CPU Concepts-2

A brief description of CPU concepts.


CPU MODULES

Advanced SIMD and floating-point Extension

❖ The optional Advanced SIMD and floating-point Extension implements:


❖ ARM NEON technology, a media and signal processing architecture that adds
instructions targeted at audio, video, 3-D graphics, image, and speech processing.
❖ Advanced SIMD instructions are available in AArch64 and AArch32 states.
❖ The floating-point architecture includes the floating-point register file and status
registers. It performs floating-point operations on the data held in the floating-point
register file.

Cryptography Extension

❖ The optional Cortex-A53 MPCore Cryptography Extension supports the ARMv8
Cryptography Extensions. The Cryptography Extension adds new A64, A32, and T32
instructions to Advanced SIMD that accelerate:

• Advanced Encryption Standard (AES) encryption and decryption.

• The Secure Hash Algorithm (SHA) functions SHA-1, SHA-224, and SHA-256.

• Finite field arithmetic used in algorithms such as Galois/Counter Mode and Elliptic
Curve Cryptography.

Debug and trace

The Cortex-A53 processor supports a range of debug and trace features including:

• ARMv8 debug features in each core.

• ETMv4 instruction trace unit for each core.

• CoreSight Cross Trigger Interface (CTI).

• CoreSight Cross Trigger Matrix (CTM).

• Debug ROM.
The Cortex-A53 processor has an Advanced Peripheral Bus version 3 (APBv3)
debug interface that is CoreSight compliant. This permits system access to debug
resources, for example, the setting of watchpoints and breakpoints.

The Cortex-A53 processor provides performance monitors that can be configured to
gather statistics on the operation of each core and the memory system. The
performance monitors implement the ARM PMUv3 architecture.

CACHE MEMORY

Cache memory is a chip-based computer component that makes retrieving data from
the computer's memory more efficient. It acts as a temporary storage area that the
computer's processor can retrieve data from easily. This temporary storage area,
known as a cache, is more readily available to the processor than the computer's main
memory source, typically some form of DRAM.

Cache memory is sometimes called CPU (central processing unit) memory because it
is typically integrated directly into the CPU chip or placed on a separate chip that has
a separate bus interconnect with the CPU. Therefore, it is more accessible to the
processor, and able to increase efficiency, because it's physically close to the
processor.

In order to be close to the processor, cache memory needs to be much smaller than
main memory. Consequently, it has less storage space. It is also more expensive than
main memory, as it is a more complex chip that yields higher performance.

What it sacrifices in size and price, it makes up for in speed. Cache memory operates
between 10 to 100 times faster than RAM, requiring only a few nanoseconds to
respond to a CPU request.

The name of the actual hardware that is used for cache memory is high-speed static
random access memory (SRAM). The name of the hardware that is used in a
computer's main memory is dynamic random access memory (DRAM).

Cache memory is not to be confused with the broader term cache. Caches are
temporary stores of data that can exist in both hardware and software. Cache memory
refers to the specific hardware component that allows computers to create caches at
various levels of the network.
Types of cache memory
Cache memory is fast and expensive. Traditionally, it is categorized as "levels" that
describe its closeness and accessibility to the microprocessor. There are three general
cache levels:

L1 cache, or primary cache, is extremely fast but relatively small, and is usually
embedded in the processor chip as CPU cache.

L2 cache, or secondary cache, is often more capacious than L1. L2 cache may be
embedded on the CPU, or it can be on a separate chip or coprocessor and have a
high-speed alternative system bus connecting the cache and CPU. That way it doesn't
get slowed by traffic on the main system bus.

Level 3 (L3) cache is specialized memory developed to improve the performance of
L1 and L2. L1 or L2 can be significantly faster than L3, though L3 is usually double
the speed of DRAM. With multicore processors, each core can have dedicated L1 and
L2 cache, but they can share an L3 cache. If an L3 cache references an instruction, it
is usually elevated to a higher level of cache.

In the past, L1, L2 and L3 caches have been created using combined processor and
motherboard components. Recently, the trend has been toward consolidating all three
levels of memory caching on the CPU itself. That's why the primary means for
increasing cache size has begun to shift from the acquisition of a specific
motherboard with different chipsets and bus architectures to buying a CPU with the
right amount of integrated L1, L2 and L3 cache.

Contrary to popular belief, implementing flash or more dynamic RAM (DRAM) on a
system won't increase cache memory. This can be confusing since the terms memory
caching (hard disk buffering) and cache memory are often used interchangeably.
Memory caching, using DRAM or flash to buffer disk reads, is meant to improve
storage I/O by caching data that is frequently referenced in a buffer ahead of slower
magnetic disk or tape. Cache memory, on the other hand, provides read buffering for
the CPU.
[Figure: architecture and data flow of a typical cache memory unit.]
Cache memory mapping
Caching configurations continue to evolve, but cache memory traditionally works
under three different configurations:

• Direct mapped cache has each block mapped to exactly one cache memory location.
Conceptually, a direct mapped cache is like rows in a table with three columns: the
cache block that contains the actual data fetched and stored, a tag with all or part of
the address of the data that was fetched, and a flag bit that indicates whether the row
entry holds valid data.
• Fully associative cache mapping is similar to direct mapping in structure but allows
a memory block to be mapped to any cache location rather than to a prespecified
cache memory location as is the case with direct mapping.
• Set associative cache mapping can be viewed as a compromise between direct
mapping and fully associative mapping in which each block is mapped to a subset of
cache locations. It is sometimes called N-way set associative mapping, which
provides for a location in main memory to be cached to any of "N" locations in the
L1 cache.
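
As a concrete illustration of these mapping schemes, the sketch below shows how a direct-mapped and an N-way set-associative cache might derive the byte offset, set index, and tag from a memory address. The cache geometry (64-byte lines, 256 sets, 4 ways) is a hypothetical example chosen for the sketch, not a parameter taken from the text above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry: 64-byte lines, 256 sets, 4 ways. */
#define LINE_SIZE   64u
#define NUM_SETS    256u
#define NUM_WAYS    4u

int main(void)
{
    uint32_t addr = 0x0001A2C4u;

    /* The low bits select the byte within a cache line (the offset). */
    uint32_t offset = addr % LINE_SIZE;

    /* The next bits select the set (or the single location, if direct mapped). */
    uint32_t index  = (addr / LINE_SIZE) % NUM_SETS;

    /* The remaining high bits form the tag stored alongside the data. */
    uint32_t tag    = addr / (LINE_SIZE * NUM_SETS);

    printf("offset=%u set=%u tag=0x%x\n",
           (unsigned)offset, (unsigned)index, (unsigned)tag);

    /* Direct mapped: the block may live only in set 'index', way 0.
       4-way set associative: the block may live in any of the 4 ways of set 'index'.
       Fully associative: the index disappears; everything above the offset becomes
       the tag, so the block may live anywhere in the cache. */
    return 0;
}
```
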
Data writing policies
Data can be written to memory using a variety of techniques, but the two main ones
involving cache memory are:

• Write-through. Data is written to both the cache and main memory at the same time.
• Write-back. Data is written only to the cache initially. It may then be written to
main memory later, but this does not have to happen immediately and does not hold
up the original operation.

The way data is written to the cache impacts data consistency and efficiency. For
example, when using write-through, more writing needs to happen, which causes
latency upfront. When using write-back, operations may be more efficient, but data
may not be consistent between the main and cache memories.

One way a computer determines data consistency is by examining the dirty bit in
memory. The dirty bit is an extra bit included in memory blocks that indicates
whether the information has been modified. If data reaches the processor's register
file with an active dirty bit, it means that it is not up to date and there are more recent
versions elsewhere. This scenario is more likely to happen in a write-back scenario,
because the data is written to the two storage areas asynchronously.
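
The dirty-bit mechanism described above can be sketched in a few lines of C. This is a simplified software model (a single cache line with invented field names), assuming a write-back policy; it is not a description of any particular hardware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* A minimal model of one write-back cache line with a dirty bit. */
struct cache_line {
    bool     valid;
    bool     dirty;   /* set when the cached copy is newer than main memory */
    uint32_t tag;
    uint8_t  data[64];
};

/* Write-back: update only the cache and mark the line dirty. */
void write_byte(struct cache_line *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;          /* main memory is now stale */
}

/* Before evicting (or on an explicit clean), a dirty line must be written back. */
void evict(struct cache_line *line, uint8_t *main_memory, uint32_t base)
{
    if (line->valid && line->dirty) {
        for (unsigned i = 0; i < 64; i++)
            main_memory[base + i] = line->data[i];  /* write back to memory */
    }
    line->valid = false;
    line->dirty = false;
}
```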

Specialization and functionality


In addition to instruction and data caches, other caches are designed to provide
specialized system functions. According to some definitions, the L3 cache's shared
design makes it a specialized cache. Other definitions keep the instruction cache and
the data cache separate and refer to each as a specialized cache.

Translation lookaside buffers (TLBs) are also specialized memory caches whose
function is to record virtual address to physical address translations.

Still other caches are not, technically speaking, memory caches at all. Disk caches,
for instance, can use DRAM or flash memory to provide data caching similar to what
memory caches do with CPU instructions. If data is frequently accessed from the
disk, it is cached into DRAM or flash-based silicon storage technology for faster
access time and response.

Specialized caches are also available for applications such as web browsers,
databases, network address binding and client-side Network File System protocol
support. These types of caches might be distributed across multiple networked hosts
to provide greater scalability or performance to an application that uses them.
A
depiction of the memory hierarchy and how it functions
Locality
The ability of cache memory to improve a computer's performance relies on the
concept of locality of reference. Locality describes various situations that make a
system more predictable. Cache memory takes advantage of these situations to create
a pattern of memory access that it can rely upon.

There are several types of locality. Two key ones for cache are:

• Temporal locality. This is when the same resources are accessed repeatedly in a
short amount of time.
• Spatial locality. This refers to accessing various data or resources that are near each
other.
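
Both forms of locality are easy to see in ordinary loop code. In the hypothetical snippet below, the accumulator `sum` shows temporal locality (the same variable is reused on every iteration), while walking the array element by element shows spatial locality (consecutive addresses are touched one after another).

```c
#include <stdio.h>

int main(void)
{
    int data[1024];
    for (int i = 0; i < 1024; i++)
        data[i] = i;

    long sum = 0;
    for (int i = 0; i < 1024; i++) {
        /* Temporal locality: 'sum' and the loop counter are reused every
           iteration, so they stay in registers or the nearest cache level. */
        /* Spatial locality: data[i] and data[i+1] are adjacent in memory, so
           one cache line fill satisfies several successive accesses. */
        sum += data[i];
    }

    printf("sum = %ld\n", sum);
    return 0;
}
```
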
Types of Cache Misses

Compulsory Miss

A compulsory miss, also known as a cold miss, occurs when data is accessed for the
first time. Since the data has not been requested before, it is not present in the cache,
leading to a miss. This type of miss is unavoidable as it is inherent in the first
reference to the data. The only way to eliminate compulsory misses would be to have
an infinite prefetch of data, which is not feasible in real-world systems.
Capacity Miss

A capacity miss happens when the cache cannot contain all the data needed by the
system. This type of miss occurs when the working set (the set of data that a program
accesses frequently) is larger than the cache size. When the cache is filled to capacity
and a new data item is referenced, existing data must be evicted to accommodate the
new data, leading to a miss. Capacity misses can be reduced by increasing the cache
size or optimizing the program to decrease the size of the working set.

Conflict Miss

Conflict misses, also known as collision misses, occur when multiple data items,
which are accessed in a sequence, map to the same cache location, known as a cache
set. This type of miss is a result of the cache’s organization. In a set-associative or
direct-mapped cache, different data items may be mapped to the same set, leading to
conflicts. When a new item is loaded into a filled set, another item must be evicted,
leading to a miss if the evicted item is accessed again. Conflict misses can be
mitigated by improving the cache’s mapping function or by increasing the cache’s
associativity.

Coherence Miss

Coherence misses are specific to multiprocessor systems. In such systems, several
processors have their own private caches and access shared data. A coherence miss
occurs when one processor updates a data item in its private cache, making the
corresponding data item in another processor’s cache stale. When the second
processor accesses the stale data, a cache miss occurs. Coherence misses are managed
by implementing cache coherence protocols that ensure consistency among the
various caches.

PIPELINING
The objectives of this module are to discuss the various hazards associated with
pipelining.

We discussed the basics of pipelining and the MIPS pipeline implementation in the
previous module. We made the following observations about pipelining:
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• We execute billions of instructions, so throughput is what matters
• Data path design – ideally we expect a CPI value of 1
• What is desirable in instruction sets for pipelining?
• Variable length instructions vs. all instructions the same length?
• Memory operands part of any operation vs. memory operands only in loads or stores?
• Register operands in many places in the instruction format vs. registers located in the
same place?
Ideally we expect a CPI value of 1 and a speedup equal to the number of stages in the
pipeline. But, there are a number of factors that limit this. The problems that occur in
the pipeline are called hazards. Hazards that arise in the pipeline prevent the next
instruction from executing during its designated clock cycle. There are three types of
hazards:
• Structural hazards: Hardware cannot support certain combinations of instructions
(two instructions in the pipeline require the same resource).
• Data hazards: Instruction depends on result of prior instruction still in the pipeline
• Control hazards: Caused by delay between the fetching of instructions and decisions
about changes in control flow (branches and jumps).

Structural hazards arise because there is not enough duplication of resources.

Resolving structural hazards:


Solution 1: Wait
o Must detect the hazard
o Must have mechanism to stall
o Low cost and simple
o Increases CPI
o Used for rare cases
Solution 2: Throw more hardware at the problem
o Pipeline hardware resource

§ useful for multi-cycle resources


§ good performance
§ sometimes complex e.g., RAM

o Replicate resource

§ good performance
§ increases cost (+ maybe interconnect delay)
§ useful for cheap or divisible resource

Figure 11.1 shows one possibility of a structural hazard in the MIPS pipeline.
Instruction 3 is accessing memory for an instruction fetch and instruction 1 is accessing
memory for a data access (load/store). These two are conflicting requirements and
give rise to a hazard. We should either stall one of the operations as shown in Figure
11.2, or have two separate memories for code and data. Structural hazards have to be
handled at design time itself.
Next, we shall discuss data dependences and the associated hazards. There are
two types of data dependence – true data dependences and name dependences.

• An instruction j is data dependent on instruction i if either of the following holds:

– Instruction i produces a result that may be used by instruction j, or

– Instruction j is data dependent on instruction k, and instruction k is data dependent
on instruction i

• A name dependence occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between the instructions associated
with that name

– Two types of name dependences between an instruction i that precedes instruction j
in program order:

– An antidependence between instruction i and instruction j occurs when instruction j
writes a register or memory location that instruction i reads. The original ordering must
be preserved.

– An output dependence occurs when instruction i and instruction j write the same
register or memory location. The ordering between the instructions must be preserved.

– Since this is not a true dependence, renaming can be more easily done for register
operands, where it is called register renaming

– Register renaming can be done either statically by a compiler or dynamically by the
hardware

Last of all, we discuss control dependences. Control dependences determine the
ordering of an instruction with respect to a branch instruction so that an instruction i is
executed in correct program order. There are two general constraints imposed by
control dependences:

– An instruction that is control dependent on its branch cannot be moved before the
branch so that its execution is no longer controlled by the branch.

– An instruction that is not control dependent on its branch cannot be moved after the
branch so that its execution is controlled by the branch.
Having introduced the various types of data dependences and control dependence, let
us discuss how these dependences cause problems in the pipeline. Dependences are
properties of programs and whether the dependences turn out to be hazards and cause
stalls in the pipeline are properties of the pipeline organization.

Data hazards may be classified as one of three types, depending on the order of read
and write accesses in the instructions:

• RAW (Read After Write)


• Corresponds to a true data dependence
• Program order must be preserved
• This hazard results from an actual need for communication
• Considering two instructions i and j, instruction j should read the data only after instruction i has written it

i: ADD R1, R2, R3


j: SUB R4, R1, R3

Add modifies R1 and then Sub should read it. If this order is changed, there is a RAW
hazard
• WAW (Write After Write)
• Corresponds to an output dependence
• Occurs when there are multiple writes in flight, for example with a short integer
pipeline and a longer floating-point pipeline, or when an instruction proceeds while a
previous instruction is stalled
• This is caused by a name dependence. There is no actual data transfer. It is the same
name that causes the problem
• Considering two instructions i and j, instruction j should write after instruction i has
written the data

i: SUB R1, R4, R3


j: ADD R1, R2, R3

Instruction i has to modify register R1 first, and then j has to modify it. Otherwise, there
is a WAW hazard. The problem arises because of R1; if some other register had been
used, there would not be a problem.

• Solution is register renaming, that is, use some other register. The hardware can do the
renaming or the compiler can do the renaming
• WAR (Write After Read)
• Arises from an anti dependence
• Cannot occur in most static issue pipelines
• Occurs either when there are early writes and late reads, or when instructions are re-
ordered
• There is no actual data transfer. It is the same name that causes the problem
• Considering two instructions i and j, instruction j should write after instruction i has
read the data.

i: SUB R4, R1, R3


j: ADD R1, R2, R3

Instruction i has to read register R1 first, and then j has to modify it. Otherwise, there is
a WAR hazard. The problem arises because of R1; if some other register had been used,
there would not be a problem.

• Solution is register renaming, that is, use some other register. The hardware can do
the renaming or the compiler can do the renaming

Figure 11.3 gives a situation of having true data dependences. The use of the result of
the ADD instruction in the next three instructions causes a hazard, since the register is
not written until after those instructions read it. The write back for the ADD instruction
happens only in the fifth clock cycle, whereas the next three instructions read the
register values before that, and hence will read the wrong data. This gives rise to RAW
hazards.

A control hazard is when we need to find the destination of a branch, and can’t fetch
any new instructions until we know that destination. Figure 11.4 illustrates a control
hazard. The first instruction is a branch and it gets resolved only in the fourth clock
cycle. So, the next three instructions fetched may be correct, or wrong, depending on
the outcome of the branch. This is an example of a control hazard.

Now, having discussed the various dependences and the hazards that they might lead
to, we shall see which hazards can happen in our simple MIPS pipeline.

• Structural hazard
• Conflict for use of a resource

• In MIPS pipeline with a single memory


– Load/store requires data access

– Instruction fetch would have to stall for that cycle

• Would cause a pipeline “bubble”

• Hence, pipelined datapaths require separate instruction/data memories or separate


instruction/data caches

• RAW hazards – can happen in any architecture


• WAR hazards – Can’t happen in MIPS 5 stage pipeline because all instructions take 5
stages, and reads are always in stage 2, and writes are always in stage 5
• WAW hazards – Can’t happen in MIPS 5 stage pipeline because all instructions take
5 stages, and writes are always in stage 5
• Control hazards
• Can happen
• The penalty depends on when the branch is resolved – in the second clock cycle or
the third clock cycle
• More aggressive implementations resolve the branch in the second clock cycle itself,
leading to one clock cycle penalty

Let us look at the speedup equation with stalls and look at an example problem.

CPIpipelined = Ideal CPI + Average Stall cycles per Inst


Let us assume we want to compare the performance of two machines. Which
machine is faster?
• Machine A: Dual ported memory – so there are no memory stalls
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times
faster clock rate

Assume:
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedupA = Pipeline Depth / (1 + 0) x (clock_unpipelined / clock_pipelined)
= Pipeline Depth
SpeedupB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipelined / (clock_unpipelined / 1.05))
= (Pipeline Depth / 1.4) x 1.05
= 0.75 x Pipeline Depth

SpeedupA / SpeedupB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33. Machine A
is 1.33 times faster.
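
The arithmetic above can be checked with a few lines of code. This is only a restatement of the worked example (ideal CPI of 1, 40% loads each adding one stall cycle on machine B, and machine B's 1.05 times faster clock), not a general pipeline model.

```c
#include <stdio.h>

int main(void)
{
    double depth = 5.0;             /* pipeline depth; it cancels out in the ratio */

    /* Machine A: dual-ported memory, no memory stalls. */
    double speedup_a = depth / (1.0 + 0.0);

    /* Machine B: 40% loads each add one stall cycle, but the clock is 1.05x faster. */
    double speedup_b = depth / (1.0 + 0.4 * 1.0) * 1.05;

    printf("SpeedupA/SpeedupB = %.2f\n", speedup_a / speedup_b);  /* ~1.33 */
    return 0;
}
```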

TRANSLATION LOOKASIDE BUFFER (TLB):


TLB organization

This section describes the organization of the TLB.

Micro TLB

The first level of caching for the translation table information is a micro TLB of ten
entries that is implemented on each of the instruction and data sides.

All main TLB related maintenance operations affect both the instruction and data
micro TLBs, causing them to be flushed.

Main TLB

A unified main TLB handles misses from the micro TLBs. This is a 512-entry, 4-
way, set-associative structure. The main TLB supports all VMSAv8 block sizes,
except 1GB. If a 1GB block is fetched, it is split into 512MB blocks and the
appropriate block for the lookup is stored.

Accesses to the main TLB take a variable number of cycles, based on:

• Competing requests from each of the micro TLBs.

• The TLB maintenance operations in flight.


• The different page size mappings in use.

IPA cache RAM

The Intermediate Physical Address (IPA) cache RAM holds mappings between
intermediate physical addresses and physical addresses. Only Non-secure EL1 and
EL0 stage 2 translations use this cache. The IPA cache RAM is updated when a stage 2
translation completes, and is checked whenever a stage 2 translation is required.

Similarly to the main TLB, the IPA cache RAM can hold entries for different sizes.

Walk cache RAM

The walk cache RAM holds the result of a stage 1 translation up to but not including
the last level. If the stage 1 translation results in a section or larger mapping then
nothing is placed in the walk cache. The walk cache holds entries fetched from
Secure and Non-secure state.

Memory is one of the most important host resources. For workloads to access global
system memory, we need to make sure virtual memory addresses are mapped to the
physical addresses. There are several components working together to perform these
translations as efficiently as possible. This blog post will cover the basics of how
virtual memory addresses are translated.

Memory Translations

The physical address space is your system RAM, the memory modules inside your
ESXi hosts, also referred to as the global system memory. When talking about virtual
memory, we are talking about the memory that is controlled by an operating system,
or a hypervisor like vSphere ESXi. Whenever workloads access data in memory, the
system needs to look up the physical memory address that matches the virtual
address. This is what we refer to as memory translations or mappings.

To map virtual memory addresses to physical memory addresses, page tables are
used. A page table consists of numerous page table entries (PTE).
One memory page in a PTE contains data structures consisting of different sizes of
‘words’. Each type of word contains multiple bytes of data (WORD (16 bits/2 bytes),
DWORD (32 bits/4 bytes) and QWORD (64 bits/8 bytes)). Executing memory
translations for every possible word, or virtual memory page, into a physical memory
address is not very efficient, as this could potentially mean billions of PTEs. We need
PTEs to find the physical address space in the system's global memory, so there is
no way around them.

To make memory translations more efficient, we use page tables to group chunks of
memory addresses in one mapping. Looking at an example of a DWORD entry of 4
bytes: a page table covers 4 kilobytes instead of just the 4 bytes of data in a single
page entry. For example, using a page table, we can translate virtual address space 0
to 4095 and say this is found in physical address space 4096 to 8191. Now we no
longer need to map all the PTEs separately, and we are far more efficient by using page
tables.
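
For a standard 4 KB page, the split between page number and page offset can be illustrated as below. The shift and mask values follow directly from the 4096-byte page size; the addresses and the hard-coded translation are arbitrary values matching the example mapping above, not real page-table contents.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u           /* 4 KB pages */
#define PAGE_SHIFT  12u             /* log2(4096) */
#define PAGE_MASK   (PAGE_SIZE - 1)

int main(void)
{
    uint64_t vaddr = 0x00000A2Cu;                 /* falls in virtual page 0 */

    uint64_t vpn    = vaddr >> PAGE_SHIFT;        /* virtual page number */
    uint64_t offset = vaddr & PAGE_MASK;          /* byte within the page */

    /* Suppose the page table says virtual page 0 maps to physical page 1
       (i.e. virtual 0..4095 -> physical 4096..8191, as in the text). */
    uint64_t ppn   = 1;                           /* hard-coded for this sketch */
    uint64_t paddr = (ppn << PAGE_SHIFT) | offset;

    printf("vaddr=0x%llx vpn=%llu offset=0x%llx -> paddr=0x%llx\n",
           (unsigned long long)vaddr, (unsigned long long)vpn,
           (unsigned long long)offset, (unsigned long long)paddr);
    return 0;
}
```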

MMU and TLB

The page tables are managed by a Memory Management Unit (MMU). All the
physical memory references are passed through the MMU. The MMU is responsible
for the translation between virtual memory addresses and physical memory
addresses. With vSphere ESXi, a virtual machine’s vCPU will call out to MMU
functionality by the Virtual Machine Monitor (VMM) process, or a hardware MMU
supported by a vendor specific CPU offloading instruction.
The Memory Management Unit (MMU) works with the Translation Lookaside
Buffer (TLB) to map the virtual memory addresses to the physical memory layer. The
page table always resides in physical memory, and having to look up the memory
pages directly in physical memory, can be a costly exercise for the MMU as it
introduces latency. That is where the TLB comes into play.

TLB in Detail

The TLB acts as a cache for the MMU that is used to reduce the time taken to access
physical memory. The TLB is a part of the MMU. Depending on the make and model
of a CPU, there is more than one TLB, or even multiple levels of TLB like with
memory caches, to avoid TLB misses and ensure the lowest possible memory
latency.

In essence, the TLB stores recent memory translations of virtual to physical. It is a
cache for page tables. Because it is part of the MMU, the TLB lives inside the CPU
package. This is why the TLB is faster than main memory, which is where the page
tables exist. Typical access times for a TLB are ~10 ns, whereas main memory access
times are around 100 ns.

Now that we covered the basics on memory translation, let’s take a look at some
example scenarios for the TLB.

TLB hit

A virtual memory address comes in, and needs to be translated to the physical
address. The first step is always to dissect the virtual address into a virtual page
number, and the page offset. The offset consists of the last bits of the virtual address.
The offset bits are not translated; they are passed through to the physical memory address.
The offset contains bits that can represent all the memory addresses in a page table.

So, the offset is directly mapped to the physical memory layer, and the virtual page
number matches a tag already in the TLB. The MMU now immediately knows what
physical memory page to access without the need to look into the global memory.
In the example provided in the above diagram, the virtual page number is found in
the TLB, and immediately translated to the physical page number.

1. The virtual address is dissected in the virtual page number and the page offset.
2. The page offset is passed through as it is not translated.
3. The virtual page number is looked up in the TLB, looking for a tag with the
corresponding number.
4. There is an entry in the TLB (hit), meaning we immediately can translate the virtual
to the physical address.

TLB miss

What happens when a virtual page number is not found in the TLB, also referred to
as a TLB miss? The TLB needs to consult the system’s global memory to understand
what physical page number is used. Reaching out to physical memory means higher
latency compared to a TLB hit. If the TLB is full and a TLB miss occurs, the least
recently used TLB entry is evicted, and the new entry takes its place. In the following
example, the virtual page number is not found in the TLB, and the TLB needs to look
into memory to get the page number.
5. The virtual address is dissected in the virtual page number and the page offset.
6. The page offset is passed through as it is not translated.
7. The virtual page number is looked up in the TLB, looking for a tag with a
corresponding number. In this example, the TLB does not yet have a valid entry.
8. TLB reaches out to memory to find page number 3 (because of the tag, derived from
the virtual page number). Page number 3 is retrieved in memory with value 0x0006.
9. The memory translation is done and the entry is now cached in the TLB.
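
A tiny software model of the hit and miss flows described above is sketched below. Real TLBs are hardware structures with many entries and hardware replacement; the four-entry table, linear search, round-robin replacement, and toy page-table lookup here are illustrative assumptions only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u
#define TLB_SIZE   4u

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* tag: virtual page number */
    uint64_t ppn;   /* cached translation: physical page number */
};

static struct tlb_entry tlb[TLB_SIZE];
static unsigned next_victim;            /* simple replacement pointer */

/* Stand-in for the page-table walk performed in memory on a TLB miss. */
static uint64_t page_table_lookup(uint64_t vpn)
{
    return vpn + 100;                   /* toy mapping for illustration */
}

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    /* Look for a valid entry whose tag matches the virtual page number. */
    for (unsigned i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            printf("TLB hit\n");
            return (tlb[i].ppn << PAGE_SHIFT) | offset;
        }
    }

    /* Miss: walk the page table in memory (slower), then cache the result. */
    printf("TLB miss\n");
    uint64_t ppn = page_table_lookup(vpn);
    tlb[next_victim] = (struct tlb_entry){ true, vpn, ppn };
    next_victim = (next_victim + 1) % TLB_SIZE;
    return (ppn << PAGE_SHIFT) | offset;
}

int main(void)
{
    translate(0x3A2C);   /* miss: first access to this page */
    translate(0x3FF0);   /* hit: same page, translation already cached */
    return 0;
}
```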

Retrieve from storage

A TLB miss is not ideal, but the worst-case scenario is data that is not residing in
memory but on storage media (flash or disk). Where we are talking nanoseconds to
retrieve data in caches or global memory, getting data from storage media will
quickly run into milliseconds or seconds depending on the media used.
10.The virtual address is dissected in the virtual page number and the page offset.
11.The page offset is passed through as it is not translated.
12.The virtual page number is looked up in the TLB, looking for a tag with a
corresponding number. In this example, the TLB does not yet have a valid entry.
13.TLB reaches out to memory to find page number 0 (because of the tag, derived from
the virtual page number). Page number 0 is looked up in memory, but the data does
not reside in memory; it is on storage. A page fault is triggered, because we
cannot translate memory pages for data that is not in memory. We need to wait for
the data from storage.
CPU DEBUG LOGIC:

This section gives an overview of debug and describes the debug components.
The processor forms one component of a debug system.
The following methods of debugging an Arm processor based SoC exist:

Conventional JTAG debug (‘external’ debug)


This is invasive debug with the core halted using:
❖ Breakpoints and watchpoints to halt the core on specific activity.
❖ A debug connection to examine and modify registers and memory, and provide
single-step execution.
Debug host
The debug host is a computer, for example a personal computer, running a software
debugger such as the DS-5 Debugger. The debug host enables you to issue high-level
commands such as setting a breakpoint at a certain location, or examining the contents
of a memory address.
Protocol converter
The debug host sends messages to the debug target using an interface such as
Ethernet. However, the debug target typically implements a different interface
protocol. A device such as DSTREAM is required to convert between the two
protocols.
Debug target
The debug target is the lowest level of the system. An example of a debug target is a
development system with a test chip or a silicon part with a processor. The debug
target implements system support for the protocol converter to access the debug unit
using the Advanced Peripheral Bus (APB) slave interface.

Conventional monitor debug (‘self-hosted’ debug)


This is invasive debug with the core running using a debug monitor that resides
in memory.
For self-hosted debug, the debug target runs additional debug monitor software
that runs on the Cortex-A53 processor itself, rather than requiring expensive interface
hardware to connect a second host computer.

GENERIC INTERRUPT CONTROLLER (GIC)


9.1 Overview
The generic interrupt controller (GIC) in this device has two interfaces: the
distributor interface connects to the interrupt sources, and the CPU interface
connects to the Cortex-A7.
It supports the following features:
• Supports 128 hardware interrupt inputs
• Masking of any interrupts
• Prioritization of interrupts
• Distribution of the interrupts to the target Cortex-A7 processor(s)
• Generation of interrupts by software
• Supports Security Extensions
9.2 Block Diagram

The diagram shows that the GIC has two independent AXI interfaces and has two
base addresses, one for each interface.
9.3 Function Description
This GIC architecture splits logically into a Distributor block and one CPU
interface block, as Figure 12-1 shows.

Distributor
This performs interrupt prioritization and distribution to the CPU interface that
connects to the processor in the system.
CPU interface
The CPU interface performs priority masking and preemption handling for a
connected processor in the system.
9.3.1 The Distributor
The Distributor centralizes all interrupt sources, determines the priority of each
interrupt, and dispatches the interrupt with the highest priority to the CPU interface
for priority masking and preemption handling.
The Distributor provides a programming interface for:
• Globally enabling the forwarding of interrupts to the CPU interface
• Enabling or disabling each interrupt
• Setting the priority level of each interrupt
• Setting the target processor list of each interrupt
• Setting each peripheral interrupt to be level-sensitive or edge-triggered
• If the GIC implements the Security Extensions, setting each interrupt as either
Secure or Non-secure
• Sending a Software-generated interrupt (SGI) to a processor.
• Visibility of the state of each interrupt
• A mechanism for software to set or clear the pending state of a peripheral interrupt.
Interrupt ID
Interrupts from sources are identified using ID numbers. The CPU interface can see
up to 160 interrupts.
The GIC assigns these 128 ID numbers as follows:
• Interrupt numbers ID32-ID127 are used for SPIs (shared peripheral interrupts).
• ID0-ID15 are used for SGIs.
• ID16-ID31 are used for Private peripheral interrupts (PPIs).
• The GIC architecture reserves interrupt ID numbers 1022-1023 for special purposes.
ID1022
The GIC returns this value to a processor in response to an interrupt
acknowledge only when the following apply:

• The interrupt acknowledge is a Secure read


• The highest priority pending interrupt is Non-secure
• The AckCtl bit in the Secure ICCICR is set to 0
• The priority of the interrupt is sufficient for it to be signalled to the processor.
Interrupt ID 1022 informs Secure software that there is a Non-secure interrupt
of sufficient priority to be signalled to the processor, that must be handled by
Non-secure software. In this situation the Secure software might alter its
schedule to permit Non-secure software to handle the interrupt, to minimize the
interrupt latency.
ID1023
This value is returned to a processor, in response to an interrupt acknowledge,
if there is no pending interrupt with sufficient priority for it to be signalled to the
processor.
On a processor that implements the Security Extensions, Secure software treats
values of 1022 and 1023 as spurious interrupts.

This section describes the different types of interrupt that the GIC-500 handles.
SGIs (group 0)
SGIs are inter-processor interrupts, that is, interrupts generated from one core and
sent to other cores. Activating an SGI on one core does not affect the same interrupt
ID on another core. Therefore when an SGI is sent to all cores it is handled
independently on each core. The settings for each SGI are also independent between
cores.
You can generate SGIs using System registers in the generating core, or, in legacy
software, by writing to the Software Generated Interrupt Register, GICD_SGIR.
There are 16 independent SGIs, ID0-ID15, that are recorded separately for every
target core. In backwards compatibility mode, the number of the generating core is
also recorded.

PPIs (group 1)
PPIs are typically used for peripherals that are tightly coupled to a particular core.
Interrupts connected to the PPI inputs associated with one core are only sent to that
core. Activating a PPI on one core does not affect the same interrupt ID on another
core. The settings for each PPI are also independent between cores.
A PPI is an interrupt that is specific to a single core and is generated by a wire input.
PPI signals are active-LOW level-sensitive, by default, but can also be programmed
to be triggered on a rising edge.

SPIs (group 1)
SPIs are typically used for peripherals that are not tightly coupled to a specific core.
You can program each SPI to target either a particular core or any core. Activating an
SPI on one core activates the SPI for all cores. That is, the GIC-500 allows at most
one core to activate an SPI. The settings for each SPI are also shared between all
cores.
SPIs are generated either by wire inputs or by writes to the AXI4 slave programming
interface. The GIC-500 can support up to 960 SPIs corresponding to the external
spi[991:32] signal. The number of SPIs available depends on the implemented
configuration. The permitted values are 32-960, in steps of 32. The first SPI has an
ID number of 32. You can configure whether each SPI is triggered on a rising edge or
is active-HIGH level-sensitive.

9.3.2 CPU interface


The CPU interface block provides the interface for a processor that operates with the
GIC. The CPU interface provides a programming interface for:
• Enabling the signalling of interrupt requests by the CPU interface
• Acknowledging an interrupt
• Indicating completion of the processing of an interrupt
• Setting an interrupt priority mask for the processor
• Defining the preemption policy for the processor
• Determining the highest priority pending interrupt for the processor.

When enabled, CPU interface takes the highest priority pending interrupt for its
connected processor and determines whether the interrupt has sufficient
priority for it to signal the interrupt request to the processor.

To determine whether to signal the interrupt request to the processor the CPU
interface considers the interrupt priority mask and the preemption settings for
the processor. At any time, the connected processor can read the priority of its
highest priority active interrupt from a CPU interface register.
The processor acknowledges the interrupt request by reading the CPU interface
Interrupt Acknowledge register. The CPU interface returns one of:

The ID number of the highest priority pending interrupt, if that interrupt is of
sufficient priority to generate an interrupt exception on the processor. This is the
normal response to an interrupt acknowledge.

Exceptionally, an ID number that indicates a spurious interrupt.

When the processor acknowledges the interrupt at the CPU interface, the
Distributor changes the status of the interrupt from pending to either active, or
active and pending. At this point the CPU interface can signal another interrupt
to the processor, to preempt interrupts that are active on the processor. If there
is no pending interrupt with sufficient priority for signaling to the processor, the
interface de-asserts the interrupt request signal to the processor.

When the interrupt handler on the processor has completed the processing of an
interrupt, it writes to the CPU interface to indicate interrupt completion. When
this happens, the distributor changes the status of the interrupt either:
From active to inactive
From active and pending to pending.

Group 0 Interrupts:
• Group 0 interrupts typically consist of critical or time-sensitive events that require
immediate attention from the processor.

• Examples of events that generate Group 0 interrupts include:

• Timer interrupts: These are generated by timer peripherals (e.g., system timer,
watchdog timer) when a timer reaches its predefined value.

• Performance monitoring interrupts: These interrupts are triggered by performance


counters when certain events or conditions are met, such as cache misses, branch
mispredictions, or other performance-related metrics.

• Inter-processor interrupts (IPIs): These interrupts are used for communication


between processor cores in multi-core systems. They are typically generated by one
core to signal another core for synchronization, coordination, or task migration.

• Group 0 interrupts are often associated with higher priority levels and are not subject
to the same interrupt handling mechanisms as Group 1 interrupts. They are usually
delivered directly to the processor cores without involvement from the GIC's
interrupt controller.

Group 1 Interrupts:

• Group 1 interrupts encompass a broader range of interrupt sources, including


peripheral devices, external interrupts, and other non-critical events.

• Examples of events that generate Group 1 interrupts include:

• Interrupts from peripheral devices: These interrupts are generated by various


peripherals connected to the system, such as UARTs, GPIO controllers, SPI
controllers, etc., to signal events like data reception, transmission completion, or error
conditions.

• External interrupts: These interrupts may originate from external sources outside the
processor, such as external interrupt controllers, hardware accelerators, or co -
processors.
• Group 1 interrupts are managed by the GIC's interrupt controller, which prioritizes
and routes them to the appropriate processor core based on their priority levels and
configuration settings. They may be subject to interrupt masking, priority adjustment,
and other control mechanisms provided by the GIC.

The specific conditions and sources of Group 0 and Group 1 interrupts can vary
widely depending on the system architecture, hardware configuration, and the
operating environment of the ARM-based system.

9.3.3 Interrupt handling state machine

The distributor maintains a state machine for each supported interrupt on CPU
interface. Following figure shows an instance of this state machine, and the
possible state transitions.

Transition A1 or A2, add pending status


For an SGI:
Occurs on a write to an ICDSGIR that specifies the processor as a target.
If the GIC implements the Security Extensions and the write to the ICDSGIR
is Secure, the transition occurs only if the security configuration of the
specified SGI, for the CPU interface, corresponds to the ICDSGIR.SATT bit
value.
For an SPI, occurs if either:
• a peripheral asserts an interrupt signal
• software writes to an ICDISPR.

Transition B1 or B2, remove pending status


Not applicable to SGIs:
• a pending SGI must transition through the active state, or reset, to remove its pending
status.
• an active and pending SGI must transition through the pending state, or reset, to
remove its pending status.
For an SPI, occurs if either:
• the level-sensitive interrupt is pending only because of the assertion of an input
signal, and that signal is deasserted
• the interrupt is pending only because of the assertion of an edge-triggered interrupt
signal, or a write to an ICDISPR, and software writes to the corresponding ICDICPR.

Transition C
If the interrupt is enabled and of sufficient priority to be signalled to the
processor, occurs when software reads from the ICCIAR.

Transition D
For an SGI, occurs if the associated SGI is enabled and the Distributor forwards
it to the CPU interface at the same time that the processor reads the ICCIAR to
acknowledge a previous instance of the SGI. Whether this transition occurs
depends on the timing of the read of the ICCIAR relative to the reforwarding of
the SGI.
For an SPI:
Occurs if all the following apply:
• The interrupt is enabled.
• Software reads from the ICCIAR. This read adds the active state to the interrupt.
• For a level-sensitive interrupt, the interrupt signal remains asserted. This is usually
the case, because the peripheral does not deassert the interrupt until the processor has
serviced the interrupt.

• For an edge-triggered interrupt, whether this transition occurs depends on the timing
of the read of the ICCIAR relative to the detection of the reassertion of the interrupt.
Otherwise the read of the ICCIAR causes transition C, possibly followed by
transition A2.

Transition E1 or E2, remove active status


Occurs when software writes to the ICCEOIR.

9.5 Interface Description


Both the distributor interface and the CPU interface allow Secure accesses only after reset.
When the signal cfgsdisable is HIGH, it enhances the security of the GIC by
preventing write accesses to security-critical configuration registers. This signal is LOW
after reset and can be configured through the TZPC registers.
9.6 Application Notes
9.6.1 General handling of interrupts
The GIC operates on interrupts as follows:
1. The GIC determines whether each interrupt is enabled. An interrupt that is not
enabled has no further effect on the GIC. (Enable an interrupt by writing to the
appropriate ICDISER bit; disable an interrupt by writing to the appropriate
ICDICER bit.)

2. For each enabled interrupt that is pending, the Distributor determines the targeted
processor.

3. For each processor, the Distributor determines the highest priority pending interrupt,
based on the priority information it holds for each interrupt, and forwards the
interrupt to the CPU interface.

4. The CPU interface compares the interrupt priority with the current interrupt
priority for the processor, determined by a combination of the Priority Mask Register,
the current preemption settings, and the highest priority active interrupt for the
processor. If the interrupt has sufficient priority, the GIC signals an interrupt
exception request to the processor.

5. When the processor takes the interrupt exception, it reads the ICCIAR in its CPU
interface to acknowledge the interrupt. This read returns an Interrupt ID that the
processor uses to select the correct interrupt handler. When it recognizes this read,
the GIC changes the state of the interrupt:
• If the pending state of the interrupt persists when the interrupt becomes active, or if
the interrupt is generated again, from pending to active and pending.
• Otherwise, from pending to active

6. When the processor has completed handling the interrupt, it signals this completion
by writing to the ICCEOIR in the GIC.

Generating an SGI
A processor generates an SGI by writing to an ICDSGIR.
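
The acknowledge/handle/complete sequence in steps 5 and 6 can be sketched as below. This assumes a memory-mapped GICv1/v2-style CPU interface: the base address is a placeholder (the real value comes from the SoC memory map), the offsets shown are the conventional ones for ICCIAR and ICCEOIR, and dispatch_interrupt is a hypothetical helper. The spurious-ID check follows the 1022/1023 convention described earlier.

```c
#include <stdint.h>

/* Placeholder base address: the real value comes from the SoC memory map. */
#define GICC_BASE   0x2C002000u
#define GICC_IAR    (*(volatile uint32_t *)(GICC_BASE + 0x0Cu)) /* ICCIAR  */
#define GICC_EOIR   (*(volatile uint32_t *)(GICC_BASE + 0x10u)) /* ICCEOIR */

extern void dispatch_interrupt(uint32_t id);  /* per-device handlers, defined elsewhere */

void irq_handler(void)
{
    /* Step 5: acknowledge; the read returns the interrupt ID and makes it active. */
    uint32_t iar = GICC_IAR;
    uint32_t id  = iar & 0x3FFu;

    if (id >= 1022u)            /* 1022/1023 indicate a spurious interrupt */
        return;

    dispatch_interrupt(id);     /* run the handler selected by the interrupt ID */

    /* Step 6: signal completion so the Distributor can deactivate the interrupt. */
    GICC_EOIR = iar;
}
```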

9.6.2 Interrupt prioritization


Software configures interrupt prioritization in the GIC by assigning a priority value to
each interrupt source. Priority values are 8-bit unsigned binary.

In this product, the GIC implements 64 priority levels, so only the highest 6 bits are
valid; the lower 2 bits read as zero.

In the GIC prioritization scheme, lower numbers have higher priority, that is, the
lower the assigned priority value the higher the priority of the interrupt. The highest
interrupt priority always has priority field value 0.
The ICDIPRs hold the priority value for each supported interrupt. To determine the
number of priority bits implemented write 0xFF to an ICDIPR priority field and read
back the value stored.
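
The probe described above (write 0xFF to a priority field and read it back) might look like the following. The Distributor base address and register offset are placeholders for whatever the SoC memory map and TRM define; in this product the read-back would be 0xFC, confirming 6 implemented priority bits.

```c
#include <stdint.h>

/* Placeholder Distributor base address and offset; check the SoC documentation. */
#define GICD_BASE       0x2C001000u
#define GICD_IPRIORITYR ((volatile uint8_t *)(GICD_BASE + 0x400u)) /* ICDIPRn, one byte per interrupt */

unsigned probe_priority_bits(unsigned int_id)
{
    GICD_IPRIORITYR[int_id] = 0xFF;             /* write all ones to the field    */
    uint8_t readback = GICD_IPRIORITYR[int_id]; /* unimplemented low bits read 0  */

    unsigned bits = 0;
    while (readback & 0x80u) {                  /* count the implemented high bits */
        bits++;
        readback <<= 1;
    }
    /* e.g. a read-back of 0xFC -> 6 priority bits -> 64 priority levels */
    return bits;
}
```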

Preemption
A CPU interface supports forwarding of higher priority pending interrupts to a target
processor before an active interrupt completes. A pending interrupt is only forwarded
if it has a higher priority than all of:
• the priority of the highest priority active interrupt on the target processor, that is, the
running priority for the processor, see Running Priority Register (ICCRPR).
• The priority mask, see Priority masking.
• The priority group, see Priority grouping.
Preemption occurs at the time when the processor acknowledges the new interrupt,
and starts to service it in preference to the previously active interrupt or the currently
running process. When this occurs, the initial active interrupt is said to have been
preempted. Starting to service an interrupt while another interrupt is still active is
sometimes described as interrupt nesting.

Priority masking
The ICCPMR for a CPU interface defines a priority threshold for the target
processor, see Interrupt Priority Mask Register. The GIC only signals pending
interrupts with a higher priority than this threshold value to the target processor. A
value of zero, the register reset value, masks all interrupts to the associated processor.
The GIC always masks an interrupt that has the largest supported priority field value.
This provides an additional means of preventing an interrupt being signalled to any
processor.
Priority grouping
Priority grouping splits each priority value into two fields, the group priority and the
subpriority fields. The GIC uses the group priority field to determine whether a
pending interrupt has sufficient priority to preempt a currently active interrupt.
9.6.3 The effect of the Security Extensions on interrupt handling
If a GIC CPU interface implements the Security Extensions, it provides two interrupt
output signals, IRQ and FIQ:
• The CPU interface always uses the IRQ exception request for Non-secure interrupts
• Software can configure the CPU interface to use either IRQ or FIQ exception
requests for Secure interrupts.
Security Extensions support
Software can detect support for the Security Extensions by reading the
ICDICTR.SecurityExtn bit, see Interrupt Controller Type Register (ICDICTR).
Secure software makes Secure writes to the ICDISRs to configure each interrupt as
Secure or Non-secure, see Interrupt Security Registers (ICDISRn).
In addition:
• The banking of registers provides independent control of Secure and Non-secure
interrupts.
• The Secure copy of the ICCICR has additional fields to control the processing of
Secure and Non-secure interrupts, see CPU Interface Control Register (ICCICR)
These fields are:
❖ the SBPR bit, that affects the preemption of Non-secure interrupts.
❖ the FIQEn bit, that controls whether the interface signals Secure interrupt to the
processor using the IRQ or FIQ interrupt exception requests.
❖ the AckCtl bit, that affects the acknowledgment of Non-secure interrupts.
❖ the EnableNS bit, that controls whether Non-secure interrupts are signaled to the
processor, and is an alias of the Enable bit in the Non-secure ICCICR.
• The Non-secure copy of the ICCBPR is aliased as the ICCABPR, see Aliased
Binary Point Register (ICCABPR). This is a Secure register, meaning it is only
accessible by Secure accesses.

Effect of the Security Extensions on interrupt acknowledgement

When a processor takes an interrupt, it acknowledges the interrupt by reading the
ICCIAR. A read of the ICCIAR always acknowledges the highest priority pending
interrupt for the processor performing the read.

If the highest priority pending interrupt is a Secure interrupt, the processor must make
a Secure read of the ICCIAR to acknowledge it.

By default, the processor must make a Non-secure read of the ICCIAR to
acknowledge a Non-secure interrupt. If the AckCtl bit in the Secure ICCICR is set to
1, the processor can make a Secure read of the ICCIAR to acknowledge a Non-secure
interrupt.

If the read of the ICCIAR does not match the security of the interrupt, taking account
of the AckCtl bit value for a Non-secure interrupt, the ICCIAR read does not
acknowledge any interrupt and returns the value:

• 1022 for a Secure read when the highest priority interrupt is Non-secure
• 1023 for a Non-secure read when the highest priority interrupt is Secure.

TIMERS AND WATCHDOG:

Here are some common features and uses of generic timers in CPUs:
• Interrupt Generation: Generic timers can be used to generate interrupts at regular
intervals. This feature is often utilized in operating systems for tasks like scheduling
tasks or preempting the CPU to handle higher-priority tasks.
• Performance Monitoring: Some CPUs include generic timers that can be used for
performance monitoring purposes, such as measuring instruction execution time,
cache misses, or other performance-related metrics.
• Power Management: Generic timers may also play a role in power management by
enabling the CPU to enter low-power states or adjust its operating frequency based
on certain timing criteria.
• Timekeeping: In some cases, generic timers are used for basic timekeeping functions
within the CPU, providing a reference for measuring time intervals or tracking
system uptime.
• System Synchronization: Generic timers can also be used for system
synchronization purposes, helping coordinate actions between different components
or subsystems within the CPU or the broader system.

TIMERS:

Here are some common types of timer registers and their typical functions:

• Control Register: This register is used to configure the operating mode of the timer,
such as whether it counts up or down, whether it operates in periodic or one-shot
mode, and whether interrupts are enabled.
• Interval Register: Also known as the Load Register or Period Register, this register
is used to set the initial value of the timer's count, which determines the interval at
which the timer will trigger an interrupt or other event.
• Counter Register: This register holds the current value of the timer's count. The
timer increments or decrements this value based on its operating mode and clock
source. When the count reaches a certain threshold (e.g., zero or a maximum value),
the timer may trigger an interrupt or perform some other action.
• Status Register: This register provides status information about the timer, such as
whether an interrupt has occurred, whether the timer is currently running, or whether
it has reached its terminal count.
• Control/Configuration Registers: These registers may include additional
configuration options for the timer, such as clock source selection, prescaler settings,
and interrupt masking.
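
A hedged sketch of how such registers are typically programmed is shown below. The register layout, field positions, and base address are invented for illustration; a real timer's datasheet defines the actual map and bit meanings.

```c
#include <stdint.h>

/* Hypothetical memory-mapped timer block; layout invented for illustration. */
struct timer_regs {
    volatile uint32_t control;   /* mode, interrupt enable, start/stop       */
    volatile uint32_t load;      /* interval (reload) value                  */
    volatile uint32_t counter;   /* current count (read-only in this sketch) */
    volatile uint32_t status;    /* interrupt/terminal-count flags           */
};

#define TIMER0          ((struct timer_regs *)0x40001000u)  /* placeholder address */
#define CTRL_ENABLE     (1u << 0)
#define CTRL_PERIODIC   (1u << 1)
#define CTRL_IRQ_EN     (1u << 2)
#define STATUS_IRQ      (1u << 0)

/* Configure a periodic interrupt every 'ticks' timer clocks. */
void timer_start_periodic(uint32_t ticks)
{
    TIMER0->load    = ticks;                                  /* interval register */
    TIMER0->control = CTRL_ENABLE | CTRL_PERIODIC | CTRL_IRQ_EN;
}

/* Called from the timer's interrupt handler to clear the pending flag. */
void timer_clear_irq(void)
{
    TIMER0->status = STATUS_IRQ;   /* write-one-to-clear in this hypothetical layout */
}
```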

SYSTEM COUNTERS:

The SoC implementer is responsible for the design of the System Counter. Usually,
the System Counter requires some initialization when a system boots up. Arm
provides a recommended register interface for the System Counter, but you should
check with your SoC implementer for details of a specific implementation.

The System Counter measures real time. This means that it cannot be affected by
power management techniques like Dynamic Voltage and Frequency Scaling (DVFS)
or putting cores into a lower power state. The count must continue to increment at its
fixed frequency. In practice, this requires the System Counterto be in an always -on
power domain.

To save power, the System Counter can vary the rate at which it updates the count.
For example, the System Counter could update the count by 10 every 10th tick of the
clock. This can be useful when the connected cores are all in low power state. The
system count still needs to reflect time advancing, but power can be saved by
broadcasting fewer counter updates.

Generic Timers:

❖ Generic timers are timers that are used for general-purpose timing functions within an
embedded system.
❖ They are typically used for tasks such as scheduling events, generating delays, or
measuring time intervals.
❖ These timers are usually programmed and controlled by software, allowing
developers to tailor their behavior according to the specific requirements of the
application.
❖ Generic timers are not typically designed to handle system failures or faults.
Watchdog Timers:

❖ Watchdog timers, on the other hand, are specifically designed to monitor the
operation of a system and take corrective action in the event of a malfunction or
system crash.
❖ The primary function of a watchdog timer is to reset the system or trigger an alarm if
the software or hardware fails to periodically "feed" or reset the watchdog timer.
❖ Watchdog timers help ensure the reliability and robustness of embedded systems by
providing a mechanism to recover from faults or errors that could otherwise cause the
system to hang or become unresponsive.
❖ These timers are often used in safety-critical applications or in systems where
continuous operation is essential.

Functional description of watchdog timers:


• The Watchdog module is based around a 32-bit down counter that is initialized from
the Reload Register, WdogLoad. The counter decrements by one on each positive
clock edge of WDOGCLK when the clock enable WDOGCLKEN is HIGH. When
the counter reaches zero, an interrupt is generated. On the next enabled WDOGCLK
clock edge the counter is reloaded from the WdogLoad Register and the count down
sequence continues. If the interrupt is not cleared by the time that the counter next
reaches zero then the Watchdog module asserts the reset signal, WDOGRES, and the
counter is stopped.
• WDOGCLK can be equal to or be a sub-multiple of the PCLK frequency. However,
the positive edges of WDOGCLK and PCLK must be synchronous and balanced.
• The Watchdog module interrupt and reset generation can be enabled or disabled as
required by use of the Control Register, WdogControl. When the interrupt generation
is disabled then the counter is stopped. When the interrupt is re-enabled then the
counter starts from the value programmed in WdogLoad, and not from the last count
value.
• Write access to the registers in the Watchdog module can be disabled by the use of
the Watchdog module Lock Register, WdogLock. Writing a value of 0x1ACCE551
to the register enables write accesses to all of the other registers. Writing any other
value disables write accesses to all registers except the Lock Register. This feature
protects the Watchdog module registers from being spuriously changed by runaway
software that might otherwise disable the Watchdog module operation.
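
A minimal driver sketch based on the register names above might look like the
following. The offsets follow the common SP805-style layout and the 0x1ACCE551
unlock value from the text, but the base address is an assumption and the exact
register map should be checked against the SoC documentation.

#include <stdint.h>

#define WDOG_BASE    0x40002000u                                  /* assumed base */
#define WDOG_LOAD    (*(volatile uint32_t *)(WDOG_BASE + 0x000))  /* WdogLoad     */
#define WDOG_CONTROL (*(volatile uint32_t *)(WDOG_BASE + 0x008))  /* WdogControl  */
#define WDOG_INTCLR  (*(volatile uint32_t *)(WDOG_BASE + 0x00C))  /* WdogIntClr   */
#define WDOG_LOCK    (*(volatile uint32_t *)(WDOG_BASE + 0xC00))  /* WdogLock     */

#define WDOG_UNLOCK_KEY 0x1ACCE551u  /* enables writes to the other registers */
#define WDOG_CTRL_INTEN (1u << 0)    /* counter and interrupt enable          */
#define WDOG_CTRL_RESEN (1u << 1)    /* reset output (WDOGRES) enable         */

void wdog_start(uint32_t timeout_ticks)
{
    WDOG_LOCK    = WDOG_UNLOCK_KEY;     /* unlock register writes              */
    WDOG_LOAD    = timeout_ticks;       /* value reloaded into the counter     */
    WDOG_CONTROL = WDOG_CTRL_INTEN | WDOG_CTRL_RESEN;
    WDOG_LOCK    = 0;                   /* any other value re-locks            */
}

/* "Feed" the watchdog: clearing the interrupt reloads the counter from WdogLoad. */
void wdog_feed(void)
{
    WDOG_LOCK   = WDOG_UNLOCK_KEY;
    WDOG_INTCLR = 1;                    /* a write of any value clears and reloads */
    WDOG_LOCK   = 0;
}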

❖ Bark Register:
➢ The "bark" register is often associated with the watchdog timer's ability to alert or
"bark" before taking action. It serves as a pre-warning mechanism to indicate that the
system is about to reset or perform a specific action due to a timeout or fault
condition.
➢ When the watchdog timer is enabled, it starts counting down from a preset value.
Before reaching the end of the countdown, it may trigger an interrupt or set a flag in
the bark register to warn the system that it needs attention.
➢ The bark signal is typically used by software to perform diagnostics, log events, or
take corrective actions before the watchdog timer proceeds to its "bite" phase
(resetting the system or taking a drastic action).
❖ Bite Register:
➢ The "bite" register, on the other hand, represents the watchdog timer's final action or
"bite" when a critical condition is not resolved within the timeout period indicated by
the timer.
➢ When the watchdog timer reaches its timeout value without being reset or serviced by
software (after the bark warning), it enters the "bite" phase. In this phase, it can
initiate a system reset, halt the processor, or trigger other emergency actions
depending on the system's design.
➢ The "bite" phase is often considered the last resort to prevent the system from
entering an unrecoverable state due to software or hardware faults.

PERFORMANCE MONITORING UNIT

A CPU Performance Monitoring Unit (PMU) is a hardware component within a
computer's central processing unit (CPU) that is designed to monitor and analyze the
performance of the CPU itself as well as other system components. The PMU tracks
various metrics and events related to the CPU's operation, such as instructions
executed, cache hits and misses, branch predictions, and power consumption.
Here are some key functions and features of a CPU Performance Monitoring Unit:

• Performance Counters: PMUs typically include a set of performance counters that
can be programmed to monitor specific events and performance metrics. These
counters can track events like instructions retired, cache misses, branch predictions,
and memory accesses.
• Event-based Sampling: PMUs can perform event-based sampling, where they
monitor specific events or conditions and sample performance data when those
events occur. This sampling is crucial for performance analysis and profiling tasks.
• Hardware Performance Events: PMUs can detect various hardware events such as
cache misses, branch mispredictions, and TLB (Translation Lookaside Buffer)
misses. These events can provide insights into the efficiency and bottlenecks within
the CPU architecture.
• Performance Analysis Tools: The data collected by the PMU can be accessed by
performance analysis tools, such as profilers and performance counters, to analyze
and optimize software performance. Developers and system administrators use these
tools to identify performance bottlenecks and improve overall system efficiency.
• Power Monitoring: Some advanced PMUs also include power monitoring
capabilities, allowing for real-time monitoring of power consumption by the CPU and
other system components. This feature is valuable for power management and
optimizing energy efficiency in computing systems.
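
On a Linux system the PMU counters are usually reached through the kernel's perf
subsystem rather than by touching hardware registers directly. The sketch below
counts hardware CPU cycles around a piece of work using the perf_event_open
system call; it assumes a Linux host with the perf interface enabled.

/* Count CPU cycles for a region of code via the Linux perf interface. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* could also count cache misses, etc. */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long sum = 0;           /* the work being measured */
    for (unsigned long i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));        /* single-counter read format */
    printf("CPU cycles: %lld\n", cycles);
    close(fd);
    return 0;
}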

EMBEDDED TRACE MACROCELL


The ETM trace unit is a module that performs real-time instruction flow tracing
based on version 4 of the Embedded Trace Macrocell (ETM) architecture (ETMv4).
ETM is a CoreSight component and an integral part of the Arm real-time debug
solution, DS-5 Development Studio.

• Processor interface: This block monitors the behavior of the processor and
generates P0 elements that are essentially executed instructions and exceptions traced
in program order.
• Trace generation: The trace generation block generates various trace packets based
on P0 elements.
• Filtering and triggering resources: You can limit the amount of trace data generated
by the ETM, through the process of filtering. For example, generating trace only in a
certain address range. More complicated logic analyzer style filtering options are also
available. The ETM trace unit can also generate a trigger that is a signal to the trace
capture device to stop capturing trace.
• FIFO: The trace generated by the ETM trace unit is in a highly-compressed form.
The FIFO enables trace bursts to be flattened out. When the FIFO becomes full, the
FIFO signals an overflow. The trace generation logic does not generate any new trace
until the FIFO is emptied. This causes a gap in the trace when viewed in the
debugger.
• Trace out: Trace from FIFO is output on the synchronous AMBA ATB interface.
• Syncbridge: The ATB interface from the trace out block goes through an ATB
synchronous bridge.
• Reset: The reset for ETM trace unit is the same as a cold reset for the processor. The
ETM trace unit is not reset when warm reset is applied to the processor so that tracing
through warm processor reset is possible. If the ETM trace unit is reset, tracing stops
until the ETM trace unit is reprogrammed and re-enabled. However, if the processor
is reset using warm reset, the last few instructions provided by the processor before
the reset might not be traced.

In ARM processors, ETM (Embedded Trace Macrocell) and ITM (Instrumentation
Trace Macrocell) are two important components related to debugging and tracing
capabilities. Here are the key differences between ETM and ITM logics in ARM
processors:

Purpose:
❖ ETM (Embedded Trace Macrocell): ETM is primarily used for tracing program
execution flow and capturing trace data related to program behavior. It helps in
understanding how the program is executing and identifying performance bottlenecks
or bugs.
❖ ITM (Instrumentation Trace Macrocell): ITM is used for inserting custom trace
messages into the trace stream. It allows developers to add their own trace events or
debug information into the trace output without affecting the program's execution.
Functionality:
❖ ETM: ETM captures a detailed trace of program execution, including instruction
addresses, data accesses, and control flow information. It provides a comprehensive
view of how the processor is executing instructions.
❖ ITM: ITM is more focused on providing debug and trace information specific to the
developer's needs. It allows inserting printf-style debug messages, timestamps, or
other custom information into the trace stream.
Trace Interface:
❖ ETM: ETM typically uses a dedicated trace port or interface to stream trace data to
an external trace capture device or debugger. It generates a rich trace stream
containing information about executed instructions and events.
❖ ITM: ITM is usually integrated into the processor core and communicates with the
debugger or trace capture unit through the CoreSight debug and trace architecture. It
provides a flexible way to add custom trace messages without requiring an additional
trace port.
Usage:
❖ ETM: ETM is commonly used in performance analysis, debugging complex
software, and optimizing code for better execution efficiency. It is especially valuable
in understanding real-time system behavior.
❖ ITM: ITM is used for debugging and tracing at a higher level of abstraction, allowing
developers to insert trace messages or markers in the code to track specific events or
conditions during program execution.

ITM functional description


The ITM is an optional application-driven trace source that supports printf() style
debugging to trace operating system and application events, and generates diagnostic
system information. The ITM generates trace information as packets from software
traces, hardware traces, time stamping, and global system timestamping sources.
The ITM generates trace information as packets. There are four sources that can
generate packets. If multiple sources generate packets at the same time, the ITM
arbitrates the order in which packets are output. The four sources in decreasing order
of priority are:
❖ Software trace. Software can write directly to ITM stimulus registers to generate
packets.
❖ Hardware trace. The DWT generates these packets, and the ITM outputs them.
❖ Time stamping. Timestamps are generated relative to packets. The ITM contains a
21-bit counter to generate the timestamp. The Cortex-M4 clock or the bitclock rate of
the Serial Wire Viewer (SWV) output clocks the counter.
❖ Global system timestamping. Timestamps can optionally be generated using a
system-wide 48-bit count value. The same count value can be used to insert
timestamps in the ETM trace stream, permitting coarse-grain correlation.
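
To make the software-trace source concrete, the following sketch writes one
character to ITM stimulus port 0 on a Cortex-M class device, in the style of the
CMSIS ITM_SendChar helper. The register addresses follow the usual Cortex-M
private peripheral map, and the code assumes the debugger or startup code has
already enabled the ITM and configured the trace output (for example SWO).

#include <stdint.h>

/* ITM registers in the Cortex-M private peripheral space (typical addresses). */
#define ITM_STIM0 (*(volatile uint32_t *)0xE0000000u) /* stimulus port 0        */
#define ITM_TER   (*(volatile uint32_t *)0xE0000E00u) /* trace enable register  */
#define ITM_TCR   (*(volatile uint32_t *)0xE0000E80u) /* trace control register */

/* Send one character over ITM stimulus port 0 (printf-style tracing). */
void itm_send_char(char c)
{
    /* Trace only if the ITM (TCR bit 0) and stimulus port 0 (TER bit 0) are enabled. */
    if ((ITM_TCR & 1u) && (ITM_TER & 1u)) {
        while (ITM_STIM0 == 0u) {
            /* a read of 0 means the port FIFO cannot accept data yet */
        }
        *(volatile uint8_t *)0xE0000000u = (uint8_t)c;  /* 8-bit write emits one byte */
    }
}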

MULTICORE RISCV SS OVERVIEW

A multicore RISC-V subsystem refers to a system architecture that incorporates
multiple RISC-V processor cores working together within a single integrated
environment. Here are some key aspects and considerations regarding multicore
RISC-V subsystems:

RISC-V Architecture:
❖ RISC-V is an open-source instruction set architecture (ISA) that provides a modular
and extensible framework for designing processors. It defines a base ISA along with
optional extensions for various functionalities.
Multicore Configuration:
❖ In a multicore RISC-V subsystem, multiple RISC-V processor cores are integrated
onto a single chip or within a system-on-chip (SoC) design. These cores can operate
independently and can execute instructions concurrently, allowing for parallel
processing.
Benefits:
❖ Parallelism: Multicore architectures offer parallelism, enabling multiple tasks or
threads to run simultaneously. This can lead to improved performance and scalability
for applications that can be parallelized.
❖ Fault Tolerance: Multicore systems can provide fault tolerance by allowing tasks to
be distributed across multiple cores. If one core fails or experiences issues, the
system can continue functioning with the remaining cores.
❖ Resource Utilization: By distributing workloads across cores, multicore systems can
utilize resources more efficiently, optimizing power consumption and overall system
throughput.
Interconnect and Communication:
❖ Efficient communication and synchronization mechanisms are crucial in multicore
systems. Inter-core communication can be facilitated through shared memory,
message passing, or dedicated interconnects depending on the system's design and
requirements.
❖ Synchronization primitives such as locks, semaphores, and barriers are used to
coordinate access to shared resources and ensure correct behavior in concurrent
execution scenarios.

Software Support:
❖ Operating systems (OS) and programming models play a vital role in leveraging
multicore architectures effectively. Multicore-aware OS kernels can schedule tasks
across cores, manage thread synchronization, and optimize resource allocation.
❖ Parallel programming frameworks and languages (e.g., OpenMP, MPI, pthreads)
provide abstractions and APIs for developing parallel software that can utilize
multiple cores efficiently.
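
As a small illustration of the software support mentioned above, the sketch below
uses POSIX threads to split a summation across two worker threads; on a multicore
system the OS scheduler is free to place each thread on a different core.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static long data[N];

struct chunk { long start, end, sum; };

static void *partial_sum(void *arg)
{
    struct chunk *c = (struct chunk *)arg;
    c->sum = 0;
    for (long i = c->start; i < c->end; i++)
        c->sum += data[i];
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++) data[i] = i;

    struct chunk halves[2] = { {0, N / 2, 0}, {N / 2, N, 0} };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, partial_sum, &halves[i]); /* one thread per half   */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);                             /* wait for both workers */

    printf("total = %ld\n", halves[0].sum + halves[1].sum);
    return 0;
}
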
Scalability:
❖ Multicore RISC-V subsystems can be designed with scalability in mind, allowing for
configurations ranging from a few cores to many cores depending on the target
application and performance requirements.
❖ Scalability considerations include power efficiency, thermal management, memory
hierarchy design, and software scalability to harness the full potential of the
multicore architecture.

Designing a multicore RISC-V subsystem involves several key components and
considerations. Below is a comprehensive overview of the design aspects involved in
creating such a system:

Processor Cores:
❖ Choose the RISC-V processor cores based on the target application and performance
requirements. Common choices include cores based on the RV32I or RV64I base
ISA with optional extensions such as M (integer multiplication and division), A
(atomic instructions), F (single-precision floating-point), D (double-precision
floating-point), and C (compressed instructions).
❖ Determine the number of cores based on the desired level of parallelism and
workload distribution. Common configurations include dual-core, quad-core, octa-
core, or more depending on the scalability needs.
Memory Hierarchy:
❖ Design the memory hierarchy to support multiple cores efficiently. This includes the
instruction cache, data cache, shared L2 cache (if applicable), and system memory
(RAM).
❖ Implement cache coherence protocols such as MESI (Modified, Exclusive, Shared,
Invalid) or MOESI (Modified, Owned, Exclusive, Shared, Invalid) to maintain data
consistency across multiple cores sharing the same memory regions.
Interconnect Fabric:
❖ Choose an appropriate interconnect fabric to connect the processor cores, caches,
memory controllers, and other peripherals within the multicore subsystem.
❖ Considerations include bandwidth, latency, scalability, and support for cache
coherency protocols. Common interconnect technologies include AXI (Advanced
eXtensible Interface), AHB (Advanced High-performance Bus), or custom on-chip
interconnect designs.
Memory Controllers:
❖ Design memory controllers for interfacing with system memory (RAM) and
managing memory accesses from multiple cores.
❖ Ensure memory controller performance matches the bandwidth requirements of the
multicore system and supports features like out-of-order memory accesses,
interleaving, and error correction (ECC).
Peripheral Interfaces:
❖ Include interfaces for peripherals such as UART (Universal Asynchronous Receiver-
Transmitter), SPI (Serial Peripheral Interface), I2C (Inter-Integrated Circuit), GPIO
(General-Purpose Input/Output), timers, and interrupt controllers.
❖ Design peripheral controllers or use standard IP blocks compatible with the RISC-V
ecosystem to enable communication with external devices and peripherals.
Power Management:
❖ Implement power management features to optimize energy consumption in the
multicore subsystem. This includes dynamic voltage and frequency scaling (DVFS),
clock gating, power domains, and sleep modes for idle cores.
❖ Consider thermal management strategies to prevent overheating in high-performance
multicore designs.
Debugging and Trace:
❖ Integrate debugging and trace features to facilitate software development, debugging,
and performance analysis on the multicore system.
❖ Utilize on-chip debug interfaces such as JTAG (Joint Test Action Group) or the
RISC-V Debug specification to enable debugging capabilities across multiple cores
simultaneously.
Software Support:
❖ Develop or port an operating system (OS) with multicore support, such as Linux with
SMP (Symmetric Multiprocessing) or a real-time operating system (RTOS) tailored
for multicore architectures.
❖ Provide software libraries, drivers, and tools that enable parallel programming, thread
synchronization, and efficient utilization of multicore resources.

Verification and Testing:


❖ Perform thorough verification and testing of the multicore RISC-V subsystem design
using simulation, emulation, and hardware validation techniques.
❖ Verify functionality, performance, cache coherence, power management features, and
system-level interactions to ensure the robustness and correctness of the design.
Documentation and Support:
❖ Create comprehensive documentation including datasheets, user manuals,
programming guides, and application notes for developers and system integrators.
❖ Provide technical support, reference designs, and development kits to enable rapid
prototyping and deployment of systems based on the multicore RISC-V subsystem.

MULTICORE CACHE COHERENCY

Cache coherency in a multicore RISC-V subsystem ensures that data stored in
different processor cores' caches remains consistent with the data in main memory. It
prevents data inconsistencies that can arise when multiple cores operate on shared
data concurrently. Here's a detailed explanation of multicore cache coherency in a
RISC-V subsystem:

Shared Memory Model:


❖ In a multicore RISC-V subsystem, all processor cores share a common memory
space. This shared memory model allows multiple cores to access the same data
stored in main memory.
❖ Each core has its own cache hierarchy, including L1 caches (per-core) and possibly
shared L2 or higher-level caches.
Cache Coherence Protocols:
❖ Cache coherence protocols are used to maintain data consistency across caches. The
most common protocols are variations of MESI (Modified, Exclusive, Shared,
Invalid) or MOESI (Modified, Owned, Exclusive, Shared, Invalid).
❖ MESI Protocol:
▪ Modified (M): The cache line is modified and differs from the main memory.
▪ Exclusive (E): The cache line is not modified and is exclusive to this cache.
▪ Shared (S): The cache line is shared among multiple caches and is consistent with
main memory.
▪ Invalid (I): The cache line is invalid and needs to be fetched from main memory or
another cache before use.
❖ MOESI Protocol:
▪ Owned (O): The cache line is not modified but is owned by the cache, meaning other
caches cannot modify it until it's invalidated.
Cache Coherency Mechanisms:
❖ Snooping-Based Coherency:
1) Many multicore systems use a snooping-based approach, where each core monitors
the system bus for memory transactions initiated by other cores.
2) When a core writes to a memory location, other cores snoop the bus to determine if
they have a copy of the same cache line. If so, they update or invalidate their copies
based on the coherence protocol.
❖ Directory-Based Coherency:
1) In directory-based coherency, a centralized directory keeps track of which caches
hold copies of specific memory blocks.
2) When a core needs to access a memory block, it communicates with the directory,
which coordinates cache operations across cores to maintain coherence.
Memory Ordering and Consistency:
❖ RISC-V defines a memory model (e.g., RVWMO - RISC-V Weak Memory
Ordering) that specifies the order of memory operations visible to different cores.
❖ Coherence protocols ensure that memory operations are observed in a globally
consistent order across cores, preventing race conditions and ensuring program
correctness.
Hardware Implementation:
❖ Cache coherence is implemented in hardware through coherence controllers, cache
coherence state machines, and interconnect fabrics with snoop filters.
❖ Coherence controllers manage coherence-related transactions, handle coherence
protocol states, and enforce coherence rules across caches.
Software Support:
❖ Multicore-aware software and operating systems play a critical role in leveraging
cache coherence effectively.
❖ Software must utilize synchronization primitives (e.g., locks, barriers) and memory
ordering constraints to ensure correct and synchronized access to shared data.
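
For example, a simple spinlock built on C11 atomics relies on the cache-coherence
protocol to make updates to the lock word visible to every core in a consistent
order; on RISC-V a compiler typically lowers these operations to the A-extension's
atomic instructions.

#include <stdatomic.h>

/* Minimal spinlock: correctness depends on coherent caches propagating the
 * flag's state to all cores.  Initialize with: spinlock_t lock = { ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* test_and_set is an atomic read-modify-write (e.g. amoswap.w on RISC-V). */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
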
Performance Considerations:
❖ Cache coherence mechanisms introduce overhead, including coherence traffic on the
interconnect and cache maintenance operations.
❖ Efficient cache coherence design balances performance, scalability, and hardware
complexity to minimize overhead while ensuring data integrity.
Scalability and Interconnect:
❖ Cache coherence protocols and hardware designs should scale efficiently as the
number of cores increases.
❖ The interconnect fabric should support efficient communication and coherence
transactions between cores while avoiding bottlenecks.
To maintain memory coherency, external bus masters (typically other processors with
their own internal caches) need to acquire the most recent copy of data before
caching it internally. That copy can be in main memory or in the internal caches of
other bus-mastering devices. When an external master
has a cache read-miss or write-miss, it probes the other mastering devices to
determine whether the most recent copy of data is held in any of their caches. If one
of the other mastering devices holds the most recent copy, it provides it to the
requesting device. Otherwise, the most recent copy is provided by main memory.
There are two general types of bus-master probes:
• Read probes indicate the external master is requesting the data for read purposes.
• Write probes indicate the external master is requesting the data for the purpose of
modifying it.
The state transitions involving probes are initiated by other processors and external
bus masters into the processor. Some read probes are initiated by devices that intend
to cache the data. Others, such as those initiated by I/O devices, do not intend to
cache the data. Some processor implementations do not change the data MOESI state
if the read probe is initiated by a device that does not intend to cache the data.
State transitions involving read misses and write misses can cause the processor to
generate probes into external bus masters and to read main memory.
Read hits do not cause a MOESI-state change. Write hits generally cause a MOESI-
state change into the modified state. If the cache line is already in the modified state,
a write hit does not change its state.
The specific operation of external-bus signals and transactions and how they
influence a cache MOESI state are implementation dependent. For example, an
implementation could convert a write miss to a WB memory type into two separate
MOESI-state changes. The first would be a read-miss placing the
cache line in the exclusive state. This would be followed by a write hit into the
exclusive cache line, changing the cache-line state to modified.
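
The state behavior described above can be summarized in a small model. The sketch
below encodes the MOESI states and the local transitions mentioned in the text
(read hits leave the state unchanged, writes move a line to the modified state); it is a
simplification for illustration, since a full protocol must also handle probes from
other masters and write-back of dirty data.

typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_t;

/* Local-access transitions for one cache line (probe handling omitted). */
static moesi_t on_read(moesi_t s, int other_caches_hold_copy)
{
    if (s == INVALID)                 /* read miss: line is fetched            */
        return other_caches_hold_copy ? SHARED : EXCLUSIVE;
    return s;                         /* read hit: no MOESI state change       */
}

static moesi_t on_write(moesi_t s)
{
    /* A write hit (or the write following a write miss handled as a read miss
     * plus a write hit) dirties the line; an already modified line stays modified. */
    (void)s;
    return MODIFIED;
}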

MULTICORE SNOOP FILTERS

Multi-core snoop filters are hardware structures used in multicore processor systems
to optimize cache coherence protocols, particularly in snooping-based coherence
schemes. These filters help reduce the overhead associated with snooping on the
system bus by selectively filtering and processing coherence-related transactions
based on the caching state of individual cores. Here's an explanation of multi-core
snoop filters:

Purpose:
❖ The primary purpose of multi-core snoop filters is to improve the efficiency of cache
coherence in multicore systems. They aim to minimize the amount of unnecessary
coherence-related traffic on the system bus, thereby reducing latency and power
consumption associated with snooping.
Operation:
❖ Each core in a multicore system is associated with a snoop filter. The snoop filter
monitors the coherence transactions happening on the system bus and selectively
filters out transactions that are irrelevant to the core's cache state.
❖ When a coherence transaction occurs (e.g., a write to a memory location), the snoop
filter of each core determines whether the transaction is relevant to that core's cache.
If it is, the core takes appropriate action (e.g., updating its cache). If not, the
transaction is ignored, reducing unnecessary bus traffic.
Filtering Mechanisms:
❖ Multi-core snoop filters use various mechanisms to filter coherence transactions
efficiently:
1) Bloom Filters: Bloom filters are probabilistic data structures used to quickly test
whether an element is a member of a set. Each core's snoop filter maintains a Bloom
filter representing the cache state of that core. When a coherence transaction occurs,
the Bloom filter is consulted to determine whether the transaction is relevant to the
core's cache.
2) Tag-Based Filtering: Each coherence transaction carries a tag indicating the memory
address being accessed and the operation being performed (e.g., read or write). The
snoop filter compares this tag with the tags of cached data in the core's cache to
determine relevance.
3) Cache Coherence State Encoding: The snoop filter may encode the coherence state
of cached lines (e.g., MESI or MOESI states) to quickly determine whether a
coherence transaction affects the core's cache.

Benefits:
❖ Reduced Bus Traffic: By filtering out irrelevant coherence transactions, multi-core
snoop filters reduce the amount of unnecessary bus traffic, improving overall system
performance and reducing power consumption.
❖ Lower Latency: Filtering coherence transactions at the core level reduces the latency
associated with snooping, allowing cores to respond more quickly to relevant
coherence events.
❖ Scalability: Snoop filters help maintain scalability in large multicore systems by
limiting the overhead of coherence protocols as the number of cores increases.
Implementation:
❖ Snoop filters are typically implemented as hardware structures integrated into each
core's cache coherence logic. They require efficient hardware designs to perform
filtering operations quickly and accurately.
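
To make the Bloom-filter mechanism concrete, the following sketch records the
cache-line addresses a core holds and tests incoming coherence transactions against
that set. It is a toy model: the hash functions and filter size are arbitrary, and the
essential property is that a Bloom filter can give false positives (causing harmless
extra snoops) but never false negatives.

#include <stdint.h>
#include <string.h>

#define FILTER_BITS 1024u   /* toy size; real filters are sized to the tracked cache */

typedef struct { uint8_t bits[FILTER_BITS / 8]; } snoop_filter_t;

/* Two simple hash functions over the cache-line address (illustrative only). */
static uint32_t h1(uint64_t line) { return (uint32_t)(line * 2654435761u) % FILTER_BITS; }
static uint32_t h2(uint64_t line) { return (uint32_t)((line >> 7) * 40503u) % FILTER_BITS; }

static void set_bit(snoop_filter_t *f, uint32_t i) { f->bits[i / 8] |= (uint8_t)(1u << (i % 8)); }
static int  get_bit(const snoop_filter_t *f, uint32_t i) { return (f->bits[i / 8] >> (i % 8)) & 1; }

void filter_reset(snoop_filter_t *f) { memset(f->bits, 0, sizeof(f->bits)); }

/* Record that this core has cached the line containing 'addr'. */
void filter_insert(snoop_filter_t *f, uint64_t addr)
{
    uint64_t line = addr >> 6;        /* assume 64-byte cache lines */
    set_bit(f, h1(line));
    set_bit(f, h2(line));
}

/* Returns 0 only if the core definitely does not hold the line, so the
 * incoming snoop can be filtered out without probing the cache. */
int filter_may_hold(const snoop_filter_t *f, uint64_t addr)
{
    uint64_t line = addr >> 6;
    return get_bit(f, h1(line)) && get_bit(f, h2(line));
}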

RISC-V ISA

❖ Base Integer ISA: The base integer ISA (RV32I, RV64I, RV128I) provides essential
instructions for integer arithmetic, logical operations, memory access, control flow,
and system interaction.
❖ Standard Extensions: RISC-V includes several standard extensions that add
functionality beyond the base integer ISA. Some notable extensions include:
➢ M: Integer Multiplication and Division
➢ A: Atomic Instructions
➢ F: Single-Precision Floating-Point
➢ D: Double-Precision Floating-Point
➢ C: Compressed Instructions
➢ Zicsr: Control and Status Register (CSR) Instructions
➢ Zifencei: Instruction-Fetch Fence

RISC-V ISA EXTENSIONS:


❖ M (Integer Multiplication and Division):
➢ Adds support for integer multiplication (mul) and division (div).
➢ Example: mul x3, x1, x2 (Multiplies x1 and x2, stores the result in x3)
❖ A (Atomic Instructions):
➢ Provides atomic memory operations such as atomic load-store (amo) instructions.
➢ Ensures atomicity for read-modify-write operations in concurrent programming.
➢ Example: amoadd.w x3, x1, (x2) (Atomically adds x1 to the memory location
pointed by x2, storing the old value in x3)
❖ F (Single-Precision Floating-Point):
➢ Introduces support for single-precision floating-point arithmetic operations (fadd.s,
fsub.s, fmul.s, fdiv.s, etc.).
➢ Enables computations involving floating-point numbers.
➢ Example: fadd.s f3, f1, f2 (Adds single-precision floating-point numbers f1 and f2,
stores the result in f3)
❖ D (Double-Precision Floating-Point):
➢ Extends floating-point support to include double-precision arithmetic operations
(fadd.d, fsub.d, fmul.d, fdiv.d, etc.).
➢ Provides higher precision for floating-point computations.
➢ Example: fadd.d f3, f1, f2 (Adds double-precision floating-point numbers f1 and f2,
stores the result in f3)
❖ C (Compressed Instructions):
➢ Introduces a compressed instruction set (C extension) to reduce code size.
➢ Compressed instructions are 16 bits long and provide a subset of the base ISA
instructions in a more compact form.
➢ Example: c.addi4spn x8, sp, 16 (Computes sp + 16 and writes the result to x8; the
destination must be one of registers x8-x15)
❖ Zifencei (Instruction-Fetch Fence):
➢ Provides instructions to synchronize instruction fetch operations across cores in a
multi-core system.
➢ Ensures correct program execution and memory consistency in multi-threaded
environments.
➢ Example: fence.i (Instruction-fetch fence to synchronize instruction fetches)
❖ Zicsr (Control and Status Register Instructions):
➢ Introduces instructions to read and write control and status registers (CSRs).
➢ CSRs control various aspects of the processor, such as interrupt handling,
performance monitoring, and privilege levels.
➢ Example: csrrw x1, mstatus, x2 (Reads mstatus register into x1 and writes x2 to
mstatus)
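
In C code on a RISC-V target, CSR accesses of this kind usually appear as short
inline-assembly helpers. The sketch below reads the cycle CSR; it assumes an RV64
GCC or Clang toolchain and that lower-privilege access to the counter has been
enabled (for example via mcounteren) by machine-mode firmware.

#include <stdint.h>

/* Read the cycle counter CSR on RV64 using a Zicsr instruction. */
static inline uint64_t read_cycle(void)
{
    uint64_t c;
    __asm__ volatile ("csrr %0, cycle" : "=r"(c));  /* csrr expands to csrrs rd, csr, x0 */
    return c;
}

/* Example use: measure how many cycles a function call takes. */
uint64_t cycles_for(void (*fn)(void))
{
    uint64_t start = read_cycle();
    fn();
    return read_cycle() - start;
}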

INSTRUCTION FORMATS:
R-Type Instructions (Register-Register Arithmetic/Logic):
❖ These instructions perform operations between two source registers and store the
result in a destination register.
❖ Format: opcode rd, rs1, rs2
➢ opcode: Specifies the operation (e.g., add, subtract, bitwise AND).
➢ rd: Destination register.
➢ rs1, rs2: Source registers.
Examples:
• add (addition): add x3, x1, x2 adds the contents of registers x1 and x2 and stores the
result in register x3.
• sub (subtraction): sub x4, x5, x6 subtracts the contents of register x6 from x5 and
stores the result in register x4.
• and (bitwise AND): and x7, x8, x9 performs a bitwise AND operation between
registers x8 and x9 and stores the result in register x7.
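
Because every R-Type instruction shares the same fixed 32-bit layout
(funct7 | rs2 | rs1 | funct3 | rd | opcode), one can be assembled by hand. The sketch
below encodes add x3, x1, x2 and should print the word 0x002081B3; the field widths
come from the RISC-V base ISA.

#include <stdint.h>
#include <stdio.h>

/* Pack the R-Type fields: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static uint32_t encode_rtype(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                             uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

int main(void)
{
    /* add rd, rs1, rs2 uses opcode 0x33 with funct3 = 0 and funct7 = 0. */
    uint32_t word = encode_rtype(0x00, 2 /* x2 */, 1 /* x1 */, 0x0, 3 /* x3 */, 0x33);
    printf("add x3, x1, x2 -> 0x%08X\n", (unsigned)word);   /* expected: 0x002081B3 */
    return 0;
}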

I-Type Instructions (Immediate):


❖ I-Type instructions perform operations with an immediate value (constant) and a
source register, storing the result in a destination register.
❖ Format: opcode rd, rs1, imm
➢ opcode: Specifies the operation (e.g., addi for addition with immediate).
➢ rd: Destination register.
➢ rs1: Source register.
➢ imm: Immediate value (constant).
Examples:
• addi (addition with immediate): addi x1, x2, 10 adds the immediate value 10 to the
contents of register x2 and stores the result in register x1.
• lw (load word): lw x3, 100(x4) loads a 32-bit word from memory at address (x4 +
100) and stores it in register x3. (Stores such as sw look similar in assembly but are
encoded in the S-Type format described below.)

S-Type Instructions (Store):


❖ S-Type instructions store data from a source register into memory at an offset from a
base address.
❖ Format: opcode rs2, imm(rs1)
➢ opcode: Specifies the operation (e.g., sb for storing a byte).
➢ rs2: Source register containing the data to be stored.
➢ imm: Offset (immediate) from the base address in rs1.
➢ rs1: Base register containing the memory address.
Examples:
• sb (store byte): sb x1, 4(x2) stores the least significant byte of register x1 into
memory at address (x2 + 4).
• sh (store halfword): sh x3, -2(x4) stores the least significant halfword of register x3
into memory at address (x4 - 2).
• sw (store word): sw x5, -8(x6) stores the contents of register x5 into memory at
address (x6 - 8).

B-Type Instructions (Branch):


❖ B-Type instructions perform conditional branching based on a comparison between
two source registers.
❖ Format: opcode rs1, rs2, label
➢ opcode: Specifies the operation (e.g., beq for branch if equal).
➢ rs1, rs2: Source registers for comparison.
➢ label: Target label to jump to if the condition is met.
Examples:
• beq (branch if equal): beq x1, x2, Label branches to Label if the contents of registers
x1 and x2 are equal.
• bne (branch if not equal): bne x3, x4, Label branches to Label if the contents of
registers x3 and x4 are not equal.

U-Type Instructions (Upper Immediate):


❖ U-Type instructions load an immediate value into the upper bits of a register.
❖ Format: opcode rd, imm
➢ opcode: Specifies the operation (e.g., lui for load upper immediate).
➢ rd: Destination register.
➢ imm: Immediate value (usually shifted left by 12 bits to fill the upper bits).
Examples:
• lui (load upper immediate): lui x1, 0xFFFFF loads the immediate value 0xFFFFF into
the upper 20 bits of register x1.

J-Type Instructions (Jump):


❖ J-Type instructions perform unconditional jumps to a target address or register.
❖ Format: opcode rd, label or opcode rd, rs1, imm
➢ opcode: Specifies the operation (e.g., jal for jump and link).
➢ rd: Destination register for the return address (PC+4).
➢ rs1: Source register containing the jump target address (used in jalr instruction).
➢ imm: Immediate value (offset) for the jump (used in jalr instruction).
➢ label: Target label for the jump.
Examples:
• jal (jump and link): jal Label jumps to the Label and stores the return address (PC+4)
in register x1.
• jalr (jump and link register): jalr x2, x3, 0 jumps to the address in register x3 with an
offset of 0 and stores the return address in register x2.
Here are the main differences between RISC and CISC ISAs:

❖ Instruction Set Complexity:


➢ RISC: RISC ISAs have a reduced and simplified instruction set compared to CISC.
They typically include simple instructions that perform basic operations, with each
instruction executing in a single clock cycle.
➢ CISC: CISC ISAs have a complex and rich instruction set that includes instructions
capable of performing multiple operations or accessing complex addressing modes.
CISC instructions can vary widely in execution time and complexity, often requiring
multiple clock cycles to execute.
❖ Instruction Length:
➢ RISC: RISC instructions are typically fixed-length and have a uniform format, which
simplifies instruction decoding and pipelining in hardware.
➢ CISC: CISC instructions can vary in length and complexity, with variable-length
instructions and multi-byte opcodes. This variability can lead to challenges in
decoding and pipeline design.
❖ Hardware Complexity:
➢ RISC: RISC processors are generally designed with simpler hardware components
due to the reduced instruction set. They emphasize pipelining, parallelism, and fast
execution of simple instructions.
➢ CISC: CISC processors often have more complex hardware components, including
microcode and specialized execution units to handle the variety of instructions
efficiently.
❖ Memory Access:
➢ RISC: RISC architectures typically rely on load/store instructions for memory
access, separating data movement from arithmetic/logic operations. This approach
simplifies instruction semantics and enhances compiler optimization.
➢ CISC: CISC architectures may have complex memory access instructions built into
the ISA, allowing operations like memory-to-memory transfers and string
manipulation directly in instructions.
❖ Code Density and Optimization:
➢ RISC: Because each instruction does relatively little work, RISC programs usually
need more instructions, so RISC ISAs tend to have lower code density; compressed
instruction sets (such as RISC-V's C extension or Arm Thumb) exist largely to
recover it.
➢ CISC: CISC ISAs generally achieve higher code density, since a single complex
instruction can replace several simple ones, but at the cost of more complex decoding
and less predictable instruction timing.
❖ Compiler Complexity:
➢ RISC: RISC architectures are generally more compiler-friendly due to the
straightforward instruction set and clear separation of instructions.
➢ CISC: CISC architectures may require more complex compiler optimizations to
generate efficient code, especially when dealing with complex instruction semantics
and addressing modes.
