Design Issues: SMT and CMP Architectures


DESIGN ISSUES:

SMT and CMP Architectures

The usage of the issue slots, and the limitations on that usage, determine the performance of each processor in a precise manner.

Why multithreading today? ILP is largely exhausted, so TLP is in. There is a large performance gap between MEMORY and PROCESSOR; there are too many transistors on chip to exploit with a single thread; more multithreaded (MT) applications exist today; multiprocessors fit on a single chip; and network latencies are long.

1. DESIGN CHALLENGES OF SMT

What is the impact of fine-grained scheduling on single-thread performance?

Can a preferred-thread approach avoid sacrificing throughput and single-thread performance? Unfortunately, even with a preferred thread, the processor is likely to sacrifice some throughput.

Reason for the loss of throughput:

The pipeline is less likely to contain a mix of instructions from several threads, so there is a greater probability that either empty issue slots or a stall will occur.

Design Challenges

A larger register file is needed to hold multiple contexts, without affecting the clock cycle time, especially in:

Ø Instruction issue - more candidate instructions need to be considered.

Ø Instruction completion - choosing which instructions to commit may be challenging.

We must also ensure that the cache and TLB conflicts generated by SMT do not degrade performance. There are mainly two observations:

Ø The potential performance overhead due to multithreading is small.

Ø The efficiency of current superscalar processors is low, with room for significant improvement.

An SMT processor works well if the number of compute-intensive threads does not exceed the number of threads supported in SMT, and if the threads have highly different characteristics, e.g., one thread doing mostly integer operations and another doing mostly floating-point operations.

It does not work well if the threads try to utilize the same functional units, or because of assignment problems:

Ø E.g., in a dual-core processor system where each core runs 2 threads

simultaneously,
Ø two compute-intensive application processes might end up on the same processor
instead of on different processors.

The problem here is that the operating system does not see the difference between SMT logical processors and real processors!
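The assignment problem above can be sketched in a few lines. The topology here is a hypothetical dual-core machine with 2 SMT threads per core (the CPU numbering is an assumption for illustration, not from the source):

```python
# Sketch of the scheduling pitfall: if the OS treats SMT logical CPUs like
# real processors, two compute-intensive processes can land on sibling
# threads of the same core. Hypothetical topology: logical CPUs 0 and 1
# share physical core 0; logical CPUs 2 and 3 share physical core 1.

CORE_OF = {0: 0, 1: 0, 2: 1, 3: 1}  # logical CPU -> physical core

def assign_naive(n):
    """SMT-unaware: hand out logical CPUs in numeric order."""
    return list(range(n))

def assign_smt_aware(n):
    """SMT-aware: spread processes across distinct physical cores first."""
    chosen, used_cores = [], set()
    # First pass: pick one logical CPU per physical core.
    for cpu, core in sorted(CORE_OF.items()):
        if core not in used_cores:
            chosen.append(cpu)
            used_cores.add(core)
    # Second pass: fall back to sibling threads once all cores are busy.
    chosen += [cpu for cpu in sorted(CORE_OF) if cpu not in chosen]
    return chosen[:n]

naive = assign_naive(2)      # [0, 1] -> both on core 0: contention
aware = assign_smt_aware(2)  # [0, 2] -> two distinct physical cores
print(len({CORE_OF[c] for c in naive}))  # 1
print(len({CORE_OF[c] for c in aware}))  # 2
```

An SMT-aware OS effectively runs the second function; an SMT-unaware one runs the first.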

Transient Faults

Transient faults are faults that persist for only a “short” duration. The cause is cosmic rays (e.g., neutrons); their effect is to knock off electrons and discharge capacitors. There is no practical absorbent for cosmic rays, so the solution must be architectural. The estimated fault rate is about 1 fault per 1000 computers per year.

Processor Utilization vs. Latency

R = the run length to a long latency event

L = the amount of latency
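A simple model (a common back-of-the-envelope approximation, assumed here rather than stated in the source) ties R and L to utilization: one thread keeps the processor busy R out of every R + L cycles, and N threads whose stalls overlap raise that toward 1.

```python
# Back-of-the-envelope utilization model (an assumption for illustration):
# each thread computes for R cycles, then stalls for L cycles on a
# long-latency event. With N threads, utilization ~ min(1, N*R/(R+L)).

def utilization(n_threads, run_length, latency):
    return min(1.0, n_threads * run_length / (run_length + latency))

# E.g., R = 20 cycles of work before a 180-cycle memory stall:
print(utilization(1, 20, 180))   # 0.1 -> 10% busy with one thread
print(utilization(5, 20, 180))   # 0.5 -> 5 threads hide half the latency
print(utilization(10, 20, 180))  # 1.0 -> enough threads to hide it fully
```

This is why multithreading pays off most when L is large relative to R.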

Simultaneous & Redundantly Threaded Processor (SRT)

SRT = SMT + fault detection, with less hardware than replicated microprocessors. SMT needs only ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT design.

+ Better performance than complete replication, through better use of resources.

+ Lower cost, since complete replication is avoided.

SRT Design Challenges

Lock-stepping does not work, because an SMT processor may issue the same instruction from the redundant threads in different cycles. Instructions from the redundant threads must be fetched and scheduled carefully, since branch mispredictions and cache misses will occur.

Transient Fault Detection in CMPs

CRT borrows its detection scheme from the SMT-based Simultaneously and Redundantly Threaded (SRT) processors and applies the scheme to CMPs.

Each program is replicated into two communicating threads (a leading and a trailing thread), and the results of the two are compared.


CRT executes the leading and trailing threads on different processors to achieve load
balancing and to reduce the probability of a fault corrupting both threads

Detection is based on replication, but to what extent? CRT replicates register values (in the register file of each core) but not memory values. CRT’s leading thread commits stores only after checking, so that memory is guaranteed to be correct. CRT compares only the stores and uncached loads of the two threads, not their register values.

Because an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, checking only stores suffices for detection; other instructions commit without checking.

CRT uses a store buffer (StB) in which the leading thread places its committed store values and addresses. The store values and addresses of the trailing thread are compared against the StB entries to determine whether a fault has occurred; only checked stores reach the cache hierarchy.
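The store-buffer check can be sketched as follows (a software illustration of the idea, not CRT's actual hardware interface):

```python
# Sketch of the StB check: the leading thread deposits committed store
# (address, value) pairs; each trailing-thread store is compared against
# the oldest entry, and only matching stores would be released to the
# cache hierarchy. A mismatch signals a transient fault.

from collections import deque

class StoreBuffer:
    def __init__(self):
        self.entries = deque()  # (address, value) from the leading thread

    def leading_commit(self, addr, value):
        self.entries.append((addr, value))

    def trailing_check(self, addr, value):
        """True if the trailing store matches the leading store."""
        lead_addr, lead_value = self.entries.popleft()
        return (addr, value) == (lead_addr, lead_value)

stb = StoreBuffer()
stb.leading_commit(0x1000, 42)
stb.leading_commit(0x1008, 7)
print(stb.trailing_check(0x1000, 42))  # True  -> store released to cache
print(stb.trailing_check(0x1008, 9))   # False -> fault detected
```

In hardware the comparison happens at commit time; the FIFO ordering above mirrors the fact that both threads commit stores in program order.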

Transient Fault Recovery for CMPs

Unlike CRT, CRTR must not allow any trailing instruction to commit before it is
checked for faults, so that the register state of the trailing thread may be used for recovery.
However, the leading thread in CRTR may commit register state before checking, as in CRT.

This asymmetric commit strategy allows CRTR to employ a long slack to absorb
inter-processor latencies. As in CRT, CRTR commits stores only after checking. In addition
to communicating branch outcomes, load addresses, load values, store addresses, and
store values like CRT, CRTR also communicates register values.
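The slack can be pictured as a bounded queue between the two cores: the leading thread may run up to `slack` instructions ahead while its results cross the interconnect. The following is a toy model of that behavior (an illustration under stated assumptions, not the published microarchitecture):

```python
# Toy model of CRTR's slack: the leading thread may get up to `slack`
# results ahead of the trailing thread; when the queue is full it stalls
# until the trailing thread checks (here, simply consumes) a result.

from collections import deque

def run_with_slack(results, slack):
    in_flight = deque()  # leading-thread results awaiting checking
    checked = []
    max_lead = 0         # how far ahead the leading thread got
    for value in results:
        if len(in_flight) == slack:
            # Queue full: leading thread stalls; trailing thread checks one.
            checked.append(in_flight.popleft())
        in_flight.append(value)
        max_lead = max(max_lead, len(in_flight))
    checked.extend(in_flight)  # drain once the leading thread finishes
    return checked, max_lead

checked, max_lead = run_with_slack(list(range(10)), slack=4)
print(checked == list(range(10)))  # True -> all results checked in order
print(max_lead)                    # 4    -> leading thread ran 4 ahead
```

A longer slack absorbs longer inter-processor latencies, at the cost of more buffering.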

Challenges with this approach

Ø Instruction bandwidth and I-cache misses: since instructions are being grabbed from many different contexts, instruction locality is degraded and the I-cache miss rate rises.

Ø Register file access time: the register file access time increases because the register file has to grow significantly in size to accommodate many separate contexts. In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.

Ø Single-thread performance: single-thread performance is significantly degraded, since the context is forced to switch to a new thread even if none is available.

Ø A very high-bandwidth network is required, both fast and wide.

Ø Retries are needed on load-empty or store-full conditions.

To maximize SMT performance, the issue slots, functional units, and renaming registers must all be shared effectively among threads.

Case Studies

Multicore architecture

A multi-core design is one in which a single physical processor contains the core logic of more than one processor. The goal is to enable a system to run more tasks simultaneously, achieving greater overall performance.

Hyper-threading or multicore?

Early PCs were capable of doing only a single task at a time. Later, multi-threading technology came into place; Intel’s implementation of multi-threading is called Hyper-Threading.

Multi-core processors

Each core has its own execution pipeline, and there is no fundamental limit on the number of cores that can be placed in a single chip. Two cores run at slower speeds and lower temperatures, but their combined throughput is greater than that of a single processor. The fundamental relationship between frequency and power can be used to multiply the number of cores from 2 to 4, 8, and even higher.
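The frequency/power trade-off can be made concrete with illustrative numbers (the scaling assumptions below are common rules of thumb, not figures from the source): dynamic power is roughly C·V²·f, and if supply voltage scales about linearly with frequency, power scales as f³.

```python
# Illustrative scaling (assumptions: dynamic power P ~ C * V^2 * f, and
# V scales roughly linearly with f, so P ~ f^3 per core).

def relative_power(freq_scale, num_cores=1):
    """Power relative to one core at full frequency, under P ~ f^3."""
    return num_cores * freq_scale ** 3

def relative_throughput(freq_scale, num_cores=1):
    """Ideal throughput for perfectly parallel work."""
    return num_cores * freq_scale

# One core at full speed vs. two cores at 80% speed:
print(relative_power(1.0, 1), relative_throughput(1.0, 1))  # baseline
print(relative_power(0.8, 2), relative_throughput(0.8, 2))  # ~1.02x power, 1.6x throughput
```

Under these assumptions, two cores at 80% frequency deliver about 1.6x the throughput for roughly the same power budget, which is why vendors multiply cores rather than frequency.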

Intel-multicore architecture

Ø Intel Turbo Boost Tech.

Ø Intel Hyper Threading Tech.

Ø Intel Core Microarchitecture.

Ø Intel Advanced Smart Cache.


Ø Intel Smart Memory Access.

Intel Smart Memory Access

Benefits

Ø Multi-core performance.

Ø Dynamic scalability.

Ø Design and performance scalability

Ø Intelligent performance on-demand

Ø Increased performance on highly-threaded apps.

Ø Scalable shared memory.

Ø Multi-level shared cache.


IBM CELL PROCESSOR

The Cell is a chip with one hyper-threaded PowerPC core, called the PPE, and eight specialized cores called SPEs. The challenge solved by the Cell was putting all those cores together on a single chip; this was made possible by the use of a bus with outstanding performance.
The Cell processor can be split into four components:

Ø external input and output structures,

Ø the main processor called the Power Processing Element (PPE)

Ø eight fully-functional co-processors called the Synergistic Processing


Elements, or SPEs,

Ø a specialized high-bandwidth circular data bus connecting the PPE,


input/output elements and the SPEs, called the Element Interconnect Bus or EIB.

Overview of the architecture of a Cell chip

POWERPC PROCESSOR ELEMENT: (PPE)

Ø The PowerPC Processor Element, usually denoted as PPE, is a dual-threaded

PowerPC processor, version 2.02.

Ø This 64-bit RISC processor also has the Vector/SIMD Multimedia Extension.

Ø The PPE’s role is crucial in the Cell architecture, since it both runs the OS and controls all other resources, including the SPEs.

Ø The PPE is made out of two main units:

1: The Power Processor Unit (PPU)

2: The Power Processor Storage Subsystem (PPSS)

PPE Block diagram

PPU:

It is the processing part of the PPE and is composed of:

Ø A full set of 64-bit PowerPC registers.

Ø 32 128-bit vector multimedia registers.

Ø A 32KB L1 instruction cache.

Ø A 32KB L1 data cache.

It also has all the common components of a PowerPC processor with Vector/SIMD extensions (instruction control unit, load and store unit, fixed-point integer unit, floating-point unit, vector unit, branch unit, virtual memory management unit). The PPU is hyper-threaded and supports 2 simultaneous threads.

PPSS

This handles all memory requests from the PPE and requests made to the PPE by
other processors or I/O devices. It is composed of:

Ø A unified 512-KB L2 instruction and data cache.

Ø Various queues

Ø A bus interface unit that handles bus arbitration and pacing on the Element Interconnect Bus.

SYNERGISTIC PROCESSOR ELEMENTS: (SPE)

Each Cell chip has 8 Synergistic Processor Elements. Each is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD applications, and consists of two main units.

1: The Synergistic Processor Unit (SPU)

2: The Memory Flow Controller (MFC)

The Synergistic Processor Unit (SPU):

This deals with instruction control and execution. It includes various components:

Ø A register file of 128 registers of 128 bits each.

Ø A unified instruction and data 256-KB Local Store (LS).

Ø A channel-and-DMA interface.

Ø As usual, an instruction-control unit, a load and store unit, two fixed-point units,
a floating point unit.

The SPU implements a set of SIMD instructions specific to the Cell. Each SPU is independent and has its own program counter. Instructions are fetched from its own Local Store (LS), and data are also loaded and stored in the LS.

The Memory Flow Controller (MFC)

It is the interface between the SPU and the rest of the Cell chip: the MFC connects the SPU to the EIB. In addition to a whole set of MMIO registers, it contains a DMA controller.

Bus design and communication among the Cell

1: The Element Interconnect Bus:

This bus links all parts of the chip. The EIB itself is made out of a 4-ring structure (two rings running clockwise and two counter-clockwise) that is used to transfer data, and a tree structure used to carry commands. It is controlled by the Data Arbiter. This structure allows 8 simultaneous transactions on the bus.
2: Input/output interfaces:

The Memory Interface Controller (MIC):

Ø It provides an interface between the EIB and the main storage.

Ø It currently supports two Rambus Extreme Data Rate (XDR) I/O (XIO)
memory channels.

The Cell Broadband Engine Interface (BEI):

Ø this is the interface between the Cell and I/O devices, such as GPUs and
various bridges.

Ø It supports two Rambus FlexIO external I/O channels.

Ø One of these channels supports only non-coherent transfers; the other

supports either coherent or non-coherent transfers.

Key Attributes of Cell

Ø Cell is Multi-Core

Ø Cell is a Flexible Architecture

Ø Cell is a Broadband Architecture

Ø Cell is a Real-Time Architecture

Ø Cell is a Security Enabled Architecture
