Design Issues: SMT and CMP Architectures
If the pipeline does not contain a mix of instructions from several threads, there is a
greater probability that empty issue slots or a stall will occur
Design Challenges
A larger register file is needed to hold multiple contexts. The clock cycle time must not be
affected, especially in critical steps such as instruction issue and instruction completion.
Ensuring that cache and TLB conflicts generated by SMT do not degrade
performance. There are mainly two observations:
An SMT processor works well if the number of compute-intensive threads does not exceed
the number of threads supported by the SMT hardware, and if the threads have highly
different characteristics, e.g., one thread doing mostly integer operations and another doing
mostly floating-point operations.
It does not work well if the threads try to utilize the same functional units or create
resource-assignment problems.
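As an illustration of the favourable case, the sketch below (plain C with POSIX threads; the loop bodies and iteration counts are arbitrary) runs one integer-heavy thread alongside one floating-point-heavy thread, the kind of mix that lets an SMT core keep both its integer and FP units busy at the same time.

```c
#include <pthread.h>
#include <stdio.h>

/* Integer-heavy worker: exercises the fixed-point/ALU pipelines. */
static void *int_worker(void *arg) {
    (void)arg;
    volatile unsigned long acc = 1;
    for (long i = 0; i < 100000000L; i++)
        acc = acc * 2654435761UL + 1;       /* integer multiply/add */
    return NULL;
}

/* Floating-point-heavy worker: exercises the FP pipelines. */
static void *fp_worker(void *arg) {
    (void)arg;
    volatile double acc = 1.0;
    for (long i = 0; i < 100000000L; i++)
        acc = acc * 1.0000001 + 0.5;        /* FP multiply/add */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, int_worker, NULL);
    pthread_create(&t2, NULL, fp_worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}
```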
The problem is that the operating system does not see the difference between SMT logical
processors and real (physical) processors!
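A minimal Linux-specific sketch of the consequence: the scheduler exposes numbered logical CPUs, and telling SMT siblings apart from separate physical cores takes explicit topology information plus thread affinity. The CPU id used below is an arbitrary choice; on Linux the sibling map can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one logical CPU.  Whether two logical CPU ids
 * are SMT siblings on the same physical core is machine specific and must
 * be looked up in the topology files mentioned above. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    if (pin_to_cpu(0) != 0)                 /* CPU 0: arbitrary example id */
        fprintf(stderr, "could not set affinity\n");
    printf("running on logical CPU %d\n", sched_getcpu());
    return 0;
}
```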
Transient Faults
Transient faults are faults that persist for only a “short” duration. Cause: cosmic rays (e.g.,
neutrons). Effect: they knock off electrons and discharge capacitors. There is no practical
absorbent for cosmic rays, so shielding is not a solution. Estimated fault rate: 1 fault per
1000 computers per year.
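To put that rate in perspective: at 1 fault per 1000 computers per year, a site with 100,000 machines would see on the order of 100,000 / 1000 = 100 transient faults per year, i.e. roughly one every few days, which is why detection schemes such as the ones below matter at scale.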
Lock-stepping does not work because SMT may issue the same instruction from the
redundant threads in different cycles. Instructions from the redundant threads must be
carefully fetched and scheduled, since branch mispredictions and cache misses will occur.
CRT borrows its detection scheme from the SMT-based Simultaneous and
Redundantly Threaded (SRT) processors and applies the scheme to CMPs.
Detection is based on replication, but to what extent? CRT replicates register values (in the
register file of each core) but not memory values. CRT’s leading thread commits stores
only after checking, so that memory is guaranteed to be correct. CRT compares only the stores
and uncached loads, not the register values, of the two threads.
CRT uses a store buffer (StB) in which the leading thread places its committed store
values and addresses. The store values and addresses of the trailing thread are compared
against the StB entries to determine whether a fault has occurred; only checked stores
reach the cache hierarchy.
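A conceptual sketch of the StB check in plain C follows (the names, queue size, and single-producer/single-consumer simplification are illustrative, not the actual hardware): the leading thread enqueues committed stores, the trailing thread compares its own stores against the oldest entry, and only matching (checked) stores would be released to the cache hierarchy.

```c
#include <stdint.h>
#include <stdio.h>

#define STB_SIZE 16

typedef struct { uint64_t addr; uint64_t value; } stb_entry_t;

static stb_entry_t stb[STB_SIZE];
static int head = 0, tail = 0;   /* synchronisation omitted for clarity */

/* Leading thread: place a committed store into the StB. */
void leading_store(uint64_t addr, uint64_t value) {
    stb[tail % STB_SIZE] = (stb_entry_t){ addr, value };
    tail++;
}

/* Trailing thread: compare its store against the oldest StB entry. */
int trailing_store(uint64_t addr, uint64_t value) {
    stb_entry_t e = stb[head % STB_SIZE];
    head++;
    if (e.addr != addr || e.value != value) {
        fprintf(stderr, "transient fault detected at 0x%llx\n",
                (unsigned long long)addr);
        return -1;               /* fault: do not update memory */
    }
    /* write_to_cache(e.addr, e.value): only checked stores reach memory */
    return 0;
}
```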
Unlike CRT, CRTR must not allow any trailing instruction to commit before it is
checked for faults, so that the register state of the trailing thread may be used for recovery.
However, the leading thread in CRTR may commit register state before checking, as in CRT.
This asymmetric commit strategy allows CRTR to employ a long slack to absorb
inter-processor latencies. As in CRT, CRTR commits stores only after checking. In addition
to communicating branch outcomes, load addresses, load values, store addresses, and
store values like CRT, CRTR also communicates register values.
Ø I-Cache: instruction fetch bandwidth becomes a concern, since instructions must be
supplied for several contexts.
Ø I-Cache misses: since instructions are being fetched from many different contexts,
instruction locality is degraded and the I-cache miss rate rises.
Ø Register file access time increases, because the regfile has to grow significantly in
size to accommodate many separate contexts.
Ø In fact, the HEP and Tera use SRAM to implement the regfile, which means
longer access times.
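A quick sizing example: with 4 hardware contexts and 32 architectural integer registers per context, the register file must hold at least 4 × 32 = 128 architectural registers even before any physical rename registers are added, which is why its access time grows.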
Case Studies
Multicore architecture
A multi-core design is one in which a single physical processor contains the core logic of
more than one processor. Goal: enable a system to run more tasks simultaneously and
thereby achieve greater overall performance.
Hyper-threading or multicore?
Early PCs were capable of doing only a single task at a time. Later, multi-threading
technology came into place; Intel’s implementation of multi-threading is called Hyper-Threading.
Multi-core processors
Each core has its own execution pipeline. There is no fundamental limit on the number of
cores that can be placed on a single chip. Two cores run at lower speeds and lower
temperatures, but their combined throughput is greater than that of a single processor. The
fundamental relationship between frequency and power can be used to multiply the
number of cores from 2 to 4, 8, and even higher.
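As a rough worked example of that relationship: dynamic power scales approximately as P ≈ C·V²·f. If each core is run at about 80% of the original frequency and voltage, its power drops to roughly 0.8 × 0.8² ≈ 0.51 of the original, so two such cores fit in about the same power budget (2 × 0.51 ≈ 1.02) while offering up to 2 × 0.8 = 1.6 times the throughput.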
Intel-multicore architecture
Benefits
Ø Multi-core performance.
Ø Dynamic scalability.
The Cell is a chip with one hyper-threaded PowerPC core, called the PPE, and eight
specialized cores, called SPEs. The challenge to be solved by the Cell was to put all those
cores together on a single chip; this was made possible by the use of a bus with outstanding
performance.
POWER PROCESSOR ELEMENT: PPE
Ø This 64-bit RISC PowerPC processor also has the Vector/SIMD Multimedia Extension.
Ø The PPE’s role is crucial in the Cell architecture, since it is on the one hand
running the OS, and on the other hand controlling all other resources, including the SPEs.
PPU:
This is the PPE’s processing unit (the PowerPC Processor Unit), which fetches and
executes the PPE’s instructions.
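The Vector/SIMD Multimedia Extension mentioned above is programmed through AltiVec/VMX intrinsics. A minimal sketch follows (it assumes a PowerPC compiler with AltiVec support, e.g. gcc -maltivec; the data values are arbitrary):

```c
#include <altivec.h>
#include <stdio.h>

int main(void) {
    /* Two 4-wide single-precision vectors added by one VMX instruction. */
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = { 10.0f, 20.0f, 30.0f, 40.0f };
    vector float c = vec_add(a, b);

    float out[4] __attribute__((aligned(16)));
    vec_st(c, 0, out);                      /* store the vector result */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```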
PPSS (PowerPC Processor Storage Subsystem):
This handles all memory requests from the PPE and requests made to the PPE by
other processors or I/O devices. It is composed of:
Ø Various queues
Ø A bus interface unit that handles bus arbitration and pacing on the
Element Interconnect Bus
SYNERGISTIC PROCESSOR ELEMENTS: SPE
Each Cell chip has 8 Synergistic Processor Elements. Each is a 128-bit RISC processor
specialized for data-rich, compute-intensive SIMD applications, and consists of two main
units: the SPU and the MFC.
SPU:
This unit deals with instruction control and execution. It includes various components:
Ø A channel-and-DMA interface.
Ø As usual, an instruction-control unit, a load-and-store unit, two fixed-point units,
and a floating-point unit.
The SPU implements a set of SIMD instructions specific to the Cell. Each SPU is
independent and has its own program counter. Instructions are fetched from its own Local
Store (LS), and data are also loaded and stored in the LS.
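A minimal SPU-side sketch of such SIMD code (it assumes the Cell SDK’s spu-gcc toolchain and its spu_intrinsics.h; the data values are arbitrary):

```c
#include <spu_intrinsics.h>

/* SPU-side kernel: everything here lives in the SPE's 256 KB Local Store. */
int main(void) {
    vec_float4 a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vec_float4 b = { 10.0f, 20.0f, 30.0f, 40.0f };
    vec_float4 c = spu_add(a, b);             /* one 4-wide SIMD add */
    return spu_extract(c, 0) > 0.0f ? 0 : 1;  /* read back lane 0    */
}
```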
MFC:
This is the interface between the SPU and the rest of the Cell chip: the MFC
interfaces the SPU with the EIB. In addition to a whole set of MMIO registers, it contains
a DMA controller.
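A sketch of how an SPU program would use the MFC’s DMA controller to pull data from main memory into its Local Store, assuming the Cell SDK’s spu_mfcio.h interface (the buffer size, tag number, and the effective address parameter ea are illustrative):

```c
#include <spu_mfcio.h>

/* 16 KB buffer in the SPE Local Store; DMA transfers must be 16-byte
 * aligned (128-byte alignment gives the best EIB performance). */
static char buffer[16384] __attribute__((aligned(128)));

/* Pull `size` bytes from main memory (effective address `ea`) into the
 * Local Store through the MFC's DMA controller, then wait for completion. */
void fetch_from_main_memory(unsigned long long ea, unsigned int size) {
    unsigned int tag = 1;                 /* DMA tag group, 0..31        */

    mfc_get(buffer, ea, size, tag, 0, 0); /* enqueue the DMA get         */
    mfc_write_tag_mask(1 << tag);         /* select this tag group       */
    mfc_read_tag_status_all();            /* block until it completes    */
}
```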
The Element Interconnect Bus (EIB) links all parts of the chip. The EIB itself is made of
a 4-ring structure (two rings clockwise and two counter-clockwise) that is used to transfer data,
and a tree structure used to carry commands. It is controlled by what is called the
Data Arbiter. This structure allows 8 simultaneous transactions on the bus.
2: Input/output interfaces:
Ø It currently supports two Rambus Extreme Data Rate (XDR) I/O (XIO)
memory channels.
Ø This is the interface between the Cell and I/O devices, such as GPUs and
various bridges.
Ø Cell is Multi-Core