Multithreading Architectures
Joel Emer, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. Based on material prepared by Krste Asanovic and Arvind.
Pipeline Hazards
[Pipeline diagram: a sequence of dependent instructions on the 5-stage pipe (F D X M W); each instruction stalls in decode for several cycles until its predecessor writes back.]
Each instruction may depend on the next. Even bypassing does not eliminate all delays. What can be done to cope with this?
Multithreading
How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline.
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                          t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
    T1: LW r1, 0(r2)      F  D  X  M  W
    T2: ADD r7, r1, r4       F  D  X  M  W
    T3: XORI r5, r4, #12        F  D  X  M  W
    T4: SW 0(r7), r5               F  D  X  M  W
    T1: LW r5, 12(r1)                 F  D  X  M  W
The prior instruction in a thread always completes its writeback before the next instruction in the same thread reads the register file.
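To make the timing concrete, here is a minimal simulation sketch (mine, not the lecture's; the thread count and stage numbering are assumptions matching the example above) checking that with 4 interleaved threads on the 5-stage pipe, a thread's writeback always precedes its next instruction's decode:

    # Fixed interleave: cycle c fetches thread c % N on a 5-stage pipe (F D X M W).
    N = 4  # threads T1-T4, as in the example above
    for t in range(N):
        wb_cycle = t + 4           # thread t's instruction 0 reaches W (writeback) here
        next_decode = t + N + 1    # thread t's instruction 1 reaches D (register read) here
        assert wb_cycle < next_decode  # register file is written before it is read again
        print(f"thread {t}: writeback at cycle {wb_cycle}, next decode at cycle {next_decode}")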
CDC 6600 Peripheral Processors (Cray, 1964)
Image removed due to copyright restrictions. To view image, visit http://www.bambi.net/computer_museum/cdc6600_and_console.jpg
First multithreaded hardware:
- 10 virtual I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state
[Diagram: simple multithreaded pipeline; a 2-bit thread-select counter (+1 each cycle) chooses which thread's PC drives fetch from the I$ into the IR, with the D$ accessed later in the pipe.]
The thread select has to be carried down the pipeline to ensure that the correct state bits are read/written at each pipe stage.
Multithreading Costs
Each thread requires its own user state:
- PC
- GPRs
Other costs?
Appears to software (including OS) as multiple, albeit slower, CPUs.
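As a rough illustration of that replicated user state (an illustrative sketch only; the 32-register count is an assumption):

    from dataclasses import dataclass, field

    @dataclass
    class ThreadContext:        # per-thread user state the hardware must replicate
        pc: int = 0
        gprs: list = field(default_factory=lambda: [0] * 32)  # assumed 32 GPRs

    contexts = [ThreadContext() for _ in range(4)]  # one context per hardware thread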
Fixed interleave: each of N threads executes one instruction every N cycles; if a thread is not ready to go in its slot, insert a pipeline bubble.
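A minimal sketch of that policy (hypothetical; the ready flags stand in for whatever stall condition the hardware tracks):

    # Fixed interleave: the slot at cycle c belongs to thread c % N.
    # If that thread is stalled, the slot becomes a pipeline bubble (None).
    def select_thread(cycle, ready):
        tid = cycle % len(ready)
        return tid if ready[tid] else None

    print(select_thread(5, [True, False, True, True]))  # cycle 5 -> thread 1 -> bubble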
Denelcor HEP
First commercial machine to use hardware threading in main CPU:
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to Tera MTA (Multithreaded Architecture)
Image removed due to copyright restrictions. To view image, visit http://www.npaci.edu/online/v2.1/mta.html
MTA Architecture
Each processor supports 128 active hardware threads:
- 1 x 128 = 128 stream status word (SSW) registers
- 8 x 128 = 1024 branch-target registers
- 32 x 128 = 4096 general-purpose registers
Thread creation and termination instructions.
Explicit 3-bit lookahead field in each instruction gives the number of subsequent instructions (0-7) that are independent of this one:
- cf. instruction grouping in VLIW
- allows fewer threads to fill the machine pipeline (see the note below)
- used for variable-sized branch delay slots
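A back-of-the-envelope reading of why lookahead helps (my arithmetic, not the slide's): if an instruction declares k subsequent independent instructions, one thread can keep k+1 instructions in flight at once, so roughly 21/(k+1) threads suffice to fill the 21-stage pipeline; at the maximum k = 7, that is 21/8, i.e., about 3 threads instead of 21.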
MTA Pipeline
[Diagram: MTA pipeline; instruction fetch feeds an issue pool, then the W, M, A, and C units; in-flight memory operations sit in memory, write, and retry pools.]
Every cycle, one instruction from one active thread is launched into the pipeline. The instruction pipeline is 21 cycles long.
Assuming a single thread issues one instruction every 21 cycles and the clock rate is 260 MHz, what is the performance? The effective single-thread issue rate is 260/21 = 12.4 MIPS.
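Implicitly, then, at least 21 ready threads are needed to reach the peak rate of one instruction per cycle: 21 threads x 12.4 MIPS ≈ 260 MIPS.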
Coarse-grained multithreading: context switch among threads every few cycles, e.g., on:
- Function unit data hazard
- L1 miss
- L2 miss
Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality:
- No data cache
- Many parallel threads needed to hide large memory latency

By contrast, cache-friendly applications need only modest multithreading:
- Few pipeline bubbles when the cache is getting hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses
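A rough sizing rule (my numbers, not the lecture's): if a fraction m of instructions miss and each miss stalls a thread for M cycles, a thread can issue only once every 1 + m*M cycles on average, so about 1 + m*M threads keep the pipeline busy. E.g., m = 0.02 and M = 100 gives 1 + 2 = 3 threads.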
MIT Alewife (1990)
Image removed due to copyright restrictions. To view image, visit http://www.cag.lcs.mit.edu/alewife/pictures/jpg/16-extender.jpg
Register windows hold different thread contexts.
IBM PowerPC RS64-IV (2000)
Commercial coarse-grain multithreading CPU:
- Based on PowerPC with quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
- The short pipeline minimizes the flush penalty (4 cycles), small compared to memory access latency
- The pipeline is flushed to simplify exception handling
A rough cost model is sketched below.
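That cost model (assumed numbers throughout; 200 cycles stands in for the memory latency the slide calls large):

    # Effective CPI for coarse-grain multithreading that flushes and
    # switches threads on an L2 miss, vs. stalling in place.
    def effective_cpi(base_cpi=1.0, l2_miss_rate=0.01, flush_penalty=4,
                      memory_latency=200, switch_on_miss=True):
        miss_cost = flush_penalty if switch_on_miss else memory_latency
        return base_cpi + l2_miss_rate * miss_cost

    print(effective_cpi(switch_on_miss=True))   # 1.04: each miss costs only the flush
    print(effective_cpi(switch_on_miss=False))  # 3.00: each miss costs the full latency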
Vertical Multithreading
[Diagram: instruction issue slots (issue width 4) over time; a second thread interleaved cycle-by-cycle fills otherwise wholly idle cycles, but partially filled cycles, i.e., IPC < 4, remain as horizontal waste.]
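A small accounting sketch of the two kinds of waste (hypothetical utilization trace; width 4 as in the diagram):

    # Classify wasted issue slots: an all-empty cycle is vertical waste;
    # a partially filled cycle (0 < used < width) adds horizontal waste.
    def waste(used_per_cycle, width=4):
        vertical = sum(width for u in used_per_cycle if u == 0)
        horizontal = sum(width - u for u in used_per_cycle if 0 < u < width)
        return vertical, horizontal

    print(waste([4, 2, 0, 3, 0, 1]))  # -> (8, 6)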
Chip Multiprocessing
[Diagram: issue width vs. time for chip multiprocessing; the issue slots are split between multiple narrower processors, each running its own thread.]
Simultaneous Multithreading (SMT)
- Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
- Utilize a wide out-of-order superscalar processor's issue queue to find instructions to issue from multiple threads (see the sketch below)
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine
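A toy version of that shared issue queue (a sketch under assumed names, not any real machine's scheduler): entries carry a thread id, and the select logic takes any ready entries up to the issue width, regardless of thread:

    # Issue-queue entries tagged with a thread id; select ignores the tag.
    def issue(queue, width=4):
        picked = [e for e in queue if e["ready"]][:width]
        for e in picked:
            queue.remove(e)
        return picked

    iq = [{"thread": 0, "op": "add", "ready": True},
          {"thread": 1, "op": "mul", "ready": True},
          {"thread": 0, "op": "ld",  "ready": False}]
    print([(e["thread"], e["op"]) for e in issue(iq)])  # mixes both threads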
SMT Pipeline
[Diagram: SMT pipeline from PC through register write and retire, with per-thread PCs and contexts at the front end.]
Assuming latency (L) is unchanged with the addition of threading, for each thread i with original throughput Ti:
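The formula itself did not survive the transcript. By Little's law (throughput = occupancy / latency), one natural reconstruction, offered as a reading rather than the slide's exact text, is Ti = Ni / L; with L fixed, the aggregate throughput T1 + T2 + ... = (N1 + N2 + ...) / L, so total throughput scales with the total occupancy the threads together keep in flight.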
For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).
[Diagram: issue width vs. time; with high TLP, several threads share the issue slots each cycle, while with low TLP a single thread can fill the full width.]
Pentium-4 Hyperthreading (2002)
First commercial SMT design (2-way SMT)
Hyperthreading == SMT
A processor running only one active software thread runs at approximately the same speed with or without hyperthreading.
Pentium-4 Hyperthreading
Front End
Figure removed due to copyright restrictions. Refer to Figure 5a in Marr, D., et al. "Hyper-threading Technology Architecture and Microarchitecture." Intel Technology Journal 6, no. 1 (2002): 8. http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf
Shared second-level branch history table, tagged with logical processor IDs
Pentium-4 Hyperthreading
Execution Pipeline
Figure removed due to copyright restrictions. Refer to Figure 6 in Marr, D., et al. "Hyper-threading Technology Architecture and Microarchitecture." Intel Technology Journal 6, no. 1 (2002): 10. http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf
Extras
[Diagram: out-of-order pipeline datapath; in-order front end and in-order commit around out-of-order execute, with a physical register file, branch/ALU/memory units, a store buffer, and the D$.]
Granularity of Multithreading
[Diagram: a CPU with register file (RF), split L1 instruction and data caches, and a unified L2 cache.]
Coarse-grained multithreading: the CPU switches to a different thread every few cycles. When does this make sense?
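A rough criterion (my framing, not the slide's): switching pays off when the stall being hidden exceeds the cost of the switch, i.e., when the stall latency is larger than the switch/flush penalty. That is why a design like the one above switches on L2 misses (hundreds of cycles) rather than on short L1 stalls.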