EE 457 Unit 9c
Credits
• Some of the material in this presentation is taken from:
– Computer Architecture: A Quantitative Approach
• John Hennessy & David Patterson
• Some of the material in this presentation is derived from
course notes and slides from
– Prof. Michel Dubois (USC)
– Prof. Murali Annavaram (USC)
– Prof. David Patterson (UC Berkeley)
BACKGROUND KNOWLEDGE
Power
• Power and energy consumption are MAJOR concerns for processors
• Power consumption can be decomposed into:
– Static (P_STAT): Power constantly being dissipated (grows with # of transistors)
– Dynamic (P_DYN): Power consumed when switching a bit (0 to 1 or 1 to 0)
• P_DYN = I_DYN · V_DD ≈ ½ · C_TOT · V_DD² · f
– Recall, I = C·dV/dt
– V_DD is the logic '1' voltage, f = clock frequency
• Dynamic power favors parallel processing vs. higher clock rates
– The V_DD value is tied to f, so a reduction/increase in f leads to a similar change in V_DD
– Implies dynamic power is proportional to f³ (a cubic savings in power if we can reduce f)
– Take a core and replicate it 4x => 4x performance and 4x power
– Take a core and increase its clock rate 4x => 4x performance and 64x power (see the sketch after this list)
• Static power
– Leakage occurs no matter what the frequency is
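A back-of-the-envelope sketch of the scaling argument above, in Python. It is purely illustrative: the linear V_DD-with-f scaling and the 4x figures are the simplifying assumptions made on this slide, not measured data.

```python
# Rough dynamic-power model from this slide: P_DYN ~ 1/2 * C_TOT * VDD^2 * f.
# Simplifying assumption (from the slide): VDD scales roughly linearly with f,
# so dynamic power effectively scales with f^3.

def dynamic_power(c_tot, vdd, f):
    """Relative dynamic power of one core (arbitrary units)."""
    return 0.5 * c_tot * vdd ** 2 * f

base = dynamic_power(c_tot=1.0, vdd=1.0, f=1.0)

# Option 1: replicate the core 4x at the same clock rate and voltage.
four_cores = 4 * dynamic_power(c_tot=1.0, vdd=1.0, f=1.0)

# Option 2: one core clocked 4x faster (VDD assumed to scale with f).
fast_core = dynamic_power(c_tot=1.0, vdd=4.0, f=4.0)

print(four_cores / base)  # 4.0  -> ~4x performance for ~4x power
print(fast_core / base)   # 64.0 -> ~4x performance for ~64x power
```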
Temperature
• Temperature is related to power consumption
– Locations on the chip that burn more power will usually run hotter
• Locations where bits toggle (register file, etc.) will often become quite hot, especially if the toggling continues for a long period of time
– Too much heat can destroy a chip
– Can use sensors to dynamically sense temperature
• Techniques for controlling temperature
– External measures: Remove and spread the heat
• Heat sinks, fans, even liquid cooled machines
– Architectural measures
• Throttle performance (run at slower frequencies / lower voltages)
• Global clock gating (pause execution by turning off the clock)
• Do nothing…the results can be catastrophic
A real wire can be modeled as… [figure omitted]
Consideration
• Consider our out-of-order, pipelined processor
from Tomasulo part 2.
Question 1
• Do we have high frequencies?
Answer 1
• Do we have high frequencies?
– Yes, deep pipelines create shorter clock cycles and higher frequencies
– Power ↑↑↑
– Temp. ↑
Question 2
• What effect does our OoO processor have on wire length?
Answer 2
• What effect does our OoO processor have on wire length?
– Wire length ↑↑ with the CDB, ROB, issue logic, etc.
– Time (wire delay) ↑
Question 3
• What is the impact of short clock cycles (high freq.) on cache miss penalties?
Answer 3
• What is the impact of short clock cycles (high freq.) on cache miss penalties?
– Cache miss penalties ↑↑ relative to processor cycles
[Figure: the Processor-Memory Performance Gap: processor performance improves ~55%/year while memory performance improves ~7%/year]
Cache Hierarchy
• A hierarchy of caches can help mitigate the cache miss penalty
• L1 Cache
– 64 KB
– 2 cycle access time
– Common miss rate ~ 5%
• L2 Cache
– 1 MB
– 10-20 cycle access time
– Common miss rate ~ 1%
• Main Memory
– ~300 cycle access time
[Figure: processor (P) backed by L1, L2, and L3 caches and main memory]
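A rough average memory access time (AMAT) calculation with the ballpark numbers above, as a Python sketch. Treating the quoted miss rates as local miss rates at each level is an assumption made here for simplicity.

```python
# AMAT for a two-level cache hierarchy using the rough numbers on this slide.
# Assumption: the quoted miss rates are local (misses per access at that level).

l1_time, l1_miss_rate = 2, 0.05    # 2-cycle hit, ~5% miss rate
l2_time, l2_miss_rate = 15, 0.01   # 10-20 cycle access; pick 15
mem_time = 300                     # ~300-cycle main memory access

amat = l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)
print(amat)  # 2 + 0.05 * (15 + 0.01 * 300) = 2.9 cycles on average
```

Even a small L1 miss rate keeps the average close to the L1 access time, which is exactly why the hierarchy helps.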
Single-Thread Execution
[Figure: an actual program alternates compute (C) phases with memory (M) latency; even with a 2x speedup in compute, single-thread speedup is minimal because memory latency dominates. Adapted from: OpenSparc T1 Micro-architecture Specification]
Question 4
• But in OoO processors, can't we just deepen our ROB, issue queues, Store Address Buffer, etc. to hide cache misses?
Answer 4
• But in OoO processors, can't we just deepen our ROB, issue queues, Store Address Buffer, etc. to hide cache misses?
– Associative lookup structures are expensive and slow down dramatically as they deepen
– DOES NOT SCALE WELL
Motivating HW Multithread/Multicore
• Issues that prevent us from exploiting ILP in more
advanced single-core processors with deeper
pipelines and OoO Execution
– Slow memory hierarchy
– Increased power with higher clock rates
– Increased wire delay & size with more advanced structures
(ROBs, Issue queues, etc.) for potentially diminishing
returns
• All of these issues point us toward "easier" sources of parallelism such as TLP (Thread-Level Parallelism)
OVERVIEW OF TLP
What is a Thread?
• Thread (def.): Single execution sequence (instruction
stream) representing a separately schedulable task
– Schedulable task: Can be transparently paused and resumed
by the OS scheduler
• Consider the processor: for two threads to execute in parallel, what would each need its own copy of?
[Figure: two instruction streams (lw/add/or/sw… and sw/sub/lw/sub…), register state ($1…$31), a PC (e.g., 0x0004a804), and separate T1 and T2 stacks]
Separate or Shared?
• Consider the processor: for what resources would each thread need its own copy?
• Classify each as "1 per thread" or "shared among all":
– Program Counter
– ALUs
– Register File
– Page Table Base Register
– Cache Memory
Software Multithreading
• Used since the 1960's on uniprocessors to hide I/O latency
– Multiple processes with different virtual address spaces and process control blocks
– On an I/O operation, state is saved and another process is given the CPU
– When the I/O operation completes, the process is rescheduled
• On a context switch… (see the sketch below)
– Trap the processor and flush the pipeline
– Save state in the process control block (PC, register file, interrupt vector, page table base register)
– Restore the state of another process
[Figure: CPU (Regs, PC) and the OS scheduler holding saved state (Regs, PC) for T1 = Ready, T2 = Blocked, T3 = Ready]
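A minimal sketch of the software context switch described above, in Python. The PCB fields mirror the bullet list; the CPU model, the ready queue, and all names are hypothetical and only illustrate the save/restore work paid in software.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class CPU:
    """Architectural state held by the (single) processor."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    interrupt_vector: int = 0
    page_table_base: int = 0

@dataclass
class PCB:
    """Process control block: the state the OS saves and restores."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    interrupt_vector: int = 0
    page_table_base: int = 0

ready_queue = deque()  # processes waiting for the CPU

def context_switch(cpu: CPU, current: PCB, nxt: PCB):
    # (Hardware traps the processor and flushes the pipeline first;
    # only the software save/restore is modeled here.)
    current.pc, current.regs = cpu.pc, cpu.regs[:]       # save state
    current.interrupt_vector = cpu.interrupt_vector
    current.page_table_base = cpu.page_table_base
    ready_queue.append(current)                          # reschedule later

    cpu.pc, cpu.regs = nxt.pc, nxt.regs[:]               # restore state
    cpu.interrupt_vector = nxt.interrupt_vector
    cpu.page_table_base = nxt.page_table_base
```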
[Figure: Threads 1-4, each alternating compute (C) time with memory (M) latency; overlapping the threads keeps the core busy while individual threads wait on memory. Adapted from: OpenSparc T1 Micro-architecture Specification]
IMPLEMENTING HARDWARE
MULTITHREADING
[Figure: two threads, each alternating between servicing cache hits (C) and servicing a cache miss (M); interleaving (+) them lets one thread's hits overlap the other thread's miss]
Hardware Multithreading
• Run multiple threads on the same core with
hardware support for fast context switch
– Multiple register files
– Multiple state registers (PCs, page table base
registers, interrupt vectors, etc.)
– Avoids saving the context manually (via software); see the sketch below
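A sketch of what the hardware support above amounts to: per-thread state is replicated, and a "context switch" just changes which copy the pipeline selects, instead of copying state through software. The Python class names are hypothetical.

```python
class ThreadContext:
    """Per-thread state replicated in hardware (one copy per thread)."""
    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32
        self.page_table_base = 0
        self.interrupt_vector = 0

class MultithreadedCore:
    """ALUs, caches, etc. are shared; only the selected context changes."""
    def __init__(self, num_threads=4):
        self.contexts = [ThreadContext() for _ in range(num_threads)]
        self.active = 0   # select signal choosing the live context

    def switch_to(self, tid):
        # Fast context switch: just change a mux select; nothing is
        # saved to or restored from memory by software.
        self.active = tid

    @property
    def context(self):
        return self.contexts[self.active]
```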
Ex. of Fine-grained Multithreading
http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
Sparc T1 Niagara
• 8 cores, each executing 4 threads (called a thread group)
– Zero cycle thread switching penalty (round-robin)
– 6 stage pipeline
• Each core has its own L1 cache
• Each thread has its own
– Register file, instruction and store buffers
• Threads share…
– L1 cache, TLB, and execution units
• 3 MB shared L2 Cache, 4-banks, 12-way set-associative
– Is it a problem that it's not a power of 2? No!
[Core pipeline: Fetch → Thread Select → Decode → Execute → Memory → WB]
[Figure: Sparc T1 core pipeline diagram]
http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
T1 Pipeline
• Thread select stage [Stage 2]
– Choose instructions to issue from ready threads
– Issues based on
• Instruction type
• Misses
• Resource conflicts
• Traps and interrupts
• Fetch stage [Stage 1]
– Thread select mux chooses which thread's instruction to
issue and uses that thread's PC to fetch more instructions
– Access I-TLB and I-Cache
– 2 instructions fetched per cycle
T1 Pipeline
• Decode stage [Stage 3]
– Accesses register file
• Execute Stage [Stage 4]
– Includes ALU, shifter, MUL and DIV units
– Forwarding Unit
• Memory stage [Stage 5]
– DTLB, Data Cache, and 4 store buffers (1 per thread)
• WB [Stage 6]
– Write to register file
Pipeline Scheduling
• No pipeline flush on context switch (except
potentially of instructions from faulting thread)
• Full forwarding/bypassing to consuming (junior) instructions of the same thread
• In the case of a load, wait 2 cycles before an instruction from the same thread is issued
– Solves the forwarding latency issue
• The scheduler guarantees fairness between threads by prioritizing the least recently scheduled thread (see the sketch below)
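A sketch of a thread-select policy consistent with the bullets above: choose among ready threads, prefer the least recently scheduled one, and keep a thread out of selection for 2 cycles after it issues a load. This is illustrative Python, not the actual Niagara selection logic.

```python
def select_thread(threads, cycle):
    """threads: list of dicts with 'ready', 'last_scheduled', 'load_wait_until'."""
    candidates = [tid for tid, t in enumerate(threads)
                  if t["ready"] and cycle >= t["load_wait_until"]]
    if not candidates:
        return None  # every thread is stalled this cycle
    # Fairness: prioritize the least recently scheduled ready thread.
    tid = min(candidates, key=lambda i: threads[i]["last_scheduled"])
    threads[tid]["last_scheduled"] = cycle
    return tid

def note_load_issued(threads, tid, cycle):
    # After issuing a load, skip the next 2 cycles for this thread
    # (one illustrative interpretation of the 2-cycle wait above).
    threads[tid]["load_wait_until"] = cycle + 3
```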
[Figure: an expensive (software) context switch … compared to an expensive cache miss penalty]
Types/Levels of Multithreading
• How should we overlap and share the HW between instructions from different threads?
– Coarse-grained Multithreading: Execute one thread with all HW resources until a cache miss or misprediction incurs a stall or pipeline flush, then switch to another thread
– Fine-grained Multithreading: Alternate fetching instructions from a different thread each clock cycle
– Simultaneous Multithreading: Fetch and execute instructions from different threads at the same time
(A toy sketch of the first two policies follows below.)
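A toy Python sketch of the first two policies listed above: coarse-grained stays on one thread until it stalls, fine-grained rotates every clock cycle. Purely illustrative; the function names and the 2-thread example are hypothetical.

```python
def coarse_grained_next(current, stalled, num_threads):
    """Keep the current thread until it stalls (e.g., a cache miss)."""
    return (current + 1) % num_threads if stalled[current] else current

def fine_grained_next(current, num_threads):
    """Switch to the next thread every clock cycle (round-robin)."""
    return (current + 1) % num_threads

# Example: with 2 threads, thread 0 stalls on a miss at cycle 2.
current = 0
for cycle in range(5):
    stalled = [cycle == 2, False]
    current = coarse_grained_next(current, stalled, num_threads=2)
    # coarse-grained stays on thread 0 until cycle 2, then moves to thread 1
```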
Levels of TLP
[Figure: issue slots over time for Superscalar, Coarse-grained MT, Fine-Grained MT, and Simultaneous Multithreading (SMT); "Miss" annotations mark where an expensive cache miss penalty leaves issue slots empty]
Simultaneous Multithreading
• Uses multiple-issue, dynamic scheduling mechanisms to execute instructions from multiple threads at the same time by filling issue slots with as many available instructions as possible from any thread
– Overcomes poor utilization due to cache misses or a lack of independent instructions
– Requires HW to tag instructions based on their thread (see the sketch below)
• Requires a greater level of hardware resources (separate register renamers, branch predictors, store buffers, multiple register files, etc.)
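A sketch of the tagging idea above: every in-flight instruction carries a thread ID, and each cycle the issue logic fills as many slots as it can from whichever threads have ready instructions. Illustrative Python only; real SMT issue logic is of course far more involved.

```python
from collections import namedtuple

# The thread-ID tag keeps renamed registers / ROB entries of different
# threads from ever being confused in the shared backend.
Instr = namedtuple("Instr", ["thread_id", "op", "ready"])

def issue(window, issue_width):
    """Fill up to issue_width slots from any thread's ready instructions."""
    slots = [i for i in window if i.ready][:issue_width]
    for i in slots:
        window.remove(i)
    return slots

window = [Instr(0, "add", True), Instr(1, "lw", True),
          Instr(0, "mul", False), Instr(1, "sub", True)]
print(issue(window, issue_width=3))
# -> ready instructions from both threads issue in the same cycle
```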
[Figure: SMT datapath. Per-thread structures: Instruction Queues 1/2, register files (Reg. File 1, …), Register Rename 1/2, ROB1/ROB2, branch prediction buffers BPB1/BPB2, and store address buffers SAB1/SAB2. Shared structures: Dispatch, issue queues (Int., Mult., Div, L/S), Issue Unit, execution units (Integer/Branch, Mul, Div), D-Cache, L/S Buffer, and CDB. Dispatch can tag instructions with a thread ID to separate instructions in the backend.]
Example
• Intel HyperThreading Technology (HTT) is
essentially SMT
• Recent processors including Core i7 are multi-
core, multi-threaded, multi-issue, OoO
(dynamically scheduled) superscalar
processors
Future of Multicore/Multithreaded
• Multiple cores in shared memory configuration
• Per-core L1 or even L2
• Large on-chip shared cache
• Multiple threads on each core to fight the memory wall
• Ever-increasing numbers of on-chip threads
– To continue to meet Moore's Law
– CMPs with 1000s of threads envisioned
– The only sane option from a technology perspective (i.e. out of necessity)
– The big roadblock is parallel programming
Parallel Programming
• Implicit parallelism via…
– Parallelizing compilers
– Programming frameworks (e.g. MapReduce)
• Explicit parallelism
– OpenMP
– Task Libraries
• Intel Thread Building Blocks, Java Task Library
– Native threading (Windows threads, POSIX threads)
– MPI
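As one concrete (if simplistic) example of explicit parallelism, here is a sketch using Python's standard library rather than any of the specific frameworks listed above. Note that CPython's GIL limits the speedup of CPU-bound threads, so this illustrates the programming model more than the performance win.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    return sum(x * x for x in chunk)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]   # split the work four ways

# Explicitly spawn threads and combine their partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(partial_sum_of_squares, chunks))

print(sum(partial))
```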
BACKUP
[Figure: single-threaded out-of-order backend: Instruction Queue, Register Status Table, Dispatch, issue queues (Int., Mult., Div, L/S), Issue Unit, execution units (Integer/Branch, Mul, Div, D-Cache), and CDB]
[Figure: SMT datapath (repeated from the main slides): per-thread instruction queues, register files, rename units, ROBs, branch prediction buffers (BPB1/BPB2), and store address buffers (SAB1/SAB2); shared Dispatch, issue queues, Issue Unit, execution units, D-Cache, L/S Buffer, and CDB. Dispatch can tag instructions with a thread ID to separate instructions in the backend.]
Queues + Functional Units
[Figure: base pipeline (IM → Reg → DM (Cache) → Reg) extended with queues and multiple functional units: ALU, MUL, DIV]
• Look ahead: the Tomasulo algorithm will help absorb the latency of different functional units and cache-miss latency by allowing other ready instructions to proceed out of order
• An added complication of out-of-order execution & completion: WAW & WAR hazards
[Figure: multi-cycle functional units: EX, FP Add (A1-A4), Int. & FP MUL (M1-M7), Int. & FP DIV]
Functional Unit | Latency | Initiation Interval
Integer ALU     | 0       | 1
FP Add          | 3       | 1
FP Mul.         | 6       | 1
FP Div.         | 24      | 25
[Figure: single-threaded out-of-order datapath: I-Cache, Instruction Queue, Branch Prediction Buffer, Dispatch, Register File, ROB (Reorder Buffer; exceptions? no problem), issue queues (Int., Mult., Div, L/S), Address Buffer, Issue Unit, execution units (Integer/Branch, Mul, Div, D-Cache), L/S Buffer, and CDB]
Updated Pipeline
[Figure: PC → I-Cache → Reg. File → multi-cycle functional units (EX for Int. ALU / Addr. Calc., FP Add A1-A4, Int. & FP MUL M1-M7, Int. & FP DIV) → MEM stage]
Functional Unit | Latency | Initiation Interval
Integer ALU     | 0       | 1
FP Add          | 3       | 1
FP Mul.         | 6       | 1
FP Div.         | 24      | 25