
1

EE 457 Unit 9c

Thread Level Parallelism


2

Credits
• Some of the material in this presentation is taken from:
– Computer Architecture: A Quantitative Approach
• John Hennessy & David Patterson
• Some of the material in this presentation is derived from
course notes and slides from
– Prof. Michel Dubois (USC)
– Prof. Murali Annavaram (USC)
– Prof. David Patterson (UC Berkeley)
3

BACKGROUND KNOWLEDGE
4

Power
• Power and energy consumption are a MAJOR concern for processors
• Power consumption can be decomposed into:
– Static (P_STAT): Power constantly being dissipated (grows with # of transistors)
– Dynamic (P_DYN): Power consumed for switching a bit (1 to 0)
• P_DYN = I_DYN*V_DD ≈ ½*C_TOT*V_DD²*f
– Recall, I = C dV/dt
– V_DD is the logic '1' voltage, f = clock frequency
• Dynamic power favors parallel processing vs. higher clock rates (see the sketch below)
– V_DD is tied to f, so a reduction/increase in f leads to a similar change in V_DD
– Implies dynamic power is roughly proportional to f³ (a cubic savings in power if we can reduce f)
– Take a core and replicate it 4x => 4x performance and 4x power
– Take a core and increase its clock rate 4x => 4x performance and 64x power
• Static power
– Leakage occurs no matter what the frequency is
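As a rough illustration of the argument above, here is a minimal C sketch of the two scaling options, assuming (as the slide does) that V_DD must scale roughly with f; the capacitance, voltage, and frequency values are illustrative placeholders, not measurements.

```c
/* Sketch of the dynamic power argument: P_DYN ~ (1/2)*C_TOT*VDD^2*f,
 * with the simplifying assumption that VDD scales roughly linearly
 * with f, so dynamic power grows roughly as f^3.
 * Constants are illustrative, not measured values. */
#include <stdio.h>

static double dyn_power(double c_tot, double vdd, double f) {
    return 0.5 * c_tot * vdd * vdd * f;   /* P_DYN = 1/2 * C_TOT * VDD^2 * f */
}

int main(void) {
    double c = 1e-9;          /* total switched capacitance (F), illustrative */
    double v = 1.0, f = 2e9;  /* baseline: 1.0 V at 2 GHz                     */
    double base = dyn_power(c, v, f);

    /* Option A: replicate the core 4x at the same V and f  -> ~4x power */
    double four_cores = 4.0 * dyn_power(c, v, f);

    /* Option B: run one core 4x faster; assume VDD must also rise ~4x -> ~64x power */
    double fast_core  = dyn_power(c, 4.0 * v, 4.0 * f);

    printf("baseline: %.2f W, 4 cores: %.1fx, 4x clock: %.1fx\n",
           base, four_cores / base, fast_core / base);
    return 0;
}
```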
5

Temperature
• Temperature is related to power consumption
– Locations on the chip that burn more power will usually run hotter
• Locations where bits toggle (register file, etc.) often will become quite hot especially if
toggling continues for a long period of time
– Too much heat can destroy a chip
– Can use sensors to dynamically sense temperature
• Techniques for controlling temperature
– External measures: Remove and spread the heat
• Heat sinks, fans, even liquid cooled machines
– Architectural measures
• Throttle performance (run at slower frequencies / lower voltages)
• Global clock gating (pause..turn off the clock)
• None…results can be catastrophic
6

Modeling Interconnect Delay


• In modern circuits wire delay (transmitting the signal) begins
to dominate logic delay (time for gate to switch)
• As wires get longer
– Resistance goes up and Capacitance goes up causing longer time
delays (time is proportional to R*C)

A real wire can be modeled as: an ideal wire, a lumped RC model (overestimates delay), or a distributed RC model (better estimate).
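A small sketch of why the lumped model overestimates delay: splitting the wire into N equal RC segments and summing the Elmore delay of the resulting ladder approaches R*C/2, versus R*C for a single lumped RC. The resistance and capacitance values below are illustrative assumptions, not from a real process.

```c
/* Compare a single lumped RC against the Elmore delay of an
 * N-segment distributed RC ladder for the same total R and C. */
#include <stdio.h>

int main(void) {
    double r_total = 1000.0;   /* total wire resistance (ohms), illustrative  */
    double c_total = 1e-12;    /* total wire capacitance (F), illustrative    */
    int n = 100;               /* number of segments in the distributed model */

    double lumped = r_total * c_total;   /* single lumped RC */

    double elmore = 0.0;
    for (int i = 1; i <= n; i++) {
        /* resistance up to node i times the capacitance hanging at node i */
        elmore += (i * r_total / n) * (c_total / n);
    }

    printf("lumped ~ %.3e s, distributed (Elmore, N=%d) ~ %.3e s\n",
           lumped, n, elmore);   /* distributed is roughly half the lumped value */
    return 0;
}
```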
7

Dealing With Interconnect


• Interconnect delay rivals switching delay
• Important design considerations
– Long wire traces slow a signal down, thus global signals on a
chip require special attention
– Clock, reset, and other signals must be routed carefully and a
whole tree of buffers inserted to decrease the delay
8

A Case for Thread-Level Parallelism

CHIP MULTITHREADING AND MULTIPROCESSORS
9

Consideration
• Consider our out-of-order, pipelined processor
from Tomasulo part 2.
10

Question 1
• Do we have high frequencies?
11

Answer 1
• Do we have high frequencies?
– Yes, deep pipelines create shorter clock
cycles and higher frequencies

–Power ↑ ↑ ↑
–Temp. ↑
12

Question 2
• What effect does our OoO processor have on wire length?
13

Answer 2
• What effect does our OoO processor have on wire length?
– Wire length ↑ ↑
with CDB, ROB,
Issue logic, etc.
– Time ↑
14

Question 3
• What is the impact of short clock cycles
(high freq.) on cache miss penalties?
15

Answer 3
• What is the impact of short clock cycles
(high freq.) on cache miss penalties?
– Cache miss
penalties ↑ ↑
relative to
processor cycles
16

Memory Wall Problem


• Processor performance is increasing much faster than memory
performance

[Figure: processor performance improving ~55%/year vs. memory performance ~7%/year, opening a growing Processor-Memory Performance Gap. Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003)]

There is a limit to ILP! If a cache miss requires several hundred clock cycles, even OoO pipelines with 10's or 100's of in-flight instructions may stall.
17

Cache Hierarchy
• A hierarchy of caches can help mitigate the cache miss penalty
• L1 Cache
– 64 KB
– 2 cycle access time
– Common Miss Rate ~ 5%
• L2 Cache
– 1 MB
– 10-20 cycle access time
– Common Miss Rate ~ 1%
• Main Memory
– ~300 cycle access time

[Figure: hierarchy of processor P, L1 Cache, L2 Cache, L3 Cache, and main Memory]
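A quick back-of-the-envelope average memory access time (AMAT) using the rough numbers above. Treating the quoted miss rates as local miss rates at each level is an assumption, since the slide does not say.

```c
/* AMAT sketch with the slide's approximate cache parameters. */
#include <stdio.h>

int main(void) {
    double l1_hit = 2.0,  l1_miss = 0.05;   /* 64 KB L1: 2 cycles, ~5% miss    */
    double l2_hit = 15.0, l2_miss = 0.01;   /* 1 MB L2: 10-20 cycles, ~1% miss */
    double mem    = 300.0;                  /* main memory: ~300 cycles        */

    /* AMAT = L1 hit time + L1 miss rate * (L2 time + L2 miss rate * memory time) */
    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);

    printf("AMAT ~ %.2f cycles\n", amat);   /* ~2.9 cycles with these numbers */
    return 0;
}
```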
18

The Growing Memory Problem


• In an In-Order pipeline, a cache miss causes computation to
stall (i.e. a memory induced stall)
– Suppose we could improve our processor to achieve a
2x speedup in compute time
– This would only yield a minimal overall speedup due to memory
latency dominating compute
• Out-of-order may do better, but the problem remains
[Figure: single-thread execution alternating compute (C) and memory latency (M) phases; even with a 2x speedup in compute, actual program speedup is minimal because memory latency dominates. Adapted from: OpenSparc T1 Micro-architecture Specification]
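To make the point concrete, here is a tiny Amdahl-style calculation. The 25%/75% compute/memory split is an illustrative assumption, not a number taken from the figure.

```c
/* If memory stall time dominates, doubling compute speed barely helps. */
#include <stdio.h>

int main(void) {
    double compute = 0.25, memory = 0.75;   /* fractions of single-thread run time (assumed) */

    /* Amdahl-style: only the compute portion gets the 2x speedup */
    double speedup = 1.0 / (memory + compute / 2.0);

    printf("2x faster compute => only %.2fx overall speedup\n", speedup);  /* ~1.14x */
    return 0;
}
```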
19

Cache Penalty Example


• Assume 50% of instructions are LW/SW, an L1-D hit
rate of 90%, and miss penalty of 20 clock cycles
(assuming these misses hit in L2). What is the CPI for
our typical 5 stage pipeline?
– 50% * 10% = 5 out of every 100 instructions miss and cause stalls
– The other 95 instructions take 95 cycles to execute
– The 5 missing instructions take 105 = 5*(1+20) cycles to execute
– Total: 200 cycles for 100 instructions =>
CPI of 2

Effective CPI = Ideal CPI + Misses per Instruction * Miss Penalty Cycles
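The same calculation in code, writing the misses per instruction explicitly as the memory-reference fraction times the L1-D miss rate:

```c
/* Effective CPI = Ideal CPI + (mem refs per instr) * (miss rate) * (miss penalty) */
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.0;    /* base 5-stage pipeline, no memory stalls */
    double mem_fraction = 0.50;   /* 50% of instructions are LW/SW           */
    double miss_rate    = 0.10;   /* L1-D hit rate of 90%                    */
    double miss_penalty = 20.0;   /* cycles, assuming misses hit in L2       */

    double cpi = ideal_cpi + mem_fraction * miss_rate * miss_penalty;

    printf("Effective CPI = %.1f\n", cpi);   /* 1 + 0.5*0.1*20 = 2.0 */
    return 0;
}
```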


20

Question 4
• But in OoO processors, can't we just
deepen our ROB, Issue queues, Store
Address Buffer, etc to hide cache misses?
21

Answer 4
• But in OoO processors, can't we just
deepen our ROB, Issue queues, Store
Address Buffer, etc to hide cache misses?
– Associative lookup
structures are expensive
and slow down
dramatically as they
deepen
– DOES NOT SCALE WELL
22

Motivating HW Multithread/Multicore
• Issues that prevent us from exploiting ILP in more
advanced single-core processors with deeper
pipelines and OoO Execution
– Slow memory hierarchy
– Increased power with higher clock rates
– Increased wire delay & size with more advanced structures
(ROBs, Issue queues, etc.) for potentially diminishing
returns
• All of these issues point us to find "easier" sources of
parallelism such as: TLP (Thread-Level Parallelism)
23

OVERVIEW OF TLP
24

What is a Thread?
• Thread (def.): Single execution sequence (instruction
stream) representing a separately schedulable task
– Schedulable task: Can be transparently paused and resumed
by the OS scheduler
• Consider the processor:
– For what resources would each thread need their own copy to execute in parallel?

[Figure: two threads (Thread 1 at PC 0x04a800, Thread 2 at 0x04001c), each a separate instruction stream of lw/sw/add/sub/or instructions, running on one CPU with registers $1, $3, ..., $31 and pc, and sharing one memory (0x0 to 0xffffffff) containing separate T1 and T2 stacks]


25

Separate or Shared?
• Consider the processor, for what resources
would each thread need their own copy?
For each resource, is it 1 Per Thread or Shared among all?
– Program Counter
– ALUs
– Register File
– Page Table Base Register
– Cache Memory
26

Shared vs. Private


27

Software Multithreading
• Used since the 1960's on uniprocessors to hide I/O latency
– Multiple processes with different virtual address spaces and process control blocks
– On an I/O operation, state is saved and another process is given to the CPU
– When the I/O operation completes, the process is rescheduled
• On a context switch…
– Trap processor and flush pipeline
– Save state in process control block (PC, register file, interrupt vector, page table base register)
– Restore state of another process
– Start execution and fill pipeline
• Context switch is also triggered by a timer for fairness
• Very high overhead! (1-10 µs)

[Figure: CPU (Regs, PC) with the OS scheduler tracking T1 = Ready, T2 = Blocked, T3 = Ready; each process keeps its saved state (Regs, PC, metadata) in its process control block]
28

Multicore vs. Multithreaded


• Multicore/Multiprocessor: Single chip containing
multiple processor cores that possess all the logic
resources necessary to execute one or more threads at
a time
– Require software/OS to context switch from one thread to
another and do not share hardware resources between
threads.
• Hardware Multithreading: A processor core that has
hardware support for executing multiple threads and
context switching between them without software
intervention
29

Typical Multicore (CMP) Organization


• Can simply replicate the entire processor core to create a chip multi-processor (CMP)
• Private L1's and L2's require maintaining coherency via snooping; sharing L1 is not a good idea
• L3 is shared (1 copy of data) and thus does not require a coherency mechanism
• A shared bus would be a bottleneck, so use a switched on-chip network (multiple simultaneous connections)

[Figure: chip multi-processor with four cores (P), each with private L1 and L2, connected through an on-chip interconnect network to banked, shared L3 and main memory]
30

Case for Multithreading


• Consider long latency events:
– Cache Miss, Exceptions, Lock (Synchronization), Long instructions such as
MUL/DIV
– Such events cause In-order and even OoO pipelines to be underutilized
• Goal/Idea: Swap to the next thread immediately (on next cycle)
when the current thread hits a long-latency event (i.e. cache miss)
– By executing multiple threads, processors can be kept busy with useful work
[Figure: four threads interleaved on one core; while one thread waits on memory latency (M), the others run their compute (C) phases, keeping the core busy. Adapted from: OpenSparc T1 Micro-architecture Specification]
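A simple model of the figure: if each thread computes for C cycles and then stalls for M cycles, the core stays busy when the other threads' compute covers the stall, i.e. roughly (N-1)*C >= M. The C and M values below are illustrative assumptions.

```c
/* How many threads are needed to hide memory latency in the C/M model? */
#include <stdio.h>

int main(void) {
    double c = 25.0;    /* compute cycles between misses (illustrative) */
    double m = 100.0;   /* memory latency in cycles (illustrative)      */

    double n = 1.0 + m / c;        /* threads needed to hide the latency */
    double util_1 = c / (c + m);   /* single-thread core utilization     */

    printf("1 thread: %.0f%% busy; need ~%.0f threads to stay ~100%% busy\n",
           util_1 * 100.0, n);     /* ~20% busy; ~5 threads */
    return 0;
}
```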
31

IMPLEMENTING HARDWARE MULTITHREADING
32

MT Needs Non-Blocking Caches


• Non-blocking cache: Does not block/pause on a miss but is able
to service hits while fetching one or more miss requests
– Needed to support multithreading
– Example: Pentium Pro has a non-blocking cache capable of handling 4
outstanding misses
[Figure: timelines of several threads on a non-blocking cache; while one thread's miss is being fetched, the cache continues to service hits (and further misses) for the other threads' compute (C) and memory (M) phases]
33

Hardware Multithreading
• Run multiple threads on the same core with
hardware support for fast context switch
– Multiple register files
– Multiple state registers (PCs, page table base
registers, interrupt vectors, etc.)
– Avoids saving context manually (via software)
34

Sun T1 "Niagara" Case Study

Ex. of Fine-grained
Multithreading

http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
35

Sparc T1 Niagara
• 8 cores each executing 4 threads called a thread group
– Zero cycle thread switching penalty (round-robin)
– 6 stage pipeline
• Each core has its own L1 cache
• Each thread has its own
– Register file, instruction and store buffers
• Threads share…
– L1 cache, TLB, and execution units
• 3 MB shared L2 Cache, 4-banks, 12-way set-associative
– Is it a problem that it's not a power of 2? No!

Pipeline stages: Fetch, Thread Select, Decode, Execute, Memory, WB
36

Sun T1 "Niagara" Pipeline

[Figure: Sun T1 pipeline block diagram]
http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf
37

T1 Pipeline
• Thread select stage [Stage 2]
– Choose instructions to issue from ready threads
– Issues based on
• Instruction type
• Misses
• Resource conflicts
• Traps and interrupts
• Fetch stage [Stage 1]
– Thread select mux chooses which thread's instruction to
issue and uses that thread's PC to fetch more instructions
– Access I-TLB and I-Cache
– 2 instructions fetched per cycle
38

T1 Pipeline
• Decode stage [Stage 3]
– Accesses register file
• Execute Stage [Stage 4]
– Includes ALU, shifter, MUL and DIV units
– Forwarding Unit
• Memory stage [Stage 5]
– DTLB, Data Cache, and 4 store buffers (1 per thread)
• WB [Stage 6]
– Write to register file
39

Pipeline Scheduling
• No pipeline flush on context switch (except
potentially of instructions from faulting thread)
• Full forwarding/bypassing to consuming, junior
instructions of same thread
• In the case of a load, wait 2 cycles before an instruction from the same thread is issued
– Solves the forwarding latency issue
• Scheduler guarantees fairness between threads by
prioritizing the least recently scheduled thread
40

A View Without HW Multithreading


[Figure: issue slots over time for a single-threaded superscalar with software MT; only instructions from a single thread fill the slots, separated by expensive cache miss penalties and expensive software context switches]
41

Types/Levels of Multithreading
• How should we overlap and share the HW between instructions from different threads?
– Coarse-grained Multithreading: Execute one thread with all HW resources until a cache miss or misprediction incurs a stall or pipeline flush, then switch to another thread
– Fine-grained Multithreading: Alternate fetching
instructions from a different thread each clock
– Simultaneous Multithreading: Fetch and execute
instructions from different threads at the same time
42

Levels of TLP
[Figure: issue slots over time for four approaches]
• Superscalar: only instructions from a single thread; expensive cache miss penalty
• Coarse-grained MT: switch threads when one hits a long-latency event such as a stall due to a cache miss, pipeline flush, etc.
• Fine-Grained MT: alternate threads every cycle (Sun UltraSparc T2)
• Simultaneous Multithreading (SMT): mix instructions from different threads during the same issue cycle (Intel HyperThreading, IBM Power 5)
43

Fine Grained Multithreading


• Like Sun Niagara
• Alternates issuing instructions from different threads
each cycle provided a thread has instructions ready
to execute (i.e. not stalled)
• With enough threads, long latency events may be
completely hidden
– Some processors like Cray may have 128 or more threads
• Degrades single thread performance since it only
gets 1 out of every N cycles if all N threads are ready
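A minimal sketch of a fine-grained, least-recently-issued thread-select policy in the spirit of Niagara's round-robin selection. This is an illustrative model, not the actual T1 selection logic.

```c
/* Each cycle, pick the least recently issued thread that is ready
 * (not waiting on a miss); ready threads share the core round-robin. */
#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 4

int main(void) {
    bool ready[NTHREADS]       = { true, true, false, true };  /* thread 2 stalled on a miss   */
    int  last_issued[NTHREADS] = { 0 };                        /* cycle each thread last issued */

    for (int cycle = 1; cycle <= 8; cycle++) {
        int pick = -1;
        for (int t = 0; t < NTHREADS; t++) {   /* least recently issued, ready thread */
            if (ready[t] && (pick < 0 || last_issued[t] < last_issued[pick]))
                pick = t;
        }
        if (pick >= 0) {
            last_issued[pick] = cycle;
            printf("cycle %d: issue from thread %d\n", cycle, pick);
        } else {
            printf("cycle %d: all threads stalled\n", cycle);
        }
    }
    return 0;
}
```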
44

Coarse Grained Multithreading


• Swaps threads on long-latency event
• Hardware does not have to swap threads in a single
cycle (as in fine-grained multithreading) but can take
a few cycles since the current thread has hit a long
latency event
• Requires flushing pipeline of current thread's
instructions and filling pipeline with new thread's
• Better single-thread performance
45

ILP and TLP


• TLP can also help ILP by providing another
source of independent instructions
• In a 3- or 4-way issue processor, better
utilization can be achieved when instructions
from 2 or more threads are executed
simultaneously
46

Simultaneous Multithreading
• Uses multiple-issue, dynamic scheduling mechanisms
to execute instructions from multiple threads at the
same time by filling issue slots with as many available
instructions from either thread
– Overcome poor utilization due to cache misses or lack of
independent instructions
– Requires HW to tag instructions based on their thread
• Requires greater level of hardware resources
(separate register renamer, branch prediction, store
buffers, and multiple register files, etc.)
47

2-Way SMT Updated Block Diagram


[Figure: updated OoO processor block diagram for 2-way hardware SMT. Per-thread (duplicated) structures: Instruction Queue 1/2, Register Rename 1/2, Register File 1/2, ROB1/ROB2, branch prediction buffers BPB1/BPB2, store buffers SB1/SB2, store address buffers SAB1/SAB2. Shared: I-Cache, D-Cache, Dispatch, Issue Unit, the Int./Mult./Div./Load-Store queues, execution units, and the CDB]

Dispatch can tag instructions with thread ID to separate instructions in the backend.
48

Example
• Intel HyperThreading Technology (HTT) is
essentially SMT
• Recent processors including Core i7 are multi-
core, multi-threaded, multi-issue, OoO
(dynamically scheduled) superscalar
processors
49

Future of Multicore/Multithreaded
• Multiple cores in shared memory configuration
• Per-core L1 or even L2
• Large on-chip shared cache
• Multiple threads on each core to fight memory wall
• Ever increasing on-chip threads
– To continue to meet Moore's Law
– CMP's with 1000's of threads envisioned
– Only sane option from technology perspective (i.e. out of
necessity)
– The big road block is parallel programming
50

Parallel Programming
• Implicit parallelism via…
– Parallelizing compilers
– Programming frameworks (e.g. MapReduce)
• Explicit parallelism
– OpenMP
– Task Libraries
• Intel Thread Building Blocks, Java Task Library
– Native threading (Windows threads, POSIX threads)
– MPI
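As a small taste of the explicit-parallelism style listed above, here is a minimal OpenMP sketch (compile with, e.g., gcc -fopenmp); the loop and its bounds are just an illustrative workload, not part of the slides.

```c
/* Spread a loop across threads and combine per-thread partial sums. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Each thread sums a chunk of iterations; the reduction clause
     * combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += 1.0 / (i + 1.0);
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```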
51

BACKUP
52

Organization for OoO Execution


[Figure: block diagram of the OoO organization (adapted from Prof. Michel Dubois, simplified for EE 457). I-Cache and the Instruction Queue feed Dispatch, which uses the TAG FIFO, Register Status Table, and Register File; instructions wait in the Int./Mult./Div./Load-Store queues until the Issue Unit sends them to the Integer/Branch, Mul, Div, and D-Cache units, with results broadcast on the CDB]
53

2-Way SMT Updated Block Diagram


[Figure: updated OoO processor block diagram for 2-way hardware SMT (same diagram as in the main slides). Per-thread structures: Instruction Queue 1/2, Register Rename 1/2, Register File 1/2, ROB1/ROB2, BPB1/BPB2, SB1/SB2, SAB1/SAB2. Shared: I-Cache, D-Cache, Dispatch, Issue Unit, the Int./Mult./Div./Load-Store queues, execution units, and the CDB]

Dispatch can tag instructions with thread ID to separate instructions in the backend.
54

Multiple Functional Units


• We now provide multiple functional units
• After decode, issue to a queue, stalling if the unit is busy or
waiting for data dependency to resolve

[Figure: pipeline with multiple functional units; after IM and Reg read, instructions are issued into queues feeding the ALU, MUL, DIV, and DM (cache) units, then write back to Reg]
55

Functional Unit Latencies


[Figure: execution pipelines. Int. ALU / Addr. Calc.: single EX stage; FP Add: A1-A4; Int. & FP MUL: M1-M7; Int. & FP DIV]

An added complication of out-of-order execution & completion: WAW & WAR hazards.

Look ahead: the Tomasulo Algorithm will help absorb the latency of different functional units and cache miss latency by allowing other ready instructions to proceed out of order.

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
56

OoO Execution w/ ROB


• ROB allows for OoO execution but in-order completion

[Figure: OoO block diagram extended with a ROB (Reorder Buffer) next to the Register File, plus a Branch Prediction Buffer and Address Buffer; I-Cache and the Instruction Queue feed Dispatch, the per-unit queues feed the Issue Unit and the Integer/Branch, Mul, Div, and D-Cache (L/S) units, and results broadcast on the CDB. Exceptions? No problem, since completion is in order]
57
58
59

Updated Pipeline
[Figure: execution pipelines. Int. ALU / Addr. Calc.: single EX stage; FP Add: A1-A4; Int. & FP MUL: M1-M7; Int. & FP DIV]

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
60

Updated Pipeline
[Figure: full pipeline. PC, I-Cache, and Reg. File feed the parallel execution pipelines (EX; A1-A4; M1-M7; DIV), followed by the MEM stage and write-back to the Reg. File]

Functional Unit | Latency (required stall cycles between dependent [RAW] instrucs.) | Initiation Interval (distance between 2 independent instructions requiring the same FU)
Integer ALU | 0 | 1
FP Add | 3 | 1
FP Mul. | 6 | 1
FP Div. | 24 | 25
