10 Multithreading
Jason Mars
Sunday, March 3, 13
Parallel Architectures for Executing Multiple Threads
Multiprocessors
[Figure: single-bus organization; labels: Single bus, Memory, I/O]
Classifying Multiprocessors
• Flynn Taxonomy
• Interconnection Network
• Memory Topology
• Programming Model
Flynn Taxonomy
Interconnection Networks
• Bus
• Network
• pros/cons?
[Figure: single-bus organization; labels: Single bus, Memory, I/O]
Memory Topology
[Figure: two organizations: (1) a single bus with Memory and I/O; (2) nodes, each a cpu with its own memory (M), connected by a Network]
Programming Model
Network
Parallel Programming
i = 47
Processor A Processor B
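The slide text preserves only the shared value i = 47 and the two processor labels. As a minimal sketch of why a shared variable needs synchronization, the toy model below assumes (this is not stated on the slide) that both processors increment i, with each increment split into load, add, and store. The function name `run` and the explicit schedule lists are hypothetical; the interleaving is spelled out by hand so the result is deterministic rather than relying on real threads.

```python
# Toy model of two processors incrementing a shared variable i.
# Assumption: each increment is three micro-steps (load, add, store).
# The schedule lists which processor performs its next micro-step.

def run(schedule, initial=47):
    memory = {"i": initial}
    regs = {"A": None, "B": None}   # one private register per processor
    step = {"A": 0, "B": 0}         # next micro-step for each processor
    for p in schedule:
        if step[p] == 0:            # load i into the private register
            regs[p] = memory["i"]
        elif step[p] == 1:          # increment the private copy
            regs[p] += 1
        elif step[p] == 2:          # store the register back to i
            memory["i"] = regs[p]
        step[p] += 1
    return memory["i"]

# A finishes its increment before B starts.
serial = run(["A", "A", "A", "B", "B", "B"])

# Both processors load 47 before either one stores.
interleaved = run(["A", "B", "A", "B", "A", "B"])

print(serial)       # 49
print(interleaved)  # 48: one increment is lost
```

In the interleaved schedule both processors read 47, so both write 48 and one update disappears. Making the load/add/store sequence atomic (for example with a lock) rules that schedule out.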
But...
Multiprocessor Caches (Shared Memory)
• the solution?
[Figure: three Processors on a Single bus with Memory and I/O, annotated with the instructions "inc i;" and "load i;"]
What Does Coherence Mean?
• Informally:
• Any read must return the most recent write
• Too strict and very difficult to implement
• Better:
• A processor sees its own writes to a location in the correct order.
• Any write must eventually be seen by a read
• All writes are seen in order (“serialization”). Writes to the same location are seen in the same order by all processors.
• Without these guarantees, synchronization doesn’t work
Solutions
Implementing Coherence Protocols
• How do you find the most up-to-date copy of the desired data?
• Snooping protocols
• Directory protocols
[Figure: single-bus organization; labels: Single bus, Memory, I/O]
Write-Update vs Write-Invalidate
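To make the write-update versus write-invalidate distinction concrete, here is a minimal sketch of snooping on a single bus. It is a toy under loud assumptions: caches are write-through, there are no per-line states (nothing like MSI/MESI), and the names `Bus`, `Cache`, and `inc_then_load` are hypothetical. It replays the "inc i; load i;" scenario with i starting at 47, first with no coherence action and then with each policy.

```python
class Bus:
    """Toy single bus shared by all caches and memory."""
    def __init__(self, policy):
        self.policy = policy          # "none", "invalidate", or "update"
        self.memory = {}
        self.caches = []

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}               # address -> cached value
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:    # miss: fetch the value from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value       # write-through (simplification)
        for other in self.bus.caches:       # other caches snoop the write
            if other is self:
                continue
            if self.bus.policy == "invalidate":
                other.lines.pop(addr, None)     # drop the stale copy
            elif self.bus.policy == "update" and addr in other.lines:
                other.lines[addr] = value       # refresh the copy in place

def inc_then_load(policy):
    bus = Bus(policy)
    bus.memory["i"] = 47
    p0, p1 = Cache(bus), Cache(bus)
    p1.load("i")                       # P1 caches i = 47
    p0.store("i", p0.load("i") + 1)    # P0: inc i
    return p1.load("i")                # P1: load i

print(inc_then_load("none"))        # 47 (stale copy)
print(inc_then_load("invalidate"))  # 48 (copy dropped, re-fetched)
print(inc_then_load("update"))      # 48 (copy refreshed by the write)
```

Both policies give P1 the new value; they differ in bus traffic. Invalidate sends a short message once and makes the next read miss, while update pushes the data on every write whether or not the other cache reads it again.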
Simultaneous Multithreading
Dean Tullsen
Hardware Multithreading
[Figure: a Conventional Processor (one instruction stream, one PC, regs, CPU) beside a Multithreaded processor (multiple PCs and register sets sharing one CPU)]
Superscalar (vs Superpipelined)
Superscalar Execution
Issue Slots
Time (proc cycles) Vertical waste
Horizontal waste
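The two kinds of waste can be counted directly from an issue-slot diagram. The sketch below assumes the usual definitions: vertical waste is every slot in a cycle where nothing issues, and horizontal waste is the unused slots in a cycle where at least one instruction issues. The function name `waste` and the 4-wide example grid are made up for illustration and are not data from the slide.

```python
def waste(grid):
    """grid: one row per cycle, one bool per issue slot (True = slot used).

    Returns (vertical_waste, horizontal_waste, total_slots).
    """
    vertical = horizontal = 0
    for row in grid:
        empty = row.count(False)
        if empty == len(row):
            vertical += empty       # the whole cycle is idle
        else:
            horizontal += empty     # the cycle is only partly used
    total = sum(len(row) for row in grid)
    return vertical, horizontal, total

# Hypothetical 4-wide superscalar observed for four cycles.
grid = [
    [True,  True,  False, False],
    [False, False, False, False],
    [True,  False, False, False],
    [True,  True,  True,  True],
]

v, h, total = waste(grid)
print(v, h, total)   # 4 5 16
```

Here 9 of 16 slots are wasted: 4 vertically (cycle 2) and 5 horizontally (cycles 1 and 3).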
Superscalar Execution with Fine-Grain Multithreading
[Figure: Issue Slots vs. Time (proc cycles); legend: Thread 1, Thread 2, Thread 3]
Simultaneous Multithreading
Issue Slots
Time (proc cycles)
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
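A rough way to see what simultaneous multithreading adds over fine-grain multithreading is to count filled issue slots under each policy. The sketch below is a toy model, not the simulator behind these slides: the issue width, the `ready` table, and the function names are all assumptions, instructions that do not issue are not carried over to the next cycle, and the fine-grain policy is plain round-robin with no skipping of stalled threads.

```python
WIDTH = 4  # issue slots per cycle (assumed)

# ready[t][c] = instructions thread t could issue in cycle c (made-up data)
ready = [
    [2, 0, 1, 3],
    [1, 2, 0, 1],
    [0, 1, 2, 2],
]

def fine_grain(ready, width=WIDTH):
    """One thread owns all slots each cycle, chosen round-robin."""
    n_threads, n_cycles = len(ready), len(ready[0])
    return sum(min(width, ready[c % n_threads][c]) for c in range(n_cycles))

def smt(ready, width=WIDTH):
    """Slots in one cycle may be filled from any mix of threads."""
    n_cycles = len(ready[0])
    filled = 0
    for c in range(n_cycles):
        free = width
        for thread in ready:        # fixed priority order, for simplicity
            take = min(free, thread[c])
            filled += take
            free -= take
    return filled

total_slots = WIDTH * len(ready[0])
print(fine_grain(ready), "of", total_slots)  # 9 of 16
print(smt(ready), "of", total_slots)         # 13 of 16
```

In this model fine-grain multithreading can hide idle cycles but still leaves slots empty whenever the chosen thread has fewer ready instructions than the issue width; mixing threads within a cycle fills some of those slots.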
SMT Performance
[Figure: performance vs. Number of Threads (1 to 8), with Conventional Superscalar marked for comparison; surviving y-axis tick labels: 0, 1.75, 5.25]
Multicore Processors (aka Chip Multiprocessors)
• Multiple cores on the same die, which may or may not share an L2 or L3 cache.
• Intel and AMD both have quad-core processors. Sun Niagara T2 is 8 cores x 8 threads (64 contexts!)
The Latest Processors
Nehalem
[Figure: Nehalem pipeline diagram, annotated with the stages Fetch, Decode, Execute, Mem/WB]
CSE 141 Dean Tullsen
Nehalem in a Nutshell
Key Points
• Shared Memory is more intuitive, but creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherency).