Presentation On Multithreading/Vector
CS252
Graduate Computer Architecture
Lecture 12
Multithreading / Vector Processing
March 2nd, 2011
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

• There can be much higher natural parallelism in some applications
– e.g., Database or Scientific codes
– Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: instruction stream with own PC and data
– thread may be a process part of a parallel program of multiple processes, or it may be an independent program
– Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Thread Level Parallelism (TLP):
– Exploit the parallelism inherent between threads to improve performance
• Data Level Parallelism (DLP):
– Perform identical operations on data, and lots of data
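
As a rough illustration of the two definitions above, the sketch below contrasts TLP (two unrelated instruction streams, each with its own code and data) with DLP (one identical operation applied to every element of a data set). The function names and workloads are made up for this example and do not come from the lecture. In CPython the global interpreter lock limits real parallel execution, so this shows the structure of the two styles rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """One independent task with its own code path and data."""
    return len(text.split())


def sum_squares(values):
    """A second, unrelated task."""
    return sum(v * v for v in values)


def run_tlp():
    # Thread-level parallelism: two unrelated instruction streams
    # are submitted as separate threads.
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(count_words, "select name from employees")
        squares = pool.submit(sum_squares, [1, 2, 3])
        return words.result(), squares.result()


def run_dlp(a, x, y):
    # Data-level parallelism: the same operation (a*x + y) on every element.
    # No element depends on another, so hardware could process many at once.
    return [a * xi + yi for xi, yi in zip(x, y)]
```

The TLP half needs no relationship between the two tasks, while the DLP half relies on every element being processed by the same operation.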
[Figure] From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
[Figure legend] M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
3/2/2011 cs252-S11, Lecture 12
Simultaneous Multithreading Details
• Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading
– Large set of virtual registers that can be used to hold the register sets of independent threads
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in datapath without confusing sources and destinations across threads
– Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
• Just adding a per thread renaming table and keeping separate PCs
– Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Design Challenges in SMT
• Since SMT makes sense only with fine-grained implementation, impact of fine-grained scheduling on single thread performance?
– A preferred thread approach sacrifices neither throughput nor single-thread performance?
– Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput, when preferred thread stalls
• Larger register file needed to hold multiple contexts
• Clock cycle time, especially in:
– Instruction issue - more candidate instructions need to be considered
– Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
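
The register renaming point can be made concrete with a toy model: each thread keeps its own rename table, while all threads draw from one shared pool of physical registers. This is a minimal sketch under simplifying assumptions, not a description of any real processor. The class name `SharedRenamer` and its methods are hypothetical, and freeing registers at commit is left out.

```python
class SharedRenamer:
    """Toy model: per-thread rename tables over one shared physical pool."""

    def __init__(self, num_physical, num_threads):
        # Shared free list of physical register numbers.
        self.free = list(range(num_physical))
        # One architectural-to-physical map per thread.
        self.tables = [dict() for _ in range(num_threads)]

    def rename(self, thread, dest, sources):
        """Rename one instruction; returns (physical dest, physical sources)."""
        table = self.tables[thread]
        # Sources are looked up in this thread's table only, so the same
        # architectural name in another thread cannot be confused with it.
        phys_sources = [table[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free.pop(0)
        table[dest] = phys_dest
        return phys_dest, phys_sources
```

For example, if thread 0 and thread 1 both write `r1`, the two writes receive different physical registers, so their instructions can be mixed in one datapath. The `RuntimeError` path corresponds to the design challenge of needing a larger register file to hold multiple contexts.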
Power 4
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

Power 5
2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Power 5 thread performance ...
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.
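
One way to picture a controllable relative priority between two threads is a weighted arbiter that hands out a shared per-cycle resource in proportion to each thread's weight. The sketch below is a toy model only: it is not the Power 5's actual mechanism, and the function name `decode_slots` and its weight parameters are hypothetical.

```python
def decode_slots(weight_a, weight_b, cycles):
    """Assign each cycle to thread 'A' or 'B' in proportion to the weights."""
    schedule = []
    credit_a = credit_b = 0
    for _ in range(cycles):
        # Every cycle each thread earns credit equal to its weight.
        credit_a += weight_a
        credit_b += weight_b
        # The thread with more credit wins the slot and pays for it.
        if credit_a >= credit_b:
            schedule.append("A")
            credit_a -= weight_a + weight_b
        else:
            schedule.append("B")
            credit_b -= weight_a + weight_b
    return schedule
```

With equal weights the two threads alternate, so each gets half of the slots it would have if it ran alone. Raising one thread's weight shifts slots toward it at the other thread's expense.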
TIME: 2:30-5:30
– This info is on the Lecture page (has been)
– Get one 8 ½ by 11 sheet of notes (both sides)
– Meet at LaVal’s afterwards for Pizza and Beverages
• CS252 First Project proposal due by Friday 3/4
– Need two people/project (although can justify three for right project)
– Complete Research project in 9 weeks
» Typically investigate hypothesis by building an artifact and
measuring it against a “base case”
» Generate conference-length paper/give oral presentation
» Often, can lead to an actual publication.
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory
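
The load/store, vector register, and vector instruction bullets can be sketched as a small software model: memory is touched only by vector loads and stores, and arithmetic works only on vector registers. The helper names below are hypothetical. Splitting a long array into register-sized pieces (often called strip mining) is added here as an assumption to make the example complete; it is not taken from the slide. The 64-element register length matches the Cray-1.

```python
MAX_VECTOR_LENGTH = 64  # elements per vector register (the Cray-1 used 64)


def vector_load(memory, base, length):
    """Vector load: copy `length` consecutive words into a vector register."""
    return memory[base:base + length]


def vector_add(v1, v2):
    """Vector instruction: one operation applied to every element pair."""
    return [a + b for a, b in zip(v1, v2)]


def vector_store(memory, base, vreg):
    """Vector store: write a vector register back to consecutive words."""
    memory[base:base + len(vreg)] = vreg


def strip_mined_add(memory, a_base, b_base, c_base, n):
    """C[i] = A[i] + B[i] for n elements, at most one register's worth per pass."""
    done = 0
    while done < n:
        vl = min(MAX_VECTOR_LENGTH, n - done)  # vector length for this pass
        va = vector_load(memory, a_base + done, vl)
        vb = vector_load(memory, b_base + done, vl)
        vector_store(memory, c_base + done, vector_add(va, vb))
        done += vl
    return memory
```

Adding two 100-element arrays takes two passes in this model, one of 64 elements and one of 36, with a single `vector_add` standing in for what would otherwise be a scalar loop.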