Presentation On Multithreading/Vector

1) Multithreading and simultaneous multithreading (SMT) are approaches to exploiting threads to increase performance beyond single-thread instruction-level parallelism (ILP). 2) With multithreading, multiple threads share the functional units of a processor via overlapping execution, allowing other threads to execute when one thread stalls. 3) SMT aims to exploit both ILP and thread-level parallelism (TLP) simultaneously by issuing instructions from multiple threads each cycle to better utilize functional units and hide stalls.

Uploaded by

Blu007

CS252
Graduate Computer Architecture
Lecture 12
Multithreading / Vector Processing
March 2nd, 2011

John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252

Performance beyond single thread ILP
• There can be much higher natural parallelism in some applications
– e.g., Database or Scientific codes
– Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: instruction stream with own PC and data
– thread may be a process part of a parallel program of multiple processes, or it may be an independent program
– Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Thread Level Parallelism (TLP):
– Exploit the parallelism inherent between threads to improve performance
• Data Level Parallelism (DLP):
– Perform identical operations on data, and lots of data

3/2/2011 cs252-S11, Lecture 12 2

One approach to exploiting threads: Multithreading (TLP within processor)
• Multithreading: multiple threads share the functional units of 1 processor via overlapping
– processor must duplicate independent state of each thread, e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table
– memory shared through the virtual memory mechanisms, which already support multiple processes
– HW for fast thread switch; much faster than full process switch, which takes 100s to 1000s of clocks
• When switch?
– Alternate instruction per thread (fine grain)
– When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading
• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
– Usually done in a round-robin fashion, skipping any stalled threads
– CPU must be able to switch threads every clock
• Advantage:
– can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
• Disadvantage:
– slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun’s Niagara (recent), several research multiprocessors, Tera
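The round-robin policy described for fine-grained multithreading can be sketched in a few lines of Python. This is only an illustrative model, not how any particular processor implements it: the function name `fine_grained_schedule` and the `stalled` trace format (one set of stalled cycle numbers per thread) are assumptions made for the example.

```python
def fine_grained_schedule(stalled, cycles):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    stalled[t] is a hypothetical stall trace: the set of cycle numbers
    during which thread t cannot issue. Returns a list with the issuing
    thread for each cycle, or None when every thread is stalled.
    """
    n = len(stalled)
    schedule = []
    last = n - 1  # start so that thread 0 is considered first
    for cycle in range(cycles):
        chosen = None
        # Try threads in round-robin order, starting after the last issuer.
        for offset in range(1, n + 1):
            t = (last + offset) % n
            if cycle not in stalled[t]:
                chosen = t
                break
        if chosen is not None:
            last = chosen
        schedule.append(chosen)
    return schedule


# Thread 1 is stalled during cycles 1 and 2; threads 0 and 2 fill the gap.
print(fine_grained_schedule([set(), {1, 2}, set()], 6))
```

In this trace no cycle is idle, which is the advantage listed above, while thread 0 issues only every second or third cycle even though it never stalls, which is the disadvantage.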

Coarse-Grained Multithreading
• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
– Relieves need to have very fast thread-switching
– Doesn’t slow down thread, since instructions from other threads are issued only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
– Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen
– New thread must fill pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill << stall time
• Used in IBM AS/400, Sparcle (for Alewife)

Simultaneous Multithreading (SMT): Do both ILP and TLP
• TLP and ILP exploit two different kinds of parallel structure in a program
• Could a processor oriented at ILP also exploit TLP?
– functional units are often idle in data path designed for ILP because of either stalls or dependences in the code
• Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls?
• Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?
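The "pipeline refill << stall time" condition for coarse-grained multithreading can be made concrete with a small cost model. This is a rough sketch under stated assumptions: another thread is always ready to run, a switch costs exactly `refill_cycles`, and the hardware switches only when a stall is at least `switch_threshold` cycles long. The function names and the example numbers are hypothetical, not taken from the lecture.

```python
def cycles_lost_to_stall(stall_cycles, refill_cycles, switch_threshold):
    """Cycles lost to one stall under a simple coarse-grained policy.

    Assumption: if the stall is long enough to trigger a switch, the only
    loss is the pipeline refill for the new thread; otherwise the processor
    simply waits out the stall.
    """
    if stall_cycles >= switch_threshold:
        return refill_cycles
    return stall_cycles


def fraction_hidden(stall_cycles, refill_cycles, switch_threshold):
    """Fraction of the stall that useful work from another thread covers."""
    lost = cycles_lost_to_stall(stall_cycles, refill_cycles, switch_threshold)
    return 1 - lost / stall_cycles


# Hypothetical numbers: a 200-cycle miss with a 10-cycle refill is mostly hidden,
print(fraction_hidden(200, 10, 20))
# while a 3-cycle stall is below the threshold and is not hidden at all.
print(fraction_hidden(3, 10, 20))
```

Lowering the threshold below the refill cost makes short stalls more expensive than simply waiting, for example `cycles_lost_to_stall(5, 10, 1)` returns 10, which is the throughput loss from shorter stalls noted above.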

Justification: For most apps, most execution units lie idle
[Figure: data for an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.]

Simultaneous Multi-threading ...
[Figure: execution-unit occupancy over cycles 1 to 9 for two cases, “One thread, 8 units” and “Two threads, 8 units”. Columns are the 8 units: M M FX FX FP FP BR CC.]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

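The idea behind the two occupancy diagrams (one thread versus two threads sharing 8 units) can be approximated with a slot-counting sketch. It is deliberately simplified and is not the methodology of the cited paper: it ignores unit types (M, FX, FP, BR, CC) and only counts how many instructions are ready each cycle. The per-cycle ready counts below are made-up illustration data.

```python
def utilization_single(ready, width):
    """Issue-slot utilization when only one thread can issue.

    ready[c] is the number of independent instructions that thread has
    available in cycle c (hypothetical trace); width is the issue width.
    """
    used = sum(min(width, r) for r in ready)
    return used / (width * len(ready))


def utilization_smt(ready_per_thread, width):
    """Utilization when instructions from all threads can share a cycle."""
    cycles = len(ready_per_thread[0])
    used = sum(
        min(width, sum(thread[c] for thread in ready_per_thread))
        for c in range(cycles)
    )
    return used / (width * cycles)


thread_a = [2, 0, 3, 1]  # made-up ready counts per cycle
thread_b = [1, 4, 0, 2]
print(utilization_single(thread_a, 8))           # one thread alone
print(utilization_smt([thread_a, thread_b], 8))  # both threads, same 8 slots
```

Cycles in which one thread has nothing to issue (cycle 1 for `thread_a`, cycle 2 for `thread_b`) are filled by the other thread, so the combined utilization is higher than either thread achieves alone.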

Simultaneous Multithreading Details
• Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading
– Large set of virtual registers that can be used to hold the register sets of independent threads
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in datapath without confusing sources and destinations across threads
– Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW
• Just adding a per thread renaming table and keeping separate PCs
– Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”

Design Challenges in SMT
• Since SMT makes sense only with fine-grained implementation, impact of fine-grained scheduling on single thread performance?
– A preferred thread approach sacrifices neither throughput nor single-thread performance?
– Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput, when preferred thread stalls
• Larger register file needed to hold multiple contexts
• Clock cycle time, especially in:
– Instruction issue - more candidate instructions need to be considered
– Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
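The point that a per-thread renaming table lets instructions from several threads share one physical register pool can be illustrated with a toy model. This is a minimal sketch, not a description of any real core: the class `SharedRenamer` and its methods are hypothetical, and it never returns registers to the free list, which a real design must do at commit.

```python
class SharedRenamer:
    """Toy rename stage: one shared physical pool, one table per thread."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # shared free list
        self.tables = {}  # thread id -> {architectural reg -> physical reg}

    def rename_dest(self, thread, arch_reg):
        """Allocate a fresh physical register for a destination operand."""
        if not self.free:
            raise RuntimeError("no free physical registers")
        phys = self.free.pop(0)
        self.tables.setdefault(thread, {})[arch_reg] = phys
        return phys

    def lookup_src(self, thread, arch_reg):
        """Find the physical register holding a thread's source operand."""
        return self.tables[thread][arch_reg]


renamer = SharedRenamer(num_physical=8)
p_t0 = renamer.rename_dest(0, "r1")  # thread 0 writes r1
p_t1 = renamer.rename_dest(1, "r1")  # thread 1 writes its own r1
print(p_t0, p_t1)  # different physical registers for the same architectural name
```

Because each thread looks up sources only in its own table, the same architectural name `r1` never aliases across threads once instructions are mixed in the datapath. The shared free list also shows why a larger register file is listed among the design challenges: every extra context draws from the same pool.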

Power 4
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.
[Figure: pipeline diagrams labeled Power 4 and Power 5. The Power 5 diagram is annotated: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).]


Power 5 data flow ...
[Figure: Power 5 data flow diagram.]
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Power 5 thread performance ...
[Figure: Power 5 thread performance plot.]
Relative priority of each thread controllable in hardware.
For balanced operation, both threads run slower than if they “owned” the machine.

Changes in Power 5 to support SMT
• Increased associativity of L1 instruction cache and the instruction address translation buffers
• Added per thread load and store queues
• Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
• Added separate instruction prefetch and buffering per thread
• Increased the number of virtual registers from 152 to 240
• Increased the size of several issue queues
• The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT
• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium 4 is dual threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4 each of 26 SPEC benchmarks paired with every other (26² runs): speed-ups from 0.90 to 1.58; average was 1.20
• Power 5, 8 processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl.Pt. apps had most cache conflicts and least gains

Multithreaded Categories
[Figure: issue slots over time (processor cycle) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

Administrivia
• Exam: Wednesday 3/30
Location: 320 Soda
TIME: 2:30-5:30
– This info is on the Lecture page (has been)
– Get one 8 ½ by 11 sheet of notes (both sides)
– Meet at LaVal’s afterwards for Pizza and Beverages
• CS252 First Project proposal due by Friday 3/4
– Need two people/project (although can justify three for right project)
– Complete Research project in 9 weeks
» Typically investigate hypothesis by building an artifact and measuring it against a “base case”
» Generate conference-length paper/give oral presentation
» Often, can lead to an actual publication.

Discussion of SPARCLE paper
• Example of close coupling between processor and memory controller (CMMU)
– All of the features mentioned in this paper implemented by combination of processor and memory controller
– Some functions implemented as special “coprocessor” instructions
– Others implemented as “Tagged” loads/stores/swaps
• Coarse Grained Multithreading
– Using SPARC register windows
– Automatic synchronous trap on cache miss
– Fast handling of all other traps/interrupts (great for message interface!)
– Multithreading half in hardware/half software (hence 14 cycles)
• Fine Grained Synchronization
– Full-Empty bit/32 bit word (effectively 33 bits)
» Groups of 4 words/cache line, so F/E bits put into memory TAG
– Fast TRAP on bad condition
– Multiple instructions. Examples:
» LDT (load/trap if empty)
» LDET (load/set empty/trap if empty)
» STF (Store/set full)
» STFT (store/set full/trap if full)

Discussion of Papers: Sparcle (Con’t)
• Message Interface
– Closely coupled with processor
» Interface at speed of first-level cache
– Atomic message launch:
» Describe message (including DMA ops) with simple stio insts
» Atomic launch instruction (ipilaunch)
– Message Reception
» Possible interrupt on message receive: use fast context switch
» Examine message with simple ldio instructions
» Discard in pieces, possibly with DMA
» Free message (ipicst, i.e. “coherent storeback”)
• We will talk about message interface in greater detail
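The four full/empty instructions listed for Sparcle can be modeled as operations on a word that carries one extra tag bit. The sketch below follows only the parenthetical descriptions on the slide (load/trap if empty, and so on); the class and exception names are hypothetical, and a Python exception stands in for the hardware trap.

```python
class FullEmptyTrap(Exception):
    """Stands in for the fast trap taken on a bad full/empty condition."""


class TaggedWord:
    """A data word plus one full/empty bit (effectively 33 bits)."""

    def __init__(self):
        self.value = 0
        self.full = False  # words start empty in this model (an assumption)

    def ldt(self):
        """LDT: load, trap if empty."""
        if not self.full:
            raise FullEmptyTrap("load from empty word")
        return self.value

    def ldet(self):
        """LDET: load and set empty, trap if empty."""
        if not self.full:
            raise FullEmptyTrap("load from empty word")
        self.full = False
        return self.value

    def stf(self, value):
        """STF: store and set full (never traps)."""
        self.value = value
        self.full = True

    def stft(self, value):
        """STFT: store and set full, trap if already full."""
        if self.full:
            raise FullEmptyTrap("store to full word")
        self.value = value
        self.full = True


# Producer/consumer handoff through one word.
word = TaggedWord()
word.stft(42)       # producer fills the empty word
print(word.ldet())  # consumer takes the value and marks the word empty again
```

Pairing STFT with LDET gives a one-slot producer/consumer channel: a second store before the consumer has read, or a second read before the producer has written, raises the trap instead of silently losing or duplicating data.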

Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Vector Supercomputers
Epitomized by Cray-1, 1976:
Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Cray-1 (1976)
[Figure: Cray-1 block diagram. Recoverable labels:]
• Single Port Memory: 16 banks of 64-bit words + 8-bit SECDED
– 80MW/sec data load/store
– 320MW/sec instruction buffer refill
• Registers: vector registers V0 to V7 (64 elements each), Vector Mask, Vector Length; scalar registers S0 to S7 backed by 64 T Regs; address registers A0 to A7 backed by 64 B Regs
• Functional units: FP Add, FP Mul, FP Recip, Int Add, Int Logic, Int Shift, Pop Cnt, Addr Add, Addr Mul
• 4 Instruction Buffers (64-bit x 16); NIP, CIP, LIP
• memory bank cycle 50 ns, processor cycle 12.5 ns (80MHz)

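The Cray-1 figures quoted above (12.5 ns processor cycle, 50 ns bank cycle, 16 banks, 80 and 320 MW/sec) fit together arithmetically, as the short calculation below shows. Reading 80 MW/sec as one word per cycle and 320 MW/sec as four words per cycle is an inference from the numbers, not a statement from the slide. The bank-conflict check is a simplified model that assumes one access per cycle at a fixed stride.

```python
import math

CYCLE_NS = 12.5       # processor cycle from the slide
BANK_CYCLE_NS = 50.0  # memory bank cycle from the slide
NUM_BANKS = 16

clock_mhz = 1000 / CYCLE_NS                  # 80 MHz
bank_busy_cycles = BANK_CYCLE_NS / CYCLE_NS  # a bank is busy for 4 cycles


def mwords_per_sec(words_per_cycle):
    """Bandwidth in millions of 64-bit words per second."""
    return words_per_cycle * clock_mhz


def conflict_free(stride):
    """True if a strided stream never revisits a bank while it is busy.

    Simplified model: one access per cycle; with word-interleaved banks the
    stream returns to the same bank every NUM_BANKS / gcd(stride, NUM_BANKS)
    accesses.
    """
    revisit_interval = NUM_BANKS // math.gcd(stride, NUM_BANKS)
    return revisit_interval >= bank_busy_cycles


print(mwords_per_sec(1))  # data load/store rate at one word per cycle
print(mwords_per_sec(4))  # instruction buffer refill rate at four words per cycle
print(conflict_free(1), conflict_free(8))
```

With 16 banks and a 4-cycle bank busy time, unit-stride streams revisit a bank only every 16 cycles and run at full rate, while in this model a stride of 8 returns to the same bank after 2 cycles and would stall.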

Vector Programming Model
[Figure: vector programming model. Recoverable labels:]
• Scalar Registers r0 to r15
• Vector Registers v0 to v15, each holding elements [0], [1], [2], ..., [VLRMAX-1]
• Vector Length Register VLR
• Vector Arithmetic Instructions: ADDV v3, v1, v2 (element-wise add of v1 and v2 into v3 over elements [0] to [VLR-1])
• Vector Load and Store Instructions: LV v1, r1, r2 (load vector register v1 from memory with Base in r1 and Stride in r2)

Multithreading and Vector Summary
• Explicitly parallel (Data level parallelism or Thread level parallelism) is next step to performance
• Coarse grain vs. Fine grained multithreading
– Only on big stall vs. every clock cycle
• Simultaneous Multithreading if fine grained multithreading based on OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines
– Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– With virtual address translation and caching
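The two instructions named in the vector programming model, LV with a base and stride and ADDV over the first VLR elements, can be sketched as plain Python functions. This is a behavioral illustration only: memory is a flat list of words, addresses are word indices, and the value of `VLRMAX` is an assumption rather than something stated in the lecture.

```python
VLRMAX = 64  # assumed maximum vector length for this sketch


def lv(memory, base, stride, vlr):
    """LV v, base, stride: load vlr elements starting at base, stepping by stride."""
    assert 0 <= vlr <= VLRMAX
    return [memory[base + i * stride] for i in range(vlr)]


def addv(v1, v2, vlr):
    """ADDV v3, v1, v2: element-wise add over elements [0] .. [vlr-1]."""
    assert 0 <= vlr <= VLRMAX
    return [v1[i] + v2[i] for i in range(vlr)]


memory = list(range(100))  # toy memory where word i holds the value i
vlr = 4                    # contents of the vector length register
v1 = lv(memory, 10, 2, vlr)  # words 10, 12, 14, 16
v2 = lv(memory, 0, 1, vlr)   # words 0, 1, 2, 3
v3 = addv(v1, v2, vlr)
print(v3)
```

A single ADDV stands in for a whole scalar loop, with VLR playing the role of the trip count, which is why the summary describes vectors as needing simpler hardware than an out-of-order machine when the code is vectorizable.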
