
Simultaneous Multithreading

Multiple Issue Machines


• To decrease the CPI below one, we use multiple-issue machines; they come in two flavors
• Superscalar: multiple parallel, dedicated pipelines:
  – Issue a varying number of instructions per cycle
    • either statically scheduled by the compiler (in-order) and/or dynamically by hardware (out-of-order), e.g. Tomasulo's algorithm (the number of instructions issued per cycle depends on the dependences found at run time)
    • IBM PowerPC, Sun UltraSPARC, DEC Alpha, IA-32 Pentium

• Explicit parallelism → VLIW (Very Long Instruction Word):
  – also classified as EPIC (Explicitly Parallel Instruction Computing)
– several operations encoded as one long instruction
– instructions have wide template (4-16 operations)
• e.g. IA-64 Itanium
• Explicit parallelism → word-level / SIMD / vector processors:
– multimedia instruction sets (Intel’s MMX and SSE, Sun’s VIS, etc.)
Choices for Multiple Issue machines

• Superscalar (static): dynamic issue structure, hardware hazard detection, static scheduling. Examples: Sun UltraSPARC II/III
• Superscalar (dynamic): dynamic issue structure, hardware hazard detection, dynamic scheduling. Examples: IBM POWER2
• Superscalar (speculative): dynamic issue structure, hardware hazard detection, dynamic scheduling with speculation. Examples: Pentium III/4, MIPS R10K, Alpha 21264, IBM POWER4, HP PA8500
• VLIW: static issue structure, software hazard detection, static scheduling. Examples: Trimedia, i860
• EPIC: "mostly" static issue structure, mostly software hazard detection, mostly static scheduling. Examples: Itanium (IA-64)
Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Basic approach: each clock cycle, fetch (when possible) two instructions belonging to different classes (one integer instruction, one FP instruction)
• Superscalar MIPS: 2 instructions/cycle, 1 FP & 1 anything else
  – Fetch 64 bits/clock cycle; the integer instruction on the left, the FP instruction on the right
  – Issue up to 2 instructions in one clock cycle (see the sketch below)
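A minimal Python sketch of this pairing rule; the opcode names and the pair representation are illustrative assumptions, not from the slides:

```python
# Dual-issue rule of a superscalar-MIPS-style machine: fetch an aligned
# pair each cycle; issue both only if one is an FP operation and the
# other is anything else (integer/load/store/branch). Opcodes illustrative.
FP_OPS = {"add.d", "sub.d", "mul.d", "div.d"}

def issue_count(pair):
    """Return how many of the two fetched instructions can issue this cycle."""
    first, second = pair                      # e.g. ("lw", ...), ("add.d", ...)
    if (first in FP_OPS) != (second in FP_OPS):
        return 2                              # one FP + one other: dual issue
    return 1                                  # same class: issue only the first

# An integer load paired with an FP add dual-issues; two integer ops do not:
print(issue_count(("lw", "add.d")))   # -> 2
print(issue_count(("lw", "addu")))    # -> 1
```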
Multiple Issue Challenges
• While the integer/FP split is simple for the hardware, we get a CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• The more instructions issued at the same time, the greater the difficulty in decode and issue
  – Even a 2-way superscalar ⇒ examine 2 opcodes and 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW tradeoff: larger instruction space but simpler decoding (no hardware scheduling)
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in the long instruction word are independent ⇒ they execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch ⇒ 16 to 24 bits per field ⇒ 7×16 = 112 to 7×24 = 168 bits wide (see the sketch below)
  – Needs a compiling technique that schedules across several branches
  – Static issue: relies completely on the compiler to schedule (superscalar machines rely on hardware dynamic issue)
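A minimal sketch of packing such a 7-slot long instruction word, assuming 24-bit fields; the slot names and encodings are made up for illustration, not from a real ISA:

```python
# Pack one VLIW instruction word: 7 independent operations, one per slot.
# Slot order and the 24-bit field width are illustrative (7 * 24 = 168 bits).
SLOTS = ["int0", "int1", "fp0", "fp1", "mem0", "mem1", "branch"]
FIELD_BITS = 24
NOP = 0  # empty slots are padded with no-ops

def pack_bundle(ops):
    """ops: dict slot-name -> 24-bit opcode field; returns one 168-bit word."""
    word = 0
    for slot in SLOTS:
        field = ops.get(slot, NOP)
        assert 0 <= field < (1 << FIELD_BITS)
        word = (word << FIELD_BITS) | field   # append fields left-to-right
    return word

# The compiler guarantees the 7 ops are independent, so the hardware just
# routes each field to its dedicated functional unit: no issue logic needed.
bundle = pack_bundle({"int0": 0x000123, "fp0": 0x0456AB, "mem0": 0x0789CD})
print(hex(bundle), bundle.bit_length() <= 7 * FIELD_BITS)
```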
Limitations to Multi-Issue Machines
• As usual, we have to deal with the three hazards:
– Structural Hazards
– Data Hazards
– Control Hazards
• Multiple issue gives:
  – A higher probability of hazards
  – A larger performance hit from each hazard
Summary of techniques to increase ILP

Techniques for reducing stalls:
• Hardware techniques:
  1. Dynamic instruction scheduling (scoreboarding, Tomasulo's algorithm) → reduces structural & data hazards
  2. Dynamic branch prediction → reduces control hazards
  3. Branch Target Buffers + branch target cache + return address predictor → reduces control hazards
  4. Superscalar / multiple issue (superscalar, VLIW) → attacks the ideal CPI
  5. Hardware-based speculation → reduces data and control hazard stalls
• Software techniques by the compiler (e.g. static scheduling for VLIW)
Summary of techniques to increase ILP

Technique → what it reduces:
• Forwarding and bypassing → potential data hazard stalls
• Delayed branches and simple branch scheduling → control hazard stalls
• Basic dynamic scheduling (scoreboarding) → data hazard stalls from true dependences
• Dynamic scheduling with renaming → data hazard stalls and stalls from antidependences and output dependences
• Dynamic branch prediction → control stalls
• Issuing multiple instructions per cycle → ideal CPI
• Speculation → data hazard and control hazard stalls
• Dynamic memory disambiguation → data hazard stalls with memory
• Loop unrolling → control hazard stalls
• Basic compiler pipeline scheduling → data hazard stalls
• Compiler dependence analysis → ideal CPI, data hazard stalls
• Software pipelining, trace scheduling → ideal CPI, data hazard stalls
• Compiler speculation → ideal CPI, data and control stalls
Studies of the Limitations of ILP (figures):
• ILP available in a perfect processor
• Effect of window size on ILP
• Effect of branch prediction schemes (e.g. always predict taken or not taken; taken/not-taken decided by the compiler)
• Conditional branch prediction accuracy
• Effect of the number of registers available for renaming
Contemporary forms of parallelism
• Instruction-level parallelism (ILP)
– Wide-issue SuperScalar processors (SS)
• 4 or more instructions per cycle
• Executing a single program or thread
• Attempts to find multiple instructions to issue each cycle.

• Thread-level parallelism (TLP)


– Fine-grained multithreaded superscalars (FGMS)
• Contain hardware state for several threads
• Executing multiple threads
• On any given cycle a processor executes instructions from one of
the threads
– Multiprocessor (MP)
• Performance improved by adding more CPUs
Superscalar Bottlenecks
• No dominant source of wasted issue bandwidth; therefore, no single dominant solution
• No single latency-tolerating technique will produce a
dramatic increase in the performance of these programs
if it only attacks specific types of latencies
• Examples:
– decrease the TLB miss rates (e.g., increase the TLB sizes)
– larger, more associative, or faster instruction/data cache
hierarchy
– improved branch prediction scheme; lower branch misprediction
penalty
– speculative execution; more aggressive if-conversion
Simple Sequential Processor
Pipelined Processor
Superscalar Processor
OOO Superscalar Processor
Speculative Execution
Limits of ILP
Horizontal and Vertical Waste
Multithreading
• Key idea
– Issue multiple instructions from multiple threads each cycle
• Somewhere in between implicit and explicit parallelism
• Features
– Fully exploits thread-level parallelism and instruction-level
parallelism
– Potentially better performance for
• A suitable mix of independent programs
• Programs that are parallelizable
– These scenarios are likely to happen for both servers (e.g. a
Web server serving several independent requests) and desktop
computers.
• Possibly still transparent to the programmer/compiler
Multithreading
• Enlarge the “width” of the instruction window rather than its depth → fetch and issue from multiple threads each cycle
Hardware Multithreading
• The processor must maintain several independent states, one per thread:
  – Program Counter
  – Register File (and Flags)
  – (Memory)
• We need either multiple resources in the processor or a fast way to switch across states (see the sketch below)
  – E.g. several physical Register Files OR a fast copy of a Register File image to/from memory
• Early hardware solutions include the so-called barrel processors (even the CDC 6000 series!) and more recent architectures such as HEP, TERA, MASA, Alewife
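A minimal sketch of the multiple-resources option: one copy of the architected state per hardware thread. Field names and thread count are illustrative assumptions:

```python
# Per-thread hardware context: the state that must exist once per thread.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                            # per-thread Program Counter
    regs: list = field(default_factory=lambda: [0] * 32)   # architected registers
    flags: int = 0                                         # condition/status flags

# With one ThreadContext per hardware thread, a "context switch" is just
# selecting which context feeds the pipeline; no state copying is needed.
contexts = [ThreadContext() for _ in range(4)]   # e.g. 4 hardware threads
current = 0                                      # id of the running thread

def switch_to(tid):
    """Zero-copy switch: later fetch/decode reads contexts[tid]."""
    global current
    current = tid
```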
Hardware Multithreading
• Fine-grained (cycle-by-cycle) multithreading:
  – Round-robin selection between a set of threads
• Coarse-grained (block) multithreading: keep executing a thread until something happens (both policies are sketched below), e.g.:
  – A long-latency instruction is found
  – Some indication of scheduling difficulties
  – The maximum number of cycles per thread has been executed
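A minimal sketch of the two thread-selection policies; the switch events and cycle threshold are illustrative assumptions:

```python
# Fine-grained vs. coarse-grained thread selection, per cycle.

def fine_grained_next(current, n_threads):
    """Cycle-by-cycle: round-robin to the next thread every cycle."""
    return (current + 1) % n_threads

def coarse_grained_next(current, n_threads, event, cycles_on_thread,
                        max_cycles=64):
    """Block multithreading: stay on a thread until 'something happens'."""
    long_latency = event in {"cache_miss", "tlb_miss", "div"}  # illustrative
    if long_latency or cycles_on_thread >= max_cycles:
        return (current + 1) % n_threads      # switch away
    return current                            # keep executing the same thread
```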
Fine-grained multithreading
Coarse-grained multithreading
Pros & Cons of fine-grained MT
• Zero context-switch time
  – Multiple Register Files
• No need for forwarding paths if the number of supported threads exceeds the pipeline depth!
  – Simpler hardware
• Fills short vertical waste well
• Fills long vertical waste much less well
• Does not significantly reduce horizontal waste (per thread, the instruction window is not much different…)
• Significant deterioration of single-thread performance
Pros & Cons of coarse-grained MT
• More time is allowed for a context switch
• Fills long vertical waste very well (other threads come in)
• Fills short vertical waste poorly (if it is not sufficient to trigger a context switch)
• Barely reduces horizontal waste at all
• Scheduling of threads is not self-evident:
  – What happens to thread #2 if thread #1 executes perfectly well and leaves no gaps?
  – Explicit techniques require ISA modifications → bad…
Simultaneous Multithreading (SMT)
Thread scheduling
• Allow a preferred thread, to maintain single-thread performance
• Prioritised scheduling (sketched below):
  – Thread #0 schedules freely
  – Thread #1 is allowed to use #0's empty slots
  – Thread #2 is allowed to use #0's and #1's empty slots, etc.
• Fair scheduling:
  – All threads compete for resources
  – If several threads want the same resource, round-robin assignment
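A minimal sketch of the prioritised policy, assuming an 8-wide issue stage and illustrative ready-instruction counts:

```python
# Prioritised slot assignment: lower-numbered threads fill issue slots
# first; higher-numbered threads only get the leftover slots.

def prioritised_schedule(ready, width=8):
    """ready[t] = number of ready instructions of thread t.
    Returns the slots granted to each thread, thread #0 first."""
    grants = []
    free = width
    for n in ready:              # iterate from highest-priority thread #0
        take = min(n, free)      # this thread uses the remaining free slots
        grants.append(take)
        free -= take
    return grants

# Thread #0 wants 5 slots, #1 wants 4, #2 wants 3; issue width is 8:
print(prioritised_schedule([5, 4, 3]))   # -> [5, 3, 0]
```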
Hardware support for SMT
• Fits well on top of an ordinary superscalar processor organization
• Multiple program counters (= threads) and a policy for the fetch units to decide which threads to fetch
• Multiple or larger register file(s), with at least as many registers as the logical registers of all threads combined (e.g. 4 threads × 32 logical registers ⇒ at least 128 physical registers)
• Multiple instruction retirement (e.g., per-thread squashing)
⇒ No changes needed in the execution path
and also:
• Thread-aware branch predictors (BTBs, etc.)
• Per-thread Return Address Stacks
Hardware support for SMT
• Complication of instruction commit
  – We want instructions from separate threads to be allowed to commit independently
  – Use logically separate Reorder Buffers
• Dealing with the larger register files needed to hold multiple contexts → potentially slower hardware
• Cache “conflicts” between threads may cause some performance degradation in memory access
Base superscalar organization
Base reorder buffer
Superscalar processor with SMT
Reorder buffer with SMT
Reservation stations with SMT
• Reservation stations do not need to know which thread an instruction belongs to
• Operand sources are renamed (physical registers, tags, etc.), as sketched below
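A minimal sketch of why renaming removes the need for thread ids in the reservation stations; the rename-table layout and register counts are illustrative assumptions:

```python
# After renaming, every operand is a physical-register tag that is unique
# machine-wide, so a reservation-station entry needs no thread id.
free_list = list(range(64, 256))          # shared pool of physical registers
rename_table = {t: {} for t in range(4)}  # per-thread: logical -> physical

def rename(tid, dest_logical):
    """Map a logical destination register of thread tid to a fresh physical one."""
    phys = free_list.pop(0)
    rename_table[tid][dest_logical] = phys
    return phys                            # globally unique tag

# r5 of thread 0 and r5 of thread 1 get different physical registers, so
# their consumers wait on different tags in the same shared RS pool:
print(rename(0, 5), rename(1, 5))          # -> 64 65
```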
Implementation of SMT
• Instruction scheduling is no more complex
• Register File datapaths are no more complex (but the register file is much larger!)
• Instruction fetch throughput is attainable even without more fetch bandwidth
• Unmodified caches and branch predictors are also appropriate for SMT
• SMT achieves better results than an aggressive superscalar
Implementation of SMT
• Static fetch solutions: round-robin
  – Each cycle, 8 instructions from 1 thread
  – Each cycle, 4 instructions from 2 threads, 2 from 4, …
  – Each cycle, 8 instructions from 2 threads: forward as many as possible from #1, then, when a long-latency instruction appears in #1, pick the rest from #2
• Dynamic fetch solutions: check the execution queues! (see the sketch below)
  – Favour threads with the minimal # of in-flight branches
  – Favour threads with the minimal # of outstanding misses
  – Favour threads with the minimal # of in-flight instructions
  – Favour threads with instructions far from the queue head
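A minimal sketch of one such dynamic policy, favouring the threads with the fewest in-flight instructions (this corresponds to the ICOUNT heuristic from the SMT literature); the counter values are illustrative:

```python
# ICOUNT-style dynamic fetch policy: each cycle, fetch from the thread(s)
# with the fewest instructions in the decode/rename/issue queues.

def pick_fetch_threads(in_flight, k=2):
    """in_flight[t] = in-flight instruction count of thread t.
    Returns the k thread ids to fetch from this cycle."""
    order = sorted(range(len(in_flight)), key=lambda t: in_flight[t])
    return order[:k]

# Thread 2 is clogging the queues (12 in-flight), so it is not fetched:
print(pick_fetch_threads([3, 7, 12, 5]))   # -> [0, 3]
```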
Implementation of SMT
• The issue policy is not exactly the same as in superscalars…
  – In a superscalar: oldest is best (least speculation, more dependent instructions waiting, etc.)
  – In SMT it is not so clear: the branch-speculation level and optimism (cache-hit speculation) vary across threads
• One can think of many selection strategies (sketched below):
  – Oldest first
  – Cache-hit speculated last
  – Branch speculated last
  – Branches first…
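A minimal sketch of these strategies as pluggable priority keys; the instruction fields and key definitions are illustrative assumptions:

```python
# Issue selection: sort ready instructions by a strategy-specific priority
# key, then issue the first `width` of them.
from dataclasses import dataclass

@dataclass
class ReadyInst:
    age: int                 # cycles since dispatch (bigger = older)
    branch_spec: bool        # issued under an unresolved branch?
    cachehit_spec: bool      # optimistically assumed a cache hit?
    is_branch: bool

STRATEGIES = {
    "oldest_first":       lambda i: -i.age,
    "cachehit_spec_last": lambda i: (i.cachehit_spec, -i.age),
    "branch_spec_last":   lambda i: (i.branch_spec, -i.age),
    "branches_first":     lambda i: (not i.is_branch, -i.age),
}

def select(ready, strategy, width=8):
    """Return up to `width` instructions in the chosen priority order."""
    return sorted(ready, key=STRATEGIES[strategy])[:width]

ready = [ReadyInst(5, False, True, False), ReadyInst(9, True, False, True)]
print([i.age for i in select(ready, "oldest_first", width=1)])   # -> [9]
```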
Commercial solutions with SMT
• Compaq Alpha 21464 (EV8)
– 4T SMT
– Project killed June 2001
• Intel Pentium IV (Xeon)
– 2T SMT
– Introduced in 2002
– 10-30% gains expected
• SUN Ultra III
– 2-core CMP, 4T SMT
Intel P4 Xeon HyperThreading
• From a software perspective, OSs and user programs
can schedule processes or threads to logical processors
as they would in a multiprocessor system
• The duplicated resources are primarily:
– Next–Instruction Pointer
– Instruction Stream Buffers
– Instruction translation look–aside buffer
– Return stack predictor
– Trace–cache plus local next–instruction pointer
– Trace–cache fill buffers
– Advanced Programmable Interrupt Controller (APIC)
– Register Alias Tables
Intel P4 Xeon HyperThreading
Intel Xeon Hyper-Threading
• Minimal additional cost: SMT costs approx. 5% extra die area
• No impact on single-thread performance
  – Partitioned resources are recombined when only one thread runs
• Fair behaviour with 2 threads
Intel P4 Xeon HyperThreading
• Performance results
  – Multithreaded benchmarks: 15% to 26% performance boost
  – Multitasking benchmarks: 15% to 27% performance boost
