Lect. 9: Multithreading
! Memory latencies and even latencies to lower level caches are
becoming longer w.r.t. processor cycle times
! There are basically 3 ways to hide/tolerate such latencies by
overlapping computation with the memory access
Dynamic out-of-order scheduling
Prefetching
Multithreading
! OOO execution and prefetching allow overlap of computation
and memory access within the same thread (these were covered in
CS3 Computer Architecture)
! Multithreading allows overlap of memory access of one thread/
process with computation by another thread/process
Blocked Multithreading
! Basic idea:
Recall multi-tasking: on I/O a process is context-switched out of the
processor by the OS
With multithreading, a thread/process is context-switched out of the
pipeline by the hardware on long-latency operations
(Figure: two timelines contrasting the cases. Multi-tasking: process 1 runs until a system call for I/O, the OS interrupt handler runs and switches to process 2, and on I/O completion the handler eventually resumes process 1. Multithreading: process 1 runs until a long-latency operation, a hardware context switch starts process 2, which in turn runs until its own long-latency operation triggers another hardware context switch.)
Blocked Multithreading
! Basic idea:
Unlike in multi-tasking, the context is still kept in the processor and the OS is not
aware of any changes
Context switch overhead is minimal (usually only a few cycles)
Unlike in multi-tasking, the completion of the long-latency operation does
not trigger a context switch (the blocked thread is simply marked as ready)
Usually the long-latency operation is an L1 cache miss, but it can also be
other operations, such as a floating-point or integer division (which takes 20 to 30
cycles and is unpipelined)
! Context of a thread in the processor:
Registers
Program counter
Stack pointer
Other processor status words
! Note: the term is commonly (mis)used to mean simply the fact
that the system supports multiple threads
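As a rough illustration of the scheme described above, the following Python sketch models blocked multithreading: the hardware switches context only when the running thread hits a long-latency operation, and a thread whose operation completes is simply marked ready again (no switch). All names, latencies and cycle-level details here are assumptions for illustration, not a description of any real pipeline.

# Toy model of blocked multithreading (illustrative only, not any real design).
from dataclasses import dataclass

@dataclass
class HWContext:
    tid: int
    ready: bool = True          # a blocked thread is marked not ready
    wake_cycle: int = 0         # cycle at which its long-latency operation completes

def run_blocked_mt(num_cycles, contexts, miss_points, miss_latency=20, switch_cost=3):
    """miss_points: set of (tid, nth-instruction) pairs that trigger a long-latency miss."""
    executed = {c.tid: 0 for c in contexts}
    current = contexts[0]
    cycle = 0
    while cycle < num_cycles:
        # A completed operation simply marks the blocked thread ready again;
        # it does not trigger a context switch.
        for c in contexts:
            if not c.ready and cycle >= c.wake_cycle:
                c.ready = True
        if current.ready:
            executed[current.tid] += 1
            if (current.tid, executed[current.tid]) in miss_points:
                # Long-latency operation: block this thread; hardware switches context.
                current.ready = False
                current.wake_cycle = cycle + miss_latency
                current = next((c for c in contexts if c.ready), current)
                cycle += switch_cost          # small hardware switch overhead
        else:
            # Current thread blocked: move to another ready thread if one exists,
            # otherwise this is a stall cycle.
            current = next((c for c in contexts if c.ready), current)
        cycle += 1
    return executed

# E.g., 2 hardware threads, each missing on its 5th instruction:
ctxs = [HWContext(0), HWContext(1)]
print(run_blocked_mt(60, ctxs, {(0, 5), (1, 5)}))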
Blocked Multithreading
! Latency hiding example:
(Figure: execution timelines of threads A-D, after Culler and Singh, Fig. 11.27. Each thread runs until a memory or pipeline latency, pays a small context switch overhead, and the processor moves on to the next ready thread; idle (stall) cycles occur only when no thread is ready.)
Blocked Multithreading
! Hardware mechanisms:
Keeping multiple contexts and supporting fast switch
! One register file per context
! One set of special registers (including PC) per context
Flushing instructions from the previous context from the pipeline after a
context switch
! Note that such squashed instructions add to the context switch overhead
! Note that keeping instructions from two different threads in the pipeline
increases the complexity of the interlocking mechanism and requires that
instructions be tagged with context ID throughout the pipeline
Possibly replicating other microarchitectural structures (e.g., branch
prediction tables)
! Employed in the Sun T1 and T2 systems (a.k.a. Niagara)
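A minimal Python sketch (illustrative only) of the per-thread state such hardware replicates, and of the pipeline flush on a context switch; the field names and structure are assumptions, not taken from any particular core.

# Illustrative per-context hardware state and pipeline flush (names are assumptions).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThreadContext:
    regs: List[int] = field(default_factory=lambda: [0] * 32)  # one register file per context
    pc: int = 0                                                # program counter
    sp: int = 0                                                # stack pointer
    psw: int = 0                                               # other processor status words

@dataclass
class PipelineSlot:
    ctx_id: int      # every in-flight instruction is tagged with its context ID
    pc: int

def switch_context(pipeline: List[PipelineSlot], old_ctx: int) -> int:
    """Flush the outgoing context's instructions; return how many were squashed
    (these squashed instructions add to the context switch overhead)."""
    squashed = sum(1 for slot in pipeline if slot.ctx_id == old_ctx)
    pipeline[:] = [slot for slot in pipeline if slot.ctx_id != old_ctx]
    return squashed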
Blocked Multithreading
! Simple analytical performance model:
Parameters:
! Number of threads (N): the number of threads supported in the hardware
! Busy time (R): time processor spends computing between context switch
points
! Switching time (C): time the processor spends on each context switch
! Latency (L): time required by the operation that triggers the switch
To completely hide all of L we need enough N such that ~N*(R+C) equals L
(strictly speaking, (N-1)*R + N*C = L)
! Fewer threads mean we can't hide all of L
! More threads are unnecessary
Note: these are only average values and ideally N should be larger to
accommodate variation
(Timeline: the running thread's busy time R and switch time C repeat for each thread in turn, covering the latency L of the blocked thread.)
Blocked Multithreading
! Simple analytical performance model:
The minimum value of N is referred to as the saturation point (N_sat)
Thus, there are two regions of operation:
! Before saturation, adding more threads increases processor utilization linearly
! After saturation, processor utilization does not improve with more threads,
but is limited by the switching overhead
N_sat = (R + L) / (R + C)
U_sat = R / (R + C)
E.g.: for R = 40, L = 200, and C = 10, N_sat = 240/50 = 4.8 (i.e., 5 threads) and U_sat = 40/50 = 0.8
(Figure: processor utilization vs. number of threads, after Culler and Singh, Fig. 11.25.)
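This model is easy to evaluate directly. The Python sketch below just plugs in the formulas above; the utilization in the unsaturated region is written here as N*R / (R + L), which follows from the R/C/L/N definitions but is not stated explicitly on the slide, so treat it as part of the sketch's assumptions.

# Analytical model of blocked multithreading (sketch; follows the R/C/L/N definitions above).

def n_sat(R, C, L):
    """Smallest number of threads that hides all of L: N*(R + C) >= R + L."""
    return (R + L) / (R + C)

def utilization(N, R, C, L):
    """Fraction of time the processor does useful work (R) with N threads."""
    if N < n_sat(R, C, L):
        # Before saturation: each thread contributes R of useful work per rotation,
        # but the rotation still takes R + L because L is not fully hidden.
        return N * R / (R + L)
    # After saturation: latency is fully hidden; only the switch overhead is lost.
    return R / (R + C)

R, C, L = 40, 10, 200
print(n_sat(R, C, L))            # 4.8  -> 5 threads reach saturation
print(utilization(5, R, C, L))   # 0.8 == R / (R + C)
print(utilization(2, R, C, L))   # 0.333... (linear region: N*R / (R + L))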
Fine-grain or Interleaved Multithreading
! Basic idea:
Instead of waiting for a long-latency operation, context switch on every
cycle
Threads waiting for a long-latency operation are marked not ready and are
not considered for execution
With enough threads, no two instructions from the same thread are in the
pipeline at the same time, so there is no need for pipeline interlocks at all
! Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single-thread performance (each thread only gets the
processor once every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading
! Some machines have taken this idea to the extreme and
eliminated caches altogether (e.g., Cray MTA-2, with 128 threads
per processor)
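The per-cycle thread selection can be sketched as follows; this is a hypothetical illustration of a round-robin choice that skips not-ready threads, not the selection logic of any specific machine.

# Interleaved (fine-grain) multithreading: pick a new thread every cycle,
# skipping threads still blocked on a long-latency operation.
def pick_next_thread(ready, last):
    """ready: one boolean per hardware thread; last: thread chosen previously.
    Returns the next ready thread in round-robin order, or None if all are blocked."""
    n = len(ready)
    for offset in range(1, n + 1):
        tid = (last + offset) % n
        if ready[tid]:
            return tid
    return None   # every thread is blocked: the pipeline stalls this cycle

# E.g., 6 threads where A (0) and E (4) are still blocked, as in Fig. 11.28:
ready = [False, True, True, True, False, True]
print(pick_next_thread(ready, last=5))   # 1 (thread B); A is skipped
print(pick_next_thread(ready, last=3))   # 5 (thread F); E is skipped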
Fine-grain or Interleaved Multithreading
! Simple analytical performance model
! Assumption: no caches, 1 in 2 instructions is a memory access
Parameters:
! Number of threads (N) and Latency (L)
! Busy time (R) is now 1 and switching time (C) is now 0
To completely hide all L we need enough N such that N - 1 = L
The minimum value of N (i.e., N = L + 1) is the saturation point (N_sat)
Again, there are two regions of operation:
! Before saturation, adding more threads increases processor utilization linearly
! After saturation, processor utilization does not improve with more threads,
but is 100% (i.e., U_sat = 1)
(Timeline: single-cycle busy slots R from the other threads, one per cycle, cover the latency L of the thread that issued the memory access.)
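A small Python sketch of this model under the stated assumptions (R = 1, C = 0); the function name is made up for illustration.

# Analytical model of interleaved multithreading (sketch; R = 1, C = 0 as assumed above).
def utilization_fine_grain(N, L):
    """Utilization with N threads and operation latency L (in cycles)."""
    n_sat = L + 1                      # minimum N that hides all of L (N - 1 = L)
    if N < n_sat:
        return N / (1 + L)             # linear region: N * R / (R + L) with R = 1
    return 1.0                         # saturated: U_sat = 1 (no switch overhead)

print(utilization_fine_grain(4, L=10))    # 0.3636...
print(utilization_fine_grain(11, L=10))   # 1.0 (N = L + 1 reaches saturation)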
Fine-grain or Interleaved Multithreading
! Latency hiding example:
(Figure: execution timelines of threads A-F, after Culler and Singh, Fig. 11.28. The processor switches thread every cycle in round-robin order; a thread still blocked on a memory or pipeline latency when its turn comes around, such as A or E here, is simply skipped, and idle (stall) cycles occur only when no thread is ready.)
Simultaneous Multithreading (SMT)
! Basic idea:
Don't actually context switch, but on a superscalar processor fetch and
issue instructions from different threads/processes simultaneously
E.g., a 4-issue processor (see the figure below)
! Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused
issue slots
+ Full performance in single-thread mode
– Most complex hardware of all multithreading schemes
(Figure: issue slots of a 4-issue processor over successive cycles, comparing no multithreading (slots wasted during a cache miss) with blocked, interleaved, and SMT execution; only SMT can fill slots within the same cycle with instructions from different threads.)
Simultaneous Multithreading (SMT)
! Fetch policies:
Non-multithreaded fetch: only fetch instructions from one thread in each
cycle, in a round-robin alternation
Partitioned fetch: divide the total fetch bandwidth equally between some
of the available threads (requires a more complex fetch unit to fetch from
multiple I-cache lines; see Lecture 3)
Priority fetch: fetch more instructions for specific threads (e.g., those not
in control speculation, or those with the fewest instructions in the
issue queue; see the sketch below)
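A sketch of how these fetch choices could be expressed; the thread-state fields and the exact priority key are assumptions, and the fewest-instructions-in-queue heuristic is only similar in spirit to published ICOUNT-style policies.

# Sketch of SMT fetch policies (illustrative; field names are assumptions).
from dataclasses import dataclass

@dataclass
class ThreadState:
    tid: int
    speculative: bool    # currently executing under control speculation?
    iq_count: int        # number of its instructions already in the issue queue

def round_robin_fetch(threads, last_tid):
    """Non-multithreaded fetch: one thread per cycle, alternating in round-robin order."""
    order = sorted(threads, key=lambda t: (t.tid - last_tid - 1) % len(threads))
    return order[0].tid

def priority_fetch(threads):
    """Priority fetch: prefer non-speculative threads with the fewest instructions
    already in the issue queue."""
    return min(threads, key=lambda t: (t.speculative, t.iq_count)).tid

threads = [ThreadState(0, False, 12), ThreadState(1, True, 3), ThreadState(2, False, 5)]
print(round_robin_fetch(threads, last_tid=0))  # 1
print(priority_fetch(threads))                 # 2 (non-speculative, fewest queued)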
! Issue policies:
Round-robin: select one ready instruction from each ready thread in turn
until all issue slots are full or there are no more ready instructions
(note: the hardware should remember which thread last had an instruction
selected and start from there in the next cycle; see the sketch below)
Priority issue:
! E.g., threads with older instructions in the issue queue are tried first
! E.g., threads in control speculative mode are tried last
! E.g., issue all pending branches first
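Finally, a sketch of the round-robin issue selection, remembering which thread was served last so that the next cycle starts from there; purely illustrative and deliberately simplified.

# Sketch of round-robin SMT issue: take one ready instruction from each ready thread
# in turn until the issue slots are full, resuming where the previous cycle stopped.
def issue_round_robin(ready_insts, issue_width, start_tid):
    """ready_insts: dict tid -> list of that thread's ready instructions (oldest first).
    Returns (issued instructions, thread id to start from next cycle)."""
    tids = sorted(ready_insts)
    queues = {tid: list(insts) for tid, insts in ready_insts.items()}
    issued = []
    idx = tids.index(start_tid)
    last_tid = start_tid
    # Cycle over the threads, taking one instruction at a time,
    # until the slots are full or nothing is left to issue.
    while len(issued) < issue_width and any(queues[t] for t in tids):
        tid = tids[idx % len(tids)]
        if queues[tid]:
            issued.append(queues[tid].pop(0))
            last_tid = tid
        idx += 1
    # Start from the thread after the last one served, for fairness.
    next_start = tids[(tids.index(last_tid) + 1) % len(tids)]
    return issued, next_start

ready = {0: ["a0", "a1"], 1: ["b0"], 2: []}
print(issue_round_robin(ready, issue_width=4, start_tid=1))
# (['b0', 'a0', 'a1'], 1) -- three ready instructions fill 3 of the 4 slots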