
Lect. 9: Multithreading
CS4/MSc Parallel Architectures - 2016-2017
▪ Memory latencies, and even latencies to lower-level caches, are
becoming longer w.r.t. processor cycle times
▪ There are basically 3 ways to hide/tolerate such latencies by
overlapping computation with the memory access
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
▪ OOO execution and prefetching allow overlap of computation and
memory access within the same thread (these were covered in CS3
Computer Architecture; a small prefetching sketch is shown below)
▪ Multithreading allows overlap of memory access of one thread/
process with computation by another thread/process
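
As a quick refresher on prefetching, the second of the three techniques, here is a minimal sketch in C. It assumes a GCC/Clang-style __builtin_prefetch intrinsic, and the prefetch distance of 16 elements is an illustrative guess rather than a tuned value.

/* Software prefetching sketch: request a[i+16] while summing a[i], so the
 * memory access overlaps with computation in the same thread.
 * Assumes GCC/Clang's __builtin_prefetch; the distance of 16 is illustrative. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        sum += a[i];                               /* overlaps with the prefetch  */
    }
    return sum;
}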

Blocked Multithreading
▪ Basic idea:
– Recall multi-tasking: on I/O a process is context-switched out of the processor by
the OS
[Diagram: process 1 runs until it makes a system call for I/O; the OS interrupt handler runs and switches to process 2; on I/O completion the interrupt handler runs again and process 1 resumes.]

– With multithreading a thread/process is context-switched out of the pipeline by
the hardware on longer-latency operations
[Diagram: process 1 runs until a long-latency operation; a hardware context switch brings in process 2; when process 2 hits a long-latency operation, another hardware context switch resumes process 1.]
Blocked Multithreading
▪ Basic idea:
– Unlike in multi-tasking, the context is still kept in the processor and the OS is not
aware of any change
– Context switch overhead is minimal (usually only a few cycles)
– Unlike in multi-tasking, the completion of the long-latency operation does not
trigger a context switch (the blocked thread is simply marked as ready)
– Usually the long-latency operation is an L1 cache miss, but it can also be another
operation, such as an fp or integer division (which takes 20 to 30 cycles and is unpipelined)
▪ Context of a thread in the processor (see the sketch below):
– Registers
– Program counter
– Stack pointer
– Other processor status words
▪ Note: the term “multithreading” is commonly used to mean
simply the fact that the system supports multiple threads
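
A minimal sketch of the per-thread context listed above, modelled as a C structure; the register count and field widths are assumptions for illustration, not taken from any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define NUM_GPRS 32                /* illustrative register-file size */

/* One hardware thread context kept inside the processor.
 * Blocked multithreading keeps one of these per supported thread. */
struct hw_context {
    uint64_t gpr[NUM_GPRS];        /* general-purpose registers                  */
    uint64_t pc;                   /* program counter                            */
    uint64_t sp;                   /* stack pointer                              */
    uint64_t status;               /* other processor status words               */
    bool     ready;                /* cleared while blocked on a long-latency op */
};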

Blocked Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.27): threads A-D run in turn; while one thread waits on a memory latency the others compute, so the memory latencies are hidden at the cost of a short context switch overhead, the pipeline latency on each switch, and occasional idle (stall) cycles.]
Blocked Multithreading
▪ Hardware mechanisms:
– Keeping multiple contexts and supporting fast switch
▪ One register file per context
▪ One set of special registers (including PC) per context
– Flushing the previous context's instructions from the pipeline after a context
switch
▪ Note that such squashed instructions add to the context switch overhead
▪ Note that keeping instructions from two different threads in the pipeline
increases the complexity of the interlocking mechanism and requires that
instructions be tagged with context ID throughout the pipeline
– Possibly replicating other microarchitectural structures (e.g., branch prediction
tables)
▪ Employed in the Sun T1 and T2 systems (a.k.a. Niagara)
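
The switch-on-long-latency behaviour described on the last few slides can be sketched as follows; the function names, the fixed SWITCH_COST, and the simple ready-flag bookkeeping are assumptions for illustration, not how the T1/T2 hardware is organised.

#include <stdbool.h>

#define NUM_THREADS 4
#define SWITCH_COST 3                /* a few cycles lost flushing the pipeline */

static bool ready[NUM_THREADS] = { true, true, true, true };
static int  current = 0;             /* thread currently in the pipeline        */

/* Round-robin search for another thread that is not blocked. */
static int pick_next_ready(int cur)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (cur + i) % NUM_THREADS;
        if (ready[t])
            return t;
    }
    return -1;                       /* nobody ready: the pipeline simply idles */
}

/* Conceptually invoked by hardware when the running thread hits, e.g., an
 * L1 miss.  The thread is marked not ready (it becomes ready again when the
 * miss completes, without triggering a switch) and another thread is run. */
int on_long_latency_op(void)
{
    ready[current] = false;
    int next = pick_next_ready(current);
    if (next >= 0)
        current = next;              /* switch costs roughly SWITCH_COST cycles */
    return current;
}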

Blocked Multithreading
▪ Simple analytical performance model:
– Parameters:
▪ Number of threads (N): the number of threads supported in the hardware
▪ Busy time (R): time processor spends computing between context switch
points
▪ Switching time (C): time processor spends with each context switch
▪ Latency (L): time required by the operation that triggers the switch
– To completely hide all L we need enough N such that (N-1)*R + N*C = L
▪ Fewer threads mean we can’t hide all L
▪ More threads are unnecessary

[Diagram: the threads execute in turn, each contributing a busy period R followed by a context switch C, until the first thread's latency L has elapsed.]
– Note: these are only average numbers and ideally N should be bigger to
accommodate variation
Blocked Multithreading
▪ Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat)
Nsat = (R + L) / (R + C)
– Thus, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is limited by the switching overhead

Usat = R / (R + C)
– E.g., for R = 40, L = 200, and C = 10 (Culler and Singh, Fig. 11.25): processor utilization
grows roughly linearly with the number of threads and flattens at Usat = 0.8 once N
reaches Nsat = 4.8, i.e., about 5 threads (see the sketch below)

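A small C program, given here as a sketch, evaluates the model for this example; the linear-region expression U = N*R/(R+L) is the utilization implied by the model before saturation.

/* Evaluates Nsat = (R+L)/(R+C) and Usat = R/(R+C) for R=40, L=200, C=10.
 * Prints Nsat = 4.80 (so 5 threads) and Usat = 0.80, matching the example. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double R = 40.0, L = 200.0, C = 10.0;

    double Nsat = (R + L) / (R + C);     /* threads needed to hide all of L */
    double Usat = R / (R + C);           /* utilization once saturated      */

    printf("Nsat = %.2f (round up to %d threads)\n", Nsat, (int)ceil(Nsat));
    printf("Usat = %.2f\n", Usat);

    /* Utilization before saturation grows linearly: U = N*R / (R + L). */
    for (int N = 1; N <= 6; N++) {
        double U_lin = N * R / (R + L);
        double U = U_lin < Usat ? U_lin : Usat;
        printf("N = %d -> U = %.2f\n", N, U);
    }
    return 0;
}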
Fine-grain or Interleaved Multithreading
▪ Basic idea:
– Instead of waiting for long-latency operation, context switch on every cycle
– Threads waiting for a long latency operation are marked not ready and are not
considered for execution
– With enough threads no two instructions from the same thread are in the pipeline
at the same time → no need for pipeline interlock at all
▪ Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single thread performance (each thread only gets the processor once
every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading (if we want to permit
multiple instructions from the same thread in the pipeline)
▪ Some machines have taken this idea to the extreme and
eliminated caches altogether (e.g., Cray MTA-2, with 128 threads
per processor)
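
A minimal sketch of the per-cycle thread selection, assuming a simple rotation that skips threads waiting on memory; all names are illustrative.

#include <stdbool.h>

#define NUM_THREADS 8

static bool waiting[NUM_THREADS];    /* true while a thread is blocked on memory */
static int  last = NUM_THREADS - 1;  /* thread that issued in the previous cycle */

/* Called once per cycle: returns the thread to fetch/issue from,
 * or -1 if every thread is blocked (an idle cycle). */
int select_thread_this_cycle(void)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last + i) % NUM_THREADS;
        if (!waiting[t]) {           /* blocked threads are simply skipped */
            last = t;
            return t;
        }
    }
    return -1;
}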

Fine-grain or Interleaved Multithreading
▪ Simple analytical performance model
▪ Assumptions: no caches, and 1 in every 2 instructions is a memory access
– Parameters:
▪ Number of threads (N) and Latency (L)
▪ Busy time (R) is now 1 and switching time (C) is now 0
– To completely hide all L we need enough N such that N − 1 = L / R
– The minimum value of N (i.e., N=L+1) is the saturation point (Nsat)


– Again, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is 100% (i.e., Usat = 1)
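– E.g., with R = 1 and a latency of L = 10 cycles, Nsat = 11 threads would be needed;
with memory latencies of hundreds of cycles the required thread count grows
accordingly, which is why designs such as the Cray MTA-2 support 128 threads per processor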

Fine-grain or Interleaved Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.28): threads A-F issue one instruction per cycle in interleaved fashion; a thread still blocked on a memory access (e.g., A or E) is simply skipped, and idle (stall) cycles occur only when no thread is ready to cover the pipeline and memory latencies.]
Simultaneous Multithreading (SMT)
▪ Basic idea:
– Don’t actually context switch, but on a superscalar processor fetch and issue
instructions from different threads/processes simultaneously
– E.g., 4-issue processor:
[Figure: issue-slot occupancy per cycle under no multithreading, blocked, interleaved, and SMT; without multithreading a cache miss leaves whole cycles empty, blocked and interleaved multithreading fill those cycles with other threads, and only SMT also fills unused issue slots within a cycle with instructions from different threads.]

▪ Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
– Most complex hardware of all multithreading schemes

Simultaneous Multithreading (SMT)
▪ Fetch policies:
– Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in
a round-robin alternation
– Partitioned fetch: divide the total fetch bandwidth equally between some of the
available threads (requires more complex fetch unit to fetch from multiple I-cache
lines; see Lecture 3)
– Priority fetch: fetch more instructions for specific threads (e.g., those not in control
speculation, or those with the fewest instructions in the issue queue)
▪ Issue policies:
– Round-robin: select one ready instruction from each ready thread in turn until all
issue slots are full or there are no more ready instructions
(note: the hardware should remember which thread last had an instruction selected
and start from there in the next cycle; a small sketch follows this list)
– Priority issue:
▪ E.g., threads with older instructions in the issue queue are tried first
▪ E.g., threads in control speculative mode are tried last
▪ E.g., issue all pending branches first
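
A minimal sketch of the round-robin issue policy in C; the per-thread ready-instruction counters and the 4-wide issue width are illustrative assumptions, not a description of any real issue queue.

/* Round-robin SMT issue sketch: take one ready instruction from each ready
 * thread in turn until the issue slots are full or nothing is ready, and
 * remember where the round stopped so the next cycle continues from there. */
#define ISSUE_WIDTH 4
#define NUM_THREADS 4

static int ready_insts[NUM_THREADS]; /* ready instructions in each thread's queue */
static int start_thread = 0;         /* thread the next round starts from         */

/* Returns the number of instructions issued this cycle. */
int issue_round_robin(void)
{
    int issued = 0;
    int skips = 0;                   /* consecutive threads with nothing ready    */
    int t = start_thread;

    while (issued < ISSUE_WIDTH && skips < NUM_THREADS) {
        if (ready_insts[t] > 0) {
            ready_insts[t]--;                     /* issue one instruction from t */
            issued++;
            skips = 0;
            start_thread = (t + 1) % NUM_THREADS; /* next cycle starts after t    */
        } else {
            skips++;
        }
        t = (t + 1) % NUM_THREADS;                /* move to the next thread      */
    }
    return issued;
}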

