
Lect. 9: Multithreading
CS4/MSc Parallel Architectures - 2016-2017
▪ Memory latencies, and even latencies to lower-level caches, are
becoming longer w.r.t. processor cycle times
▪ There are basically 3 ways to hide/tolerate such latencies by
overlapping computation with the memory access
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
▪ OOO execution and prefetching allow overlap of computation and
memory access within the same thread (these were covered in CS3
Computer Architecture; a small prefetching sketch is shown below)
▪ Multithreading allows overlap of memory access of one thread/
process with computation by another thread/process
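
As a quick refresher on prefetching, the second of the three techniques, here is a minimal sketch in C. It assumes a GCC/Clang-style __builtin_prefetch intrinsic, and the prefetch distance of 16 elements is an illustrative guess rather than a tuned value.

/* Software prefetching sketch: request a[i+16] while summing a[i], so the
 * memory access overlaps with computation in the same thread.
 * Assumes GCC/Clang's __builtin_prefetch; the distance of 16 is illustrative. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
        sum += a[i];                               /* overlaps with the prefetch  */
    }
    return sum;
}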

Blocked Multithreading
▪ Basic idea:
– Recall multi-tasking: on I/O a process is context-switched out of the processor by
the OS
[Diagram: process 1 runs until it makes a system call for I/O; the OS interrupt handler runs and switches to process 2; on I/O completion the interrupt handler runs again and process 1 resumes.]

– With multithreading a thread/process is context-switched out of the pipeline by
the hardware on longer-latency operations
[Diagram: process 1 runs until a long-latency operation; a hardware context switch brings in process 2; when process 2 hits a long-latency operation, another hardware context switch resumes process 1.]
Blocked Multithreading
▪ Basic idea:
– Unlike in multi-tasking, the context is still kept in the processor and the OS is not
aware of any change
– Context switch overhead is minimal (usually only a few cycles)
– Unlike in multi-tasking, the completion of the long-latency operation does not
trigger a context switch (the blocked thread is simply marked as ready)
– Usually the long-latency operation is an L1 cache miss, but it can also be another
operation, such as an fp or integer division (which takes 20 to 30 cycles and is unpipelined)
▪ Context of a thread in the processor (see the sketch below):
– Registers
– Program counter
– Stack pointer
– Other processor status words
▪ Note: the term “multithreading” is commonly used to mean
simply the fact that the system supports multiple threads
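
A minimal sketch of the per-thread context listed above, modelled as a C structure; the register count and field widths are assumptions for illustration, not taken from any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define NUM_GPRS 32                /* illustrative register-file size */

/* One hardware thread context kept inside the processor.
 * Blocked multithreading keeps one of these per supported thread. */
struct hw_context {
    uint64_t gpr[NUM_GPRS];        /* general-purpose registers                  */
    uint64_t pc;                   /* program counter                            */
    uint64_t sp;                   /* stack pointer                              */
    uint64_t status;               /* other processor status words               */
    bool     ready;                /* cleared while blocked on a long-latency op */
};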

Blocked Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.27): threads A-D run in turn; while one thread waits on a memory latency the others compute, so the memory latencies are hidden at the cost of a short context switch overhead, the pipeline latency on each switch, and occasional idle (stall) cycles.]
Blocked Multithreading
▪ Hardware mechanisms:
– Keeping multiple contexts and supporting fast switch
▪ One register file per context
▪ One set of special registers (including PC) per context
– Flushing the previous context's instructions from the pipeline after a context
switch
▪ Note that such squashed instructions add to the context switch overhead
▪ Note that keeping instructions from two different threads in the pipeline
increases the complexity of the interlocking mechanism and requires that
instructions be tagged with context ID throughout the pipeline
– Possibly replicating other microarchitectural structures (e.g., branch prediction
tables)
▪ Employed in the Sun T1 and T2 systems (a.k.a. Niagara)
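
The switch-on-long-latency behaviour described on the last few slides can be sketched as follows; the function names, the fixed SWITCH_COST, and the simple ready-flag bookkeeping are assumptions for illustration, not how the T1/T2 hardware is organised.

#include <stdbool.h>

#define NUM_THREADS 4
#define SWITCH_COST 3                /* a few cycles lost flushing the pipeline */

static bool ready[NUM_THREADS] = { true, true, true, true };
static int  current = 0;             /* thread currently in the pipeline        */

/* Round-robin search for another thread that is not blocked. */
static int pick_next_ready(int cur)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (cur + i) % NUM_THREADS;
        if (ready[t])
            return t;
    }
    return -1;                       /* nobody ready: the pipeline simply idles */
}

/* Conceptually invoked by hardware when the running thread hits, e.g., an
 * L1 miss.  The thread is marked not ready (it becomes ready again when the
 * miss completes, without triggering a switch) and another thread is run. */
int on_long_latency_op(void)
{
    ready[current] = false;
    int next = pick_next_ready(current);
    if (next >= 0)
        current = next;              /* switch costs roughly SWITCH_COST cycles */
    return current;
}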

Blocked Multithreading
▪ Simple analytical performance model:
– Parameters:
▪ Number of threads (N): the number of threads supported in the hardware
▪ Busy time (R): time processor spends computing between context switch
points
▪ Switching time (C): time processor spends with each context switch
▪ Latency (L): time required by the operation that triggers the switch
– To completely hide all L we need enough N such that (N-1)*R + N*C = L
▪ Fewer threads mean we can’t hide all L
▪ More threads are unnecessary

[Diagram: the threads execute in turn, each contributing a busy period R followed by a context switch C, until the first thread's latency L has elapsed.]
– Note: these are only average numbers and ideally N should be bigger to
accommodate variation
Blocked Multithreading
▪ Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat)
Nsat = (R + L) / (R + C)
– Thus, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is limited by the switching overhead

Usat = R / (R + C)
– E.g., for R = 40, L = 200, and C = 10 (Culler and Singh, Fig. 11.25): processor utilization
grows roughly linearly with the number of threads and flattens at Usat = 0.8 once N
reaches Nsat = 4.8, i.e., about 5 threads (see the sketch below)

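A small C program, given here as a sketch, evaluates the model for this example; the linear-region expression U = N*R/(R+L) is the utilization implied by the model before saturation.

/* Evaluates Nsat = (R+L)/(R+C) and Usat = R/(R+C) for R=40, L=200, C=10.
 * Prints Nsat = 4.80 (so 5 threads) and Usat = 0.80, matching the example. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double R = 40.0, L = 200.0, C = 10.0;

    double Nsat = (R + L) / (R + C);     /* threads needed to hide all of L */
    double Usat = R / (R + C);           /* utilization once saturated      */

    printf("Nsat = %.2f (round up to %d threads)\n", Nsat, (int)ceil(Nsat));
    printf("Usat = %.2f\n", Usat);

    /* Utilization before saturation grows linearly: U = N*R / (R + L). */
    for (int N = 1; N <= 6; N++) {
        double U_lin = N * R / (R + L);
        double U = U_lin < Usat ? U_lin : Usat;
        printf("N = %d -> U = %.2f\n", N, U);
    }
    return 0;
}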
Fine-grain or Interleaved Multithreading
▪ Basic idea:
– Instead of waiting for long-latency operation, context switch on every cycle
– Threads waiting for a long latency operation are marked not ready and are not
considered for execution
– With enough threads no two instructions from the same thread are in the pipeline
at the same time → no need for pipeline interlock at all
▪ Advantages and disadvantages over blocked multithreading:
+ No context switch overhead (no pipeline flush)
+ Better at handling short pipeline latencies/bubbles
– Possibly poor single thread performance (each thread only gets the processor once
every N cycles)
– Requires more threads to completely hide long latencies
– Slightly more complex hardware than blocked multithreading (if we want to permit
multiple instructions from the same thread in the pipeline)
▪ Some machines have taken this idea to the extreme and
eliminated caches altogether (e.g., Cray MTA-2, with 128 threads
per processor)
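
A minimal sketch of the per-cycle thread selection, assuming a simple rotation that skips threads waiting on memory; all names are illustrative.

#include <stdbool.h>

#define NUM_THREADS 8

static bool waiting[NUM_THREADS];    /* true while a thread is blocked on memory */
static int  last = NUM_THREADS - 1;  /* thread that issued in the previous cycle */

/* Called once per cycle: returns the thread to fetch/issue from,
 * or -1 if every thread is blocked (an idle cycle). */
int select_thread_this_cycle(void)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last + i) % NUM_THREADS;
        if (!waiting[t]) {           /* blocked threads are simply skipped */
            last = t;
            return t;
        }
    }
    return -1;
}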

Fine-grain or Interleaved Multithreading
▪ Simple analytical performance model
▪ Assumptions: no caches, and 1 in every 2 instructions is a memory access
– Parameters:
▪ Number of threads (N) and Latency (L)
▪ Busy time (R) is now 1 and switching time (C) is now 0
– To completely hide all L we need enough N such that N − 1 = L / R
– The minimum value of N (i.e., N=L+1) is the saturation point (Nsat)


– Again, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is 100% (i.e., Usat = 1)
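– E.g., with R = 1 and a latency of L = 10 cycles, Nsat = 11 threads would be needed;
with memory latencies of hundreds of cycles the required thread count grows
accordingly, which is why designs such as the Cray MTA-2 support 128 threads per processor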

Fine-grain or Interleaved Multithreading
▪ Latency hiding example:
[Figure (Culler and Singh, Fig. 11.28): threads A-F issue one instruction per cycle in interleaved fashion; a thread still blocked on a memory access (e.g., A or E) is simply skipped, and idle (stall) cycles occur only when no thread is ready to cover the pipeline and memory latencies.]
Simultaneous Multithreading (SMT)
▪ Basic idea:
– Don’t actually context switch, but on a superscalar processor fetch and issue
instructions from different threads/processes simultaneously
– E.g., 4-issue processor:
[Figure: issue-slot occupancy per cycle under no multithreading, blocked, interleaved, and SMT; without multithreading a cache miss leaves whole cycles empty, blocked and interleaved multithreading fill those cycles with other threads, and only SMT also fills unused issue slots within a cycle with instructions from different threads.]

▪ Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
– Most complex hardware of all multithreading schemes

Simultaneous Multithreading (SMT)
▪ Fetch policies:
– Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in
a round-robin alternation
– Partitioned fetch: divide the total fetch bandwidth equally between some of the
available threads (requires more complex fetch unit to fetch from multiple I-cache
lines; see Lecture 3)
– Priority fetch: fetch more instructions for specific threads (e.g., those not in control
speculation, or those with the fewest instructions in the issue queue)
▪ Issue policies:
– Round-robin: select one ready instruction from each ready thread in turn until all
issue slots are full or there are no more ready instructions
(note: the hardware should remember which thread last had an instruction selected
and start from there in the next cycle; a small sketch follows this list)
– Priority issue:
▪ E.g., threads with older instructions in the issue queue are tried first
▪ E.g., threads in control speculative mode are tried last
▪ E.g., issue all pending branches first
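
A minimal sketch of the round-robin issue policy in C; the per-thread ready-instruction counters and the 4-wide issue width are illustrative assumptions, not a description of any real issue queue.

/* Round-robin SMT issue sketch: take one ready instruction from each ready
 * thread in turn until the issue slots are full or nothing is ready, and
 * remember where the round stopped so the next cycle continues from there. */
#define ISSUE_WIDTH 4
#define NUM_THREADS 4

static int ready_insts[NUM_THREADS]; /* ready instructions in each thread's queue */
static int start_thread = 0;         /* thread the next round starts from         */

/* Returns the number of instructions issued this cycle. */
int issue_round_robin(void)
{
    int issued = 0;
    int skips = 0;                   /* consecutive threads with nothing ready    */
    int t = start_thread;

    while (issued < ISSUE_WIDTH && skips < NUM_THREADS) {
        if (ready_insts[t] > 0) {
            ready_insts[t]--;                     /* issue one instruction from t */
            issued++;
            skips = 0;
            start_thread = (t + 1) % NUM_THREADS; /* next cycle starts after t    */
        } else {
            skips++;
        }
        t = (t + 1) % NUM_THREADS;                /* move to the next thread      */
    }
    return issued;
}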

