Lect. 9: Multithreading: - Dynamic Out-Of-Order Scheduling - Prefetching
▪ Memory latencies, and even latencies to lower-level caches, are
becoming longer relative to processor cycle times
▪ There are basically three ways to hide/tolerate such latencies by
overlapping computation with the memory access:
– Dynamic out-of-order scheduling
– Prefetching
– Multithreading
▪ OOO execution and prefetching allow overlap of computation and
memory access within the same thread (these were covered in CS3
Computer Architecture; a short prefetching sketch follows below)
▪ Multithreading allows overlap of memory access of one thread/
process with computation by another thread/process
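
To make the prefetching idea concrete, here is a minimal sketch using GCC/Clang's __builtin_prefetch (the function name sum_with_prefetch and the prefetch distance of 16 elements are illustrative choices, not from the lecture). Data for later iterations is requested early, so the memory access overlaps with computation on the current element:

    /* Sketch of software prefetching: issue loads for elements a few
       iterations ahead so the misses overlap with computation.
       Requires GCC or Clang for __builtin_prefetch. */
    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n) {
        const size_t dist = 16;             /* prefetch distance (elements) */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0, 1); /* read, low reuse */
            sum += a[i] * a[i];             /* work overlapped with fetch */
        }
        return sum;
    }

Hardware prefetchers and OOO scheduling achieve the same overlap automatically, without source changes.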
[Figure: blocked multithreading timeline (e.g. Thread C, Thread D): each thread runs for R cycles until a cache miss; the miss takes L cycles to resolve, and each switch to another thread costs C cycles of pipeline (refill) latency]
– Note: these are only average numbers and ideally N should be bigger to
accommodate variation
Blocked Multithreading
▪ Simple analytical performance model:
– The minimum value of N is referred to as the saturation point (Nsat):

    Nsat = (R + L) / (R + C)

– Thus, there are two regions of operation:
▪ Before saturation, adding more threads increases processor utilization linearly
▪ After saturation, processor utilization does not improve with more threads, but
is limited by the switching overhead:

    Usat = R / (R + C)
– E.g.: for R=40, L=200, and C=10
[Figure: processor utilization vs. number of threads, rising linearly and then saturating at Usat = 0.8]
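
As a quick check of these numbers (a sketch, not part of the original slides): before saturation the model implies U(N) = N*R/(R+L), which meets Usat exactly at N = Nsat. The short C program below evaluates both formulas for the example values:

    /* Worked check of the blocked-multithreading saturation model. */
    #include <stdio.h>

    int main(void) {
        const double R = 40.0;   /* run length between misses (cycles) */
        const double L = 200.0;  /* miss latency (cycles)              */
        const double C = 10.0;   /* context-switch overhead (cycles)   */

        printf("Nsat = %.1f threads\n", (R + L) / (R + C)); /* 240/50 = 4.8 */
        printf("Usat = %.2f\n", R / (R + C));               /* 40/50 = 0.80 */

        for (int N = 1; N <= 8; N++) {
            double U = N * R / (R + L);      /* linear region */
            double Usat = R / (R + C);       /* saturation ceiling */
            printf("N = %d: U = %.2f\n", N, U < Usat ? U : Usat);
        }
        return 0;
    }

With these values utilization climbs from 0.17 at N = 1 to the 0.8 ceiling at N = 5, matching the saturation point of 4.8 threads.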
Simultaneous Multithreading (SMT)
[Figure: SMT pipeline timeline (e.g. Threads B, C, F): issue slots in each cycle are filled with instructions from multiple threads, so cache misses and pipeline latencies of one thread are covered by instructions from the others]
▪ Advantages:
+ Can handle not only long latencies and pipeline bubbles but also unused issue slots
+ Full performance in single-thread mode
▪ Disadvantages:
– Most complex hardware of all multithreading schemes
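
As a practical aside (not from the slides): on Linux the SMT contexts that share a physical core are exposed in sysfs, so a short program can check whether a machine implements SMT. A minimal sketch, assuming the standard topology files:

    /* Print the hardware threads that share cpu0's physical core.
       An output such as "0,4" or "0-1" indicates SMT; "0" alone means
       no SMT sibling. Linux-specific. */
    #include <stdio.h>

    int main(void) {
        const char *path =
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list";
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        char buf[64];
        if (fgets(buf, sizeof buf, f))
            printf("Hardware threads sharing cpu0's core: %s", buf);
        fclose(f);
        return 0;
    }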