Simultaneous Multithreading
Technique                                      Reduces
Forwarding and bypassing                       Potential data hazard stalls
Delayed branches and simple branch scheduling  Control hazard stalls
Basic dynamic scheduling (scoreboarding)       Data hazard stalls from true dependences
Dynamic scheduling with renaming               Data hazard stalls and stalls from
                                               antidependences and output dependences
Dynamic branch prediction                      Control stalls
Issuing multiple instructions per cycle        Ideal CPI
Speculation                                    Data hazard and control hazard stalls
Dynamic memory disambiguation                  Data hazard stalls with memory
Loop unrolling                                 Control hazard stalls
Basic compiler pipeline scheduling             Data hazard stalls
Compiler dependence analysis                   Ideal CPI, data hazard stalls
Software pipelining, trace scheduling          Ideal CPI, data hazard stalls
Compiler speculation                           Ideal CPI, data, control stalls
Studies of the Limitations of ILP:
• ILP available in a perfect processor
• Effect of window size on ILP
• Effect of branch prediction schemes
issue
• Prioritised scheduling
– Thread #0 schedules freely
– Thread #1 is allowed to use #0 empty slots
– Thread #2 is allowed to use #0 and #1 empty slots, etc.
• Fair scheduling
– All threads compete for resources
– If several threads want the same resource, round-robin
assignment
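The two policies above can be compared with a minimal sketch. It assumes that the contested resources are interchangeable issue slots, that `ready[t]` is the number of instructions thread t could issue this cycle, and that `width` is the number of slots per cycle. The function names and the rotating `start` index are hypothetical and chosen only for illustration.

```python
def issue_prioritised(ready, width):
    """Prioritised scheduling: thread 0 takes what it needs,
    each later thread may only use the slots left empty before it."""
    grants = []
    free = width
    for want in ready:
        take = min(want, free)
        grants.append(take)
        free -= take
    return grants


def issue_fair(ready, width, start):
    """Fair scheduling: slots are handed out one at a time,
    round-robin, beginning at thread `start` (an assumed rotation point)."""
    n = len(ready)
    grants = [0] * n
    free = width
    t = start
    # Stop when the slots run out or no thread wants another slot.
    while free > 0 and any(grants[i] < ready[i] for i in range(n)):
        if grants[t] < ready[t]:
            grants[t] += 1
            free -= 1
        t = (t + 1) % n
    return grants


print(issue_prioritised([2, 3, 4], 4))  # [2, 2, 0]
print(issue_fair([2, 3, 4], 4, 0))      # [2, 1, 1]
```

With three threads wanting 2, 3 and 4 slots on a 4-wide machine, the prioritised policy starves thread #2, while the fair policy gives every thread at least one slot.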
Hardware support for SMT
• Fits well on top of an ordinary superscalar processor
organization
• Multiple program counters (= threads) and a policy for
the fetch units to decide which threads to fetch
• Multiple or larger register file(s) with at least as many
registers as logical registers for all threads
• Multiple instruction retirement (e.g., per thread
squashing)
No changes needed in the execution path
and also:
• Thread-aware branch predictors (BTBs, etc.)
• Per-thread return address stacks
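Why return address stacks should be per thread can be shown with a small simulation. This is a sketch under assumptions: the `ReturnAddressStack` class, its depth of 8, and the event format are all hypothetical, and real predictors are more involved. Two threads interleave a call and a return each, and the code counts correctly predicted returns with one shared stack versus one stack per thread.

```python
class ReturnAddressStack:
    """Toy return address stack (depth is an assumed parameter)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def push(self, addr):
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: drop the oldest entry
        self.entries.append(addr)

    def pop(self):
        return self.entries.pop() if self.entries else None


def predicted_returns(events, num_threads, per_thread):
    """events: (thread, 'call', return_addr) or (thread, 'ret', actual_addr).
    Returns how many returns were predicted correctly."""
    stacks = [ReturnAddressStack() for _ in range(num_threads if per_thread else 1)]
    correct = 0
    for thread, kind, addr in events:
        ras = stacks[thread if per_thread else 0]
        if kind == "call":
            ras.push(addr)
        elif ras.pop() == addr:
            correct += 1
    return correct


# Two threads whose calls and returns interleave in the fetch stream.
events = [
    (0, "call", 0x100),
    (1, "call", 0x900),
    (0, "ret", 0x100),
    (1, "ret", 0x900),
]
print(predicted_returns(events, 2, per_thread=False))  # 0
print(predicted_returns(events, 2, per_thread=True))   # 2
```

With a shared stack, each thread pops the other thread's return address, so both predictions miss. With per-thread stacks, both hit.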
Hardware support for SMT
• Complication of instruction commit
– We want instructions from separate threads to be allowed to
commit independently
– Use logically separate reorder buffers
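Independent commit with logically separate reorder buffers can be sketched as follows. The model is an assumption made for illustration: each thread's buffer is a queue of flags (True means the instruction has finished executing), oldest entry first, and the commit stage retires up to `width` instructions per cycle by visiting the threads in turn. The function name and the arbitration order are hypothetical.

```python
from collections import deque


def commit_cycle(robs, width):
    """Retire up to `width` instructions in total this cycle.
    Each thread commits in program order from the head of its own
    buffer, so a stalled head only blocks its own thread."""
    committed = [0] * len(robs)
    budget = width
    progress = True
    while budget > 0 and progress:
        progress = False
        for t, rob in enumerate(robs):
            if budget > 0 and rob and rob[0]:
                rob.popleft()
                committed[t] += 1
                budget -= 1
                progress = True
    return committed


# Thread 0 is stalled on its oldest instruction; thread 1 is not.
robs = [deque([False, True, True]), deque([True, True, False])]
print(commit_cycle(robs, 4))  # [0, 2]
```

Thread 1 retires two instructions even though thread 0's oldest instruction is still pending. In a single shared in-order buffer, that pending instruction could block younger instructions from the other thread.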
• Performance results
– Multithreaded benchmarks: 15% to 26% performance boost
– Multitask benchmarks: 15% to 27% performance boost