Hardware Multithreading
Parallel computing architectures break a job into discrete parts that can be
executed concurrently. Each part is further broken down into a series of instructions,
and instructions from each part execute simultaneously on different CPUs. Parallel systems
deal with the simultaneous use of multiple computing resources, which can include a single
computer with multiple processors, a number of computers connected by a network to
form a parallel processing cluster, or a combination of both.
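As a rough illustration of this decomposition (a sketch of mine, not part of the original notes), the Python fragment below breaks one job, summing a list, into four discrete parts that are executed by concurrent workers; the workload, the four-way split, and all names are assumptions chosen for illustration. CPython threads interleave CPU-bound work rather than truly running it in parallel, so this shows the decomposition pattern rather than real hardware parallelism.

    # Sketch: break a job (summing a list) into parts and run the parts
    # concurrently. The workload and the 4-way split are illustrative.
    from concurrent.futures import ThreadPoolExecutor

    data = list(range(1_000_000))
    parts = [data[i::4] for i in range(4)]  # break the job into 4 discrete parts

    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_sums = list(pool.map(sum, parts))  # parts execute concurrently

    assert sum(partial_sums) == sum(data)  # combining the parts gives the result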
In software, a large number of threads can be spawned; in hardware, only a very limited number of threads can be supported, since each thread needs its own hardware context.
Coarse-grained multithreading is a mechanism in which a context switch happens only when
the currently executing thread causes a stall, wasting a clock cycle.
When a thread stalls due to some event, the CPU switches to a different hardware
context. This is known as switch-on-event multithreading or blocked multithreading.
It is less efficient than fine-grained multithreading but requires only a few threads
to improve CPU utilization.
The events that cause latency or stalls are cache misses, synchronization events, and
floating-point operations.
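To make the switch-on-event behavior concrete, here is a toy Python simulation (my own sketch, with an invented per-cycle thread behavior): the processor runs one hardware context until that thread raises a stall event, loses the stalling cycle, and then switches to the next context.

    # Toy model of coarse-grained (switch-on-event) multithreading.
    # Each thread is an iterator yielding "compute" or "stall" per cycle;
    # the stall pattern below is invented purely for illustration.
    import itertools

    def run_coarse_grained(threads, cycles):
        current = executed = wasted = 0
        for _ in range(cycles):
            event = next(threads[current])
            if event == "stall":    # e.g. cache miss, synchronization, FP op
                wasted += 1         # the stalling cycle is lost...
                current = (current + 1) % len(threads)  # ...then we switch
            else:
                executed += 1
        return executed, wasted

    # Each illustrative thread computes for 4 cycles, then stalls once.
    make_thread = lambda: itertools.cycle(["compute"] * 4 + ["stall"])
    print(run_coarse_grained([make_thread() for _ in range(4)], cycles=100))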
Sharing resources in space and time always raises fairness considerations, which are
addressed by tracking how much progress each thread makes.
The time allocated to each thread affects both fairness and system throughput. The
allocation strategy depends on the answers to the following questions:
When do we switch?
For how long do we switch?
When do we switch back?
How does the hardware scheduler interact with the software scheduler for
fairness?
What is the switching overhead vs. benefit?
Where do we store the contexts?
A trade-off must be made between fairness and system throughput. One option is to switch
not only on a miss but also on data return. This has a severe problem, however, because
switching carries a performance overhead: it requires flushing the pipeline and instruction
window, reduces locality, and increases resource contention.
One possible solution is to estimate the slowdown of each thread compared to when it runs
alone, and then enforce a switch when the slowdowns become significantly unbalanced.
A sketch of this policy is given below.
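This minimal Python sketch assumes the per-thread runtimes come from profiling each thread in isolation and picks an arbitrary 2x imbalance threshold; both are assumptions of mine, not values from the notes.

    # Sketch: estimate each thread's slowdown (runtime under sharing divided
    # by runtime when run alone) and force a switch when the slowdowns
    # become significantly unbalanced.
    def should_switch(shared_times, alone_times, threshold=2.0):
        slowdowns = [s / a for s, a in zip(shared_times, alone_times)]
        return max(slowdowns) / min(slowdowns) > threshold

    # Thread 0 is barely slowed down; thread 1 runs 3x slower than alone,
    # so the policy would enforce a switch. The numbers are illustrative.
    print(should_switch(shared_times=[110, 300], alone_times=[100, 100]))  # True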
Advantages (of fine-grained multithreading):
Simpler to implement; can eliminate dependency checking and branch prediction
logic completely.
Switching need not have any performance overhead. (Coarse-grained multithreading, in
contrast, requires a pipeline flush, whose overhead grows with deep pipelines and large
instruction windows.)
Disadvantages:
Low single-thread performance: each thread gets only 1/Nth of the bandwidth of the
pipeline, as the round-robin sketch below illustrates.
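This toy sketch (mine, not from the notes) assumes a new thread is selected every cycle in round-robin order, which shows both why per-thread dependency checking can be eliminated and where the 1/N bandwidth figure comes from.

    # Toy fine-grained schedule: a different hardware thread issues each
    # cycle in round-robin order, so consecutive pipeline slots never hold
    # dependent instructions from one thread, but each thread receives
    # only 1/N of the pipeline bandwidth.
    def fine_grained_schedule(num_threads, cycles):
        return [cycle % num_threads for cycle in range(cycles)]

    issue = fine_grained_schedule(num_threads=4, cycles=12)
    print(issue)                         # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
    print(issue.count(0) / len(issue))   # 0.25: each thread gets 1/4 of the cycles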
Simultaneous Multithreading (SMT)
Here, instructions can be issued from multiple threads in any given cycle:
instructions from several threads are issued simultaneously to the execution units
of a superscalar processor. Thus, wide superscalar instruction issue is combined
with the multiple-context approach.
In fine-grained and coarse-grained architectures, instructions can start executing
from only a single thread in a given cycle, so execution-unit or pipeline-stage
utilization can be low if that thread does not have enough instructions to dispatch
in one cycle.
Unused instruction slots, which arise from latencies during the pipelined execution
of single-threaded programs on a microprocessor, are filled with instructions from other
threads within a multithreaded processor. The execution units are multiplexed among
the thread contexts that are loaded in the register sets.
Underutilization of a superscalar processor due to missing instruction-level
parallelism can be overcome by simultaneous multithreading, where a processor can
issue multiple instructions from multiple threads in each cycle.
Simultaneous multithreaded processors combine the multithreading technique with a
wide-issue superscalar processor to utilize a larger part of the issue bandwidth by
issuing instructions from different threads simultaneously.
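As a closing illustration, the sketch below models one cycle of SMT issue; the issue width, the per-thread ready queues, and the round-robin slot filling are simplifying assumptions of mine, not details from the notes.

    # Toy model of SMT issue: the issue slots of a wide superscalar are
    # filled each cycle with instructions drawn from multiple thread
    # contexts instead of from a single thread.
    from collections import deque

    def smt_issue(ready_queues, issue_width):
        issued = []
        while len(issued) < issue_width and any(ready_queues):
            for q in ready_queues:                 # visit every thread context
                if q and len(issued) < issue_width:
                    issued.append(q.popleft())     # fill the next issue slot
        return issued

    # Two threads with three ready instructions each; a 4-wide issue stage
    # mixes instructions from both threads in the same cycle.
    threads = [deque(f"T{t}.i{n}" for n in range(3)) for t in range(2)]
    print(smt_issue(threads, issue_width=4))  # ['T0.i0', 'T1.i0', 'T0.i1', 'T1.i1']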