
PARALLEL PROCESSING ARCHITECTURES

Parallel computing architectures break a job into discrete parts that can be
executed concurrently. Each part is further broken down into a series of instructions,
and instructions from each part execute simultaneously on different CPUs. Parallel
systems deal with the simultaneous use of multiple computer resources: a single
computer with multiple processors, a number of computers connected by a network to
form a parallel processing cluster, or a combination of both.

Flynn's taxonomy is a classification of parallel computer architectures based on the
number of concurrent instruction streams (single or multiple) and data streams
(single or multiple) available in the architecture.

Parallel processing architectures and challenges, hardware multithreading,
multicore and shared memory multiprocessors, introduction to Graphics Processing
Units, clusters and Warehouse Scale Computers, introduction to multiprocessor
network topologies.
Introduction:
Multiprocessor: A computer system with at least two processors.
Job-level parallelism (or process-level parallelism):
Utilizing multiple processors by running independent programs simultaneously.
Parallel processing program:
A single program that runs on multiple processors simultaneously.
Multicore microprocessor:
A microprocessor containing multiple processors ("cores") in a single integrated
circuit.
Parallel systems are more difficult to program than computers with a single
processor because the architecture of parallel computers varies, and the processes
running on multiple CPUs must be coordinated and synchronized. The CPUs are the
crux of parallel processing.
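To make this coordination concrete, here is a minimal sketch, assuming a POSIX
system with pthreads (the shared counter and thread count are illustrative, not from
the text): two threads update shared data and use a mutex so their updates do not
interleave incorrectly.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared state that both threads update. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);   /* coordinate: one updater at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Without the mutex, lost updates would make this total unpredictable. */
        printf("counter = %ld\n", counter);
        return 0;
    }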

Parallelism in computer architecture is explained using Flynn's taxonomy. This
classification is based on the number of instruction streams and data streams used in
the architecture. The machine structure is described using streams, which are
sequences of items. The four categories in Flynn's taxonomy, based on the number of
instruction streams and data streams, are the following:


• SISD: single instruction, single data
• MISD: multiple instruction, single data
• SIMD: single instruction, multiple data
• MIMD: multiple instruction, multiple data
SISD (Single Instruction, Single Data stream)
Single Instruction, Single Data (SISD) refers to an Instruction Set
Architecture in which a single processor (one CPU) executes exactly one instruction
stream at a time. It also fetches or stores one item of data at a time, operating on
data stored in a single memory unit. Fig 1 shows the SISD stream.
Most CPU designs are based on the von Neumann architecture and follow
SISD. The SISD model is a non-pipelined architecture with general-purpose
registers, a Program Counter (PC), an Instruction Register (IR), Memory
Address Registers (MAR) and Memory Data Registers (MDR).

Fig 1: Single Instruction, Single Data stream
Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design"

SIMD (Single Instruction, Multiple Data streams)
Single Instruction, Multiple Data (SIMD) is an Instruction Set Architecture that
has a single control unit (CU) and more than one processing unit (PU). Each PU
operates like a von Neumann machine, executing the single instruction stream issued
through the CU.
The CU generates the control signals for all of the PUs, which thereby execute the
same operation on different data streams.
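As a concrete sketch of executing one instruction over several data items, assuming
an x86 machine with SSE support (the arrays and loop bounds are illustrative), the
loop below applies a single add instruction to four floats at a time:

    #include <xmmintrin.h>  /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];
        /* One SIMD instruction adds four floats per iteration:
           a single instruction stream applied to multiple data streams. */
        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
        for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }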

The SIMD architecture is capable of achieving data-level parallelism.

Fig 2: Single Instruction, Multiple Data streams
Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design"

MISD (Multiple Instruction, Single Data stream)


Multiple Instruction, Single Data (MISD) is an Instruction Set Architecture for
parallel computing in which many functional units perform different operations by
executing different instructions on the same data set.
This type of architecture is found mainly in fault-tolerant computers, which execute
the same instructions redundantly in order to detect and mask errors. Fig 3 below
represents the MISD streams.
Fig 3: Multiple Instruction, Single Data stream
Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design"

MIMD (Multiple Instruction, Multiple Data streams)


Multiple Instruction stream, Multiple Data stream (MIMD) is an Instruction Set
Architecture for parallel computing that is typical of computers with
multiprocessors.
With MIMD, each processor in a multiprocessor system can asynchronously execute a
different set of instructions, independently, on a different set of data units.
MIMD-based computer systems can use shared memory in a memory pool, or can work
with distributed memory across heterogeneous networked computers in a distributed
environment.
The MIMD architecture is used in a number of application areas such as
computer-aided design/computer-aided manufacturing, simulation, modelling, and
communication switches. Fig 4 below represents the MIMD streams.
Fig 4: Multiple Instruction, Multiple Data streams
Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design"
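As a minimal sketch of the MIMD idea, assuming a POSIX system with pthreads (the
function and array names are illustrative, not from the text), the program below
runs two different instruction streams on two different data sets at the same time:

    #include <pthread.h>
    #include <stdio.h>

    /* Two different instruction streams operating on two different data sets. */
    static void *sum_stream(void *arg) {
        int *data = (int *)arg;
        long sum = 0;
        for (int i = 0; i < 4; i++) sum += data[i];
        printf("sum = %ld\n", sum);
        return NULL;
    }

    static void *max_stream(void *arg) {
        int *data = (int *)arg;
        int max = data[0];
        for (int i = 1; i < 4; i++) if (data[i] > max) max = data[i];
        printf("max = %d\n", max);
        return NULL;
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        int b[4] = {7, 5, 9, 2};
        pthread_t t1, t2;
        /* Each thread executes its own instructions on its own data: MIMD. */
        pthread_create(&t1, NULL, sum_stream, a);
        pthread_create(&t2, NULL, max_stream, b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }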

5.1.1 Challenges in Parallelism


The following are the design challenges in parallelism:
Available parallelism: the program must expose enough independent work to keep all
processors busy.
Load balance: some processors work while others wait, due to insufficient
parallelism or unequal-size tasks. For example, if one of four processors is
assigned half of the total work, the overall speedup is limited to 2.
Extra work: managing parallelism, redundant computation, and communication.
5.1.2 HARDWARE MULTITHREADING
Multithreading enables the processing of multiple threads at one time, rather than
multiple processes. Since threads are smaller, more basic units of work than
processes, multithreading may occur within a process. A thread is an instruction
stream with state (registers and memory); the register state is also called the
thread context. Threads can be part of the same process or come from different
programs. Threads in the same program share the same address space and hence
consume fewer resources. Fig 5 shows hardware multithreading.
The terms multithreading, multiprocessing and multitasking are often used
interchangeably, but each has its own meaning:
Multitasking: executing multiple tasks simultaneously. In multitasking, when a new
thread needs to be executed, the old thread's context in hardware is written back to
memory and the new thread's context is loaded.
Multiprocessing: using two or more CPUs within a single computer system.
Multithreading: executing several parts of a program in parallel by dividing the
specific operations within a single application into individual threads.
Granularity: threads are categorized based on the amount of work done by the
thread; this is known as granularity. How often the hardware switches among the
hardware thread contexts determines the granularity of multithreading.

Hardware vs Software multithreading

Software Multithreading                          Hardware Multithreading

Execution of concurrent threads is               Execution of concurrent threads is
supported by the OS.                             supported by the CPU.

A large number of threads can be spawned.        Only a very limited number of
                                                 threads can be spawned.

Context switching is heavy and involves          Context switching is light/immediate
more operations.                                 and involves limited operations.

Hardware multithreading means having multiple thread contexts available in the same
processor; this is supported by the CPU.

The following are the objectives of hardware multithreading:
To tolerate the latency of memory operations, dependent instructions and branch
resolution by utilizing processing resources more efficiently: when one thread
encounters a long-latency operation, the processor can execute a useful operation
from another thread.
To improve system throughput by exploiting thread-level parallelism and improving
superscalar processor utilization.
To reduce the context switch penalty.
Advantages of hardware multithreading:
Latency tolerance
Better hardware utilization
Reduced context switch penalty
Costs of hardware multithreading:
Requires multiple thread contexts to be implemented in hardware
Usually reduces single-thread performance
Resource sharing and contention
Switching penalty (can be reduced with additional hardware)
Types of hardware multithreading
Hardware multithreading is classified, based on the granularity of the threads, as:
Fine grained
Coarse grained
Simultaneous
Fine Grained Multithreading
Here, the CPU switches to another thread every cycle, so that no two instructions
from the same thread are in the pipeline at the same time. Hence it is also known as
interleaved multithreading.

Fine-grained multithreading is a mechanism in which switching among threads
happens regardless of any cache miss or stall caused by the thread's instructions.
The threads are executed in a round-robin fashion in consecutive cycles.
The CPU checks every cycle whether the current thread is stalled.
If it is stalled, a hardware scheduler switches execution to another thread that is
ready to run.
Since the hardware checks for stalls every cycle, all stall types can be dealt with,
even single-cycle stalls.
This improves pipeline utilization by taking advantage of multiple threads.
It tolerates control and data dependency latencies by overlapping the latency with
useful work from other threads.
Fine-grained parallelism is best exploited in architectures which support fast
communication.
A shared memory architecture, which has low communication overhead, is most
suitable for fine-grained parallelism.
This requires more threads to keep the CPU busy; a minimal sketch of the
round-robin policy follows.
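The sketch below is a small cycle-by-cycle simulation, not from the source; the
thread count and stall pattern are invented purely for illustration. Every cycle
the hardware moves to the next thread context, skipping threads that are stalled:

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 4

    /* Models whether thread t is stalled in a given cycle (e.g., waiting on
       a cache miss); the pattern is invented for illustration. */
    static bool stalled(int t, int cycle) { return (t + cycle) % 5 == 0; }

    int main(void) {
        int next = 0; /* round-robin pointer over hardware thread contexts */
        for (int cycle = 0; cycle < 12; cycle++) {
            int issued = -1;
            /* Check each context, starting from the round-robin pointer. */
            for (int k = 0; k < NTHREADS; k++) {
                int t = (next + k) % NTHREADS;
                if (!stalled(t, cycle)) { issued = t; break; }
            }
            if (issued >= 0)
                printf("cycle %2d: issue from thread %d\n", cycle, issued);
            else
                printf("cycle %2d: bubble (all threads stalled)\n", cycle);
            next = (next + 1) % NTHREADS; /* switch thread every cycle */
        }
        return 0;
    }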
Advantages:
No need for dependency checking between instructions, since only one instruction
from a single thread is in the pipeline.
No need for branch prediction logic.
Bubble cycles are used for executing useful instructions from other threads.
Improved system throughput, latency tolerance and utilization.
Disadvantages:
Extra hardware complexity due to the implementation of multiple hardware
contexts and thread selection logic.
Reduced single-thread performance, since an instruction is fetched from each
thread only every N cycles.
Resource contention between threads in caches and memory.
Dependency checking logic between threads remains.
Coarse grained multithreading
In this type, the instructions of a single thread are executed successively until an
event in the current execution thread causes latency. This delay event induces a
context switch.

Coarse-grained multithreading is a mechanism in which a switch only happens when
the thread in execution causes a stall that would otherwise waste clock cycles.
When a thread is stalled due to some event, the CPU switches to a different hardware
context. This is known as switch-on-event multithreading or blocked multithreading.
This is less efficient than fine-grained multithreading, but it requires only a few
threads to improve CPU utilization.
The events that cause latency or stalls are: cache misses, synchronization events,
and floating-point operations. A minimal sketch of this switch-on-event policy
follows.
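The sketch below is a small simulation, not from the source; the stall pattern is
invented purely for illustration. The pipeline keeps issuing from one thread and
switches context only when that thread stalls:

    #include <stdio.h>
    #include <stdbool.h>

    #define NTHREADS 2

    /* Models a long-latency event (e.g., a cache miss) for thread t at a
       given cycle; the pattern is invented for illustration. */
    static bool stall_event(int t, int cycle) { return (cycle + 3 * t) % 7 == 0; }

    int main(void) {
        int current = 0; /* thread whose instructions are being issued */
        for (int cycle = 0; cycle < 14; cycle++) {
            if (stall_event(current, cycle)) {
                /* Switch-on-event: only a stall triggers a context switch. */
                printf("cycle %2d: thread %d stalls, switch context\n",
                       cycle, current);
                current = (current + 1) % NTHREADS;
            } else {
                printf("cycle %2d: issue from thread %d\n", cycle, current);
            }
        }
        return 0;
    }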
Resource sharing in space and time always requires fairness considerations. This
is implemented by considering how much progress each thread makes.
The time allocated to each thread affects both fairness and system throughput. The
allocation strategy depends on the answers to the following questions:
When do we switch?
For how long do we switch?
When do we switch back?
How does the hardware scheduler interact with the software scheduler for
fairness?
What is the switching overhead vs. benefit?
Where do we store the contexts?
A trade-off must be made between fairness and system throughput. One option is to
switch not only on a miss but also on data return; this has a severe problem
because switching has a performance overhead: it requires flushing the pipeline and
instruction window, reduces locality, and increases resource contention.
One possible solution is to estimate the slowdown of each thread compared to when
it runs alone, and then enforce switching when the slowdowns become significantly
unbalanced.
Advantages:
Simpler to implement; can eliminate dependency checking and branch prediction
logic completely.
Switching need not have any performance overhead.
Disadvantages:
Low single-thread performance: each thread gets 1/Nth of the bandwidth of the
pipeline.
Higher switching overhead with deep pipelines and large instruction windows.
Simultaneous Multithreading (SMT)
Here, instructions can be issued from multiple threads in any given cycle.
Instructions are simultaneously issued from multiple threads to the execution units
of a superscalar processor; thus, the wide superscalar instruction issue is combined
with the multiple-context approach.
In fine-grained and coarse-grained architectures, the processor can start executing
instructions from only a single thread in a given cycle. Execution unit or pipeline
stage utilization can be low if a thread does not have enough instructions to
dispatch in one cycle.
Unused instruction slots, which arise from latencies during the pipelined execution
of single-threaded programs, are filled by instructions of other threads within a
multithreaded processor. The execution units are multiplexed among those thread
contexts that are loaded in the register sets.
Underutilization of a superscalar processor due to missing instruction-level
parallelism can be overcome by simultaneous multithreading, where the processor can
issue multiple instructions from multiple threads in each cycle.
Simultaneous multithreaded processors combine the multithreading technique with a
wide-issue superscalar processor to utilize a larger part of the issue bandwidth by
issuing instructions from different threads simultaneously, as in the sketch that
follows.
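The sketch below is a small simulation, not from the source; the issue width,
thread count and per-thread ready counts are invented purely for illustration.
Each cycle, up to WIDTH issue slots are filled from whichever threads have ready
instructions:

    #include <stdio.h>

    #define NTHREADS 3
    #define WIDTH 4  /* superscalar issue width (slots per cycle) */

    /* How many instructions thread t could issue in a given cycle;
       the values are invented for illustration. */
    static int ready(int t, int cycle) { return (t + cycle) % 3; }

    int main(void) {
        for (int cycle = 0; cycle < 6; cycle++) {
            int slots = WIDTH;
            printf("cycle %d:", cycle);
            /* Fill the issue slots from multiple threads in the same cycle. */
            for (int t = 0; t < NTHREADS && slots > 0; t++) {
                int n = ready(t, cycle);
                if (n > slots) n = slots;
                if (n > 0) {
                    printf(" T%d x%d", t, n);
                    slots -= n;
                }
            }
            printf("  [%d slot(s) unused]\n", slots);
        }
        return 0;
    }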

Fig 5: Hardware multithreading


Source: David A. Patterson and John L. Hennessy, "Computer Organization and Design"
