Lecture 2

This document discusses different platforms for parallel computing. It describes how parallelism addresses performance bottlenecks in processors, memory systems, and data paths. It then discusses two approaches to implicit parallelism: pipelining and superscalar execution, and very long instruction word (VLIW) processors. Pipelining overlaps instruction execution stages to improve performance. Superscalar processors issue multiple instructions concurrently by exploiting instruction-level parallelism. VLIW processors rely on compile-time analysis to bundle instructions for parallel execution.

Parallel Computing Platforms

Implicit Parallelism
Scope of Parallelism
• Conventional architectures coarsely comprise a processor, a
memory system, and a datapath.
• Each of these components presents significant performance
bottlenecks.
• A number of architectural innovations over the years have
addressed these bottlenecks.
• One of the most important innovations is multiplicity – in
processing units, datapaths, and memory units.
• This multiplicity is either entirely hidden from the programmer,
as in the case of implicit parallelism, or exposed to the
programmer in different forms.
Scope of Parallelism

• Parallelism addresses each of these components (processor,
memory system, and datapath) in significant ways.
• Different applications utilize different aspects of parallelism -
e.g., data-intensive applications utilize high aggregate
throughput, server applications utilize high aggregate network
bandwidth, and scientific applications typically utilize high
processing and memory system performance.
• It is important to understand each of these performance
bottlenecks.
Implicit Parallelism
• Pipelining and Superscalar Execution
• Very Long Instruction Word Processors
Implicit Parallelism: Trends in Microprocessor
Architectures
• Microprocessor clock speeds have posted impressive gains
over the past two decades (two to three orders of magnitude).
• Higher levels of device integration have made available a
large number of transistors.
• The question of how best to utilize these resources is an
important one.
• Current processors use these resources in multiple functional
units and execute multiple instructions in the same cycle.
• The precise manner in which these instructions are selected
and executed provides impressive diversity in architectures.
1. Pipelining and Superscalar Execution

• Pipelining overlaps various stages of instruction execution to
achieve performance.
• At a high level of abstraction, an instruction can be executed
while the next one is being decoded and the one after that is
being fetched.
• A real-life example is an assembly line for the manufacture of
cars:
– Divide the process into multiple pipelined stages
– Each pipeline stage has multiple units
– This leads to a multi-fold speedup compared to serial production
(a worked estimate follows below)
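As a back-of-the-envelope illustration (not from the original slides), the ideal speedup of a k-stage pipeline executing n instructions, compared to unpipelined execution, is

\[
S(n, k) = \frac{n \cdot k}{k + (n - 1)} \longrightarrow k \quad \text{as } n \to \infty
\]

since unpipelined execution takes n·k cycles, while the pipeline takes k cycles for the first instruction and one additional cycle for each of the remaining n - 1. For example, with k = 5 and n = 1000, S ≈ 5000/1004 ≈ 4.98, close to the ideal five-fold speedup.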
Pipelining and Superscalar Execution

• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest
stage.
• For this reason (keeping each stage short allows a faster clock),
conventional processors rely on very deep pipelines (20-stage
pipelines in state-of-the-art Pentium processors).
• However, in typical program traces, every fifth or sixth
instruction is a conditional jump! This requires very accurate
branch prediction.
• The penalty of a misprediction grows with the depth of the
pipeline, since a larger number of in-flight instructions must be
flushed (a rough cost estimate follows below).
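As a hedged illustration of this penalty (my numbers, not the slides'), the average cycles per instruction (CPI) of a pipelined processor can be approximated as

\[
\text{CPI} \approx 1 + f_b \cdot m \cdot p
\]

where f_b is the fraction of conditional branches (about 1/5 to 1/6, per the trace statistics above), m is the misprediction rate, and p is the flush penalty in cycles, which grows with pipeline depth. With f_b = 0.2, m = 0.05, and p = 20, CPI ≈ 1.2, i.e., a 20% slowdown from mispredictions alone.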
Pipelining and Superscalar Execution

• One simple way of alleviating these bottlenecks is to use
multiple pipelines.
• During each clock cycle, multiple instructions are piped into
the processor in parallel.
• These instructions are executed on multiple functional units.
Superscalar Execution: An Example

[Figure: example of two-way superscalar execution of instructions.]
Superscalar Execution: An Example

• In the above example, there is some wastage of resources due
to data dependencies.
• The example also illustrates that different instruction mixes
with identical semantics can take significantly different
execution times (see the C sketch below).
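The effect can be reproduced at the source level. Below is a minimal C sketch (my illustration, not the slides' original example): both functions compute the same sum, but the first forms a serial dependence chain, while the second exposes two independent additions that a two-way superscalar core can issue in the same cycle.

/* Both functions compute a + b + c + d. */

/* Serial chain: each add depends on the previous result,
 * so the second issue slot goes unused (horizontal waste). */
int sum_chain(int a, int b, int c, int d) {
    int t1 = a + b;     /* cycle 1 */
    int t2 = t1 + c;    /* cycle 2: waits on t1 */
    return t2 + d;      /* cycle 3: waits on t2 */
}

/* Balanced tree: the first two adds are independent and can
 * be co-issued on a two-way superscalar core. */
int sum_tree(int a, int b, int c, int d) {
    int t1 = a + b;     /* cycle 1, unit 1 */
    int t2 = c + d;     /* cycle 1, unit 2 */
    return t1 + t2;     /* cycle 2 */
}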
Superscalar Execution

• Scheduling of instructions is determined by a number of
factors (illustrated in the sketch below):
– True Data Dependency: The result of one operation is an input
to the next.
– Resource Dependency: Two operations require the same
resource.
– Branch Dependency: Scheduling instructions across conditional
branch statements cannot be done deterministically a priori.
– The scheduler, a piece of hardware, looks at a large number of
instructions in an instruction queue and selects an appropriate
number of instructions to execute concurrently based on these
factors.
– The complexity of this hardware is an important constraint on
superscalar processors.
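A small hedged C sketch of the three dependency types (variable and function names are my own):

int schedule_example(int x, int y, const int *buf) {
    int a = x * y;      /* I1 */
    int b = a + 1;      /* I2: true data dependency -- consumes I1's result */
    int c = x * 7;      /* I3: resource dependency -- competes with I1 for a
                           (single) multiplier unit */
    if (a > 0)          /* I4: branch dependency -- whether the instruction */
        c = buf[0];     /*     after the branch executes is unknown a priori */
    return b + c;
}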
Superscalar Execution:
Issue Mechanisms
• In the simpler model, instructions can be issued only in the
order in which they are encountered.
• That is, if the second instruction cannot be issued because it
has a data dependency with the first, only one instruction is
issued in the cycle.
• This is called in-order issue.
Superscalar Execution:
Issue Mechanisms
• In a more aggressive model, instructions can be issued out of
order.
• In this case, if the second instruction has data dependencies
with the first, but the third instruction does not, the first and
third instructions can be co-scheduled (see the sketch below).
• This is also called dynamic issue.
• Performance of in-order issue is generally limited.
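A minimal C sketch of the difference (my example): I2 depends on I1, but I3 is independent of both.

int issue_example(int x, int y) {
    int a = x + y;      /* I1 */
    int b = a * 2;      /* I2: depends on I1 */
    int c = x - y;      /* I3: independent of I1 and I2 */
    return b + c;
}
/* In-order issue: cycle 1 issues only I1 (I2 stalls, so I3 waits too).
 * Dynamic (out-of-order) issue: I1 and I3 co-issue in cycle 1; I2
 * issues in cycle 2, once I1's result is available. */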
Superscalar Execution:
Efficiency Considerations
• Not all functional units can be kept busy at all times.
• If during a cycle no functional units are utilized, this is referred
to as vertical waste.
• If during a cycle only some of the functional units are utilized,
this is referred to as horizontal waste (both are quantified below).
• Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to extract
parallelism, the performance of superscalar processors is
eventually limited.
• Conventional microprocessors typically support four-way
superscalar execution.
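These losses can be quantified as follows (my notation, not the slides'): for a w-way superscalar processor running for C cycles, with u_t functional units busy in cycle t, utilization and the two kinds of waste are

\[
U = \frac{\sum_{t=1}^{C} u_t}{wC}, \qquad
W_v = \frac{w \cdot \lvert \{ t : u_t = 0 \} \rvert}{wC}, \qquad
W_h = \frac{\sum_{t :\, u_t > 0} (w - u_t)}{wC}
\]

so that U + W_v + W_h = 1: every issue slot is either used, lost to an entirely idle cycle, or lost to a partially idle one.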
2. Very Long Instruction Word (VLIW)
Processors
• The hardware cost and complexity of the superscalar scheduler
is a major consideration in processor design.
• To address these issues, VLIW processors rely on compile-time
analysis to identify and bundle together instructions that can be
executed concurrently.
• These instructions are packed and dispatched together, and
thus the name very long instruction word (a sketch follows below).
• This concept was used with some commercial success in the
Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64
processors.
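A hedged sketch of the idea (illustrative only; real VLIW encodings and slot assignments differ): the compiler locates independent operations and packs them into one wide word, one slot per functional unit.

/* Three independent operations that a VLIW compiler could pack
 * into one hypothetical 3-slot word: [ add | mul | load ].
 * If no independent operation is available for a slot, the
 * compiler must fill it with a nop -- horizontal waste that is
 * fixed at compile time. */
int vliw_bundle(int b, int c, int e, int f, const int *p) {
    int a = b + c;      /* slot 1: integer ALU */
    int d = e * f;      /* slot 2: multiplier  */
    int g = *p;         /* slot 3: load unit   */
    return a + d + g;   /* next word: depends on all three */
}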
Very Long Instruction Word (VLIW)
Processors: Considerations
• The compiler has a bigger context from which to select co-
scheduled instructions.
• Compilers, however, do not have runtime information such as
cache misses. Scheduling is, therefore, inherently
conservative.
• Branch and memory prediction are more difficult.
• VLIW performance is highly dependent on the compiler. A
number of techniques such as loop unrolling, speculative
execution, and branch prediction are critical (see the
loop-unrolling sketch below).
• Typical VLIW processors are limited to 4-way to 8-way
parallelism.
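As one concrete example of these techniques, a minimal loop-unrolling sketch in C (my example): unrolling by four hands the compiler four independent loop bodies to pack into wide instruction words.

/* Original loop: one multiply per iteration, little to bundle. */
void scale(float *y, const float *x, float k, int n) {
    for (int i = 0; i < n; i++)
        y[i] = k * x[i];
}

/* Unrolled by 4 (assumes n is a multiple of 4 for brevity):
 * the four multiplies are mutually independent, so a VLIW
 * compiler can schedule them into the same or adjacent words. */
void scale4(float *y, const float *x, float k, int n) {
    for (int i = 0; i < n; i += 4) {
        y[i]     = k * x[i];
        y[i + 1] = k * x[i + 1];
        y[i + 2] = k * x[i + 2];
        y[i + 3] = k * x[i + 3];
    }
}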
Thank You
