Lecture 2


Parallel Computing Platforms

Implicit Parallelism
Scope of Parallelism
• Conventional architectures coarsely comprise a processor, a
memory system, and a datapath.
• Each of these components presents significant performance
bottlenecks.
• A number of architectural innovations over the years have
addressed these bottlenecks.
• One of the most important innovations is multiplicity – in
processing units, datapaths, and memory units.
• This multiplicity is either entirely hidden from the programmer,
as in the case of implicit parallelism, or exposed to the
programmer in different forms.
Scope of Parallelism

• Parallelism addresses each of these components (processor,
memory system, and datapath) in significant ways.
• Different applications utilize different aspects of parallelism -
e.g., data intensive applications utilize high aggregate
throughput, server applications utilize high aggregate network
bandwidth, and scientific applications typically utilize high
processing and memory system performance.
• It is important to understand each of these performance
bottlenecks.
Implicit Parallelism
• Pipelining and Superscalar Execution
• Very Long Instruction Word Processors
Implicit Parallelism: Trends in Microprocessor
Architectures
• Microprocessor clock speeds have posted impressive gains
over the past two decades (two to three orders of magnitude).
• Higher levels of device integration have made available a
large number of transistors.
• The question of how best to utilize these resources is an
important one.
• Current processors use these resources in multiple functional
units and execute multiple instructions in the same cycle.
• The precise manner in which these instructions are selected
and executed provides impressive diversity in architectures.
1. Pipelining and Superscalar Execution

• Pipelining overlaps various stages of instruction execution to
improve performance.
• At a high level of abstraction, an instruction can be executed
while the next one is being decoded and the next one is being
fetched.
• A real-life example is an assembly line for the manufacture
of cars.
– Divide the process into multiple pipelined stages
– Each pipeline stage has multiple units
– This leads to a multi-fold speedup as compared to serial production
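The assembly-line speedup argument can be sketched numerically. A minimal model with made-up instruction and stage counts (not figures from the slides):

```python
# Toy model of pipeline throughput (illustrative numbers only).
def serial_cycles(n_instructions, n_stages):
    # Unpipelined: each instruction passes through every stage
    # before the next instruction starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # Pipelined: n_stages cycles to fill the pipe, then one
    # instruction completes per cycle.
    return n_stages + (n_instructions - 1)

n, k = 100, 5
speedup = serial_cycles(n, k) / pipelined_cycles(n, k)
print(f"{k}-stage pipeline, {n} instructions: speedup {speedup:.2f}x")
```

For large n the speedup approaches k, the number of stages, which is the ideal case the assembly-line analogy suggests.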
Pipelining and Superscalar Execution

• Pipelining, however, has several limitations.
• The speed of a pipeline is eventually limited by the slowest
stage.
• To keep each stage short and the clock fast, conventional
processors rely on very deep pipelines (20-stage pipelines in
state-of-the-art Pentium processors).
• However, in typical program traces, every fifth or sixth
instruction is a conditional jump! This requires very accurate
branch prediction.
• The penalty of a misprediction grows with the depth of the
pipeline, since a larger number of in-flight instructions must be
flushed.
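The interaction of pipeline depth and branch cost can be illustrated with a back-of-the-envelope model. The branch frequency below matches the one-in-five-to-six figure above; the misprediction rate is an assumed value:

```python
# Rough cost model for branch mispredictions (assumed misprediction rate).
def effective_cpi(depth, branch_freq=1/6, mispredict_rate=0.05):
    # Each misprediction flushes roughly `depth` in-flight instructions,
    # so the expected penalty per instruction grows with pipeline depth.
    return 1.0 + branch_freq * mispredict_rate * depth

for depth in (5, 10, 20):
    print(f"{depth}-stage pipeline: effective CPI = {effective_cpi(depth):.3f}")
```

Doubling the depth doubles the misprediction penalty term, which is why deep pipelines demand very accurate branch predictors.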
Pipelining and Superscalar Execution

• One simple way of alleviating these bottlenecks is to use
multiple pipelines.
• During each clock cycle, multiple instructions are piped into
the processor in parallel.
• These instructions are executed on multiple functional units.
Superscalar Execution: An Example

Example of two-way superscalar execution of instructions.
Superscalar Execution: An Example

• In the above example, there is some wastage of resources
due to data dependencies.
• The example also illustrates that different instruction mixes
with identical semantics can take significantly different
execution times.
Superscalar Execution

• Scheduling of instructions is determined by a number of
factors:
– True Data Dependency: The result of one operation is an input
to the next.
– Resource Dependency: Two operations require the same
resource.
– Branch Dependency: Scheduling instructions across conditional
branch statements cannot be done deterministically a priori.
– The scheduler, a piece of hardware, looks at a large number of
instructions in an instruction queue and selects an appropriate
number of them to execute concurrently based on these factors.
– The complexity of this hardware is an important constraint on
superscalar processors.
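The true-data-dependency check can be sketched for hypothetical three-address instructions of the form (dest, src1, src2):

```python
# True (RAW) dependency check for hypothetical (dest, src1, src2) tuples.
def true_dependency(earlier, later):
    # The later instruction reads a register the earlier one writes.
    dest = earlier[0]
    return dest in later[1:]

i1 = ("r1", "r2", "r3")   # r1 = r2 + r3
i2 = ("r4", "r1", "r5")   # r4 = r1 + r5: true dependency on i1
i3 = ("r6", "r7", "r8")   # r6 = r7 + r8: independent of i1
print(true_dependency(i1, i2))  # True
print(true_dependency(i1, i3))  # False
```

Resource dependencies would additionally need a model of the functional units, and branch dependencies are what force the scheduler to predict rather than know.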
Superscalar Execution:
Issue Mechanisms
• In the simpler model, instructions can be issued only in the
order in which they are encountered.
• That is, if the second instruction cannot be issued because it
has a data dependency with the first, only one instruction is
issued in the cycle.
• This is called in-order issue.
Superscalar Execution:
Issue Mechanisms
• In a more aggressive model, instructions can be issued out of
order.
• In this case, if the second instruction has data dependencies
with the first, but the third instruction does not, the first and
third instructions can be co-scheduled.
• This is also called dynamic issue.
• Performance of in-order issue is generally limited.
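A toy two-way issue model, under the simplifying assumptions that a dependency only blocks co-issue within the same cycle and that results are available in the next cycle, shows how dynamic issue can beat in-order issue:

```python
# Simplified two-way issue: compare in-order and out-of-order (dynamic)
# scheduling of hypothetical (dest, src1, src2) instructions.
def depends(earlier, later):
    # RAW hazard: `later` reads the register `earlier` writes.
    return earlier[0] in later[1:]

def in_order_cycles(prog):
    cycles, i = 0, 0
    while i < len(prog):
        # Co-issue the next instruction only if it is independent.
        if i + 1 < len(prog) and not depends(prog[i], prog[i + 1]):
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles

def dynamic_issue_cycles(prog):
    pending = list(prog)
    cycles = 0
    while pending:
        first = pending.pop(0)
        # Co-issue the first later instruction that depends neither on
        # `first` nor on any earlier instruction still waiting to issue.
        for j, ins in enumerate(pending):
            if not depends(first, ins) and \
               not any(depends(e, ins) for e in pending[:j]):
                pending.pop(j)
                break
        cycles += 1
    return cycles

prog = [("r1", "r2", "r3"),    # i1
        ("r4", "r1", "r5"),    # i2: depends on i1
        ("r6", "r7", "r8"),    # i3: independent
        ("r9", "r10", "r11")]  # i4: independent
print(in_order_cycles(prog), dynamic_issue_cycles(prog))  # 3 2
```

In-order issue stalls the independent i3 and i4 behind the i1-i2 dependency; dynamic issue co-schedules i1 with i3 and then i2 with i4, finishing a cycle earlier.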
Superscalar Execution:
Efficiency Considerations
• Not all functional units can be kept busy at all times.
• If during a cycle, no functional units are utilized, this is referred
to as vertical waste.
• If during a cycle, only some of the functional units are utilized,
this is referred to as horizontal waste.
• Due to limited parallelism in typical instruction traces, data
and resource dependencies, and the scheduler's limited ability
to extract parallelism, the performance of superscalar
processors is eventually limited.
• Conventional microprocessors typically support four-way
superscalar execution.
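These two kinds of waste can be counted from a cycle-by-cycle occupancy trace. The trace below is made up for illustration, on a four-way machine as mentioned above:

```python
# Count vertical and horizontal waste in a made-up occupancy trace.
def waste(trace, width=4):
    # trace[c] = number of functional units busy in cycle c.
    vertical = sum(1 for busy in trace if busy == 0)
    horizontal = sum(width - busy for busy in trace if 0 < busy < width)
    utilization = sum(trace) / (width * len(trace))
    return vertical, horizontal, utilization

trace = [4, 2, 0, 3, 1, 4]       # hypothetical 4-way machine, 6 cycles
v, h, u = waste(trace)
print(f"vertical-waste cycles: {v}, horizontal-waste slots: {h}, "
      f"utilization: {u:.0%}")
```

Fully idle cycles count as vertical waste, partially filled cycles contribute horizontal waste, and together they bound the achievable utilization.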
2. Very Long Instruction Word (VLIW)
Processors
• The hardware cost and complexity of the superscalar scheduler
is a major consideration in processor design.
• To address this issue, VLIW processors rely on compile-time
analysis to identify and bundle together instructions that can be
executed concurrently.
• These instructions are packed and dispatched together, and
thus the name very long instruction word.
• This concept was used with some commercial success in the
Multiflow Trace machine (circa 1984).
• Variants of this concept are employed in the Intel IA64
processors.
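Compile-time bundling can be sketched as a greedy pass over hypothetical (dest, src1, src2) instructions; real VLIW compilers also reorder code and apply far more analysis:

```python
# Greedy, in-order VLIW bundling sketch: pack mutually independent
# instructions into one long instruction word (hypothetical format).
def conflict(a, b):
    # True if the two instructions cannot share a word: one reads the
    # other's destination, or both write the same register.
    return a[0] in b[1:] or b[0] in a[1:] or a[0] == b[0]

def bundle(prog, width=4):
    words, current = [], []
    for ins in prog:
        if len(current) < width and not any(conflict(e, ins) for e in current):
            current.append(ins)
        else:
            words.append(current)
            current = [ins]
    if current:
        words.append(current)
    return words

prog = [("r1", "r2", "r3"), ("r4", "r5", "r6"),
        ("r7", "r1", "r4"),                      # reads r1 and r4
        ("r8", "r9", "r10")]
for word in bundle(prog):
    print(word)
```

Because all dependence analysis happens before execution, the hardware needs no dynamic scheduler, which is exactly the cost and complexity saving VLIW targets.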
Very Long Instruction Word (VLIW)
Processors: Considerations
• The compiler has a much larger context from which to select
instructions for co-scheduling.
• Compilers, however, do not have runtime information such as
cache misses. Scheduling is, therefore, inherently
conservative.
• Branch and memory prediction is more difficult.
• VLIW performance is highly dependent on the compiler.
Techniques such as loop unrolling, speculative execution, and
branch prediction are critical.
• Typical VLIW processors are limited to 4-way to 8-way
parallelism.
Thank You
