
High Performance

Computing

Introduction to Pipeline

Dr. SUBHASIS DASH


SCHOOL OF COMPUTER ENGINEERING.
KIIT UNIVERSITY
BHUBANESWAR

1
Introduction
– Computer performance has been
increasing phenomenally over the
last five decades.
– Brought out by Moore’s Law:
● Transistors per square inch roughly double
every eighteen months.
– Moore’s law is not exactly a law:
● but has held good for nearly 50 years.

2
Moore’s Law

● Gordon Moore (co-founder of Intel) predicted in 1965: “Transistor density
of minimum cost semiconductor chips would double roughly every 18
months.”

[Chart: Moore’s Law: it’s worked for a long time]

● Transistor density is correlated to processing speed.
3
Trends Related to Moore’s Law
Cont…

• Processor performance:
• Twice as fast after every 2 years
(roughly).
• Memory capacity:
• Twice as much after every 18 months
(roughly).

4
Interpreting Moore’s Law
● Moore's law is not about just the density of
transistors on a chip that can be achieved:
– But about the density of transistors at which the
cost per transistor is the lowest.
● As more transistors are made on a chip:
– The cost to make each transistor reduces.
– But the chance that the chip will not work due to a
defect rises.
● Moore observed in 1965 there is a transistor
density or complexity:
– At which "a minimum cost" is achieved.
5
How Did Performance Improve?
● Till 1980s, most of the performance improvements
came from using innovations in manufacturing
technologies:
– VLSI
– Reduction in feature size
● Improvements due to innovations in manufacturing
technologies have slowed down since 1980s:
– Smaller feature size gives rise to increased resistance,
capacitance, propagation delays.
– Larger power dissipation.
(Aside: What is the power consumption of Intel Pentium Processor?
Roughly 100 watts idle)
6
How Did Performance Improve?
Cont…

● Since the 1980s, most of the performance
improvements have come from:
– Architectural and organizational
innovations
● What is the difference between:
– Computer architecture and computer
organization?
7
Architecture vs. Organization
● Architecture:
– Also known as Instruction Set Architecture
(ISA)
– Programmer visible part of a processor:
instruction set, registers, addressing modes, etc.
● Organization:
– High-level design: how many caches? how many
arithmetic and logic units? What type of
pipelining, control design, etc.
– Sometimes known as micro-architecture
8
Computer Architecture
● The structure of a computer that a
machine language programmer must
understand:
– To be able to write a correct program for
that machine.
● A family of computers of the same
architecture should be able to run the
same program.
– Thus, the notion of architecture leads to
“binary compatibility.” 9
Course Objectives
● Modern processors such as Intel
Pentium, AMD Athlon, etc. use:
– Many architectural and organizational
innovations not covered in a first course.
– Innovations in memory, bus, and storage
designs as well.
– Multiprocessors and clusters

● In this light, the objective of this course is to:
– Study the architectural and organizational
innovations used in modern computers.
10
A Few Architectural and
Organizational Innovations
● RISC (Reduced Instruction Set Computers):
– Exploited instruction-level parallelism:
● Initially through pipelining and later by using
multiple instruction issue (superscalar)
– Use of on-chip caches
● Dynamic instruction scheduling
● Branch prediction
11
Intel MultiCore Architecture
● Improving execution rate of a single-
thread is still considered important:
– Uses out-of-order execution and speculation.
● Multicore architecture:
– Can reduce power consumption.
– The Core pipeline (14 stages) is closer to the Pentium
M (12 stages) than the P4 (30 stages).
● Many transistors are invested in large
branch predictors:
– To reduce wasted work (power). 12
Intel’s Dual Core Architectures
● The Pentium D is simply two Pentium 4 CPUs:
– Inefficiently paired together to run as a dual core.
● Core Duo is Intel's first-generation dual-core processor
based upon the Pentium M (a Pentium III-4 hybrid):
– Made mostly for laptops; much more efficient than the
Pentium D.
● Core 2 Duo is Intel's second-generation (hence, Core
2) processor:
– Made for desktops and laptops; designed to be fast while
not consuming nearly as much power as previous CPUs.
● Intel has now dropped the Pentium name in favor of
the Core architecture.
13
Intel Core Processor

14
Intel Core 2 Duo
• Code named “Conroe”
• Homogeneous cores
• Bus-based chip interconnect
• Shared on-die cache memory
Source: Intel Corp.

[Block diagram annotations: classic out-of-order core (reservation stations, issue ports, schedulers, etc.); L2 cache (large, shared, set-associative, with prefetching)]
15
Intel Core Processor
Specification
● Speeds: 1.06 GHz to 3 GHz
● FSB speeds: 533 MT/s to 1333 MT/s
● Process: 0.065 µm (MOSFET channel length)
● Instruction set: x86, MMX, SSE, SSE2, SSE3, SSSE3, x86-64
● Microarchitecture: Intel Core microarchitecture
● Cores: 1, 2, or 4 (2x2)
16
Core 2 Duo Microarchitecture

17
Why Sharing On-Die L2?

• What happens when L2 is too large?

18
Xeon and Opteron

[Diagram: legacy Intel Xeon platform, with dual-core CPUs sharing a front-side bus to a memory controller hub, I/O hub, and PCI-E bridges, vs. the AMD Opteron platform, with dual-core CPUs directly connected to memory and I/O]

● Legacy x86 architecture:
– 20-year-old front-side bus architecture
– CPUs, memory, and I/O all share a bus
– A bottleneck to performance
– Faster CPUs or more cores ≠ more performance
● AMD64:
– Direct Connect Architecture eliminates the FSB bottleneck.
19
Xeon vs. Opteron

[Diagram: Intel Xeon platform with four dual-core CPUs on a shared front-side bus, a memory controller hub with XMB memory bridges, an I/O hub, and PCI-E bridges, vs. the AMD Opteron platform with four directly interconnected dual-core CPUs, each with its own local memory]

20
WHY
 Now as for why, let me also try to explain that.
 There are numerous reasons; the FSB is one, but not the only one.
 The biggest factor is something called instructions per clock cycle (IPC).
 The GHz number of a processor shows how fast the clock cycles are.
 But the Core 2 Duo has a much higher IPC, meaning it processes more data
each clock cycle than the Pentium D.
 Put in its simplest terms, think of it this way: say the Pentium D has 10
clock cycles in a given amount of time, while the Core 2 Duo has only 6.
 But the Core 2 Duo can process 10 instructions per clock cycle, while the
Pentium D can process only 4.
 In the given amount of time, the Core 2 Duo would process 60 instructions,
but the Pentium D only 40.
 This is why it actually performs faster even with a lower clock speed.
21
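To make the slide's IPC arithmetic concrete, here is a minimal Python sketch; the cycle counts and IPC values are the slide's illustrative numbers, not measured figures.

# Work completed = clock cycles available x instructions per cycle (IPC).
# The numbers below are the slide's illustrative values, not real measurements.
def instructions_completed(clock_cycles, ipc):
    return clock_cycles * ipc

pentium_d = instructions_completed(clock_cycles=10, ipc=4)    # 40 instructions
core_2_duo = instructions_completed(clock_cycles=6, ipc=10)   # 60 instructions
print(pentium_d, core_2_duo)  # the lower-clocked Core 2 Duo finishes more work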
Today’s Objectives
● Study some preliminary concepts:
– Amdahl’s law, performance
benchmarking, etc.
● RISC versus CISC architectures.
● Types of parallelism in programs
versus types of parallel
computers.
● Basic concepts in pipelining.
22
GO TO SLIDE NO 69

23
Measuring performance
● Real applications
– Problems occur due to dependencies on the
OS or compiler.
● Modified applications
– Modified to enhance portability, or to
focus on one particular aspect of system
performance.
● Kernels
– To isolate the performance of individual features
of the machine.
24
Toy Benchmarks
● The performance of different
computers can be compared by running
some standard programs:
– Quick sort, Merge sort, etc.
● But, the basic problem remains:
– Even if you select based on a toy
benchmark, the system may not perform
well in a specific application.
– What can be a solution then?

25
Synthetic Benchmarks
● Basic Principle: Analyze the distribution of
instructions over a large number of practical
programs.
● Synthesize a program that has the same
instruction distribution as a typical program:
– Need not compute something meaningful.
● Dhrystone, Khornerstone, and Linpack are some
of the older synthetic benchmarks:
– More recent is SPEC (Standard Performance
Evaluation Corporation).
26
SPEC Benchmarks
● SPEC: Standard Performance Evaluation
Corporation:
– A non-profit organization (www.spec.org)
● CPU-intensive benchmarks for evaluating
processor performance of workstations:
– Generations: SPEC89, SPEC92, SPEC95,
SPEC2000, …
– Memory system performance is emphasized
in SPEC2000.
27
Problems with Benchmarks
● The SPEC89 benchmark suite included a small
kernel called matrix300:
– Consists of 8 different 300×300 matrix
operations.
– Compiler optimization of this inner loop resulted in a
performance improvement by a factor of 9.
● An optimizing compiler can discard 25% of the
Dhrystone code.
● Solution: benchmark suites
28
Other SPEC Benchmarks
● SPECviewperf: 3D graphics performance
– For applications such as CAD/CAM, visualization,
content creation, etc.
● SPEC JVM98: performance of client-side
Java virtual machine.
● SPEC JBB2000: Server-side Java application
● SPEC WEB2005: evaluating WWW servers
– Contains multiple workloads utilizing both http
and https, dynamic content implemented in PHP
and JSP.
29
BAPCo
● Non-profit consortium
www.bapco.com
● SYSmark 2004 SE
– Office productivity benchmark

30
Instruction Set Architecture
(ISA)
● Programmer visible part of a processor:
– Instruction Set (what operations can be
performed?)
– Instruction Format (how are instructions
specified?)
– Registers (where are data located?)
– Addressing Modes (how is data accessed?)
– Exceptional Conditions (what happens if
something goes wrong?)
31
ISA cont…

● ISA is important:
– Not only from the programmer’s
perspective.
– From processor design and
implementation perspectives as well.

32
Evolution of Instruction Sets

Single Accumulator
(Manchester Mark I, IBM 700 series, 1953)
        |
      Stack
(Burroughs, HP-3000, 1960-70)
        |
General Purpose Register Machines
        |
Complex Instruction Sets          RISC
(VAX, Intel 386, 1977-85)         (MIPS, IBM RS6000, ... 1987)

33
Different Types of ISAs

● Determined by the means used for storing data in the CPU.
● The major choices are:
– A stack, an accumulator, or a set of
registers.
● Stack architecture:
– Operands are implicitly on top of the
stack.
34
Different Types of ISAs
cont…
● Accumulator architecture:
– One operand is in the accumulator
(register) and the others are
elsewhere.
– Essentially this is a 1-register machine.
– Found in older machines.
● General purpose registers:
– Operands are in registers or specific
memory locations.
35
Comparison of Architectures

Consider the operation: C = A + B

Stack        Accumulator    Register-Memory    Register-Register
Push A       Load A         Load R1, A         Load R1, A
Push B       Add B          Add R1, B          Load R2, B
Add          Store C        Store C, R1        Add R3, R1, R2
Pop C                                          Store C, R3
36
Types of GPR Computers

● Register-Register (0,3)
● Register-Memory (1,2)
● Memory-Memory (2,2) or (3,3)

37
Quantitative Principle of
Computer Design
● MAKE THE COMMON CASE FAST
– Favor the frequent case over the infrequent case.
– Improvement is easier for the frequent case.
– Amdahl's law quantifies the overall performance gain due to
improvement in a part of the computation.

38
Computer System Components

[Block diagram: CPU with caches on a processor-memory bus; RAM; a bus adapter connecting to peripheral buses; controllers for I/O devices, displays, keyboards, and networks]

39
Amdahl’s Law
● Quantifies the overall performance gain due to
improvement in a part of a computation (CPU bound).
● Amdahl’s Law:
– The performance improvement gained from using some
faster mode of execution is limited by the amount
of time the enhancement is actually used.

Speedup = Performance for entire task using enhancement when possible
          / Performance for entire task without using enhancement
OR
Speedup = Execution time for the task without enhancement
          / Execution time for the task using enhancement
40
Amdahl’s Law and Speedup
● Speedup tells us:
– How much faster a machine will run due to an
enhancement.
● For using Amdahl’s law two things should
be considered:
– 1st… FRACTION ENHANCED :-Fraction of the
computation time in the original machine
that can use the enhancement
– It is always less than or equal to 1
● If a program executes in 30 seconds and 15
seconds of exec. uses enhancement, fraction = ½
41
Amdahl’s Law and Speedup
– 2nd… SPEEDUP ENHANCED :-Improvement
gained by enhancement, that is how
much faster the task would run if the
enhanced mode is used for entire
program.
– Means the time of original mode over
the time of enhanced mode
– It is always greater than 1
● If enhanced task takes 3.5 seconds and
original task took 7secs, we say the
speedup is 2.
42
Amdahl’s Law Equations

Execution time_new = Execution time_old × [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = Execution time_old / Execution time_new
                = 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Use the first equation, then solve for speedup.

Don’t just try to memorize these equations and plug numbers into them.
It’s always important to think about the problem too!
43
Amdahl’s Law Example
● Suppose that we are considering an enhancement to
the processor of a server system used for web
serving. The new CPU is 10 times faster on
computation in the web serving application than the
original processor. Assuming that the original CPU is
busy with computation 40% of the time & is waiting
for I/O 60% of the time. What is the overall speed
up gained by incorporating the enhancement?
● Solution: Fraction_enhanced = 0.4, Speedup_enhanced = 10

Speedup_overall = 1 / [(1 – 0.4) + 0.4/10] ≈ 1.56
44
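A small Python sketch of Amdahl's law as written above; the function name amdahl_speedup is ours, and the call reproduces the web-serving example (fraction 0.4, enhancement 10x faster).

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Speedup_overall = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example: computation is 40% of the time, new CPU is 10x faster on it.
print(round(amdahl_speedup(0.4, 10), 2))  # ~1.56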
Corollary of Amdahl’s Law
1. Amdahl’s law expresses the law of diminishing
returns: the incremental improvement in speedup
gained by an additional improvement in the
performance of just a portion of the
computation diminishes as improvements are
added.

2. If an enhancement is only usable for a fraction
of a task, we can’t speed up the task by more
than the reciprocal of (1 – Fraction_enhanced).

45
Amdahl’s Law Example
Assume that we make an enhancement to a computer that improves some mode
of execution by a factor of 10. Enhanced mode is used 50% of the time,
measured as a percentage of the execution time when the enhanced mode is in
use. Recall that Amdahl’s law depends on the fraction of the original,
unenhanced execution time that could make use of enhanced mode. Thus we
can’t directly use this 50% measurement to compute the speedup with Amdahl’s
law. What is the speedup we have obtained from the fast mode? What
percentage of the original execution time has been converted to fast mode?
● Solution: If an enhancement is only usable for a fraction of a task, we can’t
speed up the task by more than the reciprocal of (1 – Fraction).
● Speedup = 1 / (1 – Fraction_enhanced) = 1 / (1 – 0.5) = 2
● To find the fraction of the original execution time converted to fast mode, set

  2 = 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / 10]

  and solve, giving Fraction_enhanced = 0.5/0.9 ≈ 0.56, i.e. about 56% of the original execution time.
46
Amdahl’s Law Example
● A common transformation required in graphics engines is SQRT. Implementations of
FP square root (FPSQRT) vary significantly in performance, especially among processors
designed for graphics. Suppose FPSQRT is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance the FPSQRT H/W and speed up this
operation by a factor of 10. The other alternative is just to try to make all FP
instructions in the graphics processor run faster by a factor of 1.6; the FP
instructions are responsible for a total of 50% of the execution time for the
application. The design team believes that they can make all FP instructions run 1.6
times faster with the same effort as required for the fast SQRT. Compare these two
design alternatives to suggest which one is better.
● Solution:

Design 1 (faster FPSQRT):                     Design 2 (faster FP overall):
Fraction_enhanced = 0.2                       Fraction_enhanced = 0.5
Speedup_enhanced = 10                         Speedup_enhanced = 1.6
Speedup_overall = 1 / [(1-0.2) + 0.2/10]      Speedup_overall = 1 / [(1-0.5) + 0.5/1.6]
                = 1/0.82 ≈ 1.22                               = 1/0.8125 ≈ 1.23

Improving the performance of the FP operations overall is slightly better
because of their higher frequency.
47
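The same comparison can be scripted. This is a sketch using the slide's figures (20% of time sped up 10x versus 50% of time sped up 1.6x); the helper function is ours.

def amdahl_speedup(f, s):
    # Overall speedup when a fraction f of execution time is sped up by factor s.
    return 1.0 / ((1.0 - f) + f / s)

print(round(amdahl_speedup(0.2, 10), 2))   # faster FPSQRT hardware: ~1.22
print(round(amdahl_speedup(0.5, 1.6), 2))  # all FP ops 1.6x faster: ~1.23 (slightly better)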
High Performance
Computing
Lecture-2:
A Few Basic Concepts

Mr. SUBHASIS DASH


SCHOOL OF COMPUTER SCIENCE & ENGG.
KIIT UNIVERSITY
BHUBANESWAR

48
CPU Performance Equation
● Computers use a clock running at a constant
rate.
● CPU time = CPU clock cycles for a program × Clock cycle time
           = CPU clock cycles for a program / Clock rate
 CPI can be defined in terms of the number of clock cycles and the instruction
count (CPI is the reciprocal of IPC):

CPI = CPU clock cycles for a program / Instruction count
49
CPU Performance Equation
● CPU time = Instruction count × Cycles per instruction × Clock cycle time

● CPU time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

● Performance depends on 3 parameters:
– Clock cycle time (H/W technology & organization)
– CPI (organization & ISA)
– IC (ISA & compiler technology)
50
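A minimal sketch of this equation in Python; the instruction count, CPI, and clock rate below are made-up illustrative values, not taken from any particular processor.

def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate
    return instruction_count * cpi / clock_rate_hz

# Example: 200 million instructions, CPI of 1.5, 1 GHz clock.
print(cpu_time(200e6, 1.5, 1e9))  # 0.3 seconds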
CPU Performance Equation
● CPU clock cycles = Σ (i = 1 to n) IC_i × CPI_i

  where IC_i is the number of times instruction i is
  executed in a program, and CPI_i is the average
  number of clock cycles per instruction i.

● CPI_overall = [Σ (i = 1 to n) IC_i × CPI_i] / Instruction count
51
Performance Measurements

● Performance measurement is
important:
– Helps us to determine if one
processor (or computer) works
faster than another.
– A computer exhibits higher
performance if it executes programs
faster.
52
Clock-Rate Based Performance
Measurement
● Comparing performance based on
clock rates alone is obviously meaningless:
– Execution time = IC × CPI × Clock cycle time
– Please remember:
● A higher CPI need not mean better
performance.
● Also, a processor with a higher clock
rate may execute programs much
slower!
53
Example: Calculating Overall
CPI (Cycles per Instruction)

Operation   Freq   CPI_i   Freq × CPI_i   (% Time)
ALU         50%     1          0.5          (24%)
Load        20%     2          0.4          (19%)
Store       10%     2          0.2          (10%)
Branch      20%     5          1.0          (48%)

Typical instruction mix

Overall CPI = 1×0.5 + 2×0.2 + 2×0.1 + 5×0.2 = 2.1

54
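The weighted sum above can be checked with a short sketch; the instruction mix is the one in the table, and the dictionary layout is just one convenient way to hold it.

# Overall CPI = sum over instruction classes of (frequency_i x CPI_i).
mix = {"ALU": (0.5, 1), "Load": (0.2, 2), "Store": (0.1, 2), "Branch": (0.2, 5)}

overall_cpi = sum(freq * cpi for freq, cpi in mix.values())
print(overall_cpi)  # 2.1

# Fraction of execution time spent in each class: freq_i x CPI_i / overall CPI.
for name, (freq, cpi) in mix.items():
    print(name, round(freq * cpi / overall_cpi, 2))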
MIPS and MFLOPS
● Used extensively 30 years back.
● MIPS: millions of instructions processed
per second.
● MFLOPS: millions of floating point
operations completed per second.

MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)
55
Problems with MIPS
● Three significant problems with
using MIPS:
● So severe that it made someone term it:
– “Meaningless Information about
Processing Speed”
● Problem 1:
– MIPS is instruction set dependent.

56
Problems with MIPS
cont…
● Problem 2:
– MIPS varies between programs on
the same computer.
● Problem 3:
– MIPS can vary inversely to
performance!
● Let’s look at an example as to why
MIPS doesn’t work…
57
A MIPS Example
● Consider the following computer:
Instruction counts (in millions) for each
instruction class:

Code type      A (1 cycle)   B (2 cycles)   C (3 cycles)
Compiler 1          5              1              1
Compiler 2         10              1              1

The machine runs at 100 MHz.
Instruction A requires 1 clock cycle, instruction B requires 2
clock cycles, instruction C requires 3 clock cycles.

CPI = CPU clock cycles / Instruction count = [Σ (i = 1 to n) CPI_i × N_i] / Instruction count
58
A MIPS Example
cont…

CPI_1 = [(5×1) + (1×2) + (1×3)] × 10^6 / [(5 + 1 + 1) × 10^6] = 10/7 ≈ 1.43
MIPS_1 = 100 MHz / 1.43 ≈ 69.9

CPI_2 = [(10×1) + (1×2) + (1×3)] × 10^6 / [(10 + 1 + 1) × 10^6] = 15/12 = 1.25
MIPS_2 = 100 MHz / 1.25 = 80.0

So, compiler 2 has a higher MIPS rating and should be faster?
59
A MIPS Example
cont…

● Now let’s compare CPU time:

Note the important formula:  CPU Time = Instruction Count × CPI / Clock Rate

CPU Time_1 = 7 × 10^6 × 1.43 / (100 × 10^6) = 0.10 seconds

CPU Time_2 = 12 × 10^6 × 1.25 / (100 × 10^6) = 0.15 seconds

Therefore compiler 1’s program is faster despite a lower MIPS rating!
60
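A short sketch reproducing the whole two-compiler comparison, showing how a higher MIPS rating can coincide with a longer CPU time; the variable names are ours, and the counts and CPIs are the slide's.

clock_rate = 100e6          # 100 MHz
class_cpis = [1, 2, 3]      # CPIs of instruction classes A, B, C
counts = {"compiler 1": [5e6, 1e6, 1e6], "compiler 2": [10e6, 1e6, 1e6]}

for name, n in counts.items():
    cycles = sum(c * cpi for c, cpi in zip(n, class_cpis))
    instr_count = sum(n)
    cpi = cycles / instr_count
    mips = clock_rate / (cpi * 1e6)
    cpu_time = instr_count * cpi / clock_rate
    print(name, round(cpi, 2), round(mips, 1), round(cpu_time, 2))
# compiler 1: CPI 1.43, 69.9 MIPS, 0.10 s;  compiler 2: CPI 1.25, 80.0 MIPS, 0.15 s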
CPU Performance Equation
● Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQRT) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other operations = 1.33
– Frequency of FPSQRT = 2%
– CPI of FPSQRT = 20
– Assume that the two design alternatives are to decrease the CPI of
FPSQRT to 2, or to decrease the average CPI of all FP operations to
2.5. Compare these two design alternatives using the CPU performance
equation.
61
CPU Performance Equation
● Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQRT) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other operations = 1.33
– Frequency of FPSQRT = 2%
– CPI of FPSQRT = 20
– Assume that the two design alternatives are to decrease the CPI of FPSQRT to 2, or
to decrease the average CPI of all FP operations to 2.5. Compare these two design
alternatives using the CPU performance equation.
● Solution: First observe that only the CPI changes; the clock rate and instruction
count remain identical. We can start by finding the original CPI without enhancement:

  CPI_original = [Σ (i = 1 to n) CPI_i × IC_i] / Instruction count = (4 × 25%) + (1.33 × 75%) ≈ 2

We can compute the CPI for the enhanced FPSQRT by subtracting the cycles saved from
the original CPI:

  CPI_new FPSQRT = CPI_original – [2% × (CPI_old FPSQRT – CPI_new FPSQRT)]
                 = 2 – [0.02 × (20 – 2)] = 1.64
62
CPU Performance Equation
● We can compute the CPI for the enhancement of all FP instructions the same way, or by
adding the FP and non-FP CPIs:
● CPI_new FP = (75% × 1.33) + (25% × 2.5) = 1.6225
● Since the CPI of the overall FP enhancement is slightly lower, its performance will be
marginally better. Specifically, the speedup for the overall FP enhancement is:
● Speedup_overall for FP = 2 / 1.6225 ≈ 1.23
● Speedup_overall for FPSQRT = 2 / 1.64 ≈ 1.22
63
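The two alternatives can also be compared in a few lines of Python using the measurements above; the results differ from the slide only in rounding, since the original CPI is kept at its exact value rather than rounded to 2.

cpi_original = 0.25 * 4.0 + 0.75 * 1.33          # ~2.0

cpi_new_fpsqrt = cpi_original - 0.02 * (20 - 2)  # FPSQRT CPI reduced from 20 to 2
cpi_new_fp = 0.75 * 1.33 + 0.25 * 2.5            # all FP ops at CPI 2.5

print(round(cpi_original / cpi_new_fpsqrt, 2))   # ~1.22
print(round(cpi_original / cpi_new_fp, 2))       # ~1.23, so the overall FP enhancement wins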
Today’s Objectives
● Study some preliminary concepts:
– Amdahl’s law, performance
benchmarking, etc.
● RISC versus CISC architectures.
● Types of parallelism in programs
versus types of parallel
computers.
● Basic concepts in pipelining.
64
RISC/CISC Controversy
● RISC: Reduced Instruction Set Computer
● CISC: Complex Instruction Set Computer
● Genesis of CISC architecture:
– Implementing commonly used instructions in
hardware can lead to significant performance
benefits.
– For example, use of a FP processor can lead to
performance improvements.
● Genesis of RISC architecture:
– The rarely used instructions can be eliminated
to save chip space --- on-chip caches and a large
number of registers can be provided instead.
65
Features of A CISC Processor
● Rich instruction set:
– Some simple, some very complex
● Complex addressing modes:
– Orthogonal addressing (Every possible
addressing mode for every instruction).
 Many instructions take multiple cycles:
 Large variation in CPI
 Instructions are of variable sizes
 Small number of registers
 Microcode control
 No (or inefficient) pipelining 66
Features of a RISC Processor
● Small number of instructions
● Small number of addressing modes
 Large number of registers (>32)
 Instructions execute in one or two clock cycles
 Uniform-length instructions and a fixed
instruction format.
 Register-Register Architecture:
 Separate memory instructions (load/store)
 Separate instruction/data cache
 Hardwired control
 Pipelining (Why CISC are not pipelined?)
67
CISC vs. RISC Organizations

[Diagram (a) CISC organization: microprogrammed control unit with a microprogrammed control memory, a unified cache, and main memory.
 Diagram (b) RISC organization: hardwired control unit, separate instruction and data caches, and main memory.]

68
Why Does RISC Lead to Improved
Performance?
● Increased GPRs
– Lead to decreased data traffic to memory.
– Remember memory is the bottleneck.
● Register-Register architecture leads to
more uniform instructions:
– Efficient pipelining becomes possible.
● However, larger instruction memory
traffic:
– Because a larger number of instructions
results.
69
Early RISC Processors
● 1987 Sun SPARC
● 1990 IBM RS 6000
● 1996 IBM/Motorola PowerPC

70
Architectural Classifications
● Flynn’s Classifications [1966]
– Based on multiplicity of instruction streams
& data stream in a computer.
● Feng’s Classification [1972]
– Based on serial & parallel processing.
● Handler’s Classification [1977]
– Determined by the degree of parallelism &
pipelining at various subsystem levels.

71
Flynn’s Classification
● SISD (Single Instruction Single Data):
– Uniprocessors.
● MISD (Multiple Instruction Single
Data):
– No practical examples exist
● SIMD (Single Instruction Multiple
Data):
– Specialized processors
● MIMD (Multiple Instruction Multiple
Data):
– General purpose, commercially important
72
SISD

[Diagram: Control unit → (IS) → Processing unit → (DS) → Memory module]

73
SIMD

[Diagram: a single control unit broadcasts the instruction stream (IS) to processing units 1..n; each processing unit exchanges its data stream (DS) with a memory module]

74
MIMD

[Diagram: control units 1..n each issue their own instruction stream (IS) to processing units 1..n, which exchange data streams (DS) with memory modules]

75
Classification for MIMD
Computers
● Shared Memory:
– Processors communicate through a shared
memory.
– Typically processors connected to each
other and to the shared memory through a
bus.
● Distributed Memory:
– Processors do not share any physical
memory.
– Processors connected to each other
through a network. 76
Shared Memory
● Shared memory located at a
centralized location:
– May consist of several interleaved
modules –-- same distance (access
time) from any processor.
– Also called Uniform Memory Access
(UMA) model.

77
Distributed Memory
● Memory is distributed to each processor:
– Improves scalability.
● Non-Uniform Memory Access (NUMA)
– (a) Message passing architectures – No
processor can directly access another
processor’s memory.
– (b) Distributed Shared Memory (DSM)–
Memory is distributed, but the address
space is shared.

78
UMA vs. NUMA Computers

[Diagram (a) UMA model: processors P1..Pn, each with a cache, share a bus to a single main memory.
 Diagram (b) NUMA model: processors P1..Pn, each with a cache and a local main memory, connected by a network.]

79
Basics of Parallel Computing
● If you are ploughing a field,
which of the following would you
rather use:
– One strong OX?
– A pair of cows?
– Two pairs of goats?
– 128 chickens?
– 100,000 ants? 80
Basics of Parallel Computing
cont…
● Consider another scenario:
– You have to get a color image printed on a
stack of papers.
● Would you rather:
– For each sheet, print red, then green, then
blue, and then take up the next paper? Or,
– As soon as you complete printing red on a
paper, advance it to the next color, and meanwhile
take a new paper for printing red?
81
Parallel Processing DEMANDS
Concurrent Execution
● An efficient form of information processing
which emphasizes the exploitation of concurrent
events in computing process.
– Parallel events may occur in multiple resources
during the same interval of time.
– Simultaneous events may occur at the same time.
– Pipelined events may occur in overlapped time
spans.

82
face up to face

● Job / Program level        }  handled algorithmically (S/W)
● Task / Procedure level     }
● Inter-instruction level    }  handled in H/W
● Intra-instruction level    }
83
Parallel Computer Structure
● Emphasis on parallel processing.
● Basic architectural features of parallel
computers are:
– Pipelined computers, which perform overlapped
computation to exploit temporal parallelism (a task is
broken into multiple stages).
– Array processors, which use multiple synchronized
arithmetic logic units to achieve spatial parallelism
(duplicate hardware performs multiple tasks at
once).
– Multiprocessor systems, which achieve asynchronous
parallelism through a set of interacting processors
with shared resources. 84
Pipelined Computer

● Segments: IF, ID, OF, EX
● Multiple pipeline cycles
● A pipeline cycle can be set equal to the
delay of the slowest stage.
● Operation of all stages is synchronized
under a common clock.
● Interface latches are used between the
segments to hold the intermediate results.

85
[Diagram: functional structure of a pipeline computer with scalar and vector capabilities. Instruction preprocessing (IF, ID, OF) fetches from memory; scalar data from scalar registers feed scalar pipelines SP1..SPn, and vector data from vector registers feed vector pipelines VP1..VPn.]

86
Array processor
● Synchronous parallel computer with multiple
arithmetic logic units [same function at the same
time].
● By replication of ALUs the system can achieve
spatial parallelism.
● An appropriate data routing mechanism must
be established among the PEs.
● Scalar & control-type instructions are directly
executed in the control unit.
● Each PE consists of one ALU with registers &
local memory.
87
Array processor
● Vector instructions are broadcast to the PEs for
distributed execution over different component
operands fetched directly from the local
memory.
● IF & decode are done by the control unit.
● Array processors designed with associative
memory are called associative processors.
● Array processors are much more difficult to
program than pipelined machines.

88
[Diagram: functional structure of an SIMD array processor with concurrent scalar processing in the control unit. The control unit, with its control processor and control memory, performs scalar processing and I/O over a data bus; vector instructions are broadcast to processing elements PE1..PEn, each consisting of a processor P and a local memory M, linked by an inter-PE connection network for data routing. Duplicate hardware performs multiple tasks at once.]

89
Multiprocessor Systems
● Multiprocessor systems lead to improved
throughput, reliability, & flexibility.
● The entire system must be controlled by a
single integrated OS providing interaction
between processors & their programs at
various levels.
● Each processor has its own local memory &
I/O devices.
● Inter-processor communication can be
done through shared memories or through
an interrupt network.
90
Multiprocessor Systems
● There are three different types of inter-
processor communication:
– Time-shared common bus
– Crossbar switch network
– Multiport memories
● Centralized computing system, in which all
H/W & S/W resources are housed in the
same computing center with negligible
communication delays among subsystems.
● Loosely coupled & tightly coupled systems.

91
[Diagram: functional design of an MIMD multiprocessor system. Shared memory modules 1..n connect to processors P1..Pn, each with a local memory LM, through an interprocessor-memory interconnection network (bus, crossbar, or multiport); an input-output interconnection network serves the I/O channels, and an inter-processor interrupt network links the processors.]

92


Parallel Execution of Programs
● Parallel execution of a program can be
done in one or more of following ways:
– Instruction-level (Fine grained): individual
instructions on any one thread are executed
parallely.
● Parallel execution across a sequence of
instructions (block) -- could be a loop, a
conditional, or some other sequence of statements.
– Thread-level (Medium grained): different
threads of a process are executed parallely.
– Process-level (Coarse grained): different
processes can be executed parallely.
93
Exploitation of Instruction-Level
Parallelism
● ILP can be exploited by deploying several
available techniques:
– Temporal parallelism (Overlapped execution):
● Pipelining
– Spatial Parallelism:
● Vector processing (single instruction multiple
data SIMD)
● Superscalar execution (Multiple instructions
that use multiple data MIMD)

94
GO TO MIPS DATAPATH
SLIDES

95
ILP Exploitation Through
Pipelining

96
Pipelining
● Pipelining incorporates the
concept of overlapped execution:
– Used in many everyday applications
without our noticing.
● Has proved to be a very popular
and successful way to exploit ILP:
– Instruction pipelines are used in
almost all modern processors.
97
A Pipeline Example
● Consider two alternate ways in which
an engineering college can work:
– Approach 1. Admit a batch of students,
and the next batch is admitted only after
the already admitted batch completes (i.e.,
admit once every 4 years).
– Approach 2. Admit students every year.
– In the second approach:
● Average number of students graduating per
year increases four times.
98
Pipelining

[Diagram: overlapped batches. A new batch is admitted every year, so the First Year, Second Year, Third Year, and Fourth Year stages of successive batches overlap in time.]

99
Pipelined Execution

[Diagram: program flow over time. Successive instructions each pass through the IFetch, Dcd, Exec, Mem, and WB stages, with each instruction starting one cycle after the previous one, so the stages overlap.]

100
Advantages of Pipelining
● An n-stage pipeline:
– Can improve performance by up to n times.
● Not much investment in hardware:
– No replication of hardware resources
necessary.
– The principle deployed is to keep the
units as busy as possible.
● Transparent to the programmers:
– Easy to use
101
Basic Pipelining Terminologies
● Pipeline cycle (or Processor cycle):
– The time required to move an instruction
one step further in the pipeline.
– Not to be confused with clock cycle.

● Synchronous pipeline:
– Pipeline cycle is constant (clock-driven).
● Asynchronous pipeline:
– Time for moving from stage to stage varies
– Handshaking communication between stages
102
Principle of linear pipelining
● Assembly lines, where items are assembled
continuously from separate part along a moving
conveyor belt.
● Partition of assembly depends on factors like:-
– Quality of working units
– Desired processing speed
– Cost effectiveness
● All assembly stations must have equal
processing time.
● The slowest station becomes the bottleneck or
point of congestion.
103
Precedence relation
● A set of subtasks {T1, T2, ..., Tn} for a given
task T, such that some subtask Tj cannot start until
some earlier subtask Ti (i < j) finishes.
● A pipeline consists of a cascade of processing
stages.
● Stages are combinational circuits operating over the data
stream flowing through the pipe.
● Stages are separated by high-speed interface
latches (holding intermediate results between
stages).
● Control must be under a common clock.
104
Synchronous Pipeline
- Transfers between stages are simultaneous.
- One task or operation enters the pipeline
per cycle.

[Diagram: Input → latch L → stage S1 → L → S2 → L → ... → L → Sk → L → Output, with all latches driven by a common clock; τ_m = stage delay, d = latch delay.]
105
Asynchronous Pipeline
- Transfers performed when individual stages are
ready.
- Handshaking protocol between stages.

[Diagram: Input → S1 → S2 → ... → Sk → Output, with Ready/Ack handshake signals between adjacent stages.]

- Different amounts of delay may be experienced at
different stages.
- Can display a variable throughput rate.

106
A Few Pipeline Concepts

[Diagram: stage S_i, a latch, then stage S_i+1; τ_m = stage delay, d = latch delay]

Pipeline cycle: τ
Latch delay: d
τ = max{τ_m} + d

Pipeline frequency: f = 1/τ
107
Example on Clock Period
● Suppose the time delays of the 4 stages are
τ1 = 60 ns, τ2 = 50 ns, τ3 = 90 ns, τ4 = 80 ns, and the
interface latch has a delay of d = 10 ns. What is
the value of the pipeline cycle time?
● The cycle time of this pipeline is:
  τ = max{τ_m} + d = 90 + 10 = 100 ns
● Clock frequency of the pipeline: f = 1/100 ns = 10 MHz
● If it were non-pipelined, total time = 60 + 50 + 90 + 80 = 280 ns
108
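A quick sketch of the cycle-time calculation above; the stage and latch delays are the slide's numbers.

# Pipeline cycle time = max stage delay + latch delay; frequency = 1 / cycle time.
stage_delays_ns = [60, 50, 90, 80]
latch_delay_ns = 10

cycle_ns = max(stage_delays_ns) + latch_delay_ns   # 100 ns
freq_mhz = 1e3 / cycle_ns                          # 10 MHz
non_pipelined_ns = sum(stage_delays_ns)            # 280 ns
print(cycle_ns, freq_mhz, non_pipelined_ns)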
Ideal Pipeline Speedup
● A k-stage pipeline processes n tasks in k + (n-1)
clock cycles:
– k cycles for the first task and n-1 cycles for the
remaining n-1 tasks.
● Total time to process n tasks:
  T_k = [k + (n-1)] τ
● For the non-pipelined processor:
  T_1 = n k τ
109
Pipeline Speedup Expression
Speedup:

  S_k = T_1 / T_k = n k τ / {[k + (n-1)] τ} = n k / [k + (n-1)]

● Maximum speedup: S_k → k, for n >> k

● Observe that the memory bandwidth
must increase by a factor of S_k:
– Otherwise, the processor would stall waiting for
data to arrive from memory. 110
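A minimal sketch of the speedup expression above; the task and stage counts are arbitrary illustrative values.

def pipeline_speedup(n_tasks, k_stages):
    # S_k = n k / (k + n - 1); approaches k as n grows large.
    return n_tasks * k_stages / (k_stages + n_tasks - 1)

print(round(pipeline_speedup(100, 5), 2))     # ~4.81
print(round(pipeline_speedup(10000, 5), 2))   # ~5.0, i.e. close to k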
Efficiency of Pipeline
● The percentage of busy time-space spans over
the total time-space span.
– n: number of tasks or instructions
– k: number of pipeline stages
– τ: clock period of the pipeline
● Hence pipeline efficiency can be defined by:

  η = n k τ / {k [k τ + (n-1) τ]} = n / [k + (n-1)]
111
Throughput of Pipeline
● Number of tasks (results) that can be completed
by a pipeline per unit time.

  W = n / {[k + (n-1)] τ} = η / τ = η f

● Ideal case: W = 1/τ = f when η = 1.
● Maximum throughput = frequency of the linear
pipeline.
112
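Efficiency and throughput can be sketched the same way; here the cycle time reuses the 100 ns value from the earlier example, and the task and stage counts are illustrative.

def pipeline_efficiency(n_tasks, k_stages):
    # eta = n / (k + n - 1)
    return n_tasks / (k_stages + n_tasks - 1)

def pipeline_throughput(n_tasks, k_stages, cycle_time_s):
    # W = n / ((k + n - 1) * tau) = eta * f
    return n_tasks / ((k_stages + n_tasks - 1) * cycle_time_s)

tau = 100e-9  # 100 ns pipeline cycle
print(round(pipeline_efficiency(100, 4), 3))    # ~0.971
print(round(pipeline_throughput(100, 4, tau)))  # ~9.7 million results per second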
Pipelines: A Few Basic
Concepts
● Pipeline increases instruction throughput:
– But, does not decrease the execution time of
the individual instructions.
– In fact, slightly increases execution time of
each instruction due to pipeline overheads.
● Pipeline overhead arises due to a
combination of:
– Pipeline register delay / Latch between stages
– Clock skew
113
Pipelines: A Few Basic
Concepts
● Pipeline register delay:
– Caused due to set up time
● Clock skew:
– the maximum delay between clock arrival
at any two registers.
● Once the clock cycle is as small as the
pipeline overhead:
– No further pipelining would be useful.
– Very deep pipelines may not be useful.
114
Drags on Pipeline Performance

● Things are actually not so rosy,


due to the following factors:
– Difficult to balance the stages
– Pipeline overheads: latch delays
– Clock skew
– Hazards

115
Exercise
● Consider an unpipelined processor:
– Takes 4 cycles for ALU and other operations
– 5 cycles for memory operations.
– Assume the relative frequencies:
● ALU and other=60%,
● memory operations=40%
– Cycle time =1ns
● Compute speedup due to pipelining:
– Ignore effects of branching.
– Assume pipeline overhead = 0.2ns
116
Solution
● Average instruction execution time over a
large number of instructions:
– Unpipelined = 1 ns × (60% × 4 + 40% × 5) = 4.4 ns
– Pipelined = 1 ns + 0.2 ns = 1.2 ns
● Speedup = 4.4 / 1.2 ≈ 3.7 times
117
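The exercise's arithmetic, sketched in Python with the numbers given above.

cycle_time_ns = 1.0
pipeline_overhead_ns = 0.2

avg_unpipelined_ns = cycle_time_ns * (0.6 * 4 + 0.4 * 5)   # 4.4 ns
avg_pipelined_ns = cycle_time_ns + pipeline_overhead_ns    # 1.2 ns
print(round(avg_unpipelined_ns / avg_pipelined_ns, 1))     # ~3.7x speedup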
Pipeline Hazards
● Hazards can result in incorrect
operations:
– Structural hazards: Two instructions
requiring the same hardware unit at same
time.
– Data hazards: Instruction depends on result
of a prior instruction that is still in pipeline
● Data dependency
– Control hazards: Caused by delay in decisions
about changes in control flow (branches and
jumps).
● Control dependency
118
Pipeline Interlock
● Pipeline interlock:
– Resolving
of pipeline hazards through
hardware mechanisms.
● Interlock hardware detects all
hazards:
– Stallsappropriate stages of the
pipeline to resolve hazards.

119
MIPS Pipelining Stages
● 5 stages of MIPS Pipeline:
– IF Stage:
● Needs access to the memory to load the instruction.
● Needs an adder to update the PC.
– ID Stage:
● Needs access to the Register File for reading operands.
● Needs an adder (to compute the potential branch target).
– EX Stage:
● Needs an ALU.
– MEM Stage:
● Needs access to the memory.
– WB Stage:
● Needs access to the Register File for writing.
120
Further MIPS Enhancements
cont…

● MIPS could achieve a CPI of 1; to improve
performance further, there are two possibilities:
– Superscalar
– Superpipelined
● Superscalar:
– Replicate each pipeline stage so that two or
more instructions can proceed simultaneously.
● Superpipeline:
– Split pipeline stages into further stages.
121
Summary
● RISC architecture style saves chip area
that is used to support more registers
and cache:
– Also instruction pipelining is facilitated due
to small and uniform sized instructions.
● Three main types of parallelism in a
program:
– Instruction-level
– Thread-level
– Process-level

122
Summary
Cont…
● Two main types of parallel computers:
– SIMD
– MIMD
● Instruction pipelines are found in almost all
modern processors:
– Exploits instruction-level parallelism
– Transparent to the programmers
● Hazards can slow down a pipeline:
– In the next lecture, we shall examine hazards in
more detail and available ways to resolve hazards.
123
References
[1] J.L. Hennessy & D.A. Patterson, “Computer
Architecture: A Quantitative Approach,” Morgan
Kaufmann Publishers, 3rd Edition, 2003.
[2] John Paul Shen and Mikko Lipasti, “Modern
Processor Design,” Tata McGraw-Hill, 2005.

124
