W 04 Parallel Processing

The document outlines the concepts of computer architecture with a focus on parallel processing and pipelined execution. It discusses single-cycle processors, their limitations, and introduces the benefits and challenges of parallel processing. Intended learning outcomes include understanding pipelined execution and multi-core processors, supported by various resources and examples.


Bachelor of Science Honours in SE

DES220130 - Computer Architecture

Parallel Processing
ACKNOWLEDGEMENT

Presented by: Mr. Chandana Deshapriya

RESOURCES

• J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann, San Francisco, CA, 2012.
• B. Ram, Fundamentals of Microprocessors and Microcomputers. Dhanpat Rai & Sons, 2012.
• J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2017.

AGENDA

4. Parallel Processing

I. Single Cycle Processors
II. Limitations of Single Cycle Processors
III. Parallel Processing
IV. Parallel Processing Challenges
V. Pipelined Processing
VI. Limitations of Pipelined Instructions

INTENDED LEARNING OUTCOMES (ILO)

By the end of this section, students should be able to:

ILO1: articulate the basics of pipelined execution;
ILO2: explain parallelism and multi-core processors.

Single Cycle Processors

In a single-cycle processor, each instruction is completed in one
clock cycle.
• Every instruction goes through all of its phases (fetch, decode,
execute, write-back) within that single cycle.

Example: 5 + 3
Consider the instruction ADD R1, R2, R3 running on a single-cycle
processor. It adds the values stored in registers R2 and R3, then
stores the result in register R1. Assume R2 contains the number 5
and R3 contains the number 3.

The instruction goes through the following steps within a single
clock cycle:

1. Instruction Fetch: The processor loads the ADD R1, R2, R3
instruction from memory. (The Program Counter (PC) holds the memory
address of this instruction.)

2. Instruction Decode: The Control Unit reads the instruction and
identifies the ADD operation and the registers involved (R1 for the
result, R2 and R3 for the values to add).

3. Execute: The ALU adds the values in R2 and R3 (5 + 3 = 8).

4. Write-Back: The result, 8, is written back into register R1.
R1 now holds the value 8, completing the instruction.
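
The four steps can be traced in a few lines of Python. This is only a minimal sketch: the tuple encoding of the instruction, the dictionary-based register file and the function name run_one_cycle are assumptions made for illustration, not part of any real instruction set.

```python
# Hypothetical single-cycle model; names and encoding are illustrative only.
registers = {"R1": 0, "R2": 5, "R3": 3}
memory = {0: ("ADD", "R1", "R2", "R3")}  # instruction memory, keyed by address
pc = 0  # Program Counter: address of the instruction to run


def run_one_cycle(pc, registers, memory):
    """Perform fetch, decode, execute and write-back in one clock cycle."""
    instruction = memory[pc]                    # 1. Instruction Fetch
    opcode, dest, src1, src2 = instruction      # 2. Instruction Decode
    if opcode != "ADD":
        raise ValueError(f"unsupported opcode: {opcode}")
    result = registers[src1] + registers[src2]  # 3. Execute (ALU addition)
    registers[dest] = result                    # 4. Write-Back
    return pc + 1                               # address of the next instruction


pc = run_one_cycle(pc, registers, memory)
print(registers["R1"])  # 8
```

One call to run_one_cycle corresponds to one clock cycle, which is the defining property of the single-cycle design.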

In a single-cycle processor, all of these steps (fetching, decoding,
executing, and writing back the result) happen in one clock cycle.

In the example:

If the clock cycle is 5 nanoseconds, the ADD instruction must be
completed within those 5 nanoseconds. Simpler instructions (such as
just moving data) are also given the full 5 nanoseconds, even though
they could take less time.
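
The cost of a fixed cycle can be made concrete with a small calculation. In this sketch the 5 ns figure for ADD comes from the example, while the 2 ns figure for a data move (MOV) is an assumed value chosen only for illustration.

```python
# Per-instruction latencies in nanoseconds. ADD = 5 ns is from the example;
# MOV = 2 ns is a hypothetical value used only for illustration.
latency_ns = {"ADD": 5.0, "MOV": 2.0}

# A single-cycle design has one fixed clock period, so the period
# must be long enough for the slowest instruction.
clock_period_ns = max(latency_ns.values())

# Time the hardware sits idle while each instruction occupies its cycle.
idle_ns = {name: clock_period_ns - t for name, t in latency_ns.items()}

print(clock_period_ns)  # 5.0
print(idle_ns["MOV"])   # 3.0
```

Under these assumed numbers, every MOV wastes 3 ns of its 5 ns cycle.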

Limitations of Single Cycle Processors

Speed of single-cycle processors, assuming:
• Clock-synchronous circuits, single-cycle memory

Lots of time is not spent doing useful work!
• Can pipelining help with performance?

[Figure: execution timeline of instr. 1 and instr. 2]

Activity 1
Write short answers for the questions on the LMS

Examples should be shown using this

Parallel Processing

Multiprogramming and Multiprocessing

Parallel processor introduction
• Attempt to pipeline our processor using pipeline registers/FIFOs
• Much better latency and throughput!
• Average CPI reduced from 3 to 1!
• Still lots of time spent not doing work. Can we do better?
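
To see what a drop in CPI from 3 to 1 is worth, the standard relation CPU time = instruction count × CPI × clock period can be evaluated directly. This is a rough sketch: the instruction count and clock period below are assumed values, and the clock period is assumed to stay the same in both cases.

```python
def execution_time_ns(instruction_count, cpi, clock_period_ns):
    """CPU time = instruction count x cycles per instruction x clock period."""
    return instruction_count * cpi * clock_period_ns


# Assumed workload: 1000 instructions on a 1 ns clock (illustrative values).
before = execution_time_ns(1000, cpi=3, clock_period_ns=1.0)
after = execution_time_ns(1000, cpi=1, clock_period_ns=1.0)
speedup = before / after

print(speedup)  # 3.0
```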

Parallel processing challenges

1. Datapath
• Five (or more) instructions are in the datapath at the same time.
2. Instructions may have
• data and control flow dependences
• i.e., units of work are not independent: one may have to
stall and wait for another
3. Control
• Control signals must correspond to multiple instructions at once

Complications – Datapath Dependences

[Figure: complications in the datapath, drawn as a cycle-by-cycle diagram over the stages F, D, X, M, W]

Complications – Program Dependences
[Figure: instructions i1, i2, i3 drawn against the stages F, D, X, M, W]
• A true dependence between two instructions may only involve one
subcomputation of each instruction.
• The implied sequential precedences are an overspecification. They
are sufficient but not necessary to ensure program correctness.

Complications – Program Dependences
(i precedes j; D(i): locations written by instruction i; R(i): locations read by instruction i)
1. True dependence (RAW): $D(i) \cap R(j) \neq \emptyset$
• j cannot execute until i produces its result
2. Anti-dependence (WAR): $R(i) \cap D(j) \neq \emptyset$
• j cannot write its result until i has read its sources
3. Output dependence (WAW): $D(i) \cap D(j) \neq \emptyset$
• j cannot write its result until i has written its result
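
The three conditions are set intersections, so they can be checked mechanically. The sketch below assumes that D holds the registers an instruction writes and R the registers it reads; the function name classify_dependences and the three sample instructions are hypothetical.

```python
def classify_dependences(i, j):
    """Return the dependences of a later instruction j on an earlier instruction i.

    Each instruction is a dict with 'D' (set of registers written)
    and 'R' (set of registers read).
    """
    found = []
    if i["D"] & j["R"]:
        found.append("RAW")  # true dependence: j reads what i writes
    if i["R"] & j["D"]:
        found.append("WAR")  # anti-dependence: j overwrites what i reads
    if i["D"] & j["D"]:
        found.append("WAW")  # output dependence: both write the same location
    return found


# Hypothetical instructions used only for illustration:
add = {"D": {"R1"}, "R": {"R2", "R3"}}  # ADD R1, R2, R3
sub = {"D": {"R4"}, "R": {"R1", "R5"}}  # SUB R4, R1, R5
mov = {"D": {"R2"}, "R": {"R6"}}        # MOV R2, R6

print(classify_dependences(add, sub))  # ['RAW']
print(classify_dependences(add, mov))  # ['WAR']
```

An empty list means the two instructions are independent and could, in principle, be reordered or overlapped freely.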

Complications - Control Dependences

• Conditional branches
• Branch must execute to determine which instruction to fetch next
• Instructions following a conditional branch are control dependent on the branch instruction

Pipelined Processing


Ann, Brian, Cathy, Dave

Each has one load of clothes to wash, dry, and fold.

Sequential Laundry
[Figure: timeline of loads A, B, C, D in task order, run one after another; stage times of 30, 40 and 20 minutes repeat for each load; total 6 hours]

What would you do?


Pipelined Laundry
[Figure: timeline of loads A, B, C, D in task order with overlapping stages; time segments of 30, 40, 40, 40, 40 and 20 minutes; total 3.5 hours]

Observations
• A task has a series of stages;
• Stage dependency: e.g., wash before dry;
• Multiple tasks with overlapping stages;
• Simultaneously use different resources to speed up;
• Slowest stage determines the finish time;
• No speed up for individual task;
e.g., A still takes 30+40+20=90 minutes
• But speed up for average task execution time;
e.g., 3.5*60/4=52.5 minutes < 30+40+20=90 minutes
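
The 6-hour and 3.5-hour totals can be reproduced with a short calculation. This is a sketch under one stated assumption: a load may wait between stages until the next machine is free. The function names are illustrative.

```python
def sequential_time(stage_minutes, num_tasks):
    """Each task runs through all stages before the next task starts."""
    return num_tasks * sum(stage_minutes)


def pipelined_time(stage_minutes, num_tasks):
    """Finish time when a task enters a stage as soon as both the task
    and the stage are free (assumes loads can wait between stages)."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(num_tasks):
        ready = 0  # time at which this task finished its previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
    return stage_free_at[-1]


stages = [30, 40, 20]  # minutes per stage, as in the laundry example

print(sequential_time(stages, 4))     # 360 minutes = 6 hours
print(pipelined_time(stages, 4))      # 210 minutes = 3.5 hours
print(pipelined_time(stages, 4) / 4)  # 52.5 minutes per load on average
```

The 40-minute stage is the bottleneck: after the first load, one load finishes every 40 minutes regardless of how fast the other stages are.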
Assembly Line
[Figure: two assembly-line examples, labelled Cola and Auto]
Pipelining
• An implementation technique whereby multiple instructions are
overlapped in execution,
e.g., B washes while A dries.
• Essence: Start executing one instruction before completing the
previous one.
• Significance: Make fast CPUs.
Balanced Pipeline

• Equal-length pipe stages
e.g., wash, dry, fold = 40 mins per stage
unpipelined laundry time = 40x3 mins
3 pipe stages: wash, dry, fold
• One task/instruction completes per 40 mins

Time slot (40 min each)   Wash   Dry   Fold
T1                        A
T2                        B      A
T3                        C      B     A
T4                        D      C     B

• Performance
\[ \text{Time per instruction by pipeline} = \frac{\text{Time per instr on unpipelined machine}}{\text{Number of pipe stages}} \]
\[ \text{Speed up by pipeline} = \text{Number of pipe stages} \]
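
The speedup equals the number of pipe stages only in the limit of many tasks, because the pipeline first has to fill. The sketch below assumes a perfectly balanced pipeline with no stalls; the function names are illustrative.

```python
def pipelined_total_time(num_tasks, num_stages, stage_time):
    """Total time on a balanced pipeline: fill the pipe, then one task per stage time."""
    return (num_stages + num_tasks - 1) * stage_time


def speedup(num_tasks, num_stages, stage_time):
    """Unpipelined time divided by pipelined time for the same tasks."""
    unpipelined = num_tasks * num_stages * stage_time
    return unpipelined / pipelined_total_time(num_tasks, num_stages, stage_time)


print(pipelined_total_time(4, 3, 40))  # 240 minutes for loads A to D
print(speedup(4, 3, 40))               # 2.0
print(round(speedup(1000, 3, 40), 3))  # 2.994, approaching the 3 pipe stages
```

With only four loads the speedup is 2, while a long stream of loads approaches the ideal factor of 3.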
Multicycle Processors

• Multicycle implementation:
Cycle:  1  2  3  4  5  6  7  8  9 10 11 12 13
i       F  D  X  M  W
i+1                    F  D  X
i+2                             F  D  X  M
i+3                                         F
i+4

Ideally balanced pipeline performance

• Clock cycle: 1/5 of total latency
• Circuits in all stages are always busy with useful work

Ideally balanced performance

Cycle:  1  2  3  4  5  6  7  8  9 10 11 12 13
i       F  D  X  M  W
i+1        F  D  X  M  W
i+2           F  D  X  M  W
i+3              F  D  X  M  W
i+4                 F  D  X  M  W

Ideally balanced pipeline performance

• Identical subcomputations
  • Can pipeline into stages with equal delay
• Identical computations
  • Can fill pipeline with identical work
• Independent computations
  • No relationships between work units

Are these practical?
No, but we can get close enough to get a significant speedup.

Five-Stage Pipeline
• How it works
  • Pipeline registers are introduced between successive stages.
  • Each pipeline register stores the results of one stage and
    supplies them as the input of the next stage.
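
The role of the pipeline registers can be sketched as a list of slots that shift by one position on every clock edge. This is a simplified model under stated assumptions: the registers carry only an instruction label rather than real stage results, hazards and stalls are not modelled, and the stage letters F, D, X, M, W and the function name simulate are illustrative.

```python
STAGES = ["F", "D", "X", "M", "W"]


def simulate(instructions):
    """Clock instructions through a five-stage pipeline, one stage per cycle.

    pipeline_regs[s] holds the instruction that stage s works on in the
    current cycle. On every clock edge each register hands its contents
    to the next stage. Returns one stage-to-instruction mapping per cycle.
    """
    pipeline_regs = [None] * len(STAGES)
    pending = list(instructions)
    trace = []
    while pending or any(r is not None for r in pipeline_regs):
        # Clock edge: shift every register forward and fetch a new instruction.
        fetched = pending.pop(0) if pending else None
        pipeline_regs = [fetched] + pipeline_regs[:-1]
        if any(r is not None for r in pipeline_regs):
            trace.append({stage: instr
                          for stage, instr in zip(STAGES, pipeline_regs)
                          if instr is not None})
    return trace


program = ["i", "i+1", "i+2", "i+3", "i+4"]
trace = simulate(program)
for cycle, occupancy in enumerate(trace, start=1):
    print(cycle, occupancy)
print(len(trace))  # 9 cycles for 5 instructions
```

In cycle 5 all five stages are busy at once, which is the "always busy with useful work" condition of an ideally balanced pipeline.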
Activity 1
Complete the MCQs given on LMS

Examples should be shown using this
