Processor Organization & Pipelining
CSC 227
Computer Architecture and Organization I
cont….
• Desktop and server computers typically use CISC, while tablets, smartphones, and other embedded devices use RISC.
• The higher efficiency of the RISC architecture makes it desirable in these applications, where cycles and power are usually in short supply.
• A CISC instruction set typically includes many instructions with different sizes
and execution cycles, which makes CISC instructions harder to pipeline.
Characteristic of CISC Processors
• A CISC instruction can be thought of as many different types of operations bundled into one instruction.
• A large number of instructions - typically from 100 to 250 instructions.
• Some instructions that perform specialized tasks and are used infrequently.
• A large variety of addressing modes - typically 5 to 20 different modes.
• Variable-length instruction formats
• Instructions that manipulate operands in memory.
Properties of a CISC Processor
1. Richer instruction set: some instructions simple, some very complex.
2. Instructions generally take more than one clock cycle to execute.
3. Instructions of variable size.
4. Instructions interface with memory through multiple mechanisms with complex addressing modes.
5. No pipelining (in classic designs).
6. Microcode control makes the CISC instruction set possible and flexible.
7. Works well with simpler compilers.
Advantage
• Microprogramming is as easy as assembly language to implement, and much less expensive than hardwiring a control unit.
• Memory references (loads and stores) are slow and account for a significant fraction of all instructions.
• The instruction set and chip hardware become more complex with each generation of computers.
CISC Instruction Example
A CISC processor could multiply 5 by 10 as follows:
Mov ax, 10   ; load 10 into AX
Mov bx, 5    ; load 5 into BX
Mul bx       ; AX = AX * BX = 50
RISC: Reduced Instruction Set Computer
History
The first RISC projects came from IBM, Stanford, and UC Berkeley in the late 1970s and early 1980s.
The IBM 801, Stanford MIPS, and Berkeley RISC 1 and 2 were all designed
with a similar philosophy which has become known as RISC.
• Code expansion: since a CISC machine performs complex actions with a single instruction where a RISC machine may require multiple instructions for the same action, code expansion can be a problem.
• System design: another problem facing RISC machines is that they require very fast memory systems to feed them instructions. RISC-based systems typically contain large memory caches, usually on the chip itself; this is known as a first-level cache.
RISC Instruction Example
• In RISC, the microprocessor's designers might make sure that ADD executes in one clock cycle.
• Then a compiler could multiply a and b by adding a to itself b times, or b to itself a times.
Mov ax, 0    ; accumulator for the result
Mov bx, 10   ; value to be added repeatedly
Mov cx, 5    ; loop counter
Begin:
Add ax, bx   ; ax = ax + bx
Loop Begin   ; decrement cx; repeat until cx = 0 (loops cx times)
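The repeated-addition loop above can be sketched in Python (a hypothetical illustration of the idea; real RISC compilers typically use shift-and-add or a hardware multiply when one is available):

```python
def multiply(a, b):
    """Multiply a by b using only addition, the way the RISC loop
    above does: add a to itself b times."""
    result = 0
    for _ in range(b):   # mirrors the "Loop Begin" counter in CX
        result += a      # one single-cycle ADD per iteration
    return result

print(multiply(10, 5))  # 50
```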
RISC 5 Stage Pipelining
Five-stage "RISC" load-store architecture
1. Instruction fetch (IF)
• Get instruction from memory, increment PC.
2. Instruction Decode (ID)
• Translate opcode into control signals and read registers.
3. Execute (EX)
• Perform ALU operation, compute jump/branch target
4. Memory (MEM)
• Access memory if needed
5. Writeback (WB)
• Update register file
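As a rough illustration of how these five stages overlap, the following Python toy model (an assumption: an ideal pipeline with no hazards, not any real processor's logic) computes which stage each instruction occupies in a given clock cycle:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr_index, cycle):
    """Return the stage that instruction instr_index occupies in the
    given cycle of an ideal 5-stage pipeline, or None if it is not
    in the pipeline.  Instruction i enters IF at cycle i."""
    s = cycle - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# Trace three instructions through cycles 0-6: a new instruction
# enters IF each cycle while earlier ones advance one stage.
for cycle in range(7):
    print(f"cycle {cycle}:", [stage_of(i, cycle) or "--" for i in range(3)])
```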
Example: CISC and RISC Instructions
Comparisons between CISC and RISC Processors
• Instructions take more cycles in CISC than in RISC.
• CISC has more complex instructions than RISC.
• A CISC program typically contains fewer instructions than the equivalent RISC program.
• CISC implementations tend to be slower than RISC implementations.
• RISC design is approximately twice as cost-effective as CISC.
• RISC architectures are designed for good cost/performance, whereas CISC architectures are designed for good performance on slow memories.
Comparisons between CISC and RISC Processors
cont….
CISC                                    RISC
Emphasis on hardware                    Emphasis on software
Slower: an instruction can take         Faster: instructions usually take
more than one clock cycle               one clock cycle
Pipelining - Introduction
In a typical system, speedup is achieved through parallelism at all levels: multi-user, multitasking, multi-processing, multi-programming, multi-threading, and compiler optimizations.
• Pipelining is a technique for overlapping operations during execution. Today this is a key feature that makes CPUs fast.
• Different types of pipeline: instruction pipelines, operation pipelines, multi-issue pipelines.
What is Pipelining? - 1
What is a Pipeline? - 2
Pipeline Characteristics
• Throughput: the number of items (cars, instructions, operations) that exit the pipeline per unit time, e.g. 1 instruction/clock cycle, 10 cars/hour, 10 floating-point operations/cycle.
• Stage time: the pipeline designer's goal is to balance the length of each pipeline stage (a balanced pipeline). In general,
Stage time = Time per instruction on non-pipelined machine / number of stages
• In many instances, stage time = max(times for all stages).
• CPI: pipelining yields a reduction in cycles per instruction; at steady state, the effective time per instruction ≈ stage time.
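The two stage-time rules above can be sketched in Python (the numbers below are hypothetical):

```python
def balanced_stage_time(unpipelined_time, num_stages):
    """Ideal stage time when the pipeline is perfectly balanced:
    non-pipelined time divided evenly across the stages."""
    return unpipelined_time / num_stages

def actual_stage_time(stage_times):
    """In practice the clock must accommodate the slowest stage."""
    return max(stage_times)

# Hypothetical numbers: an 8 ns unpipelined datapath cut into 4 stages
print(balanced_stage_time(8.0, 4))              # 2.0 ns if balanced
print(actual_stage_time([2.5, 2.0, 1.5, 2.0]))  # 2.5 ns when unbalanced
```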
Pipeline Analogy – The Laundry
Pipeline Analogy – The Laundry - Sequential Operation
(Figure: four laundry loads A-D done sequentially starting at 6 PM; each load takes four 30-minute steps, so the last load finishes at 2 AM, 8 hours in total.)
Pipeline Analogy – The Laundry - Overlapping Tasks
(Figure: the same four loads A-D with their steps overlapped; a new load starts every 30 minutes, and all four finish by 9:30 PM, 3.5 hours in total.)
Pipeline Analogy – The Laundry (cont’d)
• Key idea: break big computation up into pieces
(Figure: the computation is divided into stages of about 1 ns each, separated by pipeline registers.)
Pipelining Analogy – Grading of Exam
(Figure: a k-stage pipeline; input tasks flow through Stage 1, Stage 2, ..., Stage k, with a buffer between successive stages.)
(Figure: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction Fetch, Register Read, ALU, Memory access, and Register Write, with each stage taking 200 ps; a new instruction starts every 200 ps.)
Performance Issues in Pipelining
• Speedup: how much performance improvement we gain through pipelining.
▪ n: Number of tasks to be performed
Performance Issues in Pipelining – cont’d
• Pipelined machine (k stages)
▪ tp: Clock cycle time (time to complete each sub-operation)
▪ tk: Time required to complete all n tasks
▪ tk = (k + (n − 1)) · tp
• Non-pipelined machine
▪ tn: Time required to complete each task
• Speedup
▪ Sk: Speedup
▪ Sk = n · tn / ((k + (n − 1)) · tp)
Performance Issue - Example
Example
- 4-stage pipeline
- Sub-operation in each stage: tp = 20 ns
- 100 tasks to be executed
- Time per task on the non-pipelined system: tn = 20 × 4 = 80 ns
Pipelined system
tk = (k + (n − 1)) · tp = (4 + 99) × 20 = 2060 ns
Non-pipelined system
n · tn = 100 × 80 = 8000 ns
Speedup
Sk = 8000 / 2060 = 3.88
A 4-stage pipeline is, in effect, comparable to a system with 4 identical functional units.
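The formulas and numbers from the example above can be checked with a short Python sketch:

```python
def pipelined_time(k, n, tp):
    """Total time for n tasks on a k-stage pipeline with cycle time tp:
    tk = (k + (n - 1)) * tp."""
    return (k + (n - 1)) * tp

def speedup(k, n, tp, tn):
    """Sk = n * tn / ((k + (n - 1)) * tp), with tn the time per task
    on the non-pipelined machine."""
    return (n * tn) / pipelined_time(k, n, tp)

# Numbers from the example: 4 stages, tp = 20 ns, 100 tasks, tn = 80 ns
print(pipelined_time(4, 100, 20))          # 2060 ns
print(round(speedup(4, 100, 20, 80), 2))   # 3.88
```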
Performance Issue – Example (cont’d)
(Figure: a system with four identical functional units P1-P4 operating in parallel.)
Pipeline Performance – Example 2
• Consider an unpipelined design (Design 1) with a 10 ns clock, in which ALU operations (40% of the mix) and branches (20%) take 4 cycles and memory operations (40%) take 5 cycles, and a pipelined design (Design 2) in which each instruction completes in one clock cycle with 1 ns of setup and clock-skew overhead. Compare the average instruction times.
• Design 1:
• Average instruction execution time = clock cycle time × CPI
• = 10 ns × (4 × 0.4 + 4 × 0.2 + 5 × 0.4) = 10 × (1.6 + 0.8 + 2.0)
• = 44 ns
• Design 2:
• Average instruction time at steady state equals the clock cycle time:
• = 10 ns + 1 ns (for setup and clock skew) = 11 ns
• Speedup = 44 / 11 = 4
Pipeline Performance – Example3
• Assume the times for the functional units of a pipeline are 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns, with an overhead of 1 ns per stage. Compute the speedup of the datapath.
• Pipelined: stage time = max(10, 8, 10, 10, 7) + overhead
• = 10 + 1 = 11 ns.
• This is the average instruction execution time at steady state.
• Non-pipelined: 10 + 8 + 10 + 10 + 7 = 45 ns
• Speedup = 45 / 11 ≈ 4.1 times
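A sketch of the same calculation in Python, using the stage times from the example:

```python
def pipeline_clock(stage_times, overhead):
    """Clock period = slowest stage plus per-stage overhead
    (latch setup and clock skew)."""
    return max(stage_times) + overhead

stages = [10, 8, 10, 10, 7]        # ns, from the example above
unpipelined = sum(stages)          # all stages executed in sequence
clock = pipeline_clock(stages, 1)  # steady-state time per instruction
print(unpipelined, clock, round(unpipelined / clock, 1))  # 45 11 4.1
```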
Performance Issue – (cont’d)
• Efficiency: the efficiency of a pipeline can be measured as the ratio of its busy time span to the total time span, including idle time.
• Let n be the number of tasks, m the number of stages, and c the clock period of the pipeline; the efficiency E can be written as:
• E = (n · m · c) / (m · [m · c + (n − 1) · c]) = n / (m + (n − 1))
• As n → ∞, E approaches 1.
Performance Issue – (cont’d)
• Throughput: the throughput of a pipeline is the number of results it completes per unit time.
• It can be written as:
▪ T = (n / [m + (n − 1)]) / c = E / c
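Both formulas can be expressed directly in Python; the task counts and the 11 ns clock below are hypothetical:

```python
def efficiency(n, m):
    """E = n / (m + (n - 1)) for n tasks on an m-stage pipeline."""
    return n / (m + (n - 1))

def throughput(n, m, c):
    """T = E / c: results completed per unit time with clock period c."""
    return efficiency(n, m) / c

print(round(efficiency(100, 4), 3))  # 0.971: long runs keep stages busy
print(round(efficiency(4, 4), 3))    # 0.571: short runs waste fill/drain cycles
print(round(throughput(100, 4, 11), 4))  # results per ns with an 11 ns clock
```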
Speedup - Example
• Consider an unpipelined processor with a 1 ns clock cycle. It uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations; the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate do we gain from a pipeline?
Average instruction execution time
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 4.4 ns
Speedup from pipelining
= Average instruction time unpipelined / Average instruction time pipelined
= 4.4 ns / 1.2 ns ≈ 3.7
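A quick check of this calculation as a CPI-weighted average:

```python
def avg_instr_time(clock, mix):
    """Average instruction time = clock * sum(frequency * cycles)
    over the (frequency, cycles) pairs of the instruction mix."""
    return clock * sum(freq * cycles for freq, cycles in mix)

# Mix from the example: ALU ops (40%) and branches (20%) take 4 cycles,
# memory operations (40%) take 5 cycles; 1 ns unpipelined clock.
unpipelined = avg_instr_time(1.0, [(0.4, 4), (0.2, 4), (0.4, 5)])
pipelined = 1.0 + 0.2   # one cycle per instruction plus 0.2 ns overhead
print(unpipelined)                        # 4.4 ns
print(round(unpipelined / pipelined, 1))  # 3.7
```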
Pipeline Hazards/Limitations
• Hazards reduce performance below the ideal speedup of the pipeline:
• Structural hazard: a resource conflict.
• The hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
• Data hazard:
• An instruction depends on the result of a previous instruction.
• Control hazard:
• Caused by branches and other instructions that change the PC.
Pipeline Stalls
• A stall is a delay (in cycles) caused by any of the hazards mentioned above.
• Speedup:
• Number of stages / (1 + pipeline stalls per instruction)
• The cycles needed to initially fill the pipeline may be included in the computation of the average stalls per instruction.
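A minimal sketch of the stall-speedup formula above, with hypothetical stall rates:

```python
def stall_speedup(num_stages, stalls_per_instr):
    """Speedup of a pipeline relative to no pipelining, accounting
    for hazard stalls: stages / (1 + average stalls per instruction)."""
    return num_stages / (1 + stalls_per_instr)

print(stall_speedup(5, 0.0))   # 5.0: ideal pipeline, no stalls
print(stall_speedup(5, 0.25))  # 4.0: one stall every fourth instruction
```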
Structural Hazards
• When more than one instruction in the pipeline needs to access a
resource, the datapath is said to have a structural hazard.
• Examples of resources: register file, memory, ALU.
• Solution: Stall the pipeline for one clock cycle when the conflict is
detected. This results in a pipeline bubble.
• Figures 4 & 5 illustrate the memory access conflict and how it is
resolved by stalling an instruction.
• Problem: one memory port.
Structural Hazards and Stalls - Conflicts
Structural Hazards and Stalls - Solution
Structural Hazards and Stalls - Bubble
Structural Hazard - Example
• A machine with a load structural hazard: data references constitute 40% of the instruction mix, and the ideal CPI is 1. The machine with the hazard has a clock rate 1.05 times that of the machine without it. Which machine is faster: the one without the hazard (machine A) or the one with the hazard (machine B)? Prove it.
• Solution: the hazard affects 40% of machine B's instructions.
• Average instruction time for machine A: CPI × clock cycle time = 1 × x = 1.0x
Structural Hazard - Solution
• Average instruction time for machine B:
1) The CPI is extended: 40% of instructions take 1 more cycle.
2) The clock is faster: 1.05 times machine A's. By how much does this help?
• Average instruction time for machine B: (1 + 0.4 × 1) × (clock cycle time / 1.05)
= 1.4 × x / 1.05 ≈ 1.33x
• Hence machine A (without the hazard) is faster.
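The comparison can be verified numerically; machine A's clock cycle time x is taken as 1.0 purely for illustration:

```python
def avg_time(cpi, clock):
    """Average instruction time = CPI * clock cycle time."""
    return cpi * clock

x = 1.0                           # machine A's clock cycle time (arbitrary unit)
time_a = avg_time(1.0, x)         # A: no hazard, ideal CPI of 1
time_b = avg_time(1.4, x / 1.05)  # B: 40% of instructions stall 1 cycle,
                                  #    but the clock is 1.05x faster
print(round(time_a, 2), round(time_b, 2))  # 1.0 1.33 -> A is faster
```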
Data Hazard - Example
• Consider the instruction sequence:
• ADD R1, R2, R3 ; result is in R1
• SUB R4, R5, R1
• AND R6, R1, R7
• OR R8, R1, R9
• XOR R10, R1, R11
• All the instructions after the first use R1.
Data Hazard - Solution
Data Hazard – Example
Example 1:
add $s0, $t0, $t1
sub $t2, $s0, $t3
In this example, the second instruction depends on the result left in $s0 by the first instruction:
if $s0 = −5 before the add
and $s0 = 8 after the add,
then the value 8 should be used by the second instruction (sub).
Draw the multiple-clock-cycle pipeline diagram for the execution:
Data Hazard – Example (cont'd)
The sub instruction must read the value of $s0 in its ID stage (CC3). However, the value in $s0 during CC3 is still −5, not the correct value 8; the correct value is only in $s0 at the end of clock cycle 5 (CC5).
The dependency therefore goes from CC5 to CC3, backward in time.
Data Hazard Stalls
• Not all data hazards can be solved by forwarding:
• LW R1, 0(R2)
• SUB R4, R1, R5
• AND R6, R1, R7
• OR R8, R1, R9
• Unlike the previous example, the data is not available until the MEM/WB register, so the SUB instruction's ALU cycle has to be stalled, introducing a (vertical) bubble.
Data Hazards and Stalls
Data Hazards – Time Stage Diagram
Data Hazard Classification
• RAW – Read After Write. Most common; solved by data forwarding.
• WAW – Write After Write: instruction i (e.g. a load) comes before instruction j (e.g. an add), and both write to the same register; i must do so before j. DLX avoids this by having all instructions wait until WB to write registers, so there is no WAW hazard in DLX.
• WAR – Write After Read: instruction j tries to write a destination before it is read by instruction i, so i incorrectly gets the new value. This cannot happen in DLX, since all instructions read early (in ID) but write late (in WB).
• WAR hazards can occur, however, in complex instruction sets that have auto-increment addressing modes and require operands to be read late in the cycle.
Data Hazard Classification – Exercise
Describe each of the following categories of data hazards: RAW, WAR, WAW. Using (i – iii) below, state which is RAW, WAR, or WAW, and indicate how each occurs using an arrow.
i)  R3 ← R1 op R2    ii)  R3 ← R1 op R2    iii)  R3 ← R1 op R2
    R5 ← R3 op R4         R1 ← R4 op R5          R3 ← R6 op R7
Limitations to Speedup - 1
Limitations to Speedup - 2
Limitations to Speedup - 3
• Branch instructions and interrupts in the program:
• A program is not a straight flow of sequential instructions.
• There may be branch instructions that alter the normal flow of the program, which delays pipelined execution and affects performance.