
Lecture 03

Uploaded by

hamza abbas
Copyright © All Rights Reserved

Execution time

• Execution Time (processor-related)
  ET = IC x CPI x T
  where
  IC = instruction count
  CPI = average number of system clock periods to execute an instruction
  T = clock period
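As a minimal sketch, the formula can be written as a one-line Python function. The function name, the parameter names, and the 10 ns clock period in the usage lines are illustrative assumptions, not values from the lecture.

```python
def execution_time(instruction_count, cpi, clock_period_s):
    """Processor-related execution time: ET = IC x CPI x T (in seconds)."""
    return instruction_count * cpi * clock_period_s


# Hypothetical numbers: 6 instructions, average CPI of 3, 10 ns clock period.
et_seconds = execution_time(6, 3, 10e-9)
print(f"ET = {et_seconds * 1e9:.0f} ns")
```

Because ET is a plain product, halving any one factor (fewer instructions, fewer cycles per instruction, or a shorter clock period) halves the execution time.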
Example
Consider two SRC programs having three types of instructions given as follows

Number of ..                    Program 1    Program 2
data transfer instructions          2            1
control instructions                2            5
ALSU instructions                   2            1

Compare both the programs for the following parameters
1. Instruction count
2. Speed of execution
Example contd..
1. Instruction count IC.
   IC for program 1 = 2+2+2 = 6
   IC for program 2 = 1+5+1 = 7
2. For execution time we can use the following SRC specifications.

   Instruction Type    CPI
   Control             2
   ALSU                3
   Data Transfer       4

   ET = IC x CPI x T
   ET1 = (2x2)+(2x3)+(2x4) = 18
   ET2 = (5x2)+(1x3)+(1x4) = 17
   Program 2 executes more instructions but needs fewer clock cycles, so it runs faster.
Note: Since both programs are executing on the same machine, the T factor can be ignored while calculating ET.
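The comparison in this example can be sketched in Python. The CPI values and instruction mixes are taken from the example; the dictionary keys and function names are illustrative choices only.

```python
# Cycles per instruction for each instruction class (from the example).
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}


def instruction_count(mix):
    """Total number of instructions in a program's instruction mix."""
    return sum(mix.values())


def clock_cycles(mix, cpi=CPI):
    """Sum of (count x CPI) over all instruction classes, i.e. ET in units of T."""
    return sum(count * cpi[kind] for kind, count in mix.items())


program1 = {"data_transfer": 2, "control": 2, "alsu": 2}
program2 = {"data_transfer": 1, "control": 5, "alsu": 1}

for name, mix in (("program 1", program1), ("program 2", program2)):
    print(name, "IC =", instruction_count(mix), "ET =", clock_cycles(mix), "T")
```

The clock period T is left out of the calculation for the same reason as in the note: both programs run on the same machine, so T scales both results equally.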
Problem: Consider the following two SRC code segments for implementing the operation a=b+5c. Find which one is more efficient in terms of instruction count and execution time.

Program 1: Multiplication by using repeated addition in a for loop

        .org 0
a:      .dw 1
b:      .dw 1
c:      .dw 1
        .org 80
        la r5, 5        ; load loop index
        lar r6, mpy     ; load address of mpy
        lar r7, next    ; load address of next
        ld r2, b        ; load contents of b
        ld r3, c        ; load contents of c
        la r4, 0        ; load 0 in r4
mpy:
        brzr r7, r5     ; jump to next after 5 iterations
        add r4, r4, r3  ; r4 contains r4+c
        addi r5, r5, -1 ; decrement index
        br r6           ; loop again
next:
        add r4, r4, r2  ; r4 contains sum of b and 5c
        st r4, a        ; store at address a
        stop
Program 2: Multiplication using subroutine call

        .org 0
a:      .dw 1
b:      .dw 1
c:      .dw 1
        .org 80
        lar r1, mpy     ; load address of mpy in r1
        ld r2, b        ; load contents of b in r2
        la r3, 5        ; load index in r3
        ld r4, c        ; load contents of c in r4
        brl r5, r1      ; r5 contains PC
        add r2, r2, r7  ; r2 contains sum b+5c
        st r2, a
        stop
mpy:
        la r7, 0        ; r7 contains zero
        lar r8, again   ; r8 contains address of again
again:
        brzr r5, r3     ; exit loop when index is 0
        add r7, r7, r4  ; r7 contains r7+c
        addi r3, r3, -1 ; decrement index
        br r8
Solution
The instructions in both programs can be divided into 3 types and the respective count of each type is

Number of..                     Program 1    Program 2
Data transfer instructions          7            7
Control instructions                3            4
ALSU instructions                   3            3

IC for program 1 = 7 + 3 + 3 = 13
IC for program 2 = 7 + 4 + 3 = 14
Solution contd..
For execution time, consider the following SRC specifications.

Instruction Type    CPI
Control             2
ALSU                3
Data Transfer       4

ET = IC x CPI x T
ET1 = (7x4)+(3x2)+(3x3) = 43T
ET2 = (7x4)+(4x2)+(3x3) = 45T
Conclusion:
Program 1 runs faster than program 2, as is evident from the execution times of both.
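The counting in this solution can be reproduced with a short Python sketch that classifies each mnemonic and sums the cycles. The classification table is an assumption chosen to match the slide counts (in particular, stop is grouped with the control instructions), and the counts are static: each instruction in the listing is counted once, regardless of how many times the loop executes.

```python
from collections import Counter

# Assumed classification of SRC mnemonics, chosen to match the slide counts.
INSTRUCTION_CLASS = {
    "ld": "data_transfer", "la": "data_transfer", "lar": "data_transfer",
    "st": "data_transfer",
    "br": "control", "brzr": "control", "brl": "control", "stop": "control",
    "add": "alsu", "addi": "alsu",
}
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

# Mnemonics of each listing, in program order.
PROGRAM_1 = "la lar lar ld ld la brzr add addi br add st stop".split()
PROGRAM_2 = "lar ld la ld brl add st stop la lar brzr add addi br".split()


def summarize(mnemonics):
    """Return (per-class counts, instruction count, cycles in units of T)."""
    counts = Counter(INSTRUCTION_CLASS[m] for m in mnemonics)
    ic = sum(counts.values())
    cycles = sum(CPI[kind] * n for kind, n in counts.items())
    return counts, ic, cycles


for name, program in (("program 1", PROGRAM_1), ("program 2", PROGRAM_2)):
    counts, ic, cycles = summarize(program)
    print(name, dict(counts), "IC =", ic, "ET =", cycles, "T")
```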
MIPS
• Millions of Instructions Per Second
  MIPS = IC / (ET x 10^6)
• Capability of different instructions varies from machine to machine, e.g. RISC machines have simpler instructions, so the same job will require more instructions
• Was popular when the VAX 11/780 was treated
as a reference – late 70s and early 80s
MIPS as a performance metric
• For a given instruction count, MIPS is inversely proportional to execution time,
  ET = IC / (MIPS x 10^6)
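The two forms of the MIPS relation can be sketched as a pair of Python functions. The function names and the numbers in the usage lines are hypothetical and serve only to show that the two formulas are inverses of each other.

```python
def mips_rating(instruction_count, execution_time_s):
    """MIPS = IC / (ET x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def execution_time_from_mips(instruction_count, mips):
    """ET = IC / (MIPS x 10^6), in seconds."""
    return instruction_count / (mips * 1e6)


# Hypothetical run: 50 million instructions executed in 2 seconds.
rating = mips_rating(50e6, 2.0)
print("MIPS =", rating)
print("ET recovered from MIPS =", execution_time_from_mips(50e6, rating), "s")
```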
Example
Consider a machine having a 100 MHz clock and three instruction types with the following parameters.

Instruction Type    CPI
Control             2
ALSU                3
Data Transfer       4

Now suppose that two different compilers generate code for the same program. The instruction count for each is given as follows

IC in millions      Code from compiler 1    Code from compiler 2
Control                      5                      10
ALSU                         1                       1
Data Transfer                1                       1

Compare the two codes according to MIPS and according to execution time.
Solution:
First we find the CPI for both code sequences.
Since CPI = total clock cycles over all instruction types / IC
CPI1 = (5x2 + 1x3 + 1x4) / 7 = 2.43
CPI2 = (10x2 + 1x3 + 1x4) / 12 = 2.25

As MIPS = IC / (ET x 10^6) and ET = IC x CPI / clock rate,
MIPS = (IC x clock rate) / (IC x CPI x 10^6)
     = clock rate / (CPI x 10^6)

MIPS1 = 100 x 10^6 / (2.43 x 10^6) = 41.15
MIPS2 = 100 x 10^6 / (2.25 x 10^6) = 44.44
Hence the code generated by compiler 2 has the higher MIPS rating.
Solution contd..
Since ET = IC / (MIPS x 10^6)
ET1 = (7 x 10^6) / (41.15 x 10^6) = 0.17 seconds
ET2 = (12 x 10^6) / (44.44 x 10^6) = 0.27 seconds
Hence code sequence 1 is more efficient in terms of execution time, even though code sequence 2 has the higher MIPS rating.
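The whole compiler comparison can be checked with a short Python sketch. The clock rate, CPI values, and instruction counts come from the example; the names are illustrative. The sketch does not round CPI before computing MIPS, so its MIPS figure for compiler 1 comes out slightly higher than the 41.15 obtained above from the rounded CPI of 2.43.

```python
CLOCK_RATE_HZ = 100e6  # 100 MHz clock
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

# Instruction counts in millions, per instruction class.
compiler1 = {"control": 5, "alsu": 1, "data_transfer": 1}
compiler2 = {"control": 10, "alsu": 1, "data_transfer": 1}


def analyze(ic_millions):
    """Return (average CPI, MIPS rating, execution time in seconds)."""
    ic = sum(ic_millions.values()) * 1e6
    cycles = sum(CPI[kind] * n for kind, n in ic_millions.items()) * 1e6
    cpi = cycles / ic
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    et = cycles / CLOCK_RATE_HZ
    return cpi, mips, et


for name, code in (("compiler 1", compiler1), ("compiler 2", compiler2)):
    cpi, mips, et = analyze(code)
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.2f}, ET = {et:.2f} s")
```

Compiler 2 wins on MIPS only because it executes many cheap control instructions; its total cycle count, and therefore its execution time, is larger.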
MFLOPS
• Millions of FLoating point Operations Per Second
• For some, counting FP operations makes more sense than counting arbitrary instructions
• Results vary from FP op to FP op
• Better compared to MIPS because of two reasons:
  1. FP ops are complex, and therefore provide a better picture of the capabilities of the hardware on which they are run
  2. Overheads (get operands, store results, etc.) are effectively lumped with the FP ops they support
Dhrystones
• Dhrystone is a general “integer performance” benchmark test originally developed by Reinhold Weicker in 1984.
• Small program; less than 100 HLL statements
• Compiles to about 1 to 1.5 KB of code
• The name is a play on the word Whetstone


Disadvantages of using Whetstones and Dhrystones
Both Whetstones and Dhrystones are now considered obsolete for the following reasons.
• Small enough to fit in cache
• Obsolete instruction mix
• Prone to compiler tricks
• Difficult to reproduce results
• Uncontrolled source code
SPEC
• System Performance Evaluation Cooperative (SPEC) was founded in October 1988 by Apollo, Hewlett-Packard, MIPS Computer Systems and Sun Microsystems
• Latest version (at the time of this lecture) is SPEC CPU2000
SPEC
• The standard SPEC benchmark suite includes:
  - A compiler
  - A Boolean minimization program
  - A spreadsheet program
  - A number of other programs that stress arithmetic processing speed
• It uses a simple metric, elapsed time, to measure performance of competing machines
• Machine independent code is used for fair comparison
Advantages
• It provides for ease of publication.
• Each benchmark carries the same weight.
• SPECratio is dimensionless.
• It is not unduly influenced by long running
programs.
• It is relatively immune to performance variation
on individual benchmarks.
• It provides a consistent and fair metric.
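The slides do not define SPECratio. The sketch below assumes the common convention that a SPECratio is the reference machine's time divided by the measured time, and that the per-benchmark ratios are summarized with a geometric mean. All timings and names are hypothetical.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Assumed definition: reference time / measured time (dimensionless)."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios):
    """n-th root of the product of n ratios; each ratio carries equal weight."""
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds.
timings = [(100.0, 50.0), (4000.0, 500.0)]
ratios = [spec_ratio(ref, measured) for ref, measured in timings]
print("SPECratios:", ratios)
print("Geometric mean:", geometric_mean(ratios))
```

Because each benchmark enters as a ratio, the 4000-second program contributes no more weight than the 100-second one, which is the sense in which the summary is not unduly influenced by long-running programs.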
Programmer’s view of the SRC
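[Figure: programmer's view of the SRC. CPU: register file of 32 registers R0 to R31, each 32 bits wide (bits 31..0), plus IR and PC. Main memory: byte-wide locations (bits 7..0) at addresses 0 to 2^32 - 1.]
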
SRC: Notation
• R[3] means contents of register 3
• M[8] means contents of memory location 8
• A memory word at address 8 is defined as the 32 bits at addresses 8, 9, 10 and 11
SRC: Notation
(continued…)

• Special notation for 32-bit memory words


M[8]<31…0>:=M[8]©M[9]©M[10]©M[11]
© is used to represent concatenation
• Logical addresses: one memory “word” at address a = 8

  Address    Byte      Bits of the word
  a          M[8]      31..24 (MS Byte)
  a+1        M[9]      23..16
  a+2        M[10]     15..8
  a+3        M[11]     7..0 (LS Byte)
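The word notation above amounts to concatenating four consecutive bytes, most significant byte first. A minimal Python sketch follows; the function name, the 16-byte memory, and the byte values are illustrative assumptions.

```python
def read_word(memory, address):
    """Concatenate M[address] .. M[address+3] into one 32-bit word.

    M[address] supplies bits 31..24 (MS byte) and M[address+3] supplies
    bits 7..0 (LS byte), as in the logical-address layout.
    """
    word = 0
    for offset in range(4):
        word = (word << 8) | memory[address + offset]
    return word


# Hypothetical memory contents: bytes 0x12, 0x34, 0x56, 0x78 at addresses 8..11.
memory = bytearray(16)
memory[8:12] = bytes([0x12, 0x34, 0x56, 0x78])
print(hex(read_word(memory, 8)))
```

The same result is produced by Python's built-in `int.from_bytes(memory[8:12], "big")`, which makes explicit that this byte ordering is big-endian.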
SRC: instruction formats
Type A    Op-code <31..27>    unused <26..0>
Type B    Op-code <31..27>    ra <26..22>    c1 <21..0>
Type C    Op-code <31..27>    ra <26..22>    rb <21..17>    c2 <16..0>
Type D    Op-code <31..27>    ra <26..22>    rb <21..17>    rc <16..12>    c3 <11..0>
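The field boundaries of the four formats can be illustrated with a small Python sketch that packs a Type C word and splits any 32-bit word back into its raw fields. The function names are hypothetical; the bit positions are the ones given for the formats, and op-code 1 in the usage lines is the ld op-code used elsewhere in this lecture.

```python
def encode_type_c(opcode, ra, rb, c2):
    """Pack a Type C instruction: op-code<31..27> ra<26..22> rb<21..17> c2<16..0>."""
    return (((opcode & 0x1F) << 27) | ((ra & 0x1F) << 22)
            | ((rb & 0x1F) << 17) | (c2 & 0x1FFFF))


def decode(word):
    """Split a 32-bit instruction word into its raw (unsigned) fields.

    Every field is extracted; which ones are meaningful depends on the
    instruction type (A, B, C or D) selected by the op-code.
    """
    return {
        "opcode": (word >> 27) & 0x1F,
        "ra": (word >> 22) & 0x1F,
        "rb": (word >> 17) & 0x1F,
        "rc": (word >> 12) & 0x1F,
        "c1": word & 0x3FFFFF,  # 22-bit constant (Type B)
        "c2": word & 0x1FFFF,   # 17-bit constant (Type C)
        "c3": word & 0xFFF,     # 12-bit constant (Type D)
    }


# ld R3, 56(R5): op-code 1, ra = 3, rb = 5, c2 = 56
word = encode_type_c(1, 3, 5, 56)
print(hex(word), decode(word))
```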
Type A    Op-code <31..27>    unused <26..0>

Only two instructions
• nop (op-code = 0)
  - useful in pipelining
• stop (op-code = 31)
Both are 0-operand
Type B    Op-code <31..27>    ra <26..22>    c1 <21..0>

Note: R8 is register name and R[8] means contents of register R8

• three instructions; all three use relative addressing mode
• ldr (op-code = 2) load register from memory using relative address
    ldr R3, 56        R[3] ← M[PC+56]
• lar (op-code = 6) load register with relative address
    lar R3, 56        R[3] ← PC+56
• str (op-code = 4) store register to memory using relative address
    str R8, 34        M[PC+34] ← R[8]
• the effective address is computed at run-time by adding a constant to the PC
• makes the instructions relocatable
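The relative-address computation can be sketched in Python as follows. The sketch assumes that the 22-bit constant c1 is sign-extended before it is added to the PC and that addresses wrap at 32 bits; the `pc` argument stands for whatever PC value the instruction sees when it executes. The function names and PC values are illustrative.

```python
WORD_MASK = 0xFFFFFFFF  # addresses are 32 bits wide


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def relative_address(pc, c1):
    """Effective address PC + c1, with c1 assumed to be sign-extended."""
    return (pc + sign_extend(c1, 22)) & WORD_MASK


# The same displacement reaches the same relative target wherever the code sits.
for pc in (80, 4096):
    target = relative_address(pc, 56)
    print(f"PC = {pc}: effective address = {target}, offset = {target - pc}")
```

Since only the displacement is stored in the instruction, moving the code and its data together changes the PC but not the encoded constant, which is what makes these instructions relocatable.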
Type C    Op-code <31..27>    ra <26..22>    rb <21..17>    c2 <16..0>

• three load/store instructions, plus three ALU instructions
• ld (op-code = 1) load register from memory
    ld R3, 56         R[3] ← M[56]          (rb field = 0)
    ld R3, 56(R5)     R[3] ← M[56+R[5]]     (rb field ≠ 0)
• la (op-code = 5) load register with displacement address
    la R3, 56         R[3] ← 56
    la R3, 56(R5)     R[3] ← 56+R[5]
• st (op-code = 3) store register to memory
    st R8, 34         M[34] ← R[8]
    st R8, 34(R6)     M[34+R[6]] ← R[8]
Problem: Consider the following two SRC code segments for implementing multiplication. Find which one is more efficient in terms of instruction count and execution time.

Program 1: Multiplication by using repeated addition in a for loop

        la r5, 5        ; load loop index
        lar r6, mpy     ; load address of mpy
        lar r7, next    ; load address of next
        ld r2, b        ; load contents of b in r2
        ld r3, c        ; load contents of c in r3
        la r4, 0        ; load 0 in r4
mpy:
        brzr r7, r5     ; jump to next after 5 iterations
        add r4, r4, r3  ; r4 contains r4+c
        addi r5, r5, -1 ; decrement index
        br r6           ; loop again
next:
        add r4, r4, r2  ; r4 contains sum of b and 5c
        st r4, a        ; store at address label a

Program 2: Multiplication using subroutine call

        lar r1, mpy     ; load address of mpy in r1
        ld r2, b        ; load contents of b in r2
        la r3, 5        ; load index in r3
        ld r4, c        ; load contents of c in r4
        brl r5, r1      ; r5 contains PC
        add r2, r2, r7  ; r2 contains sum of b and 5c
        st r2, a
mpy:
        lar r8, again   ; r8 contains address of again
again:
        brzr r5, r3     ; exit loop when index is 0
        add r7, r7, r4  ; r7 contains r7+c
        addi r3, r3, -1 ; decrement index
        br r8
Solution
The instructions in both programs can be divided into 3 types and the respective count of each type is

Number of..                     Program 1    Program 2
Data transfer instructions          7            6
Control instructions                2            3
ALSU instructions                   3            3

IC for program 1 = 7 + 2 + 3 = 12
IC for program 2 = 6 + 3 + 3 = 12
Solution contd..
For execution time, consider the following SRC specifications.

Instruction Type    CPI
Control             2
ALSU                3
Data Transfer       4

ET = IC x CPI x T
ET1 = (7x4)+(2x2)+(3x3) = 41
ET2 = (6x4)+(3x2)+(3x3) = 39
Conclusion:
Although the instruction count for both programs is the same, program 2 runs faster than program 1 because it requires fewer clock cycles (39 versus 41).
