Chapter 2: Advanced Computer Architecture
UNIT - 2
Instruction Set Design
[Figure: the instruction set is the interface between software and hardware]
Instruction cycle: Instruction Fetch -> Instruction Decode -> Operand Fetch -> Execute -> Result Store -> Next Instruction
Programmer's View    Computer's View
ADD                  01010
SUBTRACT             01110
AND                  10011
OR                   10001
COMPARE              11010
...                  ...
Hardware: CPU, Memory, I/O
Princeton (Von Neumann) Architecture: data and instructions mixed in the same unified memory
Harvard Architecture: data and instructions in separate memories
The arrows indicate whether the operand is an input or the result of the
ALU operation, or both an input and result.
Lighter shades indicate inputs, and the dark shade indicates the result.
In (a), a Top Of Stack (TOS) register points to the top input operand,
which is combined with the operand below. The first operand is removed
from the stack, the result takes the place of the second operand, and
TOS is updated to point to the result. All operands are implicit. In (b), the
Accumulator is both an implicit input operand and a result. In (c), one
input operand is a register, one is in memory, and the result goes to a
register. All operands are registers in (d) and, like the stack architecture,
can be transferred to memory only via separate instructions: push or pop
for (a) and load or store for (d).
Code sequences for C = A + B:
Stack      | Accumulator | Register-memory | Register-register
Push A     | Load A      | Load R1, A      | Load R1, A
Push B     | Add B       | Add R3, R1, B   | Load R2, B
Add        | Store C     | Store R3, C     | Add R3, R1, R2
Pop C      |             |                 | Store R3, C
Stack Architectures
Accumulator Architectures
Register-Set Architectures
Register-to-Memory Architectures
Memory-to-Memory Architectures
Instruction Formats
There are a few fundamental operations that all computers must provide
All designers have the same goal of finding a language that simplifies building
the hardware and the compiler while maximizing performance and
minimizing cost
Interface Design
A good interface:
[Figure labels: Technology, Machine organization, Programming languages, Compiler technology, Operating systems; Interface with use, imp 1, imp 2, imp 3 over Time]
Year | # general-purpose registers | Architecture style
1974 | 2                           | Accumulator
1977 | 16                          | Register-memory, memory-memory
1978 | 1                           | Extended accumulator
1980 | 16                          | Register-memory
1985 | 32                          | Register-memory
1992 | 32                          | Load-store
1992 | 32                          | Load-store
Operands are pushed onto a stack from memory, and results are popped from
the stack to memory
Operations take their operands by default from the top of the stack and insert
the results back onto the stack
Stack machines simplify compilers and lend themselves to a compact
instruction encoding but limit compiler optimization (e.g. in mathematical expressions)
Example:
A = B + C
Push AddressC    # Top=Top+4; Stack[Top]=Memory[AddressC]
Push AddressB    # Top=Top+4; Stack[Top]=Memory[AddressB]
add              # Stack[Top-4]=Stack[Top]+Stack[Top-4]; Top=Top-4
Pop AddressA     # Memory[AddressA]=Stack[Top]; Top=Top-4
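The push/add/pop sequence for A = B + C can be traced with a small simulator. This is a minimal sketch, assuming a Python list stands in for the hardware stack and a dictionary for memory; the function name `run_stack_program` and the sample values are hypothetical and not part of the original example.

```python
def run_stack_program(program, memory):
    """Execute (opcode, operand) pairs on a simple stack machine."""
    stack = []
    for opcode, operand in program:
        if opcode == "push":      # Stack[Top] = Memory[operand]
            stack.append(memory[operand])
        elif opcode == "pop":     # Memory[operand] = Stack[Top]
            memory[operand] = stack.pop()
        elif opcode == "add":     # replace the top two entries with their sum
            top = stack.pop()
            below = stack.pop()
            stack.append(top + below)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# Illustrative values only.
memory = {"A": 0, "B": 7, "C": 5}
program = [("push", "C"), ("push", "B"), ("add", None), ("pop", "A")]
result = run_stack_program(program, memory)
print(result["A"])  # 12
```

Note that the add instruction names no operands at all: both inputs and the result are implicit, which is what keeps the encoding compact.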
Concept of a Family (IBM 360, 1964)
Load/Store Architecture (CDC 6600, Cray 1, 1963-76)
RISC (MIPS, SPARC, IBM RS6000, ..., 1987)
Register-Memory Architectures
# memory addresses | Max. number of operands | Examples
Memory Address
Interpreting Memory Addressing
The address of a word matches the byte address of one of its 4 bytes
The addresses of sequential words differ by 4 (word size in bytes)
Word addresses are multiples of 4 (alignment restriction)
Machines that use the address of the leftmost byte as the word address are
called "Big Endian" and those that use the rightmost byte are called "Little Endian"
Misalignment complicates memory access and causes programs to run slower
(some machines do not allow misaligned memory access at all)
Byte ordering can be a problem when exchanging data among different machines
Byte addressing affects array index calculation, which must account for word addressing and
the offset within the word
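The byte-ordering and alignment points above can be checked with Python's standard `struct` module. This is a minimal sketch, assuming 4-byte words; the helper names `is_aligned` and `element_address` and the sample value are hypothetical.

```python
import struct

value = 0x0A0B0C0D

big = struct.pack(">I", value)     # most significant byte at the lowest address
little = struct.pack("<I", value)  # least significant byte at the lowest address


def is_aligned(address, size):
    """An object of `size` bytes is aligned when its address is a multiple of size."""
    return address % size == 0


def element_address(base, index, word_size=4):
    """Byte address of element `index` in an array of words starting at `base`."""
    return base + index * word_size


print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a
# Reading big-endian bytes as little-endian gives a different number,
# which is the data-exchange problem mentioned above.
print(struct.unpack("<I", big)[0] == value)  # False
print(is_aligned(6, 4), is_aligned(8, 4))    # False True
print(element_address(0, 8))                 # 32
```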
Object addressed | Aligned at byte offsets | Misaligned at byte offsets
Byte             | 0,1,2,3,4,5,6,7         | Never
Half word        | 0,2,4,6                 | 1,3,5,7
Word             | 0,4                     | 1,2,3,5,6,7
Double word      | 0                       | 1,2,3,4,5,6,7
Addressing Modes
Addressing modes refer to how to specify the location of an
operand (effective address)
Addressing modes have the ability to:
Register
  Example: ADD R4, R3
  Meaning: Regs[R4] = Regs[R4] + Regs[R3]
  When used: When a value is in a register
Immediate
  Example: ADD R4, #3
  Meaning: Regs[R4] = Regs[R4] + 3
  When used: For constants
Displacement
  Example: ADD R4, 100 (R1)
  Meaning: Regs[R4] = Regs[R4] + Mem[100 + Regs[R1]]
  When used: Accessing local variables
Register indirect
  Meaning: Regs[R4] = Regs[R4] + Mem[Regs[R1]]
  When used: Accessing using a pointer or a computed address
Indexed
  Meaning: Regs[R4] = Regs[R4] + Mem[Regs[R1] + Regs[R2]]
  When used: Sometimes useful in array addressing: R1 = base of the array; R2 = index amount
Direct or absolute
  Meaning: Regs[R4] = Regs[R4] + Mem[1001]
  When used: Sometimes useful for accessing static data; address constant may need to be large
Memory indirect or memory deferred
  Meaning: Regs[R4] = Regs[R4] + Mem[Mem[Regs[R3]]]
  When used: If R3 is the address of the pointer p, then the mode yields *p
Autoincrement
  Meaning: Regs[R4] = Regs[R4] + Mem[Regs[R2]]; Regs[R2] = Regs[R2] + d
  When used: Useful for stepping through arrays within a loop. R2 points to the start of the array; each reference increments R2 by d.
Autodecrement
  Meaning: Regs[R2] = Regs[R2] - d; Regs[R4] = Regs[R4] + Mem[Regs[R2]]
  When used: Same use as autoincrement. Autodecrement/increment can also act as push/pop to implement a stack
Scaled
  Meaning: Regs[R4] = Regs[R4] + Mem[100 + Regs[R2] + Regs[R3] * d]
  When used: Used to index arrays.
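The "Meaning" entries above translate almost line for line into code. The following is a rough sketch, assuming registers and memory are modelled as dictionaries and d is the operand size in bytes; the function name `fetch_operand`, its parameters, and the register and memory contents are all hypothetical.

```python
def fetch_operand(mode, regs, mem, r=None, r2=None, const=0, d=4):
    """Return the source operand selected by an addressing mode.

    r and r2 name registers, const is the constant held in the instruction,
    and d is the size in bytes of the object being accessed.
    """
    if mode == "register":
        return regs[r]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[r]]
    if mode == "register_indirect":
        return mem[regs[r]]
    if mode == "indexed":
        return mem[regs[r] + regs[r2]]
    if mode == "direct":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[r]]]
    if mode == "scaled":
        return mem[const + regs[r] + regs[r2] * d]
    if mode == "autoincrement":      # use the address, then step forward
        value = mem[regs[r]]
        regs[r] += d
        return value
    if mode == "autodecrement":      # step back, then use the address
        regs[r] -= d
        return mem[regs[r]]
    raise ValueError(f"unknown mode: {mode}")


# Illustrative register and memory contents.
regs = {"R1": 400, "R2": 8, "R3": 2}
mem = {8: 70, 10: 50, 116: 60, 400: 10, 408: 20, 500: 30, 1001: 40}

print(fetch_operand("displacement", regs, mem, r="R1", const=100))       # 30
print(fetch_operand("indexed", regs, mem, r="R1", r2="R2"))              # 20
print(fetch_operand("memory_indirect", regs, mem, r="R1"))               # 50
print(fetch_operand("scaled", regs, mem, r="R2", r2="R3", const=100))    # 60
print(fetch_operand("autoincrement", regs, mem, r="R2"), regs["R2"])     # 70 12
print(fetch_operand("autodecrement", regs, mem, r="R2"), regs["R2"])     # 70 8
```

Only the two auto modes change a register as a side effect, which is why they can double as push and pop.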
Reverse addressing:
Index (binary) | Bit-reversed index (binary)
0 (000)        | 0 (000)
1 (001)        | 4 (100)
2 (010)        | 2 (010)
3 (011)        | 6 (110)
4 (100)        | 1 (001)
5 (101)        | 5 (101)
6 (110)        | 3 (011)
7 (111)        | 7 (111)
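The reverse-addressing table can be reproduced by reversing the bits of each index. This is a minimal sketch, assuming 3-bit indices as in the table; the function name `bit_reverse` is hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lowest bit of index into result
        index >>= 1
    return result


order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```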
Example:
f = (g + h) - (i + j)
MIPS:
add  t0, g, h
add  t1, i, j
sub  f, t0, t1
Examples
Integer arithmetic and logical operations: add, and, subtract , or
Loads-stores (move instructions on machines with memory addressing)
Branch, jump, procedure call and return, trap
Operating system call, Virtual memory management instructions
Floating point instructions: add, multiply
Decimal add, decimal multiply, decimal to character conversion
String move, string compare, string search
Pixel operations, compression/decompression operations
Arithmetic, logical, data transfer and control are almost standard categories
for all machines
System instructions are required for multi-programming environments
although support for system functions varies
Decimal and string instructions can be primitives, e.g. IBM 360 and the VAX
Support for floating point, decimal, string and graphics is sometimes
provided optionally via a co-processor
Some machines rely on the compiler to synthesize special operations such
as string handling from simpler instructions
Perform multiple 16-bit addition on a 64-bit ALU since most data are narrow
Increases ALU throughput for multimedia applications
Paired single operations (float)
Very handy for calculating dot products of vectors (signal processing) and
matrix multiplication
80x86 Instruction      | Integer Average (% total executed)
Load                   | 22%
Conditional branch     | 20%
Compare                | 16%
Store                  | 12%
Add                    | 8%
And                    | 6%
Sub                    | 5%
Move register-register | 4%
Call                   | 1%
Return                 | 1%
Total                  | 96%
Relative addressing w.r.t. the program counter proved to be the best choice
for forward and backward branching or jumps (load address independent)
To allow for dynamic loading of library routines, register indirect address
allows addresses to be loaded in special registers
(e.g. virtual functions in C++ and system calls in a case statement)
Condition Evaluation
Remember to focus
on the common case
DSPs support a repeat instruction for "for" loops (vectors) using 3 registers
Supporting Procedures
Execution of a procedure follows the following steps:
Store parameters in a place accessible to the procedure
Transfer control to the procedure
Acquire the storage resources needed for the procedure
Perform the desired task
Store the result value in a place accessible to the calling program
Return control to the point of origin
The hardware provides a program counter to trace instruction flow and
manage transfer of control
Parameter Passing
Registers can be used for passing a small number of parameters
A stack is used to spill registers of the current context and make room for
the called procedure to run and to allow for large parameters to be passed
Storage of machine state can be performed by caller or callee
Handling of shared variables is important to ensure correct semantics and
thus requires clear specifications in the library interface
Global variables stored in registers need careful handling
Size of Operands
Instruction Representation
Humans are taught to think in base 10 (decimal) but numbers may be
represented in any base (123 in base 10 = 1111011 in binary or base 2)
Numbers are stored in computers as a series of high and low electronic
signals (binary numbers)
Binary digits are called bits and considered the atom of computing
Each piece of an instruction is a number and placing these numbers
together forms the instruction
An assembler translates symbolic assembly instructions into machine
language instructions (machine code)
Example:
Assembly:                 add $t0, $s1, $s2
M/C language (decimal):   0       17      18      8       0       32
M/C language (binary):    000000  10001   10010   01000   00000   100000
                          6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
Note: MIPS compiler by default maps $s0, ..., $s7 to reg. 16-23 and $t0, ..., $t7 to reg. 8-15
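Placing the field numbers side by side amounts to shifting each field into position and OR-ing the results. The sketch below assumes the 6/5/5/5/5/6-bit layout shown in the example; the field values 0, 17, 18, 8, 0, 32 are read off the binary pattern above, and the function name `encode_r_type` is hypothetical.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack six fields (6, 5, 5, 5, 5, 6 bits) into one 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


word = encode_r_type(op=0, rs=17, rt=18, rd=8, shamt=0, funct=32)
print(f"{word:032b}")   # 00000010001100100100000000100000
print(f"{word:#010x}")  # 0x02324020
```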
Encoding Examples
Register-type instructions:
op     | rs     | rt     | rd     | shamt  | funct
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
Immediate-type instructions:
op     | rs     | rt     | address
6 bits | 5 bits | 5 bits | 16 bits
Some instructions need longer fields than provided, e.g. for large constants
The 16-bit address means a load word instruction can load a word within a
region of ±2^15 bytes of the address in the base register
Example:
lw  $t0, 32($s3)    # Temporary register $t0 gets A[8]
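The load-word example can be encoded the same way in the immediate format. This is a sketch under stated assumptions: opcode 35 for lw, $s3 as register 19 and $t0 as register 8 (following the register-mapping note), and a signed 16-bit offset; the function name `encode_i_type` is hypothetical.

```python
def encode_i_type(op, rs, rt, immediate):
    """Pack op (6 bits), rs (5), rt (5) and a signed 16-bit immediate into a word."""
    if not -(1 << 15) <= immediate < (1 << 15):
        raise ValueError("immediate does not fit in 16 signed bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


# lw $t0, 32($s3): base register $s3 (19), destination $t0 (8), offset 32 bytes.
word = encode_i_type(op=35, rs=19, rt=8, immediate=32)
print(f"{word:#010x}")  # 0x8e680020
```

The range check in the function is the 16-bit limit discussed above: an offset outside it cannot be expressed in a single instruction of this format.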
Instruction | Format | op | rs  | rt  | rd  | shamt | funct | address
add         | R      | 0  | reg | reg | reg | 0     | 32    | N/A
sub         | R      | 0  | reg | reg | reg | 0     | 34    | N/A
lw          | I      | 35 | reg | reg | N/A | N/A   | N/A   | address
sw          | I      | 43 | reg | reg | N/A | N/A   | N/A   | address
[Figure: the processor is connected to a memory that holds an accounting program (machine code), an editor program (machine code), a C compiler (machine code), payroll data, book text, and source code in C for the editor program]
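[Figure: flowchart testing i == j. If i = j, f = g + h is executed; otherwise control goes to Else: and f = g - h is executed. Both paths meet at Exit:]
if (i == j) f = g + h; else f = g - h;
MIPS:
      bne  i, j, Else    # go to Else if i != j
      add  f, g, h       # f = g + h (skipped if i != j)
      j    Exit
Else: sub  f, g, h       # f = g - h (skipped if i = j)
Exit: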
Typical Compilation
Explanation
Frequency
High level
Procedure integration
Local
Constant propagation
Global
Across a branch
13%
Copy propagation
11%
Code motion
16%
Induction variable elimination
Machine-dependent
Strength reduction
Pipeline Scheduling
N.M
18%
22%
N.M
2%
N.M
N.M.
Starting a Program
C program -> Compiler -> Assembly language program -> Assembler ->
Object: Machine language module (together with Object: Library routine (machine language)) ->
Linker -> Executable: Machine language program -> Loader -> Memory

Memory layout (addresses in hex):
7fff ffff   Stack
            Dynamic data
1000 8000   $gp
1000 0000   Static data
0040 0000   Text (pc)
0           Reserved
Number of Addresses
Four categories
3-address machines
- 2 for the source operands and one for the result
2-address machines
- One address doubles as source and result
1-address machine
- Accumulator machines
- Accumulator is used for one source and result
0-address machines
- Stack machines
- Operands are taken from the stack
- Result goes onto the stack
add   dest,src1,src2   ; M(dest)=[src1]+[src2]
sub   dest,src1,src2   ; M(dest)=[src1]-[src2]
mult  dest,src1,src2   ; M(dest)=[src1]*[src2]
Three addresses:
Operand 1, Operand 2, Result
Example: a = b + c
Three-address instruction formats are not common, because they require a
relatively long instruction format to hold the three address references.
C statement
A = B + C * D - E + F + A
Equivalent code:
mult  T,C,D   ;T = C*D
add   T,T,B   ;T = B+C*D
sub   T,T,E   ;T = B+C*D-E
add   T,T,F   ;T = B+C*D-E+F
add   A,T,A   ;A = B+C*D-E+F+A
Sample instructions
load  dest,src   ; M(dest)=[src]
add   dest,src   ; M(dest)=[dest]+[src]
sub   dest,src   ; M(dest)=[dest]-[src]
mult  dest,src   ; M(dest)=[dest]*[src]
Two Addresses:
One address doubles as operand and result
Example: a = a + b
The two-address format reduces the space requirement but also
introduces some awkwardness. To avoid altering the value of an
operand, a MOVE instruction is used to move one of the values to a
result or temporary location before performing the operation.
C statement
A = B + C * D - E + F + A
Equivalent code:
load  T,C   ;T = C
mult  T,D   ;T = C*D
add   T,B   ;T = B+C*D
sub   T,E   ;T = B+C*D-E
add   T,F   ;T = B+C*D-E+F
add   A,T   ;A = B+C*D-E+F+A
One-address machines
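Use a special set of registers called accumulators
- Specify one source operand & receive the result
load   addr   ;accum = [addr]
store  addr   ;M(addr) = accum
add    addr   ;accum = accum + [addr]
sub    addr   ;accum = accum - [addr]
mult   addr   ;accum = accum * [addr]
C statement
A = B + C * D - E + F + A
Equivalent code:
load   C
mult   D   ;accum = C*D
add    B   ;accum = C*D+B
sub    E   ;accum = B+C*D-E
add    F   ;accum = B+C*D-E+F
add    A   ;accum = B+C*D-E+F+A
store  A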
Zero-address machines
Stack supplies operands and receives the result
- Special instructions to load and store use an address
push  addr   ; push([addr])
pop   addr   ; pop([addr])
add          ; push(pop + pop)
sub          ; push(pop - pop)
mult         ; push(pop * pop)
Example
C statement
A = B + C * D - E + F + A
Equivalent code:
push  E
push  C
push  D
mult
push  B
add
sub
push  F
add
push  A
add
pop   A
Load/Store Architecture
Sample instructions
load   Rd,addr      ;Rd = [addr]
store  addr,Rs      ;(addr) = Rs
add    Rd,Rs1,Rs2   ;Rd = Rs1 + Rs2
sub    Rd,Rs1,Rs2   ;Rd = Rs1 - Rs2
mult   Rd,Rs1,Rs2   ;Rd = Rs1 * Rs2
Example
C statement
A = B + C * D - E + F + A
Equivalent code:
load   R1,B
load   R2,C
load   R3,D
load   R4,E
load   R5,F
load   R6,A
mult   R2,R2,R3
add    R2,R2,R1
sub    R2,R2,R4
add    R2,R2,R5
add    R2,R2,R6
store  A,R2
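The load/store sequence can be run on a toy register machine to confirm that it computes the C statement and to count how often memory is touched. This is a minimal sketch, assuming dictionaries for the register file and memory; the function name `run_load_store` and the sample values are hypothetical.

```python
def run_load_store(program, memory):
    """Run (opcode, operands...) tuples; only load and store touch memory."""
    regs = {}
    memory_accesses = 0
    for opcode, *operands in program:
        if opcode == "load":              # Rd = [addr]
            rd, addr = operands
            regs[rd] = memory[addr]
            memory_accesses += 1
        elif opcode == "store":           # (addr) = Rs
            addr, rs = operands
            memory[addr] = regs[rs]
            memory_accesses += 1
        else:                             # three-register arithmetic
            rd, rs1, rs2 = operands
            if opcode == "add":
                regs[rd] = regs[rs1] + regs[rs2]
            elif opcode == "sub":
                regs[rd] = regs[rs1] - regs[rs2]
            elif opcode == "mult":
                regs[rd] = regs[rs1] * regs[rs2]
            else:
                raise ValueError(f"unknown opcode: {opcode}")
    return memory_accesses


program = [
    ("load", "R1", "B"), ("load", "R2", "C"), ("load", "R3", "D"),
    ("load", "R4", "E"), ("load", "R5", "F"), ("load", "R6", "A"),
    ("mult", "R2", "R2", "R3"), ("add", "R2", "R2", "R1"),
    ("sub", "R2", "R2", "R4"), ("add", "R2", "R2", "R5"),
    ("add", "R2", "R2", "R6"), ("store", "A", "R2"),
]
# Illustrative values: B + C*D - E + F + A = 2 + 12 - 5 + 6 + 1 = 16.
memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
accesses = run_load_store(program, memory)
print(memory["A"], accesses)  # 16 7
```

The program is longer than the three-address version, but each variable is read from memory exactly once and the arithmetic instructions work on registers only.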
2: Flow of Control
Procedure calls
- Delayed procedure calls
Branches
Unconditional
- Absolute address
- PC-relative
Target address is specified relative to PC contents
Relocatable code
Example: MIPS
- Absolute address
j target
- PC-relative
b target
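Why PC-relative branches give relocatable code can be shown with a little arithmetic. The sketch below is deliberately simplified: it takes the target to be the branch's own address plus the encoded offset, whereas real instruction sets typically measure from the following instruction and may scale the offset. All addresses and names are hypothetical.

```python
def pc_relative_target(pc, offset):
    """Branch target when the instruction stores a signed offset from the PC."""
    return pc + offset


BRANCH_OFFSET_IN_CODE = 0x20   # hypothetical position of the branch inside the code
TARGET_OFFSET_IN_CODE = 0x30   # hypothetical position of the target inside the code
encoded_offset = TARGET_OFFSET_IN_CODE - BRANCH_OFFSET_IN_CODE

targets = {}
for load_address in (0x1000, 0x8000):
    pc = load_address + BRANCH_OFFSET_IN_CODE
    # The same encoded offset reaches the right instruction at either load address.
    targets[load_address] = pc_relative_target(pc, encoded_offset)

print({hex(k): hex(v) for k, v in targets.items()})
# {'0x1000': '0x1030', '0x8000': '0x8030'}
```

An absolute target, by contrast, would have to be rewritten whenever the code is loaded at a different address.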
e.g., Pentium
e.g., SPARC
cmp  AX,BX    ; compare AX and BX
je   target   ; jump if equal
Delayed branching
Control is transferred after executing the instruction that
follows the branch instruction
- This instruction slot is called delay slot
Improves efficiency
Highly pipelined RISC processors support delayed branching
Procedure calls
Facilitate modular programming
Require two pieces of information to return
- End of procedure
Pentium
uses ret instruction
MIPS
uses jr instruction
- Return address
In a (special) register
MIPS allows any general-purpose register
On the stack
Pentium
Delay slot
Parameter Passing
3: Operand Types
Instruction overload
Same instruction for different data types
Example: Pentium
mov  AL,address
mov  AX,address
mov  EAX,address
Operand Types
Separate instructions
Instructions specify the operand size
Example: MIPS
lb  Rdest,address   ;loads a byte
lh  Rdest,address   ;loads a halfword (16 bits)
lw  Rdest,address   ;loads a word (32 bits)
ld  Rdest,address   ;loads a doubleword (64 bits)
Similar instruction: store
4: Addressing Modes
- Part of instruction
Constant
Immediate addressing mode
All processors support these two addressing modes
- Memory
Difference between RISC and CISC
CISC supports a large variety of addressing modes
RISC follows load/store architecture
5: Instruction Types
mov  dest,src
- Logical
and, or, not, xor
Example: Pentium
cmp  count,25   ;compare count to 25 (computes count - 25 to set the flags; count is unchanged)
je   target     ;jump if equal
I/O instructions
- Memory-mapped I/O
Most processors support memory-mapped I/O
No separate instructions for I/O
- Isolated I/O
Pentium supports isolated I/O
Separate I/O instructions
in   AX,io_port   ;read from an I/O port
out  io_port,AX   ;write to an I/O port
6: Instruction Formats
Two types
Fixed-length
- Used by RISC processors
- 32-bit RISC processors use 32-bit-wide instructions
Examples: SPARC, MIPS, PowerPC
Variable-length
- Used by CISC processors
- Memory operands need more bits to specify
Opcode
Major and exact operation
RISC Vs CISC
The underlying philosophy of RISC machines is that a
system is better able to manage program execution
when the program consists of only a few different
instructions that are the same length and require the
same number of clock cycles to decode and execute.
RISC systems access memory only with explicit load
and store instructions.
In CISC systems, many different kinds of instructions
access memory, making instruction length variable
and fetch-decode-execute time unpredictable.
The difference between CISC and RISC becomes
evident through the basic computer performance
equation:
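The equation itself is not reproduced here. The form usually meant by the basic computer performance equation is, as a reference sketch:

```latex
\[
\frac{\text{time}}{\text{program}}
  = \frac{\text{time}}{\text{cycle}}
  \times \frac{\text{cycles}}{\text{instruction}}
  \times \frac{\text{instructions}}{\text{program}}
\]
```

Read this way, RISC designs aim to lower cycles per instruction and accept more instructions per program, while CISC designs aim to lower instructions per program and accept more cycles per instruction.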
The simple instruction set of RISC machines
enables control units to be hardwired for maximum
speed.
The more complex-- and variable-- instruction set of
CISC machines requires microcode-based control
units that interpret instructions as they are fetched
from memory. This translation takes time.
With fixed-length instructions, RISC lends itself to
pipelining and speculative execution.
Consider the following program fragments:
CISC
       mov ax, 10
       mov bx, 5
       mul bx, ax
RISC
       mov ax, 0
       mov bx, 10
       mov cx, 5
Begin: add ax, bx
       loop Begin
The total clock cycles for the CISC version might be:
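The cycle figures for this comparison are not reproduced here. As a rough sketch only, assume every mov, add and loop instruction takes one cycle and the multiply takes 30 cycles; both numbers are illustrative assumptions, not values taken from the text. The RISC fragment is simulated to count how many instructions it actually executes.

```python
MUL_CYCLES = 30      # assumed cost of the CISC multiply (illustrative only)
SIMPLE_CYCLES = 1    # assumed cost of mov, add and loop (illustrative only)


def run_risc_fragment():
    """Simulate the RISC fragment; return (final ax, instructions executed)."""
    ax, bx, cx = 0, 10, 5        # mov ax, 0 / mov bx, 10 / mov cx, 5
    executed = 3
    while True:
        ax += bx                 # add ax, bx
        executed += 1
        cx -= 1                  # loop Begin: decrement cx, branch while cx != 0
        executed += 1
        if cx == 0:
            break
    return ax, executed


cisc_cycles = 2 * SIMPLE_CYCLES + 1 * MUL_CYCLES
ax, risc_instructions = run_risc_fragment()
risc_cycles = risc_instructions * SIMPLE_CYCLES

print(ax)            # 50
print(cisc_cycles)   # 32
print(risc_cycles)   # 13
```

Under these assumptions the RISC version executes more instructions (13 against 3) yet finishes in fewer cycles, which is the trade-off the performance equation describes.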
Because of their load-store ISAs, RISC architectures
require a large number of CPU registers.
These registers provide fast access to data during
sequential program execution.
They can also be employed to reduce the overhead
typically caused by passing parameters to
subprograms.
Instead of pulling parameters off of a stack, the
subprogram is directed to use a subset of registers.
This is how registers can be overlapped in a RISC system.
The current window pointer (CWP) points to the active register window.
It is becoming increasingly difficult to distinguish
RISC architectures from CISC architectures.
Some RISC systems provide more extravagant
instruction sets than some CISC systems.
Some systems combine both approaches.
The following two slides summarize the
characteristics that traditionally typify the differences
between these two architectures.
RISC                                       | CISC
Multiple register sets.                    | Single register set.
Three operands per instruction.            | One or two register operands per instruction.
Parameter passing through register windows.| Parameter passing through memory.
Single-cycle instructions.                 | Multiple cycle instructions.
Hardwired control.                         | Microprogrammed control.
Highly pipelined.                          | Less pipelined.

RISC                                         | CISC
Simple instructions, few in number.          | Many complex instructions.
Fixed length instructions.                   | Variable length instructions.
Complexity in compiler.                      | Complexity in microcode.
Only LOAD/STORE instructions access memory.  | Many instructions can access memory.
Few addressing modes.                        | Many addressing modes.
Summary
[Figure: memory hierarchy showing the Processor, Registers, Cache, Memory, and Disk]
Stack Architectures
Pros
Good code density (implicit top of stack)
Low hardware requirements
Easy to write a simple compiler for stack architectures
Cons
Stack becomes the bottleneck
Little ability for parallelism or pipelining
Data is not always at the top of the stack when needed, so additional
instructions like TOP and SWAP are needed
Difficult to write an optimizing compiler for stack architectures
Accumulator Architectures
Pros
Very low hardware requirements
Easy to design and understand
Cons
Accumulator becomes the bottleneck
Little ability for parallelism or pipelining
High memory traffic
Memory-to-Memory Architectures
Pros
Requires fewer instructions (especially if 3 operands)
Easy to write compilers for (especially if 3 operands)
Cons
Very high memory traffic (especially if 3 operands)
Variable number of clocks per instruction
With two operands, more data movements are required
Register-to-Memory Architectures
Cons
Operands are not equivalent (poor orthogonality)
Variable number of clocks per instruction
May limit number of registers