Computer Architecture
Prof. Douglas Comer
Computer Science And ECE
Purdue University
http://www.cs.purdue.edu/people/comer
Course Introduction
And Overview
d Companies (such as Google, IBM, Microsoft, Apple, Cisco,...) look for knowledge of
architecture when hiring (i.e., understanding computer architecture can help you land a
job)
d The most successful software engineers understand the underlying hardware (i.e.,
knowing about architecture can help you earn promotions)
d As a practical matter: knowledge of computer architecture is needed for later courses,
such as systems programming, compilers, operating systems, and embedded systems
d Hardware is ugly
– Lots of low-level details
– Can be counterintuitive
d Hardware is tricky
– Timing is important
– A small addition in functionality can require many pieces of hardware
d The subject is so large that we cannot hope to cover it in one course
d You will need to think in new ways
d Basics
– A taste of digital logic
– Data paths and execution
– Data representations
d Processors
– Instruction sets and operands
– Assembly languages and programming
d Memories
– Physical and virtual memories
– Addressing and caching
Organization Of The Course
(continued)
d Input/Output
– Devices and interfaces
– Buses and bus address spaces
– Role of device drivers
d Advanced topics
– Parallelism and data pipelining
– Power and energy
– Performance and performance assessment
– Architectural hierarchies
Fundamentals
Of
Digital Logic
d Voltage
– Quantifiable property of electricity
– Measure of potential force
– Unit of measure: volt
d Current
– Quantifiable property of electricity
– Measure of electron flow along a path
– Unit of measure: ampere (amp)
d Amplification means the large output current varies exactly like the small input current
d Called a Metal Oxide Semiconductor FET (MOSFET) when used on a CMOS chip
d Three external connections
– Source
– Gate
– Drain
d Designed to act as a switch (on or off)
– When the input reaches a threshold (i.e., becomes logic 1), the transistor turns on
and passes full current
– When the input falls below a threshold (i.e., becomes logic 0), the transistor turns
off and passes no current
[Figure: a transistor as a switch — when the gate is on, current flows from source to drain; when the gate is off, no current flows from the gate region to the drain]
A B | A and B      A B | A or B      A | not A
0 0 |    0         0 0 |   0         0 |   1
0 1 |    0         0 1 |   1         1 |   0
1 0 |    0         1 0 |   1
1 1 |    1         1 1 |   1
d Hardware component
d Consists of integrated circuit
d Implements an individual Boolean function
d To reduce complexity, hardware uses inverse of Boolean functions
– Nand gate implements not and
– Nor gate implements not or
– Inverter implements not
A B | A nand B     A B | A nor B     A B | A xor B
0 0 |    1         0 0 |    1        0 0 |    0
0 1 |    1         0 1 |    0        0 1 |    1
1 0 |    1         1 0 |    0        1 0 |    1
1 1 |    0         1 1 |    0        1 1 |    0

[Figure: gate symbols, each with inputs A and B and a single output]
d Basic gates
d Suppose we need a signal to indicate that the power button is depressed and the disk is
ready
d Two logic gates are needed to form logical and
– Output from nand gate connected to input of inverter
[Figure: inputs X (from the power button) and Y (from the disk) feed a nand gate whose output A passes through an inverter to produce output C]
d Boolean expression
– Often used when designing circuit
– Can be transformed to equivalent version that requires fewer gates
d Truth table
– Enumerates inputs and outputs
– Often used when debugging a circuit
[Figure: an example circuit shown in two equivalent forms, with labeled inputs, intermediate values A and B, and output C]
X Y Z | A B C | output
0 0 0 | 1 0 1 |   0
0 0 1 | 1 0 1 |   0
0 1 0 | 0 1 1 |   0
0 1 1 | 0 0 1 |   0
1 0 0 | 1 0 1 |   0
1 0 1 | 1 0 1 |   0
1 1 0 | 0 1 0 |   1
1 1 1 | 0 0 1 |   0
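d The output column reduces to a single Boolean expression, which a small C sketch (added for illustration, not from the original slides) can verify by enumerating all inputs:

#include <stdio.h>

int main(void) {
    /* the table's output is 1 only for X=1, Y=1, Z=0,
     * i.e., output = X and Y and (not Z) */
    for (int X = 0; X <= 1; X++)
        for (int Y = 0; Y <= 1; Y++)
            for (int Z = 0; Z <= 1; Z++)
                printf("%d %d %d -> %d\n", X, Y, Z, X && Y && !Z);
    return 0;
}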
  1 0 1 0 0
+ 1 1 1 0 1
-----------
1 1 0 0 0 1

[Figure: an adder circuit — an and gate produces the carry for bit 2 of the sum; each stage has carry-in and carry-out connections]
[Figure: three 14-pin chip packages, pins 8 through 14 across the top and pins 1 through 7 across the bottom]
d Known as latch
d Has two inputs: data and enable
d When enable is 1, output is same as data
d When enable goes to 0, output stays locked at current value
[Figure: a 1-bit latch with data in and enable inputs and an output]
[Figure: a register built from four 1-bit latches — input bits for the register enter the latches and output bits for the register emerge; a common enable loads all four at once]
d Basic flip-flop
d Can be constructed from a pair of latches
d Analogous to push-button power switch (i.e., push-on push-off)
d Each new 1 received as input causes output to reverse
– First input pulse causes flip-flop to turn on
– Second input pulse causes flip-flop to turn off
in:  0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
out: 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1
(time increases to the right)

d Note: output only changes when input makes a transition from zero to one (i.e., rises)

[Figure: waveforms of the in, out, and clock signals over time]
[Figure: (a) a binary counter — successive input pulses step the outputs through 010 (2), 011 (3), 100 (4), 101 (5), and so on; (b) the clock signal alternating between 1 and 0 as time increases]
[Figure: a decoder — inputs x, y, and z select exactly one of the eight outputs labeled “000” through “111”]
d Technical detail: on some decoder chips, an active output is logic 0 and all others are
logic 1
[Figure: a clock drives a counter whose value feeds a decoder; successive decoder outputs trigger test battery, test memory, start disk, power screen, and start CPU; the first and last outputs are not used]
d Technique: count clock pulses and use decoder to select an output for each possible
counter output
d Note: counter will wrap around to zero, so this is an infinite loop
[Figure: the same clock-counter-decoder circuit with feedback — two gates that perform the Boolean and function use the stop output to halt the sequence once it completes]
d Software
– Uses iteration
– Software engineers are taught to avoid replicating code
– Iteration increases elegance
d Hardware
– Uses replicated (parallel) hardware units
– Hardware engineers are taught to avoid iterative circuits
– Replication increases performance and reliability
d Note: because chip contains multiple gates, some gates may be unused
d May be possible to reduce total chips needed by employing unused gates
d Example: use a spare nand gate as an inverter by connecting one input to five volts:
1 nand x = not x
d Previous circuit can be implemented with a single chip (a quad 2-input nand gate)
[Figure: the circuit implemented with chips IC1, IC2, and IC3]
d Gordon Moore predicted that the number of transistors on a chip would double each
year (a rate he later revised to a doubling roughly every two years)
d Led to the classification of chips by scale of integration (SSI, MSI, LSI, and VLSI)
d Number of bits per byte determines range of values that can be stored
d Byte of k bits can store 2^k values
d Examples
– Six-bit byte can store 64 possible values
– Eight-bit byte can store 256 possible values
d Device status
– First bit has the value 1 if a disk is connected
– Second bit has the value 1 if a printer is connected
– Third bit has the value 1 if a keyboard is connected
d Integer interpretation
– Positional representation uses base 2
– Values are 0 through 7
– We must specify order of bits
bit:      5        4        3        2        1        0
weight: 2^5=32   2^4=16   2^3=8    2^2=4    2^1=2    2^0=1
d Example
010101
is interpreted as
0 × 2^5 + 1 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 21
Power Of 2   Decimal Value            Decimal Digits
 0           1                         1
 1           2                         1
 2           4                         1
 3           8                         1
 4           16                        2
 5           32                        2
 6           64                        2
 7           128                       3
 8           256                       3
 9           512                       3
10           1024                      4
11           2048                      4
12           4096                      4
15           32768                     5
16           65536                     5
20           1048576                   7
30           1073741824               10
32           4294967296               10
64           18446744073709551616     20
0xDEC90949
1101 1110 1100 1001 0000 1001 0100 1001
 D    E    C    9    0    9    4    9
d Symbols for upper and lower case letters, digits, and punctuation marks
d Set of symbols defined by computer system
d Each symbol assigned unique bit pattern
d Typically, character set size determined by byte size
d Various character sets have been used in commercial computers
– EBCDIC
– ASCII
– Unicode
10000001 (the letter a in EBCDIC)
01100001 (the letter a in ASCII)
d Extends ASCII
– Assigns meaning to values from 128 through 255
– Character can be 16 bits long
d Advantage: can represent larger set of characters
d Motivation: accommodate languages such as Chinese
    1 0 0
+   1 1 0
---------
  1 0 1 0
(the leading 1 is the overflow; the remaining bits are the result)
d Little Endian places least significant byte of integer in lowest memory location
d Big Endian places most significant byte of integer in lowest memory location
d Note: difference is especially important when transferring data over the Internet between
computers for which the byte ordering differs
d Familiar to humans
d First bit represents sign
d Successive bits represent absolute value of integer
d Interesting quirk: can create negative zero
d Assume computer
– Supports 32-bit and 64-bit integers
– Uses two’s complement representation
d When 32-bit integer assigned to 64-bit integer, correct numeric value requires upper 32
bits to be filled with
– Zeroes for a positive number
– Ones for a negative number
d In essence, high-order (sign) bit from the 32-bit integer must be replicated to fill high-
order bits of larger integer
1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
_________________
replicated
d During assignment to a larger integer, hardware copies all bits of smaller integer and
then replicates the high-order (sign) bit in remaining bits
d Most computers use two’s complement hardware, which performs sign extension
d Same hardware is used for unsigned arithmetic, which means that assigning an unsigned
integer to a larger unsigned integer can change the value
d To prevent errors from occurring, a programmer or a compiler must add code to mask
off the extended sign bits
d Example code
unsigned int x;
char y;

y = 0xf0;   /* high-order bit of y is 1 */
x = y;      /* sign extension makes x 0xfffffff0     */
            /* should be x = y & 0xff; to get 0x00f0 */
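d The following runnable C sketch (added for illustration, not from the original slides; assumes a 32-bit int) makes the effect visible:

#include <stdio.h>

int main(void) {
    signed char y = 0xf0;       /* bit pattern 1111 0000 (-16 as a signed char) */
    unsigned int x1 = y;        /* sign bit replicated: x1 becomes 0xfffffff0   */
    unsigned int x2 = y & 0xff; /* mask keeps only the low byte: 0x000000f0     */

    printf("without mask: 0x%08x\n", x1);
    printf("with mask:    0x%08x\n", x2);
    return 0;
}

d Note: plain char may be unsigned on some compilers, so the sketch uses signed char to make the behavior deterministic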
d Pioneered by IBM
d Represents integer as a string of digits
– Unpacked: one digit per 8-bit byte
– Packed: one digit per 4-bit nibble
d Uses sign-magnitude representation
d Example of unpacked BCD
– Integer 123456 is stored as
0x01 0x02 0x03 0x04 0x05 0x06
– Integer –123456 is stored as:
0x01 0x02 0x03 0x04 0x05 0x06 0x0D
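d A short C sketch of the conversion (illustrative only; the function name and layout are invented, and the sign nibble is omitted):

#include <stdio.h>

/* Convert a non-negative integer to unpacked BCD:
 * one decimal digit per byte, most significant digit first.
 * Returns the number of digits produced.
 */
static int to_unpacked_bcd(unsigned int value, unsigned char digits[]) {
    unsigned char tmp[10];
    int n = 0;
    do {
        tmp[n++] = value % 10;        /* extract the low-order digit */
        value /= 10;
    } while (value != 0);
    for (int i = 0; i < n; i++)
        digits[i] = tmp[n - 1 - i];   /* reverse to high-order first */
    return n;
}

int main(void) {
    unsigned char d[10];
    int n = to_unpacked_bcd(123456, d);
    for (int i = 0; i < n; i++)
        printf("0x%02x ", d[i]);      /* prints 0x01 0x02 0x03 0x04 0x05 0x06 */
    printf("\n");
    return 0;
}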
d Disadvantages:
– Take more space
– Hardware is slower than integer or floating point
d Advantages:
– Gives results humans expect (compare to Excel)
– Avoids the repeating binary representation of decimal values such as .01
d Preferred by banks
6.022 × 10^23
d Hardware
– Uses base 2 instead of base 10
– Allocates fixed-size bit strings for
* Exponent
* Mantissa
d Mantissa
– Normalized to eliminate leading zeroes
– No need to store most significant bit because it is always 1
– Zero is a special case
d Exponent
– Allows negative as well as positive values
– Biased to permit rapid magnitude comparison
[Figure: floating point formats — (a) single precision; (b) double precision, in which bit 63 is the sign, bits 62 through 52 hold the exponent, and bits 51 through 0 hold the mantissa]
d Zero
d Positive infinity
d Negative infinity
d Note: infinity values handle cases such as the result of dividing by zero
d Single precision: 2^-126 to 2^127 (approximately 10^-38 to 10^38)
d Double precision: 2^-1022 to 2^1023 (approximately 10^-308 to 10^308)

d Example single-precision value:
0 10000001 10100000000000000000000
(sign 0, biased exponent 129, mantissa 1.101 binary — the value 6.5)
Processors
d The terms processor and computational engine refer broadly to any mechanism that
drives computation
d Wide variety of sizes and complexity
d Processor is key element in all computational systems
[Figure: a computer consists of a processor, a memory, and input/output facilities]
d Digital device
d Performs computation involving multiple steps
d Wide variety of capabilities
d Mechanisms available
– Fixed logic
– Selectable logic
– Parameterized logic
– Programmable logic
[Figure: a modern CPU contains specialized engines (graphics, trigonometry, query, arithmetic) and other components joined by internal interconnections; a controller, local storage, and an ALU share an external interface that leads to the external connection]
d Controller
– Overall responsibility for execution
– Moves through sequence of steps
– Coordinates other units
– Timing-based operation: knows how long each unit requires and schedules steps
accordingly
d Arithmetic Logic Unit
– Operates as directed by controller
– Provides arithmetic and Boolean operations
– Performs one operation at a time as directed
d Internal interconnections
– Allow transfer of values among units of the processor
– Also called data paths
d External interface
– Handles communication between processor and rest of computer system
– Provides interaction with external memory as well as external I/O devices
d Programmable device
d Dedicated to control of a physical system
d Example: control an automobile engine or grocery store door
d Negative: extremely limited (slow processor and tiny memory)
d Positive: very low power consumption
do forever {
    wait for the sensor to be tripped;
    turn on power to the door motor;
    wait for a signal that indicates the door is open;
    wait for the sensor to reset;
    delay ten seconds;
    turn off power to the door motor;
}
[Figure: the translation toolchain — source code passes through the preprocessor (producing preprocessed source code), the compiler (assembly code), the assembler (relocatable object code), and the linker, which combines it with object code for functions in libraries to produce the binary object code]
d Clock rate
– Rate at which gates are clocked
– Provides a measure of the underlying hardware speed
d Instruction rate
– Measures the number of instructions a processor can execute per unit time
d On some processors, a given instruction may take more clock cycles than other
instructions
d Example: multiplication may take longer than addition
d Processor hardware includes a reset line that stops the fetch-execute cycle
d For power-down: reset line is asserted
d During power-up, logic holds the reset until the processor and memory are initialized
d Power-up steps known as bootstrap
Processor Types
And
Instruction Sets
d Fixed-length
– Every instruction is same size
– Hardware is less complex
– Hardware can run faster
– Wasted space: some instructions do not use all the bits
d Variable-length
– Some instructions shorter than others
– Allows instructions with no operands, a few operands, or many operands
– Efficient use of memory (no wasted space)
d Task
– Start with variables X and Y in memory
– Add X and Y and place the result in variable Z (also in memory)
d Example steps
– Load a copy of X into register 1
– Load a copy of Y into register 2
– Add the value in register 1 to the value in register 2, and put the result in register 3
– Store a copy of the value in register 3 in Z
d Note: the above assumes registers 1, 2, and 3 are available
d Register spilling
– Occurs when a register is needed for a computation and all registers contain values
– General idea
* Save current contents of register(s) in memory
* Reload registers(s) from memory when values are needed
d Register allocation
– Refers to choosing which values to keep in registers at a given time
– Performed by programmer or compiler
[Figure: two register banks — Bank A holds registers 0 through 3 and Bank B holds registers 4 through 7; the processor uses separate hardware units to access the register banks]
d Build separate hardware block for each step of the fetch-execute cycle
d Arrange hardware to pass an instruction through the sequence of hardware blocks
d Allows step K of one instruction to execute while step K–1 of next instruction executes
d Result is an execution pipeline
[Figure: pipeline timing — at time 1 the pipeline holds only inst. 1; at time 2, inst. 2 enters the first stage while inst. 1 advances to the second; and so on]
Instruction K: C ← add A B
Instruction K+1: D ← subtract E C
[Figure: pipeline contents at successive times — at time 2 the stages hold instructions K+1 down through K−3; instruction K+1 must wait for the result C, so at times 9 and 10 instructions K+1 through K+6 are still moving through the stages]
(a)                     (b)
C ← add A B             C ← add A B
D ← subtract E C        F ← add G H
F ← add G H             M ← add K L
J ← subtract I F        D ← subtract E C
M ← add K L             J ← subtract I F
P ← subtract M N        P ← subtract M N
d Forwarding hardware
– Passes result of add operation directly to ALU without waiting to store it in a
register
– Ensures the value arrives by the time subtract instruction reaches the pipeline stage
for execution
d Example
Instruction K: C ← add A B
Instruction K+1: no-op
Instruction K+2: D ← subtract E C
d If forwarding is available, no-op allows time for result from register C to be fetched for
subtract operation
d Compilers insert no-op instructions to optimize performance
d Hardware register
d Used during fetch-execute cycle
d Gives address of next instruction to execute
d Also known as instruction pointer or instruction counter
d Absolute branch
– Typically named jump
– Operand is an address
– Assigns operand value to internal register A
d Relative branch
– Typically named br
– Operand is a signed value
– Adds operand to internal register A
[Figure: register windows — (a) before a call, the program sees registers x1-x4 and A-D while registers 0-7 are unavailable; (b) while the subroutine runs, the window slides so it sees x1-x4, A-D, and locals l1-l4, with the other registers unavailable]
Data Transfer
load word                          load register from memory
store word                         store register into memory
load upper immediate               place constant in upper sixteen bits of register
move from coproc. register         obtain a value from a coprocessor

Conditional Branch
branch equal                       branch if two registers equal
branch not equal                   branch if two registers unequal
set on less than                   compare two registers
set less than immediate            compare register and constant
set less than unsigned             compare unsigned registers
set less than unsigned immediate   compare unsigned register and constant
Unconditional Branch
jump go to target address
jump register go to address in register
jump and link procedure call
Arithmetic
FP add floating point addition
FP subtract floating point subtraction
FP multiply floating point multiplication
FP divide floating point division
FP add double double-precision addition
FP subtract double double-precision subtraction
FP multiply double double-precision multiplication
FP divide double double-precision division
Data Transfer
load word coprocessor load value into FP register
store word coprocessor store FP register to memory
Conditional Branch
branch FP true branch if FP condition is true
branch FP false branch if FP condition is false
FP compare single compare two FP registers
FP compare double compare two double precision values
d Elegance
– Balanced
– No frivolous or useless instructions
d Orthogonality
– No unnecessary duplication
– No overlap among instructions
d Ease of programming
– Instructions match programmer’s intuition
– Instructions are free from arbitrary restrictions
Data Paths
d Example 1: add the contents of register 4 to the contents of register 11, and place the
result in register 9
add reg9, reg11, reg4
d Example 2: add an offset of 20 to the contents of register 12, use the result as a memory
address, and load register 1 with the value from memory
load reg1, 20(reg12)
d Example 3: add an offset of 64 to the contents of register 7, treat the result as the
address of code in memory, and branch to the address
jump 64(reg7)
d Note: many processors allow an operand to specify an offset plus the contents of a
register
Instructions In Memory
(a) instruction fields: operation, reg A, reg B, dst reg, unused — the add operation has opcode 0 0 0 0 1

(b) example encoding:
00001 00100 01101 00000 000000000000
(operation = add; reg A = 00100, register 4; reg B = 01101, register 13; remaining fields zero)
[Figure: separate instruction memory and data memory — each takes an address in and produces data out; the data memory also has a data in connection and a fetch/store control line]
d Facts
– Our instruction memory is byte-addressable
– Each instruction is 32-bits long (4 bytes)
– The program counter must be incremented by 4 to move to the next instruction
d Hardware needed
– Gates to store a program counter
– Adder to compute the increment
– Clock to control when updates occur
[Figure: a 32-bit program counter connected to a 32-bit adder that computes the next instruction address]
d Recall
– Instructions in separate instruction memory
– Instruction memory takes a 32-bit address as input and produces a 32-bit output
value equal to the contents of the specified address
[Figure: the 32-bit program counter and adder, with the program counter supplying the address in to the instruction memory, which delivers the instruction from memory on its data out connection]
d The memory output changes whenever the input changes (i.e., whenever a new address
is supplied)
[Figure: the instruction fetched from memory feeds an instruction decoder, which extracts the operation, register, and offset fields]
d Note: data paths emerging from the instruction decoder are not thirty-two bits wide
[Figure: a register unit is added — the decoder's reg A and reg B fields select two registers whose contents appear on separate outputs, the dst reg field selects the register to write through the data in connection, and the constant 4 feeds the adder that increments the program counter]
d Note: there are two inputs and two outputs because we assume the register unit has
hardware that can perform two lookups simultaneously
d Although example only has one arithmetic operation, add, additional arithmetic
instructions can be added easily (e.g., shift and subtract)
d Use an Arithmetic Logic Unit (ALU)
d Problem: inputs to ALU can be
– Two registers
– Register and offset
d Solution: use a multiplexor to choose
[Figure: a multiplexor — input 1 and input 2 enter, and one of them is passed to the output]
[Figure: the data path extended with a multiplexor and an ALU — the multiplexor selects either the reg B contents or the offset as the ALU's second input, and the ALU output emerges from the unit]
d On some instructions, ALU adds register and offset; on add instruction, ALU adds two
registers
[Figure: the complete data path — program counter, 32-bit adder, instruction memory, instruction decoder, register unit, multiplexors (including M2, which selects the data memory address), ALU, and data memory, connected so each instruction can read registers, compute, and access memory]
d A multiplexor passes one of its input data paths to the output data path
d Control signals determine which input a multiplexor selects at a given time
d By controlling multiplexors, processor hardware chooses which data paths are active for
a given instruction
d A given architecture usually uses the same number of operands for most instructions
d Four basic architectural types
– 0-address
– 1-address
– 2-address
– 3-address
d Stack-based architecture
d No explicit operands in the instruction
d Program
– Pushes operands onto stack in memory
– Executes instruction
d Instruction execution
– Removes top N items from stack
– Leaves result on top of stack
push X
push 7
add
pop X
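d The sequence can be visualized with a small C sketch of a stack machine (invented for illustration; a real 0-address architecture encodes push and pop as instructions with memory operands):

#include <stdio.h>

int main(void) {
    int stack[16];
    int sp = 0;              /* number of items currently on the stack */
    int X = 5;               /* a variable in "memory" */

    stack[sp++] = X;         /* push X */
    stack[sp++] = 7;         /* push 7 */

    int b = stack[--sp];     /* add: remove the top two items ...     */
    int a = stack[--sp];
    stack[sp++] = a + b;     /* ... and leave the result on the stack */

    X = stack[--sp];         /* pop X: store the result back into X   */

    printf("X = %d\n", X);   /* prints X = 12 */
    return 0;
}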
d Analogous to a calculator
d One explicit operand per instruction
d Processor has special register known as an accumulator
– Holds second argument for each instruction
– Used to store result of instruction
d One-address example
load X
add 7
store X
d Two-address example
add 7, X
d Three-address example
add X, Y, Z
d Add operation with register 1 and signed immediate value of –93 as operands
[Figure: operand encoding — the instruction register holds the add operation with operand descriptors that refer to a general-purpose register and to locations in memory]
d Architect chooses the number and types of operands for each instruction
d Possibilities include
– Immediate (constant value)
– Contents of register
– Value in memory
– Indirect reference to memory
CPUs:
Microcode, Protection,
And Processor Modes
d Early systems
– Single Central Processing Unit (CPU) controlled entire computer
– Responsible for all I/O as well as computation
d Modern computer
– Decentralized architecture
– CPU chip may contain multiple cores
– Each I/O device (e.g., a disk) contains processor
– CPU performs computation and coordinates other processors
d Completely general
d Can perform control functions as well as basic computation
d Offers multiple levels of protection and privilege
d Provides mechanism for hardware priorities
d Handles large volumes of data
d Uses parallelism to achieve high speed
d Automatic
– Initiated by hardware (e.g., when device needs service)
– Prior to change, software (OS) must specify which code to run when the change
occurs
d Manual
– Application makes explicit request
– Typically occurs when application calls an operating system function
[Figure: privilege levels — applications run at low privilege and the Operating System runs at high privilege]
[Figure: the macro instruction set is visible to the programmer; the micro instruction set is hidden (internal) to the CPU]
d Size used by micro instructions can differ from size used by macro instructions
d Example
– Micro instructions only offer 16-bit arithmetic
– Macro instructions provide 32-bit arithmetic
d More overhead
d Macro instruction performance depends on micro instruction set
d Microprocessor hardware must run at extremely high clock rate to accommodate
multiple micro instructions per macro instruction
d Easy to read
d Programmers are comfortable using it
d Unattractive to hardware designers because higher clock rates needed
d Generally has low performance (many micro instructions needed for each macro
instruction)
[Figure: horizontal microcode view of the hardware — an Arithmetic Logic Unit (ALU) with two operand units and two result units (result 1 and result 2) connects to the macro general-purpose registers; each micro instruction is a long string of bits, one per control point]
d Move the value from register 4 to the hardware unit for operand 1
d Move the value from register 13 to the hardware unit for operand 2
d Arrange for the ALU to perform addition
d Move the value from the hardware unit for result2 (the low-order bits of the result) to
register 4
d A single microcode instruction can continue the ALU operation and also load the value
from register 7 into operand unit 1
d By using horizontal microcode, a programmer can specify simultaneous, parallel
operation of multiple hardware units
Assembly Languages
And
Programming Paradigm
d Computer scientist Alan Perlis once quipped that a programming language is low-level
if programming requires attention to irrelevant details
d Perlis’ point: because most applications do not need direct control of hardware, a low-
level language increases programming complexity without providing benefits
d In most cases, programmers do not need assembly language, only compilers do
d Assembly language
– Term used for a special type of low-level language
– Each assembly language is specific to a processor
d Assembler
– Term used for a program that translates assembly language into binary code
– Analogous to compiler
d Bad news
– Many assembly languages exist
– Each has instructions for one particular processor architecture
d Good news
– Assembly languages all have the same general structure
– A programmer who understands one assembly language can learn another quickly
d General format: a statement contains an optional label, an operation, its operands, and an optional comment
d Typically
– A character reserved to start a comment
– Comment extends to end of line
d Examples of comment characters
– Pound sign (#)
– Semicolon (;)
d Similar to high-level languages: block comments are used to explain the overall purpose
of each large section of code
d Unlike high-level languages: each line of assembly code usually contains a comment
explaining purpose of the instruction
################################################################
# #
# Search linked list of free memory blocks to find a block #
# of size N bytes or greater. Pointer to list must be in #
# register 3 and N must be in register 4. The code also #
# destroys the contents of register 5, which is used to #
# walk the list. #
# #
################################################################
d Note: in one historic case, DEC and AT&T each built an assembly language for the same processor, and they used opposite orders for operands — one chose ( source, destination ) and the other ( destination, source )
d To remember the ( destination, source ) order, note that the operands appear in the same order as an assignment statement
#
# Define register names used in the program
#
r1 register 1 # define name r1 to be register 1
r2 register 2 # and so on for r2, r3, and r4
r3 register 3
r4 register 4
#
# Define register names for a linked list program
#
listhd register 6 # holds starting address of list
listptr register 7 # moves along the list
d Assembly language provides a way to specify the type of each operand (e.g.,
immediate, register, memory reference, indirect memory reference)
d Typically, compact syntax is used
d Example: code generated for function calls (high-level code on the left, assembly on the right)

x( );                jsr x
other statement;     code for other statement
x( );                jsr x
next statement;      code for next statement
d Hardware possibilities
– Stack in memory used for arguments
– Register windows used to pass arguments
– Special-purpose argument registers used
d Consequence: assembly language for passing arguments depends on hardware
d See Appendix 3 and Appendix 4 in the text for x86 and MIPS calling sequence
d Assembly language program can call function written in high-level language (e.g., to
avoid writing complex functions in assembly language)
d High-level language program can call function written in assembly language
– When higher speed is needed
– When access to special-purpose hardware is required
d Interactions must follow calling conventions of the high-level language
int x, y, z;            x: .long
                        y: .long
                        z: .long

short w, q;             w: .word
                        q: .word

statement(s)            code for statement(s)

x: .word 949            (a variable initialized to 949)
d Software component
d Accepts assembly language program as input
d Produces binary form of program as output
d Uses two-pass algorithm
– Pass 1: computes instruction offset for each label
– Pass 2: generates code
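d A compact C sketch of pass 1 (hypothetical: it assumes fixed 4-byte instructions and a toy syntax in which a label is a line ending in a colon):

#include <stdio.h>
#include <string.h>

struct symbol { char name[32]; unsigned offset; };

int main(void) {
    const char *lines[] = { "top:", "load r1, x", "add r1, 7",
                            "store r1, x", "jump top" };
    struct symbol symtab[16];
    int nsyms = 0;
    unsigned offset = 0;

    for (int i = 0; i < 5; i++) {
        size_t len = strlen(lines[i]);
        if (lines[i][len - 1] == ':') {     /* a label: record its offset */
            memcpy(symtab[nsyms].name, lines[i], len - 1);
            symtab[nsyms].name[len - 1] = '\0';
            symtab[nsyms].offset = offset;
            nsyms++;
        } else {
            offset += 4;                    /* an instruction occupies 4 bytes */
        }
    }
    for (int i = 0; i < nsyms; i++)         /* pass 2 would use symtab */
        printf("%s -> offset %u\n", symtab[i].name, symtab[i].offset);
    return 0;
}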
d Syntactic substitution
d Parameterized for flexibility
d Programmer supplies macro definitions
d Code contains macro invocations
d Assembler handles macro expansion in extra pass
d Known as macro assembly language
d Note: assembly macros predate #define
d Technology
– The type of the underlying hardware
– Choice determines cost, persistence, performance
– Many variants are available
d Organization
– How underlying hardware is used to build memory system (i.e., bytes, words, etc.)
– Directly visible to programmer
d Volatile or nonvolatile
d Random or sequential access
d Read-write or read-only
d Primary or secondary
d Volatile memory
– Contents disappear when power is removed
– Fastest access times
– Least expensive
d Nonvolatile memory
– Contents remain without power
– More expensive than volatile memory
– May have slower access times
– Some embedded systems “cheat” by using a battery to maintain memory contents
d Random access
– Typical for most applications
d Sequential access
– Known as a FIFO (First-In-First-Out)
– Typically associated with streaming applications
– Requires special purpose hardware
d Primary memory
– Highest speed
– Most expensive, and therefore the smallest
– Typically solid state technology
d Secondary memory
– Lower speed
– Less expensive, and therefore can be larger
– Traditionally used magnetic media and electromechanical drive mechanisms
– Moving to solid state (flash)
d Harvard architecture
– Two separate memories known as
* Instruction store
* Data store
– One memory holds programs and the other holds data
– Used on early computers and some embedded systems
d Von Neumann architecture
– A single memory holds both programs and data
– Used on most general-purpose computers
d Advantages
– Allows separate caches (described later)
– Permits memory technology to be optimized for access patterns
* Instructions: sequential access
* Data: random access
d Disadvantage
– Must choose a size for each when computer is designed
d Separating instruction and data memories has potential advantages but a big
disadvantage
d Memory systems use fetch-store paradigm
d Only two operations available
– Fetch (read)
– Store (write)
Physical Memory
And
Physical Addressing
d Main memory
– Designed to permit arbitrary pattern of references
– Known by the term RAM (Random Access Memory)
d Usually volatile
d Two basic technologies available
– Static RAM
– Dynamic RAM
d Easiest to understand
d Basic elements built from a latch
[Figure: a static RAM cell built from a latch, with data and write enable inputs]
d Advantages
– High speed
– Access circuitry is straightforward
d Disadvantages
– Higher power consumption
– Heat generation
– High cost
d Alternative to SRAM
d Consumes less power
d Analogous to a capacitor (i.e., stores an electrical charge)
d Entropy increases
d Any electronic storage device gradually loses charge
d When left for a long time, a bit in DRAM changes from logical 1 to logical 0
d Discharge time can be less than a second
d Conclusion: although it is inexpensive, DRAM is a horrible memory device!
d Solution: refresh — extra circuitry periodically reads and rewrites each bit before the charge leaks away
d Density
– Refers to memory cells per square area of silicon
– Usually stated as number of bits on standard size chip
– Example: 1 gig chip holds 1 gigabit of memory
– Note: higher density chip generates more heat
d Latency
– Time that elapses between the start of an operation and the completion of the
operation
– May depend on previous operations (see below)
[Figure: the processor connects to a controller, which connects to the physical memory]
d Main point: because all memory requests go through the controller, the interface a
processor “sees” can differ from the underlying hardware organization
d Processor
– Presents request to controller
– Waits for response
d Controller
– Translates request into signals for physical memory chips
– Returns answer to processor as quickly as possible
– Sends signals to reset physical memory for next request
d Goals
– Improve memory performance
– Avoid mismatch between CPU speed and memory speed
d Technique: memory hardware runs at a multiple of the CPU clock rate
d Available for both SRAM and DRAM
d Examples
– Double Data Rate SDRAM (DDR-SDRAM)
– Quad Data Rate SRAM (QDR-SRAM)
Technology Description
DDR-DRAM Double Data Rate Dynamic RAM
DDR-SDRAM Double Data Rate Synchronous Dynamic RAM
FCRAM Fast Cycle RAM
FPM-DRAM Fast Page Mode Dynamic RAM
QDR-DRAM Quad Data Rate Dynamic RAM
QDR-SRAM Quad Data Rate Static RAM
SDRAM Synchronous Dynamic RAM
SSRAM Synchronous Static RAM
ZBT-SRAM Zero Bus Turnaround Static RAM
RDRAM Rambus Dynamic RAM
RLDRAM Reduced Latency Dynamic RAM
[Figure: the processor connects to the controller over a parallel interface; the controller accesses a physical memory organized as 32-bit words numbered word 0, word 1, word 2, and so on]
d Byte addressing
– View of memory presented to processor
– Each byte of memory assigned an address
– Convenient for programmers
– However... the underlying memory uses word addressing
d Memory controller
– Provides translation
– Allows programmers to use byte addresses (convenient)
– Allows physical memory to use word addresses (efficient)
word 4:   bytes 16 17 18 19
word 3:   bytes 12 13 14 15
word 2:   bytes  8  9 10 11
word 1:   bytes  4  5  6  7
word 0:   bytes  0  1  2  3
(each 32-bit word holds four bytes; a byte address is assigned to each byte of each word)
W = B div N (the word that holds byte B)
O = B mod N (the offset of byte B within the word)

d Example
– Find byte B = 11 when N = 4
– B can be found in word 11 div 4 = 2 at offset 11 mod 4 = 3
– In binary, B is 0 ... 0 0 1 0 1 1; the low-order bits give the offset and the remaining bits give the word
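d A C sketch of the computation (illustrative; when N is a power of two, the division and modulo reduce to a shift and a mask):

#include <stdio.h>

#define N 4    /* bytes per word (a power of two) */

int main(void) {
    unsigned int B = 11;              /* byte address */

    unsigned int word   = B / N;      /* word index:  11 / 4   = 2 */
    unsigned int offset = B % N;      /* byte offset: 11 mod 4 = 3 */

    /* hardware avoids division: shift by log2(N), mask the low bits */
    unsigned int word2   = B >> 2;
    unsigned int offset2 = B & 0x3;

    printf("word %u, offset %u (shift/mask gives %u, %u)\n",
           word, offset, word2, offset2);
    return 0;
}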
2^32 = 4,294,967,296 unique addresses
d Known as address space
d Note: word addressing allows larger memory than byte addressing, but is seldom used
because it is difficult to program
d Speeds of data networks and other I/O devices are usually expressed in powers of ten
– Example: a Gigabit Ethernet operates at 10^9 bits per second
d Programmer must accommodate differences between measures for storage and
transmission
char *iptr;
int *iptr;
d Debugging tool
d Gives hex representation of bytes in memory
d Each line of output specifies memory address and bytes starting at that address
struct node {
    int value;
    struct node *next;
};
d Example list has the structure: head → node 1 → node 2 → node 3

Address     Contents Of Memory
0001bde0    00000000 0001bdf8 deadbeef 4420436f
0001bdf0    6d657200 0001be18 000000c0 0001be14
0001be00    00000064 00000000 00000000 00000002
0001be10    00000000 000000c8 0001be00 00000006
[Figure: memory banks — four identical memory modules (Bank 0 through Bank 3) that each handle addresses 0 to 2^k−1; a selector uses the high-order bits of an address to select a bank, and an interface accepts requests]

[Figure: Content Addressable Memory (CAM) — each slot consists of a key plus associated storage]
d Physical memory
– Organized into fixed-size words
– Accessed through a controller
d Controller can use
– Byte addressing when communicating with a processor
– Word addressing when communicating with a physical memory
d To avoid arithmetic, use powers of two for
– Address space size
– Bytes per word
d Acts as an intermediary
d Located between source of requests and source of replies
large data storage
requester
cache
d Small (usually much smaller than storage needed for entire set of items)
d Active (makes decisions about which items to save)
d Transparent (invisible to both requester and data store)
d Automatic (uses sequence of requests; does not receive extra instructions)
d Let Cm denote the cost of accessing the underlying memory and Ch the cost of accessing the cache; for a series of N requests

Cworst = N Cm

Cbest = Cm + (N − 1) Ch

d Dividing by N gives the best-case cost per request:

( Cm + (N − 1) Ch ) / N = Cm / N − Ch / N + Ch
d If we ignore overhead
– In the worst case, the performance of caching is no worse than if the cache were not
present
– In the best case, the cost per request is approximately equal to the cost of accessing
the cache
d Note: for memory caches, parallel hardware means almost no overhead
d Hit ratio
– Percentage of requests satisfied from cache
– Given as value between 0 and 1
d Miss ratio
– Percentage of requests not satisfied from cache
– Equal to 1 minus the hit ratio
d Allows us to assess expected cost
Cost = r Ch + (1 − r) Cm
d Cost is:
Cost = r1 Ch1 + r2 Ch2 + (1 − r1 − r2) Cm
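d A small C helper (invented for illustration; the costs below are made-up values, not measurements) that evaluates the single-level formula:

#include <stdio.h>

/* expected per-request cost: Cost = r*Ch + (1-r)*Cm,
 * where r is the hit ratio */
static double cache_cost(double r, double Ch, double Cm) {
    return r * Ch + (1.0 - r) * Cm;
}

int main(void) {
    double Ch = 2.0;     /* illustrative cache access cost (ns)  */
    double Cm = 100.0;   /* illustrative memory access cost (ns) */
    for (int pct = 50; pct <= 100; pct += 10)
        printf("hit ratio %d%% -> expected cost %.1f ns\n",
               pct, cache_cost(pct / 100.0, Ch, Cm));
    return 0;
}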
d Optimization technique
d Stores items in cache before requests arrive
d Works well if data accessed in related groups
d Examples
– When web page is fetched, web cache can preload images that appear on the page
– When byte of memory is fetched, memory cache can preload succeeding bytes
[Figure: two processors, processor 1 and processor 2, each with its own cache (cache 1 and cache 2), sharing one physical memory]
d Traditional memory cache was separate from both the memory and the processor
d To access traditional memory cache, a processor used pins that connect the processor
chip to the rest of the computer
d Using pins to access external hardware takes much longer than accessing functional
units that are internal to the processor chip
d Advances in technology have made it possible to increase the number of transistors per
chip, which means a processor chip can contain a cache
[Figure: a direct-mapped memory cache — memory is divided into 8-byte blocks, each identified by a tag and a block number; blocks 0 through 3 of each tag group map onto the corresponding cache slots, and each slot holds a tag plus a value]
d Think binary: if all values are powers of two, bits of an address can be used to specify a
tag, block, and offset
[Figure: cache lookup — the address divides into tag, block, and offset fields; a decoder uses the block bits to select only one slot, and a comparison of the slot's tag with the tag bits from the address (combined by a logical and) determines a hit]
[Figure: the processor sends each reference through an MMU, which forwards the request to one of two controllers, each attached to its own physical memory (#1 and #2)]
[Figure: the processor sees a single contiguous memory — addresses 0x00000000 through 0x3FFFFFFF refer to memory 1, and addresses 0x40000000 through 0x7FFFFFFF refer to memory 2]
d Notes
– 0x40000000 is 1 gigabyte or 1073741824 bytes
– For identical modules, these are called memory banks
Address Translation
d Performed by MMU
d Also called address mapping
d For our example
– To determine which physical memory, test if address is 0x40000000 or above
– Both memory modules use addresses 0 through 0x3fffffff
– Subtract 0x40000000 from address when forwarding a request to memory 2
0x40000000 = 1000000000000000000000000000000 (binary)
     to
0x7fffffff = 1111111111111111111111111111111 (binary)
d Addresses above 0x3fffffff are the same as the previous set except for high-order bit
d Hardware uses the high-order bit to select a physical memory module
[Figure: an address space with holes — memory 2 occupies the top of the upper half (just below address N), memory 1 the top of the lower half (just below N/2), and the regions beneath each, including address 0, are holes that are not present]
d Hardware perspective
– Allow multiple memory modules
– Provide homogeneous integration
d Software perspective
– Programmer convenience
– Support for multiprogramming and protection
[Figure: four virtual spaces, each with addresses 0 through M, mapped into a physical memory with addresses 0 through N — virtual space 1 maps at 3N/4, space 2 at N/2, space 3 at N/4, and space 4 at 0]
d Base-bound registers
d Segmentation
d Demand paging
[Figure: base-bound mapping — a virtual space with addresses 0 through M is placed in physical memory at the address held in the base register; the bound register holds the size M, and each reference is checked against it]
d Alternative to base-bound
d Provides fine-granularity mapping
– Divides program into segments (typical segment corresponds to one procedure)
– Maps each segment to physical memory
d Key idea
– Segment is only placed in physical memory when needed
– When segment is no longer needed, OS moves it to disk
d Part of MMU
d Intercepts each memory reference
d If referenced page is present in memory, translate address and perform the operation
d If referenced page not present in memory, generate a page fault (i.e., an error condition)
d Record the details and allow operating system to handle the fault
[Figure: a page table — entry N points to the physical location of page N]

d Page number computed by dividing the virtual address by the number of bytes per page, K

N = ⌊ V / K ⌋
O = V mod K
A = pagetable[N] + O
[Figure: translating a virtual address — the address splits into page number N and offset O; the page table maps N to frame F, and the physical address combines F with O]
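d A C sketch of the computation the MMU performs (the page size and table contents are invented for illustration):

#include <stdio.h>

#define K 4096UL    /* bytes per page (a power of two) */

/* pagetable[N] holds the physical address of the frame
 * backing virtual page N (made-up values) */
static unsigned long pagetable[] = { 0x20000UL, 0x8000UL, 0x14000UL };

static unsigned long translate(unsigned long V) {
    unsigned long N = V / K;    /* page number (V >> 12 in hardware) */
    unsigned long O = V % K;    /* offset within page (V & 0xfff)    */
    return pagetable[N] + O;    /* A = pagetable[N] + O              */
}

int main(void) {
    unsigned long V = 0x1234UL; /* page 1, offset 0x234 */
    printf("virtual 0x%lx -> physical 0x%lx\n", V, translate(V));
    return 0;                   /* prints: virtual 0x1234 -> physical 0x8234 */
}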
[Figure: physical memory holds the operating system and the page tables; the remainder is divided into frame storage]
d Consequence: only part of memory is divided into frames that hold applications
d If pair is in TLB
– Virtual address can be translated without a page table reference
– MMU returns the translation much faster than a page table lookup
d Location of A [ i , j ] given by
location(A) + i×Q + j
d Optimal
for i = 1 to N {
for j = 1 to M {
A [ i, j ] = 0;
}
}
d Nonoptimal
for j = 1 to M {
for i = 1 to N {
A [ i, j ] = 0;
}
}
[Figure: a virtual address extended with an address space ID]
d Demand paging
– The chief technology used in most systems
– Combination of hardware and software
– Uses page tables to map virtual addresses to physical addresses
– High-speed lookup mechanism known as TLB makes demand paging practical
d Caching virtual addresses requires either
– Flushing the cache during context switch
– Using an ID to disambiguate
Input / Output
Concepts And Terminology
[Figure: an external device connects to the processor over digital signals and draws power from a separate power source]
[Figure: interface hardware — a controller in the processor and a controller in the device communicate over the external connection]
d Serial interface
– Single signal wire (also need ground); one bit at a time
– Less complex hardware with lower cost
d Parallel interface
– Many wires; each wire carries one bit at any time
– Width is number of wires
– Complex hardware with higher cost
– Theoretically faster than serial
– Practical limitation: at high data rates, close parallel wires have potential for
interference
d Full-duplex
– Simultaneous, bidirectional transfer
– Example: disk drive supports simultaneous read and write operations
d Half-duplex
– Transfer in one direction at a time
– Interfaces must negotiate access before transmitting
– Example: processor can read or write to a disk, but can only perform one operation
at a time
d Latency
– Measure of the time required to perform a transfer
– Latencies of input and output may differ
d Throughput
– Measure of the amount of data that can be transferred per unit time
– Informally called speed
d Fundamental idea
d Arises from hardware limits on parallelism (pins or wires)
d Allows sharing
d Multiplexor
– Accepts input from many sources
– Sends each item along with an ID
d Demultiplexor
– Receives ID along with transmission
– Uses ID to reassemble items correctly
[Figure: multiplexing hardware feeds a parallel interface 16 bits wide; demultiplexing hardware on the other side reassembles the items]
Buses
And
Bus Architecture
[Figure: a bus connecting the processor and a device]
d Several possibilities
d Can consist of
– A cable with multiple wires
– Traces on a circuit board
d Usually, a bus has sockets into which devices plug
[Figure: a circuit board (device interface) plugs into a socket on the mother board]
d Each device attached to a bus is assigned an address (in practice, there may be a small
set of addresses)
d Bus allows processor to specify
– Address for the device
– Data to transfer
– Control (e.g., to specify input or output)
d We can think of a bus as having a separate group of wires (lines) for each of the above
functions
d Fetch
– Place an address on the address lines
– Use control line to signal fetch operation
– Wait for control line to indicate operation complete
– Extract data item from the data lines
d Store
– Place an address on the address lines and a data item on the data lines
– Use control line to signal store operation
– Wait for control line to indicate operation complete
[Figure: a processor and memories 1 through N attach to a shared bus through bus interfaces]
d Address conflict
– Two devices attempt to respond to a given address
d Unassigned address
– No device responds to a given address
d Bus hardware detects the problems and raises an error condition (sometimes called a
bus error)
d Unix reports bus error to an application that attempts to dereference an invalid pointer
if ( address == 10000 ) {
if ( op == store ) {
if ( data != 0 ) {
turn_on_display;
} else {
turn_off_display;
}
} else { /* handle fetch */
if ( device is on ) {
send value 1 as data;
} else {
send value 0 as data;
}
}
}
Asymmetry
[Figure: a processor, memory 1, memory 2, device 1, and device 2 all attach to a single bus]
d Example includes
– Two memories of 1 megabyte each
– Two devices that use 12 bytes of address space
[Figure: example bus address space —
0xffff   available for devices (device 1 and device 2)
0xdfff   top of a hole (not available)
0xbfff   available for memory (memory 2)
0x7fff   top of a hole (not available)
0x3fff   available for memory (memory 1)
0x0000   bottom of the address space]
d In a typical system
– A device only requires a few bytes of address space
– Designers leave room for many devices
d Consequence: address space available for devices is sparsely populated
d Software such as an OS that has access to the bus address space can fetch or store to a
device
d Example code
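d A hedged C sketch (the address 10000 matches the example device above; the volatile-pointer access pattern is standard, but the device itself is imaginary):

/* treat bus address 10000 as the device's register; 'volatile'
 * prevents the compiler from optimizing the fetch and store away */
#define DEVICE_ADDR 10000UL

static volatile int *device = (volatile int *)DEVICE_ADDR;

void display_on(void)     { *device = 1; }    /* store nonzero: turn on */
void display_off(void)    { *device = 0; }    /* store zero: turn off   */
int  display_status(void) { return *device; } /* fetch: 1 if device on  */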
d Hardware mechanism
d Used to connect two buses
[Figure: a bridge connects bus 1 and bus 2]

[Figure: the bridge supplies a mapping — part of the main bus address space that is available for devices maps onto the address space of the auxiliary bus; regions available for memory remain, and part of the auxiliary space is not mapped]
d Alternative to bus
d Connects multiple devices
d Sender supplies data and destination device
d Fabric delivers data to specified destination
[Figure: a switching fabric accepting inputs 1 through N and delivering data to multiple outputs]
Programmed And
Interrupt-driven I / O
d Programmed I/O
– A terrible name
– Also called polled I/O
d Interrupt-driven I/O
– Another poor naming choice
– Software actually drives I/O
d Each device defines a set of addresses and meanings for fetch and store operations
d An interface for our imaginary printer
Addresses Operation Meaning
d Set of addresses a device defines are known as its Control and Status Registers (CSRs)
d CSRs are used to transfer data and control the device
d The hardware designer chooses whether a given CSR responds to
– A fetch operation
– A store operation
– Both
d In many cases, individual CSR bits are assigned meanings
d In C, a struct can be used to define CSRs
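d A sketch of such a struct (the device, register layout, and bit meanings below are invented for illustration):

#include <stdint.h>

/* each 32-bit member corresponds to one CSR, in the order the
 * hardware defines; 'volatile' forces every access to reach the bus */
struct csr {
    volatile uint32_t control;    /* write: start or stop the device */
    volatile uint32_t status;     /* read: current device status     */
    volatile uint32_t data;       /* read/write: transfer one item   */
};

#define CSR_BASE 0x10000UL        /* hypothetical bus address of the CSRs */
#define DEV ((struct csr *)CSR_BASE)

/* example: start the device, then poll until bit 0 of the
 * status register (assumed to mean "ready") becomes 1 */
static inline void dev_start(void) {
    DEV->control = 1;
    while ((DEV->status & 1) == 0)
        ;                         /* programmed I/O: busy-wait */
}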
d Processor hardware
– Saves current instruction pointer
– Jumps to code for the interrupt
– Resumes executing the application when the code executes a return from interrupt
[Figure: interrupt vectors in memory — vector entries 0 through 3 each point to the handler for the corresponding device (device 0 through device 3)]
d Widely used
d Works well for high-speed I/O and streaming
d Requires smart device that can move data across the bus to / from memory without
using processor
d Example: Wi-Fi network interface can read an entire packet and place the packet in a
specified buffer in memory
d Basic idea
– CPU tells device location of buffer
– Device fills buffer and then interrupts
[Figure: the processor passes a buffer address to the device; the example shows a sequence of operations R 17, W 29, R 61]
A Programmer’s View
Of I / O
And Buffering
d Piece of software
d Responsible for communicating with specific device
d Usually part of operating system
d Performs basic functions
– Initializes the device
– Manipulates device's CSRs to start operations when I/O is needed
– Handles interrupts from device
d Lower half
– Handler code that is invoked when the device interrupts
– Communicates directly with device (e.g., to reset hardware)
d Upper half
– Set of functions that are invoked by applications
– Allows application to request I/O operations
d Shared variables
– Used by both halves to coordinate
– Contains input and output buffers
[Figure: device driver organization — application programs invoke the upper half; the upper half and the lower half communicate through shared variables; interrupts from the device hardware invoke the lower half]
d Character-oriented
– Transfer one byte at a time
– Examples
* Keyboard
* Mouse
d Block-oriented
– Transfer block of data at a time
– Examples
* Disk
* Network interface
Example Flow In A Network Device Driver

Steps Taken
1. The application sends data over the Internet
2. Protocol software passes a packet to the driver
3. The driver stores the outgoing packet in the shared variables
4. The upper half specifies the packet location and starts the device
5. The upper half returns to the protocol module
6. The protocol software returns to the application
7. The device interrupts and the lower half of the driver executes
8. The lower half removes the copy of the packet from the variables

[Figure: the path of an outgoing packet from the application through the protocol software, the driver's upper half, the shared variables, and the driver's lower half to the external device hardware]
upper half
request queue in
shared variables
data area
lower half
d Needed because interrupts occur asynchronously and multiple applications can attempt
I/O on a given device at the same time
d Guarantees only one operation will be performed at any time
d Device drivers handle mutual exclusion
[Figure: an application calls the run-time library, which invokes the device driver, which operates the device hardware]
d Two principles
– Making a system call is much more expensive than making a conventional function call
– To reduce the number of system calls, transfer more data per call
d Important optimization
d Widely used
d Usually automated and invisible to programmer
d Key idea: make large I/O transfers to driver
– Accumulate large block of outgoing data before transfer
– Transfer large block of incoming data and then extract individual items
Operation Meaning
setup Initialize input and/or output buffers
input Perform an input operation
output Perform an output operation
terminate Discontinue use of the buffers
flush Force contents of output buffer to be written
d Device driver in the operating system may also perform buffering to reduce number of
transfers between the processor and the device
d Setup function
– Called to initialize buffer
– May allocate buffer
– Typical buffer sizes 8K to 128K bytes
d Output function
– Called when application needs to emit data
– Places data item in buffer
– Only writes to I/O device when buffer is full
d Terminate function
– Called when all data has been emitted
– Forces remaining data in buffer to be written
Setup(N)
1. Allocate a buffer of N bytes.
2. Create a global pointer, p, and initialize p to the address of the first byte of
the buffer.
Output(D)
1. Place data byte D in the buffer at the position given by pointer p, and move
p to the next byte.
2. If the buffer is full, make a system call to write the contents of the entire
buffer, and reset pointer p to the start of the buffer.
Terminate
1. If the buffer is not empty, make a system call to write the contents of the
buffer prior to pointer p.
2. If the buffer was dynamically allocated, deallocate it.
Flush
1. If the buffer is currently empty, return to the caller without taking any action.
2. If the buffer is not currently empty, make a system call to write the contents
of the buffer and set the global pointer p to the address of the first byte of the
buffer.
Terminate
1. Call flush to ensure that any remaining data is written.
2. Deallocate the buffer.
Setup(N)
1. Allocate a buffer of N bytes.
2. Create a global pointer, p, and initialize p to indicate that the buffer is
empty.
Input(N)
1. If the buffer is empty, make a system call to fill the entire buffer, and set
pointer p to the start of the buffer.
2. Extract a byte, D, from the position in the buffer given by pointer p, move p
to the next byte, and return D to the caller.
Terminate
1. If the buffer was dynamically allocated, deallocate it.
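d The algorithms above translate almost directly into C; the sketch below (illustrative — it uses the POSIX write() call as the expensive system call and omits error handling) implements buffered output:

#include <stdlib.h>
#include <unistd.h>

static char  *buf;       /* the output buffer          */
static char  *p;         /* next free byte in buf      */
static size_t bufsize;   /* N, the size of the buffer  */
static int    fd = 1;    /* output descriptor (stdout) */

void setup(size_t n) {             /* allocate buffer, reset pointer */
    buf = malloc(n);
    bufsize = n;
    p = buf;
}

void flush_buf(void) {             /* write accumulated bytes, if any */
    if (p > buf) {
        write(fd, buf, (size_t)(p - buf));
        p = buf;
    }
}

void output(char d) {              /* place one byte in the buffer */
    *p++ = d;
    if ((size_t)(p - buf) == bufsize)
        flush_buf();               /* buffer full: one system call */
}

void terminate_buf(void) {         /* flush remaining data, free the buffer */
    flush_buf();
    free(buf);
}

int main(void) {
    setup(8192);
    for (int i = 0; i < 100000; i++)
        output('x');               /* 100000 bytes in about 13 write() calls */
    terminate_buf();
    return 0;
}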
d Implementation
– Both input and output buffering are straightforward
– Only a trivial amount of code is needed
d Effectiveness
– Buffer of size N reduces number of system calls by a factor of N
– Example: when buffering character (byte) output, a buffer of only 8K bytes reduces
system calls by a factor of 8192
d Buffering
– Fundamental technique used to enhance performance
– Useful with both input and output
d Buffer of size N reduces system calls by a factor of N
Parallelism
d Microscopic
– Parallel hardware in an ALU
– Parallel data transfer to/from physical memory or an I/O bus
d Macroscopic
– Multiple identical processors, such as a multicore CPU (known as symmetric)
– Multiple dissimilar processors, such as a CPU and GPU (known as asymmetric)
d Fine-grain parallelism
– Parallelism among individual instructions (e.g., two addition operations occur at the
same time)
d Coarse-grain parallelism
– Parallel execution of programs on multiple cores
d Explicit parallelism
– Visible to programmer
– Requires programmer to initiate and control parallel activities
d Implicit parallelism
– Hidden from programmer
– Hardware runs multiple copies of application code or instructions automatically
Name Meaning
SISD Single Instruction stream Single Data stream
SIMD Single Instruction stream Multiple Data streams
MISD Multiple Instruction streams Single Data stream
MIMD Multiple Instruction streams Multiple Data streams
d On a conventional computer
for i from 1 to N {
V[i] ← V[i] × Q;
}
d On a vector processor
V ← V × Q;
d Symmetric
d Asymmetric
[Figure: a symmetric multiprocessor — processors P1 through PN share main memory (various modules) and the devices]
d Major problem with SMP architecture: contention for memory and I/O devices
d To improve performance: provide each processor with its own copy of a device
d Old idea
d Pioneered in mainframe computers of 1960s
d Examples
– Channel (IBM mainframe)
– Peripheral Processor (CDC mainframe)
d Making a comeback — now used in large systems
d Needed
– Among processors
– Between processors and I/O devices
– Across networks
d As number of processors increases, communication becomes a bottleneck
“Building multiprocessor systems that scale while correctly synchronising the use of
shared resources is very tricky, whence the principle: with careful design and attention
to detail, an N-processor system can be made to perform nearly as well as a single-
processor system. (Not nearly N times better, nearly as good in total performance as
you were getting from a single processor). You have to be very good — and have the
right problem with the right decomposability — to do better than this.”
d Speedup on a multiprocessor is defined as

Speedup = τ1 / τN

d Where
– τN denotes the execution time on a multiprocessor
– τ1 denotes the execution time on a single processor
d Ideal: speedup that is linear in number of processors
[Figure: speedup as a function of the number of processors (N), plotted for N up to 16 and for N up to 32 — the ideal curve is linear, while the typical curve grows more slowly and flattens as N increases]
x = x + 1;

d Typical code generated for the statement

load x, R5
incr R5
store R5, x

d The same code protected by a lock

lock 17
load x, R5
incr R5
store R5, x
release 17
d Hardware allows one processor (core) to hold a given lock at a given time, and blocks
others
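d The same idea in portable C (a sketch using POSIX threads and a C11 atomic flag as a spinlock in place of hardware lock 17):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int x = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                        /* spin until the lock is free */
        x = x + 1;                   /* load x, incr, store x       */
        atomic_flag_clear(&lock);    /* release                     */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);           /* always 200000 with the lock held */
    return 0;
}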
d Implicit parallelism
– Programmer writes sequential code
– Hardware runs many copies automatically
d Explicit parallelism
– Programmer writes code for parallel architecture
– Code must use locks to prevent interference
d Conclusion: explicit parallelism makes computers extremely difficult to program
d Parallelism is fundamental
d Flynn scheme classifies computers as
– SISD (e.g., conventional uniprocessor)
– SIMD (e.g., vector computer)
– MIMD (e.g., multiprocessor)
d Multiprocessors can be
– Symmetric or asymmetric
– Explicitly or implicitly parallel
d Multiprocessor speedup usually less than linear
Data Pipelining
[Figure: a data pipeline — information arrives at one end, passes through a series of stages, and leaves at the other]
d Uniprocessor
– Each stage is a process or thread
d Multiprocessor
– Each stage executes on separate processor or core
– Hardware assist can speed interstage data transfer
[Figure: (a) a processor that takes input from one network and has multiple outputs; (b) the processing loop it runs:]

do forever {
    Wait to receive packet
    Verify integrity
    Check for loops
    Select a path
    Prepare for transmission
    Enqueue packet for output
}
d Bad news: if it uses processors of the same speed as a nonpipeline architecture, a data
pipeline will not improve the overall time needed to process a given data item
d Good news: by overlapping computation on multiple items, a pipeline increases
throughput
d Assume
– The task is packet processing
– Processing a packet requires exactly 500 instructions
– A processor executes 10 instructions per µsec
time = 500 instructions / 10 instr. per µsec = 50 µsec

Tnp = 1 packet / 50 µsec = 1 packet × 10^6 / 50 sec = 20,000 packets per second
d Suppose the problem can be divided into four stages and that the stages require
– 50 instructions
– 100 instructions
– 200 instructions
– 150 instructions
d Important principle: the throughput of a data pipeline is limited by the slowest stage
d Overall throughput: the slowest stage needs 200 instructions (20 µsec) per packet, so the pipeline handles 1 packet / 20 µsec = 50,000 packets per second
d Term refers to computer systems in which the primary focus is data pipelining
d Most often used for special-purpose systems
d Data pipeline usually organized around functions
d Less relevant to general-purpose computers
[Figure: functions f( ), g( ), and h( ) arranged (a) as conventional calls and (b) as successive pipeline stages]
d Setup time
– Refers to time required to start the pipeline initially
d Stall time
– Refers to time required to restart the pipeline after a stage blocks to wait for a
previous stage
d Flush time
– Refers to time that elapses between the cessation of input and the final data item
emerging from the pipeline (i.e., the time required to shut down the pipeline)
d Pipelining
– Broad, fundamental concept
– Can be used with hardware or software
– Applies to instructions or data
– Can be synchronous or asynchronous
– Can be buffered or unbuffered
d Pipeline performance
– Unless faster processors are used, data pipelining does not decrease the overall time
required to process a single data item
– Using a pipeline does increase the overall throughput (items processed per second)
– The stage of a pipeline that requires the most time to process an item limits the
throughput of the pipeline
– Kathryn McKinley
Microsoft, 2013
Power
E = ∫ P(t) dt, integrated from t = t0 to t = t1
d Power
– Associated with data centers
– Question: can supplier deliver the megawatts (or gigawatts) required?
d Energy
– Associated with portable systems
– Question: how long will the battery last?
Ed = (1/2) C Vdd^2
d Observe
– Energy is consumed every time a gate changes
– Many parts of circuit run on a clock
– When clock pulses, the inputs to some gates change
d Consequences
– Energy is consumed when a clock runs, even if the circuit is not otherwise active
– The rate of the clock determines the rate at which a gate uses energy
d Clock changes state twice per cycle, so the power used in one period is

    Pavg = C Vdd² / Tclock

d Because the clock frequency is the reciprocal of the period

    Fclock = 1 / Tclock

  the average power can also be written Pavg = C Vdd² Fclock
d Some systems have the ability to shut down part of a circuit (e.g., shut down some of
the cores in a multicore processor)
d If we let α denote the fraction of the circuit in use, 0 ≤ α ≤ 1, the average power is

    Pavg = α C Vdd² Fclock
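d A minimal C sketch of the formula (the capacitance, voltage, and frequency values are
illustrative assumptions, not from the text)

    #include <stdio.h>

    /* Average dynamic power: P = alpha * C * Vdd^2 * F
     * alpha: fraction of the circuit switching (0..1)
     * c:     switched capacitance in farads
     * vdd:   supply voltage in volts
     * f:     clock frequency in hertz                  */
    double dynamic_power(double alpha, double c, double vdd, double f) {
        return alpha * c * vdd * vdd * f;
    }

    int main(void) {
        /* Illustrative values: 1 nF switched, 1.0 V supply, 3 GHz clock. */
        double full = dynamic_power(1.0, 1e-9, 1.0, 3e9);  /* all gates active */
        double half = dynamic_power(0.5, 1e-9, 1.0, 3e9);  /* half gated off   */
        printf("full: %.1f W, half: %.1f W\n", full, half);  /* 3.0 W, 1.5 W */
        return 0;
    }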
d The power wall:

    PowerWall = 100 watts / cm²
d Power gating
– Refers to cutting power to some parts of a circuit
– Achieved with special, low-leakage power transistors
d Clock gating
– Refers to stopping the clock (setting the frequency to zero)
– Requires software to save state and restore it when restarting the system
d Usually employs a timeout mechanism: if the circuit has been idle for time T, enter a
sleep mode
d For user-visible actions, allow the user to specify the timeout
d For other actions, compute a break-even point
[Figure: a two-state diagram with states RUN and OFF; the transitions between them take
Tshutdown and Twakeup]
d The energy used while running for time t or sleeping for time t is
Erun = Prun × t
Esleep = Es + Ew + Poff ( t − Tshutdown − Twakeup )
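d A hedged C sketch of the break-even computation, assuming Es and Ew denote the energy
consumed while shutting down and waking up, and solving Erun = Esleep for t (all sample
values are illustrative)

    #include <stdio.h>

    /* Smallest idle time t for which sleeping saves energy, i.e.,
     * E_sleep(t) <= E_run(t).  Solving
     *     P_run * t = Es + Ew + P_off * (t - Tsd - Twu)
     * for t gives the break-even point; requires p_run > p_off. */
    double break_even(double p_run, double p_off,
                      double es, double ew,
                      double t_sd, double t_wu) {
        return (es + ew - p_off * (t_sd + t_wu)) / (p_run - p_off);
    }

    int main(void) {
        /* Illustrative values (watts, joules, seconds). */
        double t = break_even(2.0 /*P_run*/, 0.1 /*P_off*/,
                              0.5 /*Es*/, 0.5 /*Ew*/,
                              0.2 /*Tsd*/, 0.3 /*Twu*/);
        printf("sleep pays off for idle periods over %.2f sec\n", t);  /* 0.50 */
        return 0;
    }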
Assessing Performance
d Can we ignore the data and focus on measuring the performance of various groups of
instructions?
d One possible measure is the average (i.e., mean) execution time of all the instructions
available on a computer
d Problems
– Even two closely-related instructions do not take exactly the same time
– A given program may use some instructions more than others
d Assume
– Addition or subtraction takes Q nanoseconds
– Multiplication or division takes 2Q nanoseconds
d The average cost of a floating point instruction is
Q
3 + Q + 2Q + 2Q
33333333333333333333
Tavg = = 1.5 Q ns per instr.
4
d Note that addition or subtraction takes 33% less than the average, and multiplication or
division takes 33% more
d A typical program will not have equal numbers of add, subtract, multiply and divide
operations
d Suppose, for example, that one sixth of the operations are multiplications or divisions;
then the total time for N operations is

    Ttotal = 2 × Q × ( N / 6 ) + Q × ( 5N / 6 )

d Or, dividing by N

    Tavg′ = 1.16 Q ns per instruction

d Note: the weighted average given here is about 23% less than the uniform average
obtained above
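d A quick C check of the arithmetic (the one-sixth multiply/divide mix is an assumption
chosen to match the 1.16 Q result)

    #include <stdio.h>

    int main(void) {
        double q = 1.0;   /* Q: cost of an add or subtract, in ns */

        /* Uniform average over the four instruction types. */
        double uniform = (q + q + 2*q + 2*q) / 4;        /* 1.50 Q */

        /* Weighted average, assuming one sixth of the operations
         * are multiplications or divisions at cost 2Q each.      */
        double weighted = (5.0/6)*q + (1.0/6)*(2*q);     /* 7Q/6, i.e. 1.16 Q */

        printf("uniform:  %.4f Q ns per instruction\n", uniform);
        printf("weighted: %.4f Q ns per instruction\n", weighted);
        printf("difference: %.0f%%\n", 100*(uniform - weighted)/uniform);
        return 0;
    }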
d SPEC cint2006
– Used to measure integer performance
d SPEC cfp2006
– Used to measure floating point performance
d Result of measuring performance on a specific architecture is known as the computer’s
SPECmark
d Amdahl's Law: the performance improvement that can be realized from faster hardware
technology is limited to the fraction of time the faster technology can be used
    Speedupoverall = 1 / ( ( 1 − Fractionenhanced ) + Fractionenhanced / Speedupenhanced )
d Notes
– Speedupoverall is the overall speedup achieved
– Fractionenhanced is the fraction of time the enhanced hardware runs
– Speedupenhanced is the speedup the enhanced hardware gives
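d A minimal C sketch of the formula (the 40% / 10× example values are illustrative)

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when an enhancement that is
     * `speedup` times faster can be used `fraction` of the time. */
    double amdahl(double fraction, double speedup) {
        return 1.0 / ((1.0 - fraction) + fraction / speedup);
    }

    int main(void) {
        /* Illustrative: a 10x-faster unit usable 40% of the time
         * yields only about a 1.56x overall speedup.             */
        printf("overall speedup: %.2f\n", amdahl(0.40, 10.0));
        return 0;
    }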
Architecture Examples
And Hierarchy
Level        Description
----------------------------------------------------------------
System       A complete computer with processor(s), memory, and
             I/O devices. A typical system architecture describes
             the interconnection of components with buses.
d Functional units
– Processor
– Memory
– I/O interfaces
d Interconnections
– High-speed buses for high-speed devices and functional units
– Low-speed buses for lower-speed devices
d Consider a PC
d Assume
– Processor uses Peripheral Component Interconnect bus (PCI)
– Some I/O devices use older Industry Standard Architecture (ISA)
d The two buses are incompatible (cannot be directly connected)
d Solution: use two buses connected by a bridge
[Figure: a memory and a bridge attach to the PCI bus; the bridge connects the PCI bus to
an ISA bus with additional devices]
[Figure: example PC organization. A CISC CPU (x86) connects to a Northbridge controller,
which links an AGP port, stream communication, and dual-ported DDR SDRAM memory; a
Southbridge chip connects USB, PCI, 6-channel audio, a LAN interface, and an ISA bus]
d Rates increase over time, so look at relative speeds, not absolute numbers in the
following examples
d The FCC’s definition of broadband network speed has been included as a point of
comparison
[Figure: a network processor with a host interface; an SRAM bus connects SRAM, a DRAM
bus connects DRAM, and packets enter and leave through a network interface]
d SRAM
– Highest speed
– Typically used for instructions
– May be used to hold packet headers
d DRAM
– Lower speed
– Typically used to hold packets
d Designer decides which data items to place in each memory
[Figure: network processor organization. An embedded RISC processor (XScale) and
Microengines 1 through N connect over multiple, independent internal buses to a PCI bus
access unit, SRAM access, DRAM access, onboard scratch memory, a serial line, and a
media access unit]
[Figure: microengine memory access detail. Data and address paths cross an AMBA bus; a
command decoder and an address generator feed microengine command queues and
microengine data paths for the Microengine]
[Figure: an embedded system-on-chip built around a MIPS-32 processor with EJTAG,
instruction and data caches, and a bus unit; an SRAM bus with SRAM controller connects
a DMA controller, Ethernet MAC, LCD controller, USB host and device controllers, two
RTCs, an interrupt controller, power management, GPIO, two SSIs, and I2S]
[Figure: a network processor containing a policy engine, a metering engine, and six nP
cores with onboard memory. Packets pass from an ingress physical MAC multiplexor
through input, MAC classify, MPLS classify, Access Control, CAR, MLPPP, and WRED
stages, each with its own memory, to output and an egress physical MAC multiplexor]
[Figure: a pipeline of processors labeled H0 through H4, S, and D0 through D6]
[Figure: a network processor built around an embedded PowerPC and sixteen
programmable protocol processors (picoengines). Interrupt, debug, and exception
hardware, hardware registers, and an internal bus controller connect to a PCI bus; ingress
and egress data stores and interfaces, instruction memory, a classifier assist, a frame
dispatch unit, and a bus arbiter complete the chip]
[Figure: a packet arrives at a classification stage (pattern processor), passes to a
forwarding stage (traffic manager and packet modifier), and leaves; a state engine
maintains statistics and handles host communication. The chip contains 200 processors]
[Figure: IXP2xxx† chip organization. SRAM buses provide SRAM access; an embedded
RISC processor (XScale), a PCI access unit, a coprocessor, scratch memory, and
Microengines 1 through N connect over multiple, independent internal buses; a Slowport
provides FLASH access and a serial line; DRAM access units connect to the DRAM bus;
the MSF provides high-speed I/O over receive and transmit buses]
†Formerly Intel
Example Of Complexity (PCI Access Unit)
[Figure: internal structure of the PCI bus access unit, which connects to the PCI bus. A
core interface block contains host functions, initiator address, read, and write FIFOs,
PCI configuration logic, and target read, write, and address FIFOs. The master side
contains a PCI address buffer, CSRs, DMA read/write buffers, and DMA DRAM, DMA
SRAM, and direct interfaces, with DRAM data, SRAM data, and address interfaces plus a
master interface. The slave side contains write address registers, a direct buffer, slave
interfaces, and command bus and bus master logic]
Hardware Modularity:
Boards And Replication
d For software
– Easy
– Just build parameterized functions
d For hardware
– Difficult
– Must replicate hardware units
d Desiderata
– Build series of products
– Include a range of sizes
– Avoid designing each from scratch
d Solution
– Design a basic building block
– Replicate the block as needed
– Arrange to activate pieces as needed
d Lab
– Large set of backend computers
– Students create and download an operating system
– Student OS runs and interacts over a console line
d However
– Student OS can wedge the backend computer
– Must power-cycle backend to regain control
[Figure: a Rebooter Hardware Unit that receives an N-bit binary input value]
d Think binary
– Assume an 8-bit binary input (up to 256 backends)
– Low-order 4 bits of binary input used to select one of 16 devices
– High-order 4 bits of binary input used to select a module
d Each module given a unique ID between 0 and 15
d A given module only responds if high-order bits of input match its ID
d Design allows the same binary input to be passed to all modules in parallel
[Figure: the same 8-bit input is broadcast to a row of modules; other modules can be
added. Bits are numbered 7 through 0; the input value 5 appears in binary as
0 0 0 0 0 1 0 1]
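d A small C sketch of the selection logic each module implements (function names are
illustrative)

    #include <stdint.h>
    #include <stdbool.h>

    /* Each module is wired with a unique 4-bit ID (0 to 15).  The same
     * 8-bit input is broadcast to all modules in parallel; a module
     * reacts only if the high-order nibble matches its ID, and then
     * uses the low-order nibble to select one of its 16 devices.      */
    bool module_selected(uint8_t input, uint8_t module_id) {
        return (input >> 4) == (module_id & 0x0F);
    }

    uint8_t selected_device(uint8_t input) {
        return input & 0x0F;   /* low-order 4 bits select the device */
    }

d For input value 5 (binary 00000101), the high-order nibble is 0, so module 0 responds
and power-cycles its device 5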
Semester Wrap-Up
d I/O devices attach to a bus, and all I/O is performed using fetch and store operations on
the bus
d A device can be polled or can use interrupts
d Device driver software (in the OS) is divided into
– Upper-half functions that applications call when they read or write data
– Lower-half functions that are invoked when an interrupt occurs
d Sophisticated devices use DMA to transfer data between the device and memory
without requiring the CPU to take action
d Buffering can improve I/O performance dramatically
d To achieve modularity, a hardware designer creates a basic building block and then
replicates the block; each copy is configured to respond to a subset of the inputs
d Parallel architectures (e.g., multicore processors, clusters)
– Are difficult to program (e.g., the programmer may need to use locks)
– Often have contention for shared memory and devices
– Have not delivered the performance they promise
d The insight that dividing computation into a data pipeline can improve throughput, even
if each stage of a pipeline runs at the same speed as the original processor
d An understanding that two cores running at lower voltage and half the clock rate can
consume substantially less power than a single core
d Familiarity with assembly language
Note: you may not enjoy programming in assembly language, but it should not be a
mystery and you will be able to use it when necessary
d A sense that you understand what’s going on underneath the software