Computer Architecture
Prof. Douglas Comer
Computer Science And ECE
Purdue University
http://www.cs.purdue.edu/people/comer
Course Introduction
And Overview
d Companies (such as Google, IBM, Microsoft, Apple, Cisco,...) look for knowledge of
architecture when hiring (i.e., understanding computer architecture can help you land a
job)
d The most successful software engineers understand the underlying hardware (i.e.,
knowing about architecture can help you earn promotions)
d As a practical matter: knowledge of computer architecture is needed for later courses,
such as systems programming, compilers, operating systems, and embedded systems
d Hardware is ugly
– Lots of low-level details
– Can be counterintuitive
d Hardware is tricky
– Timing is important
– A small addition in functionality can require many pieces of hardware
d The subject is so large that we cannot hope to cover it in one course
d You will need to think in new ways
d Basics
– A taste of digital logic
– Data paths and execution
– Data representations
d Processors
– Instruction sets and operands
– Assembly languages and programming
d Memories
– Physical and virtual memories
– Addressing and caching
Organization Of The Course
(continued)
d Input/Output
– Devices and interfaces
– Buses and bus address spaces
– Role of device drivers
d Advanced topics
– Parallelism and data pipelining
– Power and energy
– Performance and performance assessment
– Architectural hierarchies
Fundamentals
Of
Digital Logic
d Voltage
– Quantifiable property of electricity
– Measure of potential force
– Unit of measure: volt
d Current
– Quantifiable property of electricity
– Measure of electron flow along a path
– Unit of measure: ampere (amp)
d Amplification means the large output current varies exactly like the small input current
d Called a Metal Oxide Semiconductor FET (MOSFET) when used on a CMOS chip
d Three external connections
– Source
– Gate
– Drain
d Designed to act as a switch (on or off)
– When the input reaches a threshold (i.e., becomes logic 1), the transistor turns on
and passes full current
– When the input falls below a threshold (i.e., becomes logic 0), the transistor turns
off and passes no current
[Figure: a transistor as a switch — when the gate is on, current flows from source to drain; when the gate is off, no current flows from the gate region to the drain]
A B | A and B      A B | A or B      A | not A
0 0 |    0         0 0 |   0         0 |   1
0 1 |    0         0 1 |   1         1 |   0
1 0 |    0         1 0 |   1
1 1 |    1         1 1 |   1
d Hardware component
d Consists of integrated circuit
d Implements an individual Boolean function
d To reduce complexity, hardware uses inverse of Boolean functions
– Nand gate implements not and
– Nor gate implements not or
– Inverter implements not
A B | A nand B     A B | A nor B     A B | A xor B
0 0 |    1         0 0 |    1        0 0 |    0
0 1 |    1         0 1 |    0        0 1 |    1
1 0 |    1         1 0 |    0        1 0 |    1
1 1 |    0         1 1 |    0        1 1 |    0

[Figure: gate symbols, each with inputs A and B and a single output]
d Basic gates
d Suppose we need a signal to indicate that the power button is depressed and the disk is
ready
d Two logic gates are needed to form logical and
– Output from nand gate connected to input of inverter
[Figure: inputs X (from the power button) and Y (from the disk) feed a nand gate whose output A passes through an inverter to produce output C]
d Boolean expression
– Often used when designing circuit
– Can be transformed to equivalent version that requires fewer gates
d Truth table
– Enumerates inputs and outputs
– Often used when debugging a circuit
[Figure: an example circuit shown in two equivalent forms, with labeled inputs, intermediate values A and B, and output C]
X Y Z | A B C | output
0 0 0 | 1 0 1 |   0
0 0 1 | 1 0 1 |   0
0 1 0 | 0 1 1 |   0
0 1 1 | 0 0 1 |   0
1 0 0 | 1 0 1 |   0
1 0 1 | 1 0 1 |   0
1 1 0 | 0 1 0 |   1
1 1 1 | 0 0 1 |   0
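d The output column reduces to a single Boolean expression, which a small C sketch (added for illustration, not from the original slides) can verify by enumerating all inputs:

#include <stdio.h>

int main(void) {
    /* the table's output is 1 only for X=1, Y=1, Z=0,
     * i.e., output = X and Y and (not Z) */
    for (int X = 0; X <= 1; X++)
        for (int Y = 0; Y <= 1; Y++)
            for (int Z = 0; Z <= 1; Z++)
                printf("%d %d %d -> %d\n", X, Y, Z, X && Y && !Z);
    return 0;
}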
  1 0 1 0 0
+ 1 1 1 0 1
-----------
1 1 0 0 0 1

[Figure: an adder circuit — an and gate produces the carry for bit 2 of the sum; each stage has carry-in and carry-out connections]
[Figure: three 14-pin chip packages, pins 8 through 14 across the top and pins 1 through 7 across the bottom]
d Known as latch
d Has two inputs: data and enable
d When enable is 1, output is same as data
d When enable goes to 0, output stays locked at current value
[Figure: a 1-bit latch with data in and enable inputs and an output]
[Figure: a register built from four 1-bit latches — input bits for the register enter the latches and output bits for the register emerge; a common enable loads all four at once]
d Basic flip-flop
d Can be constructed from a pair of latches
d Analogous to push-button power switch (i.e., push-on push-off)
d Each new 1 received as input causes output to reverse
– First input pulse causes flip-flop to turn on
– Second input pulse causes flip-flop to turn off
in:  0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
out: 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1
(time increases to the right)

d Note: output only changes when input makes a transition from zero to one (i.e., rises)

[Figure: waveforms of the in, out, and clock signals over time]
[Figure: (a) a binary counter — successive input pulses step the outputs through 010 (2), 011 (3), 100 (4), 101 (5), and so on; (b) the clock signal alternating between 1 and 0 as time increases]
[Figure: a decoder — inputs x, y, and z select exactly one of the eight outputs labeled “000” through “111”]
d Technical detail: on some decoder chips, an active output is logic 0 and all others are
logic 1
[Figure: a clock drives a counter whose value feeds a decoder; successive decoder outputs trigger test battery, test memory, start disk, power screen, and start CPU; the first and last outputs are not used]
d Technique: count clock pulses and use decoder to select an output for each possible
counter output
d Note: counter will wrap around to zero, so this is an infinite loop
[Figure: the same clock-counter-decoder circuit with feedback — two gates that perform the Boolean and function use the stop output to halt the sequence once it completes]
d Software
– Uses iteration
– Software engineers are taught to avoid replicating code
– Iteration increases elegance
d Hardware
– Uses replicated (parallel) hardware units
– Hardware engineers are taught to avoid iterative circuits
– Replication increases performance and reliability
d Note: because chip contains multiple gates, some gates may be unused
d May be possible to reduce total chips needed by employing unused gates
d Example: use a spare nand gate as an inverter by connecting one input to five volts:
1 nand x = not x
d Previous circuit can be implemented with a single chip (a quad 2-input nand gate)
[Figure: the circuit implemented with chips IC1, IC2, and IC3]
d Gordon Moore predicted that the number of transistors on a chip would double each
year (a rate he later revised to a doubling roughly every two years)
d Led to the classification of chips by scale of integration (SSI, MSI, LSI, and VLSI)
d Number of bits per byte determines range of values that can be stored
d Byte of k bits can store 2^k values
d Examples
– Six-bit byte can store 64 possible values
– Eight-bit byte can store 256 possible values
d Device status
– First bit has the value 1 if a disk is connected
– Second bit has the value 1 if a printer is connected
– Third bit has the value 1 if a keyboard is connected
d Integer interpretation
– Positional representation uses base 2
– Values are 0 through 7
– We must specify order of bits
bit:      5        4        3        2        1        0
weight: 2^5=32   2^4=16   2^3=8    2^2=4    2^1=2    2^0=1
d Example
010101
is interpreted as
0 × 2^5 + 1 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 21
Power Of 2   Decimal Value            Decimal Digits
 0           1                         1
 1           2                         1
 2           4                         1
 3           8                         1
 4           16                        2
 5           32                        2
 6           64                        2
 7           128                       3
 8           256                       3
 9           512                       3
10           1024                      4
11           2048                      4
12           4096                      4
15           32768                     5
16           65536                     5
20           1048576                   7
30           1073741824               10
32           4294967296               10
64           18446744073709551616     20
0xDEC90949
1101 1110 1100 1001 0000 1001 0100 1001
 D    E    C    9    0    9    4    9
d Symbols for upper and lower case letters, digits, and punctuation marks
d Set of symbols defined by computer system
d Each symbol assigned unique bit pattern
d Typically, character set size determined by byte size
d Various character sets have been used in commercial computers
– EBCDIC
– ASCII
– Unicode
10000001 (the letter a in EBCDIC)
01100001 (the letter a in ASCII)
d Extends ASCII
– Assigns meaning to values from 128 through 255
– Character can be 16 bits long
d Advantage: can represent larger set of characters
d Motivation: accommodate languages such as Chinese
    1 0 0
+   1 1 0
---------
  1 0 1 0
(the leading 1 is the overflow; the remaining bits are the result)
d Little Endian places least significant byte of integer in lowest memory location
d Big Endian places most significant byte of integer in lowest memory location
d Note: difference is especially important when transferring data over the Internet between
computers for which the byte ordering differs
d Familiar to humans
d First bit represents sign
d Successive bits represent absolute value of integer
d Interesting quirk: can create negative zero
d Assume computer
– Supports 32-bit and 64-bit integers
– Uses two’s complement representation
d When 32-bit integer assigned to 64-bit integer, correct numeric value requires upper 32
bits to be filled with
– Zeroes for a positive number
– Ones for a negative number
d In essence, high-order (sign) bit from the 32-bit integer must be replicated to fill high-
order bits of larger integer
1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
_________________
replicated
d During assignment to a larger integer, hardware copies all bits of smaller integer and
then replicates the high-order (sign) bit in remaining bits
d Most computers use two’s complement hardware, which performs sign extension
d Same hardware is used for unsigned arithmetic, which means that assigning an unsigned
integer to a larger unsigned integer can change the value
d To prevent errors from occurring, a programmer or a compiler must add code to mask
off the extended sign bits
d Example code
unsigned int x;
char y;

y = 0xf0;   /* high-order bit of y is 1 */
x = y;      /* sign extension makes x 0xfffffff0     */
            /* should be x = y & 0xff; to get 0x00f0 */
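d The following runnable C sketch (added for illustration, not from the original slides; assumes a 32-bit int) makes the effect visible:

#include <stdio.h>

int main(void) {
    signed char y = 0xf0;       /* bit pattern 1111 0000 (-16 as a signed char) */
    unsigned int x1 = y;        /* sign bit replicated: x1 becomes 0xfffffff0   */
    unsigned int x2 = y & 0xff; /* mask keeps only the low byte: 0x000000f0     */

    printf("without mask: 0x%08x\n", x1);
    printf("with mask:    0x%08x\n", x2);
    return 0;
}

d Note: plain char may be unsigned on some compilers, so the sketch uses signed char to make the behavior deterministic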
d Pioneered by IBM
d Represents integer as a string of digits
– Unpacked: one digit per 8-bit byte
– Packed: one digit per 4-bit nibble
d Uses sign-magnitude representation
d Example of unpacked BCD
– Integer 123456 is stored as
0x01 0x02 0x03 0x04 0x05 0x06
– Integer –123456 is stored as:
0x01 0x02 0x03 0x04 0x05 0x06 0x0D
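d A short C sketch of the conversion (illustrative only; the function name and layout are invented, and the sign nibble is omitted):

#include <stdio.h>

/* Convert a non-negative integer to unpacked BCD:
 * one decimal digit per byte, most significant digit first.
 * Returns the number of digits produced.
 */
static int to_unpacked_bcd(unsigned int value, unsigned char digits[]) {
    unsigned char tmp[10];
    int n = 0;
    do {
        tmp[n++] = value % 10;        /* extract the low-order digit */
        value /= 10;
    } while (value != 0);
    for (int i = 0; i < n; i++)
        digits[i] = tmp[n - 1 - i];   /* reverse to high-order first */
    return n;
}

int main(void) {
    unsigned char d[10];
    int n = to_unpacked_bcd(123456, d);
    for (int i = 0; i < n; i++)
        printf("0x%02x ", d[i]);      /* prints 0x01 0x02 0x03 0x04 0x05 0x06 */
    printf("\n");
    return 0;
}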
d Disadvantages:
– Take more space
– Hardware is slower than integer or floating point
d Advantages:
– Gives results humans expect (compare to Excel)
– Avoids the repeating binary representation of decimal values such as .01
d Preferred by banks
6.022 × 10^23
d Hardware
– Uses base 2 instead of base 10
– Allocates fixed-size bit strings for
* Exponent
* Mantissa
d Mantissa
– Normalized to eliminate leading zeroes
– No need to store most significant bit because it is always 1
– Zero is a special case
d Exponent
– Allows negative as well as positive values
– Biased to permit rapid magnitude comparison
[Figure: floating point formats — (a) single precision; (b) double precision, in which bit 63 is the sign, bits 62 through 52 hold the exponent, and bits 51 through 0 hold the mantissa]
d Zero
d Positive infinity
d Negative infinity
d Note: infinity values handle cases such as the result of dividing by zero
d Single precision: 2^-126 to 2^127 (approximately 10^-38 to 10^38)
d Double precision: 2^-1022 to 2^1023 (approximately 10^-308 to 10^308)

d Example single-precision value:
0 10000001 10100000000000000000000
(sign 0, biased exponent 129, mantissa 1.101 binary — the value 6.5)
Processors
d The terms processor and computational engine refer broadly to any mechanism that
drives computation
d Wide variety of sizes and complexity
d Processor is key element in all computational systems
[Figure: a computer consists of a processor, a memory, and input/output facilities]
d Digital device
d Performs computation involving multiple steps
d Wide variety of capabilities
d Mechanisms available
– Fixed logic
– Selectable logic
– Parameterized logic
– Programmable logic
[Figure: a modern CPU contains specialized engines (graphics, trigonometry, query, arithmetic) and other components joined by internal interconnections; a controller, local storage, and an ALU share an external interface that leads to the external connection]
d Controller
– Overall responsibility for execution
– Moves through sequence of steps
– Coordinates other units
– Timing-based operation: knows how long each unit requires and schedules steps
accordingly
d Arithmetic Logic Unit
– Operates as directed by controller
– Provides arithmetic and Boolean operations
– Performs one operation at a time as directed
d Internal interconnections
– Allow transfer of values among units of the processor
– Also called data paths
d External interface
– Handles communication between processor and rest of computer system
– Provides interaction with external memory as well as external I/O devices
d Programmable device
d Dedicated to control of a physical system
d Example: control an automobile engine or grocery store door
d Negative: extremely limited (slow processor and tiny memory)
d Positive: very low power consumption
do forever {
    wait for the sensor to be tripped;
    turn on power to the door motor;
    wait for a signal that indicates the door is open;
    wait for the sensor to reset;
    delay ten seconds;
    turn off power to the door motor;
}
[Figure: the translation toolchain — source code passes through the preprocessor (producing preprocessed source code), the compiler (assembly code), the assembler (relocatable object code), and the linker, which combines it with object code for functions in libraries to produce the binary object code]
d Clock rate
– Rate at which gates are clocked
– Provides a measure of the underlying hardware speed
d Instruction rate
– Measures the number of instructions a processor can execute per unit time
d On some processors, a given instruction may take more clock cycles than other
instructions
d Example: multiplication may take longer than addition
d Processor hardware includes a reset line that stops the fetch-execute cycle
d For power-down: reset line is asserted
d During power-up, logic holds the reset until the processor and memory are initialized
d Power-up steps known as bootstrap
Processor Types
And
Instruction Sets
d Fixed-length
– Every instruction is same size
– Hardware is less complex
– Hardware can run faster
– Wasted space: some instructions do not use all the bits
d Variable-length
– Some instructions shorter than others
– Allows instructions with no operands, a few operands, or many operands
– Efficient use of memory (no wasted space)
d Task
– Start with variables X and Y in memory
– Add X and Y and place the result in variable Z (also in memory)
d Example steps
– Load a copy of X into register 1
– Load a copy of Y into register 2
– Add the value in register 1 to the value in register 2, and put the result in register 3
– Store a copy of the value in register 3 in Z
d Note: the above assumes registers 1, 2, and 3 are available
d Register spilling
– Occurs when a register is needed for a computation and all registers contain values
– General idea
* Save current contents of register(s) in memory
* Reload registers(s) from memory when values are needed
d Register allocation
– Refers to choosing which values to keep in registers at a given time
– Performed by programmer or compiler
[Figure: two register banks — Bank A holds registers 0 through 3 and Bank B holds registers 4 through 7; the processor uses separate hardware units to access the register banks]
d Build separate hardware block for each step of the fetch-execute cycle
d Arrange hardware to pass an instruction through the sequence of hardware blocks
d Allows step K of one instruction to execute while step K–1 of next instruction executes
d Result is an execution pipeline
[Figure: pipeline timing — at time 1 the pipeline holds only inst. 1; at time 2, inst. 2 enters the first stage while inst. 1 advances to the second; and so on]
Instruction K: C ← add A B
Instruction K+1: D ← subtract E C
[Figure: pipeline contents at successive times — at time 2 the stages hold instructions K+1 down through K−3; instruction K+1 must wait for the result C, so at times 9 and 10 instructions K+1 through K+6 are still moving through the stages]
(a)                     (b)
C ← add A B             C ← add A B
D ← subtract E C        F ← add G H
F ← add G H             M ← add K L
J ← subtract I F        D ← subtract E C
M ← add K L             J ← subtract I F
P ← subtract M N        P ← subtract M N
d Forwarding hardware
– Passes result of add operation directly to ALU without waiting to store it in a
register
– Ensures the value arrives by the time subtract instruction reaches the pipeline stage
for execution
d Example
Instruction K: C ← add A B
Instruction K+1: no-op
Instruction K+2: D ← subtract E C
d If forwarding is available, no-op allows time for result from register C to be fetched for
subtract operation
d Compilers insert no-op instructions to optimize performance
d Hardware register
d Used during fetch-execute cycle
d Gives address of next instruction to execute
d Also known as instruction pointer or instruction counter
d Absolute branch
– Typically named jump
– Operand is an address
– Assigns operand value to internal register A
d Relative branch
– Typically named br
– Operand is a signed value
– Adds operand to internal register A
[Figure: register windows — (a) before a call, the program sees registers x1-x4 and A-D while registers 0-7 are unavailable; (b) while the subroutine runs, the window slides so it sees x1-x4, A-D, and locals l1-l4, with the other registers unavailable]
Data Transfer
load word                          load register from memory
store word                         store register into memory
load upper immediate               place constant in upper sixteen bits of register
move from coproc. register         obtain a value from a coprocessor

Conditional Branch
branch equal                       branch if two registers equal
branch not equal                   branch if two registers unequal
set on less than                   compare two registers
set less than immediate            compare register and constant
set less than unsigned             compare unsigned registers
set less than unsigned immediate   compare unsigned register and constant
Unconditional Branch
jump go to target address
jump register go to address in register
jump and link procedure call
Arithmetic
FP add floating point addition
FP subtract floating point subtraction
FP multiply floating point multiplication
FP divide floating point division
FP add double double-precision addition
FP subtract double double-precision subtraction
FP multiply double double-precision multiplication
FP divide double double-precision division
Data Transfer
load word coprocessor load value into FP register
store word coprocessor store FP register to memory
Conditional Branch
branch FP true branch if FP condition is true
branch FP false branch if FP condition is false
FP compare single compare two FP registers
FP compare double compare two double precision values
d Elegance
– Balanced
– No frivolous or useless instructions
d Orthogonality
– No unnecessary duplication
– No overlap among instructions
d Ease of programming
– Instructions match programmer’s intuition
– Instructions are free from arbitrary restrictions
Data Paths
d Example 1: add the contents of register 4 to the contents of register 11, and place the
result in register 9
add reg9, reg11, reg4
d Example 2: add an offset of 20 to the contents of register 12, use the result as a memory
address, and load register 1 with the value from memory
load reg1, 20(reg12)
d Example 3: add an offset of 64 to the contents of register 7, treat the result as the
address of code in memory, and branch to the address
jump 64(reg7)
d Note: many processors allow an operand to specify an offset plus the contents of a
register
Instructions In Memory
(a) instruction fields: operation, reg A, reg B, dst reg, unused — the add operation has opcode 0 0 0 0 1

(b) example encoding:
00001 00100 01101 00000 000000000000
(operation = add; reg A = 00100, register 4; reg B = 01101, register 13; remaining fields zero)
[Figure: separate instruction memory and data memory — each takes an address in and produces data out; the data memory also has a data in connection and a fetch/store control line]
d Facts
– Our instruction memory is byte-addressable
– Each instruction is 32-bits long (4 bytes)
– The program counter must be incremented by 4 to move to the next instruction
d Hardware needed
– Gates to store a program counter
– Adder to compute the increment
– Clock to control when updates occur
[Figure: a 32-bit program counter connected to a 32-bit adder that computes the next instruction address]
d Recall
– Instructions in separate instruction memory
– Instruction memory takes a 32-bit address as input and produces a 32-bit output
value equal to the contents of the specified address
[Figure: the 32-bit program counter and adder, with the program counter supplying the address in to the instruction memory, which delivers the instruction from memory on its data out connection]
d The memory output changes whenever the input changes (i.e., whenever a new address
is supplied)
[Figure: the instruction fetched from memory feeds an instruction decoder, which extracts the operation, register, and offset fields]
d Note: data paths emerging from the instruction decoder are not thirty-two bits wide
[Figure: a register unit is added — the decoder's reg A and reg B fields select two registers whose contents appear on separate outputs, the dst reg field selects the register to write through the data in connection, and the constant 4 feeds the adder that increments the program counter]
d Note: there are two inputs and two outputs because we assume the register unit has
hardware that can perform two lookups simultaneously
d Although example only has one arithmetic operation, add, additional arithmetic
instructions can be added easily (e.g., shift and subtract)
d Use an Arithmetic Logic Unit (ALU)
d Problem: inputs to ALU can be
– Two registers
– Register and offset
d Solution: use a multiplexor to choose
[Figure: a multiplexor — input 1 and input 2 enter, and one of them is passed to the output]
[Figure: the data path extended with a multiplexor and an ALU — the multiplexor selects either the reg B contents or the offset as the ALU's second input, and the ALU output emerges from the unit]
d On some instructions, ALU adds register and offset; on add instruction, ALU adds two
registers
[Figure: the complete data path — program counter, 32-bit adder, instruction memory, instruction decoder, register unit, multiplexors (including M2, which selects the data memory address), ALU, and data memory, connected so each instruction can read registers, compute, and access memory]
d A multiplexor passes one of its input data paths to the output data path
d Control signals determine which input a multiplexor selects at a given time
d By controlling multiplexors, processor hardware chooses which data paths are active for
a given instruction
d A given architecture usually uses the same number of operands for most instructions
d Four basic architectural types
– 0-address
– 1-address
– 2-address
– 3-address
d Stack-based architecture
d No explicit operands in the instruction
d Program
– Pushes operands onto stack in memory
– Executes instruction
d Instruction execution
– Removes top N items from stack
– Leaves result on top of stack
push X
push 7
add
pop X
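d The sequence can be visualized with a small C sketch of a stack machine (invented for illustration; a real 0-address architecture encodes push and pop as instructions with memory operands):

#include <stdio.h>

int main(void) {
    int stack[16];
    int sp = 0;              /* number of items currently on the stack */
    int X = 5;               /* a variable in "memory" */

    stack[sp++] = X;         /* push X */
    stack[sp++] = 7;         /* push 7 */

    int b = stack[--sp];     /* add: remove the top two items ...     */
    int a = stack[--sp];
    stack[sp++] = a + b;     /* ... and leave the result on the stack */

    X = stack[--sp];         /* pop X: store the result back into X   */

    printf("X = %d\n", X);   /* prints X = 12 */
    return 0;
}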
d Analogous to a calculator
d One explicit operand per instruction
d Processor has special register known as an accumulator
– Holds second argument for each instruction
– Used to store result of instruction
d One-address example
load X
add 7
store X
d Two-address example
add 7, X
d Three-address example
add X, Y, Z
d Add operation with register 1 and signed immediate value of –93 as operands
[Figure: operand encoding — the instruction register holds the add operation with operand descriptors that refer to a general-purpose register and to locations in memory]
d Architect chooses the number and types of operands for each instruction
d Possibilities include
– Immediate (constant value)
– Contents of register
– Value in memory
– Indirect reference to memory
CPUs:
Microcode, Protection,
And Processor Modes
d Early systems
– Single Central Processing Unit (CPU) controlled entire computer
– Responsible for all I/O as well as computation
d Modern computer
– Decentralized architecture
– CPU chip may contain multiple cores
– Each I/O device (e.g., a disk) contains processor
– CPU performs computation and coordinates other processors
d Completely general
d Can perform control functions as well as basic computation
d Offers multiple levels of protection and privilege
d Provides mechanism for hardware priorities
d Handles large volumes of data
d Uses parallelism to achieve high speed
d Automatic
– Initiated by hardware (e.g., when device needs service)
– Prior to change, software (OS) must specify which code to run when the change
occurs
d Manual
– Application makes explicit request
– Typically occurs when application calls an operating system function
[Figure: privilege levels — applications run at low privilege and the Operating System runs at high privilege]
[Figure: the macro instruction set is visible to the programmer; the micro instruction set is hidden (internal) to the CPU]
d Size used by micro instructions can differ from size used by macro instructions
d Example
– Micro instructions only offer 16-bit arithmetic
– Macro instructions provide 32-bit arithmetic
d More overhead
d Macro instruction performance depends on micro instruction set
d Microprocessor hardware must run at extremely high clock rate to accommodate
multiple micro instructions per macro instruction
d Easy to read
d Programmers are comfortable using it
d Unattractive to hardware designers because higher clock rates needed
d Generally has low performance (many micro instructions needed for each macro
instruction)
[Figure: horizontal microcode view of the hardware — an Arithmetic Logic Unit (ALU) with two operand units and two result units (result 1 and result 2) connects to the macro general-purpose registers; each micro instruction is a long string of bits, one per control point]
d Move the value from register 4 to the hardware unit for operand 1
d Move the value from register 13 to the hardware unit for operand 2
d Arrange for the ALU to perform addition
d Move the value from the hardware unit for result2 (the low-order bits of the result) to
register 4
d A single microcode instruction can continue the ALU operation and also load the value
from register 7 into operand unit 1
d By using horizontal microcode, a programmer can specify simultaneous, parallel
operation of multiple hardware units
Assembly Languages
And
Programming Paradigm
d Computer scientist Alan Perlis once quipped that a programming language is low-level
if programming requires attention to irrelevant details
d Perlis’ point: because most applications do not need direct control of hardware, a low-
level language increases programming complexity without providing benefits
d In most cases, programmers do not need assembly language, only compilers do
d Assembly language
– Term used for a special type of low-level language
– Each assembly language is specific to a processor
d Assembler
– Term used for a program that translates assembly language into binary code
– Analogous to compiler
d Bad news
– Many assembly languages exist
– Each has instructions for one particular processor architecture
d Good news
– Assembly languages all have the same general structure
– A programmer who understands one assembly language can learn another quickly
d General format: a statement contains an optional label, an operation, its operands, and an optional comment
d Typically
– A character reserved to start a comment
– Comment extends to end of line
d Examples of comment characters
– Pound sign (#)
– Semicolon (;)
d Similar to high-level languages: block comments are used to explain the overall purpose
of each large section of code
d Unlike high-level languages: each line of assembly code usually contains a comment
explaining purpose of the instruction
################################################################
# #
# Search linked list of free memory blocks to find a block #
# of size N bytes or greater. Pointer to list must be in #
# register 3 and N must be in register 4. The code also #
# destroys the contents of register 5, which is used to #
# walk the list. #
# #
################################################################
d Note: in one historic case, DEC and AT&T each built an assembly language for the same processor, and they used opposite orders for operands — one chose ( source, destination ) and the other ( destination, source )
d To remember the ( destination, source ) order, note that the operands appear in the same order as an assignment statement
#
# Define register names used in the program
#
r1 register 1 # define name r1 to be register 1
r2 register 2 # and so on for r2, r3, and r4
r3 register 3
r4 register 4
#
# Define register names for a linked list program
#
listhd register 6 # holds starting address of list
listptr register 7 # moves along the list
d Assembly language provides a way to specify the type of each operand (e.g.,
immediate, register, memory reference, indirect memory reference)
d Typically, compact syntax is used
d Example: code generated for function calls (high-level code on the left, assembly on the right)

x( );                jsr x
other statement;     code for other statement
x( );                jsr x
next statement;      code for next statement
d Hardware possibilities
– Stack in memory used for arguments
– Register windows used to pass arguments
– Special-purpose argument registers used
d Consequence: assembly language for passing arguments depends on hardware
d See Appendix 3 and Appendix 4 in the text for x86 and MIPS calling sequence
d Assembly language program can call function written in high-level language (e.g., to
avoid writing complex functions in assembly language)
d High-level language program can call function written in assembly language
– When higher speed is needed
– When access to special-purpose hardware is required
d Interactions must follow calling conventions of the high-level language
int x, y, z;            x: .long
                        y: .long
                        z: .long

short w, q;             w: .word
                        q: .word

statement(s)            code for statement(s)

x: .word 949            (a variable initialized to 949)
d Software component
d Accepts assembly language program as input
d Produces binary form of program as output
d Uses two-pass algorithm
– Pass 1: computes instruction offset for each label
– Pass 2: generates code
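d A compact C sketch of pass 1 (hypothetical: it assumes fixed 4-byte instructions and a toy syntax in which a label is a line ending in a colon):

#include <stdio.h>
#include <string.h>

struct symbol { char name[32]; unsigned offset; };

int main(void) {
    const char *lines[] = { "top:", "load r1, x", "add r1, 7",
                            "store r1, x", "jump top" };
    struct symbol symtab[16];
    int nsyms = 0;
    unsigned offset = 0;

    for (int i = 0; i < 5; i++) {
        size_t len = strlen(lines[i]);
        if (lines[i][len - 1] == ':') {     /* a label: record its offset */
            memcpy(symtab[nsyms].name, lines[i], len - 1);
            symtab[nsyms].name[len - 1] = '\0';
            symtab[nsyms].offset = offset;
            nsyms++;
        } else {
            offset += 4;                    /* an instruction occupies 4 bytes */
        }
    }
    for (int i = 0; i < nsyms; i++)         /* pass 2 would use symtab */
        printf("%s -> offset %u\n", symtab[i].name, symtab[i].offset);
    return 0;
}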
d Syntactic substitution
d Parameterized for flexibility
d Programmer supplies macro definitions
d Code contains macro invocations
d Assembler handles macro expansion in extra pass
d Known as macro assembly language
d Note: assembly macros predate #define
d Technology
– The type of the underlying hardware
– Choice determines cost, persistence, performance
– Many variants are available
d Organization
– How underlying hardware is used to build memory system (i.e., bytes, words, etc.)
– Directly visible to programmer
d Volatile or nonvolatile
d Random or sequential access
d Read-write or read-only
d Primary or secondary
d Volatile memory
– Contents disappear when power is removed
– Fastest access times
– Least expensive
d Nonvolatile memory
– Contents remain without power
– More expensive than volatile memory
– May have slower access times
– Some embedded systems “cheat” by using a battery to maintain memory contents
d Random access
– Typical for most applications
d Sequential access
– Known as a FIFO (First-In-First-Out)
– Typically associated with streaming applications
– Requires special purpose hardware
d Primary memory
– Highest speed
– Most expensive, and therefore the smallest
– Typically solid state technology
d Secondary memory
– Lower speed
– Less expensive, and therefore can be larger
– Traditionally used magnetic media and electromechanical drive mechanisms
– Moving to solid state (flash)
d Harvard architecture
– Two separate memories known as
* Instruction store
* Data store
– One memory holds programs and the other holds data
– Used on early computers and some embedded systems
d Von Neumann architecture
– A single memory holds both programs and data
– Used on most general-purpose computers
d Advantages
– Allows separate caches (described later)
– Permits memory technology to be optimized for access patterns
* Instructions: sequential access
* Data: random access
d Disadvantage
– Must choose a size for each when computer is designed
d Separating instruction and data memories has potential advantages but a big
disadvantage
d Memory systems use fetch-store paradigm
d Only two operations available
– Fetch (read)
– Store (write)
Physical Memory
And
Physical Addressing
d Main memory
– Designed to permit arbitrary pattern of references
– Known by the term RAM (Random Access Memory)
d Usually volatile
d Two basic technologies available
– Static RAM
– Dynamic RAM
d Easiest to understand
d Basic elements built from a latch
[Figure: a static RAM cell built from a latch, with data and write enable inputs]
d Advantages
– High speed
– Access circuitry is straightforward
d Disadvantages
– Higher power consumption
– Heat generation
– High cost
d Alternative to SRAM
d Consumes less power
d Analogous to a capacitor (i.e., stores an electrical charge)
d Entropy increases
d Any electronic storage device gradually loses charge
d When left for a long time, a bit in DRAM changes from logical 1 to logical 0
d Discharge time can be less than a second
d Conclusion: although it is inexpensive, DRAM is a horrible memory device!
d Solution: refresh — extra circuitry periodically reads and rewrites each bit before the charge leaks away
d Density
– Refers to memory cells per square area of silicon
– Usually stated as number of bits on standard size chip
– Example: 1 gig chip holds 1 gigabit of memory
– Note: higher density chip generates more heat
d Latency
– Time that elapses between the start of an operation and the completion of the
operation
– May depend on previous operations (see below)
[Figure: the processor connects to a controller, which connects to the physical memory]
d Main point: because all memory requests go through the controller, the interface a
processor “sees” can differ from the underlying hardware organization
d Processor
– Presents request to controller
– Waits for response
d Controller
– Translates request into signals for physical memory chips
– Returns answer to processor as quickly as possible
– Sends signals to reset physical memory for next request
d Goals
– Improve memory performance
– Avoid mismatch between CPU speed and memory speed
d Technique: memory hardware runs at a multiple of the CPU clock rate
d Available for both SRAM and DRAM
d Examples
– Double Data Rate SDRAM (DDR-SDRAM)
– Quad Data Rate SRAM (QDR-SRAM)
Technology Description
DDR-DRAM Double Data Rate Dynamic RAM
DDR-SDRAM Double Data Rate Synchronous Dynamic RAM
FCRAM Fast Cycle RAM
FPM-DRAM Fast Page Mode Dynamic RAM
QDR-DRAM Quad Data Rate Dynamic RAM
QDR-SRAM Quad Data Rate Static RAM
SDRAM Synchronous Dynamic RAM
SSRAM Synchronous Static RAM
ZBT-SRAM Zero Bus Turnaround Static RAM
RDRAM Rambus Dynamic RAM
RLDRAM Reduced Latency Dynamic RAM
[Figure: the processor connects to the controller over a parallel interface; the controller accesses a physical memory organized as 32-bit words numbered word 0, word 1, word 2, and so on]
d Byte addressing
– View of memory presented to processor
– Each byte of memory assigned an address
– Convenient for programmers
– However... the underlying memory uses word addressing
d Memory controller
– Provides translation
– Allows programmers to use byte addresses (convenient)
– Allows physical memory to use word addresses (efficient)
word 4:   bytes 16 17 18 19
word 3:   bytes 12 13 14 15
word 2:   bytes  8  9 10 11
word 1:   bytes  4  5  6  7
word 0:   bytes  0  1  2  3
(each 32-bit word holds four bytes; a byte address is assigned to each byte of each word)
W = B div N (the word that holds byte B)
O = B mod N (the offset of byte B within the word)

d Example
– Find byte B = 11 when N = 4
– B can be found in word 11 div 4 = 2 at offset 11 mod 4 = 3
– In binary, B is 0 ... 0 0 1 0 1 1; the low-order bits give the offset and the remaining bits give the word
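d A C sketch of the computation (illustrative; when N is a power of two, the division and modulo reduce to a shift and a mask):

#include <stdio.h>

#define N 4    /* bytes per word (a power of two) */

int main(void) {
    unsigned int B = 11;              /* byte address */

    unsigned int word   = B / N;      /* word index:  11 / 4   = 2 */
    unsigned int offset = B % N;      /* byte offset: 11 mod 4 = 3 */

    /* hardware avoids division: shift by log2(N), mask the low bits */
    unsigned int word2   = B >> 2;
    unsigned int offset2 = B & 0x3;

    printf("word %u, offset %u (shift/mask gives %u, %u)\n",
           word, offset, word2, offset2);
    return 0;
}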
2^32 = 4,294,967,296 unique addresses
d Known as address space
d Note: word addressing allows larger memory than byte addressing, but is seldom used
because it is difficult to program
d Speeds of data networks and other I/O devices are usually expressed in powers of ten
– Example: a Gigabit Ethernet operates at 10^9 bits per second
d Programmer must accommodate differences between measures for storage and
transmission
char *iptr;
int *iptr;
d Debugging tool
d Gives hex representation of bytes in memory
d Each line of output specifies memory address and bytes starting at that address
struct node {
    int value;
    struct node *next;
};
d Example list has the structure: head → node 1 → node 2 → node 3

Address     Contents Of Memory
0001bde0    00000000 0001bdf8 deadbeef 4420436f
0001bdf0    6d657200 0001be18 000000c0 0001be14
0001be00    00000064 00000000 00000000 00000002
0001be10    00000000 000000c8 0001be00 00000006
[Figure: memory banks — four identical memory modules (Bank 0 through Bank 3) that each handle addresses 0 to 2^k−1; a selector uses the high-order bits of an address to select a bank, and an interface accepts requests]

[Figure: Content Addressable Memory (CAM) — each slot consists of a key plus associated storage]
d Physical memory
– Organized into fixed-size words
– Accessed through a controller
d Controller can use
– Byte addressing when communicating with a processor
– Word addressing when communicating with a physical memory
d To avoid arithmetic, use powers of two for
– Address space size
– Bytes per word
d Acts as an intermediary
d Located between source of requests and source of replies
large data storage
requester
cache
d Small (usually much smaller than storage needed for entire set of items)
d Active (makes decisions about which items to save)
d Transparent (invisible to both requester and data store)
d Automatic (uses sequence of requests; does not receive extra instructions)
d Let Cm denote the cost of accessing the underlying memory and Ch the cost of accessing the cache; for a series of N requests

Cworst = N Cm

Cbest = Cm + (N − 1) Ch

d Dividing by N gives the best-case cost per request:

( Cm + (N − 1) Ch ) / N = Cm / N − Ch / N + Ch
d If we ignore overhead
– In the worst case, the performance of caching is no worse than if the cache were not
present
– In the best case, the cost per request is approximately equal to the cost of accessing
the cache
d Note: for memory caches, parallel hardware means almost no overhead
d Hit ratio
– Percentage of requests satisfied from cache
– Given as value between 0 and 1
d Miss ratio
– Percentage of requests not satisfied from cache
– Equal to 1 minus the hit ratio
d Allows us to assess expected cost
Cost = r Ch + (1 − r) Cm
d Cost is:
Cost = r1 Ch1 + r2 Ch2 + (1 − r1 − r2) Cm
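d A small C helper (invented for illustration; the costs below are made-up values, not measurements) that evaluates the single-level formula:

#include <stdio.h>

/* expected per-request cost: Cost = r*Ch + (1-r)*Cm,
 * where r is the hit ratio */
static double cache_cost(double r, double Ch, double Cm) {
    return r * Ch + (1.0 - r) * Cm;
}

int main(void) {
    double Ch = 2.0;     /* illustrative cache access cost (ns)  */
    double Cm = 100.0;   /* illustrative memory access cost (ns) */
    for (int pct = 50; pct <= 100; pct += 10)
        printf("hit ratio %d%% -> expected cost %.1f ns\n",
               pct, cache_cost(pct / 100.0, Ch, Cm));
    return 0;
}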
d Optimization technique
d Stores items in cache before requests arrive
d Works well if data accessed in related groups
d Examples
– When web page is fetched, web cache can preload images that appear on the page
– When byte of memory is fetched, memory cache can preload succeeding bytes
[Figure: two processors, processor 1 and processor 2, each with its own cache (cache 1 and cache 2), sharing one physical memory]
d Traditional memory cache was separate from both the memory and the processor
d To access traditional memory cache, a processor used pins that connect the processor
chip to the rest of the computer
d Using pins to access external hardware takes much longer than accessing functional
units that are internal to the processor chip
d Advances in technology have made it possible to increase the number of transistors per
chip, which means a processor chip can contain a cache
[Figure: a direct-mapped memory cache — memory is divided into 8-byte blocks, each identified by a tag and a block number; blocks 0 through 3 of each tag group map onto the corresponding cache slots, and each slot holds a tag plus a value]
d Think binary: if all values are powers of two, bits of an address can be used to specify a
tag, block, and offset
[Figure: cache lookup — the address divides into tag, block, and offset fields; a decoder uses the block bits to select only one slot, and a comparison of the slot's tag with the tag bits from the address (combined by a logical and) determines a hit]
[Figure: the processor sends each reference through an MMU, which forwards the request to one of two controllers, each attached to its own physical memory (#1 and #2)]
[Figure: the processor sees a single contiguous memory — addresses 0x00000000 through 0x3FFFFFFF refer to memory 1, and addresses 0x40000000 through 0x7FFFFFFF refer to memory 2]
d Notes
– 0x40000000 is 1 gigabyte or 1073741824 bytes
– For identical modules, these are called memory banks
Address Translation
d Performed by MMU
d Also called address mapping
d For our example
– To determine which physical memory, test if address is 0x40000000 or above
– Both memory modules use addresses 0 through 0x3fffffff
– Subtract 0x40000000 from address when forwarding a request to memory 2
0x40000000 = 1000000000000000000000000000000 (binary)
     to
0x7fffffff = 1111111111111111111111111111111 (binary)
d Addresses above 0x3fffffff are the same as the previous set except for high-order bit
d Hardware uses the high-order bit to select a physical memory module
[Figure: an address space with holes — memory 2 occupies the top of the upper half (just below address N), memory 1 the top of the lower half (just below N/2), and the regions beneath each, including address 0, are holes that are not present]
d Hardware perspective
– Allow multiple memory modules
– Provide homogeneous integration
d Software perspective
– Programmer convenience
– Support for multiprogramming and protection
[Figure: four virtual spaces, each with addresses 0 through M, mapped into a physical memory with addresses 0 through N — virtual space 1 maps at 3N/4, space 2 at N/2, space 3 at N/4, and space 4 at 0]
d Base-bound registers
d Segmentation
d Demand paging
[Figure: base-bound mapping — a virtual space with addresses 0 through M is placed in physical memory at the address held in the base register; the bound register holds the size M, and each reference is checked against it]
d Alternative to base-bound
d Provides fine-granularity mapping
– Divides program into segments (typical segment corresponds to one procedure)
– Maps each segment to physical memory
d Key idea
– Segment is only placed in physical memory when needed
– When segment is no longer needed, OS moves it to disk
d Part of MMU
d Intercepts each memory reference
d If referenced page is present in memory, translate address and perform the operation
d If referenced page not present in memory, generate a page fault (i.e., an error condition)
d Record the details and allow operating system to handle the fault
[Figure: a page table — entry N points to the physical location of page N]

d Page number computed by dividing the virtual address by the number of bytes per page, K

N = ⌊ V / K ⌋
O = V mod K
A = pagetable[N] + O
[Figure: translating a virtual address — the address splits into page number N and offset O; the page table maps N to frame F, and the physical address combines F with O]
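d A C sketch of the computation the MMU performs (the page size and table contents are invented for illustration):

#include <stdio.h>

#define K 4096UL    /* bytes per page (a power of two) */

/* pagetable[N] holds the physical address of the frame
 * backing virtual page N (made-up values) */
static unsigned long pagetable[] = { 0x20000UL, 0x8000UL, 0x14000UL };

static unsigned long translate(unsigned long V) {
    unsigned long N = V / K;    /* page number (V >> 12 in hardware) */
    unsigned long O = V % K;    /* offset within page (V & 0xfff)    */
    return pagetable[N] + O;    /* A = pagetable[N] + O              */
}

int main(void) {
    unsigned long V = 0x1234UL; /* page 1, offset 0x234 */
    printf("virtual 0x%lx -> physical 0x%lx\n", V, translate(V));
    return 0;                   /* prints: virtual 0x1234 -> physical 0x8234 */
}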
[Figure: physical memory holds the operating system and the page tables; the remainder is divided into frame storage]
d Consequence: only part of memory is divided into frames that hold applications
d If pair is in TLB
– Virtual address can be translated without a page table reference
– MMU returns the translation much faster than a page table lookup
d Location of A [ i , j ] given by
location(A) + i×Q + j
d Optimal
for i = 1 to N {
for j = 1 to M {
A [ i, j ] = 0;
}
}
d Nonoptimal
for j = 1 to M {
for i = 1 to N {
A [ i, j ] = 0;
}
}
[Figure: a virtual address extended with an address space ID]
d Demand paging
– The chief technology used in most systems
– Combination of hardware and software
– Uses page tables to map virtual addresses to physical addresses
– High-speed lookup mechanism known as TLB makes demand paging practical
d Caching virtual addresses requires either
– Flushing the cache during context switch
– Using an ID to disambiguate
Input / Output
Concepts And Terminology
[Figure: an external device connects to the processor over digital signals and draws power from a separate power source]
[Figure: interface hardware — a controller in the processor and a controller in the device communicate over the external connection]
d Serial interface
– Single signal wire (also need ground); one bit at a time
– Less complex hardware with lower cost
d Parallel interface
– Many wires; each wire carries one bit at any time
– Width is number of wires
– Complex hardware with higher cost
– Theoretically faster than serial
– Practical limitation: at high data rates, close parallel wires have potential for
interference
d Full-duplex
– Simultaneous, bidirectional transfer
– Example: disk drive supports simultaneous read and write operations
d Half-duplex
– Transfer in one direction at a time
– Interfaces must negotiate access before transmitting
– Example: processor can read or write to a disk, but can only perform one operation
at a time
d Latency
– Measure of the time required to perform a transfer
– Latencies of input and output may differ
d Throughput
– Measure of the amount of data that can be transferred per unit time
– Informally called speed
d Fundamental idea
d Arises from hardware limits on parallelism (pins or wires)
d Allows sharing
d Multiplexor
– Accepts input from many sources
– Sends each item along with an ID
d Demultiplexor
– Receives ID along with transmission
– Uses ID to reassemble items correctly
[Figure: multiplexing hardware feeds a parallel interface 16 bits wide; demultiplexing hardware on the other side reassembles the items]
Buses
And
Bus Architecture
[Figure: a bus connecting the processor and a device]
d Several possibilities
d Can consist of
– A cable with multiple wires
– Traces on a circuit board
d Usually, a bus has sockets into which devices plug
[Figure: a circuit board (device interface) plugs into a socket on the mother board]
d Each device attached to a bus is assigned an address (in practice, there may be a small
set of addresses)
d Bus allows processor to specify
– Address for the device
– Data to transfer
– Control (e.g., to specify input or output)
d We can think of a bus as having a separate group of wires (lines) for each of the above
functions
d Fetch
– Place an address on the address lines
– Use control line to signal fetch operation
– Wait for control line to indicate operation complete
– Extract data item from the data lines
d Store
– Place an address on the address lines and a data item on the data lines
– Use control line to signal store operation
– Wait for control line to indicate operation complete
[Figure: a processor and memories 1 through N attach to a shared bus through bus interfaces]
d Address conflict
– Two devices attempt to respond to a given address
d Unassigned address
– No device responds to a given address
d Bus hardware detects the problems and raises an error condition (sometimes called a
bus error)
d Unix reports bus error to an application that attempts to dereference an invalid pointer
if ( address == 10000 ) {
if ( op == store ) {
if ( data != 0 ) {
turn_on_display;
} else {
turn_off_display;
}
} else { /* handle fetch */
if ( device is on ) {
send value 1 as data;
} else {
send value 0 as data;
}
}
}
Asymmetry
[Figure: a processor, memory 1, memory 2, device 1, and device 2 all attach to a single bus]
d Example includes
– Two memories of 1 megabyte each
– Two devices that use 12 bytes of address space
[Figure: example bus address space —
0xffff   available for devices (device 1 and device 2)
0xdfff   top of a hole (not available)
0xbfff   available for memory (memory 2)
0x7fff   top of a hole (not available)
0x3fff   available for memory (memory 1)
0x0000   bottom of the address space]
d In a typical system
– A device only requires a few bytes of address space
– Designers leave room for many devices
d Consequence: address space available for devices is sparsely populated
d Software such as an OS that has access to the bus address space can fetch or store to a
device
d Example code
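d A hedged C sketch (the address 10000 matches the example device above; the volatile-pointer access pattern is standard, but the device itself is imaginary):

/* treat bus address 10000 as the device's register; 'volatile'
 * prevents the compiler from optimizing the fetch and store away */
#define DEVICE_ADDR 10000UL

static volatile int *device = (volatile int *)DEVICE_ADDR;

void display_on(void)     { *device = 1; }    /* store nonzero: turn on */
void display_off(void)    { *device = 0; }    /* store zero: turn off   */
int  display_status(void) { return *device; } /* fetch: 1 if device on  */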
d Hardware mechanism
d Used to connect two buses
[Figure: a bridge connects bus 1 and bus 2]

[Figure: the bridge supplies a mapping — part of the main bus address space that is available for devices maps onto the address space of the auxiliary bus; regions available for memory remain, and part of the auxiliary space is not mapped]
d Alternative to bus
d Connects multiple devices
d Sender supplies data and destination device
d Fabric delivers data to specified destination
[Figure: a switching fabric accepting inputs 1 through N and delivering data to multiple outputs]
Programmed And
Interrupt-driven I / O
d Programmed I/O
– A terrible name
– Also called polled I/O
d Interrupt-driven I/O
– Another poor naming choice
– Software actually drives I/O
d Each device defines a set of addresses and meanings for fetch and store operations
d An interface for our imaginary printer
Addresses Operation Meaning
d Set of addresses a device defines are known as its Control and Status Registers (CSRs)
d CSRs are used to transfer data and control the device
d The hardware designer chooses whether a given CSR responds to
– A fetch operation
– A store operation
– Both
d In many cases, individual CSR bits are assigned meanings
d In C, a struct can be used to define CSRs
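d A sketch of such a struct (the device, register layout, and bit meanings below are invented for illustration):

#include <stdint.h>

/* each 32-bit member corresponds to one CSR, in the order the
 * hardware defines; 'volatile' forces every access to reach the bus */
struct csr {
    volatile uint32_t control;    /* write: start or stop the device */
    volatile uint32_t status;     /* read: current device status     */
    volatile uint32_t data;       /* read/write: transfer one item   */
};

#define CSR_BASE 0x10000UL        /* hypothetical bus address of the CSRs */
#define DEV ((struct csr *)CSR_BASE)

/* example: start the device, then poll until bit 0 of the
 * status register (assumed to mean "ready") becomes 1 */
static inline void dev_start(void) {
    DEV->control = 1;
    while ((DEV->status & 1) == 0)
        ;                         /* programmed I/O: busy-wait */
}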
d Processor hardware
– Saves current instruction pointer
– Jumps to code for the interrupt
– Resumes executing the application when the code executes a return from interrupt
[Figure: interrupt vectors in memory — vector entries 0 through 3 each point to the handler for the corresponding device (device 0 through device 3)]
d Widely used
d Works well for high-speed I/O and streaming
d Requires smart device that can move data across the bus to / from memory without
using processor
d Example: Wi-Fi network interface can read an entire packet and place the packet in a
specified buffer in memory
d Basic idea
– CPU tells device location of buffer
– Device fills buffer and then interrupts
[Figure: the processor passes a buffer address to the device; the example shows a sequence of operations R 17, W 29, R 61]
A Programmer’s View
Of I / O
And Buffering
d Piece of software
d Responsible for communicating with specific device
d Usually part of operating system
d Performs basic functions
– Initializes the device
– Manipulates device's CSRs to start operations when I/O is needed
– Handles interrupts from device
d Lower half
– Handler code that is invoked when the device interrupts
– Communicates directly with device (e.g., to reset hardware)
d Upper half
– Set of functions that are invoked by applications
– Allows application to request I/O operations
d Shared variables
– Used by both halves to coordinate
– Contains input and output buffers
[Figure: device driver organization — application programs invoke the upper half; the upper half and the lower half communicate through shared variables; interrupts from the device hardware invoke the lower half]
d Character-oriented
– Transfer one byte at a time
– Examples
* Keyboard
* Mouse
d Block-oriented
– Transfer block of data at a time
– Examples
* Disk
* Network interface
Example Flow In A Network Device Driver

Steps Taken
1. The application sends data over the Internet
2. Protocol software passes a packet to the driver
3. The driver stores the outgoing packet in the shared variables
4. The upper half specifies the packet location and starts the device
5. The upper half returns to the protocol module
6. The protocol software returns to the application
7. The device interrupts and the lower half of the driver executes
8. The lower half removes the copy of the packet from the variables

[Figure: the path of an outgoing packet from the application through the protocol software, the driver's upper half, the shared variables, and the driver's lower half to the external device hardware]
upper half
request queue in
shared variables
data area
lower half
d Needed because interrupts occur asynchronously and multiple applications can attempt
I/O on a given device at the same time
d Guarantees only one operation will be performed at any time
d Device drivers handle mutual exclusion
[Figure: an application calls the run-time library, which invokes the device driver, which operates the device hardware]
d Two principles
– Making a system call is much more expensive than making a conventional function call
– To reduce the number of system calls, transfer more data per call
d Important optimization
d Widely used
d Usually automated and invisible to programmer
d Key idea: make large I/O transfers to driver
– Accumulate large block of outgoing data before transfer
– Transfer large block of incoming data and then extract individual items
Operation Meaning
setup Initialize input and/or output buffers
input Perform an input operation
output Perform an output operation
terminate Discontinue use of the buffers
flush Force contents of output buffer to be written
d Device driver in the operating system may also perform buffering to reduce number of
transfers between the processor and the device
d Setup function
– Called to initialize buffer
– May allocate buffer
– Typical buffer sizes 8K to 128K bytes
d Output function
– Called when application needs to emit data
– Places data item in buffer
– Only writes to I/O device when buffer is full
d Terminate function
– Called when all data has been emitted
– Forces remaining data in buffer to be written
Setup(N)
1. Allocate a buffer of N bytes.
2. Create a global pointer, p, and initialize p to the address of the first byte of
the buffer.
Output(D)
1. Place data byte D in the buffer at the position given by pointer p, and move
p to the next byte.
2. If the buffer is full, make a system call to write the contents of the entire
buffer, and reset pointer p to the start of the buffer.
Terminate
1. If the buffer is not empty, make a system call to write the contents of the
buffer prior to pointer p.
2. If the buffer was dynamically allocated, deallocate it.
Flush
1. If the buffer is currently empty, return to the caller without taking any action.
2. If the buffer is not currently empty, make a system call to write the contents
of the buffer and set the global pointer p to the address of the first byte of the
buffer.
Terminate
1. Call flush to ensure that any remaining data is written.
2. Deallocate the buffer.
Setup(N)
1. Allocate a buffer of N bytes.
2. Create a global pointer, p, and initialize p to indicate that the buffer is
empty.
Input(N)
1. If the buffer is empty, make a system call to fill the entire buffer, and set
pointer p to the start of the buffer.
2. Extract a byte, D, from the position in the buffer given by pointer p, move p
to the next byte, and return D to the caller.
Terminate
1. If the buffer was dynamically allocated, deallocate it.
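d The algorithms above translate almost directly into C; the sketch below (illustrative — it uses the POSIX write() call as the expensive system call and omits error handling) implements buffered output:

#include <stdlib.h>
#include <unistd.h>

static char  *buf;       /* the output buffer          */
static char  *p;         /* next free byte in buf      */
static size_t bufsize;   /* N, the size of the buffer  */
static int    fd = 1;    /* output descriptor (stdout) */

void setup(size_t n) {             /* allocate buffer, reset pointer */
    buf = malloc(n);
    bufsize = n;
    p = buf;
}

void flush_buf(void) {             /* write accumulated bytes, if any */
    if (p > buf) {
        write(fd, buf, (size_t)(p - buf));
        p = buf;
    }
}

void output(char d) {              /* place one byte in the buffer */
    *p++ = d;
    if ((size_t)(p - buf) == bufsize)
        flush_buf();               /* buffer full: one system call */
}

void terminate_buf(void) {         /* flush remaining data, free the buffer */
    flush_buf();
    free(buf);
}

int main(void) {
    setup(8192);
    for (int i = 0; i < 100000; i++)
        output('x');               /* 100000 bytes in about 13 write() calls */
    terminate_buf();
    return 0;
}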
d Implementation
– Both input and output buffering are straightforward
– Only a trivial amount of code is needed
d Effectiveness
– Buffer of size N reduces number of system calls by a factor of N
– Example: when buffering character (byte) output, a buffer of only 8K bytes reduces
system calls by a factor of 8192
d Buffering
– Fundamental technique used to enhance performance
– Useful with both input and output
d Buffer of size N reduces system calls by a factor of N
Parallelism
d Microscopic
– Parallel hardware in an ALU
– Parallel data transfer to/from physical memory or an I/O bus
d Macroscopic
– Multiple identical processors, such as a multicore CPU (known as symmetric)
– Multiple dissimilar processors, such as a CPU and GPU (known as asymmetric)
d Fine-grain parallelism
– Parallelism among individual instructions (e.g., two addition operations occur at the
same time)
d Coarse-grain parallelism
– Parallel execution of programs on multiple cores
d Explicit parallelism
– Visible to programmer
– Requires programmer to initiate and control parallel activities
d Implicit parallelism
– Hidden from programmer
– Hardware runs multiple copies of application code or instructions automatically
Name Meaning
SISD Single Instruction stream Single Data stream
SIMD Single Instruction stream Multiple Data streams
MISD Multiple Instruction streams Single Data stream
MIMD Multiple Instruction streams Multiple Data streams
d On a conventional computer
for i from 1 to N {
V[i] ← V[i] × Q;
}
d On a vector processor
V ← V × Q;
d Symmetric
d Asymmetric
[Figure: a symmetric multiprocessor — processors P1 through PN share main memory (various modules) and the devices]
d Major problem with SMP architecture: contention for memory and I/O devices
d To improve performance: provide each processor with its own copy of a device
d Old idea
d Pioneered in mainframe computers of 1960s
d Examples
– Channel (IBM mainframe)
– Peripheral Processor (CDC mainframe)
d Making a comeback — now used in large systems
d Needed
– Among processors
– Between processors and I/O devices
– Across networks
d As number of processors increases, communication becomes a bottleneck
“Building multiprocessor systems that scale while correctly synchronising the use of
shared resources is very tricky, whence the principle: with careful design and attention
to detail, an N-processor system can be made to perform nearly as well as a single-
processor system. (Not nearly N times better, nearly as good in total performance as
you were getting from a single processor). You have to be very good — and have the
right problem with the right decomposability — to do better than this.”
d Speedup on a multiprocessor is defined as

Speedup = τ1 / τN

d Where
– τN denotes the execution time on a multiprocessor
– τ1 denotes the execution time on a single processor
d Ideal: speedup that is linear in number of processors
[Figure: speedup as a function of the number of processors (N), plotted for N up to 16 and for N up to 32 — the ideal curve is linear, while the typical curve grows more slowly and flattens as N increases]
x = x + 1;

d Typical code generated for the statement

load x, R5
incr R5
store R5, x

d The same code protected by a lock

lock 17
load x, R5
incr R5
store R5, x
release 17
d Hardware allows one processor (core) to hold a given lock at a given time, and blocks
others
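d The same idea in portable C (a sketch using POSIX threads and a C11 atomic flag as a spinlock in place of hardware lock 17):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int x = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                        /* spin until the lock is free */
        x = x + 1;                   /* load x, incr, store x       */
        atomic_flag_clear(&lock);    /* release                     */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);           /* always 200000 with the lock held */
    return 0;
}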
d Implicit parallelism
– Programmer writes sequential code
– Hardware runs many copies automatically
d Explicit parallelism
– Programmer writes code for parallel architecture
– Code must use locks to prevent interference
d Conclusion: explicit parallelism makes computers extremely difficult to program
d Parallelism is fundamental
d Flynn scheme classifies computers as
– SISD (e.g., conventional uniprocessor)
– SIMD (e.g., vector computer)
– MIMD (e.g., multiprocessor)
d Multiprocessors can be
– Symmetric or asymmetric
– Explicitly or implicitly parallel
d Multiprocessor speedup usually less than linear
Data Pipelining
[Figure: a data pipeline — information arrives at one end, passes through a series of stages, and leaves at the other]
d Uniprocessor
– Each stage is a process or thread
d Multiprocessor
– Each stage executes on separate processor or core
– Hardware assist can speed interstage data transfer
[Figure: (a) a processor that takes input from one network and has multiple outputs; (b) the processing loop it runs:]

do forever {
    Wait to receive packet
    Verify integrity
    Check for loops
    Select a path
    Prepare for transmission
    Enqueue packet for output
}
d Bad news: if it uses processors of the same speed as a nonpipeline architecture, a data
pipeline will not improve the overall time needed to process a given data item
d Good news: by overlapping computation on multiple items, a pipeline increases
throughput
d Assume
– The task is packet processing
– Processing a packet requires exactly 500 instructions
– A processor executes 10 instructions per µsec
time = 500 instructions / 10 instr. per µsec = 50 µsec

Tnp = 1 packet / 50 µsec = 1 packet × 10^6 / 50 sec = 20,000 packets per second
d Suppose the problem can be divided into four stages and that the stages require
– 50 instructions
– 100 instructions
– 200 instructions
– 150 instructions
d Important principle: the throughput of a data pipeline is limited by the slowest stage
d Overall throughput: the slowest stage needs 200 instructions (20 µsec) per packet, so the pipeline handles 1 packet / 20 µsec = 50,000 packets per second
d Term refers to computer systems in which the primary focus is data pipelining
d Most often used for special-purpose systems
d Data pipeline usually organized around functions
d Less relevant to general-purpose computers
[Figure: functions f( ), g( ), and h( ) arranged (a) as conventional calls and (b) as successive pipeline stages]
d Setup time
– Refers to time required to start the pipeline initially
d Stall time
– Refers to time required to restart the pipeline after a stage blocks to wait for a
previous stage
d Flush time
– Refers to time that elapses between the cessation of input and the final data item
emerging from the pipeline (i.e., the time required to shut down the pipeline)
d Pipelining
– Broad, fundamental concept
– Can be used with hardware or software
– Applies to instructions or data
– Can be synchronous or asynchronous
– Can be buffered or unbuffered
d Pipeline performance
– Unless faster processors are used, data pipelining does not decrease the overall time
required to process a single data item
– Using a pipeline does increase the overall throughput (items processed per second)
– The stage of a pipeline that requires the most time to process an item limits the
throughput of the pipeline
– Kathryn McKinley
Microsoft, 2013
Power
E = ∫ P(t) dt, integrated from t = t0 to t = t1
d Power
– Associated with data centers
– Question: can supplier deliver the megawatts (or gigawatts) required?
d Energy
– Associated with portable systems
– Question: how long will the battery last?
Ed = (1/2) C Vdd^2
d Observe
– Energy is consumed every time a gate changes
– Many parts of circuit run on a clock
– When clock pulses, the inputs to some gates change
d Consequences
– Energy is consumed when a clock runs, even if the circuit is not otherwise active
– The rate of the clock determines the rate at which a gate uses energy
d Clock changes state twice per cycle, so the power used in one period is

    Pavg = C Vdd² / Tclock

d Because the clock frequency is the reciprocal of the period

    Fclock = 1 / Tclock

  the average power can also be written Pavg = C Vdd² Fclock
d Some systems have the ability to shut down part of a circuit (e.g., shut down some of
the cores in a multicore processor)
d If we let α denote the fraction of the circuit in use, 0 ≤ α ≤ 1, the average power is

    Pavg = α C Vdd² Fclock
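d A minimal C sketch of the formula (the capacitance, voltage, and frequency values are
illustrative assumptions, not from the text)

    #include <stdio.h>

    /* Average dynamic power: P = alpha * C * Vdd^2 * F
     * alpha: fraction of the circuit switching (0..1)
     * c:     switched capacitance in farads
     * vdd:   supply voltage in volts
     * f:     clock frequency in hertz                  */
    double dynamic_power(double alpha, double c, double vdd, double f) {
        return alpha * c * vdd * vdd * f;
    }

    int main(void) {
        /* Illustrative values: 1 nF switched, 1.0 V supply, 3 GHz clock. */
        double full = dynamic_power(1.0, 1e-9, 1.0, 3e9);  /* all gates active */
        double half = dynamic_power(0.5, 1e-9, 1.0, 3e9);  /* half gated off   */
        printf("full: %.1f W, half: %.1f W\n", full, half);  /* 3.0 W, 1.5 W */
        return 0;
    }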
d The power wall:

    PowerWall = 100 watts / cm²
d Power gating
– Refers to cutting power to some parts of a circuit
– Achieved with special, low-leakage power transistors
d Clock gating
– Refers to stopping the clock (setting the frequency to zero)
– Requires software to save state and restore it when restarting the system
d Usually employs a timeout mechanism: if the circuit has been idle for time T, enter a
sleep mode
d For user-visible actions, allow the user to specify the timeout
d For other actions, compute a break-even point
[Figure: a two-state diagram with states RUN and OFF; the transitions between them take
Tshutdown and Twakeup]
d The energy used while running for time t or sleeping for time t is
Erun = Prun × t
Esleep = Es + Ew + Poff ( t − Tshutdown − Twakeup )
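d A hedged C sketch of the break-even computation, assuming Es and Ew denote the energy
consumed while shutting down and waking up, and solving Erun = Esleep for t (all sample
values are illustrative)

    #include <stdio.h>

    /* Smallest idle time t for which sleeping saves energy, i.e.,
     * E_sleep(t) <= E_run(t).  Solving
     *     P_run * t = Es + Ew + P_off * (t - Tsd - Twu)
     * for t gives the break-even point; requires p_run > p_off. */
    double break_even(double p_run, double p_off,
                      double es, double ew,
                      double t_sd, double t_wu) {
        return (es + ew - p_off * (t_sd + t_wu)) / (p_run - p_off);
    }

    int main(void) {
        /* Illustrative values (watts, joules, seconds). */
        double t = break_even(2.0 /*P_run*/, 0.1 /*P_off*/,
                              0.5 /*Es*/, 0.5 /*Ew*/,
                              0.2 /*Tsd*/, 0.3 /*Twu*/);
        printf("sleep pays off for idle periods over %.2f sec\n", t);  /* 0.50 */
        return 0;
    }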
Assessing Performance
d Can we ignore the data and focus on measuring the performance of various groups of
instructions?
d One possible measure is the average (i.e., mean) execution time of all the instructions
available on a computer
d Problems
– Even two closely-related instructions do not take exactly the same time
– A given program may use some instructions more than others
d Assume
– Addition or subtraction takes Q nanoseconds
– Multiplication or division takes 2Q nanoseconds
d The average cost of a floating point instruction is
Q
3 + Q + 2Q + 2Q
33333333333333333333
Tavg = = 1.5 Q ns per instr.
4
d Note that addition or subtraction takes 33% less than the average, and multiplication or
division takes 33% more
d A typical program will not have equal numbers of add, subtract, multiply and divide
operations
d Suppose, for example, that one sixth of the operations are multiplications or divisions;
then the total time for N operations is

    Ttotal = 2 × Q × ( N / 6 ) + Q × ( 5N / 6 )

d Or, dividing by N

    Tavg′ = 1.16 Q ns per instruction

d Note: the weighted average given here is about 23% less than the uniform average
obtained above
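d A quick C check of the arithmetic (the one-sixth multiply/divide mix is an assumption
chosen to match the 1.16 Q result)

    #include <stdio.h>

    int main(void) {
        double q = 1.0;   /* Q: cost of an add or subtract, in ns */

        /* Uniform average over the four instruction types. */
        double uniform = (q + q + 2*q + 2*q) / 4;        /* 1.50 Q */

        /* Weighted average, assuming one sixth of the operations
         * are multiplications or divisions at cost 2Q each.      */
        double weighted = (5.0/6)*q + (1.0/6)*(2*q);     /* 7Q/6, i.e. 1.16 Q */

        printf("uniform:  %.4f Q ns per instruction\n", uniform);
        printf("weighted: %.4f Q ns per instruction\n", weighted);
        printf("difference: %.0f%%\n", 100*(uniform - weighted)/uniform);
        return 0;
    }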
d SPEC cint2006
– Used to measure integer performance
d SPEC cfp2006
– Used to measure floating point performance
d Result of measuring performance on a specific architecture is known as the computer’s
SPECmark
d Amdahl's Law: the performance improvement that can be realized from faster hardware
technology is limited to the fraction of time the faster technology can be used
    Speedupoverall = 1 / ( ( 1 − Fractionenhanced ) + Fractionenhanced / Speedupenhanced )
d Notes
– Speedupoverall is the overall speedup achieved
– Fractionenhanced is the fraction of time the enhanced hardware runs
– Speedupenhanced is the speedup the enhanced hardware gives
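d A minimal C sketch of the formula (the 40% / 10× example values are illustrative)

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when an enhancement that is
     * `speedup` times faster can be used `fraction` of the time. */
    double amdahl(double fraction, double speedup) {
        return 1.0 / ((1.0 - fraction) + fraction / speedup);
    }

    int main(void) {
        /* Illustrative: a 10x-faster unit usable 40% of the time
         * yields only about a 1.56x overall speedup.             */
        printf("overall speedup: %.2f\n", amdahl(0.40, 10.0));
        return 0;
    }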
Architecture Examples
And Hierarchy
Level        Description
----------------------------------------------------------------
System       A complete computer with processor(s), memory, and
             I/O devices. A typical system architecture describes
             the interconnection of components with buses.
d Functional units
– Processor
– Memory
– I/O interfaces
d Interconnections
– High-speed buses for high-speed devices and functional units
– Low-speed buses for lower-speed devices
d Consider a PC
d Assume
– Processor uses Peripheral Component Interconnect bus (PCI)
– Some I/O devices use older Industry Standard Architecture (ISA)
d The two buses are incompatible (cannot be directly connected)
d Solution: use two buses connected by a bridge
[Figure: a memory and a bridge attach to the PCI bus; the bridge connects the PCI bus to
an ISA bus with additional devices]
[Figure: example PC organization. A CISC CPU (x86) connects to a Northbridge controller,
which links an AGP port, stream communication, and dual-ported DDR SDRAM memory; a
Southbridge chip connects USB, PCI, 6-channel audio, a LAN interface, and an ISA bus]
d Rates increase over time, so look at relative speeds, not absolute numbers in the
following examples
d The FCC’s definition of broadband network speed has been included as a point of
comparison
[Figure: a network processor with a host interface; an SRAM bus connects SRAM, a DRAM
bus connects DRAM, and packets enter and leave through a network interface]
d SRAM
– Highest speed
– Typically used for instructions
– May be used to hold packet headers
d DRAM
– Lower speed
– Typically used to hold packets
d Designer decides which data items to place in each memory
[Figure: network processor organization. An embedded RISC processor (XScale) and
Microengines 1 through N connect over multiple, independent internal buses to a PCI bus
access unit, SRAM access, DRAM access, onboard scratch memory, a serial line, and a
media access unit]
[Figure: microengine memory access detail. Data and address paths cross an AMBA bus; a
command decoder and an address generator feed microengine command queues and
microengine data paths for the Microengine]
[Figure: an embedded system-on-chip built around a MIPS-32 processor with EJTAG,
instruction and data caches, and a bus unit; an SRAM bus with SRAM controller connects
a DMA controller, Ethernet MAC, LCD controller, USB host and device controllers, two
RTCs, an interrupt controller, power management, GPIO, two SSIs, and I2S]
[Figure: a network processor containing a policy engine, a metering engine, and six nP
cores with onboard memory. Packets pass from an ingress physical MAC multiplexor
through input, MAC classify, MPLS classify, Access Control, CAR, MLPPP, and WRED
stages, each with its own memory, to output and an egress physical MAC multiplexor]
[Figure: a pipeline of processors labeled H0 through H4, S, and D0 through D6]
[Figure: a network processor built around an embedded PowerPC and sixteen
programmable protocol processors (picoengines). Interrupt, debug, and exception
hardware, hardware registers, and an internal bus controller connect to a PCI bus; ingress
and egress data stores and interfaces, instruction memory, a classifier assist, a frame
dispatch unit, and a bus arbiter complete the chip]
[Figure: a packet arrives at a classification stage (pattern processor), passes to a
forwarding stage (traffic manager and packet modifier), and leaves; a state engine
maintains statistics and handles host communication. The chip contains 200 processors]
[Figure: IXP2xxx† chip organization. SRAM buses provide SRAM access; an embedded
RISC processor (XScale), a PCI access unit, a coprocessor, scratch memory, and
Microengines 1 through N connect over multiple, independent internal buses; a Slowport
provides FLASH access and a serial line; DRAM access units connect to the DRAM bus;
the MSF provides high-speed I/O over receive and transmit buses]
†Formerly Intel
Example Of Complexity (PCI Access Unit)
[Figure: internal structure of the PCI bus access unit, which connects to the PCI bus. A
core interface block contains host functions, initiator address, read, and write FIFOs,
PCI configuration logic, and target read, write, and address FIFOs. The master side
contains a PCI address buffer, CSRs, DMA read/write buffers, and DMA DRAM, DMA
SRAM, and direct interfaces, with DRAM data, SRAM data, and address interfaces plus a
master interface. The slave side contains write address registers, a direct buffer, slave
interfaces, and command bus and bus master logic]
Hardware Modularity:
Boards And Replication
d For software
– Easy
– Just build parameterized functions
d For hardware
– Difficult
– Must replicate hardware units
d Desiderata
– Build series of products
– Include a range of sizes
– Avoid designing each from scratch
d Solution
– Design a basic building block
– Replicate the block as needed
– Arrange to activate pieces as needed
d Lab
– Large set of backend computers
– Students create and download an operating system
– Student OS runs and interacts over a console line
d However
– Student OS can wedge the backend computer
– Must power-cycle backend to regain control
[Figure: a Rebooter Hardware Unit that receives an N-bit binary input value]
d Think binary
– Assume an 8-bit binary input (up to 256 backends)
– Low-order 4 bits of binary input used to select one of 16 devices
– High-order 4 bits of binary input used to select a module
d Each module given a unique ID between 0 and 15
d A given module only responds if high-order bits of input match its ID
d Design allows the same binary input to be passed to all modules in parallel
[Figure: the same 8-bit input is broadcast to a row of modules; other modules can be
added. Bits are numbered 7 through 0; the input value 5 appears in binary as
0 0 0 0 0 1 0 1]
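d A small C sketch of the selection logic each module implements (function names are
illustrative)

    #include <stdint.h>
    #include <stdbool.h>

    /* Each module is wired with a unique 4-bit ID (0 to 15).  The same
     * 8-bit input is broadcast to all modules in parallel; a module
     * reacts only if the high-order nibble matches its ID, and then
     * uses the low-order nibble to select one of its 16 devices.      */
    bool module_selected(uint8_t input, uint8_t module_id) {
        return (input >> 4) == (module_id & 0x0F);
    }

    uint8_t selected_device(uint8_t input) {
        return input & 0x0F;   /* low-order 4 bits select the device */
    }

d For input value 5 (binary 00000101), the high-order nibble is 0, so module 0 responds
and power-cycles its device 5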
Semester Wrap-Up
d I/O devices attach to a bus, and all I/O is performed using fetch and store operations on
the bus
d A device can be polled or can use interrupts
d Device driver software (in the OS) is divided into
– Upper-half functions that applications call when they read or write data
– Lower-half functions that are invoked when an interrupt occurs
d Sophisticated devices use DMA to transfer data between the device and memory
without requiring the CPU to take action
d Buffering can improve I/O performance dramatically
d To achieve modularity, a hardware designer creates a basic building block and then
replicates the block; each copy is configured to respond to a subset of the inputs
d Parallel architectures (e.g., multicore processors, clusters)
– Are difficult to program (e.g., the programmer may need to use locks)
– Often have contention for shared memory and devices
– Have not delivered the performance they promise
d The insight that dividing computation into a data pipeline can improve throughput, even
if each stage of a pipeline runs at the same speed as the original processor
d An understanding that two cores running at lower voltage and half the clock rate can
consume substantially less power than a single core
d Familiarity with assembly language
Note: you may not enjoy programming in assembly language, but it should not be a
mystery and you will be able to use it when necessary
d A sense that you understand what’s going on underneath the software