Computer Architecture
MJ Rutter
[email protected]
Lent 2003
Bibliography
Both are thick (1000 pages and 800 pages respectively), detailed,
and quite technical. Both are pleasantly up-to-date.
1
Contents
History 3
The CPU 9
instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Floating Point 51
Memory 76
technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Memory Management 134
CPU Families 164
Video Hardware 180
Parallel Computers 189
multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
parallel computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Permanent Storage 222
disk drives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
filing systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
tape drives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Practical Programming 269
libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
the pitfalls of F90 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Index 312
2
History
3
History: to 1970
1961 Fortran IV
Pipelined CPU (IBM 7030)
1962 Hard disk drive with flying heads (IBM)
1963 CTSS: Timesharing (multitasking) OS
Virtual memory & paging (Ferranti Atlas)
1964 First BASIC
1967 ASCII (current version)
GE635 / Multics: SMP (General Electric)
1968 Cache in commercial computer (IBM 360/85)
Mouse demonstrated
Reduce: computer algebra
1969 ARPAnet: wide area network
Fully pipelined functional units (CDC 7600)
Out of order execution (IBM 360/91)
4
History: the 1970s
5
History: the 1980s
7
A Summary of History
8
The CPU
9
The Computer
10
Inside the Computer
(Diagram: the CPU and the VDU joined by the bus.)
11
The Heart of the Computer
12
Schematic of Typical RISC CPU
(Diagram: fetch, decode and issue stages feeding two register files: the integer registers, served by +/− and shift, logical and load/store units, and the floating point registers, served by +/−, ∗ and / units and their own load/store unit, all connected to the memory controller.)
13
What the bits do
14
Clock Watching
15
Typical instructions
Integer:
• arithmetic: +,−,∗,/,negate
• logical: and, or, not, xor
• bitwise: shift, rotate
• comparison
• load / store (copy between register and memory)
Floating point:
• arithmetic: +,−,∗,/,√,negate,modulus
• convert to/from integer
• comparison
• load / store (copy between register and memory)
Control:
16
A typical instruction
fadd f4,f5,f6
add the contents of floating point registers 4 and 5,
placing the result in register 6.
Execution sequence:
17
Making it go faster. . .
(Diagram: the stages of a first and a second instruction overlapping in time.)
18
. . . and faster. . .
fadd f4,f5,f6
fmul f6,f7,f4   (dependent: the fmul needs the fadd’s result)

fadd f4,f5,f6
fmul f3,f7,f9   (independent: the two can overlap)
19
. . . and faster
20
Keep it simple
RISC (Reduced Instruction Set Computer) relies on the instructions being very simple – the
above CISC example would certainly be three RISC instructions – and then letting the CPU
overlap them as much as possible.
21
The VLIW EPIC
22
Within a functional unit
23
Breaking Dependencies
do i=1,n
sum=sum+a(i)
enddo
s1=0 ; s2=0 ; s3=0   ! partial sums must start at zero
do i=1,n,3           ! assumes mod(n,3)==0
  s1=s1+a(i)
  s2=s2+a(i+1)
  s3=s3+a(i+2)
enddo
sum=s1+s2+s3
24
A Branch in the Pipe
25
Assembler in More Detail
26
Predictions
(Pipeline diagram: the ldt, addl, cmplt, lda, addt and bne of iteration i passing through the F, D, O, X and R stages; with branch prediction the ldt of iteration i+1 is fetched immediately after the bne, before the branch is resolved.)
Note the stalls in the pipeline caused by data dependencies (shown with red arrows) or by
the need to keep the execution order unchanged. If the instruction fetch unit fetches one
instruction per cycle, stalls will cause a build-up in the number of in-flight instructions.
Eventually the fetcher will pause to allow things to quieten down.
27
Speculation
28
Predication
if (a<0) a=-a;
must be converted to a branch-free form: both possibilities are computed, and one is then selected without branching.
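A minimal sketch of the idea in Fortran 90, whose merge intrinsic behaves like the conditional-move instructions used for predication (this merely stands in for the assembler the slide has in mind):

a = merge(-a, a, a<0)   ! both arms are evaluated; the mask a<0 selects one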
29
OOO!
(Pipeline diagram: the same six instructions as before, now executed out of order so that the stalls largely disappear.)
The EV6 Alpha does OOOE, the EV5 does not, nor does the UltraSPARC III. In this simple
case, the compiler erred in not changing the order itself. However, the compiler was told not
to optimise for this example.
30
Machine Code
The IA32 instruction set, with its variable length, can place double precision floating point
values as data within a single instruction, and must store all bits of its branches.
Not all possible bit sequences will be valid instructions. If the instruction decoder hits an
invalid instruction, it objects. Under UNIX this results in the process receiving a SIGILL:
SIGnal ILLegal instruction.
31
Meaningless Indicators of Performance
As we shall see later, most of these are not worth the paper they are written on.
32
Meaningful Indicators of Performance
33
The Guilty Candidates: Linpack
Linpack 100x100
34
SPEC
Two scores are reported, ‘base’, which permits two optimisation flags to the compiler, and
‘peak’ which allows any number of compiler flags. Changing the code is not permitted.
35
The glimmers of hope
Streams
36
Various Results, SPEC
For each CPU, the best result (i.e. fastest motherboard / compiler
/ clock speed) as of 1/2/03 is given.
Note that the Pentium4, Athlon and Pentium III are the only CPUs to have higher SpecInt
scores than SpecFP.
37
Various Results, Streams and Linpack
The ‘Linpack’ column is for the 2000x2000 Fortran version, whereas the dgesv column is the
same problem using the vendor’s supplied maths library.
The faster P4 uses RAMBUS memory, the slower SDRAM. Similarly the two 21164 machines
have different memory subsystems, but identical processors.
38
Representing Integers
39
Being Positive
Bits number
0000 0
0001 1
0010 2
0011 3
0100 4
... ...
1111 15
40
Being Negative
Sign-magnitude
Offset
One’s complement
Two’s complement
41
Chaos
bits   sign-mag  offset  1’s comp  2’s comp
0111       7       −1        7         7
1000      −0        0       −7        −8
1001      −1        1       −6        −7
1110      −6        6       −1        −2
1111      −7        7       −0        −1
42
Adding up
Again, trivial:
0101 + 1001 = 1110
Otherwise known as
• unsigned: 5 + 9 = 14
• sign-mag: 5 + (−1) = −6
• offset: (−3) + 1 = 6
• 1’s: 5 + (−6) = −1
• 2’s: 5 + (−7) = −2
43
Overflow
0101 + 1100 = 1 0001 → 0001 (the carry is lost)
• unsigned: 5 + 12 = 1
• 1’s: 5 + (−3) = 1
• 2’s: 5 + (−4) = 1
44
Ranges
45
Text
46
Multiplying and Dividing
47
Shifting and Rotating
(Diagram: bits moving left one place in a shift, the top bit wrapping round to the bottom in a rotate.)
A left shift by n positions corresponds to ×2^n.
A right shift by n positions corresponds to dividing by 2^n with the remainder lost.
One can simulate multiplication from a combination of shifts, adds and comparisons.
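As a sketch of that simulation, using the standard Fortran bit intrinsics (the function name is illustrative, and b is assumed non-negative):

function shift_mult(a, b) result(p)
  integer, intent(in) :: a, b
  integer :: p, x, y
  x = a ; y = b ; p = 0
  do while (y /= 0)
     if (iand(y,1) == 1) p = p + x   ! add when this bit of b is set
     x = ishft(x, 1)                 ! x = 2*x
     y = ishft(y, -1)                ! y = y/2, remainder lost
  enddo
end function shift_mult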
48
Logical operations
a b   a and b   a or b   a xor b
0 0      0         0        0
0 1      0         1        1
1 0      0         1        1
1 1      1         1        0
These operations crop up surprisingly frequently. For instance:
xor is used for trivial encryption and for graphics cursors, because (a xor b) xor b ≡ a.
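A minimal demonstration of that identity, using Fortran’s ieor intrinsic (the values are arbitrary):

program xor_demo
  integer :: a, key, c
  a = 42 ; key = 23
  c = ieor(a, key)           ! ‘encrypt’
  print *, c, ieor(c, key)   ! the second value recovers 42
end program xor_demo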
49
Typical functional unit speeds
Those slow integer multiplies are more common than they would seem at first. Consider:
The address of x(i) is the address of x(1) plus 8 ∗ (i − 1). That multiplication is just a
shift. However, y(i,j) is y(1,1) plus 8 ∗ ((i − 1) + (j − 1) ∗ 500). A lurking multiply!
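This assumes declarations along the following lines (the bound of 500 is merely illustrative):

double precision x(500), y(500,500)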
50
Floating Point
51
Floating Point Numbers
±0.XXXX ∗ 10^(±XX)
e.g.
0.1000 ∗ 10^1
or
0.6626 ∗ 10^-33
52
Representable numbers
53
Algebra
a + b = a ⇏ b = 0
(a + b) + c ≠ a + (b + c)
54
Algebra (2)
√(a^2) ≠ |a|
√((0.1000 ∗ 10^-60)^2) = √(0.0000 ∗ 10^-99) = 0
a/b ≠ a × (1/b)
55
Zeros, Underflows, and Denormals
56
Binary fractions
0.2_10 = 0.0011001100110011. . ._2
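A sketch of the consequence in single precision (the digits printed will vary slightly with compiler):

program fifth
  real :: x
  integer :: i
  x = 0.0
  do i = 1, 10
     x = x + 0.2        ! each 0.2 is already slightly wrong
  enddo
  print '(f14.9)', x    ! close to, but not exactly, 2.0
end program fifth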
57
Computers and IEEE
58
IEEE Example
5¾ = 101.11_2 = 0.10111_2 ∗ 2^3
0100 0001 0011 1000 0000 0000 0000 0000
59
Nasty numbers
When reading rubbish bit sequences as doubles, one expects merely one in 2000 to appear as
a NaN.
60
Signals
61
Ranges
                              Single        Double
Bytes                            4             8
Bits, total                     32            64
Bits, exponent                   8            11
Bits, mantissa                  23            52
Largest value               1.7 ∗ 10^38    9 ∗ 10^307
Smallest non-zero           6 ∗ 10^-39    1 ∗ 10^-308
Decimal digits of precision    c.7           c.15
Other representations result in different ranges. For instance, IBM 370 style encoding has a
range of around 10^75 for both single and double precision.
IEEE is less precise about extended double precision formats. Intel uses an 80 bit format with
a 16 bit exponent, whereas many other vendors use a 128 bit format.
62
Rounding
2^-25 ∗ √n
2^-25 ∗ n
63
Backwards and Forwards
∑_{n=1}^{N} 1/n
This is better summed by doing a few hundred terms explicitly, then using a result such as
∑_{n=a}^{b} 1/n ≈ log((b + 0.5)/(a − 0.5)) + ((b + 0.5)^-2 − (a − 0.5)^-2)/24 + O(a^-4)
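A sketch of why the order of summation matters, in single precision (exact values are compiler-dependent):

program harmonic
  real :: fwd, bwd
  integer :: n
  fwd = 0.0 ; bwd = 0.0
  do n = 1, 1000000
     fwd = fwd + 1.0/n   ! forwards: late, tiny terms vanish against the large sum
  enddo
  do n = 1000000, 1, -1
     bwd = bwd + 1.0/n   ! backwards: tiny terms accumulate first
  enddo
  print *, fwd, bwd      ! the backwards sum is the more accurate
end program harmonic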
64
The Quadratic Formula
x = (−b ± √(b^2 − 4ac)) / 2a
The following gives no roots when compiled with a K&R C compiler, and repeated roots with
ANSI C
#include <stdio.h>
int main(){
  float a=30,b=60.01,c=30.01,d;
  d=b*b-4*a*c;
  printf("%18.15f\n",(double)d);
  return 0;
}
65
The Logistic Map
66
Making Life Complex
Addition is simple
67
The Really Complex Problem
68
Hard or Soft?
69
Soft denormals
x=1d-20
y=1d-28
n=1e8
do i=1,n
x=x+y
enddo
70
Dividing slowly
71
Conquering division
a/b = a ∗ 1/b
x_(n+1) = 2 ∗ x_n − b ∗ x_n^2
 n   x_n
 0   0.2
 1   0.16
 2   0.1664
 3   0.16666624
 4   0.1666666666655744
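A sketch of that iteration computing 1/6 in double precision, starting from the 0.2 above:

program recip
  double precision :: x, b
  integer :: n
  b = 6d0
  x = 0.2d0               ! initial guess at 1/b, e.g. from a small table
  do n = 1, 4
     x = 2d0*x - b*x*x    ! each step roughly doubles the number of correct digits
     print *, n, x
  enddo
end program recip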
72
Division in more detail
0.XXXX ∗ 10^XX
the important parts are the exponent and the first digit
of the mantissa:
0.A ∗ 10^X
and the reciprocal is (approximately)
0.B ∗ 10^(1−X)
73
Converging
74
Square roots
75
Memory
• Memory technologies
76
Memory Technologies
Most ‘ROMs’ are some form of EEPROM so they can have their contents upgraded without
physical replacement: also called Flash RAM, as the writing of an EEPROM is sometimes
called flashing.
77
RAM
78
DRAM in Detail
(Diagram: a DRAM chip as a square array of single-bit cells. RAS latches the row address and the selected row of bits is read into a buffer; CAS then selects a single bit of that buffer for output.)
Of course a ‘real’ DRAM chip would contain several tens of millions of bits.
79
DRAM Read Timings
The same address lines are used for both the row and
column access. This halves the number of address lines
needed, and adds the RAS and CAS pins.
Reading a DRAM cell causes a significant drain in the charge on its capacitor, so it needs to
be refreshed before being read again.
80
More Speed!
81
DRAM Timings compared
(Timing diagrams: in the time classic DRAM delivers three data words, EDO DRAM delivers five.)
82
Measuring time
They also have identical cycle times, that is, the time
from the start of one read, to the start of another
unrelated read.
83
SDRAM
DDR-SDRAM is similar, except that data are transferred twice each clock-cycle, thus doubling
the bandwidth, and not touching the latency. It is rated on bandwidth, not clock speed, so
133MHz DDR-SDRAM calls itself 266MHz (sic) or PC2100.
84
RAMBUS
85
Packaging it up
The SIMMs have just 12 address lines, limiting capacity to 2^24 words.
The individual chips on a SIMM used to supply one bit each, so there would be eight on an
8-bit SIMM. Later chips supplied several (e.g. four) bits each, but internally still had one bit
per array of cells. More recent DRAM chips have four or eight bits per cell array.
Modules must be installed in groups sufficient to fill the width of the bus being used: four
30pin SIMMs for a 32 bit bus, two 72pin SIMMs for a 64 bit bus, etc.
DIMMs and RIMMs are capable of identifying themselves to the motherboard using a
protocol called Serial Presence Detect. This reveals capacity, speed and type, and allows the
motherboard to configure itself appropriately.
86
Modules in Pictures
30 pin SIMM 1MB, 70ns, parity. Two of the chips are identical and supply four bits each;
the third supplies one bit to make up the nine.
72 pin SIMM 16MB, 60ns, EDO, 32 bit, four bits from each of eight chips.
168 pin DIMM 256MB, 133MHz, 64 bit. Eight more chips on the other side of the module.
184 pin RIMM 64MB, 16 bit, PC800. Note the metal heat-spreader to assist cooling.
Other forms of memory include 144 pin SO-DIMMs (Small Outline DIMM) used in laptops.
87
Speed Required
88
More Conservative
89
Wider Still and Wider
90
Caches: the Theory
One can then define a cache hit rate, that is, the
number of memory accesses which go to the cache
divided by the total number of memory accesses. This
is usually expressed as a percentage.
91
Caches: the Problem
CPU memory
cache memory
CPU controller
cache
92
The Anomalies
93
The Cache Controller
94
An aside: Hex
95
Converting to/from Hex
0101 1100 0010 1010 1111 0001 1011 0011
 5    C    2    A    F    1    B    3
So
01011100001010101111000110110011_2 = 5C2AF1B3_16 = 1546318259
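A quick way of checking such conversions, using Fortran’s standard Z (hex) and B (binary) edit descriptors:

program hex
  integer :: i
  i = 1546318259
  print '(z8.8)', i     ! prints 5C2AF1B3
  print '(b32.32)', i   ! and the 32 bit binary form
end program hex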
96
Our Computer
(Diagram: the cache as a table of tag and data pairs.)
97
A Disaster
A waste of space
A waste of time
98
Lines
0x23D17
99
Getting better. . .
A waste of space?
A waste of time
Causing trouble
100
A Further Compromise
101
Direct Mapped Caches
(Diagram: a direct mapped cache of lines 0x000 to 0xFFF alongside memory from 0x00000 to 0x40000. Addresses 0x03D10, 0x13D10, 0x23D10 and 0x33D10 all map to cache line 0x3D1; the tag, here 2, records which is currently held.)
102
Success?
0x2 3D1 7
7: byte within line
3D1: line address within cache
2: tag
103
The Consequences of Compromise
E.g. The 64KB region from 0x23840 to 0x3383F would be held in cache lines 0x384 to
0xFFF then 0x000 to 0x383
104
Associativity
(Diagram: the same memory with a 2-way associative cache: address 0x23D10 may now be held in either line 0x3D1 or line 0xBD1.)
105
Victim Caches
Assume a 16K direct mapped cache with 32 byte lines. a(1,1) comes into cache, pulling
a(2-4,1) with it. Then a(1,2) displaces all these, as it must be stored in the same line, its
address modulo 16K being the same. So a(2,1) is not found in cache when it is referenced.
With a single ATE, the cache hit rate jumps from 0% to 75%, the same that a 2-way set
associative cache would have for this algorithm.
106
Policies: Write Back vs Write Through
A partial solution to this problem is to break a line into equal sub-blocks, each with its own
dirty bit. If only one sub-block has been modified, just that sub-block is written back to
memory when the line is discarded. This is useful even for caches which do not allocate on
writes.
108
Policies: LRU vs Random replacement
109
Write Buffers
110
Not All Data are Equal
111
A Hierarchy
Intel tends to make small, fast caches, compared to RISC workstations which tend to have
larger, slower caches. Some machines have tertiary caches too.
112
Cache size
113
Line Size
114
Explicit Prefetching
115
Implicit Prefetching
116
That Code
unsigned char a[32768];    /* the 32KB region the loop walks */
int i,j;
for(j=0;j<10001;j++)
  for(i=0;i<2048;i++)
    a[i]+=a[i+8192]+a[i+16384]+a[i+24576];
Processor      Primary Data Cache    Time (s)
               Size       Assoc
Pentium        8KB          2        12.7 to 33.0
21064A (EV4)   16KB         1         6.0
486DX2         8KB          4         3.5
Pentium MMX    16KB         4         1.1
21164 (EV5)    8KB+96KB    1+3        0.75
Pentium II     16KB         4         0.25
The EV5 has a fast 96KB secondary cache in the CPU as well as the 8KB cache.
The above code can be cached in a 32KB direct mapped cache, 16KB 2 way associative, or
8KB 4 way associative.
117
Big and Free
(Diagram: the CPU containing separate instruction and data caches, each with its own controller, backed by a secondary cache with its controller, and then main memory.)
118
Clock multiplying
119
Limits to Clock Multiplying: Thermal
The dynamic power dissipated scales roughly as CV²f for capacitance C, core voltage V and
clock frequency f. ‘Shrinking’ a processor, by using more advanced fabrication techniques
with smaller features, reduces C and permits V to be reduced. Hence the 5V used for
processor cores in the late 1980s has dropped to about 2V today (2002), and the feature size
on the die has dropped from 1µm to under 0.2µm over the same period.
120
Limits to Clock Multiplying: Cache
Misses
Which is faster, a 133MHz core with the external cache and other external features running
at 66MHz (or 33MHz), or a 150MHz core with the external cache and other external features
running at 60MHz (or 30MHz)? The 12% increase in core speed can be completely offset by
the 9% decrease in the speed of everything else. See the 133MHz and 150MHz Pentiums for
more details.
121
The Relevance of Theory
integer a(*),i,j
j=1
do i=1,n
j=a(j)
enddo
122
Classic caches
(Graph: access time in ns against data set size in KB, for strides of 1, 2, 4 and 16.)
123
Performance Enhancement
(Graph: access time in ns against data set size in KB, for strides of 1, 2, 4, 16 and 32.)
124
But is it correct?
125
Parity: going for a crash
Most parity memory uses one parity bit for every 8 bits
of real data: a 12.5% overhead.
126
ECC to the rescue
127
The Hamming Code
C1 C2 D1 C3 D2 D3 D4 C4 D5 D6 D7 D8
C1 = B3 + B5 + B7 + B9 + B11
C2 = B3 + B6 + B7 + B10 + B11
C3 = B5 + B6 + B7 + B12
C4 = B9 + B10 + B11 + B12
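A sketch of computing those four check bits (the data bits chosen are arbitrary; B-numbering as above, with the check bits at positions 1, 2, 4 and 8):

program hamming
  integer :: b(12), c(4)
  b = 0
  b((/3,5,6,7,9,10,11,12/)) = (/1,0,1,1,0,0,1,0/)   ! D1..D8
  c(1) = mod(b(3)+b(5)+b(7)+b(9)+b(11), 2)
  c(2) = mod(b(3)+b(6)+b(7)+b(10)+b(11), 2)
  c(3) = mod(b(5)+b(6)+b(7)+b(12), 2)
  c(4) = mod(b(9)+b(10)+b(11)+b(12), 2)
  print '(4i2)', c
end program hamming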
128
The Hamming Code (2)
129
Quis custodiet ipsos custodes?
130
Wall to Wall ECC
131
The good, bad and ugly
132
Does it all matter?
133
Memory Management
134
Memory Management
135
Memory, the DOS way
(Memory map: the System at 0K, then command.com, then Program 1, with free space above.)
136
Fragmentation
(Memory map: the System, command.com, a free gap where Program 1 was, Program 2, and more free space above: free memory is now fragmented.)
137
Anarchy
Intentionally or accidentally.
138
What went wrong?
139
Virtual addressing
(Diagram: the contiguous virtual address spaces of programs A and B mapped onto interleaved blocks of physical memory.)
When OS/2, Windows9x or Linux runs two DOS applications simultaneously, each DOS
application sees an address range from 0K to 640K inhabited by just itself and a copy of DOS.
The corresponding physical addresses will be very different.
140
Address translation in more detail
141
Not quite there
142
A Two Tier System
143
A Two-Level Page Table
(Diagram: a virtual address split 10+10+12 passing through the page table directory and page tables to form a physical address.)
A 32 bit virtual address with the top ten bits indexing a single page table directory, and thus
giving the address of a page containing the page table entries relevant for the next ten bits of
the virtual address. This then contains a twenty bit page number to give a 32 bit physical
address when combined with the final twelve bits of the virtual address. The page table will
also contain protection information.
Each process has its own virtual address space, and hence its own page tables, but half a
dozen pages of page table is sufficient for most small programs.
144
Beyond 32 Bits
145
Efficiency
146
TLBs at work
(Graph: access time in ns against data set size in KB, for strides of 1, 2, 4, 16 and 8K.)
147
More paging
The ps command reports not only how much virtual address space a program is using, but
how many of those pages are resident in physical memory.
The union of physical memory and the page area on disk is called virtual memory. Virtual
addressing is a prerequisite for virtual memory, but the terms are not identical.
148
Less paging
149
Swapping
151
Mix and match
152
Alignment
153
‘Free’ memory
154
The Digital UNIX Way
Digital UNIX becomes very unhappy if the disk cache is forced below the lower watermark.
The command ‘free’ (Linux, Digital UNIX (TCM only)) shows the current disk cache size.
155
Segments
156
What We Want
Code
Read only, executable, fixed size
Shared libraries
Read only, shared, executable, fixed size
Static data
Read-write, non-executable, fixed size
Dynamic data
Read-write, non-executable, variable size
Temporary data
Read-write, non-executable, frequently varying size
157
Stacks of Text?
158
What Went Where?
159
Sharing
160
A UNIX Memory Map
(Memory map: a reserved region; data, bss and heap above 0x0000 0001 4000 0000; shared text at 0x0000 03ff 8000 0000.)
161
Another UNIX Memory Map
0xffff ffff
   kernel
0xc000 0000
   stack
   mmap
0x4000 0000
   heap
   bss
   data
   text
0x0804 8000
   reserved
0x0000 0000
This is roughly the layout used by Linux 2.4 on 32 bit machines. Note the shorter addresses
than for Digital UNIX.
The mmap region deals with shared libraries and large objects allocated via malloc, whereas
smaller malloced objects are placed on the heap in the usual fashion. Note too that if one
uses mmap or shared libraries at all, the largest contiguous region is under 2GB.
Note in both cases the area around zero is reserved. This is so that null pointer dereferencing
will fail: ask a C programmer why this is important.
162
Memory Maps in Action
163
CPU Families
164
CPU Families
165
Common Families
166
Inter-family differences: IA32 vs Alpha
                    IA32             Alpha
Integer Regs        8 x 32 bit       32 x 64 bit
F.P. Regs           8 x 80 bit       32 x 64 bit
Memory operands     Yes              No
Has trig functs     Yes              No
Instruction length  1 to c.14 bytes  4 bytes
There are many other differences, such as the Alpha
having integer and FP instructions of the form ‘a op b
→ c’ where a, b and c must be registers. IA32 uses
the form ‘a op b → a’ for integer operations, and a or
b may be references to data stored in memory.
The Alpha uses the naming convention $0 to $31 for its integer registers, and $f0 to $f31
for its floating point registers. The majority of these registers are equivalent. IA32 calls its
integer registers %eax, %ebx, %ecx, %edx, %edi, %esi, %ebp and %esp for odd historical
reasons. For IA32 there are many tedious restrictions concerning which instructions may act
on which registers.
Both have an additional register holding the address of the current instruction: the
instruction pointer. A branch instruction simply modifies this special register. Both also
reserve one register for pointing to the end of the program’s stack: $30 for Alpha, and %esp
for IA32. Alpha has a register, $31, whose value is fixed at zero. IA32 does not.
167
More IA32 vs Alpha
IA32:
.L5:    faddl (%ecx,%eax,8)
        incl %eax
        cmpl %edx,%eax
        jl .L5
.L3:

Alpha:
L$6:    ldt $f1, ($16)
        addl $1, 1, $1
        cmplt $1, $17, $3
        lda $16, 8($16)
        addt $f0, $f1, $f0
        bne $3, L$6
L$5:
Both sides slightly abbreviated, but many differences are clear. Different mnemonics are used
(Float LoaD Zero vs Float CLeaR), and certainly different binary representations. IA32 has a
special instruction to increment (add one to) a register, Alpha does not. IA32 can move data
from memory directly to the FP adder without passing through a register, Alpha cannot. Etc.
168
The IA32 Family in Detail: the Ancestors
8086
Introduced 1978. 16 bit registers and data bus, 20 bit
address bus (1MB), four clock cycles to do simplest
instruction, separate FPU (8087). Integer registers
each usable as a pair of 8 bit registers (e.g. ax is ah
and al). Clock speeds 4 to 10MHz. 29,000 transistors.
80286
Introduced 1982. Address bus increased to 24 bits
(16MB), simple operations complete in three cycles,
separate FPU (80287). Clock speeds 8 to 12MHz.
134,000 transistors.
These things are antique, and firmly CISC. They have
many, many bizarre instructions, not least for dealing
with binary coded decimal (what? never mind. . . )
However, code written for one of these will still run on
a Pentium4 designed over two decades later.
Although the 8086 is the first ancestor as far as binary compatibility is concerned, that is, the
ability to run machine code written for one processor on another, it is not a completely ‘clean’
design, and enjoys a degree of compatibility with the 8080, which is 4 years older.
169
IA32: the Start
80386
Introduced 1985. Registers extended to 32 bits, e.g. the ax
register is now the bottom half of the eax register, and two new
registers added. Data and address bus extended to 32 bits (4GB).
Virtual memory (paging etc.) with 32 entry 4-way associative
TLB, multitasking, device protection. Simple operations complete
in two cycles. Separate FPU (80387). Clock speeds 16 to
33MHz. Different modes of operation in order to keep full 8086
compatibility. This major increase of functionality has almost
everything a modern CPU has, and just 275,000 transistors.
i486
Introduced 1989. Almost no changes in functionality, but core
redesigned. Cache controller with an 8KB 4-way associative
write-through cache, two write buffers, and the FPU, placed on
the main CPU. Pipelined integer core does basic operations in
a single cycle. Bus can transfer a cache line (16 bytes) in five
cycles, compared to two cycles per 4 bytes for the 80386. Clock
speeds 20 to 50MHz. The i486DX2 and i486DX4 versions have
a 2:1 or 3:1 core:bus frequency ratio. Clock speeds 50MHz to
100MHz.
170
The Pentium: the last CISC
Pentium
Introduced 1993. Very few changes in functionality, but many
in implementation. Again a redesigned core, now superscalar for
integer operations, with dual issue possible in some circumstances.
The FPU is pipelined for the first time, and is made much faster.
The cache is split as a 2-way associative 8KB instruction cache,
and similar write-back data cache. Branch prediction and 4MB
pages introduced. The data bus width is increased to 64 bits, and
runs at 60 or 66MHz. The core runs at an integer or half-integer
multiple of this, from 60MHz to 200MHz.
PentiumMMX
171
PentiumPro / Pentium II / Pentium III
173
Marketing gimmicks
174
IA64
175
The Alpha Family
21064 (EV4)
The first Alpha, the 21064 or EV4, was introduced in 1992. It
was a fresh, new design with thirty-two 64 bit integer registers,
thirty-two 64 bit floating-point registers (supporting IEEE and
VAX formats), and a superscalar pipelined RISC architecture.
Supported paging and protection. Separate instruction and data
caches, each 8KB direct mapped write through. Twelve entry
ITLB, and 32 entry DTLB, both fully associative. Address bus
34 bit. Can issue two instructions per clock cycle, at most one
floating point. CPU speeds 100MHz to 266MHz. Data bus
128 bit, 33 to 40MHz.
21164 (EV5)
1995. Added a small, 96KB 3-way associative write-back
secondary cache, ITLB increased to 48 entry, DTLB to 64 entry.
Two integer pipelines, one FP add pipeline, one FP multiply
pipeline. Four way superscalar core, 40 bit address bus. The
21164A (EV5.6) added Alpha’s form of ‘MMX’ for dealing more
efficiently with small integer data. Core speeds 266MHz to
600MHz.
176
21264 (EV6)
177
How Many Bits?
178
Families and Optimisation
179
Video Hardware
180
Video
181
The video signal
Humans can perceive flicker if the vertical refresh rate is below about 65Hz. Between about
65Hz and 72Hz they do not consciously notice the flicker, but are still adversely affected by it.
182
Video hardware
Bandwidths are also high, as the video circuitry needs to read the whole display once for
every scan, so about 75 times per second: for a 1280x1024 display at 3 bytes per pixel that is
nearly 300MB/s.
183
TFT LCDs
184
Acceleration
On PCs, the original VGA and SVGA standards were unaccelerated. Chips such as the S3
and ATI’s Mach32 were early, and incompatible, examples of adding acceleration to the core
SVGA functionality.
185
Acceleration in pictures
(Diagram: before acceleration the CPU moves all the pixel data over the PCI bus itself; after, it sends a single command for the card to carry out.)
When PC graphics cards call themselves ‘64 bit’ or ‘128 bit’ they are usually referring to the
width of the internal bus on the card. When they call themselves 8, 16 or 24 bit they are
referring to the colour depth . . .
If the memory on the video card is more than needed for storing the image, the extra can be
used to store frequently-needed objects which can then be quickly copied to the part which is
being displayed.
186
OpenGL
187
Texture Mapping
A video card with hardware GL support has a processor optimised for the sort of single-precision
floating-point operations involved in 3D graphics. It will be highly RISC, highly pipelined, and
probably superscalar, and may well have better performance than the computer’s main CPU.
The processor on the graphics card is sometimes called the ‘GPU’ (Graphics PU), and needs
memory for storing textures and triangle co-ordinates as well as the image itself.
OpenGL has been adopted by Windows and MacOS, as well as the UNIX world.
A final speed enhancement on modern computers is the use of a dedicated bus for the video
card. This AGP (Accelerated Graphics Port) bus is simpler than PCI (it can only cope with a
single device), faster than PCI, and is not shared with disk drive controllers etc.
188
Parallel Computers
189
Not Parallel: Multitasking
190
Inequality
The load or load average is UNIX’s term for the number of processes in the ‘run’ state averaged
over a short period. The uptime command reports three averages: over 1, 5 and 15 minutes
on most UNIXes, and over 5s, 30s and 1 minute on Tru64.
Under UNIX the nice and renice commands can be used to decrease the scheduling priority
of any process you own. The priority cannot be increased again, unless one is root. (If you
use tcsh or csh as your shell, nice is a shell built-in and is documented in the shell man
page. Otherwise, it is /usr/bin/nice and documented by man nice in the usual way.)
191
Co-operate or be Pre-empted
192
Privilege
In any OS, the kernel should be as small as possible, for bugs in the kernel have the greatest
potential for mischief.
193
Parallel Computers: the Concepts
hence ‘cheap.’
Factorisation of a large number: the independent trial factors from 2 to √n are readily
distributed amongst multiple processors.
A simple example of parallelisation has already been seen in the various ‘multimedia’
instructions. This is known as SIMD parallelism: Single Instruction Multiple Data. The
parallelism discussed in this section is MIMD (Multiple. . . ).
194
Scaling
(Graph: performance against no. of CPUs.)
195
Amdahl’s Law
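In its usual form: if a fraction f of the work is inherently serial, then n CPUs give a speedup of
1/(f + (1 − f)/n)
which tends to 1/f however large n becomes.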
196
Bigger is better
197
SMP: Bus Based
(Diagram: several CPUs sharing a single memory over a common bus.)
198
Two Heads are Better than One?
199
Shared memory
200
Cache coherency
(Diagram: CPUs with private caches sharing one memory.)
201
Snoopy caches
Even single CPU workstations have a lesser version of this problem, as it is common for
the CPU and the disk controller to be able to read and write directly to the main memory.
However, with just two combatants, the problem is fairly easily resolved.
202
MESI solutions
203
Directory Entries vs Broadcasting
204
MPP: Breaking the Memory Bottleneck
(Diagram: nodes, each a CPU with its own cache, joined by an interconnect.)
205
Breaking the Code
206
Topologies
207
16 Nodes. . .
(Diagrams: sixteen nodes connected as a hypercube and as a 2D torus.)
208
Performance
209
Parallelisation Overheads
(n − 1) × (λ + a/(nσ)) ≈ nλ + a/σ
210
Amdahl revisited
211
Programming Example
t=0.0
do i=1,100000000
t=t+a(i)*b(i)
enddo
212
Programming, MPP
t_local=0.0
do i=1,nmax ! nmax approx 100000000/ncpus
t_local=t_local+a(i)*b(i)
enddo
! Condense results
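! e.g., one possible such call (a sketch):
call MPI_Allreduce(t_local,t,1,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,ierr)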
All the variables are local to each node, and only the MPI call causes one (t) to contain the
sum of all the t_local’s and to be set to the same value on all nodes. The programmer must
think in terms of multiple copies of the code running, one per node.
213
The Programming Differences
214
SMP: The Return
215
NUMA / cc-NUMA
216
The Consequences of NUMA
for(i=0;i<10000000;i++)
t+=x[i]*y[i];
217
Modern, small SMPs
(Diagram: CPUs connected to multiple memory banks through a crossbar switch.)
218
Modern, large MPPs
(Diagram: nodes joined by an interconnect.)
219
Multithreading
220
SMT
221
Permanent Storage
222
Disk Drives
223
Physical considerations
Although the heads move only radially, the air is dragged into tangential motion by the
spinning platters, and in this air stream the heads fly.
224
Data Storage and Access Times
(Diagram: a disk drive with track, sector, head and platter labelled.)
This disk has three platters and six heads. In reality the heads are much smaller than shown
above.
A modern (IBM) 36GB disk has 5 glass platters with a total of 10 heads. It records at 13,500
tracks per inch, and 260,000 bits per inch along the track. The raw error rate is about 1 in
10^12 bits, reducing to 1 in 10^14 after automatic correction. The sustained data transfer rate
from the physical disk is 15 to 30MB/s.
225
Floppy drives
226
CD drives
227
Wot no files?
228
File systems: the requirements
229
An example: FAT16
On partitions of under 32MB, the cluster size is 512 bytes, or one block, the smallest possible
size.
230
Chains
cluster  FAT entry
   0         1
   1         2
   2       65535
   3         0
   4       65535
   5         8
   6       65535
   7         0
   8         6
Here we see two free clusters (3 and 7) and three files
occupying clusters 0, 1 and 2, cluster 4, and clusters
5, 8 and 6. Such sequences of clusters in the FAT are
called ‘chains’.
So the FAT has already given us the concept of a file,
but not of a filename.
The metadata in the FAT are so important that DOS
stores the FAT twice at the beginning of a disk.
231
A directory
Every subdirectory contains at least two entries. One, called ‘..’, which describes its parent
directory, and one, called ‘.’, which describes itself.
232
Simple operations
File Deletion
The directory entry has the first byte zeroed, and the
corresponding FAT entries are marked free.
File Creation
An unused directory entry is found and used, and a
FAT chain of at least one block created.
File Renaming
Only the directory entry needs changing.
Appending to a file
The file length in the directory needs modifying, and
possibly a new cluster allocating and the FAT changing,
as well as writing the data.
etc.
233
Consistency
234
The 1.4MB DOS Floppy
Sector 0 contains information such as the size of the FAT (12 or 16 bit), the size of the root
directory, the cluster size, etc.
235
Other FATs
236
The UNIX file system
Again every subdirectory contains explicit entries for ‘.’ and ‘..’ giving its own and its parent’s
inode number.
237
The inode table
File length
File ownership (user and group)
File ‘creation’, modification and last access times
File access permissions
The number of directory entries pointing at this file
A list of the first ten clusters occupied by the file
Three pointers to clusters containing details of further
clusters used
The program fsck checks for consistency. fsck = File System CHeck.
238
Large files
(Diagram: an indirect block listing further clusters, e.g. 65547-65802, of a large file.)
In this example, one would need another level of indirection to support files larger than 64MB.
In practice, the block size is probably 4K, and this scheme will therefore work up to 4GB.
239
File types
The first, the hard link, is not a new file type at all.
One merely has two directory entries pointing at the
same inode. As the inode stores the information about
file length and access times, there are no consistency
problems as there would be with FAT.
One can construct two directory entries pointing at the same chain within FAT, but if the file
is modified using one name, the file length stored under the other directory entry will not be
changed, and a mess results.
240
Symbolic links
tcm30:/usr/sbin> ls -l /usr/sbin/sendmail
lrwxrwxrwx 1 root system 24 Sep 12 1998
sendmail -> /usr/local/exim/bin/exim
241
More speed
242
The ext2 floppy
243
Fragmentation
244
FAT vs HFS vs Ext2
Ext2 does support filesizes of up to 2^64 bytes, but not all implementations manage this. Linux
2.4 does.
245
Partitioning
The downside occurs when each partition has 20MB free and you wish to write a 30MB
file. . .
The partition table usually exists, even if it shows just one partition using the whole disk.
246
Still slow
247
Caching
248
Write collapsing
do i=1,50
move heads to directory
read directory
write out directory with zero in 1st byte
of filename of file deleted
move heads to 1st FAT
read it
write out with relevant chain marked free
heads are now at second copy of FAT
fix that too
done
249
Writes collapsed
do i=1,50
read directory from cache
write out directory to cache with a zero
in 1st byte of filename of file deleted
read FAT from cache
write it back with relevant chain freed
ditto second FAT
done
250
Inconsistencies
Any filesystem which records last access times (such as UFS) will be frequently modifying
data on disk.
UNIX systems, and some versions of Windows, will detect if they have been turned off without
being shutdown properly, and check their disks for consistency when they are next turned on.
If they have been shutdown correctly, they don’t bother.
Though fsck and scandisk can often autorepair a filesystem to a consistent state, it is worth
pointing out that consistency and correctness are different: formatting a disk also reduces its
filesystem to a consistent state, but in a slightly unhelpful manner.
251
Journalling filesystems
Digital UNIX has AdvFS as a journalled filesystem, Irix has xfs, AIX has jfs, Linux has ext3,
and WinNT has NTFS.
252
Journal problems
253
Remote files
On the other hand, it makes it possible to use a machine with no internal disk drive.
254
Remote trouble
255
Multiple filesystems
256
Mounting filesystems
257
Multiple programs
258
Locking
259
Dot locking
260
Quotas
261
Mirrors
262
RAID
Level 1 is mirroring.
Level 0 is very sensitive to failure: one disk fails, and all the data are lost. Level 5, which
uses parity blocks, can be quite slow for writing, as parity blocks will need updating, possibly
requiring additional reads. Rebuilding a level 5 RAID set after a power cut is also very slow.
263
CDs
264
CD-Rs
265
Tapes
266
Tape Technologies
(Diagrams: serpentine and helical scan recording.)
267
Current tapes
Note it takes over 2 hours to read any of the above tapes in full.
268
Practical Programming
269
Programs, Libraries and OSes
(Diagram: programs, libraries, the O/S and the hardware arranged in layers; some call paths are impossible.)
270
Libraries (1)
271
Libraries (2)
272
Compiling and Linking
273
Being dynamic
274
Calling Conventions
275
Vagueness
276
Name mangling
Even plain F77 must do some name mangling: the UNIX linker is case-sensitive, and F77 is
not, so all names must be converted to a consistent case. They usually gain underscores too,
to avoid unexpected name clashes with functions in the system libraries.
277
Optimisation
278
Loops
do i=1,n
x(i)=2*pi*i/k1
y(i)=2*pi*i/k2
enddo
279
Simple and automatic
CSE
do i=1,n
t1=2*pi*i
x(i)=t1/k1
y(i)=t1/k2
enddo
Invariant removal
t2=2*pi
do i=1,n
t1=t2*i
x(i)=t1/k1
y(i)=t1/k2
enddo
280
Division to multiplication
t2=2*pi
t3=1/k1
t4=1/k2
do i=1,n
t1=t2*i
x(i)=t1*t3
y(i)=t1*t4
enddo
after which
t1=2*pi/k1
t2=2*pi/k2
do i=1,n
x(i)=i*t1
y(i)=i*t2
enddo
281
Another example
y=0
do i=1,n
y=y+x(i)*x(i)
enddo
y=0
i=1
1 y=y+x(i)*x(i)
i=i+1
if (i<=n) goto 1
282
Unrolling
y=0
do i=1,n-mod(n,2),2
y=y+x(i)*x(i)+x(i+1)*x(i+1)
enddo
if (mod(n,2)==1) y=y+x(n)*x(n)
This now looks like
y=0
i=1
n2=n-mod(n,2)
1 y=y+x(i)*x(i)+x(i+1)*x(i+1)
i=i+2
if (i<n2) goto 1
if (mod(n,2)==1) y=y+x(n)*x(n)
283
Reduction
t1=0 ; t2=0
do i=1,n-mod(n,2),2
t1=t1+x(i)*x(i)
t2=t2+x(i+1)*x(i+1)
enddo
y=t1+t2
if (mod(n,2)==1) y=y+x(n)*x(n)
284
Prefetching
y=0
do i=1,n
prefetch_to_cache x(i+8)
y=y+x(i)*x(i)
enddo
It is possible to add directives to one’s code to assist a particular compiler to get prefetching
right: something for the desperate only.
285
Loop Elimination
do i=1,3
a(i)=0
enddo
will be transformed to
a(1)=0
a(2)=0
a(3)=0
286
Loop Fusion
do i=1,n
x(i)=i
enddo
do i=1,n
y(i)=i
enddo
transforms trivially to
do i=1,n
x(i)=i
y(i)=i
enddo
287
Strength reduction
double a(2000,2000)
do j=1,n
do i=1,n
a(i,j)=x(i)*y(j)
enddo
enddo
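A sketch of the transformation the compiler makes: the multiply hidden in the address of a(i,j) is replaced by an index that is merely incremented. (The 1-D view a1 of a, and the name k, are illustrative, written out here by hand.)

double precision a1(2000*2000)   ! a(2000,2000) viewed as one long vector
k=0
do j=1,n
  do i=1,n
    a1(k+i)=x(i)*y(j)   ! no multiply: k just grows by the column stride
  enddo
  k=k+2000
enddo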
288
Inlining
function norm(x)
double precision norm,x(3)
norm=x(1)**2+x(2)**2+x(3)**2
end function
...
a=norm(b)
transforms to
a=b(1)**2+b(2)**2+b(3)**2
289
Instruction scheduling and loop
pipelining
Consider a piece of code containing three integer adds and three fp adds, all independent.
Offered in that order to a CPU capable of one integer and one fp instruction per cycle, this
would probably take five cycles to issue. If reordered as 3×(integer add, fp add), it would
take just three cycles.
290
Debugging
291
Loop interchange
The conversion of
do i=1,n
do j=1,n
a(i,j)=0
enddo
enddo
to
do j=1,n
do i=1,n
a(i,j)=0
enddo
enddo
292
Matrix Multiplication
do i=1,n
do j=1,n
t=0.
do k=1,n
t=t+a(i,k)*b(k,j)
enddo
c(i,j)=t
enddo
enddo
293
The problem
The inner loop contains one fp add, one fp multiply, one fp load
with unit stride (b), and one fp load with stride n (a). The arrays
are around 32MB each.
For n=2032, every load for a is a cache and TLB miss for i=j=1.
For j=2, every load for a is a cache hit and a TLB miss: over
2000 TLB entries would be needed to cover the first column just
read. A cache hit because 2032 cache lines are sufficient, and the
cache has 32,768 lines.
For n=2048, the same analysis applies for the TLB. For the cache,
because the stride is 2^14 bytes, the bottom 14 bits of the address,
and hence the bottom 6 of the tag index, are the same for all k.
Thus only 512 different cache lines are being used, and one pass
of the loop would need 2048 if all are to remain in cache, so all
are cache misses.
294
Blocking
do i=1,n,2
do j=1,n
t1=0.
t2=0.
do k=1,n
t1=t1+a(i,k)*b(k,j)
t2=t2+a(i+1,k)*b(k,j)
enddo
c(i,j)=t1
c(i+1,j)=t2
enddo
enddo
295
Loop transformations
296
Laziness
call dgemm(’n’,’n’,n,n,n,1d0,a,n,b,n,0d0,c,n)
Compaq’s own cxml library gave 800MFLOPS. NAG’s BLAS gave just 120MFLOPS.
What was wrong with our 100MFLOPS code? The TLB miss on every cache line load of a
prevents any form of prefetching working for this array.
297
Practical Parallelism
298
The slowdown
(Graph: MPI throughput in MB/s against message size in bytes, with no other process, one other process, and two other processes running.)
No-one who cares about latencies runs MPI with more than one process per processor!
Note that when running four serial processes on a dual processor machine, each will run twice
as slowly as they would if just two had been run. With parallel code, the slowdown could be a
factor of one thousand.
299
The Compilers
may well apply optimisation to nothing (i.e. the source files following -fast). Similarly
will probably use routines from NAG rather than cxml if both contain the same routine.
However,
300
Common compiler options
-lfoo and -L
-lfoo will look first for a shared library called libfoo.so, then a
static library called libfoo.a, using a particular search path. One
can add to the search path (-L${HOME}/lib or -L.) or specify a
library explicitly like an object file, e.g. /temp/libfoo.a.
-c and -S
-g
301
More compiler options
-C
-r8
The rest
Options will exist for tuning for specific processors, warning about
unused variables, reducing (slightly) the accuracy of maths to
increase speed, aligning variables, etc. There is no standard for
these.
IBM’s equivalent of -r8 is -qautodbl=dbl4.
302
Fortran 90
303
Slow arrays
a=b+c
do i=1,n
a(i)=b(i)+c(i)
enddo
304
Big surprises
a=b+c+d
do i=1,n
a(i)=b(i)+c(i)+d(i)
enddo
temp_allocate(t(n))
do i=1,n
t(i)=b(i)+c(i)
enddo
do i=1,n
a(i)=t(i)+d(i)
enddo
This uses much more memory than the F77 form, and
is much slower.
305
Sure surprises
a=matmul(b,matmul(c,d))
will be treated as
temp_allocate(t(n,n))
t=matmul(c,d)
a=matmul(b,t)
Note that even a=matmul(a,b) needs a temporary array. The special case which does not is
a=matmul(b,c).
306
More sure surprises
allocate(a(n,n))
...
call wibble(a(1:m,1:m))
must be translated to
temp_allocate(t(m,m))
do i=1,m
do j=1,m
t(j,i)=a(j,i)
enddo
enddo
call wibble(t)
do i=1,m
do j=1,m
a(j,i)=t(j,i)
enddo
enddo
307
Type trouble
type electron
integer :: spin
real (kind(1d0)), dimension(3) :: x
end type electron
Good if one always wants the spin and position of the electron
together. However, counting the net spin of this array
s=0
do i=1,n
s=s+e(i)%spin
enddo
308
What is temp allocate?
temp_allocate(t(n,n))
t=matmul(a,b)
a=t
temp_deallocate(t)
temp_allocate(t(m,m))
t=matmul(c,d)
c=t
temp_deallocate(t)
309
Precision
complex (kind(1d0)) :: c
real (kind(1d0)) :: a,b,pi
...
pi=3.1415926536
c=cmplx(a,b)
This should read
pi=3.1415926536d0
c=cmplx(a,b,kind(1d0))
for both a constant and the cmplx function default to
single precision.
Some compilers automatically correct the above errors.
Note also that π expressed to full double precision is not the above value: either use
real (kind(1d0)) :: pi
pi=4*atan(1d0)
or
real (kind(1d0)), parameter :: pi=3.141592653589793d0
(The latter has the advantage that one cannot accidentally change the value of π in the
program, the former that it is less likely to be mistyped.)
310
Precision again
real*8 x
real(8) :: y
311
Other languages
http://www.tcm.phy.cam.ac.uk/~mjr/C/
312
-r8, 302
/proc, 163
0x, 96
2D acceleration, 185
3D acceleration, 187
address lines, 79, 80
AGP, 188
alignment, 153
allocate, 159
allocate on write, 108
Alpha, 166–168, 176, 177
Amdahl’s law, 196, 211
and, 49
ASCII, 46
assembler, 26, 165
ATE, 106
bandwidth, 83
bandwidth, hard disk, 225
bandwidth, interconnect, 209
binary, 40
binary compatibility, 165
binary fractions, 57
bit flip, 125
BLAS, 272, 297
branch, 25, 29
branch prediction, 27
bss, 158
burst, 83
bus, 15
byte, 31, 39
C, 312
cache
    associative, 105
    direct mapped, 102
    disk, 154, 155, 248
    memory, 91, 92
    primary, 112
    secondary, 112
    write back, 107, 108, 110
    write through, 107
cache coherency
    broadcast, 204, 216
    directory, 204, 216
    snoopy, 107, 202, 216
cache controller, 92, 94
cache hierarchy, 112
cache line, 99, 114
cache thrashing, 104
cc-NUMA, 216
CD drive, 227
CD-R, 265
CD-RW, 265
chkdsk, 234
CISC, 21
clock, 15, 84, 85
clock multiplying, 119
compilers, 300–302
complex arithmetic, 67, 68
cooling, 120
crossbar, 207, 218
CSE, 280
DAT, 267, 268
data dependency, 19
data segment, 158
DDR-SDRAM, 84
debugging, 291, 301, 302
denormals, 56, 70
DIMM, 86, 87
dirty bit, 107
dirty filesystem, 257
disk thrashing, 148
distributed memory computer, 205
division
    floating point, 71–74
    integer, 47
DLT, 267, 268
DOS, 135–138, 230
DRAM, 77, 78
DTLB, 146
DVI, 184
HFS, 245
hit rate, 91, 106, 121
HPF, 212
hypercube, 207
hyperthreading, 221
libraries, shared, 160
limit, 159
link, hard, 240
link, symbolic, 241
linking, 273, 277, 301
linking, dynamic, 274
linking, static, 273
Linpack, 34, 36
load, 191
locked pages, 149
logistic map, 66
loop
    blocking, 295
    coda, 283
    elimination, 286
    fusion, 287
    interchange, 292
    invariant removal, 280
    pipelining, 290
    reduction, 284
    strength reduction, 288
    transformations, 296
    unrolling, 283
LRU, 109
machine code, 31
MacOS, 138, 245
malloc, 159, 162
mantissa, 52
memory map
    Digital UNIX, 161
    DOS, 136, 137
    Linux, 162
memory refresh, 78
MESI, 203
metadata, 229, 253
MFLOPS, 32
micro-ops, 172
microcode, 69
MIPS, 32, 166
mirror, 262
MMX, 171
modulo scheduling, see loop pipelining
Motorola 68K, 166
mounting, 257
MPI, 206, 213, 214, 298, 299
MPP, 205
MTA, 221
MTOPS, 32
multitasking, 190
    co-operative, 192
    pre-emptive, 192
multithreading, 220, 221
NAG, 272
name mangling, 277
NaN, 60, 61
NFS, 255, 261
nice, 191
non-blocking, 209
nop, 22
null pointer dereferencing, 162
NUMA, 216, 217
offset, 41
OpenGL, 187, 188
OpenMP, 212
operating system, 148, 191, 193, 270
optimisation, 278–297
or, 49
out-of-order execution, 30, 177
overflow, 44, 60, 61
page, 141
page fault, 143, 148
page table, 141–145
paging, 148
palette, 183
parallel computers, 194
parity, 126
partition table, 246
Pentium4, 173
phosphor, 181
physical address, 140
pipeline, 17, 18, 23
pipeline depth, 17
pixel, 182
platter, 224, 225
Power, 166
power, 120
PowerPC, 166
predication, 29
prefetching, 115, 116, 285
priority, 191
privilege, 193
process switch, 190
ps, 191
quadratic formula, 65
quota, 261
RAID, 263
RAM, 77
RAMBUS, 85
ranges, IEEE 754, 62
ranges, integer, 45
RDRAM, 85
refresh rate, video, 182
register, 14, 31
renice, 191
RIMM, 86, 87
RISC, 21
Rock Ridge filesystem, 264
ROM, 77
root, 193
rotate, 48
rounding, 63
scaling, 196, 197, 211
scandisk, 234, 251
scheduler, 191
SCSI, 223
SDRAM, 84
SECDED, 127
sector, 225
seek time, 225
segment, 158
shared memory processor, 198
shift, 48
SIGBUS, 153
SIGFPE, 61
SIGILL, 31
sign-magnitude, 41
SIGSEGV, 143
SIMD, 194
SIMM, 86, 87
size, 158
SMP, 198
SMT, 221
SO-DIMM, 87
SPARC, 166
SPEC, 35, 37
speculative execution, 28
spin wait, 298
square root, 75
SRAM, 77, 78, 91
SSE, 172
SSE2, 173
stack, 158, 159
stalls, 27
streaming, 116
Streams, 36
sub-block, cache line, 108, 114
superscalar, 20
swap space, 150
swapping, 150
syncing, 251
UFS, 237–244
ulimit, 159
underflow, 56
uptime, 191
vector computers, 89
VFAT, 236
victim cache, 106
virtual address, 140
virtual disk, 262
virtual memory, 148
VLIW, 22
vmstat, 154
voltage, 120
word, 31
write behind cache, 248
write buffer, 110
write collapsing, 110
    disk, 250