ECE Basics
1 Introduction 1
3 Binary Numbers 7
3.1 Unsigned Binary Numbers . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Signed Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.1 Sign and Magnitude . . . . . . . . . . . . . . . . . . . . . . 7
3.2.2 Two’s Complement . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Multiplication and Division . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Extending an n-bit binary to n+k bits . . . . . . . . . . . . . . . . 11
4 Boolean Algebra 13
4.1 Karnaugh maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Karnaugh maps with 5 and 6 bit variables . . . . . . . . 18
4.1.2 Karnaugh map simplification with ‘X’s . . . . . . . . . . . 19
4.1.3 Karnaugh map simplification based on zeros . . . . . . . 20
7 Von Neumann Architecture 45
7.1 Data Path and Memory Bus . . . . . . . . . . . . . . . . . . . . . . 47
7.2 Arithmetic and Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . 47
7.3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.3.1 Static Random Access Memory (SRAM) . . . . . . . . . . 51
7.3.2 Dynamic Random Access Memory (DRAM) . . . . . . . . 52
7.4 Control Unit (CU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7.4.1 Register Transfer Language . . . . . . . . . . . . . . . . . 54
7.4.2 Execution of Instructions . . . . . . . . . . . . . . . . . . . 55
7.4.3 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . 57
7.4.4 Complex and reduced instruction sets (CISC/RISC) . . . 58
7.5 Input/Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
II Low-level programming 79
10 Programming in C 83
10.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
10.1.1 Integer data . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
10.1.2 Texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
10.1.3 Floating-point data . . . . . . . . . . . . . . . . . . . . . . . 84
10.2 Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
10.3 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11 Character encodings 87
11.1 ASCII . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.2 Latin-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.3 Latin-9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.4 Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.4.1 UTF-8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
12 Assembly programming 91
12.1 Assembler notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
12.1.1 Instruction lines . . . . . . . . . . . . . . . . . . . . . . . . . 91
12.1.2 Specification lines . . . . . . . . . . . . . . . . . . . . . . . 91
12.1.3 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
12.1.4 Alternative notation . . . . . . . . . . . . . . . . . . . . . . 92
12.2 The assembler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
12.2.1 Assembling under Linux . . . . . . . . . . . . . . . . . . . . 92
12.2.2 Assembling under Windows . . . . . . . . . . . . . . . . . 93
12.3 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
12.4 Instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A Questions Catalogue 99
A.1 Introduction to Digital Electronics . . . . . . . . . . . . . . . . . . 99
A.2 Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.3 Boolean Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.4 Combinational Logic Circuits . . . . . . . . . . . . . . . . . . . . 100
A.5 Sequential Logic Circuits . . . . . . . . . . . . . . . . . . . . . . . 100
A.6 Von Neumann Architecture . . . . . . . . . . . . . . . . . . . . . . 100
A.7 Optimizing Hardware Performance . . . . . . . . . . . . . . . . . 101
Index 102
List of Figures
6.11 State transition graph of a 3-bit counter . . . . . . . . . . . . . . 40
6.12 3-bit counter Karnaugh maps . . . . . . . . . . . . . . . . . . . . . 41
6.13 3-bit synchronous counter . . . . . . . . . . . . . . . . . . . . . . . 41
6.14 3-bit ripple counter . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.15 Shift register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Tables
5.3 3-bit decoder truth table . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5 3-bit demultiplexer truth table . . . . . . . . . . . . . . . . . . . . 27
5.6 Truth table for a 1-bit half adder . . . . . . . . . . . . . . . . . . . 28
5.7 Full Adder truth table . . . . . . . . . . . . . . . . . . . . . . . . . 29
Chapter 1
Introduction
This compendium is intended to supply required background information to
students taking the course INF2270. Together with the lectures and the
problems (both the weekly and the mandatory ones) it defines the course
curriculum.
One part (Part I) of the course works upwards from the bottom level
to explain how computers are designed; the other (Part II) progresses
downwards from the top level to describe how to program the computer
at each level. At the end of the course, the two descriptions should meet
somewhere around levels 2–3.
The authors would like to thank the following students for valuable
contributions: Christer Mathiesen, André Kramer Orten, Christian Resell
and Marius Tennøe.
Part I
Basics of computer
architecture
Chapter 2
Introduction to Digital
Electronics
The word digital comes from the Latin word ‘digitus’ which means finger.
Its meaning today is basically ‘countable’ and since many people use their
fingers for counting, that explains the connection to its Latin origin. Its
opposite is ‘analog’. Digital electronics refers to electronic circuits that
are described by a discrete/countable number of states. The basic building
block of almost all digital electronics today is the switch. This has two
states, either ‘on’ or ‘off’, and almost all digital electronics today is thus
binary, i.e., the number of states of the basic building block and basic
signals is two.1
The first predecessors of the modern computer were built with mechanical
switches (the Analytical Engine by Charles Babbage in 1837), electro-
mechanical switches/relays (G. Stibitz’ Model-K (1937) and K. Zuse’s Z3
(1941)), and vacuum tubes (ENIAC, 1946). But the veritable computer
revolution took off with an almost incredible miniaturization of the switch:
the transistor.
¹ The next most popular number of states for the basic elements is three, and a number of
ternary electronic circuits exist as well.
look becomes necessary. Be that as it may, for most digital designs the
description as a switch has proved to be to a large degree sufficient.
Chapter 3
Binary Numbers
3.1 Unsigned Binary Numbers
The numbers we use in everyday life are decimal numbers. The main reason
for us to use the decimal system is that we have 10 fingers. The decimal
system uses an alphabet of 10 digits: [0123456789]. When writing down a
decimal number, the rightmost digit has a unit value of 1 (or 10⁰), the next
to the left has a unit value of 10 (10¹), the next 100 (10²) and so on. The
number 18 thus means:

18 := 1 × 10¹ + 8 × 10⁰ (3.1)

If humankind had but two fingers, things might have turned out quite
differently.¹ A binary number system might have evolved, with an alphabet
of only 2 digits: [01]. The rightmost digit would again have a ‘unit’ value
of 1 (2⁰), but the next would have a unit value of 2 (2¹) and then 4 (2²), 8
(2³), 16 (2⁴) etc. 18 now reads:

10010 := 1 × 2⁴ + 0 × 2³ + 0 × 2² + 1 × 2¹ + 0 × 2⁰ (3.2)
For example, 8 bit numbers could encode the values from −127 to 127
(7-bit magnitude and 1 sign-bit):
¹ We might never have become intelligent enough to compute, because of the inability to use
tools, for instance. Horses, with only two digits on their forelimbs, do not (to our knowledge)
have a number system at all.
87 = 01010111
−87 = 11010111
A first problem with this scheme is that there is also a ‘signed zero’, i.e., +0
and −0, which is redundant and does not really make sense.
87 = 01010111
−41 = 11010111 (= 215 − 256)
−87 = 10101001 (= 169 − 256)
To negate a two’s complement number one proceeds in two steps: 1) invert all bits, 2) add 1.
Example:
1. 87 = 01010111 → 10101000
2. 10101000 + 1 = 10101001 = −87
1. −87 = 10101001 → 01010110
2. 01010110 + 1 = 01010111 = 87
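The same invert-and-add-one recipe can be written down as a small C sketch for 8-bit values (the function name and the use of the uint8_t type are merely illustrative choices):

#include <stdint.h>
#include <stdio.h>

/* Negate an 8-bit two's complement value: invert all bits, then add 1. */
static uint8_t negate8(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}

int main(void)
{
    uint8_t plus87  = 0x57;              /* 01010111 =  87 */
    uint8_t minus87 = negate8(plus87);   /* 10101001 = -87 */
    printf("%d -> %d\n", (int8_t)plus87, (int8_t)minus87);
    printf("%d -> %d\n", (int8_t)minus87, (int8_t)negate8(minus87));
    return 0;
}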
Why does that work? The key to understanding this is the modulo operation.
The signed value of an n-bit pattern with unsigned interpretation a is

value(a) = a − 2ⁿ   if a ∈ [2ⁿ⁻¹, 2ⁿ − 1]
value(a) = a        if a ∈ [0, 2ⁿ⁻¹ − 1]        (3.3)

Yet a third concept is that it does not matter if one computes the mod 2ⁿ
operation of a sum only on the result or also on the summands, i.e.,
(a mod 2ⁿ + b mod 2ⁿ) mod 2ⁿ = (a + b) mod 2ⁿ.
Thus, it follows:
What this equation says is that for the operation of addition of two two’s
complement numbers one can also just add their unsigned interpretation,
That convenient property is really good news for the design of arithmetic
operations in digital hardware, as one does not need to implement both
addition and subtraction, since adding a negative number is the same as
subtracting. A subtraction can be performed by
If you accept that this works for unsigned binaries, one can show this
to be true for a negative two’s complement binary number with the
corresponding unsigned interpretation because:
Examples:
decimal binary shifted decimal
-3 1101 1110 -2
-88 10101000 11010100 -44
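As a small C illustration of the shifts shown above (assuming, as on most common platforms, that the right shift of a negative signed integer is an arithmetic shift; the C standard leaves this implementation-defined):

#include <stdio.h>

int main(void)
{
    /* On common compilers, >> on a negative int is an arithmetic shift:
       the sign bit is replicated, so the result is a division by two
       that rounds towards minus infinity. */
    int a = -3;    /* 1101     in 4-bit two's complement */
    int b = -88;   /* 10101000 in 8-bit two's complement */
    printf("-3  >> 1 = %d\n", a >> 1);   /* -2  */
    printf("-88 >> 1 = %d\n", b >> 1);   /* -44 */
    return 0;
}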
a · b = ∑ₖ₌₀ⁿ⁻¹ aₖ · 2ᵏ · b        (3.7)
So as an algorithm:
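The algorithm is easiest to see in code; the following C sketch of the shift-and-add idea behind equation (3.7) uses illustrative names and 8-bit operands:

#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication following a*b = sum over k of a_k * 2^k * b:
   for every set bit k of a, add b shifted left by k positions. */
static uint16_t mul_shift_add(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    for (int k = 0; k < 8; k++) {
        if (a & (1u << k))
            result += (uint16_t)(b << k);
    }
    return result;
}

int main(void)
{
    printf("13 * 11 = %u\n", mul_shift_add(13, 11));   /* 143 */
    return 0;
}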
Examples:
decimal 4 bit 8 bit
-2 1110 → 11111110
-5 1011 → 11111011
5 0101 → 00000101
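In C, the same sign extension can be sketched as follows (the function name and masks are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>

/* Extend a 4-bit two's complement value to 8 bits:
   copy the sign bit (bit 3) into all of the new high bits. */
static uint8_t sign_extend_4_to_8(uint8_t x4)
{
    uint8_t x = x4 & 0x0Fu;          /* keep only the 4-bit value     */
    if (x & 0x08u)                   /* sign bit set?                 */
        x |= 0xF0u;                  /* fill the upper bits with ones */
    return x;
}

int main(void)
{
    printf("1110 -> %02X\n", sign_extend_4_to_8(0xE)); /* FE = -2 */
    printf("1011 -> %02X\n", sign_extend_4_to_8(0xB)); /* FB = -5 */
    printf("0101 -> %02X\n", sign_extend_4_to_8(0x5)); /* 05 =  5 */
    return 0;
}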
Chapter 4
Boolean Algebra
Digital electronics can conveniently be used to compute so called Boolean
functions, formulated using Boolean algebraic expressions, which are also
used in propositional logic. These are functions that project a vector of
binary variables onto one (or a vector of) binary variable(s):
In this context one interprets the result often as either ‘true’ or ‘false’
rather than ‘1’ or ‘0’, but that does not change anything for the definition of
Boolean functions: it’s just a renaming of the variables’ alphabet.
There are three basic operators in Boolean algebra: NOT, AND, OR.
Different notations are sometimes used:
NOT a:    ¬a    ā    a′
a AND b:  a ∧ b    a × b    a · b    (Do not confuse with multiplication!)
a OR b:   a ∨ b    a + b    (Do not confuse with addition!)
Boolean functions can be defined by truth tables, where all possible input
combinations are listed together with the corresponding output. For the
basic functions the truth tables are given in table 4.1.
More complicated functions with more input variables can also be defined
as truth tables, but of course the tables become bigger with more inputs
and more and more impractical. An alternative form to define Boolean
functions are Boolean expressions, i.e., to write down a function by
combining Boolean variables and operators (just as we are used to with
other mathematical functions). An example:
ƒ(a, b, c) = a + b · (a + c) (4.2)
There are several popular quite basic Boolean functions that have their own
operator symbol but are derived from the basic operators:
a | ā        a b | a·b        a b | a+b
0 | 1        0 0 |  0         0 0 |  0
1 | 0        0 1 |  0         0 1 |  1
             1 0 |  0         1 0 |  1
             1 1 |  1         1 1 |  1
a NAND b = ¬(a · b)
a NOR b = ¬(a + b)
Table 4.2 lists basic rules that govern Boolean functions and that allow to
rearrange and simplify them. Note that the equal sign ‘=’ connects two
functions that are equivalent, i.e., for every input the output is exactly the
same. Equivalent functions can be written in any number of ways and with
any degree of complexity. Finding the simplest, or at least a reasonably
simple expression for a given function is a very useful goal. It makes the
function easier to read and ‘understand’ and, as we will see later on, reduces
the complexity (number of electronic devices, power consumption, delay) of
digital electronics that implements the function.
Figure 4.1: Equivalent Boolean operators, truth tables and
logic gates
To verify the deMorgan theorem one can fill in the truth tables in table 4.3,
and here are two examples on how to apply the rules of table 4.2 to simplify
functions:
Example 1:
a·b + a·b̄
= a·(b+ b̄)
= a·1
= a
Example 2:
a·b·c + ā·b·c + ā·b·c̄ · (a + c)
= (a + ā)·b·c + ā·b·c̄ · a + ā · b · c̄ · c
= 1·b·c + 0+0
= b·c
Applying the rules one can also show that either the NAND or the NOR
function is actually complete, i.e., they are sufficient to derive all possible
Boolean functions. This can be shown by showing that all three basic
functions can be derived from a NAND or NOR gate, again employing the
rules from table 4.2:
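Since the derivation is not spelled out here, the following C sketch shows the standard construction of NOT, AND and OR from NAND alone and checks it by enumerating all inputs (the helper names are illustrative):

#include <stdio.h>

static int nand(int a, int b) { return !(a && b); }

/* The standard constructions using NAND gates only. */
static int not_(int a)        { return nand(a, a); }
static int and_(int a, int b) { return not_(nand(a, b)); }
static int or_(int a, int b)  { return nand(not_(a), not_(b)); }

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            if (not_(a) != !a)          printf("NOT mismatch\n");
            if (and_(a, b) != (a && b)) printf("AND mismatch\n");
            if (or_(a, b) != (a || b))  printf("OR mismatch\n");
        }
    printf("NOT, AND and OR verified from NAND only.\n");
    return 0;
}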
Beyond truth tables and Boolean expressions, Boolean functions can also
be expressed graphically with logic gates, i.e., the building blocks of digital
electronics. Figure 4.1 summarizes the basic and derived functions and
the corresponding operators, logic gates and truth tables. The logic gates
will be our main tools as we move on to designing digital electronics. Note
that they are somewhat more powerful than Boolean expression and can do
things beyond implementing Boolean functions, since one can also connect
them in circuits containing loops. These loops can be employed to realize
elements that autonomously maintain a stable state, i.e., memory elements.
But for a while still, we will stick with pure feed-forward circuits, and thus,
Boolean functions.
F = a · b · c + ā · b · c + ā · b · c̄ · (a + c) →

a b c | F
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 1
The truth table can now be shown in a so-called Karnaugh map (or K-map),
where the outputs are arranged in an array and the axes of the array are
labeled with the inputs arranged in a Gray-code, i.e., such that only one
input bit shifts between columns/rows:
a b c F
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1 →
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
Now one needs to find the so-called minterms graphically: rectangles that
contain 2ⁿ ‘1’s (i.e., 1, 2, 4, 8, . . . elements). The goal is to find a minimal
number of rectangles, maximal in size, that cover all ‘1’s in the array.
They may overlap, and this is even desirable to increase their size. They
may also wrap around the edge of the array (see the next example!). In this
first example this is quite trivial, as there are only two ‘1’s that conveniently
are neighbours and thus form a 1 × 2 rectangle (marked in red).
Now for this entire rectangle, all inputs are either constant or undergo all
possible binary combinations with each other. Here, variables b and c are
constant, and a goes through both its states ‘0’ and ‘1’.
b·c
Let us look at a somewhat more complete example. We’ll start directly with
the Karnaugh map:
Note the overlap between the green 1 × 2 and the red 4 × 2 rectangles; the
blue 2 × 2 rectangle is formed by wrapping around the edges of the array.
The resulting simple Boolean function is as follows. The brackets are colour
coded to correspond to the marked rectangles. Note that the bigger the
rectangle, the shorter the minterm expression.
The method shown so far works well up to four input variables, i.e., up to
two bits along one axis. Things become more complicated for more input
bits. For 5 or 6 input bits, the Karnaugh map becomes 3-dimensional. The
property of the 2-bit Gray code, that any of the two bits of any group of 1, 2
or 4 subsequent codes is either constant or goes through both its possible
states in all possible combinations with a possible 2nd non-constant bit,
is not maintained in a Gray code with three bits. (Do not worry if you have
not understood the last sentence ;-) as long as you understand the resulting
method.) Consequently, instead of having a 3-bit Gray code along one axis,
a third axis is added to the Karnaugh map and one has now to look for 3D-
cuboids with 2n elements instead of 2D rectangles. Since the 3D map is
hard to display on a 2D sheet of paper, the different levels are shown side
by side. Classically, the levels are unfolded along one side, so that one has
to look for matching rectangles of 1’s that are mirrored, as shown in figure
4.2. In this way, it is still a Gray code along the sides. More modern would
be to simply change the most significant bit and to copy the Gray code for
the two lesser bits. Then one would not need to look for mirrored patterns
but patterns of the same shape in the same position in the two (or four)
neighbouring squares. The solution for this example is:
ab
cd 00 01 11 10
00 0 0 0 0
01 0 0 1 0
11 1 0 1 0
10 X X X X
Figure 4.3: K-map with ‘X’s
F = (x4 · x3 · x1) + (x4 · x3 · x2 · x0) + (x5 · x2 · x1 · x0)
    + (x5 · x2 · x1 · x0) + (x5 · x4 · x2 · x1)
    + (x5 · x3 · x2 · x1 · x0) + (x5 · x4 · x3 · x1 · x0)        (4.6)
F = (ā · b̄ · c) + (a · b · d) + (a · b · c) (4.7)
F̄ = (b · d) + (a · d) + (a · b) + (c · d) (4.8)
F = (b̄ + d̄) · (ā + d̄) · (ā + b̄) · (c̄ + d̄) (4.9)
In short, if one wants to obtain the end result directly, one takes the inverse
of the input variables that are constant for each rectangle to form the
minterms as ‘OR-sums’ and combines these with ANDs.
Chapter 5
Combinational Logic
Circuits
Combinational logic circuits are logic/digital circuits composed of feed-
forward networks of logic gates (see figure 4.1) with no memory that can be
described by Boolean functions.1
Logic gates (figure 4.1) are digital circuits that implement Boolean
functions with two inputs and one output and are most often implemented to
operate on binary voltages as input and output signals:2 a certain range of
input voltage is defined as ‘high’ or logic ‘1’ and another range is defined as
‘low’ or ‘0’. In a digital circuit with a 1.8V supply, for instance, one can
guarantee an input voltage of 0V to 0.5V to be recognised as ‘0’ and 1.2V to
1.8V as ‘1’ by a logic gate.
On the output side the gate can guarantee to deliver a voltage of either
>1.75V or <0.05V.
That means that a small mistake at the input of a logic gate is actually
‘corrected’ at its output which is again closer to the theoretically optimal
values of exactly 0V and 1.8V. These safety margins between input and
output make (correctly designed) digital circuits very robust, which is
necessary with millions of logic gates in a CPU, where a single error might
impair the global function!
¹ Note what is implied here: logic gates can also be connected in ways that include feedback
connections that implement/include memory and that cannot be described as Boolean functions!
This is then not ‘combinational logic’, but ‘sequential logic’, which will be the topic of chapter 6.
² Another possibility is to use so-called ‘current mode’ logic circuits where the logic states are
represented with currents.
A complete analysis is quite trivial for small digital circuits but nigh
impossible for circuits of the complexity of a modern CPU. Hierarchical
approaches in design and analysis provide some help.
The first Pentium on the market actually had a mistake in its floating point
unit. Thus, it has been exposed to some ridicule. Here is a common joke of
that time:
After the Intel 286 there was the 386 and then the 486, but the
585.764529 was then dubbed ‘Pentium’ for simplicity’s sake.
encoder/decoder
I7 I6 I5 I4 I3 I2 I1 I0 O2 O1 O0
0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 1
0 0 0 0 0 1 0 0 0 1 0
0 0 0 0 1 0 0 0 0 1 1
0 0 0 1 0 0 0 0 1 0 0
0 0 1 0 0 0 0 0 1 0 1
0 1 0 0 0 0 0 0 1 1 0
1 0 0 0 0 0 0 0 1 1 1
multiplexer/demultiplexer
adder/multiplier
. . .
Note that the symbols for those blocks are not as much standardized as the
symbols for the basic logic gates and will vary throughout the literature.
The symbols given here are, thus, not the only ones you will encounter in
other books but will be used throughout this text.
I7 I6 I5 I4 I3 I2 I1 I0 O2 O1 O0
0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 X 0 0 1
0 0 0 0 0 1 X X 0 1 0
0 0 0 0 1 X X X 0 1 1
0 0 0 1 X X X X 1 0 0
0 0 1 X X X X X 1 0 1
0 1 X X X X X X 1 1 0
1 X X X X X X X 1 1 1
5.1.1 Encoder
An encoder in digital electronics refers to a circuit that converts 2n inputs
into n outputs, as specified (for a 3-bit encoder, i.e., n = 3) by the truth
table 5.1. The input should be a ‘one-hot’ binary input, i.e., a bit-vector
where only one bit is ‘1’ and all others are ‘0’. The output then encodes the
position of this one bit as a binary number. Note, that the truth table, thus,
is not complete. It does not define the output if the input is not a one-hot
code. Be aware that there are totally valid implementations of encoders that
behave as defined if the input is a legal one-hot code, but they may react
differently to ‘illegal’ inputs.
A symbol that is used for an encoder is given in figure 5.2 and a variant
on how to implement a 3-bit encoder is depicted in figure 5.3. This
particular (rather straight forward) implementation will produce quite
arbitrary outputs when given ‘illegal’ inputs.
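A behavioural C sketch of such an 8-to-3 encoder is given below (names are illustrative). It ORs together the indices of all set input bits, which reproduces table 5.1 for one-hot inputs and, much like the simple gate implementation mentioned above, gives arbitrary results otherwise:

#include <stdint.h>
#include <stdio.h>

/* 8-to-3 encoder: returns the position of the single '1' in a one-hot
   input vector. For non-one-hot inputs the result is not meaningful. */
static unsigned encode8to3(uint8_t onehot)
{
    unsigned out = 0;
    for (unsigned i = 0; i < 8; i++)
        if (onehot & (1u << i))
            out |= i;            /* OR together the indices of set bits */
    return out;
}

int main(void)
{
    printf("%u\n", encode8to3(0x01));   /* 0 */
    printf("%u\n", encode8to3(0x20));   /* 5 */
    printf("%u\n", encode8to3(0x80));   /* 7 */
    return 0;
}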
5.1.2 Decoder
A decoder is the inverse function of an encoder, in digital circuits usually
decoding n inputs into 2n outputs. The truth table for a 3 bit variant is given
in table 5.3. Note that the truth table is complete, not subjected to the same
I2 I1 I0 O7 O6 O5 O4 O3 O2 O1 O0
0 0 0 0 0 0 0 0 0 0 1
0 0 1 0 0 0 0 0 0 1 0
0 1 0 0 0 0 0 0 1 0 0
0 1 1 0 0 0 0 1 0 0 0
1 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 1 0 0 0 0 0
1 1 0 0 1 0 0 0 0 0 0
1 1 1 1 0 0 0 0 0 0 0
5.1.3 Multiplexer
A multiplexer routes one of 2n input signals as defined by the binary control
number S to the output. A schematics symbol that is used for a multiplexer
is shown in figure 5.6. The truth table of a 3-bit multiplexer in figure 5.4
does not only contain zeroes and ones any longer, but also the input variables
Ik, indicating that the output will depend on the input, and the control bits S
choose which input bit the output depends on. Figure 5.7 shows a possible
implementation. Note the way that multiple input logic gates are shown in
a simplified, compact way as explained in the small sub-figures.
Multiplexers are used in many a context, for example when buses (parallel
or serial data lines, see later in this text) are merged.
5.1.4 Demultiplexer
A demultiplexer performs the inverse function of a multiplexer, routing one
input signal to one of 2n outputs as defined by the binary control number
S. Table 5.5 is the corresponding truth table, figure 5.8 is a possible symbol
and figure 5.9 shows a possible implementation.
S2 S1 S0 O
0 0 0 I0
0 0 1 I1
0 1 0 I2
0 1 1 I3
1 0 0 I4
1 0 1 I5
1 1 0 I6
1 1 1 I7
Figure 5.7
S2 S1 S0 O7 O6 O5 O4 O3 O2 O1 O0
0 0 0 0 0 0 0 0 0 0 I
0 0 1 0 0 0 0 0 0 I 0
0 1 0 0 0 0 0 0 I 0 0
0 1 1 0 0 0 0 I 0 0 0
1 0 0 0 0 0 I 0 0 0 0
1 0 1 0 0 I 0 0 0 0 0
1 1 0 0 I 0 0 0 0 0 0
1 1 1 I 0 0 0 0 0 0 0
a b S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Demultiplexers find their use where a shared data line is used to convey data
to several destinations at different times.
5.1.5 Adders
Addition of binary numbers is a basic arithmetic operation that computers
execute innumerable times which makes the combinational adder circuit
very important.
Cin a b S Cout
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
connecting the carry out of a stage to the next higher order stage/more
significant bit. The first stage’s carry in should be set to zero, or the first
stage might simply consist of a half adder. This implementation of an adder
is known as ripple carry adder, since the carry bits may ‘ripple’ from the
LSB to the MSB and there might be a significant delay until the MSB of the
result and its carry out become stable.
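A bit-level C sketch of an 8-bit ripple carry adder, built directly from the full adder truth table above, may help to see the stage-by-stage carry propagation (widths and names are illustrative assumptions):

#include <stdint.h>
#include <stdio.h>

/* 8-bit ripple carry adder: one full adder per bit position, the carry out
   of each stage feeding the next, more significant stage. */
static uint8_t ripple_add8(uint8_t a, uint8_t b, int *carry_out)
{
    uint8_t sum = 0;
    int carry = 0;                                /* carry in of stage 0 */
    for (int i = 0; i < 8; i++) {
        int ai = (a >> i) & 1;
        int bi = (b >> i) & 1;
        int s  = ai ^ bi ^ carry;                      /* sum bit        */
        carry  = (ai & bi) | (carry & (ai ^ bi));      /* carry out      */
        sum   |= (uint8_t)(s << i);
    }
    *carry_out = carry;
    return sum;
}

int main(void)
{
    int c;
    printf("87 + 41 = %u (carry %d)\n", ripple_add8(87, 41, &c), c);
    return 0;
}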
Chapter 6
Sequential Logic
Circuits
Sequential logic circuits go beyond the concept of a Boolean function: they
contain internal memory elements and their output will also depend on
those internal states, i.e., on the input history and not just the momentary
input.
6.1 Flip-Flops
Flip-flops are digital circuits with two stable, self-maintaining states that
are used as storage/memory elements for 1 bit. In more recent usage, the
term ‘flip-flop’ refers more specifically to synchronous binary memory cells
(e.g., D-flip-flop, JK-flip-flop, T-flip-flop). These circuits
change their state only at the rising edge (or falling edge) of a dedicated
input signal, the clock signal. The term ‘latch’ (e.g., D-latch, SR-latch) is
used for the simpler, more basic asynchronous storage elements that do not
have a dedicated clock input signal and may change their state at once if
an input changes, but this naming convention is not consistently applied
throughout the literature.
Note that often the subscripts ‘present’ and ‘next’ are not explicitly written
but it is assumed that the left hand side of the equation refers to the next
state and the right hand to the present. This will also be applied in this
compendium.
Figure 6.1 shows a possible implementation of the D-latch and its symbol.
The double inverter feedback loop is the classic implementation of a binary
memory cell. It has two possible states: Q is either equal to 1 and Q̄ is equal
¹ Think of this as a digital circuit that is automatically ‘overclocked’ to its possible limit, for those
of the readers that have been into this ;-)
S R Q Qnext
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 0
1 0 0 1
1 0 1 1
1 1 0 ?
1 1 1 ?
S R Qnext
0 0 Q
0 1 0
1 0 1
1 1 ?
to 0 or vice versa. Once the feedback loop is connected, that state has no
way to change, but if the feedback loop is open, then Q and Q̄ will simply
be dependent on the input D. Thus the name ‘transparent latch’ that is also
sometimes used, since the latch will simply convey the input to the output
up until E is pulled low, whereupon the last state of D just before that event
is stored.
6.1.1.2 SR-latch
Another asynchronous latch is the SR-latch . The symbol is shown in figure
6.2. We will not look at its internal workings but define its behaviour
with the characteristic table 6.1. These tables can often be written more
compactly by again using variables of inputs and/or (implicitly ‘present’)
states in the table (table 6.2).
In words, the SR-latch can be asynchronously set (Q→1 and Q̄→0) by signal
‘S’ and reset (Q→0 and Q̄→1) by signal ‘R’. While both ‘S’ and ‘R’ are low,
the state is maintained. Note the unique feature of the question marks in
the characteristic table! They are caused by an ‘illegal’ input configuration,
i.e., when both ‘S’ and ‘R’ are high. The basic definition of a general SR-
latch does not define what the output should be in this case, and different
implementations are possible that will behave differently in that situation.
If a circuit designer uses an SR-latch as a black box, he cannot rely on the
output if he permits this situation to occur.
Q = S + R̄ · Q (6.2)
6.1.2.1 JK-Flip-Flop
The JK-flip-flop is the synchronous equivalent to the SR-latch. ‘J’
corresponds to ‘S’, but since it’s synchronous, a change of ‘J’ from low (0)
J K Qt Qt+1
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 0
1 0 0 1
1 0 1 1
1 1 0 1
1 1 1 0
J K Qt+1
0 0 Qt
0 1 0
1 0 1
1 1 Q̄t
T Qt Qt+1
0 0 0
0 1 1
1 0 1
1 1 0
to high (1) will not immediately set the flip-flop, i.e., raise the output ‘Q’.
This will only happen later, at the very moment that the clock signal ‘C’
rises (provided that ‘J’ is still high!). Correspondingly, if ‘K’ is 1 when the
clock signal changes to 1, the flip-flop is reset and ‘Q’ goes low, and if both
‘J’ and ‘K’ are low, the state does not change. The ‘illegal’ input state of
the SR-latch, however, is assigned a new functionality in the JK-flip-flop: if
both ‘J’ and ‘K’ are high, the flip-flop’s output will toggle, i.e., Q will change
state and become 1 if it was 0 and vice versa. This behaviour is defined
in the characteristic tables 6.3 and 6.4 and/or by the characteristic
equation:
Q = J · Q̄ + K̄ · Q (6.3)
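As a small cross-check, the characteristic equation can be written as one line of C and evaluated for all input combinations, which reproduces table 6.3 (names are illustrative):

#include <stdio.h>

/* Next state of a JK-flip-flop at a clock edge: Q_next = J*~Q + ~K*Q */
static int jk_next(int j, int k, int q)
{
    return (j & !q) | (!k & q);
}

int main(void)
{
    printf(" J K Q | Q_next\n");
    for (int j = 0; j <= 1; j++)
        for (int k = 0; k <= 1; k++)
            for (int q = 0; q <= 1; q++)
                printf(" %d %d %d |   %d\n", j, k, q, jk_next(j, k, q));
    return 0;
}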
T Qt+1
0 Qt
1 Q̄t
6.1.2.2 T-Flip-Flop
The T-flip-flop (toggle flip-flop) is a reduced version of the JK-flip-flop, i.e.,
the signals J and K are shorted and named ‘T’. So this flip-flop either
maintains its state when T is 0 or it toggles, i.e., changes its state at the
start of each clock cycle, when T is 1.
Its symbol is depicted in figure 6.5 and its characteristic table in tables 6.5
and 6.6. The characteristic equation is:
Q=T⊕Q (6.4)
6.1.2.3 D-Flip-Flop
The D-flip-flop (symbol in figure 6.6) can also be seen as a reduced version
of the JK-flip-flop, this time if J is connected to K through an inverter and
J is named ‘D’: the output of the D-flip-flop follows the input D at the start
of every clock cycle as defined in the characteristic tables 6.7 and 6.8. Its
characteristic equation is:
Q=D (6.5)
D Qt Qt+1
0 0 0
0 1 0
1 0 1
1 1 1
D Qt+1
0 0
1 1
Consider the simple example in figure 6.7. It defines a controller for a traffic
light, where pressure sensors in the ground are able to detect cars waiting
on any of the four roads. There are two states of this system:
either north-south or east-west traffic is permitted. This FSM is governed
by a slow clock cycle, let’s say of 20 seconds. Equipped with sensors, the
controller’s behaviour is somewhat more clever than simply switching back
and forth between permitting east-west and north-south traffic every cycle:
it only switches if there are cars waiting in the direction it switches to, and
will not stop the cars travelling in the direction that is green at present
otherwise.
Moore FSM: In a Moore machine the output depends solely on the internal
states. In the traffic light example here, the traffic lights are directly
controlled by the states and the inputs only influence the state
transitions, so that is a typical Moore machine.
Mealy FSM: In a Mealy machine the outputs may also depend on the input
signals directly. A Mealy machine can often reduce the number of
states (naturally, since the ‘state’ of the input signals is exploited
too), but one needs to be more careful when designing them. For
one thing: even if all memory elements are synchronous the outputs
too may change asynchronously, since the inputs are bound to change
asynchronously.
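To make the distinction concrete, here is a hedged C sketch of the traffic-light controller as a Moore machine: the output is computed from the state alone, and the sensor inputs only influence the transitions. The state names, sensor names and the exact transition rules are illustrative assumptions and do not reproduce figure 6.7 in detail.

#include <stdio.h>

typedef enum { NS_GREEN, EW_GREEN } state_t;

/* Moore machine: one transition per (slow) clock cycle.
   ns_waiting/ew_waiting are the pressure sensor inputs. */
static state_t next_state(state_t s, int ns_waiting, int ew_waiting)
{
    switch (s) {
    case NS_GREEN: return ew_waiting ? EW_GREEN : NS_GREEN;
    case EW_GREEN: return ns_waiting ? NS_GREEN : EW_GREEN;
    }
    return s;
}

/* Moore output: depends on the state only. */
static const char *output(state_t s)
{
    return (s == NS_GREEN) ? "north-south green" : "east-west green";
}

int main(void)
{
    state_t s = NS_GREEN;
    int ew_sensor[] = {0, 1, 0, 0, 1};
    int ns_sensor[] = {0, 0, 0, 1, 0};
    for (int cycle = 0; cycle < 5; cycle++) {
        printf("cycle %d: %s\n", cycle, output(s));
        s = next_state(s, ns_sensor[cycle], ew_sensor[cycle]);
    }
    return 0;
}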
For the most part, this compendium will stick with the design of
synchronous sequential circuits.
From the characteristic table one can derive the combinational circuit. In
more complicated cases one might employ a Karnaugh map to find a simple
functional expression first. Here, it is rather straightforward to find the
circuit in figure 6.9.
6.3 Registers
Registers are a concept that will simplify following discussions of more
complex logic. They are nothing more fancy than an array of flip-flops that
are accessed in parallel (e.g., as memory blocks in a CPU), controlled by
shared control signals. The array is usually of a size that is convenient
for parallel access in the context of a CPU/PC, e.g., one Byte or a Word.
Possibly most common is the use of an array of D-flip-flops. A typical control
signal is a ‘write enable’ (WE) or synchronous load (LD). In a D-flip-flop
based register, this signal is ‘and-ed’ with the global clock and connected to
the D-flip-flop clock input, such that a new input is loaded into the register
only if WE is active. Other control signals might be used to control extra
functionality (e.g., in shift-registers, see section 6.4).
present        in    next
S2 S1 S0       NA    S2 S1 S0
0 0 0                0 0 1
0 0 1                0 1 0
0 1 0                0 1 1
0 1 1                1 0 0
1 0 0                1 0 1
1 0 1                1 1 0
1 1 0                1 1 1
1 1 1                0 0 0
the input is the present state plus the inputs (not available in this example)
and the output the next state, we can deduce the simplified combinational
logic that will connect the output of the D-flip-flops (present state) to
the input of the D-flip-flops (next state) with the help of Karnaugh maps
(figure 6.12), as was introduced in section 4.1.
Equation (6.7) is the resulting sum of minterms for bit S2 and can be further
simplified to equation 6.9.
Sn,next = Sn ⊕ ⋀ₖ₌₀ⁿ⁻¹ Sk        (6.10)
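Applied to a 3-bit counter, equation (6.10) says that bit n toggles exactly when all lower-order bits are 1 (and bit 0 toggles every cycle). The following C sketch, with illustrative names, steps through the resulting count sequence:

#include <stdio.h>

/* Next state of an n-bit synchronous counter:
   S_n(next) = S_n XOR (AND of all lower bits S_0 .. S_{n-1}). */
static unsigned counter_next(unsigned state, int bits)
{
    unsigned next = 0;
    for (int n = 0; n < bits; n++) {
        int all_lower_ones = 1;
        for (int k = 0; k < n; k++)
            all_lower_ones &= (state >> k) & 1u;
        int bit = (int)((state >> n) & 1u) ^ all_lower_ones;
        next |= (unsigned)bit << n;
    }
    return next;
}

int main(void)
{
    unsigned s = 0;
    for (int i = 0; i < 9; i++) {        /* counts 0, 1, ..., 7, 0 */
        printf("%u ", s);
        s = counter_next(s, 3);
    }
    printf("\n");
    return 0;
}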
Counters may also be equipped with even more control signals to control
extra functionality such as:
control         next
LD SE LS        O2    O1    O0
1  X  X         I2    I1    I0
0  0  X         O2    O1    O0
0  1  0         RSin  O2    O1
0  1  1         O1    O0    LSin
the lowest to the highest bit, i.e., the highest bits are updated with a delay.
This must be taken into account, if this kind of counter is used.
be set to zero
Binary multiplication
Chapter 7
Von Neumann
Architecture
In 1945 John von Neumann published his reference model of a computer
architecture that is still the basis of most computer architectures today.
The main novelty was that a single memory was used for both program and
data.
control unit
execution unit
memory unit
input unit
output unit
The control unit (CU) is the central finite state machine that is controlled
by input from the memory, i.e., the program. Its internal states are its
registers. Typical registers of the CU are:
PC: (program counter, also called instruction pointer (IP)) the register
holding the memory address of the next machine code instruction.
IR: (instruction register) the register holding the machine code of the
instruction that is executed.
MAR: (memory address register) one half of the CPU-memory interface,
the register holding the memory address of the location to be read or written.
MBR: (memory buffer register) the other half of the CPU-memory inter-
face, a buffer holding the data just read from the memory or to be
written to the memory. Typically the MBR can be connected as one of
the inputs to the ALU.
The MAR and the MBR may also be assigned to a subunit of the CU, the
memory controller, which has even been implemented on a separate IC and
placed outside the CPU to allow for more system flexibility, i.e., to
allow reusing the same CPU with different generations and/or sizes of
memory. For the sake of speed, however, many modern CPUs do again
physically contain a memory controller.
The execution unit is the work horse of the CPU that manipulates data.
In its simplest form it would merely be combinational logic with input and
output registers, performing all of its tasks within a single clock cycle. A
core component is the arithmetic and logic unit (ALU) which is instructed
by the CU to execute arithmetic operations such as addition, subtraction,
multiplication, arithmetic shifts and logical operations such as bit-wise and,
or, xor and logic shifts etc. A simple ALU is a combinational circuit and is
further discussed in section 7.2. More advanced execution units will usually
contain several ALUs, e.g., at least one ALU for pointer manipulations and
one for data manipulation. Registers of the CPU that usually are assigned
to the execution unit are:
accumulator: a dedicated register that stores one operand and the result
of the ALU. Several accumulators (or general purpose registers in
the CPU) allow for storing of intermediate results, avoiding (costly)
memory accesses.
flag/status register: a register where each single bit stands for a specific
property of the result from (or the input to) the ALU, like carry in/out,
equal to zero, even/odd, . . .
The memory, if looked upon as an FSM, has a really huge number of internal
states, so describing it with a complete state transition graph would be
quite a hassle, which we will not attempt. Its workings will be further
explained in section 7.3.
inst   computation
000    a · b
001    ¬(a · b)
010    a + b (logical OR)
011    ¬(a + b)
100    a ⊕ b
101    ¬(a ⊕ b)
110    a + b (arithmetic sum)
111    a − b (!)
More complicated ALUs will have more functions as well as flags, e.g.,
overflow, divide by zero, etc.
Modern CPUs contain several ALUs, e.g., one dedicated to memory pointer
operations and one for data operations. ALUs can be much more complex
and perform many more functions in a single step than the example shown
here, but note that even a simple ALU can compute complex operations in
several steps, controlled by the software. Thus, there is always a trade-off
of where to put the complexity: either in the hardware or in the software.
Complex hardware can be expensive in power consumption, chip area and
cost. Furthermore, the most complex operation may determine the maximal
clock speed. The design of the ALU is a major factor in determining the CPU
performance!
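A behavioural C sketch of such a small ALU is given below. The 3-bit instruction encoding is only an assumed example in the spirit of the table above, not the compendium's definitive mapping, and a single zero flag stands in for a full flag register:

#include <stdint.h>
#include <stdio.h>

/* A toy 8-bit ALU: a 3-bit instruction selects the operation,
   and a zero flag is produced alongside the result. */
static uint8_t alu(unsigned inst, uint8_t a, uint8_t b, int *zero)
{
    uint8_t r = 0;
    switch (inst & 7u) {
    case 0: r = a & b;             break;   /* AND      */
    case 1: r = (uint8_t)~(a & b); break;   /* NAND     */
    case 2: r = a | b;             break;   /* OR       */
    case 3: r = (uint8_t)~(a | b); break;   /* NOR      */
    case 4: r = a ^ b;             break;   /* XOR      */
    case 5: r = (uint8_t)~(a ^ b); break;   /* XNOR     */
    case 6: r = (uint8_t)(a + b);  break;   /* add      */
    case 7: r = (uint8_t)(a - b);  break;   /* subtract */
    }
    *zero = (r == 0);
    return r;
}

int main(void)
{
    int z;
    printf("add: %u\n", alu(6, 87, 41, &z));
    printf("sub: %u, zero=%d\n", alu(7, 87, 87, &z), z);
    return 0;
}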
7.3 Memory
The memory stores program and data in a computer. In the basic von
Neumann model the memory is a monolithic structure, although in real
computer designs, there is a memory hierarchy of memories of different
types and sizes, but we will come back to this later (section 8.1). The
basic type of memory that is usually employed within a basic von Neumann
architecture is random access memory (RAM). A RAM has an address port,
and a data input and an output port, where the later two might be combined
in a single port. If the address port is kept constant, the circuit behaves like
a single register that can store one word of data, i.e., as many bits as the
input and output ports are wide. Changing the address, one addresses a
different word/register. ‘Random access’ refers to this addressed access
that allows to read and write any of the words at any time, as opposed to
the way that, for instance, pixels on a computer screen are accessible only
in sequence.
Some terminology:
word length: The number of bits or bytes that can be accessed in a single
read/write operation, i.e., the number of bits addressed with a single
address. In figure 7.6 the number of columns.
Note that the x86 architecture (and other modern CPUs) allows instructions
to address individual bytes in the main memory, despite the fact that
the word length of the underlying RAM is actually 32 bits/4 bytes or
64 bits/8 bytes. Thus, the address space that a programmer sees is in fact
bigger than the address space of the RAM.
WE, write enable (often active low): This signal causes a write access
writing I/D at address A, either immediately (asynchronous write),
with a following strobe signal (see RAS/CAS) or with the next clock
edge (synchronous write).
OE, output enable: A signal that lets the RAM drive the data line while
asserted, but lets an external source drive the data lines while
deasserted. This can be used to regulate the access if there are
several devices connected to a single bus: Only one of them should
be allowed to drive the bus at any one time. Deasserted, it can also
allow a common data line to be used as input port.
CS, chip select: A control line that allows to use several RAMs instead
of just one on the same address and data bus and sharing all other
control signals. If CS is not asserted all other signals are ignored and
the output is not enabled. Extra address bits are used to address one
specific RAM and a decoder issues the appropriate CS to just one RAM
at a time. This extends the address space.
RAM comes in either of two main categories: static- and dynamic random
access memory (SRAM and DRAM).
The figure introduces yet another digital building block within the lower
right corner circuit, the tri-state buffer. Tri-state buffers allow the outputs
of different logic circuits to be connected to the same electrical node, e.g.,
a bus-line. Usually, only one of these outputs at a time will be allowed to
drive that line and determine its states, while all others are in principle
disconnected from that line. So, the tri-state output can actually have three
different states, as controlled by its control input and the input: while its
control input is active, it conveys the usual ‘high’/1 and ‘low’/0 from input to
output. If the control input is inactive, the output is set to a ‘high impedance’
output denoted with ‘Z’. To say it more plainly, in this third state, the buffer
acts like a switch that disconnects its input from the output.
The write access is somewhat less elegant than in a D-latch (section 6.1.1):
in the D-latch, the feedback loop that maintains the state is disconnected
during a write access. Here, however, the feedback loop is maintained, also
during a write access. An active low WE activates a tri-state buffer that
drives the bit-lines. This buffer needs to be stronger than the feedback loop
in order to overrule it. This way, one saves an extra switch in the storage cell
making it more compact, and compactness is the main criterion for memory
cells, since the main goal is to get as much memory in as little a space as
possible. The price one pays is a considerable power consumption while the
feedback loop ‘struggles’ against the write input and a degradation of the
writing speed. Note that in this figure only the non-inverted bit-line BL is
driven during a write access, which is a possible design. Usually, the
inverted bit-line BL̄ is also actively driven during a write operation to increase
the writing speed.
Charge on a capacitor leaks and is lost over time. Thus, a sense amplifier
has to be connected to every memory cell within a given time period, while
the memory is idle. In modern DRAM, internal state machines take care of
this refresh cycle.
The sense amplifier reconstructs the digital state from the analog state that
the memory cell has decayed to. In principle, it is nothing but a flip-flop
itself, and if a flip flop should ever find itself in a state where its content
is neither ‘1’ nor ‘0’, it will quickly recover to the closer of those two
states, due to the amplification in the feedback loop. In the lower right
corner of figure 7.7, such a sense amplifier is depicted: it’s a pair of tri-
stated inverters connected in a feedback loop. To refresh a cell, first the
differential bit-lines BL and BL are precharged to a state between ‘0’ and
‘1’, i.e., ‘0.5’, then the memory cell that is to be refreshed is connected to
those lines, passively pulling them in the direction of either (1,0) or (0,1)
dependent on the charge that remains on the two memory cell capacitors.
Then the sense-amplifier is turned on, pulling them further apart actively in
the same direction, until the digital state is reached again.
The rest of the circuit in figure 7.7 is quite similar to the SRAM in figure 7.6.
One difference is that here you need to drive both BL and BL̄ during a
write operation. Another difference is the introduction of a write strobe
signal pulse RAS: it is not WE that directly triggers the writing of the
memory cell but this strobe signal. The pulse is also used to generate a
shorter precharge pulse during which the bit-lines are precharged and
after which the decoder output is enabled, connecting the memory cell
to the precharged bit-lines. In the case of a read access the sense amplifier
is also turned on and the memory cell content is, thus, simultaneously read
and refreshed. The sense amplifier will retain the memory cell state after
the cell itself is disconnected until the next RAS strobe.
SRAM DRAM
access speed + -
memory density - +
no refresh needed + -
simple internal control + -
price per bit - +
expression     meaning
X              register X or unit X
[X]            the content of X
←              replace/insert or execute code
M(X)           memory M at address X
[M([X])]       memory content at address [X]
more cheaply. One problem that arises with this is that the number of
address lines that are needed to address the entire address space becomes
bigger and bigger. To avoid enormously wide address buses, the address
is split into two parts, denoted as row and column address. We will not
delve deeply into this here and now. To tell a somewhat simplified story
that still conveys the principle of operation: the row address is loaded into an
internal register within the DRAM first. The RAS strobe will trigger that
latching. Then the column address is placed on the address bus and a
separate column address strobe CAS triggers the actual decoder output
and the latching of the memory content into the sense amplifier. Fast repeated
memory access with the same row address, changing only the column
address, is another consequence of this address splitting.
To describe the FSM that is the control unit one may employ the so-called
‘register transfer language’ (RTL), since moving data among registers is
a central part of what the CU does, besides telling the execution unit to
manipulate some of these data. The syntax of RTL is illuminated in table 7.4.
address   content
0         move 4
1         add 5
2         store 6
3         stop
4         1
5         2
6         0
…         …

Table 7.5
[MAR] ← [PC]
[PC] ← [PC] + 1
[MBR] ← [M([MAR])]
[IR] ← [MBR]
As a last stage of the fetch phase the operation code (‘move’) of the
instruction is decoded by the control unit (CU):
CU ← [IR(opcode)]
and triggers a cycle of the finite state machine with the appropriate signals
to execute a sequence of operations specific to the instruction. The order,
type and number of the individual operations may vary among different
instructions and the set of instructions is specific to a particular CPU.
The other part of the first machine code of the first instruction in our (16-bit)
processor is the ‘operand’ 4. What we have written as ‘move 4’ is actually a
bit pattern:

10110010  00000100
(opcode: move)  (operand: 4)
As mentioned before, the set of instructions and the machine codes are
specific to a CPU. Machine code is not portable between different CPUs.
After the opcode has been decoded into appropriate control signals it is
executed. In this first instruction the data from memory location 4 (1) is
moved to the accumulator A (often the accumulator is the implicit target of
instructions without being explicitly defined):
[MAR] ← [IR(operand)]
[MBR] ← [M([MAR])]
[A] ← [MBR]
CU ← [IR(opcode)]
[MAR] ← [IR(operand)]
[MBR] ← [M([MAR])]
The ALU receives the appropriate instruction from the state machine
triggered by the opcode (sometimes parts of the opcode simply are the
instruction for the ALU, avoiding a complicated decoding).
The number from memory location 5 (2) has been added and the result (3)
is stored in the accumulator. The second instruction is complete.
Another instruction fetch and decode follows, as before (not shown).
…
and then the execution of the third instruction which causes a write access
to the memory:
[MBR] ← [A]
[MAR] ← [IR(operand)]
[M([MAR])] ← [MBR]
Then, the fourth instruction is a stop instruction which halts the execution of
the program. The memory content is now changed as shown in table 7.6.
address   content
0         move 4
1         add 5
2         store 6
3         stop
4         1
5         2
6         3
…         …

Table 7.6
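The whole fetch-decode-execute sequence of this small program can be mimicked in a few lines of C. The sketch below uses the ‘move’ opcode 10110010 from the bit pattern shown earlier; the opcode values for add, store and stop, and the 16-bit word layout, are illustrative assumptions:

#include <stdio.h>

/* Assumed 16-bit instruction layout: high byte = opcode, low byte = operand.
   MOVE (0xB2) is taken from the bit pattern above; the others are made up. */
enum { MOVE = 0xB2, ADD = 0xB3, STORE = 0xB4, STOP = 0xB5 };

int main(void)
{
    /* Memory as in table 7.5: program at addresses 0..3, data at 4..6. */
    int M[8] = {
        (MOVE << 8) | 4, (ADD << 8) | 5, (STORE << 8) | 6, (STOP << 8),
        1, 2, 0
    };
    int PC = 0, A = 0, IR = 0, running = 1;

    while (running) {
        IR = M[PC];                        /* fetch: [MAR]<-[PC], [IR]<-[MBR] */
        PC = PC + 1;                       /* [PC] <- [PC] + 1                */
        int opcode  = (IR >> 8) & 0xFF;    /* decode                          */
        int operand = IR & 0xFF;
        switch (opcode) {                  /* execute                         */
        case MOVE:  A = M[operand];        break;
        case ADD:   A = A + M[operand];    break;
        case STORE: M[operand] = A;        break;
        case STOP:  running = 0;           break;
        }
    }
    printf("memory[6] = %d\n", M[6]);      /* 3, as in table 7.6 */
    return 0;
}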
7.4.3 Microarchitecture
If the CU were implemented like the example FSMs in section 6.2, the result
would be a so-called hardwired CU architecture, where a hardwired FSM
issues the right sequence of control signals in response to a machine code
in the IR.
Microarchitecture Hardwired
Occurrence CISC RISC
Flexibility + -
Design Cycle + -
Speed - +
Compactness - +
Power - +
7.5 Input/Output
A computer is connected to various devices transferring data to and from
the main memory. This is referred to as Input/output (I/O). Examples:
Keyboard, Graphics, Mouse, Network (Ethernet, Bluetooth, . . . ), USB,
Firewire, PCI, PCI-express, SATA, . . .
I/O addressing from the CPUs point of view usually follows one of two
principles:
Memory mapped I/O means accessing I/O ports and I/O control and status
registers (each with its own address) with the same instructions as the
memory (see the sketch after this list). Thus, in older systems, the system
interface might simply have been a single shared I/O and memory bus.
A disadvantage is that the use of memory addresses may interfere with
the expansion of the main memory.
Isolated I/O (as in the 80x86 family) means that separate instructions
accessing an I/O specific address space are used for I/O. An advantage
can be that these functions can be made privileged, i.e., only available
in certain modes of operation, e.g., only to the operating system.
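As a hedged C illustration of the memory-mapped style referred to above: a device register simply appears as a memory location at a fixed address, accessed with ordinary load and store instructions. The address, register layout and bit names below are purely hypothetical:

#include <stdint.h>

/* Hypothetical memory-mapped UART registers at a made-up address. */
#define UART_BASE   0x40001000u
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_DATA   (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_READY    (1u << 0)

/* Busy-wait until the device is ready, then write one byte to it.
   Ordinary load/store instructions perform the I/O; 'volatile' stops
   the compiler from optimizing the accesses away. */
static void uart_putc(char c)
{
    while ((UART_STATUS & TX_READY) == 0)
        ;                           /* poll the status register */
    UART_DATA = (uint32_t)c;
}

int main(void)
{
    /* Only meaningful on hardware that actually maps a device here. */
    uart_putc('A');
    return 0;
}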
Modes of Transfer:
Interrupt Driven: The I/O controller signals with a dedicated 1-bit data
line (interrupt request (IRQ)) to the CPU that it needs servicing. The
CPU is free to run other processes while waiting for I/O. If the interrupt
is not masked in the corresponding CPU status register, the current
instruction cycle is completed, the processor status is saved (PC and
flags pushed onto stack) and the PC is loaded with the starting address
of an interrupt handler. The start address is found, either at a fixed
memory location specific to the interrupt priority (autovectored ) or
stored in the controller and received by the CPU after having sent an
interrupt acknowledge control signal to the device (vectored )
Direct Memory Access (DMA): The processor is not involved, but the
transfer is negotiated directly with the memory, avoiding copying to
CPU registers first and the subroutine call to the interrupt handler.
DMA is used for maximum speed usually by devices that write whole
blocks of data to memory (e.g., disk controllers). The CPU often
requests the transfer but then relinquishes control of the system bus
to the I/O controller, which only at the completion of the block transfer
notifies the CPU with an interrupt.
(DMA poses another challenge to the cache as data can now become
stale, i.e., invalid in the cache)
Chapter 8
Optimizing Hardware
Performance
8.1 Memory Hierarchy
We have discussed two major types of memory, SRAM and DRAM, and stated
that the former was faster but less dense and more expensive, whereas
the latter was slower but denser and cheaper. Which one should one choose
when designing a computer? Computer designs try to optimize speed, which
would speak for SRAM, but also require as much storage space as possible,
which favours DRAM. Therefore, computers actually use both: a small fast
memory cache composed of SRAM for data that is accessed often, and a
large DRAM for the main memory that is used for longer term storage of
data that is not accessed quite so often. Cache, in fact, is subdivided into
several levels, where the L1 cache is smallest and fastest. Note that there is
a trade-off between size and speed, since larger memories require more
extensive cabling/routing of signals, which limits access time due to parasitic
capacitance.
The challenge now is how to know which data to keep in the fastest memory.
In fact, the fastest type of memory is the CPU-internal registers, and even
slower but more massive than the DRAM main memory is a computer’s hard
drive. Thus, one speaks of a memory hierarchy as depicted in figure 8.1,
and the following text describes the mechanisms and architecture that are
employed to assign different data to these different storage media.
Table 8.1 gives an indication of the access time and size of the different types of
memory. Note that SSD/flash memory is competing to replace hard drives,
but is at present still more expensive, not quite so big and has a shorter
lifetime, but these shortcomings are slowly being overcome as this text is
written.
8.1.1 Cache
Cache is used to ameliorate the von Neumann memory access bottleneck.
Cache refers to a small high speed RAM integrated into the CPU or close
to the CPU. Access time to cache memory is considerably faster than to
the main memory. Cache is small to reduce cost, but also because there is
always a trade-off between access speed and memory size. Thus, modern
architectures also include several hierarchical levels of cache (L1, L2,
L3, . . . ).
Cache uses the principle of locality of code and data of a program, i.e., that
code/data that is used close in time is often also close in space (memory
address). Thus, instead of only fetching a single word from the main
memory, a whole block around that single word is fetched and stored in
the cache. Any subsequent load or write instructions that fall within that
block (a cache hit ) will not access the main memory but only the cache. If
an access is attempted to a word that is not yet in the cache (a cache miss) a
new block is fetched into the cache (paying a penalty of longer access time).
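The address arithmetic behind such block-based lookup can be sketched in C for a direct-mapped cache; the block size, number of blocks and field names are assumptions chosen only for illustration:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE  64u      /* bytes per cache block (assumed)  */
#define NUM_BLOCKS  256u     /* blocks in the cache (assumed)    */

int main(void)
{
    uint32_t addr = 0x12345678u;

    uint32_t offset = addr % BLOCK_SIZE;                 /* byte inside the block  */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_BLOCKS;  /* which cache line       */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_BLOCKS);  /* identifies the block   */

    /* A hit means: the cache line at 'index' is valid and stores this 'tag'.
       Two addresses within the same block share tag and index and differ
       only in the offset, so the second access is served from the cache. */
    printf("tag=%#x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}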
the same colour. Note that they do not necessarily need to be co-
localized like in this illustration.
write through: a simple policy where each write to the cache is followed
by a write to the main memory. Thus, the write operations do not really
profit from the cache.
write back: delayed write back where a block that has been written to in
the cache is marked as dirty. Only when dirty blocks are reused for
another memory block will they be written back into the main memory.
Another problem occurs if devices other than the CPU, or multiple cores in
the CPU with their own L1 cache, have access to the main memory. In that
case the main memory might be changed and the content in the cache will
be out of date, so-called stale cache, but we will not delve into the methods
to handle these situations within this issue of the compendium.
Least recently used (LRU): this seems intuitively quite reasonable but
requires a good deal of administrative processing (causing delay):
Usually a ‘used’ flag is set per block in the cache when it is accessed.
This flag is reset in fixed intervals and a time tag is updated for
all blocks that have been used. These time tags have either to be
searched before replacing a block or a queue can be maintained and
updated whenever the time tags are updated.
First in –– first out (FIFO): is simpler. The cache blocks are simply
organized in a queue (ring buffer).
Random: Both LRU and FIFO are in trouble if a program works several
times sequentially through a portion of memory that is bigger than the
cache: the block that is cast out will very soon be needed again. A
random choice will do better here.
Hybrid solutions, e.g., using FIFO within a set of blocks that is randomly
chosen, are also used in an attempt to combine the positive properties
of the approaches.
Look-aside architecture: The cache shares the bus between CPU and
memory (system interface), see figure 8.6.
With a miss, the cache just listens in and ‘snarfs’ the data.
as depicted in figure 8.7. Processes running on the CPU only see the logic
addresses and a coherent virtual memory.
A pointer for each individual logic address would require as much space as
the entire virtual memory. Thus, a translation table maps memory
blocks (called pages (fixed size) or segments (variable size)). A logic
address can, thus, be divided into a page number and a page offset. A
location in memory that holds a page is called a page frame (figure 8.8).
Note that the penalty for page-failures is much more severe than that for
cache misses, since a hard drive has an access time that is up to 50000
times longer than that of the main memory and about 1 million times longer
than a register or L1 cache access (compare table 8.1), whereas the penalty
of a cache miss is roughly less than a 100 times increased access time.
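The split of a logic address into page number and page offset is a simple computation; the following C sketch assumes a 4 KiB page size purely for illustration:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u   /* assumed page size: 4 KiB */

int main(void)
{
    uint32_t logic_addr = 0x00ABCDEFu;

    uint32_t page_number = logic_addr / PAGE_SIZE;  /* looked up in the page table */
    uint32_t page_offset = logic_addr % PAGE_SIZE;  /* unchanged inside the frame  */

    /* The translation table maps page_number to a page frame; the physical
       address is then frame_base + page_offset. */
    printf("page %u, offset %u\n", (unsigned)page_number, (unsigned)page_offset);
    return 0;
}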
8.2 Pipelining
To accelerate the execution of instructions computer architectures today
divide the execution into several stages. The CPU is designed in a way
that allows it to execute these stages by independent subunits and such
that each stage needs one clock cycle.1 Thus, the first stage’s sub-unit can
¹ It would also work if all steps used more than one, but the same number of, clock cycles.
already fetch a new instruction, while the second stage’s sub-unit is still
busy with the first instruction.
The Pentium III has a pipeline consisting of 16 stages and the Pentium 4 even
31 stages, but the first pipelines used the following 4 stages, which we will
employ to explain the concept:
Figure 8.11 shows a rough block diagram of how a 4-stage pipeline might be implemented. Intermediate registers pass on intermediate results and the remaining instruction/operation codes for the remaining stages.
Executing n instructions in a pipeline with k stages takes k + n − 1 clock cycles instead of the nk cycles needed without pipelining, so the ideal speed-up is

speed-up = nk / (k + n − 1)    (8.1)
The average number of clock cycles per instruction (CPI) then becomes

CPI = (k + n − 1) / n    (8.2)
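As a quick numeric illustration (numbers chosen arbitrarily, not from the compendium): with k = 4 stages and n = 100 instructions the pipeline needs k + n − 1 = 103 clock cycles instead of nk = 400, i.e., a speed-up of 400/103 ≈ 3.9 and a CPI of 103/100 = 1.03, close to the ideal values of 4 and 1.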
More pipeline stages split the instructions into ever simpler parts and thus allow faster clock speeds, but pose a bigger design challenge.
The ideal throughput is impeded by three kinds of pipeline hazards:
resource hazards
data hazards
control hazards
Resource hazards occur when two instructions in the pipeline simultaneously need the same hardware resource, e.g., memory, caches, buses or the ALU.
the ALU). If a data hazard is detected, this direct data path supersedes the input from the DE/EX intermediate result register.
Possible measures:
Always Stall: Simply do not fetch any more instructions until it is clear whether the branch is taken or not. This ensures correctness but is, of course, a burden on performance.
Of course, if there are two or more jumps immediately after each other, this method fails on the second jump and the pipeline needs to stall.
CPI = (1 − Pb ) + Pb (1 − Pt ) + Pb Pt (1 + s) = 1 + Pb Pt s    (8.3)
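As a quick worked example (values chosen arbitrarily, and reading Pb as the probability that an instruction is a branch, Pt as the probability that a branch is taken, and s as the number of stall cycles a taken branch costs): with Pb = 0.2, Pt = 0.6 and s = 3 the average cost becomes CPI = 1 + 0.2 · 0.6 · 3 = 1.36 clock cycles per instruction.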
8.2.2 Conclusion
Pipelining speeds up the instruction throughput (although the execution of a single instruction is not accelerated). The ideal speed-up cannot be reached because of such hazards, and because of instruction inter-dependencies that sometimes require that one instruction is finished before another can begin.
There are techniques to reduce the occurrence of such hazards, but they
can never be avoided entirely.
Figure 8.15: The Cray-1 with transparent panels (left) and the
Paragon XP-E (right)
But even cheaper and obtainable for the common user are Ethernet clusters of individual computers, or even computer grids connected over the Internet. Both of these, obviously, suffer from massive communication overhead, and especially the latter are best used for so-called ‘embarrassingly parallel problems’, i.e., computation problems that require no or only minimal communication between the computation nodes.
For example, many processors sport both an ALU and an FPU. Thus, they should be able to execute an integer and a floating-point operation simultaneously. Data access operations require neither the ALU nor the FPU (or have a dedicated ALU for address operations) and can thus also be executed at the same time.
appropriate pipelines. Among the first three instructions, there are two
arithmetic instructions, so only one of them can immediately be dispatched.
The other has to wait and is dispatched in the next clock cycle together with
the next two instructions, a memory access and a floating-point operation.
Compiler level support can group instructions to optimize the potential for
parallel execution.
Retirement stage: The pipelining stage that takes care of finished instruc-
tions and makes the result appear consistent with the execution se-
quence that was intended by the programmer.
Part II
Low-level programming
Chapter 9
Introduction to
low-level
programming
This part of the INF2270 compendium describes low-level programming,
i.e., programming very close to the computer hardware.
Chapter 10
Programming in C
C is one of the most popular languages today and has been so for more than
30 years. One of the reasons for its success has been that it combines quite
readable code with very high execution speed.
There have been several versions of C; in this course we will use ANSI C from 1989.
10.1 Data
C has quite a substantial set of data types.
C has no special data type for Boolean values; plain integers are used instead, with the interpretation
0: false
≠ 0: true
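A small example (not from the compendium) of integers serving as truth values:

#include <stdio.h>

int main(void)
{
    int flag = 5;                    /* any value different from 0 counts as true */
    if (flag)
        printf("5 is treated as true\n");
    flag = 0;                        /* 0 is the only value that counts as false  */
    if (!flag)
        printf("0 is treated as false\n");
    return 0;
}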
10.1.1.2 Characters
C has no special type for characters either; they should be stored in
unsigned char variables using the encoding described in 11.1 on page 88.²
10.1.2 Texts
Since C has no native data type for texts, an array of unsigned chars should
be used; the end of the text is marked by a byte containing 0.
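For illustration (not from the compendium), a text and a simple length count could look like this:

#include <stdio.h>

int main(void)
{
    unsigned char text[] = { 72, 105, 33, 0 };   /* the ASCII codes for "Hi!" plus the 0 byte */
    int len = 0;
    while (text[len] != 0)                       /* the byte containing 0 marks the end       */
        len++;
    printf("%s has length %d\n", (char *) text, len);
    return 0;
}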
¹ Most C implementations can also store 8-byte integer data which have the type long long, but this is not in the ANSI C standard.
² More recent versions of C have support for more extensive character sets, like Unicode, but this is not covered in this course.
10.2 Statements
The statements in C are listed in table 10.3 on the next page.
10.3 Expressions
C has quite an extensive set of expression operators with a confusing number of precedence levels; use parentheses if you are in any doubt. The whole set of operators is shown in table 10.4 on page 86.
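One classic pitfall, sketched here as an illustration (not from the compendium): the masking operators bind more weakly than the comparison operators, so parentheses are needed when testing a single bit.

#include <stdio.h>

int main(void)
{
    int flags = 6;
    /* Without the inner parentheses this would parse as flags & (4 != 0),
       since & (level 8) binds more weakly than != (level 9).             */
    if ((flags & 4) != 0)
        printf("bit 2 is set\n");
    return 0;
}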
Table 10.3 (excerpt):
Expression statement   〈expr〉;
Null statement         ;
switch statement       switch (〈expr〉) {
                         case 〈const〉: 〈S〉 〈S〉 ...
                         case 〈const〉: 〈S〉 〈S〉 ...
                         default: 〈S〉
                       }
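A small usage example of the switch statement (not from the compendium); note that execution falls through from one case to the next unless a break statement is inserted:

#include <stdio.h>

int main(void)
{
    int day = 6;
    switch (day) {
      case 6:                        /* falls through to case 7 */
      case 7:
        printf("weekend\n");
        break;                       /* break leaves the switch */
      default:
        printf("weekday\n");
    }
    return 0;
}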
Level  Op        Meaning
15     ()        Function call
       []        Array element
       .         Member (of struct or union)
       ->        Member (accessed via pointer)
14     !         Logical negation
       ~         Masking negation
       -         Numeric negation
       ++        Increment
       --        Decrement
       &         Address
       *         Indirection
       (type)    Type cast
       sizeof    Size in bytes
13     *         Multiplication
       /         Division
       %         Remainder
12     +         Addition
       -         Subtraction
11     <<        Left shift
       >>        Right shift
10     <         Less than test
       <=        Less than or equal test
       >         Greater than test
       >=        Greater than or equal test
9      ==        Equality test
       !=        Inequality test
8      &         Masking and
7      ^         Masking exclusive or
6      |         Masking or
5      &&        Logical and
4      ||        Logical or
3      ?:        Conditional evaluation
2      =         Assignment
       *= /= %= += -=    Updating
       <<= >>= &= ^= |=
1      ,         Sequential evaluation
Chapter 11
Character encodings
A character encoding is a table of which numbers represent which characters. There are dozens of encodings; the four most common today are ASCII, Latin-1, Latin-9 and Unicode.
11.1 ASCII
This very old 7-bit encoding survives today only as a subset of other
encodings; for instance, the left half of Latin-1 (see Table 11.1 on the next
page) is the original ASCII encoding.
11.2 Latin-1
The official name of this 8-bit encoding is ISO 8859-1; it is shown in
Table 11.1 on the following page.
11.3 Latin-9
This encoding is a newer version of Latin-1; its official name is ISO 8859-
15. Only eight characters were changed; they are shown in Table 11.2 on
page 89.
11.4 Unicode
Unicode is a gigantic 21-bit encoding intended to encompass all the world’s
characters; for more information, see http://www.unicode.org/.
11.4.1 UTF-8
UTF-8 is one of several ways to store Unicode’s 21-bit representation
numbers. One advantage of UTF-8 is that it is quite compact; the most
commonly used characters are stored in just one byte, others may need two
or three or up to four bytes, as shown in Table 11.3 on page 89.
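A sketch (not from the compendium) of how a Unicode code point can be packed into 1–4 UTF-8 bytes; the function name and buffer handling are made up for this illustration:

/* Encode code point cp (0 .. 0x10FFFF) into buf; returns the number of bytes
 * written. Continuation bytes always have the form 10xxxxxx.               */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                 /*  7 bits: 0xxxxxxx                   */
        buf[0] = (unsigned char) cp;
        return 1;
    } else if (cp < 0x800) {         /* 11 bits: 110xxxxx 10xxxxxx          */
        buf[0] = 0xC0 | (unsigned char) (cp >> 6);
        buf[1] = 0x80 | (unsigned char) (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {       /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xE0 | (unsigned char) (cp >> 12);
        buf[1] = 0x80 | (unsigned char) ((cp >> 6) & 0x3F);
        buf[2] = 0x80 | (unsigned char) (cp & 0x3F);
        return 3;
    } else {                         /* 21 bits: 11110xxx and three continuation bytes */
        buf[0] = 0xF0 | (unsigned char) (cp >> 18);
        buf[1] = 0x80 | (unsigned char) ((cp >> 12) & 0x3F);
        buf[2] = 0x80 | (unsigned char) ((cp >> 6) & 0x3F);
        buf[3] = 0x80 | (unsigned char) (cp & 0x3F);
        return 4;
    }
}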
[Table 11.1: The ISO 8859-1 (Latin-1) encoding, listing all 256 characters with their decimal, octal and hexadecimal codes. © April 1995, DFL, Ifi/UiO]
Table 11.2: The eight characters that differ between Latin-1 and Latin-9
Latin-1   ¤ ¦ ¨ ´ ¸ ¼ ½ ¾
Latin-9   € Š š Ž ž Œ œ Ÿ
Chapter 12
Assembly
programming
A computer executes machine code programs in which instructions are encoded as bit patterns; for instance, on an x86 processor, a particular sequence of five bytes tells the processor to move the value 19 (= 13hex) into the %EAX register.
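In assembler notation (introduced below) the same instruction can be written symbolically. As background information not taken from the compendium: the standard x86 encoding of a move of a 32-bit immediate value into %EAX is the opcode B8 followed by the four value bytes in little-endian order, i.e., B8 13 00 00 00 for the value 19.

movl $0x13, %eax        # AT&T syntax: %EAX <- 19 (13 hex)
                        # Intel syntax: mov eax, 13h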
.align n adds extra bytes until the address has all 0s in the n least
significant bits.
                      Intel                 AT&T
Constants (decimal)   4                     $4
Constants (hex)       123h                  $0x123
Registers             eax                   %eax
Sequence              res, op, op, ...      op, op, ..., res
Size                  mov                   movl
Type specification    mov ax, WORD PTR v
Indexing              [eax+1]               1(%eax)
.globl specifies that a name is to be known globally (and not just within the
file).
12.1.3 Comments
The character “#” will make the rest of the line a comment. Blank lines are
ignored.
(The -m32 option specifies that we are treating the processor as a 32-bit
one.)
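As an illustration (the file names are made up; the commands are the standard GNU tools), a C main program and an assembly source file could be assembled and linked as 32-bit code like this:

gcc -m32 -o prog main.c func.s    # compile, assemble and link in one step
as --32 -o func.o func.s          # or assemble the assembly file separately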
[Figure 12.1: the most commonly used x86 registers: the general-purpose and index registers (among them %EAX with its 16-bit part %AX, %ECX with %CX, and %ESI) and the floating-point stack registers %ST(0)–%ST(7)]
Note that CygWin uses a slightly different convention for global names: the
name “xxx” in C is known as “_xxx” in the assembly language. Fortunately,
it is possible to comply with both the Linux and CygWin conventions by
defining every global name twice, as in
.globl funcname
.globl _funcname
funcname:
_funcname:
  ⋮
12.3 Registers
The most commonly used registers are shown in Figure 12.1.
In the instruction tables below, the size suffix {bwl} is one of
  -b   byte
  -w   word (= 2 bytes)
  -l   long (= 4 bytes)
and the floating-point size suffix {s} is one of
  -l   a double
  -s   a float
Data movement
lea{bwl} {cmr} ,{mr} Copy address {mr} ← Adr({cmr} )
mov{bwl} {cmr} ,{mr} Copy data {mr} ← {cmr}
pop{wl} {r} Pop value {r} ← pop
push{wl} {cr} Push value push {cr}
Block operations
cld Clear D-flag D←0
cmpsb Compare byte (%EDI) − (%ESI); %ESI ← %ESI ± 1; %EDI ← %EDI ± 1 4 4 4
movsb Move byte (%EDI) ← (%ESI); %ESI ← %ESI ± 1; %EDI ← %EDI ± 1
rep 〈instr〉 Repeat Repeat 〈instr〉 %ECX times
repnz 〈instr〉 Repeat until zero Repeat 〈instr〉 %ECX times while Z̄
repz 〈instr〉 Repeat while zero Repeat 〈instr〉 %ECX times while Z
scasb Scan byte %AL − (%EDI); %EDI ← %EDI ± 1 4 4 4
std Set D-flag D←1
stosb Store byte (%EDI) ← %AL; %EDI ← %EDI ± 1
Arithmetic
adc{bwl} {cmr} ,{mr} Add with carry {mr} ← {mr} + {cmr} + C 4 4 4
add{bwl} {cmr} ,{mr} Add {mr} ← {mr} + {cmr} 4 4 4
dec{bwl} {mr} Decrement {mr} ← {mr} − 1 4 4
divb {mr} Unsigned divide %AL ← %AX / {mr}; %AH ← %AX mod {mr} ? ? ?
divw {mr} Unsigned divide %AX ← %DX:%AX / {mr}; %DX ← %DX:%AX mod {mr} ? ? ?
divl {mr} Unsigned divide %EAX ← %EDX:%EAX / {mr}; %EDX ← %EDX:%EAX mod {mr} ? ? ?
idivb {mr} Signed divide %AL ← %AX / {mr}; %AH ← %AX mod {mr} ? ? ?
idivw {mr} Signed divide %AX ← %DX:%AX / {mr}; %DX ← %DX:%AX mod {mr} ? ? ?
idivl {mr} Signed divide %EAX ← %EDX:%EAX / {mr}; %EDX ← %EDX:%EAX mod {mr} ? ? ?
imulb {mr} Signed multiply %AX ← %AL × {mr} 4 ? ?
imulw {mr} Signed multiply %DX:%AX ← %AX × {mr} 4 ? ?
imull {mr} Signed multiply %EDX:%EAX ← %EAX × {mr} 4 ? ?
imul{wl} {cmr} ,{mr} Signed multiply {mr} ← {mr} × {cmr} 4 ? ?
inc{bwl} {mr} Increment {mr} ← {mr} + 1 4 4
mulb {mr} Unsigned multiply %AX ← %AL × {mr} 4 ? ?
mulw {mr} Unsigned multiply %DX:%AX ← %AX × {mr} 4 ? ?
mull {mr} Unsigned multiply %EDX:%EAX ← %EAX × {mr} 4 ? ?
neg{bwl} {mr} Negate {mr} ← −{mr} 4 4 4
sub{bwl} {cmr} ,{mr} Subtract {mr} ← {mr} − {cmr} 4 4 4
Masking
and{bwl} {cmr} ,{mr} Bit-wise AND {mr} ← {mr} ∧ {cmr} 0 4 4
not{bwl} {mr} Bit-wise invert {mr} ← {mr}
or{bwl} {cmr} ,{mr} Bit-wise OR {mr} ← {mr} ∨ {cmr} 0 4 4
xor{bwl} {cmr} ,{mr} Bit-wise XOR {mr} ← {mr} ⊕ {cmr} 0 4 4
Extensions
cbw Extend byte→word 8-bit %AL is extended to 16-bit %AX
cwd Extend word→double 16-bit %AX is extended to 32-bit %DX:%AX
cwde Extend double→ext Extends 16-bit %AX to 32-bit %EAX
cdq Extend ext→quad Extends 32-bit %EAX to 64-bit %EDX:%EAX
Shifting
rcl{bwl} {c},{mr} Left C-rotate {mr} ← 〈{mr}, C〉 rotated left by {c} 4
rcr{bwl} {c},{mr} Right C-rotate {mr} ← 〈{mr}, C〉 rotated right by {c} 4
rol{bwl} {c},{mr} Left rotate {mr} ← {mr} rotated left by {c} 4
ror{bwl} {c},{mr} Right rotate {mr} ← {mr} rotated right by {c} 4
sal{bwl} {c},{mr} Left shift {mr} ← {mr} shifted left by {c}, 0 shifted in 4 4 4
sar{bwl} {c},{mr} Right arithmetic shift {mr} ← {mr} shifted right by {c}, sign bit S shifted in 4 4 4
shr{bwl} {c},{mr} Right logical shift {mr} ← {mr} shifted right by {c}, 0 shifted in 4 4 4
Testing
bt{wl} {c} ,{mr} Bit-test bit {c} of {mr} 4
btc{wl} {c},{mr} Bit-change bit {c} of {mr} ← ¬(bit {c} of {mr}) 4
btr{wl} {c} ,{mr} Bit-clear bit {c} of {mr} ←0 4
bts{wl} {c} ,{mr} Bit-set bit {c} of {mr} ←1 4
cmp{bwl} {cmr}1 ,{mr}2 Compare values {mr}2 − {cmr}1 4 4 4
test{bwl} {cmr}1 ,{cmr}2 Test bits {cmr}2 ∧ {cmr}1 4 4 4
Jumps
call {} Call push %EIP; %EIP ← {}
ja {} Jump on unsigned > if Z̄ ∧ C̄ : %EIP ← {}
jae {} Jump on unsigned ≥ if C̄ : %EIP ← {}
jb {} Jump on unsigned < if C : %EIP ← {}
jbe {} Jump on unsigned ≤ if Z ∨ C : %EIP ← {}
jc {} Jump on carry if C : %EIP ← {}
je {} Jump on = if Z : %EIP ← {}
jmp {} Jump %EIP ← {}
jg {} Jump on > if Z̄ ∧ S = O : %EIP ← {}
jge {} Jump on ≥ if S = O : %EIP ← {}
jl {} Jump on < if S ≠ O : %EIP ← {}
jle {} Jump on ≤ if Z ∨ S ≠ O : %EIP ← {}
jnc {} Jump on non-carry if C̄ : %EIP ← {}
jne {} Jump on ≠ if Z̄ : %EIP ← {}
jns {} Jump on non-negative if S̄ : %EIP ← {}
jnz {} Jump on non-zero if Z̄ : %EIP ← {}
js {} Jump on negative if S : %EIP ← {}
jz {} Jump on zero if Z : %EIP ← {}
loop {} Loop %ECX ← %ECX − 1; if %ECX ≠ 0: %EIP ← {}
ret Return %EIP ← pop
Miscellaneous
rdtsc Fetch cycles %EDX:%EAX ← 〈number of cycles〉
Load
fld1 Float load 1 Push 1.0
fildl {m} Float int load long Push long {m}
fildq {m} Float int load quad Push long long {m}
filds {m} Float int load short Push short {m}
fldl {m} Float load long Push double {m}
flds {m} Float load short Push float {m}
fldz Float load zero Push 0.0
Store
fistl {m} Float int store long Store %ST(0) in long {m}
fistpl {m} Float int store and pop long Pop %ST(0) into long {m}
fistpq {m} Float int store and pop quad Pop %ST(0) into long long {m}
fistq {m} Float int store quad Store %ST(0) in long long {m}
fistps {m} Float int store and pop short Pop %ST(0) into short {m}
fists {m} Float int store short Store %ST(0) in short {m}
fstl {m} Float store long Store %ST(0) in double {m}
fstpl {m} Float store and pop long Pop %ST(0) into double {m}
fstps {m} Float store and pop short Pop %ST(0) into float {m}
fsts {m} Float store short Store %ST(0) in float {m}
Arithmetic
fabs Float absolute %ST(0) ←|%ST(0)|
fadd %ST(X) Float add %ST(0) ← %ST(0) + %ST(X)
fadd{s} {m} Float add %ST(0) ← %ST(0)+ float/double {m}
faddp {m} Float add and pop %ST(1) ← %ST(0) + %ST(1); pop
fchs Float change sign %ST(0) ←− %ST(0)
fdiv %ST(X) Float div %ST(0) ← %ST(0) ÷ %ST(X)
fdiv{s} {m} Float div %ST(0) ← %ST(0)÷ float/double {m}
fdivp {m} Float reverse div and pop %ST(1) ← %ST(0) ÷ %ST(1); pop
fdivrp {m} Float div and pop %ST(1) ← %ST(1) ÷ %ST(0); pop
fiadd{s} {m} Float int add %ST(0) ← %ST(0)+ short/long {m}
fidiv{s} {m} Float int div %ST(0) ← %ST(0)÷ short/long {m}
fimul{s} {m} Float int mul %ST(0) ← %ST(0)× short/long {m}
fisub{s} {m} Float int sub %ST(0) ← %ST(0)− short/long {m}
fmul %ST(X) Float mul %ST(0) ← %ST(0) × %ST(X)
fmul{s} {m} Float mul %ST(0) ← %ST(0)× float/double {m}
fmulp {m} Float mul and pop %ST(1) ← %ST(0) × %ST(1); pop
fsqrt Float square root %ST(0) ← √%ST(0)
fsub %ST(X) Float sub %ST(0) ← %ST(0) − %ST(X)
fsub{s} {m} Float sub %ST(0) ← %ST(0)− float/double {m}
fsubp {m} Float reverse sub and pop %ST(1) ← %ST(0) − %ST(1); pop
fsubrp {m} Float sub and pop %ST(1) ← %ST(1) − %ST(0); pop
fyl2xp1 Float y × log2(x + 1) %ST(1) ← %ST(1) × log2(%ST(0) + 1); pop
Stack operations
fld %ST(X) Float load Push copy of %ST(X)
fst %ST(X) Float store Store copy of %ST(0) in %ST(X)
fstp %ST(X) Float store and pop Pop %ST(0) into %ST(X)
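As a small, hedged example (not from the compendium; the label name is made up) of how these instructions combine with the assembler notation from section 12.1: a function returning the sum of its two 32-bit integer arguments, following the usual 32-bit Linux calling convention where arguments are passed on the stack and the result is returned in %EAX.

.globl add2
add2:
    movl 4(%esp), %eax      # first argument (just above the return address)
    addl 8(%esp), %eax      # add the second argument
    ret                     # result is returned in %EAX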
Appendix A
Questions Catalogue
A.1 Introduction to Digital Electronics
1) What is ‘digital’ electronics?
4) Can you list logic gates and their corresponding Boolean function?
5) Can you show what it means that the AND and OR operators are commutative, associative and distributive?
7) How do you set up a Karnaugh map and how do you use it to simplify a Boolean expression?
8) What does a Karnaugh map look like that will still result in a very long and complicated expression?
10)
2) Can you draw a digital circuit with inverters, AND and OR gates that implements the XNOR function?
4) What does the abbreviation ALU stand for and what is its task?
5) Can you describe the simple 1-bit ALU as depicted in the compendium?
6) What do DRAM and SRAM stand for, what is their task and what are their differences?
9) What is a sense-amplifier?
11) . . .
Index

.align, 91
.bss, 91
.byte, 91
.data, 92
.fill, 92
.globl, 92
.long, 92
.text, 92
.word, 92
adder, 28
ALU, 46, 47
ANSI C, 83
arithmetic and logic unit, 46, 47
arithmetic right shift, 10
as, 92
ASCII, 87
assembly code, 91
associative, 14
asynchronous, 31
asynchronous FSM, 39
asynchronous latches, 32
AT&T-notation, 92
average clock cycles per instruction, 71
binary addition, 8
binary division, 10
binary electronics, 5
binary multiplication, 10
binary numbers, 7
binary subtraction, 8
bit, 6
Boolean algebra, 13
Boolean algebraic rules, 14
Boolean expressions, 13
Boolean functions, 13
Boolean in C, 83
Boolean operator, 13
Boolean operators, 13
C, 83
carry bit, 28
char in C, 84
character, 87
circuit analysis, 21
circuit design, 21
CISC, 58, 70
CLK, 34
clock, 31, 34
clock cycle, 34
clock period, 34
clocked, 32
combinational logic circuits, 21
comments in assembly, 92
communication overhead, 75
communication protocol, 59
commutative, 14
complex instruction set computers, 58, 70
control unit, 45
counter, 40, 42
CPI, 71
CygWin, 93
D-flip-flop, 36
D-latch, 32
data hazards, 72
data path, 47
DE, 70
de Morgan, 14
decode stage, 70
decoder, 24
demultiplexer, 25
digital electronics, 5
dirty cache, 64
distributive, 14
double in C, 84
DRAM, 51, 52
dynamic random access memory, 51, 52
encoder, 24
encoding, 87
EX, 70
execute stage, 70
expressions in C, 84, 86
finite state machine, FSM, 37
flip-flop, 31
float in C, 84
floating-point, 97
floating-point values in C, 84
full adder, 28
WB, 70
write back stage, 70