The Mathematics That Power Our World How Is It Made
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.
ISBN 978-981-4730-84-6
ISBN 978-981-3144-08-8 (pbk)
Printed in Singapore
“To the person who made me watch dozens of the How it’s
made episodes during his childhood years, who continues to
inspire me every day with his never-ending curiosity. To the
better version of me, my son Michel.”
Joseph Khoury
Preface
In the early 21st century, a TV show called How it’s made premiered in
North America and quickly grew in popularity among viewers of all ages.
The purpose of the show was to look behind the scenes to explain in sim-
ple terms how common everyday items are actually made. Although many
episodes of the show featured simple things like the jeans we wear, the bi-
cycle we ride or even some of the processed food we eat, it was certainly an
eye opener on the ingenuity and effort behind the simplest things we use
on a daily basis.
application, among many others, that would not be possible without the
great accomplishments of number theorists such as Hardy. Regardless of how
we value research in mathematics, the fact remains that a wide variety
of phenomena around us are governed by mathematical models. Pushing
for innovation and research in applied and pure mathematics is key to making
further advances in science, technology and even medicine. To say the
least, mathematics is certainly a set of tools, among many others, that
helps make modern technology function well.
With the hope of convincing students that there is a need to acquire mathe-
matical skills, and to introduce the general public to the pivotal role played
by mathematics in our lives, we started our endeavor of “looking under
the hood” at the engine that makes most of the technology around us run
smoothly. But there is another, stronger motivation for starting this book.
After years of teaching various areas of undergraduate mathematics, we re-
alized that the traditional place held by mathematics in education for many
centuries is taking a step backward. In the name of reform and adapting to
a fast changing world, the learning of mathematics has unfortunately de-
graded in many cases into an empty drill of memorization of miscellaneous
techniques rather than a foundation of scientific reasoning critical in any
aspect of knowledge. More and more, the mathematical community seems
to be divided into two extremes. On one extreme, we have mathematicians
who disassociate their teaching almost completely from any aspect of sci-
entific thinking. They transfer their knowledge of the subject matter in the
form of “recipes” for their students to follow almost blindly. On the other
extreme, we have the group of mathematicians with an overemphasis on ab-
straction and almost a complete disconnection from real applications even
in early service undergraduate courses in mathematics. Our hope is that
this book will serve as a middle ground between the two extremes.
The topics for the five chapters of the book are carefully chosen to strike
a delicate balance between relevant common applications and a reasonable
dose of mathematics. In most chapters, the mathematical maturity needed
is acquired after a year of studies at the university level in any branch of
science or engineering. However, self-motivated advanced high school stu-
dents with a strong desire to acquire more knowledge and a willingness to
expand their horizons beyond the school curriculum can certainly benefit
a great deal from researching and understanding the mathematics in the
book. The topics discussed in the book are also great resources for high
school teachers and university professors who can use the various applications in their teaching.
Throughout the book, every effort was made to keep the mathematical requirements
to a minimum. For some advanced topics, like the theory of finite
fields and the notion of primitive polynomials in Chapter 4, the mathematical
background is presented in an almost self-contained fashion. All the topics
are presented with a fair amount of detail. Chapters are independent to a
large extent. The reader can choose a topic to read without acquiring full
knowledge of previous chapters. At the beginning of each chapter, a small
section entitled Before you go further is included to give the reader an idea
of the level of mathematical knowledge required to fully understand the
chapter. The organization of the chapters in the book is as follows.
Chapter 3 deals with the JPEG standard for image compression, whose central tool is a transformation called the Discrete Cosine Transform. Concepts from linear algebra
like matrix manipulations, linear independence and orthogonal bases are
used. Some knowledge of basic properties of trigonometric functions and
complex numbers is also required.
Chapter 4 is devoted to the study of the GPS system both from the satel-
lite and the receiver ends. Although the mathematics used by the receiver
to locate positions on the surface of the planet is fairly simple, the nature
of the signals emitted by the satellites and the way the receiver interprets
and treats them require heavy mathematics. Mathematical preliminaries
necessary to understand the signal structure include group theory, modular
arithmetic, the finite field Fp, polynomial rings over Fp and the notion
of primitive polynomials. This chapter is certainly the richest in terms of
mathematical knowledge. If you have a curious mind and enjoy new chal-
lenges, this is definitely a chapter for you.
A word of caution
While every attempt is made to make every chapter in the book as complete
as possible, some technical details are omitted as we are not experts in the
specific domain of application nor do we claim to be. Technicalities like
the way a circuit is wired, the type of transistors needed for a particular
design or the nature of the electrical pulse of a satellite signal are beyond
the scope of this book. Interested readers are encouraged to look up these
aspects in books written by experts in the domain of application.
Contents
Preface

1. What makes a calculator calculate?
1.1 Introduction
1.1.1 A view from the inside
1.1.2 Before you go further
1.2 Number systems
1.2.1 Why 0’s and 1’s?
1.2.2 The binary system
1.2.3 Binary Coded Decimal representation (BCD)
1.2.4 Signed versus unsigned binary numbers
1.3 Binary arithmetics
1.3.1 Binary addition of unsigned integers
1.3.2 Binary addition of signed integers
1.3.3 Two’s complement subtraction
1.4 Logic
1.4.1 Logic gates
1.5 Boolean Algebra
1.5.1 Sum of products - Product of sums
1.5.2 Sum of products
1.5.3 Product of sums
1.6 Digital adders
1.6.1 Half-adder
1.6.2 Full-adder
1.6.3 Lookahead adder
1.6.4 Two’s complement implementation
1.6.5 Adder-subtractor combo

Index
Chapter 1

What makes a calculator calculate?
1.1 Introduction
With the recent advancements in technology, you can hardly avoid seeing an
electronic calculator around you, as there is one built into almost every device
you own: your phone, computer, tablet or even your wristwatch. We
trust them blindly in our everyday tasks without questioning the answers
they display. But have you ever wondered, “How can your pocket calculator
do complex mathematical operations with such high precision in the blink
of an eye?” If you do not know the answer to that, you are certainly not
alone. Ask your friends, even your math and science instructors and you
will be surprised how little is known about the basics of this electronic
device. This chapter takes you on a journey to explore some of the logic
that powers digital computing.
and then saying that the multiplier (or coefficient) of 10^0 (4 in this example)
is called the units digit, that of 10^1 (3 in this example) is called the
tens digit and the other two multipliers (2 and 1) are called the hundreds
and the thousands digits respectively. One thing the teacher did not explain
at that time is why we chose powers of 10 in the above expansion. With
time, I came to realize that there is really nothing special about the number
10 aside from the fact that humans have 10 fingers and 10 toes and that
Given an integer b ≥ 2, we can talk about the number system with base b
in a similar way to our decimal system. In such a system, every number
can be written using the digits from the set S = {0, 1, 2, . . . , b − 1}. More
precisely, if N = Nk−1 Nk−2 . . . N1 N0 (where each Ni ∈ S) is a number in
the system then N holds the following decimal value:
N = Nk−1 × b^(k−1) + Nk−2 × b^(k−2) + · · · + N1 × b + N0.
For the number N = Nk−1 Nk−2 . . . N1 N0, the digits N0 and Nk−1 are
called the least significant digit (LSD) and the most significant digit (MSD)
respectively. If c > b, then a number N written in base b can clearly
be interpreted as a number in base c as well. This confusion can be avoided
by specifying the base as a subscript and writing (N)_b. Hence,
(123)_4 = 1 × 4^2 + 2 × 4^1 + 3 × 4^0 = 27 and (123)_6 = 1 × 6^2 + 2 × 6^1 + 3 × 6^0 = 51.
Reading the column of remainders from bottom to top gives the follow-
ing binary representation of 165: 10100101.
The second algorithm works well for relatively small decimal numbers. We
start by displaying the first 15 powers of 2:
2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16
2^5 = 32, 2^6 = 64, 2^7 = 128, 2^8 = 256, 2^9 = 512
2^10 = 1024, 2^11 = 2048, 2^12 = 4096, 2^13 = 8192, 2^14 = 16384.
Given a decimal number n, we look for the largest integer r1 such that
2^r1 ≤ n and we let n1 = n − 2^r1. Again, look for the largest integer r2 such
that 2^r2 ≤ n1 and let n2 = n1 − 2^r2. Repeat this process for all successive
values ni until you hit a certain k with nk = 0. Starting with the largest
power of 2 appearing in the above process, we record 1 as a multiplier of
2^j if 2^j is used and 0 if 2^j is not used in the process. If 2^l is the largest
power of 2 appearing in the above process, then this method gives a binary
representation with l + 1 bits. Let us look at an example by revisiting the
decimal number 165 treated in the first method. The largest power of 2
less than or equal to 165 is 2^7 = 128. Subtracting 128 from 165
gives 37. The largest power of 2 not exceeding 37 is 2^5 = 32. Now
37 − 32 = 5 and the largest power of 2 less than or equal to 5 is 2^2 = 4.
Finally, 5 − 4 = 1 and 1 − 2^0 = 0. The powers of 2 appearing in the
above process are 2^7, 2^5, 2^2 and 2^0. Therefore
165 = 1 × 2^7 + 0 × 2^6 + 1 × 2^5 + 0 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0
and the binary representation of 165 is the following 8-bit number:
10100101. Note that we have 8 bits in the binary representation of 165
since the largest power of 2 used is 2^7.
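To make the second algorithm concrete, here is a minimal Python sketch of it (the function name and structure are ours, not part of the text); it reproduces the conversion of 165 worked out above:

```python
def to_binary(n):
    """Convert a non-negative decimal integer to a binary string by
    repeatedly subtracting the largest power of 2 (the second algorithm)."""
    if n == 0:
        return "0"
    length = n.bit_length()            # l + 1 bits, where 2^l <= n < 2^(l+1)
    bits = ["0"] * length
    while n > 0:
        r = n.bit_length() - 1         # largest r with 2^r <= n
        bits[length - 1 - r] = "1"     # record a 1 as the multiplier of 2^r
        n -= 1 << r                    # subtract 2^r and repeat
    return "".join(bits)

print(to_binary(165))   # prints 10100101, as in the worked example
```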
Decimal BCD
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001
For example, the decimal number 427 is encoded in BCD digit by digit:
427 → 0100 0010 0111 (0100 for 4, 0010 for 2, 0111 for 7).
Note that with 4 bits, one can form 2^4 = 16 different binary codes, which
means that 6 codes are not used in the BCD system. The binary codes
1010 (number 10 in decimal) through 1111 (number 15 in decimal) are considered
invalid codes and cannot be used in a digital design operating
on the BCD system.
Proposition 1.1 above shows that in an n-bit machine, the decimal range
of unsigned binary numbers is from 0 to 2^n − 1. In the sign-magnitude
format, one bit (the leftmost) is used as a sign, which means that in this format
the decimal range is from −(2^(n−1) − 1) to 2^(n−1) − 1. The bad news is that zero has
two possible representations: 000 · · · 00 (which represents +0) and 100 · · · 00
(which represents −0).
Assuming the movie running time is long enough (and you do not stop
it), the counter will eventually read 9999. One second after that, it goes
back to 0000. Now, imagine at this moment you hit the stop button and
then rewind the movie for 5 seconds. The counter will probably read 9995.
Clearly, the reading ‘9995’ in this case does not mean that the movie has
been playing for 9995 seconds. To avoid the ambiguity on what 9995 means
on the counter, one has to interpret the range 0 to 9999 a little differently.
Note that 9995 + 5 = 10000 = 10^4 and since the counter can only handle
four digits, the leftmost digit (1 in this case) is dropped and we get 0000. This
suggests that 9995 can be interpreted as −5 in this scenario since 9995 + 5
results in 0000 displayed on the counter, exactly like (−5) + 5 = 0.
The above analogy with a digital counter was made to justify the follow-
ing definition. Given an n-bit binary number N, the two’s complement of N
is defined to be the n-bit binary number N_{2’s} satisfying N + N_{2’s} = 1 0 · · · 0
(a 1 followed by n 0’s), with the “+” sign referring to binary addition that we will discuss in Section
1.3.1. Notice the analogy with the equation 9995 + 5 = 10000 = 10^4
in the counter example above and the fact that the binary form of 2^n is
precisely 1 0 · · · 0 (a 1 followed by n 0’s). In other words, finding the two’s
complement of a binary number means finding the opposite (or negative)
of this number.
right keeping the first three 0’s and the first 1 we encounter. After that,
we invert the remaining bits of the block 1010 obtaining 0101. We then
obtain 01011000 as the two’s complement of 10101000. Let us next look at
the two’s complement of the binary number 11100011. Here, the rightmost
digit is 1 that we output as 1 and we invert the remaining digits, giving
(11100011)_{2’s} = 00011101. As you will see in Section 1.3.1, adding the
binary numbers 11100011 and 00011101 would result in the 9-digit binary
number 100000000. Since we are working in an 8-bit system, we simply
ignore the leftmost bit (exactly like the digital counter would show 0000
instead of 10000) and get 00000000. Note that the above algorithm shows
that the two’s complement of the two’s complement of N gives back N
which is in line with the basic rule −(−N ) = N .
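The shortcut just described is easy to express in code. The following is an illustrative Python sketch (names are ours); it also checks the defining relation N + N_{2’s} = 2^n on the two examples above:

```python
def twos_complement(bits):
    """Two's complement of a binary string via the shortcut described above:
    copy bits from the right up to and including the first 1, invert the rest."""
    out = list(bits)
    i = len(bits) - 1
    while i >= 0 and bits[i] == "0":   # copy the trailing zeros unchanged
        i -= 1
    i -= 1                             # keep the first 1 we encounter unchanged
    while i >= 0:
        out[i] = "1" if bits[i] == "0" else "0"   # invert the remaining bits
        i -= 1
    return "".join(out)

for s in ("10101000", "11100011"):
    t = twos_complement(s)
    assert int(s, 2) + int(t, 2) == 2 ** len(s)   # N + N_2's = 2^n
    print(s, "->", t)   # 01011000 and 00011101, as in the text
```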
We finish this section by showing the two main advantages of the two’s
complement representation:
Consider first how we add two decimal numbers by hand, carrying a digit to the next column whenever a column sum exceeds 9:

 111011        (carry digits)
  925557
+  76375
--------
 1001932
Like in the case of decimal numbers, the sum of binary numbers involves
a carry digit (of 1) when the sum of digits in a column exceeds 1. The
following four sums give all rules needed to add any n-bit binary numbers:
0 + 0 = 00, 0 + 1 = 1 + 0 = 01, 1 + 1 = 10 and 1 + 1 + 1 = 11. Each
addition is represented by a two-digit binary number. The digit on the
right represents the “output” of the addition called the sum digit and the
digit on the left is the carry digit that we add to the digits of the column
on the left. Here is an example of adding two binary numbers, where the
digits in smaller font in the top row are, as in the decimal case, the carry
digits.
  1111 111        (carry digits)
  1100 1001
+ 1111 1111
-----------
1 1100 1000
Unlike the “pencil and paper” addition shown above, machines have to work
within certain limits and it could happen (like in the above example) that
the result of the addition exceeds the number of storage units allowed. This
is a situation known as overflow. Detecting overflow in unsigned addition
is simple: a carry out of 1 from adding the most significant bits indicates
that an overflow has occurred. Take for instance the (unsigned) addition
0111 + 1001 (corresponding to 7 + 9 in decimal) in a 4-bit machine, which
results in 10000. As the carry out is 1 from the leftmost bit, an overflow
has occurred. In fact, 7 + 9 = 16 exceeds the maximum (15) allowed by a
4-bit machine.
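As an illustration, the column-by-column rules and the overflow test can be sketched in Python as follows (a toy model of an n-bit machine with names of our choosing, not a hardware description):

```python
def add_unsigned(a, b, n_bits):
    """Add two unsigned binary strings on an n-bit machine, mimicking the
    column rules 0+0=00, 0+1=1+0=01, 1+1=10 and 1+1+1=11."""
    a, b = a.zfill(n_bits), b.zfill(n_bits)
    carry, result = 0, []
    for i in range(n_bits - 1, -1, -1):
        s = int(a[i]) + int(b[i]) + carry
        result.append(str(s % 2))     # the sum digit of this column
        carry = s // 2                # the carry digit for the next column
    overflow = (carry == 1)           # carry out of the most significant bit
    return "".join(reversed(result)), overflow

print(add_unsigned("0111", "1001", 4))   # ('0000', True): 7 + 9 overflows a 4-bit machine
```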
Let A and B be two integers (in decimal) and let (A)_{2’s} and (B)_{2’s} be
their respective representations in two’s complement. To perform A + B, the
machine starts by computing the unsigned addition of (A)_{2’s} and (B)_{2’s}.
That is, it treats (A)_{2’s} and (B)_{2’s} as unsigned numbers including their sign
bits. Any carry out bit from the addition in the leftmost column is then
ignored. Let us look at some examples using an 8-bit machine. Assume
A = 52, B = −37. To find A + B, we write A and B in two’s complement form:
(A)_{2’s} = 00110100 and (B)_{2’s} = 11011011. We then perform the sum
(A)_{2’s} + (B)_{2’s} as unsigned binaries:
   111           (carry digits)
  0011 0100
+ 1101 1011
-----------
1 0000 1111
Since the result has 9 bits, one more than the storage limit, we simply drop
the leftmost bit and get 00001111, which is 15 in decimal. The answer is
correct: 52 + (−37) = 15. Let us now consider an example of adding two
negative numbers: Assume A = −15, B = −24; then (A)_{2’s} = 11110001
and (B)_{2’s} = 11101000. Adding (A)_{2’s} and (B)_{2’s} (as unsigned) gives
11011001 (after dropping the carry out 1 from the leftmost bit). Note that
11011001 represents −39 in two’s complement, which is the correct answer: −15 + (−24) = −39.
not the same as the carry out bit from that column. This last observation
allows an easy overflow detection hardware design in machines. Let us next
consider some examples of addition using 8-bit two’s complement:
0100 0100 + 0110 0000 = 1010 0100
1010 1011 + 1010 0110 = 1 0101 0001
0011 0100 + 1101 1011 = 1 0000 1111
0010 1100 + 0010 1101 = 0101 1001
The first sum has an overflow without carry out. It gives a negative answer
to the sum of two positive numbers (the sign bit of the answer is 1 while
both the operands have 0 as leftmost bit). The second sum also presents an
overflow but with a carry out (by dropping the leftmost bit, the sum gives
a positive answer while the operands are both negative). A carry out with
no overflow occurs in the third addition since the operands are of opposite
signs (we simply drop the leftmost bit of the answer). There is no carry
out nor an overflow in the last sum.
As an illustration of two’s complement subtraction, let us compute 55 − 78 by adding 0011 0111 (the 8-bit representation of 55) to 1011 0010 (the two’s complement representation of −78):

   11  11       (carry digits)
  0011 0111
+ 1011 0010
-----------
  1110 1001
There is no carry out nor an overflow in this case. The result is a negative
number since its sign bit is 1 and its decimal value is −23 (obtained by
converting 11101001 from two’s complement to decimal as we did in Section
1.2.4.4) which is the correct answer: 55 − 78 = −23.
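The following Python sketch mimics this 8-bit two’s complement addition together with the overflow rule discussed above (carry into the sign bit compared with the carry out of it); the function and variable names are ours:

```python
def add_twos_complement(a, b, n_bits=8):
    """Add two n-bit two's complement numbers given as bit strings: perform
    the unsigned addition column by column, drop any carry out of the
    leftmost column, and flag overflow when the carry into the sign bit
    differs from the carry out of it."""
    carry, result = 0, []
    carry_into_sign = 0
    for i in range(n_bits - 1, -1, -1):
        if i == 0:
            carry_into_sign = carry       # carry entering the sign column
        s = int(a[i]) + int(b[i]) + carry
        result.append(str(s % 2))
        carry = s // 2
    overflow = (carry_into_sign != carry)
    return "".join(reversed(result)), overflow

print(add_twos_complement("00110100", "11011011"))   # ('00001111', False): 52 + (-37) = 15
print(add_twos_complement("11110001", "11101000"))   # ('11011001', False): -15 + (-24) = -39
```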
1.4 Logic
In the context of this chapter, Logic is the set of mathematical rules gov-
erning any electrical circuit design for binary arithmetic in a machine. You
will be amazed to learn that basic words like “AND”, “OR”, “NOT” can
be interpreted as “switches” and “gates” inside your computer.
p ∧ q (p AND q) is always false except in the case where p and q are both
true. To explain the truth value of the “XOR” operator (⊕), imagine I say
“Joseph is teaching a class or he is gone fishing”, then I would be telling
the truth if exactly one of the components “Joseph is teaching a class”, “he
is gone fishing” is true. Clearly the statement is false otherwise. Table 1.3
gives the truth values of the above logic operators as functions of the truth
values of their components.
So what does this “linguistic” introduction have to do with electronic
and circuit design? Like a proposition, each bit can take two values 0 or 1
and each switch in a digital circuit is also under two possible states: high
voltage or low voltage. Think of Logic as being the “brain” of any electronic
circuit that sends signals to different parts of the circuit to execute various
tasks.
Fig. 1.1 Symbols of the four basic logic gates: OR (output A ∨ B), AND (output A ∧ B), XOR (output A ⊕ B) and NOT (output ¬A).
If you look at a map of a digital electronic chip, you will most likely
see more logic gates symbols than the four listed in Figure 1.1 above. For
instance in TTL technology (Transistor-Transistor Logic), the NAND gate
plays a crucial role. The NAND gate can be interpreted as an AND gate
followed by a NOT gate. If A, B are binary inputs, then the output of a
NAND gate is ¬(A ∧ B). Similarly, the NOR gate can be interpreted as an
OR gate followed by a NOT gate; for binary inputs A and B, its output is ¬(A ∨ B).
Many tools were developed through the years to come up with the best
circuit design to perform a given task. Some of these tools are graphical,
like the Karnaugh map and the semantic tableaux and others are algebraic
like the Boolean Algebra axioms, the sum of products and the product of
sums. In this chapter, we focus on the algebraic tools as they are more
At this point, chances are you started to draw a link between the operations
of a Boolean Algebra and the logic operators defined above. In fact, in a
Boolean Algebra, the expression x+y is read “x or y”, x·y is read “x and y”
and “x′” is read “not x”. From this point of view, the above axioms become
more natural. As in the case of real numbers, the multiplication will be denoted
from this point on by juxtaposition of operands, so x·y is replaced by
xy. Also, similar to usual arithmetic, there are some precedence rules for
the order of the Boolean operations. In order from the highest to the lowest
precedence, the rules are: the parentheses, the NOT operation (“′”), the
AND operation (the multiplication), the OR operation (the addition). An
expression combining binary variables with one or more of the above laws
is often referred to as a function. For instance, f(x, y, z) = x + x′yz + y′ is a
function that takes the value 1 for the input values x = y = 0 and z = 1. In
addition to the above axioms, the operations of a Boolean Algebra satisfy
the important properties found in Table 1.4 on page 19.
Many of these laws may seem very obvious or even trivial to you at this
point. But remember that their main use in this context is in simplifying
the output function of a complex circuit and from this point of view they
could be sometimes tricky to be handled efficiently. In what follows, we
work out an example to show how Boolean Algebra is used to simplify a
certain digital circuit. Figure 1.6 shows a logic circuit with three binary
inputs x, y and z.
Following the first AND gate from the top, it is easy to see that its input
x′y′z′, x′y′z, x′yz′, x′yz,
xy′z′, xy′z, xyz′, xyz.

x′y′z + x′yz′ + xy′z + xyz + xyz′
= (x′y′z + xy′z) + (x′yz′ + xyz′) + xyz
= (x′ + x)y′z + (x′ + x)yz′ + xyz
= y′z + yz′ + xyz
= y′z + y(z′ + xz)
= y′z + y[(z′ + x)(z′ + z)]   (OR distributive law)
= y′z + y(z′ + x)
= y′z + yz′ + xy.
Fig. 1.8 A logic circuit equivalent to the circuit in Figure 1.6 with fewer gates.
Two minterms are called adjacent if they differ in one position only. For
instance, the two minterms x1x2x3 and x1x2′x3 are adjacent since they only
differ at the second position, where the variable x2 appears complemented
in the second minterm. Notice the following:
x1x2x3 + x1x2′x3 = x1x3(x2 + x2′) = x1x3,   (1.1)
since x2 + x2′ = 1.
The variable x2 which represents the “different” position of the two
minterms has completely disappeared when the two minterms are added
together. There is nothing special about the example in equation (1.1)
and every time two adjacent minterms are added, we can simplify the sum
by dropping the different position. Exploiting this property of adjacent
minterms will play a crucial role in simplifying algebraic expressions (it is
The dual notion of a minterm is the maxterm. A maxterm over the set
S = {x1 , ..., xn } of logic variables is a sum of the form y1 + y2 + · · · + yn
where each yi is either xi or xi′. Like the minterms, there are 2^n maxterms
over n logic variables. A maxterm is False for exactly one combination of
variable inputs.
representing the output for the circuit in Figure 1.6 above is called a sum
of products form of f. The name is self-explanatory. Note that in this
expression of f, each of the products x′y′z, x′yz′, xy′z, xyz, and xyz′ is a
minterm, but that is not necessarily true in every sum of products form. For
instance, the above calculations that led to the diagram in Figure 1.8 show
that an equivalent form of the same function f is given by y′z + yz′ + xy
which we still call a sum of products form of f . In other words, there are
several ways to write a Boolean expression as a sum of products and a sum
of minterms is one of them. There is however a unique way to express a
Boolean expression as a sum of minterms. Writing the (unique) sum of
minterms form of a Boolean expression f from the truth table of f can be
achieved by writing the minterm corresponding to each row in the table
where the output of f is 1 and then adding all the minterms. Although
this method provides an easy way to derive a sum of products expression,
it is not optimal in the sense that it does not produce the simplest expres-
sion for f (as we saw in Figures 1.6 and 1.8) but it is a starting point and
Boolean Algebra manipulations can then be used to simplify it. Let us look
at an example. Suppose you are given the truth table of a Boolean function
f (x, y, z), see Table 1.5.
The truth value of f is 1 in the first, fifth and eighth rows of the
table. The minterms corresponding to these rows are x′y′z′, xyz′ and
xy′z′, respectively. The sum of minterms form of f is then given by
f = x′y′z′ + xyz′ + xy′z′.
all rows in the table where the function output is 0. For each of these rows,
form the associated maxterm by complementing the minterm that corre-
sponds to that row. For example, if x = 1, y = 1, z = 0 is one row in the
truth table where the output of f is 0, then the corresponding maxterm
is (xyz′)′ = x′ + y′ + z. The product of maxterms form of f is then obtained
by multiplying all the corresponding maxterms. Let us look again
at the truth table of the function f in the previous section. The output
of f is 0 in rows 2, 3, 4, 6 and 7. The maxterm corresponding to row 2 is
(x′y′z)′ = x + y + z′. Similarly, the maxterms corresponding to rows 3, 4, 6
and 7 are (x + y′ + z′), (x + y′ + z), (x′ + y′ + z′) and (x′ + y + z′), respectively.
The product of maxterms form of f is then
f = (x + y + z′)(x + y′ + z′)(x + y′ + z)(x′ + y′ + z′)(x′ + y + z′).
Both the sum of minterms and the product of maxterms are called standard
forms of the Boolean expression.
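As a small illustration, both standard forms can be read directly off a truth table. The Python sketch below (the helper names are ours, not from the book) does exactly that for the simplified output y′z + yz′ + xy of the circuit of Figure 1.6:

```python
from itertools import product

def standard_forms(f, names=("x", "y", "z")):
    """Read the sum-of-minterms and product-of-maxterms forms of a Boolean
    function directly off its truth table, as described above."""
    minterms, maxterms = [], []
    for values in product((0, 1), repeat=len(names)):
        if f(*values):
            # row where f = 1: the minterm complements the variables equal to 0
            minterms.append("".join(n if v else n + "'" for n, v in zip(names, values)))
        else:
            # row where f = 0: the maxterm complements the variables equal to 1
            maxterms.append("(" + " + ".join(n + "'" if v else n for n, v in zip(names, values)) + ")")
    return " + ".join(minterms), "".join(maxterms)

f = lambda x, y, z: int((not y and z) or (y and not z) or (x and y))
sop, pos = standard_forms(f)
print(sop)   # x'y'z + x'yz' + xy'z + xyz' + xyz  (the sum of minterms)
print(pos)   # the corresponding product of maxterms
```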
1.6 Digital adders

We now arrive at the heart and soul of this chapter. What is the procedure
that an electronic calculator follows in order to perform a certain arithmetic
operation? The accurate answer to this question requires a fair knowledge
of all components of an electronic circuit and the physical laws that enable
these components to work effectively together. Of course, this is well be-
yond the scope of this book. Components such as the processor inside a
machine are physical implementations of mathematical ideas and designs.
Our main interest is to look beyond the hardware in order to understand
the mathematical ideas.
1.6.1 Half-adder
As we saw in Section 1.3.1, when two binary digits x, y are added, a sum
Σ and a carry C are produced. If you think of Σ(x, y) and C(x, y) as
Boolean expressions of the variables x and y, then the basic rules of adding
bits seen in Section 1.3.1 give their truth tables as illustrated in Table 1.6.
Note that the carry is always 0 except in the case where both digits
are equal to 1. This is exactly the output of the AND operator. So,
1.6.2 Full-adder
While the half-adder is probably the simplest adder circuit, it has a ma-
jor handicap. It has only two inputs which means it cannot deal with the
carry in that usually occurs in binary addition. This makes the half-adder
capable of performing binary addition only when there is no carry out (a
carry out of 0), hence the name. The building block of any digital adder
requires a system that takes into account the two bits and the carry from
the previous column. This is what a full-adder is designed to do. In the
full-adder design, we have three inputs: the two bits to be added and the
carry in from previous addition. As in the case of the half-adder, there are
two outputs: a sum and a carry out. To distinguish between the carry in
and the carry out in the full-adder, we write Cin for the carry in and Cout
for the carry out.
The truth table of the full-adder is found in Table 1.7. Note that
1 + 1 + 1 = 11 (sum of 1 and a carry of 1).
From the truth table, we can construct Boolean expressions for the sum
Σ and the carry out Cout. The idea is to first write the sum of products
for Σ and Cout and then make some simplifications using Boolean Algebra
properties:

Σ = x′y′Cin + x′yCin′ + xy′Cin′ + xyCin
  = (x′y + xy′)Cin′ + (x′y′ + xy)Cin
  = (x ⊕ y)Cin′ + (x ⊕ y)′Cin
  = (x ⊕ y) ⊕ Cin

and

Cout = x′yCin + xy′Cin + xyCin′ + xyCin
     = xy(Cin′ + Cin) + (x′y + xy′)Cin
     = xy + (x ⊕ y)Cin.
The circuit in Figure 1.10 implements the above equations using XOR
and AND gates.
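For readers who like to experiment, the simplified expressions can be checked against the defining property of the full-adder with a few lines of Python (an illustrative sketch with our own names, not a hardware model):

```python
from itertools import product

def full_adder(x, y, c_in):
    """Outputs of a full-adder using the simplified expressions above:
    sum = (x XOR y) XOR c_in and c_out = xy + (x XOR y)c_in."""
    s = (x ^ y) ^ c_in
    c_out = (x & y) | ((x ^ y) & c_in)
    return s, c_out

# every row of the truth table must satisfy x + y + c_in = 2*c_out + sum
for x, y, c in product((0, 1), repeat=3):
    s, c_out = full_adder(x, y, c)
    assert x + y + c == 2 * c_out + s
print("the expressions reproduce the full-adder truth table")
```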
Full-adders are the building blocks in any electronic device capable of
doing binary arithmetics. Recall that binary addition of two n-bit numbers
is done by adding first the two least significant bits and progressing to
the addition of the most significant bits. In the process, the carry out
produced at any stage is added to the two bits in the next position. This
can be designed in the machine by cascading together n full-adders, one
adder for each pair of bits. The carry out produced by the sum of each pair
of bits “ripples” through the chain until we get to the last carry out. Such
an adder is called a carry ripple adder. Figure 1.11 shows a block design for
a carry ripple adder for 3-bit binary numbers A2 A1 A0 and B2 B1 B0 . Each
rectangle contains a full-adder design as shown in Figure 1.10. Note that
if the first carry-in (C0 ) is 0, then it is represented by a “ground” in real
designs.
The diagram in Figure 1.12 is a detailed logic implementation of a 3-bit
ripple carry adder with XOR, OR and AND gates. The circuit adds the
two binary numbers A2A1A0 and B2B1B0 and outputs the binary number
Σ3Σ2Σ1Σ0.
Let xi and yi denote the two bits at stage i and Ci+1 be the carry-out at this stage (C0 being
the initial carry in). From the above algebraic manipulations, we have
Ci+1 = xi yi + (xi ⊕ yi )Ci . (1.3)
This equation is the key to eliminate the propagation delay in the chain.
First, let gi = xi yi and pi = xi ⊕ yi , called carry-generator and carry-
propagate respectively. Then equation (1.3) becomes Ci+1 = gi + Ci pi .
Note that at each stage, the carry-generator and the carry-propagate can
be generated independently from the previous stages. In particular, we
have the following values:
C1 = g0 + C0 p0
C2 = g1 + C1 p1
= g1 + (g0 + C0 p0 )p1
= g1 + g0 p1 + C0 p0 p1
C3 = g2 + C2 p2
= g2 + (g1 + g0 p1 + C0 p0 p1 )p2
= g2 + g1 p2 + g0 p1 p2 + C0 p0 p1 p2 .
Continuing this way, we get the following general expression for the
carry out at the ith stage:
Ci+1 = gi + gi−1 pi + gi−2 pi pi−1 + · · · + C0 pi pi−1 pi−2 . . . p0 . (1.4)
This new expression shows that the carry can be produced at any stage
without the need to know the carries from previous stages and hence avoid
the time delay. An adder designed to produce the carry digit locally based
on equation (1.4) is called a carry lookahead adder. In spite of the calcula-
tions involved to compute the carry-generator and the carry-propagate at
each stage, the carry lookahead adder design remains much faster than the
ripple carry adder in large applications.
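A small Python sketch (naming is ours) makes the point concrete: each carry computed directly from equation (1.4), using only the generate and propagate signals and the initial carry, agrees with the carry obtained by rippling through the chain:

```python
def lookahead_carry(i, g, p, c0):
    """Carry into stage i+1 computed directly from equation (1.4):
    C_{i+1} = g_i + g_{i-1} p_i + g_{i-2} p_i p_{i-1} + ... + C_0 p_i ... p_0,
    i.e. without using any previously computed carry."""
    c = 0
    for j in range(i, -1, -1):
        term = g[j]
        for k in range(j + 1, i + 1):
            term &= p[k]              # g_j propagated through p_{j+1} ... p_i
        c |= term
    term = c0
    for k in range(i + 1):
        term &= p[k]                  # the initial-carry term C_0 p_0 ... p_i
    return c | term

# two 4-bit numbers, bits listed least significant first
x, y = [0, 0, 1, 1], [1, 1, 0, 1]
g = [a & b for a, b in zip(x, y)]     # carry-generate signals g_i = x_i y_i
p = [a ^ b for a, b in zip(x, y)]     # carry-propagate signals p_i = x_i XOR y_i
ripple = [0]
for i in range(4):
    ripple.append(g[i] | (p[i] & ripple[i]))          # C_{i+1} = g_i + p_i C_i
lookahead = [lookahead_carry(i, g, p, 0) for i in range(4)]
print(ripple[1:] == lookahead)        # True: both ways give the same carries
```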
This means that the XOR gates output B = B3B2B1B0 and each output
Σi is Ai + Bi. The circuit acts like an adder in this case and performs the
operation A + B. If the value of S is 1 (high voltage through S), then the
ith XOR gate outputs Bi ⊕ 1. Now,
Bi ⊕ 1 = (Bi ∨ 1) ∧ ¬(Bi ∧ 1) = 1 ∧ ¬Bi = ¬Bi = Bi′.
The XOR gates are then just bit inverters producing the one’s complement
B′ of B. Since the initial carry in C0 is 1, the circuit performs the operation
A + B′ + 1, which amounts to adding A to the two’s complement of B
(Section 1.6.4). In other words, the circuit performs A − B and it is a
subtractor in this case.
Fig. — The seven bars of a seven-segment LED display, labeled a through g (f and b on the upper left and right, g in the middle, e and c on the lower left and right).
The next task is to come up with a simple Boolean expression for each
of the logic functions “a” through “g”. We will work out the details for bar
“a” leaving the expressions for the other bars as exercises for the reader.
From Table 1.8, we can see that it is easier to write the standard product of
sums expression for “a” rather than the sum of products, since the output of
“a” is 0 for only two input combinations (see Section 1.5.1). The product
of maxterms of “a” is given by
a = (x0 + x1 + x2 + x3′)(x0 + x1′ + x2 + x3).   (1.5)
We use the Boolean Algebra properties to come up with a simplified version
of expression (1.5):
a = (x0 + x1 + x2 + x3′)(x0 + x1′ + x2 + x3)
  = x0x0 + x0x1′ + x0x2 + x0x3 + x1x0 + x1x1′ + x1x2 + x1x3
    + x2x0 + x2x1′ + x2x2 + x2x3 + x3′x0 + x3′x1′ + x3′x2 + x3′x3,
where x0x0 = x0, x2x2 = x2, x1x1′ = 0 and x3′x3 = 0.
Next, we regroup terms using associativity and commutativity of both multiplication
and addition, starting with the single terms x0 and x2:
a = x0 + x2 + x1x3 + x1′x3′ + (x0x1′ + x1x0) + (x0x2 + x2x0)
    + (x0x3 + x0x3′) + (x1x2 + x2x1′) + (x2x3 + x3′x2)
  = x0 + x2 + x1x3 + x1′x3′ + x0(x1′ + x1) + x0x2 + x0(x3 + x3′) + x2(x1 + x1′) + x2(x3 + x3′)
  = x0 + x2 + x1x3 + x1′x3′,
since w + w′ = 1 and x + xy = x.
Using these algebraic expressions for the seven segments, we can now
design a logic circuit for the BCD to seven-segment converter.
The logic gates that we encountered in this chapter are in reality mathemat-
ical abstractions of electrical devices that vary depending on what type of
technology is used. It is easy to find and buy these devices in the market to
physically implement a designed logic circuit. Our goal in this chapter was
not to go into the technical details of a calculator operation. We hope that
with a better understanding of basic binary arithmetics, you would now
be interested in learning about the sequence of operations that take place
when a specific operation is performed in a calculator. When you press a
button, the rubber underneath makes contact with a digital circuit produc-
ing an electrical signal in the circuit. The processor of your device picks
up the signal and identifies the addresses of corresponding active “bytes”
or switches in the circuit. If for example you press a number, the processor
will store it in some place in its memory and a signal is sent to activate
the appropriate parts on the screen to display it. The same will happen
if you press another button until another operation key is pressed or you
reach the maximal number of symbols that can be displayed on the screen.
For example, when performing an arithmetic operation like addition, the
processor will display all digits of the first operand and when the + key is
pressed, the processor will store the first operand in its memory in binary
form and wipe it out from the screen. The processor will do the same as
you enter the digits of the second operand. Finally, when the = button is
pressed, the processor activates a full-adder circuit and sends a signal to
display the digits of the answer on the screen.
1.10 References
Chapter 2

Basics of data compression, prefix-free codes and Huffman codes

2.1 Introduction
When you click the “Save” button after working on or viewing a document
(text, image, audio,. . .), a convenient interpretation of what happens next
is to imagine that your computer stores the file in the form of a (long) finite
sequence of 0’s and 1’s that we call a binary string. The reason why only
two characters are used was explained briefly in the chapter on electronic
calculators.
011111111110000100001000000001000100
In this digital format, each of the two characters “0” and “1” is called
a “bit” (binary digit). The number of bits in a string is called the length
of the string. Note that the number of strings of a given length n is 2^n
since each bit can take two possible values. For example, there are exactly
2^5 = 32 different binary strings of length 5.
Why 8 bits, you may ask? Well, why 12 units in a dozen? But here is
a somewhat more convincing answer. The famous (extended) ASCII code
list (involving symbols on your keyboard plus other characters) contains
exactly 256 characters, and since 2^8 = 256, a digital representation for each
of the 256 characters in the ASCII code is possible using the byte as the
storage unit of one character.
Depending on the end goal, data compression algorithms fall into two main
categories: lossless and lossy compressions. Lossless data compression al-
lows the exact original data to be reconstructed from the compressed data.
Lossy data compression allows only an “approximation” of the original data
If lossless compression algorithms exist (and they do), why even bother
with lossy compression? Could we not just create an algorithm capable of
reducing the size of any file and at the same time capable of reconstructing
the compressed file to its exact original form? Mathematics gives a defini-
tive answer: Don’t bother, such an algorithm is just wishful thinking and
cannot exist. First note that a lossless compression algorithm C is successful
only if the following two conditions are met:
(1) C must compress a file of size n bits to a file of size at most n − 1 bits.
Otherwise, no compression has occurred.
(2) If F1 and F2 are two distinct files, then their compressed forms CF1
and CF2 must be distinct as well. Otherwise, we would not be able to
reconstruct the original files.
Therefore,
1 + 2 + 2^2 + · · · + 2^(n−1) = (1 − 2^n)/(1 − 2) = 2^n − 1.
This means that there are 2^n − 1 different files of size at most n − 1. Thus,
α must compress at least two elements of Sn (two files of size n) into the
same compressed form (of size n − 1 at most). This is a contradiction to
the second condition of an efficient lossless compression algorithm.
Another way to interpret the above theorem is the following: for any lossless
data compression algorithm C, there must exist a file that does not get
smaller when processed by C.
node does not grow any more branches, it is called a leaf. Otherwise, we
call it an internal node.
Example 2.1. Consider the binary code C = {00, 01, 001, 0010, 1101}
for the alphabet Σ = {X, Y, Z, U, V }. The binary tree for C is given
below.
and symbols).
Variable length codes are very practical for compression purposes, but not
enough to achieve optimal results. For a code to be efficient, one must
include in the design a unique way of decoding. For example, consider
the code C = {0, 01, 10, 1001, 1} for the alphabet {a, b, c, d, e}. The word
101001 can be interpreted in more than one way: “cd”, “ead” or “ccb”.
This is certainly not a well-designed code because of this decoding ambi-
guity. A closer look at C shows that the problem with it is the fact that
some codewords are prefixes (or start) of other codewords. For example,
the codeword 10 is a prefix of the codeword 1001. One way to avoid the
confusion is to include a symbol to indicate the end of a codeword, but this
risks being costly considering the number of times the end symbol must be
included in the encoded file. A code C with the property that no codeword
in C is a prefix of any other codeword in C is called a prefix-free
code. Prefix-free codes are uniquely decodable codes in the sense that there
is a unique way to decode any encoded message. The converse is not true
in general. There are examples of uniquely decodable codes which are not
prefix-free.
Example 2.2. The code H = {10, 01, 000, 1111} is a prefix-free code since
no codeword is a start of another codeword. In particular, it is uniquely
decodable.
Example 2.3. The code H = {0, 01, 11} for the alphabet {a, b, c} is
uniquely decodable (try to decode a couple of binary sequences and you will
quickly see why) but clearly not prefix-free, since 0 is a prefix of 01.
Example 2.4. Consider the code C = {00, 01, 10, 110, 111} for the al-
phabet {A, B, C, D, E}. It is clear that C is prefix-free since no codeword
is a prefix of another. Suppose we want to decode the binary message
01000011010111. A good way to start is to draw the binary tree associated
with C:
Starting at the root of the tree, we move left using the first 0 in the string
and then we move right using the first 1. A leaf with label “B” is reached.
We record the letter “B” and return to the root. At this point, we continue
with the string 000011010111. From the root, we move left (for the first 0
in the new string) and then down along the same branch (for the second
0 in the new string). The character “A” is reached and we record it next
to the letter “B” previously found. Continuing in this manner, we see that
the string 01000011010111 corresponds to the message “BAADCE”.
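Decoding a prefix-free code amounts to walking the tree from the root, and this is easy to mimic with a dictionary. A minimal Python sketch (code and names ours), applied to the code of Example 2.4:

```python
def decode_prefix_free(bits, code):
    """Decode a bit string with a prefix-free code: because no codeword is a
    prefix of another, the first codeword that matches while scanning from
    the left is the only possible one (this mimics walking the binary tree)."""
    inverse = {w: symbol for symbol, w in code.items()}
    message, current = [], ""
    for b in bits:
        current += b
        if current in inverse:        # a leaf of the tree is reached
            message.append(inverse[current])
            current = ""              # go back to the root
    if current:
        raise ValueError("leftover bits: the input is not a valid encoding")
    return "".join(message)

code = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
print(decode_prefix_free("01000011010111", code))   # prints BAADCE
```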
Proof. There are several proofs of this well-known result in the litera-
ture. Some of them are purely algebraic, others are schematic using trees.
Theorem 2.2 consists of a necessary and a sufficient condition. In what
follows, we use a schematic proof for one direction and an algebraic one for
the other to give the reader a flavor of both approaches.
Note that T has exactly 2^l leaves, which correspond to all possible code-
words of length l. The length li is called the depth of αi in the tree and ci
is formed by collecting all the binary symbols on the path from the root to
αi . Next, we extract the binary tree T (C) corresponding to C as a “sub-
tree” of T . Since the code C is prefix-free, we know that every character
satisfied:

(S):
γ1 ≤ 2
γ2 ≤ 2^2 − 2γ1
γ3 ≤ 2^3 − 2^2 γ1 − 2γ2
γ4 ≤ 2^4 − 2^3 γ1 − 2^2 γ2 − 2γ3
⋮
γl ≤ 2^l − 2^(l−1) γ1 − 2^(l−2) γ2 − · · · − 2γl−1
Note that if the second inequality in (S) is satisfied, then 2γ1 ≤ 2^2 − γ2 and
γ1 ≤ 2 − γ2/2. But since 2 − γ2/2 ≤ 2, we get that γ1 ≤ 2 and the first inequality
in (S) is also satisfied. Similarly, it is easy to see that if any inequality in
(S) is satisfied, then the previous one is also satisfied. Therefore, if we can
prove that the last inequality is satisfied, then the system (S) would be
valid and a prefix-free binary code can be constructed. Multiplying the last
inequality in (S) by 2^(−l) and rearranging gives:
2^(−1) γ1 + 2^(−2) γ2 + · · · + 2^(−(l−1)) γl−1 + 2^(−l) γl ≤ 1.
This is the same as the inequality (2.2) above. We conclude that a prefix-
free code can be constructed if the Kraft inequality is satisfied.
Consider a source with alphabet A = {α1, . . . , αn} and a probability distribution
vector P = (p1, . . . , pn), so that p1 + p2 + · · · + pn = 1. For a prefix-free binary code C of the alphabet A with
codeword lengths L = {l1, . . . , ln}, we define the average codeword length
of C as lav = p1 l1 + p2 l2 + · · · + pn ln, measured in bits per source symbol.
We make the following observations without looking too much into the
probabilistic and statistical properties of the source.
(1) It is completely irrelevant what names we give for the source alphabet
symbols. All that matters at the end of the day is the probability
distribution vector of these symbols.
(2) If C is a prefix-free binary code of A with codeword lengths L =
{l1, . . . , ln}, then the average codeword length lav = p1 l1 + · · · + pn ln of C can
be thought of as the average number of bits per symbol required
to encode the source.
(3) It is then natural to seek binary prefix-free codes with average codeword
length as small as possible in order to save on the numbers of bits used
to encode the source. A prefix-free code with minimal average codeword
length is called an optimal code.
It is not entirely obvious that an optimal code exists for a given alphabet.
The following result proves that in fact an optimal and prefix-free binary
code always exists.
Proof. Without any loss of generality, we can assume that pi > 0 for each
i = 1, . . . , n. Indeed, if this is not the case, we can restrict our alphabet
to those symbols with positive probability and just ignore all symbols with
zero probability (they do not appear in the source anyway). By rearranging
the symbols if necessary, we can also assume that p1 ≤ · · · ≤ pn . We start
by proving the existence of a prefix-free binary code for A. This can be
achieved in more than one way. First, take li to be the smallest integer
greater than or equal to − log2 (pi ) for each i = 1, . . . , n. The proof of
Theorem 2.4 below shows in particular that the list (l1 , . . . , ln ) satisfies the
Kraft inequality and hence represents the codeword lengths of a prefix-free
binary code of the alphabet. Another way to construct a prefix-free code
on the alphabet A is to choose a positive integer l satisfying 2^l ≥ n; then
the integers l1 = l2 = · · · = ln = l satisfy
2^(−l1) + 2^(−l2) + · · · + 2^(−ln) = n · 2^(−l) = n/2^l ≤ 1.
By Theorem 2.2, (l1, . . . , ln) are the codeword lengths of a prefix-free code
on A. Next, let us fix a prefix-free code C0 on A of average codeword length
l0 . We claim that there is only a finite number of binary prefix-free codes
with average codeword length less than or equal to l0 . To see this, let C
be a prefix-free code on A with codeword lengths (l1, . . . , ln) and an average
codeword length l = p1 l1 + · · · + pn ln satisfying l ≤ l0. If lk > l0/p1 for some k, then
l = p1 l1 + · · · + pn ln ≥ pk lk > p1 · (l0/p1) = l0,
which contradicts the assumption l ≤ l0. We conclude that lk ≤ l0/p1 for
k = 1, . . . , n. Clearly, there are finitely many codewords of length less than
or equal to the constant l0/p1 and thus the set G of all binary prefix-free codes
with average codeword length less than or equal to l0 is finite. Pick a code
C in G with the lowest average codeword length (this is possible since G is
finite); then C is a prefix-free and optimal code for the alphabet.
Given a source with alphabet A = {α1 , . . . , αn } and a probability dis-
tribution vector P = (p1 , . . . , pn ), let l be the minimum of the set
{ p1 l1 + · · · + pn ln : li is a positive integer and 2^(−l1) + · · · + 2^(−ln) ≤ 1 }.
In other words, l is the minimal average codeword length taken over all
possible prefix-free codes of the source. Theorem 2.3 guarantees the exis-
tence of a prefix-free code with average codeword length equal to l. But
how can we actually construct that code? The Kraft inequality gives us a
clean and nice mathematical formulation of the problem at hand.
Proof. For the proof of the inequality on the left in (2.3), we use a well-known
inequality that you have probably seen in your first Calculus course:
ln(x) ≤ x − 1 for all x > 0,   (2.4)
with equality in (2.4) occurring only at x = 1. This can be seen from the
graphs of both ln(x) and x − 1.
This finishes the proof of (2.3) in the Theorem. For the last statement,
assume that C is a prefix-free binary code with codeword lengths (l1, . . . , ln)
and average codeword length equal to −(p1 log2(p1) + · · · + pn log2(pn)). In order to
achieve lav = −(p1 log2(p1) + · · · + pn log2(pn)), the inequality labeled (∗) in the above proof
must be an equality. That can only happen when 2^(−li)/pi = 1 for each i
(remember that inequality (2.5) is an equality if and only if x = 1), and so
pi = 2^(−li) for all i. Conversely, if (l1, . . . , ln) are positive integers satisfying
pi = 2^(−li) for all i, then −li = log2(pi) and
2^(−l1) + · · · + 2^(−ln) = 2^(log2 p1) + · · · + 2^(log2 pn) = p1 + · · · + pn = 1.
The Kraft inequality implies that there exists a prefix-free
binary code D with codeword lengths (l1, . . . , ln). Note that the average
codeword length of D is p1 l1 + · · · + pn ln = −(p1 log2(p1) + · · · + pn log2(pn)).
In terms of the entropy, two important facts about the source can
be drawn from Theorem 2.4. The first is that the average codeword
length of any optimal code cannot get any better than the entropy H =
−(p1 log2(p1) + · · · + pn log2(pn)) of the source, while it is always within only one bit of
the entropy. The second fact is that there exists a prefix-free code of a
source with probability distribution vector P = (p1 , . . . , pn ) that achieves
the entropy bound if and only if the source is dyadic, that is, each pi is of
the form pi = 2^(−li) for some positive integer li.
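As a quick numerical check, here is a small Python sketch (ours, not from the book) computing the entropy of a dyadic source and the average codeword length obtained from the lengths li = −log2(pi); the two agree, as Theorem 2.4 predicts. The distribution used is the one of the worked example in Section 2.8.3:

```python
from math import log2

def entropy(probs):
    """H = -sum p_i log2(p_i): the lower bound on the average codeword length."""
    return -sum(p * log2(p) for p in probs if p > 0)

# dyadic distribution: e and space 1/4; i, s, y 1/8; n, k 1/16
probs = [1/4, 1/4, 1/8, 1/8, 1/8, 1/16, 1/16]
lengths = [int(-log2(p)) for p in probs]        # l_i = -log2(p_i) for a dyadic source
l_av = sum(p * l for p, l in zip(probs, lengths))
print(entropy(probs), l_av)                     # both print 2.625 (= 21/8 bits per symbol)
```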
In the late 1940’s, researchers in the (then) young field of information the-
ory worked hard on the problem of constructing optimal codes with not
much luck. Some descriptions of what such a code should look like were
given but without any concrete algorithm to construct one. In the early
1950’s, David Huffman was a student in a graduate course at MIT given by
Robert Fano on information theory. Huffman and his classmates were given
the option to submit a paper on the optimal code question, a problem Fano
and others had almost given up hope of solving, or to write a standard
final exam in the course. Huffman worked on the paper for a period of
time and just before giving up and going back to study for the final exam,
a solution hit him. To everyone’s surprise, Huffman’s paper consisted of a
simple and straightforward way to construct the optimal code and earned
him a great deal of fame. While almost every attempted solution to the
problem consisted of constructing a code tree from the top down, Huffman’s
approach was to construct the tree of his code from the bottom up.
parent node in T. Let cij be the codeword corresponding to αij; that is, cij
is the codeword consisting of the first common li − 1 bits of ci and cj. The
new tree we obtain corresponds to a prefix-free code C′ for the alphabet A′
obtained from A by removing the symbols αi, αj and replacing them with
the common symbol αij, to which we assign the probability pi + pj. If l′ is
the average codeword length of the code C′, then
l − l′ = p1 l1 + · · · + pi li + · · · + pj lj + · · · + pn ln
       − [p1 l1 + · · · + (pi + pj)(li − 1) + · · · + pn ln]
     = pi li + pj lj − (pi + pj)(li − 1) = pi + pj,
since lj = li.
(1) Pick two letters αi and αj from the alphabet with the smallest proba-
bilities.
(2) Create a subtree with root labeled αij that has αi and αj as leaves.
(3) Set the probability of αij as pi + pj .
(4) Form a new alphabet A0 of n − 1 symbols by removing αi and αj from
the alphabet A and adding the new symbol αij .
(5) Repeat the previous steps for the new alphabet A0 .
(6) Stop when an alphabet with only one symbol is left.
The tree we obtain at the end of the above algorithm is called the Huffman
tree and the corresponding code is called the Huffman code.
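A compact Python sketch of the algorithm is given below (our own implementation using a heap; when probabilities tie, the code it produces may differ from the one built by hand in the next section, but it is still a valid Huffman code):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code bottom-up: repeatedly merge the two symbols (or
    groups of symbols) with the smallest probabilities, prepending 0 along
    the left branch and 1 along the right branch."""
    heap = [(p, [sym]) for sym, p in freqs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freqs}
    while len(heap) > 1:
        p1, group1 = heapq.heappop(heap)      # smallest probability
        p2, group2 = heapq.heappop(heap)      # second smallest
        for sym in group1:
            codes[sym] = "0" + codes[sym]     # left branch gets a 0
        for sym in group2:
            codes[sym] = "1" + codes[sym]     # right branch gets a 1
        heapq.heappush(heap, (p1 + p2, group1 + group2))
    return codes

freqs = {"e": 4/16, "_": 4/16, "i": 2/16, "s": 2/16, "y": 2/16, "n": 1/16, "k": 1/16}
print(huffman_code(freqs))   # one optimal prefix-free code for this source
```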
2.8.3 An example
In this section, we look at a detailed example of text compression using
the Huffman algorithm. Assume that we want to encode the following text
source:

i see eye in sky
The two letters of lowest probabilities in the text are n and k. Create
a subtree with root labeled nk, to which we assign the probability
1/16 + 1/16 = 1/8, with n and k as leaves. As usual, the left branch is labeled
with a 0 and the right branch is labeled with a 1.
We now have a new alphabet with symbols nk, i, s, y, e and the space symbol,
with probabilities 1/8, 1/8, 1/8, 1/8, 1/4 and 1/4, respectively. Pick two symbols in the new
alphabet of lowest probabilities, say nk and i, and create a new subtree
with root labeled nki having nk and i as children and with probability
1/8 + 1/8 = 1/4.
We get the new alphabet formed of the symbols nki, s, y, e and the space symbol, with probabilities
1/4, 1/8, 1/8, 1/4 and 1/4, respectively. Form a new subtree with root labeled
sy (probability 1/8 + 1/8 = 1/4) having s and y as children.
We are now left with an alphabet of four symbols, each occurring with
probability 1/4. Pick the symbols nki and sy to form the next subtree with
root nkisy of probability 1/2.
Pick the symbols e and the space symbol to form the next subtree.
Finally, we merge the last two symbols nkisy and e into the root of the
Huffman tree.
Following the tree from the root to the leaves, we get the Huffman code
of the text message:
n → 0000, k → 0001, i → 001, s → 010, y → 011, e → 10, space → 11.   (2.7)
The text i see eye in sky can now be encoded by concatenating the code-
words of the symbols as they reach the encoder:
001110101010111001110110010000110100001011, (2.8)
for a total of 42 bits. But how much did we really save? Well, a non-
compressed version of the text using standard ASCII code requires 8× 16 =
128 bits to store in a computer. A saving of almost 67%. Using the library
(2.7), decoding the message (2.8) is straightforward since Huffman code
is prefix-free, hence uniquely decodable. Note that the average codeword
length of the code in (2.7) is given by:
l = 4(1/16) + 4(1/16) + 3(1/8) + 3(1/8) + 3(1/8) + 2(1/4) + 2(1/4) = 21/8.
The fact that the average codeword length of the Huffman code is equal to
the entropy of the source is expected in this example since each alphabet
symbol appears with a probability equal to a negative power of 2.
2.10 References
Chapter 3

The JPEG standard

3.1 Introduction
• The Sequential lossy mode. The image is broken into blocks. Each
block is scanned once in a raster manner (left-to-right, top-to-bottom).
Some information is lost during the compression and the reconstructed
image is an approximation of the original one.
• The Progressive mode. Both compression and decompression of the
image are done in several scans. Each scan produces a better image
than the previous ones. The image is transferred starting with coarse
resolution (almost unrecognisable) to finer resolution. In applications
with long downloading time, the user will see the image building up in
multiple resolutions.
• The Hierarchical mode. The image is encoded at multiple resolu-
tions allowing applications to access a low resolution version without
the need to decompress the full resolution version of the image. You
have probably noticed sometimes that you do not get the same quality
when you print an image from the one displayed on a website since the
two operations (printing and displaying) require different resolutions.
• The Sequential lossless mode. The image is scanned once and en-
coded in a way that allows the exact recovery of every element of the
image after decompression. This results, of course, in a much longer
code stream than the ones obtained in the lossy modes.
The first three modes are called DCT-based modes since they all use the
Discrete Cosine Transform (DCT for short, see Section 3.2) as the main tool
to achieve compression. Each of the four modes has its own features and
parameters that allow a certain degree of flexibility in terms of compression-
to-quality ratio. The purpose of this chapter however is not to describe the
technicalities and properties of the above modes nor to discuss the hardware
implementation. We will be concerned only with the Sequential lossy mode
implemented by the Baseline JPEG standard which can be described as the
collection of “baseline routines” that must be included in every DCT-based
JPEG standard. The Baseline standard is by far the most popular JPEG
technique and it is well supported by almost all applications.
The entries of the jth row in the table are the components of the vector
dj constructed above for j = 0, 1, . . . , 7. The table shows that each vector
dj , with the exception of vector d0 , has four pairs of the form (−κ, κ) for
a certain coefficient κ. If the components of the list α are correlated, this
results in a dot product dj · α being relatively small in absolute value. Note
also that the frequency j of the cosine wave wj (x) increases as we go down
in the table. This means that the early DCT coefficients correspond to
low-frequency components of the sequence and these components usually
contain the important characteristics of the list. The table also reveals an-
other important feature of the vectors dj : the dot product di · dk is zero
D = \begin{pmatrix}
\sqrt{1/n} & \sqrt{1/n} & \cdots & \sqrt{1/n} \\
\sqrt{2/n}\cos\frac{\pi}{2n} & \sqrt{2/n}\cos\frac{3\pi}{2n} & \cdots & \sqrt{2/n}\cos\frac{(2n-1)\pi}{2n} \\
\sqrt{2/n}\cos\frac{2\pi}{2n} & \sqrt{2/n}\cos\frac{6\pi}{2n} & \cdots & \sqrt{2/n}\cos\frac{2(2n-1)\pi}{2n} \\
\vdots & \vdots & & \vdots \\
\sqrt{2/n}\cos\frac{(n-1)\pi}{2n} & \sqrt{2/n}\cos\frac{3(n-1)\pi}{2n} & \cdots & \sqrt{2/n}\cos\frac{(n-1)(2n-1)\pi}{2n}
\end{pmatrix}
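For readers who want to experiment, the matrix D is easy to build numerically. The sketch below (assuming NumPy is available; the function name is ours) constructs D for n = 8 and checks that its rows are orthonormal, i.e. D Dᵗ = I:

```python
import numpy as np

def dct_matrix(n):
    """The n-by-n DCT matrix: first row sqrt(1/n), and for j >= 1
    D[j, k] = sqrt(2/n) * cos(j * (2k + 1) * pi / (2n))."""
    D = np.zeros((n, n))
    D[0, :] = np.sqrt(1.0 / n)
    for j in range(1, n):
        for k in range(n):
            D[j, k] = np.sqrt(2.0 / n) * np.cos(j * (2 * k + 1) * np.pi / (2 * n))
    return D

D = dct_matrix(8)
print(np.allclose(D @ D.T, np.eye(8)))   # True: the rows form an orthonormal set
```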
We now come to the main application of this chapter. We explain how the
Baseline JPEG standard uses the two-dimensional DCT to compress and
decompress grayscale digital images. The procedure involves several steps
that we explain and, at each step, illustrate using the image in Example 3.1 below.
$$A_1 = \begin{pmatrix}
-8 & 27 & 34 & -62 & -48 & -69 & -64 & -55 \\
-50 & -41 & -73 & -38 & -18 & -43 & -59 & -56 \\
22 & -60 & -49 & -15 & 22 & -18 & -63 & -55 \\
52 & 30 & -57 & -6 & 26 & -22 & -58 & -59 \\
-60 & -67 & -61 & -18 & -1 & -40 & -60 & -58 \\
-50 & -63 & -69 & -58 & -51 & -60 & -70 & -53 \\
-39 & -57 & -64 & -69 & -73 & -60 & -63 & -45 \\
-39 & -49 & -58 & -60 & -63 & -52 & -50 & -34
\end{pmatrix}.$$
$$B = \begin{pmatrix}
-330.8750 & 65.1776 & -9.3885 & 46.0853 & 51.1250 & -30.8004 & -2.7408 & 4.0057 \\
75.9481 & 58.8311 & -25.6708 & 8.7057 & -2.2776 & -20.4903 & -6.1304 & 18.0618 \\
-41.8828 & 3.8547 & 52.8126 & -60.6199 & -58.4390 & 0.5139 & 17.5914 & 12.1807 \\
-51.0028 & 1.0753 & 5.0003 & -54.8967 & -40.6112 & -6.6073 & 3.4048 & 11.2796 \\
53.6250 & 41.9546 & 4.5161 & -16.0604 & -20.3750 & -14.3393 & -12.2887 & 5.2865 \\
48.4935 & 67.3990 & 39.1726 & 6.7856 & 6.2506 & -9.2437 & -6.9427 & 6.1368 \\
12.8836 & 14.1137 & 3.0914 & -3.2691 & 2.9643 & 7.5347 & 22.9374 & 16.8211 \\
-12.5815 & -23.0803 & -23.6630 & -11.5713 & 4.2610 & 17.0271 & 24.8242 & 16.5083
\end{pmatrix} \qquad (3.9)$$
3.3.4 Quantization
So far, no compression has taken place: the DCT is a completely invertible
transformation. From the matrix B of DCT coefficients obtained in the previous
step, we can recover the original matrix A of the unit data by computing $D^tBD$.
Quantization is the step in the JPEG standard where the magic of compression
takes place. The information lost in this step is (in general) lost beyond recovery,
and this is basically why we call the JPEG standard a “lossy” one. At this step,
mathematics exploits the human eye's perception and its tolerance for distortion
to come up with a reasonable compression scheme. Experiments show that the
human eye is more sensitive to the low frequency components of an image than
to the high frequency ones. The quantization step enables us to discard many of
the high frequency coefficients, as their presence has little effect on the perception
of the image as a whole. After
the DCT is applied to the 8 × 8 block, each of the 64 DCT coefficients of
the block is first divided by the corresponding entry in the matrix Q = [qij ]
below and then the result is rounded to the nearest integer:
$$Q = \begin{pmatrix}
16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\
12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\
14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\
14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\
18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\
24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\
49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\
72 & 92 & 95 & 98 & 112 & 100 & 103 & 99
\end{pmatrix} \qquad (3.10)$$
Designed based on human tolerance of visual effects, the matrix Q is
called the luminance quantization matrix. It is not the only quantization
matrix defined by the JPEG standard, but it is the one commonly used in
applications. Note how the entries in the matrix Q increase almost in every
row and every column as you move from left to right and top to bottom.
This is designed to ensure aggressive quantization of coefficients with higher
frequencies (the bigger the number we divide with, the closer the answer
to zero). The quantization step has another important role to play in the
JPEG standard. Notice how the coefficients of the DCT matrix B in (3.9)
are real numbers, shown rounded to four decimal places. When divided by the
entries of Q, these DCT coefficients remain real numbers; if we did not round them
to integer values, the encoding process of the quantized coefficients (see Section
3.3.5) would not be possible. After quantization, the matrix B of DCT coefficients
is transformed into the matrix $C = [c_{ij}]$ where $c_{ij} = \mathrm{Round}\!\left(\frac{b_{ij}}{q_{ij}}\right)$. For the block
in Example 3.1, the matrix C is the following.
$$C = \begin{pmatrix}
-21 & 6 & -1 & 3 & 2 & -1 & 0 & 0 \\
6 & 5 & -2 & 0 & 0 & 0 & 0 & 0 \\
-3 & 0 & 3 & -3 & -1 & 0 & 0 & 0 \\
-4 & 0 & 0 & -2 & -1 & 0 & 0 & 0 \\
3 & 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \qquad (3.11)$$
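To make the quantization and dequantization steps concrete, here is a minimal sketch in Python (using the numpy package; the function names are ours, not part of the JPEG standard):

```python
import numpy as np

def quantize(B, Q):
    """Divide each DCT coefficient by the corresponding entry of Q
    and round to the nearest integer: c_ij = Round(b_ij / q_ij)."""
    return np.rint(B / Q).astype(int)

def dequantize(C, Q):
    """Multiply back by Q; the rounding performed in quantize() cannot be
    undone, which is where the (irreversible) compression happens."""
    return C * Q

# With B as in (3.9) and Q as in (3.10), quantize(B, Q) reproduces the matrix C of (3.11).
```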
3.3.5 Encoding
Now that we have the quantized matrix C = [cij ] from the last step, the
next challenge is to encode the coefficients in a way that minimizes the storage
space required. The 64 quantized coefficients are first rearranged into a one-dimensional
sequence by scanning the matrix in a zigzag order, starting with the DC coefficient
in the upper left corner.
For the quantized matrix of Example 3.1, the zigzag mode yields the
following string:
−21, 6, 6, −3, 5, −1, 3, −2, 0, −4, 3, 0, 3, 0, 2, −1, 0,
−3, 0, 2, 2, 0, 2, 0, −2, −1, 0, 0, 0, 0, 0, −1, 0, 1, EOB (3.12)
where the special symbol EOB (End Of Block) is introduced to indicate
that all remaining coefficients after the last “1” are zeros until the end of
the sequence. In our example, the last run consists of 30 consecutive zeros.
As we will see later, the DC coefficient is encoded using a technique called
differential encoding while the 63 AC coefficients are encoded using a run-
length encoding technique. Huffman coding (see Chapter 2) is then used
to encode both.
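The zigzag reordering just described can be sketched as follows (a small Python helper of our own; indices here are 0-based):

```python
def zigzag(C, n=8):
    """Scan an n x n matrix in zigzag order: walk the anti-diagonals
    i + j = 0, 1, ..., 2n - 2, alternating the direction of travel."""
    out = []
    for s in range(2 * n - 1):
        cells = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            cells.reverse()          # even anti-diagonals are read bottom-to-top
        out.extend(C[i][j] for i, j in cells)
    return out
```

Applied to the matrix C of (3.11), and with the trailing run of zeros replaced by the EOB symbol, this reproduces the sequence (3.12).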
For an image using 8 bits per pixel (which is what we assume in all that
follows), it can be shown that both the AC and the DC coefficients fall in the
range [−1023, 1023] and the DC differences fall in the range [−2047, 2047].
The encoding of a DC difference coefficient d is done as follows:
(1) First find the minimal number γ of bits required to write |d| (the ab-
solute value of d) in binary form. For example, if d = −7, then γ = 3
since the binary form of |−7| is 111. The minimal number of bits re-
quired to represent |d| in binary form is referred to as the CATEGORY
of d.
(2) Figure 3.3 shows the table used to encode the CATEGORY γ. This
code is referred to as the Variable-Length Code of d or VLC for short.
For example, the VLC for d = −7 is 100 (since d = −7 belongs to
CATEGORY 3).
(3) The VLC code for d found in the previous step is the first layer in
the encoding of d. The second layer is referred to as the Variable-
Length Integer, or VLI for short, which we define as follows. If d ≥ 0,
then VLI(d) consists of taking the γ least significant bits of the 8-
bit binary representation of d (γ being, as above, the CATEGORY of
d). If d < 0, then VLI(d) consists of taking the γ least significant
bits of the 8-bit two’s complement representation of d − 1 (see the
first chapter on Calculator). Recall that the 8-bit two’s complement
representation of a negative integer β consists of writing the 8-bit binary
representation of |β|, invert the digits (0 becomes 1, 1 becomes 0) and
then add 1 to the answer. For example, if β = −8 then the 8-bit
Assuming the figure given in Example 3.1 is the first block in a digital
image, we encode its DC coefficient −21. The binary form of 21 is 10101,
so −21 is Category 5. From the table in Figure 3.3, the VLC code of −21
is 110. The 8-bit two’s complement of −21 − 1 = −22 is 11101010 and the
5 least significant bits of 11101010 are 01010. This is the VLI code of −21.
We conclude that the encoding of −21 is 11001010.
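The CATEGORY/VLI mechanics just illustrated can be sketched as follows (Python; the helper names and the vlc_table argument are ours, and the actual variable-length codes must be taken from the standard table of Figure 3.3, which we do not reproduce here):

```python
def category(d):
    """Number of bits needed to write |d| in binary (the CATEGORY of d); 0 for d = 0."""
    return abs(d).bit_length()

def vli(d):
    """Variable-Length Integer of d: the category(d) least significant bits of d
    if d >= 0, or of the 8-bit two's complement representation of d - 1 if d < 0."""
    g = category(d)
    if d >= 0:
        return format(d, 'b').zfill(g)[-g:] if g else ''
    return format((d - 1) & 0xFF, '08b')[-g:]

def encode_dc_difference(d, vlc_table):
    """Concatenate the VLC of the CATEGORY of d (from the standard table)
    with the VLI of d."""
    return vlc_table[category(d)] + vli(d)

# Example from the text: with vlc_table[5] = '110', the DC coefficient -21 of the
# first block is encoded as '110' + '01010' = '11001010'.
```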
Run-length encoding (RLE) shortens a run of a repeated character by recording
how many times the character appears in the run
instead of actually listing it every single time. Huffman coding (see Chap-
ter 2) of pairs of integers is combined with the RLE technique to produce
a compressed binary sequence representing the AC coefficients in the 8 × 8
block. The sequence of the 63 AC coefficients is first shortened to a sequence
of pairs and special symbols as follows: if α is a non-zero AC coefficient in
the zigzag sequence, then α is replaced by the “object” (r, m)(α) where r is
the length of the zero run immediately preceding α (that is the number of
consecutive zeros preceding α), m is the CATEGORY of α from the above
table. The maximum length of a run of zeros allowed in the JPEG standard
is 16. The symbol (15, 0) is used to indicate a run of 16 zeros (one zero
preceded by 15 other zeros). If a run has 17 zeros or more, it is divided into
subruns of length 16 or less each. This means that r ranges between 0 and
15 and as a consequence it requires a 4-bit binary representation. In the
intermediate sequence, the EOB is represented with the special pair (0, 0).
Let us illustrate using the AC coefficients in sequence (3.12) above result-
ing from the image in Example 3.1. The intermediate sequence for (3.12)
is (0, 3)(6); (0, 3)(6); (0, 2)(−3); (0, 3)(5); (0, 1)(−1); (0, 2)(3); (0, 2)(−2);
(1, 3)(−4); (0, 2)(3); (1, 2)(3); (1, 2)(2); (0, 1)(−1); (1, 2)(−3); (1, 2)(2);
(0, 2)(2); (1, 2)(2); (1, 2)(−2); (0, 1)(−1); (5, 1)(−1); (1, 1)(1); (0, 0). We
explain how some of the terms in this sequence are formed. The first non-zero
AC coefficient is 6 with no preceding 0’s. Since 6 belongs to CATEGORY
3, the first term in the intermediate sequence is (0, 3)(6). The last AC
coefficient −1 appearing in the zigzag sequence is preceded by a run of 5
zeros and since −1 belongs to CATEGORY 1, the corresponding entry in
the intermediate sequence is (5, 1)(−1).
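Here is a small sketch (our own helper, in Python) of how the zigzag sequence of AC coefficients is converted into the intermediate sequence of pairs described above:

```python
def intermediate_sequence(ac):
    """Replace each non-zero AC coefficient by ((run_of_zeros, CATEGORY), value),
    splitting runs of 16 or more zeros with the special pair (15, 0), and end
    the block with the EOB pair (0, 0)."""
    out, run = [], 0
    for a in ac:
        if a == 0:
            run += 1
            continue
        while run > 15:               # a run of 16 zeros is encoded as (15, 0)
            out.append(((15, 0), None))
            run -= 16
        out.append(((run, abs(a).bit_length()), a))
        run = 0
    out.append(((0, 0), None))        # EOB: all remaining coefficients are zero
    return out

# Applied to the 63 AC coefficients of (3.12), this returns
# (0,3)(6); (0,3)(6); (0,2)(-3); ... ; (5,1)(-1); (1,1)(1); (0,0).
```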
Once the intermediate sequence is formed, each entry (r, m)(α) is en-
coded using the following steps.
(1) The pair (r, m) is encoded using Huffman codes provided by the stan-
dard tables in Section 3.3.5.3.
(2) The non-zero AC coefficient α is encoded using VLI codes as in the
encoding of the DC difference coefficients above.
(3) The final codeword for (r, m)(α) is just the concatenation of codes of
(r, m) and α.
Adding the code for the DC coefficient found earlier, the block image
of Example 3.1 is encoded as:
11001010 100110 100110 0100 100101 000 0111 0101 1111001011 0111
1101111 1101110 000 1101100 1101110 0110 1101110 1101101 000 11110100
11001 1010.
Notice that the size of this new sequence is 124 bits, a saving of about 75%
compared with the raw size of 8 × 64 = 512 bits that the block would occupy
if no compression were done.
(1) When the compressed binary stream of the block enters the decoder
gate, it is read bit by bit. Using the Huffman encoding tables given
in the previous section, the decoder reconstructs the intermediate se-
quence of objects (r, m)(α).
(2) From this intermediate sequence, the quantized DC coefficient, all the
63 quantized AC coefficients and all the run lengths can be recon-
structed in the same zigzag ordering as in the encoding step above.
Recall that the first part of the compressed stream represents the (quan-
tized) DC difference coefficient di = DCi − DCi−1 . The quantized DC
coefficient of block i is reconstructed as DCi = DCi−1 + di for i ≥ 1
(assuming the DC coefficient DCi−1 of block i − 1 was obtained at the
previous step).
(3) The sequence obtained in the previous step is “dezigzaged” to recon-
struct the 8 × 8 matrix of the quantized DCT coefficients of the block.
For the block in Example 3.1, this step will reproduce the matrix (3.11)
above.
(4) The 8 × 8 matrix of the quantized DCT coefficients is “dequantized”
by multiplying each of the entries with the corresponding entry of the
quantization matrix Q given in (3.10) above. For the block in Example
3.1, this step will produce the following matrix.
$$S = \begin{pmatrix}
-336 & 66 & -10 & 48 & 48 & -40 & 0 & 0 \\
72 & 60 & -28 & 0 & 0 & 0 & 0 & 0 \\
-42 & 0 & 48 & -72 & -40 & 0 & 0 & 0 \\
-56 & 0 & 0 & -58 & -51 & 0 & 0 & 0 \\
54 & 44 & 0 & 0 & 0 & 0 & 0 & 0 \\
48 & 70 & 55 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.$$
Notice that the matrix S is close to the original matrix B of the DCT
coefficient given in (3.9) above but not exactly the same. This is due
to the rounding to the nearest integer that was performed during the
quantization step.
(5) Now apply the two-dimensional inverse DCT given in (3.8) to the ma-
trix S to get the matrix B1 .
$$B_1 = \begin{pmatrix}
-9.5363 & 26.1090 & 13.8352 & -36.5921 & -64.4003 & -71.7877 & -68.6127 & -54.1596 \\
-58.6093 & -53.3551 & -52.3111 & -35.6142 & -18.6246 & -39.9788 & -64.1639 & -56.5378 \\
19.1002 & -33.7537 & -63.8441 & -22.2048 & 16.3370 & -20.4071 & -62.0797 & -52.9300 \\
61.4794 & -0.7968 & -43.2273 & -10.3405 & 23.8693 & -16.4274 & -63.1244 & -58.2510 \\
-44.8950 & -54.4807 & -59.1986 & -29.7012 & 0.2449 & -25.1636 & -64.9681 & -69.2678 \\
-79.9202 & -66.0041 & -69.1116 & -67.7725 & -51.6672 & -55.3606 & -66.3567 & -58.5642 \\
-30.3389 & -34.5392 & -61.1915 & -78.2150 & -71.1730 & -70.3833 & -63.1596 & -37.2653 \\
-41.7867 & -52.7451 & -69.2971 & -61.0331 & -42.9813 & -53.3216 & -56.7931 & -30.6488
\end{pmatrix}.$$
Finally, each entry of B1 is rounded to the nearest integer to obtain the reconstructed block
$$B_2 = \begin{pmatrix}
-10 & 26 & 14 & -37 & -64 & -72 & -69 & -54 \\
-59 & -53 & -52 & -36 & -19 & -40 & -64 & -57 \\
19 & -34 & -64 & -22 & 16 & -20 & -62 & -53 \\
61 & -1 & -43 & -10 & 24 & -16 & -63 & -58 \\
-45 & -54 & -59 & -30 & 0 & -25 & -65 & -69 \\
-80 & -66 & -69 & -68 & -52 & -55 & -66 & -59 \\
-30 & -35 & -61 & -78 & -71 & -70 & -63 & -37 \\
-42 & -53 & -69 & -61 & -43 & -53 & -57 & -31
\end{pmatrix},$$
which is the decoder's approximation of the original block A1.
Not bad, considering that the compressed image takes only about 25%
of the storage space taken by the raw image.
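Putting decoding steps (4) and (5) together, a rough sketch in Python (with numpy; function names are ours) could look like this, using the DCT matrix D of Figure 3.1:

```python
import numpy as np

def dct_matrix(n=8):
    """The n x n DCT matrix D of Figure 3.1: first row sqrt(1/n), and row j > 0
    has entries sqrt(2/n) * cos(j*(2k+1)*pi / (2n)) for k = 0, ..., n-1."""
    D = np.zeros((n, n))
    D[0, :] = np.sqrt(1.0 / n)
    for j in range(1, n):
        for k in range(n):
            D[j, k] = np.sqrt(2.0 / n) * np.cos(j * (2 * k + 1) * np.pi / (2 * n))
    return D

def decode_block(C, Q):
    """Steps (4) and (5): dequantize the matrix C of quantized coefficients,
    then apply the inverse two-dimensional DCT (A = D^t S D) and round."""
    D = dct_matrix(len(C))
    S = C * Q                      # dequantization
    B1 = D.T @ S @ D               # inverse 2-D DCT
    return np.rint(B1).astype(int)

# With C as in (3.11) and Q as in (3.10), decode_block(C, Q) gives the matrix B2 shown above.
```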
In this section, we dig deeper into the mathematics that make the DCT
such an important tool for image compression. As the reader will soon
discover, all the magic of JPEG compression happens by mixing together
some basic properties of linear algebra with some trigonometric identities.
The key property of the DCT matrix that makes it attractive to real life
applications is the fact that it is an orthogonal matrix. Before we proceed
to the definition of orthogonality, let us quickly review some notions from
linear algebra.
$$a_1v_1 + a_2v_2 + \cdots + a_sv_s = 0$$
$$v = a_1v_1 + a_2v_2 + \cdots + a_sv_s, \qquad a_k \in \mathbb{R} \text{ for all } k.$$
$$AA^t = A^tA = I \qquad (3.14)$$
The fact that the coefficients of the matrix A in the new basis S are pre-
cisely the DCT coefficients of A allows us to interpret these coefficients as
a “measure” of how much each of the squares in the above array is present
in the image.
We proceed now to prove Theorem 3.1. Note first that since the state-
ment “A is orthogonal” is equivalent to “At is orthogonal”, we only need
to prove that the first two conditions of Theorem 3.1 are equivalent. Write
$A = (q_1\ q_2\ \ldots\ q_n)$ where $q_i$ is the $i$th column of A. Then
$A^t = \begin{pmatrix} q_1^t \\ q_2^t \\ \vdots \\ q_n^t \end{pmatrix}$ and therefore
$$A^tA = \begin{pmatrix}
q_1^t \cdot q_1 & q_1^t \cdot q_2 & \cdots & q_1^t \cdot q_n \\
q_2^t \cdot q_1 & q_2^t \cdot q_2 & \cdots & q_2^t \cdot q_n \\
\vdots & \vdots & \cdots & \vdots \\
q_n^t \cdot q_1 & q_n^t \cdot q_2 & \cdots & q_n^t \cdot q_n
\end{pmatrix} \qquad (3.18)$$
where, as usual, qit · qj is the dot product of the vectors qit and qj . If A
is an orthonormal matrix, then At A = I and by comparing (3.18) with
the identity matrix I, we conclude that $q_i^t \cdot q_j = 0$ for all i, j with i ≠ j
and qit · qi = 1 for all i. These relations show that the columns of A are
orthogonal and unit vectors of Rn . Corollary 3.1 implies that the columns
of A form an orthonormal basis of Rn . This proves the implication (1 ) ⇒
(2 ) of Theorem 3.1. The implication (2 ) ⇒ (1 ) is proven similarly.
Theorem 3.2. The DCT matrix defined in Figure 3.1 on page 71 is or-
thogonal.
There are a few elegant and relatively short proofs of this result in the litera-
ture, but they require some heavy mathematics. The proof presented here
is technical and somewhat long, but stays close to the basics. Let us start
by recalling some trigonometric identities needed for the proof.
$$\cos(\alpha)\cos(\beta) = \frac{1}{2}\left[\cos(\alpha+\beta) + \cos(\alpha-\beta)\right]. \qquad (3.19)$$
$$\cos^2(\alpha) = \frac{1}{2}\left[1 + \cos(2\alpha)\right]. \qquad (3.20)$$
Lemma 3.1. For $j \in \{1, 2, \ldots, 2n-1\}$:
$$\sum_{m=0}^{2n-1} \cos\frac{mj\pi}{n} = 0.$$
Proof. Let $\alpha = \frac{j\pi}{2n}$ and consider the geometric sum $\sum_{m=0}^{2n-1}\left(e^{2i\alpha}\right)^m$, which has
2n terms with 1 as the first term and the ratio of any two consecutive terms equal
to $e^{2i\alpha}$. Note that $0 < 2\alpha < 2\pi$ since $2\alpha = \frac{j\pi}{n}$ and $1 \le j \le 2n-1$. Therefore the
equations $\cos(2\alpha) = 1$ and $\sin(2\alpha) = 0$ cannot be satisfied simultaneously. Since
$e^{2i\alpha} = \cos(2\alpha) + i\sin(2\alpha) = 1$ if and only if $\cos(2\alpha) = 1$ and $\sin(2\alpha) = 0$, we can
confirm that the ratio $e^{2i\alpha}$ of the above geometric sum is not 1. A well-known
formula for a finite geometric sum (with ratio other than 1) allows us to write:
$$\sum_{m=0}^{2n-1}\left(e^{2i\alpha}\right)^m = \frac{1 - \left(e^{2i\alpha}\right)^{2n}}{1 - e^{2i\alpha}} = \frac{1 - e^{4ni\alpha}}{1 - e^{2i\alpha}}. \qquad (3.21)$$
Since $4n\alpha = 2j\pi$, we have $e^{4ni\alpha} = 1$, so the sum in (3.21) is zero, and taking real parts gives
$$\sum_{m=0}^{2n-1} \cos\frac{mj\pi}{n} = 0$$
as claimed.

Lemma 3.3. If j is an odd positive integer, then
$$\sum_{m=1}^{n-1} \cos\frac{mj\pi}{n} = 0.$$
Proof. The proof of this lemma is a bit more technical and requires
dividing it into subcases. First assume that n is even. Then
$$\sum_{m=1}^{n-1} \cos\frac{mj\pi}{n} = \sum_{m=1}^{\frac{n}{2}-1} \cos\frac{mj\pi}{n} + \cos\frac{\frac{n}{2}j\pi}{n} + \sum_{m=\frac{n}{2}+1}^{n-1} \cos\frac{mj\pi}{n}.$$
Note that $\cos\frac{\frac{n}{2}j\pi}{n} = \cos\left(j\frac{\pi}{2}\right) = 0$ since j is odd (the cosine of any odd
multiple of $\frac{\pi}{2}$ is zero). So
$$\sum_{m=1}^{n-1} \cos\frac{mj\pi}{n} = \sum_{m=1}^{\frac{n}{2}-1} \cos\frac{mj\pi}{n} + \sum_{m=\frac{n}{2}+1}^{n-1} \cos\frac{mj\pi}{n}. \qquad (3.22)$$
Now, let $k = n - m$; then the last sum in (3.22) can be written as:
$$\sum_{m=\frac{n}{2}+1}^{n-1} \cos\frac{mj\pi}{n} = \sum_{k=1}^{\frac{n}{2}-1} \cos\frac{(n-k)j\pi}{n} = \sum_{k=1}^{\frac{n}{2}-1} \cos\left(j\pi - \frac{kj\pi}{n}\right) = -\sum_{k=1}^{\frac{n}{2}-1} \cos\frac{kj\pi}{n}$$
since for an odd j, cos(jπ − α) = − cos(α). Relation (3.22) shows that the
lemma is true in this case. Next assume that n is odd and write
$$\sum_{m=1}^{n-1} \cos\frac{mj\pi}{n} = \sum_{m=1}^{\frac{n-1}{2}} \cos\frac{mj\pi}{n} + \sum_{m=\frac{n+1}{2}}^{n-1} \cos\frac{mj\pi}{n}. \qquad (3.23)$$
Again, the same change of index (k = n − m) in the second sum shows that
it is equal to the opposite of the first one and the lemma is proved.
We are now ready to prove Theorem 3.2. Let $q_i$ and $q_j$ (with $0 \le i, j \le n-1$) be two columns of the DCT matrix D:
$$q_i = \begin{pmatrix} \sqrt{\tfrac{1}{n}} \\[4pt] \sqrt{\tfrac{2}{n}}\cos\tfrac{(2i+1)\pi}{2n} \\[4pt] \sqrt{\tfrac{2}{n}}\cos\tfrac{2(2i+1)\pi}{2n} \\ \vdots \\ \sqrt{\tfrac{2}{n}}\cos\tfrac{(n-1)(2i+1)\pi}{2n} \end{pmatrix}, \qquad
q_j = \begin{pmatrix} \sqrt{\tfrac{1}{n}} \\[4pt] \sqrt{\tfrac{2}{n}}\cos\tfrac{(2j+1)\pi}{2n} \\[4pt] \sqrt{\tfrac{2}{n}}\cos\tfrac{2(2j+1)\pi}{2n} \\ \vdots \\ \sqrt{\tfrac{2}{n}}\cos\tfrac{(n-1)(2j+1)\pi}{2n} \end{pmatrix}.$$
We compute the dot product of $q_i$ and $q_j$:
$$q_i \cdot q_j = \sqrt{\frac{1}{n}}\sqrt{\frac{1}{n}} + \sum_{k=1}^{n-1} \sqrt{\frac{2}{n}}\cos\frac{k(2i+1)\pi}{2n}\,\sqrt{\frac{2}{n}}\cos\frac{k(2j+1)\pi}{2n}$$
$$= \frac{1}{n} + \frac{2}{n}\sum_{k=1}^{n-1} \cos\frac{k(2i+1)\pi}{2n}\cos\frac{k(2j+1)\pi}{2n}$$
$$= \frac{1}{n} + \frac{2}{n}\sum_{k=1}^{n-1} \frac{1}{2}\left[\cos\left(\frac{k(2i+1)\pi}{2n} + \frac{k(2j+1)\pi}{2n}\right) + \cos\left(\frac{k(2i+1)\pi}{2n} - \frac{k(2j+1)\pi}{2n}\right)\right] \quad \text{(by (3.19))}$$
$$= \frac{1}{n} + \frac{1}{n}\sum_{k=1}^{n-1} \cos\frac{k(i+j+1)\pi}{n} + \frac{1}{n}\sum_{k=1}^{n-1} \cos\frac{k(i-j)\pi}{n}$$
with $1 \le i+j+1 \le 2n-1$ and $0 \le i-j \le n-1$ (we may assume, without loss of
generality, that $i \ge j$). In order for the DCT matrix D to be orthogonal, two
things have to be verified. First, if $i = j$, then $q_i \cdot q_j$ must be 1. Second, if $i \ne j$,
then $q_i \cdot q_j$ must be 0. The case $i = j$ is easier to treat and that is what we start
with. Note that in this case $\cos\frac{k\pi(i-j)}{n} = 1$ for any k, which means that
$\sum_{k=1}^{n-1} \cos\frac{k\pi(i-j)}{n} = n-1$. On the other hand, $\sum_{k=1}^{n-1} \cos\frac{k\pi(i+j+1)}{n} = \sum_{k=1}^{n-1} \cos\frac{k\pi(2i+1)}{n} = 0$
by Lemma 3.3. We conclude then that $q_i \cdot q_j = \frac{1}{n} + \frac{1}{n}(n-1) = 1$ for $i = j$.
3.6 References
4.1 Introduction
Recently, after a family trip, a friend of mine decided to go back to using his
old paper map on his travels and to put his car GPS to rest forever. This
came after a series of disappointments with this little device, with its annoying
automated voice, its 4″ screen and its frequently interrupted signal (his
words, not mine). The latest of these disappointments was a trip from Ottawa
to Niagara Falls which took a turn into the US. Admittedly, such a turn is
normal, especially if the GPS is programmed to take the shortest distance,
except that my friend’s family did not have their passports on them that day.
If you have used a GPS before, you must have experienced some set-
backs here and there. But let us face it, the times when the trip goes
smoothly without any wrong turns or lost signal, we cannot help but ad-
mire the magic and ingenuity that transforms a little device into a holding
hand that takes you from point A to point B, sometimes thousands of kilo-
meters apart. It is almost “spooky” to think that someone is watching you
every step of the way from somewhere “out of this Earth”.
In this chapter, you will learn that there is nothing magical about the
GPS. It is the result of collective efforts of scientists and engineers with
Mathematics as the main link. After reading this chapter, use your time
on the road during your next trip to reveal to your co-travelers (with as
little mathematics as possible) the secret behind this technology. It works
every time I want to put my kids to sleep on a long trip.
If you press the “where am I” or “My location” buttons built in your GPS
receiver, your location will be displayed with expressions like 40◦ N, 30◦ W
and 1040 m, which are clearly not the “classical” cartesian coordinates.
This is because your GPS uses a more efficient coordinate system in which
the position or location of any point on or near the earth’s surface is de-
termined by three parameters known as the latitude, the longitude and the
altitude. You have probably seen these terms before in a geography
class, but let us review them anyway. First, consider a cartesian coordinate
system Oxyz of three orthogonal axes centered at the center O of the earth.
Consider a point Q(x, y, z) in the above coordinate system and let P be the
projection of Q on the earth’s surface. That is, P is the intersection point of
the vector $\overrightarrow{OQ}$ with the earth’s surface. The points Q and P share the same
latitude and longitude that we define in what follows.
The position of any point near the surface of the planet is uniquely
determined by its latitude, its longitude and its altitude. For example, a
point described as (40◦ N, 30◦ W, 1850 m) is a point located 40◦ north of
the Equator and 30◦ west of the Greenwich meridian and at a distance of
1.85 km from the sea level (or 6366 + 1.85 = 6367.85 km from the center of
the earth).
Next, you ask the same question to another person passing by, and he
answers: “You are 375 m away from the Math Department” and walks
away. You locate the Math Department on the map, labeled as M D, and
you draw on your map the circle C2 centered at M D and of radius 375 m.
This new information narrows your location to two possible points A and B.
[Figure: campus map showing the circles centered at UC, MD and FE.]
The point where the three circles meet determines your (relatively) exact
location.
Of course, in order for this to work, you must be lucky enough to have people
passing by giving you (relatively) precise distances from various locations
and to be able to somehow work the scale of the map to draw accurate
circles. Equally important is the kind of question you should ask the third
person in order to ensure that the third circle will somehow meet the other
two at exactly one point. Roughly speaking, a GPS receiver works the same
way except that the circles are replaced by spheres in three dimensions, and
the friendly people you ask to pinpoint your position on the campus map are
replaced with satellites located thousands of kilometers above the surface
of the earth.
Does this seem to be a bit too technical? In the next paragraph we try to
explain the idea of the “time lag” using a simple example.
Once the time delay dt is computed, the receiver’s internal computer multiplies
it by the speed of light (in vacuum), c = 299,792,458 m/s, to calculate the
distance separating the satellite from the GPS receiver.
All GPS receivers are built with multiple channels allowing them to re-
ceive and treat signals from at least four different satellites simultaneously.
Once it captures the signals of three satellites S1 , S2 and S3 in its range,
the receiver calculates the time delays t1 , t2 and t3 (respectively, in seconds)
taken by signals of the three satellites to reach it. The distances between
the receivers and the three satellites are computed as explained in the pre-
vious section: d1 = ct1 , d2 = ct2 , and d3 = ct3 , respectively. The fact that
the receiver is at a distance d1 from satellite S1 means that it could be
anywhere on the (imaginary) sphere Σ1 centered at S1 and of radius d1 .
Using the ephemeris data part of the satellite signal, the receiver knows the
position (a1 , b1 , c1 ) of the satellite S1 in the above system of axes, so the
sphere Σ1 has equation:
$$(x - a_1)^2 + (y - b_1)^2 + (z - c_1)^2 = d_1^2 = c^2t_1^2. \qquad (4.1)$$
The distance d2 = ct2 from the second satellite is computed and the receiver
is also somewhere on the sphere Σ2 centered at the satellite S2 , positioned
This narrows the position of the receiver to the intersection of two spheres,
namely to a circle Γ. Still not enough to determine the exact position.
Finally, the distance d3 = ct3 from the third satellite S3 , positioned at the
point (a3 , b3 , c3 ), shows that the receiver is also on the sphere Σ3 :
The orbits of the GPS satellites are inclined in such a way that the surface
of the third sphere intersects Γ in two points whose coordinates the receiver
can accurately compute. One of these two points will be unreasonably far
from the surface of the earth, and therefore only one possible position is left.
[Figure: the receiver at the intersection of the three spheres centered at the satellites S1, S2 and S3.]
The calculation of the time taken by the satellite signal to reach the
receiver (as explained above) assumes that the clocks in the receiver and on
board the satellite are in perfect synchronization, so that 6:00 am on board
the satellite means 6:00 am on the receiver clock. Unfortunately, that is
not the case. The satellites are equipped with atomic clocks: very sophisticated
and extremely accurate, but also very expensive. The clocks inside
the receivers, on the other hand, are the usual everyday digital clocks. The
difference between the two types of clocks creates a certain error in calculating
the real time delay of the GPS signal. You may wonder why the big fuss
about a time estimate that could be off by only a fraction of a second. Remember,
we are talking about waves traveling at the speed of light, which
makes the estimated distances from the satellites to the GPS receiver extremely
sensitive to gaps between the satellite and receiver clocks. To give
you an idea of the degree of sensitivity, an error of 0.000001 second (one
microsecond) results in an error of about 300 metres in the distance estimation.
No wonder the GPS receiver’s clock is the main source of error. This
means in particular that the distances d1, d2 and d3 shown in equations
(4.1), (4.2) and (4.3) above are not very accurate since they are based on
“fake” time delays t1, t2 and t3.
The true travel time of the signal from satellite Si is then equal to ti − ξ
(with ti as above) rather than simply ti . Equations (4.1), (4.2) and (4.3)
above can now be written as:
$$(H)\quad \begin{cases}
(x - a_1)^2 + (y - b_1)^2 + (z - c_1)^2 = d_1^2 = c^2(t_1 - \xi)^2 \\
(x - a_2)^2 + (y - b_2)^2 + (z - c_2)^2 = d_2^2 = c^2(t_2 - \xi)^2 \\
(x - a_3)^2 + (y - b_3)^2 + (z - c_3)^2 = d_3^2 = c^2(t_3 - \xi)^2
\end{cases}$$
Subtracting the equation corresponding to the fourth satellite (positioned at $(a_4, b_4, c_4)$, with measured delay $t_4$) from the first one gives
$$\left[(x - a_1)^2 + (y - b_1)^2 + (z - c_1)^2\right] - \left[(x - a_4)^2 + (y - b_4)^2 + (z - c_4)^2\right] = c^2(t_1 - \xi)^2 - c^2(t_4 - \xi)^2,$$
which, once expanded and simplified, becomes
$$2(a_4 - a_1)x + 2(b_4 - b_1)y + 2(c_4 - c_1)z = 2c^2(t_4 - t_1)\xi + (a_4^2 + b_4^2 + c_4^2) - (a_1^2 + b_1^2 + c_1^2) - c^2(t_4^2 - t_1^2).$$
The expression (a24 + b24 + c24 ) − (a21 + b21 + c21 ) − c2 (t24 − t21 ) in the above
equation is independent of the variables x, y, z and ξ of the system. To
simplify the notations a little bit, we call it A1 :
Repeating the same thing for the second and third equations in (S), we
obtain a new system (S 0 ) equivalent to (S) (in the sense that both systems
have the same set of solutions):
$$(S')\quad \begin{cases}
2(a_4 - a_1)x + 2(b_4 - b_1)y + 2(c_4 - c_1)z = 2c^2(t_4 - t_1)\xi + A_1 \\
2(a_4 - a_2)x + 2(b_4 - b_2)y + 2(c_4 - c_2)z = 2c^2(t_4 - t_2)\xi + A_2 \\
2(a_4 - a_3)x + 2(b_4 - b_3)y + 2(c_4 - c_3)z = 2c^2(t_4 - t_3)\xi + A_3 \\
(x - a_4)^2 + (y - b_4)^2 + (z - c_4)^2 = d_4^2 = c^2(t_4 - \xi)^2
\end{cases}$$
There are many ways to solve for x, y and z in terms of ξ in the first
three equations, but Cramer’s rule is probably the easiest to implement in
the receiver’s computer:
$$x = \frac{D_1}{D}, \qquad y = \frac{D_2}{D}, \qquad z = \frac{D_3}{D},$$
where D is the determinant of the 3 × 3 coefficient matrix of the first three
equations and $D_1$, $D_2$, $D_3$ are the determinants obtained by replacing the
corresponding column with the right-hand side (each of which is a linear
function of ξ).
Replacing x, y and z by $\frac{D_1}{D}$, $\frac{D_2}{D}$ and $\frac{D_3}{D}$ respectively in the fourth equation
of (S′) yields the following quadratic equation:
$$\left(\frac{D_1}{D} - a_4\right)^2 + \left(\frac{D_2}{D} - b_4\right)^2 + \left(\frac{D_3}{D} - c_4\right)^2 = c^2(t_4 - \xi)^2$$
$$c^2\xi^2 - 2c^2t_4\xi + \kappa = 0 \qquad (4.6)$$
where $\kappa = c^2t_4^2 - \left(\frac{D_1}{D} - a_4\right)^2 - \left(\frac{D_2}{D} - b_4\right)^2 - \left(\frac{D_3}{D} - c_4\right)^2$. Once again, the
way the satellites are put in their orbits guarantees that equation (4.6)
would have two solutions ξ1 and ξ2 . This gives two possible positions (one
for each of the two values found for ξ) with one of them corresponding to
a point very far from the surface of the planet that the receiver eliminates
as a possibility.
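As a rough illustration of the computation above, the following Python sketch (using numpy; every name is ours, and the inputs would come from the satellite ephemerides and the measured, biased, travel times) solves the first three equations of (S′) for x, y, z as functions of ξ and then the quadratic (4.6) for ξ:

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def locate_receiver(sat_pos, t):
    """Hypothetical sketch: positions in metres, times in seconds.
    sat_pos[i] = (a_i, b_i, c_i) for satellites 1..4, t[i] = measured delays."""
    (a1, b1, c1), (a2, b2, c2), (a3, b3, c3), (a4, b4, c4) = sat_pos
    t1, t2, t3, t4 = t
    # Linear system (S'): M @ [x, y, z]^T = u * xi + A
    M = 2 * np.array([[a4 - a1, b4 - b1, c4 - c1],
                      [a4 - a2, b4 - b2, c4 - c2],
                      [a4 - a3, b4 - b3, c4 - c3]])
    u = 2 * C**2 * np.array([t4 - t1, t4 - t2, t4 - t3])
    A = np.array([(a4**2 + b4**2 + c4**2) - (ai**2 + bi**2 + ci**2) - C**2 * (t4**2 - ti**2)
                  for (ai, bi, ci), ti in zip(sat_pos[:3], t[:3])])
    Minv = np.linalg.inv(M)
    p_lin, p_const = Minv @ u, Minv @ A     # (x, y, z) = p_lin * xi + p_const
    # Substitute into the fourth equation to get a quadratic in xi
    s4 = np.array([a4, b4, c4])
    a_coef = p_lin @ p_lin - C**2
    b_coef = 2 * p_lin @ (p_const - s4) + 2 * C**2 * t4
    c_coef = (p_const - s4) @ (p_const - s4) - C**2 * t4**2
    roots = np.roots([a_coef, b_coef, c_coef])
    # Keep the root giving a position close to the Earth's surface (~6371 km)
    best = min(roots, key=lambda xi: abs(np.linalg.norm(p_lin * xi + p_const) - 6.371e6))
    xi = float(np.real(best))
    return p_lin * xi + p_const, xi
```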
The last equation gives $\sin\beta = \frac{z}{d}$ and, since $-90° \le \beta \le 90°$, there
is a unique value of β satisfying $\sin\beta = \frac{z}{d}$, namely $\beta = \arcsin\left(\frac{z}{d}\right)$ (or
maybe you have seen this as $\sin^{-1}\left(\frac{z}{d}\right)$ in your first calculus course).
• Knowing the value of β, cos β can be computed and the system (L) is
now reduced to the following two equations:
$$\cos\varphi = \frac{x}{d\cos\beta}, \qquad \sin\varphi = \frac{y}{d\cos\beta}.$$
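Assuming the cartesian coordinates (x, y, z) of a point (and hence d) are known, the conversion to latitude and longitude sketched above could look like this (Python; all names are ours):

```python
import math

def cartesian_to_geodetic(x, y, z):
    """Hypothetical sketch: convert Earth-centered cartesian coordinates (in km)
    to latitude beta and longitude phi (in degrees) and the distance d from the
    center, using sin(beta) = z/d, cos(phi) = x/(d cos beta), sin(phi) = y/(d cos beta)."""
    d = math.sqrt(x * x + y * y + z * z)
    beta = math.asin(z / d)        # latitude, between -90 and 90 degrees
    phi = math.atan2(y, x)         # longitude, between -180 and 180 degrees
    return math.degrees(beta), math.degrees(phi), d

# Example: a point roughly 40 degrees North and 30 degrees West
print(cartesian_to_geodetic(4221.0, -2437.0, 4093.0))
```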
Obviously, the satellites are not emitting their signals using the words of
the song “I see trees of green red roses too...”. So what is the nature of these
signals and how are they engineered to be easily identified by a ground re-
ceiver? More importantly, how can we make sure the signal is sufficiently
“random” to suit the intended use?
4.5.1 Terminology
We start by going over some of the terminology needed for the rest of this
section. As seen in previous chapters, a binary sequence is a sequence
consisting of only two symbols, usually denoted by 0 and 1 (or On/Off
pulses), that we call bits. A binary sequence is said to be of length r if it is
a finite sequence consisting of r bits. A sequence a0 , a1 , a2 , . . . is called
periodic if there exists a positive integer p, called a period of the sequence,
such that an+p = an for all n. In other words, the periodic sequence repeats
itself every cycle of p terms. Note that if p is a period, then kp is also a
period for any positive integer k. The smallest possible value for p is called
the minimal period (in some books, it is simply called the period) of the
sequence. For example, the sequence
001011000101100010110001011000101100010110001011000101100010110
(the block 0010110 repeated) is periodic with minimal period 7.
$$a_m = a_0c_0 + a_1c_1 + \cdots + a_{m-1}c_{m-1}$$
(2) At the first “clock pulse”, the content of each cell is shifted to the
right by one box, “pushing out” the value a0. The content of the first
(leftmost) box is then calculated as follows: first compute the value
of the expression $\sum_{k=0}^{m-1} a_kc_k = a_0c_0 + a_1c_1 + \cdots + a_{m-1}c_{m-1}$. If the
result is even, the value $a_m = 0$ is inserted in the leftmost box. If
the result is odd, the value $a_m = 1$ is inserted in the leftmost box. If
you are familiar with modular arithmetic (see Section 4.5.3 below), this
amounts to calculating the sum $\sum_{k=0}^{m-1} a_kc_k$ “modulo” 2. We now have
the second “window” $w_1 = (a_1, a_2, \ldots, a_m)$ (again read from right to
left in the block diagram) and the first m + 1 terms of the sequence are
$a_m, a_{m-1}, \ldots, a_1, a_0$.
(3) This process is repeated. For example, at the second “clock pulse”, the
register shifts again the content of each cell to the right by one box
pushing out the value a1 this time. The content of the leftmost box is
calculated as c0 a1 + c1 a2 + · · · + cm−2 am−1 + cm−1 am (modulo 2) and
this is precisely the next bit, am+1 , in the sequence. We now have the
following terms am+1 , am , am−1 , . . . , a1 , a0 of the sequence.
(4) The procedure is iterated, creating an infinite binary sequence
. . . , ak , ak−1 , . . . , a2 , a1 , a0 .
By choosing the coefficients and the initial window appropriately (we will see
how later), we can guarantee that no window of all zeros will ever occur
and that the sequence produced by the register is periodic with the maximum
possible period.
Theorem 4.1. For a LFSR of degree m, one can always choose the
coefficients c0 , c1 , . . . , cm−1 and initial conditions a0 , a1 , . . . , am−1 in such
a way that the sequence produced by the register has a minimal period of
maximal length 2m − 1.
Definition 4.1. We say that the two integers a and b are congruent
modulo n and we write a ≡ b (mod n), if a and b have the same remainder
upon division by n.
Remark 4.3. In the notation of the equivalence class $\bar{k}$ used above, the
integer k is just one representative of that class. Any other element of the
same class is also a representative. For instance, $\bar{1}$ can also be represented
by $-5$ or by $7$ in $\mathbb{Z}_3$. To avoid confusion, the elements of $\mathbb{Z}_n$ are always
represented in the (standard) form $\bar{k}$ for $0 \le k \le n-1$. This way, we write
$\bar{2}$ instead of $\overline{14}$ in $\mathbb{Z}_3$. It is also worth mentioning that $\bar{n} = \bar{0}$ since the
remainder in the division of n by n is 0.
Our next task is to give the set Zn a certain algebraic structure by defining
an addition and a multiplication on the elements of the set that we call ad-
dition and multiplication modulo n. These operations are introduced
naturally in the following way:
• Addition modulo n. If $\bar{a}, \bar{b} \in \mathbb{Z}_n$, define $\bar{a} + \bar{b}$ to be the equivalence
class represented by the integer $a + b$. In other words, $\bar{a} + \bar{b} = \overline{a+b}$.
• Multiplication modulo n. If $\bar{a}, \bar{b} \in \mathbb{Z}_n$, define $\bar{a}\,\bar{b}$ to be the equivalence
class represented by the integer $ab$: $\bar{a}\,\bar{b} = \overline{ab}$.
Since a class in Zn has infinitely many representatives, one has to check
that these two operations are independent of the choice of representatives.
This is not hard to verify. The reader is certainly encouraged to try this as
an exercise.
4.5.4 Groups
It is not hard to prove that the identity element of a group is unique. Also,
if x is an element of a group, then the inverse of x is unique. If in addition
to the above axioms, the operation ∗ is commutative, that is x ∗ y = y ∗ x
for all x, y ∈ G, then the group G is called abelian. A subset H of a group
(G, ∗) is called a subgroup of G if H is itself a group with respect to the
same operation ∗.
All the group axioms can be easily verified. In particular, $\bar{0}$ is the zero
element of the group and if $\bar{k} \in \mathbb{Z}_n$, then the opposite of $\bar{k}$ is $\overline{n-k}$ since
$\bar{k} + \overline{n-k} = \bar{n} = \bar{0}$ in $\mathbb{Z}_n$.
Hence, $(\mathbb{Z}_2^*, \times)$, $(\mathbb{Z}_3^*, \times)$, $(\mathbb{Z}_5^*, \times)$ and $(\mathbb{Z}_{31}^*, \times)$ are all examples of multiplicative groups.
From this point on, and unless otherwise specified, the operation of a
multiplicative group is simply denoted with a concatenation of elements
and the identity element of the group is denoted simply by 1.
If $m < 0$, we define $g^m$ to be $\left(g^{-1}\right)^{-m}$. This is well defined since in a group,
every element has an inverse and $-m$ is now positive. If $m = 0$, we define
$g^m$ to be the identity element 1 of the group G.
The exponent laws for real numbers apply to elements of any group.
Given a group G and elements g, h in G, then for any integers m, n we have
• $g^{m+n} = g^mg^n$
• $\left(g^m\right)^n = g^{mn}$
• If G is abelian, then $(gh)^m = g^mh^m$.
An important theorem in the theory of finite groups (due to Lagrange)
relates the size of the subgroup to the size of the group.
for some $h'' \in H$ and therefore $y = g'h'h^{-1}h''$. But $h'h^{-1}h'' \in H$ since H
is a subgroup, so $y = g'h'h^{-1}h'' \in g'H$. This shows that gH is a subset of
g′H. Similarly, we can show that g′H is a subset of gH and conclude that
gH = g′H. So as soon as the sets gH and g′H have an element in common,
they must be equal. In other words, the sets gH and g′H are either disjoint
(no elements in common) or they are the same set. Note also that 1H is
simply the subgroup H. Finally, if g ∈ G, then g = g1 ∈ gH since 1 ∈ H.
The group G can then be written as the union of pairwise disjoint subsets
of the form:
$$G = H \cup g_1H \cup \cdots \cup g_rH$$
with $|H| = |g_1H| = \cdots = |g_rH|$. Thus, $|G| = |H| + |g_1H| + \cdots + |g_rH| = (r+1)|H|$.
We conclude that $|H|$ is a divisor of $|G|$.
Groups like (Z, +) and (Zn , +) can be “generated” by a single element.
For example, in the additive group (Z, +) every integer k can be generated
using the element 1: k = 1 + 1 + · · · + 1 = k × 1. We say in this case that
the group (Z, +) is generated by 1. Note also that −1 is a generator of
(Z, +). In general, we have the following.
Field theory has deep roots in the history of abstract algebra and has
been perceived for many years as a purely academic topic in mathematics.
But with the increasing demand on improving the technology and security
of communication, field theory started to play a central role in many real
life applications. In what follows we give a brief introduction and some
important facts about this theory, enough to be able to prove our main
result (Theorem 4.1).
to as the zero element and the second as the identity element of the field.
There is only one field, referred to as the zero field, where the zero element
and the identity element are the same. This is a set with only one element
0 with the obvious rules: 0 + 0 = 0 × 0 = 0. Any other field is called a
non-zero field.
The set (Z, +, ×) is not a field since (Z∗ , ×) is not a multiplicative group.
The sets Q, R and C (of rational numbers, real numbers and complex num-
bers respectively) with the usual addition and multiplication of numbers
are classic examples of a field structure. However, these are not the kind of
fields used in real applications. In what follows we look at fields containing
a finite number of elements that we call finite fields.
Proof.
(1) $a \cdot 0 = a(0 + 0) = a \cdot 0 + a \cdot 0$ (by the distributivity property of a field). As
an element of a field, $a \cdot 0$ must have an additive inverse (or opposite)
$-a \cdot 0$. Adding $-a \cdot 0$ to both sides of the last equation gives $0 = a \cdot 0$.
(2) Assume $ab = 0$. If $a \ne 0$, then a admits a multiplicative inverse $a^{-1}$
since $(\mathbb{F}^*, \times)$ is a group. Multiplying both sides of $ab = 0$ by $a^{-1}$
gives
$$a^{-1}(ab) = a^{-1} \cdot 0 \;\Rightarrow\; (a^{-1}a)b = 0 \;\Rightarrow\; 1 \cdot b = 0 \;\Rightarrow\; b = 0.$$
The existence of non-zero elements whose product is zero (for instance, 2 × 2 = 0 in Z4 and 2 × 3 = 0 in Z6) is the main reason why (Z4 , +, ×) and (Z6 , +, ×) are not fields.
Remark 4.6. It can be shown (but we will not show it here) that any finite
field $\mathbb{F}$ containing p elements, where p is a prime integer, is actually a “copy”
of $\mathbb{Z}_p$ (formally, we say $\mathbb{F}$ is isomorphic to $\mathbb{Z}_p$). By a “copy”, we mean that
we can relabel the elements of $\mathbb{F}$ to match those of $\mathbb{Z}_p$ (namely, $\bar{0}, \bar{1}, \ldots, \overline{p-1}$)
in such a way that the addition and multiplication tables of $\mathbb{F}$ are the
same as those of $\mathbb{Z}_p$. In other words, there is a unique field containing p
elements for each prime integer p. This field is denoted by $\mathbb{F}_p$.
From this point on, we will omit the overline in expressing the element
$\bar{a}$ of $\mathbb{Z}_p$ and just write a for simplicity. For instance, we write $\mathbb{Z}_3 = \{0, 1, 2\}$
and $\mathbb{Z}_5 = \{0, 1, 2, 3, 4\}$.
The field $\mathbb{F}_{p^r}$ plays a key role in understanding the properties of the
sequence produced by a LFSR. Our next task is to shed more light on its
structure. For this, we need the notion of a polynomial over a field.
We define addition and multiplication in F[x] in the usual way of adding and
multiplying two polynomials with real coefficients with the understanding
that the involved operations on the coefficients are done in the field F.
These two operations inside F[x] do not give this set the status of a field
since, for example, the multiplicative inverse of the polynomial x ∈ F[x]
does not exist (no polynomial p(x) exists such that xp(x) = 1).
Definition 4.7. Let p(x), q(x) be two polynomials in F[x] with p(x) not
equal to the zero polynomial. We say that p(x) is a divisor of q(x) (or that
p(x) divides q(x)) if q(x) = p(x)k(x) for some k(x) ∈ F[x]. In this case, we
also say that q(x) is a multiple of p(x).
Example 4.9. Let p(x) = x4 +2x3 +x+2 and k(x) = x2 +x+1 considered
as polynomials in Z3 [x] where as usual Z3 = {0, 1, 2}. We can perform the
long division of p(x) by k(x) the usual way but bear in mind that we are
not dealing with real numbers here but rather elements of the field Z3 .
Carrying out the long division:
$$x^4 + 2x^3 + x + 2 = x^2(x^2 + x + 1) + (x^3 - x^2 + x + 2),$$
$$x^3 - x^2 + x + 2 = x(x^2 + x + 1) + (-2x^2 + 2),$$
$$-2x^2 + 2 = -2(x^2 + x + 1) + (2x + 4),$$
so that $p(x) = (x^2 + x - 2)k(x) + (2x + 4)$.
The quotient is q(x) = x2 + x − 2 = x2 + x + 1 (since −2 = 1 in the field
Z3 ) and the remainder is r(x) = 2x + 4 = 2x + 1 (since 4 = 1 in the field
Z3 ).
Remark 4.8. It can be shown that if F is a finite field, then there exists a
monic irreducible polynomial of degree r in F[x] for any positive integer r.
Example 4.14. Let p(x) = x2 + 2x ∈ Z3 [x]. Let us see how we can add
and multiply the two polynomials h(x) = x3 + x2 and k(x) = x2 + 2x + 2
of Z3 [x] modulo p(x). First note that h(x) + k(x) = x3 + 2x2 + 2x + 2 and
h(x)k(x) = x5 + 3x4 + 4x3 + 2x2 = x5 + x3 + 2x2 (since 3 = 0 and 4 = 1
in Z3 ). We start by performing the long division of both h(x) + k(x) and
h(x)k(x) by p(x). This leads to the following two relations (the reader is
encouraged to do the long division):
h(x) + k(x) = xp(x) + (2x + 2) ,
Example 4.16. In Example 4.15 above, we saw that every non-zero element
of the field $\mathbb{Z}_2[x]/\langle x^3 + x + 1\rangle$ is a power of $\alpha = t$. Thus, $\alpha = t$ is a
primitive element of $\mathbb{Z}_2[x]/\langle x^3 + x + 1\rangle$.
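To see this concretely, here is a small sketch (Python; the bit-pattern representation and helper name are ours) of multiplication in $\mathbb{Z}_2[x]/\langle x^3 + x + 1\rangle$ that lists the powers of t:

```python
# Represent b2*t^2 + b1*t + b0 by the bit pattern (b2 b1 b0).
MOD = 0b1011  # x^3 + x + 1

def mul(a, b):
    """Multiply two elements of Z2[x]/<x^3 + x + 1>: carry-less multiplication
    followed by reduction modulo x^3 + x + 1."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    for shift in range(prod.bit_length() - 4, -1, -1):  # clear terms of degree >= 3
        if prod & (1 << (shift + 3)):
            prod ^= MOD << shift
    return prod

# t = 0b010 is primitive: its powers run through all 7 non-zero elements.
x, powers = 1, []
for _ in range(7):
    x = mul(x, 0b010)
    powers.append(x)
print([format(p, '03b') for p in powers])   # ['010', '100', '011', '110', '111', '101', '001']
```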
One can verify that the set $\mathbb{Z}_p^r$ equipped with the above addition and multiplication
with respect to a monic irreducible polynomial M(t) is indeed a field.
Remark 4.10. The key feature of this second approach is the fact that it
allows us to look at the r-tuples of $\mathbb{Z}_p^r$ as polynomials. The two fields $\mathbb{F}_p^r$
and $\mathbb{Z}_p[t]/\langle M(t)\rangle$ are copies of each other. Formally, we say that they are
isomorphic.
Theorem 4.10. For any prime integer p and any positive integer n, there
exists a primitive polynomial of degree n over the field Zp .
Remark 4.12. A special case of great interest in our treatment of the GPS
signal is the case where p = 2. In this case, there are $2^r$ polynomials of the
form $b_{r-1}t^{r-1} + \cdots + b_1t + b_0 \in \mathbb{Z}_2[t]$, with exactly half having the leading
coefficient $b_{r-1} = 0$ and the other half having leading coefficient $b_{r-1} = 1$.
This means that the lead function $\theta : \mathbb{F}_2^r \to \mathbb{F}_2$ takes the value 0 on exactly
half of the elements of $\mathbb{F}_2^r$ and the value 1 on the other half.
We now arrive at the last stop in our journey of understanding the mathe-
matics behind the signal produced by a GPS satellite using a LFSR. This
section provides the proof of the main result concerning the GPS signal
(Theorem 4.1). We start with the notion of correlation between two “slices”
of the sequence produced by a LFSR. It is the calculation of this correlation
that allows the GPS receiver to accurately compute the exact time taken
by the signal to reach it from the satellite.
4.6.1 Correlation
Definition 4.13. Given two binary finite sequences of the same length,
$A = (a_i)_{i=1}^n$ and $B = (b_i)_{i=1}^n$, the correlation between A and B, denoted
by $\nu(A, B)$, is defined as follows:
$$\nu(A, B) = \sum_{i=1}^{n} (-1)^{a_i}(-1)^{b_i}.$$
Note that:
• if $a_i = b_i$, then $(-1)^{a_i}(-1)^{b_i} = (-1)^{2a_i} = 1$, so $\sum_{i \in S_1} (-1)^{a_i}(-1)^{b_i} = |S_1|$,
where $S_1$ denotes the set of indices at which the two sequences agree;
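A one-line computation of the correlation (Python; the function name is ours) may be helpful:

```python
def correlation(A, B):
    """nu(A, B) = sum of (-1)^a_i * (-1)^b_i over two equal-length binary
    sequences: each agreement contributes +1, each disagreement -1."""
    return sum((-1) ** a * (-1) ** b for a, b in zip(A, B))

# Two slices of length 7 of the maximal-length sequence 0,0,1,0,1,1,1,...
print(correlation([0, 0, 1, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1, 0]))  # prints -1
```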
We know that such a polynomial exists by Theorem 4.10 above. For the
coefficients of the LFSR, choose the vector c = (cr−1 , . . . , c1 , c0 ) with com-
ponents equal to the coefficients of P (x). The choice of the initial window
can be any non-zero vector (ar−1 , . . . , a1 , a0 ) but the one we choose in the
following is very suitable in proving interesting facts about the sequence.
We use the lead function $\theta : \mathbb{F}_p^r \to \mathbb{F}_p$ defined in Example 4.20 above as
follows:
Note that:
$$\theta\left(t^r(t)\right) = \theta\left(c_{r-1}t^{r-1}(t) + \cdots + c_1t(t) + c_0(t)\right) \qquad (4.8)$$
$$= c_{r-1}\theta\left(t^{r-1}(t)\right) + \cdots + c_1\theta\left(t(t)\right) + c_0\theta\left((t)\right) \qquad (4.9)$$
$$= c_{r-1}a_{r-1} + \cdots + c_1a_1 + c_0a_0, \qquad (4.10)$$
where (4.8) follows from $t^r = c_{r-1}t^{r-1} + \cdots + c_1t + c_0$, (4.9) follows from
the linearity of the lead map θ and (4.10) follows from our definition of
the initial conditions $a_0, \ldots, a_{r-1}$. Look closely at the last expression. Is
that not how the LFSR computes its next term $a_r$? We conclude that
$\theta(t^r(t)) = a_r$. In fact, it is not hard to show that any term in the sequence
produced by a LFSR can be obtained this way. More specifically,
$$a_k = \theta\left(t^k(t)\right), \qquad k = 0, 1, 2, \ldots. \qquad (4.11)$$
Proof of Theorem 4.1. With the above choice of the coefficients (as
coefficients of a primitive polynomial) and the initial coefficients, we show
that a sequence produced by a LFSR with r registers has a period equal
to $N = 2^r - 1$. We already know (see Remark 4.1) that the sequence
is periodic with (minimal) period $T \le 2^r$. Since P(x) is chosen to be a
primitive polynomial, t is a generator of the multiplicative group of the
field $\mathbb{Z}_2[x]/\langle P(x)\rangle$ and so it has order $N = 2^r - 1$ as an element of
the group. This means that N is the smallest positive integer satisfying
$t^N = 1$. Moreover, for any $n \in \mathbb{N}$, we have
$$a_{n+N} = \theta\left(t^{n+N}(t)\right) = \theta\big(\underbrace{t^N}_{=1}\,t^n(t)\big) = \theta\left(t^n(t)\right) = a_n.$$
$$T \le N. \qquad (4.12)$$
For any $k \in \mathbb{N}$, if we apply θ to the relation $a_{k+T} = a_k$ we get $\theta\left(t^{k+T}(t)\right) = \theta\left(t^k(t)\right)$, or equivalently
$$\theta\left(t^k(t)(t^T - 1)\right) = 0. \qquad (4.13)$$
group of order N). So $1 + t^{m-n} = 2 = 0$ and $(-1)^{\theta\left(t^{n+k}(t)(1+t^{m-n})\right)} = 1$
for all $k = 0, \ldots, N-1$. This implies that the correlation in this case is
$\nu = \underbrace{1 + 1 + \cdots + 1}_{N} = N$. Assume next that $m - n$ is not a multiple of N; then
the polynomial $1 + t^{m-n}$ is non-zero and therefore $(t)(1 + t^{m-n})$ is also
non-zero as the product of two non-zero elements of the field $\mathbb{Z}_2[x]/\langle P(x)\rangle$.
As in the proof of Theorem 4.1, the fact that P(x) is chosen to be primitive
comes in very handy now: $(t)(1 + t^{m-n}) \ne 0$ implies that
$$(t)\left(1 + t^{m-n}\right) = t^j \quad \text{for some } j \in \{0, 1, 2, \ldots, N-1\}.$$
Now, since $(-1)^{\theta(0)} = (-1)^0 = 1$, the last sum in the above expression of
$\nu(W_1, W_2)$ can be written as
$$\sum_{k=0}^{N-1} (-1)^{\theta\left(t^{n+k}(t)(1+t^{m-n})\right)} = \sum_{\alpha_i \in \mathbb{F}_{2^r}^*} (-1)^{\theta(\alpha_i)} = \underbrace{\sum_{\alpha_i \in \mathbb{F}_{2^r}} (-1)^{\theta(\alpha_i)}}_{0} - (-1)^{\theta(0)} = -1.$$
This proves that the correlation between the two finite sequences is −1 in
this case.
This is indeed an amazing fact: take any two finite slices of the same length
$2^r - 1$ (the length of a period) in a sequence produced by a LFSR of degree r;
then you are sure that the number of terms which disagree is always one
more than the number of terms which agree (provided, as in the theorem,
that $m - n$ is not a multiple of $N = 2^r - 1$). This may sound weird, but
having poorly correlated sequences of maximal length $2^r - 1$ is important
for the GPS receiver since it makes the task of identifying satellites much
easier.
The satellite signals are streams of pulses that we usually represent by sequences
of 0’s and 1’s for simplicity; these are just representations of low and high voltages.
In the above diagram, the first signal represents the replica of the satel-
lite code generated by the receiver with t = a being the time of departure
of a particular code cycle from the satellite. The second signal represents
the signal arriving at the receiver with t = b being the time of arrival of
the cycle to the receiver. Signals emitted by various GPS satellites are in
perfect synchronization and the departure time from the satellites of the
start of each cycle is known by the receiver. The runtime of the signal is
marked by dt. In a perfect scenario, the distance between the satellite and
the receiver would simply be the product c · dt.
In reality, analyzing the satellite code at the receiver end is more so-
phisticated than the above description. Many algorithms are implemented
to increase the efficiency of the receiver. These are beyond the scope of this
book.
The idea of locating one’s position on the surface of the planet goes back
deep in human history. Some ancient civilizations were able to develop nav-
igational tools (like the Astrolabe) to locate the position of ships in high
seas. But let us not go that deep into history; after all, the chapter deals with
a very recent piece of technology.
• The story started in 1957 when the Soviet Union launched the satellite
Sputnik. Just days after, two American scientists were able to track its
orbit simply by recording changes in the satellite radio frequency.
• In the 1960’s, the American navy designed a navigation system for its
4.8 References
5.1 Introduction
In this chapter, we discuss the processing of images. Chances are you have
used at some point an image editing software like Photoshop or Gimp to
transform a photo. We discuss some of the mathematics underlying the
manipulation of images. Our approach will be to combine many images to
obtain the average image. Consider the following three digital images. The
image on the left is a picture taken at the Tremblant ski hill in 2015. The
image in the center is a picture taken in Vancouver in 2012 and the image
on the right is a picture taken in Morocco in 2015. We will combine the
faces from these images to obtain an “average” face.
There are many forms of digital images. Our goal here is to discuss the
manipulation of raster images. We can think of a raster image as an array
of pixels, where the pixel represents a unit square. For example, the above
image in Morocco has a height of 960 pixels and width of 720 pixels. The
image in Vancouver has a height of 720 pixels and a width of 960 pixels.
Lastly, the image in Tremblant is 960 × 540 pixels.
As seen in Chapter 3, a gray level is assigned to each pixel. The levels
are usually from 0 to 255, where 0 is black and 255 is white. Alternatively,
we can think of the levels as percentages, where 0% is black and 100% is
white.
Each black and white digital image can be represented by a matrix of
gray levels. For example, the image in Figure 5.1 is represented by the matrix
$$C = \begin{pmatrix} 10 & 35 & 100 \\ 125 & 175 & 200 \end{pmatrix}$$
where, for instance, the upper left pixel is assigned the gray level 10 and
the lower right pixel is assigned the level 200.
which is 3 pixels wide and 4 pixels high. The grayscale matrix of this
image is
$$C = \begin{pmatrix} 25 & 45 & 55 \\ 45 & 75 & 125 \\ 55 & 175 & 200 \\ 99 & 190 & 180 \end{pmatrix}. \qquad (5.1)$$
Using the point (2, 2) as the centroid, the point (x, y) = (0.5, 1.5) also has
the coordinates $[v_x, v_y] = [0.5 - 2, 1.5 - 2] = [-1.5, -0.5]$ with respect to the
centroid. The color of this point is C[1.5, 0.5] = C[2, 1] = 45. The computation
of the color when referring to the coordinates of the point with respect to the
centroid is $C[-0.5 + 2, -1.5 + 2] = C[1.5, 0.5] = C[2, 1] = 45$.
Its inverse Rθ−1 is the rotation matrix R−θ . In other words, to invert the
rotation, we apply a rotation with the same angle in the opposite direction.
To uniformly scale the image by a factor of r (where r > 0), we can use
the following matrix
$$S_r = \begin{pmatrix} r & 0 \\ 0 & r \end{pmatrix} = r\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
So $S_r = rI_2$, where $I_2$ is the identity matrix of size 2 × 2. The inverse of
the uniform scaling matrix $S_r$ is
$$S_r^{-1} = S_{1/r} = \begin{pmatrix} 1/r & 0 \\ 0 & 1/r \end{pmatrix} = \frac{1}{r}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \frac{1}{r}I_2.$$
When applying both a uniform scaling by a factor r and a rotation of
angle θ, the order of the transformations is not important. The matrix for
the transformation that starts with a rotation followed by a uniform scaling
is AT = Sr Rθ . This matrix is equal to
Sr Rθ = r I2 Rθ = r Rθ = r Rθ I2 = Rθ (r I2 ) = Rθ Sr .
The last matrix in this equality is the matrix for the transformation that
is a uniform scaling followed by the rotation. So regardless of the order of
these two transformations the result is the same. In fact the matrix of the
linear transformation that corresponds to rotation of angle θ clockwise and
a uniform scaling of a factor r is
$$rR_\theta = \begin{pmatrix} r\cos(\theta) & -r\sin(\theta) \\ r\sin(\theta) & r\cos(\theta) \end{pmatrix} = \begin{pmatrix} a & -b \\ b & a \end{pmatrix},$$
where a = r cos(θ) and b = r sin(θ). Theorem 5.1 states that any matrix of
the same form as the matrix on the right-hand-side of the above equality can
be interpreted as a matrix of the composition of a rotation and a uniform
scaling.
Before stating the theorem, we remind the reader that an ordered pair
(x, y) ∈ R2 (which is not (0, 0)) has a unique polar coordinate representation
(r, θ), where r > 0 and θ ∈ (−π, π]. There is a one-to-one correspondence
between the cartesian coordinates (x, y) and the polar coordinates (r, θ).
To go from the polar form to the cartesian form, we use the relations
x = r cos(θ) and y = r sin(θ).
Proof. Since (r, θ) is the polar coordinate for (a, b), then a = r cos(θ)
and b = r sin(θ). Thus, we can write AT = r Rθ , which is a clockwise
rotation of angle θ and a uniform scaling of factor r.
When applying a linear transformation (e.g. a rotation), we might end
up mapping some points outside of the region of the given image. As an
example, consider the rotation of the image in Figure 5.2. The boundary
of the region of the image corresponds to the rectangle ABCD. As we
rotate the image by an angle π/6 radians clockwise about the point (2, 2),
some points are mapped outside of the rectangle ABCD. In order not to
lose information, we increase the number of pixels. The new image, which
corresponds to the rectangle A′B′C′D′, will be of size 5 × 5 pixels, while
the original image was of size 3 × 4.
We first discuss the math involved in determining the new rectangle
A′B′C′D′ and identifying the corresponding centroid $(x'_0, y'_0)$ for our new
image. The original rectangle ABCD is a convex set and the points A, B,
C and D are the only extremal points in this set. (Refer to Section 5.8 for
a discussion concerning convex sets.) If we are using an invertible linear
transformation to transform the image, then the region of the new image
will also be a convex set and the images of A, B, C and D, say T (A) =
[ax , ay ], T (B) = [bx , by ], T (C) = [cx , cy ], T (D) = [dx , dy ], respectively, are
going to be the only extremal points in this set. So to find the boundary
of the new image, we need the extreme abscissas $x'_{\min} = \min\{a_x, b_x, c_x, d_x\}$ and
$x'_{\max} = \max\{a_x, b_x, c_x, d_x\}$, together with the extreme ordinates $y'_{\min}$ and $y'_{\max}$
defined analogously.
We will round these extreme values to the nearest integer to work in pixels.
(We might lose a bit of information by rounding, but it should be negligible,
i.e. not visible.) In Figure 5.2, $x'_{\min} = -3$, $x'_{\max} = 2$, $y'_{\min} = -3$, $y'_{\max} = 2$,
measured from the centroid (2, 2). Thus, the vertices of the new rectangular
image are $A' = [-3, -3]$, $B' = [2, -3]$, $C' = [2, 2]$, $D' = [-3, 2]$, respectively.
So the new rectangle should be $x'_{\max} - x'_{\min} = 5$ pixels wide and $y'_{\max} - y'_{\min} = 5$
pixels high. Since A′ is the point in the upper left corner of the
new image, it should correspond to the point (0, 0). But its coordinates
with respect to the centroid (2, 2) are given by the point $A' = [-3, -3]$,
which means that it is 3 units above and 3 units to the left of the centroid. By
making the centroid the point (3, 3), A′ will correctly be the point
(0, 0). So the centroid in the new image is the point (3, 3).
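A rough sketch of this bookkeeping (Python; all names are ours) for the rotation of Figure 5.2:

```python
import math

def new_bounds(w, h, centroid, T):
    """Transform the four corners of the original w x h image (coordinates taken
    relative to the centroid), then round the extreme abscissas/ordinates to get
    the size and the centroid of the new image."""
    cx, cy = centroid
    corners = [(0 - cx, 0 - cy), (w - cx, 0 - cy), (w - cx, h - cy), (0 - cx, h - cy)]
    images = [(T[0][0] * x + T[0][1] * y, T[1][0] * x + T[1][1] * y) for x, y in corners]
    xmin = round(min(x for x, _ in images)); xmax = round(max(x for x, _ in images))
    ymin = round(min(y for _, y in images)); ymax = round(max(y for _, y in images))
    return (xmax - xmin, ymax - ymin), (-xmin, -ymin)   # (new size), new centroid

theta = math.pi / 6
R = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
print(new_bounds(3, 4, (2, 2), R))   # ((5, 5), (3, 3)), as in Figure 5.2
```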
Now that we have the appropriate rectangular region for the transformed
image, we must color it. In other words, we must determine the gray level
for each pixel in the image. Suppose that the new image has a width of $w'$
pixels and a height of $h'$ pixels (in Figure 5.2, we have $w' = 5$ and $h' = 5$).
It is important to keep in mind the center of the image. In the case of Figure
5.2, it is $(x'_0, y'_0) = (3, 3)$. We need to find $C'[i, j]$ for $i = 1, 2, \ldots, h'$ and
$j = 1, 2, \ldots, w'$, where $C'$ is the matrix of the transformed image.
To find the color, we must think in terms of the inverse transformation.
$C'[i, j]$ is the color of a pixel, which is a unit square. We will use the
point $(j - 1/2, i - 1/2)$ (which is the center of the corresponding pixel) as
a representative of the pixel. As an example, let us consider the gray level
$C'[3, 4]$ in the new image. This is the color of the pixel which has $(x', y') =
(3.5, 2.5)$ as its representative. Expressing the point in terms of coordinates
with respect to the centroid (3, 3) gives us [3.5 − 3, 2.5 − 3] = [0.5, −0.5].
Let us back-transform this vector.
The transformation from the original image to the new image is a ro-
tation of π/6 radians about the centroid. The inverse transformation is
a rotation of −π/6 radians about the center. The point with coordinates
v = [0.5, −0.5] with respect to the centroid (3, 3) has the pre-image
$$R_{-\pi/6}\,v = \begin{pmatrix} \cos(-\pi/6) & -\sin(-\pi/6) \\ \sin(-\pi/6) & \cos(-\pi/6) \end{pmatrix}\begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix} \approx \begin{pmatrix} 0.183 \\ -0.683 \end{pmatrix}.$$
Recall that the centroid in the original image is the point (2, 2). So the
representative of our pixel is located at the point (2.183, 1.317). The
unit square centered at (2.183, 1.317), which is highlighted in Figure 5.3,
is overlapping four pixels in the original image (i.e. the rectangular re-
gion ABCD). Naturally, we should use a combination of the four colors:
C[1, 2] = 45, C[1, 3] = 55, C[2, 2] = 75, and C[2, 3] = 125. The largest over-
lap is with the pixel in the second row and the third column of the rectan-
gular region ABCD. Somehow, the color should be closest to C[2, 3] = 125.
In this section, we discuss the assignment of a color C[i∗, j∗] to our pixel
in the case where i∗ and j∗ are not integers. This corresponds to the case
where the back transformed pixel is overlapping pixels in the original image.
To this end, we use a technique called bilinear interpolation.
Consider the color $C[i^*, j^*] = C[i + p, j + q]$, where i, j are integers and
$0 \le p < 1$, $0 \le q < 1$. Its bilinear interpolation is
$$C[i + p, j + q] \approx (1-p)(1-q)\,C[i, j] + (1-p)q\,C[i, j+1] + p(1-q)\,C[i+1, j] + pq\,C[i+1, j+1],$$
and
$$77 \approx (5.8\%)\, C[1, 2] + (12.5\%)\, C[1, 3] + (25.9\%)\, C[2, 2] + (55.9\%)\, C[2, 3].$$
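A sketch of this interpolation as a small Python helper (our own names; Python lists are 0-indexed, so C[i][j] below corresponds to the book's C[i + 1, j + 1]):

```python
def bilinear(C, i_star, j_star):
    """Color at the non-integer position (i*, j*) = (i + p, j + q): a weighted
    average of the four surrounding pixels with weights (1-p)(1-q), (1-p)q,
    p(1-q) and pq."""
    i, j = int(i_star), int(j_star)
    p, q = i_star - i, j_star - j
    return ((1 - p) * (1 - q) * C[i][j] + (1 - p) * q * C[i][j + 1]
            + p * (1 - q) * C[i + 1][j] + p * q * C[i + 1][j + 1])
```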
The purpose of this section is to compute the “average face”. Note that we
cannot just compute the average of the matrices for the three images on
page 143 for example. The images could be of different sizes. Furthermore,
the faces could be at different locations in the image. We need to determine
the location of the face in each image. To do so, we use eight landmarks.
We identify the location of the inner and outer eye, the nostril, and the
outer mouth, both on the right and the left. Each landmark is located at
a certain row (height) and column (width).
We define the centroid of the face as being the point (c, r), where c and
r are respectively the averages of the eight columns and the eight rows of
the landmarks. In Figure 5.4, the landmarks are identified with x’s and the
centroid by a small circle.
To compute the average of our three faces in the images on page 143,
for each image, we keep an image of size 200 pixels by 200 pixels centered
about the centroid of the face. Each image is represented by a matrix.
We compute the average of the three matrices. The result is in Figure 5.6
on page 154. The original images are in the diagonal of the array. The
images off the diagonal in the upper right corner are pairwise averages.
The average of the three images is in the lower left corner.
We should notice that finding the center of the face is not sufficient
to make an average face. The eyes in image 1 (at Tremblant) are slanted
compared to the eyes in image 2 (at Vancouver). Furthermore, image 3 (at
Morocco) appears to be on a different scale than the other images. The
face appears to be closer to the camera in this last image. We will have to
transform the images to try to align the landmarks, to get the eyes on the same level.
For example, the vector corresponding to the exterior right eye is [−25, −8]
for Tremblant, but it is [−23.5, −9.5] in Vancouver. So the distance between
these corresponding landmarks is $\sqrt{(-25 - (-23.5))^2 + (-8 - (-9.5))^2} \approx 2.121$.
To get a total distance (in square units) between two faces, we square the Euclidean distance between each pair of corresponding landmarks and compute the sum of the squared distances over the eight landmarks. For our three images, we compute the total distance (in square units) for each pair of faces.
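Here is a minimal sketch of this computation; only the first landmark pair reuses the exterior-right-eye vectors quoted above, and the remaining coordinates are invented for the example:

```python
import numpy as np

# Hypothetical landmark coordinates relative to each face's centroid
# (one row per landmark); only the first rows reuse the vectors above.
face_a = np.array([[-25.0, -8.0], [-12.0, -7.5], [12.5, -7.0], [24.0, -8.5],
                   [-8.0, 14.0], [9.0, 13.5], [-15.0, 28.0], [14.5, 27.0]])
face_b = np.array([[-23.5, -9.5], [-11.0, -8.0], [11.5, -8.5], [23.0, -9.0],
                   [-7.5, 13.0], [8.5, 14.0], [-14.0, 27.5], [15.0, 26.5]])

# Euclidean distance between each pair of corresponding landmarks ...
per_landmark = np.sqrt(np.sum((face_a - face_b) ** 2, axis=1))
print(per_landmark[0])                 # about 2.121 for the eye landmark

# ... and the total distance (in square units) between the two faces.
total_sq_distance = np.sum((face_a - face_b) ** 2)
print(total_sq_distance)
```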
Repeating for the other seven landmarks, we get a system of 16 linear equations with two unknowns. Since we will not be able to get the landmarks to match exactly, the system is in general inconsistent; we instead look for the values of the unknowns that minimize the total squared distance between the landmarks (the least squares method discussed below).
images in positions (1, 1) and (2, 2). The image at position (1, 3) in Figure
5.8 is the average of the diagonal images in positions (1, 1) and (3, 3). The
image at position (2, 3) in Figure 5.8 is the average of the diagonal images
in positions (2, 2) and (3, 3). Compare the averages in Figure 5.6 to those in Figure 5.8. By using the least squares method, we are able to get a much better result.
In this section, we study the concept of convex sets and extremal points.
Definition 5.1.
(i) A subset K of R2 is called a convex set if for every pair of vectors ~v1
and ~v2 in K, we have
t ~v1 + (1 − t)~v2 ∈ K,
for every t ∈ [0, 1]. We say that the vector t ~v1 + (1 − t)~v2 is a convex
combination of ~v1 and ~v2 .
(ii) Let K be a convex set and ~v an element of K. We say that ~v is an
extremal point, if K \ {~v } (i.e. K without ~v ) is also a convex set.
We can think of convex combinations as all the vectors that lie on the
“line segment” between ~v1 and ~v2. Refer to Figure 5.9 for an example
of a convex combination of vectors. It should be evident that a rectangular
image is a convex set in R2 . Furthermore, the four vertices of the rectangle
are the only extremal points in the set.
Given a linear map T : R2 → R2 and a convex set K ⊆ R2 , is the image
T (K) of K a convex set? Moreover, if K is convex and ~v is an extremal
point of K, is T (~v ) an extremal point of T (K)? The following theorem
gives the answer with the assumption that T is invertible.
Proof.
(a) Let v1, v2 ∈ T(K), and write v1 = T(u1), v2 = T(u2) for some vectors u1, u2 ∈ K. For any t ∈ [0, 1], we have
\[
t\,v_1 + (1-t)\,v_2 = t\,T(u_1) + (1-t)\,T(u_2) = T\big(t\,u_1 + (1-t)\,u_2\big) \in T(K),
\]
since K is convex and T is linear. This shows that T(K) is convex.
In this section, we discuss the least squares method to find the “best” solu-
tion to an inconsistent system of linear equations. For example, it is possible
that we cannot find a transformation that aligns the image perfectly. In
this case, we will have to content ourselves with finding the transformation
that minimizes the distance between landmarks.
Here is the general setting. Given a system of n linear equations in p
variables β1 , β2 , . . . , βp :
\[
\begin{aligned}
y_1 &= \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p}\\
y_2 &= \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p}\\
&\ \ \vdots\\
y_n &= \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np}
\end{aligned}
\]
The last expression in (5.2) gives the dot product in terms of matrix multiplication. This means in particular that the dot product inherits many of the properties of matrix multiplication.
For an n × p matrix X:
• The null space $\mathrm{Null}(X) = \{v \in \mathbb{R}^p : Xv = 0\}$ is a subspace of $\mathbb{R}^p$.
• $p = \mathrm{rank}(X) + \dim \mathrm{Null}(X)$ (the rank–nullity theorem).
• $\langle u, v \rangle = u^t (X^t X)\, v$ is symmetric and bilinear; it is a genuine dot product (i.e. positive definite) exactly when $\mathrm{Null}(X) = \{0\}$, that is when $\mathrm{rank}(X) = p$.
• In that case $X^t X$ is invertible and the least squares solution is $\hat{\beta} = (X^t X)^{-1} X^t Y$.
• Note that if X has fewer rows than columns, then $\mathrm{rank}(X) < p$ and the method of least squares will not work.
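As an illustrative sketch (not the book's code), the normal-equations solution β̂ = (XᵗX)⁻¹XᵗY can be computed with NumPy as follows; the design matrix and response vector are randomly generated stand-ins:

```python
import numpy as np

# Hypothetical overdetermined system: n = 16 equations, p = 2 unknowns,
# as in the landmark-alignment problem (the numbers are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))            # n x p design matrix
Y = rng.normal(size=(16, 1))            # n x 1 response vector

# Least squares solution via the normal equations: beta = (X^t X)^{-1} X^t Y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# np.linalg.lstsq is preferred in practice for numerical stability,
# but it returns the same solution when rank(X) = p.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat.ravel(), beta_lstsq.ravel())
```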
Looking at the expression inside the square root of the standard deviation,
we observe that it computes the average squared deviation away from the
mean. It represents an average distance away from the mean in squared
units. The square root is used to obtain a measure in the same units as the
intensities. So the standard deviation is a measurement of variability. The
more the intensities are dispersed about the mean, the larger the standard
deviation. The more the intensities are concentrated about the mean, the
smaller the standard deviation.
The variance (i.e. the square of the standard deviation) of the stan-
dardized intensities is
\[
\begin{aligned}
\sigma_z^2 &= \frac{1}{p}\sum_{i=1}^{p}(z_i - \bar{z})^2 = \frac{1}{p}\sum_{i=1}^{p} z_i^2 \qquad (\text{since } \bar{z} = 0)\\
&= \frac{1}{p}\sum_{i=1}^{p}\left[(x_i - \bar{x})/\sigma_x\right]^2 = \frac{1}{p\,\sigma_x^2}\sum_{i=1}^{p}(x_i - \bar{x})^2 = \frac{\sigma_x^2}{\sigma_x^2} = 1.
\end{aligned}
\]
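A quick numerical check of this fact, with an invented vector of intensities:

```python
import numpy as np

x = np.array([12.0, 40.0, 7.0, 55.0, 23.0, 90.0])   # hypothetical pixel intensities
sigma_x = x.std()                 # population standard deviation (divides by p)
z = (x - x.mean()) / sigma_x      # standardized intensities

print(z.mean())                   # essentially 0
print(z.var())                    # essentially 1
```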
Bearing in mind that someone’s face can vary from image to image, this
means that the image we are comparing to the database will not be exactly
the same as any of the images in the database. So we will need to rank
them from the closest to our image to the furthest away.
We compare the distorted images of Alice and Bob (see Figure 5.11) to
our database of six penguins (see Figure 5.10). The distances between the
images are found in Table 5.1. Among the six penguins in the database, it is
the image of Alice that is closest to the distorted image of Alice. However,
the distance of 14 between the two images of Alice is not considered very
small.
In practice, an arbitrary threshold is used as a rule to determine if the person in the image is in the database. If the threshold were set to 10, then the computer would conclude that the penguin in the distorted image of Alice is not in the database. The conclusion would be similar for Bob, since the distorted image of Bob is more than 10 units away from all of the images in the database.
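A minimal sketch of this decision rule (the distance values below are invented, except that the smallest one matches the distance of 14 quoted above):

```python
import numpy as np

# Hypothetical distances from a probe image to the six database images.
distances = np.array([14.0, 18.2, 17.5, 16.9, 19.3, 21.0])
threshold = 10.0

best = int(np.argmin(distances))
if distances[best] <= threshold:
    print(f"Recognized as database image {best + 1}")
else:
    print("Not in the database")   # this branch is taken here, as in the text
```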
The root of the problem is that we are comparing all of the p = 2500
pixels. We are using some information in the comparison that is not relevant to the differences between the 6 penguins in the database. We need to identify, at the level of the pixels, the variations between the images of the penguins.
The variance of the (standardized) intensities for the 1000th pixel and
the 1500th pixel are respectively
\[
\sigma_{1000}^2 = \frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2 = 0.318
\qquad \text{and} \qquad
\sigma_{1500}^2 = \frac{1}{n}\sum_{i=1}^{n}(v_i - \bar{v})^2 = 0.153.
\]
Among these two pixels, the 1000th is more variable in terms of the (stan-
dardized) intensities. The larger the variance, the more this pixel will be
useful in the differentiation of the images. So if we were to retain only one
of these two pixels for the comparison of the images, we should choose the
1000th pixel. However, we will see that combining the information from
the two pixels will give us a component that is even more variable.
Before discussing the combination of pixels, we look at the statistical
association between these two pixels. A scatter plot of the 1500th pixel
against the 1000th pixel is found in Figure 5.13. The vertical and the
horizontal lines in the middle of the plot are the respective means of the
two variables. We see that they are highly associated. When one of the
pixels is large compared to its mean, so is the other pixel. Furthermore,
when one of the pixels is small compared to its mean, so is the other pixel.
This type of association is called positive correlation. When the majority
of the points are in Quadrants I and III, we say the association is positive, while a negative association occurs when the majority of the points are in Quadrants II and IV.
A statistic that captures the sign (i.e. positive or negative) of the asso-
ciation is the covariance. The covariance is computed as follows:
\[
\sigma_{1000,\,1500} = \frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v}) = 0.2064.
\]
If there is no association between the pixels, then the points in the scatter
plot are going to be scattered in all four quadrants delimited by the respec-
tive means of these two variables. In this case, the covariance should be
close to zero.
If the association between the pixels is positive, then the majority of
the points are in Quadrants I and III. Thus, the product (ui − u)(vi − v)
is positive for the majority of the points, which gives a positive covariance.
If the association between the pixels is negative, then the majority of the
points are in Quadrants II and IV. Thus, the product (ui − u)(vi − v) is
negative for the majority of the points, which gives a negative covariance.
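To make these computations concrete, here is an illustrative sketch with two invented intensity vectors (so the numbers will not match those quoted above):

```python
import numpy as np

# Hypothetical standardized intensities of two pixels across n = 6 images.
u = np.array([ 0.9, -0.3,  0.5, -0.8,  0.2, -0.5])   # 1000th pixel
v = np.array([ 0.6, -0.2,  0.4, -0.7,  0.1, -0.2])   # 1500th pixel

var_u = np.mean((u - u.mean()) ** 2)
var_v = np.mean((v - v.mean()) ** 2)
cov_uv = np.mean((u - u.mean()) * (v - v.mean()))

print(var_u, var_v, cov_uv)   # cov_uv > 0 indicates a positive association
```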
The combined component is more variable than the 1000th pixel and the 1500th pixel. Recall that $\sigma_{1000}^2 = 0.318$ and $\sigma_{1500}^2 = 0.153$. So the combined
component will be more useful in the differentiation of the images compared
to the 1000th pixel or to the 1500th pixel.
We do have to be careful in the combination of the pixels, since we can
obtain a component that has a smaller variance than the original pixels. If
we project the vectors onto line 2 in Figure 5.14, which has the direction vector $\vec{d}_2 = (0.5608582, -0.8279119)$, we get
w1 = −0.193, w2 = −0.043, w3 = 0.117,
Fig. 5.13 Scatter plot of the 1500th pixel versus the 1000th pixel.
Fig. 5.14 Projections of the points onto lines passing through the centroid.
The latter is the covariance between u and v. Thus, $\frac{1}{n}\sum_{i=1}^{n} y_i\, y_i^t$ is the covariance matrix V.
To combine the pixels, we orthogonally project the point $z_i \in \mathbb{R}^p$ (which corresponds to the ith image) onto a line in $\mathbb{R}^p$ that passes through the centroid $\bar{Z}$ and with direction vector $e \in \mathbb{R}^p$. We assume that the direction vector $e$ is a unit vector (that is $\|e\| = 1$, or equivalently $e^t e = 1$). The corresponding scalar projection is $y_i \cdot e = y_i^t e$. The scalar projections of the n = 6 images are:
\[
w_1 = y_1^t e, \quad w_2 = y_2^t e, \quad \ldots, \quad w_n = y_n^t e.
\]
The mean of the scalar projections is always zero:
\[
\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i = \frac{1}{n}\sum_{i=1}^{n} y_i^t\, e = \left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)^{\!t} e = 0^t\, e = 0.
\]
The variance of the scalar projections can be computed from the covariance
matrix V :
\[
\begin{aligned}
\sigma_w^2 &= \frac{1}{n}\sum_{i=1}^{n}(w_i - \bar{w})^2 = \frac{1}{n}\sum_{i=1}^{n} w_i^2 \qquad (\text{since } \bar{w} = 0)\\
&= \frac{1}{n}\sum_{i=1}^{n} w_i^t\, w_i \qquad (\text{since } w_i \text{ is a scalar})\\
&= \frac{1}{n}\sum_{i=1}^{n} (y_i^t e)^t (y_i^t e)
 = \frac{1}{n}\sum_{i=1}^{n} e^t\, y_i\, y_i^t\, e
 = e^t \left[\frac{1}{n}\sum_{i=1}^{n} y_i\, y_i^t\right] e = e^t V e.
\end{aligned}
\]
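An illustrative numerical check of the identity σ_w² = eᵗVe, using invented centered observations:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 4))          # hypothetical observations, n = 6, p = 4
Y = Y - Y.mean(axis=0)               # center each column: rows are the y_i

V = (Y.T @ Y) / Y.shape[0]           # covariance matrix (1/n) * sum of y_i y_i^t

e = rng.normal(size=4)
e = e / np.linalg.norm(e)            # unit direction vector

w = Y @ e                            # scalar projections w_i = y_i^t e
print(np.mean(w ** 2))               # variance of the projections (their mean is 0)
print(e @ V @ e)                     # equals e^t V e, the same value
```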
We want to find the direction e that maximizes the variance $\sigma_w^2$. The
following theorem will be useful in the interpretation of the optimal direc-
tion. We state the theorem without proof. It is a well-known result in linear
algebra. It states that a symmetric matrix (i.e. the matrix is equal to its
transpose) is diagonalizable. The notions of eigenvalues, eigenvectors and
orthogonal bases are quickly reviewed in the remarks following the theorem.
Remarks:
(a) We say that a non-zero vector e ∈ Rp is an eigenvector of V , if there
exists a scalar λ such that
V e = λ e.
The scalar λ is called the corresponding eigenvalue.
(b) We say that B = {e1, e2, . . . , ep} is an orthonormal basis of Rp, if
(i) the vectors in B are orthogonal, i.e. $e_j^t e_k = 0$ for all $j \neq k$;
(ii) the vectors in B are unit vectors, i.e. $\|e_j\| = 1$ for all j. Equivalently, this means $e_j^t e_j = 1$ for all j.
(c) As a consequence of part (b), the product $P^t P$ is equal to the identity matrix I, since the columns of P form an orthonormal basis of Rp.
(d) The scalar projection of the ith centered observation $y_i = Z_i - \bar{Z}$ onto the jth eigenvector $e_j$, that is $y_i \cdot e_j = y_i^t e_j$, is called the jth principal component by statisticians.
Here are a few consequences of the Principal Axis Theorem.
(1) The eigenvalues of the covariance matrix V can be interpreted as vari-
ances. Let w be the component corresponding to the scalar projections
of the centered observations y i for i = 1, 2, . . . , n along the line that
passes through the centroid with direction vector v, where v is an
eigenvector of V with corresponding eigenvalue λ. As seen above, the variance of w is $v^t V v$, which becomes
\[
s_w^2 = v^t V v = v^t (\lambda\, v) = \lambda\, (v^t v) = \lambda,
\]
since $v^t v = 1$.
(2) The trace of the covariance matrix V is the sum of the elements in its diagonal, that is $\mathrm{tr}(V) = \sum_{j=1}^{p} \sigma_j^2$. It is called the total variance. For our database of n = 6 penguins, the total variance is tr(V) = 1327.045. We can compute the trace from the sum of the eigenvalues:
\[
\mathrm{tr}(V) = \mathrm{tr}(P D P^t) = \mathrm{tr}(P^t P D) = \mathrm{tr}(I D) = \mathrm{tr}(D) = \sum_{j=1}^{p} \lambda_j.
\]
We used the fact that $P^t P = I$, where I is the p × p identity matrix, and also the cyclic property of the trace: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for any matrices A, B such that AB and BA are defined.
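The following sketch (with an invented data set) checks both consequences numerically, using NumPy's symmetric eigensolver:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(6, 5))            # hypothetical centered data: n = 6, p = 5
Y = Y - Y.mean(axis=0)
V = (Y.T @ Y) / Y.shape[0]             # covariance matrix

eigenvalues, P = np.linalg.eigh(V)     # V is symmetric, so it is diagonalizable

# (1) The variance of the projections along an eigenvector equals its eigenvalue.
v = P[:, -1]                           # eigenvector with the largest eigenvalue
w = Y @ v
print(np.mean(w ** 2), eigenvalues[-1])

# (2) The total variance tr(V) equals the sum of the eigenvalues.
print(np.trace(V), eigenvalues.sum())
```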
Given a symmetric p × p matrix A = [aij ] and a vector e =
[e1 , e2 , . . . , ep ]t ∈ Rp , we define a linear form and a quadratic form of e
as follows. For any b = [b1 , b2 , . . . , bp ]t ∈ Rp , the scalar
\[
L = b^t A\, e = e^t A\, b = \sum_{i=1}^{p}\sum_{k=1}^{p} b_k\, a_{ki}\, e_i
\]
is called a linear form of e.
The distances between the principal features of the images are found in Table 5.2. Using a threshold of 10, we see that the computer was able to identify the two distorted images as the first and the fourth image in the database, respectively. The distance between the features of the distorted image of Alice and the first penguin in the database is 1.9 (which is a very small distance). Similarly, the distance between the features of
the distorted image of Bob and the fourth penguin in the database is only 1.5. By using the first four principal features, the computer is more certain that the two distorted images are images of penguins from the database.
for the first feature. We interpret the feature as a deviation away from the
mean image of the six images in the database along a particular direction. A
complete gray image means that the image is very close to the mean image
in terms of that feature (see Penguin 6 in Figure 5.15). Furthermore, it is
very difficult to differentiate the second from the third penguin with only the first feature. However, according to the second feature, the third and the second penguins are different (see Figure 5.16).
The features are ordered according to the size of the variance. The first
features are going to be more useful to differentiate the images. As we look
at Figure 5.18, we see that the projected images are very similar. In fact,
the fourth feature only accounts for 12.9% of the total variance. Notice
also that for the distorted images of Alice and Bob, the projected image for
each feature resembles the projected images of the first and fourth penguins.
The first and fourth penguins in the database are Alice and Bob. As long as the principal features of the face of a particular individual are similar from image to image, the computer should be able to recognize the person’s face. What are the principal features of a face? This will depend
on the images in the database. The features are determined in terms of the
maximum variance.