

Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

THE MATHEMATICS THAT POWER OUR WORLD


How Is It Made?
Copyright © 2016 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

ISBN 978-981-4730-84-6
ISBN 978-981-3144-08-8 (pbk)

Printed in Singapore
“To the person who made me watch dozens of the How it’s
made episodes during his childhood years, who continues to
inspire me every day with his never-ending curiosity. To the
better version of me, my son Michel.”

Joseph Khoury
Preface

In the early 21st century, a TV show called How it’s made premiered in
North America and quickly grew in popularity among viewers of all ages.
The purpose of the show was to look behind the scenes to explain in sim-
ple terms how common everyday items are actually made. Although many
episodes of the show featured simple things like the jeans we wear, the bi-
cycle we ride or even some of the processed food we eat, it was certainly an
eye opener on the ingenuity and effort behind the simplest things we use
on a daily basis.

Unlike other scientists, many mathematicians work relentlessly on solving difficult theoretical problems without giving much thought to the practical implications of their discoveries for day-to-day life. This can be explained from cultural and practical points of view. On a cultural level, mathematics was perceived throughout human history as a form of collaborative art that can leave its mark on a whole nation and serve as a measure of that nation's achievement. On a practical level, who is to say that the seemingly abstract problems we are solving today are not the fuel of tomorrow's technology? In 1940, G.H. Hardy, a mathematician known for his achievements in number theory and mathematical analysis, wrote an essay called A Mathematician's Apology. He described the most beautiful mathematics as that without real-life applications. In fact, he described number theory as "useless", yet very elegant and beautiful. His motive was (in part at least) to promote the idea that mathematics should be pursued for its own beauty and not for the sake of its applications. Of course, it was hard in the early 1940s to imagine that number theory would become an important tool in the field of modern cryptography. Public-key encryption (and thus internet banking) is one
application, among many others, that would not be possible without the great accomplishments of number theorists such as Hardy. Regardless of how we value research in mathematics, the fact remains that a wide variety of phenomena around us are governed by mathematical models. Pushing for innovation and research in applied and pure mathematics is key to making further advancements in science, technology and even medicine. To say the least, mathematics is certainly a set of tools, among many others, that helps make modern technology function well.

Why did we write the book?

In this technology-driven era, our modern lifestyle is certainly not short of devices that we use on a daily basis regardless of age, gender, language or socio-political background. Whether receiving or sending a text message, taking a digital photo, undergoing an MRI scan or following your GPS instructions blindly to get to your destination, you are benefiting from the collaborative effort of scientists and engineers. The rapid pace of advancements in technology makes it hard for most of us to keep track of what is new, let alone to pause and learn about the science behind it.

While embracing the seemingly never-ending advances in technology, you must have wondered at one point: How does the GPS know where you are? What magic makes the Google search engine classify and display the results of your search query? Even the simplest of all: How does your pocket calculator display, in a fraction of a second, the answer to a complicated arithmetic operation? To most people, mathematics does not come up as a possible answer to these questions, while in fact it is the driving force behind all of them. A major difficulty encountered in high school and university mathematics courses, for students registered in programs other than pure mathematics, is the disconnect between what they learn in their mathematics courses and the relevance of this material to their chosen fields. This can seriously affect their motivation and their ultimate success. In fact, a question instructors of mathematics hear very often from their students is: Why do we need to learn this? Things can get even worse when it comes to research in mathematics. At various social encounters, mathematicians are often greeted with statements like "Math was my worst subject", "Hasn't everything been discovered in mathematics?" or "What is the latest number that you discovered?" Most people associate mathematicians with pure academia, but very few are actually aware that
there are many mathematicians working in industry, national security and
even the medical field. In fact, many of the best paying jobs for new university graduates require a strong mathematical background: software developer, investment banker, actuary, engineer or financial analyst, for example.

With the hope of convincing students that there is a need to acquire mathe-
matical skills, and to introduce the general public to the pivotal role played
by mathematics in our lives, we started our endeavor of “looking under
the hood” at the engine that makes most of the technology around us run
smoothly. But there is another, stronger motivation for starting this book.
After years of teaching various areas of undergraduate mathematics, we re-
alized that the traditional place held by mathematics in education for many
centuries is taking a step backward. In the name of reform and adapting to
a fast changing world, the learning of mathematics has unfortunately de-
graded in many cases into an empty drill of memorization of miscellaneous
techniques rather than a foundation of scientific reasoning critical in any
aspect of knowledge. More and more, the mathematical community seems
to be divided into two extremes. On one extreme, we have mathematicians
who disassociate their teaching almost completely from any aspect of sci-
entific thinking. They transfer their knowledge of the subject matter in the
form of “recipes” for their students to follow almost blindly. On the other
extreme, we have the group of mathematicians with an overemphasis on ab-
straction and almost a complete disconnection from real applications even
in early service undergraduate courses in mathematics. Our hope is that
this book will serve as a middle ground between the two extremes.

Who is the intended audience for the book?

The topics for the five chapters of the book are carefully chosen to strike
a delicate balance between relevant common applications and a reasonable
dose of mathematics. In most chapters, the mathematical maturity needed
is acquired after a year of studies at the university level in any branch of
science or engineering. However, self-motivated advanced high school stu-
dents with a strong desire to acquire more knowledge and a willingness to
expand their horizons beyond the school curriculum can certainly benefit
a great deal from researching and understanding the mathematics in the
book. The topics discussed in the book are also great resources for high
school teachers and university professors who can use the various applications to go hand-in-hand with the theory taught in class.

Organization of the book

Throughout the book, all efforts were made to keep the mathematical requirements to a minimum. For some advanced topics, like the theory of finite fields and the notion of primitive polynomials in Chapter 4, the mathematical background is presented in an almost self-contained fashion. All the topics are presented with a fair amount of detail. Chapters are independent to a large extent: the reader can choose a topic to read without acquiring full knowledge of previous chapters. At the beginning of each chapter, a small section entitled Before you go further is included to give the reader an idea of the level of mathematical knowledge required to fully understand the chapter. The organization of the chapters in the book is as follows.

Chapter 1 discusses the mathematics of an electronic calculator. It starts with a review of basic number systems and their properties. Signed numbers and their digital representations, in particular the one's and two's complement schemes, are explained. Logic gates, which are at the heart of any digital computing, are introduced and a link to Boolean Algebra is made. Binary adders and adder-subtractor circuits are studied from the point of view of Boolean Algebra rules. The famous seven-segment display is explained.

Chapter 2 discusses the well-known Huffman codes, an essential tool in many data compression techniques. The chapter also provides an introduction to data compression and its modes (lossy and lossless). Binary codes, binary trees, uniquely decodable and prefix-free codes are introduced as well as Kraft's inequality and its applications. A brief introduction to information theory via the notion of entropy is given. The chapter ends with the Huffman algorithm to solve the famous optimal prefix-free binary code problem. Detailed examples are given.

Chapter 3 describes the JPEG standard. We all have uploaded or downloaded a picture with the ".jpeg" extension at some point. The aim of the chapter is to explain how this popular compression technique is capable of storing or viewing photos or text on your machine (computer or digital camera) with a significantly reduced storage size without jeopardizing the quality of the original file. The main tool for this technique is a transformation called the Discrete Cosine Transform. Concepts from linear algebra like matrix manipulations, linear independence and orthogonal bases are used. Some knowledge of basic properties of trigonometric functions and complex numbers is also required.

Chapter 4 is devoted to the study of the GPS system from both the satellite and the receiver ends. Although the mathematics used by the receiver to locate positions on the surface of the planet is fairly simple, the nature of the signals emitted by the satellites and the way the receiver interprets and treats them require heavy mathematics. Mathematical preliminaries necessary to understand the signal structure include group theory, modular arithmetic, the finite field F_p, polynomial rings over F_p and the notion of primitive polynomials. This chapter is certainly the richest in terms of mathematical knowledge. If you have a curious mind and enjoy new challenges, this is definitely a chapter for you.

Chapter 5 discusses the manipulation of digital images; in particular, we show the reader how to produce an "average" face by combining the digital images of many faces. This procedure is one of the steps involved in face recognition. In the last twenty years, face recognition has become a popular area of research in computer vision. We discuss one of the methods that have been proposed for this application, known as principal components analysis. Face recognition is an essential tool in security and forensic investigation in our modern society. Concepts from linear algebra like the dot product and orthogonal projections are used. A bit of knowledge of basic statistical concepts is also useful for fully understanding this chapter.

A word of caution

While every attempt is made to make every chapter in the book as complete
as possible, some technical details are omitted as we are not experts in the
specific domain of application nor do we claim to be. Technicalities like
the way a circuit is wired, the type of transistors needed for a particular
design or the nature of the electrical pulse of a satellite signal are beyond
the scope of this book. Interested readers are encouraged to look up these
aspects in books written by experts in the domain of application.
Contents

Preface vii
1. What makes a calculator calculate? 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 A view from the inside . . . . . . . . . . . . . . . 2
1.1.2 Before you go further . . . . . . . . . . . . . . . . 2
1.2 Number systems . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Why 0’s and 1’s? . . . . . . . . . . . . . . . . . . 3
1.2.2 The binary system . . . . . . . . . . . . . . . . . . 4
1.2.3 Binary Coded Decimal representation (BCD) . . . 6
1.2.4 Signed versus unsigned binary numbers . . . . . . 7
1.3 Binary arithmetic . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Binary addition of unsigned integers . . . . . . . . 12
1.3.2 Binary addition of signed integers . . . . . . . . . 13
1.3.3 Two’s complement subtraction . . . . . . . . . . . 14
1.4 Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Logic gates . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Boolean Algebra . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.1 Sum of products - Product of sums . . . . . . . . 21
1.5.2 Sum of products . . . . . . . . . . . . . . . . . . . 23
1.5.3 Product of sums . . . . . . . . . . . . . . . . . . . 24
1.6 Digital adders . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.6.1 Half-adder . . . . . . . . . . . . . . . . . . . . . . 25
1.6.2 Full-adder . . . . . . . . . . . . . . . . . . . . . . 26
1.6.3 Lookahead adder . . . . . . . . . . . . . . . . . . 28
1.6.4 Two’s complement implementation . . . . . . . . 30
1.6.5 Adder-subtractor combo . . . . . . . . . . . . . . 31


1.7 BCD to seven-segment decoder . . . . . . . . . . . . . . . 32


1.8 So, how does the magic happen? . . . . . . . . . . . . . . 36
1.9 What next? . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2. Basics of data compression, prefix-free codes and Huffman codes 39
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.1.1 What is data compression and do we really need it? 39
2.1.2 Before you go further . . . . . . . . . . . . . . . . 40
2.2 Storage inside computers . . . . . . . . . . . . . . . . . . 40
2.2.1 Measuring units . . . . . . . . . . . . . . . . . . . 41
2.3 Lossy and lossless data compression . . . . . . . . . . . . 41
2.4 Binary codes . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.1 Binary trees . . . . . . . . . . . . . . . . . . . . . 43
2.4.2 Fixed length and variable length codes . . . . . . 44
2.5 Prefix-free code . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.1 Decoding a message using a prefix-free code . . . 45
2.5.2 How to decide if a code is prefix-free? . . . . . . . 46
2.5.3 The Kraft inequality for prefix-free binary codes . 47
2.6 Optimal codes . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7 The source entropy . . . . . . . . . . . . . . . . . . . . . . 54
2.8 The Huffman code . . . . . . . . . . . . . . . . . . . . . . 55
2.8.1 The construction . . . . . . . . . . . . . . . . . . . 57
2.8.2 The Huffman algorithm . . . . . . . . . . . . . . . 58
2.8.3 An example . . . . . . . . . . . . . . . . . . . . . 59
2.9 Some remarks . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3. The JPEG standard 65


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.1.1 Before you go further . . . . . . . . . . . . . . . . 66
3.2 The Discrete Cosine Transform (DCT) . . . . . . . . . . . 66
3.2.1 The one-dimensional DCT . . . . . . . . . . . . . 67
3.2.2 The two-dimensional DCT . . . . . . . . . . . . . 70
3.3 DCT as a tool for image compression . . . . . . . . . . . . 72
3.3.1 Image pre-processing . . . . . . . . . . . . . . . . 72
3.3.2 Level shifting . . . . . . . . . . . . . . . . . . . . . 73

3.3.3 Applying the DCT . . . . . . . . . . . . . . . . . 73


3.3.4 Quantization . . . . . . . . . . . . . . . . . . . . . 74
3.3.5 Encoding . . . . . . . . . . . . . . . . . . . . . . . 75
3.4 JPEG decompression . . . . . . . . . . . . . . . . . . . . . 82
3.5 The mathematics of DCT . . . . . . . . . . . . . . . . . . 84
3.5.1 Two-dimensional DCT as a linear transformation . 84
3.5.2 What is the deal with orthogonal bases anyway? . 87
3.5.3 Proof of the orthogonality of the DCT matrix . . 89
3.5.4 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . 92
3.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4. Global Positioning System (GPS) 95


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.1 Before you go further . . . . . . . . . . . . . . . . 96
4.2 Latitude, longitude and altitude . . . . . . . . . . . . . . 96
4.3 About the GPS system . . . . . . . . . . . . . . . . . . . . 98
4.3.1 The GPS constellation . . . . . . . . . . . . . . . 98
4.3.2 The GPS signal . . . . . . . . . . . . . . . . . . . 98
4.4 Pinpointing your location . . . . . . . . . . . . . . . . . . 99
4.4.1 Where am I on the map? . . . . . . . . . . . . . . 99
4.4.2 Measuring the distance to a satellite . . . . . . . . 100
4.4.3 Where am I on the surface of the planet? . . . . . 102
4.4.4 Is it really that simple? . . . . . . . . . . . . . . . 103
4.4.5 The fix . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4.6 Finding the coordinates of the receiver . . . . . . 105
4.4.7 Conversion from cartesian to (latitude, longitude,
altitude) coordinates . . . . . . . . . . . . . . . . 108
4.5 The mathematics of the GPS signal . . . . . . . . . . . . 109
4.5.1 Terminology . . . . . . . . . . . . . . . . . . . . . 109
4.5.2 Linear Feedback Shift Registers . . . . . . . . . . 110
4.5.3 Some modular arithmetic . . . . . . . . . . . . . . 113
4.5.4 Groups . . . . . . . . . . . . . . . . . . . . . . . . 115
4.5.5 Fields - An introduction and basic results . . . . . 120
4.5.6 The field Zp . . . . . . . . . . . . . . . . . . . . . 121
4.5.7 Polynomials over a field . . . . . . . . . . . . . . . 123
4.5.8 The field Fpr - A first approach . . . . . . . . . . 127
4.5.9 The field Fpr - A second approach . . . . . . . . . 129
4.5.10 The lead function . . . . . . . . . . . . . . . . . . 132

4.6 Key properties of GPS signals: Correlation and


maximal period . . . . . . . . . . . . . . . . . . . . . . . . 133
4.6.1 Correlation . . . . . . . . . . . . . . . . . . . . . . 133
4.6.2 The LFSR sequence revisited . . . . . . . . . . . . 134
4.6.3 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . 135
4.6.4 More about the signal . . . . . . . . . . . . . . . . 138
4.7 A bit of history . . . . . . . . . . . . . . . . . . . . . . . . 140
4.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5. Image processing and face recognition 143


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.1.1 Before you go further . . . . . . . . . . . . . . . . 144
5.2 Raster image . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.3 Invertible linear transformations . . . . . . . . . . . . . . 145
5.4 Gray level for the new image . . . . . . . . . . . . . . . . 149
5.5 Bilinear interpolation . . . . . . . . . . . . . . . . . . . . . 150
5.6 The centroid of the face . . . . . . . . . . . . . . . . . . . 152
5.7 Optimal transformation for the average face . . . . . . . . 153
5.8 Convex sets and extremal points . . . . . . . . . . . . . . 157
5.9 Least squares method . . . . . . . . . . . . . . . . . . . . 159
5.9.1 Dot product - Inner product . . . . . . . . . . . . 159
5.10 Face recognition . . . . . . . . . . . . . . . . . . . . . . . 164
5.10.1 Descriptive statistics . . . . . . . . . . . . . . . . 165
5.10.2 The principal components from the covariance
matrix . . . . . . . . . . . . . . . . . . . . . . . . 172
5.10.3 Comparison of the principal features . . . . . . . . 179
5.10.4 Visualizing the features . . . . . . . . . . . . . . . 180
5.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

Index 183
Chapter 1

What makes a calculator calculate?

1.1 Introduction

Throughout history, many civilizations realized the need to invent "counting machines" to help with long and complex calculations. Some forms
of calculators were known even before number systems were fully devel-
oped. Early calculators were mainly mechanical using parts like levers,
gears, axles and rods. The development of new mathematical counting al-
gorithms paved the way to new types of counting machines to appear in
Europe in the 17th century. But it was not until the early 1960’s that the
real revolution in electronic calculators took place thanks to the invention
of a device called the transistor.

With the recent advancements in technology, you can hardly avoid seeing an
electronic calculator around you as there is one built in almost every device
you own: your phone, computer, tablet or even your wristwatch. We
trust them blindly in our everyday tasks without questioning the answers
they display. But have you ever wondered, “How can your pocket calculator
do complex mathematical operations with such high precision in the blink
of an eye?” If you do not know the answer to that, you are certainly not
alone. Ask your friends, even your math and science instructors and you
will be surprised how little is known about the basics of this electronic
device. This chapter takes you on a journey to explore some of the logic
that powers digital computing.


1.1.1 A view from the inside


So what is inside an electronic calculator that makes it do its magic? Look around your house; you can most likely find an old or broken calculator. If you were to take it apart (if you have not done that already once in your life), you would be surprised how little you find inside. The heart of a typical electronic calculator is a microchip called the processor. The rest consists mainly of plastic keys for your inputs, placed on top of a rubber membrane covering a touch-activated electronic circuit. Calculators are also equipped with some form of display screen and a power source such as a lithium battery. The main focus of this chapter is, however, not so much on the hardware of the calculator but rather on the mathematics behind its operation that makes all the pieces come together and work as a unit. We will not touch on the practical aspects or types of electrical or electronic parts used, like transistors and resistors. Interested readers can definitely learn more about that side of the calculator from any book on basic electronics.

1.1.2 Before you go further


This chapter is probably the lightest in the book in terms of mathematical
requirement. Most of the mathematics needed is explained with a reasonable
amount of detail. However, some mathematical maturity, reasoning and
basic manipulation skills are required.

1.2 Number systems

I still have a picture in my head of one of my school teachers writing on the board the following expansion of the decimal number 1234:

1234 = 1 × 10^3 + 2 × 10^2 + 3 × 10^1 + 4 × 10^0

and then saying that the multiplier (or coefficient) of 10^0 (4 in this example) is called the units digit, that of 10^1 (3 in this example) is called the tens digit and the other two multipliers (2 and 1) are called the hundreds and the thousands digit respectively. One thing the teacher did not explain at that time is why we chose powers of 10 in the above expansion. With time, I came to realize that there is really nothing special about the number 10 aside from the fact that humans have 10 fingers and 10 toes and that
we tend to use them as we count. At least, this is what many historians
seem to agree on as the main reason for using 10 as a base for our number
system. This number system, familiar to all of us, is known as the decimal
system. In it, every number is written using the 10 digits 0 to 9.

Given an integer b ≥ 2, we can talk about the number system with base b in a similar way to our decimal system. In such a system, every number can be written using the digits from the set S = {0, 1, 2, . . . , b − 1}. More precisely, if N = N_{k−1} N_{k−2} . . . N_1 N_0 (where each N_i ∈ S) is a number in the system, then N holds the following decimal value:

N_{k−1} b^{k−1} + N_{k−2} b^{k−2} + · · · + N_1 b^1 + N_0 b^0.

For the number N = N_{k−1} N_{k−2} . . . N_1 N_0, the digits N_0 and N_{k−1} are called the least significant digit (LSD) and the most significant digit (MSD) respectively. If c > b, then a number N written in base b can clearly be interpreted as a number in base c as well. This confusion can be avoided by specifying the base as a subscript and writing (N)_b. Hence, (123)_4 = 1 × 4^2 + 2 × 4^1 + 3 × 4^0 = 27 and (123)_6 = 1 × 6^2 + 2 × 6^1 + 3 × 6^0 = 51.
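To make the positional expansion concrete, here is a minimal Python sketch (the function name and code are illustrative only, not from the book) that evaluates a list of base-b digits using Horner's rule, which is equivalent to the power expansion above:

```python
def base_to_decimal(digits, b):
    """Decimal value of a number given as its list of base-b digits,
    most significant digit first (illustrative helper)."""
    value = 0
    for d in digits:
        value = value * b + d  # Horner's rule: equivalent to the power expansion
    return value

print(base_to_decimal([1, 2, 3], 4))  # 27
print(base_to_decimal([1, 2, 3], 6))  # 51
```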

Two digital representations of numbers are of particular importance for us in this chapter: the binary system and the Binary Coded Decimal representation (BCD for short). The first one is the language of every modern digital device and the second is used because of its ability to easily decipher coded data back to its original form. Both systems transform any decimal number into a string of 0's and 1's.

1.2.1 Why 0’s and 1’s?


It is safe to say that the most common thing known about computers among
non-experts is the fact that "they use 0's and 1's". Before we go on to study
the two digital representations, maybe it is now the right place to quickly
address the question, “What does it really mean that a computer uses
strings of 0’s and 1’s?” A short (but incomplete) answer is the following.
Electronic devices are built using a number of chips each containing a fairly
large number of transistors. In the context of this chapter, you can think
of a transistor as a switch that you turn on or off as you press various keys
on your device, just like your room light switches. An electrical current
passing through a transistor is indicated by 1 (high voltage) and no current
is indicated by 0 (low voltage). By turning on and off these switches, we actually have a way to communicate with computers and give them
instructions. A basic example is the representations of numbers. As you
will see in the next section, a computer using an 8-bit system interprets the
integer 8 as being the string 00001000. This means that when you press
the key labeled 8 on the keypad of your device, you are actually sending
an electrical signal to a certain chip in your computer instructing it to turn
off the first four of its switches, turn on the fifth switch and again off the
last three switches. The processor of your device will interpret this series
of switches as the number 8 and get ready to operate on it based on your
next action.

1.2.2 The binary system


This is the number system with base 2. In other words, every number in this system is built out of the two digits 0 and 1, which we call bits. A number in this system is called a binary number and has the form N = N_{k−1} N_{k−2} . . . N_1 N_0 where each N_i is either 0 or 1. Because of the vital role the binary system plays in electronics, it is important to develop the ability of switching back and forth between the decimal and binary systems with ease. Two main algorithms to achieve that task are the successive division by 2 and the sum of weights. As the name suggests, the idea of the first algorithm is to successively divide a given decimal number n by 2 and record the successive remainders (0 or 1) until we hit a quotient of zero. The list of remainders, read from bottom to top, is the binary representation of the decimal number. As a consequence, the last remainder is the most significant bit. For example, to convert the decimal number 165 to binary we record the successive remainders upon its division by 2 as shown in Table 1.1.

Table 1.1 Division by 2.

Division by 2   Quotient   Remainder
165             82         1
82              41         0
41              20         1
20              10         0
10              5          0
5               2          1
2               1          0
1               0          1

Reading the column of remainders from bottom to top gives the follow-
ing binary representation of 165: 10100101.
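For readers who want to experiment, here is a minimal Python sketch of the successive division by 2 algorithm (an illustrative helper, not taken from the book):

```python
def to_binary(n):
    """Binary form of a non-negative integer via successive division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        remainders.append(str(n % 2))  # record the remainder (0 or 1)
        n //= 2                        # keep only the quotient
    return "".join(reversed(remainders))  # read the remainders from bottom to top

print(to_binary(165))  # 10100101
```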

The second algorithm works well for relatively small decimal numbers. We start by displaying the first 15 powers of 2:

2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16,
2^5 = 32, 2^6 = 64, 2^7 = 128, 2^8 = 256, 2^9 = 512,
2^10 = 1024, 2^11 = 2048, 2^12 = 4096, 2^13 = 8192, 2^14 = 16384.

Given a decimal number n, we look for the largest integer r_1 such that 2^{r_1} ≤ n and we let n_1 = n − 2^{r_1}. Again, look for the largest integer r_2 such that 2^{r_2} ≤ n_1 and let n_2 = n_1 − 2^{r_2}. Repeat this process for all successive values n_i until you hit a certain k with n_k = 0. Starting with the largest power of 2 appearing in the above process, we record 1 as the multiplier of 2^j if 2^j is used and 0 if 2^j is not used in the process. If 2^l is the largest power of 2 appearing in the above process, then this method gives a binary representation with l + 1 bits. Let us look at an example by revisiting the decimal number 165 treated in the first method. The largest power of 2 less than or equal to 165 is 2^7 = 128. Subtracting 128 from 165 gives 37. The largest power of 2 not exceeding 37 is 2^5 = 32. Now 37 − 32 = 5 and the largest power of 2 less than or equal to 5 is 2^2 = 4. Finally 5 − 4 = 1 and 1 − 2^0 = 0. The powers of 2 appearing in the above process are 2^7, 2^5, 2^2 and 2^0. Therefore

165 = 1 × 2^7 + 0 × 2^6 + 1 × 2^5 + 0 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0

and the binary representation of 165 is the following 8-bit number: 10100101. Note that we have 8 bits in the binary representation of 165 since the largest power of 2 used is 2^7.
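A similar sketch, again purely illustrative, implements the sum of weights method by repeatedly removing the largest power of 2:

```python
def to_binary_weights(n):
    """Binary form of a positive integer by removing the largest power of 2 at each step."""
    r = n.bit_length() - 1      # exponent of the largest power of 2 not exceeding n
    bits = []
    for j in range(r, -1, -1):  # record 1 when 2^j is used, 0 otherwise
        if n >= 2 ** j:
            bits.append("1")
            n -= 2 ** j
        else:
            bits.append("0")
    return "".join(bits)

print(to_binary_weights(165))  # 10100101
```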

Converting an n-bit binary number to its decimal form is straightforward. Just use the expansion of the binary number in terms of powers of 2. For instance, the decimal value of the binary number 1110001101 is

1×2^9 + 1×2^8 + 1×2^7 + 0×2^6 + 0×2^5 + 0×2^4 + 1×2^3 + 1×2^2 + 0×2^1 + 1×2^0 = 909.

Remark 1.1. An important thing to keep in mind is that computers work with fixed size storage. If a computer uses an n-bit storage size, then each number is stored using n bits. For example, in a 4-bit machine, the integer 5 is stored as 0101. The same integer is stored as 00000101 in an 8-bit machine. Modern computers use a 32-bit or 64-bit storage size.

Now for the first result of the chapter.

Proposition 1.1. In an n-bit machine, the range of (non-negative) decimal numbers that can be represented is [0, 2^n − 1].

Proof. Note that there is a total of 2^n different n-bit binary numbers since each bit can take only two possible values, 0 and 1. The binary number 00···00 has a decimal value of zero and the largest positive number that can be represented in n-bit binary is 11···11 (n ones), which has a decimal value of 2^{n−1} + 2^{n−2} + · · · + 2^1 + 2^0. Note that this last sum can be rearranged as 1 + 2 + · · · + 2^{n−1}, which is a finite geometric sum containing n terms with 1 as the first term and 2 as the ratio of two consecutive terms. The value of this sum is known to be (2^n − 1)/(2 − 1) = 2^n − 1.

1.2.3 Binary Coded Decimal representation (BCD)


This is one of the earliest digital representation systems of decimal num-
bers. The BCD system has almost disappeared in modern computer designs
but it is still in use in devices like your pocket calculator. In this system,
digits in the decimal system (0 to 9) are represented by their 4-bit binary
representations. The following table gives the BCD codes for the digits 0
through 9.

Decimal BCD
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001

Concatenation is then used to represent any decimal number in BCD. For example, the decimal number 427 is represented by 010000100111 in BCD:

0100 0010 0111
 (4)  (2)  (7)

Note that with 4 bits, one can form 24 = 16 different binary codes which
means that 6 codes are not used in the BCD system. The binary codes
1010 (number 10 in decimal) through 1111 (number 15 in decimal) are con-
sidered as invalid codes and cannot be used in a digital design operating
on BCD system.
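As a quick illustration (the helper below is ours, not from the book), the Natural BCD code of a decimal number can be produced by encoding each of its digits on 4 bits:

```python
def to_bcd(n):
    """Natural BCD code of a non-negative integer: each decimal digit on 4 bits."""
    return " ".join(format(int(c), "04b") for c in str(n))

print(to_bcd(427))  # 0100 0010 0111
```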

There is more than one form of BCD representation in the literature. The form presented above is called the Natural Binary Coded Decimal and it is the most straightforward one. The easy conversion between decimal and BCD is the main virtue of this representation. As we will see later, this ease of conversion comes in handy in displaying decimal numbers on digital devices. The main drawbacks of the BCD representation are its inefficiency in terms of data usage and the complexity of its circuit implementation.

1.2.4 Signed versus unsigned binary numbers


When talking about integers in the decimal system, the term includes both
non-negative and negative integers. In Section 1.2.2 on the binary system,
we have actually defined the binary form of non-negative integers only,
called unsigned binary integers. In practice, it is crucial that a distinction
between positive and negative numbers is made in any number system and a
machine that cannot deal with negative numbers is practically useless. For
a “pencil and paper” arithmetics, this distinction is simply made by the “+”
sign for positive numbers and “−” sign for negative ones. In a calculator
however, every piece of data is represented in a binary form because of
hardware limitations. It is therefore of great importance to understand the
techniques used to represent signed numbers in binary form. Three such
techniques are presented in what follows.

1.2.4.1 Sign-magnitude format


An intuitive way to represent signed integers in a digital format is to dedi-
cate the leftmost digit as a “sign digit”. This is called the sign-magnitude
format. In this format, a leftmost bit of 0 represents a positive number and
a leftmost bit of 1 represents a negative number. The remaining bits of the
binary number represent its magnitude (or absolute value). For example,
to represent −93 in a sign-magnitude format in an 8-bit machine, we start
by representing the absolute value |−93| = 93 in binary form with seven
bits: 1011101. We add 1 as a leftmost bit to indicate that the number is
indeed negative and we get 11011101.

Proposition 1.1 above shows that in an n-bit machine, the decimal range of unsigned binary numbers is from 0 to 2^n − 1. In the sign-magnitude format, one bit (the leftmost) is used as a sign, which means that in this format the decimal range is [−(2^{n−1} − 1), 2^{n−1} − 1]. The bad news is that zero has two possible representations: 000···00 (which represents +0) and 100···00 (which represents −0).
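Here is an illustrative Python sketch of the sign-magnitude encoding described above (the function name and default word size are our own choices):

```python
def sign_magnitude(k, n_bits=8):
    """n-bit sign-magnitude form of the integer k (leftmost bit is the sign)."""
    assert abs(k) <= 2 ** (n_bits - 1) - 1, "magnitude does not fit in n_bits - 1 bits"
    sign = "1" if k < 0 else "0"
    return sign + format(abs(k), "0" + str(n_bits - 1) + "b")

print(sign_magnitude(-93))  # 11011101
print(sign_magnitude(93))   # 01011101
```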

1.2.4.2 The one’s complement representation


For a binary number N, the one's complement of N is simply the binary number obtained by converting each 0 to 1 and each 1 to 0 in N. For example, the one's complement of the binary number 00101101 is 11010010. In the one's complement format, a non-negative decimal has the same representation as in the sign-magnitude format (or just the n-bit binary format), but the representation of a negative decimal is different. For a negative decimal k, the one's complement representation is obtained by writing the one's complement of the binary form of |k|. For example, to get the one's complement representation of −23 in an 8-bit machine, we start by writing the 8-bit binary form of 23: (23)_{10} = (00010111)_2, and then we flip the digits: 11101000. So, (−23)_{10} = (11101000)_{one's complement}.

As in the sign-magnitude format, 0 has two different representations in this format, namely 00···00 and 11···11.
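A one-line sketch of the bit-flipping rule, for illustration only:

```python
def ones_complement(bits):
    """One's complement of a binary string: flip every bit."""
    return "".join("1" if b == "0" else "0" for b in bits)

print(ones_complement("00101101"))  # 11010010
```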

1.2.4.3 The two’s complement representation


With the limitations of the sign-magnitude and the one’s complement for-
mats, modern computers are programmed to use different schemes to rep-
resent signed integers. One scheme that proved to be very friendly in terms
of hardware and circuit design inside the machine is the two’s complement
representation. To explain the main idea behind this scheme, let us con-
sider the following scenario. Imagine you want to watch a movie on your
electronic device with a basic four-digit counter (in seconds) that starts to
run from the reading 0000 as soon as you press the play button.

Assuming the movie running time is long enough (and you do not stop
it), the counter will eventually read 9999. One second after that, it goes
back to 0000. Now, imagine at this moment you hit the stop button and
then rewind the movie for 5 seconds. The counter will probably read 9995.

Clearly, the reading ‘9995’ in this case does not mean that the movie has
been playing for 9995 seconds. To avoid the ambiguity on what 9995 means
on the counter, one has to interpret the range 0 to 9999 a little differently.
Note that 9995 + 5 = 10000 = 10^4 and since the counter can only handle four digits, the leftmost digit (1 in this case) is dropped and we get 0000. This suggests that 9995 can be interpreted as −5 in this scenario since 9995 + 5 results in 0000 displayed on the counter, exactly like (−5) + 5 = 0.

The above analogy with a digital counter was made to justify the following definition. Given an n-bit binary number N, the two's complement of N is defined to be the n-bit binary number N_{2's} satisfying N + N_{2's} = 1 0···0 (a 1 followed by n zeros), with the "+" sign referring to the binary addition that we will discuss in Section 1.3.1. Notice the analogy with the equation 9995 + 5 = 10000 = 10^4 in the counter example above and the fact that the binary form of 2^n is precisely 1 0···0 (1 followed by n 0's). In other words, finding the two's complement of a binary number means finding the opposite (or negative) of this number.
of this number.

The two’s complement of a binary number N is obtained by finding first


the one’s complement of N and then adding 1 to the result. Practically,
we have the following simple algorithm: read the digits of N from right
to left outputting 0 as long as the digit is 0. Also output 1 for the first
1 you read. After that, invert the individual remaining bits of N , that
is a 0 becomes a 1 and vice versa. For example, to find the two’s com-
plement of the binary number 10101000, start reading the digits from the
10 The mathematics that power our world

right keeping the first three 0’s and the first 1 we encounter. After that,
we invert the remaining bits of the block 1010 obtaining 0101. We then
obtain 01011000 as the two’s complement of 10101000. Let us next look at
the two’s complement of the binary number 11100011. Here, the rightmost
digit is 1 that we output as 1 and we invert the remaining digits, giving
(11100011)20 s = 00011101. As you will see in Section 1.3.1, adding the
binary numbers 11100011 and 00011101 would result in the 9-digit binary
number 100000000. Since we are working in an 8-bit system, we simply
ignore the leftmost bit (exactly like the digital counter would show 0000
instead of 10000) and get 00000000. Note that the above algorithm shows
that the two’s complement of the two’s complement of N gives back N
which is in line with the basic rule −(−N ) = N .
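The rule "one's complement plus 1, kept to n bits" can be summarized in a short illustrative sketch (the function name is ours):

```python
def twos_complement(bits):
    """Two's complement of an n-bit string: one's complement plus 1, reduced to n bits."""
    n = len(bits)
    flipped = "".join("1" if b == "0" else "0" for b in bits)  # one's complement
    return format((int(flipped, 2) + 1) % 2 ** n, "0" + str(n) + "b")

print(twos_complement("10101000"))  # 01011000
print(twos_complement("11100011"))  # 00011101
```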

1.2.4.4 Back and forth between decimal and two’s complement


Conversion between decimal and two’s complement representations depends
on the number of bits used and the sign of the decimal value. If k is a pos-
itive integer, the two’s complement representation of k is the same as its
sign-magnitude binary representation. If k is a negative integer, its two’s
complement representation is obtained by computing the two’s complement
of the sign-magnitude form of |k|. For example, in an 8-bit machine the
two’s complement representation of the decimal number k = 25 is 00011001.
For k = −35, the sign-magnitude representation of its absolute value is
00100011 and its two’s complement is then 11011101. It is important to
note that in the two’s complement representation, like in sign-magnitude
representation, the leftmost bit is 0 for positive numbers and 1 for negative
numbers.

Converting from the two’s complement representation to decimal is simple.


Given an n-bit two’s complement representation bn−1 bn−2 . . . b1 b0 of a dec-
imal number k, we know that bn−1 represents the sign of k. If bn−1 = 0, k
is positive and its decimal value is bn−2 2n−2 + · · · + b1 21 + b0 . If bn−1 = 1, k
is negative and to find its decimal value, we first find its two’s complement
cn−1 cn−2 . . . c1 c0 which we know will correspond to the opposite −k of k.
We finally put a “−” sign in front of the answer to get the decimal value
we are looking for. We clarify this with a couple of examples. Assume you
are given the two 8-bit binary numbers α = 11011011 and β = 01010111
and you are told that they are the two’s complement representations of two
What makes a calculator calculate? 11

decimal numbers A and B respectively. Your task is to find A and B. Note


first that the sign digit of α is 1 indicating that A is a negative number.
The two’s complement of α is 00100101 which is 37 in decimal. This means
that A = −37. As for B, we know it is positive since the sign digit of β is
0. Converting β to its decimal form gives B = 87.
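For illustration, the following sketch decodes an n-bit two's complement string; note that it uses the equivalent shortcut of subtracting 2^n when the sign bit is 1, rather than negating the two's complement as in the text:

```python
def from_twos_complement(bits):
    """Decimal value of an n-bit two's complement string."""
    n = len(bits)
    value = int(bits, 2)
    return value - 2 ** n if bits[0] == "1" else value  # subtract 2^n when the sign bit is 1

print(from_twos_complement("11011011"))  # -37
print(from_twos_complement("01010111"))  # 87
```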

The range of decimal numbers available for two's complement representation depends on the number of bits used. In an n-bit machine, the binary number 100···00 (1 followed by n − 1 zeros) represents a negative number. Its two's complement remains the same. Therefore, 100···00 represents the decimal number −2^{n−1}. The largest positive number that can be represented in two's complement is 011···11 (0 followed by n − 1 ones), which has a decimal value of 2^{n−2} + 2^{n−3} + · · · + 2^1 + 2^0. As in the proof of Proposition 1.1 above, this sum is equal to 2^{n−1} − 1. We conclude that only decimal numbers between −2^{n−1} and 2^{n−1} − 1 inclusively can be represented in two's complement. For example, in an 8-bit machine, this range is [−128, 127].

We finish this section by showing the two main advantages of the two's complement representation:

• Since 1 00···00 (1 followed by n − 1 zeros) represents −2^{n−1}, 0 is uniquely represented by 00···00.
• As we will see later, the two's complement representation eliminates the need for new electronic hardware to perform subtraction as it allows the use of the existing hardware for addition to do subtraction.

1.3 Binary arithmetic

Standard operations on decimal numbers like addition, subtraction, multiplication and division are just particular cases of operations one can perform in any number system. Our main focus in this section is on the arithmetic of binary numbers. As explained earlier, there is more than one way to represent decimals in binary form, so it is natural to expect that binary arithmetic will depend on the representation built inside the electronic device. Equally important is the fact that the result of any arithmetic operation is supposed to fit within the range of integers allowed by the number of bits used to store these numbers in the machine.

1.3.1 Binary addition of unsigned integers


Let us start by going way back (at least in my case) in our learning journey
to elementary school and recall how we were first taught to add decimal
numbers. Most likely, your teacher told you to start by aligning the units
digits, the tens digits, hundreds digits, etc., in columns. Then, starting
with the units column, add the digits in each column and if the sum is
greater than 9 (that is, the sum cannot fit in one digit), generate a second
digit that we carry to the next most significant position, i.e., to the next column on
the left, to be added with the digits of that column. If the sum is less than
or equal to 9, the carry digit is 0. Here is a refresher.

   110            11
   925           557
 +  76         + 375
  1001           932
Like in the case of decimal numbers, the sum of binary numbers involves
a carry digit (of 1) when the sum of digits in a column exceeds 1. The
following four sums give all rules needed to add any n-bit binary numbers:
0 + 0 = 00, 0 + 1 = 1 + 0 = 01, 1 + 1 = 10 and 1 + 1 + 1 = 11. Each
addition is represented by a two-digit binary number. The digit on the
right represents the “output” of the addition called the sum digit and the
digit on the left is the carry digit that we add to the digits of the column
on the left. Here is an example of adding two binary numbers, where the
digits in smaller font in the top row are, as in the decimal case, the carry
digits.
   1111 111
   1100 1001
 + 1111 1111
 1 1100 1000
Unlike the "pencil and paper" addition shown above, machines have to work within certain limits and it could happen (like in the above example) that the result of the addition exceeds the number of storage units allowed. This is a situation known as overflow. Detecting overflow in unsigned addition is simple: a carry out of 1 from adding the most significant bits indicates that an overflow has occurred. Take for instance the (unsigned) addition 0111 + 1001 (corresponding to 7 + 9 in decimal) in a 4-bit machine, which results in 10000. As the carry out of the leftmost bit is 1, an overflow has occurred. In fact, 7 + 9 = 16 exceeds the maximum (15) allowed by a 4-bit machine.
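A small illustrative sketch of n-bit unsigned addition with overflow detection (function name and interface are our own choices):

```python
def add_unsigned(a, b):
    """n-bit unsigned addition of two binary strings; the overflow flag is the
    carry out of the most significant bit."""
    n = len(a)
    total = int(a, 2) + int(b, 2)
    return format(total % 2 ** n, "0" + str(n) + "b"), total >= 2 ** n

print(add_unsigned("0111", "1001"))  # ('0000', True): 7 + 9 overflows a 4-bit machine
```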

1.3.2 Binary addition of signed integers


Modern machines use the two’s complement scheme to represent signed
numbers. Two’s complement representation allows binary arithmetic to be
performed without worrying much about the signs of the operands. In this
section, we focus on binary addition using the two’s complement format.

Let A and B be two integers (in decimal) and let (A)_{2's} and (B)_{2's} be their respective representations in two's complement. To perform A + B, the machine starts by computing the unsigned addition of (A)_{2's} and (B)_{2's}. That is, it treats (A)_{2's} and (B)_{2's} as unsigned numbers, including their sign bits. Any carry out bit from the addition in the leftmost column is then ignored. Let us look at some examples using an 8-bit machine. Assume A = 52, B = −37. To find A + B, we write A and B in two's complement: (A)_{2's} = 00110100 and (B)_{2's} = 11011011. We then perform the sum (A)_{2's} + (B)_{2's} as unsigned binaries:

    111
   0011 0100
 + 1101 1011
 1 0000 1111

Since the result has 9 bits, one more than the storage limit, we simply drop the leftmost bit and get 00001111, which is 15 in decimal. The answer is correct: 52 + (−37) = 15. Let us now consider an example of adding two negative numbers. Assume A = −15, B = −24; then (A)_{2's} = 11110001 and (B)_{2's} = 11101000. Adding (A)_{2's} and (B)_{2's} (as unsigned) gives 11011001 (after dropping the carry out of 1 from the leftmost bit). Note that (11011001)_{2's} = −39, which is the correct answer: −15 + (−24) = −39.

Unlike unsigned addition, detecting an overflow in two's complement addition requires a bit more attention. First note that an overflow can never occur when the two operands are of opposite signs. The reason is that the sum in this case lies between the two operands, and since both operands fit within the available range of numbers, the sum must fit in that range as well. If A and B have the same sign (both positive or both negative), then an overflow occurs precisely when the result of the addition has the opposite sign to that of the operands (in this case, the answer is incorrect). In other words, to detect an overflow in two's complement addition, examine the sign bit of the result: if it is different from that of the operands, an overflow has occurred, and if not, there is no overflow. Note that this is equivalent to saying that an overflow occurs exactly when the carry bit into the leftmost column is
not the same as the carry out bit from that column. This last observation
allows an easy overflow detection hardware design in machines. Let us next
consider some examples of addition using 8-bit two’s complement:
   0100 0100       1010 1011       0011 0100       0010 1100
 + 0110 0000     + 1010 0110     + 1101 1011     + 0010 1101
   1010 0100     1 0101 0001     1 0000 1111       0101 1001
The first sum has an overflow without carry out. It gives a negative answer
to the sum of two positive numbers (the sign bit of the answer is 1 while
both the operands have 0 as leftmost bit). The second sum also presents an
overflow but with a carry out (by dropping the leftmost bit, the sum gives
a positive answer while the operands are both negative). A carry out with
no overflow occurs in the third addition since the operands are of opposite
signs (we simply drop the leftmost bit of the answer). There is no carry
out nor an overflow in the last sum.
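The same recipe can be sketched in a few lines of illustrative Python: add the two strings as unsigned numbers, drop any carry out, and compare sign bits to detect overflow (names are our own):

```python
def add_twos_complement(a, b):
    """n-bit two's complement addition: add as unsigned, drop any carry out, and
    flag an overflow when the operands share a sign that the result does not."""
    n = len(a)
    result = format((int(a, 2) + int(b, 2)) % 2 ** n, "0" + str(n) + "b")
    overflow = a[0] == b[0] and result[0] != a[0]
    return result, overflow

print(add_twos_complement("00110100", "11011011"))  # ('00001111', False): 52 + (-37) = 15
print(add_twos_complement("01000100", "01100000"))  # ('10100100', True): overflow
```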

1.3.3 Two’s complement subtraction


The reason why two’s complement representation is so popular in com-
puter designs is because when signed numbers are added or subtracted in
this format, they can be treated as unsigned numbers with any carry out
from the last bit dropped. The answer is correct regardless of the signs of
the operands as long as it fits within the number of bits allowed. Unlike
addition, subtraction is not a commutative operation. That is A − B and
B − A result in two different values. In the subtraction D = A − B, A is
called the minuend, B the subtrahend and D is called the difference. To
perform the subtraction A − B, write the two’s complement form of the
subtrahend and then add the answer to the two’s complement form of the
minuend using the relation A−B = A+(−B). Any addition carry out from
the sign bit is simply ignored. This gives the two’s complement representa-
tion the advantage of performing both addition and subtraction using the
same hardware as we will see later. For example, to perform the subtraction
55 − 78 in two’s complement with 8 bits, we start by writing both 55 and
−78 in two’s complement: (55)20 s = 00110111, (−78)20 s = 10110010. We
then perform the addition:

    11  11
   0011 0111
 + 1011 0010
   1110 1001

There is no carry out nor an overflow in this case. The result is a negative
number since its sign bit is 1 and its decimal value is −23 (obtained by
converting 11101001 from two’s complement to decimal as we did in Section
1.2.4.4) which is the correct answer: 55 − 78 = −23.
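An illustrative sketch of the subtraction recipe, assuming both operands are given as n-bit two's complement strings:

```python
def sub_twos_complement(a, b):
    """A - B on n bits: add A to the two's complement of B, dropping any carry out."""
    n = len(a)
    neg_b = (2 ** n - int(b, 2)) % 2 ** n      # two's complement of the subtrahend
    return format((int(a, 2) + neg_b) % 2 ** n, "0" + str(n) + "b")

print(sub_twos_complement("00110111", "01001110"))  # 11101001, i.e. 55 - 78 = -23
```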

1.4 Logic

In the context of this chapter, Logic is the set of mathematical rules gov-
erning any electrical circuit design for binary arithmetic in a machine. You
will be amazed to learn that basic words like “AND”, “OR”, “NOT” can
be interpreted as “switches” and “gates” inside your computer.

Any human language is usually a collection of words and symbols that one can put together using a set of grammatical rules to form different types
of sentences. In the English language for example, we have the imperative
sentence, the interrogative sentence and the declarative sentence among
others. What distinguishes declarative sentences from the others is the fact
that they carry a truth value. Every declarative sentence is either True
(T ) or False (F ). From the Logic point of view, a declarative sentence
is called a proposition. For instance, the sentence “Two to the power of
four is equal to 18” is a proposition with truth value F and “The number
of binary strings with 10 bits is 210 ” is a proposition with truth value T .
The ease of determining the truth value of the previous two propositions is
just an illusion and things can get complicated very quickly as we combine
propositions together to form new ones. Consider for example the following
proposition: “The hunt is over or the lion is dead if and only if it rains in the
forest and the moon is full unless the hunter hides behind the trees”. In this
latest complex proposition, words like “or”, “if and only if”, “and”, “unless”
are called logic connectives or logic operators. Their role is to connect
propositions together to make new compound ones. Table 1.2 gives the
names and symbols of the basic logic connectives. Except for the connective
“NOT”, each of these connectives requires two input propositions that we
denoted as p and q in the table.
When we combine propositions using the above operators, the result remains a proposition and as such it is either True or False. Without going into too much detail, we explain briefly the output value (T or F) of some of these operators. The compound proposition p ∨ q (p OR q) is always true except in the case where p and q are both false.

Table 1.2 Basic logic connectives.


Connective Name        Symbol   English Expressions
Disjunction (OR)       p ∨ q    p or q, p unless q
Conjunction (AND)      p ∧ q    p and q, p but q
Implication            p → q    p implies q
Biconditional          p ↔ q    p if and only if q
Exclusive OR (XOR)     p ⊕ q    p or q but not both
Negation (NOT)         ¬p       not p

Table 1.3 A truth table.


p q p∨q p∧q p→q p↔q p⊕q ¬p ¬q
T T T T T T F F F
T F T F F F T F T
F T T F T F T T F
F F F F T T F T T

On the other extreme, p ∧ q (p AND q) is always false except in the case where p and q are both true. To explain the truth value of the "XOR" operator (⊕), imagine I say "Joseph is teaching a class or he is gone fishing"; then I would be telling the truth if exactly one of the components "Joseph is teaching a class", "he is gone fishing" is true. Clearly the statement is false otherwise. Table 1.3 gives the truth values of the above logic operators as functions of the truth values of their components.
So what does this "linguistic" introduction have to do with electronics and circuit design? Like a proposition, each bit can take two values, 0 or 1, and each switch in a digital circuit is also in one of two possible states: high voltage or low voltage. Think of Logic as being the "brain" of any electronic circuit, sending signals to different parts of the circuit to execute various tasks.
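As a small illustration (not part of the book), the main rows of Table 1.3 can be reproduced with ordinary Boolean operators in Python, where the implication p → q is computed as (not p) or q:

```python
def truth_table():
    """Print the truth table of the basic connectives (T/F as in Table 1.3)."""
    fmt = lambda v: "T" if v else "F"
    print("p q | OR AND p->q p<->q XOR NOTp")
    for p in (True, False):
        for q in (True, False):
            row = (p or q, p and q, (not p) or q, p == q, p != q, not p)
            print(fmt(p), fmt(q), "|", "  ".join(fmt(v) for v in row))

truth_table()
```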

1.4.1 Logic gates


The OR, AND, XOR, and NOT operators described above can be thought
of as electronic “gates” in a circuit producing a certain output signal based
on the status of their binary input signals. Logic gates are the building
blocks of any electronic design since they have the power to allow or block
a binary signal through parts of the circuit. As one engineer once explained
them to me, they are the “decision makers” in electronic circuits. Industries
use graphic symbols to represent various logic gates. The following gives
the most common graphic representations of OR, AND, XOR, and NOT
gates.

[Gate symbols: an OR gate with inputs A, B and output A ∨ B; an AND gate with inputs A, B and output A ∧ B; an Exclusive OR gate with inputs A, B and output A ⊕ B; a NOT gate with input A and output ¬A.]

Fig. 1.1 Basic logic gates.

If you look at a map of a digital electronic chip, you will most likely
see more logic gate symbols than the four listed in Figure 1.1 above. For
instance in TTL technology (Transistor-Transistor Logic), the NAND gate
plays a crucial role. The NAND gate can be interpreted as an AND gate
followed by a NOT gate. If A, B are binary inputs, then the output of a
NAND gate is ¬(A ∧ B):

[Diagram: a NAND gate with inputs A and B and output ¬(A ∧ B).]

Fig. 1.2 NAND gate action.

In CMOS technology (Complementary Metal-Oxide-Semiconductor), an


important logic gate is the NOR gate. Similar to the NAND gate, the NOR
gate is an OR gate followed by a NOT gate. If A, B are binary inputs,
then the output of a NOR gate is ¬(A ∨ B):

[Diagram: a NOR gate with inputs A and B and output ¬(A ∨ B).]

Fig. 1.3 NOR gate action.

The NAND and NOR gates are represented graphically as follows.



[Gate symbols: the NAND gate takes inputs A and B and outputs ¬(A ∧ B); the NOR gate takes inputs A and B and outputs ¬(A ∨ B).]

Fig. 1.4 NAND and NOR gates.

A typical electronic circuit is usually built from a complex network of


logic gates carefully engineered to perform specific tasks. Depending on the
complexity of the circuit, logic gates are usually built with more than two
binary input terminals in order to accommodate more complex operations
through the gate. Figure 1.5 shows an example of a logic circuit with
four binary inputs x0 , x1 , x2 and x3 and with logic gates having multiple
input entries. The network is designed to produce a binary output function
f in response to various combinations of binary inputs. The filled little
circles in the network (•) indicate that the wires are actually connected and
electrons can flow in both wires to the corresponding gates. Unconnected
wires are simply indicated by intersecting lines (⊥) with no filled circle at
the intersection point.
A quick look at the circuit in Figure 1.5 shows that it could be cum-
bersome to determine what the final output value is for a given list of
binary inputs. Even harder is the task of finding an algebraic expres-
sion f (x0 , x1 , x2 , x3 ) for the output as a function of the four inputs. The
other side of this problem is of more importance from a design point
of view: in many instances (like an alarm system, traffic lights, binary
adder, ...) we can actually come up with the desired output as a function
f (x0 , x1 , x2 , . . . , xn ) of the input variables but the challenge is in building a
circuit to implement it. Another question to answer is, “How to ensure that
the design is the most cost effective in terms of number of gates and devices
used considering the fact that a given binary output can be achieved using
several designs?”
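To get a feel for the first half of the problem, here is a small Python sketch of the “tabulate the output” task. Since the network of Figure 1.5 is not reproduced here, the circuit below is a made-up example with four inputs, not the one in the figure.

    # Hypothetical example (not the circuit of Figure 1.5): a small network
    # with four binary inputs computing
    #     f = (x0 AND x1) OR ((NOT x2) AND x3).
    # Tabulating f over all 2^4 input combinations shows how tedious this
    # becomes by hand, and why algebraic tools are useful.
    from itertools import product

    def AND(a, b): return a & b
    def OR(a, b):  return a | b
    def NOT(a):    return 1 - a

    def f(x0, x1, x2, x3):
        return OR(AND(x0, x1), AND(NOT(x2), x3))

    for x0, x1, x2, x3 in product((0, 1), repeat=4):
        print(x0, x1, x2, x3, "->", f(x0, x1, x2, x3))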

Many tools were developed through the years to come up with the best
circuit design to perform a given task. Some of these tools are graphical,
like the Karnaugh map and the semantic tableaux and others are algebraic
like the Boolean Algebra axioms, the sum of products and the product of
sums. In this chapter, we focus on the algebraic tools as they are more
representative of the power of (what looks like) abstract mathematics in


circuit designs.

Fig. 1.5 A logic circuit with four binary inputs.

1.5 Boolean Algebra

Table 1.4 Properties of the operations of a Boolean Algebra.


Property Name
x+y =y+x Commutativity of OR
xy = yx Commutativity of AND
x + (y + z) = (x + y) + z Associativity of OR
x(yz) = (xy)z Associativity of AND
x+0=x OR identity element
x1 = x AND identity element
x+1=1 Output set
x0 = 0 Output reset
x + (yz) = (x + y)(x + z) OR distributive law
x(y + z) = (xy) + (xz) AND distributive law
(x + y)' = x'y' De Morgan's NOR law
(xy)' = x' + y' De Morgan's NAND law
x+x=x OR idempotent law
xx = x AND idempotent law
x + x' = 1 Complementation rule for addition
xx' = 0 Complementation rule for multiplication
(x')' = x Double negation

In the mid 1800’s, the British mathematician George Boole came up


with a mathematical system to model the logic laws in an algebraic way.
After almost a century of refinements and improvements on the system by
various mathematicians and algorithm designers, the scheme finally found


its way into real world applications and became an important tool in engi-
neering. In our days, people refer to Boole’s system as the Boolean Algebra.
Inspired by the logic laws described above, a Boolean Algebra can be de-
scribed as a set of two elements, say {0, 1}, together with two binary laws
written as addition “ + ” and a multiplication “ · ” and one unary law “ ' ”
(operation requiring a single variable). These laws satisfy the following
rules or axioms for any binary variables x, y:

(1) x · y = 0 except if x = y = 1 in which case x · y = 1


(2) x + y = 1 except if x = y = 0 in which case x + y = 0
(3) x' = 1 when x = 0 and x' = 0 when x = 1.

At this point, chances are you started to draw a link between the operations
of a Boolean Algebra and the logic operators defined above. In fact, in a
Boolean Algebra, the expression x+y is read “x or y”, x·y is read “x and y”
and “x'” is read “not x”. From this point of view, the above axioms become
more natural. As in the case of real numbers, the multiplication will be de-
noted from this point on by juxtaposition of operands, so x·y is replaced by
xy. Also, similar to usual arithmetic, there are precedence rules for
the order of the Boolean operations. From the highest to the lowest
precedence, these are: the parentheses, the NOT operation (“ ' ”), the
AND operation (the multiplication), the OR operation (the addition). An
expression combining binary variables with one or more of the above laws
is often referred to as a function. For instance, f(x, y, z) = x + x'yz + y' is a
function that takes the value 1 for the input values x = y = 0 and z = 1. In
addition to the above axioms, the operations of a Boolean Algebra satisfy
the important properties found in Table 1.4.
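Since a Boolean variable can only take the values 0 and 1, each property in Table 1.4 can be confirmed by checking all input combinations. The Python sketch below (an added illustration, not part of the original presentation) does this for three of them.

    # Exhaustive check of a few properties from Table 1.4.
    # Here x + y is the OR (or_), xy is the AND (and_) and x' is not_(x).
    from itertools import product

    def or_(a, b):  return a | b
    def and_(a, b): return a & b
    def not_(a):    return 1 - a

    for x, y, z in product((0, 1), repeat=3):
        # OR distributive law: x + (yz) = (x + y)(x + z)
        assert or_(x, and_(y, z)) == and_(or_(x, y), or_(x, z))
        # De Morgan's NAND law: (xy)' = x' + y'
        assert not_(and_(x, y)) == or_(not_(x), not_(y))
        # Complementation rules: x + x' = 1 and xx' = 0
        assert or_(x, not_(x)) == 1 and and_(x, not_(x)) == 0

    print("All checked identities hold for every x, y, z in {0, 1}.")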

Many of these laws may seem very obvious or even trivial to you at this
point. But remember that their main use in this context is in simplifying
the output function of a complex circuit and, from this point of view, they
can sometimes be tricky to handle efficiently. In what follows, we
work out an example to show how Boolean Algebra is used to simplify a
certain digital circuit. Figure 1.6 shows a logic circuit with three binary
inputs x, y and z.

Following the first AND gate from the top, it is easy to see that its output
is x' ∧ y' ∧ z, or x'y'z in Boolean notation. Similarly, the outputs of the other
AND gates (starting from the second one on top) are x'yz', xy'z, xyz and
xyz' respectively. All five gates are connected to an OR gate making the
final circuit output x'y'z + x'yz' + xy'z + xyz + xyz'. We obviously wish to
come up with a simpler circuit that gives exactly the same output for every
list of binary values for x, y and z. The Boolean Algebra properties become
very handy for this task. The details of the calculations in Figure 1.7 are
not hard to follow and left to the reader to verify. The last expression
y'z + yz' + xy shows that we can replace the circuit in Figure 1.6 with the
equivalent circuit of Figure 1.8, which is more efficient in terms of the
number of logic gates.

1.5.1 Sum of products - Product of sums


We start with some terminology. Given a set S = {x1 , . . . , xn } of logic vari-
ables, we define a minterm over S as being a product of the form y1 y2 · · · yn
where each yi is either equal to xi or the complement of xi (that is, xi'). For
each i, the variable xi appears exactly once in each minterm either as xi or
as its complement xi' but not both. As a consequence, there is a total of 2^n
minterms over S. For example, the following are all the 2^3 = 8 minterms
over the variables x, y and z:

x'y'z', x'y'z, x'yz', x'yz
xy'z', xy'z, xyz', xyz.

Fig. 1.6 A logic circuit with three binary inputs.



x'y'z + x'yz' + xy'z + xyz + xyz'
= (x'y'z + xy'z) + (x'yz' + xyz') + xyz
= (x' + x)y'z + (x' + x)yz' + xyz
= y'z + yz' + xyz
= y'z + y(z' + xz)
= y'z + y[(z' + x)(z' + z)]   (OR distributive law)
= y'z + y(z' + x)
= y'z + yz' + xy.

Fig. 1.7 Calculations in a Boolean Algebra.

Fig. 1.8 A logic circuit equivalent to the circuit in Figure 1.6 with fewer gates.
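The equivalence of the two circuits can also be confirmed by brute force, since there are only 2^3 = 8 possible inputs. The Python sketch below (added here as an illustration) compares the two output expressions on every input.

    # Verify that the output of the circuit in Figure 1.6 and the simplified
    # expression of Figure 1.8 agree on all 8 input combinations.
    from itertools import product

    def not_(a): return 1 - a

    def original(x, y, z):
        # x'y'z + x'yz' + xy'z + xyz + xyz'
        terms = [not_(x) & not_(y) & z,
                 not_(x) & y & not_(z),
                 x & not_(y) & z,
                 x & y & z,
                 x & y & not_(z)]
        return 1 if any(terms) else 0

    def simplified(x, y, z):
        # y'z + yz' + xy
        return 1 if (not_(y) & z) or (y & not_(z)) or (x & y) else 0

    assert all(original(x, y, z) == simplified(x, y, z)
               for x, y, z in product((0, 1), repeat=3))
    print("The two circuits compute the same Boolean function.")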

Two minterms are called adjacent if they differ in one position only. For
instance, the two minterms x1x2x3 and x1x2'x3 are adjacent since they only
differ at the second position, where the variable x2 appears complemented
in the second minterm. Notice the following:

x1x2x3 + x1x2'x3 = x1x3(x2 + x2') = x1x3.     (1.1)
The variable x2 which represents the “different” position of the two
minterms has completely disappeared when the two minterms are added
together. There is nothing special about the example in equation (1.1)
and every time two adjacent minterms are added, we can simplify the sum
by dropping the different position. Exploiting this property of adjacent
minterms will play a crucial role in simplifying algebraic expressions (it is
at the heart of a schematic method to simplify Boolean expressions known as
Karnaugh maps that we will not explore in this book). A minterm is True
for exactly one combination of variable inputs. For example, the minterm
x1'x2'x3 is only true for x1 = 0, x2 = 0 and x3 = 1.

The dual notion of a minterm is the maxterm. A maxterm over the set
S = {x1 , ..., xn } of logic variables is a sum of the form y1 + y2 + · · · + yn
where each yi is either xi or xi'. Like the minterms, there are 2^n maxterms
over n logic variables. A maxterm is False for exactly one combination of
variable inputs.

1.5.2 Sum of products


The Boolean expression

f = x'y'z + x'yz' + xy'z + xyz + xyz'     (1.2)

representing the output for the circuit in Figure 1.6 above is called a sum
of products form of f. The name is self explanatory. Note that in this
expression of f, each of the products x'y'z, x'yz', xy'z, xyz, and xyz' is a
minterm but that is not necessarily true in every sum of products form. For
instance, the above calculations that led to the diagram in Figure 1.8 show
that an equivalent form of the same function f is given by y'z + yz' + xy
which we still call a sum of products form of f . In other words, there are
several ways to write a Boolean expression as a sum of products and a sum
of minterms is one of them. There is however a unique way to express a
Boolean expression as a sum of minterms. Writing the (unique) sum of
minterms form of a Boolean expression f from the truth table of f can be
achieved by writing the minterm corresponding to each row in the table
where the output of f is 1 and then adding all the minterms. Although
this method provides an easy way to derive a sum of products expression,
it is not optimal in the sense that it does not produce the simplest expres-
sion for f (as we saw in Figures 1.6 and 1.8) but it is a starting point and
Boolean Algebra manipulations can then be used to simplify it. Let us look
at an example. Suppose you are given the truth table of a Boolean function
f (x, y, z), see Table 1.5.

The truth value of f is 1 in the first, fifth and eighth rows of the
table. The minterms corresponding to these rows are x'y'z', xyz' and
xy'z' respectively. The sum of minterms form of f is then given by
Table 1.5 A Boolean function.
x y z f
0 0 0 1
0 0 1 0
0 1 1 0
0 1 0 0
1 1 0 1
1 1 1 0
1 0 1 0
1 0 0 1

f(x, y, z) = x'y'z' + xyz' + xy'z'. Now, Boolean Algebra properties can


be used to simplify this expression:

x'y'z' + xyz' + xy'z' = (x'y'z' + xy'z') + xyz'
                      = (x' + x)y'z' + xyz'
                      = y'z' + xyz'.
A useful sum of products we will use in the adder circuit design below is
that of x ⊕ y, i.e. the exclusive OR. From the truth table of basic logic
operators given above, it is easy to see that the sum of products of x ⊕ y
is x'y + xy'. Note also that (x ⊕ y)' = (x'y + xy')' = xy + x'y' and here is
why:

(x ⊕ y)' = (x'y + xy')' = (x'y)'(xy')'   (by De Morgan laws)
         = (x + y')(x' + y)              (by De Morgan laws)
         = xx' + xy + y'x' + yy'
         = xy + x'y'                     (since xx' = 0 and yy' = 0).
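The passage from a truth table to the sum of minterms is mechanical enough to be automated. The Python sketch below (an added illustration, with the truth table of Table 1.5 hard-coded) builds the sum-of-minterms expression of f and checks the simplified form y'z' + xyz' obtained above.

    # Sum of minterms of the function f of Table 1.5, plus a check of the
    # simplified form y'z' + xyz' derived in the text.
    table = {(0, 0, 0): 1, (0, 0, 1): 0, (0, 1, 1): 0, (0, 1, 0): 0,
             (1, 1, 0): 1, (1, 1, 1): 0, (1, 0, 1): 0, (1, 0, 0): 1}
    names = ("x", "y", "z")

    def minterm(row):
        # e.g. (1, 0, 0) -> "xy'z'"
        return "".join(n if v else n + "'" for n, v in zip(names, row))

    sum_of_minterms = " + ".join(minterm(r) for r, out in table.items() if out == 1)
    print("f =", sum_of_minterms)        # x'y'z' + xyz' + xy'z'

    def simplified(x, y, z):
        return 1 if ((1 - y) & (1 - z)) or (x & y & (1 - z)) else 0

    assert all(simplified(*row) == out for row, out in table.items())
    print("The simplified form y'z' + xyz' matches Table 1.5.")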

1.5.3 Product of sums


We start by noting first that minterms and maxterms are actually not
strangers to each other. One is the complement of the other. The maxterm
x + y' + z' is the complement of the minterm x'yz since (x'yz)' = x + y' + z'
by De Morgan's laws. From this perspective, every Boolean expression
can also be written as the product of maxterms in a unique fashion. Prod-
uct of maxterms is a special type of a product of sums. From the truth
table, the product of maxterms form can be found as follows: first select
all rows in the table where the function output is 0. For each of these rows,
form the associated maxterm by complementing the minterm that corre-
sponds to that row. For example, if x = 1, y = 1, z = 0 is one row in the
truth table where the output of f is 0, then the corresponding maxterm
is (xyz')' = x' + y' + z. The product of maxterms form of f is then ob-
tained by multiplying all the corresponding maxterms. Let us look again
at the truth table of the function f in the previous section. The output
of f is 0 in rows 2, 3, 4, 6 and 7. The maxterm corresponding to row 2 is
(x'y'z)' = x + y + z'. Similarly, the maxterms corresponding to rows 3, 4, 6
and 7 are (x + y' + z'), (x + y' + z), (x' + y' + z') and (x' + y + z') respectively.
The product of maxterms form of f is then

(x + y + z')(x + y' + z')(x + y' + z)(x' + y' + z')(x' + y + z').

Both the sum of minterms and the product of maxterms are called standard
forms of the Boolean expression.
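The product of maxterms can be generated just as mechanically from the rows where the function is 0. The following Python sketch (an added illustration) does this for the function of Table 1.5.

    # Product-of-maxterms form of the function f of Table 1.5.
    table = {(0, 0, 0): 1, (0, 0, 1): 0, (0, 1, 1): 0, (0, 1, 0): 0,
             (1, 1, 0): 1, (1, 1, 1): 0, (1, 0, 1): 0, (1, 0, 0): 1}
    names = ("x", "y", "z")

    def maxterm(row):
        # Complement the minterm of the row: a 0 gives the plain variable,
        # a 1 gives the complemented one, e.g. (0, 0, 1) -> "(x + y + z')".
        return "(" + " + ".join(n if v == 0 else n + "'"
                                for n, v in zip(names, row)) + ")"

    zero_rows = [row for row, out in table.items() if out == 0]
    print("f =", "".join(maxterm(r) for r in zero_rows))

    # Each maxterm is 0 only at its defining row, so the product is 1 exactly
    # when the input is not one of the zero rows -- reproducing the table.
    assert all((row not in zero_rows) == bool(out) for row, out in table.items())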

1.6 Digital adders

We now arrive at the heart and soul of this chapter. What is the procedure
that an electronic calculator follows in order to perform a certain arithmetic
operation? The accurate answer to this question requires a fair knowledge
of all components of an electronic circuit and the physical laws that enable
these components to work effectively together. Of course, this is well be-
yond the scope of this book. Components such as the processor inside a
machine are physical implementation of mathematical ideas and designs.
Our main interest is to look beyond the hardware in order to understand
the mathematical ideas.

1.6.1 Half-adder
As we saw in Section 1.3.1, when two binary digits x, y are added, a sum
Σ and a carry C are produced. If you think of Σ(x, y) and C(x, y) as
Boolean expressions of the variables x and y, then the basic rules of adding
bits seen in Section 1.3.1 give their truth tables as illustrated in Table 1.6.

Note that the carry is always 0 except in the case where both digits
are equal to 1. This is exactly the output of the AND operator. So,

Table 1.6 Sum and carry.

x y   Sum (Σ)   Carry (C)
0 0      0          0
1 0      1          0
0 1      1          0
1 1      0          1

C(x, y) = x ∧ y or simply xy using the Boolean notation. As for the sum
Σ, note that its outcome is 1 if the value of exactly one of the variables is 1
and the output is 0 otherwise. This is precisely the output of the Exclusive
OR operator. Therefore Σ(x, y) = x ⊕ y. In light of these facts, we can
construct a logic circuit to implement this addition of two bits using one
AND gate and one Exclusive OR gate. The circuit takes in two inputs
(binary bits) and gives out two outputs, a sum Σ and a carry out C:

Fig. 1.9 Half-adder circuit.
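In code the same two gates give a one-line half-adder. The small Python sketch below (an added illustration) reproduces Table 1.6.

    # Half-adder: sum = x XOR y, carry = x AND y (Table 1.6).
    def half_adder(x, y):
        return x ^ y, x & y   # (sum, carry)

    print(" x y  sum carry")
    for x in (0, 1):
        for y in (0, 1):
            s, c = half_adder(x, y)
            print(f" {x} {y}   {s}    {c}")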

1.6.2 Full-adder
While the half-adder is probably the simplest adder circuit, it has a ma-
jor handicap. It has only two inputs which means it cannot deal with the
carry in that usually occurs in binary addition. This makes the half-adder
capable of performing binary addition only when there is no carry in (a
carry in of 0), hence the name. The building block of any digital adder
requires a system that takes into account the two bits and the carry from
the previous column. This is what a full-adder is designed to do. In the
full-adder design, we have three inputs: the two bits to be added and the
carry in from previous addition. As in the case of the half-adder, there are
two outputs: a sum and a carry out. To distinguish between the carry in
and the carry out in the full-adder, we write Cin for the carry in and Cout
for the carry out.

Table 1.7 The full-adder.

x y Cin   Sum (Σ)   Carry out (Cout)
0 0  0       0            0
0 0  1       1            0
0 1  1       0            1
0 1  0       1            0
1 1  0       0            1
1 1  1       1            1
1 0  1       0            1
1 0  0       1            0

The truth table of the full-adder is found in Table 1.7. Note that
1 + 1 + 1 = 11 (sum of 1 and a carry of 1).

From the truth table, we can construct Boolean expressions for the sum
Σ and the carry out Cout. The idea is to first write the sum of products
for Σ and Cout and then make some simplifications using Boolean Algebra
properties.

Σ = x'y'Cin + x'yCin' + xyCin + xy'Cin'
Cout = x'yCin + xyCin' + xyCin + xy'Cin.

Using the relations established in Section 1.5.2, we can simplify Σ and Cout
as follows:

Σ = x'y'Cin + x'yCin' + xyCin + xy'Cin'
  = (x'y' + xy)Cin + (x'y + xy')Cin'
  = (x ⊕ y)'Cin + (x ⊕ y)Cin'
  = (x ⊕ y) ⊕ Cin

and

Cout = x'yCin + xyCin' + xyCin + xy'Cin
     = xy(Cin' + Cin) + (x'y + xy')Cin
     = xy + (x ⊕ y)Cin.
The circuit in Figure 1.10 implements the above equations using XOR,
AND and OR gates.
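A direct translation of these two equations gives a working full-adder. The Python sketch below (an added illustration) checks it against Table 1.7 by comparing with ordinary addition.

    # Full-adder built from the simplified equations:
    #   sum  = (x XOR y) XOR Cin
    #   Cout = xy + (x XOR y)Cin
    def full_adder(x, y, cin):
        p = x ^ y
        return p ^ cin, (x & y) | (p & cin)   # (sum, carry out)

    for x in (0, 1):
        for y in (0, 1):
            for cin in (0, 1):
                s, cout = full_adder(x, y, cin)
                assert x + y + cin == 2 * cout + s
    print("The full-adder agrees with x + y + Cin on all eight input rows.")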
Full-adders are the building blocks in any electronic device capable of
doing binary arithmetics. Recall that binary addition of two n-bit numbers
Fig. 1.10 Full-adder circuit.

is done by adding first the two least significant bits and progressing to
the addition of the most significant bits. In the process, the carry out
produced at any stage is added to the two bits in the next position. This
can be designed in the machine by cascading together n full-adders, one
adder for each pair of bits. The carry out produced by the sum of each pair
of bits “ripples” through the chain until we get to the last carry out. Such
an adder is called carry ripple adder. Figure 1.11 shows a block design for
a carry ripple adder for 3-bit binary numbers A2 A1 A0 and B2 B1 B0 . Each
rectangle contains a full-adder design as shown in Figure 1.10. Note that
if the first carry-in (C0 ) is 0, then it is represented by a “ground” in real
designs.
The diagram in Figure 1.12 is a detailed logic implementation of a 3-bit
ripple carry adder with XOR, OR and AND gates. The circuit adds the
two binary numbers A2 A1 A0 and B2 B1 B0 and outputs the binary number
Σ3 Σ2 Σ1 Σ0.
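Cascading full-adders is easy to mimic in code. The Python sketch below (an added illustration) ripples the carry through a list of bit pairs, in the spirit of Figure 1.11.

    # Ripple carry adder for two n-bit numbers given as lists of bits,
    # least significant bit first (A0, A1, ..., An-1).
    def full_adder(x, y, cin):
        p = x ^ y
        return p ^ cin, (x & y) | (p & cin)

    def ripple_add(A, B, c0=0):
        carry, sums = c0, []
        for a, b in zip(A, B):            # one full-adder per bit position
            s, carry = full_adder(a, b, carry)
            sums.append(s)
        return sums + [carry]             # the final carry is the extra output bit

    # Example: 101 + 011 (that is, 5 + 3), bits listed from A0 to A2.
    A = [1, 0, 1]                         # 5
    B = [1, 1, 0]                         # 3
    print(ripple_add(A, B))               # [0, 0, 0, 1]  ->  1000 = 8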

1.6.3 Lookahead adder


In a carry ripple adder, all bits are entered at the same time in the circuit
(as seen in Figure 1.11 below) and the sum of bits at position i cannot
be executed until all carries have rippled through the previous positions.

Fig. 1.11 3-bit carry ripple adder.

Fig. 1.12 A 3-bit ripple adder logic circuit.

Considering the significantly large number of bits computers are expected


to deal with, the accumulated wait times can cause a serious delay. One
solution to speed up the process is to try to generate the carry-input locally
at each stage to avoid waiting for its value to ripple through. Let xi , yi be
the two bits at stage i and Ci+1 be the carry-out at this stage (C0 being
the initial carry in). From the above algebraic manipulations, we have
Ci+1 = xi yi + (xi ⊕ yi )Ci . (1.3)
This equation is the key to eliminate the propagation delay in the chain.
First, let gi = xi yi and pi = xi ⊕ yi , called carry-generator and carry-
propagate respectively. Then equation (1.3) becomes Ci+1 = gi + Ci pi .
Note that at each stage, the carry-generator and the carry-propagate can
be generated independently from the previous stages. In particular, we
have the following values:
C1 = g0 + C0 p0
C2 = g1 + C1 p1
= g1 + (g0 + C0 p0 )p1
= g1 + g0 p1 + C0 p0 p1
C3 = g2 + C2 p2
= g2 + (g1 + g0 p1 + C0 p0 p1 )p2
= g2 + g1 p2 + g0 p1 p2 + C0 p0 p1 p2 .
Continuing this way, we get the following general expression for the
carry out at the ith stage:
Ci+1 = gi + gi−1 pi + gi−2 pi pi−1 + · · · + C0 pi pi−1 pi−2 . . . p0 . (1.4)
This new expression shows that the carry can be produced at any stage
without the need to know the carries from previous stages and hence avoid
the time delay. An adder designed to produce the carry digit locally based
on equation (1.4) is called a carry lookahead adder. In spite of the calcula-
tions involved to compute the carry-generator and the carry-propagate at
each stage, the carry lookahead adder design remains much faster than the
ripple carry adder in large applications.
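The generate/propagate idea is easy to simulate. The Python sketch below (an added illustration) computes every carry directly from equation (1.4), using only the carry-generate and carry-propagate terms and the initial carry C0, never a previously computed carry.

    # Carry lookahead: every carry comes from equation (1.4),
    #   C(i+1) = g_i + g_(i-1)p_i + g_(i-2)p_i p_(i-1) + ... + C0 p_i ... p_0,
    # with g_i = x_i y_i (carry-generate) and p_i = x_i XOR y_i (carry-propagate).
    def lookahead_carries(X, Y, c0=0):
        g = [x & y for x, y in zip(X, Y)]
        p = [x ^ y for x, y in zip(X, Y)]
        carries = [c0]
        for i in range(len(X)):
            terms, prod = [g[i]], 1
            for j in range(i, -1, -1):          # build the product p_i ... p_j
                prod &= p[j]
                terms.append(g[j - 1] & prod if j > 0 else c0 & prod)
            carries.append(1 if any(terms) else 0)
        return carries                          # [C0, C1, ..., Cn]

    # Sanity check on a small example: the bits of 5 and 7, least significant first.
    X, Y = [1, 0, 1], [1, 1, 1]
    carry, expected = 0, [0]
    for x, y in zip(X, Y):                      # carries from ordinary addition
        carry = 1 if x + y + carry >= 2 else 0
        expected.append(carry)
    assert lookahead_carries(X, Y) == expected
    print(lookahead_carries(X, Y))              # [0, 1, 1, 1]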

1.6.4 Two’s complement implementation


In Section 1.2.4.3, an algorithm to find the two’s complement of a binary
number was introduced. Given an n-bit binary number N = Nn−1 . . . N0 ,
the algorithm requires checking the values of the bits starting at N0 until
we reach the first 1. That could be a challenge to implement in a ma-
chine. In reality, computers use a different approach to compute the two’s
complements. Recall that the two's complement N2's of N is obtained by
finding first the one's complement N1's of N and then adding 1 to the re-
sult: N2's = N1's + 1, and N1's is obtained from N by simply inverting each of its bits.
From a technical point of view, both operations of inverting the bits and
adding 1 are easy to implement in a machine.
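For instance, here is a minimal Python sketch (an added illustration) of this invert-then-add-1 computation on an n-bit number written as a string of bits.

    # Two's complement of an n-bit number: invert every bit (one's complement),
    # then add 1, keeping only n bits.
    def twos_complement(bits):
        n = len(bits)
        ones = "".join("1" if b == "0" else "0" for b in bits)   # invert the bits
        value = (int(ones, 2) + 1) % (2 ** n)                    # add 1, drop overflow
        return format(value, "0{}b".format(n))

    print(twos_complement("00000110"))   # 8-bit 6 -> 11111010, i.e. -6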

1.6.5 Adder-subtractor combo


As mentioned in Section 1.2.4.3, one advantage of representing signed in-
tegers in two’s complement format is in the flexibility to use an existing
adder circuit to perform subtraction. This will certainly reduce the number
of gates used and eliminate some of the complexity of wiring the circuit. In
this section we explain in a bit more detail how an adder-subtractor combo
circuit can be constructed and the mathematics behind this design. Figure
1.13 shows a design for an adder/subtractor circuit for two 4-bit binary
numbers A = A3 A2 A1 A0 and B = B3 B2 B1 B0 . In the design, S is a switch
that controls the signal flow in the circuit and hence allows the transition
between the addition and subtraction modes as we will see below. Each of
the rectangles labeled “FA” represents a full-adder.

Fig. 1.13 4-bit adder/subtractor combo.

If the switch S is turned off (that is the value of S is 0 or low voltage


through S), then the original input C0 is 0 and the ith XOR gate (i =
0, 1, 2, 3) outputs the value Bi ⊕ 0. Note that
Bi ⊕ 0 = (Bi ∨ 0) ∧ ¬(Bi ∧ 0)
= Bi ∧ ¬0
= B i ∧ 1 = Bi .

This means that the XOR gates output B = B3 B2 B1 B0 and each output
Σi is Ai + Bi. The circuit acts like an adder in this case and performs the
operation A + B. If the value of S is 1 (high voltage through S), then the
ith XOR gate outputs Bi ⊕ 1. Now,

Bi ⊕ 1 = (Bi ∨ 1) ∧ ¬(Bi ∧ 1)
       = 1 ∧ ¬Bi = ¬Bi = Bi'.

The XOR gates are then just bit inverters producing the one's complement
B' of B. Since the original input C0 is 1, the circuit performs the operation
A + B' + 1, which amounts to adding A to the two's complement of B
(Section 1.6.4). In other words, the circuit performs A − B and it is a
subtractor in this case.
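The role of the control bit S is easy to simulate. In the Python sketch below (an added illustration) each Bi is XORed with S and S is also fed in as the initial carry, so that S = 0 adds and S = 1 subtracts.

    # 4-bit adder/subtractor combo: with S = 0 the circuit computes A + B,
    # with S = 1 it computes A + B' + 1, i.e. A - B in two's complement.
    def full_adder(x, y, cin):
        p = x ^ y
        return p ^ cin, (x & y) | (p & cin)

    def add_sub(A, B, S):
        # A and B are lists of bits, least significant bit first.
        carry, out = S, []                           # S is also the initial carry C0
        for a, b in zip(A, B):
            s, carry = full_adder(a, b ^ S, carry)   # each XOR gate outputs Bi XOR S
            out.append(s)
        return out

    def to_bits(v, n=4):                    # helper: integer -> n bits, LSB first
        return [(v >> i) & 1 for i in range(n)]

    def to_int(bits):
        return sum(b << i for i, b in enumerate(bits))

    print(to_int(add_sub(to_bits(9), to_bits(5), S=0)))   # 14 : addition
    print(to_int(add_sub(to_bits(9), to_bits(5), S=1)))   # 4  : subtraction, 9 - 5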

1.7 BCD to seven-segment decoder

After exploring some of the techniques used by your calculator to perform


arithmetical operations, it is time to look at how the calculator actually
displays the results on its screen. An early device to achieve this is known
as the seven-segment display which can be described as an arrangement
of seven bars made up of a substance that glows as current flows through
it. Each bar is labeled by a letter from “a” to “g” as indicated in Figure
1.14. LED (Light-Emitting Diode) and LCD (Liquid Crystal Display) are
the most common technologies used for the glowing bars. In our digital era,
you can hardly be anywhere without some type of seven-segment display
staring at you: your digital alarm clock, your microwave timer,
your car dashboard, your neighborhood traffic light. In spite of its many
variations and improvements, this invention remains a very popular display
device.
By choosing the appropriate combination of electrical signals, all digits
of the decimal system (0 to 9) can be produced as shown in Figure 1.15 by
turning on the corresponding LED bars.
Along with the 10 digits, an arrangement of several seven-segment dis-
plays can be used to represent multi-digit integers and other characters
like letters, commas, dots and others. The idea is to turn on some parts
to produce the desired character while keeping the other parts off. The BCD
to seven-segment converter is an early form of display; more recent displays
use a multitude of pixels to create sharper looking characters. In this
section, we focus on the circuit design for a single classic seven-segment
[Seven-segment layout: bars f and b on the upper sides, bar g across the middle, bars e and c on the lower sides, with bars a and d across the top and bottom.]

Fig. 1.14 Seven-segment LED display.

Fig. 1.15 All 10 digits on a seven-segment display.

LED display.

In a seven-segment display design, the 10 digits 0 to 9 are represented us-


ing their BCD codes (see Section 1.2.3 above). The circuit, known as BCD
to seven segment converter, must then convert a 4-bit input x0 x1 x2 x3 into
a 7-bit output that can be used to turn on specific bars while keeping others
unlit. Each of the bars “a” through “g” on a seven-segment display can be
thought of as a logic function of 4 variables x0 , x1 , x2 and x3 . For instance,
digit 3 is represented by the BCD code 0011 and in order to display it on
the screen, bars “a”, “b”, “c”, “d” and “g” must be turned on (binary value
1, electrical signal passing through) and bars “f” and “e” turned off (binary
value 0, no electrical signal through). So our circuit must convert the BCD
code 0011 into 1111001. Using the same “visual inspection” for the ten
digits and their representations on the seven segment LED display, it is
straightforward to verify the truth tables of each of the seven bars given in
Table 1.8.

The next task is to come up with a simple Boolean expression for each
of the logic functions “a” through “g”. We will work out the details for bar
“a” leaving the expressions for the other bars as exercises for the reader.
From Table 1.8, we can see that it is easier to write the standard product of
sums expression for “a” rather than the sum of product since the output of
Table 1.8 Truth tables of each of the seven bars.


Digit x0 x1 x2 x3 a b c d e f g
0 0 0 0 0 1 1 1 1 1 1 0
1 0 0 0 1 0 1 1 0 0 0 0
2 0 0 1 0 1 1 0 1 1 0 1
3 0 0 1 1 1 1 1 1 0 0 1
4 0 1 0 0 0 1 1 0 0 1 1
5 0 1 0 1 1 0 1 1 0 1 1
6 0 1 1 0 1 0 1 1 1 1 1
7 0 1 1 1 1 1 1 0 0 0 0
8 1 0 0 0 1 1 1 1 1 1 1
9 1 0 0 1 1 1 1 1 0 1 1

“a” is 0 at only two input combinations (see Section 1.5.1). The product
of maxterms of “a” is given by

a = (x0 + x1 + x2 + x3')(x0 + x1' + x2 + x3).     (1.5)
We use the Boolean Algebra properties to come up with a simplified version
of expression (1.5):
a = (x0 + x1 + x2 + x3')(x0 + x1' + x2 + x3)
  = x0x0 + x0x1' + x0x2 + x0x3 + x1x0 + x1x1' + x1x2 + x1x3
    + x2x0 + x2x1' + x2x2 + x2x3 + x3'x0 + x3'x1' + x3'x2 + x3'x3,

where x0x0 = x0, x2x2 = x2, x1x1' = 0 and x3'x3 = 0. Next, we regroup terms
using associativity and commutativity of both multiplication and addition,
starting with the single terms x0 and x2:

a = x0 + x2 + x1x3 + x1'x3' + (x0x1' + x1x0) + (x0x2 + x2x0)
    + (x0x3 + x3'x0) + (x1x2 + x2x1') + (x2x3 + x3'x2)
  = x0 + x2 + x1x3 + x1'x3' + x0(x1' + x1) + x0x2
    + x0(x3 + x3') + x2(x1 + x1') + x2(x3 + x3')
  = x0 + x2 + x1x3 + x1'x3' + x0 + x0x2 + x0 + x2 + x2
  = x0 + x2 + x1x3 + x1'x3' + x0 + x0x2     (using the relation x + x = x)
  = x0 + x2 + x1x3 + x1'x3' + x0(1 + x2)
  = x0 + x2 + x1x3 + x1'x3'.
The following table gives a Boolean expression for each of the seven
segments “a” to “g”. The reader is encouraged to imitate the above calcu-
lation for segment “a” to verify these expressions.

Segment   Boolean expression

a    x0 + x2 + x1x3 + x1'x3'
b    x1' + x2'x3' + x2x3
c    x1 + x2' + x3
d    x0 + x1'x2 + x2x3' + x1x2'x3 + x1'x3'
e    x1'x3' + x2x3'
f    x0 + x2'x3' + x1x2' + x1x3'
g    x0 + x1'x2 + x1x2' + x2x3'
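As a quick check, the Python sketch below (an added illustration) evaluates these expressions for the ten BCD codes and prints which bars are lit for each digit; the result agrees with Figure 1.15 and Table 1.8.

    # Evaluate the segment expressions for the BCD codes of the digits 0-9.
    # x0 is the most significant bit of the code, as in Table 1.8.
    def n(v):                # complement of a bit
        return 1 - v

    def segments(x0, x1, x2, x3):
        a = x0 | x2 | (x1 & x3) | (n(x1) & n(x3))
        b = n(x1) | (n(x2) & n(x3)) | (x2 & x3)
        c = x1 | n(x2) | x3
        d = x0 | (n(x1) & x2) | (x2 & n(x3)) | (x1 & n(x2) & x3) | (n(x1) & n(x3))
        e = (n(x1) & n(x3)) | (x2 & n(x3))
        f = x0 | (n(x2) & n(x3)) | (x1 & n(x2)) | (x1 & n(x3))
        g = x0 | (n(x1) & x2) | (x1 & n(x2)) | (x2 & n(x3))
        return dict(zip("abcdefg", (a, b, c, d, e, f, g)))

    for digit in range(10):
        code = [(digit >> k) & 1 for k in (3, 2, 1, 0)]   # x0 x1 x2 x3
        lit = "".join(s for s, v in segments(*code).items() if v)
        print(digit, "->", lit)                           # e.g. 3 -> abcdg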

Using these algebraic expressions for the seven segments, we can now
design a logic circuit for the BCD to seven-segment converter.

Fig. 1.16 A logic circuit for a BCD to a seven-segment converter.



1.8 So, how does the magic happen?

The logic gates that we encountered in this chapter are in reality mathemat-
ical abstractions of electrical devices that vary depending on what type of
technology is used. It is easy to find and buy these devices in the market to
physically implement a designed logic circuit. Our goal in this chapter was
not to go into the technical details of a calculator operation. We hope that
with a better understanding of basic binary arithmetics, you would now
be interested in learning about the sequence of operations that take place
when a specific operation is performed in a calculator. When you press a
button, the rubber underneath makes contact with a digital circuit produc-
ing an electrical signal in the circuit. The processor of your device picks
up the signal and identifies the addresses of corresponding active “bytes”
or switches in the circuit. If for example you press a number, the processor
will store it in some place in its memory and a signal is sent to activate
the appropriate parts on the screen to display it. The same will happen
if you press another button, until an operation key is pressed or you
reach the maximal number of symbols that can be displayed on the screen.
For example, when performing an arithmetic operation like addition, the
processor will display all digits of the first operand and when the + key is
pressed, the processor will store the first operand in its memory in binary
form and wipe it out from the screen. The processor will do the same as
you enter the digits of the second operand. Finally, when the = button is
pressed, the processor activates a full-adder circuit and sends a signal to
display the digits of the answer on the screen.

1.9 What next?

Throughout this chapter, only representation of integers inside a machine


was considered and only basic arithmetic operations like addition and sub-
traction on those numbers were treated. At this point, you are not that far
off from understanding how machines represent and do arithmetics on real
numbers in general. The most common way is known as the floating-point.
You also have now a solid base to learn about how other arithmetic oper-
ations (like multiplication, division, exponentiation, comparison and many
others) are performed inside a machine.

1.10 References

Brown, S. and Vranesic, Z. (2005). Digital Logic with VHDL Design, Second Edition. (McGraw-Hill).
Predko, M. (2005). Digital Electronics Demystified. (McGraw-Hill).
Farhat, H.A. (2004). Digital Design and Computer Organization. (CRC Press).
Chapter 2

Basics of data compression,


prefix-free codes and Huffman codes

2.1 Introduction

In our era of digital intelligence, data compression has become such a necessity
that without it our modern lifestyle as we know it would come to a halt.
We all use data compression on a daily basis, without realizing it in most
cases. Saving a file on your computer, downloading or uploading a picture
from or to the internet, taking a photo with your digital camera, sending
or receiving a fax or undergoing an MRI medical scan, are a few examples of
daily activities that require data compression. Without this technology, it
would virtually be impossible to do things as simple as viewing a friend’s
photo album on a social network, let alone complex electronic transactions
for businesses and industries. In this chapter, we go over some of the
compression recipes where mathematics is the main ingredient.

2.1.1 What is data compression and do we really need it?


In the context of this chapter, the word data stands for a digital form of
information that can be analyzed by a computer. Before it is converted to
a digital form, information usually comes in a raw form that we call source
data. Source data can be a text, an image, an audio or a video.

The answer to “why do we need data compression?” is simple: reduce


the cost of using available technologies. Imagine you own a business for sell-
ing or renting moving boxes. If you do not use foldable boxes or collapsible
containers, you would need a huge storage facility and your business would
not be financially sustainable. Similarly, uncompressed texts, images, au-
dio and video files or information transfer over digital networks require


substantial storage capacity certainly not available on standard machines


you use at the office or at home.

By a data compression algorithm, we usually mean a process through


which we can represent data in a compact and digital form that uses less
space to store or less time to transmit over a network than the original
form. This is usually done by reducing unwanted noise (or redundancy)
in the data to a certain degree where it is still possible to recover it in an
acceptable form. The process of assigning digital codes to pieces of data
for storage or transmission is called encoding. Of course, a compression
algorithm is only efficient when we are able to reverse the process, that
is to retrieve the original sequence of characters from the encoded digital
form. This process is called decoding or decompressing. In the literature, the
word compression often means both the compression and the decompression
processes (or encoding and decoding) of data.

2.1.2 Before you go further


Mathematical skills like manipulating algebraic inequalities, basic properties
of logarithmic functions and some level of discrete mathematics are needed
in this chapter.

2.2 Storage inside computers

When you click the “Save” button after working on or viewing a document
(text, image, audio,. . .), a convenient interpretation of what happens next
is to imagine that your computer stores the file in the form of a (long) finite
sequence of 0’s and 1’s that we call a binary string. The reason why only
two characters are used was explained briefly in the chapter on electronic
calculators.

011111111110000100001000000001000100

In this digital format, each of the two characters “0” and “1” is called
a “bit” (binary digit). The number of bits in a string is called the length
of the string. Note that the number of strings of a given length n is 2^n
since each bit can take two possible values. For example, there are exactly
2^5 = 32 different binary strings of length 5.

As the smallest piece of data that can be stored in an electronic device,


a bit is too small to deal with or to use in practice. It is much like the
penny (1 ¢) which in real life does not buy you anything. Most computers
store files in the form of masses of one-dimensional arrays of bytes, each
byte being a block of 8 bits:
10100100 01110101 01101010 11000111

Each symbol above is one bit; each block of 8 bits (such as the first block, 10100100) is one byte.

Why 8 bits you may ask? Well, why 12 units in a dozen? But here is
a somehow more convincing answer. The famous (extended) ASCII code
list (involving symbols on your keyboard plus other characters) contains
exactly 256 characters and since 2^8 = 256, a digital representation for each
of the 256 characters in the ASCII code is possible using the byte as the
storage unit of one character.

2.2.1 Measuring units


Digital files are usually measured by thousands, millions and even billions
of bytes. In the standard metric system, the prefix “K” stands for “kilo”
or one thousand (10^3 = 1000), the prefix “M” stands for “Mega” or one
million (10^6 = 1,000,000), the prefix “G” stands for “Giga” or one billion
(10^9 = 1,000,000,000) and so on. The same prefixes are used for sizes of
digital files but with different interpretations. Note first that the closest
power of 2 to one thousand is 2^10 = 1024, and the closest power of 2 to
one million is 2^20 = 1,048,576. In digital sizes, the prefix “KB” stands
for “KiloByte” and is equal to 2^10 = 1024 bytes (not just 1000 bytes).
Similarly, “MB” stands for “MegaByte” and is equal to 2^20 = 1024 × 1024 =
1,048,576 bytes or 1024 KB, “GB” stands for “GigaByte” and is equal to
2^30 = 1024 × 1024 × 1024 = 1,073,741,824 bytes or 1024 MB, and so on. For
example, if a certain digital file has a size of 120 KB, a computer uses
120 × 1024 = 122,880 bytes to store it.

2.3 Lossy and lossless data compression

Depending on the end goal, data compression algorithms fall into two main
categories: lossless and lossy compressions. Lossless data compression al-
lows the exact original data to be reconstructed from the compressed data.
Lossy data compression allows only an “approximation” of the original data
to be reconstructed, in exchange for smaller size. Lossless data compression


is used in applications where loss of information cannot be tolerated. For
example, when compressing a computer program one must make sure that
the exact version can be retrieved since the loss of a single character in the
program could make it meaningless. Lossy data compression is mainly used
for applications where loss of information can be tolerated. For example,
in compressing audio, video and still images files, some loss of information
could degrade a bit the quality of the decompressed file but not to a degree
where human senses can detect a significant difference (see the chapter on
JPEG standard).

If lossless compression algorithms exist (and they do), why even bother
with lossy compression? Could we not just create an algorithm capable of
reducing the size of any file and at the same time capable of reconstructing
the compressed file to its exact original form? Mathematics gives a defini-
tive answer: Don’t bother, such an algorithm is just wishful thinking and
cannot exist. First note that a lossless compression algorithm C is successful
only if the following two conditions are met:

(1) C must compress a file of size n bits to a file of size at most n − 1 bits.
Otherwise, no compression has occurred.
(2) If F1 and F2 are two distinct files, then their compressed forms CF1
and CF2 must be distinct as well. Otherwise, we would not be able to
reconstruct the original files.

These conditions make it virtually impossible to come up with a compres-


sion algorithm that reduces the size of every possible input data file. The
following theorem explains why.

Theorem 2.1. There exists no universal algorithm that can compress, in


a lossless fashion, all files of a given size n.

Proof. If k ≥ 0 is an integer, let Sk be the set of all binary strings of


length k (files of size k bits). If a universal algorithm α exists, then for
each F ∈ Sn the size of a compressed file α (F ) must be at most n − 1.
In other words, if F ∈ Sn , then α (F ) ∈ S0 ∪ S1 ∪ · · · ∪ Sn−1 . But notice
that there are exactly 2^i strings in Si, and the sets are pairwise disjoint
(the intersection Si ∩ Sj is empty for i ≠ j), which means that there are
1 + 2^1 + 2^2 + · · · + 2^(n−1) files in total in the union S0 ∪ S1 ∪ · · · ∪ Sn−1.
The sum 1 + 2^1 + 2^2 + · · · + 2^(n−1) is geometric with n terms and ratio 2.
Therefore,

1 + 2^1 + 2^2 + · · · + 2^(n−1) = (1 − 2^n)/(1 − 2) = 2^n − 1.

This means that there are 2^n − 1 different files of size at most n − 1. Thus,
α must compress at least two elements of Sn (two files of size n) into the
same compressed form (of size n − 1 at most). This is a contradiction to
the second condition of an efficient lossless compression algorithm. 

Another way to interpret the above theorem is the following: for any lossless
data compression algorithm C, there must exist a file that does not get
smaller when processed by C.

2.4 Binary codes

Words in any language are concatenations of basic symbols that we refer to


as the alphabet of the language. From a compression perspective, an input
file is usually formed by a finite set of symbols A = {α1 , . . . , αr } that we
call the alphabet. By assigning binary representation ci (a string of 0’s and
1’s) to each character αi in the alphabet, the set C = {c1 , c2 , . . . , cr } we
obtain is called a binary code and each ci is called a codeword. For example,
if A = {X, Y, Z, W }, then C = {00110, 0, 1010, 111} is a binary code for
A with codewords 00110, 0, 1010 and 111.

2.4.1 Binary trees


A useful way to represent a binary code is to associate to it a picture called
a binary tree in the following way. Start the tree with a node (◦) at the
root. Pick a codeword and read its bits one at a time from left to right. If a
bit 0 is read, create a left branch with a new node at its end. Draw a right
branch otherwise. Move to the next bit in the codeword. If it is 0, move
down one edge along the left branch and record a new node at the end of
the new edge. Do the same but along the right branch (from the previous
node) if the second bit is 1. Repeat the process until all bits are read. Mark
the character after finishing reading all the bits in the codeword. Starting
from the root node, repeat the same process for the second codeword. The
tree is complete when all codewords are considered. Note that each node
can have at most two branches. Each branch represents either 0 or 1. If a

node does not grow any more branches, it is called a leaf. Otherwise, we
call it an internal node.

Example 2.1. Consider the binary code C = {00, 01, 001, 0010, 1101}
for the alphabet Σ = {X, Y, Z, U, V }. The binary tree for C is given
below.

2.4.2 Fixed length and variable length codes


The extended version of the ASCII code transforms letters (English alpha-
bet), punctuation, numbers and other symbols (for a total of 256) into
binary codes of length 8 each (1 byte per character). For example, the
(capital) letter “J” is assigned the codeword 01001010, the letter “o” is as-
signed the codeword 01101111 and 01100101 is the codeword for the letter
“e”. The ASCII code is an example of a fixed length code since it assigns
the same length (8 bits) for each codeword. However, fixed length codes
are not usually the most efficient for compression purposes. To save storage
space, it would make more sense to assign shorter codewords for characters
that appear frequently in a source data and longer ones for less frequent
characters. For example, some studies suggest that in a typical English
text, the letter “e” appears more frequently than other letters (this is not
exactly true if we include other non-letter characters, like the space char-
acter which is a bit more frequent than “e”) followed by the letter “t”,
whereas the letters “q” and “z” appear at the bottom of the list of most
frequent letters. So, if we are compressing an English text, a good com-
pression strategy would be to assign shorter codewords for “e” and “t” and
longer ones for “q” and “z”. In this case, our code would be called a vari-
able length code. The famous Morse code is an example of a variable length
code where just a dot is assigned to the letter “e” and just a dash to the
letter “t” (as opposed to combinations of dots and dashes for other letters
and symbols).

2.5 Prefix-free code

Variable length codes are very practical for compression purposes, but not
enough to achieve optimal results. For a code to be efficient, one must
include in the design a unique way of decoding. For example, consider
the code C = {0, 01, 10, 1001, 1} for the alphabet {a, b, c, d, e}. The word
101001 can be interpreted in more than one way: “cd”, “ead” or “ccb”.
This is certainly not a well-designed code because of this decoding ambi-
guity. A closer look at C shows that the problem with it is the fact that
some codewords are prefixes (or start) of other codewords. For example,
the codeword 10 is a prefix of the codeword 1001. One way to avoid the
confusion is to include a symbol to indicate the end of a codeword, but this
risks being costly considering the number of times the end symbol must be
included in the encoded file. A code C with the property that no code-
word in C is a prefix of any other codeword in C is called a prefix-free
code. Prefix-free codes are uniquely decodable codes in the sense that there
is a unique way to decode any encoded message. The converse is not true
in general. There are examples of uniquely decodable codes which are not
prefix-free.

Example 2.2. The code H = {10, 01, 000, 1111} is a prefix-free code since
no codeword is a start of another codeword. In particular, it is uniquely
decodable.

Example 2.3. The code H = {0, 011, 10} for the alphabet {a, b, c} is
uniquely decodable (try to decode a couple of binary sequences and you will
quickly see why) but clearly not prefix-free.

2.5.1 Decoding a message using a prefix-free code


Using a prefix-free code C, decoding a binary message can be achieved using
the binary tree of C as follows.
(1) Starting at the root of the tree, move down and left one branch if 0
is the first bit in the encoded message, and move down and right one
branch if the first bit is 1.
(2) Repeat using adjacent bits in the sequence until you reach an external
leaf node.
(3) Record the letter corresponding to the leaf node.


(4) Return to the root of the tree and repeat the previous steps using other
blocks of digits in the encoded sequence.

Example 2.4. Consider the code C = {00, 01, 10, 110, 111} for the al-
phabet {A, B, C, D, E}. It is clear that C is prefix-free since no codeword
is a prefix of another. Suppose we want to decode the binary message
01000011010111. A good way to start is to draw the binary tree associated
with C:

Starting at the root of the tree, we move left using the first 0 in the string
and then we move right using the first 1. A leaf with label “B” is reached.
We record the letter “B” and return to the root. At this point, we continue
with the string 000011010111. From the root, we move left (for the first 0
in the new string) and then down along the same branch (for the second
0 in the new string). The character “A” is reached and we record it next
to the letter “B” previously found. Continuing in this manner, we see that
the string 01000011010111 corresponds to the message “BAADCE”.
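This walk down the tree translates almost word for word into code. The Python sketch below (an added illustration) decodes the message of Example 2.4 by reading bits until a codeword is recognized, which for a prefix-free code is the same as reaching a leaf.

    # Decode a binary message with the prefix-free code of Example 2.4.
    code = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}
    decode_table = {word: letter for letter, word in code.items()}

    def decode(message):
        result, current = [], ""
        for bit in message:
            current += bit
            if current in decode_table:      # a leaf has been reached
                result.append(decode_table[current])
                current = ""                 # go back to the root
        return "".join(result)

    print(decode("01000011010111"))          # BAADCE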

2.5.2 How to decide if a code is prefix-free?


Deciding if a given code is prefix-free just by looking at the codewords can
be challenging. The associated binary tree is one way to simplify this task.
Looking at the tree, the code is prefix-free if all the alphabet characters are
associated with leaf nodes. If one character happens to be at an internal
node, the code is not prefix-free. The code in Example 2.1 is not prefix-free
since the characters X and Z are associated with internal nodes. The code
in Example 2.4 is prefix-free since every character corresponds to a leaf
node in the tree.
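The property can also be checked directly from the codewords, without drawing the tree. The short Python sketch below (an added illustration) simply compares every pair of codewords.

    # Decide whether a binary code is prefix-free by comparing all pairs.
    def is_prefix_free(code):
        return not any(a != b and b.startswith(a) for a in code for b in code)

    print(is_prefix_free({"00", "01", "001", "0010", "1101"}))   # False (Example 2.1)
    print(is_prefix_free({"00", "01", "10", "110", "111"}))      # True  (Example 2.4)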

2.5.3 The Kraft inequality for prefix-free binary codes


Given a list of positive integers L = (l1 , l2 , . . . , ln ), is it possible to con-
struct a prefix-free binary code with L as codeword lengths? The answer
is yes if a certain condition (known as the Kraft inequality) on the lengths
li is satisfied. Namely, we have the following.

Theorem 2.2 (Kraft inequality). Let A = {α1 , α2 , . . . , αn } be an al-


phabet, L = (l1 , l2 , . . . , ln ) a list of n positive integers. A prefix-free binary
code C on A with L as codeword lengths exists if and only if
2^(−l1) + 2^(−l2) + · · · + 2^(−ln) ≤ 1.     (2.1)

Proof. There are several proofs of this well-known result in the litera-
ture. Some of them are purely algebraic, others are schematic using trees.
Theorem 2.2 consists of a necessary and a sufficient condition. In what
follows, we use a schematic proof for one direction and an algebraic one for
the other to give the reader a flavor of both approaches.

Assume C = {c1 , c2 , . . . , cn } is a prefix-free binary code of the alphabet


A. Let li be the length of ci , the codeword corresponding to αi and let
l = max {li ; i = 1, . . . , n} (l is the largest codeword length). Construct the
complete and full maximal binary tree T of height l. That is to say, T is the
binary tree in which every internal node has exactly two children nodes (two
branches) with every level having the maximum possible number of nodes,
and leaves in the tree appear only at the deepest level l. For example, the
following is the complete and full maximal binary tree T of height 4:

Note that T has exactly 2^l leaves, which correspond to all possible code-
words of length l. The length li is called the depth of αi in the tree and ci
is formed by collecting all the binary symbols on the path from the root to
αi . Next, we extract the binary tree T (C) corresponding to C as a “sub-
tree” of T . Since the code C is prefix-free, we know that every character

αi corresponds to a leaf in T (C). The fact that character αi is at depth li


implies that there is a total of 2l−li codewords with ci as prefix. These 2l−li
codewords must be erased in order for αi to correspond to a leaf. Doing
this for every i = 1, 2, . . . , n, the total number of ruled out codewords is
Pn l−li
i=1 2 . But this number cannot possibly exceed 2l . Therefore
n
X n
X n
X
l −li
l−li
2 l
≤ 2 =⇒ 22 l
≤ 2 =⇒ 2−lj ≤ 1.
i=1 i=1 j=1

This proves one direction in the Theorem.


For the other direction, assume that the inequality 2^(−l1) + · · · + 2^(−ln) ≤ 1 is
satisfied for a certain set of positive integers L = {l1 , . . . , ln }. We need to
construct a prefix-free code C of the alphabet A with the li's as codeword
lengths. For each positive integer k, let Lk be the subset of L consisting
of those lj's equal to k and let γk be the number of elements in Lk. For
example, if L = {1, 2, 3, 5, 5, 5, 5}, then L1 = {l1}, L2 = {l2}, L3 = {l3},
L4 = ∅ (empty), L5 = {l4, l5, l6, l7} and γ1 = 1, γ2 = 1, γ3 = 1, γ4 = 0
and γ5 = 4. For any k ∉ {1, 2, 3, 4, 5}, Lk = ∅ and γk = 0. Note that by
regrouping like powers of 2 in the left side of inequality (2.1), the inequality
can be written as

γ1 2^(−1) + γ2 2^(−2) + · · · + γl 2^(−l) ≤ 1,     (2.2)

where l = max {li ; i = 1, . . . , n}. Starting with L1, the subset of code
length 1, it is clear that we can choose at most two codewords of length 1,
namely 0 and 1. In other words, γ1 ≤ 2. For L2, there is a maximum of
2^2 (distinct) codewords of length 2, but since the desired code C must be
prefix-free, we must eliminate those codewords of length 2 having one or
the other of the previously chosen codewords as a prefix. This can be done as long
as γ2 ≤ 2^2 − 2γ1. To create the γ3 codewords of length 3 each, none of these
codewords (a total of 2^3) can start with the γ1 codewords created at the
first step or the γ2 codewords created at the second step. There are 2^2 γ1
codewords of length 3 having prefixes from the codewords created at the
first step and 2γ2 codewords of length 3 having prefixes from the codewords
created at the second step. Therefore γ3 must satisfy the inequality γ3 ≤
2^3 − 2^2 γ1 − 2γ2. Continuing this process, we get that the desired prefix-
free code is possible provided that the following system of inequalities is
satisfied:

(S):
    γ1 ≤ 2
    γ2 ≤ 2^2 − 2γ1
    γ3 ≤ 2^3 − 2^2 γ1 − 2γ2
    γ4 ≤ 2^4 − 2^3 γ1 − 2^2 γ2 − 2γ3
    ...
    γl ≤ 2^l − 2^(l−1) γ1 − 2^(l−2) γ2 − · · · − 2γl−1

Note that if the second inequality in (S) is satisfied, then 2γ1 ≤ 2^2 − γ2 and
γ1 ≤ 2 − γ2/2. But since 2 − γ2/2 ≤ 2, we get that γ1 ≤ 2 and the first inequality
in (S) is also satisfied. Similarly, it is easy to see that if any inequality in
(S) is satisfied, then the previous one is also satisfied. Therefore, if we can
prove that the last inequality is satisfied, then the system (S) would be
valid and a prefix-free binary code can be constructed. Multiplying the last
inequality in (S) by 2^(−l) and rearranging gives:

γ1 2^(−1) + γ2 2^(−2) + · · · + γl 2^(−l) ≤ 1.

This is the same as the inequality (2.2) above. We conclude that a prefix-
free code can be constructed if the Kraft inequality is satisfied. □

Remark 2.1. It is important to have a clear understanding of what Kraft’s


inequality says about the code and more importantly, of what it does not
say. The inequality serves as a tool to (quickly) check if a given binary
code is not prefix-free. However, it does offer much help to prove that
a given code is indeed prefix-free. For example, the binary code C1 =
{10, 11, 01, 00, 0111} is not a prefix-free since the codewords lengths do
not satisfy Kraft’s inequality, so there is no point of wasting time to check
if there is a codeword which is a prefix of another, we know it is the case. On
the other hand, the codeword lengths of C2 = {10, 101, 01} satisfy Kraft’s
inequality, but C2 is clearly not prefix-free.
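A quick numerical check of the inequality, in the spirit of this remark, takes only a couple of lines; the Python sketch below is only an illustration.

    # Kraft inequality check: sum of 2^(-l) over the codeword lengths.
    def kraft_sum(lengths):
        return sum(2 ** (-l) for l in lengths)

    C1 = ["10", "11", "01", "00", "0111"]
    C2 = ["10", "101", "01"]
    print(kraft_sum(map(len, C1)))   # 1.0625 > 1 : C1 cannot be prefix-free
    print(kraft_sum(map(len, C2)))   # 0.625 <= 1 : no conclusion, and C2 is still not prefix-free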

2.6 Optimal codes

Consider an alphabet with n symbols A = {α1 , . . . , αn }. Assume that


a certain source file with alphabet A has a probability distribution vector
P = (p1 , . . . , pn ). This means that pi is the probability of occurrence of
the symbol αi in the file. We can think of the probability pi as the ratio
of the number of times αi appears in the file by the total number of oc-
currences of all alphabet symbols. A consequence of this interpretation is
that p1 + p2 + · · · + pn = 1. For a prefix-free binary code C of the alphabet A with
codeword lengths L = {l1 , . . . , ln }, we define the average codeword length
of C as lav = p1 l1 + p2 l2 + · · · + pn ln, measured in bits per source symbol.

We make the following observations without looking too much into the
probabilistic and statistical properties of the source.
(1) It is completely irrelevant what names we give for the source alphabet
symbols. All that matters at the end of the day is the probability
distribution vector of these symbols.
(2) If C is a prefix-free binary code of A with codeword lengths L =
    {l1 , . . . , ln }, then the average codeword length lav = p1 l1 + · · · + pn ln of C can
be thought of as the average number of bits per symbol required
to encode the source.
(3) It is then natural to seek binary prefix-free codes with average codeword
length as small as possible in order to save on the numbers of bits used
to encode the source. A prefix-free code with minimal average codeword
length is called an optimal code.

Example 2.5. Consider the source file EBRRACCADDABRA with
alphabet A = {A, B, C, D, E, R}. The source probability distribution
vector is P = (2/7, 1/7, 1/7, 1/7, 1/14, 3/14). For the binary (prefix-free) code

H = {1010, 001, 111, 100, 01, 110}

of the alphabet, the average codeword length is lav = (2/7)(4) + (1/7)(3) +
(1/7)(3) + (1/7)(3) + (1/14)(2) + (3/14)(3) = 45/14 ≈ 3.2. This means that it takes,
on average, 3.2 bits per alphabet symbol to encode the source using H.
Note that the code H is not optimal since the (prefix-free) code

C = {00, 010, 011, 100, 101, 11}

of the same alphabet has an average codeword length of (2/7)(2) + (1/7)(3) +
(1/7)(3) + (1/7)(3) + (1/14)(3) + (3/14)(2) = 2.5 bits per source symbol. So it
takes fewer bits to encode the source using C.
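The two averages of Example 2.5 can be recomputed in a few lines. The Python sketch below (an added illustration) uses the letter counts of the source directly, with one consistent assignment of codewords to letters; the particular assignment is our choice, since only the codeword lengths matter for the average.

    # Average codeword length (bits per source symbol) for the two codes of
    # Example 2.5, computed from the letter frequencies of EBRRACCADDABRA.
    from collections import Counter
    from fractions import Fraction

    source = "EBRRACCADDABRA"
    counts = Counter(source)
    total = len(source)

    def average_length(code):
        return sum(Fraction(counts[ch], total) * len(word) for ch, word in code.items())

    H = {"A": "1010", "B": "001", "C": "111", "D": "100", "E": "01", "R": "110"}
    C = {"A": "00", "B": "010", "C": "011", "D": "100", "E": "101", "R": "11"}
    print(average_length(H), "=", float(average_length(H)))   # 45/14 = 3.214...
    print(average_length(C), "=", float(average_length(C)))   # 5/2  = 2.5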

It is not entirely obvious that an optimal code exists for a given alphabet.
The following result proves that in fact an optimal and prefix-free binary
code always exists.

Theorem 2.3. Given a source S with alphabet A = {α1 , . . . , αn } and


a probability distribution vector P = (p1 , . . . , pn ), an optimal prefix-free
binary code for S exists.

Proof. Without any loss of generality, we can assume that pi > 0 for each
i = 1, . . . , n. Indeed, if this is not the case, we can restrict our alphabet
to those symbols with positive probability and just ignore all symbols with
zero probability (they do not appear in the source anyway). By rearranging
the symbols if necessary, we can also assume that p1 ≤ · · · ≤ pn . We start
by proving the existence of a prefix-free binary code for A. This can be
achieved in more than one way. First, take li to be the smallest integer
greater than or equal to − log2 (pi ) for each i = 1, . . . , n. The proof of
Theorem 2.4 below shows in particular that the list (l1 , . . . , ln ) satisfies the
Kraft inequality and hence represents the codeword lengths of a prefix-free
binary code of the alphabet. Another way to construct a prefix-free code
on the alphabet A is to choose a positive integer l satisfying $2^l \ge n$; then
the integers $l_1 = l_2 = \cdots = l_n = l$ satisfy
$$\sum_{i=1}^{n} 2^{-l_i} = n\,2^{-l} = \frac{n}{2^l} \le 1.$$
By Theorem 2.2, $(l_1, \ldots, l_n)$ are the codeword lengths of a prefix-free code
on A. Next, let us fix a prefix-free code C0 on A of average codeword length
l0 . We claim that there is only a finite number of binary prefix-free codes
with average codeword length less than or equal to l0 . To see this, let C
be a prefix-free code on A with codeword lengths $(l_1, \ldots, l_n)$ and average
codeword length $l = \sum_{i=1}^{n} p_i l_i$ satisfying $l \le l_0$. If $l_k > l_0/p_1$ for some k, then
$$l = \sum_{i=1}^{n} p_i l_i \ge p_k l_k > p_1 \cdot \frac{l_0}{p_1} = l_0,$$
which contradicts the assumption $l \le l_0$. We conclude that $l_k \le l_0/p_1$ for
$k = 1, \ldots, n$. Clearly, there are only finitely many binary words of length less than
or equal to the constant $l_0/p_1$, and thus the set G of all binary prefix-free codes
with average codeword length less than or equal to l0 is finite. Pick a code
C in G with the lowest average codeword length (this is possible since G is
finite), then C is a prefix-free and optimal code for the alphabet. 
Given a source with alphabet A = {α1, . . . , αn} and a probability distribution
vector P = (p1, . . . , pn), let l be the minimum of the set
$$\left\{ \sum_{i=1}^{n} p_i l_i \;;\; l_i \text{ is a positive integer and } \sum_{i=1}^{n} 2^{-l_i} \le 1 \right\}.$$
In other words, l is the minimal average codeword length taken over all
possible prefix-free codes of the source. Theorem 2.3 guarantees the existence
of a prefix-free code with average codeword length equal to l. But

how can we actually construct that code? The Kraft inequality gives us a
clean and nice mathematical formulation of the problem at hand.

Problem 2.1. If P = (p1 , . . . , pn ) is a given probability distribution


vector (i.e., $\sum_{i=1}^{n} p_i = 1$ and $p_i > 0$ for $i = 1, \ldots, n$), how can we find n
positive integers $l_1, \ldots, l_n$ subject to the constraint $\sum_{i=1}^{n} 2^{-l_i} \le 1$ (Kraft
inequality) so that the sum $\sum_{i=1}^{n} p_i l_i$ is minimal? In other words, how can
we find positive integers $l_1, \ldots, l_n$ satisfying the Kraft inequality so that
$\sum_{i=1}^{n} p_i l_i = l$?

Problem 2.1 is a classic optimization problem involving several variables


(l1 , . . . , ln ) subject to a constraint (the Kraft inequality). In addition, we
have another complication to deal with: the variables li ’s must be (positive)
integers. If we ignore that last constraint on the variables and assume that
each li is just a real number, the problem can be dealt with quite efficiently
using a technique known as the method of Lagrange multipliers. Readers
familiar with this optimization technique are encouraged to try to apply it
to this particular problem for small values of n. We will not go over the
solution of Problem 2.1 in this book. However, we present an interesting
result that gives a lower and an upper bound on the value of l.

Theorem 2.4. Let P = (p1 , . . . , pn ) be the probability distribution vector


of a certain data source with alphabet A = {α1 , . . . , αn }. If l is the average
codeword length of any optimal prefix-free code for the source, then
$$-\sum_{i=1}^{n} p_i \log_2(p_i) \;\le\; l \;<\; 1 - \sum_{i=1}^{n} p_i \log_2(p_i). \qquad (2.3)$$

Moreover, there exists a prefix-free binary code with codeword lengths
$(l_1, \ldots, l_n)$ and average codeword length equal to $-\sum_{i=1}^{n} p_i \log_2(p_i)$ if and
only if $p_i = 2^{-l_i}$ for each $i = 1, \ldots, n$.

Proof. For the proof of the inequality on the left in (2.3), we use a well-
known inequality that you probably have seen in your first Calculus course:

ln(x) ≤ x − 1, for x > 0 (2.4)

with equality in (2.4) occurring only at x = 1. This can be seen by comparing
the graphs of ln(x) and x − 1.

Since $\ln(x) = \frac{\log_2(x)}{\log_2(e)}$ (change of base for logarithmic functions), inequality
(2.4) can be written as
$$\log_2(x) \le \log_2(e)\,(x - 1), \quad \text{for } x > 0. \qquad (2.5)$$
Given any prefix-free binary code C of the alphabet A with codeword lengths
$(l_1, \ldots, l_n)$ and average codeword length $l_{av}$, we have:
$$\begin{aligned}
-\sum_{i=1}^{n} p_i \log_2(p_i) - l_{av}
&= -\sum_{i=1}^{n} p_i \log_2(p_i) - \sum_{i=1}^{n} p_i l_i
 = -\sum_{i=1}^{n} p_i \left(\log_2(p_i) + l_i\right) \\
&= -\sum_{i=1}^{n} p_i \left(\log_2(p_i) + \log_2\!\left(2^{l_i}\right)\right)
 = -\sum_{i=1}^{n} p_i \log_2\!\left(p_i 2^{l_i}\right) \\
&= \sum_{i=1}^{n} p_i \log_2\!\left(\left(p_i 2^{l_i}\right)^{-1}\right)
 = \sum_{i=1}^{n} p_i \log_2\!\left(\frac{2^{-l_i}}{p_i}\right) \\
&\le \sum_{i=1}^{n} p_i \log_2(e)\left(\frac{2^{-l_i}}{p_i} - 1\right)
 \quad \text{(by (2.5) since } \tfrac{2^{-l_i}}{p_i} > 0\text{)} \qquad (*) \\
&= \log_2(e) \sum_{i=1}^{n} \left(2^{-l_i} - p_i\right)
 = \log_2(e) \left(\sum_{i=1}^{n} 2^{-l_i} - \sum_{i=1}^{n} p_i\right) \\
&\le 0 \quad \text{(since } \sum_{i=1}^{n} 2^{-l_i} \le 1 \text{ by Kraft and } \sum_{i=1}^{n} p_i = 1\text{).}
\end{aligned}$$
We conclude that the average codeword length $l_{av}$ of any prefix-free code
satisfies the inequality $-\sum_{i=1}^{n} p_i \log_2(p_i) \le l_{av}$. Since any optimal code is
in particular prefix-free, the first inequality in (2.3) is established. For the

second (strict) inequality, it suffices to find a prefix-free binary code with


an average codeword length strictly less than $1 - \sum_{i=1}^{n} p_i \log_2(p_i)$. To this
end, let $l_i = \lceil -\log_2(p_i) \rceil$ be the integer least upper bound of $-\log_2(p_i)$ for
$i = 1, \ldots, n$. That is to say, $l_i$ represents the smallest integer greater than
or equal to $-\log_2(p_i)$. Since $x \le \lceil x \rceil < x + 1$ for any real number x, we have
$$-\log_2(p_i) \le l_i < -\log_2(p_i) + 1 \qquad (2.6)$$
or equivalently $\log_2(p_i) - 1 < -l_i \le \log_2(p_i)$. This gives the following
inequalities
$$2^{\log_2(p_i) - 1} < 2^{-l_i} \le 2^{\log_2(p_i)} \iff \frac{p_i}{2} < 2^{-l_i} \le p_i.$$
In particular, $\sum_{i=1}^{n} 2^{-l_i} \le \sum_{i=1}^{n} p_i = 1$. The Kraft inequality is then
satisfied, and Theorem 2.2 guarantees the existence of a binary prefix-free code
with the $l_i$'s as codeword lengths. Using (2.6),
$$\sum_{i=1}^{n} p_i l_i < \sum_{i=1}^{n} p_i \left(-\log_2(p_i) + 1\right) = -\sum_{i=1}^{n} p_i \log_2(p_i) + \sum_{i=1}^{n} p_i = -\sum_{i=1}^{n} p_i \log_2(p_i) + 1.$$

This finishes the proof of (2.3) in the theorem. For the last statement,
assume that C is a prefix-free binary code with codeword lengths $(l_1, \ldots, l_n)$
and average codeword length equal to $-\sum_{i=1}^{n} p_i \log_2(p_i)$. In order to
achieve $l_{av} = -\sum_{i=1}^{n} p_i \log_2(p_i)$, the inequality labeled $(*)$ in the above proof
must be an equality. That can only happen when $\frac{2^{-l_i}}{p_i} = 1$ for each i
(remember that inequality (2.5) is an equality if and only if x = 1), and so
$p_i = 2^{-l_i}$ for all i. Conversely, if $(l_1, \ldots, l_n)$ are positive integers satisfying
$p_i = 2^{-l_i}$ for all i, then $-l_i = \log_2 p_i$ and $\sum_{i=1}^{n} 2^{-l_i} = \sum_{i=1}^{n} 2^{\log_2 p_i} = \sum_{i=1}^{n} p_i = 1$.
The Kraft inequality implies that there exists a prefix-free
binary code D with codeword lengths $(l_1, \ldots, l_n)$. Note that the average
codeword length of D is $\sum_{i=1}^{n} p_i l_i = -\sum_{i=1}^{n} p_i \log_2(p_i)$. □

2.7 The source entropy

Given a data source with probability distribution vector P = (p1, . . . , pn)
and alphabet A, the lower bound $H = -\sum_{i=1}^{n} p_i \log_2(p_i)$ on the average
codeword length of a prefix-free code of A shown in Theorem 2.4 seems to
fall from the sky. But readers who carried out the details of the Lagrange
multipliers technique suggested earlier would probably have seen it popping
up in their calculations. It turns out that H plays a central role in
information theory, where it is known as the entropy of the source.

In terms of the entropy, two important facts about the source can
be drawn from Theorem 2.4. The first is that the average codeword
length of any optimal code can never do better than the entropy $H =
-\sum_{i=1}^{n} p_i \log_2(p_i)$ of the source, while it is always within one bit of
the entropy. The second fact is that there exists a prefix-free code of a
source with probability distribution vector P = (p1, . . . , pn) that achieves
the entropy bound if and only if the source is dyadic, that is, each pi is of
the form $p_i = 2^{-l_i}$ for some positive integer li.

Example 2.6. Assume that the probability distribution vector of a source


file is P = (0.2, 0.1, 0.4, 0.3). Assume also that the prefix-free code C =
{1, 01, 001, 000} is used to encode the source file. The entropy of the source
is
$$H = -\sum_{i=1}^{n} p_i \log_2(p_i) = -\left[0.2 \log_2(0.2) + 0.1 \log_2(0.1) + 0.4 \log_2(0.4) + 0.3 \log_2(0.3)\right] \approx 1.85 \text{ bits.}$$
This means that, on average, encoding the source requires a minimum of 1.85
bits per symbol. Note that the average codeword length of the code C is
$$l_{av} = \sum_{i=1}^{n} p_i l_i = (0.2)(1) + (0.1)(2) + (0.4)(3) + (0.3)(3) = 2.5$$
bits per symbol. The fact that the average codeword length of C is not
equal to the entropy of the source is due to the fact that the source is not
dyadic.
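
Readers who want to verify such computations can do so in a couple of lines; the sketch below (illustrative only) recomputes the entropy and the average codeword length of Example 2.6.

```python
import math

p = [0.2, 0.1, 0.4, 0.3]     # probability distribution vector of Example 2.6
lengths = [1, 2, 3, 3]       # codeword lengths of C = {1, 01, 001, 000}

entropy = -sum(pi * math.log2(pi) for pi in p)
l_av = sum(pi * li for pi, li in zip(p, lengths))

print(round(entropy, 2))   # 1.85 bits per symbol
print(l_av)                # 2.5 bits per symbol
```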

2.8 The Huffman code

In the late 1940’s, researchers in the (then) young field of information the-
ory worked hard on the problem of constructing optimal codes with not
much luck. Some descriptions of what such a code should look like were
given but without any concrete algorithm to construct one. In the early
1950’s, David Huffman was a student in a graduate course at MIT given by
Robert Fano on information theory. Huffman and his classmates were given
the option to submit a paper on the optimal code question, a problem Fano
and others had almost given up hope of solving, or to write a standard
final exam in the course. Huffman worked on the paper for a period of
time and, just before giving up and going back to study for the final exam,
a solution hit him. To everyone's surprise, Huffman's paper consisted of a
simple and straightforward way to construct an optimal code, and it earned
him a great deal of fame. While almost every attempted solution to the
problem consisted of constructing a code tree from the top down, Huffman's
approach was to construct the tree of his code from the bottom up.

Fix a source file with an alphabet A = {α1 , . . . , αn } and a probability


distribution vector P = (p1 , . . . , pn ). Huffman’s construction of an optimal
prefix-free code is based on the following observations.

Observation 1. In the binary tree corresponding to an optimal prefix-free


code of A , each internal node must have two children (such a tree is called
full in the literature). To see this, assume to the contrary that there exists
an internal node v with a unique child w. Move w up one level to its parent,
that is merge v and w into a unique node. The resulting tree will remain
that of a prefix-free code of A except its average codeword length is shorter
than the original one. This is a contradiction to the optimality of the code.

Observation 2. For an optimal prefix-free code of A with codeword lengths


l1 , . . . , ln , if pi > pj then li ≤ lj . This observation should come as no
surprise as we expect an optimal code to assign shorter lengths to more
probable characters and longer ones to least probable characters. For the
proof, assume to the contrary that C is an optimal prefix-free code satis-
fying li > lj with pi > pj for some i ≠ j. The average codeword length
of C contains the term pi li + pj lj. Let C′ be the binary code obtained by
interchanging the two codewords corresponding to αi and αj. The code C′
is clearly prefix-free and its average codeword length is the same as that of
C, with the exception that the term pi li + pj lj is replaced by pi lj + pj li. Since
(pi li + pj lj) − (pi lj + pj li) = (pi − pj)(li − lj) > 0, the average codeword length of
C′ is smaller than that of C, contradicting the optimality of C.

Observation 3. In an optimal code C with maximum codeword length


l, if c is a codeword of length l, then there exists another codeword c0 of
length l such that c and c0 differ only in their last bit. To see this, write
c = γ1 γ2 . . . γl−1 γl , where each γi is a bit. By Observation 1 above, the

internal node corresponding to γ1 γ2 . . . γl−1 must have two children. One


of the children is the leaf corresponding to the codeword c and the other
corresponds to c0 = γ1 γ2 . . . γl−1 γ̄l where γ̄l is 0 if γl is 1 and γ̄l is 1 if γl
is 0. Since c0 is of maximal length l, it must correspond to a leaf in the
tree. This implies that c0 is also a codeword of C of maximal length l and
it differs from c only in the last bit.

Putting together the above observations leads to the following result.

Theorem 2.5. Consider a source file with alphabet A = {α1 , . . . , αn } and


probability distribution vector P = (p1 , . . . , pn ) with the probabilities ar-
ranged in non-decreasing order p1 ≤ p2 ≤ · · · ≤ pn . Then, there exists an
optimal prefix-free code of A satisfying the following property: the code-
words corresponding to α1 and α2 (the two least probable symbols) have
the same maximal length and they differ only in their last bit.

Proof. Start with an arbitrary optimal code C = {c1 , . . . , cn } of A of


codeword lengths (l1 , . . . , ln ) where ci is the codeword assigned to symbol
αi and li is the length of ci (such a code exists by Theorem 2.3). By Ob-
servation 2, codeword c1 has a maximal length l1 since α1 has the smallest
occurrence probability. By Observation 3, there exists a codeword ck of
the same (maximal) length l1 and which differs from c1 only in the last
bit. If k = 2, we are done. If k ≥ 3, then pk ≥ p3 ≥ p2 by the ordering
on the probabilities chosen above. On the other hand, since lk = l1 ≥ l2 ,
Observation 2 implies that pk ≤ p2. We conclude that pk = p2. Let C′
be the code obtained from C by interchanging the codewords assigned to α2 and αk; then
C′ remains prefix-free and optimal since it has the same average codeword
length as C. Clearly, C′ satisfies the required property of the theorem. □

2.8.1 The construction


Consider a source with alphabet A = {α1 , . . . , αn } and probability dis-
tribution vector P = (p1 , . . . , pn ). Assume that C = {c1 , . . . , cn } is the
optimal prefix-free code for A with codeword lengths l1 , . . . , ln and average
codeword length l that satisfies the property of Theorem 2.5. Let T be
the binary tree associated with C. If αi and αj are the two least probable
characters in A, Theorem 2.5 tells us that ci , cj have the same (maximal)
length and that they differ only in their last bits. For the tree T , this means
that αi and αj correspond to sibling leaves in the tree (they have the same
parent node). Merge the two siblings into a common leaf αij placed at their
parent node in T. Let cij be the codeword corresponding to αij; that is, cij
is the codeword consisting of the first common li − 1 bits of ci and cj. The
new tree we obtain corresponds to a prefix-free code C′ for the alphabet A′
obtained from A by removing the symbols αi, αj and replacing them with
the common symbol αij, to which we assign the probability pi + pj. If l′ is
the average codeword length of the code C′, then
$$\begin{aligned}
l - l' &= p_1 l_1 + \cdots + p_i l_i + \cdots + p_j l_j + \cdots + p_n l_n \\
&\quad - \left[ p_1 l_1 + \cdots + (p_i + p_j)(l_i - 1) + \cdots + p_n l_n \right] \\
&= p_i l_i + p_j \underbrace{l_j}_{= l_i} - (p_i + p_j)(l_i - 1) = p_i + p_j.
\end{aligned}$$
In particular, $l = l' + (p_i + p_j)$. The difference between the two average
lengths depends solely on the source probability vector. This proves the
following lemma.

Lemma 2.1. Let A = {α1 , . . . , αn } be an alphabet with a probability


distribution vector P = (p1, . . . , pn). Let A′ be the alphabet obtained
from A by replacing the two least frequent characters αi and αj with a
single character αij with assigned probability pi + pj. If T′ is a binary tree
representing an optimal prefix-free code for A′, then the tree T obtained
from T′ by replacing αij with an internal node with two children αi, αj
corresponds to an optimal prefix-free code for the original alphabet A.

2.8.2 The Huffman algorithm


With all the above results in mind, we are now ready to give a practical
algorithm to construct an optimal code due to Huffman. The setup is as
before, namely an alphabet A = {α1 , . . . , αn } and probability distribution
vector P = (p1 , . . . , pn ).

(1) Pick two letters αi and αj from the alphabet with the smallest proba-
bilities.
(2) Create a subtree with root labeled αij that has αi and αj as leaves.
(3) Set the probability of αij as pi + pj .
(4) Form a new alphabet A0 of n − 1 symbols by removing αi and αj from
the alphabet A and adding the new symbol αij .
(5) Repeat the previous steps for the new alphabet A0 .
(6) Stop when an alphabet with only one symbol is left.

The tree we obtain at the end of the above algorithm is called the Huffman
tree and the corresponding code is called the Huffman code.
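
The algorithm translates almost line by line into code. The Python sketch below is one possible (illustrative) rendering: it keeps the partial subtrees in a priority queue and repeatedly merges the two least probable ones, prepending a 0 to the codewords of one subtree and a 1 to the other, exactly as in the bottom-up construction. Ties are broken arbitrarily, so the individual codewords may differ from a hand-drawn tree, while the average codeword length is the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}; returns {symbol: codeword}."""
    tie = count()   # tie-breaker so that heapq never has to compare the code dictionaries
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)    # the two least probable (merged) symbols
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code1.items()}           # left subtree: leading 0
        merged.update({s: "1" + w for s, w in code2.items()})     # right subtree: leading 1
        heapq.heappush(heap, (p1 + p2, next(tie), merged))        # new symbol of probability p1 + p2
    return heap[0][2]

# The alphabet of the example treated in Section 2.8.3 below.
probs = {"n": 1/16, "k": 1/16, "i": 1/8, "s": 1/8, "y": 1/8, "e": 1/4, " ": 1/4}
code = huffman_code(probs)
print(code)
print(sum(p * len(code[s]) for s, p in probs.items()))   # 2.625 bits per symbol
```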

Theorem 2.6. The Huffman code is an optimal prefix-free code.

Proof. We use a proof by induction on n, the number of symbols of the


alphabet. If n = 2, the Huffman algorithm assigns 0 to one symbol of the
alphabet and 1 to the other. Clearly, this is an optimal prefix-free code in
this case. Let n ≥ 3 and assume that the Huffman code returns an optimal
prefix-free code for any alphabet of size n − 1. Let A = {α1 , α2 , . . . , αn } be
an alphabet of size n with symbol probabilities arranged in a non-decreasing
order p1 ≤ p2 ≤ · · · ≤ pn−1 ≤ pn . We need to show that Huffman coding
returns an optimal prefix-free code for A. Let A be the alphabet with
n − 1 symbols obtained from A by replacing the two least frequent symbols
α1 , α2 with a single symbol α12 with assigned probability p1 + p2 . By the
induction hypothesis, the Huffman code will produce an optimal code for
A . Lemma 2.1 shows now that the Huffman code is optimal for A. 

2.8.3 An example
In this section, we look at a detailed example of text compression using
Huffman algorithm. Assume that we want to encode the following text
source:

i see eye in sky

The alphabet of the text is A = {i, s, e, y, n, k, ␣}, where the symbol ␣
represents the "space" between words. The symbol probabilities in the
text are as follows: i (1/8), s (1/8), e (1/4), y (1/8), n (1/16), k (1/16), and ␣ (1/4).
We start by arranging the characters according to a non-decreasing order
of their probabilities.

The two letters of lowest probability in the text are n and k. Create
a subtree with root labeled nk, to which we assign the probability
1/16 + 1/16 = 1/8, with n and k as leaves. As usual, the left branch is labeled
with a 0 and the right branch is labeled with a 1.

We now have a new alphabet with symbols nk, i, s, y, e and ␣ and probabilities
1/8, 1/8, 1/8, 1/8, 1/4 and 1/4, respectively. Pick two symbols of lowest
probability in the new alphabet, say nk and i, and create a new subtree
with root labeled nki having nk and i as children, with probability
1/8 + 1/8 = 1/4.

We get the new alphabet formed of the symbols nki, s, y, e and ␣ and probabilities
1/4, 1/8, 1/8, 1/4 and 1/4, respectively. Form a new subtree with root labeled
sy (probability 1/8 + 1/8 = 1/4) having s and y as children.

We are now left with an alphabet of four symbols, each occurring with
probability 1/4. Pick the symbols nki and sy to form the next subtree, with
node nkisy of probability 1/2.

Pick the symbols e and the space symbol to form the next subtree.

Finally, we merge the last two symbols nkisy and e␣ into the root of the
Huffman tree.

Following the tree from the root to the leaves, we get the Huffman code
of the text message:
n → 0000, k → 0001, i → 001, s → 010, y → 011, e → 10, ␣ → 11. (2.7)
The text i see eye in sky can now be encoded by concatenating the code-
words of the symbols as they reach the encoder:
001110101010111001110110010000110100001011, (2.8)
for a total of 42 bits. But how much did we really save? Well, a non-
compressed version of the text using standard ASCII code requires 8× 16 =
128 bits to store in a computer. A saving of almost 67%. Using the library
(2.7), decoding the message (2.8) is straightforward since the Huffman code
is prefix-free, hence uniquely decodable. Note that the average codeword
length of the code in (2.7) is given by:
$$l = 4\left(\tfrac{1}{16}\right) + 4\left(\tfrac{1}{16}\right) + 3\left(\tfrac{1}{8}\right) + 3\left(\tfrac{1}{8}\right) + 3\left(\tfrac{1}{8}\right) + 2\left(\tfrac{1}{4}\right) + 2\left(\tfrac{1}{4}\right) = \frac{21}{8}.$$

Note also that the entropy of the text message is:
$$H = -\sum_{i=1}^{7} p_i \log_2(p_i) = -\left[\tfrac{1}{16}\log_2\!\left(\tfrac{1}{16}\right) + \tfrac{1}{16}\log_2\!\left(\tfrac{1}{16}\right) + \tfrac{1}{8}\log_2\!\left(\tfrac{1}{8}\right) + \tfrac{1}{8}\log_2\!\left(\tfrac{1}{8}\right) + \tfrac{1}{8}\log_2\!\left(\tfrac{1}{8}\right) + \tfrac{1}{4}\log_2\!\left(\tfrac{1}{4}\right) + \tfrac{1}{4}\log_2\!\left(\tfrac{1}{4}\right)\right] = \frac{21}{8}.$$

The fact that the average codeword length of the Huffman code is equal to
the entropy of the source is expected in this example, since each alphabet
symbol appears with a probability equal to a negative power of 2.
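
The numbers above can be double-checked mechanically. The sketch below (again only illustrative) encodes the sentence with the code in (2.7) and recomputes both the average codeword length and the entropy.

```python
import math
from collections import Counter

text = "i see eye in sky"
code = {"n": "0000", "k": "0001", "i": "001", "s": "010",
        "y": "011", "e": "10", " ": "11"}          # the Huffman code (2.7)

encoded = "".join(code[ch] for ch in text)
counts = Counter(text)
total = len(text)

print(len(encoded))                                                      # 42 bits
print(sum(c / total * len(code[ch]) for ch, c in counts.items()))        # 2.625 = 21/8
print(-sum(c / total * math.log2(c / total) for c in counts.values()))   # 2.625 = 21/8
```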

2.9 Some remarks

We finish this chapter with some interesting remarks.


(1) Huffman codes are not unique. As seen in the above example, if three
or more symbols have the same probability at any iteration, then the
Huffman coding is not necessarily unique as it depends on the order
in which these symbols are merged. While the decisions on how to
merge equiprobable symbols may affect the individual codewords, they
certainly have no effect on the average codeword length of the code. All
Huffman codes will have the same (minimal) average codeword length.
(2) There are optimal codes which are not Huffman codes. Here is an exam-
ple. Consider the alphabet A = {α1 , . . . , α6 } with probability distri-
bution vector (0.08, 0.12, 0.15, 0.15, 0.24, 0.26). Applying the Huffman
algorithm described above, it is not difficult to see that the corresponding
Huffman code H for the source is:

α1 → 000, α2 → 001, α3 → 100, α4 → 101, α5 → 01, α6 → 11. (2.9)

Now, consider the following code C of the same alphabet:

α1 → 000, α2 → 100, α3 → 001, α4 → 101, α5 → 01, α6 → 11. (2.10)

Clearly, C is optimal since it has the same average codeword length


as the Huffman code H. On the other hand, C is not a Huffman code
on the alphabet A since the two least probable symbols (namely α1
and α2 ) do not have codewords that differ only in their last bit.
(3) The Huffman algorithm described above is called static because it as-
sumes that the alphabet’s probability distribution vector remains the
same throughout the encoding (as well as the decoding) process. A
large gap between an estimated source probability vector and the ac-
tual one can seriously deteriorate the efficiency of the Huffman code.
A more flexible coding system is provided by the Adaptive Huffman
coding where both the alphabet and the probabilities of its symbols are
dynamic in the sense that they are updated frequently as the symbols
enter the encoder.

2.10 References

Ida Mengyi Pu. (2006). Fundamental Data Compression. (Butterworth-Heinemann).

David Salomon. (2004). Data Compression: The Complete Reference, Third Edition. (Springer).
Chapter 3

The JPEG standard

3.1 Introduction

JPEG is an acronym for "Joint Photographic Experts Group", the committee
formed internationally in the 1980s to create, develop and support global
standards for compression of still (grayscale or colored) images. The com-
mittee was a result of a collaborative effort by three bodies: the Interna-
tional Telecommunication Union (ITU), the International Organization for
Standardization (ISO) and the International Electrotechnical Commission
(IEC). As noted on the official JPEG website (www.jpeg.org), people often
use the term “JPEG” to refer to a particular compression standard and its
implementation, not to the committee itself.

The JPEG standard defines four modes of operations in still image


compression. We give a very brief description of each of these modes.

• The Sequential lossy mode. The image is broken into blocks. Each
block is scanned once in a raster manner (left-to-right, top-to-bottom).
Some information is lost during the compression and the reconstructed
image is an approximation of the original one.
• The Progressive mode. Both compression and decompression of the
image are done in several scans. Each scan produces a better image
than the previous ones. The image is transferred starting with coarse
resolution (almost unrecognisable) to finer resolution. In applications
with long downloading time, the user will see the image building up in
multiple resolutions.
• The Hierarchical mode. The image is encoded at multiple resolu-
tions allowing applications to access a low resolution version without
the need to decompress the full resolution version of the image. You


have probably noticed sometimes that you do not get the same quality
when you print an image from the one displayed on a website since the
two operations (printing and displaying) require different resolutions.
• The Sequential lossless mode. The image is scanned once and en-
coded in a way that allows the exact recovery of every element of the
image after decompression. This results, of course, in a much longer
code stream than the ones obtained in the lossy modes.

The first three modes are called DCT-based modes since they all use the
Discrete Cosine Transform (DCT for short, see Section 3.2) as the main tool
to achieve compression. Each of the four modes has its own features and
parameters that allow a certain degree of flexibility in terms of compression-
to-quality ratio. The purpose of this chapter however is not to describe the
technicalities and properties of the above modes nor to discuss the hardware
implementation. We will be concerned only with the Sequential lossy mode
implemented by the Baseline JPEG standard which can be described as the
collection of “baseline routines” that must be included in every DCT-based
JPEG standard. The Baseline standard is by far the most popular JPEG
technique and it is well supported by almost all applications.

Although the Baseline standard applies to images with various color


components, we will restrict our discussion to grayscale images only for
simplicity. Once the technique is understood for grayscale images, it can
be extended to color images with not much difficulty.

3.1.1 Before you go further


Mathematical skills required to have a good understanding of this chapter
include basic matrix manipulations, basic linear algebra concepts like the
notion of linear independence and basis. Also, some good knowledge of
working with and simplifying trigonometric expressions is necessary.

3.2 The Discrete Cosine Transform (DCT)

The main ingredient in the JPEG Baseline compression recipe is a math-


ematical operation known as the Discrete Cosine Transform or DCT for
short. In a nutshell, the DCT is a transformation that takes a signal data
as an input and transforms it from one raw type of representation that
usually contains an excess of information to another more suitable for ap-

plications. For example, if you think of a still image as a two-dimensional


signal that is perceived by the human visual system, then the DCT can
be used to convert the signal (or the spatial information) into a numeric
data (“frequency” or “spectral” information) so that the image information
exists in a quantitative form that can be manipulated for compression.

In this section, n is a fixed positive integer.

3.2.1 The one-dimensional DCT


Note first that there is more than one transform known as a discrete cosine
transform in the literature. These transforms vary in minor details and
are usually referred to as DCT-I to DCT-IV. The most popular one is the
DCT-II known simply as the DCT. It is the transform used in the JPEG
baseline standard and it is the one we consider in this chapter.

Given a list α = (a0 , a1 , . . . , an−1 ) of n real numbers, the one-


dimensional discrete cosine transform of α is the list β = (b0 , b1 , . . . , bn−1 )
of n real numbers given by
$$b_j = \gamma_j \sqrt{\frac{2}{n}} \sum_{k=0}^{n-1} a_k \cos\left(\frac{(2k+1)j}{2n}\,\pi\right) \qquad (3.1)$$
with
$$\gamma_j = \begin{cases} \frac{1}{\sqrt{2}} & \text{if } j = 0,\\ 1 & \text{if } j > 0. \end{cases} \qquad (3.2)$$
Note in particular that $b_0 = \frac{a_0 + a_1 + \cdots + a_{n-1}}{\sqrt{n}}$ (since $\cos\left(\frac{j(2k+1)}{2n}\pi\right) = 1$ for
all k when j = 0), so $b_0$ is proportional to the mean value of the input list α. The coefficient $b_0$
is referred to as the DC coefficient, while the term AC coefficient is given
to any $b_j$ with j > 0.
This transformation has some interesting properties, but two properties


in particular make it a valuable tool in data processing. The first one is
its ability to concentrate most of the “energy” of a correlated sequence in
a few transformed coefficients, usually the first ones. If the list α consists
of correlated values, then very few coefficients in the transformed sequence
β are large in absolute value and the rest are very small (close to zero in
absolute value). The second important property of the DCT transformation

is the fact that it is reversible. Given the transformed DCT coefficients


β = (b0 , b1 , . . . , bn−1 ), then one can retrieve the original coefficients using
the inverse discrete cosine transform (IDCT for short) given by:
$$a_j = \sum_{k=0}^{n-1} \gamma_k \sqrt{\frac{2}{n}}\; b_k \cos\left(\frac{(2j+1)k}{2n}\,\pi\right) \qquad (3.3)$$
with $\gamma_k$ as in (3.2). To put things in perspective, we look at an example.
Consider the following (correlated) sequence of 8 terms
α = (155, 150, 165, 154, 160, 167, 158, 163) .
Applying the DCT to α leads to the following 8 coefficients (rounded to
two decimal places):

β = (449.72, −8.39, −2.74, 0.10, −2.83, −0.99, 11.85, 3.55). (3.4)

Half of the transformed coefficients in β are smaller than 3 in absolute


value. If we apply the IDCT to β, we get the following sequence
α′ = (155.0, 150.0, 164.9, 154.0, 160.0, 167.0, 158.0, 163.0),
which is almost the same as the original one, and this is what we expect
from the inverse transform. What is less expected is the following fact:
set 3 as a threshold point for the DCT coefficients in (3.4) in the sense
that every coefficient less than 3 in absolute value is rounded to zero, and
round the other coefficients to the nearest integer. We get the following
new transformed sequence
β′ = (450, −8, 0, 0, 0, 0, 12, 4).
When applying the IDCT again to β′, we obtain the sequence
α″ = (157.86, 149.11, 164.08, 154.06, 159.54, 165.20, 157.99, 164.92),
which is still relatively close to the original sequence. It is amazing how we
can still almost reconstruct the original sequence with half of the eight DCT
coefficients coarsely reduced to zero. It is precisely this kind of flexibility
that makes the DCT such a useful tool for data compression. But what
makes the DCT transformation behave this way? To answer this question,
we need to take a closer look at the operation of this transform. First
consider the n vectors
$$\begin{aligned}
d_j &= \gamma_j\sqrt{\frac{2}{n}}\left( \cos\!\left(j\left(0+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right),\; \cos\!\left(j\left(1+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right),\; \ldots,\; \cos\!\left(j\left(n-1+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right) \right) \\
&= \gamma_j\sqrt{\frac{2}{n}}\left( \cos\!\left(\tfrac{j\pi}{2n}\right),\; \cos\!\left(\tfrac{3j\pi}{2n}\right),\; \ldots,\; \cos\!\left(\tfrac{(2n-1)j\pi}{2n}\right) \right)
\end{aligned}$$

for j = 0, 1, . . . , n − 1. Then, the dot product of the vector $d_j$ with α (the
original list) is
$$\begin{aligned}
d_j \cdot \alpha &= \gamma_j\sqrt{\frac{2}{n}}\, a_0 \cos\!\left(j\left(0+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right) + \gamma_j\sqrt{\frac{2}{n}}\, a_1 \cos\!\left(j\left(1+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right) + \cdots + \gamma_j\sqrt{\frac{2}{n}}\, a_{n-1} \cos\!\left(j\left(n-1+\tfrac{1}{2}\right)\tfrac{\pi}{n}\right) \\
&= \gamma_j\sqrt{\frac{2}{n}} \sum_{k=0}^{n-1} a_k \cos\!\left(\frac{(2k+1)j}{2n}\,\pi\right) = b_j.
\end{aligned}$$
Therefore, the DCT coefficient $b_j$ is simply the dot product of the vector
$d_j$ with the original vector α. Next, consider the n cosine waves
$$w_j(x) = \gamma_j \sqrt{\frac{2}{n}} \cos(jx), \qquad j = 0, 1, \ldots, n-1,$$
with the frequency of wave $w_j$ being j. The vector $d_j$ is formed by evaluating
the wave $w_j$ at each of the following n angles
$$\frac{\pi}{2n},\; \frac{3\pi}{2n},\; \frac{5\pi}{2n},\; \ldots,\; \frac{(2n-1)\pi}{2n}.$$
For n = 8, the following table gives the values of each of the eight waves
$w_0, w_1, \ldots, w_7$ at each of the angles $\frac{\pi}{16}, \frac{3\pi}{16}, \frac{5\pi}{16}, \ldots, \frac{15\pi}{16}$.

x       π/16     3π/16    5π/16    7π/16    9π/16    11π/16   13π/16   15π/16
w0(x)   0.3536   0.3536   0.3536   0.3536   0.3536   0.3536   0.3536   0.3536
w1(x)   0.4904   0.4157   0.2778   0.0975  -0.0975  -0.2778  -0.4157  -0.4904
w2(x)   0.4619   0.1913  -0.1913  -0.4619  -0.4619  -0.1913   0.1913   0.4619
w3(x)   0.4157  -0.0975  -0.4904  -0.2778   0.2778   0.4904   0.0975  -0.4157
w4(x)   0.3536  -0.3536  -0.3536   0.3536   0.3536  -0.3536  -0.3536   0.3536
w5(x)   0.2778  -0.4904   0.0975   0.4157  -0.4157  -0.0975   0.4904  -0.2778
w6(x)   0.1913  -0.4619   0.4619  -0.1913  -0.1913   0.4619  -0.4619   0.1913
w7(x)   0.0975  -0.2778   0.4157  -0.4904   0.4904  -0.4157   0.2778  -0.0975

The entries of the jth row in the table are the components of the vector
dj constructed above for j = 0, 1, . . . , 7. The table shows that each vector
dj , with the exception of vector d0 , has four pairs of the form (−κ, κ) for
a certain coefficient κ. If the components of the list α are correlated, this
results in a dot product dj · α being relatively small in absolute value. Note
also that the frequency j of the cosine wave wj (x) increases as we go down
in the table. This means that the early DCT coefficients correspond to
low-frequency components of the sequence and these components usually
contain the important characteristics of the list. The table also reveals an-
other important feature of the vectors dj: the dot product di · dk is zero
for i ≠ k, which means that the vectors are orthogonal.

The above table provides us as well with an efficient way to compute


the transform vector β. Let D be the following 8 × 8 matrix:
$$D = \begin{bmatrix}
0.3536 & 0.3536 & 0.3536 & 0.3536 & 0.3536 & 0.3536 & 0.3536 & 0.3536 \\
0.4904 & 0.4157 & 0.2778 & 0.0975 & -0.0975 & -0.2778 & -0.4157 & -0.4904 \\
0.4619 & 0.1913 & -0.1913 & -0.4619 & -0.4619 & -0.1913 & 0.1913 & 0.4619 \\
0.4157 & -0.0975 & -0.4904 & -0.2778 & 0.2778 & 0.4904 & 0.0975 & -0.4157 \\
0.3536 & -0.3536 & -0.3536 & 0.3536 & 0.3536 & -0.3536 & -0.3536 & 0.3536 \\
0.2778 & -0.4904 & 0.0975 & 0.4157 & -0.4157 & -0.0975 & 0.4904 & -0.2778 \\
0.1913 & -0.4619 & 0.4619 & -0.1913 & -0.1913 & 0.4619 & -0.4619 & 0.1913 \\
0.0975 & -0.2778 & 0.4157 & -0.4904 & 0.4904 & -0.4157 & 0.2778 & -0.0975
\end{bmatrix} \qquad (3.5)$$

then $\beta = D\alpha^t$ (matrix multiplication, with $\alpha^t$ being the transpose of α).

In practical applications, the input sequence α is usually quite large.


In this case, the sequence is broken into small subsequences containing
n coefficients each. Each of the subsequences is treated separately and
independently of the other segments in the source. Most applications using
the DCT set 8 as the value of n.
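
As an illustration of the formulas of this section, the following Python sketch (assuming the NumPy library is available) builds the matrix D of (3.5) directly from its definition, applies it to the sample sequence α above, sets the small coefficients to zero and inverts. It reproduces the behaviour described in the example; the function name dct_matrix is of course our own.

```python
import numpy as np

def dct_matrix(n):
    """n x n DCT matrix: entry (j, k) is gamma_j * sqrt(2/n) * cos((2k+1) j pi / (2n))."""
    D = np.zeros((n, n))
    for j in range(n):
        gamma = 1 / np.sqrt(2) if j == 0 else 1.0
        for k in range(n):
            D[j, k] = gamma * np.sqrt(2 / n) * np.cos((2 * k + 1) * j * np.pi / (2 * n))
    return D

alpha = np.array([155, 150, 165, 154, 160, 167, 158, 163], dtype=float)
D = dct_matrix(8)

beta = D @ alpha                          # forward DCT, equation (3.1)
print(np.round(beta, 2))                  # 449.72, -8.39, -2.74, 0.10, -2.83, -0.99, 11.85, 3.55

beta_thresholded = np.where(np.abs(beta) < 3, 0, np.round(beta))   # coarse thresholding as in the text
alpha_back = D.T @ beta_thresholded       # inverse DCT, using the fact that D^{-1} = D^t
print(np.round(alpha_back, 2))            # still close to the original sequence
```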

3.2.2 The two-dimensional DCT


The one-dimensional DCT is used in practice in processing one-dimensional
data such as speech signals. For the processing of two-dimensional data such
as digital images, a two-dimensional version of the DCT is needed. In this
case, the input data is represented by a square n × n matrix A (as opposed
to a one-dimensional sequence α). The two-dimensional DCT transform
of A is defined as being the matrix obtained from A by applying first the
one-dimensional DCT to each column of A to get a matrix A1 , and then
applying the one-dimensional DCT to each row of A1 . As we did for the
one-dimensional DCT, we find a matrix expression for this transformation.
First note that the matrix D given in (3.5) (found for n = 8) is a special
case of the general matrix for arbitrary integer n known as the DCT matrix
that we denote also by D and which is displayed in Figure 3.1.
Given an n × n matrix A, the columns of DA are the one-dimensional
DCT transforms of the columns of A, and the rows of ADt are the one-
dimensional DCT transforms of the rows of A. So DADt is the transforma-
tion that applies the one-dimensional DCT first to the columns of A and
then to the rows of the resulting matrix. Therefore, the two-dimensional

$$D = \begin{bmatrix}
\sqrt{\tfrac{1}{n}} & \sqrt{\tfrac{1}{n}} & \cdots & \sqrt{\tfrac{1}{n}} \\[1ex]
\sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{\pi}{2n}\right) & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{3\pi}{2n}\right) & \cdots & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{(2n-1)\pi}{2n}\right) \\[1ex]
\sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{2\pi}{2n}\right) & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{6\pi}{2n}\right) & \cdots & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{2(2n-1)\pi}{2n}\right) \\[1ex]
\vdots & \vdots & \cdots & \vdots \\[1ex]
\sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{(n-1)\pi}{2n}\right) & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{3(n-1)\pi}{2n}\right) & \cdots & \sqrt{\tfrac{2}{n}}\cos\!\left(\tfrac{(n-1)(2n-1)\pi}{2n}\right)
\end{bmatrix}$$

Fig. 3.1 Discrete Cosine Transform matrix.

DCT transform of A can be defined as being the n × n matrix B given by:


$$B = D A D^t \qquad (3.6)$$
where D is the DCT matrix given in Figure 3.1. Relation (3.6) gives the
following definition of the two-dimensional DCT transform in terms of the
matrix coefficients. If $A = [a_{ij}]$, $0 \le i, j \le n-1$, is the input matrix, then
the two-dimensional DCT of A is the $n \times n$ matrix $B = [b_{ij}]$, $0 \le i, j \le n-1$,
with the entry $b_{ij}$ given by the expression
$$b_{ij} = \frac{2}{n}\, \gamma_i \gamma_j \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} a_{kl} \cos\!\left(\frac{(2k+1)i\pi}{2n}\right) \cos\!\left(\frac{(2l+1)j\pi}{2n}\right) \qquad (3.7)$$
where $\gamma_r$ is defined as in (3.2). Similar to the one-dimensional DCT, the
coefficient $b_{00}$ in (3.7) is referred to as the DC coefficient, and every other
coefficient $b_{ij}$ is called an AC coefficient. The DC coefficient is proportional to the mean
value of all the coefficients in the matrix.

Like the one-dimensional DCT, the two-dimensional version is invert-


ible. Given the matrix B = [bij ], 0 ≤ i, j ≤ n − 1, then the inverse DCT
transform of B is the matrix $A = [a_{ij}]$, $0 \le i, j \le n-1$, given by:
$$a_{ij} = \frac{2}{n} \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} \gamma_k \gamma_l\, b_{kl} \cos\!\left(\frac{(2i+1)k\pi}{2n}\right) \cos\!\left(\frac{(2j+1)l\pi}{2n}\right) \qquad (3.8)$$
where $\gamma_r$ is defined as in (3.2). From relation (3.6) above, we get that $A = D^{-1} B \,(D^t)^{-1}$,
assuming D is an invertible matrix. We will prove in Section
3.5.2 that the DCT matrix is indeed invertible and satisfies a key property
that makes it very useful in this application, namely $D^{-1} = D^t$. This shows
that $A = D^t B D$.
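
A direct two-dimensional implementation following (3.6) and the relation A = Dᵗ B D can be sketched as follows (NumPy assumed; this is an illustration of the formulas, not the code of any particular JPEG implementation).

```python
import numpy as np

def dct_matrix(n=8):
    """DCT matrix of Figure 3.1, built in vectorized form."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    D = np.sqrt(2 / n) * np.cos((2 * k + 1) * j * np.pi / (2 * n))
    D[0, :] /= np.sqrt(2)      # gamma_0 = 1/sqrt(2) on the first row
    return D

def dct2(A):
    """Two-dimensional DCT of an n x n block: B = D A D^t, equation (3.6)."""
    D = dct_matrix(A.shape[0])
    return D @ A @ D.T

def idct2(B):
    """Inverse two-dimensional DCT: A = D^t B D."""
    D = dct_matrix(B.shape[0])
    return D.T @ B @ D

# Round-trip check on a random block: before quantization, nothing is lost.
A = np.random.randint(0, 256, size=(8, 8)).astype(float)
print(np.allclose(idct2(dct2(A)), A))   # True
```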

3.3 DCT as a tool for image compression

We now come to the main application of this chapter. We explain how the
Baseline JPEG standard uses the two-dimensional DCT to compress and
decompress grayscale digital images. The procedure involves several steps
that we explain and illustrate using the image in Example 3.1 below in each
step.

3.3.1 Image pre-processing


At the encoder end, the source image is first divided into non-overlapping
blocks of 8 × 8 pixels (called data units) from left to right, top to bottom.
The size 8×8 for a data unit was chosen because at the time of development
of the JPEG standard, that size fit well with the maximum size allowed by
integrated circuit technology. Very often, the image is of size m × n pixels
with either m or n (or both) not a multiple of 8. If m is not a multiple of 8,
the bottom row is duplicated a number of times to get the nearest multiple
of 8. If n is not a multiple of 8, the same thing is done on the rightmost
column. For instance, for a 69 × 138 image, the last row is duplicated
three times and the rightmost column is duplicated six times. Each pixel
is represented by its grayscale value which indicates how bright that pixel
is and each grayscale value is stored in a digital form as one byte (8 bits).
Since 28 = 256, each grayscale value is an integer in the range from 0 to
255 with 0 representing Black and 255 representing White. For example, a
69×138 picture requires (with duplications of the last row and the rightmost
column) 72 × 144 = 10368 bytes to store before any compression is done.
Each 8 × 8 block of the digital image is represented by an 8 × 8 matrix with
integer entries in the range [0, 255].
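
The padding by duplication of the last row and of the rightmost column takes only a couple of lines; the sketch below (NumPy assumed, function name ours) reproduces the 69 × 138 → 72 × 144 example.

```python
import numpy as np

def pad_to_multiple_of_8(img):
    """Duplicate the last row and the rightmost column until both dimensions are multiples of 8."""
    m, n = img.shape
    return np.pad(img, ((0, (-m) % 8), (0, (-n) % 8)), mode="edge")

img = np.zeros((69, 138), dtype=np.uint8)   # a hypothetical 69 x 138 grayscale image
print(pad_to_multiple_of_8(img).shape)      # (72, 144)
```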

Example 3.1. Figure 3.2 is an 8 × 8 block taken somewhere from a digital


image. The block is represented by the matrix A, which is an 8×8 matrix of
the corresponding grayscale values. Note the (4, 1)-pixel (fourth row, first
column) which seems to be the brightest in the block. The corresponding
entry in A is the largest.

Fig. 3.2 A raster image and its matrix.

3.3.2 Level shifting


As the values of the cosine function are centered at zero (the midpoint of the
cosine range [−1, 1]), the DCT works more efficiently with values centered
at 0 (negative and positive) rather than just positive integers. Since the
midpoint of the range [0, 255] is 127.5, we shift the grayscale values of the
pixels from the interval [0, 255] to [−128, 127] by subtracting 128 from each
entry in the input matrix. For the image in Example 3.1 above, the matrix
of shifted values is
$$A_1 = \begin{bmatrix}
-8 & 27 & 34 & -62 & -48 & -69 & -64 & -55 \\
-50 & -41 & -73 & -38 & -18 & -43 & -59 & -56 \\
22 & -60 & -49 & -15 & 22 & -18 & -63 & -55 \\
52 & 30 & -57 & -6 & 26 & -22 & -58 & -59 \\
-60 & -67 & -61 & -18 & -1 & -40 & -60 & -58 \\
-50 & -63 & -69 & -58 & -51 & -60 & -70 & -53 \\
-39 & -57 & -64 & -69 & -73 & -60 & -63 & -45 \\
-39 & -49 & -58 & -60 & -63 & -52 & -50 & -34
\end{bmatrix}.$$

3.3.3 Applying the DCT


After level-shifting, the two-dimensional DCT is applied independently to
the shifted matrix A1 of each block in the image. This is accomplished by
computing B = DA1 Dt where D is the DCT matrix with n = 8 given in
(3.5). For the block in Example 3.1 above, the matrix B of DCT coefficients
is the following.
$$B = \begin{bmatrix}
-330.8750 & 65.1776 & -9.3885 & 46.0853 & 51.1250 & -30.8004 & -2.7408 & 4.0057 \\
75.9481 & 58.8311 & -25.6708 & 8.7057 & -2.2776 & -20.4903 & -6.1304 & 18.0618 \\
-41.8828 & 3.8547 & 52.8126 & -60.6199 & -58.4390 & 0.5139 & 17.5914 & 12.1807 \\
-51.0028 & 1.0753 & 5.0003 & -54.8967 & -40.6112 & -6.6073 & 3.4048 & 11.2796 \\
53.6250 & 41.9546 & 4.5161 & -16.0604 & -20.3750 & -14.3393 & -12.2887 & 5.2865 \\
48.4935 & 67.3990 & 39.1726 & 6.2506 & -9.2437 & -6.9427 & 6.7856 & 6.1368 \\
12.8836 & 14.1137 & 3.0914 & -3.2691 & 2.9643 & 7.5347 & 22.9374 & 16.8211 \\
-12.5815 & -23.0803 & -23.6630 & -11.5713 & 4.2610 & 17.0271 & 24.8242 & 16.5083
\end{bmatrix} \qquad (3.9)$$

Instead of the 64 grayscale values, we now have an array of different spa-


tial frequencies in the above block. It is probably worth mentioning at this
point that if we apply the DCT transform directly to the original matrix A
before the level shifting, we obtain the same matrix B with the exception
of the DC coefficient (which would be 693.1250 instead of −330.8750, since
subtracting 128 from all 64 entries lowers the DC coefficient by (64 × 128)/8 = 1024).
This has nothing to do with this particular example. It can be proven in general
that shifting the entries of the matrix A by a constant value affects
only the DC coefficient when the DCT is applied.

3.3.4 Quantization
So far no compression was made in the previous steps. Remember, the DCT
is a completely invertible transformation. From the matrix B of DCT coef-
ficients obtained in the previous step, we can recover the original matrix A
of the unit data by computing Dt BD. Quantization is the step in the JPEG
standard where the magic of compressing takes place. The information lost
in this step is (in general) lost beyond recovery, and this is basically why
we call the JPEG standard a "lossy" one. At this step, mathematics exploits
the human eye's perception of, and tolerance for, distortion to decide what
level of loss is deemed acceptable and come up with a reasonable compression scheme.
Experiments show that the human eye is more sensitive to the low frequency
components of an image than to the high frequency ones. The quantization
step enables us to discard many of the high frequency coefficients, as their
presence has little effect on the perception of the image as a whole. After
the DCT is applied to the 8 × 8 block, each of the 64 DCT coefficients of
the block is first divided by the corresponding entry in the matrix Q = [qij ]
below and then the result is rounded to the nearest integer:
$$Q = \begin{bmatrix}
16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\
12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\
14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\
14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\
18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\
24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\
49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\
72 & 92 & 95 & 98 & 112 & 100 & 103 & 99
\end{bmatrix}. \qquad (3.10)$$
Designed based on human tolerance of visual effects, the matrix Q is
called the luminance quantization matrix. It is not the only quantization
matrix defined by the JPEG standard, but it is the one commonly used in
applications. Note how the entries in the matrix Q increase almost in every
row and every column as you move from left to right and top to bottom.
This is designed to ensure aggressive quantization of coefficients with higher
frequencies (the bigger the number we divide by, the closer the result is
to zero). The quantization step has another important role to play in the
JPEG standard. Notice how the coefficients of the DCT matrix B in (3.9)
are real numbers rounded to four decimal places. When divided by the
entries of Q, these DCT coefficients remain real-valued numbers, and if we
did not round them to integer values, the encoding process
of the quantized coefficients (see Section 3.3.5) would not be possible. The
matrix B of DCT coefficients is transformed after quantization into the
matrix $C = [c_{ij}]$ where $c_{ij} = \mathrm{Round}\!\left(\frac{b_{ij}}{q_{ij}}\right)$. For the block in Example 3.1,
the matrix C is the following.
$$C = \begin{bmatrix}
-21 & 6 & -1 & 3 & 2 & -1 & 0 & 0 \\
6 & 5 & -2 & 0 & 0 & 0 & 0 & 0 \\
-3 & 0 & 3 & -3 & -1 & 0 & 0 & 0 \\
-4 & 0 & 0 & -2 & -1 & 0 & 0 & 0 \\
3 & 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}. \qquad (3.11)$$
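
Putting the steps of Sections 3.3.2 to 3.3.4 together, a minimal sketch of the per-block transform stage and of its approximate inverse could look as follows. It is an illustration of the formulas above (NumPy assumed), not production JPEG code; in particular the entropy coding of Section 3.3.5 is not included.

```python
import numpy as np

def dct_matrix(n=8):
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    D = np.sqrt(2 / n) * np.cos((2 * k + 1) * j * np.pi / (2 * n))
    D[0, :] /= np.sqrt(2)
    return D

# Luminance quantization matrix Q of (3.10).
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])

def quantize_block(block):
    """Level shift (Section 3.3.2), 2D DCT (3.6) and quantization c_ij = Round(b_ij / q_ij)."""
    D = dct_matrix(8)
    B = D @ (block.astype(float) - 128) @ D.T      # shift [0,255] to [-128,127], then DCT
    return np.rint(B / Q).astype(int)              # the quantized matrix C, mostly zeros

def reconstruct_block(C):
    """Decoder side: dequantize, inverse DCT, undo the level shift and clip to [0, 255]."""
    D = dct_matrix(8)
    A = D.T @ (C * Q) @ D + 128
    return np.clip(np.rint(A), 0, 255).astype(int)
```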

3.3.5 Encoding
Now that we have the quantized matrix C = [cij ] from the last step, the
next challenge is to encode the coefficients in a way to minimize the storage

space. As it can be seen from (3.11), the quantization step is designed to


create a matrix C = [cij ] with mostly zero coefficients cij for large values
of i and j (in the lower right section of the matrix). To take advantage of
the abundance of zeros in the matrix, the encoder starts by rearranging the
quantized coefficients in one stream of 64 coefficients in a way to produce
long runs of zeros. The idea is to assign a single code for a run of zeros rather
than coding each zero individually. This can be best achieved by collecting
the coefficients in a “zigzag” fashion starting with the DC coefficient in
the upper left corner and ending with a sequence of trailing zeros with few
short sequences of AC coefficients in between. The zigzag mode of collecting
coefficients is illustrated in the following image:

For the quantized matrix of Example 3.1, the zigzag mode yields the
following string:
−21, 6, 6, −3, 5, −1, 3, −2, 0, −4, 3, 0, 3, 0, 2, −1, 0,
−3, 0, 2, 2, 0, 2, 0, −2, −1, 0, 0, 0, 0, 0, −1, 0, 1, EOB (3.12)
where the special symbol EOB (End Of Block) is introduced to indicate
that all remaining coefficients after the last "1" are zeros till the end of
the sequence. In our example, the last run consists of 30 consecutive zeros.
As we will see later, the DC coefficient is encoded using a technique called
differential encoding while the 63 AC coefficients are encoded using a run-
length encoding technique. Huffman coding (see Chapter 2) is then used
to encode both.
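
The zigzag ordering itself is easy to generate programmatically. The sketch below (illustrative, with names of our choosing) lists the (row, column) positions in zigzag order and flattens a quantized block; applied to the matrix C in (3.11), it reproduces the sequence (3.12) up to the trailing zeros that the EOB symbol summarizes.

```python
def zigzag_indices(n=8):
    """(row, col) pairs of an n x n block in zigzag order, starting at the DC position (0, 0)."""
    order = []
    for s in range(2 * n - 1):                       # anti-diagonal i + j = s
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                           # even diagonals run bottom-left to top-right
        order.extend(diag)
    return order

def zigzag_scan(C):
    """Flatten a quantized 8 x 8 block into the 64-coefficient zigzag sequence."""
    return [C[i][j] for i, j in zigzag_indices(len(C))]
```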

3.3.5.1 Encoding the DC coefficients


In a continuous tone picture, adjacent data units (8 × 8 pixel blocks) are
usually closely correlated. Since the DC coefficient is proportional to the

average of pixel intensities in the block, it is natural to assume that adjacent


blocks in the image have fairly close DC coefficients. Once the DC coeffi-
cient DC0 of the first block is encoded, it would make more sense to encode
the difference DC1 − DC0 rather than encoding the DC coefficient DC1 it-
self since the difference is small and would require fewer bits to encode. In
general, the JPEG standard encodes the DC differences di = DCi − DCi−1
between DC coefficients of adjacent blocks rather than the usually large
DC coefficients. For example, if the first four 8 × 8 blocks have respective
(quantized) DC coefficients −21, −22, −18 and −18, then the JPEG stan-
dard encodes −21 in the first block, −1 in the second, 4 in the third and
0 in the fourth block (each of the codes is followed by the 63 codewords of
the AC coefficients of the corresponding block). This technique is known
as differential encoding.

For an image using 8 bits per pixel (which is what we assume in all that
follows), it can be shown that both the AC and the DC coefficients fall in the
range [−1023, 1023] and the DC differences fall in the range [−2047, 2047].
The encoding of a DC difference coefficient d is done as follows:

(1) First find the minimal number γ of bits required to write |d| (the ab-
solute value of d) in binary form. For example, if d = −7, then γ = 3
since the binary form of |−7| is 111. The minimal number of bits re-
quired to represent |d| in binary form is referred to as the CATEGORY
of d.
(2) Figure 3.3 shows the table used to encode the CATEGORY γ. This
code is referred to as the Variable-Length Code of d or VLC for short.
For example, the VLC for d = −7 is 100 (since d = −7 belongs to
CATEGORY 3).
(3) The VLC code for d found in the previous step is the first layer in
the encoding of d. The second layer is referred to as the Variable-
Length Integer or VLI for short that we defined as follows. If d ≥ 0,
then VLI(d) consists of taking the γ least significant bits of the 8-
bit binary representation of d (γ being, as above, the CATEGORY of
d). If d < 0, then VLI(d) consists of taking the γ least significant
bits of the 8-bit two’s complement representation of d − 1 (see the
first chapter on Calculator). Recall that the 8-bit two’s complement
representation of a negative integer β consists of writing the 8-bit binary
representation of |β|, invert the digits (0 becomes 1, 1 becomes 0) and
then add 1 to the answer. For example, if β = −8 then the 8-bit

binary representation of |β| is 00001000 and its 8-bit two’s complement


representation is 11110111 + 1 = 11111000. Since d = −7 belongs to
CATEGORY 3, its VLI coding consists of taking the 3 least significant
digits of the 8-bit two’s complement representation of −7 − 1 = −8
which is 11111000. So VLI(−7) = 000.
(4) The final code of the DC difference d is VLC(d)VLI(d) (the concate-
nation of the Variable-Length Code and the Variable-Length Integer).
For example, the codeword for d = −7 is 100000.

Category | DC_i − DC_{i−1}                | Code
---------|--------------------------------|----------
0        | 0                              | 00
1        | −1, 1                          | 010
2        | −3, −2, 2, 3                   | 011
3        | −7 … −4, 4 … 7                 | 100
4        | −15 … −8, 8 … 15               | 101
5        | −31 … −16, 16 … 31             | 110
6        | −63 … −32, 32 … 63             | 1110
7        | −127 … −64, 64 … 127           | 11110
8        | −255 … −128, 128 … 255         | 111110
9        | −511 … −256, 256 … 511         | 1111110
10       | −1023 … −512, 512 … 1023       | 11111110
11       | −2047 … −1024, 1024 … 2047     | 111111110

Fig. 3.3 Table for coding DC coefficients.

Assuming the figure given in Example 3.1 is the first block in a digital
image, we encode its DC coefficient −21. The binary form of 21 is 10101,
so −21 is Category 5. From the table in Figure 3.3, the VLC code of −21
is 110. The 8-bit two’s complement of −21 − 1 = −22 is 11101010 and the
5 least significant bits of 11101010 are 01010. This is the VLI code of −21.
We conclude that the encoding of −21 is 11001010.
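
The CATEGORY/VLC/VLI recipe above fits in a few lines of code. The following sketch (with the VLC prefixes copied from Figure 3.3) reproduces the codeword 11001010 obtained for −21, as well as the codeword 100000 of the worked example d = −7.

```python
def category(d):
    """CATEGORY of d: minimal number of bits needed to write |d| in binary (0 when d = 0)."""
    return abs(d).bit_length()

# VLC prefixes for the DC differences, categories 0 to 11 (Figure 3.3).
DC_VLC = ["00", "010", "011", "100", "101", "110", "1110", "11110",
          "111110", "1111110", "11111110", "111111110"]

def vli(d):
    """Variable-Length Integer of d, written on CATEGORY(d) bits (empty when d = 0)."""
    g = category(d)
    if g == 0:
        return ""
    if d > 0:
        return format(d, "b")                              # the g significant bits of d
    # d < 0: the g least significant bits of the two's complement representation of d - 1
    return format((d - 1) & ((1 << g) - 1), "0{}b".format(g))

def encode_dc_difference(d):
    """Concatenation of the Variable-Length Code and the Variable-Length Integer of d."""
    return DC_VLC[category(d)] + vli(d)

print(encode_dc_difference(-21))   # 11001010
print(encode_dc_difference(-7))    # 100000
```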

3.3.5.2 Encoding the AC coefficients


The zigzag mode used to collect the quantized DCT coefficients is now put
to work to encode the AC coefficients. Runs of zeros are efficiently com-
pacted using a technique known as run length encoding (RLE for short).
This technique is used to shorten a sequence containing runs of a repeated

character by recording how many times the character appears in the run
instead of actually listing it every single time. Huffman coding (see Chap-
ter 2) of pairs of integers is combined with the RLE technique to produce
a compressed binary sequence representing the AC coefficients in the 8 × 8
block. The sequence of the 63 AC coefficients is first shortened to a sequence
of pairs and special symbols as follows: if α is a non-zero AC coefficient in
the zigzag sequence, then α is replaced by the “object” (r, m)(α) where r is
the length of the zero run immediately preceding α (that is the number of
consecutive zeros preceding α), m is the CATEGORY of α from the above
table. The maximum length of a run of zeros allowed in JPEG standard
is 16. The symbol (15, 0) is used to indicate a run of 16 zeros (one zero
preceded by 15 other zeros). If a run has 17 zeros or more, it is divided into
subruns of length 16 or less each. This means that r ranges between 0 and
15 and as a consequence it requires a 4-bit binary representation. In the
intermediate sequence, the EOB is represented with the special pair (0, 0).
Let us illustrate using the AC coefficients in sequence (3.12) above result-
ing from the image in Example 3.1. The intermediate sequence for (3.12)
is (0, 3)(6); (0, 3)(6); (0, 2)(−3); (0, 3)(5); (0, 1)(−1); (0, 2)(3); (0, 2)(−2);
(1, 3)(−4); (0, 2)(3); (1, 2)(3); (1, 2)(2); (0, 1)(−1); (1, 2)(−3); (1, 2)(2);
(0, 2)(2); (1, 2)(2); (1, 2)(−2); (0, 1)(−1); (5, 1)(−1); (1, 1)(1); (0, 0). We
explain how some of the terms in this sequence are formed. The first non-zero
AC coefficient is 6 with no preceding 0’s. Since 6 belongs to CATEGORY
3, the first term in the intermediate sequence is (0, 3)(6). The last AC
coefficient −1 appearing in the zigzag sequence is preceded by a run of 5
zeros and since −1 belongs to CATEGORY 1, the corresponding entry in
the intermediate sequence is (5, 1)(−1).

Once the intermediate sequence is formed, each entry (r, m)(α) is en-
coded using the following steps.

(1) The pair (r, m) is encoded using Huffman codes provided by the stan-
dard tables in Section 3.3.5.3.
(2) The non-zero AC coefficient α is encoded using VLI codes as in the
encoding of the DC difference coefficients above.
(3) The final codeword for (r, m)(α) is just the concatenation of codes of
(r, m) and α.

The coding of the intermediate sequence of Example 3.1 is 100110; 100110;


0100; 100101; 000; 0111; 0101; 1111001011; 0111; 1101111; 1101110; 000;

1101100; 1101110; 0110; 1101110; 1101101; 000; 11110100; 11001; 1010.

Adding the code for the DC coefficient found earlier, the block image
of Example 3.1 is encoded as:

11001010 100110 100110 0100 100101 000 0111 0101 1111001011 0111
1101111 1101110 000 1101100 1101110 0110 1101110 1101101 000 11110100
11001 1010.

Notice that the size of this new sequence is 124 bits. A saving of about 75%
from the raw size of 512 (8 × 64 = 512) bits if no compression was done on
the block.
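
The construction of the intermediate sequence from the zigzag AC coefficients can likewise be sketched in a few lines. The version below is only illustrative; applied to the 63 AC coefficients of Example 3.1, it reproduces the (run, category)(value) objects listed above.

```python
def category(x):
    """Same CATEGORY function as for the DC differences."""
    return abs(x).bit_length()

def intermediate_sequence(ac):
    """Turn the 63 zigzag AC coefficients into (run, category) pairs with their values.

    A run of 16 zeros is emitted as the special pair (15, 0), and the trailing
    zeros of the block are summarized by the end-of-block pair (0, 0).
    """
    last = max((i for i, x in enumerate(ac) if x != 0), default=-1)   # last non-zero coefficient
    out, run = [], 0
    for x in ac[:last + 1]:
        if x == 0:
            run += 1
            if run == 16:
                out.append(((15, 0), None))       # ZRL: sixteen consecutive zeros
                run = 0
        else:
            out.append(((run, category(x)), x))
            run = 0
    out.append(((0, 0), None))                    # EOB
    return out

# The 63 AC coefficients of Example 3.1 in zigzag order (33 values followed by 30 zeros).
ac = [6, 6, -3, 5, -1, 3, -2, 0, -4, 3, 0, 3, 0, 2, -1, 0, -3,
      0, 2, 2, 0, 2, 0, -2, -1, 0, 0, 0, 0, 0, -1, 0, 1] + [0] * 30
print(intermediate_sequence(ac))   # ((0, 3), 6), ((0, 3), 6), ((0, 2), -3), ...
```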

3.3.5.3 AC coding tables


The tables shown in this section are the recommended Huffman codes for
AC (luminance) coefficients of the JPEG Baseline standard. They are based
on statistics from experiments on a large number of images to classify “cat-
egories” of pixels values in terms of frequency of their occurrences in various
images. As we saw in the chapter on Huffman coding, the idea is to assign
shorter codewords for more frequent categories and longer codewords for
less frequent ones. Note also that Huffman codes are uniquely decodable
which leaves no room for ambiguity when the decompression process starts
(see Section 3.4 below).

(Run, Category) Codeword Length


(Run, Category) Codeword Length
(0, 0) 1010 4
(1, 1) 1100 4
(0, 1) 00 2
(1, 2) 11011 5
(0, 2) 01 2
(1, 3) 1111001 7
(0, 3) 100 3
(1, 4) 111110110 9
(0, 4) 1011 4
(1, 5) 11111110110 11
(0, 5) 11010 5
(1, 6) 1111111110000100 16
(0, 6) 1111000 7
(1, 7) 1111111110000101 16
(0, 7) 11111000 8
(1, 8) 1111111110000110 16
(0, 8) 1111110110 10
(1, 9) 1111111110000111 16
(0, 9) 1111111110000010 16
(1, A) 1111111110001000 16
(0, A) 1111111110000011 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(2, 1) 11100 5 (3, 1) 111010 6
(2, 2) 11111001 8 (3, 2) 111110111 9
(2, 3) 1111110111 10 (3, 3) 111111110101 12
(2, 4) 111111110100 12 (3, 4) 1111111110001111 16
(2, 5) 1111111110001001 16 (3, 5) 1111111110010000 16
(2, 6) 1111111110001010 16 (3, 6) 1111111110010001 16
(2, 7) 1111111110001011 16 (3, 7) 1111111110010010 16
(2, 8) 1111111110001100 16 (3, 8) 1111111110010011 16
(2, 9) 1111111110001101 16 (3, 9) 1111111110010100 16
(2, A) 1111111110001110 16 (3, A) 1111111110010101 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(4, 1) 111011 6 (5, 1) 1111010 7
(4, 2) 1111111000 10 (5, 2) 11111110111 11
(4, 3) 1111111110010110 16 (5, 3) 1111111110011110 16
(4, 4) 1111111110010111 16 (5, 4) 1111111110011111 16
(4, 5) 1111111110011000 16 (5, 5) 1111111110100000 16
(4, 6) 1111111110011001 16 (5, 6) 1111111110100001 16
(4, 7) 1111111110011010 16 (5, 7) 1111111110100010 16
(4, 8) 1111111110011011 16 (5, 8) 1111111110100011 16
(4, 9) 1111111110011100 16 (5, 9) 1111111110100100 16
(4, A) 1111111110011101 16 (5, A) 1111111110100101 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(6, 1) 1111011 7 (7, 1) 11111010 8
(6, 2) 111111110110 12 (7, 2) 111111110111 12
(6, 3) 1111111110100110 16 (7, 3) 1111111110101110 16
(6, 4) 1111111110100111 16 (7, 4) 1111111110101111 16
(6, 5) 1111111110101000 16 (7, 5) 1111111110110000 16
(6, 6) 1111111110101001 16 (7, 6) 1111111110110001 16
(6, 7) 1111111110101010 16 (7, 7) 1111111110110010 16
(6, 8) 1111111110101011 16 (7, 8) 1111111110110011 16
(6, 9) 1111111110101100 16 (7, 9) 1111111110110100 16
(6, A) 1111111110101101 16 (7, A) 1111111110110101 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(8, 1) 111111000 9 (9, 1) 111111001 9
(8, 2) 111111111000000 15 (9, 2) 1111111110111110 16
(8, 3) 1111111110110110 16 (9, 3) 1111111110111111 16
(8, 4) 1111111110110111 16 (9, 4) 1111111111000000 16
(8, 5) 1111111110111000 16 (9, 5) 1111111111000001 16
(8, 6) 1111111110111001 16 (9, 6) 1111111111000010 16
(8, 7) 1111111110111010 16 (9, 7) 1111111111000011 16
(8, 8) 1111111110111011 16 (9, 8) 1111111111000100 16
(8, 9) 1111111110111100 16 (9, 9) 1111111111000101 16
(8, A) 1111111110111101 16 (9, A) 1111111111000110 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(A, 1) 111111010 9 (B, 1) 1111111001 10
(A, 2) 1111111111000111 16 (B, 2) 1111111111010000 16
(A, 3) 1111111111001000 16 (B, 3) 1111111111010001 16
(A, 4) 1111111111001001 16 (B, 4) 1111111111010010 16
(A, 5) 1111111111001010 16 (B, 5) 1111111111010011 16
(A, 6) 1111111111001011 16 (B, 6) 1111111111010100 16
(A, 7) 1111111111001100 16 (B, 7) 1111111111010101 16
(A, 8) 1111111111001101 16 (B, 8) 1111111111010110 16
(A, 9) 1111111111001110 16 (B, 9) 1111111111010111 16
(A, A) 1111111111001111 16 (B, A) 1111111111011000 16

(Run, Category) Codeword Length (Run, Category) Codeword Length


(C, 1) 1111111010 10 (D, 1) 11111111000 11
(C, 2) 1111111111011001 16 (D, 2) 1111111111100010 16
(C, 3) 1111111111011010 16 (D, 3) 1111111111100011 16
(C, 4) 1111111111011011 16 (D, 4) 1111111111100100 16
(C, 5) 1111111111011100 16 (D, 5) 1111111111100101 16
(C, 6) 1111111111011101 16 (D, 6) 1111111111100110 16
(C, 7) 1111111111011110 16 (D, 7) 1111111111100111 16
(C, 8) 1111111111011111 16 (D, 8) 1111111111101000 16
(C, 9) 1111111111100000 16 (D, 9) 1111111111101001 16
(C, A) 1111111111100001 16 (D, A) 1111111111101010 16

(Run, Category) Codeword Length (Run, Category) Codeword Length

(E, 1) 1111111111101011 16 (F, 0)(ZRL) 11111111001 11
(E, 2) 1111111111101100 16 (F, 1) 1111111111110101 16
(E, 3) 1111111111101101 16 (F, 2) 1111111111110110 16
(E, 4) 1111111111101110 16 (F, 3) 1111111111110111 16
(E, 5) 1111111111101111 16 (F, 4) 1111111111111000 16
(E, 6) 1111111111110000 16 (F, 5) 1111111111111001 16
(E, 7) 1111111111110001 16 (F, 6) 1111111111111010 16
(E, 8) 1111111111110010 16 (F, 7) 1111111111111011 16
(E, 9) 1111111111110011 16 (F, 8) 1111111111111100 16
(E, A) 1111111111110100 16 (F, 9) 1111111111111101 16
(F, A) 1111111111111110 16

3.4 JPEG decompression

The decompression aims to reconstruct the image from the compressed


stream. This is achieved by reversing the above steps in the compression
process. The decompression is done independently on each block and at
the end of the process, all the decompressed blocks are merged together to
form the reconstructed image.

(1) When the compressed binary stream of the block enters the decoder
gate, it is read bit by bit. Using the Huffman encoding tables given
in the previous section, the decoder reconstructs the intermediate se-
quence of objects (r, m)(α).
(2) From this intermediate sequence, the quantized DC coefficient, all the
63 quantized AC coefficients and all the run lengths can be recon-
structed in the same zigzag ordering as in the encoding step above.
Recall that the first part of the compressed stream represents the (quan-
tized) DC difference coefficient di = DCi − DCi−1 . The quantized DC
coefficient of block i is reconstructed as DCi = DCi−1 + di for i ≥ 1
(assuming the DC coefficient DCi−1 of block i − 1 was obtained at the previous step).
(3) The sequence obtained in the previous step is "de-zigzagged" to recon-
struct the 8 × 8 matrix of the quantized DCT coefficients of the block.
For the block in Example 3.1, this step will reproduce the matrix (3.11)
above.
(4) The 8 × 8 matrix of the quantized DCT coefficients is “dequantized”
by multiplying each of the entries with the corresponding entry of the
quantization matrix Q given in (3.10) above. For the block in Example
3.1, this step will produce the following matrix.
 
         −336    66   −10    48    48   −40     0     0
           72    60   −28     0     0     0     0     0
          −42     0    48   −72   −40     0     0     0
    S =   −56     0     0   −58   −51     0     0     0
           54    44     0     0     0     0     0     0
           48    70    55     0     0     0     0     0
            0     0     0     0     0     0     0     0
            0     0     0     0     0     0     0     0

Notice that the matrix S is close to the original matrix B of DCT coefficients given in (3.9) above, but not exactly the same. This is due to the fact that the entries of B were divided by the quantization steps and rounded to the nearest integers when the quantization was performed.
(5) Now apply the two-dimensional inverse DCT given in (3.8) to the ma-
trix S to get the matrix B1 .

 
          −9.5363   26.1090   13.8352  −36.5921  −64.4003  −71.7877  −68.6127  −54.1596
         −58.6093  −53.3551  −52.3111  −35.6142  −18.6246  −39.9788  −64.1639  −56.5378
          19.1002  −33.7537  −63.8441  −22.2048   16.3370  −20.4071  −62.0797  −52.9300
    B1 =  61.4794   −0.7968  −43.2273  −10.3405   23.8693  −16.4274  −63.1244  −58.2510
         −44.8950  −54.4807  −59.1986  −29.7012    0.2449  −25.1636  −64.9681  −69.2678
         −79.9202  −66.0041  −69.1116  −67.7725  −51.6672  −55.3606  −66.3567  −58.5642
         −30.3389  −34.5392  −61.1915  −78.2150  −71.1730  −70.3833  −63.1596  −37.2653
         −41.7867  −52.7451  −69.2971  −61.0331  −42.9813  −53.3216  −56.7931  −30.6488

(6) Round the entries of B1 to the nearest integer:

 
         −10    26    14   −37   −64   −72   −69   −54
         −59   −53   −52   −36   −19   −40   −64   −57
          19   −34   −64   −22    16   −20   −62   −53
    B2 =  61    −1   −43   −10    24   −16   −63   −58
         −45   −54   −59   −30     0   −25   −65   −69
         −80   −66   −69   −68   −52   −55   −66   −59
         −30   −35   −61   −78   −71   −70   −63   −37
         −42   −53   −69   −61   −43   −53   −57   −31

(7) Add 128 to each entry of B2 :


 
         118   154   142    91    64    56    59    74
          69    75    76    92   109    88    64    71
         147    94    64   106   144   108    66    75
    A′ = 189   127    85   118   152   112    65    70
          83    74    69    98   128   103    63    59
          48    62    59    60    76    73    62    69
          98    93    67    50    57    58    65    91
          86    75    59    67    85    75    71    97

(8) Matrix A′ represents the grayscale values of the reconstructed 8 × 8 block.

The original and the reconstructed images are shown below.

[Figure: original image (left) and reconstructed image (right).]

Not bad, considering that the compressed image takes only about 25%
of the storage space taken by the raw image.
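
For readers who would like to experiment, the short NumPy sketch below mirrors steps (4) to (7) above for a single block: dequantization, the two-dimensional inverse DCT A = Dt S D, rounding and the shift by 128. It is only an illustration of these steps, not a full JPEG decoder; fed with the quantized matrix (3.11) and the quantization matrix Q of (3.10), it should reproduce the matrix A′ above.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """The orthogonal DCT matrix D of Figure 3.1: row 0 holds sqrt(1/n),
    row i > 0 holds sqrt(2/n) * cos(i*(2k+1)*pi/(2n)) for k = 0, ..., n-1."""
    D = np.zeros((n, n))
    for i in range(n):
        for k in range(n):
            scale = np.sqrt(1.0 / n) if i == 0 else np.sqrt(2.0 / n)
            D[i, k] = scale * np.cos(i * (2 * k + 1) * np.pi / (2 * n))
    return D

def decompress_block(quantized: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Steps (4)-(7): dequantize, apply the inverse DCT, round, add 128."""
    D = dct_matrix(quantized.shape[0])
    S = quantized * Q                       # step (4): dequantization
    B1 = D.T @ S @ D                        # step (5): inverse DCT, A = D^t S D
    return np.rint(B1).astype(int) + 128    # steps (6)-(7): round and shift
```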

3.5 The mathematics of DCT

In this section, we dig deeper into the mathematics that make the DCT
such an important tool for image compression. As the reader will soon
discover, all the magic of JPEG compression happens by mixing together
some basic properties of linear algebra with some trigonometric identities.

3.5.1 Two-dimensional DCT as a linear transformation


From the perspective of relation (3.6) above, the two-dimensional DCT can
be seen as a transformation:
    φ : Mn → Mn,   A ↦ DADt    (3.13)

where Mn is the vector space of all n × n real matrices and D is the DCT matrix of Figure 3.1 on page 71. The transformation φ is clearly linear:

    φ(αA1 + βA2) = D(αA1 + βA2)Dt = αDA1Dt + βDA2Dt = αφ(A1) + βφ(A2)

for any A1 , A2 ∈ Mn and α, β ∈ R.

The key property of the DCT matrix that makes it attractive to real life
applications is the fact that it is an orthogonal matrix. Before we proceed
to the definition of orthogonality, let us quickly review some notions from
linear algebra.

Consider a set of nonzero vectors Σ = {v1 , . . . , vs } of Rr (where r is a


positive integer).

• Σ is called linearly independent if the only way to have an equation of


the form

a1 v1 + a2 v2 + · · · + as vs = 0

with ak ∈ R for all k is that a1 = a2 = · · · = as = 0.


• Σ is called a spanning set of Rr if every vector v ∈ Rr can be written
as a linear combination of the vectors in Σ:

v = a1 v1 + a2 v2 + · · · + as vs , ak ∈ R for all k.

• Σ is called a basis of Rr if it is at the same time a spanning set and


linearly independent. For example, the set

{(1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, 0, . . . , 0, 1)}

is a basis of Rr called the standard basis. Note that in Rr , any basis


contains exactly r (linearly independent) vectors.
• If x = (x1, . . . , xr) and y = (y1, . . . , yr) are two vectors in Rr, then their dot product is x · y = x1y1 + · · · + xryr. In particular, x · x = Σ_{i=1}^{r} xi^2 is equal to ‖x‖^2, where ‖x‖ is the magnitude (or the norm) of the vector x. In the case where x · x = 1, x is called a unit vector. If x and y are nonzero vectors such that x · y = 0, we say that they are orthogonal.
• The set Σ is called orthogonal if

    vi · vj = 0 for all i ≠ j, i, j ∈ {1, 2, . . . , s}.

If in addition, every vector in Σ is a unit vector, the set Σ is called


orthonormal.
• A basis B is called an orthonormal basis of Rr if it is an orthonormal
set in addition to being a basis. For example, the standard basis of Rr
is an orthonormal basis.

Definition 3.1. A matrix A is called orthogonal if

AAt = At A = I (3.14)

where At denotes, as usual, the transpose of the matrix A and I is the


identity matrix (square matrix with all 1’s on the main diagonal and 0’s
elsewhere).

Note that relation (3.14) implies in particular that A is a square invertible


matrix and that
A−1 = At . (3.15)

From a compression point of view, relation (3.15) is extremely valuable


since it provides a clean and a “low cost” way of reversing the compression
procedure and of reconstructing the original data. The relation also allows
us to look at the DCT matrix as a “transition matrix” (change of basis
matrix) from the standard basis Eij , i, j = 0, . . . , n − 1 (the entries of
Eij are all zeros except for the one on the ith row and the jth column
which is 1) of Mn to another basis S much more useful to the compression
standard. To see what the elements of the basis S look like, we go back
to the definition of the two-dimensional DCT and its inverse. Given an
n × n matrix A = [aij ], its DCT transform is given by the matrix B = [bij ]
with B = DADt and D is the DCT matrix displayed in Figure 3.1 on page
71. The inverse DCT is given by A = Dt BD as explained above. Write
B = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} bij Eij. Then

    A = Dt B D = Dt ( Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} bij Eij ) D = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} bij Dt Eij D.

Let Sij = Dt Eij D for 0 ≤ i, j ≤ n − 1. So the matrix A can be expressed as A = Σ_{i,j} bij Sij, and the matrices Sij are precisely the elements of the new basis S of Mn in which every matrix A can be expressed as a linear combination of the matrices Sij with coefficients equal to the DCT coefficients bij of A. Clearly, the coefficients of Sij are not in the range [0, 255], which makes
them unsuitable to represent grayscale images. To see what the blocks Sij ’s
look like as grayscale images, we proceed as follows.
(1) First notice that the (k, l)-coefficient of the matrix Sij is given by

        skl = (√(δi δj)/n) cos( i(2k + 1)π / 2n ) cos( j(2l + 1)π / 2n )

    with δr = 2 for r > 0 and δ0 = 1. So, multiplying the matrix Sij by the factor n/√(δi δj) produces a matrix S′ij with coefficients in the range [−1, 1].
(2) If −1 ≤ x ≤ 1, then 0 ≤ x + 1 ≤ 2 and 0 ≤ (255/2)(x + 1) ≤ 255. So adding 1 to each coefficient of S′ij and then scaling the coefficients by a factor of 255/2 will produce a matrix S′′ij with coefficients in the range [0, 255], which can hence be interpreted as a grayscale image.

The following picture gives a representation of each of the elements of


the new basis S as a grayscale image. Note that the upper left corner of
the array is just a white image resulting from the fact that each coefficient

in the transformed matrix S′′00 is equal to 255 (pure white).

The fact that the coefficients of the matrix A in the new basis S are pre-
cisely the DCT coefficients of A allows us to interpret these coefficients as
a “measure” of how much each of the squares in the above array is present
in the image.
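
The observations above are easy to check numerically. The NumPy sketch below builds the 8 × 8 DCT matrix D, verifies that D Dt is the identity matrix (the orthogonality property studied in the next section), and produces the basis blocks Sij = Dt Eij D rescaled to the range [0, 255]. For simplicity, the rescaling divides each block by its largest entry in absolute value rather than using the exact δi, δj factors described above.

```python
import numpy as np

n = 8
k = np.arange(n)
D = np.sqrt(2.0 / n) * np.cos(np.outer(np.arange(n), 2 * k + 1) * np.pi / (2 * n))
D[0, :] = np.sqrt(1.0 / n)                     # first row of the DCT matrix

assert np.allclose(D @ D.T, np.eye(n))         # D is orthogonal: D D^t = I

def basis_image(i: int, j: int) -> np.ndarray:
    """The basis block S_ij = D^t E_ij D, rescaled so it can be viewed as a grayscale tile."""
    E = np.zeros((n, n))
    E[i, j] = 1.0
    S = D.T @ E @ D
    S = S / np.abs(S).max()                    # bring the coefficients into [-1, 1]
    return 255.0 / 2.0 * (S + 1.0)             # then into [0, 255]

print(np.allclose(basis_image(0, 0), 255.0))   # True: the (0, 0) tile is pure white
```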

3.5.2 What is the deal with orthogonal bases anyway?


You might be wondering at this point why the name “orthogonal” for an
n×n matrix A satisfying AAt = I? The following theorem gives the answer.

Theorem 3.1. For an n × n matrix A, the following conditions are equiv-


alent.
(1) A is an orthogonal matrix.

(2) The columns of A form an orthonormal basis of Rn .


(3) The rows of A form an orthonormal basis of Rn .

For the proof, we need the following two propositions.

Proposition 3.1. Every orthogonal set of Rr is in particular linearly


independent.

Proof. Assume that Σ = {v1 , . . . , vs } is an orthogonal set of Rr and let


a1 , . . . , as ∈ R such that
    Σ_{k=1}^{s} ak vk = 0.    (3.16)

We need to prove that ak = 0 for all k = 1, 2, . . . , s. Multiplying (using dot


product) equation (3.16) by vk , we get:

a1 (v1 · vk ) + a2 (v2 · vk ) + · · · + ak (vk · vk ) + · · · + as (vs · vk ) = 0. (3.17)

Since Σ is orthogonal, vi · vk = 0 for all i ≠ k and vk · vk = ‖vk‖^2 ≠ 0.


Equation (3.17) reduces to ak (vk · vk ) = 0 which implies that ak = 0. This
is true for any k = 1, 2, . . . , s. 

The second proposition is a fundamental result in Linear Algebra that we


state here without proof.

Proposition 3.2. In Rr , every linearly independent set of r vectors forms


a basis of Rr .
    
For instance, to prove that the set {q1 = (1, −1), q2 = (1, 0)} forms a basis of R2, it is enough to show that they are linearly independent. This is clearly the case since neither of these vectors is a scalar multiple of the other.

Combining the above propositions, we get the following.

Corollary 3.1. An orthonormal set of r vectors in Rr forms an orthonor-


mal basis of Rr .

We proceed now to prove Theorem 3.1. Note first that since the state-
ment “A is orthogonal” is equivalent to “At is orthogonal”, we only need
to prove that the first two conditions of Theorem 3.1 are equivalent. Write

A = [q1 q2 . . . qn], where qi is the ith column of A. Then At is the matrix whose rows are q1t, q2t, . . . , qnt, and therefore

            q1t · q1   q1t · q2   . . .   q1t · qn
            q2t · q1   q2t · q2   . . .   q2t · qn
    At A =    ...        ...       ···      ...            (3.18)
            qnt · q1   qnt · q2   . . .   qnt · qn
where, as usual, qit · qj denotes the dot product of the vectors qi and qj. If A is an orthogonal matrix, then At A = I and, by comparing (3.18) with the identity matrix I, we conclude that qit · qj = 0 for all i, j with i ≠ j and qit · qi = 1 for all i. These relations show that the columns of A are orthogonal and unit vectors of Rn. Corollary 3.1 implies that the columns of A form an orthonormal basis of Rn. This proves the implication (1) ⇒ (2) of Theorem 3.1. The implication (2) ⇒ (1) is proven similarly.

3.5.3 Proof of the orthogonality of the DCT matrix


In this section, we give a detailed proof of the following result using Theo-
rem 3.1.

Theorem 3.2. The DCT matrix defined in Figure 3.1 on page 71 is or-
thogonal.

There are a few elegant and relatively short proofs of this result in the litera-
ture, but they require some heavy mathematics. The proof presented here
is technical and somewhat long, but stays close to the basics. Let us start
by recalling some trigonometric identities needed for the proof.
    cos(α) cos(β) = (1/2) [cos(α + β) + cos(α − β)].    (3.19)

    cos^2(α) = (1/2) [1 + cos(2α)].    (3.20)
Lemma 3.1. For j ∈ {1, 2, . . . , 2n − 1}:

    Σ_{m=0}^{2n−1} cos(mjπ/n) = 0.

Proof. First recall that if x ∈ R, then the complex exponential e^{ix} is defined as follows:

    e^{ix} = cos x + i sin x,

where i is the complex number satisfying i^2 = −1. Fix j ∈ {1, 2, . . . , 2n − 1} and let α = jπ/(2n). Then:

    Σ_{m=0}^{2n−1} (e^{2iα})^m = Σ_{m=0}^{2n−1} e^{2miα} = 1 + e^{2iα} + e^{4iα} + · · · + e^{2(2n−1)iα}.

The expression 1 + e^{2iα} + e^{4iα} + · · · + e^{2(2n−1)iα} is a geometric sum containing 2n terms, with 1 as the first term and the ratio of any two consecutive terms equal to e^{2iα}. Note that 0 < 2α < 2π since 2α = jπ/n and 1 ≤ j ≤ 2n − 1. Therefore the equations cos(2α) = 1 and sin(2α) = 0 cannot be satisfied simultaneously. Since e^{2iα} = cos(2α) + i sin(2α) = 1 if and only if cos(2α) = 1 and sin(2α) = 0, we can confirm that the ratio e^{2iα} of the above geometric sum is not 1. A well-known formula for a finite geometric sum (with ratio other than 1) allows us to write:

    Σ_{m=0}^{2n−1} (e^{2iα})^m = (1 − (e^{2iα})^{2n}) / (1 − e^{2iα}) = (1 − e^{4niα}) / (1 − e^{2iα}).    (3.21)

Note that e^{4niα} = cos(4nα) + i sin(4nα) = cos(2jπ) + i sin(2jπ) = 1 + i · 0 = 1, which implies (by equation (3.21)) that Σ_{m=0}^{2n−1} e^{2miα} = 0. We conclude that

    0 = Σ_{m=0}^{2n−1} e^{2miα} = Σ_{m=0}^{2n−1} (cos(2mα) + i sin(2mα))
      = Σ_{m=0}^{2n−1} (cos(mjπ/n) + i sin(mjπ/n))
      = Σ_{m=0}^{2n−1} cos(mjπ/n) + i Σ_{m=0}^{2n−1} sin(mjπ/n).

Since both the real and the imaginary parts of the last complex expression must be zero, the result of the lemma follows. □

Lemma 3.2. If j ∈ {1, 2, . . . , 2n − 1} is even, then:

    Σ_{m=0}^{n−1} cos(mjπ/n) = 0.

Proof. Write j = 2α for some α ∈ Z. Using the result of Lemma 3.1, we have:

    0 = Σ_{m=0}^{2n−1} cos(mjπ/n)
      = Σ_{m=0}^{n−1} cos(mjπ/n) + Σ_{m=0}^{n−1} cos((m + n)jπ/n)
      = Σ_{m=0}^{n−1} cos(mjπ/n) + Σ_{m=0}^{n−1} cos(2αmπ/n + 2απ)      [since (m + n)jπ/n = mjπ/n + jπ and j = 2α]
      = Σ_{m=0}^{n−1} cos(mjπ/n) + Σ_{m=0}^{n−1} cos(2αmπ/n)            [since cos(y + 2απ) = cos(y)]
      = 2 Σ_{m=0}^{n−1} cos(mjπ/n)                                      [since 2α = j].

Thus,

    0 = Σ_{m=0}^{2n−1} cos(mjπ/n) = 2 Σ_{m=0}^{n−1} cos(mjπ/n)

and consequently,

    Σ_{m=0}^{n−1} cos(mjπ/n) = 0

for j ∈ {1, 2, . . . , 2n − 1} even. □

Lemma 3.3. For j ∈ {1, 2, . . . , 2n − 1} odd:

    Σ_{m=1}^{n−1} cos(mjπ/n) = 0.

Proof. The proof of this lemma is a bit more technical and requires dividing it into subcases. First assume that n is even. Then

    Σ_{m=1}^{n−1} cos(mjπ/n) = Σ_{m=1}^{n/2−1} cos(mjπ/n) + cos((n/2)jπ/n) + Σ_{m=n/2+1}^{n−1} cos(mjπ/n).

Note that cos((n/2)jπ/n) = cos(jπ/2) = 0 since j is odd (the cosine of any odd multiple of π/2 is zero). So

    Σ_{m=1}^{n−1} cos(mjπ/n) = Σ_{m=1}^{n/2−1} cos(mjπ/n) + Σ_{m=n/2+1}^{n−1} cos(mjπ/n).    (3.22)

Now, let k = n − m; then the last sum in (3.22) can be written as:

    Σ_{m=n/2+1}^{n−1} cos(mjπ/n) = Σ_{k=1}^{n/2−1} cos((n − k)jπ/n) = Σ_{k=1}^{n/2−1} cos(jπ − kjπ/n) = − Σ_{k=1}^{n/2−1} cos(kjπ/n)

since for an odd j, cos(jπ − α) = − cos(α). Relation (3.22) then shows that the lemma is true in this case. Next assume that n is odd and write

    Σ_{m=1}^{n−1} cos(mjπ/n) = Σ_{m=1}^{(n−1)/2} cos(mjπ/n) + Σ_{m=(n+1)/2}^{n−1} cos(mjπ/n).    (3.23)

Again, the same change of index (k = n − m) in the second sum shows that it is equal to the opposite of the first one, and the lemma is proved. □
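
Before putting the three lemmas to work, a skeptical reader can check them numerically. The short Python loop below verifies the three cosine-sum identities for a chosen value of n; any integer n ≥ 2 can be substituted.

```python
import math

n = 8   # any n >= 2 can be used here

def cos_sum(lower: int, upper: int, j: int) -> float:
    """Sum of cos(m*j*pi/n) for m = lower, ..., upper (inclusive)."""
    return sum(math.cos(m * j * math.pi / n) for m in range(lower, upper + 1))

for j in range(1, 2 * n):
    assert abs(cos_sum(0, 2 * n - 1, j)) < 1e-9        # Lemma 3.1
    if j % 2 == 0:
        assert abs(cos_sum(0, n - 1, j)) < 1e-9         # Lemma 3.2 (j even)
    else:
        assert abs(cos_sum(1, n - 1, j)) < 1e-9         # Lemma 3.3 (j odd)
print("Lemmas 3.1-3.3 verified numerically for n =", n)
```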

3.5.4 Proof of Theorem 3.2

We are now ready to prove Theorem 3.2.

Consider two columns qi and qj (0 ≤ i, j ≤ n − 1) of the DCT matrix D displayed in Figure 3.1 on page 71:

    qi = ( √(1/n), √(2/n) cos((2i+1)π/2n), √(2/n) cos(2(2i+1)π/2n), . . . , √(2/n) cos((n−1)(2i+1)π/2n) )t,

    qj = ( √(1/n), √(2/n) cos((2j+1)π/2n), √(2/n) cos(2(2j+1)π/2n), . . . , √(2/n) cos((n−1)(2j+1)π/2n) )t.

Without loss of generality, we can assume i ≥ j. We compute the dot

product of qi and qj:

    qi · qj = √(1/n) · √(1/n) + Σ_{k=1}^{n−1} √(2/n) cos(k(2i+1)π/2n) · √(2/n) cos(k(2j+1)π/2n)
            = 1/n + (2/n) Σ_{k=1}^{n−1} cos(k(2i+1)π/2n) cos(k(2j+1)π/2n)
            = 1/n + (2/n) Σ_{k=1}^{n−1} (1/2) [ cos( k(2i+1)π/2n + k(2j+1)π/2n ) + cos( k(2i+1)π/2n − k(2j+1)π/2n ) ]      (by (3.19))
            = 1/n + (1/n) Σ_{k=1}^{n−1} cos(k(i+j+1)π/n) + (1/n) Σ_{k=1}^{n−1} cos(k(i−j)π/n),

with 1 ≤ i + j + 1 ≤ 2n − 1 and 0 ≤ i − j ≤ n − 1. In order for the DCT matrix D to be orthogonal, two things have to be verified. First, if i = j, then qi · qj must be 1. Second, if i ≠ j, then qi · qj must be 0. The case i = j is easier to treat and that is what we start with. Note that in this case, cos(kπ(i−j)/n) = 1 for any k, which means that Σ_{k=1}^{n−1} cos(kπ(i−j)/n) = n − 1. On the other hand, Σ_{k=1}^{n−1} cos(kπ(i+j+1)/n) = Σ_{k=1}^{n−1} cos(kπ(2i+1)/n) = 0 by Lemma 3.3. We conclude then that qi · qj = 1/n + (1/n)(n − 1) = 1 for i = j.

Assume next that i ≠ j, so qi, qj are distinct columns of D. Note that if i − j = 2α for some α ∈ Z (that is, if i − j is even), then i + j + 1 = 2j + 2α + 1 is odd. Similarly, if i − j is odd then i + j + 1 is even. The proof that the dot product of qi and qj is zero in this case is done by considering the following two cases.

• i − j is even. In this case

    (1/n) Σ_{k=1}^{n−1} cos(kπ(i−j)/n) = (1/n) Σ_{k=0}^{n−1} cos(kπ(i−j)/n) − 1/n = −1/n,

  since the sum starting at k = 0 is zero by Lemma 3.2. On the other hand, Lemma 3.3 shows that Σ_{k=1}^{n−1} cos(kπ(i+j+1)/n) = 0 since i + j + 1 is odd in this case. Therefore, qi · qj = 1/n − 1/n = 0.
• i − j is odd. Interchanging the roles of i − j and i + j + 1 in the previous case shows that qi · qj = 0 again in this case.

By Theorem 3.1, the DCT matrix D is indeed orthogonal.

3.6 References

Brown, S. and Vranesic, Z. (2005). Digital Logic with VHDL Design, Second Edition. (McGraw-Hill).
Predko, M. (2005). Digital Electronics Demystified. (McGraw-Hill).
Farhat, H.A. (2004). Digital Design and Computer Organization. (CRC Press).
Chapter 4

Global Positioning System (GPS)

4.1 Introduction

Recently, after a family trip, a friend of mine decided to go back to using his old paper map in his travels and to put his car GPS to rest forever. This came after a series of disappointments with this little device with the annoying automated voice, the 4-inch screen and the frequently interrupted signal (his words, not mine). The latest of these disappointments was a trip from Ottawa to Niagara Falls which took a detour through the US. Admittedly, such a detour is normal, especially if the GPS is programmed to take the shortest route, except that my friend's family did not have passports on them that day.

If you have used a GPS before, you must have experienced some set-
backs here and there. But let us face it, the times when the trip goes
smoothly without any wrong turns or lost signal, we cannot help but ad-
mire the magic and ingenuity that transforms a little device into a holding
hand that takes you from point A to point B, sometimes thousands of kilo-
meters apart. It is almost “spooky” to think that someone is watching you
every step of the way from somewhere “out of this Earth”.

In this chapter, you will learn that there is nothing magical about the
GPS. It is the result of collective efforts of scientists and engineers with
Mathematics as the main link. After reading this chapter, use your time
on the road in your next trip to try to reveal to your co-travelers (with as
little mathematics as possible) the secret behind this technology. It works
every time I want to put my kids to sleep on a long trip.


4.1.1 Before you go further


Although the chapter is intended to be self-contained as much as possi-
ble, it is the heaviest in mathematical content compared to other chapters.
The structure of the GPS signals involves the use of abstract mathemat-
ical concepts, chiefly from abstract algebra like group theory, finite fields,
polynomial rings and primitive elements. Readers who want to have a full
grasp on the mathematical proofs are expected to read through more than
one time.

4.2 Latitude, longitude and altitude

If you press the “where am I” or “My location” buttons built in your GPS
receiver, your location will be displayed with expressions like 40◦ N, 30◦ W
and 1040 m, which are clearly not the “classical” cartesian coordinates.
This is because your GPS uses a more efficient coordinate system in which
the position or location of any point on or near the earth’s surface is de-
termined by three parameters known as the latitude, the longitude and the
altitude. You may have probably seen these terms before in a geography
class, but let us review them anyway. First, consider a cartesian coordinate
system Oxyz of three orthogonal axes centered at the center O of the earth.

Consider a point Q(x, y, z) in the above coordinate system and let P be the
projection of Q on the earth surface. That is, P is the intersection point of
−−→
the vector OQ with the earth surface. The points Q and P share the same
latitude and longitude that we define in what follows.

• the latitude of P (same as the latitude of Q) is the measurement of the


angle β of the location of P north or south of the equator. It represents
−−→
the angle formed between the position vector OP and the plane of the
equator. Note that −90◦ ≤ β ≤ 90◦ with the point of latitude −90◦
being the South Pole that we mark as 90◦ S and the point of latitude
90◦ being the North Pole that we mark as 90◦ N. Points of latitude 0◦
are points on the Equator. Lines of latitude (known as parallels) are
(imaginary) circles on the planet surface each parallel to the equator
plane. Points on the same parallel share the same latitude.
• A meridian line is an imaginary north-south circle on the earth’s
surface connecting the north and south poles. The meridian line passing
through the town of Greenwich, England is agreed on internationally to

be a reference line known as the prime meridian (also known as the


Greenwich line). The longitude of the point P (same as the longitude
of Q) is the measurement of the angle φ of the location of P east or
west of the prime meridian. Note that −180◦ ≤ φ ≤ 180◦, with points of negative longitude located to the west of the prime meridian and points of positive longitude located to its east. Thus a longitude of −100◦
is written as 100◦ W and a longitude of 55◦ is written as 55◦ E. Points
on the same meridian share the same longitude. Note that meridian
lines are orthogonal to parallel lines.
• The altitude h of Q is its distance from sea level. If R is the radius of the earth (R ≈ 6366 km), then the distance between the point Q and the center of the earth is R + h.

The position of any point near the surface of the planet is uniquely
determined by its latitude, its longitude and its altitude. For example, a
point described as (40◦ N, 30◦ W, 1850 m) is a point located 40◦ north of
the Equator and 30◦ west of the Greenwich meridian and at a distance of
1.85 km from the sea level (or 6366 + 1.85 = 6367.85 km from the center of
the earth).

4.3 About the GPS system

4.3.1 The GPS constellation


The GPS system is a constellation of man-made satellites placed in six
equally-spaced orbital planes surrounding the Earth. Each orbital plane is
inclined 55 degrees relative to the plane of the equator. There are at least 24
operational satellites at any given time with a number of backup satellites in
case of failures. The 24 operational satellites are arranged in four satellites
per orbit. Each GPS satellite orbits the earth twice a day (once every
12 hours) at an altitude of approximately 20,200 km (12,550 miles). The
number of operating satellites in each orbit, altitude and inclination of their
orbital planes as well as their speed and distances apart are carefully chosen
to ensure that a GPS receiver will always have in its range at least four
satellites no matter where it is located near the surface of the planet. As to
why we need four satellites in the range of a receiver, the answer comes a bit
later. In case you are interested, each GPS satellite weighs approximately
908 kg and is about 5.2 m across with the solar panels extended. Each
satellite is built to last about 10 years and replacements are constantly
being built and launched into orbits.

4.3.2 The GPS signal


Each operating GPS satellite transmits constantly two radio signals with
frequencies labeled L1 and L2 . The frequency L1 (1575.42 MHz) is for
civilian use while L2 can only be decoded by military receivers. The signal
can travel through clouds, glass and plastic but it is reflected by objects
like water surfaces and concrete buildings. A typical GPS signal contains
mainly three segments:

• A Pseudo Random Noise (PRN) code. In simple terms, this is a digital


sequence of on/off pulses which plays the role of identification code for
the satellite transmitting the signal. Each GPS satellite generates its
own and unique PRN and the code is designed with enough complexity
to make it virtually impossible for a ground receiver to confuse the
signal with another from outside the constellation. The uniqueness of
the PRN allows all operating satellites to use the same frequency.
• Ephemeris data. This part of the signal contains detailed informa-
tion about the orbit of the satellite and where it should be at any given
time in addition to the current date and time according to the (atomic)

clock on board of the satellite. The information contained in this part


of the signal is vital for the operation of the receiver.
• Almanac data. This part provides the GPS receiver with informa-
tion about all the operating GPS satellites in the constellation. Each
satellite emits almanac data about its own orbit as well as other satel-
lites. With this information, the receiver can determine which satellites
are likely to be in its range and does not waste time looking for the
ones that are not.

4.4 Pinpointing your location

Your GPS receiver uses a simple mathematical concept called Trilateration


to locate its position at any given time. We start by explaining this principle
in the case of a “two-dimensional” map.

4.4.1 Where am I on the map?


Imagine you are lost on campus, you are holding a campus map in your
hand but you do not feel it offers much help. You ask someone: “Where
am I?” and the person answers, “You are 500 m away from the university
center” and he walks away. You locate the university center, labeled as
U C on the campus map, but that does not help much since you could be
anywhere on the circle C1 centered at U C and of radius 500 m. You draw
C1 using the scaling of the campus map.

[Figure: the circle C1 of radius 500 m centered at UC.]

Next, you ask the same question to another person passing by, and he
answers: “You are 375 m away from the Math Department” and walks
away. You locate the Math Department on the map, labeled as M D, and
you draw on your map the circle C2 centered at M D and of radius 375 m.
This new information narrows your location to two possible points A and

B, namely the intersection points of circles C1 and C2 .


[Figure: the circles C1 and C2, centered at UC and MD, meeting at the two points A and B.]

To know which of the two points A and B is your exact location, it


suffices to draw a third circle that would intersect the other two at one of
these two points. You locate another building relatively close to U C and
M D, say the Faculty of Engineering, labeled as F E on the map. You ask a
third person passing by: “How far am I from the Faculty of Engineering?”
and he answers, “About 200 m”. You then draw the circle C3 on the map
centered at F E and of radius 200 m.

[Figure: the three circles centered at UC, MD and FE, all passing through the point A.]

The point where the three circles meet determines your (relatively) exact
location.

Of course, in order for this to work, you must be lucky enough to have people
passing by giving you (relatively) precise distances from various locations
and to be able to somehow work the scale of the map to draw accurate
circles. Equally important is the kind of question you should ask the third
person in order to ensure that the third circle will somehow meet the other
two at exactly one point. Roughly speaking, a GPS receiver works the same
way except that the circles are replaced by spheres in three dimensions, and
the friendly people you ask to pinpoint your position on the campus map are
replaced with satellites located thousands of kilometers above the surface
of the earth.

4.4.2 Measuring the distance to a satellite


Now for the story of locating your position on the surface of the planet.

Signals transmitted by GPS satellites travel at the speed of light (at


least in a vacuum) and reach the GPS receiver at slightly different times
as some satellites are further away from the receiver than others. These
signals are repeated continuously and any GPS receiver has them stored in
its internal memory along with what the value of the sequence should be at
any given time. As the receiver replays the satellites' sequences internally, the captured sequence and the one generated by the receiver should, in theory, be perfectly synchronized, but they are not, because of the time taken by the signal to reach the receiver. Once the receiver captures a signal, it immediately recognizes which satellite it is coming from (using the PRN segment of the signal) and it compares it to its own replica. By measuring how much the satellite signal is lagging, the travel time dt of the signal to reach the receiver is calculated.

Does this seem to be a bit too technical? In the next paragraph we try to
explain the idea of the “time lag” using a simple example.

Let us assume that a GPS satellite signal is just a “song” broadcasted by


the satellite (admittedly not a pretty one). Imagine that at 6:00 am, a
GPS satellite begins to broadcast the song “I see trees of green, red roses
too, I see them bloom for me and you...” in the form of a radio wave to
Earth. At exactly the same time, a GPS receiver starts playing the same
song. After traveling thousands of kilometers in space, the radio wave ar-
rives at the receiver but with a certain delay in the words. If you're standing
by the receiver, you will hear two interfering versions of the song at the
same time. At the time of signal reception, the receiver version is playing
“...them bloom for...” but the satellite version is playing (for instance) the
first “I see...”. The receiver player would then immediately “rewind” its
version a bit until it synchronizes perfectly with the received version. The
amount of time equivalent to this “shift back” in the receiver player is pre-
cisely the travel time of the satellite’s version.

Once the time delay dt is computed, the receiver internal computer mul-
tiplies it with the speed of light (in a vacuum), c = 299, 792, 458 m/sec to
calculate the distance separating the satellite from the GPS receiver.
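
The "rewind until the two versions line up" idea can be sketched in a few lines of Python. The snippet below uses a random ±1 sequence as a stand-in for a real PRN code and expresses the delay in whole samples; the receiver finds the lag by cross-correlating the received copy against its own replica and converts it into a distance. The sampling rate and the delay are illustrative values only, chosen so that the resulting distance has the right order of magnitude.

```python
import numpy as np

c = 299_792_458.0            # speed of light in m/s
fs = 1_023_000.0             # toy sampling rate: one sample per C/A-code chip
N = 2 ** 18                  # length of the toy code (about a quarter of a second)

rng = np.random.default_rng(0)
code = rng.integers(0, 2, N) * 2 - 1        # +/-1 toy code standing in for a PRN code
true_delay = 68_887                          # delay in samples, unknown to the receiver
received = np.roll(code, true_delay)         # the delayed copy reaching the receiver

# Circular cross-correlation via the FFT: the peak tells how far to "rewind" the replica.
corr = np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(code))).real
estimated_delay = int(np.argmax(corr))

dt = estimated_delay / fs                    # estimated travel time in seconds
print(estimated_delay == true_delay)         # True on this clean, noise-free toy signal
print(dt * c)                                # about 2e7 m: the right order of magnitude
```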

4.4.3 Where am I on the surface of the planet?


Now that we have a bit more understanding of how the GPS receiver esti-
mates its distance to the satellites in its view, it is time to see how these
estimates are put in use to pinpoint the position of the receiver.

We start by choosing a system of three orthogonal axes centered at the


point O, the center of the earth. The z-axis is the vertical one passing
through the two poles and oriented from South to North. The xz plane
is the Greenwich meridian plane. The x-axis lies in the equatorial plane
and the direction of positive values of x goes through the Greenwich point
(point of longitude zero). Similarly, the y-axis lies in the equatorial plane
and the direction of positive values of y goes through the point of longitude
90◦ East.

All GPS receivers are built with multiple channels allowing them to re-
ceive and treat signals from at least four different satellites simultaneously.
Once it captures the signals of three satellites S1 , S2 and S3 in its range,
the receiver calculates the time delays t1 , t2 and t3 (respectively, in seconds)
taken by signals of the three satellites to reach it. The distances between
the receiver and the three satellites are computed as explained in the pre-
vious section: d1 = ct1 , d2 = ct2 , and d3 = ct3 , respectively. The fact that
the receiver is at a distance d1 from satellite S1 means that it could be
anywhere on the (imaginary) sphere Σ1 centered at S1 and of radius d1 .
Using the ephemeris data part of the satellite signal, the receiver knows the
position (a1 , b1 , c1 ) of the satellite S1 in the above system of axes, so the
sphere Σ1 has equation:
    (x − a1)^2 + (y − b1)^2 + (z − c1)^2 = d1^2 = c^2 t1^2.    (4.1)
The distance d2 = ct2 from the second satellite is computed and the receiver
is also somewhere on the sphere Σ2 centered at the satellite S2 , positioned

at the point (a2 , b2 , c2 ), with radius d2 :

    (x − a2)^2 + (y − b2)^2 + (z − c2)^2 = d2^2 = c^2 t2^2.    (4.2)

This narrows the position of the receiver to the intersection of two spheres,
namely to a circle Γ. Still not enough to determine the exact position.
Finally, the distance d3 = ct3 from the third satellite S3 , positioned at the
point (a3 , b3 , c3 ), shows that the receiver is also on the sphere Σ3 :

    (x − a3)^2 + (y − b3)^2 + (z − c3)^2 = d3^2 = c^2 t3^2.    (4.3)

The inclinations of the GPS satellite orbits are designed so that the surface of the third sphere intersects Γ in two points whose coordinates the receiver can accurately compute. One of these two points will be
unreasonably far from the surface of the earth and therefore one possible
position is left.

[Figure: the spheres centered at the satellites S1, S2 and S3, intersecting at the receiver's position.]

4.4.4 Is it really that simple?


In theory, once a GPS receiver captures the signals of three different satel-
lites in its view, it should be able to locate its exact position (as the in-
tersection of three imaginary spheres). But in reality, things are a bit more
complicated than that.

The calculation of the time taken by the satellite signal to reach the
receiver (as explained above) assumes that clocks in the receiver and on
board of the satellite are in perfect synchronization. So 6:00 am on board
of the satellite means 6:00 am on the receiver clock. Unfortunately, that is
not the case. The satellites are equipped with atomic clocks, which are very sophisticated and extremely accurate, but also very expensive. The clocks inside the receivers, on the other hand, are ordinary everyday digital clocks. The
difference between the types of clocks creates a certain error in calculating
the real time delay of the GPS signal. You may wonder, why the big fuss
about a time estimate that could differ only in a fraction of a second? Re-
member, we are talking about waves traveling at the speed of light which
makes the estimated distances from the satellite to the GPS receiver ex-
tremely sensitive to gaps between the satellite and receiver clocks. To give
you an idea about the degree of sensitivity, an error of 0.000001 second (one
microsecond) would result in an error of 300 metres in distance estimation.
No wonder why the GPS receiver’s clock is the main source of error. This
means in particular that the distances d1 , d2 and d3 shown in equations
(4.1), (4.2) and (4.3) above are not very accurate since they are based on
“fake” time delays t1 , t2 and t3 .

4.4.5 The fix


The main reason we need these expensive atomic clocks on board of the
GPS satellites is to make sure that they are always in perfect synchroniza-
tion with each other. A consequence of this is that the “time error” ξ
between the receiver clock and the satellite clock calculated by the receiver
is independent of the satellite (i.e., the same for any satellite). The real
travel time dti taken by the signal emitted from satellite Si is the differ-
ence between the arrival time of the signal to the receiver and its departure
time from the satellite with both times measured according to the satellite
(atomic) clock:
dti = (arrival time according to satellite clock)
− (departure time according to satellite clock)
= (arrival time according to receiver clock)
− (departure time according to satellite clock)
− [(arrival time according to receiver clock)
− (arrival time according to satellite clock)]
= ti − ξ.

The true travel time of the signal from satellite Si is then equal to ti − ξ
(with ti as above) rather than simply ti . Equations (4.1), (4.2) and (4.3)
above can now be written as:

(H)    (x − a1)^2 + (y − b1)^2 + (z − c1)^2 = d1^2 = c^2 (t1 − ξ)^2
       (x − a2)^2 + (y − b2)^2 + (z − c2)^2 = d2^2 = c^2 (t2 − ξ)^2
       (x − a3)^2 + (y − b3)^2 + (z − c3)^2 = d3^2 = c^2 (t3 − ξ)^2


This is a system of three equations in four unknowns: the three coordinates


of the receiver position (x, y and z) and the clock offset time ξ. To solve
the system for x, y and z, one possibility is to eliminate the need for the
fourth variable ξ all together by equipping the receivers with atomic clocks
so they perfectly synchronize with the satellites clocks. That would reduce
ξ to zero in the system (H) giving a system of three equations in three
unknowns that the receiver computer can solve to figure out its position.
Of course, that would mean paying tens of thousands of dollars for the
receiver. Not a smart way to make this technology available to the general
public. So how come almost everyone you know has a very affordable GPS
receiver that is very accurate at the same time? The designers of the GPS
came up with an answer that is mathematically brilliant and yet simple.
As it turns out, a simple digital clock in your GPS receiver will do just fine
and all it takes is one more measurement from a fourth satellite and
voilà, you have the equivalent of an atomic clock right in the palm of your
hand.

As explained earlier, the GPS satellites are placed in orbits so that


there are always at least four satellites in view of a GPS receiver anywhere
near the surface of the planet. The receiver captures the signal of a fourth
satellite S4 and adds one more equation to the above system (H). Now we
have the following system of four equations in four unknowns to deal with:
(S)    (x − a1)^2 + (y − b1)^2 + (z − c1)^2 = d1^2 = c^2 (t1 − ξ)^2
       (x − a2)^2 + (y − b2)^2 + (z − c2)^2 = d2^2 = c^2 (t2 − ξ)^2
       (x − a3)^2 + (y − b3)^2 + (z − c3)^2 = d3^2 = c^2 (t3 − ξ)^2
       (x − a4)^2 + (y − b4)^2 + (z − c4)^2 = d4^2 = c^2 (t4 − ξ)^2

4.4.6 Finding the coordinates of the receiver


Note first that the system (S) is not a linear system and solving it would
require more than the techniques seen in a basic linear algebra course. But

with a little effort, it could be brought to a “quasi-linear” form. We start by


subtracting the fourth equation in (S) from each of the first three equations.
For instance, subtracting the fourth equation from the first gives:

    (x − a1)^2 + (y − b1)^2 + (z − c1)^2 − [ (x − a4)^2 + (y − b4)^2 + (z − c4)^2 ] = c^2 (t1 − ξ)^2 − c^2 (t4 − ξ)^2.

This results in the following equation:

    2(a4 − a1)x + 2(b4 − b1)y + 2(c4 − c1)z = 2c^2 (t4 − t1)ξ + (a4^2 + b4^2 + c4^2) − (a1^2 + b1^2 + c1^2) − c^2 (t4^2 − t1^2).

The expression (a4^2 + b4^2 + c4^2) − (a1^2 + b1^2 + c1^2) − c^2 (t4^2 − t1^2) in the above equation is independent of the variables x, y, z and ξ of the system. To simplify the notation a little bit, we call it A1:

    A1 = (a4^2 + b4^2 + c4^2) − (a1^2 + b1^2 + c1^2) − c^2 (t4^2 − t1^2).

This way, the last equation can now be written as:

    2(a4 − a1)x + 2(b4 − b1)y + 2(c4 − c1)z = 2c^2 (t4 − t1)ξ + A1.    (4.4)

Repeating the same thing for the second and third equations in (S), we
obtain a new system (S′) equivalent to (S) (in the sense that both systems
have the same set of solutions):
(S′)   2(a4 − a1)x + 2(b4 − b1)y + 2(c4 − c1)z = 2c^2 (t4 − t1)ξ + A1
       2(a4 − a2)x + 2(b4 − b2)y + 2(c4 − c2)z = 2c^2 (t4 − t2)ξ + A2
       2(a4 − a3)x + 2(b4 − b3)y + 2(c4 − c3)z = 2c^2 (t4 − t3)ξ + A3
       (x − a4)^2 + (y − b4)^2 + (z − c4)^2 = d4^2 = c^2 (t4 − ξ)^2

One way to solve (S′) is to treat ξ as a constant in each of the first three


equations. This will allow us to express each of the variables x, y and z
in terms of ξ and then use the fourth equation to find ξ (hence x, y and
z). This approach enables us to use the techniques of linear algebra to
solve systems of linear equations since the first three equations in (S′) form
indeed a system of three linear equations in three variables (x, y and z).

There are many ways to solve for x, y and z in terms of ξ in the first three equations, but Cramer's rule is probably the easiest to implement in the receiver's computer:

    x = D1/D,   y = D2/D,   z = D3/D,

where D is the determinant of the matrix

         2(a4 − a1)  2(b4 − b1)  2(c4 − c1)
    L := 2(a4 − a2)  2(b4 − b2)  2(c4 − c2)
         2(a4 − a3)  2(b4 − b3)  2(c4 − c3)

and D1, D2, D3 are respectively the determinants of the matrices

         2c^2(t4 − t1)ξ + A1  2(b4 − b1)  2(c4 − c1)
    L1 = 2c^2(t4 − t2)ξ + A2  2(b4 − b2)  2(c4 − c2)
         2c^2(t4 − t3)ξ + A3  2(b4 − b3)  2(c4 − c3)

         2(a4 − a1)  2c^2(t4 − t1)ξ + A1  2(c4 − c1)
    L2 = 2(a4 − a2)  2c^2(t4 − t2)ξ + A2  2(c4 − c2)
         2(a4 − a3)  2c^2(t4 − t3)ξ + A3  2(c4 − c3)

         2(a4 − a1)  2(b4 − b1)  2c^2(t4 − t1)ξ + A1
    L3 = 2(a4 − a2)  2(b4 − b2)  2c^2(t4 − t2)ξ + A2
         2(a4 − a3)  2(b4 − b3)  2c^2(t4 − t3)ξ + A3.

Clearly, we would be in trouble if D = 0. But can that really happen?


Using the properties of determinants, we can write
            | a4 − a1   b4 − b1   c4 − c1 |
    D = 8 · | a4 − a2   b4 − b2   c4 − c2 |        (4.5)
            | a4 − a3   b4 − b3   c4 − c3 |
(the 8 in front is obtained by factoring 2 from each of the three rows of
D) where ai , bi , ci are the coordinates of the satellite Si in the above sys-
tem of axes. So the rows in the determinant D are the components of the
vectors S1S4, S2S4 and S3S4 respectively (the Si's being the satellites). If
D = 0, then a known result from linear algebra asserts that the three vec-
tors are coplanar (belong to the same orbital plane) and consequently, the
four satellites S1 , S2 , S3 and S4 are also coplanar. Engineers were of course
fully aware of this problem and the way they chose to place the 24 satellites
in their orbits was carefully chosen so that it makes it impossible for a GPS
receiver to capture the signals of four coplanar satellites at any moment
and anywhere close to the surface of the Earth. Your linear algebra course
does not look so abstract now, does it?

Replacing x, y and z by D1/D, D2/D and D3/D respectively in the fourth equation of (S′) yields the following quadratic equation

    (D1/D − a4)^2 + (D2/D − b4)^2 + (D3/D − c4)^2 = c^2 (t4 − ξ)^2

which can be written as

    c^2 ξ^2 − 2c^2 t4 ξ + κ = 0    (4.6)

where κ = c^2 t4^2 − (D1/D − a4)^2 − (D2/D − b4)^2 − (D3/D − c4)^2. Once again, the
way the satellites are put in their orbits guarantees that equation (4.6)
would have two solutions ξ1 and ξ2 . This gives two possible positions (one
for each of the two values found for ξ) with one of them corresponding to
a point very far from the surface of the planet that the receiver eliminates
as a possibility.
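
The whole procedure of Sections 4.4.5 and 4.4.6 fits in a short numerical sketch. The function below treats ξ as a parameter, solves the three linear equations of (S′) once for the constant part and once for the ξ part, substitutes the result into the fourth sphere equation to obtain the quadratic (4.6), and keeps the root whose position lies near the surface of the earth. The satellite positions and the clock offset in the small test at the end are made up for illustration; this is a toy computation, not production GPS code.

```python
import numpy as np

c = 299_792_458.0   # speed of light in m/s

def locate(sat_pos, t):
    """sat_pos: 4x3 array of satellite positions (m); t: the four measured
    (biased) travel times in seconds.  Returns (x, y, z, xi)."""
    a = np.asarray(sat_pos, dtype=float)
    t = np.asarray(t, dtype=float)
    # The three linearized equations of (S'): L * (x, y, z) = k * xi + A.
    L = 2.0 * (a[3] - a[:3])
    A = (np.sum(a[3] ** 2) - np.sum(a[:3] ** 2, axis=1)
         - c ** 2 * (t[3] ** 2 - t[:3] ** 2))
    k = 2.0 * c ** 2 * (t[3] - t[:3])
    p0 = np.linalg.solve(L, A)        # (x, y, z) when xi = 0
    p1 = np.linalg.solve(L, k)        # change of (x, y, z) per unit of xi
    # Substitute (x, y, z) = p0 + p1*xi into the fourth sphere equation:
    # |p0 + p1*xi - a4|^2 = c^2 (t4 - xi)^2, a quadratic in xi as in (4.6).
    d0 = p0 - a[3]
    qa = p1 @ p1 - c ** 2
    qb = 2.0 * (d0 @ p1 + c ** 2 * t[3])
    qc = d0 @ d0 - c ** 2 * t[3] ** 2
    roots = np.roots([qa, qb, qc])
    # Keep the root whose position is closest to the earth's surface (~6.371e6 m).
    xi = min(roots, key=lambda r: abs(np.linalg.norm(p0 + p1 * r.real) - 6.371e6)).real
    return (*(p0 + p1 * xi), xi)

# Toy check: a receiver on the surface (on the x-axis) with a 1 ms clock offset.
receiver = np.array([6.371e6, 0.0, 0.0])
xi_true = 1e-3
dirs = np.array([[1.0, 0.2, 0.2], [0.3, 1.0, 0.3],
                 [0.4, -0.5, 1.0], [0.8, -0.6, -0.2]])
sats = 26.56e6 * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
t_meas = np.linalg.norm(sats - receiver, axis=1) / c + xi_true
print(locate(sats, t_meas))    # should come back close to (6.371e6, 0, 0, 0.001)
```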

4.4.7 Conversion from cartesian to (latitude, longitude,


altitude) coordinates
At this point, the receiver knows its position in cartesian form as the point Q(x, y, z) in the above coordinate system Oxyz. But the display on the screen shows the position in (latitude, longitude, altitude) coordinates. The conversion is done in the following steps (a short numerical sketch follows the list).

• First, note that the distance separating the receiver from the earth's center is d = √(x^2 + y^2 + z^2).
• The altitude of the receiver is computed as h = d − R where R is the
radius of the earth.
• For the point P, the projection of Q on the surface of the earth, the cartesian coordinates are given by (R/d)x, (R/d)y and (R/d)z. The relations between these coordinates and the latitude β and the longitude φ of the point P (which are the same for the point Q) are given by:

      (R/d)x = R cos β cos φ
      (R/d)y = R sin φ cos β
      (R/d)z = R sin β

These can be simplified to the following equations:



      x = d cos β cos φ
(L)   y = d sin φ cos β
      z = d sin β

The last equation gives sin β = z/d, and since −90◦ ≤ β ≤ 90◦, there is a unique value of β satisfying sin β = z/d, namely β = arcsin(z/d) (you may have seen this written as sin−1(z/d) in your first calculus course).

• Knowing the value of β, cos β can be computed and the system (L) is
now reduced to the following two equations:
      cos φ = x / (d cos β)
      sin φ = y / (d cos β)

(with cos β known). Since −180◦ ≤ φ ≤ 180◦ , these two equations


determine uniquely the value of the longitude φ.
• Thus the position Q(x, y, z) of the receiver can now be displayed in
terms of the latitude, the longitude and the altitude of the position
point Q.
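
A compact Python version of this conversion is given below. It follows the steps above, except that the longitude is obtained from the two-argument arctangent atan2, which resolves the cos φ / sin φ pair in a single call; the value R = 6366 km is the earth radius used earlier in this chapter.

```python
import math

R = 6366.0   # radius of the earth in km, as used in Section 4.2

def to_lat_lon_alt(x: float, y: float, z: float):
    """Convert earth-centred cartesian coordinates (in km) to latitude and
    longitude (in degrees) and altitude (in km)."""
    d = math.sqrt(x * x + y * y + z * z)
    beta = math.degrees(math.asin(z / d))    # latitude:  sin(beta) = z/d
    phi = math.degrees(math.atan2(y, x))     # longitude: cos/sin pair resolved by atan2
    return beta, phi, d - R

# The example point (40 N, 30 W, 1850 m) of Section 4.2:
d = R + 1.85
x = d * math.cos(math.radians(40)) * math.cos(math.radians(-30))
y = d * math.cos(math.radians(40)) * math.sin(math.radians(-30))
z = d * math.sin(math.radians(40))
print(to_lat_lon_alt(x, y, z))    # approximately (40.0, -30.0, 1.85)
```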

4.5 The mathematics of the GPS signal

Obviously, the satellites are not emitting their signals using the words of
the song “I see trees of green red roses too...”. So what is the nature of these
signals and how are they engineered to be easily identified by a ground re-
ceiver? More importantly, how can we make sure the signal is sufficiently
“random” to suit the intended use?

Locating the receiver position may have appeared somehow complicated


to you, but the truth is this was the “soft” side of the mathematics used
in this project. Careful encryption of codes in the signal emitted by the
satellite is key to ensure accuracy and reliability of information provided
to your receiver. This side of the GPS project requires somewhat heavier
mathematics.

4.5.1 Terminology
We start by going over some of the terminology needed for the rest of this
section. As seen in previous chapters, a binary sequence is a sequence
consisting of only two symbols, usually denoted by 0 and 1 (or On/Off
pulses), that we call bits. A binary sequence is called of length r if it is
a finite sequence consisting of r bits. A sequence a0 , a1 , a2 , . . . is called
periodic if there exists a positive integer p, called a period of the sequence,
such that an+p = an for all n. In other words, the periodic sequence repeats
itself every cycle of p terms. Note that if p is a period, then kp is also a
period for any positive integer k. The smallest possible value for p is called

the minimal period (in some books, it is simply called the period) of the
sequence. For example, the sequence

001011000101100010110001011000101100010110001011000101100010110

is a binary sequence of length 63 and it is periodic of (minimal) period 7


with the block 0010110 of 7 digits repeated nine times. Note that a binary
sequence of length r can be viewed as a vector (a0, a1, . . . , ar−1) having r components, with each component ai an element of the set F2 := {0, 1}. This means in particular that there are exactly 2^r different binary sequences of length r. For example, there are 2^3 = 8 binary sequences of length 3,
namely 111, 110, 101, 100, 011, 010, 001 and 000.

4.5.2 Linear Feedback Shift Registers


In their raw form, the pseudo random noise codes emitted by GPS satel-
lites are exactly that: noise-like signals pretty much like the noise you hear
when your radio cannot tune in to a station. The GPS receiver however
is programmed to look beyond the noise. It treats the codes as deter-
ministic binary sequences. The word “deterministic” in this context refers
to the fact that for the receiver, these signals are not random but rather
completely determined by a fixed set of coefficients and a relatively small
number of initial values, called the PRNG’s state (PRNG is an acronym for
“Pseudo-Random Number Generator”, the algorithm used to produce such
a deterministic binary sequence). There are many pseudo-random number
generators out there used for various applications but the one used in GPS
signal is called Linear Feedback Shift Register or LFSR for short. In
simple terms, a LFSR of degree m (or of m stages) can be described as a
digital circuit containing a series of m one-bit storage (or memory) cells.
Each cell is connected to a constant coefficient ci ∈ {0, 1}. The vector
(c0 , c1 , . . . , cm−1 ) is different from one satellite to another. Figure 4.1 be-
low shows a block diagram of a typical LFSR. The rhythm of the register
is controlled by a counter or a clock. The system is capable of generating
a sequence an of binary bits that will have the “appearance” of being very
random using the following steps.

(1) Start by choosing an initial “window” w0 = (a0 , a1 , . . . , am−1 ) (in the


figure for the LFSR block diagram, values are read from right to left).
These initial values are not all zeros at the same time (w0 6= 0) and the
vector w0 is different for different satellites.

[Figure: a row of m one-bit storage cells holding am−1, am−2, . . . , a0, with tap coefficients cm−1, cm−2, . . . , c0, and feedback am = a0c0 + a1c1 + · · · + am−1cm−1.]

Fig. 4.1 Linear feedback shift register block diagram.

(2) At the first "clock pulse", the content of each cell is shifted to the right by one box, "pushing out" the value a0. The content of the first (leftmost) box is then calculated as follows: first compute the value of the expression Σ_{k=0}^{m−1} ak ck = a0c0 + a1c1 + · · · + am−1cm−1. If the result is even, the value am = 0 is inserted in the leftmost box. If the result is odd, the value am = 1 is inserted in the leftmost box. If you are familiar with modular arithmetic (see Section 4.5.3 below), this amounts to calculating the sum Σ_{k=0}^{m−1} ak ck "modulo" 2. We now have the second "window" w1 = (a1, a2, . . . , am) (again read from right to left in the block diagram) and the first m + 1 terms of the sequence are am, am−1, . . . , a1, a0.
(3) This process is repeated. For example, at the second “clock pulse”, the
register shifts again the content of each cell to the right by one box
pushing out the value a1 this time. The content of the leftmost box is
calculated as c0 a1 + c1 a2 + · · · + cm−2 am−1 + cm−1 am (modulo 2) and
this is precisely the next bit, am+1 , in the sequence. We now have the
following terms am+1 , am , am−1 , . . . , a1 , a0 of the sequence.
(4) The procedure is iterated, creating an infinite binary sequence
. . . , ak , ak−1 , . . . , a2 , a1 , a0 .

Before we dig deeper into the mathematical properties of the sequence produced by a LFSR, let us look at a simple example.

Example 4.1. In this example, we consider a LFSR of degree 5 (m = 5).


As coefficient vector, we take c = (c0 , c1 , c2 , c3 , c4 ) = (0, 1, 1, 1, 0) and as

Table 4.1 First 30 windows in the register.


Clock pulse number Window Clock pulse number Window
0 00110 15 00011
1 00011 16 10001
2 10001 17 01000
3 01000 18 10100
4 10100 19 11010
5 11010 20 01101
6 01101 21 00110
7 00110 22 00011
8 00011 23 10001
9 10001 24 01000
10 01000 25 10100
11 10100 26 11010
12 11010 27 01101
13 01101 28 00110
14 00110 29 00011

initial state we take the vector w0 = (a0 , a1 , a2 , a3 , a4 ) = (0, 1, 1, 0, 0) (or


00110 when the values in the window are read from left to right). At
the first clock pulse, the register computes the sum (0 × 0) + (1 × 1) +
(1 × 1) + (1 × 0) + (0 × 0) = 2. Since the result is even, the content of the
leftmost box is 0. The new window in the sequence is 00011 (again from
left to right). At the second clock pulse, the register computes the sum
(0 × 1) + (1 × 1) + (1 × 0) + (1 × 0) + (0 × 0) = 1. Since the result is odd,
the content of the leftmost box is 1 and the new window is 10001. Refer
to Table 4.1 to find the first 30 windows in the register and the resulting
sequence is a28 a27 · · · a1 a0 = 01000110100011010001101000110.
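
The register of Example 4.1 is easy to simulate. The short Python function below follows the steps described above (compute the feedback bit modulo 2, shift the window by one position) and reproduces the 29-bit sequence of the example.

```python
def lfsr(coeffs, state, steps):
    """Generate bits from a linear feedback shift register.
    coeffs = (c0, ..., c_{m-1}) and state = (a0, ..., a_{m-1}) as in Figure 4.1."""
    state = list(state)
    out = list(state)                        # the initial bits a0, a1, ..., a_{m-1}
    for _ in range(steps):
        new_bit = sum(c * a for c, a in zip(coeffs, state)) % 2
        out.append(new_bit)
        state = state[1:] + [new_bit]        # the window slides by one position
    return out

# Example 4.1: c = (0, 1, 1, 1, 0) and initial window (a0, ..., a4) = (0, 1, 1, 0, 0).
bits = lfsr([0, 1, 1, 1, 0], [0, 1, 1, 0, 0], 24)
print("".join(str(b) for b in reversed(bits)))
# prints 01000110100011010001101000110, i.e. a28 a27 ... a1 a0 as in Example 4.1
```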

Remark 4.1. Since there are exactly 2^m different binary sequences of length m, the sequence produced by a LFSR of degree m must be periodic with period at most 2^m. If you are not convinced, just look at the 30 windows produced by the LFSR in Example 4.1 above. Each window is a binary sequence of length 5, so there are 2^5 = 32 (different) possible windows, including the window of zeros. In the worst case scenario, one needs 31 clock pulses before repeating a window, and as soon as a window is repeated, the ones that follow will already be on the list in the same order. But note that the LFSR in Example 4.1 repeats the first window just after the seventh clock pulse. This justifies the notion of a "maximal period" of 2^m. Note also that no window of zeros appears in the table of Example 4.1, and this is no coincidence. If the coefficients c0, c1, . . . , cm−1 and the initial conditions a0, a1, . . . , am−1 are "wisely" chosen (we will see

how later), we can guarantee that no window of all zeros will ever occur and that the sequence produced by the register is periodic with the maximum possible period.

All the mathematical machinery developed in the following sections is


geared toward proving the following main result.

Theorem 4.1. For a LFSR of degree m, one can always choose the
coefficients c0 , c1 , . . . , cm−1 and initial conditions a0 , a1 , . . . , am−1 in such
a way that the sequence produced by the register has a minimal period of
maximal length 2^m − 1.
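
Theorem 4.1 can be explored experimentally before it is proved. The brute-force search below measures the period of every 5-stage register with c0 = 1 (a necessary condition for the longest possible cycle), starting each from the nonzero window 10000, and reports the best tap vector found. For m = 5 the maximum period found is 2^5 − 1 = 31, exactly as the theorem promises.

```python
from itertools import product

def period(coeffs, state):
    """Number of clock pulses after which the register window first repeats."""
    seen, window, step = {}, tuple(state), 0
    while window not in seen:
        seen[window] = step
        new_bit = sum(c * a for c, a in zip(coeffs, window)) % 2
        window = window[1:] + (new_bit,)
        step += 1
    return step - seen[window]

m = 5
start = (1,) + (0,) * (m - 1)                      # a nonzero initial window
best = max((period(c, start), c)
           for c in product((0, 1), repeat=m) if c[0] == 1)
print(best)    # the best period is 31 = 2**5 - 1, reached by several tap vectors
```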

4.5.3 Some modular arithmetic


The Division algorithm is one of those things we learn at early age in school
but we usually do not pay much attention to its proper statement, let alone
the mathematics behind it. It is a building block for almost everything we
do in Arithmetic.

Theorem 4.2. (Division algorithm for integers) Given two integers a


and b, with b > 0, there exist unique integers q and r, with 0 ≤ r < b, such
that a = bq + r.

In the above theorem, q is called the quotient, r the remainder, b the


divisor and a is called the dividend of the division.

Remark 4.2. When dividing an integer a by a non-zero integer b, one can
always assume that b > 0 (if not, just divide a by |b| = −b). Note that if
b < 0, then a = |b|q + r = b(−q) + r with 0 ≤ r < |b| by the division
algorithm.

For the rest of this section, we fix an integer n ≥ 2.

Definition 4.1. We say that the two integers a and b are congruent
modulo n and we write a ≡ b (mod n), if a and b have the same remainder
upon division by n.

If a, b ∈ Z have the same remainder upon division by n, then by the division


algorithm we can write a = nq1 +r and b = nq2 +r for some q1 , q2 and r ∈ Z
with 0 ≤ r < n. So a − b = n(q1 − q2 ) is divisible by n. Conversely, suppose

that a − b = αn is divisible by n and write a = nq1 + r1 and b = nq2 + r2 for


some q1 , q2 , r1 and r2 ∈ Z with 0 ≤ r1 < n and 0 ≤ r2 < n. We can clearly
assume that r2 ≤ r1 with no loss of generality (if not, just interchange the roles
of a and b). So, 0 ≤ r1 − r2 < n and a − b = n(q1 − q2 ) + (r1 − r2 ) = αn.
By the uniqueness of the quotient and the remainder (Theorem 4.2), we
conclude that r1 − r2 = 0. In other words, a and b have the same remainder
upon division by n. This proves the following.

Theorem 4.3. For a, b ∈ Z, a ≡ b (mod n) if and only if a − b is divisible


by n.

As an example, 11 ≡ 21 (mod 5) since 11 and 21 have the same re-


mainder (namely 1) upon division by 5 (or equivalently, their difference
21 − 11 = 10 is divisible by 5).

There are n possible remainders upon division by n, namely 0, 1, . . . , n−


1. Given any integer a, the division algorithm allows us to write a = nq + r
for some q, r ∈ Z with 0 ≤ r ≤ n − 1. Since a − r = nq is divisible by n, we
have that a ≡ r (mod n). This means that any integer is congruent modulo
n to one of the elements in the set R = {0, 1, . . . , n − 1}. If k ∈ R, the set
k̄ of all integers having k as remainder in the division by n is known as an
equivalence class modulo n:
k̄ := {j ∈ Z; j ≡ k (mod n)} .
We then consider the collection Zn of all equivalence classes modulo n:
Zn := {k̄; 0 ≤ k ≤ n − 1} .

Example 4.2. Z3 = {0̄, 1̄, 2̄} where
0̄ = {. . . , −9, −6, −3, 0, 3, 6, 9, . . .} ,
1̄ = {. . . , −8, −5, −2, 1, 4, 7, 10, . . .} ,
2̄ = {. . . , −7, −4, −1, 2, 5, 8, 11, . . .} .

Remark 4.3. In the notation of the equivalence class k̄ used above, the
integer k is just one representative of that class. Any other element of the
same class is also a representative. For instance, 1̄ can also be represented
by −5 or by 7 in Z3 . To avoid confusion, the elements of Zn are always
represented in the (standard) form k̄ for 0 ≤ k ≤ n − 1. This way, we write
2̄ instead of 14 in Z3 . It is also worth mentioning that n̄ = 0̄ since the
remainder in the division of n by n is 0.

Our next task is to give the set Zn a certain algebraic structure by defining
an addition and a multiplication on the elements of the set that we call ad-
dition and multiplication modulo n. These operations are introduced
naturally in the following way:
• Addition modulo n. If ā, b̄ ∈ Zn , define ā + b̄ to be the equivalence
class represented by the integer a + b. In other words, ā + b̄ is the class of a + b.
• Multiplication modulo n. If ā, b̄ ∈ Zn , define ā b̄ to be the equiva-
lence class represented by the integer ab: ā b̄ is the class of ab.
Since a class in Zn has infinitely many representatives, one has to check
that these two operations are independent of the choice of representatives.
This is not hard to verify. The reader is certainly encouraged to try this as
an exercise.

Example 4.3. The following are addition and multiplication tables of Z3 :


+ 0 1 2 × 0 1 2
0 0 1 2 0 0 0 0
1 1 2 0 1 0 1 2
2 2 0 1 2 0 2 1
and of Z4 :
+ 0 1 2 3 × 0 1 2 3
0 0 1 2 3 0 0 0 0 0
1 1 2 3 0 1 0 1 2 3
2 2 3 0 1 2 0 2 0 2
3 3 0 1 2 3 0 3 2 1
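Since both operations reduce to ordinary integer arithmetic followed by reduction modulo n, tables like the ones above can be generated mechanically. The short Python sketch below is our own illustration (not part of the original text):

# Addition and multiplication tables of Z_n, obtained by reducing ordinary
# integer arithmetic modulo n.
def tables(n):
    add = [[(a + b) % n for b in range(n)] for a in range(n)]
    mul = [[(a * b) % n for b in range(n)] for a in range(n)]
    return add, mul

add4, mul4 = tables(4)
print(mul4[2])   # [0, 2, 0, 2]: the row of 2 in the multiplication table of Z_4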

4.5.4 Groups

Definition 4.2. A group is a set G equipped with an operation ∗ satisfying


the following axioms:
(1) Closure of G under the operation ∗. This axiom simply says that
when we compose two elements of G, what we get is also an element of
G: x ∗ y ∈ G for all x, y ∈ G.
(2) Associativity of the operation ∗. This property allows us to move
the parenthesis freely when doing computations inside the group: x ∗
(y ∗ z) = (x ∗ y) ∗ z for all x, y, z ∈ G.
(3) Existence of an identity element. There exists an element e (called
the identity element) of G satisfying: x ∗ e = e ∗ x = x for all x ∈ G.

(4) Existence of inverses. For every x ∈ G, there exists y ∈ G such that


x ∗ y = y ∗ x = e. The element y ∈ G is called the inverse of x with
respect to the operation ∗.

It is not hard to prove that the identity element of a group is unique. Also,
if x is an element of a group, then the inverse of x is unique. If in addition
to the above axioms, the operation ∗ is commutative, that is x ∗ y = y ∗ x
for all x, y ∈ G, then the group G is called abelian. A subset H of a group
(G, ∗) is called a subgroup of G if H is itself a group with respect to the
same operation ∗.

It is convenient to use familiar notations for a group operation. The


most familiar ones are of course + and · (or just a concatenation). If we
use the symbol +, we say that our group is additive and if the multiplication
(or concatenation) is used, the group is called multiplicative. In an additive
group, the identity element is called the zero element and denoted by 0
and the inverse of an element x of the group is called the opposite of x and
denoted by −x. In a multiplicative group, the identity element is denoted
by 1 and the inverse of an element x is denoted by x^−1 or 1/x.

Example 4.4. It should come as no surprise that the abstract definition of


a group given above is a generalization of the well-known (additive) groups
(Z, +) (the integers), (Q, +) (the rational numbers) and (R, +) (the real
numbers). Note that (Z, +) is a subgroup of both (Q, +) and (R, +) and
(Q, +) is a subgroup of (R, +). Changing the operation from addition to
multiplication results in these sets losing their group structure: (Z, ×) is
not a group because only ±1 have their multiplicative inverses in Z and
the inverse of any other integer is not an integer. (Q, ×) and (R, ×) are
not groups since 0 does not have an inverse in these sets. However, and
unlike (Z, ×), the sets (Q∗ , ×) and (R∗ , ×) are indeed groups where Q∗ and
R∗ are respectively the sets of non-zero rational numbers and non-zero real
numbers.

A group G is called finite if it contains a finite number of elements. In


this case, we define the order of G, written |G|, as the number of elements
in G. Finite groups play a pivotal role in many real life applications of
mathematics. The following gives a classic example of a finite group.

Example 4.5. The set Zn = {0, 1, . . . , n − 1} of integers modulo n defined


in Section 4.5.3 above is an additive group for the addition modulo n.

All the group axioms can be easily verified. In particular, 0̄ is the zero
element of the group and if k̄ ∈ Zn , then the opposite of k̄ is the class of n − k,
since k + (n − k) = n ≡ 0 (mod n).

What about the structure of (Zn , ×) where × is the multiplication modulo


n? Note first that the element 1̄ ∈ Zn is the identity element of Zn for the
multiplication modulo n since k̄ × 1̄ = k̄ for all k̄ ∈ Zn . But the
element 0̄ has no multiplicative inverse since k̄ × 0̄ = 0̄ ≠ 1̄ for all k̄ ∈ Zn .
This means that (Zn , ×) is not a group. What about removing 0 from Zn
like we did for Q and R? Would the resulting structure (Z∗n , ×) be a group
(like in the case of (Q∗ , ×) and (R∗ , ×))? A closer look at the multiplica-
tion table of Z4 given in Example 4.3 above quickly answers that question
negatively: the element 2 ∈ Z4 has no inverse since the row of 2 in that
table does not contain 1. This is clearly not the case of the multiplication
table of Z3 where every non-zero element seems to have an inverse, making
(Z∗3 , ×) a group.

So under what conditions on n does the set (Z∗n , ×) become a (multiplicative)
group? Part of the answer comes from the following observation: if n has a
proper divisor d (so d divides n and 2 ≤ d ≤ n − 1), then the element d̄ of
Zn does not have a multiplicative inverse. To see this, write n = kd with
2 ≤ k ≤ n − 1. If d̄ has a multiplicative inverse d̄′, then
k̄ × d̄ × d̄′ = (k̄ × d̄) × d̄′ = 0̄, since k̄ × d̄ = n̄ = 0̄.
On the other hand,
k̄ × d̄ × d̄′ = k̄ × (d̄ × d̄′) = k̄ × 1̄ = k̄ ≠ 0̄,

which leads to a contradiction. This shows that (Z∗n , ×) is not a group if


n has a proper divisor (since in that case, d has no multiplicative inverse
in Zn ). Integers with no proper divisors are called prime integers. For
instance, 2, 3, 5, 7, 11 are all prime. It is then natural to expect that if p is
a prime integer, the set Z∗p = {1, 2, . . . , p − 1} (of p − 1 elements) is indeed
a group for the multiplication modulo p. The proof of this fact uses some
properties of the GCD (Greatest Common Divisor) of two integers that we
will not include here but we state the result for future reference.

Theorem 4.4. If p is a prime integer, then the set Z∗p = {1, 2, . . . , p − 1}


(of p − 1 elements) is a group for the multiplication modulo p.

Hence, (Z∗2 , ×), (Z∗3 , ×), (Z∗5 , ×) and (Z∗31 , ×) are all examples of multi-
plicative groups.

From this point on, and unless otherwise specified, the operation of a
multiplicative group is simply denoted with a concatenation of elements
and the identity element of the group is denoted simply by 1.

Definition 4.3. Let G be a (multiplicative) group, g ∈ G and m ∈ Z. If
m > 0, we define g^m to be g composed with itself m times, that is
g^m = g g · · · g (m times).
If m < 0, we define g^m to be (g^−1)^−m . This is well defined since in a group,
every element has an inverse and −m is now positive. If m = 0, we define
g^m to be the identity element 1 of the group G.

Remark 4.4. In an additive group (G, +), the notion of an “exponent”
(or a “power”) g^m of g translates to g + g + · · · + g = mg (m copies of g).

The exponent laws for real numbers apply to elements of any group.
Given a group G and elements g, h in G, then for any integers m, n we have
• g^(m+n) = g^m g^n
• (g^m)^n = g^(mn)
• If G is abelian, then (gh)^m = g^m h^m .
An important theorem in the theory of finite groups (due to Lagrange)
relates the size of the subgroup to the size of the group.

Theorem 4.5. (Lagrange) If G is a finite group and H is a subgroup of


G, then |H| is a divisor of |G|.

Proof. Given x ∈ G, define xH as being the subset {xg; g ∈ H} of G.


Note that there are as many elements in xH as there are in H. To see
this, let g ≠ g′ ∈ H and suppose that xg = xg′ . Since x^−1 exists in G,
multiplying both sides by x^−1 yields g = g′ , which is a contradiction. So, if
g ≠ g′ , then xg ≠ xg′ and so xH and H have the same number of elements.
Note also that since H is a subgroup of G, xH = H for any x ∈ H (the
operation is internal in H). Next, let g ≠ g′ ∈ G and suppose that the sets
gH and g′H have an element z ∈ G in common. Then there exist h, h′ ∈ H
such that z = gh = g′h′ and we can write g = g′h′h^−1 (by multiplying
both sides of gh = g′h′ with h^−1 on the right). If y ∈ gH, then y = gh″

for some h″ ∈ H and therefore y = g′h′h^−1 h″ . But h′h^−1 h″ ∈ H since H
is a subgroup, so y = g′h′h^−1 h″ ∈ g′H. This shows that gH is a subset of
g′H. Similarly, we can show that g′H is a subset of gH and conclude that
gH = g′H. So as soon as the sets gH and g′H have an element in common,
they must be equal. In other words, the sets gH and g′H are either disjoint
(no elements in common) or they are the same set. Note also that 1H is
simply the subgroup H. Finally, if g ∈ G, then g = g1 ∈ gH since 1 ∈ H.
The group G can then be written as the union of pairwise disjoint subsets
of the form:
G = H ∪ g1 H ∪ · · · ∪ gr H
with |H| = |g1 H| = · · · = |gr H|. Thus, |G| = |H| + |g1 H| + · · · + |gr H| =
(r + 1)|H|. We conclude that |H| is a divisor of |G|. 
Groups like (Z, +) and (Zn , +) can be “generated” by a single element.
For example, in the additive group (Z, +) every integer k can be generated
using the element 1: k = 1 + 1 + · · · + 1 = k × 1. We say in this case that
the group (Z, +) is generated by 1. Note also that −1 is a generator of
(Z, +). In general, we have the following.

Definition 4.4. A group G is called cyclic if there exists an element g ∈ G


such that G = {g^m ; m ∈ Z}. In other words, every element of a cyclic
group G can be written as a power of a fixed element g ∈ G. We say in
this case that g is a generator of G and we write G = hgi.

Example 4.6. The group (Z∗5 , ×) = {1, 2, 3, 4} is cyclic with 2 as a gener-
ator since every element of the group can be expressed as a power of 2 as
follows: 1 = 2^0 , 2 = 2^1 , 3 = 2^3 and 4 = 2^2 .
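The claim that 2 generates (Z∗5 , ×) is easy to verify by brute force, and the same loop finds every generator of (Z∗p , ×) for a small prime p. The sketch below is our own illustration (not from the original text):

# Powers of g modulo p, and the list of elements that generate Z_p^*.
def powers_mod(g, p):
    return {pow(g, k, p) for k in range(1, p)}   # {g^1, g^2, ..., g^(p-1)} mod p

p = 5
nonzero = set(range(1, p))
print(powers_mod(2, p) == nonzero)                          # True: 2 is a generator
print([g for g in nonzero if powers_mod(g, p) == nonzero])  # [2, 3]: all generators of Z_5^*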

Remark 4.5. By the exponent laws of a group, a cyclic group is always


abelian.

Given a finite group G of order n ≥ 2 and identity element 1, the expo-
nent laws of G show in particular that the set Hg = {g^k ; k ∈ Z} forms
a subgroup of G for any g ∈ G. The subgroup Hg is called the cyclic
subgroup of G generated by g. Since G is finite, we can find k, m ∈ Z
with k < m and g^k = g^m (otherwise Hg would be infinite). Multiply-
ing both sides of g^k = g^m with g^−k gives that g^(m−k) = 1. So the set
Pg = {l ∈ Z; l > 0 and g^l = 1} is not empty. The order of the element g,
denoted by |g|, is defined as being the smallest element of Pg . That is, |g|
is the smallest positive integer l satisfying g^l = 1. Therefore, the subgroup

Hg is equal to {g^0 = 1, g, g^2 , . . . , g^(|g|−1)} and the order of g ∈ G is equal to
the order of the subgroup Hg generated by g.

Theorem 4.6. If G is a finite group of order n, then g^n = 1 for any g ∈ G.

Proof. By Lagrange's Theorem, we know that |g| = |Hg | is a divisor of n.
Write n = k|g| for some k ∈ N; then g^n = g^(k|g|) = (g^|g|)^k = 1^k = 1 since
g^|g| = 1 by definition of the order of g. □
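As a quick numerical illustration of Lagrange's theorem and Theorem 4.6 (our own check, not part of the original text), the sketch below computes the order of every element of (Z∗7 , ×) and confirms that each order divides |Z∗7 | = 6 and that g^6 = 1 for every g:

# Orders of the elements of Z_7^*: each order divides 6, and g^6 = 1 for all g.
p = 7
for g in range(1, p):
    order = next(l for l in range(1, p) if pow(g, l, p) == 1)   # smallest l with g^l = 1
    assert (p - 1) % order == 0 and pow(g, p - 1, p) == 1
    print(g, order)    # printed orders: 1, 3, 6, 3, 6, 2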

4.5.5 Fields - An introduction and basic results


We have seen that the sets (Q, +), (R, +) and (Zn , +) are all examples of
additive groups. There is however more to say about their structures. Each
one of these sets is also equipped with a multiplication that interacts well
with the addition to give each of them a more “advanced” structure known
as a field. The additive group (Z, +) is also equipped with a multiplication but
its structure differs from that of Q and R in the following key property: the
inverse of a non-zero rational number (respectively a non-zero real number)
is also a rational number (respectively a real number) while the inverse of
an integer is not an integer, except for ±1.

Field theory has deep roots in the history of abstract algebra and has
been perceived for many years as a purely academic topic in mathematics.
But with the increasing demand for improved technology and secure
communication, field theory started to play a central role in many real
life applications.
important facts about this theory, enough to be able to prove our main
result (Theorem 4.1).

Definition 4.5. A field is a nonempty set F together with two internal


operations often called an addition and a multiplication (denoted as usual
by + and × or just a concatenation respectively) such that the following
axioms are satisfied.
• (F, +) is an abelian group with identity element denoted by 0;
• (F∗ , ×) is an abelian group, where F∗ = {x ∈ F; x ≠ 0};
• The multiplication is distributive over addition: x(y + z) = xy + xz for
all x, y, z ∈ F.
In what follows, 0 and 1 denote the identity elements of the groups (F, +)
and (F∗ , ×) respectively for the field (F, +, ×). The first one is referred

to as the zero element and the second as the identity element of the field.
There is only one field, referred to as the zero field, where the zero element
and the identity element are the same. This is a set with only one element
0 with the obvious rules: 0 + 0 = 0 × 0 = 0. Any other field is called a
non-zero field.

The set (Z, +, ×) is not a field since (Z∗ , ×) is not a multiplicative group.
The sets Q, R and C (of rational numbers, real numbers and complex num-
bers respectively) with the usual addition and multiplication of numbers
are classic examples of a field structure. However, these are not the kind of
fields used in real applications. In what follows we look at fields containing
a finite number of elements that we call finite fields.

4.5.6 The field Zp


The multiplication table of Z4 given in Example 4.3 above reveals the fol-
lowing surprising fact: 2 × 2 = 0 in spite of the fact that 2 ≠ 0. This cannot
happen in a field as shown in the following proposition.

Proposition 4.1. Let F be a non-zero field. Then


(1) a0 = 0 for all a ∈ F.
(2) If a, b ∈ F are such that ab = 0, then either a = 0 or b = 0.

Proof.
(1) a0 = a(0 + 0) = a0 + a0 (by the distributivity property of a field). As
an element of a field, a0 must have an additive inverse (or opposite)
−a0. Adding −a0 to the last equation gives 0 = a0.
(2) Assume ab = 0. If a ≠ 0, then a admits a multiplicative inverse a^−1
since (F∗ , ×) is a group. Multiplying both sides of ab = 0 with a^−1
gives
a^−1 (ab) = a^−1 0 ⇒ (a^−1 a)b = 0 ⇒ 1b = 0 ⇒ b = 0.

We conclude that at least one of the elements a, b must be zero.



The proposition above shows in particular that Z4 , equipped with the ad-
dition and the multiplication modulo 4, is not a field since 2 × 2 = 0 and
2 ≠ 0. Similarly, 2 × 3 = 6 = 0 in Z6 with both 2, 3 non-zero. On the
other hand, addition and multiplication tables of Z3 show that Z3 is indeed
a field. The fact that 4 and 6 can be written as 2 × 2 and 2 × 3 respectively

is the main reason why (Z4 , +, ×) and (Z6 , +, ×) are not fields.

In general, if n is not a prime integer, then n can be written in the


form n = pq where 1 < p, q < n. This translates in Zn into the equation
p × q = n = 0 with both p, q non-zero. This means that Zn is not a field.
On the other hand, Theorem 4.4 above shows that Z∗p is a (multiplicative)
group if p is a prime integer. We then have the following result.

Theorem 4.7. Zp is a field (for the addition and a multiplication modulo


p) if and only if p is a prime integer.

Hence, Z2 , Z5 and Z7 are all examples of finite fields.

Remark 4.6. It can be shown (but we will not show it here) that any finite
field F containing p elements where p is a prime integer is actually a “copy”
of Zp (formally, we say F is isomorphic to Zp ). By a “copy”, we mean that
we can relabel the elements of F to match those of Zp (namely, 1, 2, . . .,
p − 1) in such a way the addition and multiplication tables of F are the
same as those of Zp . In other words, there is a unique field containing p
elements for each prime integer p. This field is denoted by Fp .

From this point on, we will omit the overline in expressing the element
ā of Zp and just write a for simplicity. For instance, we write Z3 = {0, 1, 2}
and Z5 = {0, 1, 2, 3, 4}.

The field Zp (or Fp ) is just a particular example of a more general family


of finite fields. The following is a key result in the theory of finite fields.
We omit the proof as it is beyond the scope of this book.

Theorem 4.8. The number of elements in any non-zero finite field is p^r
for some prime integer p and positive integer r. Conversely, given a prime
integer p and a positive integer r, there exists a unique finite field (unique-
ness up to relabeling the elements) containing p^r elements, that we denote
by Fpr .

The field Fpr plays a key role in understanding the properties of the
sequence produced by a LFSR. Our next task is to shed more light on its
structure. For this, we need the notion of a polynomial over a field.

4.5.7 Polynomials over a field


In all what follows, F denotes an arbitrary non-zero field (not necessarily
finite), p a prime integer and r a positive integer. We will “cook” the field
Fpr following two recipes. The main ingredient in both recipes is the notion
of polynomials with coefficients in the field F. These are the same type of
polynomials that you always dealt with except that the coefficients are no
longer restricted to real numbers.

Definition 4.6. A polynomial in one variable x over F is an expression


of the form
p(x) = an x^n + an−1 x^(n−1) + · · · + a1 x + a0
where ai ∈ F for each i ∈ {0, 1, . . . , n}. If an ≠ 0 (with 0 being the
zero element of the field F), then we say that p(x) is of degree n and we
write deg p(x) = n. In this case, the coefficient an is called the leading
coefficient of p(x). A monic polynomial is a polynomial with leading
coefficient 1 (the identity element of the field F). If ai = 0 for all i, we say
that p(x) is the zero polynomial. The degree of the zero polynomial is
defined to be −∞. Note that any element of the field F can be considered
as a polynomial of degree 0 that we usually call a constant polynomial.
The set of all polynomials in one variable x over F is denoted by F[x].

We define addition and multiplication in F[x] in the usual way of adding and
multiplying two polynomials with real coefficients with the understanding
that the involved operations on the coefficients are done in the field F.
These two operations inside F[x] do not give this set the status of a field
since, for example, the multiplicative inverse of the polynomial x ∈ F[x]
does not exist (no polynomial p(x) exists such that xp(x) = 1).

Remark 4.7. We are mainly interested in polynomials over the finite


fields Zp (for prime p) and one has to be careful when computing modulo
the prime p. For instance, let p(x) = x^2 + x + 1 and q(x) = x + 1 considered
as polynomials in Z2 [x]. Then p(x) + q(x) = x^2 + 2x + 2 = x^2 since in the
field Z2 , 2 = 0 (remember: the coefficient 2 here means 2̄). Also p(x)q(x) =
x^3 + 2x^2 + 2x + 1 = x^3 + 1 for the same reason. Now, consider the same
polynomials p(x) = x^2 + x + 1 and q(x) = x + 1 but this time as elements
of Z3 [x]. Then p(x) + q(x) = x^2 + 2x + 2 and p(x)q(x) = x^3 + 2x^2 + 2x + 1.

Similar to integers, we can talk about division of polynomials and we


have a division algorithm in F[x].

Definition 4.7. Let p(x), q(x) be two polynomials in F[x] with p(x) not
equal to the zero polynomial. We say that p(x) is a divisor of q(x) (or that
p(x) divides q(x)) if q(x) = p(x)k(x) for some k(x) ∈ F[x]. In this case, we
also say that q(x) is a multiple of p(x).

Example 4.7. In Z2 [x], p(x) = x^2 + x + 1 is a divisor of x^3 + 1 since
(x + 1)(x^2 + x + 1) = x^3 + 1 (see Remark 4.7 above).

Example 4.8. x^4 − 1 is a multiple of x^2 + 1 in F[x] for any field F since
x^4 − 1 = (x^2 − 1)(x^2 + 1) holds regardless of the base field F.

Division algorithm of F[x]. Given two polynomials f (x) and g(x) in


F[x] with deg g(x) ≥ 1, then uniquely determined polynomials q(x) and
r(x) exist in F[x] such that
(1) f (x) = g(x)q(x) + r(x);
(2) Either r(x) is the zero polynomial or deg r(x) < deg g(x).
The polynomial q(x) is called the quotient of the division and r(x) is called
the remainder. Note that if deg f (x) < deg g(x), then we can write f (x) =
g(x) · 0 + f (x) with 0 as quotient and f (x) as remainder. As in the case of
integers, the above algorithm guarantees the existence of a quotient and a
remainder but it does not say much about how to find them. Usually the
long division of polynomials is used to that end.

Example 4.9. Let p(x) = x^4 + 2x^3 + x + 2 and k(x) = x^2 + x + 1 considered
as polynomials in Z3 [x] where as usual Z3 = {0, 1, 2}. We can perform the
long division of p(x) by k(x) the usual way, but bear in mind that we are
not dealing with real numbers here but rather elements of the field Z3 .
The long division gives
x^4 + 2x^3 + x + 2 = (x^2 + x + 1)(x^2 + x − 2) + (2x + 4).
The quotient is q(x) = x^2 + x − 2 = x^2 + x + 1 (since −2 = 1 in the field
Z3 ) and the remainder is r(x) = 2x + 4 = 2x + 1 (since 4 = 1 in the field
Z3 ).
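The long division above is mechanical enough to automate. In the Python sketch below (our own illustration; it assumes p is prime, the divisor has a non-zero leading coefficient and deg f ≥ deg g), a polynomial is a list of coefficients starting with the constant term, and the function reproduces the quotient and remainder of Example 4.9:

# Long division of polynomials over Z_p.
# A polynomial is a list of coefficients [a0, a1, ..., an] (constant term first).
def poly_divmod(f, g, p):
    f = [c % p for c in f]
    g = [c % p for c in g]
    inv_lead = pow(g[-1], p - 2, p)            # inverse of the leading coefficient of g (p prime)
    q = [0] * (len(f) - len(g) + 1)
    r = f[:]
    for i in range(len(q) - 1, -1, -1):        # eliminate the highest remaining term at each step
        q[i] = (r[i + len(g) - 1] * inv_lead) % p
        for j, c in enumerate(g):
            r[i + j] = (r[i + j] - q[i] * c) % p
    while len(r) > 1 and r[-1] == 0:           # trim leading zeros of the remainder
        r.pop()
    return q, r

# Example 4.9: divide x^4 + 2x^3 + x + 2 by x^2 + x + 1 in Z_3[x]
q, r = poly_divmod([2, 1, 0, 2, 1], [1, 1, 1], 3)
print(q, r)   # [1, 1, 1] and [1, 2]: quotient x^2 + x + 1, remainder 2x + 1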

We give next an analogue of prime integers for polynomials.

Definition 4.8. A non-zero polynomial p(x) ∈ F[x] is called irreducible


over F if it cannot be written as the product of two non-constant polyno-
mials in F[x]. In other words, p(x) is irreducible if and only if the only way
an equality of the form p(x) = p1 (x)p2 (x) with p1 (x), p2 (x) ∈ F[x] can oc-
cur is when either p1 (x) or p2 (x) is a constant polynomial. Consequently, if
p(x) is irreducible of degree r, then it does not have a non-constant polynomial
divisor (or factor) of degree strictly less than r.

The notion of irreducibility for polynomials depends largely on the coeffi-


cient field. If F1 is a field contained in a larger field F2 , it could very well
happen that a polynomial p(x) is irreducible as an element of F1 [x] but not
as an element of F2 [x].

Example 4.10. The polynomial p(x) = x^2 − 2 is irreducible as an element of
Q[x] but not as an element of R[x] since p(x) = (x − √2)(x + √2) and each
one of the polynomials (x − √2), (x + √2) is non-constant in R[x].

More interesting examples arise in the case of finite fields.

Example 4.11. The polynomial p(x) = x^2 + 1 is not irreducible over Z2
since (x + 1)(x + 1) = x^2 + 2x + 1 = x^2 + 1 in Z2 [x]. Note that x^2 + 1 is
clearly irreducible in R[x].

Remark 4.8. It can be shown that if F is a finite field, then there exists a
monic irreducible polynomial of degree r in F[x] for any positive integer r.

Similar to the arithmetics “modulo n” in Z, we can define operations


“modulo p(x)” in F[x] for some fixed polynomial p(x) ∈ F[x]. First, a
definition.

Definition 4.9. Let F be a field, p(x) ∈ F[x] a non-zero polynomial. We


say that the two polynomials f (x), g(x) ∈ F[x] are congruent modulo
p(x), and we write f (x) ≡ g(x) (mod p(x)) (or sometimes f (x) = g(x)
(mod p(x))), if p(x) divides the difference f (x) − g(x). Note that (like in
the case of integers) the fact that p(x) divides f (x) − g(x) is equivalent to
f (x) and g(x) having the same remainder when divided with p(x).

Example 4.12. x^3 + 2x^2 − 1 ≡ x^2 − 1 (mod x + 1) in R[x] since x^3 + 2x^2 −
1 − (x^2 − 1) = x^3 + x^2 = x^2 (x + 1).

Example 4.13. x^3 + 3x ≡ x^3 − x^2 − 2x − 1 (mod x^2 + 1) in Z5 [x] since
x^3 + 3x − (x^3 − x^2 − 2x − 1) = x^2 + 5x + 1 = x^2 + 1 (remember that 5 = 0
in Z5 ).

The division algorithm is at the heart of arithmetic modulo p(x) in F[x].


If f (x) = p(x)q(x) + r(x), then f (x) − r(x) = p(x)q(x) and consequently,
f (x) ≡ r(x) (mod p(x)). Like in the case of integers, given a non-zero
polynomial p(x) ∈ F[x], we can group the polynomials of F[x] in “classes”
according to their remainder upon division by p(x). So two polynomials
f (x) and g(x) are “equal” modulo p(x) if and only if they belong to the
same class, or equivalently they have the same remainder when divided by
p(x).

For a non-zero polynomial p(x) ∈ F[x], we denote by F[x]/⟨p(x)⟩ the
set of all classes of F[x] modulo p(x). In other words, F[x]/⟨p(x)⟩ is the set
of all possible remainders upon (long) division with the polynomial p(x).
Like in the case of integers modulo n, addition and multiplication (modulo
p(x)) in F[x]/hp(x)i are well defined operations in the sense that they do
not depend on the representatives of the classes.

Remark 4.9. If p(x) is a non-zero polynomial in F[x] and α is a non-
zero element of F, then we can easily verify that f (x) ≡ g(x) (mod p(x))
if and only if f (x) ≡ g(x) (mod αp(x)) for any polynomials f (x) and
g(x) in F[x]. As a consequence, if p(x) = an x^n + · · · + a1 x + a0 ∈ F[x],
then the arithmetic modulo p(x) is the same as the arithmetic modulo
an^−1 p(x) = x^n + · · · + an^−1 a1 x + an^−1 a0 . This allows us to assume, without
any loss of generality, that the polynomial p(x) is monic when looking at
the structure of F[x]/⟨p(x)⟩.

Example 4.14. Let p(x) = x^2 + 2x ∈ Z3 [x]. Let us see how we can add
and multiply the two polynomials h(x) = x^3 + x^2 and k(x) = x^2 + 2x + 2
of Z3 [x] modulo p(x). First note that h(x) + k(x) = x^3 + 2x^2 + 2x + 2 and
h(x)k(x) = x^5 + 3x^4 + 4x^3 + 2x^2 = x^5 + x^3 + 2x^2 (since 3 = 0 and 4 = 1
in Z3 ). We start by performing the long division of both h(x) + k(x) and
h(x)k(x) by p(x). This leads to the following two relations (the reader is
encouraged to do the long division):
h(x) + k(x) = x p(x) + (2x + 2) ,
h(x)k(x) = (x^3 − 2x^2 + 5x − 8) p(x) + 16x
         = (x^3 − 2x^2 + 5x − 8) p(x) + x since 16 = 1 in Z3 .




We conclude that h(x) + k(x) = 2x + 2 (mod x^2 + 2x) and h(x)k(x) = x
(mod x^2 + 2x).

If p(x) ∈ F[x] is not irreducible over F, we would have an equation of
type hq = 0 in the set F[x]/⟨p(x)⟩ with both h and q non-zero (can
you see why?). This would deprive F[x]/⟨p(x)⟩ of having a field structure
with respect to addition and multiplication mod p(x) by Proposition 4.1
above. This suggests that F[x]/⟨p(x)⟩ (with operations modulo p(x)) is a
field only in the case where p(x) is an irreducible polynomial. To completely
prove that fact, one would need the notion of the greatest common divisor of two
polynomials and the Euclidean algorithm to find it. These are techniques
that the interested reader can pick up from any abstract algebra book. We
state this fact for future reference.

Theorem 4.9. Let p(x) ∈ F[x] be a non-constant polynomial. The set
F[x]/⟨p(x)⟩ equipped with addition and multiplication modulo p(x) is a
field if and only if p(x) is an irreducible polynomial.

4.5.8 The field Fpr - A first approach


Given a prime number p and a positive integer r, we are now ready to
provide a first approach to construct the field Fpr containing p^r elements.
Start by choosing an irreducible and monic polynomial p(x) of degree r with
coefficients in the finite field Fp = {0, 1, . . . , p − 1}. Remark 4.8 above guar-
antees the existence of such a polynomial. The irreducibility of p(x) gives
Fp [x]/⟨p(x)⟩ the structure of a field by Theorem 4.9. Since Fp [x]/⟨p(x)⟩
consists of all remainders upon division by p(x) and since any remainder is
of degree at most r − 1, elements of this field are of the form
c0 + c1 t + c2 t^2 + · · · + cr−1 t^(r−1)
with ci ∈ Fp for each i. Such a polynomial has r coefficients, each of which
can take on p values in the field Fp . This means that there are exactly p^r
elements in the field Fp [x]/⟨p(x)⟩. The following example shows how we
can explicitly find the elements of Fp [x]/⟨p(x)⟩ and more importantly how
to perform the field operations on these elements.

Example 4.15. Consider the polynomial p(x) = x^3 + x + 1 of F2 [x]. We
start by proving that p(x) is irreducible over F2 . Suppose not; then the
polynomial can factor into a product of two non-constant polynomials in
F2 [x]. Since p(x) is of degree 3, at least one of the factors must be linear (of

degree 1). So there exist a, b, c ∈ Z2 such that x^3 + x + 1 = (x + a)(x^2 + bx + c).
In particular, x = −a is a root of the polynomial in F2 . But no such root
exists in F2 since p(0) = 1 ≠ 0 and p(1) = 3 = 1 ≠ 0. We conclude that
p(x) = x^3 + x + 1 is irreducible and hence Z2 [x]/⟨x^3 + x + 1⟩ is indeed a
field. Next, we look more closely at the list of elements of this field. We
know that
Z2 [x]/⟨x^3 + x + 1⟩ = {a0 + a1 t + a2 t^2 ; a0 , a1 , a2 ∈ Z2 } .
There are exactly 2^3 = 8 elements in this field, namely:
Z2 [x]/⟨x^3 + x + 1⟩ = {0, 1, 1 + t + t^2 , 1 + t, 1 + t^2 , t + t^2 , t, t^2 } .   (4.7)
It is important to know how to actually perform operations on the el-
ements inside the field. Note first that x^3 + x + 1 ≡ 0 (mod x^3 + x + 1),
which corresponds to the relation t^3 + t + 1 = 0 in Z2 [x]/⟨x^3 + x + 1⟩. As it
turns out, this relation is the “vehicle” that brings any multiplication αβ
of elements of Z2 [x]/⟨x^3 + x + 1⟩ to one element in the set (4.7) above. For
example, when the elements 1 + t + t^2 and t^2 are multiplied together, we find
(1 + t + t^2 )(t^2 ) = t^2 + t^3 + t^4 , which does not appear as one of the elements
of the field (4.7). But since t^3 + t + 1 = 0, we get that t^3 = −t − 1 = t + 1
(−α = α in Z2 since α + α = 2α = 0 for any element α) and so
(1 + t + t^2 )(t^2 ) = t^2 + t^3 + t^4
= t^2 + (t + 1) + t(t + 1) = 2t^2 + 2t + 1 = 1.
Likewise,
(1 + t)(1 + t + t^2 ) = 1 + t + t^2 + t + t^2 + t^3
= 1 + 2t + 2t^2 + t^3 = 1 + t^3 = 1 + (−t − 1) = −t = t.
Another important feature one should notice about the multiplication in
Z2 [x]/⟨x^3 + x + 1⟩ is the fact that every non-zero element of this field can
be expressed as a power of the element α = t of the field. Here is why:
α^0 = 1, α^1 = t, α^2 = t^2 , α^3 = t + 1, α^4 = t^2 + t, α^5 = 1 + t + t^2 , α^6 = 1 + t^2 ,
α^7 = 1.

This is not a coincidence according to the following proposition.

Proposition 4.2. If (F, +, ×) is a finite non-zero field, then (F∗ , ×) is a


cyclic group. Here F∗ is, as usual, the set F from which the zero element of
the field is removed.

Proof. We give a sketch of the proof, leaving it to the reader to fill in
some details. By Theorem 4.8 above, the field F contains p^r elements
for some prime number p and positive integer r. So F∗ has l = p^r − 1
elements. Write l as a product of powers of primes, l = p1^q1 p2^q2 · · · pc^qc (the
prime decomposition), where the pi ’s are distinct prime numbers and the
qi ’s are positive integers. If we can find α ∈ F∗ with order l, then we
are done since this would mean that the cyclic subgroup ⟨α⟩ has the same
number of elements as the full group F∗ . Let li = l/pi for i = 1, . . . , c;
then the polynomial x^li − 1 of degree li cannot have every element of F∗
as a root since li < l and the polynomial can have at most li roots. For
each i = 1, . . . , c, we can then choose ξi ∈ F∗ such that ξi^li − 1 ≠ 0. Let
αi = ξi^(l/pi^qi) ; then the order |αi | of αi is a divisor of pi^qi since (αi)^(pi^qi) = ξi^l = 1.
If |αi | ≠ pi^qi , then |αi | = pi^m for some m < qi and so |αi | is a divisor of
pi^(qi−1) . But if that is true, then (αi)^(pi^(qi−1)) = 1, which implies that ξi^li = 1, a
contradiction to the fact that ξi^li − 1 ≠ 0. This shows that the order of αi
is pi^qi . Now let α = α1 α2 · · · αc ∈ F∗ ; then it can be shown that the order
of α is the product of the orders of the elements αi , namely l. □

Definition 4.10. A primitive element of a finite non-zero field


(F, +, ×) is any generator of the cyclic group (F∗ , ×). In other words,
α ∈ F∗ is primitive if F∗ = {1, α, α^2 , . . . , α^(r−2) } where r = |F|.

Example 4.16. In Example 4.15 above, we saw that every non-zero ele-
ment of the field Z2 [x]/⟨x^3 + x + 1⟩ is a power of α = t. Thus, α = t is a
primitive element of Z2 [x]/⟨x^3 + x + 1⟩.
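The power computations of Example 4.15 are easy to check by machine. In the sketch below (our own illustration, not part of the original text), an element of Z2 [x]/⟨x^3 + x + 1⟩ is encoded as a 3-bit integer b2 b1 b0 standing for b2 t^2 + b1 t + b0; multiplying by t is a left shift followed by the reduction t^3 = t + 1, and the loop confirms that the powers of t run through all 7 non-zero elements before returning to 1:

# Powers of t in Z_2[x]/<x^3 + x + 1>, with elements stored as 3-bit integers.
def times_t(a):
    a <<= 1                 # multiply by t
    if a & 0b1000:          # a degree-3 term appeared: replace t^3 by t + 1
        a ^= 0b1011         # i.e. add (= subtract, in Z_2) t^3 + t + 1
    return a

powers, a = [], 1
while True:
    powers.append(a)
    a = times_t(a)
    if a == 1:
        break
print(powers)        # [1, 2, 4, 3, 6, 7, 5]: t^0, ..., t^6 are all 7 non-zero elements
print(len(powers))   # 7, so t is a primitive element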

4.5.9 The field Fpr - A second approach


Now for the second approach to construct Fpr . Recall that the field Fp
containing p elements is nothing but a copy of the field Zp of all integers
modulo p.

Consider the set Zp^r = Zp × Zp × · · · × Zp (r factors) of all r-tuples
(a0 , a1 , . . . , ar−1 ) where ai ∈ Zp for all i. Since each component ai can
take p values, the set Zp^r consists of p^r elements. We define an addition and
a multiplication in Zp^r as follows.

• The addition is defined on Zp^r the natural way (componentwise):
(a0 , a1 , . . . , ar−1 ) + (b0 , b1 , . . . , br−1 ) = (a0 + b0 , a1 + b1 , . . . , ar−1 + br−1 )
where ai + bi represents the addition modulo p in Zp .


• Unlike addition, the multiplication on Zp^r will probably appear to you
as very “unnatural”. We start by fixing an irreducible and monic
polynomial M (t) = t^r + mr−1 t^(r−1) + · · · + m1 t + m0 of degree r in
Zp [t] (by Remark 4.8 above, we know that such a polynomial exists).
Each r-tuple (a0 , a1 , . . . , ar−1 ) ∈ Zp^r is identified with the polynomial
p(t) = ar−1 t^(r−1) + · · · + a1 t + a0 ∈ Zp [t] of degree less than or equal to
r − 1 with coefficients in the field Zp . To define the multiplication of
two r-tuples (a0 , a1 , . . . , ar−1 ) and (b0 , b1 , . . . , br−1 ) of Zp^r , we start by
writing the corresponding polynomials in Zp [t]:
p(t) = ar−1 t^(r−1) + · · · + a1 t + a0 , q(t) = br−1 t^(r−1) + · · · + b1 t + b0 ,
then we multiply the two polynomials together in the usual way by
regrouping terms in t^0 , t, t^2 , . . . , t^(2(r−1)) :
p(t)q(t) = ar−1 br−1 t^(2(r−1)) + · · · + (a0 b1 + a1 b0 )t + a0 b0 ,
which in turn is congruent to its remainder R(t) modulo M (t) as an
element of the field Zp [t]/⟨M (t)⟩. Since the remainder is of degree less
than or equal to r − 1, it can be written in the form R(t) = αr−1 t^(r−1) +
· · · + α1 t + α0 where αi ∈ Zp for all i. Now define the multiplication
of the two r-tuples (a0 , a1 , . . . , ar−1 ) and (b0 , b1 , . . . , br−1 ) as being the
r-tuple consisting of the coefficients of R(t):
(a0 , a1 , . . . , ar−1 ) × (b0 , b1 , . . . , br−1 ) = (α0 , α1 , . . . , αr−1 ).

One can verify that the set Zp^r equipped with the above addition and mul-
tiplication with respect to a monic irreducible polynomial M (t) is indeed a
field.

Remark 4.10. The key feature in this second approach is the fact that it
allows us to look at the r-tuples of Zp^r as polynomials. The two fields Fp^r
and Zp [t]/⟨M (t)⟩ are copies of each other. Formally, we say that they are
isomorphic.

Example 4.17. Consider the 3-tuples (1, 0, 1) and (1, 1, 1) as elements
of Z2^3 . As polynomials, these 3-tuples can be identified with t^2 + 1 and
t^2 + t + 1 respectively. We have seen in Example 4.15 above that the

polynomial M (t) = t^3 + t + 1 ∈ Z2 [t] is irreducible. Let us multiply the two
3-tuples with respect to M (t):
(t^2 + 1)(t^2 + t + 1) = t^4 + t^3 + 2t^2 + t + 1 = t^4 + t^3 + t + 1
(remember that 2 = 0 in Z2 ). Now we divide t^4 + t^3 + t + 1 by t^3 + t + 1:
t^4 + t^3 + t + 1 = (t + 1)(t^3 + t + 1) + (t^2 + t),
and get a remainder of t^2 + t. The coefficients of this remainder
are represented with the 3-tuple (0, 1, 1). So, (1, 0, 1) × (1, 1, 1) = (0, 1, 1).
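For p = 2 this tuple multiplication fits naturally on bit strings, which is essentially how it is implemented in hardware. The sketch below (our own illustration, not part of the original text) encodes a 3-tuple (a0 , a1 , a2 ) as the integer with binary digits a2 a1 a0 , multiplies two tuples as polynomials over Z2 and reduces the product modulo M (t) = t^3 + t + 1, reproducing the result of Example 4.17:

# Multiplication in Z_2^3 with respect to M(t) = t^3 + t + 1.
# A tuple (a0, a1, a2) is encoded as the integer with binary digits a2 a1 a0.
M = 0b1011          # t^3 + t + 1
R = 3               # degree of M

def gf_mul(a, b):
    prod = 0
    for i in range(R):                        # carry-less ("XOR") multiplication
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * (R - 1), R - 1, -1):   # reduce modulo M, highest degree first
        if (prod >> d) & 1:
            prod ^= M << (d - R)
    return prod

# (1,0,1) <-> t^2 + 1 <-> 0b101, and (1,1,1) <-> t^2 + t + 1 <-> 0b111
print(bin(gf_mul(0b101, 0b111)))   # 0b110 = t^2 + t, i.e. the tuple (0, 1, 1)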

Definition 4.11. An irreducible monic polynomial F (x) ∈ Zp [x] of degree
r is called a primitive polynomial over Zp if the monomial t is a primitive
element of the field Zp [x]/⟨F (x)⟩ when the elements of the field are identified
with polynomials of the form ar−1 t^(r−1) + · · · + a1 t + a0 with ai ∈ Zp for
all i.
Example 4.18. In Example 4.15 above, the polynomial P (x) = x^3 + x + 1 ∈
Z2 [x] is primitive since it is irreducible, monic and t is a primitive element
of the field Z2 [x]/⟨x^3 + x + 1⟩.
Example 4.19. The polynomial x^6 + x^3 + 1 ∈ Z2 [x] can be shown to be
irreducible. In the field Z2 [x]/⟨x^6 + x^3 + 1⟩, the equation t^6 + t^3 + 1 = 0
is equivalent to t^6 = −t^3 − 1 = t^3 + 1. This gives the following powers of
the monomial t:
t^7 = t^4 + t, t^8 = t^5 + t^2 , t^9 = t^6 + t^3 = t^3 + 1 + t^3 = 2t^3 + 1 = 1.
The fact that t^9 = 1 and that the multiplicative group of Z2 [x]/⟨x^6 + x^3 + 1⟩
is of order 2^6 − 1 = 63 implies that t is not a generator of that group. So the
polynomial x^6 + x^3 + 1 of Z2 [x] is not primitive.
Remark 4.11. If α is a primitive element of a finite field F with |F| = r ≥ 2,
then we know that α is of order r − 1. In particular, α is a root of the
polynomial Q(x) = x^(r−1) − 1 and r − 1 is the smallest positive integer m
such that α is a root of x^m − 1. If F (x) is a primitive polynomial of degree r
over Zp , then p^r − 1 is the smallest positive integer n satisfying t^n − 1 = 0 in
the field Zp [x]/⟨F (x)⟩ and consequently p^r − 1 is the smallest positive integer
n such that F (x) is a divisor of x^n − 1.

The following theorem proves that there is enough supply of primitive


polynomials of any chosen degree. The proof is omitted as it is beyond the
scope of this book.

Theorem 4.10. For any prime integer p and any positive integer n, there
exists a primitive polynomial of degree n over the field Zp .

4.5.10 The lead function


We start by reviewing the definition of a linear map.

Definition 4.12. A map f : Fp^r → Fp is called linear if it satisfies the two
conditions:
(1) f (u + v) = f (u) + f (v) for all r-tuples u, v in Fp^r ;
(2) f (αu) = αf (u) for all u ∈ Fp^r and α ∈ Fp .

Example 4.20. Let F (x) be an irreducible polynomial of degree r in Fp [x]
and identify the field Fp^r (or Fp [x]/⟨F (x)⟩) as usual with the set of poly-
nomials of degree r − 1 or less with coefficients in Fp . Consider the map
θ : Fp^r → Fp , called the lead function, defined as follows:
θ(br−1 t^(r−1) + · · · + b1 t + b0 ) = br−1 .
If u = br−1 t^(r−1) + · · · + b1 t + b0 , v = cr−1 t^(r−1) + · · · + c1 t + c0 ∈ Fp^r and
α ∈ Fp , then
• θ(u + v) = θ((br−1 + cr−1 )t^(r−1) + · · · + (b1 + c1 )t + (b0 + c0 )) =
br−1 + cr−1 = θ(u) + θ(v).
• θ(αu) = θ(αbr−1 t^(r−1) + · · · + αb1 t + αb0 ) = αbr−1 = αθ(u).
This means that θ is a linear map.

Remark 4.12. A special case of great interest in our treatment of the GPS
signal is the case where p = 2. In this case, there are 2^r polynomials of the
form br−1 t^(r−1) + · · · + b1 t + b0 ∈ Z2 [t], with exactly half having the leading
coefficient br−1 = 0 and the other half with leading coefficient br−1 = 1.
This means that the lead function θ : F2^r → F2 takes the value 0 on exactly
half of the elements of F2^r and the value 1 on the other half.

4.6 Key properties of GPS signals: Correlation and


maximal period

We now arrive at the last stop in our journey of understanding the mathe-
matics behind the signal produced by a GPS satellite using a LFSR. This
section provides the proof of the main result concerning the GPS signal
(Theorem 4.1). We start with the notion of correlation between two “slices”
of the sequence produced by a LFSR. It is the calculation of this correlation
that allows the GPS receiver to accurately compute the exact time taken
by the signal to reach it from the satellite.

4.6.1 Correlation

Definition 4.13. Given two binary finite sequences of the same length,
A = (a1 , . . . , an ) and B = (b1 , . . . , bn ), the correlation between A and B, denoted
by ν (A, B), is defined as follows:
ν (A, B) = Σ_{i=1}^{n} (−1)^ai (−1)^bi .

Let S = {1, 2, . . . , n}, S1 = {i ∈ S; ai = bi } and S2 = {i ∈ S; ai ≠ bi }.
Then
Σ_{i=1}^{n} (−1)^ai (−1)^bi = Σ_{i∈S1} (−1)^ai (−1)^bi + Σ_{i∈S2} (−1)^ai (−1)^bi .
Note that:
• if ai = bi , then (−1)^ai (−1)^bi = (−1)^(2ai) = 1, so Σ_{i∈S1} (−1)^ai (−1)^bi =
1 + 1 + · · · + 1, with as many terms as the number of elements in S1 .
• if ai ≠ bi , then (−1)^ai (−1)^bi = −1 since one of ai , bi is 0 and the other
is 1 in this case. We conclude that Σ_{i∈S2} (−1)^ai (−1)^bi = −1 − 1 − · · · − 1,
with as many terms as the number of elements in S2 .


Thus, the correlation between A and B is equal to the number of elements
in S1 minus that of S2 . This proves the following proposition.

Proposition 4.3. The correlation between two binary finite sequences A =
(a1 , . . . , an ) and B = (b1 , . . . , bn ) of the same length is equal to the number of indices
i where ai = bi minus the number of indices i where ai ≠ bi .

The above proposition suggests that the correlation serves as a measure of
how “similar” the sequences are. The closer the correlation ν (A, B) is to zero, the

more corresponding terms in the sequences are different. So sequences with


small correlation are poorly correlated and sequences with large correlation
are strongly correlated.

Example 4.21. Consider the following two finite sequences:


101011100101110
111001011100101.
Every time the corresponding bits agree, add 1, and every time they
disagree, subtract 1. There are 7 agreements and 8 disagreements, so the resulting correlation is −1.
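The count is easy to automate. The following short Python function (our own illustration) computes ν (A, B) for the two sequences of Example 4.21:

# Correlation of two binary strings of equal length:
# +1 for each agreement, -1 for each disagreement.
def correlation(A, B):
    return sum(1 if a == b else -1 for a, b in zip(A, B))

print(correlation("101011100101110", "111001011100101"))   # -1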

4.6.2 The LFSR sequence revisited


In this section, we revisit the sequence produced by a LFSR of degree r
(with r cells) and try to analyse it in a bit more depth. We start by fixing
a primitive polynomial of degree r over Z2 :

P (x) = x^r + cr−1 x^(r−1) + · · · + c1 x + c0 .
We know that such a polynomial exists by Theorem 4.10 above. For the
coefficients of the LFSR, choose the vector c = (cr−1 , . . . , c1 , c0 ) with com-
ponents equal to the coefficients of P (x). The choice of the initial window
can be any non-zero vector (ar−1 , . . . , a1 , a0 ), but the one we choose in the
following is very suitable for proving interesting facts about the sequence.
We use the lead function θ : Fp^r → Fp defined in Example 4.20 above as
follows:

(1) Choose a non-zero polynomial ε(t) in Z2 [x]/⟨P (x)⟩:
ε(t) = εr−1 t^(r−1) + · · · + ε1 t + ε0 , εi ∈ Z2 for all i = r − 1, . . . , 0.
(2) Define a0 = θ(ε(t)) = εr−1 .
(3) Next, we compute tε(t) as an element of Z2 [x]/⟨P (x)⟩. Remember that
the equation P (t) = 0 together with the fact that −ci = ci in the field
Z2 translates to t^r = cr−1 t^(r−1) + · · · + c1 t + c0 .
tε(t) = t (εr−1 t^(r−1) + · · · + ε1 t + ε0 )
= εr−1 t^r + εr−2 t^(r−1) + · · · + ε1 t^2 + ε0 t
= εr−1 (cr−1 t^(r−1) + · · · + c1 t + c0 ) + εr−2 t^(r−1) + · · · + ε1 t^2 + ε0 t
= (εr−1 cr−1 + εr−2 ) t^(r−1) + · · · + (εr−1 c1 + ε0 ) t + εr−1 c0 .
(4) Define a1 = θ(tε(t)) = εr−1 cr−1 + εr−2 .



(5) To define a2 , we compute first t^2 ε(t) as an element of Z2 [x]/⟨P (x)⟩
(always using the identity t^r = cr−1 t^(r−1) + · · · + c1 t + c0 ) and then
we define a2 as the lead of the resulting polynomial. We find a2 =
θ(t^2 ε(t)) = εr−1 cr−1^2 + εr−1 cr−2 + εr−2 cr−1 + εr−3 .
(6) In general, define ai = θ(t^i ε(t)) for all i ∈ {0, 1, . . . , r − 1}.
(7) Take (ar−1 , . . . , a1 , a0 ) to be the initial window of the LFSR.
But what is the big deal? Why do we need P (x) to be primitive and
why this complicated way of choosing the initial window? Be patient, you
have gone a long way so far and the answers are just a few paragraphs away.

Note that:
θ(t^r ε(t)) = θ(cr−1 t^(r−1) ε(t) + · · · + c1 tε(t) + c0 ε(t))   (4.8)
= cr−1 θ(t^(r−1) ε(t)) + · · · + c1 θ(tε(t)) + c0 θ(ε(t))   (4.9)
= cr−1 ar−1 + · · · + c1 a1 + c0 a0 ,   (4.10)
where (4.8) follows from t^r = cr−1 t^(r−1) + · · · + c1 t + c0 , (4.9) follows from
the linearity of the lead map θ and (4.10) follows from our definition of
the initial conditions a0 , . . . , ar−1 . Look closely at the last expression. Is
that not how the LFSR computes its next term ar ? We conclude that
θ(t^r ε(t)) = ar . In fact, it is not hard to show that any term in the sequence
produced by a LFSR can be obtained this way. More specifically,
ak = θ(t^k ε(t)), k = 0, 1, 2, . . . .   (4.11)

4.6.3 Proof of Theorem 4.1


We are now ready to prove Theorem 4.1.

Proof of Theorem 4.1. With the above choice of the coefficients (as
coefficients of a primitive polynomial) and the initial conditions, we show
that a sequence produced by a LFSR with r registers has a period equal
to N = 2^r − 1. We already know (see Remark 4.1) that the sequence
is periodic with (minimal) period T ≤ 2^r . Since P (x) is chosen to be a
primitive polynomial, t is a generator of the multiplicative group of the
field Z2 [x]/⟨P (x)⟩ and so it has order N = 2^r − 1 as an element of
the group. This means that N is the smallest positive integer satisfying
t^N = 1. Moreover, for any n ∈ N, we have
an+N = θ(t^(n+N) ε(t)) = θ(t^N t^n ε(t)) = θ(t^n ε(t)) = an ,
since t^N = 1.

This shows in particular that N is a period of the sequence and by the


minimality of the period T , we have that

T ≤ N. (4.12)


For any k ∈ N, if we apply θ to the relation ak+T = ak we get θ(t^(k+T) ε(t)) =
θ(t^k ε(t)) or equivalently
θ(t^k ε(t)(t^T − 1)) = 0   (4.13)
by the linearity of θ. Assume (t^T − 1) ≠ 0; then ε(t)(t^T − 1) ≠ 0 as the
product of two non-zero elements of the field Z2 [x]/⟨P (x)⟩. The polyno-
mial P (x) was chosen to be primitive for a reason: any non-zero element
of Z2 [x]/⟨P (x)⟩ is a power of t, in particular ε(t)(t^T − 1) = t^n for some
n ∈ {0, 1, 2, . . . , N − 1} and therefore t^k ε(t)(t^T − 1) = t^(k+n) . The elements
t^k ε(t)(t^T − 1) are then just permutations of the elements of the multiplicative
group F∗2^r = {1, t, t^2 , . . . , t^(N−1)} . Equation (4.13) implies that the lead func-
tion θ takes the value zero everywhere on F∗2^r , which is not true (see Remark
4.12 above). Therefore t^T − 1 = 0 or equivalently t^T = 1. Since N is the
order of t as an element of the multiplicative group of the field Z2 [x]/⟨P (x)⟩,
we get that N ≤ T . Using inequality (4.12), we get that T = N . We con-
clude that the (minimal) period of the sequence an is indeed N = 2^r − 1.
This finishes the proof of the theorem.

Example 4.22. It can be shown that the polynomial P (x) = x^4 + x^3 + 1 is
primitive over Z2 . We look at the binary sequence produced by a LFSR of
degree 4 using the polynomial P (x) = x^4 + x^3 + 1. The coefficient vector of
the LFSR in this case is (c3 , c2 , c1 , c0 ) = (1, 0, 0, 1). For the initial values,
let us choose the non-zero polynomial ε(t) = t^2 + t. We compute tε(t),
t^2 ε(t) and t^3 ε(t) as elements of the field Z2 [x]/⟨P (x)⟩. Remember first that
in the field Z2 [x]/⟨x^4 + x^3 + 1⟩, we have the identity t^4 + t^3 + 1 = 0 or
t^4 = −t^3 − 1 = t^3 + 1. Then
tε(t) = t^3 + t^2 , t^2 ε(t) = t^4 + t^3 = (t^3 + 1) + t^3 = 2t^3 + 1 = 1, t^3 ε(t) = t.
So a0 = θ(ε(t)) = 0, a1 = θ(tε(t)) = 1, a2 = θ(t^2 ε(t)) = 0 and a3 =
θ(t^3 ε(t)) = 0. The initial state vector is then (a3 , a2 , a1 , a0 ) = (0, 0, 1, 0).

The following tables give the first 24 windows of the sequence:


Clock Pulse # Window Clock Pulse # Window Clock Pulse # Window
0 0010 8 0101 16 0001
1 0001 9 1010 17 1000
2 1000 10 1101 18 1100
3 1100 11 0110 19 1110
4 1110 12 0011 20 1111
5 1111 13 1001 21 0111
6 0111 14 0100 22 1011
7 1011 15 0010 23 0101

The sequence produced is . . . 11100010011010111100010 which is periodic
of period 2^4 − 1 = 15. When the sequence is read from right to left, one
can see that the slice 011010111100010 is repeated every 15 bits.
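The table of Example 4.22, the period of the sequence and the correlation property proved in the next theorem can all be checked with a short simulation. The Python sketch below is our own illustration (the recurrence and initial bits are exactly those of Example 4.22; the shift of 7 between the two slices is an arbitrary choice that is not a multiple of 15):

# Degree-4 LFSR of Example 4.22: coefficients (c0, c1, c2, c3) = (1, 0, 0, 1)
# and initial bits (a0, a1, a2, a3) = (0, 1, 0, 0).
c = [1, 0, 0, 1]
a = [0, 1, 0, 0]
while len(a) < 60:
    # a_{k+4} = c0*a_k + c1*a_{k+1} + c2*a_{k+2} + c3*a_{k+3} mod 2
    a.append(sum(ci * ai for ci, ai in zip(c, a[-4:])) % 2)

print(all(a[k] == a[k + 15] for k in range(40)))   # True: the sequence has period 15

def correlation(A, B):
    return sum(1 if x == y else -1 for x, y in zip(A, B))

print(correlation(a[0:15], a[7:22]))   # -1, as predicted by Theorem 4.11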

We can actually say more about the sequence produced by a LFSR as


constructed above.

Theorem 4.11. Consider the binary sequence produced by a LFSR of
degree r constructed as above. Let W1 = (an , an+1 , . . . , an+N−1 ) and W2 =
(am , am+1 , . . . , am+N−1 ) be two slices (or subsequences) of the sequence
with m > n and length N = 2^r − 1 (the minimal period of the sequence)
each. Then the correlation ν between W1 and W2 is given by:
ν = −1 if m − n is not a multiple of N , and ν = N if m − n is a multiple of N .
Proof. We use the definition of the correlation:
ν (W1 , W2 ) = Σ_{k=0}^{N−1} (−1)^(a_{n+k}) (−1)^(a_{m+k})
= Σ_{k=0}^{N−1} (−1)^θ(t^(n+k) ε(t)) (−1)^θ(t^(m+k) ε(t))   (by relation (4.11) above)
= Σ_{k=0}^{N−1} (−1)^[θ(t^(n+k) ε(t)) + θ(t^(m+k) ε(t))] .
By the linearity of the lead function, we get
ν (W1 , W2 ) = Σ_{k=0}^{N−1} (−1)^θ(t^(n+k) ε(t) + t^(m+k) ε(t)) = Σ_{k=0}^{N−1} (−1)^θ(t^(n+k) ε(t)(1 + t^(m−n))) .

If m − n is a multiple of N , then m − n = ρN for some integer ρ and
t^(m−n) = (t^N)^ρ = 1 since t^N = 1 (remember that t is the generator of a cyclic

group of order N ). So, 1 + t^(m−n) = 2 = 0 and (−1)^θ(t^(n+k) ε(t)(1+t^(m−n))) = 1
for all k = 0, . . . , N − 1. This implies that the correlation in this case is
ν = 1 + 1 + · · · + 1 = N (a sum of N terms equal to 1). Assume next that
m − n is not a multiple of N ; then
the polynomial 1 + t^(m−n) is non-zero and therefore ε(t)(1 + t^(m−n)) is also
non-zero as the product of two non-zero elements of the field Z2 [x]/⟨P (x)⟩.
As in the proof of Theorem 4.1, the fact that P (x) is chosen to be primitive
comes in very handy now: ε(t)(1 + t^(m−n)) ≠ 0 implies that
ε(t)(1 + t^(m−n)) = t^j for some j ∈ {0, 1, 2, . . . , N − 1}.
As k takes all values in the set {0, 1, . . . , N − 1}, the elements
t^(n+k) ε(t)(1 + t^(m−n)) = t^(j+n+k) are just permutations of the elements of
F∗2^r = {1, t, t^2 , . . . , t^(N−1)}. As seen above, the lead function θ takes the
value 0 on exactly half of the elements of the set F2^r and the value 1
on the other half. This implies in particular that Σ_{αi ∈ F2^r} (−1)^θ(αi) = 0.
Now, since (−1)^θ(0) = (−1)^0 = 1, the last sum in the above expression of
ν (W1 , W2 ) can be written as
Σ_{k=0}^{N−1} (−1)^θ(t^(n+k) ε(t)(1+t^(m−n))) = Σ_{αi ∈ F∗2^r} (−1)^θ(αi)
= Σ_{αi ∈ F2^r} (−1)^θ(αi) − (−1)^θ(0) = 0 − 1 = −1.

This proves that the correlation between the two finite sequences is −1 in
this case. 
This is indeed an amazing fact: take any two finite slices of the same length
2^r − 1 (the length of a period) in a sequence produced by a LFSR of degree r;
then you are sure that the number of terms which disagree is always one
more than the number of terms which agree (provided, as in the theorem,
that m − n is not a multiple of N = 2^r − 1). This may sound weird, but
having poorly correlated sequences of maximal length 2^r − 1 is important
for the GPS receiver since it makes the task of identifying satellites much
easier.

4.6.4 More about the signal


Sequences produced by LFSR on board of GPS satellites are modulated
by wave carriers and transformed into cycles of electrical “chips” or pulses

that we usually represent by sequences of 0’s and 1’s for simplicity. These
are just representations of low and high voltages.

Upon reception of a signal, the receiver tries immediately to match it


with one of the local replicas of the codes stored in its memory. As explained
earlier, the received code is not synchronized with any of the locally gen-
erated codes because of the runtime of the signal from the satellite and
the fact that the satellites are in constant movement. Once the receiver
succeeds to match the received code with one of the replicas, it can identify
the satellite from which the signal was emitted and starts to collect the
information needed to determine the travel time of the signal. To achieve
synchronization, the receiver shifts its locally generated signal by one chip
at a time and compares it with the captured signal by calculating the cor-
relation between two cycles of the codes. This process is repeated until a
maximal correlation is attained (and hence perfect synchronization between
the two signals).

In the above diagram, the first signal represents the replica of the satel-
lite code generated by the receiver with t = a being the time of departure
of a particular code cycle from the satellite. The second signal represents
the signal arriving at the receiver with t = b being the time of arrival of
the cycle to the receiver. Signals emitted by various GPS satellites are in
perfect synchronization and the departure time from the satellites of the
start of each cycle is known by the receiver. The runtime of the signal is
marked by dt. In a perfect scenario, the distance between the satellite and

the receiver is c · dt with c = 299792.458 km/s being the speed of light in a


vacuum. Unfortunately, many factors play a role in making the calculation
of the distance a bit more complicated than that. For instance, remember
that a major source of error comes from the fact that clocks on board of
the satellites and inside the receiver read different times at any given mo-
ment. But that can be fixed by looking at the signal of a fourth satellite
as explained in Section 4.4.5. To compute the time offset uncertainty, the
receiver records the number n of electrical chips needed to be shifted in
order to achieve synchronization (when the correlation is at a maximum)
between the two signals. On board of a GPS satellite, the LFSR used to
produce the PNR code is of degree 10 (has r = 10 registers), producing a
sequence of period 210 − 1 = 1023 bits by the above discussion. Practically,
this means that each cycle of the satellite signal is formed by 1023 electrical
chips. The code is generated at a rate of 1.023 megabits/sec which means
that a cycle repeats every millisecond (or 0.001 second). At the speed of
light, 0.001 second corresponds to a distance of 299.792458 km. Dividing
this distance with 1023 (the period of the sequence) gives an uncertainty
of about 300 m per chip.

In reality, analyzing the satellite code at the receiver end is more so-
phisticated than the above description. Many algorithms are implemented
to increase the efficiency of the receiver. These are beyond the scope of this
book.

4.7 A bit of history

The idea of locating one’s position on the surface of the planet goes back
deep in human history. Some ancient civilizations were able to develop nav-
igational tools (like the Astrolabe) to locate the position of ships in high
seas. But let us not go that deep in history, after all the chapter deals with
a very recent piece of technology.

In what follows, we give a brief history of the development of the Global


Positioning System.

• The story started in 1957 when the Soviet Union launched the satellite
Sputnik. Just days after, two American scientists were able to track its
orbit simply by recording changes in the satellite radio frequency.
• In the 1960’s, the American navy designed a navigation system for its

fleet consisting of 10 satellites. At that time, the signal reception was very slow and scientists worked hard to improve it.
• In the early 1970’s, Ivan Getting and Bradford Parkinson led a US de-
fense department project to provide continuous navigation information,
leading to the development of the GPS (formally known as NAVSTAR
GPS) in 1973.
• The launching of the first GPS satellite by the US military was accom-
plished in 1978.
• In 1980, the military activated atomic clocks on board GPS satellites.
• The year 1983 was a turning point in the development of the GPS
system. It was the year where the system started to take the form that
we use today. This came after the tragedy of Korean Air Lines Flight 007, which killed 269 people on board. The tragedy prompted US
president Ronald Reagan to declassify the GPS project and cleared the
way to allow the civilian use of the system.
• After many setbacks, full operational capability with 24 GPS satellites
was announced in 1995.
• In May 2000, the Selective Availability program was discontinued fol-
lowing an executive order issued by US president Bill Clinton to make
the GPS more responsive to civilian and commercial users worldwide.
This created a boom in the GPS devices production industry.
• In 2005, the GPS constellation consisted of 32 satellites, out of which
24 were operational and 8 were ready to take over in case of failure.
• In 2009, an alarming report by the US Government Accountability Office warned
that some GPS satellites could fail as early as 2010.
• The year 2010 marked an important step in the modernization of the
GPS system with the US government's announcement of a new contract to develop the GPS Next Generation Operational Control System.

4.8 References

Elliott Kaplan and Christopher Hegarty (editors) (2005). Understanding GPS: Principles and Applications, Second Edition. (Artech House).
Christiane Rousseau and Yvan Saint-Aubin (2008). Mathematics and Technology. (Springer).
Chapter 5

Image processing and face recognition

5.1 Introduction

In this chapter, we discuss the processing of images. Chances are you have
used at some point an image editing software like Photoshop or Gimp to
transform a photo. We discuss some of the mathematics underlying the
manipulation of images. Our approach will be to combine many images to
obtain the average image. Consider the following three digital images. The
image on the left is a picture taken at the Tremblant ski hill in 2015. The
image in the center is a picture taken in Vancouver in 2012 and the image
on the right is a picture taken in Morocco in 2015. We will combine the
faces from these images to obtain an “average” face.

Many police organizations have a database of facial images, which is


used to identify individuals who have been involved in a crime. The iden-
tification needs to be automated, since typically, the database will contain
a large number of images. We discuss a technique that compares an image
with images in a database for facial recognition.


5.1.1 Before you go further


Mathematical skills required for a good understanding of this chapter include basic knowledge of descriptive statistics and of basic linear algebra concepts like the dot product and orthogonal projection. Some experience working with and manipulating matrices is also necessary.

5.2 Raster image

There are many forms of digital images. Our goal here is to discuss the
manipulation of raster images. We can think of a raster image as an array
of pixels, where the pixel represents a unit square. For example, the above
image in Morocco has a height of 960 pixels and width of 720 pixels. The
image in Vancouver has a height of 720 pixels and a width of 960 pixels.
Lastly, the image at Tremblant is 960 × 540 pixels.
As seen in Chapter 3, a gray level is assigned to each pixel. The levels
are usually from 0 to 255, where 0 is black and 255 is white. Alternatively,
we can think of the levels as percentages, where 0% is black and 100% is
white.
Each black and white digital image can be represented by a matrix of
gray levels. For example, the image in Figure 5.1 is represented by the matrix
\[
C = \begin{pmatrix} 10 & 35 & 100 \\ 125 & 175 & 200 \end{pmatrix},
\]
where, for instance, the upper left pixel is assigned the gray level 10 and the lower right pixel is assigned the level 200.

Fig. 5.1 A raster image.



Transformations of the raster images correspond to transformations of


the corresponding matrix. The notation C[i, j] is used to indicate the color
(or gray level) in the ith row and the jth column of the image.
There are many systems to represent colored images. A widely used one
is the RED-GREEN-BLUE (RGB) color space. In the RGB color space,
each pixel is represented by a triplet (R, G, B) of integers in [0, 255] with
(R, G, B) = (0, 0, 0) corresponding to black and (R, G, B) = (255, 255, 255)
corresponding to white. A raster image in the RGB color space will have
three matrices, one for each color. In order to transform a color image, one
needs to transform each of the three matrices. For simplicity, we only look
at the transformation of grayscale images in this chapter.
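As a small illustration of these conventions, here is a sketch in Python with NumPy (our own code; only the gray levels of Figure 5.1 come from the text) of a grayscale image stored as a matrix and of an RGB image stored as three such matrices.

```python
import numpy as np

# Grayscale image of Figure 5.1: 2 rows (height) by 3 columns (width) of gray levels.
C = np.array([[ 10,  35, 100],
              [125, 175, 200]], dtype=np.uint8)

# NumPy indexes from 0 while the text indexes from 1, so C[1, 1] in the text is C[0, 0] here.
print(C[0, 0])    # gray level of the upper left pixel  -> 10
print(C[1, 2])    # gray level of the lower right pixel -> 200

# A color image in the RGB color space is three matrices of the same shape, one per channel.
rgb = np.stack([C, C, C], axis=-1)   # here all three channels are equal (a gray image)
print(rgb.shape)                     # (2, 3, 3): height, width, channels
```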

5.3 Invertible linear transformations

We will interpret a point in our image as an element (x, y) in R2 . We usually


consider the origin as the upper left corner of the image (recall the image
is rectangular). Increasing the first component (on the x-axis) by one unit
means moving to the right by one unit. Increasing the second component
(on the y-axis) by one unit means moving down by one unit. A unit square
is just one pixel. The reason to have the upper left corner as the origin is
the need to have an easy correspondence between the points on the image
and the matrix of the image. The color of the point (x, y) in the image is
C[⌈y⌉, ⌈x⌉], where ⌈x⌉ is the smallest integer greater than or equal to x (that is, x rounded up to the nearest integer).
For example, the point (0.5, 0.2) in Figure 5.1 is within the first pixel of the image. Since the gray level of the first pixel is 10, we have C[⌈0.2⌉, ⌈0.5⌉] = C[1, 1] = 10.
In this chapter we discuss rotations and uniform scaling, since these
are the transformations used to align faces. Recall that the origin (0, 0) is
the point in the upper left corner of the rectangular image. However, we
will want to apply the transformation (e.g. a rotation) about a point in the
facial region, say the point (x0 , y0 ), that we call the centroid. We represent
a point (x, y) on the image with a coordinate system with respect to the
centroid, by using the coordinates [vx , vy ] = [x − x0 , y − y0 ]. For example, if
the centroid is the point (2, 2), then the point (2, 1) can also be represented
as [2 − 2, 1 − 2] = [0, −1]. The color of [vx, vy] is C[⌈vy + y0⌉, ⌈vx + x0⌉],
since it corresponds to the point (vx + x0 , vy + y0 ) = (x, y).
Consider the image delimited by the rectangle ABCD in Figure 5.2,
which is 3 pixels wide and 4 pixels high. The grayscale matrix of this
image is
\[
C = \begin{pmatrix} 25 & 45 & 55 \\ 45 & 75 & 125 \\ 55 & 175 & 200 \\ 99 & 190 & 180 \end{pmatrix}. \tag{5.1}
\]
Using the point (2, 2) as the centroid, the point (x, y) = (0.5, 1.5) also has the coordinates [vx, vy] = [0.5 − 2, 1.5 − 2] = [−1.5, −0.5] with respect to the centroid. The color of this point is C[⌈1.5⌉, ⌈0.5⌉] = C[2, 1] = 45. The computation of the color when referring to the coordinates of the point with respect to the centroid is C[⌈−0.5 + 2⌉, ⌈−1.5 + 2⌉] = C[2, 1] = 45.

Fig. 5.2 Rotation of an image by π/6 radians.

There are many transformations in Photoshop or Gimp that are invert-


ible linear maps: rotation about a point, reflection about a line, a shear, a
uniform scaling, and a non-uniform scaling. Recall that a linear transformation from R^2 to R^2 is a function T : R^2 → R^2 such that T(v) = A_T v for some 2 × 2 matrix A_T, called the matrix of the linear transformation. In other words, if
\[
A_T = \begin{pmatrix} a & b \\ c & d \end{pmatrix},
\]
then the image v' = [v'_x, v'_y] of v = [v_x, v_y] (i.e. its location in the new image) is given by
\[
\begin{pmatrix} v'_x \\ v'_y \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} v_x \\ v_y \end{pmatrix} = \begin{pmatrix} a\,v_x + b\,v_y \\ c\,v_x + d\,v_y \end{pmatrix}.
\]
The matrix of the linear transformation that rotates the image by an
angle of θ radians (about the point (x0, y0), clockwise) is known to be
\[
R_\theta = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}.
\]

Its inverse R_θ^{-1} is the rotation matrix R_{-θ}. In other words, to invert the rotation, we apply a rotation with the same angle in the opposite direction. To uniformly scale the image by a factor of r (where r > 0), we can use the following matrix
\[
S_r = \begin{pmatrix} r & 0 \\ 0 & r \end{pmatrix} = r \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
So S_r = r I_2, where I_2 is the identity matrix of size 2 × 2. The inverse of the uniform scaling matrix S_r is
\[
S_r^{-1} = S_{1/r} = \begin{pmatrix} 1/r & 0 \\ 0 & 1/r \end{pmatrix} = \frac{1}{r} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \frac{1}{r}\, I_2.
\]
When applying both a uniform scaling by a factor r and a rotation of
angle θ, the order of the transformations is not important. The matrix for
the transformation that starts with a rotation followed by a uniform scaling
is AT = Sr Rθ . This matrix is equal to

Sr Rθ = r I2 Rθ = r Rθ = r Rθ I2 = Rθ (r I2 ) = Rθ Sr .

The last matrix in this equality is the matrix for the transformation that
is a uniform scaling followed by the rotation. So regardless of the order of
these two transformations the result is the same. In fact the matrix of the
linear transformation that corresponds to rotation of angle θ clockwise and
a uniform scaling of a factor r is
\[
r\, R_\theta = \begin{pmatrix} r\cos(\theta) & -r\sin(\theta) \\ r\sin(\theta) & r\cos(\theta) \end{pmatrix} = \begin{pmatrix} a & -b \\ b & a \end{pmatrix},
\]
where a = r cos(θ) and b = r sin(θ). Theorem 5.1 states that any matrix of
the same form as the matrix on the right-hand-side of the above equality can
be interpreted as a matrix of the composition of a rotation and a uniform
scaling.
Before stating the theorem, we remind the reader that an ordered pair
(x, y) ∈ R2 (which is not (0, 0)) has a unique polar coordinate representation
(r, θ), where r > 0 and θ ∈ (−π, π]. There is a one-to-one correspondence
between the cartesian coordinates (x, y) and the polar coordinates (r, θ).
To go from the polar form to the cartesian form, we use the relations
x = r cos(θ) and y = r sin(θ). To go from the cartesian form to the polar form, we use r = \sqrt{x^2 + y^2} and θ = atan2(y, x), where
\[
\operatorname{atan2}(y, x) =
\begin{cases}
\arctan(y/x), & \text{if } x > 0 \\
\arctan(y/x) + \pi, & \text{if } y \geq 0,\ x < 0 \\
\arctan(y/x) - \pi, & \text{if } y < 0,\ x < 0 \\
\pi/2, & \text{if } y > 0,\ x = 0 \\
-\pi/2, & \text{if } y < 0,\ x = 0 \\
\text{undefined}, & \text{if } x = 0,\ y = 0.
\end{cases}
\]

The atan2 function is sometimes called the “Four-Quadrant Inverse Tan-


gent”. Its domain is R2 \{(0, 0)} and its image is (−π, π]. It is implemented
in most modern computer languages and it is a convenient way to pass from
cartesian form to polar form.

Theorem 5.1. Let T be a linear map from R^2 to R^2 with the matrix
\[
A_T = \begin{pmatrix} a & -b \\ b & a \end{pmatrix},
\]
where a, b ∈ R and (a, b) ≠ (0, 0). Then T can be interpreted as a rotation of angle θ and a uniform scaling by a factor r, where
\[
r = \sqrt{a^2 + b^2}, \qquad \theta = \operatorname{atan2}(b, a).
\]

Proof. Since (r, θ) is the polar coordinate representation of (a, b), we have a = r cos(θ) and b = r sin(θ). Thus, we can write A_T = r R_θ, which is a clockwise rotation of angle θ and a uniform scaling of factor r. □
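As a quick numerical check of Theorem 5.1 (a sketch in Python; the entries a and b are made-up values), one can recover r and θ from the matrix and rebuild it:

```python
import math

a, b = 1.2, 0.7                      # hypothetical entries of [[a, -b], [b, a]]
r = math.hypot(a, b)                 # r = sqrt(a^2 + b^2)
theta = math.atan2(b, a)             # the four-quadrant inverse tangent

# Check: r * R_theta reproduces the original matrix.
A = [[r * math.cos(theta), -r * math.sin(theta)],
     [r * math.sin(theta),  r * math.cos(theta)]]
print(r, theta)
print(A)                             # approximately [[1.2, -0.7], [0.7, 1.2]]
```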
When applying a linear transformation (e.g. a rotation), we might end
up mapping some points outside of the region of the given image. As an
example, consider the rotation of the image in Figure 5.2. The boundary
of the region of the image corresponds to the rectangle ABCD. As we
rotate the image by an angle π/6 radians clockwise about the point (2, 2),
some points are mapped outside of the rectangle ABCD. In order not to
lose information, we increase the number of pixels. The new image, that
corresponds to the rectangle A0 B 0 C 0 D0 , will be of size 5 × 5 pixels, while
the original image was of size 3 × 4.
We first discuss the math involved in determining the new rectangle
A0 B 0 C 0 D0 and identifying the corresponding centroid (x00 , y00 ) for our new
image. The original rectangle ABCD is a convex set and the points A, B,
C and D are the only extremal points in this set. (Refer to Section 5.8 for
a discussion concerning convex sets.) If we are using an invertible linear
transformation to transform the image, then the region of the new image
will also be a convex set and the images of A, B, C and D, say T (A) =
[ax , ay ], T (B) = [bx , by ], T (C) = [cx , cy ], T (D) = [dx , dy ], respectively, are
going to be the only extremal points in this set. So to find the boundary
of the new image, we need the extreme abscissas:

\[
x'_{\min} = \min\{a_x, b_x, c_x, d_x\} \quad \text{and} \quad x'_{\max} = \max\{a_x, b_x, c_x, d_x\}.
\]

We will also need the extreme ordinates:


\[
y'_{\min} = \min\{a_y, b_y, c_y, d_y\} \quad \text{and} \quad y'_{\max} = \max\{a_y, b_y, c_y, d_y\}.
\]

We will round these extreme values to the nearest integer to work in pixels.
(We might lose a bit of information by rounding, but it should be negligible, i.e. not visible.) In Figure 5.2, x′min = −3, x′max = 2, y′min = −3, y′max = 2, measured from the centroid (2, 2). Thus, the vertices of the new rectangular image are A′ = [−3, −3], B′ = [2, −3], C′ = [2, 2], D′ = [−3, 2], respectively. So, the new rectangle should be x′max − x′min = 5 pixels wide and y′max − y′min = 5 pixels high. Since A′ is the point in the upper left corner of the
new image, it should correspond to the point (0, 0). But its coordinates
with respect to the centroid (2, 2) are given by the point A′ = [−3, −3], which means that it is 3 units above and 3 units to the left of the centroid. By making the centroid the point (3, 3), A′ will correctly be the point
(0, 0). So the centroid in the new image is the point (3, 3).
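The computation of the new rectangle can be summarized in a few lines of code. The following sketch (Python; our own illustration of the example of Figure 5.2, taking the four corners of the 3 × 4 image as (0, 0), (3, 0), (3, 4) and (0, 4)) reproduces the numbers found above.

```python
import math

theta = math.pi / 6                                 # rotation angle
x0, y0 = 2, 2                                       # centroid of the original image
corners = [(0, 0), (3, 0), (3, 4), (0, 4)]          # the four extremal points of the rectangle

def rotate(p):
    vx, vy = p[0] - x0, p[1] - y0                   # coordinates with respect to the centroid
    return (math.cos(theta) * vx - math.sin(theta) * vy,
            math.sin(theta) * vx + math.cos(theta) * vy)

imgs = [rotate(p) for p in corners]
x_min = round(min(v[0] for v in imgs)); x_max = round(max(v[0] for v in imgs))
y_min = round(min(v[1] for v in imgs)); y_max = round(max(v[1] for v in imgs))

print(x_max - x_min, y_max - y_min)   # new width and height in pixels: 5 5
print(-x_min, -y_min)                 # centroid of the new image: 3 3
```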

5.4 Gray level for the new image

Now that we have the appropriate rectangular region for the transformed
image, we must color it. In other words, we must determine the gray level
for each pixel in the image. Suppose that the new image has a width of w′ pixels and a height of h′ pixels (in Figure 5.2, we have w′ = 5 and h′ = 5). It is important to keep in mind the center of the image. In the case of Figure 5.2, it is (x′0, y′0) = (3, 3). We need to find C′[i, j] for i = 1, 2, . . . , h′ and j = 1, 2, . . . , w′, where C′ is the matrix of the transformed image.
To find the color, we must think in terms of the inverse transformation. C′[i, j] is the color of a pixel, which is a unit square. We will use the point (j − 1/2, i − 1/2) (which is the center of the corresponding pixel) as a representative of the pixel. As an example, let us consider the gray level C′[3, 4] in the new image. This is the color of the pixel which has (x′, y′) = (3.5, 2.5) as its representative. Expressing the point in terms of coordinates
with respect to the centroid (3, 3) gives us [3.5 − 3, 2.5 − 3] = [0.5, −0.5].
Let us back-transform this vector.
The transformation from the original image to the new image is a ro-
tation of π/6 radians about the centroid. The inverse transformation is
a rotation of −π/6 radians about the center. The point with coordinates
v′ = [0.5, −0.5] with respect to the centroid (3, 3) has the pre-image
\[
v = R_{-\pi/6}\, v' = \begin{pmatrix} \cos(-\pi/6) & -\sin(-\pi/6) \\ \sin(-\pi/6) & \cos(-\pi/6) \end{pmatrix} \begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix} \approx \begin{pmatrix} 0.183 \\ -0.683 \end{pmatrix}.
\]

Recall that the centroid in the original image is the point (2, 2). So the
representative of our pixel is located at the point (2.183, 1.317). The
unit square centered at (2.183, 1.317), which is highlighted in Figure 5.3,
is overlapping four pixels in the original image (i.e. the rectangular re-
gion ABCD). Naturally, we should use a combination of the four colors:
C[1, 2] = 45, C[1, 3] = 55, C[2, 2] = 75, and C[2, 3] = 125. The largest over-
lap is with the pixel in the second row and the third column of the rectan-
gular region ABCD. Somehow, the color should be closest to C[2, 3] = 125.

Fig. 5.3 The pixel centered at [0.5, −0.5] is back transformed.

5.5 Bilinear interpolation

In this section, we discuss the assignment of a color C[i∗, j∗] to our pixel
in the case where i∗ and j∗ are not integers. This corresponds to the case
where the back transformed pixel is overlapping pixels in the original image.
To this end, we use a technique called bilinear interpolation.
Consider the color C[i∗, j∗] = C[i + p, j + q], where i, j are integers and
0 ≤ p < 1, 0 ≤ q < 1. Its bilinear interpolation is

C[i∗, j∗] = (1 − p)(1 − q) C[i, j] + (1 − p) q C[i, j + 1] + p (1 − q) C[i + 1, j] + p q C[i + 1, j + 1].

We should note that it is a linear function in (p, q) with a bilinear term,


that is p q. In fact, it is a weighted average of the four colors, where the
weights are relative to the overlapping area with our back transformed pixel.
The more the corresponding pixel overlaps our back transformed pixel, the
larger its weight.
Note that the lower right point in a pixel is equivalent to referring to the
pixel in terms of its column and its row. For example, in the rectangular
region ABCD from Figure 5.3, the lower right point of the pixel in the
3rd column and the second row is the point (3, 2). We want the color
of the pixel whose representative is (2.183, 1.317). Its lower right corner
is (2.183 + 0.5, 1.317 + 0.5) = (2.683, 1.817). This pixel is highlighted in
Figure 5.3. It is overlapping four pixels in the original image (i.e. the
rectangular region ABCD). We will compute the gray level C[1.817, 2.683]
by using the following four gray levels:

C[1, 2] = 45, C[1, 3] = 55, C[2, 2] = 75, and C[2, 3] = 125.

We could start by assuming that the row was an integer, say i = 1 or


i = 2. In both of those cases, consider a linear approximation in j:

C[1, 2.683] ≈ (1 − 0.683) C[1, 2] + 0.683 C[1, 3] = 51.83,

and

C[2, 2.683] ≈ (1 − 0.683) C[2, 2] + 0.683 C[2, 3] = 109.15.

The last step is to do a linear approximation in i:

C[1.817, 2.683] ≈ (1 − 0.817) C[1, 2.683] + 0.817 C[2, 2.683] ≈ 98.7.

These steps give us the bilinear interpolation C[1.817, 2.683], which is a weighted average of the four colors:

98.7 ≈ (5.8%) C[1, 2] + (12.5%) C[1, 3] + (25.9%) C[2, 2] + (55.8%) C[2, 3].
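The whole interpolation fits in a short function. Here is a sketch (Python; our own code, using the matrix C of equation (5.1)) that reproduces the computation above.

```python
import math

C = [[25, 45, 55],
     [45, 75, 125],
     [55, 175, 200],
     [99, 190, 180]]

def bilinear(C, i_star, j_star):
    # C is indexed from 1 in the text; Python lists are indexed from 0.
    i, p = int(math.floor(i_star)), i_star - math.floor(i_star)
    j, q = int(math.floor(j_star)), j_star - math.floor(j_star)
    return ((1 - p) * (1 - q) * C[i - 1][j - 1] + (1 - p) * q * C[i - 1][j]
            + p * (1 - q) * C[i][j - 1] + p * q * C[i][j])

print(bilinear(C, 1.817, 2.683))   # ~ 98.7, a weighted average of 45, 55, 75 and 125
```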

5.6 The centroid of the face

The purpose of this section is to compute the “average face”. Note that we
cannot just compute the average of the matrices for the three images on
page 143 for example. The images could be of different sizes. Furthermore,
the faces could be at different locations in the image. We need to determine
the location of the face in each image. To do so, we use eight landmarks.
We identify the location of the inner and outer eye, the nostril, and the
outer mouth, both on the right and the left. Each landmark is located at
a certain row (height) and column (width).
We define the centroid of the face as being the point (c, r), where c and
r are respectively the averages of the eight columns and the eight rows of
the landmarks. In Figure 5.4, the landmarks are identified with x’s and the
centroid by a small circle.
To compute the average of our three faces in the images on page 143,
for each image, we keep an image of size 200 pixels by 200 pixels centered
about the centroid of the face. Each image is represented by a matrix.
We compute the average of the three matrices. The result is in Figure 5.6
on page 154. The original images are in the diagonal of the array. The
images off the diagonal in the upper right corner are pairwise averages.
The average of the three images is in the lower corner on the left.
We should notice that finding the center of the face is not sufficient
to make an average face. The eyes in image 1 (at Tremblant) are slanted
compared to the eyes in image 2 (at Vancouver). Furthermore, image 3 (at
Morocco) appears to be on a different scale than the other images. The
face appears to be closer to the camera in this last image. We will have to
transform the images to try to align the landmarks, to get the eyes on the

Fig. 5.4 Eight landmarks to find the centroid.



Fig. 5.5 Landmarks are vectors with respect to the centroid.

eyes, and so on.

5.7 Optimal transformation for the average face

In this section, we discuss the transformation of the images to try to align


the landmarks. We refer to a point (x, y) in the image by using the coor-
dinates [x − c, y − r] with respect to the centroid. In Figure 5.5, we are
displaying each landmark as a vector. As you can imagine, it is important
that a transformation does not change the shape of the face. To this end,
we use a uniform scaling and a rotation about the centroid of the face.
To find the “optimal” transformation to align the faces, we need to
define a measure of distance between the two faces. Since a landmark is a
vector, we could define the distance between corresponding landmarks as
the Euclidian distance between vectors. Consider two vectors in R2 , say
v = [v_x, v_y] and u = [u_x, u_y]. The Euclidean distance between v and u is
\[
d(v, u) = \sqrt{(v_x - u_x)^2 + (v_y - u_y)^2}.
\]

For example, the vector corresponding to the exterior right eye is [−25, −8]
for Tremblant, but it is [−23.5, −9.5] in Vancouver. So the distance between
these corresponding landmarks is \sqrt{(-25 - (-23.5))^2 + (-9.5 - (-8))^2} ≈ 2.121.
To get a total distance (in square units) between the faces, we square
the Euclidian distance between the two landmarks and compute the sum
of the squared distances over the eight landmarks. For our three images,
we compute the total distance (in square units):

Q(Tremblant; Vancouver) = 56,



Q(Tremblant; Morocco) = 297,

Q(Vancouver; Morocco) = 266.


Since the face in Vancouver is the closest to the other images, we try to
align the other two images to it.
Consider the transformation of the face in the image at Tremblant. We
want it to be as “close” as possible to the face in the image in Vancouver.
We can try to find a uniform scaling and rotation that will give a distance
of zero, but in practice such a transformation will not exist. Why? The
image is a planar cross-section of space and the images in different photos
usually do not correspond exactly to the same cross-section. So the best
we can do is to try to make the distance as small as possible.

Fig. 5.6 Averaging centered faces.



Fig. 5.7 Optimal transformation of the faces.

We transform the landmarks in Tremblant with a uniform scaling and a


rotation about the centroid of the face. The matrix of the transformation is of the form
\[
A_T = \begin{pmatrix} a & -b \\ b & a \end{pmatrix},
\]
where a and b are real numbers. The vector
for the exterior right eye is [−25, −8] for Tremblant. After applying the
transformation, the vector becomes
\[
\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} -25 \\ -8 \end{pmatrix} = \begin{pmatrix} -25\,a + 8\,b \\ -25\,b - 8\,a \end{pmatrix}.
\]
We want this vector to correspond to the vector for the exterior right eye in
Vancouver, which is [−23.5, −9.5]. So we get the following system of linear
equations in (a, b):
−23.5 = −25 a + 8 b
−9.5 = −25 b − 8 a.

Repeating for the other seven landmarks, we get a system of 16 linear equa-
tions with two unknowns. Since we will not be able to get the landmarks

Fig. 5.8 Averaging of the faces after alignment.

to align exactly, the system will be inconsistent. However, we will be able


to find a solution to this system of equations with the least squares method
described in Section 5.9. The solution will align the landmarks as “close”
as possible.
Once we have the values for a and b, we can transform the image. We
can also use Theorem 5.1 to interpret the transformation in terms of a
rotation and a uniform scaling. The optimal transformations are described
in Figure 5.7.
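To make the alignment step concrete, here is a small sketch (Python with NumPy; our own code, and only the exterior right eye coordinates come from the text, the second landmark pair being hypothetical) of how the over-determined system for (a, b) can be built and solved by least squares, with r and θ then read off via Theorem 5.1.

```python
import numpy as np

def align(source, target):
    # Least-squares estimate of (a, b) so that [[a, -b], [b, a]] maps the source
    # landmarks (w.r.t. the centroid) as close as possible to the target landmarks.
    rows, rhs = [], []
    for (sx, sy), (tx, ty) in zip(source, target):
        # [[a, -b], [b, a]] [sx, sy]^t = [a*sx - b*sy, b*sx + a*sy]^t
        rows.append([sx, -sy]); rhs.append(tx)
        rows.append([sy,  sx]); rhs.append(ty)
    X, Y = np.array(rows, float), np.array(rhs, float)
    (a, b), *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a, b

# First pair: exterior right eye (Tremblant -> Vancouver); second pair is made up.
tremblant = [(-25.0, -8.0), (-10.0, -9.0)]
vancouver = [(-23.5, -9.5), (-9.0, -10.5)]
a, b = align(tremblant, vancouver)
r, theta = np.hypot(a, b), np.arctan2(b, a)   # Theorem 5.1: scaling factor and rotation angle
print(a, b, r, theta)
```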
Now that the landmarks are aligned (or at least they are as “close” as
possible by using the least-squares method), we are ready to construct the
average. The average of the three aligned faces is in position (3, 1) in Figure
5.8. The image at position (1, 2) in Figure 5.8 is the average of the diagonal
images in positions (1, 1) and (2, 2). The image at position (1, 3) in Figure
5.8 is the average of the diagonal images in positions (1, 1) and (3, 3). The
image at position (2, 3) in Figure 5.8 is the average of the diagonal images
in positions (2, 2) and (3, 3). Compare the averages in Figure 5.6 to those
in Figure 5.8. By using the least-squares method, we are able to get a much better result.

5.8 Convex sets and extremal points

In this section, we study the concept of convex sets and extremal points.

Definition 5.1.
(i) A subset K of R2 is called a convex set if for every pair of vectors ~v1
and ~v2 in K, we have
t ~v1 + (1 − t)~v2 ∈ K,
for every t ∈ [0, 1]. We say that the vector t ~v1 + (1 − t)~v2 is a convex
combination of ~v1 and ~v2 .
(ii) Let K be a convex set and ~v an element of K. We say that ~v is an
extremal point, if K \ {~v } (i.e. K without ~v ) is also a convex set.

We can think of convex combinations as all the vectors that lie on the
“line segment” between ~v1 and ~v2 . Refer to the Figure 5.9 for an example
of a convex combination of vectors. It should be evident that a rectangular
image is a convex set in R2 . Furthermore, the four vertices of the rectangle
are the only extremal points in the set.
Given a linear map T : R2 → R2 and a convex set K ⊆ R2 , is the image
T (K) of K a convex set? Moreover, if K is convex and ~v is an extremal
point of K, is T (~v ) an extremal point of T (K)? The following theorem
gives the answer with the assumption that T is invertible.

Theorem 5.2. Let T : R2 → R2 be a linear map and K ⊆ R2 a convex


set.
(a) The image T (K) = {T (~v ) : ~v ∈ K} of K is a convex set.
(b) Assume T is invertible. Then, ~v ∈ K is an extremal point if and only
if T (~v ) is an extremal point of T (K).

Fig. 5.9 Convex combination of two vectors.

Proof.

(a) Let v1 , v2 ∈ T (K), and write v1 = T (u1 ), v2 = T (u2 ) for some vectors
u1 , u2 ∈ K. For any t ∈ [0, 1], we have

t v1 + (1 − t) v2 = t T (u1 ) + (1 − t) T (u2 )


= T (t u1 + (1 − t) u2 ) (by the linearity of T ).

Since K is convex, t u1 + (1 − t) u2 ∈ K. We conclude that t v1 + (1 −


t) v2 = T (u) for some u ∈ K and therefore t v1 + (1 − t) v2 ∈ T (K).
This shows that T (K) is convex.
(b) Assume for this part that T is invertible. This means that T is a bijective map (injective and surjective) in addition to being linear. Let v ∈ K be any point. We claim that T(K) \ {T(v)} = T(K \ {v}). To see this, let w ∈ T(K) \ {T(v)}; then w ∈ T(K) but w ≠ T(v). So w = T(u) for some u ∈ K and w = T(u) ≠ T(v). Since T is bijective, the latter inequality implies that u ≠ v. Therefore, w = T(u) with u ∈ K \ {v}, so w ∈ T(K \ {v}). This shows that T(K) \ {T(v)} ⊆ T(K \ {v}). The other inclusion is proven similarly.
If v ∈ K is extremal, we need to show that T(v) is also an extremal point of T(K). To see this, let w1, w2 ∈ T(K) \ {T(v)} and let t ∈ [0, 1]. Then w1 = T(u1), w2 = T(u2) with u1, u2 ∈ K \ {v}, and we have t w1 + (1 − t) w2 = T(t u1 + (1 − t) u2). Since v is an extremal point of K, t u1 + (1 − t) u2 ∈ K \ {v}. We conclude that t w1 + (1 − t) w2 ∈ T(K \ {v}) = T(K) \ {T(v)}. This proves that T(v) is an extremal point of T(K). The converse is proved similarly. □
point of T (K). The converse is proved similarly.


5.9 Least squares method

In this section, we discuss the least squares method to find the “best” solu-
tion to an inconsistent system of linear equations. For example, it is possible
that we cannot find a transformation that aligns the image perfectly. In
this case, we will have to content ourselves with finding the transformation
that minimizes the distance between landmarks.
Here is the general setting. Given a system of n linear equations in p
variables β_1, β_2, . . . , β_p:
\[
\begin{cases}
y_1 = \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} \\
y_2 = \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} \\
\;\;\vdots \\
y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np}
\end{cases}
\]

we start by writing the system in matrix form: Y = X β, where
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.
\]

We are interested in finding a vector β̂ ∈ Rp , such that X β̂ is as “close” as


possible to Y . Of course, we have to clearly define what we mean by “close”
in this setting. We start by reviewing some notions from linear algebra.

5.9.1 Dot product - Inner product

Definition 5.2. Let


   
v1 w1
 v2   w2 
v= .  and w =  . 
   
 ..   .. 
vn wn
be two vectors in Rn . The dot product of v and w is defined as the scalar
n
X
v·w = vi wi = v t w. (5.2)
i=1

The last expression in (5.2) gives the dot product in terms of matrix mul-
tiplication. This means in particular that the dot product inherits many
of its properties from matrix multiplication. Some of these properties are


listed in the following theorem.
Theorem 5.3. Let u, v and w be vectors in Rn and let c ∈ R. The dot
product satisfies the following properties.
(i) v · w = w · v (commutativity of the dot product)
(ii) (u + v) · w = (u · w) + (v · w) (distributivity of the dot product with
respect to addition)
(iii) (c v) · w = c (v · w) = v · (c w)
(iv) v · v ≥ 0
(v) v · v = 0 if and only if v = 0.
Proof.
(i) Write v = (v_1, v_2, . . . , v_n)^t and w = (w_1, w_2, . . . , w_n)^t. Then v · w = \sum_{i=1}^{n} v_i w_i = \sum_{i=1}^{n} w_i v_i = w · v.
(ii) (u + v) · w = (u + v)t w = (ut + v t ) w = ut w + v t w = (u · w) + (v · w).
(iii) (c v) · w = (c v)^t w = (c v^t) w = c (v^t w) = c (v · w).
(iv) Write v = (v_1, v_2, . . . , v_n)^t; then
\[
v \cdot v = \sum_{i=1}^{n} v_i^2 \geq 0.
\]
(v) If v = 0, then v_i = 0 for i = 1, . . . , n, and therefore v · v = \sum_{i=1}^{n} v_i^2 = 0. Conversely, if v · v = 0, then \sum_{i=1}^{n} v_i^2 = 0. Since all the terms in the sum are non-negative, the only way this sum can equal zero is that v_i = 0 for i = 1, . . . , n. Hence, v = 0.

The notion of the dot product of vectors in Rn is a particular case of
a certain type of vector product in a general vector space V , namely the
inner product.

Definition 5.3. Let V be a vector space over R. An inner product on V


is a function < ·, · > that transforms a couple (u, v) of vectors in V into
a scalar < u, v > such that for all vectors u, v, w ∈ V and for every scalar
c ∈ R, the following properties are satisfied:

(i) < v, w > = < w, v >


(ii) < u + v, w > = < u, w > + < v, w >
(iii) < c v, w > = c < v, w >
(iv) < v, v > ≥ 0
(v) < v, v > = 0 if and only if v = 0.

Clearly, the dot product defined above in Rn is an inner product. Recall


that we expressed earlier the dot product in Rn in terms of matrix multipli-
cation, namely v · w = v t w. Using matrix multiplication, we will introduce
another inner product in Rn . First, recall some well known notions and
facts from Linear Algebra.

For an n × p matrix X.

• The rank of X (denoted rank(X)) is the number of leading 1’s in any


echelon form of X.
• The nullity of X (denoted dim(Null(X))) is the dimension of the sub-
space

Null(X) = {v ∈ Rp : Xv = 0} of Rp .

• The rank-nullity theorem states that

p = rank(X) + dim(Null(X)).

• The following statements are equivalent:


(i) The columns of X are linearly independent vectors in Rn .
(ii) The homogeneous system of equations Xv = 0 has only the trivial
solution v = 0.
(iii) rank(X) = p.
(iv) dim(Null(X)) = 0.

We use the above facts to prove the following result.

Theorem 5.4. Let X be an n × p matrix. If the columns of X are linearly


independent, then X t X is invertible.
Proof. The p × p matrix X^t X is invertible if and only if its rank is equal to p.


By the rank-nullity theorem, rank(X t X) = p − dim(Null(X t X)). It suffices
therefore to show that Null(X t X) = {0}.
Since the columns of X are linearly independent, then its nullity is zero
(which implies that Null(X) = {0}). Let v be a vector in Null(X t X). We
show that v must be the zero vector. Since, v ∈ Null(X t X), then X t X v =
0. This implies that v t X t X v = v t 0 = 0. But, v t X t X v = (X v)t (X v) =
(X v) · (X v) (where u · w is the dot product in Rn ). By property (v) of the
dot product, we conclude that X v = 0 and therefore v ∈ Null(X) = {0}.
Thus, v = 0 and therefore Null(X^t X) = {0}. □

Given an n × p matrix X with rank(X) = p and two vectors u and v in


Rp written as columns, define

< u, v > = ut (X t X) v.

Notice that the expression ut (X t X) v is indeed a scalar.

Theorem 5.5. Given an n × p matrix X with rank(X) = p, the formula


< u, v > = ut (X t X)v is an inner product in Rp .

Proof. Let u, v, and w be vectors in Rp written as columns and let c be


a real scalar. Notice that

< u, w > = ut (X t X) w = (X u)t (X w) = (X u) · (X w),

where · is the dot product on Rn . Both X u and X w are vectors in Rn .

(i) By the commutativity of the dot product, we get < u, w > = (X u) ·


(X w) = (X w) · (X u) = < w, u >.
(ii) < u + v, w > = (u + v)t (X t X) w = (ut + v t ) (X t X) w = ut (X t X) w +
v t (X t X) w = < u, w > + < v, w >.
(iii) < c v, w > = (c v)t (X t X) w = (c v t ) (X t X) w = c (v t (X t X) w). The
latter is equal to c < v, w >. So, < c v, w > = c < v, w >.
(iv) < v, v > = (X v) · (X v) ≥ 0, by property (iv) of the dot product.
(v) Clearly, if v = 0, then < v, v > = 0. Conversely, let v be a vector in Rp
such that < v, v > = 0. So, 0 = < v, v > = v t (X t X) v = (X v) · (X v).
By property (v) of the dot product, we get X v = 0. Remember
that X was chosen to be of rank p, which means in particular that
the homogeneous system of equations X v = 0 has a unique solution
v = 0. Therefore, < v, v > = 0 implies that v = 0.


Let us now go back to our original problem in this section. We are


interested in finding the vector β ∈ Rp such that Xβ is as “close” as
possible to Y. The Euclidean distance between Y and Xβ is
\[
\| Y - X\beta \| = \Bigl(\underbrace{\sum_{i=1}^{n} (y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2}_{Q(\beta)}\Bigr)^{1/2} = \sqrt{Q(\beta)}.
\]

Finding β̂ ∈ Rp such that X β̂ is “as close as possible” to Y means mini-


mizing the expression Q(β). Note that
\[
\begin{aligned}
Q(\beta) &= \sum_{i=1}^{n} \bigl[y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip})\bigr]^2 \\
&= (Y - X\beta)^t (Y - X\beta) \\
&= Y^t Y - Y^t X\beta - \beta^t X^t Y + \beta^t X^t X\beta \\
&= Y^t Y - (\beta^t X^t Y)^t - \beta^t X^t Y + \beta^t X^t X\beta \\
&= Y^t Y - 2\,\beta^t X^t Y + \beta^t X^t X\beta,
\end{aligned}
\]

since β t X t Y is just a scalar, then it is equal to its transpose. Moreover, if


we assume that X is of rank p, the columns of X are linearly independent,
and therefore X t X is invertible by Theorem 5.4. We have
Q(β) = Y t Y − 2 β t (X t X)(X t X)−1 X t Y + β t X t Xβ
= Y t Y − 2 < β, (X t X)−1 X t Y > + < β, β >
= Y t Y − 2 < β, (X t X)−1 X t Y > + < β, β >
+ < (X t X)−1 X t Y, (X t X)−1 X t Y >
− < (X t X)−1 X t Y, (X t X)−1 X t Y >
= Y t Y + < β − (X t X)−1 X t Y, β − (X t X)−1 X t Y >
− < (X t X)−1 X t Y, (X t X)−1 X t Y >
= Q1 (β) + C,
where
Q1 (β) = < β − (X t X)−1 X t Y, β − (X t X)−1 X t Y >
and
C = Y t Y − < (X t X)−1 X t Y, (X t X)−1 X t Y > .

Note that the expression C is a constant with respect to β. Thus, to


minimize Q(β) it suffices to minimize Q1 (β). Using property (iv) of the
inner product, Q1 (β) ≥ 0 for all β ∈ Rp (since it is of the form < u, u >).
Therefore, the minimal value for Q1 (β) is zero. Using property (v) of the
inner product, Q1 (β) = 0 if and only if β−(X t X)−1 X t Y = 0 or equivalently
β = (X t X)−1 X t Y .

Here is a summary of the least squares method:

• Given a linear system Y = Xβ, where X is an n × p matrix of rank


p, the best approximation of a solution to the system is given by the
vector

β̂ = (X t X)−1 X t Y.

• Note that if X has fewer rows than columns, then rank(X) < p and
the method of least squares will not work.
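As a short illustration of this summary (a sketch in Python with NumPy; the small system below is made up for the example), the formula can be applied directly, or through a standard least squares solver.

```python
import numpy as np

# An inconsistent system Y = X beta with n = 4 equations and p = 2 unknowns (rank(X) = 2).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([3.1, 2.9, 7.2, 6.8])

# The least squares formula derived above: beta_hat = (X^t X)^{-1} X^t Y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta_hat)

# The same answer from a numerically more stable solver.
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```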

5.10 Face recognition

In this section, we discuss an algorithm that is used for face recognition. We


have an image of the face of an individual suspected of committing a crime
and we want to compare this image to images of “average” faces stored in a
database. To do so, we measure the distance between the suspect’s image
and the images in the database. We discuss a measure of distance based
on the method of principal components, which is called the “Eigenface”
method.
To illustrate the method, assume that our database consists of the six
images of penguins in Figure 5.10. As you can imagine, a face is not per-
fectly the same from one image to another. For example, the lighting can
vary from image to image. To simulate a variation in the image, we have
distorted the image of first penguin (named Alice) and the fourth penguin
(named Bob) from Figure 5.10.
The distorted images of Alice and Bob are found in Figure 5.11. A
human eye can clearly identify the images in Figure 5.11 as being those
of Alice and Bob, but with a more severe level of distortion and a bigger
image database, it would be much harder to identify images using just our
human eyes. We explore a method that enables a computer to identify the two distorted images in Figure 5.11 as being Alice and Bob, respectively.

Fig. 5.10 A database of six penguins.

Fig. 5.11 Distorted images of Penguins 1 and 4.

5.10.1 Descriptive statistics


5.10.1.1 At the level of the image
The images in Figure 5.10 are grayscale raster images, each of size 50 × 50
pixels. Each image has p = 50 × 50 = 2500 grayscale levels (also called
intensities). To describe an image, we compute the mean intensity and
the standard deviation of the intensities. We also define a standardized
intensity to control varying lighting conditions.
Consider the image of Alice, that is the first penguin in Figure 5.10. We
write the corresponding intensities as a column vector x:
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{2500} \end{pmatrix} = \begin{pmatrix} 255 \\ 255 \\ \vdots \\ 240 \end{pmatrix}.
\]
Recall that these intensities are written in a raster fashion, left to right,
top to bottom. The jth component of x (denoted xj ) is the intensity of
the jth pixel. The mean intensity is
\[
\bar{x} = \frac{\sum_{j=1}^{p} x_j}{p} = \frac{255 + 255 + \cdots + 240}{2500} = 184.2088
\]
and the standard deviation of the intensity is
\[
\sigma_x = \sqrt{\frac{\sum_{i=1}^{p} (x_i - \bar{x})^2}{p}}
= \sqrt{\frac{(255 - 184.2088)^2 + (255 - 184.2088)^2 + \cdots + (240 - 184.2088)^2}{2500}}
= 95.24998.
\]

Looking at the expression inside the square root of the standard deviation,
we observe that it computes the average squared deviation away from the
mean. It represents an average distance away from the mean in squared
units. The square root is used to obtain a measure in the same units as the
intensities. So the standard deviation is a measurement of variability. The
more the intensities are dispersed about the mean, the larger the standard
deviation. The more the intensities are concentrated about the mean, the
smaller the standard deviation.

Fig. 5.12 Descriptive statistics of the 6 images of the penguins.

The descriptive statistics for each of the n = 6 penguins in our database


are displayed in Figure 5.12. The distributions of the intensity are not the
same for these six images. Since pictures of faces are taken under varying
lighting conditions, the images of faces usually exhibit varying distributions
of intensity. We want a system that compares varying characteristics of
faces, not varying lighting conditions.
We use a standardized intensity: zj = (xj − x)/σx , for j = 1, 2, . . . , p.
The mean of the standardized intensities is
\[
\bar{z} = \frac{1}{p} \sum_{j=1}^{p} \frac{x_j - \bar{x}}{\sigma_x} = \frac{1}{\sigma_x\, p} \Bigl(\sum_{j=1}^{p} x_j - p\,\bar{x}\Bigr) = 0,
\]
since p\,\bar{x} = \sum_{j=1}^{p} x_j.

The variance (i.e. the square of the standard deviation) of the stan-
dardized intensities is
\[
\sigma_z^2 = \frac{1}{p}\sum_{i=1}^{p}(z_i - \bar{z})^2 = \frac{1}{p}\sum_{i=1}^{p} z_i^2 \;\;(\text{since } \bar{z} = 0)
= \frac{1}{p}\sum_{i=1}^{p}\Bigl[\frac{x_i - \bar{x}}{\sigma_x}\Bigr]^2 = \frac{1}{p\,\sigma_x^2}\sum_{i=1}^{p}(x_i - \bar{x})^2 = \frac{\sigma_x^2}{\sigma_x^2} = 1.
\]

By using the standardized intensities to compare images, we have control


on varying lighting conditions since all the images would have the same
mean and the same standard deviation.
To compare two images, we use the Euclidian distance between their
respective vectors of standardized intensities. Let z, w be the respective
vectors of the standardized intensities for the two images. We use the
following metric (i.e. measure of distance) to compare the two images:
\[
\| z - w \| = \sqrt{\sum_{j=1}^{p} (z_j - w_j)^2} = \sqrt{(z - w)^t (z - w)}.
\]

Bearing in mind that someone’s face can vary from image to image, this
means that the image we are comparing to the database will not be exactly
the same as any of the images in the database. So we will need to rank
them from the closest to our image to the furthest away.
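Here is a small sketch (Python with NumPy; our own code, with randomly generated placeholder arrays standing in for the 50 × 50 penguin images) of the standardization and of the distance used for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(50, 50)).astype(float)   # placeholder image data
img_b = rng.integers(0, 256, size=(50, 50)).astype(float)   # placeholder image data

def standardize(img):
    x = img.ravel()                      # raster order: left to right, top to bottom
    return (x - x.mean()) / x.std()      # mean 0 and standard deviation 1

z, w = standardize(img_a), standardize(img_b)
distance = np.linalg.norm(z - w)         # Euclidean distance between the two images
print(distance)
```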
We compare the distorted images of Alice and Bob (see Figure 5.11) to
our database of six penguins (see Figure 5.10). The distances between the
images are found in Table 5.1. Among the six penguins in the database, it is
the image of Alice that is closest to the distorted image of Alice. However,
the distance of 14 between the two images of Alice is not considered very
small.
In practice, an arbitrary threshold ε is used as a rule to determine if the person in the image is in the database. If ε were set to 10, then the
computer would conclude that the penguin in the distorted image of Alice
is not in the database. The conclusion would be similar for Bob, since the
distorted image of Bob is more than 10 units away from all of the images
in the database.
The root of the problem is that we are comparing all of the p = 2500
pixels. We are using some information in the comparison that is not relevant
to the differences between the 6 penguins in the database. We need to
describe the images at the level of the pixels to identify variations between
the images of the penguins.

Table 5.1 Distances between the images of the penguins.


                   Penguin 1   Penguin 2   Penguin 3   Penguin 4   Penguin 5   Penguin 6
Distorted Alice    14.0        54.8        62.5        47.5        52.1        49.0
Distorted Bob      47.1        52.7        61.9        11.2        59.0        52.9
Penguin 1           0          54.3        62.0        46.8        51.7        48.3
Penguin 2                       0          57.1        52.8        66.0        60.9
Penguin 3                                   0          62.0        64.7        61.3
Penguin 4                                               0          58.9        53.3
Penguin 5                                                           0          39.9
Penguin 6                                                                       0

5.10.1.2 At the level of a pixel


Let z 1 , z 2 , . . . , z n be the respective vectors of the standardized intensities
of the n = 6 images in the database. They correspond to the 6 penguins in
Figure 5.10. Consider these vectors as the rows of an n × p matrix (called
a data array by statisticians):
\[
Z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & \cdots & c_p \end{pmatrix}. \tag{5.3}
\]
The jth column (that is cj ∈ Rn ) of the data array Z is the vector of the
standardized intensities of the jth pixel from the n = 6 images.
If the intensities between the images are very dispersed (i.e. there is
a large variance) at a particular pixel, then this pixel will be considered
important in the differentiation of the images. Furthermore, if two pixels
are highly correlated, then we only need one of these two pixels (or we
might consider combining the information from the two pixels), since they
contain similar information.
As an example, let us consider the 1000th pixel and the 1500th pixel
(each has been rounded to three decimal places) from our database of 6
penguins:
\[
u = c_{1000} = \begin{pmatrix} -0.979 \\ 0.189 \\ -0.711 \\ 0.249 \\ 0.300 \\ -0.917 \end{pmatrix}, \qquad
v = c_{1500} = \begin{pmatrix} -0.475 \\ 0.135 \\ -0.669 \\ 0.223 \\ 0.005 \\ -0.757 \end{pmatrix}.
\]

The variance of the (standardized) intensities for the 1000th pixel and
the 1500th pixel are respectively
\[
\sigma_{1000}^2 = \frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})^2 = 0.318 \quad \text{and} \quad \sigma_{1500}^2 = \frac{1}{n}\sum_{i=1}^{n}(v_i - \bar{v})^2 = 0.153.
\]

Among these two pixels, the 1000th is more variable in terms of the (stan-
dardized) intensities. The larger the variance, the more this pixel will be
useful in the differentiation of the images. So if we were to retain only one
of these two pixels for the comparison of the images, we should choose the
1000th pixel. However, we will see that combining the information from
the two pixels will give us a component that is even more variable.
Before discussing the combination of pixels, we look at the statistical
association between these two pixels. A scatter plot of the 1500th pixel
against the 1000th pixel is found in Figure 5.13. The vertical and the
horizontal lines in the middle of the plot are the respective means of the
two variables. We see that they are highly associated. When one of the
pixels is large compared to its mean, so is the other pixel. Furthermore,
when one of the pixels is small compared to its mean, so is the other pixel.
This type of association is called positive correlation. When the majority
of the points are in Quadrants I and III, we say the association is positive.
While a negative association occurs when the majority of the points are in
Quadrants II and IV.
A statistic that captures the sign (i.e. positive or negative) of the asso-
ciation is the covariance. The covariance is computed as follows:
\[
\sigma_{1000,1500} = \frac{1}{n}\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v}) = 0.2064.
\]

If there is no association between the pixels, then the points in the scatter
plot are going to be scattered in all four quadrants delimited by the respec-
tive means of these two variables. In this case, the covariance should be
close to zero.
If the association between the pixels is positive, then the majority of
the points are in Quadrants I and III. Thus, the product (ui − u)(vi − v)
is positive for the majority of the points, which gives a positive covariance.
If the association between the pixels is negative, then the majority of the
points are in Quadrants II and IV. Thus, the product (ui − u)(vi − v) is
negative for the majority of the points, which gives a negative covariance.

To measure the intensity of the association, we compute the correlation


as follows:
\[
\rho_{1000,1500} = \frac{\sigma_{1000,1500}}{\sigma_{1000}\,\sigma_{1500}} = 0.934.
\]
If the pixels are not associated, then the correlation should be near zero,
since the covariance will be near zero. We recall a theorem from linear
algebra to help us interpret a non-zero correlation.

Theorem 5.6. Let x = [x1 , x2 , . . . , xn ]t and y = [y1 , y2 , . . . , yn ]t be vectors


in R^n. Then
\[
-\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2} \;\leq\; \sum_{i=1}^{n} x_i y_i \;\leq\; \sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}.
\]

These inequalities are called the Cauchy-Schwarz inequalities. Furthermore,


the inequalities are strict, unless one of the vectors x, y is a multiple of the
other.

Taking xi = ui −u and yi = vi −v for i = 1, . . . , n in the Cauchy-Schwarz


inequalities, we get
−1 ≤ ρ1000,1500 ≤ 1
and the inequalities are strict unless ui = u + c (vi − v) for some c, for all
i = 1, 2, . . . , n or vi = v + c (ui − u) for some c, for all i = 1, 2, . . . , n. This
means that a correlation is always between −1 and 1. Furthermore, it is
equal to 1 or −1 only if the points in the scatter plot fall exactly on a line.
Falling exactly on a line is a very strong association, that is called perfect
correlation.
We interpret a correlation closer to 1 or −1 as a more intense association.
The correlation between the 1000th pixel and the 1500th pixel is 0.934. This
means that they are highly correlated.
To combine the information found in these two pixels, we orthogonally
project the points in the scatter plot onto a line that passes through the
centroid (u, v) = (−0.311, −0.256). For our example, we choose two lines.
The line 1 will pass through Quadrants I and III, i.e. in the part of the
plane that corresponds to positive correlation. This line will have the vector
d~1 = (−0.8279119, −0.5608582) as unit directional vector. The projected
points are the triangles in Figure 5.14. We use the values along this line
for our two pixels. To compute the scalar projection of the ith point, we
extend a vector ~yi from the centroid to the point. The scalar projection is
wi = ~yi · d~1 , where · is the dot product in R2 . As an example, consider the


point (−0.979, −0.475). Its corresponding vector is ~y = (−0.979, −0.475) −
(u, v) = (−0.668, −0.219) and the corresponding scalar projection onto the
line is
w = ~y · d~1 = (−0.668)(−0.8279119) + (−0.219)(−0.5608582) = 0.675.
The 6 scalar projections along the line with direction vector d~1 are
w1 = 0.675, w2 = −0.633, w3 = 0.562,

w4 = −0.733, w5 = −0.652, w6 = 0.782.


Interpret these values as deviations away from the center (u, v) =
(−0.311, −0.256) along the line with the direction vector d~1 . The variance
of these values is
\[
s_w^2 = \frac{1}{n}\sum_{i=1}^{n}(w_i - \bar{w})^2 = 0.550.
\]

The combined component is more variable than the 1000th pixel and the
1500th pixel. Recall that σ_{1000}^2 = 0.318 and σ_{1500}^2 = 0.153. So the combined
component will be more useful in the differentiation of the images compared
to the 1000th pixel or to the 1500th pixel.
We do have to be careful in the combination of the pixels, since we can
obtain a component that has a smaller variance than the original pixels. If
we project the vectors onto the line 2 in Figure 5.14, which has the direction
vector d~2 = (0.5608582, −0.8279119), we get
w1 = −0.193, w2 = −0.043, w3 = 0.117,

w4 = −0.082, w5 = 0.126, w6 = 0.075.


The variance of these values is s2w = 0.0162. This is a reduction of variance
compared with each of the original pixels. Combining the pixels does not
always give more variance. Clearly the values are more varied along line 1
compared to the values along line 2. In fact, the points along line 1 resemble
much more the original points compared to the points along line 2. The
correlation of the points along line 2 is negative, while the correlation of
the 6 original points is positive.
In the next section, we learn how to combine the information found
in all p = 2500 pixels by using a covariance matrix that will contain all
variances and covariances.
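The computations of this subsection can be reproduced in a few lines. Here is a sketch (Python with NumPy; the two columns of data and the direction d1 are the ones listed above, the rest is our own naming):

```python
import numpy as np

u = np.array([-0.979, 0.189, -0.711, 0.249, 0.300, -0.917])   # 1000th pixel
v = np.array([-0.475, 0.135, -0.669, 0.223, 0.005, -0.757])   # 1500th pixel

var_u, var_v = u.var(), v.var()                     # 0.318 and 0.153
cov_uv = ((u - u.mean()) * (v - v.mean())).mean()   # 0.2064
rho = cov_uv / np.sqrt(var_u * var_v)               # 0.934
print(var_u, var_v, cov_uv, rho)

# Scalar projections onto the line through the centroid with unit direction d1.
d1 = np.array([-0.8279119, -0.5608582])
Y = np.column_stack([u - u.mean(), v - v.mean()])   # centered points
w = Y @ d1
print(w)    # 0.675, -0.633, 0.562, -0.733, -0.652, 0.782 as in the text
```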

Fig. 5.13 Scatter plot of the 1500th pixel versus the 1000th pixel.

Fig. 5.14 Projections of the points onto lines passing through the centroid.

5.10.2 The principal components from the covariance matrix
Our illustrative example in the previous subsection only describes the com-
bination of two pixels, that is the 1000th against the 1500th pixel. However,
we want to combine all pixels. To do so, we construct the covariance matrix
(of size p × p):
\[
V = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_p^2 \end{pmatrix}.
\]
The jth element in the diagonal, i.e. σj2 , is the variance of the standardized
intensities of the jth pixel. The off-diagonal element σjk is the covariance
between the standardized intensities of the jth pixel and the standardized
intensities of the kth pixel.
We can compute V from the rows of the data array Z as defined in (5.3).
The ith row Z i of the data array Z contains the p = 2500 standardized
intensities for the ith image in the data base. Compute the mean vector of
the rows in Z:
\[
\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i.
\]
The jth component of Z is the mean of the jth pixel. We want to combine
the pixels to form one component by projecting Z i onto a line that passes
through the centroid Z̄. To do so, we define the column vector y_i with the centroid Z̄ as the initial point and Z_i as the terminal point, that is
\[
y_i = Z_i^t - \bar{Z}^t.
\]
The mean of the vectors y_i is the zero vector 0:
\[
\frac{1}{n}\sum_{i=1}^{n} \bigl(Z_i^t - \bar{Z}^t\bigr) = \Bigl(\frac{1}{n}\sum_{i=1}^{n} Z_i^t\Bigr) - \bar{Z}^t = \bar{Z}^t - \bar{Z}^t = 0.
\]
The following representation of the covariance matrix V in terms of the
vectors y_i will be useful:
\[
V = \frac{1}{n}\sum_{i=1}^{n} y_i\, y_i^t. \tag{5.4}
\]
To show that the latter matrix is indeed the covariance matrix, consider u =
(u1 , u2 , . . . , un ) and v = (v1 , v2 , . . . , vn ) to be the standardized intensities
of the jth and the kth pixel, respectively. The vectors u and v are the
jth and the kth columns of the data array Z. The (j, k) element of V is
the covariance between u and v. The (j, k) element of the matrix y_i y_i^t is (u_i − ū)(v_i − v̄). This means that the (j, k) element of (1/n) \sum_{i=1}^{n} y_i y_i^t is
\[
\frac{1}{n}\sum_{i=1}^{n} (u_i - \bar{u})(v_i - \bar{v}).
\]
The latter is the covariance between u and v. Thus, (1/n) \sum_{i=1}^{n} y_i\, y_i^t is the covariance matrix V.
To combine the pixels, we orthogonally project the point z i ∈ Rp (which
corresponds to the ith image) onto a line in Rp that passes through the
centroid Z and with direction vector e ∈ Rp . We assume that the direction
vector is a unit vector e (that is, ‖e‖ = 1 or equivalently e^t e = 1). The
corresponding scalar projection is y i · e = y ti e. The scalar projections of
the n = 6 images are:
w1 = y t1 e, w2 = y t2 e, . . . , wn = y tn e.
The mean of the scalar projections is always zero:
\[
\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i = \frac{1}{n}\sum_{i=1}^{n} y_i^t\, e = \Bigl(\frac{1}{n}\sum_{i=1}^{n} y_i\Bigr)^t e = 0^t\, e = 0.
\]

The variance of the scalar projections can be computed from the covariance
matrix V:
\[
\begin{aligned}
\sigma_w^2 &= \frac{1}{n}\sum_{i=1}^{n}(w_i - \bar{w})^2 = \frac{1}{n}\sum_{i=1}^{n} w_i^2 \quad (\text{since } \bar{w} = 0) \\
&= \frac{1}{n}\sum_{i=1}^{n} w_i^t\, w_i \quad (\text{since } w_i \text{ is a scalar}) \\
&= \frac{1}{n}\sum_{i=1}^{n} (y_i^t e)^t (y_i^t e) \\
&= \frac{1}{n}\sum_{i=1}^{n} e^t y_i\, y_i^t\, e = e^t \Bigl[\frac{1}{n}\sum_{i=1}^{n} y_i\, y_i^t\Bigr] e = e^t V e.
\end{aligned}
\]
We want to find the direction e that maximizes the variance σ_w^2. The
following theorem will be useful in the interpretation of the optimal direc-
tion. We state the theorem without proof. It is a well-known result in linear
algebra. It states that a symmetric matrix (i.e. the matrix is equal to its
transpose) is diagonalizable. The notions of eigenvalues, eigenvectors and
orthogonal bases are quickly reviewed in the remarks following the theorem.

Theorem 5.7 (Principal Axis Theorem). Let V be a p × p real val-


ued symmetric matrix. Then, the p eigenvalues of V are real, and there
exists an orthonormal basis of Rp consisting of eigenvectors of V . Write
the eigenvalues in descending order λ1 ≥ λ2 ≥ · · · ≥ λp and let ei be an
eigenvector associated to the eigenvalue λi for i = 1, 2, . . . , p, such that
B = {e1 , e2 , . . . , ep } is an orthonormal basis of Rp . Let P be the ma-


trix, whose columns are the vectors in B. The matrix V has the following
decomposition:
\[
V = P D P^t = \sum_{j=1}^{p} \lambda_j\, e_j\, e_j^t,
\]
where D is the diagonal matrix whose diagonal entries are the eigenvalues λ_1, λ_2, . . . , λ_p.

Remarks:
(a) We say that a non-zero vector e ∈ Rp is an eigenvector of V , if there
exists a scalar λ such that
V e = λ e.
The scalar λ is called the corresponding eigenvalue.
(b) We say that B = {e1 , e2 , . . . , ep } is an orthonormal basis of Rp , if
(i) the vectors in B are orthogonal, i.e. etj ek = 0 for all j 6= k.
(ii) the vectors in B are unit vectors, i.e. kej k = 1 for all j. Equivalently
this means etj ej = 1 for all j.
(c) As a consequence of part (b), the product P t P is equal to the identity
matrix I, since the columns of P form an orthonormal basis of Rp .
(d) The scalar projection of the ith centered observation y_i = Z_i^t − Z̄^t onto
the jth eigenvector ej , that is y i ·ej = y ti ej , is called the jth principal
component by statisticians.
Here are a few consequences of the Principal Axis Theorem.
(1) The eigenvalues of the covariance matrix V can be interpreted as vari-
ances. Let w be the component corresponding to the scalar projections
of the centered observations y i for i = 1, 2, . . . , n along the line that
passes through the centroid with direction vector v, where v is an
eigenvector of V with corresponding eigenvalue λ. As seen above the
variance of w is v^t V v, which becomes
\[
s_w^2 = v^t V v = v^t (\lambda\, v) = \lambda\, (v^t v) = \lambda,
\]
since v^t v = 1.
(2) The trace of the covariance matrix V is the sum of the elements in its
diagonal, that is tr(V) = \sum_{j=1}^{p} \sigma_j^2. It is called the total variance. For our database of n = 6 penguins, the total variance is tr(V) = 1327.045. We can compute the trace from the sum of the eigenvalues:
\[
\operatorname{tr}(V) = \operatorname{tr}(P D P^t) = \operatorname{tr}(P^t P D) = \operatorname{tr}(I D) = \operatorname{tr}(D) = \sum_{j=1}^{p} \lambda_j.
\]

We used the fact that the P t P = I, where I is the p×p identity matrix,
and also the cyclic property of the trace: tr(A B) = tr(B A) for any
matrices A, B such that A B and B A are defined.
Given a symmetric p × p matrix A = [aij ] and a vector e =
[e1 , e2 , . . . , ep ]t ∈ Rp , we define a linear form and a quadratic form of e
as follows. For any b = [b_1, b_2, . . . , b_p]^t ∈ R^p, the scalar
\[
L = b^t A e = e^t A b = \sum_{i=1}^{p}\sum_{k=1}^{p} b_k\, a_{ki}\, e_i
\]

is called a linear form of e. The scalar
\[
Q = e^t A e = \sum_{j=1}^{p}\sum_{k=1}^{p} a_{jk}\, e_j\, e_k
\]

is called a quadratic form of e.


Both the linear form L and the quadratic form Q are functions of p
variables e1 , e2 , . . . , ep (the components of the vector e ∈ Rp ) and one can
define the gradient vector for L and Q as being
\[
\frac{\partial L}{\partial e} = \begin{pmatrix} \partial L/\partial e_1 \\ \partial L/\partial e_2 \\ \vdots \\ \partial L/\partial e_p \end{pmatrix} \quad \text{and} \quad
\frac{\partial Q}{\partial e} = \begin{pmatrix} \partial Q/\partial e_1 \\ \partial Q/\partial e_2 \\ \vdots \\ \partial Q/\partial e_p \end{pmatrix},
\]
where, as usual, ∂F/∂x_i means the partial derivative of the function F with respect to the variable x_i. The following theorem gives an easy way to
compute the gradient vector.

Theorem 5.8. Let $A$ be a symmetric $p \times p$ matrix, and let $e = [e_1, e_2, \ldots, e_p]^t$ and $b = [b_1, b_2, \ldots, b_p]^t$ be two vectors in $\mathbb{R}^p$. Let $L$ and $Q$ be the linear and quadratic forms defined above. Then,
$$\frac{\partial L}{\partial e} = A\, b \qquad \text{and} \qquad \frac{\partial Q}{\partial e} = 2\, A\, e.$$
The concepts of linear and quadratic forms in the above theorem are
a generalization of first and second order power functions in R. Notice
that the formula for the gradient vector (i.e. the vector of derivatives) is
an analogue of the formulae to differentiate the two power functions in R,
namely (d/dx)(c x) = c and (d/dx)(c x2 ) = 2 c x.
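As a sanity check of Theorem 5.8, the stated gradients can be compared with finite-difference approximations. The sketch below uses a randomly generated symmetric matrix and vectors (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.normal(size=(p, p)); A = (A + A.T) / 2   # a symmetric matrix
b = rng.normal(size=p)
e = rng.normal(size=p)

L = lambda e: b @ A @ e        # linear form  L(e) = b^t A e
Q = lambda e: e @ A @ e        # quadratic form Q(e) = e^t A e

def numerical_gradient(f, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x); d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

# Theorem 5.8:  dL/de = A b   and   dQ/de = 2 A e.
assert np.allclose(numerical_gradient(L, e), A @ b, atol=1e-5)
assert np.allclose(numerical_gradient(Q, e), 2 * A @ e, atol=1e-5)
```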
To find the optimal direction to maximize the variance, we use the
Lagrange method. We want to maximize $s_w^2 = e^t V e$ under the constraint that $e$ is a unit vector in $\mathbb{R}^p$. This constraint is equivalent to $e^t e - 1 = 0$. Define the Lagrangian function:
$$L(\lambda, e) = e^t V e - \lambda\,(e^t e - 1) = e^t V e - \lambda\,(e^t I e - 1),$$
where $I$ is the identity matrix of size $p \times p$. To find the optimal direction $e$, we need to solve
$$\frac{\partial L}{\partial e} = 0 \qquad \text{and} \qquad \frac{\partial L}{\partial \lambda} = e^t e - 1 = 0.$$
The second equality ensures that the constraint is satisfied; note that $e \neq 0$, since $e^t e = 1$. By Theorem 5.8, the first equality gives
$$0 = \frac{\partial L}{\partial e} = 2\, V e - 2\, \lambda\, I e,$$
which is equivalent to $V e = \lambda\, e$. This means that $e \neq 0$ is an eigenvector of $V$ with corresponding eigenvalue $\lambda$.
Since $\lambda_1$ is the largest eigenvalue, the corresponding eigenvector $v_1$ gives the direction of the line with the largest variance. Similarly, $\lambda_p$ is the smallest eigenvalue, so the corresponding eigenvector $v_p$ gives the direction with the smallest variance. For our database of $n = 6$ penguins, $\lambda_1 = 474.0232$; this is the variance of the projections along the line with direction vector $v_1$. The total variance is $\mathrm{tr}(V) = 1327.045$, so the first principal component accounts for $\lambda_1/\mathrm{tr}(V) = 35.72\%$ of the total variance.
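The conclusion of the Lagrange argument can be illustrated numerically: among unit vectors, no direction yields a larger projection variance than the first eigenvector. The data below are random stand-ins, not the penguin images:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 8
Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Y -= Y.mean(axis=0)
V = (Y.T @ Y) / n

eigenvalues, P = np.linalg.eigh(V)
lam1, v1 = eigenvalues[-1], P[:, -1]      # largest eigenvalue and its eigenvector

# The variance of the projections along any unit vector e is e^t V e;
# no random direction should beat the first eigenvector.
for _ in range(1000):
    e = rng.normal(size=p); e /= np.linalg.norm(e)
    assert e @ V @ e <= lam1 + 1e-9

print(lam1 / np.trace(V))   # share of the total variance, as in the 35.72% above
```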
We want the projected images to be similar to the original images, so
we should try to recover most of the total variance. Let us try to find a
second component with maximum variance that is uncorrelated with the
first principal component. The first principal component is in the direction
given by v 1 . Let e be the direction of the second component. Let u and w
be the scalar projections onto the lines that pass through the centroid in
the directions given by $e$ and $v_1$, respectively. So $u_i = y_i^t e$ and $w_i = y_i^t v_1$ for $i = 1, \ldots, n$. As with the variance, we can compute the covariance from the covariance matrix $V$:
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}(w_i - \bar w)(u_i - \bar u)
&= \frac{1}{n}\sum_{i=1}^{n} w_i\, u_i && (\text{since } \bar w = 0 = \bar u)\\
&= \frac{1}{n}\sum_{i=1}^{n} w_i^t\, u_i && (\text{since } w_i \text{ is a scalar})\\
&= \frac{1}{n}\sum_{i=1}^{n} (y_i^t v_1)^t\,(y_i^t e)
 = \frac{1}{n}\sum_{i=1}^{n} v_1^t\, y_i\, y_i^t\, e
 = v_1^t \left[\frac{1}{n}\sum_{i=1}^{n} y_i\, y_i^t\right] e\\
&= v_1^t\, V\, e.
\end{align*}
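The identity just derived, namely that the covariance of the two sets of scalar projections equals $v_1^t V e$, can be checked in a few lines of NumPy on synthetic data (all names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 5
Y = rng.normal(size=(n, p)); Y -= Y.mean(axis=0)   # centered observations y_i
V = (Y.T @ Y) / n

_, P = np.linalg.eigh(V)
v1 = P[:, -1]                                       # eigenvector of the largest eigenvalue
e = rng.normal(size=p); e /= np.linalg.norm(e)      # any unit direction

w = Y @ v1                                          # w_i = y_i^t v_1
u = Y @ e                                           # u_i = y_i^t e

# The covariance of the two sets of scalar projections equals v_1^t V e.
assert np.isclose((w * u).mean(), v1 @ V @ e)
```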
We want to maximize $s_u^2 = e^t V e$ under the constraints that $e$ is a unit vector in $\mathbb{R}^p$, that is $e^t e - 1 = 0$, and that the two components are uncorrelated, that is $v_1^t V e = 0$. Define the Lagrangian function:
$$L(\lambda, \lambda_1, e) = e^t V e - \lambda\,(e^t e - 1) - \lambda_1\,(v_1^t V e) = e^t V e - \lambda\,(e^t I e - 1) - \lambda_1\,(v_1^t V e),$$
where $I$ is the identity matrix of size $p \times p$.
To find the optimal direction $e$, we need to solve
$$\frac{\partial L}{\partial e} = 0, \qquad \frac{\partial L}{\partial \lambda} = e^t e - 1 = 0, \qquad \text{and} \qquad \frac{\partial L}{\partial \lambda_1} = v_1^t V e = 0.$$
The last two equalities ensure that the constraints are satisfied. The first equality gives
$$0 = \frac{\partial L}{\partial e} = 2\, V e - 2\, \lambda\, I e - \lambda_1\, V v_1. \qquad (5.5)$$
Multiply both sides of this equation by $v_1^t$ on the left. Since $v_1^t\, 0 = 0$, we get
$$0 = 2\, v_1^t V e - 2\, \lambda\, v_1^t e - \lambda_1\, v_1^t V v_1 = -\lambda_1\, v_1^t V v_1,$$
since $v_1^t V e = 0$ and $v_1^t e = 0$ (the first constraint forces the second, because $V v_1$ is a non-zero multiple of $v_1$). Recall that $v_1^t V v_1$ is the variance of the first component. Assuming that this variance is not zero, we must have $\lambda_1 = 0$. This means that equation (5.5) becomes
$$0 = 2\, V e - 2\, \lambda\, I e,$$
which is equivalent to $V e = \lambda\, e$. So $e \neq 0$ is an eigenvector of $V$ with eigenvalue $\lambda$, and we conclude that the second component is in the direction of the eigenvector $e_2$ corresponding to the second largest eigenvalue $\lambda_2$.
Furthermore, λ2 will correspond to the variance of the second component.
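A quick numerical check on synthetic data (again, the names and sizes are hypothetical) confirms that the scores along the first two eigenvectors are uncorrelated and that the variance of the second component equals $\lambda_2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 6
Y = rng.normal(size=(n, p)); Y -= Y.mean(axis=0)
V = (Y.T @ Y) / n

eigenvalues, P = np.linalg.eigh(V)
eigenvalues, P = eigenvalues[::-1], P[:, ::-1]   # descending order

w = Y @ P[:, 0]        # scores along the first principal direction
u = Y @ P[:, 1]        # scores along the second principal direction

# The two components are uncorrelated: their covariance v_1^t V e_2 is zero.
assert np.isclose((w * u).mean(), 0.0, atol=1e-10)
# lambda_2 is the variance of the second component.
assert np.isclose(u.var(), eigenvalues[1])
```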
You can probably now guess that if we want a component of maximum
variance that is uncorrelated to the first two components, then we should
project the points onto a line in the direction of the eigenvector e3 corre-
sponding to the third largest eigenvalue λ3 , and so on. For our database of
n = 6 penguins, the largest 4 eigenvalues of V are
λ1 = 474.0232, λ2 = 346.2693, λ3 = 208.6106, λ4 = 170.8992.
This means that the first four components will account for (λ1 + λ2 +
λ3 + λ4 )/tr(V ) = 90.4% of the total variance. In practice the number of
components m that are used can vary. However, we usually try to recover
at least 90% of the total variance. Typically, in a large database with tens
of thousands of faces, this corresponds to about 50 to 100 components that
are sometimes called the principal features.
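The rule of thumb of keeping enough components to recover at least 90% of the total variance is easy to automate. The helper below is only a sketch (the function name is ours); applied to the four eigenvalues and the total variance quoted above, it reproduces the choice m = 4:

```python
import numpy as np

def components_needed(eigenvalues, total_variance, target=0.90):
    """Smallest m whose m largest eigenvalues capture at least `target`
    of the total variance (eigenvalues given in descending order)."""
    ratios = np.cumsum(eigenvalues) / total_variance
    return int(np.searchsorted(ratios, target) + 1)

# The four largest eigenvalues and the total variance quoted for the penguin database.
top = np.array([474.0232, 346.2693, 208.6106, 170.8992])
print(components_needed(top, total_variance=1327.045))   # -> 4
print(top.sum() / 1327.045)                              # about 0.904
```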
5.10.3 Comparison of the principal features

Let $Z_i$ be the row vector of $p = 2500$ standardized intensities of the $i$th image in the database. In our case, the database contains $n = 6$ images of penguins. Let $\bar{Z}$ be the mean vector (i.e. $\bar{Z} = (1/n)\sum_{i=1}^{n} Z_i$) and $V$ the corresponding covariance matrix:
$$V = \frac{1}{n}\sum_{i=1}^{n} (Z_i^t - \bar{Z}^t)(Z_i^t - \bar{Z}^t)^t.$$
Let $P_m$ be the $p \times m$ matrix whose columns are the $m$ eigenvectors $v_1, v_2, \ldots, v_m$ corresponding to the $m$ largest eigenvalues of $V$.
We will use the first $m = 4$ principal components to compare the images of the penguins. For the $i$th penguin in the database, its vector of features is
$$F_i = \begin{bmatrix} v_1^t\, y_i \\ v_2^t\, y_i \\ \vdots \\ v_m^t\, y_i \end{bmatrix} = P_m^t\, y_i = P_m^t\,(Z_i^t - \bar{Z}^t).$$
The vector of features belongs to $\mathbb{R}^m = \mathbb{R}^4$. We will use the Euclidean distance in $\mathbb{R}^4$ for the comparison. Let $F_i$ and $F_{i'}$ be the respective vectors of features of two images in the database; the distance between the features of these images is
$$\|F_i - F_{i'}\| = \sqrt{\big[P_m^t (Z_i^t - Z_{i'}^t)\big]^t \big[P_m^t (Z_i^t - Z_{i'}^t)\big]} = \sqrt{(Z_i - Z_{i'})\, P_m P_m^t\, (Z_i^t - Z_{i'}^t)}.$$

Let $Z$ be the row vector of the $p = 2500$ standardized intensities of an image to be compared with the images in the database. For example, we will consider Alice's distorted image from Figure 5.11. The distance between the features of Alice's distorted image and the features of the $i$th image in the database is
$$\sqrt{(Z - Z_i)\, P_m P_m^t\, (Z^t - Z_i^t)}.$$

The distances between the principal features of the images are shown in Table 5.2. Using a threshold of 10, we see that the computer was able to identify the two distorted images as the first and fourth images in the database, respectively. The distance between the features of the distorted image of Alice and the first penguin in the database is 1.9 (a very small distance).
Table 5.2  Distances between the features of the penguins.

                     Penguin 1   Penguin 2   Penguin 3   Penguin 4   Penguin 5   Penguin 6
    Distorted Alice        1.9        53.0        60.9        45.2        46.8        42.4
    Distorted Bob         45.6        51.5        60.8         1.5        56.0        45.9
    Penguin 1                0        54.3        62.0        46.5        48.2        44.0
    Penguin 2                            0        57.1        52.7        63.8        56.9
    Penguin 3                                        0        61.9        62.3        57.5
    Penguin 4                                                    0        57.2        47.2
    Penguin 5                                                                0        10.4
    Penguin 6                                                                             0

Similarly, the distance between the features of the distorted image of Bob and the fourth penguin in the database is only 1.5. By using the first four principal features, the computer is more certain
that the two distorted images are images of penguins from the database.
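The comparison pipeline described in this subsection, extracting the feature vectors $F_i = P_m^t (Z_i^t - \bar{Z}^t)$ and accepting the closest database image when its feature distance falls below a threshold, fits in a short script. The sketch below uses random stand-in intensities and hypothetical names; it is not the code that produced Table 5.2:

```python
import numpy as np

def feature_vector(Z, Z_bar, P_m):
    """F = P_m^t (Z^t - Z_bar^t) for a row vector of standardized intensities Z."""
    return P_m.T @ (Z - Z_bar)

def identify(query, database, Z_bar, P_m, threshold=10.0):
    """Index of the closest database image, or None when even the best
    feature distance exceeds the threshold (face not recognized)."""
    f_query = feature_vector(query, Z_bar, P_m)
    distances = [np.linalg.norm(f_query - feature_vector(Z_i, Z_bar, P_m))
                 for Z_i in database]
    best = int(np.argmin(distances))
    return best if distances[best] <= threshold else None

# Toy usage with random stand-in data (6 images, 40 intensities each, m = 4).
rng = np.random.default_rng(5)
database = rng.normal(size=(6, 40))
Z_bar = database.mean(axis=0)
Y = database - Z_bar
_, P = np.linalg.eigh((Y.T @ Y) / 6)
P_m = P[:, ::-1][:, :4]                            # eigenvectors of the 4 largest eigenvalues

query = database[0] + 0.01 * rng.normal(size=40)   # a slightly "distorted" copy of image 0
print(identify(query, database, Z_bar, P_m))       # -> 0
```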

5.10.4 Visualizing the features


We used m = 4 principal features to compare an image of a penguin to our
database of n = 6 penguins. It is possible to convert each feature into an
image by considering the orthogonal projection as a vector in Rp instead of
simply keeping the scalar projection.
Let $Z$ be the standardized intensities of an image. The $j$th feature is the scalar projection of $Z$ onto the line in $\mathbb{R}^p$ that passes through the mean $\bar{Z}$ in the direction given by the $j$th eigenvector $v_j$ of the covariance matrix $V$. The corresponding orthogonal projection is
$$v_j^t\,(Z^t - \bar{Z}^t)\; v_j.$$
The components of this vector are not going to be between 0 and 255. To
produce an image, we must shift and scale the values in the vector to get
values between 0 and 255.
To visualize the jth feature, we computed the corresponding orthogonal
projections for each of the six images in the database and also for the two
distorted images. So we have 8 projections in total. We record the smallest
value $a$ of the 8 × 2500 = 20,000 intensities and the largest value $b$ of the 20,000 intensities. To convert a projected vector into an image, for each component $x_i$ of the vector we compute $255\,(x_i - a)/(b - a)$. The components of the resulting vector then lie between 0 and 255.
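A possible implementation of this visualization step, again with random stand-in images and hypothetical names, is sketched below: project an image onto one eigenvector, then shift and scale the result into the range 0 to 255:

```python
import numpy as np

def feature_image(Z, Z_bar, v_j, a, b):
    """Orthogonal projection of an image onto the j-th eigenvector, then
    shifted and scaled so that every entry lands in the interval [0, 255]."""
    projection = (v_j @ (Z - Z_bar)) * v_j          # v_j^t (Z^t - Z_bar^t) v_j
    return 255.0 * (projection - a) / (b - a)       # a, b: global min and max

# Toy usage: 6 "database" images plus 2 extra ones, 40 random intensities each.
rng = np.random.default_rng(6)
images = rng.normal(size=(8, 40))
Z_bar = images[:6].mean(axis=0)                     # mean of the database images only
Y = images[:6] - Z_bar
_, P = np.linalg.eigh((Y.T @ Y) / 6)
v1 = P[:, -1]                                       # eigenvector of the largest eigenvalue

projections = np.array([(v1 @ (Z - Z_bar)) * v1 for Z in images])
a, b = projections.min(), projections.max()         # extremes over all 8 projections
pixels = feature_image(images[0], Z_bar, v1, a, b)
assert 0.0 <= pixels.min() and pixels.max() <= 255.0
```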
The first feature is displayed in Figure 5.15. The top row contains the
six penguins from our database and the two distorted images (at the end
of the row on the right). The bottom row contains the projected images
for the first feature. We interpret the feature as a deviation away from the
mean image of the six images in the database along a particular direction. A completely gray image means that the image is very close to the mean image in terms of that feature (see Penguin 6 in Figure 5.15). Furthermore, it is very difficult to differentiate the second penguin from the third with only the first feature. However, according to the second feature, the second and third penguins are different (see Figure 5.16).
The features are ordered according to the size of the variance, so the first few features are the most useful for differentiating the images. Looking at Figure 5.18, we see that the projected images are very similar; in fact, the fourth feature only accounts for 12.9% of the total variance. Notice also that for the distorted images of Alice and Bob, the projected image for each feature resembles the projected images of the first and fourth penguins. The first and fourth penguins in the database are Alice and Bob. As long as the principal features of the face of a particular individual are similar from image to image, the computer should be able to recognize the person's face. What are the principal features of a face? This will depend on the images in the database; the features are determined in terms of the maximum variance.

Fig. 5.15 The first feature.

Fig. 5.16 The second feature.


Fig. 5.17 The third feature.

Fig. 5.18 The fourth feature.

Index

Abelian group, 120
Adder-subtractor, 31
Almanac data, 98
Alphabet, 43
Altitude, 96, 108
Atomic clock, 98, 104
Average codeword length, 50
Average face, 164
Baseline JPEG, 72
Baseline standard, 66, 67
Binary
    addition, 12
    code, 43
    Coded Decimal expansion, 3
    coded decimal representation, 6
    representation, 4
    system, 3, 4
Binary sequence, 109
Binary tree, 43
Bit, 40, 109
Boolean algebra, 19
Byte, 41
Calculator, 1
Carry and sum, 25
Carry digit, 12
Carry generator, 30
Carry lookahead adder, 30
Carry ripple adder, 28
Carry-propagate, 30
Clock pulse, 111
Code
    Prefix-free, 45
Codes
    Fixed length, 45
    Variable length, 45
Codeword, 43
Compression, 65
Congruent modulo, 125
Conversion
    decimal and two's complement, 10
Cosine waves, 69
Covariance, 169
Covariance matrix, 173
Cyclic group, 119
Data compression, 39
Decimal expansion, 2
Decodable, 45
Decoding, 40
Descriptive statistics, 166
Difference, 14
Differential Encoding, 76
Discrete Cosine Transform, 67
Division algorithm, 124
Dyadic, 55
Encoding, 40
Entropy, 55
Ephemeris data, 98
Euclidian distance, 167
Face recognition, 143, 164
Field, 119, 120
Finite fields, 120
Full-adder, 26
Gate
    AND, 16
    NAND, 18
    NOR, 18
    NOT, 16
    OR, 16
    XOR, 16
Generator, 119
Geometric sum, 6
GPS, 96
GPS constellation, 98
Group, 115
Half-adder, 25
Huffman algorithm, 58
Huffman code, 55, 59
Huffman coding, 76
Inverse Discrete Cosine Transform, 68
Irreducible polynomial, 125
Isomorphic, 122
Joint Picture Expert Group, 65
JPEG, 65
Karnough maps, 23
Kraft Inequality, 47, 52
Lagrange multipliers, 52
Latitude, 96, 108
Lead function, 132
Least significant digit, 3
Level shifting, 73
Linear feedback registrars, 132
Linear feedback shift register, 109, 110
Linear transformation, 84
Logic, 15
    gates, 16
    operators, 15
Longitude, 96, 108
Lossless, 41
Lossy, 41
Luminance, 75
Maxterm, 23
Mean, 165
Meridian line, 96
Minterm, 21
    adjacent, 22
Minuend, 14
Modulo arithmetic, 113
Monic irreducible polynomial, 130
Monic polynomial, 123
Most significant digit, 3
Natural Binary Coded Decimal, 7
Non-zero field, 120
Number systems, 2
One's complement representation, 8
Optimal code, 50
Optimal prefix-free code, 52, 56
Orthogonal, 70
Orthogonal basis, 87
Orthogonal matrix, 86
Overflow, 12, 13
Periodic, 109
Polynomial over a field, 123
Prefix-free code, 45
Prime integer, 117
Prime meridian, 97
Primitive element, 129, 131
Primitive polynomial, 131
Principle Axis Theorem, 174
Principle components, 172
Principle features, 179
Product of sums, 21, 24
Pseudo random noise, 98
Quantization, 74
Quotient, 124
Raster image, 144
Remainder, 124
Run length encoding, 78
Seven segment display, 32
Sign magnitude format, 7
Signed binary number, 7
Standard deviation, 165
Standard forms, 25
Standardized intensity, 165
Subtrahend, 14
Successive division, 4
Sum and carry, 25
Sum digit, 12
Sum of products, 21
Sum of weights, 4
Transistor, 1, 3
Trilateration, 99
Two's complement representation, 8
Two-dimensional DCT, 70
Variable-Length Code, 77
Variance, 169
Zigzag, 76
