0% found this document useful (0 votes)
415 views694 pages

Connecting Discrete Mathematics and Computer Science

The document introduces 'Connecting Discrete Mathematics and Computer Science' by David Liben-Nowell, which aims to engage computer science students in discrete mathematics by linking mathematical concepts to computer science applications. It features comprehensive coverage of standard topics, pedagogical tools, and numerous exercises to enhance learning. The text is praised for its clarity, accessibility, and the integration of computer science connections throughout the material.

Uploaded by

siamibne1512
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
415 views694 pages

Connecting Discrete Mathematics and Computer Science

The document introduces 'Connecting Discrete Mathematics and Computer Science' by David Liben-Nowell, which aims to engage computer science students in discrete mathematics by linking mathematical concepts to computer science applications. It features comprehensive coverage of standard topics, pedagogical tools, and numerous exercises to enhance learning. The text is praised for its clarity, accessibility, and the integration of computer science connections throughout the material.

Uploaded by

siamibne1512
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 694

“An inspired approach ... illuminating the “...

an approachable yet rigorous book, written

Liben-Nowell
aesthetic appeal of the subject together with the with wit and verve, that I look forward to
profound and inextricable links that connect it to teaching from!”
the core ideas of computing.” Raghuram Ramanujan, Davidson College
Jon Kleinberg, Cornell University
“... a beautifully motivated, clearly written, and
“... the computer science content and connections accessible exploration of the mathematical
are woven extensively throughout, with “forward foundations of computer science. ... This is
pointers” that can excite students about computer a discrete math book that truly keeps the
science areas they will encounter in their future reader engaged!”
studies. ... It will be a joy to teach with!”

COMPUTER SCIENCE
MATHEMATICS AND
CONNECTING DISCRETE
Ran Libeskind-Hadas, Founding Chair of
Valerie Barr, Mount Holyoke College Integrated Sciences, Claremont McKenna
College
“David Liben-Nowell has put forth a book that will

CONNECTING
make CS students invested in the material. He not
only connects every topic to computer science,
but does so in a clear and entertaining way.”
Dan Arena, Vanderbilt University

Computer science majors taking a non-programming-based course like discrete mathematics


might ask ‘why do I need to learn this?’ Written with these students in mind, this text introduces DISCRETE
MATHEMATICS
the mathematical foundations of computer science by providing a comprehensive treatment of
standard technical topics while simultaneously illustrating some of the broad-ranging applications
of that material throughout the field. Chapters on the standard core topics from discrete
structures – such as logic, proofs, number theory, counting, probability, graphs – are augmented

AND
with around 60 “computer science connections” pages that introduce applications of the core
technical matter: game trees (logic), triangulation of scenes in computer graphics (induction),
the Enigma machine (counting), algorithmic bias (relations), differential privacy (probability), and

COMPUTER
paired kidney transplants (graphs). Each chapter begins with a “Why You Might Care” section and
concludes with a quick-reference guide and summary of key terms and results. Throughout the
text are problem-solving and writing tips, “taking it further” asides with more technical details,
9781009150491: Liben-Nowell: PPC: C M Y K

and around 1700 exercises, 435 worked examples, and 480 figures.

SCIENCE
Second
Edition
Online Resources
www.cambridge.org/Liben-Nowell
Instructor resources
Solutions manual
Second Edition
PowerPoint slides and JPEGs of all the figures and tables
from the book

David Liben-Nowell
Cover image: Photo by David Liben-Nowell
Cover design: Andrew Ward

ISTUDY
ISTUDY
Connecting Discrete Mathematics and Computer Science

Computer science majors taking a non-programming-based course like discrete mathematics might
ask “why do I need to learn this?” Written with these students in mind, this text introduces the
mathematical foundations of computer science by providing a comprehensive treatment of standard
technical topics while simultaneously illustrating some of the broad-ranging applications of that
material throughout the field. Chapters on core topics from discrete structures—like logic, proofs,
number theory, counting, probability, graphs—are augmented with around 60 “Computer Science
Connections” pages introducing their applications: for example, game trees (logic), triangulation
of scenes in computer graphics (induction), the Enigma machine (counting), algorithmic bias
(relations), differential privacy (probability), and paired kidney transplants (graphs). Pedagogical
features include “Why You Might Care” sections, quick-reference “Chapter at a Glance” and Key
Terms and Results summaries, problem-solving and writing tips, “Taking it Further” asides with
more technical details, and around 1700 exercises, 435 worked examples, and 480 figures.

David Liben-Nowell is Professor of Computer Science at Carleton College, and earned degrees
from Cornell, Cambridge, and MIT. His research focuses on computational social science, espe-
cially social networks. He teaches broadly, emphasizing introductory and theoretical computer
science, and created and led a study-abroad program in the United Kingdom (History of Com-
puting). He is on the ACM 202x curriculum subcommittee for mathematical foundations and
is a member of the Liberal Arts CS (LACS) consortium. His awards and honors include NSF
research funding, Kavli Fellow of the National Academy of Sciences, and Visiting By-Fellowship
at Churchill College (Cambridge). He has also published about 30+ crossword puzzles in the New
York Times, Los Angeles Times, Wall Street Journal, Chronicle of Higher Education, and other
venues.

ISTUDY
“An inspired approach to the introductory discrete math course, illuminating the aesthetic appeal
of the subject together with the profound and inextricable links that connect it to the core ideas of
computing.”
– Jon Kleinberg, Cornell University

“Unlike most discrete math texts, here the computer science content and connections are woven
extensively throughout, with “forward pointers” that can excite students about numerous computer
science areas they will encounter in their future studies. In addition, the book is written TO students,
not FOR faculty. It will be a joy to teach with!”
– Valerie Barr, Mount Holyoke College

“Finally! I’ve spent years struggling to find a textbook that makes the topic of Discrete Structures
relevant to computer science students. David Liben-Nowell has put forth a book that will make
CS students invested in the material. He not only connects every topic to computer science, but
does so in a clear and entertaining way.”
– Dan Arena, Vanderbilt University

“By foregrounding the connections between the fields, this outstanding textbook makes a com-
pelling case for why computer science students should embrace the study of discrete mathematics.
This is an approachable yet rigorous book, written with wit and verve, that I look forward to
teaching from!”
– Raghuram Ramanujan, Davidson College

“David Liben-Nowell’s Connecting Discrete Mathematics and Computer Science provides stu-
dents with a beautifully motivated, clearly written, and accessible exploration of the mathematical
foundations of computer science. The “Computer Science Connections” sections provide com-
pelling applications of the mathematical content, and the frequent “Taking it Further” notes provide
extra richness that adds to the joy of the experience. This is a discrete math book that truly keeps
the reader engaged!”
– Ran Libeskind-Hadas, Founding Chair of Integrated Sciences,
Claremont McKenna College

ISTUDY
Connecting Discrete Mathematics
and Computer Science

David Liben-Nowell
Carleton College, Minnesota

ISTUDY
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/highereducation/isbn/9781009150491
DOI: 10.1017/9781009150484
© David Liben-Nowell 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
This book was previously published by John Wiley & Sons, Inc., titled ‘Discrete
Mathematics for Computer Science’, 2017
First published as Connecting Discrete Mathematics and Computer Science by
Cambridge University Press 2022
Printed in the United Kingdom by TJ Books Limited, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-009-15049-1 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.

ISTUDY
To MDSWM, with never-ending appreciation, and in loving memory of my
grandfather, Jay Liben, who brought more joy, curiosity, and kvetching to
this world than anyone else I know.

ISTUDY
ISTUDY
Contents

List of Computer Science Connections page xi


Acknowledgments xiii
Credits xv

1 On the Point of this Book 1

2 Basic Data Types 5


2.1 Why You Might Care 6
2.2 Booleans, Numbers, and Arithmetic 7
2.3 Sets: Unordered Collections 26
2.4 Sequences, Vectors, and Matrices: Ordered Collections 42
2.5 Functions 59
2.6 Chapter at a Glance 75

3 Logic 79
3.1 Why You Might Care 80
3.2 An Introduction to Propositional Logic 81
3.3 Propositional Logic: Some Extensions 95
3.4 An Introduction to Predicate Logic 109
3.5 Predicate Logic: Nested Quantifiers 127
3.6 Chapter at a Glance 140

4 Proofs 143
4.1 Why You Might Care 144
4.2 Error-Correcting Codes 145
4.3 Proofs and Proof Techniques 166
4.4 Some Examples of Proofs 184
4.5 Common Errors in Proofs 201
4.6 Chapter at a Glance 212

5 Mathematical Induction 215


5.1 Why You Might Care 216
5.2 Proofs by Mathematical Induction 217
5.3 Strong Induction 237
5.4 Recursively Defined Structures and Structural Induction 250
5.5 Chapter at a Glance 263

ISTUDY
viii Contents

6 Analysis of Algorithms 267


6.1 Why You Might Care 268
6.2 Asymptotics 269
6.3 Asymptotic Analysis of Algorithms 283
6.4 Recurrence Relations: Analyzing Recursive Algorithms 298
n

6.5 An Extension: Recurrence Relations of the Form T(n) = aT b + cnk 314
6.6 Chapter at a Glance 323

7 Number Theory 327


7.1 Why You Might Care 328
7.2 Modular Arithmetic 329
7.3 Primality and Relative Primality 343
7.4 Multiplicative Inverses 360
7.5 Cryptography 371
7.6 Chapter at a Glance 382

8 Relations 385
8.1 Why You Might Care 386
8.2 Formal Introduction 387
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 402
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 418
8.5 Chapter at a Glance 435

9 Counting 439
9.1 Why You Might Care 440
9.2 Counting Unions and Sequences 441
9.3 Using Functions to Count 463
9.4 Combinations and Permutations 481
9.5 Chapter at a Glance 502

10 Probability 505
10.1 Why You Might Care 506
10.2 Probability, Outcomes, and Events 509
10.3 Independence and Conditional Probability 527
10.4 Random Variables and Expectation 549
10.5 Chapter at a Glance 573

11 Graphs and Trees 577


11.1 Why You Might Care 578
11.2 Formal Introduction 579
11.3 Paths, Connectivity, and Distances 605
11.4 Trees 623

ISTUDY
Contents ix

11.5 Weighted Graphs 641


11.6 Chapter at a Glance 653

12 Looking Forward 657


References 661
Index 667

ISTUDY
ISTUDY
List of Computer Science Connections

Chapter 2: Basic Data Types


Integers and ints, Reals and floats 21
Computing Square Roots, and Not Computing Square Roots 22
Set Building in Languages (and MapReduce) 37
Clustering Datasets (and Speech Processing) 38
The Vector Space Model 53
Rotation Matrices 54
Hash Tables and Hash Functions 72

Chapter 3: Logic
Natural Language Processing, Ambiguity, and Truth 92
Computational Complexity, Satisfiability, and a Million Dollars 104
Short-Circuit Evaluation, Optimization, and Modern Compilers 105
Game Trees, Logic, and Winning Tic-Tac(-Toe) 121
Nonlocal Variables and Lexical vs. Dynamic Scoping 122
Gödel’s Incompleteness Theorem 123
Currying 135

Chapter 4: Proofs
Reed–Solomon Codes 161
Paul Erdős, “The Book,” and Erdős Numbers 180
The Four-Color Theorem (and Massive Computer-Generated Proofs) 181
Cryptography and the Generation of Prime Numbers 197
Other Uncomputable Problems (That You Might Care About) 198
The Cost of Missing Proofs: Some Famous Bugs in CS 207

Chapter 5: Mathematical Induction


Loop Invariants 231
Onion Routing, Steganography, and Conversations whose Very Existence is Sensitive 232
Triangulation, Computer Graphics, and 3D Surfaces 244
Max Heaps 245
Grammars, Parsing, and Ambiguity 259

Chapter 6: Analysis of Algorithms


Vacuum Tubes, Transistors, and Moore’s Law 279
Multitasking, Garbage Collection, and Wall Clocks 293

ISTUDY
xii List of Computer Science Connections

Time, Space, and Complexity 294


AVL Trees 309
Divide-and-Conquer Algorithms and Matrix Multiplication 321

Chapter 7: Number Theory


Converting Between Bases, Binary Representation, and Generating Strings 340
Secret Sharing 356
Error Correction with Reed–Solomon Codes 357
Miller–Rabin Primality Testing 368
Diffie–Hellman Key Exchange 379

Chapter 8: Relations
Relational Databases 398
Automating Decisions, Facial Recognition, and Algorithmic Bias 399
What’s Hard about Designing Voting Systems 414
Regular Expressions 415
Deterministic Finite Automata (DFAs) 430
The Painter’s Algorithm and Hidden-Surface Removal 431

Chapter 9: Counting
Running out of IP addresses, and IPv6 456
A Lower Bound for Comparison-Based Sorting 457
Infinite Cardinalities (and Problems that Can’t Be Solved by Any Program) 474
Lossy and Lossless Compression 475
Brute-Force Algorithms and Dynamic Programming 496
The Enigma Machine and a First Computer 497

Chapter 10: Probability


Using Randomness to Protect Privacy 521
Quantum Computing 522
Information, Charles Dickens, and the Entropy of English 523
Speech Recognition, Bayes’ Rule, and Language Models 543
Bayesian Modeling and Spam Filtering 544
The Monte Carlo Method 566
A Randomized Algorithm for Finding Medians 567

Chapter 11: Graphs and Trees


Degree Distributions and the Heavy Tail 599
Graph Drawing, Graph Layouts, and the 9/11 Memorial 600
The Bowtie Structure of the Web 618
Garbage Collection 619
Directed Graphs, Cycles, and Kidney Transplants 636
Binary Search Trees 637
Random Walks and Ranking Web Pages 650

ISTUDY
Acknowledgments

Would thou hadst less deserved,


That the proportion both of thanks and payment
Might have been mine! only I have left to say,
More is thy due than more than all can pay.
William Shakespeare (1564–1616)
The Scottish Play (c. 1606)

To everyone who has helped, directly and indirectly, with everything over these last years—these words
cannot adequately convey my thanks, but at least they’re a start: thank you!

I owe special thanks to a very long list of generous and warm people—many more than I can mention
here—for advice and kindness and support, both technical and emotional, as this book came into being
and, slowly, evolved into its present form. For those whom I haven’t named by name, please know that it’s
only because I have gotten such great support from so many people, and I hope that you’ll consider this
sentence the promise that, whenever we next see each other (however long that may take), the first round’s
on me. While I’m leaving out the names of the many people who have “merely” helped make my life happy
and fulfilling while I’ve been working on this book, I do want to give specific thanks to a few people.
I want to thank my colleagues—near and far, including many who are not just colleagues but also dear
friends and beloved family members—for their wisdom and patience, for answering my endlessly annoying
questions, and for conversations that led to examples or exercises or bug fixes or the very existence of
this entire book (even if you didn’t know that’s what we were talking about at the time): Katie Ahmann,
Eric Alexander, Tanya Amert, Sandra Batista, Tanya Berger-Wolf, Laura Biester, Kelly Connole, Amy
Csizmar Dalal, Josh Davis, Corey DeGagne, Doreen Densky, Roger Downs, Laura Effinger-Dean, Eric
Egge, Adriana Estill, Andy Exley, Charlotte Foran, Alex Freeman, Sherri Goings, Jack Goldfeather, Daniel
Groll, Clara Hardy, Max Harper, Deanna Haunsperger, Pierre Hecker, Maryam Hedayati, Jack Hessel,
David Huyck, Sue Jandro, Sarah Jansen, Iris Jastram, Jon Kleinberg, Carissa Knipe, Mark Krusemeyer,
Jessica Leiman, Lynn Liben, Doug Marshall, Jadrian Miles, Dave Musicant, Sneha Narayan, Gail Nelson,
Rich Nowell, Susan Nowell, Layla Oesper, Kate Olender, Steven Olender, Jeff Ondich, Sam Patterson,
Anna Rafferty, James Ryan, Alexa Sharp, Asmita Sodhi, Paula Stowe, Julia Strand, Sam Taylor, Mike Tie,
Anya Vostinar, Zach Weinersmith, Tom Wexler, Kevin Woods, Jed Yang, Steve Zdancewic, Doug Zimmer,
and Chico Zimmerman.
I’m grateful to the wonderful folks at Cambridge University Press—Lauren Cowles, Christian Green,
Amy He, Sarah Lewis, and especially Katie Leach—for their guidance, honesty, and ability to restore some
faith in the entire enterprise. Thanks to Judy Brody, Don Fowley, Bryan Gambrel, Beth Golub, Jessy Moor,
Anna Pham, Sondra Scott, and Gladys Soto for their help with a previous incarnation of this material. And
my gratitude to the many insightful reviewers of drafts of this material. So many times I got reviews back
and put them aside in a huff, only to come back to the reviewers’ comments months later and realize that
their suggestions were exactly right. (To be clear: blame me, not them, for the errors that I’m sure remain.)
Specific thanks to the Carleton CS 202 students and instructors who worked with early, and buggy, drafts
of this book. And thanks to those and many other students at Carleton for their patience, and for sending

ISTUDY
xiv Acknowledgments

their comments and suggestions for improvements: Hami Abdi, David Abel, Alexander Auyeung, Andrew
Bacon, Kharmen Bharucha, John Blake, Caleb Braun, Macallan Brown, Adam Canady, Noah Carnahan,
Yitong Chen, Jinny Cho, Leah Cole, Katja Collier, Lila Conlee, Eric Ewing, Kylie Fournier, Andy Freeland,
Emma Freeman, Samuel Greaves, Reilly Hallstrom, Jacob Hamalian, Sylvie Hauser, Jack Hessel, Joy Hill,
Matt Javaly, Emily Johnston, Emily Kampa, Carlton Keedy, Henry Keiter, Jonathan Knudson, Julia Kroll,
Brennan Kuo, Edward Kwiatkowski, Dimitri Lang, Tristan Leigh, Zach Levonian, Daniel Levy, Rhys
Lindmark, Gordon Loery, David Long, Robert Lord, Inara Makhmudova, Elliot Mawby, Javier Moran
Lemus, Sean Mullan, Micah Nacht, Justin Norden, Laurel Orr, Raven Pillmann, Josh Pitkofsky, Matthew
Pruyne, Nikki Rhodes, Will Schifeling, Colby Seyferth, Alex Simonides, Oscar Smith, Kyung Song, Fred-
erik Stensaeth, Patrick Stephen, Maximiliano Villarreal, Alex Voorhees, Allie Warren, Ben Wedin, Michael
Wheatman, Jack Wines, Christopher Winter, Kelly McGucken and Andrew Yang.

This book would not have been possible without the support of Carleton College, not only for the direct
support of this project, but also for providing a wonderfully engaging place to make my professional home.
When I started at Carleton, my friends and family back east thought that moving to Minnesota (the fron-
tier!) was nothing less than a sign that I had finally lost it, and I have to admit that I thought they had a
point. But it’s been a fabulous place to have landed, with great friends and colleagues and students—the
kind who don’t let you get away with anything, but in a good way.
Some stages of the work on this book occurred while I was visiting the University of Cambridge. Thanks
to Churchill College and the Computer Laboratory, and especially to Melissa Hines and Cecilia Mascolo,
for their hospitality and support.
And my thanks to the somewhat less formal host institutions that have fueled this writing: Brick Oven
Bakery, Cakewalk, Goodbye Blue Monday, Tandem Bagels, The Hideaway (Northfield, MN); Anodyne,
Blue Moon, Bull Run, Caffetto, Common Roots, Espresso Royale, Isles Bun & Coffee, Keen Eye, Plan
B, Precision Grind, Reverie, Spyhouse, Sebastian Joe’s, The Beat, The Nicollet, The Purple Onion, Turtle
Bread Company, Uncommon Grounds, Urban Bean (Minneapolis, MN); Ginkgo, Grand Central, Kopplin’s
(St. Paul, MN); Collegetown Bagels (Ithaca, NY); Slave to the Grind (Bronxville, NY); Bloc Eleven, Diesel
Cafe (Somerville, MA); Lyndell’s (Cambridge, MA); Tryst (Washington, DC); Hot Numbers, Espresso
Library (Cambridge, UK); and various Starbucks, Caribous, and Dunn Brothers.
The revisions to this book that resulted in the current version occurred during the annus horribilis of
2020–2021, which had the secondary effect of precluding even a single visit to any of those various tempo-
rary host institutions. My profound gratitude to K. for the kindness, encouragement, and support while I’ve
been toiling away, both at the book and at trying to maintain a semblance of being a human (maybe even
one that I recognize), at the dining table that somehow counts as a desk. Here’s to many more adventures
together (with less toil, maybe?) to come.

And, last but certainly not least, my deepest gratitude to my friends and family for all your help and
support while this project has consumed both hours and years. You know who you are, and I hope you also
know how much I appreciate you. Thank you!

David Liben-Nowell
Northfield, MN
August 2021

PS: I would be delighted to receive any comments or suggestions. Please don’t hesitate to get in touch.

ISTUDY
Credits

This book is a revised version of a book first published as Discrete Mathematics for Computer Science in
2017 (with a preliminary edition published in 2015) by John Wiley & Sons.
This book was typeset using LATEX, and I produced all but a few figures from scratch using TikZ. The
other figures are reprinted with permission from their copyright holders:

First page of each chapter: The illustrations that open every chapter were drawn for this book by Carissa
Knipe (carissaknipe.com), who was a complete delight to work with—both on these illustrations
and when she was a student at Carleton.

Figure 2.1: The photograph of the bear’s claws is a National Parks Service photo by Kaiti Critz, taken at
the Katmai National Park and Preserve. It is in the public domain.

Figure 2.15: The original (unblurred) image of Ada Lovelace is a section of a portrait by the artist Alfred
Edward Chalon (1780–1860), and is in the public domain.

Figure 2.49: The original (unquantized) image of Grace Hopper was taken by David C. MacLean. It is an
official U.S. Navy Photograph, from the collections of the Naval History and Heritage Command. It is
in the public domain.

Figure 4.20: The photograph of Paul Erdős is of him at a student seminar in Budapest in 1992, taken by the
Wikipedia user Kmhkmh; it is reproduced via the Creative Commons Attribution 3.0 Unported license
creativecommons.org/licenses/by/3.0. The photograph of Chad Jenkins was taken by Joseph
Xu of the College of Engineering at the University of Michigan, and is reproduced with permission.
The photograph of Bill Gates is by U.S. Department of Energy photographer Ken Shipp, from a 2013
visit to the Department of Energy, and is in the public domain.

Figure 4.30: The poem proving the undecidability of the halting problem is copyright © 2012 by Geoffrey
K. Pullum, and is reproduced with permission from the author.

Figure 5.11: The original image (without any hidden message) is from the National Security Agency
photograph “WAVE Demonstrating Bombe” (1944) [Visual Information Record Identification Number
190531-D-IM742-9007.JPG], and is in the public domain.

Figure 5.20: The images of the Morse–Vail Telegraph Key are derived from a 3D model from the Smith-
sonian Institution’s National Museum of American History, available at https://3d.si.edu, and in
the public domain.

Figure 5.21: The images of the rabbit are by Tobias Isenberg; they are printed with permission under CC
BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0/.

Figure 8.9: The images of the New York State delegation to the U.S. House of Representatives are all
official U.S. House of Representatives portraits, and are in the public domain.

ISTUDY
xvi Credits

Figure 9.23: The original (uncompressed) photograph is “Katherine Johnson At Her Desk at NASA Lan-
gley Research Center” by NASA photographer Bob Nye (NASA ID: LRC-1966-B701_P-06717). It is
in the public domain.
Figure 9.33: The image is based on a c. 1946 U.S. Army photograph that is in the public domain; it shows
two of the earliest-ever computer programmers, Betty Jean Jennings (later known as Jean Bartik) and
Frances Bilas (later Frances Spence), running the ENIAC.
Figure 9.35: The image of the Enigma machine is a section of a public-domain photograph from the CIA
Museum.

Quotations that appear at the beginning of sections of the book:

• The quotation from Dorothy Parker (p. 59) is printed with permission from the NAACP. The author
wishes to thank the National Association for the Advancement of Colored People for authorizing this
use of Dorothy Parker’s work.
• The translation of the quotation by Franz Kafka (p. 217) is from [67].
• The quotation from Mario Andretti (p. 269) is printed with permission from Mario Andretti.
• The quotation from Vincent van Gogh (p. 314) comes from a letter to Theo van Gogh (April 3rd, 1878).
The letter can be found in Vincent van Gogh, The Letters. Ed. Leo Jansen, Hans Luijten and Nienke
Bakker. Amsterdam 2009: www.vangoghletters.org. Reprinted with permission from the Van Gogh
Museum.
• The quotation from Nelson Mandela (p. 386) comes from an interview published by Reader’s Digest,
and it is printed with permission from the Nelson Mandela Foundation and from Reader’s Digest.
• The lyrics from Tom Lehrer’s “Poisoning Pigeons in the Park” (p. 463) appear with permission from
Tom Lehrer, and the Tom Lehrer Trust 2000 (tomlehrersongs.com).
• The lines from Emily Dickinson (p. 509) appear with permission from Harvard University Press. They
appear in The Poems of Emily Dickinson, edited by Thomas H. Johnson, Cambridge, MA: The Belknap
Press of Harvard University Press, Copyright © 1951, 1955 by the President and Fellows of Harvard
College. Copyright © renewed 1979, 1983 by the President and Fellows of Harvard College. Copyright ©
1914, 1918, 1919, 1924, 1929, 1930, 1932, 1935, 1937, 1942, by Martha Dickinson Bianchi. Copyright
© 1952, 1957, 1958, 1963, 1965, by Mary L. Hampson.
• The lyrics from John Gorka’s “I’m From New Jersey” (p. 549) are from the recording Jack’s Crows, and
appear with permission from John Gorka (www.johngorka.com).
• The lyrics from Kris Delmhorst’s “North Dakota” (p. 605) are excerpted with permission from Kris
Delmhorst (www.krisdelmhorst.com).

The remaining quotations that appear at the beginning of each section of the book are in the public domain.

ISTUDY
1 On the Point of this Book

In which our heroes decide, possibly encouraged by a requirement


for graduation, to set out to explore the world.

ISTUDY
1
2 On the Point of this Book

Why You Might Care


Read much, but not many Books.
Benjamin Franklin (1706–1790)
Poor Richard’s Almanack (1738)

This book is designed for an undergraduate student who has taken a computer science class or three.
Most likely, you are a sophomore or junior prospective or current computer science major taking your first
non-programming-based CS class. If you are a student in this position, you may be wondering why you’re
taking this class (or why you have to take this class!). Computer science students taking a class like this
one sometimes don’t see why this material has anything to do with computer science—particularly if you
enjoy CS because you enjoy programming.
I want to be clear: programming is awesome! I get lost in code all the time—it would be better not count
the number of hours that I spent writing the LATEX code to draw the Fibonacci word fractal in Figure 5.27,
or the divisibility relation in Figure 8.26, for example. (LATEX, the tool used to typeset this book, is the
standard typesetting package for computer scientists, and it’s actually also a full-fledged, if somewhat
bizarre, programming language.)
But there’s more to CS than programming. In fact, many seemingly unrelated problems rely on the same
sorts of abstract thinking. It’s not at all obvious that an optimizing compiler (a program that translates source
code in a programming language like C into something directly executable by a computer) would have
anything important in common with a program to play chess perfectly. But, in fact, they’re both tasks that
are best understood using logic (Chapter 3) as a central component of any solution. Similarly, filtering spam
out of your inbox (“given a message m, should m be categorized as spam?”) and doing speech recognition
(“given an audio stream s of a person speaking in English, what is the best ‘transcript’ reflecting the words
spoken in s?”) are both best understood using probability (Chapter 10).
And these, of course, are just examples; there are many, many ways in which we can gain insight and
efficiency by thinking more abstractly about the commonalities of interesting and important CS problems.
That is the goal of this book: to introduce the kind of mathematical, formal thinking that will allow you to
understand ideas that are shared among disparate applications of computer science—and to make it easier
for you to make your own connections, and to extend CS in even more new directions.

How To Use This Book


“Tom’s getting very profound,” said Daisy, with an expression of unthoughtful sadness.
“He reads deep books with long words in them.”
F. Scott Fitzgerald (1896–1940)
The Great Gatsby (1925)

The brief version of the advice for how to use this book is: it’s your book; use it however you’d like. (Will
Shortz, the crossword editor of The New York Times, gives the analogous advice about crossword puzzles
when he’s asked whether Googling for an answer is cheating.) But my experience is that students do best
when they read actively, with scrap paper close by; most people end up with a deeper understanding of a
problem by trying to solve it themselves first, before they look at the solution.

I’ve assumed throughout that you’re comfortable with programming in at least one language, including
familiarity with recursion. It doesn’t much matter which particular programming language you know; we’ll

ISTUDY
On the Point of this Book 3

use features that are shared by almost all modern languages—things like conditionals, loops, functions, and
recursion. You may or may not have already taken more than one programming-based CS course; many,
but not all, institutions require data structures as a prerequisite for this material. There are times in the book
when a data structures background may give you a deeper understanding (but the same is true in reverse if
you study data structures after this material). There are similarly a handful of topics for which a rudimen-
tary calculus background is valuable. But knowing/remembering calculus will be specifically useful only
a handful of times in this book; the mathematical prerequisite for this material is really algebra and “math-
ematical maturity,” which basically means having some degree of comfort with the idea of a mathematical
definition and with the manipulation of a mathematical expression. (The few places where calculus is help-
ful are explicitly marked, and there’s no point at which following a calculus-based discussion is necessary
to understand the subsequent material.)
There are 10 technical chapters after this one
in the book, followed by a brief concluding chap- 2 data types
ter (Chapter 12). Their dependencies are as shown
at right. Aside from these dependencies, there are 3 logic
occasional references to other chapters, but these
references are light, and intended to be enrich-
4 proofs
ing rather than essential. If you’ve skipped Chap-
ter 6—many instructors will choose not cover this
5 induction
material, as it is frequently included in a course on
algorithms instead of this one—then it will still be
useful to have an informal sense of O, Ω, and Θ 6 7 8 9 11
analysis of number relations counting graphs/trees
notation in the context of the worst-case running algorithms theory
time of an algorithm. (You might skim Section 6.1
and Section 6.6 before reading Chapters 7–11.)
When I teach the corresponding course at my 10
probability
institution, I typically include material from all
chapters except Chapter 8 and Chapter 11, omit-
ting various other sections and subsections as
necessary to fit into the time constraints. (Our aca- 12
demic calendar is a bit compressed: we have three looking forward

10-week terms per year, with students taking three


classes per term, and the courses end up being rather fast paced.) I also typically slightly reorder the later
chapters, moving number theory (Chapter 7) to be the last topic of the course.
I’ve tried to include some helpful tips for problem solving throughout the book, along with a few warn-
ings about common confusions and some notes on terminology and notation that may be helpful in keeping
the words and symbols straight. There are also two kinds of extensions to the main material. The “Taking
it Further” blocks give more technical details about the material under discussion—an alternate way of
thinking about a definition, or a more technical derivation or extension that’s relevant but not central, or a
way that a concept is used in CS or a related field. You should read the “Taking it Further” blocks if—but
only if!—you find them engaging. Each section also ends with one or more boxed-off “Computer Science
Connections” that show how the core material in that section can be used in a wide variety of (interesting,
I hope!) CS applications. No matter how interesting the core technical material may be, I think that it is
what we can do with it that makes it worth studying.

ISTUDY
4 On the Point of this Book

What This Book Is About


I have no wish to talk nonsense.
Charlotte Brontë (1816–1855)
Jane Eyre (1847)

This book focuses on discrete mathematics, in which the entities of interest are distinct and separate.
Discrete mathematics contrasts with continuous mathematics, as in calculus, which addresses infinites-
imally small objects, which cannot be separated. We’ll use summations rather than integrals, and we’ll
generally be thinking about things more like the integers (“1, 2, 3, . . .”) than like the real numbers (“all
numbers between π and 42”).

Be careful; there are two different words that are pronounced identically: discrete, adj., individually separate and distinct; and
discreet, adj., careful and judicious in speech, especially to maintain privacy or avoid embarrassment. You wouldn’t read a book
about discreet mathematics; instead, someone who trusts you might quietly share it while making sure no one was eavesdropping.

Because this book is mostly focused on non-programming-based parts of computer science, in general
the “output” that you produce when solving a problem will be something different from a program. Most
typically, you will be asked to answer some question (quantitatively or qualitatively) and to justify that
answer—that is, to prove your answer. (A proof is an ironclad, airtight argument that convinces its reader
of your claim.) Remember that your task in solving a problem is to persuade your reader that your pur-
ported solution genuinely solves the problem. Above all, that means that your main task in writing is
communication and persuasion.
There are three very reasonable ways of thinking about this book.

• View #1 is that this book is about the mathematical foundations of computation. This book is designed
to give you a firm foundation in mathematical concepts that are crucial to computer science: sets and
sequences and functions, logic, proofs, probability, number theory, graphs, and so forth.
• View #2 is that this book is about practice. Essentially no particular example that we consider matters;
what’s crucial is for you to get exposure to and experience with formal reasoning. Learning specific facts
about specific topics is less important than developing your ability to reason rigorously about formally
defined structures.
• View #3 is that this book is about applications of computer science: error-correcting codes (how to rep-
resent data redundantly so that the original information is recoverable even in the face of data corruption),
cryptography (how to communicate securely so that your information is understood by its intended recip-
ient but not by anyone else), natural language processing (how to interpret the “meaning” of an English
sentence spoken by a human using an automated dialogue system), and so forth. But, because solutions
to these problems rely fundamentally on sets and counting and number theory and logic, we have to
understand basic abstract structures in order to understand the solutions to these applied problems.

In the end, of course, all three views are right: I hope that this book will help to introduce some of the
foundational technical concepts and techniques of theoretical computer science, and I hope that it will also
help demonstrate that these theoretical approaches have relevance and value in work throughout computer
science—in topics both theoretical and applied. And I hope that it will be at least a little bit of fun.

Bon voyage!

ISTUDY
2 Basic Data Types

In which our heroes equip themselves for the journey ahead, by


taking on the basic provisions that they will need along the road.

ISTUDY
5
6 Basic Data Types

2.1 Why You Might Care


It is a capital mistake to theorize before one has data.
Sir Arthur Conan Doyle (1859–1930)
A Scandal in Bohemia (1892)

Imagine converting a color photograph to grayscale (as in Figure 2.1). Implementing this conversion
requires interacting with a slew of foundational data types (the basic “kinds of things”) that show up
throughout CS. A pixel is a sequence of three color values, red, green, and blue. (And an image is a
two-dimensional sequence of pixels.) Those color values are integers between 0 and 255 (because each
is represented as a sequence of 8 bits). The translation process is a function taking inputs (any of the set of
possible color values) and producing outputs (any of the set of grayscale values) using a particular formula.
Virtually every interesting computer science application uses these basic data types extensively. Cryp-
tography, which is devoted to the secure storage and transmission of information so that a malicious third
party cannot decipher that information, is typically based directly on integers, particularly large prime
numbers. A ubiquitous task in machine learning is to “cluster” a set of entities into a collection of nonover-
lapping subsets so that two entities in the same subset are similar and two entities in different subsets are
dissimilar. In information retrieval, where we might seek to find the document from a large collection that
is most relevant to a given query, it is common to represent each document by a vector (a sequence of
numbers) based on the words used in the document, and to find the most relevant documents by identifying
which ones “point in the same direction” as the query’s vector. And functions are everywhere in CS, from
data structures like hash tables to the routing that’s done for every packet of information on the internet.
This chapter introduces concepts, terminology, and notation related to the most common data types that
recur throughout this book, and throughout computer science. These basic entities—the Booleans (True
and False), numbers (integers, rationals, and reals), sets, sequences, functions—are also the basic data
types we use in modern programming languages. Some specific closely related topics will appear later in
the book, as well. But, really, every chapter of this book is related to this chapter: our whole enterprise
will involve building complex objects out of these simple ones (and, to be ready to understand the more
complex objects, we have to understand the simple pieces first). And before we launch into the sea of
applications, we need to establish some basic shared language. Much of the basic material in this chapter
may be familiar, but regardless of whether you have seen it before, it is important and standard content
with which it is important to be comfortable.
Photo by NPS/Kaiti Critz (Katmai Natl. Park & Preserve)

This grayscale
image was
produced from the
original color
using a rough-cut
formula
0.2126 · red +
0.7152 · green +
0.0722 · blue,
rounded down to
the nearest integer
(in other words,
using the floor
function).

Figure 2.1 Converting an image to grayscale. The second image’s highlighted pixels—variously corresponding to
brown fur, gray claws, and green plants—are all the same shade of gray. (That is, using the terminology of Section 2.5,
the function is not one-to-one, a fact that has implications for designing interfaces when some users are colorblind.)

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 7

2.2 Booleans, Numbers, and Arithmetic


“And you do Addition?” the White Queen asked. “What’s one and one and one and one
and one and one and one and one and one and one?”
“I don’t know,” said Alice. “I lost count.”
“She can’t do Addition,” the Red Queen interrupted.
Lewis Carroll (1832–1898)
Through the Looking-Glass (1871)

We start by introducing the most basic types of data: Boolean values (True and False), integers
(. . . , −2, −1, 0, 1, 2, . . .), rational numbers (fractions with integers as numerators and denominators), and
real numbers (including the integers and all the numbers in between them). The rest of this section will then
introduce some basic numerical operations: absolute values and rounding, exponentiation and logarithms,
summations and products. Figure 2.2 summarizes this section’s notation and definitions.

2.2.1 Booleans: True and False


The most basic unit of data is the bit: a single piece of information, which either takes on the value 0 or the
value 1. Every piece of stored data in a digital computer is stored as a sequence of bits. (See Section 2.4
for a formal definition of sequences.)
We’ll view bits from several different perspectives: 1 and 0, on and off, yes and no, True and False. Bits
viewed under the last of these perspectives have a special name, the Booleans:

Definition 2.1: Booleans.


A Boolean value is either True or False.

(Booleans are named after George Boole (1815–1864), a British mathematician, who was the first person
to think about True as 1 and False as 0.) The Booleans are the central object of study of Chapter 3, on
logic. In fact, they are in a sense the central object of study of this entire book: simply, we are interested in
making true statements, with a proof to justify why the statement is true.

2.2.2 Numbers: Integers, Reals, and Rationals


We’ll often encounter a few common types of numbers—integers, reals, and rationals:

Definition 2.2: Integers, reals, and rationals.


The integers, denoted by Z, are those numbers with no fractional part: 0, the positive integers (1, 2, . . .),
and the negative integers (−1, −2, −3, . . .).
The real numbers, denoted by R, are those numbers that can be (approximately) represented by
decimal numbers; informally, the reals include all integers and all numbers “between” any two integers.
The rational numbers, denoted by Q, are those real numbers that can be represented as a ratio mn of two
integers n and m, where n is called the numerator and m ̸= 0 is called the denominator. A real number
that is not rational is called an irrational number.

The superficially unintuitive notation for the integers, the symbol Z, is a stylized “Z” that was chosen because of the German
word Zahlen, which means “numbers.” The name rationals comes from the word ratio; the symbol Q comes from its synonym
quotient. (Besides, the symbol R was already taken by the reals, so the rationals got stuck with their second choice.)

ISTUDY
8 Basic Data Types

Booleans True and False


Z integers (. . . , −3, −2, −1, 0, 1, 2, 3, . . .)
Q rational numbers
R real numbers

[a, b] those real numbers x where a ≤ x ≤ b


(a, b) those real numbers x where a < x < b
[a, b) those real numbers x where a ≤ x < b
(a, b] those real numbers x where a < x ≤ b

|x| absolute value of x: |x| = −x if x < 0; |x| = x if x ≥ 0


⌊x⌋ floor of x: x rounded down to the nearest integer
⌈x⌉ ceiling of x: x rounded up to the nearest integer
bn b multiplied by itself n times

b1/n , or n b a number y such that yn = b (where y ≥ 0 if possible), if one exists
bm/n (b1/n )m
logb x logarithm: logb x is the value y such that by = x, if one exists
n mod k modulo: n mod k = the remainder when dividing n by k
k|n k (evenly) divides n
P P
summation: ni=1 xi = x1 + x2 + · · · + xn
Q Qn
product: i=1 xi = x1 · x2 · · · · · xn

Figure 2.2 Summary of the basic mathematical notation introduced in Section 2.2.

Here are a few examples of each of these types of numbers:

Example 2.1: A few integers, reals, and rationals.


The following are all examples of integers: 1, 42, 0, and −17.
All of the following are real numbers: 1, 99.44, 31 = 0.33333 · · · , the ratio of√the circumference of a
circle to its diameter π ≈ 3.14159 · · · , and the so-called golden ratio ϕ = (1 + 5)/2 ≈ 1.61803 · · · .
Examples of rational numbers include 32 , 95 , 16 4
4 , and 1 . (In Chapter 8, we’ll talk about the familiar notion
of the equivalence of two rational numbers like 12 and 24 , or like 16 4
4 and 1 , based on common divisors.
See Example 8.36.) Of the example real numbers above, all of 1 and 99.44 and 0.3333 · · · are rational
numbers; we can write them as 11 and 4972 1
50 and 3 , for example. Both π and ϕ are irrational.

Note that all integers are rational numbers (with denominator equal to 1), and all rational numbers are real
3
numbers. But not all√rational numbers are integers and not
√ all real numbers are rational: for example, 2 is
not an integer, and 2 is not rational. (We’ll prove that 2 is not rational in Example 4.20.)

Taking it further: Definition 2.2 specifies Z, Q, and R somewhat informally. To be completely rigorous, one can define the
nonnegative integers as the smallest collection of numbers such that: (i) 0 is an integer; and (ii) if x is an integer, then x + 1 is
also an integer. See Section 5.4.1. (Of course, for even this definition to make sense, we’d need to define the number zero and
also define the operation of adding one.) With a proper definition of the integers, it’s fairly easy to define the rationals as ratios
of integers. But formally defining the real numbers is surprisingly challenging; it was a major enterprise of mathematics in the
late 1800s, and is often the focus of a first course in analysis in an undergraduate mathematics curriculum.
Nearly all programming languages support both integers (usually known as ints) and real numbers (usually known as floats);
see p. 21 for a bit about how these basic numerical types are implemented in real computers. (Rational numbers are much less
frequently implemented as basic data types in programming languages, though there are some exceptions, like Scheme.)

In addition to the basic symbols that we’ve introduced to represent the integers, the rationals, and the reals
(Z, Q, and R), we will also introduce special notation for some specific subsets of these numbers. We will

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 9

(a) (1, 4) (b) [1, 4] (c) [1, 4) (d) (1, 4]

0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5

Figure 2.3 Number lines representing real numbers between 1 and 4, with 1 included in the range in (b) and (c), and 4
included in the range in (b) and (d).

write Z≥0 and Z≤0 to denote the nonnegative integers (0, 1, 2, . . .) and nonpositive integers (0, −1, −2, . . .),
respectively. Generally, when we write Z with a superscripted condition, we mean all those integers for
which the stated condition is true. For example, Z̸= 1 denotes all integers aside from 1. Similarly, we write
R>0 to denote the positive real numbers (every real number x > 0). Other conditions in the superscript of
R are analogous.
We’ll also use standard notation for intervals of real numbers, denoting all real numbers between two
specified values. There are two variants of this notation, which allow “between two specified values” to
either include or exclude those specified values. We use round parentheses to mean “exclude the endpoint”
and square brackets to mean “include the endpoint” when we denote a range. For example, (a, b] denotes
those real numbers x for which a < x ≤ b, and [a, b) denotes those real numbers x for which a ≤ x < b.
Sometimes (a, b) and [a, b] are, respectively, called the open interval and closed interval between a and b.
These four types of intervals are also sometimes denoted via a number line, with open and closed circles
denoting open and closed intervals; see Figure 2.3 for an example.
For two real numbers x and y, we will use the standard notation “x ≈ y” to denote that x is approximately
equal to y. This notation is defined informally, because what counts as “close enough” to be approximately
equal will depend heavily on context.

2.2.3 Absolute Value, Floor, and Ceiling


In the remaining parts of Section 2.2, we will give definitions of some standard arithmetic operations that
involve the numbers we just defined. We’ll start in here with three operations on a real number: absolute
value, floor, and ceiling.
The absolute value of a real number x, written |x|, denotes how far x is from 0, disregarding the sign of
x (that is, disregarding whether x is positive or negative):

Definition 2.3: Absolute value.


(
x if x ≥ 0
The absolute value of a real number x is |x| =
−x otherwise.

For example, |42.42| = 42.42 and |−128| = 128. (Definition 2.3 uses notation for defining “by cases”:
the value of |x| is x when x ≥ 0, and the value of |x| is −x otherwise—that is, when x < 0.)
For a real number x, we can consider x “rounded down” or “rounded up,” which are called the floor and
ceiling of x, respectively:

Definition 2.4: Floor and ceiling.


The floor of a real number x, written ⌊x⌋, is the largest integer that is less than or equal to x. The ceiling
of x, written ⌈x⌉, is the smallest integer that is greater than or equal to x.

ISTUDY
10 Basic Data Types

⌈ √ ⌉ ⌈√ ⌉
⌊ √ ⌋ − 2 = −1 ⌊√ ⌋ 2 =2 ⌈4⌉ = 4
− 2 = −2 2 =1 ⌊4⌋ = 4

−3 −2 −1 0 1 2 3 4 5

√ √
Figure 2.4 The floor and ceiling of − 2, 2, and 4.

Note that Definition 2.4 defines the floor and ceiling of negative numbers, too; the definition doesn’t care
whether x is greater than or less than 0. Here are a few examples:

Example 2.2: A few floors and ceilings.



√ have ⌊ 2⌋ = ⌊1.4142 · · ·⌋ = 1 and ⌊2π⌋ = ⌊6.28318 · · ·⌋ = 6 and ⌊4⌋ = 4. For ceilings, we have
We
⌈ 2⌉ = 2 and ⌈2π⌉ = 7 and √ ⌈4⌉ = 4. √
For negative numbers, ⌊− 2⌋ = ⌊−1.4142 · · ·⌋ = −2 and ⌈− 2⌉ = −1.

The number line may give an intuitive way to think about floor and ceiling: ⌊x⌋ denotes the first integer that
we encounter moving left in the number line starting at x; ⌈x⌉ denotes the first integer that we encounter
moving right from x. (And x itself counts for both definitions.) See Figure 2.4.

2.2.4 Exponentiation
We next consider raising a number to an exponent or power.

Definition 2.5: Raising a number to an integer power.


For a real number b and a nonnegative integer n, the number bn denotes the result of multiplying b by
itself n times:
b0 = 1 and, for n ≥ 1, bn = b · b · · · · · b .
n times

The number b is called the base and the integer n is called the exponent.

For example, 20 = 1 and 22 = 2 · 2 = 4 and 25 = 2 · 2 · 2 · 2 · 2 = 32 and 52 = 5 · 5 = 25.

Taking it further: Note that Definition 2.5 says that b0 = 1 for any base b, including b = 0. The case of 00 is a little tricky: one
is tempted to say both “0 to the anything is 0” and “anything to the 0 is 1.” But, of course, these two statements are inconsistent.
Lots of people define 00 as 1 (as Definition 2.5 does); lots of other people (particularly those who are giving definitions based on
calculus) treat 00 as undefined. We’ll live in the first camp because it won’t get us in any trouble, and some results that we’ll see
in later chapters become more convoluted and more annoying to state if 00 were undefined. If you’re interested in reading more
about the (over 200 years’ worth of) history and the reasoning behind this decision, see [58].

Raising a base to nonintegral exponents


Consider the expression bx for an exponent x > 0 that is not an integer. (It’s all too easy to have done
this calculation by typing numbers into a calculator without actually thinking about what the expression
actually means!) Here’s the definition of bm/n when the exponent mn is a rational number:

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 11

Definition 2.6: Raising a number to a positive rational power.


For a real number b and for a positive integer n ̸= 0, the expression b1/n denotes the number y such that
yn = b. (If there are two values of y with yn = b, then b1/n means the one that’s positive. If there are no
values of y with yn = b, then b1/n is undefined.)
For a positive integer m, the expression bm/n denotes the mth power of b1/n : that is, bm/n = (b1/n )m .

The value b1/n is called the nth root of b, and it can also be denoted by n b.
For example, 161/2 = 4 because
√ 161/2 is the value y such that y2 = 16, and 42 = 16. Similarly, 161/4 = 2
because 24 = 16. So 161/2 = 16 is 4, and the 4th root of 16 is 2. (As the definition suggests, there may be
2

more than one y such that yn = b. For example, both y = 2 and y = −2 satisfy the condition that y4 = 16.
By the definition, if there are positive and negative values of y satisfying the requirement, we choose the
positive one. So 161/4 = 2, not −2.) Here are a few other examples:

Example 2.3: Some fractional exponents.


The value of 51/2 is roughly 2.2360679774, because 2.23606797742 ≈ 5. (But note that this value of
51/2 is only an approximation, because actually 2.23606797742 = 4.99999999955372691076 ̸= 5.)
When the base is negative, the nth root does not always exist. To find the value of (−8)1/3 , we need to
find a y such that y3 = −8. No y ≥ 0 satisfies this condition, but y = −2 does. Thus (−8)1/3 = −2. On
the other hand, let’s try to compute the value of (−8)1/2 . We would need a value y such that y2 = −8.
But no y ≥ 0 satisfies this condition, and no y ≤ 0 does either. Thus (−8)1/2 is undefined.

When we write b without explicitly
√ indicating which root is intended, then we are talking about the

square root of b. In other words, b = 2 b denotes the (nonnegative) value y such that y2 = b. An integer

n is called a perfect square if n is an integer.

Taking it further: Definition 2.6 presents difficulties if we try to compute, say, −1: the definition tells us that we need to
find a number y such that y = −1. But y ≥ 0 if y ≤ 0 and also if y ≥ 0, so no real number y satisfies the requirement
2 2

y2 = −1. (We had the same problem in Example 2.3, in trying to compute (−8)1/2 .) To handle this situation, several centuries
ago some creative mathematicians defined the imaginary numbers as those resulting from taking the square root of negative

numbers, specifically defining i = −1. (The name “real” to describe real numbers was chosen to contrast with the imaginary
numbers.) We will not be concerned with imaginary numbers in this book, although—perhaps surprisingly—there are some very
natural computational problems in which imaginary numbers are fundamental parts of the best algorithms solving them, such as
the Fast Fourier transform in signal processing and speech processing (transcribing English words from a raw audio stream) or
even quickly multiplying large numbers together [32].

Definition 2.7: Raising a number to a negative power.


1
When the exponent x is negative, then bx is defined as b−x .

For example, 2−4 = 214 = 16 1


and 25−3/2 = 2513/2 = (251/2
1
)3
= 513 = 125
1
.
For an irrational exponent x, the value of bx is approximated arbitrarily closely by choosing a rational
number mn sufficiently close to x and computing the value of bm/n .

Taking it further: A rigorous treatment of irrational powers requires calculus; we will omit the details as they are tangential to
our purposes in this book. But the basic idea is to approximate the exponent with a close-enough rational number, which then
provably approximates the result of raising the base to that power. For example, we can approximate 2π ≈ 8.82497782708 · · ·
more and more closely as 23 = 8, or 231/10 = 8.5741 · · · , or 2314/100 = 8.8815 · · · , or 23141/1000 = 8.8213 · · · , etc.

ISTUDY
12 Basic Data Types

While essentially every modern programming language supports exponentiation—including positive, fractional, and negative
powers—in some form, often in a separate math library, the actual behind-the-scenes computation is rather complicated. See

p. 22 for some discussion of the underlying steps that are done to compute a quantity like x.

Here are a few useful facts about exponentiation:

Theorem 2.8: Properties of exponentials.


For any real numbers a and b, and for any rational numbers x and y:
b0 = 1 (2.8.1)
1
b =b (2.8.2)
b x+y
=b ·b
x y
(2.8.3)
x y xy
(b ) = b (2.8.4)
(ab) = a · b
x x x
(2.8.5)

These properties follow fairly straightforwardly from the definition of exponentiation. (The properties of
Theorem 2.8 carry over to irrational exponents, though the proofs are less straightforward. See the note
after Definition 2.5 regarding (2.8.1).)

2.2.5 Logarithms
Problem-solving tip: I have found many CS students scared, and scarred, by logs. The fear appears to me to result from students
attempting to memorize facts about logs without trying to think about what they mean. Mentally translating between logs and
exponentials can help make these properties more intuitive and can help make them make sense. Often the intuition of a property
of exponentials is reasonably straightforward to grasp.

The logarithm (or log) is the inverse operation to exponentiation: the value of an exponential by is the result
of multiplying a number b by itself y times, while the value of a logarithm logb x is the number of times
we must multiply b by itself to get x.

Definition 2.9: Logarithm.


For a positive real number b ̸= 1 and a real number x > 0, the logarithm base b of x, written logb x, is
the real number y such that by = x.

Example 2.4: Some logs.


The quantity log3 81 is the power to which we must raise 3 to get 81—and thus log3 81 = 4, because
34 = 3 · 3 · 3 · 3 = 81. Similarly:
log4 16 = 2 log4 2 = 0.5 log128 1 = 0 and log2 3 ≈ 1.5849625.

because 42 = 16 because 2 = 4 = 41 2/
because 1280 = 1 because 21 5849625 = 2.999999998 ≈ 3
.

For any base b, note that logb x does get larger as the value of x increases, but it gets larger very slowly.
Figure 2.5 illustrates the slow rate of growth of log2 x and log10 x as x grows.
For a real number x ≤ 0 and any base b, the expression logb x is undefined. For example, the value of
log2 (−4) would be the number y such that 2y = −4—but 2y can never be negative. Similarly, logarithms
base 1 are undefined: log1 2 would be the number y such that 1y = 2—but 1y = 1 for every value of y.

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 13

15
log2 x
10

5
log10 x

4000 8000 12000 16000 20000

Figure 2.5 A graph of log2 x and log10 x.

Logarithms show up frequently in the analysis of data structures and algorithms, including a number
that we will discuss in this book. Several facts about logarithms will be useful in these analyses, and are
also useful in other settings. Here are a few (these properties generally follow directly from the analogous
properties of exponentials in Theorem 2.8):

Theorem 2.10: Properties of logarithms.


For any real numbers b > 1, c > 1, x > 0, and y > 0, the following properties hold:
logb 1 = 0 (2.10.1)
logb b = 1 (2.10.2)
logb xy = logb x + logb y log of a product (2.10.3)
logb x
y = logb x − logb y log of a quotient (2.10.4)

logb xy = y logb x (2.10.5)


logc x
logb x = logc b “change of base” formula (2.10.6)

We will make use of one standard piece of notational shorthand: often the expression log x is written
without an explicit base. When computer scientists write the expression log x, we mean log2 x. One other
base is commonly used in logarithms: the natural logarithm ln x denotes loge x, where e ≈ 2.718281828 · · ·
is defined from calculus as e = limn→∞ (1 + 1n )n .

Throughout this book (and throughout computer science), the assumed base of log x is 2. (Some computer scientists write lg x to
denote log2 x; we’ll simply write log x.) But be aware that mathematicians or engineers may treat the default base to be e or 10.

2.2.6 Moduli and Division


So far, we’ve discussed multiplying numbers (repeatedly, to compute exponentials); here, we turn to the
division of one number by another. When we consider dividing two integers—64 by 5, for example—there
are several useful values to consider: regular-old division ( 64
5 = 12.8), what’s sometimes called integer
division giving “the whole part” of the fraction (⌊ 5 ⌋ = 12), and the remainder giving “the leftover part”
64

of the fraction (the difference between 64 and 12 · 5, namely 64 − 60 = 4).


We will return to these notions of division in great detail in Chapter 7, but we’ll begin here with the
formal definitions for the notions related to remainders:

ISTUDY
14 Basic Data Types

Definition 2.11: Modulus (remainder).


For any integers k > 0 and n, the integer n mod k is the remainder when we divide n by k. Using the
 
“floor” notation from Section 2.2.3, the value of n mod k is n − k · nk .

Example 2.5: Three values mod 3.


The value of 8 mod 3 is 2, because 8 = 2 + 6 (and 6 is a multiple of 3). (Or, using the definition directly:
 
8 mod 3 = 2 because 83 = ⌊2.6666 · · ·⌋ = 2, and 8 − 2 · 3 = 8 − 6 = 2.)
 
Similarly, 28 mod 3 = 1, because 28 = 9, and 28 − 9 · 3 = 28 − 27 = 1.
 48  3
And 48 mod 3 = 0, because 3 = 16, and 48 − 16 · 3 = 0.

Taking it further: In many programming languages, the / operator performs integer division when its arguments are both
integers, and performs “real” division when either argument is a floating-point number. So the expression 64 / 5 will yield 12,
but 64.0 / 5 and 64 / 5.0 and 64.0 / 5.0 will all yield 12.8. In this book, though, we will always mean “real” division
when we write x/y or yx . The n mod k operation, too, is a standard one in programming languages—it’s written as n % k in many
languages, including Java, Python, and C/C++, for example.

In Definition 2.11, we allowed n to be a negative integer, which may stretch your intuition about remainders
a bit. Here’s an example of this case of the definition:

Example 2.6: A negative integer mod 5.


We’ll compute −3 mod 5 simply by following the definition of mod from Definition 2.11:
 
−3 mod 5 = (−3) − 5 · −3 5 = (−3) − 5 · (−1) = (−3) + 5 = 2.
Viewed from an appropriate perspective, this calculation should actually be reasonably intuitive: the value
r = n mod k gives the amount r by which n exceeds its closest multiple of k. (And −3 is 2 more than a
multiple of 5, namely −5, so −3 mod 5 = 2.)

Notice that the value of n mod k is always at least 0 and at most k − 1, for any n and any k > 0; the
remainder when dividing by k can never be k or more. When nk has zero remainder, then we say that k
(evenly) divides n:

Definition 2.12: Integer k (evenly) divides integer n.


For any integers k > 0 and n, we say that k divides n, written k | n, if n
k is an integer. Notice that k | n is
equivalent to n mod k = 0.

By rearranging the floor-based definition from Definition 2.11 when n mod k = 0, we can see that the
 
condition k | n is also equivalent to the condition k · nk = n.

Example 2.7: What 5 divides.


Because 5 · ⌊ 10
5 ⌋ = 5 · 2 = 10, we know 5 | 10. But 5 · ⌊ 5 ⌋ = 5 · 1 = 5 ̸= 9, so 5 ̸ | 9.
9

Some special numbers: evens, odds, primes, composites


A few special types of integers are defined in terms of their divisibility—specifically based on whether they
are divisible by 2 (evens and odds), or whether they are divisible by any other integer except for 1 (primes
and composites).

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 15

Definition 2.13: Even, odd, and parity.


A nonnegative integer n is even if n mod 2 = 0, and n is odd if n mod 2 = 1. The parity of n is its
“oddness” or “evenness.”

For example, we have 17 mod 2 = 1 and 42 mod 2 = 0, so 17 is odd and 42 is even.

Taking it further: If we view 0 as False and 1 as True (see Section 2.2.1), then the value n mod 2 can be interpreted as a Boolean
value. In fact, there’s a deeper connection between arithmetic and the Booleans than might be readily apparent. The “exclusive
or” of two Boolean values p and q (which we will encounter in Section 3.2.3) is denoted p ⊕ q, and the expression p ⊕ q is true
when one but not both of p and q is true. The exclusive or is sometimes referred to as the parity function, because p + q is odd
(viewing p and q as numerical values, 0 or 1) exactly when p ⊕ q is true (viewing p and q as Boolean values, False or True).

Definition 2.14: Prime and composite numbers.


A positive integer n > 1 is prime if the only positive integers that evenly divide n are 1 and n itself. A
positive integer n > 1 is composite if it is not prime.

Notice that the definition of prime numbers does not include 0 and 1, and neither does the definition of
composite numbers: in other words, 0 and 1 are neither composite nor prime.

Example 2.8: Prime numbers.


Is 77 prime? What about 7?
Solution. 77 is not prime, because it is evenly divisible by 7. In other words, because 77 mod 7 = 0 (and
the integer 7 that evenly divides 77 is neither 1 nor 77 itself), 77 is composite. (11 also divides 77.)
On the other hand, 7 is prime. Convincing yourself that a number is prime is harder than convincing
yourself that one is not prime, but we can see it by trying all the possible divisors (every positive integer
except 1 and 7): 7 mod 2 = 1 and 7 mod 3 = 1 and 7 mod 4 = 3 and 7 mod 5 = 2 and 7 mod 6 = 1,
and furthermore 7 mod d = 7 for any d ≥ 8. None of these remainders is zero, so 7 is prime.

Example 2.9: Small primes and composites.


The first ten prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29.
The first ten composite numbers are 4, 6, 8, 9, 10, 12, 14, 15, 16, and 18.

Chapter 7 is devoted to the properties of modular arithmetic, prime numbers, and the like. These quantities
have deep and important connections to cryptography, error-correcting codes, and other applications that
we’ll explore later.

2.2.7 Summations and Products


There is one final piece of notation related to numbers that we need to introduce: a simple way of expressing
the sum or product of a collection of numbers. We’ll start with the compact summation notation that allows
us to express the result of adding many numbers:

ISTUDY
16 Basic Data Types

Definition 2.15: Summation notation.


Pn
Let x1 , x2 , . . . , xn be a sequence of n numbers. We write i=1 xi (usually read as “the sum for i equals 1
to n of xi ”) to denote the sum of the xi s:
P
n
xi = x1 + x2 + · · · + xn .
i=1

The variable i is called the index of summation or the index variable.


P0
Note that i=1 xi = 0: when you add nothing together, you end up with zero.

Pn P
n
A note about notation: i=1 xi and xi mean exactly the same thing.
i=1
(The placement of the “limits” (1 and n) is an aesthetic matter of layout, not a matter of meaning.)

It may be helpful to interpret this summation notation as if it expressed a for loop, as shown in Figure 2.6.

Example 2.10: Some small summations.


Let a1 = 2, a2 = 4, a3 = 8, and a4 = 16, and let b1 = 1, b2 = 2, b3 = 3, b4 = 4, and b5 = 8. Then
P
4 P
5
ai = a1 + a2 + a3 + a4 and bi = b1 + b2 + b3 + b4 + b5
i=1 i=1
= 2 + 4 + 8 + 16 = 30 = 1 + 2 + 3 + 4 + 8 = 18.

In general, instead of just adding xi in the ith term of the sum, we can add any expression involving the
index of summation. (We can also start the index of summation at a value other than 1: to denote the sum
Pn
xj + xj+1 + · · · + xn , we write i=j xi .) Here are a few examples:

Example 2.11: Some sums.


Let a1 = 2, a2 = 4, a3 = 8, and a4 = 16. Then
P
4 P
3 P
4
ai and (ai + 1) and i
i=2 i=1 i=1
= 4 + 8 + 16 = (2 + 1) + (4 + 1) + (8 + 1) =1+2+3+4
= 28 = 17 = 10.

Example 2.12: Some more sums.


As in Example 2.11, let a1 = 2, a2 = 4, a3 = 8, and a4 = 16. Evaluate the following four expressions:
P
4 P
4 P
4 P
4
i2 and i2 and (ai + i2 ) and 5.
i=1 i=2 i=1 i=1

1 result := 0 ∑n
2 for i := 1, 2, . . . , n: ∑0 of i 1 xi might help make the “empty sum” (when n = 0) more
This for loop interpretation =

intuitive: the value of i 1 xi = 0 is simply 0 because result is set to 0 in line 1, and it never changes,
=
3 result := result + xi because n = 0 (and therefore line 3 is never executed).
4 return result
∑n
Figure 2.6 A for loop that returns the value of i=1 xi .

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 17

Solution. Here are the values of these sums:


P
4 P
4 P
4 P
4
i2 = 30 and i2 = 29 and (ai + i2 ) = 60 and 5 = 20.
i=1 i=2 i=1 i=1
12 + 22 + 32 + 42 22 + 32 + 42 (2 + 12 ) + (4 + 22 ) 5+5+5+5
+ (8 + 32 ) + (16 + 42 )

Two special types of summations arise frequently enough to have special names. A geometric series is
Pn i Pn
i=1 r for some real number r; an arithmetic series is i=1 i · r for a real number r. (See Section 5.2.2.)
We will very occasionally consider an infinite sequence of numbers x1 , x2 , . . . , xi , . . .; we may write
P∞
i=1 xi to denote the infinite sum of these numbers. Here’s one example of an infinite summation:

Example 2.13: An infinite sum.


P∞
Define xi = 1/2i , so that x1 = 21 , x2 = 41 , x3 = 18 , and so forth. We can write i=1 xi to denote
2 + 4 + 8 + 16 + · · · . The value of this summation is 1: each term takes the sum halfway closer to 1.
1 1 1 1

While the for loop in Figure 2.6 would run forever if we tried to apply it to an infinite summation, the idea
remains precisely the same: we successively add the value of each term to the result variable. (We will
discuss this type of infinite sum in detail in Section 5.2.2, too.)

Reindexing summations
Just as in a for loop, the “name” of the index variable in a summation doesn’t matter, as long as it’s used
P5 P5
consistently. For example, both i=1 ai and j=1 aj denote the value of a1 + a2 + a3 + a4 + a5 .
We can also rewrite a summation by reindexing it (also known as using a change of index or a change
of variable), by adjusting both the limits of the sum (lower and upper) and what’s being summed while
ensuring that, overall, exactly the same things are being added together. Here are two examples:

Example 2.14: Shifting by two.


Pn Pn−2
The sums i=3 i and j=1 (j+2) are equal, because both express 3 + 4 + 5 + · · · + n. (We have applied
the substitution j := i − 2 to get from the first summation to the second.)

Example 2.15: Counting backward.


Pn
Consider the summation i=0 (n − i). If we define j := n − i, then the summed value becomes j instead
of n − i, and the range i = 0, 1, . . . , n corresponds to j = n − 0, n − 1, . . . , n − n (or, more simply, to
Pn Pn
j = n, n − 1, . . . , 0). Thus i=0 (n − i) and j=0 j are equal. (See Figure 2.7.)

Reindexing can be surprisingly helpful when we’re confronted by ungainly summations; doing so can often
turn the given summation into something more familiar.

∑8
i= 0 1 2 3 4 5 6 7 8 cell i contains the value 8 − i, so the total is i=0 (8 − i).
8 7 6 5 4 3 2 1 0
∑8
define j as j := n − i j= 8 7 6 5 4 3 2 1 0 cell j contains the value j, so the total is j=0 j.

∑8 ∑8
Figure 2.7 Computing 0 + 1 + · · · + 8 in two ways: j=0 j= i=0 (8 − i). (See Example 2.15.)

ISTUDY
18 Basic Data Types

j=1 2 3 4 5 6 j=1 2 3 4 5 6
P1 P1
i=1 5 j=1 5 = 5 · 1 = 5 i=1 1 j=1 j=1= 1
P2 P2
2 5 5 j=1 5 = 5 · 2 = 10 2 1 2 j=1 j = 1 + 2 = 3
P3 P3
3 5 5 5 j=1 5 = 5 · 3 = 15 3 1 2 3 j=1 j=1+2+3= 6
P4 P4
4 5 5 5 5 j=1 5 = 5 · 4 = 20 4 1 2 3 4 j=1 j = 1 + 2 + 3 + 4 = 10
P5 P5
5 5 5 5 5 5 j=1 5 = 5 · 5 = 25 5 1 2 3 4 5 j=1 j = 1 + 2 + 3 + 4 + 5 = 15
P6 P6
6 5 5 5 5 5 5 j=1 5 = 5 · 6 = 30 6 1 2 3 4 5 6 j=1 j = 1 + 2 + 3 + 4 + 5 + 6 = 21
∑6 ∑i ∑6 ∑i
Figure 2.8 The terms of i=1 j=1 5. Figure 2.9 The terms of i=1 j=1 j.

Nested sums
We can sum any expression that depends on the index variable—including summations. These summations
are called double summations or, more generally, nested summations. Just as with nested loops in programs,
the key is to read “from the inside out” in simplifying a summation. Here are two examples:

Example 2.16: A double sum.


P6 P i
Let’s compute the value of the nested sum 5.
Pi i=1 j=1
For any i, the value of j=1 5 is just 5i, because we are summing i different copies of the number 5. (See
P6
Figure 2.8.) Therefore the given sum’s value is i=1 5i = 5 + 10 + 15 + 20 + 25 + 30 = 105.

Example 2.17: A slightly more complicated double sum.


6 P
P i
What is the value of the sum j?
i=1 j=1
Pi
Solution. The inner sum ( j=1 j) has the values shown in Figure 2.9: 1 for i = 1, 3 for i = 2, 6 for i = 3,
P6 Pi
etc. Thus i=1 j=1 j = 1 + 3 + 6 + 10 + 15 + 21 = 56.

When you’re programming and need to write two nested loops, it sometimes ends up being easier to write
the loops with one variable in the outer loop rather than the other variable. Similarly, it may turn out to
be easier to think about a nested sum by reversing the summation—that is, swapping which variable is
the “outer” summation and which is the “inner.” If we have any sequence ai,j of numbers indexed by two
Pn Pn Pn Pn
variables i and j, then i=1 j=1 ai,j and j=1 i=1 ai,j have precisely the same value.
Here are two examples of reversing the order of a double summation, for the tables shown in Figure 2.10:

Example 2.18: Adding up the numbers in a small table.


Consider the table in Figure 2.10a. Write ai,j to denote the element in the ith row and jth column of the
table. Then we can compute the sum of elements in the table by adding up the row sums:
P
3 P
4
ai,j = 23 + 18 + 19 = 60.
i=1 j=1 sum of elements sum of elements sum of elements
in row 1 in row 2 in row 3

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 19

j=1 2 3 4 5 6 7 8
i = 1 −1 −1 −2 −2 −3 −3 −4 −4
2 1 1 2 2 3 3 4 4 j=1 2 3 4 5 6
3 −1 −1 −2 −2 −3 −3 −4 −4 i=1 1
4 1 1 2 2 3 3 4 4 2 1 2
1 2 3 4 5 −1 −1 −2 −2 −3 −3 −4 −4 3 1 2 3
1 7 5 6 5 6 1 1 2 2 3 3 4 4 4 1 2 3 4
2 5 5 1 7 7 −1 −1 −2 −2 −3 −3 −4 −4 5 1 2 3 4 5
3 3 5 8 3 8 1 1 2 2 3 3 4 4 6 1 2 3 4 5 6

∑n ∑n ( ) ∑6 ∑i
(a) A small table. (b) i=1 j=1 (−1)i · ⌈ 2j ⌉ , for n = 8. (c) The terms of i=1 j=1 j.

Figure 2.10 A few tables whose elements we’ll sum “row-wise” and “column-wise.”

Or, by adding up the column sums, the sum of elements in the table is also

P
4 P
3
ai,j = 15 + 15 + 15 + 15 = 60.
j=1 i=1 sum of elements sum of elements sum of elements sum of elements
in column 1 in column 2 in column 3 in column 4

Example 2.19: A double sum, reversed.


n 
n P
P  j 
Let n = 8. What is the value of the sum (−1)i · 2 ?
i=1 j=1

Solution. We are computing the sum of all the values contained in the table in Figure 2.10b. The hard
way to add up all of these values is by computing the row sums, and then adding them all up. (The given
equation expresses this hard way.) The easier way is reverse the summation, and to instead compute
n 
n P
P  j 
(−1)i · 2 .
j=1 i=1
Pn
For each j, observe that i=1 (−1)i · ⌈ 2j ⌉ is actually zero! (This value is just (⌈ 2j ⌉) n2 + (− ⌈ 2j ⌉) n2 .) In other
Pn
words, every column sum in the table is zero. Thus the entire summation is j=1 0, which is just 0.

Problem-solving tip: When you’re looking at a complicated double summation, try reversing it; it may be much easier to analyze
the other way around.

Note that computing the sum from Example 2.19 when n = 100 or n = 100,000 remains just as easy if
we use the column-based approach: as long as n is an even number, every column sum is 0, and thus the
entire summation is 0. (The row-based approach is ever-more painful to use as n gets large.)
P6 Pi
Here’s one more example—another view of the double sum i=1 j=1 j from Example 2.17—where
reversing the summation makes the calculation simpler:

Example 2.20: A double sum, redone.


P6 Pi
The value of i=1 j=1 j is the sum of all the numbers in the table in Figure 2.10c. We solved Exam-
Pi
ple 2.17 by first computing j=1 j, which is the sum of the numbers in the ith row. We then summed
these values over the six different values of i to get 56.
Alternatively, we can compute hthe desired
P6 P6 i
sum by looking at columns instead of rows. The sum of
P6
the table’s elements is also j=1 i=j j , where i=j j is the sum of the numbers in the jth column.

ISTUDY
20 Basic Data Types

P6
Because there are a total of (7 − j) terms in i=j j, the sum of the numbers in the jth column is precisely
j · (7 − j). (For example, the 4th column’s sum is 4 · (7 − 4) = 4 · 3 = 12.) Thus the overall summation is
P
6 P
i P
6
j= [j · (7 − j)]
i=1 j=1 j=1

= (1 · 6) + (2 · 5) + (3 · 4) + (4 · 3) + (5 · 2) + (6 · 1) = 56.
=6 = 10 = 12 = 12 = 10 =6

Products
P
The notation allows us to express repeated addition of a sequence of numbers; there is analogous
notation to represent repeated multiplication of numbers, too:

Definition 2.16: Product notation.


Qn
Let x1 , x2 , . . . , xn be a sequence of n numbers. We write i=1 xi (usually read as “the product for i equals
1 to n of xi ”) to denote the product of the xi s:
Q
n
xi = x1 · x2 · · · · · xn .
i=1

(The summation and product notation have a secret mnemonic to help you remember what each means:
“Σ” is the Greek letter Sigma, which starts with the same letter as the word sum. And “Π” is the Greek
letter Pi, which starts with the same letter as the word product.)

Example 2.21: Some products.


Here are a few small examples of products:

Q
4 Q
4 Q
4 Q
4
i = 24 and i=0 and i2 = 576 and 5 = 625.
i=1 i=0 i=1 i=1
1 · 2 · 3 · 4 = 24 0·1·2·3·4=0 12 · 22 · 32 · 42 = 576 5 · 5 · 5 · 5 = 625

P Q
There are direct analogues between the notions regarding and corresponding notions for : the for
loop interpretation (Figure 2.11), infinite products, reindexing, and nested products. One slight difference
Q0
worthy of note: the value of i=1 xi is 1; when we multiply by nothing, we’re multiplying by one.

1 result := 1
2 for i := 1, 2, . . . , n
3 result := result · xi
4 return result
∏n
Figure 2.11 A for loop that returns the value of i=1 xi .

ISTUDY
2.2 Booleans, Numbers, and Arithmetic 21

Computer Science Connections

Integers and ints, Reals and floats


Every modern programming language has types corresponding to integers and real numbers, often called something
like int (short for “integer”) and float (short for floating-point number; more about this name and floating-point
representation is below). In most programming languages, though, these types differ from Z and R in important ways.
Every piece of data stored on a computer is stored as sign bit data bits
a sequence of bits, and typically the bit sequence stor-
ing a number has some fixed length. For example, an
int stored using 7 bits can range from 0000000 (the ± 64 32 16 8 4 2 1
number 0 represented in binary) to 1111111 (the num-
ber 27 − 1 = 127 represented in binary). Typically,
the first bit in an int’s representation is reserved as the 0 0 1 1 0 0 1 1
sign bit (set to True for a negative number and False + 0 + 32 + 16 + 0 + 0 + 2 + 1 = 51
for a positive number), and the remaining bits store the
value of the number. (See Figure 2.12.) Thus there’s a 0 1 0 1 0 1 0 0
bound on the largest int, depending on the number of + 64 + 0 + 16 + 0 + 4 + 0 + 0 = 84
bits used to represent ints in a particular programming
language: 32,767 in Pascal (= 215 − 1, using 16 bits Figure 2.12 The values of each bit position, and the
per int: 1 sign bit and 15 data bits), and 2,147,483,647 integers 51 and 84, represented as 8-bit signed integers.
in Java (= 231 − 1; 32 bits, of which 1 is a sign bit).
A crucial point about Z and R is that they are infinite: there is no smallest integer, there’s no biggest real number,
and there isn’t even a biggest real number that is smaller than 1. In almost every programming language, however,
there is a smallest int, a biggest float, and a biggest float that’s smaller than 1: after all, there are only finitely
many possible floats (perhaps 264 different values), and one of these 264 values is the smallest float.
The finite nature of these programming language data types can cause some subtle bugs in programs. There
are issues related to integer overflow if we try to store “too large” an integer: for example, when we compute
32,767 + 1 in Pascal, the result is −32,768. And there are bugs related to underflow if we try to store “too small” a
floating-point number: for example, I tried to compute (0.0000000001)33 in Python, and got 0.0 as the result. (But
(0.0000000001)32 was, correctly, 10−320 .) Similarly, there are rounding errors implicit in floating-point representa-
tions of numbers: because there are only finitely many different floats, the infinitely many real numbers cannot all
be stored exactly. For example, when I type 0.0006 - 0.0004 == 0.0002 into a Python interpreter, I get False
as output. (According to Python, 0.0006 - 0.0004 is 0.00019999999999999993, not 0.0002.)
The name float originates with a clever idea that’s
used to mitigate these issues of imprecision: roughly, before the decimal point after the decimal point
a ≈ 0 1000111101101100110101 1000000000000000000000
we allow the decimal point to “float” in the represen-
b ≈ 0 0000000000000000000000 0111010000011110101000
tation of different numbers. Perhaps the first idea for
(approximately) representing real numbers in binary is sign bit
a “fixed-point” representation: we reserve some fixed exponent data
number of bits for the part of the number before the a ≈ 0 00010110 100011110110110011010110000000000000
decimal point, and reserve some fixed number of bits b ≈ 0 00000000 011101000001111010100001001010101101
for the part after the decimal point. But that can waste Figure 2.13 Fixed- and floating-point representations of
bits. Instead, in floating-point representation, the idea
a = 2,349,877.5 (the per-island average population of New
is basically similar to scientific notation: we represent a
Zealand, according to the most recent census) and
number as the product of a power of two and a fraction
b = 0.45359237 (the number of kilograms in a pound,
between 0 and 1. (See Figure 2.13.)
(This description only roughly approximates the according to the UK’s Weights and Measures Act 1985).
way that floating-point representation works to store Floating-point representation, at bottom, stores a as
numbers; see a good computer architecture textbook, 2 22
· 0 . 56025445 · · · and b as 20 · 0.45359237, leaving room
such as [98], for more. Another interesting detail is for more bits to represent the value itself more accurately.
the 2’s complement storage of integers, which allows
a single representation of positive and negative integers so that addition “just works,” even with a sign bit.)

ISTUDY
22 Basic Data Types

Computer Science Connections

Computing Square Roots, and Not Computing Square Roots


Many programmers happily use numerical operations without thinking about how they’re implemented—but a little
knowledge of what’s happening behind the scenes can speed up programs. Computer hardware directly and efficiently
executes basic arithmetic operations√ like +, ·, and /, but more complex calculations may require many of these basic
operations. Consider computing x given an input value x, for example. A √ promising idea is to use some kind of
iterative improvement algorithm: we start with a guess y0 of the value of x, and then update our guess to a new
guess y1 (by observing in some way whether y0 was too big or too small). We continue to improve our guess until
we’ve reached a value y such that y2 is “close enough” to x. (We can specify the tolerance of the algorithm—that is,
how close counts as “close enough.”)
A simple implementation of this Input: A positive real number x.
y0 = 21;
idea is called Heron’s method, Output: A real number y with y2 ≈ x. error ≈ 14.52
named after the first-century Greek 1 Let y0 be arbitrary, and let i := 0.
mathematician Heron of Alexan- 2 while (yi )2 is too far from x: y1 = 11.5; error ≈ 5.02
yi + yx
dria and shown in Figure 2.14. 3 let yi+1 := 2 i and i := i + 1 y2 = 7.576087; error ≈ 1.10
It relies on the nonobvious fact 4 return yi y3 = 6.559923; error ≈ 0.08
that the average
√ of y and yx is y4 = 6.481219; error ≈ 0.00048
closer to x than y was. √ (Unless y5 = 6.480741; error < 0.000001
y is exactly equal to x.) Almost
two millennia later, Isaac New- 10 15 20
ton developed a general technique √
Figure 2.14 Heron’s method for square roots, and an example of computing x
for computing values of numeri-
for x = 42, using 2x as the initial guess.
cal expressions involving exponen-
tials, among other things. This technique, known as Newton’s method, uses calculus—specifically, using derivatives
to figure out how far to move from a current guess yi in making the next guess yi+1 . Like Heron’s method, Newton’s
method is an example in scientific computing, the subfield of computer science devoted to efficient computation of
numerical values, often for the purposes of simulating a complex system. (Many interesting questions and techniques
are used in scientific computing; see [104] for one outstanding, and classic, reference for some of this material.)
Work in scientific computing
has improved the efficiency of
Portrait by Alfred Edward Chalon (1780–1860)

numerical computation. But even


better is to be aware of the fact that
operations like square roots require
significant computation “under the
hood,” and to avoid them when
possible. To take one particular
example, consider applying a blur
filter to an image: replace each
pixel p by the average of all pix-
els within a radius-r circle centered Figure 2.15 A blur filter, and Ada Lovelace (1815–1852), “the first computer
at p. (See Figure 2.15.) To compute programmer,” who developed algorithms for the Analytical Engine, invented by
the blurred value at pixel p, we can Charles Babbage (1791–1871), the first (as-yet-theoretical) “computer.” The
just scan all pixels q in the square second image of her is blurred with a radius of 9. For more on Lovelace and
within ±r rows/columns of p, com- Babbage and their work, I cannot overstate the amazingness of Sydney Padua’s
paring the distance between p and
The Thrilling Adventures of Lovelace and Babbage [96].
q to r. The “obvious” approach is √
to follow that description verbatim: calculate d := (px + qx )2 + (py + qy )2 , and test whether d ≤ r. But the
alternative is to simply square both sides of the comparison, and instead test whether (px + qx )2 + (py + qy )2 ≤ r2 .
While there is no important mathematical difference between these two formulas, there is a computational difference.
Because square roots are expensive to compute, it turns out that in my Python implementation of a blur filter, using
the alternative approach was about 12% faster than using the “obvious” way—a big win for a small tweak.

ISTUDY
Exercises 23

EXERCISES
2.1 What are the smallest and largest integers that are in the interval (111, 202)?
2.2 What are the smallest and largest integers that are in the interval [111, 202)?
2.3 What are the smallest and largest integers that are in the interval (17, 42) but not in the interval (39, 99]?
2.4 What are the smallest and largest integers that are in the interval [17, 42] but not in the interval [39, 99)?

2.5 If x and y are integers, is x + y necessarily an integer? Explain.


2.6 If x and y are rational numbers, is x + y necessarily rational? Explain.
2.7 If x and y are irrational numbers, is x + y necessarily irrational? Explain.

2.8 What is the value of ⌊2.5⌋ + ⌈3.75⌉?


2.9 What is the value of ⌊3.14159⌋ · ⌈0.87853⌉?
2.10 What is the value of (⌊3.14159⌋)⌈3.14159⌉ ?

2.11 Most programming languages provide two different functions called floor and truncate to trim real numbers to integers. In these
languages, floor(x) is defined exactly as we defined ⌊x⌋, and trunc(x) is defined to simply delete any digits that appear after the
decimal point in writing x. So trunc(3.14159) = 3 .14159 = 3. Explain why programming languages might have both floor
and trunc—that is, explain the circumstances under which floor(x) and trunc(x) give different values.

2.12 Using floor, ceiling, and standard arithmetic notation, give an expression for a real number x rounded to the nearest integer. (“Round
up” for a number that’s exactly between two integers—for example, 7.5 rounds to 8.)
2.13 Again using floor, ceiling, and standard arithmetic notation, give an expression for a real number x rounded to the nearest 0.1.
2.14 Generalize Exercise 2.13: give an expression rounding x to the nearest 10−k , for any number k of digits after the decimal point.
2.15 Again using floor, ceiling, and standard arithmetic notation, give an expression for a real number x truncated to k digits after the
decimal point—that is, leaving off the (k + 1)st digit and beyond. (For example, 3.1415926 truncated with 3 digits is 3.141, and
truncated with 4 digits is 3.1415.) [Many programming languages provide a facility for displaying formatted output in this style.
For example, printf("%.3f", x) in C (or Java, or Python, or others) prints the value of x with only 3 digits after the decimal
point. (The “f” of “printf” stands for formatted; the “f” of "%.3f" stands for float.)]
⌊x⌋+⌈x⌉
2.16 For what value(s) of x in the interval [2, 3] is x − 2
the largest?
⌊x⌋+⌈x⌉
2.17 For what value(s) of x in the interval [2, 3] is x − 2
the smallest?

2.18 Let x be a real number. Rewrite ⌊⌊x⌋⌋ as simply as possible.


2.19 Rewrite ⌈⌈x⌉⌉ as simply as possible.
2.20 Rewrite ⌊⌈x⌉⌋ as simply as possible.
2.21 Rewrite ⌈⌊x⌋⌉ as simply as possible.
2.22 Are |⌊x⌋| and ⌊|x|⌋ always equal? Explain.
2.23 Are 1 + ⌊x⌋ and ⌊1 + x⌋ always equal? Explain.
2.24 Are ⌊x⌋ + ⌊y⌋ and ⌊x + y⌋ always equal? Explain.
2.25 Let x be a real number. Describe (in English) what 1 + ⌊x⌋ − ⌈x⌉ represents. Explain.

2.26 In performing a binary search for x in a sorted n-element array A[1 . . . n] (see Figure 6.15b), the first thing we do is to compare the
value of x and the value of A [⌊ 1+n
2
⌋]. Assume that all elements of A are distinct. How many elements of A are less than A [⌊ 1+n2
⌋]?
How many are greater? Write your answers as simply as possible.

Complete Exercises 2.27–2.40 by hand, without using a calculator or computer.


2.27 Which is bigger, 310 or 103 ?
2.28 What is the value of 48 ?
2.29 What is the value of (1/4)8 ?
2.30 What is the value of (−4)8 ?
2.31 What is the value of (−4)9 ?
2.32 What is the value of 2561/4 ?
2.33 What is the value of 81/4 ?
2.34 What is the value of 83/4 ?
2.35 What is the value of (−9)1/4 ?
2.36 What is the value of log2 8?
2.37 What is the value of log2 (1/8)?
2.38 What is the value of log8 2?

ISTUDY
24 Basic Data Types

2.39 What is the value of log1/8 2?


2.40 Which is bigger, log10 17 or log17 10?

2.41 Using the definition of logarithms and the properties of exponentials from Theorem 2.8, justify Theorem 2.10.1: logb 1 = 0 for
any b > 1.
2.42 Again using the definition of logarithms and Theorem 2.8, justify Theorem 2.10.2: logb b = 1 for any b > 1.
2.43 Do the same for Theorem 2.10.3: logb xy = logb x + logb y for any b > 1 and x > 0 and y > 0.
2.44 Do the same for Theorem 2.10.5: logb xy = y logb x for any b > 1 and x > 0 and y > 0.
log x
2.45 Do the same for Theorem 2.10.6: logb x = logc b for any b > 1 and c > 1 and x > 0.
c

Using the results from Exercises 2.41–2.45 and the fact that logb x = logb y exactly when x = y (for any base b > 1), justify the
following additional properties of logarithms:
2.46 For any real numbers b > 1 and x > 0, we have that b[logb x] = x.
2.47 For any real numbers b > 1 and a, n > 0, we have that n[logb a] = a[logb n] .
2.48 Theorem 2.10.4: for any b > 1 and x > 0 and y > 0, we have that logb xy = logb x − logb y.

2.49 Using notation defined in this chapter, define the “hyperceiling” ⌈n⌉ of a positive integer n, where ⌈n⌉ is the smallest exact power
of two that is greater than or equal to n. (That is, ⌈n⌉ denotes the smallest value of 2k where 2k ≥ n and k is a nonnegative integer.)
2.50 Similar to Exercise 2.49: when writing down an integer n on paper using standard decimal notation, we need enough columns
for all the digits of n (and perhaps one additional column for a “−” if n < 0). Write down an expression indicating how many
columns we need to represent n. (Hint: use the case notation introduced in Definition 2.3, and be sure that your expression is well
defined—that is, it doesn’t “generate any errors”—for all integers n.)

Complete Exercises 2.51–2.59 by hand, without using a calculator or computer.


2.51 What is the value of 202 mod 2?
2.52 What is the value of 202 mod 3?
2.53 What is the value of 202 mod 10?
2.54 What is the value of −202 mod 10?
2.55 What is the value of 17 mod 42?
2.56 What is the value of 42 mod 17?
2.57 What is the value of 17 mod 17?
2.58 What is the value of −42 mod 17?
2.59 What is the value of −42 mod 42?

2.60 Observe the Python behavior of the % operator (the Python notation for mod) that’s shown in Figure 2.16. The first two lines
(3 mod 5 = 3 and −3 mod 5 = 2) are completely consistent with the definition that we gave for mod (Definition 2.11), including
its use for n mod k when n is negative (as in Example 2.6). But we haven’t defined what n mod k means for k < 0. Propose a
formal definition of % in Python that’s consistent with Figure 2.16.

2.61 What is the smallest positive integer n where n mod 2 = 0 and n mod 3 = 0, and n mod 5 = 0?
2.62 What is the smallest positive integer n where n mod 2 = 1 and n mod 3 = 1, and n mod 5 = 1?
2.63 What is the smallest positive integer n where n mod 2 = 0 and n mod 3 = 1, and n mod 5 = 0?
2.64 What is the smallest positive integer n where n mod 3 = 2 and n mod 5 = 3, and n mod 7 = 5?
2.65 What is the smallest positive integer n where n mod 2 = 1 and n mod 3 = 2 and n mod 5 = 3, and n mod 7 = 4?

2.66 (programming required) Write a program to determine whether a given positive integer n is prime by testing all possible divisors
between 2 and n − 1. Use your program to find all prime numbers less than 202.
2.67 (programming required) A perfect number is a positive integer n that has the following property: n is equal to the sum of all positive
integers k < n that evenly divide n. For example, 6 is a perfect number, because 1, 2, and 3 are the positive integers less than 6 that
evenly divide 6—and 6 = 1 + 2 + 3. Write a program that finds the four smallest perfect numbers.

1 3 % 5 # value: 3
2 -3 % 5 # value: 2
3 3 % -5 # value: -2
4 -3 % -5 # value: -3

Figure 2.16 Python’s implementation of % (“mod”).

ISTUDY
Exercises 25

2.68 (programming required) Write a program to find all integers between 1 and 1000 that are evenly divisible by exactly three different
integers.

Compute the values of the following summations and products:


P6
2.69 6
i=1
P6
2.70 i2
i=1
P6
2.71 22i
i=1
P6
2.72 i · 2i
i=1
P6
2.73 (i + 2i )
i=1
Q
6
2.74 6
i=1
Q
6
2.75 i2
i=1
Q
6
2.76 22i
i=1
Q
6
2.77 i · 2i
i=1
Q
6
2.78 (i + 2i )
i=1

Compute the values of the following nested summations:


P6 P
6
2.79 (i · j)
i=1 j=1
P6 P 6
2.80 (i · j)
i=1 j=i
P6 P i
2.81 (i · j)
i=1 j=1
P8 P 8
2.82 i
i=1 j=i
P8 P 8
2.83 j
i=1 j=i
P8 P 8
2.84 (i + j)
i=1 j=i
P4 P 4
2.85 (ji )
i=1 j=i

ISTUDY
26 Basic Data Types

2.3 Sets: Unordered Collections


Our torments also may in length of time
Become our Elements.
John Milton (1608–1674)
Paradise Lost (1667)

Section 2.2 introduced the primitive types of objects that we’ll use throughout the book. We turn now
to collections of objects, analogous to lists and arrays in programming languages. We start in this section
with sets, in which objects are collected without respect to order or repetition. (Section 2.4 will address
sequences, which are collections of objects in which order and repetition do matter.) The definitions and
notation related to sets are summarized in Figure 2.17.

Definition 2.17: Sets.


A set is an unordered collection of objects.

(Sets are typically denoted by uppercase letters, often by a mnemonic letter: S for a set of students, D for
a set of documents, etc. As we saw, the common sets from mathematics defined in Section 2.2.2 are often
written using a “blackboard bold” font: Z, R, and Q.) Here are a few small examples:

Example 2.22: Some sets.


Here are three sets: the set of bits {0, 1}, the set of prime numbers {2, 3, 5, 7, 11, . . .}, and the set of basic
arithmetic operators {+, −, ·, /}. (We’ve written these sets using standard notation by listing the objects
in the set between curly braces { and }.)

Set membership—that is, the question is the object x one of the objects in the collection S?, for a particular
object x and a particular set S—is the central notion for sets:

Definition 2.18: Set membership.


For a set S and an object x, the expression x ∈ S is true when x is one of the objects contained in the set
S. When x ∈ S, we say that x is an element or member of S or, more simply, that x is in S.

The expression x ∈/ S is the negation of the expression x ∈ S: that is, x ∈


/ S is true whenever x is not an
element of S (and thus whenever x ∈ S is false).

Example 2.23: Some set memberships.


The integer 0 is an element of the set of bits, and + is in the set of basic arithmetic operators. But 1 is not
an element of the set of prime numbers, and 8 is not in the set of bits.

A second key concept about a set is its cardinality, or size:

Definition 2.19: Set cardinality.


The cardinality of a set S, denoted by |S|, is the number of distinct elements in S.

ISTUDY
2.3 Sets: Unordered Collections 27

set membership x∈S x is one of the elements of S


cardinality |S| the number of distinct elements in the set S

set enumeration {x1 , x2 , . . . , xk } the set containing elements x1 , x2 , . . . , xk


set abstraction {x ∈ U : P(x)} the set containing all x ∈ U for which P(x) is true;
U is the “universe” of candidate elements

empty set {} or ∅ the set containing no elements

complement ∼S = {x ∈ U : x ∈
/ S} the set of all elements in the universe U that aren’t in S;
U may be left implicit if it’s obvious from context
union S ∪ T = {x : x ∈ S or x ∈ T} the set of all elements in either S or T (or both)
intersection S ∩ T = {x : x ∈ S and x ∈ T} the set of all elements in both S and T
set difference S − T = {x : x ∈ S and x ∈
/ T} the set of all elements in S but not in T

set equality S=T every x ∈ S is also in T, and every x ∈ T is also in S


subset S⊆T every x ∈ S is also in T
proper subset S⊂T S ⊆ T but S ≠ T
superset S⊇T every x ∈ T is also in S
proper superset S⊃T S ⊇ T but S ̸= T

power set P(S) the set of all subsets of S

Figure 2.17 A summary of set notation.

Example 2.24: Some set sizes.


The cardinality of the set of bits is 2, because there are two distinct elements of that set (namely 0 and 1).
The cardinality of the set S of prime numbers between 10 and 20 is |S| = 4: the four elements of S are
11, 13, 17, and 19.

Chapter 9 is devoted entirely to the apparently trivial problem of counting—given a (possibly convoluted)
description of a set S, find |S|—which turns out to have some interesting and useful applications, and isn’t
as easy as it seems.

Taking it further: In this book, we will be concerned almost exclusively with the cardinality of finite sets, but one can also ask
questions about the cardinality of sets like Z or R that contain an infinite number of distinct elements. For example, it’s possible
to prove that |Z| = |Z≥0 |, which is a pretty amazing result: there are as many nonnegative integers as there are integers! (And
that’s true despite the fact that every nonnegative integer is an integer!) But it’s also possible to prove that |Z| ̸= |R|: . . . but there
are more real numbers than integers! More amazingly, one can use similar ideas to prove that there are fewer computer programs
than there are problems to solve, and that therefore there are some problems that are not solved by any computer program. This
idea is the central focus of the study of computability and uncomputability. See Section 4.4.4 and p. 474.

2.3.1 Building Sets from Scratch


There are two standard ways to specify a set “from scratch”: by simply listing each of the elements of the
set, or by defining the set as the collection of objects for which a particular logical condition is true.

Set definition via exhaustive enumeration


A set can be specified using an exhaustive listing of its elements—that is, by writing a complete list of its
elements inside the curly braces { and }. Here are a few examples:

ISTUDY
28 Basic Data Types

Example 2.25: Some exhaustively enumerated sets.

• The set of even prime numbers is {2}.


• The set of prime numbers between 10 and 20 is {11, 13, 17, 19}.
• The set of 2-digit perfect squares is {81, 64, 25, 16, 36, 49}.
• The set of bits is {0, 1}.
• The set of Turing Award winners in 2004, 2008, 2012, and 2016 is {Tim Berners-Lee, Vint Cerf,
Shafi Goldwasser, Bob Kahn, Barbara Liskov, Silvio Micali}.

Taking it further: The Turing Award is the most prestigious award given in computer science—the “Nobel Prize of CS,”
it’s sometimes called. The award, named after the pioneering computer scientist Alan Turing (1912–1954)—also a WWII-era
code-breaking hero; see p. 497—has been awarded annually since 1966 to an individual or small group of collaborators for
“major contributions of lasting importance” to the field. The detailed citations and biographies of the winners are available at
https://amturing.acm.org.
Vint Cerf and Bob Kahn (2004) were honored for their collaboration, starting in the early 1970s, in developing the Transmission
Control Protocol and Internet Protocol (TCP/IP), which form the very foundation of the internet. Barbara Liskov (2008) earned
her Turing Award for her foundational work on design of computer systems and programming languages, especially on data
abstraction and modular programming, and on fault tolerance and distributed computing. Shafi Goldwasser and Silvio Micali
(2012) were jointly honored for their work in cryptography, computational complexity, and probabilistic algorithms, which began
while they were earning graduate degrees at Berkeley (his in 1982 and hers in 1984) and continued when they both became faculty
members at MIT. Sir Tim Berners-Lee (2016) won his award for inventing the World Wide Web—and some highly influential
web-centered protocols—around 1990 when he worked at CERN, a physics lab in Geneva.

Recall that a set is an unordered collection, and thus the order in which the elements are listed doesn’t
matter when specifying a set via exhaustive enumeration. Any repetition in the listed elements is also
unimportant. For example:

Example 2.26: The same set, three ways.


The set {2 + 2, 2 · 2, 2/2, 2 − 2} is precisely identical to the set {0, 1, 4}, both of which are precisely
identical to {4, 0, 1}. Also note that |{2 + 2, 2 · 2, 2/2, 2 − 2}| = 3; despite there being four entries in
the list of elements, there are only three distinct objects in the set.

It’s important to remember that the integer 2 and the set {2} are two entirely different kinds of things. For
example, note that 2 ∈ {2}, but that {2} ∈ / {2}; the lone element in {2} is the number two, not the set
containing the number two.

Set definition via set abstraction


Instead of explicitly listing all of a set’s elements, we can also define a set in terms of a condition that is
true for the elements of the set and that’s false for every object that is not an element of the set. Defining a
set this way uses set-abstraction notation:

Definition 2.20: Set abstraction.


Let U be a set of possible elements, called the universe. Let P(x) be a condition (also called a predicate)
that, for every x ∈ U, is either true or false. Then
{x ∈ U : P(x)}
denotes the set of all objects x ∈ U for which P(x) is true.

ISTUDY
2.3 Sets: Unordered Collections 29

That is, for a candidate y ∈ U, the element y is a member of the set {x ∈ U : P(x)} when P(y) = True,
and y ∈
/ {x ∈ U : P(x)} when P(y) = False. (A fully proper version of Definition 2.20 requires functions,
described in Section 2.5.)
(The colon in the notation for set abstraction is read as “such that,” so the set in Definition 2.20 would
be read “the set of all x in U such that P of x.”)

Example 2.27: Most of Example 2.25, redone.

• The set of even prime numbers is {x ∈ Z>1 : x is prime and x is even} .



• The set of 2-digit perfect squares is {n ∈ Z : n ∈ Z and 10 ≤ n ≤ 99} .

• The set of bits is b ∈ Z : b2 = b .

For this set-abstraction notation to meaningfully define a set S, we must specify the universe U of candidates
from which the elements of S are drawn. We will permit ourselves to be sloppy in our notation, and when
the universe U is clear from context we will allow ourselves to write {x : P(x)} instead of {x ∈ U : P(x)}.

Taking it further: The notational sloppiness of omitting the universe in set abstraction will be a convenience for us, and it will
not cause us any trouble—but it turns out that one must be careful! In certain strange scenarios when defining sets, there are
subtle but troubling paradoxes that arise if we allow the universe to be anything at all. The key problem can be seen in Russell’s
paradox, named after the British philosopher/mathematician Bertrand Russell (1872–1970); Russell’s discovery of this paradox
revealed an inconsistency in the commonly accepted foundations of mathematics in the early twentieth century.
Here is a sketch of Russell’s paradox. Let X denote the set of all sets that do not contain themselves: that is, let X = {S : S ∈
/ S}.
For example, {2} ∈ X because {2} ∈ / {2}, and R ∈ X because R is not a real number, so R ∈ / R. On the other hand, if we let
T∗ denote the set of all sets, then T∗ ∈/ X: because T∗ is a set, and T∗ contains all sets, then T∗ ∈ T∗ and therefore T∗ ∈ / X.
Here’s the problem: is X ∈ X? Suppose that X ∈ X: then X ∈ {S : S ∈ / S} by the definition of X, and thus X ∈ / X. But suppose
that X ∈
/ X; then, by the definition of X, we have X ∈ X. So if X ∈ X then X ∈ / X, and if X ∈
/ X then X ∈ X—but that’s absurd!
One standard way to escape this paradox is to say that the set X cannot be defined—because, to be able to define a set using set
abstraction, we need to start from a defined universe of candidate elements. (And the set T∗ cannot be defined either.) The Liar’s
paradox, dating back about 3000 years, is a similar paradox: is “this sentence is false” true (nope!) or false (nope!)? In both
Russell’s paradox and the Liar’s paradox, the fundamental issue relates to self-reference; many other mind-twisting paradoxes
are generated through self-reference, too. For more on these and other paradoxes, see [112].

Definition 2.20 lets us write {x ∈ U : P(x)} to denote the set containing exactly those elements x of U for
which P(x) is True. We will extend this notation to allow ourselves to write more complicated expressions
to the left of the colon, as in the following example:

Example 2.28: 2-digit perfect squares, again (see Examples 2.25 and 2.27).

We can also write the set of 2-digit perfect squares as either x2 : x ∈ Z and 10 ≤ x2 ≤ 99 or as
 2  2 2 2 2 2 2
x : x ∈ {4, 5, 6, 7, 8, 9} = 4 , 5 , 6 , 7 , 8 , 9 .

To properly define this extended form of the set-abstraction notation, we again need the idea of functions,
which are defined in Section 2.5. See Definition 2.49 for a proper definition of this extended notation.

Taking it further: Modern programming languages support the use of lists to store a collection of objects. While these lists store
ordered collections, there are some very close parallels between these lists and sets. In fact, the ways we’ve described building
sets have very close connections to ideas in certain programming languages like Scheme and Python; see p. 37.

The empty set


One particularly useful set—despite its simplicity—is the empty set, also sometimes called the null set:

ISTUDY
30 Basic Data Types

Definition 2.21: The empty set.


The empty set, denoted {} or ∅, is the set that contains no elements.

The definition of the empty set as {} is an exhaustive listing of all of the elements of the set—though,
because there aren’t any elements, there are no elements in the list.
Alternatively, we could have used the set-abstraction notation to define the empty set as ∅ = {x : False}.
This definition may seem initially confusing, but it’s in fact a direct application of Definition 2.20: the
condition P for this set is P(x) = False (that is: for every object x, the value of P(x) is False), and we’ve
defined ∅ to contain every object y such that P(y) = True. But there isn’t any object y such that P(y) =
True—because P(y) is always false—and thus there’s no y ∈ {x : P(x)}.
Notice that, because there are zero elements in ∅, its cardinality is zero: in other words, |∅| = 0. One
other special type of set is defined based on its cardinality; a singleton set is a set S that contains exactly
one element—that is, a set S such that |S| = 1.

2.3.2 Building Sets from Other Sets


There are a number of ways to create new sets from two sets A and B. We will define these operations
formally, but it is sometimes more intuitive to look at a visual representation of sets called a Venn diagram,
which represent sets as circular “blobs” that contain points (elements), enclosed in a rectangle that denotes
the universe. (Venn diagrams are named after the British logician/philosopher John Venn (1834–1923).)

Example 2.29: Venn diagram of odds and primes.


Let U = {1, 2, . . . , 10}. Let P = {2, 3, 5, 7} denote the set of primes in U, and let O = {1, 3, 5, 7, 9}
denote the set of odd numbers in U. A Venn diagram illustrating these sets is shown in Figure 2.18.

We will now define four standard ways of building a new set in terms of one or two existing sets:
complement, union, intersection, and set difference.

Definition 2.22: Set complement.


The complement of a set A with respect to the universe U, written ∼A (or sometimes A), is the set of all
elements not contained within A. Formally, ∼A = {x ∈ U : x ∈ / A} . (When the universe is obvious from
context, we may leave it implicit.)

For example, if the universe is {1, 2, . . . , 10}, then ∼ {1, 2, 3} = {4, 5, 6, 7, 8, 9, 10} and ∼ {3, 4, 5, 6} =
{1, 2, 7, 8, 9, 10}. (See Figure 2.19a.)

The rectangle represents the universe U = {1, 2, . . . , 10}.


P O
3 1
2 5 3, 5, and 7 are elements of both P and O;
7 9 2 is in P but not O;
1 and 9 are in O but not P; and
4 6 8 10 4, 6, 8, and 10 are in neither P nor O.

Figure 2.18 A Venn diagram for the set O of odd numbers and the set P of prime numbers between 1 and 10.

ISTUDY
2.3 Sets: Unordered Collections 31

∼A A∪B A∩B A−B B−A


A A B A B A B A B

(a) The complement (b) The union A ∪ B of (c) The intersection (d) The difference of A and B. On the left is
∼A of a set A. two sets A and B. A ∩ B of A and B. the set A − B; on the right is B − A.

Figure 2.19 Venn diagrams of the basic set operations: complement, union, intersection, and set difference. (The
complement is taken with respect to the universe U, which is represented by the rectangle.)

Definition 2.23: Set union.


The union of two sets A and B, denoted A∪B, is the set of all elements in either A or B (or both). Formally,
A ∪ B = {x : x ∈ A or x ∈ B} .

For example, {1, 2, 3} ∪ {3, 4, 5, 6} = {1, 2, 3, 4, 5, 6}. (See Figure 2.19b.)

Definition 2.24: Set intersection.


The intersection of two sets A and B, denoted A ∩ B, is the set of all elements in both A and B. Formally,
A ∩ B = {x : x ∈ A and x ∈ B} .

For example, {1, 2, 3} ∩ {3, 4, 5, 6} = {3}. (See Figure 2.19c.)


P Q Sn
Analogously to summation and product notation ( and ), we will sometimes write i=1 Si to denote
Tn
S1 ∪ S2 ∪ · · · ∪ Sn , and we will sometimes write i=1 Si to denote S1 ∩ S2 ∩ · · · ∩ Sn .

Definition 2.25: Set difference.


The difference of two sets A and B, denoted A − B, is the set of all elements contained in the set A but not
in the set B. Formally, A − B = {x : x ∈ A and x ∈
/ B} . (Note that A − B and B − A are different sets.)

(Some people write A \ B instead of A − B to denote set difference.)


For example, {1, 2, 3}−{3, 4, 5, 6} = {1, 2} and {3, 4, 5, 6}−{1, 2, 3} = {4, 5, 6}. (See Figure 2.19d.)
In expressions that use more than one of these set operators, the ∼ operator “binds tightest”—that is, in
an expression like ∼S ∪ T, we mean (∼S) ∪ T and not ∼(S ∪ T). We use parentheses to specify the order of
operations among ∩, ∪, and −. Here’s a slightly more complicated example that combines set operations:

Example 2.30: Combining odds and primes.


As in Example 2.29, define U = {1, 2, . . . , 10}, with P = {2, 3, 5, 7} and O = {1, 3, 5, 7, 9} as the
primes and odd numbers in U. List the elements of the sets P ∩ ∼O and ∼(P ∪ O) and ∼P − ∼O.
Solution. For each part, we plug in the definitions:
P ∩ ∼O = {2, 3, 5, 7} ∩ ∼ {1, 3, 5, 7, 9} = {2}
= {2, 4, 6, 8, 10}

∼(P ∪ O) = ∼( {2, 3, 5, 7} ∪ {1, 3, 5, 7, 9} ) = {4, 6, 8, 10}


= {1, 2, 3, 5, 7, 9}

ISTUDY
32 Basic Data Types

(a) (b) (c)


A B A B A B

C C C

Figure 2.20 Some three-set Venn diagrams: (a) (B ∪ C) − A; (b) (A − B) ∩ C; and (c) A ∩ (B ∪ C).

∼P − ∼O = ∼ {2, 3, 5, 7} − ∼ {1, 3, 5, 7, 9} = {1, 9} .


= {1, 4, 6, 8, 9, 10} = {2, 4, 6, 8, 10}

Thus P ∩ ∼O = {2} is the set of prime nonodd numbers; ∼(P ∪ O) = {4, 6, 8, 10} is the set of numbers
that aren’t either prime or odd; and ∼P−∼O = {1, 9} is the set of nonprime numbers that aren’t nonodd,
or, more simply stated, all nonprime odd numbers.

Of course, we can also combine more than two sets in expressions using these set operators—for exam-
ple, A ∪ B ∪ C denotes the set {x : x ∈ A or x ∈ B or x ∈ C}. We can use Venn diagrams to visualize set
operations that involve more than two sets; see Figure 2.20 for a few examples.

Arithmetic operations on sets


We’ll wrap up with a bit of notation that will allow us to perform mathematical operations on the ele-
ments of a set. In Section 2.2.7, we introduced summation and product notation, so that we could write
Pn Qn
i=1 xi to represent x1 + x2 + · · · + xn , and i=1 xi to represent x1 · x2 · · · · · xn . We will sometimes
wish to represent the sum or product of the elements of a particular set (instead of a sequence of values
like x1 , x2 , . . . , xn ). It will also be handy to refer to the smallest or largest element in a set.

Definition 2.26: Sum, product, minimum, and maximum of a set.


Let S be a set. Then the expressions
P Q
x and x and min x and max x
x∈S x∈S x∈S x∈S

respectively denote the sum of the elements of S, the product of the elements of S, the smallest element
in S, and the largest element in S.
P
For example, for the set S = {1, 2, 4, 8}, the sum of the elements of S is x∈S x = 15; the product of the
Q
elements of S is x∈S x = 64; the minimum of S is minx∈S x = 1; and the maximum of S is maxx∈S x = 8.

2.3.3 Comparing Sets


In the same way that two numbers x and y can be compared (we can ask questions like: does x = y? is
x ≤ y? is x ≥ y?), we can also compare two sets A and B. Here, we will define the analogous notions of
comparison for sets. We’ll begin by defining what it means for two sets to be equal:

Definition 2.27: Set equality.


Two sets A and B are equal, denoted A = B, if A and B have exactly the same elements. (In other words,
sets A and B are not equal if there’s an element x ∈ A but x ∈
/ B, or if there’s an element y ∈ B but y ∈
/ A.)

ISTUDY
2.3 Sets: Unordered Collections 33

This definition formalizes the idea that order and repetition don’t matter in sets: for example, the sets {4, 4}
and {4} are equal because there is no element x ∈ {4, 4} where x ∈ / {4} and there is no element y ∈ {4}
where y ∈ / {4, 4}. This definition also implies that the empty set is unique: any set containing no elements
is identical to ∅.

Taking it further: Definition 2.27 is sometimes called the axiom of extensionality. (All of mathematics, including a completely
rigorous definition of the integers and all of arithmetic, can be built up from a small number of axioms about sets, including this
one.) The point is that the only way to compare two sets is by their “externally observable” properties. For example, the follow-
ing two sets are exactly the same set: {x : x > 10 is an even prime number}, and {y : y is a country with a 128-letter name}.
(Namely, both of these sets are ∅.)

The other common type of comparison between two sets A and B is the subset relationship, which expresses
that every element of A is also an element of B:

Definition 2.28: Subset.


A set A is a subset of a set B, written A ⊆ B, if every x ∈ A is also an element of B. (In other words,
A ⊆ B is equivalent to A − B = {}.)

For example, {1, 3, 5} ⊆ {1, 2, 3, 4, 5}, because 1 ∈ {1, 2, 3, 4, 5} and 3 ∈ {1, 2, 3, 4, 5} and 5 ∈
{1, 2, 3, 4, 5}. Similarly, {1, 3, 5} ⊆ {1, 3, 5}.
Notice that {} ⊆ S for any set S: it’s impossible for there to be an x ∈ {} that satisfies x ∈ / S, because
there is no element x ∈ {} in the first place—and if there’s no x ∈ {} at all, then there’s certainly no x ∈ {}
such that x ∈ / S.

Definition 2.29: Proper subset.


A set A is a proper subset of a set B, written A ⊂ B, if A ⊆ B and A ̸= B. In other words, A ⊂ B whenever
A ⊆ B but B ̸⊆ A.

For example, let A = {1, 2, 3}. Then A ⊆ {1, 2, 3, 4} and A ⊆ {1, 2, 3} and A ⊂ {1, 2, 3, 4}, but A is not
a proper subset of {1, 2, 3}.
When A ⊂ B or A ⊆ B, we refer to A as the (possibly proper) subset of B; we can also call B the (possibly
proper) superset of A:

Definition 2.30: Superset and proper superset.


Let A be a set. A set B is a superset of A, written B ⊇ A, if A ⊆ B. The set B is a proper superset of A,
written B ⊃ A, if A ⊂ B.

Figure 2.21a illustrates subsets, proper subsets, supersets, and proper supersets.

The sets satisfy A ⊂ B (and B ⊃ A) if there’s at


B least one element in this region, and they satisfy A B
A
A = B if there’s no element in this region.

(a) Two sets satisfying A ⊆ B and, equivalently, B ⊇ A. (b) Two disjoint sets A and B.

Figure 2.21 One set that’s a subset of another, and two sets that are disjoint.

ISTUDY
34 Basic Data Types

Example 2.31: Subsets and supersets.


Let A = {3, 4, 5} and B = {4, 5, 6}. Identify a set C satisfying the following conditions, or state that the
requirement is impossible to achieve and explain why.
1 A ⊆ C and C ⊇ B
2 A ⊇ C and C ⊆ B
3 A ⊇ C and C ⊇ B

Solution. The first two conditions are achievable, but the third isn’t.
1 Let C = {3, 4, 5, 6}; both A and B are (proper) subsets of this set.
2 We can choose C = {4, 5}, because {4, 5} ⊆ A and {4, 5} ⊆ B.
3 It’s impossible to satisfy {3, 4, 5} ⊇ C and C ⊇ {4, 5, 6} simultaneously. If 6 ∈ C then we don’t have
{3, 4, 5} ⊇ C, but if 6 ∈ / C we don’t have C ⊇ {4, 5, 6}. We can’t have 6 ∈ C and we can’t have
6∈ / C, so we’re stuck with an impossibility.

Here’s one last piece of set-related terminology that we’ll use: two sets A and B are called disjoint if
they have no elements in common.

Definition 2.31: Disjoint sets.


Two sets A and B are disjoint if there is no x ∈ A where x ∈ B—in other words, if A ∩ B = {}.

For example, the sets {1, 2, 3} and {4, 5, 6} are disjoint because {1, 2, 3} ∩ {4, 5, 6} = {}, but the sets
{2, 3, 5, 7} and {2, 4, 6, 8} are not disjoint because 2 is an element of both. See Figure 2.21b.

2.3.4 Sets of Sets


Just as we can have a list of lists in a programming language like Scheme or Java, we can also consider a
set that has sets as its elements. (After all, sets are just collections of objects, and one kind of object that
can be collected is a set itself.)

Example 2.32: Set of sets of numbers.


The set A = {Z, R, Q} of the sets defined in Section 2.2.2 is itself a set. This set has cardinality |A| = 3,
because A has three distinct elements—namely Z and R and Q. (Of course, all three of these elements of
A are themselves sets, and each of these three elements of A has infinite cardinality.)

Example 2.33: A set of smaller sets.


Consider the set B = {{} , {1, 2, 3}}. Note that |B| = 2: B has two elements, namely {} and {1, 2, 3}.
Therefore {} ∈ B because {} is one of the two elements of B. However 1 ∈ / B, because 1 is not one of
the two elements of B—that is, 1 ̸= {} and 1 ̸= {1, 2, 3}—although 1 is an element of one of the two
elements of B.

There are two important types of sets of sets that we will define in the remainder of this section, both
derived from a base set S.

ISTUDY
2.3 Sets: Unordered Collections 35

Partitions
The first interesting use of a set of sets is to form a partition of S into a set of disjoint subsets whose union
is precisely S.

Definition 2.32: Partition.


A partition of a set S is a set {A1 , A2 , . . . , Ak } of nonempty sets A1 , A2 , . . . , Ak , for some k ≥ 1, such that
(i) A1 ∪ A2 ∪ · · · ∪ Ak = S; and (ii) for any i and j = ̸ i, the sets Ai and Aj are disjoint.

A useful way of thinking about a partition of a set S is that we’ve divided S up into several (nonoverlapping)
subcategories. See Figure 2.22 for an illustration of a partition of a set S. Here’s an example of one set
partitioned many different ways:

Example 2.34: Several partitions of the same set.


Consider the set S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Here are some different ways to partition S:
{{1, 3, 5, 7, 9} , {2, 4, 6, 8, 10}} evens and odds
{{1, 2, 3, 4, 5, 6, 7, 8, 9} , {10}} one- and two-digit numbers
{{1, 4, 7, 10} , {2, 5, 8} , {3, 6, 9}} x mod 3 = 0 and x mod 3 = 1 and x mod 3 = 2
{{1} , {2} , {3} , {4} , {5} , {6} , {7} , {8} , {9} , {10}} all separate
{{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}} all together

In each case, each of the 10 numbers from S is in one, and only one, of the listed sets (and no elements
not in S appear in any of the listed sets).

It’s worth noting that the last two ways of partitioning S in Example 2.34 genuinely are partitions. For
the partition {{1} , {2} , {3} , {4} , {5} , {6} , {7} , {8} , {9} , {10}}, we have k = 10 different disjoint
sets whose union is precisely S. For the partition {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}}, we have k = 1: there’s
only one “subcategory” in the partitioning, and every x ∈ S is indeed contained in one (the only one!) of
these “subcategories.” (And no two distinct subcategories overlap, because there aren’t even two distinct
subcategories at all!)

Taking it further: One way to helpfully organize a massive set S of data—for example, students or restaurants or web pages—is
to partition S into small clusters. The idea is that two elements in the same cluster will be “similar,” and two entities in different
clusters will be “dissimilar.” (So students might be clustered by their majors or dorms; restaurants might be clustered by their
cuisine or geography; and web pages might be clustered based on the set of words that appear in them.) For more, see p. 38.

(a) The set S. (b) S partitioned into five subsets.

Figure 2.22 A visualization of partitioning a set S into disjoint nonempty subsets whose union equals S itself.

ISTUDY
36 Basic Data Types

Power sets
Our second important type of a set of sets is the power set of a set S, which is the set of all subsets of S:

Definition 2.33: Power set.


The power set of a set S, written P(S), denotes the set of all subsets of S: that is, a set A is an element
of P(S) precisely if A ⊆ S. In other words, P(S) = {A : A ⊆ S}.

The power set of S is also occasionally denoted by 2S , in part because—as we’ll see in Chapter 9—|P(S)| is 2|S| . The name
“power set” also comes from this fact: the cardinality of P(S) is 2 to the power of |S|.

Here are some small examples, and one example that’s a bit more complicated:

Example 2.35: Some small power sets.


Here are the power sets of {0}, {0, 1}, and {0, 1, 2}:
P({0}) = {{} , {0}}
P({0, 1}) = {{} , {0} , {1} , {0, 1}}
P({0, 1, 2}) = {{} , {0} , {1} , {2} , {0, 1} , {0, 2} , {1, 2} , {0, 1, 2}}
For the second of these examples: there are four elements in P({0, 1})—the empty set, two singleton
sets {0} and {1}, and the two-element set {0, 1} itself, because {0, 1} ⊆ {0, 1} is a subset of itself.

Example 2.36: P(P({0, 1})).


The power set of the power set of {0, 1} is

P(P({0, 1})) = P( {} , {0} , {1} , {0, 1} );
the 16 elements of this set are shown in Figure 2.23.

{ }
{} , {0}
{ } { } { }
{} {} , {1} {0} , {1} , {0, 1}
{ } { } { }
{} {0} {} , {0, 1} {} , {1} , {0, 1} { }
{ } { } { } {} , {0} , {1} , {0, 1}
{1} {0} , {1} {} , {0} , {0, 1}
{ } { } { }
{0, 1} {0} , {0, 1} {} , {0} , {1}
{ }
{1} , {0, 1}

Figure 2.23 The power set of the power set of {0, 1}. There are a total of 1 + 4 + 6 + 4 + 1 = 16 sets in P(P({0, 1})):
one with 0 elements, four with 1 element, six with 2 elements, four with 3 elements, and one with 4 elements.

ISTUDY
2.3 Sets: Unordered Collections 37

Computer Science Connections

Set Building in Languages (and MapReduce)


Programming languages like Python, Scheme, or ML make heavy use of lists and also allow higher-order functions
(functions that take other functions as parameters). The experience of programming in these languages sometimes
feels very similar to using the set-construction notions from Section 2.3.1. These mechanisms for building sets in
mathematical notation closely parallel built-in functionality for building lists in programs in these languages:
• Build a list from scratch by writing out its elements.
• Build a list from an existing list using the function filter, which takes two parameters (a list U, corresponding
to the universe, and a function P) and returns a new list containing all x ∈ U for which P(x) is true.
• Build a list from an existing list using the function map, which takes two parameters (a list U and a function f) and
returns a new list containing f(x) for every element x of U.
(Unlike sets, the map function can
1 # Some helper functions to use as predicates.
cause repetitions in the stored list: 2 def even(x): return x % 2 == 0
map(square,L) where L contains 3 def square(x): return x**2
4 def false(x): return False
both 2 and −2 will lead to 4 being 5
present twice. But some languages, 6 # The definitions.

including Python, also have syn- 7 L = [1, 2, 4, 8, 16] L = {1, 2, 4, 8, 16}


8 M = [x for x in L if x < 10] M = {x ∈ L : x < 10} = {1, 2, 4, 8}
tax for sets instead of lists, cre- 9 N = filter(even, L) N = {{x ∈ L : x is}even} = {2, 4, 8, 16}
ating an unordered, duplicate-free 10 O = map(square, L) O = { x2 : x ∈ L = {1, 4, 16}, 64, 256}
11 P = [square(x) for x in L if even(x)] P = x2 : x ∈ L and x is even
collection of elements.) 12 = {4, 16, 64, 256}
Python has filter and map 13 Q = [x for x in L if false(x)] Q = {x ∈ L : False} = {}
built in; Scheme has filter and
map either built in or in a standard Figure 2.24 List-building code in Python analogous to our set-building
library. (In Python, there’s even an notation. (Lines 8, 11, and 13 use Python’s list comprehension syntax.)
explicit list comprehension syntax
to create a list without using filter or map, which even more closely parallels the set-abstraction notation from
Definitions 2.20 and 2.49.) See Figure 2.24 for Python, and Figure 2.25 for Scheme.
While the technical details are
a bit different, the basic idea 1 ;;; Some helper functions to use as predicates.
underlying map forms half of a 2 (define even? (lambda (x) (= (modulo x 2) 0)))
3 (define square (lambda (x) (* x x)))
programming model called Map-
4 (define false? (lambda (x) #f))
Reduce that’s become increasingly 5
popular for processing very large 6 ;;; The definitions.
7 (define L (list 1 2 4 8 16)) L = {1, 2, 4, 8, 16}
datasets. (The reduce function in 8
Python or Scheme lets a pro- 9 (define N (filter even? L)) N = {{x ∈ L : x is}even} = {2, 4, 8, 16}
grammer specify how to take a 10 (define O (map square L)) O = { x2 : x ∈ L = {1, 4, 16}, 64, 256}
2
11 (define P (map square (filter even? L))) P = x : x ∈ L and x is even
list of values and “reduce” it 12 = {4, 16, 64, 256}
to a single value—for example, 13 (define Q (filter false? L)) Q = {x ∈ L : False} = {}
by summing.) MapReduce is a
distributed-computing framework Figure 2.25 The analogous list-building code in Scheme. (There is Scheme
that processes data using two user- code analogous to the definition of M, but it requires an additional helper
specified functions: a “map” func- function or some more complicated syntax.)
tion that’s applied to every element
of the dataset, and a “reduce” function that collects together the outputs of the map function. Implementations of
MapReduce allow these computations to occur in parallel, on a cluster of machines, vastly speeding processing time.
(For more on MapReduce—a computing paradigm that has had a major impact on modern parallel and distributed
computing—see [37].)

ISTUDY
38 Basic Data Types

Computer Science Connections

Clustering Datasets (and Speech Processing)


Partitioning a set is a task that arises frequently in various applications, usually with a goal like clustering a large
collection of data points. The goal is that elements placed into the same cluster should be “very similar,” and elements
in different clusters should be “not very similar.” There is a huge array of applications of clustering (more on that
below), but here’s a simplified version of one example application, in the area of speech processing.
Software systems that interact with users as they

Spectrogram generated by Praat [16].


speak in natural language—that is, as they talk in
English, or Mandarin, or other real-world languages—
have developed with rapidly increasing quality over
the last decade. (You probably have a system that can
perform this task in your pocket right now.) Speech
recognition—taking an audio input, and identifying
what English word is being spoken from the acoustic
properties of the audio signal—turns out to be a very
challenging problem. Figure 2.26 illustrates some of Figure 2.26 A spectrogram of me pronouncing the
the reasons for the difficulty, showing a spectrogram sentence “I prefer agglomerative clustering.” In a
generated by the Praat software tool. There are essen- spectrogram, the x-axis is time, and the y-axis is frequency;
tially no acoustic correlates to the divisions between a darkly shaded frequency f at time t shows that the speech
words—the things that look like pauses in Figure 2.26 at time t had an intense component at frequency f.
actually correspond to the moments that my mouth
fully cut off the airflow during the sentence, like on “t” of clustering—which is one of the many reasons that speech
recognition is so difficult.
But we can do something with more data. First,
we can partition a training set of many speakers say- 3000
ing a collection of common words into subsets based
F2 (second-lowest)

on which word was spoken, and then use the average


acoustic properties of the utterances to guess which 2000
word was spoken in novel audio. Figure 2.27 shows the
frequencies of the two lowest formants—frequencies centroid for
of very high intensity—in the utterances of a half- 1000
beat /bit/
centroid for
dozen college students saying the words bat and beat.
bat /bæt/
The points are partitioned by the pronounced word.
0
The centroid of each cluster (the center of mass of 0 200 400 600 800 1000 1200
the points) can serve as a prototypical version of each F1 (lowest formant) [Hertz]
word’s acoustics. (These particular speakers followed
the general trend for English vowels, but their for- Figure 2.27 The frequencies of the first two formants in
mant frequencies don’t quite match the statistics you’d utterances by six speakers saying the words beat and bat.
expect from a larger sample of speakers.)
Speech recognition is an important application, but there are many other reasons to perform clustering on a dataset.
We might try to cluster a set N of news articles into “topics” C1 , C2 , . . . , Ck , where any two articles x, y that are both
in the same cluster Ci are similar (say, with respect to the words contained within them), but if x ∈ Ci and y ∈ Cj̸=i
then x and y are not very similar. Or we might try to cluster the people in a social network into communities, so
that a person in community c has a large fraction of her friends who are also in community c. Understanding these
clusters—and understanding what properties of a data point “cause” it to be in one cluster rather than another—can
help reveal the structure of a large dataset, and can also be useful in building a system to react to new data. Or we
might want to use clusters for anomaly detection: given a large dataset—for example, of user behavior on a computer
system, or the trajectory of a car on a highway—we might be able to identify those data points that do not seem to be
part of a normal pattern. These data points may be the result of suspicious behavior that’s worth further investigation
(or that might trigger a warning to the driver of the car that they have strayed from a lane). For more about clustering,
clustering algorithms, and applications of clustering, see a good data-mining book, like [76].

ISTUDY
Exercises 39

EXERCISES
2.86 Let H = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f} denote the set of hexadecimal digits. Is 6 ∈ H?
2.87 For the same set H: is h ∈ H?
2.88 For the same set H: is a70e ∈ H?
2.89 For the same set H: What is |H|?

2.90 Let S = {0 + 0, 0 + 1, 1 + 0, 1 + 1, 0 · 0, 0 · 1, 1 · 0, 1 · 1} be the set of results of adding any two bits together or multiplying
any two bits together. Which of 0, 1, 2, and 3 are elements of S?
2.91 For the same set S: what is |S|?

2.92 Let T = {n ∈ Z : 0 ≤ n ≤ 20 and n mod 2 = n mod 3}. Let H = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}. Identify at least one
element of H that is not an element of T.
2.93 For the same sets H and T: identify at least one element of T that is not an element of H.
2.94 For the same set T, and for S as defined in Exercise 2.90: identify at least one element of T that is not an element of S.
2.95 For the same set T, and for S as defined in Exercise 2.90: identify at least one element of S that is not an element of T.
2.96 For T as defined in Exercise 2.92: what is |T|?

2.97 Rewrite the following set by exhaustively listing its elements: {n ∈ Z : 0 ≤ n ≤ 20 and n mod 5 = n mod 7}.
2.98 Do the same for {n ∈ Z : 10 ≤ n ≤ 30 and n mod 5 = n mod 7}.

Let A = {1, 3, 4, 5, 7, 8, 9} and let B = {0, 4, 5, 9}. Define C = {0, 3, 6, 9}. Where relevant, assume that the universe is the set
U = {0, 1, 2, . . . , 9}. What are the following sets?
2.99 A∩B
2.100 A∪B
2.101 A−B
2.102 B−A
2.103 ∼B
2.104 A ∪ ∼C
2.105 ∼C − ∼B
2.106 C − ∼C
2.107 ∼(C − ∼A)

2.108 In general, A − B and B − A do not denote the same set. (See Figure 2.19d.) But your friends Evan and Yasmin wander by and
tell you the following. Let E denote the set of CS homework questions that Evan has not yet solved. Let Y denote the set of CS
homework questions that Yasmin has not yet solved. Evan and Yasmin claim that E − Y = Y − E. Is this possible? If so, under
what circumstances? If not, why not? Justify your answer.

Let D and E be arbitrary sets. For each set given below, indicate which of the following statements is true:
• the given set must be a subset of D (for every choice of D and E);
• the given set may be a subset of D (for certain choices of D and E); or
• the given set cannot be a subset of D (for any choice of D and E).
If you answer “must” or “cannot,” justify your answer (1–2 sentences). If you answer “may,” identify an example D1 , E1 for which
the given set is a subset of D1 , and an example D2 , E2 for which the given set is not a subset of D2 .
2.109 D∪E
2.110 D∩E
2.111 D−E
2.112 E−D
2.113 ∼D

Let F = {1, 2, 4, 8}, G = {1, 3, 9}, and H = {0, 5, 6, 7}. Where relevant, let U = {0, 1, 2, . . . , 9} be the universe.
2.114 Are F and G disjoint?
2.115 Are G and ∼F disjoint?
2.116 Are F ∩ G and H disjoint?
2.117 Are H and ∼H disjoint?

2.118 Let S and T be two sets, with n = |S| and m = |T|. What is the smallest cardinality that S ∪ T can have. Give an example of the
minimum-sized sets. (You should give a family of examples—that is, describe a smallest-possible set for any values of n and m.)

ISTUDY
40 Basic Data Types

2.119 Repeat for S ∩ T. (That is, what’s the smallest possible value of |S ∩ T| in terms of n and m? Give a family of examples.)
2.120 Repeat for S − T.
2.121 What’s the largest possible value of S ∪ T in terms of n and m? Give a family of examples.
2.122 Repeat for the largest possible value of S ∩ T.
2.123 Repeat for the largest possible value of S − T.

In a variety of CS applications, it’s useful to be able to compute the similarity of two sets A and B. (More about one of these
applications, collaborative filtering, below.) There are a number of different ideas of how to measure set similarity, all based on
the intuition that the larger |A ∩ B| is, the more similar the sets A and B are. Here are two basic measures of set similarity that are
sometimes used:
• The cardinality measure: the similarity of A and B is |A ∩ B|.
|A∩B|
• The Jaccard coefficient: the similarity of A and B is |A∪B| . (The Jaccard coefficient is named after Paul Jaccard (1868–1944), a
Swiss botanist who was interested in how similar or different the distributions of various plants were in different regions [63].)
2.124 Let A = {chocolate, hazelnut, cheese}; B = {chocolate, cheese, cardamom, cherries}; and C = {chocolate}. Compute the
similarities of each pair of these sets using the cardinality measure.
2.125 Repeat the previous exercise for the Jaccard coefficient.
2.126 Suppose we have a collection of sets A1 , A2 , . . . , An . Consider the following claim:

Claim: Suppose that the set Av is the most similar set to the set Au in this collection (aside from Au itself). Then Au is necessarily
the set that is most similar to Av (aside from Av itself).

Decide whether you think this claim is true for the cardinality measure of set similarity, and justify your answer. (That is, argue
why it must be true, or give an example showing that it’s false.)
2.127 Repeat the previous exercise for the Jaccard coefficient.

Taking it further: A collaborative filtering system, or recommender system, seeks to suggest new products to a user u on the
basis of the similarity of u’s past behavior to the past behavior of other users in the system. Collaborative filtering systems
are mainstays of many popular commercial online sites (like Amazon or Netflix, for example). One common approach to
collaborative filtering is the following. Let U denote the set of users of the system, and for each user u ∈ U, define the set Su
of products that u has purchased. To make a product recommendation to a user u ∈ U: first, identify the user v ∈ U − {u}
such that Sv is the set “most similar” to Su ; then, second, recommend the products in Sv − Su to user u (if any exist). This
approach is called nearest-neighbor collaborative filtering, because the v found in step (i) is the other person closest to
u. The measure of set similarity used in step (i) is all that’s left to decide, and either cardinality or the Jaccard coefficient
are reasonable choices. The idea behind the Jaccard coefficient is that the fraction of agreement matters more than the
total amount of agreement: a {Cat’s Cradle, Catch 22} purchaser is more similar to a {Slaughterhouse Five, Cat’s Cradle}
purchaser than someone who bought every book Amazon sells.

For each of the following claims, decide whether you think the statement is true for all sets of integers A, B, C. If it’s true for every
A, B, C, then explain why. (A Venn diagram may be helpful.) If it’s not true for every A, B, C, then provide an example for which it
does not hold.
2.128 A ∩ B = ∼(∼A ∪ ∼B)
2.129 A ∪ B = ∼(∼A ∩ ∼B)
2.130 (A − B) ∪ (B − C) = (A ∪ B) − C
2.131 (B − A) ∩ (C − A) = (B ∩ C) − A

2.132 List all of the different ways to partition the set {1, 2, 3}.

Consider the table of distances shown in Figure 2.28 for a set P = {Alice, . . . , Frank} of people. Suppose we partition P into
subsets S1 , . . . , Sk . Define the intracluster distance as the largest distance between two people who are in the same cluster, and
define the intercluster distance as the smallest distance between two people who are in different clusters:
   
intracluster distance = max max distance(x, y) and intercluster distance = min min distance(x, y) .
i x,y∈Si i,j̸=i x∈Si ,y∈Sj

2.133 Partition P into 3 or fewer subsets so that the intracluster distance is ≤ 2.0.

ISTUDY
Exercises 41

Alice Bob Charlie Desdemona Eve Frank


Alice 0.0 1.7 1.2 0.8 7.2 2.9
Bob 1.7 0.0 4.3 1.1 4.3 3.4
Charlie 1.2 4.3 0.0 7.8 5.2 1.3
Desdemona 0.8 1.1 7.8 0.0 2.1 1.9
Eve 7.2 4.3 5.2 2.1 0.0 1.9
Frank 2.9 3.4 1.3 1.9 1.9 0.0

Figure 2.28 Some distances between people.

2.134 Partition P into subsets S1 , . . . , Sk so the intracluster distance is as small as possible. (You choose k.)
2.135 Partition P into subsets S1 , . . . , Sk so the intercluster distance is as large as possible. (Again, you choose k.)

2.136 Define S = {1, 2, . . . , 100}. Let W = {x ∈ S : x mod 2 = 0}, H = {x ∈ S : x mod 3 = 0}, and O = S − H − W. Is {W, H, O}
a partition of S?

2.137 What is the power set of {1, a}?


2.138 What is the power set of {1}?
2.139 What is the power set of {}?
2.140 What is the power set of P(1)?

ISTUDY
42 Basic Data Types

2.4 Sequences, Vectors, and Matrices: Ordered Collections


For nothing matters except life; and, of course, order.
Virginia Woolf (1882–1941)
“Montaigne,” The Common Reader (1925)

In Section 2.3, we introduced sets—collections of objects in which the order of those objects doesn’t
matter. In many circumstances, though, order does matter: if a Java method takes two parameters, then
swapping the order of those parameters will usually change what the method does; if there’s an interesting
site at longitude −1.3310 and latitude 52.5990, then showing up at longitude 52.5990 and latitude −1.3310
won’t do. In this section, we turn to ordered collections of objects, called sequences. A summary of the
notation related to sequences is given in Figure 2.29.

Definition 2.34: Sequence, list, and tuple.


A sequence—also known as a list or tuple—is an ordered collection of objects, typically called compo-
nents or entries. When the number of objects in the collection is 2, 3, 4, or n, the sequence is called an
(ordered) pair, triple, quadruple, or, n-tuple, respectively.

We’ll write a sequence inside of angle brackets ⟨ and ⟩, as in ⟨Northfield, Minnesota, USA⟩ or ⟨0, 1⟩. (Some
people use parentheses instead of angle brackets, as in (128, 128, 0) instead of ⟨128, 128, 0⟩.)
For two sets A and B, we frequently will refer to the set of ordered pairs whose two elements, in order,
come from A and B:

Definition 2.35: Cartesian product.


The Cartesian product of two sets A and B is the set A × B = {⟨a, b⟩ : a ∈ A and b ∈ B} containing all
ordered pairs where the first component comes from A and the second component comes from B.

For example, {0, 1} × {2, 3} is the set {⟨0, 2⟩, ⟨0, 3⟩, ⟨1, 2⟩, ⟨1, 3⟩}.

The Cartesian product is named after René Descartes, the seventeenth-century French philosopher/mathematician. (The English
adjectival form uses only the cartes part of his last name Descartes.) Descartes is also the namesake of the Cartesian plane, which
we’ll see soon, and he was a major contributor in mathematics, particularly geometry. But Descartes is probably most famous as
a philosopher, for the cogito ergo sum (“I think therefore I am”) argument, in which Descartes—after adopting a highly skeptical
view about all claims, even apparently obviously true ones (which is not a bad attitude to adopt when thinking about the kind of
material in this book, for what it’s worth!)—attempts to argue that he himself must exist.

We can also view any particular cell in a two-dimensional grid—like a cell in a spreadsheet, or a square on
a chess board—as a sequence:

Example 2.37: Chess positions.


A chess board is an 8-by-8 grid. Chess players use what’s called “Algebraic notation” to refer to the
columns (which they call files) using the letters a through h, and they refer to the rows (which they call
ranks) using the numbers 1 through 8. (See Figure 2.30.) Thus the square containing the white queen Q
is ⟨d, 1⟩; the full set of squares of the chess board is {a, b, c, d, e, f, g, h} × {1, 2, 3, 4, 5, 6, 7, 8} ; and the
squares containing knights—the N pieces (both white and black)—are {⟨b, 1⟩, ⟨g, 1⟩, ⟨b, 8⟩, ⟨g, 8⟩}. The
set of squares with knights could also be written as {b, g} × {1, 8}.

Here’s another example, about color representation on computers:

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 43

sequence/ordered tuple ⟨a1 , a2 , . . . , an ⟩


Cartesian product A × B = {⟨a, b⟩ : a ∈ A and b ∈ B}
the set of all n-element sequences of S Sn = S × S × · · · × S (n times)

vector x ∈ Rn q
Pn
vector length, for x ∈ Rn ∥x∥ = i=1 xi
2

vector addition, for vectors x, y ∈ Rn x + y = ⟨x1 + y1 , x2 + y2 , . . . , xn + yn ⟩


scalar product, for a ∈ R and x ∈ Rn ax = ⟨a · x1 , a · x2 , . . . , a · xn ⟩
P
dot product, for vectors x, y ∈ Rn x • y = ni=1 xi · yi

matrix M ∈ Rn×m
 
1 0 ... 0
 0 1 . . . 0
 
identity matrix a matrix I ∈ Rn×n where I = 
 .. .. .. .. 

. . . .
0 0 ... 1
scalar multiplication, for a ∈ R and M ∈ Rn×m a matrix N ∈ Rn×m where Ni,j = a · Mi,j
matrix addition, for M, M′ ∈ Rn×m a matrix N ∈ Rn×m where Ni,j = Mi,j + M′i,j
P
matrix multiplication, for A ∈ Rn×m and B ∈ Rm×p a matrix M ∈ Rn×p where Mi,j = m k=1 Ai,k Bk,j
matrix inverse, for M ∈ Rn×n a matrix M−1 ∈ Rn×n where MM−1 = I (if any such M−1 exists)

Figure 2.29 A summary of notation for sequences, vectors, and matrices.

8
rmblkans red ⟨255, 0, 0⟩
7
opopopop orange ⟨255, 128, 0⟩
6
0Z0Z0Z0Z yellow ⟨255, 255, 0⟩
5
Z0Z0Z0Z0 green ⟨0, 255, 0⟩
4
0Z0Z0Z0Z blue ⟨0, 0, 255⟩
3
Z0Z0Z0Z0 indigo ⟨74, 0, 130⟩
2
POPOPOPO violet ⟨128, 0, 128⟩
1
SNAQJBMR
a b c d e f g h

Figure 2.30 The squares of a chess board. Figure 2.31 The seven traditional colors of the rainbow.

Example 2.38: RGB color values.


The RGB color space represents colors as ordered triples, where each component is an element of
{0, 1, . . . , 255}. RGB stands for red–green–blue; the three components of a color, respectively, repre-
sent how red, how green, and how blue the color is. Formally, a color is an element of {0, 1, . . . , 255} ×
{0, 1, . . . , 255} × {0, 1, . . . , 255}.
The order of these components matters; for example, the color ⟨0, 0, 255⟩ is pure blue, while the color
⟨255, 0, 0⟩ is pure red. An example of the seven traditional colors of the rainbow is shown in Figure 2.31.

Taking it further: An annoying pedantic point: we are being sloppy with notation in Example 2.38; we only defined the Cartesian
product for two sets, so when we write S×S×S we “must” mean either S×(S×S) or (S×S)×S. We’re going to ignore this issue,
and simply write statements like ⟨0, 1, 1⟩ ∈ {0, 1} × {0, 1} × {0, 1}—even though we ought to instead be writing statements
like ⟨0, ⟨1, 1⟩⟩ ∈ {0, 1} × ({0, 1} × {0, 1}). (A similar shorthand shows up in programming languages like Scheme, where
pairing—“cons”ing—a single element 3 with a list (2 1) yields the three-element list (3 2 1), rather than the two-element
pair (3 . (2 1)), where the second element is a two-element list.)

Beyond the “obvious” sequences like Examples 2.37 and 2.38, we’ve also already seen some definitions
that don’t seem to involve sequences, but implicitly are about ordered tuples of values. One example is the
rational numbers (see Section 2.2.2):

ISTUDY
44 Basic Data Types

Example 2.39: Rational numbers as sequences.


We can define the rational numbers (also known as fractions) as the set Q = Z × Z>0 . Under this view,
a rational number is represented as a pair ⟨n, d⟩ ∈ Z × Z>0 , with a numerator n and a denominator d.
For example, the fractions 21 and 202808 would be represented as ⟨1, 2⟩ and ⟨202, 808⟩, respectively. (To
flesh out the details of this representation, we also have to consider reducing fractions to lowest terms, to
establish the equivalence of fractions like ⟨2, 4⟩ and ⟨1, 2⟩. See Example 8.36.)

We will often consider sequences of elements that are all drawn from the same set, and there is special
notation for such a sequence:

Definition 2.36: Sequences of elements from the same set.


For a set S and a positive integer n, we write Sn to denote Sn = S × S × · · · × S .
n times

Thus Sn denotes the set of all sequences of length n where each component of the sequence is an element of
3 3
the set S. For example, the RGB values from Example 2.38 are elements of {0, 1, . . . , 255} , and {0, 1}
denotes the set {⟨0, 0, 0⟩, ⟨0, 0, 1⟩, ⟨0, 1, 0⟩, ⟨0, 1, 1⟩, ⟨1, 0, 0⟩, ⟨1, 0, 1⟩, ⟨1, 1, 0⟩, ⟨1, 1, 1⟩} . This notation
also lets us write R × R, called the Cartesian plane, as R2 —the way you might have written it in a high
school algebra class. (See Figure 2.32.)
In certain contexts, sequences of elements from the same set (as in Definition 2.36) are called strings.
For a set Σ, called an alphabet, a string over Σ is an element of Σn for some nonnegative integer n. (In other
S
words, a string is any element of n∈Z≥0 Σn .) The length of a string x ∈ Σn is n. For example, the set of 5-
5
letter words in English is a subset of {A, B, . . . , Z} . We allow strings to have length zero: for any alphabet
Σ, there is only one sequence of elements from Σ of length 0, called the empty string; it’s denoted by ϵ,
and for any alphabet Σ, we have Σ0 = {ϵ}. When writing strings, it is customary to omit the punctuation
11 8
(angle brackets and commas), so we write ABRACADABRA ∈ {A, B, . . . , Z} and 11010011 ∈ {0, 1} .

2.4.1 Vectors
As we’ve already seen, we can create sequences of many types of things: we can view sequences of letters as
11
strings (like ABRACADABRA ∈ {A, B, . . . , Z} ), or sequences of three integers between 0 and 255 as colors
3
(like ⟨119, 136, 153⟩ ∈ {0, 1, . . . , 255} , officially called “light slate gray”). Perhaps the most pervasive
type of sequence, though, is a sequence of real numbers, called a vector.

⟨2, 3⟩
⟨3, 2⟩

⟨7, −1⟩
⟨−3, −2⟩

Figure 2.32 A few points in R2 . The first component represents the x-axis (horizontal) position; the second component
represents the y-axis (vertical) position.

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 45

Definition 2.37: Vector.


A vector (or n-vector) x is a sequence x ∈ Rn , for some positive integer n. For a vector x ∈ Rn and for
any index i ∈ {1, 2, . . . , n}, we write xi to denote the ith component of x.

For example, ⟨0, 1⟩,√⟨1, 0⟩, and ⟨ √12 , √12 ⟩ are all vectors in R2 . For the vector x = ⟨1/2, 3/2⟩, we have
x1 = 1/2 and x2 = 3/2.
Taking it further: Vectors are used in a tremendous variety of computational contexts: computer graphics (representing the
line-of-sight from the viewer’s eye to an object in a scene), machine learning (a feature vector describing which characteristics a
particular object has, which can be used in trying to classify that object as satisfying a condition or failing to satisfy a condition),
among many others. The discussion on p. 53 describes the vector-space model for representing a document d as a vector whose
components correspond to the number of times each word appears in d. Vectors and matrices (the topics of the remainder of this
section) are the main focus of a math course in linear algebra. In this section, we’re only mentioning a few highlights of vectors
and matrices; you can find much more in any good textbook on linear algebra.
A warning for C or Java or Python (or …) programmers: notice that our vectors’ components are indexed starting at one, not
zero. For a vector x ∈ Rn , the expression xi is meaningless unless i ∈ {1, 2, . . . , n}. The expression x0 doesn’t mean anything.

Vectors are sometimes contrasted with scalars, which are just numbers: that is, a scalar is an ele-
ment of R. Vectors are also sometimes written in square brackets, so we may see an n-vector x written
as x = [x1 , x2 , . . . , xn ]. We may encounter vectors in which the components are a restricted kind of
n
number—for example, integers or bits. Elements of {0, 1} are often called bit vectors or bitstrings.
Here’s an example of using vectors to compute distances between points:

Example 2.40: Train stations in Manhattan.


Let’s (very roughly!) represent a location in Manhattan as a vector—specifically, as a point ⟨x, y⟩ ∈ R2
representing the intersection of xth Avenue and yth Street. Define the walking distance between points
p and q in Manhattan as |p1 − q1 | + |p2 − q2 |: the number of east–west blocks between p and q plus
the number of north–south blocks between p and q. (Note that walking distance is different from the
straight-line distance between the points.)
The two major train stations in Manhattan are Penn Station, located at s = ⟨8, 33⟩, and Grand Central
Station, at g = ⟨4, 42⟩. First, determine the walking distance between Penn Station and Grand Central.
Then describe the points that are closer (in walking distance) to Penn Station than to Grand Central.
Solution. The distance between s = ⟨8, 33⟩ and g = ⟨4, 42⟩ is
|s1 − g1 | + |s2 − g2 | = |8 − 4| + |33 − 42| = 4 + 9 = 13.
For the second question, let’s compute some points that are equidistant to the two stations. (Those points
are on the boundary of the region of points closer to g and the region of points closer to s.) For example,
a point ⟨4, y⟩ has distances |42 − y| and 4 + |y − 33| to the stations; these distances are both equal to 6.5
when y = 35.5.
More generally, let’s think about a point whose x-coordinate falls between 4 and 8. For any offset
0 ≤ δ ≤ 4, the distance between the point ⟨4+δ, y⟩ and the two stations are δ+|42−y| and 4−δ+|y−33|.
These two values are both equal to 6.5 when y = 35.5 + δ. (For example, when δ = 4, then y = 39.5.)
Thus the points ⟨4 + 0, 35.5 + 0⟩ = ⟨4, 35.5⟩ and ⟨4 + 4, 35.5 + 4⟩ = ⟨8, 39.5⟩ are both equidistant to
s and g, as are all points on the line segment between them. (See Figure 2.33.) The remaining cases of
the analysis—figuring out which points with x-coordinate less than 4 or greater than 8 are closer to s or
g—are left to Exercises 2.184 and 2.185.

ISTUDY
46 Basic Data Types

45 45
44 44
43 ⟨4, 42⟩ 43 ⟨4, 42⟩
42
41
42
41
points closer to ⟨4, 42⟩ than to ⟨8, 33⟩
40 40
39 39
38 38 see Exercises 2.184 and 2.185
37 37
36 36
35 35 points closer to ⟨8, 33⟩ than to ⟨4, 42⟩
34 ⟨8, 33⟩ 34 ⟨8, 33⟩
33 33
32 32
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10

Figure 2.33 Illustrations of Manhattan train stations.

Taking it further: The measure of walking distance between points described in Example 2.40 is used surprisingly commonly
in computer science applications—and, appropriately enough, it’s actually named after Manhattan. The Manhattan distance
P
between two points p, q ∈ Rn is defined as ni=1 |pi − qi |. (We’re summing the number of “blocks” of difference in each of the
n dimensions; we take the absolute value of each difference because we care about the difference in each dimension rather than
which point has the higher value in that component.)

Here’s one more useful definition about vectors:

Definition 2.38: Vector length. qP


n
The length of a vector x ∈ Rn is defined as ∥x∥ = 2
i=1 (xi ) .

√ √ √
For example, ∥⟨2, 8⟩∥ = 22 + 82 = 4 + 64 = 68 ≈ 8.246. If we draw a vector x ∈ R2 in the
Cartesian plane, then ∥x∥ denotes the length of the line from ⟨0, 0⟩ to x. (See Figure 2.34.) A vector x ∈ Rn
is called a unit vector if ∥x∥ = 1.

Vector arithmetic
We will now define basic arithmetic for vectors: vector addition, which is performed component-wise
(adding the corresponding elements of the two vectors), and two forms of multiplication—one for
multiplying a vector by a scalar (also component-wise) and one for multiplying two vectors together.

Definition 2.39: Vector addition.


The sum of two vectors x, y ∈ Rn , written x+y, is a vector z ∈ Rn , where for every index i ∈ {1, 2, . . . , n}
we have zi = xi + yi . (Note that the sum of two vectors with different sizes is meaningless.)

For example, ⟨1.1, 2.2, 3.3⟩ + ⟨2, 0, 2⟩ = ⟨3.1, 2.2, 5.3⟩.

√ √ √
∥⟨9, 1⟩∥ = 92 + 12 = 81 + 1 = 82

82 ⟨ 9 , 1⟩

√ √ √
∥⟨−3, −5⟩∥ = (−3)2 + (−5)2 = 9 + 25 = 34
34

⟨−3, −5⟩

Figure 2.34 Two vector lengths: ∥⟨9, 1⟩∥ and ∥⟨−3, −5⟩∥.

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 47

The first type of multiplication for vectors is scalar multiplication, when we multiply a vector by a real
number. As with vector addition, scalar multiplication acts on each component independently, by rescaling
each component by the same factor:

Definition 2.40: Scalar product.


Given a vector x ∈ Rn and a real number a ∈ R, the scalar product ax is a vector z ∈ Rn , where zi = axi
for every index i ∈ {1, 2, . . . , n}.

For example, 3·⟨1, 2, 3⟩ = ⟨3, 6, 9⟩ and −1.5 · ⟨1, −1⟩ = ⟨−1.5, 1.5⟩ and 0 · ⟨1, 2, 3, 5, 8⟩ = ⟨0, 0, 0, 0, 0⟩.
The second type of vector multiplication, the dot product, takes two vectors as input and multiplies them
together to produce a single scalar as output:

Definition 2.41: Dot product.


Given two vectors x, y ∈ Rn , the dot product of x and y, denoted x • y, is given by summing the products
of the corresponding components:
P
n
x•y= xi · yi .
i=1

For example, ⟨1, 2, 3⟩ • ⟨4, 5, 6⟩ = 1 · 4 + 2 · 5 + 3 · 6 = 4 + 10 + 18 = 32.


Intuitively, the dot product of two vectors measures the extent to which they point in the “same direction.”
(Again, the dimensions of the vectors in a dot product have to match up: if x ∈ Rn and y ∈ Rm are vectors
where n ≠ m, then x • y is meaningless.)
Figure 2.35 shows a few unit vectors, and each of their dot products with one of the vectors. Here are
two other examples, which use dot products for some useful applications:

Example 2.41: Common classes.


Let C = ⟨CS1, CS2, . . . , CS8⟩ denote the list of all courses offered by a (somewhat narrowly focused)
university. For a particular student, let the bit vector u represent the courses taken by that student, so that
ui = 1 if the student has taken course ci (and ui = 0 otherwise). For example, a student who’s taken only
CS1 and CS8 would be represented by x = ⟨1, 0, 0, 0, 0, 0, 0, 1⟩, and a student who’s taken everything
except CS3 would be represented by y = ⟨1, 1, 0, 1, 1, 1, 1, 1⟩.

a = ⟨0, 1⟩
c • a = c1 · a1 + c2 · a2 = √1 ·0+ √1 ·1 = √1
c = ⟨ √12 , √12 ⟩ 2 2 2

c • b = c1 · b1 + c2 · b2 = √1
2
·1+ √1
2
·0 = √1
2

b = ⟨1, 0⟩ c•c = c1 · c1 + c2 · c2 = √1
2
· 1

2
+ 1

2
· 1

2
= 1 1
2 + 2 =1
c•d = c1 · d1 + c2 · d2 = √1
2
·0+ √1
2
· −1 = − √12
d = ⟨0, −1⟩

Figure 2.35 Four unit vectors, and each of their dot products with c.

ISTUDY
48 Basic Data Types

The dot product of two student vectors represents the number of common courses that they’ve taken.
For example, the number of common classes taken by x and y is
X
8
x•y= xi yi = 1 · 1 + 0 · 1 + 0 · 0 + 0 · 1 + 0 · 1 + 0 · 1 + 0 · 1 + 1 · 1
i=1
= 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1,
which is 2. Specifically, the two common courses taken by x and y are CS1 and CS8.

Example 2.42: GPAs.


Let g ∈ Rn be an n-vector where gi denotes the grade (measured on the grade point scale) that you got
in the ith class that you’ve taken in your college career. Let c ∈ Rn be an n-vector where ci denotes the

number of credits for the ith class you took. Then your grade point average (GPA) is given by ∑gn c ci .
i=1
For example, suppose your school gives grade points on the scale 4.0 = A, 3.7 = A−, 3.3 = B+,
3.0 = B, etc. Suppose you took CS 111 (6 credits), CS 201 (6 credits), and Mbira Lessons (4 credits),
and got grades of B+, A−, and B, respectively. Then g = ⟨3.3, 3.7, 3.0⟩ and c = ⟨6, 6, 4⟩, and
g•c 3.3 · 6 + 3.7 · 6 + 3.0 · 4 19.8 + 22.2 + 12.0 54
your GPA = P3 = = = = 3.375.
i=1 ci
6 + 6 + 4 16 16

2.4.2 Matrices
If a vector is analogous to an array of numbers, then a matrix is analogous to a two-dimensional array of
numbers:

Definition 2.42: Matrix.


An n-by-m matrix M is a two-dimensional table of real numbers containing n rows and m columns. The
⟨i, j⟩th entry of the matrix appears in the ith row and jth column, and we denote that entry by Mi,j . Such
a matrix M is an element of Rn×m , and we refer to M as having size or dimension n-by-m.

(The plural of matrix is matrices, which rhymes with the word “cheese.”) Thus a matrix looks like
 
M1,1 M1,2 . . . M1,m
 M2,1 M2,2 . . . M2,m 
 
M= .. .. .. .. .
 . . . . 
Mn,1 Mn,2 . . . Mn,m

It’s possible to think of a two-dimensional array in a programming language as a one-dimensional array


of one-dimensional arrays; similarly, if you’d like, you can think of an n-by-m matrix as a sequence of n
vectors, all of which are elements of Rm . This view of an n-by-m matrix is as an element of (Rm )n .
Three small matrices are shown in Figure 2.36a. (The ⟨2, 1⟩st entry is circled in each.) In Figure 2.36a,
A is a 2-by-3 matrix, B is a 3-by-2 matrix, and I is a 3-by-3 matrix.

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 49

 
    1 2 3
" #  

5 3
 
1 0 0
 M=
 4 5 6 

3 1 4
A= B=
 4 8 
 I=
 0 1 0 
 7 8 9
9 7 2
6 9 0 0 1
(b) A matrix with the entries of the
(a) Three matrices, with the ⟨2, 1⟩st entry circled in each. main diagonal circled.

Figure 2.36 A few matrices.

(a)  (b)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1
 
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
 
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1
 
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 2.37 (a) A matrix representing a black-and-white bitmapped image, and (b) the image.

Example 2.43: Bitmaps.


One handy application of matrices is as an easy way to represent images. A black-and-white image can
be represented as a matrix with all entries in {0, 1}: each 1 entry represents white in the corresponding
pixel; each 0 represents black. See Figure 2.37a for an example.

Taking it further: The picture in Figure 2.37 is a simple black-and-white image, but we can use the same basic structure for
grayscale or color images. Instead of an integer in {0, 1} as each entry in the matrix, a grayscale pixel could be represented using
a real number in [0, 1]—or, more practically, a number in { 2550 1
, 255 , . . . , 255
255
}. For color images, each entry would be an RGB
triple (see Example 2.38). These matrix-based representations of an image are often called bitmaps. Bitmaps are highly inefficient
ways of storing images; most computer graphics file formats use much cleverer (and more space-efficient) representations.
Here are a few other examples of the pervasive applications of matrices in computer science. A term–document matrix can be
used to represent a collection of documents: the entry Md,k of the matrix M stores the number of times that keyword k appears in
document d. An adjacency matrix (see Chapter 11) can represent the page-to-page hyperlinks of the web in a matrix M, where
Mi,j = 1 if web page i has a hyperlink to web page j (and Mi,j = 0 otherwise). A rotation matrix (see p. 54) can be used in
computer graphics to re-render a scene from a different perspective.

A matrix M ∈ Rm×n is called square if m = n. For a square matrix M ∈ Rn×n , we may say that the
size of M is n (rather than saying that its size is n-by-n). A square matrix M is called symmetric if, for all
indices i, j ∈ {1, 2, . . . , n}, we have Mi,j = Mj,i . The main diagonal of a square matrix M ∈ Rn×n is the
sequence consisting of the entries Mi,i for i = 1, 2, . . . , n. (An example is shown in Figure 2.36b.) One
special square matrix that will arise frequently is the identity matrix, which has ones on the main diagonal
and zeros everywhere else (see Figure 2.38):

Definition 2.43: Identity matrix.


(
1 if i = j
The n-by-n identity matrix is the matrix I ∈ Rn×n whose entries satisfy Ii,j =
0 if i ̸= j.

Note that there is a different n-by-n identity matrix for every n ≥ 1; for example, the 1-by-1, 2-by-2, …,
5-by-5 identity matrices are shown in Figure 2.38.

ISTUDY
50 Basic Data Types

   
  1 0 0 0 0
1 0 ... 0   1 0 0 0
0 0 [ ] 1 0 0 0 1 0 0 0
 1 ...  [ ] 1 0 0 1 0 0  
. ..  1 0 1 0   0 0 1 0 0
 ..
.. .. 0 0  
. . . 0 1
0 0 1
0 1 0 0 0 1 0
0 0 ... 1 0 0 0 1
0 0 0 0 1

Figure 2.38 The identity matrix I generically, and for size up to 5.

As with vectors, we will need to define the basic arithmetic operations of addition and multiplication
for matrices. Just as with vectors, adding two n-by-m matrices or multiplying a matrix by a scalar is done
component by component:

Definition 2.44: Matrix addition and scalar multiplication.


Given a matrix M ∈ Rn×m and a real number a ∈ R, the product aM is a matrix N ∈ Rn×m where
Ni,j = aMi,j for all indices i ∈ {1, 2, . . . , n} and j ∈ {1, 2, . . . , m}.
Given two matrices M, M′ ∈ Rn×m , the sum M + M′ is a matrix N ∈ Rn×m where Ni,j = Mi,j + M′i,j
for all indices i ∈ {1, 2, . . . , n} and j ∈ {1, 2, . . . , m}.

Again, just as with vectors, adding two matrices that are not the same size is meaningless.

Example 2.44: Matrix arithmetic for some small matrices.


Consider the following matrices:
     
0 2 2 1 2 3 1 0 0
A =  2 0 2 B =  0 0 6 I =  0 1 0 .
2 2 0 0 0 4 0 0 1
Then we have:
A+B= 4B = A + 3I = A − 3I =
       
1 4 5 4 8 12 3 2 2 −3 2 2
 2 0 8 0 0 24  2 3 2  2 −3 2  .
2 2 4 0 0 16 2 2 3 2 2 −3

Matrix multiplication
Multiplying matrices is a bit more complicated than the other vector/matrix operations that we’ve seen so
far. The product of two matrices is a matrix, rather than a single number: the entry in the ith row and jth
column of AB is derived from the ith row of A and the j column of B. More precisely:

Definition 2.45: Matrix multiplication.


The product AB of two matrices A ∈ Rn×m and B ∈ Rm×p is an n-by-p matrix M ∈ Rn×p whose entries
are, for any i ∈ {1, 2, . . . n} and j ∈ {1, 2, . . . , p},
X
m
Mi,j = Ai,k Bk,j .
k=1

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 51

As usual, if the dimensions of the matrices A and B don’t match—if the number of columns in A is different
from the number of rows in B—then AB is undefined.

Example 2.45: Multiplying some small matrices.


Let’s compute the product of a sample 2-by-3 matrix and a 3-by-2 matrix:
 
  7 8
1 2 3 
· 1 3
4 5 6
9 0
Note that, by definition, the result will be a 2-by-2 matrix. Its entries are:
     
1·7+2·1+3·9 1·8+2·3+3·0 7 + 2 + 27 8+6+0 36 14
= = .
4·7+5·1+6·9 4·8+5·3+6·0 28 + 5 + 54 32 + 15 + 0 87 47
For example, the 14 in ⟨row #1, column #2⟩ of the result was calculated by successively multiplying the
first matrix’s first row ⟨1, 2, 3⟩ by the second matrix’s second column ⟨8, 3, 0⟩. A visual representation of
this multiplication is shown in Figure 2.39.

More compactly, we could write matrix multiplication using the dot product from Definition 2.41: for two
matrices A ∈ Rn×m and B ∈ Rm×p , the ⟨i, j⟩th entry of AB is the value of Ai,(1...m) • B(1...m),j .

Problem-solving tip: To help keep matrix multiplication straight, it may be helpful to compute the ⟨i, j⟩th entry of AB by simul-
taneously tracing the ith row of A with the index finger of your left hand, and the jth column of B with the index finger of your
right hand. Multiply the two numbers that you’re pointing at, and add the result to a running tally; when you’ve traced the whole
row/column, the running tally is (AB)i,j .

Be careful: matrix multiplication is not commutative—that is, for matrices A and B, the values AB and BA
are generally different! (This asymmetry is unlike numerical multiplication: for x, y ∈ R, it is always the
case that xy = yx.) In fact, because the number of columns of A must match the number of rows of B for
AB to even be meaningful, it’s possible for BA to be meaningless or a different size from AB.

Example 2.46: Multiplying the other way around.


If we multiply the matrices from Example 2.45 in the other order, we get
   
7 8   39 54 69
1 3 · 1 2 3
= 13 17 21
4 5 6
9 0 9 18 27
This matrix differs from the result in Example 2.45—it’s not even the same size!

You’ll show in the exercises that, for any n-by-m matrix A, the result of multiplying A by the identity matrix
I yields A itself: that is, AI = A. You’ll also explore the inverse of a matrix A: that is, the matrix A−1 such
that AA−1 = I (if any such A−1 exists).
Here’s another way to use matrices to combine different types of information:

ISTUDY
52 Basic Data Types

[ ][ ] [ ][ ]
1 2 3 7 8 1 2 3 7 8
4 5 6 1 3 4 5 6 1 3
9 0 [ ]  [ ] 9 0
1 2 3 7 8 36 14
4 5 6 1 3
=
87 47
[ ][ ] [ ][ ]
1 2 3 7 8 9 0 1 2 3 7 8
4 5 6 1 3 4 5 6 1 3
9 0 9 0

Figure 2.39 Multiplying two matrices.

Example 2.47: Programming language knowledge.


Let A be an n-by-m matrix where Ai,j = 1 if student i has taken class j (and Ai,j = 0 otherwise). Let B be
an m-by-p matrix where Bj,k = 1 if class j uses programming language k (and Bj,k = 0 otherwise). What
does the matrix AB represent?
Solution. First, note that the resulting matrix AB has n rows and p columns; that is, its size is (number of
students)-by-(number of languages). For a student i and a programming language k, we have that
X
m
(AB)i,k = Ai,j Bj,k
j=1
"( #
X
m
1 if student i took class j and j uses language k
=
j=1 0 otherwise

because 0 · 0 = 0 · 1 = 1 · 0 = 0, so the only terms of the sum that are 1 occur when both Ai,j (“student i
took class j?”) and Bj,k (“class j uses language k?”) are true (that is, 1). Thus (AB)i,k denotes the number
of classes that use language k that student i took.
Figure 2.40 shows a concrete example. (For example, the cell corresponding to Alice and C is computed
by ⟨0, 1, 1, 1, 1⟩ • ⟨0, 0, 1, 1, 0⟩—the dot product of Alice’s row of A with C’s column of B—which has
the value 0 · 0 + 1 · 0 + 1 · 1 + 1 · 1 + 1 · 0 = 2. This entry reflects the fact that Alice has taken two
classes that use C: (i) introduction to computer systems, and (ii) programming languages.)
C+ bly
p
pr syst es

e
th lang s

C n

em

m
og em
ur

co

he
th
eo s

+
va
tro ct

rl

ss
of

Py

Sc
Pe
in tru

 
Ja
A
ry
s
tro

ta

0 1 0 0 0 0 0 intro
da
in

   
0 1 1 1 1 Alice  0 1 0 1 0 0 0  data structures
 
A = 1 1 0 1 0 Bob B = 0 0 1 0 1 0 0  intro systems
 
1 0 0 0 1 Charlie  0 1 1 1 1 1 1  prog langs
0 0 0 0 0 0 0 theory of comp
C+ bly

e
C n

em

m
o

he
th

+
va
rl

ss
Py

Sc
Pe

Ja
A

 
0 2 2 2 2 1 1 Alice
AB =  0 3 1 2 1 1 1 Bob
0 1 0 0 0 0 0 Charlie

Figure 2.40 Three students, five courses, and seven programming languages.

ISTUDY
2.4 Sequences, Vectors, and Matrices: Ordered Collections 53

Computer Science Connections

The Vector Space Model


Here’s a classic application of vectors, taken from information retrieval, the subfield of computer science devoted to
searching for information relevant to a given query in large datasets. (For more on information retrieval, see [84].)
We start with a large corpus of documents—for example, transcripts of all email messages that you’ve sent in your
entire life. (The word corpus comes from the Latin for “body”; it simply means a body of texts.) Tasks might include
clustering the corpus into subcollections (“which of my email messages are spam?”), or finding stored documents
similar to a given query (“find the emails most relevant to ‘good restaurants in Chicago’ in my archives”).
The vector space model is a standard approach to representing text documents for the purposes of information
retrieval. We choose a list of n terms that might appear in a document. We then represent a document d as an n-vector x
of integers, where xi is the number of times that the ith term appears in the document d. See Figure 2.41 for an example.
Because documents that are about similar topics tend to
contain similar vocabulary, we can judge the similarity
of documents d and d′ based on “how similar” their 2 “three”
corresponding vectors x and x′ are. A first stab at mea-
suring similarity between x and x′ is to compute the dot d1 = [1, 0, 1] d3 = [1, 1, 0]
product x • x′ ; this approach counts the number of times 1
d2 = [2, 2, 1]
any word in d appears in d′ . (And if a word appears
twice in d, then each appearance in d′ counts twice for
the dot product.) But this first approach has an issue in
“two”
that it favors longer documents: a document that lists e” 1
“on 1 2 3
all the words in the dictionary would correspond to a 2
vector [1, 1, 1, 1, 1, . . .]—which would therefore have 3
a large dot product with all documents in the corpus.
To compensate for the fact that longer documents d1 Three is one of the loneliest numbers. −→ [1, 0, 1]
have more words, it’s a good idea to normalize these d2 A one and a two and a one, two, three. −→ [2, 2, 1]
vectors so that they have the same length, by using d 3 One, two, buckle my shoe. −→ [1, 1, 0]
′ ′
x/∥x∥ and x /∥x ∥ to represent the documents, instead
Figure 2.41 An example from the vector-space model:
of x and x′ . It turns out that the dot product of the
three documents translated into vectors using the keywords
normalized vectors computes the cosine of the angle
“one,” “two,” and “three,” and a plot of them in R3 .
between these representations of the documents. We
can then measure the similarity of x and x′ based on how small the angle between them is.
This second approach is better, but it still suffers from counting common occurrences of the word the and the
word normalize as equally indicative of the similarity of documents. That’s clearly wrong: some words carry more
information than others. Information retrieval systems apply different weights to different terms in measuring sim-
ilarity; one common approach is called term frequency–inverse document frequency (TFIDF), which downweights
terms that appear in many documents in the corpus. TFIDF is due to the British computer scientist Karen Spärck
Jones (1935–2007); she developed it in the early 1970s [64], predating by decades the major shift in natural language
processing from purely symbolic techniques to statistical ones.
Of course, real information retrieval systems are usually quite a lot more complicated than we’ve discussed so
far. For example, a document that talks about sofas would be judged to be completely unrelated to a document that
talks about couches, which seems a little foolish. Handling synonyms requires a more complicated approach, often
based around analyzing the term–document matrix that simultaneously represents the entire corpus. (For example,
if documents that discuss sofas use very similar other words to documents that discuss couches—like change and
cushion and nap—then we might be able to infer something about sofas and couches.) A recent flurry of research
has attempted to find new ways to build a vector representation for words, often using machine-learning techniques
to train the representation on large amounts of human-written texts. A prominent recent example is the word2vec
system [87]. This algorithm embedding can do some remarkable things—including doing a good job of solving
analogy problems using vector arithmetic—but there are also some reasons to be very cautious about using these
tools. Researchers have shown that the embeddings themselves, and the analogies they produce, can be racist and
sexist—in ways that, distressingly, reproduce the racism and sexism in the training data [17, 25].

ISTUDY
54 Basic Data Types

Computer Science Connections

Rotation Matrices
When an image is rendered (drawn) using computer graphics, we typically proceed by transforming a three-
dimensional representation of a scene, a model of the world, into a two-dimensional image fit for a screen. The
scene is typically represented by a collection of points in R3 , each defining a vertex of a polygon. The camera (the
eye from which the scene is viewed) is another point in R3 , with an orientation describing the direction of view. We
then project the polygons’ points into R2 .
Nearly every step of the rendering computation is done using matrix multiplications, taking into account the posi-
tion and direction of view of the camera, and the position of the given point. While a full account of this rendering
algorithm isn’t too difficult, we’ll consider a simpler problem that still includes the interesting matrix computations:
the rotation of a set of points in R2 by an angle θ. (You can learn more about the way that the full-scale computer
graphics algorithms work in [61]. The full-scale problem requires thinking about the angle of view with two param-
eters, akin to “azimuth” and “elevation” in orienteering: the direction θ in the horizontal plane and the angle ϕ away
from a straight horizontal view. Other image transformations that are important in graphics need a slick, if counterin-
tuitive, trick to be represented by matrix operations: it’s common to represent a three-dimensional point by a 4-vector,
rather than a 3-vector, which allows all of the most common transformations to be implemented using matrices.)
Suppose that we have a scene that consists of a col-
lection of points in R2 . As an example, Figure 2.42a
shows a collection of hand-collected points in R2 that
represent the borders of the state of Nevada. Now
imagine that we wish to rotate Nevada by some angle
θ—that is, to rotate each boundary point ⟨x, y⟩ by an ⟨x′ , y′ ⟩ = ⟨x cos θ − y sin θ,
angle θ around the point ⟨0, 0⟩. If you dig deep into x sin θ + y cos θ⟩
your dim recollections of trigonometry, you’d be able
to convince yourself that we can rotate a point ⟨x, 0⟩ ⟨x, y⟩
θ
around the point ⟨0, 0⟩ by moving it to ⟨x cos θ, x sin θ⟩.
More generally, the point ⟨x, y⟩ becomes the point
(a) Ten points (b) A point ⟨x, y⟩ rotated by an
⟨x cos θ − y sin θ, x sin θ + y cos θ⟩ when it’s rotated
representing Nevada. angle θ around the origin.
by an angle θ. (See Figure 2.42b.)
Now, to rotate a sequence of points ⟨x1 , y1 ⟩, . . ., Figure 2.42 Some points in R2 , and how to rotate them.
⟨xn , yn ⟩ by an angle θ, we’ll use matrices. Write a
matrix with the ith column corresponding to the ith point, and perform matrix multiplication as follows:
[ ][ ] [ ]
cos θ − sin θ x1 x2 · · · xn x cos θ − y1 sin θ x2 cos θ − y2 sin θ · · · xn cos θ − yn sin θ
= 1 .
sin θ cos θ y1 y2 · · · yn x1 sin θ + y1 cos θ x2 sin θ + y2 cos θ · · · xn sin θ + yn cos θ
[ ]
cos θ − sin θ
(The matrix R = is called a rotation matrix.)
sin θ cos θ
The result is that we have rotated an entire collection of points—arranged in the 2-by-n matrix M—by multiplying
M by this rotation matrix: in other words, RM is a 2-by-n matrix of the rotated points. (See Figure 2.43.)

origin origin origin

Figure 2.43 Nevada, rotated by three different angles.

ISTUDY
Exercises 55

EXERCISES
2.141 Write out all of the elements of {1, 2, 3} × {1, 4, 16}.
2.142 Do the same for {1, 4, 16} × {1, 2, 3}.
2.143 Do the same for {1} × {1} × {1}.
2.144 Do the same for {1, 2} × {2, 3} × {1, 4, 16}.
2.145 Suppose A × B = {⟨1, 1⟩, ⟨2, 1⟩}. What are A and B?

2.146 Let S = {1, 2, 3, 4, 5, 6, 7, 8}, and let T be an unknown set. Suppose that |S × T| = 16 and ⟨1, 2⟩, ⟨3, 4⟩ ∈ S × T. What can you
conclude about T? Be as precise as possible: if you can list the elements of T exhaustively, do so; if you can’t, identify any elements
that you can conclude must be (or must not be) in T.
2.147 Again with S = {1, 2, 3, 4, 5, 6, 7, 8}, suppose that S × T = ∅. What can you conclude about T?
2.148 Again with S = {1, 2, 3, 4, 5, 6, 7, 8}, suppose that (S × T) ∩ (T × S) = {⟨3, 3⟩}. What can you conclude about T?
2.149 Again with S = {1, 2, 3, 4, 5, 6, 7, 8}, suppose that S × T = T × S. What can you conclude about T?

Recall that Algebraic notation denotes the squares of the chess board as {a, b, c, d, e, f, g, h} × {1, 2, 3, 4, 5, 6, 7, 8}, as in Fig-
ure 2.44a. For each of the following questions, identify sets S and T such that the set of cells containing the designated pieces can
be described as S × T:
2.150 The white rooks (R).
2.151 The bishops (B, white or black).
2.152 The pawns (p, white or black).
2.153 No pieces at all.

2.154 Enumerate the elements of {0, 1, 2}3 .


2.155 Enumerate the elements of {A, B} × {C, D}2 × {E}.
S
2.156 Enumerate the elements of 3i=1 {0, 1}i .

Let Σ = {A, B, . . . , Z} denote the English alphabet. Using notation from this chapter, give an expression that denotes each of
the following sets. It may be useful to recall that Σk denotes the set of strings consisting of a sequence of k elements from Σ, so
Σ0 contains the unique string of length 0 (called the empty string, and typically denoted by ϵ—or by "" in most programming
languages).
2.157 The set of 8-letter strings.
2.158 The set of 5-letter strings that do not contain any vowels {A, E, I, O, U}.
2.159 The set of 6-letter strings that do not contain more than one vowel. (So GRITTY, QWERTY, and BRRRRR are fine; but EEEEEE, THREAT,
STRENGTHS, and A are not.)
2.160 The set of 6-letter strings that contain at most one type of vowel—multiple uses of the same vowel are fine, but no two different
vowels can appear. (So BANANA, RHYTHM, and BOOBOO are fine; ESCAPE and STRAIN are not.)
qP
n
Recall that the length of a vector x ∈ Rn is given by ∥x∥ = i=1 xi . Considering the vectors a = ⟨1, 3⟩, b = ⟨2, −2⟩,
2

c = ⟨4, 0⟩, and d = ⟨−3, −1⟩, state the values of each of the following:
2.161 ∥a∥ 2.166 2a + c − 3b
2.162 ∥b∥ 2.167 ∥a∥ + ∥c∥ and ∥a + c∥
2.163 ∥c∥ 2.168 ∥a∥ + ∥b∥ and ∥a + b∥
2.164 a+b 2.169 3∥d∥ and ∥3d∥
2.165 3d
2.170 Explain why, for an arbitrary vector x ∈ Rn and an arbitrary scalar a ∈ R, ∥ax∥ = a∥x∥.
2.171 For any two vectors x, y ∈ Rn , we have ∥x∥+∥y∥ ≥ ∥x+y∥. Under precisely what circumstances do we have ∥x∥+∥y∥ = ∥x+y∥
for x, y ∈ Rn ? Explain briefly.

2.172 For the vectors a = ⟨1, 3⟩ and b = ⟨2, −2⟩, what is a • b?


2.173 For the vectors a = ⟨1, 3⟩ and d = ⟨−3, −1⟩, what is a • d?
2.174 For the vector c = ⟨4, 0⟩, what is c • c?

2.175 Recall the definitions of Manhattan distance and Euclidean distance between two vectors x, y ∈ Rn :
P qP
Manhattan distance = ni=1 |xi − yi | Euclidean distance = n
i=1 (xi − yi ) .
2

What are the Manhattan and Euclidean distances between the vectors a = ⟨1, 3⟩ and b = ⟨2, −2⟩?

ISTUDY
56 Basic Data Types

45 45
8
rmblkans 3 44
⟨4, 42⟩
44
⟨4, 42⟩
7
opopopop 2 43
42
43
42
6
0Z0Z0Z0Z 1 41 41
5
Z0Z0Z0Z0 0 40
39
40
39
4
0Z0Z0Z0Z −1
38 38
3
Z0Z0Z0Z0 −2 37
36
37
36
2
POPOPOPO −3
35
⟨8, 33⟩
35
⟨8, 33⟩
1
SNAQJBMR −3 −2 −1 0 1 2 3 34
33
34
33
a b c d e f g h 32 32
(b) The plane. 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10
(a) A chess board. (c) Manhattan train (d) Manhattan train stations (scaled).
stations (unscaled).

Figure 2.44 A few two-dimensional grids: a chess board, the Cartesian plane, and Manhattan (twice).

2.176 What are the Manhattan and Euclidean distances between a = ⟨1, 3⟩ and d = ⟨−3, −1⟩?
2.177 What are the Manhattan and Euclidean distances between b = ⟨2, −2⟩ and c = ⟨4, 0⟩?
2.178 What’s the largest possible Euclidean distance between x and y if their Manhattan distance is 1? Justify.
2.179 What’s the smallest possible Euclidean distance between x and y if their Manhattan distance is 1? Justify.
2.180 What’s the smallest possible Euclidean distance between x and y if x, y ∈ Rn (not just n = 2) if their Manhattan distance is 1?
Explain.

2.181 Consider Figure 2.44b, and sketch the set x ∈ R2 : the Euclidean distance between x and ⟨0, 0⟩ is at most 2 .

2.182 Consider Figure 2.44b, and sketch the set x ∈ R2 : the Manhattan distance between x and ⟨0, 0⟩ is at most 2 .

In Example 2.40, we considered two train stations located at points s = ⟨8, 33⟩ and g = ⟨4, 42⟩. (See Figure 2.44c.) In that
example, we showed that, for an offset δ ∈ [0, 4], the Manhattan distance between the point ⟨4 + δ, y⟩ and s is smaller than the
Manhattan distance between the point ⟨4 + δ, y⟩ and g when y < 35.5 + δ.
2.183 Show that the point ⟨16, 40⟩ is closer to one station under Manhattan distance, and to the other under Euclidean distance.
2.184 Let δ ≥ 0. Under Manhattan distance, describe the values of y for which the point ⟨8 + δ, y⟩ is closer to s than to g.
2.185 Let δ ≥ 0. Under Manhattan distance, describe the values of y for which the point ⟨4 − δ, y⟩ is closer to s than to g.
2.186 In the real-world island of Manhattan, the east–west blocks are roughly twice the length of the north–south blocks. As such, the
more accurate picture of distances in the city is shown in Figure 2.44d. Assuming it takes 1.5 minutes to walk a north–south (up–
down) block and 3 minutes to walk an east–west (left–right) block, give a formula for the walking distance between ⟨x, y⟩ and Penn
Station, at s = ⟨8, 33⟩.
2.187 A Voronoi diagram—named after the Ukrainian mathematician Georgy Voronoy (1868–1908)—is a decomposition of the plane

R2 into regions based on a given set S of points. The region “belonging” to a point x ∈ S is y ∈ R2 : d(x, y) ≤ minz∈S d(z, y) ,
where d(·, ·) denotes Euclidean distance. (In other words, the region “belonging” to point x is that portion of the plane that’s closer
to x than any other point in S.) Compute the Voronoi diagram of the set of points {⟨0, 0⟩, ⟨4, 5⟩, ⟨3, 1⟩}. That is, compute:
• the set of points y ∈ R2 that are closer to ⟨0, 0⟩ than ⟨4, 5⟩ or ⟨3, 1⟩ under Euclidean distance;
• the set of points y ∈ R2 that are closer to ⟨4, 5⟩ than ⟨0, 0⟩ or ⟨3, 1⟩ under Euclidean distance; and
• the set of points y ∈ R2 that are closer to ⟨3, 1⟩ than ⟨0, 0⟩ or ⟨4, 5⟩ under Euclidean distance.
2.188 Compute the Voronoi diagram of the set of points {⟨2, 2⟩, ⟨8, 1⟩, ⟨5, 8⟩}.
2.189 Compute the Voronoi diagram of the set of points {⟨0, 7⟩, ⟨3, 3⟩, ⟨8, 1⟩}.
2.190 (programming required) Write a program that takes three points as input and produces a representation of the Voronoi diagram of
those three points as output.

Taking it further: Voronoi diagrams are used frequently in computational geometry, among other areas of computer
science. (For example, a coffee-shop chain might like to build a mobile app that is able to quickly answer the question
What store is closest to me right now? for any customer at any time. Voronoi diagrams can allow precomputation of these
answers.) Given any set S of n points, it’s reasonably straightforward to compute (an inefficient representation of) the
Voronoi diagram of those points by computing the line that’s equidistant between each pair of points, as you saw in the last
few exercises. But there are cleverer ways of computing Voronoi diagrams more efficiently; for more, see a good textbook
on computational geometry, like [36].

ISTUDY
Exercises 57

     
  0 8 0 5 8 7 2 7 [ ]
3 9 2 3 1
A = 9 6 0 B = 7 5 C = 3 5 6 D=
0 9 8 0 8
  2 3 3 3 2 1 2 5
6 2 0
M=
7

 5 5    
7 [ ] [ ]
4
1 0 0 0 0 0
2 8 4 1 2 9
E= F= G = 1 0 0 H = 0 1 0
1 6 7 3 2 5 4 0
1 1 0 1 1 1

Figure 2.45 An assortment of matrices.

2.191 What size is the matrix M in Figure 2.45?


2.192 For the matrix M in Figure 2.45, what is M3,1 ?
2.193 For the matrix M in Figure 2.45, list every ⟨i, j⟩ such that Mi,j = 7.
2.194 For the matrix M in Figure 2.45, what is 3M?

Considering the matrices in Figure 2.45, what are the values of the given expressions (if they’re defined)? (If the given quantity is
undefined, say so—and say why.)
2.195 A+C
2.196 B+F
2.197 D+E
2.198 A+A
2.199 −2D
2.200 0.5F
2.201 AB
2.202 AC
2.203 AF
2.204 BC
2.205 DE
2.206 ED

2.207 For G and H from Figure 2.45, what is 0.25G + 0.75H?


2.208 What is 0.5G + 0.5H?
2.209 Identify two other matrices G′ and H′ with the same average—that is, {G, H} ̸= {G′ , H′ } but 0.5G + 0.5H = 0.5G′ + 0.5H′ .
2.210 (programming required) A common computer graphics effect in the spirit of the last few exercises is morphing one image into
another—that is, slowly changing the first image into the second. There are sophisticated techniques for this task, but a simple
form can be achieved just by averaging. Given two n-by-m images represented by matrices A and B—say grayscale images, with
each entry in [0, 1]—we can produce a “weighted average” of the images as λA + (1 − λ)B, for a parameter λ ∈ [0, 1]. See
Figure 2.46. Write a program, in a programming language of your choice, that takes three inputs—an image A, an image B, and a
weight λ ∈ [0, 1]—and produces a new image λA + (1 − λ)B. (You’ll need to research an image-processing library to use in your
program.)

2.211 Let A be an m-by-n matrix. Let I be the n-by-n identity matrix. Explain why the matrix AI is identical to the matrix A.

Figure 2.46 Morphing one rotated ♣ symbol into another.

ISTUDY
58 Basic Data Types

[ ] [ ] [ ] [ ] [ ] [ ]
2 3 1 1 1 2 0 1 1 0 1 1
J= K= L= N= P= Q=
1 1 1 0 3 4 1 0 0 1 1 1

Figure 2.47 Another assortment of matrices.

If M is an n-by-n matrix, then the product of M with itself is also an n-by-n matrix. We write matrix powers in the normal way that
we defined powers of integers (or of the Cartesian product of sets): Mk = M · M · · · M, multiplied k times. (M0 is the n-by-n identity
matrix I.)
2.212 What is J3 ? (See Figure 2.47.)
2.213 What is K2 ? (See Figure 2.47.)
2.214 What is K4 ?
2.215 What is K9 ? (Hint: M2k = (Mk )2 .)

Taking it further: The Fibonacci numbers are defined recursively as the sequence f1 = 1, f2 = 1, and fn = fn−1 + fn−2
for n ≥ 3. The first several Fibonacci numbers are 1, 1, 2, 3, 5, 8, 13, . . .. As we’ll see in Exercises 5.62 and 6.99, there’s
a very fast algorithm to find the nth Fibonacci number based on computing Kn .

2.216 Let A by an n-by-n matrix. The inverse of A, denoted A−1 , is also an n-by-n matrix, with the property that AA−1 = I. There’s a
general algorithm that one can develop to invert matrices, but here you’ll calculate inverses by hand. Note that
" # " # " #
1 1 x y x+z y+w
· = .
2 1 z w 2x + z 2y + w
A B

Thus A−1 is B if the following four conditions hold: x + z = 1 and y + w = 0 and 2x + z = 0 and 2y + w = 1. Find the values of
x, y, w, and z that satisfy these four conditions.
2.217 Using the same approach as in the previous exercise, find the inverse of the matrix L from Figure 2.47.
2.218 Using the same approach, find the inverse of N from Figure 2.47.
2.219 Using the same approach, find the inverse of P from Figure 2.47.
2.220 Not all matrices have inverses—for example, Q from Figure 2.47 doesn’t have an inverse. Explain why not.

An error-correcting code (see Section 4.2) is a method for redundantly encoding information so that the information can still be
retrieved even in the face of some errors in transmission/storage. The Hamming code is a particular error-correcting code for 4-bit
chunks of information. The Hamming code can be described using matrix multiplication: given a message m ∈ {0, 1}4 , we encode
m as mG mod 2, where
 
1 0 0 0 0 1 1
 0 1 0 0 1 0 1 
 
G= .
 0 0 1 0 1 1 0 
0 0 0 1 1 1 1

(Here you should interpret the “mod 2” as describing an operation to each element of the output vector.)
For example, [1, 1, 1, 1] · G = [1, 1, 1, 1, 3, 3, 3], so we’d encode [1, 1, 1, 1] as [1, 1, 1, 1, 3, 3, 3] mod 2 = [1, 1, 1, 1, 1, 1, 1].
2.221 What is the Hamming code encoding of [0, 0, 0, 0]?
2.222 What is the Hamming code encoding of [0, 1, 1, 0]?
2.223 What is the Hamming code encoding of [1, 0, 0, 1]?

ISTUDY
2.5 Functions 59

2.5 Functions
She runs the gamut of emotions from A to B.
Dorothy Parker (1893–1967)
(speaking of a performance by Katherine Hepburn)

A function transforms an input value into an output value; that is, a function f takes an argument or
parameter x, and returns a value f(x). Functions are familiar from both algebra and from programming. In
algebra, we frequently encounter mathematical functions like f(x) = x + 6, which means that, for example,
we have f(3) = 9 and f(4) = 10. In programming, we often write or invoke functions that use an algorithm
to transform an input into an output, like a function sort—so that sort(⟨3, 1, 4, 1, 5, 9⟩) = ⟨1, 1, 3, 4, 5, 9⟩,
for example.
In this section, we’ll give definitions of functions and of some terminology related to functions, and also
discuss a few special types of functions. (Functions themselves are a special case of relations, and we will
revisit the definition of functions when we discuss them in Chapter 8.)

2.5.1 Basic Definitions


We start with the definition of a function itself:

Definition 2.46: Function.


Let A and B be sets. A function f from A to B, written f : A → B, assigns to each input value a ∈ A a
unique output value b ∈ B; the unique value b assigned to a is denoted by f(a). We sometimes say that f
maps a to f(a).

Note that A and B are allowed to be the same set; for example, a function might have inputs and outputs
that are both elements of Z.
Here are two small examples: first, a function not for Boolean inputs that maps True to False, and False
to True; and, second, a function square that returns its input multiplied by itself.

Example 2.48: Not function.


The function not : {True, False} → {True, False} can be defined with the table in Figure 2.48. Given an
input x, we find the output value not(x) by locating x in the first column of the table and reading the value
in that row’s second column. Thus not(True) = False and not(False) = True.

Example 2.49: Square function.


The function square : R → R can be defined as square(x) = x2 : for any input x ∈ R, the output is
the real number x2 . Thus, for example, square(8) = 64, because the function square assigns the output
82 = 64 to the input 8.

Note, too, that a function f : A → B might have a set A of inputs that are pairs; for example, the function that
takes two numbers and returns their average is the function average : R×R → R, where average(⟨x, y⟩) =
(x + y)/2. (We interpret R × R → R as (R × R) → R.) When there is no danger of confusion, we drop
the angle brackets and simply write, for example, average(3, 2) instead of average(⟨3, 2⟩).
As we’ve already seen in Examples 2.48 and 2.49, the rule by which a function assigns an output to a
given input can be specified either symbolically—typically via an algebraic expression—or exhaustively,

ISTUDY
60 Basic Data Types


x not(x) x double(x) 
 26 if 0 ≤ n ≤ 51


True False 0 0 
 if 52 ≤ n ≤ 103

78
False True 1 2
quantize(n) = 130 if 104 ≤ n ≤ 155
2 4 


182 if 156 ≤ n ≤ 207
3 6 



4 8 234 if 208 ≤ n ≤ 255
5 10
6 12
7 14

Figure 2.48 The functions not and double specified using a table, and quantize specified by cases.

by giving a table describing the input/output relationship. The table-based definition only makes sense
when the set of possible inputs is finite; otherwise the table would have to be infinitely large. (And it’s only
practical to define a function with a table if the set of possible inputs is pretty small!)
Here’s an example of specifying the same function in two different ways, once symbolically and once
using a table:

Example 2.50: Doubling function.


Let’s define the function double that doubles its input value, for any input in {0, 1, . . . , 7}. (That is, we
are defining a function double : {0, 1, . . . , 7} → Z.) We can write double symbolically by defining
double(x) = 2 · x. To define double using a table, we specify the output corresponding to every one of
the eight possible inputs, as shown in Figure 2.48.

The functions that we’ve discussed so far are all fairly simple, but even simple functions can have some
valuable applications. Here’s one that compresses images so that they take up less space:

Example 2.51: Reducing the colorspace of an image.


The pixels in a grayscale image are all elements of {0, 1, . . . , 255}. To reduce the space requirements
for a large image, we can consider a form of lossy compression (that is, compression that loses some
amount of data) by replacing each pixel with one chosen from a smaller list of candidate colors. That is,
instead of having 256 different shades of gray, we might have 128 or 64 or even fewer shades. Define
quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} as shown in Figure 2.48. We can apply quantize to every
pixel in a grayscale image, and then use a much smaller number of bits per pixel in storing the resulting
image. See Figure 2.49 for an example.

Taking it further: In 1976 (the date of the image in Figure 2.49), Grace Hopper was the Head of the Navy Programming
Language Section of the Office of the Chief of Naval Operations, in the U.S. Navy. She worked on the team that developed some
of the very first computers—the Harvard Mark I and the UNIVAC. (Some UNIVAC manuals are visible behind her.) She coined
the name “compiler” (see p. 105) and wrote the first of them, and was a major player in the design of influential programming
languages like COBOL that, for the first time, used English words as part of their syntax.
As for images themselves: a byte is a sequence of 8 bits. Using 8 bits, we can represent the numbers from 00000000 to 11111111—
that is, from 0 to 255. Thus a pixel with {0, 1, . . . , 255} as possible grayscale values in an image requires one byte of storage
for each pixel. If we don’t do something cleverer, a moderately sized 2048-by-1536 image (the size of many iPads) requires
over 3 megabytes even if it’s grayscale. (A color image requires three times that amount of space.) Techniques similar to the
compression function from Example 2.51 are used in a variety of CS applications—including, for example, in automatic speech
recognition, where each sample from a sound stream is stored using one of only, say, 256 different possible values instead of a
floating-point number, which requires much more space.

ISTUDY
2.5 Functions 61

(a) quantize. (b) Rear Admiral Grace Murray Hopper (c) The same image, compressed to use only
(1906–1992), in 1976, when she was a captain. five shades of gray using the quantize function.

Figure 2.49 Colorspace reduction. In PNG format, image (c) uses only ≈ 15% of the space consumed by image (b).

Domain and codomain


The domain and codomain of a function are its sets of possible inputs and outputs:

Definition 2.47: Domain/codomain.


For a function f : A → B, the set A is called the domain of the function f : A → B, and the set B is called
the codomain of the function f : A → B.

Example 2.52: Some domains and codomains.


Let’s identify the domain and codomain from the previous examples of this section. For the functions
from Examples 2.48–2.51:

• The function not (Example 2.48) has domain {True, False} and codomain {True, False}.
• The function square (Example 2.49) has domain R and codomain R.
• The function double (Example 2.50) has domain {0, 1, . . . , 7} and codomain Z.
• Both the domain and codomain of quantize (Example 2.51) are {0, 1, . . . , 255}.

Note that for three of these functions, the domain and codomain are actually the same set; for the function
double : {0, 1, . . . , 7} → Z, they’re different.

When the domain and codomain are clear from context (or they are unimportant for the purposes of a
discussion), then they may be left unwritten.
Taking it further: This possibility of implicitly representing the domain and codomain of a function is also present in code.
Some programming languages (like Java) require the programmer to explicitly write out the types of the inputs and outputs of a
function; in some (like Python), the input and output types are left implicit. In Java, for example, one would write an isPrime

ISTUDY
62 Basic Data Types

function with the explicit declaration that the input is an integer (int) and the output is a Boolean (boolean). In Python, one
would write the function without any explicit type information.

1 boolean isPrime(int n) { 1 def isPrime(n):


2 /* code to check primality of n */ 2 # code to check primality of n
3 }

But regardless of whether they’re written out or left implicit, these functions do have a domain (the set of valid inputs) and a
codomain (the set of possible outputs).

Range/image
For a function f : A → B, the set A (the domain) is the set of all possible inputs, and the set B (the codomain)
is the set of all possible outputs. But not all of the possible outputs are necessarily actually achieved: in
other words, there may be an element b ∈ B for which there’s no a ∈ A with f(a) = b. For example, we
defined square : R → R in Example 2.49, but there is no real number x such that square(x) = −1. The
range or image defines the set of actually achieved outputs:

Definition 2.48: Range/image.


The range or image of a function f : A → B is the set of all b ∈ B such that f(a) = b for some a ∈ A.
Using the notation of Section 2.3, the range of f is the set
{y ∈ B : there exists at least one x ∈ A such that f(x) = y} .

We’ll start with the functions from earlier in the section, and then look at a slightly more complex example:

Example 2.53: Some ranges.


For not (Example 2.48), double (Example 2.50), and quantize (Example 2.51), the range is easy to deter-
mine: it’s the set of values that appear in the “output” column of the table defining the function. The ranges
are {True, False} for not; {0, 2, 4, 6, 8, 10, 12, 14} for double; and {26, 78, 130, 182, 234} for quantize.
For square (Example 2.49), it’s clear that the range includes no negative numbers, because there’s no
y ∈ R such that y2 < 0. In fact, the range of square is precisely R≥0 : for any x ∈ R≥0 , there’s an input

to square that produces x as output—specifically x.

Example 2.54: The smallest divisor function.


Define a function sd : Z≥2 → Z≥2 as follows. Given an input n ∈ Z≥2 , the value of sd(n) is the smallest
integer k ≥ 2 that evenly divides n. For example:
sd(2) = 2 sd(3) = 3 sd(4) = 2 sd(121) = 11.
because 2 | 2 because 3 | 3 because 2 | 4 because 11 | 121 but
but 2 ̸ | 3 2 ̸ | 121, 3 ̸ | 121, . . ., 10 ̸ | 121

What are the domain, codomain, and range of sd?


Solution. The domain and codomain of sd are easy to determine: they are both Z≥2 . Any integer n ≥ 2
is a valid input to sd, and we defined the function sd as producing an integer k ≥ 2 as its output.
(The domain and codomain are simply written in the function’s definition, before and after the arrow
in sd : Z≥2 → Z≥2 .) The range is a bit harder to see, but it turns out to be the set P of all prime numbers.
Let’s argue that P is the range of sd by showing that (i) every prime number p ∈ P is in the range of sd,
and (ii) every number p in the range of P is a prime number.

ISTUDY
2.5 Functions 63

Claim (i). Let p ∈ Z≥2 be any prime number. Then sd(p) = p: by the definition of primality, the only
integers than evenly divide p are 1 and p itself (and 1 ≥ 2 isn’t true!). Therefore every prime number p
is in the range of sd, because there’s an input to sd such that the output is p.
Claim (ii). Let p be any number in the range of sd—that is, suppose sd(n) = p for some n. We will argue
that p must be prime. Imagine that p were instead composite—that is, there is an integer k satisfying
2 ≤ k < p that evenly divides p. But then sd(n) = p is impossible: if p evenly divides n, then k also
evenly divides n, and k < p, so k would be a smaller divisor of n. (For example, if n were evenly divisible
by the composite number 15, then n would also be evenly divisible by 3 and 5—two factors of 15—so
sd(n) ̸= 15.) Therefore every number in the range of sd is prime.
Together (i) and (ii) let us conclude that the range of sd is precisely the set of all prime numbers.

Problem-solving tip: Example 2.54 illustrates a useful general technique if we wish to show that two sets A and B are equal.
One nice way to establish that A = B is to show that A ⊆ B and B ⊆ A. That’s what we did to establish the range of sd in
Example 2.54. Let’s call R the range of sd. We showed in (i) that every element of P is in R (that is, P ⊆ R); and in (ii) that every
element of R is in P (that is, R ⊆ P). Together these facts establish that R = P.

We will also introduce a minor extension to the set-abstraction notation from Section 2.3 that’s related to
the range of a function. (We used this notation informally in Example 2.28.) Consider a function f : A → B
and a set U ⊆ A. We denote by {f(x) : x ∈ U} the set of all output values of the function f when it’s applied
to the elements x ∈ U:

Definition 2.49: Set abstraction using functions.


For a function f : A → B and a set U ⊆ A, we write {f(x) : x ∈ U} as shorthand for the set
{b ∈ B : there exists some u ∈ U for which f(u) = b}.

Remember that order and repetition of elements in a set don’t matter, which means that the set {f(x) : x ∈ A}
is precisely the range of the function f : A → B.

A visual representation of functions


The table-based and symbolic representations of functions that we’ve discussed fully represent a function,
but sometimes a more visual representation of a function is clearer. Consider a function f : A → B. We can
give a picture representing f by putting the elements of A into one column, the elements of B into a second
column, and drawing an arrow from each a ∈ A to the value of f(a) ∈ B. Notice that the definition of a
function guarantees that every element in the first column has one and only one arrow going from it to the
second column: if f : A → B is a function, then every a ∈ A is assigned a unique output f(a) ∈ B.
Figure 2.50 shows a small example. We can read the domain, codomain, and range directly from this
picture: the domain is the set of elements in the first column; the codomain is the set of elements in the
second column; and the range is the set of elements in the second column for which there is at least one
incoming arrow. In Figure 2.50, the range of f is {10, 11, 12, 13}. (There are no arrows pointing to 14 or
15, so these two numbers are in the codomain but not the range of f.)

Function composition
Suppose we have two functions f : A → B and g : B → C. Given an input a ∈ A, we can find f(a) ∈ B,
and then apply g to map f(a) to an element of C, namely g(f(a)) ∈ C. This successive application of f and
g defines a new function, called the composition of f and g, whose domain is A and whose codomain is C.

ISTUDY
64 Basic Data Types

A B
f
1 10 The function f as a table:

2 11 x f(x)
1 10
3 12
2 11
4 13 3 12
4 11
5 14 5 13
15

Figure 2.50 A picture of a function f : A → B, where A = {1, . . . , 5} and B = {10, . . . , 15}.

Definition 2.50: Function composition.


For two functions f : A → B and g : B → C, the function g ◦ f : A → C maps an element a ∈ A to
g(f(a)) ∈ C. The function g ◦ f is called the composition of f and g.

Notice a slight oddity of the notation: g◦f applies the function f first and the function g second, even though
g is written first.

Example 2.55: Function composition, four ways.


Let f : R → R and g : R → R be defined by f(x) = 2x + 1 and g(x) = x2 . Then:

• The function g ◦ f, given an input x, produces output


g(f(x)) = g(2x + 1) = (2x + 1)2 = 4x2 + 4x + 1.
• The function f ◦ g maps x to f(g(x)) = f(x2 ) = 2x2 + 1.
• The function g ◦ g maps x to g(g(x)) = g(x2 ) = (x2 )2 = x4 .
• The function f ◦ f maps x to f(f(x)) = f(2x + 1) = 2(2x + 1) + 1 = 4x + 3.

As with many function-related concepts, the visual representation of functions gives a nice way of think-
ing about function composition: the function g ◦ f corresponds to the “short-circuiting” of the pictures of
the functions f and g. A small example of this visualization is shown in Figure 2.51. The composition g ◦ f
is given by following two arrows in the diagram: one arrow defined by f, and then one arrow defined by g.

(a) A B B C (b) A B C (c) A C


f g f g g◦f
1 10 10 20 1 10 20 1 20

2 11 11 21 2 11 21 2 21

3 12 12 23 3 12 23 3 23

4 13 13 24 4 13 24 4 24

5 14 14 5 14 5

15 15 15 For example, (g ◦ f)(1) = g(f(1)), which is g(11)


because f(1) = 11. And g(11) = 24 because of g’s
arrow from 11 to 24.

Figure 2.51 Two functions f : A → B and g : B → C, (a) separately and (b) pasted together. Their composition g ◦ f is
shown in (c), based on successively following two arrows from (b).

ISTUDY
2.5 Functions 65

2.5.2 Onto and One-to-One Functions


We now turn to two special categories of functions—onto and one-to-one functions—distinguished by how
many different input values (always at least one? never more than one?) are mapped to each output value.

Onto functions
A function f : A → B is onto if every possible output in B is, in fact, an actual output:

Definition 2.51: Onto functions.


A function f : A → B is called onto if, for every b ∈ B, there exists at least one a ∈ A for which f(a) = b.
An onto function is also sometimes called a surjective function.

Alternatively, using the terminology of Section 2.5.1, a function f is onto if f’s codomain equals f’s range.
It may be easier to think about onto functions using the visual representation of functions: a function f is
onto if there’s at least one arrow pointing at every element in the second column. (See Figure 2.52.)
As an example, here are two of our previous functions, one of which is onto and one of which isn’t:

Example 2.56: An onto function and a non-onto function.


The function not : {True, False} → {True, False} is onto: there’s an input value that produces True
(namely False), and there’s an input value that produces False (namely True). Every element of the
codomain is “hit” by not, so the function is onto.
On the other hand, the function quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} from Example 2.51 is
not onto. Recall that the only output values achieved were {26, 78, 130, 182, 234}. For example, then,
there is no value of x for which quantize(x) = 42. Thus 42 is not in the range of quantize, and therefore
this function is not onto.
Here is a collection of a few more examples, where we’ll try to construct onto and non-onto functions
meeting a certain description:

Example 2.57: Sample onto/non-onto functions.


Let A = {0, 1, 2} and B = {3, 4}. Give an example of a function that satisfies the following descriptions
(or, if there’s no such function, explain why it’s impossible):
1 an onto function f : A → B;
2 a function g : A → B that is not onto;
3 an onto function h : B → A.

Solution. The first two are possible, but the third is not:
1 Define f(0) = 3, f(1) = 4, and f(2) = 4.
2 Define g(0) = 3, g(1) = 3, and g(2) = 3.
3 Impossible! A function h whose domain is {3, 4} only has two output values, namely h(3) and h(4).
For a function whose codomain is {0, 1, 2} to be onto, we need three different output values to be
achieved. These two conditions cannot be simultaneously satisfied, so there is no onto function from
B to A.
Figure 2.52 illustrates the two functions from Example 2.57; the fact that f is onto and g is not onto is
immediately visible.

ISTUDY
66 Basic Data Types

A B A B
f g
A B A B 0 3 0 4
f g
0 3 0 3 1 4 1 5

1 4 1 4 2 5 2 6

2 2 6 3

Figure 2.52 An onto function f : {0, 1, 2} → {3, 4} and Figure 2.53 A one-to-one function f and a
a function g : {0, 1, 2} → {3, 4} that is not onto. (For an non-one-to-one function g. (For a one-to-one function
onto function from A to B, every element of B has at from A to B, no element of B has more than one
least one incoming arrow.) incoming arrow.)

One-to-one functions
An onto function f : A → B guarantees that every element b ∈ B is “hit at least once” by f—that is, that
b = f(a) for at least one a ∈ A. A one-to-one function f : A → B guarantees that every element b ∈ B is
“hit at most once” by f:

Definition 2.52: One-to-one functions.


A function f : A → B is called one-to-one if, for any b ∈ B, there is at most one a ∈ A such that f(a) = b.
(Alternatively, we could say that f : A → B is one-to-one if, for any a1 ∈ A and a2 ∈ A with a1 ̸= a2 , we
̸ f(a2 ).) A one-to-one function is also sometimes called an injective function.
have that f(a1 ) =

(Terminologically, a one-to-one function sits in contrast to a many-to-one function, in which many different
input values map to the same output value. Thinking about what a many-to-one function would mean may
help to make the name “one-to-one” more intuitive.)
Taking it further: One of the many places that functions are used in computer science is in designing the data structure known
as a hash table, discussed on p. 72. The idea is that we will store a piece of data called x in a location h(x), for some function h
called a hash function. We want to choose h to ensure that this function is “not-too-many-to-one” so that no location has to store
too much information.

Let’s look at two of our previous functions, double and quantize, as an example:

Example 2.58: A one-to-one function and a non-one-to-one function.


The function double : {0, 1, . . . , 7} → Z, defined in Example 2.50, is one-to-one. The table of outputs
for the function (see Figure 2.48) shows that no number appears more than once in the second column.
Because every element of the codomain is “hit” by double at most once, the function is one-to-one.
(Note double is not onto, because there are elements of the codomain that are “hit” zero times—but it is
one-to-one, because no element of the codomain is hit twice.)
On the other hand, the function quantize : {0, 1, . . . , 255} → {0, 1, . . . , 255} from Example 2.51 is
not one-to-one. Recall that quantize(42) = 26 and quantize(17) = 26. Thus 26 is the output for two or
more distinct inputs, and therefore this function is not one-to-one.

As with the definition of onto, it may be easier to think about one-to-one functions using our visual two-
column representation: a function f is one-to-one if there’s at most one arrow pointing at every element in
the second column. Figure 2.53 shows two small examples using this visual perspective: the function f in
is one-to-one because no element of B has multiple incoming arrows, but the function g is not one-to-one,
because 4 ∈ B has two incoming arrows.

ISTUDY
2.5 Functions 67

One-to-one and onto functions


Let f : A → B be a function. We can restate the definitions of f being onto or one-to-one as follows:

• f is onto if, for every b ∈ B, we have |{a ∈ A : f(a) = b}| ≥ 1.


• f is one-to-one if, for every b ∈ B, we have |{a ∈ A : f(a) = b}| ≤ 1.

Therefore a function f : A → B that is both one-to-one and onto guarantees that {a ∈ A : f(a) = b} has
cardinality equal to 1—that is, for any b ∈ B, there is exactly one element a ∈ A so that f(a) = b. (There
is at most one such a because f is one-to-one, and at least one such a because f is onto.) A function with
both of these properties is called a bijection:

Definition 2.53: Bijection.


A function f : A → B is called a bijection if f is one-to-one and onto. In other words, f is a bijection if
|{a ∈ A : f(a) = b}| = 1 for every b ∈ B.

Here are two examples of bijections:

Example 2.59: Two bijections.


The function not : {True, False} → {True, False} from Example 2.48 is a bijection. There’s exactly
one input value whose output is True, namely False; and there’s exactly one input value whose output is
False, namely True.
Similarly, the function f : R → R defined by f(x) = x − 1 is also a bijection. For every b ∈ R, there is
exactly one a such that f(a) = b: specifically, the value a = b + 1.

If f : A → B is a bijection, then every input in A is assigned by f to a unique value in B. This fact means
that we can define a new function, denoted f−1 , that reverses this assignment—given b ∈ B, the function
f−1 (b) identifies the a ∈ A to which b was assigned by f. This function f−1 called the inverse of f:

Definition 2.54: Function inverses.


Let f be a bijection. Then f−1 : B → A is a function called the inverse of f, where f−1 (b) = a whenever
f(a) = b.

Here is an example of finding inverses of a few functions:

Example 2.60: Three inverses.


What is the inverse of each of the following functions?
1 f : R → R, where f(x) = 2x .
2 square : R≥0 → R≥0 , where square(x) = x2 .
3 not : {True, False} → {True, False}.

Solution. We can find the function f−1 , the inverse of f, by solving the equation y = 2x for x. We see that
2y = x. Thus the function f−1 : R → R is given by f−1 (y) = 2y. For any real number x ∈ R, we have
that f(x) = 2x and f−1 ( 2x ) = x. (For example, f(3) = 1.5 and f−1 (1.5) = 3.)

ISTUDY
68 Basic Data Types


The inverse of square is the function square−1 (y) = y. (Note that square : R≥0 → R≥0 is a
bijection—otherwise this problem wouldn’t be solvable!—because the domain and the codomain are
both the equal to the set of nonnegative real numbers. For example, 32 = 9 and (−3)2 = 9; if we had
allowed both negative and positive inputs, then square would not have been one-to-one. And there’s no
x ∈ R such that x2 = −9; if we had allowed negative outputs, then square would not have been onto.)
Finally, for not, note that not(not(True)) = not(False) = True and not(not(False)) = not(True) =
False. Thus the inverse of the function not is the function not itself!

If f : A → B is a bijection, then, for any a ∈ A, observe that applying f−1 to f(a) gives a back as output: that
is, f−1 (f(a)) = a. In other words, f−1 ◦ f is the identity function, defined by id : A → A where id(a) = a.
In our visualization, a bijection f : A → B has exactly one arrow coming into every element in B, and by
definition it also has exactly one arrow leaving every element in A. The inverse of f is precisely the function
that results from reversing the direction of each arrow. (The fact that every right-hand column element has
exactly one incoming arrow under f is precisely what guarantees that reversing the direction of each arrow
still results in the arrow diagram of a function.)
Figure 2.54 shows an example of a bijection and its inverse illustrated in this manner. This picture-based
approach should help to illustrate why a function that is not onto or that is not one-to-one fails to have an
inverse. If f : A → B is not onto, then there exists some element b∗ ∈ B that’s never the value of f, so f−1 (b∗ )
would be undefined. On the other hand, if f is not one-to-one, then there exists b† such that f(a) = b† and
f(a′ ) = b† for a ̸= a′ ; thus f−1 (b† ) would have to be both a and a′ , which is forbidden by the definition of
a function.

2.5.3 Polynomials
We’ll turn now to polynomials, a special type of function whose input and output are both real numbers,
and where f(x) is the sum of powers of x:

Definition 2.55: Polynomial.


A polynomial is a function f : R → R of the form f(x) = a0 + a1 x + a2 x2 + · · · + ak xk where each ai ∈ R
Pk
and ak ̸= 0, for some k ∈ Z≥0 . (More compactly, we can write this function as f(x) = i=0 ai xi .)
The real numbers a0 , a1 , . . . , ak are called the coefficients of the polynomial, and the values
a0 , a1 x, a2 x2 , . . . , ak xk being added together are called the terms of the polynomial.

A few examples of polynomials: f(x) = 7x, g(x) = x202 − 201x111 , and h(x) = x2 − 2. The function h is
graphed in Figure 2.55—in other words, for every x ∈ R, the point ⟨x, h(x)⟩ is drawn.

A B A B
f f− 1
0 4 0 4

1 5 1 5

2 6 2 6

3 7 3 7

Figure 2.54 A bijection f : {0, 1, 2, 3} → {4, 5, 6, 7} and its inverse f−1 : {4, 5, 6, 7} → {0, 1, 2, 3}.

ISTUDY
2.5 Functions 69

degree of h = 2 (the highest power with a nonzero coefficient)


6
h(x) = x2 −2
4

2 n √ √ o
roots of h = − 2, 2 (the points where h(x) = 0)

−2 2
−2

Figure 2.55 A graph of the polynomial h(x) = x2 − 2.

Annoyingly, the word “graph” among computer scientists is completely ambiguous. It can mean a graph in the sense of a plot
of a mathematical function (as in Figure 2.55). But it can also mean a network (like a social network, or a disease propagation
network, or an ethernet network) of interacting entities. We’ll spend a huge amount of time on the latter kind of graph—they are
the subject of all of Chapter 11.

There are two additional definitions related to polynomials that will be useful. The first is the degree of the
polynomial p(x), which is the highest power of x in p’s terms:

Definition 2.56: Degree.


Pk
The degree of a polynomial f(x) = i=0 ai xi is the largest index i such that ai ̸= 0—that is, the highest
power of x with a nonzero coefficient.

Here are a few examples:

Example 2.61: Some degrees.


For the polynomials f(x) = x + x3 and g(x) = x9 , the degree of f is 3 and the degree of g is 9. For the
polynomial p(x) with a0 = 1, a1 = 3, and a2 = 0, the degree of p is 1, because p(x) = 1 + 3x + 0x2 =
1 + 3x.

Some more examples of polynomials with small degrees are shown in Figure 2.56.
The second useful notion about a polynomial p(x) is a root, which is a value of x where the graph of p
crosses the x axis:

Definition 2.57: Roots.


The roots of a polynomial p(x) are the values in the set {x ∈ R : p(x) = 0}.

Here are a couple of examples:

(a) Degree 0. (b) Degree 1. (c) Degree 2. (d) Degree 3. (e) Degree 4.

Figure 2.56 Graphs of some polynomials of degree 0, 1, 2, 3, and 4.

ISTUDY
70 Basic Data Types

Example 2.62: Some roots.


The roots of the polynomial f(x) = x + x2 are 0 and −1. For g(x) = x9 , the only root is 0.

A useful general theorem relates the number of different roots for a polynomial to its degree: a polynomial
p with degree k has at most k different values of x for which p(x) = 0 (unless p is always equal to 0):

Theorem 2.58: (Nonzero) polynomials of degree k have at most k roots.


Let p(x) be a polynomial of degree at most k. Then p has at most k roots unless p(x) is zero for every
value x ∈ R.

When p(x) is zero for every value x ∈ R, we sometimes write p(x) ≡ 0 and say that p is identically zero.
We won’t give a formal proof of Theorem 2.58, but here’s one way to convince yourself of the basic
idea. Think about how many times a polynomial of degree k can “change direction” from increasing to
decreasing or from decreasing to increasing. Observe that a polynomial p must change directions between
any two roots. (Draw a picture!) A polynomial of degree 0 never changes direction, so it’s either always
zero or never zero. A polynomial p(x) of degree d ≥ 1 can change directions only at a point where its slope
is precisely equal to zero—that is, a point x where the derivative p′ of p satisfies p′ (x) = 0. Using calculus,
we can show that the derivative of a polynomial of degree d ≥ 1 is a polynomial of degree d − 1. The idea
of a proof by mathematical induction is to combine the above intuition to prove the theorem.
Taking it further: Here’s some more detailed intuition of how to prove Theorem 2.58 using a proof by mathematical induction;
see Chapter 5 for much more detail on this form of proof. (This argument relies on calculus.)
Think first about a degree-0 polynomial—that is, a constant function p(x) = a. The theorem is clear for this case: either a = 0
(in which case p(x) ≡ 0); or a ̸= 0, in which case p(x) ̸= 0 for any x. (See Figure 2.56a.)
Now think about a degree-1 polynomial—that is, p(x) = ax + b for a = ̸ 0. The derivative of p is a constant function—namely
p′ (x) = a ̸= 0. Imagine what it would mean for p to have two roots: as we move from smaller x to larger x, at some point r we
cross the x-axis, say from p(r−ϵ) < 0 to p(r+ϵ) > 0. (See Figure 2.56b.) In order to find another root larger than r, the function
p would have to change from increasing to decreasing—in other words, there would have to be a point at which p′ (x) = 0. But
we just argued that a degree-0 polynomial like p′ (x) that is not identically zero is never zero. So we can’t find another root.
Now think about a degree-2 polynomial—that is, p(x) = ax2 + bx + c for a ̸= 0. After a root, p will have to change direction to
head back toward the x-axis. That is, between any two roots of p, there must be a point where the derivative of p is zero: that is,
there is a root of the degree-1 polynomial p′ (x) = 2ax + b between any two roots of p. But p′ has at most one root, as we just
argued, so p has at most two roots.
And so forth! We can apply the same argument for degree 3, then degree 4, and so on, up to any degree k.

2.5.4 Algorithms
While functions are a valuable mathematical abstraction, computer scientists are fundamentally interested
in computing things. So, in addition to the type of functions that we’ve discussed so far in this section, we
will also often talk about mapping an input x to a corresponding output f(x) in the way that a computer
program would, by computing the value of f(x) using an algorithm:

Definition 2.59: Algorithm.


An algorithm is step-by-step procedure to transform an input into an output.

In other words, an algorithm is a function—but specified as a sequence of simple operations, of the type
that could be written as a program in your favorite programming language; in fact, these step-by-step

ISTUDY
2.5 Functions 71

procedures are even called functions in many programming languages. (It’s probably worth noting that it’s
unusual for a book like this one to introduce algorithms in the context of functions. But, because the point
of an algorithm really is to transform inputs into outputs, it can be helpful to think of an algorithm as a
description of a function f that specifies how to calculate the output f(x) from a given input x, instead of
simply describing what the value f(x) is.)
We will write algorithms in pseudocode, rather than in any particular programming language. In other
words, we will specify the steps of the algorithm in a style that is neither Python nor Java nor English, but
something in between; it’s written in a style that “looks” like a program, but is designed to communicate the
steps to a human reader, rather than to a computer executing the code. We will aim to write pseudocode that
can be interpreted straightforwardly by a reader who has used any modern programming language; we will
always try to avoid getting bogged down in detailed syntax, and instead emphasize trying to communicate
algorithms clearly. Translating the pseudocode for an algorithm into any programming language should be
straightforward (though of course would take energy and concentration, and a willingness to correct the
inevitable bugs in the code, to actually carry out successfully).
We will make use of the standard elements of any programming language in our pseudocode: condition-
als (“if”), loops (“for” and “while”), function definitions and function calls (including recursive function
calls), and functions returning values. We will use the symbol “:=” to denote assignment and the symbol
“=” to denote equality testing, so that x := 3 sets the value of x to be 3, and x = 3 is True (if x is 3) or False
(if x is not 3). We assume a basic familiarity with these programming constructs throughout the book.

Our notation of := for assignment and = for equality testing is borrowed from the programming language Pascal. In a lot of
other programming languages, like C and Java and Python, assignment is expressed using = and equality testing is expressed
using ==.

We will spend significant energy later in the book on proving algorithms correct (Chapters 4 and 5)—
that is, showing that an algorithm computes the correct output for any given input—and on analyzing the
efficiency of algorithms (Chapter 6). But Figure 2.57 shows one small example to get us started. This
algorithm finds the index of the maximum element of a list. (More properly, this algorithm finds the index
of the first maximum element.)

findMaxIndex(L):
Input: A list L with n ≥ 1 elements L[1], . . . , L[n].
Output: An index i such that L[i] is the maximum value in L.

1 maxIndex := 1
2 for i := 2 to n:
3 if L[i] > L[maxIndex] then
4 maxIndex := i
5 return maxIndex

Figure 2.57 An algorithm to find the index of the maximum element of a list.

ISTUDY
72 Basic Data Types

Computer Science Connections

Hash Tables and Hash Functions


Consider the following scenario: we have a set S of elements that we must store, each of which is chosen from a
universe U of all possible elements. We need to be able to answer the question “is x in S?” quickly. (We might also
have data associated with each x ∈ S, and seek to find the associated data rather than just determining membership.)
Furthermore, the set S might change over time, either by insertion of a new element or deletion of an existing element.
How might we efficiently organize the data to support these operations?
A hash table, one of the most frequently used data 8
structures in computer science, is designed to store a
20 2 4
set like S, as follows: 1 2 3 4 5 6 7 8 9 10
• We define a table T[1 . . . n]. (a) A hash table filled with 4, 2, 8, and 20, using the hash
• We choose a hash function h : U → {1, . . . , n}. function h(x) = (x2 mod 10) + 1. These elements go into
• Each element x ∈ S is stored in the cell T[h(x)]. cells h(4) = (16 mod 10) + 1 = 7 and h(2) = 5 and
h(20) = 1 and h(8) = 5.
To make this idea practical, we’ll make sure that n is
a lot smaller than |U|. That means that h cannot be a 99 98 96 97
one-to-one function, and thus we will need to deal with 91 92 94 93
the possibility of collisions, when we try to store two
89 88 86 87
different elements in the same cell. There are several
different choices about how to handle collisions (and 81 82 84 83
we’ll explore several of the other strategies in Chap- 79 78 76 77
ter 10), but, for simplicity, let’s assume that we store 71 72 74 73
them all in that cell, in a list. (This strategy for han-
69 68 66 67
dling collisions is called chaining.) For example, see
the hash function and hash table in Figure 2.58. 61 62 64 63
To insert a value x into the table, we merely need 59 58 56 57
to compute h(x) and place the value into the list in the
51 52 54 53
cell T[h(x)]. Answering the question “is x stored in the
table?” is similar; we compute h(x) and look through 90 49 48 95 46 47
whatever entries are stored in that list. As a result, the 80 41 42 85 44 43
performance of this data structure is almost entirely 70 39 38 75 36 37
dependent on how many collisions are generated—that
60 31 32 65 34 33
is, how long the lists are in the cells of the table.
A “good” hash function h : U → {1, . . . , n} is one 50 29 28 55 26 27
that distributes the possible values of U as evenly as 40 21 22 45 24 23
possible across the n different cells. The more evenly 30 19 18 35 16 17
the function spreads out U across the table, the smaller
20 11 12 25 14 13
the typical length of the list in a cell, and therefore the
more efficiently the program would run. (Figure 2.58b 10 9 8 15 6 7
says that its hash function is not a very good one— 0 1 2 5 4 3
the number of elements per cell is extremely variable, 1 2 3 4 5 6 7 8 9 10
which means that this hash function does a poor job of
(b) The table filled with {0, 1, . . . , 99}. (Note that four cells
spreading out its inputs across the table.) contain no elements, two cells contain 10 elements each,
Programming languages like Python and Java have and four cells contain 20 elements each.)
built-in implementations of hash tables, and they use
Figure 2.58 A hash table, nearly empty and filled.
some mildly complex iterative arithmetic operations in
their hash functions. But designing a good hash function for whatever kind of data you end up storing can be the
difference between a slow implementation and a blazingly fast one. Incidentally, there are at least two other concerns
with efficiency: first, the hash function must be able to be computed quickly; otherwise the overhead of figuring out
which cell to check is too much. There’s also some cleverness in choosing the size of the table and in deciding when
to rehash everything in the table into a bigger table if the lists get too long (on average).

ISTUDY
Exercises 73

EXERCISES
2.224 Consider the function f : {0, 1, . . . 7} → {0, 1, . . . 7} defined by f(x) = (x2 + 3) mod 8. What is f(3)?
2.225 For the same function, what is f(7)?
2.226 For the same function, identify the values of x for which f(x) = 3.
2.227 Redefine f using a table.

2.228 Example 2.51 introduced a function compressing a grayscale image to use only five different shades of gray. (See Figure 2.59 for
a reminder.) Using basic arithmetic notation (including ⌊ ⌋ and/or ⌈ ⌉ if appropriate), redefine quantize without using cases.
2.229 Let’s generalize the quantization idea to a two-argument function, so that quantize(n, k) takes both an input color
n ∈ {0, 1, . . . , 255} and a number k of “quanta.” (We insist that 1 ≤ k ≤ 256.) In other words, k is the number of different
(approximately) equally spaced output values, and the input color n is translated to the closest of these k values. What are the
domain and range of quantize(n, k)?
2.230 Repeat Exercise 2.228 for quantize(n, k). You should ensure that quantize(n, 5) yields the function from Exercise 2.228. (Hint:
first determine how big a range of colors should be mapped to a particular quantum, rounding the size up. Then figure out which
quantum the given input n corresponds to.)
2.231 A function f : A → B is said to be c-to-1 if, for every output value b ∈ B, there are exactly c different values a ∈ A such that
f(a) = b. (These functions are useful in counting; see the Division Rule in Theorem 9.11.) For what values of k is it possible to
define a c-to-1 (for some integer c) quantizing function that transforms {0, 1, . . . , 255} into a set of k quanta?
2.232 (programming required) Implement quantization in a programming language of your choice. Specifically, implement
quantize(n, k), and apply it to every pixel of an input image. (You’ll need to research an image-processing library to use in your
program.)

Many of the pieces of basic numerical notation that we’ve introduced can be thought of as functions. For each of the following,
state the domain and range of the given function:
2.233 f(x) = |x| 2.238 f(x) = 2 mod x
2.234 f(x) = ⌊x⌋ 2.239 f(x, y) = x mod y
2.235 f(x) = 2x 2.240 f(x) = 2 | x
2.236 f(x) = log2 x 2.241 f(x) = ∥x∥
2.237 f(x) = x mod 2 2.242 f(θ) = ⟨cos θ, sin θ⟩
2.243 Let T = {1, . . . , 12} × {0, 1, . . . , 59} denote the set of numbers that can be displayed on a digital clock in 12-hour mode. Define
a function add : T × Z≥0 → T so that add(t, x) denotes the time that’s x minutes later than t. Do so using only standard symbols
from arithmetic.

Define the functions f(x) = x mod 10, g(x) = x + 3, and h(x) = 2x. Rewrite the following functions using a single algebraic
expression. (For example, the function g ◦ g is given by the definition (g ◦ g)(x) = g(g(x)) = x + 6.)
2.244 f◦f 2.248 h ◦ g
2.245 h◦h 2.249 f ◦ h
2.246 f◦g 2.250 f ◦ g ◦ h
2.247 g◦h
2.251 Let f(x) = 3x + 1 and let g(x) = 2x. Identify a function h such that g ◦ h and f are identical.
2.252 For the same functions f and g, identify a function h such that h ◦ g and f are identical.

2.253 Define a function f : {0, 1, 2, 3} → {0, 1, 2, 3} as f(x) = x. Is f onto?


2.254 Now define f : {0, 1, 2, 3} → {0, 1, 2, 3} as f(x) = x2 mod 4. Is f onto?
2.255 What about for f : {0, 1, 2, 3} → {0, 1, 2, 3} defined as f(x) = (x2 − x) mod 4?



26 if 0 ≤ n ≤ 51



 if 52 ≤ n ≤ 103
Note: the ranges of colors associated with each of

78 these five “quanta” are only approximately equal
quantize(n) = 130 if 104 ≤ n ≤ 155 because of issues of integrality. Here, the first four

 quanta correspond to 52 different colors; the last

 if 156 ≤ n ≤ 207

 182 quantum corresponds to only 256 − 52 · 4 = 48

 different colors.
234 if 208 ≤ n ≤ 255

Figure 2.59 The function from Example 2.51.

ISTUDY
74 Basic Data Types

2.256 Define f : {0, 1, 2, 3} → {0, 1, 2, 3} by f(0) = 3, f(1) = 2, f(2) = 1, and f(3) = 0. Is f onto?
2.257 What about for f : {0, 1, 2, 3} → {0, 1, 2, 3} defined as f(0) = 1, f(1) = 2, f(2) = 1, and f(3) = 2—is f onto?

2.258 Thinking of f(x) = x2 mod 8 as a function f : {0, 1, 2, 3} → {0, 1, . . . , 7}, is f one-to-one?


2.259 Thinking of f(x) = x3 mod 8 as a function f : {0, 1, 2, 3} → {0, 1, . . . , 7}, is f one-to-one?
2.260 Thinking of f(x) = (x3 − x) mod 8 as a function f : {0, 1, 2, 3} → {0, 1, . . . , 7}, is f one-to-one?
2.261 Thinking of f(x) = (x3 + 2x) mod 8 as a function f : {0, 1, 2, 3} → {0, 1, . . . , 7}, is f one-to-one?
2.262 Define f(0) = 3, f(1) = 1, f(2) = 4, and f(3) = 1. For this function f : {0, 1, 2, 3} → {0, 1, . . . , 7}, is f one-to-one?

A heap is a data structure that is used to represent a collection of items, each of which has an associated priority. (See p. 245.)
A heap can be represented as a complete binary tree—a binary tree with no “holes” as you read in left-to-right, top-to-bottom
order—but a heap can also be stored more efficiently as an array, in which the elements are stored in that same left-to-right and
top-to-bottom order. See Figure 2.60.
2.263 For a heap stored as an array A[1 . . . n], state the domain and range of the function parent. Is parent one-to-one?
2.264 State the domain and range of left and right for the heap as stored in A[1 . . . n]. Are left and right one-to-one?
2.265 Give both a mathematical description and an English-language description of the meaning of the function parent ◦ left. Assume
that A is infinite (that is, don’t worry about encountering an i such that left(i) or right(i) is undefined).
2.266 Repeat for parent ◦ right. (Continue to think of A as infinite.)
2.267 Repeat for left ◦ parent. (Continue to think of A as infinite.)
2.268 Repeat for right ◦ parent. (Continue to think of A as infinite.)

2.269 Consider f : R → R, where f(x) = 3x + 1. What is the inverse of f?


2.270 Consider g : R≥0 → R≥0 , where g(x) = x3 . What is the inverse of g?
2.271 Consider h : R≥0 → R≥1 , where h(x) = 3x . What is the inverse of h?
2.272 Why doesn’t the function f : {0, . . . , 23} → {0, . . . , 11} where f(n) = n mod 12 have an inverse?

2.273 Define p(x) = 3x3 + 2x2 + x + 0. What is the degree of p?


2.274 Define p(x) = 9x3 . What is the degree of p?
2.275 Define p(x) = 4x4 − (2x2 − x)2 . What is the degree of p?
2.276 If p and q are both polynomials with degree 7, what is the smallest and largest possible degree of f(x) = p(x) + q(x)?
2.277 If p and q are both polynomials with degree 7, what is the smallest and largest possible degree of f(x) = p(x) · q(x)?
2.278 If p and q are both polynomials with degree 7, what is the smallest and largest possible degree of f(x) = p(q(x))?

2.279 Give an example of a polynomial p of degree 2 such that p has exactly 0 roots.
2.280 Give an example of a polynomial p of degree 2 such that p has exactly 1 root.
2.281 Give an example of a polynomial p of degree 2 such that p has exactly 2 roots.

2.282 The median of a list L of n numbers is the number in the “middle” of L in sorted order. Describe an algorithm to find the median
of a list L. (Don’t worry about efficiency.) You may find it useful to make use of the algorithm in Figure 2.57.

8 7 To store a heap in an array, we define three functions that allow us to compute the
index of the parent of a node; the index of the left child of a node; and the index
3 5 6 of the right child of a node. Here are the functions: for an index i into the array,

parent(i) = ⌊ 2i ⌋ left(i) = 2i right(i) = 2i + 1.


9 8 7 3 5 6
1 2 3 4 5 6
For example, the parent of the node labeled 8 is labeled 9, the left child of the node labeled 8 is labeled 3, and the right child is labeled 5. And the
node labeled 8 has index 2 in the array, and parent(2) = 1 (the index of the node labeled 9); left(2) = 4 (the index of the node labeled 3); and
right(2) = 5 (the index of the node labeled 5).

Figure 2.60 A maximum heap, as a tree and as an array.

ISTUDY
2.6 Chapter at a Glance 75

2.6 Chapter at a Glance

Booleans, Numbers, and Arithmetic


A Boolean value is True or False. The integers Z are {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}. The real numbers R
are the integers and all numbers in between. The closed interval [a, b] consists of all real numbers x where
a ≤ x ≤ b; the open interval (a, b) excludes a and b. The rational numbers Q are those numbers that can
be represented as a/b for integers a and b ̸= 0. Here is some useful notation involving numbers:

• exponentiation: bk is b · b · · · · · b, where b is multiplied k times;


• logarithms: logb x is the number y such that by = x;
• absolute value: |x| is x for x ≥ 0, and |x| = −x for x < 0;
• floor and ceiling: ⌊x⌋ is the largest integer n ≤ x; ⌈x⌉ is the smallest integer n ≥ x;
• modulus: n mod k is the remainder when n is divided by k.

If n mod d = 0, then d is a factor of n or evenly divides n, written d | n. If 2 | n for a positive integer n, then
n is even (“has even parity”); otherwise n is odd. An integer n ≥ 2 is prime if it has no positive integer
factors other than 1 and n; otherwise n is composite. (Note that 0 and 1 are neither prime nor composite.)
Pn
For a collection of numbers x1 , x2 , . . . , xn , their sum x1 + x2 + · · · + xn is written formally as i=1 xi ,
Qn
and their product x1 · x2 · · · · · xn is written i=1 xi .

Sets: Unordered Collections


A set is an unordered collection of objects called elements. A set can be specified by listing its elements
inside braces, as {x1 , x2 , . . . , xn }. A set can also be denoted by {x : P(x)}, which contains all objects x such
that P(x) is true. The set of possible values x that are considered is the universe U, which is sometimes left
implicit.
Standard sets include the empty set {} (also written ∅), which contains no elements; the integers Z; the
real numbers R; and the Booleans {True, False}. We write Z≥0 = {0, 1, 2, . . .} and Z<0 = {−1, −2, . . .},
etc. For a set A and an object x, the expression x ∈ A (“x is in A”) is true whenever x is in the set A.
(So y ∈ {x : P(x)} whenever P(y) = True, and y ∈ {x1 , x2 . . . , xn } whenever xi = y for some i.) The
cardinality of a set A, written |A|, is the number of distinct elements in A.
Given two sets A and B, the union of A and B is A ∪ B = {x : x ∈ A or x ∈ B}. The intersection of A and
B is A ∩ B = {x : x ∈ A and x ∈ B}. The set difference of A and B is A − B = {x : x ∈ A and x ∈ / B}. The
complement of a set A is ∼A = U − A = {x : x ∈ U and x ̸∈ A}, where U is the universe.
A subset of a set B is a set A such that every element of A is also an element of B; this relationship is
denoted by A ⊆ B. If A is a subset of B, then B is a superset of A, written B ⊇ A. A proper subset of B is a
set A that is a subset of B but A ̸= B, written A ⊂ B. Such a set B is a proper superset of A, written B ⊃ A.
Two sets A and B are disjoint if A ∩ B = ∅. A partition of a set S is a collection of sets A1 , A2 , . . . , Ak ,
where A1 ∪ A2 ∪ · · · ∪ Ak = S and, for any distinct i and j, the sets Ai and Aj are disjoint.
The power set of a set A, written P(A), is the set of all subsets of A.

Sequences, Vectors, and Matrices: Ordered Collections


A sequence (or tuple, (ordered) pair, triple, quadruple, …, n-tuple, …) is an ordered collection of objects
called components or entries, written inside angle brackets. The set A × B = {⟨a, b⟩ : a ∈ A and b ∈ B}

ISTUDY
76 Basic Data Types

is the Cartesian product of sets A and B; the set A × B contains all pairs where the first component comes
from A and the second from B. For a set S and a number n ≥ 0, the set Sn denotes the n-fold Cartesian
product of S with itself: Sn = S × S × · · · × S, where S occurs n times in this product.
A vector (or n-vector) is an element of Rn , for some positive integer n ≥ 2. (An element of R1 = R is
n
called a scalar.) A bit vector is an element of {0, 1} . Vectors are sometimes written in square brackets:
x = [x1 , x2 , . . . , xn ]. For a vector x, write xi to denote the ith component of x. (But xi is meaningless unless
i ∈ {1, 2, . . . , n}.) The size or dimensionality of x ∈ Rn is n.
For a vector x ∈ Rn and a real number a ∈ R, the scalar product ax is a vector where (ax)i = axi . For
two vectors x, y ∈ Rn , the sum of x and y is a vector x + y, where (x + y)i = xi + yi . The dot product of
Pn
two vectors x, y ∈ Rn is x • y = i=1 xi yi . Both x + y and x • y are meaningless unless x and y have the
same dimensionality.
An n-by-m matrix M is an element of (Rn )m , which is also sometimes written Rn×m . Such a matrix M
has n rows and m columns, and its entries are referenced with subscripts, first by row and then by column:
 
M1,1 M1,2 . . . M1,m
 M2,1 M2,2 . . . M2,m 
 
M= . .. .. .. 
 .. . . . 
Mn,1 Mn,2 . . . Mn,m
A matrix M ∈ Rn×m is square if n = m. For a size n, the identity matrix is I ∈ Rn×n has ones on the main
diagonal (the entries Ii,i = 1) and zeros everywhere else.
Given a matrix M ∈ Rn×m and a real number a ∈ R, the matrix aM is specified by (aM)i,j = aMi,j .
Given two matrices M, M′ ∈ Rn×m , the matrix M + M′ is specified by (M + M′ )i,j = Mi,j + M′i,j . (The sum
M + M′ is meaningless if M and M′ have different dimensions.) The product of two matrices A ∈ Rn×m
Pm
and B ∈ Rm×p is a matrix AB ∈ Rn×p whose components are given by (AB)i,j = k=1 Ai,k Bk,j . (More
compactly, (AB)i,j = Ai,(1...m) • B(1...m),j .) If the number of rows in A is different from the number of
columns in B then AB is meaningless. The inverse of M is a matrix M−1 such that MM−1 = I (if any such
matrix M−1 exists).

Functions
A function f : A → B maps every element a ∈ A to some element f(a) ∈ B. The domain of f is A and
the codomain is B. The image or range of f is {f(x) : x ∈ A}, the set of elements of the codomain “hit” by
some element of A according to f.
The composition of a function f : A → B and a function g : B → C is written g ◦ f : A → C, where
(g ◦ f)(x) = g(f(x)). A function f : A → B is one-to-one or injective if f(x) = f(y) implies that x = y. The
function f is onto or surjective if the image is equal to the codomain. If f : A → B is one-to-one and onto, it
is bijective. For a bijection f : A → B, the function f−1 : B → A is the inverse of f, where f−1 (b) = a when
f(a) = b.
A polynomial p : R → R is a function p(x) = a0 + a1 x + · · · + ak xk , where each ai ∈ R is a coefficient.
The degree of p is k. The roots of p are {x : p(x) = 0}. A polynomial of degree k that is not always zero
has at most k different roots.
An algorithm is a step-by-step procedure that transforms an input into an output.

ISTUDY
2.6 Chapter at a Glance 77

Key Terms and Results

Key Terms Key Results


Booleans, Numbers, and Arithmetic Booleans, Numbers, and Arithmetic
• Booleans, integers, reals, rationals 1 The value of bn is b · b · · · b, multiplied together n times.
• open intervals, closed intervals If n < 0, then bn = 1/(b−n ). For rational exponents,
• absolute value, floor ⌊ ⌋, ceiling ⌈ ⌉ b1/m is the number x such that xm = b, and
• exponentiation, logarithms bn/m = (b1/m )n .
• modulus, remainder, divides 2 For a positive real number b ̸= 1 and a real number
• even, odd, prime, parity x > 0, the quantity logb x (the log base b of x) is the real
P Q
• summation , product number y such that by = x.
• nested summations, nested products 3 Consider integers k > 0 and n. Then k | n (“k divides n”)
if nk is an integer—or, equivalently, if n mod k = 0.
Sets: Unordered Collections 4 As long as the terms being added remain unchanged, we
can reindex a summation (for example, shifting the
• set, element, membership, cardinality
variable over which the sum is taken, or reversing the
• exhaustive enumeration
order of nested sums) without affecting the total value of
• set abstraction, universe
the sum. The same is true for products.
• the empty set ∅ = {}
• Venn diagram
Sets: Unordered Collections
• complement ∼, union ∪,
intersection ∩, set difference − 1 A set can be specified using exhaustive enumeration (a
• (proper) subset, (proper) superset list of its elements), or by abstraction (a condition
• disjoint sets describing when an object is an element of the set).
• partitions 2 Two sets S and T are equal if every element of S is an
• power set element of T and every element of T is an element of S.

Sequences, Vectors, and Matrices Sequences, Vectors, and Matrices


• sequence, list, ordered pair, n-tuple 1 For vectors x, y ∈ Rn , the dot product of x and y is
Pn
• Cartesian product x • y = i=1 xi yi .
• vector, dot product 2 The product AB of two matrices A ∈ Rn×m and
• matrix, identity matrix B ∈ Rm×p is an n-by-p matrix M ∈ Rn×p whose
Pm
• matrix multiplication components are given by Mi,j = k=1 Ai,k Bk,j .
• matrix inverse
Functions
Functions 1 A one-to-one and onto function f : A → B has an inverse
• domain, codomain, image/range function f−1 : B → A, where f(a) = b precisely when
• function composition f−1 (b) = a.
• one-to-one, onto functions 2 A polynomial of degree k that is not always zero has at
• bijection, inverse most k different roots.
• polynomial, degree, roots
• algorithm

ISTUDY
ISTUDY
3 Logic

In which our heroes move carefully through the marsh, making sure
that each step follows safely from the one before it.

ISTUDY
79
80 Logic

3.1 Why You Might Care


The thinker must think for truth, not for fame.
W. E. B. Du Bois (1868–1963)
The Souls of Black Folk (1903)

Logic is the study of truth and falsity, of theorem and proof, of valid reasoning in any context. It’s also
the foundation of all of computer science, the very reasoning that you use when you write the condition
of an if statement in a Java program, or when you design an algorithm to beat a grandmaster at chess.
More concretely, logic is also the foundation of all computers. At its heart, a computer is a collection of
carefully arranged wires that transport electrons (which serve as a physical manifestation of information)
and “gates” (which serve as physical manifestations of logical operations to manipulate those electrons).
Indeed, the processor of a computer is essentially a circuit built up from almost unthinkably simple
logical components: nothing much more than wires and the physical implementations of operations like
“and,” “or,” and “not.” (For example, the very same reasoning that allows you to say that x+y is odd exactly
when either x is odd and y not odd, or when x is not odd and y is odd, allows you to design a circuit to
add two bits. And stringing together 64 copies of that small circuit—with some additional logic to handle
“carrying” from one place to the next—allows you to build a circuit that adds two 64-bit numbers.)
Because logic is the study of valid reasoning, any endeavor in which one wishes to state and justify
claims rigorously—such as that of this book—must at its core rely on logic. Every condition that you
write in a loop is a logical statement. When you sit down to write binary search in Python, it is through
a (perhaps tacit) use of logical reasoning that you ensure that your code works properly for any input.
When you use a search engine to look for web pages on the topic BEATLES AND NOT JOHN OR PAUL OR
GEORGE OR RINGO you’ve implicitly used logical reasoning to select this particular query. (Maybe you
just heard the reason for the band’s name for the first time, and you wanted to look it up.) Solving a Sudoku
puzzle is nothing more and nothing less than following logical constraints to their conclusion. The central
component of a natural language processing (NLP) system is to take an utterance by a human user that’s
made in a “natural” language like English and “understand” what it means—and understanding what a
sentence means is essentially the same task as understanding the circumstances under which the sentence
is true, and thus is a question of logic. And these are just a handful of examples; for a computer scientist,
logic is the basis of the discipline.
Our main goal in this chapter will be to introduce the basic constructs of logic. We’ll focus on formal
logic, in which it is the “form” of the argument that matters, rather than the “content.” This chapter will
introduce the two major types of formal logic. The first type is propositional logic (Sections 3.2 and 3.3),
in which we will study the truth and falsity of statements, how to construct logical statements from basic
logical operators (like “and” and “or”), and how to reason about those statements. The second type of logic
that we’ll examine is predicate logic (Sections 3.4 and 3.5), which gives us a framework to write logical
statements of the form “every x . . .” or “there’s some x such that . . ..”
One of our main goals in this chapter will be to define a precise, formal, and unambiguous language
to express reasoning—in which writer and reader agree on what each word means. But along the way,
we will encounter applications of logic to natural language processing, circuits, programming languages,
optimizing compilers, and building artificially intelligent systems to play chess and other games.

ISTUDY
3.2 An Introduction to Propositional Logic 81

3.2 An Introduction to Propositional Logic


Logic is a large drawer, containing some useful instruments, and many more that are
superfluous. A wise man will look into it for two purposes, to avail himself of those
instruments that are really useful, and to admire the ingenuity with which those that are
not so, are assorted and arranged.
Charles Caleb Colton (1777–1832)
Lacon: or, Many things in few words; addressed to those who think (1820)

A proposition is a statement that is either true or false—A modern smartphone has over 100,000 times
more memory than the computer on the Apollo moon-landing mission (true, by a long shot) or Java is a
programming language that uses indentation to denote block structure (false), for example. Propositional
logic is the study of propositions, including how to formulate statements as propositions, how to evaluate
whether a proposition is true or false, and how to manipulate propositions. The goal of this section is to
introduce propositions and propositional logic.

3.2.1 Propositions and Truth Values


We’ll begin, briefly, with propositions themselves:

Definition 3.1: Propositions and truth values.


A proposition is a statement that is either true or false. For a particular proposition p, the truth value of
p is its truth or falsity.

A proposition is also sometimes called a Boolean expression or a Boolean formula. (See Section 2.2.1.)
A proposition is written in English as a declarative sentence, the kind of sentence that usually ends with a
period. (Questions and demands—like Did you try binary search? or Always comment your code!—aren’t
the kinds of things that are true or false, and so they’re not propositions.) Here are a few examples:

Example 3.1: Some sample propositions.


The following statements are all propositions:
1 2 + 2 = 4.
2 33 is a prime number.
3 Barack Obama was the 44th person to serve as the president of the United States.
4 Every even integer greater than 2 can be written as the sum of two prime numbers.
(The last of these propositions is called Goldbach’s conjecture; it’s more complicated than the other
propositions in this example, and we’ll return to it in Section 3.4.)

Example 3.2: Determining truth values.


Let’s determine the truth values of the propositions from Example 3.1. They are:
1 True. It really is the case that 2 + 2 equals 4.
2 False. The integer 33 is not a prime number because 33 = 3 · 11. (Prime numbers are evenly divisible
only by 1 and themselves; 33 is also evenly divisible by 3 and 11.)

ISTUDY
82 Logic

3 False. Although Barack Obama is called president #44, Grover Cleveland was president #22 and #24.
So Barack Obama was actually the 43rd person to be president of the United States, not the 44th.
4 Unknown (!). Goldbach’s conjecture was first made in 1742, but has thus far resisted proof—or dis-
proof. It’s easy to check that a particular small even integer can be written as the sum of two prime
numbers; for example, 4 = 2 + 2 and 6 = 3 + 3 and 8 = 3 + 5 and 10 = 3 + 7. But is it true for all
even integers greater than 2? We simply don’t know! Many even integers have been tested, and no
violation has been found in any of these tests. But maybe it will turn out that the very next even integer
that someone tests can’t be written as the sum of two primes. (See Example 3.49.)

Before we move on from Example 3.2, there’s an important point to make about statements that have an
unknown truth value. Even though we don’t know the truth value of Goldbach’s conjecture, it is still a
proposition and thus it has a truth value. That is, Goldbach’s conjecture is indeed either true or false; it’s
just that we don’t know which. (Like The person currently sitting next to you is wearing clean underwear:
it has a truth value, you just don’t know what truth value it has.)

Taking it further: Goldbach’s conjecture, with its unknown truth value, stands in contrast to declarative sentences whose truth is
ill-defined—for example, This book is boring and Logic is fun. Whether these claims are true or false depends on the (imprecise)
definitions of words like boring and fun. “Shades of truth” questions are not the prime focus when you’re studying precise, formal
reasoning, as in this book, but they’re important when considering the role of ambiguity in software systems that interact with
humans via English language input and output. See p. 92.
There is also a potentially interesting philosophical puzzle that’s hiding in questions about the truth values of natural-language
utterances. Here’s a silly (but obviously true) statement: The sentence “snow is white” is true if and only if snow is white. (Of
course!) This claim becomes a bit less trivial if the embedded proposition is stated in a different language—Spanish or Dutch,
say: The sentence “La nieve es blanca” is true if and only if snow is white; or The sentence “Sneeuw is wit” is true if and only if
snow is white. But there’s a troubling paradox lurking here. Surely we would like to believe that the English sentence x and the
French translation of the English sentence x have the same truth value. For example, Snow is white and La neige est blanche surely
are both true, or they’re both false. (And, in fact, it’s the former.) But this belief leads to a problem with certain self-referential
sentences: for example, This sentence starts with a ‘T’ is true, but Cette phrase commence par un ‘T’ is, surely, false. For more
on paradoxes and puzzles of translation, see, for example, [57, 112].

3.2.2 Atomic and Compound Propositions


We will distinguish between two types of propositions, those that cannot be broken down into conceptually
simpler pieces and those that can be:

Definition 3.2: Atomic and compound propositions.


An atomic proposition is a proposition that is conceptually indivisible. A compound proposition is a
proposition that is built up out of conceptually simpler propositions.

Here’s an example of the difference:

Example 3.3: Sample atomic and compound propositions.


Consider these two propositions:
(1) The University of Minnesota’s mascot is the Badger.
(2) The University of Washington’s mascot is the Duck or the University of Oregon’s mascot is the Duck.

ISTUDY
3.2 An Introduction to Propositional Logic 83

The first is an atomic proposition, because it is not conceptually divisible into any simpler claim. The
second is a compound proposition, because it is conceptually divisible into two simpler claims—namely
The University of Washington’s mascot is the Duck and The University of Oregon’s mascot is the Duck.

Atomic propositions are also sometimes called Boolean variables; see Section 2.2.1. A compound propo-
sition that contains Boolean variables p1 , p2 , . . . , pk is sometimes called a Boolean expression or Boolean
formula over p1 , p2 , . . . , pk .

Example 3.4: Password validity as a compound proposition.


A certain college sends the following instructions to its users when it makes them change their password:

Your password is valid only if it is at least 8 characters long, you have not previously used it as your
password, and it contains at least three different types of characters (lowercase letters, uppercase letters,
digits, non-alphanumeric characters).
This compound proposition involves seven different atomic propositions:
p = the password is valid s = the password contains lowercase letters
q = the password is at least 8 characters long t = the password contains uppercase letters
r = the password has been used previously by you u = the password contains digits
v = the password contains non-alphanumeric characters

The form of the compound proposition is “p, only if q and not r and at least three of {s, t, u, v} are true.”
(Later we’ll see how to write this compound proposition in standard logical notation; see Example 3.16.)

3.2.3 Logical Connectives


Logical connectives are the glue that creates the more complicated compound propositions from simpler
propositions. Here are definitions of our first three of these logical connectives—not, and, and or:

Definition 3.3: Negation [not, ¬].


The proposition ¬p (“not p,” called the negation of the proposition p) is true when the proposition p is
false, and is false when p is true.

Definition 3.4: Conjunction [and, ∧].


The proposition p ∧ q (“p and q,” the conjunction of the propositions p and q) is true when both of the
propositions p and q are true, and is false when one or both of p or q is false.

Definition 3.5: Disjunction [or, ∨].


The proposition p ∨ q (“p or q,” the disjunction of the propositions p and q) is true when one or both of
the propositions p or q is true, and is false when both p and q are false.

The prefix con- means “together” and dis- means “apart.” (Junct means “join.”) The conjunction p ∧ q is true when p and q are
true together; the disjunction p ∨ q is true when p is true “apart from” q, or the other way around. To help keep the symbols
straight, it may be helpful to notice that the symbol ∧ is the angular version of the symbol ∩ (intersection), while the symbol ∨
is the angular version of the symbol ∪ (union). The set S ∩ T is the set of all elements contained in S and T; the set S ∪ T is the
set of all elements contained in S or T.

ISTUDY
84 Logic

In the conjunction p ∧ q, the propositions p and q are called conjuncts; in the disjunction p ∨ q, they are
called disjuncts.

Example 3.5: Some compound propositions.


Let p denote the proposition Ohio State’s mascot is the Buckeye and let q denote the proposition
Michigan’s mascot is the Wolverine. Then we have the following compound propositions:

¬q Michigan’s mascot is not the Wolverine.


p ∧ q Ohio State’s mascot is the Buckeye, and Michigan’s mascot is the Wolverine.
p ∨ q Ohio State’s mascot is the Buckeye, or Michigan’s mascot is the Wolverine.

Let’s also translate some English statements expressing compound propositions into logical notation:

Example 3.6: From English statements to compound propositions.


Translate each of the following statements into standard logical notation. (Name the atomic propositions
using appropriate Boolean variables.)
1 Carissa is majoring in computer science and studio art.
2 Either Delroy took a formal logic class, or he is a quick learner.
3 Eli broke his hand and didn’t take the test as scheduled.
4 Florence knows Python or she has programmed in both C and Java.

Solution. We’ll first name the atomic propositions, and then translate them:
1 Write p for Carissa is majoring in computer science and q for Carissa is majoring in studio art. The
given proposition is p ∧ q.
2 Write r to denote Delroy took a formal logic class and write s to denote Delroy is a quick learner. The
given statement is r ∨ s.
3 Denote Eli broke his hand by t and denote Eli took the test as scheduled by u. The sentence can be
written as t ∧ ¬u.
4 Let v denote Florence knows Python, let w denote Florence has programmed in C, and let x denote
Florence has programmed in Java. Then we can write v ∨ (w ∧ x) to represent the given sentence.

Implication (if/then)
Another important logical connective is ⇒, which denotes implication. It expresses a familiar idea from
everyday life, though one that’s not quite captured by a single English word.
Consider the sentence If you tell me how much money you make, then I’ll tell you how much I make. It’s
easiest to think of this sentence as a promise: I’ve promised that I’ll tell you my financial situation as long
as you tell me yours. I haven’t promised anything about what I’ll do if you don’t tell me your salary—I can
abstain from revealing anything about myself, or I might spontaneously tell you my salary anyway, but the
point is that I haven’t guaranteed anything. (But you’d justifiably call me a liar if you told me what you
make, and I failed to disclose my salary in return.)
This kind of promise is expressed as an implication in propositional logic:

ISTUDY
3.2 An Introduction to Propositional Logic 85

Definition 3.6: Implication [⇒].


The proposition p ⇒ q is true when the truth of p implies the truth of q. In other words, p ⇒ q is true
unless p is true and q is false.

In the implication p ⇒ q, the proposition p is called the antecedent or the hypothesis, and the proposition
q is called the consequent or the conclusion.
One initially confusing aspect of logical implication is that the word “implies” seems to hint at something about causation—but
p ⇒ q doesn’t actually say anything about p causing q, only that p being true implies that q is true (or, in other words, p being
true lets us conclude that q is true).

Example 3.7: Some implications.


The following propositions are all true:
1 + 1 = 2 implies that 2 + 3 = 5. (“True implies True” is true.)
2 + 3 = 4 implies that 2 + 2 = 4. (“False implies True” is true.)
2 + 3 = 4 implies that 2 + 3 = 6. (“False implies False” is true.)

But the following proposition is false:

2 + 2 = 4 implies that 2 + 1 = 5. (“True implies False” is false.)


This last proposition is false because 2 + 2 = 4 is true, but 2 + 1 = 5 is false.

There are many different ways to express the proposition p ⇒ q in English; see Figure 3.1a for some of
them. Here’s a concrete example of multiple ways to phrase a real-world implication:

Example 3.8: Expressing implications in English.


According to United States law, people who can legally vote must be American citizens, and they must
also satisfy some other various conditions that vary from state to state (for example, registering in advance
or not being a felon). Thus the following compound proposition is true:
you are a legal U.S. voter ⇒ you are an American citizen.
All of the sentences in Figure 3.1b express this proposition in English.
Most of these sentences are reasonably natural ways to express the stated implication, though the last
phrasing (“You being a legal U.S. voter is sufficient for you to be an American citizen”) seems awkward.
But it’s easier to understand if we slightly rephrase it as “You being a legal U.S. voter is sufficient for me
to conclude that you are an American citizen.”

(a) “if p, then q” (b) If you are a legal U.S. voter, then you are an American citizen.
“p implies q” You being a legal U.S. voter implies that you are an American citizen.
“p only if q” You are a legal U.S. voter only if you are an American citizen.
“q whenever p” You are an American citizen if you are a legal U.S. voter.
“q, if p” You are an American citizen whenever you are a legal U.S. voter.
“q is necessary for p” You being an American citizen is necessary for you to be a legal U.S. voter.
“p is sufficient for q” You being a legal U.S. voter is sufficient for you to be an American citizen.

Figure 3.1 Expressing implications in English: (a) p ⇒ q, and (b) a particular implication [see Example 3.8].

ISTUDY
86 Logic

Here’s another example of restating implications:

Example 3.9: More implications in English.


Consider this proposition:
p q
The nondisclosure agreement is valid only if you signed it .
(This statement is different from if you signed, then the agreement is valid: for example, the agreement
might not be valid because you’re legally a minor and thus not legally allowed to sign away rights.) We
can restate p ⇒ q as “if p then q”:

If the nondisclosure agreement is valid, then you signed it.


We can also restate this implication equivalently—and perhaps more intuitively—using the so-called
contrapositive ¬q ⇒ ¬p (see Example 3.22):

The nondisclosure agreement is invalid if you didn’t sign it.

Exclusive or
The four logical connectives that we have defined so far (¬, ∨, ∧, and ⇒) are the ones that are most
frequently used, but there are two other common connectives too. The first is exclusive or:

Definition 3.7: Exclusive or [⊕].


The proposition p ⊕ q (“p exclusive or q” or, more briefly, “p xor q”) is true when one of p or q is true,
but not both. Thus p ⊕ q is false when both p and q are true, and when both p and q are false.

(“Xor” and ⊕ are usually pronounced like “ex ore,” as in someone you used to date plus some rock with
high precious-metal content.)
When we want to emphasize the distinction between ∨ and ⊕, we refer to ∨ as inclusive or. This ter-
minology highlights the fact that p ∨ q includes the possibility that both p and q are true, while p ⊕ q
excludes that possibility. Unfortunately, the word “or” in English can mean either inclusive or exclusive
or, depending on the context in which it’s being used. When you see the word “or,” you’ll have to think
carefully about which meaning is intended.

Example 3.10: Inclusive versus exclusive or in English.


Translate these statements from a cover letter for a job into logical notation:

You may contact me by email or by phone. I am available for an on-site day-long interview on October
8th in Minneapolis or Hong Kong.
Use the following Boolean variables:

p = you may contact me by phone r = I am physically available for an interview in Minneapolis


q = you may contact me by email s = I am physically available for an interview in Hong Kong

ISTUDY
3.2 An Introduction to Propositional Logic 87

Solution. The “or” in “email or phone” is inclusive (you could receive both an email and a call); the “or”
in “Minneapolis or Hong Kong” is exclusive (it’s not physically possible to be simultaneously present in
Minneapolis and Hong Kong). Thus we can translate these statements as (p ∨ q) ∧ (r ⊕ s).

If and only if
We are now ready to define our last common logical connective:

Definition 3.8: If and only if [⇔].


The proposition p ⇔ q (“p if and only if q”) is true when the propositions p or q have the same truth
value (both p and q are true, or both p and q are false), and false otherwise.

The reason that ⇔ is read as “if and only if” is that p ⇔ q means the same thing as the compound propo-
sition (p ⇒ q) ∧ (q ⇒ p). (We’ll prove this equivalence in Example 3.24.) Furthermore, the propositions
p ⇒ q and q ⇒ p can be rendered, respectively, as “p only if q” and “p, if q.” Thus p ⇔ q expresses “p if
q, and p only if q”—or, more compactly, “p if and only if q.”
(The connective ⇔ is also sometimes called the biconditional, because an implication can also be called
a conditional. You’ll also sometimes see ⇔ abbreviated in sentences as “iff” as shorthand for “if and only
if”; we’ll avoid the “iff” abbreviation in this book, but it’s good to be prepared if you see it elsewhere.)
Unfortunately, just like with “or,” the word “if” is ambiguous in English. Sometimes “if” is used to
express an implication, and sometimes it’s used to express an if-and-only-if definition. When you see the
word “if” in a sentence, you’ll need to think carefully about whether it means ⇒ or ⇔.

Example 3.11: “If” versus “if and only if” in English.


Think of a number between 10 and 1,000,000. Consider the following propositions:
p = your number is prime.
q = your number is even.
r = your number is evenly divisible by a positive integer other than 1 and itself.
Now translate the following two sentences into logical notation:

(1) If the number you’re thinking of is even, then it isn’t prime.


(2) The number you’re thinking of isn’t prime if it’s evenly divisible by an integer other than 1 and itself.

Solution. The “if” in (1) is an implication, and the “if” in (2) is “if and only if.” A correct translation of
these sentences is (1) q ⇒ ¬p; and (2) ¬p ⇔ r.

3.2.4 Combining Logical Connectives


The six logical connectives that we’ve defined are summarized in Figure 3.2. The ¬ connective is a unary
operator, because it builds a compound proposition from a single simpler proposition. The other five con-
nectives are binary operators, which build a compound proposition from two simpler propositions. (We’ll
encounter the full list of binary logical connectives later; see Exercises 4.66–4.71.)

ISTUDY
88 Logic

negation ¬p “not p” highest precedence (binds tightest)


conjunction p∧q “p and q”
disjunction p∨q “p or q”
exclusive or p⊕q “p xor q”
implication p⇒q “if p, then q” or “p implies q”
if and only if p⇔q “p if and only if q” lowest precedence (binds loosest)

Figure 3.2 Summary of notation for propositional logic.

Taking it further: The unary-vs.-binary categorization of logical connectives based on how many “arguments” they accept also
occurs in other contexts—for example, arithmetic and programming. In arithmetic, for example, you might distinguish between
“unary minus” and “binary minus”: the former denotes negation, as in −3; the latter subtraction, as in 2 − 3. The square root

operator is also unary. In programming languages, the number of arguments that a function takes is called its arity. (The arity
of length is one; the arity of equals is two.) You will sometimes encounter variable arity functions that can take a different
number of arguments each time they’re invoked. Common examples include the print functions in many languages—C’s printf
and Python’s print, for example, can take any number of arguments—or arithmetic in prefix languages like Scheme, where you
can write an expression like (+ 1 2 3 4) to denote 1 + 2 + 3 + 4 (= 10).

Order of operations
A full description of the syntax of a programming language always includes a table of the precedence of
operators, arranged from “binds the tightest” (highest precedence) to “binds the loosest” (lowest prece-
dence). These precedence rules tell us when we have to include parentheses in an expression to make it
mean what we want it to mean, and when the parentheses are optional.
In the same way, we’ll adopt some standard conventions regarding the precedence of our logical con-
nectives: negation (¬) has the highest precedence; then there is a three-way tie among ∧, ∨, and ⊕; then
there’s ⇒; then finally ⇔ has the lowest precedence. (The horizontal lines in Figure 3.2 separate the logical
connectives by their precedence.)

The word “precedence” (pre before, cede go) means “what comes first,” so the precedence rules in Figure 3.2 tell us the order
in which the operators “get to go.” For example, consider the proposition p ∧ q ⇒ r. Because ∧ has higher precedence than ⇒,
the ∧ “goes first,” so the proposition is (p ∧ q) ⇒ r. (If ⇒ had “gone first,” it would have meant p ∧ (q ⇒ r).) Although it’s a
standard way of describing precedence, I find the “binds more tightly” phrasing a little weird, but here’s a way to think about it
which may help make it make a bit more sense: deciding how tightly ¬ binds is like deciding how small of a space to put between
the operator ¬ and the expression it operates on. For example, when we’re interpreting ¬p ∧ q, there are two possibilities:

¬p ∧ q and ¬ p ∧ q.
¬ binds more tightly than ∧ ∧ binds more tightly than ¬

Figure 3.2 says that ¬ binds more tightly than ∧, so the correct interpretation is (¬p) ∧ q, the one on the left. These precedence
rules match those in most programming languages; in Java, for example, the condition ! p && q (“not p and q” in Java syntax)
will be interpreted as (!p) && q, because not/¬/! binds tighter than and/∧/&&.

Here are a couple of small examples:

Example 3.12: Precedence of logical connectives.


The propositions p ∨ ¬q and p ∨ q ⇒ ¬r ⇔ p mean, respectively,
 
p ∨ (¬q) and (p ∨ q) ⇒ (¬r) ⇔ p,

which we can see by applying the relevant precedence rules (“¬ goes first, then ∨, then ⇒, then ⇔”).

ISTUDY
3.2 An Introduction to Propositional Logic 89

Figure 3.2 tells us the order in which two different operators are applied in an expression; for a sequence
of applications of the same binary operator, we’ll use the convention that the operator associates to the left.
For example, p ∧ q ∧ r will mean (p ∧ q) ∧ r rather than p ∧ (q ∧ r). (This choice doesn’t matter for most
of the logical connectives anyway: for example, the propositions (p ∧ q) ∧ r and p ∧ (q ∧ r) are true under
exactly the same circumstances. In fact, implication is the only logical connective we’ve seen for which
the order of application matters. See Exercises 3.48–3.50.)

Example 3.13: Precedence of logical connectives.


Fully parenthesize each of the following propositions. (In other words, add parentheses around each
operator that respect the precedence rules and do not change the meaning.)
p∨q⇔p p⊕p⊕q⊕q ¬p ⇔ p ⇔ ¬(p ⇔ p)
Do the same for these:
p ∧ ¬q ⇒ r ⇔ s p ⇒ q ⇒ r ∧ s.

Solution. Using the precedence rules from Figure 3.2 and left associativity, we get:
  
(p ∨ q) ⇔ p (p ⊕ p) ⊕ q ⊕ q (¬p) ⇔ p ⇔ ¬(p ⇔ p)
for the first set. For the second batch, we have
  
p ∧ (¬q) ⇒ r ⇔ s (p ⇒ q) ⇒ (r ∧ s).

Writing tip: Because the order of application does matter for implication, it’s good style to include the optional parentheses so
that it’s clear what you mean. Similarly, we’ll always use parentheses in propositions containing more than one of the tied-in-
precedence operators ∧, ∨, and ⊕ from Figure 3.2, just as we should in programs. It’s hard to quickly recognize the result of
Python code like False and False or True, and it’s nicer not to have to try.

3.2.5 Truth Tables


In Section 3.2.3, we described the logical connectives ¬, ∧, ∨, ⇒, ⊕, and ⇔, but we can more systemati-
cally define these connectives by using a truth table that collects the value yielded by the logical connective
under every truth assignment.

Definition 3.9: Truth assignment.


A truth assignment for a proposition over variables p1 , p2 , . . . , pk is a function that assigns a truth value
to each pi .

For example, the function f where f(p) = T and f(q) = F is a truth assignment for the proposition p ∨ ¬q.
(Each “T” abbreviates a truth value of true; each “F” abbreviates a truth value of false.)
For any particular proposition and for any particular truth assignment f for that proposition, we can
evaluate the proposition under f to figure out the truth value of the entire proposition. In the previous
example, the proposition p ∨ ¬q is true under the truth assignment with p = T and q = F (because T ∨ ¬F
is T ∨ T, which is true). A truth table displays a proposition’s truth value (evaluated in the way we just
described) under all truth assignments:

ISTUDY
90 Logic

Definition 3.10: Truth table.


A truth table for a proposition lists, for each possible truth assignment for that proposition (with one
truth assignment per row in the table), the truth value of the entire proposition.

Example 3.14: Defining ∧.


For example, here is the truth table that defines the logical connective “and”:

p q p∧q
T T T
T F F
F T F
F F F

Columns #1 and #2 correspond to the atomic propositions p and q. There is a row in the table correspond-
ing to each possible truth assignment for p ∧ q—that is, for every pair of truth values for p and q. (So
there are four rows: TT, TF, FT, and FF.)
The third column corresponds to the compound proposition p ∧ q, and it has a T only in the first row.
That is, the truth value of p ∧ q is false unless both p and q are true—just as Definition 3.4 said.

Figure 3.3 shows the truth tables for the six basic logical connectives. It’s worth paying special attention to
the column for p ⇒ q: the only truth assignment under which p ⇒ q is false is when p is true and q is false.
False implies anything! Anything implies true! For example, both of the following are true propositions:

(1) If 2 + 3 = 4, then you will eat tofu for dinner. (If false, then anything!)
(2) If the moon is a clandestine surveillance device launched by the Sumerians in 3000 bce, then 2+3 = 5.
(If anything, then true!)
To emphasize the point, observe that (1) is true even if you would never eat tofu if it were the last so-called
food on earth; the hypothesis “2 + 3 = 4” of the proposition wasn’t true, so the truth of the proposition
doesn’t depend on what your dinner plans are.

Truth tables for more complex propositions


For more complicated propositions, we can fill in a truth table by repeatedly applying the rules in Figure 3.3.
For example, to find the truth table for (p ⇒ q) ∧ (q ∨ p), we compute the truth tables for p ⇒ q and q ∨ p,
and put a “T” in the (p ⇒ q) ∧ (q ∨ p) column for precisely those rows in which the truth tables for p ⇒ q
and q ∨ p both had “T”s. Here’s one small example, and a somewhat more complicated one:

Example 3.15: A small truth table.


Here is a truth table for the proposition p ∧ q ⇒ ¬q:

p q p∧q ¬q p ∧ q ⇒ ¬q
T T T F F
T F F T T
F T F F T
F F F T T

This truth table shows that the given proposition is true precisely when at least one of p and q is false.

ISTUDY
3.2 An Introduction to Propositional Logic 91

p ¬p p q p∧q p∨q p⇒q p⊕q p⇔q


T F T T T T T F T
F T T F F T F T F
F T F T T T F
F F F F T F T

Figure 3.3 Truth tables for the basic logical connectives.

Example 3.16: Three (or more) of four, formalized.


In Example 3.4 (on the validity of passwords), we had a sentence of the form “p, only if q and not r and
at least three of {s, t, u, v} are true.” Let’s translate this sentence into propositional logic. The tricky part
will be translating “at least three of {s, t, u, v} are true.”
There are many solutions, but one relatively simple way to do it is to explicitly write out four cases,
one corresponding to allowing a different one of the four variables {s, t, u, v} to be false:
(s ∧ t ∧ u) ∨ (s ∧ t ∧ v) ∨ (s ∧ u ∧ v) ∨ (t ∧ u ∧ v)
We can verify that we’ve gotten this proposition right with a (big!) truth table, shown in Figure 3.4.
Indeed, the five rows in which the last column has a “T” are exactly the five rows in which there are three
or four “T”s in the columns for s, t, u, and v.
To finish the translation, recall that “x only if y” means x ⇒ y, so the given sentence can be translated
as p ⇒ q ∧ ¬r ∧ (the proposition above)—that is,
 
p ⇒ q ∧ ¬r ∧ (s ∧ t ∧ u) ∨ (s ∧ t ∧ v) ∨ (s ∧ u ∧ v) ∨ (t ∧ u ∧ v) .

s t u v s∧t∧u s∧t∧v s∧u∧v t∧u∧v (s ∧ t ∧ u) ∨ (s ∧ t ∧ v) ∨ (s ∧ u ∧ v) ∨ (t ∧ u ∧ v)


T T T T T T T T T
T T T F T F F F T
T T F T F T F F T
T T F F F F F F F
T F T T F F T F T
Why are there five rows in which
T F T F F F F F F the last column is true? There are
T F F T F F F F F four different truth assignments
corresponding to exactly three of
T F F F F F F F F {s, t, u, v} being true (stu, suv,
F T T T F F F T T stv, tuv), and there is one truth
F T T F F F F F F assignment corresponding to all
four being true (stuv). (In Chapter
F T F T F F F F F 9, on counting, we’ll re-encounter
F T F F F F F F F this style of question. And,
F F T T F F F F F actually, precisely this reasoning
will allow us to prove something
F F T F F F F F F interesting about error-correcting
F F F T F F F F F codes—see Section 4.2.5.)
F F F F F F F F F

Figure 3.4 A truth table for a proposition expressing “at least three of {s, t, u, v} are true,” for Example 3.16.

ISTUDY
92 Logic

Computer Science Connections

Natural Language Processing, Ambiguity, and Truth


Our main interest in this book is in developing (and understanding) precise and unambiguous language to express
mathematical notions; in this chapter specifically, we’re thinking about the truth values of completely precise state-
ments. But thinking about the truth of ambiguous or ill-defined terms is absolutely crucial to any computational
system that’s designed to interact with users via natural language. (A natural language is one like English or French
or Xhosa; these languages contrast with artificial languages like Java or Python or, arguably, Esperanto or Klingon.)
Natural language processing (NLP) (or the roughly similar computational linguistics) is the subfield of computer
science that lies at the discipline’s interface with linguistics. In NLP, we work to develop software systems that can
interact with users in a natural language. A necessary step in an NLP system is to take an utterance made by the
human user and “understand it.” (“Understanding what a sentence means” is more or less the same as “understanding
the circumstances under which it is true”—which is fundamentally a question of logic. You can find much more in
any good textbook on natural language processing, such as [45, 66].)
One major reason that NLP is hard is that there is a
tremendous amount of ambiguity in natural-language S
utterances. We can have lexical ambiguity, in which
two different words are spelled identically but have two NP VP S
different meanings; we have to determine which word
is meant in a sentence. Or there’s syntactic ambiguity, We V NP NP VP
in which a sentence’s structure can be interpreted very
differently. (See Figure 3.5.)
ate N PP We V NP PP
But there are also subtleties about when a state-
ment is true, even if the meaning of each word and the
cake P NP ate N P NP
sentence’s structure are clear. Consider, for example,
designing and implementing a chatbot that’s supposed
with N cake with N
to assist people with planning travel. (Many compa-
nies, including transportation providers, use some kind walnuts walnuts
of automated conversational system as part of their (or forks) (or forks)
interactions with customers.) Imagine a user who asks
the chatbot to find the first flight from MSP to BOS on A: Do you want coffee or tea?
a particular date. That’s straightforward enough, and B: Do you want cream or sugar?
the system would be able to generate a database query C: We ate cake with walnuts.
to find the result: say, that the first nonstop flight from D: We ate cake with forks.
MSP to BOS leaves at 8:45am, with a nonrefundable
fare of $472. Figure 3.5 Examples of lexical ambiguity (A and B) and
But now imagine that the user’s response to that syntactic ambiguity (C and D). The or of A and B can be
information is to ask: Is there a slightly later flight either inclusive or exclusive; simply answering “yes” is a
that isn’t too much more expensive? This conversa- reasonable response to question B, but a bizarre one to
tional system now has to be able to decide the truth of question A. The with of C and D can either attach to the
statements like 10:33am is slightly later than 8:45am cake or the eating; the sentences’ structures are consistent
and $529 isn’t too much more expensive than $472,
with using walnuts as an eating utensil in C, or the cake
even though the “truth” of these statements depends
containing forks as an ingredient in D.
on heavy use of conversational context and pragmatic
reasoning. Of course, even though one cannot unam-
biguously determine whether these sentences are true or false, they’re the kind of statement made continually in
natural language. So systems that process natural language must deal with this issue.
One approach for handling these statements whose truth value is ambiguous is called fuzzy logic, in which each
proposition has a truth value that is a real number between 0 and 1. (So 10:33a is slightly later than 8:45a is “more
true” than 12:19p is slightly later than 8:45a—so the former might have a truth value of 0.74, while the latter might
have a truth value of 0.34. But 7:30a is slightly later than 8:45a would have a truth value of 0.00, as 7:30a is
unambiguously not slightly later than 8:45a.)

ISTUDY
Exercises 93

EXERCISES
What are the truth values of the following propositions?
3.1 22 + 32 = 42
3.2 The number 202 is written 11010010 in binary.
3.3 After executing the following C code, the variable x has the value 1.

1 int x = 202;
2 while (x > 2) {
3 x = x / 2; /* Note: in C, a single slash denotes integer division. */
4 } /* For example, the value of 7/2 is 3, not 3.5. */

Using the atomic propositions in Figure 3.6, translate the following statements about Python expressions into logical notation:
3.4 x ** y is valid Python if and only if x and y are both numeric values.
3.5 x + y is valid Python if and only if x and y are both numeric values, or they’re both lists.
3.6 x * y is valid Python if and only if x and y are both numeric values, or if one of x and y is a list and the other is numeric.
3.7 x * y is a list if x * y is valid Python and x and y are not both numeric values.
3.8 if x + y is a list, then x * y is not a list.
3.9 x + y and x ** y are both valid Python only if x is not a list.

3.10 True story: a 29-year-old friend of mine who does not have an advance care directive was asked If you’re over 55 years old, do you
have an advance care directive? on a form at a doctor’s office. What should she answer, yes or no?

3.11 In Example 3.16, we constructed a proposition corresponding to “at least three of {s, t, u, v} are true.” Generalize this construction
by building a proposition that expresses “at least 3 of {p1 , . . . , pn } are true.”
3.12 Similar to Exercise 3.11: now build a proposition that expresses “at least n − 1 of {p1 , . . . , pn } are true.”

3.13 The identity of a binary operator ⋄ is a value i such that, for any x, the expressions {x, x ⋄ i, i ⋄ x} are all equivalent. An example
from arithmetic: the identity of + is 0, because x + 0 = 0 + x = x for any number x. Identify the identity of ∨, and justify your
answer. (Some operators do not have an identity; if there is no identity, explain why it doesn’t exist.)
3.14 What is the identity of ∧? (Or, if it doesn’t exist, explain why not.)
3.15 What is the identity of ⇔? (Or, if it doesn’t exist, explain why not.)
3.16 What is the identity of ⊕? (Or, if it doesn’t exist, explain why not.)
3.17 The zero of a binary operator ⋄ is a value z such that, for any x, the expressions {z, x ⋄ z, z ⋄ x} are all equivalent. For example, the
zero of multiplication is 0, because x · 0 = 0 · x = 0 for any number x. What is the zero of ∨? (Or explain why ∨ has no zero.)
3.18 What is the zero of ∧? (Or, if it doesn’t exist, explain why not.)
3.19 What is the zero of ⇔? (Or, if it doesn’t exist, explain why not.)
3.20 What is the zero of ⊕? (Or, if it doesn’t exist, explain why not.)
3.21 Because ⇒ is not commutative (that is, because p ⇒ q and q ⇒ p mean different things), it is not too surprising that ⇒ has neither
an identity nor a zero. But there are some related concepts for this type of operator. The left identity of a binary operator ⋄ is a
value iℓ such that, for any x, the expressions x and iℓ ⋄ x are equivalent. The right identity of ⋄ is a value ir such that, for any x,
the expressions x and x ⋄ ir are equivalent. (Again, some operators may not have left or right identities.) What are the left and right
identities of ⇒ (if they exist)?
3.22 The left zero of ⋄ is a value zℓ such that, for any x, the expressions zℓ and zℓ ⋄ x are equivalent; similarly, the right zero is a value
zr such that, for any x, the expressions zr and x ⋄ zr are equivalent. What are the left and right zeros for ⇒ (if they exist)?

In many programming languages, the Boolean values True and False are actually stored as the numerical values 1 and 0, respec-
tively. In Python, for example, both 0 == False and 1 == True are True. Thus, despite appearances, we can do arithmetic on
Boolean values! Furthermore, in many languages (including Python), anything that is not False (in other words, anything other
than 0) acts like True for the purposes of conditionals; for example, if 2 then X else Y will run successfully and execute X.
Suppose that x and y are two Boolean variables in a programming language where True and False are 1 and 0. (That is, the
values of x and y are both 0 or 1.) Each of the following code snippets includes a conditional statement based on an arithmetic
expression using x and y. Rewrite the given condition using the standard notation of propositional logic.
3.23 if x * y ...
3.24 if x + y ...
3.25 if 2 - x - y ...
3.26 if (x * (1 - y)) + ((1 - x) * y) ...

ISTUDY
94 Logic

(Exercises 3.4–3.9 make accurate statements about expressions


p: x + y is valid Python u: x is a numeric value involving numbers and lists in Python, but even so they do not come
q: x * y is valid Python v: y is a numeric value close to fully characterizing the set of valid Python statements, for
multiple reasons: first, they’re about particular variables—x and
r: x ** y is valid Python w: x is a list y—rather than about generic variables. And, second, they omit some
s: x * y is a list z: y is a list common-sense facts, like the fact that it’s not simultaneously
t: x + y is a list possible to be both a list and a numeric value.)

Figure 3.6 Some atomic propositions for Exercises 3.4–3.9.

x3 x2 x1 x0 x3 x2 x1 x0 x3 x2 x1 x0
0 0 0 1 0 0 1 1 1 1 0 0
0+0+0+1 =1 0+0+2+1 =3 8 + 4 + 0 + 0 = 12

Figure 3.7 Representing small numbers with four Boolean variables (where 0 = False and 1 = True).

3.27 We can use the Booleans-as-integers idea from Exercises 3.23–3.26 to give a simpler solution to Exercises 3.11–3.12. Assume that
{p1 , . . . , pn } are all Boolean variables in Python—that is, their values are all 0 or 1. Write a Python conditional expressing the
condition that at least 3 of {p1 , . . . , pn } are true.
3.28 Write a Python conditional expressing the condition that at least n − 1 of {p1 , . . . , pn } are true.

In addition to purely logical operations, computer circuitry has to be built to do simple arithmetic very quickly. Consider a number
x ∈ {0, . . . , 15} represented as a 4-bit binary number, and denote by x0 the least-significant bit of x, by x1 the next bit, and so
forth. (Think of 0 as false and 1 as true.) Here you’ll explore some pieces of using propositional logic and binary representation of
integers to express arithmetic operations. (It’s straightforward to convert your answers into circuits.) See Figure 3.7.
3.29 Give a proposition over {x0 , x1 , x2 , x3 } that expresses that x is greater than or equal to 8.
3.30 Give a proposition over {x0 , x1 , x2 , x3 } that expresses that x is evenly divisible by 4.
3.31 Give a proposition over {x0 , x1 , x2 , x3 } that expresses that x is evenly divisible by 5. (Hint: build a truth table.)
3.32 Give a proposition over {x0 , x1 , x2 , x3 } that expresses that x is evenly divisible by 9.
3.33 Give a proposition over {x0 , x1 , x2 , x3 } that expresses that x is an exact power of two.
3.34 Suppose that we have two 4-bit input integers x and y, represented as in Exercises 3.29–3.33. Give a proposition over the Boolean
variables {x0 , x1 , x2 , x3 , y0 , y1 , y2 , y3 } that expresses the condition that x = y.
3.35 Give a proposition over {x0 , x1 , x2 , x3 , y0 , y1 , y2 , y3 } that expresses the condition x ≤ y.
3.36 Give a proposition over {x0 , x1 , x2 , x3 , y0 , y1 , y2 , y3 } that expresses the condition x = 2 · y.
3.37 Give a proposition over {x0 , x1 , x2 , x3 , y0 , y1 , y2 , y3 } that expresses the condition xy = yx .
3.38 Consider a 4-bit input integer x represented by four Boolean variables {x0 , x1 , x2 , x3 } as in Exercises 3.29–3.33. Let y be x + 1,
represented again as a 4-bit value {y0 , y1 , y2 , y3 }. (Treat 15 + 1 = 0.) For example, if x = 11 (which is 1011 in binary), then
y = 12 (which is 1100 in binary). Give four propositions over {x0 , x1 , x2 , x3 } that express the values of y0 , y1 , y2 , and y3 .

The following exercises ask you to build a program to evaluate a given proposition. To make your life as easy as possible, you
should consider a simple representation of propositions, based on representing any compound proposition as a list. In such a list,
the first element will be the logical connective, and the remaining elements will be the subpropositions. For example, the proposition
p ⇒ (¬q) will be represented as ["implies", ["or", "p", "r"], ["not", "q"]] using Python list notation. Using this
representation of propositions, write a program, in a language of your choice, to accomplish the following operations.
3.39 (programming required) Given a proposition φ, compute the set of all atomic propositions contained within φ. (We’ll occasionally
use lowercase Greek letters, particularly φ [“phi”] or ψ [“psi”], to denote not-necessarily-atomic propositions.) The following
recursive formulation may be helpful:

vars(p) = {p}
vars(¬φ) = vars(φ)
vars(φ ∧ ψ) = vars(φ) ∪ vars(ψ) and similarly for φ ∨ ψ, φ ⇒ ψ, φ ⊕ ψ, and φ ⇔ ψ.

3.40 (programming required) Given a proposition φ and a truth assignment for each variable in φ, evaluate whether φ is true or false
under this truth assignment.
3.41 (programming required) Given a proposition φ, compute the set of all truth assignments for the variables in φ that make φ true.
(One good approach: use your solution to Exercise 3.39 to compute all the variables in φ, then build the full list of truth assignments
for those variables, and then evaluate φ under each of these truth assignments using your solution to Exercise 3.40.)

ISTUDY
3.3 Propositional Logic: Some Extensions 95

3.3 Propositional Logic: Some Extensions


“Beauty is truth, truth beauty,”—that is all
Ye know on earth, and all ye need to know.
John Keats (1795–1821)
Ode on a Grecian Urn (1819)

With the definitions from Section 3.2 in hand, we turn to a few extensions, including some special
types of propositions and some special ways of representing propositions. Along the way, we’ll see a little
bit about circuits, some ways that modern programming languages use inferences about certain logical
expressions to allow you to write some conditional statements a little more slickly, and we’ll get a little
more practice thinking about logical implication.

3.3.1 Tautology and Satisfiability


Several important types of propositions are defined in terms of their truth tables: those that are always
true (tautologies), sometimes true (satisfiable propositions), or never true (unsatisfiable propositions, also
called contradictions). We will explore each of these types in turn.

Tautologies
We’ll start by considering propositions that are always true:

Definition 3.11: Tautology.


A proposition is a tautology if it is true under every truth assignment.

Etymologically, the word tautology comes from Greek taut “same” (to + auto) + logy “word.” Another meaning for the word
“tautology” (in real life, not just in logic) is the unnecessary repetition of an idea, as in a phrase like “a canine dog.” (The
etymology and the non-mathematical meaning are not totally removed from the usage in logic.)

One reason that tautologies are important is that we can use them to reason about logical statements, which
can be particularly valuable when we’re trying to prove a claim. Here are two important tautologies:

Example 3.17: Law of the Excluded Middle.


The truth table for the proposition p ∨ ¬p is shown in Figure 3.8a. The last column (corresponding to the
proposition p ∨ ¬p itself) is filled with “T”s, so p ∨ ¬p is a tautology.
The proposition p ∨ ¬p is called the Law of the Excluded Middle: for any proposition p, either p is true
or p is false; there is nothing “between” true and false.

(a) p ¬p p ∨ ¬p (b) p q p⇒q p ∧ (p ⇒ q) p ∧ (p ⇒ q) ⇒ q


T F T T T T T T
F T T T F F F T
F T T F T
F F T F T

Figure 3.8 Truth tables for two tautologies: (a) Law of the Excluded Middle, and (b) Modus Ponens.

ISTUDY
96 Logic

(p ⇒ q) ∧ p ⇒ q Modus Ponens (p ∨ q) ∧ ¬p ⇒ q
(p ⇒ q) ∧ ¬q ⇒ ¬p Modus Tollens (p ⇒ q) ∧ (¬p ⇒ q) ⇒ q
(p ⇒ q) ∧ (q ⇒ r) ⇒ (p ⇒ r)
p ∨ ¬p Law of the Excluded Middle
p ⇔ ¬¬p Double Negation (p ⇒ q) ∧ (p ⇒ r) ⇔ p ⇒ q ∧ r
p⇔p (p ⇒ q) ∨ (p ⇒ r) ⇔ p ⇒ q ∨ r
p ∧ (q ∨ r) ⇔ (p ∧ q) ∨ (p ∧ r)
p⇒p∨q
p ⇒ (q ⇒ r) ⇔ p ∧ q ⇒ r
p∧q⇒p

Figure 3.9 Some tautologies.

Our second example of a tautology is the proposition [p ∧ (p ⇒ q)] ⇒ q, called Modus Ponens: if we know
both that (a) p is true and that (b) the truth of p implies the truth of q, then we can conclude that q is true.
(“Modus Ponens” rhymes with “goad us phone-ins”; literally, it means “the mood that affirms” in Latin.)

Example 3.18: Modus Ponens.


The truth table for p ∧ (p ⇒ q) ⇒ q (with a few extra columns of “scratch work,” for each of the con-
stituent pieces of the desired final proposition) is shown in Figure 3.8b. There are only “T”s in the last
column, which establishes that Modus Ponens is a tautology.

Figure 3.9 contains a number of tautologies that you may find interesting and occasionally helpful.
(Exercises 3.63–3.75 ask you to verify that these propositions really are tautologies.) One terminological
note from Figure 3.9: Modus Tollens is the proposition [(p ⇒ q) ∧ ¬q] ⇒ ¬p, and it’s the counterpoint to
Modus Ponens: if we know both that (a) the truth of p implies the truth of q and that (b) q is not true, then
we can conclude that p cannot be true either. (Modus Tollens means “the mood that denies” in Latin.)

Satisfiable and unsatisfiable propositions


The next type of propositions we’ll consider are those that are sometimes true, and those propositions that
are never true:

Definition 3.12: Satisfiable propositions.


A proposition is satisfiable if it is true under at least one truth assignment.

Definition 3.13: Unsatisfiable propositions/contradictions.


A proposition is unsatisfiable if it is not satisfiable. Such a proposition is also called a contradiction.

If f is a truth assignment under which a proposition is true, then we say that the proposition is satisfied by
f. Thus a proposition is satisfiable if it is satisfied by at least one truth assignment, and it’s unsatisfiable if
it is not satisfied by any truth assignment. (And it’s a tautology if it is satisfied by every truth assignment.)

Example 3.19: Contradiction of p ⇔ q and p ⊕ q.


The truth table for (p ⇔ q) ∧ (p ⊕ q) is shown in Figure 3.10a. The last column of the truth table is all
False, which means that (p ⇔ q) ∧ (p ⊕ q) is unsatisfiable. Though it might not have been immediately
apparent when they were defined, the logical connectives ⊕ and ⇔ demand precisely opposite things of

ISTUDY
3.3 Propositional Logic: Some Extensions 97

(a) p q p⇔q p⊕q (p ⇔ q) ∧ (p ⊕ q) (b) p q p∨q ¬p ¬q ¬p ∧ ¬q p ∨ q ⇒ ¬p ∧ ¬q


T T T F F T T T F F F F
T F F T F T F T F T F F
F T F T F F T T T F F F
F F T F F F F F T T T T

Figure 3.10 Truth tables for two propositions, for Examples 3.19 and 3.20.

their arguments: the proposition p ⊕ q is true when p and q have different truth values, while p ⇔ q is
true when p and q have the same truth values. Because p and q cannot simultaneously have the same and
different truth values, the conjunction (p ⇔ q) ∧ (p ⊕ q) is a contradiction.

Example 3.20: Demanding satisfaction: p ∨ q ⇒ ¬p ∧ ¬q.


Is the proposition p ∨ q ⇒ ¬p ∧ ¬q satisfiable?
Solution. We’ll answer the question by building a truth table for the given proposition. Because there is
at least one “T” in the last column in the truth table shown in Figure 3.10b, the proposition is satisfiable.
Specifically, this proposition is satisfied by the truth assignment p = False, q = False. (Under this truth
assignment, the hypothesis p ∨ q is false; because false implies anything, the entire implication is true.)

It may be helpful to think about the general relationship between tautology and satisfiability. Let φ be
any proposition. (We occasionally denote generic propositions by lowercase Greek letters, particularly φ
[“phi”] or ψ [“psi”].) Then φ is a tautology exactly when ¬φ is unsatisfiable: φ is a tautology when the
truth table for φ is all “T”s, which happens exactly when the truth table for ¬φ is all “F”s. And that’s
precisely the definition of ¬φ being unsatisfiable!

Taking it further: While satisfiability seems like a pretty precise technical definition that wouldn’t matter all that much, the
satisfiability problem—given a proposition φ, determine whether φ is satisfiable—turns out to be at the heart of the biggest
open question in computer science today. If you figure out how to solve the satisfiability problem efficiently (or prove that it’s
impossible to solve efficiently), then you’ll be the most famous computer scientist of the century. See p. 104.

3.3.2 Logical Equivalence


We’ll now consider a relationship between pairs of propositions. When two propositions “mean the same
thing” (that is, they are true under precisely the same circumstances), they are called logically equivalent:

Definition 3.14: Logical equivalence.


Two propositions φ and ψ are logically equivalent, written φ ≡ ψ, if they have exactly identical truth
tables (in other words, their truth values are the same under every truth assignment).

To state it differently: two propositions φ and ψ are logically equivalent whenever φ ⇔ ψ is a tautology.
Here’s a small example of logical equivalence:

Example 3.21: ¬(p ∧ q) and (p ∧ q) ⇒ ¬q are logically equivalent.


In Example 3.15, we found that (p ∧ q) ⇒ ¬q is true except when p and q are both true. Thus ¬(p ∧ q)
is logically equivalent to (p ∧ q) ⇒ ¬q, as this truth table shows:

ISTUDY
98 Logic

p q (p ∧ q) ⇒ ¬q ¬(p ∧ q)
T T F F
T F T T
F T T T
F F T T

Writing tip: Now that we have a reasonable amount of experience in writing truth tables, we will permit ourselves to skip columns
when they’re both obvious and not central to the point of a particular example. When you’re writing anything—whether as a food
critic or a Shakespeare scholar or a computer scientist—you should always think about the intended audience, and how much
detail is appropriate for them.

Implication, converse, contrapositive, inverse, and mutual implication


We’ll now turn to an important question of logical equivalence that involves the proposition p ⇒ q and
three other implications derived from it:

Definition 3.15: Converse, contrapositive, and inverse.


The converse of an implication p ⇒ q is the proposition q ⇒ p.
The contrapositive of an implication p ⇒ q is the proposition ¬q ⇒ ¬p.
The inverse of an implication p ⇒ q is the proposition ¬p ⇒ ¬q.

These three new implications derived from the original implication p ⇒ q—particularly the converse and
the contrapositive—will arise frequently. Let’s compare the three new implications to the original in light
of logical equivalence:

Example 3.22: Implications, contrapositives, converses, inverses.


Consider the implication p ⇒ q. Which of the converse, contrapositive, and inverse of p ⇒ q are logically
equivalent to the original proposition p ⇒ q?
Solution. To answer this question, let’s build the truth table, which is shown in Figure 3.11. Thus p ⇒ q
is logically equivalent to its contrapositive ¬q ⇒ ¬p, but not to its inverse or its converse.

Here’s a real-world example to make these results more intuitive:

Example 3.23: Contrapositives, converses, inverses, and Turing Awards.


Consider the following (true!) proposition, of the form p ⇒ q:
p q
If you won the Turing Award in 2008 , then your name is Barbara .

(The 2008 Turing Award winner was Barbara Liskov, who was honored for her groundbreaking work on
principles of programming and programming languages, among other things.) The contrapositive of this
proposition is ¬q ⇒ ¬p, which is also true:
¬q ¬p
If your name isn’t Barbara , then you didn’t win the Turing Award in 2008 .

But the converse q ⇒ p and the inverse ¬p ⇒ ¬q are both blatantly false:

If your name is Barbara, then you won the Turing Award in 2008.
If you didn’t win the Turing Award in 2008, then your name isn’t Barbara.

ISTUDY
3.3 Propositional Logic: Some Extensions 99

proposition converse contrapositive inverse


p q p⇒q q⇒p ¬q ⇒ ¬p ¬p ⇒ ¬q
T T T T T T
T F F T F T
F T T F T F
F F T T T T

Figure 3.11 The truth table for an implication and its converse, its contrapositive, and its inverse.

Consider Barbara Kingsolver (best-selling author), Barbara Lee (long-time U.S. House member from Cal-
ifornia), Barbara McClintock (geneticist and 1983 Nobel Prize winner), and Barbara Walters (acclaimed
television interviewer): all named Barbara, and not one of them a Turing Award winner.

It’s worth emphasizing the results from Example 3.22: any implication p ⇒ q is logically equivalent to
its contrapositive ¬q ⇒ ¬p, but it is not logically equivalent to its converse q ⇒ p or its inverse ¬p ⇒ ¬q.
(You might notice, though, that the inverse and the converse are logically equivalent to each other. In fact,
the converse’s contrapositive—that is, the contrapositive of q ⇒ p—is ¬p ⇒ ¬q, which just is the inverse.)
Here’s another example of the concepts of tautology and satisfiability, as they relate to implications and
converses:

Example 3.24: Mutual implication.


Consider the conjunction of the implication p ⇒ q and its converse—that is, (p ⇒ q) ∧ (q ⇒ p). Is this
proposition a tautology? Satisfiable? Unsatisfiable? Is there a simpler proposition to which it’s logically
equivalent?
Solution. We can answer this question with a truth table:

p q p⇒q q⇒p (p ⇒ q) ∧ (q ⇒ p)
T T T T T
T F F T F
F T T F F
F F T T T

Because there is a “T” in its column, (p ⇒ q) ∧ (q ⇒ p) is satisfiable (and thus isn’t a contradiction).
But that column does contain an “F” as well, and therefore (p ⇒ q) ∧ (q ⇒ p) is not a tautology.
Notice that the truth table for (p ⇒ q) ∧ (q ⇒ p) is identical to the truth table for p ⇔ q. (See Figure
3.3.) Thus p ⇔ q and (p ⇒ q) ∧ (q ⇒ p) are logically equivalent. (And ⇔ is called mutual implication
for this reason: p and q imply each other.)

Some other logically equivalent statements


Figure 3.12 contains a large collection of logical equivalences. (It’s worth taking a few minutes to think
about why each logical equivalence holds. See also Exercises 3.76–3.85.)
Here’s a brief description of some of the terminology. (Figure 3.12 contains a lot of new words to handle
all at once—sorry!) Informally, an operator is commutative if the order of its arguments doesn’t matter; an
operator is associative if the way we parenthesize successive applications doesn’t matter; and an operator
is idempotent (Latin: idem “same” + potent “strength”) if applying it to the same argument twice gives

ISTUDY
100 Logic

Commutativity p∨q ≡ q∨p Distribution of ∧ over ∨ p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)


p∧q ≡ q∧p Distribution of ∨ over ∧ p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r)
p⊕q ≡ q⊕p
Contrapositive p ⇒ q ≡ ¬q ⇒ ¬p
p⇔q ≡ q⇔p
p ⇒ q ≡ ¬p ∨ q
Associativity p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r
p ⇒ (q ⇒ r) ≡ p ∧ q ⇒ r
p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r
p ⇔ q ≡ ¬p ⇔ ¬q
p ⊕ (q ⊕ r) ≡ (p ⊕ q) ⊕ r
p ⇔ (q ⇔ r) ≡ (p ⇔ q) ⇔ r Mutual Implication (p ⇒ q) ∧ (q ⇒ p) ≡ p ⇔ q
Idempotence p∨p ≡ p De Morgan’s Laws ¬(p ∧ q) ≡ ¬p ∨ ¬q
p∧p ≡ p ¬(p ∨ q) ≡ ¬p ∧ ¬q

Figure 3.12 Some logically equivalent propositions. De Morgan’s Laws are named after Augustus De Morgan
(1806–1871), a British mathematician who not only made major contributions to the study of logic but was also the
math tutor of Ada Lovelace (1815–1852), “the first computer programmer.” (See Figure 2.15.)

that argument back. (There are two other frequently discussed concepts: the identity and the zero of the
operator; logical equivalences involving identities and zeros were left to you, in Exercises 3.13–3.22.)

Taking it further: There are at least two ways in which logical equivalences (like those in Figure 3.12) play an important role in
programming. First, most modern languages have a feature called short-circuit evaluation of logical expressions—they evaluate
conjunctions and disjunctions from left to right, and stop as soon as the truth value of the logical expression is known—and
programmers can exploit this feature to make their code cleaner or more efficient. Second, in compiled languages, an optimizing
compiler can make use of logical equivalences to simplify the machine code that ends up being executed. See p. 105.

3.3.3 Representing Propositions: Circuits and Normal Forms


Now that we’ve established the core concepts of propositional logic, we’ll turn to some bigger and more
applied questions. We’ll spend the rest of this section exploring two specific ways of representing proposi-
tions: circuits, the wires and connections from which physical computers are built; and two normal forms,
in which the structure of propositions is restricted in a particular way.
The approach we’re taking with normal forms is a commonly used idea to make reasoning about some
language L easier: we define a subset S of L, with two goals: (1) any statement in L is equivalent to some
statement in S; and (2) S is “simple” in some way. Then we can consider any statement from the “full”
language L, which we can then “translate” into a simple-but-equivalent statement of S. Defining this subset
and its accompanying translation will make it easier to accomplish some task for all expressions in L, while
still making it easy to write statements clearly.

Taking it further: The idea of translating all propositions into a particular form has a natural analogue in designing and imple-
menting programming languages. For example, every for loop can be expressed as a while loop instead, but it would be very
annoying to program in a language that doesn’t have for loops. A nice compromise is to allow for loops, but behind the scenes
to translate each for loop into a while loop. This compromise makes the language easier for the “user” programmer to use (for
loops exist!) and also makes the job of the programmer of the compiler/interpreter easier (she can worry exclusively about
implementing and optimizing while loops!).
In programming languages, this translation is captured by the notion of syntactic sugar. (The phrase is meant to suggest that the
addition of for to the language is a bonus for the programmer—“sugar on top,” maybe—that adds to the syntax of the language.)
The programming language Scheme is perhaps the pinnacle of syntactic sugar; the core language is almost unbelievably simple.
Here’s one illustration: (and x y) (Scheme for “x ∧ y”) is syntactic sugar for (if x y #f) (that’s “if x then y else false”). So
a Scheme programmer can use and, but there’s no “real” and that has to be handled by the interpreter.

ISTUDY
3.3 Propositional Logic: Some Extensions 101

(a) ∨ (b) p

∧ ∧ ¬ ∧

p ¬ ¬ q ¬ ∧
q p q

Figure 3.13 The proposition (p ∧ ¬q) ∨ (¬p ∧ q) both (a) as a tree, and (b) as a circuit.

Circuits
We’ll introduce the idea of circuits by using the proposition (p ∧ ¬q) ∨ (¬p ∧ q) as a particular example.
(Note, by the way, that this proposition is logically equivalent to p ⊕ q.)
This proposition is a disjunction of two smaller propositions, p ∧ ¬q and ¬p ∧ q. Similarly, p ∧ ¬q is a
conjunction of two even simpler propositions, namely p and ¬q. A representation of a proposition called
a tree continues to break down every compound proposition embedded within it. (We’ll talk about trees in
detail in Chapter 11.) See Figure 3.13a. The tree-based view isn’t much of a change from our usual notation
(p ∧ ¬q) ∨ (¬p ∧ q); all we’ve done is use the parentheses and order-of-operation rules to organize the
logical connectives, and then we’ve arranged that organization in a hierarchical layout.
This tree representation is closely related to another important way of viewing logical expressions:
circuits, which represent a proposition as a collection of wires and gates. Wires carry a truth value from one
physical location to another; gates are physical implementations of logical connectives. See Figure 3.13b.
In the circuit-based view, we can think of truth values “flowing in” as inputs to the left side of each gate,
and a truth value “flowing out” as output from the right side of the gate. (The only substantive difference
between the tree and the circuit—aside from which way is up—is whether the two p inputs come from the
same wire, and likewise whether the two q inputs do.)

Example 3.25: Using and and not for or.


Build a circuit for p ∨ q using only ∧ and ¬ gates.
Solution. We’ll use one of De Morgan’s Laws, which says that p ∨ q ≡ ¬(¬p ∧ ¬q):

p ¬
∧ ¬
q ¬

This basic idea—replacing one logical connective by another one (or by multiple other ones)—is crucial
to the construction of computers themselves; we’ll return to this idea in Section 4.4.1.

Conjunctive and disjunctive normal forms


In the rest of this section, we’ll consider a way to simplify the structure of propositions: conjunctive and
disjunctive normal forms, which constrain propositions to have a particular format.
To define these restricted types of propositions, we first need a basic definition. A literal is a Boolean
variable (a.k.a. an atomic proposition) or the negation of a Boolean variable. (So p and ¬p are both literals.)

ISTUDY
102 Logic

Definition 3.16: Conjunctive normal form.


A proposition is in conjunctive normal form (CNF) if it is the conjunction of one or more clauses, where
each clause is the disjunction of one or more literals.

Definition 3.17: Disjunctive normal form.


A proposition is in disjunctive normal form (DNF) if it is the disjunction of one or more clauses, where
each clause is the conjunction of one or more literals.

Less formally, a proposition in conjunctive normal form is “the and of a bunch of ors,” and a proposition
in disjunctive normal form is “the or of a bunch of ands.”

Taking it further: In computer architecture and digital electronics, people usually refer to a proposition in CNF as being a
product of sums, and a proposition in DNF as being a sum of products. (There is a deep way of thinking about formal logic based
on ∧ as multiplication, ∨ as addition, 0 as False, and 1 as True; see Exercises 3.23–3.26.)

Here are small examples of both CNF and DNF:

Example 3.26: Small propositions in CNF and DNF.


Here are two propositions, one in conjunctive normal form and one in disjunctive normal form:
     
conjunctive normal form: ¬p ∨ q ∨ r ∧ ¬q ∨ ¬r ∧ r
clause #1 clause #2 clause #3

       
disjunctive normal form: ¬p ∧ q ∧ r ∨ ¬q ∧ ¬r ∧ s ∨ r ∨ p ∧ ¬r
clause #1 clause #2 clause #3 clause #4

While conjunctive and disjunctive normal forms seem like heavy restrictions on the format of proposi-
tions, every proposition is logically equivalent to a CNF proposition and to a DNF proposition:

Theorem 3.18: All propositions are expressible in CNF.


For any proposition φ, there is a proposition φcnf over the same Boolean variables and in conjunctive
normal form such that φ ≡ φcnf .

Theorem 3.19: All propositions are expressible in DNF.


For any proposition φ, there is a proposition ψdnf over the same Boolean variables and in disjunctive
normal form such that φ ≡ ψdnf .

These two theorems are perhaps the first results that we’ve encountered that are unexpected, or at least
unintuitive. There’s no particular reason for it to be clear that they’re true—let alone how we might prove
them. But we can, and we will: we’ll prove both theorems in Section 4.4.1 and again in Section 5.4.3, after
we’ve introduced some relevant proof techniques. But, for now, here are a few examples of translating
propositions into DNF/CNF.

Problem-solving tip: A good strategy when you’re trying to prove a not-at-all-obvious claim is to test out some small examples,
and then try to start to figure a general pattern. In Examples 3.27 and 3.28, we’ll try out a few examples of converting (relatively
simple) propositions into logically equivalent propositions that are in CNF/DNF. We’ll figure out how to generalize this technique
to any proposition in Section 4.4.1.

ISTUDY
3.3 Propositional Logic: Some Extensions 103

Example 3.27: Translating basic connectives into DNF.


Give propositions in disjunctive normal form that are logically equivalent to each of the following:
1 p∨q
2 p∧q
3 p⇒q
4 p⇔q

Solution. Here are translations into disjunctive normal form:


1 This question is boring: p ∨ q is already in DNF, with two clauses (clause #1 is p; clause #2 is q).
2 This question is boring, too: p ∧ q is also already in DNF, with a single clause (p ∧ q).
3 Figure 3.12 tells us that p ⇒ q ≡ ¬p ∨ q, and ¬p ∨ q is in DNF.
4 The proposition p ⇔ q is true when p and q are either both true or both false. So we can rewrite p ⇔ q
as (p ∧ q) ∨ (¬p ∧ ¬q). We can check that we’ve gotten it right with a truth table:
p q p∧q ¬p ∧ ¬q (p ∧ q) ∨ (¬p ∧ ¬q) p⇔q
T T T F T T
T F F F F F
F T F F F F
F F F T T T

Thus p ⇔ q ≡ (p ∧ q) ∨ (¬p ∧ ¬q).

And here’s the analogous task of translating a few of the logical connectives into CNF. (As with DNF, both
p ∨ q and p ∧ q are already in CNF, so we’ll skip them in this example.)

Example 3.28: Translating basic connectives into CNF.


Give propositions in conjunctive normal form that are logically equivalent to each of the following:
1 p⇒q
2 p⇔q
3 p⊕q

Solution. Here are translations into conjunctive normal form:


1 As in Example 3.27, we know that p ⇒ q ≡ ¬p ∨ q, and ¬p ∨ q is in CNF (with one clause), too.
2 We can rewrite p ⇔ q as follows:
p ⇔ q ≡ (p ⇒ q) ∧ (q ⇒ p) mutual implication (Example 3.24)
≡ (¬p ∨ q) ∧ (¬q ∨ p). x ⇒ y ≡ ¬x ∨ y (Figure 3.12), used twice

The proposition (¬p ∨ q) ∧ (¬q ∨ p) is in CNF.


3 Because p ⊕ q is true as long as one of {p, q} is true and one of {p, q} is false, we can verify via truth
table that p ⊕ q ≡ (p ∨ q) ∧ (¬p ∨ ¬q), which is in CNF.

ISTUDY
104 Logic

Computer Science Connections

Computational Complexity, Satisfiability, and a Million Dollars


Complexity theory is the subfield of computer science devoted to understanding the computational resources—time
and memory, usually—necessary to solve particular problems. It’s the subject of a great deal of fascinating current
research in theoretical computer science. Here’s a little bit of a flavor of that work.
One of the central problems of complexity theory is the satisfiability problem, in which you’re given a proposition
φ over n variables, and asked to determine whether φ is satisfiable. This problem is pretty simple to solve. In fact,
we’ve implicitly described an algorithm for it already: you just construct the truth table for the n-variable proposition
φ, and then check to see whether there are any “T”s in φ’s column of the table. But this algorithm is not very fast,
because the truth table for φ has lots and lots of rows—2n rows, to be precise. Even a moderate value of n means
that this algorithm will not terminate in your lifetime; 2300 exceeds the number of particles in the known universe.
So, it’s clear that there is an algorithm that solves the SAT problem. What’s not clear is whether there is a sub-
stantially more efficient algorithm to solve the SAT problem. It’s so unclear, in fact, that literally nobody knows the
answer, and this question is one of the biggest open problems in computer science and mathematics today. (Arguably,
it’s the biggest.) The Clay Mathematics Institute will even give a $1,000,000 prize to anyone who solves it.
Why is this problem so important? The reason is
that, in a precise technical sense, SAT is just as hard
as a slew of other problems that have a plethora of 1 6 2 1 Define 9 different Boolean variables
7 5 3 1
unspeakably useful applications: the traveling sales- 3 8 7
corresponding to this (and every)
person problem, protein folding, optimally packing the 8 7 6 cell, where x4,6,v means “put v in
9 8 6 5
trunk of a car with suitcases. (Or see Figure 3.14.) This the cell in row 4, column 6.”
2 4 7
slew is a class of computational problems known as 7 4 8
9 5 3 1
NP (“nondeterministic polynomial time”), for which it 8 2 3
is easy to “verify” correct answers. In the context of x ∨x ∨x ∨
SAT, that means that whenever you’ve got a satisfiable cell ⟨1, 1⟩ has a x11,,11,,14 ∨ x11,,11,,25 ∨ x11,,11,,36 ∨
2 Translate the value . . . x1,1,7 ∨ x1,1,8 ∨ x1,1,9
proposition φ, it’s very easy for you to (efficiently) rules of ∧
convince me that φ is satisfiable. Here’s how: you’ll ... ...
Sudoku into a ¬(x1,1,1 ∧ x1,1,2 ) ∧
simply tell me a truth assignment under which φ eval- logical . . . but not ¬(x
1,1,1 ∧ x1,1,3 ) ∧
uates to true. And I can make sure that you didn’t try more than one. · · · ∧ ¬(x1,1,8 ∧ x1,1,9 )
proposition. ∧
to fool me by plugging and chugging: I substitute your ... ...
truth assignment in for every variable, and then I make x ∨x
1 appears in the x4,7,1 ∨ x4,8,1 ∨ x4,9,1 ∨
∨x ∨
5,7,1 5,8,1 5,9,1
sure that the final truth value of φ is indeed True. middle-right subgrid. x6,7,1 ∨ x6,8,1 ∨ x6,9,1
One of the most important results in theoretical ∧
... ...
computer science in the twentieth century—that’s say- solution uses all x1,1,1 ∧ x1,2,6 ∧
x1,1,1 = T
ing something for a field that was founded in the given numbers. x1,5,2 ∧ · · · ∧ x9,9,3
x1,1,2 = F
twentieth century!—is the Cook–Levin Theorem: if you x1,1,3 = F
can solve SAT efficiently, then you can solve any prob-
. 3 Find a satisfying assignment for the
lem in NP efficiently. The major open question is what’s .
.
known as the P-versus-NP question. A problem that’s x9,9,8 = F resulting proposition (which can be
in P is easy to solve from scratch. A problem that’s in x9,9,9 = F interpreted as a Sudoku solution).
NP is easy to verify (in the way described above). So
the question is: does P = NP? Is verifying an answer Figure 3.14 Solving Sudoku using satisfiability. This type
to a problem no easier than solving the problem from of translation is the main reason that SAT is so important: a
scratch? (It seems intuitively “clear” that the answer is huge range of (important and practical) problems can be
no—but nobody has been able to prove it!)
“translated” into satisfiability, so that an efficient solution to
(You can read more about complexity theory in gen-
SAT implies an efficient solution to all of them, too.
eral, and the P-versus-NP question addressed here in
particular, in most books on algorithms or the theory
of computing. The original results on NP-completeness are from Stephen Cook [31] and Leonid Levin [79]; the first
translations of other problems are due to Richard Karp [68].)

ISTUDY
3.3 Propositional Logic: Some Extensions 105

Computer Science Connections

Short-Circuit Evaluation, Optimization, and Modern Compilers


The logical equivalences in Figure 3.12 may seem far removed from “real” programming, but logical equiva-
lences are actually central in modern programming. Here are two ways in which they play an important role:

Short-circuit evaluation: In most modern program-


1 if (2 > 3 && x + y < 9)
ming languages, a logical expression involving ands 2 ...
and ors will only be evaluated until the truth value 3 } else {
of the expression can be determined. For an exam- 4 ... && is Java notation for ∧;
5 } || is Java notation for ∨.
ple in Java, see Figure 3.15a. Like most mod-
ern languages, Java evaluates an ∧ expression from
(a) The second conjunct of the if will never be evaluated:
left to right and stops as soon as it finds a false 2 > 3 is false, and False ∧ anything ≡ False.
conjunct. Similarly, Java evaluates an ∨ expression
from left to right and stops as soon as it finds 1 if (x == 0 || (x-1)/x > 0.5) {

a true disjunct. These simplifications rely on facts 2 ...


from propositional logic: False ∧ anything ≡ False 3 }

and True ∨ anything ≡ True. This style of evaluation


4 if (simpleOrOftenFalse(x)
is called short-circuit evaluation.
5 && complexOrOftenTrue(x)) {
Two slick ways in which programmers can take 6 ...
advantage of short-circuit evaluation are shown in Fig- 7 }

ure 3.15b. First, lines 1–3 use short-circuit evaluation to


avoid deeply nested if statements to handle exceptional (b) Two handy ways to rely on short-circuit evaluation.
cases. When x = 0, evaluating the second disjunct Figure 3.15 Short-circuit evaluation, illustrated in Java.
would cause a divide-by-zero error—but the second
disjunct isn’t evaluated when x = 0 because the first disjunct was true! Second, lines 4–7 use short-circuit evaluation
to make code faster. If the second conjunct typically takes much longer to evaluate (or if it is much more frequently
true) than the first conjunct, then careful ordering of conjuncts avoids a long and usually fruitless computation.

Compile-time optimization: For a program written in a compiled language like C, the source code is translated into
machine-readable form by the compiler. But this translation is not verbatim; instead, the compiler streamlines your
code (when it can!) to make it run faster.
One of the simplest types of compiler optimizations
is constant folding: if some of the values in an arith- 1 if (p || !p)

metic or logical expression are constants—known to 2 x = 51;


3 } else {
the compiler at “compile time,” and thus unchanged at 4 x = 63;
“run time”—then the compiler can “fold” those con- 5 }
(p || !p) denotes p ∨ ¬p in C.
stants together. Using the rules of logical or arithmetic
equivalence broadens the types of code that can be 1 x = 51;
folded in this way. For example, in C, when you write
an assignment statement like y = x + 2 + 3, most Figure 3.16 Two snippets of C code. When this code is
compilers will translate it into y = x + 5. Similarly,
compiled on a modern optimizing compiler (a recent
for z = 7 * x * 8, a modern compiler will be able
version of gcc, with optimization turned on), the machine
to optimize it into z = x * 56, using the commutativ-
ity of multiplication. Because the compiler can reorder code that is produced is exactly identical for both snippets.
the multiplicands without affecting the value, and this
reordering allows the 7 and 8 to be folded into 56, the compiler does the reordering and the folding.
An example using logical equivalences is shown in Figure 3.16. Because p ∨ ¬p is a tautology—the Law of the
Excluded Middle—no matter what the value of p, the “then” clause is executed, not the “else” clause. Thus the
compiler doesn’t even have to waste time checking whether p is true or false, and this optimization can be applied.

ISTUDY
106 Logic

EXERCISES
The operators ∧ and ∨ are idempotent—in other words, p ∧ p ≡ p ∨ p ≡ p. But ⇒, ⊕, and ⇔ are not idempotent.
3.42 Simplify the expression p ⇒ p (that is, give an as-simple-as-possible logically equivalent proposition).
3.43 Repeat for p ⊕ p.
3.44 Repeat for p ⇔ p.

3.45 Add parentheses to the proposition p ⇒ ¬p ⇒ p ⇒ q so that the resulting proposition is a tautology.
3.46 Add parentheses to the proposition p ⇒ ¬p ⇒ p ⇒ q so that the resulting proposition is logically equivalent to q.
3.47 Give as simple as possible a proposition that is logically equivalent to the (unparenthesized) original proposition p ⇒ ¬p ⇒ p ⇒ q.
3.48 Unlike the logical connectives ∧, ∨, ⊕, and ⇔, implication is not associative. In other words, p ⇒ (q ⇒ r) and (p ⇒ q) ⇒ r are
not logically equivalent. Prove it: give a truth assignment in which p ⇒ (q ⇒ r) and (p ⇒ q) ⇒ r have different truth values.
3.49 Consider the propositions p ⇒ (q ⇒ q) and (p ⇒ q) ⇒ q. One of these is a tautology; the other isn’t. Which is which? Explain.
3.50 Consider the propositions p ⇒ (p ⇒ q) and (p ⇒ p) ⇒ q. Is either one a tautology? Satisfiable? Unsatisfiable? What is the
simplest proposition to which each is logically equivalent?

On an exam, I once asked students to write a proposition logically equivalent to p ⊕ q using only the logical connectives ⇒, ¬,
and ∧. Here are some of the students’ answers. Which ones are right?
3.51 ¬(p ∧ q) ⇒ (¬p ∧ ¬q)
3.52 (p ⇒ ¬q) ∧ (q ⇒ ¬p)
3.53 (¬p ⇒ q) ∧ ¬(p ∧ q)
3.54 ¬ [(p ∧ ¬q ⇒ ¬p ∧ q) ∧ (¬p ∧ q ⇒ p ∧ ¬q)]
3.55 Write a proposition logically equivalent to p ⊕ q using only the logical connectives ⇒, ¬, and ∨.

3.56 Simplify the code in Figure 3.17a as much as possible. (For example, if p ⇒ q, it’s a waste of time to test whether q holds in a
block where p is known to be true.)
3.57 Repeat for the code in Figure 3.17b.
3.58 Repeat for the code in Figure 3.17c.
3.59 Simplify (¬p ⇒ q) ∧ (q ∧ p ⇒ ¬p) as much as possible.
3.60 Repeat for (p ⇒ ¬p) ⇒ ((q ⇒ (p ⇒ p)) ⇒ p).
3.61 Repeat for (p ⇒ p) ⇒ (¬p ⇒ ¬p) ∧ q.
3.62 Claim: Every proposition over the single variable p is either logically equivalent to p or it’s logically equivalent to ¬p. Is this
claim true or false? Prove your answer.

Show using truth tables that these propositions (mostly from Figure 3.9) are tautologies:
3.63 (p ⇒ q) ∧ ¬q ⇒ ¬p (Modus Tollens) 3.70 (p ⇒ q) ∨ (p ⇒ r) ⇔ p ⇒ q ∨ r
3.64 p⇒p∨q 3.71 p ∧ (q ∨ r) ⇔ (p ∧ q) ∨ (p ∧ r)
3.65 p∧q⇒p 3.72 p ⇒ (q ⇒ r) ⇔ p ∧ q ⇒ r
3.66 (p ∨ q) ∧ ¬p ⇒ q 3.73 p ∨ (p ∧ q) ⇔ p
3.67 (p ⇒ q) ∧ (¬p ⇒ q) ⇒ q 3.74 p ∧ (p ∨ q) ⇔ p
3.68 (p ⇒ q) ∧ (q ⇒ r) ⇒ (p ⇒ r) 3.75 p ⊕ q ⇒ p ∨ q
3.69 (p ⇒ q) ∧ (p ⇒ r) ⇔ p ⇒ q ∧ r

(a) 1 if x > 20 or (x ≤ 20 and y < 0) then (c) 1 if x mod 12 = 0 then


2 foo(x, y) 2 if x mod 4 ≠ 0 then
3 else 3 foo(x, y)
4 bar(x, y) 4 else Recall that x mod k is the
5 bar(x, y) remainder when dividing x by k.
(b) 1 if y ≥ 0 or y ≤ x or (x−y)·y ≥ 0 then So x mod k = 0 when k | x—that
2 foo(x, y) 6 else is, when x is evenly divisible by k.
3 else 7 if x = 17 then
4 bar(x, y) 8 baz(x, y)
9 else
10 quz(x, y)

Figure 3.17 Some code that uses nested conditionals, or compound propositions as conditions.

ISTUDY
Exercises 107

Show using truth tables that the following logical equivalences from Figure 3.12 hold:
3.76 ¬(p ∧ q) ≡ ¬p ∨ ¬q (De Morgan’s Law) 3.81 p ⇔ (q ⇔ r) ≡ (p ⇔ q) ⇔ r (associativity of ⇔)
3.77 ¬(p ∨ q) ≡ ¬p ∧ ¬q (De Morgan’s Law) 3.82 p ⇒ q ≡ ¬p ∨ q
3.78 p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r (associativity of ∨) 3.83 p ⇒ (q ⇒ r) ≡ p ∧ q ⇒ r
3.79 p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r (associativity of ∧) 3.84 p ⇔ q ≡ ¬p ⇔ ¬q
3.80 p ⊕ (q ⊕ r) ≡ (p ⊕ q) ⊕ r (associativity of ⊕) 3.85 ¬(p ⇒ q) ≡ p ∧ ¬q

3.86 Optimizing compilers (see p. 105) will perform the following optimization, transforming the first block of C code into the second:

1 if (p || !p) { /* "p or not p" */ 1 x = 51;


2 x = 51;
3 } else {
4 x = 63;
5 }

The compiler performs this transformation because p ∨ ¬p is a tautology: no matter the truth value of p, the proposition p ∨ ¬p is
true. But there are situations in which this code translation actually changes the behavior of the program, if p can be an arbitrary
expression (rather than just a Boolean variable)! Describe such a situation. (Hint: why do (some) people watch auto racing?)

3.87 A very mild pet peeve of mine: code that uses Option A for some Boolean-valued expression p instead of Option B.

Option A Option B
1 if p = True then A else B 1 if p then A else B

State and prove the tautology that establishes the equivalence of Option A and Option B.

See Figure 3.18 for the description of a circuit. For each of the following exercises, draw a circuit with at most three gates that
is consistent with the listed behavior. The light’s status is unknown for unlisted inputs. (If multiple circuits are consistent with the
given behavior, choose one with as few gates as possible.)
3.88 The light is on when the true inputs are {q} or {r}. It is off when the true inputs are {p} or {p, q} or {p, q, r}.
3.89 The light is on when the true inputs are {p, q} or {p, r}. It is off when the true inputs are {p} or {q} or {r}.
3.90 The light is on for at least one input setting. It is off when the true inputs are {p} or {q} or {r} or {p, q, r}.
3.91 The light is on for at least two input settings. It is off when the true inputs are {p, q} or {p, r} or {q, r} or {p, q, r}.
3.92 The light is always off, no matter what the input.

3.93 Consider a simplified class of circuits like those from Exercises 3.88–3.92: there are two inputs {p, q} and at most two gates, each
of which is ∧, ∨, or ¬. There are a total of 24 = 16 distinct propositions over inputs {p, q}: four different input configurations,
each of which can turn the light on or leave it off. Which, if any, of these 16 propositions cannot be expressed using up to two
{∧, ∨, ¬} gates?
3.94 (programming required) Consider the class of circuits from Exercises 3.88–3.92: inputs {p, q, r}, and at most three gates chosen
from {∧, ∨, ¬}. There are a total of 28 = 256 distinct propositions over inputs {p, q, r}: eight different input configurations, each
of which can turn the light on or leave it off. Write a program to determine how many of these 256 propositions can be represented
by a circuit of this type. (If you design it well, your program will let you check your answers to Exercises 3.88–3.93.)

3.95 Consider a set S = {p, q, r, s, t} of Boolean variables. Let φ = p ⊕ q ⊕ r ⊕ s ⊕ t. Describe briefly the conditions under which
φ is true. Use English and, if appropriate, standard (nonlogical) mathematical notation. (Hint: look at the symbol ⊕ itself. What’s
p + q + r + s + t, if you treat true as 1 and false as 0 as in Exercises 3.23–3.26?)

3.96 Dithering is a technique for converting grayscale images to black-and-white images (for printed media like newspapers). The classic
dithering algorithm proceeds as follows. For every pixel in the image, going from top to bottom (“north to south”), and from left
to right (“west to east”):
• “Round” the current pixel to black or white. (If it’s closer to black, make it black; if it’s closer to white, make it white.)
• This alteration to the current pixel has some created “rounding error”: we have added some “whiteness units” by making it
white, or removed some “whiteness units” by making it black. We compensate for this error by adding/removing the same total
number of “whiteness units,” distributed (in a particular way) among the neighboring pixels.
I assigned a dithering exercise in an introductory CS class, and I got, more or less, the code in Figure 3.19 from one student. This
code is correct, but it is very repetitious. Reorganize this code so that it’s not so repetitive—in particular, ensure that each “distribute
the error” operation (E, S, SE, and SW) appears only once in your solution.

ISTUDY
108 Logic

p
q unknown circuit with ≤ 3 gates
r

Figure 3.18 An unknown circuit that takes three inputs {p, q, r}, and either turns on a light bulb (output of the circuit =
true) or leaves it off (output = false). The circuit contains at most three gates, each of which is an ∧, ∨, or ¬ gate.

“Distributing the error” from pixel ⟨x, y⟩ means taking the amount by
1 for y = 1, 2, . . . , height:
which P[x, y] changed, and counteracting that change so that the total
2 for x = 1, 2, . . . , width: brightness of the image is unchanged. Specifically, the amount of change
7
3 if P[x, y] is more white than black then (stored in error, which may be negative) is distributed 16 to the eastern
5 3 1
neighboring pixel, 16 to the south, 16 to the south east, and 16 to the
4 error := “white” − P[x, y] south west. If any of these neighboring pixels don’t exist (because the
5 P[x, y] := “white” current pixel is on the border of the image), simply ignore the
6 if x > 1 then corresponding fraction of error (and don’t add it anywhere).
7 if x < width and y ̸< height then distribute E. Distribute E:
8 if x < width and y < height then distribute S, SE, SW, and E. increase P[x + 1, y] by 7
16 · error
9 if x ̸< width and y < height then distribute S and SW. Distribute S:
10 else increase P[x + 1, y] by 5
16 · error
11 if x < width and y ̸< height then distribute E. Distribute SE:
12 if x < width and y < height then distribute S, SE, and E. increase P[x + 1, y] by 3
16 · error
13 if x ̸< width and y < height then distribute S. Distribute SW:
14 else increase P[x + 1, y] by 1
16 · error
15 error := “black” − P[x, y]
16 P[x, y] := “black”
17 [exact duplication of lines 6–13]

Figure 3.19 Some (inefficient) dithering code.

Recall Definition 3.16: a proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of one or more clauses, where
each clause is the disjunction of one or more literals, and where a literal is an atomic proposition or its negation. Definition 3.17
says that φ is in disjunctive normal form (DNF) if φ is the disjunction of one or more clauses, where each clause is the conjunction
of one or more literals. Give a proposition that’s logically equivalent to each of the following, in the specified normal form.
3.97 DNF: ¬(p ∧ q) ⇒ r 3.101 CNF: ¬(p ∧ q) ⇒ r
3.98 DNF: p ∧ (q ∨ r) ⇒ (q ∧ r) 3.102 CNF: p ∧ (q ⇒ (r ⇒ q ⊕ r))
3.99 DNF: p ∨ ¬(q ⇔ p ∧ r) 3.103 CNF: (p ⇒ q) ⇒ (q ⇒ r ∧ p)
3.100 DNF: p ⊕ (¬p ⇒ (q ⇒ r) ∧ ¬r) 3.104 CNF: p ⇔ (q ∨ r ∨ ¬p)

A CNF proposition φ is in 3CNF if each clause contains exactly three distinct literals. (Note that p and ¬p are distinct literals.)
Similarly, a proposition φ is in 3DNF if it is the disjunction of one or more clauses, each of which is the conjunction of exactly
three distinct literals.
3.105 In terms of the number of clauses, what’s the smallest 3CNF formula that’s a tautology?
3.106 In terms of the number of clauses, what’s the smallest 3CNF formula that’s not satisfiable?
3.107 In terms of the number of clauses, what’s the smallest 3DNF formula that’s a tautology?
3.108 In terms of the number of clauses, what’s the smallest 3DNF formula that’s not satisfiable?

Consider 3CNF propositions over the variables {p, q, r} for which no clause appears more than once. (The following exercises
are boring if we omit the restriction about repeating clauses; we could repeat the same clause as many times as we please:
(p ∨ q ∨ r) ∧ (p ∨ q ∨ r) ∧ (p ∨ q ∨ r) · · ·.) Two clauses that contain precisely the same literals (in any order) do not count as
distinct. (But recall that a single clause can contain both a variable and its negation.)
3.109 In terms of the number of clauses, what’s the largest 3-variable distinct-clause 3CNF proposition (at all)?
3.110 In terms of the number of clauses, what’s the largest 3-variable distinct-clause 3CNF proposition that’s a tautology?
3.111 In terms of the number of clauses, what’s the largest 3-variable distinct-clause 3CNF proposition that’s satisfiable?

ISTUDY
3.4 An Introduction to Predicate Logic 109

3.4 An Introduction to Predicate Logic


Ont pû tromper quelques hommes, ou les tromper tous dans certains lieux & en certains
tems, mais non pas tous les hommes, dans tous les lieux & dans tous les siécles.
One can fool some men, or fool all men in some places and times, but one cannot fool all
men in all places and ages.
Jacques Abbadie (c. 1654–1727)
Traité de la Vérité de la Religion Chrétienne (1684)

Propositional logic gives us formal notation to encode Boolean expressions. But these expressions are
relatively simple, a sort of “unstructured programming” style of logic. Predicate logic is a more general
type of logic that allows us to write function-like logical expressions called predicates, and to express a
broader range of notions than in propositional logic.

3.4.1 Predicates
Informally, a predicate is a property that a particular entity might or might not have; for example, being a
vowel is a property that some letters have (A, E, . . .) and some letters do not have (B, C, . . .). A predicate
isn’t the kind of thing that’s true or false, so predicates are different from propositions; rather, a predicate
is like a “proposition with blanks” waiting to be filled in. Here are a few examples of predicates:

Example 3.29: Some predicates.

• “The integer is prime.”


• “The string is a palindrome.”
• “The person costarred in a movie with Kevin Bacon.”
• “The string is alphabetically after the string .”
• “The integer evenly divides the integer .”

Once the blanks of a predicate are filled in, the resulting expression is a proposition. Here are some
propositions—some true, some false—derived from the predicates in Example 3.29:

Example 3.30: Some propositions derived from Example 3.29.

• “The integer 57 is prime.”


• “The string TENET is a palindrome.”
• “The person Helen Mirren costarred in a movie with Kevin Bacon.”
• “The string PYTHON is alphabetically after the string PYTHAGOREAN.”
• “The integer 17 evenly divides the integer 42.”

Here is the formal definition of predicates:

Definition 3.20: Predicate.


A predicate P is a function that assigns the value True or False to each element of a set U. (That is, a
predicate P is a function P : U → {True, False}.) The set U is called the universe or domain of discourse,
and we say that P is a predicate over U.

ISTUDY
110 Logic

Although we didn’t use the name at the time, we’ve already encountered predicates, when talked about
sets in Chapter 2: Definition 2.20 introduced the notation {x ∈ U : P(x)} to denote the set of those objects
x ∈ U for which P is true. This set-abstraction notation “selects” those elements of the set U for which the
predicate P is true.
Here are a few more sample predicates based on arithmetic:

Example 3.31: Some more example predicates.

1 isPrime(n): the positive integer n is a prime number.


2 isPowerOf(n, k): the integer n is an exact power of k: that is, n = ki for some exponent i ∈ Z≥0 .
3 onlyPowersOfTwo(S): every element of the set S is an exact power of two.
4 Q(n, a, b): the positive integer n satisfies n = a + b, and integers a and b are both prime.
5 sumOfTwoPrimes(n): the positive integer n is equal to the sum of two prime numbers.

To highlight Definition 3.20: the isPrime predicate is a function isPrime : Z>0 → {True, False}, which
means that the universe for isPrime is Z>0 . (Sometimes the universe requires a little more thought: for
example, onlyPowersOfTwo answers questions about sets of integers, which means that its universe is the
set of sets of integers—in other words, P(Z).) When the universe is clear from context, we will sometimes
be a little sloppy in our notation and not bother writing it down.

Deriving propositions from predicates


Again, by plugging particular values into the predicates from Example 3.31, we get propositions, each of
which has a truth value:

Example 3.32: Propositions derived from predicates.


Using the predicates in Example 3.31, let’s figure out the truth values of the propositions isPrime(261),
isPrime(262), Q(8, 3, 5), and Q(9, 3, 6). For each, we’ll simply plug the given arguments into the
definition of the predicate and figure out the truth value of the resulting proposition.
1 A little arithmetic shows that 261 = 3 · 87; thus isPrime(261) is false.
2 Similarly, we have 262 = 2 · 131, so isPrime(262) is false.
3 To compute the truth value of Q(8, 3, 5), we plug n = 8, a = 3, and b = 5 into the definition of
Q(n, a, b). The proposition Q(8, 3, 5) requires that the positive integer 8 satisfies 8 = 3 + 5, and the
integers 3 and 5 are both prime. All of the requirements are met, so Q(8, 3, 5) is true.
4 On the other hand, Q(9, 3, 6) is false, because Q(9, 3, 6) requires that 9 = 3 + 6, and that the integers
3 and 6 are both prime. But 6 isn’t prime.

Just as with propositional logical connectives (and as with most functions in your favorite programming
language), each predicate takes a fixed number of arguments. So a predicate might be unary (taking one
argument, like the predicate isPrime); or binary (taking two arguments, like isPowerOf); or ternary (taking
three arguments, like Q from Example 3.31); and so forth.
Here are a few more examples:

ISTUDY
3.4 An Introduction to Predicate Logic 111

Example 3.33: More propositions derived from predicates.


Using the predicates in Example 3.31, find the truth values of these propositions:
1 sumOfTwoPrimes(17)
2 sumOfTwoPrimes(34)
3 isPowerOf(16, 2)
4 isPowerOf(2, 16)
5 onlyPowersOfTwo({1, 2, 8, 128})

Solution. As before, we plug the given values into the definition, and see if the resulting statement is true:
1 False. The only way to get an odd number n by adding two prime numbers is for one of those prime
numbers to be 2 (the only even prime)—but 17 − 2 = 15, and 15 isn’t prime.
2 True, because 34 = 17 + 17, and 17 is prime. (And the other 17 is prime, too.)
3 True, because 24 = 16 (and the exponent 4 is an integer).
4 False, because 161/4 = 2 (and 14 is not an integer).

5 True. Every element of {1, 2, 8, 128} is a power of two: {1, 2, 8, 128} = 20 , 21 , 23 , 27 .

These brief examples may already be enough to begin to give you a sense of the power of logical abstrac-
tion that predicates give us: we can now consider the same logical “condition” applied to two different
“arguments.” In a sense, propositional logic is like programming without functions; letting ourselves use
predicates allows us to write two related propositions using related notation, and to reason simultaneously
about multiple propositions—just like writing a function in Java allows you to think simultaneously about
the same function applied to different arguments.
Taking it further: Predicates also give a convenient way of representing the state of play of multiplayer games like Tic-Tac-Toe,
checkers, and chess. The basic idea is to define a predicate P(B) that expresses “Player 1 will win from board position B if both
players play optimally.” For more on this idea, and on the application of logic to playing these kinds of games, see p. 121.

3.4.2 Quantifiers
We’ve seen that we can form a proposition from a predicate by applying that predicate to a particular argu-
ment. But we can also form a proposition from a predicate using quantifiers, which allow us to formalize
statements like every Java program contains at least four for loops (false!) or there is a proposition that
cannot be expressed using only the connectives ∧ and ∨ (true! See Exercise 4.71). These types of statements
are expressed by the two standard quantifiers, the universal (“every”) and existential (“some”) quantifiers;

Definition 3.21: Universal quantifier [for all, ∀].


Let P be a predicate over S. The proposition ∀x ∈ S : P(x) is true if, for every possible x ∈ S, we have
that P(x) is true.

Definition 3.22: Existential quantifier [there exists, ∃].


Let P be a predicate over S. The proposition ∃x ∈ S : P(x) is true if, for at least one possible x ∈ S, we
have that P(x) is true.

(The proposition ∀x ∈ S : P(x) is read “for all x in S, P(x)” and the proposition ∃x ∈ S : P(x) is read “there
exists an x in S such that P(x).”)

ISTUDY
112 Logic

A hint to remember the symbols ∀ and ∃: the for all notation is ∀, an upside-down “A” as in “all”; the exists notation is ∃, a
backward “E” as in “exists.” (Annoyingly, they had to be flipped in different directions: a backward “A” is still an “A,” and an
upside-down “E” is still an “E.”)

Here’s an example of two propositions about prime numbers that use quantifiers:

Example 3.34: Some propositions about primes using quantifiers.


What are the truth values of the following two propositions?
1 ∀n ∈ Z≥2 : isPrime(n)
2 ∃n ∈ Z≥2 : isPrime(n)

Solution. The proposition ∀n ∈ Z≥2 : isPrime(n) says “every integer n ≥ 2 is prime.” This statement is
false because, for example, the integer 32 is greater than or equal to 2 and is not prime.
The proposition ∃n ∈ Z≥2 : isPrime(n), on the other hand, says “there exists an integer n ≥ 2 that is
prime.” This statement is true because, for example, the integer 31 is prime (and 31 ≥ 2).

We can use quantifiers to make precise many intuitive statements. For example, let’s formalize the
predicates from Example 3.31:

Example 3.35: The example predicates from Example 3.31, formalized.

isPrime(n): the positive integer n ∈ Z>0 is a prime number.


An integer n ∈ Z>0 is prime if and only if n ≥ 2 and the only positive integers that evenly divide n
are 1 and n itself. (See Definition 2.14.) In other words, we can describe the primality of n in terms
of a condition on every candidate divisor d: either d ∈ {1, n}, or d doesn’t evenly divide n. Using the
“divides” notation (see Definition 2.12), we can formalize isPrime(n) as
 
isPrime(n) = n ≥ 2 ∧ ∀d ∈ Z≥1 : (d | n ⇒ d = 1 ∨ d = n) .
isPowerOf(n, k): the integer n is an exact power of k.
An integer n is an exact power of k if and only if n can be written as ki , for some integer i:
isPowerOf(n, k) = ∃i ∈ Z≥0 : n = ki .
onlyPowersOfTwo(S): every element of the set S is an exact power of two.
Because isPowerOf(n, 2) expresses “n is a power of two,” we can be lazy and just use isPowerOf:
onlyPowersOfTwo(S) = ∀x ∈ S : isPowerOf(x, 2).
Q(n, a, b): the positive integer n satisfies n = a + b, and integers a and b are both prime.
If we use isPrime, then formalizing Q actually doesn’t require a quantifier at all:
Q(n, a, b) = (n = a + b) ∧ isPrime(a) ∧ isPrime(b).

sumOfTwoPrimes(n): the positive integer n is equal to the sum of two prime numbers.
This predicate requires that there exist prime numbers a and b that sum to n, and we can use the
predicate Q to express this idea:
sumOfTwoPrimes(n) = ∃⟨a, b⟩ ∈ Z × Z : Q(n, a, b).

ISTUDY
3.4 An Introduction to Predicate Logic 113

(“There exists a pair of integers ⟨a, b⟩ such that Q(n, a, b).”) Alternatively, we could instead write
sumOfTwoPrimes(n) by nesting one quantifier within the other (see Section 3.5):
sumOfTwoPrimes(n) = ∃a ∈ Z : [∃b ∈ Z : Q(n, a, b)] .

Here’s one further example, regarding the prefix relationship between two strings:

Example 3.36: Prefixes, formalized.


k n
A binary string x ∈ {0, 1} is a prefix of the binary string y ∈ {0, 1} , for n ≥ k, if y is x with some extra
bits added on at the end. For example, 01 and 0110 are both prefixes of 01101010, but 1 is not a prefix
of 01101010. If we write |x| and |y| to denote the length of x and y, respectively, then
 
isPrefixOf(x, y) = |x| ≤ |y| ∧ ∀i ∈ {i ∈ Z : 1 ≤ i ≤ |x|} : xi = yi .
That is, y is no shorter than x, and the first |x| characters of y equal their corresponding characters in x.

Quantifiers as loops
One useful way of thinking about these quantifiers is by analogy to loops in programming. If we ever
encounter a y ∈ S for which ¬P(y) holds, then we immediately know that ∀x ∈ S : P(x) is false. Similarly,
any y ∈ S for which Q(y) holds is enough to demonstrate that ∃x ∈ S : Q(x) is true. But if we “loop through”
all candidate values of x and fail to encounter any x with ¬P(x) or Q(x), we know that ∀x ∈ S : P(x) is true
or ∃x ∈ S : Q(x) is false. By this analogy, we might think of the two standard quantifiers as executing the
programs in Figure 3.20.
Another intuitive and useful way to think about ∀ and ∃ is as a supersized version of ∧ and ∨:
∀x ∈ {x1 , x2 , . . . , xn } : P(x) ∃x ∈ {x1 , x2 , . . . , xn } : P(x)
≡ P(x1 ) ∧ P(x2 ) ∧ · · · ∧ P(xn ) ≡ P(x1 ) ∨ P(x2 ) ∨ · · · ∨ P(xn )
The first of these propositions is true only if every one of the P(xi ) terms is true; the second is true if at
least one of the P(xi ) terms is true.
P Q
There is one way in which these analogies are loose, though: just as for (summation) and (prod-
uct) notation (from Section 2.2.7), the loop analogy only makes sense when the domain of discourse is
finite! The ∀ “program” for a true proposition ∀x ∈ Z : P(x) would have to complete an infinite number of
iterations before returning True. But the intuition may still be helpful.

Precedence and parenthesization


As in propositional logic, we’ll adopt standard conventions regarding order of operations so that we don’t
overdose on parentheses. We treat the quantifiers ∀ and ∃ as binding tighter than the propositional logical
connectives. (For the sake of clarity, though, we’ll err on the side of using too many parentheses in quantified
statements, rather than too few.)

(a) (b)
1 for x in S: 1 for x in S:
2 if not P(x) then 2 if Q(x) then
3 return False 3 return True
4 return True 4 return False

Figure 3.20 For loops analogous to (a) ∀x ∈ S : P(x), and (b) ∃x ∈ S : Q(x).

ISTUDY
114 Logic

   
Thus ∀x ∈ S : P(x) ⇒ ∃y ∈ S : P(y) will be understood to mean ∀x ∈ S : P(x) ⇒ ∃y ∈ S : P(y) .
To express the other reading (which involves nested quantifiers; see Section 3.5), we can use parentheses
 
explicitly, by writing ∀x ∈ S : P(x) ⇒ ∃y ∈ S : P(y) .

Free and bound variables


Consider the variables x and y in the expressions

3|x and ∀y ∈ Z : 1 | y.

Understanding the first of these expressions requires knowledge of what x means, whereas the second is
a self-contained statement that can be understood without any outside knowledge. The variable x is called
a free or unbound variable: its value is not fixed by the expression. In contrast, the variable y is a bound
variable: its value is defined within the expression itself. We say that the quantifier binds the variable y, and
the scope or body of the quantifier is the part of the expression in which it has bound y. (We’ve encountered
bound variables before; they arise whenever a variable name is assigned a value within an expression. For
P10
example, the variable i is bound in the arithmetic expression i=1 i2 , as is the variable n in the set-defining

expression n ∈ Z : |n| ≤ |n2 | .)
A single expression can contain both free variables and bound variables: for example, the expression
∃y ∈ Z≥0 : x ≥ y contains a bound variable y and a free variable x. Here’s another example:

Example 3.37: Free and bound variables.


Which variables are free in the following expression?
   
∀x ∈ Z : x2 ≥ y ∧ ∀z ∈ Z : y = z ∨ zy = 1

Solution. The variable y doesn’t appear as the variable bound by either of the quantifiers in this expression,
so y is a free variable. Both x and z are bound by the universal quantifiers. (Incidentally, this expression
is true if and only if y = 0.)

To test whether a particular variable x is free or bound in an expression, we can (consistently) replace
x by a different name in that expression. If the meaning stays the same, then x is bound; if the meaning
changes, then x is free. For example:

Example 3.38: Testing for free and bound variables.


Consider the following pairs of propositions:
∃x ∈ S : x > 251 and ∃y ∈ S : y > 251 (A)
x ≥ 42x and y ≥ 42y (B)
The expressions in (A) express precisely the same condition, namely: some element of S is greater than
251. Thus, the variables x and y in these two expressions are bound.
But the expressions in (B) mean different things, in that there are contexts in which these two statements
have different truth values (for example, x = 3 and y = −2). The first expression states a condition on
the value of x; the second states a condition on the value of y. So x is a free variable in “x ≥ 42x.”

ISTUDY
3.4 An Introduction to Predicate Logic 115

Taking it further: The free-versus-bound-variable distinction is also something that may be familiar from programming, at least
in some programming languages. There are some interesting issues in the design and implementation of programming languages
that center on how free variables in a function definition, for example, get their values. See p. 122.

An expression of predicate logic that contains no free variables is called fully quantified. For expressions
that are not fully quantified, we adopt a standard convention that any unbound variables in a stated claim
are implicitly universally quantified. For example, consider these claims:

Claim A: If x ≥ 1, then x2 ≤ x3 . Claim B: For all x ∈ R, if x ≥ 1, then x2 ≤ x3 .

When we write a (true) claim like Claim A, we will implicitly interpret it to mean Claim B. (Claim B also
explicitly notes R as the domain of discourse, which was left implicit in Claim A.)

3.4.3 Theorem and Proof in Predicate Logic


Recall that a tautology is a proposition that is always true—in other words, it is true no matter what each
Boolean variable p in the proposition “means” (that is, whether p is true or false). In this section, we will
be interested in the corresponding notion of always-true statements of predicate logic, which are called
theorems. A statement of predicate logic is “always true” when it’s true no matter what its predicates
mean. (Formally, the “meaning” of a predicate P is the set of elements of the universe U for which the
predicate is true—that is, {x ∈ U : P(x)}.)

Definition 3.23: Theorems in predicate logic.


A fully quantified expression of predicate logic is a theorem if and only if it is true for every possible
meaning of each of its predicates.

Analogously, two fully quantified expressions are logically equivalent if, for every possible meaning of
their predicates, the two expressions have the same truth values.
We’ll begin with a pair of related examples, one theorem and one nontheorem:

Example 3.39: A theorem of predicate logic.


Let S be any set. The following claim is true regardless of what the predicate P denotes:
 
∀x ∈ S : P(x) ∨ ¬P(x) .
Indeed, this claim simply says that every x ∈ S either makes P(x) true or P(x) false. And that assertion is
true if the predicate P(x) is “x ≥ 42” or “x has red hair” or “x prefers programming in Python to playing
Parcheesi”—indeed, it’s true for any predicate P.

Example 3.40: A nontheorem.


Let’s show that the following proposition is not a theorem:
   
∀x ∈ S : P(x) ∨ ∀x ∈ S : ¬P(x) .
A theorem must be true regardless of P’s meaning, so we can establish that this proposition isn’t a the-
orem by giving an example predicate that makes it false. Here’s one example: consider the predicate
P = isPrime (with the universe S = Z). Observe that ∀x ∈ Z : isPrime(x) is false because, for example,

ISTUDY
116 Logic

isPrime(4) = False; and ∀x ∈ Z : ¬isPrime(x) is false because, for example, ¬isPrime(5) = False. Thus
the given proposition is false when P is isPrime, and so it is not a theorem.

Note the crucial difference between Example 3.39 (“every element of S either makes P true or makes P
false”) and Example 3.40 (“either every element of S makes P true, or every element of S makes P false”).
(Intuitively, it’s the difference between “Every letter is either a vowel or a consonant” and “Every letter is
a vowel or every letter is a consonant.” The former is true; the latter is false.)
Example 3.40 establishes that [∀x ∈ S : P(x)] ∨ [∀x ∈ S : ¬P(x)] fails to be globally true for every mean-
ing of the predicate P, but it is true for some meanings of the predicate. For example, if the predicate P(x)
is x2 ≥ 0 (with S = R), then this disjunction is true (because ∀x ∈ R : x2 ≥ 0 is true).

The challenge of proofs in predicate logic


The remainder of this section states some theorems of predicate logic, along with an initial discussion of
how we might prove that they’re theorems. (A proof of a statement is simply a convincing argument that
the statement is a theorem.) Much of the rest of the book will be devoted to developing and writing proofs
of theorems like these, and Chapter 4 will be devoted exclusively to some techniques and strategies for
proofs. (This section will preview some of the ideas we’ll see there.) Some theorems of predicate logic are
summarized in Figure 3.21; we’ll prove a few here, and you’ll return to some of the others in the exercises.
While predicate logic allows us to express claims that we couldn’t state without quantifiers, that extra
expressiveness comes with a cost. For a quantifier-free proposition (like all propositions in Sections 3.2
and 3.3), there is a straightforward—if tedious—algorithm to decide whether a given proposition is a tau-
tology: first, build a truth table for the proposition; and, second, check to make sure that the proposition
is true in every row. It turns out that the analogous question for predicate logic is much more difficult—in
fact, impossible to solve in general: there’s no algorithm that’s guaranteed to figure out whether a given
fully quantified expression is a theorem! Demonstrating that a statement in predicate logic is a theorem
will require you to think in a way that demonstrating that a statement in propositional logic is a tautology
did not.
Taking it further: The fact that there’s no algorithm guaranteed to determine whether a given proposition is a theorem follows
from a mind-numbing 1931 result by Kurt Gödel (1906–1978). See p. 123. The absence of such an algorithm sounds like bad
news; it means that proving predicate-logic statements is harder, because you can’t just use a simple “plug and chug” technique
to figure out whether a given statement is actually always true. But this fact is also precisely the reason that creativity plays a
crucial role in proofs and in theoretical computer science more generally—and why, arguably, proving things can be fun! (For
me, this difference is exactly why I find Sudoku less interesting than crossword puzzles: when there’s no algorithm to solve a
problem, we have to embrace the creative challenge in attacking it.)

3.4.4 A Few Examples of Theorems and Proofs


In the rest of this section, we will see a few further theorems of predicate logic, with proofs. As we’ve said,
there’s no formulaic approach to prove these theorems; we’ll need to employ a variety of strategies in this
endeavor.

Negating quantifiers: a first example


Suppose that your egomaniacal, overconfident partner from Intro CS wanders into the lab and says For any
array A that you give me, partner, my implementation of insertion sort correctly sorts A. You know, though,
that your partner is wrong. (You spot a bug in his egomaniacal code.) What would that mean? Well, you

ISTUDY
3.4 An Introduction to Predicate Logic 117

 
∀x ∈ S : P(x) ∨ ¬P(x)
   
¬ ∀x ∈ S : P(x) ⇔ ∃x ∈ S : ¬P(x) De Morgan’s Laws (quantified form)
   
¬ ∃x ∈ S : P(x) ⇔ ∀x ∈ S : ¬P(x)
   
∀x ∈ S : P(x) ⇒ ∃x ∈ S : P(x) if the set S is nonempty

∀x ∈ ∅ : P(x) Vacuous quantification


¬∃x ∈ ∅ : P(x)
     
∃x ∈ S : P(x) ∨ Q(x) ⇔ ∃x ∈ S : P(x) ∨ ∃x ∈ S : Q(x)
     
∀x ∈ S : P(x) ∧ Q(x) ⇔ ∀x ∈ S : P(x) ∧ ∀x ∈ S : Q(x)
     
∃x ∈ S : P(x) ∧ Q(x) ⇒ ∃x ∈ S : P(x) ∧ ∃x ∈ S : Q(x)
     
∀x ∈ S : P(x) ∨ Q(x) ⇐ ∀x ∈ S : P(x) ∨ ∀x ∈ S : Q(x)
     
∀x ∈ S : P(x) ⇒ Q(x) ∧ ∀x ∈ S : P(x) ⇒ ∀x ∈ S : Q(x)
   
∀x ∈ {y ∈ S : P(y)} : Q(x) ⇔ ∀x ∈ S : P(x) ⇒ Q(x)
   
∃x ∈ {y ∈ S : P(y)} : Q(x) ⇔ ∃x ∈ S : P(x) ∧ Q(x)
   
φ ∧ ∃x ∈ S : P(x) ⇔ ∃x ∈ S : φ ∧ P(x) if x does not appear as a free variable in φ
   
φ ∨ ∀x ∈ S : P(x) ⇔ ∀x ∈ S : φ ∨ P(x) if x does not appear as a free variable in φ

Figure 3.21 A few theorems involving quantification.

might reply, gently but firmly: There’s an array A for which your implementation of insertion sort does not
correctly sort A. The equivalence that you’re using is a theorem of predicate logic:

Example 3.41: Negating universal quantifiers.


Let’s prove the equivalence you’re using to debunk your partner’s claim:
   
¬ ∀x ∈ S : P(x) ⇔ ∃x ∈ S : ¬P(x) . (∗)
It’s probably easiest to view (∗) as a quantified version of the tautology ¬(p ∧ q) ⇔ ¬p ∨ ¬q, which was
one of De Morgan’s Laws from propositional logic. Here’s the (oversimplified) intuition:
    if we imagine the universe S = {x1 , x2 , . . . , xn }, then
¬ ∀x ∈ S : P(x) ∼∼
∼ ¬ P(x1 ) ∧ P(x2 ) ∧ · · · ∧ P(xn ) ∀x ∈ S : P(x) means P(x1 ) ∧ P(x2 ) ∧ · · · ∧ P(xn )
 
≡ ¬P(x1 ) ∨ ¬P(x2 ) ∨ · · · ∨ ¬P(xn ) by De Morgan’s Laws (propositional version)


∼ ∃x ∈ S : ¬P(x), ∃x ∈ S : Q(x) means Q(x1 ) ∨ Q(x2 ) ∨ · · · ∨ Q(xn )

There is something slightly more subtle about (∗) because S might be infinite, but the idea is the same:

• If there’s an a ∈ S such that P(a) = False, then ∃x ∈ S : ¬P(x) is true (because a is an example) and
∀x ∈ S : P(x) is false (because a is a counterexample).
• On the other hand, if P(a) = True for every a ∈ S, then we know that ∃x ∈ S : ¬P(x) is false and
∀x ∈ S : P(x) is true.

The analogous claim for the negation of ∃x ∈ S : P(x) is also a theorem:

ISTUDY
118 Logic

Example 3.42: Negating existential quantifiers.


Let’s prove that this claim is a theorem, too:
   
¬ ∃x ∈ S : P(x) ⇔ ∀x ∈ S : ¬P(x) . (†)
To see that (†) is true for an arbitrary predicate P, we start with claim (∗) from Example 3.41, but using
the predicate Q(x) = ¬P(x). (Note that Q is also a predicate—so Example 3.41 holds for Q too!) Thus
we know that
   
¬ ∀x ∈ S : Q(x) ⇔ ∃x ∈ S : ¬Q(x) ,

and, because p ⇔ q ≡ ¬p ⇔ ¬q, we therefore also know that


   
∀x ∈ S : Q(x) ⇔ ¬ ∃x ∈ S : ¬Q(x) .

But Q(x) is just ¬P(x) and ¬Q(x) is just P(x), by definition of Q, and so we have
   
∀x ∈ S : ¬P(x) ⇔ ¬ ∃x ∈ S : P(x) .
Thus we’ve now shown that (†) is true for any predicate P, so it is a theorem.

All implies some: a proof of an implication


The entirety of Chapter 4 is devoted to proofs and proof techniques; there’s lots more there about how to
approach proving or disproving new claims. But here we’ll preview a particularly useful proof strategy for
proving an implication, and use it to establish another theorem of predicate logic:

Definition 3.24: Proof by assuming the antecedent.


Suppose that we must prove an implication φ ⇒ ψ. Because the only way for φ ⇒ ψ to fail to be true
is for φ to be true and ψ to be false, we will prove that the implication φ ⇒ ψ is always true by ruling
out the one scenario in which it wouldn’t be. Specifically, we assume that φ is true, and then prove that
ψ must be true too, under this assumption.

(Recall from the truth table of ⇒ that the only way for the implication φ ⇒ ψ to be false is when φ is
true but ψ is false. Also recall that φ is called the antecedent of the implication φ ⇒ ψ; hence this proof
technique is called assuming the antecedent.) Here are two examples of proofs that use this technique:

(An example from propositional logic.) Let’s prove that p ⇒ p ∨ q is a tautology: we assume that the
antecedent p is true, and we must prove that the consequent p ∨ q is true too. But that’s obvious, because
p is true (by our assumption), and True ∨ q ≡ True.

(An example from arithmetic.) Let’s prove that if x is a perfect square, then 4x is a perfect square: assume
that x is a perfect square, that is, assume that x = k2 for an integer k. Then 4x = 4k2 = (2k)2 is a perfect
square too, because 2k is also an integer.

Finally, here’s a theorem of predicate logic that we can prove using this technique:

ISTUDY
3.4 An Introduction to Predicate Logic 119

Example 3.43: If everybody’s doing it, then somebody’s doing it.


Consider the following proposition, for an arbitrary nonempty set S:
   
∀x ∈ S : P(x) ⇒ ∃x ∈ S : P(x) .
We’ll prove this claim by assuming the antecedent. Specifically, we assume ∀x ∈ S : P(x), and we need
to prove that ∃x ∈ S : P(x).
Because the set S is nonempty, we know that there’s at least one element a ∈ S. By our assumption,
we know that P(a) is true. But because P(a) is true, then it’s immediately apparent that ∃x ∈ S : P(x) is
true too—because we can just pick x = a.

Problem-solving tip: When you’re facing a statement that contains a lot of mathematical notation, try to understand it by rephras-
ing it as an English sentence. Restating the assertion from Example 3.43 in English makes it pretty obvious that it’s true: if
everyone in S satisfies P—and there’s actually someone in S—then of course someone in S satisfies P!

Vacuous quantification
Consider the proposition All even prime numbers greater than 12 have a 3 as their last digit. Write N
to denote the set of all even prime numbers greater than 12; formalized, then, this claim can be written as
∀n ∈ N : n mod 10 = 3. Is this claim true or false? It has to be true! The point is that N actually contains no
elements (there are no even prime numbers other than 2, because an even number is by definition divisible
by 2). Thus this claim says: “for every n ∈ ∅, some silly thing is true of n.” But there is no n in ∅, so the
claim has to be true! The general statement of the theorem is

∀x ∈ ∅ : P(x). (‡)

Quantification over the empty set is called vacuous quantification, and (‡) is said to be vacuously true.
Here’s another way to see that (‡) is a theorem, using the De Morgan–like view of quantification. What
would it mean for (‡) to be false? There would have to be some x ∈ ∅ such that ¬P(x)—but there never
exists any element x ∈ ∅, let alone an x ∈ ∅ such that ¬P(x). Thus ∃x ∈ ∅ : ¬P(x) is false, and therefore
its negation ¬∃x ∈ ∅ : ¬P(x), which is equivalent to (‡), is true.

Disjunctions and quantifiers


Here’s one last example for this section, in which we’ll figure out when the “or” of two quantified statements
can be expressed as one single quantified statement:

Example 3.44: Disjunctions and quantifiers.


Consider the following two propositions, for an arbitrary set S:
     
∀x ∈ S : P(x) ∨ Q(x) ⇔ ∀x ∈ S : P(x) ∨ ∀x ∈ S : Q(x) (A)
     
∃x ∈ S : P(x) ∨ Q(x) ⇔ ∃x ∈ S : P(x) ∨ ∃x ∈ S : Q(x) (B)
Is either (A) or (B) a theorem? Both? Prove your answers.
Solution. Claim (B) is a theorem. To prove it, we’ll show that the left-hand side implies the right-hand
side, and vice versa. (That is, we’re proving p ⇔ q by proving both p ⇒ q and q ⇒ p, which is a legit-
imate proof because p ⇔ q ≡ (p ⇒ q) ∧ (q ⇒ p).) Both proofs will use the technique of assuming the
antecedent.

ISTUDY
120 Logic

     
First, let’s prove that ∃x ∈ S : P(x) ∨ Q(x) implies ∃x ∈ S : P(x) ∨ ∃x ∈ S : Q(x) :
Suppose that [∃x ∈ S : P(x) ∨ Q(x)] is true. Then there is some particular x∗ ∈ S for which either P(x∗ )
or Q(x∗ ). But in either case, we’re done: if P(x∗ ) then ∃x ∈ S : P(x) because x∗ satisfies the condition;
if Q(x∗ ) then ∃x ∈ S : Q(x), again because x∗ satisfies the condition.
     
Second, let’s prove that ∃x ∈ S : P(x) ∨ ∃x ∈ S : Q(x) implies ∃x ∈ S : P(x) ∨ Q(x) :
Suppose that [∃x ∈ S : P(x)] ∨ [∃x ∈ S : Q(x)] is true. Thus either there’s an x∗ ∈ S such that P(x∗ ) or
an x∗ ∈ S such that Q(x∗ ). That x∗ suffices to make the left-hand side of (B) true.

On the other hand, (A) is not a theorem, for much the same reason as in Example 3.40. (In fact, if Q(x)
is ¬P(x), then Examples 3.39 and 3.40 directly establish that (A) is not a theorem.) The set Z and the
predicates isOdd and isEven make (A) false: the left-hand side is true (“all integers are either even or
odd”) but the right-hand side is false (“either (i) all integers are even, or (ii) all integers are odd”).

Problem-solving tip: In thinking about whether a particular quantified expression is a theorem—like the question of whether
(A) from Example 3.44 is a theorem—it’s often useful to get intuition by plugging in a few sample values for the universe S and
the predicates P and Q. It’s especially helpful to consider sample values that are relatively easy to describe and think about, like
“odd” and “even.”

Although the mutual implication (A) from Example 3.44 is not a theorem, one direction of it is. Let’s prove
this implication as another example:

Example 3.45: Disjunction, quantifiers, and one-way implications.


The ⇐ direction of (A) from Example 3.44 is a theorem:
     
∀x ∈ S : P(x) ∨ Q(x) ⇐ ∀x ∈ S : P(x) ∨ ∀x ∈ S : Q(x) .
Let’s prove it. We assume the antecedent [∀x ∈ S : P(x)] ∨ [∀x ∈ S : Q(x)]. This assumption means that
either ∀x ∈ S : P(x) or ∀x ∈ S : Q(x).
If we’re in the first case, in which ∀x ∈ S : P(x), then P(x) is true for every x ∈ S. But then it’s certainly
the case that P(x) ∨ Q(x) is true for any x ∈ S too: if P(x) is true, then P(x) ∨ whatever is true too.
If we’re in the case in which ∀x ∈ S : Q(x) holds, then Q(x) is true for any x ∈ S. Similarly, then,
P(x) ∨ Q(x) is true for any x ∈ S: if Q(x) is true, then whatever ∨ Q(x) is true too.
In both cases, we’ve argued that P(x) ∨ Q(x) is true for all x ∈ S—that is, that [∀x ∈ S : P(x) ∨ Q(x)].

You’ll have a chance to consider a number of other theorems of predicate logic in the exercises, including
the ∧-analogy to Examples 3.44 and 3.45 (in Exercises 3.137 and 3.138).

ISTUDY
3.4 An Introduction to Predicate Logic 121

Computer Science Connections

Game Trees, Logic, and Winning Tic-Tac(-Toe)


In 1997, Deep Blue, a chess-playing program developed by IBM, beat the chess Grandmaster Garry Kasparov in a
six-match series. This event was a turning point in the public perception of computation and artificial intelligence
(AI); it was the first time that a computer had outperformed the best humans at something that most people tended
to identify as a “human endeavor.” Ten years later, a research group developed a program called Chinook, a perfect
checkers-playing system: from any game position arising in its games, Chinook chooses the best possible legal move.
(And a few years after that came another major turning point in the public perception of computation and AI: IBM
Watson, designed to play the quiz show Jeopardy!, defeated some of the best-ever human players in a match.)
While chess and checkers are
very complicated games, the basic

ideas of playing them—ideas based
on logic—are shared with simpler
games. Consider Tic-Tac, a 2-by-
O O
O O
O
∧ ∧O O∧ ∧O
2 version of Tic-Tac-Toe. (Thanks
to Jon Kleinberg for suggesting the ∨ X∨ ∨OX
X O O O X O O
X X
game.) See Figure 3.22.
Note that—unless O is tremen-
O∧ TX O∧X
X O X O O O O O O O X O X O O O O O O O
dously dull—O will win the game, O O X X O X O X TO XT XTO
but we will use a game tree (as in
Figure 3.22a), which represents all X O X O X O X O
O X O X OFX OFX
possible moves, to systematize this
reasoning. Here’s the basic idea. (a) 25% of the Tic-Tac game tree. (The (b) The game tree, with each win for O
Let’s define a predicate P(B) to missing 75% is rotated, but otherwise labeled by T, each loss/draw by F, ∨ if
identical.) it’s Player O’s turn, and ∧ if it’s
mean “Player O wins under optimal Player X’s turn.
play starting from board B.” (“Opti-
mal play” means that both Player O Figure 3.22 Tic-Tac. Two players, O and X, make alternate moves, starting
and Player X always make the very with O; a player wins by occupying a complete row or column. Diagonals don’t
best possible move in every turn.) count, and if the board is filled without O or X winning, then the game is a draw.
For example, P( XO || O ) = True
because O has already won; and P( XO || OX ) = False because the game has ended in a draw. The answer to the question
“does O win Tic-Tac if both players play optimally?” is the truth value of P( || ). If it’s O’s turn in board B, then
P(B) is true if and only if there exists a possible move for O leading to a board B′ in which P(B′ ); if it’s X’s turn,
then P(B) is true if and only if every possible move made by X leads to a board B′ in which P(B′ ). So
| O |
P( |
) and P( |
)
= P( X | O
|
) ∧ P( | O
X |
) ∧ P( | O
| X
) = P( O |
|
) ∨ P( | O
|
) ∨ P( |
O |
) ∨ P( |
| O
).
The game tree, labeled appropriately, is shown in Figure 3.22b. If we view the truth values from the leaves as “bubbling
up” from the bottom of the tree, then a board B gets assigned the truth value True if and only if Player O can guarantee
a win from the board B.
Some serious complications arise in writing a program to play more complicated games like checkers or chess.
Here are just a few of the issues that one must confront in building a system like Deep Blue or Chinook. First, there are
≈ 500,000,000,000,000,000,000 different checkers positions—and ≈ 1040 chess positions!—so we can’t afford to
represent them all. (Luckily, our system only needs to find one good move at each stage, and we can choose moves so
most positions are never reached.) Second, approximately one bit per trillion is written incorrectly merely in copying
data on current hard disk technologies. So a program constructing a massive structure like the checkers game tree
must “check its work.” Third, for a game as big as chess, we can’t afford to compute all the way to the bottom of
the tree; instead, we estimate the quality of each position after computing a handful of layers deep in the game tree.
(There are many more complications, too. For more on game trees and algorithms for exploring large search spaces,
see any good AI text, like [111]. You can also read more specifically about Deep Blue [26], or about Chinook [113].)

ISTUDY
122 Logic

Computer Science Connections

Nonlocal Variables and Lexical vs. Dynamic Scoping


In a function f written in a programming language—say, C or Python, though nearly any language has the relevant
features—we can use several different types of variables that store values. There are local variables, whose values are
defined completely within the body of f. There are parameters to the function, inputs to f whose value is specified
when f is invoked. And there are nonlocal variables, which get their value from other contexts. The most common
type of these “other” variables is a global variable, which persists throughout the execution of the entire program.
For an example function (written in both C and
Python as illustrative examples) that uses all three types
1 int addB(int a) {
of variables, see Figure 3.23. (The variable a is a 2 int result = a + b;
parameter, the variable result is a local variable, and 3 return result;
the variable b is a nonlocal variable.) 4 }

In the body of the addB function, the variable a is


1 def addB(a):
a bound variable; specifically, it is bound when the
2 result = a + b
function is invoked with an actual parameter. But the
3 return result
variable b is unbound. (Just as with a quantified expres-
sion, an unbound variable is one for which the meaning
Figure 3.23 A function addB written in C and analogous
of the function could change if we replaced that vari-
function addB written in Python. Here addB takes one
able with a different name. If we changed the a to an
x in both Line 1 and Line 2, then the function would (integer) parameter a, accesses a nonlocal variable b,
behave identically, but if we changed the b to a y, then defines a local variable result, and returns a + b.
the function would behave differently.) In this function,
the variable b has to somehow get a value from somewhere if we are going to be able to invoke the function addB
without causing an error. Often b will be a global variable, but it is also possible in Python or C (with appropriate
compiler settings) to nest function definitions—just as quantifiers can be nested. (See Section 3.5.)
One fundamental issue in the design and imple-
mentation in programming languages is illustrated in 1 int b = 17;
Figure 3.24. Suppose x is an unbound variable in the 2

definition of a function f. Generally, programming lan- 3 int addB(int a) {


A function in C finds
guages either use lexical scope, where x’s value is 4 return a + b;
values for unbound
5 }
found by looking “outward” where f is defined; or variables in the
6 defining
dynamic scope, where x’s value is found by looking 7 int test() { environment.
where f is called. 8 int b = 128;
Almost all modern programming languages use lex- 9 return addB(3);
10 }
ical scope, though macros in C and other languages
11 /* Here addB() is a function, and */
use dynamic scope. (And some languages only use 12 test(3); /* the value of test(3) is 20. */
dynamic scope.) While we’re generally used to lexi-
cal scope and therefore it feels more intuitive, there are 13 int b = 17;
some circumstances in which macros can be tremen- 14

dously useful and convenient. 15 #define addB(a) a + b A macro in C finds


16 values for unbound
(The differences between lexical versus dynamic variables in the
17 int test() {
scope, and other related issues, are classical topics in 18 int b = 128;
calling environment.
the design and implementation of programming lan- 19 return addB(3);
guages. One of the other interesting issues is that there 20 }

are actually multiple paradigms for passing parame- 21 /* Here addB() is a macro, and */
22 test(3); /* the value of test(3) is 131. */
ters to a function, too; we’re discussing call-by-value
parameter passing, which probably is the most com-
mon, but there are other choices. See any good textbook Figure 3.24 Two C snippets defining addB, where the
on programming languages for much more.) nonlocal variable b gets its value from different places.

ISTUDY
3.4 An Introduction to Predicate Logic 123

Computer Science Connections

Gödel’s Incompleteness Theorem


Given a fully quantified proposition φ, is φ a theorem? This apparently simple question drove the development
of some of the most profound and mind-numbing results of the last hundred years. In the early twentieth century,
there was great interest in the “formalist program,” advanced especially by the German mathematician David Hilbert
(1862–1943). The formalist approach aimed to turn all of mathematical reasoning into a machine: one could feed in a
mathematical statement φ as input, turn a hypothetical crank, and the machine would spit out a proof or disproof of φ
as output. But this program was shattered by two closely related results—two of the greatest intellectual achievements
of the twentieth century.
True Provable
The first blow to the formal-
ist program was the proof by Kurt These are good: in a
sense, the entire point If there’s a statement
Gödel (1906–1978), in 1931, of that’s true but
of this book is to
what became known as Gödel’s identify true things, unprovable, then logic
Incompleteness Theorem. Gödel’s and to prove them. is incomplete.

incompleteness theorem is based


These are no
on the following two important problem—you might If there’s a statement
and desirable properties of logi- wish for these that’s provable but
cal systems. First, a logical sys- statements true, but untrue, then logic is
they aren’t (and you inconsistent.
tem is called consistent if only true can’t prove them).
statements can be proven. (In other
words, if there is a proof of φ in the (a) We can be perfectly happy with (b) There’s something troubling about
system, then φ is true.) Second, a statements that are both true and statements that are true but not
logical system is called complete if provable, and with statements that are provable, and with statements that are
every true statement can be proven. neither true nor provable. provable but not true.
(In other words, if φ is true, then Figure 3.25 A Venn diagram dividing the universe of logical statements into
there is a proof of φ in the system.) those that are true (or not), and those that are provable (or not).
See Figure 3.25.
Both consistency and completeness are Good Things for a logical system. If the system is inconsistent, then there
is a false statement φ that can be proven (which means that anything can be proven, as false implies anything!). And
if the system is incomplete, then there is a true statement φ that cannot be proven. But what Gödel was able to prove
is this troubling result:

Gödel’s (First) Incompleteness Theorem:


Any sufficiently powerful logical system is either inconsistent or incomplete.

(Here “sufficiently powerful” just means “capable of expressing multiplication”; predicate logic as described here
is certainly “sufficiently powerful.”) Gödel’s proof proceeds by constructing a self-referential logical expression φ
that means “φ is not provable.” (So if φ is true, then the system is incomplete; and if φ is false, then the system is
inconsistent.)
The second strike against the formalist program was the proof of the undecidability of the halting problem, shown
independently by Alan Turing and Alonzo Church in the 1930s. We can think of the halting problem as asking the
following question: given a function f written in Python and an input x, does running f(x) get stuck in an infinite loop?
(Or does it eventually terminate?) The undecidability of this problem means that there is no algorithm that solves
the halting problem. A corollary of this result is that our problem—given a fully quantified proposition φ, is φ a
theorem?—is also undecidable. We’ll discuss uncomputability in detail in Section 4.4.4.
If you’re interested in reading more: undecidability, incompleteness, and their profound consequences are the
focus of a number of excellent textbooks on the theory of computation (for example, [73, 121])—and also Douglas
Hofstadter’s fascinating masterpiece Gödel, Escher, Bach [56]. There’s also a compelling biography of Kurt Gödel,
who led a deeply troubled life in many ways despite his professional success, that describes much more about the
person behind the result [23].

ISTUDY
124 Logic

EXERCISES
Using the characteristics in Figure 3.26, define a predicate that’s true for each of the following lists of languages, and false for every
other language in the table. For example, the predicate P(x) = “x has strong typing and x is not functional” makes P(Pascal) and
P(Java) true, and makes P(x) false for every x ∈ {C, C++, LATEX, ML, Perl, Scheme}.
3.112 Java
3.113 C, Pascal
3.114 ML, Pascal
3.115 Pascal, Scheme, Perl
3.116 LATEX, Java, C++, Perl
3.117 C, Pascal, ML, C++, LATEX, Scheme, Perl

Examples 3.4 and 3.16 construct a proposition corresponding to “the password contains at least three of four character types
(digits, lowercase letters, uppercase letters, other).” In that example, we took “the password contains at least one digit” (and its
analogues) as an atomic proposition. But we could give a lower-level characterization of valid passwords. Let isDigit, isLower,
and isUpper be predicates that are true of single characters of the appropriate type. Use standard arithmetic notation and these
predicates to formalize the following conditions on a password x = ⟨x1 , . . . , xn ⟩, where xi is the ith character in the password:
3.118 x is at least 8 characters long.
3.119 x contains at least one lowercase letter.
3.120 x contains at least one non-alphanumeric character. (Remember: isDigit, isLower, and isUpper are the only predicates available!)
3.121 (Inspired by a letter to the editor in The New Yorker by Alexander George [51].) Steve Martin, the great comedian, reports in Born
Standing Up: A Comic’s Life [85] that, inspired by Lewis Carroll, he started closing his shows with the following line. (It got big
laughs.) “I’m not going home tonight; I’m going to Bananaland, a place where only two things are true, only two things: One, all
chairs are green; and two, no chairs are green.” Steve Martin describes the joke as a contradiction—but, in fact, these two true
things are not contradictory! Explain. How can “all chairs in Bananaland are green” and “no chairs in Bananaland are green” be
simultaneously true?

As a rough approximation, we can think of a database as a two-dimensional table, where rows correspond to individual entities,
and columns correspond to fields (data about those entities). A database query defines a predicate Q(x) that consists of tests of the
values from various columns, joined by the basic logical connectives. The database system then returns a list of rows/entities for
which the predicate is true. We can think of this type of database access as involving predicates: in response to query Q, the system
returns the list of all rows x for which Q(x) is true. An example is shown in Figure 3.27. Each of the following predicates Q(x)
uses tests on particular columns in x’s row. For each, give a logically equivalent predicate in which each column’s name appears
at most once. You may also use the symbols True and False and the four most common logical connectives (¬, ∧, ∨, ⇒) as many
times as you please. Use a truth table to prove that your answer is logically equivalent to the given predicate.

3.122 [age(x) < 18] ∨ [age(x) ≥ 18] ∧ [gpa(x) ≥ 3.0]
 
3.123 takenCS(x) ⇒ ¬ [home(x) = Hawaii] ⇒ [home(x) = Hawaii] ∧ takenCS(x)
 
3.124 hasMajor(x) ∧ [schoolYear(x) = ̸ 3] ∧ onCampus(x) ∨ hasMajor(x) ∧ [schoolYear(x) = ̸ 3] ∧ ¬onCampus(x)

∨ hasMajor(x) ∧ [schoolYear(x) = 3] ∧ ¬onCampus(x)
3.125 Following the last few exercises, you might begin to think that any query can be rewritten without duplication. Can it? Consider a
unary predicate that is built up from the predicates P(x) and Q(x) and the propositional symbols {True, False, ∧, ∨, ¬, ⇒}. Decide
whether the following claim is true or false, and prove your answer:
Claim: Every such predicate is logically equivalent to a predicate that uses only the following symbols:
(i) {True, False, ∧, ∨, ¬, ⇒}, all of which can be used as many times as you please; and
(ii) the predicates {P(x), Q(x)}, which can appear only one time each.

paradigm typing scope (continued) paradigm typing scope

C imperative weak lexical ML functional strong lexical


C++ object-oriented weak lexical Pascal imperative strong lexical
Java object-oriented strong lexical Perl scripting weak either
LATEX scripting weak dynamic Scheme functional weak either

Figure 3.26 Some well-known programming languages, with some characteristics for each.

ISTUDY
Exercises 125

name GPA CS taken? home age school year on campus? has a major?

Alice 4.0 yes Alberta 20 3 yes yes


Bob 3.14 yes Bermuda 19 3 yes no
Charlie 3.54 no Cornwall 18 1 no yes
Desdemona 3.8 yes Delaware 17 2 no no

Figure 3.27 A sample database. Here, to find a list of all students with grade point averages over 3.4 who have taken at
least one CS course if they’re from Hawaii, we could query [GPA(x) ≥ 3.4] ∧ (takenCS(x) ⇐ [home(x) = Hawaii]). For
this database, this query would return Charlie (and not Alice, Bob, or Desdemona).

Search engines allow users to specify Boolean conditions in their queries. For example, “social OR networks” will return only
web pages containing either the word “social” or the word “networks.” You can view a query as a predicate Q; the search engine
returns (in some order) the list of pages p for which Q(p) is true. Consider the following queries:

Query A: “java AND program AND NOT computer”


Query B: “(computer OR algorithm) AND java”
Query C: “java AND NOT (computer OR algorithm OR program)”

Give an example of a web page—or a sentence—that would be returned as specified in these problems:
3.126 Returned by query A but not by B or C.
3.127 Returned by query B but not by A or C.
3.128 Returned by query C but not by A or B.
3.129 Returned by query A and B but not by C.
 
3.130 Prove or disprove: ∀n ∈ Z : isPrime(n) ⇒ n
2

/Z .

3.131 Translate this quote (attributed to Groucho Marx) into logical notation: It isn’t necessary to have relatives in Kansas City in order
to be unhappy. Let P(x) be “x has relatives in Kansas City” and Q(x) be “x is unhappy,” and view the statement as implicitly making
a claim that a particular kind of person exists.

Write an English sentence that expresses the logical negation of each given sentence. (Don’t just say “It is not the case that …”;
give a genuine negation.) Some of the given sentences are ambiguous in their meaning; if so, describe all of the interpretations of
the sentence that you can find, then choose one and give its negation.
3.132 Every entry in the array A is positive.
3.133 Every decent programming language denotes block structure with parentheses or braces.
3.134 There exists an odd number that is evenly divisible by a different odd number.
3.135 There is a point in Minnesota that is farther than 10 miles from a lake.
3.136 Every sorting algorithm takes at least n log n steps on some n-element input array.

In Examples 3.44 and 3.45, we proved that


h i h i h i
∃x ∈ S : P(x) ∨ Q(x) ⇔ ∃x ∈ S : P(x) ∨ ∃x ∈ S : Q(x)
h i h i h i
∀x ∈ S : P(x) ∨ Q(x) ⇐ ∀x ∈ S : P(x) ∨ ∀x ∈ S : Q(x)

are
h theorems. Argue thati the following
h ∧-analogies
i hto these statements
i are also theorems:
3.137 ∃x ∈ S : P(x) ∧ Q(x) ⇒ ∃x ∈ S : P(x) ∧ ∃x ∈ S : Q(x)
h i h i h i
3.138 ∀x ∈ S : P(x) ∧ Q(x) ⇔ ∀x ∈ S : P(x) ∧ ∀x ∈ S : Q(x)

Explain
h why the following
i are
h theorems of predicate
i h logic: i
3.139 ∀x ∈ S : P(x) ⇒ Q(x) ∧ ∀x ∈ S : P(x) ⇒ ∀x ∈ S : Q(x)
h i h i
3.140 ∀x ∈ {y ∈ S : P(y)} : Q(x) ⇔ ∀x ∈ S : P(x) ⇒ Q(x)
h i h i
3.141 ∃x ∈ {y ∈ S : P(y)} : Q(x) ⇔ ∃x ∈ S : P(x) ∧ Q(x)

ISTUDY
126 Logic

Explain why the following propositions are theorems of predicate logic, assuming that x does not appear as a free variable in the
expression
h φ (and iassuming that S is nonempty):
3.142 φ ⇔ ∀x ∈ S : φ
h i h i
3.143 φ ∨ ∀x ∈ S : P(x) ⇔ ∀x ∈ S : φ ∨ P(x)
h i h i
3.144 φ ∧ ∃x ∈ S : P(x) ⇔ ∃x ∈ S : φ ∧ P(x)
 h i h i
3.145 φ ⇒ ∃x ∈ S : P(x) ⇔ ∃x ∈ S : φ ⇒ P(x)
h i  h i
3.146 ∃x ∈ S : P(x) ⇒ φ ⇔ ∀x ∈ S : P(x) ⇒ φ

3.147 Give an example of a predicate P, a nonempty set S, and an expression φ containing x as a free variable such that the proposition
from Exercise 3.143 is false. Because x has to get its meaning from somewhere, we will imagine a universal quantifier for x wrapped
around the entire expression. Specifically, give an example of P, φ, and S for which
   
   
∀x ∈ S : φ ∨ ∀x ∈ S : P(x) is not logically equivalent to ∀x ∈ S : ∀x ∈ S : φ ∨ P(x) .

ISTUDY
3.5 Predicate Logic: Nested Quantifiers 127

3.5 Predicate Logic: Nested Quantifiers


Who knows what beautiful and winged life, whose egg has been buried for ages under
many concentric layers of woodenness in the dead dry life of society … may
unexpectedly come forth from amidst society’s most trivial and handselled furniture, to
enjoy its perfect summer life at last!
Henry David Thoreau (1817–1862)
Walden; or, Life in the Woods (1854)

Just as we can place one loop inside another in a program, we can place one quantified statement inside
another in predicate logic. In fact, the most interesting quantified statements almost always involve more
than one quantifier. (For example: during every semester, there’s a computer science class that every stu-
dent on campus can take.) In formal notation, such a statement typically involves nested quantifiers—that
is, multiple quantifiers in which one quantifier appears inside the scope of another. We’ve encountered state-
ments involving nested quantification before, although so far we’ve discussed them using English rather
than mathematical notation. Let’s start by formalizing one of these informal uses of nested quantification:

Example 3.46: Onto functions, formalized.


Definition 2.51 said: a function f : A → B is called onto if, for every b ∈ B, there exists at least one a ∈ A
for which f(a) = b. We can rewrite this definition as:
 
A function f : A → B is onto if ∀b ∈ B : ∃a ∈ A : f(a) = b .

Another example that we’ve seen: the concept of a partition of a set (Definition 2.32) also used nested
quantification, just without the ∀ and ∃ notation. (See Exercise 3.168.) Here are two other examples:

Example 3.47: No unmatched elements in an array.


Let’s express the condition that every element of an array A[1 . . . n] is a “double”—that is, appears at least
twice in A. (For example, the array [3, 2, 1, 1, 4, 4, 2, 3, 1] satisfies this condition.) This condition requires
that, for every index i, there exists another index j such that A[i] = A[j]—in other words,
 
∀i ∈ {1, 2, . . . , n} : ∃j ∈ {1, 2, . . . , n} : i ̸= j ∧ A[i] = A[j] .

Example 3.48: Alphabetically later.


Let’s formalize the predicate “The string is alphabetically after the string ” from Example 3.29. For
two letters a, b ∈ {A, B, . . . , Z}, write a < b if a is earlier in the alphabet than b; we’ll use this ordering
on letters to define an ordering on strings. Let x and y be strings over {A, B, . . . , Z}. There are two ways
for x to be alphabetically later than y:
• y is a (proper) prefix of x. (See Example 3.36.) For example, FORTRAN is after FORT.
• x and y share an initial prefix of zero or more identical letters, and the first i for which xi ̸= yi has xi
later in the alphabet than yi . For example, PASTOR comes after PASCAL because T comes after C.
n m
Formally, then, x ∈ {A, B, . . . , Z} is alphabetically after y ∈ {A, B, . . . , Z} if
 
m < n ∧ [∀j ∈ {1, 2 . . . , m} : xj = yj ] ∨ y is a proper prefix of x
 
∃i ∈ {1, . . . , min(n, m)} : xi > yi ∧ [∀j ∈ {1, 2 . . . , i − 1} : xi = yi ] .
the first i − 1 characters match, and xi > yi

ISTUDY
128 Logic

“Sorting alphabetically” is usually called lexicographic ordering in computer science—that is, the way in which words are
organized in the dictionary (also known as the lexicon). There’s a surprisingly fascinating history of alphabetical order, with a
much more interesting set of stories than you’d have ever guessed; if you’re intrigued, see the recent book by Judith Flanders [47].

Here is one more example of a statement that we’ve already seen—Goldbach’s conjecture—that implic-
itly involves nested quantifiers; we’ll formalize it in predicate logic. (Part of the point is to illustrate how
complex even some apparently simple concepts are; there’s a good deal of complexity hidden in words like
“even” and “prime,” which at this point seem pretty intuitive!)

Example 3.49: Goldbach’s conjecture.


Recall Goldbach’s conjecture, from Example 3.1: Every even integer greater than 2 can be written as the
sum of two prime numbers. Formalize this proposition using nested quantifiers.
Solution. Using the sumOfTwoPrimes predicate from Example 3.35, we can write this statement as either
of the following:
∀n ∈ {n ∈ Z : n > 2 ∧ 2 | n} : sumOfTwoPrimes(n) (A)
h i
∀n ∈ Z : n > 2 ∧ 2 | n ⇒ sumOfTwoPrimes(n) (B)

In (B), we quantify over all integers, but the implication n > 2 ∧ 2 | n ⇒ sumOfTwoPrimes(n) is triv-
ially true for an integer n that’s not even or not greater than 2, because false implies anything! Thus the
only values of n for which the implication has any “meat” are even integers greater than 2. As such, these
two formulations are equivalent. (See Exercise 3.140.) Expanding the definition of sumOfTwoPrimes(n)
from Example 3.35, we can also rewrite (B) as
h   i
∀n ∈ Z : n > 2 ∧ 2 | n ⇒ ∃p ∈ Z : ∃q ∈ Z : isPrime(p) ∧ isPrime(q) ∧ [n = p + q] . (C)

We’ve also already seen that the predicate isPrime implicitly contains quantifiers too (“for all potential
divisors d, d does not evenly divide p”)—and, for that matter, so does the “evenly divides” predicate |. In
Exercises 3.185–3.187, you’ll show how to rewrite Goldbach’s conjecture in a few different ways, including
using yet further layers of nested quantifiers.

Writing tip: Just as with nested loops in programs, the deeper the nesting of quantifiers, the harder an expression is for a reader
to follow. Using well-chosen predicates (like isPrime, for example) in a logical statement can make it much easier to read—just
like using well-chosen (and well-named) functions makes your software easier to read!

3.5.1 Order of Quantification


In expressions that involve nested quantifiers, the order of the quantifiers matters! As a frivolous example,
take the title of the 1947 hit song “Everybody Loves Somebody” (sung by Dean Martin). There are two
plausible interpretations of the title:

∀x : ∃y : x loves y and ∃y : ∀x : x loves y.

The former is the more natural reading; it says that every person x has someone whom they love, but each
different x can love a different person. (As in: “every child loves their mother.”) The latter says that there
is one single person loved by every x. (As in: “Everybody loves Raymond.”) These claims are different!

ISTUDY
3.5 Predicate Logic: Nested Quantifiers 129

Taking it further: Disambiguating the order of quantification in English sentences is one of the daunting challenges in natural
language processing (NLP) systems. Compare Every student received a diploma and Every student heard a commencement
address: there are, surely, many diplomas and only one address, but building a software system that understands that fact is
tremendously challenging. There are many other vexing types of ambiguity in NLP systems, too. (See p. 92.) Human listeners
are able to use pragmatic knowledge about the world to disambiguate, but doing so properly in an NLP system is very difficult.

Figure 3.28 shows a visual representation of the importance of this order of quantification. In this
visualization, for example, we have the following correspondences:
∀r : ∃c : P(r, c) ←→ every row has at least one column with a filled cell in it.
∃c : ∀r : P(r, c) ←→ there is a single column for which every row has that column’s cell filled.
Now look at Figures 3.28d and 3.28f. The proposition ∃c : ∀r : P(r, c) is true in Figure 3.28f but not true
in Figure 3.28d, in which every row has a filled cell but which cell is filled varies from row to row. But the
proposition ∀r : ∃c : P(r, c) is true in both Figures 3.28d and 3.28f.
Here’s a mathematical example that illustrates the difference even more precisely.

Example 3.50: The largest real number.


One of the following propositions is true; the other is false. Which is which?
∃y ∈ R : ∀x ∈ R : x < y (A)
∀x ∈ R : ∃y ∈ R : x < y (B)

Solution. Translating these propositions into English helps resolve this question. (A) says that there is
a real number y for which the following property holds: every real number is less than y. (“There is a
largest real number.”) But there isn’t a largest real number! So (A) is false. (If someone tells you that y∗
satisfies ∀x ∈ R : x < y∗ , then you can convince him he’s wrong by choosing x = y∗ + 1.) On the other
hand, (B) says that, for every real number x, there is a real number greater than x. And that’s true: for any
x ∈ R, the number x + 1 is greater than x.
In fact, (B) is nearly the negation of (A). (Before you read through the derivation, can you figure out
why we had to say “nearly” in the last sentence?)
   
¬(A) ≡ ¬ ∃y ∈ R : ∀x ∈ R : x < y ≡ ∀y ∈ R : ¬ ∀x ∈ R : x < y De Morgan’s Laws (quantified form)
≡ ∀y ∈ R : ∃x ∈ R : ¬(x < y) De Morgan’s Laws (quantified form)
≡ ∀y ∈ R : ∃x ∈ R : y ≤ x ¬(x < y) ⇔ y ≤ x
≡ ∀x ∈ R : ∃y ∈ R : x ≤ y. renaming the bound variables

∀r : ∀c : P(r, c), ∃r : ∃c : P(r, c),


or, equivalently, or, equivalently,
∀c : ∀r : P(r, c) ∃c : ∃r : P(r, c) ∀c : ∃r : P(r, c) ∀r : ∃c : P(r, c) ∃r : ∀c : P(r, c) ∃c : ∀r : P(r, c)

(a) (b) (c) (d) (e) (f)

Figure 3.28 An illustration of order of quantification. Let r index a row of the grid, and let c index a column. If P(r, c)
is true in each filled cell, then the corresponding proposition is true.

ISTUDY
130 Logic

So (B) and the negation of (A) are almost—but not quite—identical: the latter has a ≤ where the former
has a <. But both (B) and ¬(A) are theorems.

Commutativity of ∀ and commutativity of ∃


Although the order of quantifiers does matter when universal and existential quantifiers both appear in a
proposition, the order of consecutive universal quantifiers doesn’t matter. Neither does the order of con-
secutive existential quantifiers. That is, these quantifiers are commutative. Thus the following statements
are theorems of predicate logic:

∀x ∈ S : ∀y ∈ T : P(x, y) ⇔ ∀y ∈ T : ∀x ∈ S : P(x, y) (∗)


∃x ∈ S : ∃y ∈ T : P(x, y) ⇔ ∃y ∈ T : ∃x ∈ S : P(x, y) (∗∗)

Why is (∗) true? See Figure 3.28a. Both sides of (∗) require that all the cells be filled: one says that, in
every row, every column is filled; the other says that, in every column, every row is filled. But each of these
statements expresses the condition that P(x, y) is true for every pair ⟨x, y⟩ ∈ S × T.
The argument for (∗∗) is similar. Each side is true if and only if P(x, y) holds for at least one pair
⟨x, y⟩ ∈ S × T (that is, when at least one cell is filled anywhere in the grid; see Figure 3.28b).
Because of these equivalences, as notational shorthand we’ll sometimes write ∀x, y ∈ S : P(x, y) instead
of ∀x ∈ S : ∀y ∈ S : P(x, y). We’ll use ∃x, y ∈ S : P(x, y) analogously.

Nested quantification and nested loops


Just as it can be helpful to think of a quantifier in terms of a corresponding loop, it can be helpful to think of
nested quantifiers in terms of nested loops. And a useful way to think about the importance of the order of
quantification is through the way in which changing the order of nested loops changes what they compute.
(See Exercises 3.198–3.203 for some examples.)
Here’s one example about how thinking about the nested-loop analogy for nested quantifiers can be
helpful. Imagine writing a nested loop to examine every element of a two-dimensional array. As long as
iterations don’t depend on each other, it doesn’t matter whether we proceed through the array in row-major
order (from top to bottom, going from left to right within each row) or column-major order (from left to
right, going from top to bottom within each column). The code segments in Figure 3.29 always have the
same return value, which is another way of illustrating the logical equivalence expressed by (∗∗), above.
(The graphical view is that both programs check every cell of the “grid” of possible inputs to A, as in
Figure 3.28b—just in different orders.)

1 for j = 1 to m: 1 for i = 1 to n:
2 for i = 1 to n: 2 for j = 1 to m:
3 if A[i, j] then 3 if A[i, j] then
4 return True 4 return True
5 return False 5 return False

Figure 3.29 Going through a matrix in row-major order or column-major order. If every cell in the grid is inspected,
then the final result is identical (regardless of which order is used).

ISTUDY
3.5 Predicate Logic: Nested Quantifiers 131

3.5.2 Negating Nested Quantifiers


Recall the rules for negating quantifiers from Examples 3.41 and 3.42:

¬ [∀x ∈ S : P(x)] ⇔ [∃x ∈ S : ¬P(x)] and ¬ [∃x ∈ S : P(x)] ⇔ [∀x ∈ S : ¬P(x)] .

Informally, these theorems say that “everybody is P is false” is equivalent to “somebody isn’t P”; and,
similarly, “somebody is P is false” is equivalent to “everybody isn’t P.”
Here we will consider negating a sequence of nested quantifiers. Negating nested quantifiers proceeds in
precisely the same way as negating a single quantifier, just acting on one quantifier at a time. (We already
saw this idea in Example 3.50, where we repeatedly applied these quantified versions of De Morgan’s Laws
to a sequence of nested quantifiers.) For example:

Example 3.51: No cell is filled ≡ every cell is empty.


What does it mean for the statement ∃r : ∃c : P(r, c) to be false? From the visual perspective in Fig-
ure 3.28, the statement is true if any cell in the grid is filled. There are two different ways of thinking
about what it means for ∃r : ∃c : P(r, c) to be false, then:

(1) there does not exist a filled cell ¬ [∃c : ∃r : P(c, r)]
(2) every cell is unfilled. ∀c : ∀r : [¬P(c, r)]

(These two sentences probably don’t sound very different to you, precisely because you probably already
have some sense of what it means to negate quantifiers.) To write this equivalence formally, we repeatedly
apply the rule for negating an existential quantifier (Example 3.42):
¬ [∃r : ∃c : P(r, c)] “There does not exist a filled cell.”
≡ ∀r : ¬ [∃c : P(r, c)]
≡ ∀r : ∀c : ¬P(r, c). “Every cell is unfilled.”

Similarly, let’s rewrite ¬ [∃r : ∀c : P(r, c)]:

¬ [∃r : ∀c : P(r, c)] “It’s not the case that there’s a row with all columns filled.”
≡ ∀r : ¬ [∀c : P(r, c)]
≡ ∀r : ∃c : ¬P(r, c). “Every row has at least one unfilled column.”

Here’s an example of negating a longer sequence of nested quantifiers:

Example 3.52: Triple negations.


For a ternary predicate P, we have:
¬∃x : ∀y : ∃z : P(x, y, z) ≡ ∀x : ¬∀y : ∃z : P(x, y, z)
≡ ∀x : ∃y : ¬∃z : P(x, y, z)
≡ ∀x : ∃y : ∀z : ¬P(x, y, z).

Here’s a last example, which requires translation from English into logical notation:

ISTUDY
132 Logic

Example 3.53: Negating nested quantifiers.


Negate the following sentence:

For every iPhone user, there’s an iPhone app that every one of that user’s iPhone-using friends has
downloaded.

Solution. First, let’s reason about how the statement would be false: there would be some iPhone user—
we’ll call him Steve—such that, for every iPhone app, Steve has a friend who didn’t download that
app.
Write U and A for the sets of iPhone users and apps, respectively. In (pseudo)logical notation, the
original claim looks like
 
∀u ∈ U : ∃a ∈ A : ∀v ∈ U : (u and v are friends) ⇒ (v downloaded a) .

To negate this statement, we repeatedly apply the quantified De Morgan’s laws, once per quantifier:

¬∀u ∈ U : ∃a ∈ A : ∀v ∈ U : [(u and v are friends) ⇒ (v downloaded a)]


≡ ∃u ∈ U : ¬∃a ∈ A : ∀v ∈ U : [(u and v are friends) ⇒ (v downloaded a)]
≡ ∃u ∈ U : ∀a ∈ A : ¬∀v ∈ U : [(u and v are friends) ⇒ (v downloaded a)]
≡ ∃u ∈ U : ∀a ∈ A : ∃v ∈ U : ¬[(u and v are friends) ⇒ (v downloaded a)].

Using ¬(p ⇒ q) ≡ p ∧ ¬q (Exercise 3.85), we can further write this expression as:

≡ ∃u ∈ U : ∀a ∈ A : ∃v ∈ U : [(u and v are friends) ∧ ¬(v downloaded a)].


This last proposition, translated into English, matches the informal description above as to why the orig-
inal claim would be false: there’s some person such that, for every app, that person has a friend who
hasn’t downloaded that app.

3.5.3 Two New Ways of Considering Nested Quantifiers


We’ll close this section with two different but useful ways to think about nested quantification. As a running
example, consider the following (true) proposition:

∀x ∈ Z : ∃y ∈ Z : x = y + 1, (†)

which says that the number that’s one less than every integer is an integer too. We’ll discuss two ways of
thinking about propositions like (†) with nested quantifiers: as a “game with a demon” in which you battle
against an all-knowing demon to try to make the innermost quantifier’s body true; and as a single quantifier
whose body is a predicate, but a predicate that just happens to be expressed using quantifiers.

Nested quantifiers as demon games


One way to think about any proposition involving nested quantifiers is as a “game” played between you
and a demon. (Thanks to Dexter Kozen for teaching me this way of thinking of nested quantifiers [73].)
Here are the rules of the game:

ISTUDY
3.5 Predicate Logic: Nested Quantifiers 133

• Your goal is to make the innermost statement—x = y + 1 for our running example (†)—turn out to be
true; the demon’s goal is to make that statement false.
• Every “for all” quantifier is a choice that the demon makes; every “there exists” quantifier is a choice
that you get to make:
– in the expression ∀a ∈ S : · · ·, the demon chooses a particular value of a ∈ S, and the game continues
in the “· · · ” part of the expression.
– in the expression ∃b ∈ S : · · ·, you choose a particular value of b ∈ S, and, again, the game continues
in the “· · · ” part.
• Your choices and the demon’s choices are made in the left-to-right order of the quantifiers (starting with
the outside-most quantifier, and moving inward).
• You win the game—in other words, the proposition in question is true—if, no matter how cleverly the
demon plays, you make the innermost statement true.

Here are two examples of viewing quantified statements as demon games, one for a true statement and one
for a false statement:

Example 3.54: Showing that (†) is true.


We’ll use a “game with the demon” to argue that ∀x ∈ Z : ∃y ∈ Z : x = y + 1 is true.
1 The outermost quantifier is ∀, so the demon picks a value for x ∈ Z.
2 Now you get to pick a value y ∈ Z. A good choice for you is y = x − 1.
3 Because you chose y = x − 1, indeed x = y + 1. You win!
(For example, if the demon picks 16, you pick 15. If the demon picks −19, you pick −20. And so forth.)
No matter what the demon picks, your strategy will make you win—and therefore (†) is true!

By contrast, consider (†) with the order of quantification reversed:

Example 3.55: A losing demon game.


Consider playing a demon game for the proposition
∃y ∈ Z : ∀x ∈ Z : x = y + 1.
Unfortunately, the ∃ is first, which means that you have to make the first move. But when you pick a
number y, the demon then gets to pick an x—and there are an infinitude of x values that the demon can
choose so that x ̸= y + 1. (You pick 42? The demon picks 666. You pick 17? The demon picks 666. You
pick 665? The demon picks 616.) Therefore you can’t guarantee that you win the game, so we haven’t
established this claim.

By the way, you could win a demon game to prove the negation of the claim in Example 3.55:

¬(the claim from Example 3.55) ≡ ∀y ∈ Z : ∃x ∈ Z : x ̸= y + 1.

First, the demon picks some unknown y ∈ Z. Then you have to pick an integer x such that x ̸= y + 1—but
that’s easy: for any y the demon picks, you pick x = y. You win!

ISTUDY
134 Logic

Nested quantifiers as single quantifiers


Here’s a reminder of our running example, with a portion of it highlighted:
∀x ∈ Z : ∃y ∈ Z : x = y + 1 . (†)
What kind of thing is the highlighted piece? It can’t be a proposition, because x is a free variable in
it. But once we plug in a value for x, the expression becomes true or false. In other words, the expres-
sion ∃y ∈ Z : x − 1 = y is itself a (unary) predicate: once we are given a value of x, we can compute the
truth value of the expression. Similarly, the expression x − 1 = y is also a predicate—but a binary predi-
cate, taking both x and y as arguments. Let’s name these predicates: let’s write P(x, y) for x − 1 = y, and
hasIntPredecessor(x) for ∃y ∈ Z : x − 1 = y.
Using this perspective, we can write (†) as follows:
hasIntPredecessor(x)
P(x, y)

∀x ∈ Z : ∃y ∈ Z : x = y + 1 ≡ ∀x ∈ Z : ∃y ∈ Z : P(x, y)
(‡)
≡ ∀x ∈ Z : hasIntPredecessor(x).

One implication of this view is that negating nested quantifiers is really just the same as negating non-nested
quantifiers. For example:

Example 3.56: Negating nested quantifiers.


We can view the negation of (†), as written in (‡), as follows:
¬(†) ≡ ¬∀x ∈ Z : hasIntPredecessor(x)
≡ ∃x ∈ Z : ¬hasIntPredecessor(x).

And, re-expanding the definition of hasIntPredecessor and again applying the quantified De Morgan’s
Law, we have that

¬hasIntPredecessor(x) ≡ ¬∃y ∈ Z : P(x, y)


≡ ∀y ∈ Z : ¬P(x, y)
≡ ∀y ∈ Z : x − 1 ̸= y.

Together, these two negations show

¬(†) ≡ ∃x ∈ Z : ¬hasIntPredecessor(x)
≡ ∃x ∈ Z : ∀y ∈ Z : ¬P(x, y)
≡ ∃x ∈ Z : ∀y ∈ Z : x − 1 ̸= y.

Taking it further: This view of nested quantifiers as a single quantifier whose body just happens to express its condition using
quantifiers has a close analogy with writing a particular kind of function in a programming language. If we look at a two-argument
function in the right light, we can see it as a function that takes one argument and returns a function that takes one argument.
This approach is called Currying; see p. 135.

ISTUDY
3.5 Predicate Logic: Nested Quantifiers 135

Computer Science Connections

Currying
For a binary predicate P(x, y) in an expression like ∀y : ∀x : P(x, y), we can think of this expression as first plugging
in a value for y, which then yields a unary predicate ∀x : P(x, y) (which takes an argument x). There’s an interesting
parallel between this view of nested quantifiers and a way of writing functions in some programming languages.
For concreteness, let’s think about a small
1 (* M L *) A quick note on ML syntax:
function that takes two arguments and just fun is a keyword that says
2 fun sum a b = a + b;
returns their sum. Figure 3.30 shows imple- 3 sum 3 2; (* returns 5 *) we’re defining a function; sum
mentations of this function in three different is the name of it; a b is the
4 sum 99 12; (* returns 111 *)
list of arguments; and that
programming languages, and then uses sum function is defined to return
to actually compute 3 + 2 and 99 + 12. 1 # Python the value of a + b.
But now suppose that we want to define a 2 def sum(a,b):
3 return a + b
function that takes one argument and adds 3 For Scheme: (lambda args
4 sum(3,2) # returns 5 body) denotes the function
to it. Can we make use of the sum function 5 sum(99,12) # returns 111 that takes arguments args and
to do so? (The analogy to predicates is that returns the value of the
taking a two-argument predicate and apply- function body body. Also,
1 ; Scheme
Scheme is a prefix language,
ing it to one argument gives one-argument 2 (define sum
so applying the function f to
predicate; here we’re trying to take a two- 3 (lambda (a b) (+ a b))) arguments arg1, arg2, …,
4 (sum 3 2) ;; returns 5 argN is written (f arg1 arg2
argument function in a programming lan- ... argN); for example, (+ 1
5 (sum 99 12) ;; returns 111
guage and apply it to one argument to yield a 2) has the value 3.
one-argument function.) The answer is yes—
Figure 3.30 A sum function, implemented in three languages.
and it turns out that creating the “add 3”
function using sum is very easy in ML: we simply apply sum to one argument, and the result is a function that
“still wants” one more argument. To do this, we execute val add3 = sum 3; which defines a value (that’s the val
part of the syntax) whose name is add3 and whose value is sum applied to 3. That makes add3 a one-argument
function, and now add3 0 has the value 3; add3 108 has the value 111; and add3 199 has the value 202.
A function like sum in ML, which takes its
1 fun sum a b = a + b;
multiple arguments “one at a time,” is said to
2 val add3 = sum 3;
be Curried—after the logician Haskell Curry 3
(1900–1982). (As usual, there’s a richer his- 4 sum 3 2; (* returns 5 *)
tory than a concept named after one person 5 add3 2; (* returns 5 *)
would suggest: the idea dates back at least to
Moses Schönfinkel (1888–1942) and Gottlob 1 def sum(a):
Define sum to be a function
2 def sumA(b):
Frege (1848–1925), two earlier logicians.) that takes an argument a and
3 return a + b returns . . .
The programming language Haskell is also 4 return sumA
named in Curry’s honor. Thinking about Cur- 5
ried functions is a classical topic in the study 6 add3 = sum(3)

of programming languages. (Currying can 7


8 sum(3)(2) # returns 5
be very handy in programming—particularly
9 add3(2) # returns 5
when you have one parameter to a function . . . the function that takes an
that stays the same for many different calls 1 (define sum (lambda (a)
argument b and returns a + b.
(This function [taking b,
to a multiparameter function. For example, 2 (lambda (b) (+ a b))) returning a+b] is called sumA
you might Curry a function to look up data 3 in the Python code, and is
in a large dataset: query(dataset)(item). 4 (define add3 (sum 3)) unnamed in the Scheme
5 code.)
(See [1], or any good textbook on program-
6 ((sum 3) 2) ;; returns 5
ming languages.) 7 (add3 2) ;; returns 5
While writing Curried functions is almost
automatic in ML, you can also write Curried Figure 3.31 Curried implementations of sum.
functions in other programming languages,
too. Examples of a Curried version of sum in Python and Scheme are in Figure 3.31; it’s even possible to write
Curried functions in C or Java, though it’s much less natural than in ML, Python, and Scheme.

ISTUDY
136 Logic

EXERCISES
3.148 Let F denote the set of all functions f : R → R taking real numbers as input and producing real numbers as output. (For one
example, plusOne(x) = x + 1 is a function plusOne : R → R, so plusOne ∈ F.) Is the proposition ∀c ∈ R : [∃f ∈ F : f(0) = c]
true or false? Justify your answer.
3.149 What about ∃f ∈ F : [∀c ∈ R : f(0) = c]? (F is defined in Exercise 3.148.) Again, justify your answer.
3.150 What about ∀c ∈ R : [∃f ∈ F : f(c) = 0]? (F is defined in Exercise 3.148.) Justify.
3.151 What about ∃f ∈ F : [∀c ∈ R : f(c) = 0]? (F is defined in Exercise 3.148.) Again, justify.

Under many operating systems, users can schedule a task to be run at a specified time in the future. In Unix-like operating systems,
this type of scheduled job is called a cron job. (Greek: chron- “time.”) For example, a backup might run nightly at 2:00am, and a
scratch drive might be emptied out weekly on Friday night at 11:50pm. Let T = {1, 2, . . . , tmax } be a set of times (measured in
minutes, let’s say), and let J be a set of jobs. Let scheduledAt be a predicate where scheduledAt(j, t) means “job j is scheduled
at time t.” (Assume that jobs do not last more than one minute. The same job can be scheduled at multiple times.) Formalize the
following conditions using only standard quantifiers, arithmetic operators, logical connectives, and the scheduledAt predicate.
3.152 There is never more than one job scheduled at the same time.
3.153 Every job is scheduled at least once.
3.154 Job A is never run twice within 2 minutes.
3.155 Job B is run at least three times.
3.156 Job C is run at most twice.
3.157 Job D is run sometime after the last time that Job E is run.
3.158 Job F is run at least once between consecutive executions of Job G.
3.159 Job H is run at most once between consecutive executions of Job I.

Let P[1 . . . n, 1 . . . m] be a two-dimensional array of the pixels of a black-and-white image: for every x and y, the value of P[x, y] = 0
if the ⟨x, y⟩th pixel is black, and P[x, y] = 1 if it’s white. Translate these statements into predicate logic:
3.160 Every pixel in the image is black.
3.161 There is at least one white pixel.
3.162 Every row has at least one white pixel.
3.163 There are never two consecutive white pixels in the same column.

A standard American crossword puzzle is a 15-by-15 grid, which can be represented as a two-dimensional 15-by-15 array G,
where G[i, j] = True if and only if the cell in the ith row and jth column is “open” (otherwise known as “unfilled” or “not a black
square”). If i and j are out of range—that is, for any i ≤ 0 or i > 15 or j ≤ 0 or j > 15—assume G[i, j] = False. A contiguous
horizontal or vertical sequence of two or more open squares surrounded by black squares is called a word. (The assumption that
G[i, j] is False when i or j is out of range is equivalent to us pretending that our real grid is surrounded by black squares. In CS, this
style of structure is called a sentinel, wherein we introduce boundary values to avoid having to write out verbose special cases.)
3.164 There are certain customs that G must obey to be a standard American puzzle (all of which are met by the example grid in Fig-
ure 3.32). Rewrite the following informally stated condition as a fully formal definition. No unchecked letters: every open cell
appears in both a down word and an across word.
3.165 Repeat for the no two-letter words condition: every word has length at least 3.
3.166 Repeat for rotational symmetry: if the entire grid is rotated by 180◦ , then the rotated grid is identical to the original grid.
3.167 Repeat for overall interlock: for any two open squares, there is a path of open squares that connects the first to the second. (That
is, we can get from here to there through words.) Your answer should include a formal definition of what it means for there to be a
path from one cell to another (“there exists a sequence of squares such that . . .”).

3.168 According to Definition 2.32, a partition of a set S is a set {A1 , A2 , . . . , Ak } of sets such that (i) A1 , A2 , . . . , Ak are all nonempty;
(ii) A1 ∪ A2 ∪ · · · ∪ Ak = S; and (iii) for any i and j ̸= i, the sets Ai and Aj are disjoint. Formalize this definition using nested
quantifiers and basic set notation.

3.169 Consider the “maximum” problem: given an array of numbers, return the maximum element of that array:
Input: An array A[1 . . . n], where each A[i] ∈ Z.
Output: An integer x ∈ Z such that . . .
Complete the formal specification for this problem by finishing the specification for the output.

ISTUDY
Exercises 137

Two “across” (horizontal) words

Two “down” (vertical) words

Figure 3.32 A standard American crossword puzzle.

Let T = {1, . . . , 12} × {0, 1, . . . , 59} be the set of numbers that a digital clock in 12-hour mode can display. We can think of a
clock as a function c : T → T: if the real time is t ∈ T, the clock displays the time c(t). (For example, if the clock fastby7 runs 7
minutes fast, then fastby7(12:00) = 12:07.) Formalize the following predicates using only standard quantifiers, equality symbols,
and the function add : T × Z≥0 → T, where add(t, x) is the time that is x minutes later than t. (See Exercise 2.243.)
3.170 A clock is right if it always displays the correct time. Formalize right.
3.171 A clock keeps time if there’s some fixed offset by which it is always off from being right. (For example, fastby7 above correctly
keeps time.) Formalize keepsTime.
3.172 A clock is close enough if it always displays a time that’s within 2 minutes of the correct time. Formalize closeEnough.
3.173 A clock is broken if there’s some fixed time that it always displays, regardless of the real time. Formalize broken.
3.174 “Even a broken clock is right twice a day,” they say. (They mean: “even a broken clock displays the correct time at least once per
T.”) Using the broken predicate from Exercise 3.173, formalize the adage and prove it true.

A classic topic of study for computational biologists is genomic distance measures: given two genomes, we’d like to report a single
number that represents how different those two genomes are. These distance computations are useful in, for example, reconstructing
the evolutionary tree of a collection of species. Consider two genomes A and B of bacterium. Label the n genes that appear in
A’s chromosome, in order, as πA = ⟨1, 2, . . . , n⟩. The same genes appear in a different order in B—say, in the order πB =
⟨r1 , r2 , . . . rn ⟩. A particular model of genomic distance will define a specific way in which this list of numbers can mutate; the
question is to find a minimum-length sequence of mutations to explain the difference between the orders πA and πB . One type of
biologically motivated mutation is the prefix reversal, shown in Figure 3.33a. This model defines what’s called the pancake-flipping
problem (coincidentally, the subject of the lone academic paper with Bill Gates as an author [50]). See Figure 3.33b.
3.175 You are given a sequence of pancake radii ⟨r1 , r2 , . . . , rn ⟩, listed from top to bottom, where {r1 , r2 , . . . , rn } = {1, 2, . . . , n} (but
not necessarily in order). Give a fully quantified logical expression expressing the condition that the given pancakes are sorted.
3.176 Again, you are given a sequence of pancake radii ⟨r1 , r2 , . . . , rn ⟩, listed from top to bottom. Give a fully quantified logical
expression expressing the condition that the given pancakes can be sorted with exactly one flip.
3.177 Again, you are given a sequence of pancake radii ⟨r1 , r2 , . . . , rn ⟩, listed from top to bottom. Give a fully quantified logical expres-
sion expressing the condition that the given pancakes can be sorted with exactly two flips. (Hint: consider writing a program to
verify that your indices aren’t off by one.)

3.178 Let P be a set of people, and let T be a set of times. Let friends(x, y) be a predicate denoting that x ∈ P and y ∈ P are friends. Let
bought(x, t) be a predicate denoting that x ∈ P bought an iPad at time t ∈ T. Formalize this statement in predicate logic: “Everyone
who bought an iPad has a friend who bought one before they did.”
3.179 Is the claim from Exercise 3.178 true (in the real world)? Justify your answer.

In programming, an assertion is a logical statement that announces (“asserts”) a condition φ that the programmer believes to
be true. For example, a programmer who is about to access the 202nd element of an array A might assert that length(A) ≥ 202
before accessing this element. When an executing program reaches an assert statement, the program aborts if the condition in
the statement isn’t true. (Using assertions can be an extremely valuable way of documenting and debugging programs, particu-
larly because liberally including assertions will allow the revelation of unexpected data values much earlier in the execution of a
program. And these languages have a global toggle that allows the testing of assertions to be turned off, so once the programmer
is satisfied that the program is working properly, she doesn’t have to worry about any running-time overhead for these checks.)
3.180 Give a nonempty input array A[1 . . . n] that would cause the assertion in Figure 3.34a to fail. (That is, identify an array A that would
cause for the asserted condition to be false.)
3.181 Give a nonempty input array A[1 . . . n] that would cause the assertion in Figure 3.34b to fail.
3.182 Give a nonempty input array A[1 . . . n] that would cause the assertion in Figure 3.34c to fail.

ISTUDY
138 Logic

If, in the course of cell activity, one end of If the loop untangles itself “the other way
4 3 the chromosome comes in contact with a 4 3 around,” the effect is to reverse the order 4 3
7 6 5

7 6 5

7 6 5
point in the middle, a loop forms. of the genes in that loop.

2 1

2 1

2 1
8 CBA 9 8 CBA 9 8 CBA 9

123456789ABC prefix reversal (= a pancake flip) 987654321ABC


(a) A prefix reversal, in which some prefix of the genome is reversed, as in ⟨3, 2, 1, 4, 5⟩ → ⟨1, 2, 3, 4, 5⟩.

⟨ 4, 3, 2, 1, 5⟩ ⟨ 1, 2, 3, 4, 5⟩ ⟨5, 4, 3, 1, 2⟩ ⟨2, 1, 3, 4, 5⟩ ⟨1, 2, 3, 4, 5⟩

(b) The pancake-flipping problem: you are given a stack of pancakes, which you want to arrange in order of size by
repeatedly flipping some number of pancakes at the top of the pile, using as few flips as possible.

Figure 3.33 Genome rearrangements: prefix reversals and the pancake-flipping problem.

(a) (b) (c)

1 last := 0 1 total := A[1] 1 for start := 1, . . . , n − 1:


2 for index := 1, . . . , n − 1: 2 i := 1 2 min := start
3 if A[index] > A[index + 1] then 3 for i := 2, . . . , n − 1: 3 for i := start + 1, . . . , n:
4 last := index 4 if A[i + 1] > A[i] then 4 assert start = 1 or A[i] > A[start − 1]
5 assert last ≥ 1 and last ≤ n − 1 5 total := total + A[i] 5 if A[min] > A[i] then
6 swap A[last] and A[last + 1] 6 assert total > A[1] 6 min := i
7 return total 7 swap A[start] and A[min]

Figure 3.34 Three pieces of pseudocode containing assertions.

While the quantifiers ∀ and ∃ are by far the most common, there are some other quantifiers that are sometimes used. For each of
the following quantifiers, write an expression that is logically equivalent to the given statement that uses only the quantifiers ∀ and
∃; standard propositional logic notation (∧, ¬, ∨, ⇒); standard equality/inequality notation (=, ≥, ≤, <, >); and the predicate
P in the question.
3.183 Write an equivalent expression to ∃! x ∈ Z : P(x) (“there exists a unique x ∈ Z such that P(x)”), which is true when there is one
and only one value of x in the set Z such that P(x) is true.
3.184 Write an equivalent expression to ∃∞ x ∈ Z : P(x) (“there exist infinitely many x ∈ Z such that P(x)”), which is true when there
are infinitely many different values of x ∈ Z such that P(x) is true.

3.185 Prove that the following two formulations of Goldbach’s conjecture are logically equivalent:
h   i
∀n ∈ Z : n > 2 ∧ 2 | n ⇒ ∃p ∈ Z : ∃q ∈ Z : isPrime(p) ∧ isPrime(q) ∧ n = p + q (1)
h  i
∀n ∈ Z : ∃p ∈ Z : ∃q ∈ Z : n ≤ 2 ∨ 2 ̸ | n ∨ isPrime(p) ∧ isPrime(q) ∧ n = p + q (2)

3.186 Rewrite Goldbach’s conjecture without using isPrime—that is, using only quantifiers, the | predicate, and standard arithmetic (+,
·, ≥, etc.).
3.187 Even the | predicate implicitly involves a quantifier: p|q is equivalent to ∃k ∈ Z : p · k = q. Rewrite Goldbach’s conjecture without
using | either—that is, use only quantifiers and standard arithmetic symbols (+, ·, ≥, etc.).
3.188 (programming required) As we discussed, the truth value of Goldbach’s conjecture is currently unknown. In a project that concluded
in 2012, the conjecture has been verified for all even integers from 4 up to 4 × 1018 , through a massive distributed computational
effort led by Tomás Oliveira e Silva. Write a program to test Goldbach’s conjecture, in a programming language of your choice,
for a much smaller range: all even integers up to 10,000.

Let S be an arbitrary nonempty set and let P be an arbitrary binary predicate. Decide whether the following statements are always
true (for any P and S), or whether they can be false. Prove your answers.
3.189 [∃y ∈ S : ∀x ∈ S : P(x, y)] ⇒ [∀x ∈ S : ∃y ∈ S : P(x, y)]
3.190 [∀x ∈ S : ∃y ∈ S : P(x, y)] ⇒ [∃y ∈ S : ∀x ∈ S : P(x, y)]

ISTUDY
Exercises 139

(a) (b) (c)


1 for x in S: 1 for x in S: 1 for x in S:
2 for y in S: 2 flag := False 2 flag := True
3 flag := False 3 for y in S: 3 for y in S:
4 if P(x) or P(y) then 4 if not P(x, y) then 4 if not P(x, y) then
5 flag := True 5 flag := True 5 flag := False
6 if flag then 6 if flag then 6 if flag then
7 return True 7 return True 7 return True
8 return False 8 return False 8 return False
(d) (e) (f)
1 for x in S: 1 for x in S: 1 flag := False
2 flag := False 2 for y in S: 2 for x in S:
3 for y in S: 3 if P(x, y) then 3 for y in S:
4 if not P(x, y) then 4 return False 4 if P(x, y) then
5 flag := True 5 return True 5 flag := True
6 if not flag then 6 return flag
7 return False
8 return True

Figure 3.35 Some nested loops that compute some facts about a predicate P. (Assume that the universe S is finite.)

Consider any unary predicate P(x) over a nonempty set S. It turns out that both of the following propositions are theorems of
propositional logic. Prove them both.
 
3.191 ∀x ∈ S : P(x) ⇒ ∃y ∈ S : P(y)
 
3.192 ∃x ∈ S : P(x) ⇒ ∀y ∈ S : P(y)

Most real-world English utterances are ambiguous—that is, there are multiple possible interpretations of the given sentence.
A particularly common type of ambiguity involves order of quantification. For each of the following English sentences, find as
many different logical readings based on order of quantification as you can. Write down those interpretations using pseudological
notation, and also write a sentence that expresses each meaning unambiguously.
3.193 A computer crashes every day.
3.194 Every prime number except 2 is divisible by an odd integer greater than 1.
3.195 Every student takes a class every term.
3.196 Every submitted program failed on a case submitted by a student.

3.197 You should have found two different logical interpretations in Exercise 3.194. One of these interpretations is a theorem, and one
of them is not. Decide which is which, and prove your answers.

3.198 The code in Figure 3.35a uses nested loops to compute some fact about a predicate P. Write a fully quantified statement of predicate
logic whose truth value matches the value returned by the code. (Assume that S is a finite universe.)
3.199 Do the same for the code in Figure 3.35b.
3.200 Do the same for the code in Figure 3.35c.
3.201 Do the same for the code in Figure 3.35d.
3.202 Do the same for the code in Figure 3.35e.
3.203 Do the same for the code in Figure 3.35f.

3.204 As we’ve discussed, there is no algorithm that can decide whether a given fully quantified proposition φ is a theorem of predicate
logic. But there are several specific types of fully quantified propositions for which we can decide whether a given statement is
a theorem. Here you’ll show that, when quantification is only over a finite set, it is possible to give an algorithm to determine
whether φ is a theorem. Suppose that you are given a fully quantified proposition φ, where the domain for every quantifier is a
finite set—say S = {0, 1}. Describe an algorithm that is guaranteed to figure out whether φ is a theorem.

ISTUDY
140 Logic

3.6 Chapter at a Glance

Propositional Logic
A proposition is the kind of thing that is either true or false. An atomic proposition (or Boolean variable)
is a conceptually indivisible proposition. A compound proposition (or Boolean formula) is one built up
using a logical connective and one or more simpler propositions. The most common logical connectives
are shown in Figure 3.36. A proposition that contains the atomic propositions p1 , . . . , pk is sometimes
called a proposition over p1 , . . . , pk , or a Boolean formula or a Boolean expression over p1 , . . . , pk .
The truth value of a proposition is
its truth or falsity. (The truth value of a negation ¬p “not p”
Boolean formula over p1 , . . . , pk is deter- disjunction p∨q “p or q”
mined only by the truth values of each (inclusive or: “p, q, or both”)
conjunction p∧q “p and q”
of p1 , . . . , pk .) Each logical connective implication p ⇒ q “if p, then q” or “p implies q”
is defined by how the truth value of equivalence p ⇔ q “p if and only if q”
the compound proposition formed using exclusive or p⊕q “p xor q”
that connective relates to the truth values (“p or q, but not both”)

of the constituent propositions. A truth Figure 3.36 The six basic logical connectives of propositional logic.
table defines a connective by listing, for
each possible assignment of truth values p ¬p p q p∧q p∨q p⇒q p⇔q p⊕q
for the constituent propositions, the truth T F T T T T T T F
F T T F F T F F T
value of the entire compound proposi-
F T F T T F T
tion. See Figure 3.37. F F F F T T F
Consider a Boolean formula over vari- Figure 3.37 Truth tables for the logical connectives in Figure 3.36.
ables p1 , . . . , pk . A truth assignment is a
setting to true or false for each variable. (So a truth assignment corresponds to a row of the truth table for
the proposition.) A truth assignment satisfies the proposition if, when the values from the truth assignment
are plugged in, the proposition is true. A Boolean formula is a tautology if every truth assignment satisfies
it; it’s satisfiable if some truth assignment satisfies it; and it’s unsatisfiable or a contradiction if no truth
assignment does. Two Boolean propositions are logically equivalent if they’re satisfied by exactly the same
truth assignments (that is, they have identical truth tables).
Consider an implication p ⇒ q. Note from Figure 3.37 that the proposition p ⇒ q is true if, whenever
p is true, q is too. So the only situation in which p ⇒ q is false is when p is true and q is false. False
implies anything! Anything implies true! The antecedent or hypothesis of the implication p ⇒ q is p; the
consequent or conclusion of the implication is q. The converse of the implication p ⇒ q is the implica-
tion q ⇒ p. The contrapositive is the implication ¬q ⇒ ¬p. Any implication is logically equivalent to its
contrapositive. But an implication is not logically equivalent to its converse.
A literal is a Boolean variable or the negation of a Boolean variable. A proposition is in conjunctive
normal form (CNF) if it is the conjunction (and) of a collection of clauses, where a clause is a disjunction
(or) of a collection of literals. A proposition is in disjunctive normal form (DNF) if it is the disjunction
of a collection of clauses, where a clause is a conjunction of a collection of literals. Every proposition is
logically equivalent to a proposition that is in CNF, and to another that is in DNF.

ISTUDY
3.6 Chapter at a Glance 141

Predicate Logic
A predicate is a statement containing some number of “blanks” (variables), and that has a truth value once
values are plugged in for those variables. (Alternatively, a predicate is a Boolean-valued function.) Once
particular values for these variables are plugged in, the resulting expression is a proposition. A proposition
can also be formed from a predicate through quantifiers:

• The universal quantifier ∀ (“for all”): the proposition ∀x ∈ U : P(x) is true if, for every x ∈ U, we
have that P(x) is true.
• The existential quantifier ∃ (“there exists”): the proposition ∃x ∈ U : P(x) is true if, for at least one
x ∈ U, we have that P(x) is true.

The set U is called the universe or domain of discourse. When the universe is clear from context, it may be
omitted from the notation.
   
In the expression ∀x : or ∃x : , the scope or body of the quantifier is the underlined blank,
and the variable x is bound by the quantifier. A free or unbound variable is one that is not bound by any
quantifier. A fully quantified expression is one with no free variables.
A theorem of predicate logic is a fully quantified expression that is true for all possible meanings of the
predicates in it. Two expressions are logically equivalent if they are true under precisely the same set of
meanings for their predicates. (Alternatively, two expressions φ and ψ are logically equivalent if φ ⇔ ψ
is a theorem.) Two useful theorems of predicate logic are De Morgan’s laws:
¬∀x ∈ S : P(x) ⇔ ∃x ∈ S : ¬P(x) and ¬∃x ∈ S : P(x) ⇔ ∀x ∈ S : ¬P(x).
There is no general algorithm that can test whether any given expression is a theorem. If we wish to
prove that an implication φ ⇒ ψ is an theorem, we can do so with a proof by assuming the antecedent:
to prove that the implication φ ⇒ ψ is always true, we will rule out the one scenario in which it wouldn’t
be; specifically, we assume that φ is true, and then prove that ψ must be true too, under this assumption.
A vacuously quantified statement is one in which the domain of discourse is the empty set. The vacuous
universal quantification ∀x ∈ ∅ : P(x) is a theorem; the vacuous existential quantification ∃x ∈ ∅ : P(x)
is always false.
Quantifiers are nested if one quantifier is inside the scope of another quantifier. Nested quantifiers work
in precisely the same way as single quantifiers, applied in sequence. A proposition involving nested quan-
tifiers like ∀x ∈ S : ∃y ∈ T : R(x, y) is true if, for every choice of x, there is some choice of y (which
can depend on the choice of x) for which R(x, y) is true. Order of quantification matters in general; the
expressions ∀x : ∃y : R(x, y) and ∃y : ∀x : R(x, y) are not logically equivalent.

ISTUDY
142 Logic

Key Terms and Results

Key Terms Key Results


Propositional Logic Propositional Logic
• proposition 1 We can build a truth table for any proposition by
• truth value repeatedly applying the definitions of each of the logical
• atomic and compound propositions connectives, as shown in Figure 3.3.
• logical connectives: 2 Two propositions φ and ψ are logically equivalent if
– negation (¬) and only if φ ⇔ ψ is a tautology.
– conjunction (∧) 3 An implication p ⇒ q is logically equivalent to its
– disjunction (∨) contrapositive ¬q ⇒ ¬p, but not to its converse q ⇒ p.
– implication (⇒) 4 There are many important propositional tautologies and
– exclusive or (⊕) logical equivalences, some of which are shown in
– if and only if (⇔) Figures 3.9 and 3.12.
• truth assignments and truth tables 5 We can show that propositions are logically equivalent
• tautology by showing that every row of their truth tables are the
• satisfiability/unsatisfiability same.
• logical equivalence 6 Every proposition is logically equivalent to one that is in
• antecedent and consequent disjunctive normal form (DNF) and to one that is in
• converse, contrapositive, and inverse conjunctive normal form (CNF).
• conjunctive normal form (CNF)
• disjunctive normal form (DNF)
Predicate Logic
1 We can build a proposition from a predicate P(x) by
Predicate Logic plugging in a particular value for x, or by quantifying
• predicate over x as in ∀x ∈ S : P(x) or ∃x ∈ S : P(x).
• quantifiers: 2 Unlike with propositional logic, there is no algorithm
that is guaranteed to determine whether a given fully
– universal quantifier (∀)
quantified predicate-logic expression is a theorem.
– existential quantifier (∃)
3 There are many important predicate-logic theorems,
• free and bound variables some of which are shown in Figure 3.21.
• fully quantified expression 4 The statements ¬∀x : P(x) and ∃x : ¬P(x) are logically
• theorems of predicate logic equivalent. So are ¬∃x : P(x) and ∀x : ¬P(x).
• logical equivalence in predicate logic 5 We can think of nested quantifiers as a sequence of
• proof by assuming the antecedent single quantifiers, or as “games with a demon.”
• vacuous quantification
• nested quantifiers

ISTUDY
4 Proofs

In which our heroes build ironclad scaffolding to support their


claims, thereby making them impervious to any perils they might
encounter.

ISTUDY
143
144 Proofs

4.1 Why You Might Care


Madame, Madame, un bon mot ne prouve rien.
Madam, madam, a witty saying proves nothing.
Voltaire (1694–1778)
Le Dîner Du Comte De Boulainvilliers (1768)

A typical modern computational system is structured like a tower, with each layer’s proper behavior
contingent on the correctness of the one below. The website that you use to send money to a friend relies
on both a stack of networking protocols (HTTP relying on TCP, which is relying on IP, etc.), as well as
a stack of applications on your computer or your phone (your browser relying on your operating system,
which is relying on the hardware itself). A key theme in computer science is this idea of abstraction: that,
so long as it’s working properly, you can rely on the next layer in one of these towers (or a function in a
large program, or . . .) without worrying about how exactly it works. You just have to trust that it works.
This chapter is about the mathematical version of a tower of trust. A proof is a convincing argument that
establishes a particular claim as fact. That claim might be something explicitly computational: Bubble Sort
performs fewer comparisons than Merge Sort when the input array is already sorted, for example. Or the
claim might be noncomputational, at least superficially: a property of an operating system, the impossibility
of designing a voting system with a certain set of properties. Generally speaking, our goal—in this chapter,
in this book—is to establish new facts. And that’s precisely the point of a proof: to derive a new fact from
old facts, while persuading the reader that the new fact is, indeed, a fact. (For example, we can derive a
new fact using Modus Ponens: if we know both p and p ⇒ q, then we can conclude that q is a fact, too.)
Why are proofs useful in computer science? First, proofs help prevent bugs. Whether or not they write
down in full detail a proof that their code is correct, good software developers are always reasoning carefully
about whether a function performs the task it’s supposed to perform, or whether a particular optimization
continues to meet the given specification. For a theoretical computer scientist, proofs are bread and butter:
proofs of correctness for novel algorithms, or proofs of the hardness of solving a particular problem. For
both theoretically and practically oriented computer scientists, a proof often yields great insight that can
avoid a brute-force solution, improve the efficiency of the code, or unearth some structural property of a
problem that reveals that the problem doesn’t even need to be solved in the first place.
We begin in Section 4.2 with an extended exploration of error-correcting codes, systems that allow for
the reliable transmission and storage of information even in environments that corrupt data as it’s stored/
transmitted/received/retrieved. (For example, CDs and DVDs are susceptible to scratches, and deep-space
satellites’ transmissions are susceptible to radiation.) This section will merely scratch the surface of error-
correcting codes, but it will serve as a nice introduction to error-correcting codes—and to proofs. Then, in
Section 4.3, the technical meat of this chapter, we will develop a toolbox of techniques to use in proofs, and
some strategies for choosing among them. We’ll also illustrate these proof techniques with a hefty collection
of examples about arithmetic. (In Section 4.5, we’ll also catalogue some common types of mistakes in
purported proofs, so that you can avoid them—and recognize bogus proofs when others attempt them.)
While the proof techniques themselves are the “point” of this chapter, in many cases the fact that
we’re proving is at least as interesting as the proof of that fact. Throughout our tour of proof techniques,
we’ll encounter a variety of examples of potentially interesting facts, in Section 4.4: about propositional
logic, including the fact that we need only one logical connective (“nand”) to express every proposition;
about geometry (the Pythagorean Theorem); about prime numbers; and about uncomputability (there are
problems that cannot be solved by any computer!).

ISTUDY
4.2 Error-Correcting Codes 145

4.2 An Extended Application with Proofs: Error-Correcting Codes


Facts or opinions which are to pass through the hands of so many, to be misconceived by
folly in one, and ignorance in another, can hardly have much truth left.
Jane Austen (1775–1817)
Persuasion (1818)

This section introduces error-correcting codes, a way of encoding data so that it can be transmitted
correctly even in the face of (a limited number of) errors in transmission. These codes are used widely—
for example, on DVDs/CDs and in file transfer protocols—and they’re interesting to study on their own.
But, despite appearances, they are not the point of this section! Rather, they’re mostly an excuse to introduce
a technical topic with some interesting (and nonobvious) results—and to persuade you of a few of those
results. In other words, this section is really about proofs.

Error-detecting and error-correcting codes: the basic idea


Visa and Mastercard use 16-digit numbers for their credit and debit cards, but it turns out that there are
only 1015 valid credit-card numbers: a number is valid only if a particular arithmetic calculation on the
digits—more or less, adding up the digits and taking the result modulo 10—always turns out to be zero.
(See Exercises 4.1–4.5 for details of the calculation.) Or, to describe this fact in another way: if you get a
(mildly gullible) friend to read you any 15 digits of their credit-card number, you can figure out the 16th
digit. Less creepily, this system means that there’s an error-detection mechanism built into credit-card
numbers: if any one digit in your number is mistranscribed, then a very simple algorithm can reject that
incorrect card number as invalid (because the calculation above will yield an answer other than zero).
In this section, we’ll explore encoding schemes with this sort of error-handling capability. Suppose that
you have some binary data that you wish to transmit to a friend across an imperfect channel—that is, one
that (due to cosmic rays, hardware failures, or whatever) occasionally mistransmits a 0 as a 1, or vice
versa. (When we refer to an error in a bitstring x, what we mean is a “substitution error,” where some
single bit in x is flipped.) The fundamental idea will be to add redundancy to the transmitted data; if there
is enough redundancy relative to the number of errors, then enough correct information will be transmitted
to allow the receiver to reconstruct the original message. We’ll explore both error-detecting codes that
are able to recognize whether an error has occurred (at least, as long as there aren’t too many errors) and
error-correcting codes that can fix a small number of errors. To reiterate, though: while we’re focusing
on error-correcting and error-detecting codes in this section, the fundamental purpose of this section is to
introduce proof techniques. Along the way, we’ll see some interesting results about error-correcting codes,
but the takeaway message is really about the methods that we’ll use to prove those results.

Taking it further: Aside from credit-card numbers, other examples of error-detecting or error-correcting codes include check-
sums on a transferred file—we might break a large file we wish to transmit into 32-bit blocks, transmit those blocks individually,
and transmit as a final 32-bit block the XOR of all previously transmitted blocks—as a way to check that the file was transmitted
properly. Error-correcting codes are also used in storing data on media (hard disks and CDs/DVDs, for example) so that one
can reconstruct stored data even in the face of hardware errors (or scratches on the disc). QR (“quick response”) codes use a
similar type of error correction. The idea of error detection appears in other contexts, too. Double-entry bookkeeping has been
argued to be one of the most important inventions of the last 1000 years. UPC (“universal product code”) bar codes on products
in supermarkets use error checking similar to that in credit-card numbers. There are error-detection aspects in DNA. And “the
buddy system” from elementary school field trips detects any one “deletion error” among the group (though two “deletions” may
evade detection of the system).

ISTUDY
146 Proofs

4.2.1 A Formal Introduction


k
Imagine a sender who wishes to transmit a message m ∈ {0, 1} to a receiver. A code C is a subset of
n
{0, 1} , listing the set of legal codewords; each k-bit message m is encoded as an n-bit codeword c ∈ C.
The codeword is then transmitted to the receiver, but it may be corrupted during transmission. The recipient
of the (possibly corrupted) n-bit string c′ decodes c′ into a new message m′ ∈ {0, 1} . The goal is that,
k

so long as the corruption is limited, the decoded message is identical to the original message—in other
words, that m = m′ as long as c′ ≈ c. (We’ll make the meaning of “≈” precise soon.) Figure 4.1 shows
a schematic of the process. (For an error-detecting code, the receiver still receives the bitstring c′ , but
determines whether the originally transmitted codeword was corrupted instead of which codeword was
originally transmitted, as in an error-correcting code.)

Measuring the distance between bitstrings


Before we get to codes themselves, we need a way to quantify how similar or different two bitstrings are:

Definition 4.1: Hamming distance.


n
Let x, y ∈ {0, 1} be two n-bit strings. The Hamming distance between x and y, denoted by ∆(x, y), is
the number of positions in which x and y differ. In other words,
n o
∆(x, y) = i ∈ {1, 2, . . . , n} : xi ̸= yi .
(Hamming distance is undefined if x and y don’t have the same length.)

(The Hamming distance is named after Richard Hamming (1915–1998), an American mathematician/
computer scientist who was, among other things, the third winner of the Turing Award.)
For example, ∆(011, 101) = 2 because 011 and 101 differ in bit positions #1 and #2 (but match in bit
position #3), and ∆(0011, 0111) = 1 because 0011 and 0111 differ in bit #2. Similarly, ∆(0000, 1111) = 4
because all four bits differ, and ∆(10101, 10101) = 0 because all five bits match.
The Hamming distance is a metric, which means that it satisfies the following properties, for all bitstrings
n
x, y, z ∈ {0, 1} :

• “reflexivity”: ∆(x, y) = 0 if and only if x = y;


• “symmetry”: ∆(x, y) = ∆(y, x); and
• “the triangle inequality”: ∆(x, y) ≤ ∆(x, z) + ∆(z, y). (See Figure 4.2.)

Informally, the fact that ∆ is a metric means that it generally matches your intuitions about geometric
(Euclidean) distance. (See Exercise 4.6.)

Error-detecting and error-correcting codes

Definition 4.2: Codes, messages, and codewords.


n k
A code is a set C ⊆ {0, 1} , where |C| = 2k for some integer 1 ≤ k ≤ n. Any element of {0, 1} is called
a message, and the elements of C are called codewords.

(It might seem a bit strange to require that the number of codewords in C be a precise power of two—
but doing so is convenient, as it allows us to consider all k-bit strings as the set of possible messages, for
k = log2 |C|.) Here’s an example of a code:

ISTUDY
4.2 Error-Correcting Codes 147

sender receiver

message
corrupted codeword

codeword reconstructed message

m ∈ {0, 1}k c∈C c′ ∈ {0, 1}n m ∈ {0, 1}k


encode corruption decode
C⊆ {0, 1}n

Figure 4.1 A schematic view of error-correcting codes. The goal is that, as long as there isn’t too much corruption, the
received message m′ is identical to the sent message m.

z
∆(x, z) ∆(z, y)
y
x
∆(x, y)

Figure 4.2 The triangle inequality. The distance from x to y isn’t decreased by “stopping off” at z along the way.

Example 4.1: A small code.


The set C = {000000, 101010, 000111, 100001} is a code. Because |C| = 4 = 22 , there are four
2 6
messages, namely the four elements of {0, 1} = {00, 01, 10, 11}. And because C ⊆ {0, 1} , the
6
codewords—the four elements of the set C—are elements of {0, 1} .

We can think of a code as being defined by a pair of operations:


k
encoding: given a message m ∈ {0, 1} , which codeword in C should we transmit? (We’d break up a longer
message into a sequence of k-bit message chunks.)
decoding: from a received (possibly corrupted) bitstring c′ ∈ {0, 1} , what message should we infer was
n

sent? (Or, if we trying to detect errors rather than correct them: from a received bitstring c′ ∈ {0, 1} ,
n

do we say that an error occurred, or not?)

For the moment, we’ll consider a generic (and slow) way of encoding and decoding. Given C, we build a
table mapping messages to codewords, by matching up the ith-largest message with the ith-largest code-
k
word (with both the messages from {0, 1} and the codewords in C sorted in numerical order). We encode
a message m by the codeword in m’s row of the table. For error detection, we respond to a received bitstring
c′ by reporting “no error” if c′ appears in the table, and reporting “error” if c′ does not appear in the table.
For error correction, we decode a received bitstring c′ by identifying the codeword c ∈ C that’s closest to
c′ , measured by Hamming distance; we decode c′ as the message in row c of the table. (If there’s a tie, we
choose one of the tied-for-closest codewords arbitrarily.)

Example 4.2: Encoding and decoding with a small code.


For the code C = {000000, 101010, 000111, 100001} from Example 4.1, sorting the four codewords
2
(and the messages from {0, 1} ) yields the table in Figure 4.3.

ISTUDY
148 Proofs

message codeword
00 000000
01 000111
10 100001
11 101010

Figure 4.3 The message/codeword table for the code from Example 4.1.

For example, we encode the message 10 as the codeword 100001.


If we receive the bitstring 111110, we report “error” because 111110 is not in C.
For the received bitstring 111110, we have ∆(111110, 000000) = 5 and ∆(111110, 000111) = 4 and
∆(111110, 100001) = 5 and ∆(111110, 101010) = 2. The last of these distances is smallest, so we
would decode 111110 as the message 11 (corresponding to codeword 101010).

The danger in error detection is that we’re sent a codeword c ∈ C that’s corrupted into a bitstring c′ , but
we report “no error” because c′ ∈ C. (Note that we’re never wrong when we report “error.”) The danger
in error correction is that we report another codeword c′′ ∈ C because c′ is closer to c′′ than it is to c. (As
we’ll see soon, these dangers are really about Hamming distance between codewords: we might make a
mistake if two codewords in C are too close together, relative to the number of errors.) Here are the precise
definitions of error-detecting and error-correcting codes:

Definition 4.3: Error-detecting and error-correcting codes.


n
Let C ⊆ {0, 1} be a code, and let ℓ ≥ 1 be any integer.
We say that C can detect ℓ errors if, for any codeword c ∈ C and for any sequence of up to ℓ errors
applied to c, we can correctly report “error” or “no error.”
The code C can correct ℓ errors if, for any codeword c ∈ C and for any sequence of up to ℓ errors
applied to c, we can correctly identify that c was the original codeword.

Here’s an example, for our small example code:

Example 4.3: Error detection and correction in a small code.


Recall C = {000000, 101010, 000111, 100001} from Example 4.1. Figure 4.4 shows every bitstring
6
x ∈ {0, 1} , and the Hamming distance between x and each codeword in C.
This code can detect one error, because the 24 “one error” bitstrings in Figure 4.5 are all different
from the 4 “no errors” bitstrings; we can correctly report whether the bitstring in question is a codeword
(no errors) or one of the 24 non-codewords (one error). (There are 24 single-bit errors that can happen
to codewords in C: there are 4 choices of codeword, and, for each, 6 different one-bit errors that can
occur.) Or, to state this fact in a different way: the highlighted columns of Figure 4.4 corresponding to
uncorrupted codewords are not within one error of any other codeword.
On the other hand, C cannot detect two errors. If we receive the bitstring 000000, we can’t distinguish
whether the original codeword was 000000 (and no errors occurred) or whether the original codeword was
100001 (and two errors occurred, in 000000). (Receiving the bitstring 100001 creates the same problem.)
The code C also cannot correct even one error. Consider the bitstring 100000. We cannot distinguish (i)
the original codeword was 000000 (and one error occurred) from (ii) the original codeword was 100001
(and one error occurred). Or, to state this fact differently: 100000 appears twice in the list of 24 “one
error” bitstrings in Figure 4.5.

ISTUDY
4.2 Error-Correcting Codes 149

The bitstrings 000000, 000111, 100001, and 101010 corresponding to


uncorrupted codewords are not within one error of any other codeword.

000000
000001
000010
000011
000100
000101
000110
000111
001000
001001
001010
001011
001100
001101
001110
001111
010000
010001
010010
010011
010100
010101
010110
010111
011000
011001
011010
011011
011100
011101
011110
011111
c′ =
∆ (c′ , 000000) 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5
∆(c′ , 000111) 3 2 2 1 2 1 1 0 4 3 3 2 3 2 2 1 4 3 3 2 3 2 2 1 5 4 4 3 4 3 3 2
∆(c′ , 100001) 2 1 3 2 3 2 4 3 3 2 4 3 4 3 5 4 3 2 4 3 4 3 5 4 4 3 5 4 5 4 6 5
∆(c′ , 101010) 3 4 2 3 4 5 3 4 2 3 1 2 3 4 2 3 4 5 3 4 5 6 4 5 3 4 2 3 4 5 3 4
100000 The bitstrings 000001 and 100000 are within a single error of two different codewords.
100001
100010
100011
100100
100101
100110
100111
101000
101001
101010
101011
101100
101101
101110
101111
110000
110001
110010
110011
110100
110101
110110
110111
111000
111001
111010
111011
111100
111101
111110
111111
(continued)
∆(c′ , 000000) 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
∆(c′ , 000111) 4 3 3 2 3 2 2 1 5 4 4 3 4 3 3 2 5 4 4 3 4 3 3 2 6 5 5 4 5 4 4 3
∆(c′ , 100001) 1 0 2 1 2 1 3 2 2 1 3 2 3 2 4 3 2 1 3 2 3 2 4 3 3 2 4 3 4 3 5 4
∆(c′ , 101010) 2 3 1 2 3 4 2 3 1 2 0 1 2 3 1 2 3 4 2 3 4 5 3 4 2 3 1 2 3 4 2 3

Figure 4.4 The Hamming distance of every 6-bit string to all codewords from Example 4.1.

no errors: 000000 one error: 100000 010000 001000 000100 000010 000001
101010 001010 111010 100010 101110 101000 101011
000111 100111 010111 001111 000011 000101 000110
100001 000001 110001 101001 100101 100011 100000

Figure 4.5 The 24 single-bit errors for codewords in C = {000000, 101010, 000111, 100001}.

4.2.2 Distance and Rate


Our goal with error-correcting codes is to ensure that the decoded message m′ is identical to the original
message m, as long as there aren’t too many errors in the transmission. At a high level, we will achieve
this goal by ensuring that the codewords in our code are all “very different” from each other. If every
pair of distinct codewords c1 and c2 are far apart (in Hamming distance), then the closest codeword c to
the received transmission c′ will correspond to the original message, even if “a few” errors occur. (We’ll
quantify “very” and “a few” soon.)
Intuitively, this desire suggests adding a lot of redundancy to our codewords, by making them more
redundant. But we must balance this desire for robustness against another desire that pulls in the opposite
direction: we’d like to transmit a small number of bits (so that the number of “wasted” non-data bits is
small). There’s a seeming tradeoff between these two measures of the quality of a code: increasing error
tolerance suggests making the codewords longer (so there’s room for them to differ more); increasing
efficiency suggests making the codewords shorter (so there are fewer wasted bits). Let’s formally define
both of these measures of code quality:

Definition 4.4: Minimum distance.


The minimum distance of a code C is the smallest Hamming distance between two distinct codewords of
C: that is, the minimum distance of C is min {∆(x, y) : x, y ∈ C and x ̸= y}.

Definition 4.5: Rate.


The rate of a code C is the ratio between message length and codeword length. That is, if C is a code
n
where |C| = 2k and C ⊆ {0, 1} , then the rate of C is the ratio nk .

ISTUDY
150 Proofs

000000 000111 100001 101010


000000 0 3 2 3
000111 3 0 3 4
100001 2 3 0 3
101010 3 4 3 0

Figure 4.6 The Hamming distance between codewords of C from Example 4.1.

(Quiz question: if we hadn’t restricted the minimum in Definition 4.4 to be only over pairs such that x ̸= y,
what would the minimum distance have been?)
Let’s compute the rate and minimum distance for our running example:

Example 4.4: Distance and rate in a small code.


Recall the code C = {000000, 101010, 000111, 100001} from Example 4.1.
The minimum distance of C is 2, because ∆(000000, 100001) = 2. You can check Figure 4.4 (or see
Figure 4.6) to see that no other pair of codewords is closer.
2
The rate of C is 62 , because |C| = 4 = |{0, 1} |, and the codewords have length 6.

Relating minimum distance and error detection/correction


We have now defined enough of the concepts that we can state a first nontrivial theorem, which characterizes
the error-detecting and error-correcting capabilities of a code C in terms of the minimum distance of C. Here
is the statement:

Theorem 4.6: Relationship of minimum distance to detecting/correcting errors.


Let t ≥ 0 be any integer. If the minimum distance of a code C is 2t + 1, then C can detect 2t errors and
correct t errors.

We’re now going to try to prove Theorem 4.6—that is, we’re going to try to generate a convincing argu-
ment that this statement is true. As with any statement that you try to prove, our first task is to understand
what exactly the claim is saying. In this case, the theorem makes a statement about a generic nonnegative
integer t and a generic code C. Plugging in particular values for t can help make the claim clearer:

• If the minimum distance of a code C is 9—that is, the minimum distance is 2t + 1 for t = 4—then the
claim says C can detect 2t = 2 · 4 = 8 errors and correct t = 4 errors.
• Suppose the minimum distance of C is 7. Writing 7 = 2t + 1 for t = 3, the claim states that C can detect
6 errors and correct 3 errors.
• If the minimum distance of C is 5, then C can detect 4 errors and correct 2 errors.
• If the minimum distance of C is 3, then C can detect 2 errors and correct 1 error.
• If the minimum distance of C is 1, then C can detect 0 errors and correct 0 errors.

Problem-solving tip: Step #1 in proving any claim is to understand what it’s saying! (You can’t persuade someone of something
you don’t understand.) One good way to start to do so is by plugging particular values into the statement.

Now that we have a better sense of what the theorem says, let’s prove it:

ISTUDY
4.2 Error-Correcting Codes 151

A codeword x ∈ C

2t (outer) shaded region: all strings t


within distance 2t of x

>t
Another codeword y ∈ C. Because the minimum
distance is 2t + 1, we know ∆(x, y) ≥ 2t + 1 and ≥ 2t + 1
so y is not inside the shaded region.

(a) No codewords are within distance 2t of each other, (b) Any bitstring within distance t of the codeword x
so no codeword except c itself can be inside the shaded (that is, inside the smaller circle) must be more than t
region. This code can detect up to 2t errors. away from any other codeword.

Figure 4.7 Visualizations of the proof of Theorem 4.6. Suppose that C is a code whose minimum distance is 2t + 1.

Proof of Theorem 4.6. First we’ll prove the error-detection condition, by arguing for the following claim:
if a code C has minimum distance 2t + 1, then C can detect 2t errors. In other words, for an arbitrary
codeword x ∈ C and an arbitrary received bitstring c′ with ∆(x, c′ ) ≤ 2t, our error-detection algorithm
must be correct. (If ∆(x, c′ ) > 2t, then we’re not obliged to correctly state that an error occurred, because
we’re only arguing that we can detect 2t errors.) Recall that our error-detection algorithm reports “no error”
if c′ ∈ C, and it reports “error” if c′ ∈
/ C. (It may be helpful to think about this proof via Figure 4.7a.)
If ∆(x, c′ ) = 0, then no error occurred (because the received bitstring matches the transmitted one). In
this case, our error-detection algorithm correctly reports “no error”—because c′ ∈ C (because c′ = x, and
x was a codeword).
On the other hand, suppose 1 ≤ ∆(x, c′ ) ≤ 2t—so an error occurred. The only way that we’d fail to
detect the error is if the received bitstring c′ is itself another codeword. But this situation can’t happen, by
the definition of minimum distance: for any codeword x ∈ C, the set {c′ : ∆(x, c′ ) ≤ 2t} cannot contain
any elements of C—otherwise the minimum distance of C would be 2t or smaller.

For the error-correction condition, suppose that x ∈ C is the transmitted codeword, and the received bitstring
c′ satisfies ∆(x, c′ ) ≤ t. We have to persuade ourselves that x is the codeword closest to c′ in Hamming
distance. Let y ∈ C − {x} be any other codeword. We’ll start from the triangle inequality, which tells us
that ∆(x, y) ≤ ∆(x, c′ ) + ∆(c′ , y) and therefore that ∆(c′ , y) ≥ ∆(x, y) − ∆(x, c′ ). Now we’ll prove that
c′ is closer to x than it is to y (see Figure 4.7b):
′ ′
∆(c , y) ≥ ∆(x, y) − ∆(x, c ) triangle inequality
≥ (2t + 1) − ∆(x, c′ ) ∆(x, y) ≥ 2t + 1 by definition of minimum distance
≥ (2t + 1) − t ′
∆(x, c ) ≤ t by assumption
=t+1
> ∆(x, c′ ). ′
∆(x, c ) ≤ t by assumption, and t < t + 1

This chain of inequalities shows c′ is closer to x than it is to y. (Pedantically speaking, we’re also relying
on the symmetry of Hamming distance here: ∆(c′ , y) = ∆(y, c′ ). Again, see Exercise 4.6.) Because y was
a generic codeword in C − {x}, we can conclude that the original codeword x is the one closest to c′ .

It is customary to mark the end of one’s proofs typographically; here, we’re using a traditional box symbol: . Other people may
write “QED,” short for the Latin phrase quod erat demonstrandum (“that which was to be demonstrated”).

ISTUDY
152 Proofs

Before we move on from the theorem, let’s reflect a little bit on the proof. (We’ll concentrate on the error-
correction half.) The most complicated part was unwinding the definitions in the theorem statement, in
particular of “C has minimum distance 2t + 1” and “C can correct t errors.” Eventually, we had to argue for
the claim

For every x ∈ C, y ∈ C − {x}, and c′ ∈ {0, 1} : if ∆(x, c′ ) ≤ t then ∆(x, c′ ) < ∆(y, c′ ).
n

(In other words, if c′ is within t errors of x, then c′ is closer to x than to any other codeword.) In the end,
we were able to state the proof as a relatively simple sequence of inequalities.
After proving a theorem, it’s also worth briefly reflecting on what the theorem does not say. Theorem 4.6,
for example, only addresses codes with a minimum distance that’s an odd number. You’ll be asked to
consider the error-correcting and error-detecting properties of a code C with an even minimum distance
in Exercise 4.13. We also didn’t show that we couldn’t do better: Theorem 4.6 says that a code C with
minimum distance 2t + 1 can correct t errors, but the theorem doesn’t say that C can’t correct t + 1 (or
more) errors. (But, in fact, it can’t; see Exercise 4.12.)

Problem-solving tip: (Two separate tips related to the proof of Theorem 4.6.) First, when you’re trying to prove a claim of the
form p ⇒ q, try to massage p to look as much like q as possible. A good first step in doing so is to expand out the definitions of
the premises, and then try to see what additional facts you can infer. Second, a useful general strategy: try to draw a picture to
help you clarify/understand the problem that you’re trying to solve.

Outline of the remainder of the section


Intuitively, rate and minimum distance are measures of the inherent tension in an error-correcting code. A
code that has a higher distance means that we are more robust to errors: the farther apart codewords are, the
more corruption can occur before we’re unable to reconstruct the original message. A code that has a higher
rate means that we are “wasting” fewer bits in providing this robustness: the larger the rate, the more our
codeword contains “data” rather than “redundancy.” In the rest of this section, we’re going to prove several
more theorems about error-correcting codes, exploring the tradeoff between rate and distance. (But it’s
also worth noting that it’s not a strict tradeoff: sometimes we can improve in one measure without costing
ourselves in the other!) And, as we go, we’ll continue to try to reflect on the proof techniques that we use
to establish these claims. Here are the three main theorems that we’ll prove in the rest of this section:

Theorem 4.7: Good news.


There exists a code with 4-bit messages, minimum distance 3, and rate 13 .

Theorem 4.8: Better news.


There exists a code with 4-bit messages, minimum distance 3, and rate 47 .

Theorem 4.9: Bad news.


There does not exist a code with 4-bit messages, minimum distance 3, and any rate strictly better than 47 .

Notice that the first two of these results say that a code with particular properties exists, while the third result
says that it’s impossible to create a code with a different set of properties. Also notice that Theorem 4.8
is an improvement on Theorem 4.7: we’ve made the rate better (higher) without making the minimum

ISTUDY
4.2 Error-Correcting Codes 153

distance worse. (When we can, we’ll prove more general versions of these theorems, too, not limited to
4-bit messages with minimum distance 3.)
We’ll prove Theorem 4.7 and Theorem 4.8 “by construction”—specifically, by building a code with the
desired parameters. But, because Theorem 4.9 says that a code with certain properties fails to exist, we’ll
prove the result with a proof by contradiction: we temporarily assume that a code with 4-bit messages with
distance 3 and rate strictly better than 47 does exist, and reasoning logically from that assumption, we will
derive a false statement (a contradiction). Because p ⇒ False ≡ ¬p, we can conclude that the assumption
must have been false, and no such code can exist.

4.2.3 Repetition Codes


Intuitively, a good error-correcting code will amplify even a small difference between two different
messages—a single differing bit—into a larger difference between the corresponding codewords. Perhaps
the most obvious implementation of this idea is simply to encode a message m by repeating the bits of m
several times. This idea gives rise to a simple error-correcting code, called the repetition code. (Actually,
there are many different versions of the repetition code, depending on how many times we repeat m in the
codeword.) Here’s the basic definition:

Definition 4.10: Repetition code.


Let ℓ ∈ Z≥2 . The Repetitionℓ code for k-bit messages consists of the codewords
 k
m m · · · m : m ∈ {0, 1} .
ℓ times
k
That is, the codeword corresponding to a message m ∈ {0, 1} is the ℓ-fold repetition of the message m,
kℓ
so each codeword is an element of {0, 1} .

Here are some small examples of encoding/decoding using repetition codes:

Example 4.5: Some codewords for the repetition code.


The message 00111 corresponds to the codeword 00111 00111 00111 under the Repetition3 code.
If we encode the same message using the Repetition5 code instead, then the resulting codeword is
00111 00111 00111 00111 00111.
For decoding, suppose that we receive the (possibly corrupted) bitstring c′ = 0010 0110 0010 under
the Repetition3 code. We detect that an error occurred: c′ is not a codeword, because the only codewords
are 12-bit strings where all three 4-bit thirds are identical. For error correction: the closest codeword to
c′ is 0010 0010 0010, so we decode c′ as the message 0010.

The message/codeword table for the Repetition3 code for 4-bit messages is shown in Figure 4.8. The
distance and rate properties of the repetition code are relatively easy to see (from the definition or from this
style of table):

Lemma 4.11: Distance and rate of the repetition code.


The Repetitionℓ code has rate 1ℓ and minimum distance ℓ.

ISTUDY
154 Proofs

(continued) (continued) (continued)


message codeword message codeword message codeword message codeword
0000 0000 0000 0000 0100 0100 0100 0100 1000 1000 1000 1000 1100 1100 1100 1100
0001 0001 0001 0001 0101 0101 0101 0101 1001 1001 1001 1001 1101 1101 1101 1101
0010 0010 0010 0010 0110 0110 0110 0110 1010 1010 1010 1010 1110 1110 1110 1110
0011 0011 0011 0011 0111 0111 0111 0111 1011 1011 1011 1011 1111 1111 1111 1111

Figure 4.8 The Repetition3 code for 4-bit messages.

Proof. Recall that the rate of a code is the ratio nk , where k is the length of the messages and n is the length
of the codewords. A k-bit message is encoded as a (kℓ)-bit codeword (ℓ repetitions of k bits), and so the
rate of this code is kℓk = 1ℓ .
For the minimum distance, consider any two messages m, m′ ∈ {0, 1} with m′ =
k
̸ m. We know that m
′ ′
and m must differ in at least one bit position, say bit position i. (Otherwise m = m .) But if mi = ̸ m′i , then
the codeword corresponding to m the codeword corresponding to m′

m m ···m and m′ m′ · · · m′
ℓ times ℓ times

differ in at least one bit in each of the ℓ “blocks” (in the ith position of the block)—for a total of at least ℓ
differences. Furthermore, the Repetitionℓ encodings of the messages 000 · · · 0 and 100 · · · 0 differ in only
ℓ places (the first bit of each “block”), so the minimum distance of the Repetitionℓ code is exactly ℓ.
Lemma 4.11 says that the Repetition3 code on 4-bit messages (see Figure 4.8) has minimum distance
3 and rate 13 . Thus we’ve proven Theorem 4.7: we had to show that a code with these parameters exists,
and we did so by explicitly building such a code. This proof is an example of a “proof by construction”: to
show that an object with a particular property exists, we’ve explicitly built an object with that property.

Problem-solving tip: When you’re trying to prove a claim of the form ∃x : P(x), try using a proof by construction first. (There
are other ways to prove an existential claim, but this approach is great when it’s possible.)

It’s also worth noticing that we started out by describing a generic way to do encoding and decoding
for error-correcting codes: after we build the table (like the one in Figure 4.8), we encode a message by
finding the corresponding codeword in the table, and we decode a bitstring c′ by looking at every codeword
and identifying the one closest to c′ . For particular codes, we may be able to give a much more efficient
algorithm—and, indeed, we can do so for repetition codes. See Exercise 4.21.

4.2.4 Hamming Codes


When we’re encoding 4-bit messages, the Repetition3 code achieves minimum distance 3 with 12-bit
codewords. (So its rate is 31 .) But it turns out that we can do better by defining another, cleverer code:
the Hamming code maintains the same minimum distance, while improving the rate from 13 to 47 . (The
Hamming code, like the Hamming distance, is named after Richard Hamming, who invented this code in
1950 [53]. He was frustrated that programs he started running before going home on Friday nights often
failed over the weekend because of a single bit error in memory.)
The basic idea of the Hamming code is to use an extra bit that, like the 16th digit of a credit-card number,
redundantly reports a value computed from the previous components of the message. Concretely, we could
tack a single bit b onto the message m, where b reports the parity of m—that is, whether there are an even
or odd number of bits set to 1 in m. If a single error occurs in the message, then b would be inconsistent

ISTUDY
4.2 Error-Correcting Codes 155

with the message m, and we’d detect that error. (See Exercise 4.19.) In fact, for the Hamming code, we’ll
use several different parity bits, corresponding to different subsets of the bits of m.

Definition 4.12: Parity function.


The parity of a sequence ⟨a1 , a2 , . . . , ak ⟩ of bits is denoted either parity(a1 , a2 , . . . , ak ) or a1 ⊕a2 ⊕· · ·⊕ak ,
and its value is
(
1 if there are an odd number of i such that ai = 1
a1 ⊕ a2 ⊕ · · · ⊕ ak =
0 if there are an even number of i such that ai = 1.

The parity of a and b can be denoted as a ⊕ b because if you think of a, b ∈ {0, 1}, where True = 1 and False = 0,
then parity(a, b) is the XOR of a and b. (And XOR was also denoted by ⊕.) (We could also have defined this function as
P
parity(a1 , . . . , ak ) = [ ki=1 ai ] mod 2.)

Hamming’s insight was that it’s possible to achieve good error-correction properties by using three different
parity bits, corresponding to different subsets of the message bits. It’s easiest to think of this code in terms
of its encoding algorithm:

Definition 4.13: Hamming code.


The Hamming code is defined via the following encoding function. We will encode a 4-bit message
⟨a, b, c, d⟩ as the following 7-bit codeword:
⟨ a, b, c, d, b ⊕ c ⊕ d, a ⊕ c ⊕ d, a ⊕ b ⊕ d ⟩.
message bits parity bits

Applying this encoding to every 4-bit message yields the table of messages and their corresponding code-
words shown in Figure 4.9. A few examples are shown in detail in Figure 4.10. (We could have described
encoding for the Hamming code using matrix multiplication instead; see Exercises 2.221–2.223.)
Before we analyze the rate and minimum distance of the Hamming code, let’s start to develop some
intuition by looking at a few received (possibly corrupted) codewords. (We’ll also begin to work out an
efficient decoding algorithm as we go.)

(continued) (continued) (continued)


message codeword message codeword message codeword message codeword
0000 0000000 0100 0100101 1000 1000011 1100 1100110
0001 0001111 0101 0101010 1001 1001100 1101 1101001
0010 0010110 0110 0110011 1010 1010101 1110 1110000
0011 0011001 0111 0111100 1011 1011010 1111 1111111

Figure 4.9 The Hamming code for 4-bit messages.

message a, b, c, d −→ a, b, c, d, (b ⊕ c ⊕ d), (a ⊕ c ⊕ d), (a ⊕ b ⊕ d) codeword


0000 0, 0, 0, 0 −→ 0, 0, 0, 0, (0 ⊕ 0 ⊕ 0), (0 ⊕ 0 ⊕ 0), (0 ⊕ 0 ⊕ 0) 0000000
1000 1, 0, 0, 0 −→ 1, 0, 0, 0, (0 ⊕ 0 ⊕ 0), (1 ⊕ 0 ⊕ 0), (1 ⊕ 0 ⊕ 0) 1000011
1110 1, 1, 1, 0 −→ 1, 1, 1, 0, (1 ⊕ 1 ⊕ 0), (1 ⊕ 1 ⊕ 0), (1 ⊕ 1 ⊕ 0) 1110000

Figure 4.10 A few sample Hamming code encodings.

ISTUDY
156 Proofs

Example 4.6: Some Hamming code decoding problems.


You receive the following (possibly corrupted) Hamming code codewords:
0000010 and 1000000 and 1011010 and 1110111.
Assuming that at most one error occurred in each transmission, what were the original messages?
Solution. We’ll consider these (possibly corrupted) codewords in order:
0000010. We’ve received message bits 0000 and parity bits 010. Everything in the received codeword is
consistent with the message being m = 0000, except for the second parity bit. So we infer that the second
parity bit was corrupted, the transmitted codeword was 0000000, and the message was 0000.
Could there have been a one-bit error in message bits instead? No: these parity bits are consistent only
with a message ⟨a, b, c, d⟩ with a ̸= b (because the first two received parity bits differ), and therefore
with d = 1 (because a ̸= b implies that a ⊕ b ⊕ d = 1 ⊕ d = ¬d, and the third parity bit a ⊕ b ⊕ d is 0).
But 10?1 and 01?1 are both at least two errors away from the received message 0000.
1000000. We’ve received message bits 1000 and parity bits 000. If the message bits were uncorrupted,
then the correct parity bits would have been 011. But then we would have to have suffered two transmis-
sion errors in the parity bits, and we’re assuming that at most one error occurred. Thus the error is in the
message bits; the original message is 0000, and the first bit of the message was corrupted.
1011010. The parity bits for the message 1011 are indeed 010, so 1011010 is itself a legal codeword for
the message 1011, and no errors occurred at all.
1110111. These received bits are consistent with the message 1111 with parity bits 111, where the fourth
bit of the message was flipped.

From this example, the basic approach to decoding the Hamming code should start to coalesce. Briefly, we
compute what the parity bits should have been, supposing that the received message bits (the first four bits
of the received codeword) are correct; comparing the computed parity bits to the received parity bits allows
us to deduce which, if any, of the transmitted bits were erroneous. (More on efficient decoding later.)
Why does this approach to decoding work? (And, relatedly, why were the parity bits of the Hamming
code chosen the way that they were?) There are two critical properties in the Hamming code’s parity bits.
First, every message bit appears in at least two parity bits. Thus any error in a received parity bit is
distinguishable from an error in a received message bit: an erroneous message bit will cause at least two
parity bits to look wrong; an erroneous parity bit will cause only that one parity bit to look wrong.
Second, no two message bits appear in precisely the same set of parity bits. Thus any error in a received
message bit has a different “signature” of wrong-looking parity bits: an error in bit a affects parity bits #2
and #3; b affects parity bits #1 and #3; c affects #1 and #2; and d affects all three parity bits. Because all
four of these signatures are different, we can distinguish which message bit was corrupted based on which
set of two or more parity bits look wrong.

Rate and minimum distance of the Hamming code


Let’s use the intuition that we’ve developed so far to establish the rate and minimum distance for the
Hamming code:

ISTUDY
4.2 Error-Correcting Codes 157

Lemma 4.14: Distance and rate of the Hamming code.


The Hamming code has rate 47 and minimum distance 3.

Proof. The rate is straightforward to compute: we have 4-bit messages and 7-bit codewords, so the rate
is 74 by definition.
There are several ways to convince yourself that the minimum distance is 3—perhaps the simplest way
(though certainly the most tedious) is to compute the Hamming distance between each pair of codewords
in Figure 4.9. (There are only 16 codewords, so we just have to check that all 16·152 = 120 pairs of distinct
codewords have Hamming distance at least three.) You’ll write a program to verify this claim in Exercise
4.24. But here’s a different argument.
Consider any two distinct messages m ∈ {0, 1} and m′ ∈ {0, 1} . We must establish that the codewords
4 4

c and c′ associated with m and m′ satisfy ∆(c, c′ ) ≥ 3. We’ll argue for this fact by looking at three separate
cases, depending on ∆(m, m′ ).

Case I: ∆(m, m′ ) ≥ 3. Then we’re done immediately: the message bits of c and c′ differ in at least three
positions (even without looking at the parity bits).
Case II: ∆(m, m′ ) = 2. Then at least one of the three parity bits contains exactly one of the two bit positions
where mi ̸= m′i . (This fact follows from the second crucial property above, that no two message bits
appear in precisely the same set of parity bits.) Therefore this parity bit differs in c and c′ . Thus there
are two message bits and at least one parity bit that differ, so ∆(c, c′ ) ≥ 3.
Case III: ∆(m, m′ ) = 1. Then at least two of the three parity bits contain the bit position where mi ̸= m′i .
(This fact follows from the first crucial property above, that every message bit appears in at least two
parity bits.) Thus there are at least two parity bits and one message bit that differ, and ∆(c, c′ ) ≥ 3.

Note that ∆(m, m′ ) must be 1, 2, or ≥ 3—it can’t be zero because m ̸= m′ —so, no matter what ∆(m, m′ )
is, we’ve established that ∆(c, c′ ) ≥ 3.
Because, for the codewords corresponding to messages 0000 and 1110, we have ∆(0000000, 1110000) =
3, the minimum distance is in fact exactly equal to three.
Lemma 4.14 says that the Hamming code encodes 4-bit messages with minimum distance 3 and rate 47 ;
thus we’ve proven Theorem 4.8. Let’s again reflect a little on the proof. Our proof of the minimum distance
in Lemma 4.14 was a proof by cases: we divided pairs of codewords into three different categories (differing
in 1, 2, or ≥ 3 bits), and then used three different arguments to show that the corresponding codewords
differed in ≥ 3 places. So we showed that the desired distance property was true in all three cases—and,
crucially, that one of the cases applies for every pair of codewords.

Problem-solving tip: If you discover that a proposition seems true “for different reasons” in different circumstances (and those
circumstances seem to cover all possible scenarios!), then a proof by cases may be a good strategy to employ.

Although we’re mostly omitting any discussion of the efficiency of encoding and decoding, it’s worth a
brief mention here. (The speed of these algorithms is a big deal for error-correcting codes used in practice.)
The algorithm for decoding under the Hamming code is suggested by Figure 4.11: we calculate what the
parity bits would have been if the received message bits were uncorrupted, and identify which received
parity bits don’t match those calculated parity bits. Figure 4.11 tells us what inference to draw from each
constellation of mismatched parity bits.

ISTUDY
158 Proofs

parity bit #1: b ⊕ c ⊕ d 3 7 3 3 7 7 3 7


parity bit #2: a ⊕ c ⊕ d 3 3 7 3 7 3 7 7
parity bit #3: a ⊕ b ⊕ d 3 3 3 7 3 7 7 7
location of error (if any) none! parity parity parity bit c bit b bit a bit d
bit #1 bit #2 bit #3

Figure 4.11 Decoding the Hamming code. We conclude that the stated error occurred if the received parity bits and
those calculated from the received message bits mismatch in the listed places.

Why does this decoding algorithm allow us to correct any single error? First, a low-level answer: the
Hamming code has a minimum distance of 3 = 2 · 1 + 1, so Lemma 4.6 tells us that we can correct up to
one error. So we know that a decoding scheme is possible. At a higher level, the reason that this decoding
procedure works properly is that there are eight possible “≤ 1 error” corruptions of a codeword x—namely
one 0-error string (x itself) and seven 1-error strings (one corresponding to an error in each of the seven
bit positions of x)—and there are also eight different subsets of the three parity bits that can be “wrong.”
The Hamming code works by carefully selecting the parity bits in a way that each of these eight bitstrings
corresponds to a different one of the eight parity-bit subsets. In Exercises 4.25–4.28, you’ll explore longer
versions of the Hamming code (with longer messages and more parity bits) with the same relationship.

Taking it further: As we’ve said, our attention here is mostly on the proofs and the proof techniques that we’ve used to establish
the claims in this section, rather than on error-correcting codes themselves. But see p. 161 for an introduction to Reed–Solomon
codes, the basis of the error-correcting codes used in CDs/DVDs (among other applications).

4.2.5 Upper Bounds on Rates


In Sections 4.2.3 and 4.2.4, we’ve constructed two different codes, both for 4-bit messages with minimum
4
distance 3: the repetition code (rate 12 ) and the Hamming code (rate 47 ). Because the message lengths
and minimum distances match, and because higher rates are better, the Hamming code is better. Here
we’ll consider whether we can improve the rate further, while still encoding 4-bit messages with minimum
distance 3. (In other words, can we make the codewords shorter than 7 bits?) The answer turns out to be
“no”—and we’ll prove that it’s impossible.

“Balls” around codewords


We’ll start by thinking about “balls” around codewords in a general code. (The ball of radius r around
x ∈ {0, 1} is the set {x′ : ∆(x, x′ ) ≤ r}—that is, the set of all points that are within Hamming distance r
n

of x.) Here’s a first observation:

n
Lemma 4.15: The size of a ball of radius 1 in {0, 1} .

Let x ∈ {0, 1} , and define X = x′ ∈ {0, 1} : ∆(x, x′ ) ≤ 1 . Then |X| = n + 1.
n n

Proof. The bitstring x itself is an element of X, as are all bitstrings x′ that differ from x in exactly one
position. There are n such strings x′ : one that is x with the first bit flipped, one that is x with the second bit
flipped; . . .; and one that is x with the nth bit flipped. Thus there are 1 + n total bitstrings in X.

Here’s a second useful fact about these balls: in a code C, the balls around codewords (of radius related
to the minimum distance of C) cannot overlap.

ISTUDY
4.2 Error-Correcting Codes 159

≥ 2t + 1

t t
x y

(a) If the minimum distance is 2t + 1, the “balls” of (b) How many unit circles (circles with radius 1) can be
radius t around each codeword are disjoint. packed into a 6-by-6 square?

Figure 4.12 Circle packing.

Lemma 4.16: Balls around codewords are disjoint.


n
Let C ⊆ {0, 1} be a code with minimum distance 2t + 1. For distinct codewords x, y ∈ C, the sets
 ′ 
x ∈ {0, 1} : ∆(x, x′ ) ≤ t and y′ ∈ {0, 1} : ∆(y, y′ ) ≤ t are disjoint.
n n


Proof. Suppose not: in other words, suppose that the set X = x′ ∈ {0, 1} : ∆(x, x′ ) ≤ t and the set
n
 ′ n ′
Y = y ∈ {0, 1} : ∆(y, y ) ≤ t are not disjoint. We will derive a contradiction from this assumption—
that is, a statement that can’t possibly be true—so we’ll have proven that X ∩ Y ̸= ∅ ⇒ False. But this
implication allows us to conclude that X ∩ Y = ∅, because ¬p ⇒ False ≡ p. That is, we’re using a proof
by contradiction.
To start again from the beginning: suppose that X and Y are not disjoint. That is, suppose that there is
n
some bitstring z ∈ {0, 1} such that z ∈ X and z ∈ Y. In other words, by definition of X and Y, there is a
n
bitstring z ∈ {0, 1} such that ∆(x, z) ≤ t and ∆(y, z) ≤ t. But if ∆(x, z) ≤ t and ∆(y, z) ≤ t, then, by the
triangle inequality, we know

∆(x, y) ≤ ∆(x, z) + ∆(z, y) ≤ t + t = 2t.

Therefore ∆(x, y) ≤ 2t—but then we have two distinct codewords x, y ∈ C with ∆(x, y) ≤ 2t. This
condition contradicts the assumption that the minimum distance of C is 2t + 1. (See Figure 4.12a.)

We could have used Lemma 4.16 to establish the error-correction part of Theorem 4.6—a bitstring
corrupted by ≤ t errors from a codeword c is closer to c than to any other codeword—but here we’ll use it,
plus Lemma 4.15, to establish a upper bound on the rate of codes. But, first, let’s pause to look at a similar
argument in a different (but presumably more familiar) domain: normal Euclidean geometry.

Problem-solving tip: When you’re facing a problem in a less familiar domain, try to find an analogous problem in a different,
more familiar setting to help gain intuition.

In a circle-packing problem, we are given an enclosing shape, and we’re asked to place (“pack”) as many
nonoverlapping unit circles (of radius 1) into that shape as possible. (Sphere packing—what grocers have to
do with oranges—is the three-dimensional analogue.) How many unit circles can we fit into a 6-by-6 square,
for example? (See Figure 4.12b.) Here’s an argument that it’s at most 11: a unit circle has area π · 12 = π,
and the 6-by-6 square has area 36; thus we certainly can’t fit more than 36 π ≈ 11.459 nonoverlapping
circles into the square. There isn’t room for 12. (In fact, we can’t even fit 10, because the circles won’t
nestle together without wasting space “in between.” Thus, in this case we’d say that the area-based bound
is loose.)

ISTUDY
160 Proofs

Using packing arguments to derive bounds on error-correcting codes


Now, let’s return to error-correcting codes, and use the circle-packing intuition (and the last two lemmas)
n
to prove a bound on the number of n-bit codewords that can “fit” into {0, 1} with minimum distance 3:

Lemma 4.17: The “sphere-packing bound”: distance-3 version.


n 2n
Let C ⊆ {0, 1} be a code with minimum distance three. Then |C| ≤ n+1 .


Proof. For each x ∈ C, let Sx = x′ ∈ {0, 1} : ∆(x′ , x) ≤ 1 be the ball of radius 1 around x.
n
n
Lemma 4.15 says that |Sx | = n + 1 for each x. Further, Lemma 4.16 says that every element of {0, 1} is
in at most one Sx because the balls are disjoint. Therefore,
 X X
| x′ ∈ {0, 1} : x′ is in one of the Sx balls | =
n
|Sx | = (n + 1) = |C| · (n + 1).
x∈C x∈C

Also observe that every element of any Sx is an n-bit string. There are only 2n different n-bit strings, so
therefore

| x′ ∈ {0, 1} : x′ is in one of the Sx balls | ≤ 2n .
n

Putting together these two facts, we see that |C|·(n+1) ≤ 2n . Solving for |C| yields the desired relationship:
2n
|C| ≤ n+1 .

Corollary 4.18. Any code with messages of length 4 and minimum distance 3 has codewords of length
at least 7. (Thus the Hamming code has the best possible rate among all such codes.)
n
Proof. By Lemma 4.17, we know that |C| ≤ n+1 2
. With 4-bit messages we have |C| = 16, so we know
2n
that 16 ≤ n+1 , or, equivalently, that 2 ≥ 16 · (n + 1). And 27 = 16 · (7 + 1), while for any n < 7 this
n

inequality does not hold.


Corollary 4.18 implies Theorem 4.9, so we’ve now proven the three claims that we set out to establish.
Before we close, though, we’ll mention a few extensions. Lemma 4.16 was general, for any code with
an odd minimum distance. But Lemma 4.15 was specifically about codes with minimum distance 3. To
generalize the latter lemma, we’d need techniques from counting (see Chapter 9, specifically Section 9.4).
Another interesting question: when is the bound from Lemma 4.17 exactly achievable? If we have k-bit
2n
messages, n-bit codewords, and minimum distance 3, then Lemma 4.17 says that 2k ≤ n+1 , or, taking logs,
that k ≤ n − log2 (n + 1). Because k has to be an integer, this bound is exactly achievable only when n + 1
29
is an exact power of two. (For example, if n = 9, this bound requires us to have 2k ≤ 10 = 512
10 = 51.2. In
other words, we need k ≤ log2 51.2 ≈ 5.678. But, because k ∈ Z, in fact we need k ≤ 5. That means that
this bound is not exactly achievable for n = 9.) However, it’s possible to give a version of the Hamming
code for n = 15 and k = 7 with minimum distance 3, as you’ll show in Exercise 4.26. (In fact, there’s a
version of the Hamming code for any n = 2ℓ − 1; see Exercise 4.28.)

ISTUDY
4.2 Error-Correcting Codes 161

Computer Science Connections

Reed–Solomon Codes
The error-correcting codes that are used in CDs, DVDs,
and QR codes (see Figure 4.13) are a bit more com-
plicated than repetition or Hamming codes, but they
perform better. These codes are called Reed–Solomon
codes—named after Irving Reed (1923–2012) and
Gustave Solomon (1930–1996), who invented them in
1960—and they’re based on polynomials and modular
arithmetic. We’ll leave out a lot of the details, but here Figure 4.13 A QR (“quick response”) code, which is still
is a brief sketch of how they work. readable even if some portions are misread or damaged.
The first difference from the codes that we’ve seen (Either version of this QR code can be correctly decoded.)
so far: we’re going to go beyond bits, to a larger “alpha-
k
bet” of characters in our messages and codewords. Instead of encoding messages from {0, 1} , we’re going to encode
k
messages from {0, 1, . . . , q} , for some integer q.
Here’s the core idea of the encoding. Given a mes-
sage m = ⟨m1 , m2 , . . . , mk ⟩, we will define a poly- message = ⟨1, 3, 2⟩
nomial pm (x) with the coefficients of the polynomial The coefficients of the polynomial are the
corresponding to the characters of the message: components of the message.


k pm (x) = 1 · x + 3 · x2 + 2 · x3
pm (x) = mi · xi .
i=1 The ith component of the codeword is pm (i).
For example,
Now, to encode the message m, we evaluate the polyno- pm (3) = 1 · 3 + 3 · 32 + 2 · 33 = 84.
mial for several values of x: specifically, the codeword codeword = ⟨6, 30, 84, 180, 330, 546⟩
corresponding to m will be ⟨pm (1), pm (2), . . . , pm (n)⟩.
See Figure 4.14 for an example. Figure 4.14 Encoding the message ⟨1, 3, 2⟩ using n = 6.
Suppose that we use a k-character message and an
n-character output. (We’re going to have to pick a value of n—bigger for more error correction, smaller for more
efficiency.) It’s easy enough to compute that the rate is nk . But what about the minimum distance of the code? Consider
two distinct messages m and m′ . Note that pm and pm′ are two different polynomials of degree at most k, which
means that f(x) = pm (x) − pm′ (x) is a polynomial of degree at most k, too. Therefore f has at most k roots, by
Lemma 2.58. And {x : f(x) = 0} = {x : pm (x) = pm′ (x)}, which means that there are at most k values of x for
which pm (x) = pm′ (x). We encoded m and m′ by evaluating pm and pm′ on n different inputs, so there are at least
n − k inputs on which these two polynomials disagree. Thus the minimum distance is at least n − k. For example, if
we pick n = 2k, then we achieve rate 12 and minimum distance k.
How might we decode Reed–Solomon codes? Effi-
cient decoding algorithms rely on some results from
Suppose that k = 2 (in other words, you’re looking
linear algebra, but the basic idea is to find the degree- for a message of length two, and the polynomial is
k polynomial that goes through as many of the given a quadratic). You receive the codeword
40 ⟨2, 6, 12, 13, 30, 42⟩.
points as possible. See Figure 4.15.
We’ve left out a few key details of Reed–Solomon Plot the points of the codeword. All but
codes. One is that we computed the rate misleadingly: one is consistent with pm (x) = x + x2 ,
we only counted the number of slots, rather than the 20 so you decode the codeword
as the message ⟨1, 1⟩.
“size” of those slots. (Figure 4.14 shows that the num-
bers can get pretty big!) In real Reed–Solomon codes,
every value is stored modulo a prime. See p. 357 for
how (and why) this fix works. There’s also an addi- 1 2 3 4 5 6
tional clever trick used in the physical layout of the
encoded information on a CD, DVD, or QR code: the Figure 4.15 Decoding a (corrupted) codeword.
bits for a particular codeword are spread out over the
physical encoding, so that a single physical scratch or tear doesn’t cause errors all to occur in the same codeword.

ISTUDY
162 Proofs

EXERCISES
4.1 (programming required) The algorithm for testing whether a given credit-card number is valid is shown in Figure 4.16a, and an
example is shown in Figure 4.16b. (Try executing cc-check from Figure 4.16a on a few credit-card numbers, to make sure that
you’ve understood the algorithm correctly.) This code can detect any one substitution error, because

0 1 2 3 4 5 6 7 8 9
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
0 2 4 6 8 10 → 1 + 0 = 1 12 → 1 + 2 = 3 14 → 1 + 4 = 5 16 → 1 + 6 = 7 18 → 1 + 8 = 9

are all distinct (so changing any single digit, including in odd-indexed digits, changes the overall value of sum). Implement cc-check
in a programming language of your choice. Extend your implementation so that, if it’s given any 16-digit credit/debit-card number
with a single digit replaced by a "?", it computes and outputs the correct missing digit.
4.2 Suppose that we modified cc-check so that, instead of adding the ones digit and (if it exists) the tens digit of di to sum in Line 7 of
the algorithm, we instead simply added di . (That is, replace Line 7 by sum := sum + di .) Does this modified code still allow us to
detect any single substitution error?
4.3 Consider modifying cc-check so that, rather than doubling odd-indexed digits in Line 4 of the algorithm, we tripled the odd-indexed
digits. (That is, replace Line 4 by di := 3 · ni .) Does this modified code still allow us to detect any single substitution error?
4.4 Similarly, imagine modifying cc-check to quintuple instead of double (that is, we replace Line 4 by di := 5 · ni ). Does this modified
code still allow us to detect any single substitution error?
4.5 There are simpler schemes that can detect a single substitution error than the one in cc-check: for example, we could simply ensure
that the sum of all the digits themselves (undoubled) is divisible by 10. (Just skip the doubling step.) The credit-card encoding
system includes the more complicated doubling step to help it detect a different type of error, called a transposition error, where
two adjacent digits are recorded in reverse order. (If two digits are swapped, then the “wrong” digit is multiplied by two, and so
this kind of error might be detectable.) Does cc-check detect every possible transposition error?

A metric space consists of a set X and a function d : X × X → R≥0 , called a distance function, where d obeys the following three
properties:
• reflexivity: for any x and y in X, we have d(x, x) = 0, and d(x, y) =
̸ 0 whenever x ̸= y.
• symmetry: for any x, y ∈ X, we have d(x, y) = d(y, x).
• triangle inequality: for any x, y, z ∈ X, we have d(x, y) ≤ d(x, z) + d(z, y).
When it satisfies all three conditions, we call the function d a metric. The next few exercises consider several different distance
functions d : {0, 1}n × {0, 1}n → Z≥0 . For each, you’re asked to prove whether the given function d is a metric. (In other words,
you should either prove that d satisfies reflexivity, symmetry, and the triangle inequality; or prove that d fails to satisfy one or more
of these properties.)

4.6 In this section, we’ve measured the distance between bitstrings using the Hamming distance, which counts the number of positions
in which x and y differ. Prove that the Hamming distance function ∆ : {0, 1}n × {0, 1}n → Z≥0 is a metric. (Hint: think about
one bit at a time.)

cc-check(n):
Input: a 16-digit credit-card number n ∈ {0, 1, . . . , 9}16
1 sum := 0 original credit-card number 4471 8329 · · ·
2 for i = 1, 2, . . . , 16:
3 if i is odd then 4 4 7 1 8 3 2 9 ···
4 di := 2 · ni
5 else 8 4 14 1 16 3 4 9 ···
6 di := ni
7 Increase sum by the ones’ and tens’ digits of di . 8 + 4 + 1+ 4 + 1 + 1+ 6 + 3 + 4 + 9 ···
di
8 // In other words, sum := sum + (di mod 10) + ⌊ 10 ⌋.
9 return True if sum mod 10 = 0, and False otherwise. odd-indexed digits doubled, and digits summed

(a) The algorithm itself. (b) An example of the calculation.

Figure 4.16 An algorithm for testing the validity of credit-card numbers.

ISTUDY
Exercises 163

4.7 For x, y ∈ {0, 1}n , define d(x, y) as the smallest i ∈ {0, 1, . . . , n} such that xi+1,...,n = yi+1,...,n . For example,
d(01000, 10101) = 5 and d(01000, 10100) = 3 and d(01000, 10000) = 2 and d(11010, 01010) = 1. (This function measures
how far into x and y we must go before the remaining parts match; we could also define d(x, y) as the largest i ∈ {0, 1, . . . , n}
such that xi ̸= yi , where we treat x0 ̸= y0 .) Is d a metric?

4.8 For x, y ∈ {0, 1}n , define d(x, y) as the length of the longest consecutive run of differing bits in corresponding positions of
x and y—that is, d(x, y) = max {j − i : for all k = i, i + 1, . . . , j we have xk ̸= yk } . For example, d(01000, 10101) = 3 and
d(00100, 01010) = 3 and d(01000, 10000) = 2 and d(11010, 01000) = 1. Is d a metric?

4.9 For x, y ∈ {0, 1}n , define d(x, y) as the difference in the number of ones that appears in the two bitstrings—that is,

d(x, y) = |{i : xi = 1}| − |{i : yi = 1}| .

(The vertical bars here are a little confusing: the bars around |{i : xi = 1}| and |{i : yi = 1}| denote set cardinality, while the outer
vertical bars denote absolute value.) For example, d(01000, 10101) = |1 − 3| = 2 and d(01000, 10100) = |1 − 2| = 1 and
d(01000, 10000) = |1 − 1| = 0 and d(11010, 01010) = |2 − 2| = 0. Is d a metric?

4.10 The distance version of the Sørensen index (a.k.a. the Dice coefficient) defines the distance based on the fraction of ones in x or y
that are in the same positions. (The Sørensen/Dice measure is named after independent work by two ecologists from the 1940s, the
Danish botanist Thorvald Sørensen (1902–1973) and the American mammalogist Lee Raymond Dice (1887–1977).) Specifically,
P
2 xi · yi
d(x, y) = 1 − P i .
i xi + yi

For example, d(01000, 10101) = 1− 12+·03 = 1− 04 = 1 and d(00100, 01110) = 1− 12+·13 = 1− 42 = 1/2 and d(01000, 11000) =
1 − 12+·12 = 1 − 23 = 1/3 and d(11010, 01010) = 1 − 32+·22 = 1 − 25 = 3/5. Is d a metric?

4.11 For x, y ∈ {0, 1}n , define d(x, y) as the difference in the numbers that are represented by the two strings in binary. Writing this
function formally is probably less helpful (particularly because the higher powers of 2 have lower indices), but here it is: d(x, y) =
Pn n− i −
Pn n−i . For example, d(01000, 10101) = |8 − 21| = 13 and d(01000, 10100) = |8 − 20| = 12 and
i=1 xi · 2 i=1 yi 2
d(01000, 10000) = |8 − 16| = 8 and d(11010, 01010) = |26 − 10| = 16. Is d a metric?

4.12 Show that we can’t improve on the parameters in Theorem 4.6: for any integer t ≥ 0, prove that a code with minimum distance
2t + 1 cannot correct t + 1 or detect 2t + 1 errors.
4.13 Theorem 4.6 describes the error-detecting and error-correcting properties for a code whose minimum distance is any odd integer.
This exercise asks you to give the analogous analysis for a code whose minimum distance is any even integer. Let t ≥ 1 be any
integer, and let C be a code with minimum distance 2t. How many errors can C detect and correct? Prove your answers.

Let c ∈ {0, 1}n be a codeword. Until now, we’ve mostly talked about substitution errors, in which a single bit of c is flipped from
0 to 1, or from 1 to 0. The next few exercises explore two other types of errors.
An erasure error occurs when a bit of c isn’t successfully transmitted, but the recipient is informed that the transmission of the
corresponding bit wasn’t successful. We can view an erasure error as replacing a bit ci from c with a “?” (as in Exercise 4.1,
for credit-card numbers). Thus, unlike a substitution error, the recipient knows which bit was erased. (So a codeword 1100110
might become 1?0011? after two erasure errors.) When codeword c ∈ {0, 1}n is sent, the receiver gets a corrupted codeword
c′ ∈ {0, 1, ?}n and where all unerased bits were transmitted correctly (that is, if c′i ∈ {0, 1}, then c′i = ci ).
A deletion error is like a “silent erasure” error: a bit fails to be transmitted, but there’s no indication to the recipient as to
where the deletion occurred. (So a codeword 1100110 might become 10011 after two deletion errors.)
4.14 Let C be a code that can detect t substitution errors. Prove that C can correct t erasure errors.
4.15 Let C be a code that can correct t deletion errors. Prove that C can correct t erasure errors.
4.16 Give an example of a code that can correct one erasure error, but can’t correct one deletion error.

4.17 Consider the “code” where all n-bit strings are codewords. (That is, C = {0, 1}n .) What is the rate and minimum distance, and
how many errors can it detect or correct?
4.18 Define the trivial code as C = {0n , 1n }. What is the rate and minimum distance, and how many errors can it detect or correct?
4.19 Consider the parity-check code, defined as follows: the codewords are all n-bit strings with an even number of bits set to 1. What
is the rate and minimum distance, and how many errors can it detect or correct?

ISTUDY
164 Proofs

4.20 Let’s extend the idea of the parity-check code as an add-on to any existing code with odd minimum distance. Let C ⊆ {0, 1}n be a
code with minimum distance 2t + 1, for some integer t ≥ 0. Consider a new code C ′ , in which we augment every codeword of C by
adding a parity bit, which is zero if the number of ones in the original codeword is even and one if the number is odd, as follows:
n Xn o
C ′ = ⟨x1 , x2 , . . . , xn , ( xi ) mod 2⟩ : x ∈ C .
i=1

Then the minimum distance of C ′ is 2t+2. (Hint: consider two distinct codewords x, y ∈ C. You have to argue that the corresponding
codewords x′ , y′ ∈ C have Hamming distance 2t + 2 or more. Use two different cases, depending on the value of ∆(x, y).)

4.21 Show that we can correctly decode the Repetitionℓ code as follows: given a bitstring c′ , for each bit position i, we take the majority
vote of the ℓ blocks’ ith bit in c′ , breaking ties arbitrarily. (In other words, prove that this algorithm actually gives the codeword
that’s closest to c′ .)

4.22 In some error-correcting codes, for certain errors, we may be able to correct more errors than Theorem 4.6 suggests: that is, the
minimum distance is 2t+1, but we can correct certain sequences of > t errors. We’ve already seen that we can’t successfully correct
every such sequence of errors, but we may be able to successfully handle some sequences of errors using the standard algorithm
for error correction (returning the closest codeword). The Repetition3 code with 4-bit messages is only guaranteed to correct 1
error. What’s the largest number of errors that can possibly be corrected successfully by this code? Explain your answer.
4.23 In contrast to the Repetition3 code (as in Exercise 4.22), under the Hamming code we never correct more than 1 error successfully.
Prove why not.

4.24 (programming required) Write a program, in a programming language of your choice, to verify that any two codewords in the
Hamming code differ in at least three bit positions.

Let’s find the “next” Hamming code, with 7-bit messages and 11-bit codewords and a minimum distance of 3. We’ll use the same
style of codeword as in Definition 4.13: the first 7 bits of the codeword will simply be the message, and the next 4 bits will be parity
bits (each for some subset of the message bits).
4.25 To achieve minimum distance 3, it will suffice to have parity bits with the following properties:
(a) each bit of the original message appears in at least two parity bits; and
(b) no two bits of the original message appear in exactly the same set of parity bits.
Prove that these conditions are sufficient. That is, prove that any set of parity bits that satisfy conditions (a) and (b) ensure that the
resulting code has minimum distance 3.
4.26 Define 4 parity bits for 11-bit messages that satisfy conditions (a) and (b) from Exercise 4.25.
4.27 Define 5 parity bits for 26-bit messages that satisfy conditions (a) and (b) from Exercise 4.25.
4.28 Let ℓ ∈ Z>0 , and let n = 2ℓ − 1. Prove that a code with n-bit codewords, minimum distance 3, and messages of length n − ℓ is
achievable. (Hint: look at all ℓ-bit bitstrings; use the bits to identify which message bits are part of which parity bits.)

4.29 You have come into possession of 8 bottles of “poison,” except, you’ve learned, 7 are fake poison and only 1 is really poisonous.
Your master plan to take over the world requires you to identify the poison by tomorrow. Luckily, as an evil genius, you have a small
collection of very expensive rats, which you can use for testing. You can give samples from bottles to multiple rats simultaneously
(a rat can receive a mixture of samples from more than one bottle), and then wait for a day to see which ones die. Obviously you
can identify the real poison with 8 rats (one bottle each), or even with 7 (one bottle each, one unused bottle; if all rats survive then
the leftover bottle is the poison). But how many rats do you need to identify the poison? (Make the number as small as possible.)

Taking it further: The idea in Exercise 4.29 is very similar to the idea of “pooled testing,” which held major promise in
the early stages of the COVID-19 pandemic as a way to mitigate the paucity of available tests for the disease. Usually, in
pooled testing, the idea is a two-stage approach: simultaneously test the samples from k individuals for the disease, and then,
if (and only if) the pool test is positive, complete k individual tests afterwards to determine which individual(s) was/were
infected. (Exercise 4.29 takes a slightly different, but related, approach; it asks you to determine how to perform more
pooled tests simultaneously, to avoid having to do individual tests.) One key difference: this exercise tells you to assume
that there is exactly one positive sample; in the COVID-testing scenario, the number of positive samples is unknown.
Another key difference: real-life tests are not 100% accurate, and accounting for false positives or false negatives is critical.
The appropriate tradeoffs in the size of the test pool depend on the accuracy of the test and the prevalence of the disease.
For an example analysis of the pooled-testing approach for COVID-19, see [28].

ISTUDY
Exercises 165

1 S := ∅
23
2 for x ∈ {0, 1} (in numerical order): // x = 00000000000000000000000, 00000000000000000000001, . . . , 11111111111111111111111
3 if ∆(x, y) ≥ 7 for all y ∈ S then
4 add x to S
5 return S

Figure 4.17 The “greedy algorithm” for generating the Golay code.

Let c ∈ {0, 1}23 . A handy fact (which you’ll show in Exercise 9.135, after we’ve developed the necessary tools for counting to
figure out this quantity): the number of 23-bit strings c′ with ∆(c, c′ ) ≤ 3 is exactly 2048 = 211 = 223−12 . This fact means
that (according to a generalization of Lemma 4.17) it might be possible to achieve a code with 12-bit messages; 23-bit codewords;
and minimum distance 7. In fact, these parameters are achievable—and a code that achieves these parameters is surprisingly
simple to construct. The Golay code is an error-correcting code that can be constructed by the so-called “greedy” algorithm in
Figure 4.17. (The all-zero vector will be added to S in the first iteration of the while loop; a hundred and twenty-seven iterations
later, 00000000000000001111111 will be the second element added to S, and so forth.)
(The Golay code is named after Marcel Golay (1902–1989), a Swiss researcher who discovered them in 1949, just before
Hamming discovered what would later be called the Hamming code. Golay also made major contributions in chemistry and optics.
A slight variant of the Golay code was used by NASA around 1980 to communicate with the Voyager spacecraft as they traveled to
Saturn and Jupiter.)

4.30 (programming required) Write a program, in a language of your choice (but see the warning below), that implements the algorithm
in Figure 4.17, and outputs the list of the 212 = 4096 different 23-bit codewords of the Golay code in a file, one per line.
Implementation hint: suppose you represent the set S as an array, appending each element that passes the test in Line 3 to the end
of the array. When you add a bitstring x to S, the very next thing you do is to consider adding x + 1 to S. Implementing Line 3 by
starting at the x-end of the array will make your code much faster than if you start at the 00000000000000000000000-end of the
array. Think about why!
Implementation warning: this algorithm is not very efficient! We’re doing 223 iterations, each of which might involve checking the
Hamming distance of as many as 212 pairs of strings. On an aging laptop, my Python solution took about 10 minutes to complete;
if you ignore the implementation hint from the previous paragraph, it took 80 minutes. (I also implemented a solution in C; it took
about 10 seconds following the hint, and 100 seconds not following the hint.)

4.31 You and six other friends are imprisoned by an evil genius, in a room filled with eight bubbling bottles marked as “poison.” (Though,
really, seven of them look perfectly safe to you.) The evil genius, though, admires skill with bitstrings and computation, and offers
you all a deal. You and your friends will each have a red or blue hat placed on your heads randomly. (Each hat has a 50% chance
of being red and 50% chance of being blue, independent of all other hats’ colors.) Each person can each see all hats except their
own. After a brief moment to look at each others’ hats, all of you must simultaneously say one of three things: red, blue, or pass.
The evil genius will release all of you from your imprisonment if:
• everyone who says red or blue correctly identifies their hat color; and
• at least one person says a color (that is, not everybody says pass).
You may collaborate on a strategy before the hats are placed on your heads, but once the hat is in place, no communication is allowed.
An example strategy: all 7 of you pick a random color and say it. (You succeed with probability (1/2)7 = 1/128 ≈ 0.0078.)
Another example: you number yourselves 1, 2, . . . , 7, and person #7 picks a random color and says it; everyone else passes. (You
succeed with probability 1/2.)
Can you succeed with probability better than 1/2? If so, how?

4.32 In Section 4.2.5, we proved an upper bound for the rate of a code with a particular minimum distance, based on the volume
of “spheres” around each codeword. There are other bounds that we can prove, with different justifications. Here you’ll prove
what’s called the Singleton bound. (Confusingly, the Singleton bound is named after Richard Singleton (1928–2007), an American
computer scientist; it has nothing to do with singleton sets [sets containing only one element].)
Suppose that we have a code C ⊆ {0, 1}n with |C| = 2k and minimum distance d. Prove that k ≤ n − d + 1.
(Hint: what happens if we delete the first d − 1 bits from each codeword?)

ISTUDY
166 Proofs

4.3 Proofs and Proof Techniques


The proof of the pudding is in the eating.
Miguel de Cervantes (1547–1616)
Don Quixote (1605)

In Section 4.2, we saw a number of claims about error-correcting codes—and, more importantly, proofs
that those claims were true. These proofs used several different styles of argument: proofs that involved
straight-ahead reasoning by starting from the relevant definitions; proofs that used “case-based” reasoning;
and proofs “by contradiction” that argued that x must be true because something impossible would happen
if x were false. Indeed, whenever you face a claim that you need to prove, a variety of different strategies
(including these strategies from Section 4.2) are possible approaches for you to employ. This section is
devoted to outlining these and some other common proof strategies. We’ll first catalogue these techniques
in Section 4.3.1, and then, in Section 4.3.2, we’ll reflect briefly on the strategies and how to choose among
them—and also reflect on the writing part of writing proofs.

What is a proof?
This chapter is devoted to techniques for proving claims—but before we explore proof techniques, let’s
spend a few words discussing what a proof actually is:

Definition 4.19: Proof.


A proof of a proposition is a convincing argument that the proposition is true.

Definition 4.19 says that a proof is a “convincing argument,” but it doesn’t say to whom the argument
should be convincing. The answer is: to your reader. This definition may be frustrating, but the point is that
a proof is a piece of writing, and—just like with fiction or a persuasive essay—you must write for your
audience.

Taking it further: Different audiences will have very different expectations for what counts as “convincing.” A formal logician
might not find an argument convincing unless she saw every last step, no matter how allegedly obvious or apparently trivial.
An instructor of early-to-mid-level computer science class might be convinced by a proof written in paragraph form that omits
some comparatively simple steps, like those that invoke the commutativity of addition, for example. A professional CS researcher
reading a publication in conference proceedings would expect “elementary” calculus to be omitted. Some of the debates over what
counts as convincing to an audience—in other words, what counts as a “proof”—were surprisingly controversial, particularly as
computer scientists began to consider claims that had previously been the exclusive province of mathematicians. See p. 181 for
a discussion of the Four-Color Theorem, which triggered many of these discussions in earnest.

To give an example of writing for different audiences, we’ll give several proofs of the same result. Here’s
a claim regarding divisibility and factorials. (Recall that n!, pronounced “n factorial,” is the product of the
positive integers up to n; that is, n! = n · (n − 1) · (n − 2) · · · 1.)

Let n be a positive integer and let k be any integer satisfying 2 ≤ k ≤ n.


Then n! + 1 is not evenly divisible by k. (†)

Before reading further, spend a minute trying to convince yourself why (†) is true. We’ll prove (†) three
times, using three different levels of detail:

ISTUDY
4.3 Proofs and Proof Techniques 167

Example 4.7: Factorials: Proof I.


Qn
Proofh(heavy idetail). By the definition ofhQ i we
factorial, have that n! = i=1 i, which can be rewritten as
Qk−1  Qn  k−1 Q n 
n! = i=1 i · k · i=k+1 i . Let m = i=1 i · i=k+1 i . Thus we have that n! = k · m and m ∈ Z,
because the product of any finite set of integers is also an integer.
Observe that n! + 1 = mk + 1. We claim that there is no integer ℓ such that kℓ = n! + 1. First, there is
no ℓ ≤ m such that kℓ = n! + 1, because kℓ ≤ km = n! < n! + 1. Second, there is no ℓ ≥ m + 1 such
that kℓ = n! + 1, because k ≥ 2 implies that kℓ ≥ k(m + 1) = n! + k > n! + 1. Because there is no such
integer ℓ ≤ m and no such integer ℓ > m, the claim follows.

Example 4.8: Factorials: Proof II.

Proof (medium detail). Define m = n!k , so that n! = mk and n! + 1 = mk + 1. Because k is an integer


between 2 and n, the definition of factorial implies that m is an integer. But because k ≥ 2, we know
mk < mk + 1 < (m + 1)k. Thus mk + 1 is not evenly divisible by k, because this quantity is strictly
between two consecutive integral multiples of k, namely m · k and (m + 1) · k.

Example 4.9: Factorials: Proof III.

Proof (light detail). Note that k evenly divides n!. The next integer evenly divisible by k is n! + k. But
k ≥ 2, so n! < n! + 1 < n! + k. The claim follows immediately.

Writing tip: As you study the material in this book, you will frequently be given a claim and asked to prove it. To complete
this task well, you must think about the question of for whom you are writing your proof. A reasonable guideline is that your
audience for your proofs is a classmate or a fellow reader of this book who has read and understood everything up to the point
of the claim that you’re proving, but hasn’t thought about this particular claim at all.

Which of the three proofs from Examples 4.7, 4.8, and 4.9 is best? It depends! The right level of detail
depends on your intended reader. A typical reader of this book would probably be happiest with the
medium-detail proof from Example 4.8, but it is up to you to tailor your proof to your desired reader.

Taking it further: It turns out that one can encode literally all of mathematics using a handful of set-theoretic axioms, and a lot
of patience. It’s possible to write down everything in this book in ultraformal set-theoretic notation, which serves the purpose
of making arguments 100% airtight. But the high-level computer science content can be hard to see in that style of proof. If
you’ve ever programmed in assembly language before, there’s a close analogy: you can express every program that you’ve ever
written in extremely low-level machine code, or you can write it in a high-level language like C or Java or Python or Scheme
(and, one hopes, make the algorithm much more understandable for the reader). We’ll prove a lot of facts in this book, but at the
Python-like level of proof. Someone could “compile” our proofs down into the low-level set-theoretic language—but we won’t
bother. (Lest you underestimate the difficulty of this task: a proof that 2 + 2 = 4 would require hundreds of steps in this low-level
proof!) There are subfields of computer science (“formal methods” or “formal verification,” or “automated theorem proving”)
that take this ultrarigorous approach: start from a list of axioms, and a list of inference rules, and a desired theorem, and derive
the theorem by applying the inference rules. When it is absolutely life-or-death critical that the proof be 100% verified, then
these approaches tend to be used: in verifying protocols in distributed computing, or in verifying certain crucial components of
a processor, or the software components of an airplane, for example.

4.3.1 Proof Techniques


We will describe three general strategies for proofs:

ISTUDY
168 Proofs

direct proof: we prove a statement φ by repeatedly inferring new facts from known facts to eventually
conclude φ. (Sometimes we’ll divide our work into separate cases and give different proofs in each case.
And if φ is of the form p ⇒ q, we’ll generally assume p and then try to infer q under that assumption.)
proof by contrapositive: when the statement that we’re trying to prove is an implication p ⇒ q, we can
instead prove ¬q ⇒ ¬p—the contrapositive of the original claim. The contrapositive is logically
equivalent to the original implication, so once we’ve proven ¬q ⇒ ¬p, we can also conclude p ⇒ q.
proof by contradiction: we prove a statement φ by repeatedly assuming ¬φ, and proving something
impossible—that is, proving ¬φ ⇒ False. Because ¬φ therefore cannot be true, we can conclude
that φ must be true.
“When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
— Sir Arthur Conan Doyle (1859–1930), The Sign of the Four (1890)

We’ll give some additional examples of each proof technique as we go, proving some purely arithmetic
claims to illustrate the strategy.
Almost every claim that we’ll prove here—or that you’ll ever need to prove—will be a universally
quantified statement, of the form ∀x ∈ S : P(x). (Often the quantification will not be explicit: we view
any unquantified variable in a statement as being implicitly universally quantified.) To prove a claim of the
form ∀x ∈ S : P(x), we usually proceed by considering a generic element x ∈ S, and then proving that
P(x) holds. (Considering a “generic” element means that we make no further assumptions about x, other
than assuming that x ∈ S.) Because this proof establishes that an arbitrary x ∈ S makes P(x) true, we can
conclude that ∀x ∈ S : P(x).

Direct proofs
The simplest type of proof for a statement φ is a derivation of φ from known facts. This type of argument
is called a direct proof :

Definition 4.20: Direct proof.


A direct proof of a proposition φ starts from known facts and implications, and repeatedly applies logical
deduction to derive new facts, eventually leading to the conclusion φ.

Most of the proofs in Section 4.2 were direct proofs. Here’s another, less complicated example:

Example 4.10: Divisibility by 4.


Let’s prove the correctness of a simple test of whether a given integer is divisible by 4:

Claim: Any positive integer n is divisible by 4 if and only if its last two digits are themselves divisible
by 4. (That is, n is divisible by 4 if and only if n’s last two digits are in {00, 04, 08, . . . , 92, 96}.)

Proof. Let dk , dk−1 , . . . , d1 , d0 denote the digits of n, reading from left to right, so that
n = d0 + 10d1 + 100d2 + 1000d3 + · · · + 10k dk ,

or, dividing both sides by 4,


n
4 = d0
4 + 10d1
4 + 25d2 + 250d3 + · · · + 25 · 10k−2 dk . (∗)

ISTUDY
4.3 Proofs and Proof Techniques 169

The integer n is a divisible by 4 if and only if n4 is an integer, which because of (∗) occurs if and only
if the right-hand side of (∗) is an integer. And that’s true if and only if d0 +10d
4
1
is an integer, because all
other terms in the right-hand side of (∗) are integers. Therefore 4 | n if and only if 4 | (d0 + 10d1 ). The
last two digits of n are precisely d0 + 10d1 , so the claim follows.

Note that this argument considers a generic positive integer n, and establishes the result for that generic
n. The proof relies on two previously known facts: (1) an integer n is divisible by 4 if and only if n4 is an
integer; and (2) for an integer a, we have that x + a is an integer if and only if x is an integer. The argument
itself uses these two basic facts to derive the desired claim.
Let’s give another example, this time for an implication. The proof strategy of assuming the antecedent,
discussed in Definition 3.24 in Section 3.4.3, is a form of direct proof. To prove an implication of the form
φ ⇒ ψ, we assume the antecedent φ and then prove ψ under this assumption. This proof establishes
φ ⇒ ψ because the only way for the implication to be false is when φ is true but ψ is false, but the proof
shows that ψ is true whenever φ is true. Here’s an example of this type of direct proof, for a basic fact
about rational numbers. (Recall that a number x is rational if and only if there exist integers n and d ̸= 0
such that x = nd .)

Example 4.11: The product of rational numbers is rational.

Claim: If x and y are rational numbers, then so is xy.

Proof. Assume the antecedent—that is, assume that x and y are rational. By the definition of rationality,
n
then, there exist integers nx , ny , dx ̸= 0, and dy ̸= 0 such that x = ndxx and y = dyy . Therefore
nx ny nx ny
xy = · = .
dx dy dx dy
Both nx ny and dx dy are integers, because the product of any two integers is also an integer. And dx dy ̸= 0
because both dx ̸= 0 and dy = ̸ 0. Thus xy is a rational number, by the definition of rationality.

Proof by cases
Sometimes we’ll be asked to prove a statement of the form ∀x ∈ S : P(x) that indeed seems true for every
x ∈ S—but the “reason” that P(x) is true seems to be different for different “kinds” of elements x. For
example, Lemma 4.14 argued that the Hamming distance between two Hamming-code codewords was at
least three, based on three different arguments based on whether the corresponding messages differed in 1,
in 2, or in 3 or more positions. This proof was an example of a proof by cases:

Definition 4.21: Proof by cases.


To give a proof by cases of a proposition φ, we identify a set of cases and then prove two different types
of facts: (1) “in every case, φ holds”; and (2) one of the cases has to hold.

(Proofs by cases need not be direct proofs, but plenty of them are.) Here are two small examples of proofs
by cases for properties in arithmetic:

ISTUDY
170 Proofs

Example 4.12: Certain squares.

Claim: Let n be any integer. Then n · (n + 1)2 is even.

Proof. We’ll give a proof by cases, based on the parity of n:

• If n is even, then any multiple of n is also even, so we’re done.


• If n is odd, then n + 1 must be even. Thus any multiple of n + 1 is also even, so we’re done again.

Because the integer n must be either even or odd, and the quantity n · (n + 1)2 is an even number in either
case, the claim follows.

Example 4.13: An easy fact about absolute values.

Claim: Let x ∈ R. Then −|x| ≤ x ≤ |x|.

Proof. Observe that x ≥ 0 or x ≤ 0. In both cases, we’ll show the desired inequality:

• For the case that x ≥ 0, we know −x ≤ 0 ≤ x. By the definition of absolute value, we have |x| = x
and −|x| = −x. Thus −|x| = −x ≤ 0 ≤ x = |x|.
• For the case that x ≤ 0, we know x ≤ 0 ≤ −x. By the definition of absolute value, we have |x| = −x
and −|x| = x. Thus −|x| = x ≤ 0 ≤ −x = |x|.

Note that a proof by cases is only valid if the cases are exhaustive—that is, if every situation falls into one
of the cases. (If, for example, you try to prove ∀x ∈ R : P(x) with the cases x > 0 and x < 0, you’ve left
out x = 0—and your proof isn’t valid!) But the cases do not need to be mutually exclusive (that is, they’re
allowed to overlap), as long as the cases really do cover all the possibilities; in Example 4.13, we handled
the x = 0 case in both cases x ≥ 0 and x ≤ 0. If all possible values of x are covered by at least one case,
and the claim is true in every case, then the proof is valid.
Here’s another slightly more complex example, where we’ll prove the triangle inequality for the absolute
value function. (See Figure 4.2.)

Example 4.14: Triangle inequality for absolute values.

Claim: Let x, y, z ∈ R. Then |x − y| ≤ |x − z| + |y − z|.

Proof. Without loss of generality, assume that x ≤ y. (If y ≤ x, then we simply swap the names of x
and y, and nothing changes in the claim.)
Because we’re assuming x ≤ y, we must show that |x − z| + |y − z| ≥ |x − y| = y − x. We’ll consider
three cases: z ≤ x, or x ≤ z ≤ y, or y ≤ z. See Figure 4.18.

Case I: z ≤ x. When z ≤ x—that is, z is smaller than both x and y—we have
|x − z| + |y − z| ≥ |y − z| |x − z| ≥ 0 by the definition of absolute value.
=y−z x ≤ y by assumption and z ≤ x in Case I, so z ≤ y too.
≥ y − x. z ≤ x in Case I, so −z ≥ −x.

ISTUDY
4.3 Proofs and Proof Techniques 171

Case II: x ≤ z ≤ y. When x ≤ z ≤ y—that is, z is between x and y—we have


|x − z| + |y − z| = (z − x) + |y − z| definition of absolute value and x ≤ z in Case II.
= (z − x) + (y − z) definition of absolute value and z ≤ y in Case II.
= y − x. algebra/rearranging terms.

Case III: y ≤ z. When y ≤ z—that is, z is larger than both x and y—we have
|x − z| + |y − z| ≥ |x − z| |y − z| ≥ 0 by the definition of absolute value.
=z−x x ≤ y by assumption and y ≤ z in Case III, so x ≤ z too.
≥ y − x. z ≥ y in Case III.

In all three cases, we’ve shown that |x − z| + |y − z| ≥ y − x, so the claim follows.

Writing tip: The phrase “without loss of generality” (which we used at the start of the proof in Example 4.14) indicates that
we won’t explicitly write out all the cases in the proof, because the omitted ones are virtually identical to the ones that we are
writing out. It allows you to avoid cut-and-paste-and-search-and-replace arguments for two very similar cases.

Notice the creative demand if you choose to develop a proof by cases: you have to choose which cases
to use! The proposition itself does not necessarily make obvious an appropriate choice of which different
cases to use.

Proof by contrapositive
When we seek to prove a claim φ, it suffices to instead prove any proposition that is logically equivalent
to φ. (For example, a proof by cases with two cases q and ¬q corresponds to the logical equivalence
p ≡ (q ⇒ p) ∧ (¬q ⇒ p).) A valid proof of any logically equivalent proposition can be used to prove that
φ is true, but a few logical equivalences turn out to be particularly useful. A proof by contrapositive is a
very common proof technique that relies on this principle:

Definition 4.22: Proof by contrapositive.


To give a proof by contrapositive of an implication φ ⇒ ψ, we instead give a proof of the implication
¬ψ ⇒ ¬φ.

Recall from Section 3.4.3 that an implication p ⇒ q is logically equivalent to its contrapositive ¬q ⇒ ¬p.
(An implication is true unless its antecedent is true and its conclusion is false, so ¬q ⇒ ¬p is true unless
¬q is true and ¬p is false, which is precisely when p ⇒ q is false.) Here are two small examples of proofs
using the contrapositive, one about absolute values and one about rational numbers.

Case I: Case II: Case III:


z≤x x≤z≤y y≤z Case I Case II Case III
+ + +

x y z x y x z y x y z

Figure 4.18 The three cases for Example 4.14: z can fall to the left of x, between x and y, or to the right of y. In each
case, we argue that the sum of the lengths of the marked lines is at least y − x.

ISTUDY
172 Proofs

Example 4.15: The sum of the absolute values vs. the absolute value of the sum.

Claim: If |x| + |y| ̸= |x + y|, then xy < 0.

Proof. We’ll prove the contrapositive:


If xy ≥ 0, then |x| + |y| = |x + y|. (∗)
To prove (∗), assume the antecedent; that is, assume that xy ≥ 0. We must prove |x| + |y| = |x + y|.
Because xy ≥ 0, there are two cases: either both x ≥ 0 and y ≥ 0, or both x ≤ 0 and y ≤ 0.

Case I: x ≥ 0 and y ≥ 0. Then |x| + |y| = x + y, by the definition of absolute value. And |x + y| = x + y
too, because x ≥ 0 and y ≥ 0 implies that x + y ≥ 0 as well.
Case II: x ≤ 0 and y ≤ 0. In this case |x| + |y| = −x + −y, again by the definition of absolute value. And
|x + y| = −(x + y) = −x + −y too, because x ≤ 0 and y ≤ 0 implies that x + y ≤ 0 as well.

Writing tip: Help your reader figure out what’s going on! If you’re going to use a proof by contrapositive, say you’re using a
proof by contrapositive! Don’t leave ’em guessing. This tip applies for all proof techniques: your job is to convince your reader,
so be kind and informative to your reader.

Example 4.16: Irrational quotients have an irrational numerator or denominator.

Claim: Let y ̸= 0. If x
y is irrational, then either x is irrational or y is irrational (or both).

Proof. We will prove the contrapositive:


x
If x is rational and y is rational, then y is rational. (†)
(Note that, by De Morgan’s Laws, ¬ (x is irrational or y is irrational) is equivalent to x being rational and
y being rational.)
To prove (†), assume the antecedent—that is, assume that x is rational and y is rational. By definition,
n nd
then, there exist four integers nx , ny , dx ̸= 0, and dy ̸= 0 such that x = ndxx and y = dyy . Thus xy = dxx nyy . (By
the assumption that y ̸= 0, we know that ny ̸= 0, and thus dx ny ̸= 0.) Both the numerator and denominator
are integers, so xy is rational.

Of course, you can always reuse previous results in any proof—and Example 4.11 is particularly useful
for the claim in Example 4.16. Here’s a second, shorter proof:

Example 4.17: Irrational quotients, Version B.

Claim: Let y ̸= 0. If x
y is irrational, then either x is irrational or y is irrational (or both).

Proof. We prove the contrapositive. Assume that x and y are rational. By definition, then, y = nd for
some integers n and d ̸= 0. Therefore 1y = dn is rational too. (By the assumption that y ̸= 0, we know that
n ̸= 0.) But xy = x · 1y , and both x and 1y are rational. Therefore Example 4.11 implies that xy is rational
too.

Here’s one more example of a proof that uses the contrapositive. When proving an “if and only if”
statement φ ⇔ ψ, we can instead give proofs of both φ ⇒ ψ and ψ ⇒ φ, because φ ⇔ ψ and

ISTUDY
4.3 Proofs and Proof Techniques 173

(φ ⇒ ψ) ∧ (ψ ⇒ φ) are logically equivalent. This type of proof is sometimes called a proof by mutual
implication. (We can also prove φ ⇔ ψ by giving a chain of logically equivalent statements that transform
φ into ψ, but it is often easier to prove one direction at a time.) Here’s an example of a proof by mutual
implication, which also uses the contrapositive to prove one of the directions:

Example 4.18: Even integers (and only even integers) have even squares.

Claim: Let n be any integer. Then n is even if and only if n2 is even.

Proof. We proceed by mutual implication.


First, we will show that if n is even, then n2 is even too. Assume that n is even. Then, by definition,
there exists an integer k such that n = 2k. Therefore n2 = (2k)2 = 4k2 = 2 · (2k2 ). Thus n2 is even too,
because there exists an integer ℓ such that n2 = 2ℓ. (Namely, ℓ = 2k2 .)
Second, we will show the converse: if n2 is even, then n is even. We will instead prove the contrapos-
itive: if n is not even, then n2 is not even. Assume that n is not even. Then n is odd, and there exists an
integer k such that n = 2k + 1. Therefore n2 = (2k + 1)2 = 4k2 + 4k + 1 = 2(2k2 + 2k) + 1. Thus n2 is
odd too, because there exists an integer ℓ such that n2 = 2ℓ + 1. (Namely, ℓ = 2k2 + 2k.)

Proofs by contradiction
The proof techniques that we’ve described so far establish a claim φ by arguing that φ must be true. Here,
we’ll look at the other side of the coin, and prove φ has to be true by proving that φ cannot be false. This
approach is called a proof by contradiction: we prove that something impossible must happen if φ is false
(that is, we prove ¬φ ⇒ False); thus the assumption ¬φ led us to an absurd conclusion, and we must reject
the assumption ¬φ and instead conclude its negation φ. (Or, as my grandfather Jay Liben (1913–2006)
always used to say (really!): “If the conclusion is obviously false, reexamine the premises.”)

Definition 4.23: Proof by contradiction.


To prove φ using a proof by contradiction, we assume the negation of φ and derive a contradiction; that
is, we assume ¬φ and prove False.

(This proof technique is based on the logical equivalence of φ and the proposition ¬φ ⇒ False. A proof
by contradiction is also called reductio ad absurdum, which is Latin for “reduction to an absurdity.”)
We used a proof by contradiction in Lemma 4.16: to show that two particular sets X and Y were disjoint,
we assumed that an element z ∈ X ∩ Y did exist (that is, we assumed that X and Y were not disjoint), and
we showed that this assumption led to a violation of the assumptions in the definitions of X and Y. Here’s
another small example:

Example 4.19: Trying to solve 15x + 111y = 55057 for integers x and y.

Claim: Suppose 15x + 111y = 55057, for two real numbers x and y. Then either x or y (or both) is not
an integer.

Proof. Suppose not: that is, suppose that x and y are integers with 15x + 111y = 55057. But
15x + 111y = 3 · (5x + 37y), so 55057
3 = 5x + 37y. But then, by the assumption that x and y are integers,

ISTUDY
174 Proofs

55057
3 must therefore be an integer, because 5x + 37y is—but 55057
3 = 18352.333 · · · ∈
/ Z. Therefore the
assumption that both x ∈ Z and y ∈ Z was false, and at least one of x and y must be nonintegral.

Writing tip: It’s always a good idea to help your reader with “signposts” in your writing. In a proof by contradiction, announce at
the outset that you’re assuming ¬φ for the purposes of deriving a contradiction; when you reach a contradiction, say that you’ve
reached a contradiction, and declare that therefore the assumption ¬φ was false, and φ is true.

Here is another example of a proof by contradiction, for a classical result showing that there are numbers
that aren’t rational:

Example 4.20: The irrationality of 2.

Claim: 2 is not rational.

Proof. We proceed√ by contradiction.


Assume that√ 2 is rational. Therefore, by the definition of rationality, there exist integers n and d ̸= 0
such that nd = 2, where n and d are in lowest terms (that is, where n and d have no common divisors).
2
Squaring both sides yields that nd2 = 2, and therefore that n2 = 2d2 . Because 2d2 is even, we know
that n2 is even. Therefore, by Example 4.18 (“n is even if and only if n2 is even”) we have that n is itself
even.
Because n is even, there exists an integer k such that n = 2k, which implies that n2 = 4k2 . Thus
n2 = 4k2 and n2 = 2d2 , so 2d2 = 4k2 and d2 = 2k2 . Hence d2 is even, and—again using Example 4.18—
we have that d is even.
But now we have a contradiction: we assumed that nd was√ in lowest terms, but we have now shown that
n and√d are both even! Thus the original assumption that 2 was rational was false, and we can conclude
that 2 is irrational.
√ √
Note again the structure of this proof: suppose that 2 is rational; therefore we can write 2 = nk where
n and k have no common divisors, and (a few steps later) therefore n and k are both even. Because n and
k cannot both have no common divisors and also both be even, we’ve √ derived an absurdity. The only way
we could have gotten to this absurdity is via our assumption
√ that 2 was rational—so we conclude that
this assumption must have been false, and therefore 2 is irrational.
(Exercise 4.58 asks you to adapt the proof in Example 4.20 to show that the cube root of√2 is also
irrational, and, similarly, Exercise 4.59 asks you to use an analogous argument to show that 3 is also
irrational. Meanwhile, Exercise 4.104 asks
√ you to figure out what’s wrong with
√ an attempted adaptation of
the proof in Example 4.20 to show that 4 is irrational—which it isn’t, as 4 = 2 is a rational number.)
Note that, when you’re trying to prove an implication φ ⇒ ψ, a proof by contrapositive has some
similarity to a proof by contradiction:

• in a proof by contrapositive, we prove ¬ψ ⇒ ¬φ, by assuming ¬ψ and proving ¬φ.


• in a proof by contradiction, we prove False under the assumption ¬(φ ⇒ ψ)—that is, under the
assumption that φ ∧ ¬ψ. (Note that there’s an extra creative demand here: you have to figure out
which contradiction to derive—something that’s not generally made immediately clear by the given
claim.)

Proofs by contrapositive are generally preferred over proofs by contradiction when a proof by contra-
positive is possible. A proof by contradiction can be hard to follow because we’re asking the reader to

ISTUDY
4.3 Proofs and Proof Techniques 175

temporarily accept an assumption that we’ll later show to be false, and there can be a mental strain in
keeping track of what’s been assumed and what was previously known. (Notice that the claim in Example
4.20 wasn’t an implication, so a proof by contrapositive wasn’t an option. The proofs of Lemma 4.16 (from
Section 4.2) and Example 4.19, though, could have been rephrased as proofs by contrapositive.)

Proofs by construction and disproofs by counterexample


So far we’ve concentrated on proofs of universally quantified statements, where you are asked to show that
some property holds for all elements of a given set. √ (Every example proof in this section, except the two
proofs by contradiction about the irrationality of 2 and the infinitude of primes, were proofs of a “for
all” statement—and, actually, even those two claims could have been phrased as universal quantifications.
For example, √we could have phrased Example 4.20 as the following claim: for all integers n and d ̸= 0, we
have n = ̸ d · 2.) Sometimes you’ll confront a universally quantified statement that’s false, though. The
easiest way to prove that ∀x ∈ S : P(x) is false is using a disproof by counterexample:

Definition 4.24: Disproof by counterexample.


A counterexample to a claim ∀x ∈ S : P(x) is a particular element y ∈ S such that P(y) is false. A
disproof by counterexample shows that ∀x ∈ S : P(x) is false by identifying a counterexample y ∈ S and
providing a proof that P(y) is false.

Finding a counterexample for a claim requires creativity: you have to think about why a claim might not
be true, and then try to construct an example that embodies that reason.
Problem-solving tip: One way you might try to identify counterexample to a claim is by writing a program: write a loop that
tries a bunch of examples; if you ever find one for which the claim is false, then you’ve found a counterexample. Just because
you haven’t found a counterexample with your program doesn’t mean that there isn’t one—unless you’ve tried all the elements
of S—but if you do find a counterexample, it’s still a counterexample no matter how you found it!

Here is a small example:

Example 4.21: Unique sums of squares.

Claim: Let n be a positive integer such that n = a2 + b2 for positive integers a and b. Then n cannot be
expressed as the sum of the squares of two positive integers except a and b. (Alternatively, this claim
could be written more tersely as: No positive integer is expressible in two different ways as the sum of
two perfect squares.)

The claim is false, and we will prove that it is false by counterexample. We can start trying some examples.
One easy class of potential counterexamples is a2 + 1 for an integer a. Although 12 + 12 = 2 can’t be
expressed a different way, what about 5? 10? 17? 26? 37? 50? 65? 82? By testing these examples, we find
that 65 is a counterexample to the claim. Observe that 12 +82 = 1+64 = 65, and 42 +72 = 16+49 = 65.
Another counterexample is 50, as 50 = 52 + 52 = 12 + 72 .

What about when you’re asked to prove an existential claim ∃x : P(x)? One approach is to prove
the claim by contradiction: you assume ∀x : ¬P(x), and then derive some contradiction. This type of
proof is called nonconstructive: you have proven that an object with a certain property must exist, but
you haven’t actually described a particular object with that property. In contrast, a proof by construction
actually identifies a specific object that has the desired property:

ISTUDY
176 Proofs

Definition 4.25: Proof by construction.


A constructive proof or proof by construction for a claim ∃x ∈ S : P(x) actually builds an object satisfying
the property P: first, we identify a particular element y ∈ S; and, second, we prove P(y).

For example, here’s a claim that we’ll prove twice, once nonconstructively and once constructively:

Example 4.22: The last two digits of some squares.

Claim: There exist distinct integers x, y ∈ {1901, 1902, . . . , 2024} such that the last two digits of x2 and
y2 are the same. (In other words, x2 mod 100 = y2 mod 100.)

Nonconstructive proof. There are 124 different numbers in the set {1901, 1902, . . . , 2014}. There are
only 100 different possible values for the last two digits of numbers. Thus, because there are 114 elements
assigned to only 100 categories, there must be some category that contains more than one element.

Constructive proof. Let x = 1986 and y = 1964. Both numbers’ squares have 96 as their last two digits:
19862 = 3,944,196 and 19642 = 3,857,296.

The nonconstructive proof in Example 4.22 makes use of the so-called pigeonhole principle, which we’ll see in Section 9.3.3: if
there are more things (in this case, numbers between 1901 and 2024) than there are categories of things (the possible values for
the last two digits) then there must be at least two things that are in the same category.

It’s generally preferable to give a constructive proof when you can. A constructive proof is sometimes
harder to develop than a nonconstructive proof, though: it may require more insight about the kind of object
that can satisfy a given property, and more creativity in figuring out how to actually construct that object.

Taking it further: A constructive proof of a claim is generally more satisfying for the reader than a nonconstructive proof. A
proof by contradiction may leave a reader unsettled—okay, the claim is true, but what can we do with that?—while a constructive
proof may be useful in designing an algorithm, or it may suggest further possible claims to try to prove. (There’s even a school
of thought in logic called constructivism that doesn’t count a proof by contradiction as a proof!)

4.3.2 Some Brief Thoughts about Proof Strategy


So far in this section, we’ve concentrated on developing a toolbox of proof techniques. But when you’re
confronted with a new claim and asked to prove it, you face a difficult task in figuring out which approach to
take. (It’s even harder if you’re asked to formulate a claim and then prove it!) As we discussed in Chapter 3,
there’s no formulaic approach that’s guaranteed to work—you must be creative, open-minded, persistent.
You will have to accept that you will explore approaches that end up being dead ends. This section will
give a few brief pointers about proof strategy—some things to try when you’re just starting to attack a new
problem. We’ll start with some concrete advice in the form of a three-step plan for proofs, largely inspired
by an outstanding book by George Pólya (1887–1985) [103].

Step 1: Understand what you’re trying to do. Read the statement that you’re trying to prove. Reread it.
What are the assumptions? What is the desired conclusion? (That is, what are you trying to prove under the
given assumptions?) Remind yourself of any unfamiliar notation or terminology. Pick a simple example
and make sure the alleged theorem holds for your example. (If not, either you’ve misunderstood something
or the claim is false.) Reread the statement again.

ISTUDY
4.3 Proofs and Proof Techniques 177

If you’re not given a specific claim—for example, you’re asked to prove or disprove a given statement,
or if you’re asked for the “best possible” solution to a problem—then it’s harder but even more important to
understand what you’re trying to do. Play around with some examples to generate a sense of what might be
plausibly true. Then try to form a conjecture based on these examples or the intuition that you’ve developed.

Step 2: Do it. Now that you have an understanding of the statement that you’re trying to prove, it’s time to
actually prove it. You might start by trying to think about slightly different problems to help grant yourself
insight about this one. Are there results that you already know that “look similar” to this one? Can you
solve a more general problem? Make the premises look as much like the conclusion as possible. Expand
out the definitions; write down what you know and what you have to derive, in primitive terms. Can you
derive some facts from the given hypotheses? Are there easier-to-prove statements that would suffice to
prove the desired conclusion?
Look for a special case: add assumptions until the problem is easy, and then see if you can remove the
extra assumptions. Restate the problem. Restate it again. Make analogies to problems that you’ve already
solved. Could those related problems be directly valuable? Or could you use a similar technique to what
you used in that setting? Try to use a direct proof first; if you’re finding it difficult to construct a direct
proof of an implication, try working on the contrapositive instead. If both of these approaches fail, try a
proof by contradiction.

Problem-solving tip: If you’re totally stuck in attempting to prove a statement true, switch to trying to prove it false. If you
succeed, you’re done—or, by figuring out why you’re struggling to construct a counterexample, you may figure out how to prove
that the statement is true.

When you have a candidate plan of attack, try to execute it. If there’s a picture that will help clarify the
steps in your plan, draw it. Sketch out the “big” steps that you’d need to make the whole proof work. Make
sure they fit together. Then crank through the details of each big step. Do the algebra. Check the algebra.
If it all works out, great! If not, go back and try again. Where did things go off the rails, and can you fix
them?
Think about how to present your proof; then actually write it. Note that what you did in figuring out
how to prove the result might or might not be the best way to present the proof.

Step 3: Think about what you’ve done. Check to make sure your proof is reasonable. Did you actu-
ally use all the assumptions? (If you didn’t, do you believe the stronger claim that has the smaller set of
assumptions?) Look over all the steps of your proof. Turn your internal skepticism dial to its maximum,
and reread what you just wrote. Ask yourself Why? as you think through each step. Don’t let yourself get
away with anything.

Problem-solving tip: Check your work! If your claim says something about a general n, test it for n = 1. Compare your answer
to a plot, or the output of a quick program.

After you’re satisfied that your proof is correct, work to improve it. Can you strengthen the result by making
the conclusion stronger or the assumptions weaker? Can you make the proof constructive? Simplify the
argument as much as you can. Are there unnecessary steps? Are there unnecessarily complex steps? Are
there subclaims that would be better as separate lemmas?

ISTUDY
178 Proofs

Combining the steps. It’s important to be willing to move back and forth among these steps. You’ll try
to prove a claim φ, and then you’ll discover a counterexample to φ—so you go back and modify the claim
to a new claim φ′ and try to prove φ′ instead. You’ll formulate a draft of a proof of φ′ but discover a
bug when you check your work while reflecting on the proof. You’ll go back to proving φ′ , fix the bug,
and discover a new proof that’s bugfree. You’ll think about your proof and realize that it didn’t use all the
assumptions of φ′ , so you’ll formulate a stronger claim φ′′ and then go through the proof of φ′′ and reflect
again about the proof.

Taking it further: One of the most famous—and prolific!—mathematicians of modern times was Paul Erdős (1913–1996), a
Hungarian mathematician who wrote literally thousands of papers over his career, on a huge range of topics. Erdős used to talk
about a mythical “Book” of proofs, containing the perfect proof of every theorem (the clearest, the most elegant—the best!). See
p. 180 for more about The Book, and about Paul Erdős himself.

4.3.3 Some Brief Thoughts about Writing Good Proofs


When you’re writing a proof, it’s important to remember that you are writing. Proofs, like novels or per-
suasive essays, form a particular genre of writing. Treat writing a proof with the same care and attention
that you would give to writing an essay.

Writing tip: Draft. Write. Edit. Rewrite.

Writing tip: In writing a proof, keep your reader informed about the status of every sentence. And make sure that everything
you write is a sentence. For example, every sentence contains a verb. (Note that a symbol like “=” is read as “is equal to” and
is a verb.) Is the sentence an assumption? A goal? A conclusion? Annotate your sentences with signaling words and phrases to
make it clear what each statement is doing. For example, introduce statements that follow logically from previous statements
with words like hence, thus, so, therefore, and then.

Make your argument self-contained; include definitions of all variables and all nonstandard notation. State
all assumptions, and explain your notation. Choose your notation and terminology carefully; name your
variables well. Here’s an example, of doing it well and doing it poorly. (Thanks to Josh Davis for suggesting
Examples 4.23 and 4.24.)

Example 4.23: A theorem, stated poorly.

Theorem: a2 + b2 = c2 .

This formulation is a terrible way of phrasing the theorem: the reader has no idea what a, b, and c are, or
even that the theorem has anything whatsoever to do with geometry. (Example 4.23 is an attempt to state
the Pythagorean Theorem, from geometry, which states that the square of the length of the hypotenuse of
a right triangle is equal to the sum of the squares of the lengths of its legs.) Here’s a much better statement
of the Pythagorean Theorem:

Example 4.24: Pythagorean Theorem, stated well.

Theorem: Let a and b denote the lengths of the legs of a right triangle, and let c denote the length of its
hypotenuse. Then a2 + b2 = c2 .

If you are worried that your audience has forgotten the geometric terminology from this statement, then
you might add the following clarification:

ISTUDY
4.3 Proofs and Proof Techniques 179

As a reminder from geometry, a right triangle is a three-sided polygon with one 90◦ angle, called a right angle. The
two sides adjacent to the right angle are called legs and the third side is called the hypotenuse. Figure 4.19 shows an
example of a right triangle. Here the legs are labeled a and b, and the hypotenuse is labeled c. As is customary, the right
angle is marked with the special square-shaped symbol 2.

Because the “standard” phrasing of the Pythagorean Theorem—which you might have heard in high
school—calls the length of the legs a and b and the length of the hypotenuse c, we use the standard variable
names. Calling the leg lengths θ and ϕ and the hypotenuse r would be hard on the reader; conventionally in
geometry θ and ϕ are angles, while r is a radius. Whenever you can, make life as easy as possible for your
reader. (By the way, we’ll prove the Pythagorean Theorem in Example 4.30, and you’ll prove it again in
Exercise 4.75.)
Above all, remember that your primary goal in writing is communication. Just as when you are program-
ming, it is possible to write two solutions to a problem that both “work,” but which differ tremendously in
readability. Document! Comment your code; explain why this statement follows from previous statements.
Make your proofs—and your code!—a pleasure to read.

c
a

Figure 4.19 A right triangle.

ISTUDY
180 Proofs

Computer Science Connections

Paul Erdős, “The Book,” and Erdős Numbers


After you’ve completed a proof of a claim—and after you’ve celebrated completing it—you should think again
about the problem. In programming, there are often many fundamentally different algorithms to solve a particular
problem; in proofs, there are often many fundamentally different ways of proving a particular theorem. And, just as
in programming, some approaches will be more elegant, more clear, or more efficient than others.
Paul Erdős (1913–1996), a world-famous mathe-
matician who published approximately 1500 papers
(including papers on math, physics, and computer sci-
ence), used to talk about “The Book” of proofs. “The
Book” contains the ideal proof of each theorem—the
most elegant, insightful, and beautiful proof. (If you
believe in God, then The Book contains God’s proofs.)
Proving a theorem is great, but giving a “book proof”
is even better. (There’s even a non-metaphorical book
called Proofs from The Book that collects some of the
most elegant known proofs of some theorems [6].)
Erdős was one of the most respected mathemati-
cians of his time—and one of the most eccentric, too.
(He forswore most material possessions, and instead
traveled the world, crashing in the guest rooms of his
research collaborators for months at time.) Because of
Erdős’s prolific publication record and his great respect
from the research community, a measure of a certain
type of fame sprung up around him. A researcher’s
Erdős number is one if she has coauthored a published
paper with Erdős; it’s two if she has coauthored a paper
with someone with an Erdős number of one; and so
forth. (See Figure 4.20.) If you’re more of a movie per-
son than a peripatetic mathematician person, then you
may be familiar with a similar notion from the enter-
tainment world, the so-called Bacon game, where the
goal is to connect a given actor to Kevin Bacon via the
shortest chain of intermediaries, where two actors are
linked if they have appeared together in a movie.
Although Erdős numbers themselves are really
nothing more than a nerdy source of amusement, the
ideas underlying them are fundamental in graph the-
ory, the subject of Chapter 11. A closely related topic
is the small-world phenomenon, also known as “six
degrees of separation” (a phrase that originated from an
important early paper by the social psychologist Stan-
ley Milgram [88]), the principle that almost any two
people are likely to be connected by a short chain of
intermediate friends. This phenomenon has spawned Figure 4.20 An example of Erdős numbers. The University
a massive amount of recent research by computer sci- of Michigan roboticist Chad Jenkins and Microsoft
entists, who have begun working to analyze questions co-founder Bill Gates both have an Erdős number of at most
about human behavior that have only become visible in four, having written a paper with someone who wrote a
a social-media era in which it is now possible to study paper with someone who wrote a paper with someone who
collective decision making on a massive scale. wrote a paper with Paul Erdős.

ISTUDY
4.3 Proofs and Proof Techniques 181

Computer Science Connections

The Four-Color Theorem (and Massive Computer-Generated Proofs)


As we’ve said, what we mean by a “proof” is an argument that convinces the audience that the claim is true. But there
are some kinds of arguments that may test your intuition about what counts (or should count) as convincing.
To take one example, consider the fact that Checkers is a draw when both players play optimally. The proof of this
fact hinges on showing that the software system Chinook will never lose a checkers game (see p. 121). (So if Chinook
plays against itself, the game must be a draw.) The relevant part of this story is that the fact that Chinook never loses
was established via massive computation to perform a large-scale search of the checkers game tree [113]. Is that
argument convincing? Can such an argument ever be convincing? It’s clear that no human reader can accommodate
the 5 × 1020 checkers board positions in their brain, so it’s not convincing in the sense that a reader would be able
to verify every step of the argument. But, on the other hand, a reader could potentially be convinced that Chinook’s
code is correct, even if the output is too big for a reader to find convincing.
The philosophical question about whether a large-scale computer-generated proof “counts” actually as a proof
first arose in the late 1970s, in the context of the Four-Color Theorem:
The Four-Color Theorem: Any “map” of contiguous geometric regions can be colored using four colors so that no
two adjacent regions share the same color.
(See Figure 4.21.) Two quick notes about the theorem
statement: first, adjacent means sharing a positive-
length border; two regions meeting at a point don’t
need different colors. Second, the requirement of
regions being contiguous means the map can’t require
two disconnected regions (like the Lower 48 States and
Alaska) to get the same color.
The first accepted proof of the Four-Color Theo-
rem was given by Kenneth Appel (1932–2013) and
Wolfgang Haken (b. 1928) in 1976 [8]. (Well, almost
the first: a purported proof by Alfred Kempe (1849–
1922) was published in 1879 [71] and was gener-
ally accepted—but a full decade later a fatal bug was
found!) Appel and Haken’s proof was fundamentally
computational. Appel and Haken first identified a set
of 1936 different map configurations and proved (in the
traditional way, by giving a convincing argument) that,
if the Four-Color Theorem were false, it would fail on
one of these 1936 configurations. They then wrote a Figure 4.21 A four-colored map of the 87 counties in
computer program that showed how to color each one Minnesota. No two counties that share a border are shaded
of these 1936 configurations using only four colors. with the same color.
The theorem follows (“if there were a counterexample
at all, there’d be a counterexample in one of the 1936 cases—and there are no counterexamples in the 1936 cases”).
A great deal of controversy followed the publication of Appel and Haken’s work. Some mathematicians felt
strongly that a proof that’s too massive for a human to understand is not a proof at all. Others were happy to accept
the proof, particularly because the four-colorability question had been posed, and remained unresolved, for centuries.
Computer scientists, by our nature, tend to be more accepting of computational proof than mathematicians—but there
are still plenty of interesting questions to ponder. For example, as we discussed on p. 121, some errors in the execution
of the code that generates Chinook’s proof are known to have occurred, simply because hardware errors happen at
a high enough rate that they will arise in a computation of this size. Thus bit-level corruption may have occurred,
without 100% correction, in Chinook’s proof that checkers is a draw under optimal play. So is Chinook’s “proof”
really a proof? (Of course, there are also plenty of human-generated purported proofs that contain errors!)

ISTUDY
182 Proofs

EXERCISES
Consider the following claims about rational numbers. Prove (or disprove, as indicated) each of them.
4.33 Prove that, if x and y are rational numbers, then x − y is also rational.
̸ 0, then xy is also rational.
4.34 Prove that, if x and y are rational numbers and y =
4.35 One of the following statements is true and one is false:
• If xy and x are both rational, then y is too.
• If x − y and x are both rational, then y is too.
Decide which statement is true and which is false, and give proof/disproof of both.

Prove the following claims about divisibility.


4.36 Prove that the binary representation of any odd integer ends with a 1.
4.37 Prove that a positive integer n is divisible by 5 if and only if its last digit is 0 or 5.
4.38 Let k be any positive integer. Prove that any positive integer n is divisible by 2k if and only if its last k digits are divisible by 2k .
(This exercise is a generalization of Example 4.10.)
4.39 Let n be any integer. Prove by cases that n3 − n is evenly divisible by 3.
4.40 Let n be any integer. Prove by cases that n2 + 1 is not evenly divisible by 3.

4.41 Prove that |x| + |y| ≥ |x + y| for any real numbers x and y.
4.42 Prove that |x| − |y| ≤ |x − y| for any real numbers x and y.
4.43 Prove that the product of the absolute values of x and y is equal to the absolute value of their product for any real numbers x and y.
That is, prove that |x| · |y| = |x · y|.
|x+y|
4.44 Suppose that x, y ∈ R satisfy |x| ≤ |y|. Prove that 2 ≤ |y|.

4.45 Let A and B be sets. Prove that A × B = B × A if and only if A = ∅ or B = ∅ or A = B. Prove the result by mutual implication,
where the proof of the ⇒ direction proceeds by contrapositive.

Let x ≥ 0 and y ≥ 0 be arbitrary real numbers. The arithmetic mean of x and y is (x + y)/2, their average. The geometric mean

of x and y is xy.
4.46 First, a warm-up exercise: prove that x2 ≥ 0 for any real number x. (Hint: no, it’s not supposed to be hard.)

4.47 Prove the Arithmetic Mean–Geometric Mean inequality: for x, y ∈ R≥0 , we have xy ≤ (x + y)/2. (Hint: (x − y)2 ≥ 0 by
Exercise 4.46. Use algebraic manipulation to make this inequality look like the desired one.)
4.48 Prove that the arithmetic mean and geometric mean of x and y are equal if and only if x = y.


In Chapter 2, when we defined square roots, we introduced Heron’s method, a first-century algorithm to compute x given x. See
p. 22, or Figure 4.22 for a reminder. Here you’ll prove two properties that help establish why this algorithm correctly computes

square roots. (The second property that you’ll prove—Exercise 4.50—shows that Heron’s method improves its estimate of x in
every iteration. We won’t prove “how much” improvement Heron’s method achieves in an iteration, or even that this algorithm is
converging to the correct answer—let alone quickly!—but, in fact, it is.)
√ √
4.49 Let’s assume that we start with an overestimate: assume that y0 ≥ x. Prove that, for every i ≥ 1, we have yi ≥ x. In other
√ √
words, prove that if y ≥ x then (y + y )/2 ≥ x too.
x
√ √
4.50 Continue to assume that we started with an overestimate: suppose that y > x. Prove that xy is closer to x than y is—that is,
√ √
prove that | y − x| < |y − x|. Now, using this result and Exercise 4.44, prove that yi+1 as computed in Heron’s Method is closer
x
√ √ √ √
to x than yi , as long as yi > x. (Hint: for the first part, show that |y − x| − | x − xy | > 0.)

Input: A positive real number x


Output: A real number y where y2 ≈ x

1 Let y0 be arbitrary, and let i := 0.


2 while (yi )2 is too far away from x
yi + yx
i
3 let yi+1 := 2
, and let i := i + 1.
4 return yi

Figure 4.22 A reminder of Heron’s method for computing square roots.

ISTUDY
Exercises 183

Prove the following claims using a proof by contrapositive.


4.51 Let n ∈ Z≥0 . If n mod 4 ∈ {2, 3}, then n is not a perfect square.
4.52 Let n and m be integers. If nm is not evenly divisible by 3, then neither n nor m is evenly divisible by 3. (In fact, the converse is
true too, but you don’t have to prove it.)
4.53 Let n ∈ Z≥0 . If 2n4 + n + 5 is odd, then n is even.

Prove the following claims using a proof by mutual implication, using a proof by contrapositive for one direction.
4.54 Let n be any integer. Then n3 is even if and only if n is even.
4.55 Let n be any integer. Then n is divisible by 3 if and only if n2 is divisible by 3.

Prove the following claims using a proof by contradiction.


4.56 Let x, y be positive real numbers. If x2 − y2 = 1, then x or y (or both) is not an integer.
4.57 Suppose 12x + 3y = 254, for real numbers x and y. Then either x or y (or both) is not an integer.

4.58 Adapt Example 4.20 to prove that 3 2 = 21/3 is irrational. (You may find Exercise 4.54 helpful.)

4.59 Adapt Example 4.20 to prove that 3 is irrational. (You may find Exercise 4.55 helpful.)

4.60 Consider an array A[1 . . . n]. A value x is called a strict majority element of A if strictly more than half of the elements in A are
equal to x—in other words, if
n
{i ∈ {1, 2, . . . , n} : A[i] = x} > .
2
Give a proof by contradiction that every array has at most one strict majority element.

In Example 4.11, Exercise 4.33, and Exercise 4.34, we proved that if x and y are both rational, then so are all three of xy, x − y,
and xy . The converse of each of these three statements is false. Disprove the following claims by giving counterexamples:
4.61 If xy is rational, then x and y are rational.
4.62 If x − y is rational, then x and y are rational.
4.63 If yx is rational, then x and y are rational.

4.64 In Example 4.21, we disproved the following claim by giving a counterexample:


Claim 1: No positive integer is expressible in two different ways as the sum of two perfect squares.
Let’s consider a related claim that is not disproved by our counterexamples from Example 4.21:
Claim 2: No positive integer is expressible in three different ways as the sum of two perfect squares.
Disprove Claim 2 by giving a counterexample.

4.65 Leonhard Euler, an eighteenth-century Swiss mathematician to whom the idea of an abstract formal model of networks (graphs;
see Chapter 11) is due, made the observation that the polynomial

f(n) = n2 + n + 41

yields a prime number when it’s evaluated for many small integers n: for example, f(0) = 41 and f(1) = 43 and f(2) = 47 and
f(3) = 53, and so forth. Prove or disprove the following claim: the function f(n) yields a prime for every nonnegative integer n.

ISTUDY
184 Proofs

4.4 Some Examples of Proofs


Few things are harder to put up with than the annoyance of a good example.
Mark Twain (1835–1910)
Pudd’nhead Wilson (1894)

We’ve now catalogued a variety of proof techniques, discussed some strategies for proving novel state-
ments, and described some ideas about presenting proofs well. Section 4.3 illustrated some proof techniques
with a few simple examples each, entirely about numbers and arithmetic. In this section, we’ll give a few
“bigger”—and perhaps more interesting!—examples of theorems and proofs.

4.4.1 A Proof about Propositional Logic: Conjunctive/Disjunctive Normal Form


We’ll start with a result about propositional logic, namely showing that any proposition is logically equiv-
alent to another proposition that has a very specific structure. Recall the definitions of conjunctive and
disjunctive normal form from Definitions 3.16 and 3.17:

Definition 4.26: Reminder: Conjunctive/Disjunctive Normal Form.


In propositional logic, a literal is a Boolean variable or its negation (like p or ¬p).
A proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of one or more clauses,
where each clause is the disjunction of one or more literals.
A proposition φ is in disjunctive normal form (DNF) if φ is the disjunction of one or more clauses,
where each clause is the conjunction of one or more literals.

Here are two small examples of CNF and DNF:


(¬p ∨ q ∨ ¬r) ∧ (¬q ∨ r) and (¬p ∧ ¬q ∧ r) ∨ (¬q ∧ ¬r ∧ s) ∨ (r) .
conjunctive normal form disjunctive normal form

Back in Chapter 3, we claimed that every proposition is logically equivalent to one in CNF and one in DNF,
but we didn’t prove it. Here we will. First, though, let’s recall an example from Chapter 3 and brainstorm
a bit about how to generalize that result into the desired theorem. In Example 3.27, we converted p ⇔ q
into DNF as the logically equivalent proposition (p ∧ q) ∨ (¬p ∧ ¬q). Note that this expression has two
clauses p ∧ q and ¬p ∧ ¬q, each of which is true in one and only one row of the truth table. And our full
proposition (p ∧ q) ∨ (¬p ∧ ¬q) is true in precisely two rows of the truth table. (See Figure 4.23.)
Can we make this idea general? Yes! For an arbitrary proposition φ, and for any particular row of the
truth table for φ, we can construct a clause that’s true in that row and only in that row. We can then build
a DNF proposition that’s logically equivalent to φ by “or”ing together each of the clauses corresponding
to the rows in which φ is true. And then we’re done! (Well, we’re almost done! There is one subtle bug in
this proof sketch—can you find it? We’ll fix the issue in the last paragraph of the proof below.)

p q p⇔q p∧q ¬p ∧ ¬q (p ∧ q) ∨ (¬p ∧ ¬q)


T T T T F T
T F F F F F
F T F F F F
F F T F T T

Figure 4.23 Truth table for p ⇔ q and the clauses for converting it to DNF.

ISTUDY
4.4 Some Examples of Proofs 185

Theorem 4.27: All propositions are expressible in DNF (Theorem 3.19).


For any proposition φ, there exists a proposition ψdnf in disjunctive normal form such that φ ≡ ψdnf .

Proof. Let φ be an arbitrary proposition, say over the Boolean variables p1 , . . . , pk .


For any particular truth assignment f for the variables p1 , . . . , pk , we’ll construct a conjunction cf that’s
true under f and false under all other truth assignments. Let x1 , x2 , . . . , xℓ be the variables assigned true by
f, and y1 , y2 , . . . , yk−ℓ be the variables assigned false by f. Then the clause

cf = x1 ∧ x2 ∧ · · · ∧ xℓ ∧ ¬y1 ∧ ¬y2 ∧ · · · ∧ ¬yk−ℓ

is true under f, and cf is false under every other truth assignment.


We can now construct a DNF proposition ψdnf that is logically equivalent to φ by “or”ing together the
clause cf for each truth assignment f that makes φ true. Build the truth table for φ, and let Sφ denote the set
of truth assignments for p1 , . . . , pk under which φ is true. If the truth assignments in Sφ are {f1 , f2 , . . . , fm },
then define

ψdnf = cf1 ∨ cf2 ∨ · · · ∨ cfm . (∗)

Then ψdnf is true under every truth assignment f under which φ was true (because the clause cf is true under
f). And, for a truth assignment f under which φ was false, every disjunct in ψdnf evaluates to false, so the
entire disjunction is false under such a f, too. Thus φ ≡ ψdnf .
There’s one thing we have to be careful about: what happens if Sφ = ∅—that is, if φ is unsatisfiable?
(This issue is the minor bug we mentioned before the theorem statement.) The construction in (∗) doesn’t
work, but it’s easy to handle this case separately: we simply choose an unsatisfiable DNF proposition like
p ∧ ¬p as ψdnf .

Problem-solving tip: Although we didn’t phrase it as such from the beginning, our proof of Theorem 4.27 was actually a proof
by cases, with two cases corresponding to φ being unsatisfiable and φ being satisfiable. Be on the lookout for special cases (like
an unsatisfiable φ in Theorem 4.27), and see whether you can handle them separately from the argument for the “typical” case.

As an illustration, let’s use the construction from Theorem 4.27 to transform an example proposition into
disjunctive normal form:

Example 4.25: Converting p ⇒ (q ∧ r) to DNF.


Find a proposition in DNF logically equivalent to p ⇒ (q ∧ r).
Solution. To convert p ⇒ (q ∧ r) to DNF, we start from the truth table, and then “or” together the
propositions corresponding to each row that’s marked as True. The truth table is shown in Figure 4.24.
Our DNF proposition will therefore have five clauses, one for each of the five truth assignments under
which this implication is true:
(p ∧ q ∧ r) ∨ (¬p ∧ q ∧ r) ∨ (¬p ∧ q ∧ ¬r) ∨ (¬p ∧ ¬q ∧ r) ∨ (¬p ∧ ¬q ∧ ¬r) .
TTT FTT FTF FFT FFF

Conjunctive normal form


Now that we’ve proven that we can translate any proposition into disjunctive normal form (the “or of ands”),
we’ll turn our attention to conjunctive normal form (the “and of ors”).

ISTUDY
186 Proofs

p q r q∧r p ⇒ (q ∧ r)
T T T T T p∧q∧r
T T F F F p ∧ q ∧ ¬r
T F T F F p ∧ ¬q ∧ r
T F F F F p ∧ ¬q ∧ ¬r
F T T T T ¬p ∧ q ∧ r
F T F F T ¬p ∧ q ∧ ¬r
F F T F T ¬p ∧ ¬q ∧ r
F F F F T ¬p ∧ ¬q ∧ ¬r

Figure 4.24 The truth table for p ⇒ (q ∧ r).

Theorem 4.28: All propositions are expressible in CNF.


For any proposition φ, there exists a proposition φcnf in conjunctive normal form such that φ ≡ φcnf .

Though it’s not initially obvious, Theorem 4.28 actually turns out to be not so hard to prove if we make use
of the DNF result. The crucial idea—and, once again, it’s an idea that requires some genuine creativity to
come up with!—is that it’s fairly simple to turn the negation of a DNF proposition into a CNF proposition.
So, to build a CNF proposition logically equivalent to φ, we’ll construct a DNF proposition that is logically
equivalent to ¬φ; we can then negate that DNF proposition and use De Morgan’s Laws to convert the
resulting proposition into CNF. Here are the details:

Problem-solving tip: Try being lazy first! Think about whether there’s a way to use a previously established result to make the
current problem easier.

Proof. If φ is a tautology, the task is easy; just define φcnf = p ∨ ¬p.


Otherwise, φ is a nontautology, say over the variables p1 , . . . , pk . Using Theorem 4.27, we can con-
struct a DNF proposition ψ that is logically equivalent to ¬φ. (Note that, using our construction from
Theorem 4.27, the proposition ψ will have k literals in every clause, because ¬φ is satisfiable.) Thus the
form of ψ will be

ψ = (c11 ∧ · · · ∧ c1k ) ∨ (c21 ∧ · · · ∧ c2k ) ∨ · · · ∨ (cm


1 ∧ · · · ∧ ck )
m

for some m ≥ 1, where each cji is a literal. Recall that ψ ≡ ¬φ, so we also know that ¬ψ ≡ φ. Let’s
negate ψ:
 
¬ψ = ¬ (c11 ∧ · · · ∧ c1k ) ∨ (c21 ∧ · · · ∧ c2k ) ∨ · · · ∨ (cm
1 ∧ · · · ∧ ck )
m

≡ ¬(c11 ∧ · · · ∧ c1k ) ∧ ¬(c21 ∧ · · · ∧ c2k ) ∧ · · · ∧ ¬(cm


1 ∧ · · · ∧ ck ) De Morgan’s Law: ¬(p ∨ q) ≡ ¬p ∧ ¬q
m

≡ (¬c11 ∨ · · · ∨ ¬c1k ) ∧ (¬c21 ∨ · · · ∨ ¬c2k ) · · · ∧ (¬cm


1 ∨ · · · ∨ ¬ck ).
m

De Morgan’s Law: ¬(p ∧ q) ≡ ¬p ∨ ¬q, applied once per clause

But this expression is in CNF once we remove any doubly negated literals—that is, we replace any occur-
rences of ¬¬p by p instead. Thus we’ve constructed a proposition in conjunctive normal form that’s
logically equivalent to ¬ψ ≡ φ.

As an illustration of this construction, let’s convert p ⇒ (q ∧ r)—which we converted to DNF in


Example 4.25—to conjunctive normal form too:

ISTUDY
4.4 Some Examples of Proofs 187

φ ¬φ
p q r q∧r p ⇒ (q ∧ r) ¬(p ⇒ (q ∧ r))
T T T T T F p∧q∧r
T T F F F T p ∧ q ∧ ¬r
T F T F F T p ∧ ¬q ∧ r
T F F F F T p ∧ ¬q ∧ ¬r
F T T T T F ¬p ∧ q ∧ r
F T F F T F ¬p ∧ q ∧ ¬r
F F T F T F ¬p ∧ ¬q ∧ r
F F F F T F ¬p ∧ ¬q ∧ ¬r

Figure 4.25 The truth table for ¬(p ⇒ (q ∧ r)).

Example 4.26: Converting p ⇒ (q ∧ r) to CNF.


We are considering the proposition φ = p ⇒ (q ∧ r). We’ll convert it into CNF, using Theorem 4.28.
We start from the truth table for ¬φ, shown in Figure 4.25. We construct a DNF proposition equivalent
to ¬φ, following the technique of Theorem 4.27. Our proposition ¬φ has three clauses, one for each of
the truth assignments under which ¬φ is true (and φ is false):
¬φ ≡ (p ∧ q ∧ ¬r) ∨ (p ∧ ¬q ∧ r) ∨ (p ∧ ¬q ∧ ¬r)
TTF TFT TFF

We negate this proposition and use De Morgan’s Laws to push around the negations:
φ ≡ ¬ [(p ∧ q ∧ ¬r) ∨ (p ∧ ¬q ∧ r) ∨ (p ∧ ¬q ∧ ¬r)]
≡ ¬(p ∧ q ∧ ¬r) ∧ ¬(p ∧ ¬q ∧ r) ∧ ¬(p ∧ ¬q ∧ ¬r) De Morgan
≡ (¬p ∨ ¬q ∨ ¬¬r) ∧ (¬p ∨ ¬¬q ∨ ¬r) ∧ (¬p ∨ ¬¬q ∨ ¬¬r) De Morgan
≡ (¬p ∨ ¬q ∨ r) ∧ (¬p ∨ q ∨ ¬r) ∧ (¬p ∨ q ∨ r). Double Negation

So (¬p ∨ ¬q ∨ r) ∧ (¬p ∨ q ∨ ¬r) ∧ (¬p ∨ q ∨ r) is a CNF proposition that’s logically equivalent to the
given proposition p ⇒ (q ∧ r). (We can verify via truth table that we’ve gotten it right.)

One last comment about these proofs: it’s worth emphasizing again that there’s genuine creativity required
in proving these theorems. Through the strategies from Section 4.3.2 and through practice, you can get
better at having the kinds of creative ideas that lead to proofs—but that doesn’t mean that these results
should have been “obvious” to you in advance. It takes a real moment of insight to see how to use the
truth table to develop the DNF proposition to prove Theorem 4.27, or how to use the DNF formula of the
negation to prove Theorem 4.28.

Taking it further: Theorems 4.27 and 4.28 said that “a proposition ψ (of a particular form) exists for every φ”—but our
proofs actually described an algorithm to build ψ from φ. (That’s a more computational way to approach a question: a statement
like “such-and-such exists!” is the kind of thing more typically proven by mathematicians, and “a such-and-such can be found
with this algorithm!” is a claim more typical of computer scientists.) Our algorithms in Theorems 4.27 and 4.28 aren’t very
efficient, unfortunately; they require 2k steps just to build the truth table for a k-variable proposition. We’ll give a (sometimes,
and somewhat) more efficient algorithm in Chapter 5 (see Section 5.4.3) that operates directly on the form of the proposition
(“syntax”) rather than on using the truth table (“semantics”).

ISTUDY
188 Proofs

Some other results about propositional logic


In the exercises, you’ll be asked to prove a large collection of other facts about propositional logic. We’ll
highlight one of them, which is similar in spirit to the theorems about DNF and CNF: you’ll show that any
proposition φ is logically equivalent to a simpler proposition that uses only one kind of logical connective,
called “nand.” For reasons of physics, building the physical circuitry for the logical connective nand—
as in “not and,” where p nand q means ¬(p ∧ q)—is much simpler than other logical connectives. (The
physical reasons relate specifically to the way that transistors—the most basic building blocks for digital
circuits—work.) The truth table for nand—also known as the Sheffer stroke |, after the logician Henry
Sheffer (1882–1964)—appears in Figure 4.26. It turns out that every (every!) logical connective can be
expressed in terms of |. In other words, if you have enough nand gates, then you will be able to build any
logical circuit that you want. Here is a theorem that formally states this result:

Theorem 4.29: All propositions are expressible using only |.


For any Boolean formula φ over p1 , . . . , pk , there exists a proposition ψnand-only such that (i) ψnand-only is
logically equivalent to φ, and (ii) ψnand-only contains only p1 , . . . , pk and the logical connective |.

The theorem follows from Exercise 4.69, where you’ll show that every logical connective can be expressed
in terms of |. (To give a fully rigorous proof, we will need to use mathematical induction, the subject of
Chapter 5. Mathematical induction will essentially allow us to apply the results of Exercise 4.69 recursively
to translate an arbitrary proposition φ into ψnand-only .)

Taking it further: Indeed, real circuits are typically built exclusively out of nand gates, using logical equivalences to construct
and/or/not gates from a small number of nand gates. Although it may be initially implausible if this is the first time that you’ve
heard it, the processor of a physical computer is essentially nothing more than a giant circuit built out of nand gates and wires.
With some thought, you can build a circuit that takes two integers (represented in binary, as a 64-bit sequence) and computes
their sum. Similarly, but more thought-provokingly, you can build a circuit that takes an instruction (add these numbers; compare
those numbers; save this thing in memory; load the other thing from memory) and performs the requested action. That circuit is
a computer!

Incidentally, all of the logical connectives can also be defined in terms of the logical connective known as
Peirce’s arrow ↓ and also known as nor, as in “not or.” (Peirce’s arrow is named after the logician Charles
Peirce (1839–1914).) Its truth table is also shown in Figure 4.26. You’ll prove the analogous result to
Theorem 4.29 for Peirce’s arrow in Exercise 4.70.

4.4.2 The Pythagorean Theorem


Example 4.23 presented the Pythagorean Theorem, which you probably once saw in a long-ago geometry
class: the square of the length of hypotenuse of a right triangle equals the sum of the squares of the lengths of

p q p|q p↓q
T T F F
T F T F
F T T F
F F T T

Figure 4.26 The truth table for nand (also known as the Sheffer stroke |), and nor (also known as Peirce’s arrow ↓).

ISTUDY
4.4 Some Examples of Proofs 189

the legs. (The original formulation of the Pythagorean Theorem is attributed to Pythagoras, a Greek math-
ematician/philosopher who lived around 500 bce, though it may have been known to the Mesopotamians
or others for many centuries prior.)

Theorem 4.30: The Pythagorean Theorem.


Let a and b denote the lengths of the legs of a right triangle, and let c denote the length of its hypotenuse.
Then a2 + b2 = c2 .

Let’s prove it. In brainstorming about this theorem, here’s an idea that turns out to be helpful. Because the
statement of the Pythagorean Theorem involves side lengths raised to the second power (“squared”), we
might be able to think about the problem using geometric squares, appropriately configured. Here’s a proof
that proceeds using this geometric idea:
Proof. Starting with the given right triangle in Figure 4.27a, draw a square with side length c, where one
side of the square coincides with the hypotenuse of the given triangle, as in Figure 4.27b. Now draw three
new triangles, each identical to the first. Place these three new triangles symmetrically around the square
that we just drew, so that each side of the square coincides with the hypotenuse of one of the four triangles,
as in Figure 4.27c. Each of these four triangles has leg lengths a and b and hypotenuse c. Including both
the c-by-c square and the four triangles, the resulting figure is a square with side length a + b.
To complete the proof, we will account for the area of Figure 4.27c in two different ways. First, because
a square with side length x has area x2 , we have that
area of the enclosing square = (a + b)2 = a2 + 2ab + b2 .

Second, this enclosing square can be decomposed into a c-by-c square and four identical right triangles
with leg lengths a and b. Because the area of a right triangle with leg lengths x and y is xy2 , we also have
that

area of the enclosing square = 4 · (area of one triangle) + c2


= 4 · 12 ab + c2
= 2ab + c2 .
But the area of the enclosing square is the same regardless of whether we count it all together, or in its five
disjoint pieces. Therefore a2 + 2ab + b2 = 2ab + c2 . The theorem follows by subtracting 2ab from both
sides.
a b

a
c c
c b c
c c
b
c c c
a a a

b b b a

(a) The right triangle. (b) . . . with an added square. (c) . . . and three added triangles.

Figure 4.27 Illustrations for the proof of the Pythagorean Theorem, Theorem 4.30.

ISTUDY
190 Proofs

There are many proofs of the Pythagorean Theorem—in fact, hundreds! There is a classic proof
attributed to Euclid (see the next section), and many subsequent and different proof approaches followed
over the millennia, including one published, of all people, by U.S. President James Garfield. (There’s even
a book that collects over 350 different proofs of the result [81].) There’s an important lesson to draw from
the many proofs of this theorem: there’s more than one way to do it. (“There’s more than one way to do it”
is also the motto of the programming language Perl.) Just as there are usually many fundamentally different
algorithms for the same problem (think about sorting, for example), there are usually many fundamentally
different techniques that can prove the same theorem. Keep an open mind; there is absolutely no shame in
proving a result using a different approach than the “standard” way!

4.4.3 Prime Numbers


We’ll return to arithmetic for our next set of examples, a pair of proofs about the prime numbers. Recall
that a positive integer n ≥ 2 is prime if and only if the only positive integers that divide n evenly are 1 and
n itself. Also recall that a positive integer n ≥ 2 that is not prime is called composite. (That is, the integer
n is composite if and only if there exists a positive integer k ∈/ {1, n} such that k divides n evenly.)
We’ll start with another example of a proof by contradiction, to show that there are infinitely many
prime numbers. (A similar proof to the one for Theorem 4.31 dates back around 2300 years. It’s due to
Euclid, the ancient Greek mathematician after whom Euclidean geometry—and the Euclidean algorithm
(see Section 7.2.4)—is named.)

Theorem 4.31: An infinitude of primes.


There are infinitely many prime numbers.

Proof. We proceed by contradiction. Suppose, for the purposes of deriving a contradiction, that there are
only finitely many primes. This assumption means that there is a largest prime number, which we will call
p. Consider the integer p!, the factorial of this largest prime p. Let’s consider two separate cases: either
p! + 1 is prime, or p! + 1 is not prime.

Case I: p! + 1 is prime. Then we have a contradiction of the assumption that p is the largest prime, because
p! + 1 > p is also prime.
Case II: p! + 1 is not prime. If p! + 1 is not prime, then by definition it is evenly divisible by some integer
k satisfying 2 ≤ k ≤ p!. But we proved in Example 4.7 that p! + 1 is not evenly divisible by any integer
between 2 and p, inclusive. Thus the smallest integer k that evenly divides p! + 1 must exceed p. Further,
this integer k must be prime: if k were not prime, then some 2 ≤ k′ < k divides k and therefore that
same k′ divides p! + 1—but k was defined as the smallest divisor of p! + 1. Thus k > p is prime, and
again we have a contradiction of the assumption that p is the largest prime.

In either case, we have a contradiction! Thus the original assumption—there are only finitely many prime
numbers—is false, and so there are infinitely many primes.
We’ll now turn to another result about prime numbers, relating to the primality testing problem: you
are given a positive integer n, and you have to determine whether n is prime. The definition of primality
says that n is composite if there’s an integer k ∈ Z − {1, n} such that k | n, but it shouldn’t be too hard
to see that n is composite if and only if there’s an integer k ∈ {2, 3, . . . , n − 1} such that k | n. (That is,

ISTUDY
4.4 Some Examples of Proofs 191

the largest possible divisor of n is n − 1.) But we can do better, strengthening this result by shrinking the
largest candidate value of k:


Theorem 4.32: A composite number n has a factor ≤ n.

A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ {2, 3, . . . , ⌈ n⌉} if and only if n
is composite.

Proof. We’ll proceed by mutual implication.


The forward direction follows immediately from the definition of composite numbers: if there’s some

integer k ∈ {2, 3, . . . , ⌈ n⌉} with k ̸= n such that k evenly divides n, then by definition n is composite.
(That integer k satisfies k | n and k ∈/ {1, n}.)
For the other direction, assume that the integer n ≥ 2 is composite. By definition of composite, there
exists a positive integer k ∈ / {1, n} such that n mod k = 0—that is, there exist positive integers k ∈
/ {1, n}
and d such that dk = n, so d | n and k | n. We must have that d ̸= 1 (otherwise dk = 1 · k = k = n, but
k < n) and d ̸= n (otherwise dk = nk > n, but dk = n). Thus there exist positive integers d, k ∈ / {1, n}
√ √ √ √
such that dk = n. But if both d > n and k > n, then dk > n · n = n, which contradicts the fact that
√ √
dk = n. Thus either d ≤ n or k ≤ n.

Taking it further: Generating large prime numbers (and testing the primality of large numbers) is a crucial step in many modern
cryptographic systems. See p. 197 for some discussion of algorithms for testing primality suggested by these proofs, and a bit
about the role that they play in modern cryptographic systems.

4.4.4 Uncomputability
We’ll close this section with one of the most important results in computer science, dating from the early
twentieth century: there are problems that cannot be solved by computers. At that time, great thinkers were
pondering some of the most fundamental questions that can be asked in CS. What is a computer? What is
computation? What is a program? What tasks can be solved by computers/programs? One of the deepest
and most mind-numbing results of this time was a proof, developed independently by Alan Turing and by
Alonzo Church in the mid-1930s, that there are uncomputable problems. That is, there is a problem P for
which it’s possible to give a completely formal description of the right answer—but it’s not possible to
write a program that solves P.
Here, we’ll prove this theorem. Specifically, we’ll describe the halting problem, and prove that it’s
uncomputable. (Informally, the halting problem is: given a function p written in Python and an input x, does
p get stuck in an infinite loop when it’s run on x?) The result is a great example of a proof by contradiction,
where we will exploit the abyss of self-reference to produce the contradiction.

Problems
Before we address the computability of the halting problem, we have to define precisely what we mean
by a “problem” and “computable.” A problem is the kind of task that we wish to solve with a computer
program. We will focus on yes–no problems, called decision problems:

Definition 4.33: Problem.


A problem is a description of a set of valid inputs, and a specification of the corresponding output for
each them. A decision problem is one where the output is either “yes” or “no.”

ISTUDY
192 Proofs

(In other words, a decision problem is specified by a description of a set of possible inputs, along with a
description of those inputs for which the correct answer is “yes.”)
We’ve already encountered several decision problems. Here are two:

Example 4.27: The Primality decision problem.


For the Primality decision problem, the set of possible inputs is the set of positive integers; the set of
“yes” inputs is the set of prime numbers. (The “no” inputs are 1 and the composites.)

Example 4.28: The Satisfiability decision problem.


For the Satisfiability decision problem, any propositional-logic proposition φ is a valid input, and φ
is a “yes” input if and only if φ is satisfiable.

An instance of a problem is a valid input for that problem. (An invalid input is one that isn’t the right
“kind of thing” for that problem.) We will refer to an instance x of a problem P as a yes-instance if the
correct output is “yes,” and as a no-instance if the correct output is “no.” For example, 17 or 18 are both
instances of Primality; 17 is a yes-instance, while 18 is a no-instance; p ∨ ¬p is an invalid input.

Computability
Problems are the things that we’ll be interested in solving via computer programs. Informally, problems that
can be solved by computer are called computable and those that cannot be solved by computer are called
uncomputable. It’ll be easiest to think of computability in terms of your favorite programming language,
whatever it may be. For concreteness, we’ll pretend it’s Python, though any language would do.

Taking it further: The original definition of computability given by Alan Turing used an abstract device called a Turing machine;
a programming language is called Turing complete if it can solve any problem that can be solved by a Turing machine. Every non-
toy programming language is Turing complete: Java, C, C++, Python, Ruby, Perl, Haskell, BASIC, Fortran, Assembly Language,
whatever.

Formally, we’ll define computability in terms of the existence of an algorithm, which we will think of as a
function written in Python:

Definition 4.34: Computability.


A decision problem P is computable if there exists a Python function A that solves P. That is, P is
computable if there exists a Python function A such that, on any input x, two conditions hold: (i) A
terminates when run on x; and (ii) A(x) returns true if and only if x is a yes-instance of P.

Notice that we insist that the Python function A must actually terminate on any input x: it’s not allowed to
run forever. Furthermore, running A(x) returns True if x is a yes-instance of P and running A(x) returns
False if x is a no-instance of P.
The decision problems from Examples 4.27 and 4.28 are both computable:

Example 4.29: Computability of Primality.


Primality is computable: both isPrime and isPrimeBetter (p. 197) are algorithms that could be imple-
mented as a Python function that (i) terminates when run on any positive integer, and (ii) returns True on
input n if and only if n is prime.

ISTUDY
4.4 Some Examples of Proofs 193

Example 4.30: Computability of Satisfiability.


Satisfiability is computable, too: as we discussed in Section 3.3.1, we can exhaustively try all truth
assignments for φ, checking whether any of them satisfies φ. This algorithm is slow—if φ has n variables,
there are 2n different truth assignments—but it is guaranteed to terminate for any input φ, and correctly
decides whether φ is satisfiable.

Programs that take source code as input


The inputs to the problems or programs that we’ve talked about so far have been integers (for Primality)
or Boolean formulas (for Satisfiability). Of course, other input types like rational numbers or lists are
possible, too. Programs that take programs as input are a particularly important category.
Taking it further: Although you might not have thought about them in these terms, you’ve frequently encountered programs that
take programs as input. For example, in any introductory CS class, you’ve seen one frequently: the Python interpreter python,
the Java compiler javac, and the C compiler gcc all take programs (written in Python or Java or C, respectively) as input.

It’s easy to think up some decision problems where the input is a Python program. Here’s one, about
commenting code. (For example, it’s probably not too hard to imagine an Intro CS instructor setting up
an automated grading system for programs that gives an automatic zero to any submitted assignment that
contains no comments.)

Example 4.31: The Commented decision problem.


Define the decision problem Commented as follows:

Input: the Python source code s for a function.


Output: “yes” if s contains at least one comment; “no” otherwise.

In Python, a comment starts with # and goes until the end of the line, so as long as a # appears some-
where in the source code s—and not inside quotation marks—then s is a yes-instance of Commented;
otherwise s is a no-instance.
The Commented problem is computable: testing whether s is a yes-instance can be done by look-
ing at the characters of s one by one, and testing to see whether any one of those characters starts a
comment. A Python program commentedTester that solves Commented is shown in Figure 4.28. (The
details of testing whether ch is inside quotes are omitted from the source code, but otherwise the code
for commentedTester is valid, runnable Python code.)
Consider running commentedTester on the other instances shown Figure 4.28. Observe that
absValue is a no-instance of Commented, because it doesn’t contain the comment character # at
all, and isEven is a yes-instance of Commented, because it contains three comments. As desired, if
we ran commentedTester on these two pieces of source code, the output would be False and True,
respectively.

Example 4.31 showed that the decision problem Commented is computable by giving a Python function
commentedTester that solves Commented. Because we can run commentedTester on any piece of
Python source code we please, let’s do something a little bizarre: let’s run commentedTester on the source
code for commentedTester itself (!). There weren’t any comments in commentedTester—the only # in
the code is inside quotes—so the source code of commentedTester is a no-instance of Commented. Put
a different way, if sct denotes the source code of commentedTester, then running sct on sct returns False.
This idea of taking some source code s and running s on s itself will be essential in the rest of this section.

ISTUDY
194 Proofs

1 def commentedTester(sourceCode): 1 def absValue(n): 1 def isEven(n):


2 for ch in sourceCode: 2 if n > 0: 2 # % is Python's mod operator
3 if ch == "#" and is not inside quotes: 3 return n 3 if n % 2 == 0:
4 return True 4 else: 4 return True # n is even
5 return False 5 return -1 * n 5 else:
6 return False # n is odd

Figure 4.28 Python source code for three functions.

The halting problem


The key decision problem that we’ll consider is the halting problem:

Definition 4.35: The halting problem.


Define the decision problem HaltingProblem as follows:

Input: a pair ⟨s, x⟩, where s is the source code of a syntactically valid Python function that takes one
argument, and x is any value.
Output: “yes” if s terminates when run on input x; “no” otherwise.

That is, ⟨s, x⟩ is a yes-instance of HaltingProblem if s(x) terminates (doesn’t get stuck in an infinite
loop), and it’s a no-instance if s(x) does get stuck in an infinite loop.

We can now use the idea of running a function with itself as input to show that the halting problem is
uncomputable, by contradiction:

Theorem 4.36: Uncomputability of the halting problem.


HaltingProblem is uncomputable.

Proof. We give a proof by contradiction. Suppose for the sake of contradiction that the halting problem
is computable—that is, assume

There’s a Python function Ahalting solving the halting problem. (1)

(In other words, for the Python source code s of a one-argument function, and any value x, running
Ahalting (s, x) always terminates, and Ahalting (s, x) returns True if and only if running s on x does not result
in an infinite loop.)
Now consider the Python function makeSelfSafe in Figure 4.29. The function makeSelfSafe takes
as input the Python source code s of a one-argument function, tests whether running s on s itself is “safe”

1 def makeSelfSafe(s): # the input s is the Python source code of a one-argument function.
2 safe = Ahalting (s,s)
3 if safe:
4 run s on input s
5 return True

Figure 4.29 The Python code for makeSelfSafe.

ISTUDY
4.4 Some Examples of Proofs 195

(does not cause an infinite loop), and if it’s safe then it runs s on s. We claim that makeSelfSafe never
gets stuck in an infinite loop:
For any Python source code s, makeSelfSafe(s) terminates. (2)
To see that (2) is true, observe that Line 2 of the algorithm always terminates, by assumption (1). Line 3
of the algorithm ensures that s is called on input s if and only if Ahalting (s, s) said that s terminates when
run on s. And, by assumption, Ahalting is always correct. Thus s is run on input s only if s terminates when
run on input s. So Lines 3–4 of the algorithm always terminates. And Line 5 of the algorithm doesn’t do
anything except return, so it terminates immediately. Thus (2) follows.
Write smss to denote the Python source code of makeSelfSafe. Because smss is itself Python source
code, Fact (2) implies that
makeSelfSafe(smss ) terminates. (3)
In other words, running smss on smss terminates. Thus, by the assumption (1) that Ahalting is correct, we can
conclude that
Ahalting (smss , smss ) returns true. (4)
But now consider what happens when we run makeSelfSafe on its own source code—that is, when
we compute makeSelfSafe(smss ). Observe that safe is set to true in Line 2 of the algorithm, by
Fact (4). Thus Line 4 calls makeSelfSafe(smss ) recursively! But therefore makeSelfSafe(smss ) calls
makeSelfSafe(smss ), which calls makeSelfSafe(smss ), and so on, ad infinitum. In other words,
makeSelfSafe(smss ) does not terminate. (5)
But (3) and (5) are contradictory! Thus the only assumption that we made, namely (1), was false. Therefore
there does not exist a correct always-terminating algorithm for the halting problem. That is, the halting
problem is uncomputable.
To summarize Theorem 4.36: we showed that the assumption of the existence of an algorithm for the
halting problem leads to a contradiction, and therefore we conclude that such an algorithm cannot exist.
The contradiction is, at its heart, about self-reference—an algorithmic version of the Liar’s paradox: This
sentence is false.

Taking it further: Computability theory is the study of what problems can and cannot be solved by computers. Computability
was a primary focus of theoretical computer science from the 1930s through roughly the 1970s. (After that time, the focus of
theoretical computer scientists began to shift to complexity theory, which addresses the question of what problems can and cannot
be solved efficiently by computers.) You can read more about the halting problem in any textbook on computability theory, or
a remarkable book by Douglas Hofstadter [56]. For extra amusement, you can find a full proof of Theorem 4.36 by Geoffrey
K. Pullum in poem form, reproduced in Figure 4.30. And see p. 198 for a discussion of some practically relevant problems that
are also uncomputable.

ISTUDY
196 Proofs

Scooping the Loop Snooper: A proof that the Halting Problem is undecidable
Geoffrey K. Pullum

No general procedure for bug checks will do. If P’s answer is ‘Bad!’, Q will suddenly stop.
Now, I won’t just assert that, I’ll prove it to you. But otherwise, Q will go back to the top,
I will prove that although you might work till you drop, and start off again, looping endlessly back,
you cannot tell if computation will stop. till the universe dies and turns frozen and black.

For imagine we have a procedure called P And this program called Q wouldn’t stay on the shelf;
that for specified input permits you to see I would ask it to forecast its run on itself.
whether specified source code, with all of its faults, When it reads its own source code, just what will it do?
defines a routine that eventually halts. What’s the looping behavior of Q run on Q?

You feed in your program, with suitable data, If P warns of infinite loops, Q will quit;
and P gets to work, and a little while later yet P is supposed to speak truly of it!
(in finite compute time) correctly infers And if Q’s going to quit, then P should say ‘Good.’
whether infinite looping behavior occurs. Which makes Q start to loop! (P denied that it would.)

If there will be no looping, then P prints out ‘Good.’ No matter how P might perform, Q will scoop it:
That means work on this input will halt, as it should. Q uses P’s output to make P look stupid.
But if it detects an unstoppable loop, Whatever P says, it cannot predict Q:
then P reports ‘Bad!’—which means you’re in the soup. P is right when it’s wrong, and is false when it’s true!

Well, the truth is that P cannot possibly be, I’ve created a paradox, neat as can be—
because if you wrote it and gave it to me, and simply by using your putative P.
I could use it to set up a logical bind When you posited P you stepped into a snare;
that would shatter your reason and scramble your mind. Your assumption has led you right into my lair.

Here’s the trick that I’ll use—and it’s simple to do. So where can this argument possibly go?
I’ll define a procedure, which I will call Q, I don’t have to tell you; I’m sure you must know.
that will use P’s predictions of halting success A reductio: There cannot possibly be
to stir up a terrible logical mess. a procedure that acts like the mythical P.

For a specified program, say A, one supplies, You can never find general mechanical means
the first step of this program called Q I devise for predicting the acts of computing machines;
is to find out from P what’s the right thing to say it’s something that cannot be done. So we users
of the looping behavior of A run on A. must find our own bugs. Our computers are losers!

Figure 4.30 A (slightly different) proof of Theorem 4.36, in poetic form. Copyright © 2012 by Geoffrey K. Pullum.
Used here by permission of the author.

ISTUDY
4.4 Some Examples of Proofs 197

Computer Science Connections

Cryptography and the Generation of Prime Numbers


Prime numbers are used extensively in cryptographic systems. For example, the RSA cryptosystem—named after the
first letters of its inventors’ last names [107]; see Section 7.5 for much more— uses as a primary step the generation
of two large prime numbers, each of which might be an integer containing ≈ 128 bits in its binary representation.
Prime numbers are useful in cryptography because
of an asymmetry in the apparent difficulty of two multiplication (easy)
directions of a particular problem in arithmetic. (See p = 20,201
Figure 4.31.) If you are given two (big) prime numbers pq = 848,421,799
q = 41,999
p and q, then computing their product pq is easy. But
factoring (probably hard)
if you are given a number n that is guaranteed to be the
product of two prime numbers, finding those two num-
Figure 4.31 Two directions of a problem. (Maybe
bers appears to be much harder. For example, if you’re
told that n = 504,761, it will probably take a long time strangely, “easy” is a technical term for problems in CS,
to figure out that n = 251 · 2011. But if you’re told that meaning that there exists an efficient algorithm to solve it.
p = 251 and q = 2011, then you should be able to cal- We aren’t sure that factoring is “hard”—it’s just that there’s
culate pq = 504,761 in maybe a minute or so, even if no efficient algorithm that is known.)
you have to do all the calculation by hand.
A crucial step in RSA is the generation of large prime numbers. One way to accomplish this step is by choosing
a random integer of the appropriate size and then testing whether that number is prime. (We keep retrying until the
random number turns out to be prime.) So how do we test for primality?
A little consideration of the definition implies that
we can test whether an integer n is prime by testing isPrime(n):
all candidate divisors between 2 and n − 1. This algo- 1 k := 2
rithm requires us to do roughly n divisibility checks 2 while k < n:
(actually, to be precise, n − 2 divisibility checks). 3 if n is evenly divisible by k then
Using Theorem 4.32, the algorithm can be improved 4 return False

to do only about n divisibility checks, as in Fig- 5 k := k + 1
ure 4.32. Although there are other, generally better 6 return True
ways of evaluating them (see Chapter 6), we can test isPrimeBetter(n):
these two algorithms empirically. A Python implemen- 1 k := 2
tation using n − 1 calls to isPrime to find all primes √ 
2 while k ≤ n :
in the integers {2, . . . , n} took about 3 minutes for 3 if n is evenly divisible by k and n ̸= k then
n = 65,536 on an ancient laptop of mine. For the same 4 return False
n on the same machine, isPrimeBetter took about a 5 k := k + 1
second. This difference is a nice example of the way in 6 return True
which theoretical, proof-based techniques can improve Figure 4.32 Slow and less slow primality testing. (We
actual widely used algorithms.
could further save roughly another factor of two by
In part because of its importance to cryptography,
checking only k = 2 and odd k ≥ 3.)
there has been significant work on primality testing √
over recent decades—improving far beyond the roughly n division tests of isPrimeBetter. In general, an effi-
cient algorithm for a number n should require a number of steps√proportional to log n—the number of bits or digits
in the representation of n—rather than proportional to n or even n. (When you add two 10-digit numbers by hand,
you want to do about 10 operations, rather than about 1,000,000,000 operations.) Thus isPrimeBetter still does not
count as “efficient”—in fact, it’s still quite slow on large inputs.
There are some very efficient randomized algorithms for primality testing that are actually used in real cryp-
tosystems, including one called the Miller–Rabin test [89, 106]. This randomized algorithm performs a (randomly
chosen) test that all prime numbers pass and most composite numbers fail; repeating this operation with many differ-
ent randomly chosen tests decreases the probability of getting a wrong answer to an arbitrarily small number. (See
p. 368.) And more recently, researchers gave the first theoretically efficient algorithm for primality testing that’s not
randomized [4]. In other words, like multiplication (and probably unlike factoring), primality testing is “easy.”

ISTUDY
198 Proofs

Computer Science Connections

Other Uncomputable Problems (That You Might Care About)


The halting problem (Theorem 4.36) may seem like a purely abstract problem, and therefore one that doesn’t matter
in the real world—sure, it’d be nice to have an infinite-loop detector in your Python interpreter or Java compiler, but
would it just be a vaguely helpful feature for students in Intro CS classes but nobody else? The answer is a resounding
no: while the halting problem itself may seem obscure, there are many uncomputable problems that, if solved, would
vastly improve operating systems or compilers, among other things. But they’re uncomputable, and therefore the
desired improvements cannot be made.
Here’s one example. Modern operating systems use RAM Hard Disk
virtual memory for their applications. The physical
1 2 6 3,4,5,7,8,…
computer has a limited amount of physical memory—
say, 64 gigabytes of RAM—that applications can use.
But the operating system “pretends” that it has a much (a) Initial configuration, with pages #1,2,6 in memory, and
larger amount of memory, so that the word processor, remaining pages on disk.
web browser, Java compiler, and solitaire game can RAM Hard Disk
each act as though they had even more than 64 giga-
1 2 6 3,4,5,7,8,…
bytes of memory that they don’t have to share. Memory
(both virtual and real) is divided into chunks of a fixed
size, called pages. The operating system stores pages (b) Program requests data on page #2. It’s in memory, so it’s
that are actively in use in physical memory (RAM), just fetched; nothing else happens.
and relegates some of the not-currently-used pages to RAM Hard Disk
the hard drive. At every point in time, the operating
4 2 6 1,3,5,7,8,…
system’s paging system decides which pages to leave
in physical memory, and which pages to “eject” to the
hard drive when there are too many in RAM. (This idea (c) Program requests data on page #4. It’s on disk, so it’s
is the same as what you do when you’re cooking several fetched and replaces some page in RAM—say, #1.
dishes in a kitchen with limited counter space: you have RAM Hard Disk
to relegate some of the not-currently-being-prepared
4 2 1 3,5,6,7,8,…
ingredients to the fridge. And at every moment that the
counter’s full, you have to decide which ingredients to
leave on the counter, and which to “eject” to the fridge.) (d) Program requests data on page #1. It’s on disk, so it’s
See Figure 4.33. fetched and replaces some page in RAM—say, #6.
Here’s a problem that a paging system would love Figure 4.33 A sample sequence of memory fetches in a
to solve: given a page p of memory that an application paged memory system.
has used, will that application ever access the contents
of p again? Let’s call this problem WillBeUsedAgain. When the paging system needs to eject a page, ideally it
would eject a page that’s a no-instance of WillBeUsedAgain, because it will never have to bring that page back
into physical memory. (When you’re out of counter space, you would of course prefer to put away some ingredient
that you’re done using.)
Unfortunately for operating system designers, WillBeUsedAgain is uncomputable. There’s a very quick proof,
based on the uncomputability of the halting problem. Consider the following algorithm:
• Run the Python function f on the input x.
• If f(x) terminates, then access some memory from page p.
This algorithm accesses page p if and only if ⟨f, x⟩ is a yes-instance of the halting problem.
Therefore if we could give an algorithm to solve the WillBeUsedAgain problem, then we could give an algorithm
to solve the halting problem. (Just check whether p will be used again; the answer that that question tells you whether
f halts on x!) But we already know that we can’t give an algorithm to solve the halting problem. If p ⇒ q and ¬q,
then we can conclude ¬p; therefore WillBeUsedAgain is uncomputable.

ISTUDY
Exercises 199

EXERCISES
Figure 4.34 shows the truth tables for all 16 different binary logical operators, with each column named if it’s a logical operator
that we’ve already seen. A set S of binary operators is said to be universal if every binary logical operation can be expressed using
some combination of the operators in S. Formally, a set S is universal if, for every Boolean expression φ over variables p1 , . . . , pk ,
there exists a Boolean expression ψ that is logically equivalent to φ where ψ uses only the variables p1 , . . . , pk and the logical
connectives in S.
4.66 Prove that the set {∨, ∧, ⇒, ¬} is universal. (Hint: to do so, you need to show that, for each column 1 through 16 of Figure 4.34,
you can build a Boolean expression φi over the variables p and q that uses only the operators {∨, ∧, ⇒, ¬}, and such that φi is
logically equivalent to p i q.)
4.67 Prove that the set {∨, ∧, ¬} is universal. (Hint: once you’ve done Exercise 4.66, all you have to do is show that you can express ⇒
using {∨, ∧, ¬}.)
4.68 Prove that {∨, ¬} and {∧, ¬} are both universal.
4.69 Prove that the set {|}—the set containing just the Sheffer stroke, that is, nand—is universal.
4.70 Prove that the singleton set {↓} is universal.
4.71 Prove that the set {∧, ∨} is not universal. (Hint: what happens under the all-true truth assignment?)

4.72 Let φ be a fully quantified proposition of predicate logic. Prove that φ is logically equivalent to a fully quantified proposition ψ in
which all quantifiers are at the outermost level of ψ. In other words, the proposition ψ must be of the form

∀ ∀
/∃ x1 : /∃ x2 : · · · ∀/∃ xk : P(x1 , x2 , . . . , xk ),

where each ∀/∃ is either a universal or existential quantifier. (The transformation that you performed in Exercise 3.185 put Gold-
bach’s conjecture in this special form.) (Hint: you might find the results from Exercises 4.66–4.71 helpful. Using these results, you
can assume that φ has a very particular form.)

4.73 Prove that, for any integer n ≥ 1, there is an n-variable logical proposition φ in conjunctive normal form such that the truth-table
translation to DNF (from Theorem 4.27) yields an DNF proposition with exponentially more clauses than φ has.

4.74 Prove that the area of a right triangle with legs x and y is xy/2.
4.75 Use Figure 4.35 as an outline to give a different proof of the Pythagorean Theorem.

4.76 Exercise 4.47 asked you to prove (via algebra) the Arithmetic Mean–Geometric Mean inequality: for x, y ∈ R≥0 , we have that

xy ≤ x+ y
2 . Here you’ll reprove the result geometrically. Suppose that x ≥ y, and draw two circles of radius x and y tangent to
each other, and tangent to a horizontal line. See Figure 4.36. Considering the right triangle shown in that diagram, and using the
Pythagorean Theorem and the fact that the hypotenuse is the longest side of a right triangle, prove the result again.

Let x, y ∈ R2 be two points in the plane. As usual, denote their coordinates by p


x1 and x2 , and y1 and y2 , respectively. The Euclidean
distance between these points is the length of the line that connects them: (x1 − y1 )2 + (x2 − y2 )2 . The Manhattan distance
between them is |x1 − y1 | + |x2 − y2 |: the number of blocks that you would have to walk “over” plus the number that you’d have
to walk “up” to get from one point to the other. Denote these distances by deuclidean and dmanhattan .
4.77 Prove that deuclidean (x, y) ≤ dmanhattan (x, y) for any two points x, y.
4.78 Prove that there exists a constant a such that both
• dmanhattan (x, y) ≤ a · deuclidean (x, y) for all points x and y; and
• there exist points x∗ , y∗ such that dmanhattan (x∗ , y∗ ) = a · deuclidean (x∗ , y∗ )

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
p q True p∨q p⇐q p p⇒q q p⇔q p∧q p|q p⊕q ¬q ¬p p↓q False
T T T T T T T T T T F F F F F F F F
T F T T T T F F F F T T T T F F F F
F T T T F F T T F F T T F F T T F F
F F T F T F T F T F T F T F T F T F

Figure 4.34 The full set of binary logical operators.

ISTUDY
200 Proofs

c
a

b
c b
a

x
a
b c
b
y
a
c

Figure 4.35 A diagram illustrating another way to Figure 4.36 Using the Pythagorean Theorem for the
prove the Pythagorean Theorem. Arithmetic Mean–Geometric Mean inequality.

4.79 On p. 54, we discussed rotation matrices for points in two dimensions. Specifically, given a point ⟨x, y⟩ ∈ R2 and an angle θ ∈ R,
we claimed that rotating the given point by the given angle around the origin ⟨0, 0⟩ resulted in the point

⟨x cos θ − y sin θ, x sin θ + y cos θ⟩.

Prove that this formula is correct. You may find two facts from trigonometry (which you may use without proof) useful:

sin(α + β) = sin α cos β + cos α sin β and cos(α + β) = cos α cos β − sin α sin β.

A positive integer n is called a perfect number if it is equal to the sum of all positive integer factors 1 ≤ k < n of n. For example,
the number 14 is not perfect: the numbers less than 14 that evenly divide 14 are {1, 2, 7}, but 1 + 2 + 7 = 10 ̸= 14.
4.80 Prove that at least one perfect number exists.
4.81 Prove that, for any prime integer p, the positive integer p2 is not a perfect number.

4.82 Let n ≥ 10 be any positive integer. Prove that the set {n, n + 1, . . . , n + 5} contains at most two prime numbers.
4.83 Let n be any positive integer. Prove or disprove: any set of ten consecutive positive integers {n, n + 1, . . . , n + 9} contains at least
one prime number.

4.84 (Thanks to the NPR radio show Car Talk, from which I learned this exercise.) Imagine a junior high school, with 100 lockers,
numbered 1 through 100. All lockers are initially closed. There are 100 students, each of whom—brimming with teenage angst—
rampages through the hallway and systematically slams some of them shut and yanks some of them open. Specifically, in round
i = 1, 2, . . . , 100, student #i changes the state of every ith locker: if the door is open, then it’s slammed shut; if the door is closed,
then it’s opened. (So student #1 opens them all, student #2 closes all the even-numbered lockers, etc.) Which lockers are open
after this whole process is over? Prove your answer.
 √ 
4.85 Theorem 4.32 established this claim: A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ 2, 3, . . . , n if
and only if n is composite. If we delete the word “other,” this claim becomes false. Prove that this modified claim is false.
 √ 
4.86 Prove that the unmodified claim (retaining the word “other”) remains true if the bounds on k are altered from k ∈ 2, 3, . . . , n
√ 
to k ∈ n ,...,n − 1 .
 √  √   √ 
4.87 Prove that the bound cannot be changed from k ∈ 2, 3, . . . , n to k ∈ n/2 , . . . , 3 n/2 . That is, prove that the
√   √ 
following claim is false: A positive integer n ≥ 2 is evenly divisible by some other integer k ∈ n/2 , . . . , 3 n/2 if and
only if n is composite.

4.88 Let n be any positive integer, and let pn denote the smallest prime number that evenly divides n. Prove that there are infinite number

of integers n such that pn ≥ n. (This fact establishes that we cannot change the bound in the aforementioned theorem to anything

smaller than n.)

ISTUDY
4.5 Common Errors in Proofs 201

4.5 Common Errors in Proofs


Mistakes were made.
Ron Ziegler (1939–2003)
Press Secretary for President Richard Nixon during Watergate,
in an apology to The Washington Post (1973)

We’ve now spent considerable time establishing a catalogue of proof techniques that you can use to
prove theorems, along with some examples of these techniques in action. We’ll close this chapter with
a brief overview of some common flaws in proofs, so that you can avoid them in your own work (and
be on the lookout for them in the work of others). Recall that a proof consists of a sequence of logical
inferences, deriving new facts from assumptions or previously established facts. A valid inference is one
whose conclusion is always true as long as the facts that it relies on were true. (That is, a valid step never
creates a false statement from true ones.) An invalid inference is one in which the conclusion can be false
even if the premises are all true. An invalid argument can also be called a logical fallacy, a fallacious
argument, or just a fallacy. In a correct proof, of course, every step is valid.
Taking it further: Some of the most famous disasters in the history of computer science have come from some bugs that arose
because of an erroneous understanding of some property of a system—and a lack of valid proof of correctness for the system.
These bugs have been costly, with both lives and many dollars lost. See p. 207 for a few highlights/lowlights.

Here are a few examples of a single logical inference, some of which might be fallacious:
Problem-solving tip: To make the logical structure of an argument clearer, consider an abstract form of the argument in which
you use variables to name the atomic propositions.

Example 4.32: Some (valid and invalid) logical inferences.


Here are several inferences. In each case, there are two premises, and a conclusion that is claimed to
follow logically from those premises. Which of these inferences are valid, and which are fallacies?
1 Premise A. All people belch sometimes.
Premise B. Sister Mary Kenneth Keller was a person.
Conclusion. Therefore, Sister Mary Kenneth Keller belched sometimes.
2 Premise A. All software is buggy.
Premise B. Windows is a piece of software.
Conclusion. Therefore, Windows is buggy.
3 Premise A. If you handed in an exam without your name on it, then you got a zero.
Premise B. You handed in an exam without your name on it.
Conclusion. Therefore, you got a zero.
4 Premise A. If you handed in an exam without your name on it, then you got a zero.
Premise B. You handed in an exam with your name on it.
Conclusion. Therefore, you didn’t get a zero.

Solution. We abstract away from buggy software and belching people by rewriting these arguments in
purely logical form:
1 Assume a ∈ S and assume ∀x ∈ S : P(x). Conclude P(a).
2 Assume a ∈ S and assume ∀x ∈ S : P(x). Conclude P(a).

ISTUDY
202 Proofs

3 Assume p ⇒ q and assume p. Conclude q.


4 Assume p ⇒ q and assume ¬p. Conclude ¬q.
In this format, we see first that (1) and (2) are actually the same logical argument (with different meanings
for the symbols), and they’re both valid. Argument (3) is precisely an invocation of Modus Ponens (see
Chapter 3), and it’s valid. But (4) is a fallacy: the fact that p ⇒ q and ¬p is consistent with either q or
¬q, so in particular when p = False and q = True the premises are true but the conclusion is false.

Taking it further: Sister Mary Kenneth Keller (1913–1985) earned a Ph.D. in computer science from the University of Wis-
consin in 1965. Hers was one of two such degrees awarded that day in the United States, tying her as one of the first two CS
Ph.D. recipients in the country. (She’d previously become a nun, in 1940.) Her Ph.D. work was on symbolic differentiation, and
she was also a major contributor to the early work on the BASIC programming language. And, despite all of that, she—if you
believe the premises of Example 4.32—belched sometimes.

Each of these examples purports to convince its reader of its conclusion, under the assumption that the
premises are true. Valid arguments will convince any (reasonable) reader that their conclusion follows
from their premises. Fallacious arguments are buggy; a vigilant reader will not accept the conclusion of a
fallacious argument even if she accepts the premises.

Taking it further: Here’s a useful way to think about validity and fallacy. An argument with premises p1 , p2 , . . . , pk and
conclusion c is valid if and only if p1 ∧ p2 ∧ · · · ∧ pk ⇒ c is a theorem. If there is a circumstance in which p1 ∧ p2 ∧ · · · ∧ pk ⇒ c
is false—in other words, where the premises p1 ∧ p2 ∧ · · · ∧ pk are all true but the conclusion c is false—then the argument is
fallacious.

Your main job in proofs is simple: avoid fallacies! But that can be harder than it sounds. The remainder
of this section is devoted to a few types of common mistakes in proofs—that is, some common types of
fallacies.

A broken proof
The most common mistake in a purported proof is simple but insidious: a single statement is alleged to
follow logically from previous statements, but it doesn’t. Here’s a somewhat subtle example:

Problem-solving tip: The kind of mistake in Example 4.33, in which there’s a single step that doesn’t follow from the previous
step, can sometimes be difficult to sniff out. But it’s the kind of bug that you can spot by simply being überskeptical of everything
that’s written in a purported proof.

Example 4.33: Figure out what’s wrong with this logic.


Find the error in this purported proof, and give a counterexample to the claim.

False Theorem: Let Fn = k ∈ Z≥1 : k | n denote the factors of an integer n ≥ 2. Then |Fn | is even.

Proof. Let Fsmall ⊆ Fn be the set of factors of n that are less than n. Let Fbig ⊆ Fn be the set of factors

that are greater than n. Observe that every d ∈ Fsmall has a unique entry nd corresponding to it in Fbig .
Therefore |Fsmall | = |Fbig |. Let k = |Fsmall | = |Fbig |. Note that k is an integer. Thus Fn contains precisely
√ √
k elements less than n and k elements greater than n, and so |Fn | = 2k, which is an even number.
Solution. The problem comes right at the end of the proof, in the last sentence: Thus Fn contains precisely
√ √
k elements less than n and k elements greater than n, and so |Fn | = 2k.

ISTUDY
4.5 Common Errors in Proofs 203


The problem is that this statement discounts the possibility that n itself might be in Fn . (Note that

n is in neither Fsmall nor Fbig as they were defined.) For an integer n that’s a perfect square, we have

that n ∈ Fn , and therefore |Fn | = 2k + 1. For example, the integer 9 is a counterexample to the claim,
because F9 = {1, 3, 9} and |F9 | = 3.

But while an error of this form—one step in the proof that doesn’t actually follow from the previously
established facts—may be the most common type of bug in a proof, there are some other, more structural
errors that can arise. Most of these structural errors come from errors of propositional logic—namely by
proving a new proposition that’s not in fact logically equivalent to the given proposition. Here are a few of
these types of flawed reasoning.

Fallacy: proving true


We are considering a claim φ. We proceed as follows: we assume φ, and (correctly) prove True under
that assumption. (Usually, for some reason, the “proof” writer puts a little check mark in their alleged
proof at this point: ✓.) What can we conclude about φ? The answer is: absolutely nothing! The reason:
we’ve proven that φ ⇒ True, but anything implies true. (Both True ⇒ True and False ⇒ True are true
implications.) Here’s a classical example of a bogus proof that uses this fallacious reasoning:

Example 4.34: Figure out what’s wrong with this logic.


Find the error in this purported proof.

False Theorem: 1 = 0.

Proof. Suppose that 1 = 0. Then:


1=0
therefore, multiplying both sides by 0 0·1=0·0
and therefore, 0 = 0. ✓
And, indeed, 0 = 0. Thus the assumption that 1 = 0 was correct, and the theorem follows.
Solution. We have merely shown that (1 = 0) ⇒ (0 = 0), which does not say anything about the truth
or falsity of 1 = 0; anything implies true.

Writing tip: When you’re trying to prove that two quantities a and b are equal, it’s generally preferable to manipulate a until it
equals b, rather than “meeting in the middle” by manipulating both sides of the equation until you reach a line in which the two
sides are equal. The “manipulate a until it equals b” style of argument makes it clear to the reader that you are proving a = b
rather than proving (a = b) ⇒ True.

Fallacy: affirming the consequent


We are considering a claim φ. We prove (correctly) that φ ⇒ ψ, and we prove (correctly) that ψ. We then
conclude φ. (Recall that ψ is the consequent of the implication φ ⇒ ψ, and we have “affirmed” it by
proving ψ.) This “proof” is wrong because it confuses necessary and sufficient conditions: when we prove
φ ⇒ ψ, we’ve shown that one way for ψ to be true is for φ to be true. But there might be other reasons
that φ is true! Here’s an example of a fallacious argument that uses this bogus logic:

ISTUDY
204 Proofs

Example 4.35: Figure out what’s wrong with this logic.


Find the error in this argument:
Premise A. If it’s raining, then the computer burning will be postponed.
Premise B. The computer burning was postponed.
Conclusion. Therefore, it’s raining.
Solution. This fallacious argument is an example of affirming the consequent. The first premise here
merely says that the computer burning will be postponed if it rains; it does not say that rain is the only
reason that the burning could be postponed. There may be many other reasons why the burning might be
delayed: for example, the inability to find a match, the sudden vigilance of the health and safety office,
or a last-minute stay of execution by the owner of the computer.

Fallacy: denying the hypothesis


Denying the hypothesis is a closely related fallacy to affirming the consequent: we prove (correctly) that
ψ ⇒ φ, and we prove (correctly) that ¬ψ; we then (fallaciously) conclude ¬φ. This logic is buggy for
essentially the same reason as affirming the consequent. (In fact, denying the hypothesis is the contraposi-
tive of affirming the consequent—and therefore a fallacy too, because it’s logically equivalent to a fallacy.)
The implication ψ ⇒ φ means that one way of φ being true is for ψ to be true, but it does not mean that
there is no other way for φ to be true. Here’s an example of a fallacious argument of this type:

Example 4.36: Figure out what’s wrong with this logic.


Find the error in this argument:
Premise A. If you have resolved the P-versus-NP question, then you are famous.
Premise B. You have not resolved the P-versus-NP question.
Conclusion. Therefore, you are not famous.
Solution. This fallacious argument is an example of denying the hypothesis. The first premise says that
one way to be famous is to resolve the P-versus-NP question (see p. 104 for a brief description of this
problem), but it does not say that resolving the P-versus-NP question is the only way to be famous. For
example, you could be famous by being the President of the United States or by founding Google.

Fallacy: false dichotomy


A false dichotomy or false dilemma is a fallacious argument in which two nonexhaustive alternatives are
presented as exhaustive (without acknowledgement that there are any unmentioned alternatives).

Example 4.37: False dichotomy.


The flawed step in Example 4.33 can be interpreted as a false dilemma: implicitly, that proof relied on
the assertion that if k evenly divides n, then
 √
k ∈ Fsmall = factors of n that are less than n or
 √
k ∈ Fbig = factors of n that are greater than n .

But of course the third unmentioned possibility is that k = n.

ISTUDY
4.5 Common Errors in Proofs 205

(The classical false dichotomy, often found in political rhetoric, is “either you’re with us or you’re against
us”: actually, you might be neutral on the issue, and therefore neither “with” nor “against” us!)

Fallacy: begging the question


We wish to prove a proposition φ. A purported proof of φ that begs the question is one that assumes φ
along the way. That is, the “proof” assumes precisely the thing that it purports to prove, and thus actually
proves φ ⇒ φ. Although this type of fallacious reasoning sounds ridiculous, the assumption of the desired
result can be very subtle; you must be vigilant to catch this type of error. Here’s an example of a fallacious
argument of this kind:

Example 4.38: Figure out what’s wrong with this logic.


Find the error in this proof:

False Theorem: Let n be a positive integer such that n + n2 is even. Then n is odd.

Proof. Assume the antecedent—that is, assume that n + n2 is even. Let k be the integer such that
n = 2k + 1. Then
n + n2 = 2k + 1 + (2k + 1)2
= 2k + 1 + 4k2 + 4k + 1
= 4k2 + 6k + 2
= 2 · (2k2 + 3k + 1),
which is even because it is equal to 2 times an integer. But n2 = (2k + 1)2 = 4k2 + 4k + 1 is odd (because
4k2 and 4k are both even). Therefore

n= n + n2 − n2 .
As argued above, As argued above,
n + n2 = 2 · (2k2 + 3k + 1), n2 = 4k2 + 4k + 1,
which is an even number. which is an odd number.

An even number less an odd number is an odd number, which implies that n must be odd too.
Solution. The problem comes very early in the “proof,” in the sentence Let k be the integer such that
n = 2k+1. But this statement implicitly assumes that n is an odd integer; an integer k such that n = 2k+1
exists only if n is odd. So the proof begs the question: it assumes that n is odd, and—after some algebraic
shenanigans—concludes that n is odd.

Problem-solving tip: Even without identifying the specific bug in Example 4.38, we could notice that there’s something fishy by
doing the post-proof plausibility check to make sure that all premises were actually used. The “proof” states that it is assuming
the antecedent, but we actually derived the fact that n + n2 is even. So we never used that assumption in the “proof.” (In fact,
n + n2 is even for any positive integer n.) But, because we didn’t use the assumption, the same proof works just as well without
it as an assumption, so we could use the same “proof” to establish this claim instead:
Patently False Theorem: Let n be a positive integer. Then n is odd.
Given that this new claim is obviously false, there must be a bug in the proof. The only challenge is to find that bug.

ISTUDY
206 Proofs

Other fallacies
We have discussed a reasonably large collection of logical fallacies into which some less-than-careful or
less-than-scrupulous proof writers may fall. It is always your job to be vigilant—both when reading proofs
written by others, and in developing your own proofs—to avoid fallacious reasoning.
But there are many other types of flaws in arguments that more typically arise in informal contexts; these
are the kinds of flawed arguments that are—sadly—often used in politics. (Some of them have analogues
in more mathematical settings, too.)

This listing is just a brief outline of some of the many invalid techniques of persuasion/propaganda; a much more extensive and
thorough list is maintained by Gary Curtis at http://www.fallacyfiles.org/. You might also be interested in books that
catalogue fallacious techniques of argument, such as [102].

Here are a few examples of other types of fallacies that you may encounter in “real-world” arguments:

Confusing correlation and causation. Phenomena A and B are said to be (positively) correlated if they
occur together more often than their individual frequencies would predict. (See Chapter 10.) But just
because A and B are correlated does not mean that one causes the other! For example, the user popula-
tion of TikTok is much younger than is the population at large. We could say, correctly, that Being young is
correlated with using TikTok. But Using TikTok makes you young is an obviously absurd conclusion. (Some
correlation-versus-causation mistakes are subtler; your reaction to Being young makes you use TikTok is
probably less virulent, but it is equally unsupported by the facts that we’ve cited here.) Always be wary
when attempting to infer causal relationships!

Ad hominem attacks. An ad hominem (Latin ad hominem: “to the man”) attack ignores the logical
argument and speaks to the arguer: Bob doesn’t know the difference between contrapositive and converse,
and he says that n is prime. So n must be composite.

Equivocation (also known as “shifting language”). This type of argument relies on changes in the
meanings of the words/variables in an argument. This shift can be grammatical: Time waits for no man,
and no man is an island; therefore, time waits for an island. Or it can be in the semantics of a particular
word: 1024 is a prime example of an exact power of two, and prime numbers are evenly divisible only by
1 and themselves; therefore, 1024 is not divisible by 4. A similar type of fallacy can also occur when a
variable in a proof is introduced to mean two different things.

ISTUDY
4.5 Common Errors in Proofs 207

Computer Science Connections

The Cost of Missing Proofs: Some Famous Bugs in CS


There’s an apocryphal story that the first use of the word “bug” to refer to a flaw in a computer system was in the
1940s: Grace Hopper, a rear admiral in the U.S. Navy and a pioneer in early programming, found a moth (a literal,
physical moth) jamming a piece of computer equipment and causing a malfunction. (The story is true, but the Oxford
English Dictionary reports uses of “bug” to refer to a technological fault dating back to Thomas Edison in the late
1800s.) But there are many other stories of bugs that are both more important and more true. When a computer
system “almost” works—when there’s no proof that it works correctly in all circumstances—there can be grave
repercussions, in dollars and lives lost. Here are a few of the most famous, and most costly, bugs in history.
The Pentium division bug. In 1994, Thomas Nicely, at the time a math professor at Lynchburg College, discovered
a hardware bug in Intel’s new Pentium chip that caused incorrect results when some floating-point numbers were
divided by certain other floating-point numbers. The flaw resulted from a lookup table for the division operation
that was missing a handful of entries. Although the range of numbers that were incorrectly divided was limited, the
resulting brouhaha led to a full Pentium recall and about $500 million (in 1994 dollars) in losses for Intel.
The Ariane 5 rocket. The European Space Agency’s rocket, carrying a $400 million (in 1996 dollars) payload of
satellites, exploded 40 seconds into its first flight, in 1996. The rocket had engaged its self-destruct system, which was
correctly triggered when it strayed from its intended trajectory. But the altered trajectory was caused by a sequence
of errors, including an integer overflow error: the rocket’s velocity was too big to fit into the 16-bit variable that
was being used to store it. (An Ariane 5 rocket was much faster than the Ariane 4 rockets for which the code was
originally developed.) Embarrassingly, the overflow caused a subsystem to output a diagnostic error code that was
interpreted as navigation data. More embarrassingly still, this entire subsystem played no role in navigation after
liftoff, and would have caused no harm if it were just turned off.
The Therac-25. The Therac-25 was
electron filter and beam monitor
a medical device in use in the
diaphragm
mid-1980s that treated tumors with
a focused beam of radiation. The
device fired a concentrated X-ray
beam of extremely high dosage into
a diffuser that would reduce the
beam’s intensity to the desired lev-
els before it was directed at the
patient. But it turned out that a
particularly fast touch-typing oper-
narrow slit for
ator could cause the high-intensity electron beam electron
beam to be fired without the dif- shielding trimmers
fuser in place: hitting enter at
the precise moment that an inter- Figure 4.37 A schematic drawing of a Therac-25, adapted from the article
nal variable reset to zero caused the “Therac 25: A New Medical Accelerator Concept” [125].
undiffused beam to be fired. (This
kind of bug is called a race condition, in which the output of a system depends crucially on the precise timing of
events like operator input.) At least five patients were killed by radiation overdoses.
The Pentium, Ariane 5, and Therac-25 bugs are among the most well-known and worst “bug-like” errors in the history
of CS. For a list of one person’s view of the ten worst bugs in history, including these three and some other sordid
tales, see [49]. (There are other kinds of pernicious problems with software systems that you wouldn’t necessarily
call “bugs”—maybe “ethical lapses” is a better descriptor—that are deeply problematic in other ways. For example,
a system’s impacts may be different for different kinds of users, perhaps reinforcing or exacerbating existing gender-,
race-, or socioeconomics-based biases.) For more on the particular bugs described above: you can read more about
the Pentium issue and its aftermath in [101]; the official report on the Ariane 5 is available as [80]; and an excellent
analysis of what went wrong with Therac-25 can be found in two books by Nancy Levenson [77, 78].

ISTUDY
208 Proofs

EXERCISES
Identify whether the following arguments are valid or fallacious. Justify your answers.

4.89 Premise A: Every programming language that uses garbage collection is slow.
Premise B: C does not use garbage collection.
Conclusion: Therefore, C is slow.

4.90 Premise A: If a piece of software is written well, then it was built with its user in mind.
Premise B: The Firefox web browser is a piece of software that was written with its user in mind.
Conclusion: Therefore, the Firefox web browser is written well.

4.91 Premise A: If a processor overheats while multiplying, then it overheats while computing square roots.
Premise B: The xMax processor does not overheat while computing square roots.
Conclusion: Therefore, the xMax processor does not overheat while multiplying.

4.92 Premise A: Every data structure is either slow at insertions or lookups.


Premise B: The data structure called the Hackmatack tree is slow at insertions.
Conclusion: Therefore, the Hackmatack tree is slow at lookups.

4.93 Premise A: Every web server has an IP address.


Premise B: www.cia.gov is a web server.
Conclusion: Therefore, www.cia.gov has an IP address.

4.94 Premise A: If a computer system is hacked, then there was user error or the system had a design flaw.
Premise B: A computer at NASA was hacked.
Premise C: That computer did not have a design flaw.
Conclusion: Therefore, there was user error.

In the next few problems, you will be presented with a false claim and a bogus proof of that false claim. For each, you’ll be asked
to (a) identify the precise error in the proof, and (b) give a counterexample to the claim. (Note that saying why the claim is false
does not address (a) in the slightest—it would be possible to give a bogus proof of a true claim!)

4.95 State precisely what’s wrong with the following proof.

False Claim 1. Let n be a positive integer and let p, q ∈ Z≥2 , where p and q are prime. If n is evenly divisible by
both p and q, then n is also evenly divisible by pq.

Bogus proof. Because p | n, there exists a positive integer k such that n = pk. Thus, by assumption, we know that
q | pk. Because p and q are both prime, we know that p does not evenly divide q, and thus the only way that q | pk can
hold is if q | k. Hence k = qℓ for some positive integer ℓ, and thus n = pk = pqℓ. Therefore pq | n.

4.96 Give a counterexample to False Claim 1 from Exercise 4.95.

4.97 State precisely what’s wrong with the following proof.


False Claim 2. 721 is prime.

Bogus proof. In Example 4.7, we proved that n! + 1 is not evenly divisible by any k satisfying 2 ≤ k ≤ n. Observe
that 6! = 720. Therefore, 721 = 6! + 1 isn’t evenly divisible by any integer between 2 and 720 inclusive, and therefore
721 is prime.
4.98 Without using a calculator, disprove False Claim 2 from Exercise 4.97.
4.99 Without using a calculator, find an integer n such that n! + 1 is prime.

4.100 State precisely what’s wrong with the following proof.



2 √8
False Claim 3. 4 and are both rational.
2

2
Bogus proof. In Example
√ 4.11, we
√ proved that if x and y are rational then xy is rational too. Here, let x = 4 and
y = √8 . Then xy = 42 · √8 = 8√2 = 2. So xy = 2 is rational, and x and y are too.
2 2 4 2
8

4.101 Prove that isn’t rational (which shows that False Claim 3 from Exercise 4.100 is false).
2

ISTUDY
Exercises 209

4.102 State precisely what’s wrong with the following proof.

False Claim 4. Let n be any integer. Then 12 | n if and only if 12 | n2 .

Bogus proof. We’ll follow the same outline as in Example 4.18. We proceed by mutual implication.

First, assume that 12 | n. Then, by definition, there exists an integer k such that n = 12k. Therefore n2 = (12k)2 =
12 · (12k2 ). Thus 12 | n2 too.

Second, we must show the converse: if 12 | n2 , then 12 | n. We prove the contrapositive. Assume that 12 ̸ | n. Then
there exist integers k and r ∈ {1, . . . , 11} such that n = 12k + r. Therefore n2 = (12k + r)2 = 144k2 + 24kr + r2 =
12(12k2 + 2kr) + r2 . Because r < 12, adding r2 to a multiple of 12 does not result in another multiple of 12. Thus
12 ̸ | n2 .

4.103 Disprove False Claim 4 from Exercise 4.102.

4.104 State precisely what’s wrong with the following proof.



False Claim 5. 4 is irrational.

Bogus proof. We’ll follow the same outline as Example 4.20. Our proof is by contradiction.
√ n

Assume that 4 is rational. Therefore, there exist integers n and d ̸= 0 such that d = 4, where n and d have no
common divisors.
2
Squaring both sides yields that nd2 = 4, and therefore that n2 = 4d2 . Because 4d2 is divisible by 4, we know that n2
is divisible by 4. Therefore, by the same logic as in Example 4.18, we have that n is itself divisible by 4.

Because n is divisible by 4, there exists an integer k such that n = 4k, which implies that n2 = 16k2 . Thus n2 = 16k2
and n2 = 4d2 , so d2 = 4k2 . Hence d2 is divisible by four.

But now we have a contradiction: we assumed that nd was in lowest terms, but we have now shown that n2 and d2

are both divisible by 4, and therefore both n and d must be even! Thus the original assumption was false, and 4 is
irrational.

4.105 State precisely what’s wrong with the following proof.

False Claim 6. 3 ≤ 2.

Bogus proof. Let x and y be arbitrary nonnegative numbers. Because y ≥ 0 implies −y ≤ y, we can add x to both
sides of this inequality to get

x − y ≤ x + y. (1)

Similarly, adding y − 3x to both sides of −x ≤ x yields

y − 4x ≤ y − 2x. (2)

Observe that whenever a ≤ b and c ≤ d, we know that ac ≤ bd. So we can combine (1) and (2) to get

(x − y)(y − 4x) ≤ (x + y)(y − 2x). (3)

Multiplying out and then combining like terms, we have

xy − 4x2 − y2 + 4xy ≤ xy − 2x2 + y2 − 2xy, and (4)


2 2
6xy ≤ 2x + 2y . (5)
p p 2
This calculation was valid for any x, y ≥ 0. For x = y = 1/2, we have xy = x2 = y2 = 1/2 = 12 . Plugging
into (5), we have

6
2 ≤ 22 + 22 . (6)

In other words, we have 3 ≤ 2.

ISTUDY
210 Proofs

Computer vision is the subfield of computer science devoted to developing algorithms that can “understand” images. One of the
more controversial applications of computer vision is in facial recognition, which can be used for some very positive applications
but also in some ethically dubious ones too. One of the most ethically concerning properties of many widely deployed facial
recognition algorithms is that their accuracy rates can be very different across different demographic groups, particularly based
on race: Black people’s faces are systematically less accurately recognized by these algorithms than the faces of white people. (See
p. 399 for more on this kind of algorithmic bias.) These differential accuracies are absolutely real, but it turns out that there are
some subtleties to calculating accuracy rates, which can lead to some counterintuitive conclusions, as you’ll explore in the next
two exercises. (Although some applications, like facial recognition, raise major ethical questions regarding fairness and privacy,
there are other tremendously useful applications of computer vision that seem to be much more clearly world-improving. Deployed
applications of computer vision include: assessing the condition of crops using aerial imagery from drones to help a farmer better
allocate irrigation and fertilizer; doing a first-cut diagnosis on medical images to categorize X-rays or MRIs into “normal” or
“possibly anomalous” clusters (the latter being examined by human experts); and automating the sorting of most packages in a
post office facility by using optical character recognition on the address labels.)

4.106 Suppose that we have two disjoint groups of people, say B (Black) and W (White). Further, suppose that we have two algorithms,
F and G , that we have employed on two different cameras that are being used to allow/deny access to a workplace. Suppose that
algorithm F is deployed on camera I. It makes the correct decision on 75% of the people in W at camera I and 60% of the people in
B at camera I. (That is, when a person from W arrives at camera I, algorithm F correctly decides whether to grant her access 75%
of the time.) Algorithm G , deployed at camera II, makes the correct decision on 70% of the people in W and 50% of the people in
B. The following claim seems obvious, because algorithm F performed better for both members of B and W. But the claim is false,
as you’ll show! (The falsehood of this claim (for example, in the scenario in Exercise 4.107) is called Simpson’s paradox because
the behavior is so counterintuitive.) State precisely where the purported proof goes wrong.

False Claim 7. Algorithm F is right a higher fraction of the time (overall, combining both sets of people) than algorithm G .

Bogus proof. Observe that algorithm F had a better success probability than G with members of B, and also had a better success
probability with members of W. Therefore algorithm F was right a higher fraction of the time (in total, for all people—that is,
members of B ∪ W—than algorithm G ).

4.107 Consider the scenario described in Exercise 4.106. Suppose that there were 100 members of W and 100 members of B who went
by camera I. Suppose that 1000 members of W and 100 members of B went by camera II. Calculate the success rate for algorithm
F at camera I, over all people. Do the same for algorithm G at camera II.

4.108 Here is a (nonobviously) bogus proof of the (obviously) bogus claim that 0 = 1. Identify precisely the flaw in the argument.

Proof that 0 = 1. Consider the four shapes in Figure 4.38a, and the two arrangements thereof in Figure 4.38b. The area of the
triangle in the first configuration is 132·5 = 65
2 , as it forms a right triangle with height 5 and base 13. But the second configuration
also forms a right triangle with height 5 and base 13 as well, and therefore it too has area 65 2 . But the second configuration has one
unfilled square in the triangle, and thus we have

65
0= 2 − 65
2
= area of the second bounding triangle − area of the first bounding triangle
= (1 + area of four constituent shapes) − (area of four constituent shapes)
= 1.

(a) The shapes. (b) Two configurations.

Figure 4.38 Some shapes and their arrangements, for Exercise 4.108.

ISTUDY
Exercises 211

y
θ θ
c c θ
a a z
a ϕ
x
x
θ
ϕ ϕ ϕ

b b b

(a) The original triangle. (b) A perpendicular line added from (c) The original triangle, split into two
the hypotenuse to the opposite vertex. smaller right triangles. Note that c = y + z.

Figure 4.39 Diagrams for Exercise 4.109.

4.109 The following two statements are theorems from geometry that you may recall from high school:
• The angles of a triangle sum to precisely 180◦ .
• If the three angles of triangle T1 are precisely equal to the three angles of T2 , then T1 and T2 are similar triangles, and their sides
are in the same ratios. (That is, if the side lengths of T1 are a, b, and c and the corresponding side lengths of T2 are x, y, and z,
then ax = by = cz .)
These statements are theorems, but they’re used in the following utterly bogus “proof” of the Pythagorean Theorem (actually one
that was published, in 1896!). State precisely what’s wrong with the following purported proof of the Pythagorean Theorem.

Bogus proof. Consider an arbitrary right triangle. Let the two legs and hypotenuse, respectively, have length a, b, and c, and let
the angles between the legs and the hypotenuse be given by θ and ϕ = 90◦ − θ. (See Figure 4.39a.) Draw a line perpendicular
to the hypotenuse to the opposite vertex, dividing the interior of the triangle into two separate sections, which are shaded with
different colors in Figure 4.39b. Observe that the unlabeled angle within the smaller shaded interior triangle must be ϕ = 90◦ − θ,
because the other angles of the smaller shaded interior triangle are (just like for the enclosing triangle) 90◦ and θ. Similarly, the
unlabeled angle within the larger shaded interior triangle must be θ. Therefore we have three similar triangles, all with angles 90◦ ,
θ , and ϕ. Call the lengths of the previously unnamed sides x, y, and z as in Figure 4.39c. Now we can assemble our known facts.
By assumption,

a2 = x2 + y2 , b2 = x2 + z2 , and (y + z)2 = a2 + b2 ,

which we can combine to yield

(y + z)2 = 2x2 + y2 + z2 . (1)

Expanding (y + z)2 = y2 + 2yz + z2 and subtracting common terms from both sides, we have

2yz = 2x2 , (2)

which, dividing both sides by two, yields

yz = x2 . (3)

But (3) is immediate: we know that

x
y = xz (4)

because the two shaded triangles are similar, and therefore the two triangles have the same ratio of the length of the hypotenuse to
the length of the longer leg. Multiplying both sides of (4) by xy gives us x2 = yz, as desired.

ISTUDY
212 Proofs

4.6 Chapter at a Glance

Error-Correcting Codes
Although the main purpose of this section was to introduce proofs, here’s a brief summary of the definitions
and results about error-correcting and error-detecting codes, too.
n
A code is a set C ⊆ {0, 1} , where |C| = 2k for some integer 1 ≤ k ≤ n. A message is an element of
k
{0, 1} ; the elements of C are called codewords. Thinking of an error as a single bit being flipped from 0
to 1 or from 1 to 0, consider any codeword c ∈ C and for any sequence of up to ℓ errors applied to c to
produce c′ . The code C can detect ℓ ≥ 0 errors if we can always correctly report “error” or “no error,” and
can correct ℓ errors if we can always correctly identify that c was the original codeword.
n
The Hamming distance between strings x, y ∈ {0, 1} , denoted ∆(x, y), is the number of positions i in
which xi ̸= yi . The minimum distance of a code C is the smallest Hamming distance between two distinct
codewords of C. The rate of a code with k-bit messages and n-bit codewords is k/n. If the minimum distance
of a code C is 2t + 1 for an integer t, then C can detect 2t errors and correct t errors.
The Repetitionℓ code creates codewords via the ℓ-fold repetition of the message. This code has rate
1/ℓ and minimum distance ℓ. The Hamming code creates 7-bit codewords from 4-bit messages by adding
three different parity bits to the message. This code has rate 4/7 and minimum distance 3. Any code with
messages of length 4 and minimum distance 3 has codewords of length ≥ 7. (Thus the Hamming code has
the best possible rate among all such codes.) We can prove this result via a “sphere-packing” argument and
a proof by contradiction.

Proofs and Proof Techniques


A proof of a claim φ is a convincing argument that φ is true. (A proof should be written with its audience
in mind.) A variety of useful proof techniques can be employed to prove a given claim φ:

• direct proof: we prove φ by repeatedly inferring new facts from known facts to eventually conclude φ.
(Sometimes we divide a proof into multiple cases, or “assume the antecedent,” where we prove p ⇒ q
by assuming p and deriving q.)

You may also prove φ by proving a claim logically equivalent to φ:

• proof by contrapositive: to prove p ⇒ q, we instead prove ¬q ⇒ ¬p.


• proof by contradiction (or reductio ad absurdum): to prove φ, we instead prove that ¬φ ⇒ False—that
is, we prove that ¬φ leads to an absurdity.

We say that y ∈ S with ¬P(y) is a counterexample to the claim ∀x ∈ S : P(x). A proof by construction of the
claim ∃x ∈ S : P(x) proceeds by constructing a particular y ∈ S and proving that P(y). A nonconstructive
proof establishes ∃x ∈ S : P(x) without giving an explicit y ∈ S for which P(x)—for example, by proving
∃x ∈ S : P(x) by contradiction.
The process of developing a proof requires persistence, open-mindedness, and creativity. Here’s a helpful
three-step plan to use when developing a new proof: (1) understand what you’re trying to do (checking
definitions and small examples); (2) do it (by trying the proof techniques catalogued here, and thinking
about analogies from similar problems that you’ve solved previously); and (3) think about what you’ve

ISTUDY
4.6 Chapter at a Glance 213

done (reflecting on and trying to improve your proof). Remember that writing a proof is a form of writing!
Be kind to your reader.

Some Examples of Proofs


We can use these proof techniques to establish a wide variety of facts—about arithmetic, propositional
logic, geometry, prime numbers, and computability. For more extensive examples, see Section 4.4. We’ll
highlight one result: there are problems that we can formally define, but that cannot be solved by any
computer program; these problems (including the halting problem) are called uncomputable.

Common Errors in Proofs


A valid inference is one whose conclusion is always true as long as the facts that it relies on were true. An
invalid inference is one in which the conclusion can be false even if the premises are all true. An invalid,
or fallacious, argument can also be called a logical fallacy or just a fallacy. In a correct proof, of course,
every step is valid.
Perhaps the most common error in a proof is simply asserting that a fact φ follows from previously
established facts, when actually φ is not implied by those facts. Other common types of fallacious reasoning
are structural errors that involve purporting to prove a statement φ, but instead proving a statement that is
not logically equivalent to φ. (For example, the fallacy of proving true: a “proof” of φ that assumes φ and
proves True. But φ ⇒ True is true regardless of the truth of φ, so this purported proof proves nothing.)
Be vigilant; do not let anyone—yourself or others!—get away with fallacious reasoning.

ISTUDY
214 Proofs

Key Terms and Results

Key Terms Key Results


Error-Correcting Codes Error-Correcting Codes
• Hamming distance 1 If the minimum distance of a code C is 2t+1 for an integer
• code, message, codeword t ≥ 0, then C can detect 2t errors and correct t errors.
• error-detecting/correcting code 2 For 4-bit messages and minimum distance 3, there exist
• minimum distance, rate codes with rate 13 (such as the Repetition3 code) and
• repetition code with rate 47 (such as the Hamming code), but not with rate
• Hamming code better than 47 .

Proofs and Proof Techniques Proofs and Proof Techniques


• proof 1 You can prove a claim φ with a direct proof, or by instead
• proof techniques: proving a different claim that is logically equivalent to φ.
– direct proof Examples include proofs by contrapositive and proofs by
– proof by contrapositive contradiction.
– proof by contradiction 2 A useful three-step process for developing proofs is:
• counterexample (1) understand what you’re trying to do; (2) do it; and
• constructive/nonconstructive proof (3) think about what you’ve done. All three steps are
important, and doing each will help with the other steps.
Some Examples of Proofs 3 Writing a proof is a form of writing.

• conjunctive/disjunctive normal form Some Examples of Proofs


• uncomputability
• the halting problem 1 All logical propositions are equivalent to propositions in
conjunctive/disjunctive normal form, or using only nand.
Valid and Fallacious Arguments 2 There are infinitely many prime numbers.
3 There are problems that can be specified completely for-
• valid argument mally that are uncomputable (that is, cannot be solved
• fallacious/invalid argument; fallacy by any computer program). The halting problem is one
• fallacy: proving true example.
• fallacy: affirming the consequent
• fallacy: denying the hypothesis Valid and Fallacious Arguments
• fallacy: false dichotomy
1 There are many common mistakes in proofs that are
• fallacy: begging the question
centered on several types of fallacious reasoning. These
fallacies are essentially all the result of purporting to
prove a statement φ by instead proving a statement ψ,
where ψ fails to be logically equivalent to φ.

ISTUDY
5 Mathematical Induction

In which our heroes wistfully dream about having dreams about


dreaming about a very simple and pleasant world in which no one
sleeps at all.

ISTUDY
215
216 Mathematical Induction

5.1 Why You Might Care


Chaque vérité que je trouvais étant une règle qui me servait après à en trouver d’autres.
Each truth discovered was a rule available in the discovery of subsequent ones.
René Descartes (1596–1650)
Le Discours de la méthode [Discourse on the Method] (1637)

Recursion is a powerful technique in computer science. If we can express a solution to problem X in terms
of solutions to smaller instances of the same problem X—and we can solve X directly for the “smallest”
inputs—then we can solve X for all inputs. There are many examples. We can sort an array A by sorting
the left half of A and the right half of A and merging the results together; 1-element arrays are by definition
already sorted. (That’s merge sort.) We can search for an element x in a sorted list A by comparing x to
the middle element of A and, if they don’t match, searching for x in the appropriate half of A, left or right;
an empty list A by definition does not contain x. (That’s binary search.) We can build an efficient data
structure for storing and searching a set of keys by selecting one of those keys k, and building two such
data structures for keys < k and for keys > k; to search for a key x, we compare x to k and search for x in
the appropriate substructure. And an empty data structure can store an empty set of keys. (That’s a binary
search tree.) And many other things are best understood recursively, too: factorials, the Fibonacci numbers,
fractals (see Figure 5.1), and finding the median element of an unsorted array, for example.
Mathematical induction is a technique for proofs that is directly analogous to recursion: to prove that
a property P(n) holds for all nonnegative integers n, we prove that P(0) is true, and we prove that for an
arbitrary n ≥ 1, if P(n − 1) is true, then P(n) is true too. The proof of P(0) is called the base case, and
the proof that P(n − 1) ⇒ P(n) is called the inductive case. In the same way that a recursive solution to a
problem relies on solutions to a smaller instance of the same problem, an inductive proof of a claim relies
on proofs of a smaller instance of the same claim.
A full understanding of recursion depends on a thorough understanding of mathematical induction. And
many other applications of mathematical induction will arise throughout the book: analyzing the running
time of algorithms, counting the number of bitstrings that have a particular form, and many others.
In this chapter, we will introduce mathematical induction, including a few variations and extensions of
this proof technique. We will start with the “vanilla” form of proofs by mathematical induction (Section
5.2). We will then introduce strong induction (Section 5.3), a form of proof by induction in which the
proof of P(n) in the inductive case may rely on the truth of all of P(0), P(1), …, and P(n − 1) instead of
just on P(n − 1). Finally, we will turn to structural induction (Section 5.4), a form of inductive proof that
operates directly on recursively defined structures like linked lists, binary trees, or well-formed formulas
of propositional logic.

level 0 level 1 level 2 level 3 level 4 level 5

Figure 5.1 The Von Koch Snowflake fractal, shown at several different levels. A level-ℓ snowflake consists of three
level-ℓ lines. A level-0 line is ; a level-ℓ line consists of four level-(ℓ − 1) lines arranged in the shape .

ISTUDY
5.2 Proofs by Mathematical Induction 217

5.2 Proofs by Mathematical Induction


So if you find nothing in the corridors open the doors, if you find nothing behind these
doors there are more floors, and if you find nothing up there, don’t worry, just leap up
another flight of stairs. As long as you don’t stop climbing, the stairs won’t end, under
your climbing feet they will go on growing upwards.
Franz Kafka (1883–1924)
Fürsprecher [Advocates] (c. 1922)

The principle of mathematical induction says the following: to prove that a statement P(n) is true for
all nonnegative integers n, we can prove that P “starts being true” (the base case) and that P “never stops
being true” (the inductive case). We’ll start with the formal definition of a proof based on this principle,
and then consider a wide variety of examples.

5.2.1 An Overview of Proofs by Mathematical Induction


Formally, a proof by mathematical induction proceeds as follows:

Definition 5.1: Proof by mathematical induction.


Suppose that we want to prove that P(n) holds for all n ∈ Z≥0 . To give a proof by mathematical induction
of ∀n ∈ Z≥0 : P(n), we prove two things:
1 the base case: prove P(0).
2 the inductive case: for every n ≥ 1, prove P(n − 1) ⇒ P(n).

When we’ve proven both the base case and the inductive case as in Definition 5.1, we have established that
P(n) holds for all n ∈ Z≥0 . Here’s an example to illustrate how the base case and inductive case combine
to establish this fact:

Example 5.1: Proving P(4) from a base case and inductive case.
Suppose we’ve proven both the base case (P(0)) and the inductive case (P(n − 1) ⇒ P(n), for any n ≥ 1)
as in Definition 5.1. Why do these two facts establish that P(n) holds for all n ∈ Z≥0 ? For example, why
do they establish P(4)?
Solution. Here is a proof of P(4), using the base case once and the inductive case four times. (At each
stage we use Modus Ponens—which, as a reminder, states that from p ⇒ q and p, we can conclude q.)
We know P(0) base case (5.1)
and we know P(0) ⇒ P(1) inductive case, with n = 1 (5.2)
and thus we can conclude P(1). (5.1), (5.2), and Modus Ponens (5.3)

We know P(1) ⇒ P(2) inductive case, with n = 2 (5.4)


and thus we can conclude P(2). (5.3), (5.4), and Modus Ponens (5.5)

We know P(2) ⇒ P(3) inductive case, with n = 3 (5.6)


and thus we can conclude P(3). (5.5), (5.6), and Modus Ponens (5.7)

We know P(3) ⇒ P(4) inductive case, with n = 4 (5.8)


and thus we can conclude P(4). (5.7), (5.8), and Modus Ponens (5.9)

ISTUDY
218 Mathematical Induction

This sequence of inferences established that P(4) is true. We can use the same technique to prove that
P(n) holds for an arbitrary integer n ≥ 0, using the base case once and the inductive case n times.

The principle of mathematical induction is as simple as in Example 5.1—we apply the base case to get
started, and then repeatedly apply the inductive case to conclude P(n) for any larger n—but there are
several analogies that can help to make proofs by mathematical induction more intuitive; see Figure 5.2.
(Figure 5.3 shows a more formal checklist of the components to include a proof by induction.)

Taking it further: “Mathematical induction” is somewhat unfortunately named because its name collides with a distinction
made by philosophers between two types of reasoning. Deductive reasoning is the use of logic (particularly rules of inference) to
reach conclusions—what computer scientists would call a proof. A proof by mathematical induction is an example of deductive
reasoning. For a philosopher, though, inductive reasoning is the type of reasoning that draws conclusions from empirical obser-
vations. If you’ve seen a few hundred ravens in your life, and every one that you’ve seen is black, then you might conclude All
ravens are black. Of course, it might turn out that your conclusion is false, because you haven’t happened upon any of the albino
ravens that exist in the world; what philosophers call inductive reasoning leads to conclusions that may turn out to be false.

A first example: summing powers of two


Let’s use mathematical induction to prove an arithmetic property:

Theorem 5.2: A formula for the sum of powers of two.


P
n
For any nonnegative integer n, we have that 2i = 2n+1 − 1.
i=0

As a plausibility check, let’s test the given formula for some small values of n:
n=1: 20 + 21 = 1 + 2 = 3 22 − 1 = 3
n=2: 20 + 21 + 22 = 1 + 2 + 4 = 7 23 − 1 = 7
n=3: 20 + 21 + 22 + 23 = 1 + 2 + 4 + 8 = 15 24 − 1 = 15
These small examples all check out, so it’s reasonable to try to prove the claim. Let’s do it, with our first
example of a proof by induction:

Problem-solving tip: A plausibility check (testing out a claim for small values of n) can be very helpful before attempting a
proof. Often small examples can either reveal a misunderstanding of the claim or help you see why the claim is true in general.

Example 5.2: A proof of Theorem 5.2.


Let P(n) denote the property
P
n
2i = 2n+1 − 1.
i=0
≥0
We’ll prove that ∀n ∈ Z : P(n) by induction on n.
P0
Base case (n = 0). We must prove P(0). That is, we must prove i=0 2i = 20+1 − 1. But this fact is easy
P0
to prove, because both sides are equal to 1: i=0 2i = 20 = 1, and 20+1 − 1 = 2 − 1 = 1.

Inductive case (n ≥ 1). We must prove that P(n−1) ⇒ P(n), for an arbitrary integer n ≥ 1. We prove this
implication by assuming the antecedent—namely, we assume P(n − 1) and prove P(n). The assumption

ISTUDY
5.2 Proofs by Mathematical Induction 219

Dominoes falling: We have an infinitely long line of dominoes, numbered 0, 1, 2, . . . , n, . . .. To convince


someone that the nth domino falls over, you can convince them that

• the 0th domino falls over, and


• whenever one domino falls over, the next domino falls over too.

(One domino falls, and they keep on falling. Thus, for any n ≥ 0, the nth domino falls.)

Climbing a ladder: We have a ladder with rungs numbered 0, 1, 2, . . . , n, . . .. To convince someone that a ..
.

climber climbing the ladder reaches the nth rung, you can convince them that

• the climber steps onto rung #0, and


• if the climber steps onto one rung, then she also steps onto the next rung.

(The climber starts to climb, and the climber never stops climbing. Thus, for any n ≥ 0, the climber reaches
the nth rung.)

Whispering down the alley: We have an infinitely long line of people, with the people numbered
0, 1, 2, . . . , n, . . .. To argue that everyone in the line learns a secret, we can argue that

• person #0 learns the secret, and


• if person #n learns the secret, then she tells person #(n + 1) the secret.

(The person at the front of the line learns the secret, and everyone who learns it tells the secret to the next
person in line. Thus, for any n ≥ 0, the nth person learns the secret.)

Falling into the depths of despair: Consider the Pit of Infinite Despair, which is filled with nothing but despair
and goes infinitely far down beneath the surface of the earth. (The Pit does not respect physics.) Suppose that:

• the Evil Villain is pushed into the Pit (that is, She is in the Pit zero meters below the surface), and
• if someone is in the Pit at a depth of n meters beneath the surface, then She falls to depth n + 1 meters
beneath the surface.

(The Villain starts to fall, and if the Villain has fallen to a certain depth then She falls another meter further.
Thus, for any n ≥ 0, the Evil Villain eventually reaches depth n in the Pit.)

Figure 5.2 Some analogies to make mathematical induction more intuitive.

1 A clear statement of the claim to be proven—that is, a clear definition of the property P(n) that will be proven true for all
n ≥ 0—and a statement that the proof is by induction, including specifically identifying the variable n upon which induction
is being performed. (Some claims involve multiple variables, and it can be confusing if you aren’t clear about which is the
variable upon which you are performing induction.)

2 A statement and proof of the base case—that is, a proof of P(0).

3 A statement and proof of the inductive case—that is, a proof of P(n − 1) ⇒ P(n), for a generic value of n ≥ 1. The proof of the
inductive case should include all of the following:
a a statement of the inductive hypothesis P(n − 1).
b a statement of the claim P(n) that needs to be proven.
c a proof of P(n), which at some point makes use of the assumed inductive hypothesis.

Figure 5.3 A checklist of the steps required for a proof by mathematical induction.

ISTUDY
220 Mathematical Induction

P(n − 1) is
P
n−1
2i = 2(n−1)+1 − 1. (∗)
i=0

We can now prove P(n)—under the assumption (∗)—by showing that the left-hand and right-hand sides
of P(n) are equal:
" n−1 #
Xn X
i
2 = 2i + 2n by the definition of summations
i=0 i=0
 
= 2 (n−1)+1
− 1 + 2n by (∗), a.k.a. by the assumption that P(n − 1)
=2 −1+2 n n
by algebraic manipulation
=2·2 −1 n

= 2n+1 − 1.
Pn
We’ve thus shown that i=0 2i = 2n+1 − 1—in other words, we’ve proven P(n).

We’ve proven the base case P(0) and the inductive case P(n − 1) ⇒ P(n), so by the principle of
mathematical induction we have shown that P(n) holds for all n ∈ Z≥0 .

Taking it further: In case the inductive proof doesn’t feel 100% natural, here’s another way to make the result from Example 5.2
P
intuitive: think about binary representations of numbers. Written in binary, the number ni=0 2i will look like 11 · · · 111, with
n + 1 ones. What happens when we add 1 to, say, 11111111 (= 255)? It’s a colossal sequence of carrying (as 1 + 1 = 0, carrying
the 1 to the next place), resulting in 100000000 (= 256). In other words, 2n+1 − 1 is written in binary as a sequence of n + 1
P
ones—that is, 2n+1 − 1 = ni=0 2i .

Example 5.2 follows the standard outline of a proof by mathematical induction. We will always prove the
inductive case P(n − 1) ⇒ P(n) by assuming the antecedent P(n − 1) and proving P(n). The assumed
antecedent P(n − 1) in the inductive case of the proof is called the inductive hypothesis. (You may see
“inductive hypothesis” abbreviated as IH.)

Warning! P(n) denotes a proposition—that is, P(n) is either true or false. (We’re proving that, in fact, it’s true for every n.)
Despite its apparent temptation to people new to inductive proofs, it is nonsensical to treat P(n) as a number.

A second example, and a template for proofs by induction


Here’s another proof by induction, with the parts of the proof carefully labeled:

Example 5.3: Summing powers of −1.


(
X
n
1 if n is even
Claim. For any integer n ≥ 0, we have that (−1)i =
i=0 0 if n is odd.
Proof. 1 Clearly state the claim to be proven. Clearly state that the proof will be by induction, and
clearly state the variable upon which induction will be performed.
Let P(n) denote the property
(
X
n
1 if n is even
(−1)i =
i=0 0 if n is odd.

ISTUDY
5.2 Proofs by Mathematical Induction 221

We’ll prove that ∀n ∈ Z≥0 : P(n) by induction on n.


2 State and prove the base case.
P0
Base case (n = 0). We must prove P(0). But i=0 (−1)i = (−1)0 = 1, and 0 is even.
3 State and prove the inductive case. Within the statement and proof of the inductive case . . .

a . . . state the inductive hypothesis.


Inductive case (n ≥ 1). We assume the inductive hypothesis P(n − 1), namely
(
X
n−1
1 if n − 1 is even
(−1)i =
i=0 0 if n − 1 is odd.

b . . . state what we need to prove.


We must prove P(n).
c . . . prove it, making use of the inductive hypothesis and stating where it was used.

" n−1 #
X
n X
i i
(−1) = (−1) + (−1)n definition of summations
i=0 i=0
(
1 + (−1)n if n − 1 is even
= inductive hypothesis
0 + (−1)n if n − 1 is odd.
(
1 + (−1)n if n is odd
= n is odd ⇔ n − 1 is even
n
0 + (−1) if n is even.
(
1 + −1 if n is odd
= (−1)n = ±1, depending on whether n is even; see Exercise 5.3.
0+1 if n is even.
(
0 if n is odd
=
1 if n is even.
Thus we have proven P(n), and the theorem follows.

Writing tip: In the inductive case of a proof of an equality—like Example 5.3—start from the left-hand side of the equality and
manipulate it until you derive the right-hand side of the equality exactly. If you work from both sides simultaneously, you’re at
risk of the fallacy of proving true—or at least the appearance of that fallacy!

We can treat the labeled pieces of Example 5.3 as a checklist for writing proofs by induction. (See
Figure 5.3.) You should ensure that when you write an inductive proof, you include each of these steps.

The sum of the first n integers


We’ll do another example of an inductive proof of an arithmetic property:

Theorem 5.3: Sum of the first n nonnegative integers.


P
n
n(n+1)
For any integer n ≥ 0, we have that i = 0 + 1 + · · · + n is equal to 2 .
i=0

4(4+1)
(For example, for n = 4 we have 0 + 1 + 2 + 3 + 4 = 10 = 2 .)

ISTUDY
222 Mathematical Induction

Example 5.4: Sum of the first n integers: proving Theorem 5.3.


First, we must phrase this problem in terms of a property P(n) that we’ll prove true for every n ≥ 0. For
a particular integer n, let P(n) denote the claim that
P
n
n(n+1)
i= 2 .
i=0

We will prove that P(n) holds for all integers n ≥ 0 by induction on n.


P0
Base case (n = 0). Note that i=1 i = 0 and 0(0+1)2 = 0 too. Thus P(0) follows.
Inductive case (n ≥ 1). Assume the inductive hypothesis P(n − 1), namely
P
n−1
(n−1)((n−1)+1)
i= 2 .
i=0
Pn
n(n+1)
We must prove P(n)—that is, we must prove i= 2 . Here is the proof:
  i=0
Pn P
n−1
i= i +n definition of summations
i=0 i=0
(n − 1)((n − 1) + 1)
= +n inductive hypothesis
2
(n − 1)n + 2n
= putting terms over common denominator
2
n(n − 1 + 2)
= factoring
2
n(n + 1)
= .
2
Thus we’ve shown P(n) assuming P(n − 1), which completes the proof.

Taking it further: While the summation that we analyzed in Theorem 5.3 may seem like a purely arithmetic example, it also
has direct applications in CS—particularly in the analysis of algorithms. Chapter 6 is devoted to this topic, and there’s much
more there, but here’s a brief preview. A basic step in analyzing an algorithm is counting how many steps that algorithm takes,
for an input of arbitrary size. One particular example is Insertion Sort, which sorts an n-element array by repeatedly ensuring
that the first k elements of the array are in sorted order (by swapping the kth element backward until it’s in position). The total
number of swaps that are done in the kth iteration can be as high as k − 1—so the total number of swaps can be as high as
Pn Pn−1
k=1 k − 1 = i=0 i. Thus Theorem 5.3 tells us that Insertion Sort can require as many as n(n − 1)/2 swaps.

Generating a conjecture: segments in a fractal


In the inductive proofs that we’ve seen thus far, we were given a problem statement that described exactly
what property we needed to prove. Solving these problems “just” requires proving the base case and the
inductive case—which may or may not be easy, but at least we know what we’re trying to prove! In other
problems, though, you may also have to first figure out what you’re going to prove, and then prove it.
Obviously this task is generally harder. Here’s one example of such a proof, about the Von Koch snowflake
fractal from Figure 5.1:

Problem-solving tip: Your first task in giving a proof by induction is to identify the property P(n) that you’ll prove true for every
integer n ≥ 0. Sometimes the property is given to you more or less directly and sometimes you’ll have to formulate it yourself,
but in any case you need to identify the precise property you’re going to prove before you can prove it!

ISTUDY
5.2 Proofs by Mathematical Induction 223

level 0 level 1 level 2 level 3 level 4 level 5 level 6

Figure 5.4 Von Koch lines of several levels. (A Von Koch snowflake consists of three Von Koch lines, all of the same
level, arranged in a triangle; see Figure 5.1.)

Example 5.5: Vertices in a Von Koch line.


A Von Koch line of level 0 is a straight line segment; a Von Koch line of level ℓ ≥ 1 consists of four Von
Koch lines of level (ℓ − 1) in the shape . (See Figure 5.4.) Conjecture a formula for the number
of vertices (that is, the number of segment endpoints) in a Von Koch line of level ℓ. Prove the correctness
of your formula by induction.
Solution. Our first task is to formulate a conjecture for the number of vertices in a Von Koch line of
level ℓ. Let’s start with a few small examples, and some tedious counting in the pictures in Figure 5.4:

level 0: 2 endpoints (and 1 segment) level 1: 5 endpoints (and 4 segments) level 2: 17 endpoints (and 16 segments)

There are a few ways to think about this pattern. Here’s one that turns out to be helpful: a level-ℓ line
contains 4 lines of level (ℓ − 1), so it contains 16 lines of level (ℓ − 2). And thus, expanding it all the way
out, the level-ℓ line contains 4ℓ lines of level 0. The number of endpoints that we observe is 2 = 40 + 1,
then 5 = 41 + 1, then 17 = 42 + 1. (Why the “+1?” Each segment starts where the previous segment
ended—so there is one more endpoint than segment, because of the last segment’s second endpoint.)
So it looks like there are 4ℓ + 1 endpoints in a Von Koch line of level ℓ. Let’s turn this observation into
a formal claim, with an inductive proof:

Claim: For any ℓ ≥ 0, a Von Koch line of level ℓ has 4ℓ + 1 endpoints.

Proof. Let P(ℓ) denote the claim that a Von Koch line of level ℓ has 4ℓ + 1 endpoints. We’ll prove that
P(ℓ) holds for all integers ℓ ≥ 0 by induction on ℓ.
Base case (ℓ = 0). We must prove P(0). By definition, a Von Koch line of level 0 is a single line segment,
which has 2 endpoints. Indeed, 40 + 1 = 1 + 1 = 2.
Inductive case (ℓ ≥ 1). We assume the inductive hypothesis, namely P(ℓ − 1), and we must prove P(ℓ).
The key observation is that a Von Koch line of level ℓ consists of four Von Koch lines of level (ℓ−1)—and
the last endpoint of line #1 is identical to the first endpoint of line #2; the last endpoint of #2 is the first
of #3, and the last endpoint of #3 is the first of #4. Therefore there are three endpoints that are shared
among the four lines of level (ℓ − 1). Thus:
the number of endpoints in a Von Koch line of level ℓ
 
= 4 · the number of endpoints in a Von Koch line of level (ℓ − 1) − 3 by the above discussion
h i
= 4 · 4ℓ−1 + 1 − 3 by the inductive hypothesis

= 4ℓ + 4 − 3 multiplying through

= 4 + 1. algebra

Thus P(ℓ) follows, completing the proof.

ISTUDY
224 Mathematical Induction

A note and two variations on the inductive template


The basic idea of induction is not too hard: the reason that P(n) holds is that P(n − 1) held, and the reason
that P(n − 1) held is that P(n − 2) held—and so forth, until eventually the proof finally rests on P(0), the
base case. A proof by induction can sometimes look superficially like it’s circular reasoning—that we’re
assuming precisely the thing that we’re trying to prove. But it’s not! In the inductive case, we’re assuming
P(n − 1) and proving P(n)—we are not assuming P(n) and proving P(n).

Warning! If you do not use the inductive hypothesis P(n − 1) in the proof of P(n), then something is wrong—or, at least, your
proof is not actually a proof by induction!

Taking it further: The superficial appearance of circularity in a proof by induction is equivalent to the superficial appearance
that a recursive function in a program will run forever. (A recursive function f will run forever if calling f on n results in f calling
itself on n again! That’s the same circularity that would happen if we assumed P(n) and proved P(n).) The correspondence
between these aspects of induction and recursion should be no surprise; induction and recursion are essentially the same thing.
In fact, it’s not too hard to write a recursive function that “implements” an inductive proof by outputting a step-by-step argument
establishing P(n) for an arbitrary n, as in Example 5.1.

Our proofs so far have shown ∀n ∈ Z≥0 : P(n) by proving P(0) as a base case. If we instead want to
prove ∀n ∈ Z≥k : P(n) for some integer k, we can prove P(k) as the base case, and then prove the inductive
case P(n − 1) ⇒ P(n) for all n ≥ k + 1.
Another variation in writing inductive proofs relates to the statement of the inductive case. We’ve proven
P(0) and P(n − 1) ⇒ P(n) for arbitrary n ≥ 1. Some writers prefer to prove P(0) and P(n) ⇒ P(n + 1)
for arbitrary n ≥ 0. The difference is merely a reindexing, not a substantive difference: it’s just a matter of
whether one thinks of induction as “the nth domino falls because the (n − 1)st domino fell into it” or as
“the nth domino falls and therefore knocks over the (n + 1)st domino.”

5.2.2 Some Numerical Examples: Geometric, Arithmetic, and Harmonic Series


In the remainder of this section, we’ll give some more examples of proofs by mathematical induction,
following the examples so far, and checklist of Figure 5.3. We’ll start here with some more summation-
based proofs. (The statements we’ll prove in this section are intended to serve both as useful facts to know
about particular numerical sequences and as good practice with induction.) We’ll move on to inductive
proofs of some other types of statements—inductive proofs are ubiquitous throughout computer science,
not just in analyzing summations—in Section 5.2.3.

Geometric series
A geometric sequence of numbers is one in the ratio between consecutive numbers is always the same: for
example, each pair of consecutive numbers in the sequence 1, 2, 4, 8, 16, . . . differs by a factor of two.

Definition 5.4: Geometric sequences and series.


A geometric sequence is a sequence of numbers where each entry is generated by multiplying the previous
entry by a fixed ratio r ∈ R, starting from an initial value a. (Thus the sequence is ⟨a, ar, ar2 , ar3 , . . .⟩.)
Pn
A geometric series or geometric sum is i=0 ari .

Examples of geometric sequences include ⟨2, 4, 8, 16, 32, . . .⟩; or ⟨1, 13 , 19 , 27


1
, . . .⟩; or ⟨1, 1, 1, 1, 1, . . .⟩.
There turns out to be a relatively simple formula for the sum of the first n terms of a geometric sequence:

ISTUDY
5.2 Proofs by Mathematical Induction 225

Theorem 5.5: Analysis of geometric series.


P
n
rn+1 −1
P
n
Let r ∈ R where r ̸= 1, and let n ∈ Z≥0 . Then ri = r−1 . (If r = 1, then ri = n + 1.)
i=0 i=0

(For simplicity, we stated Theorem 5.5 without reference to the initial value a. Because we can pull a
Pn
constant multiplicative factor out of a summation, the theorem allows us to conclude that i=0 ari =
Pn i
a · i=0 r = a · r r−1−1 .) We will be able to prove Theorem 5.5 using a proof by mathematical induction:
n+1

Example 5.6: Geometric series.


Proof of Theorem 5.5. Consider a fixed real number r with r ̸= 1, and let P(n) denote the property that
P
n
rn+1 −1
ri = r−1 .
i=0

We’ll prove that P(n) holds for all integers n ≥ 0 by induction on n.


P0
Base case (n = 0). Note that i=0 ri = r0 and r r−1−1 both equal 1. Thus P(0) holds.
0+1

Pn−1 rn −1
Inductive case (n ≥ 1). We assume the inductive hypothesis P(n−1), which tells us that i=0 ri = r−1 .
We must prove P(n). Here is a proof:
P
n P
n−1
ri = rn + ri definition of summation
i=0 i=0
n
−1
= r + rr−1
n
inductive hypothesis
r (r−1)+rn −1
n
= r−1 putting the fractions over a common denominator
r n+1
−r +r −1
n n
= r−1 multiplying out
r −1
n+1
= r−1 . simplifying

Thus P(n) holds, and the theorem follows.

Problem-solving tip: The inductive cases of many inductive proofs follow the same pattern: first, we use some kind of structural
definition to “pull apart” the statement about n into something kind of statement about n − 1 (plus some “leftover” other stuff),
then apply the inductive hypothesis to simplify the n − 1 part. We then manipulate the result of using the inductive hypothesis
plus the leftovers to get the desired equation.

Notice that Examples 5.2 and 5.3 were both special cases of Theorem 5.5. For the former, Theorem 5.5
Pn
tells us that i=0 2i = 2 2−1−1 = 2n+1 − 1; for the latter, this theorem tells us that
n+1

( 1−(−1)
Xn
(−1)n+1 − 1 1 − (−1)n+1 = 1 if n is even
i
(−1) = = = 1−12
−1 − 1 2 =0 if n is odd.
i=0 2

A corollary of Theorem 5.5 addressing infinite geometric sums will turn out to be useful later:
Pn
Corollary 5.6. Let r ∈ R where 0 ≤ r < 1, and define f(n) = i=0 ri . Then:
P∞ i 1
1 i=0 r = 1−r , and
2 for all n ≥ 0, we have 1 ≤ f(n) ≤ 1−r
1
.

ISTUDY
226 Mathematical Induction

Taking it further: The proof of Corollary 5.6 relies on calculus, so we won’t dwell on it here. But, if you’re interested, here’s the
n+1
idea. We’ll start with the proof of (1), which is the part that requires calculus. Theorem 5.5 says that f(n) = r r−− 1
1 , and we take
the limit as n → ∞. Because r < 1, we have that limn→∞ r n + 1 = 0. Thus as n → ∞ the numerator r n + 1 − 1 tends to −1, and
P P
the entire ratio tends to 1/(1 − r). To prove (2), observe that ni=0 ri is definitely greater than or equal to 0i=0 ri (because r ≥ 0
P
and so the latter results by eliminating n nonnegative terms from the former). Similarly, ni=0 ri is definitely less than or equal
P∞ i Pn P0 P P
to i=0 r . Thus we have f(n) = i=0 r ≥ i=0 r = r = 1 and, similarly, we have f(n) = ni=0 ri ≤ ∞
i i 0 i 1
i=0 r = 1−r .

Arithmetic series
An arithmetic sequence of numbers in which the difference between consecutive numbers is always the
same: for example, consecutive numbers in the sequence 2, 4, 6, 8, 10, . . .) always differ by two. (The
definition for a geometric sequence was about ratios instead of differences.)

Definition 5.7: Arithmetic sequences and series.


An arithmetic sequence is a sequence of numbers where each number is generated by adding a fixed step-
size r ∈ R to the previous number in the sequence. The first entry in the sequence is some initial value
Pn
a ∈ R. (Thus the sequence is ⟨a, a + d, a + 2d, a + 3d, . . .⟩.) An arithmetic series or sum is i=0 (a + id).

Examples include ⟨2, 4, 6, 8, 10, . . .⟩; or ⟨1, 13 , − 13 , −1, − 53 , . . .⟩; or ⟨1, 1, 1, 1, 1, . . .⟩. You’ll prove a
general formula for an arithmetic sum in the exercises.

Harmonic series
The harmonic sequence is defined in terms of the reciprocals of the positive integers (1, 12 , 13 , 14 , 15 , . . .):

Definition 5.8: Harmonic series.


A harmonic series is the sum of a sequence of numbers whose kth number is 1k . The nth harmonic number
Pn
is defined by Hn = k=1 1k .

Thus H1 = 1 and H2 = 1 + 1
2 = 1.5 and H3 = 1 + 1
2 + 1
3 ≈ 1.8333 and H4 = 1 + 1
2 + 1
3 + 1
4 ≈ 2.0833.

The name “harmonic” comes from music: when a note at frequency f is played, overtones of that note—other high-intensity
frequencies—can be heard at frequencies 2f, 3f, 4f, . . .. The wavelengths of the corresponding sound waves are 1f , 2f1 , 3f1 , 4f1 , . . ..
P R
(If you’ve had calculus: you can approximate the value of Hn using an integral, writing Hn = nx=1 1x ≈ xn=1 1x dx = ln n. But
we’ll do a calculus-free analysis here.)

Giving a precise equation for the value of Hn requires more work than for geometric or arithmetic series,
but we can prove upper and lower bounds on Hn by induction a bit more easily. We will be able to show
the following, which captures the value of Hn to within a factor of 2:

Theorem 5.9: Bounds on the (2k )th harmonic number.


For any integer k ≥ 0, we have k + 1 ≥ H2k ≥ k
2 + 1.

(See Figure 5.5.) We’ll prove half of Theorem 5.9 (namely k + 1 ≥ H2k ) by induction in Example 5.7,
leaving the other half to the exercises. (Theorem 5.9 only talks about the case in which n is a power of 2;
we’ll also leave to the exercises upper and lower bounds for Hn when n is not an exact power of 2.)

ISTUDY
5.2 Proofs by Mathematical Induction 227

10 upper bound: H2k ≤ k + 1

Hn = 1 + 21 + 13 + · · · + 1n
5 lower bound: H2k ≥ 2k + 1

100 200 300 400 500 600

Figure 5.5 A visualization of Theorem 5.9, showing that k + 1 ≥ H(2k ) ≥ k


2 + 1.

Example 5.7: Inductive proof that k + 1 ≥ H2k .


Proof. Let P(k) denote the property that k + 1 ≥ H2k . We’ll use induction on k to prove that P(k) holds
for all integers k ≥ 0.
Base case (k = 0). We have that H2k = H20 = H1 = 1, and k + 1 = 0 + 1 = 1 as well. Therefore
H2k = 1 = k + 1—that is, P(0) holds.
Inductive case (k ≥ 1). Let k ≥ 1 be an integer. We must prove P(k)—that is, we must prove k + 1 ≥ H2k .
To do so, we assume the inductive hypothesis P(k − 1), namely that k ≥ H2k−1 . Consider H2k :
P2k
H2k = i=1 1i definition of the harmonic numbers
hP k−1 i hP k i
2 1 2 1
= i=1 i + i=2k−1 +1 i splitting the summation into two parts: i ≤ 2k−1 and i > 2k−1
hP k i
2 1
= H2k−1 + i=2k−1 +1 i definition of the harmonic numbers, again
hP k i ∑k
2
≤ H2k−1 + 1
i=2k−1 +1 2k−1 every term in the summation 2i=2k−1 +1 1i is smaller than 2k−1
1

≤ H2k−1 + 2k−1 · 1
2k−1
there are 2k−1 terms in the summation
= H2k−1 + 1 1
x · x = 1 for any x =
̸ 0
≤ k + 1. inductive hypothesis

Thus we’ve proven that H2k ≤ k + 1—that is, we’ve proven P(k). The theorem follows.

The proof in Example 5.7 is perhaps the first time that we needed some serious insight and creativity to
establish the inductive case. (I don’t think that the “split the summation” step was in any way an obvious
thing to do, nor was the step of bounding 1i ≤ 2k−1
1
for the second-half summation.) The structure of a proof
by induction is rigid—we must prove a base case P(0); we must prove an inductive case P(n−1) ⇒ P(n)—
but that doesn’t make the entire proof totally formulaic. (The proof of the inductive case must use the
inductive hypothesis at some point, so its statement gives you a little guidance for the kinds of manipulations
to try.) Just as with all the other proof techniques that we explored in Chapter 4, a proof by induction can
require you to think—and all of strategies that we discussed in Chapter 4 may be helpful to deploy.
Taking it further: Induction as a proof technique is the analogue to recursion as a programming technique, in the sense that
both of them “call” themselves on smaller inputs. But they are also related in the sense that we just discussed, regarding the proof
in Example 5.7: there can be some surprising, creative ways of using both to solve seemingly difficult problems. Here’s another
example: when you want to communicate with someone secretly, the computationally inspired thing to do is to use cryptography
to encrypt your messages before you send them. (Perhaps you’d use RSA; see Section 7.5.) But if you want to keep secret not
just the contents of your messages, but also the very fact that you’re communicating with the other person, then there is a clever
recursive use of encryption that can be deployed. See p. 232 for a summary of onion routing (and the Tor system), along with a
brief discussion of steganography, which has a related motivation.

ISTUDY
228 Mathematical Induction

5.2.3 Some More Examples


We’ll close this section with a few more examples of proofs by mathematical induction, but we’ll focus on
things other than analyzing summations. Some of these examples are still about arithmetic properties, but
they should at least hint at the breadth of possible statements that we might be able to prove by induction.

Comparing algorithms: which is faster?


Suppose that we have two different candidate algorithms that solve a problem related to a set S with n
elements—a brute-force algorithm that tries all 2n possible subsets of S, and a second algorithm that com-
putes the solution by looking at only n2 subsets of S. Which would be faster to use? The latter algorithm is
(massively!) faster, and we can prove that it’s faster (with a small caveat for small n) by induction:

Example 5.8: 2n vs. n2 .


We’d like to prove that 2n ≥ n2 for all integers n ≥ 0—but it turns out not to be true! (See Figure 5.6.)
Indeed, 23 < 32 . But the relationship appears to hold starting at n = 4. Let’s prove it, by induction:

Claim: For all integers n ≥ 4, we have 2n ≥ n2 .

Proof. Let P(n) denote the property 2n ≥ n2 . We’ll use induction on n to prove that P(n) holds for all
n ≥ 4. (Thus our base case will be for n = 4, and our inductive case for n ≥ 5.)
Base case (n = 4). For n = 4, we have 2n = 16 = n2 , so the inequality P(4) holds.
Inductive case (n ≥ 5). Assume the inductive hypothesis P(n − 1)—that is, assume 2n−1 ≥ (n − 1)2 . We
must prove P(n). For n ≥ 4, note that n2 ≥ 4n (by multiplying both sides of the inequality n ≥ 4 by n).
Thus n2 − 4n ≥ 0, and so
2n = 2 · (2n−1 ) definition of exponentiation
≥ 2 · (n − 1) 2
inductive hypothesis
= 2n − 4n + 2
2
multiplying out
= n + (n − 4n) + 2
2 2
rearranging
≥ n +0+2
2
by the above discussion, we have n2 − 4n ≥ 0
> n2 .
Thus we have shown 2n > n2 , which completes the proof of the inductive case. The claim follows.

n 2n n2
0 1 0 60
1 2 1 2n
2 4 4 n2
40
3 8 9
4 16 16 2n = n2 = 16
for n = 4
5 32 25 20 2n = n2 = 4
6 64 36 for n = 2
7 128 49
1 2 3 4 5 6 7 8

Figure 5.6 Small values of 2n and n2 , and a plot of the functions.

ISTUDY
5.2 Proofs by Mathematical Induction 229

Taking it further: In analyzing the efficiency of algorithms, we will frequently have to do the type of comparison in Example 5.8,
to compare the amount of time consumed by one algorithm versus another. Chapter 6 discusses this type of comparison in much
greater detail, but here’s one example of this sort. Let X be a sequence. A subsequence of X results from selecting some of the
entries in X—for example, TURING is a subsequence of OUTSOURCING. For two sequences X and Y, a common subsequence is a
subsequence of both X and Y. The longest common subsequence of X and Y is, naturally, the common subsequence of X and Y
that’s longest. (For example, TURING is the longest common subsequence of DISTURBINGLY and OUTSOURCING.)
Given two sequences X and Y of length n, we can find the longest common subsequence fairly easily by testing every possible
subsequence of X to see whether it’s also a subsequence of Y. This brute-force solution tests 2n subsequences of X. But there’s a
cleverer approach to solving this problem using an algorithmic design technique called dynamic programming (see p. 496 or a
textbook on algorithms) that avoids redoing the same computation—here, testing the same sequence of letters to see if it appears
in Y—more than once. The dynamic programming algorithm for longest common subsequence requires only about n2 steps.

Proving algorithms correct: factorial


We just gave an example of using a proof by induction to analyze the efficiency of an algorithm, but we
can also use mathematical induction to prove the correctness of a recursive algorithm. (That is, we’d like
to show that a recursive algorithm always returns the desired output.) Here’s an example, for the natural
recursive algorithm to compute factorials (see Figure 5.7):
Taking it further: While induction is much more closely related to recursive algorithms than nonrecursive algorithms, we can
also prove the correctness of an iterative algorithm using induction. The basic idea is to consider a statement, called a loop
invariant, about the correct behavior of a loop; we can prove inductively that a loop invariant starts out true and stays true
throughout the execution of the algorithm. See p. 231.

Example 5.9: Factorial.


Consider the recursive algorithm fact in Figure 5.7. For a positive integer n, let P(n) denote the property
that fact(n) = n!. We’ll prove by induction on n that, indeed, P(n) holds for all integers n ≥ 1.
Base case (n = 1). P(1) is true because fact(1) returns 1 immediately, and 1! = 1 by definition.
Inductive case (n ≥ 2). We assume the inductive hypothesis P(n − 1), namely that fact(n − 1) returns
(n − 1)!. We want to prove that fact(n) returns n!. But this claim follows directly:
fact(n) = n · fact(n − 1) by inspection of the algorithm
= n · (n − 1)! by the inductive hypothesis
= n! by definition of !

and therefore the claim holds by induction.

In fact, induction and recursion are basically the same thing: recursion “works” by leveraging a solution
to a smaller instance of a problem to solve a larger instance of the same problem; a proof by induction
“works” by leveraging a proof of a smaller instance of a claim to prove a larger instance of the same claim.
(Actually, one common use of induction is to analyze the efficiency of a recursive algorithm. We’ll discuss
this type of analysis in great depth in Section 6.4.)

fact(n):
1 if n = 1 then
2 return 1
3 else
4 return n · fact(n − 1)

Figure 5.7 Pseudocode for factorial: given n ∈ Z≥1 , we wish to compute the value of n!.

ISTUDY
230 Mathematical Induction

Divisibility
We’ll close this section with one more numerical example, about divisibility:

Writing tip: Example 5.10 illustrates why it is crucial to state clearly the variable upon which induction is being performed. This
statement involves two variables, k and n, but we’re performing induction on only one of them!

Example 5.10: kn − 1 is evenly divisible by k − 1.

Claim: For any n ≥ 0 and k ≥ 2, we have that kn − 1 is evenly divisible by k − 1.

(For example, 7n − 1 is always divisible by 6, as in 7 − 1, 49 − 1, and 343 − 1. And k2 − 1 is always


divisible by k − 1; in fact, factoring k2 − 1 yields k2 − 1 = (k − 1)(k + 1).)

Proof. We’ll proceed by induction on n. That is, let P(n) denote the claim
For all integers k ≥ 2, we have that kn − 1 is evenly divisible by k − 1.
We will prove that P(n) holds for all integers n ≥ 0 by induction on n.
Base case (n = 0). For any k, we have kn − 1 = k0 − 1 = 1 − 1 = 0. And 0 is evenly divisible by any
positive integer, including k − 1. Thus P(0) holds.
Inductive case (n ≥ 1). We assume the inductive hypothesis P(n − 1), and we need to prove P(n). Let
k ≥ 2 be an arbitrary integer. Then:
kn − 1 = kn − k + k − 1 antisimplification: x = x + k − k.
= k · (kn−1 − 1) + k − 1 factoring

By the inductive hypothesis, kn−1 − 1 is evenly divisible by k − 1. In other words, by the definition of
divisibility, there exists a nonnegative integer a such that a · (k − 1) = kn−1 − 1. Therefore

kn − 1 = k · a · (k − 1) + k − 1
= (k − 1) · (k · a + 1).
Because k · a + 1 is a nonnegative integer, (k − 1) · (k · a + 1) is by definition evenly divisible by k − 1.
Thus kn − 1 = (k − 1) · (k · a + 1) is evenly divisible by k − 1. Our k was arbitrary, so P(n) follows.

Problem-solving tip: In inductive proofs, try to massage the expression in question into something—anything!—that matches
the form of the inductive hypothesis. Here, the “antisimplification” step is obviously true but seems completely bizarre. Why did
we do it? Our only hope in the inductive case is to somehow make use of the inductive hypothesis. Here, the inductive hypothesis
tells us something about kn−1 − 1—so a good strategy is to transform kn − 1 into an expression involving kn−1 − 1, plus some
leftover stuff.

ISTUDY
5.2 Proofs by Mathematical Induction 231

Computer Science Connections

Loop Invariants
In Example 5.9, we saw how to use a proof by induction to establish that a recursive algorithm correctly solves a
particular problem. But proving the correctness of iterative algorithms seems different. An approach—pioneered in
the 1960s by Robert Floyd and C. A. R. Hoare [48, 55]—is based on loop invariants, and can be used to analyze
nonrecursive algorithms. A loop invariant for a loop L is a logical property P such that (i) P is true before L is first
executed; and (ii) if P is true at the beginning of an iteration of L, then P is true after that iteration of L. The parallels
to induction are pretty direct; property (i) is the base case, and property (ii) is the inductive case. Together, they ensure
that P is always true, and in particular P is true when the loop terminates.
Let’s look at Insertion Sort as
an example of using loop invari- insertionSort(A[1 . . . n]): Property P(k):
ants. (See Figure 5.8.) The basic 1 i := 2 A[1 . . . k + 1] is sorted.
idea will be to define a property that 2 while i ≤ n: (No restrictions on
A[k + 2 . . . n])
describes the partial (and growing) 3 j := i
state of correctness of the algorithm 4 while j > 1 and A[j] > A[j − 1]:
as it executes. Specifically, let’s 5 swap A[j] and A[j − 1] 4 5 7 10 11 14 8 15 1 9
define P(k) to denote the condition 6 j := j − 1 1 2 ··· k+1 ··· n
that the subarray A[1 . . . k + 1] is 7 i := i + 1
sorted after completing k iterations Figure 5.8 Insertion Sort, and the loop invariant (if we’ve completed k
of the outer while loop. We claim iterations of the outer loop, then P(k) holds—here illustrated for k = 5).
that P(k) is true for all k ≥ 0—in
other words, that P is a loop invariant for the outer while loop. Let’s prove it:
Proof (sketch). For the base case (k = 0), we’ve completed zero iterations—that is, we have only executed Line 1.
But A[1 . . . k + 1] is then vacuously sorted, because it contains only the lone element A[1].
For the inductive case (k ≥ 1), assume the inductive hypothesis P(k − 1)—that is, that A[1 . . . k] was sorted before
the kth iteration. The kth iteration of the loop executed Lines 2–7, so we must persuade ourselves that the execution
of these lines extends the sorted segment A[1 . . . k] to A[1 . . . k + 1]. And it does! (A formal proof of this claim would
use another loop invariant to describe the behavior of the inner loop—for example, letting Q(j) denote the property
both A[1 . . . j−1] and A[j . . . i] are sorted, and A[j−1] < A[j+1]. But we’ll leave out those details in this sketch.)
Because the above argument establishes that P(n − 1) is true, we know that A[1 . . . (n − 1) + 1] (in other words, all
of A) is sorted after n − 1 iterations of the loop, exactly as desired.
Loop invariants are valuable as
part of the development of pro-
binarySearch(A[1 . . . n], x): After zero iterations of the while loop:
grams, too. For example, many If 12 is in the array A, then 12 is contained
Output: Is x in the sorted array A?
people end up struggling to cor- in the range A[lo, . . . , hi].
1 lo := 1
rectly write binary search (it’s all lo hi
2 hi := n
too easy to have an infinite loop 1 4 5 7 8 9 10 11 14 15
3 while lo ≤ hi:
or an early-stopping bug)—but by ··· n
4 middle := ⌊ lo+ hi
2 ⌋
1 2
writing down a loop invariant and
5 if A[middle] = x then
thinking about it as we write the After one iteration of the while loop:
6 return True
code, it becomes a lot easier to If 12 is in the array A, then 12 is contained
7 else if A[middle] > x then in the range A[lo, . . . , hi].
get right. (See Figure 5.9.) Many
8 hi := middle − 1 lo hi
programming languages allow pro-
9 else
grammers to use assertions to state 1 4 5 7 8 9 10 11 14 15
10 lo := middle + 1
logical conditions that they believe 11 return False 1 2 ··· n
to always be true at a particu-
lar point in the code. A simple Figure 5.9 Binary Search, and a useful loop invariant (“if x is in A, then x is one
assert(P) statement can help a of A[lo, . . . , hi]”) illustrated during the first two iterations of its execution.
programmer identify bugs earlier in
the development process and avoid a great deal of debugging trauma later.

ISTUDY
232 Mathematical Induction

Computer Science Connections

Onion Routing, Steganography, and Conversations whose Very Existence is Sensitive


The typical story introducing cryptography goes like this: two people (Alice and Bob) want to communicate securely,
but the only channel over which they are able to share information is insecure; an eavesdropper (Eve) can listen to
everything that they say. Some remarkable cryptographic protocols (like RSA; see Section 7.5 for much, much more)
have been developed in this context: Bob can publicly post some information about how to communicate with him,
and Alice can use that public information to encrypt a message to Bob so that he can easily decode it, but Eve can’t.
Cryptographic systems like RSA aim for confidentiality of communication: Eve remains fully ignorant of the
contents of whatever message Alice sent to Bob. But what is not secret is the fact that Alice sent a secret message to
Bob. That can be a problem: if you work at the Kremlin, the very fact that you just sent a message to Ottawa may be
damning, regardless of contents.
If you’re willing to reveal that you’re communicat- The contents of this box are e

ncrypted
ing (but you want to keep secret with whom you are The contents of this box are enc

rypted with
communicating) there’s a relevant—and widely used—

A.
The contents of this box are encry

pted wit
Node B.
tool called Tor (“the onion router”). Tor is a commu- The contents of this box are en

cryp
ob
de C.
nication system that combines cryptographic protocols Dear Bob, Hi there! --Alice.

Node
ted with a key shared by Alice/B
in a recursive way: the basic idea of onion routing is

with
h a key shared by Alice and Tor No
to add multiple “layers” of encryption to a message, a key shared by Alice and Tor
which is transmitted across multiple hops—and a layer
r
a key shared by Alice and To
of encryption is peeled away by each subsequent recip-
ient, like the layers of an onion. (And so an argument Figure 5.10 A message for Bob, in Tor.
for the correctness of this system relies on induction.)
If Alice wants to send a message m to Bob, she first identifies a sequence of waypoints—say, Tor nodes A, B, and
C—and agrees on a encryption key with each of these intermediaries (using the Diffie–Hellman protocol; see p. 379).
She computes a quadruply encrypted message eA (eB (eC (eBob (m)))), where the subscript indicates with whom she
shares that particular encryption key, and routes it to Bob via A to B to C. See Figure 5.10. (Alice tags each layer
of the encryption with routing information: A knows that B is the next recipient [but A doesn’t know who C or Bob
are], and so forth.) The point of these intermediate layers is that no participant knows the identity of both Alice and
Bob, and therefore, unless many nodes in the system are operated by the same malicious entity, the identity of a Tor
user’s communication partner is not revealed. (As long as A, B, and C are simultaneously serving as intermediaries
for many different Alices and Bobs, it’s not tractable to “connect the dots.”)
Another option: a system for steganography (Greek:
stego hidden/covered + graphy writing), which seeks to (This image
conceal that communication is happening at all, rather is adjusted
photograph “WAVE Demonstrating Bombe” (1944)
Original image from the National Security Agency

than concealing the recipient or the contents. (Natu- five times


more
rally, these ideas can be combined: you could encrypt prominently
a message which you then steganographically trans- than
necessary, to
mit.) One possibility is to embed a hidden message make the
by altering the least-significant bit of each pixel in an message,
image—rounding each pixel value to an odd or even perhaps, just
barely
value, as in (190) vs. (191), to convey a 0 or a 1— visible in a
in a way that’s all but undetectable visually. (You could few places.)
even post your photograph publicly, with the idea that
only the intended recipient would know to look for an Figure 5.11 An image with a hidden message in the
embedded message.) See Figure 5.11. least-significant bits of the pixel values.
(You can learn more about Tor in the original paper
describing it [39]—including about ways to mitigate risks of malicious Tor nodes, and traffic analysis [an adver-
sary might observe the rate at which particular individuals send and receive data, to try to match up who’s talking
with whom]. And, while there are plenty of good reasons for wanting a system that, like Tor, allows individuals to
communicate in an essentially untraceable way, there are some serious legal and ethical questions, too: truly private
communication leads to the possibility of using the system for illegal, unethical, and other awful purposes.)

ISTUDY
Exercises 233

EXERCISES
Prove that the following claims hold for all integers n ≥ 0, by induction on n:
Xn
5.1 i2 = n(n+1)(6
2n+1)

i=0
X
n
n4 +2n3 +n2
5.2 i3 = 4
i=0
(
1 if n is even
5.3 (−1)n =
−1 if n is odd
X
n
1 n
5.4 i(i+1)
= n+ 1
i=1
Xn
2
5.5 i(i+2)
= 32 − n+
1 1
1 − n+2
i=1
Xn
5.6 i · (i!) = (n + 1)! − 1
i=1

In an optical camera, light enters the lens through the opening called the aperture. The aperture is controlled by a collection
of movable blades, which can be adjusted inward to narrow the area through which light can pass. Although some lenses allow
continuous adjustment to their openings, many have a sequence of so-called stops: discrete steps by which the aperture narrows,
called f-stops (the “f” is short for “focal”). (See Figure 5.12.)
The names of f-stops are a little strange, and you’ll unwind them here. The “fastest” f-stop for a lens measures the ratio of two
numbers: the focal length of the lens divided by the diameter of the aperture of the lens. (For example, a lens that’s 50 mm long
and has a 25 mm diameter yields an f-stop of 50 mm
25 mm = 2.) You can also “stop down” a lens from this fastest setting by adjusting
the blades to shrink the aperture, as shown in Figure 5.12. (For example, for the 50 mm-long lens with a 25 mm diameter, you
might reduce the diameter to 12.5 mm, which yields an f-stop of 1250.5mm
mm = 4.)
5.7 Consider a camera lens with a 50 mm focal length, and let d0 = 50 mm denote the diameter of the lens’s aperture diameter.
“Stopping down” the lens by one step causes the lens’s aperture diameter to shrink by a factor of √1 —that is, the next-smaller
2
aperture diameter for a diameter di is defined as di+1 = √di for any i ≥ 0. Give a closed-form expression for dn —that is, a
2
nonrecursive numerical expression whose value is equal to dn . (Your expression can involve real numbers and the variable n.)
Prove your answer correct by induction.
5.8 Using your solution to Exercise 5.7, give a closed-form expression for two further quantities:
• the “light-gathering” area (that is, the area of the aperture) of the lens when its diameter is set to dn .
• the f-stop fn of the lens when its diameter is set to dn .
Using your formula for fn , can you explain the f-stop names from Figure 5.12?

5.9 What is the sum of the first n odd positive integers? Formulate a conjecture by trying a few examples (for example, what are 1 + 3
and 1 + 3 + 5 and 1 + 3 + 5 + 7, for n = 2, n = 3, and n = 4?). Then prove your answer by induction.
5.10 What is the sum of the first n even positive integers? Prove your answer by induction.
5.11 Formulate and prove correct by induction an expression for the sum of the first n negative powers of two—that is, 21 + 14 + · · · + 21n .
(If you write this number in binary, it looks like 0.1111111, with a total of n ones.)

5.12 Consider an arithmetic sequence ⟨x, x + r, x + 2r, . . .⟩, with x ∈ R as the first entry and r ∈ R as the amount by which each entry
increases over the previous one. (See Definition 5.7. For example, ⟨1, 3, 5, . . .⟩ has x = 1 and r = 2; ⟨25, 20, 15, . . .⟩ has x = 25
and r = −5; and ⟨5, 5, 5, . . .⟩ has x = 5 and r = 0.) An arithmetic sum or arithmetic series is the sum of an arithmetic sequence.
For the arithmetic sequence ⟨x, x + r, x + 2r, . . .⟩, formulate and prove correct by induction a formula expressing the sum of the
P P P
first n + 1 terms: that is, for the arithmetic series ni=0 (x + ir). (Hint: note that ni=0 ir = r ni=0 i = rn(n2+1) , by Theorem 5.3.)

aperture
a blade (note: blade shape is a little different in real lenses)
Narrowing the aperture has two effects:
(1) it reduces the amount of light entering the lens, which
darkens the resulting image, and
f/1 f/1.4 f/2 f/2.8 (2) it increases the depth of field (the range of distances from
the lens at which objects are in focus in the image).

Figure 5.12 A particular lens of a camera, at several different f-stops. (Not pictured: f/4, f/5.6, f/8, f/11, ….)

ISTUDY
234 Mathematical Induction

0Z0Z0Z0Z A knight at position ⟨r, c⟩ can move in an L-shaped pattern to any of eight 0Z0Z0Z0Z
Z0Z0Z0Z0 positions: moving over one row and up/down two columns to positions Z0Z0Z0Z0
0Z0Z0Z0Z ⟨r ± 1, c ± 2⟩, or moving two rows over and one column up/down to 0Z0Z0Z0Z
Z0Z0M0Z0 positions ⟨r ± 2, c ± 1⟩. Z0Z0S0Z0
0Z0Z0Z0Z 0Z0Z0Z0Z
Z0Z0Z0Z0 A rook at position ⟨r, c⟩ can move in a straight line either horizontally or Z0Z0Z0Z0
0Z0Z0Z0Z vertically, moving any integral number x of squares. The rook ends up in 0Z0Z0Z0Z
Z0Z0Z0Z0 positions ⟨r ± x, c⟩ for a vertical move, or ⟨r, c ± x⟩ for a horizontal move. Z0Z0Z0Z0

Figure 5.13 A chess board, and the legal moves for a knight (left) or rook (right).

In chess, a walk for a particular piece is a sequence of legal moves for that piece, starting from a square of your choice, that visits
every square of the board. A tour is a walk that visits every square only once. (See Figure 5.13.)
5.13 Prove by induction that there exists a knight’s walk of an n-by-n chessboard for any n ≥ 4. (It turns out that knight’s tours exist for
all even n ≥ 6, but you don’t need to prove this fact.)
5.14 (programming required) In a programming language of your choice, implement your proof from Exercise 5.13 as a recursive
algorithm that computes a knight’s walk in an n-by-n chessboard.
5.15 Prove by induction that there exists a rook’s tour for any n-by-n chessboard for any n ≥ 1.

Figure 5.14 shows three different fractals. Recall that the perimeter of a shape is the total length of all boundaries separating
regions inside the figure from regions outside. (For example, the perimeter includes the boundary of the “hole” in the Sierpiński
carpet. In fact, for the Sierpiński fractals as drawn here, the perimeter is precisely the length of lines separating colored-in from
uncolored-in regions and the enclosed area is precisely the area of the colored-in regions.) Suppose that we draw each of these
fractals at level ℓ and with size 1. For each of the following, conjecture a formula and prove your answer correct by induction:
5.16 perimeter of the Von Koch snowflake
5.17 perimeter of the Sierpiński triangle
5.18 perimeter of the Sierpiński carpet
5.19 enclosed area of the Von Koch snowflake
5.20 enclosed area of the Sierpiński triangle
5.21 enclosed area of the Sierpiński carpet
In the last few exercises, you computed the fractals’ perimeter and area at level ℓ. But what if we continued the fractal-expansion
process forever? What are the area and perimeter of an infinite-level fractal? (Hint: use Corollary 5.6.)
5.22 perimeter of the infinite Von Koch snowflake
5.23 perimeter of the infinite Sierpiński triangle
5.24 perimeter of the infinite Sierpiński carpet
5.25 enclosed area of the infinite Von Koch snowflake
5.26 enclosed area of the infinite Sierpiński triangle
5.27 enclosed area of the infinite Sierpiński carpet

5.28 (programming required) Write a recursive function sierpinski_triangle(level, length, x, y) in a language of your
choice to draw a Sierpiński triangle with the given side length and level with bottom-left coordinate ⟨x, y⟩. (You’ll need to use
some kind of graphics package with line-drawing capability.) Write your function so that—in addition to drawing the fractal—it
also returns both the total length and total area of the triangles that it draws. Use your function to verify some small cases of
Exercises 5.17 and 5.20.
5.29 (programming required) Write a recursive function sierpinski_carpet(level, length, x, y) to draw a Sierpiński carpet.
(See Exercise 5.28.) Extend your function to also return the area of the boxes that it encloses, and use it to verify some small cases
of your answer to Exercise 5.21.

5.30 An n-by-n magic square is an n-by-n grid into which the numbers 1, 2, . . . , n2 are placed, once each. The “magic” is that each row,
column, and diagonal must be filled with numbers that have the same sum. For example, you can build a 3-by-3 magic square with
the top row [4, 9, 2], middle row [3, 5, 7], and bottom row [8, 1, 6]. (Check!) Conjecture and prove a formula for what
the sum of each row/column/diagonal must be in an n-by-n magic square.
P
Recall from Section 5.2.2 the harmonic numbers, where Hn = ni=1 1i is the sum of the reciprocals of the first n positive integers.
Further recall Theorem 5.9, which states that k + 1 ≥ H2k ≥ 2k + 1 for any integer k ≥ 0.
5.31 In Example 5.7, we proved that k + 1 ≥ H2k . Using the same type of reasoning as in the example, complete the proof of
Theorem 5.9: show by induction that H2k ≥ 2k + 1 for any integer k ≥ 0.

ISTUDY
Exercises 235

A Von Koch line of size s and level 0 is just a straight


line segment; a Von Koch line of size s and level ℓ
consists of four Von Koch lines of size 3s and level
ℓ − 1 arranged in the shape ; a Von Koch
snowflake of size s and level ℓ consists of a triangle of
three Von Koch lines of size s and level ℓ.

A Sierpiński triangle of level 0 and size s is an


equilateral triangle of side length s; a Sierpiński
triangle of level ℓ + 1 is three Sierpiński triangles of
level ℓ and size 2s arranged in a triangle.

A Sierpiński carpet of level ℓ and size s is formed from


8 Sierpiński carpets of level ℓ − 1 and size 3s (arranged
in a 3-by-3 grid with a hole in the middle); a Sierpiński
carpet of level 0 and side length s is a filled s-by-s
square.

Figure 5.14 Three fractals—named after the Swedish mathematician Helge von Koch (1870–1924) and the Polish
mathematician Wacław Sierpiński (1882–1969)—drawn at levels 0, 1, . . . , 4.

5.32 Generalize Theorem 5.9 to numbers that aren’t necessarily exact powers of 2. Specifically, prove that

log n−1
log n + 2 ≥ Hn ≥ 2 +1

for any real number n ≥ 1. (Hint: use Theorem 5.9.)

5.33 Prove Bernoulli’s inequality: let x ≥ −1 be an arbitrary real number. Prove by induction on n that

(1 + x)n ≥ 1 + nx

for any positive integer n.

Prove that the following inequalities hold “for sufficiently large n.” That is, identify an integer k and then prove (by induction on
n) that the stated relationship holds for all integers n ≥ k.
5.34 2n ≤ n!
5.35 bn ≤ n!, for an arbitrary integer b ≥ 1
5.36 3n ≤ n2
5.37 n3 ≤ 2n

5.38 Prove that, for any nonnegative integer n, the algorithm odd?(n) returns True if and only if n is odd. (See Figure 5.15.)
P
5.39 For the algorithm sum in Figure 5.15 (the non-alternative version), prove that sum(n, m) returns m i=n i for any m ≥ n. (Hint:
perform induction on the value of m − n.)
5.40 Modify your proof from Exercise 5.39 for the alternative version of sum in Figure 5.15 (which differs only in Line 4).

5.41 Prove by induction on n that 8n − 3n is divisible by 5 for any nonnegative integer n.


5.42 Conjecture a formula for the value of 9n mod 10, and prove it correct by induction on n. (Hint: try computing 9n mod 10 for a few
small values of n to generate your conjecture.)
5.43 As in the previous exercise, conjecture a formula for the value of 2n mod 7, and prove it correct.

odd?(n): sum(n, m): sum(n, m): // ALTERNATIVE


1 if n = 0 then 1 if n = m then 1 if n = m then
2 return False 2 return m 2 return m
3 else 3 else 3 else
4 return not odd?(n − 1) 4 return n + sum(n + 1, m) 4 return m + sum(n, m − 1)

Figure 5.15 Two algorithms (and one variation).

ISTUDY
236 Mathematical Induction

5.44 Suppose that we count, in binary, using an n-bit counter that goes from 0 to 2n − 1. There are 2n different steps along the way: the
initial step of 00 · · · 0, and then 2n − 1 increment steps, each of which causes at least one bit to be flipped. What is the average
number of bit flips that occur per step? (Count the first step as changing all n bits.) For example,

000 → 001 → 010 → 011 → 100 → 101 → 110 → 111

for n = 3, which has a total of 3 + 1 + 2 + 1 + 3 + 1 + 2 + 1 = 14 bit flips. Prove your answer.

5.45 To protect my backyard from my neighbor, a biology professor who is sometimes a little overfriendly, I have acquired a large army
of vicious robotic dogs. Unfortunately the robotic dogs in this batch are very jealous, and they must be separated by fences—in
fact, they can’t even face each other directly through a fence. So I have built a collection of n fences to separate my backyard
into polygonal regions, where each fence completely crosses my yard (that is, it goes from property line to property line, possibly
crossing other fences). I wish to deploy my robotic dogs to satisfy the following property:
For any two polygonal regions that share a boundary (that is, are separated by a fence segment), one of the two regions has
exactly one robotic dog and the other region has zero robotic dogs.
(See Figure 5.16.) Prove by induction on n that this condition is satisfiable for any collection of n fences.

Figure 5.16 A configuration of fences, and a valid way to deploy my dogs (the dots in the shaded regions).

ISTUDY
5.3 Strong Induction 237

5.3 Strong Induction


Step by step, little by little, bit by bit—that is the way to wealth, that is the way to
wisdom, that is the way to glory. Pounds are the sons, not of pounds, but of pence.
Charles Buxton (1823–1871)
Notes of Thought (1873)

In the proofs by induction in Section 5.2, we established the claim ∀n ∈ Z≥0 : P(n) by proving P(0)
[the base case] and proving that P(n − 1) ⇒ P(n) [the inductive case]. But let’s think again about what
happens in an inductive proof, as we build up facts about P(n) for ever-increasing values of n. (Glance at
Example 5.1 again.)

1. We prove P(0).
2. We prove P(0) ⇒ P(1), so we conclude P(1), using Fact #1.

Now we wish to prove P(2). In a proof by induction like those from Section 5.2, we’d proceed as follows:

3. We prove P(1) ⇒ P(2), so we conclude P(2), using Fact #2.

In a proof by strong induction, we allow ourselves to make use of more assumptions: namely, we know that
P(1) and P(0) when we’re trying to prove P(2). (By way of contrast, we’ll refer to proofs like those from
Section 5.2 as using weak induction.) In a proof by strong induction, we proceed as follows instead:

3′ . We prove P(0) ∧ P(1) ⇒ P(2), so we conclude P(2), using Fact #1 and Fact #2.

In a proof by strong induction, in the inductive case we prove P(n) by assuming n different inductive
hypotheses: P(0), P(1), P(2), …, and P(n − 1). Or, less formally: in the inductive case of a proof by weak
induction, we show that if P “was true last time” then it’s still true this time; in the inductive case of a
proof by strong induction, we show that if P “has been true up until now” then it’s still true this time.

5.3.1 A Definition and a First Example


Here is the formal definition of a proof by strong induction:

Definition 5.10: Proof by strong induction.


Suppose that we want to prove that P(n) holds for all n ∈ Z≥0 . To give a proof by strong induction of
∀n ∈ Z≥0 : P(n), we prove the following:
1 the base case: prove P(0).
2 the inductive case: for every n ≥ 1, prove [P(0) ∧ P(1) ∧ · · · ∧ P(n − 1)] ⇒ P(n).

Generally speaking, using strong induction makes sense when the “reason for” P(n) is that P(k) is true for
more than one index k ≤ n − 1, or that P(k) is true for some index k ≤ n − 2. (For weak induction, the
“reason for” P(n) is that P(n − 1) is true.)
Strong induction makes the inductive case easier to prove than weak induction, because the claim that
we need to show—that is, P(n)—is the same, but we get to use more assumptions in strong induction: in

ISTUDY
238 Mathematical Induction

parity(n): // assume that n ≥ 0 is an integer


1 if n ≤ 1 then
2 return n
3 else
4 return parity(n − 2)

Figure 5.17 An algorithm that computes whether n is odd or even.

strong induction, we’ve assumed all of P(0) ∧ P(1) ∧ . . . ∧ P(n − 1); in weak induction, we’ve assumed
only P(n − 1). We can always ignore those extra assumptions, so it’s never harder to prove something by
strong induction than with weak induction.

Writing tip: While anything that can be proven using weak induction can also be proven using strong induction, you should still
use the tool that’s best suited to the job—generally, the one that makes the argument easiest to understand.

(Strong induction is actually equivalent to weak induction in its power; anything that can be proven with
one can also be proven with the other. See Exercises 5.81–5.82.)

A first example: a simple algorithm for parity


In the rest of this section, we’ll give several examples of proofs by strong induction. We’ll start here with a
proof of correctness for a blazingly simple algorithm that computes the parity of a positive integer. (Recall
that the parity of n is the “evenness” or “oddness” of n.) See Figure 5.17 for the parity algorithm.
We’ve already used (weak) induction to prove the correctness of recursive algorithms that, given an input
of size n, call themselves on an input of size n − 1. (That’s how we proved the correctness of the factorial
algorithm fact from Example 5.9.) But for recursive algorithms that call themselves on smaller inputs but
not necessarily of size n − 1, like parity, we can use strong induction to prove their correctness.

Example 5.11: The correctness of parity.

Claim: For any nonnegative integer n ≥ 0, we have that parity(n) = n mod 2.

Proof. Write P(n) to denote the property that parity(n) = n mod 2. We proceed by strong induction
on n to show that P(n) holds for all n ≥ 0:
Base cases (n = 0 and n = 1). By inspection of the algorithm, parity(0) returns 0 in Line 2, and, indeed,
0 mod 2 = 0. Similarly, we have parity(1) = 1, and 1 mod 2 = 1 too. Thus P(0) and P(1) hold.
Inductive case (n ≥ 2). Assume the inductive hypothesis P(0) ∧ P(1) ∧ · · · ∧ P(n − 1)—that is, assume
for any integer 0 ≤ k < n, we have parity(k) = k mod 2.
We must prove P(n)—that is, we must prove parity(n) = n mod 2:
parity(n) = parity(n − 2) by inspection (specifically because n ≥ 2 and by Line 4)
= (n − 2) mod 2 by the inductive hypothesis P(n − 2)
= n mod 2,
where (n − 2) mod 2 = n mod 2 by Definition 2.11. (Note that the inductive hypothesis applies for
k = n − 2 because n ≥ 2 and therefore 0 ≤ n − 2 < n.)

ISTUDY
5.3 Strong Induction 239

There are two things to note about the proof in Example 5.11. First, using strong induction instead of
weak induction made sense because the inductive case relied on P(n − 2) to prove P(n); we did not show
P(n − 1) ⇒ P(n). Second, we needed two base cases: the “reason” that P(1) holds is not that P(−1)
was true. (In fact, P(−1) is false—parity(−1) isn’t equal to 1! Think about why.) The inductive case of
the proof in Example 5.11 does not correctly apply for n = 1, and therefore we had to handle that case
separately, as a second base case.

5.3.2 Some Further Examples of Strong Induction


We’ll continue this section with several more examples of proofs by strong induction. We’ll first turn to a
proof about prime factorization of integers, and then look at one geometric and one algorithmic claim.

Prime factorization
Recall that an integer n ≥ 2 is prime if the only positive integers that evenly divide it are 1 and n itself. It’s
a basic fact about numbers that any positive integer can be uniquely expressed as the product of primes.

Theorem 5.11: Prime Factorization Theorem.


Let n ∈ Z≥1 be any positive integer. Then there exist k ≥ 0 prime numbers p1 , p2 , . . . , pk such that
Qk
n = i=1 pi . Furthermore, up to reordering, the primes p1 , p2 , . . . , pk are unique.

(The Prime Factorization Theorem is also sometimes called the Fundamental Theorem of Arithmetic.)
While proving the uniqueness requires a bit more work (we’ll prove it in Section 7.3.3), we can give a
proof using strong induction to show that a prime factorization exists.

Example 5.12: A prime factorization exists.


Let P(n) denote the first part of Theorem 5.11, namely the claim
Q
k
there exist k ≥ 0 prime numbers p1 , p2 , . . . , pk such that n = pi .
i=1

We will prove that P(n) holds for any integer n ≥ 1, by strong induction on n.
Base case (n = 1). Recall that the product of zero multiplicands is 1. (See Section 2.2.7 for a reminder
of the definition.) Thus we can write n as the product of zero prime numbers. Thus P(1) holds.
Inductive case (n ≥ 2). We assume the inductive hypothesis—namely, we assume that P(n′ ) holds for
any positive integer n′ where 1 ≤ n′ ≤ n − 1. We must prove P(n). There are two cases:

Case I: n is prime. If n is prime, then there’s nothing to do: we can write n as the product of one prime
number, namely n itself. Formally, let k = 1 and p1 = n, and we’re done immediately.
Case II: n is not prime. If n is not prime, then by definition n can be written as the product n = a · b, for
positive integers a and b satisfying 2 ≤ a ≤ n−1 and 2 ≤ b ≤ n−1. (The definition of (non)primality
says that n = a · b for a ∈/ {1, n}; you should be able to convince yourself that neither a nor b can be
smaller than 2 or larger than n − 1.) By the inductive hypotheses P(a) and P(b), we have
a = q1 · q2 · · · · · qℓ and b = r1 · r2 · · · · · rm (∗)

ISTUDY
240 Mathematical Induction

for prime numbers q1 , . . . , qℓ and r1 , . . . , rm . By (∗) and the fact that n = a · b,


n = q1 · q2 · · · · · qℓ · r1 · r2 · · · · · rm .
Because each qi and ri is prime, we have now written n as the product of ℓ + m prime numbers, and
P(n) holds. The theorem follows.

Taking it further: As with any inductive proof, it may be useful to view the proof from Example 5.12 as a recursive algorithm.
Here’s a sketch of the algorithm primeFactor(n) implicitly defined by this proof:

• if n = 1, then return ⟨⟩. (The base case: P(1) is true.)


• if n > 1 and n is prime, then return ⟨n⟩. (The inductive case, part I: if n ≥ 2 is prime, then P(n) is true.)
• otherwise n > 1 and n is not prime. In this case: (The inductive case, part II: if n ≥ 2 is not prime, then P(n) is true.)
– Find factors a and b where 2 ≤ a ≤ n − 1 and 2 ≤ b ≤ n − 1 such that n = a · b. (∗)
– Recursively compute ⟨q1 , ..., qk ⟩ := primeFactor(a).
– Recursively compute ⟨r1 , ..., rm ⟩ := primeFactor(b).
– Return ⟨q1 , ..., qk , r1 , ..., rm ⟩.

(Notice that there’s some magic in the “algorithm,” in the sense that Line (∗) doesn’t tell us how to find the values of a and
b—but we do know that such values exist, by definition.) We can think of the inductive case of an inductive proof as “making a
recursive call” to a proof for a smaller input. For example, primeFactor(2) returns ⟨2⟩ and primeFactor(5) returns ⟨5⟩, because
both 2 and 5 are prime. For another example, the result of primeFactor(10) is ⟨2, 5⟩, because 10 is not prime, but we can write
10 = 2 · 5 and primeFactor(2) returns ⟨2⟩ and primeFactor(5) returns ⟨5⟩. The result of primeFactor(70) could be ⟨7, 2, 5⟩,
because 70 is not prime, but we can write 70 = 7 · 10 and primeFactor(7) returns ⟨7⟩ and primeFactor(10) returns ⟨2, 5⟩.
Or primeFactor(70) could be ⟨7, 5, 2⟩ because 70 = 35 · 2, and primeFactor(35) returns ⟨7, 5⟩ and primeFactor(2) returns
⟨2⟩. (Which ordering of the values is the output depends on the magic of Line (∗). The second part of Theorem 5.11, about the
uniqueness of the prime factorization, says that it is only the ordering of these numbers that depends on the magic; the numbers
themselves must be the same.)

Triangulating a polygon
We’ll now turn to a proof by strong induction about a geometric question, instead of a numerical one. A
convex polygon is, informally, the points “inside” a set of n vertices: imagine stretching a giant rubber band
around n points in the plane; the polygon is defined as the set of all points contained inside the rubber band.
See Figure 5.18a for an example. Here we will show that an arbitrary convex polygon can be decomposed
into a collection of nonoverlapping triangles.

Example 5.13: Decomposing a polygon into triangles.


Prove the following claim. (For an example, and an outline of a possible proof, see Figure 5.18.)

Claim: Any convex polygon P with k ≥ 3 vertices can be decomposed into a set of k − 2 triangles whose
interiors do not overlap.

Solution. Let Q(k) denote the claim that any k-vertex polygon can be decomposed into a set of k − 2
interior-disjoint triangles. We’ll give a proof by strong induction on k that Q(k) holds for all k ≥ 3. (Note
that strong induction isn’t strictly necessary to prove this claim; we could give an alternative proof using
weak induction.)
Base case (k = 3). There’s nothing to do: any 3-vertex polygon P is itself a triangle, so the collection
{P} is a set of k − 2 = 1 triangles whose interiors do not intersect (vacuously, because there is only one
triangle). Thus Q(3) holds.

ISTUDY
5.3 Strong Induction 241

a side v

A
B

u
a vertex the interior
(excluding the
(a) A polygon P. boundary) (b) Two vertices u and v of P, and P (c) The subpolygons A and B divided
divided into A and B (above and into triangles, using the inductive
below the u-to-v line). hypothesis.

Figure 5.18 An example of the recursive decomposition of a polygon into interior-disjoint triangles.

Inductive case (k ≥ 4). We assume the inductive hypothesis, which states: any convex polygon with
3 ≤ ℓ < k vertices can be decomposed into a set of ℓ − 2 interior-disjoint triangles. (That is, we assume
Q(3), Q(4), . . . , Q(k − 1).) We must prove Q(k).
Let P be an arbitrary k-vertex polygon. Let u and v be any two nonadjacent vertices of P. (Because
k ≥ 4, such a pair exists.) Define A as the “above the ⟨u, v⟩ line” piece of P and B as the “below the ⟨u, v⟩
line” piece of P. Notice that P = A ∪ B, both A and B are convex, and the interiors of A and B are disjoint.
Let ℓ be the number of vertices in A. Observe that ℓ ≥ 3 and ℓ < k because u and v are nonadjacent.
Also observe that B contains precisely k − ℓ + 2 vertices. (The “+ 2” is because vertices u and v appear
in both A and B.) Note that both 3 ≤ ℓ ≤ k − 1 and 3 ≤ k − ℓ + 2 ≤ k − 1, so we can apply the inductive
hypothesis to both ℓ and k − ℓ + 2.
Therefore, by the inductive hypothesis Q(ℓ), the polygon A is decomposable into a set S of ℓ−2 interior-
disjoint triangles. Again by the inductive hypothesis Q(k − ℓ + 2), the polygon B is decomposable into
a set T of k − ℓ + 2 − 2 = k − ℓ interior-disjoint triangles. Furthermore because A and B are interior
disjoint, the triangles of S ∪ T all have disjoint interiors. Thus P itself can be decomposed into the union
of these two sets of triangles, yielding a total of ℓ − 2 + k − ℓ = k − 2 interior-disjoint triangles.
We’ve shown both Q(3) and Q(3) ∧ · · · ∧ Q(k − 1) ⇒ Q(k) for any k ≥ 4, which completes the proof
by strong induction.

Taking it further: The style of triangulation from Example 5.13 has important implications in computer graphics, in which
we seek to render representations of complicated real-world scenes using computational techniques. In many computer graphics
applications, complex surfaces are decomposed into small triangular regions, which are then rendered individually. See p. 244.

Proving algorithms correct: Quick Sort


We’ve now seen a proof of correctness by strong induction for a small (4-line) recursive algorithm (for
parity), and proofs of somewhat more complicated non-algorithmic properties. Here we’ll prove the
correctness of a somewhat more complicated algorithm—the recursive sorting algorithm called Quick
Sort—again using strong induction.
The idea of the Quick Sort algorithm is to select a pivot value x from an input array A; we then partition
the elements of A into those less than x (which we then sort recursively), then x itself, and finally the
elements of A greater than x (which we again sort recursively). We also need a base case: an input array
with fewer than two elements is already sorted. (See Figure 5.19a for the algorithm.)
For example, suppose we wish to sort by birthday the 53 winners of the Nobel Prize for Literature from
the second half of the twentieth century. Toni Morrison’s birthday is February 18th. If we choose her as

ISTUDY
242 Mathematical Induction

quickSort(A[1 . . . n]):
1 if n ≤ 1 then
2 return A
3 else
4 choose pivot ∈ {1, . . . , n}, somehow.
5 L := ⟨⟩ 1 choose a pivot:
6 R := ⟨⟩ we chose 3 as the pivot value
7 2 4 3 1 6 5 8 9 (by some algorithm or
7 for i ∈ {1, . . . , n} with i =
̸ pivot: another).
8 if A[i] < A[pivot] then
9 append A[i] to L 2 partition the elements into L,
all elements < 3, and R, all
10 else 2 1 3 7 4 6 5 8 9 elements > 3.
11 append A[i] to R
12 L := quickSort(L)
13 R := quickSort(R) 3 recursively sort L and R.
14 return L + ⟨A[pivot]⟩ + R 1 2 3 4 5 6 7 8 9

(a) The pseudocode. (b) An example of Quick Sort.

Figure 5.19 Quick Sort: pseudocode and an example.

the pivot, then Quick Sort would first divide all 52 other winners into two lists, of those with pre–February
18th birthdays and those with post–February 18th birthdays,
before[1 . . . 4] := ⟨Boris Pasternak (1958) [February 10th], . . . , Gao Xingjian (2000) [January 4th]⟩
after[1 . . . 48] := ⟨Bertrand Russell (1950) [May 18th], . . . , Günter Grass (1999) [October 16th]⟩,
and then recursively sort before and after. Then the final sorted list will be
before in sorted order Toni Morrison after in sorted order
winner[1], . . . , winner[4], winner[5], winner[6], . . . , winner[53]

While the efficiency of Quick Sort depends crucially on how we choose the pivot value (see Chapter 6),
the correctness of the algorithm holds regardless of that choice. For simplicity, we will prove that Quick
Sort correctly sorts its input under the assumption that all the elements of the input array A are distinct.
(The more general case, in which there may be duplicate elements, is conceptually no harder, but is a bit
more tedious.) It is easy to see by inspection of the algorithm that quickSort(A) returns a reordering of the
input array A—that is, Quick Sort neither deletes nor inserts elements. Thus the real work is to prove that
Quick Sort returns a sorted array.
Taking it further: There were 53 winners between 1950—the year in which Bertrand Russell, of Russell’s paradox fame (see
p. 29), won the award for his writings on philosophy—and 2000, when Gao Xingjian won the prize. As it happens, the simplifying
assumption that we’re making about distinct elements doesn’t apply for this set of Nobelists: Samuel Beckett (1969) and Seamus
Heaney (1995) share a birthday, as do Naguib Mahfouz (1988) and Aleksandr Solzhenitsyn (1970). Think about how you would
modify the proof in Example 5.14 to handle duplicates.

Example 5.14: Correctness of Quick Sort.

Claim: For any array A with distinct elements, quickSort(A) returns a sorted array.

Proof. Let P(n) denote the claim that quickSort(A[1 . . . n]) returns a sorted array for any n-element
array A with distinct elements. We’ll prove P(n) for every n ≥ 0, by strong induction on n.

ISTUDY
5.3 Strong Induction 243

Base cases (n = 0 and n = 1). Both P(0) and P(1) are trivial: any array of length 0 or 1 is sorted.
Inductive case (n ≥ 2). We assume the inductive hypothesis P(0), . . . , P(n − 1): for any array B[1 . . . k]
with distinct elements and length k < n, quickSort(B) returns a sorted array. We must prove P(n).
Let A[1 . . . n] be an arbitrary array with distinct elements. Let pivot ∈ {1, . . . , n} be arbitrary. We must
prove that x appears before y in quickSort(A) if and only if x < y. We proceed by cases, based on the
relationship between x, y, and A[pivot].

L pivot R

Case 1 x : x = A[pivot].
The elements appearing after x in quickSort(A) are precisely the elements of R. And R is exactly the
set of elements greater than x. Thus x appears before y if and only if y appears in R, which occurs if
and only if x < y.

Case 2 y : y = A[pivot].
Analogously to Case 1, x appears before y if and only if x appears in L, which occurs if and only if
x < y.

Case 3 x, y : x < A[pivot] and y < A[pivot].


Then both x and y appear in L. Because A[pivot] does not appear in L, we know that L contains at most
n − 1 elements, all of which are distinct because they’re a subset of the distinct elements of A. Thus
the inductive hypothesis P(|L|) says that x appears before y in quickSort(L) if and only if x < y. And
x appears before y in quickSort(A) if and only if x appears before y in quickSort(L).

Case 4 x, y : x > A[pivot] and y > A[pivot].


Then both x and y appear in R. An analogous argument to Case 3 shows that x appears before y if and
only if x < y.

Case 5 x y : x < A[pivot] and y > A[pivot].


It is immediate both that x appears before y (because x is in L and y is in R) and that x < y.

Case 6 y x : x > A[pivot] and y < A[pivot].


It is immediately apparent that x does not appear before y and that x ̸< y.

In all six cases, we have established that x < y if and only if x appears before y in the output array;
furthermore, the cases are exhaustive. The claim follows.

Taking it further: In addition to proofs of correctness for algorithms, like the one for quickSort that we just gave, strong
induction is crucial in analyzing the efficiency of recursive algorithms; we’ll see many examples in Section 6.4. And strong
induction can also be fruitfully applied to understanding (and designing!) data structures—for example, see p. 245 for a discussion
of maximum heaps.

ISTUDY
244 Mathematical Induction

Computer Science Connections

Triangulation, Computer Graphics, and 3D Surfaces


Here is a typical problem in computer graphics: we are given a three-dimensional scene consisting of a collection
of objects of various shapes and sizes, and we must render a two-dimensional image that is a visual display of the
scene. (Computer graphics uses a lot of matrix computation to facilitate the projection of a three-dimensional shape
onto a two-dimensional surface.)

Smithsonian Institution’s National Museum of


A typical approach—to simplify and

Images generated using a 3D model from the


speed the relevant algorithms—approximates
the three-dimensional shapes of the objects
in the scene using triangles instead of arbi-
trary shapes. (See Figure 5.20.) Triangles are
the easiest shape to process computationally:

American History.
the “real” triangle in the scene can be spec-
ified completely by three three-dimensional
points corresponding to the vertices; and the
rendered shape in the image is still a trian- Figure 5.20 A 3D model of an 1844 telegraph key (made by Alfred
gle specified completely by two-dimensional
Vail based on a design by Samuel Morse), and a close-up of the screws
points corresponding to the vertices’ projec-
in the bottom-right corner. The full model contains 150,000 triangles.
tions onto the image. Specialized hardware
called a graphics processing unit (GPU) makes these computations extremely fast on many modern computers.
We can approximate any three-dimensional shape arbitrarily well using a collection of triangles, and we can refine
a triangulation by dividing splitting one triangle into two pieces, and then properly rendering each constituent triangle:
for example, we can take the triangle , and split it into four subtriangles by placing new vertices at the midpoint
of each edge; we can then connect up the resulting 6 points (3 original + 3 new) to produce . When rendering
a scene, we might compute a single color c that best represents the color of a given triangle in the real scene, and
then display a solid c-colored (projected) triangle in the image; thus our refined triangulation can have more color
variation than the single triangle, as in . Note that there are many different ways to subdivide a given
triangle into separate triangles. Which subdivision we pick might depend on the geometry of the scene; for example,
we might try to make the subtriangles roughly similar in size, or maximally different in color.
The larger the number of triangles we use, the better the match between the real three-dimensional shape and
the triangulated approximation. But, of course, the more triangles we use, the more computation must be done (and
the slower the rendering will be). By identifying particularly important triangles—for example, those whose colors
vary particularly widely, or those at a particularly steep angle to their neighbors, or those whose angles to the viewer
are particularly extreme—we can selectively refine “the most important parts” of the triangulation to produce higher
quality images. (See Figure 5.21.) under CC BY-SA 4.0 https://creativecommons.
Images by Tobias Isenberg; printed with permission

org/licenses/by- sa/4.0/.

Figure 5.21 Two strategies for refining a triangulation of a rabbit: one based on dividing each triangle into four smaller
ones, and one that chooses which triangles to refine based on local geometric properties. (The adaptive strategy on the
far right produces a smaller image file, but one that is more accurate in the most interesting parts of the image.) See [62].

ISTUDY
5.3 Strong Induction 245

Computer Science Connections

Max Heaps
When we design data structures to support particular operations, it is often the case that we wish to maintain some
properties in the way that the data are stored. Here’s one example, for an implementation of priority queues, that
we’ll establish using a proof by mathematical induction. A priority queue is a data structure that stores a set of jobs,
each of which has a priority; we wish to be able to insert new jobs (with specified priorities) and identify/extract the
existing job with the highest priority.
A maximum heap is one way of imple-
menting priority queues. A maximum heap (a) x (b) 8
is a binary tree—see Section 5.4 or Chap-
ter 11—in which every node stores a job with
an associated priority. Every node in the tree 5 7
y z
satisfies the (maximum) heap property (see
Figure 5.22a): the priority of node u must 4 1 6 3
y≤x z≤x
be greater than or equal to the priorities of
each of u’s children. (A heap must also satisfy 2
another property, being “nearly complete”— Figure 5.22 Maximum heaps: (a) the heap property, which says that,
intuitively, a heap has no “missing nodes” for any node with value x, the children of that node must have values
except in the bottommost layer; this “nearly
≤ x; and (b) an example.
complete” property is what guarantees that
heaps implement priority queues very effi-
ciently.) An example of a heap is shown in Figure 5.22b. It’s easy to check that the topmost node (the root) of
the maximum heap in Figure 5.22b has the highest priority. Heaps are designed so that the root of the tree contains
the node with the highest priority—that’s a property that ensures that particular operations related to priority queues
can be carried out very quickly using heaps. Let’s prove this fact by induction:
Claim: In any binary tree in which every node satisfies the maximum heap property, the node with the highest priority
is the root of the tree.

Proof. We’ll proceed by strong induction on the number of layers of nodes in the tree. (This proof is an example of
a situation in which it’s not immediately clear upon what quantity to perform induction, but once we’ve chosen the
quantity well, the proof itself is fairly easy.) Let P(ℓ) denote the claim In any tree containing ℓ layers of nodes, in
which every node satisfies the maximum heap property, the node with the highest priority is the root of the tree. We
will prove that P(ℓ) holds for all ℓ ≥ 1 by strong induction on ℓ.
Base case (ℓ = 1). The tree has only one level—that is, the root is the only node in the tree. Thus, vacuously, the root
has the highest priority, because there are no other nodes.
Inductive case (ℓ ≥ 2). We assume the induc-
tive hypothesis P(1), . . . , P(ℓ − 1). Let x be x x
the priority of the root of the tree. (See Fig-
a a b
ure 5.23 for the cases.) If the root has only
one child, say with priority a, then by the
inductive hypothesis every element y beneath
a satisfies y ≤ a. (There are at most ℓ − 1
layers in the tree beneath a, so the inductive Figure 5.23 The two possibilities for the inductive case of the proof:
hypothesis applies.) By the heap property, we either x has one child (here, a) or x has two children (here, a and b).
know a ≤ x, and thus every element y satis- (Because the tree has ℓ ≥ 2 levels, the root x has at least one child.)
fies y ≤ x. If the root has a second child, say
with priority b, then by the inductive hypothesis every element z beneath b satisfies z ≤ b. (There are at most ℓ − 1
layers in the tree beneath b, so the inductive hypothesis applies again.) Again, by the heap property, we have b ≤ x,
so every element z satisfies z ≤ x.

ISTUDY
246 Mathematical Induction

EXERCISES
5.46 In Example 5.11, we showed the correctness of the parity function (see Figure 5.24)—that is, for any n ≥ 0, we have that
parity(n) = n mod 2. Prove by strong induction on n that the depth of the recursion (that is, the total number of calls to parity
 
made) is 1 + 2n .
5.47 Consider the toBinary algorithm in Figure 5.25, which finds the binary representation of a given integer n ≥ 0. For example,
toBinary(13) returns ⟨1, 1, 0, 1⟩ (and 13 = 8 + 4 + 0 + 1 is written as 1101 in binary). Prove the correctness of toBinary by
P
strong induction: prove that for any n ≥ 0, we have ki=0 bi 2i = n, where toBinary(n) = ⟨bk , . . . , b0 ⟩.
5.48 Exercise 5.47 establishes that any nonnegative integer can be represented in binary. Now, let’s show that this binary representation
is unique—or, at least, unique if we ignore the presence or absence of leading zeros. (For example, we can represent 7 in binary
as 111 or 0111 or 00111.) To do so, prove that every nonnegative integer n that can be represented as a k-bit string is uniquely
represented as a k-bit bitstring. In other words, prove the following claim, for any integer k ≥ 1:
Let a = ⟨ak−1 , ak−2 , . . . , a0 ⟩ and b = ⟨bk−1 , bk−2 , . . . , b0 ⟩ be two k-bit sequences.
P −1 i Pk−1 i
If ki= 0 ai 2 = i=0 bi 2 , then for all i ∈ {k − 1, k − 2, . . . , 0} we have ai = bi .
Your proof should be by (weak) induction on k.

In Chapter 7, we’ll talk in detail about modular arithmetic, including a more general algorithm for converting from one base to
another (p. 340). There, we’ll do most of the computation iteratively; here you’ll fill in a few pieces recursively.
5.49 Generalize the parity(n) algorithm to remainder(n, k) to recursively compute the number r ∈ {0, 1, . . . , k − 1} such that
remainder(n, k) = n mod k. Assume that k ≥ 1 and n ≥ 0 are both integers, and follow the same algorithmic outline as in
Figure 5.24. Prove your algorithm correct using strong induction on n.
5.50 Generalize the toBinary(n) algorithm to baseConvert(n, k) to recursively convert the integer n to base k. Assume that k ≥ 2
and n ≥ 0 are both integers, and follow the same algorithmic outline as in Figure 5.25. Prove using strong induction on n that if
P
baseConvert(n, k) = ⟨bℓ , bℓ−1 , . . . , b0 ⟩ with each bi ∈ {0, 1, . . . , k − 1}, then n = ℓi=0 ki bi .
5.51 (programming required) Implement your remainder and baseConvert algorithms, in a language of your choice.

You are sitting around the table with a crony you’re in cahoots with. You and the crony decide to play the following silly game. (The
two of you run a store that sells nothing but vicious robotic dogs. The loser of the game has to clean up the yard where the dogs
roam—not a pleasant chore—so the stakes are high.) We start with n ∈ Z≥1 stolen credit cards on a table. The two players take
turns removing cards from the table. In a single turn, a player chooses to remove either one or two cards. A player wins by taking
the last card. For example:

your turn crony’s turn your turn


all gone; you win!
take two take one take two

5.52 Prove (by strong induction on n) that if n is divisible by three, then the second player to move can guarantee a win, and if n is not
divisible by three, then the first player to move can guarantee a win.
5.53 Determine who wins the “Take k” modification of the game from Exercise 5.52: conjecture a condition on n that describes precisely
when the first player can guarantee a win under the stated modification, and prove your answer. Here’s the variant of the game: Let
k ≥ 1 be any integer. As in the original game, the player who takes the last card wins—but each player is now allowed to take any
number of cards between 1 and k in any single move. (The original game was Take-2.) Your answer can depend both on k and the
starting number n of cards.
5.54 Repeat for the “Don’t Go Last” variant: As in the original game, players can take only 1 or 2 cards per turn—but the player who
takes the last card loses (instead of winning by managing to take the last card).

toBinary(n): // assume that n ≥ 0 is an integer


1 if n ≤ 1 then
parity(n): // assume that n ≥ 0 is an integer 2 return ⟨n⟩
1 if n ≤ 1 then 3 else
 
2 return n 4 ⟨bk , . . . , b0 ⟩ := toBinary( 2n )
3 else 5 x := parity(n)
4 return parity(n − 2) 6 return ⟨bk , . . . , b0 , x⟩

Figure 5.24 A reminder of parity (from Figure 5.17). Figure 5.25 An algorithm to convert an integer to binary.

ISTUDY
Exercises 247

5.55 Prove by strong induction on n that, for every integer n ≥ 4, it is possible to make n dollars using only two- and five-dollar bills.
(That is, prove that any integer n ≥ 4 can be written as n = 2a + 5b for some integer a ≥ 0 and some integer b ≥ 0.)
5.56 Consider a sport in which teams can score two types of goals, worth either 3 points or 7 points. For example, Team Vikings might
(theoretically speaking) score 32 points by accumulating, in succession, 3, 7, 3, 7, 3, 3, 3, and 3 points. Find the smallest possible
n0 such that, for any n ≥ n0 , a team can score exactly n points in a game. Prove your answer correct by strong induction.

Define the Fibonacci numbers by the sequence f1 = 1, f2 = 1, and fn = fn−1 + fn−2 for n ≥ 3. Thus the first several Fibonacci
numbers are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . . . (The Fibonacci numbers are named after Leonardo of Pisa, also sometimes known
as Leonardo Bonacci or just as Fibonacci, a thirteenth-century Italian mathematician. We’ll see a lot more about the Fibonacci
numbers in Section 6.4.) Prove each of the following statements by induction—weak or strong, as appropriate—on n:
5.57 fn mod 2 = 0 if and only if n mod 3 = 0. (That is, every third Fibonacci number is even.)
5.58 fn mod 3 = 0 if and only if n mod 4 = 0.
X n
5.59 fi = fn+2 − 1
i=1
X
n
5.60 (fi )2 = fn · fn+1
i=1
5.61 Prove Cassini’s identity: fn−1 · fn+1 − (fn )2 = (−1)n for any n ≥ 2.
5.62 For a k-by-k matrix M, the matrix Mn is also k-by-k, and its value is the result of multiplying M by itself n times: MM · · · M. We can
define matrix exponentiation recursively: M0 = I (the k-by-k identity matrix), and Mn+1 = M · Mn . Using this recursive definition,
prove the following identity concerning the Fibonacci numbers:
" #n−1 " # " #
1 1 1 fn
· = for any n ≥ 2.
1 0 0 fn−1

You may use the associativity of matrix multiplication in your answer: for any matrices A, B, and C of the appropriate dimensions,
we have A(BC) = (AB)C.

Define the Lucas numbers as L1 = 1, L2 = 3, and Ln = Ln−1 + Ln−2 for n ≥ 3. (The Lucas numbers are a less famous cousin
of the Fibonacci numbers; they follow the same recursive definition as the Fibonaccis, but starting from a different pair of base
cases. They’re named after Édouard Lucas, a nineteenth-century French mathematician.) Prove the following facts about the Lucas
numbers, by induction (weak or strong, as appropriate) on n:
5.63 Ln = fn + 2fn−1
L +L
5.64 fn = n−1 5 n+1
5.65 (Ln ) = 5(fn )2 +4(−1)n (Hint: you may need to conjecture a second property relating Lucas and Fibonacci numbers to complete the
2

proof of the given property P(n)—specifically, try to formulate a property Q(n) relating Ln Ln−1 and fn fn−1 , and prove P(n) ∧ Q(n)
with a single proof by strong induction.)

Define the Jacobsthal numbers as J1 = 1, J2 = 1, and Jn = Jn−1 + 2Jn−2 for n ≥ 3. (Thus the Jacobsthal numbers are a
more distant relative of the Fibonacci numbers: they have the same base case, but a different recursive definition. They’re named
after Ernst Jacobsthal, a twentieth-century German mathematician.) Prove the following facts about the Jacobsthal numbers by
induction (weak or strong, as appropriate) on n:
5.66 Jn = 2Jn−1 + (−1)n−1 , for all n ≥ 2.
n
1)n
5.67 Jn = 2 −(− 3
5.68 Jn = 2 n − 1 − Jn−1 , for all n ≥ 2.

The next two problems are previews of Chapter 9, where we’ll talk about how to count the size of sets (often, sets that are described
in somewhat complicated ways). You should be able to attack these problems without the detailed results from Chapter 9, but feel
free to glance ahead to Section 9.2 if you’d like.
5.69 You are given a 2-by-n grid that you must tile, using either 1-by-2 dominoes or 2-by-2 squares. The dominoes can be arranged
either vertically or horizontally. (See Figure 5.26.) Prove by strong induction on n that the number of different ways of tiling the
2-by-n grid is precisely Jn+1 . (Be careful: it’s easy to accidentally count some configurations twice—for example, make sure that
you count only once the tiling of a 2-by-3 grid that uses three horizontal dominoes.)
5.70 Suppose that you run out of squares, so you can now only use dominoes for tiling. (See Figure 5.26a.) How does your answer to
the previous exercise change? How many different tilings of a 2-by-n grid are there now? Prove your answer.

ISTUDY
248 Mathematical Induction

(a) The five ways to tile using dominoes (n = 4). (b) The six additional tilings when we also allow squares.

Figure 5.26 How many ways are there to tile an 2-by-n grid using 1-by-2 dominoes and 2-by-2 squares?

the concatenation of sn−1 and sn−2


the ending
s3 = s2 ◦ s1 = 01 the starting
point (last
point (first
s1 = 1 s4 = s3 ◦ s2 = 01 0 point drawn) point drawn)
s2 = 0 s5 = s4 ◦ s3 = 010 01
sn = sn−1 ◦ sn−2 s6 = s5 ◦ s4 = 01001 010
(a) The definition. (b) Some examples. (c) A visualization of s14 , as in Exercise 5.74.

Figure 5.27 The Fibonacci word fractal, which defines a sequence of bitstrings using recursion.

The Fibonacci word fractal defines a sequence of bitstrings using a similar recursive description to the Fibonacci numbers. See
Figure 5.27 for the definition and a few examples. It turns out that if we delete the last two bits from sn , the resulting string is a
palindrome (reading the same back-to-front and front-to-back).
5.71 Prove by strong induction that the number of bits in sn is precisely fn (the nth Fibonacci number).
5.72 Prove by strong induction that the string sn does not contain two consecutive ones or three consecutive zeros.
5.73 Let #0(x) and #1(x) denote the number of zeros and ones in a bitstring x, respectively. Show that, for all n ≥ 3, the amount by
which the number of zeros in sn exceeds the number of ones—that is, the quantity #0(sn ) − #1(sn )—is a Fibonacci number.
5.74 (programming required) The reason that sn is called a fractal is that it’s possible to visualize these “words” (strings) as a geometric
fractal by interpreting zeros and ones as “turn” and “go straight,” respectively. Specifically, here’s the algorithm: start pointing
in some direction, say east. For the ith symbol in sn , for i = 1, 2, . . . , |sn |: proceed in your current direction by one unit. If the
ith symbol is 1 then do not turn; if the symbol is a 0 and i is even, turn 90◦ to the right; and if the symbol is a 0 and i is odd,
turn 90◦ to the left. (See Figure 5.27.) Write a program to draw a bitstring using these rules. Then implement the definition of
the Fibonacci word fractal and “draw” the strings s1 , s2 , . . . , s16 . (For efficiency’s sake, you may want to compute sn with a loop
instead of recursively; see Figure 6.33 in Chapter 6 for some ideas.)

5.75 The sum of the interior angles of any triangle is 180◦ . Now, using this fact and induction, prove that any polygon with k ≥ 3 vertices
has interior angles that sum to 180k − 360 degrees. (See Figure 5.28.)
5.76 A diagonal of a polygon is a line that connects two non-adjacent vertices. (Again, see Figure 5.28.) How many diagonals are there
in a triangle? A quadrilateral? A pentagon? Formulate a conjecture for the number d(k) of diagonals in a k-gon, and prove your
formula correct by induction. (Hint: consider lopping off a triangle from the polygon.)

5.77 Prove that the recursive binary search algorithm shown in Figure 5.29 is correct. That is, prove the following condition, by strong
induction on n: For any sorted array A[1 . . . n], binarySearch(A, x) returns true if and only if x ∈ A.

(a) A polygon. (b) The interior angles. (c) One of the diagonals.

Figure 5.28 A polygon, its interior angles, and one of its diagonals.

ISTUDY
Exercises 249

binarySearch(A[1 . . . n], x): merge(X[1 . . . n], Y[1 . . . m]):


1 if n ≤ 0 then 1 if n = 0 then
2 return False 2 return Y
3 middle := ⌊ 1+ n
2 ⌋ 3 else if m = 0 then
4 if A[middle] = x then 4 return X
5 return True 5 else if X[1] < Y[1] then
6 else if A[middle] > x then 6 return X[1] followed by merge(X[2 . . . n], Y)
7 return binarySearch(A[1 . . . middle − 1], x) 7 else
8 else 8 return Y[1] followed by merge(X, Y[2 . . . m])
9 return binarySearch(A[middle + 1 . . . n], x)
mergeSort(A[1 . . . n]):
Note: for the sake of efficiency, it’s important to make sure that
Line 7 and Line 9 do not copy half of the array A. (As written, they 1 if n = 1 then
appear to.) Usually, you’d accomplish these steps efficiently by 2 return A
maintaining one array, and passing indices into that array (rather than
3 else
the array itself) as parameters to binarySearch. Writing it out with  
the extra parameters makes this algorithm a lot harder to read, so this 4 L := mergeSort(A[1 . . . n2 ])
pseudocode errs on the side of readability at the cost of leaving out an
n
5 R := mergeSort(A[ 2 + 1 . . . n])
important implementation detail.
6 return merge(L, R)

Figure 5.29 Binary Search, Merge, and Merge Sort, recursively.

5.78 Prove by weak induction on the total length of the inputs (that is, on the quantity n + m) that the merge algorithm in Figure 5.29
satisfies the following property for any n ≥ 0 and m ≥ 0: given any two sorted arrays X[1 . . . n] and Y[1 . . . m] as input, the output
of merge(X, Y) is a sorted array containing all elements of X and all elements of Y.
5.79 Prove by strong induction on n that mergeSort(A[1 . . . n]), shown in Figure 5.29, indeed sorts its input.

5.80 Give a recursive algorithm to compute a list of all permutations of a given set S. (That is, compute a list of all possible orderings of
the elements of S. For example, permutations({1, 2, 3}) should return ⟨1, 2, 3⟩, ⟨1, 3, 2⟩, ⟨2, 1, 3⟩, ⟨2, 3, 1⟩, ⟨3, 1, 2⟩, and ⟨3, 2, 1⟩,
in some order.) Prove your algorithm correct by induction.

Prove that weak induction, as defined in Section 5.2, and strong induction are equivalent. (Hint: in one of these two exercises, you
will have to use a different predicate than P.)
5.81 Suppose that you’ve written a proof of ∀n ∈ Z≥0 : P(n) by weak induction. I’m in an evil mood, and I declare that you aren’t
allowed to prove anything by weak induction. Explain how to adapt your weak-induction proof to prove ∀n ∈ Z≥0 : P(n) using
strong induction.
5.82 Now suppose that, obeying my new Draconian rules, you have written a proof of ∀n ∈ Z≥0 : P(n) by strong induction. In a doubly
evil mood, I tell you that now you can only use weak induction to prove things. Explain how to adapt your strong-induction proof
to prove ∀n ∈ Z≥0 : P(n) using weak induction.

ISTUDY
250 Mathematical Induction

5.4 Recursively Defined Structures and Structural Induction


The vermin only teaze and pinch
Their foes superior by an inch.
So, naturalists observe, a flea
Has smaller fleas that on him prey;
And these have smaller still to bite ’em,
And so proceed ad infinitum.
Thus every poet, in his kind,
Is bit by him that comes behind.
Jonathan Swift (1667–1745)
“On Poetry: a Rhapsody” (1733)

In the proofs that we have written so far in this chapter, we have performed induction on an integer:
the number that’s the input to an algorithm, the number of vertices of a polygon, the number of elements
in an array. In this section, we will address proofs about recursively defined structures, instead of about
integers, using a version of induction called structural induction that proceeds over the defined structure
itself, rather than just using numbers.

5.4.1 Recursively Defined Structures


A recursively defined structure, just like a recursive algorithm, is a structure defined in terms of one or
more base cases and one or more inductive cases. Any data type that can be understood as either a trivial
instance of the type or as being built up from a smaller instance (or smaller instances) of that type can
be expressed in this way. For example, basic data structures like a linked list and a binary tree can be
defined recursively. So too can well-formed sentences of a formal language—languages like Python, or
propositional logic—among many other examples. In this section, we’ll give recursive definitions for some
of these examples.

Linked lists
A linked list is a commonly used data structure in which we store a sequence of elements (just like the
sequences from Section 2.4). The reasons that linked lists are useful are best left to a data structures course,
but here is a brief synopsis of what a linked list actually is. Each element in the list, called a node, stores a
data value and a “pointer” to the rest of the list. A special value, often called null, represents the empty
list; the last node in the list stores this value as its pointer to represent that there are no further elements in
the list. See Figure 5.30a for an example. Here is a recursive definition of a linked list:

Example 5.15: Linked list.


A linked list is either:
1 ⟨⟩, known as the empty list; or
2 ⟨x, L⟩, where x is an arbitrary element and L is a linked list.

For example, Figure 5.30a shows the linked list that consists of 1 followed by the linked list containing 6,
1, and 9 (which is a linked list consisting of 6 followed by a linked list containing 1 and 9, which is …). In
other words, Figure 5.30a shows the linked list ⟨1, ⟨6, ⟨1, ⟨9, ⟨⟩⟩⟩⟩⟩.

ISTUDY
5.4 Recursively Defined Structures and Structural Induction 251

(a) (b)
1
1 6 1 9

3 2

5 4

Figure 5.30 An example of (a) a linked list and (b) a binary tree. The slashed line represents the null value.

Binary trees
We can also recursively define a binary tree (see Section 11.4.2). We’ll again defer the discussion of why
binary trees are useful, but here is a quick summary of what they are. Like a linked list, a binary tree is
a collection of nodes that store data values and “pointers” to other nodes. Unlike a linked list, a node in
a binary tree stores two pointers to other nodes (or null, representing an empty binary tree). These two
pointers are to the left child and right child of the node. The root node is the one at the very top of the
tree. See Figure 5.30b for an example; the root node of this tree stores the value 1, and has a left child (the
binary tree with root 3) and a right child (the binary tree with root 2). Here is a recursive definition:

Example 5.16: Binary trees.


A binary tree is either:
1 the empty tree, denoted by null; or
2 a root node x, a left subtree Tℓ , and a right subtree Tr , where x is an arbitrary value and Tℓ and Tr are
both binary trees.

Taking it further: In many programming languages, we can explicitly define data types that echo these recursive definitions,
where the base case is a trivial instance of the data structure (often nil or None or null). In C, for example, we can define a
binary tree with integer-valued nodes as:

1 struct binaryTree { In C, the symbol * means that we’re storing a


2 int root; reference, or pointer, to the subtrees, rather than
3 struct binaryTree *leftSubtree; the subtrees themselves, in the data structure.
4 struct binaryTree *rightSubtree;
5 }

The base case—an empty binary tree—is NULL; the inductive case—a binary tree with a root node—has a value stored as its root,
and then two binary trees (possibly empty) as its left and right subtrees.

Define the leaves of a binary tree T to be those nodes contained in T whose left subtree and right subtree are
both null. Define the internal nodes of T to be all nodes that are not leaves. In Figure 5.30b, for example,
the leaves are the nodes 5 and 8, and the internal nodes are {1, 2, 3, 4}.
Taking it further: Binary trees with certain additional properties turn out to be very useful ways of organizing data for efficient
access. For example, a binary search tree is a binary tree in which each node stores a “key,” and the tree is organized so that, for
any node u, the key at node u is larger than all the keys in u’s left subtree and smaller than all the keys in u’s right subtree. (For
example, we might store the email address of a student as a key; the tree is then organized alphabetically.) Another special type
of a binary search tree is a heap, in which each node’s key is larger than all the keys in its subtrees. These two data structures are
very useful in making certain common operations very efficient; see p. 245 (for heaps) and p. 637 (for binary search trees).

ISTUDY
252 Mathematical Induction

Sentences in a language
In addition to data structures, we can also define sentences in a language using a recursive definition—for
example, arithmetic expressions of the type that are understood by a simple calculator; or propositions (as
in Chapter 3’s propositional logic):

Example 5.17: Arithmetic expressions.


An arithmetic expression is any of the following:
1 any integer n;
2 −E, where E is an arithmetic expression; or
3 E + F, E − F, E · F, or E/F, where E and F are arithmetic expressions.

Example 5.18: Sentences of propositional logic.


A sentence of propositional logic (also known as a well-formed formula, or wff ) over the propositional
variables X is one of the following:
1 x, for some x ∈ X;
2 ¬P, where P is a wff over X; or
3 P ∨ Q, P ∧ Q, or P ⇒ Q, where P and Q are wffs over X.

We implicitly used the recursive definition of logical propositions from Example 5.18 throughout Chapter 3,
but using this recursive definition explicitly allows us to express a number of concepts more concisely. For
example, consider a truth assignment f : X → {True, False} that assigns True or False to each variable in
X. Then the truth value of a proposition over X under the truth assignment f can be defined recursively for
each case of the definition: (1) the truth value of x ∈ X under f is f(x); (2) the truth value of ¬P under f is
True if the truth value of P under f is False, and the truth value of ¬P under f is False if the truth value of
P under f is True; and so forth.

Taking it further: Linguists interested in syntax spend a lot of energy constructing recursive definitions (like those in Exam-
ples 5.17 and 5.18) of grammatical sentences of English. But one can also give a recursive definition for non-natural languages:
in fact, another structure that can be defined recursively is the grammar of a programming language itself. As such, this type of
recursive approach to defining (and processing) a grammar plays a key role not just in linguistics but also in computer science.
See the discussion on p. 259 for more.

5.4.2 Structural Induction


The recursively defined structures from Section 5.4.1 are particularly amenable to inductive proofs. For
example, recall from Example 5.16 that a binary tree is one of the following: (1) the empty tree, denoted
by null; or (2) a root node x, a left subtree Tℓ , and a right subtree Tr , where Tℓ and Tr are both binary trees.
To prove that some property P is true of all binary trees T, we can use (strong) induction on the number n
of applications of rule #2 from the definition. Here is an example of such a proof:

Example 5.19: Internal nodes vs. leaves in binary trees.


Recall that a leaf in a binary tree is a node whose left and right subtrees are both empty; an internal node
is any non-leaf node. Write leaves(T) and internals(T) to denote the number of leaves and internal nodes
in a binary tree T, respectively.

ISTUDY
5.4 Recursively Defined Structures and Structural Induction 253

x If T was produced by ≥ 2
applications of rule #2, then
at least one of Tℓ and Tr is not
Definition 5.16, in brief: The only binary
null, and the leaves of T are
tree produced by
precisely the leaves of Tℓ plus
Rule #1: null is a binary tree. one application of
the leaves of Tr .
x rule #2 has one
Rule #2: if Tℓ and Tr are binary trees, node, which is a
then ⟨x, Tℓ , Tr ⟩ is a binary tree. leaf. Tℓ Tr
(a) What binary trees are. (b) One application of rule #2. (c) Two or more applications of rule #2.

Figure 5.31 Illustrations of the inductive case for Example 5.19.

Claim: In any binary tree T, we have leaves(T) ≤ internals(T) + 1.

Proof. We proceed by strong induction on the number of applications of rule #2 used to generate T.
Specifically, let P(n) denote the property that leaves(T) ≤ internals(T) + 1 holds for any binary tree T
generated by n applications of rule #2; we’ll prove that P(n) holds for all n ≥ 0, which proves the claim.
Base case (n = 0). The only binary tree generated with 0 applications of rule #2 is the empty tree null.
Indeed, leaves(null) = internals(null) = 0, and 0 ≤ 0 + 1.
Inductive case (n ≥ 1). Assume the inductive hypothesis P(0) ∧ P(1) ∧ · · · ∧ P(n − 1): for any binary
tree B generated using k < n applications of rule #2, we have leaves(B) ≤ internals(B) + 1. We must
prove P(n).
We’ll handle the case n = 1 separately. (See Figure 5.31b.) The only way to make a binary tree T
using one application of rule #2 is to use rule #1 for both of T’s subtrees, so T must contain only one node
(which is itself a leaf). Then T contains 1 leaf and 0 internal nodes, and indeed 1 ≤ 0 + 1.
Otherwise n ≥ 2. (See Figure 5.31c.) Observe that the tree T must have been generated by (a) gener-
ating a left subtree Tℓ using some number ℓ of applications of rule #2; (b) generating a right subtree Tr
using some number r of applications of rule #2; and then (c) applying rule #2 to a root node x, Tℓ , and Tr
to produce T. Therefore r + ℓ + 1 = n, and therefore r < n and ℓ < n. Ergo, we can apply the inductive
hypothesis to both Tℓ and Tr , and thus
leaves(Tℓ ) ≤ internals(Tℓ ) + 1 (1)
leaves(Tr ) ≤ internals(Tr ) + 1. (2)
Also observe that, because r + ℓ + 1 = n ≥ 2, either Tr ̸= null or Tℓ ̸= null, or both. Thus the leaves
of T are the leaves of Tℓ and Tr , and internal nodes of T are the internal nodes of Tℓ and Tr plus the root
x (which cannot be a leaf because at least one of Tℓ and Tr is not empty). Therefore
leaves(T) = leaves(Tℓ ) + leaves(Tr ) (3)
internals(T) = internals(Tℓ ) + internals(Tr ) + 1. (4)
Putting together these facts, we have
leaves(T) = leaves(Tℓ ) + leaves(Tr ) by (3)
≤ internals(Tℓ ) + 1 + internals(Tr ) + 1 by (1) and (2)
= internals(T) + 1. by (4)

Thus P(n) holds, which completes the proof.

ISTUDY
254 Mathematical Induction

Structural induction: the idea


The proof in Example 5.19 is perfectly legitimate, but there is another approach that we can use for recur-
sively defined structures, called structural induction. The basic idea is to perform induction on the structure
of an object itself rather than on some integer: instead of a case for n = 0 and a case for n ≥ 1, in a proof
by structural induction our cases correspond directly to the cases of the recursive structural definition.
For structural induction to make sense, we must impose some restrictions on the recursive definition.
Specifically, the set of structures defined must be well ordered, which intuitively ensures that every invoca-
tion of the inductive case of the definition “makes progress” toward the base case(s) of the definition. For
the type of recursive definitions that we’re considering—where there are base cases in the definition, and
all instances of the structure are produced by a finite sequence of applications of the inductive rules in the
definition—structural induction is a valid technique to prove facts about the recursively defined structure.

Taking it further: A set of objects is called well ordered if there’s a “least” element among any collection of those objects. More
formally, a set S of structures is well ordered if there exists a “smaller than” relationship ≺ between elements of S such that, for
any nonempty T ⊆ S, there exists a minimal element m in T—that is, there exists m ∈ T such that no x ∈ T satisfies x ≺ m.
(There might be more than one least element in T.) For example, the set Z≥0 is well ordered, using the normal ≤ relationship.
However, the set R is not well ordered: for example, the set {x ∈ R : x > 2} has no smallest element using ≤. But the set of
binary trees is well ordered; the relation ≺ is “is a subtree of.” One can prove that a set S is well ordered if and only if a proof
by mathematical induction is valid on a set S (where the base cases are the minimal elements of S, and to prove P(x) we assume
the inductive hypotheses P(y) for any y ≺ x).

Proofs by structural induction


Here is the formal definition of a proof by structural induction:

Definition 5.12: Proof by structural induction.


Suppose that we want to prove that P(x) holds for every x ∈ S, where S is the (well-ordered) set of
structures generated by a recursive definition, and P is some property. To give a proof by structural
induction of ∀x ∈ S : P(x), we prove the following:
1 Base cases: for every x defined by a base case in the definition of S, prove P(x).
2 Inductive cases: for every x defined in terms of y1 , y2 , . . . , yk ∈ S by an inductive case in the definition
of S, prove that P(y1 ) ∧ P(y2 ) ∧ · · · ∧ P(yk ) ⇒ P(x).

In a proof by structural induction, we can view both base cases and inductive cases in the same light: each
case assumes that the recursively constructed subpieces of a structure x satisfy the stated property, and we
prove that x itself also satisfies the property. For a base case, the point is just that there are no recursively
constructed pieces, so we actually are not making any assumption.
Notice that a proof by structural induction is identical in form to a proof by strong induction on the
number of applications of the inductive-case rules used to generate the object. For example, we can imme-
diately rephrase the proof in Example 5.19 to use structural induction instead. While the structure of the
proof is identical, structural induction can streamline the proof and make it easier to read:

Example 5.20: Internal nodes vs. leaves in binary trees, take II.

Claim: In any binary tree T, we have leaves(T) ≤ internals(T) + 1.

ISTUDY
5.4 Recursively Defined Structures and Structural Induction 255

Proof. Let P(T) denote the property that leaves(T) ≤ internals(T) + 1 for a binary tree T. We proceed
by structural induction on the form of T.
Base case (T = null). Then leaves(T) = internals(T) = 0, and indeed 0 ≤ 0 + 1.
Inductive case (T has root x, left subtree Tℓ , and right subtree Tr ). We assume the inductive hypotheses
P(Tℓ ) and P(Tr ), namely
leaves(Tℓ ) ≤ internals(Tℓ ) + 1 (1)
leaves(Tr ) ≤ internals(Tr ) + 1. (2)

• If x is itself a leaf, then Tℓ = Tr = null, and therefore leaves(T) = 1 and internals(T) = 0, and
indeed 1 ≤ 0 + 1.
• Otherwise x is not a leaf, and either Tr ̸= null or Tℓ ̸= null, or both. Thus the leaves of T are the
leaves of Tℓ and Tr , and internal nodes of T are the internal nodes of Tℓ and Tr plus the root x. Therefore
leaves(T) = leaves(Tℓ ) + leaves(Tr ) (3)
internals(T) = internals(Tℓ ) + internals(Tr ) + 1. (4)
Putting together these facts, we have
leaves(T) = leaves(Tℓ ) + leaves(Tr ) by (3)
≤ internals(Tℓ ) + 1 + internals(Tr ) + 1 by (1) and (2)
= internals(T) + 1. by (4)

Thus P(n) holds, which completes the proof.

5.4.3 Some More Examples of Structural Induction: Propositional Logic


We’ll finish this section with two more proofs by structural induction, about propositional logic—using
Example 5.18’s recursive definition.

Propositional logic using only ¬ and ∧


First, we’ll give a formal proof using structural induction of the claim that any propositional logic statement
can be expressed using ¬ and ∧ as the only logical connectives. (See Exercise 4.68.)

Example 5.21: All of propositional logic using ¬ and ∧.

Claim: For any logical proposition φ using the connectives {¬, ∧, ∨, ⇒}, there exists a proposition using
only {¬, ∧} that is logically equivalent to φ.

Proof. For a logical proposition φ, let A(φ) denote the property that there exists a {¬, ∧}-only propo-
sition logically equivalent to φ. We’ll prove by structural induction on φ that A(φ) holds for any
well-formed formula φ (see Example 5.18):
Base case: φ is a variable, say φ = x. The proposition x uses no connectives—and thus is vacuously
{¬, ∧}-only—and is obviously logically equivalent to itself. Thus A(x) follows.

ISTUDY
256 Mathematical Induction

Inductive case I: φ is a negation, say φ = ¬P. We assume the inductive hypothesis A(P). We must prove
A(¬P). By the inductive hypothesis, there is a {¬, ∧}-only proposition Q such that Q ≡ P. Consider the
proposition ¬Q. Because Q ≡ P, we have that ¬Q ≡ ¬P, and ¬Q contains only the connectives {¬, ∧}.
Thus ¬Q is a {¬, ∧}-only proposition logically equivalent to ¬P. Thus A(¬P) follows.
Inductive case II: φ is a conjunction, disjunction, or implication, say φ = P1 ∧ P2 , φ = P1 ∨ P2 ,
or φ = P1 ⇒ P2 . We assume the inductive hypotheses A(P1 ) and A(P2 )—that is, we assume there are
{¬, ∧}-only propositions Q1 and Q2 with Q1 ≡ P1 and Q2 ≡ P2 . We must prove A(P1 ∧ P2 ), A(P1 ∨ P2 ),
and A(P1 ⇒ P2 ). Consider the propositions Q1 ∧ Q2 , ¬(¬Q1 ∧ ¬Q2 ), and ¬(Q1 ∧ ¬Q2 ). By De Morgan’s
Law, and the facts that x ⇒ y ≡ ¬(x ∧ ¬y), P1 ≡ Q1 , and P2 ≡ Q2 :
Q1 ∧ Q2 ≡ Q1 ∧ Q2 ≡ P1 ∧ P2
¬(¬Q1 ∧ ¬Q2 ) ≡ Q1 ∨ Q2 ≡ P1 ∨ P2
¬(Q1 ∧ ¬Q2 ) ≡ Q1 ⇒ Q2 ≡ P1 ⇒ P2
Because Q1 and Q2 are {¬, ∧}-only, our three propositions are {¬, ∧}-only as well; therefore A(P1 ∧P2 ),
A(P1 ∨ P2 ), and A(P1 ⇒ P2 ) follow.
We’ve shown that A(φ) holds for any proposition φ, so the claim follows.

Taking it further: In the programming language ML, among others, a programmer can use both recursive definitions and a
form of recursion that mimics structural induction. For example, we can give a simple implementation of the recursive definition
of a well-formed formula from Example 5.18:

1 datatype wff = Variable of string


2 | Not of wff “A well-formed formula is a variable (whose name
3 | And of (wff * wff) is a string), or the negation of a well-formed
4 | Or of (wff * wff) formula, or the conjunction of a pair of
5 | Implies of (wff * wff); well-formed formulas (wff * wff), or . . . .”)

In ML, we can also write a function that mimics the structure of the proof in Example 5.21, using ML’s capability of pattern
matching function arguments. Here is an implementation of the recursive function that takes an arbitrary wff as input, and
produces an equivalent wff using only And and Not as output.

6 fun simplify (Variable var) = Variable var


7 | simplify (Not P) = Not(simplify P)
8 | simplify (And (P1, P2)) = And(simplify P1, simplify P2)
9 | simplify (Or (P1, P2)) = Not(And(Not(simplify P1), Not(simplify P2)))
10 | simplify (Implies (P1, P2)) = Not(And(simplify P1, Not(simplify P2)));

Conjunctive and disjunctive normal forms


Here is another example of a proof by structural induction based on propositional logic, to establish Theo-
rems 3.18 and 3.19, that any proposition is logically equivalent to one that’s in conjunctive or disjunctive
normal form. (Recall that a proposition φ is in conjunctive normal form (CNF) if φ is the conjunction of
one or more clauses, where each clause is the disjunction of one or more literals. A literal is a Boolean
variable or the negation of a Boolean variable. A proposition φ is in disjunctive normal form (DNF) if φ
is the disjunction of one or more clauses, where each clause is the conjunction of one or more literals.)

Theorem 5.13: CNF/DNF suffice.


Let φ be a Boolean formula that uses the connectives {∧, ∨, ¬, ⇒}. Then:
1 there exists φdnf in disjunctive normal form so that φ and φdnf are logically equivalent.
2 there exists φcnf in conjunctive normal form so that φ and φcnf are logically equivalent.

ISTUDY
5.4 Recursively Defined Structures and Structural Induction 257

Perhaps bizarrely, it will turn out to be easier to prove that any proposition is logically equivalent to both
one in CNF and one in DNF than to prove either claim on its own. So we will prove both parts of the
theorem simultaneously, by structural induction.
Problem-solving tip: Suppose we want to prove ∀x : P(x) by induction. Here’s a problem-solving strategy that’s highly coun-
terintuitive: it is sometimes easier to prove a stronger statement ∀x : P(x) ∧ Q(x). It seems bizarre that trying to prove more than
what we want is easier—but the advantage arises because the inductive hypothesis is a more powerful assumption! For example,
I don’t know how to prove that any proposition φ can be expressed in DNF (Theorem 5.13.1) by induction! But I do know how
to prove that any proposition φ can be expressed in both DNF and CNF by induction, as is done in Example 5.22.

We’ll make use of some handy notation in this proof: analogous to summation and product notation, we
Vn Wn
write i=1 pi to denote p1 ∧ p2 ∧ · · · ∧ pn , and similarly i=1 pi means p1 ∨ p2 ∨ · · · ∨ pn . Here is the proof:

Example 5.22: Conjunctive/disjunctive normal form.


Proof. We start by simplifying the task: we use Example 5.21 to ensure that φ contains only the con-
nectives {¬, ∧}. Let C(φ) and D(φ), respectively, denote the property that φ is logically equivalent to
a CNF proposition and a DNF proposition, respectively. We now proceed by structural induction on the
form of φ—which now can only be a variable, negation, or conjunction—to show that C(φ) ∧ D(φ)
holds for any proposition φ.
Base case: φ is a variable, say φ = x. We’re done immediately; a single variable is actually in both CNF
and DNF. We simply choose φdnf = φcnf = x. Thus C(x) and D(x) follow immediately.
Inductive case I: φ is a negation, say φ = ¬P. We assume the inductive hypothesis C(P) ∧ D(P)—that
is, we assume that there are propositions Pcnf and Pdnf such that P ≡ Pcnf ≡ Pdnf , where Pcnf is in CNF
and Pdnf is in DNF. We must show C(¬P) and D(¬P).
We’ll first show D(¬P)—that is, that ¬P can be rewritten in DNF. By the definition of conjunctive
Vn
normal form, we know that the proposition Pcnf is of the form Pcnf = i=1 ci , where ci is a clause of the
Wmi j
form ci = j=1 ci , where cji is a variable or its negation. Therefore we have

inductive hypothesis C(P)


definition of CNF De Morgan’s Law De Morgan’s Law
" !# " !# " !#
V
n W
mi W
n W
mi W
n V
mi
¬P ≡ ¬Pcnf ≡ ¬ cij ≡ ¬ cij ≡ ¬cij .
i=1 j=1 i=1 j=1 i=1 j=1

Once we delete double negations (that is, if cji = ¬x, then we write ¬cji as x rather than as ¬¬x), this last
proposition is in DNF, so D(¬P) follows.
The construction to show C(¬P)—that is, to give an CNF proposition logically equivalent to ¬P—is
strictly analogous; the only change to the argument is that we start from Pdnf instead of Pcnf .
Inductive case II: φ is a conjunction, say P ∧ Q. We assume the inductive hypotheses C(P) ∧ D(P) and
C(Q) ∧ D(Q)—that is, we assume that there are CNF propositions Pcnf and Qcnf and DNF propositions
Pdnf and Qdnf such that P ≡ Pcnf ≡ Pdnf and Q ≡ Qcnf ≡ Qdnf . We must show C(P ∧ Q) and D(P ∧ Q).
The argument for C(P ∧ Q) is the easier of the two: we have propositions Pcnf and Qcnf in CNF where
Pcnf ≡ P and Qcnf ≡ Q. Thus P ∧ Q ≡ Pcnf ∧ Qcnf —and the conjunction of two CNF formulas is itself
in CNF. So C(P ∧ Q) follows.
We have to work a little harder to prove D(P ∧ Q). Recall that, by the inductive hypothesis, there
are propositions Pdnf and Qdnf in DNF, where P ≡ Pdnf and Q ≡ Qdnf . By the definition of DNF, these

ISTUDY
258 Mathematical Induction

Wn Wm
propositions have the form Pdnf = i=1 ci and Qdnf = j=1 dj , where every ci and dj is a clause that is a
conjunction of literals. Therefore
inductive hypothesis D(P) and definition of DNF
inductive hypothesis D(Q) and definition of DNF
distributivity of ∨ over ∧ distributivity of ∨ over ∧

  !
W
n W
n W
n W
m W
n W
m
P∧Q≡ ci ∧Q≡ (ci ∧ Q) ≡ ci ∧ dj ≡ (ci ∧ dj ) .
i=1 i=1 i=1 i=j i=1 j=1

Because every ci and dj is a conjunction of literals, ci ∧ dj is too, and thus this last proposition is in DNF!
So D(P ∧ Q) follows—as does the theorem.

The construction for a conjunction P ∧ Q in Theorem 5.22 is a little tricky, so let’s illustrate it with a
small example:

Example 5.23: An example of the construction from Example 5.22.


Suppose that we are trying to transform a proposition φ∧ψ into DNF. Suppose that we have (recursively)
computed φdnf = (p ∧ t) ∨ q and ψdnf = r ∨ (s ∧ t). Then the construction from Example 5.22 lets us
construct a proposition equivalent to φ ∧ ψ as:
   
φ ∧ ψ ≡ φdnf ∧ ψdnf ≡ (p ∧ t) ∨ (q) ∧ (r) ∨ (s ∧ t)
c1 c2 d1 d2
h  i h  i
≡ (p ∧ t) ∧ (r) ∨ (s ∧ t) ∨ (q) ∧ (r) ∨ (s ∧ t)
c1 d1 ∨ d2 c2 d1 ∨ d2
h i h i
≡ (p ∧ t ∧ r) ∨ (p ∧ t ∧ s ∧ t) ∨ (q ∧ r) ∨ (q ∧ s ∧ t) .
c1 ∧ d1 c1 ∧ d2 c2 ∧ d1 c2 ∧ d2

Then the construction yields (p ∧ t ∧ r) ∨ (p ∧ t ∧ s ∧ t) ∨ (q ∧ r) ∨ (q ∧ s ∧ t) as the DNF proposition


equivalent to φ ∧ ψ.

5.4.4 The Integers, Recursively Defined


Before we end the section, we’ll close our discussion of recursively defined structures and structural induc-
tion with one more potentially interesting observation. Although the basic form of induction in Section 5.2
appears fairly different, that basic form of induction can actually be seen as structural induction, too. The
key is to view the nonnegative integers Z≥0 as defined recursively:

Definition 5.14: Nonnegative integers, recursively defined.


A nonnegative integer is either:
1 zero, denoted by 0; or
2 the successor of a nonnegative integer, denoted by s(x) for a nonnegative integer x.

Under this definition, a proof of ∀n ∈ Z≥0 : P(n) by structural induction and a proof of ∀n ∈ Z≥0 : P(n)
by weak induction are literally identical. First, they have precisely the same base case: prove P(0). And,
second, they have precisely the same inductive case: prove P(n) ⇒ P(s(n))—or, in other words, prove that
P(n) ⇒ P(n + 1).

ISTUDY
5.4 Recursively Defined Structures and Structural Induction 259

Computer Science Connections

Grammars, Parsing, and Ambiguity


In interpreters and compilers—systems that translate input source code written in a programming language like
Python, Java, or C into a machine-executable format—a key initial step is to parse the input into a format that
represents its structure. (A similar step occurs in systems designed to perform natural language processing.) The
structured representation of such an expression is called a parse tree, in which the leaves of the tree correspond
to the base cases of the recursive structural definition, and the internal nodes correspond to the inductive cases of
the definition. We can then use the parse tree for whatever purpose we desire: evaluating arithmetic expressions,
simplifying propositional logic, or any other manipulation. (See Figure 5.32.)
In this setting, a recursively defined structure is
written as a context-free grammar (CFG). A grammar · +
consists of a set of rules that can be used to gener- ·
2 + 4
ate a particular example of this defined structure. (This
type of grammar is called context free because the rules 3 4 2 3
defined by the grammar can be used any time—that
is, without regard to the context in which the symbol Figure 5.32 Parse trees for two arithmetic expressions:
on the left-hand side of the rule appears. There are 2 · (3 + 4) and (2 · 3) + 4.
also “context-sensitive” grammars, which remove this
restriction; they tend to be more complicated to think about.) We’ll take the definition of propositions over the
variables {p, q, r} (Example 5.18) as a running example. Here is a CFG for propositions, following that definition
precisely. (Here “→” means “can be rewritten as” and “|” means “or.”)

S → p | q | r | ¬S | S∨S | S∧S | S⇒S.


S can be a propositional or S can be the or S can be the ∧ or ∨ or ⇒ of
variable negation of a two propositions.
proposition

An expression is a valid proposition over the variables {p, q, r} if and only if it can be generated by a finite sequence
of applications of the rewriting rules in the grammar. For example, ¬p∨p is a valid proposition over {p, q, r}, because
we can generate it as follows (the rule from the grammar that we’ve invoked in each step is shown above the arrow):
S→S∨S S→p S → ¬S S→p

S → S∨S → S∨p → ¬S ∨ p → ¬p ∨ p. (1)


The parse tree corresponding to this sequence of rule applications is shown in Figure 5.33a. A complication that arises
with the grammar given above is that it is ambiguous: the same proposition can be produced using a fundamentally
different sequence of rule applications, which gives rise to a different parse tree:
S → ¬S → ¬S ∨ S → ¬p ∨ S → ¬p ∨ p. (2)
The parse tree corresponding to this second derivation
is shown in Figure 5.33b. This second parse tree corre- S S
sponds to the expression ¬(p ∨ p) instead of (¬p) ∨ p, S ∨ S ¬ S
which is the correct “order of operations” because ¬
binds tighter than ∨. ¬ S p S ∨ S
It’s bad news if the grammar of a programming lan-
p p p
guage is ambiguous, because certain valid code is then
“allowed” to be interpreted in more than one way. (The (a) The correct order of (b) The wrong order of
classic example is the attachment of else clauses: in operations, corresponding operations, corresponding
code like if P then if Q then X else Y, when to derivation (1). to derivation (2).
should Y be executed? When P is true and Q is false? Or Figure 5.33 Two parse trees for ¬p ∨ p.
when P is false?) Thus programming language design-
ers develop unambiguous grammars that reflect the desired behavior. (For more on context-free grammars and
parsing, and the relationship between context-free grammars and compilers/interpreters, see [5, 73, 121].)

ISTUDY
260 Mathematical Induction

EXERCISES
Consider linked lists as defined in Example 5.15. See Figure 5.34 for some algorithms operating on linked lists.
5.83 Let L be a linked list. Prove by structural induction on L that length(L) returns the number of elements contained in L.
5.84 Prove by structural induction that contains(L, x) returns True if and only if x is one of the elements in L.
5.85 Prove by structural induction that sum(L) returns the sum of the elements of L. (Assume L contains only numbers.)
5.86 Here’s a variant of the definition of linked lists in Example 5.15, where we insist that the elements be in increasing order. Define a
nonempty sorted list as one of the following:
• ⟨x, ⟨⟩⟩; or
• ⟨x, ⟨y, L⟩⟩ where x ≤ y and ⟨y, L⟩ is a nonempty sorted list.
Prove by structural induction that in a nonempty sorted list ⟨x, L⟩, every element z in L satisfies z ≥ x.

Intuitively, a string of parentheses is unbalanced if there’s an open parenthesis that’s never closed (as in [ []) or if there’s a close
parenthesis that doesn’t close anything (as in [] ] ). Formally, define a string of balanced parentheses as any of the following:
• the empty string (consisting of zero characters);
• a string [ S ] where S is a string of balanced parentheses; or
• a string S1 S2 where S1 and S2 are both strings of balanced parentheses.
For example, [[]][] is a string of balanced parentheses, using the third rule on [[]] and []. (And [] is a string of balanced
parentheses using the second rule on the empty string. Thus [[]] is, too, using the second rule on [].)
5.87 Prove by structural induction that every string of balanced parentheses has exactly the same number of [s as it does ]s.
5.88 Prove by structural induction that any prefix of a string of balanced parentheses has at least as many [s as it does ]s.

5.89 Prove by structural induction that the algorithm countLeaves(T) in Figure 5.35 returns the number of leaves in a binary tree T.

5.90 A binary search tree (BST) is a binary tree in which each node stores a “key,” and, for any node u, the key at node u is larger than
all keys in u’s left subtree and smaller than all the keys in u’s right subtree. (See p. 637.) That is, a BST is either:
• an empty tree, denoted by null; or
• a root node x, a left subtree Tℓ where all elements are less than x, and a right subtree Tr , where all elements are greater than x,
and Tℓ and Tr are both BSTs.
Prove that the smallest element in a nonempty BST is the bottommost leftmost node—that is, prove that
(
x if Tℓ = null
the smallest element in a BST with root x and left subtree Tℓ =
̸ null.
the smallest element in Tℓ if Tℓ =

length(L): // L is any linked list contains(L, x): // L is any linked list sum(L): // L contains only numbers
1 if L = ⟨⟩ then 1 if L = ⟨⟩ then 1 if L = ⟨⟩ then
2 return 0 2 return False 2 return 0
3 else if L = ⟨x, L′ ⟩ then 3 else if L = ⟨y, L′ ⟩ then 3 else if L = ⟨x, L′ ⟩ then
4 return 1 + length(L′ ) 4 return x = y or contains(L′ , x) 4 return x + sum(L′ )

Figure 5.34 Three algorithms on linked lists.

countLeaves(T):
A reminder of Definition 5.16: a binary tree is either
1 if T = null then
2 return 0 • an empty tree, denoted by null; or
3 else • a root node x, a left subtree Tℓ , and a right subtree Tr ,
4 TL , TR := the left and right subtrees of T where x is an arbitrary value and Tℓ and Tr are both
5 if TL = TR = null then binary trees.
6 return 1
(Recall that a leaf of a binary tree T is a node in T whose left
7 else
subtree and right subtree are both null.)
8 return countLeaves(TL ) + countLeaves(TR )

Figure 5.35 A reminder of the definition of a binary tree, and an algorithm to count leaves in one.

ISTUDY
Exercises 261

A heap is a binary tree where each node stores a priority, and in which every node satisfies the heap property: the priority of a node
u must be greater than or equal to the priorities of the roots of both of u’s subtrees. (The restriction only applies for a subtree that
is not null.) See p. 245.
5.91 Give a recursive definition of a heap.
5.92 Prove by structural induction that every heap is empty, or that no element of the heap is larger than its root node. (That is, the root
is a maximum element.)
5.93 Prove by structural induction that every heap is empty, or it has a leaf u such that u is no larger than any node in the heap. (That is,
the leaf u is a minimum element.)

A 2–3 tree is a data structure that is similar in spirit to a binary search tree (see Exercise 5.90—more precisely, a 2–3 tree is
analogous to a balanced form of BST, which is guaranteed to support fast operations like insertions, lookups, and deletions). The
name “2–3 tree” comes from the fact that each internal node in the tree must have precisely 2 or 3 children; no node has a single
child. Furthermore, all leaves in a 2–3 tree must be at the same “level” of the tree. Formally, a 2–3 tree of height h is one of the
following:
• a single node (in which case h = 0, and the node is called a leaf); or
• a node with 2 subtrees, both of which are 2–3 trees of height h − 1; or
• a node with 3 subtrees, all three of which are 2–3 trees of height h − 1.
5.94 Prove by structural induction that a 2–3 tree of height h has at least 2h leaves.
5.95 Prove by structural induction that a 2–3 tree of height h has at most 3h leaves. (This result plus Exercise 5.94 tells us that a 2–3 tree
that contains n leaf nodes has height between log3 n and log2 n.)
5.96 A 2–3–4 tree is a similar data structure to a 2–3 tree, except that a tree can be a single node or a node with 2, 3, or 4 subtrees. Give
a formal recursive definition of a 2–3–4 tree, and prove that a 2–3–4 tree of height h has at least 2h leaves and at most 4h leaves.

The next few exercises give recursive definitions of some familiar arithmetic operations which are usually defined nonrecursively.
In each, you’re asked to prove a familiar property by structural induction. Think carefully when you choose the quantity upon
which to perform induction, and don’t skip any steps in your proof! You may use the elementary-school facts about addition and
multiplication from Figure 5.36 in your proofs.
5.97 Let’s define an even number as either (i) 0, or (ii) 2 + k, where k is an even number. Prove by structural induction that the sum of
any two even numbers is an even number.
5.98 Let’s define a power of two as either (i) 1, or (ii) 2 · k, where k is a power of two. Prove by structural induction that the product of
any two powers of two is itself a power of two.
P
5.99 Let a1 , a2 , . . . , ak all be even numbers, for an arbitrary integer k ≥ 0. Prove that ki=1 ai is also an even number. (Hint: use weak
induction and Exercise 5.97.)

In Chapter 2 (Definition 2.5), we defined bn (for a base b ∈ R and an exponent n ∈ Z≥0 ) as denoting the result of multiplying
b by itself n times. As an alternative to that definition of exponentiation, we could instead give a recursive definition with integer
exponents: b0 = 1 and bn+1 = b · bn , for any nonnegative integer n.
5.100 Using the facts in Figure 5.36, prove by induction that bm bn = bm+n for any integers n ≥ 0 and m ≥ 0. Don’t skip any steps.
5.101 Using the facts in Figure 5.36 and Exercise 5.100, prove by induction that (bm )n = bmn for any integers n ≥ 0 and m ≥ 0. Again,
don’t skip any steps.

5.102 Suppose that we extend the well-formed formula (“wff”) definition from Example 5.18 to allow another logical connective: the
Sheffer stroke |, where p | q ≡ ¬(p ∧ q). Give a proof using structural induction (see Example 5.21 for an example) that any wff is
logically equivalent to one using Sheffer stroke as the only connective.
5.103 (programming required) Write a program in the programming language ML (see p. 256) to translate an arbitrary statement of
propositional logic into a logically equivalent statement in which | is the only logical connective. (In other words, implement the
proof of Exercise 5.102 as a recursive function.)

( a + b) + c = a + ( b + c ) (a · b) · c = a · (b · c) a+0=0+a=a Additive Identity

Associativity of Addition Associativity of Multiplication


a·1=1·a=a Multiplicative Identity

a+b=b+a Commutativity of Addition a·b=b·a Commutativity of Multiplication a·0=0·a=0 Multiplicative Zero

Figure 5.36 A few elementary-school facts about addition and multiplication.

ISTUDY
262 Mathematical Induction

5.104 Repeat Exercise 5.102 for Peirce’s arrow ↓, where p ↓ q ≡ ¬(p ∨ q).
5.105 (programming required) Repeat Exercise 5.103 with ↓ as the only logical connective (using Exercise 5.104).

5.106 Call a logical proposition truth-preserving if the proposition is true under the all-true truth assignment. (That is, a proposition is
truth-preserving if and only if the first row of its truth table is True.) Prove the following claim by structural induction on the form
of the proposition:
Any logical proposition that uses only the logical connectives ∨ and ∧ is truth-preserving.
(A solution to this exercise yields a rigorous solution to Exercise 4.71—there are propositions that cannot be expressed using only
∧ and ∨. Explain.)

5.107 A palindrome is a string that reads the same front-to-back as it does back-to-front—for example, RACECAR or (ignoring spaces
and punctuation) A MAN, A PLAN, A CANAL---PANAMA! or 10011001. Give a recursive definition of the set of palindromic
bitstrings.
5.108 Let #0(s) and #1(s) denote the number of zeros and ones in a bitstring s, respectively. Using structural induction and your definition
from Exercise 5.107, prove that, for any palindromic bitstring s, the value of [#0(s)] · [#1(s)] is an even number.

ISTUDY
5.5 Chapter at a Glance 263

5.5 Chapter at a Glance

Proofs by Mathematical Induction


Suppose that we want to prove that a property P(n) holds for all n ∈ Z≥0 . To give a proof by mathematical
induction of the claim ∀n ∈ Z≥0 : P(n), we prove the base case P(0), and we prove the inductive case: for
every n ≥ 1, we have P(n − 1) ⇒ P(n).
When writing an inductive proof of the claim ∀n ∈ Z≥0 : P(n), include each of the following steps:
1 A clear statement of the claim to be proven—that is, a clear definition of the property P(n) that will
be proven true for all n ≥ 0—and a statement that the proof is by induction, including specifically
identifying the variable n upon which induction is being performed. (Some claims involve multiple
variables, and it can be confusing if you aren’t clear about which is the variable upon which you are
performing induction.)
2 A statement and proof of the base case—that is, a proof of P(0).
3 A statement and proof of the inductive case—that is, a proof of P(n − 1) ⇒ P(n), for a generic value of
n ≥ 1. The proof of the inductive case should include all of the following:
a a statement of the inductive hypothesis P(n − 1),
b a statement of the claim P(n) that needs to be proven, and
c a proof of P(n), which at some point makes use of the assumed inductive hypothesis P(n − 1).

We can use a proof by mathematical induction to establish arithmetic properties, like a formula for the sum
Pn
of the nonnegative integers up to n—that is, i=0 i = n(n+1) 2 for any integer n ≥ 0—or a formula for a
geometric series:
X n
rn+1 − 1
if r ∈ R where r ̸= 1, and n ∈ Z≥0 , then ri = .
r−1
Pn i=0
(If r = 1, then i=0 ri = n+ 1.) We can also use proofs by mathematical induction to prove the correctness
of algorithms, particularly recursive algorithms.

Strong Induction
Suppose that we want to prove that P(n) holds for all n ∈ Z≥0 . To give a proof by strong induction of
∀n ∈ Z≥0 : P(n), we prove the base case P(0), and we prove the inductive case: for every n ≥ 1, we
have [P(0) ∧ P(1) . . . ∧ P(n − 1)] ⇒ P(n). Strong induction is actually completely equivalent to weak
induction; anything that can be proven with one can also be proven with the other.
Generally speaking, using strong induction makes sense when the “reason” that P(n) is true is that P(k)
is true for more than one value of k < n (or a single value of k < n with k ̸= n − 1). (For weak induction,
the reason that P(n) is true is just P(n − 1).) We can use strong induction to prove many claims, including
part of the Prime Factorization Theorem: if n ∈ Z≥1 is a positive integer, then there exist k ≥ 0 prime
Qk
numbers p1 , p2 , . . . , pk such that n = i=1 pi .

Recursively Defined Structures and Structural Induction


A recursively defined structure, just like a recursive algorithm, is a structure defined in terms of one or
more base cases and one or more inductive cases. Any data type that can be understood as either a trivial
instance of the type or as being built up from a smaller instance (or smaller instances) of that type can

ISTUDY
264 Mathematical Induction

be expressed in this way. The set of structures defined is well ordered if, intuitively, every invocation of
the inductive case of the definition “makes progress” toward the base case(s) of the definition (and, more
formally, that every nonempty subset of those structures has a “least” element).
Suppose that we want to prove that P(x) holds for every x ∈ S, where S is the (well-ordered) set of
structures generated by a recursive definition. To give a proof by structural induction of ∀x ∈ S : P(x), we
prove the following:
1 Base cases: for every x defined by a base case in the definition of S, prove P(x).
2 Inductive cases: for every x defined in terms of y1 , y2 , . . . , yk ∈ S by an inductive case in the definition
of S, prove that P(y1 ) ∧ P(y2 ) . . . ∧ P(yk ) ⇒ P(x).
The form of a proof by structural induction that ∀x ∈ S : P(x) for a well-ordered set of structures S is
identical to the form of a proof using strong induction. Specifically, the proof by structural induction looks
like a proof by strong induction of the claim ∀n ∈ Z≥0 : Q(n), where Q(n) denotes the property “for any
structure x ∈ S that is generated using n applications of the inductive-case rules in the definition of S, we
have P(x).”

ISTUDY
5.5 Chapter at a Glance 265

Key Terms and Results

Key Terms Key Results


Proofs by Mathematical Induction Proofs by Mathematical Induction
• proof by mathematical induction 1 Suppose that we want to prove that P(n) holds for all
• base case n ∈ Z≥0 . To give a proof by mathematical induction of
• inductive case ∀n ∈ Z≥0 : P(n), we prove the following:
• inductive hypothesis a The base case P(0).
• geometric series b The inductive case: for every n ≥ 1, we have
• arithmetic series P(n − 1) ⇒ P(n).
• harmonic series 2 For any integer n ≥ 0, we have 1 + 2 + · · · + n = n(n+1)
2 .
3 Let r ∈ R where r ̸= 1, and let n ∈ Z≥0 . Then
Strong Induction X
n
rn+1 − 1
ri = .
• strong induction r−1
i=0
• prime factorization Pn
(If r = 1, then i=0 ri = n + 1.)
Recursively Defined Structures and
Structural Induction Strong Induction

• recursively defined structures 1 Suppose that we want to prove that P(n) holds for all
• structural induction n ∈ Z≥0 . To give a proof by strong induction of
• well-ordered set ∀n ∈ Z≥0 : P(n), we prove the following:
a The base case P(0).
b The inductive case: for every n ≥ 1, we have
[P(0) ∧ P(1) . . . ∧ P(n − 1)] ⇒ P(n).
2 The Prime Factorization Theorem: let n ∈ Z≥1 be a
positive integer. Then there exist k ≥ 0 prime numbers
Qk
p1 , p2 , . . . , pk such that n = i=1 pi . Furthermore, up to
reordering, the prime numbers p1 , p2 , . . . , pk are unique.

Recursively Defined Structures and Structural


Induction
1 To give a proof by structural induction of ∀x ∈ S : P(x),
we prove the following:
a The base cases: for every x defined by a base case in
the definition of S, we have that P(x).
b The inductive cases: for every x defined in terms of
y1 , y2 , . . . , yk ∈ S by an inductive case in the definition
of S, we have that P(y1 ) ∧ P(y2 ) . . . ∧ P(yk ) ⇒ P(x).

ISTUDY
ISTUDY
6 Analysis of Algorithms

In which our heroes stay beyond the reach of danger, by calculating


precise bounds on how quickly they must move to stay safe.

ISTUDY
267
268 Analysis of Algorithms

6.1 Why You Might Care


It was the best of times; it was the worst of times.
Charles Dickens (1812–1870)
A Tale of Two Cities (1859)

Computer scientists are speed demons. When we are confronted by a computational problem that we
need to solve, we want to solve that problem as quickly as possible. That “need for speed” has driven much
of the advancement in computation over the last 50 years. We discover faster ways of solving important
problems: developing data structures that support apparently instantaneous search of billions of tweets
or billions of users on a social networking site; or discovering new, faster algorithms that solve practical
problems—such as finding shorter routes for delivery drivers or encrypting packets to be sent over the
internet. (Of course, the advances over the last 50 years have also been driven by improvements in computer
hardware that ensure that everything we do computationally is faster!)
This chapter will introduce asymptotic analysis, the most common way in which computer scientists
compare the speed of two possible solutions to the same problem. The basic idea is to think about the rate of
growth of the running time of an algorithm—how much slower does the algorithm get if we double the size
of the input?—in doing this analysis. We will think about “big” inputs to analyze the relative performance
of the two algorithms, focusing on the long-run behavior instead of any small-input-size special cases
for which one algorithm happens to perform exceptionally well. For the CS speed demon, asymptotic
analysis is the speedometer. (Sometimes, instead of time, we measure the amount of space/memory or
power/energy that an algorithm consumes.) We’ll start in Section 6.2 with the key definitions through
which we’ll think about the rates of growth of functions, specifically by introducing “big oh” notation and
its relatives (which formally describe what it means for one function to grow faster or slower than another).
We’ll start to discuss how we apply these definitions to analyze the efficiency of algorithms in Section 6.3.
In Section 6.4, we’ll look at how to analyze the speed of recursive algorithms—and then at a specific, and
particularly common, type of “divide and conquer” recursive algorithm in Section 6.5.
To take one example of why this kind of analysis of running time matters, consider sorting an n-element
array A. One approach is to use brute force: try all n! different permutations of A, and select the one per-
mutation whose elements are in ascending order. Sorting algorithms like Selection Sort, Insertion Sort, or
Bubble Sort require ≈ c · n2 operations, for some constant c, to sort A. You may also have seen Merge Sort,
which requires ≈ c · n log n operations. (We’ll review these sorting algorithms in Section 6.3.) Figure 6.1
shows the number of operations required by these algorithms (n!, n2 , and n log n). Given that some esti-
mates say that the earth will be swallowed by the sun in merely a few billion years [9], there is plenty of
reason to care about the differences in these running times. Asymptotic analysis is the first-cut approxima-
tion to making sure that our algorithms are fast enough—and that they will finish running while we’re still
around to view the output.

maximum n solvable in 1 minute on a


machine that completes 1,000,000,000
n = 10 n = 100 n = 1000 n = 10,000 operations per second
n log n 33 664 9966 132,877 1.94 × 109
n2 100 10,000 1,000,000 100,000,000 244,949
n! 3,628,800 9.333 × 10157 4.029 × 102567 2.846 × 1035,659 13

Figure 6.1 The number of operations for several algorithms, on several input sizes.

ISTUDY
6.2 Asymptotics 269

6.2 Asymptotics
If everything seems under control, you’re just not going fast enough.
Mario Andretti (b. 1940)

We’ll start by developing precise definitions related to the growth of functions—keeping in mind that
the functions of primary interest will be those representing the running time of an algorithm based on
its input size. Generally speaking, we will be interested in the behavior of algorithms ignoring constant
factors. There are two different senses in which we ignore constants. First, we will ignore constant mul-
tiplicative factors; for our purposes, the function f(n) and the function g(n) = 2 · f(n) “grow at the same
rate.” (Exercises 6.1–6.4 explore why we might evaluate efficiency of algorithms in this way.) Second, we
will be interested in the long-run behavior of our algorithms, so we won’t be concerned by any small input
values for which the algorithm performs particularly quickly or slowly.

Example 6.1: All of these things are quite the same.


The following three functions all grow at the same rate:


 if 0 < n < 100
202
f(n) = 3 · n 2
g(n) = 0.01 · n 2
h(n) = n 7
if 100 ≤ n < 1000


1776 · n2 otherwise.
The functions f and g differ by a multiplicative factor. For n ≥ 1000, the function h also differs by a
constant multiplier from f and g; therefore for large enough n it too grows at the same rate as f and g.

This type of analysis is called asymptotic analysis.

Taking it further: The name asymptotic comes from the Greek: a “without” + symptotos “falling together.” In math, the asymp-
tote of a function f(n) is a line that f(n) approaches as n gets very large. (Formally, this value is limn→∞ f(n).) For example,
the function f(x) = 1x has an asymptote at 0: as x gets larger and larger, f(x) gets closer and closer to 0. (Mathematicians also
consider asymptotes where a function approaches, but does not reach, some particular value as the input approaches some point;
for example, tan(θ) has an asymptote of ∞ as θ → π2 and f(x) = x− x
−2 has an asymptote of −∞ as x → 2 from below.) The
asymptotic behavior of a function is similarly motivated: we’re thinking about the growth rate of the function as n gets very
large.

Consider two functions f : R≥0 → R≥0 and g : R≥0 → R≥0 . (We will be interested in functions whose
domain and range are both nonnegative because we’re primarily thinking about functions that describe
the number of steps of a particular algorithm on an input of a particular size, and neither input size nor
number of computational steps executed can be negative.) The key concept of asymptotic analysis will be
a definition of the growth rates of the functions f and g, and how those growth rates compare: that is, what
it means to say that f grows faster than g (or, really, no slower than g); or that f grows at the same rate as g;
or that f grows slower (or no faster) than g.

6.2.1 Big O
We’ll start by defining what it means for the function f(n) to grow no faster than the function g(n), written
f(n) = O(g(n)). (Note: O is pronounced “big oh.”)

ISTUDY
270 Analysis of Algorithms

Definition 6.1: “Big O” [O].


Consider two functions f : R≥0 → R≥0 and g : R≥0 → R≥0 . We say that f grows no faster than g if there
exist constants c > 0 and n0 ≥ 0 such that
∀n ≥ n0 : f(n) ≤ c · g(n).
In this case, we write “f(n) is O(g(n))” or “f(n) = O(g(n)).”

Taking it further: The “=” in “f(n) = O(g(n))” is odd notation, but it’s also very standard. This expression means f(n) has the
property of being O(g(n)) and not f(n) is identical to O(g(n)). Philosophers sometimes distinguish between the “is” of identity
and the “is” of predication. In a sentence like Barbara Liskov is the 2008 Turing Award winner, we are asserting that Barbara
Liskov and the 2008 Turing Award Winner actually refer to the same thing—that is, they are identical. In a sentence like Barbara
Liskov is tall, we are asserting that Barbara Liskov (the entity to which Barbara Liskov refers) has the property of being tall—
that is, the predicate x is tall is true of Barbara Liskov. You should interpret the “=” in f(n) = O(g(n)) as an “is of predication.”
One reasonably accurate way to distinguish these two uses of is is by considering what happens if you reverse the order of the
sentence: The 2008 Turing Award Winner is Barbara Liskov is still a (true) well-formed sentence, but Tall is Barbara Liskov
sounds very strange. Similarly, for an “is of identity” in a mathematical context, we can say either x2 − 1 = (x + 1)(x − 1) or
(x + 1)(x − 1) = x2 − 1. But, while “f(n) = O(g(n))” is a well-formed statement, it is nonsensical to say “O(g(n)) = f(n).”

The intuition is that f(n) = O(g(n)) if, for all large enough n, we have f(n) ≤ constant · g(n). We get to
choose what counts as “large enough” and we also get to choose the value of constant.
Figure 6.2 shows five different functions f : R≥0 → R≥0 that all satisfy f(n) = O(n). (In the figure, the
value of x is “large enough” once x is outside of the shaded box, and the multiplicative constant is equal
to 3 in each subplot. For a function like f(x) = 4x, we’d show that f(n) = O(n) by choosing some c ≥ 4 as
the multiplicative constant.) More quantitatively, here are two examples of functions that are O(n2 ):

Example 6.2: A square function.


Prove that the function f(n) = 3n2 + 2 is O(n2 ).
Solution. We must identify constants c > 0 and n0 ≥ 0 such that ∀n ≥ n0 : 3n2 + 2 ≤ c · n2 . For all
n ≥ 1, observe that 2n2 ≥ 2. Therefore, for all n ≥ 1, we have
f(n) = 3n2 + 2 ≤ 3n2 + 2n2 = 5n2 .
Thus we can select c = 5 and n0 = 1. (See Figure 6.3a.)



25 − x
2


 if x < 3.5
f(x) = x + 8 f(x) =


f(x) = 2x 
 0.5x + 11
f(x) = x f(x) = 10 
if x ≥ 3.5

Figure 6.2 Five functions that are all O(n). For any x beyond the shaded box, we have f(x) ≤ 3x. (The first two
functions satisfy f(x) ≤ 3x for all nonnegative x, so there’s no shaded box necessary.) The dashed line corresponds to 3x.

ISTUDY
6.2 Asymptotics 271

n0 = 1 n0 = 4 n0 = 1

80 60 60
q(n) = 4n2 h(n) = n3
60 q(n) = n2 400
5n2
40 40 7n2
40
200 6n2
20 g(n) = 4n 20 g(n) = 4n
20 5n2
f(n) = 3n2 + 2
4n2

2 4 2 4 6 8 2 4 6 8 2 4 6 8
2 2
(a) f(n) is O(n ) with c = 5 (b) g(n) is O(n ) with c = 1 and n0 = 4. (c) No value of c has
and n0 = 1. Or, g(n) is O(n2 ) with c = 4 and n0 = 1. n3 < cn2 for all large n.

Figure 6.3 The functions f(n) = 3n2 + 2 and g(n) = 4n are O(n2 ); h(n) = n3 is not.

Example 6.3: Another square function.


Prove that the function g(n) = 4n is also O(n2 ).
Solution. We wish to show that 4n ≤ c · n2 for all n ≥ n0 , for constants c > 0 and n0 ≥ 0 that we get
to choose. The two functions g(n) and q(n) = n2 are shown in Figure 6.3b. Because the functions cross
(with no constant multiplier), we can pick c = 1. Observe that 4n ≤ n2 whenever n ≥ 4. (Or when n ≤ 0:
factoring n2 − 4n = n(n − 4) shows that the functions 4n and n2 cross at n = 0 and n = 4.) Thus c = 1
and n0 = 4 suffice.

Note that, when f(n) = O(g(n)), there are many different choices of c and n0 that satisfy the definition. For
example, we could have chosen c = 4 and n0 = 1 in Example 6.3 (again see Figure 6.3b), or c = 2 and
n0 = 2, or c = 1 and n0 = 128, or lots of other values. (See Exercise 6.15.)

Example 6.4: One nonsquare.


Prove that the function h(n) = n3 is not O(n2 ).
Solution. We need to argue that, for all constants n0 and c, there exists an n ≥ n0 such that h(n) > c · n2 .
(See Figure 6.3c.) We’ll proceed by contradiction. Imagine for a moment that we’ve identified values of
n0 and c that we hope satisfy the definition of h(n) = O(n2 ). But now define m = max(n0 , c + 1). Then:

• m ≥ n0 (because we chose m ≥ n0 ), and


• m3 > cm2 (because we chose m > c, and multiplying both sides by m2 yields m3 > cm2 ).

That means that we have found a value of n ≥ n0 such that n3 > c · n2 . Thus these values of n0 and c
do not satisfy the O(·) definition. Because n0 and c were generic, we have shown that no such constants
satisfying the condition ∀n ≥ n0 : h(n) ≤ cn2 can exist. Therefore h(n) = n3 is not O(n2 ).

ISTUDY
272 Analysis of Algorithms

Some properties of O(·)


Now that we’ve seen a few specific examples, let’s turn to some more general results. There are many useful
properties of O(·) that will come in handy later; we’ll start here with a few of these properties, together
with a proof of one.

Lemma 6.2: Asymptotic equivalence of max and sum.


We have f(n) = O(g(n) + h(n)) if and only if f(n) = O(max(g(n), h(n))).

Proof. We proceed by mutual implication.


For the forward direction, suppose that f(n) = O(g(n) + h(n)). Then by definition there exist constants
c > 0 and n0 ≥ 0 such that

∀n ≥ n0 : f(n) ≤ c · [g(n) + h(n)]


≤ c · [max(g(n), h(n)) + h(n)] for any y, we have g(n) ≤ max(g(n), y)
≤ c · [max(g(n), h(n)) + max(g(n), h(n))] for any x, we have h(n) ≤ max(x, h(n))
= 2c · max(g(n), h(n)).

We’ve now shown that ∀n ≥ n0 : f(n) ≤ 2c · max(g(n), h(n)), and this statement just is the definition of
f(n) = O(max(g(n), h(n))), using constants n′0 = n0 and c′ = 2c.
For the converse direction, suppose that f(n) = O(max(g(n), h(n))). Then there exist constants c > 0
and n0 ≥ 0 such that

∀n ≥ n0 : f(n) ≤ c · max(g(n), h(n))


≤ c · [max(g(n), h(n)) + min(g(n), h(n))] x ≤ x + y for any value of y ≥ 0
= c · [g(n) + h(n)] . min(x, y) + max(x, y) = x + y for any values of x and y

We’ve now shown that ∀n ≥ n0 : f(n) ≤ c[g(n) + h(n)], and this statement again just is the definition of
f(n) = O(g(n) + h(n)), using the same constants n′0 = n0 and c′ = c.

Problem-solving tip: Don’t force yourself to prove more than you have to! For example, when proving that an asymptotic
relationship like f(n) = O(g(n)) holds, all we need to do is identify some pair of constants c and n0 that satisfy Definition 6.1.
Don’t work too hard! Choose whatever c or n0 makes your life easiest, even if they’re much bigger than necessary. For asymptotic
purposes, we care that the constants c and n0 exist, but we don’t care how big they are.

Here are statements of a few other useful facts about O(·). (The proofs are left to Exercises 6.18–6.20.)

Lemma 6.3: Transitivity of O(·).


If f(n) = O(g(n)) and g(n) = O(h(n)), then f(n) = O(h(n)).

Lemma 6.4: Addition and multiplication preserve O(·)-ness.


If f(n) = O(h1 (n)) and g(n) = O(h2 (n)), then
f(n) + g(n) = O(h1 (n) + h2 (n)) and f(n) · g(n) = O(h1 (n) · h2 (n)).

ISTUDY
6.2 Asymptotics 273

Asymptotics of polynomials
So far, we’ve discussed properties of O(·) that are general with respect to the form of the functions in ques-
tion. But because we’re typically concerned with O(·) in the context of the running time of algorithms—and
we are generally interested in algorithms that are efficient—we’ll be particularly interested in the asymp-
totics of polynomials. The most salient point about the growth of a polynomial p(n) is that p(n)’s asymptotic
behavior is determined by the degree of p(n)—that is, the polynomial p(n) = a0 + a1 n + a2 n2 + · · · + ak nk
behaves like nk , asymptotically:

Lemma 6.5: Asymptotics of polynomials.


Pk
Let p(n) = i=0 ai ni be a polynomial. Then p(n) = O(nk ).

(If ak > 0, then indeed p(n) = O(nk ), and it is not possible to improve this bound—that is, in the notation
of Section 6.2.2, we have that p(n) = Θ(nk ).)
The proof of Lemma 6.5 is deferred to Exercise 6.21, but we have already seen the intuition in previous
examples: every term ai ni satisfies ai ni ≤ |ai | · nk , for any n ≥ 1.

Asymptotics of logarithms and exponentials


We will also often encounter logarithms and exponential functions, so it’s worth identifying a few of their
asymptotic properties. Again, we’ll prove one of these properties as an example, and leave proofs of many
of the remaining properties to the exercises. The first pair of properties is that logarithmic functions grow
more slowly than polynomials, which grow more slowly than exponential functions:

Lemma 6.6: log n grows slower than n0.0000001 .


Let ϵ > 0 be an arbitrary constant, and let f(n) = log n. Then f(n) = O(nϵ ).

Lemma 6.7: n1000000 grows slower than 1.0000001n .


Pk
Let b > 1 and k ≥ 0 be any constants, and let p(n) = i=0 ai ni be any polynomial. Then p(n) = O(bn ).

The second pair of properties is that logarithmic functions loga n and logb n grow at the same rate (for any
bases a > 1 and b > 1) but that exponential functions an and bn do not (for any bases a and b = ̸ a):

Lemma 6.8: The base of a logarithm doesn’t matter, asymptotically.


Let b > 1 and k > 0 be arbitrary constants. Then f(n) = logb (nk ) is O(log n).

Proof of Lemma 6.8. Using standard facts about logarithms, we have that

logb (nk ) = k · logb (n) Theorem 2.10.5: logb xy = y logb x


=k· log n
log b . Theorem 2.10.6 (change of base formula): logb x =
logc x
logc b

For any n ≥ 1, then, we have that f(n) = k


log b · log n. Therefore, we have that f(n) = O(log n), using the
constants n0 = 1 and c = logk b .

ISTUDY
274 Analysis of Algorithms

Lemma 6.9: The base of an exponential does matter, asymptotically.


Let b ≥ 1 and c ≥ 1 be arbitrary constants. Then f(n) = bn is O(cn ) if and only if b ≤ c.

Lemma 6.8 is the reason that, for example, binary search’s running time is generally described as O(log n)
rather than as O(log2 n), without any concern for writing the “2”: the base of the logarithm is inconsequen-
tial asymptotically, so O(log√2 n) and O(log2 n) and O(ln n) all mean exactly the same thing. In contrast,
for exponential functions, the base of the exponent does affect √ nthe asymptotic behavior: Lemma 6.9
n
says that, for example, the functions f(n) = 2 and g(n) = ( 2) do not grow at the same rate. (See
Exercises 6.25–6.28.)

Taking it further: Generally, exponential growth is a problem for computer scientists. Many computational problems that
are important and useful to solve seem to require searching a very large space of possible answers: for example, testing the
satisfiability of an n-variable logical proposition seems to require looking at about 2n different truth assignments, and factoring
an n-digit number seems to require looking at about 10n different candidate divisors. The fact that exponential functions grow
so quickly is exactly why we do not have algorithms that are practical for even moderately large instances of these problems.
(See p. 104.) But one of the most famous exponentially growing functions actually helps us to solve problems: the amount of
computational power available to a “standard” user of a computer has been growing exponentially for decades: about every 18
months, the processing power of a standard computer has roughly doubled. This trend—dubbed Moore’s Law, after Gordon
Moore, the co-founder of Intel—is discussed on p. 279.

6.2.2 Other Asymptotic Relationships: Ω, Θ, ω, and o


There are several basic asymptotic notions (with accompanying notation), based around two core ideas
(see Figure 6.4):

f(n) grows no faster than g(n). In other words, ignoring small inputs, for all n we have that
f(n) ≤ constant · g(n). This relationship is expressed by the O(·) notation: f(n) = O(g(n)). We can also
say that g is an asymptotic upper bound for f: if we plot n against f(n) and g(n), then g(n) will be “above”
f(n) for large inputs.

f(n) grows no slower than g(n). The opposite relationship, in which g is an asymptotic lower bound on f,
is expressed by Ω(·) notation. (Ω is the Greek letter Omega written in uppercase; ω, which we’ll see
soon, is the same Greek letter written in lowercase.) Again, ignoring small inputs, f(n) = Ω(g(n)) if
for all n we have that f(n) ≥ constant · g(n). (Notice that the inequality swapped directions from the
definition of O(·).)

Formal definitions
Here are the formal definitions of four other relationships based on these notions. (This notation is
summarized in Figure 6.4.)

Definition 6.10: “Big Omega” [Ω].


A function f grows no slower than g, written f(n) = Ω(g(n)), if there exist constants d > 0 and n0 ≥ 0
such that ∀n ≥ n0 : f(n) ≥ d · g(n).

The two fundamental asymptotic relationships, O(·) and Ω(·), are dual notions; they are related by the
property that f(n) = O(g(n)) if and only if g(n) = Ω(f(n)). (The proof is left as Exercise 6.30.)

ISTUDY
6.2 Asymptotics 275

Ω(g(n))

∃c > 0, n0 ≥ 0 such that ∃d > 0, n0 ≥ 0 such that


∀n ≥ n0 : f(n) ≤ c · g(n) ∀n ≥ n0 : f(n) ≥ d · g(n)

f grows no f grows no
faster than g slower than g
g(n) f(n) = O(g(n)) yes don’t care f grows no faster than g
f(n) = Ω(g(n)) don’t care yes f grows no slower than g
f(n) = Θ(g(n)) yes yes f grows at the same rate as g
O(g(n)) f(n) = o(g(n)) yes no f grows strictly slower than g
f(n) = ω(g(n)) no yes f grows strictly faster than g

(a) A visualization. (b) A summary of the O, Ω, Θ, o, and ω asymptotic notation.

Figure 6.4 Two different ways to summarize asymptotic notation.

There are three other pieces of asymptotic notation, corresponding to the situations in which f(n) is both
O(g) and Ω(g), or O(g) but not Ω(g), or Ω(g) but not O(g):

Definition 6.11: “Big Theta” [Θ].


A function f grows at the same rate as g, written f(n) = Θ(g(n)), if f(n) = O(g(n)) and f(n) = Ω(g(n)).

Definition 6.12: “Little o” [o].


A function f grows (strictly) slower than g, written f(n) = o(g(n)), if f(n) = O(g(n)) but f(n) ̸= Ω(g(n)).

Definition 6.13: “Little omega” [ω].


A function f grows (strictly) faster than g, written f(n) = ω(g(n)), if f(n) = Ω(g(n)) but f(n) ̸= O(g(n)).

Example 6.5: f = (n).


2
Let f(n) = 3n + 1. Is f(n) = O(n)? Ω(n)? Θ(n)? o(n)? ω(n)? Prove your answers.
Solution. Once we determine whether f(n) = O(n) and whether f(n) = Ω(n), we can answer all parts of
the question using Figure 6.4.
Claim 1: f(n) = Ω(n). For any n ≥ 1, we have that n ≤ n2 ≤ 3n2 + 1 = f(n). Thus selecting d = 1 and
n0 = 1 satisfies Definition 6.10.
Claim 2: f(n) ̸= O(n). Consider any c > 0. For any n ≥ 3c , we have that 3n2 + 1 > 3n2 ≥ c · n. Therefore,
for any n0 > 0, there exists an n ≥ n0 such that f(n) > c · n. (Namely, let n = max(n0 , 3c ). This value of
n satisfies n ≥ n0 and f(n) > c · n.) Thus, every c > 0 and n0 ≥ 0 fail to satisfy the requirements of
Definition 6.1, and therefore f(n) ̸= O(n).
̸ O(n) with Figure 6.4, we can also conclude that
Assembling the facts that f(n) = Ω(n) and f(n) =
̸ Θ(n), and f(n) ̸= o(n).
f(n) = ω(n), f(n) =

ISTUDY
276 Analysis of Algorithms

Taking it further: We’ve given definitions of O(·), Ω(·), Θ(·), o(·), and ω(·) that are based on nested quantifiers: there exists
a multiplicative constant such that, for all sufficiently large n, etc. If you have a more calculus-based worldview, we could also
give an equivalent definition in terms of limits:

• f(n) = O(g(n)) if limn→∞ f(n)/g(n) is finite;


• f(n) = Ω(g(n)) if limn→∞ f(n)/g(n) is nonzero;
• f(n) = Θ(g(n)) if limn→∞ f(n)/g(n) is finite and nonzero;
• f(n) = o(g(n)) if limn→∞ f(n)/g(n) = 0; and
• f(n) = ω(g(n)) if limn→∞ f(n)/g(n) = ∞.

f( n)
For the function f(n) = 3n2 +1 in Example 6.5, for example, observe that limn→∞ n = ∞. Thus f(n) = Ω(n) and f(n) = ω(n),
but none of the other asymptotic relationships holds.

A (possibly counterintuitive) example


Intuitively, the asymptotic symbols O, Ω, Θ, o, and ω correspond to the numerical comparison symbols
≤, ≥, =, <, and >—but the correspondence isn’t perfect, as we’ll see in this example:

Example 6.6: Finding functions, to spec.


For each of (a), (b), (c), and (d) in the following table, fill in the blank with an example of a function f
that satisfies the stated conditions.

f(n) = O(n2 ) f(n) ̸= O(n2 )


2
f(n) = Ω(n ) (a) (b)
f(n) ̸= Ω(n2 ) (c) (d)

Solution. Three of these cells are fairly straightforward to complete:

(a) f(n) = n2 is Θ(n2 )—that is, it satisfies both O(n2 ) and Ω(n2 ).
(b) f(n) = n is o(n2 )—that is, it satisfies O(n2 ) but not Ω(n2 ).
(c) f(n) = n3 is ω(n2 )—that is, it satisfies Ω(n2 ) but not O(n2 ).

But cell (d)—a function f(n) that is neither O(n2 ) nor Ω(n2 )—appears more challenging (perhaps even
impossible). But let’s look at the definitions carefully:

For f(n) ̸= O(g(n)) to be true, we need, for any constants c > 0 and n0 ≥ 0, that there exists n ≥ n0 such
that f(n) > cn2 .
Similarly, for f(n) ̸= Ω(n2 ) to be true, we need, for any constants d > 0 and n0 ≥ 0, there to exist n ≥ n0
such that f(n) < dn2 .

(In other words, intuitively, if I tell you a value of n0 , you have to be sure that f both exceeds n2 at some
point n ≥ n0 and is exceeded by n2 for some point n ≥ n0 . The “trick” is that it doesn’t have to be the
same value of n!) Here’s one way to simultaneously achieve these conditions. We’ll define the function
f in a piecewise manner, so that for, say, even values of n the function grows faster than n2 , and for odd
values it grows slower:
(
n n3 if n is even
f(n) = n2+(−1) =
n if n is odd.

ISTUDY
6.2 Asymptotics 277

(See Figure 6.5b for a plot of this function.) Below, we’ll argue formally that f(n) ̸= O(n2 ). Together with
the proof that f(n) ̸= Ω(n2 ), which is left to you as Exercise 6.44, this function will allow us to finish the
required table. (See Figure 6.5a.)
To prove that f(n) ̸= O(n2 ), let c > 0 and n0 ≥ 0 be arbitrary. Let n be the smallest even number strictly
greater than max(c, n0 ). Then f(n) = n3 and n3 > c · n2 because we chose n > c. But then it is not the
case that ∀n ≥ n0 : f(n) ≤ cn2 . Because this argument holds for arbitrary c > 0 and n0 ≥ 0, we conclude
that f(n) ̸= O(n2 ).

Problem-solving tip: When you’re confronted with a problem with seemingly contradictory constraints, as in the bottom-right
cell of the table in Example 6.6, very carefully write down what the constraints require. This process can help you see why the
constraints aren’t actually contradictory.

Some properties of Ω, Θ, o, and ω


Many of the properties of O(·) also hold for the other four asymptotic notions; for example, all five of O(·),
Ω(·), Θ(·), o(·), and ω(·) obey transitivity, and several obey reflexivity. See Exercises 6.45–6.53.
One of the subtlest aspects of asymptotic notation is the fact that two functions can be incomparable
with respect to their rates of growth: we can identify two functions f and g such that none of the asymptotic
̸ O(g), f ̸= Ω(g),
relationships holds. (That is, it’s possible for all five of these statements to be true: f =
f≠ Θ(g), f ̸= o(g), and f ̸= ω(g).) Intuitively,
f(n) = O(g(n)) means (roughly) “the growth rate of f ≤ the growth rate of g.” (A)
f(n) = Ω(g(n)) means (roughly) “the growth rate of f ≥ the growth rate of g.” (B)
Definitions 6.11, 6.12, and 6.13 correspond to three of the four combinations of truth values for these two
statements: (A) and (B) is Θ; (A) but not (B) is o; and (B) but not (A) is ω. But be careful! For two real
numbers a ∈ R and b ∈ R, it’s impossible for a ≤ b and a ≥ b to both be false. But it is possible for both of
the inequalities (A) and (B) to be false! The functions g(n) = n2 and the function f(n) from Example 6.6
that equals either n3 or n depending on the parity of n are an example of a pair of functions for which
neither (A) nor (B) is satisfied. (Thus f was neither O(n2 ) nor Ω(n2 ), and therefore not ω(n2 ), o(n2 ), or
Θ(n2 ), either.)

Taking it further: The real numbers satisfy the mathematical property of trichotomy (Greek: “division into three parts”): for
a, b ∈ R, exactly one of {a < b, a = b, a > b} holds. Functions compared asymptotically do not obey trichotomy: for two
functions f and g, it’s possible for none of {f = o(g), f = Θ(g), f = ω(g)} to hold.

400
n3

200 f(n)
f(n) = O(n2 ) ̸ O(n2 )
f(n) =
f(n) = Ω(n2 ) f(n) = n2 f(n) = n3
( n
n3 if n is even
f(n) ̸= Ω(n2 ) f(n) = n f(n) = 2 4 6 8
n if n is odd.
(b) A plot of the function f(n) from
(a) The summary of a solution to Example 6.6. (a) that is neither O(n2 ) nor Ω(n2 ).

Figure 6.5 A solution, and a plot, for Example 6.6.

ISTUDY
278 Analysis of Algorithms

Before we begin to apply asymptotic notation to the analysis of algorithms, we’ll close this section with
a few notes about the use (and abuse) of asymptotic notation.

Using asymptotics in arithmetic expressions


It is often convenient to use asymptotic notation in arithmetic expressions. We permit ourselves to write
something like O(n log n) + O(n3 ) = O(n3 ), which intuitively means that, given functions that grow no
faster than n log n and n3 , their sum grows no faster than n3 too. When asymptotic notation like O(n2 )
appears on the left-hand side of an equality, we interpret it to mean an arbitrary unnamed function that
grows no faster than n2 . For example, making log n calls to an algorithm whose running time is O(n)
requires log n · O(n) = O(n log n) time.

Using asymptotics with multiple variables


It will also occasionally turn out to be convenient to be able to write asymptotic expressions that depend
on more than one variable. Giving a precise technical definition of multivariate asymptotic notation is a
bit subtle, but the intuition precisely matches the univariate definitions we’ve already given. We’ll use the
notation g(n, m) = O(f(n, m)) to mean “for all sufficiently large n and m, there exists a constant c such that
g(n, m) ≤ c · f(n, m).” For example, the function f(n, m) = n2 + 3m − 5 satisfies f(n, m) = O(n2 + m).

A common mistake and some meaningless language


There is a widespread—and incorrect—sloppy use of asymptotic notation: it is unfortunately common for
people to use O(·) when they mean Θ(·). You will sometimes encounter claims like:
“I prefer f to g, because f(n) = O(n2 ) and g(n) = O(n3 ).” (1)
But this logic doesn’t make sense: O(·) defines only an upper bound, so either of f or g might grow more
slowly than the other! Saying (1) is like saying
“Alice is richer than Bob, because Alice has at most $1,000,000,000 and Bob has at most $1,000,000.”
(2)
(Alice might be richer than Bob, sure. But perhaps they both have 20 bucks each, or perhaps Bob has
$1,000,000 and Alice has nothing.) Use O(·) when you mean O(·), and use Θ(·) when you mean Θ(·)—and
be aware that others may use O(·) improperly. (And, gently, correct them if they’re doing so.)
There’s a related imprecise use of asymptotics that leads to statements that don’t mean anything. For
example, consider statements like “f(n) is at least O(n3 )” or “f(n) is at most Ω(n2 ).” These sentences have
no meaning: they say “f(n) grows at least as fast as at most as fast as n3 ” and “f(n) grows at most as fast as
at least as fast as n2 .” (?!?) Be careful: use upper bounds as upper bounds, and use lower bounds as lower
bounds! Again, by analogy, consider these sentences (thanks to Tom Wexler for suggesting (5)):
“My weight is more than ≤ 100 kilograms” (3)
or “I am shorter than some person who is taller than 4 feet tall.” (4)
or “You could save up to 50% or more!” (5)
None of these sentences says anything!

ISTUDY
6.2 Asymptotics 279

Computer Science Connections

Vacuum Tubes, Transistors, and Moore’s Law


The earliest electronic computers—machines like the Colossus at Bletchley Park and the ENIAC at the University
of Pennsylvania—were initially developed in the early-to-mid 1940s. These first electronic machines used vacuum
tubes (or valves) as the core of their circuitry. Vacuum tubes that can be used for logical switching are devices that
look a bit like a light bulb—and they also burn out at roughly similar rates to light bulbs. (These early computers
contained tens of thousands of these devices, which meant that a vacuum tube would fail every day or two.) The
discovery of the transistor was a major breakthrough of the late 1940s; transistors can do the same kinds of things as
vacuum tubes, but they were substantially more energy efficient and smaller—and have gotten smaller and smaller
as time has gone on. (Vacuum tubes are ≈centimeters in size; transistors are now ≈10 nanometers in size.)
In 1965, Gordon Moore, one of the co-founders of Intel, published an article making a basic prediction—and it’s
been reinterpreted many times—that processing power would double roughly once every 18–24 months [91]. (It’s
been debated and revised over time, by, for example, interpreting “processing power” as the number of transistors on
the processor rather than what it can actually compute.) This prediction later came to be known as Moore’s Law—not
a real “law” like Ohm’s Law or the Law of Large Numbers, of course, but rather simply a prediction. That said, it’s
proven to be a remarkably robust prediction: for something like 40 to 50 years, it has proven to be a consistent guide
to the massive increase in processing power for a typical computer user over many decades. (See Figure 6.6.)
Claims that “Moore’s Law is just about to end!” have been made for many decades, and yet Moore’s Law has still
proven to be remarkably accurate over time. Its imminent demise is still being predicted today, and yet it’s still been a
pretty good model of computing power [92]. One probable reason that Moore’s Law has held for as long as it has is a
little bizarre: the repeated publicity surrounding Moore’s Law! Because chip-manufacturing companies “know” that
the public generally expects the number of transistors on processors to double every 2 years, these companies may
actually be setting research-and-development targets based on meeting Moore’s Law. (Just as in a physical system,
we cannot observe a phenomenon without changing it!) But Moore’s Law–level growth must come to an end at some
point—we’re beginning to run up against physical limits in the size of transistors; modern transistors aren’t much
bigger than the size of a silicon atom—and so we’re entering what people tend to call a “post-Moore world,” in which
the now-expected performance improvement over time will have to come from some other kind of innovation.

Data are from an Intel press release celebrating the 40th anniversary of the original
Number of transistors
Intel Pentium 4

http://www.intel.com/pressroom/kits/events/moores_law_40th/
108

hs
107 m ont
ery 24
ling ev
106 doub
ors
sist Intel Pentium
ran
er of t
105 numb
dl ine: Intel 386
dashe
publication of Moore’s Law.

104

103
Intel 4004
Year of introduction
102
1970 1975 1980 1985 1990 1995 2000 2005

Figure 6.6 The number of transistors per processor, for Intel brand processors introduced over the last 50 years. The
dashed line indicates the rate of growth we’d see if the number of transistors per processor doubled every 2 years
(starting with the Intel 4004 in 1971).

ISTUDY
280 Analysis of Algorithms

EXERCISES
Part of the motivation for asymptotic analysis was that algorithms are typically analyzed ignoring constant factors. Ignoring
constant factors in analyzing an algorithm may seem strange: if algorithm A runs twice as fast as B, then A is way faster! But
the reason we care more about asymptotic running time is that even an improvement by a factor of two is quickly swamped by an
asymptotic improvement for even slightly larger inputs. Here are a few examples:
6.1 Suppose that linear search can find an element in a sorted list of n elements in n steps on a particular machine. Binary search
(perhaps not implemented especially efficiently) requires 100 log n steps. For what values of n ≥ 2 is linear search faster?

Alice implements Merge Sort so, on a particular machine, it requires exactly ⌈8n log n⌉ steps to sort n elements. Bob implements
Heap Sort so it requires exactly ⌈5n log n⌉ steps to sort n elements. Charlie implements Selection Sort so it requires exactly 2n2
steps to sort n elements. Suppose that Alice can sort 1000 elements in 1 minute.
6.2 How many elements can Bob sort in a minute? How many can Charlie sort in a minute?
6.3 What is the largest value of n that Charlie can sort faster than Alice?
6.4 Charlie, devastated by the news from Exercise 6.3, buys a computer that’s twice the speed of Alice’s. What is the largest value of
n that Charlie can sort faster than Alice now?

Let f(n) = 9n + 3 and let g(n) = 3n3 − n2 . (See Figure 6.7.)


Let a(n) = 7n, let b(n) = 3n2 + sin n, and let c(n) = 128. (See Figure 6.8.)
6.5 Prove that f(n) = O(n). 6.10 Prove that g(n) is not O(n2 ).
6.6 2
Prove that f(n) = O(n ). 6.11 Prove that g(n) is not O(n3−ϵ ), for any ϵ > 0.
6.7 Prove that f(n) = O(g(n)). 6.12 Prove that a(n) is O(n2 ).
6.8 Prove that g(n) = O(n3 ). 6.13 Prove that b(n) is O(n2 ).
6.9 4
Prove that g(n) = O(n ). 6.14 Prove that c(n) is O(n2 ).
6.15 Consider two functions f : Z≥0 → Z≥0 and g : Z≥0 → Z≥0 . We defined O(·) notation as follows:

f(n) = O(g(n)) if there exist constants c > 0 and n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≤ c · g(n).

Suppose f(n) = O(g(n)). Explain why there are infinitely many choices of c and infinitely many choices of n0 that satisfy the
definition of O(·).
6.16 It turns out that both c and n0 are necessary to the definition of O(·). (See Exercise 6.15 for a reminder of the definition.) Define
the following alternative asymptotic notation, leaving out c (using c = 1) from the definition:

f(n) = P(g(n)) if there exists a constant n0 ≥ 0 such that ∀n ≥ n0 : f(n) ≤ g(n).

̸ P(g),
Prove that P(·) and O(·) mean different things: prove that there exist functions f and g such that either (i) f = O(g) but f =
or (ii) f ̸= O(g) but f = P(g).
6.17 Repeat Exercise 6.16, this time showing that we cannot omit n0 from the definition. Define the following asymptotic notation
(which implicitly uses n0 = 1):

f(n) = Q(g(n)) if there exists a constant c > 0 such that ∀n ≥ 1 : f(n) ≤ c · g(n).

̸ Q(g), or (ii) f ̸= O(g) but f = Q(g).


Prove that there exist functions f and g such that either (i) f = O(g) but f =

g(n) = 3n3 − n2 b(n) = 3n2 + sin n

40 200
f(n) = 9n + 3 c(n) = 128
20 100

a(n) = 7n

1 2 3 4 5 2 4 6 8 10 12

Figure 6.7 Two functions for Exercises 6.5–6.11. Figure 6.8 Three functions for Exercises 6.12–6.14.

ISTUDY
Exercises 281

The next several exercises ask you to prove some of properties of O(·) that we stated without proof earlier in the section. (For a
model of a proof of this type of property, see Lemma 6.2 and its proof in this section.)
6.18 Prove Lemma 6.3, the transitivity of O(·): if f(n) = O(g(n)) and g(n) = O(h(n)), then f(n) = O(h(n)).
6.19 Prove the first half of Lemma 6.4: if f(n) = O(h1 (n)) and g(n) = O(h2 (n)), then f(n) + g(n) = O(h1 (n) + h2 (n)).
6.20 Prove the second half of Lemma 6.4: if f(n) = O(h1 (n)) and g(n) = O(h2 (n)), then f(n) · g(n) = O(h1 (n) · h2 (n)).
P
6.21 Prove Lemma 6.5: if p(n) = ki=0 ai ni is a polynomial, then p(n) = O(nk ).
P
6.22 Prove that the bound from Exercise 6.21 cannot be improved. That is, consider p(n) = ki=0 ai ni with ak > 0. Prove that p(n) is
not O(nk−ϵ ) for any ϵ > 0.

Lemmas 6.6 and 6.7 state that all logarithmic functions grow slower than all polynomial functions, which grow slower than all
exponential functions. (For example, log n = O(n0.000001 ) and n1000000 = O(1.000001n ).) While fully general proofs are more
calculus-intensive than we want to be in this book, here are a few simpler results to prove:
6.23 Assuming Lemma 6.6, prove that any polylogarithmic function f(n) = logk (n) satisfies f(n) = O(nϵ ) for any ϵ > 0 and any integer
k ≥ 0. (A polylogarithmic function is one that’s a polynomial where the terms are powers of log n instead of powers of n—hence
a poly(nomial of the) log function.)
6.24 Prove the special case of Lemma 6.6 for ϵ = 1: that is, prove that log n = O(n). Specifically, do so by proving that log n ≤ n for
all integers n ≥ 1, using strong induction.

The next three exercises explore whether the asymptotic properties of two functions f and g “transfer over” to the functions log f and
log g. Specifically, consider two functions f : Z≥0 → Z≥1 and g : Z≥0 → Z≥1 . (Note: the outputs of f and g are always positive,
so that log(f(n)) and log(g(n)) are well defined.)
6.25 Assume that, for all n, we have f(n) ≥ n and g(n) ≥ n. Furthermore assume that f(n) = O(g(n)). Prove that the function
ℓ(n) = log(f(n)) satisfies ℓ(n) = O(log(g(n))).
6.26 Prove that the converse of Exercise 6.25 is not true: identify functions f(n) and g(n) where f(n) ≥ n and g(n) ≥ n such that
log(f(n)) = O(log(g(n))) but f(n) ̸= O(g(n)). (Hint: what’s log n2 ?)
6.27 Prove that assuming f(n) ≥ n and g(n) ≥ n in Exercise 6.25 was necessary: identify functions f : Z≥0 → Z≥1 and g : Z≥0 → Z≥1
where f(n) = O(g(n)) but ℓ(n) ̸= O(log(g(n))) for the function ℓ(n) = log(f(n)).

6.28 For a real number b ≥ 1, define the function f(n) = bn . Prove Lemma 6.9: f(n) = O(cn ) if and only if b ≤ c.

6.29 Just as with a virus (as we now know all too well), an idea “going viral”—a video, a joke, a hashtag, an app—can be reasonably
modeled as a form of exponential growth: if each person who “adopts” the idea on a particular day causes two others to adopt that
idea the next day, then 1 adopter on Day Zero means 2 new ones on Day One (for a total of 3), and 4 new ones on Day Two (for a
total of 7), etc. Here we might call 2 the spreading rate, the number of people “infected” by each new adopter. Let r0 ∈ Z≥1 be a
P
spreading rate. Define f(n) = ni=1 (r0 )i to be the number of people who have adopted by Day n. Is f(n) = O((r0 )n )? Prove your
answer.

6.30 Prove that f(n) = O(g(n)) if and only if g(n) = Ω(f(n)).

Consider the function f(n) = n + n1 . (See Figure 6.9.) Because f(0) is undefined and the output f(n) is not an integer for any integer
n ≥ 2, treat f as a function from Z≥1 to R. Prove all of your answers to the following questions:
6.31 Is f(n) = O(1)? Ω(1)? Θ(1)? o(1)? ω(1)?
6.32 Is f(n) = O(n)? Ω(n)? Θ(n)? o(n)? ω(n)?
6.33 Is f(n) = O(n2 )? Ω(n2 )? Θ(n2 )? o(n2 )? ω(n2 )?

For an integer n ≥ 0, let k(n) denote the nonnegative integer such that 2k(n) ≤ n < 2k(n)+1 . That is, 2k(n) takes n and “rounds
down” to a power of two: for example, 2k(4) = 22 = 4 and 2k(5) = 22 = 4 and 2k(202) = 27 = 128 and 2k(55,057) = 215 = 32,768.
(See Figure 6.10.)
6.34 Prove that 2k(n) and 2k(n)+1 are both Θ(n).
6.35 Prove that k(n) = Θ(log n).
6.36 Let b > 1 be an arbitrary constant. Let kb (n) denote the nonnegative integer such that bkb (n) ≤ n < bkb (n)+1 . Prove that
kb (n) = Θ(log n) for any constant value b > 1.

6.37 In Chapter 11, we’ll talk about graphs and the “density” of graphs. If f(n) denotes the number of edges in an n-node graph (we’ll
define those terms later), then a graph is called sparse if f(n) = O(n) and it’s called dense if f(n) = Θ(n2 ). Prove that there exists
a function f : Z≥0 → Z≥0 satisfying 0 ≤ f(n) ≤ n2 such that neither f(n) = Θ(n2 ) nor f(n) = O(n).
6.38 Prove or disprove: the all-zero function f(n) = 0 is the only function that is Θ(0).

ISTUDY
282 Analysis of Algorithms

10 10
8 8
6 6
4 4
2 2

2 4 6 8 64 128 256 512

Figure 6.9 The function f(n) = n + 1n , for Figure 6.10 The function k(n), where k(n) is the
Exercises 6.31–6.33. nonnegative integer such that 2k(n) ≤ n < 2k(n)+1 , for
Exercises 6.34–6.36.

6.39 Give an example of a function f(n) such that f(n) = Θ(f(n)2 ).


6.40 Let k ∈ Z≥0 be any constant. Prove that nk = o(n!).
6.41 Let f : Z≥0 → Z≥0 be an arbitrary function. Define the function g(n) = f(n) + 1. Prove that g(n) = O(f(n)) if and only if
f(n) = Ω(1).
6.42 For each of the four blanks (a), (b), (c), and (d) in the following table, identify an example of a function f that satisfies the stated
conditions, or argue that it’s impossible to satisfy both conditions.

f(n) = o(n2 ) f(n) ̸= o(n2 )


f(n) = ω(n2 ) (a) (b)
f(n) ̸= ω(n2 ) (c) (d)

6.43 Let f and g be arbitrary functions. Prove that at most one of the three properties f(n) = o(g(n)) and f(n) = Θ(g(n)) and
f(n) = ω(g(n)) can hold.
6.44 Complete the proof in Example 6.6: prove that f(n) ̸= Ω(n2 ), where f(n) is the function
(
n3 if n is even
f(n) =
n if n is odd.

Many (but not all) of the properties of O(·) also hold for the other four asymptotic notions. Consider the following properties, for
arbitrary functions f, g, and h:
6.45 Prove that Ω is transitive: if f(n) = Ω(g(n)) and g(n) = Ω(h(n)), then f(n) = Ω(h(n)).
6.46 Prove that Θ is transitive: if f(n) = Θ(g(n)) and g(n) = Θ(h(n)), then f(n) = Θ(h(n)).
6.47 Prove that o is transitive: if f(n) = o(g(n)) and g(n) = o(h(n)), then f(n) = o(h(n)).
6.48 Is Ω symmetric? Prove or disprove: if f(n) = Ω(g(n)), then g(n) = Ω(f(n)).
6.49 Is Θ symmetric? Prove or disprove: if f(n) = Θ(g(n)), then g(n) = Θ(f(n)).
6.50 Is ω symmetric? Prove or disprove: if f(n) = ω(g(n)), then g(n) = ω(f(n)).
6.51 Is O reflexive? Prove or disprove: f(n) = O(f(n)).
6.52 Is Ω reflexive? Prove or disprove: f(n) = Ω(f(n)).
6.53 Is ω reflexive? Prove or disprove: f(n) = ω(f(n)).

6.54 Consider the false claim below, and the bogus proof that follows. Where, precisely, does the proof go wrong?

False Claim. The function f(n) = n2 satisfies f(n) = O(n).

Bogus proof. We proceed by induction on n.


For the base case (n = 1), we have n2 = 1. Thus f(1) = O(n) because 1 ≤ n for all n ≥ 1. (Choose c = 1 and n0 = 1.)
For the inductive case (n ≥ 2), we assume the inductive hypothesis—namely, assume that (n − 1)2 = O(n). We must show that
n2 = O(n). Here is the proof:

n2 = (n − 1)2 + 2n − 1 by factoring

= O(n) + 2n − 1 by the inductive hypothesis

= O(n) + O(n) by definition of O(·) and Lemma 6.4

= O(n).

ISTUDY
6.3 Asymptotic Analysis of Algorithms 283

6.3 Asymptotic Analysis of Algorithms


But why this idle toil to paint that day,
This time elaborately thrown away?
Edward Young (1683–1765)
“A Poem On The Last Day” (1713)

The main reason that computer scientists are interested in asymptotic analysis is for its application to the
analysis of algorithms. When, for example, we compare different algorithms that solve the same problem—
say, Merge Sort, Selection Sort, and Insertion Sort—we want to be able to give a meaningful answer to
the question which algorithm is the fastest? (And different inputs may trigger different behaviors in the
algorithms under consideration: when the input array is sorted, for example, Insertion Sort is faster than
Merge Sort and Selection Sort; when the input is very far from sorted, Merge Sort is fastest. But typically
we still would like to identify a single answer to the question of which algorithm is the fastest.)
When evaluating the running time of an algorithm, we generally follow asymptotic principles. Specifi-
cally, we will generally ignore constants in the two ways that O(·) and its asymptotic siblings do:
(1) We don’t care much about what happens for small inputs. There might be small special-case inputs
for which an algorithm is particularly fast, but this fast performance on a few special inputs doesn’t mean
that the algorithm is fast in general. For example, imagine an algorithm for primality testing that returns
true if it’s given as input any of 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, or 97; it returns false if it’s given any other input less than 100; and for numbers greater
than 100 it calls isPrime from Figure 4.32. Despite its speed on a few special cases, we wouldn’t consider
this algorithm to be faster in general than isPrime. (Incidentally, it’s also harder to persuade yourself that
it’s correct—it’s so easy to have made a typo in that list of small primes!) We seek general answers to the
question which algorithm is faster?, which leads us to pay little heed to special cases.
(2) We typically evaluate the running time of an algorithm not by measuring elapsed time on the “wall
clock,” but rather by counting the number of steps that the algorithm takes to complete. (How long a
program takes on your laptop, in terms of the hypothetical clock hanging on your wall, is affected by
all sorts of things unrelated to the algorithm, like how many videos you’re watching while the algorithm
executes.) We will generally ignore multiplicative constants in counting the number of steps consumed by
an algorithm. One reason is so that we can give a machine-independent answer to the which algorithm is
faster? question; how much is accomplished by one instruction on an Intel processor may be different from
one instruction on an ARM processor, and ignoring constants allows us to compare algorithms in a way
that doesn’t depend on grungy details about the particular machine.

Definition 6.14: Running time of an algorithm on a particular input.


Consider an algorithm A and an input x. The running time of algorithm A on input x is the number of
primitive steps that A takes when it’s run on input x.

For example, imagine running binarySearch on the input x = ⟨[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31], 4⟩.
The precise number of primitive steps in this execution depends on the particular machine on which the
algorithm is being run, but it involves successively comparing 4 to 13, then 5, then 2, and finally 3.
Taking it further: Definition 6.14 is intentionally vague about what a “primitive step” is, but it’s probably easiest to think of
a single machine instruction as a primitive step. That single machine instruction might add or compare two numbers, increment
a counter, return a value, etc. Different hardware systems might have different granularity in their “primitive steps”—perhaps a
Mac desktop can “do more” in one machine instruction than an iPhone can do—but, as we just indicated, we’ll look to analyze

ISTUDY
284 Analysis of Algorithms

algorithms independently of this detail. Theoretical computer scientists who work on algorithmic analysis study a very precise
model of computation, in which a “primitive step” is very carefully defined. By far the most common model is the Turing machine,
named after the British code-breaking, artificial intelligence, and computing pioneer Alan Turing (1912–1954), who defined the
model in a groundbreaking paper in the 1930s [127].
We typically evaluate an algorithm’s efficiency by counting asymptotically of the number of primitive steps used by an algorithm’s
execution, rather than by using a stopwatch to measure how long the algorithm actually takes to run on a particular input on a
particular machine. One reason is that it’s very difficult to properly measure this type of performance; see p. 293 for some
discussion about why.
In certain applications, particularly in scientific computing (the subfield of CS devoted to processing and analyzing real-valued
data, where we have to be concerned with issues like accumulated rounding errors in long calculations), it is typical to use a
variation on asymptotic analysis. Calculations on integers are substantially cheaper than those involving floating-point values
(see p. 21); thus in this field one typically doesn’t bother counting integer operations, and instead we only track floating-point
operations, or flops. Because flops are substantially more expensive, often we’ll keep track of the constant on the leading (highest-
degree) term—for example, an algorithm might require 32 n2 + O(n log n) flops or 2n2 + O(n) flops. (We’d choose the former.)

6.3.1 Worst-Case Analysis


We will generally evaluate the efficiency of an algorithm A by thinking about its performance as the input
gets large: what happens to the number of steps consumed by A as a function of the input size n? Further-
more, we generally assume the worst: when we ask about the running time of an algorithm A on an input
of size n, we are interested in the running time of A on the input of size n for which A is the slowest.

Definition 6.15: Worst-case running time of an algorithm.


The worst-case running time of an algorithm A is
 
TA (n) = max the number of primitive steps used by A on input x .
x:|x|=n

We will be interested in the asymptotic behavior of the function TA (n).

When we perform worst-case analysis of an algorithm—analyzing the asymptotic behavior of the function
TA (n)—we seek to understand the rate at which the running time of the algorithm increases as the input
size increases. Because a primary goal of algorithmic analysis is to provide a guarantee on the running
time of an algorithm, we will be pessimistic, and think about how quickly A performs on the input of size
n that’s the worst for algorithm A.

Taking it further: Occasionally we will perform average-case analysis instead of worst-case analysis, by computing the
expected (average) performance of algorithm A for inputs drawn from an appropriate distribution. It can be difficult to decide
on an appropriate distribution, but sometimes this approach makes more sense than being purely pessimistic. See Section 6.3.2.
It’s also worth noting that using asymptotic, worst-case analysis can sometimes be misleading. There are occasions in which an
algorithm’s performance in practice is very poor despite a “good” asymptotic running time—for example, because the multi-
plicative constant suppressed by the O(·) is massive. (And conversely: sometimes an algorithm that’s asymptotically slow in the
worst case might perform very well on problem instances that actually show up in real applications.) Asymptotics capture the
high-level performance of an algorithm, but constants matter too!

Some examples: three sorting algorithms


Figure 6.11 shows a sampling of worst-case running times for a number of the algorithms you may have
encountered earlier in this book or in previous CS classes. In the rest of this section, we’ll prove some of
these results as examples. We’ll start our analysis with Selection Sort, shown in Figure 6.12.

ISTUDY
6.3 Asymptotic Analysis of Algorithms 285

worst-case running time sample algorithm(s)


Θ(1) push/pop in a stack with n elements
Θ(log n) binary search in an array with n elements
Θ(n) linear search in an array with n elements
Θ(n log n) merge sort of an array of n elements
Θ(n2 ) selection sort, insertion sort, or bubble sort of an array of n elements
Θ(n3 ) naïve matrix multiplication of two n-by-n matrices
Θ(2n ) brute-force satisfiability algorithm for a proposition with n variables

Figure 6.11 The running time of some sample algorithms.

3 5 2 1 4
selectionSort(A[1 . . . n])
1 for i := 1 to n: 1 5 2 3 4
2 minIndex := i Elements in the
3 for j := i + 1 to n: shaded region 1 2 5 3 4
are sorted (and
4 if A[j] < A[minIndex] then have been placed 1 2 3 5 4
5 minIndex := j permanently).
6 swap A[i] and A[minIndex] “Swap” the element
1 2 3 4 5 5 with itself.

Figure 6.12 Selection Sort: pseudocode and an example. We repeatedly find the minimum element in the unsorted
region of the array A, and swap it into the first slot of the unsorted segment.

Example 6.7: Selection Sort.


What is the worst-case running time of Selection Sort?
Solution. The outer for loop’s body (Lines 2–6) is executed n times, once each for i = 1, 2, . . . , n. We
complete the body of the inner for loop (Lines 4–5) a total of n − i times in iteration i. Thus the total
number of times that we execute Lines 4–5 is
Pn Pn
= n 2−n ,
2
n − i = n2 − i = n2 − n(n+1)
2
i=1 i=1
Pn
where i=1 i = n(n+1) 2 by Theorem 5.3.
Notice that the only variation in the running time of Selection Sort based on the particular input array
A[1 . . . n] is in the number of times Line 5 is executed; the number of times that minIndex is reassigned
in the ith iteration of the inner loop can vary from as low as 0 to as high as n − i. If x represents the total
Pn
number of executions of Line 5, then x can be as low as 0 and as high as i=1 n − i (which, as argued
above, is less than n2 ). The remainder of the algorithm behaves identically regardless of the input array.
Thus, for some constants c1 > 0 and c2 > 0 and c3 > 0, the total number of primitive steps used by
the algorithm is c1 n + c2 n2 (for all lines except Line 5) plus c3 x (for Line 5), where 0 ≤ x < n2 . Thus the
total running time is between c1 n + c2 n2 and c1 n + (c2 + c3 )n2 , and therefore the asymptotic worst-case
running time of Selection Sort is Θ(n2 ).

We are generally interested in the asymptotic performance of algorithms, so the particular values of the
constants c1 , c2 , and c3 from Example 6.7, which reflect the number of primitive steps corresponding to
each line of the pseudocode in Figure 6.12, are irrelevant to our final answer. (One exception is that we may
sometimes try to count exactly the number of comparisons between elements of A, or swaps of elements
of A; see Exercises 6.55–6.63.)

ISTUDY
286 Analysis of Algorithms

3 5 2 1 4
insertionSort(A[1 . . . n])
1 for i := 2 to n: 3 5 2 1 4
2 j := i The shaded region forms a sorted pre-
fix; each iteration extends that region 2 3 5 1 4
3 while j > 1 and A[j] < A[j − 1]: by swapping the next element back-
4 swap A[j] and A[j − 1] ward in the array until it’s in place.
1 2 3 5 4
5 j := j − 1
1 2 3 4 5

Figure 6.13 Insertion Sort: pseudocode and an example. We maintain a sorted prefix of A (initially consisting only of
the first element), and we repeatedly expand the sorted prefix by one element.

We’ll now turn to our second sorting algorithm, Insertion Sort, which slowly increases the size of a
sorted subarray by swapping adjacent elements. (See Figure 6.13.)

Example 6.8: Insertion Sort.


Insertion Sort is more sensitive to the structure of its input than Selection Sort: if A is in sorted order, then,
in every iteration of the outer for loop, the while loop in Lines 3–5 terminates immediately (because the
test A[j] > A[j − 1] fails). On the other hand, if the input array is in reverse sorted order, then the while
loop in Lines 3–5 always completes i − 1 iterations. In fact, the reverse-sorted array is the worst-case
input for Insertion Sort: there can be as many as i − 1 iterations of the while loop, and there cannot be
more than i − 1 iterations.
In this worst case, when the while loop completes i−1 swaps for every iteration i, then the total amount
of work done by the algorithm is
c is a constant representing the work in a single for-loop iteration aside from what’s in the while loop
d is a constant representing the work in one iteration of Lines 3–5

X
n X
n
c + (i − 1) d = (c − d)n + id rearranging terms
i=1 i=1
n(n+1)
= (c − d)n + d · 2 by Theorem 5.3
= (c − d
2 )n + d 2
2 n . collecting like terms

This function is Θ(n2 ), so Insertion Sort’s worst-case running time is Θ(n2 ).

Finally, we will analyze a third sorting algorithm: Bubble Sort (Figure 6.14), which makes n left-to-right
passes through the array; in each pass, adjacent elements that are out of order are swapped. Bubble Sort
is a comparatively simpler sorting algorithm to analyze. (But, in practice, it is also a comparatively slow
sorting algorithm to run!)

pass #1 pass #2 pass #3


bubbleSort(A[1 . . . n])
1 for i := 1 to n: 3 5 2 1 4 3 2 1 4 5 2 1 3 4 5
3 5 2 1 4 2 3 1 4 5 1 2 3 4 5
2 for j := 1 to n − i:
3 2 5 1 4 2 1 3 4 5 1 2 3 4 5
3 if A[j] > A[j + 1] then
3 2 1 5 4 2 1 3 4 5 ..
4 swap A[j] and A[j + 1] 3 2 1 4 5 . (no further changes)

Figure 6.14 Bubble Sort: pseudocode and an example. We make n left-to-right passes through A[1 . . . n]; in each pass,
we swap each pair of adjacent elements that are out of order.

ISTUDY
6.3 Asymptotic Analysis of Algorithms 287

Example 6.9: Bubble Sort.


Bubble Sort repeatedly compares A[j] and A[j + 1] (swapping the two elements if necessary) for many
different values of j. Every time the body of the inner loop is executed, the algorithm does a constant
amount of work: exactly one comparison and either zero or one swaps. Thus any particular execution of
the body of the inner loop takes Θ(1) time—or, more precisely, an amount of time t satisfying c ≤ t ≤ d,
for two constants c > 0 and d > 0.
Pn Pn−i Pn Pn−i
Therefore the total running time of Bubble Sort is between i=1 j=1 c and i=1 j=1 d. The
Pn
summation i=1 n − i is Θ(n2 ), precisely as in Example 6.7, and thus Bubble Sort’s running time is
Ω(cn2 ) = Ω(n2 ) and O(dn2 ) = O(n2 ). Therefore Bubble Sort is Θ(n2 ).

Problem-solving tip: Precisely speaking, the number of primitive steps required to execute, for example, Lines 3–4 of Bubble
Sort varies based on whether a swap has to occur. In Example 6.9, we carried through the analysis considering two different
constants representing this difference. But, more simply, we could say that Lines 3–4 of Bubble Sort take Θ(1) time, without
caring about the particular constants. You can use this simpler approach to streamline arguments like the one in Example 6.9.

Before we move on from sorting, we’ll mention one more algorithm, Merge Sort, which proceeds recur-
sively by splitting the input array in half, recursively sorting each half, and then “merging” the sorted
subarrays into a single sorted array. But we will defer the analysis of Merge Sort to Section 6.4: to analyze
recursive algorithms like Merge Sort, we will use recurrence relations which represent the algorithm’s
running time itself as a recursive function.

Some more examples: search algorithms


We’ll now analyze two search algorithms, both of which determine whether a particular value x appears
in an array A. We’ll start with Linear Search (Figure 6.15a), which simply walks through the (possibly
unsorted) array A and successively compares each element to the sought value x. We’ll then look at Binary
Search (Figures 6.15b and 6.15c), which relies on the array A[1 . . . n] being sorted. It proceeds by defining

linearSearch(A[1 . . . n], x):


Input: an array A[1 . . . n] and an element x
Output: is x in the (possibly unsorted) array A? binarySearch(A[1 . . . n], x):
1 for i := 1 to n: Input: a sorted array A[1 . . . n]; an element x
2 if A[i] = x then Output: is x in the (sorted) array A?
3 return True 1 lo := 1
4 return False 2 hi := n
3 while lo ≤ hi:
(a) Linear Search. middle := ⌊ lo+ hi
4 2 ⌋
initially, lo = 1 initially, hi = n 5 if A[middle] = x then
For lo = 1 and hi = n, 6 return True
middle is set to ⌊ n+2 1 ⌋ = ⌈ n2 ⌉.
7 else if A[middle] > x then
8 hi := middle − 1
j k j k 9 else
n+1 n+1
1 2 −1 2 +1 n 10 lo := middle + 1
⌈n⌉ 11 return False
2 − 1 elements ⌊ n2 ⌋ =n− ⌈ 2n ⌉ elements
(b) The pseudocode for Binary Search.
(c) An illustration of the first split in Binary Search.

Figure 6.15 Linear and Binary Search.

ISTUDY
288 Analysis of Algorithms

a range of the array in which x would be found if it is present, and then repeatedly halving the size of that
range by comparing x to the middle entry in that range.
Unless otherwise specified (and we will rarely specify otherwise), we are interested in the worst-case
behavior of algorithms. This concern with worst-case behavior includes lower bounds! Here’s an example
of the analysis of an algorithm that suffers from this confusion:

Example 6.10: Linear Search, unsatisfactorily analyzed.


What is incomplete or incorrect in the following analysis of the worst-case running time of Linear Search?
The running time of Linear Search is obviously O(n): we at most iterate over every element of the array, performing a
constant number of operations per element. And it’s obviously Ω(1): no matter what the inputs A and x are, the algorithm
certainly at least does one operation (setting i := 1 in Line 1), even if it immediately returns because A[1] = x.

Solution. The analysis is correct, but it gives a looser lower bound than can be shown: specifically, the
running time of Linear Search is Ω(n), and not just Ω(1). If we call linearSearch(A, 42) for an array
A[1 . . . n] that does not contain the number 42, then the total number of steps required will be at least n,
because every element of A is compared to 42. Performing n comparisons takes Ω(n) time.

Here is a terser writeup of the analysis of Linear Search:

Example 6.11: Linear Search.


The worst case for Linear Search is an array A[1 . . . n] that doesn’t contain the element x. In this case, the
algorithm compares x to all n elements of A, taking Θ(n) time.

Taking it further: When we’re analyzing an algorithm A’s running time, we can generally prove several different lower and
upper bounds for A. For example, we might be able to prove that the running time is Ω(1), Ω(log n), Ω(n), O(n2 ), and O(n3 ).
The bound Ω(1) is a loose bound, because it is superseded by the bound Ω(log n). (That is, if f(n) = Ω(log n) then f(n) = Ω(1).)
Similarly, O(n3 ) is a loose bound, because it is implied by O(n2 ). (Example 6.10 established a loose lower bound on the running
time of Linear Search; it was superseded by the result in Example 6.11.)
We seek asymptotic bounds that are as tight as possible—so we always want to prove f(n) = Ω(g(n)) and f(n) = O(h(n)) for
the fastest-growing function g and slowest-growing function h that we can. If g = h, then we have proven a tight bound, or,
equivalently, that f(n) = Θ(g(n)). Sometimes there are algorithms for which we don’t know a tight bound; we can prove Ω(n)
and O(n2 ), but the algorithm might be Θ(n) or Θ(n2 ) or Θ(n log n log log log n) or whatever. In general, we want to give upper
and lower bounds that are as close together as possible.

Now let’s analyze the running time of Binary Search:

Example 6.12: Binary Search.


Here is the intuition: in every iteration of the while loop in Binary Search, we halve the range of elements
under consideration. (In other words, we halve | {i : lo ≤ i ≤ hi} |.) We can halve a set of size n only
log2 n times, and therefore Binary Search completes O(log2 n) iterations of the while loop. Each of those
iterations takes a constant amount of time, and therefore the total running time is O(log n).
Let’s translate this intuition into a somewhat more formal proof. Suppose that the range of ele-
ments under consideration at the beginning of an iteration of the while loop is A[lo, . . . , hi]. Let k
denote the number of elements in this range—that is, k = hi − lo + 1. There are ⌈ 2k ⌉ − 1 elements in
A[lo, . . . , middle − 1]. There are ⌊ 2k ⌋ elements in A[middle + 1, . . . , hi]. (See Figure 6.15c.) Thus, after
comparing x to A[middle], one of three things happens:

ISTUDY
6.3 Asymptotic Analysis of Algorithms 289

• We find x = A[middle], and the algorithm terminates.


• We find x < A[middle] and continue in the left part of the array, which contains ⌈ 2k ⌉ − 1 ≤ 2k
elements.
• We find x > A[middle] and continue in the right part of the array, which contains ⌊ 2k ⌋ ≤ 2k elements.

In all three cases, we have at most 2k elements under consideration in the next iteration of the loop.
Initially, there are n elements under consideration. Therefore, the above argument implies that there are
at most n/2i elements left under consideration after i iterations. (This claim can be proven by induction.)
So, after log2 n iterations, there is at most one element left under consideration. Once the range contains
only one element, we complete at most one more iteration of the while loop. Thus the total number of
iterations is at most 1 + log2 n. Each iteration takes a constant number of steps, and thus the total running
time is O(log n).

Notice that analyzing the running time of any single iteration of the while loop in the algorithm was not
too hard; the challenge in determining the running time of binarySearch lies in figuring out how many
iterations occur.
Here we have only shown an upper bound on Binary Search’s running time; in Example 6.26, we’ll
prove that Binary Search takes Ω(log n) time, too. (Just as for Linear Search, the worst-case input for
Binary Search is an n-element array that does not contain the sought value x.)

6.3.2 Some Other Types of Analysis


So far we have focused on asymptotically analyzing the worst-case running time of algorithms. While this
type of analysis is the one most commonly used in the analysis of algorithms, there are other interesting
types of questions that we can ask about algorithms. We’ll sketch two of them in this section: instead of
being completely pessimistic about the particular input that we get, we might instead consider either the
best possible case or the “average” case.

Best-case analysis of running time


Best-case running time simply replaces the “max” from Definition 6.15 with a “min”:

Definition 6.16: Best-case running time of an algorithm.


The best-case running time of an algorithm A on an input of size n is
 
A (n) = min the number of primitive steps used by A on input x .
Tbest
x:|x|=n

Best-case analysis is rarely used; knowing that an algorithm might be fast (on inputs for which it is partic-
ularly well tuned) doesn’t help much in drawing generalizable conclusions about its performance (on the
input that it’s actually called on). (Ambrose Bierce (1842–≈1913) defined “optimism” in his The Devil’s
Dictionary (1911) as “the doctrine or belief that everything is beautiful, including what is ugly” [14].
Best-case analysis adheres to that definition more or less perfectly.)

Average-case analysis of running time


The “average” running time of an algorithm A is subtler to state formally, because “average” means that
we have to have a notion of which values are more or less likely to be chosen as inputs. (For example,

ISTUDY
290 Analysis of Algorithms

consider sorting. In many settings, an already-sorted array is the most common input type to the sorting
algorithm; a programmer might just want to “make sure” that the input was sorted, even though they might
have been pretty confident that it already was.) The simplest way to do average-case analysis is to consider
inputs that are chosen uniformly at random from the space of all possible inputs. For example, for sorting
algorithms, we would consider each of the n! different orderings of {1, 2, . . . , n} to be equally likely inputs
of size n.

Definition 6.17: Average-case running time of an algorithm.


Let X denote the set of all possible inputs to an algorithm A. The average-case running time of an
algorithm A for a uniformly chosen input of size n is
1 X  
Tavg
A (n) = · number of primitive steps used by A on x .
| {y ∈ X : |y| = n} |
x∈X:|x|=n

Taking it further: Let ρn be a probability distribution over {x ∈ X : |x| = n}—that is, ρn is a function where ρn (x) denotes the
fraction of the time that the input to A is x, out of all of the times that the input to A is of size n. Definition 6.17 considers the
uniform distribution, where ρn (x) = 1/| {x ∈ X : |x| = n} |.
The average-case running time of A on inputs of size n is the expected running time of A for an input x of size n chosen according
to the probability distribution ρn . We will explore both probability distributions and expectation in detail in Chapter 10, which is
devoted to probability. (If someone refers to the average case of an algorithm without specifying the probability distribution ρ,
then they probably mean that ρ is the uniform distribution, as in Definition 6.17.)

We will still consider the asymptotic behavior of the best-case and average-case running times, for the same
reasons that we are generally interested in the asymptotic behavior in the worst case.

Best- and average-case analysis of sorting algorithms


We’ll close this section with the best- and average-case analyses of our three sorting algorithms. (See
Figure 6.16 for a reminder of the algorithms.)

Example 6.13: Insertion Sort, in the best case and average case.
In Example 6.8, we showed that the worst-case running time of Insertion Sort is Θ(n2 ). Let’s analyze the
best- and average-case running times of Insertion Sort.
The best-case running time for Insertion Sort is much faster than the worst-case running time: if the
input array is already in sorted order, the inner while loop that swaps each A[i] into place terminates
immediately without doing any swaps, because A[i] > A[i − 1]. Each iteration of the for loop therefore
takes Θ(1) time, so the total running time is Θ(n).
We will defer a fully formal analysis of the average-case running time of Insertion Sort to Chapter 10
(see Example 10.47), but here is an informal analysis. Consider iteration the ith of the for loop. When that
iteration starts, the first i − 1 elements of A—that is, A[1, . . . , i − 1]—are in sorted order. The next element
A[i] has an equal chance of falling into any one of the i “slots” in the sorted subarray A[1, . . . , i − 1]: before
A[1], between A[1] and A[2], . . ., between A[i − 2] and A[i − 1], and after A[i − 1]. On average, then, the
number of swaps in the ith iteration of the for loop is 2i . Thus the total average running time will be
Pn−1 i n(n−1)
i=1 2 = 4 , which is Θ(n2 ).

While we typically use formal mathematical analysis to address the performance of algorithms (whether
we’re interested in the worst, best, or average case), sometimes a kind of empirical analysis—where we

ISTUDY
6.3 Asymptotic Analysis of Algorithms 291

Insertion Sort Selection Sort Bubble Sort


0.75s

0.75s

0.75s
reversed reversed reversed
random random random
sorted sorted sorted
0.50s

0.50s

0.50s
0.25s

0.25s

0.25s
512 1024 1536 2048 512 1024 1536 2048 512 1024 1536 2048

insertionSort(A[1 . . . n]) selectionSort(A[1 . . . n]) bubbleSort(A[1 . . . n])


1 for i := 2 to n: 1 for i := 1 to n: 1 for i := 1 to n:
2 j := i 2 minIndex := i 2 for j := 1 to n − i:
3 while j > 1 and A[j] < A[j − 1]: 3 for j := i + 1 to n: 3 if A[j] > A[j + 1] then
4 swap A[j] and A[j − 1] 4 if A[j] < A[minIndex] then 4 swap A[j] and A[j + 1]
5 j := j − 1 5 minIndex := j
6 swap A[i] and A[minIndex]

Figure 6.16 The elapsed-time running time for Insertion, Selection, and Bubble Sorts, and a reminder of the
pseudocode for each.

measure an algorithm’s performance by running it on an actual computer on an actual input and measuring
how much time elapses before the algorithm terminates—can also be useful.
Figure 6.16 shows the elapsed time on an aging laptop during executions of my Python implementa-
tions of Insertion, Selection, and Bubble Sorts on a sorted array, a reverse-sorted array, and a randomly
shuffled array. For Insertion Sort, Figure 6.16 confirms the formal analysis from Example 6.13: Insertion
Sort’s worst case is about twice as slow as its average case, and both are Θ(n2 ); the best case of Insertion
Sort is virtually invisible along the x-axis. On the other hand, Figure 6.16 suggests that Selection Sort’s
performance does not seem to depend very much on the structure of its input. Let’s analyze this algorithm
formally:

Example 6.14: Selection Sort, in the best case and average case.
In Selection Sort, the only effect of the input array’s structure is the number of times that minIndex is
updated. (That’s why the reverse-sorted input tends to perform ever-so-slightly worse in Figure 6.16.)
Thus the best- and average-case running time of Selection Sort is Θ(n2 ), just like the worst-case running
time established in Example 6.7.

Figure 6.16 suggests that Bubble Sort’s performance varies only by a constant factor and, indeed, the worst-,
average-, and best-case running times are all Θ(n2 ):

Example 6.15: Bubble Sort, in the best case and average case.
Again, the only difference in running time based on the structure of the input array is in how many
times the body of the inner conditional is executed—that is, how many swaps occur. (The number of

ISTUDY
292 Analysis of Algorithms

swaps ranges between 0 for a sorted array and n(n−1)


2 for a reverse-sorted array.) But there are Θ(n2 )
comparisons of adjacent elements in any case, and Θ(n ) + 0 and Θ(n2 ) + n2 are both Θ(n2 ).
2

Careful examination (and some small tweaks) of Bubble Sort shows that we can improve the algorithm’s
best-case performance without affecting the worst- and average-case performance asymptotically; see
Exercise 6.65.
Taking it further: The tools from this chapter can be used to analyze the consumption of any resource by an algorithm. So far,
the only resource that we have considered is time: how many primitive steps are used by the algorithm on an particular input?
The other resource whose consumption is most commonly analyzed is the space used by the algorithm—that is, the amount of
memory it uses. As with time, we almost always consider the worst-case space use of the algorithm. See p. 294 for more on
the subfield of CS called computational complexity, which seeks to understand the resources required to solve any particular
problem. While time and space are the resources most frequently analyzed by complexity theorists, there are other resources that
are interesting to track, too. For example, randomized algorithms “flip coins” as they run—that is, they make decisions about how
to continue based on a randomly generated bit. Generating a truly random bit is expensive, and so we can view randomness itself
as a resource, and try to minimize the number of random bits used. And, particularly in mobile processors, power consumption—
and therefore the amount of battery life consumed, and the amount of heat generated—may be a more limiting resource than
time or space. Thus energy can also be viewed as a resource that an algorithm might consume. (For some of the research on
power-aware computing from an architecture perspective, see [70].)

ISTUDY
6.3 Asymptotic Analysis of Algorithms 293

Computer Science Connections

Multitasking, Garbage Collection, and Wall Clocks


One reason that we typically measure the running time of algorithms by counting (asymptotically) the number of
primitive operations consumed by the algorithm on (worst-case) inputs is that measuring running time by so-called
wall-clock time can be difficult to interpret—and potentially misleading.
All modern operating systems (everything that’s been widely deployed for several decades: Windows, MacOS,
Linux, iOS, Android, . . .) are multitasking operating systems. That is, the user is typically running many applica-
tions simultaneously—perhaps an application to play music, a web browser, a programming environment, a word
processor, a virus checker, and that sorting program that you wrote for your CS class. While it appears to the user
that these applications are all running simultaneously, the operating system is actually pulling off a trick. There’s
typically only one processor (or a small number of processors, in multicore machines), and the operating system uses
time-sharing to allow each running application to have a “turn” using the processor. (When it’s the next application’s
turn and there’s no currently idle processor, the operating system swaps out one application, and swaps in the next
one that gets a slice of time on the processor.) If there were more processes running when you ran Merge Sort than
when you ran Bubble Sort, then the elapsed time for Merge Sort could look worse than it should.
Many operating systems can report the total amount of processor time that a particular process consumed, so
we can avoid the multitasking concern—but even within a single process, total processor time consumed can be
misleading. While a program in Python or Java, for example, is running, periodically the garbage collector runs to
reclaim “garbage” memory (previously allocated memory that won’t be used again) for future use. When the garbage
collector runs, the code that you were executing stops running. (See p. 619.)
Figure 6.17 shows the elapsed time while running four sorting algorithms, written in Python, executed on sorted
inputs [1, 2, . . . , n], reverse sorted inputs [n, n−1, . . . , 1], and a randomly permuted n-element array. The “spikiness”
of the elapsed times within the reverse-sorted plots may be because I launched a large presentation-editing application
while the Insertion Sort test was running on inputs in descending sorted order, or because the garbage collector
happened to start running during those trials.
Even putting aside the difficulty of measuring running times accurately, there’s another fundamental issue that we
must address: we have to decide on what inputs to run the algorithms. The three panels of Figure 6.17 show why this
choice can be significant. When the input is in sorted order, Insertion Sort is the best algorithm (in fact, it’s barely
visually distinguishable from the x-axis!). When the input is in reverse sorted order, Insertion Sort is terrible, and
Merge Sort is the fastest. When the input is randomized, Insertion Sort is somewhere in the middle, and Merge Sort
is again the fastest. Selection Sort is essentially unaffected by which type of input we consider.
The fact that we get such different pictures from the three different input types says that we have to decide which
input to consider. (Typically we choose the worst-case input for the particular algorithm, as we’ve discussed.)

Sorted Input Random Input Reverse Sorted Input


0.75s

0.75s

0.75s

Bubble Bubble Bubble


Insertion Insertion Insertion
Selection Selection Selection
Merge Merge Merge
0.50s

0.50s

0.50s
0.25s

0.25s

0.25s

512 1024 1536 2048 512 1024 1536 2048 512 1024 1536 2048

Figure 6.17 The wall-clock running time of four sorting algorithms on three different types of input. For n-element
inputs of each type, the plot shows the number of seconds elapsed for the given sorting algorithms.

ISTUDY
294 Analysis of Algorithms

Computer Science Connections

Time, Space, and Complexity


Computational complexity is the subfield of computer science devoted to the study of the resources required to solve
computational problems. Computational complexity is the domain of the most important open question in all of
computer science, the P-versus-NP problem. That problem is described elsewhere in this book (see p. 104), but here
we’ll describe some of the basic entities that are studied by complexity theorists. (For much more, see any good
textbook on computational complexity (also known as complexity theory), such as [97, 121].)
A complexity class is a set of problems that can be solved using a given constraint on resources consumed. Those
resources are most typically the time or space used by an algorithm that solves the problem. For example, k
the com-
plexity class EXPTIME includes precisely those problems solvable in exponential time—that is, O(2n ) time for
some constant integer k. (There’s a wide range of other resources, other constraints, or other models of computation
that are studied by complexity theorists. For example, what extra power—if any—comes from using a computer that
can make truly random choices as part of its execution? Or, what extra power—if any—comes from using a quantum
computer instead of a classical one?)
One of the most important complexity classes is P, which denotes the set of all problems Π for which there is a
polynomial-time algorithm A that solves Π. In other words,
Π ∈ P ⇔ there exists an algorithm A and an integer k ∈ Z≥0 such that
A solves Π and the worst-case running time of A on an input of size n is O(nk ).
Although the practical efficiency of an algo-
rithm that runs in time Θ(n1000 ) is highly
suspect, it has turned out that essentially any
(non-contrived) problem that has been shown EXPSPACE
to be in P has actually also had a reasonably
efficient algorithm—almost always O(n5 ) or
better. As a result, one might think of the
EXPTIME
entire subfield of CS devoted to algorithms
as really being devoted to understanding what
problems can be solved in polynomial time.
(Of course, improving the exponent of the PSPACE
polynomial is always a goal!)
Other complexity classes that are com-
monly studied are defined in terms of the
space (memory) that they use: P

PSPACE consists of those problems solvable


using a polynomial amount of space.
L consists of those problems solvable using L
O(log n) space (beyond the input itself).
EXPSPACE consists of problems solvable
using a exponential amount of space. Figure 6.18 A few complexity classes, and their relationships.

While a great deal of effort has been devoted to complexity theory over the last half century, surprisingly little is known
about how much time or space is actually required to solve problems—including some very important problems! It
is reasonably easy to prove the relationships among the complexity classes shown in Figure 6.18, namely
L ⊆ P ⊆ PSPACE ⊆ EXPTIME ⊆ EXPSPACE.
Although the proofs are trickier, it has also been known since the 1960s that P ̸= EXPTIME (using the “time hier-
archy theorem”), and that both L = ̸ PSPACE and PSPACE ̸= EXPSPACE (using the “space hierarchy theorem”).
But that’s just about all that we know about the relationship among these complexity classes! For example, for all
we know L = P or P = PSPACE—but not both, because we do know that L = ̸ PSPACE. These foundational
complexity-theoretic questions remain open—awaiting the insights of a new generation of computer scientists!

ISTUDY
Exercises 295

EXERCISES
A comparison-based sorting algorithm reorders its input array A[1 . . . n] with two fundamental operations:
• the comparison of a pair of elements (to determine which one is bigger); and
• the swap of a pair of elements (to exchange their positions in the array).
See Figure 6.19 for a reminder of three comparison-based sorting algorithms (Selection, Insertion, and Bubble Sort), with com-
parisons and swaps highlighted. For the worst-case input array A[1 . . . n] of size n, how many of the listed operations are done by
each? Give an exact answer, not an asymptotic one, and prove your answer.
6.55 (worst-case) comparisons for selectionSort
6.56 (worst-case) comparisons for insertionSort
6.57 (worst-case) comparisons for bubbleSort
6.58 (worst-case) swaps for selectionSort
6.59 (worst-case) swaps for insertionSort
6.60 (worst-case) swaps for bubbleSort

Repeat Exercises 6.55–6.60 for the best case: for the array A[1 . . . n] on which the given algorithm performs the best, how many
comparisons/swaps does it do? (If the best-case array for swaps is different from the best-case array for comparisons, say so and
explain why, and analyze the number of comparisons/swaps in the two different “best” arrays.)
6.61 What is the best-case number of comparisons for selectionSort(A[1 . . . n])? The best-case number of swaps?
6.62 What is the best-case number of comparisons for insertionSort(A[1 . . . n])? The best-case number of swaps?
6.63 What is the best-case number of comparisons for bubbleSort(A[1 . . . n])? The best-case number of swaps?

Two variations of the basic bubbleSort algorithm are shown in Figure 6.20. In the next few exercises, you’ll explore whether they’re
asymptotic improvements.
6.64 What’s the worst-case running time of early-stopping-bubbleSort? Prove your answer.
6.65 Show that the best-case running time of early-stopping-bubbleSort is asymptotically better than the best-case running time of
bubbleSort.
6.66 Show that the running time of forward-backward-bubbleSort on a reverse-sorted array A[1 . . . n] is Θ(n). (The reverse-sorted
input is the worst case for both bubbleSort and early-stopping-bubbleSort.)

selectionSort(A[1 . . . n])
1 for i := 1 to n: insertionSort(A[1 . . . n])
2 minIndex := i 1 for i := 2 to n: bubbleSort(A[1 . . . n])
3 for j := i + 1 to n: 2 j := i 1 for i := 1 to n:
4 if A[j] < A[minIndex] then 3 while j > 1 and A[j] < A[j − 1]: 2 for j := 1 to n − i:
5 minIndex := j 4 swap A[j] and A[j − 1] 3 if A[j] > A[j + 1] then
6 swap A[i] and A[minIndex] 5 j := j − 1 4 swap A[j] and A[j + 1]

Figure 6.19 A reminder of three sorting algorithms, with swaps and comparisons of elements in the input array
highlighted. Count as a “swap” any execution of the swap operation, including the exchange of an element with itself:
the last line of selectionSort still counts as performing a swap even if i = minIndex.

early-stopping-bubbleSort(A[1 . . . n]): forward-backward-bubbleSort(A[1 . . . n]):


1 for i := 1 to n: 1 Construct R[1 . . . n], where R[i] := A[n − i + 1] for each i.
2 swapped := False 2 for i := 1 to n:
3 for j := 1 to n − i: 3 Run one iteration of lines 2–8 of early-stopping-bubbleSort on A.
4 if A[j] > A[j + 1] then 4 Run one iteration of lines 2–8 of early-stopping-bubbleSort on R.
5 swap A[j] and A[j + 1] 5 if either A or R is now sorted then
6 swapped := True 6 return whichever is sorted
7 if swapped = False then
8 return A

Figure 6.20 Two potentially improving variations on Bubble Sort: one that halts once no more swaps are possible, and
one that that simultaneously works on the input and its reverse.

ISTUDY
296 Analysis of Algorithms

6.67 Prove that the worst-case running time of forward-backward-bubbleSort is O(n2 ).


6.68 Prove that—despite the apparent improvement—the worst-case running time of forward-backward-bubbleSort is Ω(n2 ). To
prove this claim, explicitly describe an array A[1 . . . n] for which early-stopping-bubbleSort requires Ω(n2 ) on both A and the
reverse of A.
6.69 (programming required) Implement the three versions of Bubble Sort (including the two in Figure 6.20) in a programming language
of your choice.
6.70 (programming required) Modify your implementations from Exercise 6.69 to count the number of swaps and comparisons per-
formed. Then run all three algorithms on each of the 8! = 40,320 different orderings of {1, 2, . . . , 8}. How do the algorithms’
performances compare, on average?

In Chapter 9 (see p. 458), we will meet a sorting algorithm called Counting Sort that sorts an array A[1 . . . n] where each A[i] is
an element of {1, 2, . . . , k} as follows: for each possible value x ∈ {1, 2, . . . , k}, we walk through A to compute the number cx of
occurrences of the value x—that is, we compute cx := | {i : A[i] = x} |. (We can compute all k values of c1 , . . . , ck in a single pass
through A.) The output array consists of c1 copies of 1, followed by c2 copies of 2, and so forth, ending with ck copies of k. (See
Figure 6.21.) Counting sort is particularly good when k is small.
6.71 In terms of n, what is the worst-case running time of countingSort on an input array of n letters from the alphabet (so k = 26, and
n is arbitrary)?
6.72 (programming required) Implement Counting Sort and one of the Θ(n2 )-time sorting algorithms from this section. Collect some
data to determine, on a particular computer, for what values of k you’d generally prefer Counting Sort over the Θ(n2 )-time algorithm
when n = 4096 = 212 elements are each chosen uniformly at random from {1, 2, . . . , k}.
6.73 Radix Sort is a sorting algorithm based on Counting Sort that proceeds by repeatedly applying Counting Sort to the ith-most
significant bit in the input integers, for increasing i. Do some online research to learn more about Radix Sort, then write pseudocode
for Radix Sort and compare its running time (in terms of n and k) to Counting Sort.

In Example 5.14, we proved the correctness of Quick Sort, a recursive sorting algorithm (see Figure 6.21). The basic idea is to
choose a pivot element of the input array A, then partition A into those elements smaller than the pivot and those elements larger
than the pivot. We can then recursively sort the two “halves” and paste them together, around the pivot, to produce a sorted version
of A. The algorithm performs very well if the two “halves” are genuinely about half the size of A; it performs very poorly if one
“half” contains almost all the elements of A. The running time of the algorithm therefore hinges on how we select the pivot, in
Line 4. (A very good choice of pivot is actually a random element of A, but here we’ll think only about deterministic rules for
choosing a pivot.)
6.74 Suppose that we always choose pivotIndex := 1. (That is, the first element of the array is the pivot value.) Describe (for an arbitrary
n) an input array A[1 . . . n] that causes quickSort under this pivot rule to make either less or greater empty.
6.75 Argue that, for the array you found in Exercise 6.74, the running time of Quick Sort is Θ(n2 ).
6.76 Suppose that we always choose pivotIndex := ⌊ 2n ⌋. (That is, the middle element of the array is the pivot value.) What input array
A[1 . . . n] causes worst-case performance (that is, one of the two sides of the partition—less or greater—is empty) for this pivot
rule?
6.77 A fairly commonly used pivot rule is called the Median of Three rule: we choose pivotIndex ∈ {1, ⌊ 2n ⌋ , n} so that A[pivotIndex] is
the median of the three values A[1], A[⌊ 2n ⌋], and A[n]. Argue that there is still an input array of size n that results in Ω(n2 ) running
time for Quick Sort.

countingSort(A[1 . . . n]) : quickSort(A[1 . . . n]):


1 // Assume that each A[i] ∈ {1, 2, . . . , k} 1 if n ≤ 1 then
2 for v := 1 to k: 2 return A
3 count[v] := 0 3 else
4 for i := 1 to n: 4 Choose pivotIndex ∈ {1, . . . , n}, somehow.
5 count[A[i]] := count[A[i]] + 1 5 Let less (those elements smaller than A[pivotIndex]), same and greater
6 i := 1 be empty arrays.
7 for v := 1 to k: 6 for i := 1 to n:
8 for t := 1 to count[v]: 7 compare A[i] to A[pivotIndex], and append A[i] to the appropriate
9 A[i] := v array less, same, or greater.
10 i := i + 1 8 return quickSort(less) + same + quickSort(greater).

Figure 6.21 Counting Sort, and a high-level reminder of Quick Sort (see Figure 5.19a for more detail).

ISTUDY
Exercises 297

Assume that (i) testing whether Z is in s


early-stopping-linearSearch(A[1 . . . n], x) and (ii) removing a single letter from s
1 for i := 1 to n: countZ(s): both take c · |s| time, for some constant c.
2 if A[i] = x then 1 z := 0
3 return True 2 while there exists i such that si = Z:
4 else if A[i] < x then 3 z := z + 1
5 return False 4 remove si from s // that is, set s := s1 . . . si−1 si+1 . . . sn
6 return False 5 return z

Figure 6.22 An early-stopping variation on Linear Search, and an algorithm for counting ZZZs.

6.78 Earlier we described a linear-search algorithm that looks for an element x in an array A[1 . . . n] by comparing x to A[i] for each
i = 1, 2, . . . n. (See Figure 6.15a.) But if A is sorted, we can determine that x is not in A earlier, as shown in Figure 6.22: once we’ve
passed where x “should” be, we know that it’s not in A. (Our original version omitted lines 4–5.) What is the worst-case running
time of the early-stopping version of linear search?

6.79 Consider the algorithm in Figure 6.22 for counting the number of times the letter Z appears in a given string s. What is the worst-case
running time of this algorithm on an input string of length n?

ISTUDY
298 Analysis of Algorithms

6.4 Recurrence Relations: Analyzing Recursive Algorithms


What did this mean? Who was I? What was I? Whence did I come? What was my
destination? These questions continually recurred, but I was unable to solve them.
Mary Wollstonecraft Shelley (1797–1851)
Frankenstein (1818)

The nonrecursive algorithms in Section 6.3 could be analyzed “just” by counting and manipulation of
summations. (Sometimes the summations are easier to handle, and sometimes they’re harder—but they’re
just summations.) First we figured out the number of iterations of each loop, and then figured out how long
each iteration takes. By summing this work over the iterations and simplifying the summation, we were
able to compute the running time of the algorithm. Determining the running time of a recursive algorithm
is harder. Instead of “merely” containing loops that can be analyzed in this way, the algorithm’s running
time on an input of size n depends on the same algorithm’s running time for inputs of size smaller than n.
We’ll use the classical recursive sorting algorithm Merge Sort (Figure 6.23) as an example. Merge Sort
sorts an n-element array by recursively sorting the first half, recursively sorting the second half, and finally
“merging” the resulting sorted lists. (On an input array of size 1, Merge Sort just returns the array as is.)
Merging two 2n -element arrays takes Θ(n) time, but what does that mean for the overall running time of
Merge Sort? We can think about Merge Sort’s running time by drawing a picture of all of the work that is
done in its execution, in the form of a recursion tree:

Definition 6.18: Recursion tree.


The recursion tree for a recursive algorithm A is a tree that shows all of the recursive calls spawned by
a call to A on an input of size n. Each node in the tree is annotated with the amount of work, aside from
any recursive calls, done by that call.

Figure 6.24 shows the recursion tree for Merge Sort. For ease, assume that n is an exact power of 2. Let c · n
represent the amount of time needed to process an n-element array aside from the recursive calls—that is,
the time to split and merge. There are many different ways to analyze the total amount of work done by
Merge Sort on an n-element input array, but one of the easiest is to use the recursion tree itself:

Example 6.16: Analyzing Merge Sort via recursion tree.


How quickly does Merge Sort run on an n-element input array? (Assume that n is a power of two.)
Solution. Let’s look at the recursion tree for Merge Sort. The total amount of work done by the algorithm
is precisely the sum of the identified work—the quantities written in the shaded boxes—for each node in
the tree. (That’s exactly how c · n was defined.) The easiest way to sum up the work in the tree is to sum
“row-wise” (see Figure 6.25):

mergeSort(A[1 . . . n]):
1 if n = 1 then
2 return A
3 else
  The merge function combines two sorted arrays into a single sorted
4 L := mergeSort(A[1 . . . n2 ])
n array. For example, merge([2, 4, 6, 8], [5, 7, 9, 11]) yields
5 R := mergeSort(A[ 2 + 1 . . . n]) [2, 4, 5, 6, 7, 8, 9, 11]. You’ll argue in Exercise 6.100 that merging
6 return merge(L, R) two n2 -element arrays takes Θ(n) time.

Figure 6.23 Merge Sort.

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 299

n c·n

n n n n
2 c· 2 2 c· 2
1 + log2 n levels

n n n n n n n n
4 c· 4 4 c· 4 4 c· 4 4 c· 4

n n n n n n n n n n n n n n n n
8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8

.. ..
. .

2 2

1 1 c·1 c·1 1 1 c·1

Figure 6.24 The recursion tree for Merge Sort. Each input’s size is shown in the node; the linear amount of time for
splitting/merging it is shown in the shaded box adjacent to the corresponding node.

n c·n 1 · (c · n) = cn

n n n n
2 c· 2 2 c· 2 2 · (c · n2 ) = cn
1 + log2 n levels

n n n n n n n n
4 c· 4 4 c· 4 4 c· 4 4 c· 4 4 · (c · n4 ) = cn

n n n n n n n n n n n n n n n n
8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 c· 8 8 · (c · n8 ) = cn
.. ..
. .

2 2

1 1 c·1 c·1 1 1 c·1 n · (c · 1) = cn

Figure 6.25 The row-wise sum of the tree in Figure 6.24.

• The first “row” of the tree (one call on an input of size n) generates cn work.
• The second row (two calls on inputs of size 2n ) generates 2 · (c · 2n ) = cn work.
• The third row (four calls on inputs of size n4 ) generates 4 · (c · n4 ) = cn work.
• In general, the kth row of the tree contains 2k−1 nodes, each of which represents a call to Merge Sort
on an input of size n/2k−1 . Each of these calls is generates c · n/2k−1 work. In total, then, the kth row
contains a total of 2k−1 · (c · n/2k−1 ) = cn work—no matter what the value of k was.

Putting these pieces together: the total work done by Merge Sort on an n-element input array is
P2 n
1+log
(the work done in the kth row of the tree) there are 1 + log2 n rows in the recursion tree
k=1
P2 n 
1+log   
= 2k−1 · c · n
2k−1
there are 2k−1 nodes in row k, and each corresponds to c · n
2k−1
work
k=1
P2 n
1+log
= cn = cn(1 + log2 n), 2k−1 · (c · n
2k−1
) = cn for any value of k (see above)
k=1

ISTUDY
300 Analysis of Algorithms

and thus is Θ(n log n) in total.

Taking it further: We’ll be analyzing algorithms by examining recursion trees throughout this section, but—if you prefer—
here’s a different argument as to why Merge Sort requires Θ(n log n) time. (This approach looks at the individual experience of
a particular element that’s being sorted, rather than looking at the entire array all at once.) Every individual element of the input
array is merged once into an array of size 2, once into an array of size 4, once into an array of size 8, etc. So each element is
merged log2 n times, so thus the total work is Θ(n · log2 n).

6.4.1 Recurrence Relations


Recursion trees are an excellent way to gain intuition about the running time of a recursive algorithm, and to
analyze it. Here, we’ll look at another way of thinking about recursion trees, which suggests a rigorous (and
in many ways easier to use) approach to analyzing recursive algorithms: the recurrence relation. Because
at least one of the steps in a recursive algorithm A is to call A on a smaller input, the running time of A on
an input of size n depends on A’s running time for inputs of size smaller than n. We will therefore express
A’s running time recursively, too:

Definition 6.19: Recurrence relation.


A recurrence relation (sometimes simply called a recurrence) is a function T(n) that is defined (for some
values of n) in terms of the values of T(k) for input values k < n.

(The name comes from the fact that T recurs (“occurs again”) on the right-hand side of the equation. That’s
the same reason that recursion is called recursion.) Here’s a first example, about compounding interest:

Example 6.17: Compound interest.


Suppose that, in year 0, Alice puts $1000 in a bank account that pays 2% annual compound interest.
Writing A(n) to denote the balance of Alice’s account in year n, we have
A(0) = 1000 A(n) = 1.02 · A(n − 1).

If Bob opens a bank account with the same interest rate, and deposits $10 into the account each year
(starting in year 0), then Bob’s balance is given by the recurrence

B(0) = 10 B(n) = 1.02 · B(n − 1) + 10.

We’ll most frequently encounter recurrence relations in which T(n) denotes the worst-case number of
steps taken by a particular recursive algorithm on an input of size n. Here are a few examples:

Example 6.18: Factorial.


Let T(n) denote the worst-case running time of fact (see Figure 6.26). Then
T(1) = d and T(n) = T(n − 1) + c,
where c is a constant denoting the work of the comparison, conditional, multiplication, and return; and d
is a constant denoting the work of the comparison, conditional, and return.

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 301

fact(n): binarySearch(A, loIndex, hiIndex, x):


1 if n = 1 then 1 if loIndex > hiIndex then
To avoid the inefficiency of copying portions of A when
2 return 1 2 return False a recursive call is made, this code uses four parameters
3 else 3 middle := ⌊ loIndex+
2
hiIndex
⌋ instead of two: the array A, the left- and right-most
indices in A to search, and the sought element x. You’d
4 return n · fact(n − 1) 4 if A[middle] = x then call the algorithm binarySearch(A[1 . . . n], 1, n, x) to
5 return True start the recursive search for x in A.
6 else if A[middle] > x then
7 return binarySearch(A, loIndex, middle − 1, x)
8 else
9 return binarySearch(A, middle + 1, hiIndex, x)

Figure 6.26 Two recursive algorithms: factorial and Binary Search.

Example 6.19: Merge Sort.


Let T(n) denote the worst-case running time of Merge Sort (see Figure 6.23) on an input array containing
n elements. Then, for a constant c, we have:
T(1) = c and T(n) = T(⌊ 2n ⌋) + T(⌈ n2 ⌉) + cn.

Taking it further: Just as for nonrecursive algorithms, we will generally be interested in the asymptotic running times of these
recursive algorithms, so we will usually not fret about the particular values of the constants in recurrences. We will often abuse
notation and use a single constant to represent different Θ(1)-time operations, for example. In Example 6.19, for instance, we
are being sloppy in our recurrence, using a single variable c to represent two different values. The use of one constant to have
two different meanings (plus the “=” sign) is an abuse of notation, but when we care about asymptotic values, this abuse doesn’t
matter. We will even sometimes write 1 to stand for this constant. (See Exercise 6.126.)

Here’s another recurrence relation, for the recursive version of Binary Search:

Example 6.20: Binary Search.


Let T(n) denote the worst-case running time of the recursive binarySearch (see Figure 6.26) on an
n-element array. Then:
T(0) = c
(
T( 2n ) + c if n is even
T(n) =
T( n−1
2 ) + c if n is odd.

Although our interest in recurrence relations will be almost exclusively about the running times of recursive
algorithms, there are other interesting recurrence relations, too. The most famous of these is the recurrence
for the Fibonacci numbers (which turn out to have some interesting CS applications, as we’ll see):

Example 6.21: Fibonacci numbers.


The Fibonacci numbers are defined by
f1 = 1
f2 = 1
fn = fn−1 + fn−2 for n ≥ 3.
The first several Fibonacci numbers are 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, and 89.

ISTUDY
302 Analysis of Algorithms

6.4.2 Solving Recurrences: Induction


When we solve a recurrence relation, we find a closed-form (that is, nonrecursive) equivalent expression.
Because recurrence relations are recursively defined quantities, induction is the easiest way to prove that
a conjectured solution is correct. (The hard part is figuring out what solution to conjecture.) Here, we’ll
solve the recurrences from Section 6.4.1, starting with Alice and Bob and their bank accounts:

Example 6.22: Compound interest.


Let’s solve the recurrences from Example 6.17:
A(0) = 1000 A(n) = 1.02 · A(n − 1) (Alice)
B(0) = 10 B(n) = 1.02 · B(n − 1) + 10. (Bob)

The recurrence for Alice is the easier of the two to solve: we can prove relatively straightforwardly by
induction that A(n) = 1000 · 1.02n for any n ≥ 1.
For Bob, the analysis is a little trickier. Here’s some intuition: in year n, Bob has had $10 sitting in his
account since year 0 (earning interest for n years); he has had $10 in his account since year 1 (earning
interest for n − 1 years); and so forth. Ten dollars that has accumulated interest for i years has, as with
Alice, grown to 10 · 1.02i . Thus the total amount of money in Bob’s account in year n will be
by Theorem 5.5 (the analysis of a geometric series)
X
n
  X
n
1.02n+1 −1
10 · 1.02 = 10 ·
i
1.02 = 10 ·
i
1.02−1 = 510 · 1.02n − 500.
i=0 i=0

Using this intuition, let’s prove that B(n) = 510 · 1.02n − 500, by induction on n:
Base case (n = 0). The recurrence defines B(0) as 10, and indeed 510 · 1.020 − 500 = 510 − 500 = 10.
Inductive case (n ≥ 0). We assume the inductive hypothesis, which states that B(n − 1) =
510 · 1.02n−1 − 500; we must show that B(n) = 510 · 1.02n − 500:
B(n) = 1.02 · B(n − 1) + 10 definition of B(n)
 
= 1.02 · 510 · 1.02n−1 − 500 + 10 inductive hypothesis
= 1.02 · 510 · 1.02n−1 − 1.02 · 500 + 10 multiplying through
= 510 · 1.02 − 510 + 10
n
simplifying
= 510 · 1.02 − 500.
n

Taking it further: As Example 6.22 suggests, familiar summations like geometric series can be expressed using recurrence
relations. Other familiar summations can also be expressed using recurrence relations; for example, the sum of the first n integers
is given by the recurrence T(1) = 1 and T(n) = T(n − 1) + n. (See Section 5.2 for some closed-form solutions.)

Factorial
One good way to generate a conjecture (to then prove correct by induction) is by “iterating” the recurrence:
expand out a few layers of the recursion to see the values of T(n) for a few small values of n. We’ll illustrate
this technique with the simplest recurrence from the last section, for the recursive factorial function.

Example 6.23: Factorial.


Recall the recurrence from Example 6.18:
T(1) = d T(n) = T(n − 1) + c.

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 303

Give an exact closed-form (nonrecursive) solution for T(n).


Solution. Let’s iterate the recurrence a few times:
T(1) = d
T(2) = c + T(1) = c + d
T(3) = c + T(2) = 2c + d
T(4) = c + T(3) = 3c + d
From the values of this recurrence for small n, we might conjecture that T(n) = (n − 1)c + d. Let’s prove
this conjecture by induction.
For the base case (n = 1), we have T(1) = d by definition of the recurrence, and d = 0 · c + d, as
desired.
For the inductive case (n ≥ 1), assume the inductive hypothesis T(n − 1) = (n − 2)c + d. We want to
show that T(n) = (n − 1)c + d. Here’s the proof:
T(n) = T(n − 1) + c by definition of the recurrence
= [(n − 2)c + d] + c by the inductive hypothesis
= (n − 1)c + d. by algebraic manipulation

Thus T(n) = (n − 1)c + d.

Problem-solving tip: Try iterating a recurrence to generate its first few values. Once we have a few values, we can often conjecture
a general solution (which we then prove correct via induction).

Merge Sort
Recall the Merge Sort recurrence, where T(n) = T(⌈ n2 ⌉) + T(⌊ n2 ⌋) + cn and T(1) = c. We’ll solve this
recurrence in this section. But it will be easier to address the case in which n is an exact power of 2 first
(so that the floors and ceilings don’t complicate the picture), so we’ll start with that case:

Example 6.24: Merge Sort, for powers of two.


Give an exact closed-form (nonrecursive) solution for the Merge Sort recurrence from Example 6.19:
T(1) = c T(n) = T(⌈ n2 ⌉) + T(⌊ n2 ⌋) + cn.
Assume that n is an exact power of two.
Solution. Because n is an exact power of two, we can write n = 2k for some k ∈ Z≥0 . (For an exact
power of two, we have ⌈ n2 ⌉ = ⌊ n2 ⌋ = n2 = 2k−1 .)
We’ll solve the given recurrence by thinking about the recursive relationship involving k, instead of
n. Specifically, define R(k) = T(2k ). Then R(0) = T(1) = c and R(k) = T(2k ) = 2 · T(2k−1 ) + c · 2k =
2 · R(k − 1) + c · 2k for k ≥ 1. Thus we can instead solve the recurrence
R(0) = c R(k) = 2 · R(k − 1) + c · 2k .
Iterating R a few times, we see
R(0) = c
R(1) = c · 21 + 2 · R(0) = 4c

ISTUDY
304 Analysis of Algorithms

R(2) = c · 22 + 2 · R(1) = 12c


How might we get to the conjecture (∗)? The pattern from
R(3) = c · 23 + 2 · R(2) = 32c. iterating R matches it, or looking at the recursion tree in
Figure 6.25 might help; that tree had 1 + log2 n = 1 + k
levels, with c · n = c · 2k work per level, or (1 + k)2k · c in
On the basis of these values, we might conjecture total. Alternatively, we’d expect a solution that’s the product
of ≈ k and ≈ 2k so that we get T(n) ≈ n log n; if we check
the k = 0 case—R(0) = 1—it looks like we’d better
R(k) = (1 + k)2k · c. multiply by k + 1 rather than k. (∗)
Let’s prove (∗), by induction on k.
In the base case, R(0) = c by the definition of the recurrence, and indeed (1 + 0)20 · c = 1 · 1 · c.
In the inductive case, we assume the inductive hypothesis R(k − 1) = k2k−1 · c. Then,
R(k) = 2R(k − 1) + c · 2k by definition of the recurrence
= 2 · [k · 2 k−1
· c] + c · 2k
by the inductive hypothesis
= (k + 1)2 · c.k
factoring and simplifying

Thus R(k) = (k + 1)2k · c, completing the inductive case—and the proof of (∗). Because we defined
R(k) = T(2k ), we can conclude that T(n) = R(log2 n), by substituting. Therefore
T(n) = (1 + log2 n) · 2log2 n · c = (1 + log2 n) · n · c.

Problem-solving tip: A useful technique for solving recurrences is to do a variable substitution. If you can express the recurrence
in terms of a different variable and solve the new recurrence easily, you can then substitute back into the original recurrence to
solve it. Transforming an unfamiliar recurrence into a familiar one will make life easy!

Thinking only about powers of two in Example 6.24 made our life simpler, but it means that our analysis
was incomplete: what is the running time of Merge Sort when the input array’s length is not precisely a
power of two? The more general analysis is actually not too complicated, given the result we just derived:

Example 6.25: Merge Sort, for general n.


Solve the Merge Sort recurrence (asymptotically), for any integer n ≥ 1:
T(1) = c T(n) = T(⌈ 2n ⌉) + T(⌊ n2 ⌋) + cn.

Solution. We’ll use the fact that T(n) ≥ T(n′ ) if n ≥ n′ —that is, T is monotonic. (See Exercise 6.101.)
So let k be the nonnegative integer such that 2k ≤ n < 2k+1 . Then
T(n) ≥ T(2k ) and T(n) < T(2k+1 ) monotonicity and 2k ≤ n < 2k+1
= (1 + log2 2k ) · 2k · c = (1 + log2 2k+1 ) · 2k+1 · c Example 6.24
> (1 + log2 n2 ) · n
2 · c. ≤ (1 + log2 2n) · 2n · c n
2 < 2 and 2n ≥ 2
k k+1
(by definition)
= Ω(n log n) = O(n log n).
Combining these facts yields that T(n) = Θ(n log n).

Binary Search
The logic that we used to argue that Binary Search takes logarithmic time in Example 6.12 is reasonably
intuitive: In the worst case, when the sought item x isn’t in the array, we repeatedly compare x to the middle

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 305

of the valid range of the array, and halve the size of that valid range. We can halve an n-element range
exactly log2 n times, and thus the running time of Binary Search is logarithmic.
But, while this intuitive argument correctly establishes that Binary Search’s running time is O(log n),
there’s a slightly subtle issue that we’ve glossed over: the so-called “halving” in this description isn’t
actually exactly halving. If there are n elements in the valid range, then after comparing x to the middle
element of the range, we will end up with a worst-case valid range of size either n2 or n−12 , depending on
n
the parity of n. (So it’s sometimes, but not always, exactly 2 . The argument in Example 6.12 only relied on
the fact that we end up with a valid range of size at most n2 . But we have not yet ruled out the possibility
that the running time might be faster than Θ(log n), if we “slightly better than halve” at every stage.)
We can resolve this issue by rigorously analyzing the correct recurrence relation—and we can prove
that the running time is in fact Θ(log n).

Example 6.26: Binary Search.


Solve the Binary Search recurrence:
(
T( n2 ) + 1 if n is even
T(0) = 1 T(n) =
T( n−1
2 ) +1 if n is odd.
(Note that we’ve changed the additive constants to 1 instead of c; changing it back to c would only have
the effect of multiplying the entire solution by c.)
Solution. We conjecture that T(n) = ⌊log2 n⌋ + 2 for all n ≥ 1. We’ll prove the conjecture by strong
induction on n.
For the base case (n = 1), we have T(1) = T(0) + 1 = 1 + 1 = 2 by definition of the recurrence, and
indeed 2 = ⌊0⌋ + 2 = ⌊log2 1⌋ + 2.
For the inductive case (n ≥ 2), assume the inductive hypothesis, namely that T(k) = ⌊log2 k⌋ + 2 for
any k < n. We’ll proceed in two cases:
Case I: n is even. Then T(n) = T( n2 ) + 1 by definition of the recurrence
= ⌊log2 ( n2 )⌋ +2+1 by the inductive hypothesis
= ⌊(log2 n) − 1⌋ + 3 because log( ab ) = log a − log b, and log2 2 = 1
= ⌊log2 n⌋ + 2. because ⌊x + 1⌋ = ⌊x⌋ + 1

Case II: n is odd. Then T(n) = T( n−1


2 )+1 by definition of the recurrence
= ⌊log2 ( n−1
2 )⌋ +2+1 by the inductive hypothesis
= ⌊log2 (n − 1)⌋ + 2 by the same manipulations as in the even case
= ⌊log2 n⌋ + 2. because ⌊log2 (n − 1)⌋ = ⌊log2 n⌋ for any odd integer n > 1

Because T(n) = ⌊log2 n⌋ + 2 in both cases, we’ve proven the claim. Therefore T(n) = Θ(log n).

Problem-solving tip: When solving a new recurrence, we can generate conjectures (to prove correct via induction) by iterating the
recurrence, drawing out the recursion tree, or by straight-up guessing a solution (or recognizing a familiar pattern). To generate
my conjecture for Example 6.26, I wrote a program implementing the recurrence. I ran the program for n ∈ {1, 2, . . . , 2000} and
printed out the smallest integer n for which T(n) = 1, then the smallest for which T(n) = 2, etc. (See Figure 6.10 for a graph of
the function.) The conjecture followed from the observation that the breakpoints all happened at n = 2k − 1 for an integer k.

ISTUDY
306 Analysis of Algorithms

As a general matter, the appearance of floors and ceilings inside a recurrence won’t matter to the
asymptotic running time, nor will small additive adjustments inside the recursive term. For example,
T(n) = T(⌈ n2 ⌉) + 1 and T(n) = T(⌊ n2 ⌋ − 2) + 1 both have T(n) = Θ(log n) solutions. Typically, under-
standing the running time for the “pure” version of the recurrence will give a correct understanding of the
more complicated version. As such, we’ll often be sloppy in our notation, and write T(n) = T( n2 ) + 1 when
we really mean T(⌊ n2 ⌋) or T(⌈ n2 ⌉). (This abuse of notation is fairly common.)
Taking it further: Intuitively, floors and ceilings don’t change this type of recurrence because they don’t affect the total depth of
the recursion tree by more than a Θ(1) number of calls, and a Θ(1) difference in depth is asymptotically irrelevant. There’s a gen-
eral theorem called the “sloppiness” theorem, which states precise conditions under which it is safe to ignore floors and ceilings
in recurrence relations. (As long as we actually prove inductively that our conjectured solution to a recurrence relation is correct,
it’s always fine in generating conjectures.) As a rough guideline, as long as T(n) is monotonic (if n ≤ n′ , then T(n) ≤ T(n′ )) and
doesn’t grow too quickly (T(n) is O(nk ) for some constant k), then this “sloppiness” is fine. The details of the theorem, and its
precise assumptions, are presented in many algorithms textbooks; see [33] for a full proof, for example.

6.4.3 The Fibonacci Numbers


We’ll close with another example of a recurrence relation—the Fibonacci recurrence—that we will analyze
using induction. But this time we will solve the recurrence exactly (that is, nonasymptotically):

Example 6.27: The Fibonacci Numbers.


Recall the Fibonacci numbers, defined by the recurrence
f1 = 1 f2 = 2 fn = fn−1 + fn−2 .
Prove that fn grows exponentially: that is, prove that there exist a ∈ R>0 and r ∈ R>1 such that fn ≥ arn .

Some brainstorming: Let’s start in the middle of a hypothetical proof. Suppose that we’ve
somehow magically figured out values of a and r to make the base cases work (n = 1 and
n = 2—there are two base cases because f2 is not defined recursively, either). And suppose
that we’re in the middle of a inductive proof:
definition of inductive
algebra
the recurrence hypothesis

fn = fn−1 + fn−2 ≥ arn−1 + arn−2 = arn−2 (r + 1).


But what we want to prove is fn ≥ arn . So we’d be done if only r + 1 = r2 —that is, if
r2 − r − 1 = 0. But the value of r isn’t specified by the problem—so we get to choose its
value! Using the quadratic formula, we find that there are two solutions to this equation, which
we’ll name ϕ and ϕ̂:
√ √
1+ 5 1− 5
ϕ= 2 ϕ̂ = 2 .

Let’s use r = ϕ. To get the base cases to work, we would need to have f1 ≥ aϕ and
f2 ≥ aϕ2 = a(1 + ϕ)—in other words, 1 ≥ aϕ and 1 ≥ a(1 + ϕ). Because 1 + ϕ > ϕ, the
latter is the harder one to achieve. To ensure that a(1 + ϕ) ≤ 1, we must have
a≤ 1
1+ϕ = 1 √
= 2√
.
1+ 1+2 5 3+ 5

We’ve now identified a value of r and a constraint on a so the proof might work. Let’s try it!

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 307


Solution. The brainstorming above identifies a value ϕ = 1+2 5 such that ϕ + 1 = ϕ2 and a corresponding
value of a = 3+2√5 . Using these values, we’ll prove the following claim:

Claim: fn ≥ 2√
3+ 5
· ϕn , where ϕ = 1+ 5
2 .

Proof (by strong induction on n). There are two base cases:
√ √
For n = 1, we have 2√
3+ 5
· ϕ1 = 2√
3+ 5
· 1+ 5
2 = 1+√5
3+ 5
= 0.6180 · · · < 1 = f1 .
For n = 2, we have 2√
3+ 5
· ϕ2 = 2√
3+ 5
· (1 + ϕ) we chose ϕ so that ϕ + 1 = ϕ2

= 2√
3+ 5
· 3+ 5
2 = 1 = f2 .

For the inductive case (n ≥ 3), we assume the inductive hypothesis, namely that fk ≥ 2√
3+ 5
· ϕk for any
1 ≤ k ≤ n − 1. Then:
fn = fn−1 + fn−2 definition of the Fibonaccis

≥ 2√
3+ 5
· ϕn−1 + 2√
3+ 5
· ϕn−2 inductive hypothesis, applied twice

= 2√
3+ 5
· ϕn−2 · (ϕ + 1) factoring

= 2√
3+ 5
· ϕn−2 · ϕ2 we chose ϕ so that ϕ + 1 = ϕ2

= 2√
3+ 5
· ϕn . algebra

Therefore the claim follows by induction.

Problem-solving tip: Sometimes starting in the middle of a proof helps! You still need to go back and connect the dots, but
imagining that you’ve gotten somewhere may help you figure out how to get there.


Taking it further: The value ϕ = 1+2 5 ≈ 1.61803 · · · is called the golden ratio. It has a number of interesting characteristics,
including both remarkable mathematical and aesthetic properties. For example, a rectangle whose side lengths are in the ratio
ϕ-to-1 can be divided into a square and a rectangle whose side lengths are in the ratio 1-to-ϕ. That’s because, for these rectangles
to have the same ratios, we need ϕ1 = ϕ− 1
1 —that is, we need ϕ(ϕ − 1) = 1, which means ϕ − ϕ = 1. (See Figure 6.27.)
2

The golden ratio, it has been argued, describes proportions in famous works of art ranging from the Acropolis to Leonardo da
Vinci’s drawings. The Fibonacci numbers also show up all over the place in nature—and also in computation. One computational
application in which they’re relevant is in the design and analysis of a data structure called an AVL tree, a form of binary search
tree that guarantees that the tree supports all its operations efficiently. See p. 309.

1 ϕ−1
(a) A rectangle in the ratio of ϕ to 1, (b) Repeatedly inscribing a square in (c) The same rectangles, rotated and
with an inscribed square. the “leftover” rectangle. shifted to share a corner.

Figure 6.27 Some golden rectangles.

ISTUDY
308 Analysis of Algorithms

A closed-form formula for the Fibonaccis


While Example 6.27 establishes a lower bound on the Fibonacci numbers—in asymptotic notation, it proves
that fn = Ω(ϕn )—we have not yet established an exact formula for the nth Fibonacci number. Here is an
exact formula, along with a proof that it’s correct:

Example 6.28: A closed-form solution for the Fibonaccis.

Some more brainstorming: The trick will be to make use of ϕ̂. The inductive case would go
through perfectly, just as in Example 6.27, if we tried to prove fn = aϕn + bϕ̂n , for constants a
and b. But what about the base cases? For f1 , we would need 1 = aϕ + bϕ̂; for f2 , we would need
1 = aϕ2 + b(ϕ̂2 ) = a(1 + ϕ) + b(1 + ϕ̂). That’s two linear equations with two unknowns, and
−1
some algebra will reveal that a = √15 and b = √ 5
solves these equations.

√ √
−ϕ̂n
ϕn√ 1+ 5 1− 5
Claim: fn = 5
, where ϕ = 2 and ϕ̂ = 2 .

Proof (by strong induction on n). There are two base cases:
√ √
−ϕ̂1
ϕ1√
1+ 5
2 √− 1−2 5
For n = 1, we have 5
= 5
definition of ϕ and ϕ̂

2 5
= √2 =1 algebra
5
= f1 . definition of the Fibonaccis
−ϕ̂
ϕ√ 2 2
1+ϕ−(1+ϕ̂)

For n = 2, we have = ϕ2 = 1 + ϕ and ϕ̂2 = 1 + ϕ̂
5 5
ϕ−
√ ϕ̂
= 5
=1 algebra and the previous case (n = 1)
= f2 . definition of the Fibonaccis
−ϕ̂k
ϕk√
For the inductive case (n ≥ 3), we assume the inductive hypothesis: for any k < n, we have fk = 5
.
Then:
fn = fn−1 + fn−2 definition of the Fibonaccis
−ϕ̂n−1
ϕn−1√ −ϕ̂n−2
ϕn−2√
= 5
+ 5
inductive hypothesis, applied twice
ϕn−2 (ϕ+1)−ϕ̂n−2 (ϕ̂+1)
= √ factoring
5
ϕ n−2
−ϕ̂
2
ϕ√ n−2
ϕ̂ 2
= ϕ + 1 = ϕ2 and ϕ̂ + 1 = ϕ̂2
5
−ϕ̂n
ϕn√
= 5
.

ISTUDY
6.4 Recurrence Relations: Analyzing Recursive Algorithms 309

Computer Science Connections

AVL Trees
A binary search tree is a data structure that allows us to store a dynamic set of elements, supporting Insert, Delete,
and Find operations. Briefly, a binary tree consists of a root node at the top; each node u can have zero, one, or two
children directly attached beneath u. (See p. 637 for more about binary search trees, specifically.)
An AVL tree is a special type of binary search tree
that ensures that the tree is “balanced” and therefore height = 1 height = 1 + max(hL , hR )
supports its operations very efficiently [3]. The point
is to ensure that the tree is “shallow,” because the cost
of almost every operation on binary search trees is pro-
portional to the height of the tree. (The height of a node hL
in a tree is the number of levels of nodes beneath it; the hR
height of the tree is the height of the root.) AVL trees
were developed by and named after two Soviet com-
puter scientists, Georgy Adelson-Velsky (1922–2014) Figure 6.28 Binary trees and AVL trees. An AVL tree can
and Evgenii Landis (1921–1997). (You’d expect it to be either any zero- or one-node tree (of height 0 and 1,
be three people, but hyphenated names are confusing.) respectively); or an AVL tree can be any tree whose root has
Specifically, an AVL tree is a binary search tree in one or two children in which (i) h and h differ by 0 or 1,
L R
which, for any node u, the height of u’s left child and
and (ii) both subtrees are themselves AVL trees.
the height of u’s right child differ by at most one. (See
Figure 6.28.) Alternatively, we can define AVL trees
recursively: an empty (zero-node) tree is an AVL tree of height 0; and a tree of height h ≥ 1 is an AVL tree if both
(i) the subtrees rooted at the two children of the root are both AVL trees; and (ii) the heights of the root’s children
are either both h − 1, or one is h − 1 and the other is h − 2. A few examples of AVL trees are shown in Figure 6.29.

An upper bound
If you studied AVL trees before, you were probably
told “AVL trees have logarithmic height.” Here, we’ll
prove it. Consider an AVL tree T of height h. After a
little contemplation, it should be clear that T will con-
tain the maximum possible number of nodes (out of all
AVL trees of height h) when both of the children of T’s Figure 6.29 Three AVL trees. Take any node u in any of the
root node have height h − 1, and furthermore that both three trees; you can check that the number of layers beneath
subtrees of the root have as many nodes as an AVL tree u’s left child and u’s right child differ by at most one.
of height h − 1 can have. Let’s think about this argu-
ment using a recurrence relation. Let M(h) denote the maximum number of nodes that can appear in an AVL tree of
height h. There can be only one node in a height 1 tree, so M(1) = 1. For h ≥ 2, the above argument says that

M(h) = M(h − 1) + M(h − 1) + 1. (∗)


the left subtree the right subtree the root node
Claim: M(h) = 2h − 1.
Proof. The proof is straightforward by induction. For the base case (h = 1), we have M(h) = 1 by definition, and
21 − 1 = 2 − 1 = 1. For the inductive case, we have M(h) = 2M(h) + 1 = 2 · 2(2h−1 − 1) + 1 by (∗) and the
inductive hypothesis. Simplifying yields M(h) = 2h − 2 + 1 = 2h − 1.
(Another way to see that the largest number of nodes in any binary tree of height h is 2h −1 is by looking at the tree by
row: there’s 1 root, with 2 children, and 4 “grandchildren,” and 8 “great-grandchildren,” and so forth.) Figure 6.30a
shows the fullest-possible AVL trees for a few small heights.

(continued)

ISTUDY
310 Analysis of Algorithms

Computer Science Connections

AVL Trees, continued

A lower bound
To analyze the worst-case height of an AVL tree,
though, we need to look at the other direction: what is
(a) The fullest-possible AVL trees of height h ∈ {1, 2, 3, 4},
the fewest nodes that can appear in an AVL of height h?
respectively containing 1 = 21 − 1, 3 = 22 − 1, 7 = 23 − 1,
(We can transform this analysis into one that finds the and 15 = 24 − 1 nodes.
largest possible height of an AVL tree with n nodes.)
Define N(h) as the minimum number of nodes in
an AVL tree of height h. As before, any height 1 tree
has one node, so N(1) = 1. It’s also immediate that
N(2) = 2. (The emptiest-possible AVL trees of a few
small heights are shown in Figure 6.30b. The smallest
AVL tree of height 1 and height 2 are shown there, with
1 and 2 nodes, respectively.) (b) The emptiest-possible AVL trees of height
For a larger height h, it’s not too hard to persuade h ∈ {1, 2, 3, 4, 5}, which contain 1, 2, 4, 7, and 12 nodes.
yourself that the minimum number of nodes in an AVL
tree of height h is achieved when the root has one child Figure 6.30 The fullest and emptiest possible AVL trees
of height h − 1 and one child of height h − 2—and fur- for a few small heights.
thermore when these two subtrees contain as few nodes
as legally possible. That is,

N(h) = N(h − 1) + N(h − 2) + 1 . (†)


the deeper subtree the shallower subtree the root node

Observe that N(h) = 1 + N(h − 1) + N(h − 2) ≥ 1 + 2 · N(h − 2) because N(h − 1) ≥ N(h − 2). Therefore we
can conclude that N(h) ≥ 2h/2 − 1.
We can do better, though, with a bit more work. Define P(h) = 1 + N(h). Adding one to both sides of (†), in
this new notation, we have that P(h) = P(h − 1) + P(h − 2). (This recurrence should look familiar: it’s the same
recurrence as for the Fibonacci numbers!) Because P(1) = 1 + N(1) = 2 = f3 and P(2) = 1 + N(2) = 3 = f4 , we
can prove inductively that P(h) = fh+2 .
Claim: N(h) ≥ ϕh − 1.
1 2√
Proof. Using the definition of P, the proof in Example 6.27, and the fact that ϕ2
= 3+ 5
, we have

N(h) = P(h) − 1 = fh+2 − 1 ≥ 2√


3+ 5
· ϕh+2 − 1 = ϕh − 1.

Putting it all together



1+ 5
The analysis above will let us prove the following theorem, where ϕ = 2 :

Theorem 6.20: The height of an AVL tree is logarithmic.


The height h of any n-node AVL tree satisfies logϕ (n + 1) ≥ h ≥ log2 (n + 1).

(We know from the first claim that 2h − 1 = M(h) ≥ n. Thus 2h ≥ n + 1, and—taking logs of both sides—we have
h ≥ log2 (n + 1). Similarly, we know from the second claim that ϕh − 1 = N(h) ≤ n. Thus ϕh ≤ n + 1, and—taking
logϕ of both sides—we have h ≤ logϕ (n + 1).)
log (x) log2 (x)
By changing log bases, we have that logϕ (x) = log 2(ϕ) ≈ 0.69424··· ≈ 1.4404 · log2 (x). Thus this theorem says
2
that an n-node AVL tree has height between log2 (n + 1) and 1.44 log2 (n + 1). (In fact, there are AVL trees whose
height is as large as 1.44 log2 (n + 1), so this analysis is tight.)

ISTUDY
Exercises 311

EXERCISES
A quadtree is a data structure typically used to store a collection of n points in R2 . The basic idea is to start with a bounding box
that includes all n points, and then subdivide, into four equal-sized subregions, any region that contains more than a designated
number k of points. (For simplicity, we will subdivide any region with more than k = 1 point.) The height of a quadtree is the
number of levels of the deepest subdivision of the tree. See Figure 6.31.
6.80 Let R(h) denote the largest number of regions that a quadtree of height h can contain. Write a recurrence relation for R(h).
6.81 Let S(h) denote the smallest number of regions that a quadtree of height h can contain. Write a recurrence relation for S(h).
6.82 Let T(n) denote the smallest number of regions that a quadtree with n points can contain. Write a recurrence relation for T(n). You
may use the following fact without proof: the most efficient division of n points in a quadtree occurs when each subregion contains
precisely 4n points.

Consider the recursive algorithms shown in Figure 6.32, which all solve the same problem. The following exercises ask you to write
down a recurrence relation to express each running time, and then use your recurrence to prove by induction that the algorithm
requires O(n) time. (Assume that selecting a subarray takes Θ(1) time.)
6.83 Give a recurrence for B.
6.84 Give a recurrence for C.
6.85 Give a recurrence for D.
6.86 Prove that B is O(n).
6.87 Prove that C is O(n). (For ease, you may assume n is a power of 2.)
6.88 Prove that D is O(n).
6.89 What problem do B, C, and D solve?

Consider the following ternary search algorithm, a variation on binary search. Suppose you have a sorted array A[1 . . . n] and
you’re searching for a particular value x in it. If n ≤ 2, just check whether x is one of the one or two entries in A. Otherwise,
compare x to A[n/3] and A[2n/3], and do the following:
• if x = A[⌊n/3⌋] or x = A[⌊2n/3⌋], return true.
• if x < A[⌊n/3⌋], recursively search A[1 . . . ⌊n/3⌋ − 1].
• if A[⌊n/3⌋] < x < A[⌊2n/3⌋], recursively search A[⌊n/3⌋ + 1 . . . ⌊2n/3⌋ − 1].
• if x > A[⌊2n/3⌋], recursively search A[⌊2n/3⌋ + 1 . . . n].
6.90 Analyze the asymptotic worst-case running time of ternary search. Prove your answer correct using induction. For convenience,
you may assume that n is a power of three.
6.91 Does ternary search perform better or worse than binary search? Here you should count the exact number of comparisons that each
algorithm performs—don’t give an asymptotic answer.

6.92 Consider a simplified (and thus slightly erroneous) version of the recurrence for Binary Search: T(n) = T( n2 ) + c and T(1) = c.
(This recurrence ignores the off-by-one complications.) Prove by induction that T(n) = c(1 + log n) if n is a power of two.

6.93 Consider the “Median of Three” pivoting rule for quickSort. (See Example 5.14 and Exercises 6.74–6.77.) As Exercise 6.77
established, this algorithm can still be slow; the recurrence relation for the worst-case version of the algorithm is T(1) = T(2) = 1
and T(n) = T(n − 2) + cn. Prove that T(n) = Θ(n2 ).
6.94 Generalize your argument from Exercise 6.93 to show that the recurrence
(
1 if n ≤ k
T(n) =
T(n − k) + n otherwise

has solution T(n) = Θ(n2 ) for any integer k ≥ 1.

1 2 3
1

2 3

Figure 6.31 The decomposition of the plane to build a quadtree. (A region’s children are its subregions, clockwise from
the upper left.) The shaded regions at right correspond to the numbered nodes of the quadtree. This quadtree contains 17
regions, and its height is 4.

ISTUDY
312 Analysis of Algorithms

B(A[1 . . . n]): C(A[1 . . . n]): D(A[1 . . . n]):


1 if n = 0 then 1 if n = 0 or (n = 1 and A[1] ≥ 0) then 1 if n = 0 or (n = 1 and A[1] ≥ 0) then
2 return 0 2 return 0 2 return 0
3 else if A[1] < 0 then 3 else if n = 1 and A[1] < 0 then 3 else if n = 1 and A[1] < 0 then
4 return 1 + B(A[2 . . . n]) 4 return 1 4 return 1
5 else 5 else 5 else
6 return B(A[2 . . . n]) 6 count := 0 6 count := 0
 
7 count := count + C(A[1 . . . ⌊ 2n ⌋]) 7 count := count + D(A[1 . . . 4n ])
8 count := count + C(A[⌊ 2n ⌋ + 1 . . . n]) 8 count := count + D(A[⌊ 4 ⌋ + 1 . . . ⌊ 3n
n
4 ⌋])
9 return count 9 count := count + D( A[⌊ 3n 4 ⌋ + 1 . . . n] )
10 return count

Figure 6.32 Three recursive algorithms.

Recall that the Fibonacci numbers are defined by the recurrence f1 = f2 = 1 and fn = fn−1 + fn−2 . The next several exercises refer
to this recurrence and the algorithms for computing the Fibonacci numbers in Figure 6.33.
6.95 First, a warmup unrelated to the algorithms in Figure 6.33: prove by induction that fn ≤ 2n .
6.96 Prove that fibNaive(n − k) appears a total of fk+1 times in the call tree for fibNaive(n).
6.97 Write down and solve a recurrence for the running time of helper (and therefore fibMedium).
6.98 Write down and solve a recurrence for the running time of exp (and therefore fibClever).
6.99 (programming required) The reference to “repeated squaring” in fibMatrix is precisely the same as the idea of exp. Implement
fibMatrix using this idea in a programming language of your choice. (See Exercise 5.62.)

The Merge Sort algorithm sorts an input array by splitting it in half, recursively sorting the two halves, and then merging the
two resulting sorted arrays into a single sorted array. (See Figure 6.23 for mergeSort, and see Figure 6.34 for a reminder of the
algorithm that does this merging.)
6.100 Give a recurrence relation T(n) describing the running time of merge on two input arrays with a total of n elements, and prove that
T(n) = Θ(n).
6.101 Consider the recurrence for the running time of mergeSort (see Figure 6.23):
n  
T(1) = c and T(n) = T( 2 ) + T( n2 ) + cn.

Prove that T(n) ≤ T(n′ ) if n ≤ n′ —that is, T is monotonic.


6.102 Here is a recurrence relation for the number of comparisons done by mergeSort on an input array of size n:

C(1) = 0 and C(n) = 2C( 2n ) + n − 1.

Explain the recurrence relation, and then prove that C(n) = n log n − n + 1 by induction. (For ease, we’ll assume that n is a power
of two.)

fibNaive(n): fibMedium(n): fibClever(n):


exp(ϕ,n)−exp(ϕ̂,n)
1 if n = 0 or n = 1 then 1 ⟨fn , fn−1 ⟩ := helper(n) 1 return √
5
2 return 1 2 return fn
3 else exp(b, n):
4 return fibNaive(n − 1) + fibNaive(n − 2) helper(n): 1 if n = 0 then
fibMatrix(n): 1 if n = 0 then 2 return 1
1 Compute (using repeated squaring) 2 return ⟨1, undefined⟩ 3 else
 
" # " #n " # 3 else if n = 1 then 4 s := exp(b, n2 )
x 1 1 1 return ⟨1, 1⟩
:= · . 4 5 if n is odd then
y 1 0 1 5 else 6 return b · s · s
// See Exercises 5.62 and 6.99. 6 ⟨fn−1 , fn−2 ⟩ := helper(n − 1) 7 else
2 return x 7 return ⟨fn−1 + fn−2 , fn−1 ⟩ 8 return s · s

ϕn√
−ϕ̂n
Figure 6.33 Four algorithms for the Fibonaccis. The values ϕ and ϕ̂ satisfy fn = 5
; see Example 6.28.

ISTUDY
Exercises 313

merge(X[1 . . . n], Y[1 . . . m]):


1 if n = 0 then f (n): g(n):
2 return Y 1 if n ≤ 1 then 1 if n ≤ 1 then
3 else if m = 0 then 2 return n 2 return n
4 return X 3 else 3 else
5 else if X[1] < Y[1] then 4 return f (n − 2) 4 x := 1
6 return X[1] followed by merge(X[2 . . . n], Y) 5 while n ≥ 2x:
7 else 6 x := 2 · x
8 return Y[1] followed by merge(X, Y[2 . . . m]) 7 return g(n − x)

Figure 6.34 The “merging” of two sorted arrays. Figure 6.35 Two recursive algorithms.

The next few exercises refer to the algorithms f and g in Figure 6.35, both which solve the same problem.
6.103 Give and solve (using induction) a recurrence relation for the running time of f.
6.104 Give a recurrence relation for g, and use it to prove that g(n) runs in O(log2 n) time.
6.105 Describe the set of input values n that cause the worst-case behavior for g(n).
6.106 What problem do f and g solve? Prove your answer.

(A true story, inspired by Michael Eisen’s 2011 blog post “Amazon’s $23,698,655.93 book about flies” [44].) Two copies of an
out-of-print book were listed online by Seller A and Seller B. Their prices were over $1,000,000 each—and the next day, both prices
were over $2,000,000, and they kept going up. By watching the prices over several days, it became clear that the two sellers were
using algorithms to set their prices in response to each other. Let an and bn be the prices offered on day n by Seller A and Seller B,
respectively. The prices were set by two (badly conceived) algorithms such that an = α · bn−1 and bn = β · an where α = 0.9983
and β = 1.27059.
6.107 Suppose that b0 = 1. Find closed-form formulas for an and bn . Prove your answer.
6.108 State a necessary and sufficient condition on α, β , and b0 such that an = Θ(1) and bn = Θ(1).

ISTUDY
314 Analysis of Algorithms

n

6.5 An Extension: Recurrence Relations of the Form T(n) = aT b + cnk
It is wise to do that, for life is but short and time passes quickly. If one is competent in one
thing and understands one thing well, one gains at the same time insight into and
knowledge of many other things into the bargain.
Vincent van Gogh (1853–1890)
letter to Theo van Gogh (April 3rd, 1878)

In this section, we’ll develop a formulaic method to solve a common type of recurrence relation that
comes up often: in analyzing recursive algorithms, we will frequently encounter recurrences that look like

T(n) = aT nb + c · nk , for four constants a ≥ 1, b > 1, c > 0, and k ≥ 0. Here, we’ll develop a unified
solution to this type of recurrence relation. Why do these recurrences come up frequently? They arise in
any recursive algorithm that has the following structure: if the input is small, then we compute the solution
directly; otherwise, to solve an instance of size n:

• we make a different recursive calls on inputs of size nb ; and


• to construct the smaller instances and then to reconstruct the solution to the given instance from the
recursive solutions, we spend Θ(nk ) time.

These algorithms are usually called divide-and-conquer algorithms: they “divide” their input into a pieces,
and then recursively “conquer” those subproblems. (To be precise, the recurrence often has ceilings and
floors as part of its recursive calls, but for now assume that n is exact power of b, so that the floors and
ceilings don’t matter.) Here are two examples of recursive algorithms with recurrences of this form:

Example 6.29: Binary Search.


In Binary Search, we spend c = Θ(1) time to compare the sought element to the middle of the range; we
then make one recursive call to search for the element in the appropriate half of the array. If n is an exact
power of two, then the recurrence is
T(n) = T( 2n ) + c.
(So a = 1, b = 2, and k = 0, because c = c · 1 = c · n0 .)

Example 6.30: Merge Sort.


In Merge Sort, we spend Θ(1) time to divide the array in half. We make two recursive calls on the left
and right subarrays, and then spend Θ(n) time to merge the resulting sorted subarrays into a single sorted
array. If n is an exact power of two, then the recurrence is
T(n) = 2T( n2 ) + c · n.
(So a = 2, b = 2, and k = 1.)

n

6.5.1 Solving Recurrences of the Form T(n) = aT b
+ cnk : Some Intuition
We are going to develop a general technique that allows us to solve any recurrence relation of the form
T(n) = aT( bn ) + c · nk . The technique is based on examining the recursion tree for this recurrence (see
Figure 6.36), and a theorem (Theorem 6.21) that describes the total amount of work represented by this

ISTUDY

6.5 An Extension: Recurrence Relations of the Form T(n) = aT n
b + cnk 315

c · nk n

1 2 a

n n n
c · ( bn )k b b ··· b

1 2 a 1 2 a 1 2 a
···
1 + logb n levels

n n n n n n
c · ( bn2 )k b2 b2 ··· b2 b2 b2 ··· b2

1 2 a 1 2 a 1 2 a 1 2 a 1 2 a 1 2 a
··· ··· ··· ··· ··· ···

. .
. .
. .

c · bk b b
1 2 a 1 2 a

1 1 1 ··· 1 1 1 ··· 1

Figure 6.36 The recursion tree for a recurrence relation T(n) = aT( nb ) + c · nk . Assume that n is an exact power of b.

tree. Here’s the intuition. Let’s think about the ith level of the recursion tree (again, see Figure 6.36)—in
other words, the work done by the recursive calls that are i levels beneath the root of the recursion tree.
Here are a few useful facts:

There are ai different calls at level i. There is 1 = a0 call at the zeroth (root) level, then a = a1 calls at first
level, then a2 calls at the second level, and so forth.

Each of the the calls at the ith level operates on an input of size bni . The input size is n
1 = n at the zeroth
level, then nb at the first level, then bn2 at the second, and so forth.

Thus the total amount of work in the ith level of the tree is ai · c · ( bni )k . (That’s just the number of calls at
the ith level multiplied by the work per call.) Or, simplifying, the total work at the ith level is cnk · ( bak )i .

Thus the total amount of work contained within the entire tree is
P a i
 P a i

cnk · bk
= cnk · bk
. (∗)
i i

(We’ll worry about the bounds on the summation later.)


P
Note that (∗) expresses the total work in the recursion tree as a geometric sum i ri , in which the ratio
between terms is given by r = bak . (See Section 5.2.2.) As with any geometric sum, the critical question
is how the ratio compares to 1: if r < 1, then the terms of the sum are getting smaller and smaller as i
increases; if r > 1, then the terms of the sum are getting bigger and bigger as i increases. (And if r = 1,
then each term is simply equal to 1.)
Our unifying theorem will have three cases, each of which corresponds to one of these three natural
cases for the summation in (∗): its terms increase exponentially with i, its terms decrease exponentially
with i, or its terms are constant with respect to i. In these cases, respectively, most of the work is done at
the leaves of the tree; most of the work is done at the root of the tree; or the work is spread evenly across
the levels of the tree.

ISTUDY
316 Analysis of Algorithms

n 1 call, 1 or n or n2 work

n n n
2 2 2 calls, 1 or 2 or ( 2n )2 work each
1 + log2 n levels

n n n n n
4 4 4 4 4 calls, 1 or 4 or ( 4n )2 work each

n n n n n n n n n
8 8 8 8 8 8 8 8 8 calls, 1 or 8 or ( 8n )2 work each
.. ..
. . 2i calls at level i,
1 or 2ni or ( 2ni )2 work each
2 2

1 1 1 1 2log2 n = n calls, 1 or 1 or 1 work each

{ }
Figure 6.37 The recursion trees for three recurrences: T(n) = 2T( n2 ) + f(n), for f(n) ∈ 1, n, n2 . Each row of the tree
is annotated with both the number of calls at that level, plus the additional work done by each call at that level.

A trio of examples
Before we prove the general theorem, we’ll solve a few recurrences that illustrate these three cases, and
then we’ll prove the result in general. The three example recurrences are

T(n) = 2T( n2 ) + 1 and T(n) = 2T( 2n ) + n and T(n) = 2T( n2 ) + n2 ,

all with T(1) = 1. Figure 6.37 simultaneously sketches the recursion trees for these recurrences.
In each of these recurrences, we divide the input by two at every level of the recursion. Thus, the total
depth of the recursion tree is log2 n. (Assume that n is an exact power of two.) In the recursion tree for any
one of these recurrences, consider the ith level of the tree beneath the root. (The root of the recursion tree
has depth 0.) We have divided n by 2 a total of i times, and thus the input size at that level is 2ni . Furthermore,
there are 2i different calls at the ith level of the tree.

Solving the three recurrences


To solve each recurrence, we will sum the total amount of work generated at each level of the tree. The
three recursion trees for these three recurrences are shown in Figure 6.38.

Example 6.31: Solving T(n) = 2T( 2n ) + 1.


Figure 6.38a shows the recursion tree for this recurrence. There are 2i different calls at the ith level, each
of which is on an input of size 2ni —and we do 1 unit of work for each of these 2i calls. Thus the total
amount of work at level i is 2i . The total amount of work in the entire tree is therefore
X
log2 n
21+log2 n − 1
T(n) = 2i = = 2 · 2log2 n = 2n
2−1
i=0

by Theorem 5.5 (on geometric series). And, indeed, T(n) = Θ(n).

Example 6.32: Solving T(n) = 2T( 2n ) + n.


Figure 6.38b shows the recursion tree. There are 2i calls at the ith level of the recursion tree, on inputs of
size 2ni . We do 2ni units of work at each call, so the total work at the ith level is 2i · ( 2ni ) = n. Note that the

ISTUDY

6.5 An Extension: Recurrence Relations of the Form T(n) = aT n
b + cnk 317

n 1 1·1=1

n n
2 1 2 1 2·1=2
1 + log2 n levels

n n n n
4 1 4 1 4 1 4 1 4·1=4

n n n n n n n n
8 1 8 1 8 1 8 1 8 1 8 1 8 1 8 1 8·1=8
.. ..
. . work at level i
= 2i · 1 = 2i
2 2

1 1 1 1 1 1 1 2log2 n · 1 = n

(a) The recursion tree for T(n) = 2T( n2 ) + 1, with the “row-wise” sums of work. The work at each level is twice the
work at the level above it; thus the work is increasing exponentially at each level of the tree.

n
n n 1· 1 =n

n n n n n
2 2 2 2 2· 2 =n
1 + log2 n levels

n n n n n n n n n
4 4 4 4 4 4 4 4 4· 4 =n

n n n n n n n n n n n n n n n n n
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8· 8 =n

.. ..
. . work at level i
= 2i · 2ni = n
2 2

1 1 1 1 1 1 1 2log2 n · 1 = n

(b) The recursion tree for T(n) = 2T( n2 ) + n. The work at each level is exactly n; thus the work is constant across the
levels of the tree.

n2
n n2 1 · ( n1 )2 = 1

n n2 n n2 n2
2 2 2 2 2 · ( n2 )2 = 2
1 + log2 n levels

n n2 n n2 n n2 n n2 n2
4 4 4 4 4 4 4 4 4 · ( n4 )2 = 4

n n2 n n2 n n2 n n2 n n2 n n2 n n2 n n2 n2
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 · ( n8 )2 = 8
.. ..
. . work at level i
2
= 2i · ( 2ni )2 = n2i
2 2

n2
1 1 1 1 1 1 1 2log2 n · 1 = n

(c) The recursion tree for T(n) = 2T( n2 ) + n2 . The work at each level is half of the work at the level above it; thus the
work is decreasing exponentially at each level of the tree.

Figure 6.38 Three different recursion trees.

ISTUDY
318 Analysis of Algorithms

amount of work at level i is independent of the level i. The total amount of work in the tree is therefore
the work at level i is exactly n, for every i

X
log2 n
X
log2 n
T(n) = n=n· 1 = n(1 + log2 n) = Θ(n log n).
i=0 i=0

Example 6.33: Solving T(n) = 2T( 2n ) + n2 .


Figure 6.38c shows the recursion tree. There are 2i calls at the ith level of the tree, and we do ( 2ni )2 work
2
at each call at this level. Thus the work represented by the ith row of the recursion tree is ( 2ni )2 · 2i = n2i .
The total amount of work in the tree is therefore
1+ 1
2 + ··· + 1
log n =2− 1
log n =2− 1
n
2 2 2 2

X
log2 n
X
log2 n
T(n) = ( 12 )i n2 =n ·
2
( 21 )i = n2 · (2 − n1 ),
i=0 i=0

by Theorem 5.5 (on geometric series). Because 2 − 1n is certainly at most 2, and also certainly at least 1,
we know that n2 ≤ T(n) ≤ 2n2 , which allows us to conclude that T(n) = Θ(n2 ).

6.5.2 The Formal Statement and Some Examples


Examples 6.31–6.33 were designed to build the necessary intuition about the three different cases of the
recursion tree: work increases exponentially across levels of the tree; work stays constant across levels; or
work decreases exponentially across levels. Precisely the same intuition will yield the proof of the unifying
theorem for the collection of recurrences we’re considering. Here is the formal statement of the theorem,
which generalizes the idea of these examples to all recurrences of the form T(n) = aT( nb ) + cnk :

Theorem 6.21: A formulaic solution to recurrences of the form T(n) = aT( nb ) + cnk .
Consider the recurrence
T(1) = c and T(n) = a · T( bn ) + c · nk
for constants a ≥ 1, b > 1, c > 0, and k ≥ 0. Then:

Case (i) [“the leaves dominate”]: if bk < a, then T(n) = Θ(nlogb (a) ).
Case (ii) [“all levels are equal”]: if bk = a, then T(n) = Θ(nk · log n).
Case (iii) [“the root dominates”]: if bk > a, then T(n) = Θ(nk ).

(As we discussed previously, we are abusing notation by using c to denote two different constants in this
theorem statement. Again, as you’ll prove in Exercise 6.126, the recurrence T(1) = d with a constant d > 0
possibly different than c has precisely the same asymptotic solution.)
The proof of the theorem is in Section 6.5.3. Here, we’ll just use the theorem with a few examples, using
Theorem 6.21 to reproduce the recursion-tree analysis of Examples 6.31–6.33:

Example 6.34: Solving T(n) = 2T( 2n ) + 1 using Theorem 6.21.


Consider the recurrence T(n) = 2T( n2 ) + 1 with T(1) = 1. We have a = 2, b = 2, c = 1, and k = 0;
because bk = 20 = 1 < 2 = a, case (i) of Theorem 6.21 says that T(n) = Θ(nlog2 2 ) = Θ(n).

ISTUDY

6.5 An Extension: Recurrence Relations of the Form T(n) = aT n
b + cnk 319

Example 6.35: Solving T(n) = 2T( 2n ) + n using Theorem 6.21.


Consider the recurrence T(n) = 2T( 2n ) + n with T(1) = 1. We have a = 2, b = 2, c = 1, and k = 1;
because bk = 21 = 2 = a, case (ii) of Theorem 6.21 says that T(n) = Θ(n1 log n) = Θ(n log n).

Example 6.36: Solving T(n) = 2T( 2n ) + n2 using Theorem 6.21.


Consider the recurrence T(n) = 2T( 2n ) + n2 with T(1) = 1. We have a = 2, b = 2, c = 1, and k = 2;
because bk = 22 = 4 > 2 = a, case (iii) of Theorem 6.21 says that T(n) = Θ(n2 ).

Taking it further: Although we’ve mostly presented “algorithmic design” and “algorithmic analysis” as two separate phases,
in fact there’s interplay between these pieces. See p. 321 for a discussion of a particular computational problem—matrix
multiplication—and algorithms for it, including a straightforward but slow algorithm and another that (with inspiration from
Theorem 6.21) improves upon that slow algorithm.

6.5.3 A Proof of Theorem 6.21


(This section goes into a bit more detail of a more technical proof than most of those in the book. If you
prefer to use the theorem without delving into the proof, you can skip this section. But if you’re interested,
the proof is here.)

Taking it further: While Theorem 6.21 holds even when the input n is not an exact power of b—we just have to fix the recurrence
by adding floors or ceilings so that it still makes sense—we will prove the result for exact powers of b only. A full proof that
includes for the case when n is not an exact power of b can be found in [33], or you can read more about the original formulation
and motivation for this solution technique in the 1980 paper by Jon Bentley, Dorothea Haken, and James Saxe that presented
the method [13]. The unified technique that we’re considering in this section is frequently called the “Master Method” in papers
and textbooks, but I’ve avoided that terminology here—in part because I think it’s a little misleading (there are many recurrence
relations that have the form addressed by Theorem 6.21, but there are also many other recurrence relations that do not, so it’s
not quite as universal as a “master key”), and in part to avoid the word “master.” There’s a recent (and sadly belated) push in
CS to think about and avoid some racially tinged terminology that has been commonly used—things like “blacklist/whitelist”
in networking and “master/slave” in computer architecture—and, while the use of “master” in “Master Method” is something
much more benign, it remains valuable and kind to think carefully about the impact of one’s choice of language. For more, see
the Inclusive Naming Initiative’s homepage inclusivenaming.org.

Remember that we’re considering the recurrence T(n) = a · T( nb ) + c · nk with T(1) = c. We will show that
the total amount work contained in the recursion tree is
Pb n
log
a i

T(n) = cnk · bk
. (∗)
i=0

As before, the formula (∗) should make intuitive the fact that a = bk (that is, bak = 1) is the critical value
for the theorem. The value of bak corresponds to whether the work at each level of the tree is increasing
( bak > 1), steady ( bak = 1), or decreasing ( bak < 1). The summation in (∗) is a geometric sum, and as we saw
in Chapter 5 geometric sums behave fundamentally differently based on whether their ratio is less than,
equal to, or greater than one.
Proof of Theorem 6.21 (for n an exact power of b). For all three cases, we’ll use (∗) as the starting point,
so we need to prove that statement first. A formal proof by induction is left to you as Exercise 6.127, but
here is the intuition from the recursion tree (Figure 6.36). There are ai nodes at the ith level of the tree
(counting the root as level zero). Each node at the ith level corresponds to an input of size bni and therefore
contributes c · ( bni )k work. The tree continues until the inputs are of size 1—that is, until bni = 1, or when

ISTUDY
320 Analysis of Algorithms

i = logb n. Thus the work at level i is c · ( bni )k · ai , and the total work is the level-by-level sum, from level
i = 0 (the root) down to level i = logb n (the leaves). This summation is exactly the right-hand side of (∗).
We’ll now examine this summation in each of the three cases, depending on how the value of bak compares
to 1—and we’ll handle the cases in order of ease, rather than numerically:

Case (ii): a = bk . Then (∗) says that in case (ii), we have a = bk , which means ( bak )i = 1i = 1 for all i

X
logb n
 X
logb n
a i
T(n) = cnk bk
= cnk 1 = cnk (1 + logb n).
i=0 i=0

Thus the total work is Θ(nk log n).


Case (iii): a < bk . Then (∗) is a geometric sum whose ratio is strictly less than 1. Corollary 5.6 states that
Plog n
any geometric sum whose ratio is strictly between 0 and 1 is Θ(1). (Namely, the summation i=0b ( bak )i
1 k
is lower-bounded by 1 and upper-bounded by 1−a/b k , both of which are positive constants when a < b .)
a
Therefore: by Corollary 5.6, because bk < 1 in case (iii)

X
logb n

a i
T(n) = cn k
bk
= cnk · Θ(1).
i=0
k
Therefore the total work is Θ(n ).
Case (i): a > bk . Then (∗) is a geometric sum whose ratio is strictly larger than one. But we can make this
summation look more like Case (iii), by using a little algebraic manipulation. Notice that, for any r ̸= 0,
Pm
we can rewrite i=0 ri as follows: reindexing the summation, by setting j = m − i

X
m X
m X
m X
m
 X
m

1 m−i 1 j
ri = rm · ri−m = rm · ri−m = rm · r = rm · r . (†)
i=0 i=0 i=0 i=0 j=0

Applying this manipulation to (∗), we have


X
logb n

k a i
T(n) = cn bk
by (∗)
i=0
logb n  j
 X
a logb n bk
= cnk · bk
· a by (†)
j=0

a logb n
= cnk · bk
· Θ(1) by Corollary 5.6 (similar to Case (iii)), because bk
a < 1.
logb n
a
= cnk · · Θ(1)
k
(bk )logb n = bk logb n = blogb n = nk
nk
= c · alogb n · Θ(1).
And alogb n = nlogb a , which we can verify by log manipulations:
logb n logb a
alogb n = blogb [a ]
= b[logb n]·[logb a] = b[logb a]·[logb n] = blogb [n ]
= nlogb a .
Therefore the total work in this case is Θ(alogb n ) = Θ(nlogb a ).

ISTUDY

6.5 An Extension: Recurrence Relations of the Form T(n) = aT n
b + cnk 321

Computer Science Connections

Divide-and-Conquer Algorithms and Matrix Multiplication


Matrix multiplication (see Definition 2.45) is a fundamental operation with wide-ranging applications throughout
CS: in computer graphics, in data mining, and in social-network analysis, just to name a few. Often the matrices in
question are quite large—perhaps a matrix of hyperlinks among thousands or millions of web pages, for example.
Thus asymptotic improvements to matrix multiplication algorithms have potential practical importance, too.
For simplicity, consider multiplying two square (n-
by-n) matrices. The most obvious algorithm for matrix matmult(M ∈ Rn×n , N ∈ Rn×n ):
multiplication just follows the definition: separately, Input: Two matrices M ∈ Rn×n and N ∈ Rn×n
P
for each of the n2 entries in the output matrix, perform Output: A matrix P ∈ Rn×n , where Pi,j := nk=1 Mi,k Nk,j
the Θ(n) multiplications and additions to compute the 1 for i = 1, 2, . . . n:
entry. (See Figure 6.39.) But, in the spirit of this sec- 2 for j = 1, 2, . . . , n:
tion, what might we be able to do with a recursive 3 for k = 1, 2, . . . , n:
algorithm? There is indeed a nice way to think about 4 Pi,j := Pi,j + Mi,k Nk,j // Assume Pi,j := 0 to start.
matrix multiplication recursively. To multiply two n- 5 return P
by-n matrices M and N, divide M and N each into four Figure 6.39 The iterative algorithm for matrix
quarters. The matrix product MN can be expressed by multiplication for n-by-n matrices.
appropriately summing the products of the quarters of
M and the quarters of N. This fact suggests a recursive, divide-and-conquer algorithm for multiplying matrices, with
the recurrence T(n) = 8T( n2 ) + n2 . (It takes c · n2 time to combine the result of the recursive calls.) See Figure 6.40.
The recursive approach seems clever. But, using     
Theorem 6.21—we have a = 8, b = 2, and k = 2; thus     RW + SY RX + SZ 
 R S  W X   
we are in case (i)—we can conclude that T(n) =  




 = 





log2 (8) 3     
Θ(n ) = Θ(n ), so the recursive algorithm is   

 





 T V  Y Z   TW + UY TX + UZ 
actually not an improvement over Figure 6.39 at all!
But, in a major algorithmic breakthrough, in 1969
Volker Strassen found a way to use seven recursive calls Figure 6.40 Multiplying two n-by-n matrices by computing
instead of eight. (See Figure 6.41.) This change makes the product of eight pairs of n2 -by- n2 matrices. We compute
the recurrence T(n) = 7T( n2 ) + n2 , reducing the 8 to a RW, SY, . . . , UZ recursively, and then add the results as
7. Now Theorem 6.21—a = 7, b = 2, and k = 2; still indicated to compute the final answer.
case (i)—says that T(n) = Θ(nlog2 7 ) = Θ(n2.8073··· ),
which is a nice improvement. (For example, 1000log2 7 is only about 25% of 10003 —four times faster!)
With this fundamental idea, one can investigate     
other Strassen-like algorithms, making fewer recursive     A+D 
 R S  W X   C+E 
calls and combining them cleverly. In 1978, Victor Pan  




 = 
 −E + G




    
gave a further running-time improvement using this  




 A−B


 T V  Y Z   B+D 
style of algorithm—though more complicatedly!— +C + F

using a total of 143,640 recursive calls on inputs of


size 70n
(!), plus Θ(n2 ) additional work. Using The- where A := (R + V)(W + Z) E := (R + S)Z
orem 6.21, that algorithm yields a running time of B := (T + V)W F := (T − R)(W + X)
Θ(nlog70 143,640 ) = Θ(n2.7951··· ). Algorithms contin- C := R(X − Z) G := (S − V)(Y + Z)
ued to improve for several years, culminating in 1990 D := V(Y − W)
with an Θ(n2.3754··· )-time algorithm due to Don Cop-
persmith and Shmuel Winograd. Their algorithm was Figure 6.41 Strassen’s algorithm. We compute A, B, . . . , G
the best known for two decades, but recently some new recursively, and then add/subtract the results as indicated to
researchers with some new insights have come along, compute the final answer. The additions and subtractions
and the exponent is now down to 2.373. (For more take Θ(n2 ) time.
about matrix multiplication and the recent algorithmic
improvements, see [132], a survey paper by Virginia Vassilevska Williams, one of the researchers responsible for
the reinvigorated progress in improving this exponent.) Many people think that the exponent can be improved all the
way to 2—but no one has yet found an algorithm that’s that fast!

ISTUDY
322 Analysis of Algorithms

EXERCISES
The following recurrence relations follow the form of Theorem 6.21. Solve each. Assume T(1) = 1.
6.109 T(n) = 4T( n3 ) + n2 6.117 T(n) = 2T( n2 ) + n2
6.110 T(n) = 3T( 4n ) + n2 6.118 T(n) = 2T( 2n ) + n
6.111 T(n) = 2T( 3n ) + n4 6.119 T(n) = 2T( 4n ) + n2
n
6.112 T(n) = 3T( 3 ) + n 6.120 T(n) = 2T( 4n ) + n
n
6.113 T(n) = 16T( 4 ) + n 2 6.121 T(n) = 4T( 2n ) + n2
6.114 T(n) = 2T( n4 ) + 1 6.122 T(n) = 4T( 2n ) + n
6.115 T(n) = 4T( n2 ) + 1 6.123 T(n) = 4T( 4n ) + n2
n
6.116 T(n) = 3T( 3 ) + 1 6.124 T(n) = 4T( 4n ) + n
6.125 Solve the quadtree recurrence T(1) = 1 and T(n) = 1 + 4T( 4n ) using Theorem 6.21. (See Exercise 6.82.)

6.126 Prove that the recurrences T(n) = aT( nb ) + c · nk with T(1) = d and S(n) = aS( bn ) + nk with S(1) = 1 have the same asymptotic
solution, for any constants a ≥ 1, b > 1, c > 0, d > 0, and k ≥ 0. (This equivalence justifies the ways that we have paid little
attention to the constant c in recurrences in this section.)
6.127 Consider the recurrence T(n) = aT( nb ) + nk with T(1) = 1. Using induction, prove the equivalence (∗) from the proof of
Plog n
Theorem 6.21: prove that T(n) = nk · i=0b ( bak )i for any n that’s an exact power of b.
6.128 Consider the recurrence T(n) = aT( bn ) with T(1) = 1, for constants a ≥ 1 and b > 1. (This recurrence does not match the form of
Theorem 6.21, because there is no nk term added in the recursive case.) Give a closed-form solution for this recurrence, and prove
your answer correct.
6.129 Theorem 6.21 does not apply for the recurrence T(n) = 2T( n2 ) + n log n, but the same idea—considering the summation of all the
work in the recursion tree—will still work. Prove that T(n) = Θ(n log2 n) by analyzing the summation analogous to (∗).

Each of the following exercises gives a brief description of an algorithm for an interesting problem in computer science. (Sometimes
the recurrence relation is explicitly written; sometimes it’s up to you to write down the recurrence.) For each, state the recurrence
(if it’s missing) and give a Θ-bound on the running time. If Theorem 6.21 applies, you may use it. If not, give a proof by induction.
6.130 The Towers of Hanoi is the following classic puzzle. There are three posts (the “towers”); post A starts with n concentric discs
stacked in order of their radius (smallest radius at the top, largest radius at the bottom). We must move all the discs to post B,
never placing a disc of larger radius on top of a disc of smaller radius. The easiest way to solve this puzzle is with recursion. (See
Figure 6.42.) The total number of moves made satisfies T(n) = 2T(n − 1) + 1 and T(1) = 1. Prove that T(n) = 2n − 1.

6.131 Suppose we are given a sorted array A[1 . . . n], and we wish to determine where in A the element x belongs—that is, the index i such
that A[i − 1] < x ≤ A[i]. (Binary search solves this problem.) Here’s a sketch of an algorithm rootSearch to solve this problem:
• if n is small (say, less than 100), find the index by brute force. Otherwise:
√ √ √ √
• define mileposts := A[ n], A[2 n], A[3 n], . . . , A[n] to be a list of every ( n)th element of A.
• recursively, find post := rootSearch(mileposts, x).
√ √
• return rootSearch(A[(post − 1) n, . . . , post n], x).

(Note that rootSearch makes two recursive calls: one to identify which n-sized subarray is the right one to search in, and one to
search in it.) Find a recurrence relation for the running time of this algorithm, and solve it.

6.132 A van Emde Boas tree is a recursive data structure (with somewhat similar inspiration to Exercise 6.131) that allows us to insert,
delete, and look up keys drawn from a set U = {1, 2, . . . , u} quickly. (It solves the same problem that binary search trees solve, but
our running time will be in terms of the size of the universe U rather than in terms of the number of keys stored.) A van Emde Boas

tree achieves a running time given by T(u) = T( u) + 1 and T(1) = 1. Solve this recurrence. (Hint: define R(k) = T(2k ). Solving
R(k) is easier!)

A B C A B C A B C A B C
recursively move the n − 1 move the biggest recursively move the
smallest discs from A to C disc from A to B n − 1 discs from C to B
1 2 3

Figure 6.42 Solving the Towers of Hanoi by moving the n discs from post A to B.

ISTUDY
6.6 Chapter at a Glance 323

6.6 Chapter at a Glance

Asymptotics
Asymptotic analysis considers the rate of growth of functions, ignoring multiplicative constant factors
and concentrating on the long-run behavior of the function on large inputs. Consider two functions
f : R≥0 → R≥0 and g : R≥0 → R≥0 . Then f(n) = O(g(n)) (“f grows no faster than g”) if there exist c > 0
and n0 ≥ 0 such that f(n) ≤ c · g(n) for all n ≥ n0 . Here are some useful properties of O(·):

• We have that f(n) = O(g(n) + h(n)) if and only if f(n) = O(max(g(n), h(n))).
• If f(n) = O(g(n)) and g(n) = O(h(n)), then f(n) = O(h(n)).
• If f(n) = O(h1 (n)) and g(n) = O(h2 (n)), then f(n) + g(n) is O(h1 (n) + h2 (n)).
• Similarly, if f(n) = O(h1 (n)) and g(n) = O(h2 (n)), then f(n) · g(n) is O(h1 (n) · h2 (n)).
• A polynomial p(n) = ak nk + · · · + a1 n + a0 satisfies p(n) = O(nk ).
• We have that log n = O(nϵ ) for any ϵ > 0.
• For any base b and exponent k, we have logb (nk ) = O(log n).
• For constants b, c ≥ 1, we have bn = O(cn ) if and only if b ≤ c.

There are several other forms of asymptotic notation, to capture other relationships between functions.
A function f grows no slower than g, written f(n) = Ω(g(n)), if there exist constants d > 0 and n0 ≥ 0 such
that ∀n ≥ n0 : f(n) ≥ d · g(n). Two functions f and g satisfy f(n) = O(g(n)) if and only if g(n) = Ω(f(n)).
A function f grows at the same rate as g, written f(n) = Θ(g(n)), if f(n) = O(g(n)) and f(n) = Ω(g(n));
it grows (strictly) slower than g, written f(n) = o(g(n)), if f(n) = O(g(n)) but f(n) ̸= Ω(g(n)); and it
grows (strictly) faster than g, written f(n) = ω(g(n)), if f(n) = Ω(g(n)) but f(n) ̸= O(g(n)). Many of
the properties of O have analogous properties for Ω, Θ, o, and ω. One possibly surprising point is that
there are functions that are incomparable: there are functions f and g such that neither f(n) = O(g(n)) nor
f(n) = Ω(g(n)).

Asymptotic Analysis of Algorithms


Our main interest in asymptotics is in the analysis of algorithms, so that we can make statements about
which of two algorithms that solve the same problem is faster. The running time of an algorithm is a count
of the number of primitive steps that the algorithm takes to complete on a particular input. (Think of one
machine instruction as a primitive step.)
We generally evaluate the efficiency of an algorithm A using worst-case analysis: as a function of n,
how many primitive steps does A take on the input of size n for which A is the slowest. (A primary goal
of algorithmic analysis is to provide a guarantee on the running time of an algorithm, so we will be pes-
simistic.) We can also analyze the space used by an algorithm, in the same way. Sometimes we will instead
consider average-case running time of an algorithm A, which computes the running time of A, averaged
over all inputs of size n. Almost never will we consider an algorithm’s running time on the input of size n
for which A is the fastest (known as best-case analysis); this type of analysis is rarely used.

Recurrence Relations: Analyzing Recursive Algorithms


Typically, for nonrecursive algorithms, we compute the running time by inspecting the algorithm and writ-
ing down a summation corresponding to the operations done in each iteration of each loop, summed over

ISTUDY
324 Analysis of Algorithms

the iterations, and then simplifying. For recursive algorithms, we typically record the work using a recur-
rence relation that expresses the (worst-case) running time on inputs of size n in terms of the (worst-case)
running time on inputs of size less than n. (For small inputs, the running time is a constant—say, T(1) = c.)
For example, ignoring floors and ceilings, T(1) = c and T(n) = 2T( 2n ) + cn is the recurrence relation for
Merge Sort. (Almost always, we can safely ignore floors and ceilings.)
A solution to a recurrence relation is a closed-form (nonrecursive) expression for T(n). Recurrence
relations can be solved by conjecturing a solution and proving that conjecture correct by induction.
A recurrence relation can be represented using a recursion tree, where each node is annotated with the
work that is performed there, aside from the recursive calls, as in Figure 6.43. Recurrence relations can
also be solved by summing up all of the work contained within the recursion tree.

n

An Extension: Recurrence Relations of the Form T(n) = aT b
+ cnk
A particularly common type of recurrence relation is one of the form T(n) = aT( bn ) + c · nk , for constants
a ≥ 1, b > 1, c > 0, and k ≥ 0. This type of recurrence arises in divide-and-conquer algorithms that
solve an instance of size n by making a different recursive calls on inputs of size nb , and reconstructing the
solution to the given instance in Θ(nk ) time. Theorem 6.21 states that the solution to any such recurrence
relation is given by the following.
(i) If bk < a, then T(n) = Θ(nlogb (a) ). “The leaves dominate.”
(ii) If bk = a, then T(n) = Θ(nk · log n). “All levels are equal.”
(iii) If bk > a, then T(n) = Θ(nk ). “The root dominates.”
The proof follows by building the recursion tree, and summing the work at each level of the tree; the cases
correspond to whether the work increases exponentially, decreases exponentially, or stays constant across
levels of the tree.

n n
2 2
1 + log2 n levels

n n n n
4 4 4 4

n n n n n n n n
8 8 8 8 8 8 8 8

.. ..
. .

2 2

1 1 1 1

Figure 6.43 A example of a recursion tree, with each input’s size marked in the rectangular node.

ISTUDY
6.6 Chapter at a Glance 325

Key Terms and Results

Key Terms Key Results


Asymptotics Asymptotics
• asymptotic analysis 1 Some sample useful properties of O(·):
• O (big oh) • We have f(n) = O(g(n) + h(n)) if and only if
• Ω (big omega) f(n) = O(max(g(n), h(n))).
• Θ (big theta) • O(·) is transitive.
• ω (little omega) • Any degree-k polynomial satisfies p(n) = O(nk ).
• o (little oh) • For any ϵ > 0, we have log n = O(nϵ ).
• If f(n) = O(g(n)) then log f(n) = O(log g(n)).
Analysis of Algorithms • For any b and k, we have logb (nk ) = O(log n).
• running time • For any b, c ≥ 1, we have bn = O(cn ) ⇔ b ≤ c.
• worst-case analysis 2 Two functions f and g satisfy f(n) = O(g(n)) if and only
• average-case analysis if g(n) = Ω(f(n)).
• best-case analysis 3 There are pairs of functions f and g such that neither
f(n) = O(g(n)) nor f(n) = Ω(g(n)).
Recurrence Relations
Analysis of Algorithms
• recurrence relation
• recursion tree 1 We generally evaluate the efficiency of an algorithm A
• iterating a recurrence using worst-case analysis: what is (asymptotically) the
number of steps consumed by A as function of the input
Recurrences of the Form size n on the input of size n for which A is the slowest?
T(n) = aT nb + cnk 2 Typically we can analyze the running time of a
nonrecursive algorithm by simple counting and
• “the leaves dominate” manipulation of summations.
• “all levels are equal”
• “the root dominates” Recurrence Relations
1 The running time of a recursive algorithm can be
expressed using a recurrence relation, which can be
solved by figuring out a conjecture of a closed-form
formula for the relation, and then verifying by induction.

n

Recurrences of the Form T(n) = aT b
+ cnk
1 Recurrence relations of the form T(n) = aT( nb ) + cnk
(and T(1) = c) can be solved using Theorem 6.21:
• Case (i): if bk < a, then T(n) = Θ(nlogb (a) ).
• Case (ii): if bk = a, then T(n) = Θ(nk · log n).
• Case (iii): if bk > a, then T(n) = Θ(nk ).

ISTUDY
ISTUDY
7 Number Theory

In which, after becoming separated, our heroes arrange a place to


meet, by sending messages that stay secret even as snooping spies
listen in.

ISTUDY
327
328 Number Theory

7.1 Why You Might Care


I often say that when you can measure what you are speaking about, and express it in
numbers, you know something about it; but when you cannot measure it, when you cannot
express it in numbers, your knowledge is of a meagre and unsatisfactory kind.
Sir William Thomson, Lord Kelvin (1824–1907)
lecture on “Electrical Units of Measurement” (1883)

A chapter about numbers (particularly when it’s so far along in this book!) probably seems a little
bizarre—after all, what is there to say about numbers that you didn’t figure out by elementary school?!?
But, more so than any other chapter of the book, the technical material in this chapter leads directly to a
single absolutely crucial (and ubiquitous) modern application of computer science: cryptography, which
deals with protocols to allow multiple parties to communicate securely, even in the presence of eavesdrop-
ping adversaries. (The word cryptography comes from the Greek kryptos “concealed/secret” and graph
“writing.”) Cryptographic systems are used throughout our daily lives—both in the security layers that
connect us as users to servers (for example, in banking online or in registering for courses at a college),
and in the backend systems that, we hope, protect our data even when we aren’t interacting with it.
Our goal in this chapter will be to build up the technical machinery necessary to define and understand
the RSA cryptosystem, one of the most commonly used cryptographic systems today. (RSA is named after
the initials of its three discoverers, Rivest, Shamir, and Adleman.) By the end of the chapter, in Section 7.5,
we’ll be able to give a full treatment of RSA, along with sketched outlines of a few other important ideas
from cryptography. (Later in the book, in Chapter 9, we’ll also encounter the historical codebreaking work
of Alan Turing and colleagues, which deciphered the German encryption in World War II—a major part
of the allied victory. See p. 497.)
To get there, we’ll need to develop some concepts and tools from number theory. (“Number theory” is
just a slightly fancy name for “arithmetic on integers.”) Our focus will be on modular arithmetic: that is,
the numbers on which we’ll be doing arithmetic will be a set of integers {0, 1, 2, . . . , n − 1}, where—like
on a clock—the numbers “wrap around” from n − 1 back to 0. In other words, we’ll interpret numerical
expressions modulo n, always considering each expression via its remainder when we divide by n. Mod-
ular arithmetic shares a lot of properties with elementary-school arithmetic, and we’ll start by noting and
exploring those commonalities. We begin in Section 7.2 with formal definitions of modular arithmetic,
and the adaptation of some basic ideas from elementary-school arithmetic (addition, multiplication, great-
est common divisors, least common multiples—and efficient algorithms for all of these operations) to this
new setting. We’ll then turn in Section 7.3 to primality (when a number has no divisors other than 1 and
itself) and relative primality (when two numbers have no common divisors other than 1).
Modular arithmetic and elementary-school arithmetic begin to diverge more substantially when we start
to think about division. There’s no integer that’s one eighth of 4 (because 12 isn’t an integer). But, on a
clock where we treat 12:00 as 0, there is an integer that’s a eighth of 4—namely 2, because 2 + 2 + 2 + 2 +
2 + 2 + 2 + 2 is 4 (because 4:00pm is 16 hours after midnight—so 8 · 2 is 4, modulo 12). In Section 7.4,
we’ll explore exactly what division means in modular arithmetic—and some special features of division
that arise when n is a prime number.
As we go, we’ll see a few other applications of number theory: to error-correcting codes, secret sharing,
and the apparently unrelated task of generating all 4-letter sequences (AAAA to ZZZZ). And, finally, we’ll
put the pieces together to explore RSA.

ISTUDY
7.2 Modular Arithmetic 329

7.2 Modular Arithmetic


It’s always a mistake to try for universal approbation, universal approval, because if you
fear making anyone mad, then you ultimately probe for the lowest common denominator
of human achievement.
Jimmy Carter (b. 1924)
Remarks at the National Convention of the Future Farmers of America (1978)

We will start with a few reminders of some basic arithmetic definitions from Chapter 2—multiplication,
division, modular arithmetic—as these concepts are the foundations for all the work that we’ll do in this
chapter. We’ll also introduce a few algorithms for computing these basic arithmetic quantities, including
one of the oldest known algorithms: the Euclidean algorithm, from about 2300 years ago, which computes
the greatest common divisor of two integers n and m (the largest number that evenly divides both).

7.2.1 Remainders: A Reminder


Let’s start with a few basic facts about integers. Every integer is 0 or 1 more than some even number. Every
integer is 0, 1, or 2 more than a multiple of three. Every integer is at most 3 more than a multiple of four. And,
in general, for any integer k ≥ 1, every integer is r more than a multiple of k, for some r ∈ {0, 1, . . . , k − 1}.
We’ll begin with a precise statement and proof of the general version of this property:

Theorem 7.1: Floors and remainders (“the Division Theorem”).


Let k ≥ 1 and n be integers. Then there exist integers d and r such that (i) 0 ≤ r < k, and (ii) kd + r = n.
Furthermore, the values of d and r satisfying (i) and (ii) are unique.

Before we prove the theorem, let’s look at a few examples of what it claims:

Example 7.1: Some examples of the Division Theorem.


For k = 202 and n = 379, the theorem states that there exist integers r ∈ {0, 1, . . . , 201} and d with
202d + r = 379. Specifically, those values are r = 177 and d = 1, because 202 · 1 + 177 = 379.
Here are a few more examples, still with k = 202:
n = 55057 n = 507 n = 177 n = 404 n = −507 n = −404
d = 272 d=2 d=0 d=2 d = −3 d = −2
r = 113 r = 103 r = 177 r=0 r = 99 r=0
You can verify that, in each of these six columns, indeed we have 202d + r = n.

Now let’s give a proof of the general result:


Proof of Theorem 7.1. Consider a fixed integer k ≥ 1. Let P(n) denote the claim
P(n) = there exist integers d and r such that 0 ≤ r < k and kd + r = n.
We must prove that P(n) holds for all integers n. We’ll first prove the result for nonnegative n (by strong
induction on n), and then show it for n < 0 (making use of the result for nonnegative n).
Problem-solving tip: To prove that a property is true for all inputs, it often turns out to be easier to first prove a special case
and then use that special case to show that the property holds in general. (Another example: it’s probably easier to analyze the
performance of Merge Sort on inputs whose size is an exact power of 2, and to then generalize to arbitrary input sizes.)

ISTUDY
330 Number Theory

Case I: n ≥ 0. We’ll prove that P(n) holds for all n ≥ 0 by strong induction on n.
For the base cases (0 ≤ n < k), we simply select d = 0 and r = n. Indeed, these values guarantee that
0 ≤ r < k and kd + r = k · 0 + n = 0 + n = n.
For the inductive case (n ≥ k), we assume the inductive hypotheses—namely, we assume P(n′ ) for any
0 ≤ n′ < n—and we must prove P(n). Let n′ = n − k. Because n ≥ k and k > 0, it is immediate that
n′ satisfies 0 ≤ n′ < n. Thus we can apply the inductive hypothesis P(n′ ) to conclude that there exist
integers d′ and r′ such that 0 ≤ r′ < k and kd′ + r′ = n′ . Select d = d′ + 1 and r = r′ . Thus, indeed,
0 ≤ r < k and

kd + r = k(d′ + 1) + r′ definition of d and r


′ ′
= kd + k + r distributive property

=n +k n′ = kd′ + r′ , by definition
= n. definition of n′ = n − k

Case II: n < 0. To show that P(n) holds for an arbitrary n < 0, we will make use of Case I. Let r′ and d′
be the integers guaranteed by P(−n), so that kd′ + r′ = −n. We consider two cases based on whether
r′ = 0:
Case IIA: r′ ̸= 0. Then let d = −d′ −1 and let r = k−r′ . (Because k > r′ > 0, we have 0 < k−r′ < k.)
Thus

kd + r = k(−d′ − 1) + k − r′ definition of d and r



= −(kd + r) algebraic simplification
= −(−n) = n. definition of d′ and r′

Case IIB: r′ = 0. Then let d = −d′ and r = r′ = 0. Therefore

kd + r = −d′ k + r′ definition of d and r


= −(−n) = n. definition of d′ and r′

We have thus proven that P(n) holds for all integers n: Case I handled n ≥ 0, and Case II handled n < 0.
(We have not yet proven the uniqueness of the integers r and d; see Exercise 7.4.)
This theorem now allows us to give a more careful definition of modular arithmetic. (In Definition 2.11,
we gave the slightly less formal definition of n mod k as the remainder when we divide n by k.)

Definition 7.2: Modulus [reprise].


For integers k > 0 and n, the quantity n mod k is the unique integer r such that 0 ≤ r < k and kd + r = n
for some integer d (whose existence is guaranteed by Theorem 7.1).

Incidentally, the integer d whose existence is guaranteed by Theorem 7.1 is ⌊ nk ⌋ (just as Definition 7.2 says
 
that the integer r is n mod d). For any k ≥ 1 and any integer n, we can write n as n = nk · k + (n mod k).

Taking it further: One of the tasks that we can accomplish conveniently using modular arithmetic is base conversion of integers.
We’re used to writing numbers in decimal (“base 10”), where each digit is “worth” a factor of 10 more than the digit to its right.
(For example, the number we write “31” means 1 · 100 + 3 · 101 = 1 + 30.) Computers store numbers in binary (“base 2”)
representation, and we can convert between bases using modular arithmetic. For more, see p. 340.

ISTUDY
7.2 Modular Arithmetic 331

n
7.2.2 Computing n mod k and k
So far, we’ve taken arithmetic operations for granted—ignoring how we’d figure out the numerical value
of an arithmetic expression like 21024 − 3256 · 5202 , which is simple to write, but whose value is definitely
not so instantaneous to calculate. (Quick! Is 21024 − 3256 · 5202 evenly divisible by 7?) Indeed, many of us
spent a lot of time in elementary-school math classes learning algorithms for basic arithmetic operations
like addition, multiplication, long division, and exponentiation (even if back then nobody told us that they
were called algorithms).
Thinking about algorithms for some basic arithmetic operations will be useful, for multiple reasons:
because they’re surprisingly relevant for proving some useful facts about modular arithmetic, and because
computing them efficiently turns out to be crucial in the cryptographic systems that we’ll explore in Sec-
tion 7.5. We’ll start with the algorithm shown in Figure 7.1, which computes n mod k (and simultaneously
computes ⌊ nk ⌋ too). The core idea for this algorithm was implicit in the proof of Theorem 7.1: we repeatedly
subtract k from n until we reach a number in the range {0, 1, . . . , k − 1}.

Example 7.2: An example of mod-and-div.


Figure 7.1 shows the computation of mod-and-div(64, 5). We start by setting r := 64 and d := 0, and
repeatedly decrease r by 5 and increase d by 1 until r < 5. Thus mod-and-div(64, 5) returns 4 and 12.
Similarly, mod-and-div(20, 17) starts with d = 0 and r = 20, and executes one (and only one)
iteration of the loop, returning d = 1 and r = 3.

(The algorithm in Figure 7.1 is called mod-and-div because some programming languages—Pascal, for
one [admittedly dated] example—use div to denote integer division, so that 15 div 7 is 2. In C++, slightly
disconcertingly, a function called div computes the same pair of integers that our mod-and-div computes.)
The mod-and-div algorithm is fairly intuitive, if fairly slow: it simply keeps removing multiples of k
from n until there are no multiples of k left to remove. (For simplicity, the algorithm as written in Figure 7.1
only handles the case where n ≥ 0; you’ll extend the algorithm to handle negative n in Exercise 7.10. See
Example 7.1 for some examples of the Division Theorem with n < 0.)

Lemma 7.3: Correctness and efficiency of mod-and-div.


For any integers n ≥ 0 and k ≥ 1, calling mod-and-div(n, k) returns n mod k and ⌊ nk ⌋, using a total of
Θ(⌊ nk ⌋) arithmetic operations.

Proof. We claim that, throughout the execution of the algorithm, we have dk + r = n (and also r ≥ 0).
This fact follows by induction on the number of iterations of the while loop in mod-and-div: it’s true before
the loop starts (when r = n and d = 0), and if it’s true before an iteration of the while loop then it’s true after

Each iteration of the execution is shown: first


mod-and-div(n, k): r = 64 and d = 0, then r = 59 and d = 1, etc.
Input: integers n ≥ 0 and k ≥ 1 mod-and-div(64, 5) :
Output: n mod k and ⌊ nk ⌋ r = 64 59 54 49 44 39 34 29 24 19 14 9 4
1 r := n; d := 0 d = 0 1 2 3 4 5 6 7 8 9 10 11 12
2 while r ≥ k: return values: r = 4 and d = 12
3 r := r − k; d := d + 1
(Indeed, we can write 64 = 12 · 5 + 4, where 4 = 64 mod 5 and 12 = ⌊ 64
5 ⌋.)
4 return r, d

Figure 7.1 An algorithm to compute n mod k and ⌊ nk ⌋, and an example of its execution.

ISTUDY
332 Number Theory

that iteration (when dk has increased by k, and r has decreased by k). Furthermore, when the while loop
terminates, we also have that r < k. Thus the returned values satisfy dk + r = n and 0 ≤ r < k—precisely
as required by Definition 7.2.
The total number of iterations of the while loop is exactly the returned value of d = ⌊ nk ⌋, and we do
three arithmetic operations per iteration: a comparison (is r ≥ k?), a subtraction (what’s r − k?), and an
addition (what’s d + 1?). Thus the total number of arithmetic operations is 3 ⌊ nk ⌋ + Θ(1) = Θ(⌊ nk ⌋).
Should we consider mod-and-div fast? Not really! Imagine using this algorithm to determine whether,
say, n = 123,456,789 is divisible by 7. (It isn’t: n = 17,636,684 · 7 + 1.) Our algorithm would take over
10 million iterations (⌊ n7 ⌋ > 70,000,000
7 = 10,000,000) iterations to compute the answer—and that seems
(and is!) very slow. One way to think about mod-and-div(n, k) is that it performs a linear search for the
integer d such that kd ≤ n < (k + 1)d: we keep increasing d by one until this property holds. We could
instead give a much faster algorithm based on binary search to find that value of d. (See Exercises 7.11–
7.16.) The improved algorithm requires only logarithmically many arithmetic operations—an exponential
improvement over mod-and-div.
Taking it further: What should count as an efficient algorithm when the inputs are numbers? In previous chapters, we’ve talked
about the generally accepted definition of efficient as meaning “requiring a number of steps that is polynomial in the size of the
input.” That is, on an input of size n, our algorithm should run in at most O(nc ) steps, for some fixed c. (See p. 294.) So why
did we say that an algorithm like mod-and-div isn’t efficient? After all, on input n and k, Lemma 7.3 says that the algorithm
took only about nk steps. But the key point is that an algorithm that takes a numerical input n does not receive an input of size
n. The number 123,456,789 takes only 9 characters (the nine digits in the number!) to write down—not 123,456,789 characters.
(Unless you wrote down the numbers in unary, using tally marks instead of digits: ;;;;;;;;; · · · .)

Generally, an algorithm that takes a number n as input receives that number n written in binary. The binary representation of n
uses ⌈log2 n⌉ bits. As usual, we consider an algorithm to be efficient if it takes time that’s polynomial in the number of bits of
the input—so we consider an algorithm that takes a number n as input to be efficient if it requires a number of operations that
is at most (log2 n)c for some fixed c. (That is, the algorithm should run in time that is polylogarithmic in n.) Every grade-school
algorithm that you learned for arithmetic—addition, subtraction, multiplication, long division, etc.—was efficient, requiring you
to do a number of operations related to the number of digits in the numbers, and not to the value of the numbers themselves.

7.2.3 Congruences, Divisors, and Common Divisors


Lemma 7.3 says that mod-and-div(n, k), which repeatedly subtracts k from n in a loop, correctly computes
the value of n mod k. We used induction to prove Lemma 7.3, but we could have instead argued for the
correctness of the algorithm, perhaps more intuitively, via the following fact:
For any integers a ≥ 0 and k ≥ 1, we have (a + k) mod k = a mod k.
That is, the remainder when we divide an integer a by k isn’t changed by adding an exact multiple of k to
a. This property follows from the definition of mod, but it’s also a special case of a useful general property
of modular arithmetic, which we’ll state (along with some other similar facts) in Theorem 7.4. Here are a
few examples of this more general property:

Example 7.3: The mod of a sum, and the sum of the mods.
Consider the following expressions of the form (a + b) mod k.

• (17 + 43) mod 7 = 60 mod 7 = 4. (Note 17 mod 7 = 3, 43 mod 7 = 1, and 3 + 1 = 4.)


• (18 + 42) mod 9 = 60 mod 9 = 6. (Note 18 mod 9 = 0, 42 mod 9 = 6, and 0 + 6 = 6.)
• (25 + 25) mod 6 = 50 mod 6 = 2. (Note 25 mod 6 = 1, 25 mod 6 = 1, and 1 + 1 = 2.)

ISTUDY
7.2 Modular Arithmetic 333

Based on these three examples, it might be tempting to conjecture that (a + b) mod k is always equal to
(a mod k) + (b mod k), but be careful—this claim has a bug, as this example shows:

• (18 + 49) mod 5 = 67 mod 5 = 2. (Note 18 mod 5 = 3, 49 mod 5 = 4, but 3 + 4 =


̸ 2.)

Instead, it turns out that (a + b) mod k = [(a mod k) + (b mod k)] mod k—we had to add an “extra”
mod k at the end.
Here are some of the useful general properties of modular arithmetic:

Theorem 7.4: Properties of modular arithmetic.


For integers a and b and k > 0:
k mod k = 0 (7.4.1)
a + b mod k = [(a mod k) + (b mod k)] mod k (7.4.2)
ab mod k = [(a mod k) · (b mod k)] mod k (7.4.3)
b b
a mod k = [(a mod k) ] mod k. (7.4.4)

Again notice the “extra” mod k at the end of the last three of these equations—remember that it is not the
case that ab mod k = (a mod k) · (b mod k) in general. For example, 14 mod 6 = 2 and 5 mod 6 = 5, but
(14 · 5) mod 6 = 4 = ̸ 2 · 5. (We could give formal proofs of these properties based on the definitions of
mod, but we’ll omit them here. Exercise 7.17 asks you to give a formal proof for one of these properties.)
In the cryptographic applications that we will explore later in this chapter, it will turn out to be important
to perform “modular exponentiation” efficiently—that is, we’ll need to compute be mod n very quickly,
even when e is fairly large. Fortunately, Theorem 7.4.4 will help us do this computation efficiently; see
Exercises 7.23–7.25.

Congruences
We’ve now talked a little bit (in Theorem 7.4, for example) about two numbers a and b that have the same
remainder when we divide them by k—that is, with a mod k = b mod k. There’s useful terminology, and
notation, for this kind of equivalence:

Definition 7.5: Congruence.


Two integers a and b are congruent mod k, written a ≡k b, if a mod k = b mod k.

(Typically a ≡k b is read as “a is equivalent to b mod k” or “a is congruent to b mod k.” If you read the
statement a ≡k b out loud, it’s polite to pause slightly, as if there were a comma, before the “mod k” part.)
Taking it further: Some people write a ≡k b using the notation a ≡ b (mod k). This notation is used to mean the same thing
as our notation a ≡k b, but note the somewhat unusual precedence in this alternate notation: it says that [a ≡ b] ( mod k) (and it
does not, as it might appear, say that the quantity a and the quantity [b mod k] are equivalent).

Divisors, factors, and multiples


We now return to the divisibility of one number by another, when the first is an exact multiple of the second.
As with the previous topics in this section, we gave some preliminary definitions in Chapter 2 of divisibility
(and related terminology), but we’ll again repeat the definitions here, and also go into a little bit more detail.

ISTUDY
334 Number Theory

Definition 7.6: Divisibility, factors, and multiples [reprise].


For two integers k > 0 and n, we write k | n to denote the proposition that n mod k = 0. If k | n, we say
that k divides n (or that k evenly divides n), that n is a multiple of k, and that k is a factor of n.

For example, we can say that 42 | 714, that 102 divides 714, that 119 evenly divides 714, that 6 and 17 are
factors of 714, and that 714 is a multiple of 7.
Here are a few useful properties of division:

Theorem 7.7: Properties of divisibility.


For integers a and b and c:
a|0 (7.7.1)
1|a (7.7.2)
a|a (7.7.3)
a | b and b | c ⇒ a|c (7.7.4)
a | b and b | a ⇒ a = b or a = −b (7.7.5)
a | b and a | c ⇒ a | (b + c) (7.7.6)
a|b ⇒ a | bc (7.7.7)
ab | c ⇒ a | c and b | c (7.7.8)

These properties generally follow fairly directly from the definition of divisibility. A few are left to you
in the exercises, and we’ll address a few others in Chapter 8, which introduces relations. (Some of these
facts are some standard properties of some relations that the “divides” relation happens to have: reflexivity
(7.7.3), transitivity (7.7.4), and so-called antisymmetry [a version of (7.7.5)]. See Chapter 8.) To give the
flavor of these arguments, here’s one of the proofs, that ab | c implies that a | c and b | c:

Proof of (7.7.8). Assume ab | c. Then, by definition of mod (and by Theorem 7.1), there exists an integer
k such that c = (ab) · k. Taking both sides mod a, we have

c mod a = abk mod a k is the integer such that c = (ab) · k


= [(a mod a) · (bk mod a)] mod a Theorem 7.4.3
= [0 · (bk mod a)] mod a Theorem 7.4.1
= 0 mod a 0 · x = 0 for any x
= 0. 0 mod a = 0 for any a

Thus c mod a = 0, so a | c. Analogously, because b · (ak) = c, we have that b | c too.

Greatest common divisors and least common multiples


We now turn to our last pair of definitions involving division: for two integers, we’ll be interested in two
related quantities—the largest number that divides both of them, and the smallest number that they both
divide.

ISTUDY
7.2 Modular Arithmetic 335

Definition 7.8: Greatest common divisor [GCD].


The greatest common divisor of two positive integers n and m, denoted GCD(n, m), is the largest d ∈ Z≥1
such that d | n and d | m.

Definition 7.9: Least common multiple [LCM].


The least common multiple of two positive integers n and m, denoted LCM(n, m), is the smallest d ∈ Z≥1
such that n | d and m | d.

Here are some examples of both GCDs and LCMs, for a few pairs of small numbers:

Example 7.4: Examples of GCDs and LCMs.


The greatest common divisor of 6 and 27 is 3, because 3 divides both 6 and 27 (and no integer k ≥ 4
divides both). The least common multiple of 6 and 27 is 54, because 6 and 27 both divide 54 (and no
integer k ≤ 53 is divided by both). Similarly,
GCD(1, 9) = 1 GCD(12, 18) = 6 GCD(202, 505) = 101 GCD(11, 202) = 1
and
LCM(1, 9) = 9 LCM(12, 18) = 36 LCM(202, 505) = 1010 LCM(11, 202) = 2222.

Both of these concepts might be (at least vaguely!) familiar from elementary school, specifically from when
you learned about how to manipulate fractions:

Example 7.5: Fractions in lowest terms.


38
We can rewrite the fraction 133 as 72 , by dividing both numerator and denominator by the common factor
19—and we can’t reduce it further because 19 is the greatest common divisor of 38 and 133. (We have
“reduced the fraction to lowest terms.”)
5 7
We can rewrite the sum 12 + 18 as 15 14 29
36 + 36 (which equals 36 ) by rewriting both fractions with a denom-
inator that’s a common multiple of the denominators of the two addends—and we couldn’t have chosen a
smaller denominator, because 36 is the least common multiple of 12 and 18. (We have “put the fractions
over the lowest common denominator.”)

7.2.4 Computing Greatest Common Divisors


In the remainder of this section, we’ll turn to the task of efficiently computing the greatest common divisor
of two integers. (Using this algorithm, we can also find least common multiples quickly, because GCDs and
LCMs are closely related: for any two positive integers n and m, we have LCM(n, m) · GCD(n, m) = n·m.)
The “obvious” way to compute the greatest common divisor of n and m is to try all candidate divisors
d ∈ {1, 2, . . . , min(n, m)} and to return the largest value of d that indeed evenly divides both n and m. This
algorithm is slow—very slow!—but there is a more efficient approach. Amazingly, a faster way to compute
GCDs has been known for approximately 2300 years: the Euclidean algorithm, named after the Greek
geometer Euclid, who lived in the third century bce. (Euclid is also the namesake of the Euclidean distance
between points in the plane—see Exercise 2.175—among a number of other things in mathematics.) The
algorithm is shown in Figure 7.2. Here are three small examples of the Euclidean algorithm in action:

ISTUDY
336 Number Theory

Euclid(17, 42)
Euclid(n, m):
= Euclid( 42 mod 17 , 17) ̸ 0, so we’re in the else case.
42 mod 17 = 8 =
Input: positive integers n and m ≥ n =8
Output: GCD(n, m)
= Euclid( 17 mod 8 , 8) 17 mod 8 = 1 ̸= 0, so we’re in the else case again
1 if m mod n = 0 then =1
2 return n
=1 8 mod 1 = 0, so we’re done, and we return 1.
3 else
4 return Euclid(m mod n, n)
(Indeed, the only positive integer that divides both 17 and 42 is 1, so GCD(17, 42) = 1.)

Figure 7.2 The Euclidean algorithm for GCDs, and an example of its execution.

Example 7.6: GCDs using the Euclidean algorithm.


Figure 7.2 shows the calculation of GCD(17, 42) = 1 using the Euclidean algorithm. Here are two more
examples. First, for 48 and 1024,
Euclid(48, 1024) = Euclid( 1024 mod 48 , 48) 1024 mod 48 = 16 ̸= 0, so we’re in the else case.
= 16
= 16, 48 mod 16 = 0, so we return 16.

and, second (written more compactly), for 91 and 287,

Euclid(91, 287) = Euclid( 287 mod 91 , 91) = Euclid( 91 mod 14 , 14) = 7.


= 14 =7

Taking it further: Euclid described his algorithm in his book Elements, from c. 300 bce, a multivolume opus covering the
fundamentals of mathematics, particularly geometry, logic, and proofs. Most people view the Euclidean algorithm as the old-
est nontrivial algorithm that’s still in use today; there are some older not-quite-fully-specified procedures for basic arithmetic
operations like multiplication that date back close to 2000 bce, but they’re not quite laid out as algorithms.
Donald Knuth—the 1974 Turing Award winner, the inventor of TEX (the underlying system that is used to typeset virtually all
scholarly materials in computer science—including this book), and a staunch advocate and practitioner of expository writing
about computer science in general and algorithms in particular—describes the history of the Euclidean algorithm (among many
other things!) in The Art of Computer Programming, his own modern-day version of a multivolume opus covering the funda-
mentals of computer science, particularly algorithms, programming, and proofs [72]. Among the fascinating things that Knuth
points out about the Euclidean algorithm is that Euclid’s “proof” of correctness only handles the case of up to three iterations of
the algorithm—because, Knuth argues, Euclid predated the idea of mathematical induction by hundreds of years. (And Euclid’s
version of the algorithm is quite hard to read, in part because Euclid didn’t have a notion of zero, or the idea that 1 is a divisor of
any positive integer n.)

The intuition of the algorithm, and making the intuition formal


Before we try to prove the correctness of the Euclidean algorithm, let’s spend a few moments on the intu-
ition behind it. The basic idea is that any common divisor of two numbers must also evenly divide their
difference. For example, does 7 divide both 63 and 133? If so, then it would have to be the case that 7 | 63
and that 7 also divides the “gap” between 133 and 63. (That’s because 63 = 7 · 9, and if 7k = 133, then
7(k − 9) = 133 − 63.) More generally, suppose that d is a common divisor of n and m ≥ n. Then it must
be the case that d divides m − cn, for any integer c where cn < m. In particular, d divides m − ⌊ mn ⌋ · n; that
is, d divides m mod n. (We’ve only argued that if d is a common divisor of n and m then d must also divide
m mod n, but actually the converse holds too; we’ll formalize this fact in the proof.) See Figure 7.3.
We will now make this intuition formal, and give a full proof of the correctness of the Euclidean
algorithm: that is, we will establish that Euclid(n, m) = GCD(n, m) for any positive integers n and
m ≥ n, with a proof by induction. There’s a crucial lemma that we’ll need to prove first, based on the

ISTUDY
7.2 Modular Arithmetic 337

133
63
63 63 · 2 = 126

0 7 14 21 28 35 42 49 56 63 70 77 84 91 98 105 112 119 126 133

0 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135

Figure 7.3 The intuition of the Euclidean algorithm: d is a common divisor of 63 and 133 if and only if d also divides
133 − 63 and 133 − 63 · 2 = 133 − 126. Indeed d = 7 is a common divisor, but 9 is not (because 9 does not divide
133 − 126 = 7).

intuition we just described: we need to show that for any n and m ≥ n where m mod n ̸= 0, we have
GCD(n, m) = GCD(n, m mod n). We will prove this fact by proving that the common divisors of {n, m}
are identical to the common divisors of {n, m mod n}. (Thus the greatest common divisor of these two
pairs of integers will be identical.)

Lemma 7.10: When n ̸ | m, the same divisors of n divide m and m mod n.


Let n and m be positive integers such that n ≤ m and n ̸ | m. Let d | n be an arbitrary divisor of n. Then
d | m if and only if d | (m mod n).

Here’s a concrete example before we prove the lemma:

Example 7.7: An example of Lemma 7.10.


Consider n = 42 and m = 98. Then n ≤ m and n ̸ | m, as Lemma 7.10 requires. The divisors of 42 are
{1, 2, 3, 6, 7, 14, 21, 42}. Of these divisors, the ones that also divide 98 are {1, 2, 7, 14}.
The lemma claims that the common divisors of 42 and 98 mod 42 = 14 are also precisely {1, 2, 7, 14}.
And they are: because 14 | 42, all divisors of 14—namely, 1, 2, 7, and 14—are common divisors of 14
and 42.

Proof of Lemma 7.10. By the assumption that d | n, we know that there’s an integer a such that n = ad.
Let r = m mod n, so that m = cn + r for an integer c (as guaranteed by Theorem 7.1). We must prove that
d | m if and only if d | r.
For the forward direction, suppose that d | m. (We must prove that d | r.) By definition, there exists an
integer b such that m = bd. But n = ad and m = bd, so

m = cn + r ⇔ bd = c(ad) + r ⇔ r = (b − ac)d

for integers a, b, and c. Thus r is a multiple of d, and therefore d | r.


For the converse, suppose that d | r. (We must prove that d | m.) By definition, we have that r = bd for
some integer b. But then n = ad and r = bd, so

m = cn + r = c(ad) + bd = (ac + b)d

for integers a, b, and c. Thus d | m.

ISTUDY
338 Number Theory

Corollary 7.11. Let n and m be any two positive integers such that n ≤ m and n ̸ | m. Then GCD(n, m) =
GCD(m mod n, n).

Proof. Lemma 7.10 establishes that the set of common divisors of ⟨n, m⟩ is identical to the set of common
divisors of ⟨n, m mod n⟩. Therefore the maxima of these two sets of divisors—that is, GCD(n, m) and
GCD(m mod n, n)—are also equal.

Putting it together: the correctness of the Euclidean algorithm


Using this corollary, we can now prove the correctness of the Euclidean algorithm:

Theorem 7.12: Correctness of the Euclidean algorithm.


For arbitrary positive integers n and m with n ≤ m, we have Euclid(n, m) = GCD(n, m).

Proof. We’ll proceed by strong induction on n, the smaller input. Define the property
P(n) = for any m ≥ n, we have Euclid(n, m) = GCD(n, m).
We’ll prove that P(n) holds for all integers n ≥ 1.

Base case (n = 1). P(1) follows because both GCD(1, m) = 1 and Euclid(1, m) = 1: for any m, the
only positive integer divisor of 1 is 1 itself (and indeed 1 | m), and thus GCD(1, m) = 1. Observe that
Euclid(1, m) = 1, too, because m mod 1 = 0 for any m.
Inductive case (n ≥ 2). We assume the inductive hypotheses—that P(n′ ) holds for any 1 ≤ n′ < n—and
must prove P(n). Let m ≥ n be arbitrary. There are two subcases, based on whether n | m or n ̸ | m:
Case I: n | m. In other words, m = cn for an integer c, which means that m mod n = 0 and thus, by
inspection of the algorithm, Euclid(n, m) = n. Because n | n (and there is no d > n that divides n
evenly), indeed n is the GCD of n and m = cn.
Case II: n ̸ | m. Because n ̸ | m—that is, because m mod n ̸= 0—we have
Euclid(n, m) = Euclid(m mod n, n) by inspection of the algorithm
= GCD(m mod n, n) by the inductive hypothesis P(m mod n)
= GCD(n, m). by Corollary 7.11

Note that (m mod n) ≤ n − 1 by the definition of mod (anything mod n is less than n), so we can
invoke the inductive hypothesis P(m mod n) in the second step of this proof.

Theorem 7.12 establishes the correctness of the Euclidean algorithm, but we introduced this algorithm
because the brute-force algorithm (simply testing every candidate divisor d) was too slow. So we’d better
analyze its running time. Indeed, the Euclidean algorithm is very efficient:

Theorem 7.13: Efficiency of the Euclidean algorithm.


For arbitrary positive integers n and m with n ≤ m, the recursion tree of Euclid(n, m) has depth at most
log n + log m.

(The ability to efficiently compute GCD(n, m) using the Euclidean algorithm—assuming we use the effi-
cient algorithm to compute m mod n from Exercises 7.11–7.16, at least—will be crucial in the RSA

ISTUDY
7.2 Modular Arithmetic 339

cryptographic system in Section 7.5.) You’ll prove Theorem 7.13 by induction in Exercise 7.34—and
you’ll show that the recursion tree can be as deep as Ω(log n + log m), using the Fibonacci numbers,
in Exercise 7.37.
Problem-solving tip: In Theorem 7.13, it’s not obvious what quantity upon which to perform induction—after all, there are
two input variables, n and m. It is often useful to combine multiple inputs into a single “measure of progress” toward the base
case—perhaps performing induction on the quantity n + m or the quantity n · m.

ISTUDY
340 Number Theory

Computer Science Connections

Converting Between Bases, Binary Representation, and Generating Strings


For a combination of historical and anatomical reasons—we have ten fingers and ten toes!—we generally use a base
ten, or decimal, system to represent numbers. (“Decimal” comes from the Latin decim “ten”—and it’s suggestive that
digit is ambiguous in English between “place in a number” and “finger or toe.”) Moving from right to left, there’s a
ones place, a tens place, a hundreds place, and so forth; thus 2048 denotes 8 · 1 + 4 · 10 + 0 · 100 + 2 · 1000.
This representation is an example of a positional system, in which each place/position has a value, and the symbol
in that position tells us how many of that value the number has. Some ancient cultures used non-decimal positional
systems, some of which survive to the present day: for example, the Sumarians and Babylonians used a base 60
system—and, even today, 60 seconds make a minute, and 60 minutes make an hour.
In general, to represent a number n in base b ≥ 2, we write a sequence of elements of {0, 1, . . . , b − 1}—say
[dk dk−1 · · · d2 d1 d0 ]b . (We’ll write the base explicitly as a subscript, for clarity.) Moving from right to left, the ith
∑k
position is “worth” bi , so this number’s value is i=0 bi di . For example,

[1234]5 = 4 · 50 + 3 · 51 + 2 · 52 + 1 · 53 = 4 + 15 + 50 + 125 = 194


[1234]8 = 4 · 80 + 3 · 81 + 2 · 82 + 1 · 83 = 4 + 24 + 128 + 512 = 668.
We can use modular arithmetic to quickly di := n mod 2
convert an integer to an arbitrary base b. baseConvert(n, b): n i di
(For simplicity, it’s easiest to think about the 145 0 1
Input: integers n and b ≥ 2
input n as being written in base 10, but it’s 72 1 0
Output: n, represented in base b
not harder to convert from an arbitrary base 36 2 0
1 i := 0
instead.) To start, notice that 18 3 0
2 while n > 0:
∑k 9 4 1
( i=0 bi di ) mod b = d0 . 3 di := n mod b
4 n := (n − di )/b 4 5 0
(The value bi di is divisible by b for any 5 i := i + 1 2 6 0
i ≥ 1.) Thus, to represent n in base b, we have 6 return [di di−1 · · · d1 d0 ]b 1 7 1
no choice: ∑ we must have d0 := n mod b. 0 8 —
k
Similarly, ( i=0 bi di ) mod b2 = bd1 + d0 ; Figure 7.4 A base-conversion algorithm, and an example—here,
thus we must choose d1 := n−d b
0
mod b. converting 145 (in base 10) to binary (base 2). For each iteration of
(Note that n − d0 must be divisible by b, baseConvert(145, 2), the values of n, i, and di are shown; thus 145
because of our choice of d0 .) An algorithm can be written (reading from the bottom up) as [10010001] .
2
following this strategy is shown in Figure 7.4.
(We could also have written this algorithm without using division; see Exercise 7.5.)
We can use the base conversion algorithm
in Figure 7.4 to convert decimal numbers
0 [ 0 0 0 0 ]26 AAAA
(base 10) into binary (base 2), the inter- [ 0 0 0 1 ]26
1 interpret AAAB
nal representation in computers. Or we can n = 0, 2 [ 0 0 0 2 ]26 0=A AAAC
convert into octal (base 8) or hexadecimal 1=B
···

n = 1,
… 25 convert to [ 0 0 0 25]26 ... AAAZ
(base 16), two other frequently used repre- n = 456975 26 base 26 [ 0 0 1 0 ]26 25 = Z AABA
sentations for numbers in programming. But = 264 − 1 27 [ 0 0 1 1 ]26 AABB
we can also use baseConvert for seemingly
···

1234 [ 0 1 21 12]26 ABVM


unrelated problems. Consider the task of enu-
···

merating all 4-letter strings from the alpha- 456974 [25 25 25 24]26 ZZZY
bet. The “easy” way to write a program to 456975 [25 25 25 25]26 ZZZZ
accomplish this task, with four nested loops,
is painful to write—and it becomes utterly Figure 7.5 Generating all 4-letter strings using baseConvert.
unwieldy if we needed all 10-letter strings
instead. But, instead, let’s count from 0 up to 264 − 1—there are 264 different 4-letter strings—and convert each
number into base 26. We can then translate each number into a sequence of letters, with the ith digit acting as an
index into the alphabet that tells us which letter to put in position i. See Figure 7.5.

ISTUDY
Exercises 341

EXERCISES
7.1 Using paper and pencil only, follow the proof of Theorem 7.1 or use the mod-and-div algorithm (see Figure 7.6) to compute
integers r ∈ {0, 1, . . . , k − 1} and d such that kd + r = n, for k = 17 and n = 202.
7.2 Repeat Exercise 7.1 for k = 99 and n = 2017.
7.3 Repeat Exercise 7.1 for k = 99 and n = −2017.
7.4 Let k ≥ 1 and n be integers. In proving Theorem 7.1, we showed that there exist integers r and d such that 0 ≤ r < k and kd + r = n.
We stated but did not prove that r and d are unique. Prove that they are. In other words, for r, r′ , d, d′ ∈ Z, prove that if 0 ≤ r < k
and 0 ≤ r′ < k and n = dk + r = d′ k + r′ , then d′ = d and r′ = r.
7.5 The algorithm baseConvert on p. 340, which performs base conversion, is written using division. Modify the algorithm so that it
uses only addition, subtraction, mod, multiplication, and comparison.

A repdigitb is a number n that, when represented in base b (see p. 340), consists of the same symbol written over and over, repeated
at least twice. For example, 666 is a repdigit10 : when you write [666]10 , it’s the same digit (“6”) repeated (in this case, three times).
One way of understanding that 666 is a repdigit10 is that 666 = 6 + 60 + 600 = 6 · 100 + 6 · 101 + 6 · 102 . We can write [40]10
as [130]5 because 40 = 0 + 3 · 5 + 1 · 52 , or as [101000]2 because 40 = 1 · 23 + 1 · 25 . So 40 is not a repdigit10 , repdigit5 , or
repdigit2 . But 40 is a repdigit3 , because 40 = [1111]3 .
7.6 Prove that every number n ≥ 3 is a repdigitb for some base b ≥ 2, where n = [11 · · · 1]b .
7.7 Prove that every even number n > 6 is a repdigitb for some base b ≥ 2, where n = [22 · · · 2]b .
7.8 Prove that no odd number n is a repdigitb of the form [22 · · · 2]b , for any base b.
7.9 Write R(n) to denote the number of bases b, for 2 ≤ b ≤ n − 1, such that n is a repdigitb . Conjecture a condition on n such that
R(n) = 1, and prove your conjecture.
7.10 Recall the mod-and-div(n, m) algorithm, reproduced in Figure 7.6, that computes n mod k and ⌊ nk ⌋ by repeatedly subtracting k
from n until the result is less than k. As written, the mod-and-div algorithm fails when given a negative value of n. Follow Case II
of Theorem 7.1’s proof to extend the algorithm for n < 0 too.
7.11 The mod-and-div algorithm is slow—this algorithm computes an integer d such that nd ≤ m < n(d+1) by performing linear search
for d. A faster version of this algorithm, called mod-and-div-faster, finds d using binary search instead; again, see Figure 7.6. The
code for mod-and-div-faster as written uses division, by averaging lo and hi. Modify the algorithm so that it uses only addition,
subtraction, multiplication, and comparison.
7.12 The code for mod-and-div-faster as written uses hi := n + 1 as the initial upper bound. Why is this assignment an acceptable for
the correctness of the algorithm? Explain briefly.
7.13 Describe an algorithm that finds a better upper bound hi, by repeatedly doubling hi until it’s large enough.
7.14 Let k be arbitrary. Describe an input n for which the doubling search from Exercise 7.13 yields a significant improvement on the
running time of the algorithm for inputs k and n.
7.15 (programming required) Implement, in a programming language of your choice, all three of these algorithms (mod-and-div,
mod-and-div-faster, and the doubling-search modified version of mod-and-div-faster from Exercise 7.13) to compute n mod k
and ⌊ nk ⌋.
7.16 Run the three algorithms from Exercise 7.15 to compute the following values: 232 mod 202, 232 mod 2020, and 232 mod 315 . How
do their speeds compare?

mod-and-div(n, k): mod-and-div-faster(n, k): mod-exp(b, e, n):


Input: integers n ≥ 0 and k ≥ 1 Input: integers n ≥ 0 and k ≥ 1 Input: integers n ≥ 1, b, and e ≥ 0
Output: n mod k and ⌊ nk ⌋ Output: n mod k and ⌊ nk ⌋ Output: be mod n
1 r := n; d := 0 1 lo := 0; hi := n + 1 1 if e = 0 then
2 while r ≥ k: 2 while lo < hi − 1: 2 return 1
3 r := r − k; d := d + 1 3 mid := ⌊(lo + hi)/2⌋ 3 else if e is even then
4 return r, d 4 if mid · k ≤ n then 4 result := mod-exp(b, 2e , n)
5 lo := mid 5 return (result · result) mod n
6 else 6 else
7 hi := mid 7 result := mod-exp(b, e − 1, n)
8 return (n − k · lo), lo 8 return (b · result) mod n

Figure 7.6 Two algorithms to compute n mod k and ⌊ nk ⌋ (one reminder and one faster version), and modular
exponentiation via repeated squaring.

ISTUDY
342 Number Theory

7.17 Prove Theorem 7.4.2: for integers k > 0, a, and b, we have a + b mod k = [(a mod k) + (b mod k)] mod k. Begin your proof as
follows: We can write a = ck + r and b = dk + t for r, t ∈ {0, . . . , k − 1} (as guaranteed by Theorem 7.1). Then use mod-and-div
and Lemma 7.3.

7.18 Prove that a mod b = (a mod bc) mod b for all positive integers a, b, and c.
7.19 Prove Theorem 7.7.1: a | 0 for any positive integer a.
7.20 Prove Theorem 7.7.2: 1 | a for any positive integer a.
7.21 Prove Theorem 7.7.6: for all positive integers a, b, and c, if a | b and a | c, then a | (b + c).
7.22 Prove Theorem 7.7.7: for all positive integers a, b, and c, if a | c, then a | bc.

7.23 Consider the “repeated squaring” algorithm mod-exp for modular exponentiation shown in Figure 7.6. Observe that this algorithm
computes be mod n with a recursion tree of depth Θ(log e). Use this algorithm to compute 380 mod 5 without using a calculator.
(You should never have to keep track of a number larger than 5 except for the exponent itself when you’re doing these calculations!)

7.24 Write down a recurrence relation representing the number of multiplications done by mod-exp(b, e, n). Prove, using this recurrence,
that the number of multiplications done is between log e and 2 log e.
7.25 (programming required) Implement mod-exp in a programming language of your choice. Also implement a version of mod-exp
that computes be and then, after that computation is complete, takes the result mod n. Compare the speeds of these two algorithms
in computing 3k mod 5, for k = 80, k = 800, k = 8000, …, k = 8,000,000. Explain.

There’s a category of numerical tricks often called “divisibility rules” that you may have seen—quick ways of testing whether a
given number is evenly divisible by some small k. The test for whether an integer n is divisible by 3 is this: add up the digits of n; n is
divisible by 3 if and only if this sum is divisible by 3. For example, 6,007,023 is divisible by 3 because 6 + 0 + 0 + 7 + 0 + 2 + 3 = 18,
and 3 | 18. (Indeed 3 · 2,002,341 = 6,007,023.) This test relies on the following claim: for any sequence ⟨x0 , x1 , . . . , xn−1 ⟩ ∈
{0, 1, . . . , 9}n , we have
" n−1 # " n−1 #
X X
10i xi mod 3 = xi mod 3.
i=0 i=0

(For example, 6,007,023 is represented as x0 = 3, x1 = 2, x2 = 0, x3 = 7, x4 = 0, x5 = 0, and x6 = 6.)


7.26 Prove that the test for divisibility by 3 is correct. First prove that 10i mod 3 = 1 for any integer i ≥ 0; then prove the stated claim.
Your proof should make heavy use of the properties in Theorem 7.4.
7.27 The divisibility test for 9 is to add up the digits of the given number, and test whether that sum is divisible by 9. State and prove
the condition that ensures that this test is correct.

7.28 Using paper and pencil only, use the Euclidean algorithm to compute the GCDs of n = 111 and m = 202.
7.29 Do the same for n = 333 and m = 2017.
7.30 Do the same for n = 156 and m = 360.

7.31 (programming required) Implement the Euclidean algorithm in a language of your choice.
7.32 (programming required) Early in Section 7.2.4, we discussed a brute-force algorithm to compute GCD(n, m): try every
d ∈ {1, 2, . . . , min(n, m)} and return the largest d such that d | n and d | m. Implement this algorithm, and compare its perfor-
mance to the Euclidean algorithm as follows: for both algorithms, find the largest n for which you can compute GCD(n, n − 1) in
less than 1 second on your computer.

7.33 Let’s analyze the running time of the Euclidean algorithm for GCDs, to prove Theorem 7.13. Let n and m be arbitrary positive
integers with n ≤ m. Prove that m mod n ≤ m2 . (Hint: what happens if n ≤ m2 ? What happens if m2 < n ≤ m?)
7.34 Using Exercise 7.33, prove that the Euclidean algorithm terminates within O(log n + log m) recursive calls. (Actually it’s possible
to prove a bound that’s tighter by a constant factor, but this result is good enough for asymptotic work.)

Now let’s show that, in fact, the Euclidean algorithm generates a recursion tree of depth Ω(log n + log m) in the worst case—
specifically, when Euclid(fn , fn+1 ) is run on consecutive Fibonacci numbers fn , fn+1 .
7.35 Show that, for all n ≥ 3, we have fn mod fn−1 = fn−2 , where fi is the ith Fibonacci number. (Recall from Definition 6.21 that
f1 = 1, f2 = 1 and fn = fn−1 + fn−2 for n ≥ 3.)
7.36 Prove that, for all n ≥ 3, Euclid(fn−1 , fn ) generates a recursion tree of depth n − 2.
7.37 Using Exercise 7.36 and the fact that fn ≤ 2n (Exercise 6.95), argue that the running time of the Euclidean algorithm is
Ω(log n + log m) in the worst case.

ISTUDY
7.3 Primality and Relative Primality 343

7.3 Primality and Relative Primality


Why is it that we entertain the belief that for every purpose odd numbers are the most
effectual?
Pliny the Elder (23–79)
Book XXVIII (“Remedies Derived From Living Creatures”)
The Natural History (c. 79)

Now that we’ve reviewed divisibility (and the related notions of factors, divisors, and multiples) in
Section 7.2, we’ll continue with a brief review of another concept from Chapter 2: the definition of prime
numbers. We’ll then introduce the related notion of relatively prime integers—pairs of numbers that share
no common divisors aside from 1—and a few applications and extensions of both definitions.

7.3.1 Primality (A Reminder) and Relative Primality (An Introduction)


We begin with a reminder of the definitions from Chapter 2:

Definition 7.14: Primes and composites [reprise].


An integer p ≥ 2 is called prime if the only positive integers that evenly divide it are 1 and p itself. An
integer n ≥ 2 that is not prime is called composite. (Note that 1 is neither prime nor composite.)

For example, the integers 2, 3, 5, and 7 are all prime, but 4 (which is divisible by 2) and 6 (which is divisible
by 2 and 3) are composite. It’s also worth recalling two results that we saw in Chapter 4:

There are infinitely many prime numbers. Example 4.31 gave a proof by contradiction to show that there is
no largest prime. (That result is attributed to Euclid—the same Euclid whose algorithm we encountered
in Section 7.2.)
The smallest divisor of a composite number isn’t too big. Theorem 4.32 showed that any composite num-

ber n ≥ 2 is divisible by some factor d ≤ n. (That is, n ≥ 2 is prime if and only if d ̸ | n for every

d ∈ {2, 3, . . . , n}.)

We used the latter result to give an algorithm for the primality testing problem that performs n divisibility
tests. (The primality testing problem: we are given an integer n ≥ 2, and we have to figure out whether
n is prime or composite.) This algorithm simply exhaustively tests whether n is divisible by any of the

candidate divisors between 2 and n.

Taking it further: The faster divisibility algorithm that you developed in Exercises 7.11–7.16 will allow us to test primality in

Θ( n · logk n) steps, for some constant k: faster than the naïve algorithm, but still not efficient. There are faster algorithms for
primality testing that require only polylogarithmically many operations—that is, O(logk n), for some fixed k—to test whether n is
prime. See, for example, p. 368 for a discussion of a randomized algorithm that efficiently tests for primality, which requires only
O(logk n) steps to test whether n is prime, although it does have a small (provably small!) probability of making a mistake. There
are also deterministic algorithms to solve this problem in polylogarithmic time, though they’re substantially more complicated
than this randomized algorithm.

Prime numbers turn out to be useful in all sorts of settings, and it will sometimes turn out to be valuable
to compute a large collection of primes all at once. Of course, we can always generate more than one
prime number by using a primality-testing algorithm (like the one we just suggested) more than once, until
enough numbers have passed the test. But some of the work that we do in figuring out whether n is prime

ISTUDY
344 Number Theory

actually turns out to be helpful in figuring out whether n′ > n is prime. An algorithm called the Sieve of
Eratosthenes, which computes a list of all prime numbers up to a given integer, exploits this redundancy
to save some computation. (The Sieve of Eratosthenes is named after Eratosthenes, a Greek scholar who
lived in the third century bce. Eratosthenes is also credited as the first person to calculate the size of the
earth—people were a lot less specialized back then.)
The Sieve generates its list of prime numbers by successively eliminating (“sieving”) all multiples of
each discovered prime: for example, once we know that 2 is prime and that 4 is a multiple of 2, we will
never have to test whether 4 | n in determining whether n is prime. (If n isn’t prime because 4 | n, then n is
also divisible by 2—that is, 4 is never the smallest integer greater than 1 that evenly divides n, so we never
have to bother testing 4 as a candidate divisor.) See Exercises 7.38–7.42 and Figure 7.17.

Taking it further: The Sieve of Eratosthenes is one of the earliest known algorithms, dating back to about 200 bce. (The date
isn’t clear, in part because none of Eratosthenes’s work survived; the algorithm was reported, and attributed to Eratosthenes, by
Nicomachus about 300 years later.) The Euclidean algorithm for greatest common divisors from Section 7.2, which dates from
c. 300 bce, is one of the few older algorithms that are known. For more, see [72].

The distribution of the primes


For a positive integer n, let primes(n) denote the number of prime numbers less than or equal to n. Thus,
for example, we have

0 = primes(1)
1 = primes(2)
2 = primes(3) = primes(4)
3 = primes(5) = primes(6), and
4 = primes(7) = primes(8) = primes(9) = primes(10).

Or, to state it recursively: we have primes(1) = 0, and, for n ≥ 2, we have


(
primes(n − 1) if n is composite
primes(n) =
1 + primes(n − 1) if n is prime.

Figure 7.7 displays the value of primes(n) for moderately small n. An additional fact that we’ll state with-
out proof is the Prime Number Theorem—also illustrated in Figure 7.7—which describes the behavior of
primes(n) for large n:

Theorem 7.15: Prime Number Theorem.


Let primes(n) denote the number of primes less than or equal to n. As n gets large, the ratio between
primes(n) and lnnn approaches 1.

Formal proofs of the Prime Number Theorem are complicated beasts—far more complicated that we’ll
want to deal with here!—but even an intuitive understanding of the theorem is useful. Informally, this
theorem says that, given an integer n, approximately a ln1n fraction of the numbers “close to” n are prime.
(See Exercise 7.45.)

ISTUDY
7.3 Primality and Relative Primality 345

2
1200

primes(n)
1000
1.5

800 primes(n)
n
ln n

600 1
20

15
400
10 0.5
200 5

0 20 40 60

2000 4000 6000 8000 10000 2000 4000 6000 8000 10000

Figure 7.7 The distribution of primes. The Prime Number Theorem states that the ratio primes(n)/ lnnn , on the right,
slowly converges to 1.

Example 7.8: Using the Prime Number Theorem.


Using the estimate primes(n) ≈ n
ln n , calculate (approximately) how many 10-digit integers are prime.
Solution. By definition, there are exactly primes(999,999,999) primes with 9 or fewer digits, and there
are exactly primes(9,999,999,999) primes with 10 or fewer digits. Thus the number of 10-digit primes is
9,999,999,999 999,999,999
primes(9,999,999,999) − primes(999,999,999) ≈ −
ln 9,999,999,999 ln 999,999,999
≈ 434,294,499 − 48,254,956
= 386,039,543.
Thus, roughly 386 million of the 9 billion 10-digit numbers (about 4.3%) are prime. (Exercise 7.46 asks
you to consider how far off this estimate is.)

Problem-solving tip: Back-of-the-envelope calculations are often great as plausibility checks: although the Prime Number The-
orem doesn’t state a formal bound on how different primes(n) and lnnn are, you can see whether a solution to a problem “smells
right” with an approximation like this one.

The density of the primes is potentially interesting for its own sake, but there’s also a practical reason
that we’ll care about the Prime Number Theorem. In the RSA cryptosystem (see Section 7.5), one of the
first steps of the protocol involves choosing two large prime numbers p and q. The bigger p and q are, the
more secure the encryption, so we would want p and q to be pretty big—say, both approximately 22048 . The
Prime Number Theorem tells us that, roughly, one out of every ln 22048 ≈ 1420 integers around 22048 is
prime. Thus, we can find a prime in this range by repeatedly choosing a random integer n of the right size
and testing n for primality, using some efficient primality testing algorithm. (More about testing algorithms
soon.) Approximately one out of every 1420 integers we try will turn out to be prime, so on average we’ll
only need to try about 2840 values of n before we find primes to use as p and q.

ISTUDY
346 Number Theory

Prime factorization
Recall that any integer can be factored into the product of primes. For example, we can write 2001 =
3 · 23 · 29 and 202 = 2 · 101 and 507 = 3 · 13 · 13 and 55057 = 55057. (All of 2, 3, 13, 23, 29, 101, and
55057 are prime.) The Fundamental Theorem of Arithmetic (Theorem 5.11) states that any integer n can
be factored into a product of primes—and that, up to reordering, there is a unique prime factorization of
n. (In other words, any two prime factorizations of an integer n can differ in the ordering of the factors—
for example, 202 = 101 · 2 and 202 = 2 · 101—but they can differ only in ordering.) We proved the
“there exists” part of the theorem in Example 5.12 using induction; a bit later in this section, we’ll prove
uniqueness. (The proof uses some properties of prime numbers that are most easily seen using an extension
of the Euclidean algorithm that we’ll introduce shortly; we’ll defer the proof until we’ve established those
properties.)

Relative primality
An integer n is prime if it has no divisors except 1 and n itself. Here we will introduce a related concept
for pairs of integers—two numbers that do not share any divisors except 1:

Definition 7.16: Relative primality.


Two positive integers n and m are called relatively prime if GCD(n, m) = 1—that is, if 1 is the only
positive integer that evenly divides both n and m.

Here are a few small examples:

Example 7.9: Some relatively prime integers.


The integers 21 and 25 are relatively prime, as 21 = 3 · 7 and 25 = 5 · 5 have no common divisor (other
than 1). Similarly, 5 and 6 are relatively prime, as are 17 and 35. (But 12 and 21 are not relatively prime,
because they’re both divisible by 3.)

You’ll prove a number of useful facts about relatively prime numbers in the exercises—for example, a
prime number p and any integer n are relatively prime unless p | n; and, more generally, two numbers are
relatively prime if and only if their prime factorizations do not share any factors. (The “usefulness” of these
facts will show up when we look at cryptographic systems.)

Taking it further: Let f(x) be a polynomial. Polynomials have some useful properties, and here’s one of the special characteristics
of prime numbers: some of these useful properties of f(x) continue to hold if we take the result of evaluating the polynomial mod p
for some prime number p. In particular, if f(x) is a polynomial of degree k, then either f(a) ≡p 0 for every a ∈ {0, 1, . . . , p − 1}
or there are at most k values a ∈ {0, 1, . . . , p − 1} such that f(a) ≡p 0. (We saw this property in Section 2.5.3 when we didn’t
take the result modulo the prime p.) As a consequence, if we have two polynomials f(x) and g(x) of degree k, then if f and g are
not equivalent modulo p, then there are at most k values of a ∈ {0, 1, . . . , p − 1} for which f(a) ≡p g(a).
We can use the fact that polynomials of degree k “behave” in the same way modulo p (with respect to the number of roots,
and the number of places that two polynomials agree) to give efficient solutions to two problems: secret sharing, in which n
people wish to “distribute” shares of a secret so that any k of them can reconstruct the secret (but no set of k − 1 can); and
a form of error-correcting codes, as we discussed in Section 4.2. The basic idea will be that by using a polynomial f(x) and
evaluating f(x) mod p for a prime p, we’ll be able to use small numbers (less than p) to accomplish everything that we’d be able
to accomplish by evaluating f(x) without the modulus. See the discussions of secret sharing on p. 356 and of Reed–Solomon
codes on p. 357.

ISTUDY
7.3 Primality and Relative Primality 347

7.3.2 A Structural Fact and the Extended Euclidean Algorithm


Given an integer n ≥ 2, quickly determining whether n is prime seems tricky: we’ve seen some easy algo-
rithms for this problem, but they’re pretty slow. And, though there are efficient but complicated algorithms
for primality testing, we haven’t seen (and, really, nobody knows) a genuinely simple algorithm that’s also
efficient. On the other hand, the analogous question about relative primality—given integers n and m, are
n and m relatively prime?—is easy. In fact, we already know everything we need to solve this problem
efficiently, just from the definition: n and m are relatively prime if and only if their GCD is 1, which occurs
if and only if Euclid(n, m) = 1. So we can efficiently test whether n and m are relatively prime by testing
whether Euclid(n, m) = 1.
We will start this section with a structural property about GCDs. (Right now it shouldn’t be at all clear
what this claim has to with anything in the last paragraph—but stick with it! The connection will come
along soon.) Here’s the claim:

Lemma 7.17: There are multiples of n and m that add up to GCD(n, m).
Let n and m be any positive integers, and let r = GCD(n, m). Then there exist integers x and y such that
xn + ym = r.

Here are a few examples of the multiples guaranteed by this lemma:

Example 7.10: Some examples of Lemma 7.17.


For several pairs of integers n and m, values of x and y such that xn + ym = GCD(n, m) are shown
in Figure 7.8. Note that there are multiple options; for the pair {17, 35} in Figure 7.8, for example,
we could have chosen x = −2 and y = 1 instead of x = 33 and y = −16, as −2 · 17 + 1 · 35 and
33 · 17 + (−16) · 35 are both equal to 1 = GCD(17, 35).

Note that the integers x and y whose existence is guaranteed by Lemma 7.17 are not necessarily positive!
(In fact, in Example 7.10 the only time that we didn’t have a negative coefficient for one of the numbers
was for the pair {16, 48}, where GCD(16, 48) = 16 = 1 · 16 + 0 · 48.) Also, as Example 7.10 illustrates,
there may be more than one pair of values for x and y that satisfy Lemma 7.17—in fact, you’ll show in
Exercise 7.58 that there are always infinitely many values of {x, y} that satisfy the lemma.
Although, if you stare at it long enough, Example 7.10 might give a tiny hint about why Lemma 7.17
is true, a proof still seems distant. But, in fact, we’ll be able to prove the claim based what looks like a
digression: a mild extension to the Euclidean algorithm. For a little bit of a hint as to how, let’s look at one
more example of the Euclidean algorithm, but interpreting it as a guide to find the integers in Lemma 7.17:

n m x · n + y · m = GCD(n, m)
5 6 (−1) · 5 + 1 · 6 = −5 + 6 =1 = GCD(5, 6)
17 35 33 · 17 + (−16) · 35 = 561 − 560 =1 = GCD(17, 35)
12 21 2 · 12 + (−1) · 21 = 24 − 21 =3 = GCD(12, 21)
48 1024 (−21) · 48 + 1 · 1024 = −1008 + 1024 = 16 = GCD(48, 1024)
16 48 1 · 16 + 0 · 48 = 16 + 0 = 16 = GCD(16, 48)

Figure 7.8 Some examples of Lemma 7.17. For each pair of integers n and m, we’ve chosen integers x and y such that
xn + ym = GCD(n, m).

ISTUDY
348 Number Theory

Example 7.11: An example of Lemma 7.17, using the Euclidean algorithm.


Let’s find integers x and y such that 91x + 287y = GCD(91, 287). By running Euclid(91, 287), we make
the recursive calls Euclid(14, 91) and Euclid(7, 14), which returns 7. (See Figure 7.9 for a reminder.) So
GCD(91, 287) = 7. (A)
Remember from the definition of mod (Definition 7.2) that we can write m = ⌊ mn ⌋ · n + (m mod n) for
any integers m and n. Or, by rearranging,
m mod n = m − ⌊ mn ⌋ · n

Specifically, for the first two calls to Euclid (n = 91 and m = 287, and n = 14 and m = 91), we have

14 = 287 mod 91 = 287 − ⌊ 287


91 ⌋ · 91 = 287 − 3 · 91 (B)
7 = 91 mod 14 = 91 − ⌊ 91
14 ⌋ · 14 = 91 − 6 · 14. (C)
Now, assembling all of these facts, we have that
by (A) by (C) by (B) collecting like terms

GCD(91, 287) = 7 = 91 − 6 · 14 = 91 − 6 · [287 − 3 · 91] = − 6 · 287 + 19 · 91.


Thus x = −6 and y = 19 satisfy the requirement that 91x + 287y = GCD(91, 287).

The extended Euclidean algorithm


The extended Euclidean algorithm, shown in Figure 7.9, follows the outline of Example 7.11, applying
these algebraic manipulations recursively. Lemma 7.17 will follow from a proof that this extended version
of the Euclidean algorithm actually computes three integers x, y, r such that GCD(n, m) = r = xn + ym.
Here are two examples:

Example 7.12: Running the extended Euclidean algorithm.


Evaluating extended-Euclid(12, 18) recursively computes extended-Euclid(6, 12), which returns
x = 1, y = 0, and r = 6. The else case of the algorithm tells us that our result is ⟨y − ⌊ mn ⌋ · x, x, r⟩ where
m = 18 and n = 12. Plugging these values into the formula, we see that extended-Euclid(12, 18) returns
⟨−1, 1, 6⟩—and, indeed, GCD(12, 18) = 6 and −1 · 12 + 1 · 18 = 6. (See Figure 7.10a.)
Figure 7.10b shows a slightly more complicated example: extended-Euclid(18, 30), whose result is
x = 2, y = −1, and r = 6. Again, we have GCD(18, 30) = 6 and 2 · 18 + −1 · 30 = 36 − 30 = 6, just
as required.

extended-Euclid(n, m): Euclid(n, m):


Input: positive integers n and m ≥ n Input: positive integers n and m ≥ n
Output: x, y, r ∈ Z where GCD(n, m) = r = xn + ym Output: GCD(n, m)
1 if m mod n = 0 then 1 if m mod n = 0 then
2 return 1, 0, n // 1 · n + 0 · m = n = GCD(n, m) 2 return n
3 else 3 else
4 x, y, r := extended-Euclid(m mod n, n) 4 return Euclid(m mod n, n)
 
5 return y − mn · x, x, r

Figure 7.9 The extended Euclidean algorithm (and a reminder of the Euclidean algorithm).

ISTUDY
7.3 Primality and Relative Primality 349

extended-Euclid(18, 30)
n = 18, m = 30 // m mod n ̸= 0, so we recurse.
extended-Euclid(12, 18)
n = 12, m = 18 // m mod n ̸= 0, so we recurse.
extended-Euclid(6, 12)
extended-Euclid(12, 18) n = 6, m = 12
n = 12, m = 18 // m mod n ̸= 0, so we recurse. // m mod n = 0, so we return 1, 0, n.
extended-Euclid(6, 12) return values: 1, 0, 6
n = 6, m = 12
// m mod n = 0, so we return 1, 0, n. x = 1, y = 0, r = 6 // return y − ⌊ mn ⌋ · x, x, r
return values: 1, 0, 6 return values: −1, 1, 6

x = 1, y = 0, r = 6 // return y − ⌊ mn ⌋ · x, x, r x = −1, y = 1, r = 6 // return y − ⌊ mn ⌋ · x, x, r


return values: −1, 1, 6 return values: 2, −1, 6

(a) Computing extended-Euclid(12, 18). (b) Computing extended-Euclid(18, 30).

Figure 7.10 Two calls to extended-Euclid. (Note that the first recursive call made in (b) computes the value of
extended-Euclid(12, 18), which is exactly what’s shown in (a).)

We’re now ready to state the correctness of the extended Euclidean algorithm:

Theorem 7.18: Correctness of the extended Euclidean algorithm.


For arbitrary positive integers n and m with n ≤ m, extended-Euclid(n, m) returns three integers x, y, r
such that r = GCD(n, m) = xn + ym.

The proof, which is fairly straightforward by induction, is left to you as Exercise 7.60. And once
you’ve proven this theorem, Lemma 7.17—which merely stated that there exist integers x, y, r with
r = GCD(n, m) = xn + ym for any n and m—is immediate.

Problem-solving tip: A nice way, particularly for computer scientists, to prove a theorem of the form “there exists x such that
P(x)” is to actually give algorithm that computes such an x! That’s how Theorem 7.18 establishes Lemma 7.17.

Note also that the extended Euclidean algorithm is an efficient algorithm—you already proved in Exer-
cise 7.34 that the depth of the recursion tree for Euclid(n, m) is upper bounded by O(log n + log m), and
the running time of extended-Euclid(n, m) is asymptotically the same as Euclid(n, m). (The only quantity
that we need to use in extended-Euclid that we didn’t need in Euclid is ⌊ mn ⌋, but we already had to find
m mod n in Euclid—so if we use a faster version of mod-and-div(n, m) to compute m mod n, then we
“for free” also get the value of ⌊ mn ⌋.)

7.3.3 The Uniqueness of Prime Factorization


Lemma 7.17—that there are multiples of n and m that add up to GCD(n, m)—and the extended Euclidean
algorithm (which computes those coefficients) will turn out to be helpful in proving some facts that are
apparently unrelated to greatest common divisors. Here’s a claim about divisibility related to prime numbers
in that vein, which we’ll be able to use to prove that prime factorizations are unique:

ISTUDY
350 Number Theory

Lemma 7.19: When a prime divides a product.


Let p be prime, and let a and b be integers. Then p | ab if and only if p | a or p | b.

Proof. We’ll proceed by mutual implication.


For the backward direction, assume p | a. (The case for p | b is strictly analogous.) Then a = kp for some
integer k, and thus ab = kpb, which is obviously divisible by p.
For the forward direction, assume that p | ab and suppose that p ̸ | a. We must show that p | b. Because p is
prime and p̸ |a, we know that GCD(p, a) = 1 (see Exercise 7.47), and, in particular, extended-Euclid(p, a)
returns the GCD 1 and two integers n and m such that 1 = pm + an. Multiplying both sides by b yields
b = pmb + anb, and thus
b mod p = (pmb + anb) mod p
= (pmb mod p + anb mod p) mod p Theorem 7.4.2
= (0 + anb mod p) mod p Theorem 7.7.7
= (0 + 0) mod p p | ab by assumption, and Theorem 7.7.7 again
= 0.
That is, we’ve shown that if p ̸ | a, then p | b. (And ¬x ⇒ y is equivalent to x ∨ y.)
We can use Lemma 7.19 to prove that an integer’s prime factorization is unique. (We’ll prove only the
uniqueness part of the Prime Factorization Theorem here; see Example 5.12 for the “there exists a prime
factorization” part.)

Theorem 7.20: Prime Factorization Theorem [reprise].


Let n ∈ Z≥1 be any positive integer. There exist k ≥ 0 prime numbers p1 , p2 , . . . , pk such that n =
Qk
i=1 pi . Further, up to reordering, the prime numbers p1 , p2 , . . . , pk are unique.

Taking it further: Back when we defined prime numbers, we were very careful to specify that 1 is neither prime nor composite.
You may well have found this insistence to be silly and arbitrary and pedantic—after all, the only positive integers that evenly
divide 1 are 1 and, well, 1 itself, so it sure seems like 1 ought to be prime. But there was a good reason that we chose to exclude
1 from the list of primes: it makes the uniqueness of prime factorization true! If we’d listed 1 as a prime number, there would
be many different ways to prime factor, say, 202: for example, 202 = 2 · 101 and 202 = 1 · 2 · 101 and 202 = 1 · 1 · 2 · 101,
and so forth. So we’d have to have restated the theorem about uniqueness of prime factorization (“…is unique up to reordering
and the number of times that we multiply by 1”), which is a much more cumbersome statement. This theorem is the reason that
1 is not defined as a prime number, in this book or in any other mathematical treatment. (There’s an embedded problem-solving
tip—and life tip—in this discussion: if you have the power to define something, then you genuinely get to choose how to define
it. So if you can make a choice in the definition that makes life better, do it!)

Proof of Theorem 7.20 (uniqueness). We’ll proceed by strong induction on n.


For the base case (n = 1), we can write 1 as the product of zero prime numbers—recall that
Q
i∈∅ i = 1—and this representation is unique. (The product of one or more primes is greater than 1, as all
primes are at least 2.)
For the inductive case (n ≥ 2), we assume the inductive hypotheses, namely that any n′ < n has a
unique prime factorization. We must prove that the prime factorization of n is also unique. We consider
two subcases:
Case I: n is prime. Then the statement holds immediately: the only prime factorization is p1 = n. (Suppose
Qℓ
that there were a different way of prime factoring n, as n = i=1 qi for prime numbers ⟨q1 , q2 , . . . , qℓ ⟩.

ISTUDY
7.3 Primality and Relative Primality 351

We’d have to have ℓ ≥ 2 for this factorization to differ from p1 = n, but then each qi satisfies qi > 1 and
qi < n and qi | n—contradicting what it means for n to be prime.)
Case II: n is composite. Then suppose that p1 , p2 , . . . , pk and q1 , q2 , . . . , qℓ are two sequences of prime
Qk Qℓ
numbers such that n = i=1 pi = i=1 qi . Without loss of generality, assume that both sequences are
sorted in increasing order, so that p1 ≤ p2 ≤ · · · ≤ pk and q1 ≤ q2 ≤ · · · ≤ qℓ . We must prove that these
two sequences are actually equal.
Qk Qℓ
Case IIA: p1 = q1 . Define n′ = pn1 = qn1 = i=2 pi = i=2 qi as the product of all the other prime num-
bers (excluding the primes p1 and q1 = p1 ). By the inductive hypothesis, n′ has a unique prime factorization,
and thus p2 , p3 , . . . , pk and q2 , q3 , . . . , qℓ are identical.
Case IIB: p1 ̸= q1 . Without loss of generality, suppose that p1 < q1 . But we know that p1 | n, and
Qℓ
therefore p1 | i=1 qi . By Lemma 7.19, there exists an i such that p1 | qi . But 2 ≤ p1 < q1 ≤ qi . This
contradicts the assumption that qi was prime.

Taking it further: How difficult is it to factor a number n? Does there exist an efficient algorithm for factoring—that is, one
that computes the prime factorization of n in a number of steps that’s proportional to O(logk n) for some k? We don’t know. But
it is generally believed that the answer is no, that factoring large numbers cannot be done efficiently. The (believed) difficulty of
factoring is a crucial pillar of widely used cryptographic systems, including the ones that we’ll encounter in Section 7.5. There
are known algorithms that factor large numbers efficiently on so-called quantum computers (see p. 522)—but nobody knows
how to build large quantum computers. And, while there’s no known efficient algorithm for factoring large numbers on classical
computers, there’s also no proof of hardness for this problem. (And most modern cryptographic systems count on the difficulty
of the factoring problem—which is only a conjecture!)

7.3.4 The Chinese Remainder Theorem


We’ll close this section with another ancient result about modular arithmetic, called the Chinese Remainder
Theorem, from around 1750 years ago. (The name of the Chinese Remainder Theorem comes from its early
discovery by the Chinese mathematician Sun Tzu, who lived around the fifth century. This Sun Tzu, to be
clear, is a different Sun Tzu than the one who wrote The Art of War about 800 years prior.)
Here’s the basic idea. If n is some nonnegative integer, then knowing that, say, when n is divided by
7 its remainder is 4 gives you a small clue about n’s value: one seventh of integers have the right value
mod 7. Knowing n mod 2 and n mod 13 gives you more clues. The Chinese Remainder Theorem says that
knowing n mod k for enough values of k will (almost) let you figure out the value of n exactly—at least, if
those values of k are all relatively prime. Here’s a concrete example:

Example 7.13: An example of the Chinese Remainder Theorem.


What nonnegative integers n satisfy the following conditions?
n mod 2 = 0 n mod 3 = 2 n mod 5 = 1.

Solution. Suppose n ∈ {0, 1, . . . , 29}. Then there are only six values for which n mod 5 = 1, namely
0 + 1 = 1 and 5 + 1 = 6 and 10 + 1 = 11 and 15 + 1 = 16 and 20 + 1 = 21 and 25 + 1 = 26. Of
these, the only even values are 6, 16, and 26. And 6 mod 3 = 0, 16 mod 3 = 1, and 26 mod 3 = 2. Thus
n = 26.
Notice that, for any integer k, we have k ≡b k + 30 for all three moduli b ∈ {2, 3, 5}. Therefore any
n ≡30 26 will satisfy the given conditions.

ISTUDY
352 Number Theory

n = 00 10 20 30 40 50 60 70 80 90 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
n mod 2 = 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
n mod 3 = 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
n mod 5 = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

n = 00 60 12 18 24 10 16 22 28 40 20 26 20 80 14 15 21 27 30 90 25 10 70 13 19 50 11 17 23 29
n mod 2 = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
n mod 3 = 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
n mod 5 = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

Figure 7.11 The remainders of all n ∈ {0, 1, . . . , 29}, modulo 2, 3, and 5—sorted by n (above) and by the remainders
(below).

The basic point of Example 7.13 is that every value of n ∈ {0, . . . , 29} has a unique “profile” of remain-
ders mod 2, 3, and 5. (See Figure 7.11.) Crucially, every one of the 30 possible profiles of remainders occurs
in Figure 7.11, and no profile appears more than once. (The fact that there are exactly 30 possible profiles
follows from the Product Rule for counting; see Section 9.2.1.)
The Chinese Remainder Theorem states the general property that’s illustrated in these particular tables:
each “remainder profile” occurs once and only once. Here is a formal statement of the theorem. We refer
to a constraint of the form x mod n = a as a congruence, following Definition 7.5. We also write Zk to
denote the set {0, 1, . . . , k − 1}.

Theorem 7.21: Chinese Remainder Theorem: two congruences.


Let n and m be any two relatively prime integers. For any a ∈ Zn and b ∈ Zm , there exists one and only
one integer x ∈ Znm such that x mod n = a and x mod m = b.

Proof. To show that there exists an integer x satisfying x mod n = a and x mod m = b, we’ll give a proof
by construction—specifically, we’ll compute the value of x given the values of {a, b, n, m}. The two-
line algorithm is shown in Figure 7.12. We must argue that x mod n = a and x mod m = b. Note that
GCD(n, m) = 1 because n and m are relatively prime by assumption. Thus, by the correctness of the
extended Euclidean algorithm, we have

cn + dm = 1. (1)

Multiplying both sides of (1) by a, we know that

acn + adm = a. (2)

Input: relatively prime n, m ∈ Z; a ∈ Zn ; b ∈ Zm . Ensure that m ≥ n by swapping n and m if


Output: x such that x mod m = a and x mod n = b. necessary. Also, the value r returned by
extended-Euclid in Line 1 must be 1, because n
1 c, d, r := extended-Euclid(n, m) and m are assumed to be relatively prime.
2 return x := (adm + bcn) mod nm

Figure 7.12 An algorithm for the Chinese Remainder Theorem.

ISTUDY
7.3 Primality and Relative Primality 353

Recall that we defined x := (adm + bcn) mod nm. Let’s now show that x mod n = a:

x mod n = (adm + bcn) mod nm mod n definition of x


= (adm + bcn) mod n Exercise 7.18
= (adm + 0) mod n bcn mod n = 0 because n | bcn
= (adm + acn) mod n acn mod n = 0 because n | acn too!
= a mod n (2)
= a. a ∈ {0, 1, . . . , n − 1} by assumption, so a mod n = a

We can argue that x = adm + bcn ≡m bdm + bcn ≡m b completely analogously, where the last equivalence
follows by multiplying both sides of (1) by b instead. Thus we’ve established that there exists an x ∈ Znm
with x mod n = a and x mod m = b (because we computed such an x).
To prove that there is a unique such x, suppose that x mod n = x′ mod n and x mod m = x′ mod m
for two integers x, x′ ∈ Znm . We will prove that x = x′ . Because x mod n = x′ mod n, we know that
(x − x′ ) mod n = 0, or, in other words, that n | (x − x′ ). By similar reasoning, we know that m | (x − x′ ). By
Exercise 7.70 and the fact that n and m are relatively prime, then, we know that nm | (x − x′ ). And because
both x, x′ ∈ Znm , we’ve therefore shown that x = x′ .

Some examples
Here are two concrete examples of using the Chinese Remainder Theorem (and, specifically, of using the
algorithm from Figure 7.12):

Example 7.14: The Chinese Remainder Theorem, in action.


Let’s find the integer x ∈ Z30 that satisfies x mod 5 = 4 and x mod 6 = 5. Note that 5 and 6 are relatively
prime, and extended-Euclid(5, 6) returns ⟨−1, 1, 1⟩. (And indeed 5 · −1 + 6 · 1 = 1 = GCD(5, 6).) Thus
we compute x from the values of ⟨n, m, a, b, c, d⟩ = ⟨5, 6, 4, 5, −1, 1⟩ as
4 1 6 5 -1 5

x := (adm + bcn) mod 30 = (24 − 25) mod 30 = −1 mod 30 = 29.


And, indeed, 29 mod 5 = 4 and 29 mod 6 = 5.

Example 7.15: A second example of the Chinese Remainder Theorem.


We are told that x mod 7 = 1 and x mod 9 = 5. What is the value of x?
Solution. Running extended-Euclid(7, 9) yields ⟨4, −3, 1⟩. The algorithm in Figure 7.12 computes x :=
adm + bcn mod nm, where n = 7 and m = 9 are the given moduli; a = 1 and b = 5 are the given
remainders; and c = 4 and d = −3 are the computed multipliers from extended-Euclid. Thus
x := (1 · −3 · 9) + (5 · 4 · 7) mod 7 · 9 = 113 mod 63 = 50.
Indeed, 50 mod 7 = 1 and 50 mod 9 = 5. Thus x ≡63 50.

ISTUDY
354 Number Theory

Generalizing to k congruences
We’ve now shown the Chinese Remainder Theorem for two congruences, but Example 7.13 had three
constraints (x mod 2, x mod 3, and x mod 5). In fact, the generalization of the Chinese Remainder Theorem
to k congruences, for any k ≥ 1, is also true—again, as long as the moduli are pairwise relatively prime
(that is, any two of the moduli share no common divisors).
We can prove this generalization fairly directly, using induction and the two-congruence case. The basic
idea will be to repeatedly use Theorem 7.21 to combine a pair of congruences into a single congruence,
until there are no pairs left to combine. Here’s a concrete example:

Example 7.16: The Chinese Remainder Theorem, with three congruences.


Let’s describe the values of x that satisfy the congruences
x mod 2 = 1 x mod 3 = 2 x mod 5 = 4. (∗)
To do so, we first identify values of y that satisfy the first two congruences, ignoring the third. Note that 2
and 3 are relatively prime, and extended-Euclid(2, 3) = ⟨−1, 1, 1⟩. Thus, y mod 2 = 1 and y mod 3 = 2
if and only if
y mod (2 · 3) = (1 · 1 · 3 + 2 · −1 · 2) mod (2 · 3) = 5.
In other words, y ∈ Z6 satisfies the congruences y mod 2 = 1 and y mod 3 = 2 if and only if y satisfies
the single congruence y mod 6 = 5. Thus the values of x that satisfy (∗) are precisely the values of x that
satisfy
x mod 6 = 5 x mod 5 = 4. (†)
In Example 7.14, we showed that values of x that satisfy (†) are precisely those with x mod 30 = 29.

Now, using the idea from this example, we’ll prove the general version of the theorem:

Theorem 7.22: Chinese Remainder Theorem: general version.


Let n1 , n2 , . . . , nk be a collection of integers that are pairwise relatively prime, for some k ≥ 1. Let
Qk
N = i=1 ni . Then, for any ⟨a1 , . . . , ak ⟩ with each ai ∈ Zni , there exists one and only one integer x ∈ ZN
such that x mod ni = ai for all 1 ≤ i ≤ k.

Proof. We proceed by induction on k. For the base case (k = 1), there’s only one constraint, namely
x mod n1 = a1 . And x = a1 is the only element of ZN = Zn1 that satisfies this congruence.
For the inductive case (k ≥ 2), we assume the inductive hypothesis, namely that there exists a unique
x ∈ ZM satisfying any set of k − 1 congruences whose moduli have product M. To make use of this assump-
tion, we will convert the k given congruences into k − 1 equivalent congruences, as shown in Figure 7.13.
More formally, we must show that there exists one and only one value of x ∈ ZN satisfying Constraint
Set #1. Theorem 7.21 and Exercise 7.77 show that precisely the same values of x satisfy Constraint Set #1
and Constraint Set #2, so it suffices to prove that there exists one and only one value of x ∈ ZN satisfying
Constraint Set #2. The product of the moduli is the same for both the Constraint Sets: N = n1 · n2 · n3 · · · nk
for #1, and (n1 n2 ) · n3 · · · nk for #2. Furthermore, in Exercise 7.69 you’ll prove that n1 n2 is also relatively
prime to every other ni . Thus we can apply the inductive hypothesis to Constraint Set #2, which establishes
that there’s a unique x ∈ ZN that satisfies Constraint Set #2—and therefore a unique x ∈ ZN that satisfies
Constraint Set #1.

ISTUDY
7.3 Primality and Relative Primality 355

By Theorem 7.21, there exists a (unique) value y ∈ Zn n such that


1 2
∗ ∗
Constraint Set #1 y mod n1 = a1 and y mod n2 = a2 . Exercise 7.77 establishes that Constraint Set #2
x ∈ ZN satisfies x mod n1 = a1 and x mod n2 = a2 if and only if x

satisfies x mod n1 n2 = y .
x mod n1 = a1
x mod n1 n2 = y∗
x mod n2 = a2
x mod n3 = a3 x mod n3 = a3
.. ..
. .
x mod nk = ak x mod nk = ak

Figure 7.13 A sketch of the proof of Theorem 7.22. We convert the k congruences on the left to the k − 1 congruences
on the right. Exactly the same values of x satisfy the two sets of congruences.

(This proof of the general version of the Chinese Remainder Theorem is an inductive argument,
based on the two-congruence version. But we could also give a version of the proof that directly echoes
Theorem 7.21’s proof. See Exercise 7.108.)

Taking it further: One interesting implication of the Chinese Remainder Theorem is that we could choose to represent integers
efficiently in a very different way from binary representation, instead using something called modular representation. In modular
representation, we store an integer n as a sequence of values of n mod b, for a set of relatively prime values of b. To be concrete,
consider the set {11, 13, 15, 17, 19}, and let N = 11 · 13 · 15 · 17 · 19 = 692,835 be their product. The Chinese Remainder
Theorem tells us that we can uniquely represent any n ∈ ZN as

⟨n mod 11, n mod 13, n mod 15, n mod 17, n mod 19⟩.

For example, 217 = ⟨7, 6, 2, 2, 10⟩, and 17 = ⟨6, 4, 2, 0, 17⟩. Perhaps surprisingly, the representation of 217 + 17 is ⟨2, 10, 4, 2, 8⟩
and 17 · 217 = ⟨9, 11, 4, 0, 18⟩, which are really nothing more than the result of doing component-wise addition/multiplication
(modulo that component’s corresponding modulus):

mod 11 13 15 17 19 mod 11 13 15 17 19
⟨ 7, 6, 2, 2, 10 ⟩ ⟨ 7, 6, 2, 2, 10 ⟩
+⟨ 6, 4, 2, 0, 17 ⟩ and ·⟨ 6, 4, 2, 0, 17 ⟩
= ⟨ 13, 10, 4, 2, 27 ⟩ = ⟨ 42, 24, 4, 0, 170 ⟩
≡ ⟨ 2, 10, 4, 2, 8 ⟩ ≡ ⟨ 9, 11, 4, 0, 18 ⟩.

This representation has some advantages over the normal binary representation: the numbers in each component stay small, and
multiplying k pairs of 5-bit numbers is significantly faster than multiplying one pair of 5k-bit numbers. (Also, the components can
be calculated in parallel!) But there are some other operations that are slowed down by this representation. (See Exercises 7.147–
7.148.)

ISTUDY
356 Number Theory

Computer Science Connections

Secret Sharing
Although encryption/decryption is probably the most natural cryptographic problem, there are many other important
problems with related but different requirements—some kind of desired communication that adheres to some type of
security specification. Here we’ll look at a different cryptographic problem—using a solution due to Adi Shamir (the
S of the RSA cryptosystem, which we’ll see in Section 7.5) [114]. Imagine a shared resource, collectively owned by
some group, that they wish to keep secure—for example, the launch codes for the United States’ nuclear weapons. In
the post-apocalyptic world in which you’re imagining these codes being used, where many top officials are probably
dead, we’ll need to ensure that any, say, k = 3 of the cabinet members (out of the n = 15 cabinet positions) can
launch the weapons. But you’d also like to guarantee that no single rogue secretary can destroy the world!
In secret sharing, we seek a
(a) (b)
scheme by which we distribute f0 (x) = 0 + 1x + 0x2
2
“shares” of the secret s ∈ S to f 2 ( x ) = 2 + 3x + x
a group of n people such that two x f2 (x) f2 (x) mod 5
properties hold: 30 0 2 2
(1) If any k of these n peo- 1 6 1 f1 (x) = 1 + 2x + 3x2
ple cooperate, then—by combining 2 12 2
their k shares of the secret—they 3 20 0
can compute the secret s (prefer- 4 30 0
ably efficiently). 20 f2 (x) = 2 + 3x + 1x2
(2) If any k′ < k of these n peo-
ple cooperate, then by combining
their k′ shares they learn nothing
about the secret s. (Informally, to f3 (x) = 3 + 4x + 4x2
“learn nothing” about the secret 10
means that no k′ shares of the secret
allow one to infer that s comes from f2 (x) mod 5
f4 (x) = 4 + 0x + 2x2
any particular S′ ⊂ S. Note that just
“splitting up the bits” of the secret
violates condition 2.)
The basic idea will be to define a 1 2 3 4 1 2 3 4
polynomial f(x), and distribute the 2
Figure 7.14 (a) A function f(x) = a + bx + cx as a tool for secret sharing: the
value of f(i) as the the ith “share”
secret is f(0) = a; the four secret shares are f(i) mod 5 for i ∈ {1, 2, 3, 4}.
of the secret; the secret itself will
be f(0). Why will this be useful? (b) Even knowing f(1) ≡5 1 and f(2) ≡5 2, we don’t know f(0) mod 5; there
Imagine that f(x) = a + bx. (The are polynomials consistent with f(0) ≡5 m for every m ∈ {0, 1, 2, 3, 4}. Here
secret is thus f(0) = a + b · 0 = a.) we see five different polynomials, where fi (0) ≡5 i, fi (1) ≡5 1, and fi (2) ≡5 2.
Knowing that f(1) = 17 tells you
that a + b = 17, but it doesn’t tell you anything about a itself: for every possible value of the secret, there’s a
value of b that makes a + b = 17. But knowing f(1) = 17 and f(2) = 42 lets you solve for b = 25, a = −8. If
f(x) = a + bx + cx2 , then knowing f(x1 ) and f(x2 ) gives you two equations and three unknowns—but you can solve
for a if you know the value of f(x) for three different values of x. In general, knowing k values of a polynomial f of
degree k lets you compute f(0), but any k − 1 values of f are consistent with any value of f(0). And this result remains
true if, instead of using the value f(x) as the share of the secret, we instead use f(x) mod p, for some prime p. (See
p. 357, and Figure 7.14.) Thus, concretely, to distribute shares of a secret m ∈ {0, 1, 2, 3, 4}, here’s what we’d do:
• First, choose a1 , . . . , ak−1 uniformly and independently at random from the set {0, 1, 2, 3, 4}.
∑k−1
• Second, define the function f(x) = m + i=1 ai xi .
• Then, to the ith person who’s sharing the secret, send the pair ⟨i, f(i) mod 5⟩.
Then f was built from k unknown coefficients: m, and a1 , a2 , . . . , ak−1 . For example, if k = 3, no two people can
reconstruct the secret—but three people can. Only one quadratic function passes through three given points.

ISTUDY
7.3 Primality and Relative Primality 357

Computer Science Connections

Error Correction with Reed–Solomon Codes


In Section 4.2, we discussed error-correcting codes: we encode a message m as a codeword c(m), so that m is
(efficiently) recoverable from c(m), or even from a mildly corrupted codeword c′ ≈ c(m). (Note the difference in
motivation with cryptography: in error-correcting codes, we want a codeword that makes computing the original
message very easy; in cryptography, we want a ciphertext that makes computing the original message very hard.)
The key property that we seek is that if m1 ̸= m2 , then c(m1 ) and c(m2 ) are “very different,” so that decoding c′
simply corresponds to finding the m that minimizes the difference between c′ and c(m).
We’ve discussed Reed–Solomon
codes, one of the classic schemes pm1 (x) = x + 3x2 + 2x3 pm1 (x) mod 31
for error-correcting codes. Under pm2 (x) = 3x + 3x3 pm2 (x) mod 31
Reed–Solomon codes, to encode a
message m ∈ Zk , we ∑ define the
k i
polynomial pm (x) = i=1 mi x ,
and then encode m as ⟨pm (1),
pm (2), . . . , pm (n)⟩. (We choose
n much bigger than k, to achieve
the desired error-correction proper-
ties.) For example, for the messages
m1 = ⟨1, 3, 2⟩ and m2 = ⟨3, 0, 3⟩, Figure 7.15 Two polynomials derived from Reed–Solomon codes, computed
we have pm1 (x) = x + 3x2 + 2x3 using both normal arithmetic and in Z31 . (The polynomials have the same values
and pm2 (x) = 3x + 3x3 . For n = 6, for x ∈ {0, 1, 2}, and differ for all other values of x.)
we have the codewords (for m1 and
m2 , respectively) ⟨6, 30, 84, 180, 330, 546⟩ and ⟨6, 30, 90, 204, 390, 666⟩. (See Figure 7.15.)
The key point is that two distinct polynomials of degree k agree on at most k inputs, which means that the codewords
for m1 and m2 will be very different. (Here pm1 (x) and pm2 (x) agree on x ∈ {1, 2}, but not on x ∈ {3, 4, 5, 6}.) The
theorem upon which this difference rests is important enough to be called the Fundamental Theorem of Algebra
(Theorem 2.58).
While this fact about Reed– n<k+1 n=k+1 n>k+1
Solomon codes is nice, it’s already
evident that the numbers in the
codewords get really big—546 and
666 are very big relative to the inte-
gers in the original messages! In
real Reed–Solomon codes, there’s
another trick that’s used: every
value is stored modulo a prime. (We
encode the message m by comput-
ing pm (i) mod q for a prime q, for Figure 7.16 The combined message of Reed–Solomon codes and Shamir
many values of i, instead of just secret-sharing (p. 356). Imagine a degree-k polynomial p that is unknown to you
computing pm (i).) We now encode
(here, k = 2). You’re given the value of p(x) for n distinct values of x.
a message m ∈ Zkq with a code-
• If n < k + 1, you know nothing about the constant term of p. (Secrets kept!)
word in Znq . And it turns out that
• If n = k + 1, you can compute every coefficient of p. (Secrets shared!)
everything important about polyno-
mials remains true if we take all • If n > k + 1, you can find p even if some points are wrong. (Errors corrected!)
values modulo a prime q: two dis- (The exact same relationships hold if you’re given p(x) mod q instead of p(x).)
tinct polynomials of degree k agree
mod q on at most k inputs, too. (Again, see Figure 7.15.) In fact, the Shamir secret sharing scheme (p. 356) and error
correction using Reed–Solomon codes can be viewed as part of the same continuum: evaluating a polynomial of
degree k on fewer that k + 1 points keeps a secret, evaluating it on k + 1 points reveals a secret, and evaluating it on
more than k + 1 points allows for errors to be detected and corrected. (See Figure 7.16.)

ISTUDY
358 Number Theory

EXERCISES
7.38 The Sieve of Eratosthenes returns a list of all prime numbers up to a given integer n by creating a list of candidate primes
⟨2, 3, . . . , n⟩, and repeatedly marking the first unmarked number p as prime and striking out all entries in the list that are multiples
of p. (See Figure 7.17.) Write pseudocode to describe the Sieve of Eratosthenes.
7.39 Run the Sieve of Eratosthenes algorithm, by hand, to find all primes less than 100.
7.40 (programming required) Implement the Sieve of Eratosthenes in a programming language of your choice. Use your program to
compute all primes up to 100,000. How many are there?
7.41 (programming required) Earlier, we suggested another algorithm to compute all primes up to n: for each i = 2, 3, . . . , n, test whether

i is divisible by any integer between 2 and i. Implement this algorithm too, and compare their execution times for n = 100,000.
What happens for n = 500,000?
7.42 Assume that each number k is crossed off by the Sieve of Eratosthenes every time a divisor of it is found. (For example, 6 is crossed
off when 2 is the prime in question, and when 3 is the prime in question.) Prove that the total number of crossings-out by sieve(n)
is ≤ Hn · n, where Hn is the nth harmonic number. (See Definition 5.8.)

7.43 Use the Prime Number Theorem to estimate the number of primes between 2127 + 1 and 2128 .
7.44 Use the Prime Number Theorem to estimate the 2128 th-largest prime.
7.45 Use the Prime Number Theorem to argue that, roughly, the probability that a randomly chosen number close to n is prime is about
1/ ln n. (Hint: what does primes(n) − primes(n − 1) represent?)
7.46 Using the same technique as in Example 7.8, estimate the number of 6-digit primes. Then, using the Sieve (see Exercise 7.38) or
some other custom-built program, determine how far off the estimate was.

Let p be an arbitrary prime number and let a be an arbitrary nonnegative integer. Prove the following facts.
7.47 If p ̸ | a, then GCD(p, a) = 1.
7.48 For any positive integer k, we have p | ak if and only if p | a. (Hint: use induction and Lemma 7.19.)
7.49 For any integers n, m ∈ {1, . . . , p − 1}, we have that p ̸ | nm.
7.50 For any integer m and any prime number q distinct from p (that is, p = ̸ q), we have m ≡p a and m ≡q a if and only if m ≡pq a.
(Hint: think first about the case a = 0; then generalize.)
7.51 If 0 ≤ a < p, then a2 ≡p 1 if and only if a ∈ {1, p − 1}. (You may use the theorem from p. 357: if f(x) is a polynomial of degree k,
and q is a prime, then either f(a) mod q = 0 for every a ∈ Zq , or the equation f(x) = 0 has at most k solutions for x ∈ Zq .)

7.52 Using the brute-force algorithm (test all candidate divisors) and paper and pencil only, determine whether 54321 and 12345 are
relatively prime.
7.53 Do the same for 209 and 323.
7.54 Do the same for 101 and 1100.

7.55 Using the extended Euclidean algorithm (executed by hand), compute GCD(n, m) and two integers x and y such that
xn + ym = GCD(n, m) for n = 60 and m = 93.
7.56 Repeat for n = 24 and m = 28.
7.57 Repeat for n = 74 and m = 13.

7.58 Prove the following extension to Lemma 7.17: there are infinitely many pairs of integers x, y such that xn + ym = GCD(n, m), for
any nonnegative integers n and m.
7.59 Prove the extension of Lemma 7.17 to k ≥ 2 integers: if GCD(a1 , . . . , ak ) = d, then there exist integers x1 , . . . , xk such that
Pk
i=1 ai xi = d. (Define GCD(x1 , x2 , . . . , xk ) as GCD(x1 , GCD(x2 , . . . , xk )) for k ≥ 3.)

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
..
.

Figure 7.17 A few iterations of the Sieve of Eratosthenes. Primes are highlighted as they’re discovered; numbers are
written in light gray as they’re crossed off.

ISTUDY
Exercises 359

7.60 Prove Theorem 7.18 (the correctness of the extended Euclidean algorithm) by induction on n: for arbitrary positive integers n and
m with n ≤ m, extended-Euclid(n, m) returns three integers x, y, r such that r = GCD(n, m) = xn + ym.
7.61 (programming required) Write a program that implements the extended Euclidean algorithm. (Recommended: if you did
Exercises 7.11–7.16, compute m mod n and ⌊ mn ⌋ with a single call to mod-and-div-faster(m, n).)

I have a friend named Nikki, who’s from New Zealand. Nikki and I went out to eat together, and I paid for both dinners. She was
going to pay me back, in cash—but she had only New Zealand dollars [NZD]. (I was happy to take NZDs.) Nikki had a giant supply
of 5NZD bills; I had a giant supply of 5 U.S. dollar [USD] bills. At the time, the exchange rate was 5NZD = 3USD (or close
enough to 5 : 3 for two friends to call it good).
7.62 Prove that Nikki can pay me exactly 4USD in value, through only the exchange of 5NZD and 5USD bills.
7.63 In Exercise 7.62, was there something special about the number 4? Identify for which nonnegative integers x Nikki can pay me
back exactly x USD in value, through only the exchange of 5NZD and 5USD bills, and prove your answer.
7.64 In Exercises 7.62–7.63, was there something special about the number 3? Suppose that, due to geopolitical turmoil and a skyrock-
eting of the price of wool, the 5NZD bill is now worth b USDs, for some b ≡5 3. I still have many 5USD bills, and Nikki still has
the equivalent of many b USD bills. What amounts can Nikki now pay me? Prove your answer.
7.65 In an unexpected twist, I run out of U.S. dollars and Nikki runs out of New Zealand dollars. But I discover that I have a giant supply
of identical Israeli Shekel notes, each of which is worth k USD. And Nikki discovers that she has a giant supply of identical Thai
Baht notes, each of which is worth ℓ USD. (Assume k and ℓ are integers.) What amounts can she pay me now? Again, prove your
answer.

7.66 Prove that any two consecutive integers (n and n + 1) are always relatively prime.
7.67 Prove that any two consecutive Fibonacci numbers are always relatively prime.
7.68 Prove that two integers a and b are relatively prime if and only if there is no prime number p such that p | a and p | b. (Notice that
this claim differs from the definition of relative primality, which required that there be no integer n ≥ 2 such that n | a and n | b.)

7.69 Let a and b be relatively prime. Let c be relatively prime to both a and b. Prove that c and ab are also relatively prime.
7.70 Let a and b be relatively prime. Prove that, for any integer n, we have that both a | n and b | n if and only if ab | n.
7.71 Let a and b be relatively prime. Prove that, for every integer m, there exist integers x and y such that ax + by = m.

7.72 Describe the integers x ∈ Z≥0 that satisfy the constraints x mod 13 = 6 and x mod 19 = 2. More precisely, describe these integers

as a set a + bk : k ∈ Z≥0 , where a is smallest x satisfying the constraints, a + b is the next smallest, a + 2b is the next smallest,
etc. (You have to figure out the values of a and b.)
7.73 Repeat for the constraints x mod 21 = 3 and x mod 11 = 2.
7.74 Repeat for the constraints x mod 6 = 3 and x mod 7 = 3.
7.75 Repeat for the constraints x mod 5 = 4 and x mod 6 = 5 and x mod 7 = 2.
7.76 Repeat for the constraints x mod 5 = 4 and x mod 6 = 5 and x mod 7 = 3.

7.77 Let n and m be relatively prime, and let a ∈ Zn and b ∈ Zm . Define y∗ to be the unique value in Znm such that y∗ mod n = a
and y∗ mod m = b, whose existence is guaranteed by Theorem 7.21. Prove that an integer x ∈ Znm satisfies x mod n = a and
x mod m = b if and only if x satisfies x mod nm = y∗ .

Show that relative primality was mandatory for the Chinese Remainder Theorem. Considering two integers n and m that are not
necessarily relatively prime:
7.78 Prove that, for some a ∈ Zn and b ∈ Zm , it may be the case that no x ∈ Znm satisfies x mod n = a and x mod m = b.
7.79 Prove that, for some a ∈ Zn and b ∈ Zm , there may be more than one x ∈ Znm satisfies x mod n = a and x mod m = b.

ISTUDY
360 Number Theory

7.4 Multiplicative Inverses


Tous pour un, un pour tous.
All for one, one for all.
Alexandre Dumas (1802–1870)
Les Trois Mousquetaires [The Three Musketeers] (1844)

For any integer n ≥ 2, let Zn denote the set {0, 1, . . . , n − 1}. In this section, we’ll discuss arithmetic
over Zn —that is, arithmetic where we think of all expressions by considering their value modulo n. For
example, when n = 9, the expressions 4 + 6 and 8 · 7 are equivalent to 1 and 2, respectively, because
10 mod 9 = 1 and 56 mod 9 = 2. When n = 10, the expressions 4 + 6 and 8 · 7 are equivalent to 0 and 6,
respectively.
We have already encountered addition and multiplication in the world of modular arithmetic (for exam-
ple, in Theorem 7.4). But we haven’t yet defined subtraction or division. (Theorem 7.4 also introduced
exponentiation over Zn , and it turns out that, along with division, exponentiation in modular arithmetic
will form the foundation of the RSA cryptographic system; see Section 7.5.) Subtraction turns out to be
fairly straightforward (see Exercise 7.81), but division will be a bit trickier than +, ·, and −. In this sec-
tion, we’ll introduce what division over Zn even means, and then discuss algorithms to perform modular
division.

7.4.1 The Basic Definitions


Before we introduce any of the technical definitions, let’s start with a tiny bit of intuition about why there’s
something potentially interesting going on with division in Zn . For concreteness, here’s a small example
in Z9 :

Example 7.17: Halving some numbers in Z9 .


In Z9 = {0, 1, 2, 3, 4, 5, 6, 7, 8}, where every expression’s value is understood mod 9, what element of
Z9 is half of 6? Half of 8? Half of 5?
Solution. What number is half of 6? Well, easy: it’s obviously 3. (Why? Because 6 is double 3, and
therefore 3 is half of 6—or, in other words, 3 is half of 6 because 3 · 2 is 6.) And what number is half of
8? Easy again: it’s 4 (because 4 · 2 is 8).
Okay, what number is half of 5? The first temptation is to say that it’s 2.5 (or 52 , if you’re more of a fan of
fractions)—but that doesn’t make sense as an answer: after all, which element of {0, 1, 2, 3, 4, 5, 6, 7, 8}
is 2.5?!? So the next temptation is to say that there is no number that’s half of 5. (After all, in normal
nonmodular arithmetic, there is no integer that’s half of 5.) But that’s not right either: there is an answer in
Z9 , even if it doesn’t quite match our intuition. The number that’s half of 5 is in fact 7(!). Why? Because
7 · 2 is 5. (Remember that we’re in Z9 , and 14 mod 9 = 5.) So, in Z9 , the number 7 is half of 5.
In fact, we can find a number in Z9 that’s half of each element of Z9 . For each b ∈ Z9 , here is the value
of a ∈ Z9 such that 2a = b:
b 0 1 2 3 4 5 6 7 8
a 0 5 1 6 2 7 3 8 4.
(You can check that 0 · 2 ≡9 0, and 5 · 2 ≡9 1, and so forth.)

ISTUDY
7.4 Multiplicative Inverses 361

Example 7.17 illustrates the basic idea of division in Zn : we’ll define ab as the number k such that k · b
is equivalent to a in Zn . To make this idea formal, we’ll need a few definitions about modular arithmetic.
But, first, we’ll go back to “normal” arithmetic, for the real numbers, and introduce the two key concepts:
identity and inverse.

Problem-solving tip: When you encounter a new definition, it’s often helpful to try it out in a setting that you already understand
well. For example, it’s easier understand Manhattan distance in R2 (see Example 2.40) before trying to understand it for general
Rn . In this case, you’ve grasped division in R since, what, second grade—so, before trying to make sense of the definitions for
Zn , try to consider the analogy of each definition for R.

Multiplicative inverses in R
The number 1 is called the multiplicative identity, because it has the property that

x · 1 = 1 · x = x, for any x ∈ R.

(We’ve encountered identities a number of times already. Definition 2.43 introduced the identity matrix I,
where MI = IM = M for any matrix M. And Exercises 3.13–3.16 explored identities of logical connectives;
for example, the identity of ∨ is False, because p ∨ False ≡ False ∨ p ≡ p for any proposition p.)
The multiplicative inverse of a number x is the number by which we have to multiply x to get 1 (that is,
to get the multiplicative identity) as the result. In other words, the multiplicative inverse of x ∈ R is the real
number x−1 such that x·x−1 = 1. (We generally denote the multiplicative inverse of x as x−1 , though it may
be easier to think about the multiplicative inverse of x as 1x , because x · 1x = 1. Actually the “−1 ” notation is
in general ambiguous between denoting inverse and denoting exponentiation with a negative exponent—
though these concepts match up perfectly for the real numbers. Exercise 7.100 addresses negative exponents
in modular arithmetic.) For example, the multiplicative inverse of 8 is 18 = 0.125, because 8 · 0.125 = 1.
When we think of dividing y ∈ R by x ∈ R, we can instead think of this operation as multiplying y by
x−1 . For example, we have 7/8 = 7 · 8−1 = 7 · 0.125 = 0.875.
Not every real number has a multiplicative inverse: specifically, there is no number that yields 1 when
it’s multiplied by 0, so 0−1 doesn’t exist. (And we can’t divide y by 0, because 0−1 doesn’t exist.) But for
any x ̸= 0, the multiplicative inverse of x does exist, and it’s given by x−1 = 1x .

Multiplicative inverses in Zn
Now let’s turn to the analogous definitions in the world of modular arithmetic, in Zn . Notice that 1 is still
the multiplicative identity, for any modulus n: for any x ∈ Zn , it is the case that x mod n = 1 · x mod n =
x · 1 mod n. The definition of the multiplicative inverse in Zn is identical to the definition in R:

Definition 7.23: Multiplicative inverse.


Let n ≥ 2 be any integer, and let a ∈ Zn be arbitrary. The multiplicative inverse of a in Zn is the number
a−1 ∈ Zn such that a · a−1 ≡n 1. If there is no element x ∈ Zn such that ax ≡n 1, then a−1 is undefined.

Writing tip: Let a ∈ Zn . The notation a−1 doesn’t explicitly indicate the modulus n anywhere, and the value of n matters! If
there’s any ambiguity about the value of n, then be sure to specify it clearly in your words surrounding the notation.

(Note that Definition 7.23 describes the multiplicative inverse as “the” a−1 that has the desired property.
In Exercise 7.93, you’ll show that there can’t be two distinct values b, c ∈ Zn where ab ≡n ac ≡n 1.) Here
are a few examples of multiplicative inverses, and of a case where there is no multiplicative inverse:

ISTUDY
362 Number Theory

Example 7.18: Some multiplicative inverses.


The multiplicative inverse of 2 in Z9 is 2−1 = 5, because 2 · 5 = 10 ≡9 1, and the multiplicative inverse
of 1 in Z9 is 1−1 = 1, because 1 · 1 ≡9 1.
The multiplicative inverse of 7 in Z11 is 8 because 7 · 8 = 56 ≡11 1, and the multiplicative inverse of
7 in Z13 is 2 because 7 · 2 = 14 ≡13 1.

Example 7.19: A nonexistent multiplicative inverse.


The number 3 has no multiplicative inverse in Z9 , as the following table shows:
3 · 0 = 0 ≡9 0 3 · 3 = 9 ≡9 0 3 · 6 = 18 ≡9 0
3 · 1 = 3 ≡9 3 3 · 4 = 12 ≡9 3 3 · 7 = 21 ≡9 3
3 · 2 = 6 ≡9 6 3 · 5 = 15 ≡9 6 3 · 8 = 24 ≡9 6.
None of these nine entries is equivalent to 1 modulo 9, so there is no 3−1 in Z9 .

Example 7.20: Multiplicative inverses in Z7 .


Find the values of 0−1 , 1−1 , 2−1 , 3−1 , 4−1 , 5−1 , and 6−1 in Z7 .
Solution. The simplest way (though not necessarily the fastest way!) to solve this problem is by building
a multiplication table for Z7 , as shown in Figure 7.18. (The entry in row a and column b of the table is
the value ab mod 7—for example, 4 · 5 = 20 = 2 · 7 + 6, so the entry in row 4, column 5 is the number
6.) For each row a, the value a−1 we seek is the column that has a 1 in it, if there is such a column in
that row. (And there is a 1 in every row except a = 0.) Thus in Z7 we have 1−1 = 1 and 2−1 = 4 and
3−1 = 5 and 4−1 = 2 and 5−1 = 3 and 6−1 = 6—and 0−1 is undefined.

Taking it further: The field of mathematics called abstract algebra focuses on giving and analyzing very general definitions
of structures that satisfy certain properties—allowing apparently disparate objects (like Boolean logic and Rubik’s cubes) to be
studied at the same time. For example, a group is a pair ⟨G, ·⟩, where G is a set of objects and · is a binary operator on G, where
certain properties are satisfied:

• Closure: for any a, b ∈ G, we have a · b ∈ G.


• Associativity: for any a, b, c ∈ G, we have a · (b · c) = (a · b) · c.
• Identity: there is an identity element e ∈ G with the property that a · e = e · a = a for every a ∈ G.
• Inverse: for every a ∈ G, there exists b ∈ G such that a · b = b · a = e (where e is the identity element).

For example, ⟨Z, +⟩ is a group. As we’ll see, so too is ⟨Zp − {0} , ·⟩, where · denotes multiplication and p is any prime integer.
Despite the very abstract nature of these definitions—and other more general or more specific algebraic structures, like semi-
groups, rings, and fields—they are a surprisingly useful way of understanding properties of Zp . See any good textbook on abstract
algebra for more detail.

7.4.2 When Multiplicative Inverses Exist (and How to Find Them)


Examples 7.18–7.20 might inspire you to ask a question that will turn out to be both useful and reason-
ably simple to answer: under what circumstances does a particular number a ∈ Zn have a multiplicative
inverse? As we saw with arithmetic over R, there’s never a multiplicative inverse for 0 in any Zn (because,
for any x, we have x · 0 = 0 ̸≡n 1)—but what happens for nonzero a?

ISTUDY
7.4 Multiplicative Inverses 363

0 1 2 3 4 5 6
0 0 0 0 0 0 0 0
1 0 1 2 3 4 5 6
2 0 2 4 6 1 3 5
3 0 3 6 2 5 1 4
4 0 4 1 5 2 6 3
5 0 5 3 1 6 4 2
6 0 6 5 4 3 2 1

Figure 7.18 The multiplication table for Z7 .

To take one particular case, we just found that 2−1 = 5 in Z9 but that 3−1 does not exist in Z9 . It’s worth
reflecting a bit on “why” 3−1 failed to exist in Z9 . There are a lot of ways to think about it, but here’s one
convenient way to describe what went wrong: any multiple of 3 is (obviously!) divisible by 3, and numbers
divisible by 3 are never one more than multiples of 9. In other words, the only possible values of 3x mod 9
are {0, 3, 6}—a set that fails to include 1. (Recall from Definition 7.23 that, for 3−1 to exist in Z9 , we’d
have to have been able to find an x such that 3x ≡9 1.) Similarly, 6−1 doesn’t exist in Z9 : again, the only
possible values of 6x mod 9 are {0, 3, 6}, which once again does not include 1.
These observations should be reminiscent of the concepts that we discussed in Section 7.3: for a number
a ∈ Zn , we seem to be unable to find a multiplicative inverse a−1 in Zn whenever a and n share a common
divisor d > 1. In other words, when a and n are not relatively prime, then a−1 fails to exist in Zn . (That’s
because any multiple xa of a will also be divisible by d, and so xa mod n will also be divisible by d, and
therefore xa mod n will not equal 1.) In fact, not being relatively prime to n is the only way to fail to
have a multiplicative inverse in Zn , as we’ll prove. (Note that 0 ∈ Zn is not relatively prime to n, because
GCD(n, 0) ̸= 1.)

Theorem 7.24: Existence of multiplicative inverses.


Let n ≥ 2 and a ∈ Zn . Then a−1 exists in Zn if and only if n and a are relatively prime.

Proof. By definition, a multiplicative inverse of a exists in Zn precisely when there exists an integer x
such that ax ≡n 1. (The definition actually requires x ∈ Zn , not just x ∈ Z, but see Exercise 7.99.) But
ax ≡n 1 means that ax is one more than a multiple of n—that is, there exists some integer y such that
ax + yn = 1. In other words,
a−1 exists in Zn if and only if there exist integers x, y such that ax + yn = 1. (∗)
Observe that (∗) echoes the form of Lemma 7.17 (and thus also echoes the output of the extended Euclidean
algorithm), and we can use this fact to prove the theorem. We’ll prove the two directions of the theorem
separately:
If a−1 exists in Zn , then a and n are relatively prime. We’ll prove the contrapositive. Suppose that a and n
are not relatively prime—that is, suppose that GCD(a, n) = d for some d > 1. We will show that a−1 does
not exist in Zn . Because d | a and d | n, there exist integers c and k such that a = cd and n = kd. But then,
for any integers x and y, we have that
ax + yn = cdx + ykd = d(cx + yk)
and thus d | (ax + yn). Thus there are no integers x, y for which ax + yn = 1 and therefore, by (∗), a−1 does
not exist in Zn .

ISTUDY
364 Number Theory

If a and n are relatively prime, then a−1 exists in Zn . Suppose that a and n are relatively prime. We’ll prove
that a−1 exists. Because a and n are relatively prime, by definition we have that GCD(a, n) = 1. Thus, by
the correctness of the extended Euclidean algorithm (Theorem 7.18), the output of extended-Euclid(a, n)
is ⟨x, y, 1⟩ for integers x, y such that xa + yn = GCD(a, n) = 1. The fact that extended-Euclid(a, n)
outputs integers x and y such xa + yn = 1 means that such an x and y must exist—and so, by (∗), a−1 exists
in Zn .

Note that this theorem is consistent with the examples that we saw previously: we found 1−1 and 2−1
but not 3−1 in Z9 (Examples 7.18 and 7.19; 1 and 2 are relatively prime to 9, but 3 is not), and we found
multiplicative inverses for all nonzero elements of Z7 (Example 7.20; all of {1, 2, . . . , 6} are relatively
prime to 7).

Two implications of Theorem 7.24


There are two useful implications of this result. First, when the modulus is prime, multiplicative inverses
exist for all nonzero elements of Zn , because every nonzero a ∈ Zn and n are relatively prime for any prime
number n.

Corollary 7.25. If p is prime, then every nonzero a ∈ Zp has a multiplicative inverse in Zp .

(We saw an example of this corollary in Example 7.20, where we identified the multiplicative inverses of
all nonzero elements in Z7 .)
The second useful implication of Theorem 7.24 is that, whenever the multiplicative inverse of a exists in
Zn , we can efficiently compute a−1 in Zn using the extended Euclidean algorithm—specifically, by running
the algorithm in Figure 7.19a. (This problem also nicely illustrates a case in which proving a structural fact
vastly improves the efficiency of a calculation—the algorithm in Figure 7.19a is way faster than building
the entire multiplication table, as we did in Example 7.20.)

Corollary 7.26. For any n ≥ 2 and a ∈ Zn , inverse(a, n) returns the value of a−1 in Zn .

Proof. We just proved that a−1 exists if and only if extended-Euclid(a, n) returns ⟨x, y, 1⟩. In this case,
we have xa + yn = 1 and therefore xa ≡n 1. Defining a−1 = x mod n ensures that a · (x mod n) ≡n 1, as
required. (Again, see Exercise 7.99.)

Here are two examples, one replicating the calculation of 5−1 in Z7 from Example 7.20:

Example 7.21: 5−1 in Z7 , again, and 7−1 in Z9 .


To compute 5−1 , we run the extended Euclidean algorithm on 5 and 7, as in Figure 7.19b, which returns
⟨3, −2, 1⟩, implying that 3 · 5 + −2 · 7 = 1 = GCD(5, 7). Therefore inverse(5, 7) returns 3 mod 7 = 3.
And, indeed, 3 · 5 ≡7 1.
In Example 7.15, we saw that extended-Euclid(7, 9) = ⟨4, −3, 1⟩. Thus 7 and 9 are relatively prime,
and 7−1 in Z9 is 4 mod 9 = 4. And indeed 7 · 4 = 28 ≡9 1.

ISTUDY
7.4 Multiplicative Inverses 365

inverse(a, n): To compute 5−1 ∈ Z7 , we call inverse(5, 7):


Input: a ∈ Zn and n ≥ 2 extended-Euclid(5, 7) =2
Output: a−1 in Zn , if it exists extended-Euclid( 7 mod 5 , 5)
1 x, y, d := extended-Euclid(a, n) extended-Euclid( 5 mod 2 , 2)
=1
2 if d = 1 then = 1, 0, 1
3 return x mod n // xa + yn = 1, so xa ≡n 1. = −2, 1, 1
4 else = 3 , −2, 1.
5 return “no inverse for a exists in Zn .” Therefore inverse(5, 7) returns 3 mod 7, which is 3.
(a) The pseudocode. (b) An example: computing that 5−1 ∈ Z7 is 3.

Figure 7.19 An algorithm for computing multiplicative inverses, and an example.

7.4.3 Fermat’s Little Theorem


We’ll now make use of the results that we’ve developed so far—specifically Corollary 7.25—to prove a
surprising and very useful theorem, called Fermat’s Little Theorem, which states that ap−1 is equivalent to
1 mod p, for any prime number p and any a ̸= 0. (And we’ll see why this result is useful for cryptography
in Section 7.5.)

Taking it further: Fermat’s Little Theorem is named after Pierre de Fermat (1601–1665), a French mathematician. Fermat’s
Little Theorem is the second-most famous theorem named after him; his more famous theorem is called Fermat’s Last Theorem,
which states the following:

For any integer k ≥ 3, there are no positive integers x, y, z satisfying xk + yk = zk .

There are integer solutions to the equation xk + yk = zk when k = 2—the so-called Pythagorean triples, like ⟨3, 4, 5⟩ (where
32 + 42 = 9 + 16 = 25 = 52 ) and ⟨7, 24, 25⟩ (where 72 + 242 = 49 + 576 = 625 = 252 ). But Fermat’s Last Theorem states
that there are no integer solutions when the exponent is larger than 2.
The history of Fermat’s Last Theorem is convoluted and about as fascinating as the history of any mathematical statement can be.
In the seventeenth century, Fermat conjectured his theorem, and scrawled—in the margin of one of his books on mathematics—the
words “I have discovered a truly marvelous proof, which this margin is too narrow to contain . . . .” The conjecture, and Fermat’s
assertion, were found after Fermat’s death—but the proof that Fermat claimed to have discovered was never found. And it seems
almost certain that he did not have a correct proof of this claim. Some 350 years later, in 1995, the mathematician Andrew Wiles
published a proof of Fermat’s Last Theorem, building on work by a number of other twentieth-century mathematicians.
The history of the Fermat’s Last Theorem—including the history of Fermat’s conjecture and the centuries-long quest for a
proof—has been the subject of a number of books written for a nonspecialist audience; see [120] for a compelling account. This
story may be as dramatic as one about the history of proofs ever gets.

Before we can prove Fermat’s Little Theorem itself, we’ll need a preliminary result. We will show
that, for any prime p and any nonzero a ∈ Zp , the first p − 1 nonzero multiples of a—that is, the values
{a, 2a, 3a, . . . , (p − 1)a}—are precisely the p − 1 nonzero elements of Zp . Or, to state this claim in a
slightly different way, we will prove that the function f : Zp → Zp defined by f(k) = ak mod p is both
one-to-one and onto (and also satisfies f(0) = 0). Here is a formal statement of the result:

Lemma 7.27: {1, 2, . . . , p − 1} and {1a, 2a, . . . , (p − 1)a} are equivalent mod p.
For prime p and any a ∈ Zp where a ̸= 0, we have
{1 · a mod p, 2 · a mod p, . . . , (p − 1) · a mod p} = {1, 2, . . . , p − 1} .

Before we dive into a proof, let’s check an example:

ISTUDY
366 Number Theory

Example 7.22: {ai mod 11} vs. {i mod 11}.


Consider the prime p = 11 and two values of a, namely a = 2 and a = 5. Then we have
i 1 2 3 4 5 6 7 8 9 10
2i mod 11 2 4 6 8 10 1 3 5 7 9
5i mod 11 5 10 4 9 3 8 2 7 1 6
Note that every number from {1, 2, . . . , p} appears (once and only once) in the {2i mod 11} and
{5i mod 11} rows of this table—exactly as desired. That is,
{1, 2, 3, . . . , 10} ≡11 {2, 4, 6, . . . , 20} ≡11 {5, 10, 15, . . . , 50} .

We can also observe examples of this result in the multiplication table for Z7 . (See Figure 7.18.) In that
table, every (nonzero) row {a, 2a, 3a, 4a, 5a, 6a} contains all six numbers {1, 2, 3, 4, 5, 6}, in some order,
in the six nonzero columns.

Proof of Lemma 7.27. Consider any prime p, and any nonzero a ∈ Zp . We must prove that
{a, 2a, . . . , (p − 1)a} ≡p {1, 2, . . . , p − 1}.
We will first argue that the set {1 · a mod p, 2 · a mod p, . . . , (p − 1) · a mod p} contains no
duplicates—that is, the value of i · a mod p is different for every i. Let i, j ∈ {1, 2, . . . , p − 1} be arbi-
trary. We will show that ia ≡p ja implies that i = j, which establishes this first claim. Suppose that
ia ≡p ja. Then, multiplying both sides by a−1 , we have that iaa−1 ≡p jaa−1 , which immediately yields
i ≡p j because a · a−1 ≡p 1. (Note that, because p is prime, by Corollary 7.25, we know that a−1 exists in
Zp .) Therefore, for any i, j ∈ {1, 2, . . . , 1 − p}, if i ̸= j then ai ̸≡p aj.
We now need only show that ia mod p ̸= 0 for any i > 0. But that fact is straightforward to see:
ia mod p = 0 if and only if p | ia, but p is prime and i < p and a < p, so p cannot divide ia. (See
Exercise 7.49.)

The theorem itself, and a proof


With this preliminary result in hand, we turn to Fermat’s Little Theorem itself:

Theorem 7.28: Fermat’s Little Theorem.


Let p be prime, and let a ∈ Zp where a ̸= 0. Then ap−1 ≡p 1.

As with the previous lemma, we’ll start with a few examples of this claim, and then give a proof of the
general result. (While this property admittedly might seem a bit mysterious, it turns out to follow fairly
closely from Lemma 7.27, as we’ll see.)

Example 7.23: Some examples of Fermat’s Little Theorem.


Here are a few examples, for the prime numbers 7 and 19:
26 mod 7 = 64 mod 7 = (7 · 9 + 1) mod 7 = 1
36 mod 7 = 729 mod 7 = (104 · 7 + 1) mod 7 = 1
418 mod 19 = 68719476736 mod 19 = (3616814565 · 19 + 1) mod 19 = 1.

We’ll now give a proof of the theorem, making heavy use of Lemma 7.27:

ISTUDY
7.4 Multiplicative Inverses 367

Proof of Fermat’s Little Theorem (Theorem 7.28). Lemma 7.27 says that {1, 2, . . . , p − 1} is precisely the
same set as {1 · a mod p, 2 · a mod p, . . . , (p − 1) · a mod p}, so they have the same product:
1 · 2 · 3 · · · (p − 1) ≡p 1a · 2a · 3a · · · (p − 1)a. (1)

Because p is prime, Corollary 7.25 implies that the multiplicative inverses 1−1 , 2−1 , . . . , (p − 1)−1 all exist
in Zp . Multiplying both sides of (1) by the product of all p − 1 of these multiplicative inverses—that is,
multiplying by 1−1 · 2−1 · · · · · (p − 1)−1 —we have

1 · 2 · 3 · · · (p − 1) · 1−1 · 2−1 · · · (p − 1)−1 ≡p 1a · 2a · 3a · · · (p − 1)a · 1−1 · 2−1 · · · (p − 1)−1 . (2)

Rearranging the left-hand side of (2) and replacing b · b−1 by 1 for each b ∈ {1, . . . , p − 1}, we simply
get 1:

1 ≡p 1a · 2a · 3a · · · (p − 1)a · ·1−1 · 2−1 · · · (p − 1)−1 . (3)

Rearranging the right-hand side of (3) and again replacing each b · b−1 by 1, we are left only with p − 1
copies of a:

1 ≡p ap−1 ,
which is exactly what we had to show.
Note that Fermat’s Little Theorem is an implication, not an equivalence. It states that if p is prime, then
for every a ∈ {1, . . . , p − 1}—that is, for every p relatively prime to n—we have ap−1 ≡p 1. The converse
does not always hold: if an−1 ≡n 1 for every a ∈ Zn that’s relatively prime to n, we cannot conclude that
n is prime. For example, a560 ≡561 1 for every a ∈ {1, 2, . . . , 560} with GCD(a, 561) = 1—but 561 is not
prime! (See Exercise 7.112.) A number like 561, which passes the test in Fermat’s Little Theorem but is
not prime, is called a Fermat pseudoprime or a Carmichael number. (The latter name is in honor of Robert
Carmichael (1879–1967), an American mathematician who was one of the first to discover these numbers.)

Taking it further: Let n ≥ 2 be an integer, and suppose that we need to determine whether n is prime. There’s a test for
primality that’s implicitly suggested by Fermat’s Little Theorem—for “many” different values of a ∈ Zn , test to make sure
that an−1 mod n = 1—but this test sometimes incorrectly identifies composite numbers as prime, because of the Carmichael
numbers. (For speed, we generally test a few randomly chosen values of a ∈ Zp instead of trying many of them—but of course
testing fewer values of a certainly can’t prevent us from incorrectly identifying Carmichael numbers as prime.) However, there
are some tests for primality that have a similar spirit but that aren’t fooled by certain inputs in this way. See the discussion on
p. 368 for a description of a randomized algorithm called the Miller–Rabin test that checks primality using this approach.

ISTUDY
368 Number Theory

Computer Science Connections

Miller–Rabin Primality Testing fermat-isPrime?(n, k):


Input: n is a candidate prime number; k is a “certainty parameter” telling
Quick: is 1,073,676,287 prime? What about us how many tests to perform before giving up and reporting n as prime.
1,073,676,289? (Or, a bit harder: figure out 1 repeat
whether some given 100-digit number is 2 choose a ∈ {1, 2, . . . , n − 1} randomly
prime.) Fermat’s Little Theorem tells us that 3 until an−1 ̸≡n 1 or we’ve tried k times
an−1 ≡n 1 for any prime n and any nonzero 4 return “probably prime” if every an−1 ≡n 1; else return “composite”
a ∈ Zn . This fact makes the randomized
Figure 7.20 A primality tester based on Fermat’s Little Theorem.
algorithm in Figure 7.20 a tempting way to
test primality. Indeed, Fermat’s Little Theorem implies that fermat-isPrime?(p) returns “probably prime” for any
prime p—it never errs by accusing a prime number of being composite. But the opposite error (incorrectly assert-
ing that a composite number is prime) can happen. And, unfortunately, that error happens frequently for particular
values of n—specifically, for Carmichael numbers. (See the discussion after Theorem 7.28, and Exercise 7.112.) For
example, n = 118,901,521 is composite, but the only values of a for which an−1 ̸≡n 1 are multiples of 271, 541, or
811—less than 0.7% of {1, 2, . . . , n − 1}. (And there are very large Carmichael numbers that have very few prime
factors—all of which are big numbers—which cause fermat-isPrime? to have an extremely high error rate.)
In the late 1970s and early 1980s, two
computer scientists, Gary Miller and Michael miller-rabin-isPrime?(n, k):
Rabin, developed a new primality test using Input: n is a candidate prime number; k is a “certainty parameter”
modular arithmetic that doesn’t get fooled for 1 write n − 1 as 2r d for an odd number d
any particular input integer. (Miller’s origi- 2 while we’ve done fewer than k tests:
nal algorithm [89] is nonrandom but relies on 3 choose a random a ∈ {1, . . . , n − 1}
r
a still-unproven mathematical assumption; 4 σ := ⟨ad , a2d , a4d , a8d , . . . , a2 d ⟩ mod n.
Rabin [106] later modified the test to remove 5 ̸ ⟨. . . , 1⟩ or if σ = ⟨. . . , x, 1, . . .⟩ for any x ∈
if σ = / {1, n − 1} then
the assumption—but at the cost of making it 6 return “composite”
random instead.) The Miller–Rabin primality 7 return “probably prime”
test in Figure 7.21 is based on the following Figure 7.21 Miller–Rabin primality test.
fact (see Exercise 7.51): if p is prime, then
a2 ≡p 1 if and only if a mod p ∈ {1, p − 1}. Or, taking the contrapositive,

if a2 ≡n 1 with a mod p ∈
/ {1, n − 1}, then n is not prime. (2)
The idea of Miller–Rabin is to look for an n r and d ad a2d a4d a8d a16d
a ∈ Zn satisfying (2). Consider a candidate 561 4 and 35
choose a = 4
166 67 1 1 1
n − 1 = 560 435 ≡561 166
prime n ≥ 3. Thus n is odd, so n − 1 is even, = 24 · 35 435 1662 672 12 12
and we can write n − 1 = 2r d, where d is an choose a = 74
odd number and r ≥ 1. Define the sequence 7435 ≡561 131
131 331 166 67 1
7435 1312 3312 1662 672
⟨ r ⟩
ad , a2d , a4d , . . . , a2 d (3) choose a = 99
330 66 429 33 528
d 2
=(a ) 2d 2
=(a ) 2r−1 d 2 9935 ≡561 330
=(a )
9935 3302 662 4292 332
with each entry taken modulo n. Fermat’s Figure 7.22 A few examples of Miller–Rabin for n = 561. Both
Little Theorem says that n is not prime if a = 4 (because 672 ≡561 1 even though 67 ̸≡561 1 and 67 ̸≡561 560)
an−1 ̸≡n 1, so if (3) ends with anything other and a = 99 (because 99560 ̸≡ 1) demonstrate that 561 is not prime.
561
than 1 we know that n is not prime. And if
there’s a 1 that appears immediately after an entry x where x mod n ∈ / {1, n − 1} in (3), then we also know that n is
not prime: x2 ≡n 1 but x mod n ∈ / {1, n − 1}, so (2) says that n is not prime. (See Figure 7.22 for an example.) The
key fact, which we won’t prove here, is that many different values of a ∈ Zn result in one of these two violations:

Fact: If n is not prime, then for at least n−2 1 different nonzero values of a ∈ Zn , the sequence (3) contains a 1 following
an entry x ∈/ {1, n − 1} or the sequence (3) doesn’t end with 1. (For a proof, see [33].)
This fact then allows us to test for n’s primality by trying k different randomly chosen values of a; the probability
that every one of these tests fails when n is not prime is at most 1/2k .

ISTUDY
Exercises 369

EXERCISES
7.80 Following Example 7.17, identify the numbers that are half of every element in Z9 . (That is, for each a ∈ Z9 , find b ∈ Z9 such that
2b = a.)

We talked extensively in this section about multiplicative inverses, but there can be inverses for other operations, too. The next few
exercises explore the additive inverse in Zn . Notice that the additive identity in Zn is 0: for any a ∈ Zn , we have a + 0 ≡n 0 + a ≡n a.
The additive inverse of a ∈ Zn is typically denoted −a.
7.81 Give an algorithm to find the additive inverse of any a ∈ Zn . (Be careful: the additive inverse of a has to be a value from Zn , so
you can’t just say that 3’s additive inverse is negative 3!)
7.82 Using your solution to Exercise 7.81, prove that, for any a ∈ Zn , we have −(−a) ≡n a.
7.83 Using your solution to Exercise 7.81, prove that, for any a, b ∈ Zn , we have a · (−b) ≡n (−a) · b.
7.84 Using your solution to Exercise 7.81, prove that, for any a, b ∈ Zn , we have a · b ≡n (−a) · (−b).

In regular arithmetic, for a number x ∈ R, a square root of x is a number y such that y2 = x. If x = 0, there’s only one such y,
namely y = 0. If x < 0, there’s no such y. If x > 0, there are two such values y (one positive and one negative). (Hint for the next
two exercises: think about Exercise 7.81.)
7.85 Prove or disprove: for any n ≥ 2, there exists one and only one b ∈ Zn such that b2 ≡n 0.
7.86 Prove or disprove: for any n ≥ 2, and for any a ∈ Zn with a = ̸ 0, there is not exactly one b ∈ Zn such that b2 ≡n a.

7.87 Using paper and pencil (and brute-force calculation), compute 4−1 in Z11 (or state that the inverse doesn’t exist).
7.88 Repeat for 7−1 in Z11 .
7.89 Repeat for 0−1 in Z11 .
7.90 Repeat for 5−1 in Z15 .
7.91 Repeat for 7−1 in Z15 .
7.92 Repeat for 9−1 in Z15 .
7.93 Prove that the multiplicative inverse is unique: that is, for arbitrary n ≥ 2 and a ∈ Zn , suppose that ax ≡n 1 and ay ≡n 1. Prove
that x ≡n y.
7.94 Write down the full multiplication table (as in Figure 7.18) for Z5 .
7.95 Do the same for Z6 .
7.96 Do the same for Z8 .
7.97 Prove or disprove: for arbitrary n ≥ 2, (n − 1)−1 = n − 1 in Zn .
7.98 Prove that (a−1 )−1 = a for any n ≥ 2 and a ∈ Zn : that is, prove that a is the multiplicative inverse of the multiplicative inverse of
a.
7.99 Prove that, for any n ≥ 2 and a ∈ Zn , there exists x ∈ Z with ax ≡n 1 if and only if there exists y ∈ Zn with ay ≡n 1.
7.100 Suppose that the multiplicative inverse a−1 exists in Zn . Let k ∈ Zn be any exponent. Prove that ak has a multiplicative inverse in
Zn , and, in particular, prove that the multiplicative inverse of ak is the kth power of the multiplicative inverse of a. (That is, prove
that (ak )−1 ≡n (a−1 )k .)
7.101 Using paper and pencil and the algorithm based on the extended Euclidean algorithm, compute 17−1 in Z23 (or explain why it
doesn’t exist). (See Figure 7.23 for a reminder.)
7.102 Do the same for 7−1 in Z25 .
7.103 Do the same for 9−1 in Z33 .
7.104 (programming required) Implement inverse(a, n) from Figure 7.19a in a language of your choice.
7.105 Prove or disprove the converse of Corollary 7.25: if n is composite, then there exists a ∈ Zn (with a =
̸ 0) that does not have a
multiplicative inverse in Zn .
7.106 Let p be an arbitrary prime number. What value does the quantity 2p+1 mod p have? Be as specific as you can. Explain.
7.107 It turns out that 247248 mod 249 = 4. From this, you can conclude at least one of following: 247 is not prime; 247 is prime; 249 is
not prime; or 249 is prime. Which one(s)? Explain.
7.108 Reprove the general version of the Chinese Remainder Theorem with single constructive argument, as in the 2-congruence case,
Q
instead of using induction. Namely, assume n1 , n2 , . . . , nk are pairwise relatively prime, and let ai ∈ Zni . Let N = ki=1 ni . Let
Ni = N/ni (more precisely, let Ni be the product of all nj s except ni ) and let di be the multiplicative inverse of Ni in Zni . Prove that
P
x = ki=1 ai Ni di satisfies the congruence x mod ni = ai for all i.

ISTUDY
370 Number Theory

extended-Euclid(n, m): inverse(a, n):


Input: positive integers n and m ≥ n. Input: a ∈ Zn and n ≥ 2
Output: x, y, r ∈ Z where GCD(n, m) = r = xn + ym Output: a−1 in Zn , if it exists
1 if m mod n = 0 then 1 x, y, d := extended-Euclid(a, n)
2 return 1, 0, n // 1 · n + 0 · m = n = GCD(n, m) 2 if d = 1 then
3 else 3 return x mod n // xa + yn = 1, so xa ≡n 1.
4 x, y, r := extended-Euclid(m mod n, n) 4 else
 
5 return y − mn · x, x, r 5 return “no inverse for a exists in Zn .”

Figure 7.23 A reminder of two algorithms.

The totient function φ : Z≥1 → Z≥0 , sometimes called Euler’s totient function after the eighteenth-century Swiss mathematician
Leonhard Euler, is defined as

φ(n) = the number of k such that 1 ≤ k ≤ n such that k and n have no common divisors.

For example, φ(6) = 2 because 1 and 5 have no common divisors with 6 (but all of {2, 3, 4, 6} do share a common divisor with 6).
There’s a generalization of Fermat’s Little Theorem, sometimes called the Fermat–Euler Theorem or Euler’s Theorem, that states
the following: if a and n are relatively prime, then aφ(n) ≡n 1.
7.109 Assuming the Fermat–Euler theorem, argue that Fermat’s Little Theorem holds.
7.110 Assuming the Fermat–Euler theorem, argue that a−1 in Zn is aφ(n)−1 mod n, for any a ∈ Zn that is relatively prime to n. Verify
the claim for the multiplicative inverses of a ∈ {7, 17, 31} in Z60 .
7.111 (programming required) Implicitly, the Fermat–Euler theorem gives a different way to compute the multiplicative inverse of a
in Zn : (1) compute φ(n) [say by brute force, though there are somewhat faster ways—see Exercises 9.37–9.39]; and then (2)
compute aφ(n)−1 mod n [perhaps using repeated squaring; see Figure 7.6]. Implement this algorithm to compute a−1 in Zn in a
programming language of your choice.

Recall that a Carmichael number is a composite number that passes the (bogus) primality test suggested by Fermat’s Little Theorem.
In other words, a Carmichael number n is an integer that is composite but such that, for any a ∈ Zn that’s relatively prime to n,
we have an−1 mod n = 1.
7.112 (programming required) Write a program to verify that 561 is (a) not prime, but (b) satisfies a560 mod 561 = 1 for every
a ∈ {1, . . . , 560} that’s relatively prime to 561. (That is, verify that 561 is a Carmichael number.)
7.113 Suppose n is a composite integer. Argue that there exists at least one integer a ∈ {1, 2, . . . , n − 1} such that an−1 ̸≡n 1. (In other
words, there’s always at least one nonzero a ∈ Zn with an−1 ̸≡n 1 when n is composite. Thus, although the probability of error in
fermat-isPrime? from p. 368 may be very high for particular composite integers n, the probability of success is nonzero, at least!)

The following theorem is due to Alwin Korselt, from 1899: an integer n is a Carmichael number if and only if n is composite,
squarefree, and for all prime numbers p that divide n, we have that p − 1 | n − 1. (An integer n is squarefree if there is no integer
d ≥ 2 such that d2 | n.)
7.114 (programming required) Use Korselt’s theorem (and a program) to find all Carmichael numbers less than 10,000.
7.115 Use Korselt’s theorem to prove that all Carmichael numbers are odd.

7.116 (programming required) Implement the Miller–Rabin primality test (see p. 368) in a language of your choice.

ISTUDY
7.5 Cryptography 371

7.5 Cryptography
Three may keep a Secret, if two of them are dead.
Benjamin Franklin (1706–1790)
Poor Richard’s Almanack (1735)

In the rest of this chapter, we will make use of the number-theoretic machinery that we’ve now devel-
oped to explore cryptography. Imagine that a sender, named Alice, is trying to send a secret message to a
receiver, named Bob. (In a tradition that began with the paper introducing the RSA cryptosystem [107],
cryptographic systems are usually described using an imagined crew of people whose names start with
consecutive letters of the alphabet. We’ll stick with these traditional names: Alice, Bob, Carol, etc.) The
goal of cryptography is to ensure that the message itself is kept secret even if an eavesdropper—named
Eve—overhears the transmission to Bob.
To achieve this goal, Alice does not directly transmit the message m that she wishes to send to Bob;
instead, she encrypts m in some way. The resulting encrypted message c is what’s transmitted to Bob. (The
original message m is called plaintext; the encrypted message c that’s sent to Bob is called the ciphertext.)
Bob then decrypts c to recover the original message m. A diagram of the basic structure of a cryptographic
system is shown in Figure 7.24.
The two obvious crucial properties of a cryptographic system are that (i) Bob can compute m from c,
and (ii) Eve cannot compute m from c. (Of course, to make (i) and (ii) true simultaneously, it will have to be
the case that Bob has some information that Eve doesn’t have—otherwise the task would be impossible!)

One-time pads
The simplest idea for a cryptographic system is for Alice and Bob to agree on a shared secret key that they
will use as the basis for their communication. The easiest implementation of this idea is what’s called a
one-time pad. (The pad in the name comes from spycraft—spies might carry physical pads of paper, where
each sheet has a fresh secret key written on it. The one-time in the name comes from the fact that this
system is secure only if the same key is never reused, as we’ll discuss.)
Here’s how a one-time pad works. Alice and Bob agree in advance on an integer n, denoting the length
of the message that they would like to communicate. They also agree in advance on a secret bitstring
n
k ∈ {0, 1} , where each bit ki ∈ {0, 1} is chosen independently and uniformly—so that every one of the
n
2n different n-bit strings has a 21n chance of being chosen as k. To encrypt a plaintext message m ∈ {0, 1} ,

Eve (eavesdropper)
trying to decrypt without Bob’s private information

Alice (sender) Bob (recipient)

(reconstructed)
plaintext ciphertext
plaintext
encrypt decrypt
m c m
using using Bob’s
information private
about Bob information

Figure 7.24 The outline of a cryptographic system.

ISTUDY
372 Number Theory

Alice computes the bitwise exclusive or of m and k—in other words, the ith bit of the ciphertext is mi ⊕ ki .
n
To decrypt the ciphertext c ∈ {0, 1} , Bob computes the bitwise XOR of c and k. Here’s a small example:

Example 7.24: A one-time pad.


Alice and Bob agree (in advance) on the secret key k = 10111000.
Suppose that Alice decides that she wishes to transmit the message m = 01101110. To do so, Alice
finds the bitwise XOR of m and k, which is c = 11010110. (See Figure 7.25a.)
To decrypt the ciphertext c = 11010110, Bob finds the bitwise XOR of c and k. (Again see Fig-
ure 7.25a.) Observe that c ⊕ k = 01101110 is indeed precisely m = 01101110, as desired.

The reason that Bob can decrypt the ciphertext to recover the original message m is that, for any bits a
and b, it’s the case that (a ⊕ b) ⊕ b = a. (See Figure 7.25b.) The fact that Eve cannot recover m from c
relies on the fact that, for any message m and every ciphertext c, there is precisely one secret key k such
that m ⊕ k = c. (So Eve is just as likely to see a particular ciphertext regardless of what the message is,
and therefore she gains no information about m by seeing c. See Exercise 7.118.) Thus the one-time pad
is perfectly secure as a cryptographic system—if Alice and Bob only use it once! If Alice and Bob reuse
the same key to exchange many different messages, then Eve can use frequency analysis to get a handle
on the key, and therefore can begin to decode the allegedly secret messages. (See Exercises 10.73–10.77
or Exercise 7.119.)

Taking it further: One of the earliest encryption schemes is now known as a Caesar cipher, after Julius Caesar, who used it in
his correspondence. It can be understood as a cryptographic system that uses a one-time pad more than once. The Caesar cipher
works as follows. The sender and receiver agree on a shift x, an integer, as their secret key. The ith letter in the alphabet (from
A = 0 through Z = 25) will be shifted forward by x positions in the alphabet. The shift “wraps around,” so that we encode letter i
as letter (i + x) mod 26. For example, if x = 3 then A→D, L→O, Y→B, etc. To send a text message m consisting of multiple letters
from the alphabet, the same shift is applied to each letter. (For convenience, we’ll leave nonalphabetic characters unchanged.)
For example, the ciphertext XF BSF EJTDPWFSFE; GMFF BU PODF! was generated with the shift x = 1 from the message WE
ARE DISCOVERED; FLEE AT ONCE!. Because we’ve reused the same shift x for each letter of the message, the Caesar cipher
is susceptible to being broken based on frequency analysis. (In the XF BSF EJTDPWFSFE; GMFF BU PODF! example, F is by
far the most common letter in the ciphertext—and E is by far the most common letter in English text. From these two facts, you
might infer that x = 1 is the most probable secret key. See Exercise 7.119.)
Millennia later, the Enigma machine, the encryption system used by the Germans during World War II, was—as with Caesar—a
substitution cipher, but one where the shift changed with each letter. (But not in completely unpredictable ways, as in a one-time
pad!) See p. 497 for more.

m 0 1 1 0 1 1 1 0
k 1 0 1 1 1 0 0 0
a b a⊕b (a ⊕ b) ⊕ b
c=m⊕k 1 1 0 1 0 1 1 0
0 0 0 0
0 1 1 0
c 1 1 0 1 0 1 1 0 1 0 1 1
k 1 0 1 1 1 0 0 0 1 1 0 1
c⊕k 0 1 1 0 1 1 1 0
(b) The truth table for (a ⊕ b) ⊕ b = a. For any
(a) An example of encryption and decryption with a message bit a and any key bit b, the decryption of the
one-time pad. (See Example 7.24.) encryption is equal to the original plaintext.

Figure 7.25 The one-time pad.

ISTUDY
7.5 Cryptography 373

Public-key cryptography
In addition to being single-use-only, there’s another strange thing about the one-time pad: if Alice and
Bob are somehow able to communicate an n-bit string securely—as they must to share the secret key k—it
doesn’t seem particularly impressive that they can then communicate the n-bit string m securely.
Public-key cryptography is an idea to get around this oddity. Here is the idea, in a nutshell. Every par-
ticipant will have a public key and a private (or secret) key, which will somehow be related to the public
key. A user’s public key is completely public—for example, posted on the web. If Alice wishes to send a
message m to Bob, then Alice will (somehow!) encrypt her message to Bob using Bob’s public key, pro-
ducing ciphertext c. Bob, who of course knows Bob’s secret key, can decrypt c to reconstruct m; Eve, not
knowing Bob’s secret key, cannot decrypt c.
This idea sounds a little absurd, but we will be able to make it work. Or, at least, we will make it work
on the assumption that Eve has only limited computational power—and on the assumption that certain
computational problems, like factoring large numbers, require a lot of computational power to solve. (For
example, Bob’s secret key cannot be easily computable from Bob’s public key—otherwise Eve could easily
figure out Bob’s secret key and then run whatever decryption algorithm Bob uses!)

7.5.1 The RSA Cryptosystem


The basic idea of public-key cryptography was discussed in abstract terms in the 1970s—especially by
Whitfield Diffie, Martin Hellman, and Ralph Merkle—and, after some significant contributions by a num-
ber of researchers, a cryptosystem successfully implementing public-key cryptography was discovered by
Ron Rivest, Adi Shamir, and Leonard Adleman [107]. The RSA cryptosystem, named after the first initials
of their three last names, is one of the most famous, and widely used, cryptographic protocols today. The
previous sections of this chapter will serve as the building blocks for the RSA system, which we’ll explore
in the rest of this section.

Taking it further: The RSA cryptosystem is named after its three 1978 discoverers, and the Turing Award—the highest honor in
computer science, roughly equivalent to the Nobel Prize of computer science—was conferred on Rivest, Shamir, and Adleman
in 2002 for this discovery. But there is also a “shadow history” of the advances in cryptography made in the second half of
the twentieth century. The British government’s signal intelligence agency, called Government Communications Headquarters
(GCHQ), had been working to solve precisely the same set of research questions about cryptography as academic researchers
like R., S., and A. (GCHQ was perhaps best known for its success in World War II, in breaking the Enigma code of the German
military; see p. 497 for more discussion.) And it turned out that several British cryptographers at GCHQ—Clifford Cocks, James
Ellis, and Malcolm Williamson—had discovered the RSA protocol several years before 1978. But their discovery was classified
by the British government, and thus we call this protocol “RSA” instead of “CEW.” See the excellent book by Simon Singh for
more on the history of cryptography, including both the published and classified advances in cryptographic systems [119].
Also see the discussion on p. 379 of the Diffie–Hellman key exchange protocol, one of the first (published) modern breakthroughs
in cryptography, which allows Alice and Bob to solve another apparently impossible problem: exchanging secret information
while communicating only over an insecure channel.

In RSA, as for any public-key cryptosystem, we must define three algorithmic components. (These
three algorithms for the RSA cryptosystem are shown in Figure 7.26; an overview of the system is shown
in Figure 7.27.) They are:

• key generation: how do Alice and Bob construct their public/private keypairs?
• encryption: when Alice wishes to send a message to Bob, how does she encode it?
• decryption: when Bob receives ciphertext from Alice, how does he decode it?

ISTUDY
374 Number Theory

Key Generation:
1 Bob chooses two large primes, p and q, and defines n := pq.
̸ 1 such that e and (p − 1)(q − 1) are relatively prime.
2 Bob chooses e =
3 Bob computes d := e−1 modulo (p − 1)(q − 1).
4 Bob publishes ⟨e, n⟩ as his public key; Bob’s secret key is ⟨d, n⟩.

Encryption: If Alice wants to send message m to Bob,


1 Alice finds Bob’s public key, say ⟨eBob , nBob ⟩, as he published it.
2 To send message m ∈ {0, . . . , nBob − 1}, Alice computes c := meBob mod nBob .
3 Alice transmits c to Bob.

Decryption: When Bob receives ciphertext c,


1 Bob computes m := cdBob mod nBob , where ⟨dBob , nBob ⟩ is Bob’s secret key.

Figure 7.26 The RSA cryptosystem: key generation, encryption, and decryption.

Eve (eavesdropper)
trying to decrypt without Bob’s secret key

Alice (sender) Bob (recipient)

(reconstructed)
plaintext ciphertext
plaintext
encrypt decrypt
m c= me mod n cd mod n
using Bob’s using Bob’s
public key secret key
⟨e, n⟩ ⟨d, n⟩

Figure 7.27 A schematic of the RSA cryptosystem, where n = pq and de ≡(p−1)(q−1) 1, for prime numbers p and q.

The core idea of RSA is the following. (The details of the protocols are in Figure 7.26.) To encrypt a
numerical message m for Bob, Alice will compute c := me mod n, where Bob’s public key is ⟨e, n⟩. To
decrypt the ciphertext c that he receives, Bob will compute cd mod n, where Bob’s private key is ⟨d, n⟩.
(Of course, there’s an important relationship among the quantities e, d, and n!)

Taking it further: It may seem strange that n is part of both Bob’s secret key and Bob’s public key—it’s usually done this way
for symmetry, but also to support digital signatures. When Alice sends Bob a message, she can encrypt it using her own secret
key; Bob can then decrypt the message using Alice’s public key to verify that Alice was indeed the person who sent the message.

An example of RSA key generation, encryption, and decryption


Later we will prove that the message that Bob decrypts is always the same as the message that Alice origi-
nally sent. But we’ll start with an example. First, Bob generates a public and private key, using the protocol
in Figure 7.26. (All three phases can be implemented efficiently, using techniques from this chapter; see
Exercises 7.131–7.134.)

Example 7.25: Generating an RSA keypair for Bob.


For good security properties, we’d want to pick seriously large prime numbers p and q, but to make the
computation easier to see we’ll choose very small primes.

ISTUDY
7.5 Cryptography 375

(1) Suppose we choose the “large” primes p = 13 and q = 17. Then n := 13 · 17 = 221.
(2) We now must choose a value of e ̸= 1 that is relatively prime to (p − 1)(q − 1) = 12 · 16 = 192.
Note that GCD(2, 192) = 2 ̸= 1, so e = 2 fails. Similarly GCD(3, 192) = 3 and GCD(4, 192) = 4. But
GCD(5, 192) = 1. We pick e := 5.
(3) We now compute d := inverse(e, (p − 1)(q − 1))—that is, d := e−1 in Z(p−1)(q−1) . Tracing
the execution of inverse(5, 192) just as in Example 7.21 reveals that extended-Euclid(5, 192) returns
⟨77, −2, 1⟩, and thus that inverse(5, 192) returns 77 mod 192 = 77. (Indeed, 5 · 77 = 385 = 192 · 2 + 1,
so 5 · 77 ≡192 1.) So we set d := 77.
(4) Thus Bob’s public key is ⟨e, n⟩ = ⟨5, 221⟩, and Bob’s secret key is ⟨d, n⟩ = ⟨77, 221⟩.

Bob now publishes his public key somewhere, keeping his secret key to himself. If Alice now wishes to
send a message to Bob, she uses his public key, as follows:

Example 7.26: Encrypting a message with RSA.


To send message m = 202 to Bob, whose public key is ⟨e, n⟩ = ⟨5, 221⟩, Alice computes
me mod n = 2025 mod 221 = 336,323,216,032 mod 221 = 206.
Thus she sends Bob the ciphertext c := 206.

When Bob receives an encrypted message, he uses his secret key to decrypt it:

Example 7.27: Decrypting a message with RSA.


When Bob, whose secret key is ⟨d, n⟩ = ⟨77, 221⟩, receives the ciphertext c = 206 from Alice, he
decrypts it as
cd mod n = 20677 mod 221.

Computing 20677 mod 221 by hand is a bit tedious, but we can calculate it with “repeated squaring”
(using the fact that b2k mod n = (b2 mod n)k mod n and b2k+1 mod n = b · (b2k mod n) mod n; see
Exercises 7.23–7.25):

20677 mod 221 = 206 · (2062 mod 221)38 mod 221 2062 mod 221 = 4
= 206 · (42 mod 221)19 mod 221 42 mod 221 = 16
= 206 · 16 · (162 mod 221)9 mod 221 162 mod 221 = 35
= 206 · 16 · 35 · (352 mod 221)4 mod 221 352 mod 221 = 120
= 206 · 16 · 35 · (1202 mod 221)2 mod 221 1202 mod 221 = 35
= 206 · 16 · 35 · (352 mod 221) mod 221 352 mod 221 = 120
= 206 · 16 · 35 · 120 mod 221 206 · 16 · 35 · 120 = 13,843,200
= 202.
Thus Bob decrypts the ciphertext 206 as 202 = 20677 mod 221. Indeed, then, the message that Bob
receives is precisely 202—the same message that Alice sent!

ISTUDY
376 Number Theory

We’ve now illustrated the full RSA protocol: generating a key, and encrypting and decrypting a message.
Here’s one more chance to work through the full pipeline:

Example 7.28: RSA, again, from end to end.


Bob generates a public/private keypair using the primes p = 11 and q = 13, choosing the smallest valid
value of e. You encrypt the message 95 to send to Bob (using his generated public key). What ciphertext
do you send to Bob?
Solution. For ⟨p, q⟩ = ⟨11, 13⟩, we have pq = 143 and (p − 1)(q − 1) = 120. Because 120 is divisible
by 2, 3, 4, 5, and 6 but GCD(120, 7) = 1, we choose e := 7. We find d := inverse(7, 120) = 103. Then
Bob’s public key is ⟨e, n⟩ = ⟨7, 143⟩ and Bob’s private key is ⟨d, n⟩ = ⟨103, 143⟩.
To send Bob the message m = 95, we compute me mod n = 957 mod 143, which is 17. Thus the
ciphertext is c := 17. (Bob would decrypt this ciphertext as cd mod n = 17103 mod 143—which indeed
is 95.)

7.5.2 The Correctness of RSA


Examples 7.25–7.27 gave one instance of the RSA cryptosystem working properly, in the sense that
decrypt(encrypt(m)) turned out to be the original message m itself—but, of course, we want this property
to be true in general. Let’s prove that it is. Before we give the full statement of correctness, we’ll prove an
intermediate lemma:

Lemma 7.29: Correctness of RSA: decrypting the ciphertext, modulo p or q.


Suppose e, d, p, q, n are all as specified in the RSA key generation protocol—that is, n = pq for primes
p and q, and ed ≡(p−1)(q−1) 1. Let m ∈ Zn be any message. Then

m′ := [(me mod n) mod n]


d
(the decryption of the encryption of m)
satisfies both m′ ≡p m and m′ ≡q m.

Proof. We’ll prove m′ ≡p m; because p and q are symmetric in the definition, the fact that m′ ≡q m
follows by exactly the same logic. Recall that we chose d so that ed ≡(p−1)(q−1) 1, and thus we have
ed = k(p − 1)(q − 1) + 1 for some integer k. Hence
[(me mod n)d mod n] mod p = (med mod n) mod p by Theorem 7.4.4
k(p−1)(q−1)+1
= (m mod pq) mod p by definition of e, d, n, and k
k(p−1)(q−1)+1
=m mod p by Exercise 7.18
= [m · m k(p−1)(q−1)
] mod p ak+1 = a · ak
= [(m mod p) · (mk(p−1)(q−1) mod p)] mod p by Theorem 7.4.3
= [(m mod p) · ((m k(q−1)
mod p) p−1
mod p)] mod p. by Theorem 7.4.4

Although it’s not completely obvious, we’re actually almost done: we’ve now shown
   
(me mod n)d mod n mod p = (m mod p) · ((mk(q−1) mod p)p−1 mod p) mod p. (∗)

ISTUDY
7.5 Cryptography 377

If only the highlighted portion of the right-hand side of (∗) were equal to 1, we’d have shown exactly the
desired result, because the right-hand side would then equal [(m mod p) · 1] mod p = m mod p—exactly
what we had to prove! And the good news is that the highlighted portion of (∗) matches the form of Fermat’s
Little Theorem: the highlighted expression is ap−1 mod p, where a = mk(q−1) mod p, and Fermat’s Little
Theorem tells us ap−1 mod p = 1 as long as a ̸≡p 0—that is, as long as p ̸ | a. (We’ll also have to handle
the case when a is divisible by p, but we’ll be able to do that separately.)

Problem-solving tip: If there’s a proof outline that will establish a desired claim except in one or two special cases, then try to
“break off” those special cases and handle them separately. Here we handled the “normal” case a ̸≡p 0 using Fermat’s Little
Theorem, and broke off the special a ≡p 0 case and handled it separately.

Here are the two cases:

Case I: a ≡p 0. Notice that mk(q−1) mod p = 0 and thus that p | mk(q−1) . Therefore:

[(me mod n)d mod n] mod p = [(m mod p) · ap−1 mod p] mod p by (∗)
= [(m mod p) · 0] mod p by the assumption that a ≡p 0
=0
= m mod p,

where the last equality follows because p is prime and p | mk(q−1) ; thus Exercise 7.48 tells us that p | m
as well.
Case II: a ̸≡p 0. Then we can use Fermat’s Little Theorem:

[(me mod n)d mod n] mod p = [(m mod p) · ap−1 mod p] mod p by (∗)
= [(m mod p) · 1] mod p by Fermat’s Little Theorem
= m mod p.

We’ve established that [(me mod n)d mod n] mod p = m mod p in both cases, and the lemma follows.

Using Lemma 7.29 to do most of the work, we can now prove the main theorem:

Theorem 7.30: Correctness of RSA.


Suppose that Bob’s RSA public key is ⟨e, n⟩ and his corresponding private key is ⟨d, n⟩. Let m ∈ Zn be
any message. Then
decryptBob (encryptBob (m)) = m.

d
Proof. Note that decryptBob (encryptBob (m)) = (me mod n) mod n. By Lemma 7.29,

(me mod n)d mod n ≡p m and (me mod n)d mod n ≡q m.

By Exercise 7.50, together these facts imply that (me mod n)d mod n ≡pq m as well. Because n = pq and
m < n, therefore (me mod n)d mod n = m mod n = m.

ISTUDY
378 Number Theory

What about Eve?


When Alice encrypts a message m for Bob and transmits the corresponding RSA-encrypted ciphertext,
Theorem 7.30 says that Bob is able to decrypt to recover the original message m. What’s left to establish
is that Eve cannot recover m from the information that she knows—namely, from the ciphertext me mod n
and from Bob’s public key ⟨e, n⟩. (That’s the desired security property of the system!)
Unfortunately, not only are we unable to prove this property, it’s simply not true! Eve is able to recover
m from the me mod n and e and n, as follows: she factors n—that is, finds the primes p and q such that
pq = n—and then computes d precisely as Bob did when he generated his RSA keys. (And Eve then
computes the “secret” message (me mod n)d mod n precisely as Bob did when he decrypted.)
But the fact that Eve has the information necessary to recover m doesn’t mean that RSA is doomed:
factoring large numbers (particularly those that are the product of two large primes, perhaps) seems to be
a computationally difficult problem. Even if you know that n = 121,932,625,927,450,033 it will take you
quite a while to find p and q—and the best known algorithms for factoring are not fast enough for Eve.

Taking it further: The crucial property that we’re using in RSA is an asymmetry in two “directions” of a problem. Taking two
large prime numbers p and q and computing their product n = pq is easy (that is, it can be done quickly, in polylogarithmic
time). Taking a number n that happens to be the product of two primes and factoring it into p · q appears to be hard (that is,
nobody knows how to do it quickly). Cryptographic systems have been built on a number of different problems with this kind of
asymmetry; see a good textbook on cryptography for much, much more—for example, [69, 109]. See these books for a discussion
of some of the ways in which “textbook RSA”—what we’ve described here!—is susceptible to all sorts of attacks. Industrial-
level RSA implementations take all sorts of precautions that we haven’t even begun to discuss. (It’s also worth noting that Eve is
aware that Alice and Bob are communicating—just not, under the hardness assumptions we’re discussing, what it is that they’re
saying. Steganography [from the Greek stego hidden/covered; compare that to kryptos concealed/secret] seeks to keep secret the
existence of communication rather than merely the contents of the communication. See p. 232.)

Notice, though, that Eve could break RSA another way, too: she only needs to find m, and she knows
both the ciphertext c = me mod n and Bob’s public key ⟨e, n⟩. So Eve could discover m by computing
the “eth root of c”—that is, the number x such that xe = c. Unfortunately for Eve, the fact that she has to
compute the eth root of me mod n, and not just the eth root of me , is crucial; this problem also seems to be
computationally difficult. (See Exercise 7.141—though there’s some evidence that choosing a small value
of e, like e = 3, might weaken the security of the system.)
Note, though, that we have not proven that Eve is unable to efficiently break RSA encryption—for all
we know, a clever student of computational number theory (you!?) will discover an efficient algorithm for
factoring large numbers or computing eth roots in Zn . (Or, for all we know, perhaps someone already has!)

ISTUDY
7.5 Cryptography 379

Computer Science Connections 2 ≡11 2 16 ≡11 5 128 ≡11 7 1024 ≡11 1


4 ≡11 4 32 ≡11 10 256 ≡11 3
Diffie–Hellman Key Exchange 8 ≡11 8 64 ≡11 9 512 ≡11 6

Suppose that Alice and Bob wish to establish 21 22 23 24 25 26 27 28 29 210


some shared piece of secret information— ≡11 2 4 8 5 10 9 7 3 6 1
perhaps to share a key to use in a one-time
pad, or to use for some other cryptographic
51 52 53 54 55 56 57 58 59 510
protocol. But the only communication chan-
≡11 5 3 4 9 1 5 3 4 9 1
nel available to Alice and Bob is insecure;
Eve can listen to all of their communication.
This problem is called the key exchange prob- 71 72 73 74 75 76 77 78 79 710
≡11 7 5 2 3 10 4 6 9 8 1
lem: two parties seek to establish a shared
secret while they communicate only over an
Figure 7.28 Primitive roots of Z11 . Z11 has four different primitive
insecure channel. Like public-key cryptogra-
roots {2, 6, 7, 8}, two of which are shown here: the first 10 powers of
phy (as in RSA), this task seems completely
impossible—and, also like public-key cryp- 2 and 7 (but not 5) in Z11 produce all 10 nonzero elements of Z11 .
tography, despite its apparent impossibility, this problem was solved in the 1970s. The solution that we’ll describe
here is called the Diffie–Hellman key exchange protocol [38], named after Whitfield Diffie (b. 1944) and Martin
Hellman (b. 1945)—winners of the 2015 Turing Award, largely for this contribution.
Let p be prime. The key number-theoretic definition for the Diffie–Hellman protocol is what’s called a primitive
root mod p, which is an element g ∈ Zp such that every nonzero element of Zp is equivalent to a power of g. (In other
words, g is a primitive root mod p if {g1 , g2 , . . . , gp−1 } ≡p {1, 2, . . . , p − 1}.) See Figure 7.28 for some examples.
It’s a theorem of number theory that every Zp for prime p has at least one primitive root.
The Diffie–Hellman protocol is shown in
Alice Bob
Figure 7.29. Alice and Bob agree on a prime
p and a number g that’s a primitive root mod agree on prime p agree on prime p
p. Alice chooses a secret value a ∈ Zp ran- and primitive root g p and g and primitive root g
domly, computes A := ga mod p, and sends (need not be secret) (need not be secret)
A to Bob. Bob chooses a secret value
b ∈ Zp randomly, computes B := gb mod p, choose a ∈ Zp choose b ∈ Zp
and sends B to Alice. (Note that A and B (and keep it secret) (and keep it secret)
are sent over the channel, but the values of a
compute A = g mod p compute
a and b themselves are never transmitted.)
A = g a mod p and B = gb mod p and
Alice now knows a (she picked it) and B =
gb mod p (Bob sent it to her), so she can com- send A to Bob B = gb mod p send B to Alice
pute Ba mod p. Bob knows b (he picked it)
and A = ga mod p (Alice sent it to him), so compute Eve (the eavesdropper) compute
he can compute Ab mod p. s = Ba mod p knows p, g, A, and B s = Ab mod p
Alice and Bob now have a shared piece
of information, namely gab mod p, because Figure 7.29 The Diffie–Hellman key exchange protocol.
Ab ≡p (ga )b = gab = (gb )a ) ≡p Ba . But
why is this shared piece of information a secret? Let’s look at the protocol from Eve’s perspective: she observes the
values of p, g, ga mod p, and gb mod p. But it is generally believed that the problem of computing a from the values
of p, g, and ga mod p cannot be solved efficiently. (This problem is called the discrete logarithm problem: it’s the
modular analogy of computing y from the values of x and xy —that is, computing logx (xy ).) The discrete log problem
is believed to be difficult (as long as the prime p is of appreciable size), and thus it’s believed that Eve cannot feasibly
figure out the value gab mod p, shared by Alice and Bob. (For much more, see any good book on cryptography, such
as [69, 109]. It’s worth pointing out that, as we’ve stated the protocol, Diffie–Hellman is susceptible to a so-called
man-in-the-middle attack: a malicious party (“Mallory”) who has control over the channel can impersonate Bob to
Alice, and impersonate Alice to Bob. (There are improvements to the protocol that address this issue.) Doing so
allows Mallory to intercept, decrypt, and then reencrypt subsequent communications that Alice and Bob thought
were secure—and they’d never know that Mallory was involved.)

ISTUDY
380 Number Theory

EXERCISES
7.117 In our encryption/decryption scheme for one-time pads, we used exclusive or: plaintext m was encrypted as m ⊕ k, and ciphertext
c was decrypted as c ⊕ k. Because (m ⊕ k) ⊕ k is logically equivalent to m, Bob always recovers the original plaintext. But there
are actually three other Boolean operators that we could have used instead of ⊕—that is, there are three other connectives ◦ such
that (m ◦ k) ◦ k ≡ m. (See Figure 4.34.) Identify those three other connectives. Why are these three connectives either uninteresting
or actively bad choices as alternatives to ⊕?

7.118 (Requires knowledge of probability; see Chapter 10.) For a one-time pad with an n-bit key, we have
X 
Pr [ciphertext = c] = Pr [ciphertext = c|plaintext = m] · Pr [plaintext = m] .
m

Prove that the probability that the ciphertext is a particular c ∈ {0, 1}n is precisely 1/2n for any distribution over plaintext messages.

7.119 (programming required) As we suggested, one-time pads are secure if they’re used only once, but using the same key more than
once compromises security. I took a (famous) document written in English, and encoded it as follows: I converted each character
to an ASCII value (in binary, 00000000 to 11111111), and separated this bitstring into 40-bit chunks. (Each chunk contains 5
characters, with 8 bits each.) I generated a 40-bit key, and encoded each chunk using that key. (That is: I used a one-time pad more
than one time!) The encoded document starts like this:

1110111111011100100010011010000110111101
1110100011010101111110111010011110100001
1111010110111110100010011010100010000111

You can find the full encoding at http://www.cambridge.org/highereducation/isbn/9781009150491. Figure out (a)
what 40-bit key I used, and (b) what the encoded document is.

7.120 (programming required) Implement one-time pads in a programming language of your choice.

7.121 Using the “large” primes p = 19 and q = 23, compute the RSA public and private keys. You may have multiple valid choices for
e—if so, choose the smallest e that you can.
7.122 Repeat for p = 31 and q = 37.
7.123 Repeat for p = 41 and q = 43.

Suppose that Bob’s public key is n = 221 and e = 5. (And so Bob’s private key is n = 221 and d = 77.)
7.124 Compute the RSA encryption to send Bob the message m = 42.
7.125 Repeat for the message m = 99.
7.126 If Bob receives the ciphertext c = 99, what message was sent to Bob?
7.127 Repeat for the ciphertext c = 17.

7.128 (programming required) Suppose that Carol’s public key is ⟨e = 3, n = 1,331,191⟩, and the ciphertext c = 441,626. Figure out
the message that was sent to Carol by factoring n.
7.129 Repeat for the public key ⟨e = 11, n = 12,187,823⟩, and the ciphertext c = 7,303,892.
7.130 Repeat for the public key ⟨e = 5, n = 662,983,829⟩, and the ciphertext c = 43,574,279.

In both key generation and encryption/decryption, the RSA cryptosystem completes a number of steps that require some nonobvious
ideas to make them efficient. Luckily, we’ve covered those ideas at various times in previous parts of the chapter. For each of the
following, explain how to compute the desired quantity efficiently (that is, with a number of primitive arithmetic operations that’s
O(logk n) for the value of n in the RSA protocol, for some constant k). For some of these exercises, you’ll simply be able to cite a
previously developed algorithm in a few words; in others, you’ll need to combine more than one algorithm or use an algorithm in
a nontrivial way.
7.131 Find a large prime number: say, find the smallest prime number greater than a given number x.
7.132 Given primes p and q, find a number e = ̸ 1 such that e and (p − 1)(q − 1) are relatively prime.
7.133 Given primes p, q and e relatively prime to (p − 1)(q − 1), compute e−1 modulo (p − 1)(q − 1).
7.134 Given n, e, and m ∈ Zn , compute me mod n. (Similarly, given n, d, and c, compute cd mod n.)

7.135 Prove that, in the RSA key-generation protocol, the number e that we choose is always odd.
7.136 Prove that, in the RSA key-generation protocol, the number d that we choose is also always odd.

ISTUDY
Exercises 381

Imagine the following modifications to the RSA key generation protocol. What goes wrong if we change the algorithm as described?
Be precise. Is there a step of the protocol that can no longer be executed? Does Bob no longer have the information necessary to
decrypt the ciphertext? Does Eve now have the power to decrypt the ciphertext?
7.137 The protocol tells us to choose two large primes p, q. But, instead, we choose one prime p, and set q := p.
7.138 The protocol tells us to choose two large primes p and q. But, instead, we choose two large numbers p and q that aren’t actually
prime.
7.139 The protocol tells us to choose e ̸= 1 that’s relatively prime to (p − 1)(q − 1). But, instead, we choose e = 1.
7.140 The protocol tells us to choose e ̸= 1 that’s relatively prime to (p − 1)(q − 1). But, instead, we choose an e that is not relatively
prime to (p − 1)(q − 1).

7.141 Explain precisely how to use binary search to find the eth root of me efficiently. Then explain precisely why this binary-search
approach doesn’t work to find the eth root of me mod n in general.

7.142 (programming required) Implement RSA key generation. Given two prime numbers p and q as input, produce a public and private
RSA keypair ⟨e, n⟩ and ⟨d, n⟩. (Hint: Exercises 7.31 and 7.104 will be helpful. To pick e, you may wish to simply try all odd
numbers and use Exercise 7.31—you could make this step faster, but generally speaking this slightly slower approach will still be
fast enough.) Use the results from Exercises 7.131–7.134 to make your solutions efficient.
7.143 (programming required) Implement RSA encryption and decryption. For encryption, given a public key ⟨e, n⟩ and a message
m ∈ Zn , compute the corresponding ciphertext c := me mod n. Similarly, for decryption: given a private key ⟨d, n⟩ and a ciphertext
c ∈ Zn , compute m := cd mod n. (Hint: Exercise 7.25 will be helpful.) Use the results from Exercises 7.131–7.134 to make your
solutions efficient.
7.144 (programming required) Generally, a user of a cryptographic system wants to send text rather than a number, so you’ll need to add a
capacity for converting text into an integer. And RSA will only support encrypting elements of Zn , not Z, so you’ll actually need to
convert the text into a sequence of elements of Zn . Write a pair of functions string->intlist(s, n) and intlist->string(L, n)
that convert between strings of characters and a list of elements from Zn . You may do this conversion in many ways, but you must
ensure that these operations are inverses of each other: if string->intlist(s∗ , n) = L∗ , then intlist->string(L∗ , n) = s∗ .
(Hint: the easiest approach is to view text as a sequence of ASCII symbols, each of which is an element of {0, 1, . . . , 255}. Thus
you can view your input text as a number written in base 256. Your output is a number written in base n. Use baseConvert from
p. 340.)
7.145 (programming required) Test your implementations from Exercises 7.142–7.144 by generating keys, encrypting, and decrypt-
ing using the primes p = 5,277,019,477,592,911 and q = 7,502,904,222,052,693, and the message "THE SECRET OF BEING
BORING IS TO SAY EVERYTHING." (Voltaire (1694–1778)).
7.146 (programming required) Complete the last missing piece of your RSA implementation: prime generation. The key generation
implementation from Exercise 7.142 relies on being given two prime numbers. Write a function that, given a (sufficiently large)
range of possible numbers between nmin and nmax , repeatedly does the following: choose a random integer between nmin and nmax ,
and test whether it’s prime using the Miller–Rabin test (see Exercise 7.116).

7.147 (programming required) The Chinese Remainder Theorem tells us that m ∈ Zpq is uniquely described by its value modulo p and
q—that is, m mod p and m mod q fully describe m. Here’s one way to improve the efficiency of RSA using this observation: instead
of computing m := cd mod pq directly, instead compute a := cd mod p and b := cd mod q. Then use the algorithm implicit in
Theorem 7.21 to compute the value m with m mod p = a and m mod q = b. Modify your implementation of RSA to use this idea.
7.148 Actually, instead of computing a := cd mod p and b := cd mod q in Exercise 7.147, we could have instead computed
a := cd mod p−1 mod p and b := cd mod q−1 mod q. Explain why this modification is valid. (This change can improve the effi-
ciency of RSA, because now both the base and the exponent may be substantially smaller than they were in the regular RSA
implementation.)

ISTUDY
382 Number Theory

7.6 Chapter at a Glance

Modular Arithmetic
Given integers k ≥ 1 and n, there exist unique integers d and r such that 0 ≤ r < k and kd + r = n. The
 
value of d is nk , the (whole) number of times k goes into n; the value of r is n mod k, the remainder when
we divide n by k.
Two integers a and b are equivalent or congruent mod n, written a ≡n b, if a and b have the same
remainder when divided by n—that is, when a mod n = b mod n. For expressions taken mod n, we can
always freely “reduce” mod n (subtracting multiples of n) before performing addition or multiplication.
(See Theorem 7.4.)
We write k | n to denote the proposition that n mod k = 0. If k | n, we say that k (evenly) divides n, that k
is a factor of n, and that n is a multiple of k. See Theorem 7.7 for some useful properties of divisibility: for
example, if a | b then, for any integer c, it’s also the case that a divides bc as well. The greatest common
divisor GCD(n, m) of two positive integers n and m is the largest d that evenly divides both n and m; the
least common multiple is the smallest d ∈ Z≥1 that n and m both evenly divide. GCDs can be computed
efficiently using the Euclidean algorithm. (See Figure 7.30.)

Primality and Relative Primality


An integer p ≥ 2 is prime if the only positive integers that evenly divide it are 1 and p itself; an integer
n ≥ 2 that is not prime is called composite. (Note that 1 is neither prime nor composite.) Let primes(n)
denote the number of prime numbers less or equal than n. The Prime Number Theorem states that, as n gets
large, the ratio between primes(n) and logn n converges (slowly!) to 1. Every positive integer can be factored
into a product of zero or more prime numbers, and that factorization is unique up to the ordering of the
factors.
Two positive integers n and m are called relatively prime if they have no common factors aside from 1—
that is, if GCD(n, m) = 1. A tweak to the Euclidean algorithm, called the extended Euclidean algorithm,
takes arbitrary positive integers n and m as input, and (efficiently) computes three integers x, y, r such that
r = GCD(n, m) = xn + ym. (See Figure 7.31.) We can determine whether n and m are relatively prime
using the (extended) Euclidean algorithm.
Qk
Let n1 , n2 , . . . , nk be a collection of integers, any pair of which is relatively prime. Let N = i=1 ni .
Writing Zm to denote {0, 1, . . . , m − 1}, the Chinese Remainder Theorem states that, for any sequence of
values ⟨a1 , . . . , ak ⟩ with each ai ∈ Zni , there exists one and only one integer x ∈ ZN such that x mod ni = ai
for all 1 ≤ i ≤ k.

extended-Euclid(n, m):
Euclid(n, m): Input: positive integers n and m ≥ n.
Input: positive integers n and m ≥ n Output: x, y, r ∈ Z where GCD(n, m) = r = xn + ym
Output: GCD(n, m) 1 if m mod n = 0 then
1 if m mod n = 0 then 2 return 1, 0, n // 1 · n + 0 · m = n = GCD(n, m)
2 return n 3 else
3 else 4 x, y, r := extended-Euclid(m mod n, n)
 
4 return Euclid(m mod n, n) 5 return y − mn · x, x, r

Figure 7.30 The Euclidean algorithm for GCDs. Figure 7.31 The extended Euclidean algorithm.

ISTUDY
7.6 Chapter at a Glance 383

Multiplicative Inverses
For any integer n ≥ 2, let Zn denote the set {0, 1, . . . , n − 1}. Let a ∈ Zn be arbitrary. The multiplicative
inverse of a in Zn is the number a−1 ∈ Zn such that a · a−1 ≡n 1 if any such number exists. (If no such
number exists, then a−1 is undefined.) For example, the multiplicative inverse of 2 in Z9 is 2−1 = 5 because
2 · 5 = 10 ≡9 1; the multiplicative inverse of 1 in Z9 is 1−1 = 1 because 1 · 1 ≡9 1; and the multiplicative
inverse of 3 in Z9 is undefined (because 3a ̸≡9 1 for any a ∈ Z9 ).
Let n ≥ 2 and a ∈ Zn . The multiplicative inverse a−1 exists in Zn if and only if n and a are relatively
prime. Furthermore, when a−1 exists, we can find it using the extended Euclidean algorithm. We compute
⟨x, y, r⟩ := extended-Euclid(a, n); when GCD(a, n) = 1 (as it is when a and n are relatively prime), the
returned values satisfy xa + yn = 1, and thus a−1 := x mod n is the multiplicative inverse of a in Zn . For
a prime number p, every nonzero a ∈ Zp has a multiplicative inverse in Zp .
Fermat’s Little Theorem states that, for any prime p and any integer a with p ̸ | a, the (p − 1)st power
of a must equal 1 modulo p. (That is: for prime p and nonzero a ∈ Zp , we have ap−1 ≡p 1. For example,
because 17 is prime, Fermat’s Little Theorem—or arithmetic!—tells us that 516 mod 17 = 1.)

Cryptography
A sender (“Alice”) wants to send a private message to a receiver (“Bob”), but they can only communicate
using a channel that can be overheard by an eavesdropper (“Eve”). In cryptography, Alice encrypts the
message m (the “plaintext”) and transmits the encrypted version c (the “ciphertext”); Bob then decrypts it
to recover the original message m. The simplest way to achieve this goal is with a one-time pad: Alice and
Bob agree on a shared secret bitstring k; the ciphertext is the bitwise XOR of m and k, and Bob decrypts
by computing the bitwise XOR of c and k.
A more useful infrastructure is public-key cryptography, in which Alice and Bob do not have to com-
municate a secret in advance. Every user has a public key and a (mathematically related) private key; to
communicate with Bob, Alice uses Bob’s public key for encryption (and Bob uses his private key for
decryption). The RSA cryptosystem is a widely used protocol for public-key cryptography; it works as
follows:

• Key generation: Bob finds large primes p and q; he chooses an e ̸= 1 that’s relatively prime to
(p − 1)(q − 1); and he computes d := e−1 modulo (p − 1)(q − 1). Bob’s public key is ⟨e, n⟩ and his
private key is ⟨d, n⟩, where n := pq.
• Encryption: When Alice wants to send m to Bob, she encrypts m as c := me mod n.
• Decryption: Bob decrypts c as cd mod n.

By our choices of n, p, q, d, and e, Fermat’s Little Theorem allows us to prove that Bob’s decryption of
the encryption of message m is always the original message m itself. And, under commonly held beliefs
about the difficulty of factoring large numbers (and computing “eth roots mod n”), Eve cannot compute m
without spending an implausibly large amount of computation time.

ISTUDY
384 Number Theory

Key Terms and Results

Key Terms Key Results


Modular Arithmetic Modular Arithmetic
n
• modulus; n mod k and k 1 For any integers k ≥ 1 and n, there exist unique integers
• congruence/equivalence (≡n ) d and r such that 0 ≤ r < k and kd + r = n. (And
 
• (evenly) divides, factor, multiple r = n mod k and d = nk .)
• greatest common divisor 2 For arbitrary positive integers n and m ≥ n, the
• least common multiple Euclidean algorithm efficiently computes GCD(n, m).
• Euclidean algorithm
Primality and Relative Primality
Primality and Relative Primality
1 The Prime Number Theorem: as n gets large, the ratio
• prime vs. composite numbers between logn n and the number of primes less than or
• Prime Number Theorem equal to n approaches 1.
• prime factorization 2 Every positive integer has a prime factorization (which
• relative primality is unique up to reordering).
• extended Euclidean algorithm 3 Given positive integers n and m, the extended Euclidean
• Chinese Remainder Theorem algorithm efficiently computes three integers x, y, r such
that r = GCD(n, m) = xn + ym.
Multiplicative Inverses 4 The Chinese Remainder Theorem: Suppose
n1 , n2 , . . . , nk are all relatively prime, and let
• Zn Qk
N = i=1 ni . Then, for any ⟨a1 , . . . , ak ⟩ with each
• multiplicative inverse (a−1 in Zn )
ai ∈ Zni , there exists a unique x ∈ ZN such that
• Fermat’s Little Theorem
x mod ni = ai for all 1 ≤ i ≤ k.
• Carmichael number
Multiplicative Inverses
Cryptography
1 In Zn , the multiplicative inverse a−1 of a exists if and
• Alice, Bob, Eve
only if n and a are relatively prime. When it does exist,
• plaintext, ciphertext
we can find a−1 using the extended Euclidean algorithm.
• one-time pad
2 Fermat’s Little Theorem: for any prime number p and
• public-key cryptography
any nonzero a ∈ Zp , we have ap−1 ≡p 1.
• public key; private key
• key generation; encryption/decryption
Cryptography
• RSA
1 In the RSA cryptosystem, Alice can use Bob’s public
key to encrypt a message m so that Bob can decrypt it
efficiently. (And, under reasonable assumptions about
certain numerical problems’ hardness, Eve can’t recover
m without an exorbitant amount of computation.)

ISTUDY
8 Relations

In which our heroes navigate a sea of many related perils, some of


which turn out to be precisely equivalent to each other.

ISTUDY
385
386 Relations

8.1 Why You Might Care


People respond in accordance to how you relate to them.
Nelson Mandela (1918–2013), on choosing reconciliation over vengeance
Interview with Reader’s Digest (April 2005)

Imagine writing a program to implement a student registration system at a college or university. When a
student is registering for classes, you’ll need to be able to answer questions of the form “is Alice eligible to
be added to the roster for Price Theory?” to decide whether to allow her to click to add that particular course.
To do so, you’ll need to know Price Theory’s prerequisites: what classes must you have already passed
before you can take Price Theory? And you’ll need corresponding data about Alice’s academic history:
what classes has Alice already taken? You’d most likely use a database to actually store all of the necessary
records, but at its heart the key information is just two sets: a set prerequisiteOf ⊆ Courses × Courses,
and a set passed ⊆ Students × Courses. (Depending on the school’s rules, you might need some numerical
information, too, to ensure both that there’s an unfilled seat in the class that Alice can occupy, and that
Alice has room in her schedule for another class.) Then Alice is eligible to register for Price Theory only
if ⟨Alice, c⟩ ∈ passed for every course c such that ⟨c, Price Theory⟩ ∈ prerequisiteOf.
In this chapter, we’ll explore a generalization of functions, called relations, that—like prerequisiteOf
and passed—represent arbitrary subsets of A×B. (In Chapter 2, we saw functions, which map each element
of some input set A to an element of an output set B. A function is a special kind of relation where each
input element is related to one and only one element of the output set. For example, a large retailer might
be interested in the relation purchased, a subset of Customers × Products; notice that the same customer
may have purchased many different products—or one, or none at all—so purchased is not a function.)
Relations are the critical foundation of relational databases, an utterly widespread modern area of CS,
underlying many of the tools we all use regularly. (One classical special-purpose programming language
for relational databases is called SQL, for “structured query language”; there are other platforms, too.) A
relational database stores a (generally quite large!) collection of structured data. Logically, a database is
organized as a collection of tables, each of which represents a relation, where each row of a table represents
an element contained in that relation. Fundamental manipulations of these relations can then be used to
answer more sophisticated questions about the underlying data. For example, using standard operations in
relational databases (and the relations prerequisiteOf and passed above), we could compute things like (i)
a list of every class c for which you have satisfied all prerequisites of c but have not yet passed c; or (ii)
a list of people with whom you’ve taken at least one class; or (iii) a list of people p with whom you’ve
taken at least one class and where p has also taken at least one class that meets condition (i). (Those are the
friends you could ask for help when you take that class.) Or that large retailer might want, for a particular
user u, to find the 10 products not purchased by u that were most frequently purchased by other users who
share, say, at least half of their purchases with u. All of these queries—though sometimes rather brutally
complicated to state in English—can be expressed fairly naturally in the language of relations.
We’ll start in Section 8.2 with an introduction to the fundamental definitions relevant to relations. In
Section 8.3, we’ll look at a few properties—reflexivity, symmetry, and transitivity—that some relations
have. Finally, in Section 8.4, we’ll look at the special types of relations that result from particular com-
binations of those properties: equivalence relations (which divide the world into collections of mutually
equivalent items) and order relations (which rank everything in the world according to the relation, pos-
sibly with ties). And, along the way, we’ll encounter relational databases, along with regular expressions,
algorithmic bias, and applications to topics like asymptotics, voting, and computer graphics.

ISTUDY
8.2 Formal Introduction 387

8.2 Formal Introduction


A man is a bundle of relations, a knot of roots, whose flower and fruitage is the world.
Ralph Waldo Emerson (1803–1882)
“History,” Essays: First Series (1841)

Informally, a (binary) relation describes a pairwise relationship that holds for certain pairs of elements
from two sets A and B. As an example, here is one particular relation, expressing the “is a component of”
relationship between primary and secondary colors:

{⟨blue, green⟩, ⟨blue, purple⟩, ⟨red, orange⟩, ⟨red, purple⟩, ⟨yellow, green⟩, ⟨yellow, orange⟩}

In other words, these pairs represent a particular relation on the sets A = {red, yellow, blue} and
B = {green, purple, orange}. This description of a relation—a pairwise relationship between some of
the elements of two sets A and B—is obviously very general. But let’s start by considering a few specific
examples, which together begin to show the range of the kinds of properties that relations can represent:

Example 8.1: Satisfaction.


Let A = {f : truth assignments for p and q} and B = {φ : propositions over p and q}. One interesting
relation between elements of A and B denotes whether a particular truth assignment makes a particular
proposition true. (This relation is usually called satisfies.) For a proposition φ, a truth assignment f either
satisfies φ or it doesn’t satisfy φ. For example:

• the truth assignment [p = T; q = F] satisfies p ∨ q (as do all truth assignments save [p = F; q = F]);
• the truth assignment [p = T; q = F] satisfies p ∧ ¬q (and no other truth assignment does);
• every truth assignment in A satisfies p ∨ ¬p; and
• no truth assignment in A satisfies q ∧ ¬q.

(Thus an element of B might be satisfied by zero, one, or more elements of A. Similarly, an element of A
might satisfy many different elements of B.)

Example 8.2: Numbers that are not too different.


We’ll say that two real numbers x ∈ R and y ∈ R are withinHalf of each other if |x − y| ≤ 0.5.
Thus withinHalf is a relation between pairs of real numbers. For example, withinHalf(2.781828, 3.0) and
withinHalf(3.14159, 3.0) and withinHalf(2.5, 3.0) and withinHalf(2.5, 2.0).
Note that withinHalf(x, x) holds for any real number x.

Example 8.3: Being related to.


In keeping with the word “relation,” we actually use the phrase “is related to” in English to express one
specific binary relation on pairs of people—“being in the same family as” (or “being a (blood) relative
of”). For example, we can make the true claim that Rosemary Clooney is related to George Clooney.
(A related statement is also true: George Clooney is related to Rosemary Clooney. The fact that these
two statements convey the same information follows from the fact that the is related to relation has a
property called symmetry: for any x and y, it’s the case that x is related to y if and only if y is related to x.
Not all relations are symmetric, as we’ll see in Section 8.3.)

ISTUDY
388 Relations

Some qualitatively different types of relations are already peeking out in these few examples (and more
properties of relations will reveal themselves as we go further).
Sometimes the relation contains a finite number of pairs, as in the example of primary and secondary
colors; sometimes the relation contains an infinite number of pairs, as in withinHalf. Sometimes a particular
element x is related to every candidate element, sometimes to none.
Sometimes a relation connects elements from two different sets, as in Example 8.1 (satisfaction, which
connected truth assignments to propositions); sometimes it connects two elements from the same set, as in
Example 8.3 (“is a (blood) relative of,” which connects people to people).
And sometimes the relation has some special properties like reflexivity, in which every x is related to x
itself (as in withinHalf), or symmetry (as in “is a (blood) relative of”).

8.2.1 The Definition of a Relation, Formalized


Technically, a binary relation is simply a subset of the Cartesian product of two sets:

Definition 8.1: (Binary) relation.


A (binary) relation on A × B is a subset of A × B.
Often we’ll be interested in a relation on A × A, where the two sets are the same. If there is no danger
of confusion, we may refer to a subset of A × A as simply a relation on A.

Here are a few formal examples of relations:

Example 8.4: A few relations, formally.

• {⟨12, 1⟩, ⟨1, 2⟩, ⟨2, 3⟩, ⟨3, 4⟩, ⟨4, 5⟩, ⟨5, 6⟩, ⟨6, 7⟩, ⟨7, 8⟩, ⟨8, 9⟩, ⟨9, 10⟩, ⟨10, 11⟩, ⟨11, 12⟩} is a relation
on {1, . . . , 12}. (Informally, this relation expresses “is one hour before.”)
• | (“divides”) is a relation on Z, where | denotes the set {⟨d, n⟩ : n mod d = 0}.
• ≤ is a relation on R, where ≤ denotes the set {⟨x, y⟩ : x is no bigger than y}.
• As a reminder, the power set of a set S, denoted P(S), is the set of all subsets of S. For any set S, then,
we can define ⊆ as a relation on P(S), where ⊆ denotes the set
⊆ = {⟨A, B⟩ ∈ P(S) × P(S) : [∀x ∈ S : x ∈ A ⇒ x ∈ B]} .
For the set S = {1, 2}, for example, the relation ⊆ is
 
⟨∅, ∅⟩, ⟨∅, {1}⟩, ⟨∅, {2}⟩, ⟨∅, {1, 2}⟩,
⊆= .
⟨{1} , {1}⟩, ⟨{1} , {1, 2}⟩, ⟨{2} , {2}⟩, ⟨{2} , {1, 2}⟩, ⟨{1, 2} , {1, 2}⟩
• {⟨Ron Rivest, 2002⟩, ⟨Adi Shamir, 2002⟩, ⟨Len Adleman, 2002⟩, ⟨Alan Kay, 2003⟩, ⟨Vint Cerf, 2004⟩,
⟨Robert Kahn, 2004⟩, ⟨Peter Naur, 2005⟩, ⟨Frances Allen, 2006⟩} is a relation on the set People ×
{2002, 2003, 2004, 2005, 2006}, representing the relationship between people and any year in which
they won a Turing Award.

Taking it further: Rivest, Shamir, and Adleman won Turing Awards for their work in cryptography; see Section 7.5. Kay was an
inventor of the paradigm of object-oriented programming. Cerf and Kahn invented the communication protocols that undergird
the internet. Naur made crucial contributions to the design of programming languages, compilers, and software engineering.
Allen made foundational contributions to optimizing compilers and parallel computing.

ISTUDY
8.2 Formal Introduction 389

Example 8.5: Bitstring prefixes.


Let isPrefix denote the following relation: for two bitstrings x and y, we have ⟨x, y⟩ ∈ isPrefix if and only
if the bitstring y starts with precisely the symbols contained in x. (After the bits of x, the bitstring y may
contain zero or more additional bits.) For example, 001 is a prefix of 001110 and 001, but 001 is not a
prefix of 1001. Write down the relation isPrefix on bitstrings of length ≤ 2 explicitly, using set notation.
(Write ϵ to denote the empty string.)
Solution.
 
⟨ϵ, ϵ⟩, ⟨ϵ, 0⟩, ⟨ϵ, 1⟩, ⟨ϵ, 00⟩, ⟨ϵ, 01⟩, ⟨ϵ, 10⟩, ⟨ϵ, 11⟩, 
isPrefix = ⟨0, 0⟩, ⟨0, 00⟩, ⟨0, 01⟩, ⟨1, 1⟩, ⟨1, 10⟩, ⟨1, 11⟩,
 
⟨00, 00⟩, ⟨01, 01⟩, ⟨10, 10⟩, ⟨11, 11⟩

For some relations—for example, | and ≤ and ⊆ from Example 8.4—it’s traditional to write the symbol
for the relation between the elements that are being related, using so-called infix notation. (So we write
3 ≤ 3.5, rather than ⟨3, 3.5⟩ ∈ ≤.) In general, for a relation R, we may write either ⟨x, y⟩ ∈ R or x R y,
depending on context.
Taking it further: Most programming languages use infix notation in their expressions: that is, they place their operators
between their operands, as in (5 + 3) / 2 in Java or Python or C to denote the value 5+ 3
2 . But some programming languages,
like Postscript (the language commonly used by printers) or the language of Hewlett–Packard calculators, use postfix notation,
where the operator follows the operands. Other languages, like Scheme, use prefix notation, in which the operator comes before
the operands. (In Postscript, we would write 5 3 add 2 div; in Scheme, we’d write (/ (+ 5 3) 2).) While we’re all much
more accustomed to infix notation, one of the advantages of pre- or postfix notation is that the order of operations is unambiguous:
compare the ambiguous 5 + 3 / 2 to its two unambiguous postfix alternatives, namely 5 3 2 div add and 5 3 add 2 div.

Taking it further: Recall from Chapter 3 that we defined a predicate as a Boolean-valued function—that is, P is a func-
tion P : U → {True, False} for a set U, called the universe. (See Definition 3.20.) For example, we considered the predicate
Palphabetical (x, y) = “string x is alphabetically before string y.” Binary predicates—when the universe is a set of pairs U = A × B—
are very closely related to binary relations. The main difference is that in Chapter 3 we thought of a binary predicate P as a function
P : A × B → {True, False}, whereas here we’re thinking of a relation R on A × B as a set R ⊆ A × B of ordered pairs. For example,
the relation Ralphabetical is the set {⟨AA, AAH⟩, ⟨AA, AARDVARK⟩, . . . , ⟨ZYZZYVA, ZYZZYVAS⟩}. And Palphabetical (AA, AAH) = True,
Palphabetical (AA, ZYZZYVA) = True, and Palphabetical (BEFORE, AFTER) = False.
But there’s a direct translation between these two worldviews. For a relation R ⊆ A × B, define the predicate PR as
(
True if ⟨a, b⟩ ∈ R
PR (a, b) =
False if ⟨a, b⟩ ∈
/ R.

The function PR is known as the characteristic function of the set R: that is, it’s the function such that PR (x) = True if and only
if x ∈ R. (Palphabetical is the characteristic function of Ralphabetical .)
We can also go the other direction, and translate a Boolean-valued binary function into a relation. Given a predicate P : A × B →
{True, False}, define the relation RP = {⟨a, b⟩ : P(a, b)}—that is, define RP as the set of pairs for which the function P is true.
In either case, we have a direct correspondence between (i) the elements of the relation, and (ii) the inputs to the function that
make the output true.

Visualizing binary relations


For a relation R on A × B where both A and B are finite sets, instead of viewing R as a list of pairs, it can be
easier to think of R as a two-column table, where each row corresponds to an element ⟨a, b⟩ ∈ R. Alterna-
tively, we can visualize relations in a way similar to the way that we visualized functions in Chapter 2: we
place the elements of A in one column, the elements of B in a second column, and draw a line connecting

ISTUDY
390 Relations

Month Days Days Month

Jan 31 31 Jan
Feb 28 Jan 28 Feb
Feb 29 Feb 29 Feb
Mar 31 Mar 31 Mar
Apr 30 Apr 28 30 Apr
May 31 May 31 May
Jun 30 Jun 29 30 Jun
Jul 31 Jul 31 Jul
Aug 31 Aug 30 31 Aug
Sep 30 Sep 30 Sep
Oct 31 Oct 31 31 Oct
Nov 30 Nov 30 Nov
Dec 31 Dec 31 Dec

(a) The relation as a table. (b) The relation, shown visually. (c) The inverse of the relation.

Figure 8.1 In (a) and (b), the relation indicating the number of days per month. (Note that Feb is related to both 28 and
29.) The inverse of the relation (see Definition 8.2) is shown in (c).

a ∈ A to b ∈ B whenever ⟨a, b⟩ ∈ R. Note that when we drew functions using these two-column pictures,
every element in the left-hand column had exactly one line leaving it. That’s not necessarily true for a
relation; elements in the left-hand column could have none, one, or two or more lines leaving them.
Figures 8.1a and 8.1b show a relation represented in these two ways. (For a relation that’s a subset of
A×A, the graphical version of this two-column representation is less appropriate because there’s really only
one kind of element; see Section 8.3 for a different way of visualizing these relations, and see Figure 8.10a
for isPrefix as a specific example.)

8.2.2 Inverse and Composition of Binary Relations


Because a relation on A × B is simply a subset of A × B, we can combine relations on A × B using all the
normal set-theoretic operations: if R and S are both relations on A × B, then R ∪ S, R ∩ S, and R − S are
also relations on A × B, as is the set ∼R = {⟨a, b⟩ ∈ A × B : ⟨a, b⟩ ∈
/ R}.
But we can also generate new relations in ways that are specific to relations, rather than being generic
set operations. Two of the most common are the inverse of a relation (which turns a relation on A × B into
a relation on B × A by “flipping around” every pair in the relation) and the composition of two relations
(which turns two relations on A × B and B × C into a single relation on A × C, where a and c are related
if there’s a “two-hop” connection from a to c via some element b ∈ B).

Inverting a relation
Here is the formal definition of the inverse of a relation:

Definition 8.2: Inverse of a relation.


Let R be a relation on A × B. The inverse R−1 of R is a relation on B × A defined by R−1 =
{⟨b, a⟩ ∈ B × A : ⟨a, b⟩ ∈ R}.

Here are a few examples of the inverses of relations:

ISTUDY
8.2 Formal Introduction 391

Example 8.6: Some inverses.

• The inverse of the relation ≤ is the relation ≥.


• The inverse of the relation = is the relation = itself. (That is, = is its own inverse.)
• The inverse of the month–day relation from Figure 8.1a is shown in Figure 8.1c.
• Define the relation
R = {⟨1, 2⟩, ⟨1, 3⟩, ⟨1, 4⟩, ⟨1, 5⟩, ⟨1, 6⟩, ⟨2, 2⟩, ⟨2, 4⟩, ⟨2, 6⟩, ⟨3, 3⟩, ⟨3, 6⟩, ⟨4, 4⟩, ⟨5, 5⟩, ⟨6, 6⟩} .

The inverse of R is the relation

R−1 = {⟨2, 1⟩, ⟨3, 1⟩, ⟨4, 1⟩, ⟨5, 1⟩, ⟨6, 1⟩, ⟨2, 2⟩, ⟨4, 2⟩, ⟨6, 2⟩, ⟨3, 3⟩, ⟨6, 3⟩, ⟨4, 4⟩, ⟨5, 5⟩, ⟨6, 6⟩} .
(Note that R is {⟨d, n⟩ : d divides n}, and R−1 is {⟨n, d⟩ : n is a multiple of d}.)

Note that, as in the month–day example, the inverse of any relation shown in table form is simply the
relation resulting from swapping the two columns of the table.

Composing two relations


The second way of creating a new relation from existing relations is composition, which, informally, repre-
sents the successive “application” of two relations. Two elements x and y are related under the relation S◦R,
denoting the composition of two relations R and S, if there’s some intermediate element b that connects x
and y under R and S, respectively. (We already saw how to compose functions, in Section 2.5, by applying
one function immediately after the other. Functions are a special type of relation—see Section 8.2.3—and
the composition of functions will similarly be a special case of the composition of relations.) Let’s start
with an informal example to build some intuition:

Example 8.7: Relation composition, informally.


Consider two relations: allergicTo on People × Ingredients and containedIn on Ingredients × Entrees.
Then the composition of allergicTo and containedIn is a relation on People × Entrees identifying pairs
⟨p, e⟩ for which entree e contains at least one ingredient to which person p is allergic.

Here’s the formal definition:

Definition 8.3: Composition of two relations.


Let R be a relation on A × B and let S be a relation on B × C. The composition of R and S is a relation
on A × C, denoted S ◦ R, where ⟨a, c⟩ ∈ S ◦ R if and only if there exists b ∈ B such that ⟨a, b⟩ ∈ R and
⟨b, c⟩ ∈ S.

Warning! The composition of R and S is, as with functions, denoted S ◦ R: the function g ◦ f first applies f and then applies g, so
(g ◦ f)(x) gives the result g(f(x)). The order in which the relations are written may initially be confusing.

Perhaps the easiest way to understand the composition of relations is through the picture-based view that
we introduced in Figure 8.1b: the relation S ◦ R contains pairs of elements that are joined by “two-hop”
connections, where the first hop is defined by R and the second hop is defined by S. (See Figure 8.2a.)

Some examples of composing relations


Here are a few examples of the composition of some relations:

ISTUDY
392 Relations

(a) (b)
S◦R
R S R S
R S S◦R
a
b π π π
• 0 0 0
b √ √ √
a 3 3 3
• • c
c 1 √ 1 √ 1 √
d 2 2 2

Figure 8.2 The composition of two relations. In (a), the definition: a pair ⟨a, c⟩ is in S ◦ R when, for some b, both
⟨a, b⟩ ∈ R and also ⟨b, c⟩ ∈ S. In (b), a particular example. (See Example 8.8.)

Example 8.8: The composition of two small relations.


Let R = {⟨0, a⟩, ⟨0, √
b⟩, ⟨0, c⟩,√⟨1, c⟩, ⟨1,
√ d⟩} be a relation on {0, 1} × {a, b, c, d}.
Let S = {⟨b, π⟩, ⟨b, 3⟩, ⟨c, 2⟩, ⟨d, 2⟩} be a relation on {a, b, c, d} × R.
Then S ◦ R ⊆ {0, 1} × R is the relation that consists of all pairs ⟨x, z⟩ such that there exists an element
y ∈ {a, b, c, d} where ⟨x, y⟩ ∈ R and ⟨y, z⟩ ∈ S. That is,
 
√ √ √
S◦R= ⟨0, π⟩ , ⟨0, 3⟩ , ⟨0, 2⟩ , ⟨1, 2⟩ .
because of b because of b because of c because of c and d

See Figure
√ 8.2b
√ for the visual representation
√ of this composition: because there are “two-hop” paths from
0 to {π, 3, 2} and from 1 to { 2}, the relation S ◦ R is as described. (Again: the relation S ◦ R consists
of pairs related by a two-step chain, with the first step under R and the second under S.)

Here’s a second example, this time where the relations being composed are more meaningful:

Example 8.9: Relations in the U.S. Senate.


The U.S. Senate has two senators from each state, each of whom is affiliated with zero or one political
parties. See Figure 8.3 for two relations: the relation S, between all U.S. states whose names start with
the letter “I” and the senators who represented them in the year 1993 (after the 1992 elections, which
more than doubled the number of women in the Senate); and the relation T, between senators and their
political party. Figure 8.3c shows the composition of S and T, which is a relation between IStates and
Parties, where ⟨state, party⟩ ∈ T ◦ S if one or both of the senators from state are affiliated with party.

Taking it further: The relation T ⊆ Senators × Parties, between senators and their political party, is way of representing
what’s sometimes called an affiliation network: we have a collection of individuals (the senators) and groups/organizations (the
parties), and information about which individuals are part of which groups (T). Affiliations in other context make sense, too; for
example, you might define a relation R on students and university organizations, so that R tells you which people are members
of the ballroom dance team or the robotics club and so forth. Some companies have tried to use a relation like R as part of a
system to automate their résumé-screening and hiring processes—to some disastrous effects around algorithmic bias (in which,
unintentionally, the automated system turned out to be making sexist and racist hiring decisions). See p. 399 for more.

So far we’ve considered composing relations on A × B and B × C for three distinct sets A, B, and C. But
we can also consider a relation R ⊆ A × A, and in this case we can also compose R with itself.
Problem-solving tip: Just as you do with a program, always make sure that your mathematical expressions “type check.” (For
example, just as the Python expression 0.33 * "atomic" doesn’t make sense, the composition R ◦ R for the relation R =
{⟨1, A⟩, ⟨2, B⟩} doesn’t denote anything useful.)

ISTUDY
8.2 Formal Introduction 393

(a) Senators (b) Senators (c)

Dirk Kempthorne Dirk Kempthorne Parties Parties


States States
Larry Craig Larry Craig Democratic Democratic
Idaho Idaho
Paul Simon Paul Simon
Illinois Illinois
Carol Moseley Braun Carol Moseley Braun
Indiana Indiana
Richard Lugar Richard Lugar
Iowa Republican Iowa Republican
Dan Coats Dan Coats
Tom Harkin Tom Harkin
Chuck Grassley Chuck Grassley

Figure 8.3 Two relations and their composition: (a) the relation S ⊆ IStates × Senators of each state’s senators; (b) the
relation T ⊆ Senators × Parties of each senator’s party affiliation; and (c) their composition T ◦ S ⊆ IStates × Parties.

Example 8.10: Composing a relation with itself.


For each of the following relations R on Z≥1 , describe the relation R ◦ R:
1 successor, namely the set {⟨n, n + 1⟩ : n ∈ Z≥1 }.
2 =, namely the set {⟨n, n⟩ : n ∈ Z≥1 }.
3 relativelyPrime = {⟨n, m⟩ : GCD(n, m) = 1}, the set of pairs of relatively prime (positive) integers.

Solution. For successor: by definition, ⟨x, z⟩ ∈ successor ◦ successor if and only if there exists an integer
y such that both ⟨x, y⟩ ∈ successor and ⟨y, z⟩ ∈ successor. Thus the only possible y is y = x + 1, and the
only possible z is z = y + 1 = x + 2. Thus successor ◦ successor = {⟨n, n + 2⟩ : n ∈ Z≥1 } .
For equality (we’ll write equals instead of =; otherwise the notation becomes indecipherable): by def-
inition, the pair ⟨x, z⟩ is in the relation equals ◦ equals if and only if there exists an integer y such that
x = y and y = z. But that’s true if and only if x = z. That is, ⟨x, z⟩ ∈ equals ◦ equals if and only if
⟨x, z⟩ ∈ equals. Thus composing equals with itself doesn’t do anything: equals ◦ equals = equals.
For relative primality: we must identify all pairs ⟨x, z⟩ ∈ Z≥1 × Z≥1 such that there exists an integer y
where ⟨x, y⟩ ∈ relativelyPrime and ⟨y, z⟩ ∈ relativelyPrime. But notice that y = 1 is relatively prime
to every positive integer. Thus, for any ⟨x, z⟩ ∈ Z≥1 × Z≥1 , we have that ⟨x, 1⟩ ∈ relativelyPrime and
⟨1, z⟩ ∈ relativelyPrime. Thus relativelyPrime ◦ relativelyPrime = Z≥1 × Z≥1 .

An example of composing a relation with its own inverse


We’ll close with one last example of composing relations, this time by taking the composition of a relation
R and its inverse R−1 :

Example 8.11: Composing a relation and its inverse.


Let R ⊆ M × D be the relation between the months and the numbers of days in that month, and let
R−1 ⊆ D × M be its inverse. (See Figure 8.1.) What is R−1 ◦ R?
Solution. Because R ⊆ M × D and R−1 ⊆ D × M, we know that R−1 ◦ R ⊆ M × M. We have to identify
⟨x, y⟩ ∈ M × M such that ∃z ∈ D : ⟨x, z⟩ ∈ R and ⟨z, y⟩ ∈ R−1
⇔ ∃z ∈ D : ⟨x, z⟩ ∈ R and ⟨y, z⟩ ∈ R. definition of inverse

ISTUDY
394 Relations

(a) R R−1 (b)


Jan Jan  

 ⟨Jan, Jan⟩, ⟨Jan, Mar⟩, ⟨Jan, May⟩, ⟨Jan, Jul⟩, ⟨Jan, Aug⟩, ⟨Jan, Oct⟩, ⟨Jan, Dec⟩, 

Feb Feb 
 

 ⟨Mar, Jan⟩, ⟨Mar, Mar⟩, ⟨Mar, May⟩, ⟨Mar, Jul⟩, ⟨Mar, Aug⟩, ⟨Mar, Oct⟩, ⟨Mar, Dec⟩,  

Mar Mar 
 


 ⟨ May , Jan ⟩, ⟨ May , Mar ⟩, ⟨ May , May ⟩, ⟨ May , Jul ⟩, ⟨ May , Aug ⟩, ⟨ May , Oct ⟩, ⟨ May , Dec ⟩, 

Apr Apr 
 

28 
 

⟨Jul, Jan⟩, ⟨Jul, Mar⟩, ⟨Jul, May⟩, ⟨Jul, Jul⟩, ⟨Jul, Aug⟩, ⟨Jul, Oct⟩, ⟨Jul, Dec⟩,
 

May May 
 


 ⟨ Oct , Jan ⟩, ⟨ Oct , Mar ⟩, ⟨ Oct , May ⟩, ⟨ Oct , Jul ⟩, ⟨ Oct , Aug ⟩, ⟨ Oct , Oct ⟩, ⟨ Oct , Dec ⟩, 


⟨Dec, Jan⟩, ⟨Dec, Mar⟩, ⟨Dec, May⟩, ⟨Dec, Jul⟩, ⟨Dec, Aug⟩, ⟨Dec, Oct⟩, ⟨Dec, Dec⟩, 

Jun 29 Jun
Jul Jul 
 


 ⟨Apr, Apr⟩, ⟨Apr, Jun⟩, ⟨Apr, Sep⟩, ⟨Apr, Nov⟩, 

Aug 30 Aug 
 


 ⟨Jun, Apr⟩, ⟨Jun, Jun⟩, ⟨Jun, Sep⟩, ⟨Jun, Nov⟩, 


 

Sep Sep 
 

⟨Sep, Apr⟩, ⟨Sep, Jun⟩, ⟨Sep, Sep⟩, ⟨Sep, Nov⟩,





Oct 31 Oct 
 ⟨Nov, Apr⟩, ⟨Nov, Jun⟩, ⟨Nov, Sep⟩, ⟨Nov, Nov⟩, 


 


 

Nov Nov  
⟨Feb, Feb⟩
Dec Dec

Figure 8.4 R−1 ◦ R, for the relations R and R−1 from Figure 8.1, shown in two ways. In (a), the composition consists of
every pair of elements joined by a two-hop path; in (b), the set R−1 ◦ R is listed explicitly.

In other words, we seek pairs of months that are related by R to at least one of the same values. The
exhaustive list of pairs in R−1 ◦ R is shown in Figure 8.4.
Note that R−1 ◦ R is different from R ◦ R−1 : the latter is the set of numbers that are related by R−1 to at
least one of the same months, while the former is the set of months that are related by R to at least one of
the same numbers. Thus R ◦ R−1 = {⟨31, 31⟩, ⟨30, 30⟩, ⟨29, 29⟩, ⟨28, 28⟩, ⟨28, 29⟩, ⟨29, 28⟩}. (The only
distinct numbers related by R ◦ R−1 are 28 and 29, because of February.)

The relation R−1 ◦ R from Example 8.11 has a special form: this relation “partitions” the 12 months
into three clusters—the 31-day months, the 30-day months, and February—so that any two months in the
same cluster are related by R−1 ◦ R, and no two months in different clusters are related by R−1 ◦ R. (See
Figure 8.10b.) A relation with this structure, where elements are partitioned into clusters (and two elements
are related if and only if they’re in the same cluster) is called an equivalence relation; see Section 8.4.1.

8.2.3 Functions as Relations


Back in Chapter 2, we defined a function as something that maps each element of the set of legal inputs
(the domain) to an element of the set of legal outputs (the range):
Definition 2.46 (functions): Let A and B be sets. A function f from A to B, written f : A → B, assigns to each input value
a ∈ A a unique output value b ∈ B; the unique value b assigned to a is denoted by f(a). We sometimes say that f maps a
to f(a).

While we’ve begun this chapter defining relations as a completely different kind of thing from
functions, actually functions are just a special type of relation. For example, the “one hour
later than” relation {⟨12, 1⟩, ⟨1, 2⟩, . . . , ⟨10, 11⟩, ⟨11, 12⟩} from Example 8.4 really is a function
f : {1, . . . , 12} → {1, . . . , 12}, where we could write f more compactly as f(x) = (x mod 12) + 1.
In general, to think of a function f : A → B as a relation, we will view f as defining the set of ordered
pairs ⟨x, f(x)⟩ for each x ∈ A, rather than as a mapping:

Definition 8.4: Functions, viewed as relations.


Let A and B be sets. A function f from A to B, written f : A → B, is a relation on A × B with the additional
property that, for every a ∈ A, there exists one and only one element b ∈ B such that ⟨a, b⟩ ∈ f.

ISTUDY
8.2 Formal Introduction 395

That is, we view the function f : A → B as the set F = {⟨x, f(x)⟩ : x ∈ A}, which is a subset of A × B. The
restriction of the definition requires that F has a unique output defined for every input: there cannot be two
distinct pairs ⟨x, y⟩ and ⟨x, y′ ⟩ in F, and furthermore there cannot be any x for which there’s no ⟨x, •⟩ in F.

Example 8.12: A function as a relation.


(Write Z11 to denote {0, 1, 2, . . . , 10}, as in Chapter 7.) The function f : Z11 → Z11 defined as f(x) =
x2 mod 11 can be written as
{⟨0, 0⟩, ⟨1, 1⟩, ⟨2, 4⟩, ⟨3, 9⟩, ⟨4, 5⟩, ⟨5, 3⟩, ⟨6, 3⟩, ⟨7, 5⟩, ⟨8, 9⟩, ⟨9, 4⟩, ⟨10, 1⟩} .
Observe that f−1 , the inverse of f, is not a function—for example, the pairs ⟨5, 4⟩ and ⟨5, 7⟩ are both in
f−1 , and there is no element ⟨2, •⟩ ∈ f−1 . But f−1 is still a relation.

Example 8.13: Composing functions.


Suppose that f ⊆ A × B and g ⊆ B × C are functions (in the sense of Definition 8.4). Prove that the
relation g ◦ f is a function from A to C.
Solution. By definition, the composition of the relations f and g is
g ◦ f = {⟨x, z⟩ : there exists y such that ⟨x, y⟩ ∈ f and ⟨y, z⟩ ∈ g} .
Consider any x ∈ A. Because f is a function, there exists one and only one y∗ such that ⟨x, y∗ ⟩ ∈ f.
Furthermore, because g is a function, for this particular y∗ there is a unique z with ⟨y∗ , z⟩ ∈ g. Thus there
exists one and only one z such that ⟨x, z⟩ ∈ g ◦ f. By definition, then, the relation g ◦ f is a function.

Under this functions-as-relations view, the definitions of the inverse and composition of functions—
Definitions 2.50 and 2.54—precisely line up with the definitions of the inverse and composition of relations
from this section. Furthermore, if a function is just a special type of relation, then the special types of
functions that we defined in Chapter 2—one-to-one and onto functions—are just further restrictions on
relations. Under the relation-based view of functions, the function f ⊆ A × B is called one-to-one if, for
every b ∈ B, there exists at most one element a ∈ A such that ⟨a, b⟩ ∈ f. The function f ⊆ A × B is called
onto if, for every b ∈ B, there exists at least one element a ∈ A such that ⟨a, b⟩ ∈ f.
If f ⊆ A × B is a function, then the inverse f−1 of f—that is, the set f−1 = {⟨b, a⟩ : ⟨a, b⟩ ∈ f}—is guar-
anteed to be a relation on B×A. But f−1 is a function from B to A if and only if f is both one-to-one and onto.
In Exercises 8.39–8.44, you’ll explore some other properties of the composition of functions/relations.

8.2.4 n-ary Relations


The relations that we’ve explored so far have all expressed relationships between two elements. But some
interesting properties might involve more than two entities; for example, you might assemble all of your
friends’ birthdays as a collection of triples of the form ⟨name, birthdate, birthyear⟩. Or we might consider
a relation on integers of the form ⟨a, b, k⟩ where a ≡k b. A relation involving tuples with n components,
called an n-ary relation, is a natural generalization of a (binary) relation:

Definition 8.5: n-ary relation.


An n-ary relation on the set A1 × A2 × · · · × An is a subset of A1 × A2 × · · · × An . If there is no danger
of confusion, we may refer to a subset of An as an n-ary relation on A.

ISTUDY
396 Relations

(continued) (continued)
Color R G B Color R G B Color R G B

Black 0 0 0 Lime 0 255 0 Purple 128 0 128


Blue 0 0 255 Magenta 255 0 255 Red 255 0 0
Cyan 0 255 255 Maroon 128 0 0 Teal 0 128 128
Gray 128 128 128 Navy 0 0 128 White 255 255 255
Green 0 128 0 Olive 128 128 0 Yellow 255 255 0

Figure 8.5 The 4-ary relation containing the full set of RGB colors with component values that are all drawn from
either {0, 128} or {0, 255}.

(We generally refer to 2-ary relations as binary relations and 3-ary relations as ternary relations.) Here are
a few examples:

Example 8.14: Summing to 8.


Define sumsTo8 as a ternary relation on the set {0, 1, 2, 3, 4}, where
sumsTo8 = {⟨a, b, c⟩ ∈ {0, 1, 2, 3, 4} × {0, 1, 2, 3, 4} × {0, 1, 2, 3, 4} : a + b + c = 8} .
Then sumsTo8 is
 
⟨0, 4, 4⟩, ⟨1, 3, 4⟩, ⟨1, 4, 3⟩, ⟨2, 2, 4⟩, ⟨2, 3, 3⟩, ⟨2, 4, 2⟩,
.
⟨3, 1, 4⟩, ⟨3, 2, 3⟩, ⟨3, 3, 2⟩, ⟨3, 4, 1⟩, ⟨4, 0, 4⟩, ⟨4, 1, 3⟩, ⟨4, 2, 2⟩, ⟨4, 3, 1⟩, ⟨4, 4, 0⟩

Example 8.15: Betweenness.



Define the set B = ⟨x, y, z⟩ ∈ R3 : x ≤ y ≤ z or x ≥ y ≥ z . Then B is a ternary relation on R that
expresses “betweenness”—that is, the triple ⟨x, y, z⟩ ∈ B if x, y, and z are in a consistent order (either
ascending or descending).
For example, we have ⟨−1, 0, 1⟩ ∈ B and ⟨6, 5, 4⟩ ∈ B, because −1 ≤ 0 ≤ 1 and 6 ≥ 5 ≥ 4. But
⟨−7, 8, −9⟩ ∈/ B, because these three numbers are neither in ascending order (because 8 ̸≤ −9) nor
descending order (because −7 ̸≥ 8).

Example 8.16: RGB colors.


A 4-ary relation on Names×{0, 1, . . . , 255}×{0, 1, . . . , 255}×{0, 1, . . . , 255} is shown in Figure 8.5: a
collection of colors, each with its official name in HTML/CSS and its red, green, and blue components—
all of which are elements of {0, 1, . . . , 255}. (HTML (hypertext markup language) and CSS (cascading
style sheet) are languages used to express the format, style, and layout of web pages.)

Taking it further: Databases—systems for storing and accessing collections of structured data—are a widespread modern
application of computer science. Databases store student records for registrars, account information for financial institutions, and
records of who liked whose posts on Facebook; in short, virtually every industrial system that has complex data with nontrivial
relationships among data elements is stored in a database. More specifically, a relational database stores information about a
collection of entities and relationships among those entities: fundamentally, a relational database is a collection of n-ary relations,
which can then be manipulated and queried in various ways. Designing databases well affects both how easy it is for a user to
pose the questions that they wish to ask about the data, and how efficiently answers to those questions can be computed. See
p. 398 for more on relational databases and how they connect with the types of relations that we’ve discussed so far.

ISTUDY
8.2 Formal Introduction 397

Expressing n-ary relations as a collection of binary relations


Non-binary relations, like those in the last few examples, represent complex interactions among more than
two entities. For example, the “betweenness” relation
B = {⟨x, y, z⟩ ∈ R3 : x ≤ y ≤ z or x ≥ y ≥ z}
from Example 8.15 fundamentally expresses a relationship regarding triples of numbers: for any three real
numbers x, y, and z, there are triples ⟨x, y, •⟩ ∈ B and ⟨•, y, z⟩ ∈ B and ⟨x, •, z⟩ ∈ B—but whether ⟨x, y, z⟩
itself is in the relation B genuinely depends on how all three numbers relate to each other. Similarly, the
sumsTo8 relation from Example 8.14 is a genuinely three-way relationship among elements—not some-
thing that can be directly reduced to a pair of pairwise relationships. But we can represent an n-ary relation
R by a collection of binary relations, if we’re a little creative in defining the sets that are being related.
(Decomposing n-ary relations into multiple binary relations may be helpful if we store this type of data in
a database; there may be advantages of clarity and efficiency in this view of an n-ary relation.)
This idea is perhaps easiest to see for the colors from Example 8.16: because each color name appears
once and only once in the table, we can treat the name as a unique “key” that allows us to treat the 4-ary
relation as three separate binary relations, corresponding to the red, green, and blue components of the
colors. (See Figure 8.6a.) But how would we represent an n-ary relation like the ternary sumsTo8 using
multiple binary relations? (Recall the relation
 
⟨0, 4, 4⟩, ⟨1, 3, 4⟩, ⟨1, 4, 3⟩, ⟨2, 2, 4⟩, ⟨2, 3, 3⟩, ⟨2, 4, 2⟩,
sumsTo8 =
⟨3, 1, 4⟩, ⟨3, 2, 3⟩, ⟨3, 3, 2⟩, ⟨3, 4, 1⟩, ⟨4, 0, 4⟩, ⟨4, 1, 3⟩, ⟨4, 2, 2⟩, ⟨4, 3, 1⟩, ⟨4, 4, 0⟩
from Example 8.14.) One idea is to introduce a new set of fake “entities” that correspond to each of the
tuples in sumsTo8, and then build binary relations between each component and this set of entities. For
example, define the set
E = {044, 134, 143, 224, 233, 242, 314, 323, 332, 341, 404, 413, 422, 431, 440} ,
and then define the three binary relations first, second, and third shown in Figure 8.6b. Now ⟨a, b, c⟩ ∈
sumsTo8 if and only if there exists an e ∈ E such that ⟨e, a⟩ ∈ first, ⟨e, b⟩ ∈ second, and ⟨e, c⟩ ∈ third.
(See Exercise 8.45 for a way to think of betweenness using binary relations.)

R G B first second third

Black 0 Black 0 Black 0 044 0 044 4 044 4


Blue 0 Blue 0 Blue 255 134 1 134 3 134 4
Cyan 0 Cyan 255 Cyan 255 143 1 143 4 143 3
Gray 128 Gray 128 Gray 128 224 2 224 2 224 4
Green 0 Green 128 Green 0 233 2 233 3 233 3
.. .. .. .. .. ..
. . . . . .

(a) The colors from the 4-ary relation in Example 8.16, (b) The relation sumsTo8, as three binary relations.
represented as three binary relations.

Figure 8.6 Representing n-ary relations as a collection of binary relations.

ISTUDY
398 Relations

Computer Science Connections

Relational Databases
A database is a (generally large!) collection of structured data. A user can both “query” the database (asking questions
about existing entries and edit it (adding or updating existing entries). The bulk of modern attention to databases
focuses on relational databases, based explicitly on the types of relations explored in this chapter. (Until a massively
influential early paper by Ted Codd (1923–2003) [30]—he later would win a Turing Award for this work—database
systems were generally based on rigid top-down organization of the data.)
In a relational database, the fun- name red green blue
damental unit of storage is the
table, which represents an n-ary Green 0 128 0 1 SELECT name, red
relation R ⊆ A1 × A2 × · · · × An . Lime 0 255 0 2 FROM colors

Magenta 255 0 255 3 WHERE green > blue;


A table consists of a collection of
columns, each of which represents Maroon 128 0 0
name red
a component of R; the columns are Navy 0 0 128
labeled with the name of the cor- Olive 128 128 0 Green 0
responding component so you can Purple 128 0 128 Lime 0
refer to columns by name rather Red 255 0 0 Olive 128
than just by index. The rows of the Teal 0 128 128 Yellow 255
White 255 255 255
table correspond to elements of the
Yellow 255 255 0
relation: that is, each row is a value
⟨a1 , a2 , . . . , an ⟩ that’s in R. Fig- Figure 8.7 Some RGB colors; and the SQL code and output resulting from
ure 8.7 shows an example table of selecting colors with green > blue and projecting to name, red.
this form, echoing Example 8.16.
Thus a relational database is at its essence a collection of n-ary relations. A common way to interact with this sort
of database is with a special-purpose programming language, often the language SQL. (“SQL” is short for Structured
Query Language; it’s pronounced either like “sequel” or by spelling out the letters [to rhyme with “Bless you, Mel!”].)
Operations on relational databases are based on three fundamental operations on n-ary relations. The first two basic
operations either choose some of the rows or some of the columns from an n-ary relation R ⊆ A1 × · · · × An :
• select: for a predicate φ on A1 × · · · × An , we can select those elements of R that satisfy φ.
• project: we can project R ⊆ A1 × · · · × An into a smaller set of columns by deleting some Ai s.
For example, we can select colors with blue component equal to zero, or project the colors relation down to just red
and blue values. See Figure 8.7 for an example. (In SQL, select and project operations are done with unified syntax.)
The third key operation in rela- S T
tional databases, called join, corre- state senator senator party 1 SELECT * FROM T
sponds closely to the composition 2 INNER JOIN S

of relations. In a join, we combine IA Grassley Coats R 3 ON T.senator = S.senator;


two relations by insisting that an IA Harkin Craig R
senator party state
identified shared column of the two ID Craig Grassley R
relations matches. Unlike with the ID Kempthorne Harkin D Coats R IN
composition of relations, we con- IL Moseley Braun Kempthorne R Craig R ID
tinue to include that matching col- IL Simon Lugar R Grassley R IA
umn in the resulting table. For two IN Coats Moseley Braun D ..
binary relations X ⊆ S × T and IN Lugar Simon D .
Y ⊆ T × U, the join of X and Y Figure 8.8 Joining S and T from Figure 8.3.
is a ternary relation on S × T × U,
defined as X 1 Y = {⟨a, c, b⟩ ∈ S × T × U : ⟨a, c⟩ ∈ X and ⟨c, b⟩ ∈ Y} . In SQL syntax, this operation is denoted
by INNER JOIN; see Figure 8.8.
(The description here barely scratches the surface of relational databases—there’s a full course’s worth of material
on databases (and then some!) that we’ve left out, including how these operations are implemented and how to design
databases to support efficient operations. For more, see a good book on databases, like [118].)

ISTUDY
8.2 Formal Introduction 399

Computer Science Connections

Automating Decisions, Facial Recognition, and Algorithmic Bias


Imagine that you been put in charge of hiring new university graduates as soft-
ware engineers for the company where you work. The process of screening
résumés can be tedious, and it’s potentially prone to whatever implicit biases
(about, say, age, or gender, or race, or degree-granting university) you may
happen to harbor. So you might think about automating the process, to ease
your workload and try to avoid human fallibilities. Here’s one natural way
to do it: first, you apply some kind of threshold test on applicants’ academic
qualifications, and, second, you look at characteristics of applicants, and try
to infer the likelihood of success at the job for someone with those partic-
ular characteristics. There are a lot of characteristics that you might choose
to use in this kind of analysis, but one reasonable choice is to look at uni-
versity organizations that your applicants have participated in: athletic teams,
musical groups, hackathons, whatever. So, you might look at the participa-
tion relation R ⊆ Applicants × Clubs, and then, to find the most promising
applicants in the pool, you might calculate the fraction fc the fraction of your
currently employed software ∑ engineers who were in each club c. Then you
could score an applicant a by c∈Clubs:⟨a,c⟩ fc (or something similar but more
sophisticated), and interview the applicants in descending order of their score.
The automated résumé-screening system described above is very similar to
a system that Amazon tried to build and deploy in 2018—which they quickly
scrapped after discovering exactly how it was performing. More specifically,
the automated system ended up being brutally sexist: for example, the word
“women” on a résumé, as in “women’s fencing team captain,” was causing the
score of that résumé to be massively decreased. (See [35] for the Reuters story
that broke the news.) The basic reason that the system was so badly biased
because of a particular overlooked piece of logic in the system. This system
is a fancy way of reproducing the existing biases in the workforce because
it tries to predict who will be a good future software engineer by comparing
characteristics to the good current software engineers (who were hired by the
previous generations of human screeners with their own biases).
Algorithmic bias refers to automated systems making discriminatory deci-
sions, typically by reproducing existing biases in their training data. Another
prominent example, among too many, is an algorithmic system that’s been
deployed to decide whether to grant parole [7]. In it, for instance, one of the
inputs to the probabilistic model of recidivism was the number of previous Images from official U.S. House of Representatives portraits

arrests for the individual—but differential policing can lead to more arrests Figure 8.9 New York State’s 2021
in some neighborhoods (correlated with race and socioeconomics). U.S. House of Representatives
There are other forms of algorithmic bias, too. It’s possible to build a sys- delegation, from District 1 (top left)
tem that looks like it does an excellent job at some task—and yet is still to District 27 (bottom right).
profoundly biased, in that its performance is much better when, say, the input
corresponds to a white man than when the input corresponds to a person with any other demographic characteristics.
For example, Joy Buolamwini and Timnit Gebru [24] showed a severe asymmetry of accuracy rates across skin tones
in modern commercial image-processing systems presented with faces, like those in Figure 8.9, as input. (These
systems can try to detect faces, or recognize faces, or predict the gender of faces, etc.—and the performance of all of
them was systematically worse when presented with images of individuals with darker skin tones.) For more on the
kinds of flaws arising in (computational and noncomputational) research resulting from building models based only
on men in training data, see Caroline Criado Perez’s Invisible Women [100]. And for more of the burgeoning research
by computer scientists on understanding and mitigating these kind of algorithmic bias issues, see the proceedings of
the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), facctconference.org.

ISTUDY
400 Relations

EXERCISES
Here are a few English-language descriptions of relations on a particular set. For each, write out (by exhaustive enumeration) the
full set of pairs in the relation, as we did in Examples 8.4 and 8.5. (Hint: It’s easy to miss an element of these relations if you solve
these problems by hand. Consider writing a small program to enumerate all pairs in these relations.)
8.1 divides, written |, on {1, 2, . . . , 8}, where ⟨d, n⟩ ∈ | if and only if n mod d = 0.
8.2 subset, written ⊂, on P({1, 2, 3}), where ⟨S, T⟩ ∈ ⊂ if and only if S ̸= T and ∀x : x ∈ S ⇒ x ∈ T.
8.3 isProperPrefix on bitstrings of length ≤ 3. A string x is a proper prefix of a string y if x starts with precisely the symbols of y,
followed by one or more other symbols. (See Example 8.5, but note that we are looking for proper prefixes here. The string x is
prefix, but not a proper prefix, of itself.)
8.4 isProperSubstring on bitstrings of length ≤ 3. For two strings x and y, we say that x is a substring of y if all of x appears consecutively
somewhere in y. And x is a proper substring of y if x is a substring of y but x = ̸ y.
8.5 isProperSubsequence on bitstrings of length ≤ 3. A string x is a subsequence of y if the symbols of x appear in order, but not
necessarily consecutively, in y. (For example, 001 is a substring of 1001 but not of 0101. But 001 is a subsequence of 1001 and
also of 0101.) Again, x is proper subsequence of y if x is a subsequence of y but x = ̸ y.
8.6 isAnagram on bitstrings of length ≤ 3. A string x is an anagram of a string y if x contains exactly the same symbols as y (with the
same number of copies of each symbol), but not necessarily in the same order.

Let ⊆ and ⊂ denote the subset and proper subset relations on P(Z). (That is, we have ⟨A, B⟩ ∈ ⊂ if A ⊆ B but A ̸= B.) What
relation is represented by each of the following?
8.7 ⊆ ∪ ⊂ 8.10 ⊂ ∩ ⊆
8.8 ⊆ − ⊂ 8.11 ∼⊂
8.9 ⊂ − ⊆

Define R = {⟨2, 2⟩, ⟨5, 1⟩, ⟨2, 3⟩, ⟨5, 2⟩, ⟨2, 1⟩} and S = {⟨3, 4⟩, ⟨5, 3⟩, ⟨6, 6⟩, ⟨1, 4⟩, ⟨4, 3⟩} as two relations on the set
{1, 2, 3, 4, 5, 6}. What pairs are in the following relations?
8.12 R− 1 8.16 S ◦ R
8.13 S− 1 8.17 R ◦ S−1
8.14 R◦R 8.18 S ◦ R−1
8.15 R◦S 8.19 S−1 ◦ S

Five so-called mother sauces of French cooking were codified by the chef Auguste Escoffier in the early twentieth century. (Many
other sauces—“daughter” or “secondary” sauces—used in French cooking are derived from these basic recipes.) They are:
• Sauce Béchamel is made of milk, butter, and flour.
• Sauce Espagnole is made of stock, butter, and flour.
• Sauce Hollandaise is made of egg, butter, and lemon juice.
• Sauce Velouté is made of stock, butter, and flour.
• Sauce Tomate is made of tomatoes, butter, and flour.
8.20 Write down the “is an ingredient of” relation on Ingredients × Sauces using the tabular representation of relations introduced in
Figure 8.1.
8.21 Writing R to denote the relation that you enumerated in Exercise 8.20, what is R ◦ R−1 ? Give both a list of elements and an
English-language description of what R ◦ R−1 represents.
8.22 For R from Exercise 8.20, what is R−1 ◦ R? Again, give both a list of elements and a description of the meaning.

Suppose that a Registrar’s office has computed the following relations:

taughtIn ⊆ Classes × Rooms taking ⊆ Students × Classes at ⊆ Classes × Times.

For the following exercises, express the given additional relation using taughtIn, taking, and at, plus relation composition and/or
inversion (and no other tools).
8.23 R ⊆ Students × Times, where ⟨s, t⟩ ∈ R indicates that student s is taking a class at time t.
8.24 R ⊆ Rooms × Times, where ⟨r, t⟩ ∈ R indicates that there is a class in room r at time t.
8.25 R ⊆ Students × Students, where ⟨s, s′ ⟩ ∈ R indicates that students s and s′ are taking at least one class in common.
8.26 R ⊆ Students × Students, where ⟨s, s′ ⟩ ∈ R indicates that there’s at least one time when s and s′ are both taking a class (but not
necessarily the same class).

ISTUDY
Exercises 401

Let parent ⊆ People × People denote the relation {⟨p, c⟩ : p is a parent of c}. (For the sake of simplicity over realism, assume
that there are no divorces, remarriages, widows, widowers, adoptions, single parents, etc. That is, you should assume that each
child has exactly two parents, and any two children who share one parent share both parents.) What familial relationships are
represented by the following relations?
8.27 parent ◦ parent 8.30 (parent−1 ) ◦ parent
8.28 −
(parent ) ◦ (parent )
1 − 1 8.31 parent ◦ parent ◦ (parent−1 ) ◦ (parent−1 )
8.29 parent ◦ (parent−1 ) 8.32 parent ◦ (parent−1 ) ◦ parent ◦ (parent−1 )
8.33 Suppose that the relations R ⊆ Z × Z and S ⊆ Z × Z contain, respectively, n pairs and m pairs of elements. In terms of n and m,
what’s the largest possible size of R ◦ S? The smallest?
8.34 For arbitrary relations R, S, and T, prove that R ◦ (S ◦ T) = (R ◦ S) ◦ T.
8.35 For arbitrary relations R and S, prove that (R ◦ S)−1 = (S−1 ◦ R−1 ).
8.36 Let R be any relation on A × B. Prove or disprove: ⟨x, x⟩ ∈ R ◦ R−1 for every x ∈ A.
8.37 What set is represented by the relation ≤ ◦ ≥, where ≤ and ≥ are relations on R?
8.38 What set is represented by the relation successor ◦ predecessor, for the relations successor = {⟨n, n + 1⟩ : n ∈ Z} and
predecessor = {⟨n, n − 1⟩ : n ∈ Z}?

Suppose that R ⊆ A × B and T ⊆ B × C are relations. The first few exercises below (Exercises 8.39–8.41) ask you to prove that
some properties of R and S translate into properties of T ◦ R. The next few exercises (Exercises 8.42–8.44) ask you to address the
converse of these results. Supposing that T ◦ R has the listed property, can you infer that both relations R and T have the same
property? Only R? Only T? Neither? Prove your answers.
8.39 Prove that, if R and T are both functions, then T ◦ R is a function too.
8.40 Prove that, if R and T are both one-to-one functions, then T ◦ R is one-to-one too.
8.41 Prove that, if R and T are both onto functions, then T ◦ R is onto too.
8.42 Assume that T ◦ R is a function. Must T be a function? R? Both?
8.43 Assume that T ◦ R is a one-to-one function and that R and T are both functions. Must T be one-to-one? R? Both?
8.44 Assume that T ◦ R is an onto function and that R and T are both functions. Must T be onto? R? Both?

On p. 398, we introduced three operations on relations that are used frequently in relational databases:
• select(R, P): choose a subset of an n-ary relation R, according to some condition P. (A “condition” P assigns true or false to
each element of the universe.)
• project(R, K): turn an n-ary relation R into an k-ary relation for some k ≤ n, by eliminating those columns of R that aren’t in
K. (The set K ⊆ {1, 2, . . . , n} identifies which columns of R to keep.)
• join(R, S): combine two binary relations R ⊆ A × B and S ⊆ B × C into a single ternary relation containing all triples ⟨a, b, c⟩
where ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S.
For example, let R = {⟨1, 2, 3⟩, ⟨4, 5, 6⟩}, let S = {⟨6, 7⟩, ⟨6, 8⟩}, and let T = {⟨7, 9⟩, ⟨7, 10⟩}. Then
• select(R, xzEven) = {⟨4, 5, 6⟩} for xzEven(x, y, z) = (2 | x) ∧ (2 | z).
• project(R, {1, 2}) = {⟨1, 2⟩, ⟨4, 5⟩} and project(R, {1, 3}) = {⟨1, 3⟩, ⟨4, 6⟩}.
• join(S, T) = {⟨6, 7, 9⟩, ⟨6, 7, 10⟩}.
Solve the following using the relation operators −1 (inverse), ◦ (composition), select, project, and join:

8.45 Recall from Example 8.15 the ternary “betweenness” relation B = ⟨x, y, z⟩ ∈ R3 : x ≤ y ≤ z or x ≥ y ≥ z . Show how to con-
struct B using only ≤, the relation operators (−1 , ◦, join, select, project), and standard set-theoretic operations (∪, ∩, ∼, −).

Figure 8.5 contains a reminder of the 4-ary relation C that lists several colors and their red, green, and blue components. Using
this relation C and select/project/join, write a set that corresponds to the following:
8.46 The names of all colors that have red component 0.
8.47 The names of all pairs of colors whose amount of blue is the same.
8.48 The names of all colors that are more blue than red.

Let X denote the set of color names from Figure 8.5. Define three relations Red, Green, and Blue on X × {0, 1, . . . , 255} such that
⟨x, r, g, b⟩ ∈ C if and only if ⟨x, r⟩ ∈ Red and ⟨x, g⟩ ∈ Green and ⟨x, b⟩ ∈ Blue. (In other words, Red = project(C, {1, 2}) and
Green = project(C, {1, 3}) and Blue = project(C, {1, 4}).)
8.49 Repeat Exercise 8.47 using only −1 , ◦, and the relations Red, Green, Blue, ≤, and =.
8.50 Repeat Exercise 8.48 using only −1 , ◦, and the relations Red, Green, Blue, ≤, and =. Or, at least, compute the set of ⟨x, x⟩ such
that x is the name of a color that’s more blue than red. (You may construct a relation R on colors, and then take R ∩ =.)

ISTUDY
402 Relations

8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity


There are two ways of spreading light; to be
The candle or the mirror that reflects it.
Edith Wharton (1862–1937)
“Vesalius in Zante (1564)” (1902)

Let R ⊆ A × A be a relation on a single set A (as in the successor or ≤ relations on Z, or the is a (blood)
relative of relation on people). We’ve seen a two-column approach to visualizing a relation R ⊆ A × B,
but this layout is misleading when the sets A and B are identical. (Weirdly, we’d have to draw each element
twice, in both the A column and the B column.) Instead, it will be more convenient to visualize a relation
R ⊆ A × A without differentiated columns, using a directed graph: we simply write down each element of
A, and draw an arrow from a1 to a2 for every pair ⟨a1 , a2 ⟩ ∈ R. (See Chapter 11 for much more on directed
graphs.) A few small examples are shown in Figure 8.10.
This directed-graph visualization of relations will provide a useful way of thinking intuitively about
relations in general—and about some specific types of relations in particular. There are several important
structural properties that some relations on A have (and that some relations do not), and we’ll explore these
properties throughout this section. We’ll consider three basic categories of properties:

reflexivity: whether elements are related to themselves. Is an element x necessarily related to x itself?

symmetry: whether order matters in the relation. If x and y are related, are y and x necessarily related too?

transitivity: whether chains of related pairs are themselves related. If x and y are related and y and z are
related, are x and z necessarily related too?

These properties turn out to characterize several important types of relations—for example, some relations
divide A into clusters of “equivalent” elements (as in Figure 8.10b), while other relations “order” A in some
consistent way (as in Figure 8.10a)—and we’ll see these special types of relations in Section 8.4. But first
we’ll examine these three categories of properties in turn, and then we’ll define closures of relations, which
expand any relation R as little as possible while ensuring that the expansion of R has any particular desired
subset of these properties.

(a) (b) (c)


Aug Oct
00

Jul Dec 0
0 10 1
01
May Jan 9 2
ϵ Mar

10 8 3
Nov Apr
1
Feb 7 4
11 Sep Jun 6 5

Figure 8.10 Visualizations of three relations: (a) isPrefix from Example 8.5 (prefixes of bitstrings); (b) months of the
same length, from Example 8.11; and (c) ⟨x, x2 mod 11⟩ for x ∈ Z11 , from Example 8.12.

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 403

8.3.1 Reflexivity
The reflexivity of a relation R ⊆ A × A is based on whether elements of A are related to themselves. (Latin:
re “back” + flect “bend.”) That is, are there pairs ⟨a, a⟩ in R? The relation R is reflexive if all of those ⟨a, a⟩
pairs are in R, and it’s irreflexive if none of them are:

Definition 8.6: Reflexive and irreflexive relations.


A relation R on A is reflexive if, for every x ∈ A, we have that ⟨x, x⟩ ∈ R. A relation R on A is irreflexive
if, for every x ∈ A, we have that ⟨x, x⟩ ∈/ R.

Using the visualization style from Figure 8.10, a relation is reflexive if every element a ∈ A has a “loop”
from a back to itself—and it’s irreflexive if no a ∈ A has a loop to itself. (See Figure 8.11.)

Example 8.17: Reflexivity of =, ≡17 , and ⟨x, x2 ⟩ mod 11.


Consider relations = and ≡17 on Z: that is, {⟨x, y⟩ : x = y} and {⟨x, y⟩ : x mod 17 = y mod 17}. Both of
these relations are reflexive, because x = x and x mod 17 = x mod 17 for any x ∈ Z.

But the relation R = ⟨x, x2 mod 11⟩ : x ∈ Z11 from Figure 8.10c is not reflexive, because (among
other examples) we have ⟨7, 7⟩ ∈ / R.

Note that relations can be neither reflexive nor irreflexive. For example, the relation S = {⟨0, 1⟩, ⟨1, 1⟩}
on {0, 1} isn’t reflexive (because ⟨0, 0⟩ ∈
/ S), but it’s also not irreflexive (because ⟨1, 1⟩ ∈ S).

Example 8.18: A few arithmetic relations.


Which of the following relations on Z≥1 are reflexive? Which are irreflexive?
R1 = {⟨n, m⟩ : m mod n = 0} divides
R2 = {⟨n, m⟩ : n > m} greater than
R3 = {⟨n, m⟩ : n ≤ m} less than or equal to

R4 = ⟨n, m⟩ : n2 = m square
R5 = {⟨n, m⟩ : n mod 5 = m mod 5} equivalent mod 5

Solution. | is reflexive. For any positive integer n, we have that n mod n = 0. Thus ⟨n, n⟩ ∈ R1 for any n.
> is irreflexive. For any n ∈ Z≥1 , we have that n ̸> n. Thus ⟨n, n⟩ ∈
/ R2 for any n.
≤ is reflexive. For any positive integer n, we have n ≤ n, so every ⟨n, n⟩ ∈ R3 .
square is neither reflexive nor irreflexive. The square relation is not reflexive because ⟨9, 9⟩ ∈
/ R4 and it
is also not irreflexive because ⟨1, 1⟩ ∈ R4 , for example. (That’s because 9 ̸= 9 , but 1 = 1 .)
2 2

≡5 is reflexive. For any n ∈ Z≥1 , we have n mod 5 = n mod 5, so ⟨n, n⟩ ∈ R5 .

(a) (b)

1 3 4 1 3 4
2 2

Figure 8.11 Reflexive and irreflexive relations: (a) a relation on A is reflexive if every a ∈ A has a self-loop (the thick
arrows); (b) a relation is irreflexive if no a ∈ A has a self-loop.

ISTUDY
404 Relations

Note again that, as with square, it is possible to be neither reflexive nor irreflexive. (But it’s not possible
to be both reflexive and irreflexive, as long as A ̸= ∅: for any specific a ∈ A, if ⟨a, a⟩ ∈ R, then R is not
irreflexive; if ⟨a, a⟩ ∈
/ R, then R is not reflexive.)

8.3.2 Symmetry
The symmetry of a relation R ⊆ A × A is based on whether the order of the elements in a pair matters.
(Greek: syn “same” + metron “measure.” Is R the same no matter which way you measure it?) That is, if
the pair ⟨a, b⟩ is in R, is the pair ⟨b, a⟩ always also in R? (Or is it never in R? Or sometimes but not always?)
The relation R is symmetric if, for every a and b, the pairs ⟨a, b⟩ and ⟨b, a⟩ are both in R or both not in
R. There are two accompanying notions: a relation R is antisymmetric if the only time ⟨a, b⟩ and ⟨b, a⟩ are
both in R is when a = b, and R is asymmetric if ⟨a, b⟩ and ⟨b, a⟩ are never both in R (whether a = b or
a≠ b). Here are the formal definitions:

Definition 8.7: Symmetric, antisymmetric, and asymmetric relations.


• A relation R on A is symmetric if, for every a ∈ A and b ∈ A, if ⟨a, b⟩ ∈ R then ⟨b, a⟩ ∈ R.
• A relation R on A is antisymmetric if, for every a ∈ A and b ∈ A such that ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R,
we have a = b.
• A relation R on A is asymmetric if, for every a ∈ A and b ∈ A, if ⟨a, b⟩ ∈ R then ⟨b, a⟩ ∈
/ R.

An important etymological note: anti- means “against” rather than “not.” Asymmetric (there is no ⟨a, b⟩ ∈ R when ⟨b, a⟩ ∈ R)
is different from antisymmetric (whenever ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R then a = b) is different from not symmetric (there is some
⟨a, b⟩ ∈ R but ⟨b, a⟩ ∈
/ R).

Again thinking about our visualization of relations: a relation is symmetric if every arrow a → b is matched
by an arrow b → a in the opposite direction. It’s antisymmetric if there are no matched bidirectional pairs
of arrows between two distinct elements a and b; and it’s asymmetric if there also aren’t even any self-
loops. (An a-to-a self-loop is, in a weird way, a “pair” of arrows a → b and b → a, just with a = b.) See
Figure 8.12.

Example 8.19: Some symmetric relations.


The relations
{⟨w, w′ ⟩ : w and w′ have the same length} (on the set of English words)
′ ′
{⟨s, s ⟩ : s and s sat next to each other in class today} (on the set of students)

are both symmetric. If w contains the same number of letters as w′ , then w′ also contains the same number
of letters as w. And if I sat next to you, then you sat next to me!
(The first relation is also reflexive—ZEUGMA contains the same number of letters as ZEUGMA—but the
latter is irreflexive, as no student sits beside herself in class. [zeugma, n.: grammatical device in which
words are used in parallel construction syntactically, but not semantically, as in Yesterday, Alice caught
a rainbow trout and hell from Bob for fishing all day.])

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 405

1 3 4 1 3 4 1 3 4
2 2 2

(a) R is symmetric if every a → b is (b) R is antisymmetric if no a ↔ b (c) R is asymmetric if it is


matched by b → a. exists for a ̸= b. Self-loops (like the antisymmetric and it also has no
thick arrow) are allowed. self-loops.

Figure 8.12 Symmetric, antisymmetric, and asymmetric relations.

Example 8.20: A few arithmetic relations, again.


Which of these relations (from Example 8.18) are symmetric? Antisymmetric? Asymmetric?
R1 = {⟨n, m⟩ : m mod n = 0} divides
R2 = {⟨n, m⟩ : n > m} greater than
R3 = {⟨n, m⟩ : n ≤ m} less than or equal to

R4 = ⟨n, m⟩ : n2 = m square
R5 = {⟨n, m⟩ : n mod 5 = m mod 5} equivalent mod 5

Solution. | is antisymmetric. Because n mod m = 0 and m mod n = 0 if and only if n = m, we know


that if ⟨n, m⟩ ∈ R1 and ⟨m, n⟩ ∈ R1 then n = m. But the relation is neither symmetric (for example, 3 | 6
but 6 ̸ | 3) nor asymmetric (for example, 3 | 3).
> is asymmetric (and therefore antisymmetric). If x < y then y ̸< x, even if x = y. So R2 is asymmetric,
which means that it is also antisymmetric.
≤ is antisymmetric. Similar to (1), R3 is antisymmetric: if x ≤ y and y ≤ x, then x = y. (But 3 ≤ 6 and
6 ̸≤ 3, and 3 ≤ 3, so R3 is neither symmetric nor asymmetric.)
square is antisymmetric. R4 is not symmetric because ⟨3, 9⟩ ∈ R4 but ⟨9, 3⟩ ∈
/ R4 , and it’s not asymmetric
because ⟨1, 1⟩ ∈ R4 . (That’s because 32 = 9 but 92 ̸= 3, and 12 = 1.) But it is antisymmetric: the only
way x2 = y and y2 = x is if x = y (specifically x = y = 0 or x = y = 1).
≡5 is symmetric. The “equivalent mod 5” relation is symmetric because equality is: for any n and m, we
have n mod 5 = m mod 5 if and only if m mod 5 = n mod 5. But it’s not antisymmetric: ⟨17, 202⟩ ∈ R5
and ⟨202, 17⟩ ∈ R5 .

Note that it is possible for a relation to be both symmetric and antisymmetric; see Exercise 8.70. And
it’s also possible for a relation R not to be symmetric, but also for R to fail to be either antisymmetric or
asymmetric:

Example 8.21: A non-symmetric, non-asymmetric, non-antisymmetric relation.


The relation R = {⟨0, 1⟩, ⟨0, 2⟩, ⟨1, 0⟩} on {0, 1, 2} isn’t symmetric (0 → 2 but 2 ̸→ 0), and it isn’t
asymmetric or antisymmetric (0 → 1 and 1 → 0 but 0 ̸= 1).

One other useful way to think about the symmetry (or antisymmetry/asymmetry) of a relation R is
by considering the inverse R−1 of R. Recall that R−1 reverses the direction of all of the arrows of R, so
⟨a, b⟩ ∈ R if and only if ⟨b, a⟩ ∈ R−1 . A symmetric relation is one in which every a → b arrow is matched
by a b → a arrow, so reversing the arrows doesn’t change the relation. For an antisymmetric relation R, the

ISTUDY
406 Relations

(a) (b) (c)

1 3 4 1 3 4 1 3 4
2 2 2

Figure 8.13 (a) A relation R, (b) its inverse R−1 , and (c) their intersection R ∩ R−1 . From (c), we see that R isn’t
symmetric (1 → 2 and 4 → 3 are missing), asymmetric (1 has a self-loop) or antisymmetric (2 ↔ 3 is present).

inverse R−1 has only self-loops in common with R. And an asymmetric relation has no arrows in common
with its inverse. (See Figure 8.13.) Specifically (see Exercises 8.67–8.69):

Theorem 8.8: Symmetry in terms of inverses.


Let R ⊆ A × A be a relation and let R−1 be its inverse. Then:

• R is symmetric if and only if R ∩ R−1 = R = R−1 .


• R is antisymmetric if and only if R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}.
• R is asymmetric if and only if R ∩ R−1 = ∅.

8.3.3 Transitivity
The transitivity of a relation R ⊆ A × A is based on whether the relation always contains a “short circuit”
from a to c whenever two pairs ⟨a, b⟩ and ⟨b, c⟩ are in R. (Latin: trans “across/through.”) An alternative
view is that a transitive relation R is one in which “applying R twice” doesn’t yield any new connections.
For example, consider the relation “lives in the same town as”: if a person x lives in the same town as a
person y you live in same town as, then in fact x directly (without reference to the intermediary y) lives in
the same town as you. Here is the formal definition:

Definition 8.9: Transitive relation.


A relation R on A is transitive if, for every a, b, c ∈ A, if ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ R, then ⟨a, c⟩ ∈ R too.

Or, using the visualization from Figure 8.10, a relation is transitive if there are no “open triangles”: if a → b
and b → c, then a → c. (In any “chain” of connected elements in a transitive relation, every element is
also connected to all elements that are “downstream” of it.) See Figure 8.14.

1 3 4 1 3 4
2 2

(a) A relation that is not transitive (the thick arrows (b) A transitive relation, with a highlighted
form an open triangle: 1 → 3 is missing). closed triangle.

Figure 8.14 Transitivity of relations. A relation on A is transitive if every triangle is closed.

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 407

Example 8.22: Some transitive relations.


The relations
{⟨w, w′ ⟩ : w and w′ have the same length} (on the set of English words)
′ ′
{⟨s, s ⟩ : s arrived in class before s today} (on the set of students)

are both transitive. If w contains the same number of letters as w′ , and w′ contains the same number of
letters as w′′ , then w certainly contains the same number of letters as w′′ too. And if Alice got to class
before Bob, and Bob got to class before Charlie, then Alice got to class before Charlie.

Example 8.23: A few arithmetic relations, one more time.


Which of the relations from Examples 8.18 and 8.20 are transitive?
R1 = {⟨n, m⟩ : m mod n = 0} divides
R2 = {⟨n, m⟩ : n > m} greater than
R3 = {⟨n, m⟩ : n ≤ m} less than or equal to

R4 = ⟨n, m⟩ : n2 = m square
R5 = {⟨n, m⟩ : n mod 5 = m mod 5} equivalent mod 5

Solution. | is transitive. Suppose that a | b and b | c. We need to show that a | c. But that’s not too hard:
by definition a | b and b | c mean that b = ak and c = bℓ for integers k and ℓ. Therefore c = a · (kℓ)—and
thus a | c. (This fact was Theorem 7.7.4.)
> is transitive. If x > y and y > z, then we know x > z.
≤ is transitive. Just as in (2), R3 is transitive: if x ≤ y and y ≤ z, then x ≤ z.
square is not transitive. The square relation isn’t transitive, because, for example, we have ⟨2, 4⟩ ∈ R4
and ⟨4, 16⟩ ∈ R4 —but ⟨2, 16⟩ ∈/ R4 . (That’s because 22 = 4 and 42 = 16 but 22 ̸= 16.)
≡5 is transitive. The “equivalent mod 5” relation is transitive because equality is: if n mod 5 = m mod 5
and m mod 5 = p mod 5, then n mod 5 = p mod 5.

While we can understand the transitivity of a relation R directly from Definition 8.9, we can also think about
the transitivity of R by considering the relationship between R and R ◦ R—that is, R and the composition
of R with itself. (Earlier we saw how to view the symmetry of R by connecting R and its inverse R−1 .)

Theorem 8.10: Transitivity in terms of self-composition.


Let R ⊆ A × A be a relation. Then R is transitive if and only if R ◦ R ⊆ R.

A proof of this theorem is deferred to the exercises (see Exercise 8.86).


Taking it further: Your preferences among a set of possibilities form a transitive relation: if you prefer chocolate ice cream to
mint, and mint to strawberry, then you surely prefer chocolate to strawberry, too. But if you and a bunch of friends (all of whom
may have different preferences about cuisines) want to go out to a restaurant, things become trickier; it’s possible that your
group’s collective preferences may fail to be transitive (even if each individual’s preferences are). The idea of a voting system
is to determine the group’s collective preferences based on its members’ individual preferences, and there are some troubling
paradoxes that arise in this voting context. (There’s also the related issue of how to implement actual voting systems, perhaps
electronically—while maintaining both secrecy and trust in the system.) See p. 414.

ISTUDY
408 Relations

8.3.4 Properties of Asymptotic Relationships


Now that we’ve introduced the three categories of properties of relations (reflexivity, symmetry, and tran-
sitivity), let’s consider one more set of relations in light of these properties: the asymptotics of functions.
Recall from Chapter 6 that, for two functions f : R≥0 → R≥0 and g : R≥0 → R≥0 , we say that
f(n) is O(g(n)) if and only if ∃n0 ≥ 0, c > 0 : [∀n ≥ n0 : f(n) ≤ c · g(n)] .
f(n) is Θ(g(n)) if and only if f(n) is O(g(n)) and g(n) is O(f(n)).
f(n) is o(g(n)) if and only if f(n) is O(g(n)) and g(n) is not O(f(n)).
(Actually we previously phrased the definitions of Θ(·) and o(·) in terms of Ω(·), but the definition we’ve
given here is completely equivalent, as proven in Exercise 6.30.) We can view these asymptotic properties

as relations on the set F = f : R≥0 → R≥0 of functions.

The standard asymptotic notation doesn’t match the standard notation for relations—we write f = Θ(g) rather than f Θ g or
⟨f, g⟩ ∈ Θ—but Θ genuinely is a relation on F, in the sense that some pairs of functions are related by Θ and some pairs are not.
And O and o are relations on F in the same way.

Example 8.24: O and Θ and o: reflexivity.


O is reflexive. For any function f, we can establish that f = O(f) by choosing the constants n0 = 1 and
c = 1, because it is immediate that ∀n ≥ 1 : f(n) ≤ 1 · f(n). Therefore O is reflexive, because every
function f satisfies f = O(f).
Θ is reflexive. This fact follows immediately from the fact that O is reflexive:
Θ is reflexive ⇔ ∀f ∈ F : f = Θ(f) definition of reflexivity
⇔ ∀f ∈ F : f = O(f) and f = O(f) definition of Θ
⇔ ∀f ∈ F : f = O(f). p∧p≡p

But ∀f ∈ F : f = O(f) is just the definition of O being reflexive, which we just established.
o is irreflexive. This fact follows by similar logic: for any function f ∈ F,
f = o(f) ⇔ f = O(f) and f =
̸ O(f),
by the definition of o(·). But p ∧ ¬p ≡ False (including when p is “f = O(f)”), so o is irreflexive.

Example 8.25: O and Θ and o: symmetry.


O is not symmetric, antisymmetric, or asymmetric. Define the functions t1 (n) = n and t2 (n) = n2 and
t3 (n) = 2n2 . O is not symmetric because, for example, t1 = O(t2 ) but t2 ̸= O(t1 ). O is not asymmetric
because, for example, t1 = O(t1 ). And O is not antisymmetric because, for example, t2 = O(t3 ) and
t3 = O(t2 ) but t2 ̸= t3 .
Θ is symmetric. This fact follows immediately by definition: for arbitrary f and g,
f = Θ(g) ⇔ f = O(g) and g = O(f) definition of Θ
⇔ g = O(f) and f = O(g) p∧q≡q∧p
⇔ g = Θ(f). definition of Θ

(Θ is not anti/asymmetric, because t2 = Θ(t3 ) for t2 (n) and t3 (n) defined above.)

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 409

o is asymmetric. This fact follows immediately, by similar logic: for arbitrary f and g, we have f = o(g)
and g = o(f) if and only if f = O(g) and g ≠ O(f) and g = O(f) and f ̸= O(g)—a contradiction! So if
f = o(g) then g ̸= o(f). Therefore o is asymmetric.

Exercises 6.18, 6.46, and 6.47 established that O, Θ, and o are all transitive.In sum: O is reflexive and
transitive (but not symmetric, asymmetric, or antisymmetric); o is irreflexive, asymmetric, and transitive;
and Θ is reflexive, symmetric, and transitive.

Taking it further: Among the computer scientists, philosophers, and mathematicians who study formal logic, there’s a special
kind of logic called modal logic that’s of significant interest. Modal logic extends the type of logic we introduced in Chapter 3
to also include logical statements about whether a true proposition is necessarily true or accidentally true. For example, the
proposition Canada won the 2014 Olympic gold medal in curling is true—but the gold-medal game could have turned out
differently and, if it had, that proposition would have been false. But Either it rained yesterday or it didn’t rain yesterday is true,
and there’s no possible scenario in which this proposition would have turned out to be false. We say that the former statement is
“accidentally” true (it was an “accident” of fate that the game turned out the way it did), but the latter is “necessarily” true.
In modal logic, we evaluate the truth value of a particular logical statement multiple times, once in each of a set W of so-called
possible worlds. Each possible world assigns truth values to every atomic proposition. Thus every logical proposition φ of the
form we saw in Chapter 3 has a truth value in each possible world w ∈ W. But there’s another layer to modal logic. In addition
to the set W, we are also given a relation R ⊆ W × W, where ⟨w, w′ ⟩ ∈ R indicates that w′ is possible relative to w. In addition
to the basic logical connectives from normal logic, we can also write two more types of propositions:

3φ “possibly φ” 3φ is true in w if ∃w′ ∈ W such that ⟨w, w′ ⟩ ∈ R and φ is true in w′ .


2φ “necessarily φ” 2φ is true in w if ∀w′ ∈ W such that ⟨w, w′ ⟩ ∈ R, φ is true in w′ .

Of course, these operators can be nested, so we might have a proposition like 2(3p ⇒ 2p).
Different assumptions about the relation R will allow us to use modal logic to model different types of interesting phenomena.
For example, we might want to insist that 2φ ⇒ φ (“if φ is necessarily true, then φ is true”: that is, if φ is true in every world
w′ ∈ W possible relative to w, then φ is true in w). This axiom corresponds to the relation R being reflexive: w is always possible
relative to w. Symmetry and transitivity correspond to the axioms φ ⇒ 23φ and 2φ ⇒ 22φ.
The general framework of modal logic (with different assumptions about R) has been used to represent logics of knowledge
(where 2φ corresponds to “I know φ”); logics of provability (where 2φ corresponds to “we can prove φ”); and logics of
possibility and necessity (where 2φ corresponds to “necessarily φ” and 3φ to “possibly φ”). Others have also studied temporal
logics (where 2φ corresponds to “always φ” and 3φ to “eventually φ”); these logical formalisms have proven to be very useful
in formally analyzing the correctness of programs. For a good introduction to modal logic, see [60].

8.3.5 Closures of Relations


Until now, in this section we’ve discussed some important properties that certain relations R ⊆ A × A may
or may not happen to have. We’ll close this section by looking at how to “force” the relation R to have
one or more of these properties. Specifically, we will introduce the closure of a relation with respect to a
property like symmetry: we’ll take a relation R and expand it into a relation R′ that has the desired property,
while adding as few pairs to R as possible. That is, the symmetric closure of R is the smallest set R′ ⊇ R
such that the relation R′ is symmetric. Here are the formal definitions:

Definition 8.11: Reflexive, symmetric, and transitive closures.


Let R ⊆ A × A be a relation. Then:
The reflexive closure of R is the smallest relation R′ ⊇ R such that R′ is reflexive.
The symmetric closure of R is the smallest relation R′′ ⊇ R such that R′′ is symmetric.
The transitive closure of R is the smallest relation R+ ⊇ R such that R+ is transitive.

ISTUDY
410 Relations

Taking it further: In general, a set S is said to be closed under the operation f if, whenever we apply f to an arbitrary element
of S (or to an arbitrary k-tuple of elements from S, if f takes k arguments), then the result is also an element of S. For example,
the integers are closed under + and ·, because the sum of two integers is always an integer, as is their product. But the integers
are not closed under /: for example, 2/3 is not an integer even though 2 ∈ Z and 3 ∈ Z. The closure of S under f is the smallest
superset of S that is closed under f.

We’ll illustrate these definitions with a small example of symmetric, reflexive, and transitive closures, and
then return to our running examples of arithmetic relations.

Example 8.26: Closures of a small relation.


Consider the relation R = {⟨1, 5⟩, ⟨2, 2⟩, ⟨2, 4⟩, ⟨4, 1⟩, ⟨4, 2⟩} on {1, 2, 3, 4, 5}. Then we have the
following closures of R. (See Figure 8.15 for visualizations.)
n o
reflexive closure = R ∪ ⟨1, 1⟩, ⟨3, 3⟩, ⟨4, 4⟩, ⟨5, 5⟩
n o
symmetric closure = R ∪ ⟨5, 1⟩, ⟨1, 4⟩
because of ⟨1, 5⟩ because of ⟨4, 1⟩
n o
transitive closure = R ∪ ⟨2, 1⟩, ⟨4, 4⟩, ⟨4, 5⟩, ⟨2, 5⟩
because of because of because of because of
⟨2, 4⟩ and ⟨4, 1⟩ ⟨4, 2⟩ and ⟨2, 4⟩ ⟨4, 1⟩ and ⟨1, 5⟩ ⟨2, 4⟩ and ⟨4, 5⟩

It’s worth noting that ⟨2, 5⟩ had to be in the transitive closure R+ of R, even though there was no x such
that ⟨2, x⟩ ∈ R and ⟨x, 5⟩ ∈ R. There’s one more intermediate step in the chain of reasoning: the pair
⟨4, 5⟩ had to be in R+ because ⟨4, 1⟩, ⟨1, 5⟩ ∈ R, and therefore both ⟨2, 4⟩ and ⟨4, 5⟩ had to be in R+ —so
⟨2, 5⟩ had to be in R+ as well.

Example 8.27: Closures of divides.


Recall the “divides” relation R = {⟨n, m⟩ : m mod n = 0}. Because R is both reflexive and transitive, the
reflexive closure and transitive closure of R are both just R itself. The symmetric closure of R is the set
of pairs ⟨n, m⟩ where one of n and m is a divisor of the other: {⟨n, m⟩ : n mod m = 0 or m mod n = 0}.

Example 8.28: Closures of greater than.


Recall the “greater than” relation {⟨n, m⟩ : n > m}. The reflexive closure of > is ≥—that is, the set
{⟨n, m⟩ : n ≥ m}. The symmetric closure of > is ̸=—that is, the set {⟨n, m⟩ : n > m or m > n} is exactly
{⟨n, m⟩ : n =
̸ m}. The relation > is already transitive, so the transitive closure of > is > itself.

(a) (b) (c) (d)

2 2 2 2
1 3 1 3 1 3 1 3

5 4 5 4 5 4 5 4

Figure 8.15 (a) A relation R, and several closures of R: (b) the reflexive closure; (c) the symmetric closure; and (d) the
transitive closure. In each, the thick arrows had to be added to R to achieve the desired property.

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 411

Computing the closures of a relation


How did we compute the closures in the last few examples? The approach itself isn’t too hard: starting
with R′ = R, we repeatedly look for a violation of the desired property in R′ (an element of R′ required
by the property but missing from R′ ), and repair that violation by adding the necessary element to R′ . For
the reflexive and symmetric closures, this idea is more straightforward: the violations of reflexivity are
precisely those elements of {⟨a, a⟩ : a ∈ A} not already in R, and the violations of symmetry are precisely
those elements of R−1 that are not already in R.
For the transitive closure, things are slightly trickier: as we resolve existing violations by adding missing
pairs to the relation, new violations of transitivity can crop up. (See Figure 8.16.) But to compute the
transitive closure, we can simply iterate: starting with R′ := R, repeatedly add to R′ any missing ⟨a, c⟩
with ⟨a, b⟩, ⟨b, c⟩ ∈ R′ , until there are no more violations of transitivity. (While we won’t prove it here, it’s
an important fact that the order in which we add elements to the transitive closure doesn’t affect the final
result.) See Figure 8.17 for algorithms to compute these closures for R ⊆ A × A for a finite set A. (Note
that these algorithms are not guaranteed to terminate if A is infinite! Also, there are faster ways to find the
transitive closure based on graph algorithms—see Chapter 11—but the basic idea is captured here.)
Alternatively, here’s another way to view the transitive closure of R ⊆ A × A. The relation R ◦ R
denotes precisely those pairs ⟨a, c⟩ where ⟨a, b⟩, ⟨b, c⟩ ∈ R for some b ∈ A. Thus the “direct” violations of
transitivity are pairs that are in R◦R but not R. But, as Figure 8.16 shows, the relation R ∪ (R ◦ R) might have
violations of transitivity, too: that is, a pair ⟨a, d⟩ ∈
/ R ∪ (R ◦ R) but where ⟨a, b⟩ ∈ R and ⟨b, d⟩ ∈ R ◦ R
for some b ∈ A. So we have to add R ◦ R ◦ R as well. And so on! In other words, the transitive closure R+
of R is given by R+ = R ∪ R2 ∪ R3 ∪ · · ·, where Rk = R ◦ R ◦ · · · ◦ R is the result of composing R with
itself k times. Thus (see Exercise 8.105):

• The reflexive closure of R is R ∪ {⟨a, a⟩ : a ∈ A}.


• The symmetric closure of R is R ∪ R−1 .
• The transitive closure of R is R ∪ R2 ∪ R3 ∪ · · · .

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
(a) The relation R. (b) ⟨0, 1⟩ and ⟨1, 2⟩ mean (c) ⟨1, 2⟩ and ⟨2, 3⟩ mean (d) ⟨0, 2⟩, which we
that we must add ⟨0, 2⟩. that we must add ⟨0, 2⟩. added in (b), and ⟨2, 3⟩
mean that we must now
We could have instead argued that we had to add ⟨0, 3⟩ because of ⟨0, 1⟩ and ⟨1, 3⟩ [from (c)]
rather than because of ⟨0, 2⟩ [from (b)] and ⟨2, 3⟩.
add ⟨0, 3⟩ too.

Figure 8.16 Computing the transitive closure of the relation {⟨0, 1⟩, ⟨1, 2⟩, ⟨2, 3⟩}.

reflexive-closure(R): transitive-closure(R):
Input: a relation R ⊆ A × A Input: a relation R ⊆ A × A
Output: the smallest reflexive R′ ⊇ R Output: the smallest transitive R′ ⊇ R
1 return R ∪ {⟨a, a⟩ : a ∈ A} 1 R′ := R
2 while there exist a, b, c ∈ A such that
symmetric-closure(R): / R′ :
⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ R and ⟨a, c⟩ ∈
Input: a relation R ⊆ A × A 3 R′ := R′ ∪ {⟨a, c⟩}
Output: the smallest symmetric R′ ⊇ R 4 return R′
1 return R ∪ R−1

Figure 8.17 Algorithms to compute the reflexive, symmetric, and transitive closures of R ⊆ A × A, when A is finite.

ISTUDY
412 Relations

Closures with respect to multiple properties at once


In addition to defining the closure of a relation R with respect to one of the three properties (reflexivity,
symmetry, or transitivity), we can also define the closure with respect to two or more of these properties
simultaneously. Any subset of these properties makes sense in this context, but the two most common
combinations require reflexivity and transitivity, with or without requiring symmetry:

Definition 8.12: Reflexive [symmetric] transitive closure.


Let R ⊆ A × A be a relation. Then:

• The reflexive transitive closure of R is the smallest relation R∗ ⊇ R such that R∗ is both reflexive and
transitive.
• The reflexive symmetric transitive closure of R is the smallest relation R≡ ⊇ R such that R≡ is
reflexive, symmetric, and transitive.

Example 8.29: Parent.


Consider the relation parent = {⟨p, c⟩ : p is a parent of c} over a set S. (This example makes sense if we
think of S as a set of people where “parent” has biological meaning, or if we think of S as a set of nodes
in a tree.) Then:
transitive closure of parent = parent ∪ grandparent ∪ greatgrandparent ∪ · · ·
reflexive transitive closure of parent = yourself ∪ parent ∪ grandparent ∪ greatgrandparent ∪ · · ·
where yourself = {⟨x, x⟩ : x ∈ S}. These closures of parent might be called ancestor: ⟨x, y⟩ is in the
(reflexive) transitive closure of parent if and only if x is a direct ancestor of y. (The reflexive transitive
closure counts you as an ancestor of yourself; the transitive closure does not.)

Example 8.30: Adjacent seating at a concert.


Consider a set S of people attending a concert held in a theater with rows of seats. Let R denote the relation
“sat immediately to the right of,” so that ⟨x, y⟩ ∈ R if and only if x sat one seat to y’s right in the same
row. (See Figure 8.18.)
The transitive closure of R is “sat (not necessarily immediately) to the right of.” The symmetric closure
of R is “sat immediately next to.” The symmetric transitive closure of R—just like the reflexive symmetric
transitive closure—is “sat in the same row as.” (You sit in the same row as yourself.)

As we discussed previously, we can think of the transitive closure R+ of the relation R as the result of
repeating R one or more times: in other words, R+ = R ∪ R2 ∪ R3 ∪ · · ·. The reflexive transitive closure
of R also adds {⟨a, a⟩ : a ∈ A} to the closure, which we can view as the result of repeating R zero or
more times. In other words, we can write the reflexive transitive closure R∗ as R∗ = R0 ∪ R+ , where
R0 = {⟨a, a⟩ : a ∈ A} represents the “zero-hop” application of R.
Taking it further: The basic idea underlying the (reflexive) transitive closure of a relation R—allowing (zero or) one or more
repetitions of a relation R—also comes up in a widely useful tool for pattern matching in text, called regular expressions. Using
regular expressions, you can search a text file for lines that match certain kinds of patterns (like: find all violations in the dictionary
of the “I before E except after C” rule), or apply some operation to all files with a certain name (like: remove all .txt files). For
more about regular expressions in general, and a little more on the connection between (reflexive) transitive closure and regular
expressions, see p. 415.

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 413

(a) (b) (c) (d)

Figure 8.18 (a) The sat-immediately-to-the-right-of relation R for a three-row concert venue, and a few closures: (b)
the transitive closure of R; (c) the symmetric closure of R; and (d) the symmetric transitive closure of R.

We’ll end with one last example of closures of an arithmetic relation:

Example 8.31: Closures of the successor relation.


The successor relation on the integers is {⟨n, n + 1⟩ : n ∈ Z}. What are the reflexive, symmetric,
transitive, reflexive transitive, and reflexive symmetric transitive closures of this relation?
Solution. Perhaps the most interesting of these closures to think about is the transitive closure, which is
best illustrated by the infinite version of Figure 8.16. For any n, we have ⟨n, n + 1⟩ and ⟨n + 1, n + 2⟩ in
successor, so the transitive closure includes ⟨n, n + 2⟩. But ⟨n + 2, n + 3⟩ is in successor, so the transitive
closure also includes ⟨n, n + 3⟩. But ⟨n + 3, n + 4⟩ is in successor, so the transitive closure also includes
⟨n, n + 4⟩. And so forth! (See Exercise 8.107 for a formal proof.) Here are the sets:
reflexive closure = {⟨n, m⟩ : m = n or m = n + 1}
symmetric closure = {⟨n, m⟩ : m = n − 1 or m = n + 1}
transitive closure = {⟨n, m⟩ : n < m} (that is, the relation <)
reflexive transitive closure = {⟨n, m⟩ : n ≤ m} (that is, the relation ≤)
reflexive symmetric transitive closure = Z × Z (that is, every pair of integers is in this relation).

Taking it further: We can view ≤ (the reflexive transitive closure of successor) as either the reflexive closure of < (the transitive
closure of successor), or we can view ≤ as the transitive closure of {⟨n, m⟩ : m = n or m = n + 1} (the reflexive closure of
successor). (For any relation, the reflexive closure of the transitive closure equals the transitive closure of the reflexive closure.)

ISTUDY
414 Relations

Computer Science Connections

What’s Hard about Designing Voting Systems


Imagine a collection of n people who have individual preferences over k candidates. That is, we have n relations
R1 , R2 , . . . , Rn , each of which is a relation on the set {1, 2, . . . , k}. We wish to aggregate these individual preferences
into a single preference relation for the collection of people. Although this description is much more technical than our
everyday usage, the problem that we’ve described here is well known: it’s otherwise known as voting. (Economists
also call this topic the theory of social choice.) But voting systems are hard to design, for at least two distinct reasons:
Xavier Yasmeen Zelda
Aggregating preferences. Some troubling paradoxes
1. Alice 1. Charlie 1. Bob
arise in voting problems, related to transitivity—or,
2. Bob 2. Alice 2. Charlie
more precisely, to the absence of transitivity. Fig-
3. Charlie 3. Bob 3. Alice
ure 8.19 gives an example, for a three-candidate
(Alice, Bob, Charlie) and three-voter (Xavier, Yas- Alice beats Bob Bob beats Charlie Charlie beats Alice
meen, Zelda) election. (This paradox also arises when Xavier Zelda Xavier Yasmeen Yasmeen Xavier
Yasmeen Zelda Zelda
there are many more voters.) If the three voters have
preferences as shown in the figure, then, in head- Figure 8.19 The Condorcet paradox: in head-to-head
to-head runoffs between pairs of candidates, there’s comparisons, Alice beats Bob, Bob beats Charlie, and
a “cycle” of winners. That’s pretty weird: we have Charlie beats Alice (each time a 2-to-1 victory).
taken strict preferences (each of which is certainly
transitive!) from each of the voters, and aggregated them into a nontransitive set of societal preferences. This
phenomenon—no candidate would win a head-to-head vote against every other candidate—is called the Condorcet
paradox. (The Condorcet criterion declares the winner of a vote to be the candidate who would win a runoff election
against any other individual candidate. Here, no candidate satisfies the Condorcet criterion. Both of these notions are
named after the French philosopher/mathematician Marquis de Condorcet [rhymes with gone for hay] (1743–1794).)
The Condorcet paradox is troubling, but an even more troubling result says that, more or less, there’s no good way
of designing a voting system! Arrow’s Theorem, proven around 1950 by Kenneth Arrow (1921–2017)—an American
economist who won the 1972 Nobel Prize in Economics, largely for this theorem [10]—states that there’s no way
to aggregate individual preferences to society-level preferences in a way that’s consistent with three “obviously
desirable” properties of a voting system: (1) if every voter prefers candidate A to candidate B, then A beats B; (2)
there’s no “dictator” (a single voter whose preferences of the candidates directly determines the outcome of the vote);
and (3) “independence of irrelevant alternatives” (if candidate A beats B when candidate C is in the race, then A still
beats B if C were to drop out of the race). The simplest voting system is “plurality” (or “relative majority”) voting,
in which every voter chooses their preferred candidate, and the winner is whichever candidate gets the most votes.
But, increasingly, other voting schemes are used in some real-world elections, too: “ranked-choice” (or “instant-
runoff”) voting, where a voter ranks multiple candidates, or “Borda count” voting—named after Jean-Charles de
Borda (1733–1799)—in which candidates receive different numbers of “points” from each voter. But what Arrow’s
Theorem says is that all of these voting systems sometimes exhibit some undesirable property.

Collecting and verifying the counting of votes. There has been increasing attention paid in recent years to the
hardware, software, and procedures for the casting, processing, and tabulating of votes in elections. (In the United
States, this attention began to intensify after the “butterfly ballot” controversy in Florida in the 2000 presidential
election, and further intensified during and after the 2020 presidential election. The design of the ballots themselves
[did voters actually record their intended candidate on their ballot?] was the focus in 2000—the kinds of questions of
human factors and usability that so often haunt software products, too; in 2020, the focus was more on the processes
for acquiring, casting, and counting ballots.) It has been a long-standing temptation to try to develop an electronic
voting system, perhaps using cryptographic protocols to enable individual voters to verify that their votes were cast
and counted as they were intended. But the crucial fact that ballots are meant to be secret makes the cryptographic
setting very different from traditional ones: if I can prove to you that my vote was recorded and counted for a particular
candidate, you would be able to coerce or bribe me into voting in a particular way. Between this risk, and the inherent
difficulty of designing computer systems that are truly secure, there is a clear, strong consensus among cryptography
and security researchers advocating for (i) using paper ballots (and not electronic or, even worse, internet-based
voting); and (ii) using statistical means (and not cryptographic means) to audit election results [94].

ISTUDY
8.3 Properties of Relations: Reflexivity, Symmetry, and Transitivity 415

Computer Science Connections

Regular Expressions
Regular expressions (sometimes called regexps or regexes for short) are a mechanism to express pattern-matching
searches in strings. (Their name is also a bit funny; more on that below.) Regular expressions are used by a number
of useful utilities on Unix-based systems, like grep (which prints all lines of a file that match a given pattern) and
sed (which can perform search-and-replace operations for particular patterns). And many programming languages
have a capability for regular-expression processing; they’re a tremendously handy tool for text processing.
Let Σ be an alphabet of sym-
regexp matched strings The + operator is roughly analogous to
bols. (For convenience, think of transitive closure: α+ matches any string
A matches only the single character A
Σ = {A, B, . . . , Z}, but gener- that consists of one or more repetitions of
B matches only the single character B α—while ? is roughly analogous to the
ally it’s the set of all ASCII char- . reflexive closure and * to the reflexive
acters.) Let Σ∗ denote the set of ..
transitive closure. The only difference is
all finite-length strings of symbols Z matches only the single character Z that here we’re combining repetitions by
from Σ. (The ∗ notation echoes the . any single character string
concatenation rather than by composition.

notation for the reflexive transitive αβ any string xy where x matches α and y matches β
closure: Σ∗ is the set of elements α|β any string x where x matches α or x matches β
resulting from “repeating” Σ zero α? any string that matches α or the empty string
or more times.) Some of the basic α+ any string x1 x2 . . . xk , with k ≥ 1, where each xi matches α
syntax for regular expressions is α* matches any string x1 x2 . . . xk , with k ≥ 0, where each xi matches α
shown in Figure 8.20, which recur- Figure 8.20 Regular expressions: single characters, and regexp operators.
sively defines a relation Matches ⊆
Regexps × Σ∗ , where certain strings match a given pattern. For example, {s : ⟨A*B+, s⟩ ∈ Matches} is precisely the
set of strings that can be written xy where ⟨A*, x⟩ and ⟨B+, y⟩ are in Matches. (And that means that strings like AB or
BBB or AAAAABBB, with any number of As and one or more Bs, matches the regular expression A*B+.) A few other
regexp operators correspond to the types of closures introduced in this section. There’s some other shorthand for
common constructions, too: for example, a list of characters in square brackets matches any of those characters (for
example, [AEIOU] is shorthand for (A|E|I|O|U)). (Other syntax allows a range of characters or everything but a
list of characters: for example, [A-Z] for all letters, and [^AEIOU] for consonants.)
Figure 8.21 shows a few exam- (a) W E I R D (b) L E N G T H S
ples of regular expressions match-
ing words in a dictionary with some
vaguely interesting properties. .* (CIE|[^C]EI) .* .* [^AEIOU] [^AEIOU] [^AEIOU] [^AEIOU] [^AEIOU] .*
The odd-sounding name “reg-
ular expression” derives from a
G L A C I E R W I T C H C R A F T
related notion, called a regular lan-

guage. A language L ⊆ Σ is (c) F A C E T I O U S
a subset of all strings; in the sub-
field of theoretical computer sci-
[^AEIOU]* A [^AEIOU]* E [^AEIOU]* I [^AEIOU]* O [^AEIOU]* U [^AEIOU]*
ence called formal language the-
ory, we’re interested in the dif-
ficulty of determining whether a A B S T E M I O U S
given string x ∈ Σ∗ is in L or
not, for a particular language L. Figure 8.21 Three regular expressions, and two words that match each: (a)
(Some example languages: the set violations of “I before E except after C”; (b) words with five consecutive
of words containing only type of consonants; and (c) words with all five vowels, once each, in alphabetical order.
vowel, or the set of binary strings
with the same number of ones and zeros.) A regular language is one for which it’s possible to determine whether
x ∈ L by reading the string from left to right and, at each step, remembering only a constant amount of information
about what you’ve seen so far. (The set of univocalic words is regular; the set of “balanced” bitstrings is not.)
(For a bit more on regular languages, regular expressions, and formal language theory see p. 430; for a lot more,
see a good textbook on computational complexity and formal languages, like [73] or [121].)

ISTUDY
416 Relations

EXERCISES

8.51 Draw a directed graph representing the relation ⟨x, x2 mod 13⟩ : x ∈ Z13 .
8.52 Repeat for {⟨x, 3x mod 13⟩ : x ∈ Z15 }.
8.53 Repeat for {⟨x, 3x mod 15⟩ : x ∈ Z15 }.

Which of the following relations on {0, 1, 2, 3, 4} are reflexive? Irreflexive? Neither?



8.54 ⟨x, x⟩ : x5 ≡5 x
8.55 {⟨x, y⟩ : x + y ≡5 0}
8.56 {⟨x, y⟩ : there exists z such that x · z ≡5 y}

8.57 ⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y

Let R ⊆ A × A and T ⊆ A × A be relations. Prove or disprove:


8.58 R is reflexive if and only if R−1 is reflexive.
8.59 If R and T are both reflexive, then R ◦ T is reflexive.
8.60 If R ◦ T is reflexive, then R and T are both reflexive.
8.61 R is irreflexive if and only if R−1 is irreflexive.
8.62 If R and T are both irreflexive, then R ◦ T is irreflexive.

Which relations from Exercises 8.54–8.57 on {0, 1, 2, 3, 4} are symmetric? Antisymmetric? Asymmetric? Explain.

8.63 ⟨x, x⟩ : x5 ≡5 x
8.64 {⟨x, y⟩ : x + y ≡5 0}
8.65 {⟨x, y⟩ : there exists z such that x · z ≡5 y}

8.66 ⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y

Prove Theorem 8.8, connecting the symmetry/asymmetry/antisymmetry of a relation R to the inverse R−1 of R.
8.67 Prove that R is symmetric if and only if R ∩ R−1 = R = R−1 .
8.68 Prove that R is antisymmetric if and only if R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}.
8.69 Prove that R is asymmetric if and only if R ∩ R−1 = ∅.

8.70 Be careful: it’s possible for a relation R ⊆ A × A to be both symmetric and antisymmetric! Describe, as precisely as possible, the
set of relations on A that are both.
8.71 Use Theorem 8.8 to argue that every asymmetric relation is also antisymmetric.

Consider three possibilities for a relation’s symmetry—symmetric, antisymmetric, and asymmetric—and also three possibilities
for its reflexiveness: it can be reflexive, irreflexive, or neither. For each combination, identify a relation on {0, 1} that satisfies the
given criteria, or, if the criteria are inconsistent, explain why there is no such relation.
8.72 a reflexive, symmetric relation on {0, 1}
8.73 a reflexive, antisymmetric relation on {0, 1}
8.74 a reflexive, asymmetric relation on {0, 1}
8.75 an irreflexive, symmetric relation on {0, 1}
8.76 an irreflexive, antisymmetric relation on {0, 1}
8.77 an irreflexive, asymmetric relation on {0, 1}
8.78 a symmetric relation on {0, 1} that’s neither reflexive nor irreflexive
8.79 an antisymmetric relation on {0, 1} that’s neither reflexive nor irreflexive
8.80 an asymmetric relation on {0, 1} that’s neither reflexive nor irreflexive

Which relations from Exercises 8.54–8.57 on {0, 1, 2, 3, 4} are transitive? Explain.



8.81 ⟨x, x⟩ : x5 ≡5 x .
8.82 {⟨x, y⟩ : x + y ≡5 0}.
8.83 {⟨x, y⟩ : there exists z such that x · z ≡5 y}.

8.84 ⟨x, y⟩ : there exists z such that x2 · z2 ≡5 y .

8.85 Prove that, if R is irreflexive and transitive, then R is asymmetric.


8.86 Prove Theorem 8.10: show that R is transitive if and only if R ◦ R ⊆ R.
8.87 Theorem 8.10 cannot be stated with an = instead of ⊆ (although I actually made this mistake in a previous draft of this chapter!).
Give an example of a transitive relation R where R ◦ R ⊂ R (that is, where R ◦ R ̸= R).

ISTUDY
Exercises 417

The following exercises describe a relation with certain properties. For each, say whether it is possible for a relation R ⊆ A × A
to simultaneously have all of the stated properties. If so, describe as precisely as possible what structure the relation R must have.
If not, prove that it is impossible.
8.88 Is it possible for R to be simultaneously symmetric, transitive, and irreflexive?
8.89 Is it possible for R to be simultaneously transitive and a function?

8.90 Identify all relations R on {0, 1} that are transitive.


8.91 Of the transitive relations on {0, 1} from Exercise 8.90, which are also reflexive and symmetric?

Consider the relation R = {⟨2, 4⟩, ⟨4, 3⟩, ⟨4, 4⟩} on the set {1, 2, 3, 4}.
Also consider the relation T = {⟨1, 2⟩, ⟨1, 3⟩, ⟨2, 1⟩, ⟨2, 3⟩, ⟨3, 1⟩, ⟨3, 2⟩, ⟨3, 4⟩, ⟨4, 5⟩} on {1, 2, 3, 4, 5}.
8.92 What’s the reflexive closure of R?
8.93 What’s the symmetric closure of R?
8.94 What’s the transitive closure of R?
8.95 What’s the reflexive transitive closure of R?
8.96 What’s the reflexive symmetric transitive closure of R?
8.97 What’s the reflexive closure of T?
8.98 What’s the symmetric closure of T?
8.99 What’s the transitive closure of T?

8.100 What’s the symmetric closure of ≥?

The next few exercises ask you to implement relations (and the standard relation operations) in a programming language of your
choice. Don’t worry too much about efficiency in your implementation; it’s okay to run in time Θ(n3 ), Θ(n4 ) or even Θ(n5 ) when
relation R is on a set of size n.
8.101 (programming required) Develop a basic implementation of relations on a set A. Also implement inverse (R−1 ) and composition
(R ◦ T).
8.102 (programming required) Write functions reflexive?, irreflexive?, symmetric?, antisymmetric?, asymmetric?, and transitive?
to test whether a given relation R has the specified property.
8.103 (programming required) Implement the closure algorithms (from Figure 8.17) for relations.
8.104 (programming required) Using your implementation from Exercises 8.101–8.103, verify your answers to Exercises 8.72–8.80.

8.105 Prove that the transitive closure of R is indeed R+ = R ∪ R2 ∪ R3 ∪ · · ·, as follows: show that if S ⊇ R is any transitive relation,
then Rk ⊆ S. (We’d also need to prove that R+ is transitive, but you can omit this part of the proof. You may find a recursive
definition of Rk most helpful: R1 = R and Rk = R ◦ Rk−1 .)
8.106 Give an example of a relation R ⊆ A × A, for a finite set A, such that the transitive closure of R contains at least c · |R|2 pairs, for
some constant c > 0. Make c as big as you can.

8.107 Recall the relation successor = ⟨x, x + 1⟩ : x ∈ Z≥0 . Prove by induction on k that, for any integer x and any positive integer
k, we have that ⟨x, x + k⟩ is in the transitive closure of successor. (In other words, you’re showing that the transitive closure of
successor is ≥. You cannot rely on the algorithm in Figure 8.17 because Z≥0 is not finite!)
8.108 We talked about the X closure of a relation R, for X being any nonempty subset of the properties of reflexivity, symmetry, and
transitivity. But we didn’t define the “antisymmetric closure” of a relation R—with good reason! Why doesn’t the antisymmetric
closure make sense?

ISTUDY
418 Relations

8.4 Special Relations: Equivalence Relations and Partial/Total Orders


At the return of consciousness, that closed
Before the pity of those two relations,
Which utterly with sadness had confused me,

New torments I behold, and new tormented


Around me, whichsoever way I move,
And whichsoever way I turn, and gaze.
Dante Alighieri (1265–1321)
The Divine Comedy, “Inferno: Canto VI” (1320)

In Section 8.3, we introduced three key categories of properties that a particular relation R ⊆ A × A
might have: (ir)reflexivity, (a/anti)symmetry, and transitivity. Here we’ll consider relations R that have
one of two particular combinations of those three categories of properties. Two very different “flavors” of
relations emerge from these two particular constellations of properties.
The first special type of relation is an equivalence relation, which is reflexive, symmetric, and transitive.
Equivalence relations divide the elements of A into one or more groups of equivalent elements, so that all
elements in the same group are “the same” under R.
The second special type of relation is an order relation, which is antisymmetric, transitive, and either
reflexive or irreflexive. These relations “rank” the elements of A, so that some elements of A are “more R”
than others.
In this section, we’ll give formal definitions of these two types of relations, and look at a few applications.

8.4.1 Equivalence Relations


An equivalence relation R ⊆ A × A separates the elements of A into one or more groups, where any two
elements in the same group are equivalent according to R:

Definition 8.13: Equivalence relation.


An equivalence relation is a relation that is reflexive, symmetric, and transitive.

The most important equivalence relation that you’ve seen is equality (=): certainly, for any objects a, b,
and c, we have that (i) a = a; (ii) a = b if and only if b = a; and (iii) if a = b and b = c, then a = c.
The relation sat in the same row as from Example 8.30 (see Figure 8.22a) is also an equivalence relation:
it’s reflexive (you sat in the same row as you yourself), symmetric (anyone you sat in the same row as also
sat in the same row as you), and transitive (you sat in the same row as anyone who sat in the same row as
someone who sat in the same row as you).
We’ve also previously seen another example in Example 8.11 (see Figure 8.22b for a reminder):

{⟨m1 , m2 ⟩ : months m1 and m2 have the same number of days (in some years)}

is also an equivalence relation. It’s tedious but not hard to verify by checking all pairs that the relation is
reflexive, symmetric, and transitive. (See also Exercises 8.116–8.118.)
Here are a few more examples of equivalence relations:

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 419

 
 ⟨0, 0⟩, ⟨0, 12⟩, 

 

Aug Oct 
 ⟨ 1 , 1 ⟩, ⟨ 1 , 13 ⟩, 


 


 . 

Nov Apr 
 . 


 . 

Jul Dec  
⟨11, 11⟩, ⟨11, 23⟩

 ⟨ 12 , 0 ⟩, ⟨ 12 , 12 ⟩ 

Sep Jun 
 ⟨13, 1⟩, ⟨13, 13⟩  

 

May 
 

Jan 
 . 


 . 

Mar Feb 
 . 

⟨23, 11⟩, ⟨23, 23⟩

(a) “In the same row as,” (b) The months-of-the-same-length relation, from (c) Hours that are equivalent
from Example 8.30. Example 8.11 and Figure 8.10b. on a clock.

Figure 8.22 Reminders of two equivalence relations, and one new equivalence relation.

Example 8.32: Some equivalence relations.


The set of pairs from {0, 1, . . . , 23} with the same representation on a 12-hour clock is an equivalence
relation. (See Figure 8.22c.)
The asymptotic relation Θ (that is, for two functions f and g, we have ⟨f, g⟩ ∈ Θ if and only if f is
Θ(g)) is an equivalence relation. Example 8.24, Example 8.25, and Exercise 6.46 established that Θ is
reflexive, symmetric, and transitive.
The relation ≡ on logical propositions, where P ≡ Q if and only if P and Q are true under precisely
the same set of truth assignments, is an equivalence relation. (We even used the word “equivalent” in
defining ≡, which we called logical equivalence back in Chapter 3.)

Example 8.33: All equivalence relations on a small set.


List all equivalence relations on the set {a, b, c}.
Solution. There are five different equivalence relations on this set:
{⟨a, a⟩, ⟨b, b⟩, ⟨c, c⟩} “no element is equivalent to any other”
{⟨a, a⟩, ⟨a, b⟩, ⟨b, a⟩, ⟨b, b⟩, ⟨c, c⟩} “a and b are equivalent, but they’re different from c”
{⟨a, a⟩, ⟨a, c⟩, ⟨b, b⟩, ⟨c, a⟩, ⟨c, c⟩} “a and c are equivalent, but they’re different from b”
{⟨a, a⟩, ⟨b, b⟩, ⟨b, c⟩, ⟨c, b⟩, ⟨c, c⟩} “b and c are equivalent, but they’re different from a”
{⟨a, a⟩, ⟨a, b⟩, ⟨a, c⟩, ⟨b, a⟩, ⟨b, b⟩, ⟨b, c⟩, ⟨c, a⟩, ⟨c, b⟩, ⟨c, c⟩} . “all elements are equivalent”

Taking it further: A deterministic finite automaton (DFA) is a simple model of a so-called “machine” that has a finite amount
of memory, and processes an input string by moving from state to state according to a fixed set of rules. DFAs can be used for a
variety of applications (for example, in computer architecture, compilers, or in modeling simple behavior in computer games).
And they can also be understood in terms of equivalence relations. See p. 430.

Equivalence classes
The descriptions of the quintet of equivalence relations on the set {a, b, c} from Example 8.33 makes more
explicit the other way that we’ve talked about an equivalence relation R on A: as a relation that carves up
A into one or more equivalence classes, where any two elements of the same equivalence class are related
by R (and no two elements of different classes are). Here’s the formal definition:

ISTUDY
420 Relations

Definition 8.14: Equivalence class.


Let R ⊆ A × A be an equivalence relation. The equivalence class of a ∈ A is defined as the set
{b ∈ A : ⟨a, b⟩ ∈ R} of elements related to A under R. The equivalence class of a ∈ A under R is denoted
by [a]R —or, when R is clear from context, just as [a].

The equivalence classes of an equivalence relation on A form a partition of the set A—that is, every element
of A is in one and only one equivalence class. (See Definition 2.32 for the definition of a partition.)

Example 8.34: Equivalent mod 5.


Define the relation ≡5 on Z, so that ⟨x, y⟩ ∈ ≡5 if and only if x mod 5 = y mod 5. All three require-
ments of equivalence relations (reflexivity, symmetry, and transitivity) are met; see Examples 8.18, 8.20,
and 8.23. There are five equivalence classes under ≡5 :
{0, 5, 10, . . .} , {1, 6, 11, . . .} , {2, 7, 12, . . .} , {3, 8, 13, . . .} , and {4, 9, 14, . . .} ,
corresponding to the five possible values mod 5.

Example 8.35: Some equivalence classes.


The five different equivalence relations on {a, b, c} in Example 8.33 correspond to five different sets of
equivalence classes:

{a} , {b} , {c} “no element is equivalent to any other”

{a, b} , {c} “a and b are equivalent, but they’re different from c”

{a, c} , {b} “a and c are equivalent, but they’re different from b”

{a} , {b, c} “b and c are equivalent, but they’re different from a”

{a, b, c} . “all elements are equivalent”

An example: equivalence of rational numbers


̸ 0
Back in Chapter 2, we defined the rational numbers (that is, fractions) as the set Q = Z × Z= : a rational
number is a two-element sequence of integers (the numerator and the denominator, respectively), where
the denominator is nonzero. (See Example 2.39.) But we haven’t yet really talked about the fact that two
rational numbers like ⟨17, 34⟩ and ⟨101, 202⟩ are equivalent, in the sense that 34
17
= 101 1
202 = 2 . Let’s do that:

Example 8.36: Equivalence of rationals by reducing to lowest terms.


Formally define a relation ≡ on Q that captures the notion of equality for fractions, and prove that ≡ is
an equivalence relation.
Solution. We define two rationals ⟨a, b⟩ and ⟨c, d⟩ as equivalent if and only if ad = bc—that is, we define
the relation ≡ as the set
n o
⟨a, b⟩, ⟨c, d⟩ : ad = bc .

To show that ≡ is an equivalence relation, we must prove that ≡ is reflexive, symmetric, and transitive.
These three properties follow fairly straightforwardly from the fact that the relation = on integers is an

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 421

equivalence relation. We’ll prove symmetry: for arbitrary ⟨a, b⟩ ∈ Q and ⟨c, d⟩ ∈ Q, we have
⟨a, b⟩ ≡ ⟨c, d⟩ ⇒ ad = bc definition of ≡
⇒ bc = ad symmetry of =
⇒ ⟨c, d⟩ ≡ ⟨a, b⟩. definition of ≡

(Reflexivity and transitivity can be proven analogously.)

Taking it further: Recall that the equivalence class of a rational ⟨a, b⟩ ∈ Q under ≡, denoted [⟨a, b⟩]≡ , represents the set of all
rationals equivalent to ⟨a, b⟩. For example,

[⟨17, 34⟩]≡ = {⟨1, 2⟩, ⟨−1, −2⟩, ⟨2, 4⟩, ⟨−2, −4⟩, . . . , ⟨17, 34⟩, . . .} .

For equivalence relations like ≡ for Q, we may agree to associate an equivalence class with a canonical element of that class—
here, the representative that’s “in lowest terms.” So we might agree to write ⟨1, 2⟩ to denote the equivalence class [⟨1, 2⟩], for
example. This idea doesn’t matter too much for the rationals, but it plays an important (albeit rather technical) role in figuring
out how to define the real numbers in a mathematically coherent way. One standard way of defining the real numbers is as the
equivalence classes of converging infinite sequences of rational numbers, called Cauchy sequences after the French mathemati-
cian Augustin Louis Cauchy (1789–1857). (Two converging infinite sequences of rational numbers are defined to be equivalent if
they converge to the same limit—that is, if the two sequences eventually differ by less than ϵ, for all ϵ > 0.) Thus when we write
π, we’re actually secretly denoting an infinitely large set of equivalent converging infinite sequences of rational numbers—but
we’re representing that equivalence class using a particular canonical form. Actually producing a coherent definition of the real
numbers is a surprisingly recent development in mathematics, dating back less than 150 years. For more, see a good textbook on
the subfield of math called analysis. (One classic book: [110].)

Coarsening and refining equivalence relations


An equivalence relation ≡ on A slices up the elements of A into equivalence classes—that is, disjoint subsets
of A such that any two elements of the same class are related by ≡. For example, you might consider two
restaurants equivalent if they serve food from the same cuisine (Thai, Indian, Ethiopian, Chinese, British,
Minnesotan, . . .). But, given ≡, we can imagine further subdividing the equivalence classes under ≡ by
making finer-grained distinctions (that is, refining ≡)—perhaps dividing Indian into North Indian and South
Indian, and Chinese into Americanized Chinese and Authentic Chinese. Or we could make ≡ less specific
(that is, coarsening ≡) by combining some equivalence classes—perhaps having only two equivalence
classes, Delicious (Thai, Indian, Ethiopian, Chinese) and Okay (British, Minnesotan). See Figure 8.23.

Definition 8.15: Coarsening/refining equivalence relations.


Consider two equivalence relations ≡c and ≡r on the same set A. We say that ≡r is a refinement of ≡c ,
or that ≡c is a coarsening of ≡r , if (a ≡r b) ⇒ (a ≡c b) for any ⟨a, b⟩ ∈ A × A. We can also refer to ≡c
as coarser than ≡r , and ≡r as finer than ≡c .

(a) An equivalence relation ≡. Dots (b) A coarsening of ≡: a new (c) A refinement of ≡: a new
represent elements; each region equivalence relation formed by equivalence relation formed by
denotes an equivalence class under ≡. merging equivalence classes from ≡. subdividing ≡’s equivalence classes.

Figure 8.23 Refining/coarsening an equivalence relation.

ISTUDY
422 Relations

For example, equivalence mod 10 is a refinement of equivalence mod 5: whenever n mod 10 = m mod
10 we know for certain that it will be the case that n mod 5 = m mod 5 too. (In other words, we have
(n ≡10 m) ⇒ (n ≡5 m).) An equivalence class of the coarser relation is formed from the union of one
or more equivalence classes of the finer relation. Here ≡10 is a refinement of ≡5 , and, for example, the
equivalence class [3]≡5 is the union of two equivalence classes from ≡10 , namely [3]≡10 ∪ [8]≡10 .

Example 8.37: Refining/coarsening equivalence relations on {a, b, c}.


In Example 8.35, we considered five different equivalence relations on {a, b, c}:

{{a} , {b} , {c}} {{a, b} , {c}} {{a, c} , {b}} {{a} , {b, c}} {{a, b, c}}
finest coarsest
Of these, all three equivalence relations in the middle refine the one-class equivalence relation {{a, b, c}}
and coarsen the three-class equivalence relation {{a} , {b} , {c}}. (And the three-class equivalence
relation {{a} , {b} , {c}} also refines the one-class equivalence relation {{a, b, c}}.)

Taking it further: This is a very meta comment—sorry!—but we can think of “is a refinement of” as a relation on equivalence
relations on a set A. In fact, the relation “is a refinement of” is reflexive, antisymmetric, and transitive: ≡ refines ≡; if ≡1 refines
≡2 and ≡2 refines ≡1 then ≡1 and ≡2 are precisely the same relation on A; and if ≡1 refines ≡2 and ≡2 refines ≡3 then ≡1
refines ≡3 . Thus “is a refinement of” is, as per the definition to follow in the next section, a partial order on equivalence relations
on the set A. That means, for example, that there is a minimal element according to the “is a refinement of” relation on the set of
equivalence relations on any finite set A—that is, an equivalence relation ≡min such that ≡min is refined by no relation aside from
≡min itself. (Similarly, there’s a maximal relation ≡max that refines no relation except itself.) See Exercises 8.119 and 8.120.

8.4.2 Partial and Total Orders


An equivalence relation ≡ on a set A has properties that “feel like” a form of equality—differing from =
only in that there might be multiple elements that are unequal but nonetheless cannot be distinguished by
≡. Here we’ll introduce a different special type of relation, more akin to ≤ than =, that instead describes
a consistent order among the elements of A:

Definition 8.16: Partial order.


Let A be a set. A relation ⪯ on A that is reflexive, antisymmetric, and transitive is called a partial order.
(A relation ≺ on A that is irreflexive, antisymmetric, and transitive is called a strict partial order.)

(Actually, the requirement of antisymmetry in a strict partial order is redundant; see Exercise 8.85.) Here
are a few examples, from arithmetic and sets:

Example 8.38: Some (strict) partial orders on Z: |, >, and ≤.


In Examples 8.18, 8.20, and 8.23, we showed that the following relations are all antisymmetric, transitive,
and either reflexive or irreflexive:

• Divides (reflexive): R1 = {⟨n, m⟩ : m mod n = 0} is a partial order.


• Greater than (irreflexive): R2 = {⟨n, m⟩ : n > m} is a strict partial order.
• Less than or equal to (reflexive): R3 = {⟨n, m⟩ : n ≤ m} is a partial order.

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 423

(a) {} ⊆ {0} {} ⊆ {1} {} ⊆ {0, 1} (b) {0, 1}


{0} ⊆ {0} {0} ⊆ {0, 1}
For any A ∈ P(0, 1) and any
{1} ⊆ {1} {1} ⊆ {0, 1} B ∈ P(0, 1), we have that
{0, 1} ⊆ {0, 1} {0} {1} A ⊆ B if and only if we can get
from A to B by following zero or
more arrows in this diagram.
{}

Figure 8.24 The ⊆ relation on P({0, 1}): (a) all 8 pairs of sets in the ⊆ relation, and (b) a diagrammatic
representation.

Example 8.39: The subset relation.


Consider the relation ⊆ on the set P({0, 1}), which consists of the pairs of sets shown in Figure 8.24a.
It’s tedious but not hard to verify that ⊆ is reflexive, antisymmetric, and transitive. (Perhaps the easiest
way to see this fact is via Figure 8.24b, which abbreviates the visualizations in Figure 8.10 by leaving out
an a-to-c arrow if their relationship is implied by transitivity because of a-to-b and b-to-c arrows. We’ll
see more of this type of abbreviated diagram in a moment.)

Comparability and total orders


In a partial order ⪯, there can be two elements a, b ∈ A such that neither a ⪯ b nor b ⪯ a: for the subset
relation from Example 8.39 we have {0} ̸⊆ {1} and {1} ̸⊆ {0}, and for the divides relation we have 17̸ |21
and 21̸ |17. In this case, the relation ⪯ does not say which of these elements is “smaller.” This phenomenon
is the reason that ⪯ is called a partial order, because it only specifies how some pairs compare.

Definition 8.17: Comparability.


Let ⪯ be a partial order on A. We say that two elements a ∈ A and b ∈ A are comparable under ⪯ if
either a ⪯ b or b ⪯ a. Otherwise we say that a and b are incomparable.

When there are no incomparable pairs under ⪯, then we call ⪯ a total order:

Definition 8.18: Total order.


A relation ⪯ on A is a total order if it’s a partial order and every pair of elements in A is comparable. (A
relation ≺ is a strict total order if ≺ is a strict partial order and every pair of distinct elements in A is
comparable.)

Warning! It’s easy to get muddled about incomparability because of common-language use of the word that’s related to the
technical definition but misleadingly different. In everyday usage, people may say “incomparable” (or “beyond compare”) to
mean “unequaled”—as in Cheese from France is incomparable to cheese from Wisconsin. Be careful! In the context of our work,
“incomparable” means “cannot be compared” and not “cannot be matched.”

A few examples of partial and total orders


Here are a few examples of orders, related to strings and to asymptotics:

Example 8.40: Ordering strings.


Which of the following relations (on the set of all [finite-length] strings of letters) are partial orders?
Which are total orders? Of those that are partial or total orders, which are strict?

ISTUDY
424 Relations

⟨x, y⟩ ∈ R if |x| ≥ |y|. (The length of a string x—the number of letters in x—is denoted |x|.)
⟨x, y⟩ ∈ S if x comes alphabetically no later than y. (See Example 3.48.)
⟨x, y⟩ ∈ T if the number of As in x is smaller than the number of As in y.

Solution. String length. The relation {⟨x, y⟩ : |x| ≥ |y|} is reflexive and transitive, but it is not antisym-
metric: for example, both ⟨PASCAL, RASCAL⟩ and ⟨RASCAL, PASCAL⟩ are in the relation, but RASCAL = ̸
PASCAL. So this relation isn’t a partial order.
Alphabetical order. The relation “comes alphabetically no later than” is reflexive (every word w comes
alphabetically no later than w), antisymmetric (the only word that comes alphabetically no later than w
and no earlier than w is w itself), and transitive (if w1 is alphabetically no later than w2 and w2 is no later
than w3 , then indeed w1 is no later than w3 ). Thus S is a partial order.
In fact, any two words are comparable under S: either w is a prefix of w′ (and ⟨w, w′ ⟩ ∈ S) or there’s
a smallest index i in which wi = ̸ w′i (and either ⟨w, w′ ⟩ ∈ S or ⟨w′ , w⟩ ∈ S, depending on whether wi is
earlier or later in the alphabet than w′i ). Thus S is actually a total order.
Number of As. The relation “contains fewer As than” is irreflexive (any word w contains exactly the same
number of As as it contains, not fewer than that!) and transitive (if we have aw < aw′ and aw′ < aw′′ ,
then we also have aw < aw′′ ). Therefore the relation is antisymmetric (by Exercise 8.85), and thus T is a
strict partial order. But neither ⟨PASCAL, RASCAL⟩ nor ⟨RASCAL, PASCAL⟩ are in T—both words contain
two As, so neither has fewer than the other—and thus RASCAL and PASCAL are incomparable, and T is not
a (strict) total order.

Example 8.41: O and o as orders.


We’ve argued that o is irreflexive (Example 8.24), transitive (Exercise 6.47), and asymmetric (Exam-
ple 8.25). Thus o is a strict partial order. But o is not a (strict) total order: we saw a function f(n) in
Example 6.6 such that f(n) ̸= o(n2 ) and n2 ̸= o(f(n)), so these two functions are incomparable.
And, though we showed that O is reflexive and transitive (Exercise 6.18), we showed that O is not
antisymmetric (Example 8.25), because, for example, the functions f(n) = n2 and g(n) = 2n2 are O of
each other. Thus O is not a partial order.

Taking it further: A relation like O that is both reflexive and transitive (but not necessarily antisymmetric) is sometimes called
a preorder. Although O is not a partial order, it very much has an “ordering-like” feel to it: it does rank functions by their growth
rate, but there are clusters of functions that are all equivalent under O. We can think of O as defining a partial order on the
equivalence classes under Θ. We saw another preorder in Example 8.40, with the relation R (“x and y have the same length”):
although there are many pairs of distinct strings x and y where ⟨x, y⟩ ∈ R and ⟨y, x⟩ ∈ R, it is only because of ties in lengths that
R fails to be a partial order—or, indeed, a total order.

Hasse diagrams
Let R be any relation on a set A. For k ≥ 1, we will call a sequence ⟨a1 , a2 , . . . , ak ⟩ ∈ Ak a cycle if
⟨a1 , a2 ⟩, ⟨a2 , a3 ⟩, . . . , ⟨ak−1 , ak ⟩ ∈ R and ⟨ak , a1 ⟩ ∈ R. A cycle is a sequence of elements, each of which is
related by R to the next element in the sequence (where the last element is related to the first). For a partial
order ⪯, there are cycles with k = 1 (because a partial order is reflexive, a1 ⪯ a1 for any a1 ), but there are
no longer cycles. (See Exercise 8.133.)
Recall the “directed graph” visualization of a relation R ⊆ A × A that we introduced earlier (see Fig-
ure 8.10): we write down every element of A, and then, for every pair ⟨a1 , a2 ⟩ ∈ R, we draw an arrow from
a1 to a2 . For a relation R that’s a partial order, we’ll introduce a simplified visualization, called a Hasse

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 425

(a)   (b)

 ⟨0, 0⟩, ⟨0, 1⟩, ⟨0, 2⟩, ⟨0, 3⟩, ⟨0, 4⟩,

4 Note that we’ve omitted all arrow directions (they

⟨1, 1⟩, 


 
 all point up), all five self-loops (they can be
3 inferred from reflexivity), and the pairs ⟨0, 3⟩,
⟨2, 2⟩, ⟨2, 3⟩, ⟨2, 4⟩, ⟨0, 4⟩, and ⟨2, 4⟩ (they can be inferred from

 

⟨3, 3⟩, ⟨3, 4⟩,
 
 transitivity).

 
 1 2
⟨4, 4⟩
0

Figure 8.25 (a) A partial order, and (b) a Hasse diagram representing it.

diagram, that allows us to figure out the full relation R but makes the diagram dramatically cleaner. (Hasse
diagrams are named after Helmut Hasse (1898–1979), a German mathematician.)
Let ⪯ be a partial order. Consider three elements a, b, and c such that a ⪯ b and b ⪯ c and a ⪯ c.
Then the very fact that ⪯ is a partial order means that a ⪯ c can be inferred from the fact that a ⪯ b
and b ⪯ c. (That’s just transitivity.) Thus we will omit from the diagram any arrows that can be inferred
via transitivity. Similarly, we will leave out self-loops, which can be inferred from reflexivity. Finally, as
we discussed above, there are no nontrivial cycles (that is, there are no cycles other than self-loops) in a
partial order. Thus we will arrange the elements so that when a ⪯ b we will draw a physically below b in
the diagram; all arrows will implicitly point upward in the diagram.
A small partial order, and a Hasse diagram for it, are shown in Figure 8.25. Here is a somewhat larger
example:

Example 8.42: Hasse diagram for divides.


A Hasse diagram for the relation | (divides) on the set {1, 2, . . . , 32} is shown in Figure 8.26. Again, the
diagram omits arrow directions, self-loops, and “indirect” connections that can be inferred by transitivity.
For example, the fact that 2 | 20 is implicitly represented by the arrows 2 → 4 → 20 (or 2 → 10 → 20).

Which arrows must be shown in a Hasse diagram? We must include all arrows that cannot be inferred
by the definition of a partial order—in other words, we must draw a direct connections for all those rela-
tionships that are not “short circuits” of pairs of other relationships. In other words, we must draw lines for
all those pairs ⟨a, c⟩ where a ⪯ c and there is no b ∈ / {a, c} such that a ⪯ b and b ⪯ c. Such a c is called
an immediate successor of a.
Warning! When a ⪯ b holds for a partial order ⪯, we think of a as “smaller” than b under ⪯—a view that can be a little misleading
if, for example, the partial order in question is ≥ instead of ≤. One example of this oddity: for ≥, the immediate successor of 42
is 41.

Minimal/maximal elements in a partial order


Consider the partial order ⪯ = {⟨1, 1⟩, ⟨1, 2⟩, ⟨1, 3⟩, ⟨1, 4⟩, ⟨2, 2⟩, ⟨2, 4⟩, ⟨3, 3⟩, ⟨4, 4⟩}—that is, the
divides relation on the set {1, 2, 3, 4}. There’s a strong sense in which 1 is the “smallest” element under
⪯: every element a satisfies 1 ⪯ a. And there’s a slightly weaker sense in which 3 and 4 are both “largest”
elements under ⪯: no element a satisfies 3 ⪯ a or 4 ⪯ a. These ideas inspire two related pairs of definitions:

Definition 8.19: Minimum/maximum element.


For a partial order ⪯ on A:

• A minimum element is x ∈ A such that, for every y ∈ A, we have x ⪯ y.


• A maximum element is x ∈ A such that, for every y ∈ A, we have y ⪯ x.

ISTUDY
426 Relations

32

16 24

8 12 18 20 27 28 30

4 6 9 10 14 15 21 22 25 26

2 3 5 7 11 13 17 19 23 29 31

Figure 8.26 A Hasse diagram for “divides” on {1, 2, . . . , 32}. The darker lines represent the Hasse diagram; the
lighter arrows give the full picture of the relation, including all of the relationships that can be inferred from the fact that
the relation is a partial order.

Definition 8.20: Minimal/maximal element.


For a partial order ⪯ on A:

• A minimal element is x ∈ A such that, for every y ∈ A with y =


̸ x, we have y ̸⪯ x.
• A maximal element is x ∈ A such that, for every y ∈ A with y ≠ x, we have x ̸⪯ y.

A maximal whatzit is any whatzit that loses its whatzitness if we add anything to it. A maximum whatzit is the largest possible
whatzit. If you’ve studied calculus, you’ve seen a similar distinction under a different name: maximal corresponds to a local
maximum; maximum corresponds to a global maximum.

Note that x being a minimal element does not demand that every other element be larger than x—only that
no element is smaller! (Again, we’re talking about a partial order—so x ̸⪯ y doesn’t imply that y ⪯ x.)
In other words, a minimal element is one for which every other element y either satisfies x ⪯ y or is
incomparable to x.

Example 8.43: Minimal/maximal/maximum/minimum elements in the divides relation.


For the divides relation on {1, 2, . . . , 32} (Example 8.42 and Figure 8.26):

• 1 is a minimum element. (Every n ∈ {1, 2, . . . , 32} satisfies 1 | n.)


• 1 is also a minimal element. (No n ∈ {1, 2, . . . , 32} satisfies n | 1, except n = 1 itself.)
• There is no maximum element. (No n ∈ {1, 2, . . . , 32} aside from 32 satisfies n | 32, so 32 is the only
candidate—but 31 ̸ | 32.)
• There are a slew of maximal elements: each of {17, 18, . . . , 32} is a maximal element. (None of these
elements divides any n ∈ {1, 2, . . . , 32} other than itself.)

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 427

(You’ll prove that any minimum element is also minimal, and that there can be at most one minimum
element in a partial order, in Exercises 8.148 and 8.149.)
We’ve already seen some small partial orders that don’t have minimum or maximum elements, but every
partial order over a finite set must have at least one minimal element and at least one maximal element:

Theorem 8.21: Every (finite) partial order has a minimal/maximal element.


Let ⪯ ⊆ A × A be a partial order on a finite set A. Then ⪯ has at least one minimal element and at least
one maximal element.

Proof. We’ll prove that there’s a minimal element; the proof for the maximal element is analogous. Our
proof is constructive; we’ll give an algorithm to find a minimal element:

1 i := 1
2 x1 := an arbitrarily chosen element in A
3 while there exists any y ̸= xi with y ⪯ xi :
4 xi+1 := any such y (that is, one with y ̸= xi and y ⪯ xi )
5 i := i + 1
6 return xi

It’s not too hard to see that if this algorithm terminates, then it returns a minimal element. After all, the
while loop only terminates when we’ve found an xi ∈ A such that there’s no y = ̸ xi with y ⪯ xi —which is
precisely the definition of xi being a minimal element. Thus the real work is in proving that this algorithm
actually terminates.
We claim that after |A| iterations of the while loop—that is, after we’ve defined x1 , x2 , . . . , x|A|+1 —we
must have found a minimal element. Suppose not. Then we have found elements x1 ⪰ x2 ⪰ · · · ⪰ x|A|+1 ,
where xi+1 ̸= xi for each i. Because there are only |A| different elements in A, in a sequence of |A| + 1
elements we must have encountered the same element more than once. (This argument implicitly makes
use of the pigeonhole principle, which we’ll see in much greater detail in Chapter 9.) But that’s a cycle
containing two or more elements! And Exercise 8.133 asks you to show that there are no such cycles in a
partial order.
Note that Theorem 8.21 only claimed that a minimal element must exist in a partial order on a finite set
A. The claim would be false without that assumption! If A is an infinite set, then there may be no minimal
element in A under a partial order. (See Exercise 8.146.)
We can identify minimal and maximal elements of a partial order directly from the Hasse diagram:
they’re simply the elements that aren’t connected to anything above them (the maximal elements), and
those that aren’t connected to anything below them (the minimal elements). And, indeed, there are
always topmost element(s) and bottommost element(s) in a Hasse diagram, and thus there are always
maximal/minimal elements in any partial order—if the set of elements is finite, at least!

Problem-solving tip: A good visualization of data can make an apparently complicated statement much simpler. Another way to
state Theorem 8.21 and its proof: start anywhere, and follow lines downward in the Hasse diagram; eventually, you must run out
of elements below you, and you can’t go any lower. Thus there’s at least one bottommost element in any (finite) Hasse diagram.

8.4.3 Topological Ordering


Partial orders can be used to specify constraints on the order in which certain tasks must be completed:
the printer must be loaded with paper before the document can be printed; the document must be written

ISTUDY
428 Relations

software design programming algorithms computability &


languages complexity

data structures math of CS intro to


computer systems

intro to CS

Figure 8.27 The CS major at a certain college in the midwest.

before the document can be printed; the paper must be purchased before the printer can be loaded with
paper. Or, as another example: a computer science major at a certain college must take courses following
the prerequisite structure specified in Figure 8.27.
But, while these types of constraints impose a partial order on elements, the tasks must actually be
completed in some sequence. (Likewise, the courses must be taken in some sequence—for a major who
avoids “doubling up” on CS courses in the same term, at least.) The challenge we face is to extend a partial
order into a total order—that is, to create a total order that obeys all of the constraints of the partial order,
while making comparable all previously incomparable pairs.

Definition 8.22: Consistency of a total order with a partial order.


A total order ⪯total is consistent with the partial order ⪯ if a ⪯ b implies that a ⪯total b.

In general, there are many total orders that are consistent with a given partial order. Here’s an example:

Example 8.44: Ordering CS classes.


Here are two (of many!) course orderings that are consistent with the prerequisites in Figure 8.27:
Order A: (1) intro to CS, (2) data structures, (3) math of CS, (4) intro to computer systems, (5) software design,
(6) programming languages, (7) algorithms, and (8) computability & complexity.

Order B: (1) intro to CS, (2) data structures, (3) software design, (4) programming languages, (5) math of CS,
(6) algorithms, (7) computability & complexity, (8) intro to computer systems.

Order A corresponds to reading the elements of the Hasse diagram from the bottom-to-top (and left-
to-right within a “row”); Order B corresponds to completing the top row left-to-right (first recursively
completing the requirements to make the next element of the top row valid).

As in these examples, we can construct a total order that’s consistent with any given partial order on the
set A. Such an ordering of A is called a topological ordering of A. (Some people will refer to a topological
ordering as a topological sort of A.) We’ll prove this result inductively, by repeatedly identifying a minimal
element a from the set of unprocessed elements, and then adding constraints to make a be a minimum
element (and not just a minimal element).

Theorem 8.23: Extending any partial order to a total order.


Let A be a finite set with a partial order ⪯. Then there is a total order ⪯total on A that’s consistent with ⪯.

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 429

1 Identify some 2 Turn a into a



3 Inductively, find a
minimal element a in ⪯.

minimum element by total ordering of the
adding constraints (the remaining partial order
thick, dark lines). (the shaded box).
a∗
a∗ a∗

Figure 8.28 A sketch of the proof of Theorem 8.23.

Proof. We’ll proceed by induction on |A|.


For the base case (|A| = 1), the task is trivial: there’s simply nothing to do! The relation ⪯ must be
{⟨a, a⟩}, where A = {a}, because partial orders are reflexive. And the relation {⟨a, a⟩} is a total order on
{a} that’s consistent with ⪯.
For the inductive case (|A| ≥ 2), we assume the inductive hypothesis (for any set A′ of size |A′ | = |A|−1
and any partial order on A′ , there’s a total order on A′ consistent with that partial order). We must show
how to extend ⪯ to be a total order on all of A. Here’s the idea: we’ll remove some element of A that can go
first in the total order, inductively find a total order of all the remaining elements, and then add the removed
element to the beginning of the order.
More specifically, let a∗ ∈ A be an arbitrary minimal element under ⪯ on A—in other words, let a∗
be any element such that no b ∈ A − {a∗ } satisfies b ⪯ a∗ . Such an element is guaranteed to exist
by Theorem 8.21. Add any missing pair ⟨a∗ , b⟩ to ⪯. After the additions, the relation ⪯ is still a partial
order on A: by the definition of a minimal element, we haven’t introduced any violations of transitivity or
antisymmetry. Now, inductively, we extend the partial order ⪯ on A − {a∗ } to a total order; the result is a
total order on A that’s consistent with ⪯. (See Figure 8.28.)
(Slightly more formally: define ⪯′ as the relation ⪯ ∩ (A − {a∗ }) × (A − {a∗ }), which is ⪯ restricted
to A − {a∗ }. Then ⪯′ is a partial order on A − {a∗ }; by the inductive hypothesis, there exists a total order
⪯′total on A − {a∗ } consistent with ⪯′ . Define
⪯total = ⪯′total ∪ {⟨a∗ , y⟩ : y ∈ A} .
It’s not too hard to verify that ⪯total is a total order on A that’s consistent with ⪯.)

Taking it further: Deciding the order in which to compute the cells of a spreadsheet (where a cell might depend on a list of
other cells’ contents) is solved using a topological ordering. In this setting, let C denote the set of cells in the spreadsheet, and
define a relation R ⊆ C × C where ⟨c, c′ ⟩ ∈ R if we need to know the value in cell c before we can compute the value for
c′ . (For example, if cell C4’s value is determined by the formula A1 + B1 + C1, then the three pairs ⟨A1, C4⟩, ⟨B1, C4⟩, and
⟨C1, C4⟩ are all in R. Note that it’s not possible to compute all the values in a spreadsheet if there’s a cell x whose value depends
on cell y, which depends on . . . , which depends on cell x—in other words, the “depends on” relationship cannot have a cycle!
Furthermore, we’re in trouble if there’s a cell x whose value depends on x itself. In other words, we can compute the values in a
spreadsheet if and only if R is irreflexive and transitive—that is, if R is a strict partial order.
Another problem that can be solved using the idea of topological ordering is that of hidden-surface removal in computer graphics:
we have a three-dimensional “scene” of objects that we’d like to display on a two-dimensional screen, as if it were being viewed
from a camera. We need to figure out which of the objects are invisible from the camera (and therefore need not be drawn)
because they’re “behind” other objects. One classic algorithm, called the painter’s algorithm, solves this problem using ideas
from relations and topological ordering. See p. 431.

ISTUDY
430 Relations

Computer Science Connections

Deterministic Finite Automata (DFAs) The start state is marked with an unattached
incoming arrow; from state q on input symbol a,
the arrow leaving q with label a points to δ(q, a).
As we hinted at previously (see the discus- Σ = {0, 1} Final states are circled. 0, 1
sion of regular expressions on p. 415), there Q = {a, b, c, win, lose} b
are some interesting computational applica- 0
δ is defined by this table: win
0
tions of finite-state machines, a formal model 0 1
a
for a computational device that uses a fixed a b c 1 1 1
amount of memory to respond to input. Varia- b win lose 0
lose
tions on these machines can be used in build- c lose win If you’re currently c
ing very simple characters in a video game, in state a and the
win win win input symbol is a 0, 1
in computer architecture, in software systems lose lose lose 1, then move to
to do automatic speech recognition, and other state c.
tasks. They can also identify which strings the start state is a
match a given regular expression—in fact, win is the (only) final state
for a set of strings L, there exists a finite- Figure 8.29 A DFA accepting all bitstrings whose first two symbols
state machine M that recognizes precisely the are the same—both by defining all five components, and by a picture.
strings in L if and only if there’s a regular
expression α that matches precisely the strings in L. Formally, a deterministic finite automaton (DFA)—the simplest
version of a finite-state machine—is defined by five things:
• A finite set Σ (the alphabet) that defines the set of input symbols the machine can handle.
• A finite set Q (the states); the machine is always in one of these states. (The fact that Q is finite corresponds to M
having only finite memory.)
• A function δ : Q × Σ → Q (the transition function): when the machine is in state q ∈ Q and sees an input symbol
a ∈ Σ, the machine moves into state δ(q, a).
• A start state s ∈ Q, the state in which the machine begins before having seen any input.
• A set F ⊆ Q of final states. If, after processing a string x, the machine ends up in a state q ∈ F, then the machine
accepts x; if it ends in a state q ∈
/ F, then the machine rejects x.
An example of a DFA that accepts all bitstrings whose first two symbols are the same is shown in Figure 8.29.
We can also understand DFAs—and the
0 1 The input is
sorts of sets of strings that they can
divisible by three
recognize—by thinking about equivalence 1 0 if and only if we
relations. To see this connection, suppose end up in the
we’re trying to identify binary strings repre- leftmost state.

senting integers that are multiples of 3. (So 1 0


11 and 1001 and 1111 are all “yes” because Figure 8.30 A DFA for bitstrings representing multiples of 3.
3 | 3 and 3 | 9 and 3 | 15, but 10001 is “no”
because 3 ̸ | 17.) Here’s one way to solve this problem. Let’s define an equivalence relation on binary strings, where
x ≡ y if and only if, for any bitstring z, we have that (xz is divisible by 3) ⇔ (yz is divisible by 3). In other words,
two bitstrings x and y are equivalent if, no matter what additional bitstring suffix we add to both of them, the two
resulting bitstrings are either both divisible by three or both not divisible by three. For example, it turns out that
11 ≡ 1001 (11 and 1001 are both “yes”; 110 and 10010 are both “yes”; 111 and 10011 are both “no”; 1110 and
100010 are both “no”; etc.). Similarly, we have 1000 ≡ 10. It’s not hard to prove that ≡ is an equivalence relation.
It’s also true, though a bit harder to prove, that there are only three equivalence classes for ≡. (Those equivalence
classes are: bitstrings that are 0 mod 3, those that are 1 mod 3, and those that are 2 mod 3.) Thus we can actually figure
out whether a bitstring is evenly divisible by 3 with the simple DFA in Figure 8.30. The three states of this machine,
going from left to right, correspond to the three equivalence classes for ≡—namely [0], [1], and [10]. (For a set of
strings that cannot be recognized by a DFA—for example, bitstrings with an equal number of zeros and ones—there
are an infinite number of equivalence classes for ≡.)
(These particular DFAs merely hint at the kind of problem that can be solved with this kind of machine—for much
more, see any good textbook in formal languages, such as [73] or [121].)

ISTUDY
8.4 Special Relations: Equivalence Relations and Partial/Total Orders 431

Computer Science Connections

The Painter’s Algorithm and Hidden-Surface Removal


At a high level, the goal in computer graphics is to take
a three-dimensional scene—a set of objects in R3 (with
differing shapes, colors, surface reflectivities, textures,
etc.)—as seen from a particular vantage point (a point
and a direction, also in R3 ). The task is then to project
the scene into a two-dimensional image. There are a lot
of components to this task, and we’ve already talked a
bit about some of them: typically we’ll approximate the
shapes of the objects using a large collection of trian-
gles (see p. 244), and then compute where each triangle
shows up in the camera’s view, in R2 , via rotation (see
p. 54). Even after triangulation and rotation, we are still
left with another important step: when two triangles
overlap in the two-dimensional image, we have to fig-
ure out which to draw—that is, which one is obscured
Figure 8.31 A house in a forest.
by the other. This task is also known hidden-surface
removal: we want to omit whatever pieces of the image aren’t visible. For example, to render the humble forest scene
in Figure 8.31, we have to draw trees in front of and behind the house, and one particular tree in front of another. One
approach to hidden-surface removal is called the painter’s algorithm, named after a hypothetical artist at an easel:
we can “paint” the shapes in the image “from back to front,” simply painting over faraway shapes with the closer
ones as we go, as in Figure 8.32.
How might we implement this approach? Let S be
the set of shapes that we have to draw. We can com-
pute a relation obscures ⊆ S × S, where a pair ⟨s1 , s2 ⟩
in obscures tells us that we must draw s2 before s1 .
We seek a total order on S that is consistent with the
obscures relation; we’ll draw the shapes in this order.
Unfortunately obscures may not be a total order—
or even a partial order! The biggest problem with
obscures is that we can have “cycles of obscurity”—
s1 obscures s2 which obscures s3 which, eventually,
obscures a shape sk that obscures s1 . (See Figure 8.33; Figure 8.32 Drawing the house, one piece at a time.
although it may look like an M. C. Escher drawing, there’s nothing strange going on—just three triangles that over-
lap a bit like a pretzel.) This issue can be resolved using some geometric algorithms specific to the particular task:
we’ll split up shapes in each cycle of obscurity—here, dividing one triangle into two—so that we no longer have any
cycles. (Again see Figure 8.33.)
We now have an expanded set S′ of shapes, and
a cycle-free relation obscures on S′ . We can use this
relation to compute the order in which to draw the
shapes. First, compute the reflexive, transitive closure
of obscures on S′ . (The resulting relation is a partial
order on S′ .) Then, extend this partial order to a total Figure 8.33 A cycle of obscurity, and splitting one of the
order on S′ , using Theorem 8.23. We now have a total
cycle’s pieces to break the cycle.
ordering on the shapes that respect the obscures rela-
tion, so we can draw the shapes in precisely this order. (The painter’s algorithm was first published in 1972 [95].
While it does correctly accomplish hidden-surface removal, it’s pretty slow, particularly as we’ve described it here.
For example, when there are many layers to a scene, we actually have to “paint” each pixel in the resulting image
many many times. Every computation of a pixel’s color before the last is a waste of time. You can learn about cleverer
approaches to hidden-surface removal, like the “z-buffer,” in a good textbook on computer graphics, such as [61].)

ISTUDY
432 Relations

EXERCISES
8.109 List all equivalence relations on {0, 1}.
8.110 List all equivalence relations on {0, 1, 2, 3}.

Are the following relations on P({0, 1, 2, 3}) equivalence relations? If so, list the equivalence classes under the relation; if not,
explain why not.
8.111 ⟨A, B⟩ ∈ R1 if and only if (i) A and B are nonempty and the largest element in A equals the largest element in B, or (ii) if A = B = ∅.
8.112 ⟨A, B⟩ ∈ R2 if and only if the sum of the elements in A equals the sum of the elements in B.
8.113 ⟨A, B⟩ ∈ R3 if and only if the sum of the elements in A equals the sum of the elements in B and the largest element in A equals the
largest element in B. (That is, R3 = R1 ∩ R2 .)
8.114 ⟨A, B⟩ ∈ R4 if and only A ∩ B ̸= ∅.
8.115 ⟨A, B⟩ ∈ R5 if and only |A| = |B|.

In Example 8.11, we considered the relation M = {⟨m, d⟩ : in some years, month m has d days}, and computed the pairs in the
relation M−1 ◦ M. By checking all the requirements (or by visual inspection of Figure 8.10b), we see that M−1 ◦ M is an equivalence
relation. But it turns out that the fact that M−1 ◦ M is an equivalence relation says something particular about M, and is not true
in general. Let R ⊆ A × B be an arbitrary relation. Which of the three required properties of an equivalence relation must R−1 ◦ R
have? (At least one of these is false!).
8.116 Prove or disprove: R−1 ◦ R must be reflexive.
8.117 Prove or disprove: R−1 ◦ R must be symmetric.
8.118 Prove or disprove: R−1 ◦ R must be transitive.

Let A be any set. There exist two equivalence relations ≡coarsest and ≡finest with the following property: if ≡ is an equivalence
relation on A, then (i) ≡ refines ≡coarsest , and (ii) ≡finest refines ≡.
8.119 Identify ≡coarsest , prove that it’s an equivalence relation, and prove property (i): if ≡ is an equivalence relation on A, then ≡ refines
≡coarsest .
8.120 Identify ≡finest , prove that it’s an equivalence relation, and prove property (ii): if ≡ is an equivalence relation on A, then ≡finest
refines ≡.

8.121 Write ≡k to denote equivalence mod k—that is, a ≡k b if and only if a mod k = b mod k. Consider the equivalence relation ≡60 .
For what values of k is ≡k a coarsening of ≡60 ? For what values of k is ≡k a refinement of ≡60 ?
8.122 Suppose that R is an equivalence relation that coarsens ≡60 . Prove or disprove: R is ≡k , for some integer k.
8.123 In many programming languages, there are two distinct but related notions of “equality”: has the same value as and is the same
object as. In Python, these are denoted as == and is, respectively; in Java, they are .equals() and ==, respectively. (Confusingly!)
(For example, in Python, [1,7,8] + [9] is [1,7,8,9] is false, but [1,7,8] + [9] == [1,7,8,9] is true.) Does one of
these equality relations refine the other? Explain.

8.124 List all partial orders on {0, 1}.


8.125 List all partial orders on {0, 1, 2}.

Are the following relations on P({2, 3, 4, 5}) partial orders, strict partial orders, or neither? Explain.
P P
8.126 ⟨A, B⟩ ∈ R1 ⇔ a∈A a ≤ b∈B b
Q Q
8.127 ⟨A, B⟩ ∈ R2 ⇔ a∈A a ≤ b∈B b
8.128 ⟨A, B⟩ ∈ R3 ⇔ A ⊆ B
8.129 ⟨A, B⟩ ∈ R4 ⇔ A ⊇ B
8.130 ⟨A, B⟩ ∈ R5 ⇔ |A| < |B|

8.131 Prove that ⪯ is a partial order if and only if ⪯−1 is a partial order.
8.132 Prove that if ⪯ is a partial order, then {⟨a, b⟩ : a ⪯ b and a =
̸ b} is a strict partial order.

8.133 A cycle in a relation R is a sequence of k distinct elements a0 , a1 , . . . , ak−1 ∈ A where ⟨ai , ai+1 mod k ⟩ ∈ R for every
i ∈ {0, 1, . . . , k − 1}. A cycle is nontrivial if k ≥ 2. Prove that there are no nontrivial cycles in any transitive, antisymmetric
relation R. (Hint: use induction on the length k of the purported cycle.)

ISTUDY
Exercises 433

(a) (b) (c) (d)


5 5 5 5

3 4 3 4 2 3 4 4

1 2 2 1 1 2 3

Figure 8.34 Four Hasse diagrams.

Let S ∈ Z≥1 × Z≥1 be a collection of points. Define the relation R ⊆ S × S as follows: ⟨⟨a, b⟩, ⟨x, y⟩⟩ ∈ R if and only if a ≤ x
and b ≤ y. (You can think of ⟨x, y⟩ ∈ S as an x-by-y picture frame, and ⟨f, f′ ⟩ ∈ R if and only if f fits inside f′ . Or you can think
of ⟨x, y⟩ ∈ S as a job that you’d get x “happiness points” from doing and that pays you y dollars, and ⟨j, j′ ⟩ ∈ R if and only if j
generates no more happiness and pays no more than j′ .)
8.134 Show that R might not be a total order by identifying two incomparable elements of Z≥1 × Z≥1 .
8.135 Prove that R must be a partial order.
8.136 Write out all pairs in the relation represented by the Hasse diagram in Figure 8.34a.
8.137 Repeat for Figure 8.34b.
8.138 Repeat for Figure 8.34c.
8.139 Repeat for Figure 8.34d.
8.140 Draw the Hasse diagram for the partial order ⊆ on the set P(1, 2, 3).
8.141 Draw the Hasse diagram for the partial order ⪯ on the set S = {0, 1} ∪ {0, 1}2 ∪ {0, 1}3 , where, for two bitstrings x, y ∈ S, we
have x ⪯ y if and only if x is a prefix of y.

Let ⪯ be a partial order on A. Recall that an immediate successor of a ∈ A is an element c such that (i) a ⪯ c, and (ii) there is no
b∈/ {a, c} such that a ⪯ b and b ⪯ c. In this case a is said to be an immediate predecessor of c.
8.142 For the partial order ≥ on Z≥1 , identify all the immediate predecessor(s) and immediate successor(s) of 202.
8.143 For the partial order | (divides) on Z≥1 , identify all the immediate predecessor(s) and immediate successor(s) of 202.
8.144 Give an example of a strict partial order on Z≥1 such that every integer has exactly two different immediate successors.
8.145 Prove that for a partial order ⪯ on A when A is finite there must be an a ∈ A that has fewer than two immediate successors.
8.146 Consider the partial order ≥ on the set Z≥0 . Argue that there is no maximal element in Z.
8.147 Note that there is a minimal element under the partial order ≥ on Z≥0 —namely 0, which is also the minimum element. Give an
example of a partial order on an infinite set that has neither a minimal nor a maximal element.
8.148 Let ⪯ be a partial order on a set A. Prove that there is at most one minimum element in A under ⪯. (That is, prove that if a ∈ A and
b ∈ A are both minimum elements, then a = b.)
8.149 Let ⪯ be a partial order on a set A, and let a ∈ A be a minimum element under ⪯. Prove that a is also a minimal element.

Here’s a (surprisingly addictive) word game that can be played with a set of Scrabble tiles. Each player has a set of words that she
“owns”; there is also a set of individual tiles in the middle of the table. At any moment, a player can form a new word by taking
both (1) one or more tiles from the middle, and (2) zero or more words owned by any of the players; and reordering those letters
to form a new word, which the player now owns. For example, from the word GRAMPS and the letters R and O, a player could make
the word PROGRAMS.
(If you’re bored and decide to waste time playing this game: it’s more fun if you forbid stealing words with “trivial” changes,
like changing COMPUTER into COMPUTERS. Each player should also get a fair share of the tiles, originally face down; anyone can
flip a new tile into the middle of the table at any time.)
Define a relation ⪯ on the set W of English words (of three or more letters), as follows: w ⪯ w′ if w′ can be formed from word
w plus one or more individual letters. For example, we showed above that GRAMPS ⪯ PROGRAMS.
8.150 Give a description (in English) of what it means for a word w to be a minimal element under ⪯, and what it means for a word w′
to be a maximal element under ⪯.
8.151 (programming required) Write a program that, given a word w, finds all immediate successors of w. (You can find a dictionary of
English words on the web, or /usr/share/dict/words on Unix-based operating systems.) Report all immediate successors of
GRAMPS using your dictionary.

ISTUDY
434 Relations

8.152 (programming required) Write a program to find the English word that is the longest minimal element under ⪯ (that is, out of all
minimal elements, find the one that contains the most letters).

8.153 Consider a spreadsheet containing a set of cells C. A cell c can contain a formula that depends on zero or more other cells. Write
⪯ to denote the relation {⟨p, s⟩ : cell s depends on cell p}. For example, the value of cell C2 might be the result of the formula
A2 ∗ B1; here A2 ⪯ C2 and B1 ⪯ C2. A spreadsheet is only meaningful if ⪯ is a strict partial order. Give a description (in English)
of what it means for a cell c to be a minimal element under ⪯, and what it means for a cell c′ to be a maximal element under ⪯.

8.154 List all total orders consistent with the partial order in Figure 8.34a.
8.155 Repeat for the partial order in Figure 8.34b.
8.156 Repeat for the partial order in Figure 8.34c.
8.157 Repeat for the partial order in Figure 8.34d.

A chain in a partial order ⪯ on A is a set C ⊆ A such that ⪯ imposes a total order on C—that is, writing the elements of C as
C = {c1 , c2 , . . . , ck } [in an appropriate order], we have c1 ⪯ c2 ⪯ · · · ⪯ ck .
8.158 Identify all chains of k ≥ 2 elements in the partial order in Figure 8.34a.
8.159 Repeat for the partial order reproduced in Figure 8.34b.
An antichain in a partial order ⪯ on A is a set S ⊆ A such that no two distinct elements in S are comparable under ⪯—that is, for
any distinct a, b ∈ S we have a ̸⪯ b.
8.160 Identify all antichains S with |S| ≥ 2 in the partial order in Figure 8.34a.
8.161 Repeat for the partial order reproduced in Figure 8.34b.

8.162 Consider the set A = {1, 2, . . . , n}. Consider the following claim: there exists a relation ⪯ on the set A that is both an equivalence
relation and a partial order. Either prove that the claim is true (and describe, as precisely as possible, the structure of any such
relation ⪯) or disprove the claim.

ISTUDY
8.5 Chapter at a Glance 435

8.5 Chapter at a Glance

Formal Introduction
A (binary) relation on A × B is a subset of A × B. For a relation R on A × B, we can write ⟨a, b⟩ ∈ R or
a R b. When A and B are both finite, we can describe R using a two-column table, where a row containing
a and b corresponds to ⟨a, b⟩ ∈ R. Or we can view R graphically: draw all elements of A in one column,
all elements of B in a second column, and draw a line connecting a ∈ A to b ∈ B whenever ⟨a, b⟩ ∈ R.
We’ll frequently be interested in a relation that’s a subset of A × A, where the two sets are the same.
In this case, we may refer to a subset of A × A as simply a relation on A. For a relation R ⊆ A × A, it’s
more convenient to visualize R using a directed graph, without separated columns: we simply draw each
element of A, with an arrow from a1 to a2 whenever ⟨a1 , a2 ⟩ ∈ R.
The inverse of a relation R ⊆ A × B is a new relation, denoted R−1 , that “flips around” every pair in
R, so that the relation R−1 = {⟨b, a⟩ : ⟨a, b⟩ ∈ R} is a subset of B × A. The composition of two relations
R ⊆ A × B and S ⊆ B × C is a new relation, denoted S ◦ R, that, informally, represents the successive
“application” of R and S. A pair ⟨a, c⟩ is related under S ◦ R ⊆ A × C if and only if there exists an element
b ∈ B such that ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S.
For sets A and B, a function f from A to B, written f : A → B, is a special kind of relation on A × B where,
for every a ∈ A, there exists one and only one element b ∈ B such that ⟨a, b⟩ ∈ f.
An n-ary relation is a generalization of a binary relation (n = 2) to describe a relationship among n-
tuples, rather than just pairs. An n-ary relation on the set A1 ×A2 ×· · ·×An is just a subset of A1 ×A2 ×· · ·×An ;
an n-ary relation on a set A is a subset of An .

Properties of Relations: Reflexivity, Symmetry, and Transitivity


A relation R on A is reflexive if, for every a ∈ A, we have that ⟨a, a⟩ ∈ R. It’s irreflexive if ⟨a, a⟩ ∈ / R
for every a ∈ A. (In the visualization described above, where we draw an arrow a1 → a2 whenever
⟨a1 , a2 ⟩ ∈ R, reflexivity corresponds to every element having a “self-loop” and irreflexivity corresponds to
no self-loops.) Note that a relation might be neither reflexive nor irreflexive.
A relation R on A is symmetric if, for every a, b ∈ A, we have ⟨a, b⟩ ∈ R if and only if ⟨b, a⟩ ∈ R.
The relation is antisymmetric if the only time both ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R is when a = b, and it’s
asymmetric if it’s never the case that ⟨a, b⟩ ∈ R and ⟨b, a⟩ ∈ R whether a ̸= b or a = b. Note that,
while asymmetry implies antisymmetry, they are different properties—and they’re both different from “not
symmetric”; a relation might not be symmetric, antisymmetric, or asymmetric. (In the visualization, a
relation is symmetric if every arrow a → b is matched by an arrow b → a; it’s antisymmetric if there are
no matched bidirectional pairs of arrows between a and b ̸= a; and it’s asymmetric if it’s antisymmetric
and furthermore there aren’t even any self-loops.) An alternative view is that a relation R is symmetric if
and only if R ∩ R−1 = R = R−1 ; it’s antisymmetric if and only if R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}; and it’s
asymmetric if and only if R ∩ R−1 = ∅.
A relation R on A is transitive if, for every a, b, c ∈ A, if ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ R, then ⟨a, c⟩ ∈ R
too. In the visualization, R is transitive if there are no “open triangles”: in a chain of connected elements,
every element is also connected to all “downstream” connections. The relation R is transitive if and only if
R ◦ R ⊆ R.
For a relation R ⊆ A × A, the closure of R with respect to some property is the smallest relation R′ ⊇ R
that has the named property. For example, the symmetric closure of R is the smallest relation R′′ ⊇ R

ISTUDY
436 Relations

such that R′′ is symmetric. We also define the reflexive closure R′ ; the transitive closure R+ ; the reflexive
transitive closure R∗ ; and the reflexive symmetric transitive closure R≡ . When A is finite, we can compute
any of these closures by repeatedly adding any missing elements to the set. The reflexive closure of R is
given by R ∪ {⟨a, a⟩ : a ∈ A}; the symmetric closure of R is R ∪ R−1 ; and the transitive closure of R is
R ∪ R2 ∪ R3 ∪ · · · .

Special Relations: Equivalence Relations and Partial/Total Orders


There are two special kinds of relations that emerge from particular combinations of these properties:
equivalence relations and partial/total orders.

Equivalence relations. An equivalence relation is a relation ≡ that’s reflexive, symmetric, and transitive.
Such a relation partitions the elements of A into one or more categories, called equivalence classes; any
two elements in the same equivalence class are related by ≡, and no two elements in different equivalence
classes are related.
A refinement of ≡ is another equivalence relation ≡r on the same set A where a ≡ b whenever a ≡r b.
Each equivalence class of ≡ is partitioned into one or more equivalence classes by ≡r , but no equivalence
class of ≡r intersects with more than one equivalence class of ≡. We also call ≡ a coarsening of ≡r .

Partial and total orders. A partial order is a reflexive, antisymmetric, and transitive relation ⪯. (A
strict partial order ≺ is irreflexive, antisymmetric, and transitive.) Elements a and b are comparable under
⪯ if either a ⪯ b or b ⪯ a; otherwise they’re incomparable. A Hasse diagram is a simplified visual
representation of a partial order where we draw a physically below c whenever a ⪯ c, and we omit the
a → c arrow if there’s some other element b such that a ⪯ b ⪯ c. (We also omit self-loops.)
For a partial order ⪯ on A, a minimum element is an element a ∈ A such that, for every b ∈ A, we have
a ⪯ b; a minimal element is an a ∈ A such that, for every b ∈ A with b ̸= a, we have b ̸⪯ a. (Maximum
and maximal elements are defined analogously.) Every minimum element is also minimal, but a minimal
element a isn’t minimum unless a is comparable with every other element. There’s at least one minimal
element in any partial order on a finite set.
A total order is a partial order under which all pairs of elements are comparable. A total order ⪯total is
consistent with the partial order ⪯ if a ⪯ b implies that a ⪯total b. For any partial order ⪯ on a finite set
A, there is a total order ⪯total on A that’s consistent with ⪯. Such an ordering of A is called a topological
ordering of A.

ISTUDY
8.5 Chapter at a Glance 437

Key Terms and Results

Key Terms Key Results


Formal Introduction Formal Introduction
• (binary) relation 1 For relations R ⊆ A × B and S ⊆ B × C, the relations
• inverse (of a relation) R−1 ⊆ B × A and S ◦ R ⊆ A × C—the inverse of R and
• composition (of two relations) the composition of R and S—are defined as
• functions (as relations)
R−1 = {⟨b, a⟩ : ⟨a, b⟩ ∈ R}
• n-ary relation
S ◦ R = {⟨a, c⟩ :
∃b ∈ B such that ⟨a, b⟩ ∈ R and ⟨b, c⟩ ∈ S}.
Properties of Relations
2 A function f : A → B is a special case of a relation on
• reflexivity
A × B, where, for every a ∈ A, there exists one and only
• irreflexivity
one element b ∈ B such that ⟨a, b⟩ ∈ f.
• symmetry
• asymmetry Properties of Relations
• antisymmetry
• transitivity 1 A relation R is symmetric if and only if
• closures (of a relation) R ∩ R−1 = R = R−1 ; it’s antisymmetric if and only if
R ∩ R−1 ⊆ {⟨a, a⟩ : a ∈ A}; and it’s asymmetric if and
Special Relations only if R ∩ R−1 = ∅.
2 A relation R is transitive if and only if R ◦ R ⊆ R.
• equivalence relation 3 The reflexive closure of R is R ∪ {⟨a, a⟩ : a ∈ A}; the
• equivalence class symmetric closure of R is R ∪ R−1 ; and the transitive
• coarsening, refinement closure of R is R ∪ R2 ∪ R3 ∪ · · · .
• partial order
• comparability Special Relations
• total order
• Hasse diagram 1 For a partial order ⪯ ⊆ A × A on a finite set A, there is at
• minimal/maximal element least one minimal element and at least one maximal
• minimum/maximum element element under ⪯.
• consistency 2 Let A be any finite set with a partial order ⪯. Then there
(of a total order with a partial order) is a total order ⪯total (a topological ordering of A) on A
• topological ordering that’s consistent with ⪯.

ISTUDY
ISTUDY
9 Counting

In which our heroes encounter many choices, some of which may


lead them to live more happily than others, and a precise count of
their number of options is calculated.

ISTUDY
439
440 Counting

9.1 Why You Might Care


How do I love thee? Let me count the ways.
Elizabeth Barrett Browning (1806–1861)
Sonnet 43, Sonnets from the Portuguese (1850)

The earth is a slightly squashed sphere, with a radius of a little less than 6400 kilometers (about 4000
miles), which means that its surface area is roughly 4πr2 ≈ 5.1 × 1014 square meters. If you then divide
the earth’s surface into squares that are three meters on a side, you’ll end up just under 35,0003 total
squares. This calculation is at the heart of the service provided by the company What Three Words, which
has (pseudorandomly) assigned a sequence of three common English words to each three-meter square on
earth, as an easy-to-use addressing system. (So computer.science.happiness is a spot in northwestern
Brazil, and enjoy.counting.chapter is located in eastern Quebec, in Canada.) Our calculation says
that we’d only need a vocabulary of 35,000 words (out of hundreds of thousands in current use) to specify
addresses. (The company’s underlying idea is that much of the world does not have a reliable street address
system—sometimes because a particular house doesn’t have a fixed address, sometimes because it’s in the
middle of an ocean, sometimes because the resolution of a street address isn’t sufficiently fine-grained.)
This chapter is devoted to the apparently trivial task of counting. By “counting,” we mean the follow-
ing problem: given a potentially convoluted description of a set S, compute the cardinality of S—that is,
compute the number of elements in S. It may seem bizarre that counting could somehow be harder than at
the preschool level (just count! one, two, three), but it will turn out that we can solve surprisingly subtle
problems with some useful and general (and subtle) techniques.
Why does counting matter in computer science? There are, again, surprisingly many applications. Here
are a few examples. One common (basic) style of algorithm is a brute-force algorithm, which finds the best
whatzit by trying every possible whatzit and seeing which one is best. Whether a brute-force algorithm is
fast enough depends on how many possible whatzits there are. A more advanced algorithmic design tech-
nique, called dynamic programming, can be used to design efficient recursive solutions to problems—as
long as there aren’t too many distinct subproblems. Counting techniques are even powerful enough to estab-
lish a mind-bending result about computability: we will be able to prove that there are more problems than
computer programs—which means that there are some problems that cannot be solved by any program!
Probability (see Chapter 10) has a plethora of applications in computer science, ranging from random-
ized algorithms in sorting (algorithms that process their input by making random decisions about how to
act) to models of random noise in speech recognition or random errors in typing (if I’m trying to type the
letter p, what is the chance that I accidentally type o instead?). We can think of the probability of some
event X happening as two counting problems: the ratio of, first, the number of ways X can happen, and,
second, the number of ways X can either happen or not happen.
We’ll start in Section 9.2 with some basic counting techniques—how to compute the cardinality of a
union A ∪ B of two sets, or sequences from the Cartesian product A × B of two sets. We then turn in Section
9.3 to one of the best counting strategies: being lazy! If we can show that |A| = |B| and we already know
the value of |B|, then figuring out |A| is easy; we’ll often use functions to relate two sets so that we can
then lazily compute the size of the apparently harder-to-count set. Finally, in Section 9.4, we’ll explore
combinations (“how many ways are there to choose an unordered collection of k items out of a set of n
possibilities?”) and permutations (“how many ways are there to put a set of n items into some order?”).
As we go, we’ll see several other applications, scattered across computer science: breaking cryptographic
systems, compressing media files, and the time the internet started to run out of addresses, to name a few.

ISTUDY
9.2 Counting Unions and Sequences 441

9.2 Counting Unions and Sequences


But if he does really think that there is no distinction between virtue and vice, why, Sir,
when he leaves our houses let us count our spoons.
Samuel Johnson (1709–1784)

Suppose that we have two sets A and B from which we must choose an element. There are two dif-
ferent natural scenarios that meet this one-sentence description: we must choose a total of one element
from either A or B, or we must choose one element from each of A and B. For example, consider a restau-
rant that offers soups A = {chicken noodle, beer cheese, minestrone, mulligatawny, . . .} and offers salads
B = {caesar, house, cobb, . . .}. A lunch special that includes soup or salad involves choosing an x ∈ A∪B.
A dinner special including soup and salad involves choosing an x ∈ A and also choosing a y ∈ B—that is,
choosing an element ⟨x, y⟩ ∈ A×B. In Section 9.2.1, we’ll start with rules for computing these cardinalities:

Sum Rule: If A and B are disjoint, then |A ∪ B| = |A| + |B|.

Product Rule: The number of pairs ⟨x, y⟩ with x ∈ A and y ∈ B is |A × B| = |A| · |B|.

These rules handle the soup-and-salad scenarios above, but there are a pair of extensions that we’ll introduce
to handle slightly more complex situations. The first (Section 9.2.2) extends the Sum Rule to calculate the
cardinality of a union of two sets even if those sets may contain elements in common:

Inclusion–Exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|.

The second extension (Section 9.2.3) generalizes the Product Rule to allow us to calculate the cardinality
of a set of pairs ⟨x, y⟩ even if the choice of x changes the list (but not the number) of possible choices for y:

Generalized Product Rule: Consider pairs ⟨x, y⟩ of the following form: we can choose any x ∈ A, and,
for each such x, there are precisely n different choices for y. Then the total number of pairs meeting this
description is |A| · n.

The remainder of this section will give the details of these four rules, and how to use these rules individually
and in combination.

9.2.1 The Basics: The Sum and Product Rules


Sum Rule: counting disjoint unions
Our first rule addresses the union of two sets: if two sets A and B are disjoint, then the cardinality of their
union is simply the sum of their sizes:

Theorem 9.1: Sum Rule.


Let A and B be sets. If A ∩ B = ∅, then |A ∪ B| = |A| + |B|.

More generally, consider a collection of k ≥ 1 sets A1 , A2 , . . . , Ak . If these sets are all disjoint—that is, if
Ai ∩ Aj = ∅ whenever i = ̸ j—then the cardinality of their union is the sum of their individual cardinalities:
|A1 ∪ A2 ∪ · · · ∪ Ak | = |A1 | + |A2 | + · · · + |Ak |.

ISTUDY
442 Counting

The Sum Rule captures an intuitive fact: if a box contains some red things and some blue things, then
the total number of things in the box is the number of red things plus the number of blue things.

Example 9.1: Counting disjoint unions.

• Let A = {1, 2} and B = {3, 4, 5, 6}. Thus |A| = 2 and |B| = 4. Observe that the sets A and B are
disjoint. By the sum rule, |A ∪ B| = |A| + |B| = 2 + 4 = 6. Indeed, we have A ∪ B = {1, 2, 3, 4, 5, 6},
which contains 6 elements.
• There are 11 starters on your school’s women’s soccer team. Suppose there are 8 nonstarters on the
team. The total number of people on the team is 19 = 11 + 8.
• At a certain school in the midwest, there are currently 30 computer science majors who are studying
abroad. There are 89 computer science majors who are studying on campus. Then the total number of
computer science majors is 119 = 89 + 30.
• Consider a computer lab that contains 32 Macs and 14 PCs and 1 PDP-8 (a 1960s-era machine, one
of the first computers that was sold commercially). Then the total number of computers in the lab is
47 = 32 + 14 + 1.

Example 9.2: Students in classes.


During this term, there are 19 students taking Data Structures, and 39 students taking Mathematics of
Computer Science. Let S denote the set of students taking Data Structures or Mathematics of Computer
Science this term. What is |S|?
Solution. There isn’t enough information to answer the question!
It’s possible that there are no students who are taking both classes. In this case—that is, in the case
that DS ∩ MOCS = ∅—we have |S| = |DS| + |MOCS| = 19 + 39 = 58.
But, for all we know from the problem statement, every student in Data Structures is also taking
Mathematics of Computer Science. In this case, we have DS ⊂ MOCS and thus S = DS ∪ MOCS =
MOCS; therefore |S| = |MOCS| = 39.
Indeed, |S| can be anywhere between 39 and 58. (The Inclusion–Exclusion Rule, in Section 9.2.2,
formalizes the calculation of |A∪B| in terms of |A|, |B|, and |A∩B|, in the manner that we just considered.)

Taking it further: The logic that we used in Example 9.2 to conclude that there were at most 58 students in the two classes
combined is an application of the general fact that |A ∪ B| ≤ |A|+|B|. While this fact is pretty simple, it turns out to be remarkably
useful in proving facts about probability. The Union Bound states that the probability that any of A1 , A2 , . . . , Ak occurs is at most
p1 + p2 + · · · + pk , where pi denotes the probability that Ai occurs. The Union Bound turns out to be useful when each Ai is
a “bad event” that we’re worried might happen, and these bad events may have complicated probabilistic dependencies—but if
we can show that the probability that every particular one of these bad events is some very small ϵ, then we can use the Union
Bound to conclude that the probability of experiencing any bad event is at most k · ϵ. (See Exercise 10.143, for example.)

Using the Sum Rule in less obvious settings


As a general strategy for solving counting problems, we can try to find a way to apply the Sum Rule—even
if it does not superficially seem to apply. If we can find a way to partition an apparently complicated set S
Sk
into simple disjoint sets S1 , S2 , . . . , Sk such that i=1 Si = S, then we can use the Sum Rule to find |S|.
In this spirit, here’s a somewhat more complex example of using the Sum Rule, where we have to figure
out the subsets ourselves: let’s determine how many 8-bit strings contain precisely two ones. (The full list
of the bitstrings meeting this condition appears in Figure 9.1.)

ISTUDY
9.2 Counting Unions and Sequences 443

11000000 01100000 00110000 00011000 00001100 00000110 00000011


10100000 01010000 00101000 00010100 00001010 00000101
10010000 01001000 00100100 00010010 00001001
10001000 01000100 00100010 00010001
10000100 01000010 00100001
10000010 01000001
10000001

Figure 9.1 All bitstrings in {0, 1} that contain exactly two ones.
8

Example 9.3: 8-bit strings with exactly two ones.


8
How many elements of {0, 1} have precisely two ones?
Solution. Obviously, we can just count the number of bitstrings in Figure 9.1, which yields the answer:
there are 28 such bitstrings. But let’s use the Sum Rule instead.
8
What does a bitstring x ∈ {0, 1} with two ones look like? There must be two indices i and j—say
with i > j—such that xi = xj = 1, and all other components of x must be 0:
one in position j one in position i

x = 00 · · · 0 1 00 · · · 0 1 00 · · · 0
j−1 i−j−1 8−i
zeros zeros zeros

(For example, 01001000 has ones in positions j = 2 and i = 5, interspersed with an initial block of
j − 1 = 1 zero, a block of i − j − 1 = 2 between-the-ones zeros, and a block of 8 − i = 3 final zeros.)
We are going to divide the set of 8-bit strings with two ones based on the index i. That is, suppose that
8
x ∈ {0, 1} contains two ones, and the second one in x appears in bit position #i. Then there are i − 1
positions in which the first one could appear—any of the slots j ∈ {1, 2, . . . , i − 1} coming before i. (See
Figure 9.1, where the (i − 1)st column contains all i − 1 bitstrings whose second one appears in position
#i. For example, column #3 contains the three bitstrings with x4,5,6,7,8 = 10000: that is, 10010000,
01010000, and 00110000.) Because every x with exactly two ones has an index i of its second one, we
can use the Sum Rule to say that the answer to the given question is
X
8 X
8
[number of bitstrings with the second one in position i] = (i − 1) = 0 + 1 + · · · + 7 = 28.
i=1 i=1

(We’ll also see another way to solve this example later, in Example 9.41.)

Problem-solving tip: When you’re trying to find the cardinality of a complicated set S, try to find a way to split S into a collection
of simpler disjoint sets, and then apply the Sum Rule.

Let’s also generalize this example to bitstrings of arbitrary length:

Example 9.4: k-bit strings with exactly two ones.


k
Consider the set S = {x ∈ {0, 1} : x has precisely two ones}. As in Example 9.3, every bitstring x ∈ S
has an index i of its second one; we’ll use the value of i to partition S into sets that can be easily counted,
and then use the Sum Rule to find |S|. Specifically, for each index i with 1 ≤ i ≤ k, define the set
Si = {x ∈ S : xi = 1 and xi+1 = xi+2 = · · · = xk = 0}
k
= {x ∈ {0, 1} : [∃j ≤ i − 1 : xi = xj = 1 and x has no other ones]} .

ISTUDY
444 Counting

Sk
Observe that |Si | = i − 1: there are i − 1 different possible values of j. Also, observe that S = i=1 Si
and that, for any i ̸= i′ , the sets Si and Si′ are disjoint. Thus
[
k X
k X
k
k(k − 1)
|S| = Si = |Si | = (i − 1) = (∗)
2
i=1 i=1 i=1

by the Sum Rule and the formula for the sum of the first n integers (Theorem 5.3). As a check of our
formula, let’s verify our solution for some small values of k:

• For k = 2, (∗) says there are 2(2−1)


2 = 1 strings with two ones. Indeed, there’s just one: the string 11.
3(3−1)
• For k = 3, there are 2 = 3 strings with two ones: 011, 101, and 110.
• For k = 4, indeed there are 4·3
2 = 6 such strings: 1100, 1010, 0110, 1001, 0101, and 0011.
• For k = 8, (∗) matches Example 9.3: for k = 8, we have 28 = 8·7 2 strings with two ones.

Problem-solving tip: It’s always a good idea to check to make sure your formulas are reasonable by testing them for small inputs
(as we did in Example 9.4).

Product Rule: counting sequences


Our second basic counting rule addresses the Cartesian product of sets. Recall that, for sets A and B, the
Cartesian product A × B consists of all pairs ⟨a, b⟩ with a ∈ A and b ∈ B. (For example, {1, 2, 3} × {x, y}
is {⟨1, x⟩, ⟨1, y⟩, ⟨2, x⟩, ⟨2, y⟩, ⟨3, x⟩, ⟨3, y⟩}.) The cardinality of A × B is the product of the cardinalities
of A and B:

Theorem 9.2: Product Rule.


Let A and B be sets. Then |A × B| = |A| · |B|.

More generally, consider a collection of k arbitrary sets A1 , A2 , . . . , Ak , and consider the set of k-element
sequences where, for each i, the ith component is an element of Ai . The number of such sequences is given
by the product of the sets’ cardinalities:
|A1 × A2 × · · · × Ak | = |A1 | · |A2 | · · · · · |Ak |.
Here are a few examples of counting using the Product Rule:

Example 9.5: Counting sequences.

• Let A = {1, 2} and B = {3, 4, 5, 6}. By the product rule, |A × B| = |A| · |B| = 2 · 4 = 8. Indeed,
A × B = {⟨1, 3⟩, ⟨1, 4⟩, ⟨1, 5⟩, ⟨1, 6⟩, ⟨2, 3⟩, ⟨2, 4⟩, ⟨2, 5⟩, ⟨2, 6⟩}, which contains 8 elements.
• At a certain school in the midwest, there are currently 56 senior computer science majors and 63 junior
computer science majors. Then the number of ways to choose a pair of class representatives, one senior
and one junior, is 56 · 63 = 3528.
• Consider a tablet computer that is sold with three different options: a choice of protective cover, a
choice of stylus, and a color. If there are 7 different styles of protective cover, 5 different styles of
stylus, and 3 different colors, then there are 7 · 5 · 3 = 105 different configurations of the computer.

Like the Sum Rule, the Product Rule should be reasonably intuitive: if we are choosing a pair ⟨a, b⟩ from
A × B, then we have |A| different choices of the first component a—and, for each of those |A| choices, we

ISTUDY
9.2 Counting Unions and Sequences 445

have |B| choices for the second component b. (Thinking of A as A = {a1 , a2 , . . . , a|A| }, we can even view
{⟨a, b⟩ : a ∈ A, b ∈ B} as

{⟨a1 , b⟩ : b ∈ B} ∪ {⟨a2 , b⟩ : b ∈ B} ∪ · · · ∪ {⟨a|A| , b⟩ : b ∈ B} .

By the Sum Rule, this set has cardinality |B| + |B| + · · · + |B|, with one term for each element of A—in
other words, it has cardinality |A| · |B|.) Here are a few more examples:

Example 9.6: 32-bit strings.


How many different 32-bit strings are there?
32
Solution. The set of 32-bit strings is {0, 1} —that is, elements of
{0, 1} × {0, 1} × {0, 1} × · · · × {0, 1} .
32 times

Because | {0, 1} | = 2, the Product Rule lets us conclude that


32
|{0, 1} | = | {0, 1} | · | {0, 1} | · · · · · | {0, 1} | = 2 · 2 · · · · · 2 = 232 .
32 times 32 times

Example 9.7: Number of possible shortened URLs.


A URL-shortening service like bit.ly or tinyurl.com allows a user to compress a long URL into a
much shorter sequence of characters. (The shorter URL can then be used in emails or tweets or other
contexts in which a long URL is unwieldy.) For example, by entering the URL of Katherine Johnson’s
Wikipedia page into bit.ly, I got the URL https://bit.ly/36fvkTP as a shortened form of https:
//en.wikipedia.org/wiki/Katherine_Johnson.
If a shortened URL consists of 7 characters, each of which is a digit, lowercase letter, or uppercase
letter, the number of possible shortened URLs is, using the Product Rule,
|C × C × C × C × C × C × C| = |C| · |C| · |C| · |C| · |C| · |C| · |C| = |C|7 ,
where C = {0, . . . , 9} ∪ {a, . . . , z} ∪ {A, . . . , Z} is the set of possible characters. Because |C| =
10 + 26 + 26 = 62 via the Sum Rule, there are 627 = 3,521,614,606,208 possible 7-character URLs.

Taking it further: The point of a URL-shortening service is to translate long URLs into short ones, but it’s theoretically
impossible for every URL to be shortened by this service: there are more possible URLs of length k than there are URLs of
length strictly less than k. A similar issue arises with file compression algorithms, like ZIP, that try to reduce the space required
to store a file. See p. 475.

Product Rule: counting sequences from a fixed set


The use of the Product Rule in Examples 9.6 and 9.7—to count the number of sequences of length k with
elements all drawn from a fixed set S, rather than having a different set of options for each component—is
common enough that we’ll note it as a separate rule:

Theorem 9.3: Product Rule: sequences of elements from a single set S.


For any set S and any k ∈ Z≥1 , the number of k-tuples from the set Sk = S × S × · · · × S is |Sk | = |S|k .
k times

ISTUDY
446 Counting

(A notational reminder regarding Theorem 9.3: Sk is the set S × S × · · · × S, that is, the set of k-tuples
where each component is an element of S, while |S|k is the number |S| raised to the kth power.)
Here’s another example using this special case of the Product Rule:

Example 9.8: MAC addresses.


A media access control address, or MAC address, is a unique identifier for a network adapter, like an
ethernet card or wireless card. A MAC address consists of a sequence of six groups of pairs of hexadecimal
digits. (A hexadecimal digit is one of 0123456789ABCDEF.) For example, F7:DE:F1:B6:A4:38 is a
MAC address. (The pairs of digits are traditionally separated by colons when written down.) How many
different MAC addresses are there?
Solution. There are 16 different hexadecimal digits. Thus, using the Product Rule, there are 16 · 16 =
256 different pairs of hexadecimal digits, ranging from 00 to FF. Using the Product Rule again, as in
Example 9.7, we see that there are 2566 different sequences of six pairs of hexadecimal digits. Thus there
are 2566 = [162 ]6 = [(24 )2 ]6 = 248 = 281,474,976,710,656 total different MAC addresses.

Taking it further: In addition to the numerical addresses assigned to particular hardware devices—the MAC addresses from
Example 9.8—each device that’s connected to the internet is also assigned an address, akin to a mailing address, that’s used
to identify the destination of a packet of information. But we’ve had to make a major change to the way that information is
transmitted across the internet because of a counting problem: we’ve run out of addresses! See p. 456.

9.2.2 Inclusion–Exclusion: Unions of Nondisjoint Sets


The counting techniques that we’ve introduced so far have some important restrictions. We can only use
the Sum Rule to calculate |A ∪ B| when A and B are disjoint. And we are only able to use the Product Rule
to calculate the number of sequences when the set of options for the second component does not depend
on the choice that we made in the first component. In the remainder of this section, we will extend our
techniques to remove these restrictions so that we can handle more general problems. Let’s start with a
specific example of the cardinality of the union of nondisjoint sets:

Example 9.9: Primes and odds.


Consider the set O = {1, 3, 5, 7, 9} of odd numbers less than 10 and the set P = {2, 3, 5, 7} of prime
numbers less than 10. What is |O ∪ P|?
It might be tempting to use the Sum Rule to conclude that |O ∪ P| = |O| + |P| = 5 + 4 = 9. But this
conclusion is incorrect, because P ∩ O = {3, 5, 7} ̸= ∅, so the Sum Rule doesn’t apply. In particular,
O ∪ P = {1, 2, 3, 5, 7, 9}, so |O ∪ P| = 6.

The issue with the naïve application of the Sum Rule in Example 9.9 is called double counting: in the
expression |O| + |P|, we counted the elements in the intersection O ∩ P twice, which gave us the incorrect
total count. The idea underlying the Inclusion–Exclusion Rule is to correct for this error: to compute the size
of the union of two sets A and B, we extend the Sum Rule to correct for the double counting by subtracting
|A ∩ B| from the final result. (See Figure 9.2.) This counting rule is called Inclusion–Exclusion because we
include (add) the cardinalities of the two individual sets, and then exclude (subtract) the cardinality of the
intersection of the pairs:

ISTUDY
9.2 Counting Unions and Sequences 447

A B A B A B A B
+ −

A B A B A B
= =

(a) Two sets A and B; (b) Calculating |A| + |B| counts elements in (c) We correct for the double-counted
we seek |A ∪ B|. the dark-shaded region A ∩ B twice. intersection by subtracting its cardinality.

Figure 9.2 The Inclusion–Exclusion Rule: |A ∪ B| = |A| + |B| − |A ∩ B|.

Theorem 9.4: Inclusion–Exclusion.


Let A and B be sets. Then |A ∪ B| = |A| + |B| − |A ∩ B|.

Problem-solving tip: Sometimes the easiest way to solve a problem—in CS or in life!—is to find an imperfect approximation to
the solution, and then correct for whatever inaccuracies result. Inclusion–Exclusion is a good example of this estimate-and-fix
strategy.

Here are a few small examples, and then one that’s slightly more complicated:

Example 9.10: Counting not necessarily disjoint unions.

• Let A = {1, 2, 3} and B = {3, 4, 5, 6}. Thus A ∩ B = {3}, and so |A| = 3 and |B| = 4 and |A ∩ B| = 1.
By the Inclusion–Exclusion Rule, |A ∪ B| = |A| + |B| − |A ∩ B| = 3 + 4 − 1 = 6. Indeed, we have
A ∪ B = {1, 2, 3, 4, 5, 6}, which contains 6 elements.
• At a certain school in the midwest, there are 119 computer science majors and 65 math majors. There
are 7 students double majoring in CS and math. Thus a total of 119 + 65 − 7 = 177 different students
are majoring in either of the two fields.
• There are 21 consonants (BCDFGHJKLMNPQRSTVWXYZ) in English. There are 6 vowels in English
(AEIOUY). There is one letter that’s both a vowel and a consonant (Y). Thus there are 21 + 6 − 1 = 26
total letters.
• Let E be the set of even integers between 1 and 100. Let O be the set of odd integers between 1 and
100. Note that |E| = 50, |O| = 50, and |E ∩ O| = 0. Thus |E ∪ O| = 50 + 50 − 0 = 100.

Example 9.11: ATM machine PIN numbers.


A certain bank’s customers can select a 4-digit number (called a PIN) to access their accounts, but the
bank insists that the PIN may not start with the same digit repeated three times (for example, 7770) or
end with the same digit repeated three times (for example, 0111). How many invalid PINs are there?
Solution. Let S denote the set of PINs that start with three repeated digits. Let E denote the set of PINs
that end with three repeated digits. Then the set of invalid PINs is S ∪ E. (See Figure 9.3.)
2
Note that |S| = 100: we can view a PIN in S as a sequence of two digits ⟨x, y⟩ ∈ {0, 1, . . . , 9} , with x
repeated three times in the PIN. (So ⟨3, 1⟩ corresponds to the PIN 3331.) By the Product Rule, there are
102 = 100 such codes.

ISTUDY
448 Counting

the first three positions match the last three positions match

0001, 0002, 0003, . . . , 9997, 9998 0000, 1111, . . . , 9999 0111, 0222, 0333, . . . , 9777, 9888

Figure 9.3 Invalid PINs, starting or ending with the same digit repeated three times.

Similarly, we have that |E| = 100: we can think of an element of E as a sequence of two digits
2
⟨x, y⟩ ∈ {0, 1, . . . , 9} , where y is repeated three times in the PIN.
If S ∩ E were empty, then we could apply the Sum Rule to compute |S ∪ E| as 100 + 100 = 200. But
there are PINs that are in both S and E. A 4-digit number ⟨x, y, z, w⟩ is in S ∩ E if and only if x = y = z
(because ⟨x, y, z, w⟩ ∈ S) and y = z = w (because ⟨x, y, z, w⟩ ∈ E). That is, any 4-digit number that
consists of the same digit repeated four times is in S ∩ E. Thus
S ∩ E = {0000, 1111, 2222, 3333, 4444, 5555, 6666, 7777, 8888, 9999} ,
and |S ∩ E| = 10. Applying the Inclusion–Exclusion Rule, we see that the set S ∪ E of invalid PINs has
cardinality |S| + |E| − |S ∩ E| = 100 + 100 − 10 = 190. (So 10,000 − 190 = 9810 PINs are valid.)

The basic Sum Rule is actually a special case of the Inclusion–Exclusion Rule: if A and B are disjoint, then
|A ∩ B| = ∅, so |A ∪ B| = |A| + |B| − |A ∩ B| = |A| + |B| − 0 = |A| + |B|.

Inclusion–Exclusion for three sets


Theorem 9.4 describes how to calculate the cardinality of the union of two sets, but this idea can be gen-
eralized. The basic idea is simple: we will try counting in the easiest way possible, and then we’ll correct
for any overcounting or undercounting. For example, we can compute the cardinality of the union of three
sets A ∪ B ∪ C using a more complicated version of Inclusion–Exclusion:

• We add (include) the three singleton sets (|A| + |B| + |C|), but this sum counts any element contained in
more than one of the three sets more than once.
• So we subtract (exclude) the three pairwise intersections (|A ∩ B| + |A ∩ C| + |B ∩ C|) from the sum.
But we’re not done: imagine an element contained in all three of A, B, and C; such an element was
included three times and then excluded three times, so it hasn’t been counted at all.
• So we add (include) the three-way intersection |A ∩ B ∩ C|.

This calculation yields the following three-set rule for Inclusion–Exclusion. (Or see Figure 9.4 for a visual
illustration of why this calculation is correct.)

Theorem 9.5: Inclusion–Exclusion for three sets.


Let A, B, and C be sets. Then |A ∪ B ∪ C| is given by
|A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|.

Here are a couple of small examples of the three-set version of Inclusion–Exclusion:

ISTUDY
9.2 Counting Unions and Sequences 449

A B A B A B A B A B A B A B A B A B

+ + + + − +
C C C C C C C C C

A B A B A B

= = =

C C C

(a) If we start to compute |A ∪ B ∪ C| (b) Subtracting the sum of the sizes of (c) The result of (a) minus (b) hasn’t
as |A| + |B| + |C|, we correctly count the pairwise intersections counted the elements of A ∩ B ∩ C at
the light-shaded regions, but we count |A ∩ B| + |B ∩ C| + |A ∩ C| almost all, so we can achieve the final count
elements in the medium-shaded corrects for the double counting from by adding |A ∩ B ∩ C|.
regions twice, and elements in the (a), but it also triple counts the
dark-shaded region three times. elements of A ∩ B ∩ C.

Figure 9.4 The Inclusion–Exclusion Rule for three sets A, B, and C. See Theorem 9.5.

Example 9.12: Counting three-set unions.

• Let A = {0, 1, 2, 3, 4} and B = {0, 2, 4, 6} and C = {0, 3, 6}. Then


{ 0, 1, 2, 3, 4} {0, 3, 6}
{ 0, 2, 4} { 0, 3} { 0, 6} { 0}
{0, 2, 4, 6}

|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|


= 5 + 4 + 3 − 3 − 2 − 2 + 1 = 6,
by Inclusion–Exclusion. Indeed, A ∪ B ∪ C = {0, 1, 2, 3, 4, 6}.
• Consider the words ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, and EIGHT. Let E be the set of these
words containing at least one E, let T be the words containing a T, and let R be the words containing
an R. Then
ONE
THREE TWO
THREE THREE
FIVE THREE THREE THREE THREE
FOUR EIGHT
SEVEN EIGHT
EIGHT

|E ∪ T ∪ R| = |E| + |T| + |R| − |E ∩ T| − |E ∩ R| − |T ∩ R| + |E ∩ T ∩ R|


= 5 + 3 + 2 − 2 − 1 − 1 + 1 = 7,
and, indeed, seven of the eight words are in E ∪ T ∪ R (the only one missing is SIX).

We’ll close with a slightly bigger example, about integers divisible by 2, 3, or 5:

Example 9.13: Divisibility.


How many integers between 1 and 1000, inclusive, are evenly divisible by any of 2, 3, or 5?
Solution. Define the following sets (writing Z1...1000 for {1, . . . , 1000}):
  
A = n ∈ Z1...1000 : 2 | n , B = n ∈ Z1...1000 : 3 | n , and C = n ∈ Z1...1000 : 5 | n .
We must compute |A ∪ B ∪ C|. It’s fairly easy to see that |A| = 500, |B| = 333, and |C| = 200, because
A = {2n : 1 ≤ n ≤ 500}, B = {3n : 1 ≤ n ≤ 333}, and C = {5n : 1 ≤ n ≤ 200}.

ISTUDY
450 Counting

Observe that A ∩ B is the set of integers between 1 and 1000 that are divisible by both 2 and 3—that
is, the set of integers divisible by 6. By the same logic that we used to compute |A|, |B|, and |C|, we see
|A ∩ B| |A ∩ C| |B ∩ C|
= | {6n : 1 ≤ n ≤ 166} | = | {10n : 1 ≤ n ≤ 100} | = | {15n : 1 ≤ n ≤ 66} |
= 166 = 100 = 66.
And, using the same approach, we can conclude that A ∩ B ∩ C = {n : 30 | n} = {30n : 1 ≤ n ≤ 33},
so |A ∩ B ∩ C| = 33. Therefore, using the Inclusion–Exclusion Rule, |A ∪ B ∪ C| is
|A ∪ B ∪ C| = 500 + 333 + 200 − 166 − 100 − 66 + 33 = 734.
| A| | B| |C| | A ∩ B| |A ∩ C| | B ∩ C| |A ∩ B ∩ C|

Problem-solving tip: To verify a calculation like this one, it’s a good idea (and not hard) to write a short program.

We can further generalize the Inclusion–Exclusion Rule to calculate the cardinality of the union of an
arbitrary number of sets. (See Exercises 9.33 and 9.184.)

9.2.3 The Generalized Product Rule


The Product Rule (Theorem 9.2) tells us how to compute the number of 2-element sequences where the
first element is drawn from the set A and the second from the set B—specifically, it says that |A × B| is
|A| · |B|. But there are many types of sequences that do not precisely fit this setting: the Product Rule only
describes the set of sequences where each component is selected from a fixed set of options. If the set of
options for choice #2 depends on choice #1, then we cannot directly apply the Product Rule. However, the
basic principle of the Product Rule still applies if the number of different choices for the second component
is the same regardless of the choice of the first component, even if the particular set of choices can differ:

Theorem 9.6: Generalized Product Rule.


Let S denote a set of sequences, each of length k, where for each index i ∈ {1, . . . , k} the following
condition holds: for each choice of the first i − 1 components of the sequence, there are exactly ni choices
Qk
for the ith component. Then |S| = i=1 ni .

Here are a few examples using the Generalized Product Rule:

Example 9.14: Gold, silver, and bronze.


A set S of eight sprinters qualify for the finals of the 100-meter sprint in the Olympics. One will win the
gold medal, another the silver, and a third the bronze. How many different trios of medalists are possible?
Solution. It “feels” like we can solve this problem using the Product Rule, by choosing a sequence of
three elements from S, where we forbid duplication in our choices. But our choice of gold, silver, and
bronze medalists would be from
S × (S − {the gold medalist}) × (S − {the gold and silver medalists})
and the Product Rule doesn’t permit the set of choices for the second component to depend on the first
choice, or the options for the third choice to depend on the first two choices.
Instead, observe that there are 8 choices for the gold medalist. For each of those choices, there are 7
choices for the silver medalist. For each of these pairs of gold and silver medalists, there are 6 choices

ISTUDY
9.2 Counting Unions and Sequences 451

8
rmblkans 8
rmblkans
7
opopopop 7
opopopop
6
0Z0Z0Z0Z 6
0Z0Z0Z0Z
5
Z0Z0Z0Z0 5
Z0Z0Z0Z0
4
0Z0Z0Z0Z 4
0Z0Z0Z0Z
3
Z0Z0Z0Z0 3
Z0Z0Z0Z0
2
POPOPOPO 2
POPOPOPO
1
SNAQJBMR 1
SNAQJBMR
a b c d e f g h a b c d e f g h

Figure 9.5 The valid first moves in a chess game.

for the bronze medalist. Thus, by the Generalized Product Rule, the total number of trios of medalists is
8 · 7 · 6 = 336.

Example 9.15: Opening moves in a chess game.


In White’s very first move in a chess game, there are n1 = 10 pieces that can move: any of White’s 8
pawns or 2 knights. Each of these pieces has n2 = 2 legal moves: the pawns can move forward either 1
or 2 squares, and the knights can move either ↰ or ↱. (See Figure 9.5.) Thus there are n1 · n2 = 10 · 2 = 20
legal first moves.

Example 9.16: Students in classes.


At a certain school in the midwest, each of 2023 students enrolls in exactly 3 classes per term. The set
Enrollments = {⟨s, c⟩ : s is a student enrolled in class c during the current term}
has cardinality 2023 · 3 = 6069, by the Generalized Product Rule: for each of the n1 = 2023 choices
of student, there are n2 = 3 choices of classes. (Note that the original Product Rule does not apply,
because the set Enrollments is not a Cartesian product: in general, two students are not enrolled in the
same classes—just the same number of classes.)

Although we didn’t say we were doing so, we actually used the underlying idea of the Generalized Product
Rule in Example 9.11. Let’s make its use explicit here:

Example 9.17: 4-digit PINs starting with a triplicated digit.


4
Let S ⊆ {0, 1, . . . , 9} denote the set of 4-digit PINs that start with three repeated digits. We claim that
|S| = 100, as follows:

• There are n1 = 10 choices for the first digit.


• There is only n2 = 1 choice for the second digit: it must match the first digit.
• There’s also only n3 = 1 choice for the third digit: it must match the first two.
• There are n4 = 10 choices for the fourth digit.

Thus there are n1 · n2 · n3 · n4 = 10 · 1 · 1 · 10 = 100 elements of S.

Permutations
The Generalized Product Rule sheds some light on a concept that arises in a wide range of contexts: a
permutation of a set S, which is any ordering of the elements of S.

ISTUDY
452 Counting

Definition 9.7: Permutation.


A permutation of a set S is a sequence of elements from S that is of length |S| and contains no repetitions.
In other words, a permutation of S is an ordering of the elements of S.

As a first example, let’s list all the permutations of the set {1, 2, . . . , n} for a few small values of n:

• For n = 1, there’s just one ordering: ⟨1⟩.


• For n = 2, there are two orderings: ⟨1, 2⟩ and ⟨2, 1⟩.
• For n = 3, there are six orderings: ⟨1, 2, 3⟩, ⟨1, 3, 2⟩, ⟨2, 1, 3⟩, ⟨2, 3, 1⟩, ⟨3, 1, 2⟩, and ⟨3, 2, 1⟩.
• For n = 4, there are 24 orderings: six with 1 as the first element (which can then be followed by any
of the six permutations of ⟨2, 3, 4⟩), six with 2 as the first element, six with 3 first, and six with 4 first,
yielding a total of 4 · 6 = 24 orderings.

How many permutations of an n-element set are there? There are several ways to see the general pattern,
including recursively, but it may be easiest to use the Generalized Product Rule to count:

Theorem 9.8: Number of permutations.


Let S be any set, and write n = |S|. The number of different permutations of S is n!.

Proof. There are n choices for the first element of a permutation of S. For the second element, there are
n−1 choices (all but the element chosen first). There are n−2 choices for the third slot (all but the elements
chosen first and second). In general, for the ith element, there are n − i + 1 choices. Thus
Q
n Q
n
the number of permutations of S = (n − i + 1) = j = n!
i=1 j=1

by the Generalized Product Rule.


Here’s a small example for a concrete set S:

Example 9.18: 10-digit numbers.


What fraction of integers between 0 and 9,999,999,999 (all written as 10-digit numbers, including any
leading zeros) have no repeated digits?
Solution. We seek a 10-digit sequence with no repetitions—that is, a permutation of {0, 1, . . . , 9}. There
are 10! = 3,628,800 such permutations, by Theorem 9.8. There are a total of 1010 integers between 0
and 9,999,999,999, by the Product Rule. Thus the fraction of these integers with no repeated digits is
1010 ≈ 0.00036 · · · , about one out of every 2750 integers in this range.
10!

Taking it further: A permutation of a set S is an ordering of that set S—so thinking about permutations is closely related to
thinking about sorting algorithms that put an out-of-order array into a specified order. By using the counting techniques of this
section, we can prove that algorithms must take a certain amount of time to sort; see p. 457. We will also return to permutations
frequently later in the chapter. For example, in Section 9.4, we will address counting questions like the following: how many
different 13-card hands can be drawn from a standard 52-card deck of playing cards? (Here’s one way to think about it: we can
lay out the 52 cards in any order—any permutation of the cards—and then pick the first 13 of them as a hand. We’ll have to
correct for the fact that any ordering of the first 13 cards—and, for that matter, any ordering of the last 39—will count as the
same hand. But permutations will also help us to think about this correction!)

ISTUDY
9.2 Counting Unions and Sequences 453

9.2.4 Combining Products and Sums


Suppose that we select a pair ⟨a, b⟩ from a set of possible choices. The Product Rule tells us how many
ways to make these choices if the particular choice of a does not affect the set of options from which b
is chosen. The Generalized Product Rule tells us how many ways to make these choices if the particular
choice of a does not affect the size of the set of options from which b is chosen. But if the number of options
for the choice of b differs based on the choice of a, even the Generalized Product Rule does not apply. In
this case, we can use a combination of the Sum Rule and the Generalized Product Rule to calculate the
number of results. We’ll close this section with a few of these somewhat more complex questions.

Example 9.19: Ordering coffee.


A certain coffee shop sells the following espresso-based drinks: americano∗ , cappuccino, espresso∗ , latte,
macchiato, and mocha. (The drinks marked with an asterisk do not contain milk; the others do.) All drinks
can be made with either decaf or regular espresso. All milk-containing drinks can be made with any of
{soy, skim, 2%, whole} milk. How many different drinks are sold by this coffee shop?
We can think of a chosen drink as a sequence of the form
⟨drink type, milk type (or “none”), espresso type⟩.
There are 4 · 4 · 2 = 32 choices of milk-based drinks (4 drink types, 4 milk types, and 2 espresso types).
There are 2 · 1 · 2 = 4 choices of non-milk-based drinks (2 drink types, 1 “milk” type [“none”], and 2
espresso types). Thus the total number of different drinks sold by this coffee shop is 32 + 4 = 36.

Example 9.20: Text numbers.


In the United States, a text message can be sent either to a regular 10-digit phone number, or to a so-called
short code which is a 5- or 6-digit number. Neither a phone number nor a short code can start with a 0 or
a 1. How many different textable numbers are there in the United States?
Solution. Let D = {2, 3, . . . , 9}. Note |D| = 8. The set of valid textable numbers is:

D × (D ∪ {0, 1})9 ∪ D × (D ∪ {0, 1})4 ∪ D × (D ∪ {0, 1})5 .


phone numbers 5-digit short codes 6-digit short codes
The Product Rule tells us that |D × (D ∪ {0, 1})i | = |D| · |D ∪ {0, 1} |i = 8 · 10i for any i. (To be totally
pedantic: we’re using the Sum Rule to conclude that |D ∪ {0, 1} | = |D| + | {0, 1} | = 10, because D
and {0, 1} are disjoint.) Therefore:
D × (D ∪ {0, 1})9 ∪ D × (D ∪ {0, 1})4 ∪ D × (D ∪ {0, 1})5
= D × (D ∪ {0, 1})9 + D × (D ∪ {0, 1})4 + D × (D ∪ {0, 1})5
Sum Rule: the three types of numbers are disjoint because they have different lengths

= 8 · 10 + 8 · 10 + 8 · 105
9 4
Product Rule (as described in the previous paragraph)
= 8,000,880,000.

Problem-solving tip: When you’re facing a counting problem that appears complicated, try to find a nice way of splitting the
problem into several disjoint options. Often a difficult counting problem is actually the sum of two simple counting problems.

Combining sums and products: prefix-free codes


We’ll end the section with two somewhat more complicated counting problems, where we’re asked to
calculate the number of objects meeting some particular condition: sets of bitstrings such that no string

ISTUDY
454 Counting

is a prefix of another, and results of a best-of-five series of games. In both cases, we can give a solution
based entirely on a brute-force approach by simply enumerating all possible sequences, eliminating any
that don’t meet the stated condition, and counting the uneliminated sequences one by one. But there are
also ways to break down the set of objects of interest into subsets that we can count using the Sum and
(Generalized) Product Rules.

Example 9.21: Prefix-free codes.


A prefix-free code is a set C of bitstrings with the property that no x ∈ C is a prefix of any other y ∈ C.
(For example, if 010 ∈ C, then we must have 0101 ∈ / C, because 010 is a prefix of 0101.) Let’s compute
the number of prefix-free codes where all of the codewords are only 1 or 2 bits long.
1 2
One uncomplicated way to find the number of prefix-free codes C ⊆ {0, 1} ∪ {0, 1} is to write down
1 2
all subsets of S = {0, 1} ∪ {0, 1} , and then check each subset to eliminate any set that violates the
prefix rule. (See Figure 9.6, which was generated by a Python program; there are 25 codes in the table
that pass the prefix test.) There are 2|S| = 26 = 64 subsets of S: we can describe each subset of S as an
|S|
element of {yes, no} where the ith component tells us whether the ith element of S is in the set. The
|S|
Product Rule tells us that |{yes, no} | = 26 = 64. (See Lemma 9.10.)
Here’s a different approach, involving more thinking and less brute-force calculation. Let’s partition
the set of valid codes into four classes based on whether 0 ∈ C or not, and whether 1 ∈ C or not:

• If 0 ∈
/ C and 1 ∈
/ C, then any subset of {00, 01, 10, 11} can be in C.
• If 0 ∈
/ C and 1 ∈ C, then any subset of {00, 01} can also be in C.
• If 0 ∈ C and 1 ∈
/ C, then any subset of {10, 11} can also be in C.
• If 0 ∈ C and 1 ∈ C, then no 2-bit strings can be included.

By the Product Rule, there are, respectively, 24 and 22 and 22 and 20 choices corresponding to these
classes. (The four classes correspond to the four columns of Figure 9.6.) By the Sum Rule, the total
number of prefix-free codes using 1- and 2-bit strings is 16 + 4 + 4 + 1 = 25.

prefix prefix prefix prefix


0 1 00 01 10 11 free? 0 1 00 01 10 11 free? 0 1 00 01 10 11 free? 0 1 00 01 10 11 free?
3 3 3 3 3 3 3 3
3 3 3 3 11 3 3 3 3 3 3 11
3 3 3 3 10 3 3 3 3 3 3 10
3 3 3 3 3 3 10 3 3 3 3 3 3 3 3 10
3 3 3 3 3 3 3 01 3 3 3 01
3 3 3 3 3 3 11 3 3 3 01 3 3 3 3 01
3 3 3 3 3 3 10 3 3 3 01 3 3 3 3 01
3 3 3 3 3 3 3 3 10 3 3 3 3 01 3 3 3 3 3 01
3 3 3 3 3 3 3 00 3 3 3 00
3 3 3 3 3 3 11 3 3 3 00 3 3 3 3 00
3 3 3 3 3 3 10 3 3 3 00 3 3 3 3 00
3 3 3 3 3 3 3 3 10 3 3 3 3 00 3 3 3 3 3 00
3 3 3 3 3 3 3 3 3 3 00 3 3 3 3 00
3 3 3 3 3 3 3 3 11 3 3 3 3 00 3 3 3 3 3 00
3 3 3 3 3 3 3 3 10 3 3 3 3 00 3 3 3 3 3 00
3 3 3 3 3 3 3 3 3 3 10 3 3 3 3 3 00 3 3 3 3 3 3 00

Figure 9.6 All 64 subsets of {0, 1, 00, 01, 10, 11}, with indication of whether the subset is prefix-free or not. In each
row (a subset), if the set is not prefix-free, then one violation found in the set is listed.

ISTUDY
9.2 Counting Unions and Sequences 455

Taking it further: Prefix-free codes are useful in that they can be transmitted unambiguously, without a special marker that
separates codewords. For example, consider the prefix-free code {0, 10, 11}. Then a sequence 0101111100 can only be inter-
preted as 0∥10∥11∥11∥10∥0. If a code is not prefix-free—like the English language!—then a sequence of codewords cannot be
unambiguously decoded: for example, THEME might be one word (theme) or it might be two (the me).
Huffman coding—named after David Huffman (1925–1999)—is an algorithm for computing a prefix-free code that can be used
for data compression for English (for example), by allowing us to translate each letter into a corresponding code word. Huffman
coding carefully assigns shorter codewords to more commonly used letters, and thus has a special property: among all prefix-free
codes, its codewords have the smallest length, on average. A Huffman code can be constructed using a greedy approach [59].

Combining sums and products: a best-of-five series


Here’s one more example of using our counting rules in combination:

Example 9.22: A best-of-five series.


Suppose that two teams A and B play a best-of-five series of games: the teams play until one team has won
three games, at which point the match is over, and that team is the winner. How many different sequences
of outcomes are there?
Solution. The simplest approach is to use brute force: simply write out all possible sequences of out-
comes, and count them up. This approach is shown in Figure 9.7. However, there’s another way to count.
Suppose that team A wins the series:

• There’s 1 outcome in which A never loses: A wins games 1, 2, and 3.


• There are 3 outcomes in which A loses once: A loses immediately before its first win (BAAA), before
its second win (ABAA), or before its third win (AABA).
• If A loses twice, then A must have won the fifth game, and exactly two of the first four. Thinking of the
outcomes of the first four games as 4-bit strings with ones denoting A’s wins, Example 9.4 says there
are precisely 6 such outcomes.

In sum, there are 1 + 3 + 6 = 10 ways for A to win the series. There are 10 analogous ways for B to win,
so there are 20 outcomes in total.

A B

A B A B

A B A B A B A B

AAA BBB
A B A B A B A B A B A B

AABA ABAA ABBB BAAA BABB BBAB


A B A B A B A B A B A B

AABBA AABBB ABABA ABABB ABBAA ABBAB BAABA BAABB BABAA BABAB BBAAA BBAAB

Figure 9.7 A tree representing each best-of-five series of games between two teams, A and B. The branch points
correspond to the games, and are labeled by the winner of the game. The 20 different sequences of outcomes are shown
at the bottom of the tree.

ISTUDY
456 Counting

Computer Science Connections

Running out of IP addresses, and IPv6


A crucial component of the internet is the assignment of an address to every machine connected to the network.
This address is called an IP address, where “IP” stands for Internet Protocol—the algorithm by which packets of
information are handled while they’re being transmitted across the internet. Each packet of information to be trans-
mitted stores a variety of pieces of information, including (1) some basic header information; (2) a source address
(the sender of the information); (3) a destination address (the intended recipient of the information); and (4) the data
to be transmitted (the “payload”). The subfield of computer science called computer networking is devoted to every-
thing about how the internet (or some smaller network) works: design of the network, physical systems, protocols
for routing, and more. (See a good textbook on computer networks, like [75], for much more.) Here we are going to
concentrate on the IP address itself, and a particular issue related to how many—or how few!—addresses there are.
Each networked device that sends or receives infor-
0–15.x.x.x
mation needs an address by which to do so. For almost 16–31.x.x.x Ford Motor Company, 19.x.x.x
the entire history of the internet, an IP address was just 32–47.x.x.x
a 32-bit string. These IP addresses are typically repre- 48–63.x.x.x
4
sented as an element of {0, . . . , 255} instead of as an 64–79.x.x.x U.S. Postal Service, 56.x.x.x
32
element of {0, 1} , by converting 8 bits at a time into 80–95.x.x.x
base-10 numbers, and then writing each 8-bit chunk 96–111.x.x.x
separated by periods. For example, as of this writing, 112–127.x.x.x
the site cam.ac.uk is associated with the IP address 128–143.x.x.x
144–159.x.x.x
10000000 . 11101000 . 10000100 . 00001000 . 160–175.x.x.x
128 232 132 8 176–191.x.x.x
You can find the IP address of your favorite site using a 192–207.x.x.x
tool called nslookup on most machines, which checks 208–223.x.x.x Latin America and Caribbean
a so-called name server to translate a site’s name (like 224–239.x.x.x Network Information Centre [LACNIC]
cam.ac.uk) into an IP address (like 128.232.132.8). 240–255.x.x.x 177.x.x.x, 179.x.x.x, …, 201.x.x.x
The international consortia operating portions of inter- Figure 9.8 Some of the allocations of IPv4 addresses. Each
net infrastructure allocate various blocks of addresses square in this grid corresponds to 224 = 16,777,216
to individual entities; see Figure 9.8. (Sometimes these addresses; some of these blocks are assigned to companies
allocations are more efficient, and sometimes they are or government-related organizations, and some are allocated
more wasteful: Ford was allocated more IP addresses to region-based internet organizations. (Examples of each
than many countries.) are labeled above.)
As a straightforward counting problem, we can see
32
that there only 2 = 4,294,967,296 different possible 32-bit IP addresses—which amounts to around 4.3 billion
addresses. Every machine connected to the internet needs to be addressable in order to be able receive data, so that
means that we can only directly support about 4.3 billion connected devices. (There are some strategies from computer
networking for conserving addresses by “translation,” so that several computers c1 , c2 , . . . can be connected via an
access point p—where p is the only machine that has a public, visible IP address. All of those computers’ traffic
is handled by p, but p must be able to reroute the traffic it receives to the correct one of the ci computers. Again,
see a book like [75] for more.) In the 1990s and 2000s, more and more people began to have machines connected
to the internet, and each person also began to have more and more devices that they wanted to connect. It became
clear that we were facing a dire shortage of IP addresses. As such, a new version of the Internet Protocol (version
six, hence called IPv6) was introduced. In IPv6, instead of using 32-bit addresses, we now use 128-bit addresses.
There were some tricky elements to the transition from 32-bit to 128-bit addresses—the system needed to ensure that
addressing under both systems would keep working throughout the (long!) transition—but there are now 2128 different
IP addresses available. That’s 340,282,366,920,938,463,463,374,607,431,768,211,456 ≈ 3.4 × 1038 , which should
hold us for a few millennia. These addresses are represented by 8 blocks of 4 hexadecimal numbers—that is, as an
4
element of the set [{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f} ]8 . For example, cam.ac.uk is associated with a 32-
bit address 128.232.132.8, and a 128-bit address 2a05:b400:5:270::80e8:8408, (The double colon notation
is shorthand for a sequence of blocks of 0000s, and it’s customary to omit leading zeros; in full, that address is
2a05:b400:0005:0270:0000:0000:80e8:8408.)

ISTUDY
9.2 Counting Unions and Sequences 457

Computer Science Connections

A Lower Bound for Comparison-Based Sorting


Most people who encounter the sorting problem—given an array A[1 . . . n], rearrange A so that it’s in ascending
order—initially devise a quadratic-time algorithm. (For simplicity, suppose that we’re sorting distinct elements.) The
most common examples of Θ(n2 )-time algorithms are Selection Sort, Insertion Sort, and Bubble Sort. Then, after a
lot of thought (and, usually, some help), those people often are able to devise a O(n log n)-time sorting algorithm, like
Merge Sort, Quick Sort, or Heap Sort. (See Section 6.3.) But suppose that you were extra impatient with the speed
of your sorting algorithm, and you were extra, extra clever. Could you do asymptotically better than O(n log n) in
the worst case? The answer, we’ll show, is no—with a footnote: any “comparison-based” sorting algorithm requires
Ω(n log n) time. (The footnote is that it depends on what we mean by “sort,” as we’ll see.)

A warm-up: Selection Sort. First, the ith iteration (which is i the inner square is
n n
rows from the bottom) 2 -by- 2 ; the outer
recall Selection Sort, shown in Fig- contains n − i steps square is n-by-n.
ure 9.9. We saw one way to analyze selectionSort(A[1 . . . n]):
its running time in Example 6.7: 1 for i := 1 to n:
there are n iterations, and in the

n iterations
2 minIndex := i
(n − i)th iteration we require i 3 for j := i + 1 to n:
steps. In other words, the running
∑n 4 if A[j] < A[minIndex] then
time of Selection Sort is i=1 i, 5 minIndex := j
and we could∑n repeat the inductive 6 swap A[i] and A[minIndex]
proof that i=1 i = n(n + 1)/2 to
complete the analysis. But, instead, Figure 9.9 Selection Sort: a reminder of the pseudocode, ∑n
and its analysis. The
Figure 9.9 gives a more visual way triangle represents the algorithm’s running time: i=1 i, where the ith row
of seeing this result. The triangle shows i steps. This triangle is contained within an n-by-n square and also
n n
in the figure represents the running contains an 2
-by- 2
square.
time of Selection Sort. Because the
triangle fits within an n-by-n square and contains an 2n -by- n2 square, the 2
area of the triangle is upper bounded by
n · n = n2 (the area of the outer square) and lower bounded by n2 · n2 = n4 (the area of the inner square), and therefore
is Θ(n2 ). This picture is a visual representation of a more algebraic proof:

n ∑
n ∑
n ∑
n ∑
n
n2
i≤ n = n2 and i≥ i≥ n
2 = 4 .
i=1 i=1 i=1 i= n2 +1 i= n2 +1

(The first inequality


{ comes from } the fact that any i ∈ {1, . . . , n} satisfies i ≤ n; the last inequality comes from the
fact that any i ∈ n2 + 1, . . . , n satisfies i ≥ n2 .)
There are no O(n) comparison-based sorting algorithms. All of the sorting algorithms that we’ve encountered in
the book are comparison-based sorting algorithms: they proceed by repeatedly comparing the values of two elements
xi and xj from the input array without considering the values themselves. Depending on the result of the comparison,
the algorithm may then swap some elements of the array. (Comparison-based sorting algorithms probably include
every sorting algorithm that you’ve ever seen, except counting, radix, and bucket sorts.)
One way to view a comparison-based sorting algorithm is through a decision tree, like the one shown in Figure 9.10
for Selection Sort on a 3-element array. Internal nodes encode the comparisons made by the algorithm; leaves corre-
spond to sorted orders—the algorithm’s output. The running time of the sorting algorithm whose input corresponds
to a particular leaf is Ω(number of comparisons on that root-to-leaf path) because, although the algorithm might do
more than compare—in fact, it must (for example, it has to perform swaps)—it must do at least these comparisons.
We will use the decision tree to establish a lower bound on the running time of comparison-based sorting algorithms:

Theorem. Any comparison-based sorting algorithm requires Ω(n log n) time.

(continued)

ISTUDY
458 Counting

Computer Science Connections

Sorting Lower Bounds, continued

This branch (and


a ≶ b? the analogous one,
to the left) is
a<b a>b logically
a is the inconsistent, but
minimum Selection Sort
value a ≶ c? b ≶ c?
doesn’t care; it
reexecutes the
a<c a>c b<c b>c a-versus-b
comparison in
b ≶ c? a ≶ b? a ≶ c? a ≶ b? these cases.

b<c b>c a<b a>b a<c a>c a<b a>b

⟨a, b, c⟩ ⟨a, c, b⟩ ⟨c, a, b⟩ inconsistent! ⟨b, a, c⟩ ⟨b, c, a⟩ inconsistent! ⟨c, b, a⟩

Figure 9.10 The decision tree for Selection Sort on input ⟨a, b, c⟩. Selection Sort first does two comparisons to find the
minimum value of {a, b, c}, and subsequently compares the remaining two elements to decide on the final order.

Proof. Consider the decision tree T of the sorting algorithm. First, observe that T must have at least n! leaves.
There are n! different permutations of the input, and a correct algorithm must be capable of producing any of these
permutations as output. Second, observe that T has at most 2d nodes at depth d. (It’s a binary tree!) Thus the height
h of T satisfies 2h ≥ n!. Taking logarithms of both sides, we have
h ≥ log2 (n!) = log2 [n · (n − 1) · (n − 2) · · · · · ( n2 + 1) · ( 2n ) · · · · · 1]
≥ log2 [n · (n − 1) · (n − 2) · · · · · ( n2 + 1)]
[ ]
≥ log2 ( 2n )(n/2) The crucial fact here is precisely analogous to the one in Figure 9.9:


n ∏
n ∏
n
= ( n2 ) · log2 (n/2) i≥ i≥ n
= ( 2n )
n2
/

.
2
i n 1 i n 1
= Ω(n log n), i1
= =

2
+ =

2
+

The only difference: here we’re using products instead of summations.


which completes the proof.

A linear-time sorting algorithm. While we’ve now shown that every comparison-based sorting algorithm takes
Ω(n log n) time, there are faster algorithms for special cases.
Figure 9.11 shows one, called counting sort, which
allows us to sort without doing element-to-element countingSort(A[1 . . . n]):
comparisons. The basic idea is that, instead of sorting
Input: array (A[1 . . . n]) where each A[i] ∈ {1, 2, . . . , c}.
the input array A by comparing A’s elements to each
1 for v := 1 to c:
other, we count (and track) how many times each inte-
2 count[v] := 0
ger in {1, 2, . . . , c} appears in A—and then use those
3 for i := 1 to n:
counts to fill in the array’s entries from left to right. 4 count[A[i]] := count[A[i]] + 1
As long as the elements of the array are integers 5 i := 1
from a small range, then this algorithm is fast: the run- 6 for v := 1 to c:
∑ time is Θ(c + n) (the last nested loop requires
ning 7 for t := 1 to count[v]:
v count[v] = n time); as long as c is small, this 8 A[i] := v
algorithm runs in linear time. 9 i := i + 1
Figure 9.11 Counting Sort.

ISTUDY
Exercises 459

EXERCISES
9.1 For the first decade or so of Twitter’s existence, a tweet was a sequence of at most 140 characters. (This length restriction was
loosened in 2017.) Assuming there are 256 valid characters that can appear in each position, how many distinct tweets are possible?

9.2 Cars in the United States have alphanumeric license codes (often misleadingly called license plate numbers despite the presence of
uppercase letters), whose format varies among states. Minnesota recently issued plates of the form digit-digit-digit-letter-letter-letter
(as in 400GPA). How many different license plate codes of this form are possible? (All letters in all codes are uppercase.)
9.3 What about for Pennsylvania’s recent plates of the form letter-letter-letter-digit-digit-digit-digit (as in EEE2718)?
9.4 What about for Connecticut’s recent plates of the form digit-letter-letter-letter-letter-digit (as in 4FIVE6)?
9.5 You have been named Secretary of Transportation for the State of [Your Favorite State]. Congratulations! You’re considering
replacing the current license plate format ABCD-1234 (4 letters followed by 4 digits) with a sequence of any k symbols, each of
which can be either a letter or a digit. How large must k be so that your new format has at least as many options as the old format?
9.6 Until recently, France used license plates that contain codes of any of the following forms:
• digit-digit-digit-letter-digit-digit.
• digit-digit-digit-letter-letter-digit-digit, where the first letter is alphabetically ≤ P.
• digit-digit-digit-digit-letter-letter-digit-digit, where the first letter is alphabetically ≥ Q.
• digit-digit-digit-letter-letter-letter-digit-digit.
How many license plates, in total, met the French requirements?

9.7 A voicemail system allows numerical passwords of length 3, 4, or 5 digits. How many passwords are possible in this system?
9.8 What about numerical passwords of length 4, 5, or 6?

9.9 A contact lens is built with the following parameters: a (spherical) power (for correcting near- or farsightedness); and, possibly, a
cylindrical power and an axis (for correcting astigmatism). For a particular brand of contacts:
• The parameters for a lens that corrects near- or farsightedness only is a power between −6.00 and +6.00 inclusive in 0.25 steps
(excluding 0.00); between 6.50 and 8.00 inclusive in 0.50 steps; and between −6.50 and −10.00 inclusive in 0.50 steps.
• The parameters for a lens that corrects astigmatism are: (i) one of the powers listed previously; (ii) a cylindrical power in
{−0.75, −1.25, −1.75, −2.25}; and (iii) an axis between 10◦ and 180◦ in steps of 10◦ .
How many different contact lenses are there?
9.10 A patient needing vision correction in both eyes may get different contact lenses for each eye. A prescription assigns a lens for the
left eye and for the right eye. How many contact prescriptions are there?

9.11 During the West African Ebola crisis that started in 2014, geneticists were working to trace the spread of the disease. To do so, they
acquired DNA samples of the viruses from a number of patients, and affixed a unique “tag” to each patient’s sample [105]. A tag
is a sequence of 8 nucleotides—each an element of {A, C, G, T}—attached to the end of a virus sample from each patient, so that
subsequently it will be easy to identify the patient associated with a particular sample. How many different such tags are there?
9.12 In a computer science class, there are 14 students who have previously written a program in Java, and 12 students who have
previously written a program in Python. How many students have previously written a program in at least one of the two languages?
(If you can’t give a single number as a definitive answer, give as narrow a range of possible values as you can.)
9.13 True story: a relative was given a piece of paper with the password to a wireless access point that was written as follows:
a154bc0401011. But she couldn’t tell from this handwriting whether each “1” was 1 (one), l (ell), or I (eye); or whether “0”
was 0 (zero) or O (oh). How many possible passwords would she have to try before having exhausted all of the possibilities?

A Rubik’s cube—named after the twentieth-century Hungarian architect Ernő Rubik—is a 3-by-3-by-3 cube, with nine square cells
arranged in a grid on each face. Any of the six nine-cell faces (top, bottom, left, right, front, back) can be rotated 90◦ clockwise or
counterclockwise in a single move. (See Figure 9.12.)
9.14 How many Rubik’s cube moves are there?
9.15 From any configuration, 26 moves suffice to solve the cube. (Note every 90◦ rotation counts as a move; if you rotate the same face
180◦ by using two consecutive 90◦ moves, it counts as two moves.) How many sequences of exactly 26 moves are possible?
9.16 It’s useless to rotate a face clockwise in one move, and rotate the same face counterclockwise in the next move. (You’ve just
undone the previous move.) A counterclockwise move followed by a clockwise move is analogous. How many sequences are there
of exactly 26 moves that never undo the previous move?
9.17 There is a bit of a schism in the world of hard-core Rubik’s cube solvers regarding what counts as a “move.” The last several exercises
have considered what’s called the quarter-turn metric, where rotating a single face by 90◦ (clockwise or counterclockwise) counts
as a single move. Many people prefer the half-turn metric, where rotating a single face by any angle (90◦ , 180◦ , or 270◦ —the last
of which is just 90◦ in the other direction) counts as a single move.

ISTUDY
460 Counting

Each face of each cell is colored with one of six colors (blue, red,
green, yellow, white, and orange); initially, all nine cell-faces on
each cube-face have the same color, but the cube can then be
scrambled. The challenge is to use rotations to configure a
scrambled cube such that each face of the cube contains nine cells of
the same color. For more about the hardest positions, see [108]. (Or
see cube20.org.)

Figure 9.12 A Rubik’s cube, and the result of a single move—rotating the top face clockwise.

9.18 There are more single moves under the half-turn metric, but the hardest positions require fewer half-turn moves than they do
quarter-turn moves; from any configuration, 20 half-turn moves suffice to solve the cube. Redo Exercise 9.15 for half turns: how
many sequences of exactly 20 half-turn moves are possible? Is this number bigger or smaller than the solution to Exercise 9.15?
9.19 Under the half-turn metric, it’s always wasteful to make two consecutive moves by rotating the same face. (Rotating the same face
twice in a row either completely undoes whatever the first move did, or the pair of moves could have been replaced by a single
move with a different angle.) Redo Exercise 9.16 under the half-turn metric: how many sequences are there of exactly 20 moves
that never rotate the same face as the previous move? Is this number bigger or smaller than the solution to Exercise 9.16?

9.20 Emacs is a widely used software program for—among other things—editing text documents (including this book). Here’s a sim-
plified description of commands in Emacs: a command character is produced by pressing a letter key while holding down either
the Control key, the Meta key, or both. (For example, Control+Y or Meta+B or Control+Meta+U are command characters.) How
many command characters are there in Emacs?
9.21 Emacs is complicated enough that it needs more commands than Exercise 9.20 allows. To allow for more commands, the syntax
of Emacs has been extended, as follows. Meta+X and Control+X—as in eXtended—are command prefixes: neither Meta+X nor
Control+X is a valid command itself, but, they can be the first part of a two-part command. For example, “Control+X Control+U” is
a command (and it’s different from Control+U). A valid command can be formed by Control+X or Meta+X followed by any letter or
any command character (including Control+X or Meta+X). All other command characters from Exercise 9.20—aside from Meta+X
and Control+X—are still valid. How many command characters are there now?

9.22 Argue that, for any sets A and B, |A ∪ B| = |A − B| + |B − A| + |A ∩ B|. (Use the Sum Rule.)
9.23 How many 100-bit strings have at most 2 ones? (Use Example 9.4.)
9.24 Determine how many k-bit strings have exactly three ones using the approach in Example 9.4—that is, by dividing the set of
bitstrings based on the position of the third one.
9.25 (programming required) Write a program, in a language of your choice, to enumerate all bitstrings in {0, 1}16 and count the number
that have 0, 1, 2, and 3 ones. Use this program to verify your answer to Exercise 9.24 and your approach to Exercise 9.23.
9.26 The following is a simpler “solution” to Example 9.4, where we computed the number of elements of {0, 1}k that have precisely
two ones. What, exactly, is wrong with this argument?
Let S be the set of k-bit strings that contain exactly 2 ones. Define Si = {x ∈ S : xi = 1}, for each index i ∈ {1, 2, . . . , k}. Observe
S P P
that S = ki=1 Si and that |Si | = k − 1. Therefore, by the Sum Rule, we have |S| = ki=1 |Si | = ki=1 (k − 1) = k(k − 1).

Unicode is a character set frequently used on the web; it supports hundreds of thousands of characters from many languages—
English, Greek, Chinese, Arabic, and all other scripts in common current use. A very common encoding scheme for Unicode, called
UTF-8, uses a variable number of bits to represent different characters (with more commonly used characters using fewer bits).
Valid UTF-8 characters can use 1, 2, 3, or 4 bytes, in one of the following forms:

0xxxxxxx 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110yyy 10yyxxxx 10xxxxxx 10xxxxxx

Here, x represents any bit, and the five bits labeled y have a further restriction: the five yyyyy bits must be either of the form
0xxxx or 10000. The ith character in the Unicode character set is encoded by the ith legal UTF-8 representation, resulting from
converting i into binary and filling in the x (and y) bits from the templates.
9.27 How many characters can be encoded using UTF-8?
9.28 Unicode forbids excess zero padding: if a character can be encoded using one byte, then the two-byte encoding is illegal. For
example, 01010101 encodes the same character as 11000001 10010101; thus the latter is not allowed. How many characters
from Exercise 9.27 can be encoded without violating this rule?

ISTUDY
Exercises 461

8
0Z0Z0Z0Z 8
0Z0Z0Z0Z
7
Z0Z0Z0Z0 7
Z0Z0Z0Z0
6
0Z0Z0Z0Z 6
0Z0Z0Z0Z
5
Z0Z0S0Z0 5
Z0Z0L0Z0
4
0Z0Z0Z0Z 4
0Z0Z0Z0Z
3
Z0Z0Z0Z0 3
Z0Z0Z0Z0
2
0Z0Z0Z0Z 2
0Z0Z0Z0Z
1
Z0Z0Z0Z0 1
Z0Z0Z0Z0
a b c d e f g h a b c d e f g h

Figure 9.13 Two chess boards, showing the legal moves for a rook (left) and queen (right).

9.29 In chess, a rook can move any number of spaces horizontally or vertically. (See Figure 9.13.) How many ways can you put a black
rook and a white rook on an 8-by-8 chessboard so they can’t capture each other (that is, neither can move to the other’s square)?
9.30 A queen in chess can move any number of spaces horizontally, vertically, or diagonally. (Again, see Figure 9.13.) How many ways
are there to put one black queen and one white queen on an 8-by-8 chessboard so they can’t capture each other? (Hint: think about
how far the black queen is from the edge of the board.)
9.31 (programming required) Write a program to verify your solution to the last two exercises.
9.32 You have a wireless-enabled laptop, phone, and tablet. Each device needs to be assigned a unique “send” frequency and a unique
“receive” frequency to communicate with a base station. Let S = {1, . . . , 8} denote send frequencies and R = {a, . . . , h} receive
frequencies. A frequency assignment is an element of S × R. A set of frequency assignments is noninterfering if no elements of S
or R appears twice. How many noninterfering frequency assignments are there for your three devices?
9.33 Write down an Inclusion–Exclusion formula for |A ∪ B ∪ C ∪ D|.
9.34 How many integers between 1 and 1000, inclusive, are divisible by one or more of 3, 5, and 7?
9.35 How many integers between 1 and 1000, inclusive, are divisible by one or more of 6, 7, and 8?
9.36 How many integers between 1 and 10,000, inclusive, are divisible by at least one of 2, 3, 5, or 7?

In Chapter 7, we encountered the totient function φ : Z≥1 → Z≥0 , defined as

φ(n) = the number of k with 1 ≤ k ≤ n such that k and n have no common divisors.

We can always compute the totient of n by brute force (just test all k ∈ {1, . . . , n} for common divisors using the Euclidean
algorithm). But the next few exercises will give a hint at another way to do this computation more efficiently. For a fixed integer n:
9.37 Suppose m ∈ Z≥1 evenly divides n. Define M = {k ∈ {1, . . . , n} : m | k}. Argue that |M| = mn .
9.38 (A number-theoretic interlude.) Let the prime factorization of n be n = pe11 · pe22 · · · peℓ , for integers e1 , . . . , eℓ ≥ 1 and distinct

prime numbers {p1 , . . . , pℓ }. Let k ≤ n be arbitrary. Argue that k and n have no common divisors greater than 1 if and only if, for
all i, we have pi ̸ | k.
9.39 Let n be an integer with exactly two prime factors—that is, we can write n = pi qj for primes p and q = ̸ p and for powers
i, j ∈ Z≥1 . (For example, we have 544 = 171 · 25 ; here p = 17, q = 2, i = 1, and j = 5.) Let P = {k ∈ {1, . . . , n} : p | k} and
Q = {k ∈ {1, . . . , n} : q | k}. Argue that φ(n) = n(1 − p1 )(1 − 1q ) by using Inclusion–Exclusion to compute |P ∪ Q|. (You should
find the previous two exercises helpful.)

In the sport of cricket, a team consists of 11 players who come up to bat in pairs. Initially, players #1 and #2 bat. When one of
those two players gets out, then player #3 replaces the one who got out. When one of the two batting players—player #3 and
whichever player of {#1, #2} didn’t get out—gets out, then player #4 joins the one who isn’t out. This process continues until
the 10th player gets out, leaving the last player not out (but stranded without a partner).
Thus, in total, there are 11 players who bat together in 10 partnerships. As an example, consider the lineup Anil, Brendan,
Curtly, Don, Eoin, Freddie, Glenn, Hansie, Inzamam, Jacques, Kumar. We could have the following batting partnerships: Anil &
Brendan; Anil & Curtly; Anil & Don; Don & Eoin; Don & Freddie; …; Don & Kumar.
9.40 How many different partnerships are possible? (That is, how many different pairs of players might end up batting together?)
9.41 How many sequences of partnerships (like the above list) are possible? (It doesn’t matter which of the last two players gets out.)
9.42 In cricket, a team’s batting lineup may be truncated (either because the team wins the game, or because the team strategically
chooses not to bat any longer). The batting team may stop batting at any point after the first pair starts batting. Now how many
different sequences of partnerships are possible? (A tedious clarification: it matters whether a sequence of partnerships is truncated
or not. For example, a sequence ending with Don & Kumar [where one of them gets out] is different from the same sequence but
where neither of Don & Kumar gets out. It does not matter, though, which of the two was the one who got out.)

ISTUDY
462 Counting

8
0Z0Z0Z0Z
7
Z0Z0Z0Z0
6
0Z0Z0Z0Z
5
Z0Z0Z0Z0
4
0Z0Z0Z0Z
3
Z0Z0Z0Z0
2
0Z0Z0Z0Z
1
Z0Z0Z0Z0
a b c d e f g h

Figure 9.14 A checker board.

9.43 Suppose that, as in Example 9.11, a bank uses 4-digit PINs, but doesn’t permit a PIN that starts with the same digit repeated twice
(for example, 7730) or ends with the same digit repeated twice (for example, 0122). Now how many invalid PINs are there?
9.44 Let Sk denote the set of k-digit PINs that neither start with three repeated digits nor end with three repeated digits. (Example 9.11
computed |S4 |.) In terms of k, what is |Sk |? Be careful about very small k.

Checkers is a game, like chess, played on an 8-by-8 grid. (See Figure 9.14.) Two players, Red and Black, move tokens diagonally
on the grid; tokens can only occupy shaded squares. There are two types of tokens: pieces and kings. Any piece that has reached the
opposite side of the board from its starting side (row 8 or row 1) becomes a king. (So Black cannot have a piece in row 8, because
that piece would have become a king. Similarly Red cannot have a piece in row 1.) The next few exercises ask you to compute how
many different board positions are possible when the number k of tokens on the board is very small—namely, for k ∈ {1, 2}. (In
2007, Checkers was solved by the authors of a program called Chinook, which never loses a game [113]. Part of Chinook computes
all possible board positions with a small number of tokens, as these exercises start to do.)
9.45 How many board positions have exactly one token? (Both the color and the type of the token matter; a Black piece on C3 is a
different position from either a Black king on C3 or a Red piece on C3.)
9.46 How many board positions have two kings, one of each color?
9.47 How many board positions have two Red kings? (Notice that two Red kings cannot be distinguished, so it doesn’t matter “which”
one comes first.)
9.48 How many board positions have two Black pieces?
9.49 How many board positions have two pieces, one of each color?
9.50 How many board positions have one Red king and one Red piece?
9.51 How many board positions have one Black king and one Red piece?
9.52 Use the last six exercises to determine how many total board positions have two tokens.
9.53 (programming required) Write a program to verify your answers to Exercises 9.45–9.52 (particularly Exercise 9.52).

9.54 How many subsets of {0, 1}1 ∪ {0, 1}2 ∪ {0, 1}3 are prefix free? (See Example 9.21.) (Hint: I recommend writing a program.)

A text-to-speech system takes written language (text) and reads it aloud as audio (speech). One of the simplest ways to build a text-
to-speech system is to prerecord each syllable, and then paste together those sounds. (Pasting separate recordings is difficult, and
this system as described will produce very robotic-sounding speech. But it’s a start.) A syllable consists of a consonant or cluster
of consonants called the onset, then a vowel called the nucleus, and finally the consonant(s) called the coda. [Many languages only
allow some combinations of choices—there are fascinating linguistic constraints based on ordering or place of articulation (for
example, English allows stay but not tsay, and allows clay and play but not tlay) that we’re almost entirely omitting here.]
9.55 A consonant can be described by a place of articulation (one of 11 choices: the lips, the palate, etc.); a manner of articulation (one
of 8 choices: stopping the airflow, stopping the oral airflow with the nasal passage open, etc.); and a voicing (the vocal cords are
either vibrating, or not). According to this description, how many consonants are there?
9.56 A vowel can be described as either lax or tense; as either high or mid or low; and as either front or central or back. According to
this description, how many vowels are there?
9.57 As a rough approximation, Japanese syllables consist of one of 25 consonants followed by one of 5 vowels, with one consonant
that can appear as a coda (or the coda can be left off). How many Japanese syllables are there?
9.58 An even rougher approximation: English syllables consist of an onset that is either one of 25 consonants or a cluster of any two
consonants, followed by one of 16 vowels, followed optionally by one of 25 consonants. How many English syllables are there?
9.59 To cut down on the large number of syllables that you found in the last exercise, some text-to-speech systems are instead based on
demisyllables—the first half or the second half of a syllable. (We glue the sounds together in the middle of the vowel.) That is, a
demisyllable is either a legal onset followed by a vowel, or a vowel followed by a legal coda. How many demisyllables are there in
English (making the same very rough assumptions as the previous exercise)?

ISTUDY
9.3 Using Functions to Count 463

9.3 Using Functions to Count


The sun’s shining bright
Everything seems all right
When we’re poisoning pigeons in the park.
Tom Lehrer (b. 1928)
“Poisoning Pigeons In The Park” (1959)

Our focus in Section 9.2 was on counting sequences of choices (the Generalized Product Rule) and
choices of choices (the Sum Rule). But what about counting other kinds of sets? Our basic plan is simple:
be lazy! In this section, we’ll introduce ways of counting the cardinality of a given set A in terms of |B| for
some other set B, by using functions that translate between the elements of A and the elements of B:

Mapping Rule: There exists a bijection f : A → B if and only if |A| = |B|. Similarly, there exists an onto
function f : A → B if and only if |A| ≥ |B|, and a one-to-one function f : A → B if and only if |A| ≤ |B|.

Division Rule: Suppose there exists a function f : A → B such that, for every b ∈ B, we have
| {a ∈ A : f(a) = b} | = k. Then |A| = k · |B|.

In particular, we’ll hope to “translate” a choice from an arbitrary set into a sequence of choices from simple
sets—whose size we can count using the tools from Section 9.2. Here’s an example to illustrate the idea:

Example 9.23: Number of valid Hamming codewords.


In Section 4.2, we introduced the Hamming code, an error-correcting code that encodes any 4-bit message
4 7 4 7
m ∈ {0, 1} as a 7-bit codeword x ∈ {0, 1} . The encoding function encode : {0, 1} → {0, 1} maps
⟨a, b, c, d⟩ to ⟨a, b, c, d, b ⊕ c ⊕ d, a ⊕ c ⊕ d, a ⊕ b ⊕ d⟩, where ⊕ is exclusive or. That is, a valid Hamming
7
codeword x is an element of {0, 1} satisfying three conditions:
x2 + x3 + x4 ≡2 x5 x1 + x3 + x4 ≡2 x6 x1 + x2 + x4 ≡2 x7 .
How many different valid codewords does the Hamming code have?
7
Solution. We can count the number of valid codewords by looking at all 27 = 128 elements of {0, 1}
and testing these three conditions. (See Figure 9.15.) By checking every entry in the table, we see that
there are 16 valid codewords.
This table-based approach is fine, but here’s a less tedious way to count. By the definition of the
4 7
encoding function, every possible message in {0, 1} is encoded as a different codeword in {0, 1} .
4
Furthermore, every valid codeword is the encoding of a message in {0, 1} . Thus the number of valid
4
codewords equals the number of messages, and there are |{0, 1} | = 16 valid codewords.

Problem-solving tip: Use programming to help you! If you’re going to use the simple-but-tedious way to count legal Hamming
code codewords, via enumeration, write a program rather than doing it by hand. (For example, the table in Example 9.23 was
generated with a Python program!)

9.3.1 The Mapping Rule


The approach that we used in Example 9.23 is based on functions that translate from one set to another. In
the remainder of this section, we will formalize this style of reasoning as a general technique for counting
problems. To build intuition about using functions to count, let’s start with some informal examples:

ISTUDY
464 Counting

codeword codeword codeword codeword codeword codeword codeword codeword


0000000 333 0010000 7 73 0100000 737 0110000 37 7 1000000 37 7 1010000 737 1100000 7 73 1110000 333
0000001 337 0010001 7 7 7 0100001 733 0110001 373 1000001 373 1010001 733 1100001 7 7 7 1110001 337
0000010 373 0010010 733 0100010 7 7 7 0110010 337 1000010 337 1010010 7 7 7 1100010 733 1110010 373
0000011 37 7 0010011 737 0100011 7 73 0110011 333 1000011 333 1010011 7 73 1100011 737 1110011 37 7
0000100 733 0010100 373 0100100 337 0110100 7 7 7 1000100 7 7 7 1010100 337 1100100 373 1110100 733
0000101 737 0010101 37 7 0100101 333 0110101 7 73 1000101 7 73 1010101 333 1100101 37 7 1110101 737
0000110 7 73 0010110 333 0100110 37 7 0110110 737 1000110 737 1010110 37 7 1100110 333 1110110 7 73
0000111 7 7 7 0010111 337 0100111 373 0110111 733 1000111 733 1010111 373 1100111 337 1110111 777
0001000 7 7 7 0011000 337 0101000 373 0111000 733 1001000 733 1011000 373 1101000 337 1111000 777
0001001 7 73 0011001 333 0101001 37 7 0111001 737 1001001 737 1011001 37 7 1101001 333 1111001 7 73
0001010 737 0011010 37 7 0101010 333 0111010 7 73 1001010 7 73 1011010 333 1101010 37 7 1111010 737
0001011 733 0011011 373 0101011 337 0111011 7 7 7 1001011 7 7 7 1011011 337 1101011 373 1111011 733
0001100 37 7 0011100 737 0101100 7 73 0111100 333 1001100 333 1011100 7 73 1101100 737 1111100 37 7
0001101 373 0011101 733 0101101 7 7 7 0111101 337 1001101 337 1011101 7 7 7 1101101 733 1111101 373
0001110 337 0011110 7 7 7 0101110 733 0111110 373 1001110 373 1011110 733 1101110 7 7 7 1111110 337
0001111 333 0011111 7 73 0101111 737 0111111 37 7 1001111 37 7 1011111 737 1101111 7 73 1111111 333

Figure 9.15 All 27 = 128 elements of {0, 1} , marked with which of the three Hamming code conditions each one
7

passes (3) and which ones it fails (7): condition one (x2 + x3 + x4 ≡2 x5 ), condition two (x1 + x3 + x4 ≡2 x6 ), and
condition three (x1 + x2 + x4 ≡2 x7 ). The highlighted entries pass all three tests.

Example 9.24: Some mappings, informally.

• Let S be a collection of documents, where each document is labeled with one of 5 genres: poem,
essay, memoir, drama, or novel.
– Suppose every genre appears as the label for at least one document. Then |S| ≥ 5. (We see 5
different kinds of labels on documents, and every document has only one label. Thus there must
be at least 5 different documents.)
– Suppose there’s no genre that appears as the label for two distinct documents. Then |S| ≤ 5. (No
label is reused—that is, no label appears on more than one document—so we can only possibly
observe 5 total labels. Every document is labeled, so we can’t have more than 5 documents.)
• You’re taking a class in which no two students’ last names start with the same letter. Then there are
at most 26 students in the class.
• You’re in a club on campus that has at least one member from every U.S. state. Then the club has at
least 50 members.
• You’re out to dinner with friends, and you and each of your friends order one of 8 desserts on the
menu. Suppose that each dessert is ordered at least once, and no two of you order the same dessert.
Then your group has exactly 8 people.

Taking it further: The document/genre scenario in Example 9.24 is an example of a classification problem, where we must
label some given input data (“instances”) as belonging to exactly one of k different classes. Classification problems are one of
the major types of tasks encountered in the subfield of CS called machine learning. In machine learning, we try to build software
systems that can “learn” how to better perform a task on the basis of some training data. Other problems in machine learning
include anomaly detection, where we try to identify which instances from a set “aren’t like” the others; or clustering problems
(see p. 38), where we try to separate a collection of instances into coherent subgroups—for example, separating a collection of
documents into “topics.” Classification problems are very common in machine learning: for example, we might want to classify
a written symbol as one of the 26 letters of the alphabet (optical character recognition); or classify a portion of an audio speech
stream as one of 40,000 common English words (speech recognition); or classify an email message as either “spam” or “not
spam” (spam detection).

ISTUDY
9.3 Using Functions to Count 465

Formalizing the rule


How can we generalize the intuition of Example 9.24 into a rule for counting? Think about the first scenario,
the documents and the genres: we can view the labels on the documents in S as being given by a function
label : S → {poem, essay, memoir, drama, novel} . If there exists any function that behaves in the way
that label did in Example 9.24—that is, either “covering” all of the possible outputs at least once each, or
covering all of the possible outputs at most once each—then we can infer whether the set of possible inputs
or the set of possible outputs is bigger.
The formal statements of the counting rules based on this intuition rely on the definition of three special
types of functions that we defined in Chapter 2: onto functions (Definition 2.51), one-to-one functions
(Definition 2.52), and bijections (Definition 2.53). (See Figure 9.16 for a reminder of the definitions.)
Formally, the existence of a function f : A → B with one of these properties will let us relate |A| and |B|:

Theorem 9.9: Mapping Rule.


Let A and B be arbitrary sets. Then:

• An onto function f : A → B exists if and only if |A| ≥ |B|.


• A one-to-one function f : A → B exists if and only if |A| ≤ |B|.
• A bijection f : A → B exists if and only if |A| = |B|.

See Figure 9.16 for a visual representation of the Mapping Rule, and for the intuition as why it’s correct:
the number of arrows leaving A is precisely |A|; if |A| arrows are enough to “cover” all elements of B, then
|B| ≤ |A|; and if |A| arrows can be directed into |B| elements without any duplication, then |B| ≥ |A|. (And,
actually, the third part of the Mapping Rule is implied by the first two parts: if there’s a bijection f : A → B
then f is both onto and one-to-one, so the first two parts of the Mapping Rule imply that |A| ≥ |B| and
|A| ≤ |B|, and thus that |A| = |B|.)

A few examples
We’ll start with another example—like those in Example 9.24—of the logic underlying the Mapping Rule,
but this time using function terminology.

(a) A B (b) A B (c) A B


f f f
b1 b1 b1

b2 b2 b2

b3 b3 b3
.. .. ..
. . .
bn bn bn

f : A → B is onto if, for all b ∈ B, there exists an f : A → B is one-to-one if, for all a ∈ A and f : A → B is a bijection if it is both one-to-one
a ∈ A such that f(a) = b. (“Every possible a ∈ A, if f(a) = f(a ) then a = a . (“No output is
′ ′ ′
and onto. (“Every output is hit exactly once.”)
output is hit.”) Every element of B has an hit more than once.”) No element of B has more Every element of B has exactly one incoming
incoming arrow, so |A| ≥ |B|. than one incoming arrow, so |A| ≤ |B|. arrow, so |A| = |B|.

Figure 9.16 The Mapping Rule, and a reminder of the definitions of (a) onto, (b) one-to-one, and (c) bijective
functions. The number of arrows equals |A|.

ISTUDY
466 Counting

Example 9.25: Students and assignments.


Let S be a set of 128 students in a computer science class, let A be a set of programming assignments, and
suppose that mine : S → A is a function so that mine(s) is the assignment that has the name of student
s written on it. (Because mine is a function, each student’s name is by definition on one and only one
submitted assignment.)
The function mine is onto. Then every assignment in A has at least one student’s name on it—and therefore
there are at least as many students as assignments: each name is written only once, and every assignment
has a name on it. So |A| ≤ 128. (There could be fewer than 128 if, for example, assignments were allowed
to be submitted by pairs of students.)
The function mine is one-to-one. Then no assignment has more than one name on it—and therefore there
are at least as many students as assignments: each assignment has at most one name, so there can’t be
more names than assignments. So |A| ≥ 128. (There could be more than 128 if, for example, there are
assignments in the pile that were submitted by students in a different section of the course.)
The function mine is both onto and one-to-one. Then each assignment has exactly one name written on
it, and thus |A| = |S| = 128.

Let’s also rewrite two of the informal scenarios from Example 9.24 to explicitly the Mapping Rule:

Example 9.26: Classes and names, formalized.


Let S be the set of students taking a particular class. Define the function f : S → {A, B, . . . , Z}, where f(s)
is the first letter of the last name of student s. If no two students’ last names start with the same letter,
then f(s) = f(s′ ) only when s = s′ —in other words, the function f is one-to-one. Then, by the Mapping
Rule, |S| ≤ | {A, B, . . . , Z} |: there are at most 26 students in the class.

Example 9.27: States, formalized.


Let T be the set of people in a particular club. Let T′ ⊆ T be those people in T who are from one of the
50 states. Because T′ ⊆ T, we have |T| ≥ |T′ |.
Define the function g : T′ → {Alabama, Alaska, . . . , Wyoming}, where g(x) is the home state of per-
son x. If there is at least one student from every state, then for all s ∈ {Alabama, Alaska, . . . , Wyoming}
there’s an x ∈ T′ such that g(x) = s—in other words, the function g is onto. Then, by the Mapping Rule,
|T′ | ≥ | {Alabama, Alaska, . . . , Wyoming} |: there are at least 50 people in the club.

We’ll close this section with an example of using the Mapping Rule to count the cardinality of a set
that we have not yet been able to calculate. We’ll do so by giving a bijection between this new set (with
previously unknown cardinality) and a set whose cardinality we do know. The set that we’ll analyze here is
the power set of a set X—the set of all subsets of X, defined as P(X) = {Y : Y ⊆ X}. (See Definition 2.33.)

For example, P({0, 1}) is {} , {0} , {1} , {0, 1} . Let’s look at the power set of {1, 2, . . . , 8}:

Example 9.28: Power set of {1, 2, . . . , 8}.


What is |P({1, 2, . . . , 8})|?
8
Solution. We’ll give a bijection between {0, 1} and P({1, 2, . . . , 8})—that is, we’ll define a function
8
b : {0, 1} → P({1, 2, . . . , 8}) that’s a bijection. Here is the correspondence: for every 8-bit string

ISTUDY
9.3 Using Functions to Count 467

8
y ∈ {0, 1} , define b(y) to be the subset Y ⊆ {1, 2, . . . , 8} such that i ∈ Y if and only if the ith bit of y
is 1. For example:
y = 11101010 → Y = {1, 2, 3, 5, 7} that is, b(11101010) = {1, 2, 3, 5, 7},
y = 00001000 → Y = {5} and b(00001000) = {5},
y = 00000000 → Y = {} and b(00000000) = {}.

Because every subset corresponds to some bitstring, and no subset corresponds to more than one bitstring,
8 8
the function b : {0, 1} → P({1, 2, . . . , 8}) is a bijection between {0, 1} and P({1, 2, . . . , 8}).
8
Because a bijection from {0, 1} to P({1, 2, . . . , 8}) exists (we just defined it!), the Mapping Rule
8
says that |P({1, 2, . . . , 8})| = |{0, 1} | = 28 = 256.

The idea of the mapping from Example 9.28 applies for an arbitrary finite set X. Here is the general result.
(Lemma 9.10 is the reason for the power set’s name: the cardinality of P(X) is 2 to the power of |X|.)

Lemma 9.10: Cardinality of the Power Set.


Let X be any finite set. Then |P(X)| = 2|X| .

Proof. Let n = |X|. Let X = {x1 , x2 , . . . , xn } be an arbitrary ordering of the elements of X. Define a
n
function f : {0, 1} → P(X) as follows:

f(y) = {xi : the ith bit of y is 1} .


n
It is easy to see that f is onto: for any subset Y of X, there exists a y ∈ {0, 1} such that f(y) = Y. It is also
easy to see that f is one-to-one: if y ̸= y′ then there exists an i such that yi ̸= y′i , so [xi ∈ f(y)] ̸= [xi ∈ f(y′ )].
|X|
Therefore f is a bijection, and by the Mapping Rule we can conclude |P(X)| = | {0, 1} | = 2|X| .

Taking it further: Although our focus is on finding the cardinality of finite sets, we can also apply the Mapping Rule to think
about infinite cardinalities. Infinite sets are generally more the focus of mathematicians than of computer scientists, but there are
some fascinating (and completely mind-bending) results that are relevant for computer scientists, too. For example, we can prove
that the number of even integers is the same as the number of integers (even though the former is a proper subset of the latter!).
But we can also prove that |R| > |Z|. More relevantly for computer science, we can prove that there are strictly more problems
than there are computer programs, and therefore that there are problems that cannot be solved by a computer. See p. 474.

9.3.2 The Division Rule


When we introduced the Inclusion–Exclusion Rule, we used an approach to counting that we might call
count first, apologize later: to compute the cardinality of a set A∪B, we found |A|+|B| and then “fixed” our
count by subtracting the number of elements that we’d counted twice—namely, subtracting |A ∩ B|. Here
we’ll consider an analogous count-and-correct rule, called the Division Rule, that applies when we count
every element of a set multiple times (and where each element is recounted the same number of times); we
then correct our total by dividing by this “redundancy factor.” Let’s start with some informal examples:

Example 9.29: Some redundant counting, informally.

• Suppose that the Juggling Club on campus sells 99 juggling torches to its members, in sets of three.
Then there are 33 people who purchased torches.

ISTUDY
468 Counting

• There are 42 people at a party. Suppose that every person shakes hands with every other person. How
many handshakes have occurred, in total? There are many ways to solve this problem, but here’s an
approach that uses division: each person shakes hands with all 41 other people, for a grand total of
(42 people)·(41 shakes/person) = 1722 shakings. But each handshake involves two people, so we’ve
counted every shake exactly twice; thus there are actually a total of 861 = 1722 42·41
2 = 2 handshakes.

• In Game 5 of the 1997 NBA Finals, the Chicago Bulls had 10 different players who were on the
court for some portion of the game. The number of minutes played by each of these ten people were
⟨45, 44, 26, 24, 24, 24, 23, 23, 4, 3⟩. Thus the total number of minutes played by these ten players was
45 + 44 + 26 + 24 + 24 + 24 + 23 + 23 + 4 + 3 = 240. In basketball, five players are on the court
at a time. Thus the game lasted 240 5 = 48 minutes.

We’ll phrase the Division Rule using the same general structure as the Mapping Rule, in terms of a function
that maps from one set to another. Specifically, if we have a function f : A → B that always maps exactly
the same number of elements of A to each element of B—for instance, exactly three torches are mapped to
any particular juggler in Example 9.29—then |A| and |B| differ exactly by that factor:

Theorem 9.11: Division Rule.


Let A and B be arbitrary sets. Suppose that there exists a function f : A → B such that, for every b ∈ B,
there are exactly k elements a1 , . . . , ak ∈ A such that f(ai ) = b. (That is, |{a ∈ A : f(a) = b}| = k for all
b ∈ B.) Then |A| = k · |B|.

(The Division Rule with k = 1 simply is the bijection case of the Mapping Rule: what it means for f : A → B
to be a bijection is precisely that |{a ∈ A : f(a) = b}| = 1 for every b ∈ B. If such a function f exists, then
both the Mapping Rule and the Division Rule say that |A| = 1 · |B|.)

Example 9.30: Redundantly counting jugglers, formally.


Let M be the set of members of the Juggling Club, and let T be the set of torches bought by the members of
the club. Consider the function boughtBy : T → M. Assuming that each member bought precisely three
torches—that is, assuming that | {t ∈ T : boughtBy(t) = m} | = 3 for every m ∈ M—then |T| = 3 · |M|.

Example 9.31: Redundantly counting integers, formally.


Consider sets A = {0, 1, . . . , 31} and B = {0, 1, . . . , 15}. Define the function f : A → B as f(n) = ⌊ n2 ⌋.
For each b ∈ B, there are exactly two input values whose output under f is b, namely 2b and 2b + 1. Thus
by the Division Rule |A| = 2 · |B|.

This basic idea—if we count each thing k times, then dividing our total count by k gives us the number of
things—may be pretty obvious, but it’s also surprisingly useful. Here’s a sequence of examples, starting
with a warm-up exercise and continuing with two (less obvious) applications of the Division Rule:

Example 9.32: Rearranging PERL, PEER, and SMALLTALK.


How many different ways can you arrange the letters of (1) the name of the programming language PERL?
(2) the word PEER? (3) the name of the programming language SMALLTALK?

ISTUDY
9.3 Using Functions to Count 469

ELPR

LEPR

ELRP

LERP

EPLR

LPER

EPRL

LPRE

ERLP

LREP

ERPL

LRPE

PELR

PLER

PERL

PLRE

PREL

PRLE

RELP

RLEP

REPL

RLPE

RPEL

RPLE
EEPR EERP EPER EPRE EREP ERPE PEER PERE PREE REEP REPE RPEE

Figure 9.17 The 24 different orderings of PERL and the 12 different orderings of PEER. The function that replaces L by
E is displayed by the arrows.

Solution. For PERL: There are 4 different letters, and any permutation of them is a different ordering.
Thus there are 4! = 4 · 3 · 2 · 1 = 24 orderings. (See Theorem 9.8.)
PEER: We’ll answer this question using the PERL solution. Define the function L->E as follows: given a
4-character input string, it produces a 4-character output string in which every L has been replaced by
an E. For example, L->E(PERL) = PERE. Let S denote the orderings of the word PERL, and let T denote
the orderings of PEER. Note that the function L->E : S → T has the property that, for every t ∈ T,
there are exactly two strings x ∈ S such that L->E(x) = t. (For example, L->E(PERL) = PERE and
L->E(PLRE) = PERE.) Thus, by the Division Rule, there are 4!2 = 24 2 = 12 ways to order the letters of
PEER. (See Figure 9.17.)
SMALLTALK: There are 9! different orderings of the nine “letters” in the not-quite-real word
S M A1 L1 L2 T A2 L3 K. (We are writing L1 and L2 and L3 to denote three different “letters,”
and similarly for A1 and A2 .) We will use the Division Rule repeatedly to “erase” subscripts:

• The function that erases subscripts on the As maps two inputs to each output: one with A1 before A2 ,
and one with A2 before A1 . By the Division Rule, then, there are 9!2 different orderings of the “letters”
in the word S M A L1 L2 T A L3 K.
• The function that takes an ordering of S M A L1 L2 T A L3 K and erases the subscripts on
the Ls maps precisely six inputs to each output: one for each of the 3! possible orderings of the Ls.

9! 362,880
Thus there are 2·3! = 12 = 30,240 orderings of the letters in S M A L L T A L K.

Counting orderings when some elements are indistinguishable


Although we phrased Example 9.32 in terms of the number of ways to rearrange the letters of some partic-
ular words, there’s a very general idea that underlies the PEER and SMALLTALK examples. We’ll state the
underlying idea as a theorem:

Theorem 9.12: Rearranging with duplicates.


The number of ways to rearrange a sequence containing k different distinct elements {x1 , . . . , xk }, where
element xi appears ni times, is
(n1 + n2 + · · · + nk )!
.
(n1 !) · (n2 !) · · · · · (nk !)

For example, PERL has k = 4 distinct elements, which appear nP = nE = nR = nL = 1 time each; the
theorem says that there are (1+1+1+1)!
1!·1!·1!·1! = 4! ways to arrange the letters. On the other hand, SMALLTALK
has k = 6 distinct elements, which appear nA = 2, nL = 3, and nS = nM = nT = nK = 1 times each; the
theorem says that there are (2+3+1+1+1+1)!
2!·3!·1!·1!·1!·1!
9!
= 2!·3! ways to arrange the letters.

ISTUDY
470 Counting

Proof of Theorem 9.12. Let’s handle a simpler case first: suppose that we have n different elements that
we can put into any order, and precisely k of these n elements are indistinguishable. Then there are exactly
n!
k! different orderings of those n elements. To see this fact, imagine “decorating” each of those k items with
some kind of artificial distinguishing mark, like the numerical subscripts of the letters of SMALLTALK from
Example 9.32. Then there are n! different orderings of the n elements. The erase function that eliminates
our artificial distinguishing marks has k! inputs that yield the same output—namely, one for each ordering
of the k artificially marked elements. Therefore, by the Division Rule, there are n! k! different orderings of
the elements, without the distinguishing markers.
The full theorem is just a mild generalization of this argument, to allow us to consider more than one set
of indistinguishable elements. (In particular, we could give a formal proof by induction on the number of
elements with ni ≥ 2.) In total, there are (n1 +n2 +· · ·+nk )! different orderings of the elements themselves,
but there are n1 ! equivalent orderings of the first element, n2 ! of the second, and so forth. The function that
“erases subscripts” as in Example 9.32 has (n1 !) · (n2 !) · · · · · (nk !) different equivalent orderings, and
thus the total number of orderings is, by the Division Rule,
(n1 + n2 + · · · + nk )!
.
(n1 !) · (n2 !) · · · · · (nk !)
Here’s a different kind of example that we can solve using this theorem:

Example 9.33: Writing 232,848 as a sequence of prime factors.


How many ways can we write 232,848 as a product p1 p2 · · · pk , where each pi is prime? (The set of prime
factors, and the number of occurrences of each factor, are the same in every product, because the prime
factorization of any positive integer is unique. But the order may change: for example, we can write
6 = 3 · 2 or 6 = 2 · 3.)
Solution. The prime factorization of 232,848 is 232,848 = 24 · 33 · 72 · 11. Thus a product of primes
that equals 232,848 consists of 4 copies of two, 3 copies of three, 2 copies of seven, and one copy of
eleven—in some order. (For example, 2 · 2 · 7 · 3 · 3 · 7 · 2 · 11 · 3 · 2.) By Theorem 9.12, the number of
orderings of these elements is
(4 + 3 + 2 + 1)! 10! 3,628,800
= = = 12,600.
4! · 3! · 2! · 1! 4! · 3! · 2! 24 · 6 · 2

A slightly more complicated example


Here is one final example of the Division Rule, in which we’ll use this approach on a slightly more
complicated problem:

Example 9.34: Assigning partners.


The professor divides the n students in a CS class into n2 partnerships, with two students per partnership.
(Assume that n is even.) The order of partners within a pair doesn’t matter, nor does the order of the
partnerships. (That is,
Pairing A: Pair 1 = Paul and George; Pair 2 = John and Ringo and
Pairing B: Pair 1 = Ringo and John; Pair 2 = George and Paul
represent exactly the same partnerships.) How many ways are there to divide the class into partnerships?

ISTUDY
9.3 Using Functions to Count 471

ordering reordered within pairs ordering reordered within pairs ordering reordered within pairs
  
AB CD AB CD 
 AC BD AC BD 
 AD BC AD BC 


 
 

AB DC AB CD 
 AC DB AC BD 
 AD CB AD BC 


 
 

BA CD AB CD 
 BD AC BD AC   BC AD BC AD 


 
AC 

BA DC AB CD AB BD CA BD AC  BC DA BC AD AD
+ + +
CD AB CD AB 
 CA BD AC BD 
 CB AD BC AD 


 CD 
 BD 
 BC
CD BA CD AB 
 CA DB AC BD 
 CB DA BC AD 


 
 


 BD AC   

DC AB CD AB 
 DB AC 
 DA BC AD BC 

  
DC BA CD AB DB CA BD AC DA CB AD BC

Figure 9.18 Partnerships for n = 4 students: the 4! orderings, then the orderings sorted within each pair, and finally
with the pairs themselves sorted.

Solution. Let’s line up the students in some order, and then pair the first two students, then pair the third
and fourth, and so on. There are n! different orderings of the students, but there are fewer than n! possible
partnerships, because we’ve double counted each set of pairs in two different ways:

• there are two equivalent orderings of the first pair of students, and two equivalent orderings of the
second pair, and so on.
• the ordering of the pairs doesn’t matter, so the partnerships themselves can be listed in any order at
all (without changing who’s paired with whom).

Each of the n2 pairs can be listed in 2 orders, so—by the Product Rule—there are 2n/2 different possible
within-pair orderings. And there are (n/2)! different orderings of the pairs. Applying the Division Rule,
then, we see that there are
n!
(∗)
(n/2)! · 2n/2
total possible ways to assign partners.
Let’s confirm that (∗) checks out for some small values of n. For n = 2, there’s just one pairing, and
2! 2 4! 4·3·2
indeed (∗) is 1!·2 1 = 2 = 1. For n = 4, the formula (∗) yields 23 = 8 = 3 pairings; indeed, for
the quartet Paul, John, George, and Ringo, there are three possible partners for Paul (and once Paul is
assigned a partner there are no further choices to be made). See Figure 9.18: we try all 4! = 24 orderings
of the four people, then we reorder the names within each pair, and finally we reorder the pairs.

Problem-solving tip: There are often many different ways to solve a given problem—and you can use whatever approach makes
the most sense to you! For example, Exercise 9.109 explores a completely different way to solve Example 9.34, based on the
Generalized Product Rule instead of the Division Rule.

9.3.3 The Pigeonhole Principle


We’ll close this section with a not-too-hard-to-prove—but also surprisingly useful—theorem based on the
Mapping Rule, called the pigeonhole principle.
A pigeonhole refers to one of the “cells” in a grid of compartments that are open in the front, and which can house either snail
mail or, back in the day, roosting pigeons. (There’s also a related verb: to pigeonhole someone/something is to categorize that
person/thing into one of a small number of—misleadingly simple—groups.)

We’ll start with a few informal examples to introduce the underlying idea:

ISTUDY
472 Counting

Example 9.35: What happens when there are more things than kinds of things.

• If there are more socks in your drawer than there are colors of socks in your drawer, then you must
have two socks of the same color.
• If there are only 5 possible letter grades and there are 6 or more students in a class, then there must
be two students who receive the same letter grade.
• If you take 9 or more CS courses during the 8 semesters that you’re in college, then there must be at
least one semester in which you doubled up on CS courses.
• In the antiquated language in which this result is generally stated: if there are n pigeonholes, and
n + 1 pigeons that are placed into those pigeonholes, then there must be at least one pigeonhole that
contains more than one pigeon.

Here is the general statement of the theorem, along with its proof:

Theorem 9.13: Pigeonhole Principle.


Let A and B be sets with |A| > |B|, and let f : A → B be any function. Then there exist distinct elements
a ∈ A and a′ ∈ A such that f(a) = f(a′ ).

Proof. We can prove the Pigeonhole Principle using the Mapping Rule. Given the sets A and B, and the
function f : A → B, the Mapping Rule tells us that
if f : A → B is one-to-one, then |A| ≤ |B|. (1)

Taking the contrapositive of (1), we have

if |A| > |B|, then f : A → B is not one-to-one. (2)


By assumption, we have that |A| > |B|, so f : A → B is not one-to-one. The theorem follows by the
definition of a one-to-one function: the fact that f : A → B is not one-to-one means precisely that there is
some b ∈ B that’s “hit” twice by f. In other words, there exist distinct a ∈ A and a′ ∈ A such that a ̸= a′
and f(a) = f(a′ ).
A slight generalization of this idea is also sometimes useful: if there are n total objects, each of which
has one of k types, then there must be a type that has at least ⌈ nk ⌉ objects. (We’ll omit the proof, but the
idea is very similar to Theorem 9.13.)

Theorem 9.14: Pigeonhole Principle: extended version.


Let A and B be sets, and let f : A → B be any function. Then there exists some b ∈ B such that the set
{a ∈ A : f(a) = b} contains at least ⌈|A|/|B|⌉ elements.

(Another less formal way of stating this fact is “the maximum must exceed the average”: the number of
elements in A that “hit” a particular b ∈ B is |A|
|B| on average, and there must be some element of B that’s hit
at least this many times.)
We’ll start with two simpler examples of the pigeonhole principle, and close with a slightly more com-
plicated application. (In the last example, the slightly tricky part of applying the pigeonhole principle is
figuring out what corresponds to the “holes.”)

ISTUDY
9.3 Using Functions to Count 473

Example 9.36: Congressional voting.


Suppose that there were 5 different bills upon which the House of Representatives voted yesterday. (There
are 435 representatives in the U.S. House.) The pigeonhole principle implies that there are two represen-
tatives who voted identically on yesterday’s bills. A representative’s vote can be expressed as an element
5
of {aye, nay, abstain} , which has cardinality 35 = 243. Because 243 < 435, the pigeonhole principle
says that there are two representatives with the same voting record.

Example 9.37: Logical equivalence.


Let S be a set of 17 different logical propositions over the Boolean variables p and q.
4
A truth table for a proposition in S is an element of {True, False} (the rows of the truth table corre-
4
spond to each of the four truth assignments for p and q), and there are only |{True, False} | = 24 = 16
different such values. Therefore, our 17 different propositions have only 16 different possible truth
tables—so, by the pigeonhole principle, there must be two different propositions that have the same
truth table.

Example 9.38: Points in a square.


Suppose that there are n√2 + 1 points in a 1-by-1 square, as in Figure 9.19a. Show that there must be two
points within distance n2 of each other.
Solution. We will use the pigeonhole principle. Divide the unit square into n2 equal-sized disjoint
subsquares—each with dimension 1n -by- 1n . (To prevent overlap, we’ll say that every shared boundary line
is included in the square to the left or below the shared line.) There are n2 subsquares, and n2 + 1 points.
By the pigeonhole principle, at least one subsquare contains two or more points. (See Figure 9.19b.)
Notice that the farthest apart that two points in a subsquare can be is when they are at opposite corners
of the subsquare. In this case, they are n1 apart in x-coordinate and 1n apart in y-coordinate—in other words,
they are separated by a distance of
q q √
( n1 )2 + ( 1n )2 = n22 = n2 .
Therefore, because there
√ is at least one subsquare containing 2+ points, and points in the same subsquare
are within distance n2 of each other, the claim follows.

Taking it further: The pigeonhole principle can be used to show that compression of data files (for example, ZIP files or
compressed image formats like GIF) must either lose information about the original data (so-called lossy compression) or must,
for some input files, actually cause the “compressed” version to be larger than the original file. See p. 475.

(a) 17 points in a 1-by-1 square. (b) The square divided into 16 subsquares, and one of the
several doubly occupied subsquares.

Figure 9.19 Putting n2 + 1 points in the unit square.

ISTUDY
474 Counting

Computer Science Connections

Infinite Cardinalities (and Problems that Can’t Be Solved by Any Program)


Recall the Mapping Rule: for any two sets A and B, a bijection f : A → B exists if and only if |A| = |B|. Although
we were thinking about finite sets when we stated this rule, the statement holds even for infinite sets A and B; we
can even think of this rule as defining what it means for two sets to have the same cardinality. Those sets S such that
|S| = |Z|, called countable sets, will turn out to be particularly important.
Surprisingly, some sets that “seem” much
bigger or much smaller than the integers have Z≥0 = 0 1 2 3 4 5 6 7 8 9 10 11 · · ·
the same cardinality as Z. For example, the  
f(n) = n2 · (−1)n
set of nonnegative integers has the same car-
dinality as the set of all integers! (See Fig- Z = 0 −1 1 −2 2 −3 3 −4 4 −5 5 −6 · · ·
ure 9.20.) This fact is very strange—after all,
≥0 ≥0
we’re looking at sets A and B where A is a Figure 9.20 A bijection f : Z → Z. Thus |Z | = |Z|.
proper subset of B and we’ve now established
that |A| = |B|! But, indeed, because we have a bijection between A and B, they really are the same size.
Or consider a Python program p. Think
p r i n t ( " h e l l o w o r l d
of the source code of p as a sequence of
112 114 105 110 116 40 34 104 101 108 108 111 32 119 111 114 108 100
characters, each of which is represented as 1110000 1110010 1101001 1101110 1110100 0101000 0100010 1101000 1100101 1101100 1101100 1101111 0100000 1110111 1101111 1110010 1101100 1100
a sequence of bits, which can therefore be
interpreted as an integer written in binary. Figure 9.21 Converting a Python program into an integer. The
(See Figure 9.21.) Therefore there is a bijec- program print("hello world") corresponds to the integer
tion f between the integers and the set of 1,229,340,410,842,605,087,191,708,943,331,595,860,381,993, with
Python programs, where f(i) is the ith-largest binary representation 1110000 1110010 1101001 1101110 · · · .
Python program (sorted numerically by its
binary representation). With all of these sets that have the same cardinality, it might be tempting to think that all
infinite sets have the same cardinality as Z. But they don’t!

Theorem: The set of all subsets of Z≥0 —that is, P(Z≥0 )—is strictly bigger than Z≥0 .

Proof. Suppose for a contradiction that 0 1 2 3 4


f : Z≥0 → P(Z≥0 ) is an onto function. Suppose that f : Z 0 → P(Z 0 ). In a
≥ ≥

f(0) 1 0 1 0 1 ··· table, write row n corresponding to


We’ll show that there’s a set S ∈ P(Z≥0 ) f(n)—so that f(n) has a “1” in column j
such that for every n ∈ Z≥0 we have f(1) 0 0 0 1 1 ··· when j ∈ f(n). Define
S = {i : i ∈/ f(i)}—that is, the opposite of
f(n) ̸= S. Define the set S as follows: f(2) 0 1 1 0 1 ··· the diagonal element.
S = {i ∈ Z≥0 : i ∈
/ f(i)} f(3) 1 1 0 1 1 ···
(So i ∈ S ⇔ the set f(i) does not contain i.)
f(4) 1 0 1 0 0 ··· negation of diagonal:
Now, the set S differs from f(i) for every i, i= 0 1 2 3 4 ...
. . . . . diagonal 1 0 1 1 0 ...
. . . . .
because i ∈ S ⇔ i ∈ / f(i) for every i. Thus . . . . . negated 0 1 0 0 1 ...
S is never “hit” by f—contradicting the S={ 1, 4, . . .}
assumption that f was onto. Therefore there
is no onto function f : Z≥0 → P(Z≥0 ), and, Figure 9.22 Diagonalization. Suppose that there exists an onto
by the Mapping Rule, |Z≥0 | < |P(Z≥0 )|. function f : Z≥0 → P(Z≥0 ), and define S = {i : i ∈ / f(i)}. Then
(This argument is called a proof by diagonal- S ⊆ Z≥0 , but S ̸= f(i) for every i—a contradiction!
ization; see Figure 9.22.)
We can think of any subset of Z as defining a problem that we might want to write a Python program to solve.
For example, the set {0, 2, 4, 6, . . .} is the problem of identifying even numbers. The set {1, 2, 4, 8, 16, . . .} is the
problem of identifying exact powers of 2. The set {2, 3, 5, 7, 11, . . .} is prime numbers. What does all of this say?
There are more problems than there are Python programs! And thus there are problems that cannot be solved by any
program! (Problems that can’t be solved by any computer program are called uncomputable. See Section 4.4.4 for
some particular uncomputable problems, or see any good book on computability, like [73] or [121].)

ISTUDY
9.3 Using Functions to Count 475

Computer Science Connections

Lossy and Lossless Compression lower quality


than original
The task in compression is to Original

take a large (potentially massively

NASA photographer Bob Nye, 1966


large!) piece of data and to repre- uncompr
essed
lossily
sent it, somehow, using a smaller compres
sed

amount of space. Compression tech-


niques are tremendously common,
for a wide variety of data: text,
loss
images, audio, and video, for exam- com lessly identical to
pres original
sed
ple. There are two fundamentally
different approaches to compres- unc
omp
r esse
d
sion of an original data file d into These images are of Katherine Johnson (1918–2020), a
a compressed form d′ : lossy and mathematician and NASA scientist (and “computer,” when that word
was more a job title than a piece of hardware). Most famously, she
lossless compression. manually calculated flight trajectories for the first human earth orbit,
in John Glenn’s 1962 mission, in a story made famous by the 2016
Lossy Compression. In lossy com- movie Hidden Figures, based on a book of the same name [116].
pression, d′ does not represent
exactly all of the information in d— Figure 9.23 A visual example of compression. A large image file (say, in TIFF
that is, we’ve “lost” some informa- format) can typically be stored more efficiently in other image formats—either
tion through compression. (That’s losslessly (as in PNG) or lossily (as in JPG).
why the compression is called
“lossy.”) In fact, many of the standard file formats for images, audio, and video are just standard methods for lossy
compression. For example, JPEG is a lossy image compression format, and MP3 is a lossy audio compression format.
The general goal with a lossy compression technique is to maintain, to the extent possible, “perceptual indistinguisha-
bility.” For example, a digital audio stream can be represented precisely as a sequence of intensities at each time t
(“how loud is the sound at time t?”). A lossy compression technique for sound might round the intensities: instead of
representing an intensity as one of 216 values (“a 16-bit sound,” which is CD quality), we could round to the nearest
of 28 values. (This idea is called quantization; see Example 2.51.) As long as the lost precision is smaller than the
level of human perception, the new audio file would “sound the same” as the original.
Lossless Compression. In lossless compression, the precise contents of the original data file d can be reconstructed
when the compressed data file d′ is uncompressed. This approach is the one commonly used, for example, when
compressing text using a program like ZIP. The typical idea of lossless compression is to exploit redundancy in
the stored data and to avoid wasting space storing the “same” information twice. For example, take the complete
works of Shakespeare. By replacing every occurrence of the with QQ (two letters that don’t occur consecutively in
Shakespeare) the resulting file takes “only” about 99.2% of the original size. We can then set up a “translation table”
telling us that QQ → the when we’re decompressing. (There are many thousands of occurrences of the—the word
the appears over 20,000 times in the complete works of Shakespeare, plus over 1000 each of thee, them, their,
they, there, and these—and we’re saving a character’s worth of storage for each of them.)
An algorithm due to Abraham Lempel and Jacob Ziv, published in 1977—hence “LZ77”—is the basis of many
commonly used lossless compression techniques [135]. Huffman coding—see Example 9.21 and the discussion after
it—is another classic approach to lossless compression [59]. One interesting fact about lossless compression, though,
is that it is impossible to actually compress every input file into a smaller size:

Theorem: Let C be any lossless compression function. Then there exists an input file d such that C(d) takes up at
least as much space as d.

n ∪n−1 i
Proof. Suppose that C compresses all n-bit inputs into n − 1 or fewer bits. That is, C : {0, 1} → i=0 {0, 1} .
∑ n−1
Observe that the domain has size 2n and the range has size i=0 2i = 2n − 1. By the pigeonhole principle, there
must be two distinct input files d1 and d2 such that C(d1 ) = C(d2 ). But this C cannot be a lossless compression
technique: if the compressed versions of the files are identical, the decompressed versions must be identical too!

ISTUDY
476 Counting

EXERCISES
9.60 Determine how many bitstrings x ∈ {0, 1}7 fail all three Hamming code tests. (These bitstrings are the ones marked “7 7 7” in
Figure 9.15.) That is, using the idea of Example 9.23, figure out how many bitstrings satisfy the following three conditions:

x2 + x3 + x4 ̸≡2 x5 x1 + x3 + x4 ̸≡2 x6 x1 + x2 + x4 ̸≡2 x7 .

9.61 Prove that the set P of legal board positions in a chess game satisfies |P| ≤ 1364 . (Hint: Define a one-to-one function from the set
{1, 2, . . . , 13}64 to P.)

A string over Σ is a sequence of elements of a set Σ—that is, a string x over Σ satisfies x ∈ Σn for some length n ≥ 0.
9.62 How many strings of length n over the alphabet Σ = {A, B, . . . , Z, } are there? How many contain exactly 2 “words” (that is,
contain exactly one space that is not in the first or last position)?
9.63 Let n ≥ 3. How many n-symbol strings over this alphabet contain exactly 3 “words”? (Hint: use Example 9.4 to account for
n-symbol strings with exactly two s; then use Inclusion–Exclusion to prevent an initial space, a final space, or two consecutive
spaces, as in ABC, XYZ␣, and JKL␣␣MNO.)

A string over the alphabet {[, ]} is called a string of balanced parentheses if two conditions hold: (i) every [ is later closed by a ];
and (ii) every ] closes a previous [. (You must close everything, and you never close something you didn’t open.) Let Bn ⊆ {[, ]}n
denote the set of strings of balanced parentheses that contain n symbols.
9.64 Show that |Bn | ≤ 2n : define a one-to-one function f : Bn → {0, 1}n and use the Mapping Rule.
9.65 Show that |Bn | ≥ 2n/4 by defining a one-to-one function g : {0, 1}n/4 → Bn and using the Mapping Rule. (Hint: consider [][]
and [[]].)

A certain college requires its users’ passwords to be 15 characters long. Inspired by an XKCD comic (see http://xkcd.com/
936/), a certain faculty member started creating his passwords by choosing three 5-letter English words from the dictionary, without
spaces. (An example password is ADOBESCORNADORN, from the words ADOBE and SCORN and ADORN.) There are 8636 five-letter
words in the dictionary that he found.
9.66 How many passwords can be made from any 15 (uppercase-only) letters? How many passwords can be made by pasting together
three 5-letter words from this dictionary?
9.67 How many passwords can be made by pasting together three distinct 5-letter words from this dictionary? (For example, the password
ADOBESCUBAADOBE is forbidden because ADOBE is repeated.)
9.68 The faculty member in question has a hard time remembering the order of the words in his password, so he’s decided to ensure that
the three words he chooses from this dictionary are different and appear in alphabetical order in his password. (For example, the
password ADOBESCUBAFOXES is forbidden because SCUBA is alphabetically after FOXES.) How many passwords fit this criterion?
Solve this problem as follows. Let P denote the set of three-distinct-word passwords (the set from Exercise 9.67). Let A denote the
set of three-distinct-alphabetical-word passwords. Define a function f : P → A that sorts. Then use the Division Rule.

9.69 After play-in games, the NCAA basketball tournament involves 64 teams, arranged in a bracket that specifies who plays whom in
each round. (The winner of each game goes on to the next round; the loser is eliminated. See Figure 9.24.) How many different
outcomes (that is, lists of winners of all games) of the tournament are there?

A palindrome over Σ is a string x ∈ Σn that reads the same backward and forward—like 0110, TESTSET, or (ignoring spaces and
punctuation) SIT ON A POTATO PAN, OTIS!.
9.70 How many 6-letter palindromes (elements of {A, B, . . . , Z}6 ) are there?
9.71 How many 7-letter palindromes (elements of {A, B, . . . , Z}7 ) are there?
9.72 Let n ≥ 1 be an integer, and let Pn denote the set of palindromes over Σ of length n. Define a bijection f : Pn → Σk (for some
k ≥ 0 that you choose). Prove that f is a bijection, and use this bijection to write a formula for |Pn | for arbitrary n ∈ Z≥1 .

9.73 Recall an integer k ≥ 1 is a factor of n if k | n. How many positive integer factors does 100 have? How many are squarefree?
9.74 How many positive integer factors does 12! have? (Hint: calculate the prime factorization of 12!.)
9.75 An integer n is called squarefree if there’s no integer m ≥ 2 such that m2 | n. How many squarefree factors does 12! have? Explain.
9.76 (programming required) Write a program that, given n ∈ Z≥1 , finds all squarefree factors of n.

9.77 Consider two sets A and B. Consider the following claim: if there is a function f : A → B that is not onto, then |A| < |B|. Why does
this claim not follow directly from the Mapping Rule?

ISTUDY
Exercises 477

A
B
C In the first round, A plays B, C
D plays D, etc. The A/B winner
E plays the C/D winner in the
F second round, and so forth.
G
H

Figure 9.24 An 8-team tournament bracket.

9.78 The genre-counting problem (Example 9.24) considered a function f : {1, 2, . . . , n} → {1, 2, 3, 4, 5}. Let’s consider n = 5: how
many different functions f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?
9.79 How many one-to-one functions f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?
9.80 How many bijections f : {1, 2, . . . , 5} → {1, 2, . . . , 5} are there?
9.81 Let n ≥ 1 and m ≥ n be integers. Consider the set G of functions g : {1, 2, . . . n} → {1, 2, . . . , m}. How many functions are in G?
How many one-to-one functions are there in G? How many bijections?
9.82 Show that the number of bijections f : A → B is equal to the number of bijections g : B → A. (Hint: define a bijection between
{bijections f : A → B} and {bijections g : B → A}, and use the bijection case of the Mapping Rule!)

9.83 A Universal Product Code (UPC) is a numerical representation of the bar codes used in stores, with an error-detecting feature
P
to handle misscanned codes. A UPC is a 12-digit number ⟨x1 , x2 , . . . , x12 ⟩ where [ 6i=1 3x2i−1 + x2i ] mod 10 = 0. (That is, the
even-indexed digits plus three times the odd-indexed digits should be divisible by 10.) Prove that there exists a bijection between
the set of 11-digit numbers and the set of valid 12-digit UPC codes. Use this fact to determine the number of valid UPC codes.
9.84 A strictly increasing sequence of integers is ⟨i1 , i2 , . . . , ik ⟩ where i1 < i2 < · · · < ik . How many strictly increasing sequences start
with 1 and end with 1024? (That is, we have i1 = 1 and ik = 1024. The value of k can be anything you want; you should count
both ⟨1, 1024⟩ and ⟨1, 2, 3, 4, . . . , 1023, 1024⟩.)

A subsequence of a sequence x = ⟨x1 , x2 , . . . , xn ⟩ is a sequence ⟨xi1 , xi2 , . . . , xik ⟩ of k ≥ 0 elements of x, where ⟨i1 , i2 , . . . , ik ⟩
is a strictly increasing sequence. For example, PYTHON is a subsequence of PYTHAGOREAN and BASIC is a subsequence of
BRAINSICKNESS.
9.85 Suppose the components of x = ⟨x1 , x2 , . . . , xn ⟩ are all different (as in PYTHON but not PYTHAGOREAN). Use the Mapping Rule to
figure out how many subsequences of x there are.
9.86 Suppose the components of x = ⟨x1 , x2 , . . . , xn ⟩ are all different, except for a single pair of identical elements that are separated
by k other elements. For example, PYTHAGOREAN has n = 11 and k = 4, because there are four entries (GORE) between the As (at
index 5 and 10), which are the only repeated entries. In terms of n and k, how many subsequences of x are there?

As Example 9.23 describes, the Hamming code adds 3 different parity bits to a 4-bit message m, where each added bit corresponds
to the parity of a carefully chosen subset of the message bits, creating a 7-bit codeword c. Let k and n, respectively, denote the
number of bits in the message and the codeword. (For the Hamming code, we have k = 4 and n = 7.)
A decoding algorithm takes a received (and possibly corrupted) codeword c′ and determines which message has a corresponding
codeword c that is most similar to c′ (as measured by Hamming distance). Specifically for the Hamming code: having received a
(possibly corrupted) codeword c′ , we compute what the parity bits should have been for the received message bits, and check for
mismatches between the computed and received parity bits; each set of mismatched parity bits corresponds to a different single-bit
error in the received codeword. (See Section 4.2 for more, and also see Exercises 4.25–4.28.) We can view the decoding algorithm
as a function decode : P(1, 2, . . . , n − k) → {0, 1, 2, . . . , n}—where decode(S) tells us which bit (if any) to flip in the received
codeword when S is the set of mismatched parity bits. (If decode(S) = 0, then no bits should be flipped.)
9.87 Argue using the Mapping Rule (that is, without reference to the precise decoding function for the Hamming code) that for the
Hamming code’s parameters (n = 7 and k = 4) that there exists a bijection decode : P({1, 2, . . . , n − k}) → {0, 1, 2, . . . , n}.
9.88 Suppose that we choose n = 9 and k = 4. Does there exist a bijection from P({1, 2, . . . , n − k}) to {0, 1, 2, . . . , n}? Why or why
not?
9.89 Suppose that we choose n = 31. For what value(s) of k does there exist a bijection from P({1, 2, . . . , n − k}) to {0, 1, 2, . . . , n}?
Prove your answer.
9.90 Prove that, for any n that is not one less than a power of 2, there does not exist a bijection from P({1, 2, . . . , n − k}) to
{0, 1, 2, . . . , n}.

ISTUDY
478 Counting

In the corporate and political worlds, there’s a dubious technique called URL squatting, where someone creates a website whose
name is very similar to a popular site and uses it to skim the traffic generated by poor-typing internet users. For example, Google
owns the addresses gogle.com and googl.com, which redirect to google.com. (But, as of this writing, someone else owns
oogle.com, goole.com, and googe.com.) Consider an n-letter company name. How many single-typo manglings of the name
are there for the following kinds of errors? Consider only uppercase letters throughout. (If your answers depend on the particular
n-letter company name, then say how they depend on that name. Note that no transposition errors are possible for the company
name MMM, for example.)
9.91 one-letter substitutions
9.92 one-letter insertions
9.93 one-pair transpositions (two adjacent letters written in the wrong order)
9.94 one-letter deletions

9.95 How many different ways can you arrange the letters of PASCAL?
9.96 How many different ways can you arrange the letters of GRACEHOPPER?
9.97 How many different ways can you arrange the letters of ALANTURING?
9.98 How many different ways can you arrange the letters of CHARLESBABBAGE?
9.99 How many different ways can you arrange the letters of ADALOVELACE?
9.100 How many different ways can you arrange the letters of PEERTOPEERSYSTEM?
9.101 (programming required) Write a function that, given an input string, computes the number of ways to rearrange the string’s letters.
Use your program to verify your answers to the last few exercises.

9.102 (programming required) In Example 9.33, we analyzed the number of ways to write a particular integer n as the product of primes.
(Because the prime factorization of n is unique, the only difference between these products is the order in which the primes appear.)
Write a program, in a language of your choice, to compute the number xn of ways we can write a given number n as p1 · p2 · · · pk ,
where each pi is prime. For what number n ≤ 10,000 is xn the greatest?

In Chapter 3, we discussed the application of Boolean logic to AI-based approaches to playing games like Tic-Tac-Toe. (See p. 121,
or Figure 9.25 for a 2-by-2 version of the game [Tic-Tac; the 3-by-3 version is Tic-Tac-Toe].)
Specifically, recall the Tic-Tac-Toe game tree: the root of the tree is the empty board, and the children of any node in the tree
are the boards that result from any move made in any of the empty squares. We talked briefly about why chess is hard to solve using
an approach like this. (In brief: it’s huge.) The next few problems will explore why a little bit of cleverness helps a lot in solving
even something as simple as Tic-Tac-Toe.
9.103 Tic-Tac-Toe ends when either player completes a row, column, or diagonal. But for this question, assume that even after somebody
wins the game, the board is completely filled in before the game ends. (That is, every leaf of the game tree has a completely filled
board.) How many leaves are in the game tree?
9.104 Continue to assume that the board is completely filled in before the game ends. How many distinct leaves are there in the tree?
(That is, the order in which a player fills their squares doesn’t matter; if the same squares are filled, the boards count as the same.)
9.105 Continue to assume that the board is completely filled in before the game ends. Extend your answer to Exercise 9.103: how many
total boards appear in the game tree (as leaves or as internal nodes)? (Hint: it may be easiest to compute the number of boards after
k moves, and add up your numbers for k = 0, 1, . . . , 9.)
9.106 Continue to assume that the board is completely filled in before the game ends. How many distinct total boards—internal nodes or
leaves—are there in the tree?

There are still two optimizations left that we haven’t tried. The first is using the symmetry of the board to help us: for example, there
are really only three first moves that can be made in Tic-Tac-Toe: a corner, the middle of the board, and the middle of a side. The
second optimization is to truncate the tree when there’s a winner. These are both a bit tedious to track by hand, but it’s manageable
with a small program.
9.107 (programming required) We can cut the size of the game tree down to less than a third of the original size—actually substantially
more!—by exploiting symmetry in plays. (We’re down to a third of the original size just within the first move.) Write a program to
compute the entire Tic-Tac-Toe game tree, and use it to determine the number of unique boards (counting as equivalent two boards
that match with respect to rotational or reflectional symmetry) in the game tree. How many boards are now in the tree?
9.108 (programming required) We can reduce the size of the game tree just a bit further by not expanding the portions of the game tree
where one of the players has already won. Extend your implementation from the last exercise so that no moves are made in any
board in which O or X has already won. How many boards are in the tree now?

ISTUDY
Exercises 479

O O
O O

OX O O XO O O X X X X
X X X X O O OX O O XO

OX OX OO O OO O XO XO OO O OO O XO X OX X O O XO X OX X O O
O O X XO X OX O O X XO X OX O OO O OO OX OX O OO O OO XO XO

OX OX XO XO XO XO OX OX
XO XO OX OX OX OX XO XO

Figure 9.25 The game tree for Tic-Tac.

Recall Example 9.34: we must put n students (where n is even) into 2n partnerships. (We don’t care about the order of the
partnerships, nor about the order of partners within a pair.) Here is an alternative way of solving this problem:
9.109 Consider sorting the n people alphabetically by name. Repeat the following n2 times: for the unmatched person p whose name is
alphabetically first, choose a partner for p from the set of all other unmatched people. How many choices are there in iteration i?
How many choices are there, in total?
9.110 Algebraically prove the following identity. (Hint: what does (n/2)! · 2n/2 represent?)

Y
n/2
n!
(n − 2i + 1) =
i=1
(n/2)! · 2n/2

Think of an n-gene chromosome as a permutation of the numbers {1, 2, . . . , n}, representing the order in which these n genes
appear. The following questions ask you to determine how many chromosome-level rearrangement events of a particular form
there are. (See, for example, Figure 3.33b.)
9.111 A prefix reversal inverts the order of the first j genes, for some index j satisfying j > 1 and j ≤ n. For example, for the chromosome
⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨6, 9, 5, 2, 1, 4, 7, 3, 8⟩ or ⟨1, 2, 6, 9, 5, 4, 7, 3, 8⟩ from a prefix reversal. How many
different prefix reversals are there for a 1000-gene chromosome?
9.112 A reversal inverts the order of the genes between index i and index j, for some i and j > i. For example, for the chromosome
⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨6, 9, 5, 2, 1, 4, 7, 3, 8⟩ or ⟨5, 9, 6, 4, 1, 2, 7, 3, 8⟩ from a reversal. How many different
reversals are there for a 1000-gene chromosome?
9.113 A transposition takes the genes between indices i and j and places them between indices k and k + 1, for some i and j > i and
k∈ / {i, i + 1, . . . , j}. For example, for the chromosome ⟨5, 9, 6, 2, 1, 4, 7, 3, 8⟩ we could get the result ⟨5, 1, 4, 7, 3, 9, 6, 2 8⟩ or
⟨ 1, 4, 5, 9, 6, 2, 7, 3, 8⟩ from a transposition. How many different transpositions are there for a 1000-gene chromosome?

Imagine a round-robin chess tournament for 150 players, each of whom plays 7 games. (In other words, each player is guaranteed
to participate in precisely 7 games with 7 different opponents. Remember that each game has two players.)
9.114 There are 20 possible first moves for White in a chess game, and 20 possible first moves for Black in response. (See Example 9.15.)
Prove that there must be two different games in the tournament that began with the same first two moves (one by White and one
by Black).
9.115 Suppose that would-be draws in this tournament are resolved by a coin flip, so that every game has a winner and a loser. Prove
that there must be two participants in such a tournament who have precisely the same sequence of wins and losses (for example,
WWWLLLW).
9.116 A win–loss record reports a number of wins and a number of losses (for example, 6 wins and 1 loss, or 3 wins and 4 losses), without
reference to the order of these results. Continuing to suppose that there are no draws in this tournament, identify as large a value
of k as you can for which the following claim is true, and prove that it’s true for your value of k: there is some win–loss record that
is shared by at least k competitors.
9.117 Now suppose that draws are allowed, so that competitors have a win–loss–draw record (for example, 2 wins, 1 loss, and 4 draws).
Identify the largest k for which there is some win–loss–draw record that is shared by at least k competitors, and prove that this
claim holds for the k you’ve identified.

ISTUDY
480 Counting

this cell will die from overcrowding time t time t + 1


(it has 4 active neighbors)
this cell will become active
(it has 3 active neighbors)
this cell will die from loneliness ···
(it has 1 active neighbor)

(a) The Game of Life update rule. (b) A sequence of configurations following this rule.

Figure 9.26 Each cell in the Game of Life has the same update rule: active cells with ≤ 1 or ≥ 4 active neighbors die,
and dead cells with exactly three living neighbors become alive.

A cellular automaton is a formalism that’s sometimes used to model complex systems—like the spatial distribution of populations,
for example. Here is the model, in its simplest form. We start from an n-by-n toroidal lattice of cells: a two-dimensional grid, that
“wraps around” so that that there’s no edge. (Think of a donut.) Each cell is connected to its eight immediate neighbors.
Cellular automata are a model of evolution over time: our model will proceed in a sequence of time steps. At every time step,
each cell u is in one of two states: active or inactive. A cell’s state may change from time t to time t + 1. More precisely, each cell
u has an update rule that describes u’s state at time t + 1 given the state of u and each of u’s neighbors at time t.
For example, see Figure 9.26 for the update rule for The Game of Life, which is the most famous model using cellular automata.
(The Game of Life was invented in 1970 by John Conway (1937–2020), a British mathematician.)
9.118 An update rule is a function that takes the state of a cell and the state of its eight neighbors as input, and produces the new state of
the cell as output. How many different update rules are there?
9.119 Let’s call an update rule a strictly cardinal update rule if—as in the Game of Life—the state of a cell u at time t + 1 depends only
the following: (i) the state of cell u at time t, and (ii) the number of active neighbors of cell u at time t. How many different strictly
cardinal update rules are there?

Suppose that we have an 10-by-10 lattice of 100 cells, and we have an update rule fu for every cell u. (These update rules might be
the same or differ from cell to cell.) Suppose the system begins in an initial configuration M0 . Suppose we start the system at time
t = 0 in configuration M0 , and derive the configuration Mt at time t ≥ 1 by computing

Mt (u) = fu (the states of u’s neighbors in Mt−1 ).

Let’s consider the possible outcomes of the sequence M0 , M1 , M2 , . . .. Say that this sequence exhibits eventual convergence if the
following holds: there exists a time t ≥ 0 such that, for all times t′ ≥ t, we have Mt = Mt . (So the Life example in Figure 9.26

exhibits eventual convergence.) Otherwise, we’ll say that this sequence oscillates.
9.120 Given M0 and the fu functions, we’d like to know what the long-run behavior of this system is: does it eventually converge or does
it oscillate? Prove that, for a sufficiently large value of K, we have eventual convergence if and only if the following algorithm
returns True. Also compute the smallest value of K for which this algorithm is guaranteed to be correct.
• Start with M := M0 and t := 0.
• Repeat the following K times: update M to the next time step (that is, for each u compute the updated M′ (u) by evaluating fu on
u’s neighbor cells in M).
• If M would be unchanged by one additional round of updates, return True. Else return False.

9.121 Suppose that we place 1234 items into 17 buckets. (For example, consider hashing 1234 items into a 17-cell hash table.) Call the
number of items in a bucket its occupancy, and the maximum occupancy the number of items in the most-occupied bucket. What’s
the smallest possible maximum occupancy?
9.122 Consider a function f : A → B. Fill in the blank with a statement relating |A| and |B|, and then prove the resulting claim: if ,
then, for some b ∈ B, we have | {a ∈ A : f(a) = b} | ≥ 202.
9.123 Suppose that we quantize a set of values from S = {1, 2, . . . , n} into {k1 , k2 , . . . , k5 } ⊂ S. (See Example 2.51.) Namely, we choose
these 5 values and then define a function q : S → {k1 , k2 , . . . , k5 }. The maximum error of this quantization is maxx∈S |x − q(x)|.
Use the Pigeonhole Principle (or the “the maximum must exceed the average” generalization) to determine the smallest possible
maximum error.

ISTUDY
9.4 Combinations and Permutations 481

9.4 Combinations and Permutations


So much to win, so much to lose,
No marvel if I fear to choose.
Letitia Elizabeth Landon (1802–1838)
The Golden Violet with its tales of Romance and Chivalry, and other poems (1827)

So far in this chapter, we’ve been working to develop a toolbox of general techniques for counting
problems: the Sum Rule and Inclusion–Exclusion, the (Generalized) Product Rule, the Mapping Rule,
and the Division Rule. This section will be different; instead of a new technique, here we will devote our
attention to a particularly common kind of counting problem: the number of ways to choose a subset from
a given set of candidate elements. Let’s start with an illustrative example:

Example 9.39: Printing t-shirts.


Suppose you run a t-shirt shop. There is a collection of jobs that you’re asked to run, but there’s limited
time so you must choose which ones to actually print. There are 17 requested jobs {a, b, . . . , q}, but there
is only time to print 4 different jobs. How many ways are there to select 4 of these 17 candidate jobs?
Solution. There are two answers, depending on how we interpret the problem: does the order of the
printed jobs matter, or does it only matter whether a job was printed? (Are we choosing an ordered
4-tuple? Or an unordered subset of size 4?)
Order matters. Then the Generalized Product Rule immediately gives us the answer: there are 17 choices
for the first job, 16 for the second job, 15 for the third, and 14 for the fourth; thus there are 17 · 16 · 15 · 14
total choices. Another way to write 17 · 16 · 15 · 14 is 17! 13! : every multiplicand between 1 and 13 appears
in both the numerator and denominator, leaving only {17, 16, 15, 14} uncancelled. We can justify the 17! 13!
version of the answer using the Division Rule: we choose one of the 17! orderings of all 17 jobs, and
then print the first 4 jobs in this order—but we’ve counted each 4-job ordering 13! times (once for each
ordering of the 13 unprinted jobs), so we must divide by 13!.
Order doesn’t matter. As before, there are 17!
13! ways to choose an ordered sequence of 4 jobs. Because
order doesn’t matter, we have counted each set of four chosen jobs 4! times, once for each ordering of
17!
them. By the Division Rule, then, there are 13!·4! ways of selecting 4 unordered jobs from a set of 17.

Two different fundamental notions of choice are illustrated by Example 9.39: permutations, in which the
order of the chosen elements matters, and combinations, in which the order doesn’t matter. These two
notions will be our focus in this section. Here’s another example to further illustrate combinations:

Example 9.40: Arranging letters of a bitstring.


How many different ways can you arrange the symbols in the “word” 000111? What about the “word”
00...011...1 containing k zeros and n − k ones?
Solution. This problem is just another application of the techniques we used for PERL and PEER and
SMALLTALK in Example 9.32. (We can think of the word 000111 just like a word like DEEDED: two
different letters, appearing three times each.) There are 6 total characters in the word, each appearing 3
6!
times, so the total number of arrangements is 3!·3! . (See Theorem 9.12.)
For the general version of the problem—the word 00...011...1, with k zeros and n − k ones—we
have a total of n characters, so there are n! ways of writing them down. But k! orderings of the zeros, and

ISTUDY
482 Counting

(n − k)! orderings of the ones, are identical. Hence, by the Division Rule, the total number of orderings
n!
is k!·(n−k)! .

Combinations
The quantity that we computed in Example 9.40 is called the number of combinations of k elements chosen
from a set of n candidates:

Definition 9.15: Combinations.



Consider nonnegative integers n and k with k ≤ n. The quantity n
k is defined as
 
n n!
= .
k k! · (n − k)!
 
The expression nk is read as “n choose k.” (Sometimes nk is also called a binomial coefficient, for reasons
that we’ll see in Section 9.4.3. It’s also sometimes denoted C(n, k)—with a “C” as in “Combination”.)

As we just argued in Example 9.40, the quantity nk denotes the number of ways to choose a k-element

subset of a set of n elements. For convenience, define nk = 0 whenever n < 0 or k < 0 or k > n: there
are zero ways to choose a k-element subset of a set of n elements under these circumstances.

Taking it further: When there are annoying complications (or divide-by-zero errors or the like) in the boundary cases of a

definition, it’s often easiest to tweak the definition to make those cases less special. (Here, for example, instead of having 78 be
7
undefined, we treat 8 as 0.) A similar idea in programming can make life much simpler when you encounter data structures
with complicated edge conditions—for example, a node in a linked list that might not have a successor. A sentinel is a “fake”
element that you might add to the boundary of a data structure that makes the edge elements of the data structure less special.
For example, in image processing, we might augment an n-by-m image with an extra 0th and (m + 1)st column, and an extra
0th and (n + 1)st row, of blank pixels. Once these “border pixels” are added, every pixel in the image has a neighbor in each
cardinal direction. Thus there’s no special code required for edge pixels in code to, for example, apply a blur filter to the image.

Here are a few small examples of counting problems that use combinations:

Example 9.41: 8-bit strings with two ones.


How many different 8-bit strings have exactly two ones?
We solved this precise problem in Example 9.3 using the Sum Rule, but combinations give us an
easier way to answer this question. We must choose 2 out of 8 indices to make equal to one. There are
8
 8! 8! 8·7 8

2 = 2!·(8−2)! = 2!·6! = 2 = 28 such choices of indices, and thus 2 different 8-bit bitstrings with
exactly two ones. These 28 strings are shown in Figure 9.27.

Example 9.42: 32-bit strings with fewer than three ones.


How many different 32-bit strings have fewer than three ones?
We will use the Sum Rule, plus the formula for combinations. (We can partition the set of 32-bit
  
strings that have fewer than 3 ones into those with 0, 1, or 2 ones.) Thus there are 32 32 32
0 + 1 + 2 =
1 + 32 + 32·31
2 = 1 + 32 + 496 = 529 total such strings.

(Recall that 0! = 1, so 32
0 = 32! 32! 32! 32!
0!·(32−0)! = 0!·32! = 1·32! = 32! = 1.)

Finally, here’s an example of counting using combinations that relates counting to probability. (There’s
much more about probability in Chapter 10.) If we flip an unbiased coin (in other words, a coin that comes

ISTUDY
9.4 Combinations and Permutations 483

1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
Figure 9.27 All 8-bit bitstrings with exactly two ones.

up heads with probability 12 and tails with probability 21 each time we flip it), then every sequence of coin
flips is equally likely. The probability that an “event” E happens when we flip an unbiased coin is the
fraction of possible flip sequences for which E actually occurs.

Example 9.43: Exactly 50% heads.


Suppose we flip an unbiased coin 10 times. What is the probability that precisely 5 flips come up heads?

There are 210 = 1024 total sequences, of which 10 5
10!
= 5!·5! = 252 have precisely 5 heads. Thus
there’s a 1024 ≈ 0.2461 chance of exactly half of the flips being heads.
252

9.4.1 Four Different Ways to Select k out of n Options


In Example 9.39, we saw two different ways in which we can imagine choosing a subset of k distinct
elements from a set S of n candidates, depending on whether the order in which we choose those k elements
matters. There is another dichotomy that can arise in counting problems: we can imagine circumstances in
which we choose k elements from a set S, but where repetition is allowed (that is, we can choose the same
element more than once). In other scenarios, repetition might not make sense. Here are some examples of
all four situations (see also Figure 9.28):

Example 9.44: Ways of choosing.


You order a two-scoop ice cream cone from a list of flavors. Order matters: a chocolate scoop on top of
a mint scoop = ̸ mint on top of chocolate. Repetition is allowed: you can choose vanilla for both scoops.
Your soccer game is tied, and you must choose 5 of your 11 players to take penalty kicks to break the
tie. Order matters: the kicks are taken in sequence, so Rapinoe then Wambach = ̸ Wambach then Rapinoe.
Repetition is forbidden: each player is allowed to take only one kick.
You order a three-salad salad sampler from a list of salads. Order doesn’t matter: salads are served on
a round plate, so it doesn’t matter which one is “first.” Repetition is allowed: you can choose the Caesar
as two or all three of your salads.
You select a starting lineup of 5 basketball players from your 13-person team. Order doesn’t matter:
all 5 chosen players are equivalent in starting the game. Repetition is forbidden: you must choose five
different players.

Here we will consider all four types of counting problems—ordered or unordered choice, with or without
repetition—and do a few examples. See Figure 9.28 for a summary of the number of ways to make these
different types of choices.

When order matters and repetition is forbidden


Suppose that we choose a sequence of k distinct elements from a set S: that is, the order of the selected
elements matters and repetition is not allowed. (For example, in a player draft for a sports league, no player

ISTUDY
484 Counting

order matters order matters order irrelevant order irrelevant


repetition allowed repetition not allowed repetition allowed repetition not allowed
n! n + k − 1 n
nk
(n − k)! k k
(9 ways for k = 2, n = 3) (6 ways for k = 2, n = 3) (6 ways for k = 2, n = 3) (3 ways for k = 2, n = 3)

A, then A A and A
A, then B A, then B A and B A and B
B, then A B, then A
A, then C A, then C A and C A and C
C, then A C, then A
B, then B B and B
B, then C B, then C B and C B and C
C, then B C, then B
C, then C C and C

Figure 9.28 Four ways of choosing k of n items—depending on whether we can pick the same element more than once,
and whether the order of choices matters—and the ways to choose 2 elements from the candidates A, B, and C in each.

can be chosen more than once—“repetition is forbidden”—and the outcome of the draft depends not just
on whether Babe Ruth was chosen, but also whether it was the Eagles or the Wildcats that selected him.)
In other words, we make k successive selections from S, but no candidate can be chosen more than once.
Such a sequence is sometimes called a k-permutation of S—an ordered sequence of k distinct elements of
S. (Recall from Definition 9.7 that a permutation of a set S is an ordering of S’s elements. Some people
denote the number of ways of choosing an ordered sequence of k distinct selections from a set of n options
by P(n, k), because “permutation” starts with “P.”)
n!
There are (n−k)! different k-permutations of an n-element set S, by the Generalized Product Rule.
(Specifically, there are
choices of choices of choices of
first element second element kth element

(n) · (n − 1) · · · · · (n − k + 1)

total choices, and n!


(n−k)! = n · (n − 1) · (n − 2) · · · · · (n − k + 1).)

Example 9.45: 4 of 10.


Suppose that you are asked to place four of the cards {A♡, 2♡, · · · , 10♡} on the table, arranged from
left to right in an order of your choosing. There are 10 · 9 · 8 · 7 = (10−4)!
10!
such arrangements: order
matters (A234♡ ̸= 432A♡) and repetition is not allowed (4444♡ isn’t a valid arrangement, because you
only have one 4♡ card).

When order matters and repetition is allowed


Suppose that we simply choose a sequence of k (not necessarily distinct) elements: that is, order matters
and repetition is allowed. That is, we make k successive selections from S, and we’re allowed to make the
same choice multiple times. (For example, you and k − 1 friends go to a Chinese restaurant with n items
on the menu, and each of you orders something. You’re allowed to order the same dish as your friends—
“repetition is allowed”—but you getting the Tofu with Black Bean Sauce and your vegan friend getting
Twice-Cooked Pork is definitely different from the other way around.)

ISTUDY
9.4 Combinations and Permutations 485

Then there are nk different ways to make this choice, by the Product Rule: at every stage, there are n
possible choices, and there are k stages.

Example 9.46: 4 of 10, a second way.


Suppose that you are asked to create a 4-digit integer. There are 104 such integers: order matters (1234 ̸=
4321) and repetition is allowed (4444 is a valid 4-digit number).

When order doesn’t matter and repetition is forbidden


Suppose that we choose an unordered set of k distinct elements: that is, order does not matter and repetition
is not allowed. (For example, you and n − 1 friends enter a raffle in which k identical new cell phones will
be given away. Each of you puts your name on one of n cards that are placed in a hat, and k cards are drawn
to choose the winners. Cards for winners are not put back into the hat after they’re drawn, so nobody can
win twice—“repetition is forbidden”—but Alice and Bob winning is the same as Bob and Alice winning.)

When we choose an unordered set of k distinct elements from a set of n options, there are nk dif-
ferent ways to make this choice, by the definition of combination. Such a subset is sometimes called a
k-combination of S—an unordered set of k distinct elements of S. (Recall from Definition 9.15 that a
combination of elements from a set S is precisely an unordered subset of elements from S.)

Example 9.47: 4 of 10, another way.


Suppose that you’re asked to create a 10-bit number with exactly four ones. You do so by starting with

0000000000 and choosing 4 indices to change from 0 to 1. There are 10 4 such bitstrings: the order in
which you choose a bit to make a 1 doesn’t matter (changing bit #2 and then bit #7 to 1 yields the same
bitstring as changing bit #7 and then bit #2 to 1) and repetition is not allowed (you have to change 4
different bits to 1).

When order doesn’t matter and repetition is allowed


While these three types of selecting k out of n elements are the most frequent, the fourth possibility can
sometimes arise, too: order doesn’t matter but repetition is allowed. Let’s build some intuition for this case
with a longer example:

Problem-solving tip: When you encounter a problem that seems completely novel, run through the techniques you know about
and try them on for size, even if they’re not an obvious fit. The type of counting in Example 9.48 doesn’t seem like it has a lot to
do with combinations, but by changing the way you view this problem it can be transformed into a problem you’ve seen before.

Example 9.48: Taking notes on six sheets of paper in three classes.


You discover that your school notebook has only k = 6 sheets of paper left in it. You are attending n = 3
different classes today: Archaeology (A), Buddhism (B), and Computer Science (C). How many ways
are there to allocate your six sheets of paper across your three classes? (No paper splitting or hoarding:
each sheet must be allocated to one and only one class!)
(Here’s another way to phrase the question: you must choose how many pages to assign to A, how
many to B, and how many to C. That is, you must choose three nonnegative integers a, b, and c with
a + b + c = 6. How many ways can you do it?)

ISTUDY
486 Counting

(a) (b) (c) A |B |C (d)


AAAAAA 222222| | 00000011
AAAAA B 22222 |2 | 00000101
AAAAA C 22222 | |2 00000110
AAAA BB 2222 |22 | 00001001
AAAA B C 2222 |2 |2 00001010
AAAA CC 2222 | |22 00001100
AAA BBB 222 |222 | 00010001
AAA BB C 222 |22 |2 00010010
AAA B CC 222 |2 |22 00010100
AAA CCC 222 | |222 00011000
AA BBBB 22 |2222 | 00100001
AA BBB C 22 |222 |2 00100010
AA BB CC 22 |22 |22 00100100
AA B CCC 22 |2 |222 00101000
AA CCCC 22 | |2222 00110000
A BBBBB 2 |22222 | 01000001
A BBBB C 2 |2222 |2 01000010
A BBB CC 2 |222 |22 01000100
A BB CCC 2 |22 |222 01001000
A B CCCC 2 |2 |2222 01010000
A CCCCC 2 | |22222 01100000
BBBBBB |222222| 10000001
BBBBB C |22222 |2 10000010
BBBB CC |2222 |22 10000100
BBB CCC |222 |222 10001000
BB CCCC |22 |2222 10010000
B CCCCC |2 |22222 10100000
CCCCCC | |222222 11000000

Figure 9.29 The ways to allocate 6 sheets of paper across three classes {A, B, C}. (a) Any ordering of 6 pieces of paper
and 2 dividers tabs defines three sections (before, between, and after the dividers); (b) representing pages allocated to
each class by the class name; (c) representing pages allocated to each class by 2, with | marking divisions between
classes; and (d) representing pages allocated to each class by 0, with 1 marking divisions between classes.

Solution. The 28 ways of allocating your paper are shown in Figure 9.29, sorted by the number of
pages allocated to Archaeology (and breaking ties by the number of pages allocated to Buddhism). The
allocations are shown in three ways:

• Pages are represented by the class name.


• Pages are represented by 2, with | marking divisions between classes: we allocate the number of
pages before the first divider to A, the number between the dividers to B, and the number after the
second divider to C.
• Pages are represented by 0, with 1 marking divisions between classes: as in the 2-and-| representa-
tion, we allocate pages before the first 1 to A, those after the first 1 and before the second 1 to B, and
those after the second 1 to C.

All three versions of this table accurately represent the full set of 28 allocations, but let’s concentrate on
the representation in the second and third columns—particularly the third. The 0-and-1 representation in

ISTUDY
9.4 Combinations and Permutations 487

8

the third column contains exactly the same strings as Figure 9.27, which listed all 28 = 2 of the 8-bit
strings that contain exactly two ones.

In a moment, we’ll state a theorem that generalizes this example into a formula for the number of ways to
select k out of n elements when order doesn’t matter but repetition is allowed. But, first, here’s a slightly
different way of thinking about the result in Example 9.48 that may be more intuitive.
Suppose that we’re trying to allocate a total of k pages among n classes. Imagine placing the k pages
into a three-ring binder along with n − 1 “divider tabs” (the kind that separate sections of a binder), as in
Figure 9.29a. There are now n + k − 1 things in your binder. (In Example 9.48, there were 6 pages and 2
dividers, so 8 total things are in the binder.) The ways of allocating the pages precisely correspond to the
ways of ordering the things in the binder—that is, choosing which of the n + k − 1 things in the binder

should be blank sheets of paper, and which should be dividers. So there are n+k−1 ways of doing so. In
 k
Example 9.48, we had n = 3 and k = 6, so there were 86 = 28 ways of doing this allocation.
While the description in Example 9.48 wasn’t stated in precisely these terms, our paper-allocation task
was really a task about choosing with repetition: six times (once for each piece of paper), we select one
of the elements of the set {A, B, C} of classes. We may select the same class as many times as we wish
(“repetition is allowed”), and the pieces of paper are indistinguishable (“order doesn’t matter”). Here is the
general statement of the number of ways to select k out of n elements for this scenario:

Theorem 9.16: Choosing with repetition when order doesn’t matter.


The number of ways to select k out of n elements when order doesn’t matter but repetition is allowed is
n+k−1

k .

Proof. We’ll give a proof based on the Mapping Rule. We can represent a particular choice of k elements
Pn
from the set of n candidates as a sequence x ∈ (Z≥0 )n such that i=1 xi = k. (Specifically, xi tells us how
many times we chose element i.) Define
 Pn
X = x ∈ (Z≥0 )n : i=1 xi = k
n+k−1
and S = {x ∈ {0, 1} : x contains exactly n − 1 ones and k zeros} .
We claim that there is a bijection between X and S. Specifically, define f : X → S as
f(x1 , x2 , . . . , xn ) = 0 0 · · · 0 1 0 0 · · · 0 1 · · · 1 0 0 · · · 0
x1 times x2 times xn times

(This representation is precisely the one in Example 9.48.) Then f is a bijection: every element of S

corresponds to one and only one element of X. By Example 9.40, then, the cardinality of S is n+k−1
k .
Here’s another example of this type of choice:

Example 9.49: 4 of 10, one last way.


Suppose that you have decided to buy 4 total drinks for a group of 10 of your friends. (You may buy
multiple drinks for the same friend.) You can think of lining your friends up and performing a total of 13
successive actions, each of which is either (a) buying a drink for the friend immediately in front of you,
or (b) shouting “next!”. Of your 13 actions, 4 must be drink purchases. (The other 9 must be shouts of

“next!”) There are 134 ways to choose these actions.

ISTUDY
488 Counting

Choosing k of n elements, summarized


We’ve now discussed four notions of choosing k elements from a set of n candidates, depending on whether
we could choose the same option more than once and whether the order of our choices mattered:

• Order matters and repetition is allowed: nk ways.


• n!
Order matters and repetition is forbidden: (n−k)! ways.

• Order doesn’t matter and repetition is allowed: n+k−1
k ways.
• Order doesn’t matter and repetition is forbidden: nk ways.

(Or see Figure 9.28 for a summary.) We’ve also considered the same example—choosing 4 of 10 options—
in each setting, and the number of ways to do so was different in each of the four different scenarios:

• Order matters and repetition is allowed: 10,000 = 104 ways.


• Order matters and repetition is forbidden: 5040 = 10 · 9 · 8 · 7 ways.

• Order doesn’t matter and repetition is allowed: 715 = 134 ways.

• Order doesn’t matter and repetition is forbidden: 210 = 10 4 ways.

Taking it further: In CS, we frequently encounter tasks where we must identify the best solution from a set of possibilities. For
example, we might want to find the longest increasing subsequence (LIS) of a sequence of n integers. A brute-force algorithm is
one that solves the problem by literally trying every possible solution and selecting the best. (For LIS, there are 2n subsequences,
so this algorithm is very slow.) But if there’s a certain kind of structure and enough repetition in the subproblems that arise
in a naïve recursive solution, a more advanced algorithmic design technique called dynamic programming can yield a much
faster algorithm. And counting the number of subproblems—and the number of distinct subproblems!—is what establishes when
algorithms using brute force or dynamic programming are good enough. See p. 496.

n

9.4.2 Some Properties of k
, and Combinatorial Proofs
Of the ways of choosing k elements from n candidates explored in Section 9.4.1, perhaps the most common
is the setting when order doesn’t matter and repetition is forbidden. In this section, we’ll explore some of

the remarkable mathematical properties of the numbers—the values of nk —that arise in this scenario.
The properties that we’ll prove here (and those that you’ll establish in the exercises) will be equalities
of the form x = y for two expressions x and y. We’ll generally be able to give two very different styles

of proof that x = y. One type of proof uses algebra, typically using the definition of nk and algebraic
manipulations to show that x and y are equal. The other type of proof will be a more story-based approach,
called a combinatorial proof, where we argue that x = y by explaining how x and y are really just two ways
of looking at the same set:

Definition 9.17: Combinatorial proof.


A combinatorial proof establishes that two quantities x and y are equal by defining a set S and proving
that |S| = x and |S| = y by counting |S| in two different ways.

The algebraic approach is perhaps apparently more straightforward, but combinatorial proofs can be more
fun. Here’s a first example:

Theorem 9.18: A symmetry in choosing.


 
For any positive integer n and any integer k ∈ {0, 1, . . . , n}, we have n
k = n
n−k .

ISTUDY
9.4 Combinations and Permutations 489

n
 n

Proof #1 of k = n−k , via algebra. We simply follow our noses through the definition:

n
 n!
k = k!·(n−k)! definition of combinations
n!
= (n−k)!·k! commutativity of multiplication
n!
= (n−k)!·(n−(n−k))! antisimplification: k = n − (n − k)
n

= n−k . definition of combinations

Here is a second proof of Theorem 9.18—this time a combinatorial proof. The basic idea is that we will
 
construct a set S such that we can prove that |S| = nk and we can prove that |S| = n−k
n
. (Thus we can
 
conclude nk = n−k n
.)
 
Proof #2 of nk = n−kn
, via a combinatorial proof: Suppose that n students submit implementations of
Bubble Sort in a computer science class. The instructor has k gold stars, which will be affixed to each of k
different implementations. Let S be the set of ways to affix gold stars. Here are two ways to compute |S|:

• First, we claim that |S| = nk . Specifically, the instructor will choose k of the n submissions and affix

gold stars to the k chosen elements. There are nk ways of doing so.

• Second, we claim that |S| = n−k n
. Specifically, the instructor will choose n − k of the n submissions
that will not be adorned with gold stars. The remaining unchosen submissions will be adorned. There
n

are n−k ways of choosing the unadorned submissions.
 
But |S| is the same regardless of how we count it! So n
k = |S| = n
n−k and the theorem follows.
(Another way to think about the combinatorial proof: an n-bit string with k ones is an n-bit string with
n − k zeros; the number of choices for where the ones go is identical to the number of choices for where
the zeros go.)
A combinatorial proof requires creativity—what set S should we consider?—but the argument that the
proof is correct is generally comparatively straightforward. Thus the challenge in proving an identity with a
combinatorial proof is a challenge of narrative: we must find a story in which the two sides of the equation
both capture the set described by that story.

Problem-solving tip: The hard part in a combinatorial proof is coming up with a story that explains both sides of the equation.
Understanding what the more complicated side of the equation means is often a good place to start.

Pascal’s identity
Here’s another example claim with both algebraic and combinatorial proofs:

Theorem 9.19: Pascal’s identity.


  
  
n−1 n−1 n
For any integer n ≥ 1 and any k ∈ {0, 1, . . . , n}, we have + = .
k k−1 k

(Pascal’s identity is named after Blaise Pascal (1623–1662), a French mathematician. The programming
language Pascal was also named in his honor.)
Proof #1 of Pascal’s identity (algebra). Observe that if k = 0 or k = n, the identity follows immediately:
     
by definition, we have n0 = 1 = 1 + 0 = n−1 0 + n−1 n
−1 and similarly n = 1 = 0 + 1 =
n−1
n + n−1
n−1 .

ISTUDY
490 Counting

For the non-boundary cases, we’ll manipulate the left-hand side until it’s equal to the right-hand side:
n−1
 
k + n−1k−1
(n−1)! (n−1)!
= k!·(n−1−k)! + (k−1)!·(n−k)! definition of combinations
(n−1)! (n−1)!
= k!·(n−1−k)! · n−k
n−k + (k−1)!·(n−k)! · k
k multiplying by 1 = x
x
(n−1)!·(n−k) (n−1)!·k
= k!·(n−k)! + k!·(n−k)! (k − 1)! · k = k! and (n − 1 − k)! · (n − k) = (n − k)!
(n−1)!·[(n−k)+k]
= k!·(n−k)! factoring
n!
= k!·(n−k)! n − k + k = n, and (n − 1)! · n = n!
n

= k . definition of combinations

Proof #2 of Pascal’s identity (combinatorial proof). For the case of k = 0 or k = n, the argument is the
same as in Proof #1. Otherwise, consider a set of n ≥ 1 employees, one of whom is named Babbage. How
many ways can we select a subset of k different employees? Here are two different ways of counting the
number of these subsets:

• We choose k of the n employees. There are nk ways to do so.
• We decide whether to include Babbage, and then fill in the rest of the team:
– If we pick Babbage, we need to pick k − 1 further employees from the n − 1 other (non-Babbage)

employees; thus there are n−1
k−1 ways to select a team that includes Babbage. 
– If we don’t pick Babbage, we pick all k employees from the n − 1 others; thus there are n−1
k ways
to select a team that does not include Babbage.
 
By the Sum Rule, there are therefore n−1
k−1 +
n−1
k ways to choose a team.

Because we’ve counted the cardinality of one set in two different ways, the two sizes must be equal.
  
Therefore nk = n−1
k−1 +
n−1
k and the theorem follows.

Taking it further: World War II was perhaps the first major historical moment in which computer science—and, by the end of
the war, the computer—was central to the story. The German military used a complex cryptographic device called the Enigma
machine for encryption of military communication during the war. The Enigma machine, which was partially mechanical and
partially electrical, had a large (though not unfathomably large) set of possible physical configurations, each corresponding to
a different cryptographic “key.” Among the first applications of an electronic computer—and the reason that one of the first
computers was designed and built in the first place—was in breaking these codes, in part by exhaustively exploring the set of
possible keys. As such, understanding the number of different keys in the system (a counting problem!) was crucial to the Allies’
success in breaking the Enigma code. For more, see p. 497.

9.4.3 The Binomial Theorem



The quantity nk is sometimes called a binomial coefficient, for reasons that we’ll see in this section. (A
binomial—Latin bi “two” + nom “name”—is a special kind of polynomial—poly “many” + nom “name”—
that has precisely two terms.)
First, a reminder: the product of two binomials (x + y) and (a + b) is xa + xb + ya + yb. (You may have
once learned the “FOIL” mnemonic for the terms of the product: first = xa; outer = xb; inner = ya; and
last = yb.) Thus when we square x + y—that is, multiply it by itself—we get

(x + y) · (x + y) = xx + xy + yx + yy = 1 · x2 + 2 · xy + 1 · y2 .

ISTUDY
9.4 Combinations and Permutations 491

  
Observe that the three coefficients of these terms, in order, are ⟨1, 2, 1⟩ = ⟨ 20 , 21 , 22 ⟩. The Binomial
Theorem is a general statement of this pattern: when we multiply out the expression (x+y)n , the coefficient

of the xk yn−k term is nk :

Theorem 9.20: The Binomial Theorem.


X
n

For any a ∈ R, any b ∈ R, and any n ∈ Z≥0 , we have (a + b)n = n
i ai bn−i .
i=0

Before we prove the theorem, let’s start with some intuition about why these coefficients arise. For example,
let’s compute (x + y)4 = (x + y) · (x + y) · (x + y) · (x + y), without doing any simplification:

(x + y) · (x + y) · (x + y) · (x + y)
= (xx + xy + yx + yy) · (x + y) · (x + y)
= (xxx + xyx + yxx + yyx + xxy + xyy + yxy + yyy) · (x + y)
= xxxx + xyxx + yxxx + yyxx + xxyx + xyyx + yxyx + yyyx
+ xxxy + xyxy + yxxy + yyxy + xxyy + xyyy + yxyy + yyyy.

Every term of the resulting expression consists of 4 multiplicands, one from each of the 4 copies of (x + y).
How many of these 16 terms contain, say, 2 copies of x and 2 copies of y? There are 6—yyxx, xyyx, yxyx,
4
xyxy, yxxy, and xxyy—which is just the number of elements of {x, y} that contain precisely two copies of
x. While the symbols are different, it’s easy to see that this quantity is precisely the number of elements of
4 
{0, 1} that contain precisely two ones—which is just 42 .
We will prove the Binomial Theorem in generality in a moment, but to build a little bit of intuition for
the proof, let’s look at a special case first:

Problem-solving tip: When you’re asked to solve a problem for a general value of n, one good way to get started is to try to solve
it for a specific small value of n—and then try to generalize your solution to an arbitrary n. It’s often easier to generalize from a
particular n to a general n than to give a fully generally answer “from scratch.”

Example 9.50: The coefficients of (x + y)3 .


We’re going to show that (x+y)3 = x3 +3x2 y+3xy2 +y3 in the same style that we’ll use in the full proof of
the Binomial Theorem. We’ll start with the observation, made previously, that (x + y)2 = x2 + 2xy + y2 =

2 2 2
 
2 2 2
0 x + 1 xy + 2 y . A key step will make use of Theorem 9.19 to move from the coefficients of (x + y)
3
to the coefficients of (x + y) .
(x + y)3 = (x + y) · (x + y)2
h    i
= (x + y) · 20 x2 + 2
1 xy + 22 y2
   2   
= 20 x3 + 21 x2 y + 2
2 xy + 2 2
0 x y+
2
1
2 3
xy2 +
2 y
(( ) (2) ( 2) 2 ) (( ) ( ) ( ) )
2 2 2 2 2 2
=x· 0 x2 + 1 xy + 2 y =y· 0 x + 1 xy + 2 y

which, collecting like terms, simplifies to


 h  i h  i 
(x + y)3 = 20 x3 + 21 + 2
0 x2 y + 2
2
+ 2
1 xy2 + 2
2 y3 .

ISTUDY
492 Counting

     
By Theorem 9.19, we have that 21 + 20 = 31 and 22 + 21 = 32 , so
   
(x + y)3 = 20 x3 + 31 x2 y + 32 xy2 + 22 y3 .
     
Because nn = 1 and n0 = 1 for any n, we have that 20 = 30 and 22 = 3
3 , and thus
   
(x + y)3 = 30 x3 + 31 x2 y + 32 xy2 + 33 y3
= x3 + 3x2 y + 3xy2 + y3 .
The combination notation can sometimes obscure the structure of the proof; for further intuition, here is
what this proof looks like, without the notational overhead:
(x + y)3 = (x + y) · (x + y)2
= (x + y) · (x2 + 2xy + y2 )
= (x3 + 2x2 y + xy2 ) + (x2 y + 2xy2 + y3 )
= x3 + (2 + 1)x2 y + (1 + 2)xy2 + y3
= x3 + 3x2 y + 3xy2 + y3 .

Proof of the Binomial Theorem


We’re now ready to give a proof of the general form of the Binomial Theorem. Our proof will use mathe-
matical induction on the exponent, and the structure of the inductive case of the proof will precisely mimic
that of Example 9.50.
Proof of the Binomial Theorem. Let a and b be arbitrary real numbers. We wish to prove that, for any
integer n ≥ 0,
X
n

n
(a + b)n = i ai bn−i .
i=0

We proceed by induction on n.
The base case (n = 0) is straightforward: anything to the 0th power is 1, so in particular (a + b)0 = 1.
P0  
And i=0 0i ai b0−i = 00 · 1 · 1 = 1.
Pn−1  i n−1−i
For the inductive case (n ≥ 1), we assume the inductive hypothesis (a + b)n−1 = i=0 n−1 ab .
n
Pn n i n−i i
We must prove that (a + b) = i=0 i a b . Our proof echoes the structure of Example 9.50:
(a + b)n = (a + b) · (a + b)n−1 definition of exponentiation

X
n−1

= (a + b) · n−1
i ai bn−1−i inductive hypothesis
i=0
" n−1 # " n−1 #
X  X 
=a· n−1
i
i n−1−i
ab +b· n−1
i ab i n−1−i
distributing the multiplication
i=0 i=0
" n−1 # " n−1 #
X  X 
n−1 i+1 n−1−i n−1 i n−i
= i a b + i ab distributing the multiplication, again
i=0 i=0
  " n−1 #
X
n
 X 
= n−1
j−1 ab j n−j 
+ n−1
i
i n−i
ab . reindexing the first summation (j := i + 1)
j=1 i=0

ISTUDY
9.4 Combinations and Permutations 493

By separating out the i = 0 and j = n terms from the two summations, and then combining like terms, we
have
  " #
X
n−1
 X
n−1
 i n−i  n n−n  0 n−0
(a + b) = 
n n−1 j
j−1 a b
n−j + n−1
i ab + n−1
n−1 a b + n−1
0 a b
j=1 i=1
 
n−1 
X    
= n−1
j−1 + n−1
j aj bn−j  + n−1
n−1 an bn−n + n−1
0 a0 bn−0 .
j=1

    
Applying Theorem 9.19 to substitute nj for n−1
j−1 +
n−1
and using the fact that n−1
=1= n
and
n−1
  j n−1 n
0 = 1 = n0 , we have
 
X
n−1
  n n−n  (n) (n−1) (n−1)
(a + b)n =  n j n−j 
j ab + n−1
n−1 a b + n−1
0 a0 bn−0 j = j−1 + j
j=1
 
X
n−1
   (n−1) (n) (n−1) (n)
= n
j aj bn−j  + n
n an bn−n + n
0 a0 bn−0 n−1 =1= n and 0 =1= 0
j=1
 
X
n

= n
j aj bn−j  , incorporating the j = 0 and j = n terms back into the summation
j=0

which proves the theorem.

9.4.4 Pascal’s Triangle


Much of this section has been devoted to understanding the binomial coefficients, through the Binomial
Theorem and through combinatorial proofs of a number of their other properties. We’ll close our discussion
of binomial coefficients with a visual representation of these quantities, called Pascal’s triangle. (Like
Pascal’s identity, Pascal’s triangle is named after the French mathematician Blaise Pascal.)
Pascal’s triangle arranges the binomial coefficients in a classical and very useful way: the nth row of
  
Pascal’s triangle consists of all of the n + 1 binomial coefficients n0 , n1 , . . . , nn , in order. Figure 9.30
shows the first several rows of Pascal’s triangle. Many of the properties of the binomial coefficients that
we’ve established previously can be seen by looking at patterns visible in Pascal’s triangle—as can some
others that we’ll prove here, or that you’ll prove in the exercises.
For example, Figures 9.31a and 9.31b give visualizations of two properties that we’ve already proven.
 
Theorem 9.18 states that nk = n−k n
; this theorem is reflected by the fact that the numerical values of
Pascal’s triangle are symmetric around a vertical line drawn down through the middle of the triangle.
  
And Theorem 9.19 (“Pascal’s identity”), which states that n−1 k + n−1 n
k−1 = k , is illustrated by the fact
that each entry in Pascal’s triangle is the sum of the two elements immediately above it (up-and-left and
up-and-right).
There are many other notable properties of the binomial coefficients, many of which we can see more
easily by looking at Pascal’s triangle. Here’s one example; a number of other properties are left to you in
  
the exercises. Let’s look at the row sums of Pascal’s triangle—that is, computing n0 + n1 + · · · + nn for
different values of n. (See Figure 9.31c.) From calculating the row sum for a few small values of n, we see
that the nth row appears to have value equal to 2n . (Incidentally, the sum of the squares of the numbers

ISTUDY
494 Counting

(0)
0 1
(1) (1)
0 1 1 1
(2) (2) (2)
0 1 2 1 2 1
(3) (3) (3) (3)
0 1 2 3 1 3 3 1
(4) (4) (4) (4) (4)
0 1 2 3 4 1 4 6 4 1
(5) (5) (5) (5) (5) (5)
0 1 2 3 4 5 1 5 10 10 5 1
(6) (6) (6) (6) (6) (6) (6)
0 1 2 3 4 5 6 1 6 15 20 15 6 1
(7) (7) (7) (7) (7) (7) (7) (7)
0 1 2 3 4 5 6 7 1 7 21 35 35 21 7 1
(8) (8) (8) (8) (8) (8) (8) (8) (8)
0 1 2 3 4 5 6 7 8 1 8 28 56 70 56 28 8 1
.. ..
. .
Figure 9.30 The top of Pascal’s triangle, in both “choose” notation and in numerical form.

(a) Theorem 9.18: (b) (c) Theorem 9.21:


Theorem 9.19:
all values match 1 the bottom value 1 the rows add 1 1
their counterpart in each triplet up to powers
mirrored is the sum of two:
1 1 1 1 1 1 2
across the of the top 1, 2, 4,
middle line. two. 8, 16, 32,
1 2 1 1 2 1 64, . . . . 1 2 1 4

1 3 3 1 1 3 3 1 1 3 3 1 8

1 4 6 4 1 1 4 6 4 1 1 4 6 4 1 16

1 5 10 10 5 1 1 5 10 10 5 1 1 5 10 10 5 1 32

1 6 15 20 15 6 1 1 6 15 20 15 6 1 1 6 15 20 15 6 1 64

Figure 9.31 Theorems 9.18, 9.19, and 9.21 reflected in Pascal’s triangle.

in any particular row in Pascal’s triangle also has a special form, as you’ll see in Exercise 9.173.) Indeed,
the power-of-two pattern for the row sums of Pascal’s triangle that we observe in Figure 9.31c holds for
arbitrary n—and we’ll prove this theorem here, in several different ways.

Theorem 9.21: Sum of a row of Pascal’s triangle.


P n n n
i=0 i = 2 .

Proof #1 (algebraic/inductive) [sketch]. We can gain a bit of intuition for this claim from Theorem 9.19

(Pascal’s identity): each entry nk in the nth row is added into exactly two entries in the (n + 1)st row,
 
namely n+1 k and n+1
k+1 . Therefore the values in row #n of Pascal’s triangle each contribute twice to the
values in row #(n + 1), and therefore the (n + 1)st row’s sum is twice the sum of the nth row. This intuition
can be turned into an inductive proof, which you’ll give in Exercise 9.172.
Proof #2 (combinatorial). Let S = {1, 2, . . . , n} be a set with n elements. Let’s count the number of subsets
of S in two different ways. On the one hand, there are 2n such subsets: there is a bijection between subsets
of S and |S|-bit strings. (See Lemma 9.10.)

ISTUDY
9.4 Combinations and Permutations 495

On the other hand, let’s account for the subsets of S by first choosing a size k of the subset, and then
counting the number of subsets of that size. By the Sum Rule, the total number of subsets of S is exactly
X
n
(the number of subsets of S of size k).
k=0

n Pn n

By definition, there are exactly subsets of size k. Therefore the total number of subsets is .
Pn  k k=0 k
Thus 2n = k=0 nk .
Proof #3 (making clever use of the Binomial Theorem). We’ll start from the right-hand side of the theorem
statement, and begin with a completely unexpected, but obviously true, antisimplification:
2n = (1 + 1)n obviously 2 = 1 + 1; therefore 2n = (1 + 1)n
Xn

n i n−i
= i 11 Binomial Theorem
i=0
Xn

n
= i . 1k = 1 for any value of k
i=0

You’ll explore some of the many other interesting and useful properties of Pascal’s triangle, and of the
binomial coefficients in general, in the exercises.

ISTUDY
496 Counting

Computer Science Connections

Brute-Force Algorithms and Dynamic Programming


In an optimization problem, we’re given a Cheapest Vertical Seam (CVS): 9 8 7 1 9
7 3 2 9 1
set S of valid solutions and some measure of Input: An n-by-n grid of integers. 2 8 5 6 9
quality f : S → R, and asked to compute Output: A path from the top row to the bottom row, 4 7 5 3 4
the x ∈ S that’s the best according to f. One moving in direction {↙, ↓, ↘} at each step, such that 3 8 2 8 1
example is the traveling salesperson problem the sum of the integers along the path is minimized.
(TSP)—the problem solved daily by delivery
Figure 9.32 The cheapest vertical seam problem, and an example.
drivers, who have to visit a list of addresses
and return to the depot, while minimizing their overall driving time. Another example is shown in Figure 9.32: the
cheapest vertical seam (CVS) problem, which arises in a remarkable computer graphics application; see Figure 9.33.
For both TSP and CVS, there are straight-
forward brute-force algorithms that solve the To shrink an image without
cropping or rescaling,
Image adapted from U.S. Army Photo,
archives of the ARL Technical Library
problem by computing the list of all possi- construct a matrix indicating
ble solutions (all orderings of the cities; all how prominent each pixel is;
then find the CVS, and delete
top-to-bottom paths) and identifying the best the pixels in the found seam.
of these possible solutions. It’s by now rea- Here, an image of a very early
sonably straightforward to count: there are n! computer (the ENIAC), being
programmed by Jean Bartik
orderings and between 2 · n and 3 · n paths.
n n
and Frances Spence, with a
n n
(The gap between 2 and 3 arises because seam that could be removed.
it’s a little tricky to avoid counting paths that Figure 9.33 Shai Avidan and Ariel Shamir’s method of seam carving
fall off the sides of the grid.) These run-
to resize images by removing the least important vertical seam [12].
ning times are unimpressive—even n ≈ 100
would require decades of computing time—and this is, more or less, the best known algorithm for TSP! (See p. 104.)
But we can do better for CVS, with another view of the problem. Given a grid G, define best(i, j) as the cost of the
cheapest path from grid cell ⟨i, j⟩ to the bottom of the grid. Then we can solve the CVS problem using a recursive
algorithm that computes best(i, j) for every cell ⟨i, j⟩:
(1) if i = n (we’re reached the bottom row), the cost is just Gi,j ;
(2) if j ≤ 0 or j ≥ n (we’ve fallen off the sides of the grid), the cost is ∞; and
(3) otherwise, the cost is Gi,j plus the smallest of best(i + 1, j − 1) and best(i + 1, j) and best(i + 1, j + 1).
Unfortunately, this algorithm is as slow as brute force (in fact, it is the brute-force algorithm): to compute best(i, j), we
make three recursive calls, at least two of which remain inside the grid. The running time T(i) to find best(n−i, j) with
i rows beneath cell ⟨i, j⟩ is given by the recurrence T(1) = 1 and T(i) ≥ 2T(i − 1) + 1—which satisfies T(n) ≥ 2n .
But a key algorithmic observation is that
the number of different cells in the grid is CVS(G1...n,1...n ):
much smaller than 2n —there are only n2 1 for j := 1, . . . , n:
different cells! So, while the recursive algo- 2 T[n, j] := Gi,j
rithm is very slow, it “should” require only 3 for i := n − 1, . . . , 1: below left below below right
2 for j := 1 , . . . , n:
Θ(n ) time—as long as we avoid recom- 4
puting best(i, j) multiple times for the same 5 T[i, j] := Gi,j + min(T[i + 1, j − 1], T[i + 1, j], T[i + 1, j + 1])
value of ⟨i, j⟩! Once we’ve figured out, say, 6 return minj T[1, j] (Treat T[·, j] = ∞ if j is out of range.)
best(3, 7) (because we needed that value to Figure 9.34 A dynamic programming algorithm for CVS.
figure out best(4, 6)), we don’t bother recom-
puting best(3, 7) when we need it again; instead, we just remember the value and reuse it without doing any further
computation. The most straightforward way to implement this basic idea is called memoization: we build a data struc-
ture in which we check to see whether we’ve already stored the value of best(i, j) before computing the value via the
three recursive calls, and we always add all values we compute to the data structure before returning them. A slightly
more efficient way of implementing this idea is called dynamic programming, where we transform this recursive
solution into one using loops—and build up the values of best(i, j) from the bottom up. (See Figure 9.34.) In general,
dynamic programming is an algorithmic design technique that can save us a massive amount of computation—as
long as the number of different problems encountered in the recursive solution is small.

ISTUDY
9.4 Combinations and Permutations 497

Computer Science Connections

The Enigma Machine and a First Computer


The Enigma machine was a physical cryptographic
device used by the Germans during World War II

https://www.cia.gov/legacy/museum/
A
to communicate between German high command and

Image adapted from the CIA Museum


their military units in the field. The basic structure of
the machine involved rotors and cables. A rotor was a
26-slot physical wheel that encoded a permutation π;
when the wire corresponding to input i is active, the Q
output wire corresponding to πi is active. A plugboard
allowed an arbitrary matching of keys on the keyboard
to the inputs to the rotors—a cable was what actually
connected a key to the first rotor. (The machine did not Figure 9.35 An Enigma machine, and a schematic of its
require any cables in the plugboard; if there was no operation. The operator types a letter (say, A), which goes
cable, then the key pressed was what went into the rotor through the plugboard, and is then permuted by rotor #1,
in the first place.) The basic encryption in the Enigma
rotor #2, rotor #3, the fixed permutation of the machine,
machine proceeded as follows (see Figure 9.35):
rotor #3, rotor #2, and rotor #1. It then (after passing
(1) The user pressed a key, say A, on the keyboard. If
there was a cable from the A key, then the key would be through the plugboard) lights up the output, say Q. The
remapped to the other end of the cable; otherwise the rotors advance by one notch, and encoding continues.
procedure proceeded using the A. (See Figure 9.36.)
(2) The pressed key was permuted by rotor #1; the Q W E R T Each of the 26 keys is
either mapped to itself
output of rotor #1 was permuted by rotor #2; the output (like W here), or is
of rotor #2 was permuted by rotor #3. (Again, see Fig- matched with another
ure 9.36.) The output of rotor #3 was “reflected” by a A S D F G key (like Q ↔ D here).
Pressing an unmatched
fixed permutation, and then the reflector’s output pass key x yields x itself;
through the three rotors, in reverse order and backward: P Y X C V
pressing a matched key x
yields whatever letter is
the output of the reflector was permuted by rotor #3, matched to x.
then by #2, and then by #1.
O N M L
(3) A light corresponding to the output of rotor #1, Q P K
R J Each rotor encodes a
passed through the plugboard cable if present, lights S I permutation of the
T H
up; the illuminated letter is the encoding. U G letters; when the input
V F letter i comes into the
The tricky part is that the rotors rotate by one notch, W E rotor, the output πi comes
X P O N MD L
a bit like an odometer, when the key is pressed, so that YQZ A B C K out. (Here, for example,
R J an input V turns into an
the encoding changes with every keypress. The “secret S I
T H output of M.) After each
key” that the two communicating entities needed to U G keypress, the top portion
agree upon was which rotors to use in which order V F of the rotor rotates by a
W E
notch, so V would now
(5 · 4 · 3 = 60; there were 5 standard rotors in an X
Y Z A B C
D
turn into N.
Enigma), what the initial position of the rotors should
3
be (26 = 17,576), and what plugboard matching to Figure 9.36 The plugboard (above) and a rotor (below).
13 ≈ 8 × 10
26! 12
use ( 13!·2 choices if all 26 letters were
matched; see Example 9.34). Interestingly, almost all of the complexity came from the plugboards.
Perhaps surprisingly, the fact that there were so many possible settings for the Enigma led to the invention of one
of the first programmable computers, by a team including Alan Turing at Bletchley Park, in England, during the war.
They built a machine to test many of these configurations, by brute force. (If there were fewer possibilities, it could
have been cracked by hand; if there were many more, it couldn’t have been cracked by brute force.) Turing and his
team developed a device called the Bombe to exhaustively try to compute the shared German secret key—each day!
Many other cryptographic tricks related to the way the Enigma was being used were also part of the analysis.
For example, the construction of the device meant that no letter could encrypt to itself; this fact was exploited in the
analysis. Another crucial part of the code breaking was a known plaintext attack on the Enigma: the British also used
knowledge of what the Germans tended to communicate (like weather reports) to narrow their search.

ISTUDY
498 Counting

EXERCISES
For two strings x and y, let’s call a shuffle of x and y any interleaving of the letters of the two strings (that maintains the order
of the letters within each string, but may repeatedly alternate between blocks of x letters and blocks of y letters). For example, the
words ALE and LID can be shuffled into

A L L I E D or A L L I D E or A L L I D E or L I D A L E.

How many different strings can be produced as shuffles of the following pairs of words?
9.124 BACK and FORTH
9.125 DAY and NIGHT
9.126 SUPPLY and DEMAND
9.127 LIFE and DEATH
9.128 ON and ON
9.129 OUT and OUT
9.130 (programming required) Write a program, in a language of your choice, that computes all shuffles of two given words x and y. A
recursive approach works well: a shuffle consists either of the first character of x followed by a shuffle of x2...|x| and y, or the first
character of y followed by a shuffle of x and y2...|y| . (Be sure to eliminate any duplicates from your resulting list.)

The next few questions ask you to think about shuffles of generic strings, instead of particular words. (Assume that the alphabet is
an arbitrarily large set—you are not restricted to the 26 letters in English.) Consider two strings x and y, and let n = |x| + |y| be
the total number of characters between them. Note that the number of distinct shuffles of x and y may depend both on the lengths
of x and y and on the particular strings themselves; for example, if some letters are shared between or within the two strings, there
may be fewer possible shuffles.
9.131 In terms of n, what is the maximum possible number of different shuffles of x and y?
9.132 In terms of n, what’s the minimum possible number of distinct shuffles of x and y?
9.133 What is the largest possible number of different shuffles of three strings of length a, b, and c?

9.134 How many 42-bit strings have exactly 16 ones?


9.135 How many 23-bit strings have at most 3 ones? (The coincidental arithmetic structure of the answer actually turns out to be helpful
for error-correcting codes; see Exercise 4.30.)
9.136 How many 32-bit strings have a number of ones within ±2 of the number of zeros?
9.137 The set of 64-bit strings with k ones is largest for k = 32. What’s the smallest m for which

| {the number of 64-bit strings with ≤ m ones} | ≥ | {the number of 64-bit strings with 32 ones} |?

9.138 What is the smallest even integer n for which the following statement is true? If we flip an unbiased coin n times, as in Example 9.43,
the probability that we get exactly n2 heads is less than 10%.

A bridge hand consists of 13 cards from a standard 52-card deck, with 13 ranks (2 through ace) and 4 suits (♣, ♢, ♡, and ♠).
(That is, the cards in the deck are {2, 3, . . . , 10, J, Q, K, A} × {♣, ♢, ♡, ♠}.) How many different bridge hands are there that meet
the following conditions?
9.139 A void in spades: a 13-card hand that contains only cards from the suits ♣, ♢, and ♡.
9.140 A singleton in hearts: exactly one of the 13 cards comes from the suit ♡.
9.141 All four kings.
9.142 No queens at all.
9.143 Exactly two jacks.
9.144 Exactly two jacks and exactly two queens.
9.145 A bridge hand has high honors if it contains the five highest-ranked cards {10, J, Q, K, A} in the same suit. How many bridge hands
have high honors? (Warning: be careful about double counting!)

Many bridge players evaluate their hands by the following system of points. First, give yourself one high-card point for a jack, two
for a queen, three for a king, and four for an ace. Furthermore, give yourself three distribution points for each void (a suit in which
you have zero cards), two points for a singleton (a suit with one card), and one point for a doubleton (a suit with two cards).
9.146 How many bridge hands have a high-card point count of zero?
9.147 How many bridge hands have a high-card point count of zero and a distribution point count of zero? What fraction of all bridge
hands is this?

ISTUDY
Exercises 499

9.148 How many ways are there to choose 32 out of 202 options if repetition is allowed and order matters?
9.149 How many ways are there to choose 32 out of 202 options if repetition is forbidden and order matters?
9.150 How many ways are there to choose 32 out of 202 options if repetition is allowed and order doesn’t matter?
9.151 How many ways are there to choose 32 out of 202 options if repetition is forbidden and order doesn’t matter?
9.152 The first 10 prime numbers are {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}. How many different integers have exactly 5 prime factors (all
from this set), where all of these factors are different?
9.153 How many different integers have exactly 5 prime factors (all from the {2, 3, 5, 7, 11, 13, 17, 19, 23, 29})? (Note that 32 = 2·2·2·2·2
is an example.)
9.154 How many different integers have exactly 10 prime factors that all come from the set of the first 20 prime numbers?
9.155 How many different integers have exactly 10 prime factors that all come from the set of the first 20 prime numbers, and where all
10 of these factors are different from each other?

Suppose that we have two sequences ⟨x1 , x2 , . . . , xn ⟩ and ⟨y1 , y2 , . . . , y2n ⟩ of data points—perhaps representing a sequence of
intensities from two streams of speech. We wish to align x to y by matching up elements of x to elements of y. (For example, y might
represent a reference stream, where we’re trying to match x up to it.) We insist that each element of x is assigned to one and only
one element of y. (See Figure 9.37. Thanks to Roni Khardon, from whom I learned a version of the exercises.)
9.156 How many ways are there to assign each of the n elements of x to one of the 2n elements of y?
9.157 How many ways are there to assign each of the n elements of x to one of the 2n elements of y so that no element of y is matched to
more than one element of x?

In many applications, we can only consider alignments of the elements of x and y that “maintain order”: that is, we can’t have x5
assigned to an element of y that comes after the element assigned to x6 . (If f : {1, . . . , n} → {1, . . . , 2n} represents the alignment,
then we require that i ≤ j implies that f(i) ≤ f(j).)
9.158 How many ways are there to assign each of the n elements of x to one of the 2n elements of y in a way that maintains order?
9.159 How many ways are there to assign each of the n elements of x to one of the 2n elements of y in a way that maintains order so that
no element of y is matched to more than one element of x?
9.160 Consider the equation a + b + c = 202. How many solutions are there where a, b, and c are all nonnegative integers?
9.161 How many solutions are there to the equation a + b + c + d + e = 8, where all of {a, b, c, d, e} must be nonnegative integers?
9.162 What about for a + b + c + d + e = 88, again where all variables must be nonnegative integers?
9.163 What about for a + 2b + c = 128, again where a, b, and c must be nonnegative integers? (Hint: sum over the possible values of b
and use Theorem 9.16.)

The Association for Computing Machinery—a major professional society for computer scientists—puts on student programming
competitions regularly. Teams of students spend a few hours working on some programming problems (of various levels of difficulty).
9.164 Suppose that, at a certain college in the midwest, there are 141 computer science majors. A programming contest team consists of
3 students. How many ways are there to choose a team?
9.165 Suppose that, at a certain programming contest, teams are given 10 problems to try to solve. When the contest begins, each of the
three members of the team has to choose a problem to think about first. (More than one team member can think about the same
problem.) How many ways are there for the three team members to choose a problem to think about first?
9.166 In most programming contests, teams are scored by the number of problems they correctly solve. (There are tiebreakers based on
time and certain penalties.) A team can submit multiple solutions to the same problem. Suppose that a particular team has calculated
that they have time to code up and submit 20 different attempted answers to the 10 questions in the contest. How many different
ways can they allocate their 20 submissions across the 10 problems? (The order of their submissions doesn’t matter.)

x4 x4
x1 x5 x1 x5
x3 x3
x2 x2

y8 y10 y8 y10
y1 y7 y1 y7
y3 y9 y3 y9
y2 y4 y2 y4
y5 y5
y6 y6

(a) An alignment that doesn’t respect order. (b) An alignment that does respect order.

Figure 9.37 An alignment between two sequences, for Exercises 9.156–9.159.

ISTUDY
500 Counting

9.167 Solve the following problem, posed by Adi Shamir in his original paper on secret sharing (see p. 356, or [114]). Eleven scientists
are working on a secret project. They wish to lock up the documents in a cabinet so that the cabinet can be opened if and only if
six or more of the scientists are present. What is the smallest number of locks needed? What is the smallest number of keys to the
locks each scientist must carry?

9.168 In machine learning, we try to use a collection of training data—for example, a large collection of ⟨image, letter⟩ pairs of images
of handwritten letters and the English letter that they represent—to compute a predictor that will do well on predicting answers on a
set of novel test data. One danger in such a system is overfitting: we might build a predictor that’s overly affected by idiosyncrasies
of the training data. One way to address the risk of overfitting is a technique called cross-validation: we divide the training data
into several subsets, and then, for each subset S, train our predictor based on ∼S and test it on S. We might then average the
parameters of our predictor across the subsets S. In ten-fold cross-validation on a n-element training set, we would split our n
n
training examples into disjoint sets S1 , S2 , . . . , S10 where |Si | = 10 . How many ways are there to split an n-element set into disjoint
n
subsets S1 , S2 , . . . , S10 of size 10 each? (Note the order of the subsets themselves doesn’t matter, nor does the order of the elements
within a subset.)

9.169 Consider the set of bitstrings x ∈ {0, 1}n+k with n zeros and k ones with the additional condition that no ones are adjacent. (For
n = 3 and k = 2, for example, the legal bitstrings are 00101, 01001, 01010, 10001, 10010, and 10100.) Prove by induction on n
1
that the number of such bitstrings is n+ k .
n+ k
9.170 Consider the set of bitstrings x ∈ {0, 1} with n zeros and k ones with the additional condition that every block of ones has even
length. (For n = 3 and k = 2, for example, the legal bitstrings are 00011, 00110, 01100, 11000.) Prove that, for any even k, the

number of such bitstrings is n+(nk/2) .
n −1
9.171 Prove that k · k = n · nk− 1 twice, using both an algebraic and a combinatorial proof.
P n
n
9.172 Using induction on n, prove Theorem 9.21—that is, prove that i = 2n .
i= 0
9.173 Using a combinatorial proof, prove the following identity about the squares of the binomial coefficients:

X
n
n 2 
k = 2n
n .
k=0

4 2 4 2    
(For example, for n = 4, this identity states that 0 + + 42 2 + 43 2 + 44 2 = 12 + 42 + 62 + 42 + 12 = 70 is equal to 84 .
 1
And, indeed, 84 = 4!·8!4! = 70.)
9.174 Prove the following identity by algebraic manipulation:

n  m  k
m k = nk mn−
−k .

9.175 Now prove the identity from Exercise 9.174 with a combinatorial proof. (Hint: think about choosing a team of m people from a
pool of n candidates, and picking k managers from the team that you’ve chosen.)
9.176 Prove the following identity, using an algebraic, inductive, or combinatorial proof:

X
n
k 1
m = mn+
+1 .
k=0

a
Recall that b = 0 for any b < 0 or b > a, so many of the terms of the summation are zero. For example, for m = 3 and n = 5,
         
the claim states that 64 = 03 + 13 + 23 + 33 + 43 + 53 = 0 + 0 + 0 + 33 + 43 + 53 .
9.177 Prove the following identity about the binomial coefficients and the Fibonacci numbers (where fi is the ith Fibonacci number), by
induction on n:

⌊n/2⌋
X
n−k
k = fn+1 .
k=0

9.178 Prove van der Monde’s identity:

X
k
n+ m m  n
k = k−r · r .
r =0

(Hint: suppose you have a deck of n red cards and m black cards, from which you choose a hand of k total cards.)

ISTUDY
Exercises 501

(a) (b)
1 for each subsequence a of x: 1 for k := 0, 1, . . . , n:
2 for each subsequence b of y: 2 for each subsequence a of x of length k:
3 check if a = b 3 for each subsequence b of y of length k:
4 check if a = b

Figure 9.38 Two algorithms for common subsequences: (a) brute force, and (b) a length-aware version of brute force.

A common subsequence of two strings x and y is a string z that’s a subsequence of both. A subsequence of an n-character string
corresponds to a subset of {1, 2, . . . , n}, indicating which indices are included (and which aren’t). (See Exercise 9.85.) For example,
BASIC is a common subsequence of BRAINSICKNESS and BIOACOUSTICS.
9.179 Suppose that you have been asked to find the number of common subsequences of two n-character strings x, y ∈ Σn , by brute force.
An algorithm to do so is shown in Figure 9.38a. How many times do we execute Line 3 (testing whether a = b)?
9.180 Using the fact that a subsequence of x and a subsequence of y cannot be equal unless they have the same length, we can modify the
algorithm as shown in Figure 9.38b. Now how many times do we execute Line 4 (testing whether a = b)?

9.181 Using Stirling’s approximation of factorial, which states that n! ≈ 2πn · (n/e)n (where π = 3.1415 · · · and e = 2.7182 · · ·), argue
that Figure 9.38b improves on Figure 9.38a in terms of the number of subsequences compared.
P
n 
9.182 Use the Binomial Theorem to prove the following identity: (−1)k · nk = 0.
k=0
9.183 Use the Binomial Theorem to prove the following identity:
X
n n
k
n
= 23 .
2k
k=0

9.184 In Section 9.2.2, we introduced the Inclusion–Exclusion Rule for counting the union of 2 or 3 sets:

|A ∪ B| = |A| + |B| − |A ∩ B|
|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|

Exercise 9.33 asked you to give a formula for a 4-set intersection, but here’s a completely general solution:
 
[k X k X
Ai = (−1) i + 1
· |Aj ∩ Aj ∩ · · · ∩ Aj | .
1 2 i
i=1 i=1 j1 <j2 <···<ji

S
(Recall that ki=1 Ai = A1 ∪ A2 ∪ · · · ∪ Ak .) Argue that this formula correctly expresses the Inclusion–Exclusion Rule for any
number of sets. (Hint: figure out how many ℓ-set intersections each element x appears in. Then use the Binomial Theorem—
specifically, Exercise 9.182.)

9.185 In Example 8.4, we looked at the subset relation for a set S: that is, we defined the set of pairs

subset = {⟨A, B⟩ ∈ P(S) × P(S) : [∀x ∈ S : x ∈ A ⇒ x ∈ B]} .

For any particular set B ∈ P(S), the number of sets A such that ⟨A, B⟩ ∈ subset is precisely 2|B| . The total number of pairs in the
subset relation on S is thus 2k times the number of subsets of S of size k, summed over all k. We’ve already seen that the number of

subsets of S of size k is |Sk| . Thus the total number of pairs in the subset relation on S is
|S| |S|
X X |S|
(number of subsets of S of size k) · 2k = k
· 2k .
k=0 k=0

Use the Binomial Theorem to compute a simple formula for this summation.

ISTUDY
502 Counting

9.5 Chapter at a Glance


Counting refers to the task of, given a potentially convoluted description of a set S, computing the cardi-
nality of S. Our general strategy for counting is to develop techniques for counting simple sets like unions
and sequences, and then to handle more complex problems by “translating” them into these simpler ones.

Counting Unions and Sequences


The Sum Rule describes how to compute the cardinality of the union of sets: if A and B are disjoint sets, then
Sk Pk
|A ∪ B| = |A| + |B|. More generally, if the sets A1 , A2 , . . . , Ak are all disjoint, then i=1 Ai = i=1 |Ai |. If
the sets A and B are not disjoint, then the Sum Rule doesn’t apply. Instead, we can use Inclusion–Exclusion
to count |A ∪ B|. This rule states that |A ∪ B| = |A| + |B| − |A ∩ B| for any sets A and B. For three sets,
|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|.
To compute the cardinality of the Cartesian product of sets, we can use the Product Rule: for sets
A and B, we have |A × B| = |A| · |B|. More generally, for arbitrary sets A1 , A2 , . . . , Ak , we have that
Qk
|A1 × A2 × · · · × Ak | is i=1 |Ai |. Applying the Product Rule to a set S × S × · · · × S, we see that, for
any set S and any k ∈ Z≥1 , we have |Sk | = |S|k . If the set of options for one choice depends on previous
choices, then we cannot directly apply the Product Rule. However, the basic idea still applies: the Gener-
Qk
alized Product Rule says that |S| = i=1 ni if S denotes a set of sequences of length k, where, for each
choice of the first i − 1 components of the sequence, there are exactly ni choices for the ith component.
A permutation of a set S is sequence of elements from S that contains no repetitions and has length |S|.
In other words, a permutation of S is an ordering of the elements of S. By the Generalized Product Rule,
there are precisely n! = n · (n − 1) · (n − 2) · · · · · 1 permutations of an n-element set.

Using Functions to Count


Let A and B be arbitrary sets. We can use a function f : A → B to relate |A| and |B|. The Mapping Rule
says that:

• There exists a function f : A → B that’s onto if and only if |A| ≥ |B|.


• There exists a function f : A → B that’s one-to-one if and only if |A| ≤ |B|.
• There exists a function f : A → B that’s a bijection if and only if |A| = |B|.

The Mapping Rule implies, among other things, that the power set P(S) of a set S (the set of all subsets
of S) has cardinality |P(S)| = 2|S| .
The Division Rule says the following: suppose that there exists a function f : A → B such that, for every
b ∈ B, there are exactly k elements a1 , . . . , ak ∈ A such that f(ai ) = b. Then |A| = k · |B|. The Division
Rule implies, among other things, that the number of ways to rearrange a sequence containing k different
distinct elements {x1 , . . . , xk }, where element xi appears ni times, is
(n1 + n2 + · · · + nk )!
.
(n1 !) · (n2 !) · · · · · (nk !)
The pigeonhole principle says that if A and B are sets with |A| > |B|, and f : A → B, then there exist
distinct a and a′ ∈ A such that f(a) = f(a′ ). That is, if there are more pigeons than holes, and we place the
pigeons into the holes, then there must be (at least) one hole containing more than one pigeon.

ISTUDY
9.5 Chapter at a Glance 503

Combinations and Permutations



Consider nonnegative integers n and k with k ≤ n. The quantity nk is defined as
n
 n!
k = k!·(n−k)! ,

and is read as “n choose k.” The quantity nk denotes the number of ways to choose a k-element subset of a
set of n elements, called a combination, when each element can only be selected at most once and the order

of the selected elements doesn’t matter. The quantity nk is also sometimes called a binomial coefficient.
Depending on whether we allow the same candidate to be chosen more than once and whether we care
about the order in which the candidates are chosen, there are many versions of selecting k out of a set of n
candidates:

• If the order of the selected elements doesn’t matter and repetition of the chosen elements is not allowed,

then there are nk ways to choose.
• If order matters and repetition is not allowed, there are (n−k)!
n!
ways.
• If order matters and repetition is allowed, there are n ways.
k

• If order doesn’t matter and repetition is allowed, there are n+k−1k ways.

A combinatorial proof establishes that two quantities x and y are equal by defining a set S and proving
that |S| = x and |S| = y by counting |S| in two different ways. We can give combinatorial proofs of the
following facts about the binomial coefficients, among others:
n
 n
 n
 n−1
  Pn n
k = n−k k = k + n−1
k−1 i=0 i = 2 .
n

The Binomial Theorem states that, for any a, b ∈ R and any n ∈ Z≥0 ,
X
n

n n
(a + b) = i ai bn−i .
i=0

We can prove the Binomial Theorem by induction on the exponent n.


Many of the interesting properties of the binomial coefficients can be seen by looking at patterns visible
in Pascal’s triangle, which arranges the binomial coefficients so that the nth row contains the n+1 binomial
  
coefficients n0 , n1 , . . . , nn . See Figure 9.39 for the first few rows of Pascal’s triangle.

(0)
0 1
(1) (1)
0 1 1 1
(2) (2) (2)
0 1 2 1 2 1
(3) (3) (3) (3)
0 1 2 3 1 3 3 1
(4) (4) (4) (4) (4)
0 1 2 3 4 1 4 6 4 1
(5) (5) (5) (5) (5) (5)
0 1 2 3 4 5 1 5 10 10 5 1
(6) (6) (6) (6) (6) (6) (6)
0 1 2 3 4 5 6 1 6 15 20 15 6 1
(7) (7) (7) (7) (7) (7) (7) (7)
0 1 2 3 4 5 6 7 1 7 21 35 35 21 7 1

Figure 9.39 The first several rows of Pascal’s triangle, in both “choose” notation and in numerical form.

ISTUDY
504 Counting

Key Terms and Results

Key Terms Key Results


Counting Unions and Sequences Counting Unions and Sequences
• Sum Rule 1 The Sum Rule: if the sets A1 , A2 , . . . , Ak are all disjoint,
Sk Pk
• Product Rule then i=1 Ai = i=1 |Ai |. The Inclusion–Exclusion
• double counting Rule allows us to handle nondisjoint sets; for example,
• Inclusion–Exclusion for any sets A, B we have |A ∪ B| = |A| + |B| − |A ∩ B|.
Qk
• Generalized Product Rule 2 The Product Rule: |A1 × A2 × · · · × Ak | = i=1 |Ai |.
• permutation For any set S and any k ∈ Z≥1 , we have |Sk | = |S|k .
3 The Generalized Product Rule: if S is a set of sequences
Using Functions to Count of length k, where, for each choice of the first i − 1
components of the sequence, there are exactly ni choices
• Mapping Rule Qk
for the ith component, then |S| = i=1 ni .
• Division Rule
• pigeonhole principle
Using Functions to Count
Combinations and Permutations 1 The Mapping Rule: an onto function f : A → B means
|A| ≥ |B|; a one-to-one function f : A → B means
• combinations
|A| ≤ |B|; and a bijection f : A → B means |A| = |B|.
• permutations
 2 For any set S, |P(S)| = 2|S| .
• n
k / binomial coefficient 3 The Division Rule: if f : A → B satisfies
• Binomial Theorem
|{a ∈ A : f(a) = b}| = k for all b ∈ B, then |A| = k · |B|.
• combinatorial proof
4 The number of ways to arrange a sequence containing
• Pascal’s triangle
elements {x1 , . . . , xk }, where xi appears ni times, is
(n1 +n2 + ··· +nk )!
(n1 !)·(n2 !)· ··· ·(nk !) .
5 Pigeonhole principle: if f : A → B and |A| > |B|, then
there exist a, a′ ̸= a ∈ A such that f(a) = f(a′ ).

Combinations and Permutations


1 There are four versions of selecting k out of n
candidates, depending on whether the order of the
chosen elements matters and whether we can choose the
same element twice. (See Figure 9.28.) The binomial

coefficient nk denotes the number of ways to choose
when repetition is forbidden and order doesn’t matter
(called combinations).
 
2 Some useful properties: nk = n−k n
and
n−1
 n−1
 n
 Pn n 
+ = and = 2n .
k k−1 k i=0 i
n
Pn n i n−i
3 The Binomial Theorem: (a + b) = i=0 i a b .

ISTUDY
10 Probability

In which our heroes evade threats and conquer their fears by flipping
coins, rolling dice, and spinning the wheels of chance.

ISTUDY
505
506 Probability

10.1 Why You Might Care


“Les Anglais!” he murmured. “No method—absolutely none whatever. They leave all to
chance!”
Agatha Christie (1890–1976)
The Murder on the Links (1923)

This chapter introduces probability, the study of randomness. Our focus, as will be no surprise by this
point of the book, is on building a formal mathematical framework for analyzing random processes. We’ll
begin with a definition of the basics of probability: defining a random process that chooses one particular
outcome from a set of possibilities (any one of which occurs some fraction of the time). We’ll then analyze
the likelihood that a particular event occurs—in other words, asking whether the chosen outcome has some
particular property that we care about. We then consider independence and dependence of events, and
conditional probability: how, if at all, does knowing that the randomly chosen outcome has one particular
property change our calculation of the probability that it has a different property? (For example, perhaps
90% of all email is spam. Does knowing that a particular email contains the word ENLARGE make that email
more than 90% likely to be spam?) Finally, we’ll turn to random variables and expectation, which give
quantitative measurements of random processes: for example, if we flip a coin 1000 times, how many heads
would we see (on average)? How many runs of 10 or more consecutive heads? Probabilistic questions are
surprisingly difficult to have good intuition about; the focus of the chapter will be on the tools required to
rigorously settle these questions.
Probability is relevant almost everywhere in computer science. One broad application is in randomized
algorithms to solve computational problems. In the same way that the best strategy to use in a game of
rock–paper–scissors involves randomness (throw rock 31 of the time, throw paper 13 of the time, throw
scissors 13 of the time), there are some problems—for example, finding the median element of an unsorted
array, or testing whether a given large integer is a prime number—for which the best known algorithm (the
fastest, the simplest, the easiest to understand, …) proceeds by making random choices. The same idea
occurs in data structures: a hash table is an excellent data structure for many applications, and it’s best
when it assigns elements to (approximately) random cells of a table. (See Section 10.1.1.) Randomization
can also be used for symmetry breaking: we can ensure that 1000 identical drones do not clog the airwaves
by all trying to communicate simultaneously: each drone will choose to try to communicate at a random
time. And we can generate more realistic computer graphics of flame or hair or, say, a field of grass by, for
each blade, randomly perturbing the shape and configuration of an idealized piece of grass.
As a rough approximation, we can divide probabilistic applications in CS into two broad categories:
cases in which the randomness is internally generated by our algorithms or data structures, and cases in
which the randomness comes “from the outside.” The first type we discussed above. In the latter category,
consider building some sort of computational model to address some real-world phenomenon. For example,
we might wish to model social behavior (a social network of friendships), or traffic on a road network or on
the internet, or to build a speech-recognition system. Because these applications interact with extremely
complex real-world behaviors, we typically think of them as generated according to some deterministic
(nonrandom) underlying rule, but with hard-to-model variation that is valuably thought of as generated
by a random process. In systems for speech recognition, it works well to treat a particular “frame” of the
speech stream (perhaps tens of milliseconds in duration) as a noisy version of the sound that the speaker
intended to produce, where the noise is essentially a random perturbation of the intended sound.

ISTUDY
10.1 Why You Might Care 507

Finally, you might care about probability because any well-educated person must understand something
about probability. You need probability to understand political polls, weather forecasting, news reports
about medical studies, wagers that you might place (either with real money or by choosing which of two
alternatives is a better option), and many other subjects. Probability is everywhere!

10.1.1 Hashing: A Running Example


Throughout this chapter, we will consider a running sequence of examples that are about hash tables, a
highly useful data structure that also conveniently illustrates a wide variety of probabilistic concepts. So
we’ll start here with a short primer on hash tables. (See also p. 72, or a good textbook on data structures.)
A hash table is a data structure that stores a set of elements in a table T[1 . . . m]—that is, an array of
size m. (Remember that, throughout this book, arrays are indexed starting at 1, not 0.) The set of possible
elements is called the universe or the keyspace. We will be asked to store in this table a particular small
subset of the keyspace. (For example, the keyspace might be the set of all 8-letter strings; we might be
asked to store the user IDs of all students on campus.) We use a hash function h to determine in which cell
of the table T[1 . . . m] each element will be stored. The hash function h takes elements of the keyspace as
input, and produces as output an index identifying a cell in T. To store an element x in T using hash function
h, we compute h(x) and place x into the cell T[h(x)]. (We say that the element x hashes to the cell T[h(x)].)
We must somehow handle collisions, when we’re asked to store two different elements that hash to the
same cell of T. We will usually consider the simplest solution, where we use a strategy called chaining
to resolve collisions. To implement chaining, we store all elements that hash to a cell in that cell, in an
unsorted list. Thus, to find whether an element y is stored in the hash table T, we look one-by-one through
the list of elements stored in T[h(y)].

Example 10.1: A small hash table.


Let the keyspace be {1, 2, 3, 4}, and consider a 2-cell hash table with the hash function h given by h(x) =
(x mod 2) + 1. (Thus h(1) = h(3) = 2 and h(2) = h(4) = 1.) Figure 10.1 shows two scenarios for this
hash function: storing the elements {1, 4}, and storing the elements {2, 4}.

More formally, we are given a finite set K called the keyspace, and we are also given a positive integer
m representing the table size. We will base the data structure on a hash function h : K → {1, . . . , m}. For
the purposes of this chapter, we choose h randomly, specifically choosing the hash function so that each
function from K to {1, . . . , m} is equally likely to be chosen as h.
Let’s continue our above example with a randomly chosen hash function. For the moment, we’ll treat
the process of randomly choosing a hash function informally. (The precise definitions of what it means to
choose randomly, and what it means for certain “events” to occur, will be defined in the following sections
of this chapter.)

T[1] T[2] T[1] T[2]


Scenario #1: storing the Scenario #2: storing the
elements {1, 4}. [4] [1] elements {2, 4}. [2, 4] []

Figure 10.1 Hashing two elements (in two scenarios) using the hash function h(x) = (x mod 2) + 1.

ISTUDY
508 Probability

3 3 3 3 3 3 3 3 functions with h(4) = h(1)


3 3 3 3 3 3 “perfectly balanced”
3 all elements hash into cell #2
h(1) 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
h(2) 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2
h(3) 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
h(4) 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

Figure 10.2 All functions from {1, 2, 3, 4} to {1, 2}. Each column is a different function h; the ith row records the
value of h(i). The letters mark some functions as described in Example 10.2.

Example 10.2: A small hash table.


As before, let K = {1, 2, 3, 4} and m = 2. There are m|K| = 24 = 16 different functions h : K → {1, 2},
and each of these functions is equally likely to be chosen. (The 16 functions are listed in Figure 10.2.)
1
Each of these functions is chosen a 16 fraction of the time. Thus:

• A 168
= 12 fraction of the time, we have h(4) = h(1).
• A 16 = 38 fraction of the time, the hash function is “perfectly balanced”—that is, hashes an equal share
6

of the keys to each cell.


• A 161
fraction of the time, the hash function hashes every element of K into cell #2.

(The functions meeting each of these three criteria are marked in Figure 10.2.)

Taking it further: In practice, the function h will not be chosen completely at random, for a variety of practical reasons (for exam-
ple, we’d have to write down the whole function to remember it!), but throughout this chapter we will model hash tables as if h is
chosen completely randomly. The assumption that the hash function is chosen randomly, with every function K → {1, 2, . . . , m}
equally likely to be chosen, is called the simple uniform hashing assumption. It is very common to make this assumption when
analyzing hash tables.
It may be easier to think of choosing a random hash function using an iterative process instead: for every key x ∈ K, we choose
a number ix uniformly at random and independently from {1, 2, . . . , m}. (The definitions of “uniformly” and “independently”
are coming in the next few sections. Informally, this description means that each number in {1, 2, . . . , m} is equally likely to
be chosen as ix , regardless of what choices were made for previous numbers.) Now define the function h as follows: on input x,
output ix . One can prove that this process is completely identical to the process illustrated in Example 10.2: write down every
function from K to {1, 2, . . . , m} (there are m|K| of them), and pick one of these functions at random.

After we’ve chosen the hash function h, a set of actual keys {x1 , . . . , xn } ⊆ K will be given to us, and we
will store the element xi in the table slot T[h(xi )]. Notice that the only randomly determined quantity is the
hash function h. Everything else—the keyspace K, the table size m, and the set of to-be-stored elements—is
fixed.

ISTUDY
10.2 Probability, Outcomes, and Events 509

10.2 Probability, Outcomes, and Events


Luck is not chance –
It’s Toil –
Fortune’s expensive smile
Is earned.
Emily Dickinson (1830–1886)

This section will give formal definitions of the fundamental concepts in probability, giving us a frame-
work to use in thinking about the many computational applications that involve chance. These definitions
are somewhat technical, but they’ll allow us reason about some fairly sophisticated probabilistic settings
fairly quickly.

Warning! It is very rare to have good intuition or instincts about probability questions. Try to hold yourself back from jumping to
conclusions too quickly, and instead use the systematic approaches to probabilistic questions that are introduced in this chapter.

10.2.1 Outcomes and Probability


Here’s the very rough outline of the relevant definitions; we’ll give more details in a moment. Imagine a
scenario in which some quantity is determined in some random way. We will consider a set S of possible
outcomes. Each outcome has an associated probability, which is a number between 0 and 1. The set S is
called the sample space. In any particular result of this scenario, one outcome from S is selected randomly
(by “nature”); the frequency with which a particular outcome is chosen is given by that outcome’s associ-
ated probability. (Sometimes we might talk about the process by which a sequence of random quantities is
selected, and the realization as the actual choice made according to this process.) For example, for flipping
an unweighted coin we would have S = {Heads, Tails}, where Heads has probability 0.5 and Tails has
probability 0.5. Our particular outcome might be Heads. Here are the formal definitions:

Definition 10.1: Outcomes and sample space.


An outcome of a probabilistic process is the sequence of results for all randomly determined quantities.
(An outcome can also be called a realization of the probabilistic process.) The sample space S is the set
of all outcomes.

Definition 10.2: Probability function.


Let S be a sample space. A probability function Pr : S → R describes, for each outcome s ∈ S, the
fraction of the time that s occurs. (We denote probabilities using square brackets, so the probability of
s ∈ S is written Pr [s].) We insist that the following two conditions hold for the probability function Pr:
X
Pr [s] = 1 (10.2.1)
s∈S

Pr [s] ≥ 0 for all s ∈ S. (10.2.2)

Intuitively, condition (10.2.1) says that something has to happen: when we flip a coin, then either it comes
up heads or it comes up tails. (And so Pr [Heads]+Pr [Tails] = 1.) The other condition, (10.2.2), formalizes
the idea that Pr [s] denotes the fraction of the time that the outcome s occurs: the least frequently that an
outcome can occur is never.

ISTUDY
510 Probability

The probability function Pr is also sometimes called a probability distribution over S. (This function
“distributes” one unit of probability across the set S of all possible outcomes, as in (10.2.1).)

Taking it further: Bizarrely, in quantum computation—an as-yet-theoretical type of computation based on quantum
mechanics—we can have outcomes whose probabilities are not restricted to be real numbers between 0 and 1. This model is
(very!) difficult to wrap one’s mind around, but a computer based on this idea turns out to let us solve interesting problems, and
faster than on “normal” computers. For example, we can factor large numbers efficiently on a quantum computer. (Though we
don’t know how to build quantum computers of any nontrivial size.) See p. 522.

A few examples: cards, coins, and words


Here are a few examples of sample spaces with probabilities naturally associated with each outcome:

Example 10.3: One card from the deck.


We draw one card from a perfectly shuffled deck of 52 cards. Then we can denote the sample space as
S = {2, 3, . . . , 10, J, Q, K, A} × {♣, ♢, ♡, ♠}, and |S| = 52. Each card c ∈ S has Pr [c] = 52
1
. Note that
condition (10.2.1) is satisfied because
P P
c∈S 52 = 52 · 52 = 1,
1 1
c∈S Pr [c] =

and (10.2.2) is obviously satisfied because Pr [c] = 1


52 for each c, and 1
52 ≥ 0.

Example 10.4: Coin flips.


You flip a $1 coin and Margaret Thatcher flips a £1 coin. Assume that both coins are fair (equally likely
to come up heads and tails) and that flips of the $1 coin and the £1 coin do not affect each other in
any way. Then the four outcomes are—writing the dollar’s result first—⟨Heads, Heads⟩, ⟨Heads, Tails⟩,
⟨Tails, Heads⟩, and ⟨Tails, Tails⟩. Each of these four outcomes has probability 0.25.

Example 10.5: A word on the page.


Excluding spaces and punctuation, the following sentence—which comes from William Shakespeare’s
King Richard III—contains a total of 29 different symbols (namely N, o, w, i, s, t, …, t):
Now is the winter of our discontent.

We are going to select a word from this sentence, according to the following process: choose one of the
29 non-space symbols from the sentence with equal likelihood; the selected word is the one in which the
selected symbol appears. (Thus longer words will be chosen more frequently than shorter words, because
longer words contain more symbols—and are therefore more likely to be selected.)
The sample space is S = {Now, is, the, winter, of, our, discontent}.
3 2
There are 3 + 2 + 3 + 6 + 2 + 3 + 10 = 29 total symbols, and thus Pr [Now] = 29 , Pr [is] = 29 , and
10
so on, through Pr [discontent] = 29 . Again, the conditions for being a probability are satisfied: each
P
outcome’s probability is nonnegative, and w∈S Pr [w] = 1.

Examples 10.3 and 10.4 are scenarios of uniform probability, in which each outcome in the sample space
is chosen with equal likelihood. (Specifically, each s ∈ S has probability Pr [s] = |S| 1
.) Example 10.5
illustrates nonuniform probability, in which some outcomes occur more frequently than others.
Note that for a single sample space S, we can have many different distinct processes by which we choose
an outcome from S. For example:

ISTUDY
10.2 Probability, Outcomes, and Events 511

Example 10.6: Two ways of choosing from S = {0, 1, 2, . . . , 7}.


One process for selecting an element of S = {0, 1, 2, . . . , 7} is to flip three fair coins and treat their results
as a binary number (HHH = 111 → 7, HHT = 110 → 6, …, TTT = 000 → 0). This process gives a
uniform distribution over S: each sequence of coin flips occurs with the same probability. For example,
Pr [4] = 18 = 0.125 and Pr [7] = 18 = 0.125.
A second process for selecting an element of S is to flip 7 fair coins and to let the outcome be the
number of heads that we see in those 7 flips (HHHHHHH → 7, HHHHHHT → 6, HHHHHTH → 6, …,
TTTTTTT → 0). This process gives a nonuniform distribution over S, because the number of sequences
that have k heads is different for different values of k. For example:
7
 7

Pr [4] = 4
= 35
128 ≈ 0.2734, but Pr [7] = 7
= 1
128 ≈ 0.0078.
27 27

As a word of warning, notice that probabilistic statements about a particular realization don’t make sense;
the only kind of probabilistic statement that makes sense is a statement about a probabilistic process. If
you happen to be one of the ≈ 10% of the population that’s red–green colorblind, and a friend says “what
are the odds that you’re colorblind!?,” the correct answer is: the probability is 1 (because it happened!).

10.2.2 Events
Many of the probabilistic questions that we’ll ask are about whether the realization has some particular
property, rather than whether a single particular outcome occurs. For example, we might ask for the prob-
ability of getting more heads than tails in 1000 flips of a fair coin. Or we might ask for the probability
that a hand of seven cards (dealt from a perfectly shuffled deck) contains at least two pairs. There may be
many different outcomes in the sample space that have the property in question. Thus, often we will be
interested in the probability of a set of outcomes, rather than the probability of a single outcome. Such a
set of outcomes is called an event:

Definition 10.3: Event.


Let S be a sample space with probability function Pr. An event is a subset of S. The probability of an
P
event E is the sum of the probabilities of the outcomes in E, and it is written Pr [E] = s∈E Pr [s].

The notation in Definition 10.3 generalizes the function Pr by allowing us to write either elements of S or
subsets of S as inputs to Pr. The probability of an event E ⊆ S follows by a probabilistic version of the
Sum Rule, from counting: because one (and only one) outcome is chosen in a particular realization, the
probability of either outcome x or y occurring is Pr [x] + Pr [y].

Taking it further: What does it mean to extend the Pr notation to either events or outcomes? Previously we considered a function
Pr : S → [0, 1]; we have now “extended” our notation so that it’s a function Pr : P(S) → [0, 1]. (To be more precise, we’re
actually extending the notation to be a function Pr : (S ∪ P(S)) → [0, 1], because we’re still letting ourselves write outcomes
as arguments too.) Our mixture of Pr [outcome] and Pr [event] is an abuse of notation; we’re mixing the type of input willy nilly.
But, because Pr [x] for an outcome x and Pr [{x}] for the singleton event {x} are identical, we can write probabilities this way
without risk of confusion.

A few examples: coins, cards, and roulette


Here are a few examples of events and their probabilities:

ISTUDY
512 Probability

Example 10.7: At least one head.


You and Margaret Thatcher each flip fair coins, as in Example 10.4. Define the event “at least one coin
comes up heads” as H = {⟨Heads, Heads⟩, ⟨Heads, Tails⟩, ⟨Tails, Heads⟩}. Then Pr [H] = 0.25 + 0.25 +
0.25 = 0.75.

Example 10.8: Aces up.


Suppose that you draw one card from a perfectly shuffled deck, as in Example 10.3. What is the probability
that you draw an ace?
Solution. The event in question is E = {A♣, A♢, A♡, A♠}. Each of these four outcomes has a
1 1 1 1 1 4 1
probability of 52 , so Pr [E] = 52 + 52 + 52 + 52 = 52 = 13 .

Example 10.9: Full house.


You’re dealt 5 cards from a shuffled deck, so that each set of 5 cards is equally likely to be your hand. A
hand is a full house if 3 cards share one rank, and the other 2 cards share a second rank. (For example,
the hand 3♡, 3♠, 9♡, 9♣, 3♣ is a full house.) What’s the probability of being dealt a full house?
 
Solution. There are 52 52
5 possible hands, each of which is dealt with probability 1/ 5 . Thus the key
question is a counting question: how many full houses are there? We can compute this number using the
Generalized Product Rule; specifically, we can view a full house as the result of the following sequence
of selections:

• we choose the rank of which to have three of a kind;


• we choose which 3 of the 4 cards of that rank are in the hand;
• we choose the rank of the pair (any of the 12 remaining ranks); and
• we choose which 2 of the 4 cards of that rank are in the hand.
   
Thus there are 13
1 · 4
3 · 12
1 · 4
2 full houses, and the probability of a full house is
   4
13
· 43 · 12 1 · 2 3744
1
52
 = ≈ 0.00144.
5
2598960

Here’s a slightly more complex example, with multiple events of interest:

Example 10.10: Roulette.


In the casino game of roulette (French: “little wheel”), a wheel is spun, and a metal ball comes to rest in
one of the wheel’s 38 segments. The segments are numbered 1–36 (each colored red or black), and there
are two more segments labeled 0 and 00 (both colored green). See Figure 10.3a. Assume that the ball is
equally likely to land in each segment, and that the sample space consists of {00, 0, 1, 2, . . . , 36}. There
are 38 outcomes in the sample space.
A roulette player can bet on a particular outcome, or on an event that corresponds to a pair, triple,
quadruple, or sextuple of outcomes whose numbered squares are adjacent in the board in Figure 10.3a.
(For example, you can bet on 32 or {22, 23} or {4, 5, 6} or {17, 18, 20, 21} or {7, 8, 9, 10, 11, 12}.)
Roulette players can also bet on a number of different events defined by the 12 panels along the sides of
the grid. These events, and their probabilities, are shown in Figure 10.3b.

ISTUDY
10.2 Probability, Outcomes, and Events 513

0 00

1 2 3

1–18
4 5 6

first 12
6 21 33
31 18 7 8 9

even
19 16
8 4
10 11 12
23
1 2
29

35 13 14 15

black
1 00 27 10 25

14

second 12
16 18
2 0 28 9 26
17

19 20 21

red
22 23 24
13
36

30

25 26 27

odd
24

11

3 7

third 12
15 32 17
20 28 29 30
5 22 34
31 32 33
19–36

34 35 36

2 to 1 2 to 1 2 to 1

(a) The roulette wheel and the roulette board.

event outcomes probability


Note: for the purposes of this 1–18 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18} 18/38
game, the numbers 0 and 00
count as neither even nor even {2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36} 18/38
odd—for reasons related only to first 12 {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} 12/38
casinos’ business models, and
not to the value of 0 mod 2.
black {2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35} 18/38
Similarly, 0 and 00 are neither red {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36} 18/38
black nor red, and so forth. second 12 {13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} 12/38
(This figure shows the American odd {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35} 18/38
roulette wheel; the European 19–36 {19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36} 18/38
wheel has only one green third 12 {25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36} 12/38
segment (including 0, but not 00)
and there are only 37 outcomes. “2 to 1” A {1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34} 12/38
The player in the European “2 to 1” B {2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35} 12/38
version does better on average.)
“2 to 1” C {3, 6, 9, 12, 15, 18, 21, 24, 27, 39, 33, 36} 12/38

(b) Some events and their probabilities.

Even: 11 13 15 17 29 31 33 35 2 4 6 8 10 20 22 24 26 28 00
1 3 5 7 9 19 21 23 25 27 12 14 16 18 30 32 34 36 0

Black: 11 13 15 17 29 31 33 35 2 4 6 8 10 20 22 24 26 28 00 Black & Even: 11 13 15 17 29 31 33 35 2 4 6 8 10 20 22 24 26 28 00


1 3 5 7 9 19 21 23 25 27 12 14 16 18 30 32 34 36 0 1 3 5 7 9 19 21 23 25 27 12 14 16 18 30 32 34 36 0

Red: 11 13 15 17 29 31 33 35 2 4 6 8 10 20 22 24 26 28 00 Red & Even: 11 13 15 17 29 31 33 35 2 4 6 8 10 20 22 24 26 28 00


1 3 5 7 9 19 21 23 25 27 12 14 16 18 30 32 34 36 0 1 3 5 7 9 19 21 23 25 27 12 14 16 18 30 32 34 36 0

(c) A few events in roulette, visualized in a Venn diagram–like way.

Figure 10.3 The game of roulette.

ISTUDY
514 Probability

The details of the particular roulette events in Example 10.10 aren’t particularly important, but the distinc-
tion between outcomes and events—which this example should make starkly (it’s the difference between
“the ball stops on number 17” and “the ball stops on an odd number”)—is crucial in probability.
It is often useful to visualize a sample space using a Venn diagram–like representation, particularly if
we draw the subsets/events so that their area corresponds to their probability. See Figure 10.3c, which also
shows a few intersections of pairs of events: because an event is just a subset of the sample space, the
intersection of two events is still a subset of the sample space, and therefore is also an event.

A few useful properties


Here are a few useful general properties of the probability of events:

Theorem 10.4: Some properties of event probabilities.


Let S be a sample space, and let A ⊆ S and B ⊆ S be events. Then, writing A = S − A to denote the
complement of the event A, we have:
Pr [S] = 1 (10.4.1)
Pr [∅] = 0 (10.4.2)
 
Pr A = 1 − Pr [A] (10.4.3)
Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B] . (10.4.4)

These properties all follow directly from the definition of the probability of an event.

10.2.3 Tree Diagrams in Probability


Many probabilistic processes involve a sequence of randomly determined quantities, rather than just a
single random choice. Much like in counting, we can use a tree diagram to represent the sequence of
random choices—and then we can look for the probability of a particular outcome as reflected in the
sequence of choices in the tree.
In a tree diagram for a probabilistic sequence of choices, every internal node in the tree corresponds
to a random decision; every edge leaving that internal node is labeled with the probability of a particular
decision. The probability labels of all edges leaving any particular internal node u must add up to 1. (The
interpretation is: if the probabilistic process reaches node u, then each branch leaving u is chosen with
frequency in proportion to its label.) Every leaf in the tree corresponds to an outcome. The probability of
reaching a particular leaf is precisely equal to the product of the labels on the edges leading from the root
to that leaf. As usual, the probability of an event is the sum of the probabilities of the outcomes contained
in that event.

Problem-solving tip: Tree diagrams are generally a very good way to solve probability questions; they force you to systematically
think about all of the steps of a probabilistic process (and also about all of the steps of solving probability problems!).

Here is a first small example:

ISTUDY
10.2 Probability, Outcomes, and Events 515

1 1 1 1 1 1
6 6 6 6 6 6

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
⟨1, 1⟩
⟨1, 2⟩
⟨1, 3⟩
⟨1, 4⟩
⟨1, 5⟩
⟨1, 6⟩

⟨2, 1⟩
⟨2, 2⟩
⟨2, 3⟩
⟨2, 4⟩
⟨2, 5⟩
⟨2, 6⟩

⟨3, 1⟩
⟨3, 2⟩
⟨3, 3⟩
⟨3, 4⟩
⟨3, 5⟩
⟨3, 6⟩

⟨4, 1⟩
⟨4, 2⟩
⟨4, 3⟩
⟨4, 4⟩
⟨4, 5⟩
⟨4, 6⟩

⟨5, 1⟩
⟨5, 2⟩
⟨5, 3⟩
⟨5, 4⟩
⟨5, 5⟩
⟨5, 6⟩

⟨6, 1⟩
⟨6, 2⟩
⟨6, 3⟩
⟨6, 4⟩
⟨6, 5⟩
⟨6, 6⟩
Figure 10.4 The probability tree for rolling two fair dice, one after the other. Three events are highlighted: “doubles are
rolled” (marked ) “a 7 is rolled” (marked ); and “an 11 is rolled” (marked ).

Example 10.11: Rolling two dice.


The probability tree for rolling two fair dice, one after the other, is shown in Figure 10.4.

• All edges have probability 61 ; thus each outcome’s probability is 1


6 · 1
6 = 1
36 .
• The event “doubles are rolled” has probability 6 · 361
= 61 .
• The event “a 7 is rolled” has probability 6 · 36
1
= 61 .
• The event “an 11 is rolled”has probability 2 · 36 = 18
1 1
.

Incidentally, one of the calculations from Example 10.11 can also be rephrased to address a question about
hashing. Suppose that we hash two elements into a hash table with 6 slots using a uniform random hash
function. (See Section 10.1.1.) What is the probability that we have a collision? This question is precisely
the same as asking for the probability of rolling doubles with two fair dice—that is, 16 , by Example 10.11.
When we introduced hash tables in Section 10.1.1, we described resolving collisions by chaining: an
element x is stored in cell T[h(x)]; if that cell is already occupied, then we simply add x to a list of elements
in cell T[h(x)]. But there are several other strategies for resolving collisions in a hash table, the simplest of
which is called linear probing. In linear probing, when we insert an element x into the table, we put x in
the first unoccupied cell, moving from left to right, starting at cell h(x). See Figure 10.5 for an example.

Example 10.12: Hashing with linear probing.


Suppose that we hash 2 elements into a hash table with 6 slots using a uniform random hash function h,
where we resolve collisions by linear probing. What is the probability that we end up with 2 consecutive
slots of the hash table filled?
Solution. The sample space is S = {1, 2, . . . , 6} × {1, 2, . . . , 6}: we first randomly choose a value for
h(A), and then randomly choose a value for h(B). We’ll build a tree diagram to represent these choices,
as shown (in part) in Figure 10.6. The highlighted outcomes have A and B hashed to adjacent cells. (The
remainder of the tree is analogous; it’s good practice to try drawing the other branches.)
1
Each branch of the tree is equally likely, so each outcome occurs with probability 36 . How many
different outcomes result in A and B being stored in adjacent cells? For each of the 6 possible hash values
for A, there are 3 hash values for B that cause A and B to be adjacent, when h(B) is one of h(A) − 1, h(A),
and h(A) + 1. So the final probability of a cluster forming is (6 · 3) · 361
= 3618
= 12 .

Taking it further: One of the downsides of resolving collisions in a hash table using linear probing is a phenomenon called
“clustering”: contiguous blocks of filled cells develop, and these filled blocks tend to get longer and longer as more and more
elements are added to the table. (This problem is beginning to occur in Figure 10.5.) Other collision-resolution schemes can
mitigate this problem; see Exercises 10.45–10.50.

ISTUDY
516 Probability

(a) (b) (c) (d)


A B C A B C A D B C A D E B
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Suppose A and B, with Suppose h(C) = 3. T[3] is Suppose h(D) = 4. T[4] is Suppose h(E) = 3. T[3] and T[4]
h(A) = 4 and h(B) = 8, are empty, so we store C in T[3]. full, but T[5] is empty, so we and T[5] are all full, but T[6] is
stored initially. store D in T[5]. empty, so we store E in T[6].

Figure 10.5 Linear probing: we try to put x in cell T[h(x)], but if that cell is already full, then we try cell T[h(x) + 1],
and then cell T[h(x) + 2], and so on. We wrap around to T[1] after we reach the right edge of T.

1 1 1 1 1 1
6 6 6 6 6 6

h(A)=1 h(A)=2 h(A)=3 h(A)=4 h(A)=5 h(A)=6


A A A A A A

1 1 1 1 1
1 1 1 1 1 1 1 6 6
6 6 6 6 6 6 6 6 6 6

h(B)=1 h(B)=2 h(B)=3 h(B)=4 h(B)=5 h(B)=6 h(B)=1 h(B)=2 h(B)=3 h(B)=4 h(B)=5 h(B)=6
AB AB A B A B A B A B BA AB AB A B A B A B

Figure 10.6 The tree diagram for hashing 2 elements into 6 slots using a uniform random hash function h, resolving
collisions by linear probing.

Here’s another (by now famous) example, called the Monty Hall Problem, in which using a probability
tree helps resolve a potentially confusing probability question. This problem is named after Monty Hall,
the host of the television game show Let’s Make A Deal in the 1960s–1980s. (The problem became famous
after a kerfuffle involving Marilyn vos Savant who, in Parade Magazine, answered a reader’s question about
the Monty Hall Problem. Her answer was 100% right, and many of the Ph.D.-holding mathematicians who
wrote angry letters to the editor were 100% wrong [128, 129].)
Problem-solving tip: It is usually worth the time to make the probabilistic process concrete, and to make explicit any hidden
assumptions about the process, before solving the problem. (That’s how we’ll solve Example 10.13.)

Example 10.13: Monty Hall Problem.


Here is the problem (based on the Let’s Make a Deal setup):
You are given the choice of three doors, behind which are a car, a goat, and another goat. You choose a door. Monty
Hall opens one of the doors that you didn’t choose to reveal a goat. He then offers you the chance to switch to the other
(unopened) door that you didn’t initially choose. Should you switch?

(To make this question concrete, assume that the car is initially placed randomly; you choose an initial
door randomly; the host always opens one of the two doors you didn’t choose to reveal a goat, choosing
a goat at random if there are two unchosen goats; and the host will always give you an opportunity to
switch.)
Solution. There are three randomly chosen quantities: where the car is placed, which door you choose,
and which goat is revealed (if there are two possibilities). We can express the process using the probability
tree in Figure 10.7a. The shaded outcomes are those in which switching from your initially chosen door
causes your new door to hide a car; the thin-lined, unshaded outcomes are those in which not switching
causes you to win. All of the outcomes and their associated probabilities are also shown in Figure 10.7b.

ISTUDY
10.2 Probability, Outcomes, and Events 517

B revealed 1
· 1
· 1
= 1
1/2 3 3 2 18
you pick A

revealed goat
1/3 · ·

your choice
1 1 1 1
C revealed =

probability
1/2 3 3 2 18

car placed at A 1/3 you pick B 1 C revealed 1


3 · 1
3 · 1
1 = 1
9

car
1/3
you pick C 1 B revealed 1
3 · 1
3 · 1
1 = 1
9
1/3 A A B 1/18
you pick A 1 C revealed 1
3 · 1
3 · 1
1 = 1
9 A A C 1/18
1/3
1
· 1
· 1 1 A B C 1/9
1/2 A revealed 3 3 1 = 9
1/3 car placed at B 1/3 you pick B A C B 1/9
C revealed 1
· 1
· 1
= 1
B A C 1/9
1/2 3 3 1 9

1/3 B B A 1/18
you pick C 1 A revealed 1
3 · 1
3 · 1
1 = 1
9
B B C 1/18
1/3 you pick A 1 B revealed 1
3 · 1
3 · 1
1 = 1
9 B C A 1/9
1/3
C A B 1/9
car placed at C 1/3 you pick B 1 A revealed 1
3 · 1
3 · 1
1 = 1
9
C B A 1/9
C A B 1/18
A revealed 1
· 1
· 1
= 1
1/2 1/2 3 3 2 18
C B A 1/18
you pick C
B revealed 1
· 1
· 1
= 1
1/2 3 3 2 18
(b) A table of the
(a) The probability tree. outcomes.

Figure 10.7 The Monty Hall Problem. The shaded outcomes are those where you win by switching.

There are six outcomes in which switching causes you to win the car. Each of these outcomes has
probability 91 , so the probability of winning a car by switching is 6 · 19 = 32 . The other six outcomes are
those in which not switching causes you to get a car (and switching gets you a goat); these outcomes each
have probability 18 1
, and so the probability of winning by not switching is 6 · 181
= 13 . You should switch.

Taking it further: Section 10.2.3 has been devoted to tree diagrams—a systematic way of analyzing probabilistic settings in
which a sequence of random choices is made. Typically we think of—or at least model—these random choices as being made
“by nature”: if you flip a coin, you act as though the universe “chooses” (via microdrafts of wind, the precise topology of the
ground where the coin bounces, etc.) whether the coin will come up heads or tails. But, in many scenarios in computer science, we
want to generate the randomness ourselves, perhaps in a program: choose a random element of the set A; go left with probability
1 1
2 and go right with probability 2 ; generate a random 8-symbol password. The process of actually generating a sequence of
“random” numbers on a computer is difficult, and (perhaps surprisingly) very closely tied to notions of cryptographic security. A
pseudorandom generator is an algorithm that produces a sequence of bits that seem to be random, at least to someone examining
the sequence of generated bits with limited computational power. It turns out that building a difficult-to-break encryption system is
in a sense equivalent to building a difficult-to-distinguish-from-random pseudorandom generator. For more, see a good textbook
on cryptography, such as [52, 69, 109].

10.2.4 Some Common Probability Distributions


We’ll end this section with some of the common probabilistic processes (and therefore some common
probability distributions) that arise in computer science applications.

Uniform distribution
Under the uniform distribution, every outcome is equally likely. We can define a uniform distribution for
any finite sample space S:

ISTUDY
518 Probability

Definition 10.5: Uniform distribution.


Let S be a finite sample space. Under the uniform distribution, the probability of any particular outcome
s ∈ S is given by Pr [s] = |S|
1
.

Some familiar examples of the uniform distribution include:

• Flipping a fair coin (Pr [Heads] = Pr [Tails] = 12 ).


• Rolling a fair 6-sided die (Pr [1] = Pr [2] = Pr [3] = Pr [4] = Pr [5] = Pr [6] = 16 ).
• Choosing one card from a shuffled deck (Pr [c] = 52 1
for any card c).

If outcomes are chosen uniformly at random, then the probability of an event is simply its fraction of the
sample space. That is, for any event E ⊆ S, we have Pr [E] = |E|
|S| .

Taking it further: We often make use of a uniform distribution in randomized algorithms. For example, in randomized quicksort
or randomized select applied to an array A[1 . . . n], a key step is to choose a “pivot” value uniformly at random from A, and then
use the chosen value to guide subsequent operation of the algorithm. (See Exercises 10.24–10.27.)
Another algorithmic use of randomization is in mechanisms to allow scientific study of sensitive data while preserving the privacy
of the individuals whose data is being studied. The basic idea is to “fuzz” the data by adding random noise to the true data points,
and then release only the fuzzed dataset. To maximize the usefulness of the data for scientific study, we should minimize the
amount of noise added to the data (“how blurry” to make it). The insight of theoretical computer scientists has been in precisely
characterizing how much noise must be added so as to provably preserve the privacy of individual data values. For more, including
a highly successful approach called differential privacy, see p. 521.

Bernoulli distribution
The next several distributions are related to “flipping coins” in various ways. “Coin flipping” is a common
informal way to refer to any probabilistic process in which we have one or more trials, where each trial has
the same “success probability,” also known as “getting heads.” We will refer to flipping an actual coin as
a coin flip, but we will also refer to other probabilistic processes that succeed with some fixed probability
as a coin flip. We consider a (possibly) biased coin—that is, a coin that comes up heads with probability
p, and tails with probability 1 − p. The coin is called fair if p = 21 ; that is, if the probability distribution
is uniform. We can call the coin p-biased when Pr [heads] = p. It’s important that the result of one trial
has no effect on the success probability of any subsequent trial. (That is, these flips are independent; see
Section 10.3.) The first coin-related distribution is simply the one associated with a single trial:

Definition 10.6: Bernoulli distribution.


The Bernoulli distribution with parameter p is the probability distribution that results from flipping one
p-biased coin. Thus the sample space is {H, T}, where Pr [H] = p and Pr [T] = 1 − p.

(The Bernoulli distribution is named after Jacob Bernoulli (1655–1705), a Swiss mathematician.)

Taking it further: Imagine a sequence of Bernoulli trials performed with p = 0.01, and another sequence of Bernoulli trials
performed with p = 0.48. The former sequence will consist almost entirely of zeros; the latter will be about half zeros and about
half ones. There’s a precise technical sense in which the second sequence contains more information than the first, measured in
terms of the entropy of the sequence. See p. 523.

ISTUDY
10.2 Probability, Outcomes, and Events 519

n = 10, p = 0.5 n = 10, p = 0.25


0.3 Pr [k] 0.3 Pr [k]
0.2 0.2

0.1 0.1
k k
0 2 4 6 8 10 0 2 4 6 8 10

n = 15, p = 0.5 n = 10, p = 0.75


0. 3 Pr [k] 0.3 Pr [k]
0. 2 0.2

0. 1 0.1
k
0 5 10 15 0 2 4 6 8 10

n = 20, p = 0.5 n = 10, p = 0.85


0. 3 Pr [k] 0.3 Pr [k]
0. 2 0.2

0. 1 0.1
k
0 5 10 15 20 0 2 4 6 8 10

Figure 10.8 Several binomial distributions, for different values of n and p.

Binomial distribution
A somewhat more interesting distribution results from considering a sequence of flips of a biased coin.
Consider the following probabilistic process: perform n flips of a p-biased coin, and then count the number
of heads in those flips. The binomial distribution with parameters n and p is a distribution over the sample
space {0, 1, . . . , n}, where Pr [k] denotes the probability of getting precisely k heads in those flips. Fig-
ure 10.8 shows several examples of binomial distributions, for different settings of the parameters n and
p. Each panel of Figure 10.8 shows the probability P[k] of getting precisely k heads in n flips of a p-biased
coin, for each k in the sample space.
If we flip a p-biased coin n times, what is the probability of the event of getting exactly k heads? For
example, consider the outcome

H H ··· H T T ··· T .
k times n − k times

The probability of this outcome is pk · (1 − p)n−k : the first k flips must come up heads, and the next n − k
flips must come up tails. In fact, any ordering of k heads and n − k tails has probability pk · (1 − p)n−k . One
way to see this fact is by imagining the probability tree, which is a binary tree with left branches (heads)
having probability p and right branches (tails) having probability 1 − p. The outcomes in question have k

left branches and n − k right branches, and thus have probability pk · (1 − p)n−k . There are nk different
outcomes with k heads—a sequence of n flips, out of which we choose which k come up heads. Therefore:

ISTUDY
520 Probability

Definition 10.7: Binomial distribution.


The binomial distribution with parameters n and p is a distribution over the sample space {0, 1, . . . , n},
where for each k ∈ {0, 1, . . . , n} we have

Pr [k] = nk · pk · (1 − p)n−k .

For an unbiased coin—that is, when p = 21 —the expression for Pr [k] from Definition 10.7 simplifies to

Pr [k] = nk /2n , because ( 21 )k · (1 − 12 )n−k = ( 21 )k · ( 12 )n−k = ( 12 )n .

Geometric distribution
Another interesting coin-derived distribution comes from the “waiting time” before we see heads for the
first time. Consider a p-biased coin, and continue to flip it until we get a heads. The output of this prob-
abilistic process is the number of flips that were required, and the geometric distribution with parameter
p is defined by this process. (The name “geometric” comes from the fact that the probability of needing k
flips looks a lot like a geometric series, from Chapter 5.) See Figure 10.9 for a few such distributions.
What is the probability of needing precisely k flips to get heads for the first time? We would have to have
k − 1 initial flips come up tails, and then one flip come up heads. As with the binomial distribution, one
nice way to think about the probability of this outcome uses the probability tree. This tree has left branches
(heads) having probability p and right branches (tails) having probability 1−p; the outcome k follows k−1
right branches and one left branch, and thus has probability (1 − p)k−1 · p. Therefore:

Definition 10.8: Geometric distribution.


Let p be a real number satisfying 0 < p ≤ 1. The geometric distribution with parameter p is a distribution
over the sample space Z≥1 = {1, 2, 3, . . .}, where for each k we have
Pr [k] = (1 − p)k−1 · p.

Notice that the geometric distribution is our first example of an infinite sample space: every positive integer
is a possible result.

p = 0.3 p = 0.5 p = 0.7


Pr [k]
0.6

0.4

0.2

k k k

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Figure 10.9 Several geometric distributions, for different values of p. Although these plots are truncated at k = 10, the
distribution continues infinitely: Pr [k] > 0 for all positive integers k.

ISTUDY
10.2 Probability, Outcomes, and Events 521

Computer Science Connections

Using Randomness to Protect Privacy

Cinema & Media Studies


It’s not hard to think of research topics that could lead

Computer Science
Cognitive Science
American Studies
Africana Studies
to clear positive impacts at a societal level, but where

Asian Studies
Art History

Economics
Chemistry
the individual-level data of people being studied might

Classics
Biology

English
be extremely sensitive. Is having a particular underly-

Art
ing health condition (like having asthma, say, or being
HIV positive) a counterindicator for a particular steroid
(a) The true data on majors.
treatment for COVID-19? What is the magnitude of
income and wealth disparities across race and gender
for people doing the same jobs in the same locations?
(It’s harder to formulate and advocate for the most
effective policies without quantitative understanding
of the problems!) There are major potential societal (b) Random noise (from the so-called Laplace distribution).
benefits of a thorough understanding of these kinds of

Cinema & Media Studies


questions—but, at the same time, I certainly don’t want

Computer Science
Cognitive Science
my HIV status or my full financial records being made

American Studies
public. So what do we do? Africana Studies

Asian Studies
Art History

Economics
Chemistry
A first answer is to rely on the ethical standards

Classics
Biology

English
of researchers (and research oversight boards), and

Art
promise all participants in research that their individ-
ual data will never be released. (The United States (c) The adjusted counts for release.
first mandated Institutional Review Boards (“IRBs”)
Figure 10.10 Differentially private release of data. Starting
in 1974, after some prominent and egregious ethical
lapses in medical and behavioral research; see [122].) from sensitive data (say, how many first-generation students
Alongside the risk of unintentional data breaches, declared each major), we add random noise before releasing
though, a major downside of this approach is that sci- the adjusted counts. Theorems have established how much
entific progress rests on sharing of data, so that others noise must be added to ensure a given level of privacy
can reproduce and extend existing work. So might we protection. The released counts aren’t too far from the
permit sharing of anonymized data, with personal iden- originals, but they protect each individual’s privacy.
tifiers stripped out? Tempting as it is, this idea isn’t
enough: anonymization is hard. For example, a pioneering study by Latanya Sweeney showed that more than 85%
of U.S. residents were uniquely identifiable by their 5-digit ZIP code, gender, and day/month/year of birth [124].
So releasing an anonymized sensitive dataset that includes these three fields is tantamount to releasing the names; a
malicious actor with an external data source could “deanonymize” it. (Other famous cases of published anonymous
datasets later being deanonymized include AOL search logs and Netflix movie ratings.)
Instead, a long line of computer scientists have worked to develop algorithms to “fuzz” data, releasing a dataset
that has been slightly, and randomly, altered from the original values. The idea is to alter the data as much as necessary
protect individual privacy. But we also alter it as little as possible, so that group-level statistics will remain close to
their original values. In differential privacy, the most prominent and successful approach, the biggest breakthrough
was not a new algorithm but rather a new definition—one that, like safe mechanisms for releasing data, is probabilistic.
Consider an algorithm A(S) for releasing a privatized version of data from a set S of people, and consider a particular
person x. The key definition of differential privacy (informally stated) is that the probability of a particular dataset
being released is nearly identical whether A is run on S or A is run on S ∪ {x}. (This definition does not mean
that someone looking at the released dataset can’t learn something about x based on correlations—if you can’t learn
something about people from research then there’s not much point in research about people—but rather that the
analyst doesn’t learn more about a participant x than if x had not participated in the study.) Mechanisms to achieve
differential privacy add random noise to the input data, with the amount of noise carefully tuned to be as small as
possible while still obscuring individual data points. See Figure 10.10. (Only some kinds of queries are known to be
achievable under this definition.) And these results are not purely theoretical: as of the 2020 U.S. Census, individuals
are protected by differentially private release of census data. (For more on differential privacy, see [41] or [42].)

ISTUDY
522 Probability

Computer Science Connections

Quantum Computing
As the twentieth-century revolution in physics brought about by the discovery of quantum mechanics unfolded,
researchers at the boundary of physics and computer science started thinking about computation based on these
quantum ideas. This model of quantum computation relies on some very deep physics, so what follows is only a brief
summary—omitting the details of the physics (which I must admit I do not understand). (As the line attributed to
the Danish physicist and quantum theorist Niels Bohr (1885–1962) goes: “Anyone who is not shocked by quantum
theory has not understood it.”)
The basic element of data in a quantum
computer is a quantum bit, or qubit. Like a
bit (the basic element of data on a classical

|0000⟩
|0001⟩
|0010⟩
|0011⟩
|0100⟩
|0101⟩
|0110⟩
|0111⟩
|1000⟩
|1001⟩
|1010⟩
|1011⟩
|1100⟩
|1101⟩
|1110⟩
|1111⟩
computer), a qubit can be in one of two basic
states, which are written as |0⟩ and |1⟩. (A The input is a “trivial”
superposition in which all
classical bit is in state 0 or 1.) The quantum of the weight is on the input
magic is that a qubit can be in both states sequence of quantum gates value—here, 1001. The
simultaneously, in what’s called a superpo- (a.k.a. a quantum algorithm) superposition resulting
from the quantum
sition of these basic states. A qubit’s state is algorithm may spread the
α|0⟩ + β|1⟩, where α and β are “weights” weight across all 2n states.
|0000⟩
|0001⟩
|0010⟩
|0011⟩
|0100⟩
|0101⟩
|0110⟩
|0111⟩
|1000⟩
|1001⟩
|1010⟩
|1011⟩
|1100⟩
|1101⟩
|1110⟩
|1111⟩
with |α| +|β| = 1. (Actually, the weights α
2 2

and β are complex numbers, but the basic idea The final output of the
will come across if we think of them as real algorithm is a single n-bit
sequence resulting from a
numbers instead.) Thus, while there are only measurement (with the
two states of a bit, there are infinitely many measurement probability of a particular
states that a qubit can be in. So a qubit’s state output corresponding to the
coefficient of that state in
contains a huge amount of information. But, the superposition)—here,
0101
by the laws of quantum physics, we are lim- 0101.
ited in how we can extract that information
Figure 10.11 A rough schematic of a quantum algorithm.
from a qubit. Specifically, we can measure a
qubit, but we only see 0 or 1 as the output. When we measure a qubit α|0⟩ + β|1⟩, the probability that we see 0 is
|α|2 ; the probability that we see 1 is |β|2 . For example, we might have a qubit in the state
√   √ 2
2 |0⟩ 2 |1⟩.
1 3 1 2 3
+ (Note 2 + 2 = 14 + 34 = 1.)

If we measure it, we see a 0 with probability 25% (and otherwise we see a 1). See Figure 10.11.
There are two more crucial points. First, when there are multiple qubits—say n of them—the qubits’ state is a
superposition of 2n basic states. (For example, in the same way that two-bit classical system can be in state 00, 01, 10,
or 11, a two-qubit system is in state α00 |00⟩ + α01 |01⟩ + α10 |10⟩ + α11 |11⟩.) Second, even though we only see one
value when we measure qubits, there can be “cancellation” (or interference) among coefficients. There are notable
restrictions on how we can operate on qubits, based on constraints of physics, but at a very rough level, we can run
an operation on an n-qubit quantum computer in parallel in each of the 2n basic states and, if the process is designed
properly, still read something useful from our single measurement.
Why does anyone care? Much of the interest in quantum computation stems from a major breakthrough, Shor’s
algorithm, a quantum algorithm that efficiently solves the factoring problem—given a large integer n, determine n’s
prime factorization. Efficient factoring is deeply problematic for most currently deployed cryptographic systems (see
Chapter 7), so a functional quantum computer would be a big deal. And there are other important quantum algorithms,
in cryptography and in other areas, too. Building a practical quantum computer of appreciable size remains a major
engineering effort; at the moment, it’s a largely theoretical device—but there’s active research both on the physics
side (can we actually build one?) and on the algorithmic side (what else could we do if we did build one?).
(For much more about how quantum computers work, or what they would be able to do, see [86]. The efficient
quantum factoring algorithm is due to Peter Shor (b. 1959) [117]. Quantum computers with 53 qubits were first built
in 2019, by Google and IBM; that’s the same year that Google claimed to have shown a quantum computer that can
perform tasks beyond the reach of classical computers (though there is scholarly debate about this claim) [11].)

ISTUDY
10.2 Probability, Outcomes, and Events 523

Computer Science Connections

Information, Charles Dickens, and the Entropy of English


Consider the following two (identical-length) sequences of letters and spaces—one from Charles Dickens’s A Tale
of Two Cities and one generated by uniformly randomly choosing a sequence of elements of {A, . . . , Z, }:

IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES, IT WAS THE AGE OF WISDOM, IT WAS THE AGE OF
FOOLISHNESS, IT WAS THE EPOCH OF BELIEF, IT WAS THE EPOCH OF INCREDULITY.

TUYSSUWWYVOZULF XZQBSFS AFNBMAOOGWZPAHGREAYC SUSCMBOWDCNCYEJBHPVCRO MLVTGVHTVCZXHSCQFULCMBO


CDIWTXOCUPKTFZVNBHRGDWAKZSZPFTZKEWKWIH O QFIUWTCDKUBTQSPLXSYXGQZA DLXBHKFILFPZ.

Which sequence contains more information? It is very tempting to choose the first (information about contrast, and
irony—and liberty, equality, and fraternity!), but, in a precise technical sense, Random contains far more information
than Dickens. The basic reason is that, in Dickens, certain letters occur far more frequently than others—E occurs
17 times and there are six letters that don’t appear at all. (In Random, all 26 letters appear.) With such a lopsided
distribution, you already know a lot about what letter is (probably) going to come next, and so there’s less new
information conveyed by a typical letter.
Formally, the entropy of a sequence of letters (or
bits, or whatever) is a measure of “how surprising”
each element of the sequence is, averaged over the
sequence. We’ll convert the “unit of surprise” into a
real number between zero and one, where zero corre-
sponds to the next letter is 100% predictable and one
corresponds to we have absolutely no idea what the p = 0.25 p = 0.5 p = 0.9
next letter will be. Mathematically, the entropy H of
a probability distribution over S is defined as Figure 10.12 A sequence of bits, produced independently
∑ at random with probability p ∈ {0.25, 0.5, 0.9} of a .
− Pr [x] · log(Pr [x]). These sequences’ entropies are, respectively, 0.8113,
x∈S 1.0000, and 0.4690.
For example, if we produce a sequence of coin flips
where each flip comes up heads with probability p (see Figure 10.12), then the entropy of the sequence will be
− (p log p + (1 − p) log(1 − p)), as shown in Figure 10.13.
This definition of entropy comes from the 1940s, in
1
a paper by Claude Shannon [115], and has found all
sorts of useful applications since. Here is one example:
the entropy of a sequence of bits expresses a theoretical 0.8
limit on the compressibility of that sequence. (And that
theoretical limit is, in fact, achievable.) That is, if the 0.6
entropy

entropy of a string of n bits is very low—say around


0.25—then with some clever algorithms we can repre- 0.4
sent that string (without any error) using only about n4
bits. But we can’t represent it in fewer bits with perfect 0.2
fidelity (“lossless” compression; see p. 475).
There is significant redundancy in English text, as 0
we’ve already mentioned, based on the nonuniformity 0 0.25 0.5 0.75 1
in the probability distribution of individual letters. But probability p
there’s even more redundancy based on the fact that
the probability that the ith character of an English doc- Figure 10.13 The entropy of a biased coin whose heads
ument is an H is affected by whether the (i − 1)st probability is p.
character was a T. (In the language of Section 10.3,
these events are not independent.) If you’ve seen the letters ␣TH in succession, you can make a very good bet that E is
coming next. Compression schemes for English make use of this phenomenon. (For more on entropy, compressibility,
and information theory generally, a great classic reference is [34].)

ISTUDY
524 Probability

EXERCISES
10.1 Philippe flips a fair coin 100 times. Let the outcome be the number of heads that he sees. What is the sample space?
10.2 For Philippe’s 100 flips, what is Pr [0]?
10.3 For Philippe’s 100 flips, what is Pr [50]?
10.4 For Philippe’s 100 flips, what is Pr [64]?
10.5 Philippe now flips his fair coin n times. What’s the probability of the event “there are (strictly) more heads than tails” if n = 2?
10.6 Repeat Exercise 10.5 for n = 3.
10.7 Repeat Exercise 10.5 for n = 1001. (Hint: Pr [k] = Pr [1001 − k].)
10.8 Repeat Exercise 10.5 for an arbitrary positive integer n.
10.9 Bridget plays bridge. Bridge is a card game played with a standard 52-card deck. Each player is initially dealt a hand of 13 cards;

assume a fair deal in which each of the 52
13 hands is equally likely. What is the probability of being dealt both A♣ and A♢?
10.10 Suppose Bridget receives a uniformly drawn hand of 13 cards, in a uniformly random order. Because your ex-friend Peter was
trying to cheat at poker with this deck, the A♣ card is marked. You observe that the card the fourth-from-the-right position in
Bridget’s hand is A♣. What is the probability that Bridget also has the A♢ in her hand?
10.11 Most casual bridge players sort their hands by suit (♠, ♡, ♣, ♢ from left to right), and decreasing from left to right by rank within
each suit. (So your hand might be ♠AK4 ♡983 ♣AKQ ♢AJ98, reading from left to right.) Professional players are taught not to
sort their hands, because doing so causes which card they play to leak information about the rest of their hand to the other players.
Suppose Bridget receives a uniformly drawn hand of 13 cards, and sorts the cards in her hand. Peter’s card marking is still present,
and you observe the A♣ in the fourth-from-the-right (that is, fourth-from-the-lowest) card in Bridget’s hand. What is the probability
that Bridget also has the A♢ in her hand? (That is: out of all hands for which A♣ is fourth-from-the-lowest card, what fraction
also have the A♢?)
10.12 Suppose A♣ is the rightmost (that is, lowest) card in Bridget’s hand. What is the probability that Bridget also has the A♢?
10.13 Suppose A♣ is the leftmost (that is, highest) card in Bridget’s hand. What is the probability that Bridget also has the A♢?

Chrissie plays Cribbage. Cribbage is a card game played with a standard 52-card deck. For the purposes of these questions, assume

that a player is dealt one of the 52
4 different 4-card hands, chosen uniformly at random. Cribbage hands are awarded points for
having a variety of special configurations:
• A flush is a hand with all four cards from the same suit.
• A run is a set of at least 3 cards with consecutive rank. (For example, the hand 3♡, 9♣, 10♢, J♣ contains a run.)
• A pair is a set of two cards with identical rank.
Aces are low in Cribbage, so A, 2, 3 is a valid run, but Q, K, A is not.
10.14 What’s the probability that Chrissie is dealt a flush?
10.15 What’s the probability that Chrissie is dealt a run of length 4?
10.16 What’s the probability that Chrissie is dealt two runs of length 3 that is not a run of 4? (For example, the hand 9♡, 9♣, 10♢, J♣
contains two runs of length 3: the first is 9♡, 10♢, J♣ and the second is 9♣, 10♢, J♣.)
10.17 What’s the probability that Chrissie is dealt one (and only one) run of length 3 (and not a run of length 4)?
10.18 What’s the probability that Chrissie is dealt at least one pair? (Hint: Pr [getting a pair] = 1 − Pr [getting no pair].)
10.19 What’s the probability that Chrissie is dealt two or more pairs? (In cribbage, any two cards with the same rank count as a pair; for
example, the hand 2♡2♢2♠8♣ has three pairs: 2♡2♢ and 2♡2♠ and 2♢2♠.)
10.20 (programming required) Write a program to approximately verify your calculations from these Cribbage exercises, as follows:
generate 1,000,000 random hands from a standard deck, and count the number of those samples in which there’s a flush, run (of
the three flavors), pair, or multiple pairs.
10.21 (programming required) Modify your program to exactly verify your calculations: exhaustively generate all 4-card hands, and
count the number of hands with the various features (flushes, runs, pairs).
10.22 A fifteen is a subset of cards whose ranks sum to 15, where an A counts as 1 and each of {10, J, Q, K} counts as 10. (For example,
the hand 3♡, 2♣, 5♢, J♣ contains two fifteens: 3♡ + 2♣ + J♣ = 15 and 5♢ + J♣ = 15.) What’s the probability a 4-card hand
contains at least one fifteen? (Hint: use a program.)
10.23 A bitstring x ∈ {0, 1}5 is stored in vulnerable memory, subject to corruption—for example, on a spacecraft. An α-ray strikes the
memory and resets one bit to a random value (both the new value and which bit is affected are chosen uniformly at random). A
second α-ray strikes the memory and resets one bit (again chosen uniformly at random). What’s the probability that the resulting
bitstring is identical to x?
10.24 Recall the quick sort algorithm for sorting an array A, shown in Figure 10.14. (Assume that the elements of A are all distinct.) Quick
Sort is efficient if the two sublists are close to equal in size, and the Random strategy is a common (and good!) way to (usually)

ISTUDY
Exercises 525

quickSort(A[1 . . . n]): Two strategies for choosing a pivot in Line 4:


1 if n ≤ 1 then Random: choose pivot uniformly at random from A.
2 return A
Median of Three: choose three elements p1 , p2 , p3 uniformly at random from
3 else the set {1, . . . , n}. Then choose pivot as the pi whose corresponding element
4 choose pivot ∈ {1, . . . , n}, somehow. of A is the median of the three. (The same index can be chosen more than
once—for example, p1 = p3 is allowed.) For example, for the array
5 L := list of all A[i] where A[i] < A[pivot]. A = ⟨94, 32, 29, 85, 64, 8, 12, 99⟩, we might randomly choose p1 = 1,
6 R := list of all A[i] where A[i] > A[pivot]. p2 = 7, and p3 = 2. Then the pivot will be p3 because A[p3 ] = 32 is between
7 return quickSort(L) + ⟨A[pivot]⟩ + quickSort(R) A[p2 ] = 12 and A[p1 ] = 94.

Figure 10.14 Quick Sort, briefly. (See Figure 5.19a for more detail.) Assume that the elements of A are all distinct.

produce two sublists of roughly equal size. Suppose we select pivot in Line 4 uniformly at random from the set {1, . . . , n}. As a
function of n, what is the probability that |L| ≤ 3n/4 and |R| ≤ 3n/4? (You may assume that n is divisible by 4.)
10.25 Under the Random strategy, as a function of n and α ∈ [0, 1], what is the probability |L| ≤ αn and |R| ≤ αn? (You may neglect
issues of integrality: assume αn is an integer.)
10.26 Under the Median of Three strategy, what is the probability that |L| ≤ 3n/4 and |R| ≤ 3n/4? Assume n is large; for ease, you may
neglect issues of integrality in your answer.
10.27 Under the Median of Three strategy, as a function of α ∈ [0, 1], what is the probability |L| ≤ αn and |R| ≤ αn? Again, you may
assume that n is large, and you may neglect issues of integrality in your answer.
10.28 Suppose that Team Emacs and Team VI play a best-of-five series of softball games. Emacs, being better than VI, wins each game
with probability 60%. (“Emacs” rhymes with “ski wax”; “VI” rhymes with “knee-high.” The teams are named after two text editors
popular among some computer scientists.) Use a tree diagram to compute the probability that Team Emacs wins the series.
10.29 For the scenario in Exercise 10.28, what is the probability that the series goes five games? (That is, what is the probability that
neither team wins 3 of the first 4 games?)
10.30 Update your last two answers if Team Emacs wins each game with probability 70%.
10.31 (calculus required) Now assume that Team Emacs wins each game with probability p, for an arbitrary value p ∈ [0, 1]. Write down
a formula expressing the probability that there is a fifth game in the series. Also find the value of p that maximizes the probability,
and the probability of the specified event for this maximizing p.
10.32 (calculus required) Repeat Exercise 10.31 for the event that there is a fourth game in the series.
10.33 (calculus required) Repeat for the event that there is a fourth game of the series and Team Emacs wins that fourth game.

Let S be a sample space, and let Pr : S → [0, 1] be an arbitrary function satisfying the requirements of being a probability function
(Definition 10.2). Argue briefly that the following properties hold.
10.34 For any outcome s ∈ S, we have Pr [s] ≤ 1.
 
10.35 For any event A ⊆ S, we have Pr A = 1 − Pr [A]. (Recall that A = S − A.)
10.36 For any events A, B ⊆ S, we have Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
S  P
10.37 The Union Bound: for any events A1 , A2 , . . . , An , we have Pr i Ai ≤ i Pr [Ai ].

Imagine n identical computers that share a single radio frequency for use as a network connection. Each of the n computers would
like to send a packet of information out across the network, but if two or more different computers simultaneously try to send a
message, no message gets through. Here you’ll explore another use of randomization: using randomness for symmetry breaking.
10.38 Each computer flips a coin that comes up heads with probability p. What is the probability that exactly one of the n machines’ coins
comes up heads (and thus that machine can send its message)? Your answer should be a formula that’s in terms of n and p.
10.39 (calculus required) Given the formula you found in Exercise 10.38, what p should you choose to maximize the probability of a
message being successfully sent?
10.40 (calculus required) What is the probability of success if you choose p as in Exercise 10.39? What is the limit of this quantity as n
grows large? (You may use the following fact: (1 − m1 )m → e−1 as m → ∞.)
10.41 Suppose that we hash items into a 10-slot hash table using a hash function h that uniformly assigns elements to {1, . . . , 10}. If we
hash 3 elements into an 10-slot table, what’s the probability that no collisions occur?
10.42 If we hash 3 elements into an 10-slot table, what’s the probability that all 3 elements have the same hash value?
10.43 If we hash 3 elements into an 10-slot table using linear probing, what’s the probability that at least 2 adjacent slots are filled? Count
slot #10 as adjacent to #1. (In linear probing, an element x that hashes to an occupied cell h(x) is placed in the first unoccupied cell
after h(x), wrapping around to the beginning of the table if necessary. See Figure 10.15a for a reminder.)
10.44 If we hash 3 elements into an 10-slot table using linear probing, what’s the probability that 3 adjacent slots are filled?

ISTUDY
526 Probability

h(x) = 4
g(x) = 1
(a) (b) (c)
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
h(y) = 4
A reminder of linear probing: we try to put Quadratic probing: we try to store x in slot g(y) = 3
x into h(x), then h(x) + 1, then h(x) + 2, h(x), then h(x) + 12 , then h(x) + 22 , etc.
etc., wrapping around to the beginning of
the table after the 10th slot. 1 2 3 4 5 6 7 8 9 10
Double hashing: we try to store x in slot
h(x), then h(x) + g(x), then h(x) + 2g(x),
etc. (wrapping around the table as
necessary).

Figure 10.15 Handling collisions in hash tables: (a) linear probing; (b) quadratic probing; and (c) double hashing.

One issue with resolving collisions by linear probing is called clustering: if there’s a large block of occupied slots in the hash
table, then there’s a relatively high chance that the next element placed into the table extends that block. Because linear probing
suffers from this clustering issue, other mechanisms for resolving collisions are sometimes used. Another choice is called quadratic
probing: we change the cell number we try by an increasing step size at every stage, instead of by one every time. Specifically, to
hash x into an n-slot table, first try to store x in h(x); if that cell is full, try putting x into h(x) + i2 , wrapping back around to the
beginning of the table as usual, for i = 1, 2, . . . . (Linear probing tried slot h(x) + i instead.) See Figure 10.15b.
10.45 Suppose that we currently have a single block of k adjacent slots full in an n-slot hash table, and all other slots are empty. If we
use linear probing, what’s the probability that the next element inserted into the hash table extends that block (that is, leaves k + 1
adjacent slots full)?
10.46 (programming required) Write a program to hash 5000 elements into a 10,007-slot hash table using linear probing. Record which
cell x5000 ends up occupying—that is, how many hops from h(x5000 ) is x5000 ? Run your program 2048 times, and report how far,
on average, x5000 moved from h(x5000 ). Also report the maximum distance that x5000 moved.
10.47 (programming required) Modify your program from Exercise 10.46 to use quadratic probing instead, and report the same statistics:
the mean and maximum number of cells probed for x5000 .
10.48 In about one paragraph, explain the differences that you observed between linear and quadratic probing. A concern called secondary
clustering arises in quadratic probing: if h(x) = h(y) for two elements x and y, then the sequence of cells probed for x and y is
identical. These sequences were also identical for linear probing. In your answer, explain why secondary clustering from quadratic
probing is less of a concern than the clustering from linear probing.
10.49 (programming required) A fourth way of handling collisions in hash tables (after chaining, linear probing, and quadratic probing)
is what’s called double hashing: we move forward by the same number of slots at every stage, but that number is randomly chosen,
as the output of a different hash function. Specifically, to hash x into an n-slot table, first try to store x in h(x); if that cell is full, try
putting x into h(x) + i · g(x), wrapping back around to the beginning of the table as usual, for i = 1, 2, . . . . (Here g is a different
hash function, crucially one whose output is never zero.) See Figure 10.15c. Modify your program from Exercises 10.46 and 10.47
to use double hashing. Again report the mean and maximum number of cells probed for x5000 .
10.50 In about one paragraph, explain the differences you observe among chaining, linear probing, quadratic probing, and double hashing.
Is there any reason you wouldn’t always use double hashing?

Consider a randomized algorithm that solves a problem on a particular input correctly with probability p, and it’s wrong with
probability 1 − p. Assume that each run of the algorithm is independent of every other run, so that we can think of each run
as being an (independent) coin flip of a p-biased coin (where heads means “correct answer”). Suppose that the probability p is
unknown to you.
10.51 (calculus required) You observe that exactly k out of n trials gave the correct answer. Then the number k of correct answers follows
a binomial distribution with parameters n and p: that is, the probability that exactly k runs give the correct answer is
n n−k .
k · p · (1 − p) (∗)
k

Prove that the maximum likelihood estimate of p is p = nk —that is, prove that (∗) is maximized by p = nk .
10.52 (calculus required) You observe that it takes n trials before the first time you get a correct answer. Then n follows a geometric
distribution with parameter p: that is, the probability that n runs were required is given by

(1 − p)n−1 p. (†)

Prove that the maximum likelihood estimate of p is p = n1 —that is, prove that (†) is maximized by p = 1n .

ISTUDY
10.3 Independence and Conditional Probability 527

10.3 Independence and Conditional Probability


Things don’t happen because they’re bad or good, else all eggs would be addled or none at
all, and at the most it is but six to the dozen. There’s good chances and bad chances, and
nobody’s luck is pulled only by one string.
George Eliot (1819–1880)
Felix Holt, the Radical (1866)

Imagine that you’re interviewing to be a consultant for Premier Passenger Pigeon Purveyors, a com-
pany that pitches its products to prospective pigeon purchasers using online advertising—specifically, by
displaying ads to users of a particular search engine on the web. PPPP makes $50 profit from each sale,
and, from historical data, they have determined that 0.02% of searchers who see an ad buy a pigeon. The
interviewer asks you how much PPPP should be willing to pay to advertise to a searcher. A good answer
is $0.01: on average, PPPP earns $50 · 0.0002 = $0.01 per ad, so paying anything up to a penny per ad
yields a profit, on average. But you realize that there’s a better answer (and, by giving it, you get the job):
it depends on what the user is searching for! A user who searches for BIRD or PIGEON or BUYING A PET
TO COMBAT LONELINESS is far more likely to respond to a PPPP ad than an average user, while a user
who searches for ORNITHOPHOBIA is much less likely to respond to an ad.
It is a general phenomenon in probability that knowing that event A has occurred may tell you that an
event B is much more likely (or much less likely) to occur than you’d previously known. In this section, we’ll
discuss when knowing that an event A has occurred does or does not affect the probability that B occurs
(that is, whether A and B are dependent or independent, respectively). We’ll then introduce conditional
probability, which allows us to state and manipulate quantities like “the probability that B happens given
that A happens.”

10.3.1 Independence and Dependence of Events


We’ll start with independence and dependence of events. Intuitively, two events A and B are dependent if
A’s occurrence/nonoccurrence gives us some information about whether B occurs; in contrast, A and B are
independent when A occurs with the same probability when B occurs as it does when B does not occur:

Definition 10.9: Independent and dependent events.


Two events A and B are independent if and only if Pr [A ∩ B] = Pr [A] · Pr [B]. The events A and B are
called dependent if they are not independent.

If A and B are dependent events, then we can also say that A and B are correlated; independent events are
called uncorrelated. A visual representation of independent and dependent events is shown in Figure 10.16.
A little manipulation of the equation from Definition 10.9 may help to make the connection to the above
intuition clearer. Assume for now that Pr [B] ̸= 0. (Exercise 10.71 addresses the case of Pr [B] = 0.)
Considering the equality Pr [A] · Pr [B] = Pr [A ∩ B] from the definition, and dividing both sides by Pr [B],
we see that the events A and B are independent if and only if
Pr [A ∩ B]
Pr [A] = .
Pr [B]
The left-hand side of this equation, Pr [A], denotes the fraction of the time that A occurs. The right-hand
side, Pr [A ∩ B] /Pr [B], denotes the fraction of the time when B occurs that A occurs too. If these two

ISTUDY
528 Probability

B B B B

A A

A A

(a) Independent events. First, an event A in a sample space S (the shaded rectangle has area Pr [A]). Second, an event B
with Pr [B] = 0.5 (the shaded region has area 0.5). A and B are independent, because A ∩ B has area equal to
0.5 · Pr [A].
C C

A C A C

A A

(b) Dependent events. First, the event A, repeated. Second, an event C, again with Pr [C] = 0.5 (the shaded region has
area 0.5). A and C are not independent, because A ∩ C has area different from (much bigger than) 0.5 · Pr [A].

Figure 10.16 Independent and dependent events. (A region’s area represents its probability; the sample space—the
enclosing rectangle—has area 1.)

fractions are equal, then A occurs with the same probability when B occurs as it does when B does not
occur. (And if these two fractions are equal, then both when B occurs and when B does not occur, A occurs
with probability Pr [A]—that is, the probability of A without reference to B.)

Examples of independent and dependent events


To establish that two events A and B are independent, we can simply compute Pr [A], Pr [B], and Pr [A ∩ B],
and show that the product of the first two quantities is equal to the third. Here are a few examples:

Example 10.14: Some independent events.


The following pairs of events are independent:
Coin flips. I flip a fair penny and a fair nickel. Define the following events:
Event A: The penny is heads. Event B: The nickel is heads.
Then Pr [A] = 0.5 and Pr [B] = 0.5 and Pr [A ∩ B] = 0.25 = 0.5 · 0.5.
Cards. I draw a card from a randomly shuffled deck. Define the following events:
Event A: I draw an ace. Event B: I draw a heart.
For these events, we have
1
Pr [A] = Pr [{A♣, A♢, A♡, A♠}] = 13
1
Pr [B] = Pr [{A♡, 2♡, . . . , K♡}] = 4
Pr [A ∩ B] = Pr [{A♡}] = 1
52 = 1
4 · 1
13 .

ISTUDY
10.3 Independence and Conditional Probability 529

Dice. I roll a fair red die and a fair blue die. Define the following events:
Event A: The red die is odd. Event B: The sum of the rolled numbers is odd.
Then, writing outcomes as ⟨the red roll, the blue roll⟩, we have

Pr [A] = Pr [{1, 3, 5} × {1, 2, 3, 4, 5, 6}] = 18


36 = 0.5
Pr [B] = Pr [ {1, 3, 5} × {2, 4, 6} ∪ {2, 4, 6} × {1, 3, 5} ] = 18
36 = 0.5
red odd, blue even red even, blue odd

Observe that A ∩ B = {1, 3, 5} × {2, 4, 6}, and so Pr [A ∩ B] = 9


36 = (0.5) · (0.5).

Any time the processes by which A and B come to happen are completely unrelated, it’s certainly true that
A and B are independent. But events can also be independent in other circumstances, as we saw in the dice
scenario in Example 10.14: both events in some way incorporated the result of the value rolled on the red
die, but the stated events themselves are independent anyway.
Example 10.14 showed that pairs of events are independent by showing that Pr [A ∩ B] = Pr [A] · Pr [B].
Similarly, we can establish that two events are not independent—that is, they are dependent—directly from
the definition by showing that Pr [A ∩ B] is not equal to Pr [A] · Pr [B]. Here are a few examples:

Example 10.15: Some dependent events.


The following pairs of events are not independent:
Dice. I roll a fair die. Define the following events:
Event A: I roll an odd number. Event B: I roll a prime number.
Then we have
1
Pr [A] = Pr [{1, 3, 5}] = 2
1
Pr [B] = Pr [{2, 3, 5}] = 2
Pr [A ∩ B] = Pr [{3, 5}] = 2
6.

Because Pr [A ∩ B] is 26 = 13 , but Pr [A] · Pr [B] = 12 · 12 = 14 ̸= 13 , the events A and B are dependent.


Similarly, define Event C as “I roll an even number.” Because Pr [B] = Pr [C] = 12 and Pr [C ∩ B] =
Pr [{2}] = 16 ̸= 12 · 12 , the events B and C are dependent too.
Cards. I draw a card from a randomly shuffled deck. Define the following events:
Event A: I draw a heart. Event B: I draw a spade.
For these events, we have
1
Pr [A] = Pr [{A♡, 2♡, . . . , K♡}] = 4
1
Pr [B] = Pr [{A♠, 2♠, . . . , K♠}] = 4
Pr [A ∩ B] = Pr [∅] = 0,
where A ∩ B = ∅ because no cards are simultaneously a heart and a spade. Because 0 ̸= 1
16 = 1
4 · 14 , we
have Pr [A ∩ B] ̸= Pr [A] · Pr [B]. These events are dependent.
Coin flips. I flip a fair penny and a fair nickel. Define the following events:
Event A: The penny is heads. Event B: Both coins are heads.

ISTUDY
530 Probability

So Pr [A] = 0.5 and Pr [B] = 0.25 and Pr [A ∩ B] = 0.25 = Pr [B] ̸= Pr [A] · Pr [B].

Correlation of events
The pairs of dependent events from Example 10.15 are of two different qualitative types. Knowing that
the first event occurred can make the second event more likely to occur (“rolling an odd number” and
“rolling a prime number” for the dice) or less likely to occur (“rolling an even number” and “rolling a
prime number”):

Definition 10.10: Positive and negative correlation.


When two events A and B satisfy Pr [A ∩ B] > Pr [A] · Pr [B], we say that A and B are positively cor-
related. When Pr [A ∩ B] < Pr [A] · Pr [B], we say that A and B are negatively correlated. (And if
Pr [A ∩ B] = Pr [A] · Pr [B], then we say that A and B are uncorrelated.)

At the extremes, knowing that the first event occurred can ensure that the second event definitely does not
occur (“drawing a heart” and “drawing a spade” from Example 10.15) or can ensure that the second event
definitely does occur (“both coins are heads” and “the first coin is heads” from Example 10.15).
Here are some examples in which you’re asked to figure out whether certain events are correlated:

Example 10.16: Encryption by random substitution.


One simple form of encryption for text is a substitution cipher, in which (in the simplest version) we
choose a permutation of the alphabet, and then replace each letter with its permuted variant. (For example,
we might permute the letters as ABCDE· · · → XENBG· · · ; thus DECADE would be written as BGNXBG.)
Suppose we choose a random permutation for this mapping, so that each of the 26! orderings of the
alphabet is equally likely. Are the following events Q and Z independent or dependent?
Event Q = “the letter Q is mapped to itself (that is, Q is ‘rewritten’ as Q).”
Event Z = “the letter Z is mapped to itself.”

Solution. We must compute Pr [Q], Pr [Z], and Pr [Q ∩ Z]. Because each permutation is equally likely to
be chosen, we have
number of permutations π1,2,...,26 where π17 = 17 25! 1
Pr [Q] = = =
number of permutations π1,2,...,26 26! 26

because we can choose any of 25! orderings of all non-Q letters. Similarly,
number of permutations π1,2,...,26 where π26 = 26 25! 1
Pr [Z] = = = .
number of permutations π1,2,...,26 26! 26

To compute Pr [Q ∩ Z], we need to count the number of permutations π1...26 with both π17 = 17 and
π26 = 26. Any of the 24 other letters can go into any of the remaining 24 slots of the permutation, so
there are 24! such permutations. Thus
number of permutations π1,2,...,26 where π17 = 17 and π26 = 26 24! 1
Pr [Q ∩ Z] = = = .
number of permutations π1,2,...,26 26! 25 · 26

ISTUDY
10.3 Independence and Conditional Probability 531

Thus we have
Pr [Q ∩ Z] = 1
25·26 and Pr [Q] · Pr [Z] = 1
26 · 1
26 = 1
26·26 .
1
There’s only a small difference between 26·26 ≈ 0.00148 and 1
25·26 ≈ 0.00154, but they’re indubitably
different, and thus Q and Z are not independent.

(Incidentally, substitution ciphers are susceptible to frequency analysis: the most common letters in
English-language texts are ETAOIN—almost universally in texts of reasonable length—and the frequencies
of various letters is surprisingly consistent. See Exercises 10.73–10.77.)

Example 10.17: Matched flips of two fair coins.


I flip two fair coins (independently). Consider the following events:
Event A = the first flip comes up heads.
Event B = the second flip comes up heads.
Event C = the two flips match (are both heads or are both tails).
Which pairs of these events are independent, if any?
Solution. The sample space is {HH, HT, TH, TT}, and the events from the problem statement are
A = {HH, HT}, B = {HH, TH}, and C = {HH, TT}. Thus A ∩ B = A ∩ C = B ∩ C = {HH}—that
is, HH is the only outcome that results in more than one of these events being true. (See Figure 10.17.)
Because the coins are fair, every outcome in this sample space has probability 14 . Focusing on the
events A and B, we have
1
Pr [A] = Pr [{HH, HT}] = 2
1
Pr [B] = Pr [{HH, TH}] = 2
Pr [A ∩ B] = Pr [{HH}] = 1
4.

Thus Pr [A] · Pr [B] = 12 · 12 = 14 , and Pr [A ∩ B] = 41 . Because Pr [A] · Pr [B] = Pr [A ∩ B], the two
events are independent. The calculation is identical for the other two pairs of events, and so A and B are
independent; A and C are independent; and B and C are independent.

A B
HT TH
HH
Event A = the first flip comes up heads.
TT Event B = the second flip comes up heads.
Event C = the two flips match (are both heads or are both tails).
C

Figure 10.17 Two coin flips and three events.

ISTUDY
532 Probability

Example 10.18: Matched flips of two biased coins.


How would your answers to Example 10.17 change if the coins are p-biased instead of fair?
Solution. The sample space and events remain as in Example 10.17 (and Figure 10.17), but the outcomes
now have different probabilities:
Pr [HH] = p · p Pr [HT] = p · (1 − p) Pr [TH] = (1 − p) · p Pr [TT] = (1 − p) · (1 − p)
Using these outcome probabilities, we compute the event probabilities as follows:
Pr [A] = Pr [{HH, HT}] = p · p + p · (1 − p) =p (1)
Pr [B] = Pr [{HH, TH}] = p · p + (1 − p) · p =p (2)
Pr [C] = Pr [{HH, TT}] = p · p + (1 − p) · (1 − p) = p + (1 − p) . 2 2
(3)
Because A ∩ B = B ∩ C = A ∩ C = {HH}, we also have
Pr [A ∩ B] = Pr [B ∩ C] = Pr [A ∩ C] = Pr [HH] = p2 . (4)
Thus A and B are still independent, because Pr [A] · Pr [B] = p · p = p2 = Pr [A ∩ B] by (1), (2), and (4).
But what about events A and C? By (1), (3), and (4), we have
 
Pr [A] · Pr [C] = p · p2 + (1 − p)2 and Pr [A ∩ C] = p2 .
By a bit of algebra, we see that Pr [A ∩ C] = Pr [A] · Pr [C] if and only if
p2 = p(p2 + (1 − p)2 ) ⇔ 0 = p(p2 + (1 − p)2 ) − p2
⇔ 0 = 2p3 − 3p2 + p
⇔ 0 = p(2p − 1)(p − 1).
So the events A and C are independent—that is, Pr [A ∩ C] = Pr [A] · Pr [C]—if and only if p ∈ {0, 12 , 1}.
Thus events A and B are independent for any value of p, while events A and C (and similarly B and C)
are independent if and only if p ∈ {0, 12 , 1}.

Taking it further: While any two of the events from Example 10.17 (or Example 10.18 with p = 21 ) are independent, the
third event is not independent of the other two. Another way to describe this situation is that the events A and B ∩ C are not
independent: in particular, Pr [A ∩ (B ∩ C)] /Pr [B ∩ C] = 1 = ̸ Pr [A]. A set of events A1 , A2 , . . . , An is said to be pairwise
independent if, for any two indices i and j ̸= i, the events Ai and Aj are independent. More generally, these events are said to be
k-wise independent if, for any subset S of up to k of these events, the events in S are all independent. (And we say that the set of
events is fully independent if every subset of any size satisfies this property.) Sometimes it will turn out that we “really” care only
about pairwise independence. For example, if we think about a hash table that uses a “random” hash function, we’re usually only
concerned with the question “do elements x and y collide?”—which is a question about just one pair of events. Generally, we
can create a pairwise-independent random hash function more cheaply than creating a fully independent random hash function.
If we view random bits as a scarce resource (like time and space, in the style of Chapter 6), then this savings is valuable.

10.3.2 Conditional Probability


In Section 10.3.1, we discussed the black-and-white distinction between pairs of independent events and
dependent events: if A and B are independent, then knowing whether or not B happened gives you no
information about whether A happened; if A and B are dependent, then the probability that A happens if
B happened is different from the probability that A happens if B did not happen. But how does knowing
that B occurred change your estimate of the probability of A? Think about events like “the sky is clear”

ISTUDY
10.3 Independence and Conditional Probability 533

and “it is very windy” and “it will rain today”: sometimes B means that A is less likely or even impossible;
sometimes B means that A is more likely or even certain. Here we will discuss quantitatively how one
event’s probability is affected by the knowledge of another event. The conditional probability of A given B
represents the probability of A occurring if we know that B occurred:

Definition 10.11: Conditional probability.


Pr [A ∩ B]
The conditional probability of A given B, written Pr [A|B], is given by Pr [A|B] = .
Pr [B]

(The quantity Pr [A|B] is also sometimes called the probability of A conditioned on B.) We will treat Pr [A|B]
as undefined when Pr [B] = 0. Here are a few examples:

Example 10.19: Odds and primes.


I choose a number uniformly at random from {1, 2, . . . , 10}. Define these two events:
Event A: The chosen number is odd. Event B: The chosen number is prime.
For these events, we have
Pr[A∩B] Pr[{3,5,7}] 3 Pr[A∩B] Pr[{3,5,7}]
Pr [A|B] = Pr[B] = Pr[{2,3,5,7}] = 4 and Pr [B|A] = Pr[A] = Pr[{1,3,5,7,9}] = 53 .

Example 10.20: Dominoes.


Shuffle the dominoes in Figure 10.18, and draw one uniformly at random.

• What is the probability that you draw a domino with a 2 ( ) on it?


• You make a draw and see the domino . (Imagine the shaded side of the domino is covered by
your hand.) What’s the probability your domino has a 2?
• You draw and see that the domino is . What is the probability that you drew a domino with a 2?

Solution. For the first question, we are asked for the probability of drawing a domino with a 2:

contain a 2 do not contain a 2


3
Thus 3 of the 7 dominoes have a 2, so Pr [ ] = 7.
For the second question, we observe on our drawn domino. We’re asked for the probability of a 2:

Figure 10.18 Some dominoes.

ISTUDY
534 Probability

contains a 2 does not contain a 2 impossible (do not contain a 1)


We know that the domino you drew must have been either or . These two dominoes were equally
likely to be drawn, and one of these two has a , so there’s a 12 probability that you drew a 2. Using
conditional probability notation, we can write this quantity as Pr [ | ] = 21 .
Finally, we must compute Pr [ | ], the probability of a given that we observed a . By the definition
of conditional probability, we have
Pr [ ∩ ] 0
Pr [ | ] = = 1 = 0.
Pr [ ] 7

(A less notationally heavy way to write this argument: we see a , so we know that the domino you drew
must have been . This domino doesn’t have a , and so there’s zero chance that we observe a .)

Conditional probability as “zooming in” (and another example)


Intuitively, we can think of Pr [A|B] as “zooming” the universe down to the set B. The basic idea that we
used in Example 10.20 was to narrow the set of possible outcomes to those consistent with the observed
partial data about the drawn domino, and then compute the fraction of the narrowed sample space for which
A occurs. This view of conditional probability is illustrated in Figure 10.19.
Here’s one more example, where we condition on slightly more complex events.

Example 10.21: Coin flips.


Flip a fair coin 10 times (with all flips independent: the ith flip has no effect on the jth flip for j ̸= i).
Write H to denote the event of getting at least 9 heads.
1 What is Pr [H]?
2 Let A be the event “the first flip comes up heads.” What is Pr [H|A]?
3 Let B be the event “the first flip comes up tails.” What is Pr [H|B]?
4 Let C be the event “the first three flips come up heads.” What is Pr [H|C]?
5 Let D be the event “we get at least 8 heads.” What is Pr [H|D]?

10
Solution. Every outcome—every element of {H, T} —is equally likely, each with probability 1/210 .
 10
9 + 10 = 10+1 = 11, so Pr [H] = 11/2 ≈ 0.0107.
1 The number of outcomes with 9 or 10 heads is 10 10

For the conditional probabilities, we will compute Pr [H ∩ X] and Pr [X] for each of the stated events
X. The final answer is their ratio. Because each outcome is equally likely, we only have to compute the
cardinality of the given events (and the cardinality of their intersection with H) to answer the questions.
2 For A (the first flip comes up H), we have |A ∩ H| = 10: there are 9 outcomes with one tails that start
with a heads (HTHHHHHHHH, HHTHHHHHHH, . . ., HHHHHHHHHT) and 1 outcome with zero
tails (HHHHHHHHHH). Thus Pr [A ∩ H] = 10/210 . Clearly Pr [A] = 21 . Thus
Pr [A ∩ H] 10/210 10
Pr [H|A] = = = 9 ≈ 0.01953.
Pr [A] 1/2 2

ISTUDY
10.3 Independence and Conditional Probability 535

S
A
A A
B
B B

(a) A sample space S and two events A (b) Conditioning on the event B. (c) Conditioning on the event A.
and B. Any outcome in S can be chosen, Any outcome in B can be chosen, Any outcome in A can be chosen,
and so in this example Pr [A] ≈ 0.4 and and so Pr [A|B] is the fraction of and so Pr [B|A] is the fraction of
Pr [B] ≈ 0.1. those outcomes for which A those outcomes for which B
occurs, so here Pr [A|B] ≈ 0.8. occurs, so here Pr [B|A] ≈ 0.2.

Figure 10.19 A view of conditional probability. (Think of an event’s probability as being represented by its area.)

3 For B (the first flip comes up T), we’ve already “used up” the single permitted non-heads in the first
flip, so there’s only one outcome in B ∩ H, namely THHHHHHHHH. Again, Pr [B] = 12 . Therefore
Pr [B ∩ H] 1/210 1
Pr [H|B] = = = 9 ≈ 0.00195.
Pr [B] 1/2 2
4 For C (the first three flips come up H), we have Pr [C] = 18 . The outcomes in C ∩ H are start with HHH
 
followed by 6+ heads in the last 7 flips. There are 77 + 76 = 8 such outcomes. Thus
Pr [C ∩ H] 8/210 64
Pr [H|C] = = = 10 ≈ 0.0625.
Pr [C] 1/8 2
5 For D (there are at least 8 heads), we have Pr [H ∩ D] = Pr [H] = 11/210 . (There are no outcomes in
which we get 9+ heads but fail to get 8+ heads!) The probability of getting 8+ heads in 10 fair flips is
10
 10
 10

8 + 9 + 10 45 + 10 + 1 56
Pr [D] = = = 10 .
210 210 2
And therefore
Pr [D ∩ H] 11/210 11
Pr [H|D] = = = ≈ 0.1964.
Pr [D] 56/210 56

To repeat the word of warning from early in this chapter: it can be very difficult to have good intuition
about probability questions. For example, the last problem in Example 10.21 asked for the probability of
getting 9+ heads in 10 flips conditioned on getting 8+ heads. It may be easy to talk yourself into believing
that, of the times that we get 8+ heads, there’s a ≈ 50% chance of getting 9 or more heads. (“Put aside
the first 8 heads, and look at one of the other flips—it’s heads with probability 12 , so we get a 9th heads
with probability 12 .”) But this intuition is blatantly wrong. Another way of thinking about the calculation
in the last part of Example 10.21 is to observe that there are 56 outcomes with 8, 9, or 10 heads. Only 11
of these outcomes have 9 or 10 heads. Each outcome is equally likely. So if we’re promised that one of the
56 outcomes occurred, then there’s an 11 56 chance that one of the 11 occurred.

Taking it further: So far, we have considered only random processes in which each outcome that can occur does so with proba-
bility ϵ > 0—that is, there have been no infinitesimal probabilities. The restriction to non-infinitesimal probabilities is generally a
reasonable one to make for CS applications, but it is a genuine restriction. (It’s worth noting that we have encountered an infinite
sample space before—just one that didn’t have any infinitesimal probabilities. In a geometric distribution with parameter 12 , for

ISTUDY
536 Probability

example, any positive integer k is a possible outcome, with Pr [k] = 1/2k , which is a finite, albeit very small, probability for any
positive integer k.) But we can imagine scenarios in which infinitesimal probabilities make sense.
For example, imagine a probabilistic process that chooses a real number x between 0 and 1, where each element of the sample
space S = {x : 0 ≤ x ≤ 1} is equally likely to be chosen. We can make probabilistic statements like Pr [x ≤ 0.5] = 12 —half the
time, we end up with x ≤ 0.5, half the time we end up with x ≥ 0.5—but for any particular value c, the probability that x = c
is zero! (Perhaps bizarrely, Pr [x ≤ 0.5] = Pr [x < 0.5]. Indeed, Pr [x = 0.5] cannot be ϵ > 0, for any ϵ. Every possible outcome
has to have that same probability ϵ of occurring, and for any ϵ > 0 there are more than ϵ1 real numbers between 0 and 1. So we’d
violate (10.2.1) if we had Pr [x = 0.5] > 0.)
To handle infinitesimal probabilities, we need calculus. We can describe the above circumstance with a probability density func-
R
tion p : S → [0, 1], so that, in place of (10.2.1), we require x∈S p(x)dx = 1. (For a uniformly chosen x ∈ [0, 1], we have p(x) = 1;
1
for a uniformly chosen x ∈ [0, 100], we have p(x) = 100 .) Some of the statements that we’ve made in this chapter don’t apply in
the infinitesimal case. For example, the “zooming in” view of conditional probability from Figure 10.19 doesn’t quite work in
the infinitesimal case. In fact, we can consider questions about Pr [A|B] even when Pr [B] = 0, like what is the probability that
a uniformly chosen x ∈ [0, 100] is an integer, conditioned on x being a rational number?. (And Exercise 10.71—if Pr [B] = 0,
then A and B are independent—isn’t true with infinitesimal probabilities.) But details of this infinitesimal version of probability
theory are generally outside of our concern here, and are best left to a calculus/analysis-based textbook on probability.

Relating independence of events and conditional probability


Consider two events A and B for which Pr [B] ̸= 0. Observe that A and B are independent if and only if
Pr [A|B] = Pr [A]:

A and B are independent ⇔ Pr [A] · Pr [B] = Pr [A ∩ B] definition of independence

⇔ Pr [A] = Pr[A∩B]
Pr[B] dividing both sides by Pr [B]
⇔ Pr [A] = Pr [A|B] . definition of Pr [A|B]

(This calculation doesn’t work when Pr [B] = 0—we can’t divide by 0, and Pr [A|B] is undefined—but see
Exercise 10.71.) Notice again that this relationship is an if-and-only-if relationship: when A and B are not
independent, then Pr [A] and Pr [A|B] must be different. Here is a small example:

Example 10.22: Self-mapped letters in substitution ciphers.


Example 10.16 showed that, for a random permutation π of the alphabet, the events Q (Q is mapped to
1 1
itself by π) and Z (Z is mapped to itself by π) were not independent: specifically, Pr [Q] = 26 , Pr [Z] = 26 ,
1/(25·26)
and Pr [Q ∩ Z] = 25·26 . Thus Pr [Q|Z] = Pr[Z] = 1/26 = 25 . Compare Pr [Q|Z] = 25 = 0.04 to
1 Pr[Q∩Z] 1 1

Pr [Q] = 26 1
≈ 0.03846: knowing that Z is mapped to itself makes it slightly more likely that Q is also
mapped to itself. Event Z makes Q slightly more probable because, when Z occurs, Z cannot be mapped
to Q, so there are only 25 letters “competing” to be mapped to Q instead of 26.

Problem-solving tip: Often it is easier to get intuition about a probabilistic statement by imagining an absurdly small variant
of the problem. Here, for example, imagine a 2-letter alphabet Q,Z. Then if Z is mapped to itself then Q must also be mapped to
itself. So Pr [Q] = 21 , but Pr [Q|Z] = 1.

Another example (and conditional independence)


In Example 9.22, we considered the scenario of two teams playing a best-of-five series (two teams keep
playing games until one team has won three). Suppose that the teams are perfectly evenly matched—that
is, each team wins each game with probability 0.5—and that each game’s result is independent of all other
games. See Figure 10.20 for the probability tree. Let’s look at a few examples of events in this scenario:

ISTUDY
10.3 Independence and Conditional Probability 537

1
8 AAA BBB

1
16 AABA ABAA ABBB BAAA BABB BBAB

1
32 AABBA AABBB ABABA ABABB ABBAA ABBAB BAABA BAABB BABAA BABAB BBAAA BBAAB

Figure 10.20 A best-of-five series of games between teams A and B. The two 3-game outcomes each occur with
probability 18 ; the six 4-game outcomes with probability 16
1
; and the twelve 5-game outcomes with probability 321 .

Example 10.23: Playing ball.


Define the following events: Event C = team A wins the series
Event D = team A wins the first game of the series
Event E = team A wins the second game of the series, and
Event F = the series lasts five games.
There’s a shortcut to calculating the probability of three of these individual events: the two teams are
completely symmetric in this scenario, and thus Pr [C] = Pr [D] = Pr [E] = 0.5. To calculate Pr [F], we

can note that there are 42 = 6 ways for the teams to split the first four games (by winning two games
1 6
each); each of these “split” cases happens with probability 1/24 = 16 , so Pr [F] = 16 = 38 . Let’s calculate
a few conditional probabilities using pairs of these events:
Pr[C∩D] Pr[{AAA,AABA,ABAA,AABBA,ABABA,ABBAA}] 11/32 11
Pr [C|D] = Pr[D] = Pr[{AAA,AABA,ABAA,ABBB,AABBA,AABBB,ABABA,ABABB,ABBAA,ABBAB}] = 1/2 = 16
Pr[C∩E] Pr[{AAA,AABA,BAAA,AABBA,BAABA,BABAA}] 11/32 11
Pr [C|E] = Pr[E] = Pr[{AAA,AABA,BAAA,BABB,AABBA,AABBB,BAABA,BAABB,BABAA,BABAB}] = 1/2 = 16
Pr[C∩F] Pr[{AABBA,ABABA,ABBAA,BAABA,BABAA,BBAAA}] 6/32
Pr [C|F] = Pr[F] = 3/8 = 3/8 = 12 .

Similarly, we have
Pr[A wins both of the first two games of the series] 1/4
Pr [D|E] = Pr[A wins the second game] = 1/2 = 12 , and
Pr[A wins the first game of the series and the series lasts five games] 6/32
Pr [D|F] = Pr[the series lasts five games] = 6/16 = 12 .
Because Pr [C] = Pr [C|F] and Pr [D] = Pr [D|E] and Pr [D] = Pr [D|F], these three pairs of events—
that is, the pairs {C, F} and {D, E} and {D, F}—are all independent. But Pr [C|D] = 11
16 ̸= 2 = Pr [C],
1

and similarly Pr [C|E] ̸= Pr [C], so neither C and D, nor C and E, are independent.

There’s another potentially interesting aspect of conditional probability lurking in Example 10.23. We’ve
shown that Team A winning the first game of the series makes it more likely that Team A wins the series—
in other words, we’ve shown that 11 1
16 = Pr [C|D] > Pr [C] = 2 . But now suppose that you know that A and
B each win two of the first four games. Once you know that the series lasts five games, then team A has no
remaining advantage from having won the first game; in other words, given that Event F occurred, then A
winning the first game and A winning the series are now independent events. This phenomenon is called
conditional independence:

Definition 10.12: Conditional independence.


Let A, B, and C be three events. The events A and B are said to be conditionally independent given C if
Pr [A|B ∩ C] = Pr [A|C].

ISTUDY
538 Probability

Note that there can be events that are not independent—but that are conditionally independent given the
appropriate third event. (Example 10.23 includes this situation: C and D are dependent events, but they
are conditionally independent given F.) The opposite is also possible; there can be two events that are
independent, but that are not conditionally independent given some third event:

Example 10.24: Independent events aren’t always conditionally independent.


Recall Event D (Team A wins the first game), Event E (Team A wins the second game), and Event F (the
series lasts five games) from Example 10.23. We already showed that D and E are independent: knowing
who won the first game doesn’t change the fact that there’s a 50/50 chance for each team to win the second
game. But if you know that the series lasts five games, then D and E are no longer independent:
Pr[the series lasts 5 games and team A wins the first two games]
Pr [D|E ∩ F] = Pr[D∩E∩F]
Pr[E∩F] = Pr[the series lasts 5 games and team A wins the second game]
Pr[{AABBA,AABBB}] 2/32
= Pr[{AABBA,AABBB,BAABA,BAABB,BABAA,BABAB}] = 6/32 = 13 .
Because Pr [D|E ∩ F] = 13 and Pr [D|E] = 12 are different, events D and E are not conditionally indepen-
dent given F. (Intuitively, knowing that series lasted five games makes it much likelier that a different
team won game #1 as won game #2. This calculation shows how much more likely it is.)

10.3.3 Bayes’ Rule and Calculating with Conditional Probability


Here, we’ll introduce a few ways of thinking about plain (unconditional) probability using conditional
probability, and Bayes’ Rule, a tremendously useful formula that relates Pr [A|B] and Pr [B|A].

Intersections and conditional probability


The definition of conditional probability (Definition 10.11) states that Pr [A|B] = Pr[A∩B]
Pr[B] . Multiplying both
sides of this equality by Pr [B] yields a useful way of thinking about the probability of intersections:

Theorem 10.13: The Chain Rule.


Let A and B be arbitrary events. Then Pr [A ∩ B] = Pr [B] · Pr [A|B] . More generally, for a collection of
events A1 , A2 , . . . , Ak , we have
Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] = Pr [A1 ] · Pr [A2 |A1 ] · Pr [A3 |A1 ∩ A2 ] · · · Pr [Ak |A1 ∩ · · · ∩ Ak−1 ] .

If we’re interested in the probability that A and B occur, then we need it to be the case that A occurs—and
then, knowing that A occurred, B must occur too. Here’s a small example regarding hands of cards:

Example 10.25: Drawing a heart flush in poker.


A flush in poker is a 5-card hand, all of which are the same suit. What is the probability of drawing a
heart flush from a randomly shuffled deck?
Solution. We can draw any heart first. We have to keep drawing hearts to get a flush, so for 2 ≤ k ≤ 5,
the kth card we draw must be one of the remaining 14 − k hearts from the 53 − k cards left in the deck.
That is, writing Hi to denote the event that the ith card drawn is a heart:
Pr [H1 ∩ H2 ∩ H3 ∩ H4 ∩ H5 ]

ISTUDY
10.3 Independence and Conditional Probability 539

= Pr [H1 ] · Pr [H2 |H1 ] · Pr [H3 |H1,2 ] · Pr [H4 |H1,2,3 ] · Pr [H5 |H1,2,3,4 ]


13 12 11 10 9
52 51 50 49 48

= 13
52 · 12
51 · 11
50 · 10
49 · 9
48 ≈ 0.0004952.

(We could also have directly computed this quantity via counting: there are 13 5 hands containing 5 hearts,

and 52
5 total hands. Thus the fraction of all hands that are heart flushes is
13
 13!
13! · 47! 13 · 12 · 11 · 10 · 9
5
 = 5!·8! = = ,
52
5
52!
5!·47!
8! · 52! 52 · 51 · 50 · 49 · 48
which is the same quantity that we found above.)

We can use the Chain Rule to compute the probability of an event A by making the (admittedly fairly
obvious!) observation that another event B either occurs or doesn’t occur:

Theorem 10.14: The Law of Total Probability.


Let A and B be arbitrary events. Then Pr [A] = Pr [A|B] · Pr [B] + Pr [A| B ] · Pr [ B ] .

Proof. We’ll proceed by splitting A into two disjoint subsets, A ∩ B and A − B (the latter of which is
otherwise known as A ∩ B):
 
Pr [A] = Pr (A ∩ B) ∪ (A ∩ B) A = (A ∩ B) ∪ (A ∩ B)
= Pr [A ∩ B] + Pr [ A ∩ B ] A ∩ B and A ∩ B are disjoint
= Pr [A|B] · Pr [B] + Pr [A| B ] · Pr [ B ] . Chain Rule

Thus the theorem follows.


Here’s a small example of using the Law of Total Probability:

Example 10.26: Binary symmetric channel.


We wish to transmit a 1-bit message from a sender to a receiver. The sender’s message is 0 with probability
0.3, and it’s 1 with probability 0.7. The sender sends this data using a communication channel that corrupts
(that is, flips) every transmitted bit with probability 0.25. (See Figure 10.21, where p = 0.75 and thus
where 1 − p = 0.25.) Then the probability that the receiver receives a “1” message is
Pr [receive 1] = Pr [receive 1|send 1] · Pr [send 1] + Pr [receive 1|send 0] · Pr [send 0] = 0.6.
0.75 · 0.7 = 0.525 0.25 · 0.3 = 0.075

Taking it further: The binary symmetric channel has its name because it transmits a bit (it’s binary) and it corrupts a 0 with
the same probability as it corrupts a 1 (it’s symmetric). (See Figure 10.21; view each arrow as transforming a particular input
bit to a particular output bit, with the indicated probability.) The binary symmetric channel is one of the most basic forms of

input output
0 p 0
1−p
1−p
1 p 1

Figure 10.21 The binary symmetric channel.

ISTUDY
540 Probability

a noisy communication channel—that is, a channel that does not perfectly transmit its input without any chance of corruption.
The subfield of information theory is devoted to analyzing topics like the (theoretical) efficiency of communication channels,
including the binary symmetric channel. For much more on the binary symmetric channel and information theory more generally,
see [34, 83], or the 1948 paper by Claude Shannon (1916–2001) that’s generally thought to have launched the field [115].

Bayes’ Rule
Bayes’ Rule is a simple—but tremendously useful—rule for “flipping around” a conditional probability
statement. (Bayes’ Rule is named after Thomas Bayes (1702–1761), an English mathematician who also
happened to be a Presbyterian minister.) It allows us to express the conditional probability of A given B in
terms of the conditional probability of B given A:

Theorem 10.15: Bayes’ Rule.


Pr [B|A] · Pr [A]
For any two events A and B, we have Pr [A|B] = .
Pr [B]

Proof. Let’s start by thinking about the intersection of the events A and B. Applying the Chain Rule to
break apart Pr [A ∩ B] “in both orders,” we have

Pr [A ∩ B] = Pr [A|B] · Pr [B] and Pr [B ∩ A] = Pr [B|A] · Pr [A] .

The left-hand sides of these equations are equal because A ∩ B = B ∩ A (and so Pr [A ∩ B] = Pr [B ∩ A]),
and thus their right-hand sides are identical, too:

Pr [A|B] · Pr [B] = Pr [B|A] · Pr [A] .

Dividing both sides of this equality by Pr [B] yields the desired equation:
Pr [B|A] · Pr [A]
Pr [A|B] = .
Pr [B]
Here are a couple of small examples of using Bayes’ Rule:

Example 10.27: Binary symmetric channel, again.


As in Example 10.26, assume a sender transmits a 0 with probability 0.3 and a 1 with probability
0.7 across a channel that corrupts every bit with probability 0.25. We showed in Example 10.26 that
Pr [receive 1] = 0.6 and thus Pr [receive 0] = 0.4. Then the probability that the receiver receiving a “1”
message was indeed sent a 1 is by Bayes’ Rule
Pr [receive 1|send 1] · Pr [send 1] 0.75 · 0.7
Pr [message sent was 1|receive 1] = = = 0.875.
Pr [receive 1] 0.6
And the probability that the receiver receiving a “0” message was indeed sent a 0 is

Pr [receive 0|send 0] · Pr [send 0] 0.75 · 0.3


Pr [message sent was 0|receive 0] = = = 0.5625.
Pr [receive 0] 0.4
(Qualitatively, these numbers tell us that most of received ones were actually sent as ones, but barely
more than half of the received zeros were actually sent as zeros.)

ISTUDY
10.3 Independence and Conditional Probability 541

Example 10.28: 9+ heads, again.


We flip a fair coin 10 times. As in Example 10.21, let A denote the event that the first flip comes up
heads and let H denote the event that there are 9 or more heads in the 10 flips. (There we showed that
Pr [H] = 11/210 , Pr [A] = 12 , and Pr [H|A] = 10/29 .) Then
Pr [H|A] · Pr [A] (10/29 ) · 1
2 10
Pr [A|H] = = = .
Pr [H] 11/210 11

Taking it further: A speech recognition system is supposed to “listen” to speech in a language like English, and recognize the
words that are being spoken. Bayes’ Rule allows us to think about two different types of evidence that such a system uses in
deciding what words it “thinks” are being said; see p. 543.

A particularly important application of Bayes’ Rule is in “updating” one’s beliefs about the world by
observing new information. (Here “beliefs” take the form of a probability distribution.) One starts with a
prior distribution which one then updates based on evidence to produce a posterior distribution. (The prior
(pre = before) is your best guess of the probability of the event prior to seeing the produced evidence; the
posterior (post = after) is your best guess after seeing the evidence.) Here are two examples:

Example 10.29: Alice the CS major.


We are interested in whether a student (let’s call her Alice) is a computer science major. Our prior for
Alice might be Pr [CS major] = 0.05 because 5% of students are CS majors. We learn that Alice took
Ancient Philosophy. If we know that 10% of students as a whole take Ancient Philosophy, and 50% of
CS majors do, then
Pr [took philosophy|CS major] · Pr [CS major] 0.5 · 0.05
Pr [CS major|took philosophy] = = = 0.25.
Pr [took philosophy] 0.10
Our posterior distribution (the updated guess) is that there is now a 25% chance that Alice is a CS major.

Example 10.30: Flipping a coin to decide which coin to flip.


I have two coins in an opaque bag. The coins are visually indistinguishable, but one coin is fair
(Pr [H] = 0.5); the other coin is 0.75-biased (Pr [H] = 0.75). I pull one of the two coins out at random.

• My prior distribution is that there is a 50% chance I’m holding the fair coin, and a 50% chance I’m
holding the biased coin. (That is, Pr [biased] = Pr [fair] = 0.5.)

I flip the coin that I’m holding. It comes up heads.

• The evidence is the heads flip.

Because the biased coin is more likely to produce heads than the fair coin is (and we saw heads), this
evidence should make us view it as more likely that the coin that I’m holding is the biased coin. Let’s
compute my posterior probability:

ISTUDY
542 Probability

• The posterior probability of an event is the probability of that event conditioned on the observed
evidence. So we wish to compute Pr [biased|H]:
Pr [H|biased] · Pr [biased]
Pr [biased|H] = Bayes’ Rule
Pr [H]

Pr [H|biased] · Pr [biased]
= Law of Total Probability
Pr [H|biased] · Pr [biased] + Pr [H|fair] · Pr [fair]

0.75 · Pr [biased] the given biases of the coins:


=
(0.75 · Pr [biased]) + (0.5 · Pr [fair]) 0.75 for biased, 0.5 for fair

0.75 · 0.5
= Pr [biased] = Pr [fair] = 0.5, as defined by the prior
(0.75 · 0.5) + (0.5 · 0.5)

0.375
= = 0.6.
0.375 + 0.25
So the posterior probability is Pr [biased|H] = 0.6 and Pr [fair|H] = 0.4.

Taking it further: The idea of Bayesian reasoning is used frequently in many applications of computer science—any time a
computational system weighs various pieces of evidence in deciding what kind of action to take in a particular situation. One of
the most noticeable examples of this type of reasoning occurs in Bayesian spam filters; see p. 544.

ISTUDY
10.3 Independence and Conditional Probability 543

Computer Science Connections

Speech Recognition, Bayes’ Rule, and Language Models


A software system for speech recognition must solve the following problem: given an audio stream S of spoken
English as input, produce as output a transcript W of the words in S. (See Figure 10.22.) There will be many candidate
transcripts of S, and generally the task of the system is to produce the most likely sequence of words given the audio
stream—that is, to find the W ∗ maximizing Pr [W ∗ |S]. Using Bayes’ Rule, we can rephrase Pr [W ∗ |S] into an
expression that’s easier to understand:
[ ] Pr [S|W ∗ ] · Pr [W ∗ ]
the W ∗ maximizing Pr W ∗ |S = the W ∗ maximizing Bayes’ Rule
Pr [S]
[ ] [ ]
= the W maximizing Pr S|W ∗ · Pr W ∗ .

Pr [S] is the same for each W ∗

Thus there are two valuable sources of data for eval-


uating a candidate W. First, there’s Pr [S|W], the

Spectrogram generated by Praat [16]


likelihood of the observation: the probability that this
sound stream would have been produced if W were the
sequence of words. Second, there’s Pr [W], the proba-
bility of this output: the probability of this sequence of
words being uttered at all.
For example, even if the audio stream is a bet-
ter acoustic match for the phrase whirled Siri string,
you’d want your system to prefer the phrase World
Series ring—because an English speaker is far more Figure 10.22 A spectrogram representation of an audio
likely to say the latter phrase than the former. stream: the x-axis represents time, the y-axis represents
(That is, Pr [World Series ring] is much higher than frequency, and the darkness of the shading denotes the
Pr [whirled Siri string].) Of course, we still must take intensity of sound at that particular frequency at that
into account the audio stream S—otherwise, regard- particular time. (See p. 38.) The task is to turn this
less of the audio, we’d end up with a system that representation into its most probable sequence of words—in
produced precisely the same output sentence (the most this case, the sentence “I prefer agglomerative clustering.”
common sentence in English: I’m sorry!, or whatever
it is) for any input sound stream.
Generally speaking, the quantity Pr [S|W] would be estimated by an acoustic model of the vocal tract: if I’m
trying to say Camp Utah seance, what is the probability that I produce a particular stream S of sounds?
The quantity Pr [W] is estimated by what’s called
a language model. We would acquire a large collec-
tion of English text, and then try to use that data to
estimate how likely a particular sequence is. The sim- SQ
OUARE ARQ
OUSE
plest language model is the unigram model: from a
giant dataset with N total words, for each word w we Figure 10.23 Two noisily presented words. You are very
count up the number of times n(w) that w appears; likely to be able to read these words without significant
then, if W = w1 , w2 , . . . , wk , we estimate Pr [W] as difficulty—but you’re using your (implicit) language model
n(w1 ) n(w2 ) n(wk )
N · N · ··· · N . to do so, not actually “simply” recognizing the characters:
A more complex language model might use
the (noisy) characters in the two words are precisely
bigrams—two-word sequences—instead; we count the
identical, just in a different order, even though you read the
number of occurrences of wi , wi+1 consecutively in
the giant dataset, and estimate Pr [W] based on these second letter of the left-hand word as a Q and the third letter
counts. Other more complex language models are used of the right-hand word as a O. (You’ll be relieved to realize
in real systems. (See Figure 10.23.) There’s also a great that the language model hanging out inside your head is
deal of complication with avoiding overfitting of the more sophisticated than the unigram model.)
language model to the training data. (In addition to
speech recognition, a variety of other natural language processing problems are generally solved with the same
general approach. For much more, see a good NLP textbook, like [45] or [66].)

ISTUDY
544 Probability

Computer Science Connections


spam (95%) not spam (5%)
Bayesian Modeling and Spam Filtering
There are, it’s estimated, a few hundred billion email incorrectly classified (10%)
messages sent on earth per day. Of those, a significant
fraction of those messages are unsolicited, unwanted
bulk messages—that is, what’s commonly known as correctly classified (90%)
spam. Somewhere between 50% and 95% of emails are
currently spam. (It’s hard to be precise; statistics and
definitions of spam vary, and there’s change over time
as certain spammers are shut down, or not.) The basic
idea of a spam filter is to estimate the probability that a
particular message m is spam. The email client, or pos- inbox contents
sibly the individual user, can choose a threshold p; a
message m for which Pr [m is spam] ≥ p is placed into true positives: false negatives: true negatives: false positives:
a spam folder. The choice of p depends on the user’s correctly incorrectly marked correctly marked incorrectly
relative concern about false positives (nonspam mes- marked as spam as nonspam as nonspam marked as spam

sages that end up being incorrectly treated as spam) By Bayes’ Rule, the fraction of your inbox that is nonspam is:
versus false negatives (spam messages that end up
Pr [nonspam|inbox]
being incorrectly left in the inbox). See Figure 10.24.
Pr [inbox|nonspam] Pr [nonspam]
So, how might a spam filter actually make its =
Pr [inbox|nonspam] Pr [nonspam] + Pr [inbox|spam] Pr [spam]
decisions? Here’s one approach, based fundamen- 0.9 · 0.05 0.045
tally on Bayes’ Rule. Consider a message con- = = = 0.3214 · · · .
0.9 · 0.05 + 0.1 · 0.95 0.045 + .095
sisting of words w1 , w2 , . . . , wn ; we must compute
In other words, a full two thirds of your inbox would be spam!
Pr [spam|w1 , w2 , . . . , wn ]. Using Bayes’ Rule, we turn
around this probability: Figure 10.24 A test of your probabilistic intuition: if we
Pr [spam|w1 , w2 , . . . , wn ] have a spam filter that correctly classifies 90% of email
messages as spam/nonspam, and 95% of email messages
Pr [w1 , w2 , . . . , wn |spam] · Pr [spam]
= are spam, what fraction of email in your inbox is nonspam?
Pr [w1 , w2 , . . . , wn ]
By the Law of Total Probability (every message is either spam or not spam), we can rewrite this probability as
Pr [w1 , w2 , . . . , wn |spam] · Pr [spam]
Pr [spam|w1 , w2 , . . . , wn ] = .
Pr [w1 , w2 , . . . , wn |spam] Pr [spam] + Pr [w1 , w2 , . . . , wn |not spam] Pr [not spam]
That is, we want to know: what is the probability that the sequence of words w1 , . . . , wn would have been generated
in a spam message, relative to the probability that w1 , . . . , wn would have been generated in any message at all
(in a spam or nonspam message)? (These “relative probabilities” are weighted by the background probability of
spam-vs.-nonspam messages.)
A naïve Bayes classifier uses an additional assumption: that the appearance of every word in an email is an
independent event. That is, we’re going to estimate Pr [w1 , w2 , . . . , wn ] as if the probability of each wi appearing
does not depend on any other word appearing. (Obviously that assumption isn’t right: the probability of the word
MORTGAGE appearing is not independent of the probability of the word RATE appearing, in either spam or nonspam.)
Pr [w1 , w2 , . . . , wn |spam] ≈ Pr [w1 |spam] · Pr [w2 |spam] · · · · · Pr [wn |spam] .
Thus a naïve Bayes classifier estimates the probability of a message being generated as spam by multiplying a measure
of “how spammy” each word is. A spam filter would still need to have two numbers associated with each word wi —
namely Pr [wi |spam] and Pr [wi |nonspam]. We can estimate these numbers from a training set of spam/nonspam
emails, with some sort of “smoothing” mechanism to improve our estimate of the spamminess of a word that doesn’t
appear in any of the training emails. (For more about the training of these estimates, and about text classification—
the broader version of the problem that we’re trying to solve in spam filtering—again see any good book on natural
language processing, like [45] or [66]. For more about spam in particular, you may be interested in the statistics on
email and spam that are regularly calculated by the Radicati Group, for example: www.radicati.com.)

ISTUDY
Exercises 545

EXERCISES
Choose one of the 12 months of the year uniformly at random. (That is, choose a number uniformly from the set {1, 2, . . . , 12}.)
Indicate whether the following pairs of events are independent or dependent. Justify your answers.
10.53 “The month number is even” and “the month number is divisible by 3.”
10.54 “The month number is even” and “the month number is divisible by 5.”
10.55 “The month number is even” and “the month number is divisible by 6.”
10.56 “The month number is even” and “the month number is divisible by 7.”
10.57 “The month’s name contains an ’R’ ” and “the month contains an odd number of days in a leap year.”

We flip a fair coin 6 times. Which of these pairs of events are independent or dependent? Justify your answers.
10.58 “The number of heads is even” and “the number of heads is divisible by 3.”
10.59 “The number of heads is even” and “the number of heads is divisible by 4.”
10.60 “The number of heads is even” and “the number of heads is divisible by 5.”

10.61 We flip three fair coins, called a, b, and c. Are the events “The number of heads in {a, b} is odd” and “The number of heads in
{b, c} is odd” independent or dependent?
10.62 How (if at all) would your answer to the previous exercise change if the three coins are p-biased? (That is, assume Pr [a = H],
Pr [b = H], and Pr [c = H] are all equal to p, where p is not necessarily 12 .)

Consider the list of words and the events in Figure 10.25. Choose a word at random from this list. Which of these pairs of events
are independent? For the pairs that are dependent, indicate whether the events are positively or negatively correlated. Justify.
10.63 A and B 10.67 A and E
10.64 A and C 10.68 A ∩ B and E
10.65 B and C 10.69 A ∩ C and E
10.66 A and D 10.70 A ∩ D and E

10.71 Let A and B be arbitrary events in a finite sample space. Prove that if Pr [B] = 0, then A and B are independent.
10.72 Let A and B be arbitrary events in a finite sample space. Prove that A and B are independent if and only if A and B are independent.

A substitution cipher (see Example 10.16) is a cryptographic scheme in which we choose a permutation π of the alphabet, and
replace each letter i with πi . (Decryption is the same process, but backward: replace πi by i.) However, substitution ciphers are
susceptible to frequency analysis, in which an eavesdropper who observes the encrypted message (the ciphertext) infers that the
most common letter in the ciphertext probably corresponds to the most common letter in English text (the letter E), the second-most
common to the second-most common (T), and so on.
10.73 (programming required) Write a program that generates a random permutation π of the alphabet, and encrypts a given input text
using π. (Leave all non-alphabetic characters unchanged.)
10.74 (programming required) Write a program that takes a text as input, converts it to uppercase, and produces a vector ⟨fA , fB , . . . , fZ ⟩,
where f• is the fraction of letters in the input text that are the letter •. (So f will be a probability distribution over the alphabet.)
10.75 (programming required) Write a program that, given a reference text and a text encrypted with an unknown substitution cipher,
attempts to decrypt by mapping the most common encrypted letters, in order, to the most common reference letters. You can find
useful reference files (the complete works of Shakespeare, say) from Project Gutenberg, http://www.gutenberg.org/.
10.76 (programming required) A Caesar cipher is a special kind of substitution cipher in which the permutation π is generated by
choosing a numerical shift s and moving all letters s steps forward in the alphabet, wrapping back to the beginning of the alphabet
as necessary. (For example, with a shift of 5, A → F and W → B.) Write a Caesar cipher encryption program that encrypts a given
input text file with a randomly chosen shift in {0, 1, . . . , 25}.

ABIDES EXUDE A : “the first letter of the word is a consonant.”


BASES FEDORA B : “the second letter of the word is a consonant.”
CAJOLED GASOLINES C : “the second letter of the word is a vowel.”
DATIVE HABANERO D : “the last letter of the word is a consonant.”
E : “the word has even length.”

Figure 10.25 A word list from which we choose a random word, and some events.

ISTUDY
546 Probability

10.77 (programming required) If you run your decryption program from Exercise 10.75 on Caesar-ciphered text, you’ll find that your
program generally doesn’t work perfectly. Write a Caesar-cipher-decrypting program that takes advantage of the fact that every
letter is shifted by the same amount. Find the most probable s—the s that minimizes the difference in the probabilities of each letter
P
from the reference text and the deciphered text. That is, minimize i |f′i − fi+s |, where f comes from the ciphertext and f′ comes
from the reference text.

Flip n fair coins. For any two distinct indices i and j with 1 ≤ i < j ≤ n, define the event Ai,j as

Ai,j = (the ith coin flip came up heads) XOR (the jth coin flip came up heads).

For example, for n = 4 and the outcome ⟨T, T, H, H⟩, the events A1,3 , A1,4 , A2,3 , and A2,4 all occur; A1,2 and A3,4 do not. Thus,
 
from n independent coin flips, we’ve defined n2 different events. In the next few exercises, you’ll show that these n2 events are
pairwise independent, but not fully independent.
10.78 Let i and j > i be arbitrary. Show that Pr [Ai,j ] = 12 .
10.79 Let i and j > i be arbitrary, and let i′ and j′ > i′ be arbitrary. Show that any two distinct events Ai,j and Ai ,j are independent. That
′ ′

is, show that Pr [Ai,j |Ai ,j ] = Pr [Ai,j |Ai ,j ] = 21 if {i, j} ̸= {i′ , j′ }.


′ ′ ′ ′

10.80 Show that there is a set of three distinct A events that are not mutually independent. That is, identify three events Ai,j , Ai ,j , and
′ ′

Ai ,j where the sets {i, j}, {i′ , j′ }, and {i′′ , j′′ } are all different (though not necessarily disjoint). Then show that if you know the
′′ ′′

value of Ai,j and Ai ,j , the probability of Ai ,j =


′ ′ ̸ 12 .
′′ ′′

Consider the dominoes in Figure 10.26. Suppose you shuffle them and draw one domino uniformly at random. (More specifically,
1
you choose any particular domino with probability 12 . After you’ve chosen the domino, you choose an orientation, with a 50–50
chance of either side pointing to the left.) What are the following conditional probabilities? (“Even total” means that the sum of
the two halves of the domino is even. “Doubles” means that the two halves are the same.)
10.81 Pr [even total|doubles]
10.82 Pr [doubles|even total]
10.83 Pr [doubles|at least one ]
10.84 Pr [at least one |doubles]
10.85 Pr [total ≥ 7|doubles]
10.86 Pr [doubles|total ≥ 7]
10.87 Pr [even total|total ≥ 7]
10.88 Pr [doubles|the top half of the drawn domino is ]

10.89 Suppose A and B are mutually exclusive events—that is, A ∩ B = ∅. Prove or disprove the following claim: A and B cannot be
independent.
10.90 Let A and B be two events such that Pr [A|B] = Pr [B|A]. Which of the following is true? (a) A and B must be independent; (b) A
and B must not be independent; or (c) A and B may or may not be independent (there’s not enough information to tell). Justify your
answer briefly.

Suppose, as we have done throughout the chapter, that h : K → {1, . . . , n} is a random hash function.
10.91 Suppose that there are currently k cells in the array that are occupied. Consider a key x ∈ K not currently stored in the hash table.
What is the probability that the cell h(x) into which x hashes is empty?
10.92 Suppose that you insert n distinct values x1 , x2 , . . . , xn into an initially empty n-slot hash table. What is the probability that there
are no collisions? (Hint: if the first i elements have had no collisions, what is the probability that the (i + 1)st hashed element does
not cause a collision? Use Theorem 10.13 and Exercise 10.91.)

Figure 10.26 Some dominoes.

ISTUDY
Exercises 547

10.93 Imagine a disease that afflicts a small fraction of the population; one in a thousand people in the population have the disease. Doctor
Genius has invented a disease-detection test. Her test, though, isn’t perfect:
• It has false negatives: if you do have the disease, then her test says that you’re not sick with probability 0.01.
• It has false positives: if you don’t have the disease, then her test says that you’re sick with probability 0.03.
What is the probability p that Dr. Genius gives a random person x an erroneous diagnosis?
10.94 “Doctor” Quack has invented a disease-detection test, too. He was a little confused by the statement “one in a thousand people
1
in the population have the disease,” so his test is this: no matter who the patient is, with probability 1000 report “sick” and with
999
probability 1000 report “not sick.” What is p now?

Alice wishes to send a 3-bit message 011 to Bob, over a noisy channel that corrupts (flips) each transmitted bit independently with
some probability. To combat the possibility of her transmitted message differing from the received message, she adds a parity bit
to the end of her message (so that the transmitted message is 0110). [Bob checks that he receives a message with an even number
of ones, and if so interprets the first three received bits as the message.]
10.95 Assume that each bit is flipped with probability 1%. Conditioned on receiving a message with an even number of ones, what is the
probability that the message Bob received is the message that Alice sent?
10.96 What if the probability of error is 10% per bit?

Suppose, as in Example 10.30, I have two coins—one fair and one p-biased. I pull one uniformly at random from an opaque bag,
and flip it. What is Pr [I pulled the biased coin|the following observed flips]? Justify your answers.
10.97 p = 2/3, and I observe a single heads flip.
10.98 p = 3/4, and I observe the flip sequence HHHT.
10.99 p = 3/4, and I observe the flip sequence HTTTHT.

A Bloom filter is a probabilistic data structure designed to store a set of elements from a universe U, allowing very quick query
operations to determine whether a particular element has been stored [15]. Specifically, it supports the operations Insert(x), which
adds x to the stored set, and Lookup(x), which reports whether x was previously stored. But, unlike most data structures for this
problem, we will allow ourselves to (occasionally) make mistakes in lookups, in exchange for making these operations fast.
Here’s how a Bloom filter works. We will choose k different hash functions h1 , . . . , hk : U → {1, . . . , m}, and we will maintain
an array of m bits, all initially set to zero. The operations are implemented as follows:
• To insert x into the data structure, we set the k slots h1 (x), h2 (x), . . . , hk (x) of the array to 1. (If any of these slots was already
set to 1, we leave it as a 1.)
• To look up x in the data structure, we check that the k slots h1 (x), h2 (x), . . . , hk (x) of the array are all set to 1. If they’re all equal
to 1, we report “yes”; if any one of them is a 0, we report “no.”
For an example, see Figure 10.27. Note that there can be a false positive in a lookup: if all k slots corresponding to a query element
x happen to have been set to 1 because of other insertions, then x will incorrectly be reported to be present.
As usual, we treat each of the k hash functions as independently assigning each element of U to a uniformly chosen slot of
the array. Suppose that we have an m-slot Bloom filter, with k independent hash functions, and we insert n elements into the data
structure.
10.100 Suppose we have k = 1 hash functions, and we’ve inserted n = 1 element into the Bloom filter. Consider any particular slot of the
m-slot table. What is the probability that this particular slot is still set to 0? (That is, what is the probability that this slot is not the
slot set to 1 when the single element was inserted?)

1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0
0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0

(a) The table initially; after inserting 3; and after (b) Testing for 3 (yes!), 15 (no!), and 10 (yes!?!). Note
inserting 7. Note h1 (3) = 4, h2 (3) = 10, h1 (7) = 8, h1 (15) = 3, h2 (15) = 5, h1 (10) = 11, and
and h2 (7) = 11. h2 (10) = 10—so 10 is a false positive.

Figure 10.27 A Bloom filter with k = 2 hash functions: h1 (x) = x mod 13 + 1 and h2 (x) = x2 mod 13 + 1.

ISTUDY
548 Probability

10.101 Let the number k of hash functions be an arbitrary number k ≥ 1, but continue to suppose that we’ve inserted only n = 1 element
in the Bloom filter. What is the probability a particular slot is still set to 0 after this insertion?
10.102 Let the number k of hash functions be an arbitrary number k ≥ 1, and suppose that we’ve inserted an arbitrary number n ≥ 1 of
elements into the Bloom filter. What is the probability a particular slot is still set to 0 after these insertions?

Define the false-positive rate of a Bloom filter (with m slots, k hash functions, and n inserted elements) to be the probability that
we incorrectly report that y is in the table when we query for an uninserted element y.
For many years (starting with Bloom’s original paper about Bloom filters), people in computer science believed that the false
positive rate was precisely pk , where p = (1 − [your answer to Exercise 10.102]). The justification was the following. Let Bi denote
the event “slot hi (y) is occupied.” We have a false positive if and only if B1 , B2 , . . . , Bk are all true. Thus

the false positive rate = Pr [B1 and B2 and · · · and Bk ] .

You showed in the previous exercise that Pr [Bi ] = p. Everything up until here is correct; the next step in the argument,
however, was not! Therefore, because the Bi events are independent,

the false positive rate = Pr [B1 and B2 and · · · and Bk ] = Pr [B1 ] · Pr [B2 ] · · · Pr [Bk ] = pk .

But it turns out that Bi and Bj are not independent [20, 29]! (This error is a prime example of how hard it is to have perfect intuition
about probability!)
10.103 Let m = 2, k = 2, and n = 1. Compute by hand the false-positive rate. (Hint: there are “only” 16 different outcomes, each of
which is equally likely: the random hash functions assign values in {1, 2} to h1 (x), h2 (x), h1 (y), and h2 (y). In each of these 16
cases, determine whether a false positive occurred.)
10.104 Compute p2 —the answer you would have gotten by using

false-positive rate = (1 − [your answer to Exercise 10.102])2 .

Which is bigger—p2 or [your answer to Exercise 10.103]? In approximately one paragraph, explain the difference, including an
explanation of why the events B1 and B2 are not independent.
10.105 (programming required) While the actual false-positive rate is not exactly pk , it turns out that pk is a very good approximation
to the false-positive rate as long as m is sufficiently big and k is sufficiently small. Write a program that creates a Bloom filter
with m = 1,000,000 slots and k = 20 hash functions. Insert n = 100,000 elements, and estimate the false positive probability by
querying for n additional uninserted elements y ∈ / X. What is the false-positive rate that you observe in your experiment? How does
it compare to pk ?

ISTUDY
10.4 Random Variables and Expectation 549

10.4 Random Variables and Expectation


I’m from New Jersey, I don’t expect too much
If the world ended today, I would adjust.
John Gorka (b. 1958)
“I’m From New Jersey,” from the recording Jack’s Crows (1991)

Thus far, we have been considering whether or not something occurs—that is, using the language of
probability, we have been interested in events. But often we will also be interested in how many? questions
and not just did it or did it not? questions. How many heads came up in 1000 coin flips? How many times
do we have to flip a coin before it comes up heads for the 1000th time? For a randomly ordered array
A[1 . . . n] of the integers {1, . . . , n}, for how many indices i is A[i] < A[i + 1]? To address these types of
questions, we will introduce the concept of a random variable, which measures some numerical quantity
that varies from outcome to outcome. We will also consider the expectation of a random variable, which is
the value of that variable averaged over all of the outcomes in the sample space.

10.4.1 Random Variables


We begin with the definition of a random variable itself:

Definition 10.16: Random variable.


A random variable X assigns a numerical value to every outcome in the sample space S. (In other words,
a random variable is a function X : S → R.)

Warning! A “random variable” is a pretty weirdly named concept. A random variable is not a variable—rather, it’s a function
that maps each outcome to a numerical value. But everyone calls it a random variable, so that’s what we’ll call it, too.

Here are a few small examples:

Example 10.31: Counting heads in 3 flips.


3
Suppose that we flip a fair coin independently, three times. (Then the sample space is S = {H, T} , and
Pr [x] = 81 for any x ∈ S.) Define the random variables
X = the number of heads and Y = the number of initial consecutive tails.
These random variables take on the following values shown in Figure 10.28.

Example 10.32: Word length, and number of vowels.


Select a word from the sample space {Now, is, the, winter, of, our, discontent} by choosing word w
with probability proportional to the number of letters in w, as in Example 10.5. Define a random variable
L to denote the number of letters in the word chosen. Thus L(discontent) = 10 and L(winter) = 6,
for example. We can also define a random variable V to denote the number of vowels in the word chosen.
Thus V(discontent) = 3 and V(winter) = 2, for example. The values for these two random variables
for each outcome in the sample space are shown in Figure 10.29.

Although it’s an abuse of notation, often we just write X to denote the value of a random variable X for a
realization chosen according to the probability distribution Pr. (So we might write “X = 3 with probability

ISTUDY
550 Probability

X(HHH) = 3 X(HHT) = 2 X(HTH) = 2 X(HTT) = 1 X(THH) = 2 X(THT) = 1 X(TTH) = 1 X(TTT) = 0


Y(HHH) = 0 Y(HHT) = 0 Y(HTH) = 0 Y(HTT) = 0 Y(THH) = 1 Y(THT) = 1 Y(TTH) = 2 Y(TTT) = 3

Figure 10.28 The random variables X, representing the number of heads, and Y, representing the number of initial
consecutive tails, in three fair coin flips.

w Pr [w] L(w) V(w)


Choose a letter from these 29 options uniformly at random. The number of letters in Now 3/29 3 1
the box of your chosen letter is L, and the number of vowels in it is V. The values of L is 2/29 2 1
and V for each outcome in the sample space are shown at right.
the 3/29 3 1
N o w i s t h e o f winter 6/29 6 2
o u r w i n t e r of 2/29 2 1
our 3/29 3 2
d i s c o n t e n t
discontent 10/29 10 3

Figure 10.29 Choosing a word from Now is the winter of our discontent with probability proportional to the
length of each word. The number of letters in your chosen word is L, and the number of vowels in it is V.

1
8 ” or “there are L letters in the chosen word.”) Using this notation, we can state probability questions about
events based on random variables, as the following example illustrates:

Example 10.33: More word length and vowel counts.


Choose a word as in Example 10.32. Define L as the number of letters in the word, and define V as the
number of vowels in the word. Then Pr [L = 3] denotes the probability that we choose an outcome w for
which L(w) = 3. (In other words, L = 3 denotes the event {w : L(w) = 3}.) Thus (see Figure 10.29)
9 10
Pr [L = 3] = Pr [{Now, the, our}] = 29 and Pr [V = 3] = Pr [{discontent}] = 29 .

We will also abuse notation by performing arithmetic on random variables (remember, these are func-
tions!): for two random variables X and Y, we write X + Y as a new random variable that, for any outcome
x, denotes the sum of X(x) and Y(x). We will interpret similarly any other arithmetic expression that
involves random variables. (The notational analogue here is writing “sin + cos” to denote the function
f(x) = sin(x) + cos(x).) Here’s a small example:

Example 10.34: Number of consonants.


We can express the number of consonants in the randomly chosen word from our running example (see
Example 10.32) as L − V. For example, L − V = 1 when the chosen word is our, and L − V = 4 when
the chosen word is winter.

Indicator random variables


One special type of random variable that will come up frequently is an indicator random variable, which
only takes on the values 0 and 1. (Such a random variable “indicates” whether a particular event has
occurred.) Here’s a small example:

Example 10.35: Indicator random variables in coin flips.


Suppose that we flip three fair coins independently. Let X1 be an indicator random variable that reports
whether the first flip came up heads. Similarly, let X2 and X3 be indicator random variables for the second
and third flips. See Figure 10.30. The total number of heads is given by the random variable X1 + X2 + X3 .

ISTUDY
10.4 Random Variables and Expectation 551

outcome X1 X2 X3 X1 + X2 + X3
HHH 1 1 1 3
HHT 1 1 0 2
HTH 1 0 1 2
HTT 1 0 0 1
THH 0 1 1 2
THT 0 1 0 1
TTH 0 0 1 1
TTT 0 0 0 0

Figure 10.30 Three coin flips; Xi is an indicator random variable representing whether the ith flip was heads.

Independence of random variables


Just as with independence for events, we will often be concerned with whether knowing the value of one
random variable tells us something about the value of another. Two random variables X and Y are indepen-
dent if every two events of the form “X = x” and “Y = y” are independent: for every value x and y, it must
be the case that Pr [X = x and Y = y] = Pr [X = x] · Pr [Y = y]. For example:

Example 10.36: Some independent/dependent random variables.


The random variables X2 and X3 from Example 10.35—we flip 3 fair coins independently; X2 and X3 indi-
cate whether the second and third flips are heads—are independent. You can check all four possibilities;
for example,
Pr [X2 = 1 and X3 = 1] = 1
4 = 1
2 · 1
2 = Pr [X2 = 1] · Pr [X3 = 1] and
Pr [X2 = 1 and X3 = 0] = 1
4 = 1
2 · 1
2 = Pr [X2 = 1] · Pr [X3 = 0] .
On the other hand, the random variables X and Y from Example 10.31—we flip 3 fair coins indepen-
dently; X is the number of heads and Y is the number of consecutive initial tails—are not independent;
for example,
Pr [X = 3] · Pr [Y = 3] = 1
8 · 1
8 but Pr [X = 3 and Y = 3] = 0.

10.4.2 Expectation
A random variable X measures a numerical quantity that varies from realization to realization. We will often
be interested in the “average” value of X, which is otherwise known as the random variable’s expectation:

Definition 10.17: Expectation.


The expectation of a random variable X, denoted E [X], is the average value of X, defined as
X
E [X] = X(x) · Pr [x] .
x∈S

The expectation of X is also sometimes called the mean of X.


P
We can equivalently write E [X] = y (y · Pr [X = y]) by summing over each possible value y that X
can take on, rather than by summing over outcomes.

In other words, E [X] is the average value of X over all outcomes (where the average is weighted, with
weights defined by the probability function). For example:

ISTUDY
552 Probability

Example 10.37: Expectation of a Bernoulli random variable.


Let X be an indicator random variable for a Bernoulli trial with parameter p—that is, X = 1 with
probability p and X = 0 with probability 1 − p. Then E [X] is precisely
E [X] = 1 · Pr [X = 1] + 0 · Pr [X = 0] definition of expectation (alternative version)
= 1 · p + 0 · (1 − p) definition of a Bernoulli trial with parameter p
= p.

Example 10.38: Counting heads in 3 flips, again.


Recall Example 10.31, where the random variable X denotes the number of heads in three independent
3
flips of a fair coin. (The sample space was S = {H, T} , and Pr [x] = 18 for any x ∈ S.) What is E [X]?
Solution. By definition, the expectation of X is
X
Pr [x] · X(x)
x∈{H,T}3

= 18 X(HHH) + 18 X(HHT) + 18 X(HTH) + 18 X(HTT) + 18 X(THH) + 18 X(THT) + 18 X(TTH) + 18 X(TTT)


 
= 18 · 3 + 2 + 2 + 1 + 2 + 1 + 1 + 0 = 12 8 .
12
In other words, in three flips of a fair coin, we expect 8 = 1.5 flips to come up heads.

Warning! Just because E [X] = 1.5 doesn’t mean that Pr [X = 1.5] is big! (If you ever flip three fair coins and see exactly 1.5
heads, it might be a sign that the world is ending.) Remember that “average” and “typical” aren’t the same thing!

Example 10.39: Counting letters and vowels, again.


Recall the probabilistic process of choosing a word from the sentence Now is the winter of our
discontent in proportion to word length. Recall also the random variables from Example 10.32: L
denotes the chosen word’s length, and V the number of vowels in the chosen word. Then we have
E [L] = 3 · 3
29 +2· 2
29 +3· 3
29 +6· 6
29 +2· 2
29 +3· 3
29 + 10 · 10 171
29 = 29 ≈ 5.8966.
E [V] = 1 · 3
29 +1· 2
29 +1· 3
29 +2· 6
29 +1· 2
29 +2· 3
29 +3· 10
29 = 29
57
≈ 1.9656.

Taking it further: If we think about it without a great deal of care, there’s something apparently curious about the result
from Example 10.39. We’ve plopped down our thumb on a random letter in the sentence Now is the winter of our
discontent, and we’ve computed that the word that our thumb lands on has an average length of about 5.9 letters. That seems a
little puzzling, because there are 7 words in the sentence, with an average word length of 29
7 = 4.1428 letters. But there’s a good
reason for this discrepancy: longer words are more likely to be chosen because they have more letters, and therefore the average
word that’s chosen has more letters than average. An analogous phenomenon occurs in many other settings, too. When you’re
driving, you spend most of your time on longer-than-average trips. Most people in Canada live in a larger-than-average-sized
Canadian city. Most 3rd-grade students in California are in a larger-than-average-size 3rd-grade class. (In fact, this broader phe-
nomenon is sometimes called the class-size paradox.) Perhaps even more jarringly, a random person x knows fewer people than
the average number of people known by someone x knows—that is, on average, your friends are more popular than you are [46].
(Why? A very popular person—call her Oprah—is, by definition, the friend of many people, and therefore Oprah’s astronomical
popularity is averaged into the popularity of many people x. In computing the popularity of a randomly chosen person x, Oprah
only contributes her popularity once for x = Oprah—but she contributes it many times to the popularity of x’s friends.)
This phenomenon may illustrate an example of a sampling bias, in which we try to draw a uniform sample from a population but
we end up with some kind of bias that overweights some members of the population at the expense of others. Sampling biases
are a widespread concern in any statistical approach to understanding a population. For example, consider a telephone-based
political poll that collects voters’ preferences for candidates one evening by randomly dialing phone numbers until somebody

ISTUDY
10.4 Random Variables and Expectation 553

answers, and records the answerer’s preference. This poll will overweight those people who are sitting around at home during
the evening—which correlates with the voter’s age, which correlates with the voter’s political affiliation.

Example 10.40: Number of aces in a bridge hand.


What is the expected number of aces in a 13-card hand dealt from a standard 52-card deck?

Solution. Let’s compute the probability of getting 0, 1, …, 4 aces. In total, there are 52 different hands.
  13
Of them, there are k · 13−k hands with exactly k aces. (We have to pick k ace cards from the 4 aces in
4 48

the deck, and 13 − k non-ace cards from the 48 non-aces.) Each hand is equally likely to be chosen, so
 
k · 13−k
4 48
Pr [drawing exactly k aces] = 52
 .
13

And thus, letting A be a random variable denoting the number of aces, we have
X
4
E [A] = i · Pr [A = i] (using the value-based definition of expectation)
i=0
0 · Pr [A = 0] 1 · Pr [A = 1] 2 · Pr [A = 2] 3 · Pr [A = 3] 4 · Pr [A = 4]
         
0 · 40 · 48 13 + 1 · 41 · 48 12 + 2 · 42 · 48 11 + 3 · 43 · 48 10 + 4 · 44 · 48 9
= 52

13
0 + 278,674,137,872 + 271,142,404,416 + 78,488,590,752 + 6,708,426,560
=
635,013,559,600
635,013,559,600
= .
635,013,559,600
That is, the expected number of aces in a 13-card hand is precisely 1. (The fact that this calculation came
out to be exactly one may look like magic, but it’s not! We’ll see a different—and much easier!—way of
thinking about this problem in Example 10.43 that will demystify any apparent magic.)

A useful property of expectation


We’ve now seen several examples of computing the expectation of random variables by directly follow-
ing the definition of expectation. Here we’ll introduce a transformation that can often make expectation
calculations easier, at least for positive integer–valued random variables:

Theorem 10.18: Another formula for expectation, for nonnegative integers.


P∞
Let X : S → Z≥0 be a random variable. Then E [X] = i=1 Pr [X ≥ i].
P∞
(Note that by definition E [X] = i=0 i · Pr [X = i], so we’re trading the multiplication of i for the
replacement of = by ≥.)
Proof of Theorem 10.18. The theorem follows by changing the order of summation in the expectation
formula. Here is an algebraic proof:

X
E [X] = i · Pr [X = i] definition of expectation
i=0
∞ X
X i
∑i
= Pr [X = i] i= j=1 1
i=0 j=1

ISTUDY
554 Probability

∞ X
X ∞
= Pr [X = i] changing the order of summation (see Figure 10.31)
j=1 i=j
X∞
∑∞
= Pr [X ≥ j] . Pr [X ≥ j] = i=j Pr [X = i]
j=1

It may be easier to follow the idea by looking at a visualization instead of algebra; see Figure 10.31.

We can use this theorem to find the expected value of a geometric random variable, for example:

Example 10.41: Expectation of a geometric random variable.


Let X be a geometric random variable with parameter p. (That is, X measures the number of flips of a
p-biased coin before we get heads for the first time.) Then E [X] is precisely p1 :
P
∞ ∑∞
E [X] = Pr [X ≥ i] Theorem 10.18 (E [X] = i=1 Pr [X ≥ i])
i=1
P∞
= Pr [fail to get heads in i − 1 flips] definition of geometric random variable
i=1
P∞
= (1 − p)i−1 need i − 1 consecutive tails flips
i=1
P∞
= (1 − p)i changing index of summation
i=0
1
= 1−(1−p) = 1p . formula for geometric summations

For example, we expect to flip a fair coin (with p = 12 ) twice before we get heads.

10.4.3 Linearity of Expectation


Here’s a very useful general property of expectation, called linearity of expectation: the expectation of a
sum is the sum of the expectations. (A linear function is a function f that satisfies f(a + b) = f(a) + f(b)—
for example, f(x) = 3x or f(x) = 0.) The usefulness of linearity of expectation will come from the way in
which it lets us “break down” a complicated random variable into the sum of a collection of simple random
P  P
variables. (We can then compute E [Complicated] = E i Simplei = i E [Simplei ].)
We’ll see several useful examples soon, but let’s start with the proof:

Theorem 10.19: Linearity of expectation.


Consider a sample space S, and let X : S → R and Y : S → R be any two random variables. Then
E [X + Y] = E [X] + E [Y].

Proof. We’ll be able to prove this theorem by just invoking the definition of expectation and following
our algebraic noses:
X
E [X + Y] = (X + Y)(s) · Pr [s] definition of expectation
s∈S
X 
= X(s) + Y(s) · Pr [s] definition of the random variable X + Y
s∈S

ISTUDY
10.4 Random Variables and Expectation 555

Pr [X = 1] = 1 · Pr [X = 1]

Pr [X = 2] Pr [X = 2] = 2 · Pr [X = 2]

Pr [X = 3] Pr [X = 3] Pr [X = 3] = 3 · Pr [X = 3]

Pr [X = 4] Pr [X = 4] Pr [X = 4] Pr [X = 4] = 4 · Pr [X = 4]

Pr [X = 5] Pr [X = 5] Pr [X = 5] Pr [X = 5] Pr [X = 5] = 5 · Pr [X = 5]

Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] = 6 · Pr [X = 6]

Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] = 7 · Pr [X = 7]

Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] = 8 · Pr [X = 8]

Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] = 9 · Pr [X = 9]

.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
= Pr [X ≥ 1] ∑Pr∞[X ≥ 2]
= = Pr [X ≥ 3] = Pr [X ≥ 4] = Pr [X ≥ 5] = Pr [X ≥ 6] = Pr [X ≥ 7] = Pr [X ≥ 8] = Pr [X ≥ 9] ···
(a) The sum i= 0 i · Pr [X = i] can be visualized as adding up i copies of Pr [X = i] one row at a time.

Pr [X = 1] = 1 · Pr [X = 1]

Pr [X = 2] Pr [X = 2] = 2 · Pr [X = 2]

Pr [X = 3] Pr [X = 3] Pr [X = 3] = 3 · Pr [X = 3]

Pr [X = 4] Pr [X = 4] Pr [X = 4] Pr [X = 4] = 4 · Pr [X = 4]

Pr [X = 5] Pr [X = 5] Pr [X = 5] Pr [X = 5] Pr [X = 5] = 5 · Pr [X = 5]

Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] Pr [X = 6] = 6 · Pr [X = 6]

Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] Pr [X = 7] = 7 · Pr [X = 7]

Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] Pr [X = 8] = 8 · Pr [X = 8]

Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] Pr [X = 9] = 9 · Pr [X = 9]

.. .. .. .. .. .. .. .. .. ..
. . . . . . . . . .
= Pr [X ≥ 1] = Pr [X ≥ 2] = Pr [X ≥ 3] = Pr [X ≥ 4] = Pr [X ≥ 5] = Pr [X ≥ 6] = Pr [X ≥ 7] = Pr [X ≥ 8] = Pr [X ≥ 9] ···

(b) Alternatively, the same sum can be visualized as adding up one column at a time. The jth column contains a copy of
Pr [X = i] for every i greater than or equal to j, so the value of the sum of the jth column is Pr [X ≥ j].


Figure 10.31 A change of summation. View E[X] = ∞ i=0 i · Pr [X = i] as the sum of the entries of an infinite table,
where the ith row of the table contains i copies of Pr [X = i]. We can compute the infinite sum of the table either by
summing up all of the row sums, or by summing up all of the column sums.

ISTUDY
556 Probability

hX i hX i
= X(s) · Pr [s] + Y(s) · Pr [s] distributing the multiplication; rearranging
s∈S s∈S
= E[X] + E[Y]. definition of expectation

Therefore E [X + Y] = E [X] + E [Y], as desired.


Notice that Theorem 10.19 does not impose any requirement of independence on the random variables X
and Y: even if X and Y are highly correlated (positively or negatively), we still can use linearity of expecta-
tion to conclude that E [X + Y] = E [X] + E [Y]. There are many apparently complicated problems in which
using linearity of expectation makes a solution surprisingly straightforward. Here are a few examples:

Problem-solving tip: Often, the easiest way to compute an expectation is by finding a way to express the quantity of interest in
terms of a sum of indicator random variables.

Example 10.42: Expectation of a binomial random variable.


We have a p-biased coin (that is, Pr [heads] = p) that we flip 1000 times. What is the expected number
of heads that come up in these 1000 flips?
Solution. The intuition is fairly straightforward: a p-fraction of flips are heads, so we should expect 1000p
heads in 1000 flips. But doing the math requires a bit of work.
An abandoned first attempt. Let’s compute the probability that there are exactly k heads in a sequence of

1000 flips, and then apply the definition of expectation directly. There are 1000
k sequences of 1000 flips
that have exactly k heads, and the probability of any one of these sequences is pk (1 − p)1000−k , so
X
1000
E [number of heads] = k · Pr [number of heads = k] definition of expectation
k=0
X
1000

= k · 1000
k · pk · (1 − p)1000−k . above analysis of Pr [number of heads = k]
k=0

We could try to simplify this expression (but it turns out to be pretty hard!). Instead, let’s start over with
a different approach.
A second try. Here’s a different strategy that ends up being much easier. Define 1000 random variables
X1 , X2 , . . . , X1000 , where Xi is the indicator random variable
(
1 if the ith flip of the coin comes up heads
Xi =
0 if the ith flip of the coin comes up tails.
The total number of heads in the 1000 coin flips is given by the random variable
X = X1 + X2 + · · · + X1000 .
We can now use linearity of expectation to compute the expected number of heads much more easily:
Example 10.37
definition of X (expectation of a Bernoulli variable)
linearity of expectation
"1000 #
X X
1000 X
1000
E [number of heads] = E [X] = E Xi = E [Xi ] = p = 1000p.
i=1 i=1 i=1

ISTUDY
10.4 Random Variables and Expectation 557

Example 10.43: Number of aces in a bridge hand, better.


Recall Example 10.40, where we showed that the number A of aces in a randomly chosen 13-card hand
from a standard 52-card deck has E [A] = 1. Here is a much easier way of solving that problem:
Number your cards from 1 to 13. Let Ai be an indicator random variable that reports whether the ith
card in your hand is an ace. Then A = A1 + A2 + · · · + A13 . Note that Pr [Ai = 1] = 13
1 4
(there are 52 1
= 13
aces in the deck), so
E [A] = E [A1 + A2 + · · · + A13 ]
= E [A1 ] + E [A2 ] + · · · + E [A13 ] linearity of expectation
= 13 · 1
13 Pr [Ai = 1] = 1
13 as above, and so E [Ai ] = 1
13 (Example 10.37)
= 1.
(The random variables Ai and Aj are correlated—but, again, linearity of expectation doesn’t care! We can
still use it to conclude that E [Ai + Aj ] = E [Ai ] + E [Aj ].)

Some examples about hashing


Here are two more problems about expectation, both involving hashing:

Example 10.44: Empty slots in a hash table.


Suppose that we hash 1000 elements into a 1000-slot hash table, using a completely random hash function,
resolving collisions by chaining. (See Section 10.1.1.) How many empty slots do we expect?
Solution. Let’s compute the probability that some particular slot is empty:
Pr [slot i is empty]
= Pr [none of the 1000 elements hash to slot i]
= Pr [every element j ∈ {1, 2, . . . , 1000} hashes to a slot other than i]
Q
1000
= Pr [element j hashes to a slot other than i] elements are hashed independently
j=1

Q
1000
999
= 1000 elements are hashed uniformly, and there are 999 other slots
j=1

999 1000
= 1000 = 0.3677 · · · .
We’ll finish with the by-now-familiar calculation that also concluded the last two examples: we define a
collection of indicator random variables and use linearity of expectation. Let Xi be an indicator random
variable that’s 1 if slot i is empty and 0 if slot i is full. Then the expected number of empty slots is
1000 
P P
1000 
999 1000
E Xi = E [Xi ] = 1000 · 1000 ≈ 367.7.
i=1 i=1

Taking it further: If we stated the question from Example 10.44 in full generality, we would ask: if we hash n elements into n
slots, how many empty slots are there in expectation? Using the same approach as in Example 10.44, we’d find that the fraction
of empty slots is, in expectation, (1 − 1/n)n . Using calculus, it’s possible to show that (1 − 1/n)n approaches 1/e ≈ 0.367879 as
n → ∞. So, for large n, we’d expect to have ne empty slots when we hash n elements into n slots. We can also turn this hashing
problem on its head: we’ve been asking “if we hash n elements into n slots, how many slots do we expect to find empty?” Instead
we can ask “how many elements do we expect have to hash into n slots before all n slots are full?” This problem is called the
coupon-collector problem; see Exercises 10.138–10.139 for more.

ISTUDY
558 Probability

Let’s also consider a second example about hashing—this time counting the (expected) number of
collisions, rather than the (expected) number of empty slots:

Example 10.45: Expected collisions in a hash table.


Hash n elements A = {x1 , . . . , xn } into an m-slot hash table. Recall that a collision between two elements
xi and xj (for i ̸= j) occurs when h(xi ) = h(xj ).
̸ xj . What’s Pr [there’s a collision between xi and xj ]?
1 Consider two elements xi =
2 What is the expected number of collisions among the elements of A?

Solution. For the first question, note that a collision between xi and xj occurs precisely when, for some
index k, we have h(xi ) = k and h(xj ) = k. Thus:
Pr [collision between xi and xj ]
h     i
= Pr h(xi ) = h(xj ) = 1 or h(xi ) = h(xj ) = 2 or · · · or h(xi ) = h(xj ) = m
X
m
= Pr [h(xi ) = k and h(xj ) = k] by the sum rule; these events are disjoint
k=1
Xm
= Pr [h(xi ) = k] · Pr [h(xj ) = k] hashing assumption: hash values are independent
k=1
Xm
= 1
m · 1
m hashing assumption: hash values are uniform
k=1
m 1
= 2
= .
m m
So the probability that a particular pair of elements collides is precisely m1 .
Given this calculation, we can solve the second question by again computing the expected number of
collisions using indicator random variables and linearity of expectation. The number of collisions between
elements of A is precisely the number of unordered pairs {xi , xj } that collide. For indices i and j > i, then,
define Xi,j as the indicator random variable
(
1 if xi and xj collide
Xi,j =
0 if they do not.
Thus the expected number of collisions among the elements of A is given by
h X i
E Xi,j summing over all unordered pairs of elements
1≤i<j≤n
X
= E [Xi,j ] linearity of expectation
1≤i<j≤n
X
1 1
= m part 1 of this example: we showed E [Xi,j ] = Pr [Xi,j = 1] = m
1≤i<j≤n
n

2 n(n − 1) (n) n(n−1)
= = . there are 2 = 2 unordered pairs of elements
m 2m

ISTUDY
10.4 Random Variables and Expectation 559

Taking it further: Example 10.45 also explains the so-called birthday paradox. We showed that the expected number of colli-
sions when we hash n elements into an m-slot hash table is n(nm−1) . This formula suggests that we’d expect the first collision in
√ √
an m-slot hash table to occur when the number n of hashed elements reaches approximately 2m: for n = 2m + 1, the expected
number of collisions would be
√ √
n(n − 1) ( 2m + 1) · 2m 2m
= ≈ = 1.
2m 2m 2m
Assume that a person’s birthday is uniformly and independently chosen from the m = 365 days in the year. (Close, but not quite
1
true; certain parts of the year are nine months before days whose probabilities are notably more than 365 .) Under this assumption,
you can think of “birthday” as a random hash function mapping people to {1, 2, . . . , 365}. By Example 10.45, if you’re in a room

with more than 2 · 365 = 27.018 people, you’d expect to find a pair that shares a birthday. (It’s called a “paradox” because
most people’s intuition is that you’d need way more than 28 people in a room before you’d find a shared birthday.)

Two more examples of expectation: breaking PINs and Insertion Sort


Here’s another example of expectation, in a simple security context:

Example 10.46: Brute-force breaking of PINs.


I steal a debit card from a (former) friend. The card has a 4-digit PIN, between 0000 and 9999, that I need
to know to get all my friend’s money. Here are two strategies:
1 every day, I try a random PIN.
2 every day, I try a random PIN that I haven’t tried before.
How many days would I expect to wait before I get into my friend’s account?
Solution. If I’m trying a random PIN (without avoiding repeated guesses), then my probability of getting
1
the correct PIN on a particular day is 10000 . Thus we have a geometric random variable with parameter
1
10000 , so by Example 10.41 we expect to need 10000 days to break the PIN.
If I do avoid guessing the same PIN twice, I’ll be quicker. As usual, there are multiple ways to solve this
problem—and, for illustrative purposes, we’ll describe two of them, using fairly different approaches.

Solution A: what’s Pr [winning on the ith day]? The key will be to find the probability of breaking the
code on day i. Because we make i − 1 guesses on the i − 1 days before day i, we know
i−1
Pr [getting the PIN before day i] = 10000 (1)
Pr [not getting the PIN before day i] = 1 − i−1
10000 = 10001−i
10000 . (2)

Furthermore, on day i there are 10000 − (i − 1) untried guesses, and so


1 1
Pr [getting the PIN on day i| not getting it before day i] = 10000−(i−1) = 10001−i . (3)
Thus the expected number of days that we have to guess is:
P
10000
i · Pr [we break the code on the ith day] definition of expectation
i=1
P
10000
= i · Pr [wrong on days 1, . . . , i − 1] · Pr [right on day i|wrong on days 1, . . . , i − 1]
i=1
Chain Rule
P
10000
= i· 10001−i
10000 · 1
10001−i (2) and (3), as argued above
i=1

ISTUDY
560 Probability

P
10000
= 1
10000 · i algebra
i=1

= 1
10000 · 10000·10001
2 = 5000.5. arithmetic summation (Theorem 5.3)

(Another way to view this solution: our PIN-guessing strategy corresponds to choosing a permutation
of {0000, . . . , 9999} uniformly at random, and guessing in the chosen order. The correct PIN is equally
likely to be at any position in the permutation so, for any i, we require exactly i days with probability
1
precisely 10000 .)

Solution B: what’s Pr [have to guess on the ith day]? Define an indicator random variable Xi with Xi = 1
if we have to make a guess on the ith day, and Xi = 0 if we do not. Thus the number of days that we
P10000
have to guess is precisely X = i=1 Xi . Observe that
E [Xi ] = Pr [Xi ] = Pr [we guess incorrectly on all days 1, . . . , i − 1] = 10001−i
10000

by the same reasoning as in Solution A. Thus


P
10000
E [X] = E [Xi ] linearity of expectation
i=1
P
10000
10001−i
= 10000 the above argument
i=1
P
10000
j
= 10000 change of variables j = 10001 − i
j=1
= 5000.5. just as in Solution A

So avoiding duplication of guesses saves, in expectation, just less than half of the days: we expect to
use 10000 days if we allow duplication, and 5000.5 days if we avoid it. (Incidentally, the argument in
Solution B is just another way of viewing the transformation from Theorem 10.18: instead of calculating
P P
the value of i i · Pr [exactly i days], we calculated i Pr [at least i days].)

Let’s conclude with one last example of another type: analyzing the expected performance of an algo-
rithm on a randomly chosen input. In Example 6.13, we gave a brief intuition for the average-case
(expected) performance of Insertion Sort. (See Figure 10.32 for a reminder of the algorithm.) Here is a
somewhat different version of the analysis, which comes out with the same result:

Example 10.47: Expected performance of Insertion Sort.


Let the array A be a permutation of {1, . . . , n} chosen uniformly at random. What is the expected number
of swaps performed by insertionSort(A[1 . . . n])?
Solution. Define an indicator random variable Xj,i for indices j < i:
(
1 if the (original) elements A[j] and A[i] are swapped by insertionSort
Xj,i =
0 if not.
Note that E [Xj,i ] = Pr [Xj,i = 1] = 21 : precisely half of permutations have their ith element larger than
their jth element. (There’s a bijection between the set of permutations with their ith element larger than

ISTUDY
10.4 Random Variables and Expectation 561

insertionSort(A[1 . . . n]):
1 for i := 2 to n:
2 j := i
3 while j > 1 and A[j] < A[j − 1]:
4 swap A[j] and A[j − 1]
5 j := j − 1

Figure 10.32 A reminder of Insertion Sort.

their jth element and the set of permutations with their ith element smaller than their jth element. Because
these sets have the same size, the probability of choosing one of the former is 21 .)
Because insertionSort correctly sorts its input and only swaps out-of-order pairs once per pair, the
total number of swaps done is precisely
X
n X
i−1
X= Xi,j .
i=2 j=1

Note that the number of indicator random variables in this sum is


X
n X
i−1 X
n X
n−1
(n−1)·n 
1 = (i − 1) = i = 2 = n
2 .
i=2 j=1 i=2 i=1
  n(n−1)
Thus by linearity of expectation we have E [X] = n
2 · E [Xi,j ] = n
2 · 1
2 = 4 .

10.4.4 Conditional Expectation


Just as we did with conditional probability in Section 10.3, we can define a notion of conditional
expectation: that is, the average value of a random variable X when a particular event occurs.

Definition 10.20: Conditional expectation.


The conditional expectation of a random variable X given an event E, denoted E [X|E], is the average
value of X over all outcomes where E occurs:
X
E [X|E] = X(x) · Pr [x|E] .
x∈E

In the original definition of expectation, we summed over all x in the whole sample space; here we sum only
over the outcomes in the event E. Furthermore, here we weight the value of X by Pr [x|E] rather than by
Pr [x]. We’ll omit the details, but conditional expectation has analogous properties to those of the original
(nonconditional) version of expectation, including linearity of expectation.

Example 10.48: Hearts in poker.


In Texas Hold ’Em, a particular variant of poker, after a standard deck of cards is randomly shuffled, you
are dealt two “personal” cards, and then five “community” cards are dealt. Let P denote the number of
your personal cards that are hearts, and let C denote the number of community cards that are hearts.
1 First, some (nonconditional) expectations: what are E [P] and E [C]?
2 Second, some conditional expectations: what are the values of E [C|P = 0] and E [C|P = 2]?

ISTUDY
562 Probability

Solution. Each card that’s dealt has a 13 1


52 = 4 chance of being a heart. By linearity of expectation, then,
E [P] = 24 = 0.5 and E [C] = 54 = 1.25. (Implicitly, we’re defining indicator random variables for “the
ith card is a heart,” so P = P1 + P2 and C = C1 + · · · + C5 .)
For the conditional expectation E [C|P = 0], we know that 2 of the 39 non-heart cards were dealt as
your personal cards. (You got two personal cards, and because P = 0 neither of them was a heart.)
Thus there are still 13 undealt hearts among the remaining 50 undealt cards, so there is a 13 50 = 0.26
chance that any particular undealt card is a heart. Thus, again by linearity of expectation, we have that
E [C|P = 0] = 5 · 13 50 = 1.30.
Computing E [C|P = 2] is similar: there are 11 undealt hearts among the remaining 50 undealt cards,
so there is an 11 11
50 = 0.22 chance that any particular undealt card is a heart, and E [C|P = 2] = 5· 50 = 1.10.

We’ll omit the proof, but it’s worth noting a useful property that connects expectation to conditional
expectation, an analogy to the Law of Total Probability:

Theorem 10.21: Law of Total Expectation.


For any random variable X and any event E:
E [X] = E [X|E] · Pr [E] + E [X| E ] · (1 − Pr [E]).

That is, the expectation of X is the (weighted) average of the expectation of X when E occurs and when E
does not occur.

Taking it further: One tremendously valuable use of probability is in randomized algorithms, which flip some coins as part of
solving some problem. There is a massive variety in the ways that randomization is used in these algorithms, but one example—the
computation of the median element of an unsorted array of numbers—is discussed on p. 567. (We’ll make use of Theorem 10.21.)
Median finding is a nice example of problem for which there is a very simple, efficient algorithm that makes random choices in
its solution. (There are deterministic algorithms that solve this problem just as efficiently, but they are much more complicated
than this randomized algorithm.)

10.4.5 Deviation from Expectation


Let X be a random variable. By definition, the value of E [X] is the average value that X takes on, where
we’re averaging over many different realizations. But how far away from E [X] is X, on average? That is,
what is the average difference between (a) X, and (b) the average value of X? We might care about this
quantity in applications like political polling or scientific experimentation, for example. Suppose X is a
random variable defined as follows:


−1 the voter will vote for the Democratic candidate

X= 0 the voter will vote for neither the Democratic nor Republican candidates


+1 the voter will vote for the Republican candidate

for a voter chosen uniformly at random from the population. If E [X] < 0, then the Democrat will beat the
Republican in the election; if E [X] > 0, then the Republican will beat the Democrat. We might estimate
E [X] by calling, say, 500 uniformly chosen voters from the population and averaging their responses. We’d
like to know whether our estimate is accurate (that is, if our estimate is close to E [X]). This kind of question
is the core of statistical reasoning. We’ll only begin to touch on these questions, but here are a few of the
most important concepts.

ISTUDY
10.4 Random Variables and Expectation 563

Definition 10.22: Variance.


Let X be a random variable. The variance of X is
 
var (X) = E (X − E [X])2 .
p
The standard deviation is std (X) = var (X).

(Exercise: why didn’t we just define std (X) = E [X − E [X]]?)

Example 10.49: Variance/standard deviation of a Bernoulli random variable.


Let X be the outcome of a flipping a p-biased coin. (That is, X is a Bernoulli random variable.) We
previously showed that E [X] = p, so the variance of X is
var (X) = E [ (X − E [X])2 ] definition of expectation
= E [ (X − p) ] 2
expectation of a Bernoulli random variable (Example 10.37)
= Pr [X = 0] · (0 − p) + Pr [X = 1] · (1 − p)2
2
definition of expectation
= (1 − p) · (0 − p) + p · (1 − p)
2 2
definition of Bernoulli random variable
= (1 − p)p + p(1 − p)
2 2

= (1 − p)p · (p + 1 − p)
= (1 − p)p.
p p
Thus the standard deviation is std (X) = var (X) = (1 − p)p.

p √
(For example, for a fair coin, the standard deviation is (1 − 0.5)0.5 = 0.25 = 0.5: an average coin
flip is 0.5 units away from the mean 0.5. In fact, every coin flip is that far away from the mean!)
Here’s another small example, illustrating the fact that two random variables can have the same mean
but wildly different variances:

Example 10.50: Roulette bets.


Recall the game of roulette (there’s a reminder in Figure 10.33): a number in the set {0, 00, 1, 2, . . . , 36}
is chosen uniformly at random; there are 18 red numbers, 18 black numbers, and 2 numbers (0 and 00)
that are neither red nor black. Here are two bets available to a player in roulette:

Bet $1 on “red”:
You pay $1. If the spin lands on one of the 18 red numbers, you get $2 back; otherwise you get nothing.
Bet $1 on “17”:
You pay $1. If the spin lands on the number 17, you get $36 back; otherwise you get nothing.

Let X denote the payoff from playing the first bet, so X = 0 with probability 20
38 and X = 2 with probability
18 37
38 . Let Y denote the payoff from playing the second bet, so Y = 0 with probability 38 and X = 36 with
1
probability 38 . The expectations match:
E [X] = 20
38 ·0+ 18
38 ·2 = 36
38
and E [Y] = 37
38 ·0+ 1
38 · 36 = 36
38 .

ISTUDY
564 Probability

12

15

18

21

24

27

30

33

36
3

9
00

11

14

17

20

23

26

29

32

35
2

8
0

10

13

16

19

22

25

28

31

34
1

7
Figure 10.33 A reminder of the roulette outcomes. A number in the set {0, 00, 1, 2, . . . , 36} is chosen uniformly at
random by a spinning wheel; there are 18 red numbers {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
and 18 black numbers {2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35}; 0 and 00 are neither red nor
black.

But the variances are very different:


36 2 36 2
var (X) = 20
38 · (0 − 38 ) + 18
38 · (2 − 38 ) = 0.9972 · · ·
36 2 2
var (Y) = 37
38 · (0 − 38 ) + 1
38 · (36 − 36
38 ) = 33.2077 · · · .

Generally speaking, the expectation of a random variable measures “how good it is” (on average), while
the variance measures “how risky it is.”

Variance, the squared expectation, and the expectation of the square


Here’s a useful property of variance, which sometimes helps us avoid tedium in calculations. We can write
2
var (X) as var (X) = E [X2 ] − (E [X]) , that is, the difference between the expectation of the square of X and
the square of the expectation of X:

Theorem 10.23: Variance = expectation of the square minus the expectation2 .


For any random variable X, we have
  2
var (X) = E X2 − (E [X]) .

Proof. Writing μ = E [X], we have


 
var (X) = E (X − μ)2 definition of expectation
 
= E X2 − 2Xμ + μ2 multiplying out
   
= E X2 + E [−2Xμ] + E μ2 linearity of expectation
 
= E X2 − 2μ · E [X] + μ2 Exercise 10.153
 
= E X2 − 2μ · μ + μ2 definition of μ = E [X]
 
= E X2 − μ 2
  2
= E X2 − (E [X]) .

Here is an example in which Theorem 10.23 eases the computation:

ISTUDY
10.4 Random Variables and Expectation 565

Example 10.51: Variance/standard deviation of a uniform random variable.


Let X be the result of a roll of a fair die. What is var (X)?
Solution. Because Pr [X = k] = 1
6 for all k ∈ {1, . . . , 6}, we have that
E [X] = 1
6 · (1 + 2 + 3 + 4 + 5 + 6)
= 1
6 · 21
= 3.5.
 
Similarly, we can compute E X2 as follows:
 
E X2 = 16 · (12 + 22 + 32 + 42 + 52 + 62 )
= 1
6 · 91
≈ 15.1666 · · · .
Therefore, by Theorem 10.23,
2
var (X) = E [X2 ] − (E [X]) = 91
6 − 49
4 = 35
12 ≈ 2.9116 · · · ,
p
and std (X) = 35/12 ≈ 1.7078 · · · .
(In Exercise 10.152, you’ll show that the standard deviation of the average result of two independent
dice rolls is much smaller.)

Taking it further: Suppose that we need to estimate the fraction of [very complicated objects] that have [easy-to-verify property]:
would I win a higher fraction of chess games with Opening Move A or B? Roughly how many different truth assignments satisfy
Boolean formula φ? Roughly how many integers in {2, 3, . . . , n − 1} evenly divide n? Is the array A “mostly” sorted?
One nice way to approximate the answer to these questions is the Monte Carlo method, one of the simplest ways to use ran-
domization in computation. The basic idea is to compute many random candidate elements—chess games, truth assignments,
possible divisors, etc.—and test each one; we can then estimate the answer to the question of interest by calculating the fraction
of those random candidates that have the property in question. See p. 566.

ISTUDY
566 Probability

Computer Science Connections

The Monte Carlo Method


If we need to compute some (potentially very complicated) quantity, one way to do so is the Monte Carlo method.
Let’s take a computation of area of a potentially complicated shape as an example. If we identify a bounding box (a
rectangle surrounding the shape) and then generate a sequence of random points in the bounding box, we can count
how many of those points fall into the shape in question.
For example, to find the area
of the shape in Figure 10.34, we
can throw a random point into the
bounding box. The probability that
the randomly chosen point is inside
the polygon is precisely the ratio of
the area of the polygon to the area
of the bounding box—and thus the
expected fraction of points that land
inside the shape precisely yields the
area of the shape. Of course, the
more points we throw at the bound-
ing box, the more accurate our esti-
mate of the area will be: the fraction
of heads in n flips of a p-biased coin
has a much lower variance (but the
same expectation) as n gets bigger
and bigger. (See Exercise 10.157.) Figure 10.34 A shape, and an estimate of its area with random points: we
There are a few issues compli- simply estimate the area using the fraction of the chosen points that fall within
cating this approach. First, we must
the shape. The more points, the more accurate the estimate.
find a bounding box for which the
shape in question covers a “large” fraction of the bounding box. (If the probability p of a random point falling into
the shape is tiny, then a little bad luck in sampling—2 points land inside instead of 3?—causes huge relative [mul-
tiplicative] error in our area estimate.) Second, we’ve described this process as choosing a uniform point from the
bounding box—which requires infinitesimal probabilities associated with each of the infinitely many points inside
the bounding box. The handle this, typically we would define a “mesh” of points: we specify a “resolution” ϵ and
choose a coordinate of a random point as k/ϵ for a random k ∈ {0, 1, . . . , 1/ϵ}.
The example in Figure 10.34 is a
nice way of being lazy—we could
have calculated the area of the poly-
gon with some tedious algebra—
but there are some other examples
in which this technique is even
more useful. For example, some of
the simplest methods for estimat-
ing the value of π in the last cen-
tury were based on Monte Carlo
methods. One option is to throw
a point ⟨x, y⟩ into the unit square
[0, 1] × [0, 1] and test what fraction
have x2 + y2 ≤ 1. Another is an
algorithm called Buffon’s needle— Figure 10.35 Estimating π with a point in the square, or with Buffon’s needle.
named after an eighteenth-century
French mathematician—in which we throw unit-length “needles” onto a surface with parallel lines one unit apart;
one can show that Pr [a needle crosses a line] = π2 . See Figure 10.35.

ISTUDY
10.4 Random Variables and Expectation 567

Computer Science Connections

A Randomized Algorithm for Finding Medians


The median element of an array A[1 . . . n] is the item that would appear in the ⌈n/2⌉th slot of the sorted order if we
sorted A. For example, the median of [9, 7, 3, 1, 5] is 5, and the median of [4, 3, 2, 1] is 2. (We arbitrarily chose to find
the ⌈n/2⌉th element instead of the ⌊n/2⌋th.) This description already suggests a solution to the median problem: sort
A, and then return A[⌈n/2⌉]. But we can do better than the sorting-based approach: we’ll give a faster algorithm for
finding the median element of an unsorted array. Our algorithm will be randomized, and the expected running time
of the algorithm will be linear.
Perhaps a little counterintuitively, it will turn out to be easier to solve a generalization of the median problem,
called Select. (A more general problem seems like it would be harder to solve, not easier, but solving a more general
problem recursively means that we can get more value out of our recursive calls.) Here is the Select problem:
Input: an array A[1 . . . n] and an index k ∈ {1, . . . , n}.
Output: the element x in A such that, if you were to sort A, x would appear in the kth slot of the sorted array.
(We can solve the median problem by setting k = ⌈n/2⌉.)
A recursive solution to Select is given in Figure
10.36. A proof of correctness of the algorithm—that randSelect(A[1 . . . n], i):
is, a proof that randSelect actually solves the Select
1 if n = 1 then Find the ith-largest
problem—is reasonably straightforward by induction.
2 return A[1]. (If i ̸= 1, then error.) element of A. If
(In fact, correctness is guaranteed regardless of how we i∈/ {1, 2, . . . , n},
3 choose x ∈ {1, . . . , n} randomly then error.
choose x in Line 3 of the algorithm.) But we still have
4 Losers[1 . . . ℓ] := {y ∈ A : y < A[x]}
to analyze the running time. 5 Winners[1 . . . w] := {y ∈ A : y > A[x]}.
6 if i < ℓ + 1 then
Running time: the big picture. Let’s think about how
7 return randSelect(Losers, i) Build Losers and
fast the algorithm might be. To do so, think about an Winners by going
8 else if i = ℓ + 1 then
invocation of randSelect(A, k), and think about the through A
9 return A[x] element-by-element.
array A that it is given as input: 10 else if i > ℓ + 1 then
11 return randSelect(Winners, i − ℓ − 1)

In each call to
randSelect, the
Now imagine the same input array, but in sorted order pivot #1 algorithm chooses a
and divided into quartiles: random pivot value,
and divides the other
elements by
comparing them to
pivot #2 the pivot
(maintaining each
bottom 25% middle 50% top 25% group’s order). It
then makes a
recursive call on the
Here are two crucial observations: pivot #3 appropriate portion.
Observation #1: Suppose that the element A[x] cho- With a bit of care in
bookkeeping, we can
sen in Line 3 of randSelect—call A[x] the pivot—falls always figure out the
within the middle section of the sorted order above. (and so forth, until only one element rank of the current
pivot in the original
Then we know that |Losers| ≤ 3n 4 and |Winners| ≤ 4 .
3n
remains under consideration) array.
Observation #2: The middle section of the sorted Figure 10.36 Randomized Select: the pseudocode, and a
order contains half of the elements of A. (More briefly: visualization.
half of the elements of A are in the middle half of A.)
So what? Whenever we choose an element from the middle half of the sorted order, the next recursive call is on an
array of size at most 34 the size of the original input. Also observe that the running time of any particular call (aside
from the recursive call) is linear in the input size.

(continued)

ISTUDY
568 Probability

Computer Science Connections

A Randomized Algorithm for Finding Medians, Continued


Thus, if we got lucky every time and unfailingly picked an element from the middle half of the array, we’d have a
recurrence like the following:
T(1) = 1 T(n) ≤ n + T(3n/4)
That’s a classic divide-and-conquer recurrence with a solution of T(n) = Θ(n). (Actually Theorem 6.21—the theo-
rem that addressed divide-and-conquer recurrences—only says that T(n) = O(n), because we have an inequality in
the recurrence, rather than an equality. But the running time is definitely Ω(n) as well, because just building Losers
and Winners at the root takes Ω(n) time.)

Running time: making it formal. We engaged in a bit of wishful thinking in the above analysis: it’s obviously not true
that we get a pivot in the middle half of the array every time. In fact, it’s only half the time! But this isn’t so bad: even
if we imagine that picking a pivot outside the middle half yields zero progress at all toward the base case, we’d only
double the estimate of the running time! Let’s make this formal. Define
Cn = the number of comparisons performed by randSelect on an input of size n.
Cn is a random variable: the number of comparisons performed depends on which pivots are chosen. But we can
analyze E [Cn ]. (Before we start, one quick observation: the expected running time of this algorithm is monotonic in
its input size; that is, E [Cn ] ≤ E [Cn′ ] if n ≤ n′ . This fact is tedious to prove rigorously, but is still not too surprising.)

Theorem: The number of comparisons Cn performed by randSelect(A[1 . . . n], k) satisfies E [Cn ] ≤ 8n.

Proof (by strong induction on n). For the base case (n = 1), the algorithm goes into the base case of its recursion
when n = 1, and thus performs zero comparisons—and indeed 0 ≤ 8.
[ For] the inductive case (n ≥ 2), we assume the inductive hypothesis, namely that for any n′ < n, we have that
′ ′
E Cn ≤ 8n . We must prove that E [Cn ] ≤ 8n. Let’s consider the comparisons that are made on an input array of size
n. First, there are n comparisons performed in Lines 4–5, to compute Losers and Winners. Then there are whatever
comparisons are made in the recursive call. Because we’re trying to compute a worst-case bound, we’ll make do
with the following observation: Cn ≤ n + Cmax(|Losers|,|Winners|) . (That is, we can’t do worse than whichever “half” is
larger.) Let M denote the event that our pivot is in the middle half of A’s sorted order. Thus:
[ ]
E [Cn ] ≤ E n + Cmax(|Losers|,|Winners|) the above accounting of the comparisons
[ ]
= n + E Cmax(|Losers|,|Winners|) linearity of expectation
[ ] [ ] [ ]
= n + E Cmax(|Losers|,|Winners|) |M · Pr [M] + E Cmax(|Losers|,|Winners|) | M · Pr M
Law of Total Expectation (Theorem 10.21)
[ [ ] [ ]]
=n+ 1
2 · E Cmax(|Losers|,|Winners|) |M + E Cmax(|Losers|,|Winners|) | M
[ ]
Crucial observation #2: Pr [M] = Pr M = 1
[ [ ] ]
2

≤n+ 1
2 · E C3n/4 + E [Cn ] .
Crucial observation #1: if M occurs, we recurse on ≤ 3n
4 elements; else it’s certainly on ≤ n elements.

Thus we have argued that


[ ]
E [Cn ] ≤ n + 12 · E C3n/4 + 12 · E [Cn ] and therefore
[ ]
E [Cn ] ≤ 2n + E C3n/4 . starting with the previous inequality and subtracting 1
2 · E [Cn ] from both sides, and then multiplying both sides by 2

[ ]
The inductive hypothesis says that E C3n/4 ≤ 8 · 3n
4 = 6n, so we therefore have

E [Cn ] ≤ 2n + 6n = 8n.

ISTUDY
Exercises 569

EXERCISES
Choose a word in S = {Computers, are, useless, They, can, only, give, you, answers} (a quote attributed to Pablo Picasso)
by choosing a word w with probability proportional to the number of letters in w. Let L be a random variable denoting the number
of letters in the chosen word, and let V be a random variable denoting the number of vowels.
10.106 Give a table of outcomes and their probabilities, together with the values of L and V.
10.107 What is Pr [L = 4]? What is E [V|L = 4]?
10.108 Are L and V independent?
10.109 What are E [L] and E [V]?
10.110 What is var (L)?
10.111 What is var (V)?

Flip a fair coin 16 times. Define the following two random variables:
• Let H be an indicator random variable that’s 1 if at least one of the 16 flips comes up heads, and 0 otherwise.
• Let R be a random variable equal to the length of the longest “run” in the flips. (A run of length k is a sequence of k consecutive
flips that all come up heads, or k consecutive flips that all come up tails.)
10.112 What’s E [H]?
10.113 What’s E [R]? (Hint: write a program—not by simulating many sequences of 16 coin flips, but rather by listing exhaustively all
outcomes.)
10.114 Are H and R independent?

In 1975, a physicist named Michael Winkelmann invented a dice-based game with the following three (fair) dice.
Blue die: sides 1, 2, 5, 6, 7, 9 Red die: sides 1, 3, 4, 5, 8, 9 Black die: sides 2, 3, 4, 6, 7, 8
There are some weird properties of these dice, as you’ll see.
10.115 Choose one of the three dice at random, roll it, and call the result X. Show that Pr [X = k] = 19 for any k ∈ {1, . . . , 9}.
10.116 Choose one of the three dice at random, roll it, and call the result X. Put that die back in the pile and again (independently) choose
1
one of the three dice at random, roll it, and call the result Y. Show that Pr [9X − Y = k] = 81 for any k ∈ {0, . . . , 80}.
10.117 Roll each die. Call the results B (blue), R (red), and K (black). Compute E [B], E [R], and E [K].
10.118 Define B, R, and K as in the last exercise. Compute Pr [B > R|B ̸= R], Pr [R > K|R ̸= K], and Pr [K > B|K ̸= B]—in particular,
show that all three of these probabilities (strictly) exceed 12 .

Exercise 10.118 demonstrates that the red, blue, and black dice are nontransitive, using the language of relations (Chapter 8):
you’d bet on Blue beating Red and you’d bet on Red beating Black, but (surprisingly) you’d want to bet on Black beating Blue.
Here’s another, even weirder, example of nontransitive dice. (And if you’re clever and mildly unscrupulous, you can win some
serious money in bets with your friends using these dice.)
Kelly die: sides 3, 3, 3, 3, 3, 6 Lime die: sides 2, 2, 2, 5, 5, 5 Mint die: sides 1, 4, 4, 4, 4, 4
These dice are fair; each side comes up with probability 61 . Roll each die, and call the resulting values K, L, and M.
10.119 Show that the expectation of each of these three random variables is identical.
10.120 Show that Pr [K > L], Pr [L > M], and Pr [M > K] are all strictly greater than 12 .
10.121 You can think of Exercise 10.120 as showing that, if you had to bet on which of K or L would roll a higher number, you should bet
on K. (And likewise for L over M, and for M over K.) Now let’s think about rolling each die twice and adding the two rolled values
together. Roll each die twice, and call the resulting values K1 , K2 , L1 , L2 , M1 , and M2 , respectively. Show that the expectation of
the three values K1 + K2 , L1 + L2 , and M1 + M2 are identical.
10.122 (programming required) With K1 , K2 , L1 , L2 , M1 , and M2 defined as in the previous exercise, show that the following probabilities
are all strictly less than 21 , by writing a program to check how many of the 64 outcomes cause K1 + K2 > L1 + L2 , etc.:

Pr [K1 + K2 > L1 + L2 ] , Pr [L1 + L2 > M1 + M2 ] , and Pr [M1 + M2 > K1 + K2 ] .

(Notice that which die won switched directions—and all we did was go from rolling the dice once to rolling them twice!)

You are dealt a 5-card hand from a standard deck. For the next two exercises, define a pair as any two cards with the same rank—so
♣A♡A♢A23 contains three pairs (♡A♢A and ♣A♢A and ♣A♡A). Let P denote the number of pairs in your hand.
10.123 Compute E [P] “the hard way,” by computing Pr [P = 0], Pr [P = 1], Pr [P = 2], and so forth. (There can be as many as 6 pairs in
your hand, if you have four-of-a-kind.)
10.124 Compute E [P] “the easy way,” by defining an indicator random variable Ri,j that’s 1 if and only if cards #i and #j are a pair,
 
computing E Ri,j , and using linearity of expectation.

ISTUDY
570 Probability

In bridge, you are dealt a 13-card hand from a standard deck. A hand’s high-card points are awarded for face cards: 4 for an ace,
3 for a king, 2 for a queen, and 1 for a jack. A hand’s distribution points are awarded for having a small number of cards in a
particular suit: 1 point for a “doubleton” (only two cards in a suit), 2 points for a “singleton” (only one card in a suit), and 3
points for a “void” (no cards in a suit).
10.125 What is the expected number of high-card points in a bridge hand? (Hint: define some simple random variables, and use linearity
of expectation.)
10.126 What is the expected number of distribution points for hearts in a bridge hand? (Hint: calculate the probability of having exactly
2 hearts, exactly 1 heart, or no hearts in a hand.)
10.127 Using the results of the last two exercises and linearity of expectation, find the expected number of points (including both high-card
and distribution points) in a bridge hand.

10.128 Consider a random variable X : S → R. Definition 10.17 defined the expectation of X in two different ways:
X X
E [ X] = X(x) · Pr [x] and E [ X] = (y · Pr [X = y]) .
x∈S y∈R

Argue that these two formulations are equivalent.

10.129 We’ve shown linearity of expectation—the expectation of a sum equals the sum of the expectations—even when the random
variables in question aren’t independent. It turns out that the expectation of a product equals the product of the expectations when
the random variables are independent, but not in general when they’re dependent. Let X and Y be independent random variables.
Prove that E [X · Y] = E [X] · E [Y].
10.130 On the other hand, suppose that X and Y are dependent random variables. Prove that E [X · Y] is not necessarily equal to E [X] · E [Y].
10.131 Suppose that X and Y are dependent random variables. Prove that E [X · Y] is also not necessarily unequal to E [X] · E [Y].

10.132 Example 10.41 showed that the expected number of flips of a p-biased coin before we get heads is precisely 1p . How many flips
would you expect to have to make before you see 1000 heads in total (not necessarily consecutive)? (Hint: define a random variable
Xi denoting the number of coin flips after the (i − 1)st heads before you get another heads. Then use linearity of expectation.)
10.133 How many flips would you expect to make before you see two consecutive heads?


In Insertion Sort, we showed in Example 10.47 that the expected number of swaps is n2 /2 for a randomly sorted input. With respect
to comparisons, it’s fairly easy to see that each element participates in one more comparison than it does swap—with one exception:
those elements that are swapped all the way back to the beginning of the array. Here you’ll precisely analyze the expected number
of comparisons.
10.134 What is the probability that the ith element of the array is swapped all the way back to the beginning of the array?
10.135 What’s the expected number of comparisons done by Insertion Sort on a randomly sorted n-element input?

Suppose we hash n elements into an 100,000-slot hash table, resolving collisions by chaining.
10.136 Use Example 10.45 to identify the smallest n for which the expected number of collisions first reaches 1. What the smallest n for
which the expected number of collisions exceeds 100,000?
10.137 (programming required) Write a program to empirically test your answers from the last exercise, by doing k = 1000 trials of
loading [your first answer from Exercise 10.136] elements into a 100,000-slot hash table. Also do k = 100 trials of loading [your
second answer from Exercise 10.136] elements. On average, how many collisions did you see?

Consider an m-slot hash table that resolves collisions by chaining. In the next few problems, we’ll figure out the expected number
of elements that must be hashed into this table before every slot is “hit”—that is, until every cell of the hash table is full.
(The problem you’ll address in the next two exercises is called the coupon collector problem among computer scientists: imagine,
say, a cereal company that puts one of n coupons into each box of cereal that it sells, choosing which coupon type goes into each
box randomly. How many boxes of cereal must a serial cereal eater buy before he collects a complete set of the n coupons?)
10.138 Suppose that the hash table currently has i − 1 filled slots, for some number i ∈ {1, . . . , m}. What is the probability that the next
element that’s hashed falls into an unoccupied slot? Let the random variable Xi denote the number of elements that are hashed until
one more cell is filled. What is E [Xi ]?
P
10.139 Argue that the total number X of elements hashed before the entire hash table is full is given by X = m i=1 Xi . Using Exercise 10.138
P
and linearity of expectation, prove that E [X] = m · Hm . (Recall that Hm denotes the mth harmonic number, where Hm = m 1
i= 1 i .
See Definition 5.8.)

ISTUDY
Exercises 571

True story: some nostalgic friends and I were trying to remember all of the possible responses on a Magic 8 Ball, a pseudopsychic
toy that reveals one of 20 answers uniformly at random when it’s shaken—things like {ask again later, signs point to yes, don’t
count on it, . . .}. We found a toy shop with a Magic 8 Ball in stock and started asking it questions. We hoped to have learned all
20 different answers before we got kicked out of the store.
10.140 What is the probability that we’d get 20 different answers in our first 20 trials?
10.141 In expectation, how many trials would we need before we found all 20 answers? (Use the result on coupon collecting from
Exercise 10.139.)
10.142 In Exercise 10.141, you determined the number of trials that, on average, are necessary to get all 20 answers. But how likely are
we to succeed with a certain number of trials? Suppose we perform 200 trials. What is the probability that a particular answer (for
example, “ask again later”) was never revealed in any of those 200 trials?
10.143 Use the Union Bound (Exercise 10.37) and the previous exercise to argue that the probability that we need more than 200 trials to
see all 20 answers is less than 0.1%.

10.144 Suppose that one random bit in a 32-bit number is corrupted (that is, flipped from 0 to 1 or from 1 to 0). What is the expected size
of the error (thinking of the change of the value in binary)? What about for a random bit in an n-bit number?

10.145 Suppose that the numbers {1, . . . , n} are randomly ordered—that is, we choose a random permutation π of {1, . . . , n}. For a
particular index i, what is the probability that πi = i—that is, the ith biggest element is in the ith position?
10.146 Let X be a random variable denoting the number of indices i for which πi = i. What is E [X]? (Hint: define indicator random
variables and use linearity of expectation.)

10.147 Markov’s inequality states that, for a random variable X that is always nonnegative (that is, for any x in the sample space, we have
X(x) ≥ 0), the following statement is true, for any α ≥ 1:

E [ X]
Pr [X ≥ α] ≤ .
α

Prove Markov’s inequality. (Hint: use conditional expectation.) (Markov’s inequality is named after Andrey Markov (1856–1922),
a Russian mathematician. A number of other important ideas in probability are also named after him, like Markov processes, Hidden
Markov models, and more.)
10.148 The median of a random variable X is a value x such that

Pr [X ≤ x] ≥ 12 and Pr [X ≥ x] ≥ 12 .

Using Markov’s inequality, prove that the median of a nonnegative random variable X is at most 2 · E [X].

Take a fair coin, and repeatedly flip it until it comes up heads. Let K be a random variable indicating the number of flips performed.
(We’ve already shown that E [K] = 2, in Example 10.41.) You are offered a chance to play a gambling game, for the low low price of
y dollars to enter. A fair coin will be flipped until it comes up heads, and you will be paid (3/2)K dollars if K flips were required. (So
there’s a 12 chance that you’ll be paid $1.50 because the first flip comes up heads; a 14 chance that you’ll be paid $2.25 = (1.50)2
because the first flip comes up tails and the second comes up heads, and so forth.)
10.149 Assuming that you care only about expected value—that is, you’re willing to play if and only if E [(3/2)K ] ≥ y—then what value
of y is the break-even point? (In other words, what is E [(3/2)K ]?)
10.150 Let’s sweeten the deal slightly: you’ll be paid 2K dollars if K flips are required. Assuming that you still care only about expected
value, then what value of y is the break-even point? (Be careful!)

10.151 Let X be the number of heads flipped in 4 independent flips of a fair coin. What is var (X)?
10.152 Let Y be the average of two independent rolls of a fair die. What is var (Y)?

10.153 Let a ∈ R, and let X be a random variable. Prove that E [a · X] = a · E [X].


10.154 Let a ∈ R, and let X be a random variable. Prove that var (a · X) = a2 · var (X).
10.155 Prove that var (X + Y) = var (X) + var (Y) for two independent random variables X and Y. (Hint: use Exercise 10.129.)
10.156 Let X be a random variable following a binomial distribution with parameters n and p. (That is, X is the number of heads found in n
flips of a p-biased coin.) Using Exercise 10.155 and the logic as in Example 10.42, show that E [X] = np and var (X) = np(1 − p).
10.157 Flip a p-biased coin n times, and let Y be a random variable denoting the fraction of those n flips that came up heads. What are E [Y]
and var (Y)?

ISTUDY
572 Probability

In the next few exercises, you’ll find the variance of a geometric random variable. This derivation will require a little more work
than the result from Exercise 10.156 (about the variance of a binomial random variable); in particular, we’ll need a preliminary
result about summations first:
10.158 (calculus required) Prove the following two formulas, for any real number r with 0 ≤ r < 1:

X ∞
X
r r(1 + r)
iri = i2 ri = .
(1 − r)2 (1 − r)3
i=0 i=0

Pn rn 1 − 1
+

(Hint: use the geometric series formula i=0 ri = from Theorem 5.5, differentiate, and take the limit as n grows. Repeat
r−1
for the second derivative.)
10.159 Let X be a geometric random variable with parameter p. (That is, X denotes the number of flips of a p-biased coin we need before
 
we see heads for the first time.) What is var (X)? (Hint: compute both E [X]2 and E X2 . The previous exercise will help with at
least one of those computations.)

Recall from Chapter 3 that a proposition is in 3-conjunctive normal form (3CNF) if it is the conjunction of clauses, where each
clause is the disjunction of three different variables/negated variables. For example,

(¬p ∨ q ∨ r) ∧ (¬q ∨ ¬r ∨ x)

is in 3CNF. Recall further that a proposition φ is satisfiable if it’s possible to give a truth assignment for the variables of φ to
true/false so that φ itself turns out to be true. We’ve previously discussed that it is believed to be computationally very difficult to
determine whether a proposition φ is satisfiable (see p. 104)—and it’s believed to be very hard to determine whether φ is satisfiable
even if φ is in 3CNF. But you’ll show here an easy way to satisfy “most” clauses of a proposition φ in 3CNF, using randomization.
10.160 Let φ be a proposition in 3CNF. Consider a random truth assignment for φ—that is, each variable is set independently to True
with probability 21 . Prove that a particular clause of φ is true under this truth assignment with probability ≥ 78 .
10.161 Suppose that φ has m clauses and n variables. Prove that the expected number of satisfied clauses under a random truth assignment
is at least 7m
8 .
10.162 Prove the following general statement about any random variable: Pr [X ≥ E [X]] > 0. (Hint: use conditional expectation.) Then,
using this general fact and Exercise 10.161, argue that, for any 3CNF proposition φ, there exists a truth assignment that satisfies
at least 78 of φ’s clauses.
(Although we won’t prove it here, one can also show that there’s a very good chance—at least 8/m—that a random truth assignment
satisfies at least 7m/8 clauses, and therefore we expect to find such a truth assignment within m/8 random trials. This algorithm
is called Johnson’s algorithm, named after David Johnson (1945–2016). For the details on this algorithm and other approaches to
satisfiability, see a good book on randomized algorithms, like [90] or [93].)

ISTUDY
10.5 Chapter at a Glance 573

10.5 Chapter at a Glance

Probability, Outcomes, and Events


Imagine a process by which some quantities of interest are determined in some random way. An outcome,
or realization, of this probabilistic process is the sequence of results for all randomly determined quantities.
The sample space S is the set of all possible outcomes. A probability function Pr : S → R describes, for
each outcome s ∈ S, the fraction of the time that s occurs. The probability function Pr must satisfy two
P
conditions: (i) s∈S Pr [s] = 1, and (ii) Pr [s] ≥ 0 for every s ∈ S.
An event is a subset of S, and the probability of an event E, written Pr [E], is the sum of the probabilities
of all of the individual outcomes contained in E. We have that Pr [S] = 1 and Pr [∅] = 0. For events
A and B, writing A (“not A”) to denote the event A = S − A, we have that Pr [ A ] = 1 − Pr [A], and
Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
We can use a tree diagram to represent a sequence of random choices, where internal nodes of the tree
correspond to random decisions made by the probabilistic process and where leaves correspond to the
outcomes in the sample space. Every edge leaving an internal node is labeled with the probability of the
corresponding random decision; the probability of a particular outcome is precisely equal to the product
of the labels on the edges leading from the root to its corresponding leaf.
The uniform distribution is the probability distribution in which all outcomes in the sample space S are
equally likely—that is, when Pr [s] = |S| 1
for each s ∈ S. (Nonuniform probability is when this equality
does not hold.)
The Bernoulli distribution with parameter p is the probability distribution that results from flipping one
coin, where the sample space is {H, T} and Pr [H] = p (and thus Pr [T] = 1 − p). Such a coin is called
p-biased. Each coin flip is called a trial; the flip is called fair if p = 12 .
The binomial distribution with parameters n and p is a distribution over the sample space {0, 1, . . . , n}
determined by flipping a p-biased coin n times and counting the number of times the coin comes up heads.

Here Pr [k] = nk · pk · (1 − p)n−k denotes the probability that there are precisely k heads in the n flips.
The geometric distribution with parameter p is a distribution over the positive integers, where the out-
put is determined by the number of flips of a p-biased coin required before we first see a heads; thus
Pr [k] = (1 − p)k−1 · p for any integer k ≥ 1.

Independence and Conditional Probability


When there are multiple events of interest, then one useful way understanding the relationship between two
events is to understand whether one event’s occurrence changes the likelihood of the other event also occur-
ring. When there’s no change, the events are called independent; when there is a change in the probability,
the events are called dependent. More formally, two events A and B are independent (or uncorrelated) if
and only if Pr [A ∩ B] = Pr [A] · Pr [B]. Otherwise the events A and B are called dependent (or correlated).
Intuitively, A and B are dependent if A’s occurrence/nonoccurrence tells us something about whether B
occurs. When knowing that A occurred makes B more likely to occur, we say that A and B are positively
correlated; when A makes B less likely to occur, we say that A and B are negatively correlated.
The conditional probability of A given B is

Pr [A ∩ B]
Pr [A|B] = .
Pr [B]

ISTUDY
574 Probability

(Treat Pr [A|B] as undefined when Pr [B] = 0.) Intuitively, we can think of Pr [A|B] as “zooming” the
universe down to the set B. Two events A and B for which Pr [B] ̸= 0 are independent if and only if
Pr [A|B] = Pr [A].
Events A and B are called conditionally independent given a third event C if Pr [A|B ∩ C] = Pr [A|C].
There are a few useful equivalences based on conditional probability. For any events A and B, the Chain
Rule says that Pr [A ∩ B] = Pr [B] · Pr [A|B]; more generally,
Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] = Pr [A1 ] · Pr [A2 |A1 ] · Pr [A3 |A1 ∩ A2 ] · · · · · Pr [Ak |A1 ∩ · · · ∩ Ak−1 ] .
The Law of Total Probability says that Pr [A] = Pr [A|B] · Pr [B] + Pr [A| B ] · Pr [ B ].
Bayes’ Rule is a particularly useful rule that allows us to “flip around” a conditional probability
statement: for any two events A and B, we have
Pr [B|A] · Pr [A]
Pr [A|B] = .
Pr [B]

Random Variables and Expectation


The probabilistic statements that we’ve considered so far are about events (“whether or not” questions); we
can also consider probabilistic questions about “how much” or “how often.” A random variable X assigns a
numerical value to every outcome in the sample space S—that is, a random variable is a function X : S → R.
(Often we write X to denote the value of a random variable X for a realization chosen according to Pr, or
perform arithmetic on random variables.) An indicator random variable is a {0, 1}-valued random variable.
Two random variables X and Y are independent if every two events of the form “X = x” and “Y = y” are
independent.
The expectation of a random variable X, denoted E [X], is the average value of X, defined as
P
E [X] = x∈S X(x) · Pr [x]. A Bernoulli random variable with parameter p has expectation p. A binomial
random variable with parameters p and n has expectation pn. A geometric random variable with parameter
p has expectation 1p .
Linearity of expectation is the very useful fact that the expectation of a sum is the sum of the expectations.
That is, for random variables X : S → R and Y : S → R, we have E [X + Y] = E [X] + E [Y]. (Note that there
is no requirement of independence on X and Y!) Another useful fact is that, for a positive integer–valued
P∞
random variable X : S → Z≥0 , we have E [X] = i=1 Pr [X ≥ i].
The conditional expectation of a random variable X given an event E is the average value of X over
P
outcomes where E occurs, defined as E [X|E] = x∈E X(x) · Pr [x|E].
The variance of a random variable X is
    2
var (X) = E (X − E [X])2 = E X2 − (E [X]) .
p
The standard deviation is std (X) = var (X).

ISTUDY
10.5 Chapter at a Glance 575

Key Terms and Results

Key Terms Key Results


Probability, Outcomes, and Events Probability, Outcomes, and Events
• outcome/realization 1 For a sample space S and events A and B, writing A
• sample space (“not A”) to denote the event S − A, we have that
• probability function/distribution Pr [S] = 1, Pr [∅] = 0, Pr [ A ] = 1 − Pr [A], and
• event Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B].
• tree diagram 1
2 Under the uniform distribution, Pr [s] = |S| for every
• uniform vs. nonuniform probability s ∈ S. Consider parameters p and n. Under a Bernoulli
• fair vs. biased coin flips distribution, Pr [H] = p and Pr [T] = 1 − p. Under a

• uniform distribution binomial distribution, Pr [k] = nk pk (1 − p)n−k . Under a
• Bernoulli distribution geometric distribution, Pr [k] = (1 − p)k−1 p.
• binomial distribution
• geometric distribution Independence and Conditional Probability
1 Events A and B are independent if and only if
Independence and Conditional
Pr [A ∩ B] = Pr [A] · Pr [B], or, equivalently, if
Probability
Pr [A|B] = Pr [A].
• independent/uncorrelated events 2 Chain Rule: Pr [A ∩ B] = Pr [B] · Pr [A|B].
• dependent/correlated events 3 Law of Total Probability:
• positive/negative correlation Pr [A] = Pr [A|B] · Pr [B] + Pr [A| B ] · Pr [ B ].
• conditional probability 4 Bayes’ Rule: Pr [A|B] = Pr[B|A]·Pr[A]
Pr[B] .
• conditional independence
• Chain Rule Random Variables and Expectation
• Law of Total Probability 1 The expectation of a random variable X is the average
• Bayes’ Rule P
value of X, defined as E [X] = x∈S X(x) · Pr [x].
2 A Bernoulli random variable with parameter p has
Random Variables and Expectation expectation p. A binomial random variable with
• random variable parameters p and n has expectation pn. A geometric
• indicator random variable random variable with parameter p has expectation 1p .
• independent random variables 3 Linearity of expectation: for any two random variables X
• expectation and Y, we have E [X + Y] = E [X] + E [Y]. (Note that
• linearity of expectation there is no requirement of independence on X and Y!)
• conditional expectation 4 For a random variable X : S → Z≥0 , we have that
P∞
• variance E [X] = i=1 Pr [X ≥ i].
• standard deviation 5 For a random variable X, we have
    2
var (X) = E (X − E [X])2 = E X2 − (E [X]) .

ISTUDY
ISTUDY
11 Graphs and Trees

In which our heroes explore the many twisting paths through the
gnarled forest, emerging in the happy and peaceful land in which
their computational adventures will continue.

ISTUDY
577
578 Graphs and Trees

11.1 Why You Might Care


Oh what a tangled web we weave,
When first we practise to deceive!
Sir Walter Scott (1771–1832)
Marmion (1808)

In computer science, a graph means a network: a collection of things (people, web pages, subway sta-
tions, animal species, . . .) where some pairs of those things are joined by some kind of pairwise relationship
(spent more than 15 minutes inside an enclosed space with, has a [hyper]link to, is the stop before/after on
some subway line, is a predator of, . . .). It’s possible to make graphs sound hopelessly abstract and utterly
uninteresting—a graph is a pair ⟨V, E⟩, where V is a nonempty collection of entities called nodes and E
is a collection of edges that join pairs of nodes—but graphs are fascinating whenever the entities and the
relationship represented by the edges are themselves interesting! Here are just a few examples.
In a social network, the nodes are people, and an edge between two people represents a friendship (or
whatever friendship-like relationship is represented in an online social networking site). Figure 11.1 shows
an example. Or we could define such a graph based on communication—an edge between two people
might represent that they have texted each other within the last year—or as a contact network, where an
edge between two people represents close physical interaction (for example, the kind of proximity and
duration that would count as a “close contact” for the purpose of possible disease transmission).
In a road network, edges represent roads, and nodes represent intersections of roads. A graph like this
one is what your GPS operates on when it finds driving directions: the shortest sequence of edges that gets
you from where you are to where you’re going. (Sometimes “shortest” can be a little subtle; for example,
some delivery companies save fuel and money by using route-finding algorithms that avoid turns that cross
traffic [82].) Similarly, the route map for an airline forms another kind of transportation network.
We can also think about technological networks, like the internet—nodes are computers (laptops,
phones, servers, routers, etc.), and edges represent wires or radio signals connecting two machines
together—or the web, in which nodes are web pages and an edge represents a link from one page to another.
And there are many other kinds of networks, too: dating networks (and sexual contact networks), food webs
(what species eats what other species?), purchase networks (nodes correspond to people and products, and
an edge joins a person to a product they bought), among many more.
In this chapter, we’ll begin with a bunch of underlying definitions that are related to graphs (Sec-
tion 11.2), and then move on to paths, sequences of hops to get from one node in a graph to another
(Section 11.3). We’ll then look at trees (Section 11.4), which are a special type of graph, and weighted
graphs, in which different edges have different costs or lengths (Section 11.5). And we’ll spend a little
time with a few of the ubiquitous applications of graphs as we go.

Figure 11.1 A sample social network (in this case, among the 34 members of a karate club in the 1970s [133]).

ISTUDY
11.2 Formal Introduction 579

11.2 Formal Introduction


The Bible tells us to love our neighbors, and also to love our enemies; probably because
they are generally the same people.
G. K. Chesterton (1874–1936)
Illustrated London News (1910)

We begin by defining the terminology for the two different basic types of graphs. In both, we have a set
of entities called nodes, some pairs of which are joined by a relationship called an edge. (A node can also
be called a vertex, and we’ll use the terms node/nodes and vertex/vertices interchangeably; they’re both
used commonly in CS. A graph is also sometimes called a network; edges are also sometimes called links.)
The two types of graph differ in whether the relationship represented by an edge is “between two nodes”
or “from one node to another.” In an undirected graph, the relationship denoted by the edges is symmetric
(for example, “u and v are genetically related”):

Definition 11.1: Undirected graph.


An undirected graph is a pair G = ⟨V, E⟩ where V is a nonempty set of vertices or nodes, and

E ⊆ {u, v} : u, v ∈ V is a set of edges joining pairs of vertices.

The second basic kind of graph is a directed graph, in which the relationship denoted by the edges need
not be reciprocated (for example, “u has texted v”):

Definition 11.2: Directed graph.


A directed graph is a pair G = ⟨V, E⟩ where V is a nonempty set of nodes, and E ⊆ V × V is a set of
edges joining (ordered) pairs of vertices.

In other words, in a directed graph an edge is an ordered pair of vertices (“an edge from u to v”) and in
an undirected graph an edge is an unordered pair of vertices (“an edge between u and v”). Think about the
difference between Twitter followers (directed) and Facebook friendships (undirected): Alice can follow
Bob without Bob following Alice, but they’re either friends or they’re not friends.
Graphs are generally drawn with nodes represented as circles, and edges represented by lines. Each
edge in directed graphs is drawn with an arrow indicating its orientation (“which way it goes”). Here is an
example of each:

Example 11.1: A sample undirected graph.


A small undirected graph is shown in Figure 11.2a. This graph contains:

• 12 nodes: A, B, C, D, E, F, G, H, I, J, K, and L.
• 10 edges: {A, B}, {B, C}, {C, D}, {E, F}, {E, H}, {F, G}, {G, H}, {I, J}, {J, K}, and {K, L}.

Example 11.2: Streets of Manhattan: a sample directed graph.


The directed graph shown in Figure 11.2b contains 9 nodes, each corresponding to an intersection of a
“street” running east–west and an “avenue” running north–south in Manhattan. There are 14 edges in this
graph. There’s something potentially tricky in counting to 14: edges in a directed graph are ordered pairs,

ISTUDY
580 Graphs and Trees

(a) (b)
B F I 43rd & 9th 43rd & 8th 43rd & 7th
A E J

42nd & 9th 42nd & 8th 42nd & 7th


C D H
G K L
41st & 9th 41st & 8th 41st & 7th

Figure 11.2 Two small graphs: (a) one undirected graph and (b) one directed graph (a small portion of Manhattan).

(a) (b)

Figure 11.3 Parallel edges and self-loops, in (a) undirected and (b) directed graphs.

which means that there are two edges between [42nd & 9th] and [42nd & 8th], one in each direction—
⟨42nd & 9th, 42nd & 8th⟩ and ⟨42nd & 8th, 42nd & 9th⟩. The pair of nodes [42nd & 8th] and [42nd & 7th]
is similar.

For many of the concepts that we’ll explore in this chapter, it will turn out that there are no substantive
differences between the ideas for directed and undirected graphs. To avoid being tedious and unhelpfully
repetitive, whenever it’s possible we’ll state definitions and results about both undirected and directed
graphs simultaneously. But doing so will require a little abuse of notation: we’ll allow ourselves to write
an edge as an ordered pair ⟨u, v⟩ even for an undirected graph. In an undirected graph, we will agree to
understand both ⟨u, v⟩ and ⟨v, u⟩ as meaning {u, v}.

Simple graphs
For many of the real-world phenomena that we will be interested in modeling, it will make sense to make a
simplifying assumption about the edges in our graphs. Specifically, we will typically restrict our attention to
so-called simple graphs, which forbid two different kinds of edges: edges that connect nodes to themselves,
and edges that are precise duplicates of other existing edges.

Definition 11.3: Self-loops and parallel edges.


A self-loop is an edge from a node u to itself. Two edges are called parallel if they both go from same
node u and both go to the same node v.

(See Figure 11.3.) Note that the edges ⟨u, v⟩ and ⟨v, u⟩ are not parallel in a directed graph: directed edges
are parallel only if they both go from the same node and to the same node, in the same orientation. (Edges
⟨u, v⟩ and ⟨v, u⟩ are sometimes called antiparallel.)

Definition 11.4: Simple graph.


A graph is simple if it contains no parallel edges and no self-loops.

Throughout, we’ll assume that graphs are undirected and simple unless otherwise noted.
In general, the particular real-world phenomenon that we seek to model will dictate whether self-loops,
parallel edges, or both will make sense. Here are a few examples:

ISTUDY
11.2 Formal Introduction 581

Example 11.3: Self-loops and parallel edges.


Suppose that we construct a graph to model each of the following phenomena. In which settings do
self-loops or parallel edges make sense?
1 A social network: nodes correspond to people; (undirected) edges represent friendships.
2 The web: nodes correspond to web pages; (directed) edges represent links.
3 The flight network for a commercial airline: nodes correspond to airports; (directed) edges denote
flights scheduled by the airline in the next month.
4 The email network at a college: nodes correspond to students; there is a (directed) edge ⟨u, v⟩ if u has
sent at least one email to v within the last year.

Solution. A social network: Neither self-loops nor parallel edges make sense. A self-loop would corre-
spond to a person being a friend of himself, and parallel edges between two people would correspond to
them being friends “twice.” (But two people are either friends or not friends.)
The web: Both self-loops and parallel edges are reasonable. It is easy to imagine a web page p that
contains a hyperlink to p itself. It is also easy to imagine a web page p that contains two separate links
to another web page q. (For example, as of this writing, the “CNN” logo on www.cnn.com links to www.
cnn.com. And, as of the end of this sentence, this page has three distinct references to www.cnn.com.)
A commercial flight network: In a flight network, many parallel edges will exist: there are generally
many scheduled commercial flights from one airport to another—for example, there are dozens of flights
every week from BOS (Boston, MA) to SFO (San Francisco, CA) on most major airlines. However, there
are no self-loops: a commercial flight from an airport back to the same airport doesn’t go anywhere!
A who-emailed-whom network: Self-loops are reasonable but parallel edges are not. A student u has
either sent email to v in the last year or she has not, so parallel edges don’t make sense in this net-
work. However, self-loops exist if any student has sent an email to herself (as many people do to remind
themselves to do something later).

Taking it further: Technically speaking, the way that we phrased our definitions of graphs in Definitions 11.1 and 11.2 doesn’t
even allow us to consider parallel edges. (Our definitions do allow self-loops, though.) That’s because we defined the edges as a
subset E of V × V or {{u, v} : u, v ∈ V}, and sets don’t allow duplication—which means that we can’t have ⟨u, v⟩ in E “twice.”
There are alternate ways to formalize graphs that do permit parallel edges, but they’re needlessly complicated for the applications
that we’ll focus on in this chapter.

11.2.1 Neighborhoods and Degree


Imagine a social network in which two people, Ursula and Victor, are friends—or, more generally, imagine
an undirected graph in which nodes u and v are joined by an edge. Here’s the vocabulary for referring to
these nodes and the edge between them:

Definition 11.5: Adjacency, neighbors, endpoints, incidence.


For an edge e = {u, v} in an undirected graph (see Figure 11.4a), we say that:

• the nodes u and v are adjacent;


• the node v is a neighbor of the node u (and vice versa);
• the nodes u and v are the endpoints of the edge e; and
• the nodes u and v are both incident to the edge e.

ISTUDY
582 Graphs and Trees

Los Angeles
I10(west)
I10(east)
Lake City, FL Jacksonville

I75 I95
u
Tampa Daytona Beach
e I4(west) I4(east)

v Orlando

(a) Two nodes in an undirected graph (b) A small portion of the U.S. Interstate system between Orlando, FL and
joined by an edge. Los Angeles, CA. Each of the roads is labeled by its name.

Figure 11.4 A schematic view of edges joining nodes in an undirected graph, and an example.

It’s important to distinguish between two distinct concepts: first, the direct connection between two nodes u
and v that are adjacent—that is, a single edge that joins u and v directly; and, second, an indirect connection
between two nodes that follows a sequence of edges. At the moment, we’re talking only about the first kind,
a direct connection via a single edge. (A multihop connection is called a path; we’ll talk about paths in
Section 11.3.)
Here’s an example of the vocabulary from Definition 11.5:

Example 11.4: Disney World to Disney Land.


A small portion of the U.S. Interstate system is shown in Figure 11.4b. In this graph:

• Orlando is adjacent to Tampa and Daytona Beach.


• None of the other nodes (Lake City, Jacksonville, Los Angeles) is a neighbor of Orlando. Orlando is
also not a neighbor of itself.
• The endpoints of edge I75 are Tampa and Lake City.
• Jacksonville is incident to I95, as is Daytona Beach.

The neighborhood of a node is the set of all nodes adjacent to it:

Definition 11.6: Neighborhood.


Let G = ⟨V, E⟩ be an undirected graph, and let u ∈ V be a node. The neighborhood of u is the set

v ∈ V : {u, v} ∈ E —that is, the set of all neighbors of u.

For example, in the graph in Figure 11.4b, the neighborhood of Lake City is {Los Angeles, Tampa,
Jacksonville}. Or, for a graph G that represents a social network, the neighborhood of a node u is the
set of people who are u’s friends.

Degree
It’s also common to refer the number of neighbors that a node has (without reference to which particular
nodes happen to be that node’s neighbors):

ISTUDY
11.2 Formal Introduction 583

Definition 11.7: Degree.


The degree of a node u in an undirected graph G is the size of the neighborhood of u in G—that is, the
number of nodes adjacent to u.

For example, in the graph in Figure 11.4b, Lake City has degree 3 and Los Angeles has degree 1. (See
Figure 11.5 for a version of the road network with each node labeled by its degree.) Or, in a social network,
the degree of a node u is the popularity of u—the number of friends that u has. Here are a few practice
questions:

Example 11.5: Neighborhood and degree.


Here are a few questions about the graph in Figure 11.6:
1 What are the neighbors of node C?
2 What nodes, if any, have degree equal to one?
3 What node has the highest degree in this graph?
4 What nodes, if any, are in the neighborhoods of both nodes B and E?

Solution. By inspecting Figure 11.6, we see:

• Node C has two neighbors, namely the nodes B and E.


• The nodes with degree one are those with precisely one neighbor. These nodes are: A, D, F, and H.
(Their solitary neighbors are, respectively: B, G, E, and G.)
• We simply count neighbors for each node, and we find that nodes B and E both have degree three, and
are tied as the nodes with the highest degree.
• The neighborhood of node B is {A, C, E}, and the neighborhood of node E is {B, C, F}. Taking the
intersection of those sets yields the one node in the neighborhood of both B and E, namely node C.

Taking it further: Consider a population of people—say, the current residents of Canada—represented as a social network, in
an undirected graph whose edges represent friendship. For a node in the social network (also known as a person), we can calculate
many numbers that may be interesting: height, age, income, number of cigarettes smoked per day, self-reported happiness, etc.
Then, for any one of these numerical properties, we can consider the distribution over the population: for example, the distribution
of heights, or the distribution of ages. (The height distribution will follow a roughly bell-shaped curve; the age distribution is
more complicated, both because of death and because of variation in the birth rate over time.) Another interesting numerical
property of a person u is the degree of u: that is, the number of friends that u has. The degree distribution of a graph describes
how popularity varies across the nodes of the network. The degree distribution has some interesting properties—very different
from the distribution of heights or ages. See p. 599.

1
3 2
B D F
2 2
A H
2
C E G

Figure 11.5 The road network from Figure 11.4b, with Figure 11.6 An undirected graph, for some practice
nodes labeled by their degree. questions.

ISTUDY
584 Graphs and Trees

The handshaking lemma


Before we move on from degree, we’ll prove a basic but valuable fact, colloquially called the “handshaking
lemma.” (We can represent a group of people, some pairs of whom shake hands, using an undirected graph:
an edge joins u and v if and only if u and v shook hands; the theorem describes the number of shakes.) The
handshaking lemma relates the sum of nodes’ degrees to the number of edges in the graph:

Theorem 11.8: Handshaking lemma.


Let G = ⟨V, E⟩ be an undirected graph. Then
X
degree(u) = 2|E|.
u∈V

For example, Figure 11.5 shows our road network from Figure 11.4b, with all nodes labeled by their degree.
This graph has |E| = 6 edges, and the sum of the nodes’ degrees is 1 + 3 + 2 + 2 + 2 + 2 = 12, and indeed
12 = 2 · 6. Here is a proof:

Proof of Theorem 11.8. Every edge has two endpoints! (Or, as The Player says to Rosencrantz in Tom
Stoppard’s brilliant play Rosencrantz and Guildenstern are Dead: “look on every exit being an entrance
somewhere else” [123].)
Or, more formally, imagine looping over each edge to compute all nodes’ degrees:

1 initialize du to 0 for each node u


2 for each edge {u, v} ∈ E:
3 du := du + 1
4 dv := dv + 1

In each iteration of the for loop, we increment two different d• values; thus, after i iterations, we have that
P
u du = 2i. (We could give a fully rigorous proof of this fact by induction.) We complete |E| iterations
P
of the for loop, one for each edge, and thus at the end of the algorithm we have that u∈V du = 2|E|.
Furthermore, after the loop, it’s clear that du = degree(u) for every node u. Thus
X X
du = degree(u) = 2|E|.
u∈V u∈V

Here’s a useful corollary of Theorem 11.8 (the proof is left to you as Exercise 11.17):

Corollary 11.9. Let nodd denote the number of nodes whose degree is odd. Then nodd is even.

(For example, for the road-network graph in Figure 11.5, we have nodd = 2: the two nodes with odd degree
are those with degree 1 and 3. And 2 is an even number.)

Neighborhoods and degree: directed graphs


The definitions of adjacency, neighbors, and degree from Definitions 11.5–11.7 were all for undirected
graphs. Here we’ll introduce the analogous notions for directed graphs, all of which are slightly more
complicated because they must account for the orientation of each edge. We start with the directed version
of “neighbors”:

ISTUDY
11.2 Formal Introduction 585

Definition 11.10: Neighbors in directed graphs.


For an edge ⟨u, v⟩ from node u to node v in a directed graph, we say that:

• the node v is an out-neighbor of the node u; and


• the node u is an in-neighbor of the node v.

For example, if G represents a flight network (with nodes as airports and directed edges corresponding
to flights), then the out-neighbors of node u are those airports that have direct flights from u, and the in-
neighbors of u are those airports that have direct flights to u. (See Figure 11.7.) Now, using these definitions,
we can define the analogues of neighborhoods and degree in directed graphs:

Definition 11.11: Neighborhoods and degrees in directed graphs.


For a node u in a directed graph, we say that:

• the in-neighborhood of u is {v : ⟨v, u⟩ ∈ E}, the set of in-neighbors of v;


• the in-degree of u is its number of in-neighbors (its in-neighborhood’s cardinality);
• the out-neighborhood of u is {v : ⟨u, v⟩ ∈ E}, the set of out-neighbors of u; and
• the out-degree of u is its number of out-neighbors (its out-neighborhood’s cardinality).

Here are a few practice questions about in- and out-neighborhoods:

Example 11.6: Neighborhood and degree in a directed graph.


For the directed graph in Figure 11.8:
1 What are the in-neighbors of node C? The out-neighbors of C?
2 What nodes, if any, are in both the in-neighborhood and out-neighborhood of node E?
3 What nodes, if any, have in-degree zero? Out-degree zero?

Solution. Node C has one in-neighbor, namely B, and two out-neighbors, namely D and E.
Node E has three in-neighbors (B, C, and F) and two out-neighbors (B and F). So nodes B and F are in
both E’s in-neighborhood and E’s out-neighborhood.
Node A has no in-neighbors, so A’s in-degree is zero. Node G has no out-neighbors, so G’s out-degree
is zero.

(a) (b)
B D F

A H
u u
C E G

Figure 11.7 The (a) in-neighbors and (b) Figure 11.8 A directed graph for some practice
out-neighbors of a node u. questions.

ISTUDY
586 Graphs and Trees

11.2.2 Representing Graphs: Data Structures


The graphs that we’ve considered so far have been presented visually: as a picture, with nodes drawn as
circles and edges drawn as lines or arrows. But, of course, when we represent a graph on a computer,
we’ll need to use some data structure to store a network, not just some image file. Here we will give a
brief summary of the two major data structures used to represent graphs. If you’ve had a course on data
structures, then this material may be a review; if not, it will be a preview.

Taking it further: A visual representation is great for some smaller networks, and a well-designed layout can sometimes make
even large networks easy to understand at a glance. Graph drawing is the problem of algorithmically laying out the nodes of a
graph well—in an aesthetic and informative manner. There’s a physics analogy that’s often used in laying out graphs, in which
we imagine nodes “attracting” and “repelling” each other depending on the presence or absence of edges. See p. 600 for more,
including an application of this graph-drawing idea to the 9/11 Memorial in New York City. Some other gorgeous visualizations
of network (and other!) data can be found online at sites like Flowing Data (http://flowingdata.com/), Information Is
Beautiful (http://informationisbeautiful.net), or some of the beautiful books on data visualization (such as [18]).

The most straightforward data structure for a graph is just a list of nodes and a list of edges. But this
straightforward representation suffers for some standard, natural questions that are typically asked about
graphs. Many of the natural questions that we will find ourselves asking are things like: What are all of the
neighbors of A? or Are B and C joined by an edge? There are two standard data structures for graphs, each
of which is tailored to make it possible to answer one of these two questions quickly.

Adjacency lists
The first standard data structure for graphs is an adjacency list, which—as the name implies—stores, for
each node u, a list of the nodes adjacent to u:

Definition 11.12: Adjacency list.


In an adjacency list of a graph G = ⟨V, E⟩, for each node u ∈ V, we store an unsorted list of all of u’s
neighbors in the graph.

The schematic for an adjacency list is illustrated in Figure 11.9: each node in the graph corresponds to a
row of the table, which points to an unsorted list of that node’s neighbors. (These lists are unsorted so that
it’s faster to add a new edge to the data structure.)
There’s no significant difference between adjacency lists for undirected graphs and for directed graphs:
for an undirected graph, we list the neighbors for each node u; for a directed graph, we list the out-neighbors
of each node. (Every edge ⟨u, v⟩ in a directed graph appears only once in the data structure, in u’s list. Every

An array
containing all
nodes in the graph.
Some nodes (like
the second node ..
here) may have an .
empty list of u v1 v2 v3 v4
neighbors. The
node u has four ..
neighbors in this .
graph: v1 , v2 , v3 ,
and v4 .

Figure 11.9 A schematic of an adjacency list.

ISTUDY
11.2 Formal Introduction 587

Ben Allie: Evie, Ben A: B


Ben: Allie, Evie B D B: C, D
Allie Camille Camille: --- A C: E, A
Derek: --- C E D: ---
Evie Derek Evie: Allie, Ben E: C

Figure 11.10 Two small graphs, one undirected and one directed, represented by adjacency lists.

edge {u, v} in an undirected graph is represented twice: v appears in u’s list, and u appears in v’s list. This
observation is another way of thinking of the proof of Theorem 11.8.)

Example 11.7: Two sample adjacency lists.


Two small graphs, one undirected and one directed, and their corresponding adjacency lists are shown in
Figure 11.10. Note that the order of the (out-)neighbors of any particular node isn’t specified: for example,
we could just as well said that Evie’s neighbors were [Ben, Allie] as [Allie, Ben].

Adjacency matrices
The second standard data structure for representing graphs is an adjacency matrix:

Definition 11.13: Adjacency matrix.


In an adjacency matrix of a graph G = ⟨V, E⟩, we store the graph using an |V|-by-|V| table. The ith row
of the table corresponds to the neighbors of node i. A True (or 1) in column j indicates that the edge ⟨i, j⟩
is in E; a False (or 0) indicates that ⟨i, j⟩ ∈
/ E.

In a directed graph, the ith row corresponds to the out-neighbors of node i, so that the ⟨i, j⟩th entry of
the matrix corresponds to the presence/absence of an edge from i to j. The ith column corresponds to the
in-neighbors of i. See Figure 11.11.

Example 11.8: Two sample adjacency matrices.


The adjacency matrices for the graphs from Example 11.7 are shown in Figure 11.12.

The adjacency matrix has two properties that are worth a note. (Again, see Figure 11.11.) First, the main
diagonal contains all zeros: a 1 in the ⟨i, i⟩th position of the matrix would correspond to an edge between
node i and node i—that is, a self-loop, which is forbidden in a simple graph.

1 2 ··· i j ··· n
1
2
.. Is there an edge 0
. from i to j in E? 0
i 0 The main diagonal.
(These are identical in an 0 (In a simple graph, there
undirected graph.) 0 are no self loops, so
j 0
0 ⟨u, u⟩ ∈/ E for every u.)
.. Is there an edge 0
. from j to i in E? 0
n 0

Figure 11.11 A schematic of an adjacency matrix.

ISTUDY
588 Graphs and Trees

Camille
Allie

Derek
Evie
Ben
A B C D E
Ben Allie 0 1 0 0 1 A 0 1 0 0 0
Ben 1 0 0 0 1 B D B 0 0 1 1 0
Allie Camille Camille 0 0 0 0 0 A C 1 0 0 0 1
Derek 0 0 0 0 0 C E D 0 0 0 0 0
Evie Derek Evie 1 1 0 0 0 E 0 0 1 0 0

Figure 11.12 The two graphs from Figure 11.10, represented by adjacency matrices.

Second, for an undirected graph, the matrix is symmetric: the ⟨i, j⟩th position of the matrix records the
presence or absence of an edge from i to j, which is identical to the presence or absence of an edge from
j to i in an undirected graph. Adjacency matrices are not necessarily symmetric in directed graphs: there
may be an edge from u to v without an edge from v to u.

Choosing between adjacency lists and matrices


Which of the two data structures that we’ve seen for graphs should we choose? Are adjacency lists better
than adjacency matrices, or the other way around?

Meta-problem-solving tip: The answer to “which is better?” in a class or textbook is almost always It depends! After all, why
would we waste time/pages on a solution that’s always worse!? (The only plausible answer is that it warms us up conceptually
for a better but more complex solution.) The real question here what does it depend on?

Recall that there are two basic questions about graphs that we wish to answer quickly: (A) is v a neighbor
of u?, and (B) what are all of u’s neighbors? Figuring the details of how efficiently we can answer these
questions with an adjacency list or an adjacency matrix is better suited to a data-structures textbook than
this one, but here’s a brief summary of the reasoning.

Adjacency lists: An adjacency list is perfectly tailored to answering Question (B): we’ve stored precisely
the list of u’s neighbors for each node u, so we simply iterate through that list to output u’s neighborhood.
To answer Question (A), we need to search through that same unsorted list to see if v is present. In both
cases, we have to spend constant time finding u’s list in the table, and then we examine a list of length
degree(u) to answer the question.

Adjacency matrices: An adjacency matrix is perfect for answering Question (A): we just look at the appro-
priate spot in the table. If the ⟨u, v⟩th entry is True, then the edge ⟨u, v⟩ exists. This lookup takes constant
time. Answering Question (B) requires looking at one entire row of the table, entry by entry. There are
|V| entries in the row, so this loop requires |V| operations.

Thus adjacency matrices solve Question (A) faster, while adjacency lists are faster at solving Question (B).
In addition to worrying about the time required to answer these questions, we’d also want the space—
the amount of memory—consumed by the data structure to be as small as possible. (You can think of “the
amount of memory” as the total number of boxes that appear in the diagrams in Figures 11.9 and 11.11.)

ISTUDY
11.2 Formal Introduction 589

adjacency list adjacency matrix


running time for Question (A): is v a neighbor of u? 1 + Θ(degree(u)) Θ(1)
running time for Question (B): what are all of u’s neighbors? 1 + Θ(degree(u)) Θ(|V|)
total space consumed by the data structure Θ(|V| + |E|) Θ(|V|2 )

Figure 11.13 A summary of the efficiency differences between adjacency lists and adjacency matrices. The better data
structure in each row is highlighted.

Example 11.9: Space consumption for adjacency lists and matrices.


Consider a graph G = ⟨V, E⟩ stored using an adjacency list or an adjacency matrix. In terms of the number
of nodes and the number of edges in G—that is, in terms of |V| and |E|—how much memory is used by
these data structures?
Solution. An adjacency matrix is a |V|-by-|V| table, and thus contains exactly |V|2 cells. (Of them, the
|V| cells on the diagonal are always 0, but they’re still there!)
An adjacency list is a |V|-element table pointing to |V| lists; the length of the list for node u is exactly
degree(u). Thus the total number of cells in the data structure is
X
|V| + degree(u).
u∈V
P
In an undirected graph we have u degree(u) = 2|E|, by Theorem 11.8; in a directed graph we have
P
u out-degree(u) = |E| by Exercise 11.18. Thus the total amount of memory used is |V| + 2|E| for an
undirected graph, and |V| + |E| for a directed one.

Figure 11.13 summarizes the efficiency of these data structures, using asymptotic notation from Chap-
ter 6. (Note that, in a simple graph, we have that degree(u) ≤ |V| and |E| ≤ |V|2 .) So, is an adjacency list
or an adjacency matrix better? It depends!
First, it depends on what kind of questions we want to answer: if we will ask few “is v a neighbor of
u?” questions, then adjacency lists will be faster. If we will ask many of those questions, then we probably
prefer adjacency matrices. Similarly, it might depend on how much, if at all, the graph changes over time:
adjacency lists are harder to update than adjacency matrices.
Second, it depends on how many edges are present in the graph. If the total number of edges in the
graph is relatively small—and thus most nodes have only a few neighbors—then degree(u) will generally
be small, and the adjacency list will win. If the total number of edges in the graph is relatively large,
then degree(u) will generally be larger, and the adjacency matrix will perform better. (Many of the most
interesting real-world graphs are sparse: for example, the typical degree of a person in a social network
like Facebook is perhaps a few hundred or at most a few thousand—very small in relation to the billions
of Facebook users.)

11.2.3 Relationships between Graphs: Isomorphism and Subgraphs


Now that we have the general definitions, we’ll turn to a few more specific properties that certain graphs
have. We’ll start in this section with two different relationships between pairs of graphs—when two graphs
are “the same” and when one is “part” of another; in Section 11.2.4, we’ll look at single graphs with a
particular structure.

ISTUDY
590 Graphs and Trees

Graph isomorphism
When two graphs G and H are identical except for how we happen to have arranged the nodes when we
drew them on the page (and except for the names that we happen to have assigned to the nodes), then we call
the graphs isomorphic (Greek: iso “same”; morph “form”). Informally, G and H are isomorphic if there’s
a way to relabel (and rearrange) the nodes of G so that G and H are exactly identical. More formally:

Definition 11.14: Graph isomorphism.


Consider two graphs G = ⟨V, E⟩ and H = ⟨U, F⟩. We say that G and H are isomorphic if there exists a
bijection f : V → U such that
for all a ∈ V and b ∈ V, ⟨a, b⟩ ∈ E ⇔ ⟨f(a), f(b)⟩ ∈ F.

(By abusing notation as we described earlier, this definition works for either undirected or directed graphs
G and H.) Here are some small examples:

Example 11.10: Two isomorphic graphs.


Let’s show that the directed graphs in Figures 11.14a and 11.14b are isomorphic. To do so, define the
following bijection f : {1, 2, . . . , 6} → {A, B, . . . , F}:
f(1) = A f(2) = D f(3) = C f(4) = F f(5) = B f(6) = E.
Then the edges in the two graphs are
⟨1, 2⟩ ⟨1, 3⟩ ⟨1, 4⟩ ⟨1, 5⟩ ⟨1, 6⟩ ⟨2, 4⟩ ⟨2, 6⟩ ⟨3, 6⟩

⟨f(1), f(2)⟩ ⟨f(1), f(3)⟩ ⟨f(1), f(4)⟩ ⟨f(1), f(5)⟩ ⟨f(1), f(6)⟩ ⟨f(2), f(4)⟩ ⟨f(2), f(6)⟩ ⟨f(3), f(6)⟩
⟨A, D⟩ ⟨A, C⟩ ⟨A, F⟩ ⟨A, B⟩ ⟨A, E⟩ ⟨D, F⟩ ⟨D, E⟩ ⟨C, E⟩
Because these lists of edges match exactly, the two graphs are isomorphic.

Example 11.11: Isomorphic graphs.


Which pairs, if any, of the graphs in Figures 11.14c, 11.14d, and 11.14e are isomorphic?
Solution. The graphs in Figures 11.14c and 11.14d are isomorphic. The easiest way to see this fact is to
show the mapping between the nodes of the two graphs:
A B C D E F G H I J
1 2 3 4 5 0 7 9 6 8
It’s easy to verify that all 15 edges now match up between the first two graphs. But the third graph is not
isomorphic to either of the others. The easiest justification is that node S in the third graph has degree 5,
and no node in either of the first two graphs has degree 5. No matter how we reshuffle the nodes of graph
#3, there will still be a node of degree 5—so the third graph can never match the others.

Problem-solving tip: When you’re trying to prove or disprove a claim about graphs, you may find it useful to test out the claim
against the following four “trivial” graphs: a graph with a single node ( ); a graph with two nodes and no edges ( ); a graph
with two nodes joined by an edge ( ); and graph with two pairs of nodes with each pair joined by an edge ( ). A lot of
bogus claims about graphs turn out to be false on one of these four examples—or, unexpectedly, the so-called Petersen graph, the
graph in Figure 11.14c. (The Petersen graph is named after Julius Petersen (1839–1910), a Danish mathematician.) It’s a good
idea to try out any conjecture on all five of these graphs before you let yourself start to believe it!

ISTUDY
11.2 Formal Introduction 591

(a) (b) B

A C E
1 2 3 4 5 6

D F

(c) B (d) (e) R


U T
1 2 3 4 5
G S
A F H C
Z Y
Q
J I 6 7 8 9 0
V X
E D W

Figure 11.14 Several graphs. (Incidentally, the edges in (a) can be written as {⟨a, b⟩ : a < b and a evenly divides b}.)

Taking it further: In general, it’s easy to test whether two graphs are isomorphic by brute force (try all permutations!), but no
substantially better algorithms are known. The computational complexity of the graph isomorphism problem has been studied
extensively over the last few decades, and there has been substantial progress—but no complete resolution. It’s easy to convince
someone that two graphs G and H are isomorphic: we can simply describe the relabeling of the nodes of G so that the resulting
graphs are identical. (The “convincee” then just needs to verify that the edges really do match up.) When G and H are not
isomorphic, it might be easy to demonstrate their nonisomorphism: for example, if they have a different number of nodes or
edges, or if the degrees in G aren’t identical to the degrees in H. But the graphs may have identical degree distributions and yet
not be isomorphic; see Exercise 11.49.

Subgraphs
When a graph H is isomorphic to a graph G, we can think of having created H by moving around some
of the nodes and edges of G. When H is a subgraph of G, we can think of having created H by deleting
some of the nodes and edges of G. (Of course, it doesn’t make sense to delete either endpoint of an edge e
without also deleting the edge e.) Here’s the definition, for either undirected or directed graphs:

Definition 11.15: Subgraph.


Let G = ⟨V, E⟩ be a graph. A subgraph of G is a graph G′ = ⟨V′ , E′ ⟩ where V′ ⊆ V and E′ ⊆ E such
that every edge ⟨u, v⟩ ∈ E′ satisfies u ∈ V′ and v ∈ V′ .

(Definition 11.15 uses the abuse of notation that we mentioned earlier: we “ought” to have written
{u, v} ∈ E′ for the case that G is undirected.)

Example 11.12: All 3-node subgraphs of G.


Consider the graph G with nodes {A, B, C, D} and edges {{A, B} , {A, C} , {B, C} , {C, D}}. Then the graph
G′ with nodes {B, C, D} and edges {{B, C} , {C, D}} is a subgraph of G. In fact, G has many different
subgraphs. Figure 11.15 shows all of the 3-node subgraphs of G. (There are many other subgraphs—about
50 total—when we consider subgraphs with 1, 2, 3, or 4 nodes.)

We sometimes refer to a special kind of subgraph: the subgraph of G = ⟨V, E⟩ induced by a set V′ ⊆ V
of nodes is the subgraph of G where every edge between nodes in V′ is retained. The first subgraph in each
row of Example 11.12 is the induced subgraph for its nodes.
Here’s a brief description of one application of (induced) subgraphs:

ISTUDY
592 Graphs and Trees

B B B B B B B B
A, B, C: A D A D A D A D A D A D A D A D
C C C C C C C C

B B
A, B, D: A D A D
C C

B B B B
A, C, D: A D A D A D A D
C C C C

B B B B
B, C, D: A D A D A D A D
C C C C

Figure 11.15 All 3-node subgraphs of the graph with four nodes (A, B, C, and D) and four edges (three edges joining C
to every other node, plus an edge between A and B).

Example 11.13: Motifs in biological networks.


At any particular moment in any particular cell, some of the genes in the organism’s DNA are being
expressed—that is, some genes are “turned on” and the proteins that they code for are being produced
by the cell. Furthermore, one gene g can regulate another gene g′ : when g is being expressed, gene g
can cause the expression of gene g′ to increase or decrease over the baseline level. A great deal of recent
biological research has allowed us to construct gene-regulation networks for different such settings: that
is, a directed graph G whose nodes are genes, and whose edges represent the regulation of one gene by
another.
Consider the induced subgraph of a particular set of genes in such a graph G—that is, the interac-
tions among the particular genes in that set. Certain patterns of these subgraphs, called motifs, occur
significantly more frequently in gene-regulation networks than would be expected by chance. Biologists
generally believe that these repeated patterns indicate something important in the way that our genes
work, so computational biologists have been working hard to build efficient algorithms to identify induced
subgraphs that are overrepresented in a network.

Taking it further: One of the earliest applications of a formal, mathematical perspective to networks—a collaboration between
a psychologist and mathematician, in the 1950s—was based on subgraphs. Consider a signed social network, an undirected graph
where each edge is labeled with “+” to indicate friends, or “−” to indicate enemies. (See Figure 11.16a.) The adages “the enemy
of my enemy of my friend” and “the friend of my friend is my friend” correspond to the claim that the subgraphs in Figure 11.16b
would not appear. Dorwin Cartwright (1915–2008, the psychologist) and Frank Harary (1921–2005, the mathematician) proved
some very interesting structural properties of any signed social network G that does not have either triangle in Figure 11.16b as
a subgraph—a property that they called “structural balance”—and in the process helped launch much of the mathematical and
computational work on graphs that’s followed. For more, see [27].

(a) Italy
− (b)
+ UK

− + + − −
Germany + +


+
− US − −
Japan

Figure 11.16 Signed social networks: (a) a signed network from 1941, and (b) two forbidden triangles.

ISTUDY
11.2 Formal Introduction 593

11.2.4 Special Types of Graphs: Complete, Bipartite, Regular, and Planar Graphs
In Section 11.2.3, we looked at two ways in which a pair of graphs might be related. Here, we’ll consider
special characteristics that a single graph might have—that is, subcategories of graphs with some particular
structural properties. These special types of graphs arise frequently in various applications.

Complete graphs
Our first special type of graph is a complete graph (also called a clique), which is an undirected graph in
which every possible edge exists:

Definition 11.16: Complete graph/clique.


A complete graph or clique is an undirected graph G = ⟨V, E⟩ such that {u, v} ∈ E for any two distinct
nodes u ∈ V and v ∈ V.

See Figure 11.17 for examples of complete graphs of varying sizes. (In everyday usage, a clique is a small,
tight-knit, and exclusionary group of friends that doesn’t mingle with outsiders—like the cool kids who
never would even say hello to me when I was in middle school. If you think about a graph as a social
network, the common-language meaning is similar to Definition 11.16. In computer science, though, the
word clique usually rhymes with bleak or sleek. In common-language usage, the word usually rhymes with
slick or flick.)

Observe that an undirected graph with n nodes has n2 unordered pairs of nodes, and therefore an n-node

complete graph has n2 = n(n − 1)/2 edges.
A complete graph with n nodes is sometimes denoted by Kn . (Why? There are two different prevailing
explanations for this notation, so there’s some debate. One is that the K is short for complete—or, rather,
short for komplett; the notation was invented by a German speaker. The second is that the K is in honor of
Kazimierz Kuratowski (1896–1980), a Polish mathematician who made major contributions to the study
of graphs, among other mathematical topics.)
The word clique can also refer to a subgraph that’s complete—that is, a subgraph in which
every possible edge actually exists. So we’d say that the graph with nodes {A, B, C, D} and edges
{{A, B} , {A, C} , {B, C} , {C, D}} contains a 3-node clique {A, B, C}.
Here’s one small example of an interesting application in which cliques arise:

Example 11.14: Collaboration networks and cliques.


Imagine a setting in which different groups of people can work together in different teams, with each
person allowed to participate in multiple teams. For example:

• Actors in movies. (A “team” is the cast of a single movie.)

Figure 11.17 Complete graphs with 3, 5, 6, 7, 8, and 16 nodes.

ISTUDY
594 Graphs and Trees

Team Members Tigers


Tigers Deborah, George, Hicham, Josh, Lauren C K
H
G

Unicorns Anita, Bev, Eva, Fernan A E D


J
Vultures Cathy, Eva, Kelly B F L

C K G Vultures
H Unicorns C K
H
G
C K G
H
A E D A E
J
D
A E D
B F L J
J B F L
B F L
(b) The clique resulting from each team in the
(a) Three teams, and the resulting collaboration graph. collaboration graph. (Here, K5 , K4 , and K3 .)

Figure 11.18 A collaboration network.

• Scientific researchers. (A “team” is the set of coauthors of a published paper.)


• Employees of a company. (A “team” is a group that worked on a specific project.)

A collaboration network is a graph G that represents a setting like these: the nodes of G are the people
involved; there is an edge between any two people who have worked together on at least one team. (You
may have heard of a challenge in the collaboration network: in the Kevin Bacon Game, you’re given the
name of some actor A; your job is to find a sequence of edges that connects A to the “Kevin Bacon” node
in the movie collaboration network. There’s a similar game that computer scientists play in the scientific
collaboration network, trying to connect themselves to the Hungarian polymath Paul Erdős. See p. 180.)
For a concrete example, see Figure 11.18. Notice that each team results in a clique in the collaboration
graph—every pair of members of that team is joined by an edge—as shown in Figure 11.18b.

Bipartite graphs
Our second special kind of graph is a bipartite graph (Latin: bi “two”; part “part”). In a bipartite graph, the
nodes can be divided into two groups such that no edges join two nodes that are in the same group: that is,
there are two “kinds” of nodes, and all edges join a node of Type A to a node of Type B. Formally:

Definition 11.17: Bipartite graph.


A bipartite graph is an undirected graph G = ⟨V, E⟩ such that V can be partitioned into two disjoint sets
L and R where, for every edge e ∈ E, one endpoint of e is in L and the other endpoint of e is in R.

For example, consider a graph with nodes A, B, C, D, E, and F, and with edges {A, B}, {A, C}, {C, E}, and
{D, E}. This graph is bipartite: for example, we can split the nodes into two groups—the vowels {A, E}
and the consonants {B, C, D, F}—such that every edge joins a vowel and a consonant. (There’s another
split that would also have worked: {A, E, F} and {B, C, D}.) See Figure 11.19 for a visualization of the
vowel–consonant split.
Bipartite graphs are traditionally drawn with the nodes arranged in two columns, one for each part: left
(“L”) and right (“R”). But notice that the definition only requires that it be possible to divide the nodes into
two groups, with no within-group edges; it doesn’t require that there be two columns in whatever drawing
of the graph you’re looking at.

ISTUDY
11.2 Formal Introduction 595

L R
A B

D (i) (ii) (iii) (iv) (v)

E F

Figure 11.19 A
bipartite graph. Figure 11.20 Five possibly bipartite graphs.

Example 11.15: Bipartite and nonbipartite graphs.


Which of the graphs in Figure 11.20 are bipartite?
Solution. All of them except (iii)! Although (iv) and (v) are the only graphs drawn in the “two-column”
format, both (i) and (ii) can be rearranged into two columns. In fact, aside from node positioning, graphs
(i) and (iv) are identical. And, similarly, graphs (ii) and (v) are isomorphic! Only (iii) is not bipartite: if
we attempt to put the topmost node in one group, then both of the next higher two nodes must both be in
the other group—but they’re joined by an edge themselves, and so we’re stuck.

Many interesting real-world phenomena can be modeled using bipartite graphs:

Example 11.16: Bipartite graphs as models.


Here are just a few of the scenarios that are naturally modeled using bipartite graphs:

• Dating relationships in a strictly binary, heterosexual community: the nodes are the men M and the
women W; every edge connects some man and some woman.
• Nodes are courses and students; an edge joins a student to each class they’ve taken.
• Affiliation networks: people and organizations are the nodes; an edge connects person p and organiza-
tion o if p is a member of o.

There’s one further refinement of bipartite graphs that we’ll mention: a complete bipartite graph is a
bipartite graph in which every possible edge exists. In other words, a complete bipartite graph has the form
G = ⟨L ∪ R, E⟩ where, for every node ℓ ∈ L and r ∈ R, we have {ℓ, r} ∈ E. A complete bipartite graph
with ℓ nodes in the left group and r nodes in the right group is sometimes denoted by Kℓ,r . See Figure 11.21
for a few examples. (Note again that, as with the K2,4 in Figure 11.21, we don’t have to draw a bipartite
graph in two-column format—if it’s bipartite, then it’s still bipartite no matter how we draw it!)

Figure 11.21 Complete bipartite graphs of varying sizes: K1,4 , K4,4 , K8,4 , K8,8 , and K2,4 .

ISTUDY
596 Graphs and Trees

Regular graphs
Our next type of graph is defined in terms of the degree of its nodes: a regular graph is one in which all of
the nodes have an identical number of neighbors.

Definition 11.18: Regular graph.


Let d ≥ 0 be an integer. A d-regular graph is a graph G such that every node has degree precisely equal
to d. If G is d-regular for any d, then we say that G is a regular graph.

(Most of the time one talks about regular graphs that are undirected, but we can speak of regular directed
graphs, too; we’d generally require that all in-degrees match each other and all out-degrees match each
other.) Several examples of regular graphs are shown in Figure 11.22.
There are many real-world examples in which regular graphs are useful: for example, imagine construct-
ing a physical network of computers in which each machine only has the capacity for a fixed number of
connections. Here are two other useful applications of regular graphs:

Example 11.17: Scheduling sports with a regular graph.


You are the League Commissioner for an intramural ultimate frisbee league. There are 10 teams in the
league, each of whom should play four games. No two teams should play each other twice. Suppose
that you construct an undirected graph G = ⟨V, E⟩, where V = {1, 2, . . . , 10} is the set of teams, and E
is the set of games to be played. If G is an 4-regular graph, then all of the listed requirements are met.
Figure 11.22c is a randomly generated example of such a graph; you could use that graph to set the league
schedule.

A 1-regular graph is called a perfect matching, because each node is “matched” with one—and only
one—neighbor. (If every node has degree at most 1, then the graph is just called a matching.) Matchings
have a variety of applications—for example, see p. 497 for their role in the Enigma machine—but here’s
another specific use of matchings, in assigning partnerships:

Example 11.18: Matchings for CS partnerships.


Each of n students in an Intro CS class submits a list of people whom they’d like to have as a partner for
the final project. Define the following undirected graph G:

• The set V of nodes is {1, 2, . . . , n}, one per student.


• The set E of edges includes {u, v} if both of the following are true: student u wants to work with student
v, and student v wants to work with student u.

The instructor can assign partnerships by finding a 1-regular graph G′ = ⟨V, E′ ⟩ with E′ ⊆ E—that
is, a subgraph of G that includes all of the nodes of G. See Figure 11.23 for an example. (Incidentally,
Example 9.34 asked: how many perfect matchings are there in Kn ?)

Planar graphs
Our last special type of graph is a planar graph, which is one that can be drawn on a sheet of paper without
any lines crossing:

ISTUDY
11.2 Formal Introduction 597

10
9 1

V = {A, B, C, D, E, F} 8 2

E= {A, B} , {A, E} , {B, C} ,
7 3
{C, F} , {D, E} , {D, F}
6 4
5
(a) A 2-regular graph. (You can (b) For any n, the complete graph
check: each node has degree two.) Kn is (n − 1)-regular. (c) A 4-regular 10-node graph.

Figure 11.22 Several examples of regular graphs.

12 12 12
11 1 11 1 11 1

10 2 10 2 10 2

9 3 9 3 9 3

8 4 8 4 8 4

7 5 7 5 7 5
6 6 6

(a) A graph G of compatabilities. (b) For G, these graphs (among others) are valid partner assignments.

Figure 11.23 Choosing partnerships in a class. (See Example 11.18.)

Definition 11.19: Planar graph.


A planar graph is a graph G such that it is possible to draw G on a plane (that is, on a piece of paper)
such that no edges cross.

It’s important to note that a graph is planar if it is possible to draw it with no crossing edges; just because a
graph is drawn with edges crossing does not mean that it isn’t planar. Here is an example of a planar graph:

Example 11.19: New England, in a plane.


Figure 11.24 shows two copies of the same graph—one drawn with edge crossings, and another with the
nodes rearranged to avoid any edges crossing—representing the U.S. states in the New England region.

Figure 11.24 corresponds to one of the most famous types of planar graph, one derived from a map: we
can think of the countries on a map as nodes, and we draw an edge between two countries/nodes if those
two countries share a border. (See p. 181 for a discussion of the Four-Color Theorem for maps, which we
could have phrased as a result about planar graphs instead.)
There are other applications of planar graphs in computer science, too. For example, we can view a
circuit (see Section 3.3.3) as a graph, where the logic gates correspond to nodes and the wires correspond
to edges. Most modern circuits are now printed on a board (where the “ink” is the conducting material
that serves as the wire), and the question of whether a particular circuit can be printed on a single layer is
precisely the question of whether its corresponding graph is planar. (If it’s not planar, we’d like to minimize
the number of edges that cross, or more specifically the number of layers we’d need in the circuit.)
Here’s one more set of planarity challenges for you to try:

ISTUDY
598 Graphs and Trees

ME

NH ME NY VT NH

MA
VT RI NY

MA CT CT RI

Figure 11.24 New England, drawn in two different ways.

A G I H

B F

C D E J K L M

Figure 11.25 Two (possibly planar) graphs.

Example 11.20: Two planar challenges.


Consider the two graphs in Figure 11.25. Are these graphs planar?
Solution. Yes, both: we can rearrange the nodes so that there are no edges that cross, as shown in
Figure 11.26.

Taking it further: Determining how to lay out a planar graph without edge crossings can be an interesting amusement—see
www.planarity.net for a surprisingly fun game based on planar graphs. So far we haven’t seen any examples of graphs that
can’t be rearranged so that no edges cross. But, if you play around long enough, you should be able to convince yourself that
neither K5 and K3,3 are planar; see Figure 11.27. And, while this shouldn’t be at all obvious, it turns out that K5 and K3,3
are in a sense the only “reasons” that a graph can be nonplanar. A theorem known as Kuratowski’s Theorem—after the Polish
mathematician who may have lent his initial to the notation for complete graphs—says that every graph is planar unless it
“contains” K5 or K3,3 for a subgraph-like notion of “containment.” (It’s not exactly the subgraph relation, because there are
graphs that do not contain K5 or K3,3 as subgraphs but nonetheless are nonplanar in some sense “because” of one of them. For
example, the Petersen graph from Example 11.11—see Figure 11.27—is nonplanar, but it doesn’t have K5 as a subgraph. But if
we “collapse” together the nodes A/F, B/G, C/H, D/I, and E/J into “supernodes” then the resulting graph is K5 .)

G
F H A F H C

E A D G C K J L M J I

B I E D

Figure 11.26 Rearrangements of the graphs in Figure 11.27 Three non-planar graphs: K5 , K3,3 , and
Figure 11.25 to show that they are planar. the Petersen graph.

ISTUDY
11.2 Formal Introduction 599

Computer Science Connections

Degree Distributions and the Heavy Tail


400
When we think about massive graphs like the web (with 198 people with
nodes representing web pages and edges representing degree 4
links from one page to another) or an online social 300

number of people
network (with nodes representing people and edges
representing whatever it is that being “friends” in an 200
online social network means), it is interesting to look three people with
at how properties of individual nodes are distributed degree 512
across the population. We can look at the distribution 100
of any node-by-node property—the physical height of
Twitter users, or the number of words of text per web
page, for example. But in addition to demographic 1000 2000 3000
degree k
properties like height and length, we can also look at
the distribution of network-type properties. (a) The degree distribution.
The degree distribution of a graph G shows, for each
possible degree d, the number of nodes in G whose 1000
degree is d. While one might initially expect the degree
distribution of a real social network to look similar to three people with
number of people

the distribution of heights, it turns out that the degree 100 degree 512
distribution has very different properties.
Figure 11.28 shows the degree distribution (in lin-
ear, log–log, and cumulative form) for members of the 10 198 people with
University of North Carolina: that is, for each value of degree 4
k, the number of people in the network who have pre-
cisely k Facebook friends in the network. About 350
people have only 1 friend, which is the most common 10 100 1000
degree k
number of friends to have. There are about 750,000
friendships represented in this dataset; the average (b) Another plot of the degree distribution: the data as in (a),
degree is ≈ 84. but with the axes scaled logarithmically instead of linearly.
But, looking at the far-right end of Figure 11.28a
and 11.28b, we see a handful of people with very
high degrees: 2000, 2500, 3000, and even ≈ 3800. 10000
people with degree ≥ k

One of the interesting facts about degree distributions


in real social networks (or the web) is that there are 1000 17,340 people
people whose popularity is massively larger than aver- with degree ≥ 4
age: the highest-degree person in this dataset is about 100
3800/84 ≈ 45 times more popular than average.
(Imagine the tallest person at the University of North 10 57 people with
Carolina being 45 times taller than average!) degree ≥ 512
Significant research by computer scientists (among
researchers from many other fields) interested in the 10 100 1000
degree k
structure of social networks and the web has focused on
this so-called heavy-tailed degree distribution. Some of (c) The cumulative degree distribution: the number of
the literature debates the particular form of this distri- people with degree ≥ k (whereas (a) and (b) showed the
bution; for example, whether the distribution has the number of people with degree = k).
particular form of a power law, where the number of
Figure 11.28 The degree distribution of ≈ 18,000
people with degree k is roughly kα for some small con-
Facebook users at the University of North Carolina. (The
stant α, usually around 2. (If you’re interested, you
data are from the Facebook5 dataset, from Mason Porter via
can read a lot more about power laws and heavy-tailed
degree distributions in [43].) the International Network for Social Network
Analysis [126].)

ISTUDY
600 Graphs and Trees

Computer Science Connections

Graph Drawing, Graph Layouts, and the 9/11 Memorial Marble Hill–225 St

215 St
Visual representations of most large 207 St
Inwood–207 St
Dyckman St

graphs are too cluttered for a human 190 St


Dyckman St

viewer to process: there are just too 191 St


181 St
many nodes and edges crammed into 175 St
181 St

a small space to see much of any- u


168 St 168 St–Washington Hts
thing. Visually presenting a graph like 163 St–Amsterdam Av
157 St
Facebook (billions of nodes, tens of u 155 St
155 St

billions of edges) without it looking 145 St 145 St


Holding the other Harlem–148 St
like a grade-school scribble is daunt- three nodes’ 137 St–City College
145 St 135 St

ing. But there is an entire subfield of positions constant, 125 St


135 St
node u would 125 St

computer science called graph draw- move to the new 116 St–Columbia University 125 St 116 St

ing, which is devoted to taking networks position that’s Cathedral Pkwy 125 St 116 St
Cathedral Pkwy (110 St)
Central Park North (110 St)

and producing good—clear, aesthetic, shown. 103 St 116 St 103 St

110 St
96 St
informative—images of networks. 103 St
96 St

86 St
In some graphs, each node has a “nat- 79 St 96 St
86 St 96 St

81 St - Museum of Natural History


ural location” and it is clear where on the 72 St
86 St

86 St 72 St
page it should be placed. For example, 66 St–Lincoln Center
77 St

graphs may represent nodes that have 59 St–Columbus Circle


59 St - Columbus Circle
72 St
68 St - Hunter College

Av5
57 St - 757 St
Av/59 St

a precise location situated in the phys- 50 St


Lexington Av/63 St
7 Av
50 St
Lexington Av/59 St
49 St 5 Av/53 St
59 St

47-50 Sts - Rockefeller Ctr


ical world. With that kind of informa- 42 St–Port Authority Bus Terminal
34 St–11 Av Times
Roosevelt Island
Lexington Av/53 St
Times Sq - 42 St
42 StSq- -Bryant
42 St Pk
5 Av
51 St

Grand Central - 42 St
tion for each node, presenting the graph 34 St–Penn Station 34 St -34
Penn
St - Station
Grand Central–42 St
Grand Central - 42 St
Herald Sq
28 St
well is easier. (See Figure 11.29.) But 23 St 33 St23 St
23 St
28 St

28 St
18 St 23 St
many large graphs do not have obvious 14 St
23 St
14 St614Av
St
8 Av 14 St–Union Sq14 St - Union Sq
coordinates associated with each node: Christopher St–Sheridan Sq W 4 St
Union Sq - 14 St
3 Av
8 St - NYU

while you and your classmates may have Houston St


Astor Pl
1 Av
Spring St Bleecker St
Broadway-Lafayette St
Prince St
geographic locations (residence halls), Canal St 2 Av
Canal St
Spring St
Canal St
Bowery
Canal
CanalSt Essex St
it’s not clear that your room really best Franklin St
Chambers St
CanalGrand
Chambers
St St St
St
Delancey St
CityPlBridge
Brooklyn Hall
World Park
Trade Center - City Hall
East Broadway
describes “where” you fit in the social WTC Cortlandt
Chambers St
Cortlandt
Fulton St
St St
Fulton
Fulton St
WallStSt
Rector
scene of your institution. Rector St
Bowling Green
Broad St
Wall St
Whitehall St
For graphs whose nodes don’t have South Ferry

obvious coordinates, we have to do Figure 11.29 Part of the New York City subway system, where each
something else. One common approach node’s position corresponds to the station’s spatial location. Inset: a node
is to arrange the nodes based on a “seeks” a position that perfectly balances the attractive forces to each of its
physics analogy, as follows. Imagine neighbors (and repelling forces from all other nodes).
each node as a charged particle: any two
nodes that are joined by an edge are pulled together by an attractive force, and any two nodes that are not joined
by an edge are pushed apart by a repulsive force. Then figuring out how to place nodes on the page can be done by
starting them in a random configuration and letting the attractive/repulsive forces move the nodes around until they’re
“happy” in their current positions. (See the inset of Figure 11.29.)
An idea like this one was actually used in designing the 9/11 Memorial at the site of the World Trade Center.
The memorial was designed with bronze panels inscribed with the 2982 names of victims. A team of computer
scientists, architects, and visual artists collaborated to organize the names in a meaningful way. Families were invited
to submit “meaningful adjacencies” between victims—which would cause two names to be as close together in the
bronze panels as possible. (One of the other algorithmic issues regarding the layout of this memorial was that the
designers wanted the names to be placed at evenly spaced intervals on the bronze panels; this constraint added to
the computational complexity of the process.) The team used an algorithm to organize the names in an arrangement
that respected these requests, which was then used in the final design of the memorial. (In addition to the broader
news reports on the wrenching emotional and historical aspects of the 9/11 Memorial, the algorithmic aspects of the
memorial were also covered in the popular press; for more, see [99].)

ISTUDY
Exercises 601

EXERCISES
Draw a graph with the following nodes and edges. Does it make sense to use a directed or undirected graph? Is the graph you’ve
drawn simple?
11.1 The nodes are V = {1, 2, . . . , 10}; an edge connects x and y if gcd(x, y) = 1.
11.2 The nodes are V = {1, 2, . . . , 10}; an edge connects x and y if x divides y.
11.3 The nodes are V = {1, 2, . . . , 10}; an edge connects x and y if x < y.

11.4 List the edges of the graph in Figure 11.30a, and identify the node(s) with the highest degree.
11.5 Repeat for the graph in Figure 11.30b.
11.6 Repeat for the graph in Figure 11.30c, identifying both the highest in-degree node(s) and the highest out-degree node(s).
11.7 Repeat for the graph in Figure 11.30d, again for both in- and out-degree.

11.8 If G = ⟨V, E⟩ is an undirected, simple graph with n nodes, what’s the largest that |E| can be (in terms of n)? The smallest? Explain.
11.9 If G is a directed, simple graph with n nodes, what’s the largest that |E| can be? The smallest?
11.10 How do your answers to Exercise 11.9 change if self-loops are allowed?
11.11 How do your answers to Exercise 11.9 change if self-loops and parallel edges are allowed?

The anthropologist Robin Dunbar has argued that humans have a mental capacity for only approximately 150 friends [40]. Dun-
bar’s argument is based in part on the physical size of the human brain, and cross-species comparisons; 150 is now occasionally
known as Dunbar’s number. (Thanks to Michael Kearns, from whom I learned a somewhat related version of these exercises.)
11.12 Suppose that Alice has exactly 150 friends, and each of her friends has exactly 150 friends—that is, a friend of Alice knows Alice
and 149 other people. (Note that Alice’s friends’ sets of friends can overlap.) Let S denote the set of people that Alice knows directly
or with whom Alice has a mutual friend. What’s the largest possible value of |S|?
11.13 For the set S defined as in Exercise 11.12, what’s the smallest possible value of |S|?
11.14 Continue to assume that everyone has precisely 150 friends. Let Sk denote the set of all people that Bob knows via a chain of k or
fewer intermediate friends:

S0 = Bob’s friends;
S1 = the people in S0 and the friends of people in S0 ;
S2 = the people in S1 and the friends of people in S1 ; and so forth.

In terms of k, what’s the largest possible value of |Sk |?


11.15 Let k ≥ 0 be arbitrary. For the set Sk as defined in Exercise 11.14, what’s the smallest possible value of |Sk |?

11.16 Let u be a node in an undirected graph G. Prove that u’s degree is at most the sum of the degrees of u’s neighbors.

11.17 Prove Corollary 11.9: in an undirected graph G = ⟨V, E⟩, let nodd denote the number of nodes whose degree is odd. Prove that nodd
is an even number. That is: prove that

| {u ∈ V : degree(u) mod 2 = 1} | mod 2 = 0.

11.18 Prove the analogy of Theorem 11.8 for directed graphs: for a directed graph G = ⟨V, E⟩,
X X
in-degree(v) = out-degree(v) = |E|.
u∈V u∈V

(a) (b) (c) (d)


B D F D D B
B F B F
A H A D
C E G A H A H C

C G C G
E E

Figure 11.30 A few graphs.

ISTUDY
602 Graphs and Trees

head head
1 2 3 4 5 1 2 3 4

(a) A linked list. Each node has two fields: data on the (b) A doubly linked list. The three fields in each node,
left and next on the right. from left to right, are: previous, data, and next. (This
doubly linked list is circular.)

Figure 11.31 Singly and doubly linked lists. Each rectangle is a node, and contains a data field (whatever the node
stores) and a next field that is either null or points to a node in the linked list. A particular node is designated as the
head node. Note that a circular linked list in which a node’s next field points back to a previously encountered node
meets this definition. In a doubly linked list, there is also a previous field that is either null or points to a node in the
linked list.

11.19 A linked list is a data structure consisting of a collection of nodes, as illustrated in Figure 11.31a. Define a directed graph G = ⟨V, E⟩,
where V is the set of all nodes reachable by following any number of next pointers starting at the head node, and ⟨u, v⟩ ∈ E if u’s
next field points to u. (G may not be simple.) Observe that each node u in G has out-degree d ∈ {0, 1}. Describe a 5-node linked
list in which every node of G has in-degree d = 1.
11.20 Describe a 5-node linked list for which the graph G (as defined in Exercise 11.19) contains a node with in-degree d = 2.
11.21 Describe a 5-node linked list for which the graph G (as defined in Exercise 11.19) is not simple.
11.22 (This exercise is a tougher algorithmic challenge.) You are given access to the head node h of an n-node linked list. The value of n
is unknown to you. The only operations permitted are (a) to save a node; (b) test whether two saved nodes are the same or different;
and (c) given a node u, fetch the node pointed to by u’s next field. Give an algorithm to determine whether the given list is circular
using only a constant amount of memory—that is, remembering only a constant number of nodes at a time.

11.23 A doubly linked list has n nodes with data and two pointers, previous and next. (See Figure 11.31b.) Consider a doubly linked
list with nodes {1, 2, . . . , n}, where each node u’s next field points to node u + 1 (and node n points back to node 1). That is, u’s
next node is v = (u mod n) + 1 (and v’s previous node is u). (Figure 11.31b shows an example with n = 4.) Define the directed
graph Gn = ⟨V, E⟩, where V = {1, 2, . . . , n} is the set of nodes, and every node has two edges leaving it: one edge ⟨u, u.next⟩,
and one edge ⟨u, u.previous⟩. Draw the graph G5 .
11.24 Give an example of a value of n for which Gn (as defined in Exercise 11.23) contains a self-loop.
11.25 Give an example of a value of n for which Gn (as defined in Exercise 11.23) contains parallel edges.

11.26 Write down an adjacency list representing the graph in Figure 11.30a.
11.27 Write down an adjacency list representing the graph in Figure 11.30b.
11.28 Write down an adjacency list representing the graph in Figure 11.30c.
11.29 Write down an adjacency list representing the graph in Figure 11.30d.
11.30 Write down an adjacency matrix for the graph in Figure 11.30a.
11.31 Write down an adjacency matrix for the graph in Figure 11.30b.
11.32 Write down an adjacency matrix for the graph in Figure 11.30c.
11.33 Write down an adjacency matrix for the graph in Figure 11.30d.

11.34 Suppose that a (possibly directed or undirected) simple graph G is represented by an adjacency list. Suppose further that, for every
node u in G, the list of (out-)neighbors of u has a unique length. (That is, no two nodes have the same number of (out-)neighbors.)
True or False: G must be a directed graph. Justify your answer.
11.35 Describe a directed graph G meeting the specifications of Exercise 11.34.

11.36 The density of a graph G = ⟨V, E⟩ is the fraction of all possible edges that actually exist: that is,
|E|
density = .
[your answer to the first part of Exercise 11.8/Exercise 11.9]
As a function of n, what is the density of an n-node path? (Like this one, for n = 12: .)
11.37 As a function of n, what is the density of an n-node cycle? (Like this one, for n = 12: .)
11.38 As a function of n, what is the density of a graph that consists of 3n disconnected triangles? Assume that n mod 3 = 3. (Like this
one, for n = 12: .)
11.39 As a function of n, what is the density of a graph that consists of 3 separate cliques, each of which contains exactly 3n nodes?
Assume that n mod 3 = 3. (Like this one, for n = 12: .)

ISTUDY
Exercises 603

11.40 A hypercube Hn is a graph in which the 2n different nodes are all elements of {0, 1}n . There is an edge between x and y if they differ
in only one bit position. (Using the language of Section 4.2, there’s an edge between any two nodes whose Hamming distance is
1.) Draw H3 .
11.41 Write down an adjacency list for the hypercube H4 (see Exercise 11.40).
11.42 Write down an adjacency matrix for the hypercube H4 (see Exercise 11.40).
11.43 For the hypercube Hn (see Exercise 11.40): in terms of n, how many edges does Hn have? What is its density?

11.44 Are the pair of graphs in Figure 11.32a isomorphic? Prove your answer.
11.45 Are the pair of graphs in Figure 11.32b isomorphic? Prove your answer.
11.46 Are the following graphs isomorphic? Prove your answer.

G1 = ⟨V1 , E1 ⟩, where V1 = {10, 11, 12, 13, 14, 15} and ⟨x, y⟩ ∈ E1 if and only if x and y are not relatively prime
G2 = ⟨V2 , E2 ⟩, where V2 = {20, 21, 22, 23, 24, 25} and ⟨x, y⟩ ∈ E2 if and only if x and y are not relatively prime.

11.47 Prove or disprove: all 5-node graphs with degrees 1, 1, 1, 1, and 0 are isomorphic.
11.48 Prove or disprove: all 5-node graphs with degrees 4, 4, 4, 3, and 3 are isomorphic.
11.49 Prove or disprove: all 5-node graphs with degrees 3, 3, 2, 2, and 2 are isomorphic.
11.50 Prove or disprove: all n-node, 3-regular graphs are isomorphic.

11.51 The computational problem of finding the largest clique (complete graph) that’s a subgraph of a given graph G is believed to be
very difficult. But for small graphs it’s possible to do, even by brute force. Identify the size of the largest clique that’s a subgraph
of the graph in Figure 11.33a.
11.52 What’s the size of the largest clique that’s a subgraph of the graph in Figure 11.33b?
11.53 What’s the size of the largest clique that’s a subgraph of the graph in Figure 11.33c?

11.54 Consider the collaboration network in Figure 11.33d. (See Example 11.14.) Assuming that the nodes correspond to actors in movies,
what is the smallest number of movies that could possibly have generated this collaboration network?
11.55 Are you certain that there weren’t more movies than [your answer to Exercise 11.54] that generated this graph? Explain.

11.56 Consider the graph V = {1, 2, . . . , n} and E = {⟨i, i − 1⟩ : i ≥ 2}. For which n is this graph bipartite? Give proof.
11.57 For which n is the graph V = {0, 1, . . . , n − 1} and E = {⟨i, i + 1 mod n⟩ : i ≥ 1} bipartite? Give proof.
11.58 For which n is Kn (that is, the complete graph on n nodes) bipartite? Prove your answer.
11.59 Identify all values of n for which the following graph G = ⟨V, E⟩ is bipartite, and prove your answer. The set of nodes is
V = {0, 1, . . . , 2n − 1}; the set of edges is E = {⟨i, (i + n) mod 2n⟩ : i ∈ V}.

P O
(a) (b)
C A E B G J K M H

J K M I
G F C A F

H D B L N D E L N I

Figure 11.32 Two pairs of graphs.

(a) A I (b) A J I (c) A J I

B H B H B H

C G C G C G (d) A collaboration network.

D E F D E F D E F

Figure 11.33 Three graphs in which to find cliques, and a collaboration network.

ISTUDY
604 Graphs and Trees

(a) (b) (c) B D H (d)


D B D F A B C D
B F
A H F G
A H C E G I E
E A C
C G H G F
E

Figure 11.34 A few graphs.

11.60 Is the graph in Figure 11.34a bipartite? Explain.


11.61 Is the graph in Figure 11.34b bipartite? Explain.

Consider a bipartite graph with a set L of nodes in the left column and a set of nodes R on the right column, where |L| = |R|. Prove
or disprove the following claims:
11.62 The sum of the degrees of the nodes in L must equal the sum of the degrees of the nodes in R.
11.63 The sum of the degrees of the nodes in L must be even.
11.64 The sum of the degrees of all nodes (that is, all nodes in L ∪ R) must be an even number.

11.65 Suppose that G is a complete bipartite graph with n nodes—that is, G = K|L|,|R| for |L| + |R| = n. What’s the largest number of
edges that can appear in G?
11.66 What’s the smallest number of edges that can appear in a complete bipartite graph with n total nodes? (Careful!)

11.67 Suppose that G is graph that does not contain a triangle (that is, there is no set of three nodes a, b, and c with the edges {a, b} and
{b, c} and {c, a} all appearing in the graph). Prove or disprove: G is bipartite.

11.68 Definition 11.18 describes a regular undirected graph. In a directed regular graph, we require that there be two integers din and dout
such that every node’s in-degree is din and every node’s out-degree is dout . Prove that din = dout .

11.69 Show that the graph in Figure 11.34c is planar.


11.70 Show that the graph in Figure 11.34d is planar.
11.71 Prove that any 2-regular graph is planar.

ISTUDY
11.3 Paths, Connectivity, and Distances 605

11.3 Paths, Connectivity, and Distances


I love North Dakota cause you have never been there
and the days go on forever and the towns all look the same
and I can ride the back roads and I can walk the main streets
and show someone your picture but they would not know your name.
Kris Delmhorst (b. 1970)
excerpted lyrics from “North Dakota” (1998)

One of the most basic questions that one can ask about a graph is whether it is possible to get from some
given node s to some given node t by following a sequence of edges. Is there some chain of handshakes
that connects Bea Arthur to Arthur Conan Doyle? Can you get from Missoula to Madison by car? (And, if
there is a way to get from s to t, what is the shortest way to get there?) These basic questions concern the
existence of paths in the graph:

Definition 11.20: Path.


Consider a (directed or undirected) graph G = ⟨V, E⟩. A path in G is a sequence ⟨u1 , u2 , . . . , uk ⟩ of k ≥ 1
nodes such that:

• ui ∈ V for every i ∈ {1, . . . , k}, and


• ⟨ui , ui+1 ⟩ ∈ E for every i ∈ {1, . . . , k − 1}.

We say that such a sequence of nodes is a path from u1 to uk , and that this path has length k − 1. We also
say that this path traverses the edges ⟨ui , ui+1 ⟩.

See Figure 11.35a. (Note that this definition includes both directed and undirected graphs: if the edges are
directed, we have to follow them “in the right direction.”) For example, in the graphs shown in Figure 11.35b
and 11.35c, there is no path from A to X. But, in both, the sequence ⟨A, C, E, Z⟩ is a path of length 3 from
A to Z. In both cases, the edges traversed by the path are {⟨A, C⟩, ⟨C, E⟩, ⟨E, Z⟩}. Notice that the length of a
path is the number of edges that it traverses, which is one fewer than the number of nodes in the path.

Taking it further: A common mistake made by novice (and not-so-novice) programmers is an off-by-one error in specifying
the bounds on a loop, by iterating either one time too many or one time too few. These errors are also sometimes called fencepost
errors: if you build a 10-yard fence with posts placed every yard, then there are eleven fenceposts (at yard 0, yard 1, . . ., yard
10). Be careful! A path ⟨A, C, E, Z⟩ contains four nodes, but it traverses only three edges (A → C, C → E, and E → Z) and has
length 3.

Here’s an example of finding paths in a small graph:

u1 u2 u3 uk−1 uk
··· A D E X A D E X
u1 u2 u3 uk−1 uk
···
B C Z Y B C Z Y
(a) Paths in undirected and (b) An undirected graph (c) A directed graph with a
directed graphs. with a path from A to Z. path from A to Z.

Figure 11.35 Paths in graphs.

ISTUDY
606 Graphs and Trees

(a) (b) 1 1
D D D
B F B F B F

A H A 2
4 H A 2 5 H

C G C 3 G C G
E E 3 E 4

Figure 11.36 (a) An undirected graph, and (b) two paths from D to F: ⟨D, B, E, G, F⟩ and ⟨D, B, C, E, G, F⟩.

Example 11.21: Finding paths.


Consider the undirected graph in Figure 11.36.
1 Is there a path from node H to node E?
2 Name three different paths from node D to node F. What is the length of each?

Solution. For (1), yes, there is a path from node H to node E—for example, ⟨H, A, F, G, E⟩.
For (2), two paths from D to F are shown in Figure 11.36b: the path ⟨D, B, E, G, F⟩, which has length
4; and the path ⟨D, B, C, E, G, F⟩, which has length 5. Finding a third path might seem harder, but Defini-
tion 11.20 did not require that the nodes in a path be distinct from each other. (In other words, nothing
forbade the repetition of nodes in a path.) So ⟨D, B, C, E, B, C, E, G, F⟩ is a path from D to F, with length 8.

We will often restrict our attention to paths that never go back to a vertex that they’ve already visited,
which are called simple paths:

Definition 11.21: Simple path.


A path ⟨u1 , u2 , . . . , uk ⟩ is simple if all of the nodes u1 , . . . , uk are distinct.

Of the three paths identified in Example 11.21, the first two are simple paths, but the third path is not simple
because it repeated nodes {B, C, E}.

11.3.1 Connectivity in Undirected Graphs


The most basic question about two nodes in a graph is whether it’s possible to get from one to another—that
is, are these two nodes connected? We start with a formal definition of connectivity for undirected graphs,
because the relevant notions are a bit easier in the undirected setting.

Definition 11.22: Connected nodes and connected graphs.


Let G = ⟨V, E⟩ be an undirected graph.

• Two nodes u ∈ V and v ∈ V are connected if there exists a path from u to v.


• The graph G is connected if u and v are connected for any two nodes u ∈ V and v ∈ V.
• The graph G is called disconnected if it is not connected.

For example, Figure 11.37a shows a disconnected graph—there’s no path from A to H, for example—and
Figure 11.37b shows a connected one. You can check that the second graph is connected by testing all pairs

ISTUDY
11.3 Paths, Connectivity, and Distances 607

2
1 3

8 4
(a) (b)
A D E G I L M O 7 5
6
B C F H J K N P
Figure 11.38 A possibly
Figure 11.37 (a) A disconnected graph, and (b) a connected graph. connected graph.

of nodes. (Exercise 11.87 asks you to show that connectivity is symmetric in an undirected graph: if there
exists a path from u to v, then there exists a path from v to u.)

Example 11.22: Connectivity of an undirected graph.


Is the graph in Figure 11.38 connected?
Solution. No: odd-numbered nodes have edges only to other odd-numbered nodes, and even-numbered
nodes have edges only to other even-numbered nodes. So there is no path from, for example, node 1 to
node 2; this graph is disconnected.

Problem-solving tip: Sometimes it’s very helpful to redraw a graph that you’re given, with nodes placed more meaningfully.
For example, the graph in Figure 11.38 can be redrawn as just by sliding the even-numbered nodes to the right. This
visualization makes it clear that the graph is disconnected.

Connected components
More generally, we will talk about the connected components of an undirected graph G = ⟨V, E⟩—
“subsections” of the graph in which all pairs of nodes are connected.

Definition 11.23: Connected component.


In an undirected graph G = ⟨V, E⟩, a connected component is a set C ⊆ V such that
(i) any two nodes s ∈ C and t ∈ C are connected; and
(ii) for any node x ∈ V − C, adding x to C would make (i) false.

A subset C ⊆ V of nodes is a connected component of an undirected graph G = ⟨V, E⟩ if, intuitively, it


forms its own “section” of the graph: any two nodes in C are connected, and no node in C is connected to
any node not in C. For example, Figure 11.39 shows a graph with three connected components—one with
4 nodes, one with 3 nodes, and one with just a single node.
Note that we could have defined a “connected graph” in terms of the definition of connected components
(instead of Definition 11.22): an undirected graph G = ⟨V, E⟩ is connected if it contains only one connected
component, namely the entire node set V.

ISTUDY
608 Graphs and Trees

A B A B A B A B

D C D C D C D C

E F E F E F E F

G H G H G H G H

(a) The original graph. (b) Component #1. (c) Component #2. (d) Component #3.

Figure 11.39 A graph, and its three connected components.

E A B C G H
A B
A ⟨A⟩ ⟨A, C, G, B⟩ ⟨A, C⟩ ⟨A, C, G⟩ ⟨A, C, G, H⟩
B ⟨B⟩ ⟨B, G, C⟩ ⟨B, G⟩ ⟨B, H⟩
H F
C ⟨C⟩ ⟨C, G⟩ ⟨C, G, H⟩
D C G ⟨G⟩ ⟨G, H⟩
G H ⟨H⟩

(a) The graph. (b) Paths connecting every pair of nodes in the largest connected component.

Figure 11.40 An undirected graph, and some paths connecting several pairs of nodes.

Example 11.23: Connected components of an undirected graph.


What are the connected components of the graph in Figure 11.40a?
Solution. The set S = {A, B, C, G, H} is a connected component; there are paths from every node u ∈ S to
every node v ∈ S, and furthermore no node in S is connected to any node not in S. To be thorough, paths
connecting each pair of nodes from S are shown in Figure 11.40b.
Note that we haven’t bothered to write down a path from u to v when we’d already recorded a path
from v to u, because the graph is undirected and paths are symmetric. We also had many choices of paths
for many of these entries: for example, other paths from B to H included ⟨B, G, H⟩ or ⟨B, G, H, B, G, H⟩.
There’s a second connected component in the graph: the nodes {D, E, F}. It’s not too hard to check that
both clauses of Definition 11.23 are also satisfied for this set.

Observe that, in any undirected graph G = ⟨V, E⟩, there is a path from each node u ∈ V to itself. Namely,
the path is ⟨u⟩, and it has length 0. Check Definition 11.20!

Taking it further: There are many computational settings in which undirected paths are relevant; here’s one example, in brief.
In computer vision, we try to build algorithms to process—“understand,” even—images. For example, before it can decide how
to react to them, a self-driving car must partition the image of the world from a front-facing camera into separate objects: painted
lines on the road, trees, other cars, pedestrians, etc. Here’s a crude way to get started (real systems use far more sophisticated
techniques): define a graph whose nodes are the image’s pixels; there is an edge between pixels p and p′ if (i) the two pixels are
adjacent in the image, and (ii) the colors of p and p′ are within a threshold of acceptable difference. The connected components of
this graph are a (very rough!) approximation to the “objects” in the image. This description misses all sorts of crucial features of
good algorithms for the image-segmentation problem, but even as stated it may be familiar from a different context: the “region
fill” tool in image-manipulation software uses something very much like what we’ve just described.

ISTUDY
11.3 Paths, Connectivity, and Distances 609

11.3.2 Connectivity in Directed Graphs


Recall that we have to follow edges “in the right direction” in a directed graph G: as in Definition 11.20,
a path from u1 to uk in G is a sequence ⟨u1 , u2 , . . . , uk ⟩ where every pair ⟨ui , ui+1 ⟩ is an edge in G. Thus
notions of connectivity in directed graphs are more complicated: the existence of a path from u to v does
not imply the existence of a path from v to u. We will speak of a node t as being reachable from a node s if
it’s possible to go from s to t, and of pairs of nodes as being strongly connected when it’s possible to “go
in both directions” between them:

Definition 11.24: Reachability and strongly connected nodes/graphs.


Let G = ⟨V, E⟩ be a directed graph.

• A node u ∈ V is reachable from a node v ∈ V if there is a directed path from u to v.


• Two nodes u ∈ V and v ∈ V are strongly connected if u is reachable from v, and v is reachable from u.
• The graph G is strongly connected if every pair of nodes in V is strongly connected.

For example, you can check that the graph in Figure 11.41a is strongly connected by testing for directed
paths between all pairs of nodes, in both directions. But the graph in Figure 11.41b is not strongly con-
nected: there’s no path from any node in the right-hand side (nodes {M, N, O, P}) to any node in the left-hand
side (nodes {I, J, K, L}).

Strongly connected components


As with undirected graphs, for a directed graph we will divide the graph into “sections”—subsets of the
nodes—each of which is strongly connected. These sections are called strongly connected components of
the graph:

Definition 11.25: Strongly connected component.


In a directed graph G = ⟨V, E⟩, a strongly connected component (SCC) is a set C ⊆ V such that:
(i) any two nodes s ∈ C and t ∈ C are strongly connected; and
(ii) for any node x ∈ V − C, adding x to C would make (i) false.

Figure 11.42 shows an example of a directed graph G and the three strongly connected components in G.
The easiest strongly connected component to identify is {A, B, C, D}: we can go counterclockwise around
the loop A → B → C → D → A, so we can go from any one of these four nodes to any other, and we can’t
get from any of these four nodes to any of the other nodes. The other two strongly connected components
are {E, F, H} and, separately, {G} on its own. The reason is that G is not strongly connected to any other
node: we can’t get from G to any other node. (We can go around the E → F → H → E loop, so these three
nodes are together in the other strongly connected component.)

(a) A D E G (b) I L M O

B C F H J K N P

Figure 11.41 Two directed graphs: (a) one that’s strongly connected, and (b) one that’s not.

ISTUDY
610 Graphs and Trees

A B A B A B A B

D C D C D C D C

E F E F E F E F

G H G H G H G H

(a) The original graph. (b) SCC #1. (c) SCC #2. (d) SCC #3.

Figure 11.42 A graph and its strongly connected components.

(a) A F (b) A F G

B E B E

C C

D D

Figure 11.43 Two directed graphs.

Example 11.24: Finding strongly connected components.


What are the strongly connected components of the graph in Figure 11.43a?
Solution. The three nodes {C, D, E} form a strongly connected component: there is a path from any one
of them to any other of them (C → D → E → C → D → E · · · ), and furthermore there is no path from
any {C, D, E} to any other node in the graph.
In fact, every other node in the graph is alone in a strongly connected component by itself. For example,
while there is a path from A to every node in the graph, there is no path from any other node to A. (There
is a path from A to A, so the set {A} is a strongly connected component.) Thus the four strongly connected
components of the graph are {A}, {B}, {F}, and {C, D, E}.

Here’s an example that shows why the second clause of Definition 11.25 is crucial:

Example 11.25: A non-SCC.


In the graph in Figure 11.43b, the set S = {A, B, C, E, F} is not a strongly connected component. Why
not?
Solution. It is indeed the case that there is a path in both directions between any two nodes in S: we can
just keep “going around” clockwise in S and we eventually reach every other node in S. So S satisfies
clause (i) of Definition 11.25. But it fails to satisfy clause (ii) of Definition 11.25: if we considered the
set S+ = S ∪ {D}, it is still the case that there is a path in both directions between any nodes in S+ . Thus
S is not a strongly connected component!
On the other hand, S+ = {A, B, C, D, E, F} is a strongly connected component: we can’t add any other
node (specifically G; it’s the only other node) to S+ without falsifying this property—because there’s no
path from G to A, for example. Thus the two strongly connected components are {A, B, C, D, E, F} and {G}.

ISTUDY
11.3 Paths, Connectivity, and Distances 611

Taking it further: There are many computational settings in which directed paths, reachability, and strongly connected compo-
nents are relevant. For example, for a spreadsheet, consider a directed graph whose nodes are the spreadsheet’s cells, and an edge
⟨u, v⟩ indicates that u’s contents affect the contents of cell v; when a user changes the content of cell c, we must update all cells
that are reachable from node c. For a chess-playing program, consider a directed graph whose nodes are board configurations,
and there’s an edge ⟨u, v⟩ if a legal move in u can result in v; any configuration u that’s unreachable from the starting board
configuration can never occur in chess, and thus your program doesn’t have to bother evaluating what move to make in position
u. See p. 618 for a discussion of another application of reachability and strongly connected components: the structure of the web,
understood with respect to the directed paths in the graph defined by the pages and the hyperlinks of the web.

11.3.3 Shortest Paths and Distance


So far we have concentrated on the basic question of connectivity: for a given pair of nodes, does any path
exist from one node to the other? Here we address a more refined question: what is the shortest path that
goes from one node to the next?

Definition 11.26: Shortest paths.


Let G = ⟨V, E⟩ be a graph (undirected or directed), and let s ∈ V and t ∈ V be two nodes. A path from s
to t is a shortest path if its length is the smallest out of all s-to-t paths.

(Recall that the length of a path ⟨u1 , u2 , . . . , uk ⟩ is k − 1, the number of edges that it traverses.) There may
be more than one shortest path from a node s to a node t, if there are multiple paths that are tied in length.

Definition 11.27: Distance.


The distance from s to t is the length of a shortest path from s to t. If there is no path from s to t, then we
say that the distance from s to t is infinite (written as “∞”).

Example 11.26: Some shortest paths.


Consider the undirected graph in Figure 11.44a. The distance from A to A is 0 because ⟨A⟩ is a path from
A to A. This graph also has an example of a pair of nodes connected by two different shortest paths, going
from A to C (via either B or E).
For the directed graph in Figure 11.44b, again, there’s a path from G to G of length zero, so the distance
from G to G is 0. Note that there’s no G-to-J path of length two (because the edge from J to K goes in
the wrong direction), so the distance from G to J is 3 (via K and I, or via H and I). Similarly, there is no
directed path from G to L, so the distance is infinite.

(a) F 1 (b) L ∞
A G
0 0
B E 1 H K 1
1 1
C I
2 2
D 2 J 3

Figure 11.44 Some distances, in undirected and directed graphs. The nodes in (a) are marked with their distances from
A; the nodes in (b) are marked with their distances from G.

ISTUDY
612 Graphs and Trees

A: B, D, E, F, G distance 0 distance 1 distance 2 distance 3 distance 4 distance ∞


B: C, D, I
C: B, D, I B C
D: E
E: A, F D I H J
F:
G: F A E K
H: E, F
I: B, H, K F L
J: C, K
K: L
G
L: F

(a) An adjacency list. (b) A drawing of the graph, with one layer for each distance from A.

Figure 11.45 A directed graph.

Example 11.27: Shortest paths in directed graphs.


Find the shortest path from A to L in the graph with the adjacency list shown in Figure 11.45a.
Solution. The nodes at distance 1 from A are B, D, E, F, and G. There’s no edge from any of those nodes
to L—or indeed to K, which is L’s only in-neighbor. Thus the distance from A to L cannot be any smaller
than 4. But there is an edge from I to K, and one from B to I. We can assemble these edges into the
path ⟨A, B, I, K, L⟩. This path has length 4. So the distance from A to L is 4. (Drawing the graph, as in
Figure 11.45b, with nodes arranged by their distance from A, can make these facts easier to see.)

Problem-solving tip: In solving any graph problem with a small graph, a good first move is to draw the graph.

11.3.4 Finding Paths: Breadth-First Search (BFS)


There are many aspects of graphs that are valuable for interesting computational applications, but perhaps
the single most important graph algorithm is breadth-first search (BFS). BFS is a path-finding algorithm:
it explores outward from a given source node s in a given graph G until it finds every node reachable from
s in G. BFS can be used to solve all sorts of graph-related problems, as we’ll see.
Here’s the intuition of the algorithm. (See Figure 11.46a.) We maintain a set of nodes that are reachable
from the given node s (the shaded nodes in Figure 11.46a). To start, the reachable set is just {s}. Now we
find all as-yet-undiscovered neighbors of nodes in our set, and add those nodes (the darker-shaded nodes
in Figure 11.46a) to the set of reachable nodes: if ⟨u, v⟩ ∈ E and you can reach the node u from s, then you
can also reach v from s, via u. But now we’ve found some more nodes that can be reached from s, which
means that we can also reach any nodes that are directly connected to them from s. So we’ll repeat that
process with the updated set of reachable nodes. And we’ll do it again, and again, and again, until we stop
finding new nodes.
Observe that BFS discovers nodes in order of their distance from the source node. Every expansion of
the set of reachable nodes takes the full breadth of the frontier and expands it out by one more “layer” in
the graph. (That’s why the algorithm is called breadth-first search.) You can think of BFS as throwing a
pebble onto the graph at the node s, and then watching the ripples expanding out from s.
Breadth-first search is presented more formally in Figure 11.46b. (While we’ve described BFS in terms
of undirected graphs for simplicity, it works equally well for directed graphs. The only change is that Line 6
should say “for every out-neighbor” for a directed graph.)

ISTUDY
11.3 Paths, Connectivity, and Distances 613

Breadth-First Search (BFS):


Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
(continued) Output: the set of nodes reachable from s in G
0 3 1 Frontier := ⟨s⟩
F J F J
C H C H // Frontier will be a list of nodes to process, in order.
A E I A E I
B G B G 2 Known := ∅
D D
// Known will be the set of already-processed nodes.
1 4 3 while Frontier is nonempty:
F J F J
C H C H 4 u := the first node in Frontier
A E I A E I
B G B G 5 remove u from Frontier
D D
6 for every neighbor v of u:
2 5 7 if v is in neither Frontier nor Known then
F J F J
C H C H 8 add v to the end of Frontier
A E I A E I
B G B G 9 add u to Known
D D 10 return Known

(a) The steps of BFS on a small graph, starting at A. (b) The pseudocode for breadth-first search.

Figure 11.46 The intuition of a small example, and the pseudocode of breadth-first search.

Here’s another example of breadth-first search in action, running the algorithm in full detail (precisely
as specified in Figure 11.46b):

Example 11.28: Sample run of BFS, in detail.


Let’s trace BFS starting at node A in the graph shown in Figure 11.47.
Every step of the algorithm’s execution is shown in Figure 11.48. Because Frontier is empty after the
last step shown in Figure 11.48, the while loop in BFS terminates. The algorithm returns the final set
Known, which is {A, B, C, G, E, F}.

We’ll prove two important properties of BFS. The first is correctness: the set that BFS returns is precisely
those nodes that are reachable from the starting node. The second is efficiency: BFS finds this set quickly.
The first claim might seem obvious—and thus proving it may feel annoyingly pedantic—but there’s a bit
of subtlety to the argument, and it’s good practice at using induction in proofs besides.

Correctness of BFS
Theorem 11.28: Correctness of BFS.
Let G = ⟨V, E⟩ be any graph, and let s ∈ V be an arbitrary node. Then the set of nodes discovered by
BFS(G, s) is exactly {t ∈ V : t is reachable from s in G}.

We’ll prove the result by showing two set inclusions: the discovered nodes form a subset of the reachable
nodes, and the reachable nodes form a subset of the discovered nodes. Both proofs will use induction,
though on different quantities.

Problem-solving tip: The hard part here is figuring on what quantity to do induction. One way to approach this question is to
figure out a recursive way of stating the correctness claim. Question #1: why is there a path to every node added to Frontier?
(Answer #1: there was a path to every previous node in Frontier, and there’s an edge from some previously added node to this
one!) Question #2: why is every node u reachable from s eventually added to Frontier? (Answer #2: because a neighbor of u
that’s closer to s is eventually added to Frontier, and every neighbor of a node in Frontier is eventually added to Frontier!)

ISTUDY
614 Graphs and Trees

A: B, C
B: A, C, G
C: A, B, E, F, G
D: H
B G F E: C
F: C, G
A G: B, C, F
C E D H H: D

Figure 11.47 A sample graph for BFS, shown both as a picture and as an adjacency list.

nodes in Frontier nodes in Known


(a) A (b) A (c) A (d) A (e) A (f) A (g) A

C B C B C B C B C B C B C B

E G E G E G E G E G E G E G

D F D F D F D F D F D F D F

H H H H H H H
Known = {} {A} {A, B} {A, B, C} {A, B, C, G} {A, B, C, G, E} {A, B, C, G, E, F}
Frontier = ⟨A⟩ ⟨B, C⟩ ⟨C, G⟩ ⟨G, E, F⟩ ⟨E, F⟩ ⟨F⟩ ⟨⟩

Figure 11.48 Every step of BFS on the graph in Figure 11.47: (a) the initialization (Lines 1–2 of the BFS pseudocode
from Figure 11.46b); and (b–g) the processing of the next node (Lines 4–9) [in order, the processed nodes are: A, B, C, G,
E, and F].

Proof of Theorem 11.28, Part #1: every node returned by BFS(G, s) is reachable from s in G. By examin-
ing the algorithm, we see that (i) BFS returns the set of nodes that end up in the Known set, and (ii) the
only way that a node ends up in Known is having previously been in Frontier. Thus it will suffice to prove
the following property for all k ≥ 0, by strong induction on k:

Q(k) = if a node t ∈ V is added to the list Frontier during the kth iteration of the
while loop of BFS, then there is a path from s to t.

For the base case (k = 0), we must show that there is a path from s to any node t added to Frontier during
the 0th iteration of the while loop—that is, before the while loop begins. Then t was added in Line 1 of
BFS, and therefore t is actually the node s itself. There is a path from s to s itself in any graph, and thus
Q(0) holds.
For the inductive case (k ≥ 0), we assume the inductive hypotheses Q(0), . . . , Q(k − 1), and we must
prove Q(k). Consider a node t that was added to Frontier during the kth iteration of the while loop—in
other words, t was added in the for loop (Lines 6–8) because t is a neighbor of some node u that was already
in Frontier. That is, we know that ⟨u, t⟩ ∈ E and that u was added to Frontier in the (k′ )th iteration, for
some k′ < k. By the inductive hypothesis Q(k′ ), there is a path P from s to u. Therefore there is a path
from s to t, too:
edges of P edge ⟨u, t⟩
s u t .

Proof of Theorem 11.28, Part #2: every node reachable from s in G is returned by BFS(G, s). If a node t
is reachable from s in G, then by definition the distance from s to t is some integer d ≥ 0. Furthermore, by

ISTUDY
11.3 Paths, Connectivity, and Distances 615

inspection of the algorithm, we see that any node that’s added to Frontier is eventually moved to Known.
Thus it will suffice to prove the following property for all d ≥ 0, by (weak) induction on d:
R(d) = if a node t ∈ V at distance d from s, then t is eventually added to Frontier.
For the base case (d = 0), we must prove R(0): any node t at distance 0 is eventually added to Frontier. But
the only node at distance 0 from s is s itself, and BFS adds s itself to Frontier in Line 1 of the algorithm.
For the inductive case (d ≥ 1), we assume the inductive hypothesis R(d − 1), and we must prove R(d).
Let t be a node at distance d from s. Then by definition of distance there is a shortest path P of length d
from s to t. Let u be the node immediately before t in P. Then the distance from s to u must be d − 1, and
therefore by the inductive hypothesis R(d − 1) the node u is added to Frontier in some iteration of the
while loop. There are at most |V| iterations of the loop, and thus eventually u is the first node in Frontier.
In that iteration, the node t is added to Frontier (if it had not already been added). Thus R(d) follows.
(In the exercises, you’ll show how to modify BFS so that it actually computes distances from s, using an
idea very similar to the proof of Part #2 of Theorem 11.28.)

Running time of BFS

Theorem 11.29: Efficiency of BFS.


For a graph G = ⟨V, E⟩ represented using an adjacency list, BFS takes Θ(|V| + |E|) time.

Proof. See Figure 11.49a for a reminder of the algorithm. Lines 1, 2, and 10 take Θ(1) time, so the only
question is how long the while loop takes. In the worst case, every node in the graph is reachable from the
node from which BFS is run. In this case, there is one iteration of the while loop for every node u ∈ V.
The key question, then, is how long the body of the while loop (Lines 4–9) takes for a particular node u.
Lines 4, 5, and 9 take Θ(1) time. But what about the for loop in Lines 6–8? The for loop has one iteration
for each neighbor of u. (In an adjacency list, the loop simply steps through the list of neighbors, one by
one.) Each for-loop iteration takes Θ(1) time, and there are degree(u) iterations for node u.
Therefore, ignoring multiplicative constants, the worst-case running time of BFS is
Xh i P P
1+ 1 + degree(u) = 1 + u∈V 1 + u∈V degree(u) rearranging the summation
u∈V = |V| = 2|E| for an undirected graph (by Theorem 11.8),
or |E| in a directed graph (by Exercise 11.18)
= Θ(|V| + |E|).

Taking it further: BFS arises in applications throughout computer science, from network routing to artificial intelligence.
Another application of BFS occurs (hidden from your view) as you use programming languages like Python and Java, through a
language feature called garbage collection. In garbage-collected languages, when you as a programmer are done using whatever
data you’ve stored in some chunk of memory, you just “drop it on the floor”; the “garbage collector” comes along to reclaim
that memory for other use in the future of your program. The garbage collector runs BFS-like algorithms to determine whether
a particular piece of memory is actually trash. See p. 619.

11.3.5 Finding Paths: Depth-First Search (DFS)


Another important algorithm for exploring graphs is called depth-first search (DFS), whose pseudocode is
shown in Figure 11.49b. Informally, instead of exploring outward from the source node s in “layers” as in
BFS, in DFS we will try to explore a new node at every stage of the search. We start at s, and at every stage

ISTUDY
616 Graphs and Trees

Breadth-First Search (BFS): Depth-First Search (DFS):


Input: a graph G = ⟨V, E⟩ and a source node s ∈ V Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: the set of nodes reachable from s in G Output: the set of nodes reachable from s in G
1 Frontier := ⟨s⟩ 1 Frontier := ⟨s⟩
2 Known := ∅ 2 Known := ∅
3 while Frontier is nonempty: 3 while Frontier is nonempty:
4 u := the first node in Frontier 4 u := the first node in Frontier
5 remove u from Frontier 5 remove u from Frontier
6 if u is not in Known then
6 for every neighbor v of u: 7 for every neighbor v of u:
7 if v is in neither Frontier nor Known then 8 if v is not in Known then
8 add v to the end of Frontier 9 add v to the start of Frontier
9 add u to Known 10 add u to Known
10 return Known 11 return Known

(a) A reminder of BFS. (b) The pseudocode for depth-first search. The only
changes from BFS are underlined.

Figure 11.49 A reminder of breadth-first search, and the pseudocode of depth-first search.

we move to an unvisited neighbor of our current node. If at any stage we’re stuck at a node u that has no
unvisited neighbors, we go back from u to the node from which we first reached u and continue exploring
from there. An example of DFS in a small graph is shown, informally, in Figure 11.50.

(a) A (b) A (c) A (d) A (e) A (f) A (g) A

B C B C B C B C B C B C B C

D D D D D D D

E E E E E E E
F F F F F F F

G H G H G H G H G H G H G H

We start exploring node A; in each frame, In each of (a–d), we move from the In (e), the current node E has no unvisited neighbors, so we
the dark-shaded node is the current node. current node to a neighbor that is backtrack from E to D to find D’s unvisited neighbor F. Similarly,
Previously discovered nodes are lightly unexplored. (We pick the alphabetically in (f) we backtrack from F to D to B to discover the new node C,
shaded. Arrows indicate the steps of the first node if there’s a choice.) and in (g) we backtrack from C to B to A. At this point there are
exploration. no further unexplored nodes from any of these nodes, and thus
the algorithm terminates.

Figure 11.50 A sample run of depth-first search, starting at node A.

Intuitively, depth-first search is a close match for the way that you would explore a maze: you start at
the entrance, follow a passageway to a location you’ve never visited before; using breadcrumbs or a pencil,
you remember where you’ve been and backtrack if you get stuck. You may have heard of another algorithm
for mazes: Place your right hand on the wall as you go in the entrance. Continue to walk forward, always
keeping your right hand on the wall. Eventually, you will get out of the maze. In fact, this right-hand-on-
the-wall algorithm is identical in spirit to DFS: whenever you encounter a choice, you always choose the
first (right-most) unexplored passageway, and if you ever get stuck at a dead end you turn around and go
back from whence you came.
We can implement DFS with only a small change to BFS, as shown in Figure 11.49b: instead of putting
a newly discovered node u at the end of the list Frontier of nodes from which to explore (as in BFS), we

ISTUDY
11.3 Paths, Connectivity, and Distances 617

put a newly discovered node u at the beginning of Frontier. (In other words, BFS treats the list Frontier as
a queue—first in, first out—while DFS treats the list Frontier as a stack—last in, first out.) Another small
change is necessary, to allow a node already in Frontier to be “moved” earlier in the list of nodes to explore.
Because this alteration of BFS changes only the order in which the nodes in Frontier are explored, DFS
does precisely the same work as BFS, and is correct for the same reasons: DFS returns precisely the set of
nodes reachable from the given source node s. (With a little more cleverness in moving nodes to the front
of Frontier, DFS can also be implemented in Θ(|V| + |E|) time.) Here’s a fully detailed example of DFS:

Example 11.29: Sample run of DFS, in detail.


We’ll trace DFS starting at node A in the graph shown in Figure 11.51. Every substantive step of the
algorithm’s execution is shown in Figure 11.52; after the last step shown in the figure, there are two more
iterations that remove the last two entries in Frontier (making no changes to Known and adding nothing
further to Frontier), because both F and C are already in Known. The while loop then terminates, and
DFS returns {A, B, G, E, C, F}.

A: B, C
B: A, G
C: A, E
D: H
B G F E: C, F, G
F: E, G
A G: B, E, F
C E D H H: D

Figure 11.51 A sample graph for DFS, shown both as a picture and as an adjacency list.

nodes in Frontier nodes in Known


(a) A (b) A (c) A (d) A (e) A (f) A (g) A

C B C B C B C B C B C B C B

E G E G E G E G E G E G E G

D F D F D F D F D F D F D F

H H H H H H H
Known = {} {A} {A, B} {A, B, G} {A, B, G, E} {A, B, G, E, C} {A, B, G, E, C, F}
Frontier = ⟨A⟩ ⟨B, C⟩ ⟨ G , C⟩ ⟨ E , F , C⟩ ⟨ C , F , F, C⟩ ⟨ F, F, C⟩ ⟨ F, C⟩

Figure 11.52 Every step of DFS on the graph in Figure 11.51: (a) initialization; (b–g) processing the node at the
beginning of Frontier [in order, the processed nodes are: A, B, G, E, C, and F]. The names of the nodes that were just
added to Frontier are highlighted. When nodes are in Known, they are not re-added to Frontier [as with A in (c), B in
(d), G in (e), and both A and E in (f)].

ISTUDY
618 Graphs and Trees

Computer Science Connections

The Bowtie Structure of the Web


As the web has grown more and more central in the daily lives of us all, it has garnered massive attention from
researchers in computer science. A great deal of the early work performed attempted to characterize the web in terms
of its degree distribution (see p. 599) or in terms of the “small-world phenomenon” (see p. 180). But one foundational
and influential paper sought to characterize the web’s structure in terms of its strongly connected components [22]. In
the early days of the web, eight researchers from AltaVista, IBM, and Compaq downloaded around 200 million web
pages, comprising about 1.5 billion links. They then analyzed the structure of the resulting graph, by categorizing
the pages in terms of which pages could reach which other pages.
First, think about the web pages contained in the
largest strongly connected component of the web
graph. Like many other networks (for example, social
networks and collaboration networks), the web graph
has a so-called giant component that contains many
IN CORE OUT
more nodes than the second-largest strongly connected
component. Denote by core those nodes in the largest
SCC in the web graph.
Second, imagine a web page from which you can
reach the nodes in the largest SCC by following links,
but that’s not itself inside the giant component. Let in Figure 11.53 The “bowtie structure” of the web graph, in
denote those web pages p such that (i) p ∈ / core, and its basic form. Broder et al. [22] found that roughly 25% of
(ii) there is a path from p to some node in core. (And web pages fell into each of these categories: 56M pages (of
so there is a path from p in in to every page in core, 200M) in core, 43M pages in in, and 43M pages in out.
but there’s no path from any node in core to p.) And
let out denote the downstream analogue of in: pages reachable by following links from pages in core, but that are
not themselves in core. That is, out contains those pages p such that (i) p ∈ / core, and (ii) there is a path from some
node in core to p. (And therefore there is a path from every page in core to page p.)
When displayed graphically, as in Figure 11.53,
these categories of web pages look like a bowtie, and TENDRILS TENDRILS
so the paper by Broder et al. came to be known as “the
bowtie paper.” To complete the picture of the bowtie
structure of the web, we must note that not all web
pages are included in Figure 11.53. (It’s worth think-
ing through the reasons that we’re not yet done—that
is, the multiple ways in which a page on the web can IN CORE OUT
fall outside of core, in, and out.)
There are three further categories of nodes: (a)
shortcuts from in to out that bypass core (call
these tubes); (b) ways out of in that lead nowhere,
and ways into out that come from nowhere (call TUBES
these tendrils); and (c) nodes that have no paths to
any of the nodes we’ve discussed so far (call these
DISCONNECTED
disconnected). See Figure 11.54.
One of the unexpected facts found by Broder et
Figure 11.54 The remainder of the “bowtie structure” of
al. was the extent to which the web is actually not par-
ticularly well connected. In particular, if we were to the web graph. There were about 44M pages in tendrils
choose web pages p and q uniformly at random from and tubes, and about 17M pages in disconnected.
the web graph, there was only a roughly 24% chance
that a directed path from p to q exists—far lower than the “small world” phenomenon would suggest. (For much
more, see the original bowtie paper [22] or a book on the structure of social and technological networks, like [43].)

ISTUDY
11.3 Paths, Connectivity, and Distances 619

Computer Science Connections

Garbage Collection
In many modern programming languages, including Python and Java, the burden of managing memory—allocating
memory for new objects, deallocating memory for objects that are no longer needed—is lifted from the shoulders of
the programmer. When a new object is needed, the programmer just creates it.
After a program has been running for a while, there
may be objects that were stored in memory but are now 1 L := Node(7, NULL) Suppose that Node(data,next)
2 L := Node(5, L) creates a new node for a singly linked
inaccessible because the programmer has no way to
3 L := Node(3, L) list, with data data and with a pointer
refer to them ever again. This stored but inaccessible 4 L := Node(2, L) next to the next node in the list.
data is called garbage. Figure 11.55 shows an example.
In Python- and Java-like languages, the system pro- L
vides a garbage collector that periodically runs to clean
2 3 5 7
up the garbage, which allows that memory to be reused
for future allocations. (In contrast, in languages like C (a) Some code, and the state of memory after executing it.
or C++, when you as a programmer are done using a
chunk of memory, it’s your responsibility to declare 5 L.next := L.next.next
to the system that you’re done using that memory by
explicitly “deallocating” or “freeing” it.) L
The garbage-collection algorithms employed in real
2 3 5 7
systems are sophisticated, but fundamentally the algo-
rithmic idea is based on finding reachable nodes in a (b) If we continue by executing L.next := L.next.next,
graph. There is a root set of memory locations that are then the node whose data field is 3 is now garbage: there is
reachable—essentially every variable that’s defined in no way to access that memory again, because there is no
any currently active function call on the stack. Fur- way for the programmer to refer to it.
thermore, if a memory location ℓ is pointed to by a Figure 11.55 Garbage being created.
reachable memory location, then ℓ too is reachable.
Two of the simpler algorithms that are sometimes used in garbage collection are based on some corresponding simple
graph-theoretic approaches. Here’s a brief description of these two garbage-collection algorithms:
Reference counting: For each block b of memory, we maintain a reference count of the number of other blocks of
memory (or root set variables) that refer to b. When the garbage collector runs, any block b that has a reference count
equal to 0 is marked as garbage and reclaimed for future use.
Mark-and-sweep: When the garbage collector runs, we iteratively mark each block b that is accessible. Specifically,
for every variable v in the root set, we mark the block to which v refers. Then, for any block b that is marked, we also
mark any block to which b refers. Once the marking process is completed, we sweep through memory, and reclaim
all unmarked blocks.
x
y
In graph-theoretic terms, we view memory as a directed
graph, with an edge from each block b to the block(s) to 1 2 3 4 5 6
which b refers. Reference counting declares as garbage
any node with in-degree 0; mark-and-sweep declares
as garbage any node that is not reached by BFS starting
from the root set. Reference counting is a simpler algo- Figure 11.56 Six blocks of memory, and two root set
rithm, but it has a problem with cyclical structures. If variables x and y. Reference counting would show block #6
two inaccessible blocks of memory refer to each other, with a reference count of zero, and therefore it would be
they both have nonzero reference count, and therefore reclaimed. Mark-and-sweep would mark blocks #1, #4, and
won’t be marked as garbage. An example is shown in #5; thus it would reclaim blocks #2, #3, and #6.
Figure 11.56. There are issues of efficiency with mark-
and-sweep (the entire system has to pause while the garbage collector runs), and so other, more sophisticated
algorithms are generally used in real systems. (You can learn more about garbage collection in any good textbook on
programming languages—or even more in a book devoted exclusively to garbage collection, like [65].)

ISTUDY
620 Graphs and Trees

EXERCISES
11.72 Find a path from D to B in Figure 11.57a.
11.73 Find two different paths from C to H in Figure 11.57a.
11.74 Find a path from C to B in Figure 11.57b.
11.75 Find two different paths from A to H in Figure 11.57b.
11.76 Find a path from D to H in Figure 11.57b that is not simple.
11.77 Find a path from B to C in the graph defined by the adjacency list in Figure 11.57c.
11.78 Find a shortest path from B to F in Figure 11.57d.
11.79 Find a non-shortest path from B to C in the graph defined by the adjacency matrix in Figure 11.57e.
11.80 Find all nodes reachable from A in Figure 11.57d.
11.81 Find all nodes reachable from A in Figure 11.57e.

11.82 Identify all of the connected components in the graph in Figure 11.57a. Is the graph connected?
11.83 Identify all of the strongly connected components in the graph in Figure 11.57b. Is the graph strongly connected?
11.84 Identify all of the connected components in the graph in Figure 11.57c.
11.85 Identify all of the strongly connected components in the graph in Figure 11.57d.
11.86 Decide whether the graph in Figure 11.57e is directed or undirected, and then identify all of its (strongly) connected components.

11.87 Let s and t be any two nodes in an undirected graph, and let k ≥ 0 be any integer. Prove the following: if there’s a path of length k
from s to t, then there’s a path of length k from t to s.
11.88 Let s and t be any two nodes in an undirected graph. Prove that every shortest path between s and t is a simple path.

11.89 The diameter of a graph G is the largest node-to-node distance in the graph. (Although the context is different, this definition of
“diameter” matches the idea from geometry: the diameter of a circle is the distance between the two points in the circle that are
farthest apart. That’s still true for a graph.) In terms of n, what is the smallest diameter that an n-node undirected graph can have?
Prove your answer.
11.90 In terms of n, what is the largest diameter that a connected n-node undirected graph can have? Give an example of a graph where
the diameter is this large. In other words, assuming that G is connected, what’s the largest possible distance between two nodes in
G? (Without the restriction that the graph be connected, the answer would be ∞.)

11.91 Consider an n-node 3-regular undirected graph G. (That is, G = ⟨V, E⟩ has |V| = n, and each node u ∈ V has degree exactly equal
to 3.) In terms of n, what is the largest possible number of connected components in a 3-regular graph?
11.92 In terms of n, what is the smallest possible number of connected components in a 3-regular graph with n nodes?
11.93 Describe a connected 3-regular graph with n nodes with a diameter that’s at least n8 . (See Exercise 11.89.)
11.94 Describe a connected 3-regular graph with n nodes with a diameter that’s at most 8 log n. (See Exercise 11.89.)

11.95 Prove or disprove: let G = ⟨L ∪ R, E⟩ be a bipartite graph with |L| = |R|. Suppose that every node in the graph (that is, all nodes
in L and R) has at least one neighbor. Then the graph is connected.

11.96 Consider an undirected graph G. Recall that a simple path from s to t in G does not go through any node more than once. A
Hamiltonian path from s to t in G is a path from s to t that goes through each node of G precisely once. (Hamiltonian paths are
named after William Rowan Hamilton (1805–1865), an Irish mathematician and physicist.) In general, finding Hamiltonian paths
in a graph is believed to be computationally very difficult. But there are some specific graphs in which it’s not too hard to find one.
Here’s one: find a Hamiltonian path in the Petersen graph (see Figure 11.58a).

(e)
A B C D E F G H
A 0 0 0 1 0 0 0 1
B 0 0 1 0 1 0 1 0
(a) A (b) A (c) (d)
A: B, E, F A: B, C C 0 1 0 1 0 0 0 0
B C B C B: A B: A, C D 1 0 1 0 0 1 0 0
D E D E C: D C: A, B, F E 0 1 0 0 0 0 0 1
D: C, F D: E F 0 0 0 1 0 0 0 1
F G F G E: A E: B, D G 0 1 0 0 0 0 0 0
H H F: A, D F: C H 1 0 0 0 1 1 0 0

Figure 11.57 Several graphs, described by picture, adjacency list, or adjacency matrix.

ISTUDY
Exercises 621

B
14 15 1 15 16 1 14 15 1 15 16 1
G 13 2 14 2 13 2 14 2
A F H C 12 3 13 3 12 3 13 3
12 4 12 4
11 4 11 4
J I 10 5
11 5
10 5
11 5
10 6 10 6
9 6 9 7 9 6 9 7
8 7 8 8 7 8
E D

(a) The Petersen graph. (b) A 15-node cycle. (c) A 16-node cycle. (d) A 15-node path. (e) A 16-node path.

Figure 11.58 Five graphs.

11.97 Let Kn be a complete graph, and let s and t be two distinct nodes in the graph. How many different Hamiltonian paths are there
from s to t? (See Exercise 11.96.)
11.98 Let Kn,m be a complete bipartite graph with n + m nodes, and let s and t be two distinct nodes in the graph. How many different
Hamiltonian paths are there from s to t? (See Exercise 11.96.) (Careful; your answer may depend on s and t.)

The diameter of an undirected graph G = ⟨V, E⟩—the maximum distance between any two nodes s ∈ V and t ∈ V (see Exer-
cise 11.89)—is one measure of how far a graph “sprawls.” But another way of measuring this idea is by looking at the average
distance instead. That is, for a pair of distinct nodes ⟨s, t⟩ chosen uniformly from the set V, what’s the distance from s to t? In other
words, the average distance of a graph G = ⟨V, E⟩ is defined as
P P
s∈V t∈V:t̸=s distance(s, t)
the average distance of G = .
n(n − 1)
(There are n(n − 1) ordered pairs of distinct nodes.) Often the average distance is a bit harder to calculate than the maximum
distance, but in the next few exercises you’ll look at the average distance for a few well-structured graphs.
11.99 Consider an n-node cycle, where n is odd. (We’ll see a formal definition of a cycle in Section 11.4, but for now just look at the
example with n = 15 in Figure 11.58b.) Compute the average distance in this n-node graph. (Hint: every node is positioned
symmetrically, so you can just figure out the average distance from some particular node u.)
11.100 What is the average distance for an n-node cycle where n is even? (See the example with n = 16 in Figure 11.58c.)
11.101 What is the average distance for an n-node path? (See the examples in Figures 11.58d and 11.58e.) (Hint: for any particular integer
k, how many pairs of nodes have distance k? Then simplify the summation.)
11.102 (programming required) Write a program, in a language of your choice, to verify your answers to the last three exercises: build a
graph of the appropriate size and structure, sum all of the node-to-node distances, and compute their average.
11.103 Suppose that G is a disconnected undirected graph with n nodes. In terms of n, what is the largest possible number of edges that G
can contain?
11.104 Suppose that G is a connected undirected graph with n nodes. What is the smallest possible number of edges in G?
11.105 Suppose that G is a strongly connected directed graph with n nodes. What is the smallest number of edges that G can contain?
11.106 Suppose that G is a directed graph with n nodes. If every node of G is in its own strongly connected component (that is, there are
n different SCCs, one per node), what is the largest number of edges that G can contain?
11.107 A metric on a set V is a function d : V × V → R≥0 that obeys the following conditions (see Exercise 4.6):
̸ 0 whenever u ̸= v.
• Reflexivity: for any u ∈ V and v ∈ V, we have d(u, u) = 0 and d(u, v) =
• Symmetry: for any u ∈ V and v ∈ V, we have d(u, v) = d(v, u).
• Triangle inequality: for any u ∈ V and v ∈ V and z ∈ V, we have d(u, v) ≤ d(u, z) + d(z, v).
Let dG (u, v) denote the distance (shortest path length) between nodes u ∈ V and v ∈ V for a graph G = ⟨V, E⟩. Prove that dG is a
metric if G is any connected undirected graph.
11.108 Prove that dG is not necessarily a metric for a directed graph G, even if G is strongly connected. (See Exercise 11.107.)
11.109 Definition 11.25 defined a strong connected component in a graph G = ⟨V, E⟩ as a set C ⊆ V such that: (i) any two nodes s ∈ C
and t ∈ C are strongly connected; and (ii) for any node x ∈ V − C, adding x to C would make (i) false. Suppose that we’d instead
defined clause (i) as for any two nodes s ∈ C and t ∈ C, the node t is reachable from node s. (But we don’t require that s be
reachable from t.) This alternate definition is equivalent to the original. Why?
11.110 Prove that the strongly connected components (SCCs) of a directed graph partition the nodes of the graph: that is, prove that the
relation R(u, v) denoting mutual reachability (u is reachable from v, and v is reachable from u) is an equivalence relation (reflexive,
symmetric, and transitive). (See Definition 8.13.)

ISTUDY
622 Graphs and Trees

(a) G (b) (continued)


0: 3, 7 7: 0, 4, 6, 8
A F
1: 9, 2, 5 8: 11, 12
B E 2: 1, 10, 9 9: 1
C 3: 0, 7, 1 10: 2, 4
D 4: 10, 7 11: 6, 8
5: 1 12: 8
6: 7, 11

Figure 11.59 Two graphs.

11.111 Identify the strongly connected components in the graph in Figure 11.59a.
11.112 Identify the strongly connected components in the graph represented by the adjacency list in Figure 11.59b.

11.113 Suppose that we run breadth-first search from node A in the graph in Figure 11.59a. What is the last node that BFS discovers? (If
there’s a tie, then list all the tied nodes.)
11.114 What if we run BFS from node B in the graph in Figure 11.59a?
11.115 What if we run BFS from node 0 in the graph in Figure 11.59b?
11.116 What if we run BFS from node 12 in the graph in Figure 11.59b?

11.117 Breadth-first search as described in Figure 11.46b finds all nodes reachable from a given source node in a given graph, and, in
fact, it discovers nodes in increasing order of their distance from s. But we didn’t actually record distances during the computation.
Modify the pseudocode for BFS to compute distances instead of just whether a path exists, by annotating every node added to
Frontier with its distance from the source node s.
11.118 Argue that in your modified version of BFS, there are never more than two different distances stored in the Frontier.
11.119 Argue that the claim from Exercise 11.118 may be false for depth-first search.

11.120 Consider a graph G represented by an adjacency matrix M. What does the ⟨i, j⟩th entry of MM (the matrix that results from squaring
the matrix M) represent?

11.121 (programming required) A word ladder is a sequence ⟨w1 , w2 , . . . , wk ⟩ of words, where each wi is a word in English, and wi+1 is
one letter different from wi . For example, a word ladder from FROWN to SMILE for my dictionary is

FROWN → FLOWN → FLOWS → SLOWS → SLOTS → SLITS → SKITS → SKITE → SMITE → SMILE.

(SKITE is an obscure word of Scottish origin, meaning “an oblique blow.”) Write a program that uses a BFS-like algorithm to find
a shortest word ladder between two given words w1 and w2 of the same length. (You can find a dictionary of English words on
the web, or /usr/share/dict/words on Unix-based operating systems. You’ll want to cull your dictionary to only words of the
right length before you start.) There are faster solutions that involve searching “in both directions” out from w1 and into w2 until
you find a match, but BFS from w1 will work.

ISTUDY
11.4 Trees 623

11.4 Trees
I think that I shall never see
A poem lovely as a tree.
Joyce Kilmer (1886–1918)
“Trees,” Trees and Other Poems (1914)

Informally, a tree is a graph that grows from a root, branching outward and eventually leading to the
leaves. (We computer scientists are always upside down compared to botanists: unlike an oak or maple or
tamarack, the root of a tree in CS is at the top, and it grows downward toward the leaves.) See Figure 11.60
for a small example.
Trees arise very frequently in computer science: to name just a few examples, they’re the class hierarchies
of object-oriented programming, the binary search trees of data structures (see p. 637), the game trees
describing the progression of Tic-Tac-Toe or chess (p. 121), the parse trees that describe formal or natural
languages (p. 259), the recursion trees that describe the execution of recursive algorithms (Section 6.4).
Trees are also frequently used in computational models of important phenomena from outside of CS: for
example, in reconstructing evolutionary phylogenies (in computational biology), or in reconstructing the
paths by which rumors and misinformation spread from the originator (in social network analysis). In this
section, we’ll introduce trees formally—including definitions, properties, algorithms, and applications—as
a special type of graph.

11.4.1 Cycles
Before we can define trees properly, we must first define another notion about graphs in general—a cycle,
which is way to get from a node back to itself:

Definition 11.30: Cycle.


A cycle ⟨u1 , u2 , . . . , uk , u1 ⟩ is a path of length ≥ 2 from a node u1 back to node u1 that does not traverse
the same edge twice. Just as for any other path, the length of the cycle ⟨u1 , u2 , . . . , uk , u1 ⟩ is the number
of edges it traverses—that is, k.

Figures 11.61a and 11.61b show examples of an undirected and directed graph with a cycle ⟨A, B, C, A⟩.
Note that the edges ⟨s, t⟩ and ⟨t, s⟩ in a directed graph are different; in an undirected graph, the edges {s, t}
and {t, s} are the same. Thus a cycle in a directed graph can use both ⟨s, t⟩ and ⟨t, s⟩, but a cycle in an
undirected graph cannot use both ⟨s, t⟩ and ⟨t, s⟩. In the directed graph in Figure 11.61b, the path ⟨C, E, C⟩
is a cycle, but ⟨C, E, C⟩ is not a cycle in the undirected graph in Figure 11.61a because it reuses an edge.

root

leaves

Figure 11.60 A small tree.

ISTUDY
624 Graphs and Trees

(a) (b) (c) (d) (e)


B D B D B B B
A C A C
E E A C A C A C
An undirected graph (a) and a directed graph (b) that Undirected (c) and directed (d) graphs containing only one
contain the cycle ⟨A, B, C, A⟩. cycle, and a directed graph (e) containing two.

Figure 11.61 A few graphs with cycles.

Technically, the definition of a cycle in Definition 11.30 says that the undirected graph in Figure 11.61a
has six different cycles: ⟨A, B, C, A⟩, ⟨C, A, B, C⟩, and ⟨B, C, A, B⟩ (going clockwise); and ⟨A, C, B, A⟩,
⟨C, B, A, C⟩, and ⟨B, A, C, B⟩ (going counterclockwise). However, we will adopt the convention that there
is one and only one cycle in this graph. Because we can “start anywhere” in a cycle, we consider a cycle
to be defined only by the relative ordering of the nodes involved, regardless of where we start. In an undi-
rected graph, we can “go either direction” (clockwise or counterclockwise), so we also ignore the direction
of travel in distinguishing cycles. In a directed graph, the direction of travel does matter; there are graphs
in which we are able to go in one direction around a cycle without being able to go in the other. In other
words, we say that Figures 11.61c and 11.61d have one cycle each, while Figure 11.61e has two.
A cycle is by definition forbidden from traversing the same edge twice. A simple cycle also does not
visit any node more than once:

Definition 11.31: Simple cycle.


A cycle ⟨u1 , u2 , . . . , uk , u1 ⟩ is simple if each ui is distinct—that is, no nodes in the cycle are duplicated
aside from the last node (which equals the first node).

(We’ve now used the word “simple” in three different contexts: simple graphs have no parallel edges or self-
loops, and simple paths and cycles have no repeated vertices. Intuitively, all three definitions correspond
to an entity that’s not unnecessarily complicated.) Here are a few examples:

Example 11.30: Simple and non-simple ways of getting around.


Figure 11.62a shows an undirected graph. In it, ⟨D, B, A, C, E, A, D⟩ is a non-simple cycle. This graph also
has two simple cycles: ⟨D, B, A, D⟩ and ⟨C, E, A, C⟩.

(a) (b) (c)


B D K
B F
A C
A H I J M
E D
C G
E L

Figure 11.62 A few graphs that contain cycles.

ISTUDY
11.4 Trees 625

Example 11.31: Finding cycles.


Identify all simple cycles in the graphs in Figures 11.62b and 11.62c.
Solution. A nice way to identify cycles systematically is to look for cycles of all possible lengths: 2-node
cycles, 3-node cycles, etc. (Actually 2-node cycles are possible only in directed graphs. Exercise: why?)
Ordered by length, the simple cycles in the graph in Figure 11.62b are:
⟨B, E, C, B⟩, ⟨B, D, F, C, B⟩, ⟨C, F, G, E, C⟩, ⟨B, D, F, G, E, B⟩, and ⟨B, D, F, G, E, C, B⟩.
The simple cycles in the graph in Figure 11.62c are:
⟨I, J, I⟩, ⟨J, L, J⟩, ⟨J, M, L, J⟩, and ⟨J, K, M, L, J⟩.
Note that (to name one of several examples) the sequence ⟨I, J, L, J, I⟩ is also a cycle in Figure 11.62c—
it traverses four distinct directed edges and goes from node I to I—but this cycle is not simple, because
node J is repeated.

We can use a modification of breadth-first search to identify cycles algorithmically. Specifically, suppose
that we wish to find out whether a node u is involved in a cycle in a directed graph. We run BFS starting at
node u, and if we ever encounter a node v that has u as a neighbor, then we have found a cycle involving
node u. (An extra modification is necessary for undirected graphs; see Exercise 11.129.)
Taking it further: Kidneys are the most frequently transplanted organ today, in part because—unlike for other organs—humans
generally have a “spare”: we’re born with two kidneys, but only need one functioning kidney to live a healthy life. Thus patients
suffering from kidney failure may be able to get a transplant from friends or family members who are willing to donate one of
their kidneys. But this potential transplant relies on the donor and the patient being compatible in dimensions like blood type
and the physical size of the organs. Recently a computational solution to the problem of incompatibility has emerged, using
algorithms based on finding (short) cycles in a particular graph: there is now national exchange for matching up two (or a few)
patients with willing-but-incompatible donors, and doing a multiway transplant. See p. 636.

Acyclic graphs
While cycles are important on their own, their relevance for trees is actually when they don’t exist:

Definition 11.32: Acyclic graphs.


A graph is acyclic if it contains no cycles.

Let’s prove a useful structural fact about acyclic graphs. (Recall that we are considering finite graphs, where
the set of nodes in the graph is finite. The following claim would be false if graphs could have an infinite
number of nodes!)

Lemma 11.33: Every acyclic graph has a node with degree 0 or 1.


If G = ⟨V, E⟩ is an acyclic undirected graph, then there exists a node in V whose degree is zero or one.

Proof. We’ll give a constructive proof of the claim—specifically, we’ll give an algorithm that finds a node
with the stated property:

1 let u0 be an arbitrary node in the graph, and let i := 0


2 while the current node ui has no unvisited neighbors:
3 let ui+1 be a neighbor of ui that has not previously been visited
4 increment i

ISTUDY
626 Graphs and Trees

Observe that this process must terminate in at most |V| iterations, because we must visit a new node in
each step. Suppose that this algorithm goes through k iterations of the while loop, and let t be the last node
visited by the algorithm. (So t = uk .)
If k = 0, then t = u0 has degree zero, so the claim follows immediately. Otherwise k ≥ 1; in this case,
we’ll argue that t has degree one. Because the algorithm terminated, there cannot be an edge between t
and any unvisited node. Furthermore, if there were an edge from t to any previously visited node uj for
j < k − 1, then there would be a cycle in the graph, namely ⟨uj , uj+1 , . . . , uk−1 , uk , uj ⟩. Therefore t’s only
neighbor is uk−1 , and the degree of t is one.

For directed graphs, the claim analogous to Lemma 11.33 is every directed acyclic graph contains a
node with out-degree zero. (You’ll prove it in Exercise 11.130.)

Taking it further: A directed acyclic graph (often just called a DAG) is, perhaps obviously, a directed graph that contains no
cycles. A DAG G corresponds to a (strict) partial order (see Chapter 8); a cycle in G corresponds to a violation of transitivity. In
fact, we can think of any directed graph G = ⟨V, E⟩ as a relation—specifically, the edge set E is a subset of V × V. Like transitivity
and acyclicity, many of the concepts that we explored in Chapter 8 have analogues in the world of graphs.

11.4.2 Trees
With the definition of cycles in hand, we can now define trees themselves:

Definition 11.34: Tree.


A tree is an undirected graph that is connected and acyclic.

We will also sometimes talk about graphs that satisfy only the latter requirement: a forest is an undirected
graph that is acyclic (but not necessarily connected). Every connected component of a forest is a tree, and
note that a tree is itself a forest. (An irrelevant note about Chinese: the character for tree is ; the character
for forest is —a disconnected collection of trees!)

Example 11.32: A small collection of trees.


Several examples of trees are shown in Figure 11.63: all six graphs have a single connected component
and contain no cycles. Therefore all six are trees.

We’ll prove several structural facts about trees in this section, beginning with one concerning the number
of edges in a tree. To start, let’s look at the trees in Figure 11.63. The figure shows the number of nodes
and edges in each of these trees. In each, the number of nodes is one more than the number of edges, and
that’s no coincidence; here’s the statement and proof of the general fact:

(a) (b) (c) (d) (e) (f)

4 nodes 11 nodes 4 nodes 5 nodes 1 node 7 nodes


3 edges 10 edges 3 edges 4 edges 0 edges 6 edges

Figure 11.63 Six sample trees, each annotated with the number of nodes and edges that it contains.

ISTUDY
11.4 Trees 627

Theorem 11.35: Number of edges in a tree.


Let T = ⟨V, E⟩ be a tree. Then |E| = |V| − 1.

Proof. Let P(n) denote the property that any n-node tree has precisely n − 1 edges. We will prove that
P(n) holds for all n ≥ 1 by induction on n.
Base case (n = 1). We must prove P(1): any 1-node tree has 1 − 1 = 0 edges. But the only (simple) 1-node
graph is the one shown in Figure 11.63e, which has zero edges, and so we’re done immediately.
Inductive case (n ≥ 2). We assume the inductive hypothesis P(n − 1)—that is, every (n − 1)-node tree has
n − 2 edges. We must prove P(n).
Consider an arbitrary tree T = ⟨V, E⟩ with |V| = n. By definition, T is acyclic and connected. By
Lemma 11.33, then, there exists a node u ∈ V with degree 0 or 1 in T. Furthermore, because T is connected,
the degree of u cannot be 0. Thus u is a node with degree(u) = 1. Let v ∈ V be the unique neighbor of u in
T. Let T′ be T with node u and the edge {u, v} between u and v deleted. (See Figure 11.64a.) We claim that
the graph T′ = ⟨V − {u} , E − {{u, v}}⟩ is a tree, too. The acyclicity and connectivity of T′ both follow
from the fact that T was acyclic and connected, and the fact that the eliminated node u was of degree 1.
The tree T′ contains n − 1 nodes, and thus, by the inductive hypothesis P(n − 1), contains n − 2 edges.
Therefore T, whose edges are precisely the edges of T′ plus the eliminated edge {u, v}, contains precisely
(n − 2) + 1 = n − 1 edges.

An immediate consequence of Theorem 11.35 is that every tree is teetering on the edge of being
disconnected and of having a cycle (see Figure 11.64):

Corollary 11.36. Let T = ⟨V, E⟩ be any tree. Then (1) adding any edge e ∈
/ E to T creates a cycle; and
(2) removing any edge e ∈ E from T disconnects the graph.

Proof. For (1), define the graph G = ⟨V, E ∪ {e}⟩ as the result of adding the new edge e to the tree T.
Because adding an edge to a graph can never disrupt connectivity and T was already connected, we know
that G must be connected too. Thus if G were acyclic, then G would be a tree. But G has one more edge
than T—specifically, G has (|V| − 1) + 1 = |V| edges—and therefore isn’t a tree by Theorem 11.35.
The proof for (2) is similar: let G′ be T with e removed. Removing an edge cannot create a cycle, so G′
is acyclic. But G′ has too few edges to be a tree by Theorem 11.35, so G′ must be disconnected.

T′

u
v
7

(a) A tree T, with a node u of (b) Imagine adding (c) Adding an edge (d) Removing an
degree = 1 and its neighbor v. the dashed edge, or creates a cycle. edge disconnects
The tree T′ is T without the node removing the edge the graph.
u and the edge {u, v}. marked with 7.

Figure 11.64 Illustrations of (a) the proof of Theorem 11.35, and of (b–d) the statement of Corollary 11.36.

ISTUDY
628 Graphs and Trees

(Here’s an alternative proof of the first statement in Corollary 11.36. Let ⟨u, v⟩ be an edge not in the tree
T. Because T is connected, there is already a (simple) path P from u to v in T. If we add ⟨u, v⟩ to T, then
there is a cycle: follow P from u to v and then follow the new edge from v back to u. Therefore G contains
a cycle.)

Rooted trees
We often designate a particular node of a tree T as the root, which is traditionally drawn as the topmost node.
(Note that we could designate any node as the root and—just like that mobile of zoo animals from your
crib from infancy—“hang” the tree by that node.) We will adopt the standard convention that, whenever
we draw trees, the vertically highest node is the root.
There’s a lot of terminology about rooted trees in computer science that’s borrowed from the world of
family trees:

• For a node u in a tree with root r ̸= u, the parent of u is the unique neighbor of u that is closer to r
than u is. (The root is the only node that has no parent.)
• A node v is one of the children of a node u if v’s parent is u.
• A node v is a sibling of a node u ̸= v if v and u have the same parent.

A node with zero children is called a leaf. A node with one or more children is called an internal node.
(Note that the root is an internal node unless the tree is just a one-node graph.)

Example 11.33: Some sample trees.


See Figure 11.65 for an illustration of all of these definitions. Note that Figure 11.65 is correct only when
the root is the topmost node in the image; with a different root, all of the panels could change.
Here’s a concrete example. Consider the two trees in Figure 11.66a and 11.66b, which differ only based
on which node is designated as the root. A few properties of each tree are shown, too: the root, leaves,
and internal nodes, along with the parent and children of nodes A and B. While the leaves and internal
nodes are identical in these two trees, if we’d rerooted the tree at any of the erstwhile leaves instead, the
new root would become an internal node instead of a leaf. For example, if we reroot this tree at H, then
the leaves would be {D, F, J, K, L, M} and the internal nodes would be {A, B, C, E, G, H, I}.

Subtrees, descendants, and ancestors


Let T be a rooted tree, and let u be any node in T. The subtree rooted at u consists of u and all those nodes
and edges “below” u in T. (In other words, a node v is in the subtree rooted at u if and only if v is no closer
to the root of T than u is; the subtree is the induced subgraph of these nodes.) Such a node v in the subtree
rooted at u is called a descendant of u if v ̸= u. The node u is called an ancestor of v.

Example 11.34: Descendants, ancestors, and subtrees.


See Figures 11.67a–11.67c for illustrations of these definitions. For another example, the trees from
Figures 11.66a and 11.66b are reproduced in Figure 11.68, with a few properties marked (and the subtrees
rooted at a particular node highlighted).

ISTUDY
11.4 Trees 629

A A A A A A

B C B C B C B C B C B C

D E D E D E D E D E D E

F G H F G H F G H F G H F G H F G H

I I I I I I

(a) Root (b) Leaves (c) Internal nodes (d) E ’s parent (e) E ’s children (f) E ’s sibling(s)

Figure 11.65 The root, leaves, and internal nodes of the tree; the parent, children, and siblings of one node.

(a) (b)
A E

B C H I B

D E F G K L M A D F

H I J C

K L M G

Root: A Root: E
Leaves: {D, F, H, J, K, L, M} Leaves: {D, F, H, J, K, L, M}
Internal nodes: {A, B, C, E, G, I} Internal nodes: {A, B, C, E, G, I}
Parent of B: A Parent of B: E
Children of B: {D, E, F} Children of B: {A, D, F}
Parent of A: none Parent of A: B
Children of A: {B, C} Children of A: {C}

Figure 11.66 Two rooted trees. (The second tree is just the first, rerooted to make E the new root.)

A A A 0
B C B C B 1 C 1
D E D E E D 2 E 2
F G H F G H F G H F 3 G 3 H 3
I I I I 4
(a) Ancestors of E . (b) Descendants of E . (c) Subtree rooted at E . (d) Every node’s depth.

Figure 11.67 Ancestors, descendants, subtrees, and the depth of nodes in a rooted tree.

ISTUDY
630 Graphs and Trees

(a) (b)
A E

B C H I B

D E F G K L M A D F

H I J C

K L M G

Descendants of B: {D, E, F, H, I, K, L, M} Descendants of B: {A, C, D, F, G, J}


Ancestors of B: {A} Ancestors of B: {E}
Descendants of H: none Descendants of H: none
Ancestors of H: {A, B, E} Ancestors of H: {E}

Figure 11.68 Two rooted trees (from Figure 11.66), with the subtrees rooted at B highlighted in both.

We have one final pair of definitions to (at last!) conclude our parade of terminology about rooted trees,
related to how “tall” a tree is. Consider a rooted tree T with root node r. The depth of a node u is the distance
from u to r. The height of a tree is the maximum, over all nodes u in the tree, of the depth of node u.

Example 11.35: Depth of nodes in a tree.


Every node in the tree in Figure 11.67d is labeled by its depth: the root has depth 0, its children have
depth 1, their children (the “grandchildren” of the root) have depth 2, and so forth. The height of the tree
is the largest depth of any of its nodes—in this case, the height is 4.

Taking it further: If you happen to be a lover of recursion, you may find it interesting to think about alternative recursive ways
to give the definitions about rooted trees from this section. (See Section 5.4 for some similar ideas.) Ancestors and descendants
in a rooted tree are a good example. We could say that a node v is an ancestor of u if (i) v is the parent of u; or (ii) v is the parent
of any ancestor of u. Or a node v is a descendant of u if (i) v is a child of u; or (ii) v is a child of any descendant of u. We can also
think of the depth of a node, or the height of a tree, recursively. The depth of the root is zero; the depth of a node with a parent p
is 1 + (the depth of p). Or, for height: (i) the height of a one-node tree T is zero; and (ii) the height of a tree T with root r with
children {c1 , c2 , . . . , ck } is 1 + maxi (the height of the subtree rooted at ci ). These alternative versions of the definitions can be
translated—almost word for word—into recursive code to compute these tree-related quantities.

Binary trees
We’ll often encounter a special type of tree in which nodes have a limited number of children:

Definition 11.37: Binary trees and k-ary trees.


A binary tree is a rooted tree in which each node has 0, 1, or 2 children. More generally, if every node
in a rooted tree T has k or fewer children, then T is called a k-ary tree. (In other words, a binary tree is
2-ary.)

Example 11.36: Some binary trees.


Of the trees in Figure 11.69, only the tree in Figure 11.69d is not a binary tree, because its root has four
children. (This tree is a 4-ary tree.) But the other five trees are all binary: in each, every internal node has
either 1 child or 2 children.

ISTUDY
11.4 Trees 631

(a) (b) (c) (d) (e) (f)

Figure 11.69 The trees from Figure 11.63, repeated.

In a binary tree, the possible children of a node are called its left child and right child. (Even for a node u
in a binary tree that has only one child, we’ll insist that the lone child be designated as either the left child
of u or the right child of u.) For a node u, we say that u’s left subtree is the subtree rooted at u’s left child;
the right subtree is analogous. (See Section 5.4.1.)

11.4.3 Tree Traversal


We will sometimes want to do something with all of the nodes contained in a tree T: print them all out,
make a list of them, count the number whose names are anagrams of swear words, whatever. There are three
standard algorithms that are used for this purpose, called pre-order, in-order, and post-order traversal.
While these algorithms can be generalized to non-binary trees, they’re easier to understand for binary trees
(and they’re most frequently deployed for binary trees), so we’ll consider them that way.
All three algorithms are recursive, and all three algorithms execute precisely the same steps—just in a
different order. On an empty tree T, we do nothing; on a non-empty tree T, all three algorithms perform the
following steps:

• We “visit” the root of T. (You can think of “visiting” the root as printing out the contents of the root
node, or as adding it to the end of an accumulating list of the nodes that we’ve encountered in the tree.)
• We recursively traverse the left subtree of T, finding all nodes there.
• We recursively traverse the right subtree of T, finding all nodes there.

But the three traversal algorithms execute the three steps in different orders, either visiting the root
before both recursive calls (“pre-order”); between the recursive calls (“in-order”); or after both recur-
sive calls (“post-order”). We always recurse on the left subtree before we recurse on the right subtree. See
Figure 11.70. Let’s take a look at traversing a small tree using these algorithms:

pre-order-traverse(T): in-order-traverse(T): post-order-traverse(T):


1 if T is empty then 1 if T is empty then 1 if T is empty then
2 do nothing. 2 do nothing. 2 do nothing.
3 else 3 else 3 else
4 visit the root of T 4 in-order-traverse(Tleft ) 4 post-order-traverse(Tleft )
5 pre-order-traverse(Tleft ) 5 visit the root of T 5 post-order-traverse(Tright )
6 pre-order-traverse(Tright ) 6 in-order-traverse(Tright ) 6 visit the root of T

Figure 11.70 Three different algorithms to traverse a binary tree. Write Tleft to denote T’s left subtree, and Tright to
denote its right subtree.

ISTUDY
632 Graphs and Trees

(a) A (b)

B C pre-order traversal A, B, D, C, E, F
in-order traversal D, B, A, E, C, F
D E F post-order traversal D, B, E, F, C, A

Figure 11.71 A tree to be traversed, and its three traversals.

Example 11.37: Traversing a small tree: pre-order traversal.


Let’s determine the order of nodes’ visits by a pre-order traversal of the tree in Figure 11.71. In a pre-
order traversal, we first visit the root, then pre-order-traverse the left subtree, then pre-order-traverse the
B
right subtree. In other words, for this tree, we first visit the root A, then pre-order-traverse D , then
E C F
pre-order-traverse :
Step #1: visit the root. We visit the root A.
B
Step #2: pre-order-traverse the left subtree. To pre-order-traverse D , we first visit the root B, then
pre-order-traverse the left subtree D , then pre-order-traverse the (empty) right-subtree. In order, these
steps visit B and D.
C F , we first visit C, then
Step #3: pre-order-traverse the right subtree. To pre-order-traverse E
pre-order-traverse the left subtree E , and then pre-order-traverse the right subtree F . Pre-order-traversing
E just results in visiting E, and pre-order-traversing F just visits F. In order, these steps visit C, E, and F.

Putting this all together, the pre-order traversal of the tree visits the nodes in this order:
A , B, D , C, E, F .
step #1 step #2 step #3

Here are examples of the other two traversal algorithms, on the same tree:

Example 11.38: Traversing a small tree: in-order traversal.


In what order are the nodes visited by an in-order traversal of the tree in Figure 11.71?
B C
Solution. We first traverse D , then visit A, then traverse E F.
B
Traversing D visits D and B: first the left subtree, then the root.
E C F visits E, then C, then F.
Traversing
Thus an in-order traversal visits the nodes in the order D, B, A, E, C, F.

Example 11.39: Traversing a small tree: post-order traversal.


In what order are the nodes visited by an post-order traversal of the tree in Figure 11.71?
Solution. For a post-order traversal, the root of each subtree is the last node traversed in that subtree: we
B C F , then visit A.
first traverse D , then traverse E
B
Traversing D visits D and B: first the left subtree, then the nonexistent right subtree, then the
root.
C F visits E, then F, then C.
Traversing E
Thus a post-order traversal visits the tree’s nodes in the order D, B, E, F, C, A.

Here’s another example, of using traversals to reconstruct a binary tree:

ISTUDY
11.4 Trees 633

(a) (b)
9 9
2 7
2
4
{3, 4, 5, 7}
5 3
Figure 11.72 Reconstructing a tree from its pre-order, in-order, and post-order traversals.

Example 11.40: Trees from traversals.


Here is the output of the pre-, in-, and post-order traversals of a binary tree T:
9, 2, 7, 4, 5, 3 , 2, 9, 5, 4, 3, 7 , and 2, 5, 3, 4, 7, 9 .
pre-order traversal in-order traversal post-order traversal

What’s T?
Solution. We’ll reassemble T from the root down. The root is first in the pre-order traversal (and last in
the post-order), so 9 is the root. The root separates the left subtree from the right subtree in the in-order
traversal; thus the left subtree contains just 2 and the right contains {3, 4, 5, 7}. So the tree has the form
shown in Figure 11.72a.
The post-order 5, 3, 4, 7 and in-order 5, 4, 3, 7 show that 7 is the root of the unknown portion of the
tree and that 7’s right subtree is empty. The last three nodes are pre-ordered 4, 5, 3; in-ordered 5, 4, 3;
and post-ordered 5, 3, 4. In sum, that says that 4 is the root, 5 is the left subtree, and 3 is the right subtree.
Assembling these pieces yields the final tree, shown in Figure 11.72b.

Taking it further: One particularly important type of binary tree is the binary search tree (BST), a widely used data structure—
one that’s probably very familiar if you’ve taken a course on data structures. A BST is a binary tree in which each node has some
associated “key” (a piece of data), and the nodes of the tree are stored in a particular sorted order: all nodes in the left subtree
have a key smaller than the root, and all nodes in the right subtree have a key larger than the root. Thus an in-order traversal of
a binary search tree yields the tree’s keys in sorted order. (See p. 637.) An even more specific form of binary search tree, called
a balanced binary search tree, adds an additional structural property related to the depth of nodes in the tree. See p. 309 for a
discussion of one scheme for balanced binary search trees, called AVL trees.

11.4.4 Spanning Trees


Let G = ⟨V, E⟩ be an undirected graph. For example, imagine that each node in V represents a dorm
room on your campus, and each edge in E denotes a possible fiber optic cable that can be laid to build
an ethernet connection throughout the residence halls. A reasonable goal is to actually place only some
of those possible cables, a subset E′ ⊆ E, while ensuring that network traffic can be sent between any
two dorm rooms—that is, ensuring that the resulting network is connected. In other words, one seeks a
spanning tree of the graph G:

Definition 11.38: Spanning tree.


Let G = ⟨V, E⟩ be a connected undirected graph. A spanning tree of G is a tree T = ⟨V, E′ ⟩ with the
same nodes as G and with edges E′ ⊆ E that are a subset of G’s edges.

A spanning tree of G is called “spanning” because it connects (that is, spans) all nodes in G.

Example 11.41: A complete collection of spanning trees.


Figure 11.73 shows a small example: a small graph, and all of the different spanning trees for it.

ISTUDY
634 Graphs and Trees

(a) (b) B B B B
B
A F A F A F A F
A F
C C C C
E E E E
C B B B B
E A F A F A F A F

C C C C
E E E E

Figure 11.73 (a) An undirected graph, and (b) all eight spanning trees for it.

A graph G has a spanning tree if and only if G is connected: we can be sure to only remove “redundant”
edges that aren’t required for connectivity, and removing edges from G can never cause a disconnected
graph to become connected. (For disconnected graphs, people sometimes talk about a spanning forest: a
forest F = ⟨V, E′ ⟩ with E′ ⊆ E, where the connected components of the original graph G and the connected
components of the forest F are identical.)
Although we didn’t talk about it this way when we introduced breadth- and depth-first search (see Fig-
ures 11.46b and 11.49b), these algorithms can find spanning trees, with a small change: as we explore the
graph, we include in E′ every edge ⟨u, v⟩ that leads from a previously known node u to a newly discovered
node v.
We’ll also see some other ways to find spanning trees in Section 11.5.2, but here’s another, conceptually
simpler technique. To find a spanning tree in a connected graph G, we repeatedly find an edge that can be
deleted without disconnecting G—that is, an edge that’s in a cycle—and delete it. See Figure 11.74a for
the algorithm. Here’s an example:

Example 11.42: Finding a spanning tree via cycle elimination.


Figure 11.74b shows the Cycle Elimination algorithm computing a spanning tree of a particular graph.
We repeatedly do this: choose some cycle (it doesn’t matter which cycle), and then remove a edge from
that cycle (it doesn’t matter which edge). For this particular graph, after three iterations, the resulting
graph has no cycles, and remains connected; the resulting graph is a spanning tree of the original graph.

We’ll finish the section by proving that the Cycle Elimination algorithm correctly finds the spanning
tree of an arbitrary connected graph:

step #1 step #2 step #3


A A A A A

C B C B C B C B C B

Cycle Elimination algorithm: D D D D D


Input: a connected graph G = ⟨V, E⟩ E E E E E
Output: a spanning tree of G
G F G F G F G F G F
1 while there exists a cycle C in G:
2 let e be an arbitrary edge traversed by C H H H H H
3 remove e from E
4 return the resulting graph ⟨V, E⟩. (b) An example. In each iteration, we’ve selected an arbitrary cycle
(lightly shaded) and then selected an arbitrary edge from that cycle
(a) An algorithm to find a spanning tree. (heavily shaded) and removed it.

Figure 11.74 The Cycle Elimination algorithm, and an example of its execution.

ISTUDY
11.4 Trees 635

Theorem 11.39: Correctness of the Cycle Elimination algorithm.


Given any connected graph G = ⟨V, E⟩, the Cycle Elimination algorithm returns a spanning tree T of G.

Proof. The algorithm only deletes edges from G, so certainly T = ⟨V, E′ ⟩ satisfies E′ ⊆ E. We need to
prove that T is a tree: that is, T is acyclic and T is connected.
T is acyclic. As long as there’s a cycle remaining, the algorithm stays in the while loop. Thus we only exit
the loop when the remaining graph is acyclic. (And the loop terminates in at most |E| iterations, because
an edge is deleted in every iteration.)
T is connected. We claim that the graph is connected throughout the algorithm. It’s true at the beginning of
the algorithm, by assumption. Now consider an iteration in which we delete the edge {u, v} from a cycle
C. Let s and t be arbitrary nodes; we will argue that there is still a path from s to t. Before we deleted
{u, v}, there was a path P from s to t. If P didn’t traverse the edge {u, v}, then P is still a path from s to t.
Otherwise, we can still get from s to t by going “the long way around” the cycle C instead of following the
single edge {u, v}. (See Figure 11.75.) Thus there is still a path from any node s to any node t, and so the
graph stays connected.

(a) s t (b) s t
u v u v

cycle C cycle C

Figure 11.75 Maintaining connectivity in the Cycle Elimination algorithm. There are two ways from s to t: (a) the
short way, via {u, v}; and (b) the long way, all the way around C.

ISTUDY
636 Graphs and Trees

Computer Science Connections

Directed Graphs, Cycles, and Kidney Transplants


Kidneys are essential to human life; they play an essential filtering role in the body without which we would all die.
Although we are born with two kidneys, humans need only one functioning kidney to live healthy lives. Because
we’re all naturally equipped with a “spare,” kidney transplants are the most common form of transplant surgery
performed today; thousands of lives are saved annually through kidney transplants.
Typically a patient in need of a kidney finds a friend or relative who is willing to donate one of theirs. If the patient
and donor are compatible—for example, blood type and physical size of the donor’s kidney must be appropriate—
then medical teams perform two simultaneous operations: one to remove the “spare” kidney from the donor, and one
to implant it in the patient. (Some patients instead receive kidneys from strangers who chose to donate their organs in
case of an untimely death, or from so-called “angel donors” who decide to donate one of their two working kidneys
while they’re alive.) Unfortunately, many patients who need kidneys have a friend or relative willing to donate to
them—but they are incompatible with their prospective donor’s kidney. These patients may spend years on a waiting
list for a transplant, undergoing painful, expensive, and only partially effective dialysis while they wait and hope.
In recent years, medical personnel have begun a pro-
A directed edge goes from There’s a directed edge from a
gram of kidney exchanges. Suppose that a patient p1 every patient to her donor to each patient who can
is incompatible with her prospective donor d1 , another corresponding donor. receive a kidney from that donor.
patient p2 is incompatible with his prospective donor
patient #1 donor #1
d2 , but pairs ⟨p1 , d2 ⟩ and ⟨p2 , d1 ⟩ are both compati-
ble with each other. Four teams of doctors can then
patient #2 donor #2
do a “paired exchange” with four surgeries, in which
d1 donates to p2 and d2 donates to p1 . (To ensure
patient #3 donor #3
that everybody follows through, the two pairs of surg-
eries must be done simultaneously: if d1 donates to p2
patient #4 donor #4
before d2 undergoes surgery, then d2 has no incentive
to go through the surgery, as d2 ’s friend p2 has already
received his kidney.) We can even consider larger patient #5 donor #5
exchanges (three or more simultaneous donations)— (a) The graph of compatibilities.
though as the number of surgeries increases, the logis-
tical difficulty increases as well. patient #1 donor #1
Deciding which transplants to complete is done
using an algorithm based on the graph of donor–patient patient #2 donor #2
compatabilities. Each patient pi comes to the system
with a donor di who is willing to donate to pi . Define patient #3 donor #3
a directed graph G as follows. There is a node for each
patient pi and a node for each donor di . Add a directed patient #4 donor #4
edge ⟨pi , di ⟩ for every i. Also add a directed edge
⟨di , pj ⟩ if donor dj is compatible with patient pj . A cycle patient #5 donor #5
in G then corresponds to a set of surgeries that can be
(b) The selected transplants. We “cover” this graph with two
completed: every donor in the cycle donates a kidney, cycles; if we perform the transplants that are highlighted
and every patient in the cycle receives a compatible (the darker donor-to-patient edges), then every patient
kidney. See Figure 11.76 for an example. receives a compatible kidney.
The algorithm that’s actually used in the real kidney
exchange network in the United States computes a set Figure 11.76 An example of a kidney exchange network,
of node-disjoint cycles to perform. To limit the number and the cycle-based algorithm to select transplants.
of simultaneous surgeries that are required, the algo-
rithm seeks a set of cycles of length 4 or length 6—that is, two or three transplants—in G that maximizes the total
number of nodes included. (The constraint on cycle length makes the computational problem much more difficult, so
the algorithm requires significant computational power to compute the surgeries to complete.) There’s more infor-
mation about “paired donation” (the usual form, in which a cycle contains exactly two donors and two recipients)
from the National Kidney Foundation, kidney.org. You can read more about the computational research in [2].

ISTUDY
11.4 Trees 637

Computer Science Connections

Binary Search Trees


Trees are the basis of many important data structures, of which binary search trees are perhaps most frequently used.
Binary search trees are data structures that implement the abstract data type called a dictionary: we have a set of
keys, each of which has a corresponding value. (For example, the keys might be words and the values definitions,
or they might be student names and GPAs, or usernames and encrypted passwords.) The data structure must support
operations like insert(k, v) (add a new key/value pair) and lookup(k) (report the value associated with key k, if any).
A binary search tree (BST) is a binary tree for which
every node u satisfies the BST condition illustrated in x
Figure 11.77: every node v in u’s left subtree has a key
that is less than u’s key, and every node v in u’s right
subtree has a key that is greater than u’s key. For sim-
plicity, assume that all keys are distinct. (Incidentally,
the BST condition implies the following claim: an in- all keys < x all keys > x
order traversal of a binary search tree visits the keys
Figure 11.77 The binary search tree condition. For every
in sorted order. This claim can be proven formally by
node with key x: all keys in the left subtree of the node have
induction, but the intuition is straightforward: an in-
order traversal of a node with key x first visits nodes a key < x, and all keys in the right subtree have a key > x.
< x [while traversing the left subtree], then x itself, and
then nodes > x [while traversing the right subtree]. Because, recursively, the nodes of the left and right subtrees are
themselves visited in sorted order, the entire tree’s keys are visited in sorted order.) An example of a binary search
tree is shown in Figure 11.78.
Binary search trees are good data structures for Hanan
dictionaries because insert and lookup can be imple-
mented simply and efficiently. If we perform a lookup Evan Joseph
for a key k in an empty BST T, we return “not found.”
Isaac Noah
(For simplicity, we allow a BST to be empty—that is,
to contain zero nodes.) Otherwise, compare k to the key Liam Yasin
r stored in the root node of T:
Morgan Qwill

if k = r: return the value stored at the root. Milan


if k < r: perform a lookup for k in the left subtree.
if k > r: perform a lookup for k in the right subtree. Figure 11.78 A binary search tree storing a set of 10 keys.

The BST condition guarantees that we find the node with key k if it’s in the tree. (You can prove this fact by induction.)
The insert operation can be implemented similarly, by adding a new node exactly where a lookup for the key k would
have found k.
The worst-case running time of lookup and insert
is proportional to the height of the binary search tree. Liam
More “balanced” BSTs—in which every internal node
has a left subtree with roughly the same height as Hanan Noah
its right subtree—have better performance. (There are
Evan Joseph Morgan Yasin
many different BSTs with the same set of keys; for
example, another BST that has the same keys as the Isaac Milan Qwill
BST in Figure 11.78 is shown in Figure 11.79.)
Most software therefore uses balanced binary Figure 11.79 Another binary search tree with the same set
search trees instead—for example, AVL trees or red– of keys as in Figure 11.78, but a shallower depth.
black trees. (See p. 309 for further discussion of AVL
trees, and a proof of their efficiency.) The basic tool behind balanced BSTs is the rotation, which swaps a node in
a BST with one of its children (while carefully rearranging the subtrees of the swapped nodes to maintain the BST
property). You can find the details in any good textbook on data structures.

ISTUDY
638 Graphs and Trees

EXERCISES
11.122 Identify all of the simple cycles in the graph in Figure 11.80a.
11.123 Identify all of the simple cycles in the graph in Figure 11.80b.
11.124 Identify all of the simple cycles in the graph in Figure 11.80c.
11.125 Consider an undirected graph G with n nodes. In terms of n, what is the longest simple cycle that G can contain? Explain.
11.126 How does your answer to Exercise 11.125 change if the cycle is not required to be simple? Explain.
11.127 In the n-node complete graph Kn , how many simple cycles is a particular node u involved in? Simplify your answer as n gets large.
P
(You can use, without proof, the following useful fact: ni=0 i1! approaches e = 2.71828 · · · as n grows.)
11.128 Let u be a node in a n-node complete directed graph: all edges except for self-loops are present. How many simple cycles is node
u involved in? Again, simplify your answer for large n (you can again make use of the fact from Exercise 11.127).
11.129 A small modification to BFS can detect cycles involving a node s a directed graph, as shown in Figure 11.81. However, this
modification doesn’t quite work for undirected graphs. Give an example of an acyclic graph in which the algorithm Figure 11.81
falsely claims that there is a cycle. Then describe briefly how to modify this algorithm to correctly detect cycles involving node s
in undirected graphs.
11.130 Recall Lemma 11.33: in any acyclic undirected graph, there exists a node whose degree is zero or one. Prove the following variation
on this lemma: every directed acyclic graph contains a node with out-degree zero.
11.131 Prove the following extension to Lemma 11.33: there are two nodes of degree 1 in any acyclic undirected graph that contains at
least one edge.

Recall Definition 11.30: a cycle ⟨u0 , u1 , . . . , uk , u0 ⟩ is a path of length ≥ 2 from a node u0 back to node u0 that does not traverse the
same edge twice. At various times in class, I’ve tried to define cycles in all of the following ways—and they’re all bogus definitions,
in the sense that they describe something different from Definition 11.30.
11.132 What’s wrong with this definition? A cycle is a simple path from s to s.
11.133 What’s wrong with this definition? A cycle is a path of length ≥ 2 from s to s.
11.134 What’s wrong with this definition? A cycle is a path from s to s that doesn’t traverse any edge more than once.
11.135 What’s wrong with this definition? A cycle is a path from s to s that includes at least 3 distinct nodes.
11.136 What’s wrong with this definition? A cycle is a path of length ≥ 2 from s to s that doesn’t traverse any edge twice consecutively.
11.137 Definition 11.32 defines an acyclic graph as one containing no cycles, but it would have been equivalent to define acyclic graphs
as those containing no simple cycles. Prove that G has a cycle if and only if G has a simple cycle.

(a) (b) (c)


B D F B D F B D F
A H A H A H
C E G C E G C E G

Figure 11.80 Three graphs.

(Buggily) Modified BFS:


Input: a graph G = ⟨V, E⟩ and a source node s ∈ V 1 Frontier := ⟨s⟩
Output: is s involved in a cycle in G? 2 Known := ∅
3 while Frontier is nonempty:
4 u := the first node in Frontier
5 remove u from Frontier
6 if s is a neighbor of u then
7 return “s is involved in a cycle.”
8 for every neighbor v of u:
9 if v is in neither Frontier nor Known then
10 add v to the end of Frontier
11 add u to Known
12 return “s is not involved in a cycle.”

Figure 11.81 BFS modified (slightly buggily) to detect cycles involving the node s.

ISTUDY
Exercises 639

11.138 Identify two different regular graphs that are trees. (Recall Definition 11.18: G = ⟨V, E⟩ is a regular graph if every u ∈ V has
degree(u) = d, for some fixed constant d.)
11.139 It turns out that there are two and only two different trees T that are regular graphs. Prove that, aside from the two you found in
Exercise 11.138, there are no regular graphs that are trees.

11.140 A square is a simple cycle containing exactly four nodes. What is the largest number of squares possible in an undirected graph of
n nodes?
11.141 A triangle is a simple cycle containing exactly three nodes. What is the largest number of triangles possible in an undirected graph
of n nodes?
11.142 Let’s figure out the largest number of edges that are possible in an n-node undirected graph G = ⟨E, V⟩ that contains no triangles.
For two nodes u and v, argue that if {u, v} ∈ E, then we have degree(u) + degree(v) ≤ |V|.
11.143 Using induction on the number of nodes, prove the following: if G = ⟨V, E⟩ is a triangle-free graph, then |E| ≤ |V|2 /4. (Hint: use
Exercise 11.142.)
2
11.144 Give an example of an n-node triangle-free graph that contains n4 edges.

11.145 Is the graph represented by the adjacency list in Figure 11.82a a tree? Justify your answer.
11.146 What about the graph represented by the adjacency list in Figure 11.82b?
11.147 What about the graph represented by the adjacency list in Figure 11.82c?
11.148 What about the graph represented by the adjacency list in Figure 11.82d?

11.149 Prove or disprove: in any tree with 3 or more nodes, there is a node of degree equal to 2.
11.150 Prove or disprove: in any rooted binary tree, there are an even number of leaves. (In a binary tree, all nodes have 0, 1, or 2 children.)
11.151 Prove or disprove: if an undirected graph G = ⟨V, E⟩ has |V| − 1 edges, then G must be a forest.
11.152 The following pair of definitions is subtly broken: the root of a tree is a node that is not a child, and a leaf is a node that is a child
but not a parent. What’s wrong?

11.153 What are the leaves in the tree in Figure 11.83a?


11.154 For the tree in Figure 11.83a, which nodes are internal nodes?
11.155 For the tree in Figure 11.83a, what the are parent, children, and siblings of node D?
11.156 For the tree in Figure 11.83a, what nodes are descendants of node D?
11.157 For the tree in Figure 11.83a, what nodes are ancestors of node F?
11.158 What is the height of the tree in Figure 11.83a?

11.159 Let T be an arbitrary n-node rooted tree, with root r and with ℓ different leaves. Prove or disprove: if we reroot T at a new node
r′ ̸= r, then the number of leaves remains exactly ℓ.

(a) (b) (c) (d)


A: B, E A: C A: D A: C, D, F
B: A B: C, E B: E, F B: F
C: D C: A, B, F C: D, F C: A, E, F
D: C, F D: E D: A, C D: A
E: A E: B, D E: B E: C
F: D F: C F: B, C F: A, B, C

Figure 11.82 Adjacency lists for a few graphs, which may or may not be trees.

(a) A (b) A

B D B D

C E F C E F

G G

H I H I

Figure 11.83 Two rooted trees, with A as the root.

ISTUDY
640 Graphs and Trees

Figure 11.84 A complete and nearly complete binary tree of height 3.

A complete binary tree of height h has “no holes”: reading from top-to-bottom and left-to-right, every node exists. A nearly complete
binary tree has every node until the last row, which is allowed to stop early. (Every complete binary tree is nearly complete.) See
Figure 11.84, and see also p. 245 for a discussion of heaps, which are a data structure represented as a nearly complete binary
tree.)
11.160 Prove by induction that a complete binary tree of height h contains precisely 2h+1 − 1 nodes.
11.161 How many leaves does a nearly complete binary tree of height h have? Give the smallest and largest possible values, and explain.
11.162 What is the diameter of a nearly complete binary tree of height h? Again, give the smallest and largest possible values, and explain.
(Recall that the diameter of a graph is the maximum distance [length of shortest path] between nodes. See Exercise 11.89.)
11.163 Suppose that we “reroot” a complete binary tree of height h by instead designating one of the erstwhile leaves as the root. What is
the height of the rerooted tree?
11.164 What is the diameter of a complete binary tree rerooted at one of its leaves (as in Exercise 11.163)?
11.165 How many leaves are in a complete binary tree rerooted at one of its leaves (as in Exercise 11.163)?

11.166 Describe a 1000-node binary tree with height as large as possible. Justify your answer.
11.167 Describe a 1000-node binary tree with height as small as possible. Justify.
11.168 Describe a 1000-node binary tree with as many leaves as possible. Justify.
11.169 Describe a 1000-node binary tree with as few leaves as possible. Justify.

11.170 In terms of n, what is the largest possible height for an n-node binary tree in which every node has precisely zero or two children?
Justify your answer.

11.171 In what order are nodes of the tree in Figure 11.83b traversed by a pre-order traversal?
11.172 In what order are nodes of the tree in Figure 11.83b traversed by an in-order traversal?
11.173 In what order are nodes of the tree in Figure 11.83b traversed by a post-order traversal?

11.174 Draw the binary tree with in-order traversal 4, 1, 2, 3, 5; pre-order traversal 1, 4, 3, 2, 5; and post-order traversal 4, 2, 5, 3, 1.
11.175 Do the same for the tree with in-order traversal 1, 3, 5, 4, 2; pre-order traversal 1, 3, 5, 2, 4; and post-order traversal 4, 2, 5, 3, 1.

11.176 Describe (that is, fully explain the structure of) an n-node binary tree T for which the pre-order and in-order traversals of T result
in precisely the same ordering of T’s nodes. (That is, pre-order-traverse(T) = in-order-traverse(T).)
11.177 Describe a binary tree T for which the pre-order and post-order traversals result in precisely the same ordering of T’s nodes. (That
is, pre-order-traverse(T) = post-order-traverse(T).)
11.178 Prove that there are two distinct binary trees T and T′ such that pre-order and post-order traversals are both identical on the trees
T and T′ . (That is, pre-order-traverse(T) = pre-order-traverse(T′ ) and post-order-traverse(T) = post-order-traverse(T′ ) but
T ̸= T′ .)
11.179 Give a recursive algorithm to reconstruct a tree from the in-order and post-order traversals.

11.180 Argue that we didn’t leave out any spanning trees of the graph in Figure 11.73.
11.181 How many spanning trees does the graph in Figure 11.85a have? Explain.
11.182 How many spanning trees does the graph in Figure 11.85b have? Explain.

(b)
(a) B D F
B D F
A H
A H
C E G
C E G

Figure 11.85 Two graphs.

ISTUDY
11.5 Weighted Graphs 641

11.5 Weighted Graphs


Some minds improve by travel, others, rather,
Resemble copper wire, or brass,
Which gets the narrower by going farther!
Thomas Hood (1799–1845)
“Ode to Rae Wilson Esq.” (1837)

Many real-world situations are naturally modeled by different edges having different “weights”: the
price of an airplane flight, the closeness of a friendship, the physical length of a road, the time required to
transmit data across an internet connection. These graphs are called weighted graphs:

Definition 11.40: Weighted graph.


A weighted graph is a graph G = ⟨V, E⟩ and a weight function w : E → R≥0 , so that each edge e ∈ E has
a weight w(e) ≥ 0. For simplicity of notation, we’ll often write we instead of w(e); we’ll also sometimes
refer to we as the length of the edge e.
In a weighted graph, the length of a path is the sum of the lengths of the edges traversed by the path.
(A shortest path is, as before, one with the smallest length.)

Definition 11.40 considers only nonnegative weights—every we ≥ 0—which is a genuine restriction. (For example, the “signed”
social networks from Figure 11.16a have positive and negative weights signifying friendship and enmity.) Some, but not all, of
the results that we’ll discuss in this section carry over to the setting of negative weights.

Either undirected or directed graphs can be weighted. Aside from the length of a path, all of the other
notions and terminology from unweighted graphs carry over: neighbors and degree, paths and connectivity,
and so forth. Weighted graphs can be represented just as unweighted graphs were, using adjacency lists
or adjacency matrices: we typically store the weight of edge ⟨u, v⟩ directly in the ⟨u, v⟩th entry of the
adjacency matrix, or attach the edge weight as an additional slot in the adjacency list entries.

Example 11.43: A weighted graph.


Figure 11.86 shows the highway system from Example 11.4, with each road labeled by its length. There
are two simple paths between Orlando and Lake City:
Orlando ↔ Tampa ↔ Lake City 85 + 180 = 265 miles
Orlando ↔ Daytona Beach ↔ Jacksonville ↔ Lake City 55 + 90 + 60 = 205 miles.
The second path is shorter, even though it traverses more edges, as 265 > 205.

2350 miles 60 miles


Los Angeles Lake City, FL Jacksonville

180 miles 90 miles

Tampa Daytona Beach

85 miles 55 miles

Orlando

Figure 11.86 The highway system from Figure 11.4b, with each road labeled by its length.

ISTUDY
642 Graphs and Trees

Taking it further: The primary job of a web search engine is to respond to a user’s search query (“give me web pages about
Thomas Hood”) with a list of relevant pages. There’s a complex question of data structures, parallel computing, and networking
infrastructure in solving even the simplest part of this task: identifying the set R of web pages (out of many billions) that contain
the search term. A subtler challenge—and at least as important—is figuring out how to rank the set R. What pages in R are the
“most important,” the ones that we should display on the first page of results? Google uses a weighted graph (and probability) to
do some of this ranking; see p. 650.

11.5.1 Shortest Paths in Weighted Graphs: Dijkstra’s Algorithm


A shortest path from s to t in a weighted graph is the path connecting s and t that has shortest total length.
In many natural applications where shortest paths are useful, we have weights on edges: you probably want
the shortest walking route from the bar back to your apartment, for example, not necessarily the one with
the fewest turns. In Example 11.43, we already saw a case in which the shortest path used more edges than
necessary. Thus we cannot directly use breadth-first search to compute distances in weighted graphs. But
we can compute distances using an algorithm that’s very similar in spirit to BFS.
The basic idea of breadth-first search is to “expand outward” from the source node s in layers, accumu-
lating a set of nodes u for which we know the distance from s to u. We add nodes in increasing order of
their distance from s, and eventually we’ve computed distances from s to all nodes in the graph. (See Fig-
ure 11.87.) The trouble for weighted graphs is that the order in which BFS builds up its knowledge about
shortest paths doesn’t always work (as in Example 11.43). But we can use a cleverer way of building up
knowledge about the network to find shortest paths in weighted graphs, too, following the same intuition.
The algorithm that we’ll describe is due to Edsger Dijkstra, and hence it is known as Dijkstra’s algorithm.

Taking it further: Edsger Dijkstra (1930–2002) was a Dutch computer scientist—one of the founders of theoretical computer
science, and the 1972 Turing Award winner. An irrelevant-to-this-section quotation attributed to Dijkstra, for your amuse-
ment: “computer science is no more about computers than astronomy is about telescopes.” (This field is named strangely.)
An irrelevant-to-this-section challenge, also for your amusement: name a common English word that, like DIJKSTRA, has at
least five consecutive consonants (or 6 or even 7, which is technically possible—though, honestly, you’d probably be annoyed if
someone claimed the 7-consonant example as a word, and it’s only in some dictionaries). (Not SYZYGY or RHYTHMS; Y is a vowel
if it’s used as a vowel!)

The key idea of Dijkstra’s algorithm has parallels with BFS: Suppose that we know the distance from
a source node s to every node in some set S of nodes. (Assume that s ∈ S.) We will find some node not in S
for which we can determine the shortest path from s. For now, let’s not worry about where this set S came
from; the key point is just that we are assuming that we know distances to certain nodes (those in S), and
we seek to leverage that existing knowledge to learn the distance to some other node (not previously in S).
We’ll then add that new node to S and iterate.
Before we state the formal result, let’s look at an example:

The shaded region represents the set S of nodes within distance d of s.


Suppose u ∈
/ S is a neighbor of some node v ∈ S.
s

v Then the distance from s to u must be d + 1.


d u

Figure 11.87 The intuition of breadth-first search, for an unweighted graph.

ISTUDY
11.5 Weighted Graphs 643

(a) 6 (b)
d(A, B) = 1 B D 2 F node distance
1 10 F (via B) d(A, B) + wB,F =1+6=7
d(A, A) = 0 A 4 5 7 H E (via A) d(A, A) + wA,E =0+9=9
3 E (via C) d(A, C) + wC,E = 3 + 8 = 11
d(A, C) = 3 C 8 E G D (via C) d(A, C) + wC,D =3+5=8
9
Figure 11.88 (a) A weighted graph. If we know the distance from A to each node in the shaded set (as marked), then (b)
the candidate nodes are those unmarked neighbors of nodes with known distances, with the candidate distances shown.
See Example 11.44.

Example 11.44: An example of distances.


Consider the weighted, undirected graph in Figure 11.88. Suppose we know the distances from A to
every node in the shaded set S = {A, B, C}: in other words, we know d(A, A) = 0 and d(A, B) = 1 and
d(A, C) = 3. We wish to expand our set of known nodes by adding a neighbor of an node whose distance
is already known. The candidate nodes are neighbors of nodes with known distances that themselves do
not have known distances—which are {D, E, F}. Their candidate distances are shown in Figure 11.88b.
Let’s argue that we can now conclude that d(A, F) = 7. The key reason is that, to get from A to F, we
have to “escape” the set of shaded nodes in Figure 11.88a—and every “escape route” (path to F) must
reach its last shaded node v (that’s d(A, v)) and then follow an edge to its first non-shaded node u (that’s
wv,u ). Because the table in Figure 11.88b tells us that every path out of the shaded region has length at
least 7, and we’ve found a path from A to F with exactly that length, we can conclude that d(A, F) = 7.

Computing the distance to a new node


The same basic reasoning that we used in Example 11.44 will allow us to prove a general observation that’s
the foundation of Dijkstra’s algorithm:

Lemma 11.41: Foundation of Dijkstra’s algorithm.


Let G = ⟨V, E⟩ be a graph with edge weights w, let S ⊂ V be a set of nodes, and let s ∈ S be a source
node. Let d(s, v) denote the distance from s to v for every node v in S. For a node u ∈
/ S, define
du := min d(s, v) + wv,u .
v∈S:u is a neighbor of v

Let u∗ be the node u ∈


/ S for which du is minimized. Then the distance from s to u∗ is du∗ .

Before we prove the lemma, let’s restate the claim in slightly less notation-heavy English. See Figure 11.89.
We have a set S of shaded nodes for which we know the distance from s. We examine all unshaded nodes

The shaded region represents the set S of nodes with known distance from s.

s Suppose u ∈
/ S is a neighbor of a node v ∈ S,
∗ ∗

where v and u minimize du = d(s, v ) + wv u .


∗ ∗


∗, ∗

d(s, v )
∗ v

wv , u
∗ ∗
Then the distance from s to u must be du .

u ∗

Figure 11.89 The intuition for Lemma 11.41. (Compare to the intuition for BFS, in Figure 11.87.)

ISTUDY
644 Graphs and Trees

u that are neighbors of shaded nodes v. For each shaded/unshaded pair, we’ve computed the sum of the
distance d(s, v) and the edge weight wv,u . And we’ve chosen the pair ⟨v∗ , u∗ ⟩ that minimizes this quantity.
The lemma says that the shortest path from s to u∗ must have length equal to du∗ := d(s, v∗ ) + wv∗ ,u∗ .
The intuition is the same as in Example 11.44: to get from s to u∗ , we have to somehow “escape” the set
of shaded nodes—and, by the way that we chose u∗ , every “escape route” must have length at least du∗ .

Problem-solving tip: When we want to prove that x = y, it’s sometimes easier to prove x ≥ y and x ≤ y separately.

Proof of Lemma 11.41. We must show that the distance from s to u∗ is du∗ , and we’ll do it in two steps:
by showing that the distance is no more than du∗ , and by showing that the distance is no less than du∗ .
Claim #1: The distance from s to u∗ is ≤ du∗ . We must argue that there is a path that has length
d(s, v∗ ) + wv∗ ,u∗ from s to u∗ . By assumption and the fact that v∗ ∈ S, we know that d(s, v∗ ) is the distance
from s to v∗ , so there must exist a path P of length d(s, v∗ ) from s to v∗ . (It’s the curved line in Figure 11.89.)
By tacking u∗ onto the end of P, we’ve constructed a path from s to u∗ via v∗ with length d(s, v∗ ) + wv∗ ,u∗ .
Claim #2: The distance from s to u∗ is ≥ du∗ . Consider an arbitrary path P from s to u∗ . We must show
that P has length at least d(s, v∗ ) + wv∗ ,u∗ . What can P look like? The node s is in the set S, so P starts
out at s ∈ S, then wanders around for a while inside S, then crosses outside of S for the first time, wanders
around outside S for a while, and eventually ends up at u∗ ∈ / S. Nothing prevents P from re-entering (and
later re-exiting) S after its first departure—indeed, it can go in and out of S several times—but it definitely
has to leave S at least once. (See Figure 11.90.) Therefore:

the length of P
= (the length of P up to the first exit) + (the length of P after the first exit)
≥ (the length of the shortest path exiting S) + (the length of P after the first exit)
P up to the first exit is a path exiting S, so its length is at least the length of the shortest such path

≥ d(s, v ) + wv∗ ,u∗ + (the length of P after the first exit)
we chose u∗ and v∗ so that d(s, v∗ ) + wv∗ ,u∗ is exactly the length of the shortest path exiting S
≥ d(s, v∗ ) + wv∗ ,u∗ + 0 all edge weights are nonnegative, so all path lengths are ≥ 0 too
= du∗ . definition of du∗

Thus the length of P is at least du∗ , which concludes the proof of Claim #2.
We’ve therefore argued that the distance from s to u∗ is both ≤ du∗ (Claim #1) and ≥ du∗ (Claim #2). Thus
the distance is precisely du∗ , and the lemma follows.

(a) (b) (c)

s s s

u∗
u∗
u∗

Figure 11.90 An illustration of the proof of Claim #2 in Lemma 11.41. (a) Any path P from s to u∗ can be broken
down into two parts: (b) the portion of P up to (and including) its first exit from S, and (c) the portion of P after the first
exit from S.

ISTUDY
11.5 Weighted Graphs 645

Dijkstra’s algorithm:
Input: a weighted graph G = ⟨V, E⟩, nonnegative edge weights we ≥ 0, and a source node s ∈ V.
Output: the distance from s to every node in G

1 let S := {s} and let d(s, s) := 0 // S is the set of nodes with known distances.
2 while there exists a node in S with a neighbor not in S:
3 for every node u ∈
/ S, define du as minv d(s, v) + wv,u , where v is any node in S that’s a neighbor of u
4 u∗ := the node with the smallest du
5 add u∗ to S and set d(s, u∗ ) := du ∗

6 for every node u ∈ V − S:


7 d(s, u) := ∞
8 return the recorded values d(s, u)

Figure 11.91 The pseudocode for Dijkstra’s algorithm.

Dijkstra’s algorithm
With Lemma 11.41 proven, we can now put together the pieces of the entire algorithm. The lemma describes
a way to take a set S of nodes with known distance from the source node s, and correctly calculate the
distance from s to a new node u ∈ / S. In Dijkstra’s algorithm, the idea is to apply the calculation from
Lemma 11.41 repeatedly to find all distances from the given source node s. We’ll need a base case to get
started, but that’s straightforward: we start with the set of nodes with known distance from s as S = {s},
where the distance from s to s is zero. The full algorithm is shown in Figure 11.91.
Before we prove the algorithm’s correctness, let’s run through an example:

Example 11.45: Dijkstra’s algorithm in action.


Figure 11.92 shows Dijkstra’s algorithm running on the road network from Example 11.43.

(a) (b) (c) (d) (e) (f)


OR OR OR OR OR OR

85 55 85 55 85 55 85 55 85 55 85 55

TA DB TA DB TA DB TA DB TA DB TA DB

180 90 180 90 180 90 180 90 180 90 180 90

LC JA LC JA LC JA LC JA LC JA LC JA
60 60 60 60 60 60

2350 2350 2350 2350 2350 2350

LA LA LA LA LA LA

d(OR, OR) = 0 d(OR, OR) = 0 d(OR, OR) = 0 d(OR, OR) = 0 d(OR, OR) = 0 d(OR, OR) = 0
d(OR, DB) = 55 d(OR, DB) = 55 d(OR, DB) = 55 d(OR, DB) = 55 d(OR, DB) = 55
d(OR, TA) = 85 d(OR, TA) = 85 d(OR, TA) = 85 d(OR, TA) = 85
d(OR, JA) = 145 d(OR, JA) = 145 d(OR, JA) = 145
d(OR, LC) = 205 d(OR, LC) = 205
d(OR, LA) = 2555

Figure 11.92 Dijkstra’s algorithm on the graph in Figure 11.86 (with abbreviated names), starting from Orlando (OR).
Nodes inside the shaded region have known distances (which are recorded in the table) from OR. In each iteration, the
algorithm considers every “candidate” node—that is, a node that itself has an unknown distance but that has a neighbor
with a known distance. (The light, unfilled nodes in the figure are not yet candidates.) In going from (a) to (b), for
example, DB is the candidate with the smallest candidate distance (as in Lemma 11.41), so its distance can be recorded.

ISTUDY
646 Graphs and Trees

The correctness of Dijkstra’s algorithm


We’ll now prove the correctness of the algorithm, using Lemma 11.41 and induction:

Theorem 11.42: Correctness of Dijkstra’s algorithm.


Let G = ⟨V, E⟩ be a graph with nonnegative edge weights we for each edge. Let s ∈ V be a source node,
and let d(s, •) be the values computed by calling Dijkstra(G, w, s). Then, for every node u, we have that
d(s, u) is the length of the shortest path from s to u in G under w.

Proof. Looking at the algorithm, we see that Dijkstra’s algorithm records finite distances from s in Line 1
(for s itself) and Line 5 (for other nodes reachable from s). Suppose that Dijkstra’s algorithm executes n
iterations of the loop in Line 2, thus recording n + 1 total distances in Lines 1 and 5—say in the order
u0 , u1 , . . . , un . Let P(i) denote the claim that d(s, ui ) is the length of the shortest s-to-ui path. We claim by
strong induction on i that P(i) holds for all i ∈ {0, 1, . . . , n}.
For the base case (i = 0), we must prove that d(s, u0 ) is recorded correctly. The 0th node u0 is recorded
in Line 1, so u0 is the source node s itself. And the shortest path from s to s in any graph with nonnegative
edge weights is the 0-hop path ⟨s⟩, of length 0.
For the inductive case (i ≥ 1), we assume the inductive hypothesis P(0), P(1), . . . , P(i − 1): that
is, all recorded distances d(s, u0 ), d(s, u1 ), . . . , d(s, ui−1 ) are correct. We must prove P(i): that is, that
the recorded distance d(s, ui ) is correct. But this follows immediately from Lemma 11.41: the algorithm
chooses ui as the u ∈ / S minimizing
du := min d(s, v) + wv,u ,
v∈S:u is a neighbor of v

where S = {u0 , u1 , . . . , ui−1 }. Lemma 11.41 states precisely that this value du is the length of the shortest
path from s to u.
Finally, observe that any node u that’s only discovered in Line 6 is not reachable from s, and so
indeed d(s, u) = ∞. (A fully detailed argument that the ∞ values are correct can follow the structure
in Theorem 11.28, which proved the correctness of BFS.)
Taking it further: Dijkstra’s algorithm as written in Figure 11.91 can be straightforwardly implemented to run in O(|V| · |E|)
time: each iteration of the while loop (Line 2) can look at each edge to compute the smallest du . But with cleverer data structures,
Dijkstra’s algorithm can be made to run in O(|E| log |V|) time. This improved running-time analysis, as well as other shortest-
path algorithms—for example, handling the case in which edge weights can be negative (it’s worth thinking about where the
proof of Lemma 11.41 fails if an edge e can have we < 0), or computing distances between all pairs of nodes instead of just
every distance from a single source—is a standard topic in a course on algorithms. Any good algorithms text should cover these
algorithms and their analysis.

Before we leave Dijkstra’s algorithm, it’s worth reflecting on its similarities with BFS. In both cases, we
start from a seed set S of nodes for which we know the distance from s—namely S = {s}. Then we build
up the set of nodes for which we know the distance from s by finding the unknown nodes that are closest
to s, and adding them to S. Of course, BFS is conceptually simpler, but Dijkstra’s algorithm solves a more
complicated problem. It’s a worthwhile exercise to think about what happens if Dijkstra’s algorithm is run
on an unweighted graph. (How does it relate to BFS?)

11.5.2 Spanning Trees in Weighted Graphs: Minimum Spanning Trees


Recall from Definition 11.38 that a spanning tree of a connected graph G = ⟨V, E⟩ is a tree T = ⟨V, E′ ⟩
where E′ ⊆ E. As with shortest paths, in many of the applications in which spanning trees are interesting,

ISTUDY
11.5 Weighted Graphs 647

(b) 6
(a)
LA 2350 LC 60 JA B D 2 F
1 10
180 90 A 4 5 7 H
3
TA DB C 8 E
85 55 9
OR

(d) 6
(c)
LA 2350 LC 60 JA B D 2 F
1 10
180 90 A 4 5 7 H
3
TA DB C 8 E
85 55 9
OR

Figure 11.93 (a, b) Two weighted graphs, and (c, d) their minimum spanning trees.

we actually want to find a spanning tree whose edges have minimum possible total cost. For example, when
a college wants to put down networking cable in a new dorm building, they wish to ensure that the resulting
network is connected, while minimizing the cost of construction.
P
In a weighted graph, the cost of a spanning tree T is the sum of the weights of its edges: e∈E′ we . A
minimum spanning tree (MST) is a spanning tree whose cost is as small as possible.

Example 11.46: Some minimum spanning trees.


Figures 11.93a and 11.93b show two graphs (the road network from Example 11.43 and the larger con-
nected component from Example 11.44). Their minimum spanning trees are shown in Figures 11.93c
and 11.93d. (For the first, every spanning tree omits exactly one edge from the lone cycle; the cheapest
tree omits the most expensive edge.)

As with shortest paths in weighted graphs, the question of how to find a minimum spanning tree efficiently
is more appropriate to an algorithms text than this one. But, between the Cycle Elimination algorithm
(Figure 11.74a) and Example 11.46, we’ve already done most of the work to develop a first algorithm.
Assume throughout that all edge weights are distinct. (This assumption lets us refer to “the most expen-
sive edge” in a set of edges. Removing this assumption would complicate the language that we’d have to
use, but it doesn’t fundamentally change anything about the MST problem or its solution.)

Lemma 11.43: The “cycle rule” for minimum spanning trees.


Let G = ⟨V, E⟩ be a connected undirected graph, and let C be a cycle in G. Let e be the heaviest edge in
C. Then e is not in any minimum spanning tree of G.

Proof. Consider a spanning tree T of G, and suppose that e = {u, v} is in T. See Figure 11.94a. We’ll
show that T is not a minimum spanning tree. (Thus the only minimum spanning trees of G do not include e.)
By definition, the spanning tree T is connected. If we delete {u, v} from T, the resulting graph will have two
connected components, one containing u and the other containing v. (This fact follows by Corollary 11.36.)
Call those connected components U and V, respectively. See Figure 11.94b.
Imagine following the cycle C from u to v the “long way” around C. This part of C starts at u, wanders
around U for a while, and eventually crosses over into V, before finally arriving at v. Let a ∈ U be the last

ISTUDY
648 Graphs and Trees

u v u v u v u v

a b a b

(a) A spanning tree T that (b) Removing {u, v} (c) Some other edge (d) A new spanning tree
includes edge {u, v} that splits T into two {a, b} also in C must T′ that results from
is in some cycle C. connected components, U have a ∈ U and b ∈ V. replacing {u, v} by
and V. {a, b}.
Figure 11.94 The cycle rule for MSTs.

node in U and b the first node in V as we go around C. (Note that C might go back and forth between U
and V multiple times, but define a and b based on the first time C leaves U.) See Figure 11.94c.
Now define the graph T′ as T with the edge {u, v} removed and with the edge {a, b} inserted instead. (See
Figure 11.94d.) Crucially, T′ is a spanning tree of G; because we’ve only swapped which edge connected
the connected sets U and V. Thus T′ remains connected and acyclic.
Now observe that the cost of T′ is less than the cost of T, because the edge {u, v} is heavier than the
edge {a, b}. (Both {u, v} and {a, b} are in the cycle C, and by assumption {u, v} is the heaviest edge in
C.) But therefore T′ is a cheaper spanning tree than T, and thus T isn’t a minimum spanning tree.

Finding MSTs by removing cycles


Lemma 11.43 immediately suggests that we can find minimum spanning trees using a modification of
the Cycle Elimination algorithm, which is shown in Figure 11.95. While the Weighted Cycle Elimination
algorithm is correct and reasonably efficient, there are more efficient algorithms based on Lemma 11.43.
One such algorithm is called Kruskal’s algorithm, named after its discoverer Joseph Kruskal (1928–2010).
(Kruskal published his MST algorithm in 1956, in part inspired by the 1920s-era work of Otakar Borůvka,
a Czech mathematician interested in building efficient electrical grids in Eastern Europe [19, 74]—both
remarkably early in the history of CS.)
The key idea of Kruskal’s algorithm is that by sorting the edges in increasing order, we can be more
efficient: we add edges in increasing order of their weight, as long as doing so doesn’t create a cycle. The
insight of this algorithm is that, by considering edges in increasing order of weight, if including an edge e
creates a cycle, then we know that e must be the heaviest edge in that cycle. See Figure 11.95. Kruskal’s
algorithm is pretty efficient: the sorting step takes O(|E| log |E|) time, and each of the |E| iterations of the

Weighted Cycle Elimination algorithm Kruskal’s algorithm


1 while there exists any cycle C in G: 1 Sort the edges e in increasing order of weight.
2 let e be the heaviest edge traversed by C 2 S := ∅
3 remove e from E 3 for each edge e (taken in increasing order of weight):
4 return the resulting graph ⟨V, E⟩. 4 if the graph ⟨V, S ∪ {e}⟩ doesn’t contain a cycle then
5 add e to S
6 return the resulting graph ⟨V, S⟩

Figure 11.95 Two algorithms to find a minimum spanning tree of a weighted connected graph G = ⟨V, E⟩.

ISTUDY
11.5 Weighted Graphs 649

for loop can be implemented using one call to BFS to test for a cycle. (And, in fact, there are some cleverer
ways to implement Line 4 so that the entire algorithm runs in O(|E| log |E|) time.) Here’s an example:

Example 11.47: Sample run of Kruskal’s algorithm.


Figure 11.96 shows every iteration of Kruskal’s algorithm when it’s run on a small graph. The algorithm
successively considers each edge of the original graph, in order from cheapest to most expensive, and
discards the edge that it’s considering if its inclusion would create a cycle.

Here is the general statement of correctness for both algorithms:

Theorem 11.44: Correctness of minimum spanning tree algorithms.


The Weighted Cycle Elimination algorithm and Kruskal’s algorithm both return a minimum spanning
tree for any weighted connected undirected graph.

Proof. The correctness of the Weighted Cycle Elimination algorithm follows immediately from
Lemma 11.43 (the cycle rule) and from Theorem 11.39 (the correctness of the Cycle Elimination algo-
rithm): the heaviest edge in any cycle does not appear in any MST, and we terminate with a spanning tree
when we repeatedly eliminate any edge from an arbitrarily chosen cycle.
For Kruskal’s algorithm, consider an edge e that is not retained—that is, when e is considered, it is
not included in the set S. The only reason that e wasn’t included is that adding it would create a cycle C
involving e and previously included edges—but because the edges are considered in increasing order of
weight, that means that e is the heaviest edge in C. Thus by Lemma 11.43, Kruskal’s algorithm removes
only edges not contained in any minimum spanning tree. Because it only excludes edges that create cycles,
the resulting graph is also connected—and thus a minimum spanning tree.

Taking it further: There are several other commonly used algorithms for minimum spanning trees, using different structural
properties than the cycle rule. For much more on these other algorithms, and for the clever data structures that allow Kruskal’s
algorithm to be implemented in O(|E| log |E|) time, see any good textbook on algorithms.

The next-cheapest edge doesn’t create a cycle, so we retain it. The next-cheapest edge does create a cycle, so we discard it.

(a) A (b) A (c) A (d) A (e) A (f) A (g) A (h) A (i) A


4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1

B 2 C B 2 C B 2 C B 2 C B 2 C B 2 C B 2 C B 2 C B 2 C

6 3 6 3 6 3 6 3 6 3 6 3 6 3 6 3 6 3

D 7 D 7 D 7 D 7 D 7 D 7 D 7 D 7 D 7

5 5 5 5 5 5 5 5 5

E E E E E E E E E

Figure 11.96 An example execution of Kruskal’s algorithm on the graph in (a), which produces the final minimum
spanning tree in (i). In each iteration, we examine the next-cheapest edge (shown with a double line); if it doesn’t create
a cycle, we retain it in the spanning tree; if it does, we discard the edge.

ISTUDY
650 Graphs and Trees

Computer Science Connections

Random Walks and Ranking Web Pages


When Google launched as a web search engine, one of its major innovations over its competition was in how it
ranked the pages returned in response to a user’s query. Here are two key ideas in Google’s ranking system, called
PageRank (named after Larry Page, one of Google’s founders): (1) we can view a link from page u to page v as
implicit “endorsement” of v by u; and (2) not all endorsements are equal: if a page u is endorsed by many other
pages, then u’s endorsements matter more. These points can be restated more glibly as: a page is important if it is
pointed to by many important pages. The idea of PageRank is to break this apparent circularity using the Random
Surfer Model. Imagine a hypothetical web user who starts at a random web page, and, at every time step, clicks on a
randomly chosen link from the page they’re currently visiting. The more frequently that this hypothetical user visits
page u, the more important we’ll say u is. B
1.0
The Random Surfer explores the web using a so- 0.5 0.33 0.33
called random walk on the web graph. In its simplest A D E
1.0
form, a random walk on a directed graph G visits a
0.5 0.33
sequence of nodes: (1) we choose a starting node u0
C 1.0
uniformly at random; and (2) in step t = 1, 2, . . ., the
next node ut is chosen by picking a node uniformly (a) A sample 5-node graph. Edges are annotated with their
at random from the out-neighborhood of the previous probabilities in a random walk; we can view the resulting
node ut−1 . (See Figure 11.97.) As you’ll see in Exer- weighted graph as defining the process.
cises 11.204–11.208, under mild assumptions about node steps
B
G, there’s a special probability distribution p over the A 166,653
nodes of the graph, called the stationary distribution of B 166,652
A D E
G, that has the following property: if we choose an ini- C 166,155
tial node u with probability p(u), and we then take one C D 250,270
step of the random walk from u, the resulting probabil- E 250,271
ity distribution over the nodes is still p. And, it turns (b) A computer-generated random walk starting at A (this
out, we can approximate p by the probability distribu- particular walk began ABABABABABABACEDCEDEDBABAC),
tion observed simply by running the random walk for with 1,000,000 steps: the first 25 steps visualized, and the
many steps, as in Figure 11.97b. number of these 1,000,000 steps spent at each node.
This stationary probability distribution p already has
1.0 B p(B) = 16
several good properties: p(u) is higher when u has 1
p(E) = 14
0.5 0.33 p(D) = 4 0.33
more in-neighbors, but it’s increased even more when
A p(A) = 16 D 1.0 E
the in-neighbors of u have a high probability them-
0.5 0.33
selves. (In Figure 11.97c, for example, p(D) > p(B)
p(C) = 61 C 1.0
and p(D) > p(C), despite B and C having higher in-
degree than D.) But there are a few refinements to (c) The stationary distribution for G.
get to the full PageRank model. One issue is that the A B C D E
Random Surfer gets stuck at any page u that has no A 0.03 0.45 0.45 0.03 0.03
out-neighbors. (The next step isn’t defined.) In this B 0.88 0.03 0.03 0.03 0.03
case, we’ll have the Random Surfer jump to a com- C 0.03 0.03 0.03 0.03 0.88
pletely random page (each of the |V| nodes is chosen D 0.03 0.31 0.31 0.03 0.31
1
with probability |V| ). Second, the Random Surfer can E 0.03 0.03 0.03 0.88 0.03
also get stuck in a “dead end” if there’s a group of
nodes that has no edges leaving it. Thus—and this (d) The updated link probabilities, with random restarts.
change probably makes the Random Surfer more real- Figure 11.97 A random walk.
istic anyway—we’ll add a restart probability of 15%
to every stage of the random walk: with probability 85%, we behave as previously described; with probability 15%,
we jump to a randomly chosen node. See Figure 11.97d for the updated probabilities. (You can find more about
the Random Surfer model and PageRank—including about how to calculate it on a graph with nodes numbering
in the billions—in the original paper [21] or a good textbook on data mining, like [76]. There are also many other
ingredients in Google’s ranking recipe beyond PageRank, though PageRank was an early and important one.)

ISTUDY
Exercises 651

EXERCISES
11.183 Find all shortest paths from A to E in the graph in Figure 11.98a. Give both the path length and the paths themselves.
11.184 Do the same for all shortest paths from A to E in the graph in Figure 11.98b.
11.185 Do the same for all shortest paths from A to E in the graph in Figure 11.98c.
11.186 Do the same for all shortest paths from A to H in the graph in Figure 11.98d.
11.187 Do the same for all shortest paths from A to H in the graph in Figure 11.98e.

11.188 Let n be arbitrary. Give an example of an n-node weighted graph G = ⟨V, E⟩ with designated nodes s ∈ V and t ∈ V in which both
of the following conditions hold:
(i) all edge weights are distinct (for any e ∈ E and e′ ∈ E, we have w(e) ̸= w(e′ ) unless e = e′ ), and
(ii) for some α > 1 and c > 0, there are at least c · αn different shortest paths between s and t.

11.189 Suppose that we are running Dijkstra’s algorithm on the graph in Figure 11.99a to compute distances from the node A. So far
Dijkstra’s algorithm has computed four distances:

d(A, A) = 0 d(A, B) = 1 d(A, C) = 3 d(A, F) = 7

If we continue Dijkstra’s algorithm for further iterations, it records the distance for a new node in each iteration. What is the next
node recorded, and what is its distance?
11.190 What is the next node (after the one from Exercise 11.189) for which Dijkstra’s algorithm records a distance, and what is its
distance? And what’s discovered after that node? List all subsequently discovered nodes, and their distances.
11.191 Trace Dijkstra’s algorithm on the graph shown in Figure 11.99a to compute distances from the node H. List all discovered nodes
and their distances, in the order in which they’re discovered.

11.192 Identify exactly where the proof of correctness for Dijkstra’s algorithm (specifically, Lemma 11.41) the argument fails if edge
weights can be negative. Then give an example of a graph with negative edge weights in which Dijkstra’s algorithm fails.
11.193 Suppose that G is a weighted, directed graph in which nodes represent physical states of a system, and an edge ⟨u, v⟩ indicates that
it’s possible to move from state u to state v. The weight wu,v denotes the multiplicative cost of the exchange: I can swap wu,v units
of u for 1 unit of v. (For example, if there’s an edge ⟨X, Y⟩ with weight 1.04, then I can trade 2.08 units of energy in state X for 2
units of energy in state Y.) Now suppose that we wish to find a shortest multiplicative path (SMP) from a given node s to a given
node t in G, where the cost of the path is the product of the edge weights along it. (Until now, we’ve cared about the sum of the
costs.) See Figure 11.99b. Describe how to modify Dijkstra’s algorithm to find the SMP in a given weighted graph G. Alternatively,
describe how to modify a given weighted graph G into a graph G′ so that Dijkstra’s algorithm (unmodified) run on G′ finds an
SMP in G.

(a) (b) (c) (d) (e)


A A A A A
2 4 4 1 1 1
1 7 5 3
C 5 B C 6 B C 1 B
1 7 7 5 7 7 B 10 C 3 B 4 C 9
D D D
8 6 2 3 3 3 5 6 1 8
3 8 3
2 D E 6 D E
E E E
8 9 2 7 G
F 4 H F 10 H

Figure 11.98 A few graphs.

6 (b) 2.5
(a) A 1 B D 2 F
10 A B C D
3 4 5 7 H 1.1 1.5 1.4

C 8 E G The shortest multiplicative path from A to D is


9 A → B → C → D at cost 1.1 · 1.5 · 1.4 = 2.31,
which is better than A → B → D at cost
1.1 · 2.5 = 2.75.

Figure 11.99 Two graphs.

ISTUDY
652 Graphs and Trees

11.194 As you argued in Exercise 11.192, Dijkstra’s algorithm may fail if edge weights are negative. State the condition that guarantees
that your algorithm from Exercise 11.193 properly computes SMPs.

11.195 List all minimum spanning trees for the graph in Figure 11.98a.
11.196 Do the same for the graph in Figure 11.98b.
11.197 Do the same for the graph in Figure 11.98c. (Note that this graph has edges with nondistinct weights.)
11.198 Do the same for the graph in Figure 11.98d.
11.199 Do the same for the graph in Figure 11.98e.

11.200 Consider the undirected 9-node complete graph K9 . There are 92 = 92·8 = 36 unordered pairs of nodes in this graph, so there are
36 different edges. You’re asked to assign each of these 36 edges a distinct weight from the set {1, 2, . . . , 36}. (You get to choose
which edges have which weights.) What’s the cheapest possible minimum spanning tree of K9 ?
11.201 In the scenario from Exercise 11.200, what’s the most expensive edge that can ever appear in a minimum spanning tree of K9 ?
(Remember that you get to choose which edges have which weights.)
11.202 In the scenario from Exercise 11.200, what’s the costliest possible minimum spanning tree of K9 ?
11.203 Generalize Exercise 11.200 and 11.202: what are the cheapest and most expensive possible MSTs for the graph Kn if all edges
 
have distinct weights chosen from 1, 2, . . . , n2 ? (Hint: see Exercise 9.176.)

A random walk in a graph G = ⟨V, E⟩ proceeds as follows: we start at a node u0 ∈ V, and, at every time step, we select as the next
node ui+1 a uniformly chosen (out-)neighbor of ui . (See p. 650.) Suppose we choose an initial node u0 according to a probability
distribution p, and we then take one step of the random walk from u0 to get a new node u1 . The probability distribution p is called
a stationary distribution if it satisfies the following condition: for every node s ∈ V, we have that Pr [u0 = s] = Pr [u1 = s] = p(s).
Such a distribution is called “stationary” because, if p is the probability distribution before a step of the walk, then p is still the
probability distribution after a step of the walk (and thus the distribution “hasn’t moved”—that is, is stationary).
11.204 Argue that p(A) = p(B) = p(C) = 13 is a stationary distribution for the graph in Figure 11.100a.
11.205 Argue that the graph in Figure 11.100b has at least two distinct stationary distributions.

Suppose that we start a random walk at node A in the graph in Figure 11.100a. Figure 11.101 shows the probability of being at any
particular node after each of the first six steps of the random walk. Let pk (u) denote the probability of the kth step of this random
walk being at node u. Although we’ll skip the proof, the following theorem turns out to be true of random walks on undirected
graphs G: If G is connected and nonbipartite, then a unique stationary distribution p exists for this random walk on G (regardless
of which node we choose as the initial node for the walk). Furthermore, the stationary distribution is the limit of the probability
distributions pk of where the random walk is in the kth step.
11.206 (programming required) Write a random-walk simulator: take an undirected graph G as input, and simulate 2000 steps of a random
walk starting at an arbitrary node. Repeat 2000 times, and report the fraction of walks that are at each node. What are your results
on the graph from Figure 11.100a?
11.207 Argue that the above process doesn’t converge to a unique stationary distribution in a bipartite graph. (For example, what’s p1000
if a random walk starts at node J in the graph in Figure 11.100c? Node K?)
(u)
11.208 Let G = ⟨V, E⟩ be an arbitrary connected undirected graph. For any u ∈ V, define p(u) = degree 2|E|
. Prove that the probability
distribution p is a stationary distribution for the random walk on G.

(a) (b)
B E H (c) K L

A C D F G I J M

Figure 11.100 Some undirected graphs upon which a random walk can be performed.

1 1 3 5 11 21
0 2 4 8 16 32 64

1 2 1 2 3 6 5 10 11 22 21
1 0 0 2 4 4 8 8 16 16 32 32 64 64

Figure 11.101 The distribution after the first few steps of a random walk on a triangle.

ISTUDY
11.6 Chapter at a Glance 653

11.6 Chapter at a Glance


B D F
Formal Introduction
A H
A graph is a pair G = ⟨V, E⟩ where V is a set of ver-
tices or nodes, and E is a set of edges. In a directed graph, C E G

the edges E ⊆ V × V are ordered pairs of vertices; in an


B D F
undirected graph, the edges E ⊆ {{u, v} : u, v ∈ V} are
unordered pairs. A directed edge ⟨u, v⟩ goes from u to v; an A H

undirected edge ⟨u, v⟩ goes between u and v. We sometimes C E G


write ⟨u, v⟩ even for an undirected graphs. A simple graph
Figure 11.102 Two sample graphs.
has no parallel edges joining the same two nodes and also
has no self loops joining a node to itself. Two simple graphs, one undirected and one directed, are shown
in Figure 11.102.
For an edge e = ⟨u, v⟩, we say that u and v are adjacent; v is a neighbor of u; u and v are the endpoints of
e; and u and v are both incident to e. The neighborhood of a node u is {v : ⟨u, v⟩ ∈ E}, its set of neighbors.
The degree of u is the cardinality of u’s neighborhood. In a directed graph, the in-neighbors of u are the
nodes that have an edge pointing to u; the out-neighbors are the nodes to which u has an edge pointing;
and the in-degree and out-degree of u are the number of in- and out-neighbors, respectively.
An adjacency list stores a graph using an array with |V| entries; the slot for node u is a linked list of u’s
neighbors. An adjacency matrix stores the graph using a two-dimensional Boolean array of size |V| × |V|;
the value in ⟨row u, column v⟩ indicates whether the edge ⟨u, v⟩ exists.
Two graphs are isomorphic if they are identical except for the naming of the nodes. A subgraph of G
contains a subset V′ of G’s nodes and a subset E′ of G’s edges joining elements of V′ . An induced subgraph
is a subgraph in which every edge that joins elements of V′ is included in E′ . A complete graph or clique is
a graph Kn in which every possible edge exists. A bipartite graph is one in which nodes can be partitioned
into sets L and R such that every edge joins a node in L to a node in R. A regular graph is one in which every
node has identical degree. A planar graph is one that can be drawn on paper without any edges crossing.

Paths, Connectivity, and Distances


A path is a sequence of nodes ⟨v1 , v2 , . . . , vk ⟩, with k ≥ 1, where ⟨vi−1 , vi ⟩ ∈ E for every index
i ∈ {1, 2, . . . , k − 1}. The path is simple if each vi is distinct. This path has length k − 1—the number
of edges that it traverses—and is a path from v1 to vk .
In an undirected graph, nodes u and v are connected if there exists a path from u to v. A connected
component of G = ⟨V, E⟩ is a set S ⊆ V such that (i) every u ∈ S and v ∈ S are connected; and (ii) for
every w ∈ / S, the set S ∪ {w} does not satisfy condition (i). The entire graph is connected if it has only one
connected component, namely V.
In a directed graph, node u is reachable from node v if there exists a path from v to u; u and v are strongly
connected if each is reachable from the other. A strongly connected component is a set S of nodes such that
any two nodes in S are strongly connected and no node x ∈ / S is strongly connected to any node s ∈ S.
Connectivity can be tested efficiently using breadth-first search (BFS; see Figure 11.103) or depth-first
search (DFS) (which is similar, but storing newly discovered nodes at the front of Frontier rather than at
the back—in other words, treating it as a stack rather than as a queue). The distance from node s to node t
is the length of a shortest path from s to t. BFS can also be used to compute distances.

ISTUDY
654 Graphs and Trees

Breadth-First Search (BFS):


Input: a graph G = ⟨V, E⟩ and a source node s ∈ V
Output: the set of nodes reachable from s in G

1 Frontier := ⟨s⟩ // Frontier will be a list of nodes to process, in order.


2 Known := ∅ // Known will be the set of already-processed nodes.
3 while Frontier is nonempty:
4 u := the first node in Frontier
5 remove u from Frontier
6 for every neighbor v of u:
7 if v is in neither Frontier nor Known then
8 add v to the end of Frontier
9 add u to Known
10 return Known

Figure 11.103 Breadth-first search.

Trees
A cycle ⟨v1 , v2 , . . . , vk , v1 ⟩ is a path of length ≥ 2 from a node v1 back to itself that does not traverse the
same edge twice. The length of the cycle is k. The cycle is simple if each vi is distinct. Cycles can be
identified using BFS.
A graph is acyclic if it contains no cycles. Every acyclic graph has a node of degree 0 or 1. A tree is a
connected, acyclic graph. (A forest is any acyclic graph.) A tree has one more node than it has vertex. A
tree becomes disconnected if any edge is deleted; it becomes cyclic if any edge is added.
One node in a tree can be designated as the root. Every node other than the root has a parent (its neighbor
that’s closer to the root). If p is v’s parent, then v is one of p’s children. Two nodes with the same parent are
siblings. A leaf is a node with no children; an internal node is a node with children. The depth of a node
is its distance from the root; the height of the entire tree is the depth of deepest node. The descendants of
u are those nodes that go through u to get the root; the ancestors are those nodes through which u’s path to
the root goes. The subtree rooted at u is the induced subgraph consisting of u and all descendants of u.
All nodes in binary trees have at most two children, called left and right. A traversal of a binary tree
visits every node of the tree. An in-order traversal recursively traverses the root’s left subtree, visits the
root, and recursively traverses the root’s right subtree. A pre-order traversal visits the root and recursively
traverses the root’s left and right subtrees; a post-order traversal recursively traverses the root’s left and
right subtrees and then visits the root.
A spanning tree of a connected graph G = ⟨V, E⟩ is a graph T = ⟨V, E′ ⊆ E⟩ that’s a tree. A spanning
tree can by found by repeatedly identifying a cycle in G and deleting any edge in that cycle.

Weighted Graphs
In a weighted graph, each edge e has a weight we ∈ R≥0 . (Although graphs with negative edge weights
are possible, we haven’t addressed them in any detail.) The length of a path in a weighted graph is the
sum of the weights of the edges that it traverses. Shortest paths in weighted graphs can be found with
Dijkstra’s algorithm (Figure 11.91), which expands a set of nodes of known distance one by one. Minimum
spanning trees—spanning trees of the smallest possible total weight—in weighted graphs can be found with
Kruskal’s algorithm (Figure 11.95) or by repeatedly identifying a cycle in G and deleting the heaviest edge
in that cycle.

ISTUDY
11.6 Chapter at a Glance 655

Key Terms and Results

Key Terms Key Results


Formal Introduction Formal Introduction
• undirected and directed graphs 1 The “handshaking lemma”: for any undirected graph
P
• nodes/vertices, edges G = ⟨V, E⟩, we have u∈V degree(u) = 2|E|.
• parallel edges, self loops 2 Representing G with an adjacency matrix requires
• simple graphs Θ(|V|2 ) space; we can answer “what are all of u’s
• adjacent node, incident edge neighbors?” in Θ(|V|) time and “is there an edge
• (in/out-)neighbors, neighborhood between u and v?” in Θ(1) time. Representing
• (in/out-)degree G = ⟨V, E⟩ with an adjacency list requires Θ(|V| + |E|)
• adjacency list, adjacency matrix space; both questions take 1 + Θ(degree(u)) time.
• isomorphic graphs
• subgraphs Paths, Connectivity, and Distances
• complete, bipartite, regular,
1 Connectivity can be tested using breadth-first search
and planar graphs
(BFS) (Figure 11.46b) or depth-first search (DFS)
(Figure 11.49b). BFS can also be used to compute the
Paths, Connectivity, and Distances
distance between nodes in a graph, and it runs in
• path Θ(|V| + |E|) time.
• connected (nodes), connected (graph)
• connected component Trees
• reachability
1 Any tree with n nodes has exactly n − 1 edges. Adding
• strongly connected component
any edge to a tree creates a cycle; deleting any edge
• shortest path/distance disconnects the graph.
• breadth-first search (BFS) 2 A spanning tree of a graph G can by found by repeatedly
• depth-first search (DFS) identifying a cycle in G and deleting an arbitrary edge in
that cycle.
Trees
• cycle Weighted Graphs
• tree, forest 1 Shortest paths in weighted graphs can be found with
• root, leaf, internal node Dijkstra’s algorithm (Figure 11.91) if all edges have
• child, parent, ancestor, descendant nonnegative weights.
• depth, height, subtree 2 Minimum spanning trees in weighted graphs can be
• spanning tree found with Kruskal’s algorithm (Figure 11.95) or by
repeatedly identifying a cycle in G and deleting the
Weighted Graphs heaviest edge in that cycle.
• Dijkstra’s algorithm
• minimum spanning trees
• Kruskal’s algorithm

ISTUDY
ISTUDY
12 Looking Forward

In which our heroes reflect on where they have been, and look
forward to where they will travel in the future.

ISTUDY
657
658 Looking Forward

We have always held to the hope, the belief, the conviction, that there is a better life, a
better world, beyond the horizon.
Franklin D. Roosevelt (1882–1945)
radio address on hemisphere defense from Dayton, Ohio (1940)

This book has introduced the mathematical foundations of computer science—the conceptual building
blocks of, among other things, the large, complex computational systems that have become central aspects
of our daily lives, some of which have already genuinely and meaningfully improved the world in their own
unique ways, profound and small. Understanding and reasoning about these fundamental building blocks
is necessary for you to understand, develop, and evaluate the key ideas of these many new applications of
computer science, and introducing these foundations has been the underlying goal of this book.
We’ve seen introductions to a wide variety of those applications as we’ve gone along, but perhaps the sin-
gle best example is the idea of paired kidney transplants (p. 636). The technical background that enables
the paired-transplant system uses foundational concepts from at least five chapters of this book and, far
more importantly, saves literally hundreds to thousands of lives per year. There are plenty of other exam-
ples: improved forecasting and communication systems give earlier warnings about an impending natural
disaster; tools facilitate rerouting unused restaurant food to food banks instead of landfills; screen readers
allow visually impaired people to access information that would otherwise have been unavailable except
at great expense.
There are two requirements for these interventions to have worked as well as they have: first, having the
underlying idea for the system, and, second, implementing it correctly. It’s possible, of course, to conceive
of a computational approach to a problem that, if solved properly, could have this same kind of profoundly
positive impact on the world—but to execute that solution buggily. The Therac-25 system (p. 207) was
a paradigmatic example of this phenomenon: a medical device built to treat tumors that, because of a
particular fault that occurred when the machine’s operator was too fast a typist, ended up killing patients
instead of helping them to heal.
But if it’s bad to implement good things buggily, it’s worse to do bad things well. There is simply
no circumstance in which ransomware or a tool for phishing that attempts to steal someone’s financial
credentials, for example, should be built. Aside from these very rare cases in which someone conceives of
a computational idea that’s unmitigatedly awful, though, nearly every idea has both benefits and costs, and
the ethically complex questions are all about the tradeoffs between them.
Bitcoin—a digital currency built around cryptography using what’s called a “blockchain”—allows indi-
viduals to form a fully decentralized financial system, without needing to trust (or fund) big banks. And
yet the bitcoin “mining” and transaction-verification processes by which the system works are extremely
energy intensive, and environmental consequences are potentially large.
More generally, cryptographic systems like Bitcoin, Tor (p. 232), and RSA encryption (Section 7.5) are
intended to allow communication and transactions that are fully private. And yet that privacy guarantee
also implies untraceability even if the content is about the trafficking of money or weapons or people.
Meanwhile, systems based on image processing can provide security on a smartphone, or take readings
of blood oxygenation levels for rapid diagnosis, or personalize information of particular interest to you
as you move through the world. And yet those same systems can make stalking or government surveil-
lance easier, and the systems that have actually been deployed have tended to have gender- or race-biased
outcomes. (For example, automated systems that seek to caption images have been shown to exacer-
bate gender stereotyping even beyond the disparities present in training data [134]; Black men have been
arrested for crimes that they did not commit, based on erroneous facial recognition [54]; and the accuracy

ISTUDY
Looking Forward 659

ELIZA: I am the psychotherapist. Please, describe your problems.


ME: I am worried about the future of the world.
ELIZA: Are you worried often?
ME: Yes.
ELIZA: I see... Well, why do you say this is so?
ME: I think that technology can be used for a great deal of good and it often is, but sometimes it's
used for malice and sometimes it's developed to try to do something good but is implemented in a way
that makes the world worse.
ELIZA: Is it because sometimes it's used for malice and sometimes it's developed to try to do
something good but is implemented in a way that makes the world worse that you came to me?
[...]

Figure 12.1 A conversation with a version of ELIZA (the version embedded in a recent version of the program Emacs).

of commercially released image-processing systems is highest for lighter-skinned male faces and lowest
for darker-skinned female faces [24].)
It would take a whole other book to do justice to the question of how to appropriately weigh the benefits
and costs of potential computational innovations. But there are a few general categories of questions that are
helpful to consider as you’re thinking about these tradeoffs. Who would be served well by the system that
you’re thinking about building? Who would be served less well? Who would it ignore? Who might it harm?
What are the risks: what could go wrong, and what are the potential unintended consequences of deploying
the system? (For another way of thinking about these concerns: the Association for Computing Machinery,
one of the main professional societies for computer scientists, has published a Code of Ethics and Profes-
sional Conduct, https://www.acm.org/code-of-ethics, that articulates the underlying principles as
statements of responsibility, rather than as questions to consider.)

In the mid-1960s, the German-American computer scientist Joseph Weizenbaum (1923–2008) devel-
oped a computer system that could carry out a psychiatrist-like conversation with a human user [130]—a
1960s version of a chatbot. The system, called ELIZA, was rather technically simple: it “knew” a collection
of scripted patterns of communication and some grammatical rules for converting a user’s statement into a
question—as in responding to I wish I were more social with Do you really think you wish you were more
social? (Technically, it used the kinds of grammar-based tools discussed on p. 259.) See Figure 12.1 for a
sample dialogue with ELIZA.
After Weizenbaum built ELIZA, he observed people using it—and, surprisingly often, confiding deeply
personal information in conversations with it—and he found himself troubled by the ways in which humans
acted as though it were truly intelligent. This experience started Weizenbaum onto a more philosophical
track in thinking about artificial intelligence specifically, and computing in general, and eventually turned
him into a skeptic of the expanding role of computers in life. In 1976, he published a book entitled Computer
Power and Human Reason [131], expressing some of this skepticism about what role computers play in
society, in ways that feel prescient today:

[Another] kind of computer application that ought to be avoided, or at least not undertaken without very careful fore-
thought, is that which can easily be seen to have irreversible and not entirely foreseeable side effects. If, in addition, such
an application cannot be shown to meet a pressing human need that cannot readily be met in any other way, then it ought
not to be pursued. (p. 270)

ISTUDY
660 Looking Forward

(In the latter category, Weizenbaum—who, with his German Jewish family, was forced to leave Nazi Ger-
many in 1936—talks about speech recognition as an example, pointing out the ways that government
surveillance would become so much easier in the presence of automatic speech recognition. He’d have felt
even more strongly about facial recognition, presumably.)
Earlier in the book, Weizenbaum describes what he believes to be appropriate boundaries for the role of
computation in our lives:
What I conclude here is that the relevant issues are neither technological nor even mathematical; they are ethical. They
cannot be settled by asking questions beginning with “can.” The limits of the applicability of computers are ultimately
statable only in terms of oughts. What emerges as the most elementary insight is that, since we do not now have any ways
of making computers wise, we ought not now to give computers tasks that demand wisdom. (p. 227)

If this book has done its job, you are by now equipped with the key mathematical foundations necessary
to create new ideas, new insights, new applications of computer science. There are so very many ways in
which the world can be made better through those new contributions. (And some that make it worse: as
Weizenbaum wrote, “a person falling into a manhole is rarely helped by making it possible for him to fall
faster or more efficiently.”) May you, and may we all, choose to build the things that make the world better.

ISTUDY
References

[1] Harold Abelson and Gerald Jay Sussman divide-and-conquer recurrences. ACM SIGACT
with Julie Sussman. Structure and News, 12(3):36–44, 1980.
Interpretation of Computer Programs. MIT [14] Ambrose Bierce. The Devil’s Dictionary.
Press/McGraw-Hill, 2nd edition, 1996. Neale, New York, 1911.
[2] David Abraham, Avrim Blum, and Tuomas [15] Burton Bloom. Space/time trade-offs in hash
Sandholm. Clearing algorithms for barter coding with allowable errors. Communications
exchange markets: Enabling nationwide kidney of the ACM, 13(7):422–426, July 1970.
exchanges. In Proceedings of the 8th ACM [16] Paul Boersma and David Weenink. Praat: doing
Conference on Electronic Commerce, pages phonetics by computer.
295–304, 2007. http://www.praat.org, 2012. Version
[3] A. Adelson-Velskii and E. M. Landis. An 5.3.22.
algorithm for the organization of information. [17] Tolga Bolukbasi, Kai-Wei Chang, James Zou,
Proceedings of the USSR Academy of Sciences, Venkatesh Saligrama, and Adam Kalai. Man is
146:263–266, 1962. to computer programmer as woman is to
[4] Manindra Agrawal, Neeraj Kayal, and Nitin homemaker? Debiasing word embeddings. In
Saxena. Primes is in P. Annals of Mathematics, Proceedings of the 29th International
160:781–793, 2004. Conference on Neural Information Processing
[5] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Systems, volume 29, pages 4349–4357, 2016.
Jeffrey D. Ullman. Compilers: Principles, [18] Katy Börner. Atlas of Science: Visualizing
Techniques, and Tools. Prentice Hall, 2nd What We Know. MIT Press, 2010.
edition, 2006. [19] Otakar Borůvka. O jistém problému
[6] Martin Aigner and Günter Ziegler. Proofs from minimálním. Práca Moravské Prírodovedecké
The Book. Springer, 4th edition, 2009. Spolecnosti, 3(3):37–58, 1926.
[7] Julia Angwin, Jeff Larson, Surya Mattu, and [20] Prosenjit Bose, Hua Guo, Evangelos Kranakis,
Lauren Kirchner. Machine bias. ProPublica, 23 Anil Maheshwari, Pat Morin, Jason Morrison,
May 2016. Michiel Smid, and Yihui Tang. On the
[8] Kenneth Appel and Wolfgang Haken. Solution false-positive rate of Bloom filters. Information
of the four color map problem. Scientific Processing Letters, 108(4):210–213, 2008.
American, 237(4):108–121, October 1977. [21] Sergei Brin and Larry Page. The anatomy of a
[9] David Appell. The sun will eventually engulf large-scale hypertextual web search engine. In
Earth—maybe. Scientific American, September 7th International World Wide Web Conference,
2008. pages 107–117, 1998.
[10] Kenneth Arrow. Social Choice and Individual [22] Andrei Broder, Ravi Kumar, Farzin Maghoul,
Values. Wiley, 1951. Prabhakar Raghavan, Sridhar Rajagopalan,
[11] Frank Arute et al. Quantum supremacy using a Raymie Stata, Andrew Tomkins, and Janet
programmable superconducting processor. Wiener. Graph structure in the web. Computer
Nature, 574(7779):505–510, 2019. Networks, 33(1–6):309–320, 2000.
[12] Shai Avidan and Ariel Shamir. Seam carving [23] Stephen Budiansky. Journey to the Edge of
for content-aware image resizing. In ACM Reason: The Life of Kurt Gödel. Oxford
SIGGRAPH, 2007. University Press, 2021.
[13] Jon Louis Bentley, Dorothea Haken, and [24] Joy Buolamwini and Timnit Gebru. Gender
James B. Saxe. A general method for solving shades: Intersectional accuracy disparities in

ISTUDY
662 References

commercial gender classification. In large clusters. Communications of the ACM,


Conference on Fairness, Accountability and 51(1):107–113, 2008.
Transparency, pages 77–91, 2018. [38] Whitfield Diffie and Martin Hellman. New
[25] Aylin Caliskan, Joanna J. Bryson, and Arvind directions in cryptography. IEEE Transactions
Narayanan. Semantics derived automatically on Information Theory, pages 644–654,
from language corpora contain human-like November 1976.
biases. Science, 356(6334):183–186, 2017. [39] Roger Dingledine, Nick Mathewson, and Paul
[26] Murray Campbell, A. Joseph Hoane Jr., and Syverson. Tor: The second-generation onion
Feng-hsiung Hsu. Deep Blue. Artificial router. Technical report, Naval Research Lab
Intelligence, 134:57–83, 2002. Washington DC, 2004.
[27] Dorwin Cartwright and Frank Harary. [40] Robin Dunbar. How Many Friends Does One
Structural balance: a generalization of Heider’s Person Need?: Dunbar’s Number and Other
theory. Psychological Review, 63(5):277–293, Evolutionary Quirks. Harvard University Press,
1956. 2010.
[28] Alhaji Cherif, Nadja Grobe, Xiaoling Wang, [41] Cynthia Dwork, Frank McSherry, Kobbi
and Peter Kotanko. Simulation of pool testing Nissim, and Adam Smith. Calibrating noise to
to identify patients with coronavirus disease sensitivity in private data analysis. In Theory of
2019 under conditions of limited test Cryptography Conference, pages 265–284,
availability. JAMA Network Open, 2006.
3(6):e2013075–e2013075, June 2020. [42] Cynthia Dwork and Aaron Roth. The
[29] Ken Christensen, Allen Roginsky, and Miguel algorithmic foundations of differential privacy.
Jimeno. A new analysis of the false positive Foundations and Trends in Theoretical
rate of a Bloom filter. Information Processing Computer Science, 9(3–4):211–407, 2014.
Letters, 110:944–949, 2010. [43] David A. Easley and Jon M. Kleinberg.
[30] Edgar F. Codd. A relational model of data for Networks, Crowds, and Markets: Reasoning
large shared data banks. Communications of the About a Highly Connected World. Cambridge
ACM, 13(6):377–387, 1970. University Press, 2010.
[31] Stephen Cook. The complexity of theorem [44] Michael Eisen. Amazon’s $23,698,655.93 book
proving procedures. In Proceedings of the about flies. it is NOT junk blog, April 2011.
Third Annual ACM Symposium on Theory of https:
Computing, pages 151–158, 1971. //www.michaeleisen.org/blog/?p=358,
[32] James W. Cooley and John W. Tukey. An retrieved 16 August 2021.
algorithm for the machine calculation of [45] Jacob Eisenstein. Introduction to Natural
complex Fourier series. Mathematics of Language Processing. MIT Press, 2019.
Computation, 19(90):297–301, 1965. [46] Scott L. Feld. Why your friends have more
[33] Thomas H. Cormen, Charles E. Leisersen, friends than you do. American Journal of
Ronald L. Rivest, and Clifford Stein. Sociology, 96(6):1464–1477, May 1991.
Introduction to Algorithms. MIT Press, 3rd [47] Judith Flanders. A Place for Everything: The
edition, 2009. Curious History of Alphabetical Order. Basic
[34] Thomas M. Cover and Joy A. Thomas. Books, 2020.
Elements of Information Theory. Wiley, 1991. [48] Robert W. Floyd. Assigning meanings to
[35] Jeffrey Dastin. Amazon scraps secret AI programs. In Proceedings of Symposia in
recruiting tool that showed bias against women. Applied Mathematics XIX, American
Reuters, 10 October 2018. Mathematical Society, pages 19–32, 1967.
[36] Mark de Berg, Marc van Kreveld, Mark [49] Simpson Garfinkel. History’s worst software
Overmars, and Otfried Schwarzkopf. bugs. Wired Magazine, 2005.
Computational Geometry. Springer-Verlag, 2nd [50] W. H. Gates and C. H. Papadimitriou. Bounds
edition, 2000. for sorting by prefix reversals. Discrete
[37] Jeffrey Dean and Sanjay Ghemawat. Mathematics, 27:47–57, 1979.
MapReduce: simplified data processing on

ISTUDY
References 663

[51] Alexander George. Letter to the editor. The Natural Language Processing, Computational
New Yorker, page 12, 24 December 2007. Linguistics, and Speech Recognition. Pearson
[52] Oded Goldreich. Foundations of Cryptography. Prentice Hall, 2nd edition, 2008.
Cambridge University Press, 2006. [67] Frank Kafka. “Fürsprecher” [“Advocates”],
[53] R. W. Hamming. Error detecting and error c. 1922. Translation by Tania and James Stern.
correcting codes. The Bell System Technical Available in Franz Kafka: The Complete
Journal, XXIX(2):147–160, April 1950. Stories. Edited by Nahum Glatzer. New York:
[54] Kashmir Hill. Wrongfully accused by an Schocken, 1971, pp. 449–451.
algorithm. The New York Times, 24 June 2020. [68] Richard M. Karp. Reducibility among
[55] C. A. R. Hoare. An axiomatic basis for combinatorial problems. In Complexity of
computer programming. Communications of Computer Computations, pages 85–103.
the ACM, 12(10):576–585, October 1969. Springer, 1972.
[56] Douglas Hofstadter. Gödel, Escher, Bach: An [69] Jonathan Katz and Yehuda Lindell.
Eternal Golden Braid. Vintage, 1980. Introduction to Modern Cryptography.
[57] Douglas Hofstadter. Le Ton Beau de Marot: In Chapman & Hall/CRC Press, 2007.
Praise of the Music of Language. Basic Books, [70] Stefanos Kaxiras and Margaret Martonosi.
1998. Computer Architecture Techniques for
[58] Michael Huber and V. Frederick Rickey. What Power-Efficiency. Morgan Claypool, 2008.
is 00 ? Convergence, July 2012. [71] Alfred B. Kempe. On the geographical problem
https://www.maa.org/press/ of the four colours. American Journal of
periodicals/convergence/what-is-00. Mathematics, 2(3):193–200, 1879.
[59] David A. Huffman. A method for the [72] Donald E. Knuth. The Art of Computer
construction of minimum-redundancy codes. Programming: Seminumerical Algorithms
Proceedings of the IRE, 40(9):1098–1101, (Volume 2). Addison-Wesley Longman, 3rd
1952. edition, 1997.
[60] G. E. Hughes and M. J. Cresswell. A New [73] Dexter Kozen. Automata and Computability.
Introduction to Modal Logic. Routledge, 1996. Springer, 1997.
[61] John F. Hughes, Andries van Dam, Morgan [74] Joseph Kruskal. On the shortest spanning
McGuire, David F. Sklar, James D. Foley, subtree of a graph and the traveling salesman
Steven K. Feiner, and Kurt Akeley. Computer problem. Proceedings of the American
Graphics: Principles and Practice. Mathematical Society, 7:48–50, 1956.
Addison-Wesley, 3rd edition, 2013. [75] James F. Kurose and Keith W. Ross. Computer
[62] Tobias Isenberg, Knut Hartmann, and Henry Networking: A Top-Down Approach.
König. Interest value driven adaptive Addison–Wesley, 6th edition, 2013.
subdivision. In Simulation and Visualisation [76] Jure Leskovec, Anand Rajaraman, and Jeff
(SimVis), pages 139–149. SCS European Ullman. Mining of Massive Datasets.
Publishing House, 2003. Cambridge University Press, 2nd edition, 2014.
[63] P. Jaccard. Distribution de la flore alpine dans [77] Nancy Leveson. Safeware. Pearson, 1995.
le bassin des dranses et dans quelques régions [78] Nancy Leveson. Engineering a Safer World.
voisines. Bulletin de la Société Vaudoise des MIT Press, 2016.
Sciences Naturelles, 37:241–272, 1901. [79] Leonid Levin. Universal search problems.
[64] Karen Spärck Jones. A statistical interpretation Problems of Information Transmission,
of term specificity and its application in 9(3):265–266, 1973.
retrieval. Journal of Documentation, 28:11–21, [80] J. L. Lions. Ariane 5 flight 501 failure report:
1972. Report by the enquiry board, 1996.
[65] Richard Jones. Garbage Collection: Algorithms [81] Elisha Scott Loomis. The Pythagorean
for Automatic Dynamic Memory Management. Proposition. National Council of Teachers of
Wiley, 1996. Mathematics, June 1968.
[66] Daniel Jurafsky and James H. Martin. Speech [82] Joel Lovell. Left-hand-turn elimination. The
and Language Processing: An Introduction to New York Times, 9 December 2007.

ISTUDY
664 References

[83] David J. C. MacKay. Information Theory, [97] Christos H. Papadimitriou. Computational


Inference and Learning Algorithms. Cambridge Complexity. Addison Wesley, 1994.
University Press, 2003. [98] David A. Patterson and John L. Hennessy.
[84] Christopher D. Manning, Prabhakar Raghavan, Computer Organization and Design: the
and Hinrich Schütze. Introduction to Hardware/Software Interface. Morgan
Information Retrieval. Cambridge University Kaufmann, 4th edition, 2008.
Press, 2008. [99] Nick Paumgarden. The names. The New Yorker,
[85] Steve Martin. Born Standing Up: A Comic’s 16 May 2011.
Life. Simon & Schuster, 2008. [100] Caroline Criado Perez. Invisible Women:
[86] N. David Mermin. Quantum Computer Exposing Data Bias in a World Designed for
Science: An Introduction. Cambridge Men. Random House, 2019.
University Press, 2007. [101] Ivars Peterson. MathTrek: Pentium bug
[87] Tomas Mikolov, Ilya Sutskever, Kai Chen, revisited. MAA Online, May 1997.
Greg Corrado, and Jeffrey Dean. Distributed [102] Madsen Pirie. How to Win Every Argument:
representations of words and phrases and their The Use and Abuse of Logic. Continuum, 2007.
compositionality. In Proceedings of the 26th [103] George Pólya. How to Solve It. Doubleday,
International Conference on Neural 1957.
Information Processing Systems, pages [104] William Press, Saul Teukolsky, William
3111–3119, 2013. Vetterling, and Brian Flannery. Numerical
[88] Stanley Milgram. The small world problem. Recipes. Cambridge University Press, 3rd
Psychology Today, 1:61–67, May 1967. edition, 2007.
[89] Gary L. Miller. Riemann’s hypothesis and tests [105] Richard Preston. The Ebola wars. The New
for primality. Journal of Computer and System Yorker, 27 October 2014.
Sciences, 13(3):300–317, 1976. [106] Michael O. Rabin. Probabilistic algorithm for
[90] Michael Mitzenmacher and Eli Upfal. testing primality. Journal of Number Theory,
Probability and Computing: Randomized 12(1):128–138, 1980.
Algorithms and Probabilistic Analysis. [107] R. L. Rivest, A. Shamir, and L. Adleman. A
Cambridge University Press, 2005. method for obtaining digital signatures and
[91] Gordon E. Moore. Cramming more public-key cryptosystems. Communications of
components onto integrated circuits. the ACM, 21:120–126, February 1978.
Electronics, 38(8):114–117, April 1965. [108] Tomas Rokicki, Herbert Kociemba, Morley
[92] Gordon E. Moore. No exponential is forever: Davidson, and John Dethridge. The diameter of
but “forever” can be delayed! In International the Rubik’s cube group is twenty. SIAM
Solid-State Circuits Conference, pages 20–23, Review, 56(4):645–670, 2014.
2003. [109] Mike Rosulek. The Joy of Cryptography.
[93] Rajeev Motwani and Prabhakar Raghavan. Oregon State, 2020.
Randomized Algorithms. Cambridge University [110] Walter Rudin. Principles of Mathematical
Press, 1995. Analysis. McGraw–Hill, 3rd edition, 1976.
[94] National Academies of Sciences, Engineering, [111] Stuart Russell and Peter Norvig. Artificial
and Medicine. Securing the Vote: Protecting Intelligence: A Modern Approach. Prentice
American Democracy. National Academies Hall, 3rd edition, 2009.
Press, 2018. [112] R. M. Sainsbury. Paradoxes. Cambridge
[95] Martin Newell, Richard Newell, and Tom University Press, 3rd edition, 2009.
Sancha. A solution to the hidden surface [113] Jonathan Schaeffer, Neil Burch, Yngvi
problem. In Proceedings of the ACM Annual Bjornsson, Akihiro Kishimoto, Martin Muller,
Conference, pages 443–450, August 1972. Rob Lake, Paul Lu, and Steve Sutphen.
[96] Sydney Padua. The Thrilling Adventures of Checkers is solved. Science,
Lovelace and Babbage: The (Mostly) True 317(5844):1518–1522, 14 September 2007.
Story of the First Computer. Pantheon Books,
2015.

ISTUDY
References 665

[114] Adi Shamir. How to share a secret. [126] Amanda L. Traud, Peter J. Mucha, and
Communications of the ACM, 22(11):612–613, Mason A. Porter. Social structure of Facebook
November 1979. networks. CoRR, abs/1102.2166, 2011.
[115] Claude E. Shannon. A mathematical theory of [127] A. M. Turing. On computable numbers, with an
communication. Bell System Technical Journal, application to the entscheidungsproblem.
27:379–423, 1948. Proceedings of the London Mathematical
[116] Margot Lee Shetterly. Hidden Figures: The Society, s2-42(1):230–265, 1937.
American Dream and the Untold Story of the [128] Marilyn vos Savant. Ask Marilyn. Parade
Black Women Who Helped Win the Space Race. Magazine, 9 September 1990.
William Morrow and Company, 2016. [129] Marilyn vos Savant. Ask Marilyn. Parade
[117] Peter Shor. Polynomial-time algorithms for Magazine, 2 December 1990.
prime factorization and discrete logarithms on a [130] Joseph Weizenbaum. ELIZA: a computer
quantum computer. SIAM Review, program for the study of natural language
41(2):303–332, 1999. communication between man and machine.
[118] Avi Silberschatz, Henry F. Korth, and Communications of the ACM, 9(1):36–45,
S. Sudarshan. Database System Concepts. January 1966.
McGraw-Hill, 6th edition, 2010. [131] Joseph Weizenbaum. Computer Power and
[119] Simon Singh. The Code Book: The Secret Human Reason: From Judgment to
History of Codes and Code-breaking. Fourth Calculation. W. H. Freeman & Co, 1976.
Estate Ltd., 1999. [132] Virginia Vassilevska Williams. An overview of
[120] Simon Singh. Fermat’s Last Theorem: The the recent progress on matrix multiplication.
Story of a Riddle That Confounded the World’s ACM SIGACT News, 43(4), December 2012.
Greatest Minds for 358 Years. Fourth Estate [133] Wayne Zachary. An information flow model for
Ltd., 2002. conflict and fission in small groups. Journal of
[121] Michael Sipser. Introduction to the Theory of Anthropological Research, 33(4):452–473,
Computation. Course Technology, 3rd edition, 1977.
2012. [134] Jieyu Zhao, Tianlu Wang, Mark Yatskar,
[122] Laura Stark. Behind Closed Doors: IRBs and Vicente Ordonez, and Kai-Wei Chang. Men
the Making of Ethical Research. University of also like shopping: Reducing gender bias
Chicago, 2011. amplification using corpus-level constraints. In
[123] Tom Stoppard. Rosencrantz and Guildenstern Proceedings of the 2017 Conference on
are Dead. Grove/Atlantic, Inc., 1967. Empirical Methods in Natural Language
[124] Latanya Sweeney. Simple demographics often Processing, pages 2979–2989, 2017.
identify people uniquely. Data Privacy Working [135] Jacob Ziv and Abraham Lempel. A universal
Paper, Carnegie Mellon University, 2000. algorithm for sequential data compression.
[125] T. Taylor, G. VanDyk, L. Funk, R. Hutcheon, IEEE Transactions on Information Theory,
and S. Schriber. Therac 25: A new medical 23(3):337–343, 1977.
accelerator concept. IEEE Transactions on
Nuclear Science, 30(2):1768–1771, 1983.

ISTUDY
ISTUDY
Index

2–3 and 2–3–4 trees, 261 associativity, 99, 261, 362 bipartite graphs, 594 ff.
9/11 Memorial, 600 assuming the antecedent, see proofs complete bipartite graphs, 595
123456791, 378 asymmetry, 404 ff. birthday paradox, 242, 559
987654263, 378 asymptotics Bitcoin, 658
analysis of algorithms, 283 ff. bitmaps, 49
asymptotic analysis, 269 ff. bits/bitstrings, 7, 45, 94, 146 ff., 371, 442,
∀ (universal quantifier), 111 ff.
asymptotic relationships viewed as 466, 482 ff., 500
absolute value, 9, 170, 172
relations, 408 ff. Bletchley Park, 497
abstract algebra, 362
best- and average-case running time, blockchain, 658
abstraction, 144
289 ff. Bloom filters, 547
ACM Code of Ethics and Professional
divide and conquer, 314 ff. Bob smells, 232
Conduct, 659
O (Big O), 269 ff. Booleans, 7, 83, see also logic
ACM Conference on Fairness,
o, Ω, ω , and Θ, 274 ff. bound (vs. free) variables, 114, 122
Accountability, and Transparency, 399
polynomials, logs, and exponentials, breadth-first search, 612 ff.
adjacency, see graphs
273 ff. finding cycles, 625
affirming the consequent, see fallacy
recurrence relations, 300 ff. brute force, see algorithms
algorithmic bias, 53, 210, 399, 658
worst-case analysis, 284 ff. Bubble Sort, see sorting
algorithmic sentencing, 399
automata, 430, 479 Buffon’s needle, 566
algorithms, 70 ff., see also randomized
automated theorem proving, 167 bugs, 21, 207, 231, 605
algorithms
average distance in a graph, 621 butterfly ballots, 414
asymptotic analysis, 283 ff.
average-case analysis, see running time
brute force, 104, 229, 440, 496
AVL trees, 309 ff.
divide and conquer, 314 ff., 321 C (programming language), 105, 122, 251
axiom of extensionality, 32
dynamic programming, 229, 440, 496 Caesar cipher, see cryptography
greedy algorithms, 165, 455 cardinality, 26–27, 441 ff.
recurrence relations, 300 ff. Bacon, Kevin, 180, 593 infinite, 474
time, space, and power, 292 balanced binary search trees, 309 Carmichael numbers, 367, 368, 370
Alice and Bob, 371 ff. Bayes’ Rule, 540 ff. Cartesian plane, 44
ambiguity begging the question, see fallacy Cartesian product (×), 42
in natural language, 86, 87, 92 Bernoulli distribution, 518 ff., 552, 563 catchphrase, 415, 642
order of operations, 259, 389 Bernoulli’s inequality, 235 Cauchy sequences, 421
order of quantification, 129 ff., 139 betweenness, 396 ceiling, 9
prefix-free/Huffman codes, 455 BFS, see breadth-first search cellular automata, 479
analysis (mathematics), 421 bias, see algorithmic bias Chain Rule (probability), 538 ff.
ancestors (in a tree), 628 biased coins, 518 ff. change of index, 17
and (∧), 83 big O, big Ω, and big Θ, 269 ff., 408 ff. checkers, 121, 181, 462
anonymization, 521 bigrams, 543 checksum, 145, 154
antisymmetry, 404 ff. bijections, 67, 465, 474 chess, 42, 121, 234, 451, 460, 611
approximate equality, 9 binary numbers, see integers children (in a tree), 628
Ariane 5 rocket, 207 binary relation, see relations Chinese Remainder Theorem, 351 ff.
arithmetic mean, 182, 199 Binary Search, see searching circle packing, 159
arithmetic series, 17, 226, 233 binary search trees, see trees circuits, 80, 94
Arrow’s Theorem, 414 binary symmetric channel, 539, 540 printing and planar graphs, 597
artificial intelligence, 659 binary trees, see trees representing logical propositions, 100,
computer vision, 608 binomial coefficients, see combinations 107
game trees, 121, 478 binomial distribution, 519 ff., 556 using nand gates, 188
assertions, 137, 231 Binomial Theorem, 490 ff. class-size paradox, 552

ISTUDY
667
668 Index

cliques, 593 ff. rotation matrices, 54 Diffie–Hellman key exchange, 379


closure, 362, 409 ff. triangulation, 244 digital signatures, 374
clustering, 38 computer security, 378, 379 Enigma machine and WWII, 497
coarsening equivalence relations, 421 ff. computer vision, 608 frequency analysis, 531, 545
codomain (of a function), 61 computing networking, 456 key exchange, 379
collaboration networks, 593 conditional expectation, 561 ff. man-in-the-middle attack, 379
collaborative filtering, 40 conditional independence, 537 one-time pads, 371
combinations, 482 ff. conditional probability, 532 ff. public-key cryptography, 373 ff.
k-combinations, 485 ff. Bayes’ Rule, 540 RSA cryptosystem, 197, 373 ff.
Binomial Theorem, 490 ff. Chain Rule, 538 secret sharing, 356
Pascal’s identity, 489, 493 Law of Total Probability, 539 substitution cipher, 530, 536, 545
Pascal’s triangle, 493 Condorcet paradox, 414 Currying, 135
combinatorial proof, 488 ff. congruences (modular), 332 ff., 352 ff., cycles, 424, 623 ff.
commutativity, 51, 99, 130, 261, 362 420 acyclic graphs, 625 ff.
comparability, see partial orders conjunctive normal form, 101 ff., 184 ff., Cycle Elimination algorithm, 634
comparison-based sorting, see sorting 256 ff. cycle rule for minimum spanning trees,
compilers, 60, 105, 259 connectivity (in graphs), 606 ff. 647
complement (of a set), 30 connected component, 607 ff. kidney transplants, 636
complete graphs, 593 ff. reachability, 609 ff. simple cycles, 624
complexity, see computational complexity constructive proofs, 175 Weighted Cycle Elimination algorithm,
composite numbers, 15, see also prime constructivism, 176 647
numbers context-free grammar, 259
composition contradiction, 96 DAG (directed acyclic graph), 626
of functions, 64, 395 contrapositive, 98, 171, see also proofs data mining, see machine learning
of relations, 391, 407 converse, 98 data privacy, 521
compression Cook–Levin Theorem, 104 data visualization, 586, 600
entropy and compressibility, 523 correlation, 527 databases, 124, 398, 401
Huffman coding, 455, 475 correlation vs. causation, 206 De Morgan’s Laws, 100
impossibility of lossless compression, positive and negative, 530 decision problems, 191
475 countable sets, 474 Deep Blue, 121
lossy vs. lossless, 475 counterexamples, 175 ff. degree (in a graph), 583 ff., 585
quantization of images, 60, 73 counting degree distribution, 599
URL shortening, 445 Binomial Theorem, 490 ff. regular graphs, 596
computability, 192 combinations, 482 ff. degree (of a polynomial), 69
computational biology combinatorial proofs, 488 ff. density (of a graph), 281, 602
genome rearrangements, 137, 479 combining products and sums, 453 ff. denying the hypothesis, see fallacy
motifs in gene networks, 592 Division Rule, 467 ff. dependent events, 527 ff.
computational complexity double counting, 446 ff. depth-first search, 615 ff.
and cryptography, 378 Generalized Product Rule, 450 ff. Descartes, René, 42
complexity classes, 294 Inclusion–Exclusion, 446 ff. descendants (in a tree), 628
graph isomorphism, 591 for 3+ sets, 449 deterministic finite automata, 430
input size, 332 Mapping Rule, 463 ff. DFS, see depth-first search
P vs. NP, 104 order, 483 ff. diagonalization, 474
regular languages, 415, 430 Pascal’s triangle, 493 ff. diameter (of a graph), 620
computational geometry, 56 permutations, 484 ff. differential privacy, 521
computational linguistics, see natural Pigeonhole Principle, 471 ff. Diffie–Hellman key exchange, 379
language processing Product Rule (sequences), 444 Dijkstra’s algorithm, 642 ff.
computer architecture, 101 ff., 188 repetition, 483 ff. directed graphs, 402
and running times, 284 Sum Rule (unions), 441 disconnected, see connectivity in graphs
Moore’s Law, 279 Counting Sort, see sorting disjoint sets, 34, 159
power consumption, 292 coupon collector problem, 570 disjunctive normal form, 101 ff., 184 ff.,
representation of numbers, 21 crossword puzzles, 136 256 ff.
computer graphics cryptography, 232, 371 ff., 414, 658 distance, see also metrics
hidden-surface removal, 431 and pseudorandomness, 517 Euclidean, see Euclidean distance
morphing, 57 Caesar cipher, 372, 545 Hamming, see Hamming distance

ISTUDY
Index 669

in a graph, 611 ff. correlated, 527 formal language theory, 415, see
Manhattan, see Manhattan distance independent events, 527 ff. computational complexity
minimum distance of a code, 149 ff. exclusive or (⊕), 15, 86 ff., 154 ff. formal methods, 167, 409
divide and conquer, see algorithms existential quantifier (∃), 111 ff. Four-Color Theorem, 181, 597
divisibility, 14, 62, 230, 426 expectation, 551 ff. fractals, 216, 222–224, 234, 248
and modular arithmetic, 333 ff. average-case analysis of algorithms, free (vs. bound) variables, 114, 122
common divisors, 334 ff. 289 ff. frequency analysis, 531
divisibility rules, 94, 168, 182, 342 conditional expectation, 561 ff. functions, 59 ff.
Division Theorem, 329 coupon collector problem, 570 algorithms, 70 ff.
division, see mod deviation from expectation, 562 ff. characteristic function of a set, 389
in Zn , 361 Markov’s inequality, 571 composition, 63
Division Rule, 467 ff. Law of Total Expectation, 562 domain/codomain, 61
domain (of a function), 61 linearity of expectation, 554 ff. growth rates, 269 ff.
dot product, 47 ff. exponentials, 10 ff., 261 inverses, 67
Dunbar’s number, 601 asymptotics, 273 ff. one-to-one/onto functions, 65 ff.
dynamic programming, see algorithms modular, 342 range/image, 62 ff.
dynamic scope, 122 EXPSPACE (complexity class), 294 viewed as relations, 394 ff.
EXPTIME (complexity class), 294 visual representation, 63
∃ (existential quantifier), 111 ff. extended Euclidean algorithm, 348 vs. macros, 122
e (base of natural logarithm), 13 Fundamental Theorem of Arithmetic, 346
edges, see graphs fuzzy logic, 92
efficiency, see running time, see also facial recognition, 210, 399, 658
computational complexity factorial, 166–167, 229, 300, 302, 452, Game of Life, 479
ELIZA, 659 458 game trees, 121, 478
Emacs, 460, 659 Stirling’s approximation, 501 garbage collection, 293, 619
empty set, 30 factors, see divisibility, see also prime Gates, Bill, 137, 180
Enigma machine, 497 factorization GCD, see greatest common divisor
entropy, 523 fallacy, 201 ff. GCHQ, 373
equivalence relations, 418 ff. affirming the consequent, 203 Generalized Product Rule, 450 ff.
equivalence classes, 419 begging the question, 205 geometric distribution, 520 ff., 554
refinements and coarsenings, 421 denying the hypothesis, 204 geometric mean, 182, 199
Eratosthenes, 344, 358 false dichotomy, 170, 204 geometric series, 17, 224 ff.
Erdős numbers, 180 proving true, 203 for recurrence relations, 314 ff.
Erdős, Paul, 180, 593 false dichotomy, see fallacy infinite, 225
error-correcting codes, 146 ff. Fast Fourier transform, 11 giant component, 618
Golay code, 165 fencepost error, 605 Gödel’s Incompleteness Theorem, 123
Hamming code, 154 ff., 463 Fermat pseudoprime, 367 Goldbach’s conjecture, 81, 128, 138
messages and codewords, 146 ff. Fermat’s Last Theorem, 365 golden ratio, 307, 310
minimum distance and rate, 149 ff. Fermat’s Little Theorem, 365 ff. grammars, 252, 259
Reed–Solomon codes, 161, 357 Fermat–Euler Theorem, 370 graph drawing, 596, 600
repetition code, 153 ff. Fibonacci numbers, 58, 247, 301, graphs, 579 ff.
upper bounds on rates, 158 306–308, 310, 500 acyclic graphs, 625 ff.
error-detecting codes, 146 ff. algorithms, 312 adjacency lists, 586 ff.
credit-card numbers, 145, 162 and the Euclidean algorithm, 342 adjacency matrices, 587 ff.
UPC, 477 filter, 37 bipartite graphs, 594 ff.
ethics, 210, 232, 319, 399, 521, 658–660 finite-state machines, 430 breadth-first search, 612 ff.
Euclid, 190, 335 float (floating-point number), 21, 284 complete graphs, 593 ff.
Euclidean algorithm, 335, 348 floor, 9 connected components, 607 ff.
efficiency, 338, 342 Division Theorem, 329 connectivity, 606 ff.
extended Euclidean algorithm, 348 for loops cycles, 623 ff.
Q
Euclidean distance, 55, 199 analogy for , 20 data structures, 586 ff.
P
Euler’s Theorem, 370 analogy for , 16, 18 degree, 583, 585 ff.
even numbers, 15, 173 analogy for quantifiers, 113, 130, 139 handshaking lemma, 584
evenly divides, see divisibility forests, 626 regular graphs, 596
events (probability), 511 ff. spanning forests, 634 density, 602

ISTUDY
670 Index

depth-first search, 615 ff. identity matrix, 49 invalid inference, 201


forests, 626 multiplicative identity, 361 inverse
isomorphism, 590 ff. of a binary operator, 93, 261 additive, 369
matchings, 470, 478, 497, 596, 636 if and only if (⇔), 87 ff. multiplicative, 361 ff.
neighborhoods, 581 ff., 585 ff. image (of a function), 62 of a function, 67
paths, 605 ff. image processing of a matrix, 58
shortest paths, 611 ff. blur filter, 22 of a relation, 390 ff., 406
planar graphs, 596 ff. dithering, 107 of an implication, 98
shortest paths grayscale conversion, 6 IP addresses, 456
Dijkstra’s algorithm, 642 ff. quantization, 60 irrationals, see rationals

simple graphs, 580 seam carving, 496 irrationality of 2, 174
subgraphs, 591 ff. segmentation, 608 irreflexivity, 403 ff.
trees, see trees imaginary numbers, 11 isomorphism (of graphs), 590 ff.
undirected vs. directed, 579 ff. implication (⇒), 84 ff.
weighted graphs, 641 ff. in-degree, see degree Jaccard coefficient, 40
Dijkstra’s algorithm, 642 ff. in-neighbor, see neighbors (in graphs) Java (programming language), 61, 88,
grayscale, 6, 60, 107 Inclusion–Exclusion, 446 ff., 501 105, 619
greatest common divisor, 334 ff., see also inclusive naming, 319 Johnson’s algorithm, 572
Euclidean algorithm incomparability, 277, 423 join (database operation), 398
incompleteness (logic), 123
Hn , see harmonic number inconsistency (logic), 123 Kn , see complete graphs
halting problem, 123, 194 ff., 198 independent events, 527 ff. Kn,n , see bipartite graphs
Hamiltonian path, 620 pairwise independence, 532 Kasparov, Garry, 121
Hamming code, 58, 154 ff., 477 indicator random variables, 550 Keller, Mary Kenneth, 201
number of valid codewords, 463 induction, see proofs keyspace, see hashing
Hamming distance, 146 checklist for inductive proofs, 219 kidney transplants, 636, 658
handshaking lemma, 584 generating conjectures, 222 Knuth, Donald, 336
harmonic number, 226–227, 234 proofs about algorithms, 228 ff., 303 ff. Kruskal’s algorithm, 648
hashing, 72, 480, 507–508, 557, 570 strengthening the inductive hypothesis, Kuratowski’s Theorem, 598
Bloom filters, 547 256
chaining, 72 infinite sequences, 17, 225 L (complexity class), 294
collisions, 72, 507 ff., 515, 525, 546, infix notation, 389 latchstring, 415, 642
558 information retrieval, 53 Law of the Excluded Middle, 95
and pairwise independence, 532 information theory, 523, 539 Law of Total Expectation, 562
chaining, 507 injective functions, see one-to-one Law of Total Probability, 539
clustering, 515, 525 functions least common multiple, 334 ff.
double hashing, 526 Insertion Sort, see sorting length (of a vector), 46
linear probing, 515, 525 integers, 7 ff. lexical scope, 122
quadratic probing, 526 algorithms for arithmetic, 331, 341 lexicographic ordering, 127, 389
simple uniform hashing, 508 efficiency, 332 Liar’s paradox, 29, 195
Hasse diagrams, 425 division, see modular arithmetic linearity of expectation, 554
heaps, 74, 245, 261 primes and composites, see prime linked lists, 260
heavy-tailed distribution, 599 numbers adjacency lists for graphs, 586 ff.
height (of a tree), 630 recursive definition, 258 as graphs, 602
Heron’s method, 22, 182 representation recursive definition, 250
hidden-surface removal, 431 binary numbers, 94, 220, 236, 246, list, see sequence
higher-order functions, 37, 135 332, 340 little o and little ω , 274 ff., 408 ff.
Hopper, Grace Murray, 60, 207 different bases, 246, 340 logarithms, 12–13
Huffman coding, 455, 475 ints, 21 asymptotics, 273 ff.
hypercube, 603 modular representation, 355 discrete logarithm, 379
unary, 332 polylogarithmic functions, 281, 332
I (identity matrix), 49 successor relation, 413 logic
idempotence, 99 internet addresses, 456 Boolean logic, 7, 362
identity intersection (of sets), 31 consistency, 123
identity function, 68 intervals, see real numbers fuzzy logic, 92

ISTUDY
Index 671

incompleteness, 123 memoization, 496 nonconstructive proofs, 175


logical equivalence, 97, 115 memory management, 619 nor (not or), 188, 261
logical fallacy, see fallacy Merge Sort, see sorting not (¬), 59, 83
modal logic, 409 metrics, 146, 162–163, 621 NP (complexity class), 104, 204, 294
predicate logic, 109 ff. Milgram, Stanley, 180 number line, 9
games against the demon, 132 Miller–Rabin test, 197, 368 number theory, see modular arithmetic
nested quantifiers, 127 ff. minimal element, 425 ff. numerical methods, see scientific
order of quantification, 128 ff. minimum element, 32, 425 ff. computing
predicates, 109 ff. minimum spanning trees, 646 ff.
quantifiers, 111 ff. cycle rule, 647 O (Big O), 269 ff., 408 ff.
theorems in predicate logic, 115 ff. Kruskal’s algorithm, 648 o (little o), 274, 408 ff.
propositional logic, 81 ff. Weighted Cycle Elimination algorithm, odd numbers, 15
atomic vs. compound propositions, 647 off-by-one error, 605
82 ML (programming language), 135, 256 Omega (Ω) (asymptotics), 274, 408 ff.
logical connectives, 83 ff. modal logic, 409 omega (ω ) (asymptotics), 274, 408 ff.
propositions, 81 ff. modular arithmetic, 13–14, 329 ff. one-time pads, 371
recursive definition of a well-formed Division Theorem, 329 one-to-one functions, 66, 395, 465
formula, 252 mod-and-div algorithm, 331 ff., 341 onion routing, 232
satisfiability, 96 modular congruences, 332 onto functions, 65, 127, 395, 465
tautology, 95 ff. modular exponentiation, 342 operating systems, 136
truth assignment, 89 modular products, 333 multitasking, 293
truth tables, 89 ff. modular sums, 333 virtual memory, 198
truth values, 81, 252 multiplicative inverse, 361 ff. optimizing compilers, 105, 107
universal set of operators, 199 primitive roots, 379 or (∨), 83
temporal logic, 409 Modus Ponens, 96 order of operations, see precedence of
longest common subsequence, 229, 501 Modus Tollens, 96 operators
loop invariants, 231 Monte Carlo method, 566 orders, see partial orders
Lovelace, Ada, 22, 42 Monty Hall Problem, 516 organ donation, 636
Moore’s Law, 279 out-degree, see degree
machine learning multiples, see divisibility out-neighbor, see neighbors (in graphs)
bias, 53 multiplicative identity, 361 outcome (probability), 509
classification problems, 464, 544 multiplicative inverse, 361 ff. overfitting, 543
clustering, 38 multitasking, 293 overflow, 21, 207
cross-validation, 500
macros, 122 naïve Bayes classifier, 544 P , see power set
Manhattan distance, 46, 55, 199 nand (not and), 188, 261 P (complexity class), 104, 204, 294
map, 37 n-ary relations, 395 ff. PageRank, 650
Mapping Rule, 463 ff. expressing n-ary relations as binary painter’s algorithm, 431
MapReduce, 37 relations, 397 pairwise independence, 532
maps, 181, 597 natural language processing, 659 palindromes, 262, 476
mark-and-sweep, 619 ambiguity, 92 paradoxes
Markov’s inequality, 571 language model, 543 birthday paradox, 559
matchings, see graphs speech processing, 38, 462 class-size paradox, 552
matrices, 48 ff. speech recognition, 543 Liar’s paradox, 29, 195
adjacency matrices for graphs, 587 ff. text classification, 544 nontransitive dice, 569
identity matrix, 49 text-to-speech systems, 462 paradoxes of translation, 82
inverse of a matrix, 58 natural logarithm, see logarithms Russell’s paradox, 29
matrix multiplication, 50 ff. neighbors (in graphs), 581, 585 Simpson’s paradox, 210
Strassen’s algorithm, 321 nested quantifiers, 127 ff. voting paradoxes, 414
rotation matrices, 54, 200 games against the demon, 132 parallel edges, 580
term–document matrix, 53 negations, 130 parent (in a tree), 628
maximal element, 425 ff. order of quantification, 128 ff. parity, 15, 154 ff., 238–239, 246
maximum element, 32, 71, 425 ff. nested sums, 18, 555 parsing, 259
mazes, 616 Newton’s method, 22 partial orders, 422 ff.
median (of an array), 74, 567 ff. nodes, see graphs chains and antichains, 434

ISTUDY
672 Index

comparability, 423 preorder, 424 parsing, 259


extending to a total order, 427 ff. prime numbers, 15, 192, 343 ff. scoping/functions/macros, 122
Hasse diagrams, 425 Carmichael numbers, 367, 370 short-circuit evaluation, 105
immediate successors, 425 distribution of the primes, 344 syntactic sugar, 100
minimal/maximal elements, 425 infinitude of primes, 190 project (database operation), 398
minimum/maximum element, 425 primality testing, 190, 197, 283, 343 proofs, 166 ff.
strict partial order, 422 efficient algorithms, 368 by assuming the antecedent, 118, 169
topological ordering, 427 ff. prime factorization, 346, 378 by cases, 157, 169 ff.
total orders, 423 cryptography, 197, 378 by construction, 154, 175 ff.
consistency with a partial order, existence of, 239–240 by contradiction, 159, 173 ff.
427 ff. Shor’s algorithm, 522 by contrapositive, 171 ff.
partition (of a set), 35 uniqueness of, 349–351 by induction, 70, 217 ff.
bipartite graphs, 594 Prime Number Theorem, 344 by mutual implication, 172
equivalence relations, 420 Sieve of Eratosthenes, 344, 358 by strong induction, 237 ff.
Pascal’s identity, 489, 493 priority queues, 245 by structural induction, 252 ff.
Pascal’s triangle, 493 ff. privacy, 232, 521, 658 combinatorial proofs, 488 ff.
paths (in graphs), 605 ff. probability direct, 168 ff.
breadth-first search, 612 ff. Bayes’ Rule, 540 ff. nonconstructive, 175
connected graphs, 606 ff. conditional expectation, 561 ff. strategy for proofs, 176 ff.
depth-first search, 615 ff. conditional probability, 532 ff. unprovable true statements, 123
Dijkstra’s algorithm, 642 ff. coupon collector problem, 570 without loss of generality, 170
internet routing, 456 events, 511 ff. writing proofs, 178 ff.
shortest paths, 611 ff. expectation, 551 ff. proper subset and superset, 33
simple paths, 606 infinitesimal probabilities, 535 propositional logic, see logic
Peirce’s arrow (↓), 188, 261 Law of Total Expectation, 562 proving true, see fallacy
Pentium chip, 207, 279 Law of Total Probability, 539 pseudocode, 71
perfect matchings, see graphs linearity of expectation, 554 ff. pseudorandom generator, 517
perfect numbers, 24 Markov’s inequality, 571 PSPACE (complexity class), 294
perfect square, 11 Monty Hall Problem, 516 public-key cryptography, see
Perl (programming language), 190 outcomes, 509 ff. cryptography
permutations, 249, 452, 458 probability functions, 509 ff. Pythagorean Theorem, 178, 188–190, 199
k-permutations, 484 ff. random variables, 549 ff. incorrect published proof, 211
Petersen graph, 590, 598 random walks, 650 Python (programming language), 21, 37,
Pigeonhole Principle, 471 ff., 475 standard deviation, 562 ff. 61, 93, 94, 122, 135, 192 ff., 474, 619
planar graphs, 181, 596 ff. tree diagrams, 514 ff.
Kuratowski’s Theorem, 598 variance, 562 ff. Q, see rationals
polygons, 54, 236, 240, 244, 248, 566 probability distributions QR codes, 145, 161
polylogarithmic, 281, 332 Bernoulli, 518 ff. quadtrees, 311
polynomials, 68 ff., 161, see also P binomial, 519 ff. quantifiers, 111 ff.
(complexity class) entropy, 523 negating quantifiers, 116 ff.
asymptotics, 273 ff. geometric, 520 ff. nested quantifiers, 127 ff.
evaluating modulo a prime, 346, 356, posterior distribution, 541 vacuous quantification, 119
357 prior distribution, 541 quantum computation, 522
postfix notation, 389 uniform, 517 ff. Quick Sort, see sorting
Postscript (programming language), 389 product, 20 ff.
power set, 36 of a set, 32 R, see real numbers
as a relation, 388 product of sums, see conjunctive normal Radix Sort, see sorting
cardinality, 466 form raising to a power, see exponentials
power-law distribution, 599 Product Rule, 444 Random Surfer Model, 650
powers, see exponentials cardinality of Sk , 445 random variables, 549 ff.
precedence of operators, 31, 88, 113, 259 programming languages expectation, 551 ff.
predicate logic, see logic compile-time optimization, 105 independent random variables, 551
predicates, 109 ff., 389, see also logic Currying, 135 indicator random variables, 550
prefix notation, 389 garbage collection, 293, 619 random walks, 650, 652
prefix-free codes, 454 higher-order functions, 37, 135 randomized algorithms, 292

ISTUDY
Index 673

Buffon’s needle, 566 relative primality, 346 ff., 363 ff. subsets/supersets, 32 ff., see also power
finding medians, 567 Chinese Remainder Theorem, 351 ff. set
Johnson’s algorithm, 572 extended Euclidean algorithm, 348 union, 31
Monte Carlo method, 566 remainder, see mod Inclusion–Exclusion, 446 ff.
primality testing (Miller–Rabin), 368 repeated squaring, 312, 342, 375 Venn diagrams, 30
Quick Sort, 524 repetition code, 153 ff. well-ordered, 254
range (of a function), 62 roots (of a polynomial), 69, 161, 357 Sheffer stroke (|), 188, 261
rate (of a code), 149 ff. rotation matrices, see matrices Shor’s algorithm, 522
rationals, 7 ff., 44, 169, 172, 335 roulette, 513, 563 short-circuit evaluation, 105
in lowest terms, 335, 420 RSA cryptosystem, 197, 232, 373 ff. Sierpiński triangle/carpet, 234 ff.
real numbers, 7 ff. breaking the encryption, 378 Sieve of Eratosthenes, 344, 358
absolute value/floor/ceiling, 9 ff. Rubik’s cube, 362, 459 signed social networks, 592
approximate equality (≈), 9, 387 running time, 283 ff. Simpson’s paradox, 210
defining via infinite sequences, 421 average case, 289 ff., 560 six degrees of separation, 180
exponentiation, 10 ff. best case, 289 ff. skite, 622
floats (representation), 21 worst case, 284 ff. small-world phenomenon, 180, 618
intervals, 9 Russell’s paradox, 29 social networks, 180, 578, 592, 599
logarithms, 12 ff. Dunbar’s number, 601
trichotomy, 277 sorting
realization, see outcome (probability) sample space (probability), 509 Bubble Sort, 287, 291, 295
recommender system, 40 sampling bias, 552 comparison-based, 295, 457
recurrence relations, 300 ff. satisfiability, 96, 104, 193, 387, 572 Counting Sort, 296, 458
divide and conquer, 314 ff. scalars, 45 Insertion Sort, 222, 286, 290, 295
iterating, 302 SCC, see strongly connected components average-case analysis, 560, 570
sloppiness, 306 Scheme (programming language), 37, 43, correctness using loop invariants, 231
solving by induction, 302 100, 135, 389 lower bounds, 457–458
variable substitution, 304 scientific computing, 284 Merge Sort, 248, 298–301, 303–304,
recursion tree, 298, 314 ff. Newton’s method, 22 312, 314
recursively defined structures, 250 ff. seam carving, 496 Quick Sort, 296, 311
Reed–Solomon codes, 161, 357 searching correctness (for any pivot rule),
reference counting, 619 Binary Search, 231, 248, 288 ff., 301, 241 ff.
refining equivalence relations, 421 ff. 304–306, 314, 341 randomized pivot selection, 524
reflexivity, 146, 403 ff. Linear Search, 288 ff. Radix Sort, 296
reflexive closure, 409 ff. Ternary Search, 311 Selection Sort, 284, 291, 295, 457
regular expressions, 415, 430 secret sharing, 356, 500 spam filter, 544
regular graphs, 596 select, see median spanning trees, 633 ff.
reindexing, 17 select (database operation), 398 Cycle Elimination algorithm, 634
relational databases, see databases Selection Sort, see sorting minimum spanning trees, 646 ff.
relations self-loops, 580 speech processing, see natural language
n-ary relations, 395 ff. self-reference, 29, 82, 123, 191, 650, 673 processing
binary relations, 388 ff. sentinels, 482 sphere packing, 159
closures, 409 ff. sequences, 42 ff. spreadsheets, 429, 434, 611
composition, 391 ff. Sn (sequence of elements from the same SQL (programming language), 398
equivalence relations, 418 ff. set), 44 square roots, 11, 22, see exponentials
functions as relations, 394 ff. cardinality, 444, 450 Heron’s method, 22, 182
inverses, 390 ff. sets, 26 ff. standard deviation, 562 ff.
partial orders, 422 ff. cardinality, 26 ff., 441 ff. steganography, 232
reflexivity, 403 characteristic function, 389 Strassen’s algorithm, 321
relational databases, 398 complement, 30 strings, 44, 55
symmetry, 404 disjointness, see disjoint sets, see also generating all strings of a given length,
total orders, 423 ff. partitions 340
transitivity, 406 ff. empty set, 30 regular expressions, 415
visual representation, 389 ff., 402 ff. intersection, 31 strong induction, see proofs
Hasse diagrams, 425 ff. set difference, 31 strongly connected components, 609 ff.
vs. predicates, 389 singleton set, 30 structural induction, see proofs

ISTUDY
674 Index

subgraphs, see graphs signed social networks, 592 uniform distribution, 510, 517 ff.
subsequences, 229, 400, 477, 488, 499, transitive closure, 409 ff. unigrams, 543
501 Traveling Salesperson Problem, 496 union (of sets), 31
subset, 33 trees, 623 ff. Union Bound, 442
Sudoku, 104 2–3 and 2–3–4 trees, 261 unit vector, 46
sum of products, see disjunctive normal AVL trees, 309 ff. universal quantifier (∀), 111 ff.
form binary search trees, 260, 309, 637 unsatisfiability, 96
Sum Rule, 441 binary trees, 251, 309 ff., 630 ff. URL squatting, 477
summations, 15 ff. complete binary trees, 640 ff.
arithmetic, 17, 226, 233 heaps, 74, 245 vacuous quantification, 119
geometric, 17, 224 ff., 314 ff. decision trees, 458 valid inference, 201
infinite, 225 forests, 626 van Emde Boas trees, 322
harmonic, 226 ff. game trees, 121, 478 variance, 562 ff.
of a set, 32 in counting problems, 455 Vector Space Model, 53
reindexing summations, 17 parse trees, 259 vectors, 44 ff.
reversing nested summations, 18, 553 quadtrees, 311 dot product, 47 ff.
superset, 33 recursion trees, 298 ff. Venn diagrams, 30
surjective functions, see onto functions recursive definitions of trees, 251, 630 virtual memory, 198
symmetry, 146, 387, 404 ff. rooted trees, 628 ff. Von Koch snowflake, 216, 223, 234 ff.
symmetric closure, 409 ff. spanning trees, 633 ff. Voronoi diagram, 56
syntactic sugar, 100 minimum spanning trees, 646 ff. voting systems, 414
syzygy, 598, 642 subtrees, 628 ff.
tree traversal, 631 ff.
wall clocks, 293
T(n) = aT(n/b) + nk , 314 ff. van Emde Boas trees, 322
Weizenbaum, Joseph, 659
tautology, 95 ff. triangle inequality, 146, 170
well-ordered set, 254
temporal logic, 409 triangulation, 240–241, 244
What Three Words, 440
term frequency–inverse document truth tables, 89 ff.
without loss of generality, 170
frequency (TFIDF), 53 truth values, 81 ff.
word2vec, 53
The Book (of proofs), 180 tsktsks, 415, 642
World War II, 497, 592
Therac-25, 207, 658 tuple, see sequence
World Wide Web, 599, 618
Theta (Θ) (asymptotics), 274, 408 ff. Turing Award, 28, 98, 146, 270, 336, 373,
PageRank, 650
tic-tac-toe, 121, 478 379, 388, 398, 642
worst-case analysis, see running time
topological ordering, 427 ff. Turing machines, 123, 192, 284
Tor, 232, 658 Turing, Alan, 28, 191, 284, 497
total orders, 423 ff., see also partial orders xor, see exclusive or
totient function, 370, 461 unary numbers, see integers
Towers of Hanoi, 322 uncomputability, 123, 191–195, 198, 474 Z, see integers
transitivity, 406 ff. undecidability, see uncomputability Zn , 360 ff.
nontransitive dice, 569 underflow, 21 zero (of a binary operator), 93, 261
nontransitivity in voting, 414 Unicode, 460 zyzzyvas, 389

ISTUDY
ISTUDY
ISTUDY

You might also like