Discrete Notes
James Aspnes
2016-08-03 16:41
Contents

Table of contents
List of figures
List of tables
List of algorithms
Preface
Internet resources

1 Introduction
  1.1 So why do I need to learn all this nasty mathematics?
  1.2 But isn't math hard?
  1.3 Thinking about math with your heart
  1.4 What you should know about math
    1.4.1 Foundations and logic
    1.4.2 Basic mathematics on the real numbers
    1.4.3 Fundamental mathematical objects
    1.4.4 Modular arithmetic and polynomials
    1.4.5 Linear algebra
    1.4.6 Graphs
    1.4.7 Counting
    1.4.8 Probability
    1.4.9 Tools

2 Mathematical logic
  2.1 The basic picture
    2.1.1 Axioms, models, and inference rules
    2.1.2 Consistency
    2.1.3 What can go wrong
    2.1.4 The language of logic
    2.1.5 Standard axiom systems and models
  2.2 Propositional logic
    2.2.1 Operations on propositions
      2.2.1.1 Precedence
    2.2.2 Truth tables
    2.2.3 Tautologies and logical equivalence
      2.2.3.1 Inverses, converses, and contrapositives
      2.2.3.2 Equivalences involving true and false
    2.2.4 Normal forms
  2.3 Predicate logic
    2.3.1 Variables and predicates
    2.3.2 Quantifiers
      2.3.2.1 Universal quantifier
      2.3.2.2 Existential quantifier
      2.3.2.3 Negation and quantifiers
      2.3.2.4 Restricting the scope of a quantifier
      2.3.2.5 Nested quantifiers
      2.3.2.6 Examples
    2.3.3 Functions
    2.3.4 Equality
      2.3.4.1 Uniqueness
    2.3.5 Models
      2.3.5.1 Examples
  2.4 Proofs
    2.4.1 Inference Rules
    2.4.2 Proofs, implication, and natural deduction
      2.4.2.1 The Deduction Theorem
  2.5 Natural deduction
    2.5.1 Inference rules for equality
    2.5.2 Inference rules for quantified statements
  2.6 Proof techniques

3 Set theory
  3.1 Naive set theory
  3.2 Operations on sets
  3.3 Proving things about sets

  5.6.2 Recursive definitions and induction
  5.6.3 Structural induction

6 Summation notation
  6.1 Summations
    6.1.1 Formal definition
    6.1.2 Scope
    6.1.3 Summation identities
    6.1.4 Choosing and replacing index variables
    6.1.5 Sums over given index sets
    6.1.6 Sums without explicit bounds
    6.1.7 Infinite sums
    6.1.8 Double sums
  6.2 Products
  6.3 Other big operators
  6.4 Closed forms
    6.4.1 Some standard sums
    6.4.2 Guess but verify
    6.4.3 Ansatzes
    6.4.4 Strategies for asymptotic estimates
      6.4.4.1 Pull out constant factors
      6.4.4.2 Bound using a known sum (geometric, constant, arithmetic, and harmonic series)
      6.4.4.3 Bound part of the sum
      6.4.4.4 Integrate
      6.4.4.5 Grouping terms
      6.4.4.6 Oddities
      6.4.4.7 Final notes

7 Asymptotic notation
  7.1 Definitions
  7.2 Motivating the definitions
  7.3 Proving asymptotic bounds
  7.4 Asymptotic notation hints
    7.4.1 Remember the difference between big-O, big-Ω, and big-Θ
    7.4.2 Simplify your asymptotic terms as much as possible

8 Number theory
  8.1 Divisibility and division
  8.2 Greatest common divisors
    8.2.1 The Euclidean algorithm for computing gcd(m, n)
    8.2.2 The extended Euclidean algorithm
      8.2.2.1 Example
      8.2.2.2 Applications
  8.3 The Fundamental Theorem of Arithmetic
    8.3.1 Applications
  8.4 Modular arithmetic and residue classes
    8.4.1 Arithmetic on residue classes
    8.4.2 Division in Zm
    8.4.3 The Chinese Remainder Theorem
    8.4.4 The size of Z*m and Euler's Theorem
  8.5 RSA encryption

9 Relations
  9.1 Representing relations
    9.1.1 Directed graphs
    9.1.2 Matrices
  9.2 Operations on relations
    9.2.1 Composition
    9.2.2 Inverses
  9.3 Classifying relations
  9.4 Equivalence relations
    9.4.1 Why we like equivalence relations
  9.5 Partial orders
    9.5.1 Drawing partial orders
    9.5.2 Comparability
    9.5.3 Lattices
    9.5.4 Minimal and maximal elements
    9.5.5 Total orders
      9.5.5.1 Topological sort
    9.5.6 Well orders
  9.6 Closures
    9.6.1 Examples

10 Graphs
  10.1 Types of graphs
    10.1.1 Directed graphs
    10.1.2 Undirected graphs
    10.1.3 Hypergraphs
  10.2 Examples of graphs
  10.3 Local structure of graphs
  10.4 Some standard graphs
  10.5 Subgraphs and minors
  10.6 Graph products
    10.6.1 Functions
  10.7 Paths and connectivity
  10.8 Cycles
  10.9 Proving things about graphs
    10.9.1 Paths and simple paths
    10.9.2 The Handshaking Lemma
    10.9.3 Characterizations of trees
    10.9.4 Spanning trees
    10.9.5 Eulerian cycles

11 Counting
  11.1 Basic counting techniques
    11.1.1 Equality: reducing to a previously-solved case
    11.1.2 Inequalities: showing |A| ≤ |B| and |B| ≤ |A|
    11.1.3 Addition: the sum rule
      11.1.3.1 For infinite sets
      11.1.3.2 The Pigeonhole Principle
    11.1.4 Subtraction
      11.1.4.1 Inclusion-exclusion for infinite sets
      11.1.4.2 Combinatorial proof
    11.1.5 Multiplication: the product rule
      11.1.5.1 Examples
      11.1.5.2 For infinite sets
    11.1.6 Exponentiation: the exponent rule
      11.1.6.1 Counting injections
    11.1.7 Division: counting the same thing in two different ways
    11.1.8 Applying the rules
    11.1.9 An elaborate counting problem
      Solving for the PFE using the extended cover-up method
    11.3.7 Asymptotic estimates
    11.3.8 Recovering the sum of all coefficients
      11.3.8.1 Example
    11.3.9 A recursive generating function
    11.3.10 Summary of operations on generating functions
    11.3.11 Variants
    11.3.12 Further reading

12 Probability theory
  12.1 Events and probabilities
    12.1.1 Probability axioms
      12.1.1.1 The Kolmogorov axioms
      12.1.1.2 Examples of probability spaces
    12.1.2 Probability as counting
      12.1.2.1 Examples
    12.1.3 Independence and the intersection of two events
      12.1.3.1 Examples
    12.1.4 Union of events
      12.1.4.1 Examples
    12.1.5 Conditional probability
      12.1.5.1 Conditional probabilities and intersections of non-independent events
      12.1.5.2 The law of total probability
      12.1.5.3 Bayes's formula
  12.2 Random variables
    12.2.1 Examples of random variables
    12.2.2 The distribution of a random variable
      12.2.2.1 Some standard distributions
      12.2.2.2 Joint distributions
    12.2.3 Independence of random variables
      12.2.3.1 Examples
      12.2.3.2 Independence of many random variables
    12.2.4 The expectation of a random variable
      12.2.4.1 Variables without expectations
      12.2.4.2 Expectation of a sum
      12.2.4.3 Expectation of a product
      12.2.4.4 Conditional expectation
      12.2.4.5 Conditioning on a random variable
    12.2.5 Markov's inequality
      12.2.5.1 Example
      12.2.5.2 Conditional Markov's inequality
    12.2.6 The variance of a random variable
      12.2.6.1 Multiplication by constants
      12.2.6.2 The variance of a sum
      12.2.6.3 Chebyshev's inequality
        Application: showing that a random variable is close to its expectation
        Application: lower bounds on random variables
    12.2.7 Probability generating functions
      12.2.7.1 Sums
      12.2.7.2 Expectation and variance
    12.2.8 Summary: effects of operations on expectation and variance of random variables
    12.2.9 The general case
      12.2.9.1 Densities
      12.2.9.2 Independence
      12.2.9.3 Expectation

13 Linear algebra
  13.1 Vectors and vector spaces
    13.1.1 Relative positions and vector addition
    13.1.2 Scaling
  13.2 Abstract vector spaces
  13.3 Matrices
    13.3.1 Interpretation
    13.3.2 Operations on matrices
      13.3.2.1 Transpose of a matrix
      13.3.2.2 Sum of two matrices
      13.3.2.3 Product of two matrices
      13.3.2.4 The inverse of a matrix
      13.3.2.5 Scalar multiplication
    13.3.3 Matrix identities
  13.4 Vectors as matrices
    13.4.1 Length

A Sample assignments
  A.1 Assignment 1: due Thursday, 2013-09-12, at 5:00 pm
    A.1.1 Tautologies
    A.1.2 Positively equivalent
    A.1.3 A theory of leadership
  A.2 Assignment 2: due Thursday, 2013-09-19, at 5:00 pm
    A.2.1 Subsets
    A.2.2 A distributive law
    A.2.3 Exponents
  A.3 Assignment 3: due Thursday, 2013-09-26, at 5:00 pm
    A.3.1 Surjections
    A.3.2 Proving an axiom the hard way
    A.3.3 Squares and bigger squares
  A.4 Assignment 4: due Thursday, 2013-10-03, at 5:00 pm
    A.4.1 A fast-growing function
    A.4.2 A slow-growing set
    A.4.3 Double factorials
  A.5 Assignment 5: due Thursday, 2013-10-10, at 5:00 pm
    A.5.1 A bouncy function

B Sample exams
  B.1 CS202 Exam 1, October 17th, 2013
    B.1.1 A tautology (20 points)
    B.1.2 A system of equations (20 points)
    B.1.3 A sum of products (20 points)
    B.1.4 A subset problem (20 points)
  B.2 CS202 Exam 2, December 4th, 2013
    B.2.1 Minimum elements (20 points)
    B.2.2 Quantifiers (20 points)
    B.2.3 Quadratic matrices (20 points)
    B.2.4 Low-degree connected graphs (20 points)

  D.4.4 A transitive graph (20 points)
  D.4.5 A possible matrix identity (20 points)
  D.5 CS202 Final Exam, December 14th, 2010
    D.5.1 Backwards and forwards (20 points)
    D.5.2 Linear transformations (20 points)
    D.5.3 Flipping coins (20 points)
    D.5.4 Subtracting dice (20 points)
    D.5.5 Scanning an array (20 points)

G The natural numbers
  G.1 The Peano axioms
  G.2 A simple proof
  G.3 Defining addition
    G.3.1 Other useful properties of addition
  G.4 A scary induction proof involving even numbers
  G.5 Defining more operations

Bibliography

Index

List of Figures

9.1 A directed graph
9.2 Relation as a directed graph
9.3 Factors of 12 partially ordered by divisibility
9.4 Maximal and minimal elements
9.5 Topological sort
9.6 Reflexive, symmetric, and transitive closures
9.7 Strongly-connected components

List of Tables

2.1 Compound propositions
2.2 Common logical equivalences
2.3 Absorption laws
2.4 Natural deduction: introduction and elimination rules
2.5 Proof techniques

List of Algorithms
Preface
These were originally the notes for the Fall 2013 semester of the Yale course
CPSC 202a, Mathematical Tools for Computer Science. They have been subsequently updated to incorporate numerous corrections suggested by Dana
Angluin and her students.
Internet resources
You may find these resources useful.
- PlanetMath: http://planetmath.org
- Wolfram MathWorld: http://mathworld.wolfram.com
- WikiPedia: http://en.wikipedia.org
- Google: http://www.google.com
Chapter 1
Introduction
This is a course on discrete mathematics as used in Computer Science. It's
only a one-semester course, so there are a lot of topics that it doesn't cover
or doesn't cover in much depth. But the hope is that this will give you a
foundation of skills that you can build on as you need to, and particularly
to give you a bit of mathematical maturity: the basic understanding of
what mathematics is and how mathematical definitions and proofs work.
1.1 So why do I need to learn all this nasty mathematics?
Why you should know about mathematics, if you are interested in Computer
Science: or, more specifically, why you should take CS202 or a comparable
course:
- Computation is something that you can't see and can't touch, and yet
(thanks to the efforts of generations of hardware engineers) it obeys
strict, well-defined rules with astonishing accuracy over long periods
of time.

- Computations are too big for you to comprehend all at once. Imagine
printing out an execution trace that showed every operation a typical
$500 desktop computer executed in one (1) second. If you could read
one operation per second, for eight hours every day, you would die
of old age before you got halfway through. Now imagine letting the
computer run overnight.
So in order to understand computations, we need a language that allows
us to reason about things we cant see and cant touch, that are too big for us
1.2 But isn't math hard?
Yes and no. The human brain is not really designed to do formal mathematical reasoning, which is why most mathematics was invented in the last
few centuries and why even apparently simple things like learning how to
count or add require years of training, usually done at an early age so the
pain will be forgotten later. But mathematical reasoning is very close to
legal reasoning, which we do seem to be very good at.1
There is very little structural difference between the two sentences:
1. If x is in S, then x + 1 is in S.
2. If x is of royal blood, then x's child is of royal blood.
But because the first is about boring numbers and the second is about
fascinating social relationships and rules, most people have a much easier
time deducing that to show somebody is royal we need to start with some
known royal and follow a chain of descendants than they have deducing that
to show that some number is in the set S we need to start with some known
element of S and show that repeatedly adding 1 gets us to the number we
want. And yet to a logician these are the same processes of reasoning.
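The reasoning pattern in statement (1) can be made concrete. A minimal Python sketch (the function name and representation are hypothetical, chosen just for illustration): to show a number is in S, start from a known element and repeatedly apply the "if x is in S, then x + 1 is in S" rule, recording each step.

```python
def in_S(target, base=0):
    """Show target is in S, given that base is in S and that S is
    closed under x -> x + 1: follow the chain base, base+1, ..., target."""
    x = base
    chain = [x]
    while x < target:
        x = x + 1          # the rule: if x is in S, then x + 1 is in S
        chain.append(x)
    return chain           # each step is justified by one rule application

# 5 is in S because 0 is, via the chain 0, 1, 2, 3, 4, 5
print(in_S(5))  # [0, 1, 2, 3, 4, 5]
```

The returned chain is exactly the "chain of descendants" argument from the royal-blood example, just with numbers instead of people.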
So why is statement (1) trickier to think about than statement (2)? Part
of the difference is familiarity: we are all taught from an early age what it
means to be somebody's child, to take on a particular social role, etc. For
mathematical concepts, this familiarity comes with exposure and practice,
just as with learning any other language. But part of the difference is that
1. For a description of some classic experiments that demonstrate this, see
http://en.wikipedia.org/wiki/Wason_selection_task.
we humans are wired to understand and appreciate social and legal rules:
we are very good at figuring out the implications of a (hypothetical) rule
that says that any contract to sell a good to a consumer for $100 or more
can be canceled by the consumer within 72 hours of signing it provided the
good has not yet been delivered, but we are not so good at figuring out the
implications of a rule that says that a number is composite if and only if it
is the product of two integer factors neither of which is 1. It's a lot easier to
imagine having to cancel a contract to buy swampland in Florida that you
signed last night while drunk than having to prove that 82 is composite. But
again: there is nothing more natural about contracts than about numbers,
and if anything the conditions for our contract to be breakable are more
complicated than the conditions for a number to be composite.
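Proving that 82 is composite just means exhibiting the two factors; a quick sketch (helper name hypothetical) that finds such a witness by trial division:

```python
def composite_witness(n):
    """Return (a, b) with a * b == n and neither factor equal to 1,
    or None if no such factorization exists (i.e., n is prime or < 4)."""
    for a in range(2, int(n ** 0.5) + 1):
        if n % a == 0:
            return (a, n // a)
    return None

print(composite_witness(82))  # (2, 41): 82 = 2 * 41, so 82 is composite
print(composite_witness(7))   # None: 7 has no such factorization
```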
1.3 Thinking about math with your heart
There are two things you need to be able to do to get good at mathematics
(the creative kind that involves writing proofs, not the mechanical kind
that involves grinding out answers according to formulas). One of them is
to learn the language: to attain what mathematicians call "mathematical
maturity." You'll do that in CS202, if you pay attention. But the other is to
learn how to activate the parts of your brain that are good at mathematical-style reasoning when you do math: the parts evolved to detect when the
other primates in your band of hunter-gatherers are cheating.
To do this it helps to get a little angry, and imagine that finishing a proof
or unraveling a definition is the only thing that will stop your worst enemy
from taking some valuable prize that you deserve. (If you dont have a worst
enemy, there is always the universal quantifier.) But whatever motivation
you choose, you need to be fully engaged in what you are doing. Your brain
is smart enough to know when you dont care about something, and if you
dont believe that thinking about math is important, it will think about
something else.
1.4 What you should know about math
We wont be able to cover all of this, but the list below might be a minimal
set of topics it would be helpful to understand for computer science. Topics
that we didn't do this semester are marked with (*).
1.4.1 Foundations and logic

1.4.2 Basic mathematics on the real numbers
Why: You need to be able to understand, write, and prove equations and
inequalities involving real numbers.
- Standard functions and their properties: addition, multiplication, exponentiation, logarithms.
- More specialized functions that come up in algorithm analysis: floor, ceiling, max, min.
- Techniques for proving inequalities, including:
  - General inequality axioms (transitivity, anti-symmetry, etc.)
  - Inequality axioms for R (i.e., how < interacts with addition, multiplication, etc.)
  - Techniques involving derivatives (assumes calculus) (*):
    - Finding local extrema of f by solving for f′(x) = 0. (*)
    - Using f′′ to distinguish local minima from local maxima. (*)
    - Using f′(x) ≤ g′(x) in [a, b] and f(a) ≤ g(a) or f(b) ≥ g(b) to show f(x) ≤ g(x) or f(x) ≥ g(x) in [a, b]. (*)
- Special subsets of the real numbers: rationals, integers, natural numbers.
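The starred derivative technique can be sanity-checked numerically. A minimal sketch, with an example inequality chosen purely for illustration: e^x ≥ 1 + x holds on [0, 2] because the two sides agree at 0 and the left side's derivative dominates the right side's throughout the interval.

```python
import math

# Claim: e**x >= 1 + x on [0, 2], because the functions agree at 0
# and (e**x)' = e**x >= 1 = (1 + x)' throughout the interval.
f, g = math.exp, lambda x: 1 + x
fp, gp = math.exp, lambda x: 1.0

xs = [i / 100 for i in range(201)]         # grid on [0, 2]
assert f(0) >= g(0)                        # holds at the left endpoint
assert all(fp(x) >= gp(x) for x in xs)     # f' >= g' on the interval
assert all(f(x) >= g(x) for x in xs)       # so f >= g on the interval
print("e**x >= 1 + x checked on [0, 2]")
```

A numeric spot-check like this is not a proof, but it is a cheap way to catch a claimed inequality that is simply false before you try to prove it.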
1.4.3 Fundamental mathematical objects
Why: These are the mathematical equivalent of data structures, the way
that more complex objects are represented.
- Set theory.
  - Naive set theory.
  - Predicates vs sets.
  - Set operations.
  - Set comprehension.
  - Russell's paradox and axiomatic set theory.
- Functions.
  - Functions as sets.
  - Injections, surjections, and bijections.
  - Cardinality.
  - Finite vs infinite sets.
  - Sequences.
- Relations.
  - Equivalence relations, equivalence classes, and quotients.
  - Orders.
- The basic number tower.
  - Countable universes: N, Z, Q. (Can be represented in a computer.)
  - Uncountable universes: R, C. (Can only be approximated in a computer.)
- Other algebras.
  - The string monoid. (*)
  - Zm and Zp.
  - Polynomials over various rings and fields.
1.4.4 Modular arithmetic and polynomials

1.4.5 Linear algebra

1.4.6 Graphs
Why: Good for modeling interactions. Basic tool for algorithm design.

- Definitions: graphs, digraphs, multigraphs, etc.
- Paths, connected components, and strongly-connected components.
- Special kinds of graphs: paths, cycles, trees, cliques, bipartite graphs.
- Subgraphs, induced subgraphs, minors.
1.4.7 Counting
Why: Basic tool for knowing how many resources your program is going to
consume.
- Basic combinatorial counting: sums, products, exponents, differences, and quotients.
- Combinatorial functions.
  - Factorials.
  - Binomial coefficients.
  - The 12-fold way. (*)
- Advanced counting techniques.
  - Inclusion-exclusion.
  - Recurrences. (*)
  - Generating functions. (Limited coverage.)
1.4.8 Probability
1.4.9 Tools

Why: Basic computational stuff that comes up, but doesn't fit in any of the
broad categories above. These topics will probably end up being mixed in
with the topics above.

- Things you may have forgotten about exponents and logarithms. (*)
- Inequalities and approximations.
- Σ and Π notation.
Chapter 2
Mathematical logic
Mathematical logic is the discipline that mathematicians invented in the late
nineteenth and early twentieth centuries so they could stop talking nonsense.
It's the most powerful tool we have for reasoning about things that we can't
really comprehend, which makes it a perfect tool for Computer Science.
2.1 The basic picture

[Figure: a theory, e.g. the statement ∀x : ∃y : y = x + 1, on one side; the model it describes, e.g. N = {0, 1, 2, . . .}, on the other.]
2.1.1 Axioms, models, and inference rules
One approach is to come up with a list of axioms that are true statements
about the model and a list of inference rules that let us derive new true
statements from the axioms. The axioms and inference rules together generate a theory that consists of all statements that can be constructed from
the axioms by applying the inference rules. The rules of the game are that
we can't claim that some statement is true unless it's a theorem: something
we can derive as part of the theory.
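The "theory generated by axioms plus inference rules" picture can be sketched directly. A minimal toy encoding (an assumption of this sketch, not the notation used in the text): statements are strings, the only inference rule is modus ponens, and implications are given as (P, Q) pairs standing for "P implies Q". The theory is the closure of the axioms under the rule.

```python
def generate_theory(axioms, implications):
    """Close the axioms under modus ponens: from P and (P -> Q), derive Q.
    implications is a set of (P, Q) pairs standing for 'P implies Q'."""
    theorems = set(axioms)
    changed = True
    while changed:                      # repeat until no new theorems appear
        changed = False
        for p, q in implications:
            if p in theorems and q not in theorems:
                theorems.add(q)         # Q is now a theorem
                changed = True
    return theorems

axioms = {"P"}
implications = {("P", "Q"), ("Q", "R"), ("S", "T")}   # S is never derived
print(sorted(generate_theory(axioms, implications)))  # ['P', 'Q', 'R']
```

Note that "T" is not a theorem here even though S implies T is available: the rules of the game only let us claim statements we can actually derive.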
2.1.2 Consistency

A theory is consistent if it can't prove both P and not-P for any P. Consistency is incredibly important, since all the logics people actually use can
prove anything starting from P and not-P .
2.1.3 What can go wrong
If we throw in too many axioms, we can get an inconsistency: "All fish are
green; all sharks are not green; all sharks are fish; George Washington is a
shark" gets us into trouble pretty fast.
If we don't throw in enough axioms, we underconstrain the model. For
example, the Peano axioms for the natural numbers (see example below) say
(among other things) that there is a number 0 and that any number x has
a successor S(x) (think of S(x) as x + 1). If we stop there, we might have
a model that contains only 0, with S(0) = 0. If we add in 0 ≠ S(x) for any
x, then we can get stuck at S(0) = 1 = S(1). If we add yet another axiom
that says S(x) = S(y) if and only if x = y, then we get all the ordinary
natural numbers 0, S(0) = 1, S(1) = 2, etc., but we could also get some
extras: say 0′, S(0′) = 1′, S(1′) = 0′. Characterizing the correct natural
numbers historically took a lot of work to get right, even though we all know
what we mean when we talk about them. The situation is of course worse
when we are dealing with objects that we don't really understand; here the
most we can hope for is to try out some axioms and see if anything strange
happens.
Better yet is to use some canned axioms somebody else has already
debugged for us. In this respect the core of mathematics acts like a system
2.1.4 The language of logic
The basis of mathematical logic is propositional logic, which was essentially invented by Aristotle. Here the model is a collection of statements
that are either true or false. There is no ability to refer to actual things;
though we might include the statement "George Washington is a fish", from
the point of view of propositional logic that is an indivisible atomic chunk
of truth or falsehood that says nothing in particular about George Washington or fish. If we treat it as an axiom we can prove the truth of more
complicated statements like "George Washington is a fish or 2+2=5" (true
since the first part is true), but we can't really deduce much else. Still, this
is a starting point.
If we want to talk about things and their properties, we must upgrade
to predicate logic. Predicate logic adds both constants (stand-ins for
objects in the model like George Washington) and predicates (stand-ins
for properties like "is a fish"). It also lets us quantify over variables and
make universal statements like "For all x, if x is a fish then x is green." As
a bonus, we usually get functions (f(x) = "the number of books George
Washington owns about x") and equality ("George Washington = 12" implies "George Washington + 5 = 17"). This is enough machinery to define
and do pretty much all of modern mathematics.
We will discuss both of these logics in more detail below.
2.1.5 Standard axiom systems and models
Rather than define our own axiom systems and models from scratch, it helps
to use ones that already have a track record of consistency and usefulness.
Almost all mathematics fits in one of the following models:
The natural numbers N. These are defined using the Peano axioms,
and if all you want to do is count, add, and multiply, you don't need
much else. (If you want to subtract, things get messy.)

The integers Z. Like the naturals, only now we can subtract. Division
is still a problem.

The real numbers R. Now we have √2. But what about √(−1)?
The complex numbers C. Now we are pretty much done. But what if
we want to talk about more than one complex number at a time?
The universe of sets. These are defined using the axioms of set theory,
and produce a rich collection of sets that include, among other things,
structures equivalent to the natural numbers, the real numbers, collections of same, sets so big that we can't even begin to imagine what
they look like, and even bigger sets so big that we can't use the usual
accepted system of axioms to prove whether they exist or not. Fortunately, in computer science we can mostly stop with finite sets, which
makes life less confusing.

Various alternatives to set theory, like lambda calculus, category theory, or second-order arithmetic. We won't talk about these, since they
generally don't let you do anything you can't do already with sets.
However, lambda calculus and category theory are both important to
know about if you are interested in programming language theory.
In practice, the usual way to do things is to start with sets and then define
everything else in terms of sets: e.g., 0 is the empty set, 1 is a particular
set with 1 element, 2 a set with 2 elements, etc., and from here we work our
way up to the fancier numbers. The idea is that if we trust our axioms for
sets to be consistent, then the things we construct on top of them should
also be consistent, although if we are not careful in our definitions they may
not be exactly the things we think they are.
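This bottom-up construction can be sketched concretely. Here is a minimal Python sketch (the function names are my own, not from the text) of the standard von Neumann encoding, where 0 is the empty set and each later number is the set of all smaller numbers:

```python
# Von Neumann encoding of the naturals as sets: 0 = {}, n + 1 = n ∪ {n}.
EMPTY = frozenset()

def succ(n):
    """Successor: n + 1 is n together with n itself as an extra element."""
    return n | frozenset({n})

zero = EMPTY
one = succ(zero)    # {∅}
two = succ(one)     # {∅, {∅}}
three = succ(two)   # {∅, {∅}, {∅, {∅}}}

# Each number is the set of all smaller numbers, so |n| = n and m ∈ n iff m < n.
assert len(three) == 3
assert zero in three and one in three and two in three
assert three not in three
```

Under this encoding the fancier operations (order, addition, and so on) can all be defined in terms of set membership, which is the sense in which everything rests on the axioms for sets.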
2.2 Propositional logic
Propositional logic is the simplest form of logic. Here the only statements
that are considered are propositions, which contain no variables. Because
propositions contain no variables, they are either always true or always false.
Examples of propositions:
2 + 2 = 4. (Always true).
2 + 2 = 5. (Always false).
Examples of non-propositions:
x + 2 = 4. (May be true, may not be true; it depends on the value of
x.)
2.2.1 Operations on propositions
Propositions by themselves are pretty boring. So boring, in fact, that logicians quickly stop talking about specific propositions and instead haul out
placeholder names like p, q, or r. But we can build slightly more interesting propositions by combining propositions together using various logical
connectives, such as:
Negation The negation of p is written as ¬p, or sometimes ~p, −p, or p̄.
It has the property that it is false when p is true, and true when p is
false.
Or The or of two propositions p and q is written as p ∨ q, and is true as
long as at least one, or possibly both, of p and q is true.¹ This is not
always the same as what "or" means in English; in English, "or" often
is used for exclusive or, which is not true if both p and q are true. For
example, if someone says "You will give me all your money or I will
stab you with this table knife", you would be justifiably upset if you
turn over all your money and still get stabbed. But a logician would
not be at all surprised, because the standard "or" in propositional logic
is an inclusive or that allows for both outcomes.
Exclusive or If you want to exclude the possibility that both p and q are
true, you can use exclusive or instead. This is written as p ⊕ q, and
is true precisely when exactly one of p or q is true. Exclusive or is
not used in classical logic much, but is important for many computing
applications, since it corresponds to addition modulo 2 (see §8.4).
¹The symbol ∨ is a stylized V, intended to represent the Latin word vel, meaning
"or". (Thanks to Noel McDermott for remembering this.) Much of this notation is actually pretty recent (early 20th century): see http://jeff560.tripod.com/set.html for a
summary of earliest uses of each symbol.
The symbol ∧ is a stylized A, short for the Latin word atque, meaning "and also".
[Table 2.1: the logical connectives ¬p, p ∧ q, p ⊕ q, p ∨ q, p → q, p ↔ q, and their precedence, from highest (¬) to lowest (↔).]
2.2.1.1 Precedence
The short version: for the purposes of this course, we will use the ordering in
Table 2.1, which corresponds roughly to precedence in C-like programming
languages. But see caveats below. Remember always that there is no shame
in putting in a few extra parentheses if it makes a formula more clear.
Examples: (¬p ∨ q ∧ r → s ↔ t) is interpreted as ((((¬p) ∨ (q ∧ r))
→ s) ↔ t). Both OR and AND are associative, so (p ∨ q ∨ r) is the same as
((p ∨ q) ∨ r) and as (p ∨ (q ∨ r)), and similarly (p ∧ q ∧ r) is the same as
((p ∧ q) ∧ r) and as (p ∧ (q ∧ r)).
Note that this convention is not universal: many mathematicians give
AND and OR equal precedence, so that the meaning of p ∧ q ∨ r is ambiguous without parentheses. There are good arguments for either convention.
Making AND have higher precedence than OR is analogous to giving multiplication higher precedence than addition, and makes sense visually when
AND is written multiplicatively (as in pq ∨ qr for (p ∧ q) ∨ (q ∧ r)). Making them have the same precedence emphasizes the symmetry between the
two operations, which we'll see more about later when we talk about De
Morgan's laws in §2.2.3. But as with anything else in mathematics, either
convention can be adopted, as long as you are clear about what you are
doing and it doesn't cause annoyance to the particular community you are
writing for.
There does not seem to be a standard convention for the precedence of
XOR, since logicians don't use it much. There are plausible arguments for
putting XOR in between AND and OR, but it's probably safest just to use
parentheses.
Implication is not associative, although the convention is that it binds
to the right, so that a → b → c is read as a → (b → c); except for
type theorists and Haskell programmers, few people ever remember this,
so it is usually safest to put in the parentheses. I personally have no idea
what p ↔ q ↔ r means, so any expression like this should be written with
parentheses as either (p ↔ q) ↔ r or p ↔ (q ↔ r).
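That the two groupings of → genuinely differ is easy to check by hand. A quick Python sketch (my own encoding of a → b as ¬a ∨ b, not code from the text):

```python
def implies(a, b):
    # a → b is false only when a is true and b is false
    return (not a) or b

# a → (b → c) and (a → b) → c disagree, e.g. at a = b = c = False:
a = b = c = False
assert implies(a, implies(b, c)) is True    # a → (b → c) is vacuously true
assert implies(implies(a, b), c) is False   # (a → b) → c is True → False
```

So the right-associative reading and the left-associative reading are different formulas, which is exactly why the parentheses are worth writing.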
2.2.2 Truth tables
To define logical operations formally, we give a truth table. This gives, for
any combination of truth values (true or false, which as computer scientists
we often write as 1 or 0) of the inputs, the truth value of the output. In this
usage, truth tables are to logic what addition and multiplication tables are
to arithmetic.
Here is a truth table for negation:
p | ¬p
0 |  1
1 |  0
And here is a truth table for the rest of the logical operators:

p q | p∧q p∨q p⊕q p→q p↔q
0 0 |  0   0   0   1   1
0 1 |  0   1   1   1   0
1 0 |  0   1   1   0   0
1 1 |  1   1   0   1   1
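Tables like this are purely mechanical, so they are easy to regenerate by program. A short Python sketch (my own code, not from the text) that defines the connectives on {0, 1} and prints the same table:

```python
from itertools import product

# The five binary connectives, defined directly on truth values 0 and 1.
ops = {
    "p∧q": lambda p, q: p & q,
    "p∨q": lambda p, q: p | q,
    "p⊕q": lambda p, q: p ^ q,
    "p→q": lambda p, q: (1 - p) | q,   # false only when p = 1 and q = 0
    "p↔q": lambda p, q: 1 - (p ^ q),   # true exactly when p and q agree
}

print("p q | " + " ".join(ops))
for p, q in product((0, 1), repeat=2):
    print(p, q, "|", "   ".join(str(f(p, q)) for f in ops.values()))
```

The same enumeration over all assignments is what makes "check it with a truth table" an effective, if exponential, decision procedure for propositional logic.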
Q, etc. We can check that each truth table we construct works by checking
that the truth values in each column (corresponding to some subexpression of
the thing we are trying to prove) follow from the truth values in previous
columns according to the rules established by the truth table defining the
appropriate logical operation.
For predicate logic, model checking becomes more complicated, because
a typical system of axioms is likely to have infinitely many models, many of
which are likely to be infinitely large. There we will need to rely much more
on proofs constructed by applying inference rules.
2.2.3 Tautologies and logical equivalence
p q | p→q | ¬p∨q
0 0 |  1  |  1
0 1 |  1  |  1
1 0 |  0  |  0
1 1 |  1  |  1
p q | p∧q | ¬(p∧q) | ¬p | ¬q | ¬p∨¬q
0 0 |  0  |   1    |  1 |  1 |   1
0 1 |  0  |   1    |  1 |  0 |   1
1 0 |  0  |   1    |  0 |  1 |   1
1 1 |  1  |   0    |  0 |  0 |   0
p q r | q∨r | p∧(q∨r) | p∧q | p∧r | (p∧q)∨(p∧r)
0 0 0 |  0  |    0    |  0  |  0  |      0
0 0 1 |  1  |    0    |  0  |  0  |      0
0 1 0 |  1  |    0    |  0  |  0  |      0
0 1 1 |  1  |    0    |  0  |  0  |      0
1 0 0 |  0  |    0    |  0  |  0  |      0
1 0 1 |  1  |    1    |  0  |  1  |      1
1 1 0 |  1  |    1    |  1  |  0  |      1
1 1 1 |  1  |    1    |  1  |  1  |      1
associativity of ∨):

(p → r) ∨ (q → r) ≡ (¬p ∨ r) ∨ (¬q ∨ r)   [Using p → q ≡ ¬p ∨ q twice]
≡ ¬p ∨ ¬q ∨ r ∨ r                          [Associativity and commutativity of ∨]
≡ ¬p ∨ ¬q ∨ r                              [p ∨ p ≡ p]
≡ ¬(p ∧ q) ∨ r                             [De Morgan's law]
≡ (p ∧ q) → r.                             [p → q ≡ ¬p ∨ q]
¬¬p ≡ p                              Double negation
¬(p ∧ q) ≡ ¬p ∨ ¬q                   De Morgan's law
¬(p ∨ q) ≡ ¬p ∧ ¬q                   De Morgan's law
p ∧ q ≡ q ∧ p                        Commutativity of AND
p ∨ q ≡ q ∨ p                        Commutativity of OR
p ∧ (q ∧ r) ≡ (p ∧ q) ∧ r            Associativity of AND
p ∨ (q ∨ r) ≡ (p ∨ q) ∨ r            Associativity of OR
p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)      AND distributes over OR
p ∨ (q ∧ r) ≡ (p ∨ q) ∧ (p ∨ r)      OR distributes over AND
p → q ≡ ¬p ∨ q
p → q ≡ ¬q → ¬p
p ↔ q ≡ (p → q) ∧ (q → p)
p ↔ q ≡ ¬p ↔ ¬q
p ↔ q ≡ q ↔ p

Table 2.2: Common logical equivalences (see also [Fer08, Theorem 1.1])
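Any entry in a table of equivalences like this can be verified by brute force, truth-table style: two formulas are logically equivalent exactly when they agree on every assignment. A small Python sketch (my own code, not from the text) checking a few of the rows:

```python
from itertools import product

def equivalent(f, g, nvars):
    """Logical equivalence: f and g agree on every truth assignment."""
    return all(f(*v) == g(*v) for v in product((False, True), repeat=nvars))

# ¬(p ∧ q) ≡ ¬p ∨ ¬q   (De Morgan's law)
assert equivalent(lambda p, q: not (p and q),
                  lambda p, q: (not p) or (not q), 2)

# p ∧ (q ∨ r) ≡ (p ∧ q) ∨ (p ∧ r)   (AND distributes over OR)
assert equivalent(lambda p, q, r: p and (q or r),
                  lambda p, q, r: (p and q) or (p and r), 3)

# p → q ≡ ¬q → ¬p   (contraposition), encoding a → b as ¬a ∨ b
assert equivalent(lambda p, q: (not p) or q,
                  lambda p, q: (not (not q)) or (not p), 2)
```

Checking all 2ⁿ assignments is exponential in the number of variables, but for the handful of variables in these identities it is instantaneous.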
P as an object of type P;

P → Q as a function that takes a P as an argument and returns a Q;

P ∧ Q as an object that contains both a P and a Q (like a struct in C);

P ∨ Q as an object that contains either a P or a Q (like a union in C); and

¬P as P → ⊥, a function that given a P produces a special error value ⊥ that
can't otherwise be generated.

With this interpretation, many theorems of classical logic continue to hold. For example,
modus ponens says

(P ∧ (P → Q)) → Q.

Seen through the Curry-Howard isomorphism, this means that there is a function that,
given a P and a function that generates a Q from a P, generates a Q. For example, the
following Scheme function:

(define (modus-ponens p p-implies-q) (p-implies-q p))
P ∧ 0 ≡ 0
P ∨ 0 ≡ P
P ∧ 1 ≡ P
P ∨ 1 ≡ 1
P ⊕ 0 ≡ P
P ⊕ 1 ≡ ¬P
P → 0 ≡ ¬P
P → 1 ≡ 1
0 → P ≡ 1
1 → P ≡ P
P ↔ 0 ≡ ¬P
P ↔ 1 ≡ P

Table 2.3: Absorption laws. The first four are the most important. Note
that ∧, ∨, ⊕, and ↔ are all commutative, so reversed variants also work.

the law of the excluded middle or the law of non-contradiction. These can
then be absorbed into nearby terms using various absorption laws, shown in
Table 2.3.
Example Let's show that (P ∧ (P → Q)) → Q is a tautology. (This
justifies the inference rule modus ponens, defined below.) Working from the
inside out:
Similarly, in a sufficiently sophisticated programming language we can show P → ¬¬P,
since this expands to P → ((P → ⊥) → ⊥), and we can write a function that takes a P
as its argument and returns a function that takes a P → ⊥ function and feeds the P to it:

(define (double-negation p)
  (lambda (p-implies-fail) (p-implies-fail p)))

But we can't generally show ¬¬P → P, since there is no way to take a function of type
(P → ⊥) → ⊥ and extract an actual example of a P from it. Nor can we expect to show
P ∨ ¬P, since this would require exhibiting either a P or a function that takes a P and
produces an error, and for any particular type P we may not be able to do either.

For normal mathematical proofs, we won't bother with this, and will just assume P ∨ ¬P
always holds.
(P ∧ (P → Q)) → Q
≡ ((P ∧ ¬P) ∨ (P ∧ Q)) → Q   [expand]
≡ (0 ∨ (P ∧ Q)) → Q          [non-contradiction]
≡ (P ∧ Q) → Q                [absorption]
≡ ¬(P ∧ Q) ∨ Q               [expand]
≡ (¬P ∨ ¬Q) ∨ Q              [De Morgan's law]
≡ ¬P ∨ (¬Q ∨ Q)              [associativity]
≡ ¬P ∨ 1                     [excluded middle]
≡ 1                          [absorption]
In this derivation, we've labeled each step with the equivalence we used.
Most of the time we would not be this verbose.
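The end result can also be confirmed by brute force, since a tautology is just a formula that comes out true under all 2ⁿ assignments. A Python sketch (my own code, not from the text):

```python
from itertools import product

def implies(a, b):
    # a → b, encoded as ¬a ∨ b
    return (not a) or b

def is_tautology(f, nvars):
    """Check a propositional formula under every truth assignment."""
    return all(f(*v) for v in product((False, True), repeat=nvars))

# (P ∧ (P → Q)) → Q, the tautology behind modus ponens
assert is_tautology(lambda p, q: implies(p and implies(p, q), q), 2)

# By contrast, p ∧ ¬p is a contradiction: false under every assignment
assert not is_tautology(lambda p, q: p and not p, 2)
```

This is the model-checking approach mentioned earlier: feasible for propositional logic, where the set of assignments is finite, but not for predicate logic.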
2.2.4 Normal forms
2.3 Predicate logic
choice of axioms, we may not know this. What we would like is a general
way to say that humanity implies mortality for everybody, but with just
propositional logic, we can't write this fact down.
2.3.1 Variables and predicates
The solution is to extend our language to allow formulas that involve variables. So we might let x, y, z, etc., stand for any element of our universe of
discourse or domain: essentially whatever things we happen to be talking
about at the moment. We can now write statements like:
x is human.
x is the parent of y.
x + 2 = x2 .
These are not propositions because they have variables in them. Instead,
they are predicates: statements whose truth-value depends on what concrete object takes the place of the variable. Predicates are often abbreviated
by single capital letters followed by a list of arguments, the variables that
appear in the predicate, e.g.:
H(x) = x is human.
P (x, y) = x is the parent of y.
Q(x) = x + 2 = x2 .
We can also fill in specific values for the variables, e.g. H(Spocrates) =
"Spocrates is human." If we fill in specific values for all the variables, we
have a proposition again, and can talk about that proposition being true
(e.g. Q(2) and Q(−1) are true) or false (Q(0) is false).
In first-order logic, which is what we will be using in this course, variables always refer to things and never to predicates: any predicate symbol
is effectively a constant. There are higher-order logics that allow variables
to refer to predicates, but most mathematics accomplishes the same thing
by representing predicates with sets (see Chapter 3).
2.3.2 Quantifiers
Universal quantifier
Existential quantifier
The existential quantifier ∃ (pronounced "there exists") says that a statement must be true for at least one value of the variable. So "some human is
mortal" becomes ∃x : Human(x) ∧ Mortal(x). Note that we use AND rather
than implication here; the statement ∃x : Human(x) → Mortal(x) makes
the much weaker claim that there is some thing x, such that if x is human,
then x is mortal, which is true in any universe that contains an immortal
purple penguin: since it isn't human, Human(penguin) → Mortal(penguin)
is true.

As with ∀, ∃ can be limited to an explicit universe with set membership
notation, e.g., ∃x ∈ Z : x = x². This is equivalent to writing ∃x : x ∈
Z ∧ x = x².
talking about real numbers (two of which happen to be square roots of 79),
we can exclude the numbers we don't want by writing

∀x ∈ Z : x² ≠ 79

which is interpreted as

∀x : (x ∈ Z → x² ≠ 79)

or, equivalently,

∀x : x ∉ Z ∨ x² ≠ 79.

Here Z = {. . . , −2, −1, 0, 1, 2, . . .} is the standard set of integers.
For more uses of ∈, see Chapter 3.
2.3.2.5 Nested quantifiers
∀x∃y : likes(x, y)

and

∃y∀x : likes(x, y)

mean very different things. The first says that for every person, there is
somebody that that person likes: we live in a world with no complete misanthropes. The second says that there is some single person who is so
immensely popular that everybody in the world likes them. The nesting of
the quantifiers is what makes the difference: in ∀x∃y : likes(x, y), we are
saying that no matter who we pick for x, ∃y : likes(x, y) is a true statement;
while in ∃y∀x : likes(x, y), we are saying that there is some y that makes
∀x : likes(x, y) true.
Naturally, such games can go on for more than two steps, or allow the
same player more than one move in a row. For example

∀x∀y∃z : x² + y² = z²

is a kind of two-person challenge version of the Pythagorean theorem where
the universal player gets to pick x and y and the existential player has to
respond with a winning z. (Whether the statement itself is true or false
depends on the range of the quantifiers; it's false, for example, if x, y, and z
are all natural numbers or rationals but true if they are all real or complex.
Note that the universal player only needs to find one bad (x, y) pair to make
it false.)
One thing to note about nested quantifiers is that we can switch the
order of two universal quantifiers or two existential quantifiers, but we can't
swap a universal quantifier for an existential quantifier or vice versa. So
for example ∀x∀y : (x = y → x + 1 = y + 1) is logically equivalent to
∀y∀x : (x = y → y + 1 = x + 1), but ∀x∃y : y < x is not logically equivalent
to ∃y∀x : y < x. This is obvious if you think about it in terms of playing
games: if I get to choose two things in a row, it doesn't really matter which
order I choose them in, but if I choose something and then you respond it
might make a big difference if we make you go first instead.
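Over a finite domain, ∀ and ∃ are just Python's all and any, which makes the order-of-quantifiers point easy to test. A sketch with a hypothetical likes relation (the people and the relation are my own invented example, not from the text):

```python
domain = ["alice", "bob", "carol"]
# Hypothetical "likes" relation: (x, y) means x likes y.
likes = {("alice", "bob"), ("bob", "carol"), ("carol", "carol")}

# ∀x ∃y : likes(x, y) — everybody likes somebody
forall_exists = all(any((x, y) in likes for y in domain) for x in domain)

# ∃y ∀x : likes(x, y) — somebody is liked by everybody
exists_forall = any(all((x, y) in likes for x in domain) for y in domain)

assert forall_exists       # true: each of the three likes at least one person
assert not exists_forall   # false: no single person is liked by all three
```

The nesting of the comprehensions mirrors the nesting of the quantifiers exactly, which is why swapping all and any changes the answer.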
One measure of the complexity of a mathematical statement is how many
layers of quantifiers it has, where a layer is a sequence of all-universal or
all-existential quantifiers. Here's a standard mathematical definition that
involves three layers of quantifiers, which is about the limit for most humans:

lim_{x→∞} f(x) = y ≡ [∀ε > 0 : ∃N : ∀x > N : |f(x) − y| < ε]

Now that we know how to read nested quantifiers, it's easy to see what
the right-hand side means:
Examples

∀x : Crow(x) → Black(x)
∃x : Cow(x) ∧ Brown(x)

¬∃x : Cow(x) ∧ Blue(x)

¬∀x : Glitters(x) → Gold(x)

Or ∃x : Glitters(x) ∧ ¬Gold(x). Note that the English syntax is a bit ambiguous: a literal translation might look like ∀x : Glitters(x) → ¬Gold(x),
which is not logically equivalent. This is an example of how predicate logic
is often more precise than natural language.

No shirt, no service.

∀x : ¬Shirt(x) → ¬Served(x)

∀x∃y : Causes(y, x)
2.3.3 Functions
2.3.4 Equality
Often we include a special equality predicate =, written x = y. The interpretation of x = y is that x and y are the same element of the domain. It
2.3.4.1 Uniqueness
An occasionally useful abbreviation is ∃!x : P(x), which stands for "there exists a unique x such that P(x)." This is short for

(∃x : P(x)) ∧ (∀x∀y : P(x) ∧ P(y) → x = y).

An example is ∃!x : x + 1 = 12. To prove this we'd have to show not
only that there is some x for which x + 1 = 12 (11 comes to mind), but that
if we have any two values x and y such that x + 1 = 12 and y + 1 = 12, then
x = y (this is not hard to do). So the exclamation point encodes quite a bit
of extra work, which is why we usually hope that ∃x : x + 1 = 12 is good
enough.
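Over a finite domain, both halves of the expansion (existence plus uniqueness) collapse into counting witnesses. A Python sketch (my own code, not from the text):

```python
def exists_unique(pred, domain):
    """∃!x : P(x) over a finite domain.

    Equivalent to (∃x : P(x)) ∧ (∀x ∀y : P(x) ∧ P(y) → x = y):
    at least one witness, and any two witnesses are equal.
    """
    witnesses = [x for x in domain if pred(x)]
    return len(witnesses) == 1

domain = range(100)
assert exists_unique(lambda x: x + 1 == 12, domain)      # only x = 11 works
assert not exists_unique(lambda x: x * 0 == 0, domain)   # every x works: not unique
assert not exists_unique(lambda x: x + 1 == -5, domain)  # no x works: no existence
```

The two failing cases correspond exactly to the two conjuncts of the definition: the second assertion fails uniqueness, the third fails existence.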
2.3.5 Models
In propositional logic, we can build truth tables that describe all possible
settings of the truth-values of the literals. In predicate logic, the analogous
concept to an assignment of truth-values is a structure. A structure consists of a set of objects or elements (built using set theory, as described in
Chapter 3), together with a description of which elements fill in for the constant symbols, which predicates hold for which elements, and what the value
of each function symbol is when applied to each possible list of arguments
(note that this depends on knowing what constant, predicate, and function
symbols are available (this information is called the signature of the structure). A structure is a model of a particular theory (set of statements), if
each statement in the theory is true in the model.

In general we can't hope to find all possible models of a given theory.
But models are useful for two purposes: if we can find some model of a
particular theory, then the existence of this model demonstrates that the
theory is consistent; and if we can find a model of the theory in which some
additional statement S doesn't hold, then we can demonstrate that there is
no way to prove S from the theory (i.e., it is not the case that T ⊢ S, where
T is the list of axioms that define the theory).
2.3.5.1 Examples
Consider the axiom ¬∃x : x = x. This axiom has exactly one model (it's
empty).

Now consider the axiom ∃!x : x = x. This axiom also has exactly one model
(with one element).
We can enforce exactly k elements with one rather long axiom, e.g. for
k = 3 do ∃x1∃x2∃x3∀y : y = x1 ∨ y = x2 ∨ y = x3. In the absence of
any special symbols, a structure of 3 undifferentiated elements is the
unique model of this axiom.
Suppose we add a predicate P and consider the axiom ∃xP x. Now
we have many models: take any nonempty model you like, and let P
be true of at least one of its elements. If we take a model with two
elements a and b, with P a and ¬P b, we get that ∃xP x is not enough
to prove ∀xP x, since the latter statement isn't true in this model.
Now let's bring in a function symbol S and constant symbol 0. Consider a stripped-down version of the Peano axioms that consists of just
the axiom ∀x∀y : Sx = Sy → x = y. Both the natural numbers N and
the integers Z are a model for this axiom, as is the set Zm of integers
mod m for any m (see §8.4). In each case each element has a unique
predecessor, which is what the axiom demands. If we throw in the
first Peano axiom ∀x : Sx ≠ 0, we eliminate Z and Zm because in
each of these models 0 is a successor of some element. But we don't
eliminate a model that consists of two copies of N sitting next to each
other (only one of which contains the "real" 0), or even a model that
consists of one copy of N (to make 0 happy) plus any number of copies
of Z and Zm.
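We can watch these axioms cut models down by checking them mechanically over finite stand-ins. A Python sketch (my own code, not from the text) checking the injectivity axiom and ∀x : Sx ≠ 0 on Z_m, where S(x) = x + 1 mod m:

```python
def injective(S, domain):
    # The stripped-down Peano axiom: ∀x ∀y : Sx = Sy → x = y
    return all(S(x) != S(y) for x in domain for y in domain if x != y)

m = 5
Zm = range(m)
S = lambda x: (x + 1) % m

# Z_m satisfies the injectivity axiom: successor mod m is a bijection...
assert injective(S, Zm)

# ...but violates ∀x : Sx ≠ 0, since 0 is the successor of m - 1.
assert not all(S(x) != 0 for x in Zm)
```

So Z_m is a model of the first axiom alone but is eliminated as soon as we add ∀x : Sx ≠ 0, exactly as described above.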
2.4 Proofs
case) or about the logical relation between two statements (the second).
Things get a little more complicated with statements involving predicates;
in this case there are incompleteness theorems that say that sufficiently
powerful sets of axioms have consequences that can't be proven unless the
theory is inconsistent.
2.4.1 Inference Rules
Inference rules let us construct valid arguments, which have the useful property that if their premises are true, their conclusions are also true.

The main source of inference rules is tautologies of the form P1 ∧ P2 ∧ . . .
→ Q; given such a tautology, there is a corresponding inference rule that allows us to assert Q once we have P1, P2, . . . (either because each Pi is an
axiom/theorem/premise or because we proved it already while doing the
proof). The most important inference rule is modus ponens, based on the
tautology (p ∧ (p → q)) → q; this lets us, for example, write the following
famous argument:⁵
1. If it doesn't fit, you must acquit. [Axiom]
2. It doesn't fit. [Premise]
3. You must acquit. [Modus ponens applied to 1+2]
There are many named inference rules in classical propositional logic.
We'll list some of them below. You don't need to remember the names
of anything except modus ponens, and most of the rules are pretty much
straightforward applications of modus ponens plus some convenient tautology that can be proved by truth tables or stock logical equivalences. (For
example, the addition rule below is just the result of applying modus
ponens to p and the tautology p → (p ∨ q).)
Inference rules are often written by putting the premises above a horizontal line and the conclusion below. In text, the horizontal line is often
replaced by the symbol ⊢, which means exactly the same thing. Premises
are listed on the left-hand side separated by commas, and the conclusion is
on the right-hand side.
Addition                p ⊢ p ∨ q
Simplification          p ∧ q ⊢ p
Conjunction             p, q ⊢ p ∧ q
Modus ponens            p, p → q ⊢ q
Modus tollens           ¬q, p → q ⊢ ¬p
Hypothetical syllogism  p → q, q → r ⊢ p → r
Disjunctive syllogism   p ∨ q, ¬p ⊢ q
Resolution              p ∨ q, ¬p ∨ r ⊢ q ∨ r
2.4.2
to

⊢ (P1 ∧ P2 ∧ . . . ∧ Pn) → Q.

The statement that we can do this, for a given collection of inference
rules, is the Deduction Theorem:

Theorem 2.4.1 (Deduction Theorem). If there is a proof of Q from premises
Γ, P1, P2, . . . , Pn, then there is a proof of P1 → P2 → . . . → Pn → Q from
Γ alone.
The actual proof of the theorem depends on the particular set of inference
rules we start with, but the basic idea is that there exists a mechanical
procedure for extracting a proof of the implication from the proof of Q
assuming P1 etc.
Caveat: In predicate logic, the deduction theorem only applies if none
of the premises contain any free variables (which are variables that aren't
bound by a containing quantifier). Usually you won't run into this, but
there are some bad cases that arise without this restriction.
2.5 Natural deduction
2.5.1
The equality predicate is special, in that it allows for the substitution rule

x = y, P(x) ⊢ P(y).

If we don't want to include the substitution rule as an inference rule, we
could instead represent it as an axiom schema:

∀x : ∀y : ((x = y ∧ P(x)) → P(y)).

But this is messier. We can also assert x = x directly:

⊢ x = x
2.5.2
(¬I)  from Γ ⊢ P, conclude Γ ⊢ ¬¬P
(¬E)  from Γ ⊢ ¬¬P, conclude Γ ⊢ P
(∧I)  from Γ ⊢ P and Γ ⊢ Q, conclude Γ ⊢ P ∧ Q
(∧E1) from Γ ⊢ P ∧ Q, conclude Γ ⊢ P
(∧E2) from Γ ⊢ P ∧ Q, conclude Γ ⊢ Q
(∨I1) from Γ ⊢ P, conclude Γ ⊢ P ∨ Q
(∨I2) from Γ ⊢ Q, conclude Γ ⊢ P ∨ Q
(∨E1) from Γ ⊢ P ∨ Q and Γ ⊢ ¬Q, conclude Γ ⊢ P
(∨E2) from Γ ⊢ P ∨ Q and Γ ⊢ ¬P, conclude Γ ⊢ Q
(→I)  from Γ, P ⊢ Q, conclude Γ ⊢ P → Q
(→E1) from Γ ⊢ P → Q and Γ ⊢ P, conclude Γ ⊢ Q
(→E2) from Γ ⊢ P → Q and Γ ⊢ ¬Q, conclude Γ ⊢ ¬P
2.6 Proof techniques
These strategies are largely drawn from [Sol05], particularly the summary table in
the appendix, which is the source of the order and organization of the table and the
names of most of the techniques. The table omits some techniques that are mentioned in
Solow [Sol05]: Direct Uniqueness, Indirect Uniqueness, various max/min arguments, and
induction proofs. (Induction proofs are covered in Chapter 5.)
For other sources, Ferland [Fer08] has an entire chapter on proof techniques of various
sorts. Rosen [Ros12] describes proof strategies in §§1.5-1.7 and Biggs [Big02] describes
various proof techniques in Chapters 1, 3, and 4; both descriptions are a bit less systematic
than the ones in Solow or Ferland, but also include a variety of specific techniques that
are worth looking at.
Strategy: when to use it, what to assume, and what to conclude.

Direct proof: Try it first. Assume A; conclude B.

Contraposition: When B = ¬Q. Assume Q; conclude ¬A.

Contradiction: When B = ¬Q, or when you are stuck trying the other techniques. Assume A ∧ ¬B; conclude False.

Construction: When B = ∃xP(x). Conclude P(c) for some specific object c.

Counterexample: When B = ¬∀xP(x). Conclude ¬P(c) for some specific object c: pick a likely-looking c and show that ¬P(c) holds. This is identical to a proof by construction, except that we are proving ∃x¬P(x), which is equivalent to ¬∀xP(x).

Choose: When B = ∀x : P(x) → Q(x). Assume A and P(c), where c is chosen arbitrarily; conclude Q(c).

Instantiation: When A = ∀xP(x).

Elimination: When B = C ∨ D. Assume A ∧ ¬C; conclude D.

Case analysis: When A = C ∨ D. Assume C and D separately; conclude B in each case.

Induction: When B = ∀x ∈ N : P(x). Conclude P(0) and ∀x ∈ N : (P(x) → P(x + 1)).
Chapter 3
Set theory
Set theory is the dominant foundation for mathematics. The idea is that
everything else in mathematics (numbers, functions, etc.) can be written
in terms of sets, so that if you have a consistent description of how sets
behave, then you have a consistent description of how everything built on
top of them behaves. If predicate logic is the machine code of mathematics,
set theory would be assembly language.
3.1 Naive set theory
Naive set theory is the informal version of set theory that corresponds
to our intuitions about sets as unordered collections of objects (called elements) with no duplicates. A set can be written explicitly by listing its
elements using curly braces:
{} = the empty set ∅, which has no elements.
{Moe, Curly, Larry} = the Three Stooges.
{0, 1, 2, . . .} = N, the natural numbers. Note that we are relying on
the reader guessing correctly how to continue the sequence here.
{{} , {0} , {1} , {0, 1} , {0, 1, 2} , 7} = a set of sets of natural numbers,
plus a stray natural number that is directly an element of the outer
set.
Membership in a set is written using the symbol ∈ (pronounced "is an
element of," "is a member of," or just "is in"). So we can write Moe ∈ the
Three Stooges.
3.2 Operations on sets
universe U instead of A.
3.3 Proving things about sets
We have three predicates so far in set theory, so there are essentially three
positive things we could try to prove about sets:
1. Given x and S, show x ∈ S. This requires looking at the definition of
S to see if x satisfies its requirements, and the exact structure of the
proof will depend on what the definition of S is.
2. Given S and T, show S ⊆ T. Expanding the definition of subset, this
means we have to show that every x in S is also in T . So a typical
proof will pick an arbitrary x in S and show that it must also be an
element of T . This will involve unpacking the definition of S and using
its properties to show that x satisfies the definition of T .
3. Given S and T, show S = T. Typically we do this by showing S ⊆ T
and T ⊆ S separately. The first shows that ∀x : x ∈ S → x ∈ T; the
second shows that ∀x : x ∈ T → x ∈ S. Together, x ∈ S → x ∈ T
and x ∈ T → x ∈ S gives x ∈ S ↔ x ∈ T, which is what we need for
equality.
There are also the corresponding negative statements:

1. For x ∉ S, use the definition of S as before.

2. For S ⊈ T, we only need a counterexample: pick any one element of
S and show that it's not an element of T.

3. For S ≠ T, prove one of S ⊈ T or T ⊈ S.

Note that because S ⊈ T and S ≠ T are existential statements rather
than universal ones, they tend to have simpler proofs.
Here are some examples, which we'll package up as a lemma:

Lemma 3.3.1. The following statements hold for all sets S and T, and all
predicates P:

S ∩ T ⊆ S    (3.3.1)

S ⊆ S ∪ T    (3.3.2)

{x ∈ S | P(x)} ⊆ S    (3.3.3)

S = (S ∩ T) ∪ (S \ T)    (3.3.4)
Proof.

(3.3.1) Let x be in S ∩ T. Then x ∈ S and x ∈ T, from the
definition of S ∩ T. It follows that x ∈ S. Since x was arbitrary, we
have that for all x in S ∩ T, x is also in S; in other words, S ∩ T ⊆ S.

(3.3.2) Let x be in S.¹ Then x ∈ S ∨ x ∈ T is true, giving x ∈ S ∪ T.

(3.3.3) Let x be in {x ∈ S | P(x)}. Then, by the definition of set
comprehension, x ∈ S and P(x). We don't care about P(x), so we
drop it to just get x ∈ S.

(3.3.4) This is a little messy, but we can solve it by breaking it down
into smaller problems.

First, we show that S ⊆ (S \ T) ∪ (S ∩ T). Let x be an element of S.
There are two cases:

1. If x ∈ T, then x ∈ (S ∩ T).
2. If x ∉ T, then x ∈ (S \ T).

In either case, we have shown that x is in (S ∩ T) ∪ (S \ T). This gives
S ⊆ (S ∩ T) ∪ (S \ T).

Conversely, we show that (S \ T) ∪ (S ∩ T) ⊆ S. Suppose that x ∈
(S \ T) ∪ (S ∩ T). Again we have two cases:

1. If x ∈ (S \ T), then x ∈ S and x ∉ T.
2. If x ∈ (S ∩ T), then x ∈ S and x ∈ T.

In either case, x ∈ S.

Since we've shown that both the left-hand and right-hand sides of
(3.3.4) are subsets of each other, they must be equal.
Using similar arguments, we can show that properties of ∧ and ∨ that
don't involve negation carry over to ∩ and ∪ in the obvious way. For example, both operations are commutative and associative, and each distributes
over the other.

¹Note that we are starting with S here because we are really showing the equivalent
statement S ⊆ S ∪ T.
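Python's built-in sets make it easy to spot-check identities like these on concrete examples (this tests instances, of course; it is not a proof). A sketch with sets of my own choosing:

```python
S = {1, 2, 3, 4}
T = {3, 4, 5}

# Lemma 3.3.1 on one concrete pair of sets (<= is the subset test):
assert S & T <= S                          # S ∩ T ⊆ S
assert S <= S | T                          # S ⊆ S ∪ T
assert {x for x in S if x % 2 == 0} <= S   # {x ∈ S | P(x)} ⊆ S
assert S == (S & T) | (S - T)              # S = (S ∩ T) ∪ (S \ T)

# ∩ and ∪ distribute over each other, mirroring ∧ and ∨:
U = {2, 5, 6}
assert S & (T | U) == (S & T) | (S & U)
assert S | (T & U) == (S | T) & (S | U)
```

The parallel between the set operators &, |, - and the logical connectives ∧, ∨, ¬ here is exactly the carry-over described in the paragraph above.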
3.4 Axiomatic set theory
The problem with naive set theory is that unrestricted set comprehension
is too strong, leading to contradictions. Axiomatic set theory fixes this
problem by being more restrictive about what sets one can form. The axioms
most commonly used are known as Zermelo-Fraenkel set theory with
choice, or ZFC. We'll describe the axioms of ZFC below, but in practice
you mostly just need to know what constructions you can get away with.

The short version is that you can construct sets by (a) listing their
members, (b) taking the union of other sets, (c) taking the set of all subsets
of a set, or (d) using some predicate to pick out elements or subsets of some
set.² The starting points for this process are the empty set and the set N
of all natural numbers (suitably encoded as sets). If you can't construct a
set in this way (like the Russell's Paradox set), odds are that it isn't a set.
These properties follow from the more useful axioms of ZFC:

Extensionality Any two sets with the same elements are equal.³

Existence The empty set is a set.⁴

Pairing Given sets x and y, {x, y} is a set.⁵

Union For any set of sets S = {x, y, z, . . .}, the set ∪S = x ∪ y ∪ z ∪ . . .
exists.⁶

Power set For any set S, the power set P(S) = {A | A ⊆ S} exists.⁷

Specification For any set S and any predicate P, the set {x ∈ S | P(x)}
exists.⁸ This is called restricted comprehension, and is an axiom schema instead of an axiom, since it generates an infinite list
of axioms, one for each possible P. Limiting ourselves to constructing subsets of existing sets avoids Russell's Paradox, because we can't
construct S = {x | x ∉ x}. Instead, we can try to construct S =
{x ∈ T | x ∉ x}, but we'll find that S isn't an element of T, so it
doesn't contain itself but also doesn't create a contradiction.
² Technically this only gives us Z, a weaker set theory than ZFC that omits Replacement
(Fraenkel's contribution) and Choice.
³ ∀x : ∀y : (x = y) ↔ (∀z : z ∈ x ↔ z ∈ y).
⁴ ∃x : ∀y : y ∉ x.
⁵ ∀x : ∀y : ∃z : ∀q : q ∈ z ↔ (q = x ∨ q = y).
⁶ ∀x : ∃y : ∀z : z ∈ y ↔ (∃q : z ∈ q ∧ q ∈ x).
⁷ ∀x : ∃y : ∀z : z ∈ y ↔ z ⊆ x.
⁸ ∀x : ∃y : ∀z : z ∈ y ↔ (z ∈ x ∧ P(z)).
Infinity There is a set that has ∅ as a member and also has x ∪ {x} whenever
it has x. This gives an encoding of ℕ.⁹ Here ∅ represents 0 and x ∪ {x}
represents x + 1. This effectively defines each number as the set of all
smaller numbers, e.g. 3 = {0, 1, 2} = {∅, {∅}, {∅, {∅}}}. Without this
axiom, we only get finite sets. (Technical note: the set whose existence
is given by the Axiom of Infinity may also contain some extra elements,
but we can strip them out, with some effort, using Specification.)
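The encoding in the Axiom of Infinity is concrete enough to execute; a sketch using Python frozensets (the helper name von_neumann is ours, not from the text):

```python
def von_neumann(n):
    """Encode the natural number n as the set of all smaller numbers:
    0 is the empty set, and the successor of x is x ∪ {x}."""
    s = frozenset()                   # the empty set represents 0
    for _ in range(n):
        s = s | frozenset([s])        # successor step: x ∪ {x}
    return s

three = von_neumann(3)
# 3 = {0, 1, 2}: three elements, each the encoding of a smaller number
assert len(three) == 3
assert von_neumann(0) in three and von_neumann(2) in three
```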
There are three other axioms that don't come up much in computer
science:
Foundation Every nonempty set A contains a set B with A ∩ B = ∅.¹⁰ This
rather technical axiom prevents various weird sets, such as sets that
contain themselves or infinite descending chains A₀ ∋ A₁ ∋ A₂ ∋ . . . .
Without it, we can't do induction arguments once we get beyond ℕ.
Replacement If S is a set, and R(x, y) is a predicate with the property
that ∀x : ∃!y : R(x, y), then {y | ∃x ∈ S : R(x, y)} is a set.¹¹ Like comprehension, Replacement is an axiom schema. Mostly used to construct
astonishingly huge infinite sets.
Choice For any set of nonempty sets S there is a function f that assigns
to each x in S some f(x) ∈ x. This axiom is unpopular in some
circles because it is non-constructive: it tells you that f exists, but
it doesn't give an actual definition of f. But it's too useful to throw
out.
Like everything else in mathematics, the particular system of axioms
we ended up with is a function of history, and there are other axioms
that could have been included but weren't. Some of the practical reasons
for including some axioms but not others are described in a pair of classic
papers by Maddy [Mad88a, Mad88b].
3.5 Cartesian products, relations, and functions
Sets are unordered: the set {a, b} is the same as the set {b, a}. Sometimes it
is useful to consider ordered pairs (a, b), where we can tell which element
comes first and which comes second. These can be encoded as sets using the
9
x : x y x : y {y} x.
x 6= : y x : x y = .
11
(x : !y : R(x, y)) z : q : r : r q (s z : R(s, r)).
10
rule (a, b) = {{a}, {a, b}}, which was first proposed by Kuratowski [Kur21,
Definition V].¹²
Given sets A and B, their Cartesian product A × B is the set
{(x, y) | x ∈ A ∧ y ∈ B}, or in other words the set of all ordered pairs
that can be constructed by taking the first element from A and the second
from B. If A has n elements and B has m, then A × B has nm elements.¹³
For example, {1, 2} × {3, 4} = {(1, 3), (1, 4), (2, 3), (2, 4)}.
Because of the ordering, Cartesian product is not commutative in general.
We usually have A × B ≠ B × A. (Exercise: when are they equal?)
The existence of the Cartesian product of any two sets can be proved
using the axioms we already have: if (x, y) is defined as {{x}, {x, y}}, then
P(A ∪ B) contains all the necessary sets {x} and {x, y}, and P(P(A ∪ B))
contains all the pairs {{x}, {x, y}}. It also contains a lot of other sets we
don't want, but we can get rid of them using Specification.
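The Kuratowski encoding and the product construction can be sketched with frozensets, so that the pairs really are sets (the helper names kpair and cartesian are ours):

```python
def kpair(a, b):
    """Kuratowski pair: (a, b) = {{a}, {a, b}}."""
    return frozenset([frozenset([a]), frozenset([a, b])])

def cartesian(A, B):
    """A × B as a set of encoded ordered pairs."""
    return {kpair(x, y) for x in A for y in B}

# The encoding is genuinely ordered: (1, 2) and (2, 1) are different sets
assert kpair(1, 2) != kpair(2, 1)
# The degenerate pair (a, a) collapses to {{a}} but is still well-defined
assert kpair(0, 0) == frozenset([frozenset([0])])
# |A × B| = |A| · |B|
assert len(cartesian({1, 2}, {3, 4})) == 4
```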
A special class of relations are functions. A function from a domain
A to a codomain¹⁴ B is a relation on A and B (i.e., a subset of A × B)
such that every element of A appears on the left-hand side of exactly one
ordered pair. We write f : A → B as a short way of saying that f is a
function from A to B, and for each x ∈ A write f(x) for the unique y ∈ B
with (x, y) ∈ f.¹⁵
The set of all functions from A to B is written as B^A: note that the order
of A and B is backwards here from A × B. Since this is just the subset
of P(A × B) consisting of functions, as opposed to more general relations, it
exists by the Power Set and Specification axioms.
When the domain of a function is finite, we can always write down
a list of all its values. For infinite domains (e.g., ℕ), almost all functions
are impossible to write down, either as an explicit table (which would need
to be infinitely long) or as a formula (there aren't enough formulas). Most
¹² This was not the only possible choice. Kuratowski cites a previous encoding suggested
by Hausdorff [Hau14] of (a, b) as {{a, 1}, {b, 2}}, where 1 and 2 are tags not equal to a or
b. He argues that this definition "seems less convenient to me" than {{a}, {a, b}}, because
it requires tinkering with the definition if a or b do turn out to be equal to 1 or 2. This
is a nice example of how even though mathematical definitions arise through convention,
some definitions are easier to use than others.
¹³ In fact, this is the most direct way to define multiplication on ℕ, and pretty much the
only sensible way to define multiplication for infinite cardinalities; see 11.1.5.
¹⁴ The codomain is sometimes called the range, but most mathematicians will use range
for {f(x) | x ∈ A}, which may or may not be equal to the codomain B, depending on
whether f is or is not surjective.
¹⁵ Technically, knowing f alone does not tell you what the codomain is, since some
elements of B may not show up at all. This can be fixed by representing a function as a
pair (f, B), but it's generally not something most people worry about.
3.5.1 Examples of functions
• f(x) = x². Note: this single rule gives several different functions, e.g.,
f : ℝ → ℝ, f : ℤ → ℤ, f : ℕ → ℕ, f : ℤ → ℕ. Changing the domain
or codomain changes the function.
• f(x) = x + 1.
• Floor and ceiling functions: when x is a real number, the floor of x
(usually written ⌊x⌋) is the largest integer less than or equal to x and
the ceiling of x (usually written ⌈x⌉) is the smallest integer greater
than or equal to x. E.g., ⌊2⌋ = ⌈2⌉ = 2, ⌊2.337⌋ = 2, ⌈2.337⌉ = 3.
• The function from {0, 1, 2, 3, 4} to {a, b, c} given by the following table:
0 → a
1 → c
2 → b
3 → a
4 → b
3.5.2 Sequences
Functions let us define sequences of arbitrary length: for example, the infinite sequence x₀, x₁, x₂, . . . of elements of some set A is represented by a
function x : ℕ → A, while a shorter sequence (a₀, a₁, a₂) would be represented by a function a : {0, 1, 2} → A. In both cases the subscript takes
the place of a function argument: we treat xₙ as syntactic sugar for x(n).
Finite sequences are often called tuples, and we think of the result of taking
the Cartesian product of a finite number of sets A × B × C as a set of tuples (a, b, c), even though the actual structure may be ((a, b), c) or (a, (b, c))
depending on which product operation we do first.
We can think of the Cartesian product of k sets (where k need not be 2)
as a set of sequences indexed by the set {1 . . . k} (or sometimes {0 . . . k − 1}).
Such a product can be written as ∏_{i=1}^{n} A_i, or even ∏_{x∈ℝ} A_x.
3.5.3 Functions of more (or less) than one argument
3.5.4 Composition of functions
3.5.5 Functions with special properties
3.5.5.1 Surjections
3.5.5.2 Injections
3.5.5.3 Bijections
A function that is both surjective and injective is called a one-to-one correspondence, bijective, or a bijection.¹⁶ Any bijection f has an inverse
f⁻¹; this is the function {(y, x) | (x, y) ∈ f}.
Of the functions we have been using as examples, only f(x) = x + 1 from
ℤ to ℤ is bijective.
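Viewed as a set of ordered pairs, inverting a bijection really is just swapping coordinates; a sketch restricting f(x) = x + 1 to a finite slice of ℤ (Python sets must be finite):

```python
# f(x) = x + 1 as an explicit set of pairs on {-3, ..., 2}
f = {(x, x + 1) for x in range(-3, 3)}

# The inverse {(y, x) | (x, y) ∈ f}
f_inv = {(y, x) for (x, y) in f}

assert (0, 1) in f and (1, 0) in f_inv
# Composing f with its inverse yields exactly the identity pairs
compose = {(x, z) for (x, y1) in f for (y2, z) in f_inv if y1 == y2}
assert compose == {(x, x) for x in range(-3, 3)}
```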
3.5.5.4 Bijections and counting
Bijections let us define the size of arbitrary sets without having some special
means to count elements. We say two sets A and B have the same size or
¹⁶ The terms onto, one-to-one, and bijection are probably the most common, although
injective and surjective are often used as well, as injective in particular avoids any possible
confusion between one-to-one and one-to-one correspondence. The terms injective, surjective, and bijective are generally attributed to the pseudonymous collective intelligence
Bourbaki [Bou70].
3.6 Constructing the universe
With power set, Cartesian product, the notion of a sequence, etc., we can
construct all of the standard objects of mathematics. For example:
Integers The integers are the set ℤ = {. . . , −2, −1, 0, 1, 2, . . .}. We represent each integer z as an ordered pair (x, y), where x = 0 ∨ y = 0;
formally, ℤ = {(x, y) ∈ ℕ × ℕ | x = 0 ∨ y = 0}. The interpretation of
(x, y) is x − y; so positive integers z are represented as (z, 0) while
¹⁷ The formal definition is that S is an ordinal if (a) every element of S is also a subset
of S; and (b) every subset T of S contains an element x with the property that x = y or
x ∈ y for all y ∈ T. The second property says that S is well-ordered if we treat ∈ as
meaning < (see 9.5.6). The fact that every subset of S has a minimal element means that
we can do induction on S, since if there is some property that does not hold for all x in
S, there must be some minimal x for which it doesn't hold.
negative integers −z are represented as (0, z). It's not hard to define
addition, subtraction, multiplication, etc. using this representation.
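That arithmetic is easy to write down; a sketch (the norm helper, which restores the invariant that at least one component is 0, is our own bookkeeping, not spelled out in the text):

```python
def norm(x, y):
    """Reduce (x, y), interpreted as x - y, so at least one component is 0."""
    m = min(x, y)
    return (x - m, y - m)

def add(a, b):
    return norm(a[0] + b[0], a[1] + b[1])

def neg(a):
    return (a[1], a[0])

def mul(a, b):
    # (x1 - y1)(x2 - y2) = (x1*x2 + y1*y2) - (x1*y2 + y1*x2)
    return norm(a[0] * b[0] + a[1] * b[1], a[0] * b[1] + a[1] * b[0])

three, minus_five = (3, 0), (0, 5)
assert add(three, minus_five) == (0, 2)            # 3 + (-5) = -2
assert mul(three, minus_five) == (0, 15)           # 3 * (-5) = -15
assert add(minus_five, neg(minus_five)) == (0, 0)  # (-5) + 5 = 0
```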
Rationals The rational numbers ℚ are all fractions of the form p/q where
p is an integer, q is a natural number not equal to 0, and p and q have
no common factors. Each such fraction can be represented as a set
using an ordered pair (p, q). Operations on rationals are defined as
you may remember from grade school.
Reals The real numbers ℝ can be defined in a number of ways, all of
which turn out to be equivalent. The simplest to describe is that
a real number x is represented by a pair of sets {y ∈ ℚ | y < x} and
{y ∈ ℚ | y ≥ x}; this is known as a Dedekind cut [Ded01]. Formally,
a Dedekind cut is any pair of subsets (S, T) of ℚ with the properties that (a) S and T partition ℚ, meaning that S ∩ T = ∅ and
S ∪ T = ℚ; (b) every element of S is less than every element of
T (∀s ∈ S ∀t ∈ T : s < t); and (c) S contains no largest element
(∀x ∈ S ∃y ∈ S : x < y). Note that real numbers in this representation may be hard to write down.
A simpler but equivalent representation is to drop T, since it is just
ℚ \ S: this gives us a real number for any proper subset S of ℚ that
is downward closed, meaning that x < y ∈ S implies x ∈ S. Real
numbers in this representation may still be hard to write down.
More conventionally, a real number can be written as an infinite decimal expansion like
3.14159265358979323846264338327950288419716939937510582 . . . ,
which is a special case of a Cauchy sequence that gives increasingly
good approximations to the actual real number the further along you
go.
We can also represent standard objects of computer science:
Deterministic finite state machines A deterministic finite state machine is a tuple (Σ, Q, q₀, δ, Q_accept), where Σ is an alphabet (some
finite set), Q is a state space (another finite set), q₀ ∈ Q is an initial
state, δ : Q × Σ → Q is a transition function specifying which state
to move to when processing some symbol in Σ, and Q_accept ⊆ Q is
the set of accepting states. If we represent symbols and states as
natural numbers, the set of all deterministic finite state machines is
3.7 Sizes and arithmetic
We can compute the size of a set by explicitly counting its elements; for example, |∅| = 0, |{Larry, Moe, Curly}| = 3, and |{x ∈ ℕ | x < 100 ∧ x is prime}| =
|{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97}| =
25. But sometimes it is easier to compute sizes by doing arithmetic. We
can do this because many operations on sets correspond in a natural way
to arithmetic operations on their sizes. (For much more on this, see Chapter 11.)
Two sets A and B that have no elements in common are said to be disjoint; in set-theoretic notation, this means A ∩ B = ∅. In this case we have
|A ∪ B| = |A| + |B|. The operation of disjoint union acts like addition
for sets. For example, the disjoint union of the 2-element set {0, 1} and the 3-element set {Wakko, Jakko, Dot} is the 5-element set {0, 1, Wakko, Jakko, Dot}.
The size of a Cartesian product is obtained by multiplication: |A × B| =
|A| · |B|. An example would be the product of the 2-element set {a, b} with
the 3-element set {0, 1, 2}: this gives the 6-element set {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.
Even though Cartesian product is not generally commutative, since ordinary
natural number multiplication is, we always have |A × B| = |B × A|.
For power set, it is not hard to show that |P(S)| = 2^|S|. This is a special
case of the size of A^B, the set of all functions from B to A, which is |A|^|B|;
for the power set we can encode P(S) using 2^S, where 2 is the special set
{0, 1}, and a subset T of S is encoded by the function that maps each x ∈ S
to 0 if x ∉ T and 1 if x ∈ T.
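These counting rules can be spot-checked mechanically with itertools:

```python
from itertools import product

A, B = {'a', 'b'}, {0, 1, 2}

# |A × B| = |A| · |B|, and the sizes agree even though A × B ≠ B × A
assert len(set(product(A, B))) == len(A) * len(B) == len(set(product(B, A)))

# Subsets of S as functions S → {0, 1}: one bit per element, 2^|S| in all
S = ['x', 'y', 'z']
functions = list(product([0, 1], repeat=len(S)))
assert len(functions) == 2 ** len(S)

# Each bit-vector encodes the subset of elements mapped to 1
subsets = [{s for s, bit in zip(S, f) if bit} for f in functions]
assert set() in subsets and {'x', 'y', 'z'} in subsets and len(subsets) == 8
```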
3.7.1 Infinite sets
¹⁸ The generalized continuum hypothesis says (essentially) that there aren't any more
cardinalities out there in between the ones that are absolutely necessary; a consequence of
this is that there are no cardinalities between |ℕ| and |ℝ|. An alternative notation exists if
you don't want to take a position on GCH: this writes ℶ₀ (beth-0) for |ℕ| and ℶ₁ (beth-1)
for |ℝ| = |P(ℕ)|, with the general rule ℶᵢ₊₁ = 2^ℶᵢ. This avoids the issue of whether
there exist sets with size between ℕ and ℝ, for example. In my limited experience, only
hard-core set theorists ever use ℶ instead of ℵ: in the rare cases where the distinction
matters, most normal mathematicians will just assume GCH, which makes ℶᵢ = ℵᵢ for all
i.
the sequence and rest is all the other elements. For example,
f(0, 1, 2) = 1 + ⟨0, f(1, 2)⟩
= 1 + ⟨0, 1 + ⟨1, f(2)⟩⟩
= 1 + ⟨0, 1 + ⟨1, 1 + ⟨2, 0⟩⟩⟩
= 1 + ⟨0, 1 + ⟨1, 1 + 3⟩⟩ = 1 + ⟨0, 1 + ⟨1, 4⟩⟩
= 1 + ⟨0, 1 + 19⟩
= 1 + ⟨0, 20⟩
= 1 + 230
= 231.
This assigns a unique element of ℕ to each finite sequence, which is
enough to show |ℕ*| ≤ |ℕ|. With some additional effort one can show
that f is in fact a bijection, giving |ℕ*| = |ℕ|.
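The pairing function ⟨x, y⟩ is defined earlier in the notes and does not appear in this excerpt, but the standard Cantor pairing function ⟨x, y⟩ = (x + y)(x + y + 1)/2 + y reproduces every intermediate value in the derivation above (⟨2, 0⟩ = 3, ⟨1, 4⟩ = 19, ⟨0, 20⟩ = 230), so here is a sketch assuming that choice:

```python
def pair(x, y):
    """Cantor pairing: a bijection from N × N to N."""
    return (x + y) * (x + y + 1) // 2 + y

def f(seq):
    """Encode a finite sequence: f() = 0, f(first, rest) = 1 + <first, f(rest)>."""
    if not seq:
        return 0
    return 1 + pair(seq[0], f(seq[1:]))

# The worked example from the text
assert pair(2, 0) == 3 and pair(1, 4) == 19 and pair(0, 20) == 230
assert f((0, 1, 2)) == 231

# Distinct sequences get distinct codes (a spot check, not a proof)
samples = [(), (0,), (1,), (0, 0), (0, 1, 2), (2, 1, 0)]
assert len({f(s) for s in samples}) == len(samples)
```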
3.7.2 Countable sets
All of these sets have the property of being countable, which means that
they can be put into a bijection with ℕ or one of its subsets. The general
principle is that any sum or product of infinite cardinal numbers turns into
taking the maximum of its arguments. The last case implies that anything
you can write down using finitely many symbols (even if they are drawn
from an infinite but countable alphabet) is countable. This has a lot of
applications in computer science: one of them is that the set of all computer
programs in any particular programming language is countable.
3.7.3 Uncountable sets
Exponentiation is different. We can easily show that 2^ℵ₀ ≠ ℵ₀, or equivalently that there is no bijection between P(ℕ) and ℕ. This is done using
Cantor's diagonalization argument, which appears in the proof of the
following theorem.
Theorem 3.7.1. Let S be any set. Then there is no surjection f : S →
P(S).
Proof. Let f : S → P(S) be some function from S to subsets of S. We'll
construct a subset of S that f misses, thereby showing that f is not a
surjection. Let A = {x ∈ S | x ∉ f(x)}. Suppose A = f(y). Then y ∈ A ↔
y ∉ A, a contradiction.¹⁹
¹⁹ Exercise: Why does A exist even though the Russell's Paradox set doesn't?
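For a finite set S the diagonal construction can even be checked exhaustively: every one of the |P(S)|^|S| candidate functions misses its own diagonal set. A brute-force sketch (no substitute for the proof, which covers all sets):

```python
from itertools import combinations, product

def powerset(S):
    """All subsets of S, as a list of sets."""
    items = sorted(S)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

S = {0, 1, 2}
subsets = powerset(S)              # |P(S)| = 8 > |S| = 3

for images in product(subsets, repeat=len(S)):
    f = dict(zip(sorted(S), images))
    A = {x for x in S if x not in f[x]}   # the diagonal set
    assert all(f[x] != A for x in S)      # f never hits A
```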
Since any bijection is also a surjection, this means that there's no bijection between S and P(S) either, implying, for example, that |ℕ| is strictly
less than |P(ℕ)|.
(On the other hand, it is the case that |ℕ^ℕ| = |2^ℕ|, so things are still
weird up here.)
Sets that are larger than ℕ are called uncountable. A quick way to
show that there is no surjection from A to B is to show that A is countable
but B is uncountable. For example:
Corollary 3.7.2. There are functions f : ℕ → {0, 1} that are not computed
by any computer program.
Proof. Let P be the set of all computer programs that take a natural number
as input and always produce 0 or 1 as output (assume some fixed language),
and for each program p ∈ P, let f_p be the function that p computes. We've
already argued that P is countable (each program is a finite sequence drawn
from a countable alphabet), and since the set of all functions f : ℕ →
{0, 1} = 2^ℕ has the same size as P(ℕ), it's uncountable. So some f gets
missed: there is at least one function from ℕ to {0, 1} that is not equal to f_p
for any program p.
The fact that there are more functions from ℕ to ℕ than there are
elements of ℕ is one of the reasons why set theory (slogan: "everything is
a set") beat out lambda calculus (slogan: "everything is a function from
functions to functions") in the battle over the foundations of mathematics.
And this is why we do set theory in CS202 and lambda calculus (disguised
as Scheme) in CS201.
3.8 Further reading
Chapter 4

The real numbers
4.1 Field axioms
The real numbers are a field, which means that they support the operations
of addition (+), multiplication (·), and their inverse operations, subtraction (−)
and division (/). The behavior of these operations is characterized by the
field axioms.
4.1.1 Axioms for addition
Axiom 4.1.1 (Commutativity of addition). For all numbers,
a + b = b + a.
(4.1.1)
Any operation that satisfies Axiom 4.1.1 is called commutative. Commutativity lets us ignore the order of arguments to an operation. Later, we
will see that multiplication is also commutative.
Axiom 4.1.2 (Associativity of addition). For all numbers,
a + (b + c) = (a + b) + c.
(4.1.2)
An operation that satisfies Axiom 4.1.2 is called associative. Associativity means we don't have to care about how a sequence of the same
associative operation is parenthesized, letting us write just a + b + c for
a + (b + c) = (a + b) + c.²
² A curious but important practical fact is that addition is often not associative in
computer arithmetic. This is because computers (and calculators) approximate real
numbers by floating-point numbers, which only represent some limited number of digits of an actual real number in order to make it fit in limited memory. This
means that low-order digits on very large numbers can be lost to round-off error. So
a computer might report (1000000000000 + (−1000000000000)) + 0.00001 = 0.00001 but
1000000000000 + ((−1000000000000) + 0.00001) = 0.0. Since we don't have to write any
programs in this class, we will just work with actual real numbers, and not worry about
such petty numerical issues.
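The footnote's example is easy to reproduce in any language with IEEE 754 doubles; a Python sketch using the same values:

```python
big, tiny = 1.0e12, 1.0e-5

left = (big + (-big)) + tiny   # exact cancellation happens first
right = big + ((-big) + tiny)  # tiny is absorbed by -big before cancelling

assert left == 1e-05
assert right == 0.0
assert left != right           # so floating-point addition is not associative
```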
Axiom 4.1.3 (Additive identity). There exists a number 0 such that, for
all numbers a,
a + 0 = 0 + a = a.
(4.1.3)
An object that satisfies the condition a+0 = 0+a = a for some operation
is called an identity for that operation. Later we will see that 1 is an identity
for multiplication.
It's not hard to show that identities are unique:
Lemma 4.1.4. Let 0′ + a = a + 0′ = a for all a. Then 0′ = 0.
Proof. Compute 0′ = 0′ + 0 = 0. (The first equality holds by the fact that
a = a + 0 for all a, and the second from the assumption that 0′ + a = a for
all a.)
Axiom 4.1.5 (Additive inverses). For each a, there exists a number −a
such that
a + (−a) = (−a) + a = 0.
(4.1.4)
For convenience, we will often write a + (−b) as a − b ("a minus b").
This gives us the operation of subtraction. The operation that returns −a
given a is called negation, and −a can be read as "negative a", "minus
a",³ or "the negation of a".
Like identities, inverses are also unique:
Lemma 4.1.6. If a′ + a = a + a′ = 0, then a′ = −a.
Proof. Starting with 0 = a′ + a, add −a on the right to both sides to get
−a = a′ + a + (−a) = a′.
4.1.2 Axioms for multiplication
³ Warning: Some people will get annoyed with you over "minus a" and insist on reserving "minus" for the operation in a − b. In extreme cases, you may see −a typeset
differently: -a. Pay no attention to these people. Though not making the distinction
makes life more difficult for calculator designers and compiler writers, as a working mathematician you are entitled to abuse notation by using the same symbol for multiple
purposes when it will not lead to confusion.
⁴ Also called laziness.
Axiom (Commutativity of multiplication). For all numbers,
a · b = b · a.
(4.1.5)
Axiom (Associativity of multiplication). For all numbers,
a · (b · c) = (a · b) · c.
(4.1.6)
Axiom (Multiplicative inverses). For each a ≠ 0, there exists a number a⁻¹
such that
a · a⁻¹ = a⁻¹ · a = 1.
(4.1.8)
Lemma 4.1.6 applies here to show that a⁻¹ is also unique for each a.
For convenience, we will often write a · b⁻¹ as a/b or the vertical fraction
with a over b. This gives us the operation of division. The expression a/b is
pronounced "a over b" or (especially in grade school, whose occupants are
generally not as lazy as full-grown mathematicians) "a divided by b". Some
other notations (mostly used in elementary school) for this operation are a ÷ b
and a : b.⁵
Note that because 0 is not guaranteed to have an inverse,⁶ the meaning
of a/0 is not defined.
The number a⁻¹, when it does exist, is often just called "the inverse
of a" or sometimes "inverse a". (The ambiguity that might otherwise arise
with the additive inverse −a is avoided by using "negation" for −a.) The
multiplicative inverse a⁻¹ can also be written using the division operation
as 1/a.
⁵ Using a colon for division is particularly popular in German-speaking countries,
where the "My Dear Aunt Sally" rule for remembering that multiplication and division
bind tighter than addition and subtraction becomes the more direct Punktrechnung vor
Strichrechnung ("point reckoning before stroke reckoning").
⁶ In fact, once we get a few more axioms, terrible things will happen if we try to make
0 have an inverse.
4.1.3 Axioms relating multiplication and addition
Axiom (Distributive law). For all numbers,
a · (b + c) = a · b + a · c,
(4.1.9)
(a + b) · c = a · c + b · c.
(4.1.10)
One useful consequence, used below, is annihilation: for all a, a · 0 = 0.⁷
7
This is an example of the proof strategy where we show P Q by assuming P and
proving Q.
69
a · (−b) = −(ab),
(4.1.15)
(−a) · b = −(ab),
(4.1.16)
and
(−a) · (−b) = ab.
(4.1.17)
Like annihilation, these are not axioms, or at least, we don't have to include
them as axioms if we don't want to. Instead, we can prove them directly
from axioms and theorems we've already got. For example, here is a proof
of (4.1.15):
a · 0 = 0
a · (b + (−b)) = 0
a · b + a · (−b) = 0
−(ab) + (ab + a · (−b)) = −(ab)
(−(ab) + ab) + a · (−b) = −(ab)
0 + a · (−b) = −(ab)
a · (−b) = −(ab).
Similar proofs can be given for (4.1.16) and (4.1.17).
A special case of this is that multiplying by −1 is equivalent to negation:
Corollary 4.1.14. For all a,
(−1) · a = −a.
(4.1.18)
4.1.4 Other algebras satisfying the field axioms
The field axioms so far do not determine the real numbers: they also hold for
any number of other fields, including the rationals ℚ, the complex numbers
ℂ, and various finite fields such as the integers modulo a prime p (written
as ℤₚ; we'll see more about these in Chapter 14).
They do not hold for the integers ℤ (which don't have multiplicative
inverses) or the natural numbers ℕ (which don't have additive inverses either). This means that ℤ and ℕ are not fields, although they are examples
of weaker algebraic structures (a ring in the case of ℤ and a semiring in
the case of ℕ).
4.2 Order axioms
Unlike ℂ and ℤₚ (but like ℚ), the real numbers are an ordered field,
meaning that in addition to satisfying the field axioms, there is a relation ≤
that satisfies the axioms:
Axiom 4.2.1 (Comparability). a ≤ b or b ≤ a.
Axiom 4.2.2 (Antisymmetry). If a ≤ b and b ≤ a, then a = b.
Axiom 4.2.3 (Transitivity). If a ≤ b and b ≤ c, then a ≤ c.
Axiom 4.2.4 (Translation invariance). If a ≤ b, then a + c ≤ b + c.
Axiom 4.2.5 (Scaling invariance). If a ≤ b and 0 ≤ c, then a · c ≤ b · c.
The first three of these mean that ≤ is a total order (see 9.5.5). The
other axioms describe how ≤ interacts with addition and multiplication.
For convenience, we define a < b as shorthand for a ≤ b and a ≠ b, and
define reverse operations a ≥ b (meaning b ≤ a) and a > b (meaning b < a).
If a > 0, we say that a is positive. If a < 0, it is negative. If a ≥ 0, it is
non-negative. Non-positive can be used to say a ≤ 0, but this doesn't
seem to come up as much as non-negative.
Other properties of ≤ can be derived from these axioms.
Lemma 4.2.6 (Reflexivity). For all x, x ≤ x.
Proof. Apply comparability with y = x.
Lemma 4.2.7 (Trichotomy). Exactly one of x < y, x = y, or x > y holds.
Proof. First, let's show that at least one holds. If x = y, we are done.
Otherwise, suppose x ≠ y. From comparability, we have x ≤ y or y ≤ x.
Since x ≠ y, this gives either x < y or x > y.
Next, observe that x = y implies x ≮ y and x ≯ y, since x < y and x > y
are both defined to hold only when x ≠ y. This leaves the possibility that
x < y and x > y. But then x ≤ y and y ≤ x, so by antisymmetry, x = y,
contradicting our assumption. So at most one holds.
4.3 Least upper bounds
4.4
One way to think about the development of number systems is that each
system ℕ, ℤ, ℚ, ℝ, and ℂ adds the ability to solve equations that have no
solutions in the previous system. Some specific examples are
x + 1 = 0
2x = 1
x · x = 2
x · x + 1 = 0
This process stops with the complex numbers ℂ, which consist of pairs
of the form a + bi where i² = −1. The reason is that the complex numbers
are algebraically closed: if you write an equation using only complex
numbers, +, and ·, and it has some solution x in any field bigger than
ℂ, then x is in ℂ as well. The down side in comparison to the reals is
that we lose order: there is no ordering of complex numbers that satisfies
the translation and scaling invariance axioms. As in many other areas of
mathematics and computer science, we are forced to make trade-offs based
on what is important to us at the time.
4.5 Arithmetic
represent exactly even trivial values like 1/3. Similarly, mixed fractions like
1 1/10, while useful for carpenters, are not popular in mathematics.
4.6 Connection between the reals and other standard algebras
The reals are an example of an algebra, which is a set with various operations attached to it: the set is ℝ itself, with the operations being 0, 1,
+, and ·. A sub-algebra is a subset that is closed under the operations,
meaning that the result of any operation applied to elements of the subset
(no elements in the case of 0 or 1) yields an element of the subset.
All sub-algebras of ℝ inherit any properties that don't depend on the
existence of particular elements other than 0 and 1; so addition and multiplication are still commutative and associative, multiplication still distributes
over addition, and 0 and 1 are still identities. But other axioms may fail.
Some interesting sub-algebras of ℝ are:
Some interesting sub-algebras of R are:
The natural numbers N. This is the smallest sub-algebra of R,
because once you have 0, 1, and addition, you can construct the rest
of the naturals as 1 + 1, 1 + 1 + 1, etc. They do not have additive or
multiplicative inverses, but they do satisfy the order axioms, as well
as the extra axiom that 0 x for all x N.
The integers Z. These are what you get if you throw in additive
inverses: now in addition to 0, 1, 1 + 1, etc., you also get 1, (1 + 1),
etc. The order axioms are still satisfied. No multiplicative inverses,
though.
The dyadics D. These are numbers of the form m2n where m Z
and n N. These are of some importance in computing because
almost all numbers represented inside a computer are really dyadics,
although in mathematics they are not used much. Like the integers,
they still dont have multiplicative inverses: there is no way to write
1/3 (for example) as m2n .
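The claim that machine numbers are dyadics can be checked directly in Python, whose floats are IEEE 754 doubles:

```python
from fractions import Fraction

# Converting a float to an exact Fraction reveals the dyadic actually stored
exact = Fraction(0.1)
assert exact == Fraction(3602879701896397, 2 ** 55)   # m * 2^(-n) form
assert exact != Fraction(1, 10)    # so 1/10 itself is not a dyadic

# 1/3 is not dyadic either: no float equals it exactly
third = Fraction(1, 3)
assert Fraction(float(third)) != third
```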
• The rationals ℚ. Now we ask for multiplicative inverses, and get them.
Any rational can be written as p/q where p and q are integers. Unless
extra restrictions are put on p and q, these representations are not
unique: 22/7 = 44/14 = 66/21 = (−110)/(−35). You probably first
saw these in grade school as fractions, and one way to describe ℚ is
as the field of fractions of ℤ.
The rationals satisfy all the field axioms, and are the smallest subfield of ℝ. They also satisfy all the ordered field axioms and the
Archimedean property. But they are not complete. Adding completeness gives the real numbers.
An issue that arises here is that, strictly speaking, the natural numbers ℕ we defined back in 3.4 are not elements of ℝ as defined in terms
of, say, Dedekind cuts. The former are finite ordinals, while the latter are
downward-closed sets of rationals, themselves represented as elements of
ℕ × ℕ. Similarly, the integer elements of ℚ will be pairs of the form (n, 1)
where n ∈ ℕ rather than elements of ℕ itself. We also have a definition
(G.1) that builds natural numbers out of 0 and a successor operation S.
So what does it mean to say ℕ ⊆ ℚ ⊆ ℝ?
One way to think about it is that the sets
{∅, {∅}, {∅, {∅}}, {∅, {∅}, {∅, {∅}}}, . . .},
{(0, 1), (1, 1), (2, 1), (3, 1), . . .},
{{(p, q) | p < 0}, {(p, q) | p < q}, {(p, q) | p < 2q}, {(p, q) | p < 3q}, . . .},
and
{0, S0, SS0, SSS0, . . .}
are all isomorphic: there are bijections between them that preserve the
behavior of 0, 1, +, and ·. So we think of ℕ as representing some Platonic
ideal of natural-number-ness that is only defined up to isomorphism.¹⁰ So in
the context of ℝ, when we write ℕ, we mean the version of ℕ that is a subset
of ℝ, and in other contexts, we might mean a different set that happens to
behave in exactly the same way.
In the other direction, the complex numbers ℂ are a super-algebra of the
reals: we can think of any real number x as the complex number x + 0i,
and this complex number will behave exactly the same as the original real
number x when interacting with other real numbers carried over into ℂ in
the same way.
The various features of these algebras are summarized in Table 4.1.
4.7 Extracting information from reals
The floor function ⌊x⌋ and ceiling function ⌈x⌉ can be used to convert an
arbitrary real to an integer: the floor of x is the largest integer less than
or equal to x, and the ceiling of x is the smallest integer greater than or
equal to x.
¹⁰ In programming terms, ℕ is an interface that may have multiple equivalent implementations.
[Table 4.1: a summary of the algebras ℕ (naturals), ℤ (integers), ℚ (rationals),
ℝ (reals), and ℂ (complex numbers), giving a typical element of each and
yes/no entries for properties such as inverses (none for ℕ, additive only
for ℤ, both for ℚ, ℝ, and ℂ), ordering (all but ℂ), and algebraic closure
(ℂ only).]
The absolute value |x| is defined by
|x| = −x if x < 0, and x if x ≥ 0.
The absolute value function erases the sign of x: |−12| = |12| = 12.
The signum function sgn(x) returns the sign of its argument, encoded
as −1 for negative, 0 for zero, and +1 for positive:
sgn(x) = −1 if x < 0, 0 if x = 0, and +1 if x > 0.
So sgn(−12) = −1, sgn(0) = 0, and sgn(12) = 1. This allows for an alternative definition of |x| as sgn(x) · x.
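A direct transcription of these definitions (Python's built-in abs plays the role of |x|; the name sgn follows the text):

```python
def sgn(x):
    """Signum: -1, 0, or +1 according to the sign of x."""
    if x < 0:
        return -1
    if x == 0:
        return 0
    return 1

assert (sgn(-12), sgn(0), sgn(12)) == (-1, 0, 1)

# The alternative definition |x| = sgn(x) * x
for x in (-12, -2.5, 0, 7):
    assert sgn(x) * x == abs(x)
```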
Chapter 5

Induction and recursion
5.1 Simple induction
The simplest form of induction goes by the name of simple induction, and
it's what we use to show that something is true for all natural numbers.
We have several equivalent definitions of the natural numbers ℕ, but
what they have in common is the following basic pattern, which goes back
to Peano [Pea89]:
• 0 is a natural number.
• If x is a natural number, so is x + 1.
This is an example of a recursive definition: it gives us a base object
to start with (0) and defines new natural numbers (x + 1) by applying some
operation (+1) to natural numbers x we already have.
Because these are the only ways to generate natural numbers, we can
prove that any particular natural number has some property P by showing
that you can't construct a natural number without having P be true. This
means showing that P(0) is true, and that P(x) implies P(x + 1). If both
of these statements hold, then P is baked into each natural number as part
of its construction.
(P(0) ∧ (∀x ∈ ℕ : P(x) → P(x + 1))) → ∀x ∈ ℕ : P(x)
(5.1.1)
Any proof that uses the induction schema will consist of two parts: the
base case, showing that P(0) holds, and the induction step, showing that
P(x) → P(x + 1). The assumption P(x) used in the induction step is called
the induction hypothesis.
For example, let's suppose we want to show that for all n ∈ ℕ, either
n = 0 or there exists n′ such that n = n′ + 1. Proof: We are trying to show
that P(n) holds for all n, where P(n) says n = 0 ∨ (∃n′ : n = n′ + 1). The
base case is when n = 0, and here the induction hypothesis holds by the
addition rule. For the induction step, we are given that P(x) holds, and
want to show that P(x + 1) holds. In this case, we can do this easily by
observing that P(x + 1) expands to (x + 1) = 0 ∨ (∃x′ : x + 1 = x′ + 1). So
let x′ = x and we are done.¹
Here's a less trivial example. So far we have not defined exponentiation.
Let's solve this by declaring
x⁰ = 1,
(5.1.2)
x^(n+1) = x · xⁿ.
(5.1.3)
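This recursive definition transcribes directly into code; a sketch (the name power is ours, to avoid clashing with Python's built-in pow):

```python
def power(x, n):
    """x^n by the definition: x^0 = 1, x^(n+1) = x * x^n."""
    if n == 0:
        return 1
    return x * power(x, n - 1)

assert power(5, 0) == 1
assert power(2, 10) == 1024
# Agrees with the built-in operator on a range of inputs
assert all(power(x, n) == x ** n for x in range(5) for n in range(8))
```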
Induction step: Suppose the induction hypothesis holds for n, i.e., that
n > 0 → aⁿ > 1. We want to show that it also holds for n + 1. Annoyingly,
there are two cases we have to consider:
1. n = 0. Then we can compute a¹ = a · a⁰ = a · 1 = a > 1.
2. n > 0. The induction hypothesis now gives aⁿ > 1 (since in this case
the premise n > 0 holds), so aⁿ⁺¹ = a · aⁿ > a · 1 > 1.
5.2
One of the things that is apparent from the proof of Theorem 5.1.1 is that
being forced to start at 0 may require painful circumlocutions if 0 is not the
first natural number for which the predicate we care about holds. So in practice
it is common to use a different base case. This gives a generalized version
of the induction schema that works for any integer base:
(P(z0) ∧ ∀z ∈ Z, z ≥ z0 : (P(z) → P(z + 1))) → ∀z ∈ Z, z ≥ z0 : P(z).    (5.2.1)
Intuitively, this works for the same reason (5.1.1) works: if P is true for
z0, then any larger integer can be reached by applying +1 enough times,
and each +1 operation preserves P. If we want to prove it formally, observe
that (5.2.1) turns into (5.1.1) if we do a change of variables and define
Q(n) = P(n + z0).
Here's an example of starting at a non-zero base case:
Theorem 5.2.1. Let n ∈ N. If n ≥ 4, then 2^n ≥ n^2.
Proof. Base case: Let n = 4; then 2^n = 16 = n^2.
For the induction step, assume 2^n ≥ n^2. We need to show that 2^{n+1} ≥
(n + 1)^2 = n^2 + 2n + 1. Using the assumption and the fact that n ≥ 4, we
can compute

2^{n+1} = 2 · 2^n
        ≥ 2n^2
        = n^2 + n^2
        ≥ n^2 + 4n
        = n^2 + 2n + 2n
        ≥ n^2 + 2n + 1
        = (n + 1)^2.
5.3
5.4
The converse is a little trickier, since we need to figure out how to use
induction to prove things about subsets of N, but induction only talks about
elements of N. The trick is to consider only the part of S that is smaller than
some variable n, and show that any S that contains an element smaller than
n has a smallest element.
Lemma 5.4.1. For all n ∈ N, if S is a subset of N that contains an element
less than or equal to n, then S has a smallest element.
Proof. By induction on n.
The base case is n = 0. Here S contains an element less than or equal
to 0, which must be 0 itself, so 0 ∈ S. Since 0 ≤ x for any x ∈ N, in
particular 0 ≤ x for any x ∈ S, making 0 the smallest element of S.
For the induction step, suppose that the claim in the lemma holds for n.
To show that it holds for n + 1, suppose that S contains an element less
than or equal to n + 1. Then either (a) S contains an element less than or
equal to n, so S has a smallest element by the induction hypothesis; or (b)
S does not contain an element less than or equal to n. But in this second
case, S must contain n + 1, and since there are no elements less than n + 1
in S, n + 1 is the smallest element.
To show the full result, let n be some element of S. Then S contains an
element less than or equal to n, and so S contains a smallest element.
5.5
Strong induction
5.5.1
Examples
1 A number is prime if it can't be written as a · b where a and b are both greater than 1.
player's turn to move. In either case each f(y) is well-defined (by the
induction hypothesis), and so f(x) is also well-defined.
The division algorithm: For each n, m ∈ N with m ≠ 0, there is a
unique pair q, r ∈ N such that n = qm + r and 0 ≤ r < m. Proof:
Fix m, then proceed by induction on n. If n < m, then if q > 0 we
have n = qm + r ≥ qm ≥ 1 · m = m, a contradiction. So in this case q = 0
is the only solution, and since n = qm + r = r we have a unique
choice of r = n. If n ≥ m, by the induction hypothesis there is a
unique q′ and r′ such that n − m = q′m + r′ where 0 ≤ r′ < m. But
then q = q′ + 1 and r = r′ satisfies qm + r = (q′ + 1)m + r′ =
(q′m + r′) + m = (n − m) + m = n. To show that this solution is
unique, if there is some other q″ and r″ such that q″m + r″ = n, then
(q″ − 1)m + r″ = n − m = q′m + r′, and by the uniqueness of q′ and r′
(ind. hyp. again), we have q″ − 1 = q′ = q − 1 and r″ = r′ = r, giving
that q″ = q and r″ = r. So q and r are unique.
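The existence half of this proof is effectively an algorithm: keep subtracting m and count how many times you did it. A sketch in Python (the names are mine, not from the notes):

```python
def divide(n, m):
    """Return the unique (q, r) with n == q*m + r and 0 <= r < m,
    following the induction on n in the proof above."""
    if m == 0:
        raise ValueError("m must be nonzero")
    if n < m:                # base case: q = 0, r = n
        return 0, n
    q, r = divide(n - m, m)  # induction: solve for n - m, then add 1 to q
    return q + 1, r
```

For large n this recursion is deep; it is meant to mirror the proof, not to replace the built-in divmod.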
5.6
Recursively-defined structures
A recursive definition has the structure of an inductive proof: it gives a
base case and a rule for building bigger structures from smaller ones.
Structures defined in this way are recursively-defined.
Examples of recursively-defined structures:
Finite von Neumann ordinals A finite von Neumann ordinal is either
(a) the empty set ∅, or (b) x ∪ {x}, where x is a finite von Neumann
ordinal.
Complete binary trees A complete binary tree consists of either (a) a
leaf node, or (b) an internal node (the root) with two complete binary
trees as children (or subtrees).
Boolean formulas A boolean formula consists of either (a) a variable, (b)
the negation operator applied to a Boolean formula, (c) the AND of
two Boolean formulas, or (d) the OR of two Boolean formulas. A
monotone Boolean formula is defined similarly, except that negations
are forbidden.
Finite sequences, recursive version Earlier we defined a finite sequence
as a function from some natural number (in its set form: n = {0, 1, 2, . . . , n − 1})
to some set S. We could also define a finite sequence over S recursively,
by the rule: ⟨⟩ (the empty sequence) is a finite sequence, and if
a is a finite sequence and x ∈ S, then (x, a) is a finite sequence. (Fans
of LISP will recognize this method immediately.)
The key point is that in each case the definition of an object is recursive:
the object itself may appear as part of a larger object. Usually we
assume that this recursion eventually bottoms out: there are some base
cases (e.g., leaves of complete binary trees or variables in Boolean formulas)
that do not lead to further recursion. If a definition doesn't bottom out in
this way, the class of structures it describes might not be well-defined (i.e.,
we can't tell if some structure is an element of the class or not).
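The LISP-style recursive definition of finite sequences can be sketched in Python with nested pairs (the helper names are mine):

```python
def cons(x, a):
    """Extend sequence a with element x, giving the pair (x, a)."""
    return (x, a)

def length(seq):
    """Recurse on the structure: the empty sequence () is the base
    case; a pair (x, a) contributes 1 plus the length of a."""
    if seq == ():
        return 0
    x, rest = seq
    return 1 + length(rest)

s = cons(1, cons(2, cons(3, ())))  # the sequence 1, 2, 3
```

Functions defined this way terminate precisely because the recursion bottoms out at the base case ().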
5.6.1
5.6.2
Recursive definitions have the same form as an induction proof. There are
one or more base cases, and one or more recursion steps that correspond to
the induction step in an induction proof. The connection is not surprising
if you think of a definition of some class of objects as a predicate that
identifies members of the class: a recursive definition is just a formula for
writing induction proofs that say that certain objects are members.
5.6.3
Structural induction
Chapter 6
Summation notation
6.1
Summations
∑_{i=1}^{n} i = n(n + 1)/2,
6.1.1
Formal definition
For finite sums, we can formally define the value by either of two recurrences:

∑_{i=a}^{b} f(i) = { 0                            if b < a
                   { f(a) + ∑_{i=a+1}^{b} f(i)    otherwise.    (6.1.1)

∑_{i=a}^{b} f(i) = { 0                            if b < a
                   { f(b) + ∑_{i=a}^{b−1} f(i)    otherwise.    (6.1.2)
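Either recurrence can be turned directly into code; here is (6.1.1) as a Python sketch (the function name is mine):

```python
def sum_range(f, a, b):
    """Value of sum_{i=a}^{b} f(i) using recurrence (6.1.1):
    0 if b < a, else f(a) + sum_{i=a+1}^{b} f(i)."""
    if b < a:     # empty sum
        return 0
    return f(a) + sum_range(f, a + 1, b)
```

For example, sum_range(lambda i: i, 1, 10) gives 55, and a range with b < a gives the empty sum 0.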
6.1.2
Scope
The scope of a summation extends to the first addition or subtraction symbol
that is not enclosed in parentheses or part of some larger term. So

∑_{i=1}^{n} i^2 + 1 = (∑_{i=1}^{n} i^2) + 1 = 1 + ∑_{i=1}^{n} i^2 ≠ ∑_{i=1}^{n} (i^2 + 1).

In contrast, the scope of the first sum in

∑_{i=1}^{n} i^2 + ∑_{i=1}^{n} i

is unambiguous.
Here the looming bulk of the second sigma warns the reader that the
first sum is ending; it is much harder to miss than the relatively tiny plus
symbol in the first example.
6.1.3
Summation identities
The summation operator is linear. This means that constant factors can
be pulled out of sums:

∑_{i=n}^{m} a·x_i = a ∑_{i=n}^{m} x_i    (6.1.3)
and sums can be split:

∑_{i=n}^{m} (x_i + y_i) = ∑_{i=n}^{m} x_i + ∑_{i=n}^{m} y_i.    (6.1.4)
With multiple sums, the order of summation can be swapped:

∑_{j=n′}^{m′} ∑_{i=n}^{m} x_{ij} = ∑_{i=n}^{m} ∑_{j=n′}^{m′} x_{ij}.
Products of sums can be turned into double sums of products and vice
versa:
(∑_{i=n}^{m} x_i)(∑_{j=n′}^{m′} y_j) = ∑_{i=n}^{m} ∑_{j=n′}^{m′} x_i y_j.
These identities can often be used to transform a sum you can't solve
into something simpler.
To prove these identities, use induction and (6.1.2). For example, the
following lemma demonstrates a generalization of (6.1.3) and (6.1.4):
Lemma 6.1.1.

∑_{i=n}^{m} (a·x_i + b·y_i) = a ∑_{i=n}^{m} x_i + b ∑_{i=n}^{m} y_i.
Proof. If m < n, then both sides of the equation are zero. This proves
that the identity holds for all small m, and gives us a base case for our
induction at m = n − 1 that we can use to show it holds for larger m.
For the induction step, we want to show that the identity holds for m + 1 if
it holds for m. This is a straightforward computation, using (6.1.2) first to
unpack the combined sum and then to repack the split sums:
∑_{i=n}^{m+1} (a·x_i + b·y_i) = ∑_{i=n}^{m} (a·x_i + b·y_i) + (a·x_{m+1} + b·y_{m+1})
= a ∑_{i=n}^{m} x_i + b ∑_{i=n}^{m} y_i + a·x_{m+1} + b·y_{m+1}
= a (∑_{i=n}^{m} x_i + x_{m+1}) + b (∑_{i=n}^{m} y_i + y_{m+1})
= a ∑_{i=n}^{m+1} x_i + b ∑_{i=n}^{m+1} y_i.
6.1.4
When writing a summation, you can generally pick any index variable you
like, although i, j, k, etc., are popular choices. Usually it's a good idea to
pick an index that isn't used outside the sum. Though

∑_{n=0}^{n} n = ∑_{i=0}^{n} i

has a well-defined meaning, the version on the right-hand side is a lot less
confusing.
In addition to renaming indices, you can also shift them, provided you
shift the bounds to match. For example, rewriting

∑_{i=1}^{n} (i − 1)

by substituting j for i − 1 gives

∑_{j=0}^{n−1} j.
6.1.5
Sometimes we'd like to sum an expression over values that aren't consecutive
integers, or may not even be integers at all. This can be done using a sum
over all indices that are members of a given index set, or in the most general
form satisfy some given predicate (with the usual set-theoretic caveat that
the objects that satisfy the predicate must form a set). Such a sum is written
by replacing the lower and upper limits with a single subscript that gives
the predicate that the indices must obey.
For example, we could sum i^2 for i in the set {3, 5, 7}:

∑_{i∈{3,5,7}} i^2 = 3^2 + 5^2 + 7^2 = 83.
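Sums over index sets map directly onto Python generator expressions, which makes examples like this easy to check:

```python
# sum over i in the index set {3, 5, 7} of i^2
total = sum(i**2 for i in {3, 5, 7})
assert total == 9 + 25 + 49 == 83
```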
Or we could sum the inverses of all prime numbers less than 1000:

∑_{p < 1000, p prime} 1/p.

The index condition can be more elaborate, as in

∑_{1 ≤ i ≤ j ≤ n} a_{ij}   or   ∑_{x ∈ A ⊆ S} |A|,

where the first sum sums over all pairs of values (i, j) such that 1 ≤ i, i ≤ j,
and j ≤ n, with each pair appearing exactly once; and the second sums over
all sets A that are subsets of S and contain x (assuming x and S are defined
outside the summation). Hopefully, you will not run into too many sums
that look like this, but it's worth being able to decode them if you do.
Sums over a given set are guaranteed to be well-defined only if the set is
finite. In this case we can use the fact that there is a bijection between any
finite set S and the ordinal |S| to rewrite the sum as a sum over indices in |S|.
For example, if |S| = n, then there exists a bijection f : {0 . . . n − 1} → S,
so we can define

∑_{i∈S} x_i = ∑_{i=0}^{n−1} x_{f(i)}.    (6.1.5)
This sum can also be defined recursively:

∑_{i∈S} x_i = { 0                          if S = ∅,
              { x_z + ∑_{i∈S∖{z}} x_i      if z ∈ S.    (6.1.6)

The idea is that for any particular z ∈ S, we can always choose a bijection
that makes z = f(|S| − 1).
If S is infinite, computing the sum is trickier. For countable S, where
there is a bijection f : N → S, we can sometimes rewrite

∑_{i∈S} x_i = ∑_{i=0}^{∞} x_{f(i)},
and use the definition of an infinite sum (given below). Note that if the
x_i have different signs, the result we get may depend on which bijection we
choose. For this reason such infinite sums are probably best avoided unless
you can explicitly use N or a subset of N as the index set.
6.1.6
When the index set is understood from context, it is often dropped, leaving
only the index, as in ∑_i i^2. This will generally happen only if the index spans
all possible values in some obvious range, and can be a mark of sloppiness
in formal mathematical writing. Theoretical physicists adopt a still lazier
approach, and leave out the ∑_i part entirely in certain special types
of sums: this is known as the Einstein summation convention after the
notoriously lazy physicist who proposed it.
6.1.7
Infinite sums
Sometimes you may see an expression where the upper limit is infinite, as
in

∑_{i=0}^{∞} 1/2^i.

The meaning of this expression is the limit of the sequence of partial sums
s_n obtained by taking the sum of the first term, the sum of the first two
terms, the sum of the first three terms, etc. The limit converges to a
particular value x if for any ε > 0, there exists an N such that for all n > N,
the value of s_n is within ε of x (formally, |s_n − x| < ε). We will see some
examples of infinite sums when we look at generating functions in §11.3.
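We can watch the partial sums s_n approach the limit numerically; a quick Python sketch:

```python
def partial_sums(terms):
    """Yield s_0, s_1, ...: the running sums of the given terms."""
    s = 0
    for t in terms:
        s += t
        yield s

# Partial sums of sum_{i=0}^{infinity} 1/2^i approach the limit 2:
sums = list(partial_sums(1 / 2**i for i in range(30)))
assert abs(sums[-1] - 2) < 1e-8   # within epsilon of the limit
```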
6.1.8
Double sums
Nothing says that the expression inside a summation can't be another
summation. This gives double sums, such as in this rather painful definition of
multiplication for non-negative integers:

a · b := ∑_{i=1}^{a} ∑_{j=1}^{b} 1.
If you think of a sum as a for loop, a double sum is two nested for loops.
The effect is to sum the innermost expression over all pairs of values of the
two indices.
Here's a more complicated double sum where the limits on the inner sum
depend on the index of the outer sum:

∑_{i=0}^{n} ∑_{j=0}^{i} (i + 1)(j + 1).
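The painful definition of multiplication above really is just two nested for loops that each add 1; a Python sketch:

```python
def mul(a, b):
    """a*b as the double sum sum_{i=1}^{a} sum_{j=1}^{b} 1."""
    total = 0
    for i in range(1, a + 1):      # outer sum
        for j in range(1, b + 1):  # inner sum
            total += 1             # innermost expression
    return total
```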
6.2
Products
What if you want to multiply a series of values instead of adding them? The
notation is the same as for a sum, except that you replace the sigma with a
pi, as in this definition of the factorial function for non-negative n:

n! := ∏_{i=1}^{n} i = 1 · 2 · · · n.
The other difference is that while an empty sum is defined to have the
value 0, an empty product is defined to have the value 1. The reason for
this rule (in both cases) is that an empty sum or product should return the
identity element for the corresponding operation: the value that when
added to or multiplied by some other value x doesn't change x. This allows
writing general rules like:
∑_{i∈A} f(i) + ∑_{i∈B} f(i) = ∑_{i∈A∪B} f(i) + ∑_{i∈A∩B} f(i)

(∏_{i∈A} f(i)) · (∏_{i∈B} f(i)) = (∏_{i∈A∪B} f(i)) · (∏_{i∈A∩B} f(i))
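Python's built-ins follow the same identity-element convention, which gives a quick sanity check (math.prod requires Python 3.8+):

```python
import math

assert sum([]) == 0        # empty sum: additive identity
assert math.prod([]) == 1  # empty product: multiplicative identity

def fact(n):
    """n! as the product of 1..n; fact(0) is an empty product."""
    return math.prod(range(1, n + 1))
```

With any other choice of empty-product value, fact(0) would come out wrong.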
6.3
Some more obscure operators also allow you to compute some aggregate
over a series, with the same rules for indices, lower and upper limits, etc.,
as ∑ and ∏. These include:
Big AND:

⋀_{x∈S} P(x).

Big OR:

⋁_{x∈S} P(x).

Big Intersection:

⋂_{i=1}^{n} A_i = A_1 ∩ A_2 ∩ · · · ∩ A_n.

Big Union:

⋃_{i=1}^{n} A_i = A_1 ∪ A_2 ∪ · · · ∪ A_n.
These all behave pretty much the way one would expect. One issue that
is not obvious from the definition is what happens with an empty index set.
Here the rule, as with sums and products, is to return the identity element
for the operation. This will be True for AND, False for OR, and the empty
set for union; for intersection, there is no identity element in general, so the
intersection over an empty collection of sets is undefined.
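The same empty-index-set conventions show up in Python's aggregate built-ins; a quick check:

```python
from functools import reduce

# Big AND and big OR over an empty index set give their identities:
assert all(()) is True     # empty AND = True
assert any(()) is False    # empty OR = False

# Big union over an empty collection of sets is the empty set:
assert reduce(set.union, [], set()) == set()

# Big intersection has no identity element, so a starting set (the
# "universe" the A_i live in) must be supplied explicitly.
```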
6.4
Closed forms
6.4.1
Here are the three formulas you should either memorize or remember how
to derive:

∑_{i=1}^{n} 1 = n

∑_{i=1}^{n} i = n(n + 1)/2

∑_{i=0}^{n} r^i = (1 − r^{n+1})/(1 − r)

The second formula can be derived by adding the sum to its own reversal,
so that each of the n columns adds up to n + 1:

(1 + 2 + · · · + n) + (n + (n−1) + · · · + 1) = (1 + n) + (2 + (n−1)) + · · · + (n + 1) = n(n + 1),

and then dividing by 2.
For the geometric series, start with the infinite version

∑_{i=0}^{∞} r^i = 1/(1 − r),

which holds when |r| < 1. If

S = ∑_{i=0}^{∞} r^i,

then

rS = ∑_{i=0}^{∞} r^{i+1} = ∑_{i=1}^{∞} r^i,

and so

S − rS = r^0 = 1,

giving S = 1/(1 − r). For the finite version, subtract off the tail:

∑_{i=0}^{n} r^i = ∑_{i=0}^{∞} r^i − ∑_{i=n+1}^{∞} r^i = 1/(1 − r) − r^{n+1}/(1 − r) = (1 − r^{n+1})/(1 − r).

These formulas can be combined to evaluate more complicated sums, for
example:

∑_{i=0}^{n} (3 · 2^i + 5) = 3 ∑_{i=0}^{n} 2^i + 5 ∑_{i=0}^{n} 1
= 3(2^{n+1} − 1) + 5(n + 1)
= 3 · 2^{n+1} + 5n + 2.
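All three closed forms (and the worked example) are easy to spot-check by brute force in Python:

```python
for n in range(20):
    assert sum(1 for i in range(1, n + 1)) == n
    assert sum(i for i in range(1, n + 1)) == n * (n + 1) // 2
    r = 3  # any r != 1 works for the exact geometric formula
    assert sum(r**i for i in range(n + 1)) == (r**(n + 1) - 1) // (r - 1)
    assert sum(3 * 2**i + 5 for i in range(n + 1)) == 3 * 2**(n + 1) + 5*n + 2
```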
Other useful summations can be found in various places. Rosen [Ros12]
and Graham et al. [GKP94] both provide tables of sums in their chapters
on generating functions. But it is usually better to be able to reconstruct
the solution of a sum rather than trying to memorize such tables.
6.4.2
If nothing else works, you can try using the guess but verify method, which
also works more generally for identifying sequences defined recursively. Here
we write out the values of the summation for the first few values of the upper
limit (for example), and hope that we recognize the sequence. If we do, we
can then try to prove that a formula for the sequence of sums is correct by
induction.
Example: Suppose we want to compute

S(n) = ∑_{k=1}^{n} (2k − 1),

but have forgotten the closed forms for ∑_{k=1}^{n} k and ∑_{k=1}^{n} 1.
Tabulating the first few values gives:
n S(n)
0 0
1 1
2 1+3=4
3 1+3+5=9
4 1 + 3 + 5 + 7 = 16
5 1 + 3 + 5 + 7 + 9 = 25
At this point we might guess that S(n) = n^2. To verify this, observe
that it holds for n = 0, and for larger n we have S(n) = S(n−1) + (2n−1) =
(n − 1)^2 + 2n − 1 = n^2 − 2n + 1 + 2n − 1 = n^2. So we can conclude that our
guess was correct.
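The guess-but-verify loop is easy to automate; a Python sketch:

```python
def S(n):
    """Brute-force value of sum_{k=1}^{n} (2k - 1)."""
    return sum(2*k - 1 for k in range(1, n + 1))

print([S(n) for n in range(6)])  # [0, 1, 4, 9, 16, 25]: guess S(n) = n**2
for n in range(100):             # check the guess over a larger range
    assert S(n) == n**2
```

Passing the check is evidence, not proof; the induction above is what actually settles it.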
6.4.3
Ansatzes
Suppose we guess that ∑_{i=0}^{n} i^2 is a cubic polynomial in n:

∑_{i=0}^{n} i^2 = c_3 n^3 + c_2 n^2 + c_1 n + c_0,    (6.4.1)

when n ≥ 0.
Under the assumption that (6.4.1) holds, we can plug in n = 0 to get
∑_{i=0}^{0} i^2 = 0 = c_0. This means that we only need to figure out c_3, c_2, and c_1.
Plugging in some small values for n gives
0 + 1 = 1 = c_3 + c_2 + c_1
0 + 1 + 4 = 5 = 8c_3 + 4c_2 + 2c_1
0 + 1 + 4 + 9 = 14 = 27c_3 + 9c_2 + 3c_1
With some effort, this system of equations can be solved to obtain c_3 =
1/3, c_2 = 1/2, c_1 = 1/6, giving the formula

∑_{i=0}^{n} i^2 = (1/3)n^3 + (1/2)n^2 + (1/6)n.    (6.4.2)
This may be more recognizable in the equivalent form

∑_{i=0}^{n} i^2 = (2n + 1)n(n + 1)/6.    (6.4.3)

To be sure that (6.4.2) actually holds for all n, we can verify it by induction:
it holds for n = 0, and plugging n + 1 into the right-hand side gives

(1/3)(n+1)^3 + (1/2)(n+1)^2 + (1/6)(n+1)
= ((1/3)n^3 + n^2 + n + 1/3) + ((1/2)n^2 + n + 1/2) + ((1/6)n + 1/6)
= (1/3)n^3 + (1/2)n^2 + (1/6)n + (n^2 + 2n + 1)
= ∑_{i=0}^{n} i^2 + (n + 1)^2
= ∑_{i=0}^{n+1} i^2.
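Exact rational arithmetic makes it painless to check the ansatz coefficients; a Python sketch using the fractions module:

```python
from fractions import Fraction

# The coefficients found by solving the linear system above:
c3, c2, c1 = Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)

for n in range(50):
    s = sum(i * i for i in range(n + 1))
    assert s == c3 * n**3 + c2 * n**2 + c1 * n   # formula (6.4.2)
    assert s == (2*n + 1) * n * (n + 1) // 6     # equivalent form (6.4.3)
```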
6.4.4
Mostly in algorithm analysis, we do not need to compute sums exactly,
because we are just going to wrap the result up in some asymptotic expression
anyway (see Chapter 7). This makes our life much easier, because we only
need an approximate solution.
Here's my general strategy for computing sums:
6.4.4.1
Pull as many constant factors out as you can (where constant in this case
means anything that does not involve the summation index). Example:
∑_{i=1}^{n} n/i = n ∑_{i=1}^{n} 1/i = n·H_n = Θ(n log n). (See harmonic
series below.)
6.4.4.2
See if it's bounded above or below by some other sum whose solution you
already know. Good sums to try (you should memorize all of these):

Geometric series: ∑_{i=0}^{n} x^i = (1 − x^{n+1})/(1 − x) and ∑_{i=0}^{∞} x^i = 1/(1 − x).
The way to recognize a geometric series is that the ratio between adjacent
terms is constant. If you memorize the second formula, you can rederive the
first one. If you're Gauss, you can skip memorizing the second formula.
A useful trick to remember for geometric series is that if x is a constant
that is not exactly 1, the sum is always big-Theta of its largest term. So for
example ∑_{i=1}^{n} 2^i = Θ(2^n) (the exact value is 2^{n+1} − 2), and
∑_{i=1}^{n} 2^{−i} = Θ(1) (the exact value is 1 − 2^{−n}).
If the ratio between terms equals 1, the formula doesn't work; instead,
we have a constant series (see below).
Constant series: ∑_{i=1}^{n} 1 = n.

Harmonic series: ∑_{i=1}^{n} 1/i = H_n = Θ(log n).

The harmonic series can be rederived using the integral technique given below
or by summing the last half of the series, so this is mostly useful to
remember in case you run across H_n (the n-th harmonic number).
6.4.4.3
See if there's some part of the sum that you can bound. For example,
∑_{i=1}^{n} i^3 has a (painful) exact solution, or can be approximated by
the integral trick described below, but it can very quickly be solved to within
a constant factor by observing that ∑_{i=1}^{n} i^3 ≤ ∑_{i=1}^{n} n^3 = O(n^4)
and ∑_{i=1}^{n} i^3 ≥ ∑_{i=n/2}^{n} i^3 ≥ ∑_{i=n/2}^{n} (n/2)^3 = Ω(n^4).
6.4.4.4
Integrate
Integrate. If f(n) is non-decreasing and you know how to integrate it, then

∫_{a−1}^{b} f(x) dx ≤ ∑_{i=a}^{b} f(i) ≤ ∫_{a}^{b+1} f(x) dx,

which is enough to get a big-Theta bound for almost all functions you are
likely to encounter in algorithm analysis. If you don't know how to integrate
it, see §F.3.
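The sandwich is easy to test numerically for a concrete non-decreasing f; here with f(x) = x^2, whose antiderivative is x^3/3 (a quick sketch):

```python
def integral_x2(lo, hi):
    """Exact value of the integral of x**2 from lo to hi."""
    return (hi**3 - lo**3) / 3

for n in range(1, 50):
    s = sum(i**2 for i in range(1, n + 1))
    # integral_{a-1}^{b} f  <=  sum_{i=a}^{b} f(i)  <=  integral_{a}^{b+1} f
    assert integral_x2(0, n) <= s <= integral_x2(1, n + 1)
```

Both bounds are (1/3)n^3 + O(n^2), which pins the sum down to Θ(n^3).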
6.4.4.5
Grouping terms
Try grouping terms together. For example, the standard trick for showing
that the harmonic series is unbounded in the limit is to argue that 1 + 1/2 +
1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + · · · ≥ 1 + 1/2 + (1/4 + 1/4) + (1/8 +
1/8 + 1/8 + 1/8) + · · · = 1 + 1/2 + 1/2 + 1/2 + · · · . I usually try everything
else first, but sometimes this works if you get stuck.
6.4.4.6
Oddities
One oddball sum that shows up occasionally but is hard to solve using
any of the above techniques is ∑_{i=1}^{n} a^i i. If a < 1, this is Θ(1) (the exact
formula for ∑_{i=1}^{∞} a^i i when a < 1 is a/(1 − a)^2, which gives a constant
upper bound for the sum stopping at n); if a = 1, it's just an arithmetic series; if
a > 1, the largest term dominates and the sum is Θ(a^n n) (there is an exact
formula, but it's ugly; if you just want to show it's O(a^n n), the simplest
approach is to bound the series ∑_{i=0}^{n−1} a^{n−i}(n − i) by the geometric series
∑_{i=0}^{n−1} a^{n−i} n ≤ a^n n/(1 − a^{−1}) = O(a^n n)). I wouldn't bother memorizing this
one provided you remember how to find it in these notes.
6.4.4.7
Final notes
In practice, almost every sum you are likely to encounter in algorithm
analysis will be of the form ∑_{i=1}^{n} f(i) where f(n) is exponential (so that the
sum is bounded by a geometric series and the largest term dominates) or
polynomial (so that f(n/2) = Θ(f(n)) and the sum is Θ(n·f(n)), using the
∑_{i=n/2}^{n} f(i) = Ω(n·f(n)) lower bound).
Graham et al. [GKP94] spend a lot of time on computing sums exactly.
The most generally useful technique for doing this is to use generating
functions (see §11.3).
Chapter 7
Asymptotic notation
Asymptotic notation is a tool for describing the behavior of functions on
large values, which is used extensively in the analysis of algorithms.
7.1
Definitions
O(f(n)): A function g(n) is in O(f(n)) ("big O of f(n)") if there exist
constants c > 0 and N such that |g(n)| ≤ c|f(n)| for all n > N.

Ω(f(n)): A function g(n) is in Ω(f(n)) ("big Omega of f(n)") if there exist
constants c > 0 and N such that |g(n)| ≥ c|f(n)| for all n > N.

Θ(f(n)): A function g(n) is in Θ(f(n)) ("big Theta of f(n)") if there exist
constants c1 > 0, c2 > 0, and N such that c1|f(n)| ≤ |g(n)| ≤ c2|f(n)|
for all n > N. This is equivalent to saying that g(n) is in both O(f(n))
and Ω(f(n)).

o(f(n)): A function g(n) is in o(f(n)) ("little o of f(n)") if for every c > 0
there exists an N such that |g(n)| ≤ c|f(n)| for all n > N. This is
equivalent to saying that lim_{n→∞} g(n)/f(n) = 0.

ω(f(n)): A function g(n) is in ω(f(n)) ("little omega of f(n)") if for every
c > 0 there exists an N such that |g(n)| ≥ c|f(n)| for all n > N. This
is equivalent to saying that lim_{n→∞} |g(n)|/|f(n)| diverges to infinity.
7.2
Constant factors vary from one machine to another. The c factor hides
this. If we can show that an algorithm runs in O(n^2) time, we can be
confident that it will continue to run in O(n^2) time no matter how fast
(or how slow) our computers get in the future.
For the N threshold, there are several excuses:
Any problem can theoretically be made to run in O(1) time for
any finite subset of the possible inputs (e.g., all inputs expressible
in 50 MB or less), by prefacing the main part of the algorithm
with a very large table lookup. So it's meaningless to talk about
the relative performance of different algorithms for bounded inputs.
If f(n) > 0 for all n, then we can get rid of N (or set it to zero) by
making c large enough. But some functions f(n) take on zero (or
undefined) values for interesting n (e.g., f(n) = n^2 is zero when
n is zero, and f(n) = log n is undefined for n = 0 and zero for
n = 1). Allowing the minimum N lets us write O(n^2) or O(log n)
for classes of functions that we would otherwise have to write
more awkwardly as something like O(n^2 + 1) or O(log(n + 2)).
Putting the n > N rule in has a natural connection with the
definition of a limit, where the limit as n goes to infinity of g(n)
is defined to be x if for each ε > 0 there is an N such that
|g(n) − x| < ε for all n > N. Among other things, this permits
the limit test that says g(n) = O(f(n)) if lim_{n→∞} g(n)/f(n) exists
and is finite.
7.3
Most of the time when we use asymptotic notation, we compute bounds
using stock theorems like O(f(n)) + O(g(n)) = O(max(f(n), g(n))) or
O(c·f(n)) = O(f(n)). But sometimes we need to unravel the definitions to
see whether a given function fits in a given class, or to prove these utility
theorems to begin with. So let's do some examples of how this works.
Theorem 7.3.1. The function n is in O(n^3).
Proof. We must find c, N such that for all n > N, |n| ≤ c|n^3|. Since n^3
is much bigger than n for most values of n, we'll pick c to be something
convenient to work with, like 1. So now we need to choose N so that for all
n > N, |n| ≤ |n^3|. It is not the case that |n| ≤ |n^3| for all n (try plotting
n vs n^3 for n < 1), but if we let N = 1, then we have n > 1, and we just
need to massage this into n^3 ≥ n. There are a couple of ways to do this,
but the quickest is probably to observe that squaring and multiplying by n
(a positive quantity) are both increasing functions, which means that from
n > 1 we can derive n^2 > 1^2 = 1 and then n^2 · n = n^3 > 1 · n = n.
7.4
7.4.1
Use big-O when you have an upper bound on a function, e.g., the zoo
never got more than O(1) new gorillas per year, so there were at most
O(t) gorillas at the zoo in year t.
Use big-Ω when you have a lower bound on a function, e.g., every year
the zoo got at least one new gorilla, so there were at least Ω(t) gorillas
at the zoo in year t.
Use big-Θ when you know the function exactly to within a constant-factor
error, e.g., every year the zoo got exactly five new gorillas, so
there were Θ(t) gorillas at the zoo in year t.
For the others, use little-o and ω when one function becomes vanishingly
small relative to the other, e.g., new gorillas arrived rarely and with declining
frequency, so there were o(t) gorillas at the zoo in year t. These are not used
as much as big-O, big-Ω, and big-Θ in the algorithms literature.
7.4.2
O(f(n)) + O(g(n)) = O(f(n)) when g(n) = O(f(n)). If you have an
expression of the form O(f(n) + g(n)), you can almost always rewrite
it as O(f(n)) or O(g(n)), depending on which is bigger. The same goes
for Ω and Θ.
O(c·f(n)) = O(f(n)) if c is a constant. You should never have a
constant inside a big O. This includes bases for logarithms: since
log_a x = log_b x / log_b a, you can always rewrite O(lg n), O(ln n), or
O(log_{1.4467712} n) as just O(log n).
But watch out for exponents and products: O(3^n n^{3.1178} log^{1/3} n) is
already as simple as it can be.
7.4.3
If you are confused whether, e.g., log n is O(n), try computing the limit as
n goes to infinity of (log n)/n, and see if it converges to a constant (zero is OK).
The general rule is that f(n) is O(g(n)) if lim_{n→∞} f(n)/g(n) exists.1
You may need to use L'Hôpital's Rule to evaluate such limits if they
aren't obvious. This says that

lim_{n→∞} f(n)/g(n) = lim_{n→∞} f′(n)/g′(n)

when f(n) and g(n) both diverge to infinity or both converge to zero. Here
f′ and g′ are the derivatives of f and g with respect to n; see §F.2.

1 Note that this is a sufficient but not necessary condition. For example, the function
f(n) that is 1 when n is even and 2 when n is odd is O(1), but lim_{n→∞} f(n)
doesn't exist.
7.5
Variations in notation
As with many tools in mathematics, you may see some differences in how
asymptotic notation is defined and used.
7.5.1
Absolute values
Some authors leave out the absolute values. For example, Biggs [Big02]
defines f(n) as being in O(g(n)) if f(n) ≤ c·g(n) for sufficiently large n.
If f(n) and g(n) are non-negative, this is not an unreasonable definition.
But it produces odd results if either can be negative: for example, by this
definition, −n^{1000} is in O(n^2). Some authors define O, Ω, and Θ only for
non-negative functions, avoiding this problem.
The most common definition (which we will use) says that f(n) is in
O(g(n)) if |f(n)| ≤ c|g(n)| for sufficiently large n; by this definition −n^{1000}
is not in O(n^2), though it is in O(n^{1000}). This definition was designed for
error terms in asymptotic expansions of functions, where the error term
might represent a positive or negative error.
7.5.2
For example,

f(n) + o(f(n)) = Θ(f(n))

means that for any g in o(f(n)), there exists an h in Θ(f(n)) such that
f(n) + g(n) = h(n), and

O(f(n)) + O(g(n)) + 1 = O(max(f(n), g(n))) + 1

means that for any r in O(f(n)) and s in O(g(n)), there exists t in O(max(f(n), g(n)))
such that r(n) + s(n) + 1 = t(n) + 1.
The nice thing about this definition is that as long as you are careful
about the direction the equals sign goes in, you can treat these complicated
pseudo-equations like ordinary equations. For example, since O(n^2) +
O(n^3) = O(n^3), we can write

n^2/2 + n(n + 1)(n + 2)/6 = O(n^2) + O(n^3)
= O(n^3),
which is much simpler than what it would look like if we had to talk about
particular functions being elements of particular sets of functions.
This is an example of abuse of notation, the practice of redefining
some standard bit of notation (in this case, equations) to make calculation
easier. It's generally a safe practice as long as everybody understands what
is happening. But beware of applying facts about unabused equations to the
abused ones. Just because O(n^2) = O(n^3) doesn't mean O(n^3) = O(n^2);
the big-O equations are not reversible the way ordinary equations are.
More discussion of this can be found in [Fer08, §10.4] and [GKP94, Chapter 9].
Chapter 8
Number theory
Number theory is the study of the natural numbers, particularly their
divisibility properties. Throughout this chapter, when we say number, we
mean an element of N.
8.1
8.2
Let m and n be numbers, where at least one of m and n is nonzero, and let k
be the largest number for which k|m and k|n. Then k is called the greatest
common divisor or gcd of m and n, written gcd(m, n) or sometimes just
(m, n). A similar concept is the least common multiple (lcm), written
lcm(m, n), which is the smallest k such that m|k and n|k.
Formally, g = gcd(m, n) if g|m, g|n, and for any g′ that divides both m
and n, g′|g. Similarly, ℓ = lcm(m, n) if m|ℓ, n|ℓ, and for any ℓ′ with m|ℓ′
and n|ℓ′, ℓ|ℓ′.
Two numbers m and n whose gcd is 1 are said to be relatively prime
or coprime, or simply to have no common factors.
If divisibility is considered as a partial order, the naturals form a lattice
(see §9.5.3), which is a partial order in which every pair of elements x and
y has both a unique greatest element that is less than or equal to both (the
meet x ∧ y, equal to gcd(x, y) in this case) and a unique smallest element
that is greater than or equal to both (the join x ∨ y, equal to lcm(x, y) in
this case).
8.2.1
Euclid described in Book VII of his Elements what is now known as the
Euclidean algorithm for computing the gcd of two numbers (his original
version was for finding the largest square you could use to tile a given
rectangle, but the idea is the same). Euclid's algorithm is based on the
recurrence

gcd(m, n) = { n                    if m = 0,
            { gcd(n mod m, m)      if m > 0.

The first case holds because n|0 for all n. The second holds because
if k divides both n and m, then k divides n mod m = n − ⌊n/m⌋·m; and
conversely if k divides m and n mod m, then k divides n = (n mod m) +
m·⌊n/m⌋. So (m, n) and (n mod m, m) have the same set of common factors,
and the greatest of these is the same.
So the algorithm simply takes the remainder of the larger number by the
smaller recursively until it gets a zero, and returns whatever number is left.
8.2.2
The extended Euclidean algorithm not only computes gcd(m, n), but
also computes integer coefficients m′ and n′ such that

m′·m + n′·n = gcd(m, n).

It has the same structure as the Euclidean algorithm, but keeps track of
more information in the recurrence. Specifically:
For m = 0, gcd(m, n) = n with n′ = 1 and m′ = 0.
For m > 0, let n = qm + r where 0 ≤ r < m, and use the algorithm
recursively to compute a and b such that ar + bm = gcd(r, m) =
gcd(m, n). Substituting r = n − qm gives gcd(m, n) = a(n − qm) + bm =
(b − aq)m + an. This gives both the gcd and the coefficients m′ = b − aq
and n′ = a.
8.2.2.1
Example
Figure 8.1 gives a computation of the gcd of 176 and 402, together with
the extra coefficients. The code used to generate this figure is given in
Figure 8.2.
8.2.2.2
Applications
Finding gcd(176,402)
q = 2 r = 50
Finding gcd(50,176)
q = 3 r = 26
Finding gcd(26,50)
q = 1 r = 24
Finding gcd(24,26)
q = 1 r = 2
Finding gcd(2,24)
q = 12 r = 0
Finding gcd(0,2)
base case
Returning 0*0 + 1*2 = 2
a = b1 - a1*q = 1 - 0*12 = 1
Returning 1*2 + 0*24 = 2
a = b1 - a1*q = 0 - 1*1 = -1
Returning -1*24 + 1*26 = 2
a = b1 - a1*q = 1 - -1*1 = 2
Returning 2*26 + -1*50 = 2
a = b1 - a1*q = -1 - 2*3 = -7
Returning -7*50 + 2*176 = 2
a = b1 - a1*q = 2 - -7*2 = 16
Returning 16*176 + -7*402 = 2
Figure 8.1: Trace of extended Euclidean algorithm
#!/usr/bin/python3

def euclid(m, n, trace = False, depth = 0):
    """Implementation of extended Euclidean algorithm.

    Returns triple (a, b, g) where am + bn = g and g = gcd(m, n).
    Optional argument trace, if true, shows progress."""
    def output(s):
        if trace:
            print("{}{}".format(" " * depth, s))
    output("Finding gcd({},{})".format(m, n))
    if m == 0:
        output("base case")
        a, b, g = 0, 1, n
    else:
        q = n//m
        r = n % m
        output("q = {} r = {}".format(q, r))
        a1, b1, g = euclid(r, m, trace, depth + 1)
        a = b1 - a1*q
        b = a1
        output("a = b1 - a1*q = {} - {}*{} = {}".format(b1, a1, q, a))
    output("Returning {}*{} + {}*{} = {}".format(a, m, b, n, g))
    return a, b, g

if __name__ == "__main__":
    import sys
    euclid(int(sys.argv[1]), int(sys.argv[2]), True)
Figure 8.2: Python code for extended Euclidean algorithm
If p is prime and p|ab, then either p|a or p|b. Proof: suppose p ∤ a;
since p is prime we have gcd(p, a) = 1. So there exist r and s such
that rp + sa = 1. Multiply both sides by b to get rpb + sab = b. Then
p|rpb and p|sab (the latter because p|ab), so p divides their sum and
thus p|b.
8.3
8.3.1
Applications
8.4
From the division algorithm, we have that for each pair of integers n and
m ≠ 0, there is a unique remainder r with 0 ≤ r < |m| and n = qm + r for
some q; this unique r is written as (n mod m). Define n ≡_m n′ (read "n is
congruent to n′ mod m") if (n mod m) = (n′ mod m), or equivalently if
there is some q ∈ Z such that n = n′ + qm.
The set of integers congruent to n mod m is called the residue class
of n (residue is an old word for remainder), and is written as [n]m . The
sets [0]m , [1]m , . . . , [m-1]m between them partition the integers, and the set
{[0]m , [1]m , . . . , [m-1]m } defines the integers mod m, written Zm . We will
see that Zm acts very much like Z, with well-defined operations for addition,
subtraction, and multiplication. In the case where the modulus is prime, we
even get division: Zp is a finite field for any prime p.
The most well-known instance of Zm is Z2 , the integers mod 2. The class
[0]2 is the even numbers and the class [1]2 is the odd numbers.
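As a quick illustration (my sketch, not the text's code), arithmetic on residue classes can be modeled by working with the canonical representatives 0, . . . , m - 1:

```python
# Hypothetical helpers (not from the text): arithmetic in Zm using
# the canonical representatives 0..m-1.
def zm_add(x, y, m):
    return (x + y) % m

def zm_mul(x, y, m):
    return (x * y) % m

# In Z2, [0] is the evens and [1] is the odds:
assert zm_add(4, 7, 2) == 1   # even + odd is odd
assert zm_mul(3, 5, 2) == 1   # odd * odd is odd
```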
8.4.1
The addition table for Z3:

    + | 0 1 2
    --+------
    0 | 0 1 2
    1 | 1 2 0
    2 | 2 0 1

The multiplication table for Z3:

    * | 0 1 2
    --+------
    0 | 0 0 0
    1 | 0 1 2
    2 | 0 2 1
8.4.2
Division in Zm
One thing we don't get in general in Zm is the ability to divide. This is not
terribly surprising, since we don't get to divide (without remainders) in Z
either. But for some values of x and m we can in fact do division: for these
x and m there exists a multiplicative inverse x⁻¹ (mod m) such that
x·x⁻¹ = 1 (mod m). We can see the winning x's for Z9 by looking for ones
in the multiplication table for Z9 , given in Table 8.2.
Here we see that 1⁻¹ = 1, as we'd expect, but that we also have 2⁻¹ = 5,
4⁻¹ = 7, 5⁻¹ = 2, 7⁻¹ = 4, and 8⁻¹ = 8. There are no inverses for 0, 3, or
6.
What 1, 2, 4, 5, 7, and 8 have in common is that they are all relatively
prime to 9. This is not an accident: when gcd(x, m) = 1, we can use the
extended Euclidean algorithm (§8.2.2) to find x⁻¹ (mod m). Observe that
what we want is some x′ such that xx′ ≡ 1 (mod m), or equivalently such that
x′x + qm = 1 for some q. But the extended Euclidean algorithm finds such
an x′ (and q) whenever gcd(x, m) = 1.
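This recipe can be coded directly. The sketch below is mine (not the text's code), echoing the recursive structure of the euclid function in Figure 8.2, and computes x⁻¹ (mod m) when gcd(x, m) = 1:

```python
def euclid(m, n):
    """Extended Euclid: return (a, b, g) with a*m + b*n == g == gcd(m, n)."""
    if m == 0:
        return 0, 1, n
    a1, b1, g = euclid(n % m, m)
    return b1 - a1 * (n // m), a1, g

def inverse_mod(x, m):
    """Return x' with (x * x') % m == 1, or raise if gcd(x, m) != 1."""
    a, b, g = euclid(x, m)          # a*x + b*m == g
    if g != 1:
        raise ValueError("{} has no inverse mod {}".format(x, m))
    return a % m                    # reduce to a representative in 0..m-1

# The winning x's in Z9:
assert [inverse_mod(x, 9) for x in (1, 2, 4, 5, 7, 8)] == [1, 5, 7, 2, 4, 8]
```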
    * | 0 1 2 3 4 5 6 7 8
    --+------------------
    0 | 0 0 0 0 0 0 0 0 0
    1 | 0 1 2 3 4 5 6 7 8
    2 | 0 2 4 6 8 1 3 5 7
    3 | 0 3 6 0 3 6 0 3 6
    4 | 0 4 8 3 7 2 6 1 5
    5 | 0 5 1 6 2 7 3 8 4
    6 | 0 6 3 0 6 3 0 6 3
    7 | 0 7 5 3 1 8 6 4 2
    8 | 0 8 7 6 5 4 3 2 1

Table 8.2: Multiplication table for Z9
8.4.3
The Chinese Remainder Theorem
Theorem 8.4.3 (Chinese Remainder Theorem). Let m1 and m2 be relatively prime.⁴ Then for each pair of equations
n mod m1 = n1 ,
n mod m2 = n2 ,
there is a unique solution n with 0 ≤ n < m1 m2 .
Example: let m1 = 3 and m2 = 4. Then the integers n from 0 to 11 can
be represented as pairs (n1 , n2 ) with no repetitions as follows:
 n  n1 n2
0 0 0
1 1 1
2 2 2
3 0 3
4 1 0
5 2 1
6 0 2
7 1 3
8 2 0
9 0 1
10 1 2
11 2 3
Proof. We'll show an explicit algorithm for constructing the solution. The
first trick is to observe that if a|b, then (x mod b) mod a = x mod a. The
proof is that x mod b = x - qb for some q, so (x mod b) mod a = (x mod a) -
(qb mod a) = x mod a, since any multiple of b is also a multiple of a, giving
qb mod a = 0.
Since m1 and m2 are relatively prime, the extended Euclidean algorithm
gives m′1 and m′2 such that m′1 m1 = 1 (mod m2 ) and m′2 m2 = 1 (mod m1 ).
This result is due to the fifth-century Indian mathematician Aryabhata. The name Chinese
Remainder Theorem appears to be much more recent. See http://mathoverflow.net/
questions/11951/what-is-the-history-of-the-name-chinese-remainder-theorem
for a discussion of the history of the name.
⁴ This means that gcd(m1 , m2 ) = 1.
This works in general: given equations n mod mi = ni for pairwise relatively
prime moduli m1 , . . . , mk , for each i and each j ≠ i let (mj⁻¹ mod mi) be the
multiplicative inverse of mj mod mi . Then take

    n = Σ_i ni Π_{j≠i} (mj⁻¹ mod mi) mj .

To see that this works, compute, for each k,

    n mod mk = Σ_i ni Π_{j≠i} (mj⁻¹ mod mi) mj  (mod mk)
             = nk · 1 + Σ_{i≠k} (ni · 0)  (mod mk)
             = nk ,

since for i = k every factor (mj⁻¹ mod mk) mj is congruent to 1 mod mk ,
while for i ≠ k the product over j ≠ i includes the factor mk and is thus
congruent to 0 mod mk .
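The construction in the proof can be implemented directly; this is my sketch, not the text's code, reusing the extended Euclidean algorithm for the inverses:

```python
def inverse_mod(x, m):
    """Inverse of x mod m via extended Euclid; assumes gcd(x, m) == 1."""
    def euclid(a, b):  # (s, t, g) with s*a + t*b == g == gcd(a, b)
        if a == 0:
            return 0, 1, b
        s1, t1, g = euclid(b % a, a)
        return t1 - s1 * (b // a), s1, g
    return euclid(x, m)[0] % m

def crt(residues, moduli):
    """Unique n with 0 <= n < product(moduli) and n % moduli[i] ==
    residues[i], for pairwise relatively prime moduli."""
    M = 1
    for m in moduli:
        M *= m
    total = 0
    for i, (ni, mi) in enumerate(zip(residues, moduli)):
        term = ni
        for j, mj in enumerate(moduli):
            if j != i:
                # factor (m_j^{-1} mod m_i) * m_j from the construction
                term *= inverse_mod(mj, mi) * mj
        total += term
    return total % M

# The m1 = 3, m2 = 4 example: the pair (2, 3) corresponds to n = 11.
assert crt([2, 3], [3, 4]) == 11
```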
8.4.4
The size of Z∗m , or equivalently the number of numbers less than m whose gcd
with m is 1, is written φ(m) and is called Euler's totient function or just
the totient of m. When p is prime, gcd(n, p) = 1 for all n with 0 < n < p, so
φ(p) = |Z∗p| = p - 1. For a prime power p^k , we similarly have gcd(n, p^k) = 1
unless p|n. There are exactly p^{k-1} numbers less than p^k that are divisible
by p (they are 0, p, 2p, . . . , (p^{k-1} - 1)p), so φ(p^k) = p^k - p^{k-1} = p^{k-1}(p - 1).⁵
For composite numbers m that are not prime powers, finding the value of
φ(m) is more complicated; but we can show using the Chinese Remainder
Theorem (Theorem 8.4.3) that in general

    φ(Π_{i=1}^{k} pi^{ei}) = Π_{i=1}^{k} pi^{ei - 1} (pi - 1).
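A sketch (mine, not the text's code) of computing φ(m) from this formula, factoring m by trial division:

```python
def totient(m):
    """Euler's totient via the prime-power formula phi(p^e) = p^(e-1)*(p-1),
    factoring m by trial division."""
    result = 1
    p = 2
    while p * p <= m:
        if m % p == 0:
            e = 0
            while m % p == 0:
                m //= p
                e += 1
            result *= p ** (e - 1) * (p - 1)
        p += 1
    if m > 1:                  # one leftover prime factor
        result *= m - 1
    return result

assert totient(9) == 6    # the invertible elements of Z9: 1, 2, 4, 5, 7, 8
assert totient(7) == 6    # phi(p) = p - 1 for prime p
```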
Theorem (Euler's theorem). If gcd(n, m) = 1, then n^φ(m) = 1 (mod m).
Proof. We will prove this using an argument adapted from the proof of
[Big02, Theorem 13.3.2]. Let z1 , z2 , . . . , z_φ(m) be the elements of Z∗m . For any
n
8.5
RSA encryption
Euler's theorem is useful in cryptography; for example, the RSA encryption system is based on the fact that (x^e)^d = x (mod m) when de = 1
(mod φ(m)). So x can be encrypted by raising it to the e-th power mod m,
and decrypted by raising the result to the d-th power. It is widely believed
that publishing e and m reveals no useful information about d provided e
and m are chosen carefully.
Specifically, the person who wants to receive secret messages chooses
large primes p and q, and finds d and e such that de = 1 (mod (p - 1)(q - 1)).
They then publish m = pq (the product, not the individual factors) and e.
Encrypting a message x involves computing x^e mod m. If x and e are
both large, computing x^e and then taking the remainder is an expensive
operation; but it is possible to get the same value by computing x^e mod m
in stages, by repeatedly squaring x and taking the product of the appropriate powers.
To decrypt x^e, compute (x^e)^d mod m.
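The repeated-squaring trick can be sketched as follows (my code; Python's built-in pow(x, e, m) does the same job):

```python
def modexp(x, e, m):
    """Compute x**e % m by repeated squaring, O(log e) multiplications."""
    result = 1
    square = x % m
    while e > 0:
        if e & 1:              # this bit of e contributes a factor x^(2^i)
            result = result * square % m
        square = square * square % m
        e >>= 1
    return result

# Encrypt 11 with e = 5, m = 91, then decrypt with d = 29:
assert modexp(11, 5, 91) == 72
assert modexp(72, 29, 91) == 11
assert modexp(11, 5, 91) == pow(11, 5, 91)   # matches the built-in
```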
For example, let p = 7, q = 13, so m = 91. The totient φ(m) of m is
(p - 1)(q - 1) = 6 · 12 = 72. Next pick some e relatively prime to φ(m):
e = 5. Since 5 · 29 = 72 · 2 + 1 we can make d = 29. Note that to compute d
in this way, we needed to know how to factor m so that we could compute
(p - 1)(q - 1); it's not known how to find d otherwise.
Now let's encrypt a message. Say we want to encrypt 11. Using e = 5
and m = 91, we can compute:
11^1 = 11
11^2 = 121 = 30
11^4 = 30^2 = 900 = 81
11^5 = 11^4 · 11^1 = 81 · 11 = 891 = 72.
When the recipient (who knows d) receives the encrypted message 72,
they can recover the original by computing 72^29 mod 91:
72^1 = 72
72^2 = 5184 = 88
72^4 = 88^2 = (-3)^2 = 9
72^8 = 9^2 = 81
72^16 = 81^2 = (-10)^2 = 100 = 9
72^29 = 72^16 · 72^8 · 72^4 · 72^1 = 9 · 81 · 9 · 72 = 81^2 · 72 = 9 · 72 = 648 = 11.
Note that we are working in Z91 throughout. This is what saves us from
computing the actual value of 72^29 in Z,⁶ and only at the end taking the
remainder.
For actual security, we need m to be large enough that it's hard to
recover p and q using presently conceivable factoring algorithms. Typical
applications choose m in the range of 2048 to 4096 bits (so each of p and
q will be a random prime between roughly 10^308 and 10^617). This is too
big to show a hand-worked example, or even to fit into the much smaller
integer data types shipped by default in many programming languages, but
it's not too large to be able to do the computations efficiently with a good
large-integer arithmetic library.
Chapter 9
Relations
A binary relation from a set A to a set B is a subset of A × B. In general,
an n-ary relation on sets A1 , A2 , . . . , An is a subset of A1 × A2 × · · · × An .
We will mostly be interested in binary relations, although n-ary relations
are important in databases. Unless otherwise specified, a relation will be
a binary relation. A relation from A to A is called a relation on A; many
of the interesting classes of relations we will consider are of this form. Some
simple examples are the relations =, <, ≤, and | (divides) on the integers.
You may recall that functions are a special case of relations, but most
of the relations we will consider now will not be functions.
Binary relations are often written in infix notation: instead of writing
(x, y) ∈ R, we write xRy. This should be pretty familiar for standard
relations like < but might look a little odd at first for relations named with
capital letters.
9.1
Representing relations
In addition to representing a relation by giving an explicit table ({(0, 1), (0, 2), (1, 2)})
or rule (xRy if x < y, where x, y ∈ {0, 1, 2}), we can also visualize relations
in terms of other structures built on pairs of objects.
9.1.1
Directed graphs
Figure 9.2: Relation {(1, 2), (1, 3), (2, 3), (3, 1)} represented as a directed
graph
term(e1 ) = term(e2 ).
If we don't care about the labels of the edges, a simple directed graph
can be described by giving E as a subset of V × V ; this gives a one-to-one
correspondence between relations on a set V and (simple) directed graphs.
For relations from A to B, we get a bipartite directed graph, where all edges
go from vertices in A to vertices in B.
Directed graphs are drawn using a dot or circle for each vertex and an
arrow for each edge, as in Figure 9.1.
This also gives a way to draw relations. For example, the relation on
{1, 2, 3} given by {(1, 2), (1, 3), (2, 3), (3, 1)} can be depicted as shown in Figure 9.2.
A directed graph that contains no sequence of edges leading back to
their starting point is called a directed acyclic graph or DAG. DAGs are
important for representing partially-ordered sets (see §9.5).
9.1.2
Matrices
like this:
    A = [ 0 1 1 0
          2 1 0 0
          1 0 0 1 ]
The first index of an entry gives the row it appears in and the second one
the column, so in this example A2,1 = 2 and A3,4 = 1. The dimensions
of a matrix are the numbers of rows and columns; in the example, A is a
3 × 4 (pronounced "3 by 4") matrix.
Note that rows come before columns in both indexing (Aij : i is row, j
is column) and giving dimensions (n × m: n is rows, m is columns). Like
the convention of driving on the right (in many countries), this choice is
arbitrary, but failing to observe it may cause trouble.
Matrices are used heavily in linear algebra (Chapter 13), but for the
moment we will use them to represent relations from {1 . . . n} to {1 . . . m},
by setting Aij = 0 if (i, j) is not in the relation and Aij = 1 if (i, j) is. So
for example, the relation on {1 . . . 3} given by {(i, j) | i < j} would appear
in matrix form as
    [ 0 1 1
      0 0 1
      0 0 0 ]
When used to represent the edges in a directed graph, a matrix of this
form is called an adjacency matrix.
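A sketch (mine, not the text's code) of building such a 0/1 matrix from a relation given as a set of pairs:

```python
def relation_matrix(relation, n, m):
    """0/1 matrix A for a relation from {1..n} to {1..m}:
    A[i][j] == 1 iff (i+1, j+1) is in the relation (0-based lists)."""
    return [[1 if (i + 1, j + 1) in relation else 0 for j in range(m)]
            for i in range(n)]

# The relation {(i, j) | i < j} on {1..3}:
less_than = {(i, j) for i in range(1, 4) for j in range(1, 4) if i < j}
assert relation_matrix(less_than, 3, 3) == [[0, 1, 1],
                                            [0, 0, 1],
                                            [0, 0, 0]]
```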
9.2
Operations on relations
9.2.1
Composition
mind that the equality relation is also the constant function.) In directed
graph terms, xRn y if and only if there is a path of exactly n edges from x
to y (possibly using the same edge more than once).
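A sketch (mine) of relational composition and powers, where compose(R, S) relates x to z whenever xRy and ySz for some y (the standard definition), illustrating the path interpretation of Rⁿ:

```python
def compose(R, S):
    """(x, z) such that x R y and y S z for some y (standard composition)."""
    return {(x, z) for (x, y1) in R for (y2, z) in S if y1 == y2}

def power(R, n, domain):
    """R^n, where R^0 is the equality relation on the domain."""
    result = {(x, x) for x in domain}
    for _ in range(n):
        result = compose(result, R)
    return result

R = {(1, 2), (1, 3), (2, 3), (3, 1)}
# x R^2 y iff there is a path of exactly 2 edges from x to y:
assert power(R, 2, {1, 2, 3}) == {(1, 1), (1, 3), (2, 1), (3, 2), (3, 3)}
```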
9.2.2
Inverses
Relations also have inverses: xR⁻¹y ↔ yRx. Unlike functions, every relation has an inverse.
9.3
Classifying relations
9.4
Equivalence relations
examples are:
Equality mod m: The relation x = y (mod m) that holds when x
and y have the same remainder when divided by m is an equivalence
relation. This is often written as x ≡_m y.
Equality after applying a function: Let f : A → B be any function,
and define x ∼_f y if f (x) = f (y). Then ∼_f is an equivalence relation.
Note that ≡_m is a special case of this.
Membership in the same block of a partition: Let A be the union of
a collection of sets Ai where the Ai are all disjoint. The set {Ai } is
called a partition of A, and each individual set Ai is called a block
of the partition. Let x y if x and y appear in the same block Ai for
some i. Then is an equivalence relation.
Directed graph isomorphism: Suppose that G = (V, E) and G′ =
(V′, E′) are directed graphs, and there exists a bijection f : V → V′
such that (u, v) is in E if and only if (f (u), f (v)) is in E′. Then G
and G′ are said to be isomorphic (from Greek "same shape"). The
relation G ≅ G′ that holds when G and G′ are isomorphic is easily seen
to be reflexive (let f be the identity function), symmetric (replace f
by f⁻¹), and transitive (compose f : G → G′ and g : G′ → G″); thus it is
an equivalence relation.
Partitioning a plane: draw a curve in a plane (i.e., pick a continuous
function f : [0, 1] → R²). Let x ∼ y if there is a curve from x to y
(i.e., a curve g with g(0) = x and g(1) = y) that doesn't intersect the
first curve. Then x ∼ y is an equivalence relation on points in the
plane excluding the curve itself. Proof: To show x ∼ x, let g be the
constant function g(t) = x. To show x ∼ y implies y ∼ x, consider some
function g demonstrating x ∼ y with g(0) = x and g(1) = y and let
g′(t) = g(1 - t). To show x ∼ y and y ∼ z implies x ∼ z, let g be a
curve from x to y and g′ a curve from y to z, and define a new curve
(g + g′) by (g + g′)(t) = g(2t) when t ≤ 1/2 and (g + g′)(t) = g′(2t - 1)
when t ≥ 1/2.
Any equivalence relation ∼ on a set A gives rise to a set of equivalence
classes, where the equivalence class of an element a is the set of all b such
that a ∼ b. Because of transitivity, the equivalence classes form a partition of
the set A, usually written A/∼ (pronounced "the quotient set of A by ∼",
"A slash ∼", or sometimes "A modulo ∼"). A member of a particular equivalence class is said to be a representative of that class. For example, the
equivalence classes of equality mod m are the sets [i]m = {i + km | k ∈ Z},
with one collection of representatives being {0, 1, 2, 3, . . . , m - 1}. A more
complicated case is the equivalence classes of the plane-partitioning example; here the equivalence classes are essentially the pieces we get after
cutting out the curve f , and any point on a piece can act as a representative
for that piece.
This gives us several equally good ways of showing that a particular
relation is an equivalence relation:
Theorem 9.4.1. Let ∼ be a relation on A. Then each of the following
conditions implies the others:
1. ∼ is reflexive, symmetric, and transitive.
2. There is a partition of A into disjoint equivalence classes Ai such
that x ∼ y if and only if x ∈ Ai and y ∈ Ai for some i.
3. There is a set B and a function f : A → B such that x ∼ y if and
only if f (x) = f (y).
Proof. We do this in three steps:
(1 → 2). For each x ∈ A, let Ax = [x]∼ = {y ∈ A | y ∼ x}, and let
the partition be {Ax | x ∈ A}. (Note that this may produce duplicate
indexes for some sets.) By reflexivity, x ∈ Ax for each x, so A = ⋃x Ax .
To show that distinct equivalence classes are disjoint, suppose that
Ax ∩ Ay ≠ ∅. Then there is some z that is in both Ax and Ay , which
means that z ∼ x and z ∼ y; symmetry reverses these to get x ∼ z
and y ∼ z. If q ∈ Ax , then q ∼ x ∼ z ∼ y, giving q ∈ Ay ; conversely,
if q ∈ Ay , then q ∼ y ∼ z ∼ x, giving q ∈ Ax . It follows that Ax = Ay .
(2 → 3). Let B = A/∼ = {Ax }, where each Ax is defined as above. Let
f (x) = Ax . Then x ∼ y implies x ∈ Ay implies Ax ∩ Ay ≠ ∅. We've
shown above that if this is the case, Ax = Ay , giving f (x) = f (y).
Conversely, if f (x) ≠ f (y), then Ax ≠ Ay , giving Ax ∩ Ay = ∅. In
particular, x ∈ Ax means x ∉ Ay , so x ≁ y.
(3 → 1). Suppose x ∼ y if and only if f (x) = f (y) for some f . Then
f (x) = f (x), so x ∼ x: (∼) is reflexive. If x ∼ y, then f (x) = f (y),
giving f (y) = f (x) and thus y ∼ x: (∼) is symmetric. If x ∼ y ∼ z,
then f (x) = f (y) = f (z), and f (x) = f (z), giving x ∼ z: (∼) is
transitive.
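Condition (3) gives a particularly easy way to compute equivalence classes in code; a sketch (mine):

```python
def quotient(A, f):
    """The quotient set A/~ for the relation x ~ y iff f(x) == f(y)."""
    classes = {}
    for x in A:
        classes.setdefault(f(x), set()).add(x)
    return list(classes.values())

# Equality mod 3 on {0,...,8} yields the three residue classes:
blocks = quotient(range(9), lambda x: x % 3)
assert sorted(sorted(b) for b in blocks) == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```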
9.4.1
Equivalence relations are the way that mathematicians say "I don't care."
If you don't care about which integer you've got except for its remainder
when divided by m, then you define two integers that don't differ in any
way that you care about to be equivalent and work in Z/≡_m . This turns
out to be incredibly useful for defining new kinds of things: for example, we
can define multisets (sets where elements can appear more than once) by
starting with sequences, declaring x ∼ y if there is a permutation of x that
reorders it into y, and then defining a multiset as an equivalence class with
respect to this relation.
This can also be used informally: "I've always thought that broccoli,
spinach, and kale are in the same equivalence class."¹
9.5
Partial orders
A partial order is a relation ≤ that is reflexive, transitive, and antisymmetric. The last means that if x ≤ y and y ≤ x, then x = y. A set S together
with a partial order ≤ is called a partially ordered set or poset. A strict
partial order is a relation < that is irreflexive and transitive (which implies
antisymmetry as well). Any partial order can be converted into a strict
partial order and vice versa by deleting/including the pairs (x, x) for all x.
A total order is a partial order in which any two elements are comparable. This means that, given x and y, either x ≤ y or y ≤ x. A poset
(S, ≤) where ≤ is a total order is called totally ordered. Not all partial
orders are total orders; for an extreme example, the poset (S, =) for any set
S with two or more elements is partially ordered but not totally ordered.
Examples:
(N, ≤) is a poset. It is also totally ordered.
(N, ≥) is also both partially ordered and totally ordered. In general,
if R is a partial order, then R⁻¹ is also a partial order; similarly for
total orders.
The divisibility relation a|b on natural numbers, where a|b if and
only if there is some k in N such that b = ak, is reflexive (let k = 1),
¹ Curious fact: two of these unpopular vegetables are in fact cultivars of the same
species Brassica oleracea of cabbage.
Figure 9.3: The divisors of 12 partially ordered by divisibility, as a digraph and as a Hasse diagram
9.5.1
Since partial orders are relations, we can draw them as directed graphs.
But for many partial orders, this produces a graph with a lot of edges whose
existence is implied by transitivity, and it can be easier to see what is going
on if we leave the extra edges out. If we go further and line the elements
up so that x is lower than y when x < y, we get a Hasse diagram: a
representation of a partially ordered set as a graph where there is an edge
from x to y if x < y and there is no z such that x < z < y.2
Figure 9.3 gives an example of the divisors of 12 partially ordered by
divisibility, represented both as a digraph and as a Hasse diagram. Even in
this small example, the Hasse diagram is much easier to read.
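Computing which edges survive in the Hasse diagram is mechanical; a sketch (mine) for a strict order given as a predicate:

```python
def hasse_edges(elements, less):
    """Covering pairs (x, y) with x < y and no z with x < z < y."""
    return [(x, y) for x in elements for y in elements
            if less(x, y) and not any(less(x, z) and less(z, y)
                                      for z in elements)]

divisors_of_12 = [1, 2, 3, 4, 6, 12]
proper_divides = lambda a, b: a != b and b % a == 0
# Only the covering edges of the divisibility order survive:
assert set(hasse_edges(divisors_of_12, proper_divides)) == {
    (1, 2), (1, 3), (2, 4), (2, 6), (3, 6), (4, 12), (6, 12)}
```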
9.5.2
Comparability
diagram, comparable elements are connected by a path that only goes up.
For example, in Figure 9.3, 3 and 4 are not comparable because the only
paths between them require going both up and down. But 1 and 12 are
both comparable to everything.
9.5.3
Lattices
A lattice is a partial order in which (a) each pair of elements x and y has a
unique greatest lower bound or meet, written x ∧ y, with the property
that (x ∧ y) ≤ x, (x ∧ y) ≤ y, and z ≤ (x ∧ y) for any z with z ≤ x and z ≤ y;
and (b) each pair of elements x and y has a unique least upper bound or
join, written x ∨ y, with the property that (x ∨ y) ≥ x, (x ∨ y) ≥ y, and
z ≥ (x ∨ y) for any z with z ≥ x and z ≥ y.
Examples of lattices are any total order (x ∧ y is min(x, y), x ∨ y is
max(x, y)), the subsets of a fixed set ordered by inclusion (x ∧ y is x ∩ y,
x ∨ y is x ∪ y), and the divisibility relation on the positive integers (x ∧ y
is the greatest common divisor, x ∨ y is the least common multiple; see
Chapter 8). Products of lattices with the product order are also lattices:
(x1 , x2 ) ∧ (y1 , y2 ) = (x1 ∧ y1 , x2 ∧ y2 ) and (x1 , x2 ) ∨ (y1 , y2 ) = (x1 ∨ y1 , x2 ∨ y2 ).³
9.5.4
³ The product of two lattices with lexicographic order is not always a lattice. For
example, consider the lex-ordered product of (P({0, 1}), ⊆) with (N, ≤). For the elements
x = ({0} , 0) and y = ({1} , 0), we have that z is a lower bound on x and y if and only if
z is of the form (∅, k) for some k ∈ N. But there is no greatest lower bound for x and y
because given any particular k we can always choose a bigger k′.
Figure 9.4: Maximal and minimal elements. In the first poset, a is minimal
and a minimum, while b and c are both maximal but not maximums. In the
second poset, d is maximal and a maximum, while e and f are both minimal
but not minimums. In the third poset, g and h are both minimal, i and j
are both maximal, but there are no minimums or maximums.
Here is an example of the difference between a maximal and a maximum
element: consider the family of all subsets of N with at most three elements,
ordered by ⊆. Then {0, 1, 2} is a maximal element of this family (it's not a
subset of any larger set in the family), but it's not a maximum because it's
not a superset of {3}. (The same thing works for any other three-element set.)
See Figure 9.4 for some more examples.
9.5.5
Total orders
If any two elements of a partial order are comparable (that is, if at least one
of x ≤ y or y ≤ x holds for all x and y), then the partial order is a total
order. Total orders include many of the familiar orders on the naturals, the
reals, etc.
Any partial order (S, ≤) can be extended to a total order (generally
more than one, if the partial order is not total itself). This means that we
construct a new relation ≤′ on S that is a superset of ≤ and also totally
ordered. There is a straightforward way to do this when S is finite, called
a topological sort, and a less straightforward way to do this when S is
infinite.
9.5.5.1
Topological sort
Figure 9.5: Topological sort. On the right is a total order extending the
partial order on the left.
sort exist. We won't bother with efficiency, and will just use the basic idea
to show that a total extension of any finite partial order exists.
The simplest version of this algorithm is to find a minimal element, put
it first, and then sort the rest of the elements; this is similar to selection
sort, an algorithm for doing ordinary sorting, where we find the smallest
element of a set and put it first, find the next smallest element and put it
second, and so on. In order for the selection-based version of topological
sort to work, we have to know that there is, in fact, a minimal element.4
Lemma 9.5.1. Every nonempty finite partially-ordered set has a minimal
element.
Proof. Let (S, ) be a nonempty finite partially-ordered set. We will prove
that S contains a minimal element by induction on |S|.
If |S| = 1, then S = {x} for some x; x is the minimal element.
Now consider some S with |S| > 2. Pick some element x ∈ S, and let
T = S \ {x}. Then by the induction hypothesis, T has a minimal element
y, and since |T | ≥ 2, T has at least one other element z ≠ y.
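The selection-based algorithm just described can be sketched as follows (my code, not the text's):

```python
def topological_sort(elements, le):
    """Total order extending the partial order given by the predicate le:
    repeatedly remove a minimal element and append it to the output."""
    remaining = list(elements)
    out = []
    while remaining:
        # a minimal element: nothing else in remaining is strictly below it
        x = next(x for x in remaining
                 if not any(le(y, x) and y != x for y in remaining))
        remaining.remove(x)
        out.append(x)
    return out

divides = lambda a, b: b % a == 0
order = topological_sort([12, 4, 6, 2, 3, 1], divides)
# Every divisor appears before its proper multiples:
assert all(order.index(a) < order.index(b)
           for a in order for b in order if a != b and divides(a, b))
```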
transitivity, and repeat. Unfortunately this process may take infinitely long,
so we have to argue that it converges in the limit to a genuine total order,
using a tool called Zorn's lemma, which itself is a theorem about partial
orders.⁵
9.5.6
Well orders
⁵ If R is not a total order, then there is some pair of elements x and y that are incomparable. Let
T = R ∪ · · ·
Then T is reflexive (because it contains all the pairs (x, x) that are in R), transitive (by
a tedious case analysis), and antisymmetric (by another tedious case analysis), meaning
that it is a partial order that extends R (and thus an element of S) while also being a
proper superset of R. But this contradicts the assumption that R is maximal. So R is in
fact the total order we are looking for.
⁶ Proof: We can prove that any nonempty S ⊆ N has a minimum in a slightly roundabout way by induction. The induction hypothesis (for x) is that if S contains some
element y less than or equal to x, then S has a minimum element. The base case is when
x = 0; here x is the minimum. Suppose now that the claim holds for x. Suppose also that
S contains some element y ≤ x + 1; if not, the induction hypothesis holds vacuously. If
there is some y ≤ x, then S has a minimum by the induction hypothesis. The alternative
is that there is no y in S such that y ≤ x, but there is a y in S with y ≤ x + 1. This y
must be equal to x + 1, and so y is the minimum.
9.6
Closures
Figure 9.6: Reflexive, symmetric, and transitive closures of a relation represented as a directed graph. The original relation {(0, 1), (1, 2)} is on top;
the reflexive, symmetric, and transitive closures are depicted below it.
9.6.1
Examples
Chapter 10
Graphs
A graph is a structure in which pairs of vertices are connected by edges.
Each edge may act like an ordered pair (in a directed graph) or an unordered pair (in an undirected graph). We've already seen directed graphs
as a representation for relations. Most work in graph theory concentrates
instead on undirected graphs.
Because graph theory has been studied for many centuries in many languages, it has accumulated a bewildering variety of terminology, with multiple terms for the same concept (e.g. node for vertex or arc for edge) and
ambiguous definitions of certain terms (e.g., a graph without qualification
might be either a directed or undirected graph, depending on who is using
the term: graph theorists tend to mean undirected graphs, but you can't
always tell without looking at the context). We will try to stick with consistent terminology to the extent that we can. In particular, unless otherwise
specified, a graph will refer to a finite simple undirected graph: an undirected graph with a finite number of vertices, where each edge connects two
distinct vertices (thus no self-loops) and there is at most one edge between
each pair of vertices (no parallel edges).
A reasonably complete glossary of graph theory can be found at
http://en.wikipedia.org/wiki/Glossary_of_graph_theory. See also Ferland [Fer08],
Chapters 8 and 9; Rosen [Ros12] Chapter 10; or Biggs [Big02] Chapter 15
(for undirected graphs) and 18 (for directed graphs).
If you want to get a fuller sense of the scope of graph theory, Reinhard
Diestel's (graduate) textbook Graph Theory [Die10] can be downloaded from
http://diestel-graph-theory.com.
10.1
Types of graphs
10.1.1
Directed graphs
10.1.2
Undirected graphs
10.1.3
Hypergraphs
10.2
Examples of graphs
Such graphs are often labeled with edge lengths, prices, etc. In computer networking, the design of network graphs that permit efficient routing
of data without congestion, roundabout paths, or excessively large routing
10.3
There are some useful standard terms for describing the immediate connections of vertices and edges:
Incidence: a vertex is incident to any edge of which it is an endpoint
(and vice versa).
Adjacency, neighborhood: two vertices are adjacent if they are the
endpoints of some edge. The neighborhood of a vertex v is the set
of all vertices that are adjacent to v.
Degree, in-degree, out-degree: the degree of v counts the number of
edges incident to v. In a directed graph, in-degree counts only incoming edges and out-degree counts only outgoing edges (so that
the degree is always the in-degree plus the out-degree). The degree of
a vertex v is often abbreviated as d(v); in-degree and out-degree are
similarly abbreviated as d (v) and d+ (v), respectively.
10.4
Most graphs have no particular structure, but there are some families of
graphs for which it is convenient to have standard names. Some examples
are:
Complete graph Kn . This has n vertices, and every pair of vertices
has an edge between them. See Figure 10.4.
Figure 10.4: The complete graphs K1 through K10
Figure 10.5: The cycle graphs C3 through C11
Figure 10.6: The path graphs P1 through P4
Figure 10.7: Complete bipartite graph K3,4
Star graphs. These have a single central vertex that is connected to
n outer vertices, and are the same as K1,n . See Figure 10.8.
The cube Qn . This is defined by letting the vertex set consist of all
n-bit strings, and putting an edge between u and u0 if u and u0 differ
in exactly one place. It can also be defined by taking the n-fold square
product of an edge with itself (see §10.6).
Graphs may not always be drawn in a way that makes their structure
obvious. For example, Figure 10.9 shows two different presentations of Q3 ,
neither of which looks much like the other.
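A sketch (mine, not the text's code) generating Qn directly from the n-bit-string definition:

```python
from itertools import product

def cube_graph(n):
    """Vertices: n-bit strings; edges join strings differing in one place."""
    vertices = [''.join(bits) for bits in product('01', repeat=n)]
    edges = [(u, v) for u in vertices for v in vertices
             if u < v and sum(a != b for a, b in zip(u, v)) == 1]
    return vertices, edges

V, E = cube_graph(3)
assert len(V) == 8 and len(E) == 12   # Q3 is the ordinary cube
```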
10.5
Figure 10.8: The star graphs K1,3 through K1,8
Figure 10.9: Two presentations of the cube Q3
Figure 10.10: Examples of subgraphs and minors. Top left is the original
graph. Top right is a subgraph that is not an induced subgraph. Bottom
left is an induced subgraph. Bottom right is a minor.
A minor of a graph H is a graph obtained from H by deleting edges
and/or vertices (as in a subgraph) and contracting edges, where two adjacent vertices u and v are merged together into a single vertex that is adjacent
to all of the previous neighbors of both vertices. Minors are useful for recognizing certain classes of graphs. For example, a graph can be drawn in
the plane without any crossing edges if and only if it doesn't contain K5 or
K3,3 as a minor (this is known as Wagner's theorem).
Figure 10.10 shows some subgraphs and minors of the graph from Figure 10.2.
10.6
Graph products
There are at least five different definitions of the product of two graphs used
by serious graph theorists. In each case the vertex set of the product is the
Cartesian product of the vertex sets, but the different definitions throw in
different sets of edges. Two of them are used most often:
The square product or graph Cartesian product G □ H. An
edge (u, u′)(v, v′) is in G □ H if and only if (a) u = v and u′v′ is
an edge in H, or (b) uv is an edge in G and u′ = v′. It's called the
10.6.1
Functions
10.7
makes sense to talk about this in terms of reachability, or whether you can
get from one vertex to another along some path.
A path of length n in a graph is the image of a homomorphism from
Pn .
In ordinary speech, it's a sequence of n + 1 vertices v0 , v1 , . . . , vn
such that vi vi+1 is an edge in the graph for each i.
A path is simple if the same vertex never appears twice (i.e. if
the homomorphism is injective). If there is a path from u to v,
there is a simple path from u to v obtained by removing cycles
(Lemma 10.9.1).
10.8
Cycles
The standard cycle graph Cn has vertices {0, 1, . . . , n - 1} with an edge from
i to i + 1 for each i and from n - 1 to 0. To avoid degeneracies, n must be
at least 3. A simple cycle of length n in a graph G is an embedding of Cn
in G: this means a sequence of distinct vertices v0 v1 v2 . . . vn-1 , where each
pair vi vi+1 is an edge in G, as well as vn-1 v0 . If we omit the requirement
that the vertices are distinct, but insist on distinct edges instead, we have a
cycle. If we omit both requirements, we get a closed walk; this includes
very non-cyclic-looking walks like the short excursion uvu. We will mostly
worry about cycles.¹ See Figure 10.11.
¹ Some authors, including Ferland [Fer08], reserve cycle for what we are calling a simple
cycle, and use circuit for cycle.
Figure 10.11: Examples of cycles and closed walks. Top left is a graph. Top
right shows the simple cycle 1253 found in this graph. Bottom left shows
the cycle 124523, which is not simple. Bottom right shows the closed walk
12546523.
Unlike paths, which have endpoints, no vertex in a cycle has a special
role.
A graph with no cycles is acyclic. Directed acyclic graphs or DAGs
have the property that their reachability relation is a partial order; this
10.9
Suppose we want to show that all graphs or perhaps all graphs satisfying
certain criteria have some property. How do we do this? In the ideal case,
we can decompose the graph into pieces somehow and use induction on the
number of vertices or the number of edges. If this doesn't work, we may
have to look for some properties of the graph we can exploit to construct an
explicit proof of what we want.
10.9.1
10.9.2
This lemma relates the total degree of a graph to the number of edges. The
intuition is that each edge adds one to the degree of both of its endpoints,
so the total degree of all vertices is twice the number of edges.
Lemma 10.9.3. For any graph G = (V, E),

    Σ_{v∈V} d(v) = 2|E|.
Proof. By induction on m = |E|. If m = 0, every vertex has degree 0 and
both sides are zero. Otherwise, pick some edge st, and let G′ be G with
st removed. Then by the induction hypothesis,

    2(m - 1) = Σ_{v∈V} dG′(v)
             = Σ_{v∈V∖{s,t}} dG(v) + (dG(s) - 1) + (dG(t) - 1)
             = Σ_{v∈V} dG(v) - 2.

So Σ_{v∈V} dG(v) - 2 = 2m - 2, giving Σ_{v∈V} dG(v) = 2m.
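The lemma is easy to check mechanically on small graphs; a sketch (mine):

```python
def degrees(vertices, edges):
    """Degree of each vertex in an undirected simple graph:
    each edge adds one to the degree of both of its endpoints."""
    d = {v: 0 for v in vertices}
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d

# Sanity check on K4 (6 edges, every vertex of degree 3):
V = [0, 1, 2, 3]
E = [(u, v) for u in V for v in V if u < v]
assert sum(degrees(V, E).values()) == 2 * len(E) == 12
```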
10.9.3
Characterizations of trees
A tree is defined to be an acyclic connected graph. There are several equivalent characterizations.
Theorem 10.9.4. A graph is a tree if and only if there is exactly one simple
path between any two distinct vertices.
In fact, no graph with |V | ≤ 2 contains a cycle, but we don't need to use this.
(2) and (3) imply (1). As in the previous case, G contains a vertex
v with d(v) ≤ 1. If d(v) = 1, then G - v is a nonempty graph with
n - 2 edges and n - 1 vertices that is acyclic by Lemma 10.9.5. It is
thus connected by the induction hypothesis, so G is also connected by
Lemma 10.9.5. If d(v) = 0, then G - v has n - 1 edges and n - 1
vertices. From Corollary 10.9.7, G - v contains a cycle, contradicting
(2).
10.9.4
Spanning trees
10.9.5
Eulerian cycles
Let's prove the vertex degree characterization of graphs with Eulerian cycles.
As in the previous proofs, we'll take the approach of looking for something
to pull out of the graph to get a smaller case.
Theorem 10.9.10. Let G be a connected graph. Then G has an Eulerian
cycle if and only if all nodes have even degree.
Proof.
(Only if part). Fix some Eulerian cycle, and orient the edges by the
direction that the cycle traverses them. Then in the resulting directed
graph we must have d⁻(u) = d⁺(u) for all u, since every time we enter
a vertex we have to leave it again. But then d(u) = 2d⁺(u) is even.
(If part). Suppose now that d(u) is even for all u. We will construct an Eulerian cycle on all nodes by induction on |E|. The base case is when |E| = |V| and G = C_{|V|}. For a larger graph, choose some starting node u1, and construct a path u1 u2 . . . by choosing an arbitrary unused edge leaving each ui; this is always possible for ui ≠ u1 since whenever we reach ui we have always consumed an even number of edges on previous visits plus one to get to it this time, leaving at least one remaining edge to leave on. Since there are only finitely many edges and we can only use each one once, eventually we must get stuck, and this must occur with uk = u1 for some k. Now delete
all the edges in u1 . . . uk from G, and consider the connected components of G − (u1 . . . uk). Removing the cycle reduces d(v) by an even number, so within each such connected component the degree of all vertices is even. It follows from the induction hypothesis that each connected component has an Eulerian cycle. We'll now string these per-component cycles together using our original cycle: while traversing u1 . . . uk, when we encounter some component for the first time, we take a detour around the component's cycle. The resulting merged cycle gives an Eulerian cycle for the entire graph.
Why doesn't this work for Hamiltonian cycles? The problem is that in a Hamiltonian cycle we have too many choices: out of the d(u) edges incident to u, we will only use two of them. If we pick the wrong two early on, this may prevent us from ever fitting u into a Hamiltonian cycle. So we would need some stronger property of our graph to get Hamiltonicity.
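The "if" direction of this proof is constructive, and is essentially Hierholzer's algorithm: walk until stuck, then splice the per-component tours into the main cycle. A minimal sketch, with our own adjacency-list representation and example graph:

```python
def eulerian_cycle(adj):
    """Hierholzer's algorithm. adj maps each vertex to a list of neighbors
    (each undirected edge appears in both lists). Assumes the graph is
    connected and every degree is even, as in the theorem."""
    adj = {u: list(vs) for u, vs in adj.items()}  # local copy we can consume
    start = next(iter(adj))
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:                 # unused edge left: keep walking
            v = adj[u].pop()
            adj[v].remove(u)       # consume the edge in both directions
            stack.append(v)
        else:                      # stuck at u: it closes a sub-tour; splice it in
            cycle.append(stack.pop())
    return cycle                   # vertices of an Eulerian cycle, in order

# Two triangles sharing vertex 0: all degrees even (4, 2, 2, 2, 2).
cycle = eulerian_cycle({0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]})
print(cycle[0] == cycle[-1], len(cycle))  # closed tour over all 6 edges: True 7
```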
Chapter 11
Counting
Counting is the process of creating a bijection between a set we want to
count and some set whose size we already know. Typically this second set
will be a finite ordinal [n] = {0, 1, . . . , n − 1}.¹
Counting a set A using a bijection f : A → [n] gives its size |A| = n; this size is called the cardinality of A. As a side effect, it also gives a well-ordering of A, since [n] is well-ordered: we can define x ≤ y for x, y in A by x ≤ y if and only if f(x) ≤ f(y). Often the quickest way to find f
is to line up all the elements of A in a well-ordering and then count them
off: the smallest element of A gets mapped to 0, the next smallest to 1, and
so on. Stripped of the mathematical jargon, this is exactly what you were
taught to do as a small child.
Usually we will not provide an explicit bijection to compute the size
of a set, but instead will rely on standard counting principles based on
how we constructed the set. The branch of mathematics that studies sets
constructed by combining other sets is called combinatorics, and the sub-branch that counts these sets is called enumerative combinatorics. In this chapter, we're going to give an introduction to enumerative combinatorics, but this basically just means counting.
For infinite sets, cardinality is a little more complicated. The basic idea
is that we define |A| = |B| if there is a bijection between them. This gives an
equivalence relation on sets2 , and we define |A| to be the equivalence class
of this equivalence relation that contains A. For the finite case we represent
1
Starting from 0 is traditional in computer science, because it makes indexing easier.
Normal people count to n using {1, 2, . . . , n}.
2
Reflexivity: the identity function is a bijection from A to A. Symmetry: if f : A → B is a bijection, so is f⁻¹ : B → A. Transitivity: if f : A → B and g : B → C are bijections, so is (g ∘ f) : A → C.
11.1 Basic counting techniques
Our goal here is to compute the size of some set of objects, e.g., the number
of subsets of a set of size n, the number of ways to put k cats into n boxes
so that no box gets more than one cat, etc.
In rare cases we can use the definition of the size of a set directly, by constructing a bijection between the set we care about and some canonical set [n]. For example, the set $S_n = \{ x \in \mathbb{N} \mid x < n^2 \wedge \exists y : x = y^2 \}$ has exactly n
constructing an explicit one-to-one correspondence is too time-consuming
or too hard, so instead we will show how to map set-theoretic operations to
arithmetic operations, so that from a set-theoretic construction of a set we
can often directly read off an arithmetic computation that gives the size of
the set.
11.1.1
If we can produce a bijection between a set A whose size we don't know and a set B whose size we do, then we get |A| = |B|. Pretty much all of our proofs of cardinality will end up looking like this.
11.1.2
The claim for general sets is known as the Cantor-Bernstein-Schroeder theorem. One way to prove this is to assume that A and B are disjoint and construct a (not necessarily finite) graph whose vertex set is A ∪ B and that has edges for all pairs (a, f(a)) and (b, g(b)). It can then be shown that the connected components of this graph consist of (1) finite cycles, (2) doubly-infinite paths (i.e., paths with no endpoint in either direction), (3) infinite paths with an initial vertex in A, and (4) infinite paths with an initial vertex in B. For vertexes in all but the last class of components, define h(x) to be f(x) if x is in A and f⁻¹(x) if x is in B. (Note that we are abusing notation slightly here by defining f⁻¹(x)
11.1.3 The sum rule
The sum rule computes the size of A B when A and B are disjoint.
Theorem 11.1.1. If A and B are finite sets with A ∩ B = ∅, then
|A ∪ B| = |A| + |B|.
Proof. Let f : A → [|A|] and g : B → [|B|] be bijections. Define h : A ∪ B → [|A| + |B|] by the rule h(x) = f(x) for x ∈ A, h(x) = |A| + g(x) for x ∈ B.
To show that this is a bijection, define h⁻¹(y) for y in [|A| + |B|] to be f⁻¹(y) if y < |A| and g⁻¹(y − |A|) otherwise. Then for any y in [|A| + |B|], either
1. 0 ≤ y < |A|, y is in the codomain of f (so h⁻¹(y) = f⁻¹(y) ∈ A is well-defined), and h(h⁻¹(y)) = f(f⁻¹(y)) = y; or
2. |A| ≤ y < |A| + |B|. In this case 0 ≤ y − |A| < |B|, putting y − |A| in the codomain of g and giving h(h⁻¹(y)) = g(g⁻¹(y − |A|)) + |A| = y.
So h⁻¹ is in fact an inverse of h, meaning that h is a bijection.
to be the unique y that maps to x when it exists.) For the last class of components, the initial B vertex is not the image of any x under f; so for these we define h(x) to be g(x) if x is in B and g⁻¹(x) if x is in A. This gives the desired bijection h between A and B.
In the case where A and B are not disjoint, we can make them disjoint by replacing them with A′ = {0} × A and B′ = {1} × B. (This is a pretty common trick for enforcing disjoint unions.)
One way to think about this proof is that we are constructing a total
order on A B by putting all the A elements before all the B elements. This
gives a straightforward bijection with [|A| + |B|] by the usual preschool trick
of counting things off in order.
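The bijection h from the proof is concrete enough to run; a small sketch with made-up disjoint sets A and B:

```python
A = ["red", "green", "blue"]   # |A| = 3
B = ["cat", "dog"]             # |B| = 2, disjoint from A

def h(x):
    # A-elements map to [0, |A|); B-elements are shifted up by |A|
    return A.index(x) if x in A else len(A) + B.index(x)

images = sorted(h(x) for x in A + B)
print(images)  # [0, 1, 2, 3, 4], a bijection with [|A| + |B|]
```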
Generalizations: If A₁, A₂, A₃, . . . , A_k are pairwise disjoint (i.e., A_i ∩ A_j = ∅ for all i ≠ j), then
$$\left|\bigcup_{i=1}^{k} A_i\right| = \sum_{i=1}^{k} |A_i|.$$
The sum rule works for infinite sets, too; technically, the sum rule is used to define |A| + |B| as |A ∪ B| when A and B are disjoint. This makes cardinal arithmetic a bit wonky: if at least one of A and B is infinite, then |A| + |B| = max(|A|, |B|), since we can space out the elements of the larger of A and B and shove the elements of the other into the gaps.
11.1.3.2 The Pigeonhole Principle
A consequence of the sum rule is that if A and B are both finite and |A| > |B|, you can't have an injection from A to B. The proof is by contraposition. Suppose f : A → B is an injection. Write A as the union of f⁻¹(x) for each x ∈ B, where f⁻¹(x) is the set of y in A that map to x. Because each f⁻¹(x) is disjoint, the sum rule applies; but because f is an injection there is at most one element in each f⁻¹(x). It follows that $|A| = \sum_{x \in B} |f^{-1}(x)| \le \sum_{x \in B} 1 = |B|$. (Question: Why doesn't this work for infinite sets?)
The Pigeonhole Principle generalizes in an obvious way to functions with larger domains; if f : A → B, then there is some x in B such that |f⁻¹(x)| ≥ |A|/|B|.
11.1.4 Subtraction

$$|A \cup B| = |A| + |B| - |A \cap B|. \tag{11.1.1}$$
Combinatorial proof
follows that |L| = |A| + |B| and |R| = |A ∪ B| + |A ∩ B|. Now define the function f : L → R by the rule
f((0, x)) = (0, x).
f((1, x)) = (1, x) if x ∈ B ∩ A.
f((1, x)) = (0, x) if x ∈ B \ A.
11.1.5 The product rule
The product rule says that Cartesian product maps to arithmetic product. Intuitively, we line up the elements (a, b) of A × B in lexicographic order and count them off. This looks very much like packing a two-dimensional array in a one-dimensional array by mapping each pair of indices (i, j) to i · |B| + j.
Theorem 11.1.3. For any finite sets A and B,
|A × B| = |A| · |B|.
Proof. The trick is to order A × B lexicographically and then count off the elements. Given bijections f : A → [|A|] and g : B → [|B|], define h : (A × B) → [|A| · |B|] by the rule h((a, b)) = f(a) · |B| + g(b). The division algorithm recovers f(a) and g(b) from h((a, b)) by finding the unique natural numbers q and r such that h((a, b)) = q · |B| + r and 0 ≤ r < |B|, and letting a = f⁻¹(q) and b = g⁻¹(r).
The general form is
$$\left|\prod_{i=1}^{k} A_i\right| = \prod_{i=1}^{k} |A_i|,$$
where the product on the left is a Cartesian product and the product on the right is an ordinary integer product.
Examples
As I was going to Saint Ives, I met a man with seven sacks, and every sack had seven cats. How many cats total? Answer: Label the sacks 0, 1, 2, . . . , 6, and label the cats in each sack 0, 1, 2, . . . , 6. Then each cat can be specified uniquely by giving a pair (sack number, cat number), giving a bijection between the set of cats and the set [7] × [7]. Since |[7] × [7]| = 7 · 7 = 49, we have 49 cats.
Dr. Frankenstein's trusty assistant Igor has brought him 6 torsos, 4 brains, 8 pairs of matching arms, and 4 pairs of legs. How many different monsters can Dr. Frankenstein build? Answer: there is a one-to-one correspondence between possible monsters and 4-tuples of the form (torso, brain, pair of arms, pair of legs); the set of such 4-tuples has 6 · 4 · 8 · 4 = 768 members.
How many different ways can you order n items? Call this quantity n! (pronounced "n factorial"). With 0 or 1 items, there is only one way; so we have 0! = 1! = 1. For n > 1, there are n choices for the first item, leaving n − 1 items to be ordered. From the product rule we thus have n! = n · (n − 1)!, which we could also expand out as $\prod_{i=1}^{n} i$.
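All three product-rule counts above can be checked by enumerating the tuples directly; a sketch using Python's itertools (note that 6 · 4 · 8 · 4 = 768):

```python
from itertools import product, permutations

# cats: one per (sack, cat) pair
cats = list(product(range(7), range(7)))
print(len(cats))       # 49

# monsters: one per (torso, brain, arms, legs) 4-tuple
monsters = list(product(range(6), range(4), range(8), range(4)))
print(len(monsters))   # 768

# orderings of n = 4 items
orderings = list(permutations(range(4)))
print(len(orderings))  # 24 = 4!
```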
11.1.5.2

The product rule also works for infinite sets, because we again use it as a definition: for any A and B, |A| · |B| is defined to be |A × B|. One oddity for infinite sets is that this definition gives |A| · |B| = |A| + |B| = max(|A|, |B|), because if at least one of A and B is infinite, it is possible to construct a bijection between A × B and the larger of A and B. Infinite sets are strange.
11.1.6
B
Given
sets A and B, let A be the set of functions f : B A. Then
B
|B|
A = |A| .
If |B| is finite, this is just a |B|-fold application of the product rule: we
can write any function f : B A as a sequence of length |B| that gives
the value in A for each input in B. Since each element of the sequence
contributes |A| possible choices, we get |A||B| choices total.
For infinite sets, the exponent rule is a definition of |A|^{|B|}. Some simple facts are that $n^{\kappa} = 2^{\kappa}$ whenever n is finite and κ is infinite (this comes down to the fact that we can represent any element of [n] as a finite sequence of
Counting injections
$$\prod_{i=n-k+1}^{n} i = \frac{n!}{(n-k)!}$$
11.1.7
$$\binom{n}{k} \stackrel{\text{def}}{=} \frac{n!}{k!\,(n-k)!},$$
where the left-hand side is known as a binomial coefficient and is pronounced "n choose k". We discuss binomial coefficients at length in §11.2.
The secret of why it's called a binomial coefficient will be revealed when we talk about generating functions in §11.3.
Example: Here's a generalization of binomial coefficients: let the multinomial coefficient
$$\binom{n}{n_1\, n_2\, \ldots\, n_k}$$
be the number of different ways to distribute n items among k bins where the i-th bin gets exactly n_i of the items and we don't care what order the items appear in each bin. (Obviously this only makes sense if n₁ + n₂ + · · · + n_k = n.)
Can we find a simple formula for the multinomial coefficient?
Here are two ways to count the number of permutations of the n-element
set:
1. Pick the first element, then the second, etc., to get n! permutations.
2. Generate a permutation in three steps:
(a) Pick a partition of the n elements into groups of size n1 , n2 , . . . nk .
(b) Order the elements of each group.
(c) Paste the groups together into a single ordered list.
There are $\binom{n}{n_1\,n_2\,\ldots\,n_k}$ ways to pick the partition in the first step, and $n_1!\, n_2! \cdots n_k!$ ways to order the elements of the groups in the second. So counting the same set of permutations both ways gives
$$n! = \binom{n}{n_1\,n_2\,\ldots\,n_k} n_1!\, n_2! \cdots n_k!,$$
and solving for the multinomial coefficient gives
$$\binom{n}{n_1\,n_2\,\ldots\,n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}.$$
This also gives another way to derive the formula for a binomial coefficient, since
$$\binom{n}{k} = \binom{n}{k\ (n-k)} = \frac{n!}{k!\,(n-k)!}.$$
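The formula just derived can be coded directly from the double-counting argument; a sketch (the function name is ours):

```python
from math import factorial

def multinomial(ns):
    # n! / (n1! n2! ... nk!), per the double-counting argument above
    n = sum(ns)
    out = factorial(n)
    for ni in ns:
        out //= factorial(ni)  # always divides exactly
    return out

print(multinomial([2, 1, 1]))  # 12 ways to split 4 items into bins of sizes 2, 1, 1
print(multinomial([3, 2]))     # 10 = C(5, 3), the binomial special case
```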
11.1.8
If you're given some strange set to count, look at the structure of its description:
If it's given by a rule of the form "x is in S if either P(x) or Q(x) is true," use the sum rule (if P and Q are mutually exclusive) or inclusion-exclusion. This includes sets given by recursive definitions, e.g., "x is a tree of depth at most k if it is either (a) a single leaf node (provided k > 0) or (b) a root node with two subtrees of depth at most k − 1." The two classes are disjoint so we have T(k) = 1 + T(k − 1)² with T(0) = 0.⁴
For objects made out of many small components or resulting from many small decisions, try to reduce the description of the object to something previously known, e.g.: (a) a word of length k of letters from an alphabet of size n allowing repetition (there are n^k of them, by the product rule); (b) a word of length k not allowing repetition (there are (n)_k of them, or n! if n = k); (c) a subset of k distinct things from a set of size n, where we don't care about the order (there are $\binom{n}{k}$ of them); any subset of a set of n things (there are 2^n of them; this is a special case of (a), where the alphabet encodes non-membership as 0 and membership as 1, and the position in the word specifies the element). Some examples:
The number of games of Tic-Tac-Toe assuming both players keep playing until the board is filled is obtained by observing that each such game can be specified by listing which of the 9 squares are filled in order, giving 9! = 362880 distinct games. Note that we don't have to worry about which of the 9 moves are made by X and which by O, since the rules of the game enforce it. (If we only consider games that end when one player wins, this doesn't work: probably the easiest way to count such games is to send a computer off to generate all of them. This gives 255168 possible games and 958 distinct final positions.)
The number of completely-filled-in Tic-Tac-Toe boards can be obtained by observing that any such board has 5 X's and 4 O's. So there are $\binom{9}{5} = 126$ such positions. (Question: Why would this be smaller than the actual number of final positions?)
⁴ Of course, just setting up a recurrence doesn't mean it's going to be easy to actually solve it.
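Both Tic-Tac-Toe counts in the example are one-liners to verify; a sketch using the standard library:

```python
from math import factorial, comb

print(factorial(9))  # 362880 full games: orders in which the 9 squares fill
print(comb(9, 5))    # 126 filled boards: choose which 5 squares get an X
```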
11.1.9
Suppose you have the numbers {1, 2, . . . , 2n}, and you want to count how many sequences of k of these numbers you can have that are (a) increasing (a[i] < a[i + 1] for all i), (b) nonincreasing (a[i] ≥ a[i + 1] for all i), or (c) made up only of even numbers.
This is the union of three sets A, B, and C, corresponding to the three
cases. The first step is to count each set individually; then we can start
thinking about applying inclusion-exclusion to get the size of the union.
$$\binom{2n}{k} - \binom{n}{k} + \binom{2n}{k} - \binom{n}{k} + n^k = 2\left(\binom{2n}{k} - \binom{n}{k}\right) + n^k$$
It's even easier to assume that A ∩ B = ∅ always, but for k = 1 any sequence is both increasing and nonincreasing, since there are no pairs of adjacent elements in a 1-element sequence to violate the property.
Without looking at the list, can you say which 3 of the 6² = 36 possible length-2 sequences are missing?
11.1.10 Further reading
Rosen [Ros12] does basic counting in Chapter 6 and more advanced counting (including solving recurrences and using generating functions) in Chapter 8.
Biggs [Big02] gives a basic introduction to counting in Chapters 6 and 10,
with more esoteric topics in Chapters 11 and 12. Graham et al. [GKP94]
have quite a bit on counting various things.
Combinatorics largely focuses on counting rather than efficient algorithms for constructing particular combinatorial objects. The book Constructive Combinatorics, by the Dennises Stanton and White [SW86], remedies
this omission, and includes algorithms not only for enumerating all instances
of various classes of combinatorial objects but also for finding the i-th such
instance in an appropriate ordering without having to generate all previous
instances (unranking) and the inverse operation of finding the position of
a particular object in an appropriate ordering (ranking).
11.2 Binomial coefficients
$$\binom{n}{k} = \frac{(n)_k}{k!} = \frac{n!}{k!\,(n-k)!},$$

$$\sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}, \tag{11.2.1}$$

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}. \tag{11.2.2}$$
11.2.1
Recursive definition
If we don't like computing factorials, we can also compute binomial coefficients recursively. This may actually be less efficient for large n (we need to do Θ(n²) additions instead of Θ(n) multiplications and divisions), but the recurrence gives some insight into the structure of binomial coefficients.
Base cases:
If k = 0, then there is exactly one zero-element subset of our n-element set, namely the empty set, and we have $\binom{n}{0} = 1$.
If k > n, then there are no k-element subsets, and we have $\binom{n}{k} = 0$.
Recursive step: We'll use Pascal's identity, which says that
$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}.$$
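The two base cases plus Pascal's identity translate directly into a recursive implementation; a sketch (memoized, since the naive recursion does exponential work):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, k):
    if k == 0:
        return 1          # only the empty subset
    if k > n:
        return 0          # no k-element subsets of an n-element set
    # Pascal's identity: split on whether the last element is included
    return binom(n - 1, k) + binom(n - 1, k - 1)

print([binom(5, k) for k in range(6)])  # [1, 5, 10, 10, 5, 1]
```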
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
Using the binomial theorem plus a little bit of algebra, we can prove Pascal's identity without using a combinatorial argument (this is not necessarily an improvement). The additional fact we need is that if we have two equal series
$$\sum_{k=0}^{\infty} a_k x^k = \sum_{k=0}^{\infty} b_k x^k$$
$$\begin{aligned}
\sum_{k=0}^{n} \binom{n}{k} x^k = (1+x)^n &= (1+x)(1+x)^{n-1} \\
&= (1+x)^{n-1} + x(1+x)^{n-1} \\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} x^k + x \sum_{k=0}^{n-1} \binom{n-1}{k} x^k \\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} x^k + \sum_{k=0}^{n-1} \binom{n-1}{k} x^{k+1} \\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} x^k + \sum_{k=1}^{n} \binom{n-1}{k-1} x^{k} \\
&= \sum_{k=0}^{n} \binom{n-1}{k} x^k + \sum_{k=0}^{n} \binom{n-1}{k-1} x^{k} \\
&= \sum_{k=0}^{n} \left( \binom{n-1}{k} + \binom{n-1}{k-1} \right) x^k,
\end{aligned}$$
and equating the coefficients of x^k on both sides gives
$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$$
as advertised.
11.2.2 Vandermonde's identity

$$\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{r-k} \binom{n}{k}.$$
11.2.2.1 Combinatorial proof
the whole set if we limit ourselves to choosing exactly k from the last n. The identity follows by summing over all possible values of k.
11.2.2.2 Algebraic proof
Here we use the fact that, for any sequences of coefficients {a_i} and {b_i},
$$\left(\sum_{i=0}^{n} a_i x^i\right)\left(\sum_{i=0}^{m} b_i x^i\right) = \sum_{i=0}^{m+n} \left(\sum_{j=0}^{i} a_j b_{i-j}\right) x^i.$$
So now consider
$$\begin{aligned}
\sum_{r=0}^{m+n} \binom{m+n}{r} x^r = (1+x)^{m+n} &= (1+x)^n (1+x)^m \\
&= \left(\sum_{i=0}^{n} \binom{n}{i} x^i\right) \left(\sum_{j=0}^{m} \binom{m}{j} x^j\right) \\
&= \sum_{r=0}^{m+n} \left(\sum_{k=0}^{r} \binom{n}{k} \binom{m}{r-k}\right) x^r.
\end{aligned}$$
11.2.3 Sums of binomial coefficients
What is the sum of all binomial coefficients for a given n? We can show
$$\sum_{k=0}^{n} \binom{n}{k} = 2^n$$
$$\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0 \qquad \text{(assuming $n \ne 0$),}$$
$$\sum_{k=0}^{n} 2^k \binom{n}{k} = 3^n.$$
11.2.4 The inclusion-exclusion principle
$$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{S \subseteq \{1 \ldots n\},\, S \ne \emptyset} (-1)^{|S|+1} \left|\bigcap_{j \in S} A_j\right|. \tag{11.2.3}$$
This rather horrible expression means that to count the elements in the union of n sets A₁ through A_n, we start by adding up all the individual sets |A₁| + |A₂| + · · · + |A_n|, then subtract off the overcount from elements that appear in two sets −|A₁ ∩ A₂| − |A₁ ∩ A₃| − · · ·, then add back the resulting undercount from elements that appear in three sets, and so on.
Why does this work? Consider a single element x that appears in k of the sets. We'll count it as +1 in $\binom{k}{1}$ individual sets, as −1 in $\binom{k}{2}$ pairs, +1 in $\binom{k}{3}$ triples, and so on, adding up to
$$\sum_{i=1}^{k} (-1)^{i+1} \binom{k}{i} = -\sum_{i=1}^{k} (-1)^{i} \binom{k}{i} = -\left(\sum_{i=0}^{k} (-1)^{i} \binom{k}{i} - 1\right) = -(0 - 1) = 1.$$
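Formula (11.2.3) can be checked mechanically on a small set system by summing over every nonempty S; a sketch (the function name and example sets are ours):

```python
from itertools import combinations

def union_size_by_inclusion_exclusion(sets):
    # sum over all nonempty subsets S of indices, with sign (-1)^(|S|+1)
    n = len(sets)
    total = 0
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            inter = set.intersection(*(sets[i] for i in S))
            total += (-1) ** (r + 1) * len(inter)
    return total

A = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(union_size_by_inclusion_exclusion(A), len(set().union(*A)))  # 5 5
```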
11.2.5 Negative binomial coefficients
$$\binom{n}{k} = \frac{(n)_k}{k!} = \prod_{i=n-k+1}^{n} i \Big/ k!.$$
$$\binom{-1}{k} = \frac{(-1)_k}{k!} = \prod_{i=-k}^{-1} i \Big/ k! = \prod_{i=-k}^{-1} i \Big/ \prod_{i=1}^{k} i = (-1)^k.$$
$$\frac{1}{1-z} = (1-z)^{-1} = \sum_{n=0}^{\infty} \binom{-1}{n} (-z)^n 1^{-1-n} = \sum_{n=0}^{\infty} (-1)^n (-z)^n = \sum_{n=0}^{\infty} z^n.$$
$$\frac{1}{1-z} = (1-z)^{-1} = \sum_{n=0}^{\infty} \binom{-1}{n} 1^n (-z)^{-1-n} = -\frac{1}{z} \sum_{n=0}^{\infty} \frac{1}{z^n}.$$
This turns out to actually be correct, since applying the geometric series
formula turns the last line into
$$-\frac{1}{z} \cdot \frac{1}{1 - 1/z} = \frac{-1}{z - 1} = \frac{1}{1-z},$$
but it's a lot less useful.
What happens for a larger upper index? One way to think about (−n)_k is that we are really computing (n + k − 1)_k and then negating all the factors (which corresponds to multiplying the whole expression by (−1)^k). So this gives us the identity
$$\binom{-n}{k} = (-1)^k \binom{n+k-1}{k}.$$
$$\frac{1}{(1-z)^2} = (1-z)^{-2} = \sum_{n} \binom{-2}{n} 1^{-2-n} (-z)^n = \sum_{n} (-1)^n \binom{n+1}{n} (-z)^n = \sum_{n} (n+1) z^n.$$
11.2.6 Fractional binomial coefficients
Yes, we can do fractional binomial coefficients, too. Exercise: Find the value of
$$\binom{1/2}{n} = \frac{(1/2)_n}{n!}.$$
Like negative binomial coefficients, these don't have an obvious combinatorial interpretation, but can be handy for computing power series of fractional binomial powers like $\sqrt{1+z} = (1+z)^{1/2}$.
11.2.7 Further reading
11.3
Generating functions
11.3.1
Basics
A simple example
We are given some initial prefixes for words: qu, s, and t; some vowels to
put in the middle: a, i, and oi; and some suffixes: d, ff, and ck, and we
want to calculate the number of words we can build of each length.
We are using "word" in the combinatorial sense of a finite sequence of letters (possibly even the empty sequence) and not the usual sense of a finite, nonempty sequence of letters that actually make sense.
Formal definition
$$F(z) = \sum_{i=0}^{\infty} a_i z^i.$$
In some cases, the sum has a more compact representation. For example, we have
$$\frac{1}{1-z} = \sum_{i=0}^{\infty} z^i,$$
so 1/(1 − z) is the generating function for the sequence 1, 1, 1, . . . . This may let us manipulate this sequence conveniently by manipulating the generating function.
Here's a simple case. If F(z) generates some sequence a_i, what sequence b_i does F(2z) generate? The i-th term in the expansion of F(2z) will be a_i (2z)^i = a_i 2^i z^i, so we have b_i = 2^i a_i. This means that the sequence 1, 2, 4, 8, 16, . . . has generating function 1/(1 − 2z). In general, if F(z) represents a_i, then F(cz) represents c^i a_i.
What else can we do to F? One useful operation is to take its derivative with respect to z. We then have
$$\frac{d}{dz} F(z) = \sum_{i=0}^{\infty} a_i \frac{d}{dz} z^i = \sum_{i=0}^{\infty} a_i i z^{i-1}.$$
This almost gets us the representation for the series i·a_i, but the exponents on the z's are off by one. But that's easily fixed:
$$z \frac{d}{dz} F(z) = z \sum_{i=0}^{\infty} a_i i z^{i-1} = \sum_{i=0}^{\infty} a_i i z^{i}.$$
$$z \frac{d}{dz} \frac{1}{1-z} = \frac{z}{(1-z)^2},$$

$$z \frac{d}{dz} \frac{z}{(1-z)^2} = \frac{z}{(1-z)^2} + \frac{2z^2}{(1-z)^3}.$$
As you can see, some generating functions are prettier than others.
(We can also use integration to divide each term by i, but the details are
messier.)
Another way to get the sequence 0, 1, 2, 3, 4, . . . is to observe that it
satisfies the recurrence:
a0 = 0.
$$\left(\sum_{n=0}^{\infty} a_n z^n - a_0\right) \Big/ z = \left(\sum_{n=1}^{\infty} a_n z^n\right) \Big/ z = \sum_{n=1}^{\infty} a_n z^{n-1} = \sum_{n=0}^{\infty} a_{n+1} z^n.$$
pad each weight-k object out to weight n in exactly one way using n − k junk objects, i.e., multiply F(z) by 1/(1 − z).
11.3.2 Some standard generating functions
$$\frac{1}{1-z} = \sum_{i=0}^{\infty} z^i$$
$$\frac{z}{(1-z)^2} = \sum_{i=0}^{\infty} i z^i$$
$$(1+z)^n = \sum_{i=0}^{n} \binom{n}{i} z^i$$
$$\frac{1}{(1-z)^n} = \sum_{i=0}^{\infty} \binom{n+i-1}{i} z^i$$
Of these, the first is the most useful to remember (its also handy for
remembering how to sum geometric series). All of these equations can be
proven using the binomial theorem.
11.3.3
$$F(z) G(z) = \sum_{i=0}^{\infty} \left(\sum_{j=0}^{i} a_j b_{i-j}\right) z^i.$$
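This coefficient formula is ordinary polynomial multiplication (convolution), which is easy to state in code; a sketch:

```python
def gf_product(a, b):
    # c[i] = sum_{j <= i} a[j] * b[i-j], the convolution formula above
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

# (1 + x)^2 * (1 + x) = (1 + x)^3: coefficients 1, 3, 3, 1
print(gf_product([1, 2, 1], [1, 1]))  # [1, 3, 3, 1]
```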
$$(1 + x)^n = \sum_{i=0}^{n} \binom{n}{i} x^i.$$
11.3.4
The product formula above suggests that generating functions can be used to count combinatorial objects that are built up out of other objects, where our goal is to count the number of objects of each possible non-negative integer weight (we put "weight" in scare quotes because we can make the weight be any property of the object we like, as long as it's a non-negative integer; a typical choice might be the size of a set, as in the binomial theorem example above). There are five basic operations involved in this process; we've seen two of them already, but will restate them here with the others.
Throughout this section, we assume that F (z) is the generating function
counting objects in some set A and G(z) the generating function counting
objects in some set B.
11.3.4.1
Disjoint union
11.3.4.2 Cartesian product
Now let C = A × B, and let the weight of a pair (a, b) ∈ C be the sum of the weights of a and b. Then the generating function for objects in C is F(z)G(z).
Example: Let A be all-x strings and B be all-y or all-z strings, as in the previous example. Let C be the set of all strings that consist of zero or more x's followed by zero or more y's and/or z's. Then the generating function for C is $F(z)G(z) = \frac{1}{(1-z)(1-2z)}$.
11.3.4.3
Repetition
Now let C consist of all finite sequences of objects in A, with the weight of each sequence equal to the sum of the weights of its elements (0 for an empty sequence). Let H(z) be the generating function for C. From the preceding rules we have
$$H = 1 + F + F^2 + F^3 + \cdots = \frac{1}{1-F}.$$
This works best when F(0) = 0; otherwise we get infinitely many weight-0 sequences. It's also worth noting that this is just a special case of substitution (see below), where our outer generating function is 1/(1 − z).
Example: (0|11)* Let A = {0, 11}, and let C be the set of all sequences of zeros and ones where ones occur only in even-length runs. Then the generating function for A is z + z² and the generating function for C is 1/(1 − z − z²). We can extract exact coefficients from this generating function using the techniques below.
Example: sequences of positive integers Suppose we want to know
how many different ways there are to generate a particular integer as a sum
of positive integers. For example, we can express 4 as 4, 3+1, 2+2, 2+1+1,
1 + 1 + 1 + 1, 1 + 1 + 2, 1 + 2 + 1, or 1 + 3, giving 8 different ways.
We can solve this problem using the repetition rule. Let F = z/(1 − z)
$$H = \frac{1}{1 - \frac{z}{1-z}} = \frac{1-z}{1-2z} = \frac{1}{1-2z} - \frac{z}{1-2z} = \sum_{n=0}^{\infty} 2^n z^n - \sum_{n=0}^{\infty} 2^n z^{n+1} = 1 + \sum_{n=1}^{\infty} \left(2^n - 2^{n-1}\right) z^n = 1 + \sum_{n=1}^{\infty} 2^{n-1} z^n.$$
This means that there is 1 way to express 0 (the empty sum), and 2^{n−1} ways to express any larger value n (e.g., 2^{4−1} = 8 ways to express 4).
Once we know what the right answer is, it's not terribly hard to come up with a combinatorial explanation. The quantity 2^{n−1} counts the number of subsets of an (n − 1)-element set. So imagine that we have n − 1 places and we mark some subset of them, plus add an extra mark at the end; this might give us a pattern like XX-X. Now for each sequence of places ending with a mark we replace it with the number of places (e.g., XX-X = 1, 1, 2, X--X-X---X = 1, 3, 2, 4). Then the sum of the numbers we get is equal to n, because it's just counting the total length of the sequence by dividing it up at the marks and then adding the pieces back together. The value 0 doesn't fit this pattern (we can't put in the extra mark without getting a sequence of length at least 1), so we have 0 as a special case again.
If we are very clever, we might come up with this combinatorial explanation from the beginning. But the generating function approach saves us
from having to be clever.
Pointing

$$\frac{d^n}{dz^n} F(z),$$
Substitution
Suppose that the way to make a C-thing is to take a weight-k A-thing and attach to each of its k items a B-thing, where the weight of the new C-thing is the sum of the weights of the B-things. Then the generating function for C is the composition F(G(z)).
Why this works: Suppose we just want to compute the number of C-things of each weight that are made from some single specific weight-k A-thing. Then the generating function for this quantity is just (G(z))^k. If we expand our horizons to include all a_k weight-k A-things, we have to multiply
$$\sum_{k=0}^{\infty} a_k (G(z))^k.$$
But this is just what we get if we start with F (z) and substitute G(z)
for each occurrence of z, i.e. if we compute F (G(z)).
Example: bit-strings with primes Suppose we let A be all sequences of zeros and ones, with generating function F(z) = 1/(1 − 2z). Now suppose we can attach a single or double prime to each 0 or 1, giving 0′ or 0″ or 1′ or 1″, and we want a generating function for the number of distinct primed bit-strings with n attached primes. The set {′, ″} has generating function G(z) = z + z², so the composite set has generating function F(G(z)) = 1/(1 − 2(z + z²)) = 1/(1 − 2z − 2z²).
Example: (0|11)* again The previous example is a bit contrived. Here's one that's a little more practical, although it involves a brief digression into multivariate generating functions. A multivariate generating function F(x, y) generates a series $\sum_{ij} a_{ij} x^i y^j$, where a_ij counts the number of things that have i x's and j y's. (There is also the obvious generalization to more than two variables.) Consider the multivariate generating function for the set {0, 1}, where x counts zeros and y counts ones: this is just x + y. The multivariate generating function for sequences of zeros and ones is 1/(1 − x − y) by the repetition rule. Now suppose that each 0 is left intact but each 1 is replaced by 11, and we want to count the total number of strings by length, using z as our series variable. So we substitute z for x and z² for y (since each y turns into a string of length 2), giving 1/(1 − z − z²). This gives another way to get the generating function for strings built by repeating 0 and 11.
11.3.5
11.3.6 Recovering coefficients from generating functions
There are basically three ways to recover coefficients from generating functions:
1. Recognize the generating function from a table of known generating functions, or as a simple combination of such known generating functions. This doesn't work very often but it is possible to get lucky.
2. To find the k-th coefficient of F(z), compute the k-th derivative $\frac{d^k}{dz^k} F(z)$ and divide by k! to shift a_k to the z⁰ term. Then substitute 0 for z. For example, if F(z) = 1/(1 − z) then a₀ = 1 (no differentiating), a₁ = 1/(1 − 0)² = 1, a₂ = 1/(1 − 0)³ = 1, etc. This usually only works if the derivatives have a particularly nice form or if you only care about the first couple of coefficients (it's particularly effective if you only want a₀).
3. If the generating function is of the form 1/Q(z), where Q is a polynomial with Q(0) ≠ 0, then it is generally possible to expand the generating function out as a sum of terms of the form P_c/(1 − z/c), where c is a root of Q (i.e., a value such that Q(c) = 0). Each numerator P_c will be a constant if c is not a repeated root; if c is a repeated root, then P_c can be a polynomial of degree up to one less than the multiplicity of c. We like these expanded solutions because we recognize $1/(1 - z/c) = \sum_i c^{-i} z^i$, and so we can read off the coefficients a_i generated by 1/Q(z) as an appropriately weighted sum of $c_1^{-i}$, $c_2^{-i}$, etc., where the c_j range over the roots of Q.
Example: Take the generating function G = 1/(1 − z − z²). We can simplify it by factoring the denominator: 1 − z − z² = (1 − az)(1 − bz), where 1/a and 1/b are the solutions to the equation 1 − z − z² = 0; in this case a = (1 + √5)/2, which is approximately 1.618, and b = (1 − √5)/2, which is approximately −0.618. It happens to be the case that we can always expand 1/P(z) as A/(1 − az) + B/(1 − bz) for some constants A and B whenever P is a degree 2 polynomial with constant coefficient 1 and distinct roots a and b, so
$$G = \frac{A}{1 - az} + \frac{B}{1 - bz},$$
and here we can recognize the right-hand side as the sum of the generating functions for the sequences A · aⁱ and B · bⁱ. The A · aⁱ term dominates, so we have that T(n) = Θ(aⁿ), where a is approximately 1.618. We can also solve for A and B exactly to find an exact solution if desired.
A rule of thumb that applies to recurrences of the form T(n) = a₁T(n − 1) + a₂T(n − 2) + · · · + a_k T(n − k) + f(n) is that unless f is particularly large, the solution is usually exponential in 1/x, where x is the smallest root of the polynomial 1 − a₁z − a₂z² − · · · − a_k z^k. This can be used to get very quick estimates of the solutions to such recurrences (which can then be proved without fooling around with generating functions).
Exercise: What is the exact solution if T(n) = T(n − 1) + T(n − 2) + 1? Or if T(n) = T(n − 1) + T(n − 2) + n?
There is a nice trick for finding the numerators in a partial fraction expansion. Suppose we have
$$\frac{1}{(1-az)(1-bz)} = \frac{A}{1-az} + \frac{B}{1-bz}.$$
Multiply both sides by 1 − az to get
$$\frac{1}{1-bz} = A + \frac{B(1-az)}{1-bz}.$$
Now plug in z = 1/a to get
$$\frac{1}{1-b/a} = A + 0.$$
We can immediately read off A. Similarly, multiplying by 1 bz and
then setting 1 bz to zero gets B. The method is known as the cover-up
method because multiplication by 1 az can be simulated by covering up
1 az in the denominator of the left-hand side and all the terms that dont
have 1 az in the denominator in the right hand side.
The cover-up method will work in general whenever there are no repeated roots, even if there are many of them; the idea is that setting 1 - qz to zero knocks out all the terms on the right-hand side but one. With repeated roots we have to worry about getting numerators that aren't just a constant, so things get more complicated. We'll come back to this case below.
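A concrete instance of the cover-up method can be spot-checked numerically. Below (my addition, not from the notes) take a = 2 and b = -1; covering up gives A = 1/(1 - b/a) = 2/3 for the 1 - 2z term and B = 1/(1 - a/b) = 1/3 for the 1 + z term:

```python
# Numeric spot-check of a cover-up partial fraction expansion:
#   1/((1 - 2z)(1 + z)) = (2/3)/(1 - 2z) + (1/3)/(1 + z)

def lhs(z):
    return 1 / ((1 - 2*z) * (1 + z))

def rhs(z):
    return (2/3) / (1 - 2*z) + (1/3) / (1 + z)

for z in [0.0, 0.1, -0.2, 0.3]:
    assert abs(lhs(z) - rhs(z)) < 1e-12
print("expansion checks out at sample points")
```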
Example: A simple recurrence Suppose f(0) = 0, f(1) = 1, and for n ≥ 2, f(n) = f(n-1) + 2f(n-2). Multiplying these equations by z^n and summing over all n gives a generating function

    F(z) = Σ_{n=0}^∞ f(n) z^n = 0·z^0 + 1·z^1 + Σ_{n=2}^∞ f(n-1) z^n + Σ_{n=2}^∞ 2 f(n-2) z^n.
With a bit of tweaking, we can get rid of the sums on the RHS by converting them into copies of F:

    F(z) = z + Σ_{n=2}^∞ f(n-1) z^n + 2 Σ_{n=2}^∞ f(n-2) z^n
         = z + Σ_{n=1}^∞ f(n) z^{n+1} + 2 Σ_{n=0}^∞ f(n) z^{n+2}
         = z + z Σ_{n=1}^∞ f(n) z^n + 2z^2 Σ_{n=0}^∞ f(n) z^n
         = z + z F(z) + 2z^2 F(z),

using f(0) = 0 in the last step. Now solve for F(z) to get

    F(z) = z/(1 - z - 2z^2) = z/((1 + z)(1 - 2z)) = z (A/(1 + z) + B/(1 - 2z)),

where we need to solve for A and B.
We can do this directly, or we can use the cover-up method. The cover-up method is easier. Setting z = -1 and covering up 1 + z gives A = 1/(1 - 2(-1)) = 1/3. Setting z = 1/2 and covering up 1 - 2z gives B = 1/(1 + z) = 1/(1 + 1/2) = 2/3. So we have
    F(z) = (1/3) z/(1 + z) + (2/3) z/(1 - 2z)
         = Σ_{n=0}^∞ ((-1)^n / 3) z^{n+1} + Σ_{n=0}^∞ (2·2^n / 3) z^{n+1}
         = Σ_{n=1}^∞ ((-1)^{n-1} / 3) z^n + Σ_{n=1}^∞ (2^n / 3) z^n
         = Σ_{n=1}^∞ ((2^n - (-1)^n)/3) z^n,

giving f(n) = (2^n - (-1)^n)/3 (which, as it happens, is also correct for n = 0).
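A quick mechanical check of this closed form against the recurrence (my addition, not in the notes):

```python
# Check f(n) = (2**n - (-1)**n)/3 against f(n) = f(n-1) + 2*f(n-2),
# f(0) = 0, f(1) = 1.

def f_rec(n):
    a, b = 0, 1  # a = f(0), b = f(1)
    for _ in range(n):
        a, b = b, b + 2*a
    return a

def f_closed(n):
    return (2**n - (-1)**n) // 3  # always an exact integer

for n in range(20):
    assert f_rec(n) == f_closed(n)
print([f_closed(n) for n in range(6)])  # [0, 1, 1, 3, 5, 11]
```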
Since each part can be chosen independently of the other two, the generating function for all three parts together is just the product:

    1/((1 - z)(1 - 2z)(1 - 3z)).
Let's use the cover-up method to convert this to a sum of partial fractions. We have

    1/((1 - z)(1 - 2z)(1 - 3z))
        = (1/((1 - 2)(1 - 3))) / (1 - z) + (1/((1 - 1/2)(1 - 3/2))) / (1 - 2z) + (1/((1 - 1/3)(1 - 2/3))) / (1 - 3z)
        = (1/2)/(1 - z) - 4/(1 - 2z) + (9/2)/(1 - 3z).
    Formula                    Strings
    1/2 - 4 + 9/2 = 1          ()
    1/2 - 8 + 27/2 = 6         M, O, U, G, H, K
    1/2 - 16 + 81/2 = 25       MM, MO, MU, MG, MH, MK, OO, OU, OG, OH, OK, UO, UU, UG, UH
    1/2 - 32 + 243/2 = 90      (exercise)
Here F is the generating function for the recurrence T(0) = 0, T(1) = 1, and T(n) = 4T(n-1) + 12T(n-2) + 1 for n ≥ 2. Summing over n gives

    F = Σ_{n=0}^∞ T(n) z^n = z + Σ_{n=2}^∞ z^n + 4 Σ_{n=2}^∞ T(n-1) z^n + 12 Σ_{n=2}^∞ T(n-2) z^n
      = z + z^2/(1 - z) + 4z Σ_{n=1}^∞ T(n) z^n + 12z^2 Σ_{n=0}^∞ T(n) z^n
      = z + z^2/(1 - z) + 4zF + 12z^2 F,

so

    F = (z + z^2/(1 - z)) / (1 - 4z - 12z^2).
Factoring the denominator as 1 - 4z - 12z^2 = (1 + 2z)(1 - 6z) gives

    F = (z + z^2/(1 - z)) / ((1 + 2z)(1 - 6z))
      = z/((1 + 2z)(1 - 6z)) + z^2/((1 - z)(1 + 2z)(1 - 6z)).

Applying the cover-up method to the first term:

    z/((1 + 2z)(1 - 6z)) = z [ (1/(1 - 6·(-1/2)))/(1 + 2z) + (1/(1 + 2/6))/(1 - 6z) ]
                         = ((1/4) z)/(1 + 2z) + ((3/4) z)/(1 - 6z),

and to the second:

    z^2/((1 - z)(1 + 2z)(1 - 6z))
        = (-(1/15) z^2)/(1 - z) + ((1/6) z^2)/(1 + 2z) + ((9/10) z^2)/(1 - 6z),

since 1/((1 + 2)(1 - 6)) = -1/15, 1/((1 + 1/2)(1 + 3)) = 1/6, and 1/((1 - 1/6)(1 + 1/3)) = 9/10.

From this we can immediately read off the value of T(n) for n ≥ 2:

    T(n) = (1/4)(-2)^{n-1} + (3/4) 6^{n-1} - 1/15 + (1/6)(-2)^{n-2} + (9/10) 6^{n-2}
         = -(1/8)(-2)^n + (1/8) 6^n - 1/15 + (1/24)(-2)^n + (1/40) 6^n
         = (3/20) 6^n - (1/12)(-2)^n - 1/15.
Let's check this against the solutions we get from the recurrence itself:

    n    T(n)
    0    0
    1    1
    2    1 + 4·1 + 12·0 = 5
    3    1 + 4·5 + 12·1 = 33
    4    1 + 4·33 + 12·5 = 193

We'll try n = 3, and get T(3) = (3/20)·216 + 8/12 - 1/15 = (3·3·216 + 40 - 4)/60 = (1944 + 40 - 4)/60 = 1980/60 = 33.

To be extra safe, let's try T(2) = (3/20)·36 - 4/12 - 1/15 = (3·3·36 - 20 - 4)/60 = (324 - 20 - 4)/60 = 300/60 = 5. This looks good too.
The moral of this exercise? Generating functions can solve ugly-looking recurrences exactly, but you have to be very, very careful in doing the math.
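Being very careful is exactly what exact rational arithmetic is for. This sketch (my addition) redoes the hand checks above with Fractions, comparing the closed form against the recurrence for many values of n:

```python
# Exact check of T(n) = (3/20)*6**n - (1/12)*(-2)**n - 1/15 against
# T(n) = 1 + 4*T(n-1) + 12*T(n-2), T(0) = 0, T(1) = 1.
from fractions import Fraction

def T_rec(n):
    a, b = 0, 1  # a = T(0), b = T(1)
    for _ in range(n):
        a, b = b, 1 + 4*b + 12*a
    return a

def T_closed(n):
    return (Fraction(3, 20) * 6**n
            - Fraction(1, 12) * (-2)**n
            - Fraction(1, 15))

for n in range(12):
    assert T_closed(n) == T_rec(n)
print([T_rec(n) for n in range(5)])  # [0, 1, 5, 33, 193]
```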
This is the recurrence a_n = 2a_{n-1} + n; the first few values are

    a_0
    a_1 = 2a_0 + 1
    a_2 = 2a_1 + 2 = 4a_0 + 2 + 2 = 4a_0 + 4
    a_3 = 2a_2 + 3 = 8a_0 + 8 + 3 = 8a_0 + 11
    a_4 = 2a_3 + 4 = 16a_0 + 22 + 4 = 16a_0 + 26.

Summing the recurrence over n ≥ 1 and accounting for the a_0 term gives

    Σ_{n=0}^∞ a_n z^n = 2 Σ_{n=1}^∞ a_{n-1} z^n + Σ_{n=1}^∞ n z^n + a_0,

or

    F = 2zF + z/(1 - z)^2 + a_0.

So

    (1 - 2z)F = z/(1 - z)^2 + a_0,
    F = z/((1 - z)^2 (1 - 2z)) + a_0/(1 - 2z).
Observe that the right-hand term gives us exactly the 2^n a_0 terms we expected, since 1/(1 - 2z) generates the sequence 2^n. But what about the left-hand term? Here we need to apply a partial-fraction expansion, which is simplified because we already know how to factor the denominator but is complicated because there is a repeated root.

We can now proceed in one of two ways: we can solve directly for the partial fraction expansion, or we can use an extended version of Heaviside's cover-up method that handles repeated roots using differentiation. We'll start with the direct method.
Solving for the PFE directly Write

    1/((1 - z)^2 (1 - 2z)) = A/(1 - z)^2 + B/(1 - 2z).

We expect B to be a constant and A to be of the form A_1 z + A_0.
Clearing denominators gives 1 = A(1 - 2z) + B(1 - z)^2; setting z = 1/2 gives B = 4, and matching the remaining coefficients gives A = 2z - 3. So

    F = (2z - 3) z/(1 - z)^2 + 4z/(1 - 2z) + a_0/(1 - 2z),

and reading off the coefficient of z^n for large n gives

    a_n = 2(n - 1) - 3n + 4·2^{n-1} + a_0 2^n = 2^{n+1} - n - 2 + a_0 2^n.
The reason for the large n caveat is that z^2/(1 - z)^2 doesn't generate precisely the sequence x_n = n - 1, since it takes on the values 0, 0, 1, 2, 3, 4, . . . instead of -1, 0, 1, 2, 3, 4, . . . . Similarly, the power series for z/(1 - 2z) does not have the coefficient 2^{n-1} = 1/2 when n = 0. Miraculously, in this particular example the formula works for n = 0, even though it shouldn't: 2(n - 1) is -2 instead of 0, but 4·2^{n-1} is 2 instead of 0, and the two errors cancel each other out.
Solving for the PFE using the extended cover-up method It is also possible to extend the cover-up method to handle repeated roots. Here we choose a slightly different form of the partial fraction expansion:

    1/((1 - z)^2 (1 - 2z)) = A/(1 - z)^2 + B/(1 - z) + C/(1 - 2z).

Here A, B, and C are all constants. We can get A and C by the cover-up method, where for A we multiply both sides by (1 - z)^2 before setting z = 1; this gives A = 1/(1 - 2) = -1 and C = 1/(1 - 1/2)^2 = 4. For B, if we multiply both sides by (1 - z) we are left with A/(1 - z) on the right-hand side and a (1 - z) in the denominator on the left-hand side. Clearly setting z = 1 in this case will not help us.
The solution is to first multiply by (1 - z)^2 as before but then take a derivative:

    1/((1 - z)^2 (1 - 2z)) = A/(1 - z)^2 + B/(1 - z) + C/(1 - 2z)
    1/(1 - 2z) = A + B(1 - z) + C(1 - z)^2/(1 - 2z)
    (d/dz)[1/(1 - 2z)] = (d/dz)[A + B(1 - z) + C(1 - z)^2/(1 - 2z)]
    2/(1 - 2z)^2 = -B - 2C(1 - z)/(1 - 2z) + 2C(1 - z)^2/(1 - 2z)^2.

Now if we set z = 1, every term on the right-hand side except -B becomes 0, and we get -B = 2/(1 - 2)^2, or B = -2.
Plugging A, B, and C into our original formula gives

    1/((1 - z)^2 (1 - 2z)) = -1/(1 - z)^2 - 2/(1 - z) + 4/(1 - 2z),

and thus

    F = z/((1 - z)^2 (1 - 2z)) + a_0/(1 - 2z)
      = z [ -1/(1 - z)^2 - 2/(1 - z) + 4/(1 - 2z) ] + a_0/(1 - 2z).
11.3.7 Asymptotic estimates
More examples:

    F(z)                                  Smallest pole              Asymptotic value
    1/(1 - z)                             1                          Θ(1)
    1/(1 - z)^2                           1, multiplicity 2          Θ(n)
    1/(1 - z - z^2)                       (√5 - 1)/2 = 2/(1 + √5)    Θ(((1 + √5)/2)^n)
    1/((1 - z)(1 - 2z)(1 - 3z))           1/3                        Θ(3^n)
    (z + z^2/(1 - z))/(1 - 4z - 12z^2)    1/6                        Θ(6^n)
    1/((1 - z)^2 (1 - 2z))                1/2                        Θ(2^n)
11.3.8
11.3.8.1
Example
Let's derive the formula for 1 + 2 + ··· + n. We'll start with the generating function for the series Σ_{i=0}^n z^i, which is (1 - z^{n+1})/(1 - z). Applying the z d/dz method gives us

    Σ_{i=0}^n i z^i = z (d/dz) [(1 - z^{n+1})/(1 - z)]
                    = z [ (1 - z^{n+1})/(1 - z)^2 - (n + 1) z^n/(1 - z) ].

Taking the limit as z goes to 1 gives n(n + 1)/2,^8

^8 The justification for doing this is that we know that a finite sequence really has a finite sum, so the singularity appearing at z = 1 in e.g. (1 - z^{n+1})/(1 - z) is an artifact of the generating-function representation rather than the original series; it's a removable singularity that can be replaced by the limit of f(x)/g(x) as x → c.
which is our usual formula. Gauss's childhood proof is a lot quicker, but with the generating-function proof we could in principle automate most of the work using a computer algebra system, and it doesn't require much creativity or intelligence. So it might be the weapon of choice for nastier problems where no clever proof comes to mind.
More examples of this technique can be found in §11.2, where the binomial theorem applied to (1 + x)^n (which is really just a generating function for Σ_i (n choose i) z^i) is used to add up various sums of binomial coefficients.
11.3.9
Let's suppose we want to count binary trees with n internal nodes. We can obtain such a tree either by (a) choosing an empty tree (g.f.: z^0 = 1); or (b) choosing a root with weight 1 (g.f.: 1·z^1 = z, since we can choose it in exactly one way), and two subtrees (g.f. = F^2, where F is the g.f. for trees). This gives us a recursive definition

    F = 1 + zF^2.

Solving for F using the quadratic formula gives

    F = (1 ± √(1 - 4z)) / (2z).

That 2z in the denominator may cause us trouble later, but let's worry about that when the time comes. First we need to figure out how to extract coefficients from the square root term.
Here

    √(1 - 4z) = (1 - 4z)^{1/2} = Σ_{n=0}^∞ (1/2 choose n) (-4z)^n.

For n ≥ 1, we can rewrite the (1/2 choose n) terms as

    (1/2 choose n) = (1/n!) Π_{k=0}^{n-1} (1/2 - k)
                   = (1/n!) Π_{k=0}^{n-1} (1 - 2k)/2
                   = ((-1)^{n-1} / (2^n n!)) Π_{k=1}^{n-1} (2k - 1)
                   = ((-1)^{n-1} / (2^n n!)) · (Π_{k=1}^{2n-2} k) / (Π_{k=1}^{n-1} 2k)
                   = ((-1)^{n-1} / (2^n n!)) · (2n - 2)! / (2^{n-1} (n - 1)!)
                   = ((-1)^{n-1} / 2^{2n-1}) · (2n - 2)! / (n! (n - 1)!)
                   = ((-1)^{n-1} / 2^{2n-1}) · (2n - 1)! / ((2n - 1) · n! (n - 1)!)
                   = ((-1)^{n-1} / (2^{2n-1} (2n - 1))) (2n - 1 choose n).

For n = 0, the switch from the big product of odd terms to (2n - 2)! divided by the even terms doesn't work, because (2n - 2)! is undefined. So here we just use the special case (1/2 choose 0) = 1.
Now plug this back in:

    F = (1 - √(1 - 4z)) / (2z)
      = 1/(2z) - (1/(2z)) Σ_{n=0}^∞ (1/2 choose n) (-4z)^n
      = 1/(2z) - 1/(2z) - (1/(2z)) Σ_{n=1}^∞ ((-1)^{n-1} / (2^{2n-1} (2n - 1))) (2n - 1 choose n) (-4z)^n
      = -(1/(2z)) Σ_{n=1}^∞ ((-1)^{2n-1} 2^{2n} / (2^{2n-1} (2n - 1))) (2n - 1 choose n) z^n
      = (1/(2z)) Σ_{n=1}^∞ (2/(2n - 1)) (2n - 1 choose n) z^n
      = Σ_{n=1}^∞ (1/(2n - 1)) (2n - 1 choose n) z^{n-1}
      = Σ_{n=0}^∞ (1/(2n + 1)) (2n + 1 choose n + 1) z^n
      = Σ_{n=0}^∞ (1/(n + 1)) (2n choose n) z^n.

Here we choose minus for the plus-or-minus to get the right answer and then do a little bit of tidying up of the binomial coefficient.
We can check the first few values of f(n):

    n    f(n)
    0    (0 choose 0) = 1
    1    (1/2)(2 choose 1) = 1
    2    (1/3)(4 choose 2) = 6/3 = 2
    3    (1/4)(6 choose 3) = 20/4 = 5

and these are consistent with what we get if we draw all the small binary trees by hand.
The numbers (1/(n+1))(2n choose n) show up in a lot of places in combinatorics, and are known as the Catalan numbers.
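The tree-counting recursion and the Catalan formula can be checked against each other mechanically. This sketch (my addition) counts binary trees directly from the structural decomposition F = 1 + zF^2:

```python
# Count binary trees with n internal nodes: a nonempty tree is a root plus
# two subtrees whose internal-node counts sum to n - 1.  Compare with the
# Catalan formula C_n = binomial(2n, n) / (n + 1).
from math import comb

def trees(n):
    if n == 0:
        return 1  # the empty tree
    return sum(trees(k) * trees(n - 1 - k) for k in range(n))

for n in range(10):
    assert trees(n) == comb(2*n, n) // (n + 1)
print([trees(n) for n in range(6)])  # [1, 1, 2, 5, 14, 42]
```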
11.3.10
The following table describes all the nasty things we can do to a generating function. Throughout, we assume F = Σ f_k z^k, G = Σ g_k z^k, etc.

    Operation    | Generating functions               | Coefficients          | Combinatorial interpretation
    Find f_0     | f_0 = F(0)                         | Returns f_0           | Count weight-0 objects.
    Find f_k     | f_k = (1/k!) (d^k/dz^k) F(z)|_{z=0} | Returns f_k          | Count weight-k objects.
    Flatten      | F(1)                               | Computes Σ f_k        | Count all objects, ignoring weights.
    Shift right  | G = zF                             | g_k = f_{k-1}         | Add 1 to all weights.
    Shift left   | G = z^{-1}(F - F(0))               | g_k = f_{k+1}         | Subtract 1 from all weights, after removing any weight-0 objects.
    Pointing     | G = z (d/dz) F                     | g_k = k f_k           | A G-thing is an F-thing with a label pointing to one of its units.
    Sum          | H = F + G                          | h_k = f_k + g_k       | Disjoint union.
    Product      | H = FG                             | h_k = Σ_i f_i g_{k-i} | Cartesian product.
    Composition  | H = F ∘ G                          | H = Σ f_k G^k         | To make an H-thing, first choose an F-thing of weight m, then bolt onto it m G-things. The weight of the H-thing is the sum of the weights of the G-things.
    Repetition   | G = 1/(1 - F)                      | G = Σ F^k             | A G-thing is a sequence of zero or more F-things. Note: this is just a special case of composition.
11.3.11 Variants
11.3.12
Further reading
Rosen [Ros12] discusses some basic facts about generating functions in §8.4. Graham et al. [GKP94] give a more thorough introduction. Herbert Wilf's book generatingfunctionology, which can be downloaded from the web, will tell you more about the subject than you probably want to know.

See http://www.swarthmore.edu/NatSci/echeeve1/Ref/LPSA/PartialFraction/PartialFraction.html for very detailed notes on partial fraction expansion.
Chapter 12
Probability theory
Here are two examples of questions we might ask about the likelihood of
some event:
Gambling: If I throw two six-sided dice, what are my chances of seeing a 7?

Insurance: I insure a typical resident of Smurfington-Upon-Tyne against premature baldness. How likely is it that I have to pay a claim?
Answers to these questions are summarized by a probability, a number
in the range 0 to 1 that represents the likelihood that some event occurs.
There are two dominant interpretations of this likelihood:
The frequentist interpretation says that if an event occurs with
probability p, then in the limit as I accumulate many examples of
similar events, I will see the number of occurrences divided by the
number of samples converging to p. For example, if I flip a fair coin
over and over again many times, I expect that heads will come up
roughly half of the times I flip it, because the probability of coming
up heads is 1/2.
The Bayesian interpretation says that when I say that an event occurs with probability p, that means my subjective beliefs about the event would lead me to take a bet that would be profitable on average if this were the real probability. So a Bayesian would take a double-or-nothing bet on a coin coming up heads if they believed that the probability it came up heads was at least 1/2.
12.1
We'll start by describing the basic ideas of probability in terms of probabilities of events, which either occur or don't. Later we will generalize these ideas and talk about random variables, which may take on many different values in different outcomes.
12.1.1
Probability axioms
Coming up with axioms for probabilities that work in all the cases we want to consider took much longer than anybody expected, and the current set in common use only goes back to the 1930s. Before presenting these, let's talk a bit about the basic ideas of probability.

An event A is something that might happen, or might not; it acts like a predicate over possible outcomes. The probability Pr[A] of an event A is a real number in the range 0 to 1, that must satisfy certain consistency rules like Pr[¬A] = 1 - Pr[A].
In discrete probability, there is a finite set of atoms, each with an assigned probability, and every event is a union of atoms. The probability of an event is then the sum of the probabilities of the atoms it contains.

^1 This caricature of the debate over interpreting probability is thoroughly incomplete. For a thoroughly complete discussion, including many other interpretations, see http://plato.stanford.edu/entries/probability-interpret/.
Some examples of discrete probability spaces:

• Ω = {H, T}, F = P(Ω) = {∅, {H}, {T}, {H, T}}, Pr[A] = |A|/2. This represents a fair coin with two outcomes H and T that each occur with probability 1/2.

• Ω = {H, T}, F = P(Ω), Pr[{H}] = p, Pr[{T}] = 1 - p. This represents a biased coin, where H comes up with probability p.

• Ω = {(i, j) | i, j ∈ {1, 2, 3, 4, 5, 6}}, F = P(Ω), Pr[A] = |A|/36. Roll of two fair dice. A typical event might be "the total roll is 4", which is the set {(1, 3), (2, 2), (3, 1)} with probability 3/36 = 1/12.

• Ω = N, F = P(Ω), Pr[A] = Σ_{n∈A} 2^{-n-1}. This is an infinite probability space; a real-world process that might generate it is to flip a fair coin repeatedly and count how many times it comes up tails before the first time it comes up heads. Note that even though it is infinite, we can still define all probabilities by summing over atoms: Pr[{0}] = 1/2, Pr[{1}] = 1/4, Pr[{0, 2, 4, . . .}] = 1/2 + 1/8 + 1/32 + ··· = 2/3, etc.
It's unusual for anybody doing probability to actually write out the details of the probability space like this. Much more often, a writer will just assert the probabilities of a few basic events (e.g. Pr[{H}] = 1/2), and claim that any other probability that can be deduced from these initial probabilities from the axioms also holds (e.g. Pr[{T}] = 1 - Pr[{H}] = 1/2). The main reason Kolmogorov gets his name attached to the axioms is that he was responsible for Kolmogorov's extension theorem, which says (speaking very informally) that as long as your initial assertions are consistent, there exists a probability space that makes them and all their consequences true.
12.1.2
Probability as counting
The easiest probability space to work with is a uniform discrete probability space, which has N outcomes each of which occurs with probability
1/N . If someone announces that some quantity is random without specifying probabilities (especially if that someone is a computer scientist), the
odds are that what they mean is that each possible value of the quantity is
equally likely. If that someone is being more careful, they would say that
the quantity is drawn uniformly at random from a particular set.
Such spaces are among the oldest studied in probability, and go back
to the very early days of probability theory where randomness was almost
Examples

• A random bit has two outcomes, 0 and 1. Each occurs with probability 1/2.

• A die roll has six outcomes, 1 through 6. Each occurs with probability 1/6.

• A roll of two dice has 36 outcomes (order of the dice matters). Each occurs with probability 1/36.

• A random n-bit string has 2^n outcomes. Each occurs with probability 2^{-n}. The probability that exactly one bit is a 1 is obtained by counting all strings with a single 1 and dividing by 2^n. This gives n·2^{-n}.

• A poker hand consists of a subset of 5 cards drawn uniformly at random from a deck of 52 cards. Depending on whether the order of the 5 cards is considered important (usually it isn't), there are either (52 choose 5) or (52)_5 possible hands. The probability of getting a flush (all five cards in the hand drawn from the same suit of 13 cards) is 4·(13 choose 5)/(52 choose 5); there are 4 choices of suits, and (13 choose 5) ways to draw 5 cards from each suit.

• A random permutation on n items has n! outcomes, one for each possible permutation. A typical event might be that the first element of a random permutation of 1 . . . n is 1; this occurs with probability (n-1)!/n! = 1/n. Another example of a random permutation might be a uniform shuffling of a 52-card deck (difficult to achieve in practice!). Here, the probability that we get a particular set of 5 cards as the first 5 in the deck is obtained by counting all the permutations that have those 5 cards in the first 5 positions (there are 5!·47! of them) divided by 52!. The result is the same 1/(52 choose 5) that we get from the uniform poker hands.
12.1.3
Examples
See http://arXiv.org/abs/math/0509698.
12.1.4 Union of events

Examples

• What is the probability of getting at least one head out of two independent coin-flips? Compute Pr[H_1 ∪ H_2] = 1/2 + 1/2 - (1/2)(1/2) = 3/4.

• What is the probability of getting at least one head out of two coin-flips, when the coin-flips are not independent? Here again we can get any probability from 0 to 1, because the probability of getting at least one head is just 1 - Pr[T_1 ∩ T_2].
For more events, we can use a probabilistic version of the inclusion-exclusion formula (Theorem 11.2.2). The new version looks like this:

Theorem 12.1.1. Let A_1 . . . A_n be events on some probability space. Then

    Pr[ ⋃_{i=1}^n A_i ] = Σ_{S ⊆ {1...n}, S ≠ ∅} (-1)^{|S|+1} Pr[ ⋂_{j∈S} A_j ].    (12.1.1)
For discrete probability, the proof is essentially the same as for Theorem 11.2.2; the difference is that instead of showing that we add 1 for each possible element of ⋃ A_i, we show that we add the probability of each outcome in ⋃ A_i. The result continues to hold for more general spaces, but requires a little more work.^3
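Formula (12.1.1) can be verified by brute force on a small space. The sketch below (my addition) uses the two-dice space with three arbitrarily chosen events:

```python
# Brute-force check of inclusion-exclusion on the two-dice space with
# A1 = "first die is 1", A2 = "second die is 1", A3 = "total is 4".
from itertools import combinations
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
A = [{w for w in omega if w[0] == 1},
     {w for w in omega if w[1] == 1},
     {w for w in omega if sum(w) == 4}]

def pr(event):
    return Fraction(len(event), len(omega))

lhs = pr(A[0] | A[1] | A[2])
rhs = Fraction(0)
for r in range(1, 4):
    for S in combinations(A, r):
        rhs += (-1)**(r + 1) * pr(set.intersection(*S))
assert lhs == rhs
print(lhs)
```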
12.1.5
Conditional probability
Suppose I want to answer the question "What is the probability that my dice add up to 6 if I know that the first one is an odd number?" This question involves conditional probability, where we calculate a probability subject to some conditions. The probability of an event A conditioned on an event B, written Pr[A | B], is defined by the formula

    Pr[A | B] = Pr[A ∩ B] / Pr[B].

One way to think about this is that when we assert that B occurs we are in effect replacing the entire probability space with just the part that sits in B. So we have to divide all of our probabilities by Pr[B] in order to make Pr[B | B] = 1, and we have to replace A with A ∩ B to exclude the part of A that can't happen any more.

Note also that conditioning on B only makes sense if Pr[B] > 0. If Pr[B] = 0, Pr[A | B] is undefined.
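The opening question of this section can be answered by enumerating the 36 outcomes (my addition, not from the notes):

```python
# Pr[ total = 6 | first die odd ] = Pr[total 6 and first odd] / Pr[first odd]
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
B = [w for w in omega if w[0] % 2 == 1]      # first die odd (18 outcomes)
AandB = [w for w in B if sum(w) == 6]        # (1,5), (3,3), (5,1)
p = Fraction(len(AandB), len(omega)) / Fraction(len(B), len(omega))
print(p)  # 1/6
```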
12.1.5.1
We can use the fact that A is the disjoint union of A ∩ B and A ∩ ¬B to get Pr[A] by case analysis:

    Pr[A] = Pr[A ∩ B] + Pr[A ∩ ¬B]
          = Pr[A | B] Pr[B] + Pr[A | ¬B] Pr[¬B].

For example, if there is a 0.9 chance I can make it to the top of Mt Everest safely if I learn how to climb first (which I do with probability 0.1), and only a 0.2 chance I can make it safely without learning how to climb first, my chances of getting there are (0.9)(0.1) + (0.2)(0.9) = 0.27.

This method is sometimes given the rather grandiose name of the law of total probability. The most general version is that if B_1 . . . B_n are all disjoint events and the sum of their probabilities is 1, then

    Pr[A] = Σ_{i=1}^n Pr[A | B_i] Pr[B_i].
Bayes's formula Reversing the order of conditioning gives

    Pr[B | A] = Pr[A | B] Pr[B] / Pr[A].
12.2
Random variables
A random variable X is a variable that takes on particular values randomly. This means that for each possible value x, there is an event [X = x] with some probability of occurring that corresponds to X (the random variable, usually written as an upper-case letter) taking on the value x (some fixed value). Formally, a random variable X is really a function X(ω) of the outcome ω that occurs, but we save a lot of ink by leaving out ω.^4
12.2.1
Indicator variables: The indicator variable for an event A is a variable X that is 1 if A occurs and 0 if it doesn't (i.e., X(ω) = 1 if ω ∈ A and 0 otherwise). There are many conventions out there for writing indicator variables. I am partial to 1_A, but you may also see them written using the Greek letter chi (e.g. χ_A) or by abusing the bracket notation for events (e.g., [A], [Y^2 > 3], [all six coins come up heads]).

^4 For some spaces, not all functions X(ω) work as random variables, because the events [X = x] might not be measurable with respect to F. We will generally not run into these issues.
Functions of random variables: Any function you are likely to run across of a random variable or random variables is a random variable. So if X and Y are random variables, X + Y, XY, and log X are all random variables.

Counts of events: Flip a fair coin n times and let X be the number of times it comes up heads. Then X is an integer-valued random variable.

Random sets and structures: Suppose that we have a set T of n elements, and we pick out a subset U by flipping an independent fair coin for each element to decide whether to include it. Then U is a set-valued random variable. Or we could consider the infinite sequence X_0, X_1, X_2, . . . , where X_0 = 0 and X_{n+1} is either X_n + 1 or X_n - 1, depending on the result of an independent fair coin flip. Then we can think of the entire sequence X as a sequence-valued random variable.
12.2.2
Joint distributions
12.2.3
The difference between the two preceding examples is that in the first case, X and Y are independent, and in the second case, they aren't.

Two random variables X and Y are independent if any pair of events of the form X ∈ A, Y ∈ B are independent. For discrete random variables, it is enough to show that Pr[X = x ∧ Y = y] = Pr[X = x] · Pr[Y = y], or in other words that the events [X = x] and [Y = y] are independent for all values x and y. For continuous random variables, the corresponding equation is Pr[X ≤ x ∧ Y ≤ y] = Pr[X ≤ x] · Pr[Y ≤ y]. In practice, we will typically either be told that two random variables are independent or deduce it from the fact that they arise from separated physical processes.
12.2.3.1
Examples
• Roll two six-sided dice, and let X and Y be the values of the dice. By convention we assume that these values are independent. This means for example that Pr[X ∈ {1, 2, 3} ∧ Y ∈ {1, 2, 3}] = Pr[X ∈ {1, 2, 3}] · Pr[Y ∈ {1, 2, 3}] = (1/2)(1/2) = 1/4, which is a slightly easier computation than counting up the 9 cases (and then arguing that each occurs with probability (1/6)^2, which requires knowing that X and Y are independent).

• Take the same X and Y, and let Z = X + Y. Now Z and X are not independent, because Pr[X = 1 ∧ Z = 12] = 0, which is not equal to Pr[X = 1] · Pr[Z = 12] = (1/6)(1/36) = 1/216.

• Place two radioactive sources on opposite sides of the Earth, and let X and Y be the number of radioactive decay events in each source during some 10 millisecond interval. Since the sources are 42 milliseconds away from each other at the speed of light, we can assert that either X and Y are independent, or the world doesn't behave the way the physicists think it does. This is an example of variables being independent because they are physically independent.

• Roll one six-sided die X, and let Y = ⌈X/2⌉ and Z = X mod 2. Then Y and Z are independent, even though they are generated using the same physical process.
12.2.4
The expectation of a discrete random variable X is

    E[X] = Σ_x x Pr[X = x];

for a continuous random variable with distribution function F, the analogous formula is E[X] = ∫ x dF(x).

Technically, this will work for any values we can add and multiply by probabilities. So if X is actually a vector in R^3 (for example), we can talk about the expectation of X, which in some sense will be the average position of the location given by X.
Example (unbounded discrete variable) Let X be a geometric random variable with parameter p; this means that Pr[X = k] = q^k p, where as usual q = 1 - p. Then

    E[X] = Σ_{k=0}^∞ k q^k p = p Σ_{k=0}^∞ k q^k = p · q/(1 - q)^2 = pq/p^2 = q/p = (1 - p)/p = 1/p - 1.
Expectation is a way to summarize the distribution of a random variable without giving all the details. If you take the average of many independent copies of a random variable, you will be likely to get a value close to the expectation. Expectations are also used in decision theory to compare different choices. For example, given a choice between a 50% chance of winning $100 (expected value: $50) and a 20% chance of winning $1000 (expected value: $200), a rational decision maker would take the second option. Whether ordinary human beings correspond to an economist's notion of a rational decision maker often depends on other details of the situation.

Terminology note: If you hear somebody say that some random variable X takes on the value z on average, this usually means that E[X] = z.
12.2.4.1
Not every random variable has a finite expectation. For example, if X is geometric with p = 1/2, so that Pr[X = k] = 2^{-k-1}, then

    E[2^X] = Σ_{k=0}^∞ 2^k Pr[X = k]
           = Σ_{k=0}^∞ 2^k 2^{-k-1}
           = Σ_{k=0}^∞ 1/2
           = ∞.
Expectation of a sum The expectation operator is linear: E[aX + Y] = a E[X] + E[Y] for any random variables X and Y and constant a. For discrete random variables this can be shown directly:

    E[aX + Y] = Σ_{x,y} (ax + y) Pr[X = x ∧ Y = y]
              = a Σ_{x,y} x Pr[X = x ∧ Y = y] + Σ_{x,y} y Pr[X = x ∧ Y = y]
              = a Σ_x x Σ_y Pr[X = x ∧ Y = y] + Σ_y y Σ_x Pr[X = x ∧ Y = y]
              = a Σ_x x Pr[X = x] + Σ_y y Pr[Y = y]
              = a E[X] + E[Y].
Linearity of expectation makes computing many expectations easy. Example: Flip a fair coin n times, and let X be the number of heads. What is E[X]? We can solve this problem by letting X_i be the indicator variable for the event "coin i came up heads". Then X = Σ_{i=1}^n X_i and E[X] = E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i] = Σ_{i=1}^n 1/2 = n/2. In principle it is possible to calculate the same value from the distribution of X (this involves a lot of binomial coefficients), but linearity of expectation is much easier.

Example Choose a random permutation π, i.e., a random bijection from {1 . . . n} to itself. What is the expected number of values i for which π(i) = i?

Let X_i be the indicator variable for the event that π(i) = i. Then we are looking for E[X_1 + X_2 + ··· + X_n] = E[X_1] + E[X_2] + ··· + E[X_n]. But E[X_i] is just 1/n for each i, so the sum is n(1/n) = 1. Calculating this by computing Pr[Σ_{i=1}^n X_i = x] first would be very painful.
12.2.4.3
Expectation of a product
For example: Roll two dice and take their product. What value do we get on average? The product formula gives E[XY] = E[X] E[Y] = (7/2)^2 = 49/4 = 12 1/4. We could also calculate this directly by summing over all 36 cases, but it would take a while.

Alternatively, roll one die and multiply it by itself. Now what value do we get on average? Here we are no longer dealing with independent random variables, so we have to do it the hard way: E[X^2] = (1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2)/6 = 91/6 = 15 1/6. This is substantially higher than when the dice are uncorrelated. (Exercise: How can you rig the second die so it still comes up with each value 1/6 of the time but minimizes E[XY]?)
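Both averages are small enough to enumerate exactly (my addition, not from the notes):

```python
# Exact enumeration of E[XY] for two independent dice and E[X^2] for one die.
from fractions import Fraction

faces = range(1, 7)
E_XY = Fraction(sum(x * y for x in faces for y in faces), 36)
E_X2 = Fraction(sum(x * x for x in faces), 6)
print(E_XY, E_X2)  # 49/4 91/6
```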
We can prove the product rule without too much trouble for discrete random variables. The easiest way is to start from the right-hand side.

    E[X] E[Y] = (Σ_x x Pr[X = x]) (Σ_y y Pr[Y = y])
              = Σ_{x,y} xy Pr[X = x] Pr[Y = y]
              = Σ_z z Σ_{x,y: xy=z} Pr[X = x] Pr[Y = y]
              = Σ_z z Σ_{x,y: xy=z} Pr[X = x ∧ Y = y]
              = Σ_z z Pr[XY = z]
              = E[XY].

Here we use independence in going from Pr[X = x] Pr[Y = y] to Pr[X = x ∧ Y = y] and use the union rule to convert the x, y sum into Pr[XY = z].
12.2.4.4
Conditional expectation
Like conditional probability, there is also a notion of conditional expectation. The simplest version of conditional expectation conditions on a single event A; it is written E[X | A], and is defined for discrete random variables by

    E[X | A] = Σ_x x Pr[X = x | A].

This is exactly the same as ordinary expectation except that the probabilities are now all conditioned on A.
The law of total probability carries over to expectations: if A_1 . . . A_n are disjoint events whose probabilities sum to 1, then

    E[X] = Σ_i E[X | A_i] Pr[A_i].
There is a more general notion of conditional expectation for random variables, where the conditioning is done on some other random variable Y. Unlike E[X | A], which is a constant, the expected value of X conditioned on Y, written E[X | Y], is itself a random variable: when Y = y, it takes on the value E[X | Y = y].

Here's a simple example. Let's compute E[X + Y | X], where X and Y are the values of independent six-sided dice. When X = x, E[X + Y | X = x] = x + E[Y] = x + 7/2. For the full random variable we can write E[X + Y | X] = X + 7/2.
Another way to get the result in the preceding example is to use some general facts about conditional expectation:

• E[aX + bY | Z] = a E[X | Z] + b E[Y | Z]. This is the conditional-expectation version of linearity of expectation.

• E[X | X] = X. This is immediate from the definition, since E[X | X = x] = x.

• If X and Y are independent, then E[Y | X] = E[Y]. The intuition is that knowing the value of X gives no information about Y, so E[Y | X = x] = E[Y] for any x in the range of X. (To do this formally requires using the fact that Pr[Y = y | X = x] = Pr[Y = y ∧ X = x]/Pr[X = x] = Pr[Y = y] Pr[X = x]/Pr[X = x] = Pr[Y = y].)
12.2.5 Markov's inequality

Markov's inequality says that for a nonnegative random variable X and any a > 1, Pr[X > a E[X]] < 1/a. To see this, condition on whether X > a E[X]; dropping the nonnegative X ≤ a E[X] term on the right-hand side can only make it smaller. This gives:

    E[X] ≥ E[X | X > a E[X]] Pr[X > a E[X]]
         > a E[X] Pr[X > a E[X]],

and dividing both sides by a E[X] gives the desired result.

Another version of Markov's inequality replaces > with ≥:

    Pr[X ≥ a E[X]] ≤ 1/a.

The proof is essentially the same.
12.2.5.1
Example
Suppose that all you know about the high tide height X is that E[X] = 1 meter and X ≥ 0. What can we say about the probability that X > 2 meters? Using Markov's inequality, we get Pr[X > 2 meters] = Pr[X > 2 E[X]] < 1/2.
12.2.5.2
12.2.6
Expectation tells you the average value of a random variable, but it doesn't tell you how far from the average the random variable typically gets: the random variables X = 0 and Y = ±1,000,000,000,000 with equal probability both have expectation 0, though their distributions are very different. Though it is impossible to summarize everything about the spread of a distribution in a single number, a useful approximation for many purposes is the variance Var[X] of a random variable X, which is defined as the expected square of the deviation from the expectation, or E[(X - E[X])^2].

Example Let X be 0 or 1 with equal probability. Then E[X] = 1/2, and (X - E[X])^2 is always 1/4. So Var[X] = 1/4.
Example Let X be the value of a fair six-sided die. Then E[X] = 7/2, and

    E[(X - E[X])^2] = (1/6) [(1 - 7/2)^2 + (2 - 7/2)^2 + (3 - 7/2)^2 + ··· + (6 - 7/2)^2] = 35/12.
Computing variance directly from the definition can be tedious. Often it is easier to compute it from E[X^2] and E[X]:

    Var[X] = E[(X - E[X])^2]
           = E[X^2 - 2X E[X] + (E[X])^2]
           = E[X^2] - 2 E[X] E[X] + (E[X])^2
           = E[X^2] - (E[X])^2.

Example Let's try the six-sided die again, except this time we'll use an n-sided die. We have

    Var[X] = E[X^2] - (E[X])^2
           = (1/n) Σ_{i=1}^n i^2 - ((n + 1)/2)^2
           = (n + 1)(2n + 1)/6 - (n + 1)^2/4.

For n = 6, this gives 91/6 - 49/4 = 35/12, matching the direct calculation above.
Multiplication by constants Multiplying a random variable by a constant c multiplies its variance by c^2:

    Var[cX] = E[(cX)^2] - (E[cX])^2
            = c^2 E[X^2] - (c E[X])^2
            = c^2 Var[X].
The variance of a sum is more complicated:

    Var[X + Y] = E[(X + Y)^2] - (E[X + Y])^2
               = E[X^2] + 2 E[XY] + E[Y^2] - (E[X])^2 - 2 E[X] E[Y] - (E[Y])^2
               = Var[X] + Var[Y] + 2 (E[XY] - E[X] E[Y]).

The quantity E[XY] - E[X] E[Y] is the covariance Cov[X, Y]; it is zero when X and Y are independent. More generally,

    Var[ Σ_{i=1}^n X_i ] = Σ_{i=1}^n Var[X_i] + Σ_{i≠j} Cov[X_i, X_j].
Chebyshev's inequality Chebyshev's inequality bounds how far a random variable is likely to stray from its expectation:

    Pr[ |X - E[X]| ≥ r ] ≤ Var[X]/r^2.

Proof. We'll do the first version. The event |X - E[X]| ≥ r is the same as the event (X - E[X])^2 ≥ r^2. By Markov's inequality, the probability that this occurs is at most E[(X - E[X])^2]/r^2 = Var[X]/r^2.
Application: showing that a random variable is close to its expectation This is the usual statistical application.

Example Flip a fair coin n times, and let X be the number of heads. What is the probability that |X - n/2| > r? Recall that Var[X] = n/4, so Pr[|X - n/2| > r] < (n/4)/r^2 = n/(4r^2). So, for example, the chance of deviating from the average by more than 1000 after 1000000 coin-flips is less than 1/4.

Example Out of n voters in Saskaloosa County, m plan to vote for Smith for County Dogcatcher. A polling firm samples k voters (with replacement) and asks them who they plan to vote for. Suppose that m < n/2; compute a bound on the probability that the polling firm incorrectly polls a majority for Smith.

Solution: Let X_i be the indicator variable for a Smith vote when the i-th voter is polled and let X = Σ X_i be the total number of pollees who say they will vote for Smith. Let p = E[X_i] = m/n. Then Var[X_i] = p - p^2, E[X] = kp, and Var[X] = k(p - p^2). To get a majority in the poll, we need X > k/2, or X - E[X] > k/2 - kp. Using
12.2.7

The probability generating function (pgf) of a discrete random variable X taking values in N is

    F(z) = Σ_{n=0}^∞ Pr[X = n] z^n.
Sums
A very useful property of pgfs is that the pgf of a sum of independent random variables is just the product of the pgfs of the individual random variables. The reason for this is essentially the same as for ordinary generating functions: when we multiply together two terms (Pr[X = n] z^n)(Pr[Y = m] z^m), we get Pr[X = n ∧ Y = m] z^{n+m}, and the sum over all the different ways of decomposing n + m gives all the different ways to get this sum.
So, for example, the pgf of a binomial random variable equal to the sum
of n independent Bernoulli random variables is (q + pz)n (hence the name
binomial).
12.2.7.2 Expectation and variance
One nice thing about pgfs is that they can be used to quickly compute expectation and variance. For expectation, we have

F′(z) = Σ_{n=0}^{∞} n Pr[X = n] z^{n-1}.

So

F′(1) = Σ_{n=0}^{∞} n Pr[X = n] = E[X].

Taking the derivative again gives

F″(z) = Σ_{n=0}^{∞} n(n - 1) Pr[X = n] z^{n-2},

or

F″(1) = Σ_{n=0}^{∞} n(n - 1) Pr[X = n] = E[X(X - 1)] = E[X^2] - E[X].

So we can recover E[X^2] as F″(1) + F′(1) and get Var[X] as F″(1) + F′(1) - (F′(1))^2.
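For a concrete check, here is a small sketch in Python that evaluates F′(1) and F″(1) for a binomial pgf directly from its coefficients (the parameters n and p are arbitrary):

```python
from math import comb, isclose

n, p = 10, 0.3
q = 1 - p
# Coefficients of F(z) = (q + p z)^n: Pr[X = k] = C(n, k) p^k q^(n-k)
probs = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

F1 = sum(k * probs[k] for k in range(n + 1))             # F'(1) = E[X]
F2 = sum(k * (k - 1) * probs[k] for k in range(n + 1))   # F''(1) = E[X(X-1)]

print(isclose(F1, n * p))                      # True: E[X] = np
print(isclose(F2 + F1 - F1**2, n * p * q))     # True: Var[X] = np(1-p)
```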
12.2.8 Summary

E[X + Y] = E[X] + E[Y]
E[aX] = a·E[X]
E[XY] = E[X]·E[Y] + Cov[X, Y]
12.2.9 The general case

So far we have only considered discrete random variables, which avoids a lot of nasty technical issues. In general, a random variable on a probability space (Ω, F, P) is a function whose domain is Ω that satisfies some extra conditions on its values that make interesting events involving the random variable elements of F. Typically the codomain will be the reals or the integers, although any set is possible. Random variables are generally written as capital letters with their arguments suppressed: rather than writing X(ω), where ω ∈ Ω, we write just X.

A technical condition on random variables is that the inverse image of any measurable subset of the codomain must be in F. In simple terms, if you can't nail down ω exactly, being able to tell which element of F you land in should be enough to determine the value of X(ω). For a discrete random variable, this just means that X^{-1}(x) ∈ F for each possible value x. For a real-valued random variable, it is enough that X^{-1}((-∞, x]) ∈ F for each real x.

The detail we are sweeping under the rug here is what makes a subset of the codomain measurable. The essential idea is that we also have a σ-algebra F′ on the codomain, and elements of this codomain σ-algebra are the measurable subsets. The rules for simple random variables and real-valued random variables come from default choices of σ-algebra.
Densities
Independence

Independence is the same as for discrete random variables: two random variables X and Y are independent if any pair of events of the form X ∈ A, Y ∈ B are independent. For real-valued random variables it is enough to show that their joint distribution F(x, y) is equal to the product of their individual distributions F_X(x)·F_Y(y). For real-valued random variables with densities, showing that the densities multiply also works. Both methods generalize in the obvious way to sets of three or more random variables.
Expectation
If a continuous random variable has a density f (x), the formula for its
expectation is
E[X] = ∫ x f(x) dx.

For example, let X be a uniform random variable in the range [a, b]. Then f(x) = 1/(b - a) when a ≤ x ≤ b and 0 otherwise, giving

E[X] = ∫_a^b x/(b - a) dx
     = [x^2/(2(b - a))]_{x=a}^{b}
     = (b^2 - a^2)/(2(b - a))
     = (a + b)/2.
Chapter 13
Linear algebra
Linear algebra is the branch of mathematics that studies vector spaces
and linear transformations between them.
13.1 Vectors and vector spaces
Lets start with vectors. In the simplest form, a vector consists of a sequence
of n values from some field (see 4.1); for most purposes, this field will be R.
The number of values (called coordinates) in a vector is the dimension
of the vector. The set of all vectors over a given field of a given dimension
(e.g., Rn ) forms a vector space, which has a more general definition that
we will give later.
So the idea is that a vector represents a point in an n-dimensional space by its coordinates in some coordinate system. For example, if we imagine the Earth is flat, we can represent positions on the surface of the Earth as a latitude and longitude, with the point ⟨0, 0⟩ representing the origin of the system at the intersection between the equator (all points of the form ⟨0, x⟩) and the prime meridian (all points of the form ⟨x, 0⟩). In this system, the location of Arthur K. Watson Hall (AKW) would be ⟨41.31337, -72.92508⟩, and the location of LC 317 would be ⟨41.30854, -72.92967⟩. These are both offsets (measured in degrees) from the origin point ⟨0, 0⟩.
[Figure 13.1: Geometric interpretation of vector addition: laying x = ⟨3, -1⟩ and y = ⟨1, 2⟩ end-to-end, starting from ⟨0, 0⟩, gives x + y = ⟨4, 1⟩.]
13.1.1 Relative positions
What makes this a little confusing is that we will often use vectors to represent relative positions as well.¹ So if we ask the question "where do I have to go to get to LC 317 from AKW?", one answer is to travel -0.00483 degrees in latitude and -0.00459 degrees in longitude, or, in vector terms, to follow the relative vector ⟨-0.00483, -0.00459⟩. This works because we define vector addition coordinatewise: given two vectors x and y, their sum x + y is defined by (x + y)_i = x_i + y_i for each index i. In geometric terms, this has the effect of constructing a compound vector by laying vectors x and y end-to-end and drawing a new vector from the start of x to the end of y (see Figure 13.1).

The correspondence between vectors as absolute positions and vectors as relative positions comes from fixing an origin 0. If we want to specify an absolute position (like the location of AKW), we give its position relative to the origin (the intersection of the equator and the prime meridian). Similarly, the location of LC 317 can be specified by giving its position relative to the origin, which we can compute by first going to AKW (⟨41.31337, -72.92508⟩) and then adding the offset of LC 317 from AKW (⟨-0.00483, -0.00459⟩) to this vector to get the offset directly from the origin (⟨41.30854, -72.92967⟩).
More generally, we can add together as many vectors as we want, by
adding them coordinate-by-coordinate.
This can be used to reduce the complexity of pirate-treasure instructions:
¹ A further complication that we will sidestep completely is that physicists will often use "vector" to mean both an absolute position and an offset from it (sort of like an edge in a graph), requiring n coordinates to represent the starting point of the vector and another n coordinates to represent the ending point. These vectors really do look like arrows at a particular position in space. Our vectors will be simpler, and always start at the origin.
1. Yargh! Start at the olde hollow tree on Dead Man's Isle, if ye dare.

2. Walk 10 paces north.

3. Walk 5 paces east.

4. Walk 20 paces south.
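Adding the walking directions coordinatewise collapses them into a single offset. A small sketch, using the convention (assumed here) that east is +x and north is +y:

```python
def vector_add(*vectors):
    """Coordinatewise sum of any number of equal-length vectors."""
    return tuple(sum(coords) for coords in zip(*vectors))

# The pirate's walk as displacement vectors, measured in paces:
steps = [(0, 10), (5, 0), (0, -20)]    # 10 north, 5 east, 20 south
print(vector_add(*steps))              # (5, -10): 5 paces east, 10 paces south of the tree
```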
13.1.2 Scaling
13.2 Abstract vector spaces

This means that there is an addition operation for vectors that is commutative (x + y = y + x), associative (x + (y + z) = (x + y) + z), and has an identity element 0 (0 + x = x + 0 = x) and inverses -x (x + (-x) = 0).
13.3 Matrices
    A = | A11  A12 |
        | A21  A22 |
        | A31  A32 |
where A_ij = a(i, j), and the domain of the function is just the cross-product of the two index sets. Such a structure is called a matrix. The values A_ij are called the elements or entries of the matrix. A sequence of elements with the same first index is called a row of the matrix; similarly, a sequence of elements with the same second index is called a column. The dimension of the matrix specifies the number of rows and the number of columns: the matrix above has dimension (3, 2), or, less formally, it is a 3 × 2 matrix.³ A matrix is square if it has the same number of rows and columns.
Note: The convention in matrix indices is to count from 1 rather than
0. In programming language terms, matrices are written in FORTRAN.
³ The convention for both indices and dimension is that rows come before columns.
13.3.1 Interpretation
We can use a matrix any time we want to depict a function of two arguments
(over small finite sets if we want it to fit on one page). A typical example
(that predates the formal notion of a matrix by centuries) is a table of distances between cities or towns, such as a classic printed example from 1807. (The table itself is not reproduced here.) Because distance matrices are symmetric (see below), usually only half of the matrix is actually printed.
Another example would be a matrix of counts. Suppose we have a set
of destinations D and a set of origins O. For each pair (i, j) D O, let
Cij be the number of different ways to travel from j to i. For example, let
origin 1 be Bass Library, origin 2 be AKW, and let destinations 1, 2, and
3 be Bass, AKW, and SML. Then there is 1 way to travel between Bass
and AKW (walk), 1 way to travel from AKW to SML (walk), and 2 ways
to travel from Bass to SML (walk above-ground or below-ground). If we
assume that we are not allowed to stay put, there are 0 ways to go from
Bass to Bass or AKW to AKW, giving the matrix
    C = | 0  1 |
        | 1  0 |
        | 2  1 |
Another example is a matrix of probabilities, where each row sums to 1:

    P = | 1/2  1/2   0    0  |
        | 1/2   0   1/2   0  |
        |  0   1/2   0   1/2 |
        |  0    0   1/2  1/2 |
Finally, the most common use of matrices in linear algebra is to represent
the coefficients of a linear transformation, which we will describe later.
13.3.2 Operations on matrices

13.3.2.1 Transpose of a matrix

The transpose A^⊤ of a matrix A is obtained by swapping rows and columns: (A^⊤)_ij = A_ji. For example,

    A^⊤ = | A11  A12 |⊤   | A11  A21  A31 |
          | A21  A22 |  = | A12  A22  A32 |
          | A31  A32 |

If a matrix is equal to its own transpose (i.e., if A_ij = A_ji for all i and j), it is said to be symmetric. The transpose of an n × m matrix is an m × n matrix, so only square matrices can be symmetric.
13.3.2.2 Sums of matrices
If we have two matrices A and B with the same dimension, we can compute
their sum A + B by the rule (A + B)ij = Aij + Bij . Another way to say this
is that matrix sums are done term-by-term: there is no interaction between
entries with different indices.
For example, suppose we have the matrix of counts C above of ways of getting between two destinations on the Yale campus. Suppose that upperclassmen are allowed to also take the secret Science Hill Monorail from the sub-basement of Bass Library to the sub-basement of AKW. We can get the total number of ways an upperclassman can get from each origin to each destination by adding to C a matrix M that counts the monorail route:

    C + M = | 0  1 |   | 0  0 |   | 0  1 |
            | 1  0 | + | 1  0 | = | 2  0 |
            | 2  1 |   | 0  0 |   | 2  1 |
13.3.2.3 Products of matrices
Suppose we are not content to travel once, but have a plan, once we reach our destination in D, to travel again to a final destination in some set F. Just as we constructed the matrix C (or C + M, for monorail-using upperclassmen) counting the number of ways to go from each point in O to each point in D, we can construct a matrix Q counting the number of ways to go from each point in D to each point in F. Can we combine these two matrices to compute the number of ways to travel O → D → F?

The resulting matrix is known as the product QC. We can compute each entry in QC by taking a sum of products of entries in Q and C. Observe that the number of ways to get from k to i via some single intermediate point j is just Q_ij C_jk. To get all possible routes, we have to sum over all possible intermediate points, giving (QC)_ik = Σ_j Q_ij C_jk.

This gives the rule for multiplying matrices in general: to get (AB)_ik, sum A_ij B_jk over all intermediate values j. This works only when the number of columns in A is the same as the number of rows in B (since j has to vary over the same range in both matrices), i.e., when A is an n × m matrix and B is an m × s matrix for some n, m, and s. If the dimensions of the matrices don't match up like this, the matrix product is undefined. If the dimensions do match, they are said to be compatible.
For example, let B = (C + M ) from the sum example and let A be the
number of ways of getting from each of destinations 1 = Bass, 2 = AKW,
and 3 = SML to final destinations 1 = Heaven and 2 = Hell. After consulting
with appropriate representatives of the Divinity School, we determine that
one can get to either Heaven or Hell from any intermediate destination in
one way by dying (in a state of grace or sin, respectively), but that Bass
Library provides the additional option of getting to Hell by digging. This
gives a matrix

    A = | 1  1  1 |
        | 2  1  1 |
    A(C + M) = | 1  1  1 | | 0  1 |   | 1·0+1·2+1·2  1·1+1·0+1·1 |   | 4  2 |
               | 2  1  1 | | 2  0 | = | 2·0+1·2+1·2  2·1+1·0+1·1 | = | 4  3 |
                           | 2  1 |
One special matrix I (for each dimension n × n) has the property that IA = A and BI = B for all matrices A and B with compatible dimension. This matrix is known as the identity matrix, and is defined by the rule I_ii = 1 and I_ij = 0 for i ≠ j. It is not hard to see that in this case (IA)_ij = Σ_k I_ik A_kj = I_ii A_ij = A_ij, giving IA = A; a similar computation shows that BI = B. With a little more effort (omitted here) we can show that I is the unique matrix with this identity property.
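The general rule (AB)_ik = Σ_j A_ij B_jk is only a few lines of code. As a sketch, here it is applied to the Heaven/Hell matrix A and the upperclassman count matrix C + M from the text:

```python
def mat_mul(A, B):
    """(AB)_ik = sum of A_ij * B_jk over j; requires columns of A = rows of B."""
    n, m, s = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "incompatible dimensions"
    return [[sum(A[i][j] * B[j][k] for j in range(m)) for k in range(s)]
            for i in range(n)]

A  = [[1, 1, 1], [2, 1, 1]]       # ways to Heaven/Hell from Bass, AKW, SML
CM = [[0, 1], [2, 0], [2, 1]]     # C + M: ways from each origin to each destination
print(mat_mul(A, CM))             # [[4, 2], [4, 3]]

I3 = [[int(i == j) for j in range(3)] for i in range(3)]
print(mat_mul(I3, CM) == CM)      # True: the identity matrix changes nothing
```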
13.3.2.4 The inverse of a matrix

Each elementary row operation can be carried out by multiplying on the left by a suitable matrix. The tedious details: to multiply row r by a, use a matrix B with B_ii = 1 when i ≠ r, B_rr = a, and B_ij = 0 for i ≠ j; to add a times row r to row s, use a matrix B with B_ii = 1 for all i, B_sr = a, and B_ij = 0 for all other pairs ij; to swap rows r and s, use a matrix B with B_ii = 1 for i ∉ {r, s}, B_rs = B_sr = 1, and B_ij = 0 for all other pairs ij.
all the entries above the diagonal. The only way this can fail is if we hit some A_ii = 0, which we can swap with a nonzero A_ji if one exists (using a type (c) operation). If all the rows from i on down have a zero in the i column, then the original matrix A is not invertible. This entire process is known as Gauss-Jordan elimination.

This procedure can be used to solve matrix equations: if AX = B, and we know A and B, we can compute X by first computing A^{-1} and then multiplying X = A^{-1}AX = A^{-1}B. If we are not interested in A^{-1} for its own sake, we can simplify things by substituting B for I during the Gauss-Jordan elimination procedure; at the end, it will be transformed to X.
Example: Original A is on the left, I on the right.

Initial matrices:

    | 2 0 1 | 1 0 0 |
    | 1 0 1 | 0 1 0 |
    | 3 1 2 | 0 0 1 |

Divide top row by 2:

    | 1 0 1/2 | 1/2 0 0 |
    | 1 0  1  |  0  1 0 |
    | 3 1  2  |  0  0 1 |

Subtract top row from middle row and 3 × top row from bottom row:

    | 1 0 1/2 |  1/2 0 0 |
    | 0 0 1/2 | -1/2 1 0 |
    | 0 1 1/2 | -3/2 0 1 |

Swap middle and bottom rows:

    | 1 0 1/2 |  1/2 0 0 |
    | 0 1 1/2 | -3/2 0 1 |
    | 0 0 1/2 | -1/2 1 0 |

Multiply bottom row by 2:

    | 1 0 1/2 |  1/2 0 0 |
    | 0 1 1/2 | -3/2 0 1 |
    | 0 0  1  |  -1  2 0 |

Subtract 1/2 × bottom row from the top and middle rows:

    | 1 0 0 |  1 -1 0 |
    | 0 1 0 | -1 -1 1 |
    | 0 0 1 | -1  2 0 |
and we're done. (It's probably worth multiplying the original A by the alleged A^{-1} to make sure that we didn't make a mistake.)
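The elimination procedure is mechanical enough to automate. Here is a sketch in Python using exact fractions, checked against the worked example above (it assumes the input matrix is invertible):

```python
from fractions import Fraction

def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination on the augmented [A | I]."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for i in range(n):
        if M[i][i] == 0:                       # zero pivot: swap in a row with A_ji != 0
            j = next(r for r in range(i + 1, n) if M[r][i] != 0)
            M[i], M[j] = M[j], M[i]
        M[i] = [entry / M[i][i] for entry in M[i]]   # scale so the pivot becomes 1
        for r in range(n):
            if r != i and M[r][i] != 0:        # clear the rest of column i
                M[r] = [a - M[r][i] * b for a, b in zip(M[r], M[i])]
    return [row[n:] for row in M]

A = [[2, 0, 1], [1, 0, 1], [3, 1, 2]]
Ainv = [[int(x) for x in row] for row in invert(A)]
print(Ainv)    # [[1, -1, 0], [-1, -1, 1], [-1, 2, 0]]
```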
13.3.2.5 Scalar multiplication

13.3.3 Matrix identities
For the most part, matrix operations behave like scalar operations, with a few important exceptions:

1. Matrix multiplication is only defined for matrices with compatible dimensions.

2. Matrix multiplication is not commutative: in general, we do not expect that AB = BA. This is obvious when one or both of A and B is not square (one of the products is undefined because the dimensions aren't compatible), but it may also be true even if A and B are both square. For a simple example of a non-commutative pair of matrices, consider

    | 1 1 | | 1  1 |   | 2  0 |        | 1  1 | | 1 1 |   | 1 2 |
    | 0 1 | | 1 -1 | = | 1 -1 |   ≠    | 1 -1 | | 0 1 | = | 1 0 |.
3. Matrix multiplication is associative: A(BC) = (AB)C. To check this, compute (A(BC))_ij = Σ_k A_ik (BC)_kj = Σ_k Σ_m A_ik B_km C_mj, then compute ((AB)C)_ij = Σ_m (AB)_im C_mj = Σ_m Σ_k A_ik B_km C_mj; the two double sums are equal because finite sums can be reordered freely.
13.4 Vectors as matrices

13.4.1 Length
aligned with each other, but the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖ always holds.

A special class of vectors are the unit vectors, those vectors x for which ‖x‖ = 1. In geometric terms, these correspond to all the points on the surface of a radius-1 sphere centered at the origin. Any vector x can be turned into a unit vector x/‖x‖ by dividing by its length. In two dimensions, the unit vectors are all of the form [cos θ  sin θ]^⊤, where by convention θ is the angle from due east measured counterclockwise; this is why traveling 9 units northwest corresponds to the vector [9 cos 135°  9 sin 135°]^⊤ = [-9/√2  9/√2]^⊤. In one dimension, the unit vectors are [±1]. (There are no unit vectors in zero dimensions: the unique zero-dimensional vector has length 0.)
13.4.2 Dot products
Suppose we have some column vector x, and we want to know how far x sends us in a particular direction, where the direction is represented by a unit column vector e. We can compute this distance (a scalar) by taking the dot product

e · x = e^⊤ x = Σ_i e_i x_i.

For example, if x = [3  4]^⊤ and e = [1  0]^⊤, then

e · x = [1  0] [3  4]^⊤ = 1·3 + 0·4 = 3.

In this case we see that the [1  0]^⊤ vector conveniently extracts the first coordinate, which is about what we'd expect. But we can also find out how far x takes us in the [1/√2  1/√2]^⊤ direction: this is [1/√2  1/√2] · x = 7/√2.

By convention, we are allowed to take the dot product of two row vectors or of a row vector times a column vector or vice versa, provided of course that the non-boring dimensions match. In each case we transpose as appropriate to end up with a scalar when we take the matrix product.

Nothing in the definition of the dot product restricts either vector to be a unit vector. If we compute x · y where x = ce and ‖e‖ = 1, then we are effectively multiplying e · y by c. It follows that the dot product is proportional to the length of both of its arguments. This is often expressed geometrically in terms of the angle between the two vectors.
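These calculations are easy to mirror in code; a small sketch:

```python
from math import isclose, sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = (3, 4)
print(dot((1, 0), x))                       # 3: extracts the first coordinate
e = (1 / sqrt(2), 1 / sqrt(2))              # a unit vector at 45 degrees
print(isclose(dot(e, x), 7 / sqrt(2)))      # True: how far x goes in that direction
```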
13.5 Linear combinations and subspaces
Technical note: If the set of vectors {x_i} is infinite, then we will only permit linear combinations with a finite number of nonzero coefficients. We will generally not consider vector spaces big enough for this to be an issue.
13.5.1 Bases
If a set of vectors is both (a) linearly independent, and (b) spans the entire vector space, then we call that set of vectors a basis of the vector space. An example of a basis is the standard basis consisting of the vectors [1 0 … 0 0]^⊤, [0 1 … 0 0]^⊤, …, [0 0 … 1 0]^⊤, [0 0 … 0 1]^⊤. This has the additional nice property of being made up of vectors that are all orthogonal to each other (making it an orthogonal basis) and of unit length (making it a normal basis).
A basis that is both orthogonal and normal is called orthonormal.
We like orthonormal bases because we can recover the coefficients of some arbitrary vector v by taking dot products. If v = Σ a_i x_i, then v · x_j = Σ_i a_i (x_i · x_j) = a_j, since orthogonality means that x_i · x_j = 0 when i ≠ j, and normality means x_i · x_i = ‖x_i‖² = 1.
However, even for non-orthonormal bases it is still the case that any
vector can be written as a unique linear combination of basis elements. This
fact is so useful we will state it as a theorem:
Theorem 13.5.1. If {x_i} is a basis for some vector space V, then every vector y has a unique representation y = a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n.

Proof. Suppose there is some y with more than one representation, i.e., there are sequences of coefficients a_i and b_i such that y = a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n = b_1 x_1 + b_2 x_2 + ⋯ + b_n x_n. Then 0 = y - y = (a_1 x_1 + a_2 x_2 + ⋯ + a_n x_n) - (b_1 x_1 + b_2 x_2 + ⋯ + b_n x_n) = (a_1 - b_1) x_1 + (a_2 - b_2) x_2 + ⋯ + (a_n - b_n) x_n. But since the x_i are independent, the only way a linear combination of the x_i can equal 0 is if all coefficients are 0, i.e., if a_i = b_i for all i.
Even better, we can do all of our usual vector space arithmetic in terms of the coefficients a_i. For example, if a = Σ a_i x_i and b = Σ b_i x_i, then it can easily be verified that a + b = Σ (a_i + b_i) x_i and ca = Σ (c a_i) x_i.
However, it may be the case that the same vector will have different representations in different bases. For example, in R², we could have a basis B1 = {(1, 0), (0, 1)} and a basis B2 = {(1, 0), (1, 2)}. Because B1 is the standard basis, the vector (2, 3) is represented as just (2, 3) using basis B1, but it is represented as (1/2, 3/2) in basis B2, since (2, 3) = (1/2)·(1, 0) + (3/2)·(1, 2).
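Finding the representation in a non-orthonormal basis means solving a small linear system; in two dimensions, Cramer's rule is enough. A sketch (the helper name is ours):

```python
from fractions import Fraction

def coords_2d(b1, b2, v):
    """Coefficients (c1, c2) with c1*b1 + c2*b2 = v, found by Cramer's rule."""
    det = b1[0] * b2[1] - b1[1] * b2[0]
    assert det != 0, "basis vectors must be linearly independent"
    c1 = Fraction(v[0] * b2[1] - v[1] * b2[0], det)
    c2 = Fraction(b1[0] * v[1] - b1[1] * v[0], det)
    return c1, c2

print(coords_2d((1, 0), (0, 1), (2, 3)))   # c1 = 2,   c2 = 3   (standard basis)
print(coords_2d((1, 0), (1, 2), (2, 3)))   # c1 = 1/2, c2 = 3/2 (the second basis)
```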
Both bases above have the same size. This is not an accident; if a vector space has a finite basis, then all bases have the same size. We'll state this as a theorem, too:
Theorem 13.5.2. Let x_1, …, x_n and y_1, …, y_m be two finite bases of the same vector space V. Then n = m.

Proof. Assume without loss of generality that n ≤ m. We will show how to replace elements of the x_i basis with elements of the y_i basis to produce a new basis consisting only of y_1, …, y_n. Start by considering the sequence y_1, x_1, …, x_n. This sequence is not independent, since y_1 can be expressed as a linear combination of the x_i (they're a basis). So there is some x_i that can be expressed as a linear combination of y_1, x_1, …, x_{i-1}. Swap this x_i out to get a new sequence y_1, x_1, …, x_{i-1}, x_{i+1}, …, x_n. This new sequence is also a basis, because (a) any z can be expressed as a linear combination of these vectors by substituting the expansion of x_i into the expansion of z in the original basis, and (b) it's independent, because if there were some nonzero linear combination that produces 0, we could substitute the expansion of x_i to get a nonzero linear combination of the original basis that produces 0 as well. Now continue by constructing the sequence y_2, y_1, x_1, …, x_{i-1}, x_{i+1}, …, x_n, and arguing that some x_{i′} in this sequence must be expressible as a combination of earlier terms (it can't be y_1, because then y_2, y_1 would not be independent), and drop this x_{i′}. By repeating this process we can eventually eliminate all the x_i, leaving the basis y_n, …, y_1. But then any y_k for k > n would be a linear combination of this basis, so we must have m = n.
The size of any basis of a vector space is called the dimension of the
space.
13.6 Linear transformations
Proof. We'll use the following trick for extracting entries of a matrix by multiplication. Let M be an n × m matrix, and let e^i be a column vector with e^i_j = 1 if i = j and 0 otherwise.⁷ Now observe that

(e^i)^⊤ M e^j = Σ_k e^i_k (M e^j)_k = (M e^j)_i = Σ_k M_ik e^j_k = M_ij.

So given a particular linear f, we will now define M by the rule M_ij = (e^i)^⊤ f(e^j). It is not hard to see that this gives f(e^j) = M e^j for each basis vector j, since multiplying by (e^i)^⊤ grabs the i-th coordinate in each case. To show that M x = f(x) for all x, decompose each x as Σ_k c_k e^k. Now compute f(x) = f(Σ_k c_k e^k) = Σ_k c_k f(e^k) = Σ_k c_k M e^k = M (Σ_k c_k e^k) = M x.
13.6.1 Composition
13.6.2 Matrix-vector products
When we multiply a matrix and a column vector, we can think of the matrix as a sequence of row or column vectors and look at how the column vector operates on these sequences.

Let M_{i·} be the i-th row of the matrix (the dot is a stand-in for the missing column index). Then we have

(M x)_i = Σ_k M_ik x_k = M_{i·} · x,

so each coordinate of M x is the dot product of the corresponding row of M with x. Alternatively, writing M_{·k} for the k-th column of M, we have

(M x)_i = Σ_k M_ik x_k = Σ_k (M_{·k})_i x_k,

so M x is a weighted sum of the columns of M, with the weights supplied by the coordinates of x. For example:

    | 1 2 3 | | 1 |     | 1 |     | 2 |     | 3 |   | 1 |   | 2 |   | 6  |   | 9  |
    | 4 5 6 | | 1 | = 1·| 4 | + 1·| 5 | + 2·| 6 | = | 4 | + | 5 | + | 12 | = | 21 |
              | 2 |
The set {M x} for all x is thus equal to the span of the columns of M ;
it is called the column space of M .
For yM, where y is a row vector, similar properties hold: we can think of yM either as a row vector of dot products of y with columns of M, or as a weighted sum of the rows of M; the proof follows immediately from the above facts about a product of a matrix and a column vector and the fact that yM = (M^⊤ y^⊤)^⊤. The span of the rows of M is called the row space of M, and equals the set {yM} of all results of multiplying a row vector by M.
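Both readings of M x compute the same thing; a sketch contrasting them:

```python
def matvec_by_rows(M, x):
    """(Mx)_i is the dot product of row i of M with x."""
    return [sum(m * xk for m, xk in zip(row, x)) for row in M]

def matvec_by_columns(M, x):
    """Mx as a weighted sum of the columns of M, weights taken from x."""
    result = [0] * len(M)
    for k, weight in enumerate(x):
        for i in range(len(M)):
            result[i] += weight * M[i][k]
    return result

M = [[1, 2, 3], [4, 5, 6]]
x = [1, 1, 2]
print(matvec_by_rows(M, x))      # [9, 21]
print(matvec_by_columns(M, x))   # [9, 21]
```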
13.6.3 Geometric interpretation
Geometrically, linear transformations can be thought of as changing the basis vectors for a space: they keep the origin in the same place, move the
basis vectors, and rearrange all the other vectors so that they have the same
coordinates in terms of the new basis vectors. These new basis vectors are
easily read off of the matrix representing the linear transformation, since
they are just the columns of the matrix. So in this sense all linear transformations are transformations from some vector space to the column space of
some matrix.8
This property makes linear transformations popular in graphics, where
they can be used to represent a wide variety of transformations of images.
Below is a picture of an untransformed image (top left) together with two
standard basis vectors labeled x and y. In each of the other images, we
have shifted the basis vectors using a linear transformation, and carried the
image along with it.9
⁸ The situation is slightly more complicated for infinite-dimensional vector spaces, but we will try to avoid them.
⁹ The thing in the picture is a Pokémon known as a Wooper, which evolves into a Quagsire at level 20. This evolution is not a linear transformation.
[The pictures of the original image and its transformed variants, with the basis vectors x and y marked, are not reproduced here.]
Note that in all of these transformations, the origin stays in the same place. If you want to move an image, you need to add a vector to everything. This gives an affine transformation, which is any transformation that can be written as f(x) = Ax + b for some matrix A and column vector b. One nifty thing about affine transformations is that, like linear transformations, they compose to produce new transformations of the same kind: A(Cx + d) + b = (AC)x + (Ad + b).
Many two-dimensional linear transformations have standard names. The
simplest transformation is scaling, where each axis is scaled by a constant,
but the overall orientation of the image is preserved. In the picture above,
the top right image is scaled by the same constant in both directions and
the second-from-the-bottom image is scaled differently in each direction.
Recall that the product M x corresponds to taking a weighted sum of the columns of M, with the weights supplied by the coordinates of x. So a shear that leaves the x-axis fixed while tilting vertical lines sideways is represented by the matrix

    | 1 c |
    | 0 1 |.
Here the x vector is preserved: (1, 0) maps to the first column (1, 0), but
the y vector is given a new component in the x direction of c, corresponding
to the shear. If we also flipped or scaled the image at the same time that
we sheared it, we could represent this by putting values other than 1 on the
diagonal.
For a rotation, we will need some trigonometric functions to compute the new coordinates of the axes as a function of the angle we rotate the image by. The convention is that we rotate counterclockwise: so in the figure above, the rotated image is rotated counterclockwise approximately 315° (or, equivalently, clockwise 45°). If θ is the angle of rotation, the rotation matrix is given by

    | cos θ  -sin θ |
    | sin θ   cos θ |.
13.6.4 Rank

The dimension of the column space of a matrix (or, equivalently, the dimension of the range of the corresponding linear transformation) is called the rank of the matrix.
13.6.5 Projections
A line consists of all points that are scalar multiples of some fixed vector b. Given any other vector x, we want to extract all of the parts of x that lie in the direction of b and throw everything else away. In particular, we want to find a vector y = cb for some scalar c, such that (x - y) · b = 0. This is enough information to solve for c.

We have (x - cb) · b = 0, so x · b = c(b · b), or c = (x · b)/(b · b). So the projection of x onto the subspace {cb | c ∈ R} is given by y = b(x · b)/(b · b), or y = b(x · b)/‖b‖². If b is normal (i.e., if ‖b‖ = 1), then we can leave out the denominator; this is one reason we like orthonormal bases so much.

Why is this the right choice to minimize distance? Suppose we pick some other vector db instead. Then the points x, cb, and db form a right triangle with the right angle at cb, and the distance from x to db is ‖x - db‖ = √(‖x - cb‖² + ‖cb - db‖²) ≥ ‖x - cb‖.
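The projection formula is a one-liner; a sketch in exact arithmetic that also checks the defining orthogonality condition:

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(x, b):
    """Projection of x onto the line {cb}: y = b (x.b)/(b.b)."""
    c = Fraction(dot(x, b), dot(b, b))
    return tuple(c * bi for bi in b)

x, b = (3, 4), (1, 2)
y = project(x, b)                        # (11/5, 22/5): the part of x along b
residual = tuple(a - c for a, c in zip(x, y))
print(dot(residual, b))                  # 0: (x - y) is orthogonal to b
```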
13.7 Further reading
Chapter 14
Finite fields
Our goal here is to find computationally useful structures that act enough like the rational numbers Q or the real numbers R that we can do arithmetic in them, but that are small enough that we can describe any element of the structure uniquely with a finite number of bits. Such structures are called finite fields.

An example of a finite field is Z_p, the integers mod p (see §8.4). These finite fields are inconvenient for computers, which like to count in bits and prefer numbers that look like 2^n to horrible nasty primes. So we'd really like finite fields of size 2^n for various n, particularly if the operations of addition, multiplication, etc. have a cheap implementation in terms of sequences of bits. To get these, we will show how to construct a finite field of size p^n for any prime p and positive integer n, and then let p = 2.
14.1 A magic trick
14.2 Fields

A field is a set F together with two operations + and · that behave like addition and multiplication in the rationals or real numbers. Formally, this means that:
1. Addition is associative: (x + y) + z = x + (y + z) for all x, y, z in F.

2. There is an additive identity 0 such that 0 + x = x + 0 = x for all x in F.

3. Every x in F has an additive inverse -x such that x + (-x) = (-x) + x = 0.

4. Addition is commutative: x + y = y + x for all x, y in F.

5. Multiplication distributes over addition: x·(y + z) = (x·y + x·z) and (y + z)·x = (y·x + z·x) for all x, y, z in F.

6. Multiplication is associative: (x·y)·z = x·(y·z) for all x, y, z in F.

7. There is a multiplicative identity 1 such that 1·x = x·1 = x for all x in F.

8. Multiplication is commutative: x·y = y·x for all x, y in F.

9. Every x in F \ {0} has a multiplicative inverse x^{-1} such that x·x^{-1} = x^{-1}·x = 1.
Some structures fail to satisfy all of these axioms but are still interesting enough to be given names. A structure that satisfies 1-3 is called a group; 1-4 is an abelian group or commutative group; 1-7 is a ring; 1-8 is a commutative ring. In the case of groups and abelian groups there is only one operation +. There are also more exotic names for structures satisfying other subsets of the axioms.³

Some examples of fields: R, Q, C, and Z_p where p is prime. We will be particularly interested in Z_p, since we are looking for finite fields that can fit inside a computer.

The integers Z are an example of a commutative ring, as is Z_m for m > 1. Square matrices of fixed dimension greater than 1 are an example of a non-commutative ring.
³ A set with one operation that does not necessarily satisfy any axioms is a magma. If the operation is associative, it's a semigroup, and if there is also an identity (but not necessarily inverses), it's a monoid. For example, the set of nonempty strings with + interpreted as concatenation forms a semigroup, and throwing in the empty string as well gives a monoid.

Weaker versions of rings knock out the multiplicative identity (a pseudo-ring or rng) or negation (a semiring or rig). An example of a semiring that is actually useful is the (max, +) semiring, which uses max for addition and + (which distributes over max) for multiplication; this turns out to be handy for representing scheduling problems.
14.3 Polynomials over a field
Any field F generates a polynomial ring F[x] consisting of all polynomials in the variable x with coefficients in F. For example, if F = Q, some elements of Q[x] are 3/5, (22/7)x² + 12, 9003x⁴¹⁷ - (32/3)x⁴ + x², etc. Addition and multiplication are done exactly as you'd expect, by applying the distributive law and combining like terms: (x + 1)·(x² + 3/5) = x·x² + x·(3/5) + 1·x² + 1·(3/5) = x³ + x² + (3/5)x + (3/5).

The degree deg(p) of a polynomial p in F[x] is the exponent on the leading term, the term with a nonzero coefficient that has the largest exponent. Examples: deg(x² + 1) = 2, deg(17) = 0. For 0, which doesn't have any terms with nonzero coefficients, the degree is taken to be -∞. Degrees add when multiplying polynomials: deg((x² + 1)(x + 5)) = deg(x² + 1) + deg(x + 5) = 2 + 1 = 3; this is just a consequence of the leading terms in the polynomials we are multiplying producing the leading term of the new polynomial. For addition, we have deg(p + q) ≤ max(deg(p), deg(q)), but we can't guarantee equality (maybe the leading terms cancel).

Because F[x] is a ring, we can't do division the way we do it in a field like R, but we can do division the way we do it in a ring like Z, leaving a remainder. The equivalent of the integer division algorithm for Z is:
Theorem 14.3.1 (Division algorithm for polynomials). Given a polynomial f and a nonzero polynomial g in F[x], there are unique polynomials q and r such that f = q·g + r and deg(r) < deg(g).

Proof. The proof is by induction on deg(f). If deg(f) < deg(g), let q = 0 and r = f. If deg(f) is larger, let m = deg(f), n = deg(g), and q_{m-n} = f_m g_n^{-1}. Then q_{m-n} x^{m-n} g is a degree-m polynomial with leading term f_m x^m. Subtracting this from f gives a polynomial f′ of degree at most m - 1, and by the induction hypothesis there exist q′, r such that f′ = q′·g + r and deg(r) < deg(g). Let q = q_{m-n} x^{m-n} + q′; then f = f′ + q_{m-n} x^{m-n} g = (q_{m-n} x^{m-n} + q′)·g + r = q·g + r.
The essential idea of the proof is that we are finding q and r using the same process of long division as we use for integers. For example, dividing x^2 + 3x + 5 by x + 2 in Q[x] gives quotient x + 1 and remainder 3: subtracting (x + 2) · x = x^2 + 2x leaves x + 5, and subtracting (x + 2) · 1 = x + 2 from that leaves 3, so x^2 + 3x + 5 = (x + 1)(x + 2) + 3.
14.4

Here is the multiplication table for Z_2[x]/(x^2 + x + 1), whose elements are the four polynomials of degree less than 2 with coefficients in Z_2; products are reduced using x^2 = x + 1:

  ·    | 0   1     x     x+1
  -----+---------------------
  0    | 0   0     0     0
  1    | 0   1     x     x+1
  x    | 0   x     x+1   1
  x+1  | 0   x+1   1     x
We can see that every nonzero element has an inverse by looking for ones in the table; e.g., 1 · 1 = 1 means 1 is its own inverse and x · (x+1) = x^2 + x = 1 means that x and x + 1 are inverses of each other.
Here's the same thing for Z_2[x]/(x^3 + x + 1):
  ·        | 0   1         x         x+1       x^2       x^2+1     x^2+x     x^2+x+1
  ---------+------------------------------------------------------------------------
  0        | 0   0         0         0         0         0         0         0
  1        | 0   1         x         x+1       x^2       x^2+1     x^2+x     x^2+x+1
  x        | 0   x         x^2       x^2+x     x+1       1         x^2+x+1   x^2+1
  x+1      | 0   x+1       x^2+x     x^2+1     x^2+x+1   x^2       1         x
  x^2      | 0   x^2       x+1       x^2+x+1   x^2+x     x         x^2+1     1
  x^2+1    | 0   x^2+1     1         x^2       x         x^2+x+1   x+1       x^2+x
  x^2+x    | 0   x^2+x     x^2+x+1   1         x^2+1     x+1       x         x^2
  x^2+x+1  | 0   x^2+x+1   x^2+1     x         1         x^2+x     x^2       x+1

Here products are reduced using x^3 = x + 1.
Note that we now have 2^3 = 8 elements. In general, if we take Z_p[x] modulo an irreducible degree-n polynomial, we will get a field with p^n elements. These turn out to be all the possible finite fields, with exactly one finite field for each number of the form p^n (up to isomorphism, which means that we consider two fields equivalent if there is a bijection between them that preserves + and ·). We can refer to a finite field of size p^n abstractly as GF(p^n), which is an abbreviation for the Galois field of size p^n.
⁴ This is not an accident: any extension field acts like a vector space over its base field.
14.5 Applications
14.5.1 Linear-feedback shift registers

Consider a 4-bit register that, at each step, shifts its contents left one position and then, if the leading (fifth) bit of the result is a 1, XORs the result with 11001. Starting from two example states, the register might do:

1101 (initial value)
11010 (after shift)
0011 (after XOR with 11001)

or

0110 (initial value)
01100 (after shift)
1100 (no XOR needed)
If we write our initial value as r3 r2 r1 r0 , the shift produces a new value
r3 r2 r1 r0 0. Then XORing with 11001 has three effects: (a) it removes a
leading 1 if present; (b) it sets the rightmost bit to r3 ; and (c) it flips the
new leftmost bit if r3 = 1. Steps (a) and (b) turn the shift into a rotation.
Step (c) is the mysterious flip from our sequence generator. So in fact what
our magic sequence generator was doing was just computing all the powers
of x in a particular finite field.
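The shift-and-conditionally-XOR step is exactly multiplication by x modulo x^4 + x^3 + 1, the polynomial encoded by the bit pattern 11001. A minimal sketch in Python, assuming that particular polynomial:

```python
def lfsr_states(poly=0b11001, width=4, start=0b0001):
    """Successive states of the linear-feedback shift register above:
    each step multiplies the state by x modulo the polynomial over Z_2
    (here x^4 + x^3 + 1, encoded as the bit pattern 11001)."""
    state = start
    states = []
    while True:
        states.append(state)
        state <<= 1              # shift: multiply by x
        if state >> width:       # a leading 1 appeared...
            state ^= poly        # ...so reduce mod the polynomial
        if state == start:
            return states
```

Because x^4 + x^3 + 1 is primitive, the register cycles through all 15 nonzero states before repeating.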
As in Z_p, these powers of an element bounce around unpredictably, which makes them a useful (though cryptographically very weak) pseudorandom
number generator. Because high-speed linear-feedback shift registers are
very cheap to implement in hardware, they are used in applications where
a pre-programmed, statistically smooth sequence of bits is needed, as in the
Global Positioning System and to scramble electrical signals in computers
to reduce radio-frequency interference.
14.5.2 Checksums

For example, feeding the message bits 100101 into the register, so that we shift in one message bit at a time and XOR with 11001 whenever the leading bit of the result is 1, gives:

0     (start with 0)
1     (shift in 1) (no XOR)
10    (shift in 0) (no XOR)
100   (shift in 0) (no XOR)
1001  (shift in 1) (no XOR)
10010 (shift in 0)
01011 (XOR with 11001)
10111 (shift in 1)
01110 (XOR with 11001)
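This shift-in-and-reduce loop computes the remainder of the message polynomial modulo the divisor polynomial over Z_2; a sketch, where the message 100101 and divisor 11001 are the values from the trace above:

```python
def crc_remainder(message_bits, divisor=0b11001, deg=4):
    """Polynomial division over Z_2 by shift-and-XOR, as in the trace
    above: shift in one message bit at a time, and XOR out the leading
    1 whenever the register overflows past deg bits."""
    reg = 0
    for bit in message_bits:
        reg = (reg << 1) | bit   # shift in the next message bit
        if reg >> deg:           # leading 1 past the top bit...
            reg ^= divisor       # ...divide it out
    return reg                   # the remainder: the checksum
```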
14.5.3 Cryptography
Appendix A
Sample assignments
These are sample assignments from the Fall 2013 version of CPSC 202.
A.1 Bureaucratic part
Send me email! My address is [email protected].
In your message, include:
1. Your name.
2. Your status: whether you are an undergraduate, grad student, auditor,
etc.
3. Anything else you'd like to say.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
A.1.1 Tautologies

3. (P ∨ Q) ↔ (Q ∨ (P ↔ (Q → R))).
Solution
For each solution, we give the required truth-table solution first, and then
attempt to give some intuition for why it works. The intuition is merely an
explanation of what is going on and is not required for your solutions.
1. Here is the truth table:

   P | ¬P | ¬P → P | (¬P → P) → P
   --+----+--------+-------------
   0 | 1  |   0    |      1
   1 | 0  |   1    |      1
This is a little less intuitive than the first case. A reasonable story might be that the proposition is true if P is true, so for it to be false, P must be false. But then (P ∨ Q) reduces to Q, and Q ↔ Q is true.
3. (P ∨ Q) ↔ (Q ∨ (P ↔ (Q → R))).

   P Q R | P∨Q | Q→R | P↔(Q→R) | Q∨(P↔(Q→R)) | (P∨Q)↔(Q∨(P↔(Q→R)))
   ------+-----+-----+---------+-------------+--------------------
   0 0 0 |  0  |  1  |    0    |      0      |         1
   0 0 1 |  0  |  1  |    0    |      0      |         1
   0 1 0 |  1  |  0  |    1    |      1      |         1
   0 1 1 |  1  |  1  |    0    |      1      |         1
   1 0 0 |  1  |  1  |    1    |      1      |         1
   1 0 1 |  1  |  1  |    1    |      1      |         1
   1 1 0 |  1  |  0  |    0    |      1      |         1
   1 1 1 |  1  |  1  |    1    |      1      |         1
I have no intuition whatsoever for why this is true. In fact, all three of these tautologies were plucked from long lists of machine-generated tautologies, and three variables is enough to start getting tautologies that don't have good stories.

It's possible that one could prove this more succinctly by arguing by cases: if Q is true, both sides of the biconditional are true, and if Q is not true, then Q → R is always true, so P ↔ (Q → R) becomes just P, making both sides equal. But sometimes it is more direct (and possibly less error-prone) just to shut up and calculate.
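The shut-up-and-calculate approach mechanizes easily; a sketch of a brute-force checker, with the third tautology written out using hypothetical helper lambdas:

```python
from itertools import product

def is_tautology(f, nvars=3):
    """True iff f holds under every one of the 2^nvars truth assignments."""
    return all(f(*vals) for vals in product([False, True], repeat=nvars))

implies = lambda a, b: (not a) or b     # a -> b
iff = lambda a, b: a == b               # a <-> b

# The third tautology: (P v Q) <-> (Q v (P <-> (Q -> R))).
taut3 = lambda P, Q, R: iff(P or Q, Q or iff(P, implies(Q, R)))
```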
A.1.2 Positively equivalent

Show how each of the following propositions can be simplified using equivalences from Table 2.2 to a single operation applied directly to P and Q:

1. ¬(P → ¬Q).

2. ¬((P ∧ ¬Q) ∨ (¬P ∧ Q)).
Solution

1.
   ¬(P → ¬Q) ≡ ¬(¬P ∨ ¬Q)
             ≡ ¬¬P ∧ ¬¬Q
             ≡ P ∧ Q.

2.
   ¬((P ∧ ¬Q) ∨ (¬P ∧ Q)) ≡ ¬(P ∧ ¬Q) ∧ ¬(¬P ∧ Q)
                          ≡ (¬P ∨ ¬¬Q) ∧ (¬¬P ∨ ¬Q)
                          ≡ (¬P ∨ Q) ∧ (P ∨ ¬Q)
                          ≡ (¬P ∨ Q) ∧ (¬Q ∨ P)
                          ≡ (P → Q) ∧ (Q → P)
                          ≡ P ↔ Q.
A.1.3 A theory of leadership

Suppose we have a predicate taller(x, y), meaning that x was taller than y, and a predicate successful(x), meaning that x was successful as a leader. We also have all the usual tools of predicate logic: ∀, ∃, ¬, ∧, ∨, =, and so forth, and can refer to specific leaders by name.
Express each of the following statements in mathematical form. Note
that these statements are not connected, and no guarantees are made about
whether any of them are actually true.
1. Lincoln was the tallest leader.
2. Napoleon was at least as tall as any unsuccessful leader.
3. No two leaders had the same height.
Solution
1. The easiest way to write this is probably ∀x : ¬taller(x, Lincoln). There is a possible issue here, since this version says that nobody is taller than Lincoln, but it may be that somebody is the same height.¹ A stronger claim is ∀x : (x ≠ Lincoln) → taller(Lincoln, x). Both solutions (and their various logical equivalents) are acceptable.

2. ∀x : ¬successful(x) → ¬taller(x, Napoleon).

3. ∀x ∀y : ¬(x = y) → (taller(x, y) ∨ taller(y, x)). Equivalently, ∀x ∀y : x ≠ y → (taller(x, y) ∨ taller(y, x)). If we assume that taller(x, y) and taller(y, x) are mutually exclusive, then ∀x ∀y : ¬(x = y) ↔ (taller(x, y) ∨ taller(y, x)) also works.
¹ At least one respected English-language novelist [Say33] has had a character claim that it is well understood that stating that a particular brand of toothpaste is the most effective is not a falsehood even if it is equally effective with other brands (which are also the most effective), but this understanding is not universal. The use of "the" also suggests that Lincoln is unique among tallest leaders.

A.2

A.2.1
A.2.2 A distributive law
Show that the following identities hold for all sets A, B, and C:
1. A × (B ∪ C) = (A × B) ∪ (A × C).

2. A × (B ∩ C) = (A × B) ∩ (A × C).
Solution
1. Let (a, x) ∈ A × (B ∪ C). Then a ∈ A and x ∈ B ∪ C. If x ∈ B, then (a, x) ∈ A × B; alternatively, if x ∈ C, then (a, x) ∈ A × C. In either case, (a, x) ∈ (A × B) ∪ (A × C).
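These identities are easy to spot-check mechanically; a small sanity check on example sets (evidence, not a proof; the sets A, B, C here are arbitrary choices):

```python
from itertools import product

A, B, C = {1, 2}, {2, 3}, {3, 4}
cross = lambda X, Y: set(product(X, Y))   # Cartesian product as a set of pairs

assert cross(A, B | C) == cross(A, B) | cross(A, C)   # product distributes over union
assert cross(A, B & C) == cross(A, B) & cross(A, C)   # product distributes over intersection
```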
A.2.3 Exponents

Let A be a set with |A| = n > 0. What is the size of each of the following sets of functions? Justify your answers.

1. A^∅.

2. ∅^A.

3. ∅^∅.
Solution
A.3

A.3.1

A.3.2

g(x) = { f(x)  if x ∈ A, and
       { x     if x ∈ C.
A.3.3

A.4

A.4.1

Let f : N → N be defined by

f(0) = 2,
f(n + 1) = f(n) · f(n) − 1.
Show that f(n) > 2^n for all n ∈ N.

Solution

The proof is by induction on n, but we have to be a little careful for small values. We'll treat n = 0 and n = 1 as special cases, and start the induction at 2.

For n = 0, we have f(0) = 2 > 1 = 2^0.

For n = 1, we have f(1) = f(0) · f(0) − 1 = 2 · 2 − 1 = 3 > 2 = 2^1.

For n = 2, we have f(2) = f(1) · f(1) − 1 = 3 · 3 − 1 = 8 > 4 = 2^2.

For the induction step, we want to show that, for all n ≥ 2, if f(n) > 2^n, then f(n + 1) = f(n) · f(n) − 1 > 2^{n+1}. Compute

f(n + 1) = f(n) · f(n) − 1
         > 2^n · 2^n − 1
         ≥ 2^n · 4 − 1
         = 2^{n+1} + 2^{n+1} − 1
         > 2^{n+1}.

(The middle step uses 2^n ≥ 4 when n ≥ 2.)

The principle of induction gives us that f(n) > 2^n for all n ≥ 2, and we've already covered n = 0 and n = 1 as special cases, so f(n) > 2^n for all n ∈ N.
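A quick computational spot-check of the claim (the loop just iterates the recurrence):

```python
def f(n):
    """f(0) = 2, f(n+1) = f(n)*f(n) - 1, as defined above."""
    v = 2
    for _ in range(n):
        v = v * v - 1
    return v

assert [f(n) for n in range(3)] == [2, 3, 8]
assert all(f(n) > 2 ** n for n in range(10))
```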
A.4.2 A slow-growing set

Let

A_0 = {3, 4, 5},
A_{n+1} = A_n ∪ { Σ_{x ∈ A_n} x }.
Using the induction hypothesis that S_n = 12 · 2^n, we can compute

S_{n+1} = Σ_{x ∈ A_{n+1}} x
        = (Σ_{x ∈ A_n} x) + S_n
        = S_n + S_n
        = 12 · 2^n + 12 · 2^n
        = 12 · (2^n + 2^n)
        = 12 · 2^{n+1}.
This completes the induction argument and the proof.
A.4.3 Double factorials

Recall that the factorial of n is

n! = Π_{i=1}^{n} i = 1 · 2 · 3 · … · n.   (A.4.1)

The double factorial multiplies together every other number instead:

n!! = Π_{i=0}^{⌈n/2⌉−1} (n − 2i).   (A.4.2)

Show that there is some n_0 such that

(2n)!! ≤ (n!)^2   (A.4.3)

for all n ≥ n_0.

Solution

First let's figure out what n_0 has to be. We have

(2 · 0)!! = 1                (0!)^2 = 1 · 1 = 1
(2 · 1)!! = 2                (1!)^2 = 1 · 1 = 1
(2 · 2)!! = 4 · 2 = 8        (2!)^2 = 2 · 2 = 4
(2 · 3)!! = 6 · 4 · 2 = 48   (3!)^2 = 6 · 6 = 36
(2 · 4)!! = 8 · 6 · 4 · 2 = 384   (4!)^2 = 24 · 24 = 576

so for positive n the inequality first holds at n = 4, and we guess n_0 = 4.
To see that it keeps holding, first observe that

(2n)!! = Π_{i=0}^{n−1} (2n − 2i) = Π_{i=1}^{n} 2i.

Now proceed by induction: assuming (2n)!! ≤ (n!)^2 and n ≥ 1,

(2(n + 1))!! = Π_{i=1}^{n+1} 2i
             = (Π_{i=1}^{n} 2i) · (2(n + 1))
             = ((2n)!!) · 2 · (n + 1)
             ≤ (n!)^2 · (n + 1) · (n + 1)
             = (n! · (n + 1))^2
             = ((n + 1)!)^2.
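Both the product identity (2n)!! = 2^n · n! read off above and the inequality are easy to confirm numerically:

```python
from math import factorial, prod

def dfact(n):
    """n!! = n * (n-2) * (n-4) * ... down to 1 or 2."""
    return prod(range(n, 0, -2))

assert all(dfact(2 * n) == 2 ** n * factorial(n) for n in range(12))
assert all(dfact(2 * n) <= factorial(n) ** 2 for n in range(4, 14))
```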
A.5

A.5.1

Let f : N → N be defined by

f(n) = { 1  if n is odd, and
       { n  if n is even.
A.5.2

A.5.3

A.6

A.6.1
Solution

Let n be composite. Then there exist natural numbers a, b ≥ 2 such that n = ab. Assume without loss of generality that a ≤ b.

For convenience, let k = ⌊n/2⌋. Since b = n/a and a ≥ 2, b ≤ n/2; but b is an integer, so b ≤ n/2 implies b ≤ ⌊n/2⌋ = k. It follows that both a and b are at most k.

We now consider two cases:

1. If a ≠ b, then both a and b appear as distinct factors in k!. So k! = ab · Π_{1 ≤ i ≤ k, i ∉ {a,b}} i, giving ab | k!, which means n | k! and k! = 0 (mod n).
A.6.2

A.6.3 Equivalence relations
Solution

Proof: The direct approach is to show that T is reflexive, symmetric, and transitive:

1. Reflexive: For any x, xRx and xSx, so xTx.

2. Symmetric: Suppose xTy. Then xRy and xSy. Since R and S are symmetric, yRx and ySx. But then yTx.

3. Transitive: Let xTy and yTz. Then xRy and yRz implies xRz, and similarly xSy and ySz implies xSz. So xRz and xSz, giving xTz.
Alternative proof: It's also possible to show this using one of the alternative characterizations of an equivalence relation from Theorem 9.4.1. Since R and S are equivalence relations, there exist sets B and C and functions f : A → B and g : A → C such that xRy if and only if f(x) = f(y) and xSy if and only if g(x) = g(y). Now consider the function h : A → B × C defined by h(x) = (f(x), g(x)). Then h(x) = h(y) if and only if (f(x), g(x)) = (f(y), g(y)), which holds if and only if f(x) = f(y) and g(x) = g(y). But this last condition holds if and only if xRy and xSy, the definition of xTy. So we have h(x) = h(y) if and only if xTy, and T is an equivalence relation.
A.7

A.7.1

Show that if

f(x ∨ y) = f(x) ∧ f(y)

for all x, y ∈ S, then

x ≤ y implies f(y) ≤ f(x)

for all x, y ∈ S.
Solution

Let S, T, f be such that f(x ∨ y) = f(x) ∧ f(y) for all x, y ∈ S. Now suppose that we are given some x, y ∈ S with x ≤ y.

Recall that x ∨ y is the minimum z greater than or equal to both x and y; so when x ≤ y, we have y ≥ x and y ≥ y, and for any z with z ≥ x and z ≥ y we have z ≥ y. Hence y = x ∨ y. From the assumption on f we have f(y) = f(x ∨ y) = f(x) ∧ f(y).

Now use the fact that f(x) ∧ f(y) is less than or equal to both f(x) and f(y) to get f(y) = f(x) ∧ f(y) ≤ f(x).
A.7.2

A.7.3

For each pair of natural numbers m and k with m ≥ 2 and 0 < k < m, let S_{m,k} be the graph whose vertices are the m elements of Z_m and whose edges consist of all pairs (i, i + k), where the addition is performed mod m. Some examples are given in Figure A.1.

Give a simple rule for determining, based on m and k, whether or not S_{m,k} is connected, and prove that your rule works.
Solution
The rule is that S_{m,k} is connected if and only if gcd(m, k) = 1.
To show that this is the case, consider the connected component that
contains 0; in other words, the set of all nodes v for which there is a path
from 0 to v.
Figure A.1: The graphs S_{5,1}, S_{5,2}, and S_{8,2}.
A.8

A.8.1

For odd n the count comes to

n! · (n + 1)/8,

and for even n it likewise comes to

n! · (n + 1)/8.
So we get the same expression in each case. We can simplify this further to get

(n + 1)!/8   (A.8.1)

two-path graphs on n ≥ 3 vertices.

The simplicity of (A.8.1) suggests that there ought to be a combinatorial proof of this result, where we take a two-path graph and three bits of additional information and bijectively construct a permutation of n + 1 values.
The basic idea is to paste the two paths together in some order with n
between them, with some special handling of one-element paths to cover
permutations that put n at one end of the other. Miraculously, this special
handling exactly compensates for the fact that one-element paths have no
sense of direction.
1. For any two-path graph, we can order the two components on which
contains 0 and which doesnt. Similarly, we can order each path by
starting with its smaller endpoint.
2. To construct a permutation on [n+1], use one bit to choose the order of
the two components. If both components have two or more elements,
use two bits to choose whether to include them in their original order or
the reverse, and put n between the two components. If one component
has only one element x, use its bit instead to determine whether we
include x, n or n, x in our permutation.
In either case we can reconstruct the original two-path graph uniquely by
splitting the permutation at n, or by splitting off the immediate neighbor
of n if n is an endpoint; this shows that the construction is surjective.
Furthermore changing any of the three bits changes the permutation we get;
together with the observation that we can recover the two-path graph, this
298
shows that the construction is also injective. So we have that the number of
permutations on n + 1 values is 23 = 8 times the number of two-path graphs
on n vertices, giving (n + 1)!/8 two-path graphs as claimed.
(For example, if our components are 0, 1, and 2, 3, 4, and the bits are
101, the resulting permuation is 4, 3, 2, 5, 0, 1. If the components are instead
3 and 2, 0, 4, 1, and the bits are 011, then we get 5, 3, 1, 4, 0, 2. In either
case we can recover the original two-path graph by deleting 5 and splitting
according to the rule.)
Both of these proofs are pretty tricky. The brute-force counting approach
may be less prone to error, and the combinatorial proof probably wouldnt
occur to anybody who hadnt already seen the answer.
A.8.2 Even teams

Solution

We'll take the hint, and let E(n) be the number of team assignments that make k even and U(n) be the number that make k uneven, or odd. Then we can compute
E(n) − U(n) = Σ_{0≤k≤n, k even} C(n,k) 2^k − Σ_{0≤k≤n, k odd} C(n,k) 2^k
            = Σ_{k=0}^{n} (−1)^k C(n,k) 2^k
            = Σ_{k=0}^{n} C(n,k) (−2)^k
            = (1 + (−2))^n
            = (−1)^n.

We also have E(n) + U(n) = Σ_{k=0}^{n} C(n,k) 2^k = (1 + 2)^n = 3^n. Adding the two equations and solving for E(n) gives

E(n) = (3^n + (−1)^n)/2.   (A.8.2)
To make sure that we didn't make any mistakes, it may be helpful to check a few small cases. For n = 0, we have one even split (nobody on either team), and (3^0 + (−1)^0)/2 = 2/2 = 1. For n = 1, we have the same even split, and (3^1 + (−1)^1)/2 = (3 − 1)/2 = 1. For n = 2, we get five even splits ((∅, ∅), ({x}, {y}), ({y}, {x}), ({x, y}, ∅), (∅, {x, y})), and (3^2 + (−1)^2)/2 = (9 + 1)/2 = 5. This is not a proof that (A.8.2) will keep working forever, but it does suggest that we didn't screw up in some obvious way.
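The same check can be pushed further by brute force; a sketch, assuming the interpretation that each of n people is assigned to one of two teams or left out, with k the number assigned to a team:

```python
from itertools import product

def even_splits(n):
    """Count assignments of n people to team 0, team 1, or 'out' (2)
    in which an even number of people end up on a team."""
    return sum(1 for a in product(range(3), repeat=n)
               if sum(x != 2 for x in a) % 2 == 0)

assert all(even_splits(n) == (3 ** n + (-1) ** n) // 2 for n in range(8))
```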
A.8.3 Inflected sequences
|S| = |T| = Σ_{i=0}^{n−1} (i + 1)^2 = Σ_{i=1}^{n} i^2 = (1/3)n^3 + (1/2)n^2 + (1/6)n.

The last step uses (6.4.2).

The number we want is |S ∪ T| = |S| + |T| − |S ∩ T|. For a triple to be in S ∩ T, we must have a_0 = a_1 = a_2; there are n such triples. So we have

|S ∪ T| = 2 · ((1/3)n^3 + (1/2)n^2 + (1/6)n) − n
        = (2/3)n^3 + n^2 − (2/3)n.
A.9

For problems that ask you to compute a value, closed-form expressions are preferred, and you should justify your answers.

A.9.1
A.9.2 Two flushes

A standard poker deck has 52 cards, which are divided into 4 suits of 13 cards each. Suppose that we shuffle a poker deck so that all 52! permutations are equally likely.³ We then deal the top 5 cards to you and the next 5 cards to me.

Define a flush to be five cards from the same suit. Let A be the event that you get a flush, and let B be the event that I get a flush.

1. What is Pr[B | A]?

2. Is this more or less than Pr[B]?

Solution

Recall that

Pr[B | A] = Pr[B ∩ A] / Pr[A].

Let's start by calculating Pr[A]. For any single suit s, there are (13)_5 ways to give you 5 cards from s, out of (52)_5 ways to give you 5 cards, assuming in both cases that we keep track of the order of the cards.⁴

³ This turns out to be pretty hard to do in practice [BD92], but we'll suppose that we can actually do it.

⁴ If we don't keep track of the order, we get C(13,5) choices out of C(52,5) possibilities; these divide out to the same value.
So the probability that your five cards all come from suit s is (13)_5/(52)_5, and summing over the four suits gives Pr[A] = 4 · (13)_5/(52)_5. A similar argument shows that the probability that your five cards all come from suit s and my five cards all come from suit t is

(13)_10 / (52)_10          if s = t, and
(13)_5 · (13)_5 / (52)_10  if s ≠ t.

Summing over the 4 choices of s and, for each, the one equal and three unequal choices of t gives Pr[B ∩ A] = 4 · ((13)_10 + 3 · (13)_5 · (13)_5)/(52)_10. So

Pr[B | A] = Pr[B ∩ A] / Pr[A]
          = (((13)_10 + 3 · (13)_5 · (13)_5)/(52)_10) · ((52)_5/(13)_5)
          = ((8)_5 + 3 · (13)_5)/(47)_5,   (A.9.1)

using (13)_10 = (13)_5 · (8)_5 and (52)_10 = (52)_5 · (47)_5.
Another way to get (A.9.1) is to argue that once you have five cards of a
particular suit, there are (47)5 equally probable choices for my five cards, of
which (8)5 give me five cards from your suit and 3 (13)5 give me five cards
from one of the three other suits.
This turns out to be slightly larger than the probability that I get a flush without conditioning, which is

Pr[B] = 4 · (13)_5 / (52)_5
      = (4 · 154440) / 311875200
      = 33/16660
      ≈ 0.00198079.
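Both probabilities can be checked exactly with falling factorials (math.perm(n, k) computes (n)_k):

```python
from fractions import Fraction
from math import perm

pr_b = Fraction(4 * perm(13, 5), perm(52, 5))           # unconditional flush
pr_b_given_a = Fraction(perm(8, 5) + 3 * perm(13, 5),   # (A.9.1)
                        perm(47, 5))

assert pr_b == Fraction(33, 16660)
assert pr_b_given_a > pr_b
```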
A.9.3

Let D_0, D_1, D_2, … be independent rolls of a fair n-sided die, so that each D_i is uniform on {1, …, n}, and let

S = Σ_{i=1}^{D_0} D_i.

Conditioning on the value of D_0,

E[S] = E[Σ_{i=1}^{D_0} D_i]
     = Σ_{j=1}^{n} E[Σ_{i=1}^{D_0} D_i | D_0 = j] · Pr[D_0 = j]
     = (1/n) Σ_{j=1}^{n} Σ_{i=1}^{j} E[D_i]
     = (1/n) Σ_{j=1}^{n} Σ_{i=1}^{j} (n + 1)/2
     = (1/n) Σ_{j=1}^{n} j · (n + 1)/2
     = ((n + 1)/(2n)) Σ_{j=1}^{n} j
     = ((n + 1)/(2n)) · (n(n + 1)/2)
     = (n + 1)^2 / 4.
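A quick simulation agrees with the closed form (n + 1)^2/4 (seeded so the run is reproducible):

```python
import random

def avg_total(n, trials=100_000, seed=1):
    """Simulate S: roll D0, then sum D0 further rolls of a fair n-sided die."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sum(rng.randint(1, n) for _ in range(rng.randint(1, n)))
    return total / trials

# For n = 6 the derivation above gives E[S] = 49/4 = 12.25.
```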
Appendix B
Sample exams
These are exams from the Fall 2013 version of CPSC 202. Some older exams
can be found in Appendices C and D.
B.1
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 75 minutes to complete this exam.
B.1.1

P Q | P∨Q | P∧¬(P∨Q) | (P∧¬(P∨Q))→Q
----+-----+----------+--------------
0 0 |  0  |    0     |      1
0 1 |  1  |    0     |      1
1 0 |  1  |    0     |      1
1 1 |  1  |    0     |      1
B.1.2

x · y = 0 (mod m)
B.1.3

2.

Solution

Using the definition of exponentiation and the geometric series formula, we can compute

Σ_{i=1}^{n} Π_{j=1}^{i} 2 = Σ_{i=1}^{n} 2^i
                          = Σ_{i=0}^{n−1} 2^{i+1}
                          = 2 Σ_{i=0}^{n−1} 2^i
                          = 2 · (2^n − 1)/(2 − 1)
                          = 2 · (2^n − 1)
                          = 2^{n+1} − 2.
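A one-liner confirms the closed form:

```python
# sum of 2^1 + 2^2 + ... + 2^n equals 2^(n+1) - 2 for every n >= 0
assert all(sum(2 ** i for i in range(1, n + 1)) == 2 ** (n + 1) - 2
           for n in range(20))
```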
B.1.4
B.2
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 75 minutes to complete this exam.
B.2.1

B.2.2

∀x ∈ Z : ∃y ∈ Z : x < y   (B.2.1)

∃x ∈ Z : ∀y ∈ Z : x < y   (B.2.2)

Solution

First, we'll show that (B.2.1) is true. Given any x ∈ Z, choose y = x + 1. Then x < y.

Next, we'll show that (B.2.2) is not true, by showing that its negation is true. Negating (B.2.2) gives ∀x ∈ Z : ∃y ∈ Z : x ≮ y. Given any x ∈ Z, choose y = x. Then x ≮ y.
B.2.3

Prove or disprove: For any 2-by-2 real matrices A and B,

(A + B)^2 = A^2 + 2AB + B^2.   (B.2.3)

Solution

We don't really expect this to be true, because the usual expansion (A + B)^2 = A^2 + AB + BA + B^2 doesn't simplify further, since AB does not equal BA in general.
For a counterexample, let

A = [1 1]      B = [1 −1]
    [1 1],         [1  1].

Then

(A + B)^2 = [2 0]^2 = [4 0]
            [2 2]     [8 4],

but

A^2 + 2AB + B^2 = [2 2] + 2 [2 0] + [0 −2] = [6 0]
                  [2 2]     [2 0]   [2  0]   [8 2].
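The counterexample is easy to verify by hand or by machine; a small helper (plain lists, no library assumed):

```python
def mmul(X, Y):
    """2x2 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def madd(*Ms):
    """Entrywise sum of 2x2 matrices."""
    return [[sum(M[i][j] for M in Ms) for j in range(2)] for i in range(2)]

A = [[1, 1], [1, 1]]
B = [[1, -1], [1, 1]]
S = madd(A, B)
lhs = mmul(S, S)                                    # (A+B)^2
rhs = madd(mmul(A, A),
           [[2 * v for v in row] for row in mmul(A, B)],
           mmul(B, B))                              # A^2 + 2AB + B^2
```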
B.2.4
How many connected graphs contain no vertices with degree greater than
one?
Solution
There are three: The empty graph, the graph with one vertex, and the graph
with two vertices connected by an edge. These enumerate all connected
graphs with two vertices or fewer (the other two-vertex graph, with no edge,
is not connected).
To show that these are the only possibilities, suppose that we have a
connected graph G with more than two vertices. Let u be one of these
vertices. Let v be a neighbor of u (if u has no neighbors, then there is no
path from u to any other vertex, and G is not connected). Let w be some
other vertex. Since G is connected, there is a path from u to w. Let w′ be the first vertex in this path that is not u or v. Then w′ is adjacent to u or v; in either case, one of u or v has degree at least two.
Appendix C
C.1
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 50 minutes to complete this exam.
C.1.1

If F(z) = Σ_{n=0}^{∞} T(n) z^n, then the 2^n term in the recurrence contributes 1/(1 − 2z) to F(z). Expanding the resulting generating function (or just guessing) suggests

T(n) = Σ_{k=0}^{n} 3^{n−k} 2^k
     = 3^n Σ_{k=0}^{n} (2/3)^k
     = 3^n · (1 − (2/3)^{n+1})/(1 − (2/3))
     = 3^{n+1} − 2^{n+1}.

A guess is not a proof; to prove that this guess works we verify T(0) = 3^1 − 2^1 = 3 − 2 = 1 and T(n) = 3T(n − 1) + 2^n = 3(3^n − 2^n) + 2^n = 3^{n+1} − 2 · 2^n = 3^{n+1} − 2^{n+1}.
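The verification is mechanical; a quick check, assuming the recurrence T(0) = 1, T(n) = 3T(n−1) + 2^n used above:

```python
def T(n):
    """T(0) = 1, T(n) = 3*T(n-1) + 2^n (assumed recurrence)."""
    v = 1
    for i in range(1, n + 1):
        v = 3 * v + 2 ** i
    return v

assert all(T(n) == 3 ** (n + 1) - 2 ** (n + 1) for n in range(15))
```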
C.1.2

C.1.3

Prove that k · C(n, k) = n · C(n − 1, k − 1) when 1 ≤ k ≤ n.
Solution

There are several ways to do this. The algebraic version is probably cleanest.

Combinatorial version

The LHS counts the ways to choose k of n elements and then specially mark one of the k. Alternatively, we could choose the marked element first (n choices) and then choose the remaining k − 1 elements from the remaining n − 1 elements (C(n − 1, k − 1) choices); this gives the RHS.

Algebraic version

Compute

k · C(n, k) = k · n!/(k!(n − k)!)
            = n!/((k − 1)!(n − k)!)
            = n · (n − 1)!/((k − 1)!((n − 1) − (k − 1))!)
            = n · C(n − 1, k − 1).
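A brute-force check of the identity over a small range:

```python
from math import comb

assert all(k * comb(n, k) == n * comb(n - 1, k - 1)
           for n in range(1, 25) for k in range(1, n + 1))
```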
C.1.4

Suppose you flip a fair coin n times, where n ≥ 1. What is the probability of the event that both of the following hold: (a) the coin comes up heads at least once and (b) once it comes up heads, it never comes up tails on any later flip?

Solution

For each i ∈ {1, …, n}, let A_i be the event that the coin comes up heads for the first time on flip i and continues to come up heads thereafter. Then the desired event is the disjoint union of the A_i. Since each A_i consists of a single sequence of coin-flips, each occurs with probability 2^{−n}. Summing over all i gives a total probability of n · 2^{−n}.
C.2
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 50 minutes to complete this exam.
C.2.1
C.2.2
C.2.3
C.2.4

Let A ⊆ B.

1. Prove or disprove: There exists an injection f : A → B.

2. Prove or disprove: There exists a surjection g : B → A.

Solution

1. Proof: Let f(x) = x. Then f(x) = f(y) implies x = y, and f is injective.

2. Disproof: Let B be nonempty and let A = ∅. Then there is no function at all from B to A, surjective or not.
C.3
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 50 minutes to complete this exam.
C.3.1
C.3.2

Σ_{m=0}^{n} ( C(n, m) Σ_{c=0}^{m} ( C(m, c) Σ_{x=0}^{c} C(c, x) ) )
  = Σ_{m=0}^{n} C(n, m) Σ_{c=0}^{m} C(m, c) 2^c
  = Σ_{m=0}^{n} C(n, m) (1 + 2)^m
  = Σ_{m=0}^{n} C(n, m) 3^m
  = (1 + 3)^n
  = 4^n.
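Checking the collapse numerically:

```python
from math import comb

def triple_sum(n):
    """The nested binomial sum simplified above."""
    return sum(comb(n, m)
               * sum(comb(m, c) * sum(comb(c, x) for x in range(c + 1))
                     for c in range(m + 1))
               for m in range(n + 1))

assert all(triple_sum(n) == 4 ** n for n in range(9))
```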
C.3.3

C.3.4

A test is graded on a scale of 0 to 80 points. Because the grading is completely random, your grade can be represented by a random variable X with 0 ≤ X ≤ 80 and E[X] = 60.

1. What is the maximum possible probability that X = 80?

2. Suppose that we change the bounds to 20 ≤ X ≤ 80, but E[X] is still 60. Now what is the maximum possible probability that X = 80?

Solution

1. Here we apply Markov's inequality: since X ≥ 0, we have Pr[X ≥ 80] ≤ E[X]/80 = 60/80 = 3/4. This maximum is achieved exactly by letting X = 0 with probability 1/4 and 80 with probability 3/4, giving E[X] = (1/4) · 0 + (3/4) · 80 = 60.

2. Raising the minimum grade to 20 knocks out the possibility of getting 0, so our previous distribution doesn't work. In this new case we can apply Markov's inequality to Y = X − 20 ≥ 0, to get Pr[X ≥ 80] = Pr[Y ≥ 60] ≤ E[Y]/60 = 40/60 = 2/3. So the extreme case would seem to be that we get 20 with probability 1/3 and 80 with probability 2/3. It's easy to check that we then get E[X] = (1/3) · 20 + (2/3) · 80 = 180/3 = 60. So in fact the best we can do now is a probability of 2/3 of getting 80, less than we had before.
C.4
Write your answers on the exam. Justify your answers. Work alone. Do not
use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately 75 minutes to complete this exam.
C.4.1
C.4.2
Let p be a prime, and let 0 ≤ a < p. Show that a^{2p−1} = a (mod p).

Solution

Write a^{2p−1} = a^{p−1} · a^{p−1} · a. If a ≠ 0, Euler's Theorem (or Fermat's Little Theorem) says a^{p−1} = 1 (mod p), so in this case a^{p−1} · a^{p−1} · a = a (mod p). If a = 0, then (since 2p − 1 ≠ 0), a^{2p−1} = 0 = a (mod p).
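A quick check with Python's built-in modular exponentiation:

```python
# Verify a^(2p-1) = a (mod p) for several small primes p and all 0 <= a < p.
for p in (2, 3, 5, 7, 11, 13, 17):
    assert all(pow(a, 2 * p - 1, p) == a for a in range(p))
```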
C.4.3

Let L(x, y) represent the statement "x likes y" and let T(x) represent the statement "x is tall," where x and y range over a universe consisting of all children on a playground. Let m be Mary, one of the children.

1. Translate the following statement into predicate logic: "If x is tall, then Mary likes x if and only if x does not like x."

2. Show that if the previous statement holds, Mary is not tall.

Solution

1. ∀x (T(x) → (L(m, x) ↔ ¬L(x, x))).
C.4.4

Compute Σ_{k=a}^{b} k, assuming 0 ≤ a ≤ b.

Solution

Here are three ways to do this:

1. Write Σ_{k=a}^{b} k as Σ_{k=1}^{b} k − Σ_{k=1}^{a−1} k and use the formula Σ_{k=1}^{n} k = n(n+1)/2 to get

Σ_{k=a}^{b} k = Σ_{k=1}^{b} k − Σ_{k=1}^{a−1} k
             = b(b + 1)/2 − (a − 1)a/2
             = (b(b + 1) − a(a − 1))/2.

2. Pair up the terms of the sum with their reverses:

2 Σ_{k=a}^{b} k = Σ_{k=a}^{b} k + Σ_{k=a}^{b} (b + a − k)
               = Σ_{k=a}^{b} (k + b + a − k)
               = Σ_{k=a}^{b} (b + a)
               = (b − a + 1)(b + a).

Dividing both sides by 2 gives (b − a + 1)(b + a)/2.

3. Write Σ_{k=a}^{b} k as Σ_{k=0}^{b−a} (a + k) = (b − a + 1)a + Σ_{k=0}^{b−a} k. Then use the sum formula as before to turn this into (b − a + 1)a + (b − a)(b − a + 1)/2.
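All three closed forms agree with a direct summation:

```python
def sum_range(a, b):
    """Closed form for a + (a+1) + ... + b (method 1 above)."""
    return (b * (b + 1) - a * (a - 1)) // 2

assert all(sum_range(a, b) == sum(range(a, b + 1))
           for a in range(8) for b in range(a, 16))
```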
Appendix D
D.1
Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are seven problems on this exam, each worth 20 points, for a total
of 140 points. You have approximately three hours to complete this exam.
D.1.1
1. What is the probability that the players score at the end of the game
is zero?
2. What is the expectation of the players score at the end of the game?
Solution
1. The only way to get a score of zero is to lose on the first roll. There
are 36 equally probable outcomes for the first roll, and of these the
six outcomes (4,6), (5,5), (5,6), (6,4), (6,5), and (6,6) yield a product
greater than 20. So the probability of getting zero is 6/36 = 1/6.
2. To compute the total expected score, let us first compute the expected score for a single turn. This is

(1/36) Σ_{i=1}^{6} Σ_{j=1}^{6} ij · [ij ≤ 20],

where [ij ≤ 20] is the indicator random variable for the event that ij ≤ 20.
I don't know of a really clean way to evaluate the sum, but we can expand it as

(Σ_{i=1}^{3} i)(Σ_{j=1}^{6} j) + 4 Σ_{j=1}^{5} j + 5 Σ_{j=1}^{4} j + 6 Σ_{j=1}^{3} j
  = 6 · 21 + 4 · 15 + 5 · 10 + 6 · 6
  = 126 + 60 + 50 + 36
  = 272.
So the expected score per turn is 272/36 = 68/9.
Now we need to calculate the expected total score; call this value S.
Assuming we continue after the first turn, the expected total score for
the second and subsequent turns is also S, since the structure of the
tail of the game is identical to the game as a whole. So we have
S = 68/9 + (5/6)S,
which we can solve to get S = (6 · 68)/9 = 136/3.
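Exact arithmetic confirms both the per-turn value and the fixed point:

```python
from fractions import Fraction

# Expected score for one turn: average of ij over rolls with ij <= 20.
per_turn = Fraction(sum(i * j for i in range(1, 7) for j in range(1, 7)
                        if i * j <= 20), 36)
assert per_turn == Fraction(68, 9)

# Total expected score S satisfies S = 68/9 + (5/6) S.
S = per_turn / (1 - Fraction(5, 6))
assert S == Fraction(136, 3)
```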
D.1.2
D.1.3

Show that

(9^{2^{24036583−1}} − 9) / (2^{24036583} − 1)

is an integer. (2^{24036583} − 1 is a known Mersenne prime.)

Solution

Let's save ourselves a lot of writing by letting x = 24036583, so that p = 2^x − 1 and the fraction becomes

(9^{2^{x−1}} − 9) / p.

To show that this is an integer, we need to show that p divides the numerator, i.e., that

9^{2^{x−1}} − 9 = 0 (mod p).

We'd like to attack this with Fermat's Little Theorem, so we need to get the exponent to look something like p − 1 = 2^x − 2. Observe that 9 = 3^2, so

9^{2^{x−1}} = (3^2)^{2^{x−1}} = 3^{2 · 2^{x−1}} = 3^{2^x} = 3^{2^x − 2} · 3^2 = 3^{p−1} · 3^2.

Since p is prime, Fermat's Little Theorem gives 3^{p−1} = 1 (mod p), so 9^{2^{x−1}} = 3^2 = 9 (mod p), and the numerator is divisible by p as claimed.
D.1.4

D.1.5

D.1.6
Recall that the powerset P(S) of a set S is the set of sets {A : A ⊆ S}. Prove that if S ⊆ T, then P(S) ⊆ P(T).

Solution

Let A ∈ P(S); then by the definition of P(S) we have A ⊆ S. But then A ⊆ S ⊆ T implies A ⊆ T, and so A ∈ P(T). Since A was arbitrary, A ∈ P(T) holds for all A in P(S), and we have P(S) ⊆ P(T).
D.1.7
Archaeologists working deep in the Upper Nile Valley have discovered a curious machine, consisting of a large box with three levers painted red, yellow, and blue. Atop the box is a display that shows one of a set of n hieroglyphs. Each lever can be pushed up or down, and pushing a lever changes the displayed hieroglyph to some other hieroglyph. The archaeologists have determined by extensive experimentation that for each hieroglyph x, pushing the red lever up when x is displayed always changes the display to the same hieroglyph f(x), and pushing the red lever down always changes hieroglyph f(x) to x. A similar property holds for the yellow and blue levers: pushing yellow up sends x to g(x) and down sends g(x) to x; and pushing blue up sends x to h(x) and down sends h(x) to x.

Prove that there is a finite number k such that no matter which hieroglyph is displayed initially, pushing any one of the levers up k times leaves the display with the same hieroglyph at the end.

Clarification added during exam: k > 0.
Solution

Let H be the set of hieroglyphs, and observe that the map f : H → H corresponding to pushing the red lever up is invertible and thus a permutation. Similarly, the maps g and h corresponding to yellow or blue up-pushes are also permutations, as are the inverses f^{−1}, g^{−1}, and h^{−1} corresponding to red, yellow, or blue down-pushes. Repeated pushes of one or more levers correspond to compositions of permutations, so the set of all permutations obtained by sequences of zero or more pushes is the subgroup G of the permutation group S_{|H|} generated by f, g, and h.

Now consider the cyclic subgroup ⟨f⟩ of G generated by f alone. Since G is finite, there is some index m such that f^m = e. Similarly there are indices n and p such that g^n = e and h^p = e. So pushing the red lever up any multiple of m times restores the initial state, as does pushing the yellow lever up any multiple of n times or the blue lever up any multiple of p times. Let k = mnp. Then k is a multiple of m, n, and p, and pushing any single lever up k times leaves the display in the same state.
D.2
Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are six problems on this exam, each worth 20 points, for a total
of 120 points. You have approximately three hours to complete this exam.
D.2.1

Recall that the order of an element x of a group is the least positive integer k such that x^k = e, where e is the identity, or ∞ if no such k exists.

Prove or disprove: In the symmetric group S_n of permutations on n elements, the order of any permutation is at most n(n − 1)/2.

Clarifications added during exam

Assume n > 2.

Solution

Disproof: Consider the permutation (1 2)(3 4 5)(6 7 8 9 10)(11 12 13 14 15 16 17) in S_17. This has order 2 · 3 · 5 · 7 = 210, but 17 · 16/2 = 136.
D.2.2

Recall that the free group over a singleton set {a} consists of all words of the form a^k, where k is an integer, with multiplication defined by a^k · a^m = a^{k+m}. Prove or disprove: The free group over {a} has exactly one finite subgroup.

Solution

Proof: Let F be the free group defined above and let S be a subgroup of F. Suppose S contains a^k for some k ≠ 0. Then S contains a^{2k}, a^{3k}, …, because it is closed under multiplication. Since these elements are all distinct, S is infinite.

The alternative is that S does not contain a^k for any k ≠ 0; this leaves only a^0 as a possible element of S, and there is only one such subgroup: the trivial subgroup {a^0}.
D.2.3
S and T , there are at least two edges that have one endpoint in S and one
in T .
Solution
Proof: Because G is connected and every vertex has even degree, there is
an Euler tour of the graph (a cycle that uses every edge exactly once). Fix
some particular tour and consider a partition of V into two sets S and T .
There must be at least one edge between S and T, or G is not connected; but if there is only one, then the tour can't return to S or T once it leaves. It follows that there are at least 2 edges between S and T as claimed.
D.2.4
D.2.5
and the weight of its meal (and the eaten piranha is gone); if unsuccessful,
the piranha remains at the same weight.
Prove that after k days, no surviving piranha weighs more than 2k units.
Clarifications added during exam
It is not possible for a piranha to eat and be eaten on the same day.
Solution
By induction on k. The base case is k = 0, when all piranha weigh exactly 2^0 = 1 unit. Suppose every surviving piranha has weight x ≤ 2^k after k days. Then on day k + 1, either its weight stays the same, or it successfully eats another piranha of weight y ≤ 2^k, which increases its weight to x + y ≤ 2^k + 2^k = 2^{k+1}. In either case the claim follows for k + 1.
D.2.6
Recall that a subspace of a vector space is a set that is closed under vector
addition and scalar multiplication. Recall further that the subspace generated by a set of vector space elements is the smallest such subspace, and its
dimension is the size of any basis of the subspace.
Let A be the 2-by-2 matrix

A = [1 1]
    [0 1]

over the reals, and consider the subspace S of the vector space of 2-by-2 real matrices generated by the set {A, A^2, A^3, …}. What is the dimension of S?
First let's see what A^k looks like. We have

A^2 = [1 1; 0 1]·[1 1; 0 1] = [1 2; 0 1],
A^3 = [1 1; 0 1]·[1 2; 0 1] = [1 3; 0 1],

and in general

A^k = [1 1; 0 1]·[1 (k−1); 0 1] = [1 k; 0 1].
Moreover,

A^k = [1 k; 0 1] = (k−1)·[1 2; 0 1] − (k−2)·[1 1; 0 1] = (k−1)A^2 − (k−2)A.
It follows that {A, A^2} generates all the A^k and thus generates any linear
combination of the A^k as well. It is easy to see that A and A^2 are linearly
independent: if c_1·A + c_2·A^2 = 0, we must have (a) c_1 + c_2 = 0 (to cancel
out the diagonal entries) and (b) c_1 + 2c_2 = 0 (to cancel out the nonzero
off-diagonal entry). The only solution to both equations is c_1 = c_2 = 0.
Because {A, A^2} is a linearly independent set that generates S, it is a
basis, and S has dimension 2.
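Both the closed form for A^k and the claim that {A, A^2} spans every power can be checked mechanically. A sketch with hand-rolled 2-by-2 integer arithmetic (the helper names are ours):

```python
def matmul(X, Y):
    """Product of two 2-by-2 matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow(X, k):
    """X**k by repeated multiplication (k >= 0)."""
    R = [[1, 0], [0, 1]]
    for _ in range(k):
        R = matmul(R, X)
    return R

A = [[1, 1], [0, 1]]
A2 = matmul(A, A)
for k in range(1, 20):
    Ak = matpow(A, k)
    assert Ak == [[1, k], [0, 1]]
    # A^k = (k-1)A^2 - (k-2)A, so {A, A^2} generates every A^k
    combo = [[(k - 1) * A2[i][j] - (k - 2) * A[i][j] for j in range(2)]
             for i in range(2)]
    assert combo == Ak
```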
D.3
Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are six problems on this exam, each worth 20 points, for a total
of 120 points. You have approximately three hours to complete this exam.
D.3.1
p_{2H} = 1 − p_H·p_S − p_S
       = p_H + p_T − p_H·p_S
       = p_T + p_H·(p_H + p_T)
       = p_T + p_H·p_T + p_H^2.
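The chain of equalities in this solution uses only the constraint p_H + p_T + p_S = 1. A quick numerical check of that algebra (the sampling scheme is ours):

```python
import random

random.seed(1)
for _ in range(1000):
    # Random probabilities with p_H + p_T + p_S = 1.
    a, b = sorted(random.random() for _ in range(2))
    pH, pT, pS = a, b - a, 1 - b
    e1 = 1 - pH * pS - pS
    e2 = pH + pT - pH * pS
    e3 = pT + pH * (pH + pT)
    e4 = pT + pH * pT + pH ** 2
    assert abs(e1 - e2) < 1e-12
    assert abs(e2 - e3) < 1e-12
    assert abs(e3 - e4) < 1e-12
```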
D.3.2
Let G be a group and ≤ a partial order on the elements of G such that for
all x, y in G, x ≤ xy. How many elements does G have?
Solution
The group G has exactly one element.
First observe that G has at least one element, because it contains an
identity element e.
Now let x and y be any two elements of G. We can show x ≤ y, because
y = x(x^{−1}y). Similarly, y ≤ x, because x = y(y^{−1}x). But then x = y by antisymmetry.
It follows that all elements of G are equal, i.e., that G has at most one
element.
D.3.3
Solution
Let's look at the effect of multiplying a vector of known weight by just one
near-diagonal matrix. We will show: (a) for any near-diagonal A and any x,
w(Ax) ≤ w(x) + 1, and (b) for any n × 1 column vector x with 0 < w(x) < n,
there exists a near-diagonal matrix A with w(Ax) ≥ w(x) + 1.
To prove (a), observe that (Ax)_i = Σ_{j=1}^{n} A_{ij}·x_j. For (Ax)_i to be nonzero,
there must be some index j such that A_{ij}·x_j is nonzero. This can occur in
two ways: j = i, and A_{ii} and x_i are both nonzero; or j ≠ i, and A_{ij} and x_j
are both nonzero. The first case can occur for at most w(x) different values
of i (because there are only w(x) nonzero entries x_i). The second can occur
for at most one value of i (because there is at most one nonzero entry A_{ij}
with i ≠ j). It follows that Ax has at most w(x) + 1 nonzero entries, i.e.,
that w(Ax) ≤ w(x) + 1.
To prove (b), choose k and m such that x_k = 0 and x_m ≠ 0, and let A
be the matrix with A_{ii} = 1 for all i, A_{km} = 1, and all other entries equal to
zero. Now consider (Ax)_i. If i ≠ k, then (Ax)_i = Σ_{j=1}^{n} A_{ij}·x_j = A_{ii}·x_i = x_i.
If i = k, then (Ax)_k = Σ_{j=1}^{n} A_{kj}·x_j = A_{kk}·x_k + A_{km}·x_m = x_m ≠ 0, since we
chose k so that x_k = 0 and chose m so that x_m ≠ 0. So (Ax)_i is nonzero if
either x_i is nonzero or i = k, giving w(Ax) ≥ w(x) + 1.
Now proceed by induction:
For any k, if A_1 . . . A_k are near-diagonal matrices, then w(A_1 · · · A_k x) ≤
w(x) + k. Proof: The base case of k = 0 is trivial. For larger k, w(A_1 · · · A_k x) =
w(A_1 (A_2 · · · A_k x)) ≤ w(A_2 · · · A_k x) + 1 ≤ w(x) + (k − 1) + 1 = w(x) + k.
Fix x with w(x) = 1. Then for any k < n, there exists a sequence of
near-diagonal matrices A_1 . . . A_k such that w(A_1 · · · A_k x) = k + 1. Proof:
Again the base case of k = 0 is trivial. For larger k < n, we have from the
induction hypothesis that there exists a sequence of k − 1 near-diagonal
matrices A_2 . . . A_k such that w(A_2 · · · A_k x) = k < n. From claim (b)
above we then get that there exists a near-diagonal matrix A_1 such that
w(A_1 (A_2 · · · A_k x)) = w(A_2 · · · A_k x) + 1 = k + 1.
Applying both these facts, k = n − 1 is necessary and sufficient
for w(A_1 · · · A_k x) = n, and so k = n − 1 is the smallest value of k for which
this works.
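Part (b)'s construction is easy to run: starting from a weight-1 vector and repeatedly applying the matrix with ones on the diagonal plus a single off-diagonal one (which copies a nonzero coordinate into a zero coordinate), the weight grows by exactly one per step, so full weight n is reached after n − 1 multiplications. A sketch, applying the matrix directly at the vector level (the helper names are ours):

```python
def weight(x):
    """Number of nonzero entries of x."""
    return sum(1 for v in x if v != 0)

def apply_near_diagonal(x, k, m):
    """Multiply x by the matrix with A_ii = 1 for all i and a single
    off-diagonal entry A_km = 1 (the matrix built in part (b))."""
    y = list(x)
    y[k] += x[m]
    return y

n = 8
x = [0] * n
x[0] = 1                                       # w(x) = 1
steps = 0
while weight(x) < n:
    k = x.index(0)                             # a zero coordinate
    m = next(i for i, v in enumerate(x) if v)  # a nonzero coordinate
    x = apply_near_diagonal(x, k, m)
    steps += 1
assert steps == n - 1
```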
D.3.4
D.3.5
Solution
1. We have two equations in two unknowns:

a·x_i + b = x_{i+1} (mod p)
a·x_{i+1} + b = x_{i+2} (mod p).

Subtracting the first equation from the second gives

a·(x_{i+1} − x_i) = x_{i+2} − x_{i+1} (mod p),

and multiplying both sides by (x_{i+1} − x_i)^{−1} gives

a = (x_{i+2} − x_{i+1})·(x_{i+1} − x_i)^{−1} (mod p).

Now we have a. To find b, plug our value for a into either equation
and solve for b.
2. We will show that for any observed values of x_i and x_{i+1}, there are at
least two different values for a that are consistent with our observation;
in fact, we'll show the even stronger fact that for any value of a, x_i
and x_{i+1} are consistent with that choice of a. Proof: Fix a, and let
b = x_{i+1} − a·x_i (mod p). Then x_{i+1} = a·x_i + b (mod p).
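Part 1's recovery procedure runs in a couple of lines, assuming p is prime and x_{i+1} ≠ x_i so the modular inverse exists (the function name is ours):

```python
def recover_lcg(x0, x1, x2, p):
    """Given three consecutive outputs of x_{i+1} = a*x_i + b (mod p),
    with p prime and x1 != x0 (mod p), recover (a, b)."""
    a = (x2 - x1) * pow(x1 - x0, -1, p) % p   # pow(v, -1, p): modular inverse
    b = (x1 - a * x0) % p
    return a, b

p, a, b = 101, 17, 23
xs = [5]
for _ in range(2):
    xs.append((a * xs[-1] + b) % p)
assert recover_lcg(xs[0], xs[1], xs[2], p) == (a, b)
```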
D.3.6
H = 1/(1 − 3z) − 2z/(1 − 3z) = (1 − 2z)/(1 − 3z).

So h_0 = 3^0 = 1, and for n > 0, we have h_n = 3^n − 2·3^{n−1} = (3 − 2)·3^{n−1} =
3^{n−1}.
D.4
Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are five problems on this exam, each worth 20 points, for a total
of 100 points. You have approximately three hours to complete this exam.
D.4.1
D.4.2
D.4.3
Take a biased coin that comes up heads with probability p and flip it 2n
times.
What is the probability that at some time during this experiment two
consecutive coin-flips come up both heads or both tails?
Solution
It's easier to calculate the probability of the event that we never get two
consecutive heads or tails, since in this case there are only two possible
patterns of coin-flips: HTHT . . . or THTH . . . . Since each of these patterns
contains exactly n heads and n tails, they each occur with probability p^n·(1 −
p)^n, giving a total probability of 2·p^n·(1 − p)^n. The probability that neither
alternating sequence occurs, i.e. that some two consecutive flips match, is then 1 − 2·p^n·(1 − p)^n.
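The closed form can be confirmed exactly for small n by enumerating all 2^{2n} flip sequences with rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def prob_consecutive_match(n, p):
    """Exact probability that 2n flips of a p-biased coin contain two
    equal consecutive flips, by brute-force enumeration."""
    total = Fraction(0)
    for flips in product('HT', repeat=2 * n):
        pr = Fraction(1)
        for f in flips:
            pr *= p if f == 'H' else 1 - p
        if any(s == t for s, t in zip(flips, flips[1:])):
            total += pr
    return total

p = Fraction(1, 3)
for n in (1, 2, 3):
    assert prob_consecutive_match(n, p) == 1 - 2 * p**n * (1 - p)**n
```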
D.4.4
D.4.5
Solution
Observe first that (A − B)(A + B) = A^2 + AB − BA − B^2. The question
then is whether AB = BA. Because A and B are symmetric, we have that
BA = B^T·A^T = (AB)^T. So if we can show that AB is also symmetric, then we
have AB = (AB)^T = BA. Alternatively, if we can find symmetric matrices
A and B such that AB is not symmetric, then A^2 − B^2 ≠ (A − B)(A + B).
Let's try multiplying two generic symmetric 2-by-2 matrices:

[a b; b c]·[d e; e f] = [ad + be, ae + bf; bd + ce, be + cf].
The product doesn't look very symmetric, and in fact we can assign
variables to make it not so. We need ae + bf ≠ bd + ce. Let's set b = 0
to make the bf and bd terms drop out, and e = 1 to leave just a and c.
Setting a = 0 and c = 1 gives an asymmetric product. Note that we didn't
determine d or f, so let's just set them to zero as well to make things as
simple as possible. The result is:
AB = [0 0; 0 1]·[0 1; 1 0] = [0 0; 1 0].
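The counterexample takes a few lines to verify by direct multiplication (helper names are ours):

```python
def mm(X, Y):
    """2-by-2 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def sub(X, Y):
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]

A = [[0, 0], [0, 1]]
B = [[0, 1], [1, 0]]
assert mm(A, B) == [[0, 0], [1, 0]]       # AB is not symmetric
lhs = mm(sub(A, B), add(A, B))            # (A - B)(A + B)
rhs = sub(mm(A, A), mm(B, B))             # A^2 - B^2
assert lhs != rhs                         # so the identity fails here
```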
D.5
Write your answers in the blue book(s). Justify your answers. Give closed-form solutions when possible. Work alone. Do not use any notes or books.
There are five problems on this exam, each worth 20 points, for a total
of 100 points. You have approximately three hours to complete this exam.
D.5.1
Both ∼ and ≈ are equivalence relations. Let {0, 1}^n/∼ and {0, 1}^n/≈
be the corresponding sets of equivalence classes.
1. What is |{0, 1}^n/∼| as a function of n?
2. What is |{0, 1}^n/≈| as a function of n?
Solution
1. Given a string x, the equivalence class [x] = {x, r(x)} has either one
element (if x = r(x)) or two elements (if x ≠ r(x)). Let m_1 be
the number of one-element classes and m_2 the number of two-element
classes. Then |{0, 1}^n| = 2^n = m_1 + 2m_2, and the number we are
looking for is

m_1 + m_2 = (2m_1 + 2m_2)/2 = (2^n + m_1)/2 = 2^{n−1} + m_1/2.

To find m_1, we must count the number of strings x_1 . . . x_n with x_1 = x_n,
x_2 = x_{n−1}, etc. If n is even, there are exactly 2^{n/2} such strings, since
we can specify one by giving the first n/2 bits (which determine the
rest uniquely). If n is odd, there are exactly 2^{(n+1)/2} such strings,
since the middle bit can be set freely. We can write both alternatives
as 2^{⌈n/2⌉}, giving |{0, 1}^n/∼| = 2^{n−1} + 2^{⌈n/2⌉−1}.
2. In this case, observe that x ≈ y if and only if x and y contain the same
number of 1 bits. There are n + 1 different possible values 0, 1, . . . , n
for this number. So |{0, 1}^n/≈| = n + 1.
The solution to the first part assumes n > 0. The problem does not specify
whether n = 0 should be considered; if it is, we get exactly one equivalence
class for both parts (the class containing the empty string).
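Both counts are easy to brute-force for small n, which is a good way to double-check the closed forms (helper names are ours):

```python
from itertools import product
from math import ceil

def classes_reversal(n):
    """Number of classes of {0,1}^n under x ~ y iff y is x or r(x)."""
    seen = set()
    for bits in product('01', repeat=n):
        s = ''.join(bits)
        seen.add(min(s, s[::-1]))   # canonical representative of {s, r(s)}
    return len(seen)

def classes_popcount(n):
    """Number of classes under 'same number of 1 bits'."""
    return len({''.join(bits).count('1') for bits in product('01', repeat=n)})

for n in range(1, 11):
    assert classes_reversal(n) == 2 ** (n - 1) + 2 ** (ceil(n / 2) - 1)
    assert classes_popcount(n) == n + 1
```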
D.5.2
Show whether each of the following functions from R^2 to R is a linear transformation or not.

f_1(x) = x_1 − x_2.
f_2(x) = x_1·x_2.
f_3(x) = x_1 + x_2 + 1.
f_4(x) = (x_1^2 − x_2^2 + x_1 − x_2)/(x_1 + x_2 + 1).
Solution
1. Linear: f_1(ax) = a·x_1 − a·x_2 = a(x_1 − x_2) = a·f_1(x) and f_1(x + y) =
(x_1 + y_1) − (x_2 + y_2) = (x_1 − x_2) + (y_1 − y_2) = f_1(x) + f_1(y).
2. Not linear: f_2(2x) = (2x_1)(2x_2) = 4·x_1·x_2 = 4·f_2(x) ≠ 2·f_2(x) when
f_2(x) ≠ 0.
3. Not linear: f_3(2x) = 2x_1 + 2x_2 + 1 but 2·f_3(x) = 2x_1 + 2x_2 + 2. These
are never equal.
4. Linear:

f_4(x) = (x_1^2 − x_2^2 + x_1 − x_2)/(x_1 + x_2 + 1)
       = ((x_1 + x_2)(x_1 − x_2) + (x_1 − x_2))/(x_1 + x_2 + 1)
       = ((x_1 + x_2 + 1)(x_1 − x_2))/(x_1 + x_2 + 1)
       = x_1 − x_2
       = f_1(x).

Since we've already shown f_1 is linear, f_4 = f_1 is also linear.
A better answer is that f_4 is not a linear transformation from R^2 to
R, because it's not defined when x_1 + x_2 + 1 = 0. The clarification added
during the exam tries to work around this, but doesn't really work. A
better clarification would have defined f_4 as above for most x, but have
f_4(x) = x_1 − x_2 when x_1 + x_2 = −1. Since I was being foolish about this
myself, I gave full credit for any solution that either did the division or
noticed the dividing-by-zero issue.
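The two linearity conditions can also be probed numerically at random points; passing such a test is only evidence, not a proof, but a single failure is a disproof. A sketch (f_4 is omitted because of the division-by-zero issue just discussed, and the helper name is ours):

```python
import random

def is_linear(f, trials=200, tol=1e-9, seed=0):
    """Check f(a*x) == a*f(x) and f(x + y) == f(x) + f(y) at random points."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(-5, 5)
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        y = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        if abs(f([a * v for v in x]) - a * f(x)) > tol:
            return False
        if abs(f([x[0] + y[0], x[1] + y[1]]) - (f(x) + f(y))) > tol:
            return False
    return True

assert is_linear(lambda x: x[0] - x[1])              # f1
assert not is_linear(lambda x: x[0] * x[1])          # f2
assert not is_linear(lambda x: x[0] + x[1] + 1)      # f3
```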
D.5.3
Flip n independent fair coins, and let X be a random variable that counts
how many of the coins come up heads. Let a be a constant. What is E[a^X]?
Solution
To compute E[a^X], we need to sum over all possible values of a^X weighted
by their probabilities. The variable X itself takes on each value k ∈ {0 . . . n}
with probability C(n,k)·2^{−n}, so a^X takes on each corresponding value a^k with
the same probability. We thus have:

E[a^X] = Σ_{k=0}^{n} C(n,k)·2^{−n}·a^k
       = 2^{−n} Σ_{k=0}^{n} C(n,k)·a^k·1^{n−k}
       = 2^{−n}·(a + 1)^n
       = ((a + 1)/2)^n.
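The answer ((a + 1)/2)^n can be verified by enumerating all 2^n equally likely coin sequences with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def expected_a_to_X(n, a):
    """E[a^X] for X = number of heads in n fair flips, by enumeration."""
    total = Fraction(0)
    for flips in product((0, 1), repeat=n):
        total += Fraction(a ** sum(flips), 2 ** n)
    return total

for n in range(1, 9):
    assert expected_a_to_X(n, 3) == Fraction(3 + 1, 2) ** n
```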
D.5.4
D.5.5
Appendix E
E.1
By hand
Example
E.2
LaTeX
This is what these notes are written in. It's also standard for writing papers
in most technical fields.
Advantages Very nice formatting. De facto standard for mathematics publishing. Free.
Disadvantages You have to install it and learn it. Can't tell what something looks like until you run it through a program. Cryptic and
Σ_{i=1}^{n} i = n(n + 1)/2.
E.3
E.4
This is the method originally used to format these notes, back when they
lived at http://pine.cs.yale.edu/pinewiki/CS202.
Advantages Everybody can read ASCII and most people can read Unicode. No special formatting required. Results are mostly machine-readable.
Appendix F
F.1
Limits
We say that

lim_{y→x} f(y) = c

if for any constant ε > 0 there exists a constant δ > 0 such that

|f(y) − c| ≤ ε

whenever

|y − x| ≤ δ.

The intuition is that as y gets closer to x, f(y) gets closer to c.
The formal definition has three layers of quantifiers, so as with all quantified expressions it helps to think of it as a game played between you and
some adversary that controls all the universal quantifiers. So to show that
lim_{y→x} f(y) = c, we have three steps:
Some malevolent jackass picks ε, and says "oh yeah, smart guy, I bet
you can't force f(y) to be within ε of c."
After looking at ε, you respond with δ, limiting the possible values of
y to the range [x − δ, x + δ].
Your opponent wins if he can find a nonzero y in this range with f(y)
outside [c − ε, c + ε]. Otherwise you win.
For example, in the next section we will want to show that

lim_{Δz→0} ((x + Δz)^2 − x^2)/Δz = 2x.

We need to take a limit here because the left-hand side isn't defined when
Δz = 0.
Before playing the game, it helps to use algebra to rewrite the left-hand
side a bit:

lim_{Δz→0} ((x + Δz)^2 − x^2)/Δz = lim_{Δz→0} (x^2 + 2x·Δz + (Δz)^2 − x^2)/Δz
                                = lim_{Δz→0} (2x·Δz + (Δz)^2)/Δz
                                = lim_{Δz→0} (2x + Δz).
So now the adversary says "make |(2x + Δz) − 2x| < ε," and we say "that's
easy, let δ = ε; then no matter what Δz you pick, as long as |Δz − 0| < δ, we
get |(2x + Δz) − 2x| = |Δz| < δ = ε, QED." And the adversary slinks off with
its tail between its legs to plot some terrible future revenge.
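The game is easy to watch numerically: for this function the difference quotient is exactly 2x + Δz, so after choosing δ = ε the error is just |Δz|. A sketch:

```python
def difference_quotient(x, dz):
    """((x + dz)^2 - x^2) / dz, defined only for dz != 0."""
    return ((x + dz) ** 2 - x ** 2) / dz

x = 5.0
for dz in (1e-1, 1e-3, 1e-5):
    # |quotient - 2x| = |dz|, so delta = epsilon always wins the game
    assert abs(difference_quotient(x, dz) - 2 * x) <= abs(dz) + 1e-9
```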
Of course, a definition only really makes sense if it doesn't work if we
pick a different limit. If we try to show

lim_{Δz→0} ((x + Δz)^2 − x^2)/Δz = 12

(assuming x ≠ 6), then the adversary picks ε < |12 − 2x|. Now we are out
of luck: no matter what δ we pick, the adversary can respond with some
value Δz very close to 0 (say, min(δ/2, |12 − 2x|/2)), and we land inside
[2x − ε, 2x + ε] but outside [12 − ε, 12 + ε].
We can also take the limit as a variable goes to infinity. This has a
slightly different definition:

lim_{x→∞} f(x) = c

holds if for any ε > 0, there exists an N > 0, such that for all x > N,
|f(x) − c| < ε. Structurally, this is the same 3-step game as before, except
now after we see ε, instead of constraining x to be very close to a, we constrain x to be very big. Limits as x goes to infinity are sometimes handy
for evaluating asymptotic notation.
Limits don't always exist. For example, if we try to take

lim_{x→∞} x^2,

there is no value c that works, since x^2 grows without bound.
F.2
Derivatives
The derivative or differential of a function measures how much the function changes if we make a very small change to its input. One way to think
about this is that for most functions, if you blow up a plot of them enough,
you don't see any curvature any more, and the function looks like a line that
we can approximate as ax + b for some coefficients a and b. This is useful for
determining whether a function is increasing or decreasing in some interval,
and for finding things like local minima or maxima.
The derivative f'(x) just gives this coefficient a for each particular x.
The f' notation is due to Lagrange and is convenient for functions that have
names but not so convenient for something like x^2 + 3. For more general
functions, a different notation due to Leibniz is used. The derivative of f
with respect to x is written as df/dx or (d/dx) f, and its value for a particular value
x = c is written using the somewhat horrendous notation

(d/dx) f |_{x=c}.
There is a formal definition of f'(x), which nobody ever uses, given by

f'(x) = lim_{Δx→0} (f(x + Δx) − f(x))/Δx,

where Δx is a single two-letter variable (not the product of Δ and x!) that
represents the change in x. In the preceding section, we calculated an example of this kind of limit and showed that (d/dx) x^2 = 2x.
f(x)            f'(x)
c               0
x^n             n·x^{n−1}
e^x             e^x
a^x             a^x·ln a                  (follows from a^x = e^{x ln a})
ln x            1/x
c·g(x)          c·g'(x)                   (multiplication by a constant)
g(x) + h(x)     g'(x) + h'(x)             (sum rule)
g(x)·h(x)       g(x)·h'(x) + g'(x)·h(x)   (product rule)
g(h(x))         g'(h(x))·h'(x)            (chain rule)
For example:

d/dx (x^2/ln x) = x^2 · (d/dx)(1/ln x) + (1/ln x) · (d/dx) x^2    [product rule]
               = x^2 · (−1)(ln x)^{−2} · (d/dx) ln x + 2x/ln x    [chain rule]
               = −x^2/(x·(ln x)^2) + 2x/ln x
               = −x/(ln x)^2 + 2x/ln x.
The idea is that whatever the outermost operation in an expression is,
you can apply one of the rules above to move the differential inside it, until
there is nothing left. Even computers can be programmed to do this. You
can do it too.
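And indeed a computer can check the work: comparing the closed form obtained from the table rules for the x^2/ln x example against a central-difference approximation of the derivative:

```python
from math import log

def f(x):
    return x ** 2 / log(x)

def fprime(x):
    """Result of applying the product and chain rules: 2x/ln x - x/(ln x)^2."""
    return 2 * x / log(x) - x / log(x) ** 2

for x in (2.0, 3.0, 10.0):
    h = 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    assert abs(numeric - fprime(x)) < 1e-5
```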
F.3
Integrals
First you have to know how to differentiate (see the previous section). Having
learned how to differentiate, your goal in integrating some function f(x) is
to find another function F(x) such that F'(x) = f(x). You can then write
∫_a^b f(x) dx = lim_{Δx→0} Σ_{i=0}^{(b−a)/Δx} f(a + i·Δx)·Δx.    (F.3.1)
Alternatively, one can also think of the definite integral ∫_a^b f(x) dx as a
special case of the indefinite integral ∫ f(x) dx = F(x) + C, where we choose
C = −F(a) so that F(a) + C = 0. In this case, F(b) + C = F(b) − F(a) =
∫_a^b f(x) dx. Where this interpretation differs from area under the curve is
that it works even if b < a.
Returning to anti-differentiation, how do you find the magic F(x) with
F'(x) = f(x)? Some possibilities:

Memorize some standard integral formulas. Some useful ones are given
in Table F.2.
Guess but verify. Guess F(x) and compute F'(x) to see if it's f(x).
May be time-consuming unless you are good at guessing, and can put
enough parameters in F(x) to let you adjust F'(x) to equal f(x).
Example: if f(x) = 2/x, you may remember the 1/x formula and
try F(x) = a·ln bx. Then F'(x) = ab/(bx) = a/x and you can set
a = 2, quietly forget you ever put in b, and astound your friends (who
also forgot the a·f(x) rule) by announcing that the integral is 2 ln x.
Sometimes if the answer comes out wrong you can see how to fudge
F(x) to make it work: if for f(x) = ln x you guess F(x) = x·ln x, then
F'(x) = ln x + 1, and you can notice that you need to add a −x term
(the integral of −1) to get rid of the 1. This gives ∫ ln x dx = x·ln x − x.
1
Having a bounded derivative over [a, b] will make (F.3.1) work, in the sense of giving
sensible results that are consistent with more rigorous definitions of integrals. An example
of a non-well-behaved function for this purpose is the non-differentiable function f with
f (x) = 1 if x is rational and f (x) = 0 if x is irrational. This is almost never 1, but (F.3.1)
will assign it a value 1 if a and b are rational. More sophisticated definitions of integrals,
like the Lebesgue integral, give more reasonable answers here.
(Table F.2: a table of standard integral formulas; only its side conditions, such as "a is constant" and "n constant, n ≠ −1," survive in this copy.)
Integrate by parts, using the formula ∫ u dv = u·v − ∫ v du.
An example is ∫ ln x dx = x·ln x − ∫ x d(ln x) = x·ln x − ∫ x·(1/x) dx =
x·ln x − ∫ 1 dx = x·ln x − x. You probably shouldn't bother memorizing
this unless you need to pass AP Calculus again, although you can
rederive it from the product rule for derivatives.
Use a computer algebra system like Mathematica, Maple, or Maxima. Mathematica's integration routine is available on-line at http://integrals.wolfram.com.
Look your function up in a big book of integrals. This is actually less
effective than using Mathematica, but may continue to work during
power failures.
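The "guess but verify" strategy also works numerically: differentiate the guessed F with a central difference and compare it to f. A sketch for the ∫ ln x dx = x·ln x − x example:

```python
from math import log

def F(x):
    """Guessed antiderivative of ln x."""
    return x * log(x) - x

for x in (0.5, 2.0, 7.0):
    h = 1e-6
    Fprime = (F(x + h) - F(x - h)) / (2 * h)   # should be close to ln x
    assert abs(Fprime - log(x)) < 1e-6
```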
Appendix G
G.1
∀x : 0 ≠ Sx.    (P1)
Some people define the natural numbers as starting at 1. Those people are generally
(a) wrong, (b) number theorists, (c) extremely conservative, or (d) citizens of the United
Kingdom of Great Britain and Northern Ireland. As computer scientists, we will count
from 0 as the gods intended.
2
This is not actually the first axiom that Peano defined. The original Peano axioms [Pea89, §1] included some axioms on the existence of Sx and the properties of equality
that have since been absorbed as standard rules of first-order logic. The axioms we are
presenting here correspond to Peano's axioms 8, 7, and 9.
This still allows for any number of nasty little models in which 0 is
nobody's successor, but we still stop before getting all of the naturals. For
example, let SS0 = S0; then we only have two elements in our model (0
and S0), because once we get to S0, any further applications of S keep us
where we are.
To avoid this, we need to prevent S from looping back round to some
number we've already produced. It can't get to 0 because of the first axiom,
and to prevent it from looping back to a later number, we take advantage
of the fact that they already have one successor:

∀x : ∀y : Sx = Sy → x = y.    (P2)
(P(0) ∧ (∀x : P(x) → P(Sx))) → ∀x : P(x).    (P3)
This is known as the induction schema, and says that, for any predicate P, if we can prove that P holds for 0, and we can prove that P(x)
implies P(x + 1), then P holds for all x in N. The intuition is that even
though we haven't bothered to write out a proof of, say, P(1337), we know
that we can generate one by starting with P(0) and modus-pwning our way
out to P(1337) using P(0) → P(1), then P(1) → P(2), then P(2) → P(3),
etc. Since this works for any number (eventually), there can't be some
number that we missed.
In particular, this lets us throw out the bogus numbers in the bad example above. Let B(x) be true if x is bogus (i.e., it's equal to B or one
of the other values in its chain of successors). Let P(x) ≡ ¬B(x). Then
P(0) holds (0 is not bogus), and if P(x) holds (x is not bogus) then so does
P(Sx). It follows from the induction axiom that ∀x P(x): there are no bogus
numbers.3
G.2
A simple proof
Let's use the Peano axioms to prove something that we know to be true
about the natural numbers we learned about in grade school but that might
not be obvious from the axioms themselves. (This will give us some confidence that the axioms are not bogus.) We want to show that 0 is the only
number that is not a successor:
number that is not a successor:
Claim G.2.1. ∀x : (x ≠ 0) → (∃y : x = Sy).
To find a proof of this, we start by looking at the structure of what we are
trying to prove. It's a universal statement about elements of N (implicitly,
the ∀x is really ∀x ∈ N, since our axioms exclude anything that isn't in N), so
our table of proof techniques suggests using an induction argument, which
in this case means finding some predicate we can plug into the induction
schema.
If we strip off the ∀x, we are left with

(x ≠ 0) → (∃y : x = Sy).
Here a direct proof is suggested: assume x ≠ 0, and try to prove
∃y : x = Sy. But our axioms don't tell us much about numbers that aren't
0, so it's not clear what to do with the assumption. This turns out to be a
dead end.
Recalling that A → B is the same thing as ¬A ∨ B, we can rewrite our
goal as

x = 0 ∨ ∃y : x = Sy.
3
There is a complication here. Peano's original axioms were formulated in terms of
second-order logic, which allows quantification over all possible predicates (you can
write things like ∀P : P(x) → P(Sx)). So the bogus predicate we defined is implicitly
included in that for-all. But if there is no first-order predicate that distinguishes bogus
numbers from legitimate ones, the induction axiom won't kick them out. This means
that the Peano axioms (in first-order logic) actually do allow bogus numbers to sneak in
somewhere around infinity. But they have to be very polite bogus numbers that never
do anything different from ordinary numbers. This is probably not a problem except for
philosophers. Similar problems show up for any model with infinitely many elements, due
to something called the Löwenheim-Skolem theorem.
This seems like a good candidate for P (our induction hypothesis), because we do know a few things about 0. Let's see what happens if we try
plugging this into the induction schema:
P(0) ≡ 0 = 0 ∨ ∃y : 0 = Sy. The right-hand term looks false because
of our first axiom, but the left-hand term is just the reflexive axiom
for equality. P(0) is true.
∀x P(x) → P(Sx). We can drop the ∀x if we fix an arbitrary x.
Expand the right-hand side P(Sx) ≡ Sx = 0 ∨ ∃y : Sx = Sy. We can
be pretty confident that Sx ≠ 0 (it's an axiom), so if this is true, we
had better show ∃y : Sx = Sy. The first thing to try for ∃ statements is
instantiation: pick a good value for y. Picking y = x works.
Since we showed P(0) and ∀x P(x) → P(Sx), the induction schema tells
us ∀x P(x). This finishes the proof.
Having figured the proof out, we might go back and clean up any false
starts to produce a compact version. A typical mathematician might write
the preceding argument as:
Proof. By induction on x. For x = 0, the premise fails. For Sx, let y = x.
A really lazy mathematician would write:
Proof. Induction on x.
Though laziness is generally a virtue, you probably shouldn't be quite
this lazy when writing up homework assignments.
G.3
Defining addition
Because of our restricted language, we do not yet have the ability to state
valuable facts like 1 + 1 = 2 (which we would have to write as S0 + S0 = SS0).
Let's fix this by adding a two-argument function symbol + which we will
define using the axioms

x + 0 = x.
x + Sy = S(x + y).
354
(We are omitting some quantifiers, since unbounded variables are implicitly universally quantified.)
This definition is essentially a recursive program for computing x + y using only successor, and there are some programming languages (e.g. Haskell)
that will allow you to define addition using almost exactly this notation. If
the definition works for all inputs to +, we say that + is well-defined. Not
working would include giving different answers depending on which parts of
the definitions we applied first, or giving no answer for some particular inputs. These bad outcomes correspond to writing a buggy program. Though
we can in principle prove that this particular definition is well-defined (using
induction on y), we won't bother. Instead, we will try to prove things about
our new concept of addition that will, among other things, tell us that the
definition gives the correct answers.
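The two defining equations really are a runnable recursive program. A sketch in Python, encoding a natural number as nested applications of S (the encoding is ours; the notes mention Haskell, but any language with recursion works):

```python
ZERO = 'Z'

def S(x):
    """Successor: wrap one more S around x."""
    return ('S', x)

def add(x, y):
    """Peano addition: x + 0 = x; x + Sy = S(x + y)."""
    if y == ZERO:
        return x
    return S(add(x, y[1]))

def to_int(x):
    """Count the S's, for readability."""
    n = 0
    while x != ZERO:
        n, x = n + 1, x[1]
    return n

one = S(ZERO)
two = S(one)
assert add(one, one) == two            # S0 + S0 = SS0
assert to_int(add(two, S(two))) == 5   # 2 + 3 = 5
```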
We start with a lemma, which is Greek for a result that is not especially
useful by itself but is handy for proving other results.4
Lemma G.3.1. 0 + x = x.
Proof. By induction on x. When x = 0, we have 0 + 0 = 0, which is true
from the first case of the definition. Now suppose 0 + x = x and consider
what happens with Sx. We want to show 0 + Sx = Sx. Rewrite 0 + Sx as
S(0 + x) [second case of the definition], and use the induction hypothesis to
show S(0 + x) = S(x).
(We could do a lot of QED-ish jumping around in the end zone there,
but it is more refined (and lazier) to leave off the end of the proof once it's
clear we've satisfied all of our obligations.)
Here's another lemma, which looks equally useless:
Lemma G.3.2. x + Sy = Sx + y.
Proof. By induction on y. If y = 0, then x + S0 = S(x + 0) = Sx = Sx + 0.
Now suppose the result holds for y and show x + SSy = Sx + Sy. We have
x + SSy = S(x + Sy) = S(Sx + y) [ind. hyp.] = Sx + Sy.
Now we can prove a theorem: this is a result that we think of as useful
in its own right. (In programming terms, its something we export from a
module instead of hiding inside it as an internal procedure.)
Theorem G.3.3. x + y = y + x. (Commutativity of addition.)
G.3.1
This actually came up on a subtraction test I got in the first grade from the terrifying
Mrs Garrison at Mountain Park Elementary School in Berkeley Heights, New Jersey. She
insisted that −2 was not the correct answer, and that we should have recognized it as a
trick question. She also made us black out the arrow to the left of the zero on the number-line
stickers we had all been given to put on the top of our desks. Mrs Garrison was, on the
whole, a fine teacher, but she did not believe in New Math.
a ≤ b ∧ c ≤ d → a + c ≤ b + d.
x ≤ y ∧ y ≤ x → x = y.
(The actual proofs will be left as an exercise for the reader.)
G.4
Let's define the predicate Even(x) ≡ ∃y : x = y + y. (The use of ≡ here signals
that Even(x) is syntactic sugar, and we should think of any occurrence of
Even(x) as expanding to ∃y : x = y + y.)
It's pretty easy to see that 0 = 0 + 0 is even. Can we show that S0 is not
even?
Lemma G.4.1. ¬Even(S0).
Proof. Expand the claim as ¬∃y : S0 = y + y, or equivalently ∀y : S0 ≠ y + y. Since we are
working over N, it's tempting to try to prove the ∀y bit using induction. But
it's not clear why S0 ≠ y + y would tell us anything about S0 ≠ Sy + Sy.
So instead we do a case analysis, using our earlier observation that every
number is either 0 or Sz for some z.
Case 1 y = 0. Then S0 ≠ 0 + 0, since 0 + 0 = 0 (by the definition of +)
and 0 ≠ S0 (by the first axiom).
Case 2 y = Sz. Then y + y = Sz + Sz = S(Sz + z) = S(z + Sz) = SS(z + z).6
Suppose S0 = SS(z + z) [Note: "Suppose" usually means we are
starting a proof by contradiction]. Then 0 = S(z + z) [second axiom],
violating ∀x : 0 ≠ Sx [first axiom]. So S0 ≠ SS(z + z) = y + y.
Since we have S0 ≠ y + y in either case, it follows that S0 is not even.
Maybe we can generalize this lemma! If we recall the pattern of non-even
numbers we may have learned long ago, each of them (1, 3, 5, 7, . . . ) happens
to be the successor of some even number (0, 2, 4, 6, . . . ). So maybe it holds
that:
Theorem G.4.2. Even(x) → ¬Even(Sx).
G.5
x ≠ 0 ∧ x·y = x·z → y = z.
x·(y + z) = x·y + x·z.
x ≤ y → z·x ≤ z·y.
z ≠ 0 ∧ z·x ≤ z·y → x ≤ y.
(Note we are using 1 as an abbreviation for S0.)
The first few of these are all proved pretty much the same way as for
addition. Note that we can't divide in N any more than we can subtract,
which is why we have to be content with multiplicative cancellation.
Exercise: Show that the Even(x) predicate, defined previously as ∃y : x =
y + y, is equivalent to Even'(x) ≡ ∃y : x = 2·y, where 2 = SS0. Does this
definition make it easier or harder to prove ¬Even'(S0)?
Bibliography
[BD92]
Dave Bayer and Persi Diaconis. Trailing the dovetail shuffle to its
lair. Annals of Applied Probability, 2(2):294–313, 1992.
[Ber34]
George Berkeley. THE ANALYST; OR, A DISCOURSE Addressed to an Infidel MATHEMATICIAN. WHEREIN It is examined whether the Object, Principles, and Inferences of the modern
Analysis are more distinctly conceived, or more evidently deduced,
than Religious Mysteries and Points of Faith. Printed for J. Tonson, London, 1734.
[Big02]
[Bou70]
[Ded01]
[Die10]
[Fer08]
[Hau14]
[Kol33]
[Kur21]
[Pel99]
[Ros12]
[Say33]
[Sch01]
[Sol05]
[Sta97]
Richard P. Stanley. Enumerative Combinatorics, Volume 1. Number 49 in Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1997.
[Str05]
[SW86]
[TK74]
Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131,
September 1974.
[Wil95]
Index
O(f (n)), 100
(f (n)), 100
(f (n)), 100
, 14
N, 64, 74
Q, 64
R, 64
Z, 74
, 13
(f (n)), 100
, 14
, 13
, 14
D, 74
k-permutation, 169
n-ary relation, 122
o(f (n)), 100
abelian group, 65, 270
absolute value, 76
absorption law, 22
abuse of notation, 66, 105
accepting state, 59
acyclic, 154
addition, 65
addition (inference rule), 36
additive identity, 270
additive inverse, 270
adjacency matrix, 124
adjacent, 144
adversary, 28
affine transformation, 262
aleph-nought, 60
aleph-null, 60
aleph-one, 60
aleph-zero, 60
algebra, 74
linear, 243
algebraic field extension, 272
algebraically closed, 73
alphabet, 59
and, 14
annihilator, 68
ansatz, 96
antisymmetric, 125
antisymmetry, 70
arc, 140
arguments, 25
associative, 65, 270
associativity, 65
of addition, 65
of multiplication, 67
associativity of addition, 355
asymptotic notation, 100
atom, 211
automorphism, 151
average, 225
axiom, 9, 34
Axiom of Extensionality, 48
axiom schema, 52, 351
axiomatic set theory, 52
axioms
field, 65
for the real numbers, 64
362
of relations, 124
compound proposition, 14
comprehension
restricted, 52
conclusion, 34
conditional expectation, 227
conditional probability, 217
congruence mod m, 113
conjuctive normal form, 23
conjunction (inference rule), 36
connected, 152
connected components, 152
connected to, 152
consistent, 10
constants, 11
constructive, 41
contained in, 49
continuous random variable, 241
contraction, 149
contradiction, 17
contrapositive, 19
converges, 91
converse, 19
coordinate, 243
coprime, 108
countable, 62
counting two ways, 170
covariance, 234
CRC, 276
cross product, 150
cube, 147
Curry-Howard isomorphism, 21
cycle, 146, 153
simple, 153
cyclic redundancy check, 276
DAG, 123, 154
decision theory, 225
Dedekind cut, 59
deducible, 34
Deduction Theorem, 37
definite integral, 348
definition
recursive, 77, 78
degree, 144, 271
dense, 72
density, 241
joint, 241
derivative, 346
deterministic finite state machine, 59
diagonalization, 62
die roll, 214
differential, 346
dimension, 243, 247, 259
of a matrix, 124
directed acyclic graph, 123, 137, 154
directed graph, 122, 140, 141
directed multigraph, 141
discrete probability, 211
discrete random variable, 220
disjoint, 60
disjoint union, 60
disjunctive normal form, 24
disjunctive syllogism, 36
distribution, 220
Bernoulli, 221
binomial, 221
geometric, 221
joint, 222
marginal, 222
normal, 222
Poisson, 221
uniform, 221
distribution function, 240
distributive, 270
divergence to infinity, 346
divided by, 67
divides, 106
divisibility, 128
division, 67
division algorithm, 83, 107
for polynomials, 271
divisor, 106, 107
DNF, 24
domain, 25, 54
dot product, 256
double factorial, 287
downward closed, 59
dyadics, 74
edge, 122
parallel, 122, 140
edges, 140
egf, 209
Einstein summation convention, 91
element
minimum, 308
elements, 47, 247
empty set, 47
endpoint, 141
entries, 247
enumerative combinatorics, 162
equality, 31
equivalence class, 126, 127
equivalence relation, 125
Euclidean algorithm, 108
extended, 109
Euler's Theorem, 119
Euler's totient function, 119
Eulerian cycle, 154
Eulerian tour, 154
even, 113
even numbers, 61
event, 211
exclusive or, 13
existential quantification, 26
existential quantifier, 26
expectation, 224
exponential generating function, 209
extended Euclidean algorithm, 109
extended real line, 72
extension
of a partial order to a total order,
132
factor, 106
factorial, 168, 287
double, 287
false positive, 219
Fermat's Little Theorem, 120
field, 65, 269
finite, 113, 268
Galois, 273
ordered, 70
field axioms, 65
field extension, 272
field of fractions, 74
finite field, 113, 268
finite simple undirected graph, 140
first-order logic, 25
floating-point number, 65
floor, 55, 75, 107
flush, 301
formal power series, 185
fraction, 64, 74
fractional part, 76
frequentist interpretation, 210
full rank, 264
function symbol, 31
functions, 54
Fundamental Theorem of Arithmetic, 112
Galois field, 273
gcd, 108
Generalized Continuum Hypothesis, 61
generating function, 185
probability, 237
generating functions, 183
geometric distribution, 221
Goldbach's conjecture, 31
graph, 140
bipartite, 123, 142
directed, 122
directed acyclic, 123, 137
simple, 122, 140
two-path, 296
undirected, 140
graph Cartesian product, 149
Graph Isomorphism Problem, 150
greatest common divisor, 108
greatest lower bound, 72, 131
group, 65, 270
abelian, 270
commutative, 65, 270
Hamiltonian cycle, 154
Hamiltonian tour, 155
Hasse diagram, 130
head, 141
homomorphism, 115, 151
graph, 293
hyperedges, 142
hypergraph, 142
hypothesis, 34
hypothetical syllogism, 36
identity, 66
additive, 270
for addition, 66
for multiplication, 67
multiplicative, 270
identity element, 92
identity matrix, 251
if and only if, 14
immediate predecessor, 130
immediate successor, 130
implication, 14
in-degree, 144
incident, 144
inclusion-exclusion, 181
inclusion-exclusion formula, 166
inclusive or, 13
incomparable, 130
incompleteness theorem, 35
indefinite integral, 348
independent, 214, 223, 241
pairwise, 234
index of summation, 86
indicator variable, 219
indirect proof, 19
induced subgraph, 147
induction, 77
induction hypothesis, 78
induction schema, 78, 351
induction step, 78
inequality
Chebyshev's, 235
inference rule, 35
inference rules, 9, 34
infimum, 72
infinite descending chain, 135
infinitesimal, 72
infix notation, 122
initial state, 59
initial vertex, 122, 141
injection, 57
injective, 57
integer, 64
integers, 74
integers mod m, 113
integral
definite, 348
indefinite, 348
Lebesgue, 348
Lebesgue-Stieltjes, 224
integration by parts, 349
intersection, 49
intuitionistic logic, 21
invariance
scaling, 70
translation, 70
inverse, 19, 57, 67
additive, 270
multiplicative, 270
of a relation, 125
invertible, 251
irrational, 64
irreducible, 272
isomorphic, 56, 75, 126, 150
isomorphism, 150
join, 108, 131
joint density, 241
joint distribution, 221, 222, 240
Kolmogorov's extension theorem, 213
labeled, 143
lambda calculus, 63
lattice, 108, 131
law of non-contradiction, 21
law of the excluded middle, 21
law of total probability, 218
lcm, 108
least common multiple, 108
least upper bound, 71, 131
Lebesgue integral, 348
Lebesgue-Stieltjes integral, 224
lemma, 34, 37, 354
length, 146, 152, 255
lex order, 129
lexicographic order, 129
LFSR, 268
limit, 91, 344
linear, 87, 226, 259
linear algebra, 243
linear combination, 257
linear transformation, 249, 259
linear-feedback shift register, 268
linearly independent, 257
little o, 100
little omega, 100
logical equivalence, 17
logically equivalent, 17
loops, 141
lower bound, 72, 86
lower limit, 86
lower-factorial, 169
magma, 270
marginal distribution, 222
Markov's inequality, 231
mathematical maturity, 1, 3
matrix, 123, 247
adjacency, 124
maximal, 131
maximum, 131
measurable, 240
measurable set, 212
meet, 108, 131
mesh, 150
method of infinite descent, 81
minimal, 131
minimum, 131
minimum element, 308
minor, 149
minus, 66
model, 10, 16, 33
model checking, 16
modulus, 107
modus ponens, 24, 35, 36
modus tollens, 36
monoid, 270
multigraph, 142
multinomial coefficient, 171
multiplicative identity, 270
multiplicative inverse, 109, 115, 270
multiset, 128
multivariate generating functions, 193
naive set theory, 47
natural, 64
natural deduction, 38
natural number, 64
natural numbers, 47, 74, 350
negation, 13, 66
negative, 66, 70
neighborhood, 144
node, 140
non-constructive, 21, 41, 53
non-negative, 70
non-positive, 70
normal, 258
normal distribution, 222, 241
notation
abuse of, 105
asymptotic, 100
number
complex, 64
floating-point, 65
natural, 64
rational, 64
real, 64
number theory, 106
O
big, 100
o
little, 100
octonion, 64
odd, 113
odd numbers, 61
Omega
big, 100
omega
little, 100
on average, 225
one-to-one, 57
one-to-one correspondence, 57
onto, 57
or, 13
exclusive, 13
inclusive, 13
order
lexicographic, 129
partial, 128
pre-, 129
quasi-, 129
total, 70, 128, 132
ordered field, 70
ordered pair, 53
orthogonal, 257
orthogonal basis, 258
orthonormal, 258
out-degree, 144
outcome, 212
over, 67
pairwise disjoint, 165
pairwise independent, 234
parallel edge, 122
parallel edges, 140
partial order, 128
strict, 128
partially ordered set, 128
partition, 59, 126
Pascal's identity, 177
Pascal's triangle, 178
path, 146, 152
Peano axioms, 350
peer-to-peer, 144
permutation, 169
k-, 169
pgf, 209, 237
plus, 65
Poisson distribution, 221
poker deck, 301
poker hand, 214
pole, 203
polynomial ring, 271
poset, 128
product, 129
positive, 70
power, 153
predecessor, 130
predicate logic, 11
predicates, 11, 25
prefix, 129
premise, 34
preorder, 129
prime, 82, 106
prime factorization, 112
probability, 210, 211
conditional, 217
discrete, 211
probability distribution function, 220
probability generating function, 209, 237
probability mass function, 220, 240
probability measure, 212
probability space, 212
uniform, 213
product, 250
product poset, 129
product rule, 167
projection, 264
projection matrix, 266
proof
by contraposition, 19
proof by construction, 41
proof by example, 41
proposition, 12
compound, 14
propositional logic, 11, 12
provable, 34
pseudo-ring, 270
Pythagorean theorem, 257
quantifiers, 26
quantify, 11
quasiorder, 129
quaternion, 64
quotient, 107
quotient set, 126
radius of convergence, 203
random bit, 214
random permutation, 214
random variable, 219, 239
continuous, 241
discrete, 220
range, 54, 57
rank, 264
ranking, 176
rational, 64
rational decision maker, 225
rational functions, 203
rational number, 64
reachable, 152
real number, 64
recursion, 77
recursive, 84
recursive definition, 77, 78
recursively-defined, 83
reflexive, 125
reflexive closure, 136
reflexive symmetric transitive closure, 137
reflexive transitive closure, 137
reflexivity, 70
reflexivity axiom, 32
regression to the mean, 229
regular expression, 197
relation, 122
n-ary, 122
binary, 122
equivalence, 125
on a set, 122
relatively prime, 108
remainder, 107
representative, 114, 127
residue class, 113
resolution, 23, 36
resolution proof, 24
resolving, 23
restricted comprehension, 52
restriction, 169
rig, 270
ring, 68, 70, 270
commutative, 270
polynomial, 271
rng, 270
round-off error, 65
row, 247
row space, 261
row vector, 255
RSA encryption, 120
Russell's paradox, 48
scalar, 245, 246, 253
scalar multiplication, 246
scalar product, 253
scaling, 245, 262
scaling invariance, 70
second-order logic, 352
selection sort, 133
self-loops, 140
semigroup, 270
semiring, 70, 270
sequence, 247
set comprehension, 48
set difference, 49
set theory
axiomatic, 52
naive, 47
set-builder notation, 48
shear, 263
sigma-algebra, 212
signature, 33
signum, 76
simple, 141, 152
simple cycle, 153
simple induction, 77
simple undirected graph, 141
simplification (inference rule), 36
sink, 122, 141
size, 57
sort
selection, 133
topological, 132
soundness, 34
source, 122, 141
span, 257
spanning tree, 160
square, 247
square product, 149
standard basis, 258
star graph, 146
state space, 59
statement, 11
Stirling number, 170
strict partial order, 128
strong induction, 81
strongly connected, 152, 153
strongly-connected component, 137
strongly-connected components, 153
structure, 32
sub-algebra, 74
subgraph, 147
induced, 147
sublattice, 113
subscript, 55
subset, 49
subspace, 257
substitution (inference rule), 39
substitution axiom schema, 32
substitution rule, 32, 39
subtraction, 66
successor, 130
sum rule, 164
supremum, 72
surjection, 57
surjective, 56, 57
symmetric, 125, 249
symmetric closure, 137
symmetric difference, 49
symmetry, 32
syntactic sugar, 55
tail, 141
tautology, 17
terminal vertex, 122, 141
theorem, 9, 34, 37, 354
Wagner's, 149
theory, 9, 33
Theta
big, 100
Three Stooges, 47
topological sort, 132
topologically sorted, 154
total order, 70, 128, 132
totally ordered, 128
totient, 119
transition function, 59
transition matrix, 249
transitive, 125
transitive closure, 137, 153
transitivity, 32, 70
translation invariance, 70
transpose, 249
tree, 154, 156
triangle inequality, 256
trichotomy, 70
truth table, 16
proof using, 17
tuples, 55
turnstile, 34
two-path graph, 296
uncorrelated, 234
uncountable, 63
undirected graph, 140, 141
uniform discrete probability space, 213
uniform distribution, 221
union, 49
unit, 106
unit vector, 256
universal quantification, 26
universal quantifier, 26
universe, 49
universe of discourse, 25
unranking, 176
upper bound, 71, 86
upper limit, 86
valid, 35, 36
Vandermonde's identity, 179
variable
indicator, 219
variance, 232
vector, 243, 246, 255
unit, 256
vector space, 243, 246
vertex, 122
initial, 122
terminal, 122
vertices, 140
Von Neumann ordinals, 58
Wagner's theorem, 149
weakly connected, 153
web graph, 144
weight, 184
well order, 135
well-defined, 354
well-ordered, 58, 80
Zermelo-Fraenkel set theory with choice, 52
ZFC, 52
Zorn's lemma, 135