lecture2-notes
1 Introduction
In general, when analyzing an algorithm, we want to know two things.
1. Does it work?
2. Does it have good performance?
Today, we’ll begin to see how we might formally answer these questions, through the lens
of sorting. We’ll start, as a warm-up, with InsertionSort, and then move on to the (more
complicated) MergeSort. MergeSort will also allow us to continue our exploration of Divide
and Conquer.
2 InsertionSort
In your pre-lecture exercise, you should have taken a look at a couple of ways of implementing
InsertionSort. Here’s one:
def InsertionSort(A):
    for i in range(1, len(A)):
        current = A[i]
        j = i - 1
        while j >= 0 and A[j] > current:
            A[j+1] = A[j]
            j -= 1
        A[j+1] = current
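For concreteness, here is a quick illustrative usage (the particular input list is just an example we chose):
A = [5, 2, 4, 6, 1, 3]
InsertionSort(A)    # InsertionSort sorts the list in place
print(A)            # prints [1, 2, 3, 4, 5, 6]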
Let us ask our two questions: does this algorithm work, and does it have good performance?
It may seem obvious that InsertionSort works, but for the more complicated algorithms we’ll study in the future, it won’t always be obvious that they work, and so we’ll have to prove it. To warm us up for those proofs, let’s carefully go through a proof of correctness of InsertionSort.
We’ll do the proof by maintaining a loop invariant, in this case that after iteration i, A[:i+1] is sorted. This is true when i = 0 (because the one-element list A[:1] is sorted) and
then we’ll show that for any i > 0, if it’s true for i − 1, then it’s true for i . At the end of the
day, we’ll conclude that A[:n] (aka, the whole thing) is sorted and we’ll be done.
Formally, we will proceed by induction.
• Inductive hypothesis. After iteration i of the outer loop, A[:i+1] is sorted.
• Base case. When i = 0, A[:1] contains only one element, and this is sorted.
• Inductive step. Suppose that the inductive hypothesis holds for i − 1, so A[:i] is
sorted after the i − 1’st iteration. We want to show that A[:i+1] is sorted after the
i ’th iteration.
Suppose that j∗ is the largest integer in {0, . . . , i − 1} so that A[j∗] ≤ A[i]. Then
the effect of the inner loop is to turn
[A[0], A[1], . . . , A[j∗], . . . , A[i − 1], A[i]]
into
[A[0], A[1], . . . , A[j∗], A[i], A[j∗ + 1], . . . , A[i − 1]].
We claim that this latter list is sorted. This is because A[i] ≥ A[j∗], and by the inductive
hypothesis, we have A[j∗] ≥ A[j] for all j ≤ j∗, and so A[i] is larger than or equal to
everything that is positioned before it. Similarly, by the choice of j∗ we have
A[i] < A[j∗ + 1] ≤ A[j] for all j ∈ {j∗ + 1, . . . , i − 1}, so A[i] is smaller than everything that comes
after it. Thus, A[i] is in the right place. All of the other elements were already in the
right place, so this proves the claim. (If no such j∗ exists, then A[i] is smaller than every element of A[:i], the inner loop moves it to the front, and A[:i+1] is again sorted.)
Thus, after the i ’th iteration completes, A[:i+1] is sorted, and this establishes the
inductive hypothesis for i .
• Conclusion. By induction, we conclude that the inductive hypothesis holds for all
i ≤ n − 1. In particular, this implies that after the end of the n − 1’st iteration (after the
algorithm ends) A[:n] is sorted. Since A[:n] is the whole list, this means the whole
list is sorted when the algorithm terminates, which is what we were trying to show.
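If you’d like to sanity-check the loop invariant empirically (this is just a debugging aid we are adding, not part of the proof), you can instrument the code with an assertion after each iteration of the outer loop:
def InsertionSortChecked(A):
    for i in range(1, len(A)):
        current = A[i]
        j = i - 1
        while j >= 0 and A[j] > current:
            A[j+1] = A[j]
            j -= 1
        A[j+1] = current
        # loop invariant: after iteration i, the prefix A[:i+1] is sorted
        assert all(A[k] <= A[k+1] for k in range(i)), "invariant violated at iteration %d" % i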
The above proof was maybe a bit pedantic: we used a lot of words to prove something that
may have been pretty obvious. However, it’s important to understand the structure of this
argument, because we’ll use it a lot, sometimes for more complicated algorithms.
What about the running time? In the worst case, the nested loops of InsertionSort perform roughly n^2 operations on a list of length n. We’re not going to stress the precise operation count, because we’ll argue at the end of the lecture that we don’t care too much about it. The main question is: can we do asymptotically better than n^2? That is, can we come up with an algorithm that sorts an arbitrary list of n integers in time that scales less than n^2? For example, like n^1.5, or n log(n), or even n? Next, we’ll see that MergeSort scales like n log(n), which is much faster.
3 MergeSort
Recall the Divide-and-conquer paradigm from the first lecture. In this paradigm, we use the
following strategy:
• Break the problem into sub-problems.
• Solve the sub-problems (often recursively)
• Combine the results of the sub-problems to solve the big problem.
At some point, the sub-problems become small enough that they are easy to solve, and then
we can stop recursing.
With this approach in mind, MergeSort is a very natural algorithm to solve the sorting problem.
The pseudocode is below:
def MergeSort(A):
    n = len(A)
    if n <= 1:
        return A
    L = MergeSort( A[:n//2] )
    R = MergeSort( A[n//2:] )
    return Merge(L, R)
Above, we are using Python notation, so A[:n//2] = [A[0], A[1], . . . , A[n//2 − 1]] and A[n//2:] = [A[n//2], . . . , A[n − 1]]. Additionally, // denotes integer division, so n//2 means ⌊n/2⌋.
How do we do the Merge procedure? We need to take two sorted arrays, L and R, and merge
them into a sorted array that contains both of their elements. See the slides for a walkthrough
of this procedure.
def Merge(L, R):
    i = 0
    j = 0
    m = len(L) + len(R)
    S = []
    for k in range(m):
        if L[i] < R[j]:
            S.append( L[i] )
            i += 1
        else:
            S.append( R[j] )
            j += 1
    return S
Note: This pseudocode is incomplete! What happens if we get to the end of L or R? Try to
adapt the pseudocode above to fix this.
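For reference, here is one way it might be completed (a sketch; try it yourself before peeking). The idea is to stop comparing once one of the two lists is exhausted, and then append whatever remains of the other list:
def Merge(L, R):
    i = 0
    j = 0
    S = []
    # take the smaller front element while both lists still have elements left
    while i < len(L) and j < len(R):
        if L[i] < R[j]:
            S.append(L[i])
            i += 1
        else:
            S.append(R[j])
            j += 1
    # at most one of these is non-empty, and its elements are already sorted
    S.extend(L[i:])
    S.extend(R[j:])
    return S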
As before, we need to ask: Does it work? And does it have good performance?
Correctness can be proved by induction, much as for InsertionSort, so let’s focus on the running time. To analyze it, consider the recursion tree of MergeSort:
[Figure: the recursion tree of MergeSort.]
At the top (zeroth) level is the whole problem, which has size n. This gets broken into two
subproblems, each of size n/2, and so on. At the t’th level, there are 2^t problems, each of
size n/2^t. This continues until we have n problems of size 1, which happens at the log(n)’th
level.
Some notes:
• In this class, logarithms will always be base 2, unless otherwise noted.
• We are being a bit sloppy in the picture above: what if n is not a power of 2? Then
n/2^j might not be an integer. In the pseudocode above, we break a problem of size
n into problems of size ⌊n/2⌋ and ⌈n/2⌉. Keeping track of this in our analysis will be
messy, and it won’t add much, so we will ignore it, and for now, we will assume that n
is a power of 2. (To formally justify this assumption, notice that we can always sort a longer list of length n′ = 2^⌈log n⌉. That is, we’ll add extra entries, whose values are ∞, to the list. Then we sort the new list of length n′ and return the first n values. Since n′ ≤ 2n (why?) this won’t affect the asymptotic running time. Also, see CLRS Section 4.6.2 for a rigorous analysis of the original algorithm with floors and ceilings.)
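To make the padding trick in the note above concrete, here is a sketch (the helper name SortViaPadding is ours, float('inf') plays the role of ∞, and we assume a completed Merge as sketched earlier):
import math

def SortViaPadding(A):
    n = len(A)
    if n <= 1:
        return A[:]
    m = 2 ** math.ceil(math.log2(n))        # n' = 2^⌈log n⌉, the next power of 2
    padded = A + [float('inf')] * (m - n)   # pad with +infinity
    return MergeSort(padded)[:n]            # sort the padded list, keep the first n values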
To figure out the total amount of work, we will figure out how much work is done at each node
in the tree, and add it all up. To that end, we tally up the work that is done in a particular
node in the tree—that is, in a particular call to MergeSort. There are three things:
1. Checking the base case
2. Making recursive calls (but we don’t count the work done in those recursive calls; that
will count in other nodes)
3. Running Merge.
Let’s analyze each of these. Suppose that our input has size m (so that m = n/2^j for some j).
1. Checking the base case doesn’t take much time. For concreteness, let us say that it
takes one operation to retrieve the length of A, and another operation to compare this
length to 1, for a total of two operations. (Of course, there’s no reason that the “operation” of getting the length of A should take the same amount of time as the “operation” of comparing two integers. This disconnect is one of the reasons we’ll introduce big-Oh notation at the end of this lecture.)
2. Making the recursive calls should also be fast. If we implemented the pseudocode well,
it should also take a constant number of operations.
Aside: This is a good point to talk about how we interpret pseudocode in
this class. Above, we’ve written MergeSort(A[:n//2]) as an example of a
recursive call. This makes it clear that we are supposed to recurse on the first
half of the list, but it’s not clear how we implement that. Our “pseudocode”
above is working Python code, and in Python, this implementation, while clear,
is a bit inefficient. That is, written this way, Python will copy the first n//2
elements of the list before sending them to the recursive call. A much better
way would be to instead just pass in pointers to the 0’th and (n//2 − 1)’st index
in the list. This would result in a faster algorithm, but kludgier pseudocode;
a sketch of this index-based version appears after this list.
In this class, we generally will opt for cleaner pseudocode, as long as it does
not hurt the asymptotic running time of the algorithm. In this case, our
simple-but-slower pseudocode turns out not to affect the asymptotic running
time, so we’ll stick with it.
In light of the above Aside, let’s suppose that this step takes m + 2 operations, m/2
to copy each half of the list over, and 2 operations to store the results. Of course, a
better implementation of this step would only take a constant number of (say, four)
operations.
3. The third thing is the tricky part. We claim that the Merge step takes a number of
operations that is linear in m; let’s work out roughly how many.
Consider a single call to Merge, where the two lists L and R together contain m numbers.
How long will it take for Merge to execute? To start, there are two initializations, for i
and j. Then, we enter a for loop which will execute m times. Each iteration requires one
comparison, followed by an assignment to S and an increment of i or j. Finally, we’ll
need to increment the for-loop counter k. If we assume that each type of operation costs
us a certain amount of time, say Cost_a for assignment, Cost_c for comparison, and Cost_i
for incrementing a counter, then we can express the total time of the Merge subroutine
as follows:
2·Cost_a + m·(Cost_a + Cost_c + 2·Cost_i)
This is a precise, but somewhat unruly, expression of the running time. In particular,
it seems difficult to keep track of lots of different constants, and it isn’t clear which
costs will be more or less expensive (especially if we switch programming languages or
machine architectures). To simplify our analysis, we choose to assume that there is
some global constant c_op which represents the cost of an operation. You may think of
c_op as max{Cost_a, Cost_c, Cost_i, . . .}. We can then bound the running time
of Merge by
2·c_op + 4·c_op·m = (4m + 2)·c_op,
that is, 4m + 2 operations if we measure time in units of c_op. Adding up the three contributions above, a single call of MergeSort on an input of size m does at most
2 + (m + 2) + (4m + 2) = 5m + 6 ≤ 11m
operations (not counting the work done inside the recursive calls), using the assumption that m ≥ 1. This is a very loose bound; for larger m, 5m + 6 is much
closer to 5m than it is to 11m. But, as we’ll discuss more below, the difference between 5
and 11 won’t matter too much to us, so much as the linear dependence on m.
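As promised in the Aside above, here is a sketch of the index-based variant (the name MergeSortRange and the lo/hi convention are our own choices). It avoids copying the two halves before recursing, although Merge still builds a new output list:
def MergeSortRange(A, lo, hi):
    # sort A[lo:hi] and return the result as a new list
    if hi - lo <= 1:
        return A[lo:hi]
    mid = (lo + hi) // 2
    L = MergeSortRange(A, lo, mid)
    R = MergeSortRange(A, mid, hi)
    return Merge(L, R)

# MergeSortRange(A, 0, len(A)) sorts all of A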
Now that we understand how much work is done in one call where the input has size m,
we want to add it all up to obtain a bound on the total number of operations
required by MergeSort. At first glance, the pessimist in you may be concerned that at
each level of recursive calls, we’re spawning an exponentially increasing number of copies of
MergeSort (because the number of calls at each depth doubles). Dual to this, the optimist in
you will notice that at each level, the inputs to the problems are decreasing at an exponential
rate (because the input size halves with each recursive call). Today, the optimists win out.
Claim 1. MergeSort requires at most 11n log n + 11n operations to sort n numbers.
Before we go about proving this bound, let’s first consider whether this running time bound is
good. We mentioned earlier that more obvious methods of sorting, like InsertionSort, required
roughly n^2 operations. How does n^2 = n · n compare to n · log n? An intuitive definition of
log n is the following: “Enter n into your calculator. Divide by 2 until the total is ≤ 1. The
number of times you divided is the logarithm of n.” This number in general will be significantly
smaller than n. In particular, if n = 32, then log n = 5; if n = 1024, then log n = 10. Already,
to sort arrays of ≈ 10^3 numbers, the savings of n log n as compared to n^2 will be orders of
magnitude. At larger problem instances of 10^6, 10^9, etc. the difference will become even
more pronounced! n log n is much closer to growing linearly (with n) than it is to growing
quadratically (with n^2).
One way to argue about the running time of recursive algorithms is to use recurrence relations.
A recurrence relation for a running time expresses the time it takes to solve an input of size n
in terms of the time required to solve the recursive calls the algorithm makes. In particular,
we can write the running time T(n) for MergeSort on an array of n numbers as the following
expression:
T(n) ≤ 2·T(n/2) + 11n for n > 1, with T(1) ≤ 2,
where the 11n accounts for all of the work done in the top call itself (checking the base case, setting up the recursive calls, and merging), as computed above.
There are several sophisticated and powerful techniques for solving recurrences. We will cover
many of these techniques in the coming lectures. Today, we can analyze the running time
directly.
Proof of Claim 1. Consider the recursion tree of a call to MergeSort on an array of n numbers.
Assume for simplicity that n is a power of 2. Let’s refer to the initial call as Level 0, the
subsequent recursive calls as Level 1, and so on, numbering the level of recursion by its depth
in the tree. How deep is the tree? At each level, the size of the inputs is divided in half, and
there are no recursive calls when the input size is ≤ 1 element. By our earlier “definition”,
this means the bottom level will be Level log n. Thus, there will be a total of log n + 1 levels.
We can now ask two questions: (1) How many subproblems are there at Level i ? (2) How
large are the individual subproblems at Level i? We can observe that at the i’th level, there
will be 2^i subproblems, each with inputs of size n/2^i.
We’ve already worked out that each sub-problem with an input of size n/2^i takes at most
11n/2^i operations. Now we can add this up: the work done at Level i is at most
(number of subproblems at Level i) · (work per subproblem) = 2^i · 11n/2^i = 11n.
Importantly, we can see that the work done at Level i is independent of i – it only depends on
n and is the same for every level. This means we can bound the total running time as follows:
total work ≤ (work per level) · (number of levels) = 11n · (log n + 1) = 11n log n + 11n,
which proves the claim.
Before moving on, it is worth asking whether this style of worst-case analysis is too pessimistic. For example, what would happen
if we received the sequence of numbers [1, 2, 3, 5, 4, 6, 7, 8]? There is a “sorting algorithm”
for this sequence that only takes a few operations, but MergeSort runs through all log n + 1
levels of recursion anyway. Would it be better to try to design our algorithms with this in
mind? Additionally, in our analysis, we’ve given a very loose upper bound on the time required
of Merge and dropped some constant factors and lower-order terms. Is this a problem? In
what follows, we’ll argue that these are features, not bugs, in the design and analysis of the
algorithm.
4 Asymptotic Analysis
In this class, we will usually measure efficiency using worst-case, asymptotic analysis: we focus on the
running time of your algorithm as your input size gets very large (i.e. n → +∞). This
framework is motivated by the fact that if we need to solve a small problem, it doesn’t cost
that much to solve it by brute force. If we want to solve a large problem, we may need to be
much more creative for the problem to run efficiently. From this perspective, it should be very
clear that 11n(log n + 1) is much better than n^2/2. (If you are unconvinced, try plugging in
some values for n.)
Intuitively, we’ll say that an algorithm is “fast” when the running time grows “slowly” with the
input size. In this class, we want to think of growing “slowly” as growing as close to linear as
possible. Based on this intuitive notion, we can come up with a formal system for analyzing
how quickly the running time of an algorithm grows with its input size.
“Big-Oh” Notation:
Intuitively, Big-Oh notation gives an upper bound on a function. We say T(n) is O(f(n))
when, as n gets big, f(n) grows at least as quickly as T(n). Formally, we say
T(n) = O(f(n)) ⇐⇒ ∃ c > 0, n0 > 0 s.t. ∀ n ≥ n0, 0 ≤ T(n) ≤ c · f(n).
“Big-Omega” Notation:
Intuitively, Big-Omega notation gives a lower bound on a function. We say T(n) is Ω(f(n))
when, as n gets big, f(n) grows no faster than T(n). Formally, we say
T(n) = Ω(f(n)) ⇐⇒ ∃ c > 0, n0 > 0 s.t. ∀ n ≥ n0, 0 ≤ c · f(n) ≤ T(n).
“Big-Theta" Notation:
T (n) is Θ(f (n)) if and only if T (n) = O(f (n)) and T (n) = Ω(f (n)). Equivalently, we can say
that
T(n) = Θ(f(n)) ⇐⇒ ∃ c1 > 0, c2 > 0, n0 > 0 s.t. ∀ n ≥ n0, 0 ≤ c1 · f(n) ≤ T(n) ≤ c2 · f(n).
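As a quick worked example (our own, tying back to MergeSort): the bound T(n) = 11n log n + 11n from Claim 1 is Θ(n log n). For n ≥ 2 we have log n ≥ 1, so 11n ≤ 11n log n, and hence 11n log n ≤ T(n) ≤ 22n log n; the definition above is satisfied with c1 = 11, c2 = 22, and n0 = 2.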
We can see that these notations do capture exactly the behavior that we want – namely, to
focus on the rate of growth of a function as the inputs get large, ignoring constant factors
and lower-order terms. As a sanity check, consider the following example and non-example.
Claim 2. All degree-k polynomials are O(n^k). (To be more precise, all degree-k polynomials T so that T(n) ≥ 0 for all n ≥ 1. How would you adapt the proof below to cover all degree-k polynomials T with a positive leading coefficient?)
[Figure 3.1 from CLRS: examples of asymptotic bounds. Note that in that figure, f(n) corresponds to our T(n) and g(n) corresponds to our f(n).]
Proof of Claim 2. Write T(n) = a_k·n^k + · · · + a_1·n + a_0 and let a∗ = max{|a_k|, . . . , |a_0|}. Then for every n ≥ 1,
T(n) = a_k·n^k + · · · + a_1·n + a_0
≤ a∗·n^k + · · · + a∗·n + a∗
≤ a∗·n^k + · · · + a∗·n^k + a∗·n^k
= (k + 1)·a∗·n^k.
Thus the definition of big-Oh is satisfied with c = (k + 1)·a∗ and n0 = 1, so T(n) = O(n^k).
Claim 3. n^k is not O(n^(k−1)).
Proof. By contradiction. Assume n^k = O(n^(k−1)). Then there is some choice of c and n0 such
that n^k ≤ c · n^(k−1) for all n ≥ n0. But this in turn means that n ≤ c for all n ≥ n0, which
contradicts the fact that c is a constant, independent of n. Thus, our original assumption
was false and n^k is not O(n^(k−1)).