K - Select Using Data Structures
7 2 6 9 1 5 4 11
SELECT(A,k):
    A = MERGESORT(A)
    return A[k-1]
It’s k-1 (rather than k) since my pseudocode is 0-indexed and k is a 1-indexed number.
AN O(n log n) ALGORITHM
THE QUESTION IS... CAN WE DO BETTER?
GOAL: AN O(n) ALGORITHM
If k = 1, then we want the minimum of A. There’s an easy O(n) algorithm for that:
SELECT-1(A):
    result = infinity
    for i in [0,...,n-1]:
        if A[i] < result:
            result = A[i]
    return result
This loop runs O(n) times.
Pretty much the same if k = n (we’re just finding MAX(A) instead).
SELECT-2(A):
    result = infinity
    minSoFar = infinity
    for i in [0,...,n-1]:
        if A[i] < result & A[i] < minSoFar:
            result = minSoFar
            minSoFar = A[i]
        else if A[i] < result & A[i] >= minSoFar:
            result = A[i]
    return result
This loop runs O(n) times. The body of each iteration is still O(1) work.
SELECT-n/2(A):
    result = infinity
    minSoFar = infinity
    secondMinSoFar = infinity
    thirdMinSoFar = infinity
    fourthMinSoFar = infinity
    fifthMinSoFar = infinity
    ...
Select a pivot
Partition around it
Recurse!
(Kind of like a “binary search” for the kth smallest element, except that the array isn’t sorted!)
LINEAR SELECTION: THE IDEA
3 2 9 8 1 6 4 11
Select a pivot: How do we pick a pivot? We’ll see this later. For now, imagine we pick it randomly.
Partition around it (here the pivot is 6):
L: 3 2 1 4 | pivot: 6 | R: 9 8 11
Partition around pivot: L has elements less than pivot, and R has elements greater than pivot.
(Note that L and R remain unsorted.)
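Recurse! Since len(L) = 4, the pivot 6 is the 5th smallest element:
1. if k = 5: return 6 (the pivot is the kth smallest element)
2. if k < 5: return SELECT(L, k) (the kth smallest element is the kth smallest element in L)
3. if k > 5: return SELECT(R, k-5) (the kth smallest element is the (k-5)th smallest element in R)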
LINEAR SELECTION: EXAMPLE
SELECT(A, 7):
1 12 4 20 31 6 18 9
PICK A PIVOT: How do we pick a pivot? We’ll see later...
PARTITION around the pivot 18:
L: 1 12 4 6 9 | pivot: 18 | R: 20 31
RECURSE on R (since 18 occupies position 6, counting from 1, and k = 7 > 6):
SELECT(R, 1):
20 31
Here 1 = 7 - 6 (aka k minus pivot position).
PICK A PIVOT: How do we pick a pivot? We’ll see later...
PARTITION around the pivot 20:
L: (empty) | pivot: 20 | R: 31
20 IS OUR ANSWER! (20 is the 1st smallest in R, and the 7th smallest overall.)
20 is in the 1st position, and k = 1! No need to recurse further!
LINEAR SELECTION: PSEUDOCODE
SELECT(A,k):
    if len(A) == 1:
        return A[0]
    p = GET_PIVOT(A)
    L, R = PARTITION(A,p)
    if len(L) == k-1:
        return p
    else if len(L) > k-1:
        return SELECT(L, k)
    else if len(L) < k-1:
        return SELECT(R, k-len(L)-1)
Base Case: if len(A) = 1, then just go ahead and return the element itself.
Case 1: We got lucky and found exactly the kth smallest!
Case 2: The kth smallest is in the first part of the array (L).
Case 3: The kth smallest is in the second part of the array (R).
PARTITION(A, pivot):
    L, R = [], []
    for i in [0,...,len(A)-1]:
        if A[i] == pivot:
            continue
        else if A[i] < pivot:
            add A[i] to L
        else:
            add A[i] to R
    return L, R
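The SELECT and PARTITION pseudocode can be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation: it assumes the elements of A are distinct (as in the examples) and that 1 ≤ k ≤ len(A), and it uses a random pivot as a stand-in for GET_PIVOT, which the slides have not defined yet.

```python
import random

def partition(A, pivot):
    """Split A into elements smaller than and larger than pivot (distinct values assumed)."""
    L = [x for x in A if x < pivot]
    R = [x for x in A if x > pivot]
    return L, R

def select(A, k):
    """Return the kth smallest element of A, where k is 1-indexed."""
    if len(A) == 1:
        return A[0]
    p = random.choice(A)  # assumption: random pivot standing in for GET_PIVOT
    L, R = partition(A, p)
    if len(L) == k - 1:
        return p                          # Case 1: the pivot is the answer
    elif len(L) > k - 1:
        return select(L, k)               # Case 2: the answer is in L
    else:
        return select(R, k - len(L) - 1)  # Case 3: the answer is in R

print(select([1, 12, 4, 20, 31, 6, 18, 9], 7))  # 20
```

Whatever pivots get picked, the answer for the example is 20; only the running time depends on the pivot choice.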
RUNTIME
Recurrence Relation for SELECT
For now, assume we’ll pick the pivot in time O(n).
T(n) = O(n)                   if len(L) == k-1
T(n) = T(len(L)) + O(n)       if len(L) > k-1
T(n) = T(len(R)) + O(n)       if len(L) < k-1
But what are len(L) and len(R)? That depends on how we pick the pivot...
What’s a “good” pivot? What’s a “bad” pivot?
THE WORST PIVOT
The WORST pivot: picking the max or the min each time!
Then, in the worst case, the recurrence relation looks like T(n) = T(n-1) + O(n).
This solves to T(n) = O(n^2).
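To see where the quadratic bound comes from, the recurrence can be unrolled. The constant c below is an assumption introduced only to stand for the hidden constant in the O(n) term:

```latex
\begin{aligned}
T(n) &\le T(n-1) + cn \\
     &\le T(n-2) + c(n-1) + cn \\
     &\;\;\vdots \\
     &\le T(1) + c\sum_{i=2}^{n} i
      = T(1) + c\left(\frac{n(n+1)}{2} - 1\right)
      = O(n^2).
\end{aligned}
```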
THE IDEAL PIVOT
The IDEAL pivot: splits the input array exactly in half!
len(L) = len(R) = (n-1)/2
In that case, T(n) ≤ T(n/2) + O(n), which solves to T(n) = O(n).
Sadly, the pivot to divide the input in half is the MEDIAN.
THE GOOD-ENOUGH PIVOT
The GOOD-ENOUGH pivot: splits the input array kind of in half!
3n/10 < len(L) < 7n/10
3n/10 < len(R) < 7n/10
If we could fetch this good-enough pivot in time O(n), let’s say, the recurrence looks like:
T(n) ≤ T(7n/10) + O(n), which solves to T(n) = O(n).
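One way to check that this recurrence is linear is to unroll it into a geometric series. The constant c is an assumption standing for the hidden constant in the O(n) term:

```latex
T(n) \le cn + c\left(\tfrac{7}{10}\right)n + c\left(\tfrac{7}{10}\right)^{2}n + \cdots
     \le cn\sum_{i=0}^{\infty}\left(\tfrac{7}{10}\right)^{i}
     = \frac{cn}{1 - \tfrac{7}{10}}
     = \tfrac{10}{3}\,cn
     = O(n).
```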
OUR GOAL
Efficiently pick the pivot in time O(n) so that
3n/10 < len(L) < 7n/10 and 3n/10 < len(R) < 7n/10
L (array with things smaller than pivot): 1 12 4 6 9 | pivot: 18 | R (array with things larger than pivot): 20 31
MEDIAN-OF-MEDIANS
The ideal world wasn’t feasible because we can’t just compute SELECT(A, n/2) ⇒ that would
throw us into infinite recursion since problem sizes aren’t shrinking between recursive calls…
But we can instead generate a smaller list and call SELECT on that smaller list!
GOAL: get a proxy for the true median by finding the exact median of all the sub-medians!
1 14 4 18 25 6 17 9 3 5 10 16 12 23 19 13 20 8 15 24 7 21 22 2 11
Divide the original list into ⌈n/5⌉ groups (each group has 5 elements)
Find the sub-median of each small group (3rd smallest out of the 5)
Sub-medians: 14 6 16 15 11
Find the median of all the sub-medians (call SELECT)
Median of the sub-medians: 14
Finding the sub-medians is constant work for each group. ⌈n/5⌉ groups total ⇒ O(n) work.
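The grouping steps above can be sketched in Python on the 25-element example. This is an illustrative sketch under two assumptions: the helper name `sub_medians` is hypothetical, and the last step sorts the five sub-medians as a stand-in for the recursive call to SELECT.

```python
def sub_medians(A, group_size=5):
    """Return the median of each consecutive group of (at most) group_size elements."""
    groups = [A[i:i + group_size] for i in range(0, len(A), group_size)]
    # Sorting a group of at most 5 elements is constant work per group.
    return [sorted(g)[(len(g) - 1) // 2] for g in groups]

A = [1, 14, 4, 18, 25, 6, 17, 9, 3, 5, 10, 16, 12, 23, 19,
     13, 20, 8, 15, 24, 7, 21, 22, 2, 11]

meds = sub_medians(A)
print(meds)   # [14, 6, 16, 15, 11]

# Stand-in for SELECT(meds, ceil(len(meds)/2)): sort the short list directly.
pivot = sorted(meds)[(len(meds) - 1) // 2]
print(pivot)  # 14
```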
ANALYZING RUNTIME
SELECT(A,k):
    if len(A) == 1:
        return A[0]
    p = MEDIAN_OF_MEDIANS(A)
    L, R = PARTITION(A,p)
    if len(L) == k-1:
        return p
    else if len(L) > k-1:
        return SELECT(L, k)
    else if len(L) < k-1:
        return SELECT(R, k-len(L)-1)
O(n) work outside of recursive calls (base case, set-up within MEDIAN_OF_MEDIANS, partitioning).
T(n/5) work hidden in the call to MEDIAN_OF_MEDIANS (remember, MEDIAN_OF_MEDIANS calls SELECT on a ⌈n/5⌉-size array).
T(???) work hidden in the recursive call on L or R. What is the maximum size of either L or R?
m = ⌈n/5⌉ groups, each with at most 5 elements:
1 14 4 18 25
6 17 9 3 5
10 16 12 23 19
13 20 8 15 24
7 21 22 2 11
MEDIAN_OF_MEDIANS will choose a pivot greater than at least 3n/10 - 6 elements.
(The same reasoning we’re about to do also shows that the pivot will be less than at least 3n/10 - 6 elements.)
Sort each group, and order the groups by their medians (m = ⌈n/5⌉ groups, at most 5 elements each):
3 5 6 9 17
2 7 11 21 22
1 4 14 18 25
8 13 15 20 24
10 12 16 19 23
Count the elements smaller than the median of medians: 3 elements from each group that has a median smaller than the median of medians, plus 2 elements from the group containing the median of medians:
3 · (⌈m/2⌉ - 1) + 2
(The -1 is to exclude the group with the median of medians.)
Any one of those groups might be a “leftover” group with fewer than 5 elements, so count only the non-leftover groups:
3 · (⌈m/2⌉ - 1 - 1) + 2
The group with the median of medians might be a “leftover” group! Might as well just get rid of the +2 to be safe:
3 · (⌈m/2⌉ - 2)
= 3 · (⌈⌈n/5⌉/2⌉ - 2)
≥ 3 · (n/10 - 2)
= 3n/10 - 6
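The bound can also be checked empirically. The sketch below is not part of the proof: it assumes distinct elements, takes the lower median of a leftover group, and sorts the sub-medians as a stand-in for the recursive SELECT call. The helper name `mom_pivot` is hypothetical.

```python
import random

def mom_pivot(A):
    """Median-of-medians pivot; sorting stands in for the recursive SELECT call."""
    groups = [A[i:i + 5] for i in range(0, len(A), 5)]
    meds = sorted(sorted(g)[(len(g) - 1) // 2] for g in groups)
    return meds[(len(meds) - 1) // 2]  # the ceil(m/2)-th smallest sub-median

random.seed(0)  # fixed seed so the check is repeatable
for n in range(1, 200):
    A = random.sample(range(10 * n), n)  # n distinct values in random order
    p = mom_pivot(A)
    smaller = sum(x < p for x in A)
    larger = sum(x > p for x in A)
    # Both sides of the pivot should hold at least 3n/10 - 6 elements.
    assert smaller >= 3 * n / 10 - 6
    assert larger >= 3 * n / 10 - 6

print("bound held for every n tested")
```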
We just showed:
3n/10 - 6 ≤ len(L)
len(R) ≤ 7n/10 + 5 (since len(L) + len(R) = n-1)
We can similarly show the inverse:
3n/10 - 6 ≤ len(R)
len(L) ≤ 7n/10 + 5
O(n) Worst-case Runtime!
PSEUDOCODE & RUNTIME
SELECT(A,k):
    if len(A) == 1:
        return A[0]
    p = MEDIAN_OF_MEDIANS(A)
    L, R = PARTITION(A,p)
    if len(L) == k-1:
        return p
    else if len(L) > k-1:
        return SELECT(L, k)
    else if len(L) < k-1:
        return SELECT(R, k-len(L)-1)
O(n) work outside of recursive calls (base case, set-up within MEDIAN_OF_MEDIANS, partitioning).
T(n/5) work hidden in the call to MEDIAN_OF_MEDIANS (remember, MEDIAN_OF_MEDIANS calls SELECT on a ⌈n/5⌉-size array).
T(7n/10) work hidden in the recursive call on L or R: 7n/10 is the maximum size of either L or R (this is what the median-of-medians technique guarantees us)!
LINEAR-TIME SELECTION
SELECT(A,k):
    if len(A) == 1:
        return A[0]
    p = MEDIAN_OF_MEDIANS(A)
    L, R = PARTITION(A,p)
    if len(L) == k-1:
        return p
    else if len(L) > k-1:
        return SELECT(L, k)
    else if len(L) < k-1:
        return SELECT(R, k-len(L)-1)
O(n) Worst-case Runtime!
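Putting the pieces together, the whole algorithm can be sketched in Python. This is a sketch under stated assumptions rather than a reference implementation: elements are distinct, 1 ≤ k ≤ len(A), leftover groups use their lower median, and the list-building PARTITION is chosen for clarity over in-place efficiency.

```python
def partition(A, pivot):
    """Split A into elements smaller than and larger than pivot (distinct values assumed)."""
    L = [x for x in A if x < pivot]
    R = [x for x in A if x > pivot]
    return L, R

def median_of_medians(A):
    """Pick a pivot: the median of the medians of groups of at most 5 elements."""
    groups = [A[i:i + 5] for i in range(0, len(A), 5)]
    meds = [sorted(g)[(len(g) - 1) // 2] for g in groups]  # O(1) work per group
    return select(meds, (len(meds) + 1) // 2)              # recursive call on ~n/5 elements

def select(A, k):
    """Return the kth smallest element of A, where k is 1-indexed."""
    if len(A) == 1:
        return A[0]
    p = median_of_medians(A)
    L, R = partition(A, p)
    if len(L) == k - 1:
        return p
    elif len(L) > k - 1:
        return select(L, k)
    else:
        return select(R, k - len(L) - 1)

A = [1, 14, 4, 18, 25, 6, 17, 9, 3, 5, 10, 16, 12, 23, 19,
     13, 20, 8, 15, 24, 7, 21, 22, 2, 11]
print(select(A, 13))  # 13, the true median of the values 1..25
```

The recursion terminates because the list of sub-medians is always shorter than A once len(A) ≥ 2, and L and R each exclude the pivot.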
HIGHLIGHTS OF SELECT
We covered a lot of details - here are the big picture takeaways.
LINEAR SELECTION: THE BIG IDEA
Select a pivot
Partition around it
Recurse!
LINEAR SELECTION: RUNTIME
Select a pivot: Median of (sub)Medians
Divide the original list into ⌈n/5⌉ groups (each group has ≤ 5 elements)
Find the sub-median of each small group (3rd smallest out of the 5)
Find the median of all the sub-medians (via recursive call to SELECT!!)
Recurse! on either L or R (size ≤ 7n/10)
WAIT: WHERE DID WE GET 7n/10?
We proved this claim:
3n/10 - 6 ≤ len(L) ≤ 7n/10 + 5
3n/10 - 6 ≤ len(R) ≤ 7n/10 + 5
This is because len(L) + len(R) = n-1, so if 3n/10 - 6 ≤ len(L) then len(R) ≤ 7n/10 + 5.
We asked ourselves how many elements the pivot must be greater than, looking at the m = ⌈n/5⌉ sorted groups (for example, 3 5 6 9 17).
Recurse! on either L or R (size ≤ ~7n/10)
LINEAR SELECTION: RUNTIME
Select a pivot: Median of (sub)Medians
Divide the original list into ⌈n/5⌉ groups (each group has ≤ 5 elements)
Find the sub-median of each small group (3rd smallest out of the 5)
Find the median of all the sub-medians (via recursive call to SELECT!!)
    T(n/5) recursive work: we call SELECT on an array of size n/5
Partition around pivot
Recurse! on either L or R (size ≤ 7n/10)
    T(7n/10) recursive work: we call SELECT on an array of size at most 7n/10
Everything else is O(n) non-recursive “shallow” work!
T(n) ≤ T(n/5) + T(7n/10) + O(n)
O(n) Worst-case Runtime!
LINEAR SELECTION: THE BIG IDEA
Select a pivot: Median of Medians
Recurse!
Median of Medians is really cool! The math was a little detailed, but worth the time to digest so that you’re 110% convinced that the technique does give a ~7n/10 bound on the max size of either L or R. Solving the recurrence can be done via the Substitution Method.
SELECT as a whole is an amazing display of Divide-and-Conquer!
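A sketch of how the Substitution Method might go for this recurrence is shown below. It makes simplifying assumptions: the O(n) term is written as dn for some constant d, and the ceilings and the additive +5 in the size bounds are ignored. The guess is T(n) ≤ cn, assumed to hold for all smaller inputs.

```latex
\begin{aligned}
T(n) &\le T\!\left(\tfrac{n}{5}\right) + T\!\left(\tfrac{7n}{10}\right) + dn \\
     &\le c\cdot\tfrac{n}{5} + c\cdot\tfrac{7n}{10} + dn \\
     &= \tfrac{9}{10}\,cn + dn \\
     &\le cn \quad \text{whenever } c \ge 10d .
\end{aligned}
```

The key fact is that 1/5 + 7/10 = 9/10 < 1, so the two recursive calls together shrink the problem by a constant factor and the total work stays linear.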