Algorithms and Data Structures--Topic Summary
This course will examine various data structures for storing and accessing information, together with
the relationships between the items being stored, and algorithms for efficiently solving problems, both
on the data structures themselves and through queries and operations based on the relationships between
the items stored. We will conclude by looking at some theoretical limitations of algorithms and what we
can compute.
1.1 Introduction
Often, we are faced with the issue of dealing with items in order of their priority relative to each other.
Items waiting for a service will arrive at different times, but they must be serviced in order of their priority.
1.3 C++
The C++ programming language is similar to C, Java, and C#. Where it differs from C# and Java is in its
memory allocation model (explicitly having to deallocate memory as opposed to relying on a garbage
collector), in pointers explicitly recording the address in memory where an object is stored, and in the
concept of a pre-processor. Where it differs from all three languages is the concept of templates:
allowing the user of a class to specify the types it stores. C++ also uses namespaces to prevent collisions
on large projects. We will use the std namespace of the standard template library (STL).
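
As a minimal sketch of templates in action, the following class lets its user specify the stored type; the
class Pair and its interface are illustrative, not part of the STL:

    #include <iostream>

    // A class template: the user of the class specifies the stored type.
    template <typename Type>
    class Pair {
        public:
            Pair( Type first, Type second ):
            first_( first ), second_( second ) {
                // Empty constructor body
            }

            Type first() const { return first_; }
            Type second() const { return second_; }

        private:
            Type first_;
            Type second_;
    };

    int main() {
        Pair<int> p( 3, 5 );          // the same template used with int...
        Pair<double> q( 2.5, 7.25 );  // ...and with double

        std::cout << p.first() + q.second() << std::endl;  // prints 10.25

        return 0;
    }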
3.1 Lists
There are numerous occasions where the programmer may want to specify the linear order. Operations
we may want to perform on a list are inserting an object at a particular location, moving to the previous
or next object, or removing the object at that location. Both arrays and singly linked lists are reasonable
for some but not all of these operations. We introduce doubly linked lists and two-ended arrays to reduce
some of the run times, but at a cost of more memory. We observe that, in general, it is often possible to
speed up some operations at the cost of additional memory.
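
As a minimal sketch, consider the STL's std::list, a doubly linked list: given an iterator to a location,
insertion there is a constant-time operation. The specific values are illustrative:

    #include <iostream>
    #include <list>

    int main() {
        std::list<int> values;  // a doubly linked list from the STL

        values.push_back( 42 );
        values.push_front( 13 );

        // Insert 7 before the last object: constant time given the iterator
        std::list<int>::iterator itr = values.end();
        --itr;
        values.insert( itr, 7 );

        for ( itr = values.begin(); itr != values.end(); ++itr ) {
            std::cout << *itr << ' ';   // prints 13 7 42
        }
        std::cout << std::endl;

        return 0;
    }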
If the objects being linearly ordered are selected from a finite and well-defined alphabet, the list is
referred to as a string. This includes text, but also DNA, where the alphabet comprises the four bases
adenine, thymine, guanine, and cytosine (A, T, G and C).
3.2 Stacks
One type of container we see often is a last-in, first-out container: items may be inserted into the
container (pushed onto the stack) in any order, but the item removed (popped) is always the one that has
most recently been pushed onto the stack. The last item pushed onto the stack is at the top of the stack.
This defines an abstract stack or Stack ADT. This is simple to implement efficiently (all relevant
operations are Θ(1)) with a singly linked list and with a one-ended (standard) array. Stacks, despite being
trivial to implement, are used in parsing code (matching parentheses and XML tags), tracking function
calls, allowing undo and redo operations in applications, and in evaluating reverse-Polish notation, the
format also used by stack-based assembly language instructions. With respect to the array-based
implementation, we focus on the amortized effect on the run time if the capacity is doubled whenever
the array is full, versus increased by a constant amount. In the first case, operations have an amortized
run time of Θ(1) but there is O(n) unused memory, while in the second the amortized run time is O(n)
while the unused memory is Θ(1).
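
A minimal sketch of the array-based implementation with capacity doubling follows; the class layout
and initial capacity of 8 are illustrative choices, not prescribed by the ADT:

    #include <iostream>
    #include <cassert>

    // An array-based stack that doubles its capacity when full,
    // giving amortized Theta(1) push operations.
    template <typename Type>
    class Stack {
        public:
            Stack( int capacity = 8 ):
            size_( 0 ),
            capacity_( capacity ),
            array_( new Type[capacity] ) {
            }

            ~Stack() { delete[] array_; }

            bool empty() const { return size_ == 0; }

            void push( Type const &obj ) {
                if ( size_ == capacity_ ) {
                    // Double the capacity and copy the entries over: the copy
                    // costs Theta(n), but it occurs rarely enough that each
                    // push is amortized Theta(1).
                    Type *tmp = new Type[2*capacity_];
                    for ( int i = 0; i < size_; ++i ) tmp[i] = array_[i];
                    delete[] array_;
                    array_ = tmp;
                    capacity_ *= 2;
                }
                array_[size_] = obj;
                ++size_;
            }

            Type pop() {
                assert( !empty() );
                --size_;
                return array_[size_];
            }

        private:
            int size_;
            int capacity_;
            Type *array_;
    };

    int main() {
        Stack<int> s;
        for ( int i = 1; i <= 20; ++i ) s.push( i );  // forces doubling
        std::cout << s.pop() << std::endl;            // prints 20
        return 0;
    }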
3.3 Queues
Another type of container we see often is a first-in, first-out container, a behavior desirable in many
client-server models where clients waiting for service enter into a queue (are pushed onto the back of the
queue) and, when a server becomes ready, it begins servicing the client that has been waiting the longest
in the queue (the client is popped off the front of the queue). This defines an abstract queue or Queue
ADT. This can be implemented efficiently (all relevant operations are Θ(1)) with either a singly linked
list or a two-ended cyclic array. With respect to the array-based implementation, we focus on the
characteristics of a cyclic array, including the requirement for doubling the capacity of the array when
it is full.
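
A minimal sketch of the cyclic array implementation; for brevity, the doubling step is reduced to an
assertion marking where it would occur:

    #include <iostream>
    #include <cassert>

    // A queue stored in a cyclic array: front_ and size_ locate the
    // entries, and indices wrap around using the modulus operator.
    template <typename Type>
    class Queue {
        public:
            Queue( int capacity = 8 ):
            front_( 0 ), size_( 0 ), capacity_( capacity ),
            array_( new Type[capacity] ) {
            }

            ~Queue() { delete[] array_; }

            bool empty() const { return size_ == 0; }

            void push( Type const &obj ) {
                assert( size_ < capacity_ );  // a full version would double here
                array_[(front_ + size_) % capacity_] = obj;
                ++size_;
            }

            Type pop() {
                assert( !empty() );
                Type obj = array_[front_];
                front_ = (front_ + 1) % capacity_;
                --size_;
                return obj;
            }

        private:
            int front_, size_, capacity_;
            Type *array_;
    };

    int main() {
        Queue<int> q( 4 );
        q.push( 1 ); q.push( 2 ); q.push( 3 );
        std::cout << q.pop() << ' ' << q.pop() << std::endl;  // prints 1 2
        q.push( 4 ); q.push( 5 );   // wraps around the end of the array
        while ( !q.empty() ) std::cout << q.pop() << ' ';     // prints 3 4 5
        std::cout << std::endl;
        return 0;
    }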
3.4 Deques
A less common container stores items as a contiguous list but only allows insertions and removals at
either end (pushes and pops at the front and back). This defines an abstract doubly ended queue, abstract
deque, or Deque ADT. This can be implemented efficiently using a two-ended array, but requires a
doubly linked list for an efficient implementation using a linked list. For this data structure, we look at
the concept of an iterator: an object that allows the user to step through the items in a container without
gaining access to the underlying data structure.
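
A minimal sketch using the STL's std::deque and its iterator; the values are illustrative:

    #include <iostream>
    #include <deque>

    int main() {
        std::deque<int> d;

        d.push_back( 2 );
        d.push_back( 3 );
        d.push_front( 1 );   // pushes at the front are also efficient

        // An iterator steps through the items without exposing the
        // underlying data structure.
        for ( std::deque<int>::iterator itr = d.begin(); itr != d.end(); ++itr ) {
            std::cout << *itr << ' ';   // prints 1 2 3
        }
        std::cout << std::endl;

        return 0;
    }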
4. Trees and hierarchical orders
To this point, we have looked at data structures for storing items linearly in an order specified by the
programmer (an explicitly defined linear order). We will now look at trees, starting with their most
obvious purpose: storing hierarchical orders.
6.4 B+ trees
A B+ tree is a tree that is used as an associative container. Each leaf node contains up to L objects,
including keys and the associated information. Internal nodes are multi-way nodes where the intermediate
values are the smallest entries in the leaf nodes of the second through last sub-trees. If a B+ tree has no
more than L entries, those entries are stored in a root node that is a leaf node. Otherwise, we require that
leaf nodes are at least half full and all at the same depth, that internal nodes are multi-way nodes that, too,
are at least half full, and that the root node is a multi-way node that is at least half full. When an insertion
occurs into a leaf node that is already full, that node is split in two, and a reference to the new node is
added to the parent. If the parent is already full, it, too, is split. This recurses possibly all the way back to
the root, in which case the root node will have to be split and a new root node will be created. For
example, with L = 2, inserting 5 into a full leaf holding {3, 7} splits it into {3, 5} and {7}, and the key 7
is passed up to the parent.
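
A complete B+ tree implementation is beyond this summary, but the node layouts described above can
be sketched as follows; the field names and the chaining of leaf nodes (a common but not required
design) are illustrative assumptions:

    #include <cstddef>
    #include <iostream>

    // A leaf node holds up to L keys together with the associated information.
    template <typename Key, typename Value, std::size_t L>
    struct LeafNode {
        std::size_t count;   // number of stored objects, at least L/2
        Key keys[L];         // keys stored in increasing order
        Value values[L];     // the associated information
        LeafNode *next;      // leaves chained together (illustrative choice)
    };

    // An internal node is a multi-way node holding up to M sub-trees.
    template <typename Key, std::size_t M>
    struct InternalNode {
        std::size_t count;   // number of children, at least M/2
        Key keys[M - 1];     // keys[i] is the smallest entry in the leaves
                             // of sub-tree i + 1
        void *children[M];   // pointers to internal or leaf nodes
    };

    int main() {
        // Instantiate the layouts to show how L and M are supplied.
        LeafNode<int, double, 4> leaf;
        leaf.count = 0;
        leaf.next = 0;

        InternalNode<int, 5> node;
        node.count = 0;

        std::cout << "leaf capacity L = 4, internal fan-out M = 5" << std::endl;
        return 0;
    }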
7. Priority queues
In this topic, we will examine priority queues. We will look at the abstract data type and will then
continue to look at binary min-heaps. While there are numerous other data structures that could be used
to implement a priority queue, almost all are node-based. Given the emphasis on node-based data
structures in the previous topics, we will instead focus on an array-based binary min-heap. Students are
welcome to look at other implementations (leftist heaps, skew heaps, binomial heaps, and Fibonacci
heaps).
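
As a minimal usage sketch, the STL's std::priority_queue adapts an array-based binary heap; supplying
std::greater turns the default max-heap into a min-heap. The stored values are illustrative:

    #include <iostream>
    #include <queue>
    #include <vector>
    #include <functional>

    int main() {
        // std::priority_queue stores an array-based binary heap inside a
        // std::vector; std::greater makes the top the smallest entry.
        std::priority_queue<int, std::vector<int>, std::greater<int> > pq;

        pq.push( 42 );
        pq.push( 13 );
        pq.push( 27 );

        while ( !pq.empty() ) {
            std::cout << pq.top() << ' ';   // prints 13 27 42
            pq.pop();
        }
        std::cout << std::endl;

        return 0;
    }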
Placing n items into a binary min-heap takes O(n ln(n)) time, and taking those same n items out again
takes the same amount of time. However, the items will come out of the heap in order of their values;
consequently, the items will come out in linear order. The only issue is that this requires Θ(n) additional
memory. Instead, converting the array of unsorted items into a binary max-heap, popping the top n times,
and placing each popped item into the vacancy left as a result of the pop allows us to sort the list in place.
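
A minimal sketch of this in-place strategy, expressed with the STL's heap algorithms rather than a
hand-written heap (std::make_heap builds a max-heap; each std::pop_heap moves the largest remaining
entry into the vacancy at the end of its range):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> data;
        int const raw[] = { 5, 1, 4, 2, 3 };
        data.assign( raw, raw + 5 );

        // Convert the unsorted array into a binary max-heap in place...
        std::make_heap( data.begin(), data.end() );

        // ...then repeatedly pop the top; each pop_heap places the largest
        // remaining entry into the vacancy at the end of the range.
        for ( std::vector<int>::iterator itr = data.end(); itr != data.begin(); --itr ) {
            std::pop_heap( data.begin(), itr );
        }

        for ( std::size_t i = 0; i < data.size(); ++i ) {
            std::cout << data[i] << ' ';   // prints 1 2 3 4 5
        }
        std::cout << std::endl;

        return 0;
    }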
8.6 Quicksort
The most significant issue with merge sort is that it requires Θ(n) additional memory. Instead, another
approach would be to find the median element and then rearrange the remaining entries based on whether
they are less than or greater than the median. We can do this in place in Θ(n) time, in which case the
run time would be equal to that of merge sort. Unfortunately, selecting the median is difficult at best.
We could randomly choose a pivot, but this would have a tendency of dividing the interval in a ratio of
1:3, on average. There would also be a higher likelihood of a poor split and, in the worst case, a Θ(n²)
run time.
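
A minimal in-place sketch follows; choosing the last entry as the pivot is an illustrative simplification
(it degrades on already-sorted input), not a recommended selection strategy:

    #include <iostream>
    #include <algorithm>

    // In-place quicksort on the range array[first..last].
    void quicksort( int *array, int first, int last ) {
        if ( first >= last ) return;

        int pivot = array[last];   // simplistic pivot choice
        int i = first;

        // Partition: entries less than the pivot move to the front.
        for ( int j = first; j < last; ++j ) {
            if ( array[j] < pivot ) {
                std::swap( array[i], array[j] );
                ++i;
            }
        }
        std::swap( array[i], array[last] );  // pivot into its final position

        quicksort( array, first, i - 1 );
        quicksort( array, i + 1, last );
    }

    int main() {
        int data[] = { 5, 1, 4, 2, 3 };
        quicksort( data, 0, 4 );
        for ( int i = 0; i < 5; ++i ) std::cout << data[i] << ' ';  // 1 2 3 4 5
        std::cout << std::endl;
        return 0;
    }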
9. Hash tables and relation-free storage
9.1 Introduction to hash tables
When we store linearly ordered data that we intend to both access and modify at arbitrary locations, this
requires us to use a tree structure that ultimately requires most operations to run in O(ln(n)) time; the
linear ordering prevents us from having run times that are o(ln(n)). If, however, we don't care about the
relative order (What comes next? What is the largest?), we don't need the tree structure. Instead, we just
need a simple formula, a hash value, that tells us where to look in an array to find the object. The issue
is that it is very difficult to find hash functions that generate unique hash values on a small range from 0
to M – 1, so we must deal with collisions. Our strategy will be to find Θ(1) functions that first map the
object onto a 32-bit number (our hash value), then map this hash value onto the range 0, …, M – 1, and
finally deal with collisions.
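
A minimal sketch of this two-step strategy, assuming a simple and entirely illustrative string-hashing
function hash32 and a table size M that is a power of two:

    #include <iostream>
    #include <string>

    // Step one: map the object onto a 32-bit hash value. The multiplier 31
    // is a common but arbitrary choice; overflow wraps modulo 2^32.
    unsigned int hash32( std::string const &str ) {
        unsigned int value = 0;
        for ( std::size_t i = 0; i < str.size(); ++i ) {
            value = 31*value + static_cast<unsigned char>( str[i] );
        }
        return value;
    }

    int main() {
        unsigned int const M = 16;  // table size, here a power of two

        std::string keys[] = { "adenine", "thymine", "guanine", "cytosine" };
        for ( int i = 0; i < 4; ++i ) {
            // Step two: map the hash value onto 0, ..., M - 1 (bit-masking
            // is equivalent to the modulus when M is a power of two).
            std::cout << keys[i] << " -> bin "
                      << ( hash32( keys[i] ) & (M - 1) ) << std::endl;
        }

        return 0;
    }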
Each vertex joined by an edge is said to be incident with that edge. We define sub-graphs and
vertex-induced sub-graphs, paths, simple paths, cycles, and simple cycles. We define weighted graphs,
where each edge has a weight associated with it. We describe how a graph can be a tree or a forest, and
the concept of an acyclic graph: one with no cycles. We then define directed graphs and modify the
definitions seen previously for undirected graphs. We then quickly cover the means of storing graphs
and some general questions that may be asked of an abstract graph.
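
As a minimal sketch of one storage scheme, an adjacency list for a small undirected graph might look as
follows; the graph itself is an arbitrary example:

    #include <iostream>
    #include <vector>

    int main() {
        int const num_vertices = 4;

        // An adjacency list: entry v holds the vertices adjacent to v.
        std::vector< std::vector<int> > adjacent( num_vertices );

        // Each undirected edge is recorded in both directions.
        int edges[][2] = { {0, 1}, {0, 2}, {1, 2}, {2, 3} };
        for ( int i = 0; i < 4; ++i ) {
            adjacent[ edges[i][0] ].push_back( edges[i][1] );
            adjacent[ edges[i][1] ].push_back( edges[i][0] );
        }

        for ( int v = 0; v < num_vertices; ++v ) {
            std::cout << v << ":";
            for ( std::size_t j = 0; j < adjacent[v].size(); ++j ) {
                std::cout << ' ' << adjacent[v][j];
            }
            std::cout << std::endl;
        }

        return 0;
    }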
12.3 Divide-and-conquer
Often, a problem can be solved by breaking it into smaller sub-problems, finding solutions to those
smaller sub-problems, and then recombining the results to produce a solution for the overall problem.
Two questions are: Can we determine when it is beneficial to use a divide-and-conquer algorithm? What
approaches should we be using to increase the efficiency of a divide-and-conquer strategy? In finding the
maximum entry of a sorted square matrix, a divide-and-conquer algorithm is sub-optimal when contrasted
with a linear search. When multiplying two n-digit numbers, Karatsuba’s algorithm reduces the problem
to multiplying three sets of n/2-digit numbers, yielding a significant reduction in run time. When
multiplying two n × n matrices, Strassen’s algorithm reduces the problem to multiplying seven pairs of
n/2 × n/2 matrices. A naïve divide-and-conquer algorithm for matrix-vector multiplication reduces the
product to four products of n/2 × n/2 matrices with n/2-dimensional vectors. This does not reduce the run
time, which is still Θ(n²); however, in the special case of the Fourier transform, the matrix is such that the
matrix-vector product can be reduced to two products of n/2 × n/2 matrices with n/2-dimensional vectors,
resulting in a run time of Θ(n ln(n)). The master theorem allows us to write
    T(n) = { Θ(1)                 if n = 1
           { a T(n/b) + O(n^k)    if n > 1

and then, based on whether a > b^k, a = b^k, or a < b^k, the run times may be determined to be
O(n^(log_b a)), O(n^(log_b a) ln(n)) = O(n^k ln(n)), or O(n^k), respectively.
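
As a minimal sketch of Karatsuba's algorithm, the following operates on machine integers and splits on
a power of ten (a real implementation works with arbitrary-length digit sequences). With a = 3, b = 2,
and k = 1, the master theorem gives the O(n^(log_2 3)) ≈ O(n^1.585) run time:

    #include <iostream>

    // Karatsuba multiplication of non-negative integers: three half-size
    // multiplications replace the four required by the naive approach.
    long long karatsuba( long long x, long long y ) {
        if ( x < 10 || y < 10 ) {
            return x*y;  // base case: single-digit multiplication
        }

        // Split each number into high and low halves of roughly half the
        // digits of the larger operand.
        long long larger = ( x > y ) ? x : y;
        int digits = 0;
        for ( long long t = larger; t > 0; t /= 10 ) ++digits;

        long long m = 1;
        for ( int i = 0; i < digits/2; ++i ) m *= 10;

        long long xh = x / m, xl = x % m;
        long long yh = y / m, yl = y % m;

        // Three recursive multiplications instead of four:
        long long high = karatsuba( xh, yh );
        long long low  = karatsuba( xl, yl );
        long long mid  = karatsuba( xh + xl, yh + yl ) - high - low;

        return high*m*m + mid*m + low;
    }

    int main() {
        std::cout << karatsuba( 1234, 5678 ) << std::endl;  // prints 7006652
        return 0;
    }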