Abstract Data Structures
Li Yin
www.liyinscience.com
0.1 Introduction
Setting aside statements such as "data structures are the building blocks of algorithms," data structures simply mimic, in the digital sphere, how things and events are organized in the real world. Imagine that a data structure is an old-school file manager with some basic operations: searching, modifying, inserting, deleting, and potentially sorting. In this chapter, we simply learn how a file manager 'lays out' his or her files (the structures) and the operations each layout supports.
We say the data structures introduced in this chapter are abstract, or idiomatic, because they are conventionally defined structures. Understanding these abstract data structures is like learning the terminology of computer science. We further provide each abstract data structure's corresponding Python data structure in Part ??.
There are generally three broad ways to organize data: linear, tree-like, and graph-like, which we introduce in the following three sections.
Items We use the notion of items throughout this book as a generic name for data of unspecified type.
Records
shown in Fig. 1. Since contiguous memory locations are used, once we know the physical position of the first element, an offset determined by the data type can be used to access any item in the array in O(1) time, which is characterized as random access. Because the items are physically stored contiguously, one after the other, the array is the most efficient data structure for storing and accessing items. Specifically, arrays are designed and used for fast random access of data.
Dynamic Array In a static array, once we have declared its size, we are not allowed to perform any operation that would change it; that is, we are banned from inserting or deleting an item at any position of the array. To be able to change the size, we can use a dynamic array instead: static and dynamic arrays differ in whether the size is fixed. A simple dynamic array can be constructed by allocating
a static array, typically larger than the number of elements immediately
required. The elements of the dynamic array are stored contiguously at the
start of the underlying array, and the remaining positions towards the end
of the underlying array are reserved, or unused. Elements can be added at
the end of a dynamic array in constant time by using the reserved space,
until this space is completely consumed. When all space is consumed, and
an additional element is to be added, then the underlying fixed-sized array
needs to be increased in size. Typically resizing is expensive because it
involves allocating a new underlying array and copying each element from
the original array. Elements can be removed from the end of a dynamic
array in constant time, as no resizing is required. The number of elements
used by the dynamic array contents is its logical size or size, while the size of
the underlying array is called the dynamic array’s capacity or physical size,
which is the maximum possible size without relocating data. Moreover, if the memory size of the array is beyond the memory size of your computer, it could be impossible to fit the entire array in, and we would then resort to other data structures that do not require physical contiguity, such as the linked list, tree, heap, and graph that we introduce next.
• Random access: it takes O(1) time to access one item in the array
• Search and Iteration: it takes O(n) time to iterate over all the elements of the array. Similarly, searching for an item by value through iteration takes O(n) time too.
No matter it’s static or dynamic array, they are static data structures; the
underlying implementation of dynamic array is static array. When frequent
need of insertion and deletion, we need dynamic data structures, The concept
of static array and dynamic array exist in programming languages such as
C–for example, we declare int a[10] and int* a = new int[10], but not
in Python, which is fully dynamically typed(need more clarification).
Singly and Doubly Linked List When each node has only one pointer, the list is called a singly linked list, which means we can only scan nodes in one direction; when there are two pointers, one pointing to the node's predecessor and another to its successor, it is called a doubly linked list, which supports traversal in both forward and backward directions.
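The two node types can be sketched as follows (the class names are illustrative):

```python
class SinglyNode:
    """Node of a singly linked list: one pointer, forward scans only."""
    def __init__(self, value):
        self.value = value
        self.next = None


class DoublyNode:
    """Node of a doubly linked list: predecessor and successor pointers."""
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


# Build a three-node singly linked list: 1 -> 2 -> 3
head = SinglyNode(1)
head.next = SinglyNode(2)
head.next.next = SinglyNode(3)

# Forward iteration visits each node once: O(n)
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next
```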
• Search and Iteration: it takes O(n) time for a linked list to iterate over all items. Similarly, searching for an item by value through iteration takes O(n) time too.
• Extra memory space for a pointer is required with each element of the
list.
the delete operation: for a stack, we delete the item from the end; for a queue, we delete the item from the front instead. Of course, we can also implement them with any other linear data structure, such as a linked list. Conventionally, the add and delete operations are called "push" and "pop" in a stack, and "enqueue" and "dequeue" in a queue.
Operations Stacks and queues support limited access and limited insertion and deletion; their search and iteration rely on the underlying data structure.
Stacks and queues are widely used in computer science. First, they are used to implement the three fundamental searching strategies: depth-first, breadth-first, and priority-first search. Also, a stack is a recursive data structure: it is either empty, or it consists of a top item and the rest, which is itself a stack.
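The push/pop and enqueue/dequeue disciplines can be sketched with Python built-ins; collections.deque supports O(1) operations at both ends, which makes it a natural queue:

```python
from collections import deque

# Stack: push and pop at the same end (the back). Last in, first out.
stack = []
stack.append('a')        # push
stack.append('b')        # push
top = stack.pop()        # pop: removes and returns 'b'

# Queue: enqueue at the back, dequeue at the front. First in, first out.
queue = deque()
queue.append('a')        # enqueue
queue.append('b')        # enqueue
front = queue.popleft()  # dequeue: removes and returns 'a'
```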
A hash table is a data structure that (a) stores items formed as {key: value} pairs, and (b) uses a hash function index = h(key) to compute an index into an array of buckets or slots, in which the mapped value will be stored and accessed. For users, ideally, the result is that given a key we can find its value in constant time, only by computing the hash function. An example is shown in Fig. 5. Hashing will not allow two pairs that have the same key.
First, the key needs to be a natural number; when it is not, a conversion from its type to a natural number is necessary. Now, we assume the keys passed to our hash function are all natural numbers. We define a universe set of keys U = {0, 1, 2, ..., |U| − 1}. To frame hashing as a math problem: given a set of keys drawn from U forming n {key: value} pairs, a hash function needs to be designed to map each pair to an index in the range {0, ..., m − 1} so that it fits into a table of size m (denoted by T[0...m − 1]), usually with n > m. We denote this mapping as h : U → {0, ..., m − 1}. The simplest hash function is h = key, called direct hashing, which is only possible when the keys are drawn from {0, ..., m − 1}, and that is usually not the case in reality.
Continuing from the hashing problem: when two keys are mapped to the same slot, which will surely happen given n > m, we have a collision. In practice, a well-designed hashing mechanism should include (1) a hash function that minimizes the number of collisions and (2) an efficient collision resolution when a collision occurs.
Hashing Functions
Because the keys share a common factor c = 2 with the bucket size m = 4, the range of the remainder shrinks to m/c of its original range. As shown in the example, the remainders are just {0, 2}, which is only half of the space of m. The real load factor increases to cα. Using a prime number is an easy way to avoid this, since a prime number has no factors other than 1 and itself.
If the size of the table cannot easily be adjusted to a prime number, we can use h(k, m) = (k % p) % m, where p is a prime number chosen from the range m < p < |U|.
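The shrinking remainder range can be checked directly. The sketch below reproduces the m = 4 example, with p = 7 as an illustrative prime:

```python
m = 4
keys = [10, 20, 30, 40, 50]   # every key shares the factor c = 2 with m

# Plain modulo: the remainders collapse to only m/c = 2 distinct slots.
plain = sorted({k % m for k in keys})

# Hashing through a prime p first spreads the keys over more slots.
p = 7
primed = sorted({(k % p) % m for k in keys})
```

Here `plain` comes out as [0, 2], only half the table, while `primed` uses three distinct slots.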
Resolving Collision
Collisions are unavoidable given that m < n, and sometimes it is just pure bad luck that the data you have and the chosen hash function produce many collisions.
Chaining An easy solution is to chain the keys that have the same hash value using a linked list (either singly or doubly). For example, when h(k, m) = k % 4 and keys = [10, 20, 30, 40, 50], the keys 10, 30, and 50 are mapped to the same slot 2. Therefore, we chain them up at index 2 using a singly linked list, as shown in Fig. 6. This method has the following characteristics:
• the worst-case behavior is when all keys are mapped to the same slot.
The advantage of chaining is its flexible size: we can always add more items by chaining them behind, which is useful when the size of the input set is unknown or too large. However, this advantage comes at the price of extra space due to the use of pointers.
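A minimal sketch of chaining, with Python lists standing in for the linked-list chains (the class and method names are illustrative):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining. Python lists stand
    in for the singly linked lists described in the text."""

    def __init__(self, m=4):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def put(self, key, value):
        bucket = self.buckets[key % self.m]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # duplicate key: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision: chain behind

    def get(self, key):
        for k, v in self.buckets[key % self.m]:
            if k == key:
                return v
        return None                      # key absent


table = ChainedHashTable(m=4)
for key in [10, 20, 30, 40, 50]:
    table.put(key, str(key))
# 10, 30, and 50 all hash to slot 2 and share one chain of length 3.
```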
Open Addressing In open addressing, all items are stored in the hash table itself; this requires the size of the hash table to satisfy m ≥ n, makes each slot either contain an item or be empty, and keeps the load factor α ≤ 1. This avoids the use of pointers, saving space. So, here is the question: what would you do if there is a collision?
i marks the number of tries. Now, try to delete an item from a linear probing table. We know that T[p1] = k1, T[p1 + 1] = k2, T[p1 + 2] = k3. Say we delete k1: we repeat the hash function, find it at p1, and delete it, which works well; now T[p1] = NULL. Then we need to delete or search for k2: we first probe p1, find that it is already empty, and conclude that k2 is not in the table. Do you see the problem here? k2 is actually at p1 + 1, but the probing process cannot know that. A simple resolution: instead of really deleting the value, we place a flag, say deleted, at any position where a key is supposed to be deleted. Now, to delete k1, we mark p1 as deleted. This time, when we try to delete k2, we first go to p1 and see that its value is not the same as k2, so we know we should move on to p1 + 1 and check its value: it matches, so we put the marker there as well.
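The deleted-flag (tombstone) scheme can be sketched as follows; the class name, sentinel objects, and table size are illustrative, and the table stores bare keys rather than {key: value} pairs:

```python
EMPTY, DELETED = object(), object()    # sentinel markers for the slots


class LinearProbingTable:
    """Open addressing with linear probing and a deleted flag."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        """Yield slot indices in probing order: h(k), h(k)+1, ..."""
        for i in range(self.m):
            yield (key % self.m + i) % self.m

    def insert(self, key):
        for p in self._probe(key):
            # A deleted slot may be reused for new insertions.
            if self.slots[p] is EMPTY or self.slots[p] is DELETED:
                self.slots[p] = key
                return
        raise RuntimeError('table full')

    def search(self, key):
        for p in self._probe(key):
            if self.slots[p] is EMPTY:    # a true hole: key is absent
                return None
            if self.slots[p] == key:
                return p
            # DELETED flag or a different key: keep probing past it.
        return None

    def delete(self, key):
        p = self.search(key)
        if p is not None:
            self.slots[p] = DELETED       # flag it, do not truly empty it
        return p
```

Reproducing the scenario from the text with m = 8: inserting 8, 16, and 24 (all hashing to 0) fills slots 0, 1, and 2; after deleting 8, searching for 16 still succeeds because the probe steps past the deleted marker at slot 0.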
*Perfect Hashing
0.3 Graphs
0.3.1 Introduction
A graph is a natural way to represent connections and reasoning between things or events. A graph is made up of vertices (nodes or points), which are
edge might be the length, speed limit or traffic. In unweighted graphs, there
is no cost distinction between various edges and vertices.
0.4 Trees
Trees in Interviews The most widely used trees are the binary tree and the binary search tree, which are also the most popular tree problems you will encounter in real interviews. There is a large chance you will be asked to solve a binary tree or binary search tree problem in a real coding interview, especially as a new graduate who has no industrial experience and has had little chance to put the relevant knowledge into practice.
0.4.1 Introduction
A tree is essentially a simple graph that is (1) connected, (2) acyclic, and (3) undirected. Connecting n nodes without a cycle requires n − 1 edges. Adding one edge will create a cycle, and removing one edge will divide a tree into two components. Such a tree can be represented as a graph using the representations we learned in the last section, and it is called a free tree. A forest is a set of n ≥ 0 disjoint trees.
Figure 9: Example of Trees. Left: Free Tree, Right: Rooted Tree with
height and depth denoted
However, free trees are not commonly seen or applied in computer science (nor in coding interviews), and there is a better form: rooted trees. In a rooted tree, a special node called the root is singled out, and all edges are oriented to point away from it. The root node and the one-way structure enable a rooted tree to express a hierarchy between nodes, which a free tree cannot. A comparison between a free tree and a rooted tree is shown in Fig. 9.
Rooted Trees
Three Types of Nodes Just like a real tree, we have the root, branches, and finally the leaves. The first node of the tree is called the root node, which is typically connected to several underlying child nodes, making the root node the parent node of its children. Besides the root node, there are two other kinds of nodes: inner nodes and leaf nodes. A leaf node is found at the last level of the tree and has no further children. An inner node is any node that has both a parent and children, that is, any node that cannot be characterized as either a leaf or the root. A node can be both the root and a leaf at the same time if it is the only node in the tree.
• Depth: The depth (or level) of a node is the number of edges from
the node to the tree’s root node. The depth of the root node is 0.
• Height: The height of a node is the number of edges on the longest path from the node down to a leaf. The height of a tree is the height of its root node, or equivalently, the depth of its deepest node.
• Trees can always be used to organize data and can offer efficient information retrieval. Because of the recursive tree structure, divide and conquer can easily be applied to trees (a problem can most likely be divided into subproblems on its subtrees). Examples include the segment tree, binary search tree, and binary heap; for pattern matching, we have tries and suffix trees.
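The depth and height definitions above can be checked on a small tree (the node class is illustrative):

```python
class TreeNode:
    """A general rooted-tree node holding a value and child nodes."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []


def height(node):
    """Height: edges on the longest downward path from node to a leaf."""
    if not node.children:          # a leaf has height 0
        return 0
    return 1 + max(height(child) for child in node.children)


# Root at depth 0; its children at depth 1; grandchildren at depth 2.
root = TreeNode('root', [
    TreeNode('inner', [TreeNode('leaf1'), TreeNode('leaf2')]),
    TreeNode('leaf3'),
])
```

The height of `root` is 2 (the path root -> inner -> leaf1), which equals the depth of its deepest node.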
1. Unlike arrays and linked lists, a tree is hierarchical: (1) we can store information that naturally forms a hierarchy, e.g., the file system on a computer or the employee relations at a company; (2) we can organize the keys of the tree with an ordering, e.g., the binary search tree, the segment tree, and the trie used to implement prefix lookup for strings.
2. Trees are relevant to the analysis of algorithms not only because they implicitly model the behavior of recursive programs but also because they are explicitly involved in many basic algorithms that are widely used.
Types of Binary Tree There are four common types of binary tree:
1. Full Binary Tree: A binary tree is full if every node has either 0 or 2 children. We can also say that a full binary tree is a binary tree in which all nodes except the leaves have two children. In a full binary tree, the number of leaves |L| and the number of all other non-leaf nodes |NL| satisfy the relation |L| = |NL| + 1. The total number of nodes n in terms of the height h is:
n = 2^0 + 2^1 + 2^2 + ... + 2^h = 2^(h+1) − 1
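The closed form can be verified numerically for small heights (the sum assumes, as the level-by-level terms imply, that every level is completely filled):

```python
# Sum of nodes level by level versus the closed form 2^(h+1) - 1.
for h in range(10):
    level_sum = sum(2 ** level for level in range(h + 1))
    assert level_sum == 2 ** (h + 1) - 1
```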