Data Structures AND Algorithm Analysis: University of Veszprem Hungary
DATA STRUCTURES
AND
ALGORITHM ANALYSIS
lecture notes
Sandor Dominich
Adrienn Skrop
Maria Horvath
2001.
Mathematics Review
1. Exponents
a^b · a^c = a^(b+c)
a^b / a^c = a^(b−c)
(a^b)^c = a^(bc)
a^n + a^n = 2·a^n   (not a^(2n))
2^n + 2^n = 2^(n+1)
2. Logarithms
In these notes log typically means the base-2 logarithm: log = log_2 (unless stated otherwise).

Definition: log_a b = x  ⇔  a^x = b

Properties:
log(ab) = log a + log b
log(a/b) = log a − log b
log_a b = log_c b / log_c a   (change of base)
log 1 = 0
log 2 = 1

Note: ln denotes the natural (base-e) logarithm, lg the base-10 logarithm; using plain log for
base 2 is just a compromise between the notations.
3. Series
Geometric series
Σ_{i=0}^{n} a^i = (a^(n+1) − 1) / (a − 1)

a = 2:   Σ_{i=0}^{n} 2^i = 2^(n+1) − 1

0 < a < 1:   Σ_{i=0}^{n} a^i ≤ 1 / (1 − a), and as n → ∞ the sum tends to 1 / (1 − a).

Derivation:   S  = 1 + a + a^2 + a^3 + ...,   |a| < 1
             aS  =     a + a^2 + a^3 + a^4 + ...
Subtracting: S − aS = 1, hence S = 1 / (1 − a).

HW: show that Σ_{i=1}^{∞} i / 2^i = 2
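A minimal Python sketch checking these formulas numerically (the helper name geometric_sum is just illustrative):

def geometric_sum(a, n):
    # Σ_{i=0}^{n} a^i computed term by term
    return sum(a**i for i in range(n + 1))

a, n = 2, 10
assert geometric_sum(a, n) == (a**(n + 1) - 1) // (a - 1) == 2**(n + 1) - 1

# HW sum: the partial sums of Σ i / 2^i approach 2
print(sum(i / 2**i for i in range(1, 60)))   # ~ 2.0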
Arithmetic series
Σ_{i=1}^{n} i = n(n + 1) / 2

Σ_{i=1}^{n} i^2 = n(n + 1)(2n + 1) / 6
Harmonic number
H_n = Σ_{i=1}^{n} 1/i = log_e n + γ_n,   where γ_n tends to a constant (Euler's constant, ≈ 0.577)
Fibonacci numbers
F_0 = 1, F_1 = 1 (by definition)
F_2 = F_1 + F_0 = 2
F_i = F_{i−1} + F_{i−2},   i ≥ 2

Claim: F_i < (5/3)^i,   i ≥ 1
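A minimal Python sketch checking the claim under the convention F_0 = F_1 = 1 used above (the function name fib is illustrative):

def fib(n):
    a, b = 1, 1            # F_0, F_1 as defined above
    for _ in range(n):
        a, b = b, a + b
    return a               # F_n

for i in range(1, 30):
    assert fib(i) < (5 / 3) ** i   # F_i < (5/3)^i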
Proof by contradiction (example: √2 is irrational). Assume √2 = a/b with (a, b) = 1, i.e. a and b
have no common divisor. Then a^2 = 2b^2, so a is even:
a = 2c
b^2 = 2c^2
so b is also even, b = 2d, which contradicts (a, b) = 1.

Note on negation (¬P):
P: The sun is shining.
¬P: The sun is NOT shining, i.e. "it is not the case that the sun is shining"
    (not "it is not the sun that is shining").
Deriving both P and ¬P is a contradiction.
HW: calculate   Σ_{i=0}^{∞} 1/4^i   and   Σ_{i=0}^{∞} i/4^i
Prove:   Σ_{i=1}^{n} (2i − 1) = n^2
Asymptotic notation
Θ-notation

Θ(g(n)) = { f(n) : ∃ c_1, c_2, n_0 > 0 such that 0 ≤ c_1·g(n) ≤ f(n) ≤ c_2·g(n) for all n > n_0 }

[Figure: f(n) runs between c_1·g(n) and c_2·g(n) for all n > n_0, so f(n) ∈ Θ(g(n)).]
Example: f(n) = Θ(n^2)
HW: show that 6n^3 ≠ Θ(n^2)

Property: for a polynomial p(n) = Σ_{i=0}^{d} a_i n^i of degree d, p(n) = Θ(n^d).
A constant c = Θ(n^0) = Θ(1).
O notation
f(n) = O(g(n)):  ∃ c, n_0 > 0 such that 0 ≤ f(n) ≤ c·g(n) for all n > n_0   (upper bound for f(n))
Examples: n^2 = O(n^2),  n = O(n^2)
[Figure: f(n) stays below c·g(n) for all n > n_0.]

Ω notation
f(n) = Ω(g(n)):  ∃ c, n_0 > 0 such that 0 ≤ c·g(n) ≤ f(n) for all n > n_0   (lower bound for f(n))
[Figure: the growth rates of n^3, n^2 and n·log n for comparison.]
Ex.:   i, S: integers
       S := 0
       FOR i = 1 TO n
           S := S + i * i

Running time: instead of measuring the effective (wall-clock) time, algorithm analysis counts the
elementary operations, charging 1 unit for each simple statement.
1. Calculate:   S = Σ_{i=1}^{∞} i / 2^i

   S   = 1/2 + 2/4 + 3/8 + 4/16 + ...
   S/2 =       1/4 + 2/8 + 3/16 + ...

   Adding the geometric series 1/2 + 1/4 + 1/8 + 1/16 + ... (= 1) to S/2 term by term gives back
   1/2 + 2/4 + 3/8 + ... = S, so S/2 + 1 = S, hence S = 2.
2. Prove:
a)  F_i < (5/3)^i,   i ≥ 1

    i = 1:  F_1 = 1 < 5/3
    i = 2:  F_2 = 2 < (5/3)^2 = 25/9

    Induction step (assume the claim up to i, show it for i + 1):
    F_{i+1} = F_i + F_{i−1} < (5/3)^i + (5/3)^{i−1}
            = (5/3)^{i+1} · [ 3/5 + (3/5)^2 ]
            = (5/3)^{i+1} · (3/5 + 9/25)
            = (24/25) · (5/3)^{i+1} < (5/3)^{i+1}
b)  Σ_{i=1}^{n} i = n(n + 1)/2

    n = 1:  1 = 1·2/2 = 1
    n → n + 1:  Σ_{i=1}^{n+1} i = Σ_{i=1}^{n} i + (n + 1) = n(n + 1)/2 + (n + 1) = (n + 1)(n + 2)/2
c)  Σ_{i=1}^{n} i^2 = n(n + 1)(2n + 1)/6

    n = 1:  1 = 1·2·3/6 = 1
    n → n + 1:
    Σ_{i=1}^{n+1} i^2 = Σ_{i=1}^{n} i^2 + (n + 1)^2 = n(n + 1)(2n + 1)/6 + (n + 1)^2
        = (n + 1) · [ n(2n + 1) + 6(n + 1) ] / 6
        = (n + 1)(2n^2 + 7n + 6) / 6
        = (n + 1)(n + 2)(2n + 3) / 6,       since (n + 2)(2n + 3) = 2n^2 + 7n + 6 is true,
    which is the claimed formula with n + 1 in place of n.
3.  Σ_{i=1}^{n} (2i − 1) = 2·Σ_{i=1}^{n} i − Σ_{i=1}^{n} 1 = 2 · n(n + 1)/2 − n = n^2 + n − n = n^2
4.  Prove:   Σ_{i=1}^{n−2} F_i = F_n − 2

    n = 3:   Σ_{i=1}^{1} F_i = F_1 = 1 = F_3 − 2

    n → n + 1: we must show Σ_{i=1}^{n−1} F_i = F_{n+1} − 2. Using F_i = F_{i−1} + F_{i−2}:

    Σ_{i=1}^{n−1} F_i = F_1 + Σ_{i=2}^{n−1} (F_{i−1} + F_{i−2})
                      = F_1 + Σ_{i=2}^{n−1} F_{i−1} + Σ_{i=2}^{n−1} F_{i−2}
                      = Σ_{i=1}^{n−2} F_i + Σ_{i=1}^{n−3} F_i + F_1 + F_0
                      = (F_n − 2) + (F_{n−1} − 2) + 2        (induction hypothesis, F_1 + F_0 = 2)
                      = F_n + F_{n−1} − 2 = F_{n+1} − 2
Asymptotic notation in equations:
n = O(n^2)
2n^2 + 3n + 1 = 2n^2 + Θ(n): there exists some function f(n) ∈ Θ(n) such that the equation holds.
2n^2 + Θ(n) = Θ(n^2)

Besides O, Ω and Θ there are the strict bounds o and ω:

f = o(g)   ⇔   lim_{n→∞} f(n)/g(n) = 0
f = ω(g)   ⇔   lim_{n→∞} f(n)/g(n) = ∞
Properties
Transitivity:   ( f(n) = •(g(n)) and g(n) = •(h(n)) )  ⇒  f(n) = •(h(n)),   where • ∈ {O, Ω, Θ, o, ω}
Reflexivity:    f(n) = •(f(n))   (for • ∈ {O, Ω, Θ})
Symmetry:       f(n) = Θ(g(n))  ⇔  g(n) = Θ(f(n))

Analogy with comparing two real numbers a and b:
f(n) = O(g(n))   ~   a ≤ b
f(n) = Ω(g(n))   ~   a ≥ b
f(n) = Θ(g(n))   ~   a = b
f(n) = o(g(n))   ~   a < b
f(n) = ω(g(n))   ~   a > b
Trichotomy
Dichotomy: Boolean logic (two possible values).
Trichotomy: three possible outcomes.
Trichotomy of the reals: for a, b ∈ R exactly one of the following holds:
    a < b,   a = b,   a > b
Note: this holds for every pair of real numbers, and it can be proved.
Asymptotic notation: trichotomy does not hold; not every two functions can be compared in an
asymptotic sense.
Ex.:  f(n) = n,   g(n) = n^(1 + sin n)
      f(n) = n,   g(n) = n^(sin n)
Floor, ceiling
Floor of x ∈ R: ⌊x⌋ (the largest integer ≤ x).
Ceiling of x ∈ R: ⌈x⌉ (the smallest integer ≥ x).
x − 1 < ⌊x⌋ ≤ x ≤ ⌈x⌉ < x + 1
n ∈ Z:  ⌈n/2⌉ + ⌊n/2⌋ = n
Rate of growth of polynomials and exponentials
lim_{x→∞} x^n / a^x = 0,   for n ∈ N \ {0} and a ∈ (1, ∞),
i.e. every exponential with base greater than 1 eventually outgrows every polynomial.
Series
e^x = Σ_{i=0}^{∞} x^i / i!

e^x ≥ 1 + x
1 + x ≤ e^x ≤ 1 + x + x^2,   for |x| ≤ 1
x → 0:  e^x = 1 + x + Θ(x^2)

e^x = lim_{n→∞} (1 + x/n)^n

ln(1 + x) = x − x^2/2 + x^3/3 − ...,   |x| < 1
Stirling approximation:
n! = √(2πn) · (n/e)^n · (1 + Θ(1/n))

√(2πn) · (n/e)^n  ≤  n!  ≤  √(2πn) · (n/e)^(n + 1/(12n))
Fibonacci numbers and the Golden ratio
F_i = F_{i−1} + F_{i−2},   i ≥ 2,   F_0 = 0, F_1 = 1

Golden section of a segment AB by a point P:
AB / AP = AP / PB
AP = AB · (√5 − 1)/2;   for AB = 1,  AP = (√5 − 1)/2 ≈ 0.61803

φ = (1 + √5)/2,   φ̂ = (1 − √5)/2

The relation between the Fibonacci numbers and the Golden ratio (Binet's formula):
F_i = (φ^i − φ̂^i) / √5
Fibonacci numbers grow exponentially.
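A minimal Python sketch (assuming F_0 = 0, F_1 = 1 as above) comparing Binet's formula with the recurrence:

from math import sqrt

phi = (1 + sqrt(5)) / 2          # golden ratio
phi_hat = (1 - sqrt(5)) / 2      # its conjugate

def fib(n):
    a, b = 0, 1                  # F_0, F_1
    for _ in range(n):
        a, b = b, a + b
    return a

for i in range(20):
    assert round((phi**i - phi_hat**i) / sqrt(5)) == fib(i)
# since |phi_hat| < 1, F_i is roughly phi^i / sqrt(5): exponential growth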
Summations
Sequence: a_1, ..., a_n
Σ_{i=1}^{n} a_i      (finite sum)
Σ_{i=1}^{∞} a_i      (infinite sum)
Linearity
Σ_{k=1}^{n} (c·a_k + b_k) = c·Σ_{k=1}^{n} a_k + Σ_{k=1}^{n} b_k

Σ_{k=1}^{n} Θ(f(k)) = Θ( Σ_{k=1}^{n} f(k) )
Differentiation
Σ_{k=0}^{∞} x^k = 1 / (1 − x),   |x| < 1;   differentiating and multiplying by x:

Σ_{k=0}^{∞} k·x^k = x / (1 − x)^2
Telescoping
Telescoping series (sums)
Σ_{k=1}^{n} (a_k − a_{k−1}) = (a_1 − a_0) + (a_2 − a_1) + (a_3 − a_2) + ... + (a_n − a_{n−1}) = a_n − a_0
(every intermediate term cancels)

Ex.:   Σ_{k=1}^{n−1} 1/(k(k+1)) = Σ_{k=1}^{n−1} ( 1/k − 1/(k+1) ) = 1 − 1/n
Products
lg( Π_{k=1}^{n} a_k ) = Σ_{k=1}^{n} lg a_k
Bounding summations
By induction
Prove:   Σ_{k=0}^{n} 3^k = O(3^n)

Find a constant c such that Σ_{k=0}^{n} 3^k ≤ c·3^n.
Initial value, n = 0:  1 ≤ c·1, so any c ≥ 1 works.
Assume the bound holds for n; then
Σ_{k=0}^{n+1} 3^k = Σ_{k=0}^{n} 3^k + 3^(n+1)
                  ≤ c·3^n + 3^(n+1)
                  = (1/3 + 1/c) · c·3^(n+1)
                  ≤ c·3^(n+1)        as long as 1/3 + 1/c ≤ 1 (e.g. c = 2).
By bounding terms
Σ_{k=1}^{n} a_k ≤ n·a_max,   where a_max = max_k a_k

Ex.:   Σ_{k=1}^{n} k ≤ n·n = n^2
By splitting
n even:
Σ_{k=1}^{n} k = Σ_{k=1}^{n/2} k + Σ_{k=n/2+1}^{n} k
             ≥ 0 + Σ_{k=n/2+1}^{n} (n/2)
             = (n/2)·(n/2) = n^2 / 4
             = Ω(n^2)
By integrals
For a monotonically increasing f:
∫_{m−1}^{n} f(x) dx  ≤  Σ_{k=m}^{n} f(k)  ≤  ∫_{m}^{n+1} f(x) dx
(for a monotonically decreasing f the inequalities are reversed).

Ex.:   Σ_{k=1}^{n} 1/k  ≥  ∫_{1}^{n+1} dx/x = ln(n + 1)

[Figure: f plotted over the points m−1, m, m+1, ..., n+1, with the sum shown as a staircase squeezed
between the two integrals.]
LIFO (Last In, First Out): this structure is called a STACK.
[Figure: a stack holding the elements 2, 3, 4; p points to the top.]
Note: p is the address of the top, which is not empty.
(Another view: the stack as numbered memory locations; p = 7 is the address of the topmost occupied
location, which holds B.)
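A minimal array-based stack sketch in Python, assuming the usual PUSH/POP operations with overflow and underflow checks (names and layout are illustrative, not taken from the notes):

class Stack:
    def __init__(self, capacity):
        self.slots = [None] * capacity   # fixed-size "locations"
        self.p = -1                      # address of the top; -1 means empty

    def push(self, x):                   # write
        if self.p + 1 == len(self.slots):
            raise OverflowError("stack overflow")
        self.p += 1
        self.slots[self.p] = x

    def pop(self):                       # read
        if self.p == -1:
            raise IndexError("stack underflow")
        x = self.slots[self.p]
        self.p -= 1
        return x

s = Stack(8)
for v in (2, 3, 4):
    s.push(v)
print(s.pop(), s.pop(), s.pop())   # 4 3 2 : last in, first out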
FIFO (First In, First Out): it is called a QUEUE.
[Figure: the queue as a row of locations with a Head and a Tail pointer; element A at the head.]
Conceptually it is infinite at both ends, but data is written in from one direction only, at one end.
Tail: where we write the data in.
Head: where we read the data out.
The space we can actually allocate for a QUEUE is finite, therefore:
    write: ENQUEUE (with an overflow check)
    read:  DEQUEUE (with an underflow check)
Application:
modelling the dataflow in on-line processing (e.g. pressure or temperature measurements)
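A minimal circular-buffer queue sketch in Python along the same lines (field and method names are assumptions):

class Queue:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0        # next position to read
        self.tail = 0        # next position to write
        self.count = 0

    def enqueue(self, x):                 # write
        if self.count == len(self.slots):
            raise OverflowError("queue overflow")
        self.slots[self.tail] = x
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def dequeue(self):                    # read
        if self.count == 0:
            raise IndexError("queue underflow")
        x = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return x

q = Queue(4)
q.enqueue("A"); q.enqueue("B")
print(q.dequeue(), q.dequeue())   # A B : first in, first out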
ARRAY
1 dimensional

LINKED LIST
[Figure: a list of elements E1, E2, ..., En; each element Ei holds a data field Di and a pointer pi
to the next element, and the pointer of the last element En is NIL.]
Example application: storing the data of files on disk; element = track, i.e. the elements of a file
are stored on whichever tracks are available and linked together by pointers.
Operations:
INSERT: write a new element E into the list
a. in front of the first element of the list
   [Figure: E is linked before E1; the pointer of E points towards E1.]
b. after the last element
   [Figure: E is linked after En; the NIL pointer of En is replaced by a pointer to E.]
c. between existing elements
   [Figure: E is linked in between Ei and Ej; the pointers are rearranged.]
SEARCHING: Θ(n)
D: the data value to be searched for.
The list is scanned sequentially, element by element:
[Figure: E1(D1, p1) → ... → Ei(Di, pi) → ... → En(Dn, NIL).]
We either find D in some element, or we reach NIL and D is not in the list.
DELETION: delete an existing element
D: find the element that contains D and take it out of the list.
a) searching for the element Ei that contains D
b) deletion of Ei: the pointer of Ei−1 is redirected to point to Ei+1
[Figure: E1 → ... → Ei−1 → Ei+1 → ... → En → NIL after the deletion.]
Logical deletion: just the pointers are rearranged; there is no physical deletion.
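A minimal singly linked list sketch in Python illustrating insertion at the front and logical deletion by pointer rearrangement (class and method names are illustrative):

class Node:
    def __init__(self, data, nxt=None):
        self.data = data
        self.next = nxt                  # NIL is represented by None

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, data):        # INSERT, case a
        self.head = Node(data, self.head)

    def delete(self, data):              # a) search, b) rearrange pointers
        prev, cur = None, self.head
        while cur is not None and cur.data != data:
            prev, cur = cur, cur.next
        if cur is None:
            return False                 # not found
        if prev is None:
            self.head = cur.next         # deleting the first element
        else:
            prev.next = cur.next         # E_{i-1} now points to E_{i+1}
        return True

lst = LinkedList()
for d in ("D3", "D2", "D1"):
    lst.insert_front(d)
lst.delete("D2")                         # logical deletion: only pointers change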
Example:
[Figure: a small list E1, E2, E3 stored in a table with Data and Pointer columns (e.g. B stored at
location 2, C at location 3), the last pointer being NIL.]

[Figure: the same kind of list with a sentinel element placed in front of E1(D1, p1), E2(D2, p2),
..., En(Dn, NIL).]
The sentinel always tells us which is the first element of the list.
TREE
Mathematically: acyclic connected graph
ROOT, level 0: the parent of its children; it has no parent itself.
Level 1: the children of the root, parents of the nodes on level 2.
Level 2: the children of the nodes on level 1, parents of the nodes on level 3.
...
Last level: the leaves, which have no children.
Binary tree: every node has at most two children (the leaves have none).
[Figure: a binary search tree with root 7, left child 5 and right child 8; missing children are NIL.]
Binary search tree property: for every node, all keys in its left subtree are ≤ its own key, and all
keys in its right subtree are ≥ its own key.
WALK (traversal): visit the left subtree (L), the root, and the right subtree (R) in a fixed order.
INORDER:    L, Root, R
PREORDER:   Root, L, R
POSTORDER:  L, R, Root

Example (a binary search tree with root 5, left child 3 having children 2 and 5, and right child 7
having right child 8):
INORDER:    2 3 5 5 7 8
PREORDER:   5 3 2 5 7 8
POSTORDER:  2 5 3 8 7 5
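A minimal Python sketch of the three walks, run on the example tree reconstructed above:

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder(t):    # L, Root, R
    return inorder(t.left) + [t.key] + inorder(t.right) if t else []

def preorder(t):   # Root, L, R
    return [t.key] + preorder(t.left) + preorder(t.right) if t else []

def postorder(t):  # L, R, Root
    return postorder(t.left) + postorder(t.right) + [t.key] if t else []

root = Node(5, Node(3, Node(2), Node(5)), Node(7, None, Node(8)))
print(inorder(root))     # [2, 3, 5, 5, 7, 8]
print(preorder(root))    # [5, 3, 2, 5, 7, 8]
print(postorder(root))   # [2, 5, 3, 8, 7, 5]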
Application: evaluation of an arithmetic expression.
(A + B) * C − D is the usual mathematical (infix) form: it needs parentheses and operator precedence
rules.
Polish (prefix / postfix) form: no parentheses, and there is no need to check the precedence of the
operators.
[Figure: the expression tree of (A + B) * C − D; its postorder walk gives the postfix form
A B + C * D −.]
Application: SORTING, i.e. arranging data (e.g. the numbers 1, 7, 6 into 1, 6, 7) by comparisons.

Lower bound for comparison-based sorting
[Figure: the decision tree of a comparison sort; the root compares a1:a2, inner nodes compare pairs
such as a2:a3 or a1:a3, and every leaf corresponds to one possible ordering.]
n! : the number of leaves (every permutation of the input must appear as a leaf)
A binary tree of height h has at most 2^h leaves, so n! ≤ 2^h, hence h ≥ log(n!) = Ω(n log n).
HASHING
Hash table
[Figure: a table of slots 0, 1, ..., m−1; the keys K1, K2 are mapped into slots by h.]
U: the universe of all possible key values (from where the keys can take their values).
K1, K2, ..., K: keys; they are used to identify records.
Hash function h: it maps the key values to the slots.
h: U → {0, 1, ..., m−1}
h: K → h(K)
Collision: h(K1) = h(K2) although K1 ≠ K2.
Resolution: chaining; the data items that collide are chained together into a list per slot
(the chain may start with a sentinel, as in the linked-list section).
Complexity
Analysing hashing with chaining.
Search: Θ(1 + α)
n: number of keys,  m: number of slots
[Figure: slots with chained keys K1, K2, ...]
n/m: the average number of elements in a chain
α = n/m: the load factor (the analysis is made in terms of α)

Searching for a key K:
    computing h(K) costs Θ(1),
    scanning the chain of slot h(K) costs Θ(α),
    Θ(1) + Θ(α) = Θ(1 + α) in total.

Assume n = O(m) (about as many keys as slots); then α = n/m = O(m)/m = O(1), so
the search takes constant time.
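A minimal Python sketch of a hash table with chaining (method names are assumptions); with α = n/m = O(1) a search inspects Θ(1 + α) elements on average:

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]    # one chain per slot 0..m-1

    def h(self, key):                          # division method: h(K) = K mod m
        return key % self.m

    def insert(self, key, value):
        self.slots[self.h(key)].append((key, value))

    def search(self, key):
        for k, v in self.slots[self.h(key)]:   # scan the chain: Theta(1 + alpha)
            if k == key:
                return v
        return None

t = ChainedHashTable(7)
t.insert(10, "rec10"); t.insert(17, "rec17")   # 10 mod 7 == 17 mod 7 == 3: collision, chained
print(t.search(17))                            # rec17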
Uniform hashing: any given key is equally likely to hash into any of the slots.
Probability: 1/m for each slot.
Application
Spelling checker
Compilers
Game playing
Graphs
Hash function
What makes a good hash function? (One that ensures uniform hashing, i.e. probability 1/m per slot.)
Usually we do not know the distribution of the key values, so it is difficult to design a good hash
function.
In practice: spread the key values as much as possible.
Division method
h(K) = K mod m
It does not work well in every case, e.g. when m = 2^p:
the remainder is then just the lowest p bits of K, so the upper bits of the key are ignored.
Example: p = 2, m = 2^2 = 4, the possible hash values are 0, 1, 2, 3; for K = 5 (binary 101, 3 bits)
only the lowest 2 bits are kept, so h(K) = 01 = 1.
A key that is a character string C1 C2 ... Cn of ASCII codes (0 ≤ Cj ≤ 127) is first interpreted as
a number (e.g. in radix 128) and then hashed.
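A small Python sketch illustrating the division method and why m = 2^p is a poor choice (the sample keys are made up):

def h(key, m):
    return key % m

m_bad = 4             # m = 2^2: only the lowest 2 bits of the key matter
print([h(k, m_bad) for k in (1, 5, 9, 13)])    # [1, 1, 1, 1] : many collisions

m_good = 7            # a prime not close to a power of 2 spreads the keys better
print([h(k, m_good) for k in (1, 5, 9, 13)])   # [1, 5, 2, 6]

# a string key of ASCII codes C1..Cn interpreted as a radix-128 number before hashing
def string_to_number(s):
    n = 0
    for ch in s:
        n = n * 128 + ord(ch)      # 0 <= ord(ch) <= 127 for ASCII
    return n

print(h(string_to_number("key"), m_good))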
Universal hashing
For any fixed hash function there exist key values such that more than one of them is hashed into the
same slot.
H = {h1, h2, ..., hr}: a set (family) of hash functions; any member is a valid hash function on its
own.
Choosing hi from H at random gives uniform hashing on average.
Statement
If h is chosen from a universal collection H of hash functions and is used to hash n keys into a hash
table of size m, where n ≤ m, then the expected number of collisions involving a particular key x is
less than 1.
Proof
Let c_yz be a random variable for every pair of distinct keys y, z:
    c_yz = 1 if h(y) = h(z), and 0 otherwise.
For a hash function chosen from a universal family, E[c_yz] = 1/m.

Let c_x be the total number of collisions involving the key x; if y_1, ..., y_{n−1} are the other
keys, then
    c_x = c_{x y_1} + c_{x y_2} + ... + c_{x y_{n−1}},    with E[c_{x y_j}] = 1/m for each j.
By the linearity of expectation, E[Σ_i ξ_i] = Σ_i E[ξ_i], so
    E[c_x] = Σ_y E[c_xy] = (n − 1)/m < 1,    since n ≤ m.

An example of a universal family: write each key x in digits, x = <x_0, x_1, ..., x_r>, with every
digit 0 ≤ x_i < m; choose a = <a_0, a_1, ..., a_r>, where every a_i is picked at random from
{0, 1, ..., m − 1}, and define
    h_a(x) = ( Σ_{i=0}^{r} a_i x_i ) mod m
Theorem
The set H is universal.
Proof
Take two distinct keys x ≠ y with digit forms x = <x_0, x_1, ..., x_r> and y = <y_0, y_1, ..., y_r>;
since x ≠ y they differ in at least one digit, say x_0 ≠ y_0.

h_a(x) = Σ_{i=0}^{r} a_i x_i mod m,        h_a(y) = Σ_{i=0}^{r} a_i y_i mod m

If h_a(x) = h_a(y), then
    Σ_{i=0}^{r} a_i x_i ≡ Σ_{i=0}^{r} a_i y_i          (mod m)
    a_0 (x_0 − y_0) ≡ − Σ_{i=1}^{r} a_i (x_i − y_i)    (mod m)

For m prime, (x_0 − y_0) has a multiplicative inverse mod m, so for every choice of a_1, ..., a_r
there is exactly one value of a_0 for which x and y collide. Hence the number of vectors a that make
them collide is m^r, while |H| = m^{r+1}, so the probability of a collision is
    m^r / m^{r+1} = 1/m
and H is universal.
H (with n ≤ m): the number of keys and the table size should be comparable.
Choose a hash function from H at random each time hashing is to be done, then apply that h.
Application
Databases
Hash table + B-tree
It is not easy to estimate n.
The choice of h is not easy either.
ESTIMATION OF COMPLEXITY
Running time: the time required for a computer to execute a program.
Running time depends on the following factors:
CPU: the higher the speed, the quicker the computer.
Memory: main and secondary memory available to execute the program.
Input data: size, type, operations.
Software: compiler, operating system, etc.
Algorithm (based on which the particular program is written).
Example: compute a1 + a2 + a3 + a4

Straight-line version:        Loop version:
S = 0                         S = 0
S = S + a1                    FOR i = 1 TO 4
S = S + a2                        S = S + ai
S = S + a3
S = S + a4
The asymptotic notation is a way to express how bad (slow) or good (fast) an algorithm is.
By expressing the complexity we get a measure of this.
1. Simple statement: 1 unit each (Ex.: S = 0)
2. Consecutive statements: the sum of the costs of each
3. Loop: time required to evaluate the body multiplied by the number of iterations
4. Nested loop: inside multiplied by the product of iterations
5. IF: test + max (THEN, ELSE)
This technique can overestimate (e.g. rule 5 always charges the more expensive branch), but it will
never underestimate the complexity.
SORTING
To arrange given data according to given criteria.
Ex.:
1. Data items c_i, i = 1, ..., n, and criteria j = 1, ..., m: arrange the c_i taking every
   criterion j into account.
2. A list of names: alphabetic order.
3. Temperature values: ascending or descending order.
4. Sorting: a_1, a_2, ..., a_n input data; criterion: ascending (descending).
BUBBLE SORTING
Main idea: find the smallest element and put it on the top, then find the second smallest and put it
next to the top one, and so on.

Ex.: 2, 1, 7, 6
Fix the first position and compare its element with all the others; whenever a pair is in the wrong
order, swap:   1, 2, 7, 6
Repeat from the second position (2, 7, 6 remain), then from the third (7, 6).
Result: 1, 2, 6, 7
Best case: the input is already in the right order (no swap) O(n2)
Worst case: all the numbers are in the wrong order O(n2)
Average case: the numbers are given at random (typical case)
Bubble sorting (comparison-based): O(n2)
Comparison-based sorting in general: Ω(n log n) is a lower bound, and the best comparison-based
methods achieve O(n log n).
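A minimal Python sketch of the idea described above, i.e. fix a position and swap whenever a later element is smaller (illustrative, not the notes' own code):

def bubble_sort(a):
    n = len(a)
    for i in range(n):                    # the position being fixed
        for j in range(i + 1, n):         # compare with every later element
            if a[j] < a[i]:
                a[i], a[j] = a[j], a[i]   # wrong order: swap
    return a                              # about n^2/2 comparisons in every case: O(n^2)

print(bubble_sort([2, 1, 7, 6]))          # [1, 2, 6, 7]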
QUICK SORTING
The fastest known method.
Divide and conquer philosophy
A(p..r) is split around a computed value q (the pivot):
    A(p..q):      every element x ≤ q
    A(q+1..r):    every element y ≥ q
q: a computed value
Ex.: 9, 2, 11, 20, 7
pivot q = (9 + 7) / 2 = 8   (the pivot does not necessarily belong to the sequence)

    7, 2        |   11, 20, 9
    q = 4           q = 10
    2 | 7           9 | 11, 20
                        q = 15
                        11 | 20

Result: 2 7 9 11 20
QUICKSORT (A, p, r)
    IF p < r THEN
        q ← PARTITION (A, p, r)
        QUICKSORT (A, p, q)
        QUICKSORT (A, q + 1, r)
Initial call: QUICKSORT (A, 1, length(A))

PARTITION (A, p, r)
    x ← A(p), i ← p − 1, j ← r + 1
    WHILE TRUE DO
        REPEAT j ← j − 1 UNTIL A(j) ≤ x
        REPEAT i ← i + 1 UNTIL A(i) ≥ x
        IF i < j THEN SWAP (A(i), A(j))
        ELSE RETURN j
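A runnable Python sketch of the same partitioning scheme (0-based indices; illustrative rather than a verbatim translation of the pseudocode):

def partition(A, p, r):
    x = A[p]                       # pivot value
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while A[j] > x:            # REPEAT j <- j-1 UNTIL A(j) <= x
            j -= 1
        i += 1
        while A[i] < x:            # REPEAT i <- i+1 UNTIL A(i) >= x
            i += 1
        if i < j:
            A[i], A[j] = A[j], A[i]
        else:
            return j

def quicksort(A, p, r):
    if p < r:
        q = partition(A, p, r)
        quicksort(A, p, q)         # A[p..q]
        quicksort(A, q + 1, r)     # A[q+1..r]

data = [9, 2, 11, 20, 7]
quicksort(data, 0, len(data) - 1)
print(data)                        # [2, 7, 9, 11, 20]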
Recurrence: resolved by telescoping
T(n) = 2·T(n/2) + Θ(n)

T(n) = 2·T(n/2) + n              | divide by n
T(n)/n       = T(n/2)/(n/2) + 1
T(n/2)/(n/2) = T(n/4)/(n/4) + 1
...
T(2)/2       = T(1)/1 + 1

Adding the equations, the intermediate terms cancel (telescoping):
T(n)/n = T(1)/1 + log n
T(n) = n·c + n·log n
T(n) = O(n log n)      average (and best) case

Worst case: O(n^2), when the partition puts only one element into one of the two regions:
T(n)     = T(n − 1) + Θ(n)
T(n − 1) = T(n − 2) + Θ(n − 1)
...
Adding these up:
T(n) = T(1) + Σ_{k=2}^{n} Θ(k) = Θ( Σ_{k=1}^{n} k ) = Θ(n^2)
SEARCHING
There may be different situations where searching is performed
SEQUENTIAL SEARCH
Given a1, , ai, , an
Find: x
(naive or brute-force search, also called straight search)
It is in fact one loop:
FOR i = 1 TO n
IF x = ai THEN STOP
O(n)
Note: this kind of search is very simple (primitive), but what if a_i is a matrix, or a record of a
file? The comparison step becomes more complicated then, but in principle the search stays the same.
RANDOM SEARCH
Introduce some sort of probability.
Given a1, , ai, , an to be searched.
Find: x
Flip a coin (the probability element):
    if head: search from 1 to n;
    if tail: search from n to 1.
Two sequential searches are combined, applied together.
Assume: x = ai (ith position)
The coin is fair: the head and the tail occur with equal probabilities.
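A minimal Python sketch of this randomized search (illustrative; it simply picks one of the two scanning directions at random):

import random

def random_search(a, x):
    if random.random() < 0.5:              # "head": search from 1 to n
        indices = range(len(a))
    else:                                  # "tail": search from n to 1
        indices = range(len(a) - 1, -1, -1)
    for i in indices:
        if a[i] == x:
            return i
    return -1                              # not found

print(random_search([4, 8, 15, 16, 23, 42], 16))   # 3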
BINARY SEARCH
Given a sorted sequence a_1, ..., a_n, x is compared with the middle element and the half that
cannot contain x is discarded, so the length of the part still to be searched shrinks as
    n/2^0, n/2^1, n/2^2, ..., n/2^k = 1,
i.e. after k = log n halvings the search stops: O(log n).
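A minimal iterative binary search sketch in Python matching the halving argument above:

def binary_search(a, x):                   # a must be sorted
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid
        if a[mid] < x:
            lo = mid + 1                   # discard the left half
        else:
            hi = mid - 1                   # discard the right half
    return -1                              # not found

print(binary_search([2, 3, 5, 7, 11, 13], 11))   # 4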
Note:
1. Will the search work when all elements are given at once (at the same time)?
2. Will the search work when the elements are given on-line (one by one)?

                        1. (all at once)    2. (on-line)
Sequential search            YES                YES
Randomized search            YES                NO
Binary search                YES                NO
In every case we assume that we have just one processor to do the job.
PARALLEL BINARY SEARCH
It speeds up the binary search.
P processors with shared memory (PRAM parallel RAM).
CREW: Concurrent Read Exclusive Write
Given the elements a1, , ai, , an sorted.
Find: x
Divide the sequence into p + 1 parts:
a_1, ..., a_{i1} | a_{i1+1}, ..., a_{i2} | ... | ..., a_{ij} | ..., a_n

x is compared with the boundary elements (or with the leftmost or the rightmost element of each
part) in parallel; processor j reports
    0, if x > a_{ij}
    1, otherwise
so the part that must contain x is identified in one parallel step.

BS:  the length of the searched part shrinks as   n/2^0, n/2^1, ..., n/2^k = 1,
     i.e. k = log_2 n steps.
PBS: with p processors it shrinks as   n/(p+1)^0, n/(p+1)^1, ..., n/(p+1)^k = 1,
     i.e. k = log_{p+1} n steps.
STRING MATCHING
Ex.:  text: 3 2 4 5 6  (n = 5),   pattern: 4 5  (m = 2)
Naive (brute force) matching: the pattern is aligned with each position of the text in turn and
compared character by character:
    3 2 4 5 6
    4 5
      4 5
        4 5      <- match
O(n · m)
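A minimal Python sketch of the naive matcher (the function name is illustrative):

def naive_match(text, pattern):
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # every possible alignment
        if text[i:i + m] == pattern:     # character-by-character comparison
            return i                     # first match position
    return -1

print(naive_match("32456", "45"))        # 2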
KNUTH-MORRIS-PRATT ALGORITHM
Given a text of length n with pointer i, and a pattern of length m with pointer j.
Both pointers are incremented by 1 as long as there is a match.
Mismatch at position k: j is reset and comparison is restarted.
I.   i = 1, j = 1:   3 2 3 4 5     text[1] = 3 = pattern[1]: match, advance both pointers
                     3 4
II.  i = 2, j = 2:   3 2 3 4 5     text[2] = 2, pattern[2] = 4: mismatch, reset j to 1
                     3 4
III. i = 2, j = 1:   3 2 3 4 5     text[2] = 2, pattern[1] = 3: mismatch, advance i
                       3 4
IV.  i = 3, j = 1:   3 2 3 4 5     3 matches, then 4 matches: the pattern is found
                         3 4
Since the text pointer i never moves backwards, the total number of comparisons is O(n + m).
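A hedged Python sketch of the Knuth-Morris-Pratt idea: a failure table stores, for each pattern prefix, the length of its longest proper prefix that is also a suffix, so j can be reset on a mismatch without moving i backwards (this is the standard construction, not copied from the notes):

def kmp_search(text, pattern):
    m = len(pattern)
    # failure[j]: length of the longest proper prefix of pattern[:j+1] that is also its suffix
    failure = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = failure[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        failure[j] = k

    j = 0
    for i, ch in enumerate(text):          # i only moves forward: O(n + m) in total
        while j > 0 and ch != pattern[j]:
            j = failure[j - 1]             # mismatch: reset j using the table
        if ch == pattern[j]:
            j += 1
        if j == m:
            return i - m + 1               # match starts here
    return -1

print(kmp_search("32345", "34"))           # 2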
BOYER-MOORE ALGORITHM
text: s1 (length n),  pattern: s2 (length m)
Idea: the pattern is searched from right to left, while it is moved from left to right an appropriate
number of positions.
I.   3 2 7 4 5 6
     4 5            mismatch, and there is no matching character between pattern and text here,
                    so we move the pattern m positions to the right
II.  3 2 7 4 5 6
         4 5        mismatch: shift so that the 4 of the pattern is aligned with the 4 of the text
III. 3 2 7 4 5 6
           4 5      match: stop
Best case: O(n/m), really very fast.
Application: word processing (spelling checker)
SIGNATURE FILES
Given text
Every word is hashed (transformed) into a string of bits. (Ex. radix 128, soundex coding)
Synonyms (in the hashing sense, not grammatically): different words that hash to the same signature.
hash table:
Find word x:
x is transformed into a string of bits,
the hash table is presorted,
x is searched for in the presorted hash table using binary search (or, better, located directly with a hash function),
we get a block address on the disk, we find the word on that block somewhere,
[Figure: words w1, w2, ..., wn mapped to the addresses URL 1, URL 2, ..., URL n where they can be found.]