
Week 5: String Algorithms, Approximation, Randomisation, Ethics
Strings 6/182

A string is a sequence of characters.
An alphabet Σ is the set of possible characters in strings.

Examples of strings:

C program
HTML document
DNA sequence
Digitised image

Examples of alphabets:

ASCII
Unicode
{0,1}
{A,C,G,T}

... Strings 7/182

Notation:

length(P) … #characters in P
λ … empty string (length(λ) = 0)
Σ^m … set of all strings of length m over alphabet Σ
Σ* … set of all strings over alphabet Σ
νω denotes the concatenation of strings ν and ω

Note: length(νω) = length(ν)+length(ω), and λω = ω = ωλ

... Strings 8/182

Notation:

substring of P … any string Q such that P = νQω, for some ν,ω∈Σ*
prefix of P … any string Q such that P = Qω, for some ω∈Σ*
suffix of P … any string Q such that P = ωQ, for some ω∈Σ*

Exercise #1: Strings 9/182

The string "a/a" of length 3 over the ASCII alphabet has

how many prefixes?
how many suffixes?
how many substrings?

4 prefixes: "", "a", "a/", "a/a"
4 suffixes: "a/a", "/a", "a", ""
6 substrings: "", "a", "/", "a/", "/a", "a/a"

Note: "" means the same as λ (= empty string)

... Strings 11/182

ASCII (American Standard Code for Information Interchange)

Specifies mapping of 128 characters to integers 0..127
The characters encoded include:

upper and lower case English letters: A-Z and a-z
digits: 0-9
common punctuation symbols
special non-printing characters: e.g. newline and space

... Strings 12/182

UTF-8

Most common Unicode standard
Specifies mapping of around 150,000 characters
ASCII-compatible

0b0xxxxxxx … ASCII characters 0-127 (one byte)
Two, three and four byte long characters:
0b110xxxxx 10xxxxxx … most Latin-script alphabets
0b1110xxxx … … Chinese, Japanese, Korean characters
0b11110xxx … … mathematical symbols, emojis

unicode.org/emoji/charts/full-emoji-list.html
Pattern Matching 14/182
Example (pattern checked backwards):

Text    … abacaab
Pattern … abacab

... Pattern Matching 15/182

Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P

Applications:

Text editors
Search engines
Biological research

... Pattern Matching 16/182

Naive pattern matching algorithm

checks for each possible shift of P relative to T

until a match is found, or
all placements of the pattern have been tried

NaiveMatching(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| for all i=0..n-m do
| | j=0                            // check from left to right
| | while j<m and T[i+j]=P[j] do   // test ith shift of pattern
| | | j=j+1
| | | if j=m then
| | | return i                     // entire pattern checked
| | | end if
| | end while
| end for
| return -1                        // no match found

Analysis of Naive Pattern Matching 17/182

Naive pattern matching runs in O(n·m)

Examples of worst case (forward checking):

T = aaa…ah
P = aaah

may occur in DNA sequences
unlikely in English text

Exercise #2: Naive Matching 18/182

Suppose all characters in P are different.

Can you accelerate NaiveMatching to run in O(n) on an n-character text T?

When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]

⇒ each character in T checked at most twice

Example:

abcdabcdeabcc !! abcdabcdeabcc
abcdexxxxxxxx !! xxxxabcde
Boyer-Moore Algorithm 20/182

The Boyer-Moore pattern matching algorithm is based on two heuristics:

Looking-glass heuristic: Compare P with subsequence of T moving backwards

Character-jump heuristic: When a mismatch occurs at T[i]=c
if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")

... Boyer-Moore Algorithm 21/182

Example:

... Boyer-Moore Algorithm 22/182

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build

last-occurrence function L
L maps Σ to integers such that L(c) is defined as
the largest index i such that P[i]=c, or
-1 if no such index exists

Example: Σ = {a,b,c,d}, P = acab

c      a  b  c  d
L(c)   2  3  1  -1

L can be represented by an array indexed by the numeric codes of the characters

L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)

... Boyer-Moore Algorithm 23/182

BoyerMooreMatch(T,P,Σ):
| Input  text T of length n, pattern P of length m, alphabet Σ
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| L=lastOccurrenceFunction(P,Σ)
| i=m-1, j=m-1                 // start at end of pattern
| repeat
| | if T[i]=P[j] then
| | | if j=0 then
| | | return i                // match found at i
| | | else
| | | i=i-1, j=j-1            // keep comparing
| | | end if
| | else                      // character-jump
| | | i=i+m-min(j,1+L[T[i]])
| | | j=m-1
| | end if
| until i≥n
| return -1                   // no match

Biggest jump (m characters ahead) occurs when L[T[i]] = -1

... Boyer-Moore Algorithm 24/182

l … last occurrence L[T[i]] of character T[i]

Why i = i + m - min(j,1+l)?

Case 1: 1+l ≤ j ⇒ i = i+m-(1+l)
Case 2: j < 1+l ⇒ i = i+m-j

Exercise #3: Boyer-Moore algorithm 25/182

For the alphabet Σ = {a,b,c,d}

1. compute last-occurrence function L for pattern P = abacab
2. trace Boyer-Moore on P and text T = abacaabadcabacabaabb
how many comparisons are needed?

c      a  b  c  d
L(c)   4  5  3  -1

13 comparisons in total

... Boyer-Moore Algorithm 27/182

Analysis of Boyer-Moore algorithm:

Runs in O(nm+s) time
m … length of pattern, n … length of text, s … size of alphabet

Example of worst case:
T = aaa … a
P = baaa

Worst case may occur in images and DNA sequences but unlikely in English texts
⇒ Boyer-Moore significantly faster than naive matching on English text
Knuth-Morris-Pratt Algorithm 28/182

The Knuth-Morris-Pratt algorithm …

compares the pattern to the text left-to-right
but shifts the pattern more intelligently than the naive algorithm

... Knuth-Morris-Pratt Algorithm 29/182

Reminder:

Q is a prefix of P … P = Qω, for some ω∈Σ*
Q is a suffix of P … P = ωQ, for some ω∈Σ*

When a mismatch occurs …

what is the most we can shift the pattern to avoid redundant comparisons?
Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]

... Knuth-Morris-Pratt Algorithm 30/182

KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with itself

Failure function F(j) defined as
the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
for each position j=0..m-1
if mismatch occurs at P[j] ⇒ advance j to F(j-1)

Example: P = abaaba

j    0 1 2 3 4 5
P[j] a b a a b a
F(j) 0 0 1 1 2 3

... Knuth-Morris-Pratt Algorithm 31/182

KMPMatch(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| F=failureFunction(P)
| i=0, j=0                 // start from left
| while i<n do
| | if T[i]=P[j] then
| | | if j=m-1 then
| | | return i-j           // match found at i-j
| | | else
| | | i=i+1, j=j+1         // keep comparing
| | | end if
| | else if j>0 then       // mismatch and j>0?
| | | j=F[j-1]             // → shift pattern to i-F[j-1]
| | else                   // mismatch and j still 0?
| | | i=i+1                // → begin at next text character
| | end if
| end while
| return -1                // no match

Exercise #4: KMP-Algorithm 32/182

1. compute failure function F for pattern P = abacab
2. trace Knuth-Morris-Pratt on P and text T = abacaabaccabacabaabb
how many comparisons are needed?

j    0 1 2 3 4 5
P[j] a b a c a b
F(j) 0 0 1 0 1 2

19 comparisons in total

... Knuth-Morris-Pratt Algorithm 34/182

Analysis of Knuth-Morris-Pratt algorithm:

Failure function can be computed in O(m) time (→ next slide)
At each iteration of the while-loop, either
i increases by one, or
the pattern is shifted ≥1 to the right ("shift amount" i-j increases since always F(j-1)<j)
⇒ i can be incremented at most n times, pattern can be shifted at most n times
⇒ there are no more than 2·n iterations of the while-loop
⇒ KMP's algorithm runs in optimal time O(m+n)

... Knuth-Morris-Pratt Algorithm 35/182

Construction of the failure function matches pattern against itself:

failureFunction(P):
| Input  pattern P of length m
| Output failure function for P
|
| F[0]=0                   // F[0] is always 0
| j=1, len=0
| while j<m do
| | if P[j]=P[len] then
| | len=len+1              // we have matched len+1 characters
| | F[j]=len               // P[0..len-1] = P[j-len+1..j]
| | j=j+1
| | else if len>0 then     // mismatch and len>0?
| | len=F[len-1]           // → use already computed F[len-1] for new len
| | else                   // mismatch and len still 0?
| | F[j]=0                 // → no prefix of P[0..j] is also suffix of P[1..j]
| | j=j+1                  // → continue with next pattern character
| | end if
| end while
| return F

Exercise #5: 36/182

Trace the failureFunction algorithm for pattern P = abaaba

⇒ F[0]=0
j=1, len=0, P[1]≠P[0] ⇒ F[1]=0
j=2, len=0, P[2]=P[0] ⇒ len=1, F[2]=1
j=3, len=1, P[3]≠P[1] ⇒ len=F[0]=0
j=3, len=0, P[3]=P[0] ⇒ len=1, F[3]=1
j=4, len=1, P[4]=P[1] ⇒ len=2, F[4]=2
j=5, len=2, P[5]=P[2] ⇒ len=3, F[5]=3

... Knuth-Morris-Pratt Algorithm 38/182

Analysis of failure function computation:

At each iteration of the while-loop, either
j increases by one, or
the "shift amount" j-len increases by at least one (remember that always F(len-1)<len)
Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in O(m) time

Boyer-Moore vs KMP 39/182

Boyer-Moore algorithm

decides how far to jump ahead based on the mismatched character in the text
works best on large alphabets and natural language texts (e.g. English)

Knuth-Morris-Pratt algorithm

uses information embodied in the pattern to determine where the next match could begin
works best on small alphabets (e.g. A,C,G,T)

For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to size of alphabet

Preprocessing Strings 41/182

Preprocessing the pattern speeds up pattern matching queries

After preprocessing P, KMP algorithm performs pattern matching in time proportional to the text length

If the text is large, immutable and searched for often (e.g., works by Shakespeare)

we can preprocess the text instead of the pattern
Tries 43/182

A trie …

is a tree-like data structure
for compact representation of a set of strings
e.g. all the words in a text, a dictionary etc.

Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"

... Tries 44/182

Tries are trees organised using parts of keys (rather than whole keys)

Exercise #6: 45/182

How many words are encoded in the trie on the previous slide?

11

... Tries 47/182

Each node in a trie …

contains one part of a key (typically one character)
may have up to 26 children
may be tagged as a "finishing" node
but even "finishing" nodes may have children

Depth d of trie = length of longest key value

Cost of searching O(d) (independent of n)

... Tries 48/182

Possible trie representation:

#define ALPHABET_SIZE 26

typedef struct Node *Trie;

typedef struct Node {
   bool finish;                  // last char in key?
   Item data;                    // no Item if !finish
   Trie child[ALPHABET_SIZE];
} Node;

typedef char *Key;

... Tries 49/182

Note: Can also use BST-like nodes for more space-efficient implementation of tries

Trie Operations 50/182

Basic operations on tries:

1. search for a key
2. insert a key

... Trie Operations 52/182

Traversing a path, using char-by-char from Key:

find(trie,key):
| Input  trie, key
| Output pointer to element in trie if key found
|        NULL otherwise
|
| node=trie
| for each char in key do
| | if node.child[char] exists then
| | node=node.child[char]       // move down one level
| | else
| | return NULL
| | end if
| end for
| if node.finish then           // "finishing" node reached?
| return node
| else
| return NULL
| end if

... Trie Operations 53/182

Insertion into Trie:

insert(trie,item,key):
| Input  trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
| t=new trie node
| else
| t=trie
| end if
| if m=0 then
| t.finish=true, t.data=item
| else
| t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t

Exercise #7: Trie Insertion 54/182

Insert cat, cats and carer into this trie:

... Trie Operations 56/182

Analysis of standard tries:

O(n) space
insertion and search in time O(m)
n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
m … size of the string parameter of the operation (the "key")

Word Matching With Tries 58/182

Preprocessing the text:

1. Insert all searchable words of a text into a trie
2. Each finishing node stores the occurrence(s) of the associated word in the text

... Word Matching with Tries 59/182

Example text and corresponding trie of searchable words:
Compressed Tries 60/182

Compressed tries …

have internal nodes of degree at least 2 (i.e. non-finishing nodes must have ≥ 2 children)
are obtained from standard tries by compressing "redundant" chains of nodes

Example:

Exercise #8: Compressed Tries 61/182

Consider this uncompressed trie:

How many nodes (including the root) are needed for the compressed trie?

7

Pattern Matching With Suffix Tries 63/182

The suffix trie of a text T is the compressed trie of all the suffixes of T

Example:

... Pattern Matching With Suffix Tries 64/182

Compact representation:

... Pattern Matching With Suffix Tries 65/182

Input:

compact suffix trie for text T
pattern P

Goal:

find starting index of a substring of T equal to P

Exercise #9: Suffix Tries 66/182

Construct the compressed suffix trie for T = banana
... Pattern Matching With Suffix Tries 68/182

suffixTrieMatch(trie,P):
| Input  compact suffix trie for text T, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| j=0, v=root of trie
| repeat
| | // we have matched j characters
| | if ∃w∈children(v) such that P[j]=T[start(w)] then
| | | i=start(w)              // start(w) is the start index of w
| | | x=end(w)-i+1            // end(w) is the end index of w
| | | if m≤x then             // length of suffix ≤ length of the node label?
| | | if P[j..j+m-1]=T[i..i+m-1] then
| | | return i-j              // match at i-j
| | | else
| | | return -1               // no match
| | | else if P[j..j+x-1]=T[i..i+x-1] then
| | | j=j+x, m=m-x            // update suffix start index and length
| | | v=w                     // move down one level
| | | else return -1          // no match
| | | end if
| | else
| | return -1
| | end if
| until v is leaf node
| return -1                   // no match

... Pattern Matching With Suffix Tries 69/182

Analysis of pattern matching using suffix tries:

Suffix trie for a text of size n …

can be constructed in O(n) time
uses O(n) space
supports pattern matching queries in O(m) time
m … length of the pattern

Text Compression 71/182

Problem: Efficiently encode a given string X by a smaller string Y

Applications:

Save memory and/or bandwidth

Huffman's algorithm

computes frequency f(c) for each character c
encodes high-frequency characters with short code
no code word is a prefix of another code word
uses optimal encoding tree to determine the code words

... Text Compression 72/182

Code … mapping of each character to a binary code word

Prefix code … binary code such that no code word is prefix of another code word

Encoding tree …

represents a prefix code
each leaf stores a character
code word given by the path from the root to the leaf (0 for left child, 1 for right child)

... Text Compression 73/182

Example:

... Text Compression 74/182

Text compression problem

Given a text T, find a prefix code that yields the shortest encoding of T

short codewords for frequent characters
long code words for rare characters

Exercise #10: 75/182

Two different prefix codes:

Which code is more efficient for T = abracadabra?

T1 requires 29 bits to encode text T,
T2 requires 24 bits.

01011011010000101001011011010 vs 001011000100001100101100

Huffman Code 77/182

Huffman's algorithm

computes frequency f(c) for each character
successively combines pairs of lowest-frequency characters to build encoding tree "bottom-up"

Example: abracadabra

... Huffman Code 78/182

Huffman's algorithm using priority queue:

HuffmanCode(T):
| Input  string T of size n
| Output optimal encoding tree for T
|
| compute frequency array
| Q=new priority queue
| for all characters c do
| | T=new single-node tree storing c
| | join(Q,T) with frequency(c) as key
| end for
| while |Q|≥2 do
| | f1=Q.minKey(), T1=leave(Q)
| | f2=Q.minKey(), T2=leave(Q)
| | T=new tree node with subtrees T1 and T2
| | join(Q,T) with f1+f2 as key
| end while
| return leave(Q)

Exercise #11: Huffman Code 79/182

Construct a Huffman tree for: a fast runner need never be afraid of the dark

... Huffman Code 81/182

Analysis of Huffman's algorithm:

O(n+d·log d) time
n … length of the input text T
d … number of distinct characters in T
Approximation

Approximation for Numerical Problems 83/182

Approximation is often used to solve numerical problems by

solving a simpler, but much more easily solved, problem
where this new problem gives an approximate solution
and refining the method until it is "accurate enough"

Examples:

roots of a function f
length of a curve determined by a function f
… and many more

... Approximation for Numerical Problems 84/182

Example: Finding Roots

Find where a function crosses the x-axis:

Generate and test: move x1 and x2 together until "close enough"

... Approximation for Numerical Problems 85/182

A simple approximation algorithm for finding a root in a given interval:

bisection(f,x1,x2):
| Input  function f, interval [x1,x2]
| Output x∈[x1,x2] with f(x)≅0
|
| repeat
| | mid=(x1+x2)/2
| | if f(x1)*f(mid)<0 then
| | x2=mid                     // root to the left of mid
| | else
| | x1=mid                     // root to the right of mid
| | end if
| until f(mid)=0 or x2-x1<ε    // ε: accuracy
| return mid

bisection guaranteed to converge to a root if f continuous on [x1,x2] and f(x1) and f(x2) have opposite signs

... Approximation for Numerical Problems 86/182

Example: Length of a Curve

Estimate length: approximate curve as sequence of straight lines.

length=0, δ=(end-start)/StepSize
for each x∈[start+δ,start+2δ,..,end] do
length = length + sqrt(δ² + (f(x)-f(x-δ))²)
end for

Approximation for Problems in NP 87/182

Approximation is often used for problems in NP…

computing a near-optimal solution
in polynomial time

Examples:

vertex cover of a graph
subset-sum problem

Vertex Cover 88/182

Reminder: Graph G = (V,E)

set of vertices V
set of edges E

Vertex cover C of G …

C⊆V
for all edges (u,v) ∈ E either v ∈ C or u ∈ C (or both)

⇒ All edges of the graph are "covered" by vertices in C

... Vertex Cover 89/182

Example (6 nodes, 7 edges, 3-vertex cover):

What would be an optimal vertex cover?

Applications:

Computer Network Security
compute minimal set of routers to cover all connections
Biochemistry

... Vertex Cover 90/182

size of vertex cover C … |C| (number of elements in C)

optimal vertex cover … a vertex cover of minimum size

Theorem.
Determining whether a graph has a vertex cover of a given size k is an NP-complete problem.
... Vertex Cover 91/182

An approximation algorithm for vertex cover:

approxVertexCover(G):
| Input  undirected graph G=(V,E)
| Output vertex cover of G
|
| C=∅
| E'=copy of E
| while E'≠∅ do
| | choose any (v,w) ∈ E'
| | C = C∪{v,w}
| | E' = E' \ {all edges incident on v or w}
| end while
| return C

Exercise #12: Vertex Cover 92/182

Show how the approximation algorithm produces a vertex cover on:

... Vertex Cover 95/182

Theorem.
The approximation algorithm returns a vertex cover at most twice the size of an optimal cover.
Proof. Any (optimal) cover must include at least one endpoint of each chosen edge.

Cost analysis …

repeatedly select an edge from E
add endpoints to C
delete all edges in E covered by endpoints

Time complexity: O(V+E) (adjacency list representation)

Randomised Algorithms 102/182

Algorithms employ randomness to

improve worst-case runtime
compute correct solutions to hard problems more efficiently but with low probability of failure
compute approximate solutions to hard problems

Randomness 103/182

Randomness is also useful

in computer games:
may want aliens to move in a random pattern
the layout of a dungeon may be randomly generated
may want to introduce unpredictability
in physics/applied maths:
carry out simulations to determine behaviour
e.g. models of molecules are often assumed to move randomly
in testing:
stress test components by bombarding them with random data
random data is often seen as unbiased data
gives average performance (e.g. in sorting algorithms)
in cryptography

Sidetrack: Random Numbers 104/182

How can a computer pick a number at random?

it cannot

Software can only produce pseudo random numbers.

a pseudo random number is one that is predictable
(although it may appear unpredictable)
⇒ Implementation may deviate from expected theoretical behaviour

It is a complex task to pick good numbers.

... Sidetrack: Random Numbers 105/182

The most widely-used technique is called the Linear Congruential Generator (LCG)

it uses a recurrence relation:
Xn+1 = (a·Xn + c) mod m, where:
m is the "modulus"
a, 0 < a < m is the "multiplier"
c, 0 ≤ c < m is the "increment"
X0 is the "seed"
if c=0 it is called a multiplicative congruential generator

LCG is not good for applications that need extremely high-quality random numbers

the period length is too short
period length … length of sequence at which point it repeats itself
a short period means the numbers are correlated

... Sidetrack: Random Numbers 106/182

Trivial example:

for simplicity assume c=0
so the formula is Xn+1 = a·Xn mod m
try a=11=X0, m=31, which generates the sequence:

11, 28, 29, 9, 6, 4, 13, 19, 23, 5, 24, 16, 21, 14, 30, 20, 3, 2, 22, 25,
27, 18, 12, 8, 26, 7, 15, 10, 17, 1, 11, 28, 29, 9, 6, 4, 13, 19, 23, 5, ...

all the integers from 1 to 30 are here
period length = 30

... Sidetrack: Random Numbers 107/182

Another trivial example:

again let c=0
try a=12=X0 and m=30
that is, Xn+1 = 12·Xn mod 30
which generates the sequence:

12, 24, 18, 6, 12, 24, 18, 6, 12, 24, 18, 6, ...

notice the period length (= 4) … clearly a terrible sequence

... Sidetrack: Random Numbers 108/182

A bit of history:

Lewis, Goodman and Miller (1969) suggested

Xn+1 = 7^5·Xn mod (2^31-1)

note:
7^5 is 16807
2^31-1 is 2147483647
X0 = 0 is not a good seed value

Most compilers use LCG-based algorithms that are slightly more involved; see www.mscs.dal.ca/~selinger/random/ for details (including a short C program that produces the exact same pseudo-random numbers as gcc for any given seed value)

... Sidetrack: Random Numbers 109/182

Two functions are required:

srand(unsigned int seed)   // sets its argument as the seed
rand()                     // uses a LCG technique to generate random
                           // numbers in the range 0 .. RAND_MAX

where the constant RAND_MAX is defined in stdlib.h
(depends on the computer: on the CSE network, RAND_MAX = 2147483647)

The period length of this random number generator is very large

approximately 16 · (2^31 - 1)

... Sidetrack: Random Numbers 110/182

To convert the return value of rand() to a number between 0 .. RANGE

compute the remainder after division by RANGE+1

Using the remainder to compute a random number is not the best way:

can generate a 'better' random number by using a more complex division
but good enough for most purposes

Some applications require more sophisticated, cryptographically secure pseudo random numbers
Exercise #13: Random Numbers 111/182

Write a program to simulate 10,000 rounds of Two-up.

Assume a $10 bet at each round
Compute the overall outcome and average per round

#include <stdlib.h>
#include <stdio.h>

#define RUNS 10000
#define BET 10

int main(void) {
   srand(1234567);               // choose arbitrary seed
   int coin1, coin2, n, sum = 0;
   for (n = 0; n < RUNS; n++) {
      do {
         coin1 = rand() % 2;
         coin2 = rand() % 2;
      } while (coin1 != coin2);
      if (coin1==1 && coin2==1)
         sum += BET;
      else
         sum -= BET;
   }
   printf("Final result: %d\n", sum);
   printf("Average outcome: %f\n", (float) sum / RUNS);
   return 0;
}

... Sidetrack: Random Numbers 113/182

Seeding

There is one significant problem:

every time you run a program with the same seed, you get exactly the same sequence of 'random' numbers (why?)

To vary the output, can give the random seeder a starting point that varies with time

an example of such a starting point is the current time, time(NULL)
(NB: this is different from the UNIX command time, used to measure program running time)

#include <time.h>
time(NULL)   // returns the time as the number of seconds
             // since the Epoch, 1970-01-01 00:00:00 +0000
             // time(NULL) on July 31st, 2020, 12:59pm was 1596164340
             // time(NULL) about a minute later was 1596164401

Analysis of Randomised Algorithms 115/182

Math needed for the analysis of randomised algorithms:

Sample space … Ω = {ω1,…,ωn}
Probability … 0 ≤ P(ωi) ≤ 1
Event … E ⊆ Ω

Basic probability theory

P(E) = Σω∈E P(ω)
P(Ω) = 1
P(not E) = P(Ω\E) = 1 – P(E)
P(E1 and E2) = P(E1∩E2) = P(E1) · P(E2) if E1,E2 independent

Expectation

event E has probability p ⇒ average number of trials needed to see E is 1/p

Combinatorics

number of ways to choose k objects from n objects …

(n choose k) = n·(n−1)·…·(n−k+1) / (1·2·…·k)

Exercise #14: Basic Probability 116/182

Consider Ω = {HH, HT, TH, TT} (each outcome with probability ¼)

1. E1 = first coin lands on heads. What is P(E1)?
2. E2 = second coin lands on tails. What is P(E2)?
3. Are E1,E2 independent?
4. Probability of not (E1 and E2)?
5. On average, how often do you have to toss the pair of coins to obtain HH or TT?

1. ½
2. ½
3. Yes
4. 1 – ¼ = ¾
5. 2 times
Note that 2 is the infinite sum ½·1 + (1−½)·½·2 + (1−½)²·½·3 + (1−½)³·½·4 + ...

... Analysis of Randomised Algorithms 118/182

Randomised algorithm to find some element with key k in an unordered array:

findKey(L,k):
| Input  array L, key k
| Output some element in L with key k
|
| repeat
| randomly select e∈L
| until key(e)=k
| return e
... Analysis of Randomised Algorithms 119/182

Analysis:

p … ratio of elements in L with key k (e.g. p = 1/3)
Probability of success: 1 (if p > 0)
Expected runtime: 1/p

Example: a third of the elements have key k ⇒ expected number of iterations = 3

... Analysis of Randomised Algorithms 120/182

If we cannot guarantee that the array contains any elements with key k …

findKey(L,k,d):
| Input  array L, key k, maximum #attempts d
| Output some element in L with key k
|
| repeat
| | if d=0 then
| | return failure
| | end if
| | randomly select e∈L
| | d=d-1
| until key(e)=k
| return e

... Analysis of Randomised Algorithms 121/182

Analysis:

p … ratio of elements in L with key k
d … maximum number of attempts
Probability of success: 1 - (1-p)^d
Expected runtime: Σi=1..d−1 i·(1−p)^(i−1)·p + d·(1−p)^(d−1)
O(1) if d is a constant

Randomised Quicksort 122/182

Quicksort applies divide and conquer to sorting:

Divide
pick a pivot element
move all elements smaller than the pivot to its left
move all elements greater than the pivot to its right
Conquer
sort the elements on the left
sort the elements on the right

Non-randomised Quicksort 123/182

Divide ...

partition(array,low,high):
| Input  array, index range low..high
| Output selects array[low] as pivot element
|        moves all smaller elements between low+1..high to its left
|        moves all larger elements between low+1..high to its right
|        returns new position of pivot element
|
| pivot_item=array[low], left=low+1, right=high
| repeat
| | right = find index of rightmost element <= pivot_item
| | left = find index of leftmost element > pivot_item   // left=right if none
| | if left<right then
| | swap array[left] with array[right]
| | end if
| until left≥right
| if low<right then
| swap array[low] with array[right]   // right is final position for pivot
| end if
| return right

... Non-randomised Quicksort 124/182

... and Conquer!

Quicksort(array,low,high):
| Input  array, index range low..high
| Output array[low..high] sorted
|
| if high > low then                  // termination condition low >= high
| | pivot = partition(array,low,high)
| | Quicksort(array,low,pivot-1)
| | Quicksort(array,pivot+1,high)
| end if

... Non-randomised Quicksort 125/182

Example:

3 6 5 2 4 1          // swap a[left=1] and a[right=5]
3 1 5 2 4 6          // swap a[left=2] and a[right=3]
3 1 2 5 4 6          // swap pivot and a[right=2]
2 1 | 3 | 5 4 6
1 2 | 3 | 5 4 6
1 2 | 3 | 4 | 5 | 6

Worst-case Running Time 126/182

Worst case for Quicksort occurs when the pivot is the unique minimum or maximum element:

One of the intervals low..pivot-1 and pivot+1..high is of size n-1 and the other is of size 0
⇒ running time is proportional to n + n-1 + … + 2 + 1
Hence the worst case for non-randomised Quicksort is O(n²)

1 2 3 4 5 6
1 | 2 3 4 5 6
1 | 2 | 3 4 5 6
...
1 | 2 | 3 | 4 | 5 | 6

Randomised Quicksort 127/182

partition(array,low,high):
| Input  array, index range low..high
| Output randomly selects a pivot element from array[low..high]
|        moves all smaller elements between low..high to its left
|        moves all larger elements between low..high to its right
|        returns new position of pivot element
|
| randomly select pivot_index∈[low..high]
| pivot_item=array[pivot_index], swap array[low] with array[pivot_index]
| left=low+1, right=high
| repeat
| | right = find index of rightmost element <= pivot_item
| | left = find index of leftmost element > pivot_item   // left=right if none
| | if left<right then
| | swap array[left] with array[right]
| | end if
| until left≥right
| if low<right then
| swap array[low] with array[right]   // right is final position for pivot
| end if
| return right
130/182
Minimum Cut Problem

Given:

undirected graph G=(V,E)

Cut of a graph …

a partition of V into S ∪ T
S,T disjoint and both non-empty
its weight is the number of edges between S and T:

ω(S,T) = | { {s,t}∈E : s∈S, t∈T } |

Minimum cut problem … find a cut of G with minimal weight

131/182
... Minimum Cut Problem

Example:

132/182
Contraction

Contracting edge e = {v,w} …

remove edge e
replace vertices v and w by new node n
replace all edges {x,v}, {x,w} by {x,n}

… results in a multigraph (multiple edges between vertices allowed)

Example:

133/182
... Contraction

Randomised algorithm for graph contraction = repeated edge contraction until 2 vertices remain

contract(G):
|  Input  graph G = (V,E) with |V|≥2 vertices
|  Output cut of G
|
|  while |V|>2 do
|     randomly select e∈E
|     contract edge e in G
|  end while
|  return the only cut in G

134/182
Exercise #15: Graph Contraction

Apply the contraction algorithm twice to the following graph, with different random choices:

135/182
... Contraction

Analysis:

V … number of vertices

Fact. Probability of contract to result in a minimum cut:

≥ 1 / binomial(V,2)

This is much higher than the probability of picking a minimum cut at random, which is

≤ binomial(V,2) / (2^(V−1) − 1)

because every graph has 2^(V−1) − 1 cuts and

Fact. At most binomial(V,2) cuts can have minimum weight

Single edge contraction can be implemented in O(V) time on an adjacency-list representation ⇒ total running time: O(V²)
(Best known implementation of graph contraction uses O(E) time)

136/182
Karger's Algorithm

Idea: Repeat random graph contraction several times and take the best cut found

MinCut(G):
|  Input  graph G with V≥2 vertices
|  Output smallest cut found
|
|  min_weight=∞, d=0
|  repeat
|  |  cut=contract(G)
|  |  if weight(cut)<min_weight then
|  |     min_cut=cut, min_weight=weight(cut)
|  |  end if
|  |  d=d+1
|  until d > binomial(V,2)·ln V
|  return min_cut

137/182
... Karger's Algorithm

Analysis:

V … number of vertices
E … number of edges

Probability of not finding a minimum cut when the contraction algorithm is repeated d = binomial(V,2)·ln V times:

[1 − 1/binomial(V,2)]^d ≤ 1/e^(ln V) = 1/V

Probability of success: ≥ 1 − 1/V

Total running time: O(E·d) = O(E·V²·log V)
assuming graph contraction implemented in O(E)

138/182
Sidetrack: Maxflow and Mincut

Given: flow network G=(V,E) with

edge weights w(u,v)
source s∈V, sink t∈V

Cut of flow network G …

a partition of V into S ∪ T
s∈S, t∈T, S and T disjoint
its weight is the sum of the weights of the edges between S and T:

ω(S,T) = ∑u∈S ∑v∈T w(u,v)

Minimum cut problem … find cut of a network with minimal weight

143/182
... Sidetrack: Maxflow and Mincut

Max-flow Min-cut Theorem.

In a flow network G the following conditions are equivalent:

1. f is a maximum flow in G
2. the residual network G relative to f contains no augmenting path
3. value of flow f = weight of some minimum cut (S,T) of G

144/182
Randomised Algorithms for NP-Problems

Many NP-problems can be tackled by randomised algorithms that

compute nearly optimal solutions
with high probability

Examples:

travelling salesman
constraint satisfaction problems, satisfiability
… and many more
139/182
Exercise #16: Cut of Flow Networks

What is the weight of the cut {Fairfield,Parramatta,Auburn}, {Ryde,Homebush,Rozelle}?

12+14 = 26

141/182
Exercise #17: Cut of Flow Networks

Find a minimal cut in:

ω(S,T) = 4

Simulation

146/182
Simulation

In some problem scenarios

it is difficult to devise an analytical solution
so build a software model and run experiments

Examples: weather forecasting, traffic flow, queueing, games

Such systems typically require random number generation

distributions: uniform, numerical, normal, exponential

Accuracy of results depends on accuracy of model.

147/182
Example: Area inside a Curve

Scenario:

have a closed curve defined by a complex function
have a function to compute "X is inside/outside curve?"

148/182
... Example: Area inside a Curve

Simulation approach to determining the area:

determine a region completely enclosing curve
generate very many random points in this region
for each point x, compute inside(x)
count number of insides and outsides
areaWithinCurve = totalArea * insides/(insides+outsides)

i.e. we approximate the area within the curve by using the ratio of points inside the curve against those outside

This general method of approximating is known as Monte Carlo estimation.

149/182
Summary

Alphabets and words
Pattern matching
  Boyer-Moore, Knuth-Morris-Pratt
Tries
Text compression
  Huffman code
Approximation
  numerical problems
  vertex cover
Analysis of randomised algorithms
  probability of success
  expected runtime
Randomised Quicksort
Karger's algorithm
Simulation

Suggested reading:

tries … Sedgewick, Ch. 15.2
approximation … Moffat, Ch. 9.4
randomisation … Moffat, Ch. 9.3, 9.5

Algorithm and Data Ethics

151/182
Data Breaches

Major incidents …

TJ Maxx credit and debit card theft (2005-07)

Hackers gained access to accounts of over 100 million customers
⇒ Customers exposed to credit/debit card fraud

Yahoo! data breach (2013-16)

Hackers gained access to all 3 billion user accounts
Details taken included names, DOBs, passwords, answers to security questions
⇒ Customers exposed to identity theft
⇒ Over 20 class-action lawsuits filed against Yahoo!

152/182
... Data Breaches

The Guardian, 30/03/15 …

Facebook-Cambridge Analytica data scandal (2018)

Millions of people's Facebook profiles used for political purpose without their consent
⇒ Cambridge Analytica went bust as a consequence

153/182
... Data Breaches

More severe, recent incidents in Australia …

Optus cyberattack (Sept 2022)

Hacker gained access to personal information of 2.1 million customers
Details taken included names, DOBs, street addresses, driving licence numbers, passport numbers
⇒ Customers vulnerable to financial crimes
⇒ Up to 100,000 new passports had to be issued
⇒ Optus put aside $140 million for costs related to the breach

154/182
... Data Breaches

ABC News, 26/10/20 …

Medibank cyberattack (Oct 2022)

Details of 9 million customers taken, including names, DOBs, street addresses, medical diagnoses and procedures
Also passport numbers and visa details for international students stolen
Ransom demanded but refused
⇒ Medical records posted on darknet, including data on abortions, mental health information

155/182
... Data Breaches

Australia's Privacy Act 1988 …

outlines how personal information must be used and managed
applies to government agencies, businesses and organisations with annual turnover of >$3 million, private health services, …

Individuals have the right to:

have access to their personal information
know why and how information is collected and who it will be disclosed to
ask to stop unwanted direct marketing

Businesses and organisations must comply with the Australian Privacy Principles:

how to collect personal information
how (not) to use personal information
how to secure personal information

156/182
... Data Breaches

Australia's Privacy Act 1988 → Notifiable Data Breaches scheme

In the event of a suspected or known data breach …

contain breach where possible
assess if personal information is likely to result in serious harm to affected individuals
individuals must be notified promptly
Australian Information Commissioner must also be notified
take action to prevent future breaches

157/182
Data (Mis-)use

In 2012 several newspapers reported that …

Target used data analysis to predict whether female customers are likely pregnant
Target then sent coupons by mail
A Minneapolis man thus found out about the pregnancy of his teenage daughter

Not based on a factual story, but not implausible either

158/182
... Data (Mis-)use

Who "owns" your data?

big companies (Google, Meta, Microsoft, …)?
governments?
you?

Respect privacy

Store only the minimum amount of personal information necessary
Prevent re-identification of anonymised data
Carefully analyse the consequences of data aggregation
Access data only when authorised or compelled by the public good
  Whistleblower Manning's disclosing of classified military data to Wikileaks (2010-11)
  Paradise papers that disclosed offshore investments (2017)

Source: ACM Code of Ethics and Professional Conduct


Costly Software Errors

NASA's Mars Climate Orbiter …

launched 11/12/1998
reached Mars on 23/9/1999
came too close to surface and disintegrated

Cause of failure:

spec said impulse must be calculated in newton seconds
one module calculated impulse in pound-force seconds
1 newton ≅ 0.2248 pound-force

160/182
... Costly Software Errors

Toyota vehicle recall (2009-11)

Vehicles experienced sudden unintended acceleration
89 deaths have been linked to the failure
9 million cars recalled worldwide

Causes of failure included …

a deficiency in the electronic throttle control system:
stack overflow
⇒ stack grew out of boundary, overwrote other data

161/182
... Costly Software Errors

Sydney Morning Herald, 05/01/10:

EFTPOS terminals inoperable for several days in early 2010
customers' cards rejected as expired

Cause of failure:

one module interpreted the current year as hexadecimal
0x09 = 09
0x10 = 16 (≠ 10)

162/182
Sidetrack: Year 2038 Problem

Recall:

#include <time.h>
time(NULL)   // returns the time as the number of seconds
             // since the Epoch, 1970-01-01 00:00:00 +0000

Year 2038 problem …

time(NULL) on 19 January 2038 at 03:14:07 (UTC) will be 2147483647 = 0x7FFFFFFF
a second later it will be 0x80000000 = -2,147,483,648
⇒ -2^31 seconds since 01/01/1970 ("Epoch") is 13 December 1901 …

163/182
Programming Ethics

From the ACM/IEEE Software Engineering Code …

Software engineers shall ensure that their products meet the highest professional standards possible

Strive to fully understand the specifications for software
Ensure that specifications have been well documented and satisfy the users' requirements
Ensure adequate testing, debugging, and review of software and related documents
Approve software only if it
  is safe
  meets specifications
  passes appropriate tests
  does not diminish quality of life, diminish privacy or harm the environment

164/182
... Programming Ethics

Algorithms can save lives.

Uberlingen airplane collision 1/7/02 at 11:35pm …

passenger jet V9 2937 and cargo jet QY 611 on collision course at 36,000 feet
ground air traffic controller instructed V9 pilot to descend
seconds later, the automatic Traffic Collision Avoidance System (TCAS)
  instructed V9 2937 to climb
  instructed QY 611 to descend
flight 611's pilot followed TCAS, flight 2937's pilot ignored TCAS
all 71 people on board the two planes killed

⇒ Collision would not have occurred had both pilots followed TCAS

165/182
Exercise #18: Collision Avoidance Algorithm

The TCAS …

builds 3D map of aircraft in the airspace
determines if collision threat occurs
automatically negotiates mutual avoidance manoeuvre
gives synthesised voice instructions to pilots ("climb, climb")

What algorithm would you use for reaching an agreement (climb vs. descent)?

Recall:
166/182
Moral Dilemmas

How to program an autonomous car …

for a potential crash scenario
when you have to choose between two actions that are both harmful

This is a modern version of the Trolley Problem …

A runaway trolley is on course to kill five people
You stand next to a lever that controls a switch
If the trolley is diverted, it will kill one person on the side track

Is it ethical to pull the lever and kill the one in order to save the five?

167/182
Exercise #19: Moral Dilemmas

What would you do?

Variations:

Fat man on bridge
Transplant

⇒ try it yourself on the Moral Machine

Course Review

169/182
Course Review

Goal:

For you to become competent Computer Scientists able to:

choose/develop effective data structures
choose/develop algorithms on these data structures
analyse performance characteristics of algorithms (time/space complexity)
package a set of data structures+algorithms as an abstract data type
represent data structures and implement algorithms in C

170/182
Assessment Summary

lab = mark for programs/quizzes (out of 8+8)
midterm = mark for mid-term test (out of 12)
assn = mark for large assignment (out of 12)
exam = mark for final exam (out of 60)

if (exam >= 25)
   total = lab + midterm + assn + exam
else
   total = exam * (100/60)

To pass the course, you must achieve:

at least 50/100 for total

which implies that you must achieve at least 25/60 for exam

171/182
... Assessment Summary

Check your results using

prompt$ 9024 classrun -sturec

ClassKey: 24T0COMP9024        ClassKey: 24T0COMP9024
...                           ...
Exam:    43/60                Exam:    23/60
assn:    8/12                 assn:    8/12
lab:     11.5/16              lab:     11.5/16
midterm: 9/12                 midterm: 9/12
total:   72/100               total:   38/100

172/182
Final Exam

Goal: to check whether you have become a competent Computer Scientist

Requires you to demonstrate:

understanding of fundamental data structures and algorithms
ability to analyse time complexity of algorithms
ability to develop algorithms from specifications

Lectures, problem sets and assignments have built you up to this point.

173/182
... Final Exam

2-hour exam on Monday, 5 February
CSE Computer Labs, your Time/Lab/Seat emailed to you
reading time starts 10 minutes before beginning of exam, be there early
7 multiple-choice questions, 4 open questions
Covers all of the contents of this course
Each multiple-choice question is worth 4 marks (7 × 4 = 28)
Each open question is worth 8 marks (4 × 8 = 32)
Closed book, but you can bring one A4-sized sheet of your own handwritten notes
Bring student ID card, your zPass, ballpoint pens, your A4-sheet

174/182
... Final Exam

Sample prac exam available on Moodle

4 multiple-choice questions, 2 open questions
maximum time: 60 minutes
sample solutions provided upon completion

175/182
... Final Exam

Of course, assessment isn't a "one-way street" …

I get to assess you in the final exam
you get to assess me in UNSW's MyExperience Evaluation
  go to https://myexperience.unsw.edu.au/
  login using [email protected] and your zPass

Response rate (as at Friday week 4): 21.7%

Please fill it out …

give me some feedback on how you might like the course to run in the future
even if that is "Exactly the same. I liked how it was run."

176/182
Revision Strategy

Re-read lecture slides and example programs
Read the corresponding chapters in the recommended textbooks
Review/solve problem sets
Attempt prac exam questions on Moodle
Invent your own variations of the weekly exercises (problem solving is a skill that improves with practice)

177/182
Supplementary Exam

If you attend an exam

you are making a statement that you are "fit and healthy enough"
it is your only chance to pass (i.e. no second chances)

Supplementary exam only available to students who

do not attend the final exam and
apply formally for special consideration
with a documented and accepted reason for not attending

178/182
Assessment

Assessment is about determining how well you understand the syllabus of this course.

If you can't demonstrate your understanding, you don't pass.

In particular, we don't pass people just because …

please, please, … I'll be excluded if I fail COMP9024
please, please, … this is my final course to graduate
please, please, … my parents will be ashamed of me
please, please, … I tried really hard in this course
etc. etc. etc.

Failure is a fact of life. For example, my scientific papers or project proposals get rejected sometimes too.

Summing Up …

180/182
So What Was the Real Point?

The aim was for you to become a better computer scientist

more confident in your own ability to design data structures and algorithms
with an expanded set of fundamental structures and algorithms to draw on
able to analyse and justify your choices
ultimately, enjoying the software design and development process

181/182
Finally …

Book 9
Epilogue

Thus spake the Master Programmer:

"Time for you to leave."

182/182
... Finally …

T h a t ' s   A l l   F o l k s
&&
Good Luck with the Exam
and with your future studies

Produced: 27 Jan 2024