Notes 5
Strings
A string is a sequence of characters.
An alphabet Σ is the set of possible characters in strings.
Examples of strings:
C program
HTML document
DNA sequence
Digitised image
Examples of alphabets:
ASCII
Unicode
{0,1}
{A,C,G,T}
Notation:
length(P) … #characters in P
λ … empty string (length(λ) = 0)
Σ^m … set of all strings of length m over alphabet Σ
Σ* … set of all strings over alphabet Σ
ASCII (American Standard Code for Information Interchange)
Specifies mapping of 128 characters to integers 0..127
The characters encoded include:
upper and lower case English letters: A-Z and a-z
digits: 0-9
common punctuation symbols
special non-printing characters: e.g. newline and space
Text … abacaab
Pattern … abacab
NaiveMatching(T,P):
| Input text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| for all i=0..n-m do
| | j=0 // check from left to right
| | while j<m and T[i+j]=P[j] do // test ith shift of pattern
| | | j=j+1
| | | if j=m then
| | | | return i // entire pattern checked
| | | end if
| | end while
| end for
| return -1 // no match found
Analysis of Naive Pattern Matching
Naive pattern matching runs in O(n·m)
When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]
Boyer-Moore Algorithm
otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")
Example:
last-occurrence function L
L maps Σ to integers such that L(c) is defined as
the largest index i such that P[i]=c, or
-1 if no such index exists
c    a  b  c  d
L(c) 2  3  1  -1
Case 2: j < 1+l ⇒ i = i+m-j
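The last-occurrence function and the character-jump shift can be sketched in C as follows. This is a minimal sketch, not the course's reference implementation: the names lastOccurrence and boyerMooreMatch, the 256-entry table (assuming 8-bit characters) and the Case 1 shift i = i+m-(1+l) are assumptions taken from the usual textbook formulation; only Case 2 appears in the notes above. The pattern "acab" over {a,b,c,d} is also an assumption, chosen because it reproduces the table L(a)=2, L(b)=3, L(c)=1, L(d)=-1.

```c
#include <string.h>

#define ALPHABET_SIZE 256   /* assumption: 8-bit characters */

/* Fill L so that L[c] is the largest index i with P[i] == c,
   or -1 if c does not occur in P. */
void lastOccurrence(const char *P, int L[ALPHABET_SIZE]) {
    for (int c = 0; c < ALPHABET_SIZE; c++)
        L[c] = -1;
    for (int i = 0; P[i] != '\0'; i++)
        L[(unsigned char)P[i]] = i;    /* later occurrences overwrite earlier ones */
}

/* Return the starting index of the first substring of T equal to P,
   or -1 if no such substring exists. */
int boyerMooreMatch(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    int L[ALPHABET_SIZE];
    if (m == 0) return 0;
    if (m > n) return -1;
    lastOccurrence(P, L);
    int i = m - 1, j = m - 1;          /* compare right to left */
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == 0) return i;      /* entire pattern matched */
            i--;
            j--;
        } else {
            int l = L[(unsigned char)T[i]];
            int shift = (j < 1 + l) ? j : 1 + l;   /* Case 2 : Case 1 */
            i = i + m - shift;
            j = m - 1;
        }
    }
    return -1;                         /* no match found */
}
```

With the text and pattern used later in these notes, boyerMooreMatch("abacaabaccabacabaabb", "abacab") finds the match at index 10.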
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm …
KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with itself
1. compute failure function F for pattern P = abacab
2. trace Knuth-Morris-Pratt on P and text T = abacaabaccabacabaabb
how many comparisons are needed?
j    0 1 2 3 4 5
Pj   a b a c a b
F(j) 0 0 1 0 1 2
Example: P = abaaba
j    0 1 2 3 4 5
Pj   a b a a b a
F(j) 0 0 1 1 2 3
⇒ F[0]=0
j=1, len=0, P[1]≠P[0] ⇒ F[1]=0
j=2, len=0, P[2]=P[0] ⇒ len=1, F[2]=1
j=3, len=1, P[3]≠P[1] ⇒ len=F[0]=0
j=3, len=0, P[3]=P[0] ⇒ len=1, F[3]=1
j=4, len=1, P[4]=P[1] ⇒ len=2, F[4]=2
j=5, len=2, P[5]=P[2] ⇒ len=3, F[5]=3
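The trace above follows the usual way of computing the failure function, which can be sketched in C as below. This is a hedged sketch: the function names failureFunction and kmpMatch are assumptions, and the matcher is the standard KMP loop rather than code taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Compute F[0..m-1], where F[j] is the length of the longest proper
   prefix of P[0..j] that is also a suffix of P[0..j]. */
void failureFunction(const char *P, int *F) {
    int m = (int)strlen(P);
    if (m == 0) return;
    F[0] = 0;
    int len = 0, j = 1;
    while (j < m) {
        if (P[j] == P[len]) {          /* prefix of length len can be extended */
            len++;
            F[j] = len;
            j++;
        } else if (len > 0) {
            len = F[len - 1];          /* fall back to a shorter prefix */
        } else {
            F[j] = 0;
            j++;
        }
    }
}

/* Return the starting index of the first substring of T equal to P,
   or -1 if there is none (or if memory allocation fails). */
int kmpMatch(const char *T, const char *P) {
    int n = (int)strlen(T), m = (int)strlen(P);
    if (m == 0) return 0;
    int *F = malloc(m * sizeof(int));
    if (F == NULL) return -1;
    failureFunction(P, F);
    int i = 0, j = 0, result = -1;
    while (i < n) {
        if (T[i] == P[j]) {
            if (j == m - 1) { result = i - j; break; }
            i++;
            j++;
        } else if (j > 0) {
            j = F[j - 1];              /* reuse the matched prefix, keep i */
        } else {
            i++;
        }
    }
    free(F);
    return result;
}
```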
⇒ KMP's algorithm runs in optimal time O(m+n)
Boyer-Moore decides how far to jump ahead based on the mismatched character in the text
works best on large alphabets and natural language texts (e.g. English)
Tries
Tries are trees organised using parts of keys (rather than whole keys)
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
Exercise #6:
How many words are encoded in the trie on the previous slide?
7
typedef char *Key;
Note: Can also use BST-like nodes for more space-efficient implementation of tries
Trie Operations
Basic operations on tries:
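Insertion and search on a trie can be sketched in C as follows. This is an illustrative sketch only: apart from the Key typedef, every name (TrieNode, finish, newTrieNode, trieInsert, trieFind, trieFree) is an assumption, as is the restriction of keys to the lowercase letters a-z with one child pointer per letter.

```c
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26               /* assumption: keys use a-z only */

typedef char *Key;

typedef struct TrieNode {
    bool finish;                       /* true if a key ends at this node */
    struct TrieNode *child[ALPHABET_SIZE];
} TrieNode;

/* Allocate an empty node; returns NULL if out of memory. */
TrieNode *newTrieNode(void) {
    TrieNode *t = malloc(sizeof(TrieNode));
    if (t != NULL) {
        t->finish = false;
        for (int i = 0; i < ALPHABET_SIZE; i++)
            t->child[i] = NULL;
    }
    return t;
}

/* Insert key k, one character per level; returns false on an
   unsupported character or allocation failure. */
bool trieInsert(TrieNode *root, Key k) {
    TrieNode *t = root;
    for (; *k != '\0'; k++) {
        if (*k < 'a' || *k > 'z') return false;
        int c = *k - 'a';
        if (t->child[c] == NULL) {
            t->child[c] = newTrieNode();
            if (t->child[c] == NULL) return false;
        }
        t = t->child[c];
    }
    t->finish = true;                  /* mark the end of a complete key */
    return true;
}

/* Return true if k was inserted as a whole key (not merely as a prefix). */
bool trieFind(const TrieNode *root, Key k) {
    const TrieNode *t = root;
    for (; t != NULL && *k != '\0'; k++) {
        if (*k < 'a' || *k > 'z') return false;
        t = t->child[*k - 'a'];
    }
    return t != NULL && t->finish;
}

/* Release all nodes of the trie. */
void trieFree(TrieNode *t) {
    if (t == NULL) return;
    for (int i = 0; i < ALPHABET_SIZE; i++)
        trieFree(t->child[i]);
    free(t);
}
```

Both operations take time proportional to the length of the key, independent of how many keys are stored.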
Compressed Tries
Compressed tries …
have internal nodes of degree at least 2 (i.e. non-finishing nodes must have ≥ 2 children)
are obtained from standard tries by compressing "redundant" chains of nodes
Example:
Consider this uncompressed trie:
Pattern Matching With Suffix Tries
The suffix trie of a text T is the compressed trie of all the suffixes of T
Example:
Compact representation:
Input:
Goal:
suffixTrieMatch(trie,P):
| Input compact suffix trie for text T, pattern P of length m
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| j=0, v=root of trie
| repeat
| | // we have matched j characters
| | if ∃w∈children(v) such that P[j]=T[start(w)] then
| | | i=start(w) // start(w) is the start index of w
| | | x=end(w)-i+1 // end(w) is the end index of w
| | | if m≤x then // length of suffix ≤ length of the node label?
| | | | if P[j..j+m-1]=T[i..i+m-1] then
| | | | | return i-j // match at i-j
| | | | else
| | | | | return -1 // no match
| | | else if P[j..j+x-1]=T[i..i+x-1] then
| | | | j=j+x, m=m-x // update suffix start index and length
| | | | v=w // move down one level
| | | else return -1 // no match
| | | end if
| | else
| | | return -1
| | end if
| until v is leaf node
| return -1 // no match
Text Compression
Applications:
Huffman's algorithm
Prefix code … binary code such that no code word is prefix of another code word
Encoding tree …
represents a prefix code
each leaf stores a character
code word given by the path from the root to the leaf (0 for left child, 1 for right child)
Example:
Text compression problem
Given a text T, find a prefix code that yields the shortest encoding of T
Which code is more efficient for T = abracadabra?
01011011010000101001011011010 vs 001011000100001100101100
T2 requires 24 bits.
Huffman Code
Huffman's algorithm
| while |Q|≥2 do
| | f1=Q.minKey(), T1=leave(Q)
| | f2=Q.minKey(), T2=leave(Q)
| | T=new tree node with subtrees T1 and T2
| | join(Q,T) with f1+f2 as key
| end while
| return leave(Q)
Example: abracadabra
Construct a Huffman tree for: a fast runner need never be afraid of the dark
O(n+d·log d) time
n … length of the input text T
d … number of distinct characters in T
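The merging loop of Huffman's algorithm can be sketched in C as below. To keep the sketch short it makes two simplifying assumptions that differ from the pseudocode: the priority queue Q is replaced by a linear scan for the two smallest weights (so it runs in O(n+d²) rather than O(n+d·log d)), and it only computes the total length of the encoding instead of building the tree. The name huffmanEncodedBits and the convention of 1 bit per character when the text has a single distinct character are also assumptions. It relies on the fact that the total encoded length equals the sum of the keys f1+f2 of all merged nodes.

```c
#define MAX_SYMBOLS 256                /* assumption: 8-bit characters */

/* Return the number of bits needed to encode text with a Huffman code. */
long huffmanEncodedBits(const char *text) {
    long freq[MAX_SYMBOLS] = {0};
    long weight[MAX_SYMBOLS];          /* weights of the trees still in the "queue" */
    int live = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; p++)
        freq[*p]++;
    for (int c = 0; c < MAX_SYMBOLS; c++)
        if (freq[c] > 0) weight[live++] = freq[c];

    if (live == 0) return 0;           /* empty text */
    if (live == 1) return weight[0];   /* assumed: 1 bit per character */

    long total = 0;
    while (live >= 2) {
        /* a = index of smallest weight, b = index of second smallest */
        int a = 0, b = 1;
        if (weight[b] < weight[a]) { a = 1; b = 0; }
        for (int k = 2; k < live; k++) {
            if (weight[k] < weight[a]) { b = a; a = k; }
            else if (weight[k] < weight[b]) { b = k; }
        }
        long merged = weight[a] + weight[b];   /* key f1+f2 of the new node */
        total += merged;               /* every leaf below gains one bit */
        weight[a] = merged;            /* new tree replaces the first one */
        weight[b] = weight[live - 1];  /* remove the second one */
        live--;
    }
    return total;
}
```

For T = abracadabra the frequencies are a:5, b:2, r:2, c:1, d:1, the merged keys are 2, 4, 6 and 11, and the function returns 23.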
Approximation
length=0, δ=(end-start)/StepSize
for each x∈[start+δ,start+2δ,..,end] do
length = length + sqrt(δ² + (f(x)-f(x-δ))²)
end for
Generate and test: move x1 and x2 together until "close enough"
bisection guaranteed to converge to a root if f continuous on [x1,x2] and f(x1) and f(x2) have opposite signs
Approximation for Problems in NP
Approximation is often used for problems in NP…
⇒ All edges of the graph are "covered" by vertices in C
Theorem.
Determining whether a graph has a vertex cover of a given size k is an NP-complete problem.
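Because the exact problem is NP-complete, a vertex cover is usually approximated. The sketch below uses the standard greedy 2-approximation (take any uncovered edge and add both of its endpoints); this is an assumption about which approximation is meant, since the corresponding slide content is missing here. The Edge type and the name approxVertexCover are hypothetical.

```c
#include <stdbool.h>

typedef struct { int v, w; } Edge;     /* hypothetical edge representation */

/* Mark a vertex cover in inCover[0..nV-1] and return its size.
   The cover is at most twice as large as a minimum vertex cover. */
int approxVertexCover(const Edge edges[], int nE, int nV, bool inCover[]) {
    int size = 0;
    for (int i = 0; i < nV; i++)
        inCover[i] = false;
    for (int i = 0; i < nE; i++) {
        int v = edges[i].v, w = edges[i].w;
        if (!inCover[v] && !inCover[w]) {      /* edge not yet covered */
            inCover[v] = true;
            inCover[w] = true;
            size += (v == w) ? 1 : 2;          /* a self-loop adds one vertex */
        }
    }
    return size;
}
```

On the path 0-1-2-3 the sketch selects all four vertices, while the minimum cover {1,2} has size 2, which shows that the factor of 2 can actually be reached.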
Possible result:
Randomness
Randomness is also useful
in computer games:
may want aliens to move in a random pattern
the layout of a dungeon may be randomly generated
may want to introduce unpredictability
in physics/applied maths:
carry out simulations to determine behaviour
e.g. models of molecules are often assumed to move randomly
in testing:
stress test components by bombarding them with random data
random data is often seen as unbiased data
gives average performance (e.g. in sorting algorithms)
in cryptography
Sidetrack: Random Numbers
How can a computer pick a number at random?
it cannot
Software can only produce pseudo random numbers.
a pseudo random number is one that is predictable
(although it may appear unpredictable)
⇒ Implementation may deviate from expected theoretical behaviour
The most widely-used technique is called the Linear Congruential Generator (LCG)
it uses a recurrence relation:
X_{n+1} = (a·X_n + c) mod m, where:
m is the "modulus"
a, 0 < a < m is the "multiplier"
c, 0 ≤ c ≤ m is the "increment"
X_0 is the "seed"
if c=0 it is called a multiplicative congruential generator
LCG is not good for applications that need extremely high-quality random numbers
the period length is too short
period length … length of sequence at which point it repeats itself
a short period means the numbers are correlated
Trivial example:
for simplicity assume c=0
so the formula is X_{n+1} = a·X_n mod m
try a=11=X_0, m=31, which generates the sequence:
11, 28, 29, 9, 6, 4, 13, 19, 23, 5, 24, 16, 21, 14, 30, 20, 3, 2, 22, 25,
27, 18, 12, 8, 26, 7, 15, 10, 17, 1, 11, 28, 29, 9, 6, 4, 13, 19, 23, 5,
24, 16, 21, 14, 30, 20, 3, 2, 22, 25, 27, 18, 12, 8, 26, 7, 15, 10, 17, 1,
11, 28, 29, 9, 6, 4, 13, 19, 23, 5, 24, 16, 21, 14, 30, 20, 3, 2, 22, 25, 27,
18, 12, 8, 26, 7, 15, 10, 17, 1, 11, 28, 29, 9, 6, 4, 13, 19, 23, 5, 24, 16,
21, 14, 30, 20, 3, 2, 22, 25, 27, 18, 12, 8, 26, 7, 15, 10, 17, 1, 11, 28,
29, 9, 6, 4, 13, 19, 23, 5, 24, 16, 21, 14, 30, 20, 3, 2, 22, 25, 27, 18,
12, 8, 26, 7, 15, 10, 17, 1, ...
all the integers from 1 to 30 are here
period length = 30
Another trivial example:
again let c=0
try a=12=X_0 and m=30
that is, X_{n+1} = 12·X_n mod 30
which generates the sequence:
12, 24, 18, 6, 12, 24, 18, 6, 12, 24, 18, 6, 12, 24, 18, 6, 12, 24, 18, 6,
12, 24, 18, 6, 12, 24, 18, 6, 12, 24, 18, 6, ...
notice the period length (= 4) … clearly a terrible sequence
It is a complex task to pick good numbers.
A bit of history:
Lewis, Goodman and Miller (1969) suggested
X_{n+1} = 7^5·X_n mod (2^31-1)
note:
7^5 is 16807
2^31-1 is 2147483647
X_0 = 0 is not a good seed value
Most compilers use LCG-based algorithms that are slightly more involved; see www.mscs.dal.ca/~selinger/random/ for details (including a short C program that produces the exact same pseudo-random numbers as gcc for any given seed value)
Two functions are required:
srand(unsigned int seed) // sets its argument as the seed
rand() // uses a LCG technique to generate random
// numbers in the range 0 .. RAND_MAX
where the constant RAND_MAX is defined in stdlib.h
(depends on the computer: on the CSE network, RAND_MAX = 2147483647)
The period length of this random number generator is very large
approximately 16 · ((2^31) - 1)
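The LCG recurrence and the two trivial examples can be reproduced with a few lines of C. This is a minimal sketch: the LCG struct and the functions lcgNext and lcgPeriod are hypothetical names, not part of stdlib.h, and lcgPeriod assumes that the seed lies on the cycle it generates (it returns 0 otherwise).

```c
#include <stdint.h>

typedef struct {
    uint64_t a, c, m;                  /* multiplier, increment, modulus */
    uint64_t x;                        /* current value, initially the seed X_0 */
} LCG;

/* Advance the generator: X_{n+1} = (a*X_n + c) mod m.
   64-bit arithmetic avoids overflow for moduli up to 2^31-1. */
uint64_t lcgNext(LCG *g) {
    g->x = (g->a * g->x + g->c) % g->m;
    return g->x;
}

/* Number of steps until the seed value reappears, or 0 if it does not
   reappear within m steps. */
uint64_t lcgPeriod(uint64_t a, uint64_t c, uint64_t m, uint64_t seed) {
    LCG g = { a, c, m, seed % m };
    uint64_t start = g.x;
    for (uint64_t steps = 1; steps <= m; steps++)
        if (lcgNext(&g) == start) return steps;
    return 0;
}
```

For example, lcgPeriod(11, 0, 31, 11) gives 30 and lcgPeriod(12, 0, 30, 12) gives 4, matching the two sequences listed above; with a = 16807, m = 2147483647 and seed 1 the first two values are 16807 and 282475249.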
To convert the return value of rand() to a number between 0 .. RANGE
compute the remainder after division by RANGE+1
Using the remainder to compute a random number is not the best way:
Exercise #13: Random Numbers
Write a program to simulate 10,000 rounds of Two-up.
Assume a $10 bet at each round
Compute the overall outcome and average per round
Seeding
There is one significant problem:
every time you run a program with the same seed, you get exactly the same sequence of 'random' numbers (why?)
To vary the output, can give the random seeder a starting point that varies with time
an example of such a starting point is the current time, time(NULL)
(NB: this is different from the UNIX command time, used to measure program running time)
#include <time.h>
time(NULL) // returns the time as the number of seconds
// since the Epoch, 1970-01-01 00:00:00 +0000
// time(NULL) on July 31st, 2020, 12:59pm was 1596164340
// time(NULL) about a minute later was 1596164401
Analysis of Randomised Algorithms
Math needed for the analysis of randomised algorithms:
Sample space … Ω = {ω1,…,ωn}
Probability … 0 ≤ P(ωi) ≤ 1
Event … E ⊆ Ω
1. ½
2. ½
3. Yes
4. 1 – ¼·¼ = ¾
5. 2 times
Note that 2 = 1/½ is the infinite sum ½·1 + (1−½)·½·2 + (1−½)²·½·3 + (1−½)³·½·4 + ...
Randomised algorithm to find some element with key k in an unordered array:
findKey(L,k):
| Input array L, key k
| Output some element in L with key k
|
| repeat
| | randomly select e∈L
| until key(e)=k
| return e
Analysis:
p … ratio of elements in L with key k (e.g. p = 1/3)
Probability of success: 1 (if p > 0)
Expected runtime: 1/p
Example: a third of the elements have key k ⇒ expected number of iterations = 3
If we cannot guarantee that the array contains any elements with key k …
findKey(L,k,d):
| Input array L, key k, maximum #attempts d
| Output some element in L with key k
|
| repeat
| | if d=0 then
| | | return failure
| | end if
| | randomly select e∈L
| | d=d-1
| until key(e)=k
| return e
Analysis:
p … ratio of elements in L with key k
d … maximum number of attempts
Probability of success: 1 - (1-p)^d
Expected runtime: ( ∑_{i=1..d−1} i·(1−p)^(i−1)·p ) + d·(1−p)^(d−1)
O(1) if d is a constant
Randomised Quicksort
Divide
pick a pivot element
move all elements smaller than the pivot to its left
move all elements greater than the pivot to its right
Conquer
sort the elements on the left
sort the elements on the right
Non-randomised Quicksort
Divide ...
partition(array,low,high):
| Input array, index range low..high
| Output selects array[low] as pivot element
| moves all smaller elements between low+1..high to its left
| moves all larger elements between low+1..high to its right
| returns new position of pivot element
|
| pivot_item=array[low], left=low+1, right=high
| repeat
| | right = find index of rightmost element <= pivot_item
| | left = find index of leftmost element > pivot_item // left=right if none
| | if left<right then
| | | swap array[left] with array[right]
| | end if
| until left≥right
| if low<right then
| | swap array[low] with array[right] // right is final position for pivot
| end if
| return right
... and Conquer!
Quicksort(array,low,high):
| Input array, index range low..high
| Output array[low..high] sorted
|
| if high > low then // termination condition low >= high
| | pivot = partition(array,low,high)
| | Quicksort(array,low,pivot-1)
| | Quicksort(array,pivot+1,high)
| end if
3 6 5 2 4 1 // swap a[left=1] and a[right=5]
3 1 5 2 4 6 // swap a[left=2] and a[right=3]
3 1 2 5 4 6 // swap pivot and a[right=2]
...
1 | 2 | 3 | 4 | 5 | 6
Randomised Quicksort
partition(array,low,high):
| Input array, index range low..high
| Output randomly select a pivot element from array[low..high]
| moves all smaller elements between low..high to its left
| moves all larger elements between low..high to its right
| returns new position of pivot element
|
| randomly select pivot_index∈[low..high]
| pivot_item=array[pivot_index], swap array[low] with array[pivot_index]
| left=low+1, right=high
| repeat
| | right = find index of rightmost element <= pivot_item
| | left = find index of leftmost element > pivot_item // left=right if none
| | if left<right then
| | | swap array[left] with array[right]
| | end if
| until left≥right
| if low<right then
| | swap array[low] with array[right] // right is final position for pivot
| end if
| return right
n … size of array
From probability theory we know that the expected number of coin tosses required in order to get k heads is 2·k
For a recursive call at depth d we expect
d/2 ancestors are good calls
⇒ size of input sequence for current call is ≤ (¾)^(d/2) · n
Therefore,
the input of a recursive call at depth 2·log_{4/3} n has expected size 1
⇒ the expected recursion depth thus is O(log n)
The total amount of work done at all the nodes of the same depth is O(n)
Hence the expected runtime is O(n·log n)
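The randomised partition and the recursive Quicksort above translate into C roughly as follows. This is a sketch under stated assumptions: the array holds int values, the pivot index is chosen with rand() and the remainder operator (which, as noted earlier, is not the best way to obtain a uniform value), and the function names are illustrative. Replacing the random choice by pivot_index = low gives the non-randomised variant.

```c
#include <stdlib.h>

void swapInts(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Partition array[low..high] (requires low < high) around a randomly
   selected pivot and return the final position of the pivot. */
int partition(int array[], int low, int high) {
    int pivot_index = low + rand() % (high - low + 1);
    swapInts(&array[low], &array[pivot_index]);
    int pivot_item = array[low];
    int left = low + 1, right = high;
    for (;;) {
        /* rightmost element <= pivot; stops at low at the latest */
        while (array[right] > pivot_item) right--;
        /* leftmost element > pivot; left = right if there is none */
        while (left < right && array[left] <= pivot_item) left++;
        if (left >= right) break;
        swapInts(&array[left], &array[right]);
    }
    swapInts(&array[low], &array[right]);  /* right is final position for pivot */
    return right;
}

/* Sort array[low..high] in ascending order. */
void quicksort(int array[], int low, int high) {
    if (high > low) {                  /* termination condition low >= high */
        int pivot = partition(array, low, high);
        quicksort(array, low, pivot - 1);
        quicksort(array, pivot + 1, high);
    }
}
```

Calling quicksort(a, 0, 5) on the array 3 6 5 2 4 1 from the trace leaves it as 1 2 3 4 5 6; the intermediate swaps depend on the pivots that rand() happens to select.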
Minimum Cut Problem
Given:
Example:
Contraction
Randomised algorithm for graph contraction = repeated edge contraction until 2 vertices remain
contract(G):
| Input graph G = (V,E) with |V|≥2 vertices
| Output cut of G
|
| while |V|>2 do
| | randomly select e∈E
| | contract edge e in G
| end while
| return the only cut in G
Analysis:
V … number of vertices
Karger's Algorithm
Idea: Repeat random graph contraction several times and take the best cut found
MinCut(G):
| Input graph G with V≥2 vertices
| Output smallest cut found
|
| min_weight=∞, d=0
| repeat
| | cut=contract(G)
| | if weight(cut)<min_weight then
| | | min_cut=cut, min_weight=weight(cut)
| | end if
| | d=d+1
| until d > binomial(V,2)·ln V
| return min_cut
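Random contraction and its repetition can be sketched in C as below. Several simplifications are assumptions of this sketch and not taken from the slides: the graph is an unweighted, connected multigraph given as an edge list, contraction is simulated with a union-find structure (picking a random original edge and skipping it if both endpoints are already merged), only the weight of the cut is returned, and the number of repetitions is passed in by the caller instead of being computed as binomial(V,2)·ln V. All names are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>

typedef struct { int v, w; } Edge;     /* hypothetical edge representation */

/* Union-find lookup with path halving. */
int findRoot(int parent[], int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* One run of random contraction on a connected graph with nV >= 2
   vertices; returns the number of edges crossing the resulting cut,
   or -1 if memory allocation fails. */
int contractOnce(const Edge edges[], int nE, int nV) {
    int *parent = malloc(nV * sizeof(int));
    if (parent == NULL) return -1;
    for (int i = 0; i < nV; i++)
        parent[i] = i;
    int remaining = nV;                /* number of (super-)vertices left */
    while (remaining > 2) {
        Edge e = edges[rand() % nE];   /* randomly select an edge */
        int rv = findRoot(parent, e.v), rw = findRoot(parent, e.w);
        if (rv != rw) {                /* skip edges inside one super-vertex */
            parent[rv] = rw;           /* contract the edge */
            remaining--;
        }
    }
    int cut = 0;
    for (int i = 0; i < nE; i++)
        if (findRoot(parent, edges[i].v) != findRoot(parent, edges[i].w))
            cut++;
    free(parent);
    return cut;
}

/* Repeat the contraction and keep the smallest cut weight found. */
int minCutWeight(const Edge edges[], int nE, int nV, int repetitions) {
    int min_weight = INT_MAX;
    for (int d = 0; d < repetitions; d++) {
        int weight = contractOnce(edges, nE, nV);
        if (weight >= 0 && weight < min_weight)
            min_weight = weight;
    }
    return min_weight;
}
```

Every run returns the weight of some cut, so the result is never smaller than the minimum cut; repeating the contraction makes it very likely that the minimum itself is found.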
Analysis:
Probability of success: ≥ 1 − 1/V
probability of not finding a minimum cut when the contraction algorithm is repeated d = binomial(V,2)·ln V times:
≤ [1 − 1/binomial(V,2)]^d ≤ 1/e^(ln V) = 1/V
Total running time: O(E·d) = O(E·V²·log V)
assuming graph contraction implemented in O(E)
Sidetrack: Maxflow and Mincut
What is the weight of the cut {Fairfield,Parramatta,Auburn}, {Ryde,Homebush,Rozelle}?
= ω(S,T) = 4
Simulation
In some problem scenarios
Example: Area inside a Curve
Scenario:
Summary
Alphabets and words
Pattern matching
Boyer-Moore, Knuth-Morris-Pratt
Tries
Text compression
Huffman code
Approximation
numerical problems
vertex cover
Analysis of randomised algorithms
probability of success
expected runtime
Randomised Quicksort
Karger's algorithm
Simulation
Suggested reading:
tries … Sedgewick, Ch. 15.2
approximation … Moffat, Ch. 9.4
randomisation … Moffat, Ch. 9.3, 9.5
Data Breaches
Major incidents …
More severe, recent incidents in Australia …
Details taken included names, DOBs, street addresses, driving licence numbers, passport numbers
⇒ Customers vulnerable to financial crimes
⇒ Up to 100,000 new passports had to be issued
⇒ Optus put aside $140 million for costs related to the breach
Medibank cyberattack (Oct 2022)
ABC News, 26/10/20 …
Details of 9 million customers taken, including names, DOBs, street addresses, medical diagnoses and procedures
Also passport numbers and visa details for international students stolen
individuals must be notified promptly
Australian Information Commissioner must also be notified
take action to prevent future breaches
Data (Mis-)use
In 2012 several newspapers reported that …
Target used data analysis to predict whether female customers are likely pregnant
Target then sent coupons by mail
A Minneapolis man thus found out about the pregnancy of his teenage daughter
Costly Software Errors
Toyota vehicle recall (2009-11)
a deficiency in the electronic throttle control system:
stack overflow
⇒ stack grew out of boundary, overwrote other data
EFTPOS terminals inoperable for several days in early 2010
customers' cards rejected as expired
Cause of failure:
one module interpreted the current year as hexadecimal
0x09 = 09
0x10 = 16 (≠ 10)
Sidetrack: Year 2038 Problem
Recall:
time(NULL) on 19 January 2038 at 03:14:07 (UTC) will be 2147483647 = 0x7FFFFFFF
a second later it will be 0x80000000 = -2,147,483,648
⇒ -2^31 seconds since 01/01/1970 ("Epoch") is 13 December 1901 …
Software engineers shall ensure that their products meet the highest professional standards possible
Strive to fully understand the specifications for software
Ensure that specifications have been well documented and satisfy the users' requirements
Ensure adequate testing, debugging, and review of software and related documents
passenger jet V9 2937 and cargo jet QY 611 on collision course at 36,000 feet
ground air traffic controller instructed V9 pilot to descend
seconds later, the automatic Traffic Collision Avoidance System (TCAS)
instructed V9 2937 to climb
instructed QY 611 to descend
flight 611's pilot followed TCAS, flight 2937's pilot ignored TCAS
all 71 people on board the two planes killed
⇒ Collision would not have occurred had both pilots followed TCAS
Exercise #18: Collision Avoidance Algorithm
The TCAS …
builds 3D map of aircraft in the airspace
determines if collision threat occurs
automatically negotiates mutual avoidance manoeuvre
gives synthesised voice instructions to pilots ("climb, climb")
What algorithm would you use for reaching an agreement (climb vs. descent)?
Moral Dilemmas
How to program an autonomous car …
for a potential crash scenario
when you have to choose between two actions that are both harmful
This is a modern version of the Trolley Problem …
A runaway trolley is on course to kill five people
You stand next to a lever that controls a switch
If the trolley is diverted, it will kill one person on the side track
Course Review
Goal:
For you to become competent Computer Scientists able to:
choose/develop effective data structures
choose/develop algorithms on these data structures
analyse performance characteristics of algorithms (time/space complexity)
package a set of data structures+algorithms as an abstract data type
represent data structures and implement algorithms in C
Lectures, problem sets and assignments have built you up to this point.
Assessment Summary
lab = mark for programs/quizzes (out of 8+8)
midterm = mark for mid-term test (out of 12)
assn = mark for large assignment (out of 12)
exam = mark for final exam (out of 60)
if (exam >= 25)
total = lab + midterm + assn + exam
else
total = exam * (100/60)
To pass the course, you must achieve:
at least 50/100 for total
which implies that you must achieve at least 25/60 for exam
Final Exam
2-hour exam on Monday, 5 February
CSE Computer Labs, your Time/Lab/Seat emailed to you
2 hours, reading time starts 10 minutes before beginning of exam, be there early
7 multiple-choice questions, 4 open questions
Covers all of the contents of this course
Each multiple-choice question is worth 4 marks (7 × 4 = 28)
Each open question is worth 8 marks (4 × 8 = 32)
Closed book, but you can bring one A4-sized sheet of your own handwritten notes
Bring student ID card, your zPass, ballpoint pens, your A4-sheet
Sample prac exam available on Moodle
4 multiple-choice questions, 2 open questions
maximum time: 60 minutes
sample solutions provided upon completion
Re-read lecture slides and example programs
Read the corresponding chapters in the recommended textbooks
Review/solve problem sets
Attempt prac exam questions on Moodle
Invent your own variations of the weekly exercises (problem solving is a skill that improves with practice)
Supplementary Exam
you are making a statement that you are "fit and healthy enough"
it is your only chance to pass (i.e. no second chances)
Assessment
Assessment is about determining how well you understand the syllabus of this course.
If you can't demonstrate your understanding, you don't pass.
please, please, … I'll be excluded if I fail COMP9024
please, please, … this is my final course to graduate
etc. etc. etc.
Failure is a fact of life. For example, my scientific papers or project proposals get rejected sometimes too.
Summing Up …
Finally …
"Time for you to leave."
Produced: 27 Jan 2024