KMP Algorithm
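A string-matching algorithm finds the starting index m in string S[] at which the
search word W[] occurs. The straightforward approach tries each position m in turn
and compares W[] against S[] character by character from there (a trial check).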
Usually, the trial check will quickly reject the trial match. If the strings are
uniformly distributed random letters, then the chance that two characters match is
1 in 26. In most cases, the trial check will reject the match at the initial letter.
The chance that the first two letters both match is 1 in 26^2 (1 in 676). So if the
characters are random, then the expected complexity of searching string S[] of
length n is on the order of n comparisons, or Θ(n). The expected performance is very
good. If S[] is 1 million characters and W[] is 1000 characters, then the string
search should complete after about 1.04 million character comparisons.
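The 1.04 million figure can be checked with a short calculation. The sketch below is
only illustrative: it assumes independent, uniformly random letters, ignores the
rare case of a complete match ending the search early, and uses a hypothetical
helper name, expected_comparisons, that does not come from the text.

```python
from fractions import Fraction


def expected_comparisons(n, k, alphabet=26):
    """Expected character comparisons for the straightforward search.

    Assumes every character of S and W is an independent, uniformly
    random letter (the assumption made in the text).
    """
    p = Fraction(1, alphabet)  # chance that two random letters match
    # A trial always makes one comparison, and makes a j-th comparison
    # only if the previous j-1 letters matched: 1 + p + p^2 + ... + p^(k-1).
    per_position = (1 - p**k) / (1 - p)
    positions = n - k + 1      # number of trial positions m
    return per_position, per_position * positions


per_position, total = expected_comparisons(1_000_000, 1000)
print(float(per_position))  # close to 26/25 = 1.04 comparisons per position
print(float(total))         # a little over one million comparisons
```

The per-position cost is a geometric series whose sum approaches 26/25, which is
where the factor of 1.04 comes from.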
That expected performance is not guaranteed. If the strings are not random, then
checking a trial m may take many character comparisons. The worst case is if the
two strings match in all but the last letter. Imagine that the string S[] consists
of 1 million characters that are all A, and that the word W[] is 999 A characters
terminating in a final B character. The simple string-matching algorithm will now
examine 1000 characters at each trial position before rejecting the match and
advancing the trial position. The search would then take about 1000 character
comparisons at each of roughly 1 million positions, or about 1 billion character
comparisons in total. If the length of W[] is k, then the worst-case performance is
O(k⋅n).
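The worst case can be reproduced on a smaller scale. The following is a minimal
sketch of the straightforward algorithm, not a reference implementation; the
function name naive_search and the comparison counter are assumptions added for
illustration, and the input is scaled down (2000 A characters, a pattern of 99 A
characters followed by B) so that it runs quickly.

```python
def naive_search(S, W):
    """Return (index of first match or -1, number of character comparisons)."""
    n, k = len(S), len(W)
    comparisons = 0
    for m in range(n - k + 1):   # m: trial starting position in S
        i = 0                    # i: index of the character being tested in W
        while i < k:
            comparisons += 1
            if S[m + i] != W[i]:
                break            # trial rejected, move on to the next m
            i += 1
        if i == k:
            return m, comparisons
    return -1, comparisons


# Scaled-down version of the example in the text.
S = "A" * 2000
W = "A" * 99 + "B"
index, comparisons = naive_search(S, W)
print(index, comparisons)  # -1 190100  (1901 positions times 100 comparisons)
```

Every one of the n - k + 1 trial positions costs the full k comparisons here, which
is the O(k⋅n) behaviour described above.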
The KMP algorithm has a better worst-case performance than the straightforward
algorithm. KMP spends a little time precomputing a table (on the order of the size
of W[], O(k)), and then it uses that table to do an efficient search of the string
in O(n).
The difference is that KMP makes use of previous match information that the
straightforward algorithm does not. In the example above, when KMP sees a trial
match fail on the 1000th character (i = 999) because S[m+999] ≠ W[999], it will
increment m by 1, but it will know that the first 998 characters at the new
position already match. KMP matched 999 A characters before discovering a mismatch
at the 1000th character (position 999). Advancing the trial match position m by one
throws away the first A, so KMP knows there are 998 A characters that match W[] and
does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in
the precomputed table and two state variables. When KMP discovers a mismatch, the
table determines how far KMP advances the trial position (variable m) and where it
resumes testing (variable i).
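The interplay of the table with m and i can be sketched as follows. This is one
common formulation under stated assumptions, not necessarily the exact variant the
text has in mind: the names build_table, kmp_search and T are illustrative, T[i]
holds the length of the longest proper prefix of W that is also a suffix of the
first i characters of W, and T[0] = -1 marks "no fallback". The comparison counter
is added only to show the cost.

```python
def build_table(W):
    """Precompute the fallback table in O(k) time for a non-empty W."""
    k = len(W)
    T = [-1] + [0] * (k - 1)
    cnd = 0                          # length of the current candidate prefix
    for pos in range(2, k):
        while cnd > 0 and W[pos - 1] != W[cnd]:
            cnd = T[cnd]             # fall back to a shorter prefix
        if W[pos - 1] == W[cnd]:
            cnd += 1
        T[pos] = cnd
    return T


def kmp_search(S, W):
    """Return (index of first match or -1, number of character comparisons)."""
    n, k = len(S), len(W)
    if k == 0:
        return 0, 0
    T = build_table(W)
    m = 0                            # start of the current trial match in S
    i = 0                            # index of the current character in W
    comparisons = 0
    while m + i < n:
        comparisons += 1
        if S[m + i] == W[i]:
            i += 1
            if i == k:
                return m, comparisons
        elif T[i] > -1:
            m += i - T[i]            # advance the trial position
            i = T[i]                 # resume after the characters known to match
        else:
            m += 1
            i = 0
    return -1, comparisons


# In the text's example, a mismatch at i = 999 resumes testing at i = 998.
print(build_table("A" * 999 + "B")[999])   # 998

# Scaled-down worst case: the count stays below 2 * len(S).
index, comparisons = kmp_search("A" * 2000, "A" * 99 + "B")
print(index, comparisons <= 2 * 2000)      # -1 True
```

Each pass through the loop either increases m + i or increases m, so the search
makes at most about 2n comparisons, in line with the O(n) bound given earlier.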