Algorithms and Data Structures Part 5 String Matching

String matching

Uploaded by

Kiran Ganivada


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/259398205

Lecture Notes - Algorithms and Data Structures - Part 5: String Matching

Book · December 2013

DOI: 10.13140/2.1.1062.0486


3 authors, including:

Reiner Creutzburg
Brandenburg University of Applied Sciences

All content following this page was uploaded by Reiner Creutzburg on 13 May 2015.


Algorithms and
Data Structures
Part 5: String Matching
(Wikipedia Book 2014)

By Wikipedians

Editors: Reiner Creutzburg, Jenny Knackmuß

Contents

1 String Matching
1.1 String (computer science)
1.1.1 Formal theory
1.1.2 String datatypes
1.1.3 Text file strings
1.1.4 Non-text strings
1.1.5 String processing algorithms
1.1.6 Character string-oriented languages and utilities
1.1.7 Character string functions
1.1.8 String buffers
1.1.9 See also
1.1.10 References
1.2 String searching algorithm
1.2.1 Basic classification
1.2.2 Other classification
1.2.3 See also
1.2.4 Academic conferences on text searching
1.2.5 References
1.2.6 External links
1.3 Knuth–Morris–Pratt algorithm
1.3.1 Background
1.3.2 KMP algorithm
1.3.3 “Partial match” table (also known as “failure function”)
1.3.4 Efficiency of the KMP algorithm
1.3.5 Variants
1.3.6 See also
1.3.7 References
1.3.8 External links
1.4 Boyer–Moore string search algorithm
1.4.1 Definitions
1.4.2 Description
1.4.3 Shift Rules
1.4.4 The Galil Rule
1.4.5 Performance
1.4.6 Implementations
1.4.7 Variants
1.4.8 See also
1.4.9 References
1.4.10 External links

2 Text and image sources, contributors, and licenses
2.1 Text
2.2 Images
2.3 Content license

Chapter 1

String Matching

1.1 String (computer science)

This article is about the data type. For other uses, see String (disambiguation).
In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. A string may also denote more general arrays or other sequence (or list) data types and structures.
Depending on the programming language and the precise data type used, a variable declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to allow it to hold a variable number of elements.
When a string appears literally in source code, it is known as a string literal or an anonymous string.[1]
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set called an alphabet.

[Figure: Strings are applied e.g. in bioinformatics to describe DNA strands composed of nitrogenous bases.]

1.1.1 Formal theory

Concatenation and substrings

Concatenation is an important binary operation on Σ*, the set of all finite strings over the alphabet Σ. For any two strings s and t in Σ*, their concatenation is defined as the sequence of symbols in s followed by the sequence of symbols in t, and is denoted st. For example, if Σ = {a, b, ..., z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.
String concatenation is an associative, but non-commutative operation. The empty string ε serves as the identity element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers (that is, a function L : Σ* → ℕ ∪ {0} such that L(st) = L(s) + L(t) for all s, t ∈ Σ*).
A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation “is a substring of” defines a partial order on Σ*, the least element of which is the empty string.
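The properties above can be checked on concrete values. The following is a minimal Python sketch, not part of the original text; the helper name is_substring is hypothetical and simply restates the definition t = usv.

```python
def is_substring(s: str, t: str) -> bool:
    """Return True if t = u + s + v for some (possibly empty) strings u and v."""
    # Try every position in t where a copy of s could start.
    return any(t[i:i + len(s)] == s for i in range(len(t) - len(s) + 1))


s, t = "bear", "hug"

# Concatenation is associative but not commutative.
associative = (s + t) + "s" == s + (t + "s")
commutative = s + t == t + s

# The empty string is the identity element.
identity = "" + s == s + "" == s

# Length is a monoid homomorphism: L(st) = L(s) + L(t).
homomorphism = len(s + t) == len(s) + len(t)
```

With these values, s + t is "bearhug" and t + s is "hugbear", so commutative is False while the other three checks hold.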

Prefixes and suffixes

A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper
prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is nonempty,
s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t. Both the relations “is a prefix of” and “is a
suffix of” are prefix orders.
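As an illustration of these definitions, the sketch below enumerates all prefixes and suffixes of a string in Python. The function names are hypothetical. Note that under the definition above the empty string counts as a proper prefix of any nonempty string.

```python
def prefixes(t: str) -> list:
    """All prefixes of t, from the empty string up to t itself."""
    return [t[:i] for i in range(len(t) + 1)]


def suffixes(t: str) -> list:
    """All suffixes of t, from t itself down to the empty string."""
    return [t[i:] for i in range(len(t) + 1)]


def is_proper_prefix(s: str, t: str) -> bool:
    """s is a proper prefix of t if t = s + u with u nonempty."""
    return t.startswith(s) and s != t
```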

Rotations

A string s = uv is said to be a rotation of t if t = vu. For example, if Σ = {0, 1} the string 0011001 is a rotation of
0100110, where u = 00110 and v = 01.
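One common way to test for a rotation relies on the observation that if s = uv and t = vu, then tt = vuvu contains s. The sketch below assumes this approach; the function name is_rotation is hypothetical.

```python
def is_rotation(s: str, t: str) -> bool:
    """Return True if s = u + v and t = v + u for some strings u and v."""
    # Equal length is required; then s must appear inside t written twice.
    return len(s) == len(t) and s in t + t
```

For the example above, is_rotation("0011001", "0100110") holds because "0011001" occurs in "01001100100110".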

Reversal

The reverse of a string is a string with the same symbols but in reverse order. For example, if s = abc (where a, b,
and c are symbols of the alphabet), then the reverse of s is cba. A string that is the reverse of itself (e.g., s = madam)
is called a palindrome, which also includes the empty string and all strings of length 1.

Lexicographical ordering

It is often useful to define an ordering on a set of strings. If the alphabet Σ has a total order (cf. alphabetical order)
one can define a total order on Σ* called lexicographical order. For example, if Σ = {0, 1} and 0 < 1, then the
lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 < ... < 0001 < 001 < 01 < 010 < 011 < 0110
< 01111 < 1 < 10 < 100 < 101 < 111 < 1111 < 11111 ... The lexicographical order is total if the alphabetical order
is, but isn't well-founded for any nontrivial alphabet, even if the alphabetical order is.
See Shortlex for an alternative string ordering that preserves well-foundedness.
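The difference between the two orderings can be seen on a small sample. This sketch assumes Python's built-in string comparison, which is lexicographic by code point (so "0" < "1"), and models shortlex with a (length, string) sort key.

```python
words = ["1", "01", "0", "", "10", "001", "00"]

# Lexicographical order: a prefix sorts before its extensions.
lex = sorted(words)

# Shortlex order: shorter strings first, ties broken lexicographically.
shortlex = sorted(words, key=lambda w: (len(w), w))

# Lexicographical order is not well-founded: 1 > 01 > 001 > 0001 > ...
# is the start of an infinite strictly descending chain.
chain = ["0" * k + "1" for k in range(5)]
descending = all(chain[k] > chain[k + 1] for k in range(len(chain) - 1))
```

Under shortlex every string has only finitely many predecessors, which is why no such infinite descending chain exists there.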

String operations

A number of additional operations on strings commonly occur in the formal theory. These are given in the article on
string operations.

Topology

[Figure: (Hyper)cube of binary strings of length 3, with nodes 000, 001, 010, 011, 100, 101, 110, 111]

Strings admit the following interpretation as nodes on a graph:

• Fixed-length strings can be viewed as nodes on a hypercube
• Variable-length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of
symbols in Σ
• Infinite strings (otherwise not considered here) can be viewed as infinite paths on the k-ary tree.

The natural topology on the set of fixed-length strings or variable length strings is the discrete topology, but the natural
topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse limit of
the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the Cantor
set, and yields the same topology.
Isomorphisms between string representations of topologies can be found by normalizing according to the lexicographically
minimal string rotation.

1.1.2 String datatypes

See also: Comparison of programming languages (string functions)

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful
datatype that they are implemented in nearly every programming language. In some languages they are available as
primitive types and in others as composite types. The syntax of most high-level programming languages allows for a
string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal
or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often
constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings, which
have a fixed maximum length to be determined at compile time and which use the same amount of memory whether
this maximum is needed or not, and variable-length strings, whose length is not arbitrarily fixed and which can use
varying amounts of memory depending on the actual requirements at run time. Most strings in modern programming
languages are variable-length strings. Of course, even variable-length strings are limited in length – theoretically by
the number of bits available to a pointer, practically by the current size of memory. The string length can be stored
as a separate integer (which may put an artificial limit on the length) or implicitly through a termination character,
usually a character value with all bits zero. See also “Null-terminated” below.

Character encoding

String datatypes have historically allocated one byte per character, and, although the exact character set varied by
region, character encodings were similar enough that programmers could often get away with ignoring this, since
characters a program treated specially (such as period and space and comma) were in the same place in all the
encodings a program would encounter. These character sets were typically based on ASCII or EBCDIC.
Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256
characters (the limit of an encoding that uses one 8-bit byte per character) for reasonable representation. The normal solutions
involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use
of these with existing code led to problems with matching and cutting of strings, the severity of which depended on
how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the
ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters
as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching
on byte codes unsafe. These encodings also were not “self-synchronizing”, so that locating character boundaries
required backing up to the start of a string, and pasting two strings together could result in corruption of the second
string (these problems were much less with EUC as any ASCII character did synchronize the encoding).
Unicode has simplified the picture somewhat. Most programming languages now have a datatype for Unicode strings. Unicode’s preferred byte stream format UTF-8 is designed not to have the problems described above for older multibyte encodings. UTF-8, UTF-16 and UTF-32 require the programmer to know that the fixed-size code units are different from the “characters”. The main difficulty currently is incorrectly designed APIs that attempt to hide this difference (UTF-32 does make code points fixed-sized, but these are not “characters” due to composing codes).
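To make the distinction between code units, code points and user-visible characters concrete, the following sketch counts each for two sample strings. It assumes Python 3, where len() on a str counts code points; the helper name unit_counts and the sample strings are illustrative choices.

```python
def unit_counts(s: str) -> dict:
    """Count code points and code units of s in three Unicode encodings."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),            # 8-bit code units
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32_units": len(s.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One visible character built from two code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
accented = "e\u0301"

# One code point outside the Basic Multilingual Plane (U+1F600).
emoji = "\U0001F600"
```

The accented letter is a single character to a reader but two code points even in UTF-32, while the emoji is a single code point that needs four UTF-8 bytes or two UTF-16 code units.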

Implementations

Some languages like C++ implement strings as templates that can be used with any datatype, but this is the exception,
not the rule.
Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been
created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new
string must be created if any alteration is to be made; these are termed immutable strings.
Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to individual
units or substrings, including characters when they have a fixed length. A few languages such as Haskell implement
them as linked lists instead.
Some languages, such as Prolog and Erlang, avoid implementing a dedicated string datatype at all, instead adopting
the convention of representing strings as lists of character codes.

Representations

Representations of strings depend heavily on the choice of character repertoire and the method of character encoding.
Older string implementations were designed to work with repertoire and encoding defined by ASCII, or more recent
extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode
along with a variety of complex encodings such as UTF-8 and UTF-16.
The term bytestring usually indicates a general-purpose string of bytes, rather than strings of only (readable) characters,
strings of bits, or such. Byte strings often imply that bytes can take any value and any data can be stored as-is, meaning
that there should be no value interpreted as a termination value.
Most string implementations are very similar to variable-length arrays with the entries storing the character codes of
corresponding characters. The principal difference is that, with certain encodings, a single logical character may take
up more than one entry in the array. This happens for example with UTF-8, where single codes (UCS code points)
can take anywhere from one to four bytes, and single characters can take an arbitrary number of codes. In these cases,
the logical length of the string (number of characters) differs from the logical length of the array (number of bytes in
use). UTF-32 avoids the first part of the problem.

Null-terminated

Main article: Null-terminated string

The length of a string can be stored implicitly by using a special terminating character; often this is the null character (NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language.[4] Hence, this representation is commonly referred to as a C string. This representation of an n-character string takes n + 1 space (1 for the terminator), and is thus an implicit data structure.
In terminated strings, the terminating code is not an allowable character in any string. Strings with a length field do not have this limitation and can also store arbitrary binary data. In C, two things are needed to handle binary data: a character pointer and the length of the data.
An example of a null-terminated string stored in a 10-byte buffer, along with its ASCII (or more modern UTF-8) representation as 8-bit hexadecimal numbers, is:

F   R   A   N   K   NUL  ?   ?   ?   ?
46  52  41  4E  4B  00   ??  ??  ??  ??

The length of the string in the above example, “FRANK”, is 5 characters, but it occupies 6 bytes. Characters after the terminator (shown as ? above) do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)
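The layout can be imitated with a Python bytes object. This is only a sketch of the convention, not how C itself stores strings in memory; the function name c_string and the four trailing bytes are placeholders chosen for illustration.

```python
def c_string(buf: bytes) -> bytes:
    """Return the bytes before the first NUL terminator in buf."""
    end = buf.find(b"\x00")
    if end == -1:
        # Without a terminator the string would run past the buffer.
        raise ValueError("no terminator in buffer")
    return buf[:end]


# A 10-byte buffer: "FRANK", the terminator, then 4 arbitrary leftover bytes.
buf = b"FRANK\x00" + b"junk"
text = c_string(buf)
```

The explicit check for a missing terminator corresponds to the bounds checking that C library functions such as strlen do not perform on their own.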

Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value (a convention used in many Pascal dialects); as a consequence, some people call it a Pascal string or P-string. Storing the string length as a byte limits the maximum string length to 255. To avoid such limitations, improved implementations of P-strings use 16-, 32-, or 64-bit words to store the string length. When the length field covers the address space, strings are limited only by the available memory. Encoding the length n takes log(n) space (see fixed-length code), so length-prefixed strings are a succinct data structure, encoding a string of length n in log(n) + n space. However, if the length is bounded, then the length can be encoded in constant space, typically a machine word, and thus is an implicit data structure, taking n + k space, where k is the number of characters in a word (8 for 8-bit ASCII on a 64-bit machine, 1 for 32-bit UTF-32/UCS-4 on a 32-bit machine, etc.).
Here is the equivalent Pascal string stored in a 10-byte buffer, along with its ASCII / UTF-8 representation:

length  F   R   A   N   K   ?   ?   ?   ?
05      46  52  41  4E  4B  ??  ??  ??  ??
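A one-byte length prefix can be sketched in Python as follows. The function names pack_pstring and unpack_pstring are hypothetical, and the trailing bytes in the usage below are placeholders for whatever happens to follow the string in the buffer.

```python
def pack_pstring(text: bytes) -> bytes:
    """Prefix text with its length stored in a single byte."""
    if len(text) > 255:
        raise ValueError("a one-byte length field holds at most 255")
    return bytes([len(text)]) + text


def unpack_pstring(buf: bytes) -> bytes:
    """Read a length-prefixed string from the start of buf."""
    if not buf:
        raise ValueError("empty buffer")
    n = buf[0]
    # Bounds check: a corrupted length byte must not read past the buffer.
    if n > len(buf) - 1:
        raise ValueError("length byte exceeds buffer size")
    return buf[1:1 + n]
```

Unlike the terminated form, this representation can carry NUL bytes: unpack_pstring(pack_pstring(b"a\x00b")) returns all three bytes.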

Strings as records

Many languages, including object-oriented ones, implement strings as records in a structure like:
class string { unsigned int length; char *text; };

This implementation is usually hidden, however, and accessed through member functions. The “text” will be a dynamically allocated memory area that might be expanded if needed. See also string (C++).

Linked-list

Both character termination and length codes limit strings. For example, C character arrays that contain null (NUL) characters cannot be handled directly by C string library functions, and strings using a length code are limited to the maximum value of the length code.
Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition not standard.
Rough equivalents of the C termination method have historically appeared in both hardware and software. For example, “data processing” machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This meant that, while the IBM 1401 had a seven-bit word in “reality”, almost no one ever thought to use this as a feature, and override the assignment of the seventh bit to (for example) handle ASCII codes.
It is possible to create data structures and functions that manipulate them that do not have the problems associated with character termination and can in principle overcome length code bounds. It is also possible to optimize the string represented using techniques from run-length encoding (replacing repeated characters by the character value and a length) and Hamming encoding.
While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations, more efficient.
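Run-length encoding as described above, replacing each run of a repeated character by the character and a count, can be sketched in a few lines of Python. The function names and the list-of-pairs output format are illustrative assumptions; real formats pack the pairs more compactly.

```python
from itertools import groupby


def rle_encode(s: str) -> list:
    """Replace each run of equal characters by a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def rle_decode(pairs: list) -> str:
    """Expand (character, length) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)
```

The encoding only saves space when the input actually contains long runs; for text without repeats every character becomes a pair of length 1.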

Security concerns

The differing memory layout and storage requirements of strings can affect the security of the program accessing
the string data. String representations requiring a terminating character are commonly susceptible to buffer overflow
problems if the terminating character is not present, caused by a coding error or an attacker deliberately altering the
data. String representations adopting a separate length field are also susceptible if the length can be manipulated. In
such cases, program code accessing the string data requires bounds checking to ensure that it does not inadvertently
access or change data outside of the string memory limits.
String data is frequently obtained from user-input to a program. As such, it is the responsibility of the program to
validate the string to ensure that it represents the expected format. Performing limited or no validation of user-input
can cause a program to be vulnerable to code injection attacks.

1.1.3 Text file strings

In computer readable text files, for example programming language source files or configuration files, strings can be
represented. The NUL byte is normally not used as terminator since that does not correspond to the ASCII text
standard, and the length is usually not stored, since the file should be human editable without bugs.
Two common representations are:

• Surrounded by quotation marks (ASCII 22 hexadecimal), used by most programming languages. To be able to include
quotation marks, newline characters etc., escape sequences are often available, usually using the backslash
character (ASCII 5C hexadecimal).
• Terminated by a newline sequence, for example in Windows INI files.
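The first representation can be illustrated with a small reader for quoted strings. This is a sketch under stated assumptions: the function name unquote and the set of four escape sequences are chosen for illustration and do not correspond to any particular language's full syntax.

```python
# Hypothetical, deliberately small escape table.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unquote(literal: str) -> str:
    """Decode a string surrounded by quotation marks with backslash escapes."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a quoted string")
    out = []
    i, end = 1, len(literal) - 1
    while i < end:
        ch = literal[i]
        if ch == "\\":
            # The character after the backslash selects the escape.
            i += 1
            if i >= end or literal[i] not in ESCAPES:
                raise ValueError("bad escape sequence")
            out.append(ESCAPES[literal[i]])
        elif ch == '"':
            raise ValueError("unescaped quotation mark inside string")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

The quotation mark and backslash handled here are the characters with codes 0x22 and 0x5C mentioned in the list above.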

1.1.4 Non-text strings

While character strings are very common uses of strings, a string in computer science may refer generically to any
sequence of homogeneously typed data. A string of bits or bytes, for example, may be used to represent non-textual
binary data retrieved from a communications medium. This data may or may not be represented by a string-specific
datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the
programming language being used. If the programming language’s string implementation is not 8-bit clean, data
corruption may ensue.

1.1.5 String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include:

• String searching algorithms for finding a given substring or pattern

• String manipulation algorithms

• Sorting algorithms

• Regular expression algorithms

• Parsing a string

• Sequence mining

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite
state machines.
The name stringology was coined in 1984 by computer scientist Zvi Galil for the issue of algorithms and data
structures used for string processing.[5]

1.1.6 Character string-oriented languages and utilities

Character strings are such a useful datatype that several languages have been designed in order to make string
processing applications easy to write. Examples include the following languages:

• awk

• Icon

• MUMPS

• Perl

• Rexx

• Ruby

• sed

• SNOBOL

• Tcl

• TTM

Many Unix utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings.
Some APIs like Multimedia Control Interface, embedded SQL or printf use strings to hold commands that will be interpreted.
Recent scripting programming languages, including Perl, Python, Ruby, and Tcl employ regular expressions to facilitate text operations. Perl is particularly noted for its regular expression use,[6] and many other languages and applications implement Perl compatible regular expressions.
Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be evaluated and included in string literals.

1.1.7 Character string functions

1.1.8 String buffers

In some programming languages, a string buffer is an alternative to a string. It has the ability to be altered through
adding or appending, whereas a String is normally fixed or immutable.

In Java

Theory

Java's standard way to handle text is to use its String class. Any given String in Java is an immutable object, which means its state cannot be changed. A String has an array of characters. Whenever a String must be manipulated, any changes require the creation of a new String (which, in turn, involves the creation of a new array of characters, and copying of the original array). This happens even if the original String’s value or intermediate Strings used for the manipulation are not kept.
Java provides an alternate class for string manipulation, called a StringBuffer. A StringBuffer, like a String, has an array to hold characters. It, however, is mutable (its state can be altered). Its array of characters is not necessarily completely filled (as opposed to a String, whose array is always the exact required length for its contents). Thus, it has the capability to add, remove, or change its state without creating a new object (and without the creation of a new array, and array copying). The exception to this is when its array is no longer of suitable length to hold its content. In this case, it is required to create a new array, and copy contents.
For these reasons, Java would handle an expression like
String newString = aString + anInt + aChar + aDouble;

like this:
String newString = (new StringBuilder(aString)).append(anInt).append(aChar).append(aDouble).toString();

Implications

Generally, a StringBuffer is more efficient than a String in string handling. However, this is not necessarily the case, since a StringBuffer will be required to recreate its character array when it runs out of space. Theoretically, this can happen as many times as a new String would be required, although this is unlikely (and the programmer can provide length hints to prevent this). Either way, the effect is not noticeable in modern desktop computers.
As well, the shortcomings of arrays are inherent in a StringBuffer. In order to insert or remove characters at arbitrary positions, whole sections of arrays must be moved.
The spare array capacity that makes a StringBuffer attractive in an environment with low processing power also costs extra memory, which is likely at a premium in such an environment as well. This cost is minor, however, compared with the space required for creating many instances of Strings in order to process them. As well, the StringBuffer can be optimized to “waste” as little memory as possible.

The StringBuilder class, introduced in J2SE 5.0, differs from StringBuffer in that it is unsynchronized. When only a single thread at a time will access the object, using a StringBuilder is more efficient than using a StringBuffer.
StringBuffer and StringBuilder are included in the java.lang package.
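The same idea can be sketched outside Java. Python's str is immutable like Java's String, and the standard io.StringIO class can serve as a growable text buffer. This is only a loose analogy to StringBuffer and StringBuilder, not a description of the Java classes; the function name join_with_buffer is hypothetical.

```python
import io


def join_with_buffer(parts) -> str:
    """Append the text form of each part to one mutable buffer."""
    buf = io.StringIO()
    for part in parts:
        # Writing extends the buffer in place instead of creating
        # a new string object for every intermediate result.
        buf.write(str(part))
    # A single immutable string is produced only at the end.
    return buf.getvalue()
```

Calling join_with_buffer(["a", 1, "c", 2.5]) mirrors the mixed-type concatenation in the Java expression above and yields "a1c2.5".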

In .NET

Microsoft’s .NET Framework has a StringBuilder class in its Base Class Library.

In other languages

• In C++ and Ruby, the standard string class is already mutable, with the ability to change the contents and
append strings, etc., so a separate mutable string class is unnecessary.

• In Objective-C (Cocoa/OpenStep frameworks), the NSMutableString class is the mutable version of the NSString
class.

1.1.9 See also

• Formal language: a (possibly infinite) set of strings in theoretical computer science

• Connection string: passed to a driver to initiate a connection e.g. to a database

• Rope: a data structure for efficiently manipulating long strings

• Bitstring: a string of binary digits

• Binary-safe: a property of string manipulating functions treating their input as raw data stream

• Improper input validation: a type of software security vulnerability particularly relevant for user-given strings

• Incompressible string: a string that cannot be compressed by any algorithm

• Empty string: its properties and representation in programming languages

• String metric: notions of similarity between strings

• string (C++): overview of C++ string handling

• string.h: overview of C string handling

• Analysis of algorithms: determining time and storage needed by a particular (e.g. string manipulation)
algorithm

1.1.10 References
[1] “Introduction To Java - MFC 158 G”. String literals (or constants) are called ‘anonymous strings’.

[2] Barbara H. Partee; Alice ter Meulen; Robert E. Wall (1990). Mathematical Methods in Linguistics. Kluwer.

[3] John E. Hopcroft; Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. ISBN 0-201-02988-X. Here: sect. 1.1, p. 1.

[4] Randal E. Bryant; David O'Hallaron (2003). Computer Systems: A Programmer’s Perspective (2003 ed.). Upper Saddle River, NJ: Pearson Education. p. 40. ISBN 0-13-034074-X.

[5] http://www.stringology.org/

[6] “Essential Perl”. Perl’s most famous strength is in string manipulation with regular expressions.

1.2 String searching algorithm

In computer science, string searching algorithms, sometimes called string matching algorithms, are an important
class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a
larger string or text.
Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are vectors of elements of Σ. The Σ
may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). Other applications may
use binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.
In practice, how the string is encoded can affect the feasible string search algorithms. In particular if a variable width
encoding is in use then it is slow (time proportional to N) to find the Nth character. This will significantly slow down
many of the more advanced search algorithms. A possible solution is to search for the sequence of code units instead,
but doing so may produce false matches unless the encoding is specifically designed to avoid it.
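The following sketch illustrates searching for the sequence of code units rather than for characters. It relies on Python's standard utf-8 and shift_jis codecs; the function names are hypothetical. UTF-8 is designed so that a byte-level match of a valid encoded pattern always starts on a character boundary, whereas in Shift-JIS the second byte of a two-byte character can equal an ASCII code.

```python
def byte_level_contains(text: str, pattern: str, encoding: str) -> bool:
    """Search for the encoded pattern in the encoded text, byte by byte."""
    return pattern.encode(encoding) in text.encode(encoding)


def find_code_point_index(text: str, pattern: str) -> int:
    """Find pattern via a UTF-8 byte search and return a code point index."""
    data = text.encode("utf-8")
    pos = data.find(pattern.encode("utf-8"))
    if pos == -1:
        return -1
    # Converting the byte offset back costs time proportional to the offset.
    return len(data[:pos].decode("utf-8"))


# The katakana letter "ソ" contains no backslash, yet its Shift-JIS
# encoding ends in byte 0x5C, the ASCII code of the backslash.
false_match = byte_level_contains("ソ", "\\", "shift_jis")
no_match = byte_level_contains("ソ", "\\", "utf-8")
```

Here false_match is True even though the character-level test "\\" in "ソ" is False, while the UTF-8 search correctly reports no match.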

1.2.1 Basic classification

The various algorithms can be classified by the number of patterns each uses.

Single pattern algorithms

Let m be the length of the pattern and let n be the length of the searchable text.
Note: Asymptotic times are expressed using O, Ω, and Θ notation.
The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search literature.[1]

Algorithms using a finite set of patterns

• Aho–Corasick string matching algorithm

• Commentz-Walter algorithm

• Rabin–Karp string search algorithm

Algorithms using an infinite number of patterns

Naturally, the patterns cannot be enumerated in this case. They are usually represented by a regular grammar or
regular expression.

1.2.2 Other classification

Other classification approaches are possible. One of the most common uses preprocessing as the main criterion.

Naïve string search

A simple but inefficient way to see where one string occurs inside another is to check each place it could be, one by
one, to see if it’s there. So first we see if there’s a copy of the needle in the first character of the haystack; if not, we
look to see if there’s a copy of the needle starting at the second character of the haystack; if not, we look starting at
the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong
position to see that it is a wrong position, so in the average case, this takes O(n + m) steps, where n is the length of
the haystack and m is the length of the needle; but in the worst case, searching for a string like “aaaab” in a string like
“aaaaaaaaab”, it takes O(nm) steps.
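The procedure described above can be sketched in Python as follows. This is a minimal illustration; the function name naive_search and the convention of returning -1 when nothing is found are assumptions made for the example.

```python
def naive_search(haystack: str, needle: str) -> int:
    """Return the first index at which needle occurs in haystack, or -1."""
    n, m = len(haystack), len(needle)
    # Try every position at which the needle could start.
    for start in range(n - m + 1):
        j = 0
        # Compare character by character until a mismatch or a full match.
        while j < m and haystack[start + j] == needle[j]:
            j += 1
        if j == m:
            return start
    return -1


print(naive_search("aaaaaaaaab", "aaaab"))
```

On the worst-case input from the text, almost every starting position forces the inner loop to run nearly the full length of the needle before failing.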

[Figure: deterministic finite automaton (DFA) recognizing the word “MOMMY”.]

Finite state automaton based search

In this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes the stored
search string. These are expensive to construct (they are usually created using the powerset construction) but are
very quick to use. For example, the DFA in the accompanying figure recognizes the word “MOMMY”. This approach is
frequently generalized in practice to search for arbitrary regular expressions.
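A minimal Python sketch of automaton-based search is given below. It does not use the powerset construction mentioned above; instead it fills the transition table directly from the definition (state q means that the last q characters read equal the first q characters of the pattern), which is simple but slow to build. The names build_dfa and dfa_search are hypothetical.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Transition table: dfa[state][ch] is the next state."""
    m = len(pattern)
    dfa = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            # Next state = length of the longest prefix of the pattern
            # that is a suffix of what has been read so far.
            k = min(m, len(candidate))
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        dfa.append(row)
    return dfa


def dfa_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    alphabet = "".join(sorted(set(pattern)))
    dfa = build_dfa(pattern, alphabet)
    state = 0
    for pos, ch in enumerate(text):
        # Characters that do not occur in the pattern send us back to state 0.
        state = dfa[state].get(ch, 0)
        if state == len(pattern):
            return pos - len(pattern) + 1
    return -1


print(dfa_search("MOMMOMMY", "MOMMY"))
```

Once the table exists, the scan reads each text character exactly once and never backtracks, which is the property the text refers to as being very quick to use.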

Stubs

Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix. Boyer–Moore
starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates
keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable
to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates’ approach.
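As a rough illustration of the bit-parallel idea behind bitap, the following Python sketch uses the exact-matching "Shift-And" formulation. Which variant the text has in mind is not stated, so this choice and the name bitap_search are assumptions.

```python
def bitap_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # masks[c] has bit j set exactly when pattern[j] == c.
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    # Bit j of state is set when pattern[:j + 1] is a suffix of the text read so far,
    # i.e. when the previous j + 1 characters are a prefix of the search string.
    state = 0
    for pos, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):
            return pos - m + 1
    return -1


print(bitap_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))
```

All candidate prefix lengths are tracked at once in the bits of a single integer, which is what makes the approach easy to extend to approximate matching.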

Index methods

Faster search algorithms are based on preprocessing of the text. After building a substring index, for example a suffix
tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be built in
Θ(n) time, and all z occurrences of a pattern can be found in O(m) time under the assumption that the alphabet
has a constant size and all inner nodes in the suffix tree know what leaves are underneath them. The latter can be
accomplished by running a DFS algorithm from the root of the suffix tree.
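The index idea can be sketched with a suffix array, the simpler of the two structures named above. The construction below is deliberately naive (it sorts whole suffixes and is far from the Θ(n) bound quoted for suffix trees); the function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, found by binary search on the index."""
    m = len(pattern)
    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


sa = build_suffix_array("banana")
print(sa, find_occurrences("banana", sa, "ana"))
```

The index is built once per text; each later query costs only a logarithmic number of string comparisons, regardless of how many patterns are searched.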

Other variants

Some search methods, for instance trigram search, are intended to ﬁnd a “closeness” score between the search string
and the text rather than a “match/non-match”. These are sometimes called “fuzzy” searches.
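One simple way to turn trigrams into a closeness score is the Jaccard similarity of the two trigram sets. The text does not specify a scoring formula, so this choice, the absence of padding, and the names below are all assumptions made for illustration.

```python
def trigrams(s: str) -> set[str]:
    """The set of all length-3 substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the trigram sets: 0.0 (disjoint) up to 1.0 (identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)


print(trigram_similarity("string", "strong"))
```

Here "string" and "strong" share only the trigram "str" out of seven distinct trigrams, so they receive a low but non-zero score rather than a plain non-match.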

1.2.3 See also

• Sequence alignment

• Pattern matching

• Compressed pattern matching

• Approximate string matching

1.2.4 Academic conferences on text searching

• Combinatorial pattern matching (CPM), a conference on combinatorial algorithms for strings, sequences, and
trees.

• String Processing and Information Retrieval (SPIRE), an annual symposium on string processing and informa-
tion retrieval.

• Prague Stringology Conference (PSC), an annual conference on algorithms on strings and sequences.

• Competition on Applied Text Searching (CATS), an annual series of evaluations of text searching algorithms.

1.2.5 References
[1] Hume; Sunday (1991). “Fast String Searching”. Software: Practice and Experience 21 (11): 1221–1248. doi:10.1002/spe.4380211105.

[2] Melichar, Borivoj, Jan Holub, and J. Polcar. Text Searching Algorithms. Volume I: Forward String Matching. Vol. 1. 2
vols., 2005. http://stringology.org/athens/TextSearchingAlgorithms/.

• R. S. Boyer and J. S. Moore, A fast string searching algorithm, Comm. ACM 20 (10), 762–772 (1977).

• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Cliﬀord Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 32: String Matching,
pp.906–932.

1.2.6 External links

• Huge (maintained) list of pattern matching links Last updated:12/27/2008 20:18:38
• StringSearch – high-performance pattern matching algorithms in Java – Implementations of many String-
Matching-Algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
• Exact String Matching Algorithms — Animation in Java, Detailed description and C implementation of many
algorithms.
• Boyer-Moore-Raita-Thomas
• (PDF) Improved Single and Multiple Approximate String Matching
• Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features
• C implementation of Suﬃx Tree based Pattern Searching

1.3 Knuth–Morris–Pratt algorithm

In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for
occurrences of a “word” W within a main “text string” S by employing the observation that when a mismatch occurs,
the word itself embodies sufficient information to determine where the next match could begin, thus bypassing
re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris.
The three published it jointly in 1977.

1.3.1 Background
A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[].
The most straightforward algorithm is to look for a character match at successive values of the index m, the position
in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which
case the search is said to “fail”. At each position m the algorithm first checks for equality of the first character in the
searched for word, i.e. S[m] =? W[0]. If a match is found, the algorithm tests the other characters in the searched
for word by checking successive values of the word position index, i. The algorithm retrieves the character W[i] in
the searched for word and checks for equality of the expression S[m+i] =? W[i]. If all successive characters match
in W at position m, then a match is found at that position in the search string.
Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then
the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The
chance that the first two letters will match is 1 in 26² (1 in 676). So if the characters are random, then the expected
complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance
is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after
about one billion character comparisons.
That expected performance is not guaranteed. If the strings are not random, then checking a trial m may take many
character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string
S[] consists of 1 billion characters that are all A, and that the word W[] is 999 A characters terminating in a final
B character. The simple string matching algorithm will now examine 1000 characters at each trial position before
rejecting the match and advancing the trial position. The simple string search example would now take about 1000
character comparisons times 1 billion positions for 1 trillion character comparisons. If the length of W[] is n, then
the worst-case performance is O(k⋅n).
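The worst-case arithmetic can be checked on a scaled-down instance. The sketch below counts character comparisons made by the straightforward algorithm; the function name and the reduced sizes (a text of 1000 characters and a word of 10) are illustrative assumptions.

```python
def count_comparisons(S: str, W: str) -> tuple[int, int]:
    """Run the straightforward search; return (match position or -1, comparisons made)."""
    comparisons = 0
    for m in range(len(S) - len(W) + 1):
        for i in range(len(W)):
            comparisons += 1
            if S[m + i] != W[i]:
                break  # trial position m rejected
        else:
            return m, comparisons  # every character of W matched at m
    return -1, comparisons


# Scaled-down version of the example in the text: all A's versus A...AB.
S = "A" * 1000
W = "A" * 9 + "B"
position, comparisons = count_comparisons(S, W)
print(position, comparisons)
```

There are 1000 - 10 + 1 = 991 trial positions and each one costs 10 comparisons, giving 9910 in total, which is the k⋅n behaviour described above.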
The KMP algorithm does not have the horrendous worst-case performance of the straightforward algorithm. KMP
spends a little time precomputing a table (on the order of the size of W[], O(n)), and then it uses that table to do an
efficient search of the string in O(k).
The difference is that KMP makes use of previous match information that the straightforward algorithm does not.
In the example above, when KMP sees a trial match fail on the 1000th character (i = 999) because S[m+999] ≠
W[999], it will increment m by 1, but it will know that the first 998 characters at the new position already match.
KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999). Advancing
the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that match W[]
and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed table and two
state variables. When KMP discovers a mismatch, the table determines how far KMP will advance the trial position
(variable m) and where it will resume testing (variable i).

1.3.2 KMP algorithm

Worked example of the search algorithm

To illustrate the algorithm’s details, consider a (relatively artiﬁcial) run of the algorithm, where W = “ABCDABD”
and S = “ABC ABCDAB ABCDABCDABDE”. At any given time, the algorithm is in a state determined by two
integers:

• m, denoting the position within S where the prospective match for W begins,

• i, denoting the index of the currently considered character in W.

In each step the algorithm compares S[m+i] with W[i] and advances i if they are equal. This is depicted, at the start
of the run, like
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
The algorithm compares successive characters of W to “parallel” characters of S, moving from one to the next by
incrementing i if they match. However, in the fourth step S[3] = ' ' does not match W[3] = 'D'. Rather than beginning
to search again at S[1], we note that no 'A' occurs between positions 1 and 2 in W; hence, having checked all those
characters previously (and knowing they matched the corresponding characters in S), there is no chance of finding
the beginning of a match. Therefore, the algorithm sets m = 3 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:    ABCDABD
i:    0123456
This match fails at the initial character, so the algorithm sets m = 4 and i = 0
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456
Here i increments through a nearly complete match “ABCDAB” until i = 6 giving a mismatch at W[6] and S[10].
However, just prior to the end of the current partial match, there was that substring “AB” that could be the beginning
of a new match, so the algorithm must take this into consideration. As these characters match the two characters prior
to the current position, those characters need not be checked again; the algorithm sets m = 8 (the start of the initial
prefix) and i = 2 (signaling the first two characters match) and continues matching. Thus the algorithm not only omits
previously matched characters of S (the “BCD”), but also previously matched characters of W (the prefix “AB”).
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456
This search fails immediately, however, as W does not contain another “A”, so as in the first trial, the algorithm
returns to the beginning of W and begins searching at the mismatched character position of S: m = 10, reset i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:           ABCDABD
i:           0123456
The match at m=10 fails immediately, so the algorithm next tries m = 11 and i = 0.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456
Once again, the algorithm matches “ABCDAB”, but the next character, 'C', does not match the final character 'D' of
the word W. Reasoning as before, the algorithm sets m = 15, to start at the two-character string “AB” leading up to
the current position, set i = 2, and continue matching from the current position.
             1         2
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456
This time the match is complete, and the first character of the match is S[15].

Description of pseudocode for the search algorithm

The above example contains all the elements of the algorithm. For the moment, we assume the existence of a “partial
match” table T, described below, which indicates where we need to look for the start of a new match in the event
that the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m]
that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that
is, T[i] is the amount of “backtracking” we need to do after a mismatch). This has two implications: ﬁrst, T[0] =
−1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character; and
second, although the next possible match will begin at index m + i - T[i], as in the example above, we need not actually
check any of the T[i] characters after that, so that we continue searching from W[T[i]]. The following is a sample
pseudocode implementation of the KMP search algorithm.
algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            if T[i] > −1 then
                let m ← m + i - T[i], i ← T[i]
            else
                let i ← 0, m ← m + 1

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S
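A direct Python transcription of the pseudocode above might look as follows. It is a sketch that assumes W is non-empty and that the table T is supplied by the caller; in the usage line the table for “ABCDABD” is written out by hand from the values derived in the worked example of the table-building section.

```python
def kmp_search(S: str, W: str, T: list[int]) -> int:
    """Return the zero-based position of W in S, or len(S) if W does not occur.

    T is the partial match table for W (assumed to be precomputed).
    """
    m = 0  # beginning of the current match in S
    i = 0  # position of the current character in W
    while m + i < len(S):
        if W[i] == S[m + i]:
            if i == len(W) - 1:
                return m
            i += 1
        elif T[i] > -1:
            # Fall back: the first T[i] characters at the new m are known to match.
            m, i = m + i - T[i], T[i]
        else:
            # Mismatch on the very first character: just move on.
            i = 0
            m += 1
    return len(S)


table = [-1, 0, 0, 0, 0, 1, 2]  # partial match table for "ABCDABD"
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", table))
```

The simultaneous assignment in the fall-back branch matters: both new values must be computed from the old i, exactly as in the pseudocode's single let statement.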

Efficiency of the search algorithm

Assuming the prior existence of the table T, the search portion of the Knuth–Morris–Pratt algorithm has complexity
O(n), where n is the length of S and the O is big-O notation. Except for the ﬁxed overhead incurred in entering and
exiting the function, all the computations are performed in the while loop. To bound the number of iterations of this
loop, observe that T is constructed so that if a match which had begun at S[m] fails while comparing S[m + i] to
W[i], then the next possible match must begin at S[m + (i - T[i])]. In particular, the next possible match must occur
at a higher index than m, so that T[i] < i.
This fact implies that the loop can execute at most 2n times, since at each iteration it executes one of the two branches
in the loop. The ﬁrst branch invariably increases i and does not change m, so that the index m + i of the currently
scrutinized character of S is increased. The second branch adds i - T[i] to m, and as we have seen, this is always a
positive number. Thus the location m of the beginning of the current potential match is increased. At the same time,
the second branch leaves m + i unchanged, for m gets i - T[i] added to it, and immediately after T[i] gets assigned
as the new value of i, hence new_m + new_i = old_m + old_i - T[old_i] + T[old_i] = old_m + old_i. Now, the loop
ends if m + i = n; therefore, each branch of the loop can be reached at most n times, since they respectively increase
either m + i or m, and m ≤ m + i: if m = n, then certainly m + i ≥ n, so that since it increases by unit increments at
most, we must have had m + i = n at some point in the past, and therefore either way we would be done.
Thus the loop executes at most 2n times, showing that the time complexity of the search algorithm is O(n).
Here is another way to think about the runtime. Suppose we begin to match W against S at text position p, with i the
current index in W. If W occurs as a substring of S at p, then W[0..m] = S[p..p+m] (here m denotes the last index
of W). Upon success, that is, when the word and the text match at the current positions (W[i] = S[p+i]), we increase
i by 1. Upon failure, that is, when the word and the text do not match at the current positions (W[i] ≠ S[p+i]), the
text pointer is kept still, while the word pointer is rolled back by a certain amount (i = T[i], where T is the jump
table), and we attempt to match W[T[i]] with S[p+i]. The total amount of roll-back of i is bounded by i; that is to say,
for any failure, we can only roll back as much as we have progressed up to the failure. It follows that the search
performs at most 2n comparisons.

1.3.3 “Partial match” table (also known as “failure function”)

The goal of the table is to allow the algorithm not to match any character of S more than once. The key observation
about the nature of a linear search that allows this to happen is that in having checked some segment of the main
string against an initial segment of the pattern, we know exactly at which places a new potential match which could
continue to the current position could begin prior to the current position. In other words, we “pre-search” the pattern
itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not
sacriﬁcing any potential matches in doing so.
We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading
up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is
how far we have to backtrack in finding the next match. Hence T[i] is exactly the length of the longest possible proper
initial segment of W which is also a suffix of the substring ending at W[i - 1]. We use the convention that the
empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of
backtracking), we set T[0] = −1, as discussed below.

Worked example of the table-building algorithm

We consider the example of W = “ABCDABD” first. We will see that it follows much the same pattern as the main
search, and is efficient for similar reasons. We set T[0] = −1. To find T[1], we must discover a proper suffix of “A”
which is also a prefix of W. But there are no proper suffixes of “A”, so we set T[1] = 0. For T[2], the only proper
suffix of “AB” is “B”, which is not a prefix of W, so T[2] = 0.
Continuing to T[3], we note that there is a shortcut to checking all suffixes: let us say that we discovered a proper
suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible); then its first character is a
proper prefix of W, hence a proper prefix itself, and it ends at W[1], which we already determined cannot occur in
case T[2]. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1
only if a valid suffix of size m was found at the previous stage (i.e. T[x] = m).
Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole
one with length 1 fails, so T[3] = 0.
We pass to the subsequent W[4], 'A'. The same logic shows that the longest substring we need consider has length 1,
and although in this case 'A' does work, recall that we are looking for segments ending before the current character;
hence T[4] = 0 as well.
Considering now the next character, W[5], which is 'B', we exercise the following logic: if we were to find a subpattern
beginning before the previous character W[4], yet continuing to the current one W[5], then in particular it would itself
have a proper initial segment ending at W[4] yet beginning before it, which contradicts the fact that we already found
that 'A' itself is the earliest occurrence of a proper segment ending at W[4]. Therefore we need not look before W[4]
to find a terminal string for W[5]. Therefore T[5] = 1.
Finally, we see that the next character in the ongoing segment starting at W[4] = 'A' would be 'B', and indeed this is
also W[5]. Furthermore, the same argument as above shows that we need not look before W[4] to find a segment for
W[6], so that this is it, and we take T[6] = 2.
Therefore we compile the following table:
i:      0   1   2   3   4   5   6
W[i]:   A   B   C   D   A   B   D
T[i]:  −1   0   0   0   0   1   2
Another example, more interesting and complex: [table not preserved in this copy]

Description of pseudocode for the table-building algorithm

The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle
is that of the overall search: most of the work was already done in getting to the current position, so very little needs
to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously
gives non-proper substrings at the beginning. This necessitates some initialization code.
algorithm kmp_table:
    input:
        an array of characters, W (the word to be analyzed)
        an array of integers, T (the table to be filled)
    output:
        nothing (but during operation, it populates the table)

    define variables:
        an integer, pos ← 2 (the current position we are computing in T)
        an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring)

    (the first few values are fixed but different from what the algorithm might suggest)
    let T[0] ← −1, T[1] ← 0

    while pos < length(W) do
        (first case: the substring continues)
        if W[pos-1] = W[cnd] then
            let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1
        (second case: it doesn't, but we can fall back)
        else if cnd > 0 then
            let cnd ← T[cnd]
        (third case: we have run out of candidates. Note cnd = 0)
        else
            let T[pos] ← 0, pos ← pos + 1
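The table-building pseudocode can be transcribed into Python along the following lines. This is a sketch: unlike the pseudocode it returns a new list instead of filling a caller-supplied array, and the handling of words shorter than two characters is an assumption added so that the initialization does not write out of bounds.

```python
def kmp_table(W: str) -> list[int]:
    """Build the partial match table T for the word W."""
    # Start from all zeros so that T[1] = 0 is already in place for len(W) >= 2.
    T = [0] * max(len(W), 1)
    T[0] = -1
    pos = 2  # current position we are computing in T
    cnd = 0  # index in W of the next character of the current candidate substring
    while pos < len(W):
        if W[pos - 1] == W[cnd]:
            # First case: the substring continues.
            cnd += 1
            T[pos] = cnd
            pos += 1
        elif cnd > 0:
            # Second case: it does not, but we can fall back.
            cnd = T[cnd]
        else:
            # Third case: we have run out of candidates (cnd == 0).
            T[pos] = 0
            pos += 1
    return T


print(kmp_table("ABCDABD"))
```

For “ABCDABD” this reproduces the values derived step by step in the worked example: −1, 0, 0, 0, 0, 1, 2.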

Efficiency of the table-building algorithm

The complexity of the table algorithm is O(n), where n is the length of W. Since, apart from some initialization, all the
work is done in the while loop, it is sufficient to show that this loop executes in O(n) time, which will be done by
simultaneously examining the quantities pos and pos - cnd. In the first branch, pos - cnd is preserved, as both pos and
cnd are incremented simultaneously, but naturally, pos is increased. In the second branch, cnd is replaced by T[cnd],
which we saw above is always strictly less than cnd, thus increasing pos - cnd. In the third branch, pos is incremented
and cnd is not, so both pos and pos - cnd increase. Since pos ≥ pos - cnd, this means that at each stage either pos or a
lower bound for pos increases; therefore since the algorithm terminates once pos = n, it must terminate after at most
2n iterations of the loop, since pos - cnd begins at 1. Therefore the complexity of the table algorithm is O(n).

1.3.4 Efficiency of the KMP algorithm

Since the two portions of the algorithm (the search and the table construction) have, respectively, complexities of O(k)
and O(n), where k is the length of S and n is the length of W, the complexity of the overall algorithm is O(n + k).
These complexities are the same, no matter how many repetitive patterns are in W or S.

1.3.5 Variants

A real-time version of KMP can be implemented using a separate failure function table for each character in the
alphabet. If a mismatch occurs on character x in the text, the failure function table for character x is consulted for the
index i in the pattern at which the mismatch took place. This will return the length of the longest substring ending at i
matching a prefix of the pattern, with the added condition that the character after the prefix is x. With this restriction,
character x in the text need not be checked again in the next phase, and so only a constant number of operations are
executed between the processing of each index of the text. This satisfies the real-time computing restriction.
The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal
string rotation. The failure function is progressively calculated as the string is rotated.

1.3.6 See also

• Boyer–Moore string search algorithm

• Rabin–Karp string search algorithm

• Aho–Corasick string matching algorithm

1.3.7 References
• Knuth, Donald; Morris, James H., jr; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal
on Computing 6 (2): 323–350. doi:10.1137/0206024. Zbl 0372.68005.

• Cormen, Thomas; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Section 32.4: The Knuth-
Morris-Pratt algorithm”. Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931.
ISBN 0-262-03293-7. Zbl 1047.68161.

• Crochemore, Maxime; Rytter, Wojciech (2003). Jewels of stringology. Text algorithms. River Edge, NJ: World
Scientiﬁc. pp. 20–25. ISBN 981-02-4897-0. Zbl 1078.68151.

• Szpankowski, Wojciech (2001). Average case analysis of algorithms on sequences. Wiley-Interscience Series
in Discrete Mathematics and Optimization. With a foreword by Philippe Flajolet. Chichester: Wiley. pp.
15–17,136–141. ISBN 0-471-24063-X. Zbl 0968.68205.

1.3.8 External links

• String Searching Applet animation

• An explanation of the algorithm and sample C++ code by David Eppstein

• Knuth-Morris-Pratt algorithm description and C code by Christian Charras and Thierry Lecroq

• Explanation of the algorithm from scratch by FH Flensburg.

• Breaking down steps of running KMP by Chu-Cheng Hsieh.

• NPTELHRD YouTube lecture video

• Proof of correctness

1.4 Boyer–Moore string search algorithm

For the Boyer-Moore theorem prover, see Nqthm.

In computer science, the Boyer–Moore string search algorithm is an eﬃcient string searching algorithm that is the
standard benchmark for practical string search literature.[1] It was developed by Robert S. Boyer and J Strother Moore
in 1977.[2] The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in
(the text). It is thus well-suited for applications in which the pattern is much shorter than the text or where it persists
across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocess step to skip
sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm
runs faster as the pattern length increases. The key features of the algorithm are to match on the tail of the pattern
rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single
character in the text.

1.4.1 Definitions

Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.

• S[i] refers to the character at index i of string S, counting from 1.

• S[i..j] refers to the substring of string S starting at index i and ending at j, inclusive.

• A preﬁx of S is a substring S[1..i] for some i in range [1, n], where n is the length of S.

• A suﬃx of S is a substring S[i..n] for some i in range [1, n], where n is the length of S.

• The string to be searched for is called the pattern and is referred to with symbol P.

• The string being searched in is called the text and is referred to with symbol T.

• The length of P is n.

• The length of T is m.

• An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.

• A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
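The definitions above can be checked against the PAN / ANPANMAN example with a short Python sketch. The function name is hypothetical; alignments k are 1-based as in the definitions, so the 1-based substring T[(k-n+1)..k] corresponds to the Python slice T[k - n:k].

```python
def matching_alignments(P: str, T: str) -> list[int]:
    """All 1-based alignments k at which P occurs in T (brute force over every k)."""
    n, m = len(P), len(T)
    # Valid alignments run from k = n (start of P under start of T) to k = m.
    return [k for k in range(n, m + 1) if T[k - n:k] == P]


P, T = "PAN", "ANPANMAN"
alignments = list(range(len(P), len(T) + 1))
print(alignments, matching_alignments(P, T))
```

The list of candidate alignments runs from k = 3 to k = 8, six in total, and the only match is at k = 5, in agreement with the figure caption.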

1.4.2 Description

The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at
different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore
uses information gained by preprocessing P to skip as many alignments as possible.
Before the introduction of this algorithm, the usual way to search within text was to examine each character of
the text for the first character of the pattern. Once that was found, the subsequent characters of the text would be
compared to the characters of the pattern. If no match occurred, then the text would again be checked character by
character in an effort to find a match. Thus almost every character in the text needed to be examined.
The key insight in this algorithm is that if the end of the pattern is compared to the text then jumps along the text can
be made rather than checking every character of the text. The reason that this works is that in lining up the pattern
against the text, the last character of the pattern is compared to the character in the text. If the characters do not
match there is no need to continue searching backwards along the pattern. If the character in the text does not match
any of the characters in the pattern, then the next character to check in the text is located n characters farther along
the text, where n is the length of the pattern. If the character is in the pattern then a partial shift of the pattern along
the text is done to line up along the matching character and the process is repeated. The movement along the text in
jumps to make comparisons rather than checking every character in the text decreases the number of comparisons
that have to be made, which is the key to the increase of the efficiency of the algorithm.
More formally, the algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in
P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the
end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there
is a match) or a mismatch occurs upon which the alignment is shifted to the right according to the maximum value
permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats
until the alignment is shifted past the end of T, which means no further matches will be found.
The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.

1.4.3 Shift Rules

A shift is calculated by applying two rules: the bad character rule and the good suﬃx rule. The actual shifting oﬀset
is the maximum of the shifts calculated by these rules.

The Bad Character Rule

Description [Figure: demonstration of the bad character rule with pattern NNAAMAN.]

The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure
occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in
line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a
shift is proposed that moves the entirety of P past the point of mismatch.

Preprocessing Methods vary on the exact form the table for the bad character rule should take, but a simple
constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c
in the alphabet and second by the index i in the pattern. This lookup will return the largest index j < i at which c
occurs in P, or −1 if there is no such occurrence. The proposed shift will then be i - j, with O(1) lookup
time and O(kn) space, assuming a finite alphabet of size k.
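The 2D table described above can be sketched in Python as follows. Note two departures from the text, both assumptions made for convenience: indices are zero-based rather than counted from 1, and the table is keyed by the character itself rather than by its index in the alphabet. The function names are hypothetical.

```python
def bad_character_table(P: str, alphabet: str) -> dict[str, list[int]]:
    """table[c][i] = largest j < i with P[j] == c, or -1 if c does not occur left of i."""
    table = {c: [-1] * len(P) for c in alphabet}
    last = {c: -1 for c in alphabet}  # most recent position of each character so far
    for i, ch in enumerate(P):
        for c in alphabet:
            table[c][i] = last[c]
        last[ch] = i
    return table


def bad_character_shift(table: dict[str, list[int]], c: str, i: int) -> int:
    """Proposed shift when text character c mismatches at pattern index i."""
    return i - table[c][i]


table = bad_character_table("NNAAMAN", "AMNX")
print(table["A"], bad_character_shift(table, "X", 3))
```

Building the table takes O(kn) time and space for an alphabet of size k and a pattern of length n, after which every lookup is a constant-time index operation.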

The Good Suffix Rule

Description [Figure: demonstration of the good suffix rule with pattern ANAMPNAM.]

The good suﬃx rule is markedly more complex in both concept and implementation than the bad character rule. It
is the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus:[3]

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch
occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that
t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P.
Shift P to the right so that substring t' in P aligns with substring t in T. If t' does not exist, then shift the
left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches
a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P
is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the
occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past t.

Preprocessing The good suffix rule requires two tables: one for use in the general case, and another for use when
either the general case returns no meaningful result or a match occurs. These tables will be designated L and H
respectively. Their definitions are as follows:[3]

For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]]
and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there
is no position satisfying the condition.

Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If
none exists, let H[i] be zero.

Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given
by n - L[i] or n - H[i]. H should only be used if L[i] is zero or a match has been found.
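
To make the two definitions concrete, the following sketch computes L and H by brute force, directly from the wording above. It is deliberately not the O(n) construction mentioned in the text; the function name good_suffix_tables is hypothetical, and the arrays are 1-indexed (index 0 unused) only to match the notation of the definitions. A copy of the suffix that starts at position 1 has no preceding character and is accepted here, which is an interpretation assumed by this sketch.

```python
def good_suffix_tables(P):
    """Brute-force L and H for pattern P, using 1-based positions."""
    n = len(P)
    L = [0] * (n + 1)
    H = [0] * (n + 1)

    for i in range(2, n + 1):
        t = P[i - 1:]                 # P[i..n] in 1-based notation
        m = len(t)
        # Try end positions k < n, largest first.
        for k in range(n - 1, m - 1, -1):
            start = k - m             # 0-based start of the candidate copy
            same_text = P[start:k] == t
            # The preceding character must differ from P[i-1] (1-based).
            differs = start == 0 or P[start - 1] != P[i - 2]
            if same_text and differs:
                L[i] = k
                break

    for i in range(1, n + 1):
        # Longest suffix of P[i..n] that is also a prefix of P.
        for length in range(n - i + 1, 0, -1):
            if P[n - length:] == P[:length]:
                H[i] = length
                break

    return L, H


L, H = good_suffix_tables("ANAMPNAM")
print(len("ANAMPNAM") - L[6])   # shift proposed after matching the suffix "NAM"
```

For the pattern ANAMPNAM, the suffix NAM (starting at position 6) reappears ending at position 4 with a different preceding character, so the general-case table proposes a shift of n - L[6].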

1.4.4 The Galil Rule

A simple but important optimization of Boyer-Moore was put forth by Galil in 1979.[4] As opposed to shifting, the
Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known
to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to
k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring
T[(k2 - n)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without
explicitly comparing past k1. In addition to increasing the efficiency of Boyer-Moore, the Galil rule is required for
proving linear-time execution in the worst case.

1.4.5 Performance
The Boyer-Moore algorithm as presented in the original paper has worst-case running time of O(n+m) only if the
pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977,[5] followed by Guibas
and Odlyzko in 1980[6] with an upper bound of 5m comparisons in the worst case. Richard Cole gave a proof with
an upper bound of 3m comparisons in the worst case in 1991.[7]
When the pattern does occur in the text, running time of the original algorithm is O(nm) in the worst case. This is
easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil
rule results in linear runtime across all cases.[4][7]
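
The quadratic case can be made concrete with a small counting sketch. For a pattern made of one repeated character, H[2] = n - 1, so the shift after each match is n - H[2] = 1. The hypothetical helper count_comparisons_shift_one below models only that situation (a right-to-left scan at every alignment, without the Galil rule); it is not a full Boyer-Moore implementation.

```python
def count_comparisons_shift_one(P, T):
    """Scan P against T right to left at every alignment, shifting by one.
    Returns (number of character comparisons, list of match positions)."""
    n, m = len(P), len(T)
    comparisons = 0
    matches = []
    k = n - 1                      # index in T aligned with the end of P
    while k < m:
        i, h = n - 1, k
        while i >= 0:
            comparisons += 1
            if P[i] != T[h]:
                break
            i -= 1
            h -= 1
        if i < 0:
            matches.append(k - n + 1)
        k += 1                     # the shift n - H[2] for a one-letter pattern
    return comparisons, matches


comparisons, matches = count_comparisons_shift_one("aaaa", "a" * 12)
print(comparisons, len(matches))
```

With n = 4 and m = 12, every one of the m - n + 1 alignments costs n comparisons, which is the n(m - n + 1) behaviour that the Galil rule avoids.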

1.4.6 Implementations
Various implementations exist in different programming languages. In C++, Boost provides the generic Boyer–Moore
search implementation under the Algorithm library.
Below are a few simple implementations.
[Python implementation]
def alphabet_index(c):
    """
    Returns the index of the given character in the English alphabet, counting from 0.
    """
    return ord(c.lower()) - 97  # 'a' is ASCII character 97


def match_length(S, idx1, idx2):
    """
    Returns the length of the match of the substrings of S beginning at idx1 and idx2.
    """
    if idx1 == idx2:
        return len(S) - idx1
    match_count = 0
    while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
        match_count += 1
        idx1 += 1
        idx2 += 1
    return match_count


def fundamental_preprocess(S):
    """
    Returns Z, the Fundamental Preprocessing of S. Z[i] is the length of the substring
    beginning at i which is also a prefix of S. This pre-processing is done in O(n) time,
    where n is the length of S.
    """
    if len(S) == 0:  # Handles case of empty string
        return []
    if len(S) == 1:  # Handles case of single-character string
        return [1]
    z = [0 for x in S]
    z[0] = len(S)
    z[1] = match_length(S, 0, 1)
    for i in range(2, 1+z[1]):  # Optimization from exercise 1-5
        z[i] = z[1]-i+1
    # Defines lower and upper limits of z-box
    l = 0
    r = 0
    for i in range(2+z[1], len(S)):
        if i <= r:  # i falls within existing z-box
            k = i-l
            b = z[k]
            a = r-i+1
            if b < a:  # b ends within existing z-box
                z[i] = b
            else:  # b ends at or after the end of the z-box, we need to do an
                   # explicit match to the right of the z-box
                # Only the first a characters are known to match, so the
                # count starts from a (not b) before matching explicitly.
                z[i] = a+match_length(S, a, r+1)
                l = i
                r = i+z[i]-1
        else:  # i does not reside within existing z-box
            z[i] = match_length(S, 0, i)
            if z[i] > 0:
                l = i
                r = i+z[i]-1
    return z


def bad_character_table(S):
    """
    Generates R for S, which is an array indexed by the position of some character c in
    the English alphabet. At that index in R is an array of length |S|+1, specifying for
    each index i in S (plus the index after S) the next location of character c
    encountered when traversing S from right to left starting at i. This is used for a
    constant-time lookup for the bad character rule in the Boyer-Moore string search
    algorithm, although it has a much larger size than non-constant-time solutions.
    """
    if len(S) == 0:
        return [[] for a in range(26)]
    R = [[-1] for a in range(26)]
    alpha = [-1 for a in range(26)]
    for i, c in enumerate(S):
        alpha[alphabet_index(c)] = i
        for j, a in enumerate(alpha):
            R[j].append(a)
    return R


def good_suffix_table(S):
    """
    Generates L for S, an array used in the implementation of the strong good suffix
    rule. L[i] = k, the largest position in S such that S[i:] (the suffix of S starting
    at i) matches a suffix of S[:k+1] (a substring in S ending at index k). Used in
    Boyer-Moore, L gives an amount to shift P relative to T such that no instances of P
    in T are skipped and a suffix of P[:L[i]+1] matches the substring of T matched by a
    suffix of P in the previous match attempt. Specifically, if the mismatch took place
    at position i-1 in P, the shift magnitude is given by the equation
    len(P) - 1 - L[i]. In the case that L[i] = -1, the full shift table is used.
    Since only proper suffixes matter, L[0] = -1.
    """
    L = [-1 for c in S]
    N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
    N.reverse()
    for j in range(0, len(S)-1):
        i = len(S) - N[j]
        if i != len(S):
            L[i] = j
    return L


def full_shift_table(S):
    """
    Generates F for S, an array used in a special case of the good suffix rule in the
    Boyer-Moore string search algorithm. F[i] is the length of the longest suffix of
    S[i:] that is also a prefix of S. In the cases it is used, the shift magnitude of
    the pattern P relative to the text T is len(P) - F[i] for a mismatch occurring at
    i-1.
    """
    F = [0 for c in S]
    Z = fundamental_preprocess(S)
    longest = 0
    for i, zv in enumerate(reversed(Z)):
        longest = max(zv, longest) if zv == i+1 else longest
        F[-i-1] = longest
    return F


def string_search(P, T):
    """
    Implementation of the Boyer-Moore string search algorithm. This finds all
    occurrences of P in T, and incorporates numerous ways of pre-processing the pattern
    to determine the optimal amount to shift the string and skip comparisons. In
    practice it runs in O(m) (and even sublinear) time, where m is the length of T.
    This implementation only supports ASCII alphabetic characters (no spaces); the bad
    character table is indexed case-insensitively, but characters are compared
    case-sensitively.
    """
    if len(P) == 0 or len(T) == 0 or len(T) < len(P):
        return []

    matches = []

    # Preprocessing
    R = bad_character_table(P)
    L = good_suffix_table(P)
    F = full_shift_table(P)

    k = len(P) - 1   # Represents alignment of end of P relative to T
    previous_k = -1  # Represents alignment in previous phase (Galil's rule)
    while k < len(T):
        i = len(P) - 1  # Character to compare in P
        h = k           # Character to compare in T
        while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
            i -= 1
            h -= 1
        if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
            matches.append(k - len(P) + 1)
            k += len(P)-F[1] if len(P) > 1 else 1
        else:  # No match, shift by max of bad character and good suffix rules
            char_shift = i - R[alphabet_index(T[h])][i]
            if i+1 == len(P):  # Mismatch happened on first attempt
                suffix_shift = 1
            elif L[i+1] == -1:  # Matched suffix does not appear anywhere in P
                suffix_shift = len(P) - F[i+1]
            else:  # Matched suffix appears in P
                # L holds 0-based end positions, hence the extra -1; without it
                # the shift overshoots (e.g. P = "abab", T = "xcabab").
                suffix_shift = len(P) - 1 - L[i+1]
            shift = max(char_shift, suffix_shift)
            previous_k = k if shift >= i+1 else previous_k  # Galil's rule
            k += shift
    return matches
[C implementation]
#include <stdint.h>
#include <stdlib.h>

#define ALPHABET_LEN 256
#define NOT_FOUND patlen
#define max(a, b) (((a) < (b)) ? (b) : (a))

// delta1 table: delta1[c] contains the distance between the last
// character of pat and the rightmost occurrence of c in pat.
// If c does not occur in pat, then delta1[c] = patlen.
// If c is at string[i] and c != pat[patlen-1], we can
// safely shift i over by delta1[c], which is the minimum distance
// needed to shift pat forward to get string[i] lined up
// with some character in pat.
// this algorithm runs in alphabet_len+patlen time.
void make_delta1(int *delta1, uint8_t *pat, int32_t patlen) {
    int i;
    for (i=0; i < ALPHABET_LEN; i++) {
        delta1[i] = NOT_FOUND;
    }
    for (i=0; i < patlen-1; i++) {
        delta1[pat[i]] = patlen-1 - i;
    }
}

// true if the suffix of word starting from word[pos] is a prefix
// of word
int is_prefix(uint8_t *word, int wordlen, int pos) {
    int i;
    int suffixlen = wordlen - pos;
    // could also use the strncmp() library function here
    for (i = 0; i < suffixlen; i++) {
        if (word[i] != word[pos+i]) {
            return 0;
        }
    }
    return 1;
}

// length of the longest suffix of word ending on word[pos].
// suffix_length("dddbcabc", 8, 4) = 2
int suffix_length(uint8_t *word, int wordlen, int pos) {
    int i;
    // increment suffix length i to the first mismatch or beginning
    // of the word
    for (i = 0; (word[pos-i] == word[wordlen-1-i]) && (i < pos); i++);
    return i;
}

// delta2 table: given a mismatch at pat[pos], we want to align
// with the next possible full match, based on what we
// know about pat[pos+1] to pat[patlen-1].
//
// In case 1:
// pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat,
// the next plausible match starts at or after the mismatch.
// If, within the substring pat[pos+1 .. patlen-1], lies a prefix
// of pat, the next plausible match is here (if there are multiple
// prefixes in the substring, pick the longest). Otherwise, the
// next plausible match starts past the character aligned with
// pat[patlen-1].
//
// In case 2:
// pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The
// mismatch tells us that we are not looking at the end of a match.
// We may, however, be looking at the middle of a match.
//
// The first loop, which takes care of case 1, is analogous to
// the KMP table, adapted for a 'backwards' scan order with the
// additional restriction that the substrings it considers as
// potential prefixes are all suffixes. In the worst case scenario
// pat consists of the same letter repeated, so every suffix is
// a prefix. This loop alone is not sufficient, however:
// Suppose that pat is "ABYXCDBYX", and text is ".....ABYXCDEYX".
// We will match X, Y, and find B != E. There is no prefix of pat
// in the suffix "YX", so the first loop tells us to skip forward
// by 9 characters.
// Although superficially similar to the KMP table, the KMP table
// relies on information about the beginning of the partial match
// that the BM algorithm does not have.
//
// The second loop addresses case 2. Since suffix_length may not be
// unique, we want to take the minimum value, which will tell us
// how far away the closest potential match is.
void make_delta2(int *delta2, uint8_t *pat, int32_t patlen) {
    int p;
    int last_prefix_index = patlen-1;

    // first loop
    for (p=patlen-1; p>=0; p--) {
        if (is_prefix(pat, patlen, p+1)) {
            last_prefix_index = p+1;
        }
        delta2[p] = last_prefix_index + (patlen-1 - p);
    }

    // second loop
    for (p=0; p < patlen-1; p++) {
        int slen = suffix_length(pat, patlen, p);
        if (pat[p - slen] != pat[patlen-1 - slen]) {
            delta2[patlen-1 - slen] = patlen-1 - p + slen;
        }
    }
}

uint8_t* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = (int *)malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    // The empty pattern must be considered specially
    if (patlen == 0) {
        free(delta2);  // release the table before the early return
        return string;
    }

    i = patlen-1;
    while (i < stringlen) {
        int j = patlen-1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i+1);
        }

        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}
[Java implementation]
/**
 * Returns the index within this string of the first occurrence of the
 * specified substring. If it is not a substring, return -1.
 *
 * @param haystack The string to be scanned
 * @param needle The target string to search
 * @return The start index of the substring
 */
public static int indexOf(char[] haystack, char[] needle) {
    if (needle.length == 0) {
        return 0;
    }
    int charTable[] = makeCharTable(needle);
    int offsetTable[] = makeOffsetTable(needle);
    for (int i = needle.length - 1, j; i < haystack.length;) {
        for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
            if (j == 0) {
                return i;
            }
        }
        // i += needle.length - j; // For naive method
        i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
    }
    return -1;
}

/**
 * Makes the jump table based on the mismatched character information.
 */
private static int[] makeCharTable(char[] needle) {
    // Note: a table of 256 entries only covers characters in the range 0..255;
    // other characters would index past the end of the table.
    final int ALPHABET_SIZE = 256;
    int[] table = new int[ALPHABET_SIZE];
    for (int i = 0; i < table.length; ++i) {
        table[i] = needle.length;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
        table[needle[i]] = needle.length - 1 - i;
    }
    return table;
}

/**
 * Makes the jump table based on the scan offset at which the mismatch occurs.
 */
private static int[] makeOffsetTable(char[] needle) {
    int[] table = new int[needle.length];
    int lastPrefixPosition = needle.length;
    for (int i = needle.length - 1; i >= 0; --i) {
        if (isPrefix(needle, i + 1)) {
            lastPrefixPosition = i + 1;
        }
        table[needle.length - 1 - i] = lastPrefixPosition - i + needle.length - 1;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
        int slen = suffixLength(needle, i);
        table[slen] = needle.length - 1 - i + slen;
    }
    return table;
}

/**
 * Is needle[p:end] a prefix of needle?
 */
private static boolean isPrefix(char[] needle, int p) {
    for (int i = p, j = 0; i < needle.length; ++i, ++j) {
        if (needle[i] != needle[j]) {
            return false;
        }
    }
    return true;
}

/**
 * Returns the maximum length of the substring that ends at p and is a suffix.
 */
private static int suffixLength(char[] needle, int p) {
    int len = 0;
    for (int i = p, j = needle.length - 1; i >= 0 && needle[i] == needle[j]; --i, --j) {
        len += 1;
    }
    return len;
}

1.4.7 Variants
The Boyer–Moore–Horspool algorithm is a simplification of the Boyer–Moore algorithm using only the bad character
rule.
The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given
alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of
the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths
requires an additional table equal in size to the text being searched.
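
For the Boyer–Moore–Horspool variant mentioned above, a minimal sketch is given below. It keeps only a bad-character style table, keyed on the text character aligned with the last position of the pattern. The function name horspool_search and the use of a dictionary (so that any character is accepted) are choices made for this illustration rather than anything prescribed by the text.

```python
def horspool_search(pattern, text):
    """Return the start indices of all occurrences of pattern in text,
    using only a single shift table (Boyer-Moore-Horspool)."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []

    # Distance from the rightmost occurrence of each character
    # (excluding the last pattern position) to the end of the pattern.
    shift = {}
    for idx, ch in enumerate(pattern[:-1]):
        shift[ch] = n - 1 - idx

    matches = []
    k = n - 1                      # index in text aligned with the end of pattern
    while k < m:
        i, h = n - 1, k
        while i >= 0 and pattern[i] == text[h]:
            i -= 1
            h -= 1
        if i < 0:
            matches.append(k - n + 1)
        # Characters absent from the table allow a shift of the full length.
        k += shift.get(text[k], n)
    return matches


print(horspool_search("abab", "xcabab"))
```

Dropping the good suffix rule and the Galil rule makes the preprocessing trivial, at the cost of losing the linear worst-case bound discussed in the Performance section.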

1.4.8 See also

• Knuth–Morris–Pratt string search algorithm

• Boyer–Moore–Horspool string search algorithm

• Apostolico–Giancarlo string search algorithm

• Aho–Corasick multi-pattern string search algorithm

• Rabin–Karp multi-pattern string search algorithm

• Suffix trees

1.4.9 References
[1] Hume; Sunday (November 1991). “Fast String Searching”. Software—Practice and Experience 21 (11): 1221–1248.

[2] Boyer, Robert S.; Moore, J Strother (October 1977). “A Fast String Searching Algorithm.”. Comm. ACM (New York,
NY, USA: Association for Computing Machinery) 20 (10): 762–772. doi:10.1145/359842.359859. ISSN 0001-0782.

[3] Gusfield, Dan (1999) [1997], “Chapter 2 - Exact Matching: Classical Comparison-Based Methods”, Algorithms on Strings,
Trees, and Sequences (1 ed.), Cambridge University Press, pp. 19–21, ISBN 0521585198

[4] Galil, Z. (September 1979). “On improving the worst case running time of the Boyer-Moore string matching algorithm”.
Comm. ACM (New York, NY, USA: Association for Computing Machinery) 22 (9): 505–508. doi:10.1145/359146.359148.
ISSN 0001-0782.

[5] Knuth, Donald; Morris, James H.; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal on Computing
6 (2): 323–350. doi:10.1137/0206024.

[6] Guibas, Leonidas; Odlyzko, Andrew (1977). “A new proof of the linearity of the Boyer-Moore string searching algorithm”.
Proceedings of the 18th Annual Symposium on Foundations of Computer Science (Washington, DC, USA: IEEE Computer
Society): 189–195. doi:10.1109/SFCS.1977.3.

[7] Cole, Richard (September 1991). “Tight bounds on the complexity of the Boyer-Moore string matching algorithm”.
Proceedings of the 2nd annual ACM-SIAM symposium on Discrete algorithms (Philadelphia, PA, USA: Society for Industrial
and Applied Mathematics): 224–233. ISBN 0-89791-376-0.

1.4.10 External links

• Original paper on the Boyer-Moore algorithm

• An example of the Boyer-Moore algorithm from the homepage of J Strother Moore, co-inventor of the algorithm

• Richard Cole’s 1991 paper proving runtime linearity

Chapter 2

Text and image sources, contributors, and licenses

2.1 Text
• String (computer science) Source: http://en.wikipedia.org/wiki/String%20(computer%20science)?oldid=660859856 Contributors: Damian
Yerrick, AxelBoldt, Eloquence, Hornlo, Bryan Derksen, Zundark, The Anome, Stephen Gilbert, Koyaanis Qatsi, Drj, Boleslav Bobcik,
Perique des Palottes, Mjb, B4hand, Patrick, RTC, Michael Hardy, Pnm, TakuyaMurata, Mkweise, Ahoerstemeier, Александър, Arthur
Frayn, Error, Bogdangiusca, Andres, Ghewgill, Charles Matthews, Dcoetzee, Dysprosia, Furrykef, Bevo, Sewing, Robbot, Murray Lang-
ton, Scarfboy, Pengo, Tobias Bergemann, Giftlite, DavidCary, Castaa, Fropuff, Jorge Stolfi, Christopherlin, Vadmium, Fudo, Beland,
Kusunose, Maximaximax, Sebbe, Pinguin.tk~enwiki, Andreas Kaufmann, Shahab, Slady, Murtasa, Plugwash, Spitzak, MisterSheik, Can-
isRufus, Anphanax, Cedders, Richard W.M. Jones, Spearhead, Sietse Snel, R. S. Shaw, Minghong, Obradovic Goran, Nevyn, Wayfarer,
Hippophaë~enwiki, Ubermonkey, Seec77, Alai, Forderud, Oleg Alexandrov, Linas, Bkkbrad, MattGiuca, Ruud Koot, Urod, Anthony
Borla, Jonnabuz, Gwil, Qwertyus, Kbdank71, TheIncredibleEdibleOompaLoompa, StuartBrady, FlaBot, Ian Pitchford, Stoph, Margos-
bot~enwiki, Gparker, Gurch, Pexatus, Chobot, YurikBot, Borgx, Hairy Dude, Fabartus, Howcheng, Mikeblas, Black Falcon, JLaTondre,
SmackBot, BurntSky, AnOddName, Chris the speller, Sahirshah, Gaiacarra, Thumperward, Nbarth, Jeremysr, BIL, Cybercobra, Dread-
star, Tompsci, Drphilharmonic, Mlpkr, Lambiam, Doug Bell, Derek farn, Witharebelyell, Shirifan, Loadmaster, Dr.K., Rory O'Kane,
Dreftymac, Pimlottc, Courcelles, Georg Peter, Neelix, Gregbard, Peterdjones, Gogo Dodo, Christian75, Mojo Hand, John254, Icep,
AntiVandalBot, JonathanCross, JAnDbot, Dereckson, David Eppstein, Philg88, Tigrisek, Gwern, DorganBot, WinterSpw, Tortoise3,
VolkovBot, AlnoktaBOT, Andy Dingley, C45207, S.Örvarr.S, Bentogoa, Taemyr, Doctorfluffy, OKBot, Anchor Link Bot, Treekids,
Elassint, ClueBot, The Thing That Should Not Be, Garyzx, Alexbot, OpinionPerson, Mad Tinman, Marc van Leeuwen, Gumum, Silvo-
nenBot, Addbot, Jncraton, IOLJeff, Numbo3-bot, Teles, Jarble, Legobot, Yobot, Gyro Copter, Denispir, AnomieBOT, Materialscien-
tist, LilHelpa, Xqbot, 4twenty42o, Nasnema, GenQuest, SassoBot, Kyng, Charvest, GNRY09, Jordandanford, Jc3s5h, Ptarjan, FoxBot,
TBloemink, Ripchip Bot, EmausBot, RogerofRomsey, GoingBatty, Wikipelli, Slawekb, Nomen4Omen, Dennis714, SporkBot, Under-
rated1 17, Jay-Sebastos, Uuf6429, Cgt, ClueBot NG, Jiri 1984, Fatboar, CanadianMaritimer, Doorknob747, Luke Igoe, Dainomite,
Joydeep, Andrew Helwer, Local.empire, Alfabalon, Frosty, Jochen Burghardt, Pantser, A4b3c2d1e0f, Tentinator, Captain Conundrum,
DavidLeighEllis, Komarov om, Mythas11, Sofia Koutsouveli, IAMBLAQTHOVEN, Stawny, PJ editing and Anonymous: 133
• String searching algorithm Source: http://en.wikipedia.org/wiki/String%20searching%20algorithm?oldid=653943206 Contributors:
Taw, Boleslav Bobcik, B4hand, Nixdorf, Kku, Angela, Poor Yorick, Dcoetzee, Nyxos, Mordomo, Jaredwf, Fredrik, Macrakis, Alves-
trand, Pne, Phe, Watcher, PFHLai, Sam Hocevar, Andreas Kaufmann, Squash, Tristan Schmelcher, Ascánder, Bender235, Plugwash,
BrokenSegue, NJM, Ruud Koot, Mandarax, Shadowhillway, Quuxplusone, Borgx, Rsrikanth05, Neilbeach, Bisqwit, Nils.grimsmo, Nils
Grimsmo, Mikeblas, Tony1, Ms2ger, Netrapt, Thosylve, SmackBot, Rdt~enwiki, TripleF, Mlpkr, MegaHasher, Jafet, CRGreathouse,
Sniffnoy, Szabolcs Nagy, MaxEnt, Ltickett, Shehzad.kazmi, A3RO, Drake Wilson, PhilKnight, Squidonius, Trusilver, Catmoongirl, Ijus-
tam, TXiKiBoT, Webmeischda~enwiki, IdreamofJeanie, OKBot, Kumioko (renamed), Hariva, ClueBot, SummerWithMorons, Excirial,
Jwpat7, Algebran, Addbot, Jarble, Luckas-bot, THEN WHO WAS PHONE?, Ayonggu114ster, AnomieBOT, Sz-iwbot, Materialscientist,
GrouchoBot, RedBot, Dcirovic, KerinthIT, ZéroBot, Jan.papousek, HerrMister, OldCodger2, Jodosma, Dummy6277, Shekharsuman93,
Stefan.Bunk, Moylin4, Anurag.x.singh and Anonymous: 75
• Knuth–Morris–Pratt algorithm Source: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt%20algorithm?oldid=659379745
Contributors: Bryan Derksen, Michael Hardy, Tregoweth, Timwi, Dcoetzee, Ww, Almi, Fibonacci, Jaredwf, Mark T, Elias,
Wikibob, Madoka, Phe, PACO~enwiki, Chadernook, Diagonalfish, Rich Farmbrough, Crescent Moon, Antaeus Feldspar, Olau, Acntx,
Blinken, Krischik, Swift, Bikri, Chester br, Erroneous01, LOL, Ruud Koot, GregorB, Byronknoll, Ryan Reich, VsevolodSipakov, Man-
darax, Quuxplusone, Chobot, Bgwhite, YurikBot, Borgx, Tom Alsberg, KSmrq, Shell Kinney, Zhaladshar, Bruguiea, BOT-Superzerocool,
Cedar101, SmackBot, Mhss, Jon Awbrey, A5b, Curly Turkey, NeilFraser, Ycl6, Pranith, Vanisaac, Amitchaudhary, RainCT, Şamil~enwiki,
Billc.cn, Znora, .anacondabot, Magioladitis, Master.mind, David Eppstein, Raknarf44, Gwern, Glrx, STBotD, LokiClock, MadLex,
Sikuyihsoy, Jeremiah Mountain, Ee19921, Hariva, ClueBot, Jagat sastry, Magicheader, Niceguyedc, Johnuniq, Arlolra, Chucheng, Lit-
tle Mountain 5, Addbot, Javy413, Lightbot, Peni, Yobot, Ptbotgourou, Citation bot, Xqbot, J04n, Smallman12q, FrescoBot, Hobson-
lane, Pratik.mallya, Mikespedia, Wahas1234, Dinamik-bot, Jocapc, Seninp, TjBot, Ripchip Bot, WikitanvirBot, Spencer4Hire, One-
PlusTwelve, Haojin, Mikhail Ryazanov, Adityasinghhhh, Adityasinghhhhh, Winston Chuen-Shih Yang, Andrew Helwer, Xterminatrix,
Tushicomeng, Deltahedron, Hddqsb, Axings, Dmshafi, Wisiti, Mafagafogigante, Ying.l.xiong, Angelababy00, Jason721z and Anonymous:
137
• Boyer–Moore string search algorithm Source: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%20string%20search%20algorithm?oldid=660810615
Contributors: Damian Yerrick, Tim Starling, Booyabazooka, Kayvee, Dcoetzee, Ww, Greenrd, Maximus Rex, Furrykef,
Murray Langton, Fredrik, Moink, Kd4ttc, Tobias Bergemann, Ancheta Wis, DocWatson42, Wikibob, Karnan, Mboverload, Pne,
Beland, Phe, Watcher, Fib, Billlava, Rich Farmbrough, Mathiasl26, YUL89YYZ, Antaeus Feldspar, Jemfinch, Mr flea, RJFJR, Fbriere,
Ruud Koot, Triddle, Ryan Reich, BD2412, Nneonneo, Quuxplusone, Kri, Borgx, Bisqwit, Dpakoha, Jashmenn, Mikeblas, Klutzy,
Eyal0, Ott2, DaveWF, Blueyoshi321, Lt-wiki-bot, SmackBot, JoeMarfice, Kostmo, TripleF, Evgeny Lykhin, Frap, Radagast83, Xil-
lion, Wthrower, Zearin, Freaky Dug, Szabolcs Nagy, Ahuds, Infinito, Tim.head, Thijs!bot, Billyoneal, Alphachimpbot, Martinkunev,
PhilKnight, Abednigo, Gwern, Lisamh, Glrx, Nemo bis, Plindenbaum, STBotD, Icktoofay, Duplicity, Nickjhay, Barry Fruitman, Eep-
peliteloop, Elassint, SummerWithMorons, Thegeneralguy, Rhododendrites, M.O.X, Dekart, Addbot, DOI bot, Alex.mccarthy, Adfellin,
Mi1ror, Tide rolls, Cneubauer, Balabiot, Luckas-bot, Yobot, Ptbotgourou, Lauren Lester, AnomieBOT, TapatioGeek, J12f, Edsarian,
Czlaner, Smallman12q, SeekerOfThePath, Lumpynifkin, Citation bot 1, Biker Biker, Art1x com, Cwalgampaya, JustAHappyCamper,
John of Reading, Brunobowden, Donner60, ChuispastonBot, Neelpulse, Snowgene, ClueBot NG, Jinghaoxu, Kejia, Vacation9, PedR,
Widr, Chokfung, Jy2wong, Aunndroid, Andrew Helwer, ChrisGualtieri, Kucyla, IgushevEdward, Deqing.huang, Jun Furuse, Patrickzzy
and Anonymous: 141

2.2 Images
• File:0321_DNA_Macrostructure.jpg Source: http://upload.wikimedia.org/wikipedia/commons/b/b4/0321_DNA_Macrostructure.jpg
License: CC BY 3.0 Contributors: Anatomy & Physiology, Connexions Web site. http://cnx.org/content/col11496/1.6/, Jun 19, 2013.
Original artist: OpenStax College
• File:DFA_search_mommy.svg Source: http://upload.wikimedia.org/wikipedia/commons/d/d9/DFA_search_mommy.svg
License: Public domain Contributors: ? Original artist: ?
• File:Hamming_distance_3_bit_binary.svg Source: http://upload.wikimedia.org/wikipedia/commons/b/b4/Hamming_distance_3_bit_binary.svg
License: CC-BY-SA-3.0 Contributors: This vector image was created with Inkscape. Original artist: en:User:Cburnett
• File:Question_book-new.svg Source: http://upload.wikimedia.org/wikipedia/en/9/99/Question_book-new.svg
License: Cc-by-sa-3.0 Contributors: Created from scratch in Adobe Illustrator. Based on Image:Question book.png created by
User:Equazcion Original artist: Tkgd2007
• File:Wikibooks-logo-en-noslogan.svg Source: http://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.svg
License: CC BY-SA 3.0 Contributors: Own work Original artist: User:Bastique, User:Ramac et al.

2.3 Content license

• Creative Commons Attribution-Share Alike 3.0

108 pages
Data Sructures and Algorithms
No ratings yet
Data Sructures and Algorithms
112 pages

1.4.4 The Galil Rule

2 Text and image sources, contributors, and licenses

1.1 String (computer science)

1.1.1 Formal theory

Concatenation and substrings

Prefixes and suffixes

Strings admit the following interpretation as nodes on a graph:

• Fixed-length strings can be viewed as nodes on a hypercube
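As a small illustration of the hypercube view, the sketch below assumes a binary alphabet {0, 1} and treats two strings of length n as adjacent when they differ in exactly one position. Both choices are assumptions, since the text does not state the alphabet or the adjacency relation, and the helper name `hypercube_neighbours` is hypothetical.

```python
def hypercube_neighbours(s: str) -> list[str]:
    """Return the binary strings that differ from s in exactly one position."""
    if set(s) - {"0", "1"}:
        raise ValueError("expected a string over the alphabet {0, 1}")
    flip = {"0": "1", "1": "0"}
    # Flipping position i moves along the i-th dimension of the hypercube.
    return [s[:i] + flip[s[i]] + s[i + 1:] for i in range(len(s))]


print(hypercube_neighbours("000"))  # ['100', '010', '001']
```

Under these assumptions each string of length n has exactly n neighbours, which matches the degree of a vertex in an n-dimensional hypercube.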

1.1.2 String datatypes

Null-terminated

Main article: Null-terminated string

1.1.3 Text file strings

See also: String literal

1.1.4 Non-text strings

1.1.5 String processing algorithms

• String searching algorithms for finding a given substring or pattern

• String manipulation algorithms

• Regular expression algorithms

1.1.6 Character string-oriented languages and utilities

1.1.7 Character string functions

See also: Comparison of programming languages (string functions)

1.1.8 String buffers

can be optimized to “waste” as little memory as possible.

1.1.9 See also

• Connection string — passed to a driver to initiate a connection e.g. to a database

• Rope — a data structure for efficiently manipulating long strings

• Bitstring — a string of binary digits

• Incompressible string — a string that cannot be compressed by any algorithm

• Empty string — its properties and representation in programming languages

• String metric — notions of similarity between strings

• string (C++) — overview of C++ string handling

• string.h — overview of C string handling

1.2 String searching algorithm

1.2.1 Basic classification

Single pattern algorithms

Algorithms using a finite set of patterns

• Aho–Corasick string matching algorithm

• Rabin–Karp string search algorithm

Algorithms using an infinite number of patterns

1.2.2 Other classification

Naïve string search
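A minimal sketch of the naïve approach, assuming the usual formulation in which the pattern is tried at every start position of the text and compared character by character. The function name `naive_search` and the choice to return all 0-based match positions are illustrative assumptions.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        # Compare the pattern against the current window, one character at a time.
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
    return matches


print(naive_search("abababa", "aba"))  # [0, 2, 4]
```

In the worst case every window is compared almost to its end, so the running time is proportional to the product of the text length and the pattern length.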

Finite state automaton based search

1.2.3 See also

• Compressed pattern matching

• Approximate string matching

1.2.4 Academic conferences on text searching

1.2.6 External links

1.3 Knuth–Morris–Pratt algorithm

1.3.2 KMP algorithm

• i, denoting the index of the currently considered character in W.

Description of pseudocode for the search algorithm
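The pseudocode itself is not reproduced in this extract. The following is a hedged sketch of the search loop as it is commonly presented, using `i` for the index of the currently considered character in W as above. The names `S` (the text), `m` (the start of the prospective match in S), and `T` (the partial match table, with the convention `T[0] == -1`) are assumptions, and the table is taken as an input rather than computed here.

```python
def kmp_search(S: str, W: str, T: list[int]) -> int:
    """Return the 0-based index of the first occurrence of W in S, or -1.

    T is the partial match table for W, assumed to satisfy T[0] == -1.
    """
    if not W:
        return 0
    m = 0  # start of the prospective match in S
    i = 0  # index of the currently considered character in W
    while m + i < len(S):
        if W[i] == S[m + i]:
            if i == len(W) - 1:
                return m
            i += 1
        elif T[i] > -1:
            # Fall back: the first T[i] characters of W are already known to match.
            m = m + i - T[i]
            i = T[i]
        else:
            # No usable fallback; restart at the next position of S.
            m += 1
            i = 0
    return -1


# Hand-written table for W = "ABCDABD" under the T[0] == -1 convention.
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", [-1, 0, 0, 0, 0, 1, 2]))  # 15
```

Note that `m + i` never decreases, which is why the search never re-reads a character of S that has already been matched.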

Efficiency of the search algorithm

1.3.3 “Partial match” table (also known as “failure function”)

Worked example of the table-building algorithm

Description of pseudocode for the table-building algorithm
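As the pseudocode is not included in this extract, here is a minimal sketch of one common way to build the table. It assumes the convention that `T[0] == -1` and that, for `i > 0`, `T[i]` is the length of the longest proper prefix of W that is also a suffix of the first `i` characters of W. The names `kmp_table`, `pos`, and `cnd` are illustrative.

```python
def kmp_table(W: str) -> list[int]:
    """Build the partial match table for W (convention: T[0] == -1)."""
    if not W:
        return []
    T = [0] * len(W)
    T[0] = -1
    pos = 2  # position of T currently being computed
    cnd = 0  # length of the current candidate prefix
    while pos < len(W):
        if W[pos - 1] == W[cnd]:
            # The candidate prefix extends by one character.
            cnd += 1
            T[pos] = cnd
            pos += 1
        elif cnd > 0:
            # Fall back to the next shorter candidate prefix.
            cnd = T[cnd]
        else:
            # No prefix of W ends here.
            T[pos] = 0
            pos += 1
    return T


print(kmp_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```

Each iteration either advances `pos` or shrinks `cnd`, and `cnd` can shrink no more often than it has grown, so the loop runs in time linear in the length of W.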

Efficiency of the table-building algorithm

1.3.4 Efficiency of the KMP algorithm

1.3.6 See also

• Rabin–Karp string search algorithm

• Aho–Corasick string matching algorithm

1.3.8 External links

• An explanation of the algorithm and sample C++ code by David Eppstein

• Explanation of the algorithm from scratch by FH Flensburg.

• Breaking down steps of running KMP by Chu-Cheng Hsieh.

• NPTELHRD YouTube lecture video

1.4 Boyer–Moore string search algorithm

• S[i] refers to the character at index i of string S, counting from 1.

• A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
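To make the 1-based notation concrete, the sketch below checks whether P occurs at a given alignment. It assumes that n is the length of P and that an alignment k places the last character of P under T[k]; neither is spelled out in this extract. The function name `matches_at` is hypothetical, and the comparison runs from the right end of P, the direction in which Boyer–Moore compares characters.

```python
def matches_at(T: str, P: str, k: int) -> bool:
    """Check whether P equals T[(k-n+1)..k], with k a 1-based index into T."""
    n = len(P)
    if k < n or k > len(T):
        return False
    for offset in range(n):
        # 1-based T[k - offset] and P[n - offset]; subtract 1 for Python's 0-based indexing.
        if T[k - offset - 1] != P[n - offset - 1]:
            return False
    return True


print(matches_at("ANPANMAN", "PAN", 5))  # True
print(matches_at("ANPANMAN", "PAN", 8))  # False
```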

1.4.3 Shift Rules

The Bad Character Rule

Description

Demonstration of bad character rule with pattern NNAAMAN.
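The rule's full statement is not part of this extract. As a hedged sketch of its common formulation: when the text character `c` mismatches the pattern at position `i`, the pattern is shifted right so that the nearest occurrence of `c` to the left of `i` in P lines up with it, or moved entirely past the mismatch if there is none. The function name `bad_character_shift` is hypothetical, and indices here are 0-based for convenience, unlike the 1-based convention used in the text.

```python
def bad_character_shift(P: str, i: int, c: str) -> int:
    """Shift suggested by the bad character rule for a mismatch at P[i] against text character c."""
    # Rightmost occurrence of c strictly to the left of position i, or -1 if there is none.
    j = P.rfind(c, 0, i)
    # If c does not occur, j == -1 and the shift i + 1 moves P past the mismatch.
    return i - j


print(bad_character_shift("NNAAMAN", 6, "A"))  # 1
print(bad_character_shift("NNAAMAN", 4, "N"))  # 3
print(bad_character_shift("NNAAMAN", 6, "X"))  # 7
```

Scanning the pattern on every mismatch keeps the sketch short; practical implementations usually precompute a lookup table so that each shift is found in constant time.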

The Good Suffix Rule