String Processing
Lê Sỹ Vinh
Computational Science and Engineering
Email: [email protected]
Outline
• String matching
• Regular expression
String
• A string is an array of characters.
For example: S = “Matching is a string algorithm”
• A substring is a contiguous part of a string.
Example: s = “a string” is a substring of S.
• A prefix string is a substring of S that includes the first character of S.
Example: S = “Algorithm”
Prefixes of S: A, Al, Alg, ..., Algorithm
• A suffix string is a substring of S that includes the last character of S.
Example: S = “Algorithm”
Suffixes of S: m, hm, thm, ithm, ..., Algorithm
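The definitions above can be checked with Python slicing. This is only a small sketch; the helper names prefixes and suffixes are illustrative and not part of the slides.

```python
def prefixes(s):
    """Return all non-empty prefixes of s, shortest first."""
    return [s[:i] for i in range(1, len(s) + 1)]


def suffixes(s):
    """Return all non-empty suffixes of s, shortest first."""
    return [s[i:] for i in range(len(s) - 1, -1, -1)]


S = "Algorithm"
print(prefixes(S)[:3])   # ['A', 'Al', 'Alg']
print(suffixes(S)[:3])   # ['m', 'hm', 'thm']

# The "in" operator tests whether one string is a substring of another.
print("a string" in "Matching is a string algorithm")  # True
```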
String matching problem
Problem: Given a short string P (the pattern) and a long string S (the text), determine
whether the pattern P appears in the text S.
Example:
• S = “Hello to string algorithms”
• P = “algorithm”
Naïve string matching
Moving from the beginning to the end of the text S, check at each position whether the
pattern P appears starting at that position.
Algorithm Naïve (P, S):
    Let m be the length of S
    Let n be the length of P
    For x from 0 to m – n do
        if P = S[x…(x + n – 1)]:
            return “P in S”
    return “P not in S”
Complexity: O(mn) in the worst case
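The pseudocode above translates almost line by line into Python. In this sketch the function name naive_match is illustrative, and, as an assumption that differs from the pseudocode, it returns the first matching position (or -1) instead of a message, since the position is usually more useful.

```python
def naive_match(P, S):
    """Return the first index x where P occurs in S, or -1 if P is not in S."""
    m, n = len(S), len(P)
    for x in range(m - n + 1):      # x = 0 .. m - n
        if S[x:x + n] == P:         # compare P with S[x .. x + n - 1]
            return x
    return -1


print(naive_match("algorithm", "Hello to string algorithms"))  # 16
print(naive_match("regex", "Hello to string algorithms"))      # -1
```

Each of the m - n + 1 positions may cost up to n character comparisons, which is where the O(mn) bound comes from.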
Knuth Morris Pratt Algorithm
Idea: Whenever a mismatch occurs, we shift the pattern as far as possible to avoid
redundant comparisons.
Complexity: O(m+n)
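One common way to realize this idea is a precomputed failure table that says how far the pattern can be shifted after a mismatch. The sketch below assumes that standard formulation; the names failure_table and kmp_search are illustrative and not taken from the slides.

```python
def failure_table(P):
    """fail[i] = length of the longest proper prefix of P[:i+1]
    that is also a suffix of P[:i+1]."""
    fail = [0] * len(P)
    k = 0                               # length of the current matched prefix
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = fail[k - 1]             # fall back to a shorter prefix
        if P[i] == P[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(P, S):
    """Return the first index where P occurs in S, or -1 if it does not."""
    if not P:
        return 0
    fail = failure_table(P)
    k = 0                               # pattern characters matched so far
    for i, c in enumerate(S):
        while k > 0 and c != P[k]:
            k = fail[k - 1]             # shift the pattern, never move back in S
        if c == P[k]:
            k += 1
        if k == len(P):
            return i - k + 1            # start position of the match
    return -1


print(failure_table("abab"))                                  # [0, 0, 1, 2]
print(kmp_search("algorithm", "Hello to string algorithms"))  # 16
```

The index into S never moves backwards, and the total number of fallbacks is bounded by the number of advances, which gives the O(m+n) bound.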
Exercises on string
• Given a string, write an algorithm to determine all
duplicate words in the string.
• Given a string, write an algorithm to check if it
contains only digits
Regular expression
Problem: How can we find patterns such as email addresses or URLs in a string or
text?
• A regular expression (regex) defines a pattern of characters with conditions:
Examples:
• “regular expression” matches exactly the text “regular expression”
• “oo+h!” matches “ooh!”, “oooh!”, “ooooh!”, etc.
• “colou?r” matches “color” or “colour”
• “beg.n” matches begin, began, begun, etc.
• The search pattern can be anything from a single character or a fixed string to a
complex expression containing special characters.
• The pattern defined by the regex may match one or several times or not at all for a
given string.
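These examples can be tried with Python's built-in re module (choosing Python here is an assumption; the slides do not name a language). re.fullmatch tests a whole string against a pattern, while re.findall shows that a pattern may match several times or not at all.

```python
import re

# re.fullmatch succeeds only if the whole string matches the pattern.
print(bool(re.fullmatch(r"oo+h!", "ooooh!")))    # True
print(bool(re.fullmatch(r"oo+h!", "oh!")))       # False: needs at least two o's
print(bool(re.fullmatch(r"colou?r", "colour")))  # True
print(bool(re.fullmatch(r"beg.n", "begun")))     # True

# re.findall returns every non-overlapping match in the text.
print(re.findall(r"beg.n", "I began before they begin"))  # ['began', 'begin']
print(re.findall(r"beg.n", "nothing here"))               # []
```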
Common matching symbols
Regular expression  Description                                       Example
.                   Matches any single character                      /beg.n/ => “begin”, “began”, “begun”
^regex              regex must match at the beginning of the string   /^sit/ => “site”, “sitcom” but not “visit”, “deposit”
regex$              regex must match at the end of the string         /ext$/ => “next”, “context” but not “extra”, “extent”
[abc]               Matches either a, b, or c                         /[fg]un/ => “fun”, “gun”
[^abc]              Matches any character except a, b, or c           /[^fg]un/ => “run”, “sun”
[1-9]               Matches any digit from 1 to 9                     /any[1-9]/ => “any1”, “any2”
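The anchors and character classes above can be checked against a small word list. This sketch uses Python's re.search, which looks for the pattern anywhere in the string; the helper name matching and the word list are illustrative.

```python
import re

words = ["site", "sitcom", "visit", "next", "extra",
         "fun", "gun", "run", "any1", "any0"]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [w for w in candidates if re.search(pattern, w)]


print(matching(r"^sit", words))      # ['site', 'sitcom']
print(matching(r"ext$", words))      # ['next']
print(matching(r"[fg]un", words))    # ['fun', 'gun']
print(matching(r"[^fg]un", words))   # ['run']
print(matching(r"any[1-9]", words))  # ['any1']
```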
Meta characters
Regular expression  Description                                       Example
\d                  Any digit, short for [0-9]                        /\d\d/ => “01”, “02” … “99”
\D                  A non-digit, short for [^0-9]                     /c\Dt/ => “cat”, “cut” but not “c4t”
\s                  A whitespace character                            /get\sup/ => “get up”
\w                  A word character, short for [a-zA-Z0-9_]          /h\wt/ => “hAt”, “hot”, “h0t”, “h1t”
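A short sketch of the meta characters in Python's re module. One caveat when using Python 3 string patterns: \d, \s and \w are Unicode-aware by default, so they accept more than the ASCII ranges listed above; passing the re.ASCII flag restricts them to those ranges.

```python
import re

print(re.findall(r"\d\d", "Room 42, floor 07"))   # ['42', '07']
print(bool(re.fullmatch(r"c\Dt", "cat")))          # True
print(bool(re.fullmatch(r"c\Dt", "c4t")))          # False: 4 is a digit
print(re.split(r"\s+", "get  up\tnow"))            # ['get', 'up', 'now']
print(re.findall(r"h\wt", "hAt hot h0t h-t"))      # ['hAt', 'hot', 'h0t']
```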
Quantifiers
Regular expression  Description                                       Example
regex*              regex occurs zero or more times                   /buz*/ => “bu”, “buz”, “buzz”, “buzzzzzz”
regex+              regex occurs one or more times                    /lo+ng/ => “long”, “loooooong” but not “lng”
regex?              regex occurs zero or one time                     /colou?r/ => “color”, “colour”
regex{X}            regex occurs exactly X times                      /\d{3}/ => “016”, “752”
regex{X,Y}          regex occurs between X and Y times                /\w{3,4}/ => “int”, “long” but not “double” (as a whole string)
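The quantifier examples hold when the whole string must match, so this sketch uses re.fullmatch; the helper name full is illustrative. The last line shows why that matters: a plain search still finds a 4-character piece inside “double”.

```python
import re


def full(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print([w for w in ["bu", "buz", "buzzzz", "b"] if full(r"buz*", w)])
# ['bu', 'buz', 'buzzzz']
print([w for w in ["long", "loooong", "lng"] if full(r"lo+ng", w)])
# ['long', 'loooong']
print([w for w in ["color", "colour", "colouur"] if full(r"colou?r", w)])
# ['color', 'colour']
print([w for w in ["016", "75", "7520"] if full(r"\d{3}", w)])
# ['016']
print([w for w in ["int", "long", "double"] if full(r"\w{3,4}", w)])
# ['int', 'long']

# Without anchoring, the pattern matches part of the longer word.
print(re.search(r"\w{3,4}", "double").group())  # doub
```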
Examples
Regular expression for a password
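A possible sketch in Python. The password policy is an assumption made only for illustration (at least 8 characters, with at least one lowercase letter, one uppercase letter and one digit); a real policy may differ. It uses lookaheads, written (?=...), which check a condition without consuming characters and are not covered in the tables above.

```python
import re

# Hypothetical policy (an assumption): at least 8 characters, with at least
# one lowercase letter, one uppercase letter and one digit.
PASSWORD_RE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).{8,}")


def is_valid_password(text):
    """True if text satisfies the assumed policy."""
    return PASSWORD_RE.fullmatch(text) is not None


print(is_valid_password("Passw0rdX"))  # True
print(is_valid_password("password"))   # False: no uppercase letter, no digit
print(is_valid_password("Sh0rt"))      # False: fewer than 8 characters
```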
Regular expression for an email
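A deliberately simplified sketch in Python. The pattern below is an assumption for illustration: it accepts common addresses of the form name@domain.tld and does not try to cover everything the email standards allow.

```python
import re

# Simplified pattern (an assumption): local part, "@", one or more
# dot-separated domain labels, then a top-level domain of 2+ letters.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def is_email(text):
    """True if the whole text looks like an email address."""
    return EMAIL_RE.fullmatch(text) is not None


print(is_email("user@example.com"))             # True
print(is_email("first.last@mail.example.org"))  # True
print(is_email("user@localhost"))               # False: no top-level domain
print(is_email("no-at-sign.com"))               # False
```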
Regular expression for a URL
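A simplified sketch in Python for finding http and https URLs in text. The pattern is an assumption for illustration, not a full URL grammar: it expects a scheme, a host, an optional port and an optional path, and it will also pick up punctuation that directly follows a path. The (?:...) groups are non-capturing, so re.findall returns whole matches.

```python
import re

# Simplified pattern (an assumption): scheme, host, optional port,
# optional path made of non-whitespace characters.
URL_RE = re.compile(r"https?://[A-Za-z0-9.-]+(?::[0-9]+)?(?:/\S*)?")


def find_urls(text):
    """Return all substrings of text that look like http(s) URLs."""
    return URL_RE.findall(text)


text = "See https://example.com/docs and http://localhost:8080/ for details"
print(find_urls(text))
# ['https://example.com/docs', 'http://localhost:8080/']
```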
Regular expression for an IP address
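A sketch in Python, assuming the slide refers to IPv4 addresses in dotted decimal form: four numbers from 0 to 255 separated by dots, with no leading zeros. The names OCTET and is_ipv4 are illustrative.

```python
import re

# One number from 0 to 255 without leading zeros:
# 250-255 | 200-249 | 100-199 | 0-99
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_RE = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")


def is_ipv4(text):
    """True if the whole text is a dotted-decimal IPv4 address."""
    return IPV4_RE.fullmatch(text) is not None


print(is_ipv4("192.168.0.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False: 256 is out of range
print(is_ipv4("1.2.3"))            # False: only three numbers
```

The simpler pattern \d{1,3}(\.\d{1,3}){3} is shorter but would also accept out-of-range values such as 999.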
Regular expression for a variable
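A sketch in Python, assuming the usual naming rule of C-like languages: a variable name starts with an ASCII letter or underscore and continues with letters, digits or underscores. The pattern does not exclude reserved keywords.

```python
import re

# Assumed rule: first character is a letter or "_",
# the rest are letters, digits or "_".
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_variable_name(text):
    """True if text is a valid name under the assumed rule."""
    return IDENTIFIER_RE.fullmatch(text) is not None


print(is_variable_name("count"))    # True
print(is_variable_name("_tmp1"))    # True
print(is_variable_name("2fast"))    # False: starts with a digit
print(is_variable_name("my-var"))   # False: "-" is not allowed
```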