Lecture19 12PM
15-123
Systems Skills in C and Unix
Case for regular expressions
• Many web applications require pattern
matching
– look for <a href> tag for links
– Token search
• A regular expression
– A pattern that defines a class of strings
– Special syntax used to represent the class
• E.g., *.c: any string that ends with .c (this is shell wildcard syntax; the equivalent regex is .*\.c$)
Formal Languages
• A formal language consists of
– An alphabet
– A formal grammar
– Web searches
Regex Engine
• Software that processes a string to find regex
matches.
• A regex engine is usually part of a larger piece of
software
– grep, awk, sed, PHP, Python, Perl, Java, etc.
• We can write our own regex engine that
recognizes every “caa” in a string
– See democode folder
• Different regex engines may not be compatible
with each other
– Perl 5 is a popular one to learn
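The demo code folder is not reproduced in these notes, so the following is only a sketch of what a hand-written matcher for “caa” might look like. The name find_caa and the three-state design are assumptions made for illustration, not the course's demo code.

```python
def find_caa(text):
    """Return the start index of every "caa" in text, using a tiny state machine."""
    matches = []
    state = 0  # how many characters of "caa" have been matched so far
    for i, ch in enumerate(text):
        if state == 0:
            state = 1 if ch == "c" else 0
        elif state == 1:            # seen "c"
            if ch == "a":
                state = 2
            elif ch == "c":         # a new "c" restarts the match
                state = 1
            else:
                state = 0
        else:                       # state == 2, seen "ca"
            if ch == "a":
                matches.append(i - 2)   # "caa" started two characters back
                state = 0
            elif ch == "c":
                state = 1
            else:
                state = 0
    return matches
```

For example, find_caa("caacaa") returns [0, 3]. A general regex engine builds a state machine like this automatically from the pattern instead of having it written by hand.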
Regex machines
• Perl can do a “decent” job with simple regexes
• But it can fail in cases where expressions can
be of the form ____________
• One of the best regex machines was written in
C by Ken Thompson in the 70’s
– 400 lines of C code
– Superior to perl, python and other
implementations when working with real world
applications
Unix grep utility
The grep command
Simple grep examples
• grep "<a href" guna.html > output.txt
• ls | grep "guna"
• grep 'regex' filename
• man grep
– For more info
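To make the behaviour of these commands concrete, here is a rough Python sketch of the core of grep: keep every line that contains a match. The function name grep and the sample file names are made up for illustration. Python's re syntax also differs from grep's basic syntax in places (for example, grouping is written ( ) rather than \( \)).

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern, roughly like `grep pattern`."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, as grep does
    return [line for line in lines if regex.search(line)]

# Hypothetical directory listing, standing in for the output of `ls`
listing = ["guna.html", "notes.txt", "guna_lab.c"]
matches = grep("guna", listing)
```

Here matches holds the two names that contain "guna", which is what ls | grep "guna" would print for the same listing.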
Regex grammar
Regular Expression Grammar
• A regex grammar defines a set of rules for
describing patterns. Its main categories are
– Alternation
– Grouping
– Quantification
Regular Expression Grammar
• Alternation
– The vertical bar describes a choice among two or
more alternatives.
– The notation a | b | c indicates that we can choose
a or b or c as part of the string.
– Another example: “(c|s)at” describes the
strings “cat” and “sat”.
Regular Expression Grammar
• Grouping
– Parentheses describe the scope
and precedence of operators.
– In “(c|s)at”, the group (c|s) indicates that the string
begins with either c or s, which must be
immediately followed by “at”
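A short sketch of why the parentheses matter, using Python's re module (one engine among many; the variable names are illustrative). Without grouping, the vertical bar splits the whole pattern, so c|sat means “c” or “sat”.

```python
import re

grouped = re.compile(r"(c|s)at")    # "c" or "s", then "at"
ungrouped = re.compile(r"c|sat")    # either "c" alone, or "sat"

# fullmatch() succeeds only if the entire string matches the pattern
grouped_results = {s: grouped.fullmatch(s) is not None
                   for s in ["cat", "sat", "at", "c"]}
ungrouped_results = {s: ungrouped.fullmatch(s) is not None
                     for s in ["cat", "sat", "at", "c"]}
```

The grouped pattern accepts "cat" and "sat" only, while the ungrouped pattern accepts "sat" and "c" but rejects "cat".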
Regular Expression Grammar
• Quantification
– Quantification is the notation used to define how
many times an expression may repeat in the string.
• The most common quantifiers are
– ?, * and +
– The “?” indicates that zero or one of the
previous expression can be accepted.
– The “*” indicates that zero or more of the previous
expression can be accepted.
– The “+” indicates that one or more of the previous
expression can be accepted.
Examples of *, ? , +
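The three quantifiers can be tried out with a small helper. The sketch below uses Python's re module, and the helper name full and the sample patterns are chosen for illustration.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# ? : zero or one "u"
optional = [full(r"colou?r", s) for s in ["color", "colour", "colouur"]]

# * : zero or more "b"
star = [full(r"ab*c", s) for s in ["ac", "abc", "abbbc"]]

# + : one or more "b"
plus = [full(r"ab+c", s) for s in ["ac", "abc", "abbbc"]]
```

The only difference between ab*c and ab+c is the string "ac": the star accepts it because zero copies of b are allowed, and the plus rejects it.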
Other facts
• . matches any single character
• .* matches any string (zero or more characters)
• [a-zA-Z]* matches any string of alphabetic
characters
• [ag].* matches any string that starts with a or g
• [a-d].* matches any string that starts with a, b, c or
d
• ^(ab) matches any string that begins with ab. In
general, to match all lines that begin with a given
string, use ^string
• (ab)$ matches any string that ends with ab
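A few of these facts, checked against a made-up word list with Python's re module. The words and variable names are illustrative only.

```python
import re

words = ["abacus", "cab", "grab", "dog", "gab"]

# ^ anchors the match to the start of the string, $ to the end
starts_ab = [w for w in words if re.search(r"^(ab)", w)]
ends_ab = [w for w in words if re.search(r"(ab)$", w)]

# [ag].* : first character is a or g, then anything
starts_a_or_g = [w for w in words if re.fullmatch(r"[ag].*", w)]

# [a-zA-Z]* : only alphabetic characters
all_alpha = [re.fullmatch(r"[a-zA-Z]*", s) is not None
             for s in ["Hello", "Hello1"]]
```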
Finding non-matches
• To exclude a pattern
– [^class] matches any single character that is not in the class
– E.g., [^0-9] matches any character that is not a digit
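A small sketch of a negated class in use, with Python's re module. The sample strings are invented for illustration.

```python
import re

# Collect every character that is NOT a digit
non_digits = re.findall(r"[^0-9]", "a1b22c")

# Delete every non-digit, leaving only the digits of a phone-style string
digits_only = re.sub(r"[^0-9]", "", "412-268-2000")
```

Note that ^ means “not” only as the first character inside square brackets. Outside brackets it anchors the match to the start of the line.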
Group Matches
– grep '<h\([1-4]\)>.*h\([1-3]\)>' filename
• What patterns match?
– grep 'h\([1-4]\).*h\1' filename
• Back-reference: \1 matches the same text that the first group \( \) matched
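The two grep patterns can be compared in Python, where groups are written ( ) instead of \( \). This is a sketch with made-up input lines. The first pattern has two independent groups, and the second reuses whatever digit the first group captured.

```python
import re

independent = re.compile(r"<h([1-4])>.*h([1-3])>")  # two unrelated groups
backref = re.compile(r"h([1-4]).*h\1")              # \1 repeats group 1's text

matched = "<h2>Title</h2>"
mismatched = "<h2>Title</h3>"

independent_hits = [independent.search(s) is not None
                    for s in [matched, mismatched]]
backref_hits = [backref.search(s) is not None
                for s in [matched, mismatched]]
captured = backref.search(matched).group(1)
```

The independent pattern accepts both lines because its second group may be any digit from 1 to 3. The back-reference accepts only the line whose closing tag has the same level as its opening tag.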
Character Classes
• \d digit [0-9]
• \D non-digit [^0-9]
• \w word character [0-9a-z_A-Z]
• \W non-word character [^0-9a-z_A-Z]
• \s a whitespace character [ \t\n\r\f]
• \S a non-whitespace character [^ \t\n\r\f]
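A sketch of the shorthand classes in Python's re module. In Python 3, \d and \w also match non-ASCII digits and letters by default, so the re.ASCII flag is passed here to keep them close to the bracket forms listed above. The sample strings are invented.

```python
import re

# \d+ : runs of digits
numbers = re.findall(r"\d+", "15-123 meets at 12PM", flags=re.ASCII)

# \w+ : runs of word characters (letters, digits, underscore)
tokens = re.findall(r"\w+", "Systems Skills, in C!", flags=re.ASCII)

# \s+ : split on any run of whitespace
fields = re.split(r"\s+", "a  b\tc\nd", flags=re.ASCII)
```

Not every engine supports these shorthands. Where one is unavailable, the bracket form on the right of each line above is the portable spelling.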
More regex notation
• {n,m} matches the previous expression at least n but not more than m times
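A brief sketch of the {n,m} quantifier with Python's re module, using invented sample strings.

```python
import re

# a{2,3} : two or three copies of "a"
counts = [re.fullmatch(r"a{2,3}", s) is not None
          for s in ["a", "aa", "aaa", "aaaa"]]

# \d{2,3} : runs of two or three digits; matching is greedy,
# so a run of four digits yields its first three
runs = re.findall(r"\d{2,3}", "1 22 333 4444")
```

In grep's basic syntax the braces are escaped, as in a\{2,3\}, while grep -E and Python use the unescaped form shown here.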