Natural Language Processing: Instructor: Dr. Muhammad Asfand-E-Yar
Lecture 2
If the caret ^ is the first symbol after the open square brace [, the
resulting pattern is negated.
For example, the pattern /[^a]/ matches any single character (including
special characters) except a.
This is only true when the caret is the first symbol after the open
square brace.
If it occurs anywhere else, it usually stands for a literal caret symbol.
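The two roles of the caret inside square brackets can be checked with a minimal sketch in Python's re module. The sample strings and variable names are illustrative assumptions, not part of the lecture.

```python
import re

# Caret first inside the brackets: negation, any single character except "a".
not_a = re.findall(r"[^a]", "banana!")

# Caret anywhere else inside the brackets: a literal caret symbol.
literal_caret = re.findall(r"[a^]", "a^b")

print(not_a)          # ['b', 'n', 'n', '!']
print(literal_caret)  # ['a', '^']
```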
To make a character optional, we use the question mark /?/, which means “the
preceding character or nothing”, that is, zero or one instances of the
previous character.
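A small sketch of the optional character in Python. The pattern colou?r and the test string are assumed examples chosen only for illustration.

```python
import re

# "u?" means zero or one "u", so both spellings match,
# while "colouur" (two u's) does not.
optional_u = re.compile(r"colou?r")
found = optional_u.findall("color colour colouur")

print(found)  # ['color', 'colour']
```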
MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar
Regular Expression
Regular Expressions with Optional Conditions
We can’t use the square brackets to search for “cat or dog”.
Why can’t we say /[catdog]/? Because square brackets match only a single
character from the set, here any one of c, a, t, d, o or g.
We need a new operator, the disjunction operator, also called the pipe
symbol |.
The pattern /cat|dog/ matches
• either the string cat
• or the string dog
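The difference between a character class and the disjunction operator can be seen in a short Python sketch. The sentence and variable names are assumptions made for the example.

```python
import re

sentence = "my cat chased a dog"

# Square brackets: one character at a time from the set {c, a, t, d, o, g}.
char_class = re.findall(r"[catdog]", sentence)

# Pipe: either the whole string "cat" or the whole string "dog".
disjunction = re.findall(r"cat|dog", sentence)

print(char_class)   # a list of single letters
print(disjunction)  # ['cat', 'dog']
```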
Can we specify both guppy and guppies by writing /guppy|ies/?
No, because that would match only the strings guppy and ies.
Regular Expression
Precedence:
This is because sequences like guppy take precedence over the
disjunction operator |.
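The precedence effect can be reproduced in Python. This is a sketch under the assumption that parentheses are used to limit the disjunction to the suffix; the test string is illustrative.

```python
import re

text = "guppy guppies ies"

def hits(pattern, s):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# Sequence binds tighter than |, so this reads as (guppy) or (ies).
without_parens = hits(r"guppy|ies", text)

# Parentheses restrict the disjunction to the ending only.
with_parens = hits(r"gupp(y|ies)", text)

print(without_parens)  # ['guppy', 'ies', 'ies']
print(with_parens)     # ['guppy', 'guppies']
```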
The following table gives the order of RE operator precedence, from highest
precedence to lowest precedence.
Order  Operator type            Examples
1      Parenthesis              ()
2      Counters                 * + ? {}
3      Sequences and anchors    the ^my end$
4      Disjunction              |
/Column [0-9]+ */
will not match any number of columns; instead, it will match a single column
followed by any number of spaces.
The star here applies only to the space that precedes it, not to the whole
sequence.
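A minimal Python sketch of the Column example. It assumes that wrapping the sequence in parentheses is the intended way to repeat the whole pattern, since parentheses have the highest precedence; the sample line is made up.

```python
import re

line = "Column 1 Column 2 Column 3"

# The star applies only to the space: one column plus trailing spaces.
single = re.match(r"Column [0-9]+ *", line).group(0)

# Parentheses make the star apply to the whole sequence.
repeated = re.match(r"(Column [0-9]+ *)*", line).group(0)

print(repr(single))    # 'Column 1 '
print(repr(repeated))  # 'Column 1 Column 2 Column 3'
```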
[Figure: finite-state automaton diagrams with states q0 to q5 and labels a, b and ε.]
Kleene *, Kleene +
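/[tT]he/
Incorrectly returns other or theology.
/\b[tT]he\b/
Matches the only at word boundaries, but \b won’t treat underscores and
numbers as word boundaries, so it misses the in the_ or the25.
For the same reason, /\b99\b/ does not match the 99 in
“There are 299 bottles of beer on the wall”
(since 99 follows a number).
/[^a-zA-Z][tT]he[^a-zA-Z]/
Requires a non-letter on both sides, so it misses the at the start or end of
a line.
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
Also matches the at the start or end of a line.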
False positives (Type I errors): matching strings that we should not have
matched (there, then, other).
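The four patterns for finding the word the can be compared side by side in Python. This is only a sketch: the sample sentence and the helper hits are assumptions, and the matches include the neighbouring non-letter characters that the last two patterns consume.

```python
import re

sample = "The other day the_ cat saw the dog"

def hits(pattern, s):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, s)]

simple = hits(r"[tT]he", sample)                             # also hits "other"
boundary = hits(r"\b[tT]he\b", sample)                       # misses "the_"
non_letter = hits(r"[^a-zA-Z][tT]he[^a-zA-Z]", sample)       # misses line-initial "The"
anchored = hits(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)", sample)

print(simple)      # ['The', 'the', 'the', 'the']
print(boundary)    # ['The', 'the']
print(non_letter)  # [' the_', ' the ']
print(anchored)    # ['The ', ' the_', ' the ']
```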
For example, consider building an application to help a user buy a computer
on the Web.
• The user might want
“any machine with more than 6 GHz and 500 GB of disk space for less
than $1000”.
• To do this kind of retrieval, we first need to analyze expressions like
6 GHz, 500 GB, Mac or $999.99.
The rest of the section analyzes some simple regular expressions for this
task.
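As a rough sketch of what such expressions might look like, the Python patterns below pick out prices, processor speeds and disk sizes. The pattern names and their exact form are assumptions for illustration, not the lecture's own definitions.

```python
import re

# Hypothetical patterns: a dollar amount with optional cents,
# and a number (optionally decimal) followed by a unit.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")
SPEED = re.compile(r"\b[0-9]+(?:\.[0-9]+)? *GHz\b")
DISK = re.compile(r"\b[0-9]+(?:\.[0-9]+)? *GB\b")

query = ("any machine with more than 6 GHz and 500 GB of disk space "
         "for less than $1000")

prices = PRICE.findall(query)
speeds = SPEED.findall(query)
disks = DISK.findall(query)

print(prices)  # ['$1000']
print(speeds)  # ['6 GHz']
print(disks)   # ['500 GB']
```

The non-capturing groups (?:...) are used so that findall returns the whole match rather than only the optional decimal part.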
Counters in braces give explicit repetition counts. For example,
/a\.{24}z/
will match a followed by exactly 24 dots followed by z
/a\.{24,30}z/
will match a followed by 24 to 30 dots followed by z
/a\.{24,}z/
will match a followed by at least 24 dots followed by z
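The three counter forms can be tried out in Python. In this sketch the counters are written without spaces inside the braces, since Python's re module does not treat a form such as {24, 30} as a counter. The helper dotted is a hypothetical name used to build test strings.

```python
import re

def dotted(n):
    """Build 'a', then n dots, then 'z' (hypothetical helper)."""
    return "a" + "." * n + "z"

exactly_24 = re.compile(r"a\.{24}z")
from_24_to_30 = re.compile(r"a\.{24,30}z")
at_least_24 = re.compile(r"a\.{24,}z")

print(bool(exactly_24.fullmatch(dotted(24))))     # True
print(bool(exactly_24.fullmatch(dotted(25))))     # False
print(bool(from_24_to_30.fullmatch(dotted(30))))  # True
print(bool(from_24_to_30.fullmatch(dotted(31))))  # False
print(bool(at_least_24.fullmatch(dotted(100))))   # True
print(bool(at_least_24.fullmatch(dotted(23))))    # False
```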