2 NLP PDF
Chapter 2
RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrences of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
{,m}     up to m occurrences of the previous char or expression
Figure 1.9: Regular expression operators for counting.
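The counting operators can be tried out directly. The following is a minimal sketch using Python's re module; the helper name fully_matches and the sample strings are illustrative assumptions, not part of the original text.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True only when the whole string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r"ba*", "b"))          # True:  * allows zero a's
print(fully_matches(r"ba+", "b"))          # False: + needs at least one a
print(fully_matches(r"colou?r", "color"))  # True:  ? makes the u optional
print(fully_matches(r"a{3}", "aaa"))       # True:  exactly three a's
print(fully_matches(r"a{2,4}", "aaaaa"))   # False: five a's exceeds the upper bound
print(fully_matches(r"a{2,}", "aaaaa"))    # True:  at least two a's
print(fully_matches(r"a{,2}", "aaa"))      # False: at most two a's
```

re.fullmatch is used here rather than re.search so that each result reflects the count itself and not a shorter substring that happens to match.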
Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Figure 1.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like .,
*, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).
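To see what backslash escaping changes, here is a short sketch in Python. The sample string and variable names are made up for illustration; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# The Python literal "C:\\dir" contains a single backslash character.
text = "a.b axb a*b [x] C:\\dir"

any_char = re.findall(r"a.b", text)      # unescaped . matches any character
literal_dot = re.findall(r"a\.b", text)  # escaped . matches only a period
literal_star = re.findall(r"a\*b", text)
literal_bracket = re.findall(r"\[", text)
literal_backslash = re.findall(r"\\", text)

print(any_char)           # ['a.b', 'axb', 'a*b']
print(literal_dot)        # ['a.b']
print(literal_star)       # ['a*b']
print(literal_bracket)    # ['[']
print(len(literal_backslash))  # 1

# \t in a pattern refers to the tab character.
columns = re.split(r"\t", "col1\tcol2")
print(columns)            # ['col1', 'col2']

# re.escape builds the escaped form of a literal string.
escaped = re.escape("a.b*")
print(escaped)            # a\.b\*
```

Raw strings (the r"..." prefix) keep Python itself from interpreting the backslashes before the regex engine sees them.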
GHz or 500 GB or Mac or $999.99. In the rest of this section, we’ll work out some
simple regular expressions for this task.
First, let’s complete our regular expression for prices. Here’s a regular expression
for a dollar sign followed by a string of digits:
/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
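Not every engine makes this guess. As a hedged illustration, Python's re module always treats an unescaped $ as an anchor, so the dollar sign has to be escaped there. The sample sentence below is an assumption made for the example.

```python
import re

text = "It costs $999 today"

# Unescaped: $ anchors at the end of the string, so digits can never follow it.
unescaped = re.search(r"$[0-9]+", text)

# Escaped: \$ matches a literal dollar sign.
escaped = re.search(r"\$[0-9]+", text)

print(unescaped)        # None
print(escaped.group())  # $999
```

When porting the patterns in this section to a particular tool, it is worth checking how that tool handles $ before relying on context-sensitive behavior.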
Now we just need to deal with fractions of dollars. We’ll add a decimal point and
two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary:
/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/
One last catch! This pattern allows prices like $199999.99 which would be far too
expensive! We need to limit the dollars:
/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
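The finished price pattern can be sketched in Python as follows. Three choices here are assumptions of this sketch rather than part of the text: the dollar sign is written \$ because Python requires it, a capturing group isolates the price from the preceding non-word character, and the digit count uses a lower bound of 1, since a lower bound of 0 would also let the pattern match a bare dollar sign that is followed by digits. The names PRICE and find_price are hypothetical.

```python
import re

# (?:^|\W)  start of string or a non-word character before the price
# \$        literal dollar sign
# [0-9]{1,3}             one to three dollar digits (this sketch's bounds)
# (?:\.[0-9][0-9])?      optional cents
# \b        word boundary after the last digit
PRICE = re.compile(r"(?:^|\W)(\$[0-9]{1,3}(?:\.[0-9][0-9])?)\b")

def find_price(text: str):
    """Return the first price found in text, or None if there is none."""
    match = PRICE.search(text)
    return match.group(1) if match else None

print(find_price("$199.99"))          # $199.99
print(find_price("only $199 today"))  # $199
print(find_price("$199999.99"))       # None
print(find_price("$1999"))            # None
```

The trailing \b is what rejects the long numbers: after three digits the next character is another digit, so no word boundary is found there.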
How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero or
more spaces” since there might always be extra spaces lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
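The disk space pattern carries over to Python unchanged, since it contains no dollar sign. A minimal sketch, with the names DISK_SPACE and find_disk_space and the sample strings chosen only for illustration:

```python
import re

DISK_SPACE = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

def find_disk_space(text: str):
    """Return the first disk-space expression in text, or None."""
    match = DISK_SPACE.search(text)
    return match.group() if match else None

print(find_disk_space("a 5.5 GB disk"))      # 5.5 GB
print(find_disk_space("500GB drive"))        # 500GB
print(find_disk_space("holds 2 gigabytes"))  # 2 gigabytes
print(find_disk_space("16   Gigabyte"))      # 16   Gigabyte
print(find_disk_space("500 MB"))             # None
```

Note that the pattern as written accepts only GB in capitals, so a lowercase "gb" is not matched.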
SE-507 Natural Language Processing
Edit Distance