2 NLP PDF

This document discusses regular expressions (REs) for natural language processing tasks. It provides examples of REs to find instances of the word "the" accounting for capitalization and word boundaries. More complex REs are given to extract prices with dollars and cents, and disk space amounts with optional fractions. The document also outlines common RE operators for matching characters, counting occurrences, and escaping special characters.

Uploaded by

Sherry Adan Off
SE-507 Natural Language Processing

Chapter 2

6.3 A Simple Example


Suppose we wanted to write a RE to find cases of the English article the. A simple
(but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So, we need to specify that we want instances with a word
boundary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since /\b/
won’t treat underscores and numbers as word boundaries; but we might want to
find the in some context where it might also have underlines or numbers nearby
(the_ or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [^a-zA-Z], which we
used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specifying
that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
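
The progression above can be tried out directly. The following is a minimal sketch, assuming Python's re module (the text itself does not name a language); the helper found and the sample strings are illustrative only. Note that in Python ^ and $ anchor to the whole string unless the re.MULTILINE flag is passed.

```python
import re

# The three patterns from the text, without the surrounding slashes.
NAIVE = re.compile(r"the")
BOUNDARY = re.compile(r"\b[tT]he\b")
NON_ALPHA = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

def found(pattern, text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None

for sample in ["The cat", "other", "the25", "see the_ here"]:
    print(sample, found(NAIVE, sample), found(BOUNDARY, sample), found(NON_ALPHA, sample))
```

On these samples the naive pattern misses "The cat" and wrongly accepts "other"; the \b version rejects "the25" and "the_" because digits and underscores count as word characters; the final pattern accepts both while still rejecting "other".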

6.4 More operators


Figure 1.8 shows some aliases for common ranges, which can be used mainly to
save typing. Besides the Kleene * and Kleene + we can also use explicit numbers
as counters, by enclosing them in curly brackets. The regular expression /{3}/
means “exactly 3 occurrences of the previous character or expression”. So
/a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by
23 or 25 dots followed by a z).
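
As a quick check of the counter, here is a small sketch, again assuming Python's re module. The helper name matches_exactly is made up for illustration; fullmatch is used so that the whole string must fit the pattern.

```python
import re

# "a", then exactly 24 literal dots, then "z".
PATTERN = re.compile(r"a\.{24}z")

def matches_exactly(n_dots):
    """Build 'a' + n_dots dots + 'z' and test it against the pattern."""
    return PATTERN.fullmatch("a" + "." * n_dots + "z") is not None

print(matches_exactly(23), matches_exactly(24), matches_exactly(25))
```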

RE    Expansion       Match                          First Matches

\d    [0-9]           any digit                      Party of 5
\D    [^0-9]          any non-digit                  Blue moon
\w    [a-zA-Z0-9_]    any alphanumeric/underscore    Daiyu
\W    [^\w]           a non-alphanumeric             !!!!
\s    [ \r\t\n\f]     whitespace (space, tab)
\S    [^\s]           non-whitespace                 in Concord
Figure 1.8: Aliases for common sets of characters.
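
The aliases can be checked against the example strings in the figure. This is a sketch assuming Python's re module; first_match is a hypothetical helper. The re.ASCII flag is an assumption made here so that \d, \w and \s stay close to the ASCII expansions listed above (by default Python also matches Unicode digits and letters, and its \s additionally includes the vertical tab).

```python
import re

def first_match(pattern, text):
    """Return the first substring matched by pattern, or None (ASCII-only classes)."""
    m = re.search(pattern, text, flags=re.ASCII)
    return m.group(0) if m else None

print(first_match(r"\d", "Party of 5"))        # the digit
print(first_match(r"\D", "Blue moon"))         # first non-digit
print(first_match(r"\w", "Daiyu"))             # first word character
print(first_match(r"\W", "!!!!"))              # first non-word character
print(repr(first_match(r"\s", "in Concord")))  # the space
print(first_match(r"\S", "in Concord"))        # first non-space
```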

A range of numbers can also be specified. So /{n,m}/ specifies from n to m
occurrences of the previous char or expression, and /{n,}/ means at least n
occurrences of the previous expression. REs for counting are summarized in Figure
1.9.

RE Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? zero or one occurrence of the previous char or expression
{n} n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 1.9: Regular expression operators for counting.
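
Each counting operator can be exercised with a one-line test. The sketch below assumes Python's re module, which accepts all of the forms in the figure, including {,m}; some other regex engines do not support {,m}. The helper full and the test strings are illustrative.

```python
import re

def full(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r"ba*", "b"), full(r"ba*", "baaa"))            # * allows zero or more a's
print(full(r"ba+", "b"), full(r"ba+", "ba"))              # + needs at least one a
print(full(r"colou?r", "color"), full(r"colou?r", "colour"))
print(full(r"a{3}", "aaa"), full(r"a{3}", "aa"))
print(full(r"a{2,3}", "aaa"), full(r"a{2,3}", "aaaa"))
print(full(r"a{2,}", "aaaa"), full(r"a{2,}", "a"))
print(full(r"a{,2}", ""), full(r"a{,2}", "aaa"))
```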

Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Figure 1.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like .,
*, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).

RE    Match              First Matches

\*    an asterisk “*”    “K*A*P*L*A*N”
\.    a period           “Dr. Livingston, I presume”
\?    a question mark    “Why don’t they come and lend a hand?”
\n    a newline
\t    a tab

Figure 1.10: Some characters that need to be backslashed.
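
The backslashed forms behave as follows in practice. This sketch assumes Python's re module; re.escape is a standard-library function that adds the backslashes automatically, which can help when a literal string has to be embedded in a larger pattern.

```python
import re

# Escaped metacharacters match themselves literally.
asterisks = re.findall(r"\*", "K*A*P*L*A*N")
period_at = re.search(r"\.", "Dr. Livingston, I presume").start()
question_at = re.search(r"\?", "Why don't they come and lend a hand?").start()

# re.escape backslashes every special character in a literal string.
literal = "$9.99"
escaped = re.escape(literal)

print(len(asterisks), period_at, question_at)
print(escaped, re.fullmatch(escaped, literal) is not None)
```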

6.5 A More Complex Example


Let’s try out a more significant example of the power of REs. Suppose we want to
build an application to help a user buy a computer on the Web. The user might want
“any machine with at least 6 GHz and 500 GB of disk space for less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6
GHz or 500 GB or Mac or $999.99. In the rest of this section, we’ll work out some
simple regular expressions for this task.

First, let’s complete our regular expression for prices. Here’s a regular expression
for a dollar sign followed by a string of digits:

/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)

Now we just need to deal with fractions of dollars. We’ll add a decimal point and
two digits afterwards:

/$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary:

/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/

One last catch! This pattern allows prices like $199999.99 which would be far too
expensive! We need to limit the dollars:

/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
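
A runnable version of the price pattern might look like the sketch below, assuming Python's re module. Two details differ from the pattern as printed and are assumptions of this sketch, not part of the text: the dollar sign is escaped as \$, because Python treats an unescaped $ as an end-of-string anchor wherever it appears, and the counter is {1,3} rather than {0,3}, because with zero digits allowed the pattern can still match a bare "$" at the front of a long number. The name find_prices is hypothetical.

```python
import re

# The capture group holds the price itself, without the leading non-word character.
PRICE = re.compile(r"(?:^|\W)(\$[0-9]{1,3}(?:\.[0-9][0-9])?)\b")

def find_prices(text):
    """Return every price of at most three dollar digits found in text."""
    return PRICE.findall(text)

print(find_prices("for less than $999.99 or $199"))
print(find_prices("$199999.99"))
```

The first call returns both prices; the second returns an empty list, since no run of one to three digits after the dollar sign ends at a word boundary.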

How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero or
more spaces” since there might always be extra spaces lying around:

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
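
The disk-space pattern can be used unchanged. Here is a minimal sketch, assuming Python's re module; find_disk_sizes and the sample sentence are illustrative. It uses finditer rather than findall because the pattern contains groups, and the whole match is what we want back.

```python
import re

DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

def find_disk_sizes(text):
    """Return the full text of every disk-space expression in text."""
    return [m.group(0) for m in DISK.finditer(text)]

print(find_disk_sizes("at least 500 GB, or 5.5 gigabytes, but not 500 MB"))
```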

Edit Distance