2 NLP PDF

This document discusses regular expressions (REs) for natural language processing tasks. It provides examples of REs to find instances of the word "the" accounting for capitalization and word boundaries. More complex REs are given to extract prices with dollars and cents, and disk space amounts with optional fractions. The document also outlines common RE operators for matching characters, counting occurrences, and escaping special characters.

Uploaded by

Sherry Adan Off


SE-507 Natural Language Processing

Chapter 2

6.3 A Simple Example

Suppose we wanted to write a RE to find cases of the English article the. A simple
(but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So, we need to specify that we want instances with a word
boundary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since /\b/
won’t treat underscores and numbers as word boundaries; but we might want to
find the in some context where it might also have underlines or numbers nearby
(the_ or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[^a-zA-Z][tT]he[^a-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [^a-zA-Z], which we
used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specifying
that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
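
The progression of patterns above can be tried out with a short sketch in Python's re module. The sample sentence and the dictionary keys are made up for illustration, and re.MULTILINE is an assumption added so that ^ and $ match at line boundaries rather than only at the ends of the whole string.

```python
import re

# Hypothetical sample text: sentence-initial "The", embedded "the" (other,
# theology), "the" next to an underscore or digits, and "the" at a line start.
text = "The other day the theology student read the_ and the25.\nthe end"

patterns = {
    "naive": r"the",
    "either_case": r"[tT]he",
    "word_boundary": r"\b[tT]he\b",
    "non_alphabetic": r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",
}

# Count how many matches each pattern finds in the sample text.
counts = {
    name: len(re.findall(pattern, text, flags=re.MULTILINE))
    for name, pattern in patterns.items()
}

for name, count in counts.items():
    print(f"{name:15} {count}")
```

Note that the last pattern consumes the neighbouring character on each side, so two instances separated by a single space would not both be found by a left-to-right scan.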

6.4 More operators

Figure 1.8 shows some aliases for common ranges, which can be used mainly to
save typing. Besides the Kleene * and Kleene + we can also use explicit numbers
as counters, by enclosing them in curly brackets. The regular expression /{3}/
means “exactly 3 occurrences of the previous character or expression”. So
/a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by
23 or 25 dots followed by a z).
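
As a quick check of the exact-count example, the following Python sketch builds strings with 23, 24, and 25 dots and tests them against the pattern. The helper name dots_between is hypothetical, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

pattern = re.compile(r"a\.{24}z")

def dots_between(n: int) -> str:
    """Build 'a', then n literal dots, then 'z' (illustrative helper)."""
    return "a" + "." * n + "z"

# Only the string with exactly 24 dots should match.
results = {n: pattern.fullmatch(dots_between(n)) is not None for n in (23, 24, 25)}
print(results)
```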

RE    Expansion       Match                          First Matches

\d    [0-9]           any digit                      Party of 5
\D    [^0-9]          any non-digit                  Blue moon
\w    [a-zA-Z0-9_]    any alphanumeric/underscore    Daiyu
\W    [^\w]           a non-alphanumeric             !!!!
\s    [ \r\t\n\f]     whitespace (space, tab)
\S    [^\s]           non-whitespace                 in Concord
Figure 1.8: Aliases for common sets of characters.
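
The aliases in Figure 1.8 can be checked against the example strings with a small Python sketch. One assumption to flag: in Python 3, \d, \w, and \s are Unicode-aware by default, so the re.ASCII flag is passed here to keep them close to the ASCII expansions listed in the figure.

```python
import re

# (alias, example string) pairs taken from Figure 1.8; the \s row reuses
# "in Concord" because the figure gives no example string for it.
examples = [
    (r"\d", "Party of 5"),
    (r"\D", "Blue moon"),
    (r"\w", "Daiyu"),
    (r"\W", "!!!!"),
    (r"\s", "in Concord"),
    (r"\S", "in Concord"),
]

# For each alias, record the first character it matches in its example.
first_matches = {
    alias: re.search(alias, sample, flags=re.ASCII).group()
    for alias, sample in examples
}

for alias, found in first_matches.items():
    print(alias, repr(found))
```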

A range of numbers can also be specified. So /{n,m}/ specifies from n to m
occurrences of the previous char or expression, and /{n,}/ means at least n
occurrences of the previous expression. REs for counting are summarized in Figure
1.9.

RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrences of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
{,m}     up to m occurrences of the previous char or expression
Figure 1.9: Regular expression operators for counting.
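
A minimal Python sketch of the counting operators in Figure 1.9 follows. The test strings and the helper name full are made up for illustration. Python's re module accepts the {,m} form as "zero to m", but other regex engines may treat it differently, so that line should be read as Python-specific.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True if the whole string s matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

checks = {
    "star_zero": full(r"ab*", "a"),             # * allows zero b's
    "plus_zero": full(r"ab+", "a"),             # + needs at least one b
    "optional_u": full(r"colou?r", "color") and full(r"colou?r", "colour"),
    "exactly_3": full(r"[0-9]{3}", "123"),
    "exactly_3_short": full(r"[0-9]{3}", "12"),
    "range_2_4": full(r"[0-9]{2,4}", "12") and not full(r"[0-9]{2,4}", "12345"),
    "at_least_2": full(r"[0-9]{2,}", "12345"),
    "up_to_2": full(r"[0-9]{,2}", "") and not full(r"[0-9]{,2}", "123"),
}

for name, ok in checks.items():
    print(f"{name:16} {ok}")
```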

Finally, certain special characters are referred to by special notation based on the
backslash (\) (see Figure 1.10). The most common of these are the newline character
\n and the tab character \t. To refer to characters that are special themselves (like .,
*, [, and \), precede them with a backslash (i.e., /\./, /\*/, /\[/, and /\\/).

RE    Match              First Matches

\*    an asterisk “*”    “K*A*P*L*A*N”
\.    a period           “Dr. Livingston, I presume”
\?    a question mark    “Why don’t they come and lend a hand?”
\n    a newline
\t    a tab

Figure 1.10: Some characters that need to be backslashed.
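
The effect of backslashing can be seen in a short Python sketch using the example strings from Figure 1.10. The variable names are illustrative. re.escape is a standard-library helper that backslashes special characters in a string for you; the string "$1.00?" is a made-up example.

```python
import re

# A backslashed asterisk matches only literal asterisks.
asterisks = re.findall(r"\*", "K*A*P*L*A*N")

# A backslashed period matches the literal dot after "Dr".
period_at = re.search(r"\.", "Dr. Livingston, I presume").start()

# Unescaped, the dot matches any character; escaped, only the dot itself.
any_char = re.findall(r".", "a.b")
only_dot = re.findall(r"\.", "a.b")

# re.escape produces a pattern that matches the given string literally.
literal = "$1.00?"
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None

print(len(asterisks), period_at, any_char, only_dot, escaped_ok)
```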

6.5 A More Complex Example

Let’s try out a more significant example of the power of REs. Suppose we want to
build an application to help a user buy a computer on the Web. The user might want
“any machine with at least 6 GHz and 500 GB of disk space for less than $1000”.
To do this kind of retrieval, we first need to be able to look for expressions like 6
GHz or 500 GB or Mac or $999.99. In the rest of this section, we’ll work out some
simple regular expressions for this task.

First, let’s complete our regular expression for prices. Here’s a regular expression
for a dollar sign followed by a string of digits:

/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)

Now we just need to deal with fractions of dollars. We’ll add a decimal point and
two digits afterwards:

/$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the cents optional
and to make sure we’re at a word boundary:

/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/

One last catch! This pattern allows prices like $199999.99 which would be far too
expensive! We need to limit the dollars:

/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
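
The price patterns can be sketched in Python under a few stated assumptions. Python's re module always treats an unescaped $ as an end-of-line anchor, so the dollar sign is written \$ here. A capturing group around the price is added so that findall returns the price without the preceding character. The limited version uses {1,3} instead of the {0,3} in the text, because with {0,3} a bare "$" followed by a digit would also satisfy the pattern. The sample sentence is made up.

```python
import re

# Dollar sign, digits, optional cents, ending at a word boundary.
PRICE = re.compile(r"(?:^|\W)(\$[0-9]+(?:\.[0-9][0-9])?)\b")

# Same idea with the dollar amount limited to at most three digits.
# Assumption: {1,3} rather than the text's {0,3}, to require a digit.
PRICE_LIMITED = re.compile(r"(?:^|\W)(\$[0-9]{1,3}(?:\.[0-9][0-9])?)\b")

text = "Laptops from $999.99, tablets at $199 and a server for $199999.99."

all_prices = PRICE.findall(text)
affordable = PRICE_LIMITED.findall(text)

print(all_prices)
print(affordable)
```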

How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero or
more spaces” since there might always be extra spaces lying around:

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
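
The disk-space pattern can be exercised with a short Python sketch. The sample sentence is made up, and finditer with group(0) is used so that the full matched text is returned rather than the contents of the parenthesized groups.

```python
import re

DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b")

text = "Options: 500 GB, 5.5 GB, 2 Gigabytes, 1 gigabyte or 250MB."

# Collect the full text of every match, in order of appearance.
sizes = [m.group(0) for m in DISK.finditer(text)]
print(sizes)
```

The final item, 250MB, is skipped because the pattern only lists GB and the spelled-out gigabyte forms as units.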

Edit Distance
