Natural Language Processing

Session 3: Regular Expressions

Instructor: Behrooz Mansouri
Spring 2023, University of Southern Maine
Previous Session
In the previous session we learned about:

Coding in Python

Using Google Colab

Revisited programming concepts in Python

Regular Expressions
Regular Expressions and their Usage
Regular expression (RE): a language for specifying text search strings
● They are particularly useful for searching in texts, when we have a
pattern to search for and a corpus of texts to search through
○ In an information retrieval (IR) system such as a Web search
engine, the texts might be entire documents or Web pages
○ In a word-processor, the texts might be individual words, or lines
of a document
● grep command in Linux
○ grep 'nlp' /path/file

Basic Regular Expressions
The simplest kind of regular expression is a sequence of simple characters
● For example, to search for language, we type /language/
● The search string can consist of a single character (like /!/) or a sequence of characters (like
/urgl/)

Regular expressions are case-sensitive; lower case /s/ is distinct from uppercase /S/

Square brackets [] can be used to match any one of a set of characters

● The string of characters inside the brackets specifies a disjunction of characters to match
● /[lL]anguage/ matches language or Language
● The regular expression /[1234567890]/ specifies any single digit

Ranges in []: If there is a well-defined sequence associated with a set of characters, a dash (-) in
brackets can specify any one character in a range
● /[A-Z]/ matches an upper case letter, e.g. the D in “we should call it ‘Drenched Blossoms’ ”
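
The bracket patterns above can be tried out in Python's re module. This is a minimal sketch; the sample text is made up for illustration and is not from the slides.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns.
text = "Language models study language; Chapter 7 has 42 pages."

# [lL] matches either case of the first letter.
either_case = re.findall(r"[lL]anguage", text)

# [0-9] is the range form of [1234567890]: any single digit.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches any single upper case letter.
capitals = re.findall(r"[A-Z]", text)

print(either_case)  # ['Language', 'language']
print(digits)       # ['7', '4', '2']
print(capitals)     # ['L', 'C']
```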
Basic Regular Expressions
Negations in []:

● Square brackets can also be used to specify what a single character cannot be, using the caret ^
● If the caret ^ is the first symbol after the open square bracket [, the resulting pattern is negated
● [^A-Z] Not an upper case letter
● [^a-z] Not a lower case letter
● [^Ss] Neither ‘S’ nor ‘s’

If E1 and E2 are regular expressions, then E1 | E2 is a regular expression

● woodchuck|groundhog → woodchuck or groundhog

● a|b|c → a, b or c
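
A short sketch of negation and disjunction in Python's re module follows. The sample text is an assumption made for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "Sam saw a groundhog; some say woodchuck."

# [^Ss] matches any single character that is neither 'S' nor 's'.
first_non_s = re.search(r"[^Ss]", text).group()

# E1|E2 matches either alternative.
animals = re.findall(r"woodchuck|groundhog", text)

# Inside brackets, ^ negates only when it comes first; elsewhere it is literal.
literal_caret = re.findall(r"[a^]", "a^b")

print(first_non_s)    # a
print(animals)        # ['groundhog', 'woodchuck']
print(literal_caret)  # ['a', '^']
```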

Closure Operators: Kleene * and Kleene +
● Kleene * (closure) operator: the Kleene star means “zero or more occurrences of the
immediately preceding regular expression”
● Kleene + (positive closure) operator: the Kleene plus means “one or more occurrences of
the immediately preceding regular expression”

Regular Expression    Matches
ba*                   b, ba, baa, baaa, ...
ba+                   ba, baa, baaa, ...
(ba)*                 ε (the empty string), ba, baba, bababa, ...
(ba)+                 ba, baba, bababa, ...
(b|a)+                b, a, bb, ba, aa, ab, ...

Stephen Kleene, 1909 - 1994
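
The closure operators can be checked with re.fullmatch, which succeeds only when the whole string matches. This is a sketch; the helper name matches and the candidate list are assumptions for illustration.

```python
import re

def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical candidate strings, including the empty string.
candidates = ["", "b", "ba", "baa", "baba", "ab", "bb"]

print(matches(r"ba*", candidates))     # ['b', 'ba', 'baa']
print(matches(r"ba+", candidates))     # ['ba', 'baa']
print(matches(r"(ba)*", candidates))   # ['', 'ba', 'baba']
print(matches(r"(ba)+", candidates))   # ['ba', 'baba']
print(matches(r"(b|a)+", candidates))  # ['b', 'ba', 'baa', 'baba', 'ab', 'bb']
```

Note that only the starred pattern (ba)* accepts the empty string.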

Wildcard, Question Mark, and Curly Braces
● A wildcard expression (dot) . matches any single character (except a newline)
○ beg.n → begin, begun, begxn, …
○ a.*b → any string that starts with a and ends with b

● The question mark ? marks optionality of the previous expression

○ woodchucks? → woodchuck or woodchucks
○ colou?r → color or colour
○ (a|b)?c → ac, bc, c

● {m,n} causes the resulting RE to match from m to n repetitions of the preceding RE

● {m} specifies that exactly m copies of the previous RE should be matched
○ (ba){2,3} → baba, bababa
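
The wildcard, optionality, and counter operators can be sketched in Python as follows. The test strings are assumptions chosen to show one failing case per pattern.

```python
import re

# The dot matches any single character except a newline.
dot_hits = re.findall(r"beg.n", "begin begun begxn beg\nn")

# ? makes the preceding expression optional.
optional_hits = re.findall(r"colou?r", "color colour colouur")

# {m,n} bounds the number of repetitions of the preceding expression.
counted_hits = [s for s in ["ba", "baba", "bababa", "babababa"]
                if re.fullmatch(r"(ba){2,3}", s)]

print(dot_hits)       # ['begin', 'begun', 'begxn']
print(optional_hits)  # ['color', 'colour']
print(counted_hits)   # ['baba', 'bababa']
```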

Precedence of Operators
The order of precedence of RE operators, from highest to lowest, is as follows

● Parentheses ()
● Counters * + ? {}
● Sequences and anchors ^ $
● Disjunction |

● The regular expression the* matches theeeee but not thethe

● The regular expression (the)* matches thethe but not theeeee
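
The two precedence examples above can be verified directly. This is a small sketch; the helper full and the last test string are assumptions for illustration.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None

# Counters bind tighter than sequence: the* is "th" followed by zero or more "e".
print(full(r"the*", "theeeee"))    # True
print(full(r"the*", "thethe"))     # False

# Parentheses group first: (the)* repeats the whole word.
print(full(r"(the)*", "thethe"))   # True
print(full(r"(the)*", "theeeee"))  # False

# Sequence binds tighter than |, so the|any means "the" or "any".
alternatives = re.findall(r"the|any", "theny any the")
print(alternatives)                # ['the', 'any', 'the']
```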

Advanced Operators
Aliases for common sets of characters

Special characters need to be backslashed
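
A minimal sketch of the aliases \d, \w and \s and of backslash escaping in Python's re module. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "Dr. Smith paid $4.30 for 2 items (approx.)"

digit_runs = re.findall(r"\d+", text)  # \d: any digit
words = re.findall(r"\w+", text)       # \w: letters, digits, underscore
spaces = re.findall(r"\s", text)       # \s: any whitespace character

# Unescaped, . is the wildcard and $ is an anchor; a backslash makes them literal.
loose = re.findall(r"4.30", "4.30 4x30")
strict = re.findall(r"4\.30", "4.30 4x30")
price = re.findall(r"\$\d+\.\d+", text)

print(digit_runs)   # ['4', '30', '2']
print(len(spaces))  # 7
print(loose)        # ['4.30', '4x30']
print(strict)       # ['4.30']
print(price)        # ['$4.30']
```

When a literal string has to be embedded in a pattern, re.escape adds the needed backslashes automatically.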

Finite State Automaton
Any regular expression can be realized as a finite state automaton (FSA)
There are two kinds of FSAs
● Deterministic finite state automata (DFAs)
● Non-deterministic finite state automata (NFAs)

An FSA (equivalently, a regular expression) represents a regular language

Regular Expressions: a DFA and an NFA

Any NFA can be converted into a corresponding DFA
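
To make the link between regular expressions and automata concrete, here is a sketch of a hand-built DFA for the pattern ba+ from the closure examples. The state numbering and transition table are assumptions for illustration, not taken from the slides.

```python
import re

# Hypothetical DFA for ba+ : state 0 is the start state,
# state 2 is the only accepting state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 2,
}
ACCEPTING = {2}

def dfa_accepts(s: str) -> bool:
    """Run the DFA over s; a missing transition means rejection."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

for s in ["ba", "baaa", "b", "ab", "baab"]:
    # The DFA and the regular expression agree on each string.
    print(s, dfa_accepts(s), bool(re.fullmatch(r"ba+", s)))
```

Because each (state, character) pair has at most one successor, the automaton is deterministic: it reads the input once, left to right, with no backtracking.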

Python Regular Expressions
The re library in Python is a built-in library that provides support for regular expressions
The re library provides several functions for searching and manipulating strings, including:

● search(): searches for a match anywhere in the string

● findall(): returns a list of all non-overlapping matches in the string
● finditer(): returns an iterator yielding match objects
● sub(): replaces all occurrences of the pattern in the string
● split(): splits the string by the occurrences of the pattern
● compile(): compiles a regular expression pattern into a regular expression object, which
can be used for efficient reuse

To use the re library, you first need to import it at the beginning of your Python script by using
the following line: import re

And then use the functions and constants provided by the re library to perform regular
expression operations on strings
https://www.w3schools.com/python/python_regex.asp
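
The functions listed above can be compared side by side on one string. This is a sketch; the sample sentence is an assumption for illustration.

```python
import re

# Hypothetical sample text for illustration.
text = "There are 12 monkeys and 3 zebras in the zoo"

first = re.search(r"\d+", text)             # first match anywhere, or None
all_numbers = re.findall(r"\d+", text)      # list of matched strings
starts = [m.start() for m in re.finditer(r"\d+", text)]  # match objects carry positions
masked = re.sub(r"\d+", "N", text)          # replace every match
parts = re.split(r"\s+and\s+", text)        # split on the pattern

number = re.compile(r"\d+")                 # compile once, reuse many times
reused = number.findall("7 of 9")

print(first.group(), first.span())  # 12 (10, 12)
print(all_numbers)                  # ['12', '3']
print(starts)                       # [10, 25]
print(masked)                       # There are N monkeys and N zebras in the zoo
print(parts)                        # ['There are 12 monkeys', '3 zebras in the zoo']
print(reused)                       # ['7', '9']
```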
Python Regular Expressions
Extract all the digits from a string:
import re
string = "There are 12 monkeys in the zoo"
# \d is a metacharacter that matches any digit (0-9)
print(re.findall(r'\d+', string))
# Output: ['12']

Extract all the words starting with a specific letter:

import re
string = "The quick brown fox jumps over the lazy dog"
# \b is a metacharacter that matches only at the beginning or end of a word
print(re.findall(r'\b[Tt]\w+', string))
# Output: ['The', 'the']

Python Regular Expressions (Class Activity)
1. Extract all email addresses from a string
string = "The email addresses are be-mansouri@maine and [email protected]"

2. Extract all phone numbers from a string

"The phone numbers are 207-587-3290, 5583446854, (207) 107-9293, 2075142279x0000"

\w Matches Unicode word characters

\s Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other
characters, for example the non-breaking spaces mandated by typography rules in many
languages)
[-.\s] Matches any one of the characters inside the square brackets, in this case -, . or any whitespace character
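
One possible approach to the phone number task, built from the hints above, is sketched below. It is not the official solution: the pattern assumes ten-digit numbers with optional parentheses and separators, and it treats a trailing x followed by digits as an extension.

```python
import re

string = "The phone numbers are 207-587-3290, 5583446854, (207) 107-9293, 2075142279x0000"

# Assumed format: optional (, 3 digits, optional ), optional separator,
# 3 digits, optional separator, 4 digits, optional x-extension.
phone = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:x\d+)?")

numbers = phone.findall(string)
print(numbers)
# ['207-587-3290', '5583446854', '(207) 107-9293', '2075142279x0000']
```

The group (?:x\d+)? is non-capturing, so findall still returns the whole match rather than only the extension.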

Words and Corpora
Corpus
Corpus (plural corpora), a computer-readable collection of text or speech

● For example the Brown corpus is a million-word collection of samples from 500
written texts from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963–64

Punctuation is critical for finding boundaries of things (commas, periods, colons) and
for identifying some aspects of meaning (question marks, exclamation marks,
quotation marks)

● For some tasks, like part-of-speech tagging or parsing or speech synthesis, we
sometimes treat punctuation marks as if they were separate words

Utterance
An utterance is the spoken correlate of a sentence

● I do uh main- mainly business data processing

This utterance has two kinds of disfluencies: fragments and fillers

(1) The broken-off word main- is called a fragment
(e.g., when the speaker pauses mid-word)

(2) Words like uh and um are called fillers or filled pauses

More generally, a speech disfluency (also spelled dysfluency) is any of various breaks or
irregularities in otherwise fluent speech.

Word Type
“How many words are there in English?”

To answer this question, we need to distinguish two ways of talking about words:
(1) WORD TYPE; (2) WORD TOKEN
Word types are the distinct words in a corpus

● If the set of words in the vocabulary is V, the number of types is the vocabulary size |V|

Word tokens are the running words in a corpus; N is the total number of tokens

e.g. They picnicked by the pool, then lay back on the grass and looked at the stars

● This sentence has 14 word types (ignoring punctuation)

● This sentence has 16 word tokens
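
The counts for the example sentence can be reproduced with a few lines of Python. This is a sketch that assumes a token is a maximal run of word characters, so punctuation is ignored and case is kept.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars")

# Assumption: a token is a maximal run of word characters.
tokens = re.findall(r"\w+", sentence)

# Types are the distinct tokens; "the" occurs three times but counts once.
types = set(tokens)

print(len(tokens))  # 16
print(len(types))   # 14
```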

Text Normalization
Text Preprocessing
Before almost any natural language processing of a text, the text has to be
normalized

At least three tasks are commonly applied as part of any normalization process

(a) Segmenting/tokenizing words from running text

(b) Normalizing word formats

(c) Segmenting sentences in running text

Segmenting / Tokenizing
Separating out words from sentences

Example: “It is sunny today.” → “It” “is” “sunny” “today”

Tokenization algorithms may also tokenize multiword expressions like New York or rock 'n' roll as
a single token

Tokenization can be more challenging for other languages:

● German noun compounds are not segmented:

○ Lebensversicherungsgesellschaftsangestellter "Life Insurance Company Employee"
● Chinese and Japanese have no spaces between words
○ 莎拉波娃现在居住在美国东南部的佛罗里达。
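
For space-delimited text such as the English example, a regular expression tokenizer that keeps multiword expressions together can be sketched as follows. The MULTIWORD lexicon and the function name tokenize are assumptions for illustration; this approach does not handle languages written without spaces.

```python
import re

# Hypothetical lexicon of multiword expressions to keep as single tokens.
MULTIWORD = ["New York", "rock 'n' roll"]

# Multiword alternatives come first so they win over the single-word pattern.
TOKEN = re.compile("|".join([re.escape(m) for m in MULTIWORD] + [r"\w+"]))

def tokenize(text: str) -> list[str]:
    """Split text into word tokens, dropping punctuation."""
    return TOKEN.findall(text)

print(tokenize("It is sunny today."))
# ['It', 'is', 'sunny', 'today']
print(tokenize("She plays rock 'n' roll in New York."))
# ['She', 'plays', "rock 'n' roll", 'in', 'New York']
```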

Text Normalization
Tokens can also be normalized, in which a single normalized form is chosen for words with multiple forms like
USA and US

● This standardization may be valuable, despite the spelling information that is lost in the normalization
process
○ "$200" would be pronounced as "two hundred dollars" in English
● For information retrieval, we want a query for US to match a document that has USA

Case folding is another kind of normalization: reduce all letters to lower case (this loses distinctions such as US versus us)

● For most applications (e.g., information retrieval), case folding is helpful

● For some NLP applications (MT, information extraction), case can be helpful
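
A small sketch of both kinds of normalization follows. The CANONICAL table is a hypothetical stand-in for a real lexicon of variant forms.

```python
# Hypothetical normalization table; a real system would use a larger lexicon.
CANONICAL = {"US": "USA", "U.S.": "USA", "U.S.A.": "USA"}

def normalize(token: str) -> str:
    """Map known variants to one canonical form; leave other tokens alone."""
    return CANONICAL.get(token, token)

def case_fold(token: str) -> str:
    """Reduce all letters to lower case."""
    return token.lower()

# A query for US now matches a document containing USA.
print(normalize("US") == normalize("USA"))  # True

# Case folding merges the country with the pronoun.
print(case_fold("US") == case_fold("us"))   # True
```

The second result shows why case folding is applied selectively: after folding, US and us are indistinguishable.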

Lemmatization
Lemmatization is the task of determining that two words have the same root, despite their surface
differences

● am, are, is → be
● car, cars, car's, cars' → car

Lemmatization has to find the correct dictionary headword form of the word

Lemmatization algorithms can be complex. For this reason, we sometimes make use of a simpler
but cruder method, which mainly consists of chopping off word-final affixes

● This naïve version of morphological analysis is called stemming
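
The contrast between stemming and lemmatization can be sketched in a few lines. The suffix list, the minimum stem length of three, and the tiny LEMMAS dictionary are all assumptions for illustration; this is not the Porter stemmer or any library's algorithm.

```python
# Hypothetical suffix list, checked in order.
SUFFIXES = ("'s", "s'", "ing", "ed", "s")

# Hypothetical lemma dictionary for irregular forms.
LEMMAS = {"am": "be", "are": "be", "is": "be"}

def naive_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    """Look up irregular forms first, then fall back to the stemmer."""
    return LEMMAS.get(word, naive_stem(word))

print([naive_stem(w) for w in ["car", "cars", "car's", "cars'"]])
# ['car', 'car', 'car', 'car']
print(naive_stem("picnicked"))                     # picnick
print([naive_stem(w) for w in ["am", "are", "is"]])  # ['am', 'are', 'is']
print([lemmatize(w) for w in ["am", "are", "is"]])   # ['be', 'be', 'be']
```

Suffix chopping handles the regular car forms but yields the non-word picnick and cannot relate am, are and is, which is why lemmatization needs dictionary knowledge.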

Sentence Segmentation
Separate out sentences from a paragraph/text
Example: Shelby is a student. She is sick today. She will not go to school.
After segmenting:
● “Shelby is a student”
● “She is sick today”
● “She will not go to school”

Question marks and exclamation points are relatively unambiguous markers of sentence boundaries

Periods, on the other hand, are more ambiguous

● Abbreviations like Inc. or Dr.

● Numbers like .02% or 4.3
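
The ambiguity of the period shows up as soon as a naive splitter is tried. The sketch below splits after ., ! or ? when whitespace follows; the function name and the second test sentence are assumptions for illustration.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(split_sentences("Shelby is a student. She is sick today. She will not go to school."))
# ['Shelby is a student.', 'She is sick today.', 'She will not go to school.']

print(split_sentences("Dr. Lee paid 4.3 dollars. He left."))
# ['Dr.', 'Lee paid 4.3 dollars.', 'He left.']
```

The number 4.3 survives because no whitespace follows its period, but the abbreviation Dr. is wrongly treated as a sentence boundary. Practical segmenters add abbreviation lists or learned classifiers to handle such cases.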

Minimum Edit Distance
String Edit Distance
Given two strings (sequences) return the “distance” between the two strings as
measured by the minimum number of “character edit operations” needed to turn one
sequence into the other

Natural → Nature

1. Substitution a → e
2. Deletion l

Distance = 2
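
The definition above translates directly into a recursive function. This sketch assumes a cost of 1 for each insertion, deletion, and substitution, which is consistent with the distance of 2 in the example; memoization keeps the recursion efficient.

```python
from functools import lru_cache

def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # Distance between the prefixes source[:i] and target[:j].
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        substitution = 0 if source[i - 1] == target[j - 1] else 1
        return min(
            d(i - 1, j) + 1,                 # deletion
            d(i, j - 1) + 1,                 # insertion
            d(i - 1, j - 1) + substitution,  # substitution or match
        )

    return d(len(source), len(target))

print(edit_distance("Natural", "Nature"))  # 2
```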

Damerau-Levenshtein distance
Counts the minimum number of insertions, deletions, substitutions, or
transpositions of single characters required
● e.g., about 80% of spelling errors are within Damerau-Levenshtein distance 1 (caused by a single-character error)

● distance 2
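
A sketch of how transpositions can be added to the edit distance computation is given below. It implements the restricted variant often called optimal string alignment, in which no substring is edited more than once; this is a simplification of the full Damerau-Levenshtein distance, and the function name is an assumption.

```python
def osa_distance(s: str, t: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i
    for j in range(cols):
        D[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[-1][-1]

print(osa_distance("teh", "the"))         # 1
print(osa_distance("Natural", "Nature"))  # 2
```

Without the transposition case, turning teh into the would cost two substitutions.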

Alignment

distance(“William Cohen”, “Willliam Cohon”)

Given two sequences, an alignment is a correspondence between substrings of the two sequences
Dynamic Program Table for String Edit
Measure distance between strings PARK and SPAKE

D(i,j) = score of best alignment from s1..si to t1..tj
d(c,d) is an arbitrary distance function on characters

          P   A   R   K
      0   1   2   3   4
S     1
P     2
A     3      Cij
K     4
E     5

Cij = the number of edit operations needed to align PA with SPA

Dynamic Program Table for String Edit
Measure distance between strings PARK and SPAKE

          P   A   R   K
      0   1   2   3   4
S     1   1   2   3   4
P     2   1   2   3   4
A     3   2   1   2   3
K     4   3   2   2   2
E     5   4   3   3   3

Cij = the number of edit operations needed to align PA with SPA
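
The table can be filled in bottom-up with a short program. This is a sketch that assumes unit costs for insertion, deletion, and substitution, which reproduces the values shown; the function name edit_table is hypothetical.

```python
def edit_table(source: str, target: str) -> list[list[int]]:
    """Fill the edit distance table; rows follow target, columns follow source."""
    rows, cols = len(target) + 1, len(source) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # i edits to reach target[:i] from the empty prefix
    for j in range(cols):
        table[0][j] = j  # j edits to reach the empty prefix from source[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if target[i - 1] == source[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # edit above
                table[i][j - 1] + 1,         # edit to the left
                table[i - 1][j - 1] + cost,  # diagonal: substitution or match
            )
    return table

table = edit_table("PARK", "SPAKE")
for label, row in zip(" SPAKE", table):
    print(label, *row)

print(table[3][2])    # 1: the cell for aligning PA with SPA
print(table[-1][-1])  # 3: the distance between PARK and SPAKE
```

Each cell depends only on its upper, left, and upper-left neighbours, so the table is filled row by row and the bottom-right cell holds the final distance.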

Summary
Today we learned about:

Regular Expressions

Words and Corpora

Text Normalization

String Edit Distance

Next Session
Tokenization and Stemming
In the next session, we will explore tokenization and stemming

To do before next session:

● Chapter 2 of Jurafsky’s book (link)

● Assignment 1 is ready
● We will have the first Quiz
