Natural Language Processing - Session 3 - Regular Expressions
Coding in Python
Regular Expressions
Regular Expressions and Their Usage
Regular expression (RE): a language for specifying text search strings
● They are particularly useful for searching in texts, when we have a
pattern to search for and a corpus of texts to search through
○ In an information retrieval (IR) system such as a Web search
engine, the texts might be entire documents or Web pages
○ In a word-processor, the texts might be individual words, or lines
of a document
● grep command in Linux
○ grep 'nlp' /path/file
Basic Regular Expressions
The simplest kind of regular expression is a sequence of simple characters
● For example, to search for language, we type /language/
● The search string can consist of a single character (like /!/) or a sequence of characters (like
/urgl/)
Regular expressions are case-sensitive; lower case /s/ is distinct from uppercase /S/
Ranges in []: If there is a well-defined sequence associated with a set of characters, dash (-) in
brackets can specify any one character in a range
● /[A-Z]/ matches an upper case letter, as in the “D” in “we should call it ‘Drenched Blossoms’ ”
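A quick check of this pattern with Python's re module (covered later in this session):
import re
text = "we should call it 'Drenched Blossoms'"
# [A-Z] matches any single upper case letter
print(re.findall(r'[A-Z]', text))
# Output: ['D', 'B']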
Basic Regular Expressions
Negations in []:
● The square braces can also be used to specify what a single character cannot be, using the caret ^
● If the caret ^ is the first symbol after the open square brace [, the resulting pattern is negated
● [^A-Z] Not an upper case letter
● [^a-z] Not a lower case letter
● [^Ss] Neither ‘S’ nor ‘s’
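A small illustration in Python, with a made-up input string:
import re
text = "Sam sees Sue"
# [^Ss] matches any single character that is neither 'S' nor 's'
print(re.findall(r'[^Ss]', text))
# Output: ['a', 'm', ' ', 'e', 'e', ' ', 'u', 'e']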
Closure Operators: Kleene * and Kleene +
● Kleene * (closure) operator: The Kleene star means “zero or more occurrences of the immediately preceding regular expression”
● Kleene + (positive closure) operator: The Kleene plus means “one or more occurrences of the immediately preceding regular expression”
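A sketch of the difference in Python, on a made-up input:
import re
text = "b ba baa"
# ba* matches a 'b' followed by zero or more a's
print(re.findall(r'ba*', text))   # ['b', 'ba', 'baa']
# ba+ requires at least one 'a' after the 'b'
print(re.findall(r'ba+', text))   # ['ba', 'baa']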
Wildcard, Question Mark, and Curly Brackets
● A wildcard expression (dot) . matches any single character (except a carriage return)
○ beg.n → begin, begun, begxn, …
○ a.*b → any string that starts with a and ends with b
● The question mark ? matches zero or one occurrence of the previous expression
○ colou?r → color, colour
● Curly brackets {n,m} match from n to m occurrences of the previous expression
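A brief Python illustration of the wildcard and the question mark:
import re
# The dot matches any single character between 'beg' and 'n'
print(re.findall(r'beg.n', "begin began begun begins"))
# Output: ['begin', 'began', 'begun', 'begin']
# The 'u' is optional, so both spellings match
print(re.findall(r'colou?r', "color colour"))
# Output: ['color', 'colour']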
Precedence of Operators
The precedence of RE operators, from highest to lowest, is as follows
● Parentheses ()
● Counters * + ? {}
● Sequences and anchors ^ $
● Disjunction |
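These rules can be seen at work in Python; (?: ) is a non-capturing group, used here so that findall returns the whole match:
import re
# Parentheses bind the disjunction to the 'gupp' stem
print(re.findall(r'gupp(?:y|ies)', "guppy guppies"))   # ['guppy', 'guppies']
# Without parentheses, | has the lowest precedence: 'guppy' OR 'ies'
print(re.findall(r'guppy|ies', "guppy guppies"))       # ['guppy', 'ies']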
Advanced Operators
Aliases for common sets of characters:
● \d matches any digit; \D any non-digit
● \w matches any alphanumeric character or underscore; \W any non-alphanumeric
● \s matches whitespace (space, tab); \S non-whitespace
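A quick Python demonstration of some of these aliases, on a made-up string:
import re
text = "Plan 9 from Outer Space"
print(re.findall(r'\d', text))    # ['9']
print(re.findall(r'\w+', text))   # ['Plan', '9', 'from', 'Outer', 'Space']
print(re.findall(r'\s', text))    # [' ', ' ', ' ', ' ']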
Finite State Automaton
Any regular expression can be realized as a finite state automaton (FSA)
There are two kinds of FSAs
● Deterministic Finite State Automata (DFAs)
● Non-deterministic Finite State Automata (NFAs)
Regular Expressions: A DFA and an NFA
Python Regular Expressions
The re library in Python is a built-in library that provides support for regular expressions
The re library provides several functions for searching and manipulating strings, including findall, search, split, and sub
To use the re library, first import it at the beginning of your Python script with the line import re, and then use the functions and constants it provides to perform regular expression operations on strings
https://www.w3schools.com/python/python_regex.asp
Python Regular Expressions
Extract all the digits from a string:
import re
string = "There are 12 monkeys in the zoo"
# \d is a metacharacter that matches any digit (0-9)
print(re.findall(r'\d+', string))
# Output: ['12']

Extract all words beginning with 'T' or 't':
import re
string = "The quick brown fox jumps over the lazy dog"
# \b is a metacharacter that matches only at the beginning or end of a word
print(re.findall(r'\b[Tt]\w+', string))
# Output: ['The', 'the']
Python Regular Expressions (Class Activity)
1. Extract all email addresses from a string
string = "The email addresses are be-mansouri@maine and [email protected]"
Words and Corpora
Corpus
Corpus (plural corpora), a computer-readable collection of text or speech
● For example, the Brown corpus is a million-word collection of samples from 500
written texts from different genres (newspaper, fiction, non-fiction, academic, etc.),
assembled at Brown University in 1963–64
Punctuation is critical for finding boundaries of things (commas, periods, colons) and
for identifying some aspects of meaning (question marks, exclamation marks,
quotation marks)
Utterance
An utterance is the spoken correlate of a sentence
Word Type
“How many words are there in English?”
To answer this question, we need to distinguish two ways of talking about words:
(1) WORD TYPE; (2) WORD TOKEN
Word Type is the number of distinct words in a corpus
● If the set of words in the vocabulary is V, the number of types is the vocabulary size |V|
e.g. They picnicked by the pool, then lay back on the grass and looked at the stars (16 tokens and 14 types, since the occurs three times)
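Counting tokens and types for this sentence in Python, as a minimal sketch that ignores punctuation and case:
import re
sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars"
tokens = re.findall(r'\w+', sentence.lower())
print(len(tokens))        # 16 tokens
print(len(set(tokens)))   # 14 types ('the' occurs three times)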
Text Normalization
Text Preprocessing
Before any natural language processing can be done on a text, the text has to be normalized
At least three tasks are commonly applied as part of any normalization process: tokenizing (segmenting) words, normalizing word formats, and segmenting sentences
Segmenting / Tokenizing
Separating out words from sentences
Tokenization algorithms may also tokenize multiword expressions like New York or rock 'n' roll as
a single token
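A crude tokenizer sketch using re; real tokenizers handle many more cases (abbreviations, multiword expressions like New York, and so on):
import re
sentence = "Mr. Smith doesn't like New York."
# Match a word (optionally with an internal apostrophe) or a single punctuation mark
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(tokens)
# ['Mr', '.', 'Smith', "doesn't", 'like', 'New', 'York', '.']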
Text Normalization
Tokens can also be normalized, in which a single normalized form is chosen for words with multiple forms like
USA and US
● This standardization may be valuable, despite the spelling information that is lost in the normalization
process
○ "$200" would be pronounced as "two hundred dollars" in English
● For information retrieval, we want a query for US to match a document that has USA
Case folding is another kind of normalization: reduce all letters to lower case, even though case can carry useful information (US the country versus the pronoun us)
Lemmatization
Lemmatization is the task of determining that two words have the same root, despite their surface
differences
● am, are, is → be
● car, cars, car's, cars' → car
Lemmatization algorithms can be complex. For this reason, we sometimes make use of a simpler but cruder method, known as stemming, which mainly consists of chopping off word-final affixes
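A toy illustration of the cruder suffix-chopping approach; a real system would use a proper stemmer such as the Porter stemmer:
import re
def crude_stem(word):
    # Chop off a few common word-final affixes
    return re.sub(r'(ing|ed|es|s)$', '', word)

print([crude_stem(w) for w in ['cars', 'walked', 'walking', 'boxes']])
# Output: ['car', 'walk', 'walk', 'box']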
Sentence Segmentation
Separate out sentences from a paragraph/text
Example: Shelby is a student. She is sick today. She will not go to school.
After segmenting:
● “Shelby is a student”
● “She is sick today”
● “She will not go to school”
Question marks and exclamation points are relatively unambiguous markers of sentence boundaries; periods are more ambiguous, since they also mark abbreviations like Dr. or Inc.
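A minimal rule-based segmenter sketch; it assumes sentence-final punctuation is always followed by whitespace, so abbreviations like Dr. would fool it:
import re
text = "Shelby is a student. She is sick today. She will not go to school."
# Split after '.', '!', or '?' when followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# ['Shelby is a student.', 'She is sick today.', 'She will not go to school.']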
Minimum Edit Distance
String Edit Distance
Given two strings (sequences) return the “distance” between the two strings as
measured by the minimum number of “character edit operations” needed to turn one
sequence into the other
Natural → Nature
1. Substitution a → e
2. Deletion l
Distance = 2
Damerau-Levenshtein distance
Counts the minimum number of insertions, deletions, substitutions, or
transpositions of single characters required
● e.g., about 80% of human misspellings are within Damerau-Levenshtein distance 1 of the intended word (a single-character error)
● nearly all are within distance 2
Alignment
Given two sequences, an alignment is a correspondence between substrings of the two sequences
Dynamic Programming Table for String Edit
Measure the distance between the strings PARK and SPAKE
D(i,j) = score of the best alignment from s1..si to t1..tj
d(c,d) is an arbitrary distance function on characters; here insertion, deletion, and substitution each cost 1
The completed table (rows: SPAKE, columns: PARK):
      P  A  R  K
   0  1  2  3  4
S  1  1  2  3  4
P  2  1  2  3  4
A  3  2  1  2  3
K  4  3  2  2  2
E  5  4  3  3  3
The minimum edit distance is the bottom-right cell: D(5,4) = 3
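A minimal Python implementation of this dynamic program, with unit costs as in the table:
def min_edit_distance(source, target):
    # D[i][j] = cost of the best alignment of source[:i] and target[:j]
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i          # deletions from source
    for j in range(1, m + 1):
        D[0][j] = j          # insertions into source
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution or match
    return D[n][m]

print(min_edit_distance("SPAKE", "PARK"))      # 3, the bottom-right cell above
print(min_edit_distance("Natural", "Nature"))  # 2, as in the earlier example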
Summary
Today we learned about:
Regular Expressions
Words and Corpora
Text Normalization
Minimum Edit Distance
Next Session
Tokenization and Stemming
In the next session, we will explore tokenization and stemming