2 - Regular Expressions, Text Normalization, Edit Distance
Regular Expressions
• Regular expressions are case sensitive. This means that the
pattern /woodchucks/ will not match the string “Woodchucks”.
– We can solve this by using square brackets []
– The string of characters inside the brackets [] specifies a disjunction of
characters to match.
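As a sketch of the idea, here is the same pattern in Python's `re` module (the slides write patterns between slashes; `re` takes the bare pattern string):

```python
import re

# [wW] matches either "w" or "W", so one pattern covers both capitalizations.
pattern = re.compile(r"[wW]oodchucks")

print(pattern.search("interesting links to woodchucks"))  # match
print(pattern.search("Woodchucks are rodents"))           # match
print(pattern.search("WOODCHUCKS"))                       # None: only the first letter may vary
```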
Regular Expressions: Disjunctions
Regular Expressions: Negation in Disjunction
• Negations can be applied using the caret ^ symbol
– Caret means negation only when first in []
Pattern    Matches                     Example Patterns Matched
[^A-Z]     Not an upper case letter    Oyfn pripetchik

Pattern    Matches                       Example Patterns Matched
colou?r    Optional previous char        Color, Colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
Regular Expressions: Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret (^) matches the start of a line.
– The pattern /^The/ matches the word “The” only at the start of a line.
Pattern       Example Matched
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end?  The end!
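The anchor patterns above can be verified directly (a sketch using Python's `re`; note that `\.` escapes the period so it matches literally, while an unescaped `.` matches any character):

```python
import re

assert re.search(r"^[A-Z]", "Palo Alto")        # line starts with an upper-case letter
assert re.search(r"^[^A-Za-z]", '1 "Hello"')    # line starts with a non-letter
assert re.search(r"\.$", "The end.")            # literal period at end of line
assert re.search(r".$", "The end?")             # any character at end of line
assert not re.search(r"\.$", "The end?")        # but "?" is not a literal period

print("all anchor checks passed")
```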
Regular Expressions: Boundary Anchors \b \B
• There are also two other anchors: \b matches a word boundary, and \B
matches a non-boundary.
• For the purposes of a regular expression, a “word” is defined as any
sequence of digits, underscores, or letters.
• Examples:
– /\bthe\b/ matches the word “the” but not the word “other”.
– /\b99\b/ will match the string 99 in “There are 99 bottles of juice on the wall”
(because 99 follows a space and precedes a space) but not 99 in “There are
299 bottles of juice on the wall” (since 99 follows a number). But it will match
99 in “$99” (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
• What would the results be if we used the other anchor, \B, in the previous
examples, given that it matches a non-word boundary?
Example:
• Suppose we wanted to write a RE to find cases of the
English article “the”. A simple (but incorrect) pattern might
be:
/the/
• One problem is that this pattern will miss the word when it
begins a sentence and hence is capitalized (i.e., The). This
might lead us to the following pattern:
/[tT]he/
• But we will still incorrectly return texts with “the” embedded
in other words (e.g., other or theology).
• So we need to specify that we want instances with a word
boundary on both sides:
/\b[tT]he\b/
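The three-step refinement can be watched in action (a sketch in Python's `re`; the sample sentence is made up for illustration):

```python
import re

text = "The other theology text was the one they read."

print(re.findall(r"the", text))         # misses "The"; also matches inside "other", "theology", "they"
print(re.findall(r"[tT]he", text))      # now catches "The", but still matches the embedded cases
print(re.findall(r"\b[tT]he\b", text))  # only the article: ['The', 'the']
```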
Errors
• The process we just went through was based on
fixing two kinds of errors
– Matching strings that we should not have matched (there,
then, other)
• False positives (Type I)
– Not matching things that we should have matched (The)
• False negatives (Type II)
Basic Text Processing
Text normalization
Text normalization
• Normalizing text means converting it to a more convenient, standard
form.
1. Tokenization - Splitting a phrase, sentence, paragraph, or an entire
text document into smaller units, such as individual words or terms.
2. Lemmatization - The task of determining that two words have the
same root, despite their surface differences.
– The words “sang”, “sung”, and “sings” are forms of the verb “sing”. The
word sing is the common lemma of these words, and a lemmatizer maps
from all of these to “sing”.
3. Stemming - We mainly just strip suffixes from the end of the word.
– The words “caring”, “careful” are stemmed to “car”, and the words
“history” and “historical” are stemmed to “histori”
4. Sentence Segmentation - We break up a text into individual
sentences, using cues like periods or exclamation points.
Normalization
• Need to “normalize” terms
– Information Retrieval: indexed text and query terms must
have the same form.
• We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
– e.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows
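A minimal sketch of the equivalence-class idea: a `normalize` function (a hypothetical helper, one of many possible choices) that deletes periods and case-folds, so that variant spellings map to the same representative:

```python
def normalize(term):
    """Map a term to its equivalence-class representative.
    Illustrative choice: delete periods and case-fold."""
    return term.replace(".", "").lower()

# U.S.A. and USA now fall in the same equivalence class:
print(normalize("U.S.A."))   # "usa"
print(normalize("USA"))      # "usa"
assert normalize("U.S.A.") == normalize("USA")
```

This symmetric collapsing is the alternative to the asymmetric expansion shown above, where the query side is expanded instead of both sides being normalized.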
Case folding
Word tokenization
Text Normalization
• Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
How many words?
• A lemma is a set of lexical forms having the same stem, the same
major part of speech, and the same word sense.
• cat and cats = same lemma
How many words?
They lay back on the San Francisco grass and looked at the stars and their
• How many?
• 15 tokens (or 14, if “San Francisco” counts as one token)
How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^0.5)
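The token/type distinction can be computed directly; this sketch uses naive whitespace tokenization on the sentence from the earlier slide:

```python
text = "They lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()    # naive whitespace tokenization
types = set(tokens)      # the vocabulary V

print(len(tokens))   # N = 15 tokens
print(len(types))    # |V| = 13 types ("the" and "and" each occur twice)
```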
Simple Tokenization in UNIX
• We can use the command tr to tokenize the words by changing every sequence of
non-alphabetic characters to a newline ('A-Za-z' means alphabetic, the -c
option complements to non-alphabetic, and the -s option squeezes repeated
characters into a single one):
tr -sc 'A-Za-z' '\n' < shakes.txt
The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Simple Tokenization in UNIX
• Now that there is one word per line, we can sort the lines, and
pass them to uniq -c, which will collapse and count them:
...
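The same tokenize-sort-count pipeline can be sketched in Python (the sample text stands in for the contents of shakes.txt):

```python
import re
from collections import Counter

text = "THE SONNETS by William Shakespeare From fairest creatures we desire increase THE"

# Equivalent of: tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
words = re.findall(r"[A-Za-z]+", text)   # every maximal run of letters is one token
counts = Counter(words)

for word, n in sorted(counts.items()):
    print(n, word)
```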
Issues in Tokenization
• Finland’s capital → Finland? Finlands? Finland’s?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Basic Text Processing
Morphology
• It is the study of the internal structure of words.
• Morphology focuses on how the components within a word (stems, root
words, prefixes, suffixes, etc.) are arranged or modified to create different
meanings.
• Example: happy; un-happy; happy-ness; un-happy-ness
• Morphemes: the smallest meaning-bearing units of words (e.g., stems and affixes).
Stemming
• Reduce terms to their stems in information retrieval.
• Stemming is a crude chopping of affixes
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
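A toy suffix-stripping stemmer makes the "crude chopping" concrete. This is an illustrative sketch, not the Porter algorithm; the rule list is hand-picked just to reproduce the slide's examples:

```python
# Ordered (suffix, replacement) rules; the first matching suffix is stripped.
RULES = [("eful", ""), ("ing", ""), ("ion", ""), ("cal", ""), ("ic", ""),
         ("es", ""), ("e", ""), ("y", "i")]

def crude_stem(word):
    """Strip the first matching suffix (toy example, language dependent)."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["automates", "automatic", "automation", "caring", "careful",
          "history", "historical"]:
    print(w, "->", crude_stem(w))
# automates, automatic, automation -> automat
# caring, careful -> car
# history, historical -> histori
```

Note how crude the results are: "careful" loses its whole "care" stem, which is exactly the drawback the slide's examples illustrate.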
Basic Text Processing
Determining if a word is End-of-Sentence: Decision Tree
More sophisticated decision tree features
• Numeric features
– Length of word with “.”
– Probability(word with “.” occurs at end-of-sentence)
– Probability(word after “.” occurs at beginning-of-sentence)
Implementing Decision Trees
• A decision tree is just an if-then-else statement.
• The interesting research is choosing the features.
• Setting up the structure is often too hard to do by hand.
– Hand-building only possible for very simple features,
domains
• For numeric features, it’s too hard to pick each
threshold
• Instead, structure usually learned by machine learning from
a training corpus
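The "decision tree as if-then-else" point can be made concrete with a tiny hand-built end-of-sentence classifier. Everything here is an illustrative sketch: the abbreviation list is made up, and real systems learn both the structure and the thresholds from a training corpus:

```python
# Hypothetical abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def is_end_of_sentence(token, next_token):
    """A hand-built decision tree, written as nested if/else."""
    if not token.endswith((".", "!", "?")):
        return False                 # no final punctuation -> not a boundary
    if token.lower() in ABBREVIATIONS:
        return False                 # e.g., "Dr." before "Smith": period belongs to the abbreviation
    if next_token[:1].islower():
        return False                 # next word not capitalized -> likely not a boundary
    return True

print(is_end_of_sentence("wall.", "The"))   # True
print(is_end_of_sentence("Dr.", "Smith"))   # False
print(is_end_of_sentence("the", "end"))     # False
```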
Basic Text Processing
Minimum Edit Distance
– d → delete
– s → substitution
– i → insert
How to find the Min Edit Distance?
• Searching for a path (sequence of edits) from the start string
to the final string:
– Initial state: the word we’re transforming
– Operators: insert, delete, substitute
– Goal state: the word we’re trying to get to
– Path cost: what we want to minimize: the number of edits
Defining Min Edit Distance
• For two strings
– X of length n
– Y of length m
• We define D(i,j)
– the edit distance between X[1..i] and Y[1..j]
• i.e., the first i characters of X and the first j characters of Y
– The edit distance between X and Y is thus D(n,m)
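The recurrence behind D(i,j) can be filled in bottom-up with dynamic programming. A minimal sketch (insert and delete cost 1; the substitution cost is a parameter, since some texts use cost 1 and others use cost 2):

```python
def min_edit_distance(x, y, sub_cost=1):
    """D[i][j] = edit distance between x[:i] and y[:j]; returns D(n, m)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                  # delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j                  # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,     # delete x[i]
                D[i][j - 1] + 1,     # insert y[j]
                D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost),
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5 (substitution cost 1)
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8 (substitution cost 2)
```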
Minimum Edit Distance - Example