Regular Expression - Sentence Segment
Regular Expressions
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
– woodchuck
– woodchucks
– Woodchuck
– Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit
• Ranges [A-Z]
Pattern   Matches                Example
[A-Z]     An upper case letter   Drenched Blossoms
[a-z]     A lower case letter    my beans were impatient
[0-9]     A single digit         Chapter 1: Down the Rabbit Hole
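The bracket patterns above can be tried directly. The following is a minimal sketch using Python's standard `re` module; the sample sentence and the variable names are illustrative assumptions, not part of the slides.

```python
import re

# Character-class disjunction: either case of the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
sample = "Woodchuck or woodchuck? Two woodchucks!"  # made-up test sentence
woodchuck_hits = woodchuck.findall(sample)
# The third hit is the "woodchuck" inside "woodchucks".

# Ranges: first upper-case letter, lower-case letter, and digit
# in the example strings from the table.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchuck_hits, first_upper, first_lower, first_digit)
```

Note that `re.search` returns only the first match, which is the character these examples are meant to highlight.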
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
– Caret means negation only when first in []
Pattern   Matches                    Example
[^A-Z]    Not an upper case letter   Oyfn pripetchik
[^Ss]     Neither ‘S’ nor ‘s’        I have no exquisite reason
[^e^]     Neither e nor ^            Look here
a^b       The pattern a caret b      Look up a^b now
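A short sketch of the negation patterns in Python's `re` module follows (variable names are illustrative). One caveat that depends on the regex engine: in Python, a caret outside square brackets is a start-of-line anchor, so this sketch escapes it as `a\^b` to match the literal text a^b.

```python
import re

# Negated classes: the first character that is NOT in the set.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
not_e_or_caret = re.search(r"[^e^]", "Look here").group()

# Outside brackets, Python treats ^ as an anchor, so the unescaped
# pattern cannot match here; escaping the caret matches it literally.
unescaped = re.search(r"a^b", "Look up a^b now")
escaped = re.search(r"a\^b", "Look up a^b now").group()

print(not_upper, not_s, not_e_or_caret, unescaped, escaped)
```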
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
Pattern                     Matches
groundhog|woodchuck         groundhog, woodchuck
yours|mine                  yours, mine
a|b|c                       = [abc]
[gG]roundhog|[Ww]oodchuck   Groundhog, groundhog, Woodchuck, woodchuck
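The pipe can be combined with bracket classes, as in the last pattern above. This is a minimal Python sketch; the test sentence is invented for illustration.

```python
import re

# Alternation of two bracketed patterns.
animal = re.compile(r"[gG]roundhog|[Ww]oodchuck")
sentence = "A Groundhog is a woodchuck; a groundhog is a Woodchuck."
animal_hits = animal.findall(sentence)

# For single characters, a|b|c and [abc] find the same matches.
same_matches = re.findall(r"a|b|c", "abcabd") == re.findall(r"[abc]", "abcabd")

print(animal_hits, same_matches)
```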
Regular Expressions: ? * + .
Pattern   Meaning                      Matches
colou?r   Optional previous char       color, colour
oo*h!     0 or more of previous char   oh!, ooh!, oooh!, ooooh!
o+h!      1 or more of previous char   oh!, ooh!, oooh!, ooooh!
baa+      1 or more of previous char   baa, baaa, baaaa, baaaaa
beg.n     Any single char              begin, begun, beg3n
Kleene *, Kleene + (named after Stephen C Kleene)
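To check which whole strings each quantifier pattern accepts, one option is `re.fullmatch`. The sketch below assumes a small hypothetical helper, `fully_matching`, and adds a few non-matching strings (such as "h!" and "begn") for contrast.

```python
import re

def fully_matching(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional_u = fully_matching(r"colou?r", ["color", "colour", "colouur"])
zero_or_more = fully_matching(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
one_or_more = fully_matching(r"o+h!", ["h!", "oh!", "ooh!"])
sheep = fully_matching(r"baa+", ["ba", "baa", "baaa"])
any_char = fully_matching(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional_u, zero_or_more, one_or_more, sheep, any_char)
```

Here `oo*h!` and `o+h!` accept the same strings, since both require at least one "o" before "h!".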
Regular Expressions: Anchors ^ $
Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 “Hello”
\.$          The end.
.$           The end? The end!
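The anchor examples can be checked as follows. This is a sketch in Python's `re` module with illustrative variable names; the curly quotes from the slide are replaced by plain ASCII quotes.

```python
import re

# ^ anchors the match to the start of the string.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_nonletter = re.search(r"^[^A-Za-z]", '1 "Hello"').group()

# $ anchors to the end; \. is a literal period, bare . is any character.
ends_period = re.search(r"\.$", "The end.").group()
period_vs_question = re.search(r"\.$", "The end?")
ends_anything = [re.search(r".$", s).group() for s in ["The end?", "The end!"]]

print(starts_upper, starts_nonletter, ends_period, period_vs_question, ends_anything)
```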
Example
• Find me all instances of the word “the” in a text.
the
– Misses capitalized examples
[tT]he
– Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
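The three attempts can be compared on one sentence. In this Python sketch the test sentence is invented, and the final `\b` variant is an addition that is not on the slide: the third pattern requires a non-letter character on both sides, so it cannot match "The" at the very start of a string, while the word-boundary anchor `\b` does not have that limitation.

```python
import re

sentence = "The other day the theology class met there."  # made-up example

attempt_1 = re.findall(r"the", sentence)     # misses "The", hits inside other words
attempt_2 = re.findall(r"[tT]he", sentence)  # finds "The", still hits inside other words
attempt_3 = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", sentence)

# Assumption beyond the slide: \b marks a word boundary in Python's re.
with_boundaries = re.findall(r"\b[tT]he\b", sentence)

print(len(attempt_1), len(attempt_2), attempt_3, with_boundaries)
```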
Errors
• The process we just went through was based on
fixing two kinds of errors
– Matching strings that we should not have matched (there,
then, other)
• False positives (Type I)
– Not matching things that we should have matched (The)
• False negatives (Type II)
Errors cont.
• In NLP we are always dealing with these kinds of
errors.
• Reducing the error rate for an application often
involves two antagonistic efforts:
– Increasing accuracy or precision (minimizing false
positives)
– Increasing coverage or recall (minimizing false negatives).
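The trade-off between the two efforts can be made concrete by scoring a pattern against a hand-labelled answer. The sketch below is an illustrative assumption throughout: the token list, the `gold` set, and the helper `precision_recall` are invented here, and evaluation is done per token.

```python
import re

def precision_recall(pattern, tokens, gold):
    """Score a regex as a token classifier against a set of gold wordforms."""
    predicted = {i for i, tok in enumerate(tokens) if re.search(pattern, tok)}
    relevant = {i for i, tok in enumerate(tokens) if tok in gold}
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

tokens = ["The", "other", "day", "the", "theology", "class", "met", "there"]
gold = {"the", "The"}  # the tokens we actually want to find

loose = precision_recall(r"[tT]he", tokens, gold)  # many false positives
narrow = precision_recall(r"^the$", tokens, gold)  # one false negative

print(loose, narrow)
```

The loose pattern finds every target but also three unwanted tokens (high recall, low precision); the narrow one returns only correct tokens but misses "The" (high precision, lower recall).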
Summary
• Regular expressions play a surprisingly large role
– Sophisticated sequences of regular expressions are
often the first model for any text processing task
• For many hard tasks, we use machine learning
classifiers
– But regular expressions are used as features in the
classifiers
– Can be very useful in capturing generalizations
Text Normalization
• Every NLP task needs to do text
normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How many words?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms
How many words?
they lay back on the San Francisco grass and looked at the stars and their
Viewing morphology in a corpus
Why only strip –ing if there is a vowel?
(*v*)ing → ø    walking → walk
                sing → sing
tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
1312 King
 548 being
 541 nothing
 388 king
 375 bring
 358 thing
 307 ring
 152 something
 145 coming
 130 morning
tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr
 548 being
 541 nothing
 152 something
 145 coming
 130 morning
 122 having
 120 living
 117 loving
 116 Being
 102 going
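For readers without a Unix shell, the two pipelines can be approximated in Python. This is a sketch under stated assumptions: the function name `ing_counts` and the sample text are invented, and words that tie on count may come out in a different order than `sort -nr` would give.

```python
import re
from collections import Counter

def ing_counts(text, require_vowel=False):
    """Count words ending in 'ing', optionally requiring an earlier vowel."""
    # Like tr -sc 'A-Za-z' '\n': keep only maximal runs of letters.
    words = re.findall(r"[A-Za-z]+", text)
    # Like the two grep patterns on the slide.
    pattern = r"[aeiou].*ing$" if require_vowel else r"ing$"
    # Like sort | uniq -c | sort -nr: counts in descending order.
    return Counter(w for w in words if re.search(pattern, w)).most_common()

sample = "the king was walking and being sung to by a king walking"
all_ing = ing_counts(sample)
vowel_ing = ing_counts(sample, require_vowel=True)

print(all_ing)
print(vowel_ing)
```

To run it on the corpus from the slide, pass the contents of `shakes.txt` as `text`. Requiring a vowel before the suffix drops "king", where -ing is part of the stem, and keeps "walking" and "being".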
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
– Turkish
– Uygarlastiramadiklarimizdanmissinizcasina
– ‘(behaving) as if you are among those whom we could not civilize’
– Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’