Week 2
Introduction
Basic regular expression patterns
Examples
Regular expression with Python
Practical examples
CS3TM20 © XH 1
Introduction
Regular expression (RE): a language for specifying text
search strings.
A RE is a sequence of characters that specifies a search pattern.
REs are used in every programming language, word processor, and text
processing tool.
They are particularly useful for searching texts, when we have a
pattern to search for and a corpus of texts to search
through.
The corpus can be a single document or a collection.
Regular expressions come in many variants.
Basic regular expression patterns
The simplest kind of regular expression is a sequence of
simple characters.
To search for woodchuck, we type /woodchuck/
Woodchuck
woodchuck
Woodchucks
woodchucks
Disjunction [ ]
The string of characters inside the square brackets [ ]
specifies a disjunction of characters to match.
Range -
The brackets can be used with the dash (-) to specify any
one character in a range.
Negation ^
The square brackets can also be used to specify what a
single character cannot be, by use of the caret ^.
This holds only if the caret ^ is the first symbol after the open
square bracket [ .
RE         Match                       Example Patterns
/[^A-Z]/   not an upper-case letter    "Oyfn pripetchik"
/[^Ss]/    neither 'S' nor 's'         "I have no exquisite reason for 't"
/[^.]/     not a period                "our resident Djinn"
/[e^]/     either 'e' or '^'           "look up ^now"
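The bracket patterns above can be checked directly with Python's re module (the test strings for the disjunction and range examples are assumed for illustration):

```python
import re

assert re.search(r"[wW]oodchuck", "Woodchuck")    # disjunction: 'w' or 'W'
assert re.search(r"[0-9]", "room 221")            # range: any single digit
assert re.search(r"[^A-Z]", "Oyfn pripetchik")    # negation: not an upper-case letter
assert re.search(r"[e^]", "look up ^now")         # ^ not first in brackets: literal 'e' or '^'
assert not re.search(r"[^Ss]", "SsSs")            # every character is 'S' or 's'
```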
Consider the language of certain sheep:
baa!
baaa!
baaaa! ...
Kleene *
zero or more occurrences of the immediately preceding
character or regular expression
/baaa*!/
Kleene +
one or more occurrences of the immediately preceding
character or regular expression
/baa+!/
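A quick check of both Kleene patterns on the sheep language, using Python's re module:

```python
import re

pattern_star = re.compile(r"baaa*!")  # 'a*': zero or more extra a's after 'baa'
pattern_plus = re.compile(r"baa+!")   # 'a+': one or more a's after 'ba'
for bleat in ["baa!", "baaa!", "baaaa!"]:
    assert pattern_star.fullmatch(bleat)
    assert pattern_plus.fullmatch(bleat)
assert not pattern_plus.fullmatch("ba!")  # needs at least two a's in total
```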
Anchors
special characters that anchor RE to specific places in a
string.
The caret ^ matches the start of a line.
The dollar sign $ matches the end of a line.
\b matches a word boundary. \B matches a non-boundary.
/\bthe\b/ matches the word the but not the word other
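The anchors can be tried on a small assumed example string:

```python
import re

text = "the other one; they said the end"
# \b...\b restricts the match to the whole word 'the'.
matches = re.findall(r"\bthe\b", text)
assert len(matches) == 2          # 'other' and 'they' are skipped
assert re.search(r"^the", text)   # ^ anchors at the start of the line
assert re.search(r"end$", text)   # $ anchors at the end of the line
```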
Pipe |
The pipe (disjunction operator) symbol |
The pattern /cat|dog/ matches either the string cat or the
string dog
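A minimal check of the pipe operator (example strings assumed):

```python
import re

# | matches either whole alternative, anywhere in the string.
assert re.search(r"cat|dog", "hot dog")
assert re.search(r"cat|dog", "concatenate")   # 'cat' also matches inside words
assert not re.search(r"cat|dog", "bird")
```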
Precedence hierarchy from high to low
Parenthesis ()
Counters * + ? {}
Sequences and anchors (e.g. the, ^my, end$)
Disjunction |
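Because parentheses bind tighter than |, grouping changes what the disjunction covers. A standard illustration (the guppy pattern is an assumed example, not from the slide):

```python
import re

# gupp(y|ies): the | applies only inside the parentheses.
assert re.fullmatch(r"gupp(y|ies)", "guppy")
assert re.fullmatch(r"gupp(y|ies)", "guppies")
# Without the group, | splits the whole pattern into 'guppy' or 'ies'.
assert not re.fullmatch(r"guppy|ies", "guppies")
```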
A simple example
Suppose we wanted to write a RE to find cases of the English
article the
1. /the/
2. /[tT]he/
3. /[^a-zA-Z][tT]he[^a-zA-Z]/
4. /\b[tT]he\b/
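The four candidate patterns can be compared on an assumed example sentence:

```python
import re

line = "The other one went over there, so the dog followed."
for p in [r"the", r"[tT]he", r"[^a-zA-Z][tT]he[^a-zA-Z]", r"\b[tT]he\b"]:
    print(p, re.findall(p, line))
# r"the" also hits the substrings inside 'other' and 'there' (false positives),
# while r"\b[tT]he\b" finds exactly 'The' and 'the'.
```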
The process we just went through was based on fixing two kinds
of errors:
Matching strings that we should not have matched (there,
then, other)
False positives (Type I errors)
Not matching things that we should have matched (The)
False negatives (Type II errors)
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two
antagonistic efforts:
Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives).
Regular expression in Python : re
Practical examples
Capture Groups
use of parentheses to store a pattern in memory
Every time a capture group is used ( ), the resulting match is
stored in a numbered register.
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran (but not
the faster they ran, the faster we ate).
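The capture-group pattern can be verified with Python's re module:

```python
import re

# \1 and \2 refer back to what the first and second groups captured.
pattern = r"the (.*)er they (.*), the \1er we \2"
assert re.search(pattern, "the faster they ran, the faster we ran")
assert not re.search(pattern, "the faster they ran, the faster we ate")
```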
Your turn: practice answering questions with NLTK:
Reference: https://www.nltk.org/book/ch03.html
3.4 Regular Expressions for Detecting Word Patterns
Language model : N-grams
Introduction
N-grams
Probabilities Recap
Bigram examples
Evaluation and Perplexity
Introduction
In class thinking exercise: Based on the following
sentence, find the probability P(cookies|cook)?
(see if you can say it fast first)
“How many cookies could a good cook cook if a good
cook could cook cookies?”
P(cook) = 4/15
P(cook cookies) = 1/15
P(cookies|cook) = C(cook cookies) / C(cook) = 1/4
Markov assumption
The idea that a future event (in this case, the next word)
can be predicted using a relatively short history (for the
example above, one or two words).
Hence the N-gram model: we will focus on the bigram.
Bigram
The bigram model approximates the probability of a word
given all the previous words P(w|h)
by using only the conditional probability of the preceding
word P(wn|wn−1).
P(in | Please turn your homework) ≈P(in | homework)
import nltk
from nltk import word_tokenize
from nltk.util import ngrams

nltk.download('punkt')

text = ['I am Sam', 'Sam I am', 'I do not like green eggs and ham']
bigram = []
for line in text:
    token = word_tokenize(line)
    bigram = bigram + list(ngrams(token, 2, pad_left=True, pad_right=True,
                                  left_pad_symbol='<S>', right_pad_symbol='</S>'))
print(bigram, '\n')
word_fd = nltk.FreqDist(bigram)
word_fd
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like',
'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'),
('ham', '</S>')]
P(I|<s>) = 2/3
P(Sam|<s>) = 1/3
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]
P(am|I) = 2/3
P(</s>|Sam) = 1/2 = 0.5
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]
P(Sam|am) = 1/2
P(do|I) = 1/3
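The hand counts above can be reproduced in plain Python; the MLE formula P(w|prev) = C(prev w) / C(prev) is assumed from the standard bigram definition:

```python
from collections import Counter

# The toy corpus from the slides, with <s>/</s> sentence boundary markers.
sentences = ["<s> I am Sam </s>",
             "<s> Sam I am </s>",
             "<s> I do not like green eggs and ham </s>"]
bigrams, unigrams = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("I", "<s>"))     # 2/3
print(p_bigram("am", "I"))      # 2/3
print(p_bigram("</s>", "Sam"))  # 1/2
```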
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Why?
Just add one to all the counts!
MLE estimate: P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
Add-1 estimate: P_Add-1(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V), where V is the vocabulary size
[('<S>', 'I'), ('I', 'am'), ('am', 'Tom'), ('Tom', '</S>'),
('<S>', 'Tom'), ('Tom', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'like'), ('like', 'watch'),
('watch', 'videos'), ('videos', 'on'), ('on', 'Youtube'),
('Youtube', 'and'), ('and', 'watch'), ('watch', 'films'),
('films', 'in'), ('in', 'cinema'), ('cinema', '</S>')]
Add-1:
P(I|<s>) = (2+1)/(3+13) = 3/16
P(Tom|<s>) = (1+1)/(3+13) = 1/8
P(am|I) = (2+1)/(3+13) = 3/16
P(</s>|Tom) = (1+1)/(2+13) = 2/15
P(films|watch) = (1+1)/(2+13) = 2/15
P(videos|watch) = (1+1)/(2+13) = 2/15
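The add-1 probabilities above can be reproduced in plain Python, with V = 13 word types (excluding <s> and </s>) as on the slide:

```python
from collections import Counter

# Toy corpus matching the slide's bigram list.
sentences = ["<s> I am Tom </s>",
             "<s> Tom I am </s>",
             "<s> I do like watch videos on Youtube and watch films in cinema </s>"]
bigrams, unigrams = Counter(), Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = 13  # number of distinct word types, as used on the slide

def p_add1(w, prev):
    """Add-1 (Laplace) estimate: (C(prev w) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("I", "<s>"))        # 3/16
print(p_add1("films", "watch"))  # 2/15
```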
Evaluation: How good is our model?
Does our language model prefer good sentences to bad
ones?
Assign higher probability to “real” or “frequently
observed” sentences than “ungrammatical” or “rarely
observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
A test set is an unseen dataset that is different from
our training set, totally unused.
An evaluation metric tells us how well our model
does on the test set.
Evaluation: How good is our model?
Intuition of Perplexity: The Shannon Game.
How well can we predict the next word?
A better model of a text is one which assigns a higher
probability to the word that occurs.
I always order pizza with cheese and ___
Word           Probability
1. mushrooms   0.1
2. pepperoni   0.1
3. anchovies   0.01
4. fried rice  0.0001
5. ...         1e-100
The Shannon Visualization Method
Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>. Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
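The generation procedure above can be sketched in Python; the single-successor bigram table is an assumption chosen so the walk reproduces the example sentence:

```python
import random

# Assumed bigram successor table; a real model would weight choices by probability.
successors = {
    "<s>": ["I"], "I": ["want"], "want": ["to"], "to": ["eat"],
    "eat": ["Chinese"], "Chinese": ["food"], "food": ["</s>"],
}
word, out = "<s>", []
while True:
    word = random.choice(successors[word])  # pick the next word given the current one
    if word == "</s>":                      # stop when the end marker is drawn
        break
    out.append(word)
print(" ".join(out))  # I want to eat Chinese food
```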
Perplexity
The best language model is one that best predicts an unseen test set,
i.e. gives the highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the
number of words:
    PP(W) = P(w1 w2 ... wN)^(-1/N)
Chain rule:
    PP(W) = ( prod_{i=1..N} 1 / P(wi | w1 ... wi-1) )^(1/N)
For bigrams:
    PP(W) = ( prod_{i=1..N} 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability.
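A worked sketch with assumed bigram probabilities for a 4-word test string:

```python
import math

# PP(W) = P(w1 ... wN) ** (-1/N); computed in log space for numerical stability.
probs = [0.5, 0.25, 0.25, 0.5]   # assumed P(w_i | w_{i-1}) values, N = 4
log_p = sum(math.log(p) for p in probs)
perplexity = math.exp(-log_p / len(probs))
print(round(perplexity, 3))   # 2.828
```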
Exam style question:
You are given a Corpus that has the following n-grams and their
number of occurrences: