Week 2

Text processing: Regular expression

 Introduction
 Basic regular expression patterns
 Examples
 Regular expression with Python
 Practical examples
CS3TM20 © XH 1
Introduction
 A regular expression (RE) is a language for specifying text search strings.
 An RE is a sequence of characters that specifies a search pattern.
 REs are used in virtually every computer language, word processor, and text processing tool.
 They are particularly useful for searching texts when we have a pattern to search for and a corpus of texts to search through.
 The corpus can be a single document or a collection.
 Regular expressions come in many variants.
CS3TM20 © XH 2
Basic regular expression patterns
The simplest kind of regular expression is a sequence of
simple characters.
To search for woodchuck, we type /woodchuck/

Woodchuck
woodchuck
Woodchucks
woodchucks

(Matching is case sensitive: of the four strings above, /woodchuck/ matches only woodchuck, and the substring woodchuck inside woodchucks.)

CS3TM20 © XH 3
Disjunction [ ]
 The string of characters inside the square braces [ ]
specifies a disjunction of characters to match

RE Match Example Patterns

/[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck”

/[abc]/ ‘a’, ‘b’, or ‘c’ “In uomini, in soldati”

/[1234567890]/ any digit “plenty of 7 to 5”
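
These character-class patterns can be tried directly with Python's re module; a minimal sketch (the sample text is made up for illustration):

import re

text = "Woodchuck and woodchuck saw 7 woodchucks"
# [wW] matches either an upper- or lower-case first letter
print(re.findall(r'[wW]oodchuck', text))   # ['Woodchuck', 'woodchuck', 'woodchuck']
# [0-9] (or [1234567890]) matches any single digit
print(re.findall(r'[0-9]', text))          # ['7']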

CS3TM20 © XH 4
Range -
 the brackets can be used with the dash (-) to specify any
one character in a range

RE Match Example Patterns


/[A-Z]/ an upper case letter “we should call it ‘Drenched Blossoms’ ”

/[a-z]/ a lower case letter “my beans were impatient to be hoed!”

/[0-9]/ any single digit “Chapter 1: Down the Rabbit Hole”

CS3TM20 © XH 5
Negation ^
 The square braces can also be used to specify what a
single character cannot be, by use of the caret ^.
 This applies only when the caret ^ is the first symbol after the open square brace [ .
RE Match Example Patterns
/[^A-Z]/ not an upper case letter “Oyfn pripetchik”
/[^Ss]/ neither ‘S’ nor ‘s’ “I have no exquisite reason for ‘t”
/[^.]/ not a period “our resident Djinn”
/[e^]/ either ‘e’ or ‘^’ “look up ^now”

/a^b/ the pattern ‘a^b’ “look up a^ b now”


CS3TM20 © XH 6
Optional ?
 The question mark ? marks optionality of the previous
expression.
RE Match Example Patterns
/woodchucks?/ woodchuck or woodchucks “woodchuck”
/colou?r/ color or colour “color”
Period .
 The use of the period . to specify any character.
RE Match Example Patterns
/beg.n/ any character between beg and n begin, beg’n, begun
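
A small Python sketch of the optional and wildcard operators (the example strings are made up):

import re

# ? makes the previous character optional: matches both color and colour
print(re.findall(r'colou?r', 'color or colour'))    # ['color', 'colour']
# . matches any single character between beg and n
print(re.findall(r'beg.n', 'begin begun beg3n'))    # ['begin', 'begun', 'beg3n']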

CS3TM20 © XH 7
 Consider the language of certain sheep:
baa!
baaa!
baaaa! . . .
Kleene *
 zero or more occurrences of the immediately previous
character or regular expression
/baaa*!/
Kleene +
 “one or more occurrences of the immediately preceding
character or regular expression”
/baa+!/
CS3TM20 © XH 8
Anchors
 special characters that anchor RE to specific places in a
string.
 The caret ^ matches the start of a line.
 The dollar sign $ matches the end of a line.
 \b matches a word boundary. \B matches a non-boundary.
 /\bthe\b/ matches the word the but not the word other

Pipe |
 The pipe (disjunction operator) symbol |
 The pattern /cat|dog/ matches either the string cat or the
string dog
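
A short Python sketch covering the Kleene, word-boundary, and pipe operators above (the example strings are made up):

import re

# Kleene * (zero or more a's after 'ba') vs Kleene + (one or more)
print(re.findall(r'baa*!', 'ba! baa! baaaa!'))   # ['ba!', 'baa!', 'baaaa!']
print(re.findall(r'baa+!', 'ba! baa! baaaa!'))   # ['baa!', 'baaaa!']
# \b marks a word boundary: matches the word 'the' but not the 'the' inside 'other'
print(re.findall(r'\bthe\b', 'the other then'))  # ['the']
# | is disjunction: cat or dog
print(re.findall(r'cat|dog', 'cats and dogs'))   # ['cat', 'dog']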
CS3TM20 © XH 9
Precedence hierarchy from high to low

Parentheses ()
Counters * + ? {}
Sequences and anchors the ^my end$
Disjunction |

CS3TM20 © XH 10
A simple example
Suppose we wanted to write a RE to find cases of the English
article the

1. /the/
2. /[tT]he/
3. /[^a-zA-Z][tT]he[^a-zA-Z]/
4. /\b[tT]he\b/

CS3TM20 © XH 11
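
A small Python sketch (not part of the original slides; the sample sentence is made up) of how the four candidate patterns behave, illustrating the false positives and false negatives discussed just below:

import re

sent = "The other one then saw the cat."
print(re.findall(r'the', sent))                       # ['the', 'the', 'the']: hits 'other'/'then', misses 'The'
print(re.findall(r'[tT]he', sent))                    # now finds 'The' too, but still hits 'other' and 'then'
print(re.findall(r'[^a-zA-Z][tT]he[^a-zA-Z]', sent))  # drops 'other'/'then' but misses 'The' at the line start
print(re.findall(r'\b[tT]he\b', sent))                # ['The', 'the']: whole words only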
 The process we just went through was based on fixing two kinds
of errors:
 Matching strings that we should not have matched (there,
then, other)
False positives (Type I errors)
 Not matching things that we should have matched (The)
False negatives (Type II errors)
 In NLP we are always dealing with these kinds of errors.
 Reducing the error rate for an application often involves two
antagonistic efforts:
 Increasing accuracy or precision (minimizing false positives)
 Increasing coverage or recall (minimizing false negatives).

CS3TM20 © XH 16
Regular expression in Python : re
Practical examples

Example 1: RE for detecting Word Patterns


print('\n RE for detecting Word Patterns \n ')
import nltk, re, pprint
nltk.download("popular")
from nltk import word_tokenize
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
print([w for w in wordlist if re.search('ed$', w)],'\n')
print([w for w in wordlist if re.search('^..j..t..$', w)],'\n')
print([w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)],'\n')

CS3TM20 © XH 17
Regular expression in Python : re
Practical examples

Example 1: RE for detecting Word Patterns


print([w for w in wordlist if re.search('ed$', w)],'\n')

What words are found?

print([w for w in wordlist if re.search('^..j..t..$', w)],'\n')

What words are found?

print([w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)],'\n')

What words are found?


CS3TM20 © XH 18
Regular expression in Python : re
In class exercises: Modify the code as
print([w for w in wordlist if re.search('ing$', w)],'\n')

What words are found?

print([w for w in wordlist if re.search('^..s..y..$', w)],'\n')

What words are found?

print([w for w in wordlist if re.search('^[st][aeiou][dt]$', w)],'\n')

What words are found?

CS3TM20 © XH 19
Capture Groups
 use of parentheses to store a pattern in memory
 Every time a capture group is used ( ), the resulting match is
stored in a numbered register.
/the (.*)er they (.*), the \1er we \2/
will match the faster they ran, the faster we ran (but not
the faster they ran, the faster we ate).
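
A minimal Python sketch of this capture-group backreference (same example strings as on the slide):

import re

pattern = r'the (.*)er they (.*), the \1er we \2'
# Matches: \1 = 'fast' and \2 = 'ran' are reused consistently
print(re.search(pattern, 'the faster they ran, the faster we ran'))
# Returns None: the second group must repeat 'ran', but the text has 'ate'
print(re.search(pattern, 'the faster they ran, the faster we ate'))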

CS3TM20 © XH 20
Your turn: Practical answering questions with NLTK:

RE for detecting Word Patterns (More examples):

What patterns have been detected?

Reference: https://www.nltk.org/book/ch03.html
3.4 Regular Expressions for Detecting Word Patterns

CS3TM20 © XH 21
Language model : N-grams
 Introduction
 N-grams
 Probabilities Recap
 Bigram examples
 Evaluation and Perplexity

CS3TM20 © XH 22
Introduction

 Language models offer a way to assign a probability to a sentence or other sequence of words, and
 to predict a word from preceding words.
 Why?
 Probabilities are essential in any NLP task in which we have to identify words, e.g.
• speech recognition.
• spelling correction
• grammatical error correction
• machine translation
CS3TM20 © XH 23
N-gram model

 An N-gram is a sequence of N words.


 N-gram models are attempts to guess the next word in a
sentence based upon the (N-1) previous words in the
sentence, e.g.
Please turn your homework

 Bigram N=2 (two-word sequences), e.g. the bigrams in "Please turn your homework" are:

Please turn
turn your
your homework
CS3TM20 © XH 24
Probability Recap
Conditional probability
 Conditional probability is defined as the likelihood of an
event or outcome occurring,
 based on the occurrence of a previous event or outcome.
P(w|h): the probability of a word w given some history h.

Example: h : Please turn your homework


w: in
P(in | Please turn your homework)
> P(the | Please turn your homework)
CS3TM20 © XH 25
Probability Recap
Product rule
P(A∩B) = P(A) P(B|A)

 P(A∩B): joint probability of A and B
 P(B|A): probability of B conditional on A

Example (refers to a figure not reproduced here):
P(red) = 3/5
P(number ∩ red) = 2/5
P(number | red) = 2/3

CS3TM20 © XH 26
In class thinking exercise: Based on the following
sentence, find the probability P(cookies|cook )?
(see if you can say it fast first)
“How many cookies could a good cook cook if a good
cook could cook cookies?”

P(cook)=4/15
P(cook cookies) = 1/15
P(cookies| cook) =1/4

The other three pairs are


“cook | cook”, “if| cook”, “could|cook”
CS3TM20 © XH 27
Probability Recap

Probability chain rule


P(A ∩ B ∩ C) = P(A ∩ B) P(C|A ∩B)
=P ( B ) P ( A|B ) P(C|A ∩ B)

Markov assumption

 This is the idea that a future event (in this case, the next word) can be predicted using a relatively short history (for example, one or two words).
 Hence the N-gram model: we will focus on bigrams.
CS3TM20 © XH 28
Bigram
 The bigram model approximates the probability of a word
given all the previous words P(w|h)
 by using only the conditional probability of the preceding
word P(wn|wn−1).
P(in | Please turn your homework) ≈P(in | homework)

 Maximum likelihood estimator (MLE)


P(wn|wn-1) =C(wn-1wn)/C(wn-1)
C(wn-1wn): count of the bigram
C(wn-1) : count of the unigram
CS3TM20 © XH 29
Example:

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]

CS3TM20 © XH 30
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
nltk.download('punkt')

text = [' I am Sam ', ' Sam I am ', ' I do not like green eggs and ham ']
bigram = []
for line in text:
    token = nltk.word_tokenize(line)
    bigram = bigram + list(ngrams(token, 2, pad_right=True, pad_left=True,
                                  left_pad_symbol='<S>', right_pad_symbol='</S>'))
print(bigram, '\n')
word_fd = nltk.FreqDist(bigram)
word_fd
bigram
CS3TM20 © XH 31
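
To turn these bigram counts into the MLE conditional probabilities shown on the following slides, a minimal sketch (continuing from the bigram list built above; nltk's ConditionalFreqDist treats each pair as (condition, outcome)):

cfd = nltk.ConditionalFreqDist(bigram)   # condition = previous word, outcome = next word

# MLE: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
print(cfd['<S>'].freq('I'))     # P(I | <s>)    = 2/3
print(cfd['<S>'].freq('Sam'))   # P(Sam | <s>)  = 1/3
print(cfd['I'].freq('am'))      # P(am | I)     = 2/3
print(cfd['Sam'].freq('</S>'))  # P(</s> | Sam) = 1/2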
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]

P(I|<s>) =2/3
P(Sam|<s>) =1/3

CS3TM20 © XH 32
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]

P(am|I) =2/3
P(</s>|Sam) =1/2=0.5

CS3TM20 © XH 33
[('<S>', 'I'), ('I', 'am'), ('am', 'Sam'), ('Sam', '</S>'),
('<S>', 'Sam'), ('Sam', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'),
('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', '</S>')]

P(Sam|am) =1/2
P(do|I) =1/3

CS3TM20 © XH 34
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Why?
Just add one to all the counts!

MLE estimate:
P(wn|wn-1) = C(wn-1 wn) / C(wn-1)

Add-1 estimate:
PAdd-1(wn|wn-1) = ( C(wn-1 wn) + 1 ) / ( C(wn-1) + V )

where V is the vocabulary size (the number of distinct word types).


In class exercise:
a) List all bigrams for the below sentences (Python
recommended)
<s> I am Tom </s>
<s> Tom I am </s>
<s> I do like watch videos on Youtube and watch films in
cinema </s>

b) Then identify the conditional probabilities of MLE and Add-1 estimation, respectively:
P(I|<s>) = P(Tom|<s>) =
P(am|I) = P(</s>|Tom) =
P(films|watch) = P(videos|watch) =
CS3TM20 © XH 36
[('<S>', 'I'), ('I', 'am'), ('am', 'Tom'), ('Tom', '</S>'),
('<S>', 'Tom'), ('Tom', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'like'), ('like', 'watch'),
('watch', 'videos'), ('videos', 'on'), ('on', 'Youtube'),
('Youtube', 'and'), ('and', 'watch'), ('watch', 'films'),
('films', 'in'), ('in', 'cinema'), ('cinema', '</S>')]

P(I|<s>) = 2/3 P(Tom|<s>) = 1/3


P(am|I) = 2/3 P(</s>|Tom) = 1/2
P(films|watch) = 1/2 P(videos|watch) = 1/2

CS3TM20 © XH 37
[('<S>', 'I'), ('I', 'am'), ('am', 'Tom'), ('Tom', '</S>'),
('<S>', 'Tom'), ('Tom', 'I'), ('I', 'am'), ('am', '</S>'),
('<S>', 'I'), ('I', 'do'), ('do', 'like'), ('like', 'watch'),
('watch', 'videos'), ('videos', 'on'), ('on', 'Youtube'),
('Youtube', 'and'), ('and', 'watch'), ('watch', 'films'),
('films', 'in'), ('in', 'cinema'), ('cinema', '</S>')]
Add-1:
P(I|<s>) = (2+1)/(3+13)= 3/16
P(Tom|<s>) = (1+1)/(3+13)=1/8
P(am|I) = (2+1)/(3+13)= 3/16
P(</s>|Tom) = (1+1)/(2+13)=2/15
P(films|watch) = (1+1)/(2+13)=2/15
P(videos|watch) = (1+1)/(2+13)=2/15
CS3TM20 © XH 38
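
A small sketch that reproduces these MLE and Add-1 numbers from raw bigram counts (plain Python, not part of the original slides; V = 13 is the vocabulary size used above, excluding <s> and </s>):

from collections import Counter

sents = [['<s>', 'I', 'am', 'Tom', '</s>'],
         ['<s>', 'Tom', 'I', 'am', '</s>'],
         ['<s>', 'I', 'do', 'like', 'watch', 'videos', 'on', 'Youtube',
          'and', 'watch', 'films', 'in', 'cinema', '</s>']]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sents for i in range(len(s) - 1))
V = 13  # vocabulary size used on the slide above

def mle(w_prev, w):
    # MLE: C(w_prev w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def add1(w_prev, w):
    # Add-1: (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(mle('<s>', 'I'), add1('<s>', 'I'))              # 2/3 and 3/16
print(mle('watch', 'films'), add1('watch', 'films'))  # 1/2 and 2/15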
Evaluation: How good is our model?
Does our language model prefer good sentences to bad
ones?
Assign higher probability to “real” or “frequently
observed” sentences than “ungrammatical” or “rarely
observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
A test set is an unseen dataset that is different from
our training set, totally unused.
An evaluation metric tells us how well our model
does on the test set.
Evaluation: How good is our model?
Intuition of Perplexity: The Shannon Game.
 How well can we predict the next word?
A better model of a text is one which assigns a higher
probability to the word that occurs.
I always order pizza with cheese and ____

Word and probability:
1. mushrooms 0.1
2. pepperoni 0.1
3. anchovies 0.01
4. fried rice 0.0001
5. .... 1e-100
The Shannon Visualization Method
Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>. Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
Perplexity
The best language model is one that best predicts an unseen test set
with highest P(sentence).
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
Chain rule:
PP(W) = ( Π i 1/P(wi | w1 ... wi-1) )^(1/N)
For bigrams:
PP(W) = ( Π i 1/P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability.
Exam style question:

You are given a Corpus that has the following n-grams and their
number of occurrences:

I (500), always (400), eat (100), pizza (350)


I always (300), always eat (50), eat pizza (50)

(i) Estimate the probabilities P(always) and P(always | I) using the Maximum Likelihood Estimation method.

(ii) What is the perplexity of the sentence "I always eat pizza"?


Answer:
I (500), always (400), eat (100), pizza (350)
I always (300), always eat (50), eat pizza (50)

(i) Estimate the probabilities P(always) and P(always | I) using the Maximum Likelihood Estimation method.
P(always) = 400/1350 ≈ 0.296 (taking the corpus size as N = 500 + 400 + 100 + 350 = 1350)
P(always | I) = 300/500 = 0.6

(ii) What is the perplexity of the sentence "I always eat pizza"?
P(I) P(always | I) P(eat | always) P(pizza | eat)
= (500/1350)(300/500)(50/400)(50/100) ≈ 0.0139
PP = (1/0.0139)^(1/4) = 2.9130
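
A short sketch of the perplexity arithmetic for part (ii) (plain Python, only reproducing the numbers above):

# Unigram and bigram counts from the question
c = {'I': 500, 'always': 400, 'eat': 100, 'pizza': 350,
     ('I', 'always'): 300, ('always', 'eat'): 50, ('eat', 'pizza'): 50}
N = 500 + 400 + 100 + 350  # total corpus size = 1350

# Bigram probability of "I always eat pizza"
p = ((c['I'] / N)
     * (c[('I', 'always')] / c['I'])
     * (c[('always', 'eat')] / c['always'])
     * (c[('eat', 'pizza')] / c['eat']))
print(p)                   # ≈ 0.0139

# Perplexity: inverse probability, normalized by the number of words (4)
pp = (1 / p) ** (1 / 4)
print(pp)                  # ≈ 2.913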
