
Lecture 3: Text Processing & Minimum Edit Distance Algorithm
Lecture Objectives:
• Students will be able to understand NLP tasks
• Students will be able to understand language modeling techniques
• Students will be able to understand the edit distance algorithm and its applications
CSC-441: Natural Language Processing
What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Text Normalization
• Every NLP task requires text normalization:
  1. Tokenizing (segmenting) words
  2. Normalizing word formats
  3. Segmenting sentences
REs & Parsing
Each regular expression (RE) represents a set of strings having a certain pattern.
• In NLP, we can use REs to find strings having certain patterns in a given text.

Simple definition of regular expressions over an alphabet Σ:
• ε is a regular expression
• If a ∈ Σ, then a is a regular expression
• Or (union): if E1 and E2 are REs, then E1 | E2 is a regular expression
• Concatenation: if E1 and E2 are REs, then E1E2 is a regular expression
• Kleene closure: if E is an RE, then E* is a regular expression
• Positive closure: if E is an RE, then E+ is a regular expression
Regular Expressions: Searching for Specific Words/Strings
Example
• Find all instances of the word "the" in a text.
  the                         Misses capitalized examples ("The")
  [tT]he                      Incorrectly also matches inside words such as "other" or "theology"
  [^a-zA-Z][tT]he[^a-zA-Z]    Matches only standalone occurrences
This is a good example for understanding sequencing, patterns, and their efficient use in IE (information extraction) or IR (information retrieval).
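A minimal sketch (our own illustration, not the lecture's code) of how these three patterns behave with Python's re module; the sample sentence is invented for the demo:

import re

text = "The theology student said the answer lay in the Theorem."
print(re.findall(r"the", text))
# ['the', 'the', 'the']  misses "The" and matches inside "theology"
print(re.findall(r"[tT]he", text))
# ['The', 'the', 'the', 'the', 'The']  still matches inside "theology" and "Theorem"
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
# [' the ', ' the ']  only standalone "the"; note it misses "The" at the very start
# of the string because there is no preceding character for [^a-zA-Z] to match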
Example
import re

stri = "hahaha! yes i will dodo this in Rs. 234/-"
# Remove all the special characters
document = re.sub(r'\W', ' ', stri)
# Remove all single characters surrounded by whitespace
document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
# Remove single characters from the start of the string
document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
print(document)
# 'hahaha  yes will dodo this in Rs  234  '
# (the lone "i" is removed; the substitutions leave extra spaces behind)
What is Parsing?
Parsing means resolving a sentence into its component parts. These components can comprise a word or a group of words.

Grammar:
S  → NP VP        NP  → Ali
NP → Det N        N   → rice
NP → NP PP        N   → spoon
VP → V NP         V   → spoon
VP → VP PP        V   → ate
PP → P NP         P   → with
                  Det → the
                  Det → a

[Parse tree for "Ali ate the rice with a spoon": S splits into NP (Ali) and VP; the VP splits into VP (ate the rice) and PP (with a spoon).]
Parsing Techniques
• Classical
  o Top-down parsing
  o Bottom-up parsing
  o Dynamic programming
• Bottom-up: explores options that won't lead to a full parse
  o Example: shift-reduce (srparser in nltk)
  o Example: CKY (Cocke-Kasami-Younger)
• Top-down: explores options that don't match the full sentence
  o Example: recursive descent (rdparser in nltk; see the sketch below)
  o Example: Earley parser
• Dynamic programming: caches intermediate results (memoization)
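As a quick illustration of one of the classical parsers named above, here is a hedged sketch (our own code, not the lecture's) that runs NLTK's recursive-descent parser, the rdparser mentioned above, on a tiny grammar; the grammar and sentence are simplified from the parsing example on the previous slide:

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N | 'Ali'
  VP -> V NP
  V -> 'ate'
  Det -> 'the'
  N -> 'rice'
""")
rd_parser = nltk.RecursiveDescentParser(grammar)   # top-down parser
for tree in rd_parser.parse("Ali ate the rice".split()):
    print(tree)   # (S (NP Ali) (VP (V ate) (NP (Det the) (N rice))))

Note that recursive descent can loop forever on left-recursive rules such as NP → NP PP, which is why they are left out of this toy grammar.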
Shift-Reduce Parsing
Bottom-Up Parser Example

Grammar:
S → aABe
A → Abc
A → b
B → d

Input: a b b c d e $

Trace (each reduction rewrites the matched symbols, so the working string shrinks step by step):
1. Shift a
2. Shift b, then reduce b to A (A → b):         a A b c d e $
3. Shift A
4. Shift b
5. Shift c, then reduce Abc to A (A → Abc):     a A d e $
6. Shift A
7. Shift d, then reduce d to B (B → d):         a A B e $
8. Shift B
9. Shift e, then reduce aABe to S (S → aABe):   S $
10. Shift S and hit the target $: the whole input has been reduced to the start symbol S, so the parse succeeds.

This parser is known as an LR parser because it scans the input from Left to right, and it constructs a Rightmost derivation in reverse order.
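Below is a minimal sketch (our own illustration, not the lecture's program) of a shift-reduce recognizer for the toy grammar above. A purely greedy parser would reduce the second b to A too early, so this sketch backtracks: it tries reductions first and falls back to shifting. Unlike the slides, reductions happen directly on the stack, so there are no separate "shift A" steps.

# Toy grammar from the example above: S -> aABe, A -> Abc, A -> b, B -> d
GRAMMAR = [("S", ("a", "A", "B", "e")),
           ("A", ("A", "b", "c")),
           ("A", ("b",)),
           ("B", ("d",))]

def parse(stack, tokens):
    """Return the list of (action, stack) steps that reduce the input to S, or None."""
    if not tokens and stack == ["S"]:
        return []                                    # success: only the start symbol remains
    for lhs, rhs in GRAMMAR:                         # try every reduction of a stack suffix
        n = len(rhs)
        if len(stack) >= n and tuple(stack[-n:]) == rhs:
            new_stack = stack[:-n] + [lhs]
            rest = parse(new_stack, tokens)
            if rest is not None:
                return [("reduce %s -> %s" % (" ".join(rhs), lhs), new_stack)] + rest
    if tokens:                                       # otherwise (or after a failed reduce) shift
        new_stack = stack + [tokens[0]]
        rest = parse(new_stack, tokens[1:])
        if rest is not None:
            return [("shift " + tokens[0], new_stack)] + rest
    return None                                      # dead end: the caller backtracks

for action, stack in parse([], list("abbcde")):
    print("%-22s stack: %s" % (action, " ".join(stack)))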
Ambiguity
Using the same grammar as on the previous slides, the sentence "Ali ate the rice with a spoon" has two different parse trees:

[Parse 1: the PP attaches to the VP. S splits into NP (Ali) and VP; the VP splits into VP (ate the rice) and PP (with a spoon).]

[Parse 2: the PP attaches to the NP. S splits into NP (Ali) and VP; the VP splits into V (ate) and NP; that NP splits into NP (the rice) and PP (with a spoon).]
The Parsing Problem

[Diagram: a PARSER takes test sentences plus a grammar and produces parse trees, which are scored for accuracy against the correct test trees.]

Recent parsers are quite accurate: good enough to help a range of NLP tasks!
Chomsky Normal Form
The right-hand side of a standard CFG rule can have an arbitrary number of symbols (terminals and nonterminals):
  VP → ADV eat NP

A CFG in Chomsky Normal Form (CNF) allows only two kinds of right-hand sides:
  – Two nonterminals: VP → ADV VP
  – One terminal:     VP → eat

Any CFG can be transformed into an equivalent CNF, e.g.:
  VP  → ADVP VP1
  VP1 → VP2 NP
  VP2 → eat
A note about ε-productions
Formally, context-free grammars are allowed to have empty productions (ε = the empty string), e.g. NP → ε.

These can always be eliminated without changing the language generated by the grammar:
  VP → V NP    NP → DT Noun    NP → ε
becomes
  VP → V NP    VP → V ε    NP → DT Noun
which in turn becomes
  VP → V NP    VP → V    NP → DT Noun

We will assume that our grammars don't have ε-productions.


The CKY Parsing Algorithm

[CKY chart for "We eat mango", built with the grammar below. To recover the parse tree, each chart entry also needs pairs of backpointers to the smaller entries it was built from.]

S  → NP VP
VP → V NP
V  → eat
NP → we
NP → mango
CKY Algorithm, Recognizer Version
• Input: a string of n words
• Output: yes/no (since it is only a recognizer)
• Data structure: an n × n table
  – rows labeled 0 to n-1
  – columns labeled 1 to n
  – cell [i, j] lists the constituents found between positions i and j
• Basic idea: fill in the width-1 cells, then the width-2 cells, and so on
CKY Algorithm, Recognizer Version

for J := 1 to n
    add to [J-1, J] all categories for the J-th word
for col := 2 to n
    for i := 0 to n - col
        k := i + col
        for j := i+1 to k-1
            for every nonterminal Y in [i, j]
                for every nonterminal Z in [j, k]
                    for all nonterminals X
                        if X → Y Z is in the grammar
                            then add X to [i, k]
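A direct Python transcription of this pseudocode (a hedged sketch; encoding the grammar as a word-to-category lexicon plus a dictionary of binary rules is our own choice, not part of the lecture):

from collections import defaultdict

def cky_recognize(words, lexicon, binary_rules):
    # lexicon: word -> set of categories (unary rules X -> word)
    # binary_rules: (Y, Z) -> set of X such that X -> Y Z is in the grammar
    n = len(words)
    chart = defaultdict(set)                      # chart[(i, k)] = constituents spanning i..k
    for j in range(1, n + 1):                     # width-1 cells
        chart[(j - 1, j)] |= lexicon.get(words[j - 1], set())
    for width in range(2, n + 1):                 # then width 2, 3, ...
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):             # every split point
                for Y in chart[(i, j)]:
                    for Z in chart[(j, k)]:
                        chart[(i, k)] |= binary_rules.get((Y, Z), set())
    return "S" in chart[(0, n)]

# Toy CNF grammar (assumed; matches the example on the later slides):
lexicon = {"we": {"NP"}, "eat": {"V"}, "mango": {"NP"}, "apple": {"NP"}, "with": {"P"}}
binary_rules = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"},
                ("VP", "PP"): {"VP"}, ("NP", "PP"): {"NP"}, ("P", "NP"): {"PP"}}
print(cky_recognize("we eat mango with apple".split(), lexicon, binary_rules))   # True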
CKY: Filling the Chart
[Figure: the chart for words w1 ... wn is filled span by span, shorter spans before longer ones.]

CKY: Filling One Cell
[Figure: a single cell, e.g. chart[2][6] for the input w1 w2 w3 w4 w5 w6 w7, is filled by combining the pairs of cells that cover its two halves at every possible split point.]


The CKY Parsing Algorithm
Example sentence: "We buy drinks with milk"

[CKY chart: the word "drinks" is ambiguous between V and NP, and the span "drinks with milk" receives both VP and NP analyses.]

S  → NP VP        NP → NP PP
VP → V NP         NP → we
VP → VP PP        NP → drinks
V  → drinks       NP → milk
PP → P NP         P  → with
The CKY Parsing Algorithm
Example sentence: "We eat mango with apple"

[CKY chart: "eat mango" and "eat mango with apple" receive VP analyses, "mango with apple" an NP analysis, and "with apple" a PP analysis; the span "eat mango with apple" can be built both by VP attachment and by NP attachment of the PP.]

S  → NP VP        NP → NP PP
VP → V NP         NP → we
VP → VP PP        NP → mango
V  → eat          NP → apple
PP → P NP         P  → with
What are the terminals in NLP?
Are the "terminals" words or POS tags?

• For toy examples (e.g. on slides), the terminals are typically the words.
• With POS-tagged input, we may either treat the POS tags as the terminals, or assume that the unary rules in our grammar are of the form
  POS-tag → word
  (so POS tags are the only nonterminals that can be rewritten as words; some people call POS tags "preterminals").


Shift-Reduce Parsing
• A bottom-up parser: tries to match the RHS of a production until it can build an S
• Shift operation: each word in the input sentence is pushed onto a stack
• Reduce-n operation: if the top n symbols of the stack match the RHS of a production, they are popped and replaced by the LHS of that production
• Breadth-first search
• Stopping condition: the process stops when the input sentence has been processed and S has been popped from the stack
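NLTK includes a simple shift-reduce parser, nltk.ShiftReduceParser (the srparser mentioned earlier). A hedged sketch of using it; the grammar and sentence are our own toy example, not the lecture's:

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N | 'Ali'
  VP -> V NP
  V -> 'ate'
  Det -> 'the'
  N -> 'rice'
""")
sr_parser = nltk.ShiftReduceParser(grammar)
for tree in sr_parser.parse("Ali ate the rice".split()):
    print(tree)
# Note: NLTK's shift-reduce parser is greedy (no backtracking), so on some
# grammars and sentences it fails to find a parse even though one exists.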
Probabilistic Language Modeling
Assign a probability P(S) to a sentence S.
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5, ..., wn)
• Related task: probability of an upcoming word:
  P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.
• "Grammar" would be a better name, but "language model" or LM is standard.
How to compute P(W)
• How do we compute this joint probability?
  – P(its, water, is, so, transparent, that)
• Intuition: let's rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probability:
  P(B|A) = P(A,B) / P(A), so rewriting: P(A,B) = P(A) P(B|A)
• More variables:
  P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
  P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)
The Chain Rule applied to compute the joint probability of the words in a sentence:

P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi-1)

P("its water is so transparent") =
  P(its) × P(water | its) × P(is | its water)
  × P(so | its water is) × P(transparent | its water is so)
Sentence (Parse Tree) Probability

Rule              Probability
S  → P VP         1.0
P  → He           0.5
P  → She          0.5
VP → VH NV        0.5
VP → VH           0.5
VH → can          1.0
NV → help         1.0

The probability of a parse tree given by a PCFG is the product of the probabilities of the rules used in its derivation.
Exercises: P(He can)? P(She can help)?
A probabilistic context-free grammar is a generative model; P(Tree) is the probability that the grammar generates that particular derivation.
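As a worked example under these rule probabilities:
P(He can) = P(S → P VP) × P(P → He) × P(VP → VH) × P(VH → can) = 1.0 × 0.5 × 0.5 × 1.0 = 0.25
P(She can help) = P(S → P VP) × P(P → She) × P(VP → VH NV) × P(VH → can) × P(NV → help) = 1.0 × 0.5 × 0.5 × 1.0 × 1.0 = 0.25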
How to estimate these
probabilities
• Could we just count and divide?
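To make "count and divide" concrete, here is a hedged sketch (our own code, not the lecture's) that estimates a single conditional word probability by maximum likelihood from the Brown corpus introduced below:

import nltk
from collections import Counter
# requires: nltk.download('brown')

words = [w.lower() for w in nltk.corpus.brown.words(categories='news')]
unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))

def p(word, prev):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(p("the", "of"))   # P(the | of), a fairly large value in news text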
Corpora
• Where do the words come from?
• A text can be written for any specific purpose,
• by any writer,
• employing the rules of a specific language.
In linguistics, a corpus (plural: corpora) or text corpus is a language resource consisting of a large and structured set of texts.
nltk: Natural Language Toolkit
• Install nltk: pip install nltk
• Import the Brown corpus:
>>> from nltk.corpus import brown
• Find all categories:
>>> brown.categories()
• ['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore',
'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
nltk: Natural Language Toolkit
• Access the corpus as a list of words or a list
of sentences
>>> brown.words(categories='humor')
['It', 'was', 'among', 'these', 'that', 'Hinkle', ...]
>>> brown.words(categories='fiction')
['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', ...]
>>> brown.sents(categories='humor')
[['It', 'was', 'among', 'these', 'that', 'Hinkle', 'identified', 'a',
'photograph', 'of', 'Barco', '!', '!'], ['For', 'it', 'seems', 'that',
'Barco', ',', 'fancying', 'himself', 'a', "ladies'", 'man', '(', 'and',
'why', 'not', ',', 'after', 'seven', 'marriages', '?', '?'], ...]
nltk: Natural Language Toolkit
• Use your own text
import nltk

text1 = input("Enter some text: ")
words = nltk.word_tokenize(text1)
print(words)
print(len(words))
print("You typed", len(nltk.word_tokenize(text1)), "words.")

>>> Enter some text: Natural Languge Processing
['Natural', 'Languge', 'Processing']
3
You typed 3 words.
How Many Words?
• "My father , walking along a river looking at sky said these words."
• Type: an element of the vocabulary.
• Token: an instance of a type in running text.
• How many?
  – Number of tokens = 14
  – Number of types = ?
Word Normalization
• Lemmatization
  – Represent all words by their shared root
  – The goal is to remove inflections and map a word to its root form:
    am, are, is → be
    car, cars, car's, cars' → car
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
Lemmatization is done by Morphological Parsing
• Morphemes:
  – The small meaningful units that make up words
  – Stems: the core meaning-bearing units
  – Affixes: parts that adhere to stems, often with grammatical functions
• Morphological parsers:
  – Parse "cats" into two morphemes, cat and -s
  – Parse "connected" into connect and -ed
Word Normalization
• Stemming: chop off the ends of words to approximate the root
  – Stemming uses a crude heuristic process that chops off word endings, e.g. This → Thi, Accurate → Accur
Lemmatization and stemming
import nltk
text1 =input("Enter some text: ")
words=nltk.word_tokenize(text1)
print(words)
print(len(words))
print ("You typed", len(nltk.word_tokenize(text1)),
"words.")
lemma = nltk.wordnet.WordNetLemmatizer()
print ("Lemmatized: ",lemma.lemmatize('article'))
print ("Lemmatized: ",lemma.lemmatize('leaves'))
sno = nltk.stem.SnowballStemmer('english')
print("Stemmed: ",sno.stem('article'))
print("Stemmed: ",sno.stem('leaves'))
output
• “article” Lemmatized: article
• “leaves” Lemmatized: leaf
• “article” Stemmed: articl
• “leaves” Stemmed: leav
Sentence Segmentation
• !, ? are relatively unambiguous but “.” is quite
ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Common Algorithm: decide whether a (.) is
part of the word or is a sentence-boundary
marker.
– An abbreviation dictionary can help
• Sentence segmentation can then often be
done by rules based on this tokenization.
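NLTK's pre-trained Punkt sentence tokenizer makes exactly this kind of decision (it learns abbreviations from data rather than using a hand-built dictionary). A hedged sketch with an invented example sentence:

import nltk
# requires: nltk.download('punkt')

text = "Dr. Khan paid Rs. 234 at 4.30 p.m. Then he left. Was that enough?"
for sent in nltk.sent_tokenize(text):
    print(sent)
# The periods in "Dr.", "Rs." and "4.30" are not sentence boundaries;
# the one after "p.m." is genuinely ambiguous, since a capitalized word follows.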
How similar are two strings?

• Spell correction
  – The user typed "Karachi". Which of these is closest?
    Kirachi, Karachu, Kerrach, Kararachi

• Computational biology
  – Align two sequences of nucleotides:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  – Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also used for machine translation, information extraction, and speech recognition.
Edit Distance
• The minimum edit distance between two
strings
• Is the minimum number of editing
operations
– Insertion
– Deletion
– Substitution
• Needed to transform one into the other
Computing Levenshtein Distance
D(i, j) = score of the best alignment of s1[1..i] with s2[1..j]

              D(i-1, j-1)        if s1[i] = s2[j]   // copy
D(i, j) = min D(i-1, j-1) + 1    if s1[i] ≠ s2[j]   // substitute
              D(i-1, j) + 1                          // insert
              D(i, j-1) + 1                          // delete
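A minimal Python implementation of this recurrence (our own sketch; insertion, deletion and substitution all cost 1, as in Levenshtein distance, and the operation labels follow the recurrence above):

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                    # align s1[1..i] with the empty prefix of s2
    for j in range(n + 1):
        D[0][j] = j                    # align the empty prefix of s1 with s2[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub_cost,   # copy or substitute
                          D[i - 1][j] + 1,              # insert
                          D[i][j - 1] + 1)              # delete
    return D[m][n]

print(edit_distance("Karachi", "Kirachi"))    # 1 (one substitution)
print(edit_distance("Karachi", "Kararachi"))  # 2 (two insertions)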
Minimum Edit Distance
• Two strings and their alignment:
Basic Processing
• Basic text processing includes:
  – Conversion to the same case (lower-casing)
  – Lemmatization
  – Stemming
  – Spelling correction
  – Sentence segmentation
Summary
• Language models treat word-sequence prediction as a probabilistic model
• The minimum edit distance between two strings can be used to correct spelling
• It can also be used for information extraction
• It can also be used for word annotation
References
• Wikipedia
• Prof. Jason Eisner, Natural Language Processing course, Johns Hopkins University
• web.stanford.edu
• CS447: Natural Language Processing (J. Hockenmaier)
