
Lecture 3: Text Processing & Minimum Edit Distance Algorithm
Lecture Objectives:
• Students will be able to understand NLP tasks
• Students will be able to understand language modeling techniques
• Students will be able to understand the edit distance algorithm and its applications
CSC-441: Natural Language Processing
What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Text Normalization
• Every NLP task requires text normalization:
  1. Tokenizing (segmenting) words
  2. Normalizing word formats
  3. Segmenting sentences
REs & Parsing
Each regular expression (RE) represents a set of strings having a certain pattern.
• In NLP, we can use REs to find strings having certain patterns in a given text.

Simple definition of regular expressions over an alphabet Σ:
• ε is a regular expression
• If a ∈ Σ, then a is a regular expression
• Or (union): if E1 and E2 are REs, then E1 | E2 is a regular expression
• Concatenation: if E1 and E2 are REs, then E1E2 is a regular expression
• Kleene closure: if E is an RE, then E* is a regular expression
• Positive closure: if E is an RE, then E+ is a regular expression
Regular Expressions: Searching for Specific Words/Strings
Example
• Find all instances of the word "the" in a text.
  the                         Misses capitalized examples ("The")
  [tT]he                      Incorrectly also matches inside words such as "other" or "theology"
  [^a-zA-Z][tT]he[^a-zA-Z]    Matches only standalone occurrences
This is a good example for understanding sequencing, patterns, and their efficient use in IE (information extraction) or IR (information retrieval).
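A minimal sketch (our own illustration, not the lecture's code) of how these three patterns behave with Python's re module; the sample sentence is invented for the demo:

import re

text = "The theology student said the answer lay in the Theorem."
print(re.findall(r"the", text))
# ['the', 'the', 'the']  misses "The" and matches inside "theology"
print(re.findall(r"[tT]he", text))
# ['The', 'the', 'the', 'the', 'The']  still matches inside "theology" and "Theorem"
print(re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text))
# [' the ', ' the ']  only standalone "the"; note it misses "The" at the very start
# of the string because there is no preceding character for [^a-zA-Z] to match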
Example
import re

stri = "hahaha! yes i will dodo this in Rs. 234/-"
# Remove all the special characters
document = re.sub(r'\W', ' ', stri)
# Remove all single characters surrounded by whitespace
document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
# Remove single characters from the start of the string
document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
print(document)
# 'hahaha  yes will dodo this in Rs  234  '
# (the lone "i" is removed; the substitutions leave extra spaces behind)
What is Parsing?
Parsing means resolving a sentence into its component parts. These components can comprise a word or a group of words.

Grammar:
S  → NP VP        NP  → Ali
NP → Det N        N   → rice
NP → NP PP        N   → spoon
VP → V NP         V   → spoon
VP → VP PP        V   → ate
PP → P NP         P   → with
                  Det → the
                  Det → a

[Parse tree for "Ali ate the rice with a spoon": S splits into NP (Ali) and VP; the VP splits into VP (ate the rice) and PP (with a spoon).]
Parsing Techniques
• Classical
  o Top-down parsing
  o Bottom-up parsing
  o Dynamic programming
• Bottom-up: explores options that won't lead to a full parse
  o Example: shift-reduce (srparser in nltk)
  o Example: CKY (Cocke-Kasami-Younger)
• Top-down: explores options that don't match the full sentence
  o Example: recursive descent (rdparser in nltk; see the sketch below)
  o Example: Earley parser
• Dynamic programming: caches intermediate results (memoization)
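As a quick illustration of one of the classical parsers named above, here is a hedged sketch (our own code, not the lecture's) that runs NLTK's recursive-descent parser, the rdparser mentioned above, on a tiny grammar; the grammar and sentence are simplified from the parsing example on the previous slide:

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N | 'Ali'
  VP -> V NP
  V -> 'ate'
  Det -> 'the'
  N -> 'rice'
""")
rd_parser = nltk.RecursiveDescentParser(grammar)   # top-down parser
for tree in rd_parser.parse("Ali ate the rice".split()):
    print(tree)   # (S (NP Ali) (VP (V ate) (NP (Det the) (N rice))))

Note that recursive descent can loop forever on left-recursive rules such as NP → NP PP, which is why they are left out of this toy grammar.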
Shift-Reduce Parsing
Bottom-Up Parser Example

Grammar:
S → aABe
A → Abc
A → b
B → d

Input: a b b c d e $

Trace (each reduction rewrites the matched symbols, so the working string shrinks step by step):
1. Shift a
2. Shift b, then reduce b to A (A → b):         a A b c d e $
3. Shift A
4. Shift b
5. Shift c, then reduce Abc to A (A → Abc):     a A d e $
6. Shift A
7. Shift d, then reduce d to B (B → d):         a A B e $
8. Shift B
9. Shift e, then reduce aABe to S (S → aABe):   S $
10. Shift S and hit the target $: the whole input has been reduced to the start symbol S, so the parse succeeds.

This parser is known as an LR parser because it scans the input from Left to right, and it constructs a Rightmost derivation in reverse order.
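Below is a minimal sketch (our own illustration, not the lecture's program) of a shift-reduce recognizer for the toy grammar above. A purely greedy parser would reduce the second b to A too early, so this sketch backtracks: it tries reductions first and falls back to shifting. Unlike the slides, reductions happen directly on the stack, so there are no separate "shift A" steps.

# Toy grammar from the example above: S -> aABe, A -> Abc, A -> b, B -> d
GRAMMAR = [("S", ("a", "A", "B", "e")),
           ("A", ("A", "b", "c")),
           ("A", ("b",)),
           ("B", ("d",))]

def parse(stack, tokens):
    """Return the list of (action, stack) steps that reduce the input to S, or None."""
    if not tokens and stack == ["S"]:
        return []                                    # success: only the start symbol remains
    for lhs, rhs in GRAMMAR:                         # try every reduction of a stack suffix
        n = len(rhs)
        if len(stack) >= n and tuple(stack[-n:]) == rhs:
            new_stack = stack[:-n] + [lhs]
            rest = parse(new_stack, tokens)
            if rest is not None:
                return [("reduce %s -> %s" % (" ".join(rhs), lhs), new_stack)] + rest
    if tokens:                                       # otherwise (or after a failed reduce) shift
        new_stack = stack + [tokens[0]]
        rest = parse(new_stack, tokens[1:])
        if rest is not None:
            return [("shift " + tokens[0], new_stack)] + rest
    return None                                      # dead end: the caller backtracks

for action, stack in parse([], list("abbcde")):
    print("%-22s stack: %s" % (action, " ".join(stack)))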
Ambiguity
Using the same grammar as on the previous slides, the sentence "Ali ate the rice with a spoon" has two different parse trees:

[Parse 1: the PP attaches to the VP. S splits into NP (Ali) and VP; the VP splits into VP (ate the rice) and PP (with a spoon).]

[Parse 2: the PP attaches to the NP. S splits into NP (Ali) and VP; the VP splits into V (ate) and NP; that NP splits into NP (the rice) and PP (with a spoon).]
The Parsing Problem

[Diagram: a PARSER takes test sentences plus a grammar and produces parse trees, which are scored for accuracy against the correct test trees.]

Recent parsers are quite accurate: good enough to help a range of NLP tasks!
Chomsky Normal Form
The right-hand side of a standard CFG rule can have an arbitrary number of symbols (terminals and nonterminals):
  VP → ADV eat NP

A CFG in Chomsky Normal Form (CNF) allows only two kinds of right-hand sides:
  – Two nonterminals: VP → ADV VP
  – One terminal:     VP → eat

Any CFG can be transformed into an equivalent CNF, e.g.:
  VP  → ADVP VP1
  VP1 → VP2 NP
  VP2 → eat
A note about ε-productions
Formally, context-free grammars are allowed to have empty productions (ε = the empty string), e.g. NP → ε.

These can always be eliminated without changing the language generated by the grammar:
  VP → V NP    NP → DT Noun    NP → ε
becomes
  VP → V NP    VP → V ε    NP → DT Noun
which in turn becomes
  VP → V NP    VP → V    NP → DT Noun

We will assume that our grammars don't have ε-productions.


The CKY Parsing Algorithm

[CKY chart for "We eat mango", built with the grammar below. To recover the parse tree, each chart entry also needs pairs of backpointers to the smaller entries it was built from.]

S  → NP VP
VP → V NP
V  → eat
NP → we
NP → mango
CKY Algorithm, Recognizer Version
• Input: a string of n words
• Output: yes/no (since it is only a recognizer)
• Data structure: an n × n table
  – rows labeled 0 to n-1
  – columns labeled 1 to n
  – cell [i, j] lists the constituents found between positions i and j
• Basic idea: fill in the width-1 cells, then the width-2 cells, and so on
CKY Algorithm, Recognizer Version

for J := 1 to n
    add to [J-1, J] all categories for the J-th word
for col := 2 to n
    for i := 0 to n - col
        k := i + col
        for j := i+1 to k-1
            for every nonterminal Y in [i, j]
                for every nonterminal Z in [j, k]
                    for all nonterminals X
                        if X → Y Z is in the grammar
                            then add X to [i, k]
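A direct Python transcription of this pseudocode (a hedged sketch; encoding the grammar as a word-to-category lexicon plus a dictionary of binary rules is our own choice, not part of the lecture):

from collections import defaultdict

def cky_recognize(words, lexicon, binary_rules):
    # lexicon: word -> set of categories (unary rules X -> word)
    # binary_rules: (Y, Z) -> set of X such that X -> Y Z is in the grammar
    n = len(words)
    chart = defaultdict(set)                      # chart[(i, k)] = constituents spanning i..k
    for j in range(1, n + 1):                     # width-1 cells
        chart[(j - 1, j)] |= lexicon.get(words[j - 1], set())
    for width in range(2, n + 1):                 # then width 2, 3, ...
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):             # every split point
                for Y in chart[(i, j)]:
                    for Z in chart[(j, k)]:
                        chart[(i, k)] |= binary_rules.get((Y, Z), set())
    return "S" in chart[(0, n)]

# Toy CNF grammar (assumed; matches the example on the later slides):
lexicon = {"we": {"NP"}, "eat": {"V"}, "mango": {"NP"}, "apple": {"NP"}, "with": {"P"}}
binary_rules = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"},
                ("VP", "PP"): {"VP"}, ("NP", "PP"): {"NP"}, ("P", "NP"): {"PP"}}
print(cky_recognize("we eat mango with apple".split(), lexicon, binary_rules))   # True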
CKY: Filling the Chart
[Figure: the chart for words w1 ... wn is filled span by span, shorter spans before longer ones.]

CKY: Filling One Cell
[Figure: a single cell, e.g. chart[2][6] for the input w1 w2 w3 w4 w5 w6 w7, is filled by combining the pairs of cells that cover its two halves at every possible split point.]


The CKY Parsing Algorithm
Example sentence: "We buy drinks with milk"

[CKY chart: the word "drinks" is ambiguous between V and NP, and the span "drinks with milk" receives both VP and NP analyses.]

S  → NP VP        NP → NP PP
VP → V NP         NP → we
VP → VP PP        NP → drinks
V  → drinks       NP → milk
PP → P NP         P  → with
The CKY Parsing Algorithm
Example sentence: "We eat mango with apple"

[CKY chart: "eat mango" and "eat mango with apple" receive VP analyses, "mango with apple" an NP analysis, and "with apple" a PP analysis; the span "eat mango with apple" can be built both by VP attachment and by NP attachment of the PP.]

S  → NP VP        NP → NP PP
VP → V NP         NP → we
VP → VP PP        NP → mango
V  → eat          NP → apple
PP → P NP         P  → with
What are the terminals in NLP?
Are the "terminals" words or POS tags?

• For toy examples (e.g. on slides), the terminals are typically the words.
• With POS-tagged input, we may either treat the POS tags as the terminals, or assume that the unary rules in our grammar are of the form
  POS-tag → word
  (so POS tags are the only nonterminals that can be rewritten as words; some people call POS tags "preterminals").


Shift-Reduce Parsing
• A bottom-up parser: tries to match the RHS of a production until it can build an S
• Shift operation: each word in the input sentence is pushed onto a stack
• Reduce-n operation: if the top n symbols of the stack match the RHS of a production, they are popped and replaced by the LHS of that production
• Breadth-first search
• Stopping condition: the process stops when the input sentence has been processed and S has been popped from the stack
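NLTK includes a simple shift-reduce parser, nltk.ShiftReduceParser (the srparser mentioned earlier). A hedged sketch of using it; the grammar and sentence are our own toy example, not the lecture's:

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N | 'Ali'
  VP -> V NP
  V -> 'ate'
  Det -> 'the'
  N -> 'rice'
""")
sr_parser = nltk.ShiftReduceParser(grammar)
for tree in sr_parser.parse("Ali ate the rice".split()):
    print(tree)
# Note: NLTK's shift-reduce parser is greedy (no backtracking), so on some
# grammars and sentences it fails to find a parse even though one exists.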
Probabilistic Language Modeling
Assign a probability P(S) to a sentence S.
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1, w2, w3, w4, w5, ..., wn)
• Related task: probability of an upcoming word:
  P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.
• "Grammar" would be a better name, but "language model" or LM is standard.
How to compute P(W)
• How do we compute this joint probability?
  – P(its, water, is, so, transparent, that)
• Intuition: let's rely on the Chain Rule of Probability
Reminder: The Chain Rule
• Recall the definition of conditional probability:
  P(B|A) = P(A,B) / P(A), so rewriting: P(A,B) = P(A) P(B|A)
• More variables:
  P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
  P(x1,x2,x3,...,xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)
The Chain Rule applied to compute the joint probability of the words in a sentence:

P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi-1)

P("its water is so transparent") =
  P(its) × P(water | its) × P(is | its water)
  × P(so | its water is) × P(transparent | its water is so)
Sentence (Parse Tree) Probability

Rule              Probability
S  → P VP         1.0
P  → He           0.5
P  → She          0.5
VP → VH NV        0.5
VP → VH           0.5
VH → can          1.0
NV → help         1.0

The probability of a parse tree given by a PCFG is the product of the probabilities of the rules used in its derivation.
Exercises: P(He can)? P(She can help)?
A probabilistic context-free grammar is a generative model; P(Tree) is the probability that the grammar generates that particular derivation.
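As a worked example under these rule probabilities:
P(He can) = P(S → P VP) × P(P → He) × P(VP → VH) × P(VH → can) = 1.0 × 0.5 × 0.5 × 1.0 = 0.25
P(She can help) = P(S → P VP) × P(P → She) × P(VP → VH NV) × P(VH → can) × P(NV → help) = 1.0 × 0.5 × 0.5 × 1.0 × 1.0 = 0.25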
How to estimate these
probabilities
• Could we just count and divide?
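To make "count and divide" concrete, here is a hedged sketch (our own code, not the lecture's) that estimates a single conditional word probability by maximum likelihood from the Brown corpus introduced below:

import nltk
from collections import Counter
# requires: nltk.download('brown')

words = [w.lower() for w in nltk.corpus.brown.words(categories='news')]
unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))

def p(word, prev):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(p("the", "of"))   # P(the | of), a fairly large value in news text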
Corpora
• Where do the words come from?
• A text can be written for any specific purpose,
• by any writer,
• employing the rules of a specific language.
In linguistics, a corpus (plural: corpora) or text corpus is a language resource consisting of a large and structured set of texts.
nltk: Natural Language Toolkit
• Install nltk: pip install nltk
• Import the Brown corpus:
>>> from nltk.corpus import brown
• Find all categories:
>>> brown.categories()
• ['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore',
'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
nltk: Natural Language Toolkit
• Access the corpus as a list of words or a list
of sentences
>>> brown.words(categories='humor')
['It', 'was', 'among', 'these', 'that', 'Hinkle', ...]
>>> brown.words(categories='fiction')
['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', ...]
>>> brown.sents(categories='humor')
[['It', 'was', 'among', 'these', 'that', 'Hinkle', 'identified', 'a',
'photograph', 'of', 'Barco', '!', '!'], ['For', 'it', 'seems', 'that',
'Barco', ',', 'fancying', 'himself', 'a', "ladies'", 'man', '(', 'and',
'why', 'not', ',', 'after', 'seven', 'marriages', '?', '?'], ...]
nltk: Natural Language Toolkit
• Use your own text
import nltk

text1 = input("Enter some text: ")
words = nltk.word_tokenize(text1)
print(words)
print(len(words))
print("You typed", len(nltk.word_tokenize(text1)), "words.")

>>> Enter some text: Natural Languge Processing
['Natural', 'Languge', 'Processing']
3
You typed 3 words.
How Many Words?
• "My father , walking along a river looking at sky said these words."
• Type: an element of the vocabulary.
• Token: an instance of a type in running text.
• How many?
  – Number of tokens = 14
  – Number of types = ?
Word Normalization
• Lemmatization
  – Represent all words by their shared root
  – The goal is to remove inflections and map a word to its root form:
    am, are, is → be
    car, cars, car's, cars' → car
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
Lemmatization is done by Morphological Parsing
• Morphemes:
  – The small meaningful units that make up words
  – Stems: the core meaning-bearing units
  – Affixes: parts that adhere to stems, often with grammatical functions
• Morphological parsers:
  – Parse "cats" into two morphemes, cat and -s
  – Parse "connected" into connect and -ed
Word Normalization
• Stemming: chop off the ends of words to approximate the root
  – Stemming uses a crude heuristic process that chops off word endings, e.g. This → Thi, Accurate → Accur
Lemmatization and stemming
import nltk
text1 =input("Enter some text: ")
words=nltk.word_tokenize(text1)
print(words)
print(len(words))
print ("You typed", len(nltk.word_tokenize(text1)),
"words.")
lemma = nltk.wordnet.WordNetLemmatizer()
print ("Lemmatized: ",lemma.lemmatize('article'))
print ("Lemmatized: ",lemma.lemmatize('leaves'))
sno = nltk.stem.SnowballStemmer('english')
print("Stemmed: ",sno.stem('article'))
print("Stemmed: ",sno.stem('leaves'))
output
• “article” Lemmatized: article
• “leaves” Lemmatized: leaf
• “article” Stemmed: articl
• “leaves” Stemmed: leav
Sentence Segmentation
• !, ? are relatively unambiguous but “.” is quite
ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Common Algorithm: decide whether a (.) is
part of the word or is a sentence-boundary
marker.
– An abbreviation dictionary can help
• Sentence segmentation can then often be
done by rules based on this tokenization.
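NLTK's pre-trained Punkt sentence tokenizer makes exactly this kind of decision (it learns abbreviations from data rather than using a hand-built dictionary). A hedged sketch with an invented example sentence:

import nltk
# requires: nltk.download('punkt')

text = "Dr. Khan paid Rs. 234 at 4.30 p.m. Then he left. Was that enough?"
for sent in nltk.sent_tokenize(text):
    print(sent)
# The periods in "Dr.", "Rs." and "4.30" are not sentence boundaries;
# the one after "p.m." is genuinely ambiguous, since a capitalized word follows.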
How similar are two strings?

• Spell correction
  – The user typed "Karachi". Which of these is closest?
    Kirachi, Karachu, Kerrach, Kararachi

• Computational biology
  – Align two sequences of nucleotides:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  – Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also used for machine translation, information extraction, and speech recognition.
Edit Distance
• The minimum edit distance between two
strings
• Is the minimum number of editing
operations
– Insertion
– Deletion
– Substitution
• Needed to transform one into the other
Computing Levenshtein Distance
D(i, j) = score of the best alignment of s1[1..i] with s2[1..j]

              D(i-1, j-1)        if s1[i] = s2[j]   // copy
D(i, j) = min D(i-1, j-1) + 1    if s1[i] ≠ s2[j]   // substitute
              D(i-1, j) + 1                          // insert
              D(i, j-1) + 1                          // delete
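A minimal Python implementation of this recurrence (our own sketch; insertion, deletion and substitution all cost 1, as in Levenshtein distance, and the operation labels follow the recurrence above):

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                    # align s1[1..i] with the empty prefix of s2
    for j in range(n + 1):
        D[0][j] = j                    # align the empty prefix of s1 with s2[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub_cost,   # copy or substitute
                          D[i - 1][j] + 1,              # insert
                          D[i][j - 1] + 1)              # delete
    return D[m][n]

print(edit_distance("Karachi", "Kirachi"))    # 1 (one substitution)
print(edit_distance("Karachi", "Kararachi"))  # 2 (two insertions)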
Minimum Edit Distance
• Two strings and their alignment:
Basic Processing
• Basic text processing includes:
  – Conversion to the same case (lower-casing)
  – Lemmatization
  – Stemming
  – Spelling correction
  – Sentence segmentation
Summary
• Language models treat word-sequence prediction as a probabilistic model
• The minimum edit distance between two strings can be used to correct spelling
• It can also be used for information extraction
• It can also be used for word annotation
References
• Wikipedia
• Prof. Jason Eisner, Natural Language Processing course, Johns Hopkins University
• web.stanford.edu
• CS447: Natural Language Processing (J. Hockenmaier)
