2 - Regular Expressions, Text Normalization, Edit Distance
Regular Expressions
• Regular expressions are case sensitive. This means that the
pattern /woodchucks/ will not match the string “Woodchucks”.
– We can solve this by using square brackets []
– The string of characters inside the brackets [] specifies a disjunction of
characters to match.
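As a sketch of the idea, here is the same pattern in Python's `re` module (the slides write patterns between slashes; `re` takes the bare pattern string):

```python
import re

# [wW] matches either "w" or "W", so one pattern covers both capitalizations.
pattern = re.compile(r"[wW]oodchucks")

print(pattern.search("interesting links to woodchucks"))  # match
print(pattern.search("Woodchucks are rodents"))           # match
print(pattern.search("WOODCHUCKS"))                       # None: only the first letter may vary
```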
Regular Expressions: Disjunctions
Regular Expressions: Negation in Disjunction
• Negations can be applied using the caret ^ symbol
– Caret means negation only when first in []
Pattern    Matches                     Example Patterns Matched
[^A-Z]     Not an upper case letter    Oyfn pripetchik

Pattern    Matches                       Example Patterns Matched
colou?r    Optional previous char        Color, Colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
Regular Expressions: Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret (^) matches the start of a line.
– The pattern /^The/ matches the word “The” only at the start of a line.
Pattern       Example Matched
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end?  The end!
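The anchor patterns above can be verified directly (a sketch using Python's `re`; note that `\.` escapes the period so it matches literally, while an unescaped `.` matches any character):

```python
import re

assert re.search(r"^[A-Z]", "Palo Alto")        # line starts with an upper-case letter
assert re.search(r"^[^A-Za-z]", '1 "Hello"')    # line starts with a non-letter
assert re.search(r"\.$", "The end.")            # literal period at end of line
assert re.search(r".$", "The end?")             # any character at end of line
assert not re.search(r"\.$", "The end?")        # but "?" is not a literal period

print("all anchor checks passed")
```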
Regular Expressions: Boundary Anchors \b \B
• There are also two other anchors: \b matches a word boundary, and \B
matches a non-boundary.
• For the purposes of a regular expression, a “word” is defined as any
sequence of digits, underscores, or letters.
• Examples:
– /\bthe\b/ matches the word “the” but not the word “other”.
– /\b99\b/ will match the string 99 in “There are 99 bottles of juice on the wall”
(because 99 follows a space and precedes a space) but not 99 in “There are
299 bottles of juice on the wall” (since 99 follows a number). But it will match
99 in “$99” (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
• What would the results be if we used the other anchor, \B, in the previous
examples, given that it matches a non-word boundary?
Example:
• Suppose we wanted to write a RE to find cases of the
English article “the”. A simple (but incorrect) pattern might
be:
/the/
• One problem is that this pattern will miss the word when it
begins a sentence and hence is capitalized (i.e., The). This
might lead us to the following pattern:
/[tT]he/
• But we will still incorrectly return texts with “the” embedded
in other words (e.g., other or theology).
• So we need to specify that we want instances with a word
boundary on both sides:
/\b[tT]he\b/
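The three-step refinement can be watched in action (a sketch in Python's `re`; the sample sentence is made up for illustration):

```python
import re

text = "The other theology text was the one they read."

print(re.findall(r"the", text))         # misses "The"; also matches inside "other", "theology", "they"
print(re.findall(r"[tT]he", text))      # now catches "The", but still matches the embedded cases
print(re.findall(r"\b[tT]he\b", text))  # only the article: ['The', 'the']
```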
Errors
• The process we just went through was based on
fixing two kinds of errors
– Matching strings that we should not have matched (there,
then, other)
• False positives (Type I)
– Not matching things that we should have matched (The)
• False negatives (Type II)
Basic Text Processing
Text normalization
Text normalization
• Normalizing text means converting it to a more convenient, standard
form.
1. Tokenization - Splitting a phrase, sentence, paragraph, or an entire
text document into smaller units, such as individual words or terms.
2. Lemmatization - The task of determining that two words have the
same root, despite their surface differences.
– The words “sang”, “sung”, and “sings” are forms of the verb “sing”. The
word sing is the common lemma of these words, and a lemmatizer maps
from all of these to “sing”.
3. Stemming - We mainly just strip suffixes from the end of the word.
– The words “caring”, “careful” are stemmed to “car”, and the words
“history” and “historical” are stemmed to “histori”
4. Sentence Segmentation - We break up a text into individual
sentences, using cues like periods or exclamation points.
Normalization
• Need to “normalize” terms
– Information Retrieval: indexed text and query terms must
have the same form.
• We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
– e.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows
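A minimal sketch of the equivalence-class idea: a `normalize` function (a hypothetical helper, one of many possible choices) that deletes periods and case-folds, so that variant spellings map to the same representative:

```python
def normalize(term):
    """Map a term to its equivalence-class representative.
    Illustrative choice: delete periods and case-fold."""
    return term.replace(".", "").lower()

# U.S.A. and USA now fall in the same equivalence class:
print(normalize("U.S.A."))   # "usa"
print(normalize("USA"))      # "usa"
assert normalize("U.S.A.") == normalize("USA")
```

This symmetric collapsing is the alternative to the asymmetric expansion shown above, where the query side is expanded instead of both sides being normalized.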
Case folding
Word tokenization
Text Normalization
• Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
How many words?
• A lemma is a set of lexical forms having the same stem, the same
major part of speech, and the same word sense.
• cat and cats = same lemma
How many words?
They lay back on the San Francisco grass and looked at the stars and their
• How many?
• 15 tokens (or 14, if “San Francisco” counts as one token)
How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^0.5)
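The token/type distinction can be computed directly; this sketch uses naive whitespace tokenization on the sentence from the earlier slide:

```python
text = "They lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()    # naive whitespace tokenization
types = set(tokens)      # the vocabulary V

print(len(tokens))   # N = 15 tokens
print(len(types))    # |V| = 13 types ("the" and "and" each occur twice)
```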
Simple Tokenization in UNIX
• We can use the command tr to tokenize the words by changing every sequence of
non-alphabetic characters to a newline ('A-Za-z' means alphabetic, the -c
option complements to non-alphabetic, and the -s option squeezes repeated
characters into a single one):
tr -sc 'A-Za-z' '\n' < shakes.txt
The output of this command will be:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Simple Tokenization in UNIX
• Now that there is one word per line, we can sort the lines, and
pass them to uniq -c, which will collapse and count them:
...
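The same tokenize-sort-count pipeline can be sketched in Python (the sample text stands in for the contents of shakes.txt):

```python
import re
from collections import Counter

text = "THE SONNETS by William Shakespeare From fairest creatures we desire increase THE"

# Equivalent of: tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
words = re.findall(r"[A-Za-z]+", text)   # every maximal run of letters is one token
counts = Counter(words)

for word, n in sorted(counts.items()):
    print(n, word)
```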
Issues in Tokenization
• Finland’s capital → Finland? Finlands? Finland’s?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Basic Text Processing
Morphology
• It is the study of the internal structure of words.
• Morphology focuses on how the components within a word (stems, root
words, prefixes, suffixes, etc.) are arranged or modified to create different
meanings.
• Example: happy; un-happy; happy-ness; un-happy-ness
• Morphemes: the smallest meaning-bearing units of words (e.g., stems and affixes).
Stemming
• Reduce terms to their stems in information retrieval.
• Stemming is a crude chopping of affixes
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
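A toy suffix-stripping stemmer makes the "crude chopping" concrete. This is an illustrative sketch, not the Porter algorithm; the rule list is hand-picked just to reproduce the slide's examples:

```python
# Ordered (suffix, replacement) rules; the first matching suffix is stripped.
RULES = [("eful", ""), ("ing", ""), ("ion", ""), ("cal", ""), ("ic", ""),
         ("es", ""), ("e", ""), ("y", "i")]

def crude_stem(word):
    """Strip the first matching suffix (toy example, language dependent)."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["automates", "automatic", "automation", "caring", "careful",
          "history", "historical"]:
    print(w, "->", crude_stem(w))
# automates, automatic, automation -> automat
# caring, careful -> car
# history, historical -> histori
```

Note how crude the results are: "careful" loses its whole "care" stem, which is exactly the drawback the slide's examples illustrate.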
Basic Text Processing
Determining if a word is End-of-Sentence: Decision Tree
More sophisticated decision tree features
• Numeric features
– Length of word with “.”
– Probability(word with “.” occurs at end-of-sentence)
– Probability(word after “.” occurs at beginning-of-sentence)
Implementing Decision Trees
• A decision tree is just an if-then-else statement.
• The interesting research is choosing the features.
• Setting up the structure is often too hard to do by hand.
– Hand-building only possible for very simple features,
domains
• For numeric features, it’s too hard to pick each
threshold
• Instead, structure usually learned by machine learning from
a training corpus
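The "decision tree as if-then-else" point can be made concrete with a tiny hand-built end-of-sentence classifier. Everything here is an illustrative sketch: the abbreviation list is made up, and real systems learn both the structure and the thresholds from a training corpus:

```python
# Hypothetical abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}

def is_end_of_sentence(token, next_token):
    """A hand-built decision tree, written as nested if/else."""
    if not token.endswith((".", "!", "?")):
        return False                 # no final punctuation -> not a boundary
    if token.lower() in ABBREVIATIONS:
        return False                 # e.g., "Dr." before "Smith": period belongs to the abbreviation
    if next_token[:1].islower():
        return False                 # next word not capitalized -> likely not a boundary
    return True

print(is_end_of_sentence("wall.", "The"))   # True
print(is_end_of_sentence("Dr.", "Smith"))   # False
print(is_end_of_sentence("the", "end"))     # False
```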
Basic Text Processing
Minimum Edit Distance
– d → delete
– s → substitution
– i → insert
How to find the Min Edit Distance?
• Searching for a path (sequence of edits) from the start string
to the final string:
– Initial state: the word we’re transforming
– Operators: insert, delete, substitute
– Goal state: the word we’re trying to get to
– Path cost: what we want to minimize: the number of edits
Defining Min Edit Distance
• For two strings
– X of length n
– Y of length m
• We define D(i,j)
– the edit distance between X[1..i] and Y[1..j]
• i.e., the first i characters of X and the first j characters of Y
– The edit distance between X and Y is thus D(n,m)
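The recurrence behind D(i,j) can be filled in bottom-up with dynamic programming. A minimal sketch (insert and delete cost 1; the substitution cost is a parameter, since some texts use cost 1 and others use cost 2):

```python
def min_edit_distance(x, y, sub_cost=1):
    """D[i][j] = edit distance between x[:i] and y[:j]; returns D(n, m)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                  # delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j                  # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,     # delete x[i]
                D[i][j - 1] + 1,     # insert y[j]
                D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost),
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 5 (substitution cost 1)
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8 (substitution cost 2)
```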
Minimum Edit Distance - Example