Basic Text Processing: Regular Expressions and Text Normalization
DR. K.M. AZHARUL HASAN
ELIZA: A natural language processing system
User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE
User: You are not very aggressive but I think you don't want me to notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
User: You don't argue with me.
ELIZA: WHY DO YOU THINK I DON'T ARGUE WITH YOU
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
Weizenbaum (1966)
An Intelligent Bangla Conversational Agent: TUNI
Rahman Joy, M.T., Shahriar Akash, M.N., Hasan, K.M.A. (2021). An Intelligent Bangla Conversational Agent: TUNI. In: Uddin, M.S., Bansal, J.C. (eds) Proceedings of International Joint Conference on Advances in Computational Intelligence. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-0586-4_34
Corpus
Corpus linguistics encompasses the compilation and analysis of collections of spoken and written texts as the source of evidence for describing the nature, structure, and use of languages.
A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Corpora (plural of corpus) vary greatly in size and design.
Most are nowadays in electronic form, built so that computer software can support their analysis.
Corpora are often annotated to show grammatical classes, structures, and functions.
Software can then be used to analyse grammatical structures or to identify such annotated features.
Regular expressions
A formal language for specifying text strings.
Regular expressions are particularly useful for searching in texts when we have a pattern to search for.
A regular expression search function will search through the corpus, returning all texts that match the pattern.
The corpus can be a single document or a collection.
How can we search for any of these?
woodchuck
woodchucks
Woodchuck
Woodchucks
Regular Expressions: Disjunctions
The string of characters inside square brackets specifies a disjunction of characters to match.

Pattern       Matches
[wW]oodchuck  Woodchuck, woodchuck
[1234567890]  any digit

Ranges [A-Z]

Pattern  Matches                Example
[A-Z]    an upper case letter   Drenched Blossoms
[a-z]    a lower case letter    my beans were impatient
[0-9]    a single digit         Chapter 1: Down the Rabbit Hole
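As a quick illustration, here is a minimal Python sketch (using the standard re module; the sample strings are invented) that answers the woodchuck question above with a character class:

import re

# [wW] is a disjunction over 'w' and 'W'; the trailing s? makes the plural optional.
pattern = re.compile(r"[wW]oodchucks?")
text = "A Woodchuck chucks wood while two woodchucks watch."
print(pattern.findall(text))          # ['Woodchuck', 'woodchucks']

# Ranges: find every digit and every upper case letter in a line.
line = "Chapter 1: Down the Rabbit Hole"
print(re.findall(r"[0-9]", line))     # ['1']
print(re.findall(r"[A-Z]", line))     # ['C', 'D', 'R', 'H']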
Regular Expressions: Negation in Disjunction
The pattern /[2-5]/ specifies any one of the characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g.

Negations [^Ss]
The caret means negation only when it is first inside [].

Pattern  Matches                    Example
[^A-Z]   not an upper case letter   Oyfn pripetchik
[^Ss]    neither 'S' nor 's'        I have no exquisite reason
[e^]     either 'e' or '^'          Look here
a^b      the pattern a^b            Look up a^b now
Regular Expressions: ? * + .
Pattern  Description                  Matches
colou?r  optional previous char       color, colour
oo*h!    0 or more of previous char   oh! ooh! oooh! ooooh!
o+h!     1 or more of previous char   oh! ooh! oooh! ooooh!
baa+     1 or more of previous char   baa baaa baaaa baaaaa
beg.n    . matches any character      begin begun begun beg3n

The * and + operators are called the Kleene star and Kleene plus, after Stephen C. Kleene.
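A small Python sketch of these counters (sample strings invented for illustration):

import re

# ? makes the previous character optional: both spellings match.
print(re.findall(r"colou?r", "color colour"))      # ['color', 'colour']

# * allows zero or more of the previous char; + requires at least one.
text = "oh! ooh! oooh! h!"
print(re.findall(r"oo*h!", text))    # ['oh!', 'ooh!', 'oooh!'] but not 'h!'
print(re.findall(r"o+h!", text))     # ['oh!', 'ooh!', 'oooh!']

# . matches any single character.
print(re.findall(r"beg.n", "begin begun beg3n"))   # ['begin', 'begun', 'beg3n']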
Regular Expressions: Anchors ^ $
Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $.
The caret matches the start of a line. The pattern /^The/ matches the word "The" only at the start of a line.
The caret ^ has three uses:
to match the start of a line,
to indicate a negation inside of square brackets,
and just to mean a caret.
Regular Expressions: Anchors ^ $
The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by the dollar sign) is a useful pattern for matching a space at the end of a line.
/^The dog\.$/ matches a line that contains only the phrase "The dog."
There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary.
/\bthe\b/ matches the word the but not the word other.
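A brief Python sketch of the anchors (re.MULTILINE makes ^ and $ apply per line; the test string is invented):

import re

text = "The dog.\nSee the other dog."

# ^ and $ anchor the match to the start and end of a line.
print(re.findall(r"^The dog\.$", text, re.MULTILINE))   # ['The dog.']

# \b matches at word boundaries, so the 'the' inside 'other' is not matched.
print(re.findall(r"\bthe\b", text))                     # ['the']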
Disjunction, Grouping, and Precedence
The disjunction operator is the pipe symbol |. The pattern /cat|dog/ matches either the string cat or the string dog.
To apply the disjunction operator only to a specific pattern, use the parenthesis operators ( and ).
The pattern /gupp(y|ies)/ specifies that we mean the disjunction only to apply to the suffixes y and ies.
We could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated any number of times, as in the sketch below.
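A short Python illustration of grouping (example strings invented):

import re

# Parentheses restrict the disjunction to the suffix; (?: ) groups
# without capturing, so findall returns the whole match.
print(re.findall(r"gupp(?:y|ies)", "one guppy, two guppies"))
# ['guppy', 'guppies']

# The whole group repeated: 'Column', a number, optional spaces, any number of times.
m = re.search(r"(Column [0-9]+ *)*", "Column 1 Column 2 Column 3")
print(m.group(0))    # 'Column 1 Column 2 Column 3'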
Operator precedence hierarchy
Parentheses             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |
Counters have a higher precedence than sequences: /the*/ matches theeeee but not thethe.
Sequences have a higher precedence than disjunction: /the|any/ matches the or any but not theny.
Greedy pattern
Patterns can be ambiguous in another way.
Consider the expression /[a-z]*/ when matching against the text once upon a time.
Since /[a-z]*/ matches zero or more letters, this expression could match nothing, or just the first letter o, or on, onc, or once.
In this case the RE matches the longest sequence:
REs always match the largest string they can.
Therefore, we say that patterns are greedy, expanding to cover as much of a string as they can.
Ways to make a pattern non-greedy:
*? : Kleene star matching as little text as possible
+? : Kleene plus matching as little text as possible
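A minimal Python demonstration of greedy versus non-greedy matching on the slide's text:

import re

text = "once upon a time"

# Greedy: [a-z]* grabs as many letters as it can from the start.
print(re.match(r"[a-z]*", text).group())     # 'once'

# Non-greedy: *? takes as little as possible, here the empty string.
print(re.match(r"[a-z]*?", text).group())    # ''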
Example 1
Find all instances of the word "the" in a text.

/the/                               misses capitalized examples (The)
/[tT]he/                            incorrectly returns other or theology
/\b[tT]he\b/                        \b does not treat underscores or digits as boundaries, so it misses the in contexts like the_ or the25
/[^a-zA-Z][tT]he[^a-zA-Z]/          won't find the word the when it begins a line
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/  handles all of the above
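The final pattern, tried out in Python (the test text is invented for illustration):

import re

pattern = re.compile(r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)", re.MULTILINE)
text = "The cat saw the other theology book.\nthe end"

# Group 2 holds the actual 'the'; groups 1 and 3 are the non-letter context.
print([m.group(2) for m in pattern.finditer(text)])
# ['The', 'the', 'the']   ('other' and 'theology' are correctly skipped)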
Example 2
Find "any machine with at least 6 GHz and 500 GB of disk space for less than $1000"
Look for expressions like
"6 GHz or 500 GB or Mac or $999.99"

/$[0-9]+/                                    a dollar sign followed by a string of digits
/$[0-9]+\.[0-9][0-9]/                        fractions of dollars; $199.99 but not $199
/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/             optional cents and a word boundary; still allows prices like $199999.99
/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/         limit the number of digits
/\b[6-9]+ *(GHz|[Gg]igahertz)\b/             specifications for processor speed of at least 6 GHz
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/  for disk space
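A Python version of three of these patterns. One assumption to note: in Python's re a bare $ means end-of-line, so a literal dollar sign must be escaped as \$ (the slide patterns leave it unescaped); the sample ad text is invented.

import re

ad = "MacBook with 8 GHz processor, 500 GB disk, now $999.99"

# Dollar amount with optional cents ((?: ) groups without capturing).
print(re.findall(r"\$[0-9]+(?:\.[0-9][0-9])?\b", ad))                    # ['$999.99']

# Processor speed of at least 6 GHz.
print(re.findall(r"\b[6-9]+ *(?:GHz|[Gg]igahertz)\b", ad))               # ['8 GHz']

# Disk space in GB / gigabytes.
print(re.findall(r"\b[0-9]+(?:\.[0-9]+)? *(?:GB|[Gg]igabytes?)\b", ad))  # ['500 GB']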
Errors
The process we just went through was based on fixing two kinds of errors:
Matching strings that we should not have matched (there, then, other): false positives (Type I)
Not matching things that we should have matched (The): false negatives (Type II)
Errors cont.
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic efforts:
Increasing accuracy or precision (minimizing false positives)
Increasing coverage or recall (minimizing false negatives)
Summary
Regular expressions play a surprisingly large role.
Sophisticated sequences of regular expressions are often the first model for any text processing task.
For many hard tasks, we use machine learning classifiers.
But regular expressions are used as features in the classifiers,
and can be very useful in capturing generalizations.
Text Normalization
Text Normalization
Normalizing text means converting it to a more convenient, standard form.
Ex. most of what we are going to do with language relies on first separating out, or tokenizing, words: word boundary detection.
English words are often separated from each other by whitespace, but whitespace is not always sufficient:
New York and rock 'n' roll are sometimes treated as large words despite the fact that they contain spaces;
sometimes we'll need to separate I'm into the two words I and am;
for processing tweets or texts we need to tokenize emoticons like :) or hashtags like #nlproc.
Tokenization
Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens.
Ex. "After sleeping for four hours, he decided to sleep for another four"
{'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}
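A minimal sketch of this kind of tokenization with a regular expression (one simple convention among many; punctuation handling is deliberately naive):

import re

sentence = "After sleeping for four hours, he decided to sleep for another four"

# \w+ grabs maximal runs of word characters, dropping punctuation.
print(re.findall(r"\w+", sentence))
# ['After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided',
#  'to', 'sleep', 'for', 'another', 'four']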
Lemmatization and Stemming
A form of text normalization is lemmatization:
the task of determining that two words have the same root, despite their surface differences.
For example, the words sang, sung, and sings are forms of the verb sing.
A lemmatizer maps from all of these to sing:
sang, sung, sings -> sing
Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word:
sings -> sing (suffix stripping alone cannot map sung to sing)
Lemma
A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
Seuss's cat in the hat is different from other cats!
cat and cats = same lemma
Word form: the full inflected surface form
cat and cats = different word forms
Lemmatization and Stemming
Lemmatization is the task of determining that two words have the same root, despite their surface differences.
The words am, are, and is have the shared lemma be;
the words dinner and dinners both have the lemma dinner.
Representing a word by its lemma is important for web search. This is especially important in morphologically complex languages.
Ex. He is reading detective stories would thus be He be read detective story.
How is lemmatization done?
Morphology: the study of how words are formed by combining smaller units of meaning called morphemes.
Morphemes: the smallest meaningful units in a language. E.g. in the word "cats", the parts "cat" and "s" are both morphemes.
Two types of morphemes:
1. Stems: the central part of the word that gives the main meaning. E.g., in "running", "run" is the stem, as it provides the core meaning.
2. Affixes: additional parts of the word that add extra meaning, such as prefixes or suffixes. E.g., in "running", "ing" is the affix, which tells us it's an action happening right now.
Lemmatization
Reduce inflections or variant forms to base form:
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
Lemmatization: we have to find the correct dictionary headword form.
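A quick sketch with NLTK's WordNetLemmatizer (assuming nltk and its wordnet data are installed; the pos argument says which part of speech to assume):

# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Verbs reduce to their dictionary headword.
print(lemmatizer.lemmatize("are", pos="v"))    # 'be'
print(lemmatizer.lemmatize("sang", pos="v"))   # 'sing'

# Nouns: plural to singular.
print(lemmatizer.lemmatize("cars", pos="n"))   # 'car'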
Text Normalization
Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words
2. Normalizing word formats
3. Segmenting sentences in running text
Words
What counts as a word?
Look at one particular corpus: a collection of text or speech.
The Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, etc.)
How many words are in the following Brown sentence?
"He stepped out into the hall, was delighted to encounter a water brother."
13 words if we don't count punctuation marks as words, 15 if we count punctuation.
Whether we treat period ("."), comma (","), and so on as words depends on the task. Punctuation is critical for finding boundaries.
Stop words
Stop words: words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant in that context.
There is no single universal list of stop words used by all NLP tools, nor any agreed upon rules for identifying stop words, and indeed not all tools even use such a single list.
Therefore, the set of stop words can vary with the purpose.
Ex. "This is a sample sentence, showing off the stop words filtration."
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
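A sketch reproducing this example with NLTK (assuming nltk plus its stopwords and punkt data are installed):

# Requires: pip install nltk, then nltk.download('stopwords') and nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))

# Case-sensitive filtering, so the capitalized 'This' survives, as on the slide.
filtered = [w for w in word_tokenize(sentence) if w not in stop_words]
print(filtered)
# ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']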
How many words?
Types are the number of distinct words in a corpus;
tokens are the total number of running words.
"They picnicked by the pool, then lay back on the grass and looked at the stars."
How many?
16 tokens and 14 types (the occurs three times but counts once as a type).
Herdan’s Law/ Heaps’ Law
The larger the corpora we look at, the more types we find.
The relationship between the number of types |V| and the number of tokens N is called Herdan's Law:
|V| = k N^β
where k and β are positive constants, and 0 < β < 1.
The value of β depends on the corpus size and the genre; for large corpora, β ranges from 0.67 to 0.75.
So |V| for a text goes up significantly faster than the square root of its tokens.
How many words?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^(1/2))

Corpus                           Tokens = N    Types = |V|
Switchboard phone conversations  2.4 million   20 thousand
Shakespeare                      884,000       31 thousand
Google N-grams                   1 trillion    13 million
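As a sanity check, a hedged sketch that fits k and β of Herdan's Law to two of the rows above (purely illustrative; real fits use many corpus sizes, not two points):

import math

# (tokens, types) for Switchboard and Google N-grams, from the table above.
n1, v1 = 2.4e6, 20e3
n2, v2 = 1e12, 13e6

# |V| = k * N**beta  =>  solve for beta, then k, from the two points.
beta = math.log(v2 / v1) / math.log(n2 / n1)
k = v1 / n1 ** beta
print(f"beta = {beta:.2f}, k = {k:.1f}")   # beta is about 0.5 for this pair

# Prediction for a Shakespeare-sized corpus (884,000 tokens): about 12,000 types,
# versus the actual 31 thousand; k and beta clearly vary with genre.
print(f"{k * 884000 ** beta:,.0f}")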
Tokenization: Byte Pair Encoding (BPE)
Another option for text tokenization
Instead of
• white-space segmentation
• single-character segmentation
use the data to tell us how to tokenize:
subword tokenization (because tokens can be parts of words as well as whole words).
Subword tokenization
Three common algorithms:
Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
Unigram language modeling tokenization (Kudo, 2018)
WordPiece (Schuster and Nakajima, 2012)
All have 2 parts:
A token learner takes a raw training corpus and induces a vocabulary (a set of tokens).
A token segmenter takes a raw test sentence and tokenizes it according to that vocabulary.
Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters
= {A, B, C, D, ..., a, b, c, d, ...}
Repeat:
Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
Add a new merged symbol 'AB' to the vocabulary
Replace every adjacent 'A' 'B' in the corpus with 'AB'
Until k merges have been done.
BPE token learner example
Let the corpus be: low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
The base vocabulary is the set of individual characters, with _ marking the end of a word: {_, d, e, i, l, n, o, r, s, t, w}
Merge e r to er
Merge er _ to er_
Merge n e to ne
The next merges give the full sequence: er, er_, ne, new, lo, low, newer_, low_
On the test data, run each merge learned from the training data.
Result:
Test set "n e w e r _" would be tokenized as a full word
Test set "l o w e r _" would be two tokens: "low er_"
Properties of BPE tokens
Usually include frequent words
and frequent subwords
• which are often morphemes like -est or -er
A morpheme is the smallest meaning-bearing unit of a language
• unlikeliest has 3 morphemes: un-, likely, and -est
BPE token learner algorithm: HT
Training corpus:
This is the hugging face course. This chapter is about tokenization. This section shows several tokenizer algorithms.
Testing corpus: hugs
The Porter Stemmer
One of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm.
ATIONAL -> ATE (e.g., relational -> relate)
ING -> ø if stem contains vowel (e.g., motoring -> motor)
SSES -> SS (e.g., grasses -> grass)
Porter’s algorithm
Step 1a
sses -> ss    caresses -> caress
ies  -> i     ponies -> poni
ss   -> ss    caress -> caress
s    -> ø     cats -> cat

Step 1b
(*v*)ing -> ø    walking -> walk, sing -> sing (unchanged: stem has no vowel)
(*v*)ed  -> ø    plastered -> plaster

Step 2 (for long stems)
ational -> ate    relational -> relate
izer    -> ize    digitizer -> digitize
ator    -> ate    operator -> operate
...

Step 3 (for longer stems)
al   -> ø    revival -> reviv
able -> ø    adjustable -> adjust
ate  -> ø    activate -> activ
...
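These rules are implemented in NLTK's PorterStemmer; a quick check (assuming nltk is installed):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["caresses", "ponies", "cats", "walking", "sing", "motoring"]:
    print(w, "->", ps.stem(w))
# caresses -> caress, ponies -> poni, cats -> cat,
# walking -> walk, sing -> sing, motoring -> motor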
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing -> ø    walking -> walk, sing -> sing

The most frequent -ing words in a corpus:

All words ending in -ing    -ing words with a vowel before the suffix
1312 King                   548 being
 548 being                  541 nothing
 541 nothing                152 something
 388 king                   145 coming
 375 bring                  130 morning
 358 thing                  122 having
 307 ring                   120 living
 152 something              117 loving
 145 coming                 116 Being
 130 morning                102 going

Without the vowel condition, words like King, bring, and thing would be mangled to K, br, and th.
Issues in Tokenization
Some exceptions to handle:
Finland's capital -> Finland? Finlands? Finland's?
what're, I'm, isn't -> What are, I am, is not
Hewlett-Packard -> Hewlett Packard?
state-of-the-art -> state of the art?
Lowercase -> lower-case? lowercase? lower case?
San Francisco -> one token or two?
m.p.h., Ph.D. -> ??
$44.55 -> prices?
01/02/06 -> dates?
abc@xyz, www.kuet.ac.bd/cse -> email addresses and URLs?
Case folding
Case folding: another form of normalization
Reduce all letters to lower case
Applications like IR from text
Since users tend to use lower case
Possible exception: upper case in mid-sentence?
e.g., General Motors
US vs. us
SAIL vs. sail
For sentiment analysis, MT, and information extraction, case is helpful.
Sentence Segmentation
!, ? are relatively unambiguous
Period "." is quite ambiguous:
Sentence boundary
Abbreviations like Inc. or Dr.
Numbers like .02% or 4.3
Build a binary classifier that
looks at a "."
decides EndOfSentence/NotEndOfSentence
Classifiers: hand-written rules, regular expressions, or machine learning
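A toy sketch of the hand-written-rules flavor of such a classifier (the abbreviation list and the rules are invented for illustration; real systems use much richer features):

import re

ABBREVIATIONS = {"Dr.", "Mr.", "Inc.", "Ph.D."}   # hypothetical, minimal list

def is_end_of_sentence(text, i):
    """Decide whether the '.' at position i ends a sentence."""
    word = text[:i + 1].split()[-1]               # the token containing this period
    if word in ABBREVIATIONS:
        return False                              # known abbreviation
    if re.search(r"[0-9]\.[0-9]", text[max(0, i - 1):i + 2]):
        return False                              # decimal number like 4.3
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0].isupper()        # end of text, or a capital follows

text = "Dr. Smith paid $4.3 million. He was pleased."
for i, ch in enumerate(text):
    if ch == ".":
        print(i, is_end_of_sentence(text, i))     # False, False, True, True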
Determining if a word is end-of-sentence: a Decision Tree
[Figure: decision tree for end-of-sentence detection]
THANK YOU