Text Preprocessing
✓ Sentence Segmentation
✓ Word Tokenization
Text Preprocessing
Sentence Segmentation
The task of segmenting running text into sentences.
Word Tokenization
The task of segmenting running text into words.
Word Segmentation - #WelcometoIndia
The task of segmenting the combined string into words.
Requires a dictionary to match against.
Normalization
The task of putting words/tokens in a standard format.
Sentence Segmentation
Boundary detection - identify periods, semicolons, exclamation marks, and question marks, and break the text stream on them.
! and ? are mostly unambiguous, but the period “.” is very ambiguous.
Exceptions:
• San Francisco
• The New York-New Haven railroad
• Wake up, work out
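As a quick check on the boundary-detection idea above, here is a minimal sketch using NLTK's sentence tokenizer (assumptions: NLTK is installed and its punkt sentence model has been downloaded; the slides do not name a specific tool). Its statistical model handles many of the ambiguous periods, e.g., abbreviations.

import nltk
# nltk.download('punkt')          # one-time download of the sentence model
from nltk.tokenize import sent_tokenize

text = "Mr. Smith moved to San Francisco. Wake up, work out! Did it help?"
print(sent_tokenize(text))
# Splits on . ! ? while keeping the abbreviation "Mr." inside the first sentence.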
Tokenization & Punctuations
Dots: dots help indicate sentence boundaries.
It’s useful to keep the dot in abbreviations; stripping it loses information (Wash. → wash)
Hyphenation:
• Keep as single word: co-operate
• Separate into words: 26-year-old, aluminum-export-ban
Tokenization & Punctuations
Punctuation as a separate token
A clitic is a part of a word that can’t stand on its own; it occurs only when attached to another word.
what’re, we're
A tokenizer can expand clitic contractions that are marked by apostrophes, converting them to what are, we are.
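A minimal sketch of such contraction expansion, assuming a small hand-written mapping (the mapping and the regex-based approach are illustrative, not something the slides prescribe):

import re

# Illustrative mapping of apostrophe-marked clitics to their expanded forms.
CONTRACTIONS = {"what're": "what are", "we're": "we are", "can't": "can not"}

def expand_clitics(text):
    # Replace each listed contraction, matching case-insensitively on word boundaries.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_clitics("What're we doing? We're tokenizing."))
# -> "what are we doing? we are tokenizing."  (note the expansion lowercases the match)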
Tokenization
Tokenizer output - Penn Treebank tokenization standard
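For illustration, NLTK ships a tokenizer that follows the Penn Treebank conventions (assuming NLTK is installed; exact output details can vary slightly across versions):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("They're selling the co-op's shares, aren't they?"))
# Clitics such as 're and n't become separate tokens, and most punctuation
# is split off as its own token, per the Penn Treebank standard.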
Tokenization – Multiword Expression
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their

Subword representation - BPE training corpus:
low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
BPE Token Learner
• Merge e r to er
• Merge er _ to er_
• Merge n e to ne
BPE Token Segmenter
• Test data: run each merge learned from the training data (in learnt order); a sketch of both the learner and the segmenter follows.
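A minimal BPE sketch in Python, using the corpus above (the function names, the end-of-word marker "_", and the fixed number of merges are assumptions for illustration; this is the textbook algorithm in simplified form, not a production tokenizer):

from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

# Corpus from the slides: 5 x low, 2 x lowest, 6 x newer, 3 x wider, 2 x new
vocab = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6,
         "w i d e r _": 3, "n e w _": 2}

merges = []
for _ in range(8):                                  # number of merges is a free parameter
    best = pair_counts(vocab).most_common(1)[0][0]  # most frequent adjacent pair
    merges.append(best)
    vocab = apply_merge(best, vocab)
print(merges)  # starts with ('e', 'r'), ('er', '_'), ('n', 'e'), as on the slides

def segment(word, merges):
    # BPE token segmenter: start from characters plus the end-of-word marker
    # and apply the learned merges in the order they were learned.
    symbols = list(word) + ["_"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

print(segment("newer", merges))   # ['newer_']
print(segment("lower", merges))   # ['low', 'er_']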
Text Normalization
Non-standard words (numbers, acronyms, and dates)
Text Normalization - Case folding
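A one-line sketch of case folding (the token list is illustrative); note that folding conflates acronyms and proper nouns with ordinary words, which matters for the non-standard words mentioned above:

tokens = ["The", "US", "and", "Apple", "filed", "reports", "."]
print([t.lower() for t in tokens])   # ['the', 'us', 'and', 'apple', 'filed', 'reports', '.']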
Text Normalization
Stemming
• Removes affixes
Lemmatization
• Removes affixes
• Resulting form is a known word in a dictionary
Stemming & Lemmatization
Stemming
How can we know “organize”, “organizes”, and “organizing” should map to the same word?
• The goal of stemming and lemmatization
• To reduce inflectional forms and sometimes derivationally related forms of
a word to a common base form.
am, are, is → be
car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color” (see the lemmatizer sketch below)
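A lemmatization sketch using NLTK's WordNet lemmatizer (assumptions: NLTK is installed and the WordNet data has been downloaded; the slides do not name a specific tool). The part of speech matters: without pos="v", "are" comes back unchanged.

from nltk.stem import WordNetLemmatizer
# import nltk; nltk.download('wordnet')   # one-time data download

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # car
print(lemmatizer.lemmatize("are", pos="v"))   # be
print(lemmatizer.lemmatize("colors"))         # color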
Stemming Algorithms
Word = Stem + Affix(es)
• generalizations = general + ization + s

The Porter Stemmer
A word is represented as alternating runs of consonants (C) and vowels (V):
• Troubles = Tr ou bl e s = C V C V C
All words are of the form (C)(VC)m(V), where VC is repeated m times (m is the measure)
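A small sketch for computing the measure m (the helper name is mine, and treating 'y' as a vowel everywhere is a simplification of Porter's actual rule):

import re

def porter_measure(word):
    # Map each letter to V (vowel) or C (consonant), collapse runs,
    # then count VC pairs: the m in (C)(VC)m(V).
    pattern = "".join("V" if ch in "aeiouy" else "C" for ch in word.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)   # e.g. CCVVC -> CVC
    return collapsed.count("VC")

print(porter_measure("tree"))       # 0
print(porter_measure("trouble"))    # 1
print(porter_measure("troubles"))   # 2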
The Porter Stemmer: rule format
Rules have the form (condition) S1 -> S2: if the stem satisfies the condition, the suffix S1 is rewritten as S2.
Conditions refer to the stem, using the notation shown in the steps below (m, *v*, *d, *S, *o).
The Porter Stemmer: Step 1
SSES -> SS    (caresses -> caress)
IES -> I      (ponies -> poni, ties -> ti)
SS -> SS      (caress -> caress)
S -> є        (cats -> cat)
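A sketch of Step 1 as ordered, longest-suffix-first rewrite rules (the helper name is mine; only the first matching rule fires, as in Porter's paper):

STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1(word):
    # Apply the first rule whose suffix matches; otherwise leave the word alone.
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1(w))
# caresses -> caress, ponies -> poni, ties -> ti, caress -> caress, cats -> cat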
The Porter Stemmer: Step 2a
(m≥1) EED -> EE
(*v*) ING -> є    (*v*: the stem contains a vowel)
• Condition verified: motoring -> motor
• Condition not verified: sing -> sing
The Porter Stemmer: Step 2b (cleanup)
(These rules are run if the second or third rule in 2a applied)
AT -> ATE
BL -> BLE
• Troubl(ing) -> trouble
(*d & !(*L or *S or *Z)) -> single letter
• Condition verified: hopp(ing) -> hop, tann(ed) -> tan
• Condition not verified: fall(ing) -> fall
(m=1 & *o) -> E
• Condition verified: fil(ing) -> file
• Condition not verified: fail -> fail
Notation: *d - the stem ends with a double consonant; *S - the stem ends with S; *o - the stem ends in CVC
The Porter Stemmer: Steps 3 and 4
Step 3: Y Elimination
(*v*) Y -> I
• Condition verified: happy -> happi
• Condition not verified: sky -> sky
The Porter Stemmer: Steps 5 & 6
Step 5: Derivational Morphology, II
(m>0) ICATE -> IC    (triplicate -> triplic)
(m>0) FUL -> є       (hopeful -> hope)
(m>0) NESS -> є      (goodness -> good)
Step 6: Derivational Morphology, III
(m>0) ANCE -> є      (allowance -> allow)
(m>0) ENT -> є       (dependent -> depend)
(m>0) IVE -> є       (effective -> effect)
The Porter Stemmer: Step 7 (cleanup)
Step 7a
(m>1) E -> є         (probate -> probat)
(m=1 & !*o) E -> є
Step 7b
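As an end-to-end check, the whole pipeline of steps is implemented in NLTK (an assumption here: NLTK is installed; its default Porter mode adds small extensions to the 1980 algorithm, so a few edge cases can differ from the steps above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["caresses", "ponies", "motoring", "hopping", "goodness", "happy"]:
    print(w, "->", stemmer.stem(w))
# e.g. caresses -> caress, motoring -> motor, hopping -> hop, goodness -> good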
Problems with Porter Stemmer
Lancaster stemmer (Iterative)
Snowball stemmer
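For illustration, all three stemmers discussed here are available in NLTK (assuming NLTK is installed; the word list below is mine):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
for w in ["generalizations", "organization", "maximum"]:
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))
# Lancaster is typically the most aggressive of the three, which is one source
# of the over-stemming problems discussed on these slides.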
Stemmer
Cons:
• Operates on a single word without knowledge of the context
• Cannot discriminate between words that have different meanings depending on part of speech (POS)
Pros:
• Easier to implement and runs faster
• Its reduced accuracy may not matter for some applications.
Lemmatization
Computational Morphology
Methods
• Brute force: try every possible segmentation of the word and see which ones match known stems and affixes
• Rule-based (simplistic method): have a list of known affixes and see which ones apply (see the sketch below)
• Rule-based (more sophisticated): a list of known affixes plus knowledge about allowable combinations, e.g., -ing can only attach to a verb stem (morphotactic rules)
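A minimal sketch of the simplistic rule-based method (the suffix list and the toy stem lexicon are illustrative assumptions, not from the slides; the morphotactic variant would additionally check the stem's part of speech):

KNOWN_SUFFIXES = ["ization", "ing", "ed", "s"]        # illustrative affix list
STEM_LEXICON = {"general", "organize", "walk"}        # toy dictionary of known stems

def analyses(word):
    # Return (stem, suffix) pairs where stripping a known suffix
    # leaves a known stem; the empty suffix covers bare stems.
    results = [(word, "")] if word in STEM_LEXICON else []
    for suffix in KNOWN_SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in STEM_LEXICON:
            results.append((word[: -len(suffix)], suffix))
    return results

print(analyses("walking"))          # [('walk', 'ing')]
print(analyses("generalization"))   # [('general', 'ization')]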
References
• The Porter Stemmer home page (with the original paper and code): http://www.tartarus.org/~martin/PorterStemmer/
• Jurafsky and Martin, chapter 3.4
• Porter, M.F., 1980. An algorithm for suffix stripping. Program, 14(3):130-137.