Text preprocessing
Word Tokenization
Tokenization is the process of segmenting a string of characters into
tokens (words).
An example
I have a can opener; but I can’t open these cans.
Word Tokens: 11
Word Types: 10
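The token/type counts above can be reproduced with a small sketch; the regex keeps internal apostrophes so "can't" stays a single token (this tokenization choice is one of several reasonable ones):

```python
import re
from collections import Counter

text = "I have a can opener; but I can't open these cans."
# One token per word; internal apostrophes are kept ("can't" is one token),
# punctuation is dropped.
tokens = re.findall(r"\w+(?:'\w+)*", text)
types = Counter(tokens)

print("Word tokens:", len(tokens))   # 11 ("I" occurs twice)
print("Word types:", len(types))     # 10 distinct forms
```

Note that "can", "can't", and "cans" count as three different types here; only the repeated "I" collapses.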
Hyphenation
End-of-Line Hyphen: Used for splitting whole words into parts for text
justification, e.g. “... apparently, mid-dle English followed this practice...”
Lexical Hyphen: Certain prefixes are often written hyphenated, e.g. co-,
pre-, meta-, multi-, etc.
Sententially Determined Hyphenation: Mainly to prevent incorrect
parsing of the phrase. e.g. State-of-the-art, three-to-five-year, etc.
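A common preprocessing step is to undo end-of-line hyphenation while leaving lexical and sentential hyphens alone. A minimal sketch (the simple rule below rejoins any word split across a line break, so it would also wrongly join a lexical hyphen that happens to fall at a line end):

```python
import re

def dehyphenate(text):
    # Rejoin words split across a line break for justification
    # ("mid-\ndle" -> "middle"); hyphens not at a line break are untouched,
    # so "state-of-the-art" survives.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

print(dehyphenate("apparently, mid-\ndle English"))
print(dehyphenate("state-of-the-art"))
```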
French
l’ensemble: want to match with un ensemble
German
Noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
Sanskrit
Very long compound words
Japanese
Further complications with multiple alphabets intermingled.
Why is it difficult?
Are “!” and “?” ambiguous? No
Is period “.” ambiguous? Yes
Abbreviations (Dr., Mr., m.p.h.)
Numbers (2.4%, 4.3)
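These observations suggest a simple rule-based sentence splitter: treat "!" and "?" as unambiguous boundaries, and accept "." only when it does not end a known abbreviation or sit inside a number. A toy sketch (the abbreviation list is an assumed stand-in; real systems use much larger lists or learned classifiers):

```python
import re

# Assumed, tiny abbreviation list for illustration.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "m.p.h.", "U.S.A."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        i = m.end()
        token = text[:i].split()[-1]          # the word ending at this mark
        if m.group() == "." and (token in ABBREVIATIONS
                                 or re.fullmatch(r"\d+(\.\d+)?%?\.?", token)):
            continue                          # abbreviation or number: no boundary
        sentences.append(text[start:i].strip())
        start = i
    if text[start:].strip():                  # trailing text without a final mark
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith drove 4.3 miles. Amazing!"))
```

The "Dr." and "4.3" periods are skipped; the sentence-final "." and the "!" are accepted as boundaries.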
Why “normalize”?
Indexed text and query terms must have the same form.
U.S.A. and USA should be matched
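One common normalization is case-folding plus dropping periods, which makes U.S.A. and USA match. A minimal sketch (real IR systems apply such rules selectively, since aggressive normalization can conflate distinct terms):

```python
def normalize(term):
    # Case-fold and remove periods so "U.S.A." and "USA" map to
    # the same index/query form, "usa".
    return term.lower().replace(".", "")

print(normalize("U.S.A."))
print(normalize("U.S.A.") == normalize("USA"))
```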
http://text-processing.com/demo/tokenize/
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
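The classic pipeline for this uses only `tr`, `sort`, and `uniq`; the sample file and its contents below are illustrative:

```shell
# Create a small sample file (the filename sample.txt is an example).
printf 'The cat sat on the mat\n' > sample.txt

tr -sc 'A-Za-z' '\n' < sample.txt |  # map every non-letter to a newline: one token per line
  tr 'A-Z' 'a-z' |                   # case-fold
  sort | uniq -c | sort -rn          # count each word type, most frequent first
```

`uniq -c` only merges adjacent duplicate lines, which is why the first `sort` is required before it.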
1/24/2022
Lemmatization in Python
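In practice one would use NLTK's WordNetLemmatizer here; to keep the example self-contained, the sketch below substitutes a tiny hand-written lemma dictionary (the LEMMAS table is an assumed stand-in, not a real resource):

```python
# Minimal dictionary-backed lemmatizer sketch. A real lemmatizer
# (e.g. NLTK's WordNetLemmatizer) consults a full lexicon and POS tags.
LEMMAS = {"am": "be", "are": "be", "is": "be", "better": "good", "cars": "car"}

def lemmatize(word):
    # Case-fold, then look up the lemma; fall back to the word itself.
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("better"))   # irregular form mapped to its lemma
print(lemmatize("Cars"))     # case-folded, then looked up
```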
Morphology
Morphology studies the internal structure of words, how words are built
up from smaller meaningful units called morphemes
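A toy illustration of morpheme structure via affix stripping; the prefix and suffix lists are assumed examples, and real morphological analyzers use finite-state transducers over full lexicons:

```python
# Segment a word into prefix + stem + suffix using small assumed affix lists.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word):
    morphemes = []
    for p in PREFIXES:                       # strip at most one prefix
        if word.startswith(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix:                               # strip at most one suffix
        morphemes.extend([word[:-len(suffix)], suffix])
    else:
        morphemes.append(word)
    return morphemes

print(segment("unhappiness"))   # ['un', 'happi', 'ness']
```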
Stemming
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
...
If the first two rules of Step 1b are successful, the following is
done: AT → ATE (conflat(ed) → conflate)
BL → BLE (troubl(ed) → trouble)
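Step 1a can be sketched directly as an ordered cascade of suffix rules, where the first matching rule wins (the full algorithm has more steps plus conditions on the stem, all omitted here):

```python
def porter_step1a(word):
    # Ordered suffix rules from Step 1a; longest/most specific first.
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss
    if word.endswith("ies"):
        return word[:-2]      # ies -> i
    if word.endswith("ss"):
        return word           # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]      # s -> (deleted)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step1a(w))
```

Rule order matters: testing plain "s" first would wrongly turn "caress" into "cares".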
Porter’s algorithm
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
...
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
...
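Steps 2 and 3 are naturally expressed as suffix-rewrite tables applied longest-suffix-first; the sketch below covers only the rules listed above and omits the "measure" condition the real algorithm checks before rewriting:

```python
# Suffix-rewrite tables for the Step 2 and Step 3 rules shown above.
STEP2 = {"ational": "ate", "izer": "ize", "ator": "ate"}
STEP3 = {"al": "", "able": "", "ate": ""}

def apply_rules(word, rules):
    # Try longer suffixes first so "ational" beats "al"; rewrite at most once.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:-len(suffix)] + rules[suffix]
    return word

print(apply_rules("relational", STEP2))   # Step 2: ational -> ate
print(apply_rules("revival", STEP3))      # Step 3: al -> (deleted)
```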