5 Basic Text Processing
Word tokenization
How many words?
It's a complicated question: is 'uh' a word? How about 'main- mainly'?
• I do uh main- mainly business data processing
  – Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  – Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  – Wordform: the full inflected surface form
    • cat and cats = different wordforms

Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text
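A minimal sketch of these three steps with standard UNIX tools (the example strings are illustrative assumptions, not from the source, and each command is deliberately naive):

# 1. Word tokenization: turn every run of non-letters into a newline
echo "Seuss's cat in the hat is different from other cats!" | tr -sc 'A-Za-z' '\n'
# (note: this splits Seuss's into Seuss and s; real tokenizers must handle apostrophes)

# 2. Normalizing word formats: fold upper case to lower case
echo "Seuss" | tr 'A-Z' 'a-z'

# 3. Sentence segmentation: naively break at every period
echo "One sentence. Another one. A third one." | tr '.' '\n'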
Tokenization
• Tokenization is the process of splitting a string of text into a list of tokens. A token can be thought of as a constituent part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.
Source: https://www.kaggle.com/satishgunjal/tokenization-in-nlp
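Because tokens nest in this way, "how many tokens?" depends on the granularity chosen. A tiny illustration (the example strings are my own, not from the source):

# word tokens in one sentence
echo "a sentence is a token in a paragraph" | wc -w                  # -> 8
# sentence tokens in a tiny paragraph, naively one per '.'
printf 'First sentence. Second sentence.\n' | tr -dc '.' | wc -c     # -> 2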
Tokenization Techniques
Simple Tokenization in UNIX
One token per line, before sorting (left) and after sorting (right):

Tokenized     Sorted
THE           A
SONNETS       A
by            A
William       A
Shakespeare   A
From          A
fairest       A
creatures     A
We            A
...           ...
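A sketch of the pipeline that produces these columns, assuming the text sits in a plain-text file shakes.txt (the filename the later commands also use): tr turns every run of non-letters into a newline so each token is on its own line, sort brings identical tokens together, and uniq -c counts each type.

tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c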
More counting Issues in Tokenization
• Merging upper and lower case • Finland’s capital Finland Finlands Finland’s ?
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c
• what’re, I’m, isn’t What are, I am, is not
• Sorting the counts • Hewlett-Packard Hewlett Packard ?
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r • state-of-the-art state of the art ?
23243 the • Lowercase lower-case lowercase lower case ?
22225 i
18618 and • San Francisco one token or two?
16339 to
15687 of • m.p.h., PhD. ??
12780 a
12163 you
10839 my
10005 in
8954 d
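The 8954 occurrences of the bare token d are an artifact of splitting contractions such as I'd and he'd at the apostrophe. One possible tweak (an assumption, not one of the original commands) is to keep apostrophes inside the token character class:

# keep apostrophes inside tokens so i'd survives as one token
tr 'A-Z' 'a-z' < shakes.txt | tr -sc "a-z'" '\n' | sort | uniq -c | sort -n -r
# caveat: quotation marks written as apostrophes will now stick to tokens too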
Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
• Potentially more powerful, but less efficient
  – Case is helpful (US versus us is important), as sketched below
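A tiny sketch of the point about case (the input is my own example): folding case conflates the country US with the pronoun us when counting types.

printf 'US\nus\n' | sort | uniq -c                     # two types: 1 US, 1 us
printf 'US\nus\n' | tr 'A-Z' 'a-z' | sort | uniq -c    # one type:  2 us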