
Basic Text Processing

Word tokenization

NLP Library Text Processing


There are many libraries / frameworks for solving NLP problems:
1. Natural Language Toolkit (NLTK)
2. TextBlob
3. CoreNLP
4. Gensim
5. spaCy
6. polyglot
7. scikit-learn
8. Pattern

These are some of the methods of processing the data in NLP:
• Tokenization
• Stop words removal
• Stemming
• Normalization
• Lemmatization
• Parts of speech tagging

Text Normalization

• Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text

How many words?

• It is a complicated question: is 'uh' a word? How about 'main- mainly'?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms

How many words?

• (Word) Type: an element of the vocabulary (how many unique words there are).
• (Word) Token: an instance of that type in running text.
• N = number of tokens
• V = vocabulary = set of types; |V| is the size of the vocabulary
• Church and Gale (1990): |V| > O(N½)

they lay back on the San Francisco grass and looked at the stars and their

• How many?
– 15 tokens (or 14, if we count San Francisco as one token)
– 13 types (or 12 if San Francisco is one type; or 11, if the, they & their share a lemma?)

Dataset (Corpora)                 Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million
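The token/type distinction for the example sentence can be sketched in a few lines of Python, using simple whitespace tokenization (so San and Francisco count as two tokens):

```python
# Count tokens (N) and types (|V|) for the example sentence,
# treating each whitespace-separated wordform as a token.
text = "they lay back on the San Francisco grass and looked at the stars and their"

tokens = text.split()   # every occurrence is a token
types = set(tokens)     # unique wordforms are types

print(len(tokens))  # N = 15
print(len(types))   # |V| = 13 ("the" and "and" repeat)
```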

Tokenization

• Tokenization is the process of splitting a string or text into a list of tokens. One can think of tokens as parts: a word is a token in a sentence, and a sentence is a token in a paragraph.

Source: https://www.kaggle.com/satishgunjal/tokenization-in-nlp

Tokenization Techniques

• Tokenization Using Python's Inbuilt Method
– Word Tokenization
– Sentence Tokenization
https://colab.research.google.com/drive/1D_HclnqrPU_-XaK8xjkp0ZkZSaueGjo0#scrollTo=oD04lqUWT3W4

• Tokenization Using Regular Expressions (RegEx)
– Word Tokenization
– Sentence Tokenization
https://colab.research.google.com/drive/1D_HclnqrPU_-XaK8xjkp0ZkZSaueGjo0#scrollTo=oD04lqUWT3W4

Source: https://www.kaggle.com/satishgunjal/tokenization-in-nlp
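Both techniques can be sketched with the standard library alone (the sample sentence here is an illustrative assumption, not from the slides):

```python
import re

text = "Mr. Smith lives in San Francisco. He likes NLP!"

# 1. Python's inbuilt method: str.split() breaks on whitespace,
#    so punctuation stays attached to the word ("Mr.", "NLP!").
words_inbuilt = text.split()

# 2. Regular expressions: \w+ keeps runs of word characters and
#    drops punctuation; splitting after ., !, or ? gives rough
#    sentences -- note the abbreviation "Mr." fools this splitter.
words_regex = re.findall(r"\w+", text)
sentences_regex = re.split(r"(?<=[.!?])\s+", text)

print(words_inbuilt[:3])   # ['Mr.', 'Smith', 'lives']
print(words_regex[:3])     # ['Mr', 'Smith', 'lives']
print(len(sentences_regex))  # 3 -- "Mr." was wrongly split off
```

The failure on "Mr." is exactly the m.p.h. / PhD. issue raised later: naive punctuation rules are not enough for real text.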

Tokenization Techniques

• Tokenization Using NLTK
– Install Python
– Install NLTK (Natural Language Toolkit)

Simple Tokenization in UNIX

• (Inspired by Ken Church's UNIX for Poets.)
• Given a text file, output the word tokens and their frequencies

tr -sc 'A-Za-z' '\n' < shakes.txt    Change all non-alpha to newlines
| sort                               Sort in alphabetical order
| uniq -c                            Merge and count each type

1945 A
  72 AARON
  19 ABBESS
   5 ABBOT
 ...
  25 Aaron
   6 Abate
   1 Abates
   5 Abbess
   6 Abbey
   3 Abbot
 ...

The first step: tokenizing

tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...

The second step: sorting

tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...

More counting

• Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c

• Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d

Issues in Tokenization

• Finland's capital → Finland Finlands Finland's ?
• what're, I'm, isn't → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
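The whole UNIX pipeline above (lowercase, split on non-letters, count, sort by frequency) has a compact Python equivalent; `shakes.txt` is assumed, so a small inline sample stands in for it here:

```python
import re
from collections import Counter

# Python sketch of:
#   tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
text = "The SONNETS by William Shakespeare From fairest creatures We the the"

tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, keep letter runs only
counts = Counter(tokens)                      # uniq -c

for n, word in sorted(((n, w) for w, n in counts.items()), reverse=True)[:3]:
    print(n, word)                            # sort -n -r | head -3
```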

Tokenization: language issues

• French
– L'ensemble → one token or two?
• L ? L' ? Le ?
• Want l'ensemble to match with un ensemble
– Until at least 2003, it didn't on Google
» Internationalization!
• German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– 'life insurance company employee'
– German retrieval systems benefit greatly from a compound splitter module
– Can give a 15% performance boost for German

Tokenization: language issues

• Chinese and Japanese have no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。 ('Sharapova now lives in southeastern Florida, in the US.')
– Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
– フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) — a single string mixing Katakana, Hiragana, Kanji, and Romaji
– End-user can express query entirely in hiragana!

Tokenization: language issues

• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
• 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
• With Unicode, the surface presentation is complex, but the stored form is straightforward

Basic Text Processing

Word Normalization and Stemming

Normalization

• Need to "normalize" terms
– Information Retrieval: indexed text & query terms must have same form.
• We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
– e.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window   Search: window, windows
– Enter: windows  Search: Windows, windows, window
– Enter: Windows  Search: Windows
• Potentially more powerful, but less efficient

Case folding

• Applications like IR: reduce all letters to lower case
– Since users tend to use lower case
– Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, MT, Information extraction
– Case is helpful (US versus us is important)
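A minimal sketch of an IR-style equivalence class, combining the two steps above (period deletion plus case folding), so that U.S.A. and USA normalize to the same term:

```python
# Map a term to its equivalence class: delete periods, then lowercase.
# This is a deliberately crude sketch -- it is exactly the step that
# sentiment analysis or MT might NOT want (US vs. us is lost).
def normalize(term: str) -> str:
    return term.replace(".", "").lower()

print(normalize("U.S.A."))  # usa
print(normalize("USA"))     # usa -- both terms fall in the same class
print(normalize("US") == normalize("us"))  # True: case distinction is gone
```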
