Text preprocessing

The document discusses text processing techniques, focusing on word tokenization, sentence segmentation, and normalization. It highlights various tokenization libraries, issues in tokenization across different languages, and methods for stemming and lemmatization. Additionally, it covers the implementation of decision trees and classifiers for text processing tasks.

Text Processing

Word Tokenization
Tokenization is the process of segmenting a string of characters into
tokens (words).

An example
I have a can opener; but I can’t open these cans.
Word Tokens: 11
Word Types: 10
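As a quick illustration (not part of the original slides), the same sentence can be tokenized with NLTK's word_tokenize; a minimal sketch, assuming NLTK and its punkt model are installed. Note that this tokenizer also splits punctuation and contractions, so its counts differ from the hand counts above.

from nltk.tokenize import word_tokenize

sentence = "I have a can opener; but I can't open these cans."
tokens = word_tokenize(sentence)
print(len(tokens))        # word tokens (punctuation counted as tokens)
print(len(set(tokens)))   # word types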



Several tokenization libraries

NLTK Toolkit (Python)
spaCy (Python)
Polyglot (Python)
Stanford CoreNLP (Java)
Unix Commands



Issues in Tokenization
Common examples

Finland's → Finland? Finlands? Finland's?
What're, I'm, shouldn't → What are, I am, should not?
San Francisco → one token or two?
m.p.h. → ??

Hyphenation
End-of-Line Hyphen: Used for splitting whole words into parts for text
justification. e.g. “... apparently, mid-dle English followed this practice...”
Lexical Hyphen: Certain prefixes are often written hyphenated, e.g. co-,
pre-, meta-, multi-, etc.
Sententially Determined Hyphenation: Mainly to prevent incorrect
parsing of the phrase. e.g. state-of-the-art, three-to-five-year, etc.

Language Specific Issues
French
l’ensemble: want to match with un ensemble

German
Noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
'life insurance company employee'

Sanskrit
Very long compound words



Language Specific Issues
Chinese
No space between words

Japanese
Further complications with multiple alphabets intermingled.

Word Tokenization in Chinese

Maximum Matching (Greedy Algorithm)

Start a pointer at the beginning of the string
Find the longest word in the dictionary that matches the string starting
at the pointer
Move the pointer over the matched word in the string

Will the above scheme work for English?

No: Thetabledownthere (greedy matching yields “Theta bled own there”)
Yes: #ThankYouSachin, #musicmonday, etc. (hashtag segmentation)
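A minimal Python sketch of the greedy maximum-matching procedure described above; the toy dictionary and the helper name max_match are illustrative, not from the slides.

def max_match(text, dictionary):
    # Greedy maximum matching: at each position take the longest
    # dictionary word starting there, falling back to a single character.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy English dictionary showing why the greedy scheme misfires:
words = {"the", "theta", "table", "down", "there", "bled", "own"}
print(max_match("thetabledownthere", words))
# ['theta', 'bled', 'own', 'there'] rather than ['the', 'table', 'down', 'there']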

Sentence Segmentation
Can we decide where the sentences begin and end?

Why is it difficult?
Are '!' and '?' ambiguous? No
Is the period “.” ambiguous? Yes
Abbreviations (Dr., Mr., m.p.h.)
Numbers (2.4%, 4.3)

Can we build a binary classifier for “period” classification?
For each “.” it decides EndOfSentence/NotEndOfSentence
Classifiers can be: hand-written rules, regular expressions, or
machine learning (a rule-based sketch follows below)
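A minimal hand-written-rules sketch for this classifier; the abbreviation list and the helper name is_end_of_sentence are made up for illustration.

import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "m.p.h.", "etc."}   # hypothetical, tiny list

def is_end_of_sentence(token, next_token):
    # Decide EndOfSentence / NotEndOfSentence for a token containing "."
    if token.lower() in ABBREVIATIONS:
        return False                                   # known abbreviation
    if re.fullmatch(r"\d+(\.\d+)?%?", token):
        return False                                   # number like 4.3 or 2.4%
    return bool(next_token) and next_token[0].isupper()  # next word capitalized

print(is_end_of_sentence("cans.", "I"))    # True
print(is_end_of_sentence("Dr.", "Smith"))  # False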

Sentence Segmentation: Decision Tree
Example
Decision Tree: Is this word the end-of-sentence (E-O-S)?

Other Important Features

Case of word with “.”: Upper, Lower, Number
Case of word after “.”: Upper, Lower, Number
Numeric Features
  Length of word with “.”
  Probability (word with “.” occurs at end-of-sentence)
  Probability (word after “.” occurs at beginning-of-sentence)

Implementing Decision Trees
Just an if-then-else statement
Choosing the features is more important
For numeric features, thresholds have to be picked
With a growing number of features, including numeric ones, it becomes
difficult to set up the tree structure by hand
The decision tree structure can instead be learned using machine learning
over a training corpus (see the sketch below)

Basic Idea
Usually works top-down, by choosing a variable at each step that best
splits the set of items.
Popular algorithms: ID3, C4.5, CART
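A minimal sketch of learning such a tree with scikit-learn (not from the slides); the feature set mirrors the “Other Important Features” slide and the tiny training set is invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

def features(word_with_dot, next_word):
    return [
        word_with_dot[0].isupper(),                   # case of word with "."
        bool(next_word) and next_word[0].isupper(),   # case of word after "."
        len(word_with_dot),                           # length of word with "."
    ]

X = [features("Dr.", "Smith"), features("cans.", "I"),
     features("m.p.h.", "on"), features("home.", "The")]
y = [0, 1, 0, 1]   # 0 = NotEndOfSentence, 1 = EndOfSentence

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([features("walk.", "She")]))        # expected: [1]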

Other Classifiers

Support Vector Machines
Logistic Regression
Neural Networks

Normalization

Why “normalize”?
Indexed text and query terms must have the same form.
U.S.A. and USA should be matched

We implicitly define equivalence classes of terms

Case Folding

Reduce all letters to lower case


Some caveats (task dependent):
Upper case in mid-sentence may point to named entities (e.g. General Motors)
For MT and information extraction, retaining case might be helpful (US vs. us)
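A minimal sketch of such equivalence classing, assuming a crude rule that case-folds every token and drops internal periods so that U.S.A. and USA map to the same form; real systems apply rules like this selectively because of the caveats above.

def normalize(token):
    # Case-fold and strip periods: one crude equivalence class per result.
    return token.lower().replace(".", "")

print(normalize("U.S.A.") == normalize("USA"))   # True
print(normalize("General"))                      # 'general' (case information lost)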

Python tokenization example

http://text-processing.com/demo/tokenize/
Simple Tokenization in UNIX

Given a text file, output the word tokens and their frequencies

tr -sc 'A-Za-z' '\n' < file_name | sort | uniq -c | sort -rn

Change all non-alphabetic characters to newlines
Sort in alphabetical order
Merge and count each type
Sort based on the count

For more info: execute ‘man tr’



Token normalization
We may want the same token for different forms of the word
• wolf, wolves → wolf
• talk, talks → talk
Stemming
• A process of removing and replacing suffixes to get to the root
form of the word, which is called the stem
• Usually refers to heuristics that chop off suffixes
Lemmatization
• Usually refers to doing things properly with the use of a
vocabulary and morphological analysis
• Returns the base or dictionary form of a word,
which is known as the lemma
Lemmatization example
WordNet lemmatizer
• Uses the WordNet database to look up lemmas
• nltk.stem.WordNetLemmatizer
• Examples:
  − feet → foot, cats → cat
  − wolves → wolf, talked → talked
• Problems: not all forms are reduced
• Takeaway: we need to try stemming and lemmatization and
  choose the best one for our task
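As a quick contrast supporting that takeaway, the sketch below (not from the slides) runs NLTK's Porter stemmer and WordNet lemmatizer on the same words; the stemmer chops suffixes heuristically while the lemmatizer consults the WordNet vocabulary (downloaded via nltk.download('wordnet')).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for w in ["wolves", "talks"]:
    print(w, stemmer.stem(w), lemmatizer.lemmatize(w))
# typically: wolves -> stem 'wolv' vs lemma 'wolf'; talks -> 'talk' for both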
Lemmatization

Reduce inflections or variant forms to base form:

am, are, is → be
car, cars, car's, cars' → car
Have to find the correct dictionary headword form

Lemmatization in Python

>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
u'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
u'church'
>>> wordnet_lemmatizer.lemmatize('abaci')
u'abacus'
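The "talked → talked" problem noted earlier arises because the lemmatizer assumes a noun by default; passing a part-of-speech argument changes the result. A small follow-up (output shown as on Python 3, without the u prefix):

>>> wordnet_lemmatizer.lemmatize('talked', pos='v')
'talk'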

Morphology

Morphology studies the internal structure of words, how words are built
up from smaller meaningful units called morphemes

Morphemes are divided into two categories

Stems: The core meaning-bearing units
Affixes: Bits and pieces adhering to stems to change their meanings
and grammatical functions
Prefix: un-, anti-, etc. (a-, ati-, pra-, etc.)
Suffix: -ity, -ation, etc. (-taa, -ke, -ka, etc.)

Stemming

Reducing terms to their stems

Used in information retrieval
Crude chopping of affixes
Language dependent

Porter's algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)

Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
...
If either of the two Step 1b rules above succeeds, the following cleanup is done:
AT → ATE (conflat(ed) → conflate)
BL → BLE (troubl(ed) → trouble)
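A minimal sketch (not the real Porter implementation) of how the Step 1a rules above can be expressed as an ordered suffix-rule list, where the first matching rule wins:

STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))   # caress, poni, caress, cat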

Porter's algorithm

Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
...

Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
...

The complete algorithm is available at:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
Python stemming example
Stemming in Python

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'
