
Text Pre-processing

✓ Sentence Segmentation
✓ Word Tokenization
✓ Word Segmentation
✓ Text Normalization

Text Preprocessing

Sentence Segmentation
The task of segmenting running text into sentences.

Word Tokenization
The task of segmenting running text into words.

Word Segmentation (e.g., #WelcometoIndia)
The task of segmenting a combined string into its component words; requires a dictionary to match against (see the sketch after this list).

Normalization
The task of putting words/tokens into a standard format.
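A common dictionary-based approach to word segmentation is greedy maximum matching. Below is a minimal sketch, assuming a toy dictionary (the word list is illustrative, not from the slides):

```python
def max_match(text, dictionary):
    """Greedy maximum matching: repeatedly take the longest
    dictionary word that prefixes the remaining string."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):  # try the longest prefix first
            if text[:end].lower() in dictionary or end == 1:
                tokens.append(text[:end])
                text = text[end:]
                break
    return tokens

# Toy dictionary for the hashtag example
dictionary = {"welcome", "to", "india"}
print(max_match("WelcometoIndia", dictionary))
# ['Welcome', 'to', 'India']
```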

Sentence Segmentation
Boundary detection: identify periods, semicolons, exclamation marks, and question marks, and break the text stream on them.
! and ? are mostly unambiguous, but the period “.” is very ambiguous.

Issues: a “.” that does not mark the end of a sentence (numbers, abbreviations):

U.S. Corporate
Mr. Ram
Christopher S. Manning
e.g.
[email protected]
Sentence Segmentation
Common algorithm:
Tokenize, then use rules or ML to classify a period as either
(a) part of the word or (b) a sentence boundary.

An abbreviation dictionary can help.

Sentence segmentation can then be based on this tokenization.
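A minimal rule-based sketch of this idea, using an abbreviation dictionary to classify periods (the abbreviation list is illustrative, not exhaustive):

```python
import re

# Illustrative abbreviation dictionary (not exhaustive).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "u.s.", "etc."}

def segment_sentences(text):
    """Split on ! and ? always; split on '.' only when the token
    ending there is not a known abbreviation or a single initial."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        i = m.end()
        words = text[start:i].split()
        token = words[-1].lower() if words else ""
        if m.group() == "." and (token in ABBREVIATIONS
                                 or re.fullmatch(r"[a-z]\.", token)):
            continue  # classify this period as part of the word
        sentences.append(text[start:i].strip())
        start = i
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(segment_sentences("Mr. Ram met Christopher S. Manning. It was brief!"))
# ['Mr. Ram met Christopher S. Manning.', 'It was brief!']
```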
Sentence Segmentation
Sentence boundary detection algorithms:
• Rule based (if-else), for small feature sets
• Regular-expression based
• Statistical classification trees
  • Probability of case and length of the words occurring before or after a boundary
• Machine learning (Naive Bayes, MaxEnt)
  • Part of speech and other feature distributions of the preceding and following words
Tokenization
Divide text into units called tokens (words, numbers, punctuation).

Whitespace alone does not reliably indicate a word break.

Exceptions:
• San Francisco
• the New York-New Haven railroad
• wake up, work out
Tokenization & Punctuation
Dots: dots help indicate sentence boundaries.
It’s useful to keep word-internal dots: the period distinguishes Wash. (the abbreviation) from wash (the verb).

Single apostrophes: contractions (isn’t, didn’t, dog’s)
Useful for meaning extraction (is + n’t, i.e., not)

Hyphenation:
• Keep as a single word: co-operate
• Separate into words: 26-year-old, aluminum-export-ban
Tokenization & Punctuation
Punctuation as a separate token:
Commas are a useful piece of information for parsers (clause/phrase boundaries).

We want to keep punctuation that occurs word-internally; do not separate:
• Ph.D.
• AT&T
• cap’n
• prices ($45.55) and dates (01/02/06)
• URLs (http://www.nitt.edu)
• Twitter hashtags (#nlpACL)
• email addresses ([email protected])
• numbers with commas (5,50,500.50)
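A minimal regular-expression sketch that keeps such word-internal punctuation together (the pattern is illustrative, not exhaustive):

```python
import re

# Order matters: try the most specific patterns first.
pattern = r'''(?x)                 # verbose mode
      https?://\S+                 # URLs
    | \w+(?:\.\w+)+\.?             # abbreviations like Ph.D., U.S.
    | [#@]\w+                      # hashtags and @-mentions
    | \$?\d+(?:[.,/]\d+)*%?        # prices, dates, numbers with commas
    | \w+(?:[-'&]\w+)*             # words, hyphens, AT&T, cap'n
    | [.,;!?()"']                  # punctuation as separate tokens
'''

text = "AT&T charges $45.55 on 01/02/06, see http://www.nitt.edu #nlpACL!"
print(re.findall(pattern, text))
# ['AT&T', 'charges', '$45.55', 'on', '01/02/06', ',', 'see',
#  'http://www.nitt.edu', '#nlpACL', '!']
```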
Tokenization - Clitics
A clitic is a part of a word that can’t stand on its own; it occurs only when attached to another word.
what’re, we’re

A tokenizer can expand clitic contractions that are marked by apostrophes, converting them to what are, we are.
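A tiny sketch of such expansion with a lookup table (the mapping shown is partial and illustrative):

```python
# Partial, illustrative contraction table.
CONTRACTIONS = {"what're": "what are", "we're": "we are",
                "isn't": "is not", "didn't": "did not"}

def expand_clitics(tokens):
    expanded = []
    for tok in tokens:
        # Expand if known, otherwise keep the token unchanged.
        expanded.extend(CONTRACTIONS.get(tok.lower(), tok).split())
    return expanded

print(expand_clitics(["What're", "we're", "doing"]))
# ['what', 'are', 'we', 'are', 'doing']
```

Note that a plain lookup table cannot resolve genuinely ambiguous cases such as dog’s (possessive vs. dog is).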

Tokenization
• Tokenization standard: the Penn Treebank tokenization standard.
• Used for the parsed corpora (treebanks) released by the Linguistic Data Consortium (LDC).
• This standard:
  • Separates out clitics (doesn’t becomes does + n’t)
  • Keeps hyphenated words together
  • Separates out all punctuation
Tokenizer output - Penn Treebank tokenization standard
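As an illustration, NLTK’s TreebankWordTokenizer produces output in this style (a sketch; the example sentence is the standard one from Jurafsky and Martin):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
sentence = '"The San Francisco-based restaurant," they said, "doesn\'t charge $10".'
print(tokenizer.tokenize(sentence))
# ['``', 'The', 'San', 'Francisco-based', 'restaurant', ',', "''", 'they',
#  'said', ',', '``', 'does', "n't", 'charge', '$', '10', "''", '.']
```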

Tokenization – Multiword Expressions
Multiword expressions can be tokenized as a single token:

New York (or) rock ’n’ roll

Requires a multiword expression dictionary.

Tokenization is thus tied up with named entity detection.
How many words in a sentence?
They lay back on the San Francisco grass and looked at the stars and their

Type: types are the number of distinct words in a corpus.
Token: tokens are the total number of running words.

• How many types/tokens in the sentence above?
• 15 tokens (counting San and Francisco separately)
• 13 types (the and and each occur twice)
How many words in a corpus?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary

Heaps' Law (= Herdan's Law): |V| = kN^β, with 0.67 < β < 0.75 and k typically between 10 and 100.
So vocabulary size grows faster than the square root of the number of word tokens.

Corpus                            Tokens = N     Types = |V|
Switchboard phone conversations   2.4 million    20 thousand
Shakespeare                       884,000        31 thousand
COCA                              440 million    2 million
Google N-grams                    1 trillion     13+ million
Tokenization in NLTK
Bird, Loper and Klein (2009), Natural Language Processing with Python. O’Reilly
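For example, with NLTK’s standard sentence and word tokenizers (a sketch; the model name to download varies slightly across NLTK versions):

```python
import nltk
nltk.download('punkt', quiet=True)  # tokenizer models ('punkt_tab' in newer NLTK)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "They lay back on the grass. It isn't dark yet!"
print(sent_tokenize(text))
# ['They lay back on the grass.', "It isn't dark yet!"]
print(word_tokenize("It isn't dark yet!"))
# ['It', 'is', "n't", 'dark', 'yet', '!']
```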
TOKENIZATION: Byte Pair Encoding
A recent approach to text tokenization:

Use the data to tell us how to tokenize.

Subword tokenization (because tokens can be parts of words as well as whole words).
Subword tokenization
Three common algorithms:
• Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
• Unigram language modeling tokenization (Kudo, 2018)
• WordPiece (Schuster and Nakajima, 2012)

Token learner: takes a raw training corpus and induces a vocabulary (a set of tokens).
Token segmenter: takes a raw test sentence and tokenizes it according to that vocabulary.
Byte Pair Encoding (BPE) token learner
Let the vocabulary be the set of all individual characters
= {A, B, C, D, …, a, b, c, d, …}
Repeat:
• Choose the two symbols most frequently adjacent in the training corpus (say 'A', 'B')
• Add a new merged symbol 'AB' to the vocabulary
• Replace every adjacent 'A' 'B' in the corpus with 'AB'
Until k merges have been done.
BPE token learner algorithm
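A minimal Python sketch of the learner, representing the corpus as symbol sequences with counts (a toy implementation, not an optimized one):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus (sequence -> frequency)."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus, k):
    """Perform k merges; return the merge list (in learned order)."""
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus

# The corpus from the slides: symbol sequence (letters + '_') -> count
corpus = {tuple("low_"): 5, tuple("lowest_"): 2, tuple("newer_"): 6,
          tuple("wider_"): 3, tuple("new_"): 2}
merges, corpus = learn_bpe(corpus, 8)
print(merges[:3])  # [('e', 'r'), ('er', '_'), ('n', 'e')]
```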
Byte Pair Encoding (BPE): Preprocessing the Corpus
Subword algorithms run inside space-separated tokens:

• Add a special end-of-word symbol '_' before each space in the training corpus
• Separate each word into letters

BPE token learner
Sample corpus:

low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new

Add end-of-word tokens and separate into letters, giving this representation (word counts on the left) and initial vocabulary:

5  l o w _
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _

Vocabulary: _, d, e, i, l, n, o, r, s, t, w
BPE Token Learner
The first merges on this corpus:
• Merge e r → er
• Merge er _ → er_
• Merge n e → ne
• … and so on, until k merges have been done.
BPE Token Segmenter
On test data, run each merge learned from the training data, in the order learned:

Merge every e r to er, then merge every er _ to er_, etc.

Result:
• Test word "n e w e r _" is tokenized as a full word: "newer_"
• Test word "l o w e r _" becomes two tokens: "low er_"
Properties of BPE tokens
They include frequent words and frequent subwords,
• which are often morphemes like -est or -er.
A morpheme is the smallest meaning-bearing unit of a language.
• unlikeliest: 3 morphemes, un-, likely, and -est
Used in:
• GPT-2, RoBERTa, XLM, FlauBERT
Text Normalization
Typical normalization operations:
• Set all characters to lowercase
• Remove numbers (or convert numbers to textual representations)
• Remove punctuation (generally part of tokenization)
• Strip whitespace (also generally part of tokenization)
• Remove default stop words (general English stop words)
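A minimal sketch of such a pipeline (the stop-word list is truncated for illustration; a real pipeline would use e.g. NLTK's list):

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "and", "is", "are", "to", "of"}  # truncated

def normalize(text):
    text = text.lower()                                  # case folding
    text = re.sub(r'\d+', '', text)                      # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # punctuation
    tokens = text.split()                                # also strips whitespace
    return [t for t in tokens if t not in STOP_WORDS]    # stop-word removal

print(normalize("The 3 stars are out, and the grass is wet!"))
# ['stars', 'out', 'grass', 'wet']
```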

Text Normalization
Tokens need to be normalized:

A single normalized form is chosen for words with multiple forms
• e.g., USA and US
• This is valuable despite the loss of spelling information.
Non-standard words (numbers, acronyms, and dates)
• Map them to a special vocabulary:
  • Any decimal number to a single token 0.0
  • Any acronym to AAA
• The vocabulary becomes smaller
• This improves the accuracy of many language modeling tasks.

Abbreviations: Cust. - Customer
Acronyms: UN - United Nations
Text Normalization - Case folding
Case folding: converting everything to the same case.

Adopted in speech recognition and information retrieval: everything is mapped to lower case.

In sentiment analysis, text classification, information extraction, and machine translation, case is helpful and case folding is generally not done:
US [country] vs. us [pronoun]
Text Normalization

Stemming
• Removes affixes

Lemmatization
• Removes affixes
• The resulting form is a known word in a dictionary
Stemming & Lemmatization
Stemming
• Stemming: reducing derived or inflected words to their stem (base/root form).
• Related words map to the same stem.
• Search engines treat words with the same stem as synonyms (conflation).
Stemming and Lemmatization
How can we know that “organize”, “organizes”, and “organizing” should map to the same word?
• The goal of stemming and lemmatization:
• To reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
am, are, is → be
car, cars, car’s, cars’ → car
• “the boy’s cars are different colors” → “the boy car be different color”
Stemming Algorithms

The Porter Stemmer Algorithm
Word = Stem + Affix(es)
• generalizations = general + ization + s

Porter’s stemmer is a rule-based algorithm
• E.g., ATIONAL → ATE (applied: relational → relate)

Porter’s stemmer is heuristic, in that it is a practical method not guaranteed to be optimal.
The Porter Stemmer
C = a string of one or more consonants
V = a string of one or more vowels

Example: troubles → Tr ou bl e s → C V C V C

All words are of the form (C)(VC)^m (V), i.e., VC repeated m times, where m is called the measure.
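A simplified sketch of computing the measure m (it ignores Porter's special handling of 'y', which counts as a vowel when it follows a consonant):

```python
import re

def measure(stem):
    """Number of VC sequences in the (C)(VC)^m (V) decomposition."""
    cv = ''.join('V' if c in 'aeiou' else 'C' for c in stem.lower())
    cv = re.sub(r'V+', 'V', cv)  # collapse vowel runs
    cv = re.sub(r'C+', 'C', cv)  # collapse consonant runs
    return cv.count('VC')

print(measure("tr"), measure("trouble"), measure("troubles"))  # 0 1 2
```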
The Porter Stemmer: rule format
The rules are of the form:
(condition) S1 → S2
where S1 and S2 are suffixes.

Conditions:
m    The measure of the stem
*S   The stem ends with S
*v*  The stem contains a vowel
*d   The stem ends with a double consonant
*o   The stem ends in CVC
The Porter Stemmer: Step 1
SSES → SS
caresses → caress

IES → I
ponies → poni
ties → ti

SS → SS
caress → caress

S → є
cats → cat
The Porter Stemmer: Step 2a (past tense, progressive)

(m≥1) EED → EE
• Condition verified: agreed → agree
• Condition not verified: feed → feed

(*v*) ED → є
• Condition verified: plastered → plaster
• Condition not verified: bled → bled

(*v*) ING → є
• Condition verified: motoring → motor
• Condition not verified: sing → sing

[*v*: the stem contains a vowel]
The Porter Stemmer: Step 2b (cleanup)
(These rules run only if the second or third rule in 2a applied)

AT → ATE
conflat(ed) → conflate

BL → BLE
troubl(ing) → trouble

(*d & !(*L or *S or *Z)) → single letter
Condition verified: hopp(ing) → hop, tann(ed) → tan
Condition not verified: fall(ing) → fall

(m=1 & *o) → E
• Condition verified: fil(ing) → file
• Condition not verified: fail → fail

[*d: the stem ends with a double consonant; *o: the stem ends in CVC]
The Porter Stemmer: Steps 3 and 4
Step 3: Y Elimination
(*v*) Y → I
Condition verified: happy → happi
Condition not verified: sky → sky

Step 4: Derivational Morphology, I
(m>0) ATIONAL → ATE
relational → relate
(m>0) IZATION → IZE
generalization → generalize
(m>0) BILITI → BLE
sensibiliti → sensible
The Porter Stemmer: Steps 5 & 6
Step 5: Derivational Morphology, II
(m>0) ICATE → IC: triplicate → triplic
(m>0) FUL → є: hopeful → hope
(m>0) NESS → є: goodness → good

Step 6: Derivational Morphology, III
(m>0) ANCE → є: allowance → allow
(m>0) ENT → є: dependent → depend
(m>0) IVE → є: effective → effect
The Porter Stemmer: Step 7 (cleanup)
Step 7a
(m>1) E → є
probate → probat
(m=1 & !*o) NESS → є
goodness → good

Step 7b
(m>1 & *d & *L) → single letter
Condition verified: controll → control
Condition not verified: roll → roll

[*L: the stem ends with L; *d: the stem ends with a double consonant; *o: the stem ends in CVC]
Correct Examples
computers
• Step 1: → computer
• Step 6: → compute
singing
• Step 2a: → sing
controlling
• Step 2a: → controll
• Step 7b: → control
generalizations
• Step 1: → generalization
• Step 4: → generalize
• Step 6: → general
doing
• Step 2a: → do
Problems
elephants → eleph
• Step 1: → elephant
• Step 6: → eleph
Problems with the Porter Stemmer
Words with different meanings are conflated to the same stem:
• general, generous, generation, and generic all conflate to the root gener

Words with similar meanings are not conflated at all:
• recognize and recognition are not conflated
Lancaster Stemmer (iterative)
A table of 120 rules is indexed by the last letter of a suffix.
• On each iteration, find a rule based on the last character of the word.
• Each rule specifies deletion or replacement of an ending.
• Terminate when:
  • No rule is applicable
  • The word starts with a vowel and only two letters remain
  • The word starts with a consonant and only three characters remain
• Otherwise, apply the rule and repeat.
Snowball Stemmer
Supports English and 13 other languages:
• Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish.
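All three stemmers are available in NLTK, so they are easy to compare side by side (a sketch; outputs differ per stemmer, with Lancaster usually the most aggressive):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for w in ["generalizations", "elephants", "recognition", "organizing"]:
    # Print each word with its three stems for comparison.
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))
```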

Stemmers: Pros and Cons
Cons:
• A stemmer operates on a single word, without knowledge of the context
• It cannot discriminate between words which have different meanings (based on POS)
Pros:
• Easier to implement and runs faster
• Its reduced accuracy may not matter for some applications
Lemmatization
• The algorithmic process of determining the lemma for a given word.
• To extract the proper lemma, we look at the morphological analysis of each word.
• The base form together with its part of speech is called a lexeme.
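With NLTK's WordNet lemmatizer, for example, passing the part of speech matters (a sketch; some NLTK versions also need the 'omw-1.4' data package):

```python
import nltk
nltk.download('wordnet', quiet=True)  # WordNet data for the lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))             # car   (default POS is noun)
print(lemmatizer.lemmatize("are", pos="v"))     # be
print(lemmatizer.lemmatize("better", pos="a"))  # good
```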
Lemmatization – How to?
Computational Morphology
Methods:
• Brute force: try every possible segmentation of the word and see which ones match known stems and affixes
• Rule-based (simplistic): have a list of known affixes, and see which ones apply
• Rule-based (more sophisticated): a list of known affixes, plus knowledge about allowable combinations, e.g., -ing can only attach to a verb stem (morphotactic rules)
References
• The Porter Stemmer home page (with the original paper and code): http://www.tartarus.org/~martin/PorterStemmer/
• Jurafsky and Martin, Chapter 3.4.
• Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
