CL - Lec 6

This document provides an overview of tokenization, stemming, and lemmatization in natural language processing. It explains the process of breaking text into tokens, the challenges of tokenization across different languages, and the importance of stemming and lemmatization for reducing words to their base forms. It also discusses the implications of stop words and normalization in text processing.

Tokenization, Stemming and Lemmatization
What is Tokenization?

Tokenization is the task of chopping text into pieces, called tokens.

● Token: an instance of a sequence of characters in some particular document, grouped as a useful semantic unit
● Type: the class of all tokens containing the same character sequence
● Term: a (perhaps normalized) type that is included in the corpus dictionary

Example: to sleep more to learn
Token: to, sleep, more, to, learn
Type: to, sleep, more, learn
Term: sleep, more, learn (stop words removed)
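A minimal Python sketch of the token/type/term distinction, assuming whitespace tokenization and a one-word stop list (both assumptions chosen just for this example):

```python
text = "to sleep more to learn"
stop_words = {"to"}                # assumed stop list for this example

tokens = text.split()              # every occurrence: 5 tokens
types = set(tokens)                # distinct character sequences: 4 types
terms = types - stop_words         # types kept in the dictionary: 3 terms

print(tokens)  # ['to', 'sleep', 'more', 'to', 'learn']
print(types)   # {'to', 'sleep', 'more', 'learn'} (set, unordered)
print(terms)   # {'sleep', 'more', 'learn'} (set, unordered)
```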
Tokenization
◼ Input: “Girls, Bengalees and Countrymen”
◼ Output: Tokens
◼ Girls
◼ Bengalees
◼ Countrymen
◼ Each such token is now a candidate for an index entry, after further processing (described below)
◼ But what are valid tokens to emit?
Tokenization
◼ Issues in tokenization:
◼ Japan’s capital → Japan? Japans? Japan’s?
◼ অগ্নি-বীণা → অগ্নি and বীণা as two tokens?
◼ state-of-the-art: break up the hyphenated sequence?
◼ co-education
◼ lowercase, lower-case, lower case?
◼ It’s effective to get the user to put in possible hyphens
◼ San Francisco: one token or two? How do you decide it is one token?
◼ For example, if the document to be indexed is to sleep perchance to dream, then there are five tokens, but only four types (because there are two instances of to). However, if to is omitted from the index (as a stop word), then there are only three terms: sleep, perchance, and dream.
◼ The major question of the tokenization phase is: what are the correct tokens to use? In this example, it looks fairly trivial: you chop on whitespace and throw away punctuation characters.
◼ Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. (How should the apostrophes in O’Neill, boys’, and aren’t be tokenized?)
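A small sketch of the naive “chop on whitespace and throw away punctuation” tokenizer, showing why apostrophes are troublesome (the regex and function name are illustrative, not a standard API):

```python
import re

def naive_tokenize(text):
    # Lowercase, then keep only runs of letters/digits: punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

print(naive_tokenize("Mr. O'Neill thinks that the boys' stories ..."))
# ['mr', 'o', 'neill', 'thinks', 'that', 'the', 'boys', 'stories']
# O'Neill splits into 'o' + 'neill', and boys' loses its possessive marker.
```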
Numbers
◼ 3/12/91 Mar. 12, 1991
◼ 55 B.C.
◼ B-52
◼ My PGP key is 324a3df234cb23e
◼ (800) 234-2333
◼ Often have embedded spaces
◼ Often, don’t index as text
◼ But often very useful: think about things like looking up error codes/stacktraces on the web
◼ (One answer is using n-grams: Lecture 3)
◼ Will often index “meta-data” separately
◼ Creation date, format, etc.
Tokenization: language issues
◼ French
◼ L'ensemble → one token or two?
◼ L ? L’ ? Le ?
◼ Want l’ensemble to match with un ensemble

◼ German noun compounds are not segmented
◼ Lebensversicherungsgesellschaftsangestellter
◼ ‘life insurance company employee’
◼ German retrieval systems benefit greatly from a
compound splitter module
Tokenization: language issues
◼ Chinese and Japanese have no spaces
between words:
◼ 莎拉波娃现在居住在美国东南部的佛罗里达。
◼ Not always guaranteed a unique tokenization
◼ Further complicated in Japanese, with
multiple alphabets intermingled
◼ Dates/amounts in multiple formats:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single sentence mixing katakana, hiragana, kanji, and romaji)
◼ End-user can express a query entirely in hiragana!
Tokenization: language issues
◼ Arabic (or Hebrew) is basically written right
to left, but with certain items like numbers
written left to right
◼ Words are separated, but letter forms within
a word form complex ligatures

◼ Example (the slide shows an Arabic sentence read right to left, with the embedded numerals read left to right): ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
◼ With Unicode, the surface presentation is
complex, but the stored form is straightforward
Word-based Tokenization

Approach
● Splitting the text by spaces
● Other delimiters such as punctuation can be used
Advantages
● Easy to implement
Disadvantages
● High risk of near-duplicate types; e.g., Let and Let’s will have two different types
● Languages like Chinese do not have spaces
● Huge vocabulary size (many token types)
○ Must limit the number of words that can be added to the vocabulary
● Misspelled words will become separate tokens
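A minimal sketch of word-based tokenization by splitting on spaces, illustrating the vocabulary-inflation problem noted above:

```python
text = "Let it be. Let's go!"
tokens = text.split()
print(tokens)  # ['Let', 'it', 'be.', "Let's", 'go!']
# 'Let' and "Let's" (and 'be.' with its trailing period) each become
# distinct types, inflating the vocabulary.
```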
Character-based Tokenization
Approach
● Splitting the text into individual characters
Advantages
● There will be no or very few unknown (out-of-vocabulary) words
● Useful for languages in which individual characters carry information
● Much smaller vocabulary (fewer token types)
● Easy to implement
Disadvantages
● A character usually does not have a meaning on its own
○ Cannot learn semantics for words
● Longer sequences to be processed by models
○ More input to process
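A one-liner sketch of character-based tokenization, showing how the sequence length grows:

```python
text = "tokenization"
tokens = list(text)
print(tokens)       # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
print(len(tokens))  # 12 character tokens vs. a single word token
```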
Subword Tokenization

Approach
● Frequently used words should not be split into smaller subwords
● Rare words should be decomposed into meaningful subwords
● Uses a special symbol to mark whether a subword starts a word or continues one
○ Tokenization → “Token”, “##ization”
● State-of-the-art approaches for NLP and IR rely on this type

Advantages
● Out-of-vocabulary word problem solved
● Manageable vocabulary sizes

Disadvantages
● Newer scheme that needs more exploration

Byte Pair Encoding (BPE) and WordPiece are two examples of this scheme; a sketch of BPE training follows.
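A minimal sketch of how BPE learns its merges on a toy corpus; real tokenizers are far more optimized, but the core loop (count adjacent symbol pairs, merge the most frequent) is the same. Everything here, corpus included, is illustrative:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = Counter()
    for w in words:
        vocab[tuple(w)] += 1
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 4))
# On this toy corpus the early merges typically include pairs like
# ('e', 's') and ('es', 't'), building up the subword 'est'.
```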
Stop words
◼ With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
◼ They have little semantic content: the, a, and, to, be
◼ There are a lot of them: ~30% of postings for the top 30 words
◼ But the trend is away from doing this:
◼ Good compression techniques means the space for
including stop words in a system is very small
◼ Good query optimization techniques mean you pay little
at query time for including stop words.
◼ You need them for:
◼ Phrase queries: “King of Denmark”
◼ Various song titles, etc.: “Let it be”, “To be or not to be”
◼ “Relational” queries: “flights to London”
Stopwords Removal

Stopping: removing common words from the stream of tokens that become index terms
● Words that are function words helping form sentence structure: the, of, and, to, …
● For an application, an additional domain-specific stop word list may be constructed
● Why do we need to remove stop words?
○ Reduces the indexing (or data) file size
○ Usually has no impact on the NLP task’s effectiveness, and may even improve it
● Can sometimes cause issues for NLP tasks (see the sketch below):
○ e.g., phrases: “to be or not to be”, “let it be”, “flights to Portland Maine”
○ Some tasks consider a very small stop word list
■ Sometimes perhaps only “the”
● List of stop words: https://www.ranks.nl/stopwords
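A minimal stop-word filter, assuming a tiny illustrative stop list (real lists, like the one linked above, are much longer):

```python
stop_words = {"the", "a", "an", "of", "and", "to", "be", "or", "not", "it"}

def remove_stopwords(tokens):
    # Keep only tokens whose lowercase form is not in the stop list.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords("flights to London".split()))
# ['flights', 'London']
print(remove_stopwords("to be or not to be".split()))
# [] -- the phrase query vanishes entirely, the failure mode noted above
```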
Normalization
◼ Need to “normalize” terms in indexed text as
well as query terms into the same form
◼ We want to match U.S.A. and USA
◼ We most commonly implicitly define
equivalence classes of terms
◼ e.g., by deleting periods in a term
◼ Alternative is to do asymmetric expansion:
◼ Enter: window Search: window, windows
◼ Enter: windows Search: Windows, windows, window
◼ Enter: Windows Search: Windows
◼ Potentially more powerful, but less efficient
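A sketch of the two options: symmetric equivalence classing (normalize index and query terms the same way) versus asymmetric expansion via a hand-built table. Both the normalizer and the table are assumptions for illustration:

```python
# Equivalence classing: delete periods in terms on both sides.
def normalize(term):
    return term.replace(".", "")

print(normalize("U.S.A.") == normalize("USA"))  # True -- the terms now match

# Asymmetric expansion: what is searched depends on how the query was typed.
expansions = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}
print(expansions["windows"])  # {'Windows', 'windows', 'window'}
```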
Normalization: other languages
◼ Accents: résumé vs. resume.
◼ Most important criterion:
◼ How are your users likely to write their queries for these words?
◼ Even in languages that standardly have accents, users often may not type them
◼ German: Tuebingen vs. Tübingen
◼ Should be equivalent
Case folding
◼ Reduce all letters to lower case
◼ exception: upper case in mid-sentence?
◼ e.g., General Motors
◼ Fed vs. fed
◼ SAIL vs. sail
◼ Often best to lower case everything, since
users will use lowercase regardless of
‘correct’ capitalization…

◼ Aug 2005 Google example:
◼ C.A.T. → Cat Fanciers website, not Caterpillar Inc.
Stemming and lemmatization
◼ The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:
◼ car = automobile
◼ color = colour
◼ Rewrite to form equivalence classes
◼ Index such equivalences
◼ When the document contains automobile, index it under car
as well (usually, also vice-versa)
Stemming is a process that extracts stems by removing the last few characters of a word, often leading to incorrect meanings and spellings.

Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
Stemming
Stemming: to group words that are derived from a common stem
● e.g., “fish”, “fishes”, “fishing” could be mapped to “fish”
● Generally produces small improvements in task effectiveness
● Similar to stopping, stemming can be done aggressively, conservatively, or not at all
○ Aggressively: consider “fish” and “fishing” the same
○ Conservatively: just identify plural forms using the letter “s”
■ issue: ‘centuries’ → ‘centurie’
○ Not at all: consider all the word variants
● In different languages, stemming can have different importance for effectiveness:
○ In Arabic, morphology is more complicated than in English
○ In Chinese, stemming is not effective
Stemming
◼ Reduce terms to their “roots” before
indexing
◼ “Stemming” suggests crude affix chopping
◼ language dependent
◼ e.g., automate(s), automatic, automation all
reduced to automat.

Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
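This behavior can be reproduced with NLTK’s Porter stemmer (assumes `nltk` is installed; shown as an illustration, not necessarily the exact stemmer behind the slide):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "accepted", "equivalent"]:
    print(word, "->", stemmer.stem(word))
# compressed -> compress, compression -> compress,
# accepted -> accept, equivalent -> equival
```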
Evaluation of Stemmers
There are three criteria for evaluating stemmers:
1.Correctness
2.Efficiency of the task
3.Compression performance

There are two ways in which stemming can be incorrect:
● Over-stemming (too much of the term is removed)
○ Two or more words are reduced to the same wrong root
○ e.g., ‘centennial’, ‘century’, ‘center’ → ‘cent’
● Under-stemming (too little of the term is removed)
○ Words that share a root are wrongly reduced to more than one root word
○ e.g., ‘acquire’, ‘acquiring’, ‘acquired’ → ‘acquir’, but ‘acquisition’ → ‘acquis’
Lemmatization
◼ Reduce inflectional/variant forms to base
form
◼ E.g.,
◼ am, are, is → be
◼ car, cars, car's, cars' → car
◼ the boy's cars are different colors → the boy
car be different color
◼ Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization

Reduce inflectional/variant forms to base form
● am, are, is → be
● car, cars, car's, cars' → car
● the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
●e.g., WordNet is a lexical database of semantic relations between words in
more than 200 languages
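A sketch using NLTK’s WordNet lemmatizer (assumes `nltk` is installed and the WordNet data has been fetched via nltk.download('wordnet')); note that it needs a part-of-speech hint to lemmatize verbs:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
print(lemmatizer.lemmatize("is", pos="v"))   # 'be'
print(lemmatizer.lemmatize("cars"))          # 'car' (default POS is noun)
print(lemmatizer.lemmatize("colors"))        # 'color'
```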

Phrases
In a task such as information retrieval, input queries can be 2-3
word phrases
● Phrases can yield more precise queries
○ “University of Southern Maine”, “black sea”
● Less ambiguous
○ “Red apple” vs. “apple”

A phrase is any sequence of n words: an n-gram
● unigram: one word; bigram: sequence of 2 words; trigram: sequence of 3 words
● Generated by:
○ Choosing a particular value for ‘n’
○ Moving that “window” forward one word at a time
● The more frequently a word n-gram occurs, the more likely it is
to correspond to a meaningful phrase in the language
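A minimal sketch of the sliding-window n-gram generation just described:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list, one word at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```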
Language-specificity
◼ Many of the above features embody
transformations that are
◼ Language-specific and
◼ Often, application-specific
◼ These are “plug-in” addenda to the indexing
process
◼ Both open source and commercial plug-ins
are available for handling these
Dictionary entries – first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english

These may be grouped by language (or not…). More on this in ranking/query processing.
THANK YOU
