CL - Lec 6

This document provides an overview of tokenization, stemming, and lemmatization in natural language processing. It explains the process of breaking text into tokens, the challenges of tokenization across different languages, and the importance of stemming and lemmatization for reducing words to their base forms. It also discusses the implications of stop words and normalization in text processing.

Tokenization, Stemming and Lemmatization
What is Tokenization?

Tokenization is the task of chopping text into pieces, called tokens.

● Token: an instance of a sequence of characters in some particular document, grouped as a useful semantic unit
● Type: the class of all tokens containing the same character sequence
● Term: a (perhaps normalized) type that is included in the corpus dictionary

Example: to sleep more to learn
Token: to, sleep, more, to, learn
Type: to, sleep, more, learn
Term: sleep, more, learn (stop words removed)
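A minimal Python sketch of the token/type/term distinction, assuming whitespace tokenization and a one-word stop list (both assumptions chosen just for this example):

```python
text = "to sleep more to learn"
stop_words = {"to"}                # assumed stop list for this example

tokens = text.split()              # every occurrence: 5 tokens
types = set(tokens)                # distinct character sequences: 4 types
terms = types - stop_words         # types kept in the dictionary: 3 terms

print(tokens)  # ['to', 'sleep', 'more', 'to', 'learn']
print(types)   # {'to', 'sleep', 'more', 'learn'} (set, unordered)
print(terms)   # {'sleep', 'more', 'learn'} (set, unordered)
```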
Tokenization
◼ Input: “Girls, Bengalees and Countrymen”
◼ Output: Tokens
◼ Girls
◼ Bengalees
◼ Countrymen
◼ Each such token is now a candidate for an index entry, after further processing (described below)
◼ But what are valid tokens to emit?
Tokenization
◼ Issues in tokenization:
◼ Japan’s capital → Japan? Japans? Japan’s?
◼ অগ্নি-বীণা → অগ্নি and বীণা as two tokens?
◼ state-of-the-art: break up the hyphenated sequence?
◼ co-education
◼ lowercase, lower-case, lower case?
◼ It’s effective to get the user to put in possible hyphens
◼ San Francisco: one token or two? How do you decide it is one token?
◼ For example, if the document to be indexed is to sleep perchance to dream, then there are five tokens, but only four types (because there are two instances of to). However, if to is omitted from the index (as a stop word), then there are only three terms: sleep, perchance, and dream.
◼ The major question of the tokenization phase is: what are the correct tokens to use? In this example, it looks fairly trivial: you chop on whitespace and throw away punctuation characters.
◼ Example: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. (How should the apostrophes in O’Neill, boys’, and aren’t be tokenized?)
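A small sketch of the naive “chop on whitespace and throw away punctuation” tokenizer, showing why apostrophes are troublesome (the regex and function name are illustrative, not a standard API):

```python
import re

def naive_tokenize(text):
    # Lowercase, then keep only runs of letters/digits: punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

print(naive_tokenize("Mr. O'Neill thinks that the boys' stories ..."))
# ['mr', 'o', 'neill', 'thinks', 'that', 'the', 'boys', 'stories']
# O'Neill splits into 'o' + 'neill', and boys' loses its possessive marker.
```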
Numbers
◼ 3/12/91 Mar. 12, 1991
◼ 55 B.C.
◼ B-52
◼ My PGP key is 324a3df234cb23e
◼ (800) 234-2333
◼ Often have embedded spaces
◼ Often, don’t index as text
◼ But often very useful: think about things like looking up error codes/stacktraces on the web
◼ (One answer is using n-grams: Lecture 3)
◼ Will often index “meta-data” separately
◼ Creation date, format, etc.
Tokenization: language issues
◼ French
◼ L'ensemble → one token or two?
◼ L ? L’ ? Le ?
◼ Want l’ensemble to match with un ensemble

◼ German noun compounds are not segmented
◼ Lebensversicherungsgesellschaftsangestellter
◼ ‘life insurance company employee’
◼ German retrieval systems benefit greatly from a
compound splitter module
Tokenization: language issues
◼ Chinese and Japanese have no spaces
between words:
◼ 莎拉波娃现在居住在美国东南部的佛罗里达。
◼ Not always guaranteed a unique tokenization
◼ Further complicated in Japanese, with
multiple alphabets intermingled
◼ Dates/amounts in multiple formats:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single sentence mixing katakana, hiragana, kanji, and romaji)
◼ End-user can express a query entirely in hiragana!
Tokenization: language issues
◼ Arabic (or Hebrew) is basically written right
to left, but with certain items like numbers
written left to right
◼ Words are separated, but letter forms within
a word form complex ligatures

◼ Example (the slide shows an Arabic sentence read right to left, with the embedded numerals read left to right): ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
◼ With Unicode, the surface presentation is
complex, but the stored form is straightforward
Word-based Tokenization

Approach
● Splitting the text by spaces
● Other delimiters such as punctuation can be used
Advantages
● Easy to implement
Disadvantages
● High risk of near-duplicate types; e.g., Let and Let’s will have two different types
● Languages like Chinese do not have spaces
● Huge vocabulary size (many token types)
○ Must limit the number of words that can be added to the vocabulary
● Misspelled words will become separate tokens
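A minimal sketch of word-based tokenization by splitting on spaces, illustrating the vocabulary-inflation problem noted above:

```python
text = "Let it be. Let's go!"
tokens = text.split()
print(tokens)  # ['Let', 'it', 'be.', "Let's", 'go!']
# 'Let' and "Let's" (and 'be.' with its trailing period) each become
# distinct types, inflating the vocabulary.
```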
Character-based Tokenization
Approach
● Splitting the text into individual characters
Advantages
● There will be no or very few unknown (out-of-vocabulary) words
● Useful for languages in which individual characters carry information
● Much smaller vocabulary (fewer token types)
● Easy to implement
Disadvantages
● A character usually does not have a meaning on its own
○ Cannot learn semantics for words
● Longer sequences to be processed by models
○ More input to process
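A one-liner sketch of character-based tokenization, showing how the sequence length grows:

```python
text = "tokenization"
tokens = list(text)
print(tokens)       # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
print(len(tokens))  # 12 character tokens vs. a single word token
```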
Subword Tokenization

Approach
● Frequently used words should not be split into smaller subwords
● Rare words should be decomposed into meaningful subwords
● Uses a special symbol to mark whether a subword starts a word or continues one
○ Tokenization → “Token”, “##ization”
● State-of-the-art approaches for NLP and IR rely on this type

Advantages
● Out-of-vocabulary word problem solved
● Manageable vocabulary sizes

Disadvantages
● Newer scheme that needs more exploration

Byte Pair Encoding (BPE) and WordPiece are two examples of this scheme; a sketch of BPE training follows.
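A minimal sketch of how BPE learns its merges on a toy corpus; real tokenizers are far more optimized, but the core loop (count adjacent symbol pairs, merge the most frequent) is the same. Everything here, corpus included, is illustrative:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    vocab = Counter()
    for w in words:
        vocab[tuple(w)] += 1
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the pair with a single merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest", "widest"], 4))
# On this toy corpus the early merges typically include pairs like
# ('e', 's') and ('es', 't'), building up the subword 'est'.
```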
Stop words
◼ With a stop list, you exclude the commonest words from the dictionary entirely. Intuition:
◼ They have little semantic content: the, a, and, to, be
◼ There are a lot of them: ~30% of postings for the top 30 words
◼ But the trend is away from doing this:
◼ Good compression techniques means the space for
including stop words in a system is very small
◼ Good query optimization techniques mean you pay little
at query time for including stop words.
◼ You need them for:
◼ Phrase queries: “King of Denmark”
◼ Various song titles, etc.: “Let it be”, “To be or not to be”
◼ “Relational” queries: “flights to London”
Stopwords Removal

Stopping: removing common words from the stream of tokens that become index terms
● Words that are function words helping form sentence structure: the, of, and, to, …
● For an application, an additional domain-specific stop word list may be constructed
● Why do we need to remove stop words?
○ Reduces the indexing (or data) file size
○ Usually has no impact on the NLP task’s effectiveness, and may even improve it
● Can sometimes cause issues for NLP tasks (see the sketch below):
○ e.g., phrases: “to be or not to be”, “let it be”, “flights to Portland Maine”
○ Some tasks consider a very small stop word list
■ Sometimes perhaps only “the”
● List of stop words: https://www.ranks.nl/stopwords
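A minimal stop-word filter, assuming a tiny illustrative stop list (real lists, like the one linked above, are much longer):

```python
stop_words = {"the", "a", "an", "of", "and", "to", "be", "or", "not", "it"}

def remove_stopwords(tokens):
    # Keep only tokens whose lowercase form is not in the stop list.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords("flights to London".split()))
# ['flights', 'London']
print(remove_stopwords("to be or not to be".split()))
# [] -- the phrase query vanishes entirely, the failure mode noted above
```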
Normalization
◼ Need to “normalize” terms in indexed text as
well as query terms into the same form
◼ We want to match U.S.A. and USA
◼ We most commonly implicitly define
equivalence classes of terms
◼ e.g., by deleting periods in a term
◼ Alternative is to do asymmetric expansion:
◼ Enter: window Search: window, windows
◼ Enter: windows Search: Windows, windows, window
◼ Enter: Windows Search: Windows
◼ Potentially more powerful, but less efficient
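A sketch of the two options: symmetric equivalence classing (normalize index and query terms the same way) versus asymmetric expansion via a hand-built table. Both the normalizer and the table are assumptions for illustration:

```python
# Equivalence classing: delete periods in terms on both sides.
def normalize(term):
    return term.replace(".", "")

print(normalize("U.S.A.") == normalize("USA"))  # True -- the terms now match

# Asymmetric expansion: what is searched depends on how the query was typed.
expansions = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}
print(expansions["windows"])  # {'Windows', 'windows', 'window'}
```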
Normalization: other languages
◼ Accents: résumé vs. resume.
◼ Most important criterion:
◼ How are your users likely to write their queries for these words?
◼ Even in languages that standardly have accents, users often may not type them
◼ German: Tuebingen vs. Tübingen
◼ Should be equivalent
Case folding
◼ Reduce all letters to lower case
◼ exception: upper case in mid-sentence?
◼ e.g., General Motors
◼ Fed vs. fed
◼ SAIL vs. sail
◼ Often best to lower case everything, since
users will use lowercase regardless of
‘correct’ capitalization…

◼ Aug 2005 Google example:
◼ C.A.T. → Cat Fanciers website, not Caterpillar Inc.
Stemming and lemmatization
◼ The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:
◼ car = automobile
◼ color = colour
◼ Rewrite to form equivalence classes
◼ Index such equivalences
◼ When the document contains automobile, index it under car
as well (usually, also vice-versa)
Stemming is a process that extracts stems by removing the last few characters of a word, often leading to incorrect meanings and spellings.

Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
Stemming
Stemming: to group words that are derived from a common stem
● e.g., “fish”, “fishes”, “fishing” could be mapped to “fish”
● Generally produces small improvements in task effectiveness
● Similar to stopping, stemming can be done aggressively, conservatively, or not at all
○ Aggressively: consider “fish” and “fishing” the same
○ Conservatively: just identify plural forms using the letter “s”
■ issue: ‘centuries’ → ‘centurie’
○ Not at all: consider all the word variants
● In different languages, stemming can have different importance for effectiveness:
○ In Arabic, morphology is more complicated than in English
○ In Chinese, stemming is not effective
Stemming
◼ Reduce terms to their “roots” before
indexing
◼ “Stemming” suggests crude affix chopping
◼ language dependent
◼ e.g., automate(s), automatic, automation all
reduced to automat.

Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
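This behavior can be reproduced with NLTK’s Porter stemmer (assumes `nltk` is installed; shown as an illustration, not necessarily the exact stemmer behind the slide):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "accepted", "equivalent"]:
    print(word, "->", stemmer.stem(word))
# compressed -> compress, compression -> compress,
# accepted -> accept, equivalent -> equival
```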
Evaluation of Stemmers
There are three criteria for evaluating stemmers:
1.Correctness
2.Efficiency of the task
3.Compression performance

There are two ways in which stemming can be incorrect:
● Over-stemming (too much of the term is removed)
○ Two or more words are reduced to the same wrong root
○ e.g., ‘centennial’, ‘century’, ‘center’ → ‘cent’
● Under-stemming (too little of the term is removed)
○ Words that share a root are wrongly reduced to more than one root word
○ e.g., ‘acquire’, ‘acquiring’, ‘acquired’ → ‘acquir’, but ‘acquisition’ → ‘acquis’
Lemmatization
◼ Reduce inflectional/variant forms to base
form
◼ E.g.,
◼ am, are, is → be
◼ car, cars, car's, cars' → car
◼ the boy's cars are different colors → the boy
car be different color
◼ Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization

Reduce inflectional/variant forms to base form
● am, are, is → be
● car, cars, car's, cars' → car
● the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
●e.g., WordNet is a lexical database of semantic relations between words in
more than 200 languages
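A sketch using NLTK’s WordNet lemmatizer (assumes `nltk` is installed and the WordNet data has been fetched via nltk.download('wordnet')); note that it needs a part-of-speech hint to lemmatize verbs:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
print(lemmatizer.lemmatize("is", pos="v"))   # 'be'
print(lemmatizer.lemmatize("cars"))          # 'car' (default POS is noun)
print(lemmatizer.lemmatize("colors"))        # 'color'
```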

Phrases
In a task such as information retrieval, input queries can be 2-3
word phrases
● Phrases can yield more precise queries
○ “University of Southern Maine”, “black sea”
● Less ambiguous
○ “Red apple” vs. “apple”

A phrase is any sequence of n words: an n-gram
● unigram: one word; bigram: sequence of 2 words; trigram: sequence of 3 words
● Generated by:
○ Choosing a particular value for ‘n’
○ Moving that “window” forward one word at a time
● The more frequently a word n-gram occurs, the more likely it is
to correspond to a meaningful phrase in the language
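A minimal sketch of the sliding-window n-gram generation just described:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list, one word at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```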
Language-specificity
◼ Many of the above features embody
transformations that are
◼ Language-specific and
◼ Often, application-specific
◼ These are “plug-in” addenda to the indexing
process
◼ Both open source and commercial plug-ins
are available for handling these
Dictionary entries – first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english

These may be grouped by language (or not…). More on this in ranking/query processing.
THANK YOU
