CL - Lec 6
Tokenization and Lemmatization
What is Tokenization?
A type is the class of all tokens containing the same character sequence
◼ ‘Algeria achieved its independence in 1962
after 132 years of French occupation.’
◼ With Unicode, the surface presentation is
complex, but the stored form is straightforward
Word-based Tokenization
Approach
● Splitting the text by spaces
● Other delimiters such as punctuation can be used
Advantages
● Easy to implement
Disadvantages
● High risk of splitting related words into distinct types; e.g., “Let” and “Let’s”
will be two different types
● Languages like Chinese do not use spaces between words
● Huge vocabulary size (many token types)
○ Forces a limit on the number of words that can be added to the
vocabulary
● Each misspelled word becomes its own token type
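The splitting-by-spaces-and-punctuation approach above can be sketched as follows. This is a minimal illustration, not a production tokenizer; real tokenizers handle clitics (“Let’s”), hyphens, and numbers far more carefully.

```python
def word_tokenize(text):
    """Word-based tokenization: split on whitespace, then strip
    surrounding punctuation from each chunk."""
    tokens = []
    for chunk in text.split():
        # Peel punctuation off both ends so "occupation." -> "occupation"
        token = chunk.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens

print(word_tokenize("Algeria achieved its independence in 1962."))
# ['Algeria', 'achieved', 'its', 'independence', 'in', '1962']
```

Note that “Let’s” survives intact (the apostrophe is word-internal), so it still forms a different type from “Let”, which is exactly the disadvantage noted above.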
Character-based Tokenization
Approach
●Splitting the text into individual characters
Advantages
●There will be no or very few unknown
(out-of-vocabulary) words
●Useful for languages whose characters carry
meaning (e.g., Chinese)
●Much smaller vocabulary (fewer token types)
●Easy to implement
Disadvantages
●A single character usually does not carry meaning
○ Hard to learn word-level semantics
●Larger sequence to be processed by models
○ More input to process
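A character-based tokenizer is almost trivial to write, which is why the slides list “easy to implement” as an advantage; the sketch below also shows the longer-sequence disadvantage directly.

```python
def char_tokenize(text):
    """Character-based tokenization: every non-space character
    becomes its own token."""
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("fishing"))
# ['f', 'i', 's', 'h', 'i', 'n', 'g']
```

A 7-letter word becomes a 7-token sequence, so models must process far longer inputs than with word-based tokenization.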
Subword Tokenization
Approach
●Frequently used words should not be split into smaller subwords
●Rare words should be decomposed into meaningful subwords
●Uses a special symbol to mark which subword starts a word and which subwords
continue it
○ Tokenization → “Token”, “##ization” (“##” marks a continuation in WordPiece)
●State-of-the-art approaches for NLP and IR rely on this type
Advantages
●Out-of-vocabulary word problem solved
●Manageable vocabulary sizes
Disadvantages
●New scheme and needs more exploration
Byte Pair Encoding (BPE) and WordPiece are two examples of this scheme
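A toy sketch of how BPE learns its subword vocabulary: repeatedly merge the most frequent adjacent symbol pair in a word-frequency dictionary. The tiny corpus below is illustrative only; real implementations add end-of-word markers, byte-level fallback, and much larger training data.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict."""
    # Represent each word as a tuple of symbols, keeping its count.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2))
# [('l', 'o'), ('lo', 'w')]
```

Frequent words end up as single symbols after enough merges, while rare words stay split into smaller, reusable subwords, which is exactly the behavior described above.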
Stop words
◼ With a stop list, you exclude the commonest
words from the dictionary entirely. Intuition:
◼ They have little semantic content: the, a, and, to, be
◼ There are a lot of them: ~30% of postings for top 30 words
◼ But the trend is away from doing this:
◼ Good compression techniques mean the space for
including stop words in a system is very small
◼ Good query optimization techniques mean you pay little
at query time for including stop words.
◼ You need them for:
◼ Phrase queries: “King of Denmark”
◼ Various song titles, etc.: “Let it be”, “To be or not to be”
◼ “Relational” queries: “flights to London”
Stopwords Removal
Stopping: Removing common words from the stream of tokens that become
index terms
●Function words that help form sentence structure: the, of, and, to, …
●For an application, an additional domain-specific stop-word list may be
constructed
●Why do we need to remove stop words?
○ Reduce indexing (or data) file size
○ Usually has no impact on the NLP task’s effectiveness, and may even
improve it
●Can sometimes cause issues for NLP tasks:
○ e.g., phrases: “to be or not to be”, “let it be”, “flights to Portland Maine”
○ Some tasks consider very small stopwords list
■ Sometimes perhaps only “the”
●List of stopwords: https://www.ranks.nl/stopwords
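Stopping itself is a simple filter over the token stream, as the sketch below shows. The stop-word set here is a tiny illustrative list, not one of the published lists linked above.

```python
# Tiny illustrative stop list; real lists (e.g., the one linked above)
# contain a few hundred entries.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "be", "in", "is"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercased form is on the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["flights", "to", "London"]))
# ['flights', 'London']
```

The example also shows the risk described above: the query “flights to London” loses its relational word “to” once stopped.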
Normalization
◼ Need to “normalize” terms in indexed text as
well as query terms into the same form
◼ We want to match U.S.A. and USA
◼ We most commonly implicitly define
equivalence classes of terms
◼ e.g., by deleting periods in a term
◼ Alternative is to do asymmetric expansion:
◼ Enter: window Search: window, windows
◼ Enter: windows Search: Windows, windows, window
◼ Enter: Windows Search: Windows
◼ Potentially more powerful, but less efficient
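The equivalence-class approach mentioned above (deleting periods in a term) reduces to a small normalization function applied to both indexed terms and query terms; this is a minimal sketch of that idea.

```python
def normalize(term):
    """Map a term to its equivalence-class representative:
    delete periods and lowercase, so U.S.A. and USA match."""
    return term.replace(".", "").lower()

print(normalize("U.S.A."), normalize("USA"))
# usa usa
```

Because the same function runs at index time and at query time, both forms land in the same equivalence class.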
Normalization: other languages
◼ Accents: résumé vs. resume.
◼ Most important criterion:
◼ How are your users likely to write their queries
for these words?
Lemmatization
Lemmatization considers the context and converts the word to its meaningful base form,
which is called the lemma
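In the simplest case, lemmatization is a dictionary lookup from an inflected form to its lemma. The toy table below is purely illustrative; real lemmatizers combine part-of-speech tags, morphological rules, and a lexicon such as WordNet.

```python
# Toy lemma dictionary -- illustrative only, not a real lexicon.
LEMMAS = {"was": "be", "better": "good", "mice": "mouse", "fishes": "fish"}

def lemmatize(word):
    """Look up the lemma; fall back to the lowercased word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Mice", "was", "better"]])
# ['mouse', 'be', 'good']
```

Unlike stemming, the output is always a real word (“mouse”, “be”, “good”), because the mapping targets dictionary base forms rather than chopped-off stems.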
Stemming
Stemming: Grouping words that are derived from a common stem
●e.g., “fish”, “fishes”, “fishing” could be mapped to “fish”
●Generally produces small improvements in task effectiveness
●Similar to stopping, stemming can be done aggressively, conservatively, or not
at all
○ Aggressively: consider “fish” and “fishing” the same
○ Conservatively: just identifying plural forms by stripping the letter “s”
■ issue: ‘Centuries’ → ‘Centurie’
○ Not at all: Consider all the word variants
●In different languages, stemming can have different importance for
effectiveness:
○ In Arabic, morphology is more complicated than English
○ In Chinese, stemming is not effective
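The conservative plural-stripping strategy, and the ‘Centuries’ → ‘Centurie’ failure it causes, can be sketched directly. Both functions below are toy illustrations, not the Porter stemmer or any published algorithm.

```python
def naive_stem(word):
    """Conservative stemming: drop a trailing 's' only
    (but leave '-ss' words like 'glass' alone)."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def better_stem(word):
    """Patch the '-ies' failure mode: centuries -> century."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    return naive_stem(word)

print(naive_stem("centuries"))   # 'centurie' -- the issue noted above
print(better_stem("centuries"))  # 'century'
```

Each special case like “-ies” is a language-specific rule, which is why the next slide stresses that stemming is language dependent.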
Stemming
◼ Reduce terms to their “roots” before
indexing
◼ “Stemming” suggests crude affix chopping
◼ language dependent
◼ e.g., automate(s), automatic, automation all
reduced to automat.
Phrases
In a task such as information retrieval, input queries can be 2-3
word phrases
● Phrases can yield more precise queries
○ “University of Southern Maine”, “black sea”
● Less ambiguous
○ “Red apple” vs. “apple”
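Extracting candidate phrases from a token stream is often done with contiguous n-grams; this is a minimal sketch of that idea, not a full phrase-indexing pipeline.

```python
def ngrams(tokens, n):
    """All contiguous n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["University", "of", "Southern", "Maine"], 2))
# ['University of', 'of Southern', 'Southern Maine']
```

Indexing 2- and 3-grams alongside single words is one way to support the precise, less ambiguous phrase queries described above.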
THANK YOU