NLP m2
What is Tokenization
Tokenization is the process of converting text or
data into smaller units, called tokens. These
tokens can be words, subwords, characters, or
even sentences, depending on the level of
granularity chosen for tokenization. The idea is to
break down complex data (such as natural
language text) into manageable, discrete pieces
that can be more easily processed, analyzed, or
understood by computers, especially in fields like
natural language processing (NLP) and machine
learning.
Tokenization in NLP
Hassan Khoury ©
Ms. Vandana Soni
In the context of Natural Language Processing (NLP):
● Word Tokenization: The process of splitting a sentence into
words. For example, the sentence "I love pizza" would be
tokenized into the words ["I", "love", "pizza"].
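● Subword Tokenization: Sometimes words are further
broken down into smaller units, such as prefixes, suffixes, or
even parts of words (e.g., breaking "unhappiness" into ["un",
"happi", "ness"]). This is common in models like BERT (which
uses WordPiece) or GPT (which uses byte-pair encoding). The
pieces are learned from data, so the split a real model
produces may differ from this illustration.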
● Character Tokenization: In some cases, individual
characters are treated as tokens. For instance, "hello" could
be tokenized as ["h", "e", "l", "l", "o"].
● Sentence Tokenization: Breaking a document or passage of
text into sentences. For example, "Hello! How are you?"
would be tokenized into the sentences ["Hello!", "How are
you?"].
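The four levels of tokenization above can be sketched with the Python standard library alone. This is a minimal illustration, not how production tokenizers work: the regular expressions are naive, and `TOY_VOCAB` is a hypothetical three-piece vocabulary chosen only so that the "unhappiness" example from the slide comes out the same way. The subword function uses greedy longest-match, which is one simple strategy and is assumed here for illustration.

```python
import re

def word_tokenize(text):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

def char_tokenize(text):
    # Every character (including spaces) becomes its own token.
    return list(text)

def sentence_tokenize(text):
    # Split on whitespace that follows ., ! or ?.
    # Naive: abbreviations such as "Dr." would be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def subword_tokenize(word, vocab):
    # Greedy longest-match-first over a given vocabulary (an assumption,
    # not the exact algorithm of any particular model).
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece matches: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = {"un", "happi", "ness"}

print(word_tokenize("I love pizza"))                # ['I', 'love', 'pizza']
print(subword_tokenize("unhappiness", TOY_VOCAB))   # ['un', 'happi', 'ness']
print(char_tokenize("hello"))                       # ['h', 'e', 'l', 'l', 'o']
print(sentence_tokenize("Hello! How are you?"))     # ['Hello!', 'How are you?']
```

In practice the subword vocabulary is learned from a large corpus rather than written by hand.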
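Sentence segmentation, in NLP, is the process of dividing a text into individual sentences. In early-literacy teaching, the same term
is used for breaking a sentence down into individual words, which corresponds to word tokenization. Here are some classroom examples
of that word-level sense: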
● Counting words: Students can count the words in a sentence and use their fingers to map the words.
● Using counters: Students can use counters to represent each word in a sentence.
● Using clothespins: Students can count the words in a sentence and clip a clothespin on the correct number.
● Building sentences: Students can use word cards to build sentences.
● Sentence segmentation games: Students can use a game board and marker to count the words in a
sentence and move their marker that many spaces.
Example: "hyphenation" (without the line break)
2. Hyphenated Compound Words:
○ Compound words that use hyphens to combine two or more words into one, such as "high-quality" or "well-known".
○ Example: "He is a well-known scientist."
○ Here "well-known" should be recognized as a single token rather than two separate tokens ("well" and "known").
3. Hyphenation in Foreign Words:
○ Some foreign words or names may use hyphens as part of their structure, and these should not be split or misunderstood by
NLP models.
○ Example: "Franco-British" or "co-op".
4. Hyphens as Separators in Lists or Numbers:
○ Hyphens might appear in numeric expressions, phone numbers, or lists.
○ Example: "A 10-15% increase in profits."
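The hyphen cases above can be handled, at least roughly, with two small regular-expression helpers. This is a sketch under stated assumptions: the function names and `TOKEN_PATTERN` are hypothetical, a hyphen immediately followed by a newline is assumed to be a line-break hyphen, and any hyphen between word characters is assumed to belong to a single token.

```python
import re

def join_line_break_hyphens(text):
    # Assumption: "-" directly followed by a newline marks a word that was
    # split across lines, so both halves are glued back together.
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)

# A token is either a run of word characters with optional internal hyphens
# (and an optional trailing %), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*%?|[^\w\s]")

def tokenize_keep_hyphens(text):
    # Hyphenated compounds and numeric ranges stay in one piece.
    return TOKEN_PATTERN.findall(text)

print(join_line_break_hyphens("hyphen-\nation"))
# hyphenation
print(tokenize_keep_hyphens("He is a well-known scientist."))
# ['He', 'is', 'a', 'well-known', 'scientist', '.']
print(tokenize_keep_hyphens("A 10-15% increase in profits."))
# ['A', '10-15%', 'increase', 'in', 'profits', '.']
```

One limitation of the first helper: a genuine compound that happens to break at its hyphen (for example "well-" at the end of a line followed by "known") would be joined into "wellknown", so a dictionary check is usually needed to tell the two cases apart.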
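Stemming is a text normalization technique used in Natural
Language Processing (NLP) that reduces words to a root or
base form, usually by stripping suffixes with simple rules. The
purpose of stemming is to treat different forms of a word
(e.g., "running", "runs") as a single entity, making it easier to
process and analyze the underlying concepts. The resulting
stem does not have to be a dictionary word, and irregular
forms such as "ran" are generally not reduced by suffix rules.
For example:
● "running" → "run"
● "happiness" → "happi" (the result of the Porter stemmer)
● "better" → "good" is not stemming: mapping an irregular
form to its dictionary form requires lemmatization.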
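To make the idea of suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy written for this example, not the Porter algorithm or any library's implementation; the suffix list and the name `toy_stem` are assumptions. Note that it cuts "happiness" to "happi" rather than "happy", and that it leaves "better" and "ran" untouched, because plain suffix rules know nothing about irregular forms.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ness", "ing", "ly", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            # Leave words such as "class" alone.
            continue
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "happiness", "jumped", "cats", "better", "ran"]:
    print(w, "->", toy_stem(w))
# running -> run
# happiness -> happi
# jumped -> jump
# cats -> cat
# better -> better
# ran -> ran
```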
Lemmatization
Lemmatization is a text
pre-processing technique that
reduces words to their
dictionary form, or lemma, to
make them easier to analyze.
Lemmatization Techniques
1. Rule-Based Lemmatization
2. Dictionary-Based
Lemmatization
3. Machine Learning-Based
Lemmatization
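The rule-based and dictionary-based techniques can be combined in a few lines: look the word up in a dictionary of known forms first, and fall back to suffix rules when it is missing. The sketch below is illustrative only. `IRREGULAR_LEMMAS` and `SUFFIX_RULES` are small hypothetical tables invented for this example, and the machine learning-based approach is not shown.

```python
# Dictionary-based part: hypothetical lookup table of irregular forms.
IRREGULAR_LEMMAS = {
    "better": "good",
    "ran": "run",
    "mice": "mouse",
    "was": "be",
}

# Rule-based part: (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def rule_based_lemma(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only apply a rule when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def lemmatize(word):
    word = word.lower()
    # The dictionary lookup takes priority over the rules.
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    return rule_based_lemma(word)

for w in ["better", "ran", "studies", "walked", "cats"]:
    print(w, "->", lemmatize(w))
# better -> good
# ran -> run
# studies -> study
# walked -> walk
# cats -> cat
```

The rules here are crude (for instance, they would turn "running" into "runn"), which is one reason practical lemmatizers rely on large dictionaries and part-of-speech information.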