NLP-1 (Tokenization)

Tokenization

Tokenization is a key task in Natural Language Processing (NLP) that involves breaking up a sequence
of text into individual words or tokens. In NLP, a token is a sequence of characters that represents a
single unit of meaning. Tokenization is a fundamental step in many NLP tasks such as text classification,
sentiment analysis, and machine translation.

There are several techniques used for tokenization, including rule-based methods and statistical
methods. Rule-based methods involve defining a set of rules to split the text into tokens based on
punctuation marks, spaces, and other delimiters. Statistical methods, on the other hand, use machine
learning algorithms to learn patterns in the text and determine the most appropriate way to tokenize
it.
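
As an illustration of the rule-based approach, here is a minimal sketch using Python's built-in re
module (the rule set and the example sentence are illustrative assumptions, not a complete tokenizer):

Python code

import re

text = "Hello, world! Tokenization isn't trivial."

# Rule: a token is either a word (optionally with an internal apostrophe) or a single punctuation mark
pattern = r"\w+(?:'\w+)?|[^\w\s]"
tokens = re.findall(pattern, text)
print(tokens)
# ['Hello', ',', 'world', '!', 'Tokenization', "isn't", 'trivial', '.']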

Tokenization can also involve additional tasks such as stemming and lemmatization. Stemming
involves reducing words to their base or root form, while lemmatization involves reducing words to
their base form based on their part of speech. These tasks can help reduce the number of tokens in a
text and improve the accuracy of NLP models.
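
As a brief illustration of the difference, here is a minimal sketch using NLTK's PorterStemmer and
WordNetLemmatizer (the example words are arbitrary, and the WordNet data may need to be downloaded
before the lemmatizer can be used):

Python code

from nltk.stem import PorterStemmer, WordNetLemmatizer
# May require: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                    # studi  (crude suffix stripping)
print(lemmatizer.lemmatize("studies"))            # study  (dictionary lookup, defaults to noun)
print(lemmatizer.lemmatize("running", pos="v"))   # run    (uses the supplied part of speech)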

Overall, tokenization is a critical task in NLP as it forms the foundation for many other tasks in the
field. It allows NLP models to understand the structure and meaning of natural language text, enabling
them to perform various tasks such as text classification, sentiment analysis, and machine translation.

Reasons behind tokenization in NLP

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can
be individual words, punctuation marks, or other meaningful elements of a language. Tokenization is a
fundamental step in natural language processing (NLP) and is essential for many NLP tasks, such as
text classification, sentiment analysis, and language translation.

Here are some of the main reasons why tokenization is necessary in NLP:

1. Text Preprocessing: Tokenization is an important step in text preprocessing, which involves
cleaning and preparing text data for analysis. By breaking text into tokens, it becomes easier
to perform various preprocessing tasks, such as removing stop words, stemming, and
lemmatization.
2. Text Representation: Tokens provide a way to represent text in a numerical format that can
be processed by machine learning algorithms. By assigning a numerical value to each token,
we can create a numerical representation of a text document, which can be used for text
classification, clustering, and other NLP tasks.

3. Vocabulary Management: Tokenization helps manage the vocabulary of a language by
creating a finite set of unique tokens that represent all the possible words and symbols in a
language. This is important for tasks like language modeling, where a model needs to predict
the probability of the next word in a sentence.

4. Language Translation: Tokenization is essential for language translation, where text is
translated from one language to another. By breaking text into tokens, we can align words
between two languages, making it easier to create accurate translations.

Overall, tokenization is a crucial step in NLP, enabling us to process, analyze, and understand text data
efficiently.
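
To make the representation and vocabulary points above concrete, here is a minimal sketch (plain
Python, with an illustrative toy sentence) that assigns each unique token an integer id:

Python code

text = "the cat sat on the mat"
tokens = text.split()

# Build a vocabulary: each unique token gets the next available integer id
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

ids = [vocab[token] for token in tokens]
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]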

What is a token, and how is it created?

In the context of natural language processing, a token refers to a sequence of characters that
represents a unit of meaning. Typically, a token is a word or a punctuation mark, but it can also be a
phrase or a combination of words that convey a particular meaning.

Tokenization is the process of breaking down a text into individual tokens. The specific rules for
tokenization can vary depending on the task and the language being processed, but some common
techniques include:

1. White-space Tokenization: This involves splitting a text into tokens based on white spaces (i.e.,
spaces, tabs, and line breaks). This method is commonly used for English text.

2. Punctuation-based Tokenization: This method involves splitting a text into tokens based on
punctuation marks. For example, a period or a comma may be used to separate tokens.

3. Rule-based Tokenization: This method involves applying a set of rules to break down a text
into tokens. For example, a rule-based tokenizer may split a text into tokens based on
capitalization, hyphenation, or other linguistic features.

4. Machine Learning-based Tokenization: This involves training a machine learning model to
automatically split a text into tokens. This method is becoming increasingly popular, especially
for languages with complex grammatical structures.

Once the tokens have been created, they can be used to perform various NLP tasks, such as text
classification, sentiment analysis, and language translation.
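
The following sketch contrasts the first two techniques on a small example (standard-library Python
only; the sentence is an arbitrary illustration):

Python code

import re

text = "Hello, world. This is NLP!"

# 1. White-space tokenization: punctuation stays attached to the neighbouring word
print(text.split())
# ['Hello,', 'world.', 'This', 'is', 'NLP!']

# 2. Punctuation-based tokenization: punctuation marks become separate tokens
print(re.findall(r"\w+|[^\w\s]", text))
# ['Hello', ',', 'world', '.', 'This', 'is', 'NLP', '!']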

Real-world example of how tokenization with a token vault works

Tokenization with a token vault is a method for securing sensitive information by replacing it with a
token, or a unique identifier, that can be used in place of the original data. This can be useful in
situations where sensitive data, such as credit card numbers or personal identification information,
needs to be stored or transmitted securely.

Here's a real-world example of how tokenization with a token vault works in the context of a payment
transaction:

1. A customer makes a purchase using a credit card at a retail store.

2. The store's point-of-sale system captures the credit card number, expiration date, and CVV
code.

3. Instead of storing the credit card information in its raw form, the point-of-sale system sends
the data to a token vault for tokenization.

4. The token vault replaces the credit card number with a unique token, which is then sent back
to the point-of-sale system.

5. The point-of-sale system stores the token in its database instead of the credit card number.

6. When the customer makes another purchase at the store, the point-of-sale system uses the
token to retrieve the credit card information from the token vault for processing.

7. The token vault returns the original credit card information to the point-of-sale system, which
processes the transaction as usual.

By using tokenization with a token vault, the sensitive credit card information is protected because
the token is essentially useless to anyone who doesn't have access to the token vault. Even if the token
is intercepted during transmission, it cannot be used to retrieve the original credit card information
without access to the token vault.
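
The flow above can be sketched as a toy token vault in Python. This is a purely hypothetical
illustration (the TokenVault class, its methods, and the sample card number are invented for the
example; real vaults use hardened storage, access controls, and certified token generation):

Python code

import secrets

class TokenVault:
    """Toy token vault: maps random surrogate tokens back to the original sensitive values."""

    def __init__(self):
        self._vault = {}  # token -> original value, held only inside the vault

    def tokenize(self, card_number: str) -> str:
        token = secrets.token_hex(8)   # random surrogate with no mathematical link to the data
        self._vault[token] = card_number
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]      # only possible with access to the vault

vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(token)                    # e.g. 'a3f1...' - useless to anyone without the vault
print(vault.detokenize(token))  # the original card number, recovered inside the vault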

Word Tokenization

Word tokenization is the process of breaking down a text into individual words, also known as word-
level tokens. This is a fundamental step in many natural language processing (NLP) tasks, such as
language modeling, sentiment analysis, and machine translation.

Word tokenization typically involves identifying word boundaries in a text. In English, words are often
separated by spaces, but this is not always the case. For example, in some languages, such as Chinese
or Japanese, there are no spaces between words. In these cases, other techniques, such as character-
level tokenization, may be used.

There are several approaches to word tokenization, including rule-based methods and statistical
methods. Rule-based methods use pre-defined rules to identify word boundaries, such as splitting text
on whitespace or punctuation. Statistical methods, such as machine learning algorithms, learn to
identify word boundaries based on patterns in a large corpus of text.

Here's an example of word tokenization in Python using the NLTK library:

Python code

import nltk

text = "Tokenization is the process of breaking down a text into individual words."

tokens = nltk.word_tokenize(text)  # may require nltk.download('punkt') the first time
print(tokens)

Output:

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'a', 'text', 'into', 'individual', 'words', '.']

In this example, the word_tokenize() function from the NLTK library is used to tokenize the text into
individual words. The resulting list of tokens includes each word and punctuation mark in the original
text.

Drawback of Word Tokenization

While word tokenization is a crucial step in many NLP tasks, it also has some drawbacks that can affect
the accuracy of downstream analysis. Here are some common drawbacks of word tokenization:

1. Ambiguity: Word tokenization can be ambiguous, particularly when dealing with languages
that have complex grammar and syntax. For example, in English, the word "run" can be a noun
or a verb, depending on the context in which it is used. This can lead to ambiguity in the
meaning of the text, which can affect the accuracy of NLP models.

2. Out-of-vocabulary words: Word tokenization assumes that all words in a text are known and
can be represented as tokens. However, there are often words that are not included in the
tokenization vocabulary, such as misspelled words or domain-specific jargon. These words
may be treated as out-of-vocabulary words, which can affect the accuracy of downstream
analysis.

3. Token size: Word tokenization typically splits text into individual words, but this can be
problematic for languages that have complex word structures, such as compounds or inflected
forms. For example, in German, the word "Kindergarten" (meaning "nursery school") is a
compound word made up of two smaller words, "Kind" (meaning "child") and "Garten"
(meaning "garden"). If the word is split into individual tokens, it may not be recognized as a
compound word, which can affect the accuracy of analysis.

4. Contextual information: Word tokenization does not take into account contextual
information, such as the relationship between words in a sentence or the grammatical role of
a word. This can affect the accuracy of downstream analysis, particularly in tasks such as
named entity recognition or part-of-speech tagging.

Overall, while word tokenization is a useful technique for breaking down text into individual units, it
has limitations that can affect the accuracy of downstream analysis. It is important to consider these
limitations and to use additional techniques, such as contextual modeling or character-level
tokenization, to improve the accuracy of NLP models.

Character Tokenization

Character tokenization is an alternative approach to word tokenization, where the text is broken down
into individual characters, rather than words. In character tokenization, each character in the text is
treated as a separate token.

Character tokenization has some advantages over word tokenization:

1. Handling out-of-vocabulary words: Since character tokenization breaks down text into
individual characters, it can handle out-of-vocabulary words more effectively than word
tokenization. This is because even if a word is not in the tokenization vocabulary, its individual
characters can still be represented as tokens.

2. Improved language modeling: Character tokenization can be useful for language modeling
tasks, where the goal is to predict the next character or sequence of characters in a text. By
treating each character as a separate token, character-level models can capture more fine-
grained patterns in the text, such as spelling mistakes or word structure.

3. Multilingual support: Character tokenization is language-independent and can be used to
tokenize text in any language, including languages that do not have spaces between words,
such as Chinese or Japanese.

Here's an example of character tokenization in Python:

Python code

text = "Tokenization is the process of breaking down a text into individual characters."

tokens = list(text)
print(tokens)

Output:

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', ' ', 'o', 'f', ' ',
'b', 'r', 'e', 'a', 'k', 'i', 'n', 'g', ' ', 'd', 'o', 'w', 'n', ' ', 'a', ' ', 't', 'e', 'x', 't', ' ', 'i', 'n', 't', 'o', ' ', 'i', 'n', 'd', 'i', 'v',
'i', 'd', 'u', 'a', 'l', ' ', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.']

In this example, the text variable is tokenized into individual characters using the Python list()
function. Each character in the text is represented as a separate token in the resulting list.

Drawbacks of character tokenization

While character tokenization has some advantages over word tokenization, it also has some
drawbacks that should be considered:

1. Increased sequence length: Since character tokenization breaks down text into individual
characters, the resulting sequence of tokens can be much longer than a sequence produced
by word tokenization. This can make training and inference more computationally expensive.

2. Reduced interpretability: Character tokenization can make it more difficult to interpret the
meaning of a text, since individual characters may not carry as much semantic information as
words. This can make it harder to extract meaningful insights from the text.

3. Limited ability to capture word-level patterns: Character tokenization does not capture word-
level patterns or relationships between words, which can be important for many NLP tasks.
For example, named entity recognition or sentiment analysis may rely on the presence or
absence of specific words in a text, which may not be captured by character tokenization.

4. Vocabulary size: Character tokenization can result in a very large vocabulary size, particularly
for languages with complex writing systems, such as Chinese or Japanese. This can make it
more difficult to train models with limited computational resources.

Overall, while character tokenization can be useful in certain contexts, it is not always the best choice
for NLP tasks. The choice between character and word tokenization should depend on the specific
requirements of the task, as well as the resources available for training and inference.
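
A quick sketch makes the sequence-length drawback (point 1 above) concrete by tokenizing the same
sentence both ways:

Python code

text = "Tokenization is the process of breaking down a text into individual characters."

word_tokens = text.split()
char_tokens = list(text)

print(len(word_tokens))  # 12 word-level tokens
print(len(char_tokens))  # 79 character-level tokens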

Need of Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down
a piece of text into smaller units called tokens. Tokens are typically words, but they can also be
phrases, characters, or other meaningful units of text. Tokenization is important for several reasons:

1. Text preprocessing: Tokenization is often the first step in text preprocessing, which is the
process of transforming raw text into a format that can be easily analyzed by NLP algorithms.
By breaking down text into tokens, tokenization makes it easier to perform tasks such as part-
of-speech tagging, named entity recognition, sentiment analysis, and text classification.

2. Vocabulary creation: Tokenization is also important for creating the vocabulary that is used to
represent text in NLP models. Each token is typically assigned a unique integer or vector
representation, which is used to represent the token in the model's input or output layer. The
vocabulary size can be very large, particularly for languages with a large number of words or
a complex writing system, such as Chinese or Japanese.

3. Feature extraction: Tokens can be used as features in machine learning models, which can
help to improve the performance of the models. For example, a bag-of-words model can
represent a text document as a vector of token frequencies, which can be used to train a
classifier or regression model.
4. Text normalization: Tokenization can also be used for text normalization, which involves
transforming text into a canonical or standardized form. For example, tokenization can be
used to convert all words to lowercase or remove punctuation from a text.

Overall, tokenization is an essential step in NLP that enables text to be processed, analyzed, and
represented in a way that can be used by machine learning algorithms.
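
As a small sketch of the feature-extraction point above, a bag-of-words representation can be built
directly from token counts (plain Python with an illustrative two-document corpus):

Python code

from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize each document and count token frequencies (a bag-of-words representation)
bags = [Counter(doc.split()) for doc in documents]

print(bags[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(bags[1])  # Counter({'the': 2, 'dog': 1, 'sat': 1, 'on': 1, 'log': 1})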

Example of sentence tokenization

Input: "Natural Language Processing is an interesting field. It involves the application of computational
techniques to process and analyze natural language data."

Output:

1. "Natural Language Processing is an interesting field."

2. "It involves the application of computational techniques to process and analyze natural
language data."

In this example, the input text contains two sentences, which are separated by a period followed by a
space. Sentence tokenization involves identifying these sentence boundaries and separating the input
text into individual sentences. The output shows the two sentences in the input text, with each
sentence on a separate line.

import nltk

# Define input text

input_text = "Natural Language Processing is an interesting field. It involves the application of


computational techniques to process and analyze natural language data."

# Perform sentence tokenization

sentences = nltk.sent_tokenize(input_text)

# Print output

for sentence in sentences:
    print(sentence)

The sent_tokenize function from the nltk library is used to perform sentence tokenization on the input
text. It returns a list of sentences, which are then printed using a for loop.

The output of this code will be:

Natural Language Processing is an interesting field.

It involves the application of computational techniques to process and analyze natural language data.

What are the benefits of Tokenization?

1. Text Preprocessing: Tokenization is a fundamental step in text preprocessing for NLP tasks. It
breaks down the raw text into smaller, more manageable units that can be easily processed
by downstream algorithms, such as part-of-speech tagging, named entity recognition, and
sentiment analysis.
2. Feature Extraction: Tokens can be used as features in machine learning models. They can help
improve the performance of the models by capturing important information about the text.
For example, in a bag-of-words model, the frequency of each token can be used as a feature.

3. Vocabulary Creation: Tokenization is used to create the vocabulary that is used to represent
text in NLP models. Each token is typically assigned a unique integer or vector representation,
which is used to represent the token in the model's input or output layer.

4. Text Normalization: Tokenization can be used for text normalization, which involves
transforming text into a standardized form. For example, all tokens can be converted to
lowercase, or punctuation can be removed.

5. Efficient Storage and Retrieval: By converting text into a sequence of tokens, the amount of
data that needs to be stored and processed can be reduced. This can lead to more efficient
storage and retrieval of text data, which is important for many NLP applications.

Overall, tokenization is a crucial step in many NLP tasks. It helps to simplify and standardize the text
data, making it easier to process, analyze, and represent in a way that can be used by machine learning
algorithms.
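
As a sketch of the normalization benefit (point 4 above), tokens can be lower-cased and stripped of
punctuation in a single pass (the regular expression and the example sentence are illustrative
assumptions):

Python code

import re

text = "Tokenization, IMPORTANTLY, simplifies NLP pipelines!"

# Lowercase the text and keep only word tokens; punctuation is dropped entirely
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['tokenization', 'importantly', 'simplifies', 'nlp', 'pipelines']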

Tokenization Challenges in NLP

While tokenization is an important step in natural language processing (NLP), there are several
challenges that need to be addressed:

1. Ambiguity: Some words can have multiple meanings, which can lead to ambiguity in
tokenization. For example, the word "bank" can refer to a financial institution or the edge of
a river. In such cases, context and part-of-speech information may be used to disambiguate
the token.

2. Domain-specific terms: Tokenizing domain-specific terms, such as medical terminology or
legal jargon, can be challenging. These terms may not appear in standard language
dictionaries and may require additional preprocessing or custom dictionaries to tokenize
correctly.

3. Compound words: Some languages, such as German, have long compound words that can be
difficult to tokenize. For example, the German word
"Donaudampfschifffahrtsgesellschaftskapitän" (Danube steamship company captain) is a
single word that can be challenging to split into individual tokens.

4. Informal language: Tokenizing informal language, such as social media posts or text messages,
can be difficult because it often contains non-standard grammar, spelling, and punctuation.

5. Multilingual text: Tokenizing multilingual text can be challenging because different languages
may have different tokenization rules and writing systems.

6. Tokenization errors: Tokenization algorithms are not always perfect and can sometimes make
errors, such as splitting a word into two tokens or combining two words into a single token.

Overall, tokenization is a crucial step in NLP, but it can be challenging to perform accurately,
particularly in cases where the text contains domain-specific terms, compound words, or informal
language. Addressing these challenges requires careful consideration of the text data and the specific
requirements of the NLP task.
Types of Tokens

In natural language processing (NLP), there are several types of tokens that can be created depending
on the tokenization technique used. Here are some of the most common types:

1. Word tokens: These are tokens created by word tokenization. Each word in the text is treated
as a separate token.

2. Character tokens: These are tokens created by character tokenization. Each character in the
text is treated as a separate token.

3. Subword tokens: These are tokens created by subword tokenization, which involves breaking
words down into smaller units called subwords. Subword tokenization is often used for
languages with complex morphology or for handling out-of-vocabulary words.

4. Byte pair encoded (BPE) tokens: These are a type of subword token created by a specific
subword tokenization algorithm called byte pair encoding. BPE works by iteratively merging
the most frequent pair of adjacent symbols, starting from individual characters, until a
predetermined vocabulary size is reached (see the sketch at the end of this section).

5. Part-of-speech (POS) tags: These are tokens that represent the part of speech of a word in the
text, such as noun, verb, adjective, or adverb. POS tags are often created as a post-processing
step after word tokenization.

6. Named entity tokens: These are tokens that represent named entities in the text, such as
person names, organization names, or location names. Named entity tokens are often created
as a post-processing step after word tokenization.

Overall, the type of token created depends on the tokenization technique used and the specific
requirements of the NLP task. Word tokens are the most common type, but subword and character
tokens are becoming more widely used, especially for languages with complex morphology or out-of-
vocabulary words.
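
To make the byte pair encoding idea (point 4 above) concrete, here is a minimal sketch of the merge
loop, modelled on the commonly cited toy example from the original BPE paper (the corpus, frequencies,
and number of merges are illustrative; production tokenizers add many refinements):

Python code

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent pair of symbols occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated characters, weighted by corpus frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(3):  # learn three merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, vocab)
# Merges learned on this toy corpus: ('e', 's'), then ('es', 't'), then ('l', 'o')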

What is tokenization?
Tokenization is the process of replacing sensitive data with unique identification symbols that retain
all the essential information about the data without compromising its security.

Examples of tokenization

Tokenization technology can, in theory, be used with sensitive data of all kinds, including bank
transactions, medical records, criminal records, vehicle driver information, loan applications, stock
trading and voter registration. For the most part, any system in which surrogate, nonsensitive
information can act as a stand-in for sensitive information can benefit from tokenization.

Tokenization is often used to protect credit card data, bank account information and other sensitive
data handled by payment processors. Payment processing use cases that tokenize sensitive credit
card information include the following:

• mobile wallets, such as Google Pay and Apple Pay;

• e-commerce sites; and

• businesses that keep customers' cards on file.

How tokenization works

Tokenization substitutes sensitive information with equivalent nonsensitive information. The
nonsensitive replacement information is called a token.

Tokens can be created in the following ways:

• using a mathematically reversible cryptographic function with a key;

• using a nonreversible function, such as a hash function; or

• using an index function or randomly generated number.

As a result, the token becomes the exposed information, and the sensitive information that the
token stands in for is stored safely in a centralized server known as a token vault. The token vault is
the only place where the original information can be mapped back to its corresponding token.

Here is one real-world example of how tokenization with a token vault works.

• A customer provides their payment details at a point-of-sale (POS) system or online checkout
form.

• The details, or data, are substituted with a randomly generated token, which is generated in
most cases by the merchant's payment gateway.

• The tokenized information is then encrypted and sent to a payment processor. The original
sensitive payment information is stored in a token vault in the merchant's payment gateway.
This is the only place where the token can be mapped to the information it represents.

• The tokenized information is encrypted again by the payment processor before being sent for
final verification.

On the other hand, some tokenization is vaultless. Instead of storing the sensitive information in a
secure database, vaultless tokens are stored using an algorithm. If the token is reversible, then the
original sensitive information is generally not stored in a vault.
