NLP-1 (Tokenization)

Tokenization

Tokenization is a key task in Natural Language Processing (NLP) that involves breaking up a sequence
of text into individual words or tokens. In NLP, a token is a sequence of characters that represents a
single unit of meaning. Tokenization is a fundamental step in many NLP tasks such as text classification,
sentiment analysis, and machine translation.

There are several techniques used for tokenization, including rule-based methods and statistical
methods. Rule-based methods involve defining a set of rules to split the text into tokens based on
punctuation marks, spaces, and other delimiters. Statistical methods, on the other hand, use machine
learning algorithms to learn patterns in the text and determine the most appropriate way to tokenize
it.
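
As an illustration of the rule-based approach, here is a minimal sketch using Python's built-in re
module (the rule set and the example sentence are illustrative assumptions, not a complete tokenizer):

Python code

import re

text = "Hello, world! Tokenization isn't trivial."

# Rule: a token is either a word (optionally with an internal apostrophe) or a single punctuation mark
pattern = r"\w+(?:'\w+)?|[^\w\s]"
tokens = re.findall(pattern, text)
print(tokens)
# ['Hello', ',', 'world', '!', 'Tokenization', "isn't", 'trivial', '.']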

Tokenization can also involve additional tasks such as stemming and lemmatization. Stemming
involves reducing words to their base or root form, while lemmatization involves reducing words to
their base form based on their part of speech. These tasks can help reduce the number of tokens in a
text and improve the accuracy of NLP models.
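
As a brief illustration of the difference, here is a minimal sketch using NLTK's PorterStemmer and
WordNetLemmatizer (the example words are arbitrary, and the WordNet data may need to be downloaded
before the lemmatizer can be used):

Python code

from nltk.stem import PorterStemmer, WordNetLemmatizer
# May require: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                    # studi  (crude suffix stripping)
print(lemmatizer.lemmatize("studies"))            # study  (dictionary lookup, defaults to noun)
print(lemmatizer.lemmatize("running", pos="v"))   # run    (uses the supplied part of speech)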

Overall, tokenization is a critical task in NLP as it forms the foundation for many other tasks in the
field. It allows NLP models to understand the structure and meaning of natural language text, enabling
them to perform various tasks such as text classification, sentiment analysis, and machine translation.

Reasons behind tokenization in NLP

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can
be individual words, punctuation marks, or other meaningful elements of a language. Tokenization is a
fundamental step in natural language processing (NLP) and is essential for many NLP tasks, such as
text classification, sentiment analysis, and language translation.

Here are some of the main reasons why tokenization is necessary in NLP:

1. Text Preprocessing: Tokenization is an important step in text preprocessing, which involves
cleaning and preparing text data for analysis. By breaking text into tokens, it becomes easier
to perform various preprocessing tasks, such as removing stop words, stemming, and
lemmatization.
2. Text Representation: Tokens provide a way to represent text in a numerical format that can
be processed by machine learning algorithms. By assigning a numerical value to each token,
we can create a numerical representation of a text document, which can be used for text
classification, clustering, and other NLP tasks.

3. Vocabulary Management: Tokenization helps manage the vocabulary of a language by
creating a finite set of unique tokens that represent all the possible words and symbols in a
language. This is important for tasks like language modeling, where a model needs to predict
the probability of the next word in a sentence.

4. Language Translation: Tokenization is essential for language translation, where text is
translated from one language to another. By breaking text into tokens, we can align words
between two languages, making it easier to create accurate translations.

Overall, tokenization is a crucial step in NLP, enabling us to process, analyze, and understand text data
efficiently.
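
To make the representation and vocabulary points above concrete, here is a minimal sketch (plain
Python, with an illustrative toy sentence) that assigns each unique token an integer id:

Python code

text = "the cat sat on the mat"
tokens = text.split()

# Build a vocabulary: each unique token gets the next available integer id
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

ids = [vocab[token] for token in tokens]
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]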

What is a token, and how is it created?

In the context of natural language processing, a token refers to a sequence of characters that
represents a unit of meaning. Typically, a token is a word or a punctuation mark, but it can also be a
phrase or a combination of words that convey a particular meaning.

Tokenization is the process of breaking down a text into individual tokens. The specific rules for
tokenization can vary depending on the task and the language being processed, but some common
techniques include:

1. White-space Tokenization: This involves splitting a text into tokens based on white spaces (i.e.,
spaces, tabs, and line breaks). This method is commonly used for English text.

2. Punctuation-based Tokenization: This method involves splitting a text into tokens based on
punctuation marks. For example, a period or a comma may be used to separate tokens.

3. Rule-based Tokenization: This method involves applying a set of rules to break down a text
into tokens. For example, a rule-based tokenizer may split a text into tokens based on
capitalization, hyphenation, or other linguistic features.

4. Machine Learning-based Tokenization: This involves training a machine learning model to
automatically split a text into tokens. This method is becoming increasingly popular, especially
for languages with complex grammatical structures.

Once the tokens have been created, they can be used to perform various NLP tasks, such as text
classification, sentiment analysis, and language translation.
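
The following sketch contrasts the first two techniques on a small example (standard-library Python
only; the sentence is an arbitrary illustration):

Python code

import re

text = "Hello, world. This is NLP!"

# 1. White-space tokenization: punctuation stays attached to the neighbouring word
print(text.split())
# ['Hello,', 'world.', 'This', 'is', 'NLP!']

# 2. Punctuation-based tokenization: punctuation marks become separate tokens
print(re.findall(r"\w+|[^\w\s]", text))
# ['Hello', ',', 'world', '.', 'This', 'is', 'NLP', '!']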

Real-world example of how tokenization with a token vault works

Tokenization with a token vault is a method for securing sensitive information by replacing it with a
token, or a unique identifier, that can be used in place of the original data. This can be useful in
situations where sensitive data, such as credit card numbers or personal identification information,
needs to be stored or transmitted securely.

Here's a real-world example of how tokenization with a token vault works in the context of a payment
transaction:

1. A customer makes a purchase using a credit card at a retail store.

2. The store's point-of-sale system captures the credit card number, expiration date, and CVV
code.

3. Instead of storing the credit card information in its raw form, the point-of-sale system sends
the data to a token vault for tokenization.

4. The token vault replaces the credit card number with a unique token, which is then sent back
to the point-of-sale system.

5. The point-of-sale system stores the token in its database instead of the credit card number.

6. When the customer makes another purchase at the store, the point-of-sale system uses the
token to retrieve the credit card information from the token vault for processing.

7. The token vault returns the original credit card information to the point-of-sale system, which
processes the transaction as usual.

By using tokenization with a token vault, the sensitive credit card information is protected because
the token is essentially useless to anyone who doesn't have access to the token vault. Even if the token
is intercepted during transmission, it cannot be used to retrieve the original credit card information
without access to the token vault.
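
The flow above can be sketched as a toy token vault in Python. This is a purely hypothetical
illustration (the TokenVault class, its methods, and the sample card number are invented for the
example; real vaults use hardened storage, access controls, and certified token generation):

Python code

import secrets

class TokenVault:
    """Toy token vault: maps random surrogate tokens back to the original sensitive values."""

    def __init__(self):
        self._vault = {}  # token -> original value, held only inside the vault

    def tokenize(self, card_number: str) -> str:
        token = secrets.token_hex(8)   # random surrogate with no mathematical link to the data
        self._vault[token] = card_number
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]      # only possible with access to the vault

vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(token)                    # e.g. 'a3f1...' - useless to anyone without the vault
print(vault.detokenize(token))  # the original card number, recovered inside the vault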

Word Tokenization

Word tokenization is the process of breaking down a text into individual words, also known as word-
level tokens. This is a fundamental step in many natural language processing (NLP) tasks, such as
language modeling, sentiment analysis, and machine translation.

Word tokenization typically involves identifying word boundaries in a text. In English, words are often
separated by spaces, but this is not always the case. For example, in some languages, such as Chinese
or Japanese, there are no spaces between words. In these cases, other techniques, such as character-
level tokenization, may be used.

There are several approaches to word tokenization, including rule-based methods and statistical
methods. Rule-based methods use pre-defined rules to identify word boundaries, such as splitting text
on whitespace or punctuation. Statistical methods, such as machine learning algorithms, learn to
identify word boundaries based on patterns in a large corpus of text.

Here's an example of word tokenization in Python using the NLTK library:

Python code

import nltk

text = "Tokenization is the process of breaking down a text into individual words."

tokens = nltk.word_tokenize(text)  # may require nltk.download('punkt') the first time
print(tokens)

Output:

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'a', 'text', 'into', 'individual', 'words', '.']

In this example, the word_tokenize() function from the NLTK library is used to tokenize the text into
individual words. The resulting list of tokens includes each word and punctuation mark in the original
text.

Drawback of Word Tokenization

While word tokenization is a crucial step in many NLP tasks, it also has some drawbacks that can affect
the accuracy of downstream analysis. Here are some common drawbacks of word tokenization:

1. Ambiguity: Word tokenization can be ambiguous, particularly when dealing with languages
that have complex grammar and syntax. For example, in English, the word "run" can be a noun
or a verb, depending on the context in which it is used. This can lead to ambiguity in the
meaning of the text, which can affect the accuracy of NLP models.

2. Out-of-vocabulary words: Word tokenization assumes that all words in a text are known and
can be represented as tokens. However, there are often words that are not included in the
tokenization vocabulary, such as misspelled words or domain-specific jargon. These words
may be treated as out-of-vocabulary words, which can affect the accuracy of downstream
analysis.

3. Token size: Word tokenization typically splits text into individual words, but this can be
problematic for languages that have complex word structures, such as compounds or inflected
forms. For example, in German, the word "Kindergarten" (meaning "nursery school") is a
compound word made up of two smaller words, "Kind" (meaning "child") and "Garten"
(meaning "garden"). If the word is split into individual tokens, it may not be recognized as a
compound word, which can affect the accuracy of analysis.

4. Contextual information: Word tokenization does not take into account contextual
information, such as the relationship between words in a sentence or the grammatical role of
a word. This can affect the accuracy of downstream analysis, particularly in tasks such as
named entity recognition or part-of-speech tagging.

Overall, while word tokenization is a useful technique for breaking down text into individual units, it
has limitations that can affect the accuracy of downstream analysis. It is important to consider these
limitations and to use additional techniques, such as contextual modeling or character-level
tokenization, to improve the accuracy of NLP models.

Character Tokenization

Character tokenization is an alternative approach to word tokenization, where the text is broken down
into individual characters, rather than words. In character tokenization, each character in the text is
treated as a separate token.

Character tokenization has some advantages over word tokenization:

1. Handling out-of-vocabulary words: Since character tokenization breaks down text into
individual characters, it can handle out-of-vocabulary words more effectively than word
tokenization. This is because even if a word is not in the tokenization vocabulary, its individual
characters can still be represented as tokens.

2. Improved language modeling: Character tokenization can be useful for language modeling
tasks, where the goal is to predict the next character or sequence of characters in a text. By
treating each character as a separate token, character-level models can capture more fine-
grained patterns in the text, such as spelling mistakes or word structure.

3. Multilingual support: Character tokenization is language-independent and can be used to
tokenize text in any language, including languages that do not have spaces between words,
such as Chinese or Japanese.

Here's an example of character tokenization in Python:

Python code

text = "Tokenization is the process of breaking down a text into individual characters."

tokens = list(text)
print(tokens)

Output:

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'p', 'r', 'o', 'c', 'e', 's', 's', ' ', 'o', 'f', ' ',
'b', 'r', 'e', 'a', 'k', 'i', 'n', 'g', ' ', 'd', 'o', 'w', 'n', ' ', 'a', ' ', 't', 'e', 'x', 't', ' ', 'i', 'n', 't', 'o', ' ', 'i', 'n', 'd', 'i', 'v',
'i', 'd', 'u', 'a', 'l', ' ', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's', '.']

In this example, the text variable is tokenized into individual characters using the Python list()
function. Each character in the text is represented as a separate token in the resulting list.

Drawbacks of character tokenization

While character tokenization has some advantages over word tokenization, it also has some
drawbacks that should be considered:

1. Increased sequence length: Since character tokenization breaks down text into individual
characters, the resulting sequence of tokens can be much longer than a sequence produced
by word tokenization. This can make training and inference more computationally expensive.

2. Reduced interpretability: Character tokenization can make it more difficult to interpret the
meaning of a text, since individual characters may not carry as much semantic information as
words. This can make it harder to extract meaningful insights from the text.

3. Limited ability to capture word-level patterns: Character tokenization does not capture word-
level patterns or relationships between words, which can be important for many NLP tasks.
For example, named entity recognition or sentiment analysis may rely on the presence or
absence of specific words in a text, which may not be captured by character tokenization.

4. Vocabulary size: Character tokenization can result in a very large vocabulary size, particularly
for languages with complex writing systems, such as Chinese or Japanese. This can make it
more difficult to train models with limited computational resources.

Overall, while character tokenization can be useful in certain contexts, it is not always the best choice
for NLP tasks. The choice between character and word tokenization should depend on the specific
requirements of the task, as well as the resources available for training and inference.
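
A quick sketch makes the sequence-length drawback (point 1 above) concrete by tokenizing the same
sentence both ways:

Python code

text = "Tokenization is the process of breaking down a text into individual characters."

word_tokens = text.split()
char_tokens = list(text)

print(len(word_tokens))  # 12 word-level tokens
print(len(char_tokens))  # 79 character-level tokens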

Need of Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down
a piece of text into smaller units called tokens. Tokens are typically words, but they can also be
phrases, characters, or other meaningful units of text. Tokenization is important for several reasons:

1. Text preprocessing: Tokenization is often the first step in text preprocessing, which is the
process of transforming raw text into a format that can be easily analyzed by NLP algorithms.
By breaking down text into tokens, tokenization makes it easier to perform tasks such as part-
of-speech tagging, named entity recognition, sentiment analysis, and text classification.

2. Vocabulary creation: Tokenization is also important for creating the vocabulary that is used to
represent text in NLP models. Each token is typically assigned a unique integer or vector
representation, which is used to represent the token in the model's input or output layer. The
vocabulary size can be very large, particularly for languages with a large number of words or
a complex writing system, such as Chinese or Japanese.

3. Feature extraction: Tokens can be used as features in machine learning models, which can
help to improve the performance of the models. For example, a bag-of-words model can
represent a text document as a vector of token frequencies, which can be used to train a
classifier or regression model.
4. Text normalization: Tokenization can also be used for text normalization, which involves
transforming text into a canonical or standardized form. For example, tokenization can be
used to convert all words to lowercase or remove punctuation from a text.

Overall, tokenization is an essential step in NLP that enables text to be processed, analyzed, and
represented in a way that can be used by machine learning algorithms.
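
As a small sketch of the feature-extraction point above, a bag-of-words representation can be built
directly from token counts (plain Python with an illustrative two-document corpus):

Python code

from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize each document and count token frequencies (a bag-of-words representation)
bags = [Counter(doc.split()) for doc in documents]

print(bags[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(bags[1])  # Counter({'the': 2, 'dog': 1, 'sat': 1, 'on': 1, 'log': 1})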

Example of sentence tokenization

Input: "Natural Language Processing is an interesting field. It involves the application of computational
techniques to process and analyze natural language data."

Output:

1. "Natural Language Processing is an interesting field."

2. "It involves the application of computational techniques to process and analyze natural
language data."

In this example, the input text contains two sentences, which are separated by a period followed by a
space. Sentence tokenization involves identifying these sentence boundaries and separating the input
text into individual sentences. The output shows the two sentences in the input text, with each
sentence on a separate line.

import nltk

# Define input text

input_text = "Natural Language Processing is an interesting field. It involves the application of


computational techniques to process and analyze natural language data."

# Perform sentence tokenization

sentences = nltk.sent_tokenize(input_text)

# Print output

for sentence in sentences:
    print(sentence)

The sent_tokenize function from the nltk library is used to perform sentence tokenization on the input
text. It returns a list of sentences, which are then printed using a for loop.

The output of this code will be:

Natural Language Processing is an interesting field.

It involves the application of computational techniques to process and analyze natural language data.

What are the benefits of Tokenization?

1. Text Preprocessing: Tokenization is a fundamental step in text preprocessing for NLP tasks. It
breaks down the raw text into smaller, more manageable units that can be easily processed
by downstream algorithms, such as part-of-speech tagging, named entity recognition, and
sentiment analysis.
2. Feature Extraction: Tokens can be used as features in machine learning models. They can help
improve the performance of the models by capturing important information about the text.
For example, in a bag-of-words model, the frequency of each token can be used as a feature.

3. Vocabulary Creation: Tokenization is used to create the vocabulary that is used to represent
text in NLP models. Each token is typically assigned a unique integer or vector representation,
which is used to represent the token in the model's input or output layer.

4. Text Normalization: Tokenization can be used for text normalization, which involves
transforming text into a standardized form. For example, all tokens can be converted to
lowercase, or punctuation can be removed.

5. Efficient Storage and Retrieval: By converting text into a sequence of tokens, the amount of
data that needs to be stored and processed can be reduced. This can lead to more efficient
storage and retrieval of text data, which is important for many NLP applications.

Overall, tokenization is a crucial step in many NLP tasks. It helps to simplify and standardize the text
data, making it easier to process, analyze, and represent in a way that can be used by machine learning
algorithms.
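
As a sketch of the normalization benefit (point 4 above), tokens can be lower-cased and stripped of
punctuation in a single pass (the regular expression and the example sentence are illustrative
assumptions):

Python code

import re

text = "Tokenization, IMPORTANTLY, simplifies NLP pipelines!"

# Lowercase the text and keep only word tokens; punctuation is dropped entirely
tokens = re.findall(r"\w+", text.lower())
print(tokens)
# ['tokenization', 'importantly', 'simplifies', 'nlp', 'pipelines']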

Tokenization Challenges in NLP

While tokenization is an important step in natural language processing (NLP), there are several
challenges that need to be addressed:

1. Ambiguity: Some words can have multiple meanings, which can lead to ambiguity in
tokenization. For example, the word "bank" can refer to a financial institution or the edge of
a river. In such cases, context and part-of-speech information may be used to disambiguate
the token.

2. Domain-specific terms: Tokenizing domain-specific terms, such as medical terminology or
legal jargon, can be challenging. These terms may not appear in standard language
dictionaries and may require additional preprocessing or custom dictionaries to tokenize
correctly.

3. Compound words: Some languages, such as German, have long compound words that can be
difficult to tokenize. For example, the German word
"Donaudampfschifffahrtsgesellschaftskapitän" (Danube steamship company captain) is a
single word that can be challenging to split into individual tokens.

4. Informal language: Tokenizing informal language, such as social media posts or text messages,
can be difficult because it often contains non-standard grammar, spelling, and punctuation.

5. Multilingual text: Tokenizing multilingual text can be challenging because different languages
may have different tokenization rules and writing systems.

6. Tokenization errors: Tokenization algorithms are not always perfect and can sometimes make
errors, such as splitting a word into two tokens or combining two words into a single token.

Overall, tokenization is a crucial step in NLP, but it can be challenging to perform accurately,
particularly in cases where the text contains domain-specific terms, compound words, or informal
language. Addressing these challenges requires careful consideration of the text data and the specific
requirements of the NLP task.
Types of Tokens

In natural language processing (NLP), there are several types of tokens that can be created depending
on the tokenization technique used. Here are some of the most common types:

1. Word tokens: These are tokens created by word tokenization. Each word in the text is treated
as a separate token.

2. Character tokens: These are tokens created by character tokenization. Each character in the
text is treated as a separate token.

3. Subword tokens: These are tokens created by subword tokenization, which involves breaking
words down into smaller units called subwords. Subword tokenization is often used for
languages with complex morphology or for handling out-of-vocabulary words.

4. Byte pair encoded (BPE) tokens: These are a type of subword token created by a specific
subword tokenization algorithm called byte pair encoding. BPE works by iteratively merging
the most frequent pair of adjacent symbols, starting from individual characters, until a
predetermined vocabulary size is reached (see the sketch at the end of this section).

5. Part-of-speech (POS) tags: These are tokens that represent the part of speech of a word in the
text, such as noun, verb, adjective, or adverb. POS tags are often created as a post-processing
step after word tokenization.

6. Named entity tokens: These are tokens that represent named entities in the text, such as
person names, organization names, or location names. Named entity tokens are often created
as a post-processing step after word tokenization.

Overall, the type of token created depends on the tokenization technique used and the specific
requirements of the NLP task. Word tokens are the most common type, but subword and character
tokens are becoming more widely used, especially for languages with complex morphology or out-of-
vocabulary words.
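
To make the byte pair encoding idea (point 4 above) concrete, here is a minimal sketch of the merge
loop, modelled on the commonly cited toy example from the original BPE paper (the corpus, frequencies,
and number of merges are illustrative; production tokenizers add many refinements):

Python code

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent pair of symbols occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated characters, weighted by corpus frequency
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(3):  # learn three merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, vocab)
# Merges learned on this toy corpus: ('e', 's'), then ('es', 't'), then ('l', 'o')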

What is tokenization?
Tokenization is the process of replacing sensitive data with unique identification symbols that retain
all the essential information about the data without compromising its security.

Examples of tokenization

Tokenization technology can, in theory, be used with sensitive data of all kinds, including bank
transactions, medical records, criminal records, vehicle driver information, loan applications, stock
trading and voter registration. For the most part, any system in which surrogate, nonsensitive
information can act as a stand-in for sensitive information can benefit from tokenization.

Tokenization is often used to protect credit card data, bank account information and other sensitive
data handled by payment processors. Payment processing use cases that tokenize sensitive credit
card information include the following:

• mobile wallets, such as Google Pay and Apple Pay;

• e-commerce sites; and

• businesses that keep customers' cards on file.

How tokenization works

Tokenization substitutes sensitive information with equivalent nonsensitive information. The
nonsensitive replacement information is called a token.

Tokens can be created in the following ways:

• using a mathematically reversible cryptographic function with a key;

• using a nonreversible function, such as a hash function; or

• using an index function or randomly generated number.

As a result, the token becomes the exposed information, and the sensitive information that the
token stands in for is stored safely in a centralized server known as a token vault. The token vault is
the only place where the original information can be mapped back to its corresponding token.

Here is one real-world example of how tokenization with a token vault works.

• A customer provides their payment details at a point-of-sale (POS) system or online checkout
form.

• The details, or data, are substituted with a randomly generated token, which is generated in
most cases by the merchant's payment gateway.

• The tokenized information is then encrypted and sent to a payment processor. The original
sensitive payment information is stored in a token vault in the merchant's payment gateway.
This is the only place where the token can be mapped to the information it represents.

• The tokenized information is encrypted again by the payment processor before being sent for
final verification.

On the other hand, some tokenization is vaultless. Instead of storing the sensitive information in a
secure database, vaultless tokens are stored using an algorithm. If the token is reversible, then the
original sensitive information is generally not stored in a vault.
