
Tokenization

Reference: ChatGPT

What is Tokenization?
Tokenization is the process of breaking text down into smaller units.
These units are called tokens.
In some languages, such as Chinese, tokenization is difficult because there are no clear gaps
between words.
Tokenization is essential for text analysis, sentiment analysis, and machine translation.
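
As a minimal sketch of the idea (the sample sentence is ours), the snippet below splits a sentence into word tokens with spaCy:

import spacy

# A blank English pipeline gives us just the tokenizer, nothing else
nlp = spacy.blank("en")

doc = nlp("Tokenization breaks text into tokens.")
print([token.text for token in doc])

Output:

['Tokenization', 'breaks', 'text', 'into', 'tokens', '.']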

Libraries used for Tokenization

spaCy and NLTK are the two most popular libraries used for NLP tasks.

| spaCy | NLTK |
| --- | --- |
| Object oriented | String processing |
| Most efficient algorithm for a given task | Gives more customization options |
| More user friendly | Less user friendly |
| Cannot choose a specific algorithm | Can choose a specific algorithm |
| Newer, with a more active community | Older, with a less active community |
| Better for app developers | Better for researchers |
| We create an NLP object and use that object to tokenize sentences and words | We import the desired tokenizer from nltk.tokenize and use that for tokenization |
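
The last row of the table is easiest to see in code. Below is a minimal sketch of the same tokenization done both ways; the sample text is ours, and NLTK additionally needs its tokenizer data downloaded once:

import spacy
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download("punkt")  # one-time download; newer NLTK versions may ask for "punkt_tab"

text = "Tokenization is essential. It breaks text into tokens."

# spaCy: create an NLP object and tokenize through it
nlp = spacy.blank("en")
doc = nlp(text)
print([token.text for token in doc])

# NLTK: import the desired tokenizer and call it directly
print(word_tokenize(text))   # word tokens
print(sent_tokenize(text))   # sentence tokens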

Tokenization Process
Tokenization of different languages
📔 Reference: https://www.youtube.com/watch?v=hKK59rfpXL0&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=10
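
As a hedged sketch of the idea (the sample sentence is ours), spacy.blank() accepts a language code and loads language-specific tokenization rules:

import spacy

# French tokenizer: applies French-specific rules, e.g. splitting elisions like "C'est"
nlp_fr = spacy.blank("fr")
print([t.text for t in nlp_fr("C'est une phrase.")])

# Languages written without spaces between words (e.g. Chinese) need a word
# segmenter rather than rule-based splitting, which is why they are harder to tokenize.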

Introduction to Language Processing Pipelines in spaCy

An NLP pipeline comes directly after tokenization and includes a number of processing steps. These
steps are used for categorizing the tokens or finding their base values.

If you define the nlp object as blank, you won't have access to any of the pipeline components.
To access the pipeline features, they need to be added explicitly.
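
For example, a blank object reports an empty component list (a minimal sketch):

import spacy

# A blank pipeline has a tokenizer but no pipeline components
print(spacy.blank("en").pipe_names)   # []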

Code Implementation in spaCy

A sample pipeline code is given below; this will load the pretrained pipeline components for the
English language.
nlp = spacy.load("en_core_web_sm")

A pipeline includes many components. Once you define a pipeline, you can use it to
categorize and manipulate the tokens.
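
You can check which components came with the loaded pipeline; for en_core_web_sm the list should look like this:

nlp.pipe_names

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']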

Components of Language Processing Pipeline


tok2vec : Converts words into vectors tagger : Assigns part of speech tokens parser : Analyze the
relation between words (eg: grammatical structure) attribute_ruler : Adds token attributes based on
patterns lemmatizer : Provides the base or root of words ner : Identifies the named entities (eg:
persons, organizations)
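
As a small illustrative sketch (the sample sentence is ours), the tagger and lemmatizer populate per-token attributes that you can read off directly:

doc = nlp("The children were running quickly.")
for token in doc:
    # tagger fills token.pos_, lemmatizer fills token.lemma_
    print(token.text, " | ", token.pos_, " | ", token.lemma_)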

Use of Language Processing Pipeline for Entity Recognition

The ner component helps to find the named entities in the text.

doc = nlp(
    "In a bustling city, John bought 3 tickets for the concert, each priced "
    "at $50. As the clock struck 7, he realized he was running late and decided to "
    "take a cab, spending an additional $30. Despite the unexpected expenses, he "
    "couldn't contain his excitement as he headed to the venue, looking forward to an "
    "unforgettable night."
)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Output:

John | PERSON | People, including fictional
3 | CARDINAL | Numerals that do not fall under another type
50 | MONEY | Monetary values, including unit
7 | DATE | Absolute or relative dates or periods
an additional $30 | MONEY | Monetary values, including unit

Customization of Language Processing Pipeline

We can also customize an existing blank object by adding components from other existing pipelines.

nlp_blank = spacy.blank("en")           # blank English pipeline: tokenizer only
nlp_blank.add_pipe('ner', source=nlp)   # copy the trained ner component from the loaded pipeline
nlp_blank.pipe_names

Output:

['ner']
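
As a quick usage sketch (the sample sentence is ours, and the entities found depend on the copied model), the customized blank pipeline can now run entity recognition on its own:

doc = nlp_blank("John paid $50 for the concert tickets.")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)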
