Tokenization
Reference: ChatGPT
What is Tokenization?
Tokenization is the process of breaking text down into smaller units.
These units are called tokens.
In some languages, such as Chinese, tokenization is difficult because there are no clear gaps (spaces) between words.
Tokenization is essential for text analysis, sentiment analysis, and machine translation.
Popular libraries that provide tokenization include spaCy and NLTK.
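Below is a minimal sketch of word-level tokenization with spaCy (the sample sentence and the choice of a blank English pipeline are assumptions for illustration; NLTK offers a comparable word_tokenize function once its punkt resource is downloaded):

import spacy

# Even a blank pipeline includes spaCy's rule-based tokenizer
nlp = spacy.blank("en")
doc = nlp("Tokenization breaks text into smaller units called tokens.")
print([token.text for token in doc])
# ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units', 'called', 'tokens', '.']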
Tokenization Process
Tokenization of different languages
📔 Reference: https://www.youtube.com/watch?v=hKK59rfpXL0&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=10
As shown in the image above, if you define the nlp object as blank, you won't have access to any of the pipeline components.
To access the pipeline features, the components need to be added explicitly.
A pipeline includes many components, as shown in the representation above. Once you define a pipeline, you can use it to categorize and manipulate the tokens.
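As a rough sketch of this difference (assuming the small English model en_core_web_sm has been installed with python -m spacy download en_core_web_sm):

import spacy

# A freshly created blank pipeline has no components, only the tokenizer
print(spacy.blank("en").pipe_names)   # []

# A pretrained pipeline ships with several components (model name assumed here)
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)                 # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']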
doc = nlp("In a bustling city, John bought 3 tickets for the concert, each priced
at $50. As the clock struck 7, he realized he was running late and decided to
take a cab, spending an additional $30. Despite the unexpected expenses, he
couldn't contain his excitement as he headed to the venue, looking forward to an
unforgettable night.")
for ent in doc.ents:
print(ent.text , " | ", ent.label_, " | ", spacy.explain(ent.label_))
Output:
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe('ner', source=nlp)
nlp_blank.pipe_names
Output:
['ner']
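As a hedged follow-up to the snippet above (the example sentence is an assumption), the previously blank pipeline can now recognize entities using only the sourced 'ner' component:

doc = nlp_blank("Sarah paid $20 for two concert tickets in New York.")
for ent in doc.ents:
    # each detected entity with its label, e.g. PERSON, MONEY, GPE (results depend on the model)
    print(ent.text, " | ", ent.label_)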