
Python | Tokenize text using TextBlob

Last Updated : 13 Apr, 2025

Tokenization is a fundamental task in Natural Language Processing (NLP) that breaks text down into smaller units such as words or sentences. It underpins tasks like text classification, sentiment analysis and named entity recognition. TextBlob is a Python library for processing textual data that simplifies many NLP tasks, including tokenization. In this article we’ll explore how to tokenize text using the TextBlob library in Python.

Implementing Tokenization using TextBlob

TextBlob is a simple NLP library built on top of NLTK (Natural Language Toolkit) and Pattern. It provides easy-to-use APIs for common NLP tasks such as tokenization, part-of-speech tagging, noun phrase extraction and translation. It offers two main types of tokenization:

  1. Word tokenization: Breaking text into individual words.
  2. Sentence tokenization: Breaking text into individual sentences.

1. Downloading Necessary Library

Before starting, we need to install TextBlob. You can install it with the following command in a command-line interface (CLI):

pip install textblob

Once installed, you also need to download the NLTK corpora that TextBlob relies on for operations such as tokenization. Run this Python code to download them:

Python
from textblob import download_corpora

# Fetch the NLTK corpora that TextBlob uses (punkt, wordnet, brown, ...)
download_corpora.download_all()

Output:

(Screenshot: console log of the NLTK corpora being downloaded)
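
Alternatively, TextBlob ships the same downloader as a script, so you can run it straight from the CLI:

python -m textblob.download_corpora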

2. Tokenizing Text into Words

Let’s start by tokenizing text into words. We will use the TextBlob class to create a TextBlob object which allows us to easily manipulate the text.

  • We create a TextBlob object from a sample text.
  • The words property of the TextBlob object returns the list of words in the text, breaking the sentence into individual tokens, i.e. words.
  • Punctuation is handled automatically, so punctuation marks are excluded from the list of words.
Python
from textblob import TextBlob

text = "Hello! I am learning NLP with TextBlob."
blob = TextBlob(text)

# words splits the text into tokens and drops punctuation
words = blob.words
print(words)

Output:

['Hello', 'I', 'am', 'learning', 'NLP', 'with', 'TextBlob']
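
The returned object is TextBlob's WordList, a list subclass with a few extra conveniences. For example, continuing from the snippet above, its count() method matches case-insensitively by default:

Python
# WordList.count ignores case unless case_sensitive=True is passed
print(words.count('nlp'))

Output:

1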

3. Tokenizing Text into Sentences

Now we will tokenize text into sentences. To do this you can use the sentences property of the TextBlob object.

  • We use the sentences property to break the text into its two individual sentences.
  • TextBlob recognizes sentence boundaries and tokenizes the text accordingly.
Python
from textblob import TextBlob

text = "Hello! I am learning NLP with TextBlob. It's a fun journey."
blob = TextBlob(text)

# sentences returns a list of Sentence objects
sentences = blob.sentences
for sentence in sentences:
    print(sentence)

Output:

Hello!
I am learning NLP with TextBlob.
It's a fun journey.
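
Each item in blob.sentences is a Sentence object rather than a plain string, so you can keep processing it with the same API. For example, continuing from the example above, you can tokenize a single sentence into words:

Python
# A Sentence supports the same properties as a TextBlob
print(blob.sentences[1].words)

Output:

['I', 'am', 'learning', 'NLP', 'with', 'TextBlob']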

4. Working with Tokenized Data

Once you’ve tokenized the text into words or sentences you can perform further processing on the tokens. Here are a few common operations you can do with tokenized data:

  1. Word Frequency Analysis: Count how often each word appears in the text.
  2. Filtering Stop Words: Remove common words like “and”, “the”, etc. that carry little meaning.
  3. Stemming or Lemmatization: Reduce words to their base or root form (a sketch of items 1 and 3 follows the stop-word example below).

Here we download the list of English stop words from NLTK's stopwords corpus and filter them out of the tokenized word list.

Python
import nltk
from textblob import TextBlob
from nltk.corpus import stopwords

# Download the stop-word list (only needed once)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

text = "Hello! I am learning NLP with TextBlob."
blob = TextBlob(text)
words = blob.words

# Keep only tokens that are not stop words (case-insensitive check)
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)

Output:

['Hello', 'learning', 'NLP', 'TextBlob']
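
To round out the list above, here is a minimal sketch of the other two operations: word-frequency analysis via the word_counts property and lemmatization via TextBlob's Word class (the lemmatizer uses the WordNet corpus downloaded earlier):

Python
from textblob import TextBlob, Word

blob = TextBlob("TextBlob makes NLP simple. NLP is fun.")

# word_counts maps each lowercased word to its frequency
print(blob.word_counts['nlp'])

# lemmatize reduces a word to its base form; 'v' treats it as a verb
print(Word("running").lemmatize("v"))

Output:

2
run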

Tokenization is an important step in NLP, and TextBlob simplifies it in Python. With TextBlob you can easily tokenize text into words and sentences, then perform further operations such as filtering stop words and analyzing word frequencies.


