
Python NLTK | nltk.tokenize.word_tokenize()

Last Updated : 01 Aug, 2025

Tokenization is the process of breaking text into smaller units called tokens. These may be sentences, words, sub-words or characters, depending on the level of granularity an NLP task needs. Tokens are the basic building blocks for most NLP operations, such as text analysis, information extraction, sentiment analysis and more.

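To see the difference in granularity, here is a minimal sketch contrasting sentence-level and word-level tokenization with NLTK's sent_tokenize and word_tokenize (the sample text is our own):

Python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization splits text. Each unit is a token."

# Sentence-level: one token per sentence
print(sent_tokenize(text))

# Word-level: words and punctuation become separate tokens
print(word_tokenize(text))

Output:
['Tokenization splits text.', 'Each unit is a token.']
['Tokenization', 'splits', 'text', '.', 'Each', 'unit', 'is', 'a', 'token', '.']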

NLTK (Natural Language Toolkit) is a Python library that provides a range of tokenization tools, including methods for splitting text into words, punctuation and even syllables. In this article we will learn about word_tokenize, which splits a sentence or phrase into words and punctuation.

Let's look at an example:

Python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the Punkt models; download them once
# (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

text = "The company spent $30,000,000 last year."
tokens = word_tokenize(text)
print(tokens)

Output: ['The', 'company', 'spent', '$', '30,000,000', 'last', 'year', '.']

nltk.tokenize.word_tokenize() tokenizes sentences into words, numbers and punctuation marks. It does not split words into syllables, but simply splits text at word boundaries.
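Because word_tokenize follows Penn Treebank conventions, it also separates English contractions into their component tokens. A short sketch (the sentence is illustrative):

Python
from nltk.tokenize import word_tokenize

# Contractions like "Don't" and "it's" are split into two tokens each
print(word_tokenize("Don't panic, it's only tokenization!"))

Output: ['Do', "n't", 'panic', ',', 'it', "'s", 'only', 'tokenization', '!']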

Syntax:

Python
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

Here we pass the text to word_tokenize and it returns a list of word tokens.
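word_tokenize also accepts optional language and preserve_line parameters. By default it splits the input into sentences first (using the Punkt model for the given language) and then tokenizes each sentence, so multi-sentence strings work in a single call. A minimal sketch (the sample text is our own):

Python
from nltk.tokenize import word_tokenize

text = "Prices rose sharply. Analysts weren't surprised."

# Default: the text is sentence-split first, then each sentence is tokenized
print(word_tokenize(text))

# preserve_line=True skips the internal sentence split, which can be useful
# when the input is already one sentence per line
tokens = word_tokenize(text, preserve_line=True)

Output: ['Prices', 'rose', 'sharply', '.', 'Analysts', 'were', "n't", 'surprised', '.']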

NLTK offers useful and flexible tokenization tools that form the backbone of many NLP workflows. By understanding how word-level tokenization with word_tokenize behaves, users can decide where it fits their needs, from general text analysis to specialized linguistic applications.

