
Python NLTK | nltk.tokenize.word_tokenize()

Last Updated : 01 Aug, 2025

Tokenization is the process of breaking text into smaller units called tokens. These may be sentences, words, sub-words or characters, depending on the level of granularity an NLP task needs. Tokens are the basic building blocks for most NLP operations, such as text analysis, information extraction, sentiment analysis and more.

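To see the difference in granularity, here is a minimal sketch contrasting sentence-level and word-level tokenization with NLTK's sent_tokenize and word_tokenize (the sample text is our own):

Python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization splits text. Each unit is a token."

# Sentence-level: one token per sentence
print(sent_tokenize(text))

# Word-level: words and punctuation become separate tokens
print(word_tokenize(text))

Output:
['Tokenization splits text.', 'Each unit is a token.']
['Tokenization', 'splits', 'text', '.', 'Each', 'unit', 'is', 'a', 'token', '.']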

NLTK (Natural Language Toolkit) is a Python library that provides a range of tokenization tools, including methods for splitting text into words, punctuation and even syllables. In this article we will learn about word_tokenize, which splits a sentence or phrase into words and punctuation.

Let's look at an example:

Python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the Punkt models; download them once
# (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

text = "The company spent $30,000,000 last year."
tokens = word_tokenize(text)
print(tokens)

Output: ['The', 'company', 'spent', '$', '30,000,000', 'last', 'year', '.']

nltk.tokenize.word_tokenize() tokenizes sentences into words, numbers and punctuation marks. It does not split words into syllables, but simply splits text at word boundaries.
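Because word_tokenize follows Penn Treebank conventions, it also separates English contractions into their component tokens. A short sketch (the sentence is illustrative):

Python
from nltk.tokenize import word_tokenize

# Contractions like "Don't" and "it's" are split into two tokens each
print(word_tokenize("Don't panic, it's only tokenization!"))

Output: ['Do', "n't", 'panic', ',', 'it', "'s", 'only', 'tokenization', '!']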

Syntax:

Python
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

Here we pass the text to word_tokenize and it returns a list of word tokens.
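word_tokenize also accepts optional language and preserve_line parameters. By default it splits the input into sentences first (using the Punkt model for the given language) and then tokenizes each sentence, so multi-sentence strings work in a single call. A minimal sketch (the sample text is our own):

Python
from nltk.tokenize import word_tokenize

text = "Prices rose sharply. Analysts weren't surprised."

# Default: the text is sentence-split first, then each sentence is tokenized
print(word_tokenize(text))

# preserve_line=True skips the internal sentence split, which can be useful
# when the input is already one sentence per line
tokens = word_tokenize(text, preserve_line=True)

Output: ['Prices', 'rose', 'sharply', '.', 'Analysts', 'were', "n't", 'surprised', '.']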

NLTK offers useful and flexible tokenization tools that form the backbone of many NLP workflows. By understanding how word-level tokenization with word_tokenize behaves, users can decide where it fits their needs, from general text analysis to specialized linguistic applications.

