Tokenize Text Using NLTK in Python
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. In the context of NLTK and Python, it is simply the process of collecting the tokens into a list, so that instead of iterating over the text one character at a time, we can iterate over it one token at a time.
For example, given the input string −
Hi man, how have you been?
We should get the output −
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
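Before turning to NLTK, the idea can be sketched with the standard library alone. The following is a minimal, hypothetical tokenizer that treats runs of word characters as tokens and splits off each punctuation mark; it is only an illustration of the concept, not a replacement for NLTK's tokenizers, which handle many more cases (contractions, abbreviations, etc.).

```python
import re

def simple_word_tokenize(text):
    # Match either a run of word characters (a word) or any single
    # character that is neither a word character nor whitespace
    # (a punctuation mark), in order of appearance.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hi man, how have you been?"))
# → ['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
```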
We can tokenize this text using the word_tokenize method from NLTK. Note that word_tokenize depends on the Punkt tokenizer models, which can be installed once with nltk.download('punkt'). For example,
Example
from nltk.tokenize import word_tokenize

my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)
print(tokens)
Output
This will give the output −
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']