Tokenization
Section 2
Tokenization
Tokenization is the process of decomposing a collection of text into smaller, but meaningful, units. These
smaller meaningful units are called tokens.
Tokens typically represent individual words or numbers. However, they can also represent punctuation,
symbols, emoticons (e.g., :-D), and emojis (e.g., 😁).
Tokens can also represent entire sentences of a document.
Tokenization is one of the most fundamental activities undertaken in text analytics.
Given this importance, it is not surprising that the NLTK offers rich support for tokenization.
Tokenization Basics
The most basic form of tokenization is to split text based on spaces:
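A minimal sketch of space-based splitting (the example sentence here is assumed, not taken from the original slide):

text = "The quick brown fox jumped over the lazy dog."
tokens = text.split(" ")   # split on spaces only
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']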
Notice how there's already a problem (hint: the last token)? Here's another example:
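Another sketch with an assumed example string:

text = "I'm excited to learn text analytics!"
tokens = text.split(" ")
print(tokens)
# ["I'm", 'excited', 'to', 'learn', 'text', 'analytics!']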
What about this?
Should the first token be broken into “I” and “’m”? Should it be expanded into “I” and “am”?
Word Tokenization
It turns out that tokenization is a hard problem!
Luckily, the NLTK offers several tokenizers to assist in tokenization. First up, the word tokenizer:
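A sketch of the word tokenizer on an assumed example string:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")   # one-time download of the Punkt models (newer NLTK versions use "punkt_tab")
text = "I'm excited to learn text analytics!"
tokens = word_tokenize(text)
print(tokens)   # word_tokenize() returns a plain Python list of strings
# ['I', "'m", 'excited', 'to', 'learn', 'text', 'analytics', '!']

Note how the contraction is split into "I" and "'m", and the punctuation becomes its own token.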
Regular Expression Tokenization
Regular expressions form a small, dedicated language for defining string-matching patterns.
Regular expressions are very powerful and commonly used to parse text data.
The NLTK supports the use of regular expressions for tokenization via the RegexpTokenizer class:
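A minimal sketch, assuming a pattern that keeps only runs of word characters:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # one token per run of letters, digits, or underscores
tokens = tokenizer.tokenize("I'm excited to learn text analytics!")
print(tokens)
# ['I', 'm', 'excited', 'to', 'learn', 'text', 'analytics']

The pattern you choose defines what counts as a token; here the apostrophe and exclamation point are simply discarded.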
Sentence Tokenization
In many applications, you want to be able to first decompose text into sentences. The NLTK offers the
sent_tokenize() function for that purpose:
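A sketch using an assumed snippet of text (and assuming the Punkt models are downloaded, as in the word tokenizer example above):

from nltk.tokenize import sent_tokenize

text = "Tokenization is a hard problem. Luckily, the NLTK can help! Let's see how."
sentences = sent_tokenize(text)
print(sentences)   # a Python list of strings, one per sentence
# ['Tokenization is a hard problem.', 'Luckily, the NLTK can help!', "Let's see how."]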
Combining the sent_tokenize() function with the word_tokenize() function:
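A sketch of the combination using a list comprehension, reusing the assumed text from above:

words_by_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
print(words_by_sentence)   # a list of lists: one inner list of word tokens per sentence
# [['Tokenization', 'is', 'a', 'hard', 'problem', '.'],
#  ['Luckily', ',', 'the', 'NLTK', 'can', 'help', '!'],
#  ['Let', "'s", 'see', 'how', '.']]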
Tweet Tokenization
Social media text is a prime example of why tokenization is a hard problem to solve.
To help, the NLTK offers the TweetTokenizer class to handle the specific challenges of tokenizing tweets
(e.g., for sentiment analysis).
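A sketch with an assumed tweet; strip_handles and reduce_len are optional TweetTokenizer arguments that remove @-mentions and shorten exaggerated character runs:

from nltk.tokenize import TweetTokenizer

tweet = "@dave OMG this course is soooooo good!!! #NLP :-D"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(tweet))
# ['OMG', 'this', 'course', 'is', 'sooo', 'good', '!', '!', '!', '#NLP', ':-D']

Notice that hashtags and emoticons survive as single tokens, which a standard word tokenizer would typically split apart.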
N-Grams
So far, the tokens we’ve seen correspond very closely to individual words or unigrams:
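For instance (an assumed token list for illustration):

word_tokenize("The quick brown fox jumped over the lazy dog.")
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']
# every token is a single word (or punctuation mark), i.e., a unigram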
To provide more insight into the structure of text, tokenization can also produce tokens of multiple
consecutive words or n-grams:
• Tokens consisting of two consecutive words are known as bigrams or 2-grams.
• Tokens consisting of three consecutive words are known as trigrams or 3-grams.
• While it is possible to create larger n-grams, there are diminishing returns in practice.
Bigrams
To create n-grams from a list of tokens, the NLTK provides the ngrams() function:
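A sketch using an assumed token list:

from nltk.util import ngrams

tokens = ['The', 'quick', 'brown', 'fox', 'jumped']
bigrams = list(ngrams(tokens, 2))   # ngrams() returns a generator, so wrap it in list()
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped')]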
NOTE - n-grams do not extend past the end of the unigram list.
Trigrams
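Along the same lines, a trigram sketch with the same assumed token list:

from nltk.util import ngrams

tokens = ['The', 'quick', 'brown', 'fox', 'jumped']
trigrams = list(ngrams(tokens, 3))   # same function, n = 3
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped')]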