
Tokenization

Section 2

Langer - Introduction to Text Analytics with Python


Tokenization
Tokenization is the process of decomposing a collection of text into smaller, but meaningful, units. These
smaller meaningful units are called tokens.

Tokens typically represent individual words or numbers. However, they can also represent punctuation,
symbols, emoticons (e.g., :-D), and emojis (e.g., 😁).

Tokens can also represent entire sentences of a document.

Tokenization is one of the most fundamental activities undertaken in text analytics.

Given this importance, it is not surprising that the NLTK offers rich support for tokenization.



Tokenization Basics
The most basic form of tokenization is to split text based on spaces:

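A minimal sketch using Python's built-in str.split(); the sample sentence below is an assumption, not the slide's original example:

text = "The quick brown fox jumped over the lazy dog."

# Splitting on whitespace is the simplest possible tokenizer.
tokens = text.split()
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']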

Notice how there’s already a problem (hint – the last token)? Here’s another example:
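Again splitting on spaces; the contraction-bearing sentence here is assumed for illustration:

text = "I'm excited to learn text analytics!"
print(text.split())
# ["I'm", 'excited', 'to', 'learn', 'text', 'analytics!']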

What about this?

Should the first token be broken into “I” and “’m”? Should it be expanded into “I” and “am”?



Word Tokenization
Turns out that tokenization is a hard problem!

Luckily, the NLTK offers several purpose-built tokenizers. First up, the word tokenizer:

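A minimal sketch using NLTK's word_tokenize() function (the sample sentence is an assumption):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # the Punkt tokenizer models, needed once

# Import the tokenizer, tokenize the text, and inspect the result.
text = "I'm excited to learn text analytics!"
tokens = word_tokenize(text)
print(tokens)
# ['I', "'m", 'excited', 'to', 'learn', 'text', 'analytics', '!']
# word_tokenize() returns a plain Python list of strings.

Note that the contraction from the previous slide is split into "I" and "'m" rather than expanded, and the punctuation becomes its own token.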
Regular Expression Tokenization
Regular expressions are a small, dedicated language for defining string-matching patterns.

Regular expressions are very powerful and commonly used to parse text data.

The NLTK supports the use of regular expressions for tokenization via the RegexpTokenizer class:
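A sketch of RegexpTokenizer in use; the pattern below (runs of word characters) is just one possible choice, not necessarily the slide's original:

from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters and silently drop everything else.
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize("I'm excited to learn text analytics!")
print(tokens)
# ['I', 'm', 'excited', 'to', 'learn', 'text', 'analytics']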



Sentence Tokenization
In many applications, you want to be able to first decompose text into sentences. The NLTK offers the sentence tokenizer, sent_tokenize(), for that purpose:

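A minimal sketch with sent_tokenize(); the sample text is assumed:

from nltk.tokenize import sent_tokenize

text = "Tokenization is fun. It is also surprisingly hard!"
sentences = sent_tokenize(text)
print(sentences)
# ['Tokenization is fun.', 'It is also surprisingly hard!']
# sent_tokenize() returns a list of strings, one string per sentence.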

Combining the sent_tokenize() function with the word_tokenize() function:

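One common way to combine them is a list comprehension (sample text again assumed):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization is fun. It is also surprisingly hard!"

# Word-tokenize each sentence in turn via a list comprehension.
tokens_by_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
print(tokens_by_sentence)
# [['Tokenization', 'is', 'fun', '.'], ['It', 'is', 'also', 'surprisingly', 'hard', '!']]
# The result is a list of lists -- one inner list of tokens per sentence.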



Tweet Tokenization
Social media text is a prime example of why tokenization is a hard problem to solve.

For example, the NLTK offers the TweetTokenizer class to handle the specific challenges of tokenizing tweets
(e.g., for sentiment analysis).
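A sketch of TweetTokenizer on a made-up tweet (the text, hashtag, handle, and URL are all assumptions):

from nltk.tokenize import TweetTokenizer

tweet = "Loving this #textanalytics course @a_friend :-D 😁 https://example.com"

tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet))
# ['Loving', 'this', '#textanalytics', 'course', '@a_friend', ':-D', '😁', 'https://example.com']
# Hashtags, handles, emoticons, emojis, and URLs survive as single tokens.

TweetTokenizer also accepts options such as strip_handles and reduce_len for cleaning up tweet text before further analysis.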



N-Grams
So far, the tokens we've seen correspond very closely to individual words, or unigrams.

To provide more insight into the structure of text, tokenization can also produce tokens of multiple
consecutive words or n-grams:

• Tokens consisting of two consecutive words are known as bigrams or 2-grams.
• Tokens consisting of three consecutive words are known as trigrams or 3-grams.
• While it is possible to make larger n-grams, there are diminishing returns in practice.



Bigrams
To create n-grams from a list of tokens, the NLTK provides the ngrams() function:
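A minimal bigram sketch (the sample sentence is assumed):

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumped over the lazy dog.")

# ngrams() returns a generator of tuples; wrap it in list() to inspect it.
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]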

NOTE - n-grams do not extend past the end of the unigram list.



Trigrams
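The same call with n=3 produces trigrams (same assumed sentence as before):

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The quick brown fox jumped over the lazy dog.")

trigrams = list(ngrams(tokens, 3))
print(trigrams[:3])
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped')]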



Top-Rated Training Courses

Attendee course ratings from TDWI Las Vegas (Feb 2023)

My new Python courses will delight attendees in the same way!


Use promo code INS150 to save an additional $150!
