Tokenization
Section 2
Tokenization
Tokenization is the process of decomposing a collection of text into smaller, but meaningful, units. These
smaller meaningful units are called tokens.
Tokens typically represent individual words or numbers. However, they can also represent punctuation,
symbols, emoticons (e.g., :-D), and emojis (e.g., 😁).
Tokens can also represent entire sentences of a document.
Tokenization is one of the most fundamental activities undertaken in text analytics.
Given this importance, it is not surprising that the NLTK offers rich support for tokenization.
Tokenization Basics
The most basic form of tokenization is to split text based on spaces:
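A minimal sketch of space-based splitting (the example sentence here is assumed, not taken from the original slide):

text = "The quick brown fox jumped over the lazy dog."
tokens = text.split(" ")   # split on spaces only
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']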
Notice how there's already a problem (hint: the last token)? Here's another example:
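Another sketch with an assumed example string:

text = "I'm excited to learn text analytics!"
tokens = text.split(" ")
print(tokens)
# ["I'm", 'excited', 'to', 'learn', 'text', 'analytics!']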
What about this?
Should the first token be broken into “I” and “’m”? Should it be expanded into “I” and “am”?
Word Tokenization
It turns out that tokenization is a hard problem!
Luckily, the NLTK offers several tokenizers to assist in tokenization. First up, the word tokenizer:
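A sketch of the word tokenizer on an assumed example string:

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")   # one-time download of the Punkt models (newer NLTK versions use "punkt_tab")
text = "I'm excited to learn text analytics!"
tokens = word_tokenize(text)
print(tokens)   # word_tokenize() returns a plain Python list of strings
# ['I', "'m", 'excited', 'to', 'learn', 'text', 'analytics', '!']

Note how the contraction is split into "I" and "'m", and the punctuation becomes its own token.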
Regular Expression Tokenization
Regular expressions form a small, dedicated language for defining string-matching patterns.
Regular expressions are very powerful and commonly used to parse text data.
The NLTK supports the use of regular expressions for tokenization via the RegexpTokenizer class:
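A minimal sketch, assuming a pattern that keeps only runs of word characters:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")   # one token per run of letters, digits, or underscores
tokens = tokenizer.tokenize("I'm excited to learn text analytics!")
print(tokens)
# ['I', 'm', 'excited', 'to', 'learn', 'text', 'analytics']

The pattern you choose defines what counts as a token; here the apostrophe and exclamation point are simply discarded.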
Sentence Tokenization
In many applications, you want to be able to first decompose text into sentences. The NLTK offers the
sent_tokenize() function for that purpose:
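A sketch using an assumed snippet of text (and assuming the Punkt models are downloaded, as in the word tokenizer example above):

from nltk.tokenize import sent_tokenize

text = "Tokenization is a hard problem. Luckily, the NLTK can help! Let's see how."
sentences = sent_tokenize(text)
print(sentences)   # a Python list of strings, one per sentence
# ['Tokenization is a hard problem.', 'Luckily, the NLTK can help!', "Let's see how."]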
Combining the sent_tokenize() function with the word_tokenize() function:
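A sketch of the combination using a list comprehension, reusing the assumed text from above:

words_by_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
print(words_by_sentence)   # a list of lists: one inner list of word tokens per sentence
# [['Tokenization', 'is', 'a', 'hard', 'problem', '.'],
#  ['Luckily', ',', 'the', 'NLTK', 'can', 'help', '!'],
#  ['Let', "'s", 'see', 'how', '.']]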
Tweet Tokenization
Social media text is a prime example of why tokenization is a hard problem to solve.
To help, the NLTK offers the TweetTokenizer class to handle the specific challenges of tokenizing tweets
(e.g., for sentiment analysis).
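A sketch with an assumed tweet; strip_handles and reduce_len are optional TweetTokenizer arguments that remove @-mentions and shorten exaggerated character runs:

from nltk.tokenize import TweetTokenizer

tweet = "@dave OMG this course is soooooo good!!! #NLP :-D"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tokenizer.tokenize(tweet))
# ['OMG', 'this', 'course', 'is', 'sooo', 'good', '!', '!', '!', '#NLP', ':-D']

Notice that hashtags and emoticons survive as single tokens, which a standard word tokenizer would typically split apart.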
N-Grams
So far, the tokens we’ve seen correspond very closely to individual words or unigrams:
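For instance (an assumed token list for illustration):

word_tokenize("The quick brown fox jumped over the lazy dog.")
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']
# every token is a single word (or punctuation mark), i.e., a unigram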
To provide more insight into the structure of text, tokenization can also produce tokens of multiple
consecutive words or n-grams:
• Tokens consisting of two consecutive words are known as bigrams or 2-grams.
• Tokens consisting of three consecutive words are known as trigrams or 3-grams.
• While it is possible to create larger n-grams, there are diminishing returns in practice.
Bigrams
To create n-grams from a list of tokens, the NLTK provides the ngrams() function:
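A sketch using an assumed token list:

from nltk.util import ngrams

tokens = ['The', 'quick', 'brown', 'fox', 'jumped']
bigrams = list(ngrams(tokens, 2))   # ngrams() returns a generator, so wrap it in list()
print(bigrams)
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped')]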
NOTE - n-grams do not extend past the end of the unigram list.
Trigrams
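Along the same lines, a trigram sketch with the same assumed token list:

from nltk.util import ngrams

tokens = ['The', 'quick', 'brown', 'fox', 'jumped']
trigrams = list(ngrams(tokens, 3))   # same function, n = 3
print(trigrams)
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped')]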