5 Basic Text Processing
Word tokenization
How many words?
It's a complicated question: is 'uh' a word? How about 'main- mainly'?
• I do uh main- mainly business data processing
  – Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  – Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  – Wordform: the full inflected surface form
    • cat and cats = different wordforms

Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text
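A minimal sketch of these three steps with standard UNIX tools (the example strings are illustrative assumptions, not from the source, and each command is deliberately naive):

# 1. Word tokenization: turn every run of non-letters into a newline
echo "Seuss's cat in the hat is different from other cats!" | tr -sc 'A-Za-z' '\n'
# (note: this splits Seuss's into Seuss and s; real tokenizers must handle apostrophes)

# 2. Normalizing word formats: fold upper case to lower case
echo "Seuss" | tr 'A-Z' 'a-z'

# 3. Sentence segmentation: naively break at every period
echo "One sentence. Another one. A third one." | tr '.' '\n'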
Tokenization
• Tokenization is the process of splitting a string of text into a list of tokens. A token can be thought of as a constituent part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.
Source: https://www.kaggle.com/satishgunjal/tokenization-in-nlp
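Because tokens nest in this way, "how many tokens?" depends on the granularity chosen. A tiny illustration (the example strings are my own, not from the source):

# word tokens in one sentence
echo "a sentence is a token in a paragraph" | wc -w                  # -> 8
# sentence tokens in a tiny paragraph, naively one per '.'
printf 'First sentence. Second sentence.\n' | tr -dc '.' | wc -c     # -> 2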
Tokenization Techniques
Simple Tokenization in UNIX
One token per line, before sorting (left) and after sorting (right):

Tokenized     Sorted
THE           A
SONNETS       A
by            A
William       A
Shakespeare   A
From          A
fairest       A
creatures     A
We            A
...           ...
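A sketch of the pipeline that produces these columns, assuming the text sits in a plain-text file shakes.txt (the filename the later commands also use): tr turns every run of non-letters into a newline so each token is on its own line, sort brings identical tokens together, and uniq -c counts each type.

tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c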
More counting Issues in Tokenization
• Merging upper and lower case • Finland’s capital Finland Finlands Finland’s ?
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c
• what’re, I’m, isn’t What are, I am, is not
• Sorting the counts • Hewlett-Packard Hewlett Packard ?
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r • state-of-the-art state of the art ?
23243 the • Lowercase lower-case lowercase lower case ?
22225 i
18618 and • San Francisco one token or two?
16339 to
15687 of • m.p.h., PhD. ??
12780 a
12163 you
10839 my
10005 in
8954 d
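The 8954 occurrences of the bare token d are an artifact of splitting contractions such as I'd and he'd at the apostrophe. One possible tweak (an assumption, not one of the original commands) is to keep apostrophes inside the token character class:

# keep apostrophes inside tokens so i'd survives as one token
tr 'A-Z' 'a-z' < shakes.txt | tr -sc "a-z'" '\n' | sort | uniq -c | sort -n -r
# caveat: quotation marks written as apostrophes will now stick to tokens too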
Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures
• Potentially more powerful, but less efficient
  – Case is helpful (US versus us is important), as sketched below
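A tiny sketch of the point about case (the input is my own example): folding case conflates the country US with the pronoun us when counting types.

printf 'US\nus\n' | sort | uniq -c                     # two types: 1 US, 1 us
printf 'US\nus\n' | tr 'A-Z' 'a-z' | sort | uniq -c    # one type:  2 us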