Text preprocessing
Word Tokenization
Tokenization is the process of segmenting a string of characters into
tokens (words).
An example
I have a can opener; but I can’t open these cans.
Word Tokens: 11
Word Types: 10
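The token/type counts above can be reproduced with a small sketch; the regex keeps internal apostrophes so "can't" stays a single token (this tokenization choice is one of several reasonable ones):

```python
import re
from collections import Counter

text = "I have a can opener; but I can't open these cans."
# One token per word; internal apostrophes are kept ("can't" is one token),
# punctuation is dropped.
tokens = re.findall(r"\w+(?:'\w+)*", text)
types = Counter(tokens)

print("Word tokens:", len(tokens))   # 11 ("I" occurs twice)
print("Word types:", len(types))     # 10 distinct forms
```

Note that "can", "can't", and "cans" count as three different types here; only the repeated "I" collapses.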
Hyphenation
End-of-Line Hyphen: Used for splitting whole words into parts for text
justification, e.g. “... apparently, mid-dle English followed this practice...”
Lexical Hyphen: Certain prefixes are often written hyphenated, e.g. co-,
pre-, meta-, multi-, etc.
Sententially Determined Hyphenation: Mainly to prevent incorrect
parsing of the phrase. e.g. State-of-the-art, three-to-five-year, etc.
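A common preprocessing step is to undo end-of-line hyphenation while leaving lexical and sentential hyphens alone. A minimal sketch (the simple rule below rejoins any word split across a line break, so it would also wrongly join a lexical hyphen that happens to fall at a line end):

```python
import re

def dehyphenate(text):
    # Rejoin words split across a line break for justification
    # ("mid-\ndle" -> "middle"); hyphens not at a line break are untouched,
    # so "state-of-the-art" survives.
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

print(dehyphenate("apparently, mid-\ndle English"))
print(dehyphenate("state-of-the-art"))
```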
French
l’ensemble: want to match with un ensemble
German
Noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
“life insurance company employee”
Sanskrit
Very long compound words
Japanese
Further complications with multiple alphabets intermingled.
Why is it difficult?
Are “!” and “?” ambiguous? No
Is period “.” ambiguous? Yes
Abbreviations (Dr., Mr., m.p.h.)
Numbers (2.4%, 4.3)
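These observations suggest a simple rule-based sentence splitter: treat "!" and "?" as unambiguous boundaries, and accept "." only when it does not end a known abbreviation or sit inside a number. A toy sketch (the abbreviation list is an assumed stand-in; real systems use much larger lists or learned classifiers):

```python
import re

# Assumed, tiny abbreviation list for illustration.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "m.p.h.", "U.S.A."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        i = m.end()
        token = text[:i].split()[-1]          # the word ending at this mark
        if m.group() == "." and (token in ABBREVIATIONS
                                 or re.fullmatch(r"\d+(\.\d+)?%?\.?", token)):
            continue                          # abbreviation or number: no boundary
        sentences.append(text[start:i].strip())
        start = i
    if text[start:].strip():                  # trailing text without a final mark
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith drove 4.3 miles. Amazing!"))
```

The "Dr." and "4.3" periods are skipped; the sentence-final "." and the "!" are accepted as boundaries.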
Why “normalize”?
Indexed text and query terms must have the same form.
U.S.A. and USA should be matched
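One common normalization is case-folding plus dropping periods, which makes U.S.A. and USA match. A minimal sketch (real IR systems apply such rules selectively, since aggressive normalization can conflate distinct terms):

```python
def normalize(term):
    # Case-fold and remove periods so "U.S.A." and "USA" map to
    # the same index/query form, "usa".
    return term.lower().replace(".", "")

print(normalize("U.S.A."))
print(normalize("U.S.A.") == normalize("USA"))
```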
http://text-processing.com/demo/tokenize/
Simple Tokenization in UNIX
Given a text file, output the word tokens and their frequencies
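The classic pipeline for this uses only `tr`, `sort`, and `uniq`; the sample file and its contents below are illustrative:

```shell
# Create a small sample file (the filename sample.txt is an example).
printf 'The cat sat on the mat\n' > sample.txt

tr -sc 'A-Za-z' '\n' < sample.txt |  # map every non-letter to a newline: one token per line
  tr 'A-Z' 'a-z' |                   # case-fold
  sort | uniq -c | sort -rn          # count each word type, most frequent first
```

`uniq -c` only merges adjacent duplicate lines, which is why the first `sort` is required before it.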
1/24/2022
Lemmatization in Python
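In practice one would use NLTK's WordNetLemmatizer here; to keep the example self-contained, the sketch below substitutes a tiny hand-written lemma dictionary (the LEMMAS table is an assumed stand-in, not a real resource):

```python
# Minimal dictionary-backed lemmatizer sketch. A real lemmatizer
# (e.g. NLTK's WordNetLemmatizer) consults a full lexicon and POS tags.
LEMMAS = {"am": "be", "are": "be", "is": "be", "better": "good", "cars": "car"}

def lemmatize(word):
    # Case-fold, then look up the lemma; fall back to the word itself.
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("better"))   # irregular form mapped to its lemma
print(lemmatize("Cars"))     # case-folded, then looked up
```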
Morphology
Morphology studies the internal structure of words, how words are built
up from smaller meaningful units called morphemes
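A toy illustration of morpheme structure via affix stripping; the prefix and suffix lists are assumed examples, and real morphological analyzers use finite-state transducers over full lexicons:

```python
# Segment a word into prefix + stem + suffix using small assumed affix lists.
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word):
    morphemes = []
    for p in PREFIXES:                       # strip at most one prefix
        if word.startswith(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = next((s for s in SUFFIXES if word.endswith(s)), None)
    if suffix:                               # strip at most one suffix
        morphemes.extend([word[:-len(suffix)], suffix])
    else:
        morphemes.append(word)
    return morphemes

print(segment("unhappiness"))   # ['un', 'happi', 'ness']
```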
Stemming
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
...
If the first two rules of Step 1b are successful, the following is
done: AT → ATE (conflat(ed) → conflate)
BL → BLE (troubl(ed) → trouble)
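Step 1a can be sketched directly as an ordered cascade of suffix rules, where the first matching rule wins (the full algorithm has more steps plus conditions on the stem, all omitted here):

```python
def porter_step1a(word):
    # Ordered suffix rules from Step 1a; longest/most specific first.
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss
    if word.endswith("ies"):
        return word[:-2]      # ies -> i
    if word.endswith("ss"):
        return word           # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]      # s -> (deleted)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step1a(w))
```

Rule order matters: testing plain "s" first would wrongly turn "caress" into "cares".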
Porter’s algorithm
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
...
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
...
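Steps 2 and 3 are naturally expressed as suffix-rewrite tables applied longest-suffix-first; the sketch below covers only the rules listed above and omits the "measure" condition the real algorithm checks before rewriting:

```python
# Suffix-rewrite tables for the Step 2 and Step 3 rules shown above.
STEP2 = {"ational": "ate", "izer": "ize", "ator": "ate"}
STEP3 = {"al": "", "able": "", "ate": ""}

def apply_rules(word, rules):
    # Try longer suffixes first so "ational" beats "al"; rewrite at most once.
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            return word[:-len(suffix)] + rules[suffix]
    return word

print(apply_rules("relational", STEP2))   # Step 2: ational -> ate
print(apply_rules("revival", STEP3))      # Step 3: al -> (deleted)
```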