
Lec 2

The document discusses corpus-based work and text normalization in natural language processing, emphasizing the importance of tokenization, sentence segmentation, and handling formatting issues. It highlights challenges such as hyphenation, contractions, and homographs, as well as methods for sentence boundary detection. Additionally, it mentions the use of software tools for processing corpora and the complexities involved in defining and normalizing tokens.

Uploaded by

Tooba Liaquat

Corpus and text normalization

Corpus-Based Work
 Text corpora are usually big and often represent
samples of some population of interest. For
example, the Brown Corpus, collected by Kucera
and Francis, was designed as a representative
sample of written American English. A balance of
subtypes (e.g., genres) is often desired.
 Corpus work involves collecting a large number
of counts from corpora, and these counts need to
be accessed quickly.
 Software tools exist for processing corpora.

Mar 16, 2025 Natural Language Processing 2


Text Normalization
 Tokenizing (segmenting) words
 Normalizing word formats
 Segmenting sentences

Looking at text
 Markup
 Removing junk formatting/content
Examples include document headers and separators,
typesetter codes, tables and diagrams, and garbled data
in the file. Problems arise if the data was obtained via
OCR (unrecognized words). Junk content may need to be
removed before any processing begins.
 Upper case/Lower case
 Should we ignore case?
 Distinction between Richard Brown and brown paint
 One heuristic is to lowercase the first word of each
sentence

Tokenization: What is a Word?
 Early in processing, we must divide the input
text into meaningful units called tokens (e.g.,
words, numbers, punctuation).
 Tokenization is the process of breaking the
input character stream into tokens to be
normalized and saved.
 One practical definition: a token is a string of
contiguous alphanumeric characters with space
on either side; it may include hyphens and
apostrophes, but no other punctuation marks.
 There are problems with this definition, though.
Problems: Micro$oft or :-)
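The practical definition above can be sketched as a one-line regular expression. This is only an illustrative sketch, and it shows exactly where the definition breaks down on strings like Micro$oft:

```python
import re

# Tokens per the practical definition: runs of alphanumeric characters,
# optionally joined by internal hyphens or apostrophes.
def simple_tokenize(text):
    return re.findall(r"\w+(?:['-]\w+)*", text)

# "$" is not allowed inside a token, so Micro$oft splits in two,
# and the smiley ":-)" disappears entirely:
print(simple_tokenize("Micro$oft isn't a well-formed token :-)"))
# -> ['Micro', 'oft', "isn't", 'a', 'well-formed', 'token']
```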

Tokenization issues
 Dealing with full stops
 Words are not always surrounded by white space
 Punctuation marks such as , ; . denote the ends of words
 One cannot simply remove such marks, however.
 Case in point: how to deal with etc. or Calif.
 And other standard and non-standard
abbreviations
 If etc. appears at the end of a sentence, its dot
also serves as the full stop. In linguistic parlance
this phenomenon is called haplology.
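A toy illustration of the full-stop problem: a naive split on ". " treats the period in an abbreviation like Prof. as a sentence boundary, while a small known-abbreviation list (a hypothetical one here) lets a tokenizer keep the dot attached:

```python
# A naive split on ". " wrongly treats the period in "Prof." as a
# sentence boundary.
text = "Prof. Smith teaches NLP. He is famous."
print(text.split(". "))   # ['Prof', 'Smith teaches NLP', 'He is famous.']

# A (toy) known-abbreviation list lets a tokenizer keep the dot attached:
ABBREVIATIONS = {"etc.", "Calif.", "Prof.", "vs.", "Jr."}
print([w for w in text.split() if w in ABBREVIATIONS])   # ['Prof.']
```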

Some of the Problems: Hyphens
 How should we deal with hyphens? Do hyphenated
words comprise one token or several? Usage:
1. Typographical: to improve the right margin of a document.
Typically these hyphens should be removed, since breaks
occur at syllable boundaries; however, the hyphen may
also be part of the word.
2. Lexical hyphens: inserted before or after small word
formatives (e.g., co-operate, so-called, pro-university).
3. Word grouping: take-it-or-leave-it, once-in-a-lifetime,
text-based, etc.
 How many lexemes will you allow?
 Data base, data-base, database
 Cooperate, co-operate
 Mark-up, mark up
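One possible normalization policy (an assumption for illustration, not a prescription from the slide) is to collapse hyphens and spaces so the spelling variants above map to a single dictionary form:

```python
# Collapse hyphens and internal spaces, then lowercase, so spelling
# variants of one lexeme share a single normalized form.
def normalize_lexeme(word):
    return word.replace("-", "").replace(" ", "").lower()

variants = ["Data base", "data-base", "database", "Cooperate", "Co-operate"]
print({v: normalize_lexeme(v) for v in variants})
```

Whether this is the right policy depends on the application; "mark up" (verb) and "mark-up" (noun) arguably should stay distinct.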

Some of the Problems: Whitespace
 White space not indicating a word break
 Classic examples are phone numbers
 Proper nouns
 New York or San Francisco
 An example where hyphenation can cause trouble
 The New York-New Haven railroad
 Creating a token “York-New” is meaningless

Contractions
 I’m right.
 He isn’t funny.
 Child’s health
 Babies’ toys
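The clitics above can be split off in the style of Treebank tokenizers. This is a rough, simplified sketch, not the actual Treebank rules:

```python
import re

# Split "n't" and "'x" clitics off their host word (simplified sketch).
def split_contraction(token):
    m = re.match(r"(.+)(n't)$", token)
    if m:
        return [m.group(1), m.group(2)]
    m = re.match(r"(.+)('(?:m|s|re|ve|ll|d)?)$", token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

for t in ["I'm", "isn't", "Child's", "Babies'"]:
    print(t, "->", split_contraction(t))
```

Note the residual ambiguity: "'s" may be a contraction of is or a possessive marker, which a tokenizer alone cannot resolve.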

Phone number representation

Tokenized Text

Some of the Problems:
Homographs
 In some cases, lexemes have
overlapping forms (homographs) as
in:
 I saw the dog.
 The saw is sharp.
 These forms will need to be
distinguished for part-of-speech
tagging.

Some of the Problems: No
Space between Words
 Languages like Chinese have no separators
between words, so English tokenization methods
cannot be applied directly.
 Waterloo is located in the south of
Canada.
 Compounds in German:
Lebensversicherungsgesellschaftsangestellter
(life insurance company employees)
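A standard baseline for segmenting text without word separators is greedy longest-match ("maxmatch") against a dictionary. The tiny dictionary below is a toy assumption, demonstrated on the German compound above:

```python
# Toy dictionary of German compound parts (an assumption for illustration).
DICTIONARY = {"lebens", "versicherungs", "gesellschafts", "angestellter",
              "leben", "gesellschaft"}

# Greedy longest-match ("maxmatch") segmentation.
def maxmatch(text, dictionary, max_len=20):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it alone
            i += 1
    return tokens

print(maxmatch("lebensversicherungsgesellschaftsangestellter", DICTIONARY))
# -> ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']
```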
Morphology: What Should I Put
in My Dictionary?
 Speech Corpora
 Morphology
 Stemming
 The idea is to extract the root of the word
and use it for other purposes.
 Not that helpful in English (from an IR point
of view)
 Perhaps more useful for other languages or
in other contexts

Morphology: What Should I Put
in My Dictionary?
 Example: if you are looking for
“business”, extract the stem of that
word, “busy”, and use it to retrieve
relevant documents from a collection,
the result would be underwhelming.
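The kind of conflation the example warns about can be reproduced with a toy suffix-stripper (not the Porter algorithm or any real stemmer):

```python
# A crude suffix-stripper, for illustration only: it removes the first
# matching suffix if enough of the word remains.
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(crude_stem("business"))  # 'busi' -- collapses toward "busy"-like forms
print(crude_stem("cooking"))   # 'cook'
```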

What is a Sentence?
 Something ending with a ‘.’, ‘?’ or ‘!’. True in
90% of the cases.
 Sentences may be split up by other
punctuation marks (e.g., : ; --).
 Sentences may be broken up, as in: “You
should be here,” she said, “before I know it!”
 Quote marks may be at the very end of the
sentence.
 Identifying sentence boundaries can involve
hand-coded heuristic methods. Some effort has
also been made to automate sentence-boundary
detection.

Heuristic Algorithm for
Sentence Boundary Detection
 Place putative sentence boundaries after all
occurrences of . ? !
 Move the boundary after following quotation
marks, if any.
 Disqualify a period boundary in the following
circumstances:
 If it is preceded by a known abbreviation of
a sort that does not normally occur word-
finally but is commonly followed by a
capitalized proper name, such as Prof. or vs.

 If it is preceded by a known abbreviation
and not followed by an uppercase word. This
will deal correctly with most usage of
abbreviations like etc. or Jr. which can occur
sentence medially or finally.
 Disqualify a boundary with a ? or ! if:
 it is followed by a lowercase letter (or a
known name)
 Regard other putative sentence boundaries as
sentence boundaries.
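The heuristic above can be sketched in a few lines of code. The abbreviation lists here are small toy assumptions; a real system would use much larger ones:

```python
import re

# Toy abbreviation lists (assumptions, not from the slide).
NEVER_FINAL = {"Prof.", "vs.", "Dr.", "Mr."}   # usually followed by a name
SOMETIMES_FINAL = {"etc.", "Jr.", "Calif."}    # may or may not end a sentence

def sentence_boundaries(text):
    sentences, start = [], 0
    # Putative boundaries after . ? ! (moved past a following quote mark).
    for m in re.finditer(r'[.?!]["\']?', text):
        end = m.end()
        prev_word = text[start:end].split()[-1]
        rest = text[end:].lstrip()
        if prev_word in NEVER_FINAL:
            continue   # e.g. Prof., vs.: never a boundary
        if prev_word in SOMETIMES_FINAL and rest and not rest[0].isupper():
            continue   # e.g. etc. followed by a lowercase word
        if m.group()[0] in "?!" and rest and rest[0].islower():
            continue   # ? or ! followed by a lowercase letter
        sentences.append(text[start:end].strip())
        start = end
    return sentences

text = ("Ali lives in Calif. He is a student of Prof. Kifor. "
        "He eats apples, mangoes etc. and corn too. Does he eat kiwi too?")
for s in sentence_boundaries(text):
    print(s)
```

Note how the sketch keeps "Prof. Kifor" and "etc. and" inside one sentence while still splitting after "Calif.", matching the rules on the two slides above.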

Example
 Ali lives in Calif. He is a student of Prof.
Kifor and he is interested in India vs.
Pakistan cricket match. He eats apple,
orange, mango etc. Does he eat kiwi
too? is a question he is often asked
about. Aaah! what a fruit is kiwi. I
simply love it!

 Ali lives in Calif. He is a student of Prof.
Kifor and he is interested in India vs.
Pakistan cricket match. He eats apple,
orange, mango etc. and corn meal
too. Does he eat kiwi too? Asad often
asks this question. Aaah! what a fruit
is kiwi. I simply love it!

# import nltk                                    # needed only for word_tokenize
# from nltk.tokenize import word_tokenize
import re

# Longer test sentence (commented out):
# corpus = ("Ali is a good guy. He lives in Lahore. He spends 3-4 hours in study every day. "
#           "His CGPA is 3.7. He does not SPEND more than 500 Rs. per day. "
#           "His contact number is 042-1113456. But it has been days since I could make him a call.")
corpus = "this -5 BAG 2-3 Doesn't 2.67 worth 87-987 $4.7."

# print(word_tokenize(corpus))                   # NLTK's default tokenizer
# print(re.findall(r"[\w'$.]+", corpus))         # words, keeping ' $ . inside tokens
# print(re.findall(r"\$[0-9]+\.[0-9]+", corpus)) # dollar amounts: $4.7
# print(re.findall(r"[A-Z][A-Z]+", corpus))      # all-caps words: BAG
# print(re.findall(r"[0-9]+-[0-9]+", corpus))    # number ranges: 2-3, 87-987
# print(re.findall(r"\.", corpus))               # literal periods
# print(re.findall(r"\w+'\w+", corpus))          # contractions: Doesn't
# Contractions, capitalized, lowercase, or all-caps words:
print(re.findall(r"\w+'\w+|[A-Z][a-z]+|[a-z][a-z]+|[A-Z][A-Z]+", corpus))
# -> ['this', 'BAG', "Doesn't", 'worth']
