
Lec 2

The document discusses corpus-based work and text normalization in natural language processing, emphasizing the importance of tokenization, sentence segmentation, and handling formatting issues. It highlights challenges such as hyphenation, contractions, and homographs, as well as methods for sentence boundary detection. Additionally, it mentions the use of software tools for processing corpora and the complexities involved in defining and normalizing tokens.

Uploaded by

Tooba Liaquat

Corpus and text normalization

Corpus-Based Work
 Text corpora are usually big and often represent
samples of some population of interest. For
example, the Brown Corpus, collected by Kucera
and Francis, was designed as a representative
sample of written American English. A balance of
subtypes (e.g., genres) is often desired.
 Corpus work involves collecting a large number
of counts from corpora, and these counts need to
be accessed quickly.
 Software tools exist for processing corpora.

Mar 16, 2025 Natural Language Processing 2


Text Normalization
 Tokenizing (segmenting) words
 Normalizing word formats
 Segmenting sentences

Looking at text
 Markup
 Removing junk formatting/content
Examples include document headers and separators,
typesetter codes, tables and diagrams, and garbled data
in the file. Problems arise if the data was obtained via
OCR (unrecognized words). Junk content may need to be
removed before any processing begins.
 Upper case/Lower case
 Should we ignore case?
 Distinction between Richard Brown and brown paint
 One heuristic is to lowercase the first word of each
sentence

Tokenization: What is a Word?
 Early in processing, we must divide the input
text into meaningful units called tokens (e.g.,
words, numbers, punctuation).
 Tokenization is the process of breaking the
input character stream into tokens to be
normalized and saved.
 One practical definition: a token is a string of
contiguous alphanumeric characters with space
on either side; it may include hyphens and
apostrophes, but no other punctuation marks.
 There are problems with this definition, though.
Problems: Micro$oft or :-)
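The practical definition above can be sketched as a one-line regular expression. This is only an illustrative sketch, and it shows exactly where the definition breaks down on strings like Micro$oft:

```python
import re

# Tokens per the practical definition: runs of alphanumeric characters,
# optionally joined by internal hyphens or apostrophes.
def simple_tokenize(text):
    return re.findall(r"\w+(?:['-]\w+)*", text)

# "$" is not allowed inside a token, so Micro$oft splits in two,
# and the smiley ":-)" disappears entirely:
print(simple_tokenize("Micro$oft isn't a well-formed token :-)"))
# -> ['Micro', 'oft', "isn't", 'a', 'well-formed', 'token']
```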

Tokenization issues
 Dealing with full stops
 Words are not always surrounded by white space
 Punctuation marks such as , ; . denote the ends of words
 One cannot simply remove such marks, however.
 Case in point: how to deal with etc. or Calif.
 And other standard and non-standard
abbreviations
 If etc. appears at the end of a sentence, its dot
also serves as the full stop. In linguistic parlance
this phenomenon is called haplology.
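A toy illustration of the full-stop problem: a naive split on ". " treats the period in an abbreviation like Prof. as a sentence boundary, while a small known-abbreviation list (a hypothetical one here) lets a tokenizer keep the dot attached:

```python
# A naive split on ". " wrongly treats the period in "Prof." as a
# sentence boundary.
text = "Prof. Smith teaches NLP. He is famous."
print(text.split(". "))   # ['Prof', 'Smith teaches NLP', 'He is famous.']

# A (toy) known-abbreviation list lets a tokenizer keep the dot attached:
ABBREVIATIONS = {"etc.", "Calif.", "Prof.", "vs.", "Jr."}
print([w for w in text.split() if w in ABBREVIATIONS])   # ['Prof.']
```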

Some of the Problems: Hyphens
 How should we deal with hyphens? Do hyphenated
words comprise one token or several? Usage:
1. Typographical: to improve the right margin of a document.
Typically these hyphens should be removed, since breaks
occur at syllable boundaries; however, the hyphen may
also be part of the word.
2. Lexical hyphens: inserted before or after small word
formatives (e.g., co-operate, so-called, pro-university).
3. Word grouping: take-it-or-leave-it, once-in-a-lifetime,
text-based, etc.
 How many lexemes will you allow?
 Data base, data-base, database
 Cooperate, co-operate
 Mark-up, mark up
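One possible normalization policy (an assumption for illustration, not a prescription from the slide) is to collapse hyphens and spaces so the spelling variants above map to a single dictionary form:

```python
# Collapse hyphens and internal spaces, then lowercase, so spelling
# variants of one lexeme share a single normalized form.
def normalize_lexeme(word):
    return word.replace("-", "").replace(" ", "").lower()

variants = ["Data base", "data-base", "database", "Cooperate", "Co-operate"]
print({v: normalize_lexeme(v) for v in variants})
```

Whether this is the right policy depends on the application; "mark up" (verb) and "mark-up" (noun) arguably should stay distinct.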

Some of the Problems: Whitespace
 White space not indicating a word break
 Classic examples are phone numbers
 Proper nouns
 New York or San Francisco
 An example where hyphenation can cause trouble
 The New York-New Haven railroad
 Creating a token “York-New” is meaningless

Contractions
 I’m right.
 He isn’t funny.
 Child’s health
 Babies’ toys
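The clitics above can be split off in the style of Treebank tokenizers. This is a rough, simplified sketch, not the actual Treebank rules:

```python
import re

# Split "n't" and "'x" clitics off their host word (simplified sketch).
def split_contraction(token):
    m = re.match(r"(.+)(n't)$", token)
    if m:
        return [m.group(1), m.group(2)]
    m = re.match(r"(.+)('(?:m|s|re|ve|ll|d)?)$", token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

for t in ["I'm", "isn't", "Child's", "Babies'"]:
    print(t, "->", split_contraction(t))
```

Note the residual ambiguity: "'s" may be a contraction of is or a possessive marker, which a tokenizer alone cannot resolve.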

Phone number representation

Tokenized Text

Some of the Problems:
Homographs
 In some cases, lexemes have
overlapping forms (homographs) as
in:
 I saw the dog.
 The saw is sharp.
 These forms will need to be
distinguished for part-of-speech
tagging.

Some of the Problems: No
Space between Words
 Languages like Chinese have no separators
between words, so English tokenization methods
cannot be applied directly.
 Waterloo is located in the south of
Canada.
 Compounds in German:
Lebensversicherungsgesellschaftsangestellter
(life insurance company employees)
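A standard baseline for segmenting text without word separators is greedy longest-match ("maxmatch") against a dictionary. The tiny dictionary below is a toy assumption, demonstrated on the German compound above:

```python
# Toy dictionary of German compound parts (an assumption for illustration).
DICTIONARY = {"lebens", "versicherungs", "gesellschafts", "angestellter",
              "leben", "gesellschaft"}

# Greedy longest-match ("maxmatch") segmentation.
def maxmatch(text, dictionary, max_len=20):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it alone
            i += 1
    return tokens

print(maxmatch("lebensversicherungsgesellschaftsangestellter", DICTIONARY))
# -> ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']
```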
Morphology: What Should I Put
in My Dictionary?
 Speech Corpora
 Morphology
 Stemming
 The idea is to extract the root of the word
and use it for other purposes.
 Not that helpful in English (from an IR point
of view)
 Perhaps more useful for other languages or
in other contexts

Morphology: What Should I Put
in My Dictionary?
 Example: if you are looking for
“business”, extract the stem of that
word, “busy”, and use it to retrieve
relevant documents from a collection,
the result would be underwhelming.
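The kind of conflation the example warns about can be reproduced with a toy suffix-stripper (not the Porter algorithm or any real stemmer):

```python
# A crude suffix-stripper, for illustration only: it removes the first
# matching suffix if enough of the word remains.
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(crude_stem("business"))  # 'busi' -- collapses toward "busy"-like forms
print(crude_stem("cooking"))   # 'cook'
```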

What is a Sentence?
 Something ending with a ‘.’, ‘?’ or ‘!’. True in
90% of the cases.
 Sentences may be split up by other
punctuation marks (e.g., : ; --).
 Sentences may be broken up, as in: “You
should be here,” she said, “before I know it!”
 Quote marks may be at the very end of the
sentence.
 Identifying sentence boundaries can involve
hand-coded heuristic methods. Some effort has
also been made to automate sentence-boundary
detection.

Heuristic Algorithm for
Sentence Boundary Detection
 Place putative sentence boundaries after all
occurrences of . ? !
 Move the boundary after following quotation
marks, if any.
 Disqualify a period boundary in the following
circumstances:
 If it is preceded by a known abbreviation of
a sort that does not normally occur word-
finally but is commonly followed by a
capitalized proper name, such as Prof. or vs.

 If it is preceded by a known abbreviation
and not followed by an uppercase word. This
will deal correctly with most usage of
abbreviations like etc. or Jr. which can occur
sentence medially or finally.
 Disqualify a boundary with a ? or ! if:
 it is followed by a lowercase letter (or a
known name)
 Regard other putative sentence boundaries as
sentence boundaries.
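The heuristic above can be sketched in a few lines of code. The abbreviation lists here are small toy assumptions; a real system would use much larger ones:

```python
import re

# Toy abbreviation lists (assumptions, not from the slide).
NEVER_FINAL = {"Prof.", "vs.", "Dr.", "Mr."}   # usually followed by a name
SOMETIMES_FINAL = {"etc.", "Jr.", "Calif."}    # may or may not end a sentence

def sentence_boundaries(text):
    sentences, start = [], 0
    # Putative boundaries after . ? ! (moved past a following quote mark).
    for m in re.finditer(r'[.?!]["\']?', text):
        end = m.end()
        prev_word = text[start:end].split()[-1]
        rest = text[end:].lstrip()
        if prev_word in NEVER_FINAL:
            continue   # e.g. Prof., vs.: never a boundary
        if prev_word in SOMETIMES_FINAL and rest and not rest[0].isupper():
            continue   # e.g. etc. followed by a lowercase word
        if m.group()[0] in "?!" and rest and rest[0].islower():
            continue   # ? or ! followed by a lowercase letter
        sentences.append(text[start:end].strip())
        start = end
    return sentences

text = ("Ali lives in Calif. He is a student of Prof. Kifor. "
        "He eats apples, mangoes etc. and corn too. Does he eat kiwi too?")
for s in sentence_boundaries(text):
    print(s)
```

Note how the sketch keeps "Prof. Kifor" and "etc. and" inside one sentence while still splitting after "Calif.", matching the rules on the two slides above.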

Example
 Ali lives in Calif. He is a student of Prof.
Kifor and he is interested in India vs.
Pakistan cricket match. He eats apple,
orange, mango etc. Does he eat kiwi
too? is a question he is often asked
about. Aaah! what a fruit is kiwi. I
simply love it!

 Ali lives in Calif. He is a student of Prof.
Kifor and he is interested in India vs.
Pakistan cricket match. He eats apple,
orange, mango etc. and corn meal
too. Does he eat kiwi too? Asad often
asks this question. Aaah! what a fruit
is kiwi. I simply love it!

# import nltk                                    # needed only for word_tokenize
# from nltk.tokenize import word_tokenize
import re

# Longer test sentence (commented out):
# corpus = ("Ali is a good guy. He lives in Lahore. He spends 3-4 hours in study every day. "
#           "His CGPA is 3.7. He does not SPEND more than 500 Rs. per day. "
#           "His contact number is 042-1113456. But it has been days since I could make him a call.")
corpus = "this -5 BAG 2-3 Doesn't 2.67 worth 87-987 $4.7."

# print(word_tokenize(corpus))                   # NLTK's default tokenizer
# print(re.findall(r"[\w'$.]+", corpus))         # words, keeping ' $ . inside tokens
# print(re.findall(r"\$[0-9]+\.[0-9]+", corpus)) # dollar amounts: $4.7
# print(re.findall(r"[A-Z][A-Z]+", corpus))      # all-caps words: BAG
# print(re.findall(r"[0-9]+-[0-9]+", corpus))    # number ranges: 2-3, 87-987
# print(re.findall(r"\.", corpus))               # literal periods
# print(re.findall(r"\w+'\w+", corpus))          # contractions: Doesn't
# Contractions, capitalized, lowercase, or all-caps words:
print(re.findall(r"\w+'\w+|[A-Z][a-z]+|[a-z][a-z]+|[A-Z][A-Z]+", corpus))
# -> ['this', 'BAG', "Doesn't", 'worth']
