Natural Language Processing
Text Normalization & Corpus
Text Normalization
• Conversion of text containing nonstandard "words" (numbers, abbreviations,
misspellings) into standard words.
Examples:
u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.
"$200" would be pronounced as "two hundred dollars" in English.
• Text normalization requires being aware of what type of text is to be
normalized and how it is to be processed afterwards; there is no all-purpose
normalization procedure.
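The two examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table, the `number_to_words` helper, and the `normalize` function are hypothetical names invented here, the number speller covers only 0..999, and a real normalizer would need a far larger lexicon chosen for the text type at hand.

```python
import re

# Hypothetical abbreviation table covering the texting example above.
ABBREVIATIONS = {
    "u": "you", "r": "are", "dng": "doing", "btr": "better", "thn": "than",
    "ny": "any", "autmtc": "automatic", "txt": "text",
    "nrmlztion": "normalization", "prgrm": "program", "cn": "can",
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (sketch only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand small dollar amounts and known abbreviations."""
    def expand_dollars(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # "$200" -> "two hundred dollars"; amounts above 999 are left untouched.
    text = re.sub(r"\$(\d{1,3})\b", expand_dollars, text)
    # Replace each whitespace-separated token found in the abbreviation table.
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)


print(normalize("It costs $200"))
print(normalize("u r dng btr thn ny autmtc txt nrmlztion prgrm cn do."))
```

The table-driven design makes the point of the last bullet concrete: swapping in a different table (for example, medical abbreviations) changes what counts as "normal" without touching the code.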
• Text normalization is frequently used when converting text to speech.
• Numbers, dates, acronyms, and abbreviations are non-standard "words"
that need to be pronounced differently depending on context.
• Given a string of characters in a text, what is the (reasonable) set of possible
actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
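These two questions (which candidates exist, and which one fits the context) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `CANDIDATES` table, the function names, and especially the capitalization heuristic, which is far cruder than what a real system would use.

```python
# Hypothetical candidate table: each ambiguous token maps to its possible
# expansions, listed as [title reading, street-suffix reading].
CANDIDATES = {
    "dr.": ["doctor", "drive"],
    "st.": ["saint", "street"],
}


def candidates(token: str) -> list[str]:
    """Step 1: the set of words the token might stand for."""
    return CANDIDATES.get(token.lower(), [token])


def choose(tokens: list[str], i: int) -> str:
    """Step 2: pick one candidate for tokens[i] from its context.

    Toy rule (an assumption, not a real disambiguation model): if the next
    token is capitalized, take the title reading; otherwise take the
    street-suffix reading.
    """
    options = candidates(tokens[i])
    if len(options) == 1:
        return options[0]
    next_is_capitalized = i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
    return options[0] if next_is_capitalized else options[1]


def normalize(text: str) -> str:
    tokens = text.split()
    return " ".join(choose(tokens, i) for i in range(len(tokens)))


print(normalize("Dr. Smith lives on Elm Dr."))
```

Separating candidate generation from selection mirrors the two bullets above: the table answers the first question, and the context rule answers the second.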
Text Normalization Types
Word Form Normalization
• Words can have many inflected forms, but often the inflection is not important
and we only need to know the base form of the word.
This can be done by:
• Stemming: keeping only the stem of the word (usually by just deleting suffixes)
• economy, economic, economical, economically, economics, economize => econom
• Lemmatization: keeping only the lemma (the dictionary form of the word)
• produce, produces, produced, producing => produce
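The difference between the two approaches can be sketched as follows. This is a minimal illustration, not a standard algorithm: the `SUFFIXES` list is hand-picked so that the "econom" example works, and the `LEMMAS` table is a hypothetical stand-in for the dictionary a real lemmatizer would consult.

```python
# Suffixes to strip, ordered longest first so that "economically" loses
# "ically" rather than just "y". Hand-picked for this example only.
SUFFIXES = ["ically", "ical", "ize", "ics", "ic", "y"]

# Hypothetical lemma dictionary: inflected form -> dictionary form.
LEMMAS = {
    "produces": "produce",
    "produced": "produce",
    "producing": "produce",
    "mice": "mouse",
    "went": "go",
}


def stem(word: str) -> str:
    """Crude stemming: delete the first matching suffix, keeping >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Dictionary lookup: return the lemma if known, else the word itself."""
    return LEMMAS.get(word, word)


words = ["economy", "economic", "economical", "economically",
         "economics", "economize"]
print({w: stem(w) for w in words})
print([lemmatize(w) for w in ["produce", "produces", "produced", "went"]])
```

Note that the stemmer returns "econom", which is not itself a word, and would just as happily cut "city" down to "cit". The lemmatizer always returns a real word but only knows the forms in its dictionary.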
What is a Corpus
• A corpus is a large collection of texts: a body of written or spoken
material upon which a linguistic analysis is based.
• The plural form of corpus is corpora.
• Some popular corpora are the British National Corpus (BNC), the
COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.
• The European Corpus Initiative (ECI) corpus is multilingual, with 98 million
words in Turkish, Japanese, Russian, Chinese, and other languages.
• A corpus may be composed of written language, spoken language, or both.
A spoken corpus is usually in the form of audio recordings.
Types of Corpus
• A corpus may be open or closed. An open corpus is one which does not
claim to contain all data from a specific area while a closed corpus does
claim to contain all or nearly all data from a particular field. Medical
corpora, for example, are closed as there can be no further input to an area.
• Monolingual corpora represent only one language while bilingual corpora
represent two languages.
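• Parallel corpus: texts in one language paired with their translations in another.
• Balanced corpus: covers a range of text categories.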
Balanced Corpus
What should be covered in a balanced corpus?
Balanced: covers a range of text categories
• Definition depends upon the intended uses
• No true objective measure of balance
• Usually based on proportional sampling
• Balance can be based on a text typology, a classification of text types
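Proportional sampling over a text typology can be sketched as below. The category names, the target shares, and the `proportional_sample` function are all assumptions made for illustration; as the bullets note, the right proportions depend on the intended use and there is no objective measure of balance.

```python
import random


def proportional_sample(texts_by_category, target_shares, total, seed=0):
    """Draw `total` texts so each category gets roughly its target share.

    texts_by_category: dict mapping category name -> list of texts
    target_shares:     dict mapping category name -> fraction of the corpus
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = {}
    for category, share in target_shares.items():
        quota = round(total * share)
        pool = texts_by_category[category]
        if quota > len(pool):
            raise ValueError(f"not enough texts in category {category!r}")
        # Sample without replacement within the category.
        sample[category] = rng.sample(pool, quota)
    return sample


# Hypothetical typology with placeholder documents.
pools = {
    "news": [f"news-{i}" for i in range(100)],
    "fiction": [f"fiction-{i}" for i in range(50)],
    "academic": [f"academic-{i}" for i in range(50)],
}
shares = {"news": 0.5, "fiction": 0.3, "academic": 0.2}

corpus = proportional_sample(pools, shares, total=20)
print({category: len(texts) for category, texts in corpus.items()})
```

Because each quota is rounded independently, the category sizes may not always sum exactly to `total`; a production sampler would need to reconcile the remainder.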
Uses of Corpus
• A corpus provides grammarians, lexicographers, and other interested parties
with better descriptions of a language.
• Computer-processable corpora allow linguists to adopt the principle of total
accountability: retrieving all occurrences of a particular word or structure
(or a randomly selected sample of them) for inspection.
• Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic
information.
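Retrieving every occurrence of a word together with its surrounding context is commonly done with a keyword-in-context (KWIC) concordance. The sketch below is a minimal version; the `concordance` function name, the `width` parameter, and the simple regex tokenizer are assumptions made here for illustration.

```python
import re


def concordance(text: str, word: str, width: int = 3):
    """Return (left context, keyword, right context) for every occurrence.

    Tokenization is a simple lowercase split on word characters; `width`
    is the number of context tokens kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


sample_text = "The corpus is large. A corpus may be open or closed."
for left, keyword, right in concordance(sample_text, "corpus", width=2):
    print(f"{left:>10} [{keyword}] {right}")
```

To inspect a random subset instead of every hit, the returned list can be passed to `random.sample` from the standard library.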
Applications of Corpus
• Corpora are used in the development of NLP tools.
• Applications include spell checking, grammar checking, speech recognition
(speech-to-text), speech synthesis (text-to-speech), automatic abstracting and
indexing, information retrieval, and machine translation.
• Corpora are also used to create new dictionaries and grammars for
learners.
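As one concrete illustration of a corpus-driven tool, the sketch below builds a tiny spell checker from word frequencies in a corpus. It follows a common approach (generate all strings one edit away, keep those seen in the corpus, prefer the most frequent), but the function names and the miniature training text are assumptions made here; a usable checker would need a corpus of millions of words.

```python
import re
from collections import Counter


def train(corpus_text: str) -> Counter:
    """Count how often each lowercase word occurs in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word: str) -> set[str]:
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, counts: Counter) -> str:
    """Return the word if known, else its most frequent one-edit neighbour."""
    if word in counts:
        return word
    known = [w for w in edits1(word) if w in counts]
    return max(known, key=counts.get) if known else word


# Miniature stand-in for a real corpus.
counts = train("the corpus is a large collection of texts the corpus is large")
print(correct("corpsu", counts))
print(correct("larg", counts))
```

The quality of the corrections depends entirely on the corpus: a word missing from it can never be suggested, which is one reason large and balanced corpora matter for tool development.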