4 Natural Language Processing-Text Normalization

Uploaded by

calm_magician

Natural Language Processing
Text Normalization & Corpus
Text Normalization

• Conversion of text that includes 'nonstandard' words, such as numbers, abbreviations, and misspellings, into normal words.

Examples:
u r dng btr thn ny autmtc txt nrmlztion prgrm cn do.

"$200" would be pronounced as "two hundred dollars" in English.

• Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.
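The two examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ABBREVIATIONS` table and the helper names are hypothetical, the table covers just the sample sentence, and the currency rule handles whole-dollar amounts below 1000 only.

```python
import re

# Hypothetical lookup table built for the sample sentence only;
# a real normalizer would need a far larger, domain-specific lexicon.
ABBREVIATIONS = {
    "u": "you", "r": "are", "dng": "doing", "btr": "better", "thn": "than",
    "ny": "any", "autmtc": "automatic", "txt": "text",
    "nrmlztion": "normalization", "prgrm": "program", "cn": "can",
}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")

def expand_dollars(match):
    """Turn a matched '$N' into 'N dollars' spelled out in words."""
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return number_to_words(amount) + " " + unit

def normalize(text):
    # Step 1: expand whole-dollar amounts of up to three digits.
    text = re.sub(r"\$(\d{1,3})\b", expand_dollars, text)
    # Step 2: replace known abbreviations, leaving other words unchanged.
    return re.sub(
        r"[A-Za-z]+",
        lambda m: ABBREVIATIONS.get(m.group(0).lower(), m.group(0)),
        text,
    )

print(normalize("u r dng btr thn ny autmtc txt nrmlztion prgrm cn do."))
print(normalize("$200"))
```

A table like this only works for the text type it was built for, which is the point of the last bullet: the same token may need a different expansion in a different domain.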
Text Normalization

• Text normalization is frequently used when converting text to speech.

• Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context.
Text Normalization

• Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it?

• Which of those is right for the particular context?
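The two questions above (which readings are possible, and which one fits the context) might be sketched as follows. The `CANDIDATES` table and the neighbour heuristic are assumptions made for illustration, not a method taken from the slides; real systems typically use much richer context.

```python
# Hypothetical candidate table: each nonstandard token maps to its
# possible readings, listed as (title reading, street reading).
CANDIDATES = {
    "Dr.": ("Doctor", "Drive"),
    "St.": ("Saint", "Street"),
}

def expand(tokens):
    """Choose one reading per ambiguous token using the previous token."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in CANDIDATES:
            out.append(tok)
            continue
        title_reading, street_reading = CANDIDATES[tok]
        prev_tok = tokens[i - 1] if i > 0 else ""
        # Assumed heuristic: directly after a capitalised word the token is
        # read as a street suffix, otherwise as a title before a name.
        if prev_tok[:1].isupper():
            out.append(street_reading)
        else:
            out.append(title_reading)
    return out

print(" ".join(expand("Dr. Smith lives on Baker St.".split())))
```

The heuristic is deliberately crude (it would misread "Visit Dr. Smith"), which shows why choosing among candidates usually needs more than one neighbouring word.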


Text Normalization Types

Word Form Normalization

• Words can have many inflections, but often these are not important and we need to know only the base form of the word.
Can be done by:
• Stemming: keeping only the root of the word (usually just deleting suffixes)
• economy, economic, economical, economically, economics, economize => econom
• Lemmatization: keeping only the lemma (the dictionary form of the word)
• produce, produces, produced, producing => produce
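The difference between the two approaches can be illustrated with a toy sketch. Both the suffix list and the lemma dictionary below are assumptions chosen to fit the examples; established stemmers such as the Porter stemmer use more elaborate rules and may produce different stems.

```python
# Hypothetical suffix list, ordered longest first so that "ically"
# is tried before "ical", "ic", or "y".
SUFFIXES = ["ically", "ical", "ize", "ics", "ic", "y"]

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary: lemmatization relies on vocabulary
# knowledge rather than on blind suffix removal.
LEMMAS = {
    "produces": "produce",
    "produced": "produce",
    "producing": "produce",
}

def lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

words = ["economy", "economic", "economical",
         "economically", "economics", "economize"]
print({w: stem(w) for w in words})
print([lemmatize(w) for w in ["produce", "produces", "produced", "producing"]])
```

Note that the stem "econom" is not itself a word, whereas a lemma always is; that is the practical trade-off between the two techniques.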
What is Corpus

• A corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based.

• The plural form of corpus is corpora.

• Some popular corpora are the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus.

• The European Corpus Initiative (ECI) corpus is multilingual, with 98 million words in Turkish, Japanese, Russian, Chinese, and other languages.
• A corpus may be composed of written language, spoken language, or both. A spoken corpus is usually in the form of audio recordings.
Types of Corpus

• A corpus may be open or closed. An open corpus does not claim to contain all data from a specific area, while a closed corpus does claim to contain all or nearly all data from a particular field. Historical corpora, for example, are closed, as there can be no further input to the area.

• Monolingual corpora represent only one language, while bilingual corpora represent two languages.

• Parallel corpus

• Balanced Corpus
Balanced Corpus

What should be covered in a balanced corpus?

Balanced: covers a range of text categories

• Definition depends upon the intended uses

• No true objective measure of balance

• Usually based on proportional sampling

• Balance can be based on a text typology, a classification of text types
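One way to read "proportional sampling" is that each text category contributes to the corpus in proportion to its share of the available texts. The sketch below assumes that reading; the function name, the category labels, and the fixed random seed are all hypothetical.

```python
import random

def proportional_sample(texts_by_category, total, seed=0):
    """Draw about `total` texts so each category keeps its population share."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    population = sum(len(texts) for texts in texts_by_category.values())
    sample = {}
    for category, texts in texts_by_category.items():
        # Quota is the category's share of the population, scaled to `total`.
        quota = round(total * len(texts) / population)
        sample[category] = rng.sample(texts, min(quota, len(texts)))
    return sample

# Hypothetical text typology with 60, 30, and 10 available texts.
available = {
    "news": [f"news_{i}" for i in range(60)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(10)],
}
balanced = proportional_sample(available, total=10)
print({category: len(texts) for category, texts in balanced.items()})
```

Because the definition of balance depends on the intended use, the proportions could equally be set by hand from a text typology instead of being derived from the population.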


Uses of Corpus

• A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language.

• Computer-processable corpora allow linguists to adopt the principle of total accountability: retrieving either all the occurrences of a particular word or structure, or a randomly selected sample of them, for inspection.

• Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
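Retrieving every occurrence of a word for inspection is commonly done with a keyword-in-context (KWIC) concordance. The following is a minimal sketch under simple assumptions: tokens are lowercased runs of word characters, and the function name and `width` parameter are hypothetical.

```python
import re

def concordance(text, word, width=3):
    """Return every occurrence of `word` with `width` tokens of context."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = "The corpus is large. A corpus may be open or closed."
for left, keyword, right in concordance(sample, "corpus", width=2):
    print(f"{left:>10} [{keyword}] {right}")
```

Returning all hits rather than the first one is what makes the retrieval accountable: no occurrence is silently left out of the analysis.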
Applications of Corpus

• Corpora are used in the development of NLP tools.

• Applications include spell-checking, grammar-checking, speech recognition, text-to-speech and speech-to-text synthesis, automatic abstraction and indexing, information retrieval, and machine translation.

• Corpora are also used for the creation of new dictionaries and grammars for learners.
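As an illustration of the first application, a corpus can supply the word frequencies behind a simple spell-checker. The sketch below follows the well-known approach of proposing words one edit away and picking the most frequent; the tiny training text and all function names are assumptions for illustration only.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def train(corpus_text):
    """Count how often each word occurs in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the input."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Hypothetical miniature corpus; a usable checker needs millions of words.
counts = train("the corpus is a large collection of texts the corpus")
print(correct("corpsu", counts))
print(correct("teh", counts))
```

The quality of such a checker depends directly on the size and balance of the corpus it is trained on.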
