Natural Language Processing Module 1 (CH-1)
ORIGINS OF NLP:
Natural language processing, sometimes mistakenly termed natural language understanding, originated from machine translation research.
Natural language processing includes both understanding (interpretation) and generation (production).
Computational linguistics is related to theoretical linguistics and psycholinguistics.
Theoretical linguistics:
It mainly provides structural descriptions of natural language and its semantics. Theoretical linguists are not concerned with the actual processing of sentences or the generation of sentences from structural descriptions.
They are in quest of principles that remain common across languages and of rules that capture linguistic generalizations.
Example: most languages have constructs like noun and verb phrases.
Theoretical linguists identify rules that describe and restrict the structure of languages.
Psycholinguistics:
They explain how humans produce and comprehend natural language. They are interested in the representation of linguistic structures as well as in the processes by which these structures are produced. They rely primarily on empirical investigations to back up their theories.
Computational linguistics:
It is concerned with the study of language using computational models of linguistic phenomena. It deals with
the application of linguistic theories and computational techniques for NLP.
In computational linguistics, representing a language is a major problem; most knowledge representations tackle only a small part of knowledge.
Computational models may be broadly classified into knowledge-driven and data-driven categories.
Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules. Acquiring and encoding such knowledge is difficult.
Data-driven approaches presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort required is less, and the performance of these systems depends on the quantity of the data. These systems are usually adaptive to noisy data.
1. Lexical analysis:
This phase scans the input text as a stream of characters and converts it into meaningful lexemes (tokens). It divides the whole text into paragraphs, sentences, and words.
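As a simple illustration, here is a minimal tokenization sketch in Python (the regular expressions are an invented approximation, not a standard; real lexical analyzers handle abbreviations, numbers, and punctuation far more carefully):

```python
import re

def tokenize(text):
    """Split raw text into sentences, then into word tokens (naive sketch)."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Within each sentence, pull out word-like lexemes.
    return [re.findall(r"\w+", s) for s in sentences]

print(tokenize("I went to the market. It was closed."))
# [['I', 'went', 'to', 'the', 'market'], ['It', 'was', 'closed']]
```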
2. Syntactic analysis:
It takes a sequence of words as a unit, usually a sentence, and finds its structure.
It decomposes a sentence into its constituents (or words) and identifies how they relate to each other. It captures the grammaticality or non-grammaticality of sentences by looking at constraints like word order, number, and case agreement. This level of processing requires syntactic knowledge, i.e., knowledge about how words are combined to form larger units such as phrases and sentences.
For example, “I went to the market” is a valid sentence, whereas “went the I market to” is not.
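As a sketch of this level, the toy grammar below (invented for this example) accepts the valid word order and rejects the scrambled one; it assumes the NLTK toolkit is installed:

```python
import nltk

# Toy phrase structure grammar, invented for this example.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Pro | Det N
    VP -> V PP
    PP -> P NP
    Pro -> 'I'
    Det -> 'the'
    N -> 'market'
    V -> 'went'
    P -> 'to'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I went to the market".split()):
    tree.pretty_print()  # draws the parse tree as text
# "went the I market to".split() yields no parse at all:
# the parser rejects it as ungrammatical.
```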
3. Semantic Analysis:
Semantics is associated with the meaning of the language. It is concerned with creating meaningful representations of linguistic inputs.
The general idea of semantic interpretation is to take natural language sentences or utterances and map them onto some representation of meaning.
Defining meaning components is difficult, as grammatically valid sentences can be meaningless.
The starting point in semantic analysis has been lexical semantics. A word can have a number of possible meanings associated with it, but in a given context, only one of these meanings participates. Finding out the correct meaning of a particular use of a word is necessary for finding the meaning of larger units.
Consider the following sentences:
Kabir and Ayan are married.
Kabir and Suha are married.
Both sentences have identical structures, and the meanings of the individual words are clear. But most of us end up with two different interpretations.
We may interpret the second sentence to mean that Kabir and Suha are married to each other, but this interpretation does not occur for the first sentence. Syntactic structure and compositional semantics fail to explain these interpretations; we make use of pragmatic information.
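To make lexical ambiguity concrete, the sketch below lists the candidate senses of a word using NLTK's WordNet interface (assuming the WordNet corpus has been downloaded); picking the sense that fits the context is word sense disambiguation:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

# One word form, many candidate senses; in a given context only one participates.
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())
# bank.n.01 - sloping land (especially the slope beside a body of water)
# depository_financial_institution.n.01 - a financial institution that
#   accepts deposits and channels the money into lending activities
# ... (several more senses follow)
```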
4. Discourse Analysis:
Discourse-level processing attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and document level, in terms of words, phrases, clusters, and sentences.
It requires the resolution of anaphoric references and the identification of discourse structure. It also requires discourse knowledge, that is, knowledge of how the meaning of a sentence is determined by preceding sentences.
In fact, pragmatic knowledge may be needed for resolving anaphoric references.
For example, in the following sentences, resolving the anaphoric reference ‘they’ requires pragmatic knowledge.
“The district administrator refused to give the trade union permission for the meeting because they feared violence.”
“The district administrator refused to give the trade union permission for the meeting because they oppose the government.”
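The sketch below is a deliberately naive recency heuristic (invented for illustration, not an algorithm from these notes): it resolves a pronoun to the most recently mentioned candidate, and so it cannot distinguish the two sentences above, which is exactly why pragmatic knowledge is needed:

```python
def resolve_pronoun(candidates):
    """Naive recency heuristic: pick the most recently mentioned antecedent."""
    return candidates[-1] if candidates else None

# Candidate antecedents for 'they', in order of mention.
mentions = ["the district administrator", "the trade union"]
print(resolve_pronoun(mentions))  # -> 'the trade union'
# Plausible for "... because they oppose the government", but wrong for
# "... because they feared violence": only world knowledge tells us that
# it is the administrator, not the union, who fears violence.
```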
5. Pragmatic analysis:
This is the highest level of processing. It deals with the purposeful use of sentences in situations.
It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.
b. Syntactic Ambiguity:
Syntactic ambiguity exists when a sentence can be parsed in more than one way.
Example: I saw the girl with the binoculars. In this example, did I have the binoculars, or did the girl have the binoculars?
c. Referential Ambiguity:
Referential ambiguity exists when we refer to something using a pronoun.
Example: Kiran went to Sunitha. She said, “I am hungry.” In this sentence, we do not know who is hungry, Kiran or Sunitha.
S: sentence
NP: noun phrase
VP: verb phrase
Det: determiner
Transformational grammar is a set of transformational rules, which transform one phrase marker (underlying) into another phrase marker (derived). These rules are applied on the terminal string generated by phrase structure rules. Transformational rules are heterogeneous and may have more than one symbol on their left-hand side. These rules are used to transform one representation into another, e.g., an active sentence into a passive one.
The rule relating active and passive sentences is:
NP1 - Aux - V - NP2 → NP2 - Aux + be + en - V - by + NP1
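A minimal sketch of this transformation over labeled constituents (the function and example are invented for illustration; real transformations operate on phrase markers, and the morphology of ‘be + en’ is left to the caller):

```python
def passivize(np1, aux_be, v_participle, np2):
    """Apply NP1 - Aux - V - NP2  =>  NP2 - Aux + be + en - V - by + NP1.

    The caller supplies the inflected 'be' form and the past participle
    (the 'en' form), since morphology is outside this sketch.
    """
    return f"{np2} {aux_be} {v_participle} by {np1}"

print(passivize("the dog", "was", "chased", "the cat"))
# -> 'the cat was chased by the dog'
```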
e.g., ga raha hai (singing) and khel rahi hai (playing).
The auxiliary verbs in such a sequence provide information about tense, aspect, and modality.
NLP APPLICATIONS:
The applications utilizing NLP include the following:
1. Machine Translation: This refers to the automatic translation of text from one human language to another. In order to carry out this translation, it is necessary to have an understanding of the words and phrases, the grammars of the two languages involved, the semantics of the languages, and world knowledge.
2. Speech Recognition: This is the process of mapping acoustic speech signals to a set of words. The difficulties arise due to wide variations in the pronunciation of words, homonyms, and acoustic ambiguities.
3. Speech synthesis: This refers to the automatic production of speech. Such systems can read out your mail over the telephone, or even read out a storybook.
4. Natural language interfaces to databases: These allow querying a structured database using natural language sentences.
5. Information retrieval: This is concerned with identifying documents relevant to a user's query. Indexing, word sense disambiguation, query modification, and knowledge bases have also been used in IR systems to enhance performance.
6. Information Extraction: It captures and outputs factual information contained within a document.
7. Question answering: It attempts to find the precise portion of text in which the answer appears.
8. Text summarization: This deals with the creation of summaries of documents and involves syntactic, semantic, and discourse-level processing of text.
INFORMATION RETRIEVAL
The availability of a large amount of text in electronic form has made it extremely difficult to get relevant
information. Information retrieval systems aim at providing a solution to this.
The term ‘information’ is used here to reflect the ‘subject matter’ or ‘content’ of some text. The focus is on the communication taking place between human beings as expressed through natural language. Information is always associated with some data; we are concerned with text only.
The word ‘retrieval’ refers to the operation of accessing information from some computer-based representation. Retrieval of information thus requires the information to be processed and stored. Not all the information represented in computable form is retrieved; instead, only the information relevant to the needs expressed in the form of a query is located.
Information retrieval deals with unstructured data. The retrieval is performed based on the content of the
document rather than on its structure. IR systems usually return a ranked list of documents. IR components have traditionally been incorporated into different types of information systems, including database management systems, bibliographic text retrieval systems, question answering systems, and, more recently, search engines.
Current approaches for accessing large text collections can be broadly classified into two categories.
1. Approaches that construct a topic hierarchy, e.g., Yahoo. These help the user locate documents of interest manually by traversing the hierarchy.
2. Approaches that rank the retrieved documents according to relevance.
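As an illustration of the second category, here is a minimal ranked retrieval sketch using TF-IDF weighting and cosine similarity (standard techniques, though not named in these notes; the tiny corpus is invented, and scikit-learn is assumed to be available):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the trade union held a meeting in the district",
    "the market opened early in the district",
    "speech recognition maps acoustic signals to words",
]
query = ["trade union meeting"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # index the collection
query_vector = vectorizer.transform(query)    # represent the query the same way

# Rank documents by cosine similarity between query and document vectors.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc_id, score in sorted(enumerate(scores), key=lambda x: -x[1]):
    print(f"doc {doc_id}: {score:.2f}")
# doc 0 ranks first: it shares 'trade', 'union', and 'meeting' with the query.
```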
ISSUES IN INFORMATION RETRIEVAL:
1. The first issue is to choose a representation of the document. Most human knowledge is coded in natural language, which is difficult to use as a knowledge representation language for computer systems. Most of the current retrieval models are based on keyword representation. This representation creates problems during retrieval due to polysemy, homonymy, and synonymy.
Polysemy: It involves the phenomenon of a lexeme with multiple meanings.
Homonymy: It is an ambiguity in which words that appear the same have unrelated meanings.
Synonymy: It creates a problem when a document is indexed with one term, the query contains a different term, and the two terms share a common meaning.
Another problem with keyword-based retrieval is that it ignores semantic and contextual information in the retrieval process. This information is lost in the extraction of keywords from the text and cannot be recovered by the retrieval algorithms.
2. Inappropriate characterization of queries by the user is another issue. The user may fail to include relevant terms in the query or may include irrelevant terms. Inappropriate or inaccurate queries lead to poor retrieval performance.
3. Matching query representation with that of the document is another issue.
4. Selection of the appropriate similarity measure is a crucial issue in the design of IR systems.
5. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely used measures of effectiveness; a worked example is given after this list.
6. Since the major goal of IR is to retrieve documents relevant to the query, understanding what constitutes relevance is also an important issue.
7. The size of document collections and the varying needs of users also complicate text retrieval.
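As promised under issue 5, here is a worked sketch of the two effectiveness measures (the document sets are invented): precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved & relevant| / |retrieved|
    recall    = |retrieved & relevant| / |relevant|"""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical run: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
# precision = 0.75, recall = 0.50
```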