Natural Language Processing Module1 (CH-1)

Natural Language Processing (NLP) involves developing computational models to understand and process human language for automated tools and insights into communication. It encompasses various approaches, including rationalist and empiricist, and involves multiple levels of language analysis such as lexical, syntactic, semantic, discourse, and pragmatic analysis. Challenges in NLP include ambiguity, representation issues, and the evolving nature of language, with applications ranging from machine translation to information retrieval.


Natural language processing


Introduction
What is natural language processing (NLP)?
Language is the primary means of communication used by humans. It is the tool we use to express the greater part of our ideas and emotions. It shapes thought, has a structure, and carries meaning.
Natural language processing (NLP) is concerned with the development of computational models of aspects of human language processing.
There are two main reasons for such development:
1. To develop automated tools for language processing.
2. To gain a better understanding of human communication.
Building computational models with human language processing abilities requires knowledge of how humans acquire, store, and process language. It also requires knowledge of the world and of language.
Two major approaches to NLP:
1. Rationalist approach
2. Empiricist approach
Rationalist approach:
It assumes the existence of some language faculty in the human brain. Supporters of this approach argue that it is not possible for children to learn something as complex as natural language from the limited sensory input available to them.
Empiricist approach:
Empiricists do not believe in the existence of a language faculty. Instead, they believe in the existence of some general organizing principles such as pattern recognition, generalization, and association. Learning of detailed structures can, therefore, take place through the application of these principles to the sensory input available to the child.

ORIGINS OF NLP:
Natural language processing, sometimes mistakenly termed natural language understanding, originated from machine translation research.
Natural language processing includes both understanding (interpretation) and generation (production).
Computational linguistics is similar to theoretical linguistics and psycholinguistics.


Theoretical linguistics:
It mainly provides structural descriptions of natural language and its semantics. Theoretical linguists are not concerned with the actual processing of sentences or with the generation of sentences from structural descriptions.
They are in quest of principles that remain common across languages and identify rules that capture linguistic generalizations.
Example: most languages have constructs like noun and verb phrases.
Theoretical linguists identify rules that describe and restrict the structures of languages.
Psycholinguistics:
Psycholinguists explain how humans produce and comprehend natural language. They are interested in the representation of linguistic structures as well as in the process by which these structures are produced. They rely primarily on empirical investigations to back up their theories.
Computational linguistics:
It is concerned with the study of language using computational models of linguistic phenomena. It deals with
the application of linguistic theories and computational techniques for NLP.
In computational linguistics, representing a language is a major problem; most knowledge representation schemes tackle only a small part of knowledge.
Computational models may be broadly classified under knowledge driven and data driven categories.
Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules. Acquiring and encoding such knowledge is difficult.
Data-driven approaches presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns. The amount of human effort is less, and the performance of these systems depends on the quantity of the data. These systems are usually adaptive to noisy data.
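
To make the contrast concrete, here is a minimal sketch in Python, assuming NLTK with its bundled Penn Treebank sample; the rule and the training setup are illustrative, not a prescribed method:

# A minimal sketch contrasting the two categories (assumes NLTK with
# its bundled sample of the Penn Treebank corpus).
import nltk

nltk.download("treebank", quiet=True)

# Knowledge driven: an explicitly coded, handcrafted linguistic rule.
def looks_like_plural_noun(word: str) -> bool:
    return word.endswith("s")  # crude rule; exceptions must be hand-coded

# Data driven: a unigram tagger whose behaviour is learned from
# annotated data rather than written by hand.
train_sents = nltk.corpus.treebank.tagged_sents()
tagger = nltk.UnigramTagger(train_sents)

print(tagger.tag("the police will catch the snatcher".split()))
# Words unseen in the training data get the tag None, illustrating how
# performance depends on the quantity (and coverage) of the data.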

LANGUAGE AND KNOWLEDGE:


Language is the medium of expression through which knowledge is deciphered. Language, being a medium of expression, is the outer form of the content it expresses; the same content can be expressed in different languages, or even in the same language with a different set of words.
Language (text) processing has different levels, each involving different types of knowledge.
1. Lexical Analysis:
It is the simplest level of analysis, which involves the analysis of words. Words are the most fundamental unit of any natural language text. Word-level processing requires morphological knowledge, i.e., knowledge about the structure and formation of words from basic units (morphemes). The rules for forming words from morphemes are language specific.


This phase scans the text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.
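
As a rough illustration of this level, the following Python sketch (assuming NLTK and its 'punkt' tokenizer models are installed) tokenizes text into sentences and words, with a stemmer standing in for fuller morphological analysis:

# A minimal sketch of lexical analysis (assumes NLTK and its 'punkt'
# sentence tokenizer models are available).
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)

text = "The children were playing in the gardens. They laughed loudly."

# Divide the text into sentences, then into word tokens.
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))

# Morphological knowledge: strip affixes to approximate the base form.
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in nltk.word_tokenize(text)])
# e.g. 'playing' -> 'play', 'gardens' -> 'garden'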

2. Syntactic analysis:
It considers a sequence of words as a unit, usually a sentence, and finds its structure.
It decomposes a sentence into its constituents (or words) and identifies how they relate to each other. It captures the grammaticality or non-grammaticality of sentences by looking at constraints like word order and number and case agreement. This level of processing requires syntactic knowledge, i.e., knowledge about how words are combined to form larger units such as phrases and sentences.
For example, “I went to the market” is a valid sentence whereas “went the I market to” is not.
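
A toy grammar makes the word-order constraint concrete. The following sketch uses NLTK with a deliberately tiny, illustrative rule set; it accepts the first sentence and finds no parse for the second:

# A toy syntactic analysis: a tiny, illustrative context-free grammar
# that accepts "I went to the market" and rejects the scrambled order.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pro | Det N
VP -> V PP
PP -> P NP
Pro -> 'I'
Det -> 'the'
N -> 'market'
V -> 'went'
P -> 'to'
""")
parser = nltk.ChartParser(grammar)

for tree in parser.parse("I went to the market".split()):
    tree.pretty_print()  # at least one parse: the sentence is grammatical

print(list(parser.parse("went the I market to".split())))  # [] -> rejected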

3. Semantic Analysis:
Semantics is associated with the meaning of the language. It is concerned with creating meaningful representations of linguistic inputs.
The general idea of semantic interpretation is to take natural language sentences or utterances and map them onto some representation of meaning.
Defining meaning components is difficult, as grammatically valid sentences can be meaningless.
The starting point in semantic analysis has been lexical semantics. A word can have a number of possible meanings associated with it, but in a given context only one of these meanings participates. Finding the correct meaning of a particular use of a word is necessary for finding the meaning of larger units.
Consider the following sentences:
Kabir and Ayan are married.
Kabir and Suha are married.
Both sentences have identical structures, and the meanings of the individual words are clear. But most of us end up with two different interpretations.
We may interpret the second sentence to mean that Kabir and Suha are married to each other, but this interpretation does not occur for the first sentence. Syntactic structure and compositional semantics fail to explain these interpretations; we make use of pragmatic information.
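
The two readings can be written down explicitly. The sketch below uses NLTK's first-order logic parser; the predicate and constant names are illustrative assumptions, not a standard lexicon:

# Two candidate meaning representations for the structurally identical
# sentences above, written with NLTK's first-order logic parser.
from nltk.sem import Expression

read = Expression.fromstring

# Collective reading: the two are married to each other.
collective = read("married(kabir, suha)")

# Distributive reading: each is married to some unspecified person.
distributive = read("exists x. exists y. (married(kabir, x) & married(ayan, y))")

print(collective)
print(distributive)
# Choosing between such representations is exactly where pragmatic
# information enters.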
4. Discourse Analysis:
Discourse-level processing attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and document level, in terms of words, phrases, clusters, and sentences.
It requires the resolution of anaphoric references and identification of discourse structure. It also requires discourse knowledge, that is, knowledge of how the meaning of a sentence is determined by preceding sentences.
In fact, pragmatic knowledge may be needed for resolving anaphoric references.
For example, in the following sentences, resolving the anaphoric reference ‘they’ requires pragmatic knowledge.


“The district administrator refused to give the trade union permission for the meeting because they
feared violence.”
“The district administrator refused to give the trade union permission for the meeting because they
oppose government.”

5. Pragmatic analysis:
This is the highest level of processing. It deals with the purposeful use of sentences in situations.
It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.

THE CHALLENGES OF NLP:

There are a number of factors that make NLP difficult.


1. The first relates to the problems of representation and interpretation of content. Since natural languages are highly ambiguous and vague, achieving such representations can be difficult.
2. The inability to capture all the required knowledge is another source of difficulty. It is almost impossible to embody all the sources of knowledge that humans use to process language.
3. The greatest source of difficulty in natural language is identifying its semantics. Words alone do not make a sentence; instead, it is the words as well as their syntactic and semantic relations that give meaning to a sentence. Moreover, a language keeps evolving. New words are added continually and existing words are introduced in new contexts. For example, most newspapers and TV channels use 9/11 to refer to the terrorist attack on the World Trade Centre in the USA in 2001.
4. Idioms, metaphors, and ellipses add more complexity to identifying the meaning of written text. As an example, consider the sentence:
“The old man finally kicked the bucket.”
The meaning of this sentence has nothing to do with the words ‘kick’ and ‘bucket’ appearing in it.
5. Quantifier scoping is another problem. The scope of quantifiers (the, each, etc.) is often not clear and poses problems in automatic processing.
6. The ambiguity of natural languages is another difficulty.
a. Lexical ambiguity: the first level of ambiguity arises at the word level. We can identify words that have multiple meanings associated with them.
Example:
Manya is looking for a match.
In this example the word ‘match’ may mean either that Manya is looking for a partner or that Manya is looking for a match (cricket or other).
Solution: part-of-speech tagging and word sense disambiguation (see the sketch after this list).
b. Syntactic ambiguity: this exists when the structure of a sentence admits two or more possible interpretations.


Example: I saw the girl with the binoculars. In this example, did I have the binoculars, or did the girl have the binoculars?
c. Referential ambiguity: this exists when a pronoun may refer to more than one entity.
Example: Kiran went to Sunitha. She said, “I am hungry.” In this sentence, we do not know who is hungry, Kiran or Sunitha.
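
The remedies named under (a) can be sketched with NLTK, assuming a standard setup with the perceptron tagger and WordNet; Lesk is one classic word sense disambiguation method, used here purely as an illustration:

# POS tagging and word sense disambiguation for the 'match' example.
import nltk
from nltk.wsd import lesk

for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource, quiet=True)

tokens = nltk.word_tokenize("Manya is looking for a match")

# Part-of-speech tagging fixes the word class of 'match' (a noun here).
print(nltk.pos_tag(tokens))

# Lesk picks the WordNet sense whose gloss best overlaps the context.
sense = lesk(tokens, "match", pos="n")
if sense is not None:
    print(sense.name(), "-", sense.definition())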

Language and Grammar:


Automatic processing of language requires the rules and exceptions of a language to be explained to the computer. Grammar consists of a set of rules that allow us to parse and generate sentences in a language. These rules relate information to coding devices at the language level, not at the world-knowledge level.
However, world knowledge affects both the coding and the coding convention (structure); this blurs the boundary between syntax and semantics.
Types of grammars:
Transformational grammar (Chomsky 1957).
Lexical functional grammar (Kaplan and Bresnan 1982).
Government and binding (Chomsky 1981).
Generalized phrase structure grammar.
Dependency grammar.
Paninian grammar.
Tree-adjoining grammar.
Some of these grammars focus on derivation (e.g., phrase structure grammar) while others focus on relationships (e.g., dependency grammar).
The greatest contribution comes from Noam Chomsky, who proposed a hierarchy of formal grammars based on levels of complexity. These grammars use phrase structure rules. Generative grammar basically refers to any grammar that uses a set of rules to specify or generate all and only the grammatical sentences of a language.
In transformational grammar, each sentence in a language has two levels of representation, namely, a deep structure and a surface structure.


Transformational grammar has three components:


1. Phrase structure grammar
2. Transformational rules
3. Morphophonemic rules: these rules match each sentence representation to a string of phonemes.
Each of these components consists of rules that generate natural language sentences and assign a structural description to them. As an example, consider the following set of rules:

[Figure: a set of phrase structure rules. Legend: S = sentence, NP = noun phrase, VP = verb phrase, Det = determiner]
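
The original rules are not reproduced here, but a plausible, textbook-style reconstruction can be written as an NLTK grammar; the generator below also shows the 'generative' character of such rules, enumerating exactly the sentences they license:

# A hedged reconstruction of a textbook-style phrase structure rule set.
# NLTK's generator enumerates the sentences the rules license -- the
# sense in which a generative grammar specifies "all and only" the
# grammatical sentences.
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> Aux V NP
Det -> 'the'
N -> 'police' | 'snatcher'
Aux -> 'will'
V -> 'catch'
""")

for words in generate(grammar, n=4):
    print(" ".join(words))
# e.g. "the police will catch the snatcher",
#      "the snatcher will catch the police", ...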
Transformational grammar is a set of transformation rules, which transform one phrase marker (underlying) into another phrase marker (derived). These rules are applied to the terminal string generated by the phrase structure rules. Transformational rules are heterogeneous and may have more than one symbol on their left-hand side. They are used to transform one surface representation into another, e.g., an active sentence into a passive one.
The rule relating active and passive sentences is:
NP1 + Aux + V + NP2 → NP2 + Aux + be + en + V + by + NP1
Transformational rules can be obligatory or optional.


An obligatory transformation is one that ensures agreement in number of subject and verb.
An optional transformation is one that modifies the structure of a sentence while preserving its meaning.
Morphophonemic rules match each sentence representation to a string of phonemes.
Consider the active sentence:
The police will catch the snatcher.
The application of the phrase structure rules will assign it the structure shown below:
[Figure: phrase structure tree for ‘The police will catch the snatcher’]
The passive transformation rule will then convert the sentence into:
The + snatcher + will + be + en + catch + by + the police
Finally, the morphophonemic rules map this representation to the surface sentence ‘The snatcher will be caught by the police.’
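
As a toy rendering (real transformational rules operate on phrase markers, i.e., trees, not flat strings), the rewrite can be coded directly:

# A toy rendering of the passive transformation as a string rewrite:
# NP1 + Aux + V + NP2  ->  NP2 + Aux + be + en + V + by + NP1.
# This only mirrors the derivation shown above; genuine transformational
# rules operate on phrase markers, not flat strings.
def passivize(np1: str, aux: str, verb: str, np2: str) -> str:
    return " + ".join([np2, aux, "be", "en", verb, "by", np1])

print(passivize("the police", "will", "catch", "the snatcher"))
# -> the snatcher + will + be + en + catch + by + the police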


Processing Indian Languages:


There are a number of differences between Indian languages and English. Some of these differences are listed here:
 Unlike English, Indic scripts have a non-linear structure.
 Unlike English, Indian languages have SOV (subject object verb) as the default sentence structure.
 Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing the meaning of the sentence.
 Spelling standardization is more subtle in Hindi than in English.
 Indian languages have a relatively rich set of morphological variants.
 Indian languages make extensive and productive use of complex predicates(CPs).
 Indian languages use post-position (karakas) case markers instead of prepositions.
 Indian languages use verb complexes consisting of sequences of verbs, e.g., gaa rahaa hai (is singing) and khel rahi hai (is playing). The auxiliary verbs in the sequence provide information about tense, aspect, and modality.
NLP APPLICATIONS:
The applications utilizing the NLP include the following:
1. Machine translation: This refers to the automatic translation of text from one human language to another. In order to carry out this translation, it is necessary to have an understanding of the words and phrases, the grammars of the two languages involved, the semantics of the languages, and world knowledge.
2. Speech recognition: This is the process of mapping acoustic speech signals to a set of words. The difficulties arise due to wide variations in the pronunciation of words, homonyms, and acoustic ambiguities.


3. Speech synthesis: This refers to the automatic production of speech. Such systems can read out your mail over the telephone, or even read out a storybook.
4. Natural language interfaces to databases: These allow querying a structured database using natural language sentences.
5. Information retrieval: This is concerned with identifying documents relevant to a user's query. Indexing, word sense disambiguation, query modification, and knowledge bases have also been used in IR systems to enhance performance.
6. Information extraction: This captures and outputs factual information contained within a document.
7. Question answering: This attempts to find the precise portion of text in which the answer to a question appears.
8. Text summarization: This deals with the creation of summaries of documents and involves syntactic, semantic, and discourse-level processing of text.

SOME SUCCESSFUL EARLY NLP SYSTEMS


ELIZA is one of the earliest natural language understanding programs. It uses syntactic patterns to mimic human conversation with the user. Here is a sample conversation.

ELIZA: Hello. I am ELIZA. How may I help you?


User: I am feeling a little bit sleepy.
ELIZA: How long have you been feeling a little bit sleepy?
User: For almost half an hour. I am feeling a little bit sleepy.
ELIZA: Please go on.
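
ELIZA's pattern-matching behaviour can be approximated in a few lines. The sketch below is a loose reconstruction of the idea, not Weizenbaum's original script:

# A tiny ELIZA-style sketch: ordered regex patterns whose templates
# echo fragments of the user's input back as questions.
import re

RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
]

def eliza_reply(utterance: str) -> str:
    text = utterance.strip().rstrip(".!?")  # drop trailing punctuation
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return "Please go on."  # default reply when nothing matches

print(eliza_reply("I am feeling a little bit sleepy."))
# -> How long have you been feeling a little bit sleepy?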
SYSTRAN (SYSTEM TRANSLATION)
The first SYSTRAN machine translation system was developed in 1969 for Russian-English translation. SYSTRAN also provided the first online machine translation service, called Babel Fish, which was used by the AltaVista search engine to handle translation requests from users.
TAUM METEO
This is a natural language generation system used in Canada to generate weather reports. It accepts daily weather data and generates weather reports in English and French.
SHRDLU (Winograd 1972)
This is a natural language understanding system that simulates the actions of a robot in a blocks world domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the robot to manipulate blocks, to describe block configurations, and to explain its reasoning.
LUNAR(Woods 1977)
This was an early question answering system that answered questions about moon rocks.


INFORMATION RETRIEVAL
The availability of a large amount of text in electronic form has made it extremely difficult to get relevant information. Information retrieval systems aim at providing a solution to this.
The term ‘information’ is used here to refer to the ‘subject matter’ or ‘content’ of some text. The focus is on the communication taking place between human beings as expressed through natural language. Information is always associated with some data; we are concerned with text only.
The word ‘retrieval’ refers to the operation of accessing information from some computer-based representation. Retrieval of information thus requires the information to be processed and stored. Not all the information represented in computable form is retrieved; instead, only the information relevant to the needs expressed in the form of a query is located.
Information retrieval deals with unstructured data. Retrieval is performed based on the content of the document rather than on its structure. IR systems usually return a ranked list of documents. IR components have traditionally been incorporated into different types of information systems, including database management systems, bibliographic text retrieval systems, question answering systems, and, more recently, search engines.
Current approaches for accessing large text collections can be broadly classified into two categories:
1. Approaches that construct a topic hierarchy (e.g., Yahoo). These help the user locate documents of interest manually by traversing the hierarchy.
2. Approaches that rank the retrieved documents according to relevance.
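
The second category can be sketched with a standard TF-IDF model and cosine similarity; the documents and query below are made up for illustration, and scikit-learn is assumed to be installed:

# Ranking documents by content: TF-IDF vectors compared to the query
# by cosine similarity (a hedged sketch of category 2 above).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "machine translation of natural language text",
    "retrieval of documents relevant to a user query",
    "speech recognition maps acoustic signals to words",
]
query = "find documents relevant to my query"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # index the collection
query_vector = vectorizer.transform([query])       # represent the query

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1]:                   # best match first
    print(f"{scores[i]:.3f}  {documents[i]}")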
ISSUES IN INFORMATION RETRIEVAL:
1. Choosing a representation of the document. Most human knowledge is coded in natural language, which is difficult to use as a knowledge representation language for computer systems. Most current retrieval models are therefore based on keyword representation. This representation creates problems during retrieval due to polysemy, homonymy, and synonymy.
Polysemy: the phenomenon of a lexeme having multiple meanings.
Homonymy: an ambiguity in which words that appear the same have unrelated meanings.
Synonymy: creates a problem when a document is indexed with one term while the query contains a different term, and the two terms share a common meaning.
Another problem with keyword-based retrieval is that it ignores semantic and contextual information in the retrieval process. This information is lost in the extraction of keywords from the text and cannot be recovered by the retrieval algorithms.
2. Inappropriate characterization of queries by the user. The user may fail to include relevant terms in the query or may include irrelevant terms. Inappropriate or inaccurate queries lead to poor retrieval performance.
3. Matching the query representation with that of the document is another issue.
4. Selection of an appropriate similarity measure is a crucial issue in the design of IR systems.
5. Evaluating the performance of IR systems is also a major issue. Recall and precision are the most widely used measures of effectiveness (see the sketch after this list).


6. The major goal of IR is to search documents in a manner relevant to the query; understanding what constitutes relevance is therefore also an important issue.
7. The size of document collections and the varying needs of users further complicate text retrieval.
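
The two effectiveness measures mentioned in issue 5 are easy to state precisely; the document IDs below are hypothetical:

# Precision: fraction of retrieved documents that are relevant.
# Recall:    fraction of relevant documents that are retrieved.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, purely for illustration.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.67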
