Module 1 Notes
MODULE-1
CHAPTER-1
INTRODUCTION
1.1 WHAT IS NATURAL LANGUAGE PROCESSING (NLP)
Natural language processing (NLP) is concerned with the development of computational
models of aspects of human language processing. There are two main reasons for such
development:
1. To develop automated tools for language processing
2. To gain a better understanding of human communication.
Building computational models with human language-processing abilities requires
knowledge of how humans acquire, store, and process language. It also requires knowledge
of the world and of language.
Historically, there have been two major approaches to NLP: the rationalist approach and the
empiricist approach. Early NLP research took a rationalist approach, which assumes the
existence of some language faculty in the human brain. Supporters of this approach argue
that it is not possible for children to learn something as complex as natural language from the
limited sensory input available to them. Empiricists do not believe in the existence of a
language faculty. Instead, they believe in the existence of general organizing principles such as
pattern recognition, generalization, and association. Learning of detailed structures can,
therefore, take place through the application of these principles to the sensory inputs available
to the child.
Computational linguistics is concerned with the study of language using
computational models of linguistic phenomena. It deals with the application of linguistic
theories and computational techniques to NLP. In computational linguistics, representing a
language is a major problem; most knowledge representations tackle only a small part of
knowledge. Computational models may be broadly classified as knowledge-driven or
data-driven:
Knowledge-driven models rely on explicitly coded linguistic knowledge, expressed as
a set of grammar rules.
Data-driven models rely on the existence of a large amount of data and use machine
learning techniques to learn from it.
Discourse analysis attempts to interpret the structure and meaning of larger units, e.g., at
the paragraph and document level, in terms of words, phrases, clusters, and sentences.
It deals with how the meaning of a sentence is determined by the preceding sentences.
Pragmatic analysis is the highest level of analysis. It deals with the purposeful use
of sentences in situations.
It requires knowledge of the world, i.e., knowledge that extends beyond the
contents of the text.
It focuses on the inferred meaning perceived by the speaker and the listener.
Given that natural languages are highly ambiguous and vague, achieving such a representation
can be difficult. The inability to capture all the required knowledge is another source of
difficulty: it is almost impossible to embody all the sources of knowledge that humans use to
process language. Even if this were done, it would not be possible to write procedures that
imitate language processing as humans do it.
Identifying the semantics of sentences.
• Sentence meaning can be inferred from the syntactic and semantic relations of its words.
• The frequency with which a word is used in a particular sense also affects its meaning.
• Idioms, metaphor, and ellipsis add more complexity to identifying the meaning of
written text.
• The scope of quantifiers (the, each, etc.) is often not clear and poses problems in
automatic processing.
Ambiguity of natural languages
• Word-level ambiguity: A word may be ambiguous in its part-of-speech or it may be
ambiguous in its meaning. The word 'can' is ambiguous in its part-of-speech,
whereas the word 'bat' is ambiguous in its meaning.
• Eg: bank, can
• Structural ambiguity:
• A sentence may be ambiguous even if its words are not. For example, in the
sentence 'Stolen rifle found by tree.', none of the words is
ambiguous but the sentence is. This is an example of structural ambiguity.
• Eg: Stolen ice candy found by tree
• An idiom is a phrase that, when taken as a whole, has a meaning you wouldn't be able to
deduce from the meanings of the individual words.
A piece of cake
• A metaphor is a figure of speech that describes something by saying it's something else.
Life is a highway.
He is a shining star.
• A grammar consists of a set of rules that allows us to parse and generate sentences in a
language.
• The main hurdle in language specification is the constantly changing nature of natural
languages and the presence of a large number of hard-to-specify exceptions.
• This has led to the development of a number of grammars.
• Main among them are transformational grammar, lexical functional grammar, government
and binding, dependency grammar, Paninian grammar, and tree-adjoining grammar.
• Formal grammars form a hierarchy based on their level of complexity.
• These grammars use phrase structure rules.
• Generative grammar uses a set of rules to specify or generate all and only grammatical
(well-formed) sentences in a language.
• Transformational grammar is a system of language analysis that recognizes the
relationship among the various elements of a sentence and among the possible
sentences of a language and uses rules (some of which are called transformations) to
express these relationships.
• Transformational grammar assigns a “deep structure” and a “surface structure” to show
the relationship of such sentences.
• Deep structure is what you wish to express, and surface structure is how you express it
with the help of words and sentences.
• Surface structures are the versions of sentences that are seen or heard, while deep
structures contain the basic units of meaning of a sentence
• The mapping from deep structure to surface structure is carried out by transformations
• Deep structure can be transformed in a number of ways to yield many different surface-
level representations.
Example phrase structure rules:
S → NP + VP
NP → Det + Noun
VP → V + NP
• In these rules, S stands for sentence, NP for noun phrase, VP for verb phrase, and Det for
determiner. Sentences that can be generated using these rules are termed grammatical. The
structure assigned by the grammar is a constituent structure analysis of the sentence. The
second component of transformational grammar is a set of transformation rules, which
transform one phrase-marker (underlying) into another phrase-marker (derived). These rules
are applied to the terminal string generated by the phrase structure rules. Unlike phrase
structure rules, transformational rules are heterogeneous and may have more than one
symbol on their left-hand side. These rules are used to transform one surface representation
into another, e.g., an active sentence into a passive one. The rule relating active and passive
sentences (as given by Chomsky) is:
NP1 - Aux - V - NP2 → NP2 - Aux + be + en - V - by + NP1
This rule says that an underlying input having the structure NP - Aux - V - NP can be
transformed into NP - Aux + be + en - V - by + NP. This transformation involves the addition
of the strings 'be' and 'en' and certain rearrangements of the constituents of the sentence.
Transformational rules can be obligatory or optional. An obligatory transformation is one that
ensures agreement in number of subject and verb, etc., whereas an optional transformation
is one that modifies the structure of a sentence while preserving its meaning.
Morphophonemic rules match each sentence representation to a string of phonemes.
Consider the active sentence:
The police will catch the snatcher. (1.5)
The application of the phrase structure rules will assign the structure shown in Figure 1.2 to
this sentence.
The passive transformation rules will convert the sentence into: The + snatcher + will + be +
en + catch + by + the + police (Figure 1.3).
Another transformational rule will then reorder 'en + catch' to 'catch + en', and subsequently
one of the morphophonemic rules will convert 'catch + en' to 'caught'. In general, the noun
phrase is not always as simple as in sentence (1.5).
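The transformation can be illustrated as symbolic rewriting. Below is a minimal sketch: the constituent segmentation that the parse in Figure 1.2 would supply is hard-coded, and the single morphophonemic table entry is a simplified assumption that folds the reordering and realization steps together.

```python
# Passive transformation as symbolic rewriting (sketch).
# Constituents of "The police will catch the snatcher", as the
# phrase structure rules would segment them (hard-coded here).
np1  = ["the", "police"]
aux  = ["will"]
verb = ["catch"]
np2  = ["the", "snatcher"]

# Transformation: NP1 - Aux - V - NP2  ->  NP2 - Aux + be + en - V - by + NP1
passive = np2 + aux + ["be", "en"] + verb + ["by"] + np1
print(" + ".join(passive))
# the + snatcher + will + be + en + catch + by + the + police

# A reordering rule turns 'en + catch' into 'catch + en', and a
# morphophonemic rule then realizes it as 'caught' (both steps are
# collapsed into one table entry here).
MORPHOPHONEMIC = {("be", "en", "catch"): ["be", "caught"]}
i = passive.index("be")
surface = passive[:i] + MORPHOPHONEMIC[tuple(passive[i:i + 3])] + passive[i + 3:]
print(" ".join(surface))  # the snatcher will be caught by the police
```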
Except for the direction in which its script is written, Urdu is closely related to Hindi. Both share
similar phonology, morphology, and syntax. Both are free-word-order languages and use post-
positions. They also share a large amount of their vocabulary. Differences in the vocabulary arise
mainly because a significant portion of Urdu vocabulary comes from Persian and Arabic, while
Hindi borrows much of its vocabulary from Sanskrit. Paninian grammar provides a framework for
Indian language models, which can be used for the computational processing of Indian
languages. The grammar focuses on the extraction of karaka relations from a sentence.
Machine Translation
• Automatic translation of text from one human language to another.
• In order to carry out this translation, it is necessary to have an understanding of
words and phrases, the grammars of the two languages involved, the semantics of the
languages, and world knowledge.
Speech Recognition
• It is the process of mapping acoustic speech signals to a set of words.
• Issues might occur due to wide variations in the pronunciation of words, homonyms
(e.g., dear and deer), and acoustic ambiguities (e.g., in the rest vs. interest).
Speech Synthesis
• Speech synthesis refers to the automatic production of speech utterances for natural
language sentences.
• Such systems can read out mail over the telephone, or read out a storybook.
• In order to generate utterances, the text has to be processed.
Natural Language Interfaces to Databases
• Natural language interfaces allow querying a structured database using natural
language sentences.
Question Answering
• Attempts to find the precise answer, or at least the precise portion of text in which
the answer appears.
• Requires precise analysis of the question and of portions of text, as well as semantics and
background knowledge, to answer certain types of questions.
Text summarization
• Creates summaries of documents; involves syntactic, semantic, and discourse-level
processing.
Information Retrieval
• Identifies the documents relevant to a user's query. NLP techniques such as indexing
(stop word elimination, stemming, phrase extraction, etc.), word sense
disambiguation, query modification, and knowledge bases have been used in IR
systems to enhance performance (see the indexing sketch below).
• Eg: Google Search
• WordNet, LDOCE (Longman Dictionary of Contemporary English), and Roget's
Thesaurus are some useful lexical resources for IR research.
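As a concrete illustration of indexing, here is a minimal sketch of index-term extraction combining stop-word elimination with crude suffix stripping. The stop list and suffix rules are illustrative stand-ins, not a real stemmer such as Porter's.

```python
# Index-term extraction: stop-word elimination + crude stemming (sketch).
STOP_WORDS = {"the", "is", "are", "a", "an", "of", "to", "and", "in"}
SUFFIXES = ["ing", "ies", "es", "ed", "s"]  # checked in this order

def stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, stem what remains."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("The stories of the Arabian knights are translated"))
# ['stor', 'arabian', 'knight', 'translat']
```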
Information Extraction
• Captures and outputs factual information contained within a document.
• The information need is specified as pre-defined database schemas or templates.
*****CHAPTER ENDS*****
CHAPTER-2
LANGUAGE MODELLING
• A language model is a description of a language.
• Two approaches: grammar-based and statistical (data-driven).
• In the grammar-based approach, the grammar consists of hand-coded rules defining the
structure and ordering of the various constituents appearing in a linguistic unit (phrase,
sentence, etc.). The approach attempts to utilize this structure and also the
relationships between these structures, as in the sketch below.
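A minimal sketch of the grammar-based idea, assuming a toy grammar with the rules S → NP VP, NP → Det Noun, VP → Verb NP; the rules and word lists are illustrative assumptions, not from any particular system.

```python
# Toy grammar-based recognizer: hand-coded rules, no statistics.
# Grammar (illustrative): S -> NP VP, NP -> Det Noun, VP -> Verb NP
DET = {"the", "a"}
NOUN = {"police", "snatcher", "dog"}
VERB = {"caught", "saw"}

def parse_np(words, i):
    """Return the index just past an NP starting at i, or None."""
    if i + 1 < len(words) and words[i] in DET and words[i + 1] in NOUN:
        return i + 2
    return None

def parse_vp(words, i):
    """Return the index just past a VP starting at i, or None."""
    if i < len(words) and words[i] in VERB:
        return parse_np(words, i + 1)
    return None

def is_grammatical(sentence):
    """A sentence is well-formed iff it derives from S -> NP VP."""
    words = sentence.lower().split()
    j = parse_np(words, 0)
    return j is not None and parse_vp(words, j) == len(words)

print(is_grammatical("The police caught the snatcher"))  # True
print(is_grammatical("Caught the snatcher the police"))  # False
```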
n-gram Model
So, in order to calculate the probability of a sentence, we need to calculate the probability of
each word given the sequence of words preceding it:
P(w1 w2 ... wm) = P(w1) × P(w2 | w1) × ... × P(wm | w1 ... wm-1) = ∏ P(wi | hi)
The n-gram model calculates P(wi | hi) by modelling language as a Markov model of order n-1,
i.e., by looking at the previous n-1 words only:
P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)
A model that limits the history to the previous one word only is termed a bi-gram (n=2) model.
Likewise, a model that conditions the probability of a word on the previous two words is called
a tri-gram (n=3) model.
EXAMPLE:
• Training set:
• Test sentence(s):
• Solution
Test sentence:
• 0.67 × 0.5 × 1.0 × 0.5 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2 = 0.00335
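Since the training set of the worked example is not reproduced above, here is a hedged sketch of the same calculation over a made-up two-sentence corpus (the corpus is an assumption for illustration only).

```python
from collections import Counter

# Hypothetical toy corpus with sentence-boundary markers.
corpus = [
    "<s> the arabian knights </s>",
    "<s> the stories of the arabian knights </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """P(sentence) as a product of bigram probabilities."""
    p = 1.0
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)  # an unseen bigram would make p = 0
    return p

print(sentence_prob("<s> the arabian knights </s>"))  # 1.0 * 2/3 * 1.0 * 1.0 ≈ 0.667
```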
An n-gram that does not occur in the training data is assigned zero probability, so
that even a large corpus has several zero entries in its bi-gram matrix.
There are several long-distance dependencies in natural language sentences, which
this model fails to capture.
Smoothing techniques have been developed to handle the data sparseness problem
Add-one (Laplace) smoothing
• It adds a value of one to each n-gram frequency before normalizing the frequencies into
probabilities. Thus, the conditional probability becomes:
P(wi | wi-n+1 ... wi-1) = (C(wi-n+1 ... wi) + 1) / (C(wi-n+1 ... wi-1) + V)
where V is the vocabulary size, i.e., the size of the set of all the words being considered.
• Issues:
It assigns the same probability to all missing n-grams, even though some of them are
more likely than others.
It shifts too much of the probability mass towards the unseen n-grams (n-grams with
0 probability), as their number is usually quite large.
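A standalone sketch of the add-one estimate, using assumed toy counts in the spirit of the bigram example above.

```python
from collections import Counter

# Assumed toy counts for illustration.
unigrams = Counter({"<s>": 2, "the": 3, "arabian": 2, "knights": 2})
bigrams = Counter({("<s>", "the"): 2, ("the", "arabian"): 2})
V = 7  # assumed vocabulary size: distinct words in the toy corpus

def add_one_prob(prev, word):
    """P(word | prev) = (C(prev word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(add_one_prob("the", "arabian"))  # seen:   (2 + 1) / (3 + 7) = 0.3
print(add_one_prob("the", "stories"))  # unseen: (0 + 1) / (3 + 7) = 0.1
```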
• Good-Turing smoothing adjusts the frequency f of an n-gram using the count of n-grams
having a frequency of occurrence f+1:
f* = (f + 1) × n(f+1) / n(f)
where n(f) is the number of n-grams that occur exactly f times in the training corpus.
• Eg: consider that the number of n-grams that occur 4 times is 25,108 and the
number of n-grams that occur 5 times is 20,542. The adjusted frequency is then
f* = 5 × 20,542 / 25,108 ≈ 4.09.
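The adjustment can be checked directly with the values from the example above.

```python
# Good-Turing adjusted frequency: f* = (f + 1) * n(f + 1) / n(f)
n = {4: 25108, 5: 20542}  # number of n-grams occurring f times (from the example)

f = 4
f_star = (f + 1) * n[f + 1] / n[f]
print(round(f_star, 2))  # 4.09: n-grams seen 4 times behave as if seen ~4.09 times
```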
• The frequency of an n-gram is not uniform across text segments or corpora.
• Certain words occur more frequently in certain segments (or documents) and rarely in
others.
• The basic n-gram model ignores this sort of variation in n-gram frequency.
• The cache model combines the most recent n-gram frequencies with the standard n-gram
model to improve its performance locally, as sketched below.
• The underlying assumption here is that recently encountered words are more likely to be
repeated.
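A hedged sketch of the cache idea: interpolate the static n-gram estimate with a unigram cache over the most recent words. The interpolation weight, cache size, and function names are assumptions following the usual formulation; in practice the weight is tuned on held-out data.

```python
from collections import Counter, deque

K = 200     # cache size (assumed)
LAM = 0.9   # interpolation weight (assumed)
cache = deque(maxlen=K)  # the K most recently seen words

def cache_prob(word):
    """Relative frequency of word within the cache."""
    if not cache:
        return 0.0
    return Counter(cache)[word] / len(cache)

def combined_prob(word, static_prob):
    """Interpolate: P(w | h) = LAM * P_static(w | h) + (1 - LAM) * P_cache(w)."""
    return LAM * static_prob + (1 - LAM) * cache_prob(word)

# Recently seen words get a local boost:
for w in "the share market crashed and the market panicked".split():
    cache.append(w)
print(combined_prob("market", static_prob=0.001))  # 0.9*0.001 + 0.1*0.25 = 0.0259
```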
PANINIAN FRAMEWORK
• Paninian grammar (PG) was written by Panini around 500 BC.
The inflections provide important syntactic and semantic cues for language
analysis and understanding.
Some languages like Sanskrit have the flexibility to allow word groups representing
subject, object, and verb to occur in any order.
Layered Representation in PG
• Paninian grammar framework is said to be syntactico-semantic, that is, one can go from
surface layer to deep semantics by passing through intermediate layers.
• It has four levels: the surface level, the vibhakti level, the karaka level, and the semantic
level (the topmost).
PG specifies a mapping between the karaka level and the vibhakti level, and the
vibhakti level and the surface form.
• The vibhakti level is the level at which there are local word groups based on case endings.
Vibhakti refers to word (noun, verb, or other) groups based either on case endings,
or post-positions, or compound verbs, or main and auxiliary verbs, etc.
These markers are language-specific, but all Indian languages can be represented at
the vibhakti level.
Vibhakti for verbs includes the verb form and the auxiliary verbs
The information about TAM (tense, aspect and modality) is given by the vibhakti for
a verb.
• The karaka level lies between the topmost (semantic) level and the vibhakti level.
At the karaka level, we have karaka relations and verb-verb relations etc.
• Through these relations, the Karakas try to capture the information from the
semantics of the texts.
Thus, Karaka level processes the semantics of the language but represents it at the
syntactic level.
• These relations are based on the way the word groups participate in the
activity denoted by the verb group
This is the level of semantics that is important syntactically and is reflected in the
surface form of the sentence.
KARAKA THEORY
• Karaka relations are assigned based on the roles played by the various participants in the
main activity.
These roles are reflected in case markers and post-position markers.
• The richness of the case endings found in Indian languages has been used to
advantage. The main karaka relations include:
Karta (agent)
Karma (object/theme)
Karana (instrument)
Sampradana (beneficiary/recipient)
Apadan (source/separation)
Example (word-by-word gloss of a Hindi sentence):
mother (karta) | plate-from (apadan) | food | taking-up | child-to | gave
'The mother gave food to the child, taking it up from the plate.'
• Computational implementation of PG
• The PG implementation is multilayered.
• Rules may be represented in the form of charts (such as the karaka chart and the lakshan
chart).
Another difficulty arises when the mapping between the vibhakti (case markers and
post-positions) and the semantic relation (with respect to the verb) is not one-to-one.
• Two different vibhaktis can represent the same relation, or the same vibhakti can represent
different relations in different contexts (see the toy chart below). Strategies to disambiguate
the various senses of words, or word groupings, are still challenging issues.
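A toy illustration of this many-to-many mapping as a chart lookup; the markers and candidate relations below are simplified assumptions for Hindi-like post-positions, not an actual karaka chart from the framework.

```python
# Toy "karaka chart": case/post-position marker -> candidate karaka relations.
KARAKA_CHART = {
    "ne": ["karta"],                # ergative marker: agent
    "ko": ["karma", "sampradana"],  # one vibhakti, two possible relations
    "se": ["karana", "apadan"],     # instrument or source
}

def candidate_karakas(marker):
    """All karaka relations a marker may signal; context must disambiguate."""
    return KARAKA_CHART.get(marker, [])

print(candidate_karakas("se"))  # ['karana', 'apadan'] -> needs disambiguation
```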
As the system of rules differs from language to language, the framework requires adaptation
to tackle various applications in various languages.
******CHAPTER ENDS*****
*****END OF MODULE-1*****