BAI601 Module 1 PDF
MODULE-1
Introduction & Language Modelling
• Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian
Languages, NLP Applications.
• Language Modelling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.
Textbook 1: Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press. Ch. 1, Ch. 2.
1. INTRODUCTION
Two major approaches to the study of language and its acquisition are:
1. Rationalist approach
2. Empiricist approach
Rationalist approach: The earlier approach; it assumes the existence of some language faculty in
the human brain. Supporters of this approach argue that it is not possible to learn something as complex
as natural language from limited sensory inputs.
Empiricist approach: Empiricists do not believe in the existence of a language faculty. They believe in the
existence of some general organizing principles such as pattern recognition, generalization, and association.
Learning of detailed structures takes place through the application of these principles on sensory
inputs available to the child.
Computational linguistics is similar to theoretical linguistics and psycholinguistics, but uses different tools.
While theoretical linguistics is concerned mainly with the structural rules of language, psycholinguistics focuses on
how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates
universal grammar, syntax, semantics, phonology, and morphology. Linguists create models to explain
how languages are structured and how meaning is encoded. Eg. Most languages have constructs like noun
and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages
(grammar).
Psycho-linguistics focuses on the psychological and cognitive processes involved in language use. It
examines how individuals acquire, process, and produce language. Researchers study language
development in children and how the brain processes language in real-time. Eg. Studying how children
acquire language, such as learning to form questions ("What’s that?").
NLP has become one of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.
• A word can have a number of possible meanings associated with it. But in a given context, only
one of these meanings participates.
• Finding out the correct meaning of a particular use of a word is necessary to find the meaning of larger
units.
• Eg. Kabir and Ayan are married.
Kabir and Suha are married.
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic
knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.
Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
Example: The district administration refused to give the trade union
permission for the meeting because they feared violence. (a)
The district administration refused to give the trade union permission
for the meeting because they oppose government. (b)
• For example, in the above sentences, resolving the anaphoric reference 'they' requires pragmatic
knowledge.
1.3.5 Pragmatic analysis
• The highest level of processing, deals with the purposeful use of sentences in situations.
• It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.
• Challenges in Language Specification: Natural languages constantly evolve, and the numerous
exceptions make language specification challenging for computers.
• Different Grammar Frameworks: Various grammar frameworks have been developed,
including transformational grammar, lexical functional grammar, and dependency grammar, each
focusing on different aspects of language such as derivation or relationships.
• Chomsky’s Contribution: Noam Chomsky’s generative grammar framework, which uses rules
to specify grammatically correct sentences, has been fundamental in the development of formal
grammar hierarchies.
Chomsky argued that phrase structure grammars are insufficient for natural language and proposed
transformational grammar in Syntactic Structures (1957). He suggested that each sentence has two levels:
a deep structure and a surface structure (as shown in Fig 1), with transformations mapping one to the
other.
• Chomsky argued that an utterance is the surface representation of a 'deeper structure' representing
its meaning.
• The deep structure can be transformed in a number of ways to yield many different surface-level
representations.
• Sentences with different surface-level representations having the same meaning, share a common
deep-level representation.
Pooja plays veena.
Veena is played by Pooja.
Both sentences have the same meaning, despite having different surface structures (roles of subject and
object are inverted).
Transformational grammar has three components:
1. Phrase structure grammar: Defines the basic syntactic structure of sentences.
2. Transformational rules: Describe how deep structures can be transformed into different surface
structures.
3. Morphophonemic rules: Govern how the structure of a sentence (its syntax) influences the
form of the words in terms of sound and pronunciation (phonology).
Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them. As an example, consider the following set of rules:
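As an illustration only (not the textbook's exact rule set), a small phrase structure grammar that can generate the active sentence used below is:
S → NP VP
NP → Det N
VP → Aux V NP
Det → the
N → police | snatcher
Aux → will
V → catch
Applying these rules top-down yields "The police will catch the snatcher" together with its phrase structure tree.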
Transformational rules transform one phrase-marker (underlying) into another phrase-marker (derived).
These rules are applied on the terminal string generated by the phrase structure rules. They transform one
representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: “The police will catch the snatcher.”
The application of phrase structure rules will assign the structure shown in Fig 2 (a)
Fig. 2: (a) Phrase structure (b) Passive Transformation
Note: Long distance dependency refers to syntactic phenomena where a verb and its subject or object can
be arbitrarily far apart. Wh-movement is a specific case of this type of dependency.
E.g.
"I wonder who John gave the book to" involves a long-distance dependency between the verb "wonder"
and the object "who". Even though "who" is not directly adjacent to the verb, the syntactic relationship
between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.
Paninian grammar provides a framework for Indian language models. These can be used for
computation of Indian languages. The grammar focuses on extraction of relations from a
sentence.
1.8.1 ELIZA (Weizenbaum 1966)
ELIZA is one of the earliest natural language understanding programs. It uses syntactic patterns to
mimic human conversation with the user. Here is a sample conversation.
The first SysTran machine translation system was developed in 1969 for Russian-English translation.
SysTran also provided the first on-line machine translation service, called Babel Fish, which was used by
the AltaVista search engine for handling translation requests from users.
This is a natural language generation system used in Canada to generate weather reports. It accepts
daily weather data and generates weather reports in English and French.
This is a natural language understanding system that simulates actions of a robot in a block world
domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the
robot to manipulate the blocks, to tell the blocks configurations, and to explain its reasoning.
This was an early question answering system that answered questions about moon rocks.
The availability of vast amounts of electronic text has made it challenging to find relevant
information. Information retrieval (IR) systems aim to address this issue by providing efficient access to
relevant content. Unlike 'entropy' in communication theory, which measures uncertainty, information here
refers to the content or subject matter of text, not digital communication or data transmission. Words serve
as carriers of information, and text is seen as the message encoded in natural language.
IR systems deal mostly with unstructured data. Retrieval is based on content, not structure, and systems
typically return a ranked list of relevant documents.
IR has been integrated into various systems, including database management systems, bibliographic
retrieval systems, question answering systems, and search engines. Approaches for accessing large text
collections fall into two categories: one builds topic hierarchies (e.g., Yahoo), requiring manual
classification of new documents, which is not cost-effective; the other ranks documents by relevance,
offering more scalability and efficiency for large collections.
Major issues in designing and evaluating Information Retrieval (IR) systems include selecting
appropriate document representations. Current models often use keyword-based representation, which
suffers from problems like polysemy, homonymy, and synonymy, as well as ignoring semantic and
contextual information. Additionally, vague or inaccurate user queries lead to poor retrieval performance,
which can be addressed through query modification or relevance feedback.
2.1 Introduction
Our purpose is to understand and generate natural languages from a computational viewpoint.
1st approach: Try to understand every word and sentence of it, and then come to a conclusion (has not
succeeded).
2nd approach: To study the grammar of various languages, compare them, and if possible, arrive at
reasonable models that facilitate our understanding of the problem and designing of natural-language
tools.
Language Model: A model is a description of some complex entity or process. Natural language is a
complex entity and in order to process it through a computer-based program, we need to build a
representation (model) of it.
Two categories of language modelling approaches: grammar-based language models and statistical language models.
Grammar-based language model:
Eg. A sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts
to utilize this structure and also the relationships between these structures.
Linguists often argue that language structure, especially in resolving structural ambiguity, can be
understood through meaning. However, the transformation between meaning and syntax is not well
understood. Transformational grammars distinguish between surface-level and deep (root) level sentence
structures.
Government and Binding (GB) theories rename these as s-level and d-level, adding phonetic and
logical forms as parallel levels of representation for analysis, as shown in Figure.
• 'Meaning' and its 'sound' form are represented as logical form (LF) and phonetic form (PF), respectively, in the above
figure.
• The GB is concerned with LF, rather than PF.
• The GB imagines that if we define rules for structural units at the deep level, it will be possible
to generate any language with fewer rules.
Components of GB
• Government and binding (GB) comprise a set of theories that map the structures from d-structure
to s-structure and to logical form (LF).
• A general transformational rule called 'Move 𝛼' is applied at d-structure level as well as at s-
structure level.
• The simplest form of GB can be represented as below.
GB consists of 'a series of modules that contain constraints and principles' applied at various
levels of its representations and the transformation rule, Move α.
The GB considers all three levels of representations (d-, s-, and LF) as syntactic, and LF is also
related to meaning or semantic-interpretive mechanisms.
GB applies the same Move α transformation to map d-levels to s-levels or s-levels to the LF level.
LF level helps in quantifier scoping and also in handling various sentence constructions such as passive
or interrogative constructions.
Example:
Consider the sentence: “ Two countries are visited by most travellers.”
Its two possible logical forms are:
LF1: [s Two countries are visited by [NP most travellers]]
LF2: [NP most travellers]i [s Two countries are visited by ei]
• In LF1, the interpretation is that most travellers visit the same two countries (say, India and
China).
• In LF2, when we move [most travellers] outside the scope of the sentence, the interpretation can
be that most travellers visit two countries, which may be different for different travellers.
• One of the important concepts in GB is that of constraints. It is the part of the grammar which
prohibits certain combinations and movements; otherwise Move α can move anything to any
possible position.
• Thus, GB, is basically the formulation of theories or principles which create constraints to
disallow the construction of ill-formed sentences.
The organization of GB is as given below:
𝑿̅ Theory:
• The 𝑿̅ Theory (pronounced X-bar theory) is one of the central concepts in GB. Instead of defining
several phrase structures and the sentence structure with separate sets of rules, 𝑿̅ Theory defines
them both as maximal projections of some head.
• Noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional phrase (PP) are
maximal projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and
can be represented as head X of their corresponding phrases (where X = {N, V, A, P})
• Even the sentence structure can be regarded as the maximal projection of inflection (INFL).
• The GB envisages projections at two levels:
• the projection of the head at the semi-phrasal level, denoted by X̄ (X-bar), and
• the maximal projection at the phrasal level, denoted by X̿ (X-double-bar), i.e., the full phrase XP.
Sub-categorization: It refers to the process of classifying words or phrases (typically verbs) according
to the types of arguments or complements they can take. It's a form of syntactic categorization that is
important for understanding the structure and meaning of sentences.
For example, different verbs in English can have different sub-categorization frames (also called argument
structures). A verb like "give" might take three arguments (subject, object, and indirect object), while a
verb like "arrive" might only take a subject and no objects.
"He gave her a book." ("gave" requires a subject, an indirect object, and a direct object)
In principle, any maximal projection can be the argument of a head, but sub-categorization is used as a
filter to permit various heads to select a certain subset of the range of maximal projections.
Projection Principle:
Three syntactic representations:
1. Constituency Parsing (Tree Structure):
• Sentences are broken into hierarchical phrases or constituents (e.g., noun phrases, verb
phrases), represented as a tree structure.
2. Dependency Parsing (Directed Graph):
• Focuses on the direct relationships between words, where words are connected by directed
edges indicating syntactic dependencies.
3. Semantic Role Labelling (SRL):
• Identifies the semantic roles (e.g., agent, patient) of words in a sentence, focusing on the meaning
behind the syntactic structure.
The projection principle, a basic notion in GB, places a constraint on the three syntactic representations
and their mapping from one to the other.
The principle states that representations at all syntactic levels (i.e., d-level, s-level, and LF level) are
projections from the lexicon (collection or database of words and their associated linguistic information).
Thus, lexical properties of categorical structure (sub-categorization) must be observed at each level.
Suppose 'the object' is not present at d-level, then another NP cannot take this position at s-level.
Example:
• At D-structure, each argument of a verb is assigned a thematic role (e.g., Agent, Theme, Goal,
etc.).
• In a sentence like "John gave Mary the book", the verb "gave" requires three arguments: Agent
(John), Recipient (Mary), and Theme (the book).
• If the object (Theme) is not present at the deep structure, it cannot be filled at the surface structure
(S-structure) by another NP (e.g., a different noun phrase).
• 'Sub-categorization' only places a restriction on syntactic categories which a head can accept.
• GB puts another restriction on the lexical heads through which it assigns certain roles to its
arguments.
• These roles are pre-assigned and cannot be violated at any syntactical level as per the projection
principle.
• These role assignments are called theta-roles and are related to 'semantic-selection'.
Agent is a special type of role which can be assigned by a head to outside arguments (external
arguments) whereas other roles are assigned within its domain (internal arguments).
Hence in 'Mukesh ate food',
the verb 'eat' assigns the 'Agent' role to 'Mukesh' (outside the VP) and the 'Theme' role to 'food' (inside the VP).
Theta-Criterion states that 'each argument bears one and only one Ɵ-role, and each Ɵ-role is
assigned to one and only one argument'.
If there are two structures α and β related in such a way that 'every maximal projection dominating α
dominates β', we say that α C-commands β; this is the necessary and sufficient condition (iff) for C-
command.
GOVERNMENT
Government is a special case of C-COMMAND.
Government refers to the syntactic relationship between a head (typically a verb, noun, or adjective) and its dependent
elements (such as objects or complements) within a sentence. It determines how certain words control the form or
case of other words in a sentence.
C-command, on the other hand, is a syntactic relationship between two constituents in a sentence. A constituent A c-
commands another constituent B if A is higher in the syntactic structure (usually in the tree) and can potentially
govern or affect B, provided there are no intervening nodes.
To put it together in context:
Government: This is a formal rule determining how certain words govern the case or form of other words in a
sentence (e.g., verbs can govern the object noun in accusative case in languages like Latin or German).
C-command: This is a structural relationship in which one constituent can influence another, typically affecting
operations like binding, scope, and sometimes government.
In short, government often operates within the structures of c-command, but c-command itself is a broader syntactic
relationship that is also relevant for other linguistic phenomena, such as binding theory, where one element can bind
another if it c-commands it.
Here are a few examples of government in syntax, showing how one word governs the form or case of another
word in a sentence:
1. Verb Government
In many languages, verbs can govern the case of their objects. Here’s an example in Latin:
Latin: "Vidēre puellam" (to see the girl)
The verb "vidēre" (to see) governs the accusative case of "puellam" (the girl).
In this case, the verb "vidēre" governs the object "puellam" by requiring it to be in the accusative case.
2. Preposition Government
Prepositions can also govern the case of their objects. Here’s an example from German:
German: "Ich gehe in den Park" (I am going to the park)
The preposition "in" governs the accusative case of "den Park" (the park).
The preposition "in" governs the accusative case for the noun "Park" in this sentence.
3. Adjective Government
Adjectives can govern the case, gender, or number of the noun they modify. Here's an example from Russian:
Russian: "Я вижу красивую девочку" (I see a beautiful girl)
The adjective "красивую" (beautiful) governs the accusative case of "девочку" (girl).
In this case, the adjective "красивую" (beautiful) governs the accusative case of "девочку".
4. Noun Government
In some languages, nouns can govern the case of their arguments. In Russian, for example, some nouns govern a
particular case:
Russian: "Я горжусь успехом" (I am proud of the success)
The noun "успехом" (success) governs the instrumental case in this sentence.
Here, the noun "успехом" governs the instrumental case of its argument "успех".
Summary:
Government involves syntactic relationships where a head (verb, preposition, adjective, etc.) dictates or determines
the form (such as case) of its dependent elements.
In these examples, verbs, prepositions, and adjectives have a "governing" influence on the cases of nouns or objects
in the sentence, which is a core part of the syntax in many languages.
GB recognizes four types of empty categories: two being empty NP positions, called wh-trace and NP-trace, and
the remaining two being pronouns, called small 'pro' and big 'PRO'.
This division is based on two properties: anaphoric (+a or -a) and pronominal (+p or -p).
Wh-trace: -a, -p
NP-trace: +a, -p
small 'pro': -a, +p
big 'PRO': +a, +p
The traces help ensure that the proper binding relationships are maintained between moved elements (such
as how pronouns or reflexives bind to their antecedents, even after movement).
Additional Information:
• +a (Anaphor): A form that must refer back to something mentioned earlier (i.e., it has an
antecedent). For example, "himself" in "John washed himself." The form "himself" is an anaphor
because it refers back to "John."
• -a (Non-Anaphor): A form that does not require an antecedent to complete its meaning. A regular
pronoun like "he" in "He went to the store" is not an anaphor because it doesn't explicitly need to
refer back to something within the same sentence or clause.
• +p (Pronominal): A form that can function as a pronoun, standing in for a noun or noun phrase.
For example, "she" in "She is my friend" is a pronominal because it refers to a specific person
(though not necessarily previously mentioned).
• -p (Non-Pronominal): A word or form that isn't used as a pronoun. It could be a noun or other
word that doesn't serve as a replacement for a noun phrase in a given context. For example, in
"John went to the store," "John" is not pronominal—it is a noun phrase.
Co-indexing
It is the indexing of the subject NP and AGR (agreement) at d-structure which are preserved by Move α
operations at s-structure.
When an NP-movement takes place, a trace of the movement is created by having an indexed empty
category (e) from the position at which the movement began to the corresponding indexed NP.
For defining constraints on movement, the theory identifies two kinds of positions in a sentence. Positions assigned
θ-roles are called θ-positions, while the others are called θ̄-positions (theta-bar positions).
In a similar way, core grammatical positions (where subject, object, indirect object, etc., are positioned)
are called A-positions (argument positions), and the rest are called Ā-positions.
Binding theory:
Binding Theory is a syntactic theory that explains how pronouns and noun phrases are interpreted and
distributed in a sentence. It's concerned with the relationships between pronouns and their antecedents
(myself, herself, himself).
The empty category (ei) and Mukesh (NPi) are bound (co-indexed). This theory gives a relationship between NPs (including
pronouns and reflexive pronouns). Now, binding theory can be given as follows:
(a) An anaphor (+a) is bound in its governing category.
(b) A pronominal (+p) is free in its governing category.
(c) An R-expression (-a, -p) is free.
Example
A: Mukeshi knows himselfi
α properly governs β if α governs β and α is lexical (i.e., N, V, A, or P), or α locally A-binds β.
The ECP says 'A trace must be properly governed'.
This principle justifies the creation of empty categories during NP- trace and wh-trace and also explains
the subject/object asymmetries to some extent. As in the following sentences:
(a) Whati do you think that Mukesh ate ei?
Note: There are many other types of constraints on Move α, and it is not possible to explain all of them here.
In English, the long-distance movement for complement clause can be explained by bounding theory if
NP and S are taken to be bounding nodes. The theory says that the application of Move α may not cross
more than one bounding node. The theory of control involves syntax, semantics, and pragmatics.
In GB, case theory deals with the distribution of NPs and mentions that each NP must be assigned a case.
In English, we have the nominative, objective, genitive, etc., cases, which are assigned to NPs at particular
positions. Indian languages are rich in case-markers, which are carried even during movements.
Example:
He is running ("He" is the subject of the sentence, performing the action. - nominative)
She sees him. ("Him" is the object of the verb "sees." - Objective)
The man's book. (The genitive case expresses possession or a relationship between nouns.)
Case filter: An NP is ungrammatical if it has phonetic content or is an argument, and it is not case-marked.
Phonetic content here refers to some physical realization, as opposed to empty categories.
Thus, case filters restrict the movement of NP at a position which has no case assignment. It works in a
manner similar to that of the θ-criterion.
Summary of GB:
In short, GB presents a model of the language which has three levels of syntactic representation.
• It assumes phrase structures to be the maximal projection of some lexical head and in a similar
fashion, explains the structure of a sentence or a clause.
• It assigns various types of roles to these structures and allows them a broad kind of movement
called Move α.
• It then defines various types of constraints which restrict certain movements and justifies others.
2.2.4 Lexical Functional Grammar (LFG) Model
Watch this video: https://www.youtube.com/watch?v=EoCLhS_0cmE
• LFG represents sentences at two syntactic levels - constituent structure (c-structure) and
functional structure (f-structure).
• Kaplan proposed a concrete form for the register names and values which became the functional
structures in LFG.
• Bresnan was more concerned with the problem of explaining some linguistic issues, such as
active/passive and dative alternations, in transformational approach. She proposed that such
issues can be dealt with by using lexical redundancy rules.
• The unification of these two diverse approaches (with a common concern) led to the development
of the LFG theory.
• The 'functional' part is derived from 'grammatical functions', such as subject and object, or roles
played by various arguments in a sentence.
• The 'lexical' part is derived from the fact that the lexical rules can be formulated to help define
the given structure of a sentence and some of the long-distance dependencies, which is difficult
in transformational grammars.
Since grammatical-functional roles cannot be derived directly from the phrase and sentence structure, functional
specifications are annotated on the nodes of the c-structure; applying these annotated rules to a sentence yields its
f-structure. For example, for the sentence 'She saw stars in the sky', the f-structure is:
[
SUBJ: [ PERS: 3, NUM: SG ], // "She" is the subject, 3rd person, singular
PRED: "see", // The verb "saw" represents the predicate "see"
OBJ: [ NUM: PL, PRED: "star" ], // "stars" is the object, plural, and the predicate is "star"
LOC: [ PRED: "sky", DEF: + ] // "sky" is the location, with a definite determiner ("the")
]
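As a minimal sketch (my own illustration, not from the textbook), the same f-structure can be written as a nested Python dictionary:

# The f-structure above as a nested dictionary (attribute names as in the AVM).
f_structure = {
    "SUBJ": {"PERS": 3, "NUM": "SG"},       # "She": third person singular
    "PRED": "see",                          # main predicate
    "OBJ":  {"NUM": "PL", "PRED": "star"},  # "stars": plural object
    "LOC":  {"PRED": "sky", "DEF": True},   # "the sky": definite location
}
print(f_structure["OBJ"]["NUM"])            # PL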
Example: annotated c-structure and the corresponding f-structure (figure). The annotated c-structure rules include:
Rule 1: S → NP VP, with the NP annotated (↑ SUBJ) = ↓ and the VP annotated ↑ = ↓
PP → P NP
NP → Det N {PP}
S' → Comp S
Where: S: Sentence, V: Verb, P: Preposition, N: Noun
• Here, ↑ (the up arrow) refers to the f-structure of the mother node, i.e., the node on the left-hand side of the
rule.
• The ↓ (down arrow) symbol refers to the f-structure of the node under which it is written.
• Hence, in Rule 1, (↑ SUBJ) = ↓ indicates that the f-structure of the first NP goes to the f-structure of
the subject of the sentence, while ↑ = ↓ indicates that the f-structure of the VP node goes directly
into the f-structure of the sentence.
Consistency In a given f-structure, a particular attribute may have at the most one value. Hence, while
unifying two f-structures, if the attribute Num has value SG in one and PL in the other, it will be rejected.
Completeness When an f-structure and all its subsidiary f-structures (as the value of any attribute of f-
structure can again contain other f-structures) contain all the functions that their predicates govern, then
and only then is the f-structure complete.
For example, since the predicate 'see <(↑ Subj) (↑ Obj)>' contains an object as its governable function,
a sentence like 'She saw' will be incomplete.
Coherence Coherence maps the completeness property in the reverse direction. It requires that all
governable functions of an f-structure, and all its subsidiary f-structures, must be governed by their
respective predicates. Hence, in the f-structure of a sentence, an object cannot be taken if its verb does
not allow that object. Thus, it will reject the sentence, 'I laughed a book.'
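A minimal sketch (my own illustration, not the textbook's algorithm) of how these three well-formedness conditions could be checked over dictionary-style f-structures; the PRED encoding and the function names are assumptions:

def unify(f1, f2):
    # Unify two f-structures; return None on a consistency violation.
    result = dict(f1)
    for attr, val in f2.items():
        if attr not in result:
            result[attr] = val
        elif isinstance(result[attr], dict) and isinstance(val, dict):
            sub = unify(result[attr], val)
            if sub is None:
                return None
            result[attr] = sub
        elif result[attr] != val:          # e.g., NUM: SG vs NUM: PL
            return None                    # consistency violated
    return result

def well_formed(f):
    # PRED is encoded as (lemma, governed functions), e.g. ("see", ("SUBJ", "OBJ")).
    lemma, governed = f["PRED"]
    governable = ("SUBJ", "OBJ", "OBJ2", "XCOMP")
    present = {g for g in governable if g in f}
    complete = set(governed) <= present    # every governed function is filled
    coherent = present <= set(governed)    # no ungoverned governable function
    return complete and coherent

print(unify({"NUM": "SG"}, {"NUM": "PL"}))     # None: consistency violation
she_saw = {"PRED": ("see", ("SUBJ", "OBJ")), "SUBJ": {"NUM": "SG"}}
print(well_formed(she_saw))                    # False: 'She saw' is incomplete (no OBJ)
laughed_a_book = {"PRED": ("laugh", ("SUBJ",)),
                  "SUBJ": {"NUM": "SG"},
                  "OBJ": {"PRED": ("book", ())}}
print(well_formed(laughed_a_book))             # False: 'I laughed a book' is incoherent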
Example:
Let us see first the lexical entries of various words in the sentence:
Lexical entries
c-structure
Lexical Rules in LFG
Different theories have different kinds of lexical rules and constraints for handling various sentence-
constructs (active, passive, dative, causative, etc.).
In LFG, the verb is converted to the participial form, but the sub-categorization is changed directly.
Example
Active: तारा हँसी
Taaraa hansii
Tara laughed
Active: Pred = 'laugh <(↑ Subj)>'

Causative: मोनिका ने तारा को हँसाया
Monika ne Tara ko hansaayaa
Monika-Subj Tara-Obj laugh-cause-past
Monika made Tara laugh.
Causative: Pred = 'cause <(↑ Subj) (↑ Obj) (↑ Comp)>'

Here, a new predicate is formed which causes the action and requires a new subject, while the old subject
becomes the object of the new predicate and the old verb becomes the X-complement (complement to
infinitival VPs).
Long Distance Dependencies and Coordination
In GB, when a category is moved, it creates (leaves behind) an empty category.
In LFG, unbounded movement and coordination is handled by the functional identity and by correlation
with the corresponding f-structure.
Unlike English, which is SVO (Subject-Verb-Object) ordered, Indian languages are typically SOV (Subject-Object-Verb) ordered
and inflectionally rich. The inflections provide important syntactic and semantic cues for language
analysis and understanding. The Paninian framework takes advantage of these features.
Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.
Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, we
can change the position of subject and object. For example:
वह चला (vaha chalaa) – He moved.
वह चल दिया (vaha chal diyaa, literally 'he move gave') – He moved (started the action).
The nouns are followed by post-positions instead of prepositions. These generally remain as separate
words in Hindi:
रेखा के पिता (Rekha ke pita) – Father of Rekha
उसके पिता (uske pita) – Her (his) father
All nouns are categorized as feminine or masculine, and the verb form must show gender agreement with
the subject:
ताला खो गया (taalaa kho gayaa, 'lock lose (past)') – The lock was lost.
चाभी खो गयी (chaabhii kho gayii, 'key lose (past)') – The key was lost.
Layered Representation in PG
The GB theory represents three syntactic levels: deep structure, surface structure, and logical form (LF),
where the LF is nearer to semantics. This theory tries to resolve all language issues at syntactic levels
only.
• The surface and the semantic levels are obvious. The other
two levels (vibhakti and karaka) should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here, it refers to word
(noun, verb, or other) groups based either on case endings, or
post-positions, or compound verbs, or main and auxiliary
verbs, etc
• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.
Karaka Theory
• Karaka relations are assigned based on the roles played by various participants in the main
activity.
• Various Karakas, such as Karta (subject), Karma (object), Karana (instrument), Sampradana
(beneficiary), Apadan (separation), and Adhikaran (locus).
Example (illustrative sentence consistent with the roles discussed below): माँ हाथ से बच्ची को आँगन में रोटी खिलाती है
(maan haath se bachchii ko aangan mein rotii khilaatii hai) – 'Mother feeds the child bread in the courtyard with her hand.'
• 'maan' (mother) is the Karta; the Karta generally takes the 'ne' or null (φ) case marker.
• rotii (bread) is the Karma. ('Karma' is similar to object and is the locus of the result of the activity)
• haath (hand) is the Karan. (noun group through which the goal is achieved), It has the marker
“dwara” (by) or “se”
• 'Sampradan' is the beneficiary of the activity, e.g., bachchi (child). It takes the marker "ko" (to) or "ke liye" (for).
• 'Apaadaan' denotes separation; its marker "se" (from) is attached to the part that serves as the reference
point (being stationary).
• aangan (courtyard) is the Adhikaran (is the locus (support in space or time) of Karta or Karma).
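As a rough illustration only (my sketch, not from the textbook), the surface cues just listed can be collected into a small lookup table; real Paninian parsers choose among the candidate karakas using the verb's karaka frame and the context.

# Hypothetical default mapping from Hindi post-position (vibhakti) markers to
# candidate karaka roles, based on the cues listed above. Ambiguous markers map
# to more than one role; disambiguation needs the verb's karaka frame.
default_karaka = {
    "ne":      ["Karta (agent)"],
    None:      ["Karta (agent)", "Karma (object)"],          # null/zero marker
    "ko":      ["Karma (object)", "Sampradan (beneficiary)"],
    "se":      ["Karan (instrument)", "Apaadaan (separation)"],
    "dwara":   ["Karan (instrument)"],
    "ke liye": ["Sampradan (beneficiary)"],
    "mein":    ["Adhikaran (locus)"],                         # assumed locative marker
    "par":     ["Adhikaran (locus)"],                         # assumed locative marker
}
print(default_karaka["se"])   # ['Karan (instrument)', 'Apaadaan (separation)']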
The n-gram model is a statistical method that predicts the probability of a word appearing next in a sequence based on the
previous n-1 words.
Why n-gram?
The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is
achieved by decomposing sentence probability into a product of conditional probabilities using the chain
rule as follows:
P(w1 w2 ... wn) = P(w1) × P(w2/w1) × P(w3/w1 w2) × ... × P(wn/w1 ... wn-1) = Π i P(wi/hi)
where hi is the history (the words preceding wi).
So, in order to calculate sentence probability, we need to calculate the probability of a word, given the
sequence of words preceding it. This is not a simple task.
An n-gram model simplifies the task by approximating the probability of a word given all the previous
words by the conditional probability given previous n-1 words only.
P(wi/hi) ≈ P(wi/wi-n+1 ... wi-1)
Thus, an n-gram model calculates P(w/h) by modelling language as Markov model of order n-1, i.e., by
looking at previous n-1 words only.
A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model.
A model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3)
model.
Using the bi-gram and tri-gram estimates, the probability of a sentence can be calculated as:
bi-gram: P(w1 w2 ... wn) ≈ Π i P(wi/wi-1)
tri-gram: P(w1 w2 ... wn) ≈ Π i P(wi/wi-2 wi-1)
Example: The Arabian knights are fairy tales of the east
bi-gram approximation: P(east/the); tri-gram approximation: P(east/of the)
One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation.
The conditional probabilities are estimated from the training corpus by maximum likelihood estimation:
• Count a particular n-gram in the training corpus and divide it by the sum of the counts of all n-grams that
share the same prefix.
• The sum of the counts of all n-grams that share the first n-1 words is equal to the count of the common prefix
wi-n+1, ..., wi-1. Hence:
P(wi/wi-n+1 ... wi-1) = C(wi-n+1 ... wi) / C(wi-n+1 ... wi-1)
Example tri-gram:
Bi-gram model:
Test sentence(s): The Arabian knights are the fairy tales of the east.
P(The/<s>) × P(Arabian/The) × P(knights/Arabian) × P(are/knights) × P(the/are) × P(fairy/the) ×
P(tales/fairy) × P(of/tales) × P(the/of) × P(east/the)
= 0.67 × 0.5 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
= 0.0067
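A minimal sketch of bigram estimation and sentence scoring (the tiny training corpus below is assumed for illustration and is not the one behind the numbers above); it also shows the log-space computation mentioned under the limitations below:

import math
from collections import Counter

# Assumed toy training corpus (not the textbook's); <s> marks the sentence start.
corpus = [
    "<s> the arabian knights are the fairy tales of the east".split(),
    "<s> the fairy tales of the east are popular".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams  = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(prev, word):
    # MLE estimate: bigram count divided by the count of its one-word prefix.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_logprob(sentence):
    words = ["<s>"] + sentence.split()
    # Summing log probabilities avoids numerical underflow on long sentences.
    return sum(math.log(p_bigram(words[i], words[i + 1])) for i in range(len(words) - 1))

test = "the arabian knights are the fairy tales of the east"
lp = sentence_logprob(test)
print(lp, math.exp(lp))   # the antilog of the sum gives back the raw sentence probability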
Limitations:
• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model faces data sparsity: n-grams that do not appear in the training data are assigned zero
probability, leading to many zero entries in the bigram matrix. A further limitation is the underlying
assumption that a word's probability depends solely on the preceding word(s), which isn't always true.
• Fails to capture long-distance dependencies in natural language sentences.
Solution:
• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.
2.3.2 Add-one Smoothing
• It adds a value of one to each n-gram frequency before normalizing the frequencies into probabilities. Thus,
the conditional probability becomes:
P(wi/wi-n+1 ... wi-1) = (C(wi-n+1 ... wi) + 1) / (C(wi-n+1 ... wi-1) + V)
where V is the vocabulary size.
• Yet, it is not very effective, since it assigns the same probability to all missing n-grams, even though some
of them are intuitively more plausible than others.
Example:
Consider the following toy corpus:
We want to calculate the probability of the bigram "I love" using Add-one smoothing.
Step 1: Collect counts
• Unigrams: "I" appears 2 times; "love" appears 2 times.
• Bigrams: "I love" appears 2 times; the other observed bigrams each appear 1 time.
• Vocabulary size V: there are 4 unique words: "I", "love", "programming", "coding".
Step 2: Apply Add-one smoothing
For the bigram "I love":
P(love/I) = (C("I love") + 1) / (C("I") + V) = (2 + 1) / (2 + 4) = 0.5
For the bigram "I coding" (which does not appear in the training data):
P(coding/I) = (0 + 1) / (2 + 4) ≈ 0.17
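A small sketch of the same computation (assuming, consistently with the counts above, that the toy corpus consists of the two sentences "I love programming" and "I love coding"):

from collections import Counter

# Assumed toy corpus consistent with the unigram/bigram counts listed above.
sentences = [["I", "love", "programming"], ["I", "love", "coding"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams  = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)   # vocabulary size = 4

def p_add_one(prev, word):
    # Add-one (Laplace) smoothing: add 1 to the bigram count, add V to the prefix count.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_add_one("I", "love"))    # (2 + 1) / (2 + 4) = 0.5
print(p_add_one("I", "coding"))  # (0 + 1) / (2 + 4) ≈ 0.167, non-zero for an unseen bigram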
2.3.3 Good-Turing Smoothing
• Good-Turing smoothing improves probability estimates by adjusting for unseen n-grams based
on the frequency distribution of observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having a frequency of
occurrence f + 1. It converts the frequency of an n-gram from f to a smoothed frequency f* using the following expression:
f* = (f + 1) × n(f+1) / n(f)
where n(f) is the number of n-grams that occur exactly f times in the training corpus.
As an example, consider that the number of n-grams that occur 4 times is
25,108 and the number of n-grams that occur 5 times is 20,542. Then, the smoothed count for the n-grams occurring 4 times will be:
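Plugging these numbers into the expression above (a quick check of the arithmetic, with f = 4, n(4) = 25,108 and n(5) = 20,542):

def good_turing(f, n_f, n_f_plus_1):
    # Good-Turing smoothed count: f* = (f + 1) * n(f+1) / n(f)
    return (f + 1) * n_f_plus_1 / n_f

print(good_turing(4, 25108, 20542))   # ≈ 4.09: smoothed count for n-grams seen 4 times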