NLP BAD613B FullNotes
VI – Semester
BAD613B
Dr. Mahantesh K
Associate Professor
Dept. of CSE (Data Science)
RNS Institute of Technology
Natural Language Processing [BAD613B]
Course objectives:
• Learn the importance of natural language modelling.
• Understand the applications of natural language processing.
• Study spelling, error detection and correction methods and parsing techniques in NLP.
• Illustrate the information retrieval models in natural language processing.
Module-1
Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian Languages,
NLP Applications.
Language Modeling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.
Textbook 1: Ch. 1, Ch. 2.
Module-1
Textbook 1: Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press. Ch. 1, Ch. 2.
1. INTRODUCTION
1. Rationalist approach
2. Empiricist approach
Rationalist Approach: Early approach, assumes the existence of some language faculty in
the human brain. Supporters of this approach argue that it is not possible to learn a complex thing
like natural language from limited sensory inputs.
Empiricist approach: Empiricists do not believe in the existence of a language faculty. Instead, they
believe in the existence of some general organization principles such as pattern recognition,
generalization, and association. Learning of detailed structures takes place through the application of
these principles to the sensory inputs available to the child.
Computational linguistics: is similar to theoretical- and psycho-linguistics, but uses different tools.
While theoretical linguistics is more about the structural rules of language, psycho-linguistics focuses on
how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates
universal grammar, syntax, semantics, phonology, and morphology. Linguists create models to explain
how languages are structured and how meaning is encoded. Eg. Most languages have constructs like noun
and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages
(grammar).
Psycho-linguistics focuses on the psychological and cognitive processes involved in language use. It
examines how individuals acquire, process, and produce language. Researchers study language
development in children and how the brain processes language in real-time. Eg. Studying how children
acquire language, such as learning to form questions ("What’s that?").
With the unprecedented amount of information now available on the web, NLP has become one
of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.
Syntax and Semantics
• Finding out the correct meaning of a particular use of word is necessary to find meaning of larger
units.
• Eg. Kabir and Ayan are married.
Kabir and Suha are married.
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic
knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.
Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
Example: The district administration refused to give the trade union
permission for the meeting because they feared violence. (a)
Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them. As an example, consider the following set of rules:
Transformation rules transform one phrase-marker (underlying) into another phrase-marker (derived).
These rules are applied to the terminal string generated by the phrase structure rules and transform one
representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: “The police will catch the snatcher.”
The application of phrase structure rules will assign the structure shown in Fig 2 (a)
Morphophonemic Rule: Another transformational rule will then reorder 'en + catch' to 'catch + en' and
subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'.
Note: Long distance dependency refers to syntactic phenomena where a verb and its subject or object can
be arbitrarily far apart. Wh-movement is a specific case of these types of dependencies.
E.g.
"I wonder who John gave the book to" involves a long-distance dependency between the verb "gave"
and its object "who". Even though "who" is not adjacent to the verb, the syntactic relationship
between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.
Paninian grammar provides a framework for Indian language models. These can be used for
computation of Indian languages. The grammar focuses on extraction of relations from a
sentence.
The first SysTran machine translation system was developed in 1969 for Russian-English translation.
SysTran also provided the first on-line machine translation service, called Babel Fish, which was used by
the AltaVista search engine for handling translation requests from users.
This is a natural language generation system used in Canada to generate weather reports. It accepts
daily weather data and generates weather reports in English and French.
SHRDLU: This is a natural language understanding system that simulates the actions of a robot in a blocks world
domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the
robot to manipulate the blocks, to describe the block configurations, and to explain its reasoning.
LUNAR: This was an early question answering system that answered questions about moon rocks.
The availability of vast amounts of electronic text has made it challenging to find relevant
information. Information retrieval (IR) systems aim to address this issue by providing efficient access to
relevant content. Unlike 'entropy' in communication theory, which measures uncertainty, information
here refers to the content or subject matter of text, not digital communication or data transmission. Words
serve as carriers of information, and text is seen as the message encoded in natural language.
Information retrieval involves organizing, storing, retrieving, and evaluating information that matches a
query, and it works with unstructured data. Retrieval is based on content, not structure, and systems
typically return a ranked list of relevant documents.
IR has been integrated into various systems, including database management systems, bibliographic
retrieval systems, question answering systems, and search engines. Approaches for accessing large text
collections fall into two categories: one builds topic hierarchies (e.g., Yahoo), requiring manual
classification of new documents, which can be cost-ineffective; the other ranks documents by relevance,
offering more scalability and efficiency for large collections
Major issues in designing and evaluating Information Retrieval (IR) systems include selecting
appropriate document representations. Current models often use keyword-based representation, which
suffers from problems like polysemy, homonymy, and synonymy, as well as ignoring semantic and
contextual information. Additionally, vague or inaccurate user queries lead to poor retrieval performance,
which can be addressed through query modification or relevance feedback.
2. LANGUAGE MODELLING
To create a general model of any language is a difficult task. There are two approaches for language
modelling.
2.1 Introduction
Our purpose is to understand and generate natural languages from a computational viewpoint.
1st approach: Try to understand every word and sentence of the language and then come to a conclusion
(this approach has not succeeded).
2nd approach: To study the grammar of various languages, compare them, and if possible, arrive at
reasonable models that facilitate our understanding of the problem and designing of natural-language
tools.
Language Model: A model is a description of some complex entity or process. Natural language is a
complex entity and in order to process it through a computer-based program, we need to build a
representation (model) of it.
Two categories of language modelling approaches: grammar-based and statistical.
Grammar-based language model:
Eg. A sentence usually consists of a noun phrase and a verb phrase. The grammar-based approach attempts
to utilize this structure and also the relationships between these structures.
Government and Binding (GB) theory renames the surface and deep structures as s-level and d-level,
adding phonetic and logical forms as parallel levels of representation for analysis, as shown in the figure.
• 'meaning' in a 'sound' form is represented as logical form (LF) and phonetic form (PF) in above
figure.
• The GB is concerned with LF, rather than PF.
• The GB imagines that if we define rules for structural units at the deep level, it will be possible
to generate any language with fewer rules.
Components of GB
• Government and binding (GB) comprise a set of theories that map the structures from d-structure
to s-structure and to logical form (LF).
• A general transformational rule called 'Move 𝛼' is applied at d-structure level as well as at s-
structure level.
• In its simplest form, GB can be represented as below.
GB consists of 'a series of modules that contain constraints and principles' applied at various
levels of its representations and the transformation rule, Move α.
The GB considers all three levels of representations (d-, s-, and LF) as syntactic, and LF is also
related to meaning or semantic-interpretive mechanisms.
GB applies the same Move 𝛼 transformation to map d-levels to s-levels or s-levels to LF level.
LF level helps in quantifier scoping and also in handling various sentence constructions such as passive
or interrogative constructions.
Example:
Consider the sentence: “Two countries are visited by most travellers.”
Its two possible logical forms are:
LF1: [S Two countries are visited by [NP most travellers]]
LF2 (applying Move α): [NP Most travellers_i] [S two countries are visited by e_i]
• In LF1, the interpretation is that most travellers visit the same two countries (say, India and
China).
• In LF2, when we move [most travellers] outside the scope of the sentence, the interpretation can
be that most travellers visit two countries, which may be different for different travellers.
• One of the important concepts in GB is that of constraints. It is the part of the grammar which
prohibits certain combinations and movements; otherwise Move α can move anything to any
possible position.
• Thus, GB, is basically the formulation of theories or principles which create constraints to
disallow the construction of ill-formed sentences.
The organization of GB is as given below:
X̄ Theory:
• The X̄ Theory (pronounced 'X-bar theory') is one of the central concepts in GB. Instead of defining
several phrase structures and the sentence structure with separate sets of rules, X̄ Theory defines
them both as maximal projections of some head.
• Noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional phrase (PP) are
maximal projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and
can be represented as head X of their corresponding phrases (where X = {N, V, A, P}).
• Even the sentence structure can be regarded as the maximal projection of inflection (INFL).
• The GB envisages projections at two levels:
o The projection of the head at the semi-phrasal level, denoted by X̄ (X-bar).
o The maximal projection at the phrasal level, denoted by X̿ (X-double-bar).
Sub-categorization: It refers to the process of classifying words or phrases (typically verbs) according
to the types of arguments or complements they can take. It's a form of syntactic categorization that is
important for understanding the structure and meaning of sentences.
For example, different verbs in English can have different sub-categorization frames (also called
argument structures). A verb like "give" might take three arguments (subject, object, and indirect object),
while a verb like "arrive" might only take a subject and no objects.
"He gave her a book." ("gave" requires a subject, an indirect object, and a direct object)
In principle, any maximal projection can be the argument of a head, but sub-categorization is used as a
filter to permit various heads to select a certain subset of the range of maximal projections.
Projection Principle:
Three syntactic representations:
1. Constituency Parsing (Tree Structure):
• Sentences are broken into hierarchical phrases or constituents (e.g., noun phrases, verb
phrases), represented as a tree structure.
2. Dependency Parsing (Directed Graph):
• Focuses on the direct relationships between words, where words are connected by directed
edges indicating syntactic dependencies.
3. Semantic Role Labelling (SRL):
• Identifies the semantic roles (e.g., agent, patient) of words in a sentence, focusing on the
meaning behind the syntactic structure.
The projection principle, a basic notion in GB, places a constraint on the three syntactic representations
and their mapping from one to the other.
The principle states that representations at all syntactic levels (i.e., d-level, s-level, and LF level) are
projections from the lexicon (collection or database of words and their associated linguistic information).
Thus, lexical properties of categorical structure (sub-categorization) must be observed at each level.
Suppose 'the object' is not present at d-level, then another NP cannot take this position at s-level.
Example:
• At D-structure, each argument of a verb is assigned a thematic role (e.g., Agent, Theme, Goal,
etc.).
• In a sentence like "John gave Mary the book", the verb "gave" requires three arguments: Agent
(John), Recipient (Mary), and Theme (the book).
• If the object (Theme) is not present at the deep structure, it cannot be filled at the surface structure
(S-structure) by another NP (e.g., a different noun phrase).
• 'Sub-categorization' only places a restriction on syntactic categories which a head can accept.
• GB puts another restriction on the lexical heads through which it assigns certain roles to its
arguments.
• These roles are pre-assigned and cannot be violated at any syntactical level as per the projection
principle.
• These role assignments are called theta-roles and are related to 'semantic-selection'.
Agent is a special type of role which can be assigned by a head to outside arguments (external
arguments) whereas other roles are assigned within its domain (internal arguments).
For example, the verb 'eat' assigns the 'Agent' role to 'Mukesh' (outside the VP).
Theta-Criterion states that 'each argument bears one and only one Ɵ-role, and each Ɵ-role is
assigned to one and only one argument'.
If any word or phrase (say α or ß) falls within the scope of and is determined by a maximal projection,
we say that it is dominated by the maximal projection.
If there are two structures α and ß related in such a way that 'every maximal projection dominating α
dominates ß', we say that α C-commands ß, and this is the necessary and sufficient condition (iff) for C-
command.
Government
α governs ß iff:
• α C-commands ß,
• α is an X (a head, e.g., noun, verb, preposition, adjective, or inflection), and
• every maximal projection dominating ß dominates α.
Additional information
C-COMMAND
A c-command is a syntactic relationship in linguistics, particularly in the theory of syntax, where one node (word
or phrase) in a tree structure can "command" or "govern" another node in certain ways. In simpler terms, it's a rule
that helps determine which parts of a sentence can or cannot affect each other syntactically.
Simple Definition:
C-command occurs when one word or phrase in a sentence has a syntactic connection to another word or phrase,
typically by being higher in the syntactic tree (closer to the top).
Example 1:
In the sentence "John saw Mary,"
"John" c-commands "Mary" because "John" is higher up in the tree structure and can potentially affect "Mary"
syntactically.
Example 2:
In the sentence "She thinks that I am smart,"
The pronoun "She" c-commands "I" because "She" is higher in the syntactic tree, governing the phrase where "I"
occurs.
In essence, c-command helps explain which words in a sentence are connected in ways that allow for things like
pronoun interpretation or binding relations (e.g., which noun a pronoun refers to).
GOVERNMENT
-is a special case of C-COMMAND
government refers to the syntactic relationship between a head (typically a verb, noun, or adjective) and its
dependent elements (such as objects or complements) within a sentence. It determines how certain words control
the form or case of other words in a sentence.
On the other hand, c-command is a syntactic relationship between two constituents in a sentence. A constituent A
c-commands another constituent B if the first constituent (A) is higher in the syntactic structure (usually in the tree)
and can potentially govern or affect the second constituent (B), provided no intervening nodes.
To put it together in context:
Government: This is a formal rule determining how certain words govern the case or form of other words in a
sentence (e.g., verbs can govern the object noun in accusative case in languages like Latin or German).
C-command: This is a structural relationship in which one constituent can influence another, typically affecting
operations like binding, scope, and sometimes government.
In short, government often operates within the structures of c-command, but c-command itself is a broader syntactic
relationship that is also relevant for other linguistic phenomena, such as binding theory, where one element can bind
another if it c-commands it.
Here are a few examples of government in syntax, showing how one word governs the form or case of another
word in a sentence:
1. Verb Government
In many languages, verbs can govern the case of their objects. Here’s an example in Latin:
Latin: "Vidēre puellam" (to see the girl)
The verb "vidēre" (to see) governs the accusative case of "puellam" (the girl).
In this case, the verb "vidēre" governs the object "puellam" by requiring it to be in the accusative case.
2. Preposition Government
Prepositions can also govern the case of their objects. Here’s an example from German:
German: "Ich gehe in den Park" (I am going to the park)
The preposition "in" governs the accusative case of "den Park" (the park).
The preposition "in" governs the accusative case for the noun "Park" in this sentence.
3. Adjective Government
Adjectives can govern the case, gender, or number of the noun they modify. Here's an example from Russian:
Russian: "Я вижу красивую девочку" (I see a beautiful girl)
The adjective "красивую" (beautiful) governs the accusative case of "девочку" (girl).
In this case, the adjective "красивую" (beautiful) governs the accusative case of "девочку".
4. Noun Government
In some languages, nouns can govern the case of their arguments. In Russian, for example, some nouns govern a
particular case:
Russian: "Я горжусь успехом" (I am proud of the success)
The noun "успехом" (success) governs the instrumental case in this sentence.
Here, the noun "успехом" governs the instrumental case of its argument "успех".
Summary:
Government involves syntactic relationships where a head (verb, preposition, adjective, etc.) dictates or determines
the form (such as case) of its dependent elements.
In these examples, verbs, prepositions, and adjectives have a "governing" influence on the cases of nouns or objects
in the sentence, which is a core part of the syntax in many languages.
GB distinguishes four special NP types: two are empty NP positions called wh-trace and NP-trace, and
the remaining two are pronouns called small 'pro' and big 'PRO'.
This division is based on two properties: anaphoric (+a or -a) and pronominal (+p or -p).
Wh-trace: -a, -p
NP-trace: +a, -p
small 'pro': -a, +p
big 'PRO': +a, +p
The traces help ensure that the proper binding relationships are maintained between moved elements
(such as how pronouns or reflexives bind to their antecedents, even after movement).
Additional Information:
• +a (Anaphor): A form that must refer back to something mentioned earlier (i.e., it has an
antecedent). For example, "himself" in "John washed himself." The form "himself" is an anaphor
because it refers back to "John."
• -a (Non-Anaphor): A form that does not require an antecedent to complete its meaning. A regular
pronoun like "he" in "He went to the store" is not an anaphor because it doesn't explicitly need to
refer back to something within the same sentence or clause.
• +p (Pronominal): A form that can function as a pronoun, standing in for a noun or noun phrase.
For example, "she" in "She is my friend" is a pronominal because it refers to a specific person
(though not necessarily previously mentioned).
• -p (Non-Pronominal): A word or form that isn't used as a pronoun. It could be a noun or other
word that doesn't serve as a replacement for a noun phrase in a given context. For example, in
"John went to the store," "John" is not pronominal—it is a noun phrase.
Co-indexing
It is the indexing of the subject NP and AGR (agreement) at d-structure which are preserved by Move α
operations at s-structure.
When an NP-movement takes place, a trace of the movement is created by having an indexed empty
category (e) from the position at which the movement began to the corresponding indexed NP.
For defining constraints on movement, the theory identifies two kinds of positions in a sentence. Positions
assigned θ-roles are called θ-positions, while the others are called θ̄-positions (theta-bar positions).
In a similar way, core grammatical positions (where subject, object, indirect object, etc., are positioned)
are called A-positions (argument positions), and the rest are called Ā-positions (A-bar positions).
Binding theory:
Binding Theory is a syntactic theory that explains how pronouns and noun phrases are interpreted and
distributed in a sentence. It's concerned with the relationships between pronouns and their antecedents
(myself, herself, himself).
The empty category (e_i) and Mukesh (NP_i) are bound (co-indexed). Binding theory specifies the
relationships between NPs (including pronouns and reflexive pronouns). It can be stated as follows:
(a) An anaphor (+a) is bound in its governing category.
(b) A pronominal (+p) is free in its governing category.
(c) An R-expression (-a, -p) is free.
Example
A: Mukesh_i knows himself_i (anaphor)
B: Mukesh_i believes that Amrita knows him_i (pronominal)
C: Mukesh believes that Amrita_j knows Nupur_k (referring expression)
Note: There are many other types of constraints on Move α; it is not possible to explain all of them here.
In English, the long-distance movement of a complement clause can be explained by bounding theory if
NP and S are taken to be bounding nodes. The theory says that the application of Move α may not cross
more than one bounding node. The theory of control involves syntax, semantics, and pragmatics.
In GB, case theory deals with the distribution of NPs and mentions that each NP must be assigned a case.
In English, we have the nominative, objective, genitive, etc., cases, which are assigned to NPs at particular
positions. Indian languages are rich in case-markers, which are carried even during movements.
Example:
He is running ("He" is the subject of the sentence, performing the action. - nominative)
She sees him. ("Him" is the object of the verb "sees." - Objective)
The man's book. (The genitive case expresses possession or a relationship between nouns,)
Case filter: An NP is ungrammatical if it has phonetic content (or is an argument) but is not case-marked.
Phonetic content here refers to some physical realization, as opposed to empty categories.
Thus, case filters restrict the movement of NP at a position which has no case assignment. It works in a
manner similar to that of the θ-criterion.
Summary of GB:
In short, GB presents a model of the language which has three levels of syntactic representation.
• It assumes phrase structures to be the maximal projection of some lexical head and in a similar
fashion, explains the structure of a sentence or a clause.
• It assigns various types of roles to these structures and allows them a broad kind of movement
called Move α.
• It then defines various types of constraints which restrict certain movements and justifies others.
Lexical Functional Grammar (LFG)
• LFG represents sentences at two syntactic levels - constituent structure (c-structure) and
functional structure (f-structure).
• Kaplan proposed a concrete form for the register names and values which became the functional
structures in LFG.
• Bresnan was more concerned with the problem of explaining some linguistic issues, such as
active/passive and dative alternations, in transformational approach. She proposed that such
issues can be dealt with by using lexical redundancy rules.
• The unification of these two diverse approaches (with a common concern) led to the development
of the LFG theory.
• The 'functional' part is derived from 'grammatical functions', such as subject and object, or roles
played by various arguments in a sentence.
• The 'lexical' part is derived from the fact that the lexical rules can be formulated to help define
the given structure of a sentence and some of the long-distance dependencies, which is difficult
in transformational grammars.
Since grammatical-functional roles cannot be derived directly from the phrase and sentence structure,
functional specifications are annotated on the nodes of the c-structure; applying these annotated rules to a
sentence yields its f-structure.
[
SUBJ: [ PERS: 3, NUM: SG ], // "She" is the subject, 3rd person, singular
PRED: "see", // The verb "saw" represents the predicate "see"
OBJ: [ NUM: PL, PRED: "star" ], // "stars" is the object, plural, and the predicate is "star"
LOC: [ PRED: "sky", DEF: + ] // "sky" is the location, with a definite determiner ("the")
]
f-structure
c- structure
Example:
She saw stars in the sky
CFG rules to handle this sentence are:
S → NP VP
VP → V {NP} PP* {NP} {S'}
PP → P NP
NP → Det N {PP}
S' → Comp S
Where: S: Sentence V: Verb P: Preposition N: Noun
• Here, the ↑ (up arrow) refers to the f-structure of the mother node, i.e., the node on the left-hand side of
the rule.
• The ↓ (down arrow) refers to the f-structure of the node under which it is written.
• Hence, in Rule 1, the annotation (↑ SUBJ) = ↓ on the NP indicates that the f-structure of the first NP
goes into the subject slot of the sentence's f-structure, while ↑ = ↓ on the VP indicates that the f-structure
of the VP node goes directly into the f-structure of the sentence.
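As a rough illustration of the c-structure rules above (ignoring the functional annotations), the following sketch uses NLTK's CFG and chart parser; the hand-expanded grammar and the tool choice are assumptions made for illustration, not part of the textbook.

```python
import nltk

# Hand-expanded version of the CFG rules above (the {NP}, PP*, and {S'}
# options are flattened into plain alternatives for this toy grammar).
grammar = nltk.CFG.fromstring("""
S    -> NP VP
VP   -> V NP PP | V NP | V
PP   -> P NP
NP   -> Det N | N | PRON
Det  -> 'the'
N    -> 'stars' | 'sky'
PRON -> 'She'
V    -> 'saw'
P    -> 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("She saw stars in the sky".split()):
    tree.pretty_print()   # prints the c-structure tree
```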
Consistency In a given f-structure, a particular attribute may have at the most one value. Hence, while
unifying two f-structures, if the attribute Num has value SG in one and PL in the other, it will be rejected.
Completeness When an f-structure and all its subsidiary f-structures (as the value of any attribute of f-
structure can again contain other f-structures) contain all the functions that their predicates govern, then
and only then is the f-structure complete.
For example, since the predicate 'see <(↑ SUBJ)(↑ OBJ)>' contains an object among its governable
functions, a sentence like 'She saw' will be incomplete.
Coherence Coherence maps the completeness property in the reverse direction. It requires that all
governable functions of an f-structure, and all its subsidiary f-structures, must be governed by their
respective predicates. Hence, in the f-structure of a sentence, an object cannot be taken if its verb does
not allow that object. Thus, it will reject the sentence, 'I laughed a book.'
Example:
Let us see first the lexical entries of various words in the sentence:
Lexical entries
c – structure
In LFG, the passive is handled lexically: the verb is converted to its participial form and its sub-categorization is changed directly.
Example
Active: तारा हँसी
Taaraa hansii
Tara laughed
In LFG, unbounded movement and coordination is handled by the functional identity and by correlation
with the corresponding f-structure.
Unlike English (Subject-Verb-Object ordered), Indian languages are typically SOV (Subject-Object-Verb)
ordered and inflectionally rich. The inflections provide important syntactic and semantic cues for language
analysis and understanding. The Paninian framework takes advantage of these features.
Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.
Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, we
can change the position of subject and object. For example:
In Hindi, some verbs, e.g., give (देना) and take (लेना), also combine with other (main) verbs to
change the aspect and modality of the verbs. For example:
वह चला (vah chalaa) – He moved.
वह चल दिया (vah chal diyaa) – He moved (started the action); literally, 'he move gave'.
The nouns are followed by post-positions instead of prepositions. They generally remain as separate
words in Hindi,
• The surface and the semantic levels are obvious. The other
two levels should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here, it refers to word
(noun, verb, or other) groups based either on case endings, or
post-positions, or compound verbs, or main and auxiliary
verbs, etc
• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.
Karaka Theory
Example: Consider a sentence such as 'The mother feeds bread to the child in the courtyard with her hand.'
• 'maan' (mother) is the Karta, Karta has generally 'ne' or 'o' case marker.
• rotii (bread) is the Karma. ('Karma' is similar to object and is the locus of the result of the activity)
• haath (hand) is the Karan. (noun group through which the goal is achieved), It has the marker
“dwara” (by) or “se”
• 'Sampradan' is the beneficiary of the activity, e.g., bachchi (child). It takes the marker "ko" (to) or
"ke liye" (for).
• 'Apaadaan' denotes separation; its marker, "se" (from), is attached to the part that serves as the
reference point (the stationary one).
• aangan (courtyard) is the Adhikaran (is the locus (support in space or time) of Karta or Karma).
N-gram Model
An n-gram model is a statistical method that predicts the probability of the next word in a sequence based
on the previous n−1 words.
Why n-gram?
The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is
achieved by decomposing the sentence probability into a product of conditional probabilities using the chain
rule as follows:
P(w1 w2 ... wn) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(wn | w1 ... wn-1) = ∏ P(wi | hi)
where hi (the history) is the sequence of words preceding wi. So, in order to calculate the sentence
probability, we need to calculate the probability of a word, given the sequence of words preceding it. This
is not a simple task.
An n-gram model simplifies the task by approximating the probability of a word given all the previous
words by the conditional probability given previous n-1 words only.
P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)
Thus, an n-gram model calculates P(w/h) by modelling language as Markov model of order n-1, i.e., by
looking at previous n-1 words only.
A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model.
A model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3)
model.
Using bi-gram and tri-gram estimates, the probability of a sentence can be calculated as:
P(s) ≈ ∏ P(wi | wi-1)          (bi-gram)
P(s) ≈ ∏ P(wi | wi-2 wi-1)     (tri-gram)
One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation, and two
pseudo-words <s1> and <s2> are used for tri-gram estimation.
How to estimate these probabilities?
1. Train n-gram model on training corpus.
2. Estimate n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e.,
using relative frequencies.
o Count a particular n-gram in the training corpus and divide it by the sum of all n-grams
that share the same prefix
3. The sum of the counts of all n-grams that share the first n−1 words is equal to the count of the
common prefix wi-n+1, ..., wi-1.
Example (tri-gram MLE estimate):
P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
Example
Training set:
The Arabian Knights.
These are the fairy tales of the east.
The stories of the Arabian knights are translated in many languages.
Bi-gram model:
P(the/<s>) =0.67 P(Arabian/the) = 0.4 P(knights /Arabian) =1.0
P(are/these) = 1.0 P(the/are) = 0.5 P(fairy/the) =0.2
P(tales/fairy) =1.0 P(of/tales) =1.0 P(the/of) =1.0
P(east/the) = 0.2 P(stories/the) =0.2 P(of/stories) =1.0
P(are/knights) =1.0 P(translated/are) =0.5 P(in /translated) =1.0
P(many/in) =1.0
P(languages/many) =1.0
Test sentence(s): The Arabian knights are the fairy tales of the east.
P(The/<s>) × P(Arabian/the) × P(knights/Arabian) × P(are/knights)
× P(the/are) × P(fairy/the) × P(tales/fairy) × P(of/tales) × P(the/of)
× P(east/the)
= 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
≈ 0.0054
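The same calculation can be reproduced with a short Python sketch. The training sentences and the prefix-count normalization follow the estimation steps above; this is an illustrative sketch, not code from the textbook.

```python
from collections import Counter

# The three training sentences assumed above, padded with the pseudo-word <s>.
corpus = [
    "<s> the arabian knights".split(),
    "<s> these are the fairy tales of the east".split(),
    "<s> the stories of the arabian knights are translated in many languages".split(),
]

bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def p(w, prev):
    # MLE: count of the bigram divided by the count of all bigrams
    # sharing the same one-word prefix (see the estimation steps above).
    prefix_total = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    return bigrams[(prev, w)] / prefix_total

test = "<s> the arabian knights are the fairy tales of the east".split()
prob = 1.0
for prev, w in zip(test, test[1:]):
    prob *= p(w, prev)
print(round(prob, 4))   # about 0.0053 (0.0054 with the rounded factors above)
```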
Limitations:
• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model faces data sparsity, assigning zero probability to unseen n-grams in the training
data, leading to many zero entries in the bigram matrix. This results from the assumption that a
word's probability depends solely on the preceding word(s), which isn't always true.
• Fails to capture long-distance dependencies in natural language sentences.
Solution:
• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.
Add-one (Laplace) smoothing:
• It adds a value of one to each n-gram frequency before normalizing the frequencies into probabilities.
Thus, the conditional probability becomes:
P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V), where V is the vocabulary size.
• Yet, it is not very effective, since it assigns the same probability to all missing n-grams, even though
some of them are intuitively more plausible than others.
Example:
We want to calculate the probability of the bigram "I love" using Add-one smoothing.
• Unigrams:
• Bigrams:
• Vocabulary size V: There are 4 unique words: "I", "love", "programming", "coding".
Let’s say we want to calculate the probability for the bigram "I coding" (which doesn’t appear in the
training data):
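Since the counts themselves are not reproduced here, the sketch below assumes a hypothetical two-sentence corpus over the same four-word vocabulary and applies the add-one formula to one seen and one unseen bigram.

```python
from collections import Counter

# Hypothetical toy corpus (assumed for illustration; the notes only fix the
# vocabulary as {"I", "love", "programming", "coding"}).
sents = ["I love programming".split(), "I love coding".split()]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter(b for s in sents for b in zip(s, s[1:]))
V = len(unigrams)            # vocabulary size = 4

def p_add_one(w, prev):
    # Add-one (Laplace) smoothing: P(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add_one("love", "I"))     # seen bigram "I love"     -> (2+1)/(2+4) = 0.5
print(p_add_one("coding", "I"))   # unseen bigram "I coding" -> (0+1)/(2+4) ~ 0.17
```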
• Good-Turing smoothing improves probability estimates by adjusting for unseen n-grams based
on the frequency distribution of observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having a frequency of
occurrence f + 1. It converts the frequency of an n-gram from f to a smoothed count f* using the
following expression:
f* = (f + 1) × n(f+1) / n(f)
where n(f) is the number of n-grams that occur exactly f times in the training corpus. As an example, consider
that the number of n-grams that occur 4 times is 25,108 and the number of n-grams that occur 5 times is
20,542. Then, the smoothed count for 4 will be:
f* = (4 + 1) × 20,542 / 25,108 ≈ 4.09
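The same adjustment can be written directly from the formula above; the counts-of-counts are those from the example.

```python
# Good-Turing adjusted count: f* = (f + 1) * n(f+1) / n(f),
# using the counts-of-counts from the example above.
n = {4: 25108, 5: 20542}     # n(f) = number of n-grams occurring exactly f times

def good_turing(f):
    return (f + 1) * n[f + 1] / n[f]

print(round(good_turing(4), 2))   # smoothed count for f = 4  ->  about 4.09
```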
Module – 2
Processing carried out at word level, including methods for characterizing word sequences,
identifying morphological variants, detecting and correcting misspelled words, and identifying
correct part-of-speech of a word.
Some simple regular expressions: First instance of each match is underlined in table
Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies
a range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/,
which matches any character except x. The caret is interpreted literally elsewhere.
• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start of a line, and $ at the end.
o /^The nature\.$/ matches exactly the line 'The nature.'
• The dot (.) is a wildcard matching any single character (e.g., /./).
o The expression /.at/ matches any of the strings cat, bat, rat, gat, kat, mat, etc.
Special characters
RE Description
. The dot matches any single character.
\n Matches a new line character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character.
\W Matches a non-alphanumeric character.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, \. matches a dot, \* matches a *,
and \\ matches a backslash.
• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.
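The expressions discussed above can be tried directly with Python's re module; the sample strings below are made up for illustration.

```python
import re

text = "Sana and sana bought 15 supernovas."

print(re.findall(r"[sS]ana", text))          # character class: ['Sana', 'sana']
print(re.findall(r"[0-9]+", text))           # one or more digits: ['15']
print(re.findall(r"supernovas?", text))      # optional 's': ['supernovas']
print(re.findall(r"\b.at\b", "the cat sat on a mat"))        # wildcard: ['cat', 'sat', 'mat']
print(re.findall(r"Tanveer|Siddiqui", "Tanveer Siddiqui"))   # disjunction
print(bool(re.search(r"^The nature\.$", "The nature.")))     # anchors: True
```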
3. Finite-State Automata
• Game Description: The game involves a board with pieces, dice or a wheel to generate random
numbers, and players rearranging pieces based on the number. There’s no skill or choice; the
game is entirely based on random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.
Let ∑ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4} with q0 being the start state and q4 the final state,
we have the following rules of transition:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.
• The nodes in this diagram correspond to the states, and the arcs to transitions.
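A minimal Python sketch of this deterministic automaton; the dictionary-based encoding is an implementation choice, not part of the definition.

```python
# The deterministic automaton described above: states q0..q4,
# alphabet {a, b, c}, start state q0, final state q4.
transitions = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}
FINAL = {"q4"}

def accepts(string):
    state = "q0"
    for symbol in string:
        if (state, symbol) not in transitions:   # missing transition: get stuck
            return False
        state = transitions[(state, symbol)]
    return state in FINAL

print(accepts("abb"))   # True  (q0 -a-> q1 -b-> q2 -b-> q4)
print(accepts("acb"))   # True  (q0 -a-> q1 -c-> q3 -b-> q4)
print(accepts("ac"))    # False (stops in q3, which is not a final state)
```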
Non-Deterministic Automata:
• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in the figure, where there are two possible transitions from state q0 on input symbol a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε})
to 2^Q (the power set of Q), i.e., each state-input pair is mapped to a set of possible next states.
Example:
1. Consider the deterministic automaton described in above example and the input, “ac”.
• We start with state q0 and input symbol a and will go to state
q1.
• The next input symbol is c, we go to state q3.
• No more input is left and we have not reached the final state.
• Hence, the string ac is not recognized by the automaton.
State-transition table
• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates missing transition.
• This table contains all the information needed by FSA.
Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
Deterministic finite -state automaton (DFA) The state-transition table of DFA
Example
• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression→ /(a|b)*baa$/.
• The NFA implementing this regular expression is shown & state-transition table for the NFA is
as shown below.
Input
State a b
Start: q0 {q0} {q0, q1}
q1 {q2} ɸ
q2 {q3} ɸ
Final: q3 ɸ ɸ
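The non-deterministic automaton can be simulated by tracking the set of reachable states; a small sketch, assuming the transition table above.

```python
# NFA for /(a|b)*baa$/, encoded from the state-transition table above.
delta = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
}
FINAL = {"q3"}

def accepts(string):
    states = {"q0"}                  # the set of states the NFA may be in
    for symbol in string:
        states = set().union(*(delta.get((q, symbol), set()) for q in states))
        if not states:
            return False
    return bool(states & FINAL)

print(accepts("abbaa"))   # True: the string ends with 'baa'
print(accepts("baab"))    # False
```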
4. Morphological Parsing
• It is a sub-discipline of linguistics
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
'eggs' consists of two morphemes: egg and -s
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
2. Suffix - morphemes applied to the end. (ghodha-on, gurramu-lu, bidr-s, शीतलता)
3. Infixes - morphemes that appear inside a stem.
• English slang word "abso-bloody-lutely." The morpheme "-bloody-" is
inserted into the stem "absolutely" to emphasize the meaning.
4. Circumfixes - morphemes that may be applied to beginning & end of the stem.
• German word - gespielt (played) → ge+spiel+t
Spiel – play (stem)
4.2 Three main ways of word formation: Inflection, Derivation, and Compounding
Inflection: a root word combined with a grammatical morpheme to yield a word of the same class as the
original stem.
Ex. play (verb)+ ed (suffix) = Played (inflected form – past-tense)
Derivation: a root word combined with a grammatical morpheme to yield a word belonging to a different
class.
Ex. compute (verb) + -ation (suffix) = computation (noun)
Compounding: The process of merging two or more words to form a new word.
Ex. desktop (desk + top), overlook (over + look)
Morphological analysis and generation deal with the inflection, derivation, and compounding processes in
word formation and are essential to many NLP applications:
1. Applications ranging from spelling correction to machine translation.
2. In information retrieval – to identify the presence of a query word in a document in spite of its
different morphological variants.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.
4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Stemming algorithms work in two steps:
(i) Suffix removal: This step removes predefined endings from words.
(ii) Recoding: This step adds predefined endings to the output of the first step.
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).
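A toy suffix-stripping stemmer along these lines might look as follows; the rewrite rules are illustrative only, and this is neither the Lovins nor the Porter algorithm.

```python
# A toy suffix-stripping stemmer (illustrative rewrite rules only).
RULES = [
    ("ies", "y"),   # ponies  -> pony
    ("ier", "y"),   # earlier -> early   (the recoding step adds 'y')
    ("ing", ""),    # playing -> play
    ("ed", ""),     # played  -> play
    ("s", ""),      # eggs    -> egg
]

def stem(word):
    for suffix, replacement in RULES:                    # (i) suffix removal
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement    # (ii) recoding
    return word

print([stem(w) for w in ["playing", "played", "eggs", "earlier"]])
# ['play', 'play', 'egg', 'early']
```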
Surface Level → p l a y i n g
Lexical Level → p l a y +V +PP
• Identifies the base form ("walk") and applies the appropriate suffix to generate different surface
forms, like "walked" or "walking".
A finite-state transducer (FST) is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is a set of states, S is the initial state,
F ⊆ Q is a set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a transition function
mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to a subset of the power set of Q:
δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except in that transitions are made on strings rather than on symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.
Two-level morphology using FSTs involves analyzing surface forms in two steps.
Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").
Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.
We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its comparative form, where ɛ represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).
E.g. Lesser
• The plural form of regular nouns usually ends with -s or -es. (not necessarily be the plural form
– class, miss, bus).
• One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g. Boxes, This deletion is usually required for words ending in xes, ses, zes.
• This is done by below transducer – Mapping English nouns to the intermediate form:
Bird+s
Box+e+s
Quiz+e+s
• The next step is to develop a transducer that does the mapping from the intermediate level to the
lexical level. The input to transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output
N and sg.
• In the second case, it has to map all symbols of the stem to themselves, but then output N and
replaces PL with s.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then it should add N and PL.
The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input
tape and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and
singular nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to
bird + N + pl as follows.
b:b i:i r:r d:d + ε:N + s:pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the morphological
feature pl. The figure shows the resulting composed transducer.
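A rough procedural stand-in for the composed transducer is sketched below; it is a plain function rather than an actual finite-state transducer, and spelling rules and exceptions such as 'class' or 'bus' are ignored.

```python
# Maps a surface noun form to "stem +N +SG/+PL" (simplified illustration).
IRREGULAR = {"geese": "goose"}          # tiny irregular-noun lexicon

def analyse(word):
    if word in IRREGULAR:                        # irregular plural stem
        return IRREGULAR[word] + " +N +PL"
    if word.endswith(("xes", "ses", "zes")):     # e-deletion class: boxes -> box
        return word[:-2] + " +N +PL"
    if word.endswith("s"):                       # regular plural: birds -> bird
        return word[:-1] + " +N +PL"
    return word + " +N +SG"                      # bare stem: singular noun

for w in ["birds", "boxes", "geese", "bird"]:
    print(w, "->", analyse(w))
```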
Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.
OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space
deletion or insertion, and failures.
Substitution errors: Caused due to visual similarity (single character) such as c→e, 1→l, r→n.
The same is true for multi-substitution (two or more chars), e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.
Phonetic errors:
• Spelling errors are often phonetic, with incorrect words sounding like correct ones.
• Phonetic errors are harder to correct due to more complex distortions.
• Phonetic variations are common in translation
• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information
Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:
Isolated-error detection and correction: Each word is checked separately, independent of its context.
Context-dependent error detection and correction methods: Utilize the context of a word. This requires
grammatical analysis and is thus more complex and language dependent. The list of candidate words must
first be obtained using an isolated-word method before making a selection depending on the context.
Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.
Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.
n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.
Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.
Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of
a common spelling error pattern is used to transform misspelled words into valid words.
The minimum edit distance is the number of insertions, deletions, and substitutions required to change
one string into another.
For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.
Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.
Alignment 1:
t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.
A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).
The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation;
the distance here is therefore 2.
Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
which has a cost of 3.
Dynamic programming algorithms can be quite useful for finding minimum edit distance between two
sequences. (table-driven approach to solve problems by combining solutions to sub-problems).
The dynamic programming algorithm for minimum edit distance is implemented by creating an edit
distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in
the target string.
• The (i, j)th cell in this matrix represents the distance between the first i character of the source
and the first j character of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the
beginning of the matrix, it is possible to fill each entry iteratively.
• The value in each cell is computed in terms of three possible paths: deletion, insertion, and substitution.
• The substitution cost is 0 if the ith character in the source matches the jth character in the target.
• The minimum edit distance algorithm is shown below.
• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2
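The dynamic-programming algorithm described above can be sketched as follows, with unit costs as in the table.

```python
def min_edit_distance(source, target):
    """Dynamic-programming edit distance with unit costs for
    insertion, deletion, and substitution (Levenshtein distance)."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between first i chars of source and first j of target
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                      # deletions
    for j in range(1, m + 1):
        dist[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]

print(min_edit_distance("tutor", "tumour"))   # 2, as in the table above
```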
Minimum edit distance algorithms are also useful for determining accuracy in speech recognition
systems.
Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.
Word classes are further categorized as open and closed word classes.
• Open word classes constantly acquire new members while closed word classes do not (or only
infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, and adverbs are open word classes.
e.g. computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily
• Prepositions, auxiliary verbs, determiners, pronouns, conjunctions, and interjections are closed word
classes.
e.g. in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because, oh, wow,
ouch
7. Part-of-Speech Tagging
• The process of assigning a part-of-speech (such as a noun, verb, pronoun, preposition, adverb,
and adjective), to each word in a sentence.
• Input to a tagging algorithm: Sequence of words of a natural language sentence and specified tag
sets.
• Output: single best part-of-speech tag for each word.
• Many words may belong to more than one lexical category:
o I am reading a good book → book: Noun
o Please book my ticket → book: Verb
Tag set:
Consider the following tags from the Penn Treebank tag set and the possible tags for forms of the word eat:
VB – Verb, base form (subsumes imperatives, infinitives, and subjunctives): eat/VB
VBD – Verb, past tense (includes the conditional form of the verb to be): ate/VBD
VBG – Verb, gerund or present participle: eating/VBG
VBN – Verb, past participle: eaten/VBN
VBP – Verb, non-3rd person singular present: eat/VBP
VBZ – Verb, 3rd person singular present: eats/VBZ
Rule-based taggers use hand-coded rules to assign tags to words. These rules use a lexicon to obtain a
list of candidate tags and then use rules to discard incorrect tags.
Stochastic taggers use a tagged training corpus to compute the probability of a word having a particular
tag in a given context and assign the most probable tag.
Hybrid taggers combine features of both these approaches. Like rule-based systems, they use rules to
specify tags. Like stochastic systems, they use machine learning to induce rules from a tagged training
corpus automatically. E.g. Brill tagger.
• A two-stage architecture.
• The first stage: A dictionary look-up procedure, which returns a set of potential tags (parts-of-
speech) and appropriate syntactic features for each word.
• The second stage: A set of hand-coded rules to discard contextually illegitimate tags to get a
single part-of-speech for each word.
IF word ends in -ing and preceding word is a verb THEN label it a verb (VB).
Rule-based taggers use capitalization to identify unknown nouns and typically require supervised training.
Rules can be induced by running untagged text through a tagger, manually correcting it, and feeding it
back for learning.
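A toy illustration of the two-stage architecture (dictionary look-up followed by hand-coded rules) is sketched below; the lexicon entries and rules are made up for illustration.

```python
# Stage 1 looks up candidate tags in a small lexicon; stage 2 applies
# hand-coded rules to discard contextually illegitimate tags.
LEXICON = {
    "I": ["PRP"], "am": ["VBP"], "reading": ["VBG", "NN"],
    "a": ["DT"], "good": ["JJ"], "book": ["NN", "VB"],
}

def tag(sentence):
    words, tags = sentence.split(), []
    for i, word in enumerate(words):
        candidates = list(LEXICON.get(word, ["NN"]))        # stage 1: look-up
        # stage 2: hand-coded rules (illustrative only)
        if "VB" in candidates and len(candidates) > 1 and i > 0 and tags[-1] in ("DT", "JJ"):
            candidates.remove("VB")     # a base-form verb cannot follow DT/JJ
        if "NN" in candidates and len(candidates) > 1 and i > 0 and tags[-1] == "VBP":
            candidates.remove("NN")     # after 'am', prefer the -ing participle
        tags.append(candidates[0])
    return list(zip(words, tags))

print(tag("I am reading a good book"))
```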
TAGGIT (1971) tagged 77% of the Brown corpus using 3,300 rules. ENGTWOL (1995) is another rule-
based tagger known for speed and determinism.
While rule-based systems are fast and deterministic, they require significant effort to write rules and need
a complete rewrite for other languages. Stochastic taggers are more flexible, adapting to new languages
with minimal changes and retraining. Thus, rule-based systems are precise but labor-intensive, while
stochastic systems are more adaptable but probabilistic.
The unigram model requires a tagged training corpus to gather statistics for tagging data. It assigns tags
based solely on the word itself. For example, the tag JJ (Adjective) is frequently assigned to "fast"
because it is more commonly used as an adjective than as a noun, verb, or adverb. However, this can lead
to incorrect tagging, as seen in the following examples:
3. Those who were injured in the accident need to be helped fast — Here, "fast" is an adverb.
In these cases, a more accurate prediction could be made by considering additional context. A bi-gram
tagger improves accuracy by incorporating both the current word and the tag of the previous word. For
instance, in sentence (1), the sequence "DT NN" (determiner, noun) is more likely than "DT JJ"
(determiner, adjective), so the bi-gram tagger would correctly tag "fast" as a noun. Similarly, in sentence
(3), a verb is more likely to be followed by an adverb, so the bi-gram tagger assigns "fast" the tag RB
(adverb).
In general, n-gram models consider both the current word and the tags of the previous n-1 words. A tri-
gram model, for example, uses the previous two tags, providing even richer context for more accurate
tagging. The context considered by a tri-gram model is shown in Figure, where the shaded area represents
the contextual window.
How the HMM tagger assigns the most likely tag sequence to a given sentence:
We refer to this model as a Hidden Markov Model (HMM) because it has two layers of states: an observable layer (the words) and a hidden layer (the tags). While tagging input data, we can observe the words, but the tags (states) are hidden. The states are visible during training (the training corpus is tagged) but not during the tagging process.
As mentioned earlier, the HMM uses lexical and bi-gram probabilities estimated from a tagged training
corpus to compute the most likely tag sequence for a given sentence. One way to store these probabilities
is by constructing a probability matrix. This matrix includes:
• The lexical probabilities (the probability of a word given a particular tag).
• The n-gram probabilities (for example, in a bi-gram model, the probability that a word of class X follows a word of class Y).
During tagging, this matrix is used to guide the HMM tagger in predicting the tags for an unknown
sentence. The goal is to determine the most probable tag sequence for a given sequence of words.
For example, the probability of the tag sequence DT NNP MD VB together with the sentence The bird can fly can be computed as
P(DT) × P(NNP|DT) × P(MD|NNP) × P(VB|MD) × P(the|DT) × P(bird|NNP) × P(can|MD) × P(fly|VB)
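A minimal Python sketch (not from the textbook) of how a bigram HMM scores one candidate tag sequence: transition probabilities P(tag_i | tag_{i-1}) multiplied by lexical probabilities P(word_i | tag_i). The probability values below are invented purely for illustration.

    # Illustrative transition and lexical probabilities (invented numbers).
    transition = {("<s>", "DT"): 0.3, ("DT", "NNP"): 0.2,
                  ("NNP", "MD"): 0.1, ("MD", "VB"): 0.6}
    lexical = {("the", "DT"): 0.5, ("bird", "NNP"): 0.001,
               ("can", "MD"): 0.3, ("fly", "VB"): 0.01}

    def sequence_probability(words, tags):
        """P(tags, words) under the bigram HMM factorization."""
        prob = 1.0
        prev = "<s>"                                   # sentence-start pseudo-tag
        for word, tag in zip(words, tags):
            prob *= transition.get((prev, tag), 0.0)   # P(tag | previous tag)
            prob *= lexical.get((word, tag), 0.0)      # P(word | tag)
            prev = tag
        return prob

    print(sequence_probability(["the", "bird", "can", "fly"],
                               ["DT", "NNP", "MD", "VB"]))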
Figure illustrates the TBL process, which is a supervised learning technique. The algorithm starts by
assigning the most likely tag to each word using a lexicon. Transformation rules are then applied
iteratively, with the rule that improves tagging accuracy most being selected each time. The process
continues until no significant improvements are made.
The output is a ranked list of transformations, which are applied to new text by first assigning the most
frequent tag and then applying the transformations.
As the most likely tag for fish is NN, the tagger assigns this tag to the word in both sentences. In the second case, it is a mistake. After initial tagging, when the transformation rules are applied, the tagger learns a rule that applies exactly to this mis-tagging of fish. As the contextual condition of that rule is satisfied, the rule changes fish/NN to fish/VB.
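A minimal sketch of applying one Brill-style transformation rule of the form "change tag FROM to TO when the previous tag is PREV". The rule and the initial tagging below are illustrative, not taken from a trained tagger.

    def apply_transformation(tagged, from_tag, to_tag, prev_tag):
        """Return a new (word, tag) list with the contextual rule applied."""
        result = list(tagged)
        for i in range(1, len(result)):
            word, tag = result[i]
            if tag == from_tag and result[i - 1][1] == prev_tag:
                result[i] = (word, to_tag)
        return result

    # Initial (lexicon-based) tagging assigns the most frequent tag to 'fish'.
    sentence = [("I", "PRP"), ("like", "VBP"), ("to", "TO"), ("fish", "NN")]

    # Hypothetical learned rule: NN -> VB when the preceding tag is TO.
    print(apply_transformation(sentence, "NN", "VB", "TO"))
    # [('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('fish', 'VB')]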
Most part-of-speech tagging research focuses on English and European languages, but the lack of
annotated corpora limits progress for other languages, including Indian languages. Some systems, like
Bengali (Sandipan et al., 2004) and Hindi (Smriti et al., 2006), combine morphological analysis with
tagged corpora.
Tagging Urdu is more complex due to its right-to-left script and grammar influenced by Arabic and
Persian. Before Hardie (2003), little work was done on Urdu tag sets, with his research part of the
EMILLE project for South Asian languages.
Unknown words, which do not appear in a dictionary or training corpus, pose challenges during tagging.
Solutions include:
• Assigning the most frequent tag from the training corpus or initializing unknown words with an
open class tag and disambiguating them using tag probabilities.
• Another approach involves using morphological information, such as affixes, to predict the tag
based on common suffixes or prefixes in the training data, similar to Brill's tagger.
Syntactic Analysis
1. Introduction:
• Syntactic parsing deals with the syntactic structure of a sentence.
• 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationship with
each other.
• The objective of syntactic analysis is to find the syntactic structure of the sentence.
• This structure is usually depicted as a tree, as shown in Figure.
o Nodes in the tree represent the phrases and leaves correspond to the words.
o The root of the tree is the whole sentence.
• Identifying the syntactic structure is useful in determining the
meaning of the sentence.
• Syntactic parsing can be considered as the process of assigning
'phrase markers' to a sentence.
• Two important ideas in natural language are those of constituency
and word order.
o Constituency is about how words are grouped together.
o Word order is about how, within a constituent, words are
ordered and also how constituents are ordered with respect
to one another.
• A widely used mathematical system for modelling constituent structure in natural language is
context-free grammar (CFG) also known as phrase structure grammar.
2. Context-free Grammar:
• Context-free grammar (CFG) was first defined for natural language by Chomsky (1957).
• Consists of four components:
1. A set of non-terminal symbols, N
2. A set of terminal symbols, T
3. A designated start symbol, S, that is one of the symbols from N.
4. A set of productions, P, of the form: A→α
o where A ∈ N and α is a string consisting of terminal and non-terminal symbols.
o The rule A → α says that constituent A can be rewritten as α. This is also called the
phrase structure rule. It specifies which elements (or constituents) can occur in a phrase
and in what order.
o For example, the rule S → NP VP states that S consists of NP followed by VP, i.e., a
sentence consists of a noun phrase followed by a verb phrase.
CFG as a generator:
• A sentence such as Hena reads a book can be derived from S by successively applying the grammar rules. The representation of this derivation is shown in Figure.
• Sometimes, a more compact bracketed notation is used to represent a parse tree.
• The parse tree in Figure can be represented using this notation as follows:
[S [NP [N Hena]] [VP [V reads] [NP [Det a] [N book]]]]
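A small sketch using NLTK (assumed to be installed) to encode an illustrative grammar fragment that derives Hena reads a book and to print its parse tree in the bracketed notation shown above.

    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> N | Det N
    VP -> V NP
    Det -> 'a'
    N -> 'Hena' | 'book'
    V -> 'reads'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(["Hena", "reads", "a", "book"]):
        print(tree)   # (S (NP (N Hena)) (VP (V reads) (NP (Det a) (N book))))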
3. Constituency:
• Words in a sentence are not tied together as a sequence of part-of-speech.
• Language puts constraints on word order.
• Words group together to form constituents (often termed phrases), each of which acts as a single
unit. They combine with other constituents to form larger constituents, and eventually, a sentence.
• Constituents combine with others to form a sentence constituent.
• For example: the noun phrase, The bird, can combine with the verb phrase, flies, to form
the sentence, The bird flies.
• Different types of phrases have different internal structures.
Noun Phrase, Verb phrase, Prepositional Phrase, Adjective Phrase, Adverb Phrase
Noun Phrase:
• A noun phrase is a phrase whose head is a noun or a pronoun, optionally accompanied by a set
of modifiers. It can function as subject, object, or complement.
• The modifiers of a noun phrase can be determiners or adjective phrases.
• Phrase structure rules are of the form: A→B C
NP → Pronoun
NP → Det Noun
NP → Noun
NP → Adj Noun
NP → Det Adj Noun
• We can combine all these rules in a single phrase structure rule as follows:
NP → (Det) (Adj) Noun|Pronoun
• A noun phrase may include post-modifiers and more than one adjective.
NP → (Det) (AP) Noun (PP)
A few examples of noun phrases:
They → Pronoun
The foggy morning → Det Adj Noun
Chilled water → Adj Noun
A beautiful lake in Kashmir → Det Adj Noun PP
Cold banana shake → Adjective followed by a sequence of nouns
• Adjective followed by a sequence of nouns → A noun sequence is termed as nominal. We modify
our rules to cover this situation.
NP → (Det) (AP) Nom (PP)
Nom → Noun | Noun Nom
• A noun phrase can act as a subject, an object, or a predicate.
Example:
The foggy damped weather disturbed the match. → noun phrase acts as a subject
I would like a nice cold banana shake. → noun phrase acts as an object
Kula botanical garden is a beautiful location. → noun phrase acts as predicate
Verb Phrase:
• Headed by a verb
• The verb phrase organizes various elements of the sentence that depend syntactically on the verb.
Things are further complicated by the fact that objects may also be entire clauses, as in the sentence I know that Taj is one of the seven wonders. Hence, we must also allow for an alternative phrase structure rule, in which the NP is replaced by S.
VP → Verb S
Prepositional Phrase:
Prepositional phrases are headed by a preposition. They consist of a preposition, possibly followed by
some other constituent, usually a noun phrase.
The phrase structure rule that captures the above eventualities is as follows.
PP → Prep (NP)
Adjective Phrase:
The head of an adjective phrase (AP) is an adjective. APs consist of an adjective, which may be preceded
by an adverb and followed by a PP.
The four commonly known structures are declarative structure, imperative structure, yes-no question structure, and wh-question structure.
Declarative structure: a subject NP followed by a VP. Grammar rule: S → NP VP
Imperative structure: usually begins with a verb phrase and lacks a subject. Examples: Please pass the salt, Look at the door, Show me the latest design. Grammar rule: S → VP
Yes-no question structure: usually begins with an auxiliary verb, followed by a subject NP, followed by a VP. Grammar rule: S → Aux NP VP
S→ NP VP
S→ VP
S→ Aux NP VP
S→ Wh-NP VP
S→ Wh-NP Aux NP VP
NP → (Det) (AP) Nom (PP)
VP → Verb (NP) (NP) (PP)*
VP → Verb S
AP → (Adv) Adj (PP)
PP → Prep (NP)
Nom → Noun | Noun Nom
Note:
Coordination:
Refers to conjoining phrases with conjunctions like 'and', 'or', and 'but'.
For example,
A coordinate noun phrase can consist of two other noun phrases separated by a conjunction.
I ate [NP [NP an apple] and [NP a banana]].
Similarly, verb phrases and prepositional phrases can be conjoined as follows:
It is [VP [VP drizzling] and [VP raining]].
Not only that, even a sentence can be conjoined.
[S [S I am reading the book] and [S I am also watching the movie]]
Examples: Demonstrate how the subject NP affects the form of the verb.
Does [NP Priya] sing?
Do [Np they] eat?
The first sentence has a singular NP subject, so the -es form of 'do', i.e. 'does', is used. The second sentence has a plural NP subject, hence the form 'do' is used. Sentences in which subject and verb do not agree are ungrammatical.
For example, the number property of a noun phrase can be represented by NUMBER feature. The value
that a NUMBER feature can take is SG (for singular) and PL (for plural).
Feature structures are represented by a matrix-like diagram called attribute value matrix (AVM).
The feature structure can be used to encode the grammatical category of a constituent and the features
associated with it. For example, the following structure represents the third person singular noun phrase.
The CAT and PERSON feature values remain the same in both structures, illustrating how feature
structures support generalization while maintaining necessary distinctions. Feature values can also be
other feature structures, not just atomic symbols. For instance, combining NUMBER and PERSON into
a single AGREEMENT feature makes sense, as subjects must agree with predicates in both properties.
This allows a more streamlined representation.
4. Parsing
• A phrase structure tree constructed from a sentence is called a parse.
• The syntactic parser is thus responsible for recognizing a sentence and assigning a syntactic
structure to it.
• The task that uses the rewrite rules of a grammar to either generate a particular sequence of words
or reconstruct its derivation (or phrase structure tree) is termed parsing.
• It is possible for many different phrase structure trees to derive the same sequence of words.
• Sentence can have multiple parses → This phenomenon is called syntactic ambiguity.
• A parser processes input data (usually in the form of text) and converts it into a format that can be easily understood and manipulated by a computer. A valid parse must satisfy two kinds of constraints:
o Input: The first constraint comes from the words in the input sentence. A valid parse is
one that covers all the words in a sentence. Hence, these words must constitute the leaves
of the final parse tree.
o Grammar: The second kind of constraint comes from the grammar. The root of the final
parse tree must be the start symbol of the grammar.
Example: Consider the grammar shown in Table and the sentence “Paint the door”.
S → NP VP               VP → Verb NP
S → VP                  VP → Verb
NP → Det Nominal        PP → Preposition NP
NP → Noun               Det → this | that | a | the
NP → Det Noun PP        Verb → sleeps | sings | open | saw | paint
Nominal → Noun          Preposition → from | with | on | to
Nominal → Noun Nominal  Pronoun → she | he | they
A top-down parser begins with the start symbol S and expands it using the rules S → NP VP and S → VP.
A bottom-up parser starts with the words in the input sentence and attempts to construct a parse tree
in an upward direction towards the root.
• Start with the input words – Begin with the words in the sentence as the leaves of the parse
tree.
• Look for matching grammar rules – Search for rules where the right-hand side matches parts
of the input.
• Apply reduction using the left-hand side – Replace matched portions with non-terminal
symbols from the left-hand side of the rule.
• Construct the parse tree upwards – Build the parse tree by moving upward toward the root.
• Repeat until the start symbol is reached – Continue reducing until the entire sentence is
reduced to the start symbol.
• Successful parse – The parsing is successful if the input is fully reduced to the start
symbol, completing the parse tree.
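A naive sketch of this bottom-up reduction process (not a full shift-reduce parser): it repeatedly replaces the first substring that matches the right-hand side of a rule with the rule's left-hand side, until only the start symbol remains. The tiny grammar fragment below is illustrative.

    rules = [
        ("S", ["VP"]),
        ("VP", ["Verb", "NP"]),
        ("NP", ["Det", "Nominal"]),
        ("Nominal", ["Noun"]),
        ("Verb", ["paint"]),
        ("Det", ["the"]),
        ("Noun", ["door"]),
    ]

    def bottom_up_reduce(symbols):
        symbols = list(symbols)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in rules:
                for i in range(len(symbols) - len(rhs) + 1):
                    if symbols[i:i + len(rhs)] == rhs:
                        symbols[i:i + len(rhs)] = [lhs]   # reduce RHS to LHS
                        print(symbols)                    # show each reduction step
                        changed = True
                        break
                if changed:
                    break
        return symbols

    print(bottom_up_reduce(["paint", "the", "door"]) == ["S"])   # True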
• Top-Down Parsing: Starts from the start symbol and generates trees, avoiding paths that lead to
a different root, but it may waste time exploring inconsistent trees before seeing the input.
• Bottom-Up Parsing: Starts with the input and ensures only matching trees are explored, but may
waste time generating trees that won't lead to a valid parse tree (e.g., incorrect assumptions about
word types).
• Top-Down Drawback: It can explore incorrect trees that eventually do not match the input,
resulting in wasted computation.
Basic Search Strategy: Combines top-down tree generation with bottom-up constraints to filter out
bad parses, aiming to optimize the parsing process.
• Start with Depth-First Search (DFS): Use a depth-first approach to explore the search tree
incrementally.
• Left-to-Right Search: Expand nodes from left to right in the tree.
• Incremental Expansion: Expand the search space one state at a time.
• Select Left-most Node for Expansion: Always select the left-most unexpanded node for
expansion.
• Expand Using Grammar Rules: Expand nodes based on the relevant grammar rules.
• Handle Inconsistent State: If a state is inconsistent with the input, it is flagged.
• Return to Recent Tree: The search then returns to the most recently unexplored tree to continue.
1. Initialize the agenda.
2. Pick a state, call it curr_state, from the agenda.
3. If curr_state represents a successful parse, then return the parse tree
   else if curr_state is a POS then
       if the category of curr_state is a subset of the POS categories associated with curr_word
       then apply lexical rules to the current state
       else reject
   else generate new states by applying grammar rules and push them into the agenda.
4. If the agenda is empty then return failure
   else select a node from the agenda for expansion and go to step 3.
Figure shows the trace of the algorithm on the sentence, Open the door.
• The algorithm begins with the node S and input word "Open."
• It first expands S using the rule S → NP VP, then expands NP with NP → Det Nominal.
• Since "Open" cannot be derived from Det, the parser discards this rule and tries NP → noun,
which also fails.
• The next agenda item corresponds to S → VP.
• Expanding VP using VP → Verb NP matches the first input word successfully.
• The algorithm then continues in a depth-first, left-to-right manner to match the remaining words.
Drawbacks of the basic top-down parser:
1. Inefficiency: It may explore many unnecessary branches of the parse tree, especially if the input does not match the grammar well, leading to high computational overhead.
2. Backtracking: If a rule fails, the parser often needs to backtrack to a previous state and try
alternative expansions, which can significantly slow down parsing.
3. Left Recursion Issues: Top-down parsers struggle with left-recursive grammars because they
can lead to infinite recursion.
4. Lack of Lookahead: Basic top-down parsers generally lack lookahead capabilities, meaning they
might make incorrect decisions early on without enough information, leading to errors.
5. Ambiguity Handling: They may have difficulty handling ambiguities in the grammar, often
exploring all possible alternatives without any way of pruning inefficient branches.
6. Limited Error Recovery: Basic top-down parsers typically have poor error recovery and can
fail immediately when encountering an unexpected input.
Dynamic programming algorithms can solve these problems. These algorithms construct a table
containing solutions to sub-problems, which, if solved, will solve the whole problem.
There are three widely known dynamic parsers-the Cocke-Younger-Kasami (CYK) algorithm, the
Graham-Harrison-Ruzzo (GHR) algorithm, and the Earley algorithm.
Earley Parsing
Input: Sentence and the Grammar
Output: Chart
chart[0] ← S' → • S, [0,0]
n ← length(sentence)   // number of words in the sentence
for i = 0 to n do
    for each state in chart[i] do
        if (incomplete(state) and next category is not a part of speech) then
            predictor(state)
        else if (incomplete(state) and next category is a part of speech) then
            scanner(state)
        else
            completer(state)
        end-if
    end for
end for
return chart

Procedure predictor(A → X1 ... • B ... Xm, [i, j])
    for each rule (B → α) in G do
        insert the state B → • α, [j, j] into chart[j]
End

Procedure scanner(A → X1 ... • B ... Xm, [i, j])
    if B is one of the parts of speech associated with word[j] then
        insert the state B → word[j] •, [j, j+1] into chart[j+1]
End

Procedure completer(A → X1 ... •, [j, k])
    for each state B → X1 ... • A ..., [i, j] in chart[j] do
        insert the state B → X1 ... A • ..., [i, k] into chart[k]
End
Steps:
1. Prediction
➢ If the dot (•) is before a non-terminal in a rule, add all rules expanding that non-terminal
to the state set.
➢ The predictor generates new states representing potential expansion of the non-terminal
in the left-most derivation.
➢ A predictor is applied to every state that has a non-terminal to the right of the dot.
➢ Results in the creation of as many new states as there are grammar rules for the non-
terminal
Their start and end positions are at the point where the generating state ends. If the generating state is
A → X1 ... • B ... Xm, [i, j]
then for every rule of the form B → α, the operation adds to chart[j] the state
B → • α, [j, j]
For example, when the generating state is S → • NP VP, [0,0], the predictor adds the following states to chart[0]:
NP → • Det Nominal, [0,0]
NP → • Noun, [0,0]
NP → • Pronoun, [0,0]
NP → • Det Noun PP, [0,0]
2. Scanning
➢ A scanner is used when a state has a part-of-speech category to the right of the dot.
➢ The scanner examines the input to see if the part-of-speech appearing to the right of the
dot matches one of the part-of-speech associated with the current input.
➢ If yes, then it creates a new state using the rule that allows generation of the input word with
this part-of-speech.
➢ If the dot (•) is before a terminal that matches the current input symbol, move the dot to the
right.
Example:
When the state NP → • Det Nominal, [0,0] is processed, the parser finds a part-of-speech category (Det) next to the dot.
It checks if the category of the current word (curr_word) matches the expectation in the current state.
If yes, then it adds the new state Det → curr_word •, [0,1] to the next chart entry.
3. Completion
• If the dot reaches the end of a rule, find and update previous rules that were waiting for this rule
to complete.
• The completer identifies all previously generated states that expect this grammatical category at
this position in the input and creates new states by advancing the dots over the expected category.
Example:
Since John is a valid NP, we scan it. The next word is "sees", which matches V.
The sequence of states for “Paint the door” created by the parser is shown in Figure
CYK Parsing
• Fill Base Case (Single Words): Find matching grammar rules for each word.
• Fill Table for Larger Substrings: Now, we combine smaller segments.
• Check for Start Symbol (S): Since S appears in T[1,5], the sentence is valid under this grammar!
Algorithm:
Let w = w1 w2 ... wi ... wj ... wn and wi,j = wi ... wi+j-1
// Initialization step
for i := 1 to n do
    for all rules A → wi do
        chart[i, 1] := {A}
// Recursive step
for j := 2 to n do
    for i := 1 to n - j + 1 do
    begin
        chart[i, j] := ∅
        for k := 1 to j - 1 do
            chart[i, j] := chart[i, j] ∪ {A | A → BC is a production and
                                          B ∈ chart[i, k] and C ∈ chart[i+k, j-k]}
    end
if S ∈ chart[1, n] then accept else reject
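A minimal Python sketch of CYK for a grammar already in Chomsky Normal Form (every rule is A → B C or A → terminal). The grammar fragment and sentence are illustrative; the chart is indexed from 0 here rather than 1 as in the pseudocode above.

    from collections import defaultdict

    lexical = {"the": {"Det"}, "bird": {"Noun"}, "saw": {"Verb"}, "door": {"Noun"}}
    binary = [("S", "NP", "VP"), ("NP", "Det", "Noun"), ("VP", "Verb", "NP")]

    def cyk(words):
        n = len(words)
        chart = defaultdict(set)
        for i in range(n):                           # initialization: spans of length 1
            chart[(i, 1)] = set(lexical.get(words[i], set()))
        for j in range(2, n + 1):                    # span length
            for i in range(n - j + 1):               # span start
                for k in range(1, j):                # split point
                    for a, b, c in binary:
                        if b in chart[(i, k)] and c in chart[(i + k, j - k)]:
                            chart[(i, j)].add(a)
        return "S" in chart[(0, n)]

    print(cyk("the bird saw the door".split()))      # True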
5. Probabilistic Parsing
• Statistical parser, requires a corpus of hand-parsed text.
• The Penn treebank is a large corpus of hand-parsed text: sentences are annotated with Penn treebank tags and parsed using a simple set of phrase structure rules, in the style of Chomsky's government and binding syntax.
• The parsed sentences are represented in the form of properly bracketed trees.
Given a grammar G, a sentence s, and the set of possible parse trees of s, which we denote by τ(s), a probabilistic parser finds the most likely parse φ of s as follows:
φ = argmax φ∈τ(s) P(φ | s)   % the parse in τ(s) with the highest conditional probability
  = argmax φ∈τ(s) P(φ, s)    % equivalently, the parse maximizing the joint probability P(φ, s), since P(s) is fixed for the given sentence
  = argmax φ∈τ(s) P(φ)       % equivalently, the parse maximizing P(φ), since P(φ, s) = P(φ) when φ yields s
We have to first find all possible parses of a sentence, then assign probabilities to them, and finally return
the most probable parse → probabilistic context-free grammars (PCFGs).
• A probabilistic parser helps resolve parsing ambiguity (multiple parse trees) by assigning
probabilities to different parse trees, allowing selection of the most likely structure.
• It improves efficiency by narrowing the search space, reducing the time required to determine the
final parse tree.
For every non-terminal A, the probabilities of all its expansions sum to one:
∑α f(A → α) = 1
Example: A PCFG is shown in the Table; for each non-terminal, the sum of the probabilities of its rules is 1.
S→NP VP 0.8 Noun→door 0.25
S→VP 0.2 Noun→bird 0.25
NP→Det Noun 0.4 Noun→hole 0.25
NP→Noun 0.2 Verb→sleeps 0.2
NP→Pronoun 0.2 Verb→sings 0.2
NP→Det Noun PP 0.2 Verb→open 0.2
VP→Verb NP 0.5 Verb→saw 0.2
VP→Verb 0.3 Verb→paint 0.2
VP→VP PP 0.2 Preposition→from 0.3
PP→Preposition NP 1.0 Preposition→with 0.25
Det→this 0.2 Preposition→on 0.2
Det→that 0.2 Preposition→to 0.25
Det→a 0.25 Pronoun→she 0.35
Det→the 0.35 Pronoun→he 0.35
Noun→paint 0.25 Pronoun→they 0.25
If our training corpus consists of two parse trees (as shown in Figure), we will get the estimates as shown
in Table for the rules.
Figure: Two Parse trees Table: MLE for grammar rules considering two parse trees
The probability of a parse tree φ is the product of the probabilities of the rules used to build it: P(φ) = ∏n∈φ P(r(n)), where n is a node in the parse tree φ and r(n) is the rule used to expand n.
The probability of the two parse trees of the sentence Paint the door with the hole (shown in Figure)
using PCFG table can be computed as follows:
P(t1) = 0.2 * 0.5 * 0.2 * 0.2 * 0.35 * 0.25 * 1.0 * 0.25 * 0.4 * 0.35 * 0.25 = 0.0000030625
P(t2) = 0.2* 0.2 * 0.5 * 0.2 * 0.4 * 0.35 * 0.25 * 1 * 0.25 * 0.4 * 0.35 * 0.25 = 0.000001225
The first tree has a higher probability leading to correct interpretation.
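A sketch of how P(t1) above is obtained: multiply the PCFG probabilities of every rule used in the first parse tree of Paint the door with the hole (with the PP attached inside the object NP). The rule probabilities come from the PCFG table above.

    from math import prod

    rule_prob = {
        ("S", "VP"): 0.2, ("VP", "Verb NP"): 0.5, ("Verb", "paint"): 0.2,
        ("NP", "Det Noun PP"): 0.2, ("Det", "the"): 0.35, ("Noun", "door"): 0.25,
        ("PP", "Preposition NP"): 1.0, ("Preposition", "with"): 0.25,
        ("NP", "Det Noun"): 0.4, ("Noun", "hole"): 0.25,
    }

    # Rules used to expand each node of parse tree t1, top to bottom.
    rules_in_t1 = [
        ("S", "VP"), ("VP", "Verb NP"), ("Verb", "paint"),
        ("NP", "Det Noun PP"), ("Det", "the"), ("Noun", "door"),
        ("PP", "Preposition NP"), ("Preposition", "with"),
        ("NP", "Det Noun"), ("Det", "the"), ("Noun", "hole"),
    ]

    p_t1 = prod(rule_prob[r] for r in rules_in_t1)
    print(p_t1)   # 3.0625e-06, i.e., 0.0000030625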
We can calculate probability to a sentence s by summing up probabilities of all possible parses associated
with it.
Given a PCFG, a probabilistic parsing algorithm assigns the most likely parse φ to a sentence s:
φ = argmax φ∈τ(s) P(φ | s)
• The rest of the steps follow those of the basic CYK parsing algorithm.
A limitation of plain PCFGs is that rule probabilities are independent of the actual words involved, so they cannot capture preferences such as which verb a prepositional phrase attaches to. Handling this requires a model which captures lexical dependency statistics for different words → Lexicalization
Lexicalization
• This lexicalized version keeps track of headwords (e.g., "jumped" in VP) and improves parsing
accuracy.
• A lexicalized PCFG assigns specific words to rules, making parsing more accurate by capturing
relationships between words.
o The verb (jumped) affects parsing probability.
o Dependencies between words like "jumped" and "boy" are captured.
o A sentence like "The boy jumped over the fence" is parsed more accurately.
6. Indian Languages
• Some of the characteristics of Indian languages that make CFG unsuitable.
• Paninian grammar can be used to model Indian languages.
1. Indian languages have relatively free word order.
o सबा खाना खाती है । Saba khana khati hai. (Saba eats food.)
o खाना सबा खाती है । Khana Saba khati hai. (Saba eats food, with the object fronted.)
Both orders express the same meaning. The CFG we used for parsing English is basically positional, and it fails to model free-word-order languages.
2. Complex predicates (CPs) is another property that most Indian languages have in common.
• A complex predicate combines a light verb with a verb, noun, or adjective, to produce a
new verb.
• For example:
(b) सबा आ गयी। → (Saba a gayi.) → Saba come went. → Saba arrived.
(c) सबा आ पडी। → Saba a pari. → Saba come fell. → Saba came (suddenly).
The use of post-position case markers and the auxiliary verbs in this sequence provide information about
tense, aspect, and modality.
Paninian grammar provides a framework to model Indian languages. It focuses on the extraction of Karak
relations from a sentence.
Bharti and Sangal (1990) described an approach for parsing of Indian languages based on Paninian
grammar formalism. Their parser works in two stages.
Example (1st stage): For the sentence ladkiyan maidaan mein khel rahi hein (the girls are playing in the field),
• The word ladkiyan forms one unit, the words maidaan and mein are grouped together to form a noun group, and the word sequence khel rahi hein forms a verb group.
2nd stage:
• The parser takes the word groups formed during first stage and identifies (i) Karaka relations
among them, and (ii) senses of words.
• Karaka chart is created to store additional information like Karaka-Vibhakti mapping.
• Constraint graph for sentence: The Karaka relation between a verb group and a noun group can
be depicted using a constraint graph.
Each sub-graph of the constraint graph that satisfies the following constraints yields a parse of the
sentence.
1. It contains all the nodes of the graph.
2. It contains exactly one outgoing edge from a verb group for each of its mandatory Karakas. These
edges are labelled by the corresponding Karaka.
3. For each of the optional Karaka in Karaka chart, the sub-graph can have at most one outgoing
edge labelled by the Karaka from the verb group.
4. For each noun group, the sub-graph should have exactly one incoming edge.
Question Bank
1 2 3 4 5
In this table, 1 is a protocol, 2 is name of a server, 3 is the directory, and 4 is the name
of a document. Suppose you have to write a program that takes a URL and returns the
protocol used, the DNS name of the server, the directory and the document name.
Develop a regular expression that will help you in writing this program.
7. Give two possible parse trees for the sentence, Stolen painting found by tree.
8. Identify the noun and verb phrases in the sentence, My soul answers in music.
10. Discuss the disadvantages of the basic top-down parser with the help of an
appropriate example.
11. Tabulate the sequence of states created by CYK algorithm while parsing, The sun
rises in the east. Augment the grammar in section 4.4.5 with appropriate rules of
lexicon.
13. What does lexicalized grammar mean? How can lexicalization be achieved? Explain
with the help of suitable examples.
14. List the characteristics of a garden path sentence. Give an example of a garden path
sentence and show its correct parse.
Consider the grammar:
S → NP VP        S → VP          NP → Det Noun
NP → Noun        NP → NP PP      VP → VP PP
VP → Verb        VP → VP NP      PP → Preposition NP
Give two possible parses of the sentence 'Pluck the flower with the stick'. Introduce lexicon
rules for words appearing in the sentence. Using these parse trees obtain maximum
likelihood estimates for the grammar rules used in the tree. Calculate probability of any one
parse tree using these estimates.
Lab Exercises
1. Write a program to find minimum edit distance between two input strings.
2. Use any tagger available in your lab to tag a text file. Now write a program to find
the most likely tag in the tagged text.
3. Write a program to find the probability of a tag given previous two tags, i.e., P(t3/t2
t1).
4. Write a program to extract all the noun phrases from a text file. Use the phrase structure
rule given in this chapter.
5. Write a program to check whether a given grammar is context free grammar or not.
Module – 3
Naive Bayes, Text Classification and Sentiment
Naive Bayes, Text Classification and Sentiment: Naive Bayes Classifiers, Training the Naive
Bayes Classifier, Worked Example, Optimizing for Sentiment Analysis, Naive Bayes for Other
Text Classification Tasks, Naive Bayes as a Language Model.
Textbook 2: Ch. 4.
Introduction
Example: + ... any characters and richly applied satire, and some great plot twists
- It was pathetic. The worst part about it was the boxing scenes ...
+ ... awesome caramel sauce and sweet toasty almonds. I love this place!
- ... awful pizza and ridiculously overpriced ...
Words like great, richly, awesome, and pathetic, awful, and ridiculously are very informative cues for text classification. Common text classification tasks include:
1. Sentiment analysis: extracting the positive or negative orientation that a writer expresses toward some object, as in the movie and restaurant reviews above.
2. Spam detection:
o Binary classification task of assigning an email to one of the two classes spam or
not-spam.
o Many lexical and other features can be used to perform classification.
3. Assigning a library subject category or topic label to a text: Various sets of subject
categories exist. Deciding whether a research paper concerns epidemiology, embryology,
etc..is an important component of information retrieval.
Supervised Learning:
• The most common way of doing text classification in language processing is supervised
learning.
• In supervised learning, we have a data set of input observations, each associated with
some correct output (a ‘supervision signal’).
• The goal of the algorithm is to learn how to map from a new observation to a correct
output.
• We have a training set of N documents that have each been hand labeled with a class:
{(d1, c1), …, (dN, cN)}. Our goal is to learn a classifier that is capable of mapping from a new document d to its correct class c ∈ C, where C is some set of useful document classes.
The intuition of the classifier is shown in Fig. 1. We represent a text document as if it were a bag
of words, that is, an unordered set of words with their position ignored, keeping only their
frequency in the document.
Instead of representing the word order in all the phrases like “I love this movie” and “I would
recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word
it 6 times, the words love, recommend, and movie once, and so on.
For a document d, out of all classes c ∈ C, the classifier returns the class ĉ which has the maximum posterior probability given the document:
ĉ = argmax c∈C P(c|d)     (1)
Use Bayes' rule to break down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y)     (2)
Substituting Eq. 2 into Eq. 1:
ĉ = argmax c∈C P(d|c) P(c) / P(d)     (3)
Since P(d) doesn't change for each class, we can conveniently simplify Eq. 3 by dropping the denominator:
ĉ = argmax c∈C P(d|c) P(c)     (4)
We call naive Bayes a generative model: Eq. 4 can be read as saying that a class is first sampled from P(c), and then the words are generated by sampling from P(d|c), thereby generating a document.
Representing a document as a set of features f1, f2, ..., fn, Eq. 4 becomes
ĉ = argmax c∈C P(f1, f2, ..., fn | c) P(c)     (5)
This equation is still too hard to compute directly: without some simplifying assumptions, estimating the probability of every possible combination of features (for example, every possible set of words and positions) would require huge numbers of parameters and impossibly large training sets.
The first is the bag-of-words assumption, that the features f1, f2, ... ,fn only encode word identity
and not position.
The second is commonly called the naive Bayes assumption, the conditional independence
assumption that the probabilities P(fi|c) are independent given the class c.
The final equation for the class chosen by a naive Bayes classifier is:
c_NB = argmax c∈C P(c) ∏f∈F P(f|c)     (8)
To apply the naive Bayes classifier to text, we will use each word in the documents as a feature, as suggested above, and we consider each of the words in the document by walking an index through every word position in the document:
c_NB = argmax c∈C P(c) ∏i∈positions P(wi|c)     (9)
Naive Bayes calculations, like calculations for language modelling, are done in log space, to avoid underflow and increase speed. Thus Eq. 9 is generally instead expressed as
c_NB = argmax c∈C [ log P(c) + ∑i∈positions log P(wi|c) ]     (10)
Eq. 10 computes the predicted class as a linear function of input features. Classifiers that use a
linear combination of the inputs to make a classification decision -like naive Bayes and also
logistic regression are called linear classifiers.
To learn the class prior P(c), we ask what percentage of the documents in our training set are in each class c.
Let Nc be the number of documents in our training data with class c, and Ndoc be the total number of documents. Then,
P(c) = Nc / Ndoc     (11)
We'll assume a feature is just the existence of a word in the document's bag of words, and so we'll want P(wi|c), which we compute as the fraction of times the word wi appears among all words in all documents of topic c.
Concatenate all documents with category c into one big "category c" text. Then we use the frequency of wi in this concatenated document to give a maximum likelihood estimate of the probability:
P(wi|c) = count(wi, c) / ∑w∈V count(w, c)     (12)
Here the vocabulary V consists of all the unique words across all classes, not just the words in one class c.
Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but
suppose there are no training documents that both contain the word "fantastic" and are classified
as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative.
In such a case the probability for this feature will be zero:
P("fantastic"|positive) = count("fantastic", positive) / ∑w∈V count(w, positive) = 0     (13)
Since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in the
likelihood term for any class will cause the probability of the class to be zero, no matter the other
evidence!
To solve this, we use something called Laplace smoothing (or add-one smoothing): add 1 to every count before normalizing:
P(wi|c) = (count(wi, c) + 1) / ∑w∈V (count(w, c) + 1) = (count(wi, c) + 1) / (∑w∈V count(w, c) + |V|)     (14)
Now "fantastic" will still get a very small probability in the "positive" class, but not zero.
2. Words that occur in our test data but are not in our vocabulary:
• Remove them from the test document and not include any probability for them at all.
Some systems choose to completely ignore another class of words: stop words, very
frequent words like the and a.
• Defining the top 10-100 vocabulary entries as stop words, or alternatively by using one
of the many predefined stop word lists available online. Then each instance of these stop
words is simply removed from both training and test documents.
• However, using a stop word list doesn't improve performance, and so it is more common
to make use of the entire vocabulary.
Fig. The naive Bayes algorithm, using add-1 smoothing. To use add-α smoothing instead, change the +1 to +α for the log-likelihood counts in training.
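A minimal Python sketch (not the textbook's pseudocode) of this training and classification procedure in log space with add-one smoothing. The tiny training corpus is invented for illustration.

    import math
    from collections import Counter, defaultdict

    train = [("just plain boring", "-"), ("very few laughs", "-"),
             ("very powerful", "+"), ("the most fun film", "+")]

    def train_nb(docs):
        log_prior, word_counts, class_totals = {}, defaultdict(Counter), Counter()
        vocab = set()
        n_doc = len(docs)
        for text, c in docs:
            words = text.split()
            word_counts[c].update(words)      # count(w, c)
            class_totals[c] += len(words)     # total words in class c
            vocab.update(words)
        for c in word_counts:
            log_prior[c] = math.log(sum(1 for _, y in docs if y == c) / n_doc)
        return log_prior, word_counts, class_totals, vocab

    def classify(text, log_prior, word_counts, class_totals, vocab):
        scores = {}
        for c in log_prior:
            score = log_prior[c]
            for w in text.split():
                if w not in vocab:            # unknown words are simply dropped
                    continue
                score += math.log((word_counts[c][w] + 1) /
                                  (class_totals[c] + len(vocab)))   # add-1 smoothing
            scores[c] = score
        return max(scores, key=scores.get)

    model = train_nb(train)
    print(classify("powerful and fun", *model))   # '+'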
Let’s use a sentiment analysis domain with the two classes positive (+) and negative (-), and take
the following miniature training and test documents simplified from actual movie reviews.
Step1: Prior P(c) for the two classes is computed as per equation 11:
Step 2: The word “with” doesn't occur in the training set, so we drop it completely.
Step 3: The likelihoods from the training set for the remaining three words "predictable", "no",
and "fun", are as follows:
Step 4: For the test sentence S = "predictable with no fun", after removing the word 'with', the
chosen class, via Eq. 9: is therefore computed as
follows:
While standard naive Bayes text classification can work well for sentiment analysis, some small
changes are generally employed that improve performance.
Consider the difference between I really like this movie (positive) and I didn’t like this movie
(negative). Similarly, negation can modify a negative word to produce a positive review (don’t
dismiss this film, doesn’t let us get bored).
Solution: Prepend the prefix NOT to every word after a token of logical negation (n’t, not, no,
never) until the next punctuation mark.
‘words’ like NOT_like, NOT_recommend will thus occur more often in negative document and
act as cues for negative sentiment, while words like NOT_bored, NOT_dismiss will acquire
positive associations.
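A sketch of the negation heuristic described above: prepend NOT_ to every token after a negation word until the next punctuation mark. The token pattern and the sets of negation words and punctuation below are simplified for illustration.

    import re

    NEGATIONS = {"not", "no", "never", "n't", "didn't", "don't", "doesn't"}
    PUNCT = {".", ",", "!", "?", ";", ":"}

    def mark_negation(text):
        tokens = re.findall(r"[\w']+|[.,!?;:]", text.lower())
        negated, out = False, []
        for tok in tokens:
            if tok in PUNCT:
                negated = False          # negation scope ends at punctuation
                out.append(tok)
            elif tok in NEGATIONS:
                negated = True
                out.append(tok)
            else:
                out.append("NOT_" + tok if negated else tok)
        return out

    print(mark_negation("I didn't like this movie, but I like the cast."))
    # ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'i', 'like', ...]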
Derive the positive and negative word features from sentiment lexicons (corpus), lists of words
that are pre-annotated with positive or negative sentiment.
For example, the MPQA subjectivity lexicon has 6885 words, each marked for whether it is strongly or weakly biased positive or negative. Some examples:
- : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate
Spam detection—deciding whether an email is unsolicited bulk mail—was one of the earliest
applications of naïve Bayes in text classification (Sahami et al., 1998). Rather than treating all
words as individual features, effective systems often use predefined sets of words or patterns,
along with non-linguistic features.
For instance, the open-source tool SpamAssassin uses a range of handcrafted features:
In contrast, tasks like language identification rely less on words and more on subword units like
character n-grams or even byte n-grams. These can capture statistical patterns at the start or end
of words, especially when spaces are included as characters.
A well-known system, langid.py (Lui & Baldwin, 2012), starts with all possible n-grams of
lengths 1–4 and uses feature selection to narrow down to the 7,000 most informative.
Training data for language ID systems often comes from multilingual sources such as Wikipedia
(in 68+ languages), newswire, and social media. To capture regional and dialectal diversity,
additional corpora include:
These diverse sources help models capture the full range of language use across different
communities and contexts (Jurgens et al., 2017).
Example: Consider a naive Bayes model with the classes positive (+) and negative (-) and the
following model parameters:
Each of the two columns above instantiates a language model that can assign a probability to
the sentence “I love this fun film”:
P("I love this fun film" | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10^-7
P("I love this fun film" | -) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1.0 × 10^-9
The positive model assigns a higher probability to the sentence: P(s|pos) > P(s|neg).
Note: This is just the likelihood part of the naive Bayes model; once we multiply in the prior a
full naive Bayes model might well make a different classification decision.
• Need to compare:
o System’s prediction
o Gold label (human-defined correct label)
Example 2: Social Media Monitoring for a Brand
• Scenario: CEO of Delicious Pie Company wants to track mentions on social media.
• To evaluate how well a system (e.g., spam detector or pie-tweet detector) performs.
• Confusion Matrix:
• Accuracy:
o Real-world data is often skewed (e.g., most tweets are not about pie).
o Example: if only 100 out of 1,000,000 tweets mention pie, a classifier that labels every tweet as "not about pie" is 99.99% accurate, yet useless for finding the pie tweets.
o Conclusion: Accuracy is not a reliable metric when the positive class is rare.
That’s why, instead of relying on accuracy, we often use two more informative metrics:
precision and recall (as shown in Fig).
• Precision measures the percentage of items labeled as positive by the system that are
actually positive (according to human-annotated “gold” labels).
• Recall measures the percentage of actual positive items that were correctly identified by
the system.
These metrics address the issue with the "nothing is pie" classifier. Despite its seemingly excellent 99.99% accuracy, it has a recall of 0, because it misses all 100 true positive cases and identifies none (with no true positives and 100 false negatives, recall is 0/100). Its precision is likewise uninformative, since the classifier labels nothing as positive.
Unlike accuracy, precision and recall focus on true positives, helping us measure how well the
system finds the things it’s actually supposed to detect.
To combine both precision and recall into a single metric, we use the F-measure (van
Rijsbergen, 1975), with the most common version being the F1 score:
The ß parameter differentially weights the importance of recall and precision, based perhaps on
the needs of an application. Values of ß > 1 favor recall, while values of ß < 1 favor precision.
When ß = 1, precision and recall are equally balanced; this is the most frequently used metric,
and is called Fβ=1 or just F1:
F1 = 2PR / (P + R)     (16)
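A sketch computing precision, recall, and F-measure from raw confusion counts (tp, fp, fn). The beta parameter weights recall against precision as described above; beta = 1 gives the balanced F1 score. The example counts are hypothetical.

    def precision_recall_f(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision == 0 and recall == 0:
            return precision, recall, 0.0
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f

    # Hypothetical pie-tweet detector: 60 true positives, 40 false positives,
    # 40 false negatives.
    print(precision_recall_f(60, 40, 40))   # (0.6, 0.6, 0.6)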
For sentiment analysis we generally have 3 classes (positive, negative, neutral) and even
more classes are common for tasks like part-of-speech tagging, word sense disambiguation,
semantic role labeling, emotion detection, and so on. Luckily the naive Bayes algorithm is
already a multi-class classification algorithm.
Consider the sample confusion matrix for a hypothetical 3-way one-of email
categorization decision (urgent, normal, spam) shown in Fig. The matrix shows, for example,
that the system mistakenly labeled one spam document as urgent, and we have shown how to
compute a distinct precision and recall value for each class.
Confusion matrix for a three-class categorization task, showing for each pair of
classes (c1,c2), how many documents from c1 were (in)correctly assigned to c2.
In order to derive a single metric that tells us how well the system is doing, we can combine
these values in two ways.
1. In macroaveraging, we compute the performance for each class, and then average over
classes.
2. In microaveraging, we collect the decisions for all classes into a single confusion matrix,
and then compute precision and recall from that table.
Fig. shows the confusion matrix for each class separately, and shows the computation of
microaveraged and macroaveraged precision.
As the figure shows, a microaverage is dominated by the more frequent class (in this case spam),
since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes,
and so is more appropriate when performance on all the classes is equally important.
1. Standard Procedure:
o Train the model on the training set.
o Use the development set (devset) to tune parameters and choose the best model.
o Evaluate the final model on a separate test set.
2. Issue with Fixed Splits:
o Fixed training/dev/test sets may lead to small dev/test sets.
o Smaller test sets might not be representative of overall performance.
3. Solution – Cross-Validation (as shown in Fig):
o Cross-validation allows use of all data for training and testing.
o Process:
▪ Split data into k folds.
▪ For each fold:
▪ Train on k-1 folds, test on the remaining fold.
▪ Repeat k times, average the test errors.
o Example: 10-fold cross-validation (train on 90%, test on 10%, repeated 10
times).
4. Limitation of Cross-Validation:
o All data is used for testing → we can't analyze the data in advance (to avoid "peeking").
o Looking at the data is important for feature design in NLP systems.
5. Common Compromise:
o Split off a fixed test set.
o Do 10-fold cross-validation on the training set.
o Use the test set only for final evaluation.
• If p < threshold, the result is considered statistically significant (we reject H₀ and
conclude A is likely better than B).
How Do We Compute the p-value in NLP?
• NLP avoids parametric tests (like t-tests or ANOVAs) because they assume certain
distributions that often don't apply.
• Instead, we use non-parametric tests that rely on sampling methods.
Key Idea:
• Simulate many variations of the experiment (e.g., using different test sets x′).
• Compute δ(x′) for each → this gives a distribution of δ values.
• If the observed δ(x) is in the top 1% (i.e., p-value < 0.01), it's unlikely under H₀ → reject
H₀.
Common Non-Parametric Tests in NLP:
1. Approximate Randomization (Noreen, 1989)
2. Bootstrap Test (paired version is most common)
o Compares aligned outputs from two systems (e.g., A vs. B on the same inputs
xi).
o Measures how consistently one system outperforms the other across samples.
1. Generate a large number (e.g., 100,000) of new test sets by sampling 10 documents with
replacement from the original set.
2. For each virtual test set, recalculate the accuracy difference between A and B.
3. Use the distribution of these differences to estimate a p-value, telling us how likely the
observed δ(x) is under the null hypothesis (that A is not better than B).
This helps determine whether the observed performance difference is statistically significant or
just due to random chance.
Figure: The paired bootstrap test: Examples of b pseudo test sets x (i) being created from an initial true test
set x. Each pseudo test set is created by sampling n = 10 times with replacement; thus an individual sample
is a single cell, a document with its gold label and the correct or incorrect performance of classifiers A and
B.
With the b bootstrap test sets, we now have a sampling distribution to analyze whether
A’s advantage is due to chance. Following Berg-Kirkpatrick et al. (2012), we assume the null
hypothesis (H₀)—that A is not better than B—so the average δ(x) should be zero or negative. If
our observed δ(x) is much higher, it would be surprising under H₀. To measure this, we calculate
the p-value by checking how often the sampled δ(xᵢ) values exceed the observed δ(x).
We use the notation 1(x) to mean “1 if x is true, and 0 otherwise.” Although the expected value
of δ(X) over many test sets is 0, this isn't true for bootstrapped test sets due to the bias in the
original test set, so we compute the p-value by counting how often δ (x(i)) exceeds the expected
δ(x) by δ(x) or more.
p-value(x) = (1/b) ∑i=1..b 1( δ(x(i)) ≥ 2δ(x) )     (22)
If we have 10,000 test sets and a threshold of 0.01, and in 47 test sets we find δ(x(i)) ≥ 2δ(x), the
p-value of 0.0047 is smaller than 0.01. This suggests the result is surprising, allowing us to reject
the null hypothesis and conclude A is better than B.
The full algorithm for the bootstrap is shown in Fig. It is given a test set x and a number of samples b, and it counts the percentage of the b bootstrap test sets in which δ(x(i)) ≥ 2δ(x). This percentage then acts as a one-sided empirical p-value.
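A sketch of the paired bootstrap test described above: resample the test set with replacement b times and count how often the resampled difference δ(x(i)) exceeds the observed δ(x) by δ(x) or more. The per-document correctness scores for systems A and B below are invented.

    import random

    def paired_bootstrap(correct_a, correct_b, b=10000, seed=0):
        rng = random.Random(seed)
        n = len(correct_a)
        delta_x = (sum(correct_a) - sum(correct_b)) / n     # observed accuracy gap
        exceed = 0
        for _ in range(b):
            idx = [rng.randrange(n) for _ in range(n)]      # sample n docs with replacement
            delta_i = (sum(correct_a[i] for i in idx) -
                       sum(correct_b[i] for i in idx)) / n
            if delta_i >= 2 * delta_x:
                exceed += 1
        return exceed / b                                    # one-sided empirical p-value

    # 1 = document classified correctly, 0 = incorrectly, for the same 10 documents.
    a = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
    b_sys = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
    print(paired_bootstrap(a, b_sys))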
Module – 4
Information Retrieval & Lexical Resources
Information Retrieval: Design Features of Information Retrieval Systems, Information
Retrieval Models - Classical, Non-classical, Alternative Models of Information Retrieval - Cluster model, Fuzzy model, LSI model, Major Issues in Information Retrieval.
Lexical Resources: WordNet, FrameNet, Stemmers, Parts-of-Speech Tagger, Research
Corpora.
Textbook 1: Ch. 9, Ch. 12.
Overview:
The huge amount of information stored in electronic form, has placed heavy demands on
information retrieval systems. This has made information retrieval an important research area.
4.1 Introduction
• Information retrieval (IR) deals with the organization, storage, retrieval, and evaluation
of information relevant to a user's query.
• A user in need of information formulates a request in the form of a query written in a
natural language.
• The retrieval system responds by retrieving the document that seems relevant to the
query.
“An information retrieval system does not inform (i.e., change the knowledge of) the user on the
subject of their inquiry. It merely informs on the existence (or non-existence) and whereabouts of
documents relating to their request”.
• This chapter focuses on text document retrieval, excluding question answering and data
retrieval systems, which handle precise queries for specific data or answers.
• In contrast, IR systems deal with vague, imprecise queries and aim to retrieve relevant
documents rather than exact answers.
• In information retrieval, documents are not represented by their full text but by a set of
index terms or keywords, which can be single words or phrases, extracted automatically
or manually.
• Indexing, provides a logical view of the document and helps reduce computational costs.
• A commonly used data structure is the inverted index, which maps keywords to the
documents they appear in.
• To further reduce the number of keywords, text operations such as stop word
elimination (removing common functional words) and stemming (reducing words to
their root form) are used.
• Zipf’s law can be applied to reduce the index size by filtering out extremely frequent or
rare terms.
• Since not all terms are equally relevant, term weighting assigns numerical values to
keywords to reflect their importance.
• Choosing appropriate index terms and weights is a complex task, and several term-
weighting schemes have been developed to address this challenge.
4.2.1 Indexing
In principle, an IR system can scan the full text of every document to decide its relevance to a query, but with a large collection of documents this poses practical problems. A collection of raw documents is therefore usually transformed into an easily accessible representation. This process is known as indexing.
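A minimal sketch of the inverted index mentioned above: each keyword maps to the set of documents in which it occurs. The toy documents are invented for illustration.

    from collections import defaultdict

    docs = {1: "information retrieval and query evaluation",
            2: "retrieval of relevant documents",
            3: "query languages for databases"}

    inverted_index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted_index[term].add(doc_id)     # term -> documents containing it

    print(sorted(inverted_index["retrieval"]))   # [1, 2]
    print(sorted(inverted_index["query"]))       # [1, 3]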
• Multi-word terms can be extracted using methods like n-grams, POS tagging, NLP, or
manual crafting.
• POS tagging aids in resolving word sense ambiguity using contextual grammar.
• Statistical methods (e.g., frequent word pairs) are efficient but struggle with word order
and structural variations, which syntactic methods handle better.
• TREC approach: Treats any adjacent non-stop word pair as a phrase, retaining only
those that occur in a minimum number (e.g., 25) of documents.
• NLP is also used for identifying proper nouns and normalizing noun phrases to unify
variations (e.g., "President Kalam" and "President of India").
• Phrase normalization reduces structural differences in similar expressions (e.g., "text
categorization," "categorization of text," and "categorize text" → "text categorize").
4.2.3 Stemming
• Stemming reduces words to their root form by removing affixes (e.g., "compute,"
"computing," "computes," and "computer" → "compute").
• This helps normalize morphological variants for consistent text representation.
• Stems are used as index terms.
• The Porter Stemmer (1980) is one of the most widely used stemming algorithms.
The stemmed representation of the text, Design features of information retrieval systems, is
{design, feature, inform, retrieval, system}
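A small sketch using NLTK's implementation of the Porter stemmer (assuming NLTK is installed) to conflate the morphological variants mentioned above.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["compute", "computing", "computes", "computer"]:
        print(word, "->", stemmer.stem(word))
    # all four variants reduce to the same stem ("comput")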
• Stemming can sometimes reduce effectiveness by removing useful distinctions
between words.
• It may increase recall by conflating similar terms, but can also reduce precision by
retrieving irrelevant results (e.g., "computation" vs. "personal computer").
• Recall and precision are key metrics for evaluating information retrieval performance.
• High-frequency words lack discriminative power and are not useful for indexing.
• Words can be filtered by setting frequency thresholds to drop too common or too rare
terms.
• Stop word elimination is a practical application of Zipf’s law, targeting high-frequency
terms.
✓ Example:
2. Non-classical IR models:
✓ Use principles beyond similarity, probability, or Boolean logic.
✓ Based on advanced theories like special logic, situation theory, or interaction models.
3. Alternative IR models:
✓ Examples include the Cluster model, Fuzzy model, and Latent Semantic Indexing
(LSI).
Advantages:
They are simple, efficient, and easy to implement and perform well in terms of recall and
precision if the query is well formulated.
Drawbacks:
• The Boolean model retrieves only fully matching documents; it cannot handle documents
that are partially relevant to a query (No partial relevance).
• It does not rank the retrieved documents by relevance—documents either match or don’t
(No ranking of results).
• Users must formulate queries using strict Boolean expressions, which is unnatural and
difficult for most users (Strict query format).
• Representation:
• Documents and queries are represented as vectors of features (terms).
• Each vector exists in a multi-dimensional space, with each dimension
corresponding to a unique term in the corpus.
• Numerical vectors: Terms are assigned weights, often based on their frequency in
the document (e.g., TF-IDF).
• Similarity computation:
• Ranking algorithms (e.g., cosine similarity) are used to compute the similarity
between a document vector and the query vector.
• The similarity score determines how relevant a document is to a given query.
• Retrieval output:
• Documents are ranked based on their similarity scores to the query.
• A ranked list of documents is presented as the retrieval result.
Where wij is the weight of the term ti in document dj, the document collection as a whole is
represented by an m x n term-document matrix as:
Example:
Consider the documents and terms in previous section Let the weights be assigned based on the
frequency of the term within the document. Then, the associated vectors will be
(2, 2, 1)
(1, 0, 1)
(0, 1, 1)
The vectors can be represented as a point in Euclidean space,
To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization changes all vectors to a standard length.
We convert document vectors to unit length by dividing each dimension by the overall length of the vector.
Elements of each column are divided by the length of the column vector, given by |dj| = sqrt(∑i wij²).
The tf-idf weighting scheme combines two components to determine the importance of a term:
• Term frequency (tf): A local statistic indicating how often a term appears in a
document.
• Inverse document frequency (idf): A global statistic that reflects how rare or
specific a term is across the entire document collection.
• tf-idf is Widely used in information retrieval and natural language processing to assess
the relevance of a term in a document relative to a corpus.
Example:
Consider a document represented by the three terms {tornado, swirl, wind} with the raw tf {4, 1,
and 1} respectively. In a collection of 100 documents, 15 documents contain the term tornado,
20 contain swirl, and 40 contain wind.
The idf of other terms are computed in the same way. Table shows the weights assigned to the three terms
using this approach.
Note:
Tornado: highest TF-IDF weight (3.296), indicating both high frequency in the document and relatively
low occurrence across all documents.
Swirl: rare but relevant
Wind: least significant
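A sketch of the tf-idf computation for the example above, assuming the weights are computed as tf × log10(N/df), which reproduces the 3.296 value quoted for tornado.

    import math

    N = 100                                   # documents in the collection
    tf = {"tornado": 4, "swirl": 1, "wind": 1}
    df = {"tornado": 15, "swirl": 20, "wind": 40}

    for term in tf:
        weight = tf[term] * math.log10(N / df[term])   # tf * idf
        print(f"{term}: {weight:.3f}")
    # tornado: 3.296, swirl: 0.699, wind: 0.398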
Most weighting schemes can thus be characterized by the following three factors:
Table: Calculating weight with different options for the three weighting factors
Term weighting in IR has evolved significantly from basic tf-idf. Different combinations of tf,
idf, and normalization strategies form various weighting schemes, each affecting retrieval
performance. Advanced models like BM25 further refine this by incorporating document length
and probabilistic reasoning.
4.4.3 A simple automatic method for obtaining indexed representation of the documents is
as follows.
Step 1: Tokenization This extracts individual terms form a document, converts all the letters to
lower case, and removes punctuation marks.
Step 2: Stop word elimination This removes the words that appear most frequently across the document collection (stop words).
Step 3: Stemming This reduces the remaining terms to their linguistic root, to obtain the index
terms.
Step 4: Term weighting This assigns weights to terms according to their importance in the
document, in the collection, or some combination of both.
Example:
Sample documents
Dice's Coefficient:
Measures similarity by doubling the inner product and normalizing by the sum of squared weights.
Jaccard’s Coefficient:
Computes similarity as the ratio of the inner product to the union (sum of squares minus
intersection).
Cosine Similarity:
Computes the cosine of the angle between the document vector dj and the query vector qk. It gives a similarity score between 0 and 1:
3.Interaction Model
• Documents are interconnected; retrieval emerges from the interaction between query
and documents.
• Implemented using artificial neural networks, where documents and the query are
neurons in a dynamic network.
• Query integration reshapes connections, and the degree of interaction guides retrieval.
Cluster Model
• Reduces the number of document comparisons during retrieval by grouping similar documents.
• The cluster hypothesis suggests that documents with high similarity are likely to be relevant to the same queries.
Cluster Representation
o rₖ = {a₁ₖ, a₂ₖ, ..., aₘₖ}, where each element represents the average of the corresponding term weights in the documents of that cluster.
• The query is compared with each cluster by computing the similarity between the query vector q and the representative vector rₖ as
• A cluster Ck whose similarity Sk exceeds a threshold is returned and the search proceeds
in that cluster.
Example:
Consider 3 documents (d1, d2, d3) and 5 terms (t1 to t5). The term-by-document matrix is:
t/d d1 d2 d3
t1 1 1 0
t2 1 0 0
t3 1 1 1
t4 0 0 1
t5 1 1 0
So, document vectors are: d1 = (1, 1, 1, 0, 1), d2 = (1, 0, 1, 0, 1), d3 = (0, 0, 1, 1, 0)
• sim(d1, d2)
dot(d1, d2) = 1×1 + 1×0 + 1×1 + 0×0 + 1×1 = 3
|d1| = √(1²+1²+1²+0²+1²) = √4 = 2
|d2| = √(1²+0²+1²+0²+1²) = √3 ≈ 1.73
sim = 3 / (2 × 1.73) ≈ 0.87
• sim(d1, d3)
dot = 1×0 + 1×0 + 1×1 + 0×1 + 1×0 = 1
|d3| = √(0²+0²+1²+1²+0²) = √2 ≈ 1.41
sim = 1 / (2 × 1.41) ≈ 0.35
• sim(d2, d3)
dot = 1×0 + 0×0 + 1×1 + 0×1 + 1×0 = 1
sim = 1 / (1.73 × 1.41) ≈ 0.41
Similarity matrix:
d1 d2 d3
d1 1.0
d2 0.87 1.0
d3 0.35 0.41 1.0
Since d1 and d2 are highly similar (0.87), they are grouped into cluster C1 = {d1, d2}, while d3 forms cluster C2 = {d3}. The cluster representatives are:
• r1 = avg(d1, d2)
= ((1+1)/2, (1+0)/2, (1+1)/2, (0+0)/2, (1+1)/2)
= (1, 0.5, 1, 0, 1)
• r2 = d3 = (0, 0, 1, 1, 0)
Let the query vector be q = (1, 0, 1, 0, 1). Then:
• sim(q, r1)
dot = 1×1 + 0×0.5 + 1×1 + 0×0 + 1×1 = 3
|q| = √(1² + 0² + 1² + 0² + 1²) = √3 ≈ 1.73
|r1| = √(1² + 0.5² + 1² + 0² + 1²) = √3.25 ≈ 1.80
sim = 3 / (1.73 × 1.80) ≈ 0.96
• sim(q, r2)
dot = 1×0 + 0×0 + 1×1 + 0×1 + 1×0 = 1
|r2| = √(0²+0²+1²+1²+0²) = √2 ≈ 1.41
sim = 1 / (1.73 × 1.41) ≈ 0.41
Query is closer to r1, so we retrieve documents from Cluster C1 = {d1, d2}
In the fuzzy model of information retrieval, each document is represented as a fuzzy set of
terms, where each term is associated with a membership degree indicating its importance to
the document's content. These weights are typically derived from term frequency within the
document and across the entire collection.
For queries:
• A single-term query returns documents in which the term's membership degree exceeds a threshold.
• Multi-term Boolean queries are evaluated using fuzzy set operations on the term memberships.
This model allows ranking documents by their degree of relevance to the query.
Example:
Documents:
• d1 = {information, retrieval, query}
• d2 = {retrieval, query, model}
• d3 = {information, retrieval}
Term Set:
• The index terms include t2 = model and t4 = retrieval (as used in the query below).
Query:
• q = t2 ∧ t4 (i.e., model AND retrieval)
In fuzzy logic, the AND operation (∧) is typically interpreted as the minimum of the
memberships.
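Since the membership degrees for this example are not reproduced above, the following sketch assumes illustrative values; it shows how the fuzzy AND (minimum) ranks the documents:

# Hypothetical membership degrees for each document's fuzzy set of terms
# (values chosen for illustration only; real weights come from tf/idf statistics).
memberships = {
    'd1': {'information': 0.8, 'retrieval': 0.7, 'query': 0.5, 'model': 0.0},
    'd2': {'information': 0.0, 'retrieval': 0.6, 'query': 0.4, 'model': 0.9},
    'd3': {'information': 0.9, 'retrieval': 0.8, 'query': 0.0, 'model': 0.0},
}

def fuzzy_and(doc, terms):
    # Fuzzy AND: the degree of relevance is the minimum membership over the query terms.
    return min(memberships[doc].get(t, 0.0) for t in terms)

query = ['model', 'retrieval']          # q = t2 AND t4
scores = {d: fuzzy_and(d, query) for d in memberships}
print(sorted(scores.items(), key=lambda kv: -kv[1]))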
6. Latent Semantic Indexing (LSI) Model
Latent Semantic Indexing (LSI) applies Singular Value Decomposition (SVD) to information
retrieval, aiming to uncover hidden semantic structures in word usage across documents.
Unlike traditional keyword-based methods, LSI captures conceptual similarities between terms
and documents, even when there is no exact term match.
The term-by-document matrix W is decomposed as W = T S Dᵀ, where T contains the term
vectors, S is the diagonal matrix of singular values, and D contains the document vectors.
• Query Transformation: Queries are projected into the same reduced k-dimensional latent
space as the documents.
• Similarity Computation: Documents are ranked using similarity measures (e.g., cosine
similarity) between the query vector and document vectors in the latent space.
Advantages:
• Captures semantic relationships between terms and documents.
• Can retrieve relevant documents even if they don’t share any terms with the query.
• Reduces the impact of synonymy and polysemy.
Example:
Consider a term-by-document matrix X with 5 terms and 6 documents. The SVD of X is
computed to obtain the three matrices T, S, and D: X5×6 = T5×5 S5×5 (D6×5)ᵀ, where T holds
the term vectors, S the singular values, and D the document vectors.
Consider the two largest singular values of S, and rescale the first two rows of Dᵀ with these
singular values to get the matrix R2×6 = S2×2 Dᵀ2×6, as shown in the figure below. R is a
reduced-dimensionality representation of the original term-by-document matrix X.
To find out the changes introduced by the reduction, we compute document similarities in the
new space and compare them with the similarities between documents in the original space.
The document-document correlation matrix for the original n-dimensional space is given by
the matrix Y = XᵀX. Here, Y is a square, symmetric n × n matrix. An element Yij in this matrix
gives the similarity between documents i and j. The correlation matrix for the original document
vectors is shown in the figure. This matrix is computed using X, after normalizing the lengths of
its columns.
The document-document correlation matrix for the new space is computed analogously using
the reduced representation R. Let N be the matrix R with length-normalized columns. Then
M = NᵀN gives the matrix of document correlations in the reduced space. The correlation matrix
M is given in the figure.
The similarity between document d1 and documents d4 (−0.0304) and d6 (−0.2322) is quite low
in the new space, because document d1 is not topically similar to documents d4 and d6.
In the original space, the similarity between documents d2 and d3 and between documents d2 and
d5 is 0. In the new space, they have high similarity values (0.5557 and 0.8518 respectively)
although documents d3 and d5 share no term with document d2. This topical similarity is
recognized due to the co-occurrence patterns in the documents.
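The LSI computation can be sketched with NumPy; the matrix X below is a stand-in (the textbook's actual term-by-document matrix is not reproduced here):

import numpy as np

# Hypothetical 5-term x 6-document matrix, for illustration only.
X = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 1]], dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T @ diag(s) @ Dt
k = 2
R = np.diag(s[:k]) @ Dt[:k, :]                     # reduced 2 x 6 representation

def col_normalize(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

Y = col_normalize(X).T @ col_normalize(X)          # doc-doc correlations, original space
M = col_normalize(R).T @ col_normalize(R)          # doc-doc correlations, reduced space
print(np.round(Y, 2))
print(np.round(M, 2))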
Part B
LEXICAL RESOURCES
1. Introduction
The chapter provides an overview of freely available tools and lexical resources for natural
language processing (NLP), aimed at assisting researchers—especially newcomers to the field.
It emphasizes the importance of knowing where to find resources, which can significantly reduce
time and effort. The chapter compiles and briefly discusses key tools such as stemmers, taggers,
parsers, and lexical databases like WordNet and FrameNet, along with accessible test corpora,
all of which are available online or through scholarly articles.
2. WORDNET
WordNet is a comprehensive lexical database for the English language, developed at Princeton
University under George A. Miller and based on psycholinguistic principles. It is divided into
three databases: one for nouns, one for verbs, and a combined one for adjectives and adverbs.
Key features include:
• Synsets: Groups of synonymous words representing a single concept.
• Lexical and semantic relations: These include synonymy, antonymy,
hypernymy/hyponymy (generalization/specialization), meronymy/holonymy
(part/whole), and troponymy (manner-based verb distinctions).
• Multiple senses: Words can belong to multiple synsets and parts of speech, with each
sense given a gloss—a dictionary-style definition with usage examples.
• Hierarchical structure: Nouns and verbs are arranged in taxonomic hierarchies (e.g.,
'river' has a hypernym chain), while adjectives are grouped by antonym sets.
Figure 1 shows the entries for the word 'read'. 'Read' has one sense as a noun and 11 senses as a verb.
Glosses help differentiate meanings. Figures 2, 3, and 4 show some of the relationships that hold between
nouns, verbs, and adjectives and adverbs.
Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy relation,
whereas adjectives are organized into clusters based on antonym pairs (or triplets). Figure 5
shows a hypernym chain for 'river' extracted from WordNet. Figure 6 shows the troponym
relations for the verb 'laugh'.
• Figure 7 shows the Hindi WordNet entry for the word आकांक्षा (aakanksha).
• Hindi WordNet can be obtained from the URL
http://www.cfilt.iitb.ac.in/wordnet/webhwn/. CFILT has also developed a Marathi
WordNet.
• Figure 8 shows the Marathi WordNet
(http://www.cfilt.iitb.ac.in/wordnet/webmwn/wn.php) entry for the word pau.
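A small sketch of querying WordNet through NLTK (senses and glosses, a hypernym chain, and an antonym relation); it assumes NLTK and its WordNet data are installed:

# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Senses (synsets) and glosses for 'read' as a verb
for syn in wn.synsets('read', pos=wn.VERB)[:3]:
    print(syn.name(), '-', syn.definition())

# A hypernym chain for 'river', from the most general concept down
river = wn.synsets('river')[0]
print(' -> '.join(s.name() for s in river.hypernym_paths()[0]))

# Antonym relation for an adjective
wet = wn.synsets('wet', pos=wn.ADJ)[0]
print([ant.name() for lemma in wet.lemmas() for ant in lemma.antonyms()])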
3. FRAMENET
FrameNet is a rich lexical database of semantically annotated English sentences, grounded in
frame semantics.
1. Frame Semantics:
Each word (especially verbs, nouns, adjectives) evokes a specific situation or event
known as a frame.
2. Target Word / Predicate:
The word that evokes the frame (e.g., nab in the ARREST frame).
3. Frame Elements (FEs):
These are semantic roles or participants in the frame-specific event (e.g.,
AUTHORITIES, SUSPECT, TIME in the ARREST frame).
o These roles define the predicate-argument structure of the sentence.
4. Annotated Sentences:
Sentences, often drawn from the British National Corpus, are tagged with frame
elements to illustrate how words function in context.
5. Ontology Representation:
FrameNet provides a semantic-level ontology of language, representing not just
grammatical but also contextual and conceptual relationships.
Example:
In the sentence, “The police nabbed the suspect,” the word nab triggers the ARREST frame:
• The police → AUTHORITIES
• The suspect → SUSPECT
[Authorities The police] nabbed [Suspect the snatcher]
FrameNet thus provides a structured and nuanced way to model meaning and roles in language,
making it valuable for tasks such as semantic role labeling, information extraction, and natural
language understanding.
The COMMUNICATION frame includes roles like ADDRESSEE, COMMUNICATOR, TOPIC, and
MEDIUM. The JUDGEMENT frame includes JUDGE, EVALUEE, and REASON. Frames can
inherit roles from others; for instance, the STATEMENT frame inherits from COMMUNICATION and
includes roles such as SPEAKER, ADDRESSEE, and MESSAGE.
The following sentences show some of these roles:
[Judge She] [Evaluee blames the police] [Reason for failing to provide enough protection].
[Speaker She] told [Addressee me] [Message 'I’ll return by 7:00 pm today'].
Figure 9 shows the core and non-core frame elements of the COMMUNICATION frame, along with
other details.
4. STEMMERS:
Stemming (or conflation) is the process of reducing inflected or derived words to their base
or root form. The resulting stem doesn't need to be a valid word, as long as related terms map
to the same stem.
Purpose:
• Helps in query expansion, indexing (e.g., in search engines), and various NLP tasks.
Common Stemming Algorithms:
• Porter's Stemmer – Most widely used (Porter, 1980).
• Lovins Stemmer – An earlier approach (Lovins, 1968).
• Paice/Husk Stemmer – A more recent and flexible method (Paice, 1990).
These tools, called stemmers, differ in how aggressively they reduce words but all aim to
improve text processing by grouping word variants.
Figure 10 shows a sample text and output produced using these stemmers.
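A short sketch comparing two of these stemmers via NLTK (the Lancaster stemmer is an implementation of the Paice/Husk algorithm); the word list is illustrative only:

from nltk.stem import PorterStemmer, LancasterStemmer

words = ['connection', 'connected', 'connecting', 'flies', 'studies', 'traditional']
porter, lancaster = PorterStemmer(), LancasterStemmer()
for w in words:
    # The two stemmers differ in how aggressively they strip suffixes.
    print(f'{w:12s}  Porter: {porter.stem(w):10s}  Lancaster: {lancaster.stem(w)}')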
5. PART-OF-SPEECH TAGGER
Part-of-speech tagging is a crucial early-stage NLP technique used in applications like speech
synthesis, machine translation, information retrieval (IR), and information extraction. In
IR, it helps with indexing, phrase extraction, and word sense disambiguation.
• One freely available tagger performs bidirectional inference with a maximum-entropy
model (Tsuruoka and Tsujii, 2005).
• Performance:
o Outperforms unidirectional methods.
o Comparable to top algorithms such as kernel-based SVMs.
• Reference: Tsuruoka and Tsujii (2005)
Table 12.1 shows tagged text of document #93 of the CACM collection.
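A minimal tagging sketch using NLTK's default tagger (an averaged-perceptron model, not the bidirectional tagger described above, but enough to show the input/output format); the sentence is illustrative only:

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "Stemming reduces inflected words to their root form."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # list of (token, POS tag) pairs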
6. RESEARCH CORPORA
Research corpora have been developed for a number of NLP-related tasks. In the following
section, we point out a few of the available standard document collections for a variety of NLP-
related tasks, along with their Internet links.
Glasgow University, UK, maintains a list of freely available IR test collections. The table lists
the sources of those and a few more IR test collections. LETOR (learning to rank) is a package
of benchmark data sets released by Microsoft Research Asia. It consists of two datasets:
OHSUMED and TREC (TD2003 and TD2004).
LETOR is packaged with extracted features for each query-document pair in the collection,
baseline results of several state-of-the-art learning-to-rank algorithms on the data, and evaluation
tools. The data set is aimed at supporting future research in the area of learning ranking functions
for information retrieval.
Evaluating a text summarization system requires the existence of 'gold' summaries. DUC provides
document collections with known extracts and abstracts, which are used for evaluating the
performance of summarization systems submitted at TREC conferences. Figure 11 shows a
sample document and its extract from DUC 2002 summarization data.
Open Mind Word Expert attempts to create a very large sense-tagged corpus. It collects word
sense tagging from the general public over the Web.
Module – 5
Machine Translation
Machine Translation: Language Divergences and Typology, Machine Translation using
Encoder-Decoder, Details of the Encoder-Decoder Model, Translating in Low-Resource
Situations, MT Evaluation, Bias and Ethical Issues.
Textbook 2: Ch. 13. (Exclude 13.4)
Overview:
• Machine Translation: The use of computers to translate from one language to another.
• MT for information access is probably one of the most common uses of NLP
o We might want to translate some instructions on the web, perhaps the recipe for
a favorite dish, or the steps for putting together some furniture.
o We might want to read an article in a newspaper, or get information from an
online resource like Wikipedia or a government webpage in some other language.
o Google Translate alone translates hundreds of billions of words a day between
over 100 languages.
• Another common use of machine translation is to aid human translators.
o This task is often called computer-aided translation or CAT.
o CAT is commonly used as part of localization: the task of adapting content or a
product to a particular language community.
• Finally, a more recent application of MT is to in-the-moment human communication
needs. This includes incremental translation, translating speech on-the-fly before the
entire sentence is complete, as is commonly used in simultaneous interpretation.
• Image-centric translation can be used for example to use OCR of the text on a phone
camera image as input to an MT system to translate menus or street signs.
Fig 5.1: The Tower of Babel, Pieter Bruegel 1563. Wikimedia Commons, from the
Kunsthistorisches Museum, Vienna.
Story of The Tower of Babel (Bruegel, 1563):
• Bruegel’s painting depicts the biblical story from Genesis 11, where humanity, speaking
a single language, tries to build a tower to reach the heavens.
• As a divine response, God confuses their language, causing miscommunication and
halting the project. It’s a cautionary tale about human ambition and the limits of
communication.
To build better machine translation (MT) systems, we need to understand why translations can
be different (Dorr, 1994).
• Differences about words themselves. For example, each language has a different word
for "dog." These are called idiosyncratic or lexical differences, and we handle them one
by one.
• Differences about patterns. For example, some languages put the verb before the object.
Others put the verb after the object. These are systematic differences that we can model
more generally. The study of these patterns across languages is called linguistic
typology.
Two languages that share their basic word order type often have other similarities.
For example, VO languages generally have prepositions, whereas OV languages
generally have postpositions.
VO → the verb wrote is followed by its object a letter and the prepositional phrase to a friend,
in which the preposition to is followed by its argument a friend.
• Arabic, with a VSO order, also has the verb before the object and prepositions.
• Other kinds of ordering preferences vary idiosyncratically - In some SVO languages (like
English and Mandarin) adjectives tend to appear before nouns, while in others languages
like Spanish and Modern Hebrew, adjectives appear after the noun.
Fig. shows examples of other word order differences. All of these word order differences
between languages can cause problems for translation, requiring the system to do huge structural
reorderings as it generates the output.
5.2.1 Tokenization
• Machine translation systems use a fixed vocabulary decided in advance.
• The vocabulary is built by running a tokenization algorithm on both source and target
language texts together.
• This vocabulary is made using subword tokenization, not by splitting at spaces.
• An example of a subword tokenization method is BPE (Byte Pair Encoding).
• One shared vocabulary is used for both source and target languages.
• This sharing makes it easy to copy names and special words from one language to
another.
• Subword tokenization works well for languages with spaces (like English, Hindi) and no
spaces (like Chinese, Thai).
• Modern systems use better algorithms than simple BPE.
o For example, BERT uses WordPiece, a smarter version of BPE.
• WordPiece chooses merges that improve the language model probability of the training
corpus, rather than merges based on frequency alone.
• Wordpieces use a special symbol at the beginning of each token.
The wordpiece algorithm is given a training corpus and a desired vocabulary size V, and
proceeds as follows:
1. Initialize the wordpiece lexicon with characters (for example a subset of Unicode
characters, collapsing all the remaining characters to a special unknown character token).
2. Repeat until there are V wordpieces:
(a) Train an n-gram language model on the training corpus, using the current set of
wordpieces.
(b) Consider the set of possible new wordpieces made by concatenating two wordpieces
from the current lexicon. Choose the one new wordpiece that most increases the
language model probability of the training corpus.
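The contrast between BPE's frequency-based merges and WordPiece's likelihood-motivated merges can be illustrated on a toy corpus. The sketch below scores candidate merges with count(ab)/(count(a)·count(b)), a commonly cited approximation to the likelihood gain; a real WordPiece trainer retrains an n-gram language model at each step, and the toy word counts are assumptions:

from collections import Counter

# Toy corpus: each word is a tuple of current wordpieces (starting as characters).
words = Counter({('l','o','w'): 5, ('l','o','w','e','r'): 2,
                 ('n','e','w','e','s','t'): 6, ('w','i','d','e','s','t'): 3})

def pair_stats(words):
    pair_counts, unit_counts = Counter(), Counter()
    for pieces, freq in words.items():
        for u in pieces:
            unit_counts[u] += freq
        for a, b in zip(pieces, pieces[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, unit_counts

def merge(words, pair):
    # Replace every adjacent occurrence of the chosen pair with the merged piece.
    a, b = pair
    merged = Counter()
    for pieces, freq in words.items():
        out, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and pieces[i] == a and pieces[i+1] == b:
                out.append(a + b); i += 2
            else:
                out.append(pieces[i]); i += 1
        merged[tuple(out)] += freq
    return merged

for step in range(5):
    pairs, units = pair_stats(words)
    bpe_pick = pairs.most_common(1)[0][0]            # BPE: most frequent pair
    wp_pick = max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))
    print(step, 'BPE would merge:', bpe_pick, '| WordPiece-style merge:', wp_pick)
    words = merge(words, wp_pick)                    # apply the WordPiece-style merge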
Unigram Model:
• Unlike BPE, which requires specifying the number of merges, WordPiece and the
unigram algorithm let users define a target vocabulary size, typically between 8K–32K
tokens.
• The unigram algorithm, often referred to as SentencePiece (its implementation library),
starts with a large initial vocabulary of characters and frequent character sequences.
• Unigram iteratively removes low-probability tokens using statistical modeling (like the
EM algorithm) until reaching the desired size.
• It generally outperforms BPE by avoiding overly fragmented or non-meaningful tokens
and better handling common subword patterns.
Standard parallel corpora for training MT systems include, for example:
• The OpenSubtitles corpus: movie and TV subtitles.
• The ParaCrawl corpus: general web text; 223 million sentence pairs between 23 EU
languages and English, extracted from CommonCrawl.
Sentence alignment
Standard training corpora for MT come as aligned pairs of sentences.
When creating new corpora, for example for underresourced languages or new domains, these
sentence alignments must be created.
Fig: A sample alignment between sentences in English and French, with sentences extracted from Antoine de
Saint-Exupery’s Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e 1,..., en,
and f1,……, fm and finds minimal sets of sentences that are translations of each other, including single
sentence mappings like (e1, f1), (e4, f3), (e5, f4), (e6, f6) as well as 2-1 alignments (e2/e3,f2), (e7/e8,f7), and
null alignments (f5).
A candidate pair of spans (x, y) is scored with a cost function based on the cosine similarity of
their multilingual sentence embeddings, of the form
cost(x, y) = (1 − cos(x, y)) · nSents(x) · nSents(y) / ( Σs (1 − cos(x, ys)) + Σs (1 − cos(xs, y)) )
where nSents() gives the number of sentences (this biases the metric toward many alignments of
single sentences instead of aligning very large spans). The denominator helps to normalize the
similarities; xs and ys are randomly selected sentences sampled from the respective
documents.
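A sketch of this scoring with stand-in sentence embeddings (in practice these would come from a multilingual sentence encoder); the vectors and the number of random samples below are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def align_cost(x_emb, y_emb, n_x, n_y, rand_x, rand_y):
    # Low cost = likely translation pair; the denominator normalizes against
    # randomly sampled sentences from each document.
    denom = sum(1 - cos(x_emb, ys) for ys in rand_y) + sum(1 - cos(xs, y_emb) for xs in rand_x)
    return (1 - cos(x_emb, y_emb)) * n_x * n_y / denom

x = rng.normal(size=16)
y = x + 0.1 * rng.normal(size=16)          # near-translation: embedding close to x
rand_x = [rng.normal(size=16) for _ in range(5)]
rand_y = [rng.normal(size=16) for _ in range(5)]
print(align_cost(x, y, 1, 1, rand_x, rand_y))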
5.3 Details of the Encoder-Decoder Model
In the decoder's cross-attention layer, CrossAttention(Q, K, V) = softmax(QKᵀ/√dk) V, where:
• Q = the previous decoder layer's output Hdec multiplied by the cross-attention query weights WQ
• K = the encoder output Henc multiplied by the cross-attention key weights WK
• V = the encoder output Henc multiplied by the cross-attention value weights WV
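A NumPy sketch of cross-attention with toy dimensions, showing that queries come from the decoder while keys and values come from the encoder output; all matrices are random stand-ins:

import numpy as np

rng = np.random.default_rng(1)
d = 8                               # model / key dimension (toy size)
H_enc = rng.normal(size=(6, d))     # encoder output: 6 source positions
H_dec = rng.normal(size=(4, d))     # decoder states from the previous layer: 4 target positions

W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q = H_dec @ W_Q                     # queries come from the decoder
K = H_enc @ W_K                     # keys come from the encoder output
V = H_enc @ W_V                     # values come from the encoder output

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # softmax over source positions
cross_attn = weights @ V            # each target position attends over the source
print(cross_attn.shape)             # (4, 8)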
• To train an encoder-decoder model, we use the same self-supervision objective we used for
training encoder-decoder RNNs.
• The network is given the source text and then, starting with the separator token, is trained
autoregressively to predict the next token using cross-entropy loss.
• Cross-entropy is determined by the probability the model assigns to the correct next word.
• We use teacher forcing in the decoder, at each time step in decoding we force the system
to use the gold target token from training as the next input.
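A toy NumPy sketch of teacher forcing and the per-token cross-entropy loss; the "model outputs" here are random stand-ins for real decoder predictions:

import numpy as np

rng = np.random.default_rng(2)
V = 10                       # toy vocabulary size
BOS = 0
gold = [4, 7, 2, 9]          # gold target token ids for one sentence

# Teacher forcing: the decoder input at step t is the gold token from step t-1,
# regardless of what the model actually predicted at step t-1.
decoder_input  = [BOS] + gold[:-1]
decoder_target = gold

# Stand-in model outputs: one probability distribution over the vocabulary per step.
logits = rng.normal(size=(len(gold), V))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Cross-entropy: negative log probability assigned to the correct next token.
loss = -np.mean([np.log(probs[t, decoder_target[t]]) for t in range(len(gold))])
print(decoder_input, decoder_target, round(loss, 3))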
5.4 Translating in Low-Resource Situations
• One advantage of a multilingual model is that it can improve the translation of lower-
resourced languages by drawing on information from a similar language in the training data
that happens to have more resources.
5.5 MT Evaluation
Translations are evaluated along two dimensions:
1. Adequacy: how well the translation captures the exact meaning of the source sentence.
Sometimes called faithfulness or fidelity.
2. Fluency: how fluent the translation is in the target language (is it grammatical, clear,
readable, natural).
5.5.1 Using Human Raters to Evaluate MT
• Human evaluation is the most accurate method for assessing machine translation (MT)
quality, focusing on two main dimensions: fluency (how natural and readable the
translation is) and adequacy (how much meaning from the source is preserved).
• Raters, often crowdworkers, assign scores on a scale (e.g., 1–5 or 1–100) for each.
• Bilingual raters compare source and translation directly for adequacy, while monolingual
raters compare MT output with a human reference. Alternatively, raters may choose the
better of two translations.
• Proper training is crucial, as raters often struggle to distinguish fluency from adequacy.
• To ensure consistency, outliers are removed and ratings are normalized.
5.5.2 Automatic Evaluation
• Automatic metrics compare the MT output with one or more human reference translations;
chrF, for example, is a character n-gram F-score that combines the precision and recall of
character n-grams shared between the hypothesis and the reference.
• These metrics are not good for comparing very different systems (e.g., human-aided
vs. machine translation).
• chrF works best for comparing small changes within the same system.
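A simplified character n-gram F-score in the spirit of chrF (uniform averaging over n-gram orders; real implementations such as sacreBLEU's differ in details, so this is an illustrative sketch only):

from collections import Counter

def char_ngrams(text, n):
    text = text.replace(' ', '')
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    # Average character n-gram precision and recall, then combine with an F-beta score
    # (beta > 1 weights recall more heavily, as chrF does).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        if sum(hyp.values()) and sum(ref.values()):
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    p, r = sum(precisions) / len(precisions), sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(simple_chrf("the cat sat on the mat", "the cat is on the mat"))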