
Department of CSE (Data Science)

VI – Semester

Natural Language Processing

BAD613B

Dr. Mahantesh K
Associate Professor
Dept. of CSE (Data Science)
RNS Institute of Technology

NATURAL LANGUAGE PROCESSING


Course Code: BAD613B Semester: VI
Teaching Hours/Week (L: T:P: S): 3:0:0:0 CIE Marks: 50
Total Hours of Pedagogy: 40 SEE Marks: 50
Credits: 03 Total Marks: 100
Examination type (SEE): Theory Exam Hours: 03

Course objectives:
• Learn the importance of natural language modelling.
• Understand the applications of natural language processing.
• Study spelling error detection and correction methods and parsing techniques in NLP.
• Illustrate the information retrieval models in natural language processing.

Module-1
Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian Languages,
NLP Applications.
Language Modeling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.
Textbook 1: Ch. 1, Ch. 2.
Module-2

Word Level Analysis: Regular Expressions, Finite-State Automata, Morphological Parsing, Spelling Error Detection and Correction, Words and Word Classes, Part-of-Speech Tagging.

Syntactic Analysis: Context-Free Grammar, Constituency, Top-down and Bottom-up Parsing, CYK Parsing.
Textbook 1: Ch. 3, Ch. 4.
Module-3
Naive Bayes, Text Classification and Sentiment: Naive Bayes Classifiers, Training the
Naive Bayes Classifier, Worked Example, Optimizing for Sentiment Analysis, Naive Bayes
for Other Text Classification Tasks, Naive Bayes as a Language Model.
Textbook 2: Ch. 4.
Module-4
Information Retrieval: Design Features of Information Retrieval Systems, Information Retrieval Models - Classical, Non-classical, Alternative Models of Information Retrieval - Cluster model, Fuzzy model, LSI model, Major Issues in Information Retrieval.


Lexical Resources: WordNet, FrameNet, Stemmers, Parts-of-Speech Tagger, Research Corpora.
Textbook 1: Ch. 9, Ch. 12.
Module-5
Machine Translation: Language Divergences and Typology, Machine Translation using
Encoder-Decoder, Details of the Encoder-Decoder Model, Translating in Low-Resource
Situations, MT Evaluation, Bias and Ethical Issues.
Textbook 2: Ch. 13.

Course outcome (Course Skill Set)


At the end of the course, the student will be able to:
1. Apply the fundamental concept of NLP, grammar-based language model and statistical-
based language model.
2. Explain morphological analysis and different parsing approaches.
3. Develop the Naïve Bayes classifier and sentiment analysis for Natural language problems
and text classifications.
4. Apply the concepts of information retrieval, lexical semantics, lexical dictionaries.
5. Identify the Machine Translation applications of NLP using Encoder and Decoder.
Suggested Learning Resources:
Text Books:
1. Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press.
2. Daniel Jurafsky, James H. Martin, “Speech and Language Processing, An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition”,
Pearson Education, 2023.
Reference Books:
1. Akshay Kulkarni, Adarsha Shivananda, “Natural Language Processing Recipes -
Unlocking Text Data with Machine Learning and Deep Learning using Python”, Apress,
2019.
2. T V Geetha, “Understanding Natural Language Processing – Machine Learning and Deep
Learning Perspectives”, Pearson, 2024.
3. Gerald J. Kowalski and Mark. T. Maybury, “Information Storage and Retrieval systems”,
Kluwer Academic Publishers.
Web links and Video Lectures (e-Resources):
• https://www.youtube.com/watch?v=M7SWr5xObkA
• https://youtu.be/02QWRAhGc7g
• https://www.youtube.com/watch?v=CMrHM8a3hqw
• https://onlinecourses.nptel.ac.in/noc23_cs45/preview
• https://archive.nptel.ac.in/courses/106/106/106106211/


Module-1

Introduction & Language Modelling


• Introduction: What is Natural Language Processing? Origins of NLP, Language and
Knowledge, The Challenges of NLP, Language and Grammar, Processing Indian
Languages, NLP Applications.
• Language Modelling: Statistical Language Model - N-gram model (unigram, bigram),
Paninian Framework, Karaka theory.

Textbook 1: Tanveer Siddiqui, U.S. Tiwary, “Natural Language Processing and Information
Retrieval”, Oxford University Press. Ch. 1, Ch. 2.

1. INTRODUCTION

1.1 What is Natural Language Processing (NLP)


Language is the primary means of communication used by humans and the tool we use to express the greater part of our ideas and emotions. It shapes thought, has a structure, and carries meaning. When a thought is expressed, its content is represented in language in real time.

NLP is concerned with the development of computational models of aspects of human language processing. There are two main reasons for this:

1. To develop automated tools for language processing.


2. To gain a better understanding of human communication.

Building computational models with human language-processing abilities requires a


knowledge of how humans acquire, store, and process language.

Historically, there have been two major approaches to NLP:

1. Rationalist approach
2. Empiricist approach

Rationalist Approach: An early approach that assumes the existence of some language faculty in the human brain. Supporters of this approach argue that it is not possible to learn something as complex as natural language from limited sensory inputs.

Empiricist approach: Does not believe in the existence of a language faculty, but believes in the existence of some general organization principles such as pattern recognition, generalization, and association.


Learning of detailed structures takes place through the application of these principles on sensory
inputs available to the child.

1.2 Origins of NLP


NLP, which includes speech processing and is sometimes mistakenly termed natural language understanding, originated from machine translation research. Natural language processing includes both understanding (interpretation) and generation (production). We are concerned with text processing only - the area of computational linguistics and its applications.

Computational linguistics is similar to theoretical and psycho-linguistics, but uses different tools.
While theoretical linguistics is more about the structural rules of language, psycho-linguistics focuses on
how language is used and processed in the mind.
Theoretical linguistics explores the abstract rules and structures that govern language. It investigates
universal grammar, syntax, semantics, phonology, and morphology. Linguists create models to explain
how languages are structured and how meaning is encoded. Eg. Most languages have constructs like noun
and verb phrases. Theoretical linguists identify rules that describe and restrict the structure of languages
(grammar).
Psycho-linguistics focuses on the psychological and cognitive processes involved in language use. It
examines how individuals acquire, process, and produce language. Researchers study language
development in children and how the brain processes language in real-time. Eg. Studying how children
acquire language, such as learning to form questions ("What’s that?").

Computational Linguistics Models:


Computational linguistics is concerned with the study of language using computational models of
linguistic phenomena. It deals with the application of linguistic theories and computational techniques
for NLP. In computational linguistics, representing a language is a major problem; Most knowledge
representations tackle only a small part of knowledge. Representing the whole body of knowledge is
almost impossible.
Computational models may be broadly classified under knowledge-driven and data-driven categories.
Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of
handcrafted grammar rules. Acquiring and encoding such knowledge is difficult and is the main
bottleneck in the development of such systems.
Data-driven approaches presume the existence of a large amount of data and usually employ some
machine learning technique to learn syntactic patterns. Performance of these systems is dependent on the
quantity of the data and usually adaptive to noisy data.
Main objective of the models is to achieve a balance between semantic (knowledge-driven) and
data-driven approaches on one hand, and between theory and practice on the other.


With the unprecedented amount of information now available on the web, NLP has become one of the leading techniques for processing and retrieving information.
Information retrieval includes a number of information processing applications such as information
extraction, text summarization, question answering, and so forth. It includes multiple modes of
information, including speech, images, and text.

1.3 Language & Knowledge


Language is the medium of expression in which knowledge is deciphered. We are here considering
the text form of the language and the content of it as knowledge.
Language, being a medium of expression, is the outer form of the content it expresses. The same
content can be expressed in different languages.
Hence, to process a language means to process the content of it. As computers are not able to
understand natural language, methods are developed to map its content in a formal language.
The language and speech community considers a language as a set of sounds that, through
combinations, conveys meaning to a listener. However, we are concerned with representing and
processing text only. Language (text) processing has different levels, each involving different types of
knowledge.
1.3.1 Lexical analysis
• Analysis of words.
• Word-level processing requires morphological knowledge, i.e., knowledge about the
structure and formation of words from basic units (morphemes).
• The rules for forming words from morphemes are language specific.

1.3.2 Syntactic analysis


• Considers a sequence of words as a unit, usually a sentence, and finds its structure.
• Decomposes a sentence into its constituents (or words) and identifies how they relate to each
other.
• It captures grammaticality or non-grammaticality of sentences by looking at constraints like
word order, number, and case agreement.
• This level of processing requires syntactic knowledge (How words are combined to form
larger units such as phrases and sentences)
• For example:
o 'I went to the market' is a valid sentence whereas 'went the I market to' is not.
o 'She is going to the market' is valid, but 'She are going to the market' is not.


1.3.3 Semantic analysis


• It is associated with the meaning of the language.
• Semantic analysis is concerned with creating meaningful representation of linguistic inputs.
• Eg. 'Colorless green ideas sleep furiously' - syntactically correct, but semantically anomalous.
• A word can have a number of possible meanings associated with it. But in a given context,
only one of these meanings participates.


• Finding out the correct meaning of a particular use of word is necessary to find meaning of larger
units.
• Eg. Kabir and Ayan are married.
Kabir and Suha are married.
• Syntactic structure and compositional semantics fail to explain these interpretations.
• This means that semantic analysis requires pragmatic knowledge besides semantic and syntactic
knowledge.
• Pragmatics helps us understand how meaning is influenced by context, social factors, and
speaker intentions.

1.3.4 Discourse Analysis


• Attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and
document level, in terms of words, phrases, clusters, and sentences.
• It requires the resolution of anaphoric references and identification of discourse structure.

Anaphoric Reference
• Pragmatic knowledge may be needed for resolving anaphoric references.
Example: The district administration refused to give the trade union
permission for the meeting because they feared violence. (a)


The district administration refused to give the trade union permission for the meeting because they oppose the government. (b)
• For example, in the above sentences, resolving the anaphoric reference 'they' requires pragmatic
knowledge.

1.3.5 Pragmatic analysis


• The highest level of processing, deals with the purposeful use of sentences in situations.
• It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.

1.4 The Challenges of NLP


• Natural languages are highly ambiguous and vague, so achieving a precise representation of content can be difficult.
• The inability to capture all the required knowledge.
• Identifying its semantics.
• A language keeps on evolving. New words are added continually and existing words are introduced in new contexts (e.g., '9/11' for the terrorist attack on the WTC).
Solution: The only way machines can learn such usage is by considering context; the context of a word is defined by its co-occurring words.
• The frequency of a word being used in a particular sense also affects its meaning.
• Idioms, metaphor, and ellipses add more complexity to identify the meaning of the written text.
o Example: “The old man finally kicked the bucket” → "kicked the bucket" is a well-known idiom meaning "to die."
o "Time is a thief." → Metaphor suggests “time robs you of valuable moments or
experiences in life”.
o "I’m going to the store, and you’re going to the party, right?"
"Yes, I am…"
Ellipses refer to the omission of words or phrases in a sentence. (represented by "…")
• The ambiguity of natural languages is another difficulty (explicit as well as implicit sources of
knowledge).
o Word Ambiguity: Example: 'Taj' - a monument, a brand of tea, or a hotel.
▪ “Can” – ambiguous in its part-of-speech (resolved by a 'part-of-speech tagging' algorithm).
▪ “Bank” – ambiguous in its meaning (resolved by a 'word sense disambiguation' algorithm); a short illustration of both follows this list.
o Structural ambiguity - A sentence may be ambiguous
▪ 'Stolen rifle found by tree.'
▪ Verb sub-categorization may help to resolve
▪ Probabilistic parsing - statistical models to predict the most likely syntactic
structure.
• A number of grammars have been proposed to describe the structure of sentences.
o It is almost impossible for grammar to capture the structure of all and only meaningful
text.
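As a concrete illustration of the two disambiguation problems mentioned in the list above (part-of-speech ambiguity of 'can' and sense ambiguity of 'bank'), here is a minimal Python sketch using the NLTK toolkit. NLTK is not part of the prescribed texts; the sketch assumes the library and its tagger and WordNet resources have been installed via nltk.download(), and the tags and senses shown in the comments are indicative, not guaranteed outputs.

# Illustrative only: resolving the two kinds of ambiguity named above with NLTK.
# Assumes: pip install nltk, plus the tokenizer, tagger, and WordNet resources
# obtained through nltk.download().
from nltk import word_tokenize, pos_tag
from nltk.wsd import lesk

# 1. Part-of-speech ambiguity: 'can' as a modal verb vs. a noun.
tokens = word_tokenize("I can open the can")
print(pos_tag(tokens))
# typically: [('I','PRP'), ('can','MD'), ('open','VB'), ('the','DT'), ('can','NN')]

# 2. Word-sense ambiguity: 'bank' as river bank vs. financial institution.
context1 = word_tokenize("He sat on the bank of the river and watched the water")
context2 = word_tokenize("She deposited the cheque at the bank near her office")
print(lesk(context1, "bank"))   # a WordNet Synset such as the 'sloping land' sense
print(lesk(context2, "bank"))   # usually a different Synset, e.g. the financial sense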


1.5 Language and Grammar


• Language Grammar: Grammar defines language and consists of rules that allow parsing and
generation of sentences, serving as a foundation for natural language processing.
• Syntax vs. Semantics: Although syntax and semantics are closely related, a separation is made
in processing due to the complexity of world knowledge influencing both language structure and
meaning.
• Challenges in Language Specification: Natural languages constantly evolve, and the numerous
exceptions make language specification challenging for computers.
• Different Grammar Frameworks: Various grammar frameworks have been developed,
including transformational grammar, lexical functional grammar, and dependency grammar, each
focusing on different aspects of language such as derivation or relationships.
• Chomsky’s Contribution: Noam Chomsky’s generative grammar framework, which uses rules
to specify grammatically correct sentences, has been fundamental in the development of formal
grammar hierarchies.
Chomsky argued that phrase structure grammars are insufficient for natural language and proposed
transformational grammar in Syntactic Structures (1957). He suggested that each sentence has two levels:
a deep structure and a surface structure (as shown in Fig 1), with transformations mapping one to the
other.

Fig 1. Surface and Deep Structures of sentence


• Chomsky argued that an utterance is the surface representation of a 'deeper structure' representing
its meaning.
• The deep structure can be transformed in a number of ways to yield many different surface-level
representations.
• Sentences with different surface-level representations having the same meaning, share a common
deep-level representation.
Pooja plays veena.
Veena is played by Pooja.
Both sentences have the same meaning, despite having different surface structures (roles of subject and
object are inverted).


Transformational grammar has three components:


1. Phrase structure grammar: Defines the basic syntactic structure of sentences.
2. Transformational rules: Describe how deep structures can be transformed into different surface
structures.
3. Morphophonemic rules: Govern how the structure of a sentence (its syntax) influences the form of its words in terms of sound and pronunciation (phonology).

Phrase structure grammar consists of rules that generate natural language sentences and assign a
structural description to them. As an example, consider the following set of rules:

Eg: “The police will catch the snatcher.”

S → NP + VP
VP → V + NP
NP → Det + Noun
V → Aux + Verb

Det → the, a, an, ...
Verb → catch, write, eat, ...
Noun → police, snatcher, ...
Aux → will, is, can, ...
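The toy grammar above can be run directly. The following short Python sketch (an added illustration, not from the textbook) encodes the same rules with the NLTK toolkit, restricted to the lexical items needed for the example sentence, and parses 'The police will catch the snatcher'. NLTK is assumed to be installed.

# A runnable version of the toy phrase structure grammar, using NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
S    -> NP VP
VP   -> V NP
NP   -> Det Noun
V    -> Aux Verb
Det  -> 'the'
Aux  -> 'will'
Verb -> 'catch'
Noun -> 'police' | 'snatcher'
""")

parser = nltk.ChartParser(grammar)
tokens = "the police will catch the snatcher".split()
for tree in parser.parse(tokens):
    tree.pretty_print()   # prints the assigned phrase structure tree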

Transformation rules transform one phrase-marker (underlying) into another phrase-marker (derived). These rules are applied to the terminal string generated by the phrase structure rules. They transform one representation into another, e.g., an active sentence into a passive one.
Consider the active sentence: “The police will catch the snatcher.”

Eg. [NP1 - Aux - V - NP2] → [NP2 - Aux + be + en - V - by + NP1]

The application of phrase structure rules will assign the structure shown in Fig 2 (a)

(a) Phrase structure (b) Passive Transformation

The passive transformation rules will convert the sentence into

The + snatcher + will + be + en + catch + by + the + police


Morphophonemic Rule: Another transformational rule will then reorder 'en + catch' to 'catch + en' and
subsequently one of the morphophonemic rules will convert 'catch + en' to 'caught'.
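As a rough illustration of how the transformational and morphophonemic rules compose, the sketch below (hypothetical code, not the textbook's formulation) applies the pattern [NP1 - Aux - V - NP2] → [NP2 - Aux + be + en - V - by + NP1] to token lists and then uses a small lookup table, standing in for the morphophonemic rules, to realize 'en + catch' as 'caught'.

# Toy illustration: passive transformation followed by a morphophonemic rule.
# The segmentation of the active sentence into (NP1, Aux, Verb, NP2) is assumed given.

PAST_PARTICIPLE = {"catch": "caught", "eat": "eaten", "write": "written"}

def passivize(np1, aux, verb, np2):
    """[NP1 - Aux - V - NP2] -> [NP2 - Aux + be + en - V - by + NP1]"""
    # Transformational rule: reorder constituents and insert 'be' + 'en'.
    underlying = np2 + [aux, "be", "en", verb, "by"] + np1
    # Morphophonemic rule: merge 'en' + verb into a single surface word form.
    surface, i = [], 0
    while i < len(underlying):
        if underlying[i] == "en" and i + 1 < len(underlying):
            surface.append(PAST_PARTICIPLE.get(underlying[i + 1], underlying[i + 1] + "-en"))
            i += 2
        else:
            surface.append(underlying[i])
            i += 1
    return " ".join(surface)

print(passivize(["the", "police"], "will", "catch", ["the", "snatcher"]))
# -> the snatcher will be caught by the police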

Note: Long-distance dependency refers to syntactic phenomena where a verb and its subject or object can be arbitrarily far apart. Wh-movement is a specific case of this type of dependency.

E.g.

"I wonder who John gave the book to" involves a long-distance dependency between the verb "wonder"
and the object "who". Even though "who" is not directly adjacent to the verb, the syntactic relationship
between them is still clear.
The problem in the specification of appropriate phrase structure rules occurs because these phenomena
cannot be localized at the surface structure level.

1.6 Processing Indian Languages


There are a number of differences between Indian languages and English:
• Unlike English, Indic scripts have a non-linear structure.
• Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence
structure.
• Indian languages have a free word order, i.e., words can be moved freely within a sentence
without changing the meaning of the sentence.
• Spelling standardization is more subtle in Hindi than in English.
• Indian languages have a relatively rich set of morphological variants.
• Indian languages make extensive and productive use of complex predicates (CPs).
• Indian languages use post-position (Karakas) case markers instead of prepositions.
• Indian languages use verb complexes consisting of sequences of verbs,
o e.g., गा रहा है (ga raha hai-singing) and खेल रही है (khel rahi hai-playing).
o The auxiliary verbs in this sequence provide information about tense, aspect, modality,
etc

Paninian grammar provides a framework for Indian language models. These can be used for
computation of Indian languages. The grammar focuses on extraction of relations from a
sentence.


1.7 NLP Applications


1.7.1 Machine Translation
This refers to automatic translation of text from one human language to another. In order to carry out
this translation, it is necessary to have an understanding of words and phrases, grammars of the two
languages involved, semantics of the languages, and word knowledge.

1.7.2 Speech Recognition


This is the process of mapping acoustic speech signals to a set of words. The difficulties arise due to wide variations in the pronunciation of words, homonyms (e.g., dear and deer), and acoustic ambiguities (e.g., in the rest and interest).

1.7.3 Speech Synthesis


Speech synthesis refers to automatic production of speech (utterance of natural language sentences). Such
systems can read out your mails on telephone, or even read out a storybook for you.

1.7.4 Information Retrieval


This focuses on identifying documents relevant to a user's query. NLP methods like indexing, word sense disambiguation, query modification (including query expansion), and knowledge bases improve IR performance. Lexical resources like WordNet, LDOCE, and Roget's Thesaurus enhance these systems, helping refine search results and improve accuracy.

1.7.5 Information Extraction


An information extraction system captures and outputs factual information contained within a document. The query is specified as a set of pre-defined templates, and the system identifies the subset of information within a document that fits a template.

1.7.6 Question Answering


Given a question and a set of documents, a question answering system attempts to find the precise
answer, or at least the precise portion of text in which the answer appears. A question answering system
requires more NLP than an information retrieval system or an information extraction system. It requires
not only precise analysis of questions and portions of texts but also semantic as well as background
knowledge to answer certain type of questions.

1.7.7 Text Summarization


This deals with the creation of summaries of documents and involves syntactic, semantic, and
discourse level processing of text.


1.8 Some Successful Early NLP Systems

1.8.1 ELIZA (Weizenbaum 1966)


ELIZA is one of the earliest natural language understanding programs. It uses syntactic patterns to
mimic human conversation with the user. Here is a sample conversation.

Eliza: Hello. I am ELIZA. How may I help you?


User: I am feeling a little bit sleepy.
Eliza: How long have you been feeling a little bit sleepy?
User: For almost half an hour.
Eliza: Please go on.
1.8.2 SysTran (System Translation)

The first SysTran machine translation system was developed in 1969 for Russian-English translation. SysTran also provided the first on-line machine translation service, called Babel Fish, which was used by the AltaVista search engine to handle translation requests from users.

1.8.3 TAUM METEO

This is a natural language generation system used in Canada to generate weather reports. It accepts
daily weather data and generates weather reports in English and French.

1.8.4 SHRDLU (Winograd 1972)

This is a natural language understanding system that simulates actions of a robot in a block world
domain. It uses syntactic parsing and semantic reasoning to understand instructions. The user can ask the
robot to manipulate the blocks, to tell the blocks configurations, and to explain its reasoning.

1.8.5 LUNAR (Woods 1977)

This was an early question answering system that answered questions about moon rocks.

1.9 Information Retrieval

The availability of vast amounts of electronic text has made it challenging to find relevant
information. Information retrieval (IR) systems aim to address this issue by providing efficient access to
relevant content. Unlike 'entropy' in communication theory, which measures uncertainty, information
here refers to the content or subject matter of text, not digital communication or data transmission. Words
serve as carriers of information, and text is seen as the message encoded in natural language.

In IR, "retrieval" refers to accessing information from computer-based representations, requiring


processing and storage. Only relevant information, based on a user's query, is retrieved. IR involves


organizing, storing, retrieving, and evaluating information that matches a query, working with
unstructured data. Retrieval is based on content, not structure, and systems typically return a ranked list
of relevant documents.

IR has been integrated into various systems, including database management systems, bibliographic
retrieval systems, question answering systems, and search engines. Approaches for accessing large text
collections fall into two categories: one builds topic hierarchies (e.g., Yahoo), requiring manual
classification of new documents, which is not cost-effective; the other ranks documents by relevance, offering more scalability and efficiency for large collections.

Major issues in designing and evaluating Information Retrieval (IR) systems include selecting
appropriate document representations. Current models often use keyword-based representation, which
suffers from problems like polysemy, homonymy, and synonymy, as well as ignoring semantic and
contextual information. Additionally, vague or inaccurate user queries lead to poor retrieval performance,
which can be addressed through query modification or relevance feedback.

Matching query representation to document representation is another challenge, requiring effective


similarity measures to rank results. Evaluating IR system performance typically relies on recall and
precision, though relevance itself is subjective and difficult to measure accurately. Relevance
frameworks, such as the situational framework, attempt to address this by considering context and time.
Moreover, varying user needs and document collection sizes further complicate retrieval, requiring
specialized methods for different scopes.
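Since evaluation by recall and precision comes up again in Module 4, a small worked example (added here for illustration; the document identifiers are made up) makes the two measures concrete: precision = |relevant ∩ retrieved| / |retrieved| and recall = |relevant ∩ retrieved| / |relevant|.

# Worked example of the two standard IR evaluation measures.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                 # relevant documents actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The system returns 4 documents for a query; 5 documents are judged relevant.
retrieved = ["d1", "d3", "d7", "d9"]
relevant  = ["d1", "d2", "d3", "d5", "d8"]
print(precision_recall(retrieved, relevant))
# -> (0.5, 0.4): 2 of the 4 returned are relevant; 2 of the 5 relevant are found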


2. LANGUAGE MODELLING
To create a general model of any language is a difficult task. There are two approaches to language modelling:

1. Define a grammar that can handle the language.

2. Capture the patterns of the language statistically.

2.1 Introduction
Our purpose is to understand and generate natural languages from a computational viewpoint.

1st approach: Try to understand every word and sentence of it, and then come to a conclusion (has not
succeeded).
2nd approach: To study the grammar of various languages, compare them, and if possible, arrive at
reasonable models that facilitate our understanding of the problem and designing of natural-language
tools.
Language Model: A model is a description of some complex entity or process. Natural language is a
complex entity and in order to process it through a computer-based program, we need to build a
representation (model) of it.
Two categories of language modelling approaches:
Grammar-based language model:

• Uses the grammar of a language to create its model.


• It attempts to represent the syntactic structure of language.
• Hand-coded rules defining the structure and ordering of various constituents appearing in a
linguistic unit.

Eg. A sentence usually consists of noun phrase and a verb phrase. The grammar-based approach attempts
to utilize this structure and also the relationships between these structures.

Statistical language modelling:

• Creates a language model by training it from a corpus.


• To capture regularities of a language, the training corpus needs to be sufficiently large.
• Fundamental tasks in many NLP applications, including speech recognition, spelling correction,
handwriting recognition, and machine translation.
• Information retrieval, text summarization, and question answering.
• Most popular - n-gram models.
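To make the n-gram idea concrete before the formal treatment later in this module, here is a minimal bigram model estimated from a toy corpus by maximum likelihood (an added sketch, not the textbook's notation): P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).

# Minimal bigram language model trained on a toy corpus (maximum-likelihood estimates).
from collections import defaultdict

corpus = [
    "<s> mother gives food to the child </s>",
    "<s> the child eats food </s>",
    "<s> mother eats food </s>",
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[(w1, w2)] += 1
        unigram_counts[w1] += 1

def p_bigram(w2, w1):
    """P(w2 | w1) = count(w1, w2) / count(w1); 0 if w1 was never seen."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

def sentence_probability(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    words = sentence.split()
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(p_bigram("food", "eats"))                            # 2/2 = 1.0
print(sentence_probability("<s> mother eats food </s>"))   # (2/3)*(1/2)*1*(2/3) ≈ 0.22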


2.2 Various Grammar-Based Language Models


• Generative Grammars
• Hierarchical Grammar
• Government and Binding (GB)
• Lexical Functional Grammar (LFG) Model
• Paninian Framework

2.2.1 Generative Grammars


• We can generate sentences in a language if we know a collection of words and rules in that
language (Noam Chomsky).
• Sentences that can be generated as per these rules are grammatical; this approach dominated computational linguistics.
• It addressed the syntactic structure of language.
• But language is a relation between the sound (or the written text) and its meaning.

2.2.2 Hierarchical Grammar


• Chomsky (1956) described classes of grammars in a hierarchical manner, where the top layer
contained the grammars represented by its sub classes.
• Hence, Type 0 (or unrestricted) grammar contains Type 1 (or context-sensitive grammar), which
in turn contains Type 2 (context-free grammar) and that again contains Type 3 grammar (regular
grammar).

2.2.3 Government and Binding (GB)


(GB eliminated rules of grammar, since such rules were language-particular.)
Linguists often argue that language structure, especially in resolving structural ambiguity, can be
understood through meaning. However, the transformation between meaning and syntax is not well
understood. Transformational grammars distinguish between surface-level and deep-root-level sentence
structures.

Government and Binding (GB) theories rename these as s-level and d-level, adding phonetic and
logical forms as parallel levels of representation for analysis, as shown in Figure.


• 'Meaning' in 'sound' form is represented as logical form (LF) and phonetic form (PF) in the above figure.
• The GB is concerned with LF, rather than PF.
• The GB imagines that if we define rules for structural units at the deep level, it will be possible
to generate any language with fewer rules.

Let us take an example to explain d- and s-structures in GB:

Mukesh was killed.

i) In transformational grammar, this can be expressed as S → NP AUX VP, as given below.

ii) In GB, the s-structure and d-structure are as follows:

(Figures: surface structure and deep structure trees)

Note:
• The surface structure is the actual form of the sentence as it appears in speech or writing.
• The deep structure represents the underlying syntactic and semantic structure that is abstract and not
directly visible (Represents the core meaning of the sentence). "Someone killed Mukesh" or "A person
killed Mukesh."


Components of GB

• Government and binding (GB) comprise a set of theories that map the structures from d-structure
to s-structure and to logical form (LF).
• A general transformational rule called 'Move 𝛼' is applied at d-structure level as well as at s-
structure level.
• In its simplest form, GB can be represented as shown below.

GB consists of 'a series of modules that contain constraints and principles' applied at various
levels of its representations and the transformation rule, Move α.
The GB considers all three levels of representations (d-, s-, and LF) as syntactic, and LF is also
related to meaning or semantic-interpretive mechanisms.
GB applies the same Move 𝛼 transformation to map d-levels to s-levels or s-levels to LF level.
LF level helps in quantifier scoping and also in handling various sentence constructions such as passive
or interrogative constructions.
Example:
Consider the sentence: “Two countries are visited by most travellers.”
Its two possible logical forms are:
LF1: [s Two countries are visited by [NP most travellers]]
LF2: Applying Move 𝛼
[NP Most travellersi ] [s two countries are visited by ei]

• In LF1, the interpretation is that most travellers visit the same two countries (say, India and
China).
• In LF2, when we move [most travellers] outside the scope of the sentence, the interpretation can
be that most travellers visit two countries, which may be different for different travellers.
• One of the important concepts in GB is that of constraints. It is the part of the grammar which
prohibits certain combinations and movements; otherwise Move α can move anything to any
possible position.


• Thus, GB, is basically the formulation of theories or principles which create constraints to
disallow the construction of ill-formed sentences.
The organization of GB is as given below:

X̄ Theory:

• The X̄ Theory (pronounced 'X-bar theory') is one of the central concepts in GB. Instead of defining several phrase structures and the sentence structure with separate sets of rules, X̄ Theory defines them both as maximal projections of some head.
• Noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional phrase (PP) are maximal projections of noun (N), verb (V), adjective (A), and preposition (P) respectively, and can be represented as head X of their corresponding phrases (where X = {N, V, A, P}).
• Even the sentence structure can be regarded as the maximal projection of inflection (INFL).
• The GB envisages projections at two levels:
o The projection of the head at the semi-phrasal level, denoted by X̄ (X-bar),
o The maximal projection at the phrasal level, denoted by X̿ (X-double-bar).

Figure depicts the general and particular structures with examples


Maximal projection of sentence structure

Sub-categorization: It refers to the process of classifying words or phrases (typically verbs) according
to the types of arguments or complements they can take. It's a form of syntactic categorization that is
important for understanding the structure and meaning of sentences.

For example, different verbs in English can have different sub-categorization frames (also called
argument structures). A verb like "give" might take three arguments (subject, object, and indirect object),
while a verb like "arrive" might only take a subject and no objects.

"He gave her a book." ("gave" requires a subject, an indirect object, and a direct object)

"He arrived." ("arrived" only requires a subject)

In principle, any maximal projection can be the argument of a head, but sub-categorization is used as a
filter to permit various heads to select a certain subset of the range of maximal projections.
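Computationally, sub-categorization can be pictured as a small lexicon that lists, for each verb head, the complement frames it licenses; anything outside the listed frames is filtered out. The sketch below is an added Python illustration with a hypothetical frame inventory, not a fragment of any real grammar.

# Sub-categorization as a filter: each verb head licenses only certain frames.
SUBCAT = {
    "give":   [("Subj", "IObj", "Obj")],    # He gave her a book.
    "arrive": [("Subj",)],                  # He arrived.
    "eat":    [("Subj", "Obj"), ("Subj",)], # She ate (the food).
}

def licensed(verb, arguments):
    """True if the tuple of grammatical functions is a frame the verb permits."""
    return tuple(arguments) in SUBCAT.get(verb, [])

print(licensed("give", ["Subj", "IObj", "Obj"]))   # True:  "He gave her a book."
print(licensed("arrive", ["Subj", "Obj"]))         # False: *"He arrived her."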

Projection Principle:
Three syntactic representations:
1. Constituency Parsing (Tree Structure):
• Sentences are broken into hierarchical phrases or constituents (e.g., noun phrases, verb
phrases), represented as a tree structure.
2. Dependency Parsing (Directed Graph):
• Focuses on the direct relationships between words, where words are connected by directed
edges indicating syntactic dependencies.
3. Semantic Role Labelling (SRL):
• Identifies the semantic roles (e.g., agent, patient) of words in a sentence, focusing on the
meaning behind the syntactic structure.
The projection principle, a basic notion in GB, places a constraint on the three syntactic representations
and their mapping from one to the other.

The principle states that representations at all syntactic levels (i.e., d-level, s-level, and LF level) are
projections from the lexicon (collection or database of words and their associated linguistic information).


Thus, lexical properties of categorical structure (sub-categorization) must be observed at each level.
Suppose 'the object' is not present at d-level, then another NP cannot take this position at s-level.

Example:

• At D-structure, each argument of a verb is assigned a thematic role (e.g., Agent, Theme, Goal,
etc.).
• In a sentence like "John gave Mary the book", the verb "gave" requires three arguments: Agent
(John), Recipient (Mary), and Theme (the book).
• If the object (Theme) is not present at the deep structure, it cannot be filled at the surface structure
(S-structure) by another NP (e.g., a different noun phrase).

Theta Theory (Ɵ-Theory) or The Theory of Thematic Relations

• 'Sub-categorization' only places a restriction on syntactic categories which a head can accept.
• GB puts another restriction on the lexical heads through which it assigns certain roles to its
arguments.
• These roles are pre-assigned and cannot be violated at any syntactical level as per the projection
principle.
• These role assignments are called theta-roles and are related to 'semantic-selection'.

Theta Role and Theta Criterion


There are certain thematic roles from which a head can select. These are called Ɵ-roles and they are
mentioned in the lexicon, say for example the verb 'eat' can take arguments with Ɵ-roles '(Agent, Theme)'.

Agent is a special type of role which can be assigned by a head to outside arguments (external
arguments) whereas other roles are assigned within its domain (internal arguments).

Hence in 'Mukesh ate food',

the verb 'eat' assigns the 'Agent' role to 'Mukesh' (outside VP)

and 'Theme' (or 'patient') role to 'food'.

Theta-Criterion states that 'each argument bears one and only one Ɵ-role, and each Ɵ-role is
assigned to one and only one argument'.

C-command and Government


C-Command: It is a syntactic relation that defines a type of hierarchical relationship between two
constituents (words or phrases) in a sentence. It plays a critical role in the distribution of certain syntactic
phenomena, such as binding, agreement, and pronoun reference.


If any word or phrase (say α or ß) falls within the scope of and is determined by a maximal projection,
we say that it is dominated by the maximal projection.

If there are two structures α and ß related in such a way that 'every maximal projection dominating α dominates ß', we say that α C-commands ß, and this is the necessary and sufficient condition (iff) for C-command.

Government
α governs ß iff: α C-commands ß
α is an X (head, e.g., noun, verb, preposition, adjective, and inflection), and every maximal projection
dominating ß dominates α.
Additional information
C-COMMAND
A c-command is a syntactic relationship in linguistics, particularly in the theory of syntax, where one node (word
or phrase) in a tree structure can "command" or "govern" another node in certain ways. In simpler terms, it's a rule
that helps determine which parts of a sentence can or cannot affect each other syntactically.
Simple Definition:
C-command occurs when one word or phrase in a sentence has a syntactic connection to another word or phrase,
typically by being higher in the syntactic tree (closer to the top).
Example 1:
In the sentence "John saw Mary,"
"John" c-commands "Mary" because "John" is higher up in the tree structure and can potentially affect "Mary"
syntactically.
Example 2:
In the sentence "She thinks that I am smart,"
The pronoun "She" c-commands "I" because "She" is higher in the syntactic tree, governing the phrase where "I"
occurs.
In essence, c-command helps explain which words in a sentence are connected in ways that allow for things like
pronoun interpretation or binding relations (e.g., which noun a pronoun refers to).
GOVERNMENT
-is a special case of C-COMMAND
government refers to the syntactic relationship between a head (typically a verb, noun, or adjective) and its
dependent elements (such as objects or complements) within a sentence. It determines how certain words control
the form or case of other words in a sentence.
On the other hand, c-command is a syntactic relationship between two constituents in a sentence. A constituent A
c-commands another constituent B if the first constituent (A) is higher in the syntactic structure (usually in the tree)
and can potentially govern or affect the second constituent (B), provided no intervening nodes.
To put it together in context:
Government: This is a formal rule determining how certain words govern the case or form of other words in a
sentence (e.g., verbs can govern the object noun in accusative case in languages like Latin or German).
C-command: This is a structural relationship in which one constituent can influence another, typically affecting
operations like binding, scope, and sometimes government.


In short, government often operates within the structures of c-command, but c-command itself is a broader syntactic
relationship that is also relevant for other linguistic phenomena, such as binding theory, where one element can bind
another if it c-commands it.
Here are a few examples of government in syntax, showing how one word governs the form or case of another word in a sentence:
1. Verb Government
In many languages, verbs can govern the case of their objects. Here’s an example in Latin:
Latin: "Vidēre puellam" (to see the girl)
The verb "vidēre" (to see) governs the accusative case of "puellam" (the girl).
In this case, the verb "vidēre" governs the object "puellam" by requiring it to be in the accusative case.
2. Preposition Government
Prepositions can also govern the case of their objects. Here’s an example from German:
German: "Ich gehe in den Park" (I am going to the park)
The preposition "in" governs the accusative case of "den Park" (the park).
The preposition "in" governs the accusative case for the noun "Park" in this sentence.
3. Adjective Government
Adjectives can govern the case, gender, or number of the noun they modify. Here's an example from Russian:
Russian: "Я вижу красивую девочку" (I see a beautiful girl)
The adjective "красивую" (beautiful) governs the accusative case of "девочку" (girl).
In this case, the adjective "красивую" (beautiful) governs the accusative case of "девочку".
4. Noun Government
In some languages, nouns can govern the case of their arguments. In Russian, for example, some nouns govern a
particular case:
Russian: "Я горжусь успехом" (I am proud of the success)
The noun "успехом" (success) governs the instrumental case in this sentence.
Here, the noun "успехом" governs the instrumental case of its argument "успех".
Summary:
Government involves syntactic relationships where a head (verb, preposition, adjective, etc.) dictates or determines
the form (such as case) of its dependent elements.
In these examples, verbs, prepositions, and adjectives have a "governing" influence on the cases of nouns or objects
in the sentence, which is a core part of the syntax in many languages.

Movement, Empty Category, and Co-indexing


Movement & Empty Category:
In GB, Move α is described as 'move anything anywhere', though it provides restrictions for valid
movements.
In GB, the active to passive transformation is the result of NP movement as shown in sentence. Another
well-known movement is the wh-movement, where wh-phrase is moved as follows.
What did Mukesh eat?


[Mukesh INFL eat what]


As discussed in the projection principle, lexical categories must exist at all the three levels. This principle,
when applied to some cases of movement leads to the existence of an abstract entity called empty category.

In GB, there are four types of empty categories:

Two being empty NP positions called wh-trace and NP trace, and the remaining two being pronouns
called small 'pro' and big 'PRO'.

This division is based on two properties: anaphoric (+a or -a) and pronominal (+p or -p).
Wh-trace: -a, -p
NP-trace: +a, -p
small 'pro': -a, +p
big 'PRO': +a, +p

The traces help ensure that the proper binding relationships are maintained between moved elements
(such as how pronouns or reflexives bind to their antecedents, even after movement).
Additional Information:
• +a (Anaphor): A form that must refer back to something mentioned earlier (i.e., it has an
antecedent). For example, "himself" in "John washed himself." The form "himself" is an anaphor
because it refers back to "John."
• -a (Non-Anaphor): A form that does not require an antecedent to complete its meaning. A regular
pronoun like "he" in "He went to the store" is not an anaphor because it doesn't explicitly need to
refer back to something within the same sentence or clause.
• +p (Pronominal): A form that can function as a pronoun, standing in for a noun or noun phrase.
For example, "she" in "She is my friend" is a pronominal because it refers to a specific person
(though not necessarily previously mentioned).
• -p (Non-Pronominal): A word or form that isn't used as a pronoun. It could be a noun or other
word that doesn't serve as a replacement for a noun phrase in a given context. For example, in
"John went to the store," "John" is not pronominal—it is a noun phrase.

Co-indexing
It is the indexing of the subject NP and AGR (agreement) at d-structure which are preserved by Move α
operations at s-structure.

When an NP-movement takes place, a trace of the movement is created by having an indexed empty
category (e) from the position at which the movement began to the corresponding indexed NP.

For defining constraints on movement, the theory identifies two kinds of positions in a sentence. Positions assigned θ-roles are called θ-positions, while the others are called θ̄-positions (theta-bar positions). In a similar way, core grammatical positions (where subject, object, indirect object, etc., are positioned) are called A-positions (argument positions), and the rest are called Ā-positions.


Binding theory:

Binding Theory is a syntactic theory that explains how pronouns and noun phrases are interpreted and
distributed in a sentence. It's concerned with the relationships between pronouns and their antecedents
(myself, herself, himself).

Binding is defined by Sells (1985) as follows:


α binds ß iff
α C-commands ß, and
α and ß are co-indexed
As we noticed in the sentences

[ei INFL kill Mukesh]
[Mukeshi was killed (by ei)]
Mukesh was killed.

the empty category (ei) and Mukesh (NPi) are bound. This theory gives a relationship between NPs (including pronouns and reflexive pronouns). Now, binding theory can be given as follows:
(a) An anaphor (+a) is bound in its governing category.
(b) A pronominal (+p) is free in its governing category.
(c) An R-expression (-a, -p) is free.
Example
A: Mukeshi knows himselfi
B: Mukeshi believes that Amrita knows himi
C: Mukesh believes that Amritaj knows Nupurk (Referring expression)

Similar rules apply on empty categories also:


NP-trace: +a, -p: Mukeshi was killed ei
wh-trace: -a, -p: Whoi does he like ei
Empty Category Principle (ECP):

The 'proper government' is defined as:


α properly governs ß iff:
α governs ß and α is lexical (i.e., N, V, A, or P), or
α locally A-binds ß
The ECP says 'A trace must be properly governed'.
This principle justifies the creation of empty categories during NP- trace and wh-trace and also explains
the subject/object asymmetries to some extent. As in the following sentences:


(a) Whati do you think that Mukesh ate ei?


(b) Whati do you think Mukesh ate ei?
Mukesh is the subject, ate is the verb, and what is the object that moves to the front; Mukesh remains in its original position.

Bounding and Control Theory:

Note: There are many other types of constraints on Move α and not possible to explain all of them.

In English, the long-distance movement of a complement clause can be explained by bounding theory if NP and S are taken to be bounding nodes. The theory says that the application of Move α may not cross more than one bounding node. The theory of control involves syntax, semantics, and pragmatics.

Case Theory and Case Filter:

In GB, case theory deals with the distribution of NPs and mentions that each NP must be assigned a case.
In English, we have the nominative, objective, genitive, etc., cases, which are assigned to NPs at particular
positions. Indian languages are rich in case-markers, which are carried even during movements.

Example:
He is running ("He" is the subject of the sentence, performing the action. - nominative)
She sees him. ("Him" is the object of the verb "sees." - Objective)
The man's book. (The genitive case expresses possession or a relationship between nouns,)

Case filter: An NP is ungrammatical if it has phonetic content or is an argument, and is not case-marked. Phonetic content here refers to some physical realization, as opposed to empty categories.

Thus, case filters restrict the movement of NP at a position which has no case assignment. It works in a
manner similar to that of the θ-criterion.

Summary of GB:

In short, GB presents a model of the language which has three levels of syntactic representation.

• It assumes phrase structures to be the maximal projection of some lexical head and in a similar
fashion, explains the structure of a sentence or a clause.
• It assigns various types of roles to these structures and allows them a broad kind of movement
called Move α.
• It then defines various types of constraints which restrict certain movements and justifies others.


2.2.4 Lexical Functional Grammar (LFG) Model


% Watch this video: https://www.youtube.com/watch?v=EoCLhS_0cmE %

• LFG represents sentences at two syntactic levels - constituent structure (c-structure) and
functional structure (f-structure).
• Kaplan proposed a concrete form for the register names and values which became the functional
structures in LFG.
• Bresnan was more concerned with the problem of explaining some linguistic issues, such as
active/passive and dative alternations, in transformational approach. She proposed that such
issues can be dealt with by using lexical redundancy rules.
• The unification of these two diverse approaches (with a common concern) led to the development
of the LFG theory.

The term 'lexical functional' is composed of two terms:

• The 'functional' part is derived from 'grammatical functions', such as subject and object, or roles
played by various arguments in a sentence.
• The 'lexical' part is derived from the fact that the lexical rules can be formulated to help define
the given structure of a sentence and some of the long-distance dependencies, which is difficult
in transformational grammars.

C-structure and f-structure in LFG


The c-structure is derived from the usual phrase and sentence structure syntax, as in CFG

The grammatical-functional role cannot be derived directly from phrase and sentence structure; functional specifications are annotated on the nodes of the c-structure, which, when applied to sentences, results in the f-structure.

Example: She saw stars in the sky

[
SUBJ: [ PERS: 3, NUM: SG ], // "She" is the subject, 3rd person, singular
PRED: "see", // The verb "saw" represents the predicate "see"
OBJ: [ NUM: PL, PRED: "star" ], // "stars" is the object, plural, and the predicate is "star"
LOC: [ PRED: "sky", DEF: + ] // "sky" is the location, with a definite determiner ("the")
]

f-structure

c-structure


Example:
She saw stars in the sky

CFG rules to handle this sentence are:
S → NP VP
VP → V {NP} PP* {NP} {S'}
PP → P NP
NP → Det N {PP}
S' → Comp S

Where: S: Sentence, V: Verb, P: Preposition, N: Noun, S': clause, Comp: complement, { }: optional, *: the phrase can appear any number of times, including not at all.

When annotated with functional specifications, the rules become (figure not reproduced):

• Here, ↑ (the up arrow) refers to the f-structure of the mother node, i.e., the node on the left-hand side of the rule.
• The ↓ (down arrow) symbol refers to the f-structure of the node under which it is written.
• Hence, in Rule 1, the annotation (↑ SUBJ) = ↓ on the first NP indicates that the f-structure of the first NP goes to the f-structure of the subject of the sentence, while the annotation ↑ = ↓ on the VP indicates that the f-structure of the VP node goes directly to the f-structure of the sentence.


Consistency: In a given f-structure, a particular attribute may have at most one value. Hence, while unifying two f-structures, if the attribute Num has the value SG in one and PL in the other, the unification will be rejected.

Completeness: An f-structure is complete if and only if it and all its subsidiary f-structures (the value of any attribute of an f-structure can again contain other f-structures) contain all the functions that their predicates govern. For example, since the predicate 'see <(↑ Subj) (↑ Obj)>' contains an object as a governable function, a sentence like 'She saw' will be incomplete.

Coherence: Coherence maps the completeness property in the reverse direction. It requires that all governable functions of an f-structure, and of all its subsidiary f-structures, must be governed by their respective predicates. Hence, in the f-structure of a sentence, an object cannot be taken if its verb does not allow that object. Thus, it will reject the sentence 'I laughed a book.'
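The three well-formedness conditions can be read as checks on a simple attribute-value unification. The sketch below is an added illustration using plain Python dictionaries (not LFG notation): unification rejects an attribute clash (consistency), and a separate check flags a missing governed function (completeness, as in 'She saw') or an extra one (coherence, as in 'I laughed a book') for a predicate such as 'see <(↑ Subj) (↑ Obj)>'.

# Toy f-structure unification with the three LFG well-formedness checks.
def unify(f1, f2):
    """Consistency: an attribute may have at most one value across unified f-structures."""
    result = dict(f1)
    for attr, value in f2.items():
        if attr in result and result[attr] != value:
            raise ValueError("inconsistent values for " + attr)
        result[attr] = value
    return result

def completeness_coherence(fstruct, governed):
    """Completeness: every governed function must be present.
    Coherence: no governable function may appear that the predicate does not govern."""
    present = {f for f in ("SUBJ", "OBJ", "OBJ2", "COMP") if f in fstruct}
    missing = set(governed) - present     # non-empty -> incomplete ('She saw')
    extra = present - set(governed)       # non-empty -> incoherent ('I laughed a book')
    return missing, extra

subj = {"SUBJ": {"PRED": "she", "NUM": "SG", "PERS": 3}}
verb = {"PRED": "see"}
fs = unify(subj, verb)                    # consistent: no attribute clash

print(completeness_coherence(fs, governed=("SUBJ", "OBJ")))
# -> ({'OBJ'}, set()): 'She saw' is incomplete because the governed OBJ is missing.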
Example:

Let us first see the lexical entries of the various words in the sentence:

She saw stars

(Lexical entries and c-structure: figures not reproduced here.)

Finally, the f-structure is the set of attribute-value pairs, represented as shown in the figure. It is interesting to note that the final f-structure is obtained through the unification of the various f-structures for subject, object, verb, complement, etc. This unification is based on the functional specifications of the verb, which predict the overall sentence structure.


Lexical Rules in LFG


Different theories have different kinds of lexical rules and constraints for handling various sentence-
constructs (active, passive, dative, causative, etc.).

In LFG, the verb is converted to the participial form, but the sub-categorization is changed directly.

Consider the following example (oblique agent, Oblag, phrase):

Active: Tara ate the food.
Passive: The food was eaten by Tara.

Active: Pred = 'eat <(↑ Subj) (↑ Obj)>'
Passive: Pred = 'eat <(↑ Oblag) (↑ Subj)>'

Here, Oblag represents the oblique agent phrase.
Similar rules can be applied to active and dative constructs for verbs that accept two objects (oblique goal, Oblgo, phrase):

Active: Tara gave a pen to Monika.
Dative: Tara gave Monika a pen.

Active: Pred = 'give <(↑ Subj) (↑ Obj) (↑ Oblgo)>'
Dative: Pred = 'give <(↑ Subj) (↑ Obj2) (↑ Obj)>'

Here, Oblgo stands for the oblique goal phrase.
Similar rules are also applicable to the process of causativization. This can be seen in Hindi, where the
verb form is changed as follows:

Example

Active: तारा हँसी
Taaraa hansii
Tara laughed

Causative: मोनिका ने तारा को हँसाया
Monika ne Tara ko hansaayaa
Monika-Subj Tara-Obj laugh-cause-past
Monika made Tara laugh.

Active: Pred = 'laugh <(↑ Subj)>'
Causative: Pred = 'cause <(↑ Subj) (↑ Obj) (↑ Comp)>'

Here, a new predicate is formed which causes the action and requires a new subject; the old subject becomes the object of the new predicate, and the old verb becomes the X-complement (a complement to infinitival VPs).


Long Distance Dependencies and Coordination


In GB, when a category is moved, it leaves behind an empty category (a trace).

In LFG, unbounded movement and coordination are handled through functional identity, i.e., by correlation with the corresponding f-structure.

Example: Consider the wh-movement in the following sentence.

Which picture does Tara like most?

The f-structure of this sentence can be represented as shown in the figure.

2.2.5 Paninian Framework


Paninian grammar (PG), written by Panini around 500 BC for Sanskrit (the original text being the Ashtadhyayi), provides a framework that can also be used for other Indian languages and possibly some other Asian languages.

Unlike English, which is SVO (Subject-Verb-Object) ordered, Indian languages are typically SOV (Subject-Object-Verb) ordered and inflectionally rich. The inflections provide important syntactic and semantic cues for language analysis and understanding. The Paninian framework takes advantage of these features.

Note: Inflectional – refers to the changes a word undergoes to express different grammatical categories
such as tense, number, gender, case, mood, and aspect without altering the core meaning of the word.

Indian languages have traditionally used oral communication for knowledge propagation. In Hindi, the positions of the subject and the object can be interchanged without changing the meaning. For example:

(a) माँ बच्चे को खाना देती है।
Maan bachche ko khanaa detii hai
Mother child-to food give-(s)
Mother gives food to the child.

(b) बच्चे को माँ खाना देती है।
Bachche ko Maan khanaa detii hai
Child-to mother food give-(s)
Mother gives food to the child.

The auxiliary verbs follow the main verb. In Hindi, they remain as separate words:

खा रहा है (khaa rahaa hai: eat-ing, 'eating')
करता रहा है (kartaa rahaa hai: doing been has, 'has been doing')


In Hindi, some main verbs, e.g., give (देना) and take (लेना), also combine with other main verbs to change the aspect and modality of the verb:

उसने खाना खाया।
Usne khanaa khaayaa
He (Subj) food ate
He ate food.

उसने खाना खा लिया।
Usne khaanaa khaa liyaa
He (Subj) food eat taken
He ate food (completed the action).

वह चला (vaha chalaa): He moved.
वह चल दिया (vaha chal diyaa, he move given): He moved (started the action).

The nouns are followed by post-positions instead of prepositions. They generally remain as separate words in Hindi:

रेखा के पिता
Rekha ke pita
Rekha of father
Father of Rekha

उसके पिता
Uske pita
Her (his) father
All nouns are categorized as feminine or masculine, and the verb form must show gender agreement with the subject:

ताला खो गया
Taalaa kho gayaa
lock lose (past, masculine)
The lock was lost.

चाभी खो गयी
Chaabhii kho gayii
key lose (past, feminine)
The key was lost.
Layered Representation in PG
The GB theory represents three syntactic levels: deep structure, surface structure, and logical form (LF),
where the LF is nearer to semantics. This theory tries to resolve all language issues at syntactic levels
only.

The Paninian grammar framework is said to be syntactico-semantic, that is, one can go from the surface layer to deep semantics by passing through intermediate layers. The four layers are the surface level, the vibhakti level, the karaka level, and the semantic level.

• The surface and the semantic levels are obvious. The other two levels should not be confused with the levels of GB.
• Vibhakti literally means inflection, but here it refers to word (noun, verb, or other) groups based either on case endings, or post-positions, or compound verbs, or main and auxiliary verbs, etc.


• Karaka (pronounced Kaaraka) literally means Case, and in GB, we have already discussed case
theory, θ-theory, and sub-categorization, etc. Paninian Grammar has its own way of defining
Karaka relations.

Karaka Theory

• Karaka theory is the central theme of PG framework.


• Karaka relations are assigned based on the roles played by various participants in the main
activity.
• Various Karakas, such as Karta (subject), Karma (object), Karana (instrument), Sampradana
(beneficiary), Apadan (separation), and Adhikaran (locus).

Example:

माँ बच्ची को आँगन में हाथ से रोटी खिलाती है।

Maan bachchi ko aangan mein haath se rotii khilaatii hai

Mother child-to courtyard-in hand-by bread feed(s).
The mother feeds bread to the child by hand in the courtyard.

• maan (mother) is the Karta. The Karta generally takes the 'ne' or zero (φ) case marker.
• rotii (bread) is the Karma. (Karma is similar to the object and is the locus of the result of the activity.)
• haath (hand) is the Karana (the noun group through which the goal is achieved). It takes the marker 'se' or 'dwara' (by).
• bachchi (child) is the Sampradan, the beneficiary of the activity. It takes the marker 'ko' (to) or 'ke liye' (for).
• Apaadaan denotes separation; its marker ('se') is attached to the participant that serves as the reference point (the stationary one).
• aangan (courtyard) is the Adhikaran, the locus (support in space or time) of the Karta or Karma.

Issues in Paninian Grammar


The two problems challenging linguists are:
(i) Computational implementation of PG, and
(ii) Adaptation of PG to Indian, and other similar languages.
However, many issues remain unresolved, especially in cases of shared Karaka relations. Another difficulty arises when the mapping between the Vibhakti (case markers and post-positions) and the semantic relation (with respect to the verb) is not one-to-one: two different Vibhaktis can represent the same relation, or the same Vibhakti can represent different relations in different contexts.


2.3 Statistical Language Model


A statistical language model is a probability distribution P(s) over all possible word sequences s; the same idea applies to other linguistic units such as sentences, paragraphs, documents, or spoken utterances.

2.3.1 n-gram Model (https://www.youtube.com/watch?v=Vc2C1NZkH0E)


Applications: suggestions in messaging apps, spelling correction, machine translation, handwriting recognition, etc.

An n-gram model is a statistical method that predicts the probability of the next word in a sequence based on the previous n-1 words.

Why n-gram?

The goal of a statistical language model is to estimate the probability (likelihood) of a sentence. This is achieved by decomposing the sentence probability into a product of conditional probabilities using the chain rule:

P(s) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 ... wn-1) = ∏i P(wi|hi)

where hi is the history of word wi, defined as w1 w2 ... wi-1.

So, in order to calculate the sentence probability, we need to calculate the probability of a word given the sequence of words preceding it. This is not a simple task.

An n-gram model simplifies the task by approximating the probability of a word given all the previous words by the conditional probability given the previous n-1 words only:

P(wi|hi) ≈ P(wi|wi-n+1 ... wi-1)

Thus, an n-gram model calculates P(w|h) by modelling language as a Markov model of order n-1, i.e., by looking at the previous n-1 words only.

A model that limits the history to the previous one word only is termed a bi-gram (n = 2) model; a model that conditions the probability of a word on the previous two words is called a tri-gram (n = 3) model.

Using the bi-gram and tri-gram estimates, the probability of a sentence can be calculated as:

P(s) ≈ ∏i P(wi|wi-1)          (bi-gram)
P(s) ≈ ∏i P(wi|wi-2 wi-1)     (tri-gram)

Example: The Arabian knights are fairy tales of the east

For the final word, the bi-gram approximation uses P(east|the), while the tri-gram approximation uses P(east|of the).


One pseudo-word <s> is introduced to mark the beginning of the sentence in bi-gram estimation.
Two pseudo-words <s1> and <s2> for tri-gram estimation.
How are these probabilities estimated?
1. Train the n-gram model on a training corpus.
2. Estimate the n-gram parameters using the maximum likelihood estimation (MLE) technique, i.e., using relative frequencies: count a particular n-gram in the training corpus and divide this count by the sum of the counts of all n-grams that share the same prefix.
3. The sum of the counts of all n-grams that share the first n-1 words equals the count of the common prefix wi-n+1 ... wi-1, so

P(wi|wi-n+1 ... wi-1) = C(wi-n+1 ... wi-1 wi) / C(wi-n+1 ... wi-1)

Example (tri-gram): the predicted word following the context 'The girl bought' is the word w with the highest count C(The girl bought w) in the training corpus.


Example
Training set:

The Arabian Knights


These are the fairy tales of the east
The stories of the Arabian knights are translated in many languages

Bi-gram model:
P(the/<s>) =0.67 P(Arabian/the) = 0.4 P(knights /Arabian) =1.0
P(are/these) = 1.0 P(the/are) = 0.5 P(fairy/the) =0.2
P(tales/fairy) =1.0 P(of/tales) =1.0 P(the/of) =1.0
P(east/the) = 0.2 P(stories/the) =0.2 P(of/stories) =1.0
P(are/knights) =1.0 P(translated/are) =0.5 P(in /translated) =1.0
P(many/in) =1.0
P(languages/many) =1.0

Test sentence(s): The Arabian knights are the fairy tales of the east.
P(The/<s>)×P(Arabian/the)×P(Knights/Arabian)x P(are/knights)
× P(the/are)×P(fairy/the)xP(tales/fairy)×P(of/tales)× P(the/of)
x P(east/the)
= 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
≈ 0.0054
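The estimation above can be reproduced in a few lines of Python. The sketch below is my own illustration (not from the textbook): it builds bi-gram counts from the toy training corpus using simple lowercased tokenization and a <s> sentence marker; note that values for bigrams involving 'Knights'/'knights' depend on how case is handled.

# Minimal bi-gram MLE sketch for the toy corpus above (illustrative only).
from collections import Counter

corpus = [
    "The Arabian Knights",
    "These are the fairy tales of the east",
    "The stories of the Arabian knights are translated in many languages",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # Relative-frequency (MLE) estimate of P(word | prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_probability(sentence):
    tokens = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)
    return prob

print(round(p("the", "<s>"), 2))      # 0.67
print(round(p("arabian", "the"), 2))  # 0.4
print(round(p("east", "the"), 2))     # 0.2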
Limitations:

• Multiplying the probabilities might cause a numerical underflow, particularly in long sentences.
To avoid this, calculations are made in log space, where a calculation corresponds to adding log
of individual probabilities and taking antilog of the sum.
• The n-gram model suffers from data sparseness: n-grams that do not occur in the training data are assigned zero probability, which leads to many zero entries in the bigram matrix. No matter how large the training corpus is, it covers only a small fraction of the word combinations that can occur in practice.
• It fails to capture long-distance dependencies in natural language sentences.

Solution:

• A number of smoothing techniques have been developed to handle the data sparseness problem.
• Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-
grams and assigning them non-zero values.


2.3.2 Add-one Smoothing

• Add-one (Laplace) smoothing adds one to each n-gram count before normalizing the counts into probabilities. The conditional probability thus becomes:

P(wi|wi-n+1 ... wi-1) = (C(wi-n+1 ... wi) + 1) / (C(wi-n+1 ... wi-1) + V)

where V is the vocabulary size.

• It is still not very effective, since it assigns the same probability to every unseen n-gram, even though some of them are intuitively far more plausible than others.

Example:

Consider the following toy corpus:

• "I love programming"

• "I love coding"

We want to calculate the probability of the bigram "I love" using Add-one smoothing.

Step 1: Count the occurrences

• Unigrams:

o "I" appears 2 times

o "love" appears 2 times

o "programming" appears 1 time

o "coding" appears 1 time

• Bigrams:

o "I love" appears 2 times

o "love programming" appears 1 time

o "love coding" appears 1 time

• Vocabulary size V: There are 4 unique words: "I", "love", "programming", "coding".

Step 2: Apply Add-one smoothing

For the bigram "I love" (C(I love) = 2, C(I) = 2, V = 4):

P(love|I) = (C(I love) + 1) / (C(I) + V) = (2 + 1) / (2 + 4) = 3/6 = 0.5

Step 3: For an unseen bigram

Suppose we want to calculate the probability of the bigram "I coding" (which does not appear in the training data):

P(coding|I) = (0 + 1) / (2 + 4) = 1/6 ≈ 0.17

The unseen bigram therefore still receives a small non-zero probability.
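These numbers can be checked by coding the formula directly. The snippet below is an illustrative sketch using the toy counts above (the variable names are my own):

# Add-one (Laplace) smoothed bi-gram probability for the toy corpus above.
bigram_counts = {("i", "love"): 2, ("love", "programming"): 1, ("love", "coding"): 1}
unigram_counts = {"i": 2, "love": 2, "programming": 1, "coding": 1}
V = len(unigram_counts)  # vocabulary size = 4

def p_add_one(word, prev):
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts[prev] + V)

print(p_add_one("love", "i"))              # 0.5  (seen bigram "I love")
print(round(p_add_one("coding", "i"), 2))  # 0.17 (unseen bigram "I coding")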

2.3.3 Good-Turing Smoothing

• Good-Turing smoothing improves probability estimates by adjusting the counts of unseen n-grams based on the frequency distribution of the observed n-grams.
• It adjusts the frequency f of an n-gram using the count of n-grams having frequency of occurrence f + 1, converting the frequency from f to a smoothed count f* using the expression:

f* = (f + 1) × n(f+1) / n(f)

where n(f) is the number of n-grams that occur exactly f times in the training corpus. As an example, suppose the number of n-grams that occur 4 times is 25,108 and the number of n-grams that occur 5 times is 20,542. Then the smoothed count for f = 4 will be:

4* = 5 × 20,542 / 25,108 ≈ 4.09

2.3.4 Caching Technique


The caching model is an enhancement to the basic n-gram model that addresses the issue of frequency
variation across different segments of text or documents. In traditional n-gram models, the probability of
an n-gram is calculated based solely on its occurrence in the entire corpus, which does not take into
account the local context or recent patterns. The caching model improves this by incorporating the
recently discovered n-grams into the probability calculations.
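One common way to realize this idea (a general formulation, not a prescription from the textbook) is to interpolate the static n-gram estimate with a probability estimated from a cache of the most recently seen words, e.g., P_cache(w|h) = λ P_ngram(w|h) + (1 - λ) C_recent(w)/K for a cache of the last K words. A minimal sketch:

# Illustrative cache interpolation; lambda_ and the cache size are tunable assumptions.
def cached_probability(word, prev, recent_words, p_ngram, lambda_=0.9):
    cache_prob = recent_words.count(word) / len(recent_words) if recent_words else 0.0
    return lambda_ * p_ngram(word, prev) + (1 - lambda_) * cache_prob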


Module – 2

Word Level Analysis & Syntactic Analysis


Word Level Analysis: Regular Expressions, Finite-State Automata, Morphological Parsing,
Spelling Error Detection and Correction, Words and Word Classes, Part-of Speech Tagging.

Syntactic Analysis: Context-Free Grammar, Constituency, Top-down and Bottom-up Parsing,


CYK Parsing.
Textbook 1: Ch. 3, Ch. 4.

Word Level Analysis


1. Introduction

Processing carried out at word level, including methods for characterizing word sequences,
identifying morphological variants, detecting and correcting misspelled words, and identifying
correct part-of-speech of a word.

1.1 The part-of-speech tagging methods:


1. Rule-based (linguistic).
2. Stochastic (data-driven).
3. Hybrid.
1.2 Regular expressions: standard notations for describing text patterns.
1.3 Implementation of regular expressions using finite-state automata (FSA): applications in speech recognition and synthesis, spell checking, and information extraction.
1.4 Detecting and correcting errors.

2. Regular Expressions (regexes)


• Pattern-matching standard for string parsing and replacement.
• Powerful way to find and replace strings that take a defined format.
• They are useful tools for the design of language compilers.
• Used in NLP for tokenization, describing lexicons, morphological analysis, etc.
• Perl was the first language that provided integrated support for regular expressions.
o It uses a slash “/” around each regular expression;
• A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings,
called the language of the expression. Example: /a/


Some simple regular expressions and example strings in which they find a match (the expression picks out the first matching instance):

Regular expression Example patterns


/book/ The world is a book, and those who do not travel read only one page.
/book/ Reporters, who do not read the stylebook, should not criticize their
editors.
/face/ Not everything that is faced can be changed. But nothing can be
changed until it is faced.
/a/ Reason, Observation, and Experience-the Holy Trinity of Science.

2.1 Character Classes

Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies
a range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/,
which matches any character except x. The caret is interpreted literally elsewhere.

Use of square brackets


RE Match Example patterns Matched
[abc] Match any of a, b, and c 'Refresher course will start
tomorrow'
[A-Z] Match any character between A and Z (ASCII order) the course will end on Jan. 10,
2006'
[^A-Z] Match any character other than an uppercase letter 'TREC Conference'

[^abc] Match anything other than a, b, and c 'TREC Conference'


[+ *?. ] Match any of +, *, ?, or the dot. '3 +2 = 5'
[a^] Match a or ^ ‘^ has different uses.’

• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start, and $ at the end of a line.
o /^The nature\.$/ will search exactly for this line.
• The dot (.) is a wildcard matching any single character (e.g., /./).
o Expression /.at/ matches with any of the string cat, bat, rat, gat, kat, mat, etc.
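These operators can be tried directly with Python's re module; the short sketch below is my own illustration mirroring the examples above.

# Demonstrating character classes, ?, +, the wildcard dot, and the ^ anchor.
import re

print(re.findall(r"[sS]ana", "sana Sana"))                 # ['sana', 'Sana']
print(re.findall(r"supernovas?", "supernova supernovas"))  # optional final 's'
print(re.findall(r"[0-9]+", "Jan. 10, 2006"))              # ['10', '2006']
print(re.findall(r".at", "cat bat rat"))                   # ['cat', 'bat', 'rat']
print(bool(re.search(r"^The", "The nature.")))             # True (anchored at line start)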


Special characters

RE Description
. The dot matches any single character.
\n Matches a new line character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character.
\W Matches a non-alphanumberic character.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, l. matches a dot, \* matches a *
and \ matches a backslash.

• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.

Example: Suppose we need to check if a string is an email address or not.

^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$

Pattern                        Description
^[A-Za-z0-9_\.-]+              Match one or more acceptable characters at the start of the string.
@                              Match the @ sign.
[A-Za-z0-9_\.-]+               Match any domain name, including a dot.
[A-Za-z0-9_][A-Za-z0-9_]$      Match two acceptable characters, but not a dot, at the end. This ensures that the email address ends with .xx, .xxx, .xxxx, etc.
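This pattern can be tested with Python's re module; the sketch below is an illustration (the sample addresses are made up).

# Checking strings against the email pattern above.
import re

email_re = re.compile(r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$")

print(bool(email_re.match("someone@example.com")))  # True
print(bool(email_re.match("someone@example.c")))    # False (only one character after the final dot)
print(bool(email_re.match("not-an-email")))         # False (no @ sign)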

• The language of regular expressions is similar to formulas of Boolean logic.


• Regular languages may be encoded as finite state networks.
• A regular expression can contain symbol pairs, e.g., /a:b/, which represents a relation between
two strings.
• Regular languages can be encoded using finite-state automata (FSA), making it easier to
manipulate and process regular languages, which in turn aids in handling more complex
languages like context-free languages.


3. Finite-State Automata
• Game Description: The game involves a board with pieces, dice or a wheel to generate random
numbers, and players rearranging pieces based on the number. There’s no skill or choice; the
game is entirely based on random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.

A finite automaton has the following properties:


1. A finite set of states, one of which is designated the initial or start state, and one or more of which are
designated as the final states.
2. A finite alphabet set, ∑, consisting of input symbols.
3. A finite set of transitions that specify for each state and each symbol of the input alphabet, the state to
which it next goes.
This finite-state automaton is shown as a directed graph, called transition diagram.

A deterministic finite -state automaton (DFA)

Let ∑ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4} with q0 being the start state and q4 the final state,
we have the following rules of transition:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.

A finite automaton can be deterministic or non-deterministic.


Deterministic Automata:

• The nodes in this diagram correspond to the states, and the arcs to transitions.


• The arcs are labelled with inputs.


• The final state is represented by a double circle.
• There is exactly one transition leading out of each state. Hence, this automaton is deterministic.
• Any regular expression can be represented by a finite automaton and the language of any finite
automaton can be described by a regular expression.
• A deterministic finite-state automaton (DFA) is defined as a 5-tuple (Q, Σ, δ, S, F), where Q is a set of states, Σ is an alphabet, S is the start state, F ⊆ Q is a set of final states, and δ is a transition function.
• The transition function δ defines a mapping from Q × Σ to Q. That is, for each state q and symbol a, there is at most one transition possible.

Non-Deterministic Automata:

• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in Figure, where there are two possible transitions from state q 0 on input symbol
a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε}) to a subset of the power set of Q.

Non-deterministic finite-state automaton (NFA)


How it Works for Regex – NLP?

• A path is a sequence of transitions beginning with the start state.


• A path leading to one of the final states is a successful path.
• The FSAs encode regular languages.
• The language that an FSA encodes is the set of strings that can be formed by concatenating the
symbols along each successful path.

Example:
1. Consider the deterministic automaton described in above example and the input, “ac”.
• We start with state q0 and input symbol a and will go to state
q1.
• The next input symbol is c, we go to state q3.
• No more input is left and we have not reached the final state.
• Hence, the string ac is not recognized by the automaton.


2. Now, consider the input “acb”


• we start with state q0 and go to state q1
• The next input symbol is c, so we go to state q3.
• The next input symbol is b, which leads to state q4.
• No more input is left and we have reached to final state.
• The string acb is a word of the language defined by the automaton.

State-transition table

• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates missing transition.
• This table contains all the information needed by FSA.

Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
(Figure: the deterministic finite-state automaton (DFA) and its state-transition table.)
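The acceptance test described above can be written directly from the state-transition table. The following sketch is my own illustration (not from the textbook); it encodes the DFA of the earlier example and replays the traces for 'ac' and 'acb'.

# Simulating the DFA via its transition table; a missing entry means the machine gets stuck.
dfa = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2", ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}
start, finals = "q0", {"q4"}

def accepts(string):
    state = start
    for symbol in string:
        state = dfa.get((state, symbol))
        if state is None:   # no transition defined: reject
            return False
    return state in finals

print(accepts("ac"))   # False: input exhausted in q3, which is not final
print(accepts("acb"))  # True:  q0 -a-> q1 -c-> q3 -b-> q4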
Example

• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression→ /(a|b)*baa$/.
• The NFA implementing this regular expression is shown & state-transition table for the NFA is
as shown below.
Input
State a b
Start: q0 {q0} {q0, q1}
q1 {q2} ɸ
q2 {q3} ɸ
Final: q3 ɸ ɸ

NFA for /(a|b)*baa$/ State transition table

An NFA can be converted to an equivalent DFA and vice versa.

DFA for /(a|b)*baa$/


4. Morphological Parsing
• It is a sub-discipline of linguistics
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
The word 'eggs' consists of two morphemes: egg and -s.
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
2. Suffix - morphemes applied at the end of a stem. (ghodha-on, gurramu-lu, bird-s, शीतल-ता)
3. Infixes - morphemes that appear inside a stem.
• English slang word "abso-bloody-lutely." The morpheme "-bloody-" is
inserted into the stem "absolutely" to emphasize the meaning.
4. Circumfixes - morphemes that may be applied to beginning & end of the stem.
• German word - gespielt (played) → ge+spiel+t
Spiel – play (stem)
4.2 Three main ways of word formation: Inflection, Derivation, and Compounding
Inflection: a root word combined with a grammatical morpheme to yield a word of the same class as the
original stem.
Ex. play (verb)+ ed (suffix) = Played (inflected form – past-tense)
Derivation: a root word combined with a grammatical morpheme to yield a word belonging to a different
class.

Ex. Compute (verb)+ion=Computation (noun).

Care (noun)+ ful (suffix)= careful (adjective).

Compounding: The process of merging two or more words to form a new word.

Ex. Personal computer, desktop, overlook.

Morphological analysis and generation deal with inflection, derivation and compounding process in
word formation and essential to many NLP applications:
1. Spelling corrections to machine translations.
2. In Information retrieval – to identify the presence of a query word in a document in spite of
different morphological variants.


4.3 Morphological parsing:


It converts inflected words into their canonical form (lemma) with syntactical and morphological tags
(e.g., tense, gender, number).
Morphological generation reverses this process, and both parsing and generation rely on a dictionary
of valid lemmas and inflection paradigms for correct word forms.
A morphological parser uses following information sources:
1. Lexicon: A lexicon lists stems and affixes together with basic information about them.
2. Morphotactics: the ordering among the morphemes that constitute a word; it describes the way morphemes are arranged or attach to each other. E.g., rest-less-ness is a valid word, but rest-ness-less is not.
3. Orthographic rules: Spelling rules that specify the changes that occur when two given
morphemes combine. Ex. 'easy' to 'easier' and not to 'easyer'. (y → ier spelling rule)

Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.

Ex. A sample lexicon entry:


Word form Category Root Gender Number Person
Ghodhaa Noun GhoDaa Masculine Singular 3rd

Ghodhii -do- -do- feminine -do- -do-

Ghodhon -do- -do- Masculine plural -do-

Ghodhe -do- -do- -do- -do- -do-

Limitations in Lexical entry:


1. It puts a heavy demand on memory.
2. Fails to show the relationship between different roots having similar word-forms.
3. Number of possible word-forms may be theoretically infinite (complex languages like Turkish).

4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Stemming algorithms work in two steps:
(i) Suffix removal: This step removes predefined endings from words.
(ii) Recoding: This step adds predefined endings to the output of the first step.
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).


o Lovins's stemmer performs Suffix removal & Recoding sequentially


e.g. earlier→ first removes ier and recodes as Early
o Porter's stemmer performs Suffix removal & Recoding simultaneously
e.g. ational → ate
To transform word such as 'rotational' into 'rotate'.
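The rewrite-rule idea above can be illustrated with a few lines of Python. The sketch below is a toy suffix stripper built only from the rules mentioned in this section; it is not the actual Lovins or Porter algorithm.

# Toy suffix-stripping stemmer using the rewrite rules quoted above.
RULES = [
    ("ational", "ate"),  # rotational -> rotate
    ("ier", "y"),        # earlier    -> early
    ("ing", ""),         # playing    -> play
    ("s", ""),           # books      -> book
]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem("rotational"))  # rotate
print(stem("earlier"))     # early
print(stem("playing"))     # play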
Limitations:
• It is difficult to use stemming with morphologically rich languages.
• E.g. Transformation of the word 'organization' into 'organ'
• It reduces only suffixes and prefixes; Compound words are not reduced. E.g. “toothbrush” or
“snowball” can’t be broken.
A more efficient two-level morphological model – Koskenniemi (1983)
• Morphological parsing is viewed as a mapping from the surface level into morpheme and feature
sequences on the lexical level.
• The surface level represents the actual spelling of the word.
• The lexical level represents the concatenation of its constituent morphemes.
e.g., 'playing' is represented at the two levels as:

Surface level → playing
Lexical level → 'play' followed by the morphological information +V +PP

Surface Level → p l a y i n g
Lexical Level → p l a y +V +PP

Similarly, 'books' is represented in the lexical form as 'book + N + PL'


This model is usually implemented with a kind of finite-state automata, called finite-state transducer
(FST).
Finite-state transducer (FST)
• FST maps an input word to its morphological components (root, affixes, etc.) and can also
generate the possible forms of a word based on its root and morphological rules.
• An FST can be thought of as a two-tape automaton, which recognizes or generates a pair of strings.
E.g. Walking
Analysis (Decomposition):
The analyzer uses a transducer that:
• Identifies the base form ("walk") from the surface form ("walking").
• Recognizes the suffix ("-ing") and removes it.
Generation (Synthesis):
The generator uses another transducer that:


• Identifies the base form ("walk") and applies the appropriate suffix to generate different surface
forms, like "walked" or "walking".
A finite-state transducer is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is a set of states, S is the initial state, F ⊆ Q is a set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to a subset of the power set of Q:

δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except in that transitions are made on strings rather than on symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.

Two-step morphological parser

Two-level morphology using FSTs involves analyzing surface forms in two steps.

Fig. Two-step morphological parser

Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").

Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.

We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its comparative form, where ɛ represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).

E.g. Lesser


FST-based morphological parser for singular and plural nouns in English

• The plural form of regular nouns usually ends with -s or -es (although a word ending in -s is not necessarily a plural form, e.g., class, miss, bus).
• One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g. Boxes, This deletion is usually required for words ending in xes, ses, zes.
• This is done by below transducer – Mapping English nouns to the intermediate form:

Bird+s

Box+e+s

Quiz+e+s

Mapping English nouns to the intermediate form

• The next step is to develop a transducer that does the mapping from the intermediate level to the
lexical level. The input to transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output
N and sg.
• In the second case, it has to map all symbols of the stem to themselves, but then output N and
replaces PL with s.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then it should add Nand PL.

Transducer for Step 2

The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.


This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input
tape and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and
singular nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to
bird + N + pl as follows.
b:b i:i r:r d:d + ε:N + s:pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the morphological feature +pl. The figure shows the resulting composed transducer.

A transducer mapping nouns to their stem and morphological features
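In the absence of an FST toolkit, the surface-to-lexical mapping can be approximated procedurally. The sketch below is a simplified stand-in of my own (not a real transducer); it combines a small exceptions lexicon with default suffix stripping, whereas the approach described above composes two transducers.

# Simplified noun analyzer: exceptions lexicon plus default -s/-es stripping.
IRREGULAR = {"geese": ("goose", "+N +PL"), "goose": ("goose", "+N +SG")}

def parse_noun(surface):
    if surface in IRREGULAR:
        stem, features = IRREGULAR[surface]
        return f"{stem} {features}"
    if surface.endswith(("xes", "ses", "zes")):  # boxes, buses: strip -es
        return f"{surface[:-2]} +N +PL"
    if surface.endswith("s"):                    # birds: strip -s
        return f"{surface[:-1]} +N +PL"
    return f"{surface} +N +SG"

print(parse_noun("birds"))  # bird +N +PL
print(parse_noun("boxes"))  # box +N +PL
print(parse_noun("geese"))  # goose +N +PL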

5. Spelling Error Detection and Correction

Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.

• Omission: When a single character is missed - 'concpt'


• Insertion error: Presence of an extra character in a word - 'error' misspelt as 'errorn'
• Substitution error: When a wrong letter is typed in place of the right one - 'errpr'
• Reversal: A situation in which the sequence of characters is reversed - 'aer' instead of 'are'.

OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space deletion, space insertion, and failure.
Substitution errors are caused by visual similarity of single characters, such as c→e, 1→l, r→n.
The same is true of multi-substitution (two or more characters), e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.

Phonetic errors:

• Speech recognition matches spoken utterances to a dictionary of phonemes.


• Spelling errors are often phonetic, with incorrect words sounding like correct ones.
• Phonetic errors are harder to correct due to more complex distortions.
• Phonetic variations are common in translation

Spelling errors: are classified as non-word or real-word errors.

• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information

Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:

1. Isolated-error detection and correction


2. Context-dependent error detection and correction

Isolated-error detection and correction: Each word is checked separately, independent of its context.

• It requires the existence of a lexicon containing all correct words.


• Would take a long time to compile and occupy a lot of space.
• It is impossible to list all the correct words of highly productive languages.

Context dependent error detection and correction methods: Utilize the context of a word. This requires
grammatical analysis and is thus more complex and language dependent. the list of candidate words must
first be obtained using an isolated-word method before making a selection depending on the context.

The spelling correction algorithm:

Broadly categorized by Kukich (1992) as follows:

Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.

Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.


n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.

Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.

Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of
a common spelling error pattern is used to transform misspelled words into valid words.

5.1 Minimum Edit Distance:

The minimum edit distance is the number of insertions, deletions, and substitutions required to change
one string into another.

For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.

Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.

Alignment 1:

t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.

A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).

The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation; therefore, the distance here is 2.

Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
which has a cost of 3.
Dynamic programming algorithms can be quite useful for finding minimum edit distance between two
sequences. (table-driven approach to solve problems by combining solutions to sub-problems).


The dynamic programming algorithm for minimum edit distance is implemented by creating an edit
distance matrix.

• This matrix has one row for each symbol in the source string and one column for each symbol in the target string.
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j characters of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the beginning of the matrix, it is possible to fill in each entry iteratively.
• The value of each cell is computed in terms of three possible paths: from the cell above (insertion), from the cell diagonally above-left (substitution), and from the cell to the left (deletion).
• The substitution cost is 0 if the ith character of the source matches the jth character of the target.
• The minimum edit distance algorithm is shown below.

Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i, 0] ← i
for j = 0 to n do
    dist[0, j] ← j
for i = 1 to m do
    for j = 1 to n do
        dist[i, j] ← min { dist[i-1, j] + insert_cost,
                           dist[i-1, j-1] + subst_cost(Xi, Yj),
                           dist[i, j-1] + delete_cost }

• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2

Minimum edit distance algorithms are also useful for determining accuracy in speech recognition
systems.
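For reference, the algorithm can be written directly in Python; the sketch below uses unit costs for all three operations (Levenshtein distance) and reproduces the tutor/tumour value from the table.

# Dynamic-programming minimum edit distance with unit costs.
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # insertion
                             dist[i - 1][j - 1] + subst,  # substitution
                             dist[i][j - 1] + 1)          # deletion
    return dist[m][n]

print(min_edit_distance("tutor", "tumour"))  # 2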


6.Words & Word Classes

• Words are classified into categories called part-of-speech.


• These are sometimes called word classes or lexical categories.
• These lexical categories are usually defined by their syntactic and morphological behaviours.
• The most common lexical categories are nouns and verbs. Other lexical categories include
adjectives, adverbs, prepositions, and conjunctions.

NN noun Student, chair, proof, mechanism


VB verb Study, increase, produce
JJ adjective Large, high, tall, few
RB adverb Carefully, slowly, uniformly
IN preposition in, on, to, of
PRP pronoun I, me, they
DET determiner the, a, an, this, those

Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.

Word classes are further categorized as open and closed word classes.

• Open word classes constantly acquire new members while closed word classes do not (or only infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, and adverbs are open word classes.
  e.g., computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily
• Prepositions, auxiliary verbs, determiners, pronouns, conjunctions, and interjections are closed word classes.
  e.g., in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because, oh, wow, ouch

7. Part-of-Speech Tagging

• The process of assigning a part-of-speech (such as a noun, verb, pronoun, preposition, adverb,
and adjective), to each word in a sentence.
• Input to a tagging algorithm: Sequence of words of a natural language sentence and specified tag
sets.
• Output: single best part-of-speech tag for each word.
• Many words may belong to more than one lexical category:
o I am reading a good book → book: Noun


o The police booked the snatcher → book: verb


o 'sona' may mean 'gold' (noun) or 'sleep' (verb)

Determine the correct lexical category of a word in its context

Tag set:

• The collection of tags used by a particular tagger is called a tag set.


• Most part-of-speech tag sets make use of the same basic categories, i.e., noun, verb, adjective,
and prepositions.
• Most tag sets capture morpho-syntactic information such as singular/plural, number, gender,
tense, etc.
• Tag sets differ in how they define categories and how finely they divide words into categories.

Consider,

Zuha eats an apple daily.


Aman ate an apple yesterday.
They have eaten all the apples in the basket.
I like to eat guavas.
The word eat has a distinct grammatical form in each of these four sentences.
Eat is the base form, ate its past tense, and the form eats requires a third person singular subject.
Similarly, eaten is the past participle form and cannot occur in another grammatical context.
Number of tags:
• The number of tags used by different taggers varies substantially (20 tags and over 400 tags).
• Penn Treebank tag set contains 45 tags & C7 uses 164
• TOSCA-ICE for the International Corpus of English with 270 tags (Garside 1997).
• TESS with 200 tags.
• For English, which is not morphologically rich, a tag set as large as C7 is too big and may yield too many mistagged words.

Tags from the Penn Treebank tag set:
VB   Verb, base form (subsumes imperatives, infinitives, and subjunctives)
VBD  Verb, past tense (includes the conditional form of the verb to be)
VBG  Verb, gerund or present participle
VBN  Verb, past participle
VBP  Verb, non-3rd person singular present
VBZ  Verb, 3rd person singular present

Possible tags for the forms of eat:
eat    VB
ate    VBD
eaten  VBN
eats   VBZ


Example of a tagged sentence:


Speech/NN sounds/NNS were/VBD sampled/VBN by/IN a/DT microphone/NN.
Another tagging possible
Speech/NN sounds/VBZ were/VBD sampled/VBN by/IN a/DT microphone/NN
It leads to semantic incoherence. We resolve the ambiguity using the context of the word. The context is
also utilized by automatic taggers.

Part-of-speech tagging methods:


1. Rule-based (linguistic)
2. Stochastic (data-driven)
3. Hybrid

Rule-based taggers use hand-coded rules to assign tags to words. These rules use a lexicon to obtain a
list of candidate tags and then use rules to discard incorrect tags.

Stochastic taggers have data-driven approaches in which frequency-based information is automatically


derived from corpus and used to tag words. Probability that a word occurs with a particular tag. E.g.
Hidden Markov model (HMM).

Hybrid taggers combine features of both these approaches. Like rule- based systems, they use rules to
specify tags. Like stochastic systems, they use machine-learning to induce rules from a tagged training
corpus automatically. E.g. Brill tagger.

7.1 Rule-based Tagger

• A two-stage architecture.
• The first stage: A dictionary look-up procedure, which returns a set of potential tags (parts-of-
speech) and appropriate syntactic features for each word.
• The second stage: A set of hand-coded rules to discard contextually illegitimate tags to get a
single part-of-speech for each word.

E.g Consider the noun-verb ambiguity in the following sentence:


“The show must go on”
Show → ambiguity {VB, NN}
Following are the rules to resolve this ambiguity:
IF preceding word is determiner THEN eliminate VB tag.
In addition to contextual information, many taggers use morphological information to help in the
disambiguation process:

IF word ends in -ing and preceding word is a verb THEN label it a verb (VB).


Rule-based taggers use capitalization to identify unknown nouns and typically require supervised training.
Rules can be induced by running untagged text through a tagger, manually correcting it, and feeding it
back for learning.

TAGGIT (1971) tagged 77% of the Brown corpus using 3,300 rules. ENGTWOL (1995) is another rule-
based tagger known for speed and determinism.

Advantages & disadvantages:

While rule-based systems are fast and deterministic, they require significant effort to write rules and need
a complete rewrite for other languages. Stochastic taggers are more flexible, adapting to new languages
with minimal changes and retraining. Thus, rule-based systems are precise but labor-intensive, while
stochastic systems are more adaptable but probabilistic.

7.2 Stochastic Tagger

• The standard stochastic tagger algorithm is the HMM tagger.


• Applies the simplifying assumption that the probability of a chain of symbols can be
approximated in terms of its parts or n-grams.

The unigram model requires a tagged training corpus to gather statistics for tagging data. It assigns tags
based solely on the word itself. For example, the tag JJ (Adjective) is frequently assigned to "fast"
because it is more commonly used as an adjective than as a noun, verb, or adverb. However, this can lead
to incorrect tagging, as seen in the following examples:

1. She had a fast — Here, "fast" is a noun.

2. Muslims fast during Ramadan — Here, "fast" is a verb.

3. Those who were injured in the accident need to be helped fast — Here, "fast" is an adverb.

In these cases, a more accurate prediction could be made by considering additional context. A bi-gram
tagger improves accuracy by incorporating both the current word and the tag of the previous word. For
instance, in sentence (1), the sequence "DT NN" (determiner, noun) is more likely than "DT JJ"
(determiner, adjective), so the bi-gram tagger would correctly tag "fast" as a noun. Similarly, in sentence
(3), a verb is more likely to be followed by an adverb, so the bi-gram tagger assigns "fast" the tag RB
(adverb).

In general, n-gram models consider both the current word and the tags of the previous n-1 words. A tri-
gram model, for example, uses the previous two tags, providing even richer context for more accurate
tagging. The context considered by a tri-gram model is shown in Figure, where the shaded area represents
the contextual window.


How the HMM tagger assigns the most likely tag sequence to a given sentence:

We refer to this model as a Hidden Markov Model (HMM) because it has two layers of states:

• A visible layer corresponding to the input words.


• A hidden layer corresponding to the tags.

While tagging input data, we can observe the words, but the tags (states) are hidden. The states are visible
during training but not during the tagging process.

As mentioned earlier, the HMM uses lexical and bi-gram probabilities estimated from a tagged training
corpus to compute the most likely tag sequence for a given sentence. One way to store these probabilities
is by constructing a probability matrix. This matrix includes:

• The probability that a specific word belongs to a particular word class.

• The n-gram analysis (for example, in a bi-gram model, the probability that a word of class X
follows a word of class Y).

During tagging, this matrix is used to guide the HMM tagger in predicting the tags for an unknown
sentence. The goal is to determine the most probable tag sequence for a given sequence of words.

Let W be the sequence of words.

W=W1, W2, ... ,Wn


The task is to find the tag sequence

T= t1, t2, ... , tn


which maximizes P(T|W), i.e.,

T'= argmaxT P(T|W)


Applying Bayes Rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T)/P(W)
Since P(W) remains the same for every candidate tag sequence, it can be dropped. The expression for the most likely tag sequence becomes:

T'= argmaxT P(W|T) * P(T)


A tag sequence can be estimated as the product of the probability of its constituent n-grams, i.e.,

P(T)=P(t1) * P(t2|t1) * P(t3|t1t2) ...* P(tn|t1 ... tn-1)


P(W/T) is the probability of seeing a word sequence, given a tag sequence.
For example, it is asking the probability of seeing 'The egg is rotten' given 'DT NNP VB JJ'. We make
the following two assumptions:

• The words are independent of each other.
• The probability of a word depends only on its own tag.

Under these assumptions, P(W|T) = P(w1|t1) × P(w2|t2) × ... × P(wn|tn).

Example Consider the sentence “The bird can fly”.


and the tag sequence DT NNP MD VB
Using the bi-gram approximation, the probability P(T) × P(W|T) can be computed as:

P(DT) × P(NNP|DT) × P(MD|NNP) × P(VB|MD) × P(the|DT) × P(bird|NNP) × P(can|MD) × P(fly|VB)
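To make this concrete, the sketch below scores this single tag sequence under a bi-gram HMM; the transition and emission probabilities are made-up illustrative numbers, and a real tagger would use the Viterbi algorithm to search over all candidate tag sequences.

# Scoring one tag sequence for "the bird can fly" (probabilities are invented for illustration).
transition = {("<s>", "DT"): 0.5, ("DT", "NNP"): 0.4, ("NNP", "MD"): 0.3, ("MD", "VB"): 0.6}
emission = {("the", "DT"): 0.6, ("bird", "NNP"): 0.2, ("can", "MD"): 0.5, ("fly", "VB"): 0.3}

def score(words, tags):
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= transition.get((prev, tag), 0.0) * emission.get((word, tag), 0.0)
        prev = tag
    return prob

print(score(["the", "bird", "can", "fly"], ["DT", "NNP", "MD", "VB"]))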


7.3 Hybrid Taggers


Hybrid approaches to tagging combine the strengths of both rule-based and stochastic methods.

• These approaches use rules to assign tags to words.


• While leveraging machine learning techniques to automatically generate rules from data.

An example is Transformation-Based Learning (TBL), or Brill tagging, introduced by E. Brill in 1995.


TBL has been applied to tasks like part-of-speech tagging and syntactic parsing.

Figure illustrates the TBL process, which is a supervised learning technique. The algorithm starts by
assigning the most likely tag to each word using a lexicon. Transformation rules are then applied
iteratively, with the rule that improves tagging accuracy most being selected each time. The process
continues until no significant improvements are made.

The output is a ranked list of transformations, which are applied to new text by first assigning the most
frequent tag and then applying the transformations.

TBL tagging algorithm


INPUT: Tagged corpus and lexicon (with most frequent information)
Step 1: Label every word with most likely tag (from dictionary)
Step 2: Check every possible transformation and select one which most improves tagging
Step 3: Re-tag corpus applying the rules
Repeat 2-3: Until some stopping criterion is reached
RESULT Ranked sequence of transformation rules

Example: Assume that in a corpus, fish is most likely to be a noun.


P(NN/fish) = 0.91
P(VB/fish) = 0.09
Now consider the following two sentences and their initial tags.
I/PRP like/VB to/TO eat/VB fish/NN.
I/PRP like/VB to/TO fish/NN.


As the most likely tag for fish is NN, the tagger assigns this tag to the word in both sentences. In the second sentence, this is a mistake. After the initial tagging, when the transformation rules are applied, the tagger learns a rule that applies exactly to this mis-tagging of fish:

Change NN to VB if the previous tag is TO.

As the contextual condition is satisfied, this rule will change fish/NN to fish/VB:

like/VB to/TO fish/NN → like/VB to/TO fish/VB
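The effect of such a transformation can be shown with a small sketch of my own; in real TBL, the rule itself is selected automatically because it most improves tagging accuracy on the training corpus.

# Applying a learned rule: change tag `old` to `new` when the previous tag is `prev_tag`.
def apply_rule(tagged, old, new, prev_tag):
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == old and result[i - 1][1] == prev_tag:
            result[i] = (word, new)
    return result

initial = [("I", "PRP"), ("like", "VB"), ("to", "TO"), ("fish", "NN")]
print(apply_rule(initial, old="NN", new="VB", prev_tag="TO"))
# [('I', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('fish', 'VB')]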

Scope; Advantages and disadvantages:


The algorithm can be made more efficient by indexing words in a training corpus using potential
transformations. Recent work has applied finite state transducers to compile pattern-action rules,
combining them into a single transducer for faster rule application, as demonstrated by Roche and Schabes
(1997) on Brill’s tagger.

Most part-of-speech tagging research focuses on English and European languages, but the lack of
annotated corpora limits progress for other languages, including Indian languages. Some systems, like
Bengali (Sandipan et al., 2004) and Hindi (Smriti et al., 2006), combine morphological analysis with
tagged corpora.

Tagging Urdu is more complex due to its right-to-left script and grammar influenced by Arabic and
Persian. Before Hardie (2003), little work was done on Urdu tag sets, with his research part of the
EMILLE project for South Asian languages.

7.4 Unknown words:

Unknown words, which do not appear in a dictionary or training corpus, pose challenges during tagging.
Solutions include:

• Assigning the most frequent tag from the training corpus or initializing unknown words with an
open class tag and disambiguating them using tag probabilities.
• Another approach involves using morphological information, such as affixes, to predict the tag
based on common suffixes or prefixes in the training data, similar to Brill's tagger.


Syntactic Analysis

1. Introduction:
• Syntactic parsing deals with the syntactic structure of a sentence.
• 'Syntax' refers to the grammatical arrangement of words in a sentence and their relationship with
each other.
• The objective of syntactic analysis is to find the syntactic structure of the sentence.
• This structure is usually depicted as a tree, as shown in Figure.
o Nodes in the tree represent the phrases and leaves correspond to the words.
o The root of the tree is the whole sentence.
• Identifying the syntactic structure is useful in determining the
meaning of the sentence.
• Syntactic parsing can be considered as the process of assigning
'phrase markers' to a sentence.
• Two important ideas in natural language are those of constituency
and word order.
o Constituency is about how words are grouped together.
o Word order is about how, within a constituent, words are
ordered and also how constituents are ordered with respect
to one another.
• A widely used mathematical system for modelling constituent structure in natural language is
context-free grammar (CFG) also known as phrase structure grammar.

2. Context-free Grammar:
• Context-free grammar (CFG) was first defined for natural language by Chomsky (1957).
• Consists of four components:
1. A set of non-terminal symbols, N
2. A set of terminal symbols, T
3. A designated start symbol, S, that is one of the symbols from N.
4. A set of productions, P, of the form: A→α
o Where A ∈ N and α is a string consisting of terminal and non-terminal symbols.
o The rule A → α says that constituent A can be rewritten as α. This is also called the
phrase structure rule. It specifies which elements (or constituents) can occur in a phrase
and in what order.
o For example, the rule S → NP VP states that S consists of NP followed by VP, i.e., a
sentence consists of a noun phrase followed by a verb phrase.


CFG as a generator:

• A CFG can be used to generate a sentence or to assign a structure to a given sentence.


• When used as a generator, the arrows in the production rule may be read as 'rewrite the symbol
on the left with symbols on the right'.
• Consider the toy grammar shown in the figure.
• The symbol S can be rewritten as NP VP using Rule 1, then using rules R2 and R4, NP and VP
are rewritten as N and V NP respectively. NP is then rewritten as Det N (R3). Finally, using rules
R6 and R7, we get the sentence:

Hena reads a book.

• The above sentence can thus be derived from S. The representation of this derivation is shown in the figure.
• Sometimes, a more compact bracketed notation is used to represent a parse tree. The parse tree in the
figure can be written in this notation as follows:
[S [NP [N Hena]] [VP [V reads] [NP [Det a] [N book]]]]
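As a small illustration, the toy grammar can be encoded as a Python dictionary and used as a generator. The rule set below is reconstructed from the derivation described above (an assumption, since the figure itself is not reproduced here), and random choice stands in for a systematic enumeration of derivations.

# Use a CFG as a generator: expand the start symbol left-to-right
# until only terminal words remain.
import random

RULES = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["Det", "N"]],
    "VP": [["V", "NP"]],
    "N": [["Hena"], ["book"]],
    "V": [["reads"]],
    "Det": [["a"]],
}

def generate(symbol="S"):
    if symbol not in RULES:                 # terminal: emit the word itself
        return [symbol]
    rhs = random.choice(RULES[symbol])      # pick one production, e.g. S -> NP VP
    return [word for sym in rhs for word in generate(sym)]

print(" ".join(generate()))                 # e.g. "Hena reads a book"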

3. Constituency:
• Words in a sentence are not tied together merely as a sequence of parts of speech.
• Language puts constraints on word order.
• Words group together to form constituents (often termed phrases), each of which acts as a single
unit. They combine with other constituents to form larger constituents, and eventually, a sentence.
• Constituents combine with others to form a sentence constituent.
• For example: the noun phrase, The bird, can combine with the verb phrase, flies, to form
the sentence, The bird flies.
• Different types of phrases have different internal structures.


3.1 Phrase Level Constructions

Noun Phrase, Verb phrase, Prepositional Phrase, Adjective Phrase, Adverb Phrase
Noun Phrase:

• A noun phrase is a phrase whose head is a noun or a pronoun, optionally accompanied by a set
of modifiers. It can function as subject, object, or complement.
• The modifiers of a noun phrase can be determiners or adjective phrases.
• Phrase structure rules are of the form: A→B C
NP → Pronoun
NP → Det Noun
NP → Noun
NP → Adj Noun
NP → Det Adj Noun
• We can combine all these rules in a single phrase structure rule as follows:
NP → (Det) (Adj) Noun|Pronoun

• A noun phrase may include post-modifiers and more than one adjective.
NP → (Det) (AP) Noun (PP)
Few examples of noun phrases:
They                          Pronoun
The foggy morning             Det Adj Noun
Chilled water                 Adj Noun
A beautiful lake in Kashmir   Det Adj Noun PP
Cold banana shake             Adjective followed by a sequence of nouns
• Adjective followed by a sequence of nouns → a sequence of nouns is termed a nominal. We modify
our rules to cover this situation:
NP → (Det) (AP) Nom (PP)
Nom → Noun | Noun Nom
• A noun phrase can act as a subject, an object, or a predicate.

Example:
The foggy damp weather disturbed the match. → noun phrase acts as a subject
I would like a nice cold banana shake. → noun phrase acts as an object
Kula botanical garden is a beautiful location. → noun phrase acts as predicate
Verb Phrase:

• Headed by a verb
• The verb phrase organizes various elements of the sentence that depend syntactically on the verb.


Examples of verb phrases:


Khushbu slept. VP → Verb
The boy kicked the ball VP →Verb NP
Khushbu slept in the garden. VP → Verb PP
The boy gave the girl a book. VP → Verb NP NP
The boy gave the girl a book with a blue cover. VP → Verb NP NP PP
In general, the number of NPs in a VP is limited to two, whereas it is possible to add more than two
PPs. VP → Verb (NP) (NP) (PP)*

Things are further complicated by the fact that objects may also be entire clauses, as in the sentence, I
know that Taj is one of the seven wonders. Hence, we must also allow for an alternative phrase structure
rule, in which NP is replaced by S.

VP → Verb S
Prepositional Phrase:
Prepositional phrases are headed by a preposition. They consist of a preposition, possibly followed by
some other constituent, usually a noun phrase.

We played volleyball on the beach.

We can have a preposition phrase that consists of just a preposition.

John went outside.

The phrase structure rule that captures the above eventualities is as follows.

PP → Prep (NP)

Adjective Phrase:
The head of an adjective phrase (AP) is an adjective. APs consist of an adjective, which may be preceded
by an adverb and followed by a PP.

Here are few examples.


Ashish is clever.
The train is very late.
My sister is fond of animals.
The phrase structure rule for adjective phrase is
AP → (Adv) Adj (PP)
Adverb Phrase:
An adverb phrase consists of an adverb, possibly preceded by a degree adverb. Here is an example.
Time passes very quickly. AdvP → (Intens) Adv


3.2 Sentence Level Constructions

A sentence can have varying structure.

The four commonly known structures are declarative structure, imperative structure, yes-no question
structure, and wh-question structure.

1. Declarative structure: Makes a statement or expresses an idea.

Example: I like horse riding

Structure: will have a subject followed by a predicate.

The subject is noun phrase and the predicate is a verb phrase.

Grammar rule: S → NP VP

2. Imperative structure: Gives a command, request, or suggestion.

Example: Please pass the salt, Look at the door, Show me the latest design.
Structure: usually begin with a verb phrase and lack subject.

Grammar rule: S → VP

3. Yes-no question structure: Asks a question that expects a yes or no answer.

Example: Do you have a red pen?

Did you finish your homework?

Is the game over?

Structure: usually begin with an auxiliary verb, followed by a subject NP, followed by a VP.

Grammar rule: S → Aux NP VP


4. Wh-question structure: Asks for specific information using words like who, what, or where.

Example: Where are you going?

Which team won the match?


Structure: May have a wh-phrase as a subject or may include another subject.
Grammar rule: S → Wh-NP VP
Another type of wh-question structure involves more than one NP; in this case, the auxiliary verb comes
before the subject NP, just as in yes-no question structures.

Example: Which cameras can you show me in your shop?


Grammar rule: S → Wh-NP Aux NP VP


Table. Summary of grammar rules

S→ NP VP
S→ VP
S→ Aux NP VP
S→ Wh-NP VP
S→ Wh-NP Aux NP VP
NP → (Det) (AP) Nom (PP)
VP → Verb (NP) (NP) (PP)*
VP → Verb S
AP → (Adv) Adj (PP)
PP → Prep (NP)
Nom → Noun | Noun Nom
Note:

• Grammar rules are not exhaustive.


• There are other sentence-level structures that cannot be modelled by the rules.
• Coordination, Agreement and Feature structures

Coordination:
Refers to conjoining phrases with conjunctions like 'and', 'or', and 'but'.
For example,
A coordinate noun phrase can consist of two other noun phrases separated by a conjunction.
I ate [NP [NP an apple] and [NP a banana]].
Similarly, verb phrases and prepositional phrases can be conjoined as follows:
It is [VP [VP dazzling] and [VP raining]].
Not only that, even a sentence can be conjoined.
[S [S I am reading the book] and [S I am also watching the movie]]

Conjunction rules for NP, VP, and S can be built as follows:


NP → NP and NP
VP → VP and VP
S → S and S
Agreement:
Most verbs use two different forms in present tense-one for third person, singular subjects, and the other
for all other kinds of subjects. Subject and verb must agree.

Examples: Demonstrate how the subject NP affects the form of the verb.
Does [NP Priya] sing?
Do [Np they] eat?


The -es form of 'do', i.e. 'does' is used. The second sentence has a plural NP subject. Hence, the
form 'do' is being used. Sentences in which subject and verb do not agree are ungrammatical.

The following sentences are ungrammatical:


[Does] they eat?
[Do] she sings?
Rules that handle the yes-no questions: S → Aux NP VP
To take care of the subject-verb agreement, we replace this rule with a pair of rules as follows:
S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP
We could add rules for the lexicon like these:
3sg Aux → does| has| can
Non3sg Aux → do | have | can
Similarly, rules for 3sgNP and Non3sgNP need to be added. So we replace each of the phrase structure
rules for noun phrase by a pair of rules as follows:
3sgNP → (Det) (AP) SgNom (PP)
Non3sgNP → (Det) (AP) PlNom (PP)
SgNom → SgNoun | SgNoun SgNom
PlNom → PlNoun | PlNoun PlNom
SgNoun → Priya | lake | banana | sister | ...
PlNoun → children | ...
Note: This results in an explosion in the number of grammar rules and a loss of generality.
Solution: Feature structures
Feature Structures
Feature structures are able to capture grammatical properties without increasing the size of the grammar.

Feature structures are sets of feature-value pairs.

Features are simply symbols representing properties that we wish to capture.

For example, the number property of a noun phrase can be represented by NUMBER feature. The value
that a NUMBER feature can take is SG (for singular) and PL (for plural).

Feature structures are represented by a matrix-like diagram called attribute value matrix (AVM).


The feature structure can be used to encode the grammatical category of a constituent and the features
associated with it. For example, the following structure represents a third person singular noun phrase:

[CAT NP, NUMBER SG, PERSON 3]

Similarly, a third person plural noun phrase can be represented as follows:

[CAT NP, NUMBER PL, PERSON 3]

The CAT and PERSON feature values remain the same in both structures, illustrating how feature
structures support generalization while maintaining necessary distinctions. Feature values can also be
other feature structures, not just atomic symbols. For instance, combining NUMBER and PERSON into
a single AGREEMENT feature makes sense, as subjects must agree with predicates in both properties.
This allows a more streamlined representation.

4. Parsing
• A phrase structure tree constructed from a sentence is called a parse.
• The syntactic parser is thus responsible for recognizing a sentence and assigning a syntactic
structure to it.
• The task that uses the rewrite rules of a grammar to either generate a particular sequence of words
or reconstruct its derivation (or phrase structure tree) is termed parsing.
• It is possible for many different phrase structure trees to derive the same sequence of words.
• Sentence can have multiple parses → This phenomenon is called syntactic ambiguity.
• Processes input data (usually in the form of text) and converts it into a format that can be
easily understood and manipulated by a computer.
o Input: The first constraint comes from the words in the input sentence. A valid parse is
one that covers all the words in a sentence. Hence, these words must constitute the leaves
of the final parse tree.
o Grammar: The second kind of constraint comes from the grammar. The root of the final
parse tree must be the start symbol of the grammar.

Two most widely used search strategies by parsers,

1. Top-down or goal-directed search.


2. Bottom-up or data-directed search.


4.1 Top-down Parsing


• Starts its search from the root node S and works downwards towards the leaves.
• Find all sub-trees which can start with S: Expand the root node using all the grammar rules with
S on their left-hand side.
• Likewise, each non-terminal symbol in the resulting sub-trees is expanded next using the
grammar rules having a matching non-terminal symbol on their left-hand side.
• The right-hand side of the grammar rules provide the nodes to be generated, which are then
expanded recursively.
• The tree grows downward and eventually reaches a state where the bottom of the tree consists
only of part-of-speech categories.
• A successful parse corresponds to a tree which matches exactly with the words in the input
sentence.

Example: Consider the grammar shown in Table and the sentence “Paint the door”.

S → NP VP VP → Verb NP
S→ VP VP → Verb
NP → Det Nominal PP → Preposition NP
NP → Noun Det → this | that | a | the
NP → Det Noun PP Verb → sleeps | sings | open | saw | paint
Nominal → Noun Preposition → from | with | on | to
Nominal → Noun Nominal Pronoun → she | he | they

1. The first level (ply) of the search tree consists of a single node labelled S.
2. The grammar in the table has two rules with S on their left-hand side: S → NP VP and S → VP.
3. These rules are used to expand the tree, giving us two partial trees at the second level of the search.
4. The third level is generated by expanding the non-terminals at the bottom of the search tree in the
previous level.


4.2 Bottom-Up Parsing

A bottom-up parser starts with the words in the input sentence and attempts to construct a parse tree
in an upward direction towards the root.

• Start with the input words – Begin with the words in the sentence as the leaves of the parse
tree.
• Look for matching grammar rules – Search for rules where the right-hand side matches parts
of the input.
• Apply reduction using the left-hand side – Replace matched portions with non-terminal
symbols from the left-hand side of the rule.
• Construct the parse tree upwards – Build the parse tree by moving upward toward the root.
• Repeat until the start symbol is reached – Continue reducing until the entire sentence is
reduced to the start symbol.
• Successful parse – The parsing is successful if the input is fully reduced to the start
symbol, completing the parse tree.


Advantages & disadvantages:

• Top-Down Parsing: Starts from the start symbol and generates trees, avoiding paths that lead to
a different root, but it may waste time exploring inconsistent trees before seeing the input.
• Bottom-Up Parsing: Starts with the input and ensures only matching trees are explored, but may
waste time generating trees that won't lead to a valid parse tree (e.g., incorrect assumptions about
word types).
• Top-Down Drawback: It can explore incorrect trees that eventually do not match the input,
resulting in wasted computation.

Basic Search Strategy: Combines top-down tree generation with bottom-up constraints to filter out
bad parses, aiming to optimize the parsing process.

4.3 A Basic Top-Down Parsing

A depth first, left to right search.

• Start with Depth-First Search (DFS): Use a depth-first approach to explore the search tree
incrementally.
• Left-to-Right Search: Expand nodes from left to right in the tree.
• Incremental Expansion: Expand the search space one state at a time.
• Select Left-most Node for Expansion: Always select the left-most unexpanded node for
expansion.
• Expand Using Grammar Rules: Expand nodes based on the relevant grammar rules.
• Handle Inconsistent State: If a state is inconsistent with the input, it is flagged.
• Return to Recent Tree: The search then returns to the most recently unexplored tree to continue.

Top-down, depth-first parsing algorithm

1. Initialize agenda
2. Pick a state, let it be curr_state, from agenda
3. If (curr_state) represents a successful parse then return parse tree
else if curr_state is a POS then
if category of curr_state is a subset of POS associated with curr_word
then apply lexical rules to current state
else reject
else generate new states by applying grammar rules and push them into agenda
4. If (agenda is empty) then return failure
else select a node from agenda for expansion and go to step 3.
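The following Python sketch implements this top-down, depth-first, left-to-right strategy for the toy grammar of the "Paint the door" example. It is a simplification of the algorithm above: backtracking is obtained with generators rather than an explicit agenda, and only a subset of the grammar rules is encoded.

# A backtracking recursive-descent (top-down, depth-first, left-to-right) parser.
GRAMMAR = {
    "S":  [["NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Noun"], ["Pronoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb", "NP"], ["Verb"]],
    "PP": [["Preposition", "NP"]],
}
LEXICON = {
    "Det": {"this", "that", "a", "the"},
    "Noun": {"paint", "door", "bird", "hole"},
    "Verb": {"sleeps", "sings", "open", "saw", "paint"},
    "Preposition": {"from", "with", "on", "to"},
    "Pronoun": {"she", "he", "they"},
}

def parse(symbol, words, pos):
    # Try to derive a prefix of words[pos:] from symbol.
    # Yields (tree, next_position) for every successful derivation.
    if symbol in LEXICON:                       # part-of-speech category
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in GRAMMAR.get(symbol, []):         # expand the non-terminal, rule by rule
        yield from parse_seq(rhs, words, pos, (symbol,))

def parse_seq(rhs, words, pos, partial):
    if not rhs:                                 # whole right-hand side matched
        yield partial, pos
        return
    for subtree, nxt in parse(rhs[0], words, pos):
        yield from parse_seq(rhs[1:], words, nxt, partial + (subtree,))

if __name__ == "__main__":
    sentence = "paint the door".split()
    for tree, end in parse("S", sentence, 0):
        if end == len(sentence):                # a valid parse must cover all the words
            print(tree)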


Figure shows the trace of the algorithm on the sentence, Open the door.

• The algorithm begins with the node S and input word "Open."
• It first expands S using the rule S → NP VP, then expands NP with NP → Det Nominal.
• Since "Open" cannot be derived from Det, the parser discards this rule and tries NP → noun,
which also fails.
• The next agenda item corresponds to S → VP.
• Expanding VP using VP → Verb NP matches the first input word successfully.
• The algorithm then continues in a depth-first, left-to-right manner to match the remaining words.

Left corner for each grammar category


Category Left Corners
S Det, Pronoun, Noun, Verb
NP Noun, Pronoun, Det
VP Verb
PP Preposition
Nominal Noun
Disadvantages:

1. Inefficiency: It may explore many unnecessary branches of the parse tree, especially if the input
does not match the grammar well, leading to high computational overhead.

2. Backtracking: If a rule fails, the parser often needs to backtrack to a previous state and try
alternative expansions, which can significantly slow down parsing.

3. Left Recursion Issues: Top-down parsers struggle with left-recursive grammars because they
can lead to infinite recursion.

4. Lack of Lookahead: Basic top-down parsers generally lack lookahead capabilities, meaning they
might make incorrect decisions early on without enough information, leading to errors.


5. Ambiguity Handling: They may have difficulty handling ambiguities in the grammar, often
exploring all possible alternatives without any way of pruning inefficient branches.

6. Limited Error Recovery: Basic top-down parsers typically have poor error recovery and can
fail immediately when encountering an unexpected input.

Dynamic programming algorithms can solve these problems. These algorithms construct a table
containing solutions to sub-problems, which, if solved, will solve the whole problem.

There are three widely known dynamic parsers-the Cocke-Younger-Kasami (CYK) algorithm, the
Graham-Harrison-Ruzzo (GHR) algorithm, and the Earley algorithm.

Probabilistic grammar can also be used to disambiguate parse trees.

4.4 Earley Parser


• Efficient parallel top-down search using dynamic programming.
• It builds a table of sub-trees for each of the constituents in the input (eliminates the repetitive
parse and reduces the exponential-time problem).
• Most important component of this algorithm is Earley chart.
o The chart contains a set of states for each word position in the sentence.
o The algorithm makes a left to right scan of input to fill the elements in this chart.
o It builds a set of states, one for each position in the input string.
o The states in each entry provide the following information.
▪ A sub-tree corresponding to a grammar rule.
▪ Information about the progress made in completing the sub-tree.
▪ Position of the sub-tree with respect to input.

Earley Parsing
Input: Sentence and the Grammar
Output: Chart
chart[0] ← S' → • S, [0,0]
n ← length(sentence)            // number of words in the sentence
for i = 0 to n do
    for each state in chart[i] do
        if (incomplete(state) and next category is not a part of speech) then
            predictor(state)
        else if (incomplete(state) and next category is a part of speech) then
            scanner(state)
        else
            completer(state)
        end-if
    end for
end for
return chart

Procedure predictor (A → X1 ... • B ... Xm, [i, j])
    for each rule (B → α) in G do
        insert the state B → • α, [j, j] into chart[j]
End

Procedure scanner (A → X1 ... • B ... Xm, [i, j])
    if B is one of the parts of speech associated with word[j] then
        insert the state B → word[j] •, [j, j+1] into chart[j+1]
End

Procedure completer (A → X1 ... Xm •, [j, k])
    for each state B → X1 ... • A ..., [i, j] in chart[j] do
        insert the state B → X1 ... A • ..., [i, k] into chart[k]
End

Steps:

Earley’s algorithm works in three main steps:

1. Prediction

➢ If the dot (•) is before a non-terminal in a rule, add all rules expanding that non-terminal
to the state set.

➢ The predictor generates new states representing potential expansion of the non-terminal
in the left-most derivation.

➢ A predictor is applied to every state that has a non-terminal to the right of the dot.

➢ Results in the creation of as many new states as there are grammar rules for the non-
terminal

Their start and end positions are at the point where the generating state ends. If the generating state is

A → X1 ... • B ... Xm, [i, j]

then for every rule of the form B → α, the operation adds to chart[j] the state

B → • α, [j, j]

For example, when the generating state is S → • NP VP, [0,0], the predictor adds the following states
to chart [0]:
NP →· Det Nominal, [0,0]
NP →· Noun, [0,0]


NP →· Pronoun, [0,0]
NP →· Det Noun PP, [0,0]

2. Scanning

➢ A scanner is used when a state has a part-of-speech category to the right of the dot.

➢ The scanner examines the input to see if the part-of-speech appearing to the right of the
dot matches one of the part-of-speech associated with the current input.

➢ If yes, then it creates a new state using the rule that allows generation of the input word with
this part-of-speech.

➢ If the dot (•) is before a terminal that matches the current input symbol, move the dot to the
right.

Example:

When the state NP → • Det Nominal, [0,0] is processed, the parser finds a part-of-speech category next
to the dot. It checks if the category of the current word (curr_word) matches the expectation in the
current state. If yes, then it adds the new state Det → curr_word •, [0,1] to the next chart entry, chart[1].

3. Completion

• If the dot reaches the end of a rule, find and update previous rules that were waiting for this rule
to complete.
• The completer identifies all previously generated states that expect this grammatical category at
this position in the input and creates new states by advancing the dots over the expected category.

Example:

Consider a simple CFG for sentences such as "John sees the dog" (for instance: S → NP VP,
NP → "John" | Det N, VP → V NP, V → "sees", Det → "the", N → "dog"). We want to parse the sentence:

“John sees the dog”

Chart[0] (start state): we start with S → • NP VP.
Chart[1] ("John"): since John is a valid NP, we scan it; the next word, "sees", matches V.
Chart[2] ("sees"): we scan "the".
Chart[3] ("the"): we scan "dog".
Chart[4] ("dog"): the dot reaches the end of S → NP VP •, and the parse is complete.

The sequence of states for “Paint the door” created by the parser is shown in Figure
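A compact Earley recognizer can be sketched in Python as follows (recognition only, without the back pointers needed to recover the parse tree). The grammar and lexicon encodings are assumptions made for this sketch and follow a subset of the toy grammar used earlier for "Paint the door".

# A state is (lhs, rhs, dot, start); chart[i] holds the states ending at position i.
GRAMMAR = {
    "S": [["NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Noun"], ["Pronoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}
LEXICON = {"paint": {"Verb", "Noun"}, "open": {"Verb"},
           "the": {"Det"}, "a": {"Det"}, "door": {"Noun"}}
POS = {"Det", "Noun", "Verb", "Pronoun"}          # part-of-speech categories

def earley(words):
    chart = [[] for _ in range(len(words) + 1)]

    def add(state, i):
        if state not in chart[i]:
            chart[i].append(state)

    add(("S'", ("S",), 0, 0), 0)                  # dummy start state
    for i in range(len(words) + 1):
        j = 0
        while j < len(chart[i]):                  # chart[i] may grow while being scanned
            lhs, rhs, dot, start = chart[i][j]
            j += 1
            if dot < len(rhs) and rhs[dot] not in POS:            # predictor
                for prod in GRAMMAR[rhs[dot]]:
                    add((rhs[dot], tuple(prod), 0, i), i)
            elif dot < len(rhs):                                  # scanner
                if i < len(words) and rhs[dot] in LEXICON.get(words[i], ()):
                    add((rhs[dot], (words[i],), 1, i), i + 1)
            else:                                                 # completer
                for l2, r2, d2, s2 in chart[start]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        add((l2, r2, d2 + 1, s2), i)
    return ("S'", ("S",), 1, 0) in chart[len(words)]

print(earley("paint the door".split()))           # True
print(earley("paint door the".split()))           # False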


4.5 CYK Parser


• CYK (Cocke-Younger-Kasami) is a dynamic programming parsing algorithm.
• Follows a bottom-up approach in parsing.
• It builds a parse tree incrementally. Each entry in the table is based on previous entries. The
process is iterated until the entire sentence has been parsed.
• Checks whether a particular string of words is a member of the language defined by the grammar.
• The CYK parsing algorithm assumes the grammar to be in Chomsky normal form (CNF). A CFG
is in CNF if all the rules are of only two forms:
o A→ B C
o A → w, where w is a word.

Consider the following simplified grammar in CNF:


S→ NP VP Verb → wrote
VP → Verb NP Noun → girl
NP → Det Noun Noun → essay
Det → an | the

The sentence to be parsed is: The girl wrote an essay.


The table contains the entries after a complete run of the algorithm. The entry in the [1, n]th cell contains the
start symbol, which indicates that S ⇒* w1n, i.e., the parse is successful.

Create a triangular table where:

• Rows represent start positions in the sentence.

• Columns represent substrings of increasing length.

• Fill Base Case (Single Words): Find matching grammar rules for each word
• Fill Table for Larger Substrings: Now, we combine smaller segments.
• Check for Start Symbol (S): Since S appears in T[1,5], the sentence is valid under this grammar!


Algorithm:
Let w = w1 w2 ... wi ... wn
and wij = wi ... wi+j-1
// Initialization step
for i := 1 to n do
for all rules A→ wi do
chart [i,1] = {A}
// Recursive step
for j= 2 to n do
for i = 1 to n-j+1 do
begin
chart [i, j]=ø
for k= 1 to j -1 do
chart[i, j] := chart[i, j] ∪ {A | A → BC is a production and
B ∈ chart[i, k] and C ∈ chart[i+k, j-k]}
end
if S ∈ chart[1, n] then accept else reject
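A runnable Python sketch of the algorithm for the CNF grammar above is given below. The encoding of the grammar as two dictionaries, with each (B, C) pair mapping to a single parent, is a simplification that happens to suffice for this example.

from itertools import product

UNARY = {           # A -> w rules
    "the": {"Det"}, "an": {"Det"},
    "girl": {"Noun"}, "essay": {"Noun"},
    "wrote": {"Verb"},
}
BINARY = {          # A -> B C rules, keyed by (B, C)
    ("NP", "VP"): "S",
    ("Verb", "NP"): "VP",
    ("Det", "Noun"): "NP",
}

def cyk(words):
    n = len(words)
    # chart[i][j] holds the non-terminals deriving the j words starting at position i
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                       # initialization step
        chart[i][1] = set(UNARY.get(w, ()))
    for j in range(2, n + 1):                           # span length
        for i in range(n - j + 1):                      # span start
            for k in range(1, j):                       # split point
                for B, C in product(chart[i][k], chart[i + k][j - k]):
                    if (B, C) in BINARY:
                        chart[i][j].add(BINARY[(B, C)])
    return "S" in chart[0][n], chart

accepted, chart = cyk("the girl wrote an essay".split())
print("accepted" if accepted else "rejected")           # accepted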

5. Probabilistic Parsing
• Statistical parser, requires a corpus of hand-parsed text.
• The Penn tree-bank is a large corpus – consists Penn tree-bank tags, parsed based on simple set
of phrase structure rules, Chomsky's government and binding syntax.
• The parsed sentences are represented in the form of properly bracketed trees.
Given a grammar G, a sentence s, and the set of possible parse trees of s, which we denote by τ(s), a
probabilistic parser finds the most likely parse φ of s as follows:
φ = argmax_{φ ∈ τ(s)} P(φ | s)   % the parse in τ(s) that maximizes the conditional probability P(φ | s)
  = argmax_{φ ∈ τ(s)} P(φ, s)    % equivalently, the joint probability P(φ, s), since P(s) is the same for all parses
  = argmax_{φ ∈ τ(s)} P(φ)       % and P(φ, s) = P(φ), because the tree φ already contains the sentence s

To construct a statistical parser:

We have to first find all possible parses of a sentence, then assign probabilities to them, and finally return
the most probable parse → probabilistic context-free grammars (PCFGs).

Benefits of statistical parser:

• A probabilistic parser helps resolve parsing ambiguity (multiple parse trees) by assigning
probabilities to different parse trees, allowing selection of the most likely structure.
• It improves efficiency by narrowing the search space, reducing the time required to determine the
final parse tree.


Probabilistic context-free grammar (PCFG):

• Every rule is assigned a probability. A → α [p]


o Where p gives the probability of expanding a constituent using the rule: A → α.
• A PCFG is defined by the pair (G, f), where G is a CFG and f is a positive function defined over the
set of rules such that, the sum of the probabilities associated with the rules expanding a particular
non-terminal is 1.

∑α f(A → α) = 1

Example: PCFG is shown in Table, for each non-terminal, the sum of probabilities is 1.
S→NP VP 0.8 Noun→door 0.25
S→VP 0.2 Noun→bird 0.25
NP→Det Noun 0.4 Noun→hole 0.25
NP→Noun 0.2 Verb→sleeps 0.2
NP→Pronoun 0.2 Verb→sings 0.2
NP→Det Noun PP 0.2 Verb→open 0.2
VP→Verb NP 0.5 Verb→saw 0.2
VP→Verb 0.3 Verb→paint 0.2
VP→VP PP 0.2 Preposition→from 0.3
PP→Preposition NP 1.0 Preposition→with 0.25
Det→this 0.2 Preposition→on 0.2
Det→that 0.2 Preposition→to 0.25
Det→a 0.25 Pronoun→she 0.35
Det→the 0.35 Pronoun→he 0.35
Noun→paint 0.25 Pronoun→they 0.25

f(S→ NP VP) + f(S→ VP)=1


f(NP→ Det Noun) + f(NP→ Noun)+ f(NP → Pronoun) + f(NP→ Det Noun PP) = 1
f(VP → Verb NP) + f(VP → Verb) + f(VP → VP PP) = 1.0
f(Det→this) +f(Det→that)+f(Det→a)+f(Det→ the)=1.0
f(Noun→paint)+f(Noun→door)+f(Noun→bird) + f(Noun→ hole) = 1.0

5.1 Estimating Rule Probabilities


• How are probabilities assigned to rules? (As shown in PCFG table)
• Manually construct a corpus of a parse tree for a set of sentences, and then estimate the
probabilities of each rule being used by counting them over the corpus.
• The MLE estimate for a rule A → α is given by the expression:
P(A → α) = Count(A → α) / ∑γ Count(A → γ)
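A minimal Python sketch of this estimation, starting from rule counts that would be gathered over a hand-parsed corpus (the counts below are hypothetical stand-ins):

from collections import Counter, defaultdict

# Counts of (lhs, rhs) rule uses collected over a treebank (hypothetical values).
rule_counts = Counter({("S", ("NP", "VP")): 1, ("S", ("VP",)): 1,
                       ("NP", ("Det", "Noun")): 2, ("NP", ("Noun",)): 1})

lhs_totals = defaultdict(int)
for (lhs, rhs), n in rule_counts.items():
    lhs_totals[lhs] += n                                  # Count(A), summed over all expansions

probs = {(lhs, rhs): n / lhs_totals[lhs] for (lhs, rhs), n in rule_counts.items()}
print(probs[("NP", ("Det", "Noun"))])                     # 2/3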


If our training corpus consists of two parse trees (as shown in Figure), we will get the estimates as shown
in Table for the rules.

Figure: Two Parse trees Table: MLE for grammar rules considering two parse trees


What do we do with these probabilities?

• We assign a probability to each parse tree φ of a sentence s.
• The probability of a complete parse is calculated by multiplying the probabilities of each of the
rules used in generating the parse tree:

P(φ) = ∏ n ∈ φ P(r(n))

where n is a node in the parse tree φ and r(n) is the rule used to expand n.


The probability of the two parse trees of the sentence Paint the door with the hole (shown in Figure)
using PCFG table can be computed as follows:

P(t1) = 0.2 * 0.5 * 0.2 * 0.2 * 0.35 * 0.25 * 1.0 * 0.25 * 0.4 * 0.35 * 0.25 = 0.0000030625
P(t2) = 0.2* 0.2 * 0.5 * 0.2 * 0.4 * 0.35 * 0.25 * 1 * 0.25 * 0.4 * 0.35 * 0.25 = 0.000001225
The first tree has a higher probability, leading to the correct interpretation.

We can calculate the probability of a sentence s by summing the probabilities of all possible parses
associated with it.

The sentence will therefore have the probability

P(t1) + P(t2) = 0.0000030625 + 0.000001225 = 0.0000042875
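The two products can be reproduced with a few lines of Python; the probability lists below are simply read off the two parse trees (PP attached to the object NP for t1, and to the VP for t2), using the values from the PCFG table.

from math import prod

t1 = [0.2, 0.5, 0.2, 0.2, 0.35, 0.25, 1.0, 0.25, 0.4, 0.35, 0.25]        # PP attached to the NP
t2 = [0.2, 0.2, 0.5, 0.2, 0.4, 0.35, 0.25, 1.0, 0.25, 0.4, 0.35, 0.25]   # PP attached to the VP

print(prod(t1), prod(t2), prod(t1) + prod(t2))   # P(t1), P(t2), and P(sentence)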
5.2 Parsing PCFGs

Given a PCFG, a probabilistic parsing algorithm assigns the most likely parse φ to a sentence s:
φ = argmax_{T ∈ τ(s)} P(T | s)

where τ(s) is the set of all possible parse trees of s.


Probabilistic CYK
Let w = w1 w2 ... wn represent a sentence consisting of n words.
Let φ[i, j, A] represent the maximum probability parse for a constituent with non-terminal A spanning the j
words wi, wi+1, ..., wi+j-1. This is a sub-tree rooted at A that derives this sequence of j words beginning at
position i and has a probability greater than all other possible sub-trees deriving the same word sequence.


• An array named BP is used to store back pointers. These pointers allow us to recover the best parse.
• Initialize the maximum probable parse trees deriving a string of length 1 with the probabilities of the
terminal derivation rules used to derive them.
• The recursive step involves breaking a string in all possible ways and identifying the maximum
probable parse.
• The rest of the steps follow those of the basic CYK parsing algorithm.

5.3 Problems with PCFG


• The probability of a parse tree assumes that the rules are independent of each other.
o Example: pronouns occur more frequently as subjects than as objects, so the probability of
expanding an NP as a pronoun versus a lexical NP depends on whether the NP appears as a
subject or an object.
o Such dependencies are not captured by a PCFG.
• Lack of sensitivity to lexical information.
o Two structurally different parses that use the same rules will have the same probability
under a PCFG.

Solution: This requires a model that captures lexical dependency statistics for different words → Lexicalization

Lexicalization

• Words do affect the choice of the rule.


• Involvement of actual words in the sentences, to
decide the structure of the parse tree.
• Lexicalization is also helpful in choosing phrasal
attachment positions.
• One way to achieve lexicalization is to mark each
phrasal node in a parse tree by its head word.


• This lexicalized version keeps track of headwords (e.g., "jumped" in VP) and improves parsing
accuracy.
• A lexicalized PCFG assigns specific words to rules, making parsing more accurate by capturing
relationships between words.
o The verb (jumped) affects parsing probability.
o Dependencies between words like "jumped" and "boy" are captured.
o A sentence like "The boy jumped over the fence" is parsed more accurately.

6. Indian Languages
• Some of the characteristics of Indian languages that make CFG unsuitable.
• Paninian grammar can be used to model Indian languages.
1. Indian languages have free word order.
o सबा खाना खाती है । Saba khana khati hai.
o खाना सबा खाती है । Khana Saba khati hai.

The CFG we used for parsing English is basically positional, but it fails to model free word order
languages.
2. Complex predicates (CPs) are another property that most Indian languages have in common.
• A complex predicate combines a light verb with a verb, noun, or adjective, to produce a
new verb.
• For example:

(a) सबा आयी। → (Saba Ayi.) → Saba came.

(b) सबा आ गयी। → (Saba a gayi.) → Saba come went. → Saba arrived.

(c) सबा आ पडी। → Saba a pari. → Saba come fell. → Saba came (suddenly).
The use of post-position case markers and the auxiliary verbs in this sequence provide information about
tense, aspect, and modality.

Paninian grammar provides a framework to model Indian languages. It focuses on the extraction of Karak
relations from a sentence.

Bharti and Sangal (1990) described an approach for parsing of Indian languages based on Paninian
grammar formalism. Their parser works in two stages.

1st stage: Identifying word groups.

2nd stage: Assigning a parse structure to the input sentence.


Example:

लड़कियााँ मैदान में हािी खेल रही हैं ।

Ladkiyan maidaan mein hockey khel rahi hein.


1st stage:

• Word ladkiyan forms one unit, the words maidaan and mein are grouped together to form a noun
group, and the word sequence khel rahi hein forms a verb group.

2nd stage:

• The parser takes the word groups formed during first stage and identifies (i) Karaka relations
among them, and (ii) senses of words.
• Karaka chart is created to store additional information like Karaka-Vibhakti mapping.

• Constraint graph for sentence: The Karaka relation between a verb group and a noun group can
be depicted using a constraint graph.

• A parse of the sentence:

Each sub-graph of the constraint graph that satisfies the following constraints yields a parse of the
sentence.
1. It contains all the nodes of the graph.
2. It contains exactly one outgoing edge from a verb group for each of its mandatory Karakas. These
edges are labelled by the corresponding Karaka.
3. For each of the optional Karaka in Karaka chart, the sub-graph can have at most one outgoing
edge labelled by the Karaka from the verb group.
4. For each noun group, the sub-graph should have exactly one incoming edge.


Question Bank

1. Define a finite automaton that accepts the following language: (aa)(bb).

2. A typical URL is of the form:

http :// www.abc.com /nlppaper/public /xxx.html

1 2 3 4 5

In this table, 1 is a protocol, 2 is name of a server, 3 is the directory, and 4 is the name
of a document. Suppose you have to write a program that takes a URL and returns the
protocol used, the DNS name of the server, the directory and the document name.
Develop a regular expression that will help you in writing this program.

3. Distinguish between non-word and real-word error.

4. Compute the minimum edit distance between paecflu and peaceful.

5. Comment on the validity of the following statements:

(a) Rule-based taggers are non-deterministic.

(b) Stochastic taggers are language independent.

(c) Brill's tagger is a rule-based tagger.

6. How can unknown words be handled in the tagging process?

7. Give two possible parse trees for the sentence, Stolen painting found by tree.

8. Identify the noun and verb phrases in the sentence, My soul answers in music.

9. Give the correct parse of the sentence.

10. Discuss the disadvantages of the basic top-down parser with the help of an
appropriate example.

11. Tabulate the sequence of states created by CYK algorithm while parsing, The sun
rises in the east. Augment the grammar in section 4.4.5 with appropriate rules of
lexicon.

12. Discuss the disadvantages of probabilistic context free grammar.

13. What does lexicalized grammar mean? How can lexicalization be achieved? Explain
with the help of suitable examples.


14. List the characteristics of a garden path sentence. Give an example of a garden path
sentence and show its correct parse.

15. What is the need of lexicalization?

16. Use the following grammar:

S → NP VP        S → VP          NP → Det Noun
NP → Noun        NP → NP PP      VP → VP PP
VP → Verb        VP → VP NP      PP → Preposition NP

Give two possible parses of the sentence: 'Pluck the flower with the stick.' Introduce lexicon
rules for words appearing in the sentence. Using these parse trees obtain maximum
likelihood estimates for the grammar rules used in the tree. Calculate probability of any one
parse tree using these estimates.

Lab Exercises

1. Write a program to find minimum edit distance between two input strings.

2. Use any tagger available in your lab to tag a text file. Now write a program to find
the most likely tag in the tagged text.

3. Write a program to find the probability of a tag given previous two tags, i.e., P(t3/t2
t1).

4. Write a program to extract all the noun phrases from a text file. Use the phrase structure
rule given in this chapter.

5. Write a program to check whether a given grammar is context free grammar or not.

6. Write a program to convert a given CFG grammar in CNF.

7. Write a program to implement a basic top-down parser.

8. Implement Earley parsing algorithm.


Module – 3
Naive Bayes, Text Classification and Sentiment
Naive Bayes, Text Classification and Sentiment: Naive Bayes Classifiers, Training the Naive
Bayes Classifier, Worked Example, Optimizing for Sentiment Analysis, Naive Bayes for Other
Text Classification Tasks, Naive Bayes as a Language Model.

Textbook 2: Ch. 4.

Introduction

• Classification, heart of both human and machine intelligence → assigning a category to


an input
• Deciding what letter, word, or image has been presented to our senses, recognizing faces
or voices, sorting mail, assigning grades to homeworks;
• Naïve Bayes algorithm for text categorization: the task of assigning a label or category
to an entire text or document.
• Common text categorization tasks:
1. Sentiment analysis, the extraction of sentiment, the positive or negative orientation that
a writer expresses toward some object.
o A review of a movie, book, or product on the web.

Example: + ... any characters and richly applied satire, and some great plot twists

- It was pathetic. The worst part about it was the boxing scenes ...
+ ... awesome caramel sauce and sweet toasty almonds. I love this place!
- ... awful pizza and ridiculously overpriced ...

Words like great, richly, awesome, and pathetic, and awful and ridiculously are very informative
cues for sentiment.

2. Spam detection:
o Binary classification task of assigning an email to one of the two classes spam or
not-spam.
o Many lexical and other features can be used to perform classification.


Example: Suspicious of an email containing phrases like “online pharmaceutical” or


“WITHOUT ANY COST” or “Dear Winner”.

3. Assigning a library subject category or topic label to a text: Various sets of subject
categories exist. Deciding whether a research paper concerns epidemiology, embryology,
etc..is an important component of information retrieval.

Supervised Learning:

• The most common way of doing text classification in language processing is supervised
learning.
• In supervised learning, we have a data set of input observations, each associated with
some correct output (a ‘supervision signal’).
• The goal of the algorithm is to learn how to map from a new observation to a correct
output.
• We have a training set of N documents that have each been hand labeled with a class:
{(d1 c1)…(dN cN)}. Our goal is to learn a classifier that is capable of mapping from a new
document d to its correct class c € C, where C is some set of useful document classes.

3.1 Naive Bayes Classifiers

The intuition of the classifier is shown in Fig. 1. We represent a text document as if it were a bag
of words, that is, an unordered set of words with their position ignored, keeping only their
frequency in the document.

Instead of representing the word order in all the phrases like “I love this movie” and “I would
recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word
it 6 times, the words love, recommend, and movie once, and so on.


• Naive Bayes is a probabilistic classifier.


• For a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the
maximum posterior probability given the document:

ĉ = argmax_{c ∈ C} P(c | d)    (1)

Use Bayes’ rule to break down any conditional probability P(x|y) into three other probabilities:

P(x | y) = P(y | x) P(x) / P(y)    (2)

We can then substitute Eq.2 into Eq.1 to get Eq.3

ĉ = argmax_{c ∈ C} P(d | c) P(c) / P(d)    (3)

Since P(d) doesn't change for each class, we can conveniently simplify Eq. 3 by dropping the
denominator.
ĉ = argmax_{c ∈ C} P(d | c) P(c)    (4)
We call naive Bayes a generative model: Eq. 4 can be read as saying that a class is sampled from P(c), and
then the words of the document are generated by sampling from P(d | c).

Eq. 4 states that we compute the most probable class ĉ given some document d by choosing the class
which has the highest product of two probabilities: the prior probability of the class P(c) and
the likelihood of the document P(d | c):

ĉ = argmax_{c ∈ C} P(c) P(d | c)    (5)

We can represent a document d as a set of features f1, f2, ..., fn:

ĉ = argmax_{c ∈ C} P(f1, f2, ..., fn | c) P(c)    (6)

Eq. 6 is still too hard to compute directly: without some simplifying assumptions, estimating the
probability of every possible combination of features (for example, every possible set of words
and positions) would require huge numbers of parameters and impossibly large training sets.

Naive Bayes classifiers therefore make two simplifying assumptions.

The first is the bag-of-words assumption, that the features f1, f2, ... ,fn only encode word identity
and not position.

The second is commonly called the naive Bayes assumption, the conditional independence
assumption that the probabilities P(fi|c) are independent given the class c.


Therefore, P(f1, f2, ..., fn | c) = P(f1 | c) · P(f2 | c) · ... · P(fn | c)    (7)

The final equation for the class chosen by a naive Bayes classifier is:

c_NB = argmax_{c ∈ C} P(c) ∏ f ∈ F P(f | c)    (8)

To apply the naive Bayes classifier to text, we will use each word in the documents as a feature,
as suggested above, and we consider each of the words in the document by walking an index
through every word position in the document:

c_NB = argmax_{c ∈ C} P(c) ∏ i ∈ positions P(wi | c)    (9)

Naive Bayes calculations, like calculations for language modelling, are done in log space, to
avoid underflow and increase speed. Thus Eq. 9 is generally instead expressed as,

c_NB = argmax_{c ∈ C} [ log P(c) + ∑ i ∈ positions log P(wi | c) ]    (10)

Eq. 10 computes the predicted class as a linear function of input features. Classifiers that use a
linear combination of the inputs to make a classification decision -like naive Bayes and also
logistic regression are called linear classifiers.

3.2 Training the Naive Bayes Classifier

How can we learn the probabilities P(c) and P(fi|c)?

To learn class priori P(c): What percentage of the documents in our training set are in each class
c.

Let Nc be the number of documents in our training data with class c. Ndoc be the total number of
documents. Then,
P̂(c) = Nc / Ndoc    (11)

To learn the probability P(fi|c):

We'll assume a feature is just the existence of a word in the document's bag of words, and so
we'll want P(wi|c), we compute as the fraction of times the word wi appears among all words in
all documents of topic c.

Concatenate all documents with category c into one big "category c" text. Then we use the
frequency of wi in this concatenated document to give a maximum likelihood estimate of the
probability:

i.e.,  P̂(wi | c) = count(wi, c) / ∑ w ∈ V count(w, c)    (12)


Here the vocabulary V consists of the total set of unique words across all classes, not just
the words in one class c.

Issues with training:

1. Zero Probability problem with maximum likelihood training:

Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but
suppose there are no training documents that both contain the word "fantastic" and are classified
as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative.
In such a case the probability for this feature will be zero:

P̂("fantastic" | positive) = count("fantastic", positive) / ∑ w ∈ V count(w, positive) = 0    (13)

Since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in the
likelihood term for any class will cause the probability of the class to be zero, no matter the other
evidence!

To solve this, we use something called Laplace smoothing (or add-one smoothing). Instead of the raw
maximum likelihood estimate of Eq. 12, we use:

P̂(wi | c) = (count(wi, c) + 1) / ∑ w ∈ V (count(w, c) + 1) = (count(wi, c) + 1) / (∑ w ∈ V count(w, c) + |V|)    (14)

Now "fantastic" will still get a very small probability in the "positive" class — but not zero.

2. Words that occur in our test data but are not in our vocabulary:
• Remove them from the test document and not include any probability for them at all.
Some systems choose to completely ignore another class of words: stop words, very
frequent words like the and a.
• Defining the top 10-100 vocabulary entries as stop words, or alternatively by using one
of the many predefined stop word lists available online. Then each instance of these stop
words is simply removed from both training and test documents.
• However, using a stop word list doesn't improve performance, and so it is more common
to make use of the entire vocabulary.


Fig. The naive Bayes algorithm, using add-1 smoothing. To use add-α smoothing instead, change the +1
to +α for the log-likelihood counts in training.

3.3 Worked example:

Let’s use a sentiment analysis domain with the two classes positive (+) and negative (-), and take
the following miniature training and test documents simplified from actual movie reviews.

Training:  (-)  just plain boring
           (-)  entirely predictable and lacks energy
           (-)  no surprises and very few laughs
           (+)  very powerful
           (+)  the most fun film of the summer
Test:      (?)  predictable with no fun

Step 1: The prior P(c) for the two classes is computed as per Eq. 11:

P(-) = 3/5    P(+) = 2/5


Step 2: The word “with” doesn't occur in the training set, so we drop it completely.

Step 3: The likelihoods from the training set for the remaining three words "predictable", "no",
and "fun" (add-1 smoothed, with 14 negative tokens, 9 positive tokens, and |V| = 20) are as follows:

P(predictable|-) = (1+1)/(14+20) = 2/34      P(predictable|+) = (0+1)/(9+20) = 1/29
P(no|-) = 2/34                               P(no|+) = 1/29
P(fun|-) = 1/34                              P(fun|+) = 2/29

Step 4: For the test sentence S = "predictable with no fun", after removing the word 'with', the
chosen class, via Eq. 9, is therefore computed as follows:

P(-) P(S|-) = 3/5 × (2 × 2 × 1)/34^3 ≈ 6.1 × 10^-5
P(+) P(S|+) = 2/5 × (1 × 1 × 2)/29^3 ≈ 3.2 × 10^-5

The model therefore predicts the class negative for the test sentence.
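The worked example can be reproduced with the following minimal Python sketch (add-1 smoothing, computation in log space). It is an illustration, not the textbook's reference implementation.

import math
from collections import Counter, defaultdict

train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]

counts = defaultdict(Counter)                  # per-class word counts
docs_per_class = Counter()
for c, doc in train:
    docs_per_class[c] += 1
    counts[c].update(doc.split())
vocab = {w for c in counts for w in counts[c]}

def predict(doc):
    scores = {}
    for c in counts:
        logp = math.log(docs_per_class[c] / len(train))            # log prior
        total = sum(counts[c].values())
        for w in doc.split():
            if w not in vocab:                                     # unknown test word: drop it
                continue
            logp += math.log((counts[c][w] + 1) / (total + len(vocab)))   # add-1 smoothing
        scores[c] = logp
    return max(scores, key=scores.get), scores

print(predict("predictable with no fun"))      # the negative class ('-') wins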

3.4 Optimizing for Sentiment Analysis

While standard naive Bayes text classification can work well for sentiment analysis, some small
changes are generally employed that improve performance.

3.4.1 Clip the word counts (duplicate words) in each document at 1:


• Remove all duplicate words before concatenating them into the single big document
during training and we also remove duplicate words from test documents.
• This variant is called binary multinomial naive Bayes or binary naive Bayes.
• Example:

Fig. An example of binarization for the binary naive Bayes algorithm


3.4.2 Deal with negation.

Consider the difference between I really like this movie (positive) and I didn’t like this movie
(negative). Similarly, negation can modify a negative word to produce a positive review (don’t
dismiss this film, doesn’t let us get bored).

Solution: Prepend the prefix NOT to every word after a token of logical negation (n’t, not, no,
never) until the next punctuation mark.

Thus the phrase: didn’t like this movie , but I

becomes: didnt NOT_like NOT_this NOT_movie , but I

‘words’ like NOT_like, NOT_recommend will thus occur more often in negative document and
act as cues for negative sentiment, while words like NOT_bored, NOT_dismiss will acquire
positive associations.
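A rough Python sketch of this NOT_-prepending step; the simple tokenizer and the list of negation triggers are simplifying assumptions.

import re

NEGATIONS = {"not", "no", "never"}

def mark_negation(text):
    out, negating = [], False
    for tok in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if re.fullmatch(r"[.,!?;]", tok):          # punctuation ends the negated span
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS or tok.endswith("n't"):
                negating = True
    return " ".join(out)

print(mark_negation("didn't like this movie , but I"))
# didn't NOT_like NOT_this NOT_movie , but i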

3.4.3 Insufficient labelled training data:

Derive the positive and negative word features from sentiment lexicons (corpus), lists of words
that are pre-annotated with positive or negative sentiment.

For example, the MPQA subjectivity lexicon has 6885 words, each marked for
whether it is strongly or weakly biased positive or negative. Some examples:

+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great

- : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate

3.5 Naive Bayes for other text classification tasks

3.5.1 Spam Detection and Naïve Bayes

Spam detection—deciding whether an email is unsolicited bulk mail—was one of the earliest
applications of naïve Bayes in text classification (Sahami et al., 1998). Rather than treating all
words as individual features, effective systems often use predefined sets of words or patterns,
along with non-linguistic features.

For instance, the open-source tool SpamAssassin uses a range of handcrafted features:

• Specific phrases like "one hundred percent guaranteed"

• Regex patterns like mentions of millions of dollars


• Structural properties like HTML with a low text-to-image ratio

• Non-linguistic metadata, such as the email’s delivery path

Other examples of SpamAssassin features include:

• Subject lines written entirely in capital letters

• Urgent phrases like "urgent reply"

• Keywords such as "online pharmaceutical"

• HTML anomalies like unbalanced head tags

• Claims such as "you can be removed from the list"

3.5.2 Language Identification

In contrast, tasks like language identification rely less on words and more on subword units like
character n-grams or even byte n-grams. These can capture statistical patterns at the start or end
of words, especially when spaces are included as characters.

A well-known system, langid.py (Lui & Baldwin, 2012), starts with all possible n-grams of
lengths 1–4 and uses feature selection to narrow down to the 7,000 most informative.

Training data for language ID systems often comes from multilingual sources such as Wikipedia
(in 68+ languages), newswire, and social media. To capture regional and dialectal diversity,
additional corpora include:

• Geo-tagged tweets from Anglophone regions like Nigeria or India

• Translations of the Bible and Quran

• Slang from Urban Dictionary

• Corpora of African American Vernacular English (Blodgett et al., 2016)

These diverse sources help models capture the full range of language use across different
communities and contexts (Jurgens et al., 2017).


3.6 Naive Bayes as a Language Model


• Naive Bayes classifiers can use any sort of feature: dictionaries, URLs, email addresses,
network features, phrases, and so on.
• A naive Bayes model can be viewed as a set of class-specific unigram language models,
in which the model for each class instantiates a unigram language model.
• Since the model assigns a probability P(word | c) to each word, it also assigns a probability to each
sentence:

P(s | c) = ∏ i ∈ positions P(wi | c)    (15)

Example: Consider a naive Bayes model with the classes positive (+) and negative (-) and the
following model parameters:

  w        P(w|+)    P(w|-)
  I        0.1       0.2
  love     0.1       0.001
  this     0.01      0.01
  fun      0.05      0.005
  film     0.1       0.1

Each of the two columns above instantiates a language model that can assign a probability to
the sentence “I love this fun film”:

P(“I love this fun film” | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10^-7

P(“I love this fun film” | -) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1.0 × 10^-9

The positive model assigns a higher probability to the sentence: P(s|pos) > P(s|neg).

Note: This is just the likelihood part of the naive Bayes model; once we multiply in the prior a
full naive Bayes model might well make a different classification decision.

3.7 Evaluation: Precision, Recall, F-measure

Text classification evaluation often starts with binary detection tasks.


Example 1: Spam Detection

• Goal: Label each text as spam (positive) or not spam (negative).

• Need to compare:
o System’s prediction
o Gold label (human-defined correct label)
Example 2: Social Media Monitoring for a Brand


• Scenario: CEO of Delicious Pie Company wants to track mentions on social media.

• Build a system to detect tweets about Delicious Pie.

• Positive class: Tweets about the company.

• Negative class: All other tweets.


Why we need metrics:

• To evaluate how well a system (e.g., spam detector or pie-tweet detector) performs.

• Confusion Matrix:
o A table that compares system predictions vs. gold (human) labels.
o Each cell represents a type of outcome:
▪ True Positive (TP): correctly predicted positives (e.g., actual spam labeled as spam).
▪ False Negative (FN): actual positives incorrectly labeled as negative (e.g., spam labeled as non-spam).
▪ False Positive (FP) and True Negative (TN) are defined analogously.

• Accuracy:

o Formula: (Correct predictions) / (Total predictions).

o Appears useful but misleading for unbalanced classes.

• Why accuracy can fail:

o Real-world data is often skewed (e.g., most tweets are not about pie).

o Example:

▪ 1,000,000 tweets → only 100 about pie.

▪ A naive classifier labels all tweets as "not about pie".

▪ Result: 99.99% accuracy, but 0 useful results.

o Conclusion: Accuracy is not a reliable metric when the positive class is rare.


That’s why, instead of relying on accuracy, we often use two more informative metrics:
precision and recall (as shown in Fig).

• Precision measures the percentage of items labeled as positive by the system that are
actually positive (according to human-annotated “gold” labels).

Precision = true positives / (true positives + false positives)

• Recall measures the percentage of actual positive items that were correctly identified by
the system.

Recall = true positives / (true positives + false negatives)

These metrics address the issue with the “nothing is pie” classifier. Despite its seemingly
excellent 99.99% accuracy, it has a recall of 0: since there are no true positives and 100 false negatives,
the recall is 0/100. Its precision is also meaningless, since the classifier labels nothing as positive and so
finds none of the things it is supposed to detect.

Unlike accuracy, precision and recall focus on true positives, helping us measure how well the
system finds the things it’s actually supposed to detect.

To combine both precision and recall into a single metric, we use the F-measure (van
Rijsbergen, 1975), with the most common version being the F1 score:

The β parameter differentially weights the importance of recall and precision, based perhaps on
the needs of an application:

Fβ = (β² + 1) P R / (β² P + R)

Values of β > 1 favor recall, while values of β < 1 favor precision. When β = 1, precision and recall are
equally balanced; this is the most frequently used metric, and is called Fβ=1 or just F1:

F1 = 2 P R / (P + R)    (16)
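These definitions translate directly into code. A small sketch follows; the zero-count guards are a practical assumption, and the counts are those of the "nothing is pie" classifier discussed above.

def precision(tp, fp): return tp / (tp + fp) if tp + fp else 0.0
def recall(tp, fn):    return tp / (tp + fn) if tp + fn else 0.0
def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r else 0.0

# The "nothing is pie" classifier: 0 true positives, 100 false negatives.
p, r = precision(0, 0), recall(0, 100)
print(p, r, f_beta(p, r))          # 0.0 0.0 0.0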

3.7.1 Evaluating with more than two classes

For sentiment analysis we generally have 3 classes (positive, negative, neutral) and even
more classes are common for tasks like part-of-speech tagging, word sense disambiguation,
semantic role labeling, emotion detection, and so on. Luckily the naive Bayes algorithm is
already a multi-class classification algorithm.

Consider the sample confusion matrix for a hypothetical 3-way one-of email
categorization decision (urgent, normal, spam) shown in Fig. The matrix shows, for example,


that the system mistakenly labeled one spam document as urgent, and we have shown how to
compute a distinct precision and recall value for each class.

Figure: Confusion matrix for a three-class categorization task, showing for each pair of
classes (c1, c2), how many documents from c1 were (in)correctly assigned to c2.

In order to derive a single metric that tells us how well the system is doing, we can combine
these values in two ways.

1. In macroaveraging, we compute the performance for each class, and then average over
classes.
2. In microaveraging, we collect the decisions for all classes into a single confusion matrix,
and then compute precision and recall from that table.

Fig. shows the confusion matrix for each class separately, and shows the computation of
microaveraged and macroaveraged precision.

As the figure shows, a microaverage is dominated by the more frequent class (in this case spam),
since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes,
and so is more appropriate when performance on all the classes is equally important.
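The two averaging strategies can be sketched as follows; the 3-class confusion-matrix counts here are hypothetical (they are not taken from the figure).

```python
# Hypothetical confusion matrix: rows = system output, columns = gold labels
classes = ["urgent", "normal", "spam"]
conf = {
    "urgent": {"urgent": 8, "normal": 10, "spam": 1},
    "normal": {"urgent": 5, "normal": 60, "spam": 50},
    "spam":   {"urgent": 3, "normal": 30, "spam": 200},
}

# Per-class precision = TP / (TP + FP): TP on the diagonal, FP in the same row
precisions = {c: conf[c][c] / sum(conf[c].values()) for c in classes}

# Macroaverage: mean of the per-class precisions
macro_p = sum(precisions.values()) / len(classes)

# Microaverage: pool counts over all classes, then compute a single precision
# (with one label per document this equals overall accuracy, dominated by the frequent class)
total_tp = sum(conf[c][c] for c in classes)
total_predicted = sum(sum(conf[c].values()) for c in classes)
micro_p = total_tp / total_predicted

print(precisions, macro_p, micro_p)
```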


3.8 Test sets and Cross-validation

Training & Testing for Text Classification:

1. Standard Procedure:
o Train the model on the training set.
o Use the development set (devset) to tune parameters and choose the best model.
o Evaluate the final model on a separate test set.
2. Issue with Fixed Splits:
o Fixed training/dev/test sets may lead to small dev/test sets.
o Smaller test sets might not be representative of overall performance.
3. Solution – Cross-Validation (as shown in Fig):
o Cross-validation allows use of all data for training and testing.
o Process:
▪ Split data into k folds.
▪ For each fold:
▪ Train on k-1 folds, test on the remaining fold.
▪ Repeat k times, average the test errors.
o Example: 10-fold cross-validation (train on 90%, test on 10%, repeated 10
times).
4. Limitation of Cross-Validation:
   o All data is used for testing, so we can't analyze the data in advance (to avoid "peeking" at the test data).
   o Looking at data is important for feature design in NLP systems.
5. Common Compromise (see the sketch below):
   o Split off a fixed test set.
   o Do 10-fold cross-validation on the training set.
   o Use the test set only for final evaluation.
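A minimal sketch of this compromise in plain Python, using a hypothetical labeled corpus: hold out a fixed test set, then build the folds over the remaining training data.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Hypothetical corpus of 1000 labeled documents; hold out 200 as the final test set
data = [(f"doc_{i}", i % 2) for i in range(1000)]
test_set, train_set = data[:200], data[200:]

for fold_no, fold in enumerate(k_fold_indices(len(train_set), k=10)):
    fold_set = set(fold)
    dev = [train_set[i] for i in fold]                                   # test on this fold
    train = [train_set[i] for i in range(len(train_set)) if i not in fold_set]
    # train a classifier on `train`, evaluate on `dev`, then average the 10 scores
    print(f"fold {fold_no}: train={len(train)} dev={len(dev)}")

# Only after model selection: evaluate once on test_set
```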


3.9 Statistical Significance Testing


• When building NLP systems, we often need to compare performance between two
systems (e.g., a new model vs. an existing one).
• Simply observing different scores (e.g., accuracy, F1) isn't enough — we need to know
if the difference is statistically significant.
• This is where statistical hypothesis testing comes in.
• Inspired by Dror et al. (2020) and Berg-Kirkpatrick et al. (2012), these tests help
determine if the observed improvement is real or due to chance.
• Example:
o Classifier A (e.g., logistic regression) vs. Classifier B (e.g., naive Bayes).
o Metric M (e.g., F1-score), tested on dataset x.
o Let M(A, x) be the score for A, and δ(x) be the difference in performance between
A and B.
δ(x) = M(A, x) − M(B, x)        (19)
Understanding Effect Size and Significance
• We want to know if δ(x) > 0, meaning A (logistic regression) performs better than B
(naive Bayes).
• δ(x) is the effect size — larger δ means a bigger performance gap.
• But a positive δ alone isn’t enough.
o Example: A has 0.04 higher F1 than B — is that meaningful?
• Problem: The difference might be due to chance on this specific test set.
• What we really want to know:
o Would A still outperform B on another test set or under different conditions?
• That’s why we need statistical testing, not just raw differences.
Statistical Hypothesis Testing Paradigm
• We compare models by setting up two formal hypotheses:
H0: δ(x) ≤ 0        H1: δ(x) > 0        (20)

o Null hypothesis (H₀): There's no real difference between A and B — any


observed difference is due to chance.
o Alternative hypothesis (H₁): There is a real performance difference between A
and B.
• Statistical tests help us decide whether to reject H₀ in favor of H₁ based on the data.


Null Hypothesis and p-value

• Null hypothesis (H₀): Assumes δ(x) ≤ 0 — A is not better than B.


• We want to see if we can reject H₀ and support H₁ (that A is better).
• We imagine δ(x) over many possible test sets.
• The p-value measures how likely we are to observe our δ(x), or a larger one, if H₀ were true:
  p-value(x) = P(δ(X) ≥ δ(x) | H₀ is true)        (21)
• A low p-value suggests our result is unlikely due to chance, supporting H₁.
Interpreting p-values and Statistical Testing in NLP
• The p-value is the probability of observing a performance difference δ(x) (or larger),
assuming A is not better than B (null hypothesis H₀).
• If δ(x) is large (e.g., A’s F1 = 0.9 vs. B’s = 0.2), it's unlikely under H₀ → low p-value
→ we reject H₀.
• If δ(x) is small, it's more plausible under H₀ → higher p-value → we may fail to reject
H₀.
What Counts as “Small”?
o Common p-value thresholds: 0.05 or 0.01

• If p < threshold, the result is considered statistically significant (we reject H₀ and
conclude A is likely better than B).
How Do We Compute the p-value in NLP?
• NLP avoids parametric tests (like t-tests or ANOVAs) because they assume certain
distributions that often don't apply.
• Instead, we use non-parametric tests that rely on sampling methods.
Key Idea:
• Simulate many variations of the experiment (e.g., using different test sets x′).
• Compute δ(x′) for each → this gives a distribution of δ values.
• If the observed δ(x) is in the top 1% (i.e., p-value < 0.01), it's unlikely under H₀ → reject
H₀.
Common Non-Parametric Tests in NLP:
1. Approximate Randomization (Noreen, 1989)
2. Bootstrap Test (paired version is most common)
o Compares aligned outputs from two systems (e.g., A vs. B on the same inputs
xi).
o Measures how consistently one system outperforms the other across samples.


3.9.1 The Paired Bootstrap Test


The bootstrap test is a flexible, non-parametric method that can be applied to any evaluation
metric—like precision, recall, F1, or BLEU.
What is bootstrapping?
It involves repeatedly sampling with replacement from an original dataset to create many
"bootstrap samples" or virtual test sets. The key assumption is that the original sample is
representative of the larger population.
Example
Imagine a small classification task with 10 test documents. Two classifiers, A and B, are
evaluated:

• Each document outcome falls into one of four categories:


o Both A and B correct
o Both incorrect
o A correct, B wrong
o A wrong, B correct
• If A has 70% accuracy and B has 50%, then the performance difference δ(x) = 0.20.
How bootstrap works:

1. Generate a large number (e.g., 100,000) of new test sets by sampling 10 documents with
replacement from the original set.

2. For each virtual test set, recalculate the accuracy difference between A and B.

3. Use the distribution of these differences to estimate a p-value, telling us how likely the
observed δ(x) is under the null hypothesis (that A is not better than B).

This helps determine whether the observed performance difference is statistically significant or
just due to random chance.

Figure: The paired bootstrap test: Examples of b pseudo test sets x (i) being created from an initial true test
set x. Each pseudo test set is created by sampling n = 10 times with replacement; thus an individual sample
is a single cell, a document with its gold label and the correct or incorrect performance of classifiers A and
B.


With the b bootstrap test sets, we now have a sampling distribution to analyze whether
A’s advantage is due to chance. Following Berg-Kirkpatrick et al. (2012), we assume the null
hypothesis (H₀)—that A is not better than B—so the average δ(x) should be zero or negative. If
our observed δ(x) is much higher, it would be surprising under H₀. To measure this, we calculate
the p-value by checking how often the sampled δ(xᵢ) values exceed the observed δ(x).

We use the notation 1(x) to mean "1 if x is true, and 0 otherwise." Although the expected value
of δ(X) over many test sets is 0, this isn't true for the bootstrapped test sets due to the bias in the
original test set, so we compute the p-value by counting how often δ(x(i)) exceeds the expected
value δ(x) by δ(x) or more.

p-value(x) = (1/b) Σ_{i=1..b} 1(δ(x(i)) − δ(x) ≥ δ(x))        (22)

If we have 10,000 test sets and a threshold of 0.01, and in 47 test sets we find δ(x(i)) ≥ 2δ(x), the
p-value of 0.0047 is smaller than 0.01. This suggests the result is surprising, allowing us to reject
the null hypothesis and conclude A is better than B.

Fig. A version of the paired bootstrap algorithm

The full algorithm for the bootstrap is shown in Fig. It is given a test set x and a number of samples
b, and counts the percentage of the b bootstrap test sets in which δ(x*(i)) ≥ 2δ(x). This percentage
then acts as a one-sided empirical p-value.
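A minimal sketch of the paired bootstrap, assuming we already have per-document correctness flags for classifiers A and B on the same test items (the flags below are toy data).

```python
import random

def paired_bootstrap(results_a, results_b, b=10_000, seed=0):
    """results_a / results_b: lists of 0/1 correctness flags on the same test items.
    Returns the one-sided empirical p-value for H0: A is not better than B."""
    rng = random.Random(seed)
    n = len(results_a)
    delta_x = (sum(results_a) - sum(results_b)) / n     # observed accuracy difference
    count = 0
    for _ in range(b):
        sample = [rng.randrange(n) for _ in range(n)]   # sample n items with replacement
        delta_i = sum(results_a[j] - results_b[j] for j in sample) / n
        # in the bootstrap world delta_i is centred on delta_x,
        # so we test whether it exceeds delta_x by delta_x or more
        if delta_i - delta_x >= delta_x:
            count += 1
    return count / b

# Toy example: 10 documents, A correct on 7, B correct on 5
a_flags = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
b_flags = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(paired_bootstrap(a_flags, b_flags))
```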


3.10 Avoiding Harms in Classification (Summary)

• Classifiers can cause harm, including representational harms (e.g., reinforcing


stereotypes).
o Example: Sentiment analysis systems rated sentences with African American
names more negatively than identical ones with European American names.
• Toxicity classifiers may falsely label non-toxic content as toxic, especially when it
references marginalized groups or dialects (e.g., AAVE), leading to silencing.
• Harms can arise from:
o Biased training data
o Biased labels or resources (e.g., lexicons, embeddings)
o Model design choices
• No universal fix exists, so transparency is key.
• A proposed solution: release model cards (Mitchell et al., 2019), which include:
o Training algorithms and parameters
o Training data sources, motivation, and preprocessing
o Evaluation data sources, motivation, and preprocessing
o Intended use and users
o Model performance across different demographic or other groups and
environmental situations


Module – 4
Information Retrieval & Lexical Resources
Information Retrieval: Design Features of Information Retrieval Systems, Information
Retrieval Models - Classical, Non-classical, Alternative Models of Information Retrieval - Cluster
model, Fuzzy model, LSI model, Major Issues in Information Retrieval.
Lexical Resources: WordNet, FrameNet, Stemmers, Parts-of-Speech Tagger, Research
Corpora.
Textbook 1: Ch. 9, Ch. 12.

Overview:

The huge amount of information stored in electronic form has placed heavy demands on
information retrieval systems. This has made information retrieval an important research area.

4.1 Introduction
• Information retrieval (IR) deals with the organization, storage, retrieval, and evaluation
of information relevant to a user's query.
• A user in need of information formulates a request in the form of a query written in a
natural language.
• The retrieval system responds by retrieving the document that seems relevant to the
query.

“An information retrieval system does not inform (i.e., change the knowledge of) the user on the
subject of their inquiry. It merely informs on the existence (or non-existence) and whereabouts of
documents relating to their request”.

• This chapter focuses on text document retrieval, excluding question answering and data
retrieval systems, which handle precise queries for specific data or answers.
• In contrast, IR systems deal with vague, imprecise queries and aim to retrieve relevant
documents rather than exact answers.

4.2 Design Features of Information Retrieval Systems


• It begins with the user's information need.
• Based on this need, he/she formulates a query.
• The IR system returns documents that seem relevant to the query.
• The retrieval is performed by matching the query representation
with document representation.


• In information retrieval, documents are not represented by their full text but by a set of
index terms or keywords, which can be single words or phrases, extracted automatically
or manually.
• Indexing, provides a logical view of the document and helps reduce computational costs.
• A commonly used data structure is the inverted index, which maps keywords to the
documents they appear in.
• To further reduce the number of keywords, text operations such as stop word
elimination (removing common functional words) and stemming (reducing words to
their root form) are used.
• Zipf’s law can be applied to reduce the index size by filtering out extremely frequent or
rare terms.
• Since not all terms are equally relevant, term weighting assigns numerical values to
keywords to reflect their importance.
• Choosing appropriate index terms and weights is a complex task, and several term-
weighting schemes have been developed to address this challenge.

4.2.1 Indexing
IR system can access a document to decide its relevance to a query. Large collection of documents,
this technique poses practical problems. A collection of raw documents is usually transformed into an
easily accessible representation. This process is known as indexing.

• Indexing involves identifying descriptive terms (keywords) that capture a document's


content and distinguish it from others.
• Effective descriptors aid in both content representation and document discrimination.
• Luhn (1957, 1958) introduced automatic indexing based on word frequency, suggesting
that terms with middle-range frequency are the most effective discriminators.
• Indexing represents text—both documents and queries—using selected terms that
reflect the original content meaningfully.
The word term can be a single word or multi-word phrases.
For example, the sentence, Design features of information retrieval systems, can be represented
as follows:
Design, features, information, retrieval, systems
It can also be represented by the set of terms:
Design, features, information retrieval, information retrieval systems


• Multi-word terms can be extracted using methods like n-grams, POS tagging, NLP, or
manual crafting.
• POS tagging aids in resolving word sense ambiguity using contextual grammar.
• Statistical methods (e.g., frequent word pairs) are efficient but struggle with word order
and structural variations, which syntactic methods handle better.
• TREC approach: Treats any adjacent non-stop word pair as a phrase, retaining only
those that occur in a minimum number (e.g., 25) of documents.
• NLP is also used for identifying proper nouns and normalizing noun phrases to unify
variations (e.g., "President Kalam" and "President of India").
• Phrase normalization reduces structural differences in similar expressions (e.g., "text
categorization," "categorization of text," and "categorize text" → "text categorize").

4.2.2 Eliminating Stop Words


• Stop words are high-frequency, low-semantic-value words (e.g., articles, prepositions)
that are commonly removed during lexical processing.
• They play grammatical roles but offer little help in distinguishing document content for
retrieval.
• Eliminating stop words reduces the number of index terms and enhances efficiency.
• Drawbacks include potential loss of meaningful terms (e.g., "Vitamin A") and inability
to search for meaningful phrases composed entirely of stop words (e.g., "to be or not to
be").

Sample stop words in English


4.2.3 Stemming
• Stemming reduces words to their root form by removing affixes (e.g., "compute,"
"computing," "computes," and "computer" → "compute").
• This helps normalize morphological variants for consistent text representation.
• Stems are used as index terms.
• The Porter Stemmer (1980) is one of the most widely used stemming algorithms.

The stemmed representation of the text, Design features of information retrieval systems, is
{design, feature, inform, retrieval, system}
• Stemming can sometimes reduce effectiveness by removing useful distinctions
between words.
• It may increase recall by conflating similar terms, but can also reduce precision by
retrieving irrelevant results (e.g., "computation" vs. "personal computer").
• Recall and precision are key metrics for evaluating information retrieval performance

4.2.4 Zipf's Law


• Zipf's Law describes the distribution of words in natural language.
• It states that word frequency × rank ≈ constant, meaning frequency is inversely
proportional to rank.
• When words are sorted by decreasing frequency, higher-ranked words occur more often,
and lower-ranked words occur less frequently.
• This pattern is consistent across large text corpora.
• This relationship is shown in the figure.
• Zipf’s Law in practice shows that human language has:
o A few high-frequency words,
o Many low-frequency words, and
o A moderate number of medium-frequency words.

• In information retrieval (IR):

o High-frequency words lack discriminative power and are not useful for
indexing.

o Low-frequency words are rarely queried and can also be excluded.

• Medium-frequency words are typically content-bearing and ideal for indexing.


• Words can be filtered by setting frequency thresholds to drop too common or too rare
terms.
• Stop word elimination is a practical application of Zipf’s law, targeting high-frequency
terms.

4.3 Information Retrieval Models


• An Information Retrieval (IR) model defines how documents and queries are
represented, matched, and ranked.
• Core components of an IR system include:
o A document model
o A query model
o A matching function to compare the two
• The primary goal is to retrieve all relevant documents for a user query.
• Different IR models exist, varying in:
o Representation: e.g., as sets of terms or vectors of weighted terms
o Retrieval method: based on term presence or similarity scoring
• Some models use binary matching, while others use vector space models with
numerical scoring for ranking results.
These models can be classified as follows:
➢ Classical models of IR
➢ Non-classical models of IR
➢ Alternative models of IR

1. Classical IR models (Boolean, Vector, Probabilistic):

✓ Based on well-known mathematical foundations.

✓ Simple, efficient, and widely used in commercial systems.

✓ Example:

i. Boolean: Query → ("machine" AND "learning") OR "AI"


ii. Vector: Query and documents represented as vectors → cosine similarity used to
rank results.
iii. Probabilistic: Estimates the probability that a document is relevant to a given
query

2. Non-classical IR models:
✓ Use principles beyond similarity, probability, or Boolean logic.


✓ Based on advanced theories like special logic, situation theory, or interaction models.

✓ Example: Modal or fuzzy logic, Contextual information, Dialogue or iterative process

3. Alternative IR models:

✓ Enhance classical models with techniques from other fields.

✓ Examples include the Cluster model, Fuzzy model, and Latent Semantic Indexing
(LSI).

✓ Example: Hierarchical or k-means clustering of documents, partial matching between


query and documents using fuzzy logic, Singular Value Decomposition (SVD) to
identify hidden semantic structures.

4.4 Classical Information Retrieval Models


4.4.1 Boolean model
• Introduced in the 1950s – Oldest of the three classical information retrieval models.
• Based on Boolean logic and set theory – Uses binary logic (true/false) operations.
• Document representation – Documents are represented as sets of keywords.
• Uses inverted files – A data structure listing keywords and the documents they appear in.
• Query formulation – Users must write queries using Boolean operators (AND, OR, NOT).
• Retrieval method – Documents are retrieved based on the presence or absence of query
terms.

Example: Let the set of original documents be D= {D1, D2, D3}


Where,
D1 = Information retrieval is concerned with the organization, storage, retrieval, and evaluation of
information relevant to user's query.
D2 = A user having an information needs to formulate a request in the form of query written in natural
language.
D3 = The retrieval system responds by retrieving the document that seems relevant to the query.
Let the set of terms used to represent these documents be:
T= {information, retrieval, query}
Then, the set D of document will be represented as follows:
D= {d1, d2, d3}
Where
d1 = {information, retrieval, query}, d2 = {information, query}, d3 = {retrieval, query}


Let the query be Q: Q= information retrieval


First, the sets R1 and R2 of documents are retrieved in response to Q,
Where,
R1 = {dj | information ∈ dj} = {d1, d2}
R2 = {dj | retrieval ∈ dj} = {d1, d3}
Then, the following documents are retrieved in response to query Q:
{dj | dj ∈ R1 ∩ R2} = {d1}
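A minimal sketch of Boolean retrieval over an inverted index, reproducing the example above; AND is set intersection and OR would be set union.

```python
# Index-term representation of the three documents
docs = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

# Build the inverted index: term -> set of documents containing it
inverted = {}
for doc_id, terms in docs.items():
    for t in terms:
        inverted.setdefault(t, set()).add(doc_id)

# Query: information AND retrieval
R1 = inverted.get("information", set())   # {d1, d2}
R2 = inverted.get("retrieval", set())     # {d1, d3}
print(sorted(R1 & R2))                    # ['d1']
```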

Advantages:
They are simple, efficient, and easy to implement and perform well in terms of recall and
precision if the query is well formulated.

Drawbacks:

• The Boolean model retrieves only fully matching documents; it cannot handle documents
that are partially relevant to a query (No partial relevance).
• It does not rank the retrieved documents by relevance—documents either match or don’t
(No ranking of results).
• Users must formulate queries using strict Boolean expressions, which is unnatural and
difficult for most users (Strict query format).

4.4.2 Probabilistic Model

• Applies probability theory to information retrieval (Robertson and Jones, 1976).


• Documents are ranked by the probability of being relevant to a given query.
• A document is considered relevant if: P(R/d) ≥ P(I/d)


(i.e., relevance probability is greater than or equal to irrelevance)

• A document is retrieved only if its probability of relevance is greater than or equal to a


threshold value α.
• The retrieved set S consists of documents meeting both criteria: S = {dj | P(R/dj) ≥ P(I/dj) and P(R/dj) ≥ α}

Assumptions & limitations:

• Assumes terms occur independently when calculating relevance probabilities.


• This simplifies computation and aids in parameter estimation.
• However, real-world terms co-occur, making this assumption often inaccurate.
• The probabilistic model allows partial matching of documents to queries.
• A threshold (α) must be set to filter relevant documents.
• Difficult to estimate accurately, especially when the number of relevant documents is
small.


4.4.3 Vector Space Model

• Representation:
• Documents and queries are represented as vectors of features (terms).
• Each vector exists in a multi-dimensional space, with each dimension
corresponding to a unique term in the corpus.
• Numerical vectors: Terms are assigned weights, often based on their frequency in
the document (e.g., TF-IDF).
• Similarity computation:
• Ranking algorithms (e.g., cosine similarity) are used to compute the similarity
between a document vector and the query vector.
• The similarity score determines how relevant a document is to a given query.
• Retrieval output:
• Documents are ranked based on their similarity scores to the query.
• A ranked list of documents is presented as the retrieval result.

Given a finite set of n documents D = {d1, d2, ..., dj, ..., dn} and a finite set of m terms
T = {t1, t2, ..., ti, ..., tm}, each document is represented by a column vector of weights:

dj = (w1j, w2j, ..., wij, ..., wmj)T

where wij is the weight of the term ti in document dj. The document collection as a whole is
represented by an m × n term-document matrix W = [wij].

Example:
Consider the documents and terms in the previous section. Let the weights be assigned based on the
frequency of the term within the document. Then, the associated vectors will be

d1 = (2, 2, 1), d2 = (1, 0, 1), d3 = (0, 1, 1)

The vectors can be represented as points in Euclidean space.


To reduce the importance of the length of document vectors, we normalize document vectors.
Normalization changes all vectors to a standard length.

We convert document vectors to unit length by dividing each dimension by the overall length of
the vector.

Elements of each column are divided by the length of the column vector, given by |dj| = √(Σi wij²).

Let Query be Q = (1,1,0)

Compute Cosine Similarity:


Cosine similarity between vectors Q and Dj is the dot product since all vectors are unit length.

Rank documents based on similarity:


1. D1 — 0.951 → Retrieved
2. D2 — 0.504
3. D3 — 0.504
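A minimal sketch of this ranking step: normalize each term-frequency vector to unit length and score documents by the dot product with the normalized query (small rounding differences from the figures quoted above are expected).

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)

docs = {"D1": (2, 2, 1), "D2": (1, 0, 1), "D3": (0, 1, 1)}
query = unit((1, 1, 0))

# Cosine similarity reduces to a dot product once both vectors are unit length
scores = {d: sum(q * x for q, x in zip(query, unit(vec))) for d, vec in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 3))        # D1 ranks highest
```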


4.5 Term Weighting


• Each selected indexing term distinguishes a document from others in the collection.
• Mid-frequency terms are the most discriminative and content-bearing.
• Two key observations refine this idea:
1. A document is more about a term if the term appears frequently in it.
2. A term is more discriminative if it appears in fewer documents across the
collection.
Term Frequency (TF):
• A term that appears more frequently in a document likely represents its content well.
• TF can be used as a weight to reflect this.
Inverse Document Frequency (IDF):
• Measures how unique or discriminating a term is across the corpus.
• Terms common across many documents are less useful for distinguishing content.
• Calculated as:
IDF = log(n / ni)
• n = total number of documents
• ni = number of documents containing term i
• Note: A term occurring in all documents has the lowest ratio n/ni = 1 (and so the lowest IDF),
while a term occurring in only one document has the highest ratio n/ni = n (the highest IDF, before taking the log).
4.5.1 TF & IDF:
To assign higher weight to terms that occur frequently in a particular document but are rare
across the corpus
• tf-idf (term frequency-inverse document frequency) weighting scheme combines both
term frequency and inverse document frequency.


The tf-idf weighting scheme combines two components to determine the importance of a term:
• Term frequency (tf): A local statistic indicating how often a term appears in a
document.
• Inverse document frequency (idf): A global statistic that reflects how rare or
specific a term is across the entire document collection.
• tf-idf is Widely used in information retrieval and natural language processing to assess
the relevance of a term in a document relative to a corpus.
Example:
Consider a document represented by the three terms {tornado, swirl, wind} with the raw tf {4, 1,
and 1} respectively. In a collection of 100 documents, 15 documents contain the term tornado,
20 contain swirl, and 40 contain wind.

The idf of the term tornado can be computed as idf(tornado) = log(100/15) ≈ 0.824, giving a tf-idf
weight of 4 × 0.824 ≈ 3.296.

The idf values of the other terms are computed in the same way. The table shows the weights assigned
to the three terms using this approach.

Note:
Tornado: highest TF-IDF weight (3.296), indicating both high frequency in the document and relatively
low occurrence across all documents.
Swirl: rare but relevant
Wind: least significant
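A minimal sketch of the tf-idf computation for this example; a base-10 logarithm is assumed here, since that reproduces the 3.296 weight quoted above.

```python
import math

n = 100                                        # total documents in the collection
tf = {"tornado": 4, "swirl": 1, "wind": 1}     # raw term frequencies in the document
df = {"tornado": 15, "swirl": 20, "wind": 40}  # document frequencies in the collection

for term in tf:
    idf = math.log10(n / df[term])             # idf = log(n / ni)
    weight = tf[term] * idf                    # tf-idf weight
    print(f"{term}: idf={idf:.3f} tf-idf={weight:.3f}")
# tornado ≈ 3.296, swirl ≈ 0.699, wind ≈ 0.398
```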

4.5.2 Weight normalization:


Normalization prevents longer documents from being unfairly weighted due to higher raw term
counts.
Term frequency (tf) can be normalized by dividing by the frequency of the most frequent term
in the document, known as maximum normalization, producing values between 0 and 1.
Inverse document frequency (idf) can also be normalized by dividing it by the logarithm of the
total number of documents (log(n)).

Most weighting schemes can thus be characterized by the following three factors:

• Within-document frequency or term frequency (tf)


• Collection frequency or inverse document frequency (idf)
• Document length


Table: Calculating weight with different options for the three weighting factors

Term weighting in IR has evolved significantly from basic tf-idf. Different combinations of tf,
idf, and normalization strategies form various weighting schemes, each affecting retrieval
performance. Advanced models like BM25 further refine this by incorporating document length
and probabilistic reasoning.

4.5.3 A simple automatic method for obtaining an indexed representation of the documents is
as follows (a sketch of this pipeline is given below).

Step 1: Tokenization. This extracts individual terms from a document, converts all the letters to
lower case, and removes punctuation marks.
Step 2: Stop word elimination. This removes words that appear very frequently across the document
collection.
Step 3: Stemming. This reduces the remaining terms to their linguistic root, to obtain the index
terms.
Step 4: Term weighting. This assigns weights to terms according to their importance in the
document, in the collection, or some combination of both.
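A minimal sketch of the four-step pipeline; the stop list and the suffix-stripping function are illustrative stand-ins for a real stop word list and a real stemmer such as Porter's.

```python
import re
from collections import Counter

STOP_WORDS = {"of", "the", "a", "an", "and", "is", "to", "in"}   # tiny illustrative list

def crude_stem(word):
    """Stand-in for a real stemmer (e.g., Porter): strip a few common suffixes."""
    for suffix in ("ation", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # Step 1: tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # Step 2: stop word elimination
    stems = [crude_stem(t) for t in tokens]               # Step 3: stemming
    return Counter(stems)                                 # Step 4: raw tf weights

print(index_terms("Design features of information retrieval systems"))
```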


Example:

Vector representation of sample documents after stemming

Sample documents

4.6 Similarity Measures


• Vector Space Model (VSM) represents documents and queries as vectors in a multi-
dimensional space.
• Retrieval is based on measuring the closeness between query and document vectors.
• Documents are ranked according to their numeric similarity to the query.
• Selected documents are those geometrically closest to the query vector.
• The model assumes that similar vectors represent semantically related documents.
• Example in a 2D space using terms ti and tj:
o Document d1: 2 occurrences of ti
o Document d2: 1 occurrence of ti
o Document d3: 1 occurrence each of ti and tj
• Term weights (raw term frequencies) are used as vector coordinates.
• Angles θ1, θ2, θ3 represent direction differences between document vectors and the query.
• Basic similarity measure: counting common terms.
• Commonly used similarity metric: inner product of query and document vectors.


The Dice’s coefficient:

Measures similarity by doubling the inner product and normalizing by the sum of squared
weights.

Jaccard’s Coefficient:

Computes similarity as the ratio of the inner product to the union (sum of squares minus
intersection).

The cosine measure:

Computes the cosine of the angle between the document vector dj and the query vector qk. It
gives a similarity score between 0 and 1:

• 0: No similarity (vectors are orthogonal, angle is 90°).

• 1: Maximum similarity (vectors point in the same direction).
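A minimal sketch of the three coefficients computed over weighted term vectors (the vectors below are illustrative).

```python
import math

def dice(d, q):
    num = 2 * sum(x * y for x, y in zip(d, q))
    return num / (sum(x * x for x in d) + sum(y * y for y in q))

def jaccard(d, q):
    inner = sum(x * y for x, y in zip(d, q))
    return inner / (sum(x * x for x in d) + sum(y * y for y in q) - inner)

def cosine(d, q):
    inner = sum(x * y for x, y in zip(d, q))
    return inner / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

d1, q = (2, 2, 1), (1, 1, 0)      # illustrative term-weight vectors
print(dice(d1, q), jaccard(d1, q), cosine(d1, q))
```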


4.7 Non-Classical Models of IR

1. Information Logic Model

• Based on logical imaging and inference from document to query.

• Introduces uncertain inference, where a measure of uncertainty (from van Rijsbergen's


principle) quantifies how much additional information is needed to establish the truth of
an implication.

• Aims to address classical models' limitations in effectiveness.

2. Situation Theory Model

• Also grounded in van Rijsbergen's principle.

• Uses infons to represent information and its truth in specific situations.

• Retrieval is seen as an information flow from document to query.

• Incorporates semantic transformations (e.g., synonyms, hypernyms) to establish


relevance even if a document does not directly support a query.

3. Interaction Model

• Inspired by quantum mechanics' concept of interaction (Copenhagen interpretation).

• Documents are interconnected; retrieval emerges from the interaction between query
and documents.

• Implemented using artificial neural networks, where documents and the query are
neurons in a dynamic network.

• Query integration reshapes connections, and the degree of interaction guides retrieval.


4.8 Alternative Models of IR

4.8.1 Cluster Model

Reduces the number of document comparisons during retrieval by grouping similar documents.

Cluster Hypothesis (Salton)

• “Closely associated documents tend to be relevant to the same clusters.”

• Suggests that documents with high similarity are likely to be relevant to the same queries.

Clustering Improves Efficiency

• Instead of comparing a query with every document:


o The query is first compared with cluster representatives (centroids).
o Only documents in relevant clusters are checked individually.
• This significantly reduces search time and computational cost.

Clustering can be applied to:

o Documents (group similar documents).


o Terms (group co-occurring terms; useful for dimensionality reduction or building
thesauri).

Cluster Representation

• Each cluster Ck has a representative vector (centroid):

o rk = (a1k, a2k, ..., amk), where each element represents the average of the corresponding term weights in the documents of that cluster.

o An element aik of this vector is computed as aik = (1/|Ck|) Σ_{dj ∈ Ck} aij, where aij is the weight of the term ti in document dj of cluster Ck. During retrieval, the query is compared with the cluster vectors.


• This comparison is carried out by computing the similarity Sk between the query vector q
and the representative vector rk (the worked example below uses cosine similarity).

• A cluster Ck whose similarity Sk exceeds a threshold is returned and the search proceeds
in that cluster.

Example:

Consider 3 documents (d1, d2, d3) and 5 terms (t1 to t5). The term-by-document matrix is:

t/d d1 d2 d3
t1 1 1 0
t2 1 0 0
t3 1 1 1
t4 0 0 1
t5 1 1 0
So, document vectors are: d1 = (1, 1, 1, 0, 1), d2 = (1, 0, 1, 0, 1), d3 = (0, 0, 1, 1, 0)

Calculate cosine similarity between the documents:

• sim(d1, d2)
dot(d1, d2) = 1×1 + 1×0 + 1×1 + 0×0 + 1×1 = 3
|d1| = √(1²+1²+1²+0²+1²) = √4 = 2
|d2| = √(1²+0²+1²+0²+1²) = √3 ≈ 1.73
sim = 3 / (2 × 1.73) ≈ 0.87
• sim(d1, d3)
dot = 1×0 + 1×0 + 1×1 + 0×1 + 1×0 = 1
|d3| = √(0²+0²+1²+1²+0²) = √2 ≈ 1.41
sim = 1 / (2 × 1.41) ≈ 0.35
• sim(d2, d3)
dot = 1×0 + 0×0 + 1×1 + 0×1 + 1×0 = 1
sim = 1 / (1.73 × 1.41) ≈ 0.41
Similarity matrix:

d1 d2 d3
d1 1.0
d2 0.87 1.0
d3 0.35 0.41 1.0


Clustering with threshold 0.7


• d1 and d2 → sim = 0.87 → Cluster C1
• d3 has low similarity with both → Cluster C2
Clusters:
• C1 = {d1, d2}
• C2 = {d3}
Cluster representatives

Average the vectors in each cluster:

• r1 = avg(d1, d2)
= ((1+1)/2, (1+0)/2, (1+1)/2, (0+0)/2, (1+1)/2)
= (1, 0.5, 1, 0, 1)

• r2 = d3 = (0, 0, 1, 1, 0)

Retrieval is performed by matching the query vector with r1 and r2.


Retrieval using a query

Assume the query vector q = (1, 0, 1, 0, 1)


This means the query contains terms t1, t3, and t5.

Similarity with cluster vectors:

• sim(q, r1)
dot = 1×1 + 0×0.5 + 1×1 + 0×0 + 1×1 = 3
|q| = √(1² + 0² + 1² + 0² + 1²) = √3 ≈ 1.73
|r1| = √(1² + 0.5² + 1² + 0² + 1²) = √3.25 ≈ 1.80
sim = 3 / (1.73 × 1.80) ≈ 0.96
• sim(q, r2)
dot = 1×0 + 0×0 + 1×1 + 0×1 + 1×0 = 1
|r2| = √(0²+0²+1²+1²+0²) = √2 ≈ 1.41
sim = 1 / (1.73 × 1.41) ≈ 0.41
Query is closer to r1, so we retrieve documents from Cluster C1 = {d1, d2}
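A minimal sketch of cluster-based retrieval for this example: build the two centroids, compare the query only against them, and then search inside the best cluster.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

docs = {"d1": (1, 1, 1, 0, 1), "d2": (1, 0, 1, 0, 1), "d3": (0, 0, 1, 1, 0)}
clusters = {"C1": ["d1", "d2"], "C2": ["d3"]}

# Cluster representatives: element-wise average of the member document vectors
reps = {c: tuple(sum(col) / len(members)
                 for col in zip(*(docs[d] for d in members)))
        for c, members in clusters.items()}

q = (1, 0, 1, 0, 1)
scores = {c: cosine(q, r) for c, r in reps.items()}
best = max(scores, key=scores.get)        # C1 (~0.96) beats C2 (~0.41)
print(scores, "-> search inside", best, clusters[best])
```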


4.8.2 Fuzzy Model

In the fuzzy model of information retrieval, each document is represented as a fuzzy set of
terms, where each term is associated with a membership degree indicating its importance to
the document's content. These weights are typically derived from term frequency within the
document and across the entire collection.

Each document dj is modelled as a fuzzy set of weighted terms, dj = {(ti, wij) | ti ∈ T}, where wij is the degree to which term ti belongs to document dj.

Each term ti defines a fuzzy set fi over the documents: fi = {(dj, wij) | dj ∈ D}.

For queries:

• A single-term query returns documents where the term’s weight exceeds a threshold.

• An AND query uses the minimum of term weights (fuzzy intersection).

• An OR query uses the maximum of term weights (fuzzy union).

This model allows ranking documents by their degree of relevance to the query.

Example:

Documents:
• d1 = {information, retrieval, query}
• d2 = {retrieval, query, model}
• d3 = {information, retrieval}
Term Set:

• T = {t1: information, t2: model, t3: query, t4: retrieval}

Fuzzy sets (term-document weights):

• f1 (t1): {(d1, 1/3), (d2, 0), (d3, 1/2)}


• f2 (t2): {(d1, 0), (d2, 1/3), (d3, 0)}
• f3 (t3): {(d1, 1/3), (d2, 1/3), (d3, 0)}
• f4 (t4): {(d1, 1/3), (d2, 1/3), (d3, 1/2)}

Query:
• q = t2 ˄ t4 (i.e., model AND retrieval)

In fuzzy logic, the AND operation (˄) is typically interpreted using the minimum of the
memberships.


Step 1: Retrieve memberships for t2 and t4

From f2 (t2 - model):


• d1: 0
• d2: 1/3
• d3: 0
From f4 (t4 - retrieval):
• d1: 1/3
• d2: 1/3
• d3: 1/2

Step 2: Compute query membership using min (t2, t4)


Apply min operator for each document:
• d1: min(0, 1/3) = 0
• d2: min(1/3, 1/3) = 1/3
• d3: min(0, 1/2) = 0

Step 3: Determine which documents are returned


Assuming a non-zero membership indicates relevance (typical in fuzzy IR), only documents with
non-zero membership values for the query will be returned.

So, only: d2 has a non-zero value (1/3)
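A minimal sketch of fuzzy query evaluation for this example, with AND as the minimum and OR as the maximum of the term memberships.

```python
# Term membership degrees per document (from the fuzzy sets above)
memberships = {
    "information": {"d1": 1/3, "d2": 0,   "d3": 1/2},
    "model":       {"d1": 0,   "d2": 1/3, "d3": 0},
    "query":       {"d1": 1/3, "d2": 1/3, "d3": 0},
    "retrieval":   {"d1": 1/3, "d2": 1/3, "d3": 1/2},
}
docs = ["d1", "d2", "d3"]

def fuzzy_and(t1, t2):
    return {d: min(memberships[t1][d], memberships[t2][d]) for d in docs}

def fuzzy_or(t1, t2):
    return {d: max(memberships[t1][d], memberships[t2][d]) for d in docs}

result = fuzzy_and("model", "retrieval")
print({d: round(m, 2) for d, m in result.items() if m > 0})   # only d2 survives
```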

4.8.3 Latent Semantic Indexing Model

Latent Semantic Indexing (LSI) applies Singular Value Decomposition (SVD) to information
retrieval, aiming to uncover hidden semantic structures in word usage across documents.
Unlike traditional keyword-based methods, LSI captures conceptual similarities between terms
and documents, even when there’s no exact term match.

• Term-document matrix (W): Represents the frequency or weighted usage of terms


(rows) in documents (columns).
• SVD Decomposition: The matrix W is decomposed into three matrices:

W = T S D^T

where T holds the term vectors, S is the diagonal matrix of singular values, and D holds the document vectors.


• Truncated SVD: Retain only the top k singular values and corresponding vectors to
form a lower-dimensional approximation Wk = Tk Sk Dk^T, capturing the main semantic
structure and removing noise.


• Query Transformation: Queries are projected into the same reduced k-dimensional
latent
• Similarity Computation: Documents are ranked using similarity measures (e.g., cosine
similarity) between the query vector and document vectors in the latent space.
Advantages:
• Captures semantic relationships between terms and documents.
• Can retrieve relevant documents even if they don’t share any terms with the query.
• Reduces the impact of synonymy and polysemy.

Example:

• An example is given with a 5-term, 6-document matrix reduced to 2 dimensions using


truncated SVD. This shows how documents originally in a 5D space (based on terms like
tornado, storm, etc.) are projected into a 2D concept space, revealing deeper connections
among them.
• In essence, LSI enhances retrieval effectiveness by operating on meaning (latent
semantics) rather than surface-level word matching.

The SVD of X is computed to get the three matrices T, S, and D: X5×6 = T5×5 S5×5 (D6×5)^T.
(The figure shows the term vectors T, the singular values S, and the document vectors D.)


Consider the two largest singular values of S, and rescale D^T (2×6) with the singular values to get
the matrix R2×6 = S2×2 D^T2×6, as shown in the figure below. R is a reduced-dimensionality
representation of the original term-by-document matrix X.

To find out the changes introduced by the reduction, we compute document similarities in the
new space and compare them with the similarities between documents in the original space.

The document-document correlation matrix for the original n-dimensional space is given by
the matrix Y = X^T X. Here, Y is a square, symmetric n × n matrix. An element Yij in this matrix
gives the similarity between documents i and j. The correlation matrix for the original document
vectors is shown in the figure (Z). This matrix is computed using X, after normalizing the lengths of
its columns.

The document-document correlation matrix for the new space is computed analogously using
the reduced representation R. Let N be the matrix R with length-normalized columns. Then, M =
N^T N gives the matrix of document correlations in the reduced space. The correlation matrix M
is given in the figure.

The similarity between document d1 and documents d4 (−0.0304) and d6 (−0.2322) is quite low in the
new space because document d1 is not topically similar to documents d4 and d6.

In the original space, the similarity between documents d2 and d3 and between documents d2 and
d5 is 0. In the new space, they have high similarity values (0.5557 and 0.8518 respectively)
although documents d3 and d5 share no term with the document d2. This topical similarity is
recognized due to the co-occurrence of patterns in the documents.
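A minimal sketch of the LSI machinery using NumPy, on a small hypothetical term-document matrix (not the one in the figures): truncate the SVD to k = 2 and fold a query into the reduced space (one common folding convention is assumed).

```python
import numpy as np

# Hypothetical 5-term x 6-document matrix of raw term frequencies
X = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(X, full_matrices=False)    # X = T S D^T
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T  # keep the top k dimensions

# Each row of Dk is a document in the k-dimensional latent space;
# fold the query into the same space: q_k = q^T T_k S_k^{-1}
q = np.array([1, 0, 1, 0, 0], dtype=float)          # query containing terms 1 and 3
q_k = q @ Tk @ np.linalg.inv(Sk)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print([round(cos(q_k, d), 3) for d in Dk])          # similarity of the query to each document
```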


4.9 Major Issues in Information Retrieval


1. Vocabulary Mismatch: Users often express queries using terms that differ from those
in relevant documents, leading to retrieval failures.
2. Ambiguity and Polysemy: Words with multiple meanings can cause confusion in
interpreting user intent, affecting retrieval accuracy.
3. Scalability and Performance: As data volumes grow, IR systems must efficiently index
and retrieve information without compromising speed or accuracy.
4. Evaluation Metrics: Determining the relevance and effectiveness of IR systems is
challenging due to the subjective nature of "relevance" and the lack of standardized
evaluation methods.
5. User Behavior Modeling: Understanding and predicting user behavior is essential for
refining search results and improving user satisfaction.
6. Integration with Natural Language Processing (NLP): Incorporating NLP techniques
can enhance IR systems by enabling better understanding of context and semantics, but
it also introduces complexity.
These issues highlight the multifaceted nature of IR and the need for interdisciplinary approaches
to address them effectively.


Part B
LEXICAL RESOURCES

1. Introduction
The chapter provides an overview of freely available tools and lexical resources for natural
language processing (NLP), aimed at assisting researchers—especially newcomers to the field.
It emphasizes the importance of knowing where to find resources, which can significantly reduce
time and effort. The chapter compiles and briefly discusses key tools such as stemmers, taggers,
parsers, and lexical databases like WordNet and FrameNet, along with accessible test corpora,
all of which are available online or through scholarly articles.

2. WORDNET

A comprehensive lexical database for the English language developed at Princeton University
under George A. Miller based on psycholinguistic principles, WordNet is divided into three
databases: nouns, verbs, and a combined one for adjectives and adverbs.
Key features include:
• Synsets: Groups of synonymous words representing a single concept.
• Lexical and semantic relations: These include synonymy, antonymy,
hypernymy/hyponymy (generalization/specialization), meronymy/holonymy
(part/whole), and troponymy (manner-based verb distinctions).
• Multiple senses: Words can belong to multiple synsets and parts of speech, with each
sense given a gloss—a dictionary-style definition with usage examples.
• Hierarchical structure: Nouns and verbs are arranged in taxonomic hierarchies (e.g.,
'river' has a hypernym chain), while adjectives are grouped by antonym sets.

Figure 1 shows the entries for the word 'read'. 'Read' has one sense as a noun and 11 senses as a verb.
Glosses help differentiate meanings. Figures 2, 3, and 4 show some of the relationships that hold between
nouns, verbs, and adjectives and adverbs.


Nouns and verbs are organized into hierarchies based on the hypernymy/hyponymy relation,
whereas adjectives are organized into clusters based on antonym pairs (or triplets). Figure 5
shows a hypernym chain for 'river' extracted from WordNet. Figure 6 shows the troponym
relations for the verb 'laugh'.
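The same lexical relations can be explored programmatically, for example through NLTK's WordNet interface (assuming NLTK and its WordNet data are installed):

```python
from nltk.corpus import wordnet as wn   # assumes NLTK + WordNet data are installed

# Senses (synsets) of 'read', with their glosses
for syn in wn.synsets("read")[:3]:
    print(syn.name(), "-", syn.definition())

# Hypernym chain for the first noun sense of 'river'
river = wn.synsets("river", pos=wn.NOUN)[0]
chain = river.hypernym_paths()[0]
print(" -> ".join(s.name() for s in chain))
```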

The availability and multilingual extensions of WordNet:


• English WordNet is freely available for download at
http://wordnet.princeton.edu/obtain.
• EuroWordNet extends WordNet to multiple European languages, including English,
Dutch, Spanish, Italian, German, French, Czech, and Estonian. It includes both
language-internal relations and cross-lingual links to English meanings.
• Hindi WordNet, developed by CFILT at IIT Bombay, follows the same design
principles as the English version but includes language-specific features, such as
causative relations. It currently includes:
o Over 26,208 synsets and 56,928 Hindi words
o 16 types of semantic relations
o Each entry contains a synset, gloss (definition), and its position in the ontology.

• Figure 7 shows the Hindi WordNet entry for the word (aakanksha).
• Hindi WordNet can be obtained from the URL
http://www.cfilt.iitb.ac.in/wordnet/webhwn/. CFILT has also developed a Marathi
WordNet.
• Figure 8 shows the Marathi WordNet
(http://www.cfilt.iitb.ac.in/wordnet/webmwn/wn.php) entry for the word (pau).

Figure 7: WordNet entry for the Hindi word (aakanksha). Figure 8: WordNet entry for the Marathi word (pau).


2.1 Applications of WordNet


The key applications of WordNet in Information Retrieval (IR) and Natural Language
Processing (NLP):
1. Concept Identification:
WordNet helps identify the underlying concepts associated with a term, enabling more
accurate understanding and interpretation of user queries or texts by capturing their full semantic
richness.
2. Word Sense Disambiguation (WSD):
WordNet is widely used for disambiguating word meanings in context. Its value lies in:
o Providing sense definitions and examples
o Organizing words into synsets
o Defining semantic relations (e.g., synonymy, hypernymy)
These features make WordNet the most prominent and frequently used resource for WSD.
o Early research: One of the first uses of WordNet in WSD for IR was by Voorhees
(1993), who applied its noun hierarchy (hypernym/hyponym structure).
o Further work: Researchers like Resnik (1995, 1997) and Sussna (1993) also utilized
WordNet in developing WSD techniques.
Additional applications of WordNet
Automatic Query Expansion:
WordNet’s semantic relations (e.g., synonyms, hypernyms, hyponyms) can enhance query
terms, allowing a broader and more meaningful search.
• Voorhees (1994) used these relations to expand queries, improving retrieval performance
by going beyond simple keyword matching.
Document Structuring and Categorization:
WordNet’s conceptual framework and semantic relationships have been employed for text
categorization, helping systems classify documents more effectively.
• Scott and Matwin (1998) leveraged this approach for document classification tasks.
Document Summarization:
WordNet aids in generating lexical chains—sequences of semantically related words—that help
identify key topics and coherence in texts.
• Barzilay and Elhadad (1997) used this technique to improve text summarization.


3. FRAMENET
FrameNet, a rich lexical database focused on semantically annotated English sentences,
grounded in frame semantics.
1. Frame Semantics:
Each word (especially verbs, nouns, adjectives) evokes a specific situation or event
known as a frame.
2. Target Word / Predicate:
The word that evokes the frame (e.g., nab in the ARREST frame).
3. Frame Elements (FEs):
These are semantic roles or participants in the frame-specific event (e.g.,
AUTHORITIES, SUSPECT, TIME in the ARREST frame).
o These roles define the predicate-argument structure of the sentence.
4. Annotated Sentences:
Sentences, often drawn from the British National Corpus, are tagged with frame
elements to illustrate how words function in context.
5. Ontology Representation:
FrameNet provides a semantic-level ontology of language, representing not just
grammatical but also contextual and conceptual relationships.
Example:
In the sentence, “The police nabbed the suspect,” the word nab triggers the ARREST frame:
• The police → AUTHORITIES
• The suspect → SUSPECT
[Authorities The police] nabbed [Suspect the snatcher]
FrameNet thus provides a structured and nuanced way to model meaning and roles in language,
making it valuable for tasks such as semantic role labeling, information extraction, and natural
language understanding.

The COMMUNICATION frame includes roles like ADDRESSEE, COMMUNICATOR, TOPIC, and
MEDIUM. The JUDGEMENT frame includes JUDGE, EVALUEE, and REASON. Frames can
inherit roles from others; for instance, the STATEMENT frame inherits from COMMUNICATION and
includes roles such as SPEAKER, ADDRESSEE, and MESSAGE.
The following sentences show some of these roles:
[Judge She] [Evaluee blames the police] [Reason for failing to provide enough protection].
[Speaker She] told [Addressee me] [Message 'I’ll return by 7:00 pm today'].
Figure 9 shows the core and non-core frame elements of the COMMUNICATION frame, along with
other details.


Figure 9 Frame elements of communication frame

3.1 FrameNet Applications


FrameNet supports semantic parsing and information extraction by providing shallow
semantic roles that reveal meaning beyond syntax. For example, the noun "match" plays the
same theme role in both sentences below, despite differing syntactic positions:
The umpire stopped the match.
The match stopped due to bad weather.
FrameNet also enhances question-answering systems by enabling role-based reasoning. For
instance, in the TRANSFER frame, verbs like "send" and "receive" share roles such as
SENDER, RECIPIENT, and GOODS, allowing a system to infer that:
Q: Who sent a packet to Khushbu?
A: Khushbu received a packet from the examination cell.
Additional applications of FrameNet include:
• Information retrieval (IR)
• Machine translation (interlingua design)
• Text summarization
• Word sense disambiguation
These uses highlight FrameNet’s importance in understanding and processing natural language
at a deeper semantic level.


4. STEMMERS:
Stemming (or conflation) is the process of reducing inflected or derived words to their base
or root form. The resulting stem doesn't need to be a valid word, as long as related terms map
to the same stem.
Purpose:
• Helps in query expansion, indexing (e.g., in search engines), and various NLP tasks.
Common Stemming Algorithms:
• Porter's Stemmer – Most widely used (Porter, 1980).
• Lovins Stemmer – An earlier approach (Lovins, 1968).
• Paice/Husk Stemmer – A more recent and flexible method (Paice, 1990).
These tools, called stemmers, differ in how aggressively they reduce words but all aim to
improve text processing by grouping word variants.
Figure 10 shows a sample text and output produced using these stemmers.
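If NLTK is installed, the Porter stemmer can be tried directly; note that a stem such as 'comput' need not be a valid word.

```python
from nltk.stem.porter import PorterStemmer   # assumes NLTK is installed

stemmer = PorterStemmer()
words = ["compute", "computing", "computes", "computer", "retrieval", "retrieving"]
print([stemmer.stem(w) for w in words])
# related word forms map to the same stem, which need not be a dictionary word
```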

4.1 Stemmers for European Languages:


• Snowball provides stemmers for many European languages:
o Examples: English, French, Spanish, Russian, Portuguese, German, Dutch,
Hungarian, Italian, Swedish, Norwegian, Danish, Finnish
o Available at: http://snowball.tartarus.org/texts/stemmersoverview.html
4.2 Stemmers for Indian Languages:
• Standard stemmers for Indian languages like Hindi are limited.
• Notable research:
o Ramanathan and Rao (2003): Used handcrafted suffix lists for Hindi.


o Majumder et al. (2007): Used a cluster-based approach, evaluated using


Bengali data, and found that stemming improves recall.
• CFILT, IIT Bombay has developed stemmers for Indian languages:
o http://www.cfilt.iitb.ac.in
4.3 Stemming Applications:
• Widely used in search engines and IR systems:
o Reduces word variants to a common form, improving recall and reducing index
size.
o Example: "astronaut" and "astronauts" are treated as the same term.
o However, for English, stemming may not always improve precision.
• Also applied in:
o Text summarization
o Text categorization
o Helps in term frequency analysis by consolidating word forms into stems.

5. PART-OF-SPEECH TAGGER
Part-of-speech tagging is a crucial early-stage NLP technique used in applications like speech
synthesis, machine translation, information retrieval (IR), and information extraction. In
IR, it helps with indexing, phrase extraction, and word sense disambiguation.
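As a quick illustration (separate from the research taggers listed below), NLTK ships a pre-trained English POS tagger that can be invoked in a couple of lines, assuming NLTK and its tokenizer/tagger models are installed:

```python
import nltk   # assumes NLTK plus its tokenizer and tagger models are downloaded

sentence = "Information retrieval systems index large document collections."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # list of (token, Penn Treebank tag) pairs
```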

5.1 Stanford Log-linear POS Tagger


• Model Type: Maximum Entropy Markov Model
• Key Features:
o Uses preceding and following tag contexts via a dependency network.
o Employs a wide range of lexical features.
o Incorporates priors in conditional log-linear models.
• Accuracy: 97.24% on Penn Treebank WSJ
• Improvement: 4.4% error reduction over previous best (Toutanova et al., 2003)
• More Info: http://nlp.stanford.edu/software/tagger.shtml

5.2 A Part-of-Speech Tagger for English


• Model Type: Maximum Entropy Markov Model (MEMM)
• Inference: Bi-directional inference algorithm
o Enumerates all possible decompositions to find the best sequence.


• Performance:
o Outperforms unidirectional methods.
o Comparable to top algorithms like kernel SVMs.
• Reference: Tsuruoka and Tsujii (2005)

5.3 TnT Tagger (Trigrams'n'Tags)


• Model Type: Hidden Markov Model (HMM)
• Features:
o Uses trigrams, smoothing, and handling of unknown words.
• Efficiency: Performs as well as other modern methods, including maximum entropy
models.
• Reference: Brants (2000)

Table 12.1 shows tagged text of document #93 of the CACM collection.

5.4 Brill Tagger


• Type: Rule-based, transformation-based learning.
• Key Features:
o Learns tagging rules automatically.
o Handles unknown words.
o Supports k-best tagging (multiple tags in uncertain cases).
• Performance: Comparable to statistical methods.
• Brill tagger is available for download at the link http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z.

5.5 CLAWS Tagger


• Type: Hybrid (probabilistic + rule-based).
• Developed By: University of Lancaster.
• Accuracy: 96–97%, depending on text type.


• Adaptability: Works with diverse input formats.


• More Info:
5.6 Tree-Tagger
• Type: Probabilistic (uses decision trees).
• Strengths:
o Effective with sparse data.
o Automatically selects optimal context size.
• Accuracy: Above 96% on Penn Treebank.
• The tagger is available at the link
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

5.7 ACOPOST Collection


• Language: C
• Includes:
1. Maximum Entropy Tagger (MET) – Iterative feature-based method.
2. Trigram Tagger (T3) – HMM-based using tag pairs.
3. Transformation-based Tagger (TBT) – Based on Brill's rule learning.
4. Example-based Tagger (ET) – Uses memory-based reasoning from past data.

5.8 POS Taggers for Indian Languages


• Challenge: Lack of annotated corpora and basic NLP tools.
• Development Centers:
o IIT Bombay: Developing POS taggers for Hindi and Marathi using a
bootstrapping + statistical approach.
o Other Institutes: CDAC, IIIT Hyderabad, CIIL Mysore, University of
Hyderabad.
o Urdu Tagging: Reported by Hardie (2003) and Baker et al. (2004).
o More information can be found at http://ltrc.iiit.net and www.cse.iitb.ac.in


6. RESEARCH CORPORA
Research corpora have been developed for a number of NLP-related tasks. In the following
section, we point out few of the available standard document collections for a variety of NLP-
related tasks, along with their Internet links.

6.1 IR Test Collection

Glasgow University, UK, maintains a list of freely available IR test collections. The table lists the
sources of those and a few more IR test collections. LETOR (learning to rank) is a package of
benchmark data sets released by Microsoft Research Asia. It consists of two datasets, OHSUMED
and TREC (TD2003 and TD2004).

LETOR is packaged with extracted features for each query-document pair in the collection,
baseline results of several state-of-the-art learning-to-rank algorithms on the data and evaluation
tools. The data set is aimed at supporting future research in the area of learning ranking function
for information retrieval.

6.2 Summarization Data

Evaluating a text summarization system requires the existence of 'gold summaries'. DUC provides
document collections with known extracts and abstracts, which are used for evaluating the
performance of summarization systems submitted at TREC conferences. Figure 11 shows a
sample document and its extract from DUC 2002 summarization data.


6.3 Word Sense Disambiguation

SEMCOR is a sense-tagged corpus used in disambiguation. It is a subset of the Brown corpus,
sense-tagged with WordNet synsets.

Open Mind Word Expert attempts to create a very large sense-tagged corpus. It collects word
sense tagging from the general public over the Web.

6.4 Asian Language Corpora


The EMILLE (Enabling Minority Language Engineering) corpus is a multilingual resource
developed at Lancaster University, UK, aimed at supporting natural language processing (NLP)
for South Asian languages. The project, in collaboration with the Central Institute for Indian
Languages (CIIL) in India, provides extensive data and tools for various Indian languages. The
corpus includes monolingual written and spoken corpora, parallel corpora, and annotated data.
The monolingual written corpus covers 14 South Asian languages, while the spoken data,
sourced from BBC Asia radio broadcasts, includes five languages: Hindi, Bengali, Gujarati,
Punjabi, and Urdu. The parallel corpus consists of English texts and their translations into five
languages, featuring materials like UK government advice leaflets, aligned at the sentence level.
The annotated section includes part-of-speech tagging for Urdu and annotations of demonstrative
usage in Hindi.
The EMILLE/CIIL corpus is available free of charge for research purposes at elda.org, with
further details provided in the manual at emille.lancs.ac.uk. This resource is particularly valuable
for research in statistical machine translation and other NLP applications involving Indian
languages, despite challenges posed by the limited availability of electronic text repositories in
these languages.

7. JOURNALS AND CONFERENCES IN THE AREA


Major NLP Research Bodies and Conferences:
• ACM (Association for Computing Machinery)
• ACL (Association for Computational Linguistics) and EACL (European Chapter)
• RIAO (Recherche d'Information Assistée par Ordinateur)
• COLING (International Conferences on Computational Linguistics)
Key Conferences:
• ACM SIGIR: A leading international conference on Information Retrieval (IR); the 30th
conference was held in Amsterdam (July 23–27, 2007).


• TREC (Text Retrieval Conferences): Organized by the US government (NIST), providing
standardized IR evaluation results; formerly called the Document/Message Understanding
Conferences.
• NTCIR: Focuses on IR for Japanese and other Asian languages.
• ECIR: European counterpart of SIGIR.
• KES (Knowledge-Based and Intelligent Engineering & Information Systems): Focuses
on intelligent systems, including NLP, neural networks, fuzzy logic, and web mining.
• HLT-NAACL: Sponsored by the North American chapter of ACL; covers human
language technologies.
Notable Journals:
• Journal of Computational Linguistics: Focuses on theoretical and linguistic aspects.
• Natural Language Engineering Journal: Focuses on practical NLP applications.
• Information Retrieval (Kluwer), Information Processing and Management (Elsevier),
ACM TOIS (Transactions on Information Systems), Journal of the American Society for
Information Science.
Other Relevant Journals:
• International Journal of Information Technology and Decision Making (World Scientific)
• Journal of Digital Information Management
• Journal of Information Systems
AI Journals Reporting NLP Work:
• Artificial Intelligence
• Computational Intelligence
• IEEE Transactions on Intelligent Systems
• Journal of AI Research


Module – 5
Machine Translation
Machine Translation: Language Divergences and Typology, Machine Translation using
Encoder-Decoder, Details of the Encoder-Decoder Model, Translating in Low-Resource
Situations, MT Evaluation, Bias and Ethical Issues.
Textbook 2: Ch. 13. (Exclude 13.4)

Overview:

• Machine Translation: The use of computers to translate from one language to another.
• MT for information access is probably one of the most common uses of NLP
o We might want to translate some instructions on the web, perhaps the recipe for
a favorite dish, or the steps for putting together some furniture.
o We might want to read an article in a newspaper, or get information from an online
resource like Wikipedia or a government webpage in some other language.
o Google Translate alone translates hundreds of billions of words a day between
over 100 languages.
• Another common use of machine translation is to aid human translators.
o This task is often called computer-aided translation or CAT.
o CAT is commonly used as part of localization: the task of adapting content or a
product to a particular language community.
• Finally, a more recent application of MT is to in-the-moment human communication
needs. This includes incremental translation, translating speech on-the-fly before the
entire sentence is complete, as is commonly used in simultaneous interpretation.
• Image-centric translation can be used for example to use OCR of the text on a phone
camera image as input to an MT system to translate menus or street signs.


5.1 Language Divergences and Typology


There are about 7,000 languages in the world, yet some aspects of human language seem to be
universal. For example, every language seems to have words for referring to people and for
talking about eating. There are also structural linguistic universals; for example, every language
seems to have nouns and verbs. At the same time, languages differ in many ways, and people
have noticed this since ancient times (see Fig. 5.1).

Fig 5.1: The Tower of Babel, Pieter Bruegel 1563. Wikimedia Commons, from the
Kunsthistorisches Museum, Vienna.
Story of The Tower of Babel (Bruegel, 1563):
• Bruegel’s painting depicts the biblical story from Genesis 11, where humanity, speaking
a single language, tries to build a tower to reach the heavens.
• As a divine response, God confuses their language, causing miscommunication and
halting the project. It’s a cautionary tale about human ambition and the limits of
communication.
To build better machine translation (MT) systems, we need to understand why translations can
be different (Dorr, 1994).
• Differences about words themselves. For example, each language has a different word
for "dog." These are called idiosyncratic or lexical differences, and we handle them one
by one.
• Differences about patterns. For example, some languages put the verb before the object.
Others put the verb after the object. These are systematic differences that we can model
more generally. The study of these patterns across languages is called linguistic
typology.


Different types of Language Divergences and Typology:


• Word Order Typology
• Lexical Divergences
• Morphological Typology
• Referential density

5.1.1 Word Order Typology


• Languages differ in the basic word order of verbs, subjects, and objects in simple
declarative clauses; English and Japanese, for example, order them differently.
• German, French, English, and Mandarin are all SVO (Subject-Verb-Object) languages.
• Hindi and Japanese, by contrast, are SOV languages, in which the verb tends to come at
the end of basic clauses.
• Irish and Arabic are VSO languages.

Two languages that share their basic word order type often have other similarities.
For example, VO languages generally have prepositions, whereas OV languages
generally have postpositions.
VO (e.g., English): the verb wrote is followed by its object a letter and the
prepositional phrase to a friend, in which the preposition to is followed by its
argument a friend.

• Arabic, with a VSO order, also has the verb before the object and prepositions.
• Other kinds of ordering preferences vary idiosyncratically: in some SVO languages (like
English and Mandarin) adjectives tend to appear before nouns, while in other languages,
like Spanish and Modern Hebrew, adjectives appear after the noun.

Fig. shows examples of other word order differences. All of these word order differences
between languages can cause problems for translation, requiring the system to do huge structural
reorderings as it generates the output.


5.1.2 Lexical Divergences


When translating, the individual words of one language must be mapped onto words of another,
and the appropriate word can vary depending on the context.
• Bass in English can correspond, in Spanish, to the fish lubina or to the musical instrument bajo.
• A wall in English corresponds to two German words: Wand (a wall inside a building) and Mauer
(a wall outside a building).
• English uses the word brother for any male sibling, while Chinese and many other languages
have distinct words for older brother and younger brother (Mandarin gege and didi, respectively).
• One language may place more grammatical constraints on word choice than another.
o English marks nouns for whether they are singular or plural; Mandarin doesn’t. French
and Spanish, unlike English, mark grammatical gender on adjectives.
• Languages lexically divide up conceptual space differently, leading to many-to-many mappings.
o English leg, foot, and paw map onto several different French words.
o For example, when leg is used about an animal → patte; the leg of a journey → étape; the
leg of a chair → pied.
• Lexical gap: a language may have no word that corresponds neatly to a concept expressed in
another; English, for example, has no single word for the idea behind phrases like filial piety or
loving child, or good son/daughter.

Fig.: The complex overlap between English leg, foot, etc.,

• Verb-framed languages (Spanish, French, Japanese)


o Verb shows the direction or path of movement.
o The manner (how something moves) is optional or added separately.
Examples:
o Entrar = to enter
o Salir = to go out
o Subir = to go up
• Satellite-framed languages (English, German, Russian)
o The verb shows the manner of movement.
o The direction or path is shown in a satellite (like a preposition or particle).
Examples:

o Run out = run (manner) + out (path)


o Jump over = jump (manner) + over (path)
o Slide down = slide (manner) + down (path)


5.1.3 Morphological Typology


• Languages differ morphologically along two main dimensions:
1. Number of Morphemes per Word:
• Isolating languages (e.g., Vietnamese, Cantonese):
o One morpheme per word.
o Words are simple and not combined.
• Polysynthetic languages (e.g., Siberian Yupik):
o Many morphemes in one word.
o One word can express a full sentence.
2. Segmentability of Morphemes:
• Agglutinative languages (e.g., Turkish):
o Morphemes are clearly separable.
o Each morpheme represents one meaning or function.
• Fusional languages (e.g., Russian):
o Morphemes blend together.
o One morpheme may carry multiple meanings (e.g., case, number, gender).
In Machine Translation:
• Translating morphologically rich languages requires understanding parts of words.
• Modern systems often use subword models (like BPE or WordPiece) to handle this.

5.1.4 Referential density


• Measures how often a language uses explicit pronouns.
• High referential density (hot languages):
o Use pronouns more frequently.
o Easier for the listener (e.g., English).
• Low referential density (cold languages):
o Omit pronouns often.
o Listener must infer more (e.g., Chinese, Japanese).
Hot vs. Cold Languages:
• Hot languages = more explicit, easier for the listener.
• Cold languages = less explicit, require more inference.
Translation Challenge:
• Translating from pro-drop (cold) to non-pro-drop (hot) languages is tricky.
• The system must infer missing pronouns and insert the correct ones.


5.2 Machine Translation using Encoder-Decoder (sequence to- sequence model)


• MT translates each sentence independently from the source language to the target language.
o The green witch arrived (English) → Llegó la bruja verde (Spanish).
• MT uses supervised machine learning:
o The system is given a large set of parallel sentences.
o It learns to map source sentences into target sentences.
o The sentences are split into sequences of subword tokens.
o The system is then trained to maximize the probability of the sequence of tokens in the
target language y1, ..., ym given the sequence of tokens in the source language
x1, ..., xn, i.e., to maximize P(y1, ..., ym | x1, ..., xn).
• Rather than use the input tokens directly, the encoder-decoder architecture is used.
o The encoder takes the input words x = [x1, ..., xn] and produces an intermediate
context h.
o The decoder then takes h and, word by word, generates the output y.

5.2.1 Tokenization
• Machine translation systems use a fixed vocabulary decided in advance.
• The vocabulary is built by running a tokenization algorithm on both source and target
language texts together.
• This vocabulary is made using subword tokenization, not by splitting at spaces.
• An example of a subword tokenization method is BPE (Byte Pair Encoding).
• One shared vocabulary is used for both source and target languages.
• This sharing makes it easy to copy names and special words from one language to
another.
• Subword tokenization works well for languages with spaces (like English, Hindi) and no
spaces (like Chinese, Thai).
• Modern systems use better algorithms than simple BPE.
o For example, BERT uses WordPiece, a smarter version of BPE.
• WordPiece chooses – merges that improve the language model probability, not just
based on frequency.
• Wordpieces use a special symbol at the beginning of each token;


The wordpiece algorithm is given a training corpus and a desired vocabulary size V, and
proceeds as follows:
1. Initialize the wordpiece lexicon with characters (for example a subset of Unicode
characters, collapsing all the remaining characters to a special unknown character token).
2. Repeat until there are V wordpieces:
(a) Train an n-gram language model on the training corpus, using the current set of
wordpieces.
(b) Consider the set of possible new wordpieces made by concatenating two wordpieces
from the current lexicon. Choose the one new wordpiece that most increases the
language model probability of the training corpus.
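
As a concrete (simplified) illustration, the sketch below implements the closely related BPE merge loop on a toy corpus. This is our own example: WordPiece differs in step 2(b) by scoring candidate merges with a language model rather than by raw frequency.

# Toy BPE-style merging (illustrative sketch; WordPiece would score candidate
# merges by language-model probability instead of raw frequency).
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs over a vocabulary mapping symbol-tuples to frequencies.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of `pair` with a single concatenated symbol.
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters plus an end-of-word marker "_".
vocab = {tuple("low") + ("_",): 5, tuple("lower") + ("_",): 2,
         tuple("newest") + ("_",): 6, tuple("widest") + ("_",): 3}

for step in range(8):                    # perform 8 merges for the demo
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair (BPE criterion)
    vocab = merge_pair(best, vocab)
    print("merge", step + 1, ":", best)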

Unigram Model:
• Unlike BPE, which requires specifying the number of merges, WordPiece and the
unigram algorithm let users define a target vocabulary size, typically between 8K–32K
tokens.
• The unigram algorithm, often referred to as SentencePiece (its implementation library),
starts with a large initial vocabulary of characters and frequent character sequences.
• Unigram iteratively removes low-probability tokens using statistical modeling (like the
EM algorithm) until reaching the desired size.
• It generally outperforms BPE by avoiding overly fragmented or non-meaningful tokens
and better handling common subword patterns.

5.2.2 Creating the Training data


Machine translation models are trained on a parallel corpus, sometimes called a bitext, a text
that appears in two (or more) languages.
• Europarl corpus: Proceedings of the European Parliament, contains between 400,000
and 2 million sentences each from 21 European languages.
• The United Nations Parallel Corpus: contains on the order of 10 million sentences in
the six official languages of the United Nations (Arabic, Chinese, English, French,
Russian, Spanish).


• OpenSubtitles corpus & ParaCrawl corpus: Movie and TV subtitles & general web
text, 223 million sentence pairs between 23 EU languages and English extracted from the
CommonCrawl.
Sentence alignment
Standard training corpora for MT come as aligned pairs of sentences.
When creating new corpora, for example for underresourced languages or new domains, these
sentence alignments must be created.

Fig: A sample alignment between sentences in English and French, with sentences extracted from Antoine de
Saint-Exupéry’s Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e1, ..., en
and f1, ..., fm and finds minimal sets of sentences that are translations of each other, including single
sentence mappings like (e1, f1), (e4, f3), (e5, f4), (e6, f6) as well as 2-1 alignments (e2/e3,f2), (e7/e8,f7), and
null alignments (f5).

we generally need two steps to produce sentence alignments:


1. A cost function that takes a span of source sentences and a span of target sentences and
returns a score measuring how likely these spans are to be translations.
2. An alignment algorithm that takes these scores to find a good alignment between the
documents.
The cost function between two sentences or spans x, y from the source and target documents
respectively can be written as:

c(x, y) = ( (1 - cos(x, y)) nSents(x) nSents(y) ) / ( Σs (1 - cos(x, ys)) + Σs (1 - cos(xs, y)) )

where cos(x, y) is the cosine similarity of the multilingual embeddings of x and y, and nSents()
gives the number of sentences (this biases the metric toward many alignments of single sentences
instead of aligning very large spans). The denominator helps to normalize the similarities; xs and
ys are randomly selected sentences sampled from the respective documents.
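
A small sketch of this cost computation is given below, assuming each sentence or span has already been mapped to a multilingual embedding vector (the embeddings themselves would come from some sentence encoder, which is outside this sketch).

# Span-alignment cost sketch; inputs are embedding vectors (numpy arrays).
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def span_cost(x_vec, y_vec, n_sents_x, n_sents_y, x_samples, y_samples):
    # x_samples / y_samples: embeddings of randomly sampled sentences from each document.
    numerator = (1 - cos(x_vec, y_vec)) * n_sents_x * n_sents_y
    denominator = (sum(1 - cos(x_vec, ys) for ys in y_samples) +
                   sum(1 - cos(xs, y_vec) for xs in x_samples))
    return numerator / denominator

# Tiny demo with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
x, y = rng.normal(size=16), rng.normal(size=16)
samples = [rng.normal(size=16) for _ in range(5)]
print(span_cost(x, y, 1, 1, samples, samples))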


5.3 Details of the Encoder-Decoder Model

Fig. The encoder-decoder transformer architecture for machine translation.

• Fig. shows the intuition of the architecture at a high level.


• The encoder-decoder architecture is made up of two transformers: an encoder, which is
the same as the basic transformers, and a decoder, which is augmented with a special new
layer called the cross-attention layer.
• The encoder takes the source language input word tokens X = x1, ..., xn and maps them
to an output representation Henc = h1, ..., hn via a stack of encoder blocks.
• The decoder attends to the encoder representation and generates the target words one by
one.
o At each timestep conditioning on the source sentence and the previously
generated target language words to generate a token.
• In order to attend to the source language, the transformer blocks in the decoder have an
extra cross-attention layer.
• The cross-attention layer sits alongside the usual components of a block: a self-attention
layer that attends to the input from the previous layer, followed by layer norm, a
feed-forward layer, and another layer norm.


Each encoder block consists of:


1. Multi-Head Self-Attention
o Allows the model to focus on different positions in the input sequence
simultaneously.
o Uses scaled dot-product attention in multiple “heads.”
2. Add & Layer Normalization
o Adds the input and the output of the attention layer (residual connection).
o Applies layer normalization.
3. Feed-Forward Neural Network (FFN)
o Two linear layers with a ReLU (or GELU) in between.
o Applies to each position independently.
4. Add & Layer Normalization (again)
o Adds the input and output of the FFN and normalizes.

Each decoder block adds an additional attention layer:


1. Masked Multi-Head Self-Attention
o Prevents attending to future tokens (important during training).
2. Add & Norm
3. Encoder-Decoder Attention (Cross attention)
o The decoder attends to encoder outputs.
o Helps generate context-aware outputs.
4. Add & Norm
5. Feed-Forward Neural Network
6. Add & Norm

Cross-attention is given by:

CrossAttention(Q, K, V) = softmax( Q K^T / √dk ) V

where (dk is the dimensionality of the key vectors),

• In ordinary multi-head self-attention, the input to each attention layer is X, the output of
the previous layer of the same stack.
• The final output of the encoder is Henc = h1, ..., hn; Henc is of shape [n, d].
• The query (Q) comes from the output of the prior decoder layer, which is multiplied by
the cross-attention layer’s query weights WQ.


• K is obtained by multiplying the encoder output Henc by the cross-attention layer’s key weights WK.
• V is obtained by multiplying the encoder output Henc by the cross-attention layer’s value weights WV.
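
The following is a minimal single-head numpy sketch of this cross-attention computation. It is our own illustration: a real transformer uses multiple heads, biases, and layer normalization around this step.

# Single-head cross-attention sketch: decoder states attend to encoder outputs.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H_dec_prev, H_enc, W_Q, W_K, W_V):
    # H_dec_prev: [m, d] output of the prior decoder layer; H_enc: [n, d] encoder output.
    Q = H_dec_prev @ W_Q          # queries come from the decoder side
    K = H_enc @ W_K               # keys come from the encoder output
    V = H_enc @ W_V               # values come from the encoder output
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # [m, n]: each target position scored against each source token
    return softmax(scores) @ V        # [m, d]: context-aware decoder representations

rng = np.random.default_rng(0)
n, m, d = 6, 4, 8                     # 6 source tokens, 4 target positions, model width 8
H_enc, H_dec = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H_dec, H_enc, W_Q, W_K, W_V).shape)   # (4, 8)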

• To train an encoder-decoder model, we use the same self-supervised training method we
used for encoder-decoder RNNs.
• The network is given the source text and then starting with the separator token is trained
autoregressively to predict the next token using cross-entropy loss.
• Cross-entropy is determined by the probability the model assigns to the correct next word.
• We use teacher forcing in the decoder, at each time step in decoding we force the system
to use the gold target token from training as the next input.
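
A compact sketch of this training step with teacher forcing is shown below, written in PyTorch style; the model here is a placeholder assumed to return logits of shape [batch, target_length, vocab].

# Teacher-forcing training step sketch (placeholder model; assumes token-id tensors).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, src_ids, tgt_ids, pad_id):
    tgt_in = tgt_ids[:, :-1]        # gold target tokens fed as decoder input (shifted right)
    tgt_out = tgt_ids[:, 1:]        # gold next tokens the decoder must predict
    logits = model(src_ids, tgt_in)                      # [batch, tgt_len-1, vocab]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1),
                           ignore_index=pad_id)          # cross-entropy on the correct next word
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()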

5.4 Translating in low-resource situations


• For some languages, and especially for English, online resources are widely available.
• There are many large parallel corpora that contain translations between English and many
languages.
• But, the vast majority of the world’s languages do not have large parallel training texts
available.
How can we get good translation for lower-resourced languages, given this data sparsity?
Two commonly used approaches are data augmentation (backtranslation) and multilingual
models.

5.4.1 Data Augmentation


• Statistical technique for dealing with insufficient training data.
• Adding new synthetic data that is generated from the current natural data.
• Backtranslation is a common data augmentation technique in machine translation (MT)
using monolingual target-language data (text written only in the target language) to
create synthetic parallel data.
• It addresses the scarcity of parallel corpora by generating synthetic bitexts from abundant
monolingual data.
• The process involves training a target-to-source MT model using available parallel data,
then using it to translate monolingual target-language text into the source language.
• The resulting synthetic source-target pairs are added to the training data to improve the
original source-to-target MT model.


• Backtranslation includes configurable parameters like decoding methods (greedy, beam
search, sampling) and data ratio settings (e.g., upsampling real bitext).
• It is highly effective—studies suggest it provides about two-thirds the benefit of training
on real bitext.
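
A high-level sketch of the backtranslation recipe described above is given below; train_mt() and translate() are hypothetical placeholders standing in for whatever MT toolkit is being used.

# Backtranslation pipeline sketch (train_mt and translate are hypothetical placeholders).
def train_mt(pairs):
    # Placeholder: train a translation model on (source, target) sentence pairs.
    raise NotImplementedError

def translate(model, sentence):
    # Placeholder: translate a single sentence with a trained model.
    raise NotImplementedError

def backtranslation(parallel_pairs, mono_target_sents, upsample=2):
    # 1. Train a reverse (target -> source) model on the available real bitext.
    reverse_model = train_mt([(tgt, src) for (src, tgt) in parallel_pairs])
    # 2. Translate monolingual target-language text into the source language,
    #    producing synthetic (source, target) pairs.
    synthetic = [(translate(reverse_model, tgt), tgt) for tgt in mono_target_sents]
    # 3. Train the final source -> target model on upsampled real bitext plus synthetic bitext.
    return train_mt(parallel_pairs * upsample + synthetic)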


5.4.2 Multilingual models


• The models we’ve described so far are for bilingual translation: one source language, one
target language.
• It’s also possible to build a multilingual translator. In a multilingual translator, we train
the system by giving it parallel sentences in many different pairs of languages.
• We tell the system which language is which by adding a special token ls to the encoder
specifying the source language we’re translating from, and a special token lt to the
decoder telling it the target language we’d like to translate into.
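
For instance (the exact tag format below is just an assumed convention; real systems differ), the training pairs can be prepared by prepending language tags to the token sequences:

# Prepending language tags for a multilingual model (tag format is an assumption).
def add_language_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    src = ["<" + src_lang + ">"] + src_tokens      # tells the encoder the source language (ls)
    tgt = ["<2" + tgt_lang + ">"] + tgt_tokens     # tells the decoder the target language (lt)
    return src, tgt

print(add_language_tokens(["the", "green", "witch"], ["la", "bruja", "verde"], "en", "es"))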

One advantage of multilingual models is that they can improve the translation of lower-
resourced languages by drawing on information from a similar language in the training data that
happens to have more resources.


5.4.3 Sociotechnical issues


• Limited native speaker involvement: Many low-resource language projects lack
participation from native speakers in content curation, technology development, and
evaluation.
• Low data quality: audits have found that only some multilingual datasets are of
acceptable quality; many contain errors, repeated content, or boilerplate text, likely due
to insufficient native speaker oversight.
• English-centric bias: Many MT systems prioritize language pairs involving English,
limiting the diversity of language coverage.
• Efforts to broaden coverage: Recent large-scale projects aim to support MT across
hundreds of languages, expanding beyond English-centric training.
• Participatory design: Researchers advocate for involving native speakers in all stages
of MT development, including through online communities, mentoring, and collaborative
infrastructure.
• Improved evaluation method: Instead of direct evaluation, post-editing MT outputs
(then measuring the difference) helps reduce bias from linguistic variation and simplifies
training evaluators.

5.5 MT Evaluation
Translations are evaluated along two dimensions:
1. Adequacy: how well the translation captures the exact meaning of the source sentence.
Sometimes called faithfulness or fidelity.
2. Fluency: how fluent the translation is in the target language (is it grammatical, clear,
readable, natural).
5.5.1 Using Human Raters to Evaluate MT
• Human evaluation is the most accurate method for assessing machine translation (MT)
quality, focusing on two main dimensions: fluency (how natural and readable the
translation is) and adequacy (how much meaning from the source is preserved).
• Raters, often crowdworkers, assign scores on a scale (e.g., 1–5 or 1–100) for each.
• Bilingual raters compare source and translation directly for adequacy, while monolingual
raters compare MT output with a human reference. Alternatively, raters may choose the
better of two translations.
• Proper training is crucial, as raters often struggle to distinguish fluency from adequacy.
• To ensure consistency, outliers are removed and ratings are normalized.


• Another evaluation method involves post-editing: raters minimally correct MT output,
and the extent of editing reflects translation quality.

5.5.2 Automatic Evaluation


Human evaluation can be time consuming and expensive. In this regard, automatic metrics are
often used as temporary proxies.
Automatic metrics are less accurate than human evaluation, but can help test potential system
improvements, and even be used as an automatic loss function for training.
Automatic Evaluation by Character Overlap: chrF
• The simplest and most robust metric for MT evaluation is called chrF, which stands for
character F-score.
• A good machine translation will tend to contain characters and words that occur in a
human translation of the same sentence.
• Consider a test set from a parallel corpus, in which each source sentence has both a gold
human target translation and a candidate MT translation we’d like to evaluate.
• chrP percentage of character 1-grams, 2-grams, ..., k-grams in the hypothesis that occur
in the reference, averaged.
• chrR percentage of character 1-grams, 2-grams,..., k-grams in the reference that occur
in the hypothesis, averaged.
• The metric then computes an F-score by combining chrP and chrR using a weighting
parameter β. It is common to set β = 2, thus weighing recall twice as much as precision:

chrFβ = (1 + β²) · chrP · chrR / (β² · chrP + chrR)
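
A simplified sketch of this computation is given below (our own illustration; the official chrF implementation, e.g. in sacrebleu, differs in details such as whitespace handling and averaging).

# Simplified chrF sketch over character n-grams.
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")               # simplification: drop spaces
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def chrf(reference, hypothesis, k=4, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, k + 1):
        ref, hyp = char_ngrams(reference, n), char_ngrams(hypothesis, n)
        overlap = sum((ref & hyp).values())    # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    chrP, chrR = sum(precisions) / k, sum(recalls) / k
    if chrP + chrR == 0:
        return 0.0
    return (1 + beta**2) * chrP * chrR / (beta**2 * chrP + chrR)

print(round(chrf("the green witch arrived", "the witch arrived"), 3))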



Alternative overlap metric: BLEU


• BLEU is a traditional metric used to evaluate machine translation quality.
• It is precision-based, not based on recall.
• It uses n-gram precision, meaning it checks how many n-grams (1 to 4 words in a row)
in the translation match the reference.
• It calculates a geometric mean of unigram to 4-gram precision scores.
• BLEU includes a brevity penalty to avoid favoring translations that are too short.
• It uses clipped counts, which means it limits how often a match is counted to avoid
overestimating quality.
• BLEU is word-based and sensitive to tokenization, so consistent tokenization is
important.
• BLEU doesn’t perform well with languages that have complex word forms.
• Tools like SACREBLEU are used to ensure consistent evaluation across different
systems.
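
As a usage sketch (assuming the sacrebleu package is installed; check its documentation for the current API), corpus-level BLEU can be computed like this:

# BLEU via sacrebleu (sketch; pip install sacrebleu).
import sacrebleu

hypotheses = ["the green witch arrived at the house"]      # system outputs, one per segment
references = [["the green witch arrived at the house"]]    # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)    # corpus-level BLEU on a 0-100 scale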
Statistical Significance Testing for MT evals
• chrF and BLEU are overlap-based metrics used to compare machine translation
(MT) systems.
• They help answer: Did the new translation system improve over the old one?
• To check if a difference in scores is statistically significant, we use tests like the
paired bootstrap test or randomization test.
• For a confidence interval on one system’s chrF score:
o Create many pseudo-testsets by sampling the original test set with
replacement.
o Calculate the chrF score for each one.
o Drop the top and bottom 2.5% of scores → the rest give a 95% confidence
interval.
• To compare two systems (A and B):
o Use the same pseudo-testsets.
o Compare chrF scores on each.
o Count the percentage of sets where A scores higher than B.
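
A sketch of this paired bootstrap comparison is shown below (our own illustration; scores_a and scores_b would hold per-sentence metric scores, e.g. sentence-level chrF, for the two systems on the same test set).

# Paired bootstrap sketch: resample the test set with replacement and count how
# often system A outscores system B on the pseudo-test sets.
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(scores_a)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]     # one pseudo-test set (sampled with replacement)
        a = sum(scores_a[i] for i in sample) / len(sample)
        b = sum(scores_b[i] for i in sample) / len(sample)
        wins += a > b
    return wins / n_samples      # fraction of pseudo-test sets where A beats B

print(paired_bootstrap([0.61, 0.70, 0.52, 0.80], [0.55, 0.72, 0.40, 0.75]))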
chrF: Limitations
• Metrics like chrF and BLEU are helpful but have important limitations.
• chrF is local — it doesn't handle large phrase movements or reordering well.
• chrF can’t assess document-level features like discourse coherence.


• These metrics are not good for comparing very different systems (e.g., human-aided
vs. machine translation).
• chrF works best for comparing small changes within the same system.

5.6 Bias and Ethical Issues


• MT systems can show gender bias, especially when translating from gender-neutral
languages (like Hungarian or Spanish) to gendered ones (like English).
• They may default to male pronouns due to lack of gender info or cultural stereotypes.
• Example: Hungarian gender-neutral pronoun "ő" becomes:
o “she” if the job is nurse
o “he” if the job is CEO
• These biases reflect and reinforce gender stereotypes, which is a serious ethical
concern.
• The WinoMT dataset tests MT systems on sentences involving non-stereotypical
gender roles.
• MT systems often perform worse on such sentences.
• Example: A system may mistranslate “The doctor asked the nurse to help her” if it
expects the doctor to be male.
