NLP MODULE 1: Chapters 1 & 2
TEXTBOOK 1: TANVEER SIDDIQUI, U.S. TIWARY, "NATURAL LANGUAGE PROCESSING", 2008.
MODULE 1: CHAPTER 1: INTRODUCTION
• Language is a method of communication for humans. By studying language, we come to understand more
about the world. Through language we can speak, read, and write. E.g., we think, make
decisions, and plan in natural language; precisely, in words.
• Natural Language Processing (NLP) is a sub-field of Computer Science, specifically of AI, concerned
with enabling computers to understand and process human language.
• Machines process and understand human language so that they can automatically perform
repetitive tasks.
• Its goal is to process text data (unstructured data) to perform tasks like translation, grammar checking, etc.
• Lexical (word level): It occurs when a word has more than one meaning, i.e., which sense to
choose. E.g., treating the word silver as a noun or as an adjective.
• Syntactic (sentence level / parsing): "The man saw the girl with the telescope." The sentence has
multiple meanings for a parser. This ambiguity concerns how words are grouped together in a
sentence: it is unclear whether the man saw a girl carrying a telescope, or he saw her through
his telescope.
• Referential ambiguity: It occurs when a phrase can have multiple interpretations because several
objects are referred to and the referencing is not clear.
• Ex: "Meena went to Geetha. She said that she was hungry." Here "she" can refer to either Meena or
Geetha.
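The syntactic ambiguity above can be made concrete with a small chart parser. Below is a minimal sketch (the toy grammar and lexicon are invented for illustration) that counts how many distinct parse trees a CFG assigns to the telescope sentence; a count of two means two readings:

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: binary rules plus a word lexicon.
LEXICON = {
    "the": {"Det"}, "man": {"N"}, "girl": {"N"},
    "telescope": {"N"}, "saw": {"V"}, "with": {"P"},
}
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase (girl has the telescope)
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase (man uses the telescope)
    ("PP", "P", "NP"),
]

def count_parses(words):
    """CKY-style chart that counts distinct parse trees for each span."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat in LEXICON[w]:
            chart[i][i + 1][cat] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    if chart[i][k][left] and chart[k][j][right]:
                        chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n]["S"]

sentence = "the man saw the girl with the telescope".split()
print(count_parses(sentence))  # → 2 (PP attaches to "the girl" or to "saw")
```

Without the prepositional phrase the ambiguity disappears and only one parse remains.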
Contd..
• NATURAL LANGUAGE GENERATION:
• TEXT PLANNING:
It includes retrieving the relevant content from the knowledge base.
• SENTENCE PLANNING:
It includes choosing the required words, forming meaningful phrases, and setting the
tone of the sentence.
It arranges words in a proper, meaningful way.
• TEXT REALIZATION:
It maps the sentence plan into sentence structure and displays it as output.
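The three stages above can be sketched as a toy pipeline; the knowledge base, the facts, and the wording templates are all invented for illustration:

```python
# Minimal sketch of the three NLG stages; all data and templates are invented.
KNOWLEDGE_BASE = {
    "delhi": {"type": "city", "weather": "sunny", "temp_c": 31},
    "pune":  {"type": "city", "weather": "rainy", "temp_c": 24},
}

def text_planning(entity):
    """Text planning: retrieve the relevant content from the knowledge base."""
    facts = KNOWLEDGE_BASE[entity]
    return {"entity": entity, "weather": facts["weather"], "temp": facts["temp_c"]}

def sentence_planning(content):
    """Sentence planning: choose words and form meaningful phrases."""
    subject = content["entity"].capitalize()
    predicate = f"is {content['weather']}"
    detail = f"at {content['temp']} degrees Celsius"
    return [subject, predicate, detail]

def text_realization(phrases):
    """Text realization: arrange the phrases into a final surface sentence."""
    return " ".join(phrases) + "."

print(text_realization(sentence_planning(text_planning("delhi"))))
# → "Delhi is sunny at 31 degrees Celsius."
```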
Contd..
ORIGINS/HISTORY OF NLP
• (1940-1960) - Focused on Machine Translation (MT): Natural Language
Processing started in the 1940s.
• 1948 - The first recognizable NLP application was introduced at
Birkbeck College, London.
• (1960-1980)- Flavored with Artificial Intelligence (AI): In the year 1960 to 1980, the
key developments were:
For example: "Neha broke the mirror with the hammer." In this example, case grammar
identifies Neha as the agent, the mirror as the theme, and the hammer as the instrument.
• SHRDLU: Between 1960 and 1980, the key systems were:
SHRDLU, a program written by Terry Winograd in 1968-70, which lets users communicate
with the computer and move objects.
Contd..
• LUNAR:
LUNAR is the classic example of a natural language database interface system; it used ATNs
and Woods' Procedural Semantics. It was capable of translating elaborate natural language
expressions into database queries and handled 78% of requests without errors.
• 1980 - Current: Until 1980, natural language processing systems were based on
complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for
language processing.
• In the early 1990s, NLP started growing faster and achieved good processing
accuracy, especially in English grammar.
• Now, modern NLP consists of various applications, like speech recognition, machine
translation, and machine text reading.
LANGUAGE AND KNOWLEDGE
• Language is a vital part of human connection. Although all species have their
ways of communicating, humans are the only ones that have mastered cognitive
language communication.
• Language allows us to share our ideas, thoughts, and feelings with others.
Challenges of NLP
1. Multiple languages: If you serve a multilingual and/or multicultural audience, you'll need to
provide support for multiple languages.
2. Training data:
At its core, NLP is all about analyzing language to better understand it. A human being must be
immersed in a language constantly for a period of years to become fluent in it; even the best AI
must also spend a significant amount of time reading, listening to, and utilizing a language.
Contd..
3. Misspellings:
Misspellings are a simple problem for human beings. But for a machine, misspellings can be harder to
identify. You’ll need to use an NLP tool with capabilities to recognize common misspellings of words,
and move beyond them.
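One standard way to recognize and move beyond misspellings is to map each unknown word to the closest in-vocabulary word by edit distance. A minimal sketch (the vocabulary is illustrative):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Map a likely misspelling to the closest vocabulary word, if close enough."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word

vocab = {"language", "processing", "natural", "machine"}
print(correct("langauge", vocab))  # → "language"
```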
4. Words with multiple meanings: No language is perfect, and most languages have words that
have multiple meanings. For example, a user who asks, “how are you” has a totally different goal than a
user who asks something like “how do I add a new credit card?” Good NLP tools should be able to
differentiate between these phrases with the help of context.
Your NLP AI needs to be able to keep the conversation moving, providing additional questions to collect
more information.
NLP Applications
• Speech Synthesis: It is used for converting text to speech. Voice synthesizers such as
Alexa and Google Home are a few well-known examples.
Advantages of NLP
• NLP offers exact answers to a question; it does not offer unnecessary or
unwanted information.
• Most companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
Disadvantages of NLP
• Data privacy: Privacy issues arise mainly due to NLP's reliance on
personal information collected from users via apps and websites.
• Limited adaptability: An NLP system is often unable to adapt to a new domain and has limited
functionality; for this reason, it is usually built for a single, specific task.
Language and Grammar in NLP
• Grammar defines a language. It consists of a set of rules that allows us to parse and
generate sentences in the language.
• This includes identifying parts of speech such as nouns, verbs, and adjectives,
determining the subject and predicate of a sentence, and identifying the
relationships between words and phrases.
What is Grammar?
• Grammar is defined as the rules for forming well-structured sentences.
• Grammar also plays an essential role in describing the syntactic structure of well-
formed programs, like denoting the syntactical rules used for conversation in
natural languages.
• In the C programming language, the precise grammar rules state how functions
are made with the help of lists and statements.
Contd..
• Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P), where N is the set of
non-terminal symbols (VN), T is the set of terminal symbols (Σ), S ∈ N is the start symbol, and P
is the set of production rules.
o Each production has the form α → β, where α and β are strings over VN ∪ Σ, and at least one
symbol of α belongs to VN.
Syntax
• Each natural language has an underlying structure, usually referred to as its
syntax.
• The fundamental idea of syntax is that words group together to form
constituents: groups of words or phrases which behave as a single unit.
• Context Free Grammar: It consists of a set of rules expressing how symbols of the
language can be grouped and ordered together and a lexicon of words and symbols.
o One example rule expresses that an NP (noun phrase) can be composed of
either a ProperNoun or a determiner (Det) followed by a Nominal; a Nominal in turn
can consist of one or more Nouns: NP → Det Nominal, NP → ProperNoun; Nominal
→ Noun | Nominal Noun
Contd..
• A Context free grammar consists of a set of rules or productions, each expressing
the ways the symbols of the language can be grouped, and a lexicon of words.
• Context-free grammar (CFG) can also be seen as the list of rules that define the
set of all well-formed sentences in a language. Each rule has a left-hand side that
identifies a syntactic category and a right-hand side that defines its alternative
parts reading from left to right.
• Example: The rule s --> np vp means that "a sentence is defined as a noun
phrase followed by a verb phrase."
Contd..
• Formalism in rules for context-free grammar: A sentence in the
language defined by a CFG is a series of words that can be derived by
systematically applying the rules, beginning with a rule that has s on its left-hand
side.
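The derivation process just described can be sketched in a few lines. The rules below echo the NP/Nominal productions given earlier; the lexicon entries (the, flight, prefers, Houston) are illustrative assumptions, and the sketch always expands the first alternative so the derivation is deterministic:

```python
# Toy CFG: non-terminals map to lists of alternative expansions.
RULES = {
    "S":          [["NP", "VP"]],
    "NP":         [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal":    [["Noun"], ["Nominal", "Noun"]],
    "VP":         [["Verb", "NP"]],
    "Det":        [["the"], ["a"]],
    "Noun":       [["flight"], ["morning"]],
    "ProperNoun": [["Houston"]],
    "Verb":       [["prefers"]],
}

def derive(symbol):
    """Top-down leftmost derivation, always applying the first alternative."""
    if symbol not in RULES:      # terminal: a word of the language
        return [symbol]
    words = []
    for child in RULES[symbol][0]:
        words.extend(derive(child))
    return words

print(" ".join(derive("S")))  # → "the flight prefers the flight"
```

Replacing the fixed first-alternative choice with a random one would generate different grammatical sentences on each run.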
• Both sentences have the same meaning; both are generated from the same 'deep structure', in
which the deep subject is Pooja and the deep object is Veena.
Contd..
• Each sentence in a language has two levels of representations as shown in the below fig:
Deep structure and Surface structure.
• The mapping from deep structure to surface structure is carried out by transformations. In
the following paragraphs, we introduce transformational grammar.
Contd..
• Transformational grammar has three components, each consisting of a set of rules:
o Phrase structure grammar: It consists of rules that generate natural language sentences
and assign a structural description to them.
o Transformational rules:
• Transformational rules are heterogeneous, and may have more than one symbol on
their left-hand side. These rules are used to transform one representation
into another, e.g., an active sentence into a passive sentence.
Contd..
Q: Write the transformational grammar for the sentence
S: "The boy hit the girl."
Transformational grammar for the sentence:
S -> NP + VP
NP -> Det + Noun
VP -> Verb + NP
Verb -> Aux + V
Det -> a, the
Noun -> boy, girl
V -> hit
Contd..
• Morphophonemic rules: These rules match each sentence representation to a string of
phonemes.
• The Transformational rule will reorder ‘en + catch’ to ‘catch + en’ and subsequently one
of the morphophonemic rules will convert ‘catch + en’ to ‘caught’.
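As a rough sketch of how a transformational rule and a morphophonemic rule can work together (the passivization pattern and the rule table below are simplified assumptions, not the textbook's full rule set):

```python
# Morphophonemic rules: map intermediate representations to final word forms.
MORPHOPHONEMIC = {
    "en + catch": "caught",   # the 'catch + en' → 'caught' rule from the text
    "en + hit":   "hit",
}

def passivize(np1, verb, np2):
    """Transformational rule: NP1 V NP2  →  NP2 Aux en+V by NP1."""
    surface = [np2, "was", f"en + {verb}", "by", np1]
    # Apply morphophonemic rules token by token to get the phonemic string.
    return " ".join(MORPHOPHONEMIC.get(tok, tok) for tok in surface)

print(passivize("the police", "catch", "the thief"))
# → "the thief was caught by the police"
```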
Contd..
• As an example, consider the following set of rules. Sentences that can be generated
using these rules are termed grammatical.
Contd..
Processing Indian Languages
• There are a number of differences between Indian Languages and English.
• This introduces differences in their processing. Some of the differences are listed here:
o Indian Languages have a free word order i.e., words can be moved freely within a
sentence without changing the meaning of the sentence.
o Indian languages use verb complexes consisting of sequences of verbs, e.g., गा रहा
है (ga raha hai — singing ) and खेल रही है (khel rahi hai — playing).
Information Retrieval
• Information Retrieval is a software program that deals with the organization, storage,
retrieval, and evaluation of information from document repositories, particularly textual
information.
• Information Retrieval is the activity of obtaining material, usually documents of an
unstructured nature (i.e., usually text), that satisfies an information need from within
large collections stored on computers. For example, Information Retrieval takes place
when a user enters a query into the system.
Contd..
• Information retrieval also retrieves information about a subject.
• A set of keywords are required to search. Keywords are what people are searching
for in search engines. These keywords summarize the description of the
information.
• A retrieval model specifies how a document is represented with the selected keywords and
how document and query representations are compared to calculate a score.
• Information Retrieval (IR) deals with issues like uncertainty and vagueness in information
systems.
o Uncertainty: The available representation does not typically reflect true semantics of
objects such as images, videos etc.
o Vagueness: The information that the user requires lacks clarity, is only vaguely expressed
in a query, feedback or user action.
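A minimal sketch of such a keyword-based retrieval model follows; the document collection and the plain term-frequency scoring are invented for illustration:

```python
from collections import Counter

# Toy document collection; real systems index much larger corpora.
DOCS = {
    "d1": "information retrieval deals with storage and retrieval of documents",
    "d2": "natural language processing helps computers understand language",
    "d3": "search engines use keywords for information retrieval",
}

def score(query, doc_text):
    """Score a document as the sum of query-keyword frequencies it contains."""
    tf = Counter(doc_text.split())
    return sum(tf[w] for w in query.split())

def retrieve(query):
    """Rank documents by descending score, dropping documents with no match."""
    ranked = sorted(DOCS, key=lambda d: score(query, DOCS[d]), reverse=True)
    return [d for d in ranked if score(query, DOCS[d]) > 0]

print(retrieve("information retrieval"))  # → ['d1', 'd3']
```

This plain term-frequency model illustrates the uncertainty noted above: the keyword representation only approximates the true semantics of each document.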
Contd..
• Language Modelling:
• Natural language is a complex entity, and in order to process it through a computer-based program,
we need to build a representation (model) of it. This is known as language modelling.
• Language modeling is the way of determining the probability of any sequence of words.
• Language modeling is used in a wide variety of applications such as Speech Recognition, Spam
filtering, etc.
• In fact, language modeling is the key aim behind the implementation of many state-of-the-art
Natural Language Processing models.
Contd..
• Methods of Language Modelling:
There are two types of language modelling: grammar-based and statistical.
• For example, a sentence usually consists of a noun phrase and a verb phrase.
• The grammar-based approach attempts to utilize this structure as well as the relationships
between these structures.
o Government and Binding (GB): GB theories have renamed surface structure and deep structure as
s-level and d-level, and identified two more levels of representation (parallel to each other) called
phonetic form and logical form.
o According to GB theories, language can be analysed at the levels shown in
the figure.
Contd..
• Let us take an example to explain the TG representation of a sentence:
• Ex: Mukesh was killed.
(i) In transformational grammar, this can be represented as S → NP AUX VP, as shown in
Fig. 2.2.
Contd
• Example to explain d-structure and s-structure:
Consider the sentence: S: Mukesh was killed
s-structure d-structure
D-structure and S-structure
Contd..
• Ex: Drano, he drank
Contd..
• Components of GB: GB comprises a set of theories that map structures from d-
structure to s-structure and on to logical form (LF).
• Instead of defining several phrase structures and the sentence structure with separate sets
of rules, X-bar theory defines them all as maximal projections of some head. In this
manner, the entities defined become language independent.
• Thus, the noun phrase (NP), verb phrase (VP), adjective phrase (AP), and prepositional
phrase (PP) are maximal projections of noun (N), verb (V), adjective (A), and preposition
(P) respectively, and can be represented as projections of a head X (where
X = {N, V, A, P}). The sentence structure (S', the projection of the sentence) can be regarded
as the maximal projection of inflection (INFL).
Contd..
• GB envisages projection at two levels:
o First, the projection of the head at the semi-phrasal level, denoted by x̄ (x-bar).
o Second, the maximal projection at the phrasal level, denoted by x̿ (x double-bar).
o For sentences, the first-level projection is denoted by S.
o The second-level maximal projection is denoted by S'.
o We now illustrate phrase and sentence representations with the help of examples.
• Example 2.3 Figure 2.7 depicts the general and particular structures with examples. We
see the general structure in Figure 2.7(a).
Contd..
• Next, we consider the representation of the NP the food in a dhaba. This is followed by
the representation of the VP, AP, and PP structures in Figure 2.7(c)-(e); finally, Figure
2.7(f) shows the representation of a sentence.
Contd..
Contd..
• As shown in Figure 2.7(f), INFL is considered to be the head of the sentence, and the projection of
the sentence is denoted by S, which has the complementizer (COMP) as its specifier.
Contd..
• Different components of GB:
1. X-bar theory
2. Sub-Categorization
3. Projection
4. Theta Theory(θ-Theory)
5. Theta-role and Theta-criterion
6. Binding Theory
7. Empty Category Principle
8. Case Theory and Case Filter
Lexical Functional Grammar (LFG Model)
• Lexical Functional Grammar (LFG) plays a vital role in the area of Natural Language Processing
(NLP).
• C-structure (constituent structure) indicates the hierarchical composition of words into larger
units or phrasal constituents, while f-structure (functional structure) represents grammatical
functions such as subject and object.
• The classical Paninian Grammar facilitates the task of obtaining the semantics through a
syntactic framework. In PG, an extensive and precise interpretation of phonology,
morphology, syntax, and semantics is available.
Contd..
• Layered representation in Paninian Grammar:
• The Paninian Grammar (PG) framework is said to be syntactico-semantic; that is, one can go from the
surface layer to deep semantics by passing through intermediate layers.
• PG works on various levels of language analysis to achieve the meaning of the sentence from the
hearer’s perspective.
• To achieve the desired meaning, the grammar analysis is divided itself internally into various
levels as shown in the figure below.
Contd..
• Semantic Level: Represents the speaker’s actual intention, that is, his real thought for the sentence.
• Surface Level: It is the actual string or the sentence. It captures the written or the spoken sentences as
it is.
• Vibhakti Level: Vibhakti is the word suffix (case ending or post-position), which helps to identify the
participants, the gender, and the form of the word.
The Vibhakti level is purely syntactic. At the Vibhakti level, noun groups are formed, containing
instances of nouns or pronouns together with their case markers.
Vibhakti for verbs includes the verb form and the auxiliary verbs.
Karaka Level: At the Karaka level, the relation of the participant noun, in the action, to the verb is
determined.
• Various Karakas: The Karta, Karma, and Karana are considered the foremost Karakas while
Sampradana, Apadana, and Adhikarana Karakas are known as the influenced Karakas.
1. Karta - subject
2. Karma - object
3. Karana - instrument
4. Sampradana - beneficiary
5. Apadana - separation
6. Adhara or Adhikarana - locus
7. Sambandh - relation
8. Tadarthya - purpose
Contd..
1. Karta Karaka: The Karta Karaka is the premier one with respect to the action; it performs
the action independently, of its own accord.
• The action indicated in a sentence is entirely dependent upon the Karta-Karaka. The activity either
resides in or arises from the Karta only.
Tiger - karta.
2. Karma Karaka: The thing most desired by the Karta through the action is the Karma Karaka.
When the Karta carries out any activity, the result of that activity rests in the Karma. As the
Karma (object) is the basis of the outcome of the primary action, it is one of the most prominent Karakas.
3. Karana Karaka: The Karta and Karma are directly dependent on the Karana for performing the action.
• The Karana is the most important tool by means of which the action is achieved.
• Only when the Karana Karaka executes its auxiliary actions is the main action executed by the
Karta Karaka. This is why the Karana is considered the efficient means of accomplishing the action.
4. Sampradana Karaka: The Sampradana Karaka receives, or benefits from, the action. It can also be
said that the person/object for which the Karma is intended is known as the Sampradana.
Example: Shambhavi is the Sampradana; "me" is the Sampradana.
Contd..
5. Apadana Karaka: About the Apadana Karaka, Panini stated that when separation is effected by
a verbal action, the point of separation is called the Apadana.
• During the execution of the action, whenever separation from a certain entity takes place,
whatever remains unmoved or constant is known as the Apadana.
• The entity from which something gets separated is known as the Apadana.
Example: Shambhavi tore the page from the book with scissors. (Here, the book is the Apadana.)
6. Adhikarana Karaka: The Adhikarana is assigned to the locus of the action, i.e., the Kriya. The
Adhikarana may indicate the place at which the Kriya (the action) is taking place or the time at which
the Kriya is carried out. Any action, i.e., the Kriya, is bounded either by space (place) or by time.
Example: 'Yesterday Shambhavi hit the dog with the stick in front of the shop.'
Dog : Karma
Stick : Karana
Karaka roles:
maan (mother) - Karta
roti (bread) - Karma
haath (hand) - Karana
putree (daughter) - Sampradana
angan (courtyard) - Adhikarana
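A very rough sketch of how vibhakti markers can drive Karaka assignment is shown below. The marker-to-role table and the sample sentence are simplified assumptions: in practice a single marker (e.g., se) can signal either Karana or Apadana, and PG systems disambiguate using the verb's karaka frame:

```python
# Simplified mapping from Hindi case markers (vibhakti) to likely Karaka roles.
MARKER_TO_KARAKA = {
    "ne":   "Karta",        # ergative subject marker
    "ko":   "Karma",        # object / beneficiary marker
    "se":   "Karana",       # instrument marker (can also mark Apadana)
    "mein": "Adhikarana",   # locative marker
}

def karaka_roles(tokens):
    """Assign a Karaka role to the noun immediately preceding each marker."""
    roles = {}
    for i, tok in enumerate(tokens):
        if tok in MARKER_TO_KARAKA and i > 0:
            roles[tokens[i - 1]] = MARKER_TO_KARAKA[tok]
    return roles

# Illustrative transliterated sentence: "maan ne haath se roti banaayi"
# (mother made bread with her hand)
print(karaka_roles("maan ne haath se roti banaayi".split()))
# → {'maan': 'Karta', 'haath': 'Karana'}
```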
Statistical language model
• A statistical language model is a probability distribution P(s) over all possible word sequences (or
any other linguistic unit, such as sentences, paragraphs, or documents).
• A number of statistical language models have been proposed in the literature. The dominant
approach in modeling is the n-gram model:
• n-gram Model: The goal of a statistical language model is to estimate the probability of a
sentence. This is achieved by decomposing sentence probability into a product of conditional
probabilities using the chain rule as follows:
Contd..
P(s) = P(w1, w2, w3, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3) ... P(wn|w1 w2 ... wn-1)
• An n-gram model approximates the conditional probability of a word given all
previous words by the conditional probability given the previous n-1 words only.
• A model that limits the history to the previous one word only is termed a bi-gram (n = 2)
model. A model that conditions the probability of a word on the previous two words is called a
tri-gram (n = 3) model.
Contd..
• Using bi-gram and tri-gram estimates, the probability of a sentence can be calculated accordingly.
• A special word (pseudoword) <s> is introduced to mark the beginning of the sentence in bi-gram
estimation. The probability of the first word in the sentence is conditioned on <s>. Similarly, in tri-
gram estimation, two pseudowords <s1> <s2> are introduced.
• Estimation of probabilities is done by training the n-gram model on a training corpus. We count
a particular n-gram in the training corpus and divide the count by the sum of all n-grams that share
the same (n-1)-word prefix.
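This counting procedure can be sketched as a small unsmoothed bi-gram model; unseen bigrams simply get probability zero:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams; each sentence gets an <s> start marker."""
    uni, bi = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def sentence_prob(sentence, uni, bi):
    """P(s) = product of P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})."""
    tokens = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if uni[prev] == 0 or bi[(prev, cur)] == 0:
            return 0.0   # unseen bigram: smoothing would be needed in practice
        p *= bi[(prev, cur)] / uni[prev]
    return p

corpus = ["I am a human", "I am not a stone"]
uni, bi = train_bigram(corpus)
print(sentence_prob("I am a human", uni, bi))  # → 0.25
```

Here P(I|&lt;s&gt;) = 2/2, P(am|I) = 2/2, P(a|am) = 1/2, and P(human|a) = 1/2, giving 0.25.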
Contd..
Q: Find the probability of the test sentence using the bi-gram model on the following
training set.
Training Set:
<s> I am a human
<s> I am not a stone

Test Sentence:
"I I am not"

Bi-gram estimates:
P(I | <s>) = 3/3 = 1
P(am | I) = 2/4 = 1/2 = 0.5
P(I | I) = 1/4 = 0.25
P(not | am) = 1/2 = 0.5

Bi-gram model:
P("I I am not") = 3/3 * 1/4 * 2/4 * 1/2
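The bi-gram product for P("I I am not") evaluates as follows:

```python
# Checking the product 3/3 * 1/4 * 2/4 * 1/2 from the worked example above.
p = (3/3) * (1/4) * (2/4) * (1/2)
print(p)  # → 0.0625
```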