Unit - 2 NLP - R20
UNIT - II Grammars and Parsing (Lecture 8 Hrs)
Grammars and Parsing - Top-Down and Bottom-Up Parsers, Transition Network Grammars,
Feature Systems and Augmented Grammars, Morphological Analysis and the Lexicon, Parsing
with Features, Augmented Transition Networks, Bayes Rule, Shannon Game, Entropy and
Cross Entropy.
Grammars and Parsing:
Natural language has an underlying structure usually referred to under the heading of syntax.
The fundamental idea of syntax is that words group together to form so-called constituents,
i.e. groups of words or phrases which behave as a single unit. These constituents can combine
to form bigger constituents and eventually sentences.
For instance, John, the man, the man with a hat and almost every man are constituents (called
Noun Phrases or NPs for short) because they all can appear in the same syntactic context (they
can all function as the subject or the object of a verb, for instance). Moreover, the NP
constituent the man with a hat can combine with the VP (Verb Phrase) constituent run to
form an S (sentence) constituent.
A program consists of various strings of characters. But every string is not a proper or
meaningful string. So, to identify valid strings in a language, some rules should be specified
to check whether the string is valid or not. These rules together make up a grammar.
Derivations:
A derivation is a sequence of rule applications that derives a terminal string w = w1 … wn
from the start symbol S.
For Example:
S
=> NP VP
=> Pro VP
=> I VP
=> I Verb NP
=> I prefer NP
=> I prefer Det Nom
=> I prefer a Nom
=> I prefer a Nom Noun
=> I prefer a Noun Noun
=> I prefer a morning Noun
=> I prefer a morning flight
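A minimal Python sketch (illustrative only; the rule list and helper name are written out for this example) that replays this leftmost derivation as successive string rewrites:

# Minimal sketch: replay the leftmost derivation above as successive rewrites.
steps = [
    ("S", ["NP", "VP"]), ("NP", ["Pro"]), ("Pro", ["I"]),
    ("VP", ["Verb", "NP"]), ("Verb", ["prefer"]), ("NP", ["Det", "Nom"]),
    ("Det", ["a"]), ("Nom", ["Nom", "Noun"]), ("Nom", ["Noun"]),
    ("Noun", ["morning"]), ("Noun", ["flight"]),
]

def leftmost_rewrite(form, lhs, rhs):
    """Replace the leftmost occurrence of the nonterminal lhs with rhs."""
    i = form.index(lhs)
    return form[:i] + rhs + form[i + 1:]

form = ["S"]
print(" ".join(form))
for lhs, rhs in steps:
    form = leftmost_rewrite(form, lhs, rhs)
    print("=> " + " ".join(form))        # ends with: => I prefer a morning flight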
Parsing: In the syntax analysis phase, a compiler verifies whether or not the tokens generated
by the lexical analyzer are grouped according to the syntactic rules of the language. This is
done by a parser.
The parser obtains a string of tokens from the lexical analyzer and verifies that the
string can be generated by the grammar for the source language. It detects and reports any
syntax errors and produces a parse tree from which intermediate code can be generated.
Ambiguity: A grammar that produces more than one parse tree for some sentence is said to be
ambiguous.
E.g., consider the grammar
S -> aS | Sa | a
For the string aaa, there are 4 parse trees, so the grammar is ambiguous.
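A quick check in Python (assuming the NLTK library is available; the grammar string and tokens are written for this example) can enumerate the parse trees:

import nltk

grammar = nltk.CFG.fromstring("""
S -> 'a' S | S 'a' | 'a'
""")
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(['a', 'a', 'a']))
print(len(trees))            # 4 distinct parse trees, so the grammar is ambiguous
for t in trees:
    print(t)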
Top-Down Parser
A parsing algorithm can be described as a procedure that searches through various ways of
combining grammatical rules to find a combination that generates a tree that could be the
structure of the input sentence. In other words, the algorithm will say whether a certain
sentence is accepted by the grammar or not. The top-down parsing method is closely related to
many artificial intelligence (AI) search applications.
A top-down parser starts with the S symbol and attempts to rewrite it into a sequence of
terminal symbols that matches the classes of the words in the input sentence. The state of the
parse at any given time can be represented as a list of symbols that are the result of operations
applied so far, called the symbol list.
For example, the parser starts in the state (S) and after applying the rule S -> NP VP the
symbol list will be (NP VP). If it then applies the rule NP -> ART N, the symbol list will be
(ART N VP), and so on.
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP
The parser could continue in this fashion until the state consisted entirely of terminal
symbols, and then it could check the input sentence to see if it matched. The lexical analyser
will produce a list of words from the given sentence. A very small lexicon for use in the
examples is
cried: V
dogs: N, V
the: ART
Positions fall between the words, with 1 being the position before the first word. For
example, here is the sentence The dogs cried with its positions indicated:
1 The 2 dogs 3 cried 4
A typical parse state would be
((N VP) 2)
indicating that the parser needs to find an N followed by a VP, starting at position two. New
states are generated from old states depending on whether the first symbol is a lexical symbol
or not. If it is a lexical symbol, like N in the preceding example, and if the next word can
belong to that lexical category, then you can update the state by removing the first symbol
and updating the position counter. In this case, since the word dogs is listed as an N in the
lexicon, the next parser state would be ((VP) 3)
which means it needs to find a VP starting at position 3. If the first symbol is a nonterminal,
like VP, then it is rewritten using a rule from the grammar. For example, using rule 4 in the
above Grammar, the new state would be
((V) 3)
which means it needs to find a V starting at position 3. On the other hand, using rule 5, the
new state would be
((V NP) 3)
A parsing algorithm that is guaranteed to find a parse if there is one must systematically
explore every possible new state. One simple technique for this is called backtracking. Using
this approach, rather than generating a single new state from the state ((VP) 3), you generate
all possible new states. One of these is picked to be the next state and the rest are saved as
backup states. If you ever reach a situation where the current state cannot lead to a solution,
you simply pick a new current state from the list of backup states. Here is the algorithm in a
little more detail.
A Simple Top-Down Parsing Algorithm
The algorithm manipulates a list of possible states, called the possibilities list. The first
element of this list is the current state, which consists of a symbol list and a word position in
the sentence; the remaining elements of the list are the backup states, each indicating an
alternate symbol-list / word-position pair. For example, the possibilities list
(((N) 2) ((NAME) 1) ((ADJ N) 1))
indicates that the current state consists of the symbol list (N) at position 2, and that there are
two possible backup states: one consisting of the symbol list (NAME) at position 1 and the
other consisting of the symbol list (ADJ N) at position 1.
The algorithm starts with the initial state ((S) 1) and no backup states.
1. Select the current state: Take the first state off the possibilities list and call it C. If the
possibilities list is empty, then the algorithm fails (that is, no successful parse is
possible).
2. Check C: if C consists of an empty symbol list and the word position is at the end of the
sentence, then the algorithm succeeds.
3. Otherwise, generate the next possible states:
3.1. If the first symbol on the symbol list of C is a lexical symbol, and the next word in the
sentence can be in that class, then create a new state by removing the first symbol
from the symbol list and updating the word position, and add it to the possibilities list.
3.2. Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new
state for each rule in the grammar that can rewrite that non-terminal symbol, and add them
all to the possibilities list.
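A minimal Python sketch of this backtracking procedure (illustrative only; the data structures and function names are assumptions), using the small grammar and lexicon introduced above:

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": ["ART"], "dogs": ["N", "V"], "cried": ["V"]}

def top_down_parse(words):
    # A state is (symbol list, word position); position 0 is before the first word.
    possibilities = [(["S"], 0)]
    while possibilities:
        symbols, pos = possibilities.pop(0)            # 1. select the current state
        if not symbols:
            if pos == len(words):                      # 2. empty list at end of input
                return True
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:
            # 3.2 nonterminal: one new state per rewrite rule
            possibilities = [(rhs + rest, pos) for rhs in GRAMMAR[first]] + possibilities
        elif pos < len(words) and first in LEXICON.get(words[pos], []):
            # 3.1 lexical symbol: consume the next word if its category matches
            possibilities = [(rest, pos + 1)] + possibilities
    return False

print(top_down_parse("the dogs cried".split()))        # True
print(top_down_parse("the dogs".split()))              # False: no VP can be found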
Consider an example. Using the above grammar, here is a trace of the algorithm on the
sentence The dogs cried. First, the initial S symbol is rewritten using rule 1 to produce a new
current state of ((NP VP) 1) in step 2. The NP is then rewritten in turn, but since there are two
possible rules for NP in the grammar, two possible states are generated: The new current state
involves (ART N VP) at position 1, whereas the backup state involves (ART ADJ N VP) at
position 1. In step 4 a word in category ART is found at position 1 of the sentence, and the
new current state becomes (N VP). The backup state generated in step 3 remains untouched.
The parse continues in this fashion to step 5, where two different rules can rewrite VP. The
first rule generates the new current state, while the other rule is pushed onto the stack of
backup states. The parse completes successfully in step 7, since the current state is empty and
all the words in the input sentence have been accounted for.
Now consider the same algorithm and grammar operating on the sentence The old man cried.
In this case assume that the word old is ambiguous between an ADJ and an N and that the
word man is ambiguous between an N and a V (as in the sentence The sailors man the
boats). Specifically, the lexicon is
the: ART
old: ADJ, N
man: N, V
cried: V
The parse proceeds as follows. The initial S symbol is rewritten by rule 1 to produce the new
current state of ((NP VP) 1). The NP is rewritten in turn, giving the new state of ((ART N
VP) 1) with a backup state of ((ART ADJ N VP) 1). The parse continues, finding the as an
ART to produce the state ((N VP) 2) and then old as an N to obtain the state ((VP) 3). There
are now two ways to rewrite the VP, giving us a current state of ((V) 3) and the backup states
of ((V NP) 3) and ((ART ADJ N VP) 1) from before. The word man can be parsed as a V, giving
the state (() 4). Unfortunately, while the symbol list is empty, the word position is not at the
end of the sentence, so no new state can be generated and a backup state must be used. In the
next cycle, step 8, ((V NP) 3) is attempted. Again man is taken as a V and the new state ((NP)
4) generated. None of the rewrites of NP yield a successful parse. Finally, in step 12, the last
backup state, ((ART ADJ N VP) 1), is tried and leads to a successful parse.
You can think of parsing as a special case of a search problem as defined in AI. In particular,
the top-down parser in this section was described in terms of the following generalized search
procedure. The possibilities list is initially set to the start state of the parse. Then you repeat
the following steps until you have success or failure:
1. Select the first state from the possibilities list (and remove it from the list).
2. Generate the new states by trying every possible option from the selected state (there may
be none if we are on a bad path).
3. Add the states generated in step 2 to the possibilities list.
Bottom-Up Parser
A bottom-up parser builds a derivation by working from the input sentence back toward the
start symbol S. The bottom-up parser is also known as a shift-reduce parser.
The basic operation in bottom-up parsing is to take a sequence of symbols and match it to the
right-hand side of the rules. You could build a bottom-up parser simply by formulating this
matching process as a search process. The state would simply consist of a symbol list,
starting with the words in the sentence. Successor states could be generated by exploring all
possible ways to:
replace a sequence of symbols that matches the right-hand side of a grammar rule by
its left-hand side.
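A minimal Python sketch of this matching-as-search idea (illustrative only; the rule list, lexicon, and function names are assumptions), using the earlier toy grammar and lexicon:

from itertools import product

RULES = [
    ("S",  ["NP", "VP"]),
    ("NP", ["ART", "N"]),
    ("NP", ["ART", "ADJ", "N"]),
    ("VP", ["V"]),
    ("VP", ["V", "NP"]),
]
LEXICON = {"the": ["ART"], "dogs": ["N", "V"], "cried": ["V"]}

def bottom_up_parse(words):
    # Start from every way of replacing each word by one of its lexical categories.
    stack = [list(cats) for cats in product(*(LEXICON[w] for w in words))]
    seen = set()
    while stack:
        symbols = stack.pop()
        key = tuple(symbols)
        if key in seen:
            continue
        seen.add(key)
        if symbols == ["S"]:
            return True
        # Try every reduction: a span matching a right-hand side becomes the left-hand side.
        for lhs, rhs in RULES:
            for i in range(len(symbols) - len(rhs) + 1):
                if symbols[i:i + len(rhs)] == rhs:
                    stack.append(symbols[:i] + [lhs] + symbols[i + len(rhs):])
    return False

print(bottom_up_parse("the dogs cried".split()))       # True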
Transition Network Grammars
Starting at the initial state, you can traverse an arc if the current word in the sentence is in the
category on the arc. If the arc is followed, the current word is updated to the next word. A
phrase is a legal NP if there is a path from the node NP to a pop arc (an arc labeled pop) that
accounts for every word in the phrase. This network recognizes the same set of sentences as
the following context-free grammar:
NP -> ART NP1
NP1 -> ADJ NP1
NP1 -> N
Consider parsing the NP a purple cow with this network. Starting at the node NP, you can
follow the arc labelled art, since the current word is an article, namely a. From node NP1
you can follow the arc labeled adj using the adjective purple, and finally, again from NP1,
you can follow the arc labeled noun using the noun cow. Since you have reached a pop arc, a
purple cow is a legal NP.
Consider finding a path through the S network for the sentence The purple cow ate the grass.
Starting at node 5, to follow the arc labeled NP, you need to traverse the NP network. Starting
at node NP, traverse the network as before for the input the purple cow. Following the pop
arc in the NP network, return to the S network and traverse the arc to node S 1. From node S
1 you follow the arc labeled verb using the word ate. Finally, the arc labeled NP can be
followed if you can traverse the NP network again. This time the remaining input consists of
the words the grass. You follow the arc labeled art and then the arc labeled noun in the NP
network; then take the pop arc from node NP2 and then another pop from node S3. Since you
have traversed the network and used all the words in the sentence, The purple cow ate the
grass is accepted as a legal sentence.
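A minimal Python sketch (illustrative only; the network encoding, the one-category-per-word lexicon, and the function names are assumptions) of recognizing this sentence with the NP and S networks just described:

LEXICON = {"the": "ART", "a": "ART", "purple": "ADJ", "cow": "N",
           "ate": "V", "grass": "N"}

NETWORKS = {
    "NP": {"NP":  [("ART", "NP1")],
           "NP1": [("ADJ", "NP1"), ("N", "NP2")],
           "NP2": [("pop", None)]},
    "S":  {"S":  [("push NP", "S1")],
           "S1": [("V", "S2")],
           "S2": [("push NP", "S3"), ("pop", None)],
           "S3": [("pop", None)]},
}

def traverse(net, node, words, pos):
    """Return every word position reachable after completing `net` from `node`."""
    results = []
    for label, target in NETWORKS[net][node]:
        if label == "pop":
            results.append(pos)                            # this network is completed here
        elif label.startswith("push "):
            sub = label.split()[1]
            for end in traverse(sub, sub, words, pos):     # recurse into the sub-network
                results.extend(traverse(net, target, words, end))
        elif pos < len(words) and LEXICON.get(words[pos]) == label:
            results.extend(traverse(net, target, words, pos + 1))
    return results

words = "the purple cow ate the grass".split()
print(len(words) in traverse("S", "S", words, 0))          # True: accepted as a sentence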
Feature Systems and Augmented Grammars
In natural languages there are often agreement restrictions between words and phrases.
For example, the NP "a men" is not correct English because the article a indicates a single
object while the noun "men" indicates a plural object; the noun phrase does not satisfy the
number agreement restriction of English. There are many other forms of agreement, including
subject-verb agreement, gender agreement for pronouns, restrictions between the head of a
phrase and the form of its complement, and so on. To handle such phenomena conveniently,
the grammatical formalism is extended to allow constituents to have features. For example,
we might define a feature NUMBER that may take a value of either s (for singular) or p (for
plural), and we then might write an augmented CFG rule such as
(NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)
This rule says that a legal noun phrase consists of an article followed by a noun, but only
when the number feature of the first word agrees with the number feature of the second. This
one rule is equivalent to two CFG rules that would use different terminal symbols for
encoding singular and plural forms of all noun phrases, such as
NP-SING -> ART-SING N-SING
NP-PLURAL -> ART-PLURAL N-PLURAL
While the two approaches seem similar in ease-of-use in this one example, consider that all
rules in the grammar that use an NP on the right-hand side would now need to be duplicated
to include a rule for NP-SING and a rule for NP-PLURAL, effectively doubling the size of
the grammar.
Augmented grammars provide more precise and detailed linguistic analysis and can handle
complex linguistic phenomena more effectively.
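A minimal Python sketch (illustrative only; the lexicon entries and function name are assumptions) of the NUMBER agreement check that the augmented rule above expresses:

LEXICON = {
    "a":   {"cat": "ART", "NUMBER": {"s"}},
    "the": {"cat": "ART", "NUMBER": {"s", "p"}},
    "man": {"cat": "N",   "NUMBER": {"s"}},
    "men": {"cat": "N",   "NUMBER": {"p"}},
}

def parse_np(art_word, noun_word):
    art, noun = LEXICON[art_word], LEXICON[noun_word]
    agreed = art["NUMBER"] & noun["NUMBER"]          # shared NUMBER values, if any
    if art["cat"] == "ART" and noun["cat"] == "N" and agreed:
        return {"cat": "NP", "NUMBER": agreed}       # the NP inherits the agreed value
    return None                                      # agreement failure: not a legal NP

print(parse_np("a", "man"))    # {'cat': 'NP', 'NUMBER': {'s'}}
print(parse_np("a", "men"))    # None: "a men" violates number agreement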
Morphological Analysis: Morphological Analysis is the study of lexemes and how they are
created. The discipline is particularly interested in neologisms (newly created words from
existing words (root words)), derivation, and compounding. In morphological analysis each
token will be analysed as follows:
katternas -> katt+N+plur+def+gen
Derivation refers to a way of creating new words by adding affixes to the root of a word; this
is also known as affixation. There are two kinds of affixes: prefixes and suffixes.
Compounding: Compounding refers to the creation of new words by combining two or more
existing words, for example toothpaste (tooth + paste), blackboard (black + board), and
football (foot + ball).
The Lexicon:
The lexicon must contain information about all the different words that can be used,
including all the relevant feature value restrictions. When a word is ambiguous, it may be
described by multiple entries in the lexicon, one for each different use.
Most English verbs, for example, use the same set of suffixes to indicate different forms: -s is
added for third person singular present tense, -ed for past tense, -ing for the present
participle, and so on.
The idea is to store the base form of the verb in the lexicon and use context-free rules to
combine verbs with suffixes to derive the other entries. Consider the following rule for
present tense verbs:
(V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base) (+S)
where +S is a new lexical category that contains only the suffix morpheme -s. This rule,
coupled with the lexicon entry
want:
(V ROOT want
VFORM base)
would produce the following constituent given the input string want -s:
(V ROOT want
VFORM pres
AGR 3s)
Another rule would generate the constituents for the present tense form not in third person
singular, which for most verbs is identical to the root form:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p}) -> (V ROOT ?r SUBCAT ?s VFORM base)
But this rule needs to be modified in order to avoid generating erroneous interpretations.
Currently, it can transform any base form verb into a present tense form, which is clearly
wrong for some irregular verbs. For instance, the base form be cannot be used as a present
form (for example, *We be at the store). To cover these cases, a feature is introduced to
identify irregular forms. Specifically, verbs with the binary feature +IRREGPRES have
irregular present tense forms. Now the rule above can be stated correctly:
(V ROOT ?r SUBCAT ?s VFORM pres AGR {1s 2s 1p 2p 3p}) -> (V ROOT ?r SUBCAT ?s VFORM base IRREGPRES -)
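A minimal Python sketch (illustrative only; the feature names follow the rules above, while the lexicon entries and function name are assumptions) of applying the present tense rule to a base form while blocking an irregular verb:

LEXICON = {"want": {"cat": "V", "ROOT": "want", "VFORM": "base"},
           "be":   {"cat": "V", "ROOT": "be",   "VFORM": "base", "IRREGPRES": True}}

def present_tense_entry(word):
    entry = LEXICON[word]
    # The rule applies only to base-form verbs that are not marked +IRREGPRES.
    if entry["cat"] == "V" and entry["VFORM"] == "base" and not entry.get("IRREGPRES"):
        derived = dict(entry)
        derived.update({"VFORM": "pres", "AGR": ["1s", "2s", "1p", "2p", "3p"]})
        return derived
    return None

print(present_tense_entry("want"))   # a new V constituent with VFORM pres
print(present_tense_entry("be"))     # None: "be" is irregular, so the rule is blocked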
Parsing with Features
The goal of incorporating features into parsing is to enhance the accuracy and quality
of the parsed output.
Features can be defined at different levels. For example, at the word level, features
may include part-of-speech tags, lemma forms, or morphological properties (such as the
-ed, -s, and -ing endings). At the syntactic level, features may involve phrase types, head
words, or dependencies between words.
2. Syntactic parsing: Identifying the correct structure of the sentence then finding the
relationships between words and phrases.
3. Semantic parsing: Identifying the meaning of the words used in the sentence and
understanding the relationship between them.
4. Named entities: Identifying and classifying named entities in the sentence, such as
person names, locations, or organizations
Augmented Transition Networks
The ATN (augmented transition network) is produced by adding new features to a recursive
transition network. Features in an ATN are traditionally called registers. Constituent
structures are created by allowing each network to have a set of registers. Each time a new
network is pushed, a new set of registers is created. As the network is traversed, these
registers are set to values by actions associated with each arc. When the network is popped,
the registers are assembled to form a constituent structure, with the CAT slot being the
network name.
Consider the figure below, which shows a simple NP network. The actions are listed in the table
below the network. ATNs use a special mechanism to extract the result of following an arc.
When a lexical arc, such as arc 1, is followed, the constituent built from the word in the input
is put into a special variable named "*".
The action DET := * then assigns this constituent to the DET register. The second action on this
arc, AGR := AGR*, assigns the AGR register of the network to the
value of the AGR register of the new word (the constituent in "*"). Agreement checks are
specified in the tests. A test is an expression that succeeds if it returns a nonempty value and
fails if it returns the empty set or nil.
A simple NP network
If a test fails, its arc is not traversed. The test on arc 2 indicates that the arc can be followed
only if the AGR feature of the network has a non-null intersection with the AGR register of
the new word (the noun constituent in "*").
Features on push arcs are treated similarly. The constituent built by traversing the NP
network is returned as the value of "*". Thus in the S network shown below, the action on the
arc from S to S1, SUBJ := *, would assign the constituent returned by the NP network to the register SUBJ.
The test on arc 2 will succeed only if the AGR register of the constituent in the SUBJ register
has a non-null intersection with the AGR register of the new constituent (the verb). This test
enforces subject-verb agreement.
A simple S Network
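A minimal Python sketch (illustrative only; the register names follow the arcs described above, while the lexicon entries and function name are assumptions) of the register actions and the agreement test on the NP network:

LEXICON = {"a":    {"cat": "ART", "AGR": {"3s"}},
           "the":  {"cat": "ART", "AGR": {"3s", "3p"}},
           "dog":  {"cat": "N",   "AGR": {"3s"}},
           "dogs": {"cat": "N",   "AGR": {"3p"}}}

def np_network(words):
    registers = {}
    art, noun = LEXICON[words[0]], LEXICON[words[1]]
    if art["cat"] != "ART" or noun["cat"] != "N":
        return None
    # Arc 1 (art): DET := *, AGR := AGR*
    registers["DET"] = art
    registers["AGR"] = art["AGR"]
    # Arc 2 (noun): test that AGR and AGR* intersect, then HEAD := *, AGR := the intersection
    agreed = registers["AGR"] & noun["AGR"]
    if not agreed:
        return None                       # the test fails, so the arc is not traversed
    registers["HEAD"] = noun
    registers["AGR"] = agreed
    # Pop: assemble the registers into a constituent whose CAT slot is the network name.
    return {"CAT": "NP", **registers}

print(np_network(["a", "dog"])["AGR"])    # {'3s'}
print(np_network(["a", "dogs"]))          # None: "a dogs" fails number agreement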
Bayes' Rule
Bayes' rule is a formula that describes how to update your beliefs about something
based on new evidence.
In NLP, Bayes' rule is used in various applications, such as text classification, spam
filtering, and sentiment analysis. For two events A and B it states:
P(A|B) = P(B|A) × P(A) / P(B)
Where:
P(A|B) is the posterior probability of A given the evidence B,
P(B|A) is the likelihood of the evidence B given A,
P(A) is the prior probability of A, and
P(B) is the probability of the evidence B.
Let's consider a simple example of spam email classification using Bayes' Rule.
Assume we have a dataset of emails labeled as spam or not spam, and we want to
classify a new email as spam or not spam based on its content.
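A minimal Python sketch (illustrative only; the tiny training set is invented, and word independence is the usual naive Bayes assumption) of classifying an email with Bayes' rule:

import math
from collections import Counter

train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting at noon", "ham"),  ("project meeting tomorrow", "ham")]

counts = {"spam": Counter(), "ham": Counter()}
docs = Counter()
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

def log_posterior(text, label):
    # log P(label) + sum over words of log P(word | label), with add-one smoothing
    prior = math.log(docs[label] / sum(docs.values()))
    total = sum(counts[label].values())
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    likelihood = sum(math.log((counts[label][w] + 1) / (total + vocab))
                     for w in text.split())
    return prior + likelihood

email = "cheap money meeting"
label = max(("spam", "ham"), key=lambda c: log_posterior(email, c))
print(label)                              # spam: the spam class has the higher posterior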
Shannon's Game
The Shannon game is a thought experiment in linguistics and natural language processing
(NLP) that asks participants to guess the next letter in a sequence based on its preceding
context.
If the next letter in a sequence is highly predictable, then the game will be easy to win.
However, if the next letter is not predictable, then the game will be more difficult to win.
The Shannon game can be used to measure the entropy of a language. Entropy is a measure
of the uncertainty or randomness in a sequence. The higher the entropy, the more uncertain
the sequence is. The lower the entropy, the more predictable the sequence is.
The language model is used to predict the next letter in a sequence; the Shannon game
measures the predictability of the model's predictions; and this predictability is in turn used
to improve the model's accuracy. For example, guessing the word question one letter at a time:
o q
o qu
o que
o ques----
o quest---
o questi--
o questio-
o question
The Shannon game is a powerful tool for understanding and improving the predictability of
language.
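A minimal Python sketch (illustrative only; the two next-letter distributions are made-up numbers) showing how entropy quantifies the predictability measured by the Shannon game:

import math

def entropy(dist):
    """Shannon entropy in bits: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-letter distributions after two different contexts.
after_q  = {"u": 0.98, "a": 0.01, "i": 0.01}                     # nearly certain
after_th = {"e": 0.5, "a": 0.2, "i": 0.15, "o": 0.1, "r": 0.05}  # much more spread out

print(round(entropy(after_q), 3))    # ~0.161 bits: highly predictable, easy to guess
print(round(entropy(after_th), 3))   # ~1.923 bits: less predictable, harder to guess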