Unit 3
Grammar in NLP is a set of rules for constructing sentences in a language; it is used to understand and
analyze the structure of sentences in text data.
This includes identifying parts of speech such as nouns, verbs, and adjectives; determining the
subject and predicate of a sentence; and identifying the relationships between words and phrases.
Grammar is defined as the rules for forming well-structured sentences. Grammar also plays an
essential role in describing the syntactic structure of well-formed programs, just as it denotes
the syntactic rules used for conversation in natural languages.
Syntax
Each natural language has an underlying structure, usually referred to as its syntax. The
fundamental idea of syntax is that words group together to form constituents: groups of
words or phrases that behave as a single unit. These constituents can combine to form larger
constituents and, eventually, sentences.
Syntax describes the regularity and productivity of a language by making the structure of
sentences explicit. The goal of syntactic analysis, or parsing, is to determine whether a
sentence is well formed and to provide its syntactic structure.
Syntax also refers to the way words are arranged together. Let us see some basic ideas related to
syntax:
Constituency: Groups of words may behave as a single unit or phrase, called a constituent;
for example, a noun phrase.
Grammatical relations: These are the formalization of ideas from traditional grammar.
Examples include - subjects and objects.
Subcategorization and dependency relations: These are the relations between words
and phrases; for example, a verb followed by an infinitive verb.
Regular languages and parts of speech: These refer to the way words are arranged
together, but they cannot easily capture notions such as constituency, grammatical
relations, or subcategorization and dependency relations.
Syntactic categories and their common denotations in NLP: np - noun phrase, vp -
verb phrase, s - sentence, det - determiner (article), n - noun, tv - transitive verb (takes an
object), iv - intransitive verb, prep - preposition, pp - prepositional phrase, adj - adjective
There are some differences between these two parsing techniques, which are given below:

Top-Down Parsing
It is a parsing strategy that first looks at the highest level of the parse tree and works down
the parse tree by using the rules of grammar.
In this technique we start parsing from the top (the start symbol of the parse tree) and work
down to the leaf nodes, in a top-down manner.
It uses leftmost derivation.
The main decision is which production rule to use in order to construct the string.

Bottom-Up Parsing
It is a parsing strategy that first looks at the lowest level of the parse tree and works up
the parse tree by using the rules of grammar.
In this technique we start parsing from the bottom (the leaf nodes of the parse tree) and work
up to the start symbol, in a bottom-up manner.
It uses rightmost derivation (traced in reverse).
The main decision is when to use a production rule to reduce the string back to the start symbol.
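To make the bottom-up strategy concrete, here is a minimal shift-reduce recognizer in Python, using a fragment of the grammar from Figure A.3 below. The greedy reduce-first control loop and the function names are illustrative assumptions, not a production parser:

```python
# Grammar rules written as (LHS, RHS-tuple); lexical entries map words to tags.
RULES = [
    ("S",  ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("VBZ", "NN")),
]
LEXICON = {"the": "DT", "dog": "NN", "meat": "NN", "likes": "VBZ"}

def shift_reduce(tokens):
    """Greedy bottom-up (shift-reduce) recognizer: shift a word, then
    reduce the top of the stack whenever a rule's right-hand side matches."""
    stack = []
    buffer = list(tokens)
    while buffer or len(stack) > 1:
        # Try to reduce first: does any rule's RHS match the stack top?
        for lhs, rhs in RULES:
            n = len(rhs)
            if tuple(stack[-n:]) == rhs:
                stack[-n:] = [lhs]
                break
        else:
            if not buffer:          # nothing to shift and nothing to reduce
                return False
            word = buffer.pop(0)    # shift: tag the next word and push it
            stack.append(LEXICON[word])
    return stack == ["S"]

print(shift_reduce("the dog likes meat".split()))
```

Note that the decision here is indeed "when to reduce": the loop prefers reductions over shifts, which happens to suffice for this tiny grammar but would need backtracking or lookahead in general.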
Transition Networks:
A transition network is a finite state automaton that is used to represent a part of a grammar. A
transition network parser uses a number of these transition networks to represent its entire
grammar. Each network represents one non-terminal symbol in the grammar.
A transition network is a method of parsing which represents the grammar as a set of finite
state machines (FSMs).
1. Augmented Transition Networks (ATNs): The ATN was developed by William Woods in 1970.
The ATN method of parsing sentences integrates many concepts from Chomsky’s (1957) formal
grammar theory with a matching process resembling a dynamic semantic network.
2. Recursive Transition Networks (RTNs): An RTN is a transition network that permits arc
labels to refer to other networks (and those networks may, in turn, refer back to the referring
network), rather than permitting only the word categories used previously.
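As an illustrative sketch of this idea (the network layout, lexicon, and function names are assumptions, not from any specific system), each network can be encoded as a small FSM whose arcs are labeled either with a lexical category or with the name of another network, which is traversed recursively, as in an RTN:

```python
# Each network maps a state to its outgoing arcs; an arc label is either
# a lexical category (matched against the next word's tag) or the name of
# another network, which is traversed recursively (the RTN idea).
NETWORKS = {
    "S":  {0: [("NP", 1)], 1: [("VP", 2)], 2: []},
    "NP": {0: [("DT", 1)], 1: [("NN", 2)], 2: []},
    "VP": {0: [("VBZ", 1)], 1: [("NN", 2)], 2: []},
}
FINAL = {"S": 2, "NP": 2, "VP": 2}
LEXICON = {"the": "DT", "dog": "NN", "meat": "NN", "likes": "VBZ"}

def traverse(net, state, tags, pos):
    """Return the input position reached after traversing `net` from
    `state`, or None if the network cannot be completed."""
    if state == FINAL[net]:
        return pos
    for label, nxt in NETWORKS[net][state]:
        if label in NETWORKS:                        # arc names a sub-network
            end = traverse(label, 0, tags, pos)
            if end is not None:
                done = traverse(net, nxt, tags, end)
                if done is not None:
                    return done
        elif pos < len(tags) and tags[pos] == label:  # lexical arc
            done = traverse(net, nxt, tags, pos + 1)
            if done is not None:
                return done
    return None

def accepts(sentence):
    tags = [LEXICON[w] for w in sentence.split()]
    return traverse("S", 0, tags, 0) == len(tags)

print(accepts("the dog likes meat"))
```

The recursion on sub-network names is what distinguishes an RTN from a plain FSM: the S network never lists DT or NN arcs itself; it delegates to the NP and VP networks.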
To illustrate a top down chart parse, we will assume the CFG shown in Figure A.3.
Figure A.3. A small CFG for parsing “the dog likes meat”.
S → NP VBD NN → dog | meat
S → NP VP VBD → barked
NP → DT NN VBZ → likes
VP → VBZ NN DT → the
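Before walking through the chart trace, here is a hedged sketch of how the Figure A.3 grammar could be run top-down: the recursive-descent recognizer below always expands the leftmost symbol first (a leftmost derivation) and backtracks over alternative rules. The encoding of the grammar as a Python dict is illustrative:

```python
# The CFG of Figure A.3, with lexical rules folded in as single-terminal RHSs.
GRAMMAR = {
    "S":   [["NP", "VBD"], ["NP", "VP"]],
    "NP":  [["DT", "NN"]],
    "VP":  [["VBZ", "NN"]],
    "DT":  [["the"]],
    "NN":  [["dog"], ["meat"]],
    "VBD": [["barked"]],
    "VBZ": [["likes"]],
}

def parse(symbols, words):
    """Top-down recognizer with backtracking: expand the leftmost symbol
    using each of its rules in turn (leftmost derivation)."""
    if not symbols:
        return not words          # success only if the input is also exhausted
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:           # non-terminal: try each expansion in order
        return any(parse(rhs + rest, words) for rhs in GRAMMAR[head])
    # terminal: must match the next input word
    return bool(words) and words[0] == head and parse(rest, words[1:])

print(parse(["S"], "the dog likes meat".split()))
```

On "the dog likes meat" the first S rule (S → NP VBD) fails at "likes" and the parser backtracks to S → NP VP, which is exactly the re-prediction the chart trace below records more efficiently.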
Figure A.4 shows a trace of a top-down chart parse of the sentence “The dog likes meat.”,
showing the edges created in the top-down chart parser implemented in NLTK. The parse begins
with top-down predictions, based on the grammar, and then begins to process the actual input,
which is shown in the third row. As each word is read, there is a top-down prediction followed
by an application of the fundamental rule. After an edge is completed (such as the first NP) then
new predictions are added (e.g., for an upcoming VP).
Figure A.4. Trace of a top-down chart parse of the sentence “The dog likes meat.”
[0:0] S → * NP VBD
[0:0] S → * NP VP
[0:0] NP → * DT NN
[0:0] DT → * the
[0:1] DT → the *            (read "the"; apply the fundamental rule)
[0:1] NP → DT * NN
[1:1] NN → * dog            (predict "dog")
[1:2] NN → dog *            (apply the fundamental rule)
[0:2] NP → DT NN *          (the NP is complete)
[0:2] S → NP * VBD
[0:2] S → NP * VP
[2:2] VP → * VBZ NN         (top-down prediction for a VP, using the second S rule)
[2:2] VBZ → * likes         (predict "likes")
[2:3] VBZ → likes *         (apply the fundamental rule to create VBZ → likes *)
[2:3] VP → VBZ * NN         (and again, to add VP → VBZ * NN)
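The behavior traced above can be approximated with a small Earley-style top-down chart recognizer. The sketch below is simplified (it only recognizes, and its data structures are assumptions rather than NLTK's actual implementation), but it performs the same three steps as the trace: top-down prediction, scanning a word, and the fundamental rule:

```python
from collections import namedtuple

# A dotted edge [start:end] lhs -> done * todo, as in the Figure A.4 trace.
Edge = namedtuple("Edge", "start end lhs done todo")

GRAMMAR = {
    "S":  [("NP", "VBD"), ("NP", "VP")],
    "NP": [("DT", "NN")],
    "VP": [("VBZ", "NN")],
}
LEXICON = {"the": "DT", "dog": "NN", "meat": "NN", "likes": "VBZ", "barked": "VBD"}

def chart_parse(words):
    tags = [LEXICON[w] for w in words]
    chart = set()
    agenda = [Edge(0, 0, "S", (), rhs) for rhs in GRAMMAR["S"]]  # top-down start
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        if edge.todo:
            nxt = edge.todo[0]
            if nxt in GRAMMAR:
                # Top-down prediction: expand the category after the dot.
                for rhs in GRAMMAR[nxt]:
                    agenda.append(Edge(edge.end, edge.end, nxt, (), rhs))
                # Fundamental rule against already-complete edges.
                for e in list(chart):
                    if not e.todo and e.lhs == nxt and e.start == edge.end:
                        agenda.append(Edge(edge.start, e.end, edge.lhs,
                                           edge.done + (nxt,), edge.todo[1:]))
            elif edge.end < len(tags) and tags[edge.end] == nxt:
                # Scan: the next input word matches the category after the dot.
                agenda.append(Edge(edge.start, edge.end + 1, edge.lhs,
                                   edge.done + (nxt,), edge.todo[1:]))
        else:
            # Complete edge: apply the fundamental rule to waiting edges.
            for e in list(chart):
                if e.todo and e.todo[0] == edge.lhs and e.end == edge.start:
                    agenda.append(Edge(e.start, edge.end, e.lhs,
                                       e.done + (edge.lhs,), e.todo[1:]))
    return any(e.lhs == "S" and not e.todo and e.start == 0 and e.end == len(tags)
               for e in chart)

print(chart_parse("the dog likes meat".split()))
```

Unlike the backtracking recursive-descent parser, the chart stores every completed constituent (such as the NP over [0:2]) so it can be reused by both S rules without being re-parsed.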
In natural languages there are often agreement restrictions between words and phrases. For
example, the NP "a men" is not correct English because the article a indicates a single object
while the noun "men" indicates a plural object; the noun phrase does not satisfy the number
agreement restriction of English.
There are many other forms of agreement, including subject-verb agreement, gender agreement
for pronouns, restrictions between the head of a phrase and the form of its complement, and so
on. To handle such phenomena conveniently, the grammatical formalism is extended to allow
constituents to have features. For example, we might define a feature NUMBER that may take a
value of either s (for singular) or p (for plural), and we then might write an augmented CFG rule
such as

NP (NUMBER ?n) → ART (NUMBER ?n) N (NUMBER ?n)
This rule says that a legal noun phrase consists of an article followed by a noun, but only when
the number feature of the first word agrees with the number feature of the second. This one rule
is equivalent to two CFG rules that would use different non-terminal symbols for encoding singular
and plural forms of all noun phrases, such as

NP-SING → ART-SING N-SING
NP-PLURAL → ART-PLURAL N-PLURAL
While the two approaches seem similar in ease-of-use in this one example, consider that all rules
in the grammar that use an NP on the right-hand side would have to be duplicated into singular
and plural versions, roughly doubling the size of the grammar; the feature mechanism avoids this
blow-up.
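A minimal sketch of feature checking, assuming a tiny illustrative lexicon in which each word carries a NUMBER feature ("s" or "p"; treating "the" as unmarked for number is an assumption made here for illustration):

```python
# Each word carries a NUMBER feature: "s" (singular) or "p" (plural).
LEXICON = {
    "a":   ("ART", "s"), "the": ("ART", None),   # "the" is unmarked for number
    "man": ("N", "s"),   "men": ("N", "p"),
}

def noun_phrase_ok(article, noun):
    """Augmented NP -> ART N rule: the NUMBER features must agree,
    i.e. be equal or have one of them unspecified."""
    _, art_num = LEXICON[article]
    _, n_num = LEXICON[noun]
    return art_num is None or n_num is None or art_num == n_num

print(noun_phrase_ok("a", "man"))   # the features agree
print(noun_phrase_ok("a", "men"))   # number agreement fails, as in "a men"
```

The "agree or unspecified" test is a simple form of unification: one augmented rule covers singular, plural, and unmarked articles without duplicating the grammar.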
Lexical or Morphological Analysis is the initial step in NLP. It entails recognizing and analyzing
word structures. The collection of words and phrases in a language is referred to as the lexicon.
Lexical analysis is the process of breaking a text down into paragraphs, sentences, and words.
In this phase, the source text is scanned as a stream of characters and converted into
meaningful lexemes.
It refers to the study of text at the level of individual words. It searches for morphemes,
which are the smallest meaningful units of a word. Lexical analysis identifies the relationships
between these morphemes and transforms each word into its root form. A lexical analyzer also
assigns the word's probable parts of speech (POS).
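A toy illustration of these steps (the suffix list, lexicon, and tag set below are assumptions for demonstration, not a real morphological analyzer): the function tokenizes text, strips common suffixes to recover a root morpheme, and reports the root's possible POS tags:

```python
import re

# Toy lexicon of root forms with their possible parts of speech (assumed).
ROOTS = {"the": {"DT"}, "dog": {"NN"}, "like": {"NN", "VB"}, "bark": {"NN", "VB"}}
# (suffix, replacement) pairs tried in order, e.g. "liked" -> "like".
SUFFIXES = [("ies", "y"), ("ed", ""), ("ed", "e"), ("ing", ""), ("ing", "e"), ("s", "")]

def analyze(text):
    """Split text into word tokens, reduce each to a known root by
    stripping a common suffix, and report the root's possible POS tags."""
    out = []
    for token in re.findall(r"[a-z]+", text.lower()):
        root, pos = token, ROOTS.get(token)
        if pos is None:
            for suffix, repl in SUFFIXES:
                candidate = token[:-len(suffix)] + repl if token.endswith(suffix) else None
                if candidate in ROOTS:
                    root, pos = candidate, ROOTS[candidate]
                    break
        out.append((token, root, sorted(pos) if pos else ["UNK"]))
    return out

print(analyze("The dogs liked barking."))
```

Each result triple is (surface token, root morpheme, candidate POS tags); the final POS choice is left to a later tagging stage, since words like "like" are genuinely ambiguous at the lexical level.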