
Unit 3

Grammars and Sentence Structure

Grammar in NLP is a set of rules for constructing sentences in a language; it is used to
understand and analyze the structure of sentences in text data.

This includes identifying parts of speech such as nouns, verbs, and adjectives, determining the
subject and predicate of a sentence, and identifying the relationships between words and phrases.

Grammar is defined as the rules for forming well-structured sentences. It plays an essential
role both in describing the syntactic structure of well-formed programs and in denoting the
syntactic rules used for conversation in natural languages.

 In the theory of formal languages, grammar is also applicable in computer science, mainly
in programming languages and data structures. Example: in the C programming language, the
precise grammar rules state how functions are made with the help of lists and statements.

 Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P) where:


o N or VN = set of non-terminal symbols or variables.
o T or ∑ = set of terminal symbols.
o S = Start symbol where S ∈ N
o P = Production rules for Terminals as well as Non-terminals.
o Each production has the form α → β, where α and β are strings over VN ∪ ∑ and at
least one symbol of α belongs to VN.
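
As a small worked instance of this definition (the particular grammar is an illustrative
assumption, not one taken from elsewhere in this unit), the 4-tuple for a toy noun-phrase
grammar can be written directly in Python:

# A toy grammar G = (N, T, S, P); the rules are illustrative assumptions.
N = {"NP", "DT", "NN"}                  # non-terminal symbols (VN)
T = {"the", "a", "dog", "meat"}         # terminal symbols (the alphabet)
S = "NP"                                # start symbol, S in N
P = [                                   # production rules (alpha, beta)
    ("NP", ["DT", "NN"]),
    ("DT", ["the"]), ("DT", ["a"]),
    ("NN", ["dog"]), ("NN", ["meat"]),
]
assert S in N                           # the start symbol must be a non-terminal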

Syntax

Each natural language has an underlying structure usually referred to as its syntax. The
fundamental idea of syntax is that words group together to form constituents - groups of
words or phrases that behave as a single unit. These constituents can combine to form bigger
constituents and, eventually, sentences.

 Syntax describes the regularity and productivity of a language by making the structure of
sentences explicit. The goal of syntactic analysis, or parsing, is to detect whether a
sentence is correct and to provide its syntactic structure.

Syntax also refers to the way words are arranged together. Let us see some basic ideas related to
syntax:

 Constituency: Groups of words may behave as a single unit or phrase, called a constituent -
for example, a noun phrase.
 Grammatical relations: These are the formalization of ideas from traditional grammar.
Examples include subjects and objects.
 Subcategorization and dependency relations: These are the relations between words and
phrases - for example, a verb followed by an infinitive verb.
 Regular languages and parts of speech: Regular languages describe the way words are
arranged together, but they cannot easily support notions such as constituency, grammatical
relations, or subcategorization and dependency relations.
 Syntactic categories and their common denotations in NLP: np - noun phrase, vp -
verb phrase, s - sentence, det - determiner (article), n - noun, tv - transitive verb (takes an
object), iv - intransitive verb, prep - preposition, pp - prepositional phrase, adj - adjective

Top-Down and Bottom-Up Parsers


There are two parsing techniques: top-down parsing and bottom-up parsing. Top-down parsing is
a technique that first looks at the highest level of the parse tree and works down the parse
tree by using the rules of grammar, while bottom-up parsing first looks at the lowest level of
the parse tree and works up the parse tree by using the rules of grammar.

The main differences between these two parsing techniques are given below:

Top-Down Parsing vs. Bottom-Up Parsing

1. Top-down parsing is a strategy that first looks at the highest level of the parse tree and
works down the parse tree by using the rules of grammar; bottom-up parsing first looks at the
lowest level of the parse tree and works up the parse tree by using the rules of grammar.

2. Top-down parsing attempts to find the leftmost derivation for an input string; bottom-up
parsing attempts to reduce the input string to the start symbol of the grammar.

3. In top-down parsing, we start parsing from the top (the start symbol of the parse tree) and
work down to the leaf nodes; in bottom-up parsing, we start from the bottom (the leaf nodes of
the parse tree) and work up to the start symbol.

4. Top-down parsing uses leftmost derivation; bottom-up parsing uses rightmost derivation (in
reverse).

5. In top-down parsing, the main decision is which production rule to use in order to construct
the string; in bottom-up parsing, the main decision is when to use a production rule to reduce
the string back to the start symbol.

6. Example of a top-down parser: the recursive descent parser. Example of a bottom-up parser:
the shift-reduce parser.
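
Both strategies can be tried directly in NLTK. The following is a minimal sketch: the toy
grammar is an illustrative assumption, and note that NLTK's shift-reduce parser is greedy and
can fail on some grammars even when a parse exists.

import nltk

# A toy CFG; the rules here are illustrative assumptions.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ NN
DT -> 'the'
NN -> 'dog' | 'meat'
VBZ -> 'likes'
""")

sentence = "the dog likes meat".split()

# Top-down: expand from the start symbol S downward (recursive descent).
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# Bottom-up: shift words onto a stack and reduce toward S (shift-reduce).
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)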

Transition Network Grammars

Transition Networks:
A transition network is a finite state automaton that is used to represent a part of a grammar. A
transition network parser uses a number of these transition networks to represent its entire
grammar. Each network represents one non-terminal symbol in the grammar.
A transition network is a method of parsing which represents the grammar as a set of finite
state machines (FSMs).

Finite State Machine:


An FSM is a model of computational behavior where each node represents an internal state of the
system and the arcs are the means of moving between states. FSMs are used in automata theory to
represent grammars. In the parsing of natural language, the arcs in the networks represent
either a terminal or a non-terminal symbol.
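
As a minimal sketch of this idea (the states, arcs, and tiny lexicon below are assumptions for
illustration), an FSM that recognizes a simple noun phrase of the form DT N can be written as a
transition table:

# Minimal finite state machine that recognizes a noun phrase "DT N".
# States, arcs, and lexicon are illustrative assumptions.
LEXICON = {"the": "DT", "a": "DT", "dog": "N", "meat": "N"}

# TRANSITIONS[state][arc_label] -> next state
TRANSITIONS = {
    "q0": {"DT": "q1"},   # expect a determiner first
    "q1": {"N": "q2"},    # then a noun
}
FINAL_STATES = {"q2"}

def accepts(words):
    state = "q0"
    for w in words:
        label = LEXICON.get(w)                        # arc label = part of speech
        state = TRANSITIONS.get(state, {}).get(label)
        if state is None:
            return False                              # no arc for this word: reject
    return state in FINAL_STATES                      # accept only in a final state

print(accepts("the dog".split()))   # True
print(accepts("dog the".split()))   # False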

Types of Transition Networks:

1. Augmented Transition Networks (ATNs): The ATN was developed by William Woods in 1970. The
ATN method of parsing sentences integrates many concepts from Chomsky's (1957) formal grammar
theory with a matching process resembling a dynamic semantic network.
2. Recursive Transition Networks (RTNs): An RTN permits arc labels to refer to other networks,
and those networks, in turn, may refer back to the referring network, rather than permitting
only the word categories used previously.
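
As a minimal sketch of this recursion (the networks, arcs, and lexicon are again illustrative
assumptions), each network below is a small FSM, and an arc labeled with another network's name
is traversed by recursively running that network - a subroutine-style call between networks:

# Minimal recursive transition network (RTN) recognizer.
# Arc labels are either parts of speech or names of other networks.
LEXICON = {"the": "DT", "dog": "N", "meat": "N", "likes": "V"}

NETWORKS = {
    "S":  {"start": "s0", "final": {"s2"},
           "arcs": {("s0", "NP"): "s1", ("s1", "VP"): "s2"}},
    "NP": {"start": "n0", "final": {"n2"},
           "arcs": {("n0", "DT"): "n1", ("n1", "N"): "n2"}},
    "VP": {"start": "v0", "final": {"v2"},
           "arcs": {("v0", "V"): "v1", ("v1", "NP"): "v2"}},
}

def run(net_name, words, pos):
    """Traverse one network starting at word index pos.
    Returns the set of positions where the network can finish."""
    net = NETWORKS[net_name]
    results = set()

    def walk(state, i):
        if state in net["final"]:
            results.add(i)
        for (src, label), dst in net["arcs"].items():
            if src != state:
                continue
            if label in NETWORKS:                 # subnetwork call
                for j in run(label, words, i):
                    walk(dst, j)
            elif i < len(words) and LEXICON.get(words[i]) == label:
                walk(dst, i + 1)                  # consume one word

    walk(net["start"], pos)
    return results

words = "the dog likes the meat".split()
print(len(words) in run("S", words, 0))   # True: sentence accepted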

Top- Down Chart Parsing


Top-down chart parsing methods, such as Earley's algorithm, begin with the top-most
nonterminal and then expand downward by predicting rules in the grammar, considering for each
rule the leftmost unseen category (the category just to the right of the dot). Compared to
other search-based parsing algorithms, top-down chart parsing can be more efficient because it
eliminates many potential local ambiguities as it expands the tree downwards.

To illustrate a top-down chart parse, we will assume the CFG shown in Figure A.3.

Figure A.3. A small CFG for parsing "the dog likes meat".

S → NP VBD
S → NP VP
NP → DT NN
VP → VBZ NN
NN → dog | meat
VBD → barked
VBZ → likes
DT → the

Figure A.4 shows a trace of a top-down chart parse of the sentence “The dog likes meat.”,
showing the edges created in the top-down chart parser implemented in NLTK. The parse begins
with top-down predictions, based on the grammar, and then begins to process the actual input,
which is shown in the third row. As each word is read, there is a top-down prediction followed
by an application of the fundamental rule. After an edge is completed (such as the first NP) then
new predictions are added (e.g., for an upcoming VP).

Figure A.4. Trace of a top-down chart parse of the sentence "The dog likes meat." At each step
the chart retains all edges from earlier steps; only the newly created edges are listed.

Step 1. For each of the sentence rules, make a top-down prediction to create an empty, active
edge.
[0:0] S → * NP VBD
[0:0] S → * NP VP

Step 2. Make more top-down predictions to create active edges for each of the nonterminal
categories just to the right of the dot.
[0:0] NP → * DT NN

Step 3. Predict "the" and use the fundamental rule to create new edges where the dot in the DT
rule and in the NP rule moves to the right.
[0:0] DT → * the
[0:1] DT → the *
[0:1] NP → DT * NN

Step 4. Predict "dog"; apply the fundamental rule, and then make a top-down prediction for a VP
(using the second S rule).
[1:1] NN → * dog
[1:2] NN → dog *
[0:2] NP → DT NN *
[0:2] S → NP * VBD
[0:2] S → NP * VP
[2:2] VP → * VBZ NN

Step 5. Predict "likes"; apply the fundamental rule to create VBZ → likes *, and again to add
VP → VBZ * NN.
[2:2] VBZ → * likes
[2:3] VBZ → likes *
[2:3] VP → VBZ * NN

Step 6. Predict "meat"; apply the fundamental rule for meat as a noun (NN), and again to extend
the active VP edge, giving VP → VBZ NN *. Finally, use the fundamental rule to extend the
active S edge, which is now complete.
[3:3] NN → * meat
[3:4] NN → meat *
[2:4] VP → VBZ NN *
[0:4] S → NP VP *
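
This trace can be reproduced with NLTK's top-down chart parser and the grammar of Figure A.3.
The following is a minimal sketch; the exact formatting and ordering of the printed edges may
differ across NLTK versions.

import nltk
from nltk.parse.chart import TopDownChartParser

# The grammar of Figure A.3, written in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S -> NP VBD
S -> NP VP
NP -> DT NN
VP -> VBZ NN
NN -> 'dog' | 'meat'
VBD -> 'barked'
VBZ -> 'likes'
DT -> 'the'
""")

# trace=2 prints the edges as they are added to the chart.
parser = TopDownChartParser(grammar, trace=2)
for tree in parser.parse("the dog likes meat".split()):
    print(tree)   # (S (NP (DT the) (NN dog)) (VP (VBZ likes) (NN meat)))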

Feature Systems and Augmented Grammars

In natural languages there are often agreement restrictions between words and phrases. For
example, the NP "a men" is not correct English because the article a indicates a single object
while the noun "men" indicates a plural object; the noun phrase does not satisfy the number
agreement restriction of English.

There are many other forms of agreement, including subject-verb agreement, gender agreement
for pronouns, restrictions between the head of a phrase and the form of its complement, and so
on. To handle such phenomena conveniently, the grammatical formalism is extended to allow
constituents to have features. For example, we might define a feature NUMBER that may take a
value of either s (for singular) or p (for plural), and we then might write an augmented CFG rule
such as

NP → ART N only when NUMBER1 agrees with NUMBER2

This rule says that a legal noun phrase consists of an article followed by a noun, but only when
the number feature of the first word agrees with the number feature of the second. This one rule
is equivalent to two CFG rules that would use different terminal symbols for encoding singular
and plural forms of all noun phrases, such as

NP-SING → ART-SING N-SING

NP-PLURAL → ART-PLURAL N-PLURAL

While the two approaches seem similar in ease of use in this one example, consider that all
rules in the grammar that use an NP on the right-hand side would have to be duplicated into
singular and plural variants, effectively doubling the size of the grammar; each additional
feature would multiply the grammar size again, so features keep the grammar compact.
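
NLTK provides a feature-based grammar formalism that expresses this kind of agreement rule. The
following is a minimal sketch; the tiny lexicon is an illustrative assumption. NUM is a number
feature, and the shared variable ?n forces the article and the noun to agree; the unmarked
entry for 'the' leaves its number unspecified, so it unifies with either value.

from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

grammar = FeatureGrammar.fromstring("""
NP -> ART[NUM=?n] N[NUM=?n]
ART[NUM=sg] -> 'a'
ART -> 'the'
N[NUM=sg] -> 'man'
N[NUM=pl] -> 'men'
""")

parser = FeatureChartParser(grammar)
print(list(parser.parse("a man".split())))   # one parse: agreement holds
print(list(parser.parse("a men".split())))   # []: sg article, pl noun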

Morphological Analysis and the Lexicon

Lexical or Morphological Analysis is the initial step in NLP. It entails recognizing and analyzing
word structures. The collection of words and phrases in a language is referred to as the lexicon.
Lexical analysis is the process of breaking down a text file into paragraphs, phrases, and
words. In this phase, the source text is scanned as a stream of characters and converted into
meaningful lexemes. Lexical analysis studies text at the level of individual words: it searches
for morphemes, the smallest meaningful units of a word, identifies the relationships between
these morphemes, and transforms each word into its root form. A lexical analyzer also assigns
each word its probable parts of speech (POS).
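
A minimal sketch of these steps using NLTK - tokenization, POS tagging, and reduction to root
forms; the sample sentence is an illustrative assumption:

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK data packages
# (newer NLTK versions may use *_tab / *_eng package names).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

text = "The dogs were barking loudly."

tokens = nltk.word_tokenize(text)   # break the text into words
tagged = nltk.pos_tag(tokens)       # assign probable POS tags
print(tagged)                       # e.g. [('The', 'DT'), ('dogs', 'NNS'), ...]

# Reduce words to their root forms (lemmas).
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dogs", pos="n"))      # dog
print(lemmatizer.lemmatize("barking", pos="v"))   # bark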

Augmented transition network


An augmented transition network or ATN is a type of graph theoretic structure used in
the operational definition of formal languages, used especially in parsing relatively
complex natural languages, and having wide application in artificial intelligence. An ATN can,
theoretically, analyze the structure of any sentence, however complicated. ATNs are modified
transition networks and an extension of RTNs.
ATNs build on the idea of using finite state machines (Markov model) to parse sentences. W. A.
Woods in "Transition Network Grammars for Natural Language Analysis" claims that by adding
a recursive mechanism to a finite state model, parsing can be achieved much more efficiently.
Instead of building an automaton for a particular sentence, a collection of transition graphs is
built. A grammatically correct sentence is parsed by reaching a final state in any state graph.
Transitions between these graphs are simply subroutine calls from one state to any initial state on
any graph in the network. A sentence is determined to be grammatically correct if a final state is
reached by the last word in the sentence.
This model meets many of the goals set forth by the nature of language in that it captures the
regularities of the language. That is, if there is a process that operates in a number of
environments, the grammar should encapsulate the process in a single structure. Such
encapsulation not only simplifies the grammar, but has the added bonus of efficiency of
operation. Another advantage of such a model is the ability to postpone decisions. Many
grammars use guessing when an ambiguity comes up. This means that not enough is yet known
about the sentence. By the use of recursion, ATNs solve this inefficiency by postponing
decisions until more is known about a sentence.
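
A minimal sketch of the "augmented" part (the network, register names, and lexicon are
hypothetical, and real ATN systems are far richer): extending the RTN idea above, each arc can
also run an action that fills a register, so that after a successful parse the registers hold a
simple structural description of the sentence.

# Minimal ATN-style recognizer: RTN-like arcs plus register-setting actions.
# The network, actions, and lexicon are illustrative assumptions.
LEXICON = {"the": "DT", "dog": "N", "dogs": "N", "barked": "V"}

# Each arc: (state, label) -> (next_state, register); traversing the arc
# stores the consumed word in the named register.
ARCS = {
    ("s0", "DT"): ("s1", "det"),
    ("s1", "N"):  ("s2", "subj"),
    ("s2", "V"):  ("s3", "verb"),
}
FINAL = {"s3"}

def parse(words):
    state, registers = "s0", {}
    for w in words:
        arc = ARCS.get((state, LEXICON.get(w)))
        if arc is None:
            return None                   # no matching arc: fail
        state, register = arc
        registers[register] = w           # the "augmentation": fill a register
    return registers if state in FINAL else None

print(parse("the dog barked".split()))
# {'det': 'the', 'subj': 'dog', 'verb': 'barked'}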
