Grammars


Naga College Foundation, Inc.

COLLEGE OF COMPUTER STUDIES


PACUCOA Level II for BS Computer Science

GRAMMARS AND LANGUAGES

Computers can perform many tasks. Given a task, two questions arise. The first is:
Can it be carried out using a computer? Once we know that this first question has an
affirmative answer, we can ask the second question: How can the task be carried out?

Models of computation are used to help answer these questions. We will study three
types of structures used in models of computation, namely, grammars, finite-state machines,
and Turing machines. Grammars are used to generate the words of a language and to
determine whether a word is in a language. Formal languages, which are generated by
grammars, provide models both for natural languages, such as English, and for programming
languages, such as Pascal, Fortran, Prolog, C, and Java.

In particular, grammars are extremely important in the construction and theory of
compilers. The grammars that we will discuss were first used by the American linguist Noam
Chomsky in the 1950s.

Various types of finite-state machines are used in modeling. All finite-state machines
have a set of states, including a starting state, an input alphabet, and a
transition function that assigns a next state to every pair of a state and an input.

The states of a finite-state machine give it limited memory capabilities. Some finite-
state machines produce an output symbol for each transition; these machines can be used to
model many kinds of machines, including vending machines, delay machines, binary adders,
and language recognizers.

Words in the English language can be combined in various ways. The grammar of
English tells us whether a combination of words is a valid sentence. For instance, "the frog
writes neatly" is a valid sentence, because it is formed from a noun phrase, "the frog", made up
of the article "the" and the noun "frog", followed by a verb phrase, "writes neatly", made up of the
verb "writes" and the adverb "neatly". We do not care that this is a nonsensical statement,
because we are concerned only with the syntax, or form, of the sentence, and not its
semantics, or meaning. We also note that the combination of words "swims quickly
mathematics" is not a valid sentence because it does not follow the rules of English grammar.

The syntax of a natural language, that is, a spoken language, such as English, French,
German, or Spanish, is extremely complicated. In fact, it does not seem possible to specify
all the rules of syntax for a natural language.

Research in the automatic translation of one language to another has led to the concept
of a formal language, which, unlike a natural language, is specified by a well-defined set of
rules of syntax. Rules of syntax are important not only in linguistics, the study of natural
languages, but also in the study of programming languages.

We will describe the sentences of a formal language using a grammar. The use of
grammars helps when we consider the two classes of problems that arise most frequently in
applications to programming languages:

(1) How can we determine whether a combination of words is a valid sentence in a
formal language?

(2) How can we generate the valid sentences of a formal language?



Before giving a technical definition of a grammar, we will describe an example of a
grammar that generates a subset of English. This subset of English is defined using a list of
rules that describe how a valid sentence can be produced.

We specify that

1. a sentence is made up of a noun phrase followed by a verb phrase;
2. a noun phrase is made up of an article followed by an adjective followed by a noun, or
3. a noun phrase is made up of an article followed by a noun;
4. a verb phrase is made up of a verb followed by an adverb, or
5. a verb phrase is made up of a verb;
6. an article is a, or
7. an article is the;
8. an adjective is large, or
9. an adjective is hungry;
10. a noun is rabbit, or
11. a noun is mathematician;
12. a verb is eats, or
13. a verb is hops;
14. an adverb is quickly, or
15. an adverb is wildly.

From these rules we can form valid sentences using a series of replacements until no more
rules can be used. For instance, we can follow the sequence of replacements:

sentence
noun phrase verb phrase
article adjective noun verb phrase
article adjective noun verb adverb
the adjective noun verb adverb
the large noun verb adverb
the large rabbit verb adverb
the large rabbit hops adverb
the large rabbit hops quickly

to obtain a valid sentence. It is also easy to see that some other valid sentences are: "a hungry
mathematician eats wildly", "a large mathematician hops", "the rabbit eats quickly", and so on.
Also, we can see that "the quickly eats mathematician" is not a valid sentence.
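The fifteen rules above can be written down as data and expanded mechanically. The following is a minimal Python sketch, not part of the original handout; the names RULES and generate are illustrative choices. It enumerates every sentence the rules produce and checks the examples mentioned in the text.

```python
# Productions of the introductory grammar, keyed by the symbol being replaced.
# Each alternative is a list of symbols. RULES is an illustrative name.
RULES = {
    "sentence": [["noun phrase", "verb phrase"]],
    "noun phrase": [["article", "adjective", "noun"], ["article", "noun"]],
    "verb phrase": [["verb", "adverb"], ["verb"]],
    "article": [["a"], ["the"]],
    "adjective": [["large"], ["hungry"]],
    "noun": [["rabbit"], ["mathematician"]],
    "verb": [["eats"], ["hops"]],
    "adverb": [["quickly"], ["wildly"]],
}


def generate(symbol="sentence"):
    """Yield every list of words that can be derived from `symbol`."""
    if symbol not in RULES:
        # No rule replaces this symbol, so it is an actual word.
        yield [symbol]
        return
    for alternative in RULES[symbol]:
        # Expand the symbols of this alternative from left to right.
        partial = [[]]
        for part in alternative:
            partial = [head + tail for head in partial for tail in generate(part)]
        yield from partial


sentences = {" ".join(words) for words in generate()}
print(len(sentences))                                   # 72
print("the large rabbit hops quickly" in sentences)     # True
print("the quickly eats mathematician" in sentences)    # False
```

The count 72 comes from 12 possible noun phrases (8 with an adjective, 4 without) times 6 possible verb phrases (4 with an adverb, 2 without).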

A vocabulary (or alphabet) V is a finite, nonempty set of elements called symbols. A word (or
sentence) over V is a string of finite length of elements of V. The empty string or null string,
denoted by λ, is the string containing no symbols. The set of all words over V is denoted by
V∗. A language over V is a subset of V∗.

Note that λ, the empty string, is the string containing no symbols. It is different from ∅, the
empty set. It follows that {λ} is the set containing exactly one string, namely, the empty
string. Languages can be specified in various ways. One way is to list all the words in the
language. Another is to give some criteria that a word must satisfy to be in the language. In
this section, we describe another important way to specify a language, namely, through the
use of a grammar, such as the set of rules we gave in the introduction to this section.
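To make these definitions concrete, here is a small Python sketch under stated assumptions: the vocabulary {0, 1}, the helper name words_up_to, and the palindrome criterion are all chosen for illustration. Because V∗ is infinite, the sketch lists only the words up to a given length.

```python
from itertools import product


def words_up_to(vocabulary, max_length):
    """Return all words over `vocabulary` of length 0..max_length, as strings."""
    return ["".join(letters)
            for n in range(max_length + 1)
            for letters in product(sorted(vocabulary), repeat=n)]


V = {"0", "1"}          # an illustrative vocabulary
LAMBDA = ""             # the empty string λ

words = words_up_to(V, 2)
print(words)            # ['', '0', '1', '00', '01', '10', '11']

# A language specified by a criterion: words that read the same backward.
language = {w for w in words if w == w[::-1]}
print(sorted(language))             # ['', '0', '00', '1', '11']

# The empty set contains no words; {λ} contains exactly one word.
print(len(set()), len({LAMBDA}))    # 0 1
```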

A grammar provides a set of symbols of various types and a set of rules for producing
words. More precisely, a grammar has a vocabulary V, which is a set of symbols used to
derive members of the language. Some of the elements of the vocabulary cannot be replaced
by other symbols. These are called terminals, and the other members of the vocabulary,
which can be replaced by other symbols, are called nonterminals. The sets of terminals and
nonterminals are usually denoted by T and N, respectively.

In the example given in the introduction of the section, the set of terminals is {a, the,
large, hungry, rabbit, mathematician, hops, eats, quickly, wildly}, and the set of nonterminals
is {sentence, noun phrase, verb phrase, adjective, article, noun, verb, adverb}. There is a
special member of the vocabulary called the start symbol, denoted by S, which is the element
of the vocabulary that we always begin with.

In the example in the introduction, the start symbol is sentence. The rules that specify
when we can replace a string from V∗, the set of all strings of elements in the vocabulary,
with another string are called the productions of the grammar.

The notion of a phrase-structure grammar extends the concept of a rewrite system devised by
Axel Thue in the early 20th century.

We denote by z0 → z1 the production that specifies that z0 can be replaced by z1
within a string. The productions in the grammar given in the introduction of this section were
listed. The first production, written using this notation, is sentence → noun phrase verb
phrase.
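Applying a production z0 → z1 amounts to finding z0 inside the current string and substituting z1 for it. The Python sketch below illustrates this. Its name apply_production, its use of tuples of symbols, and its choice to rewrite the leftmost occurrence are assumptions made for the example.

```python
def apply_production(form, lhs, rhs):
    """Replace the leftmost occurrence of `lhs` in `form` by `rhs`.

    `form`, `lhs` and `rhs` are tuples of symbols. Returns None when
    `lhs` does not occur in `form`, so the production cannot be applied.
    """
    n = len(lhs)
    for i in range(len(form) - n + 1):
        if form[i:i + n] == lhs:
            return form[:i] + rhs + form[i + n:]
    return None


form = ("sentence",)
form = apply_production(form, ("sentence",), ("noun phrase", "verb phrase"))
form = apply_production(form, ("noun phrase",), ("article", "noun"))
form = apply_production(form, ("article",), ("the",))
print(form)   # ('the', 'noun', 'verb phrase')
```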

TYPES OF GRAMMARS

Phrase-structure grammars can be classified according to the types of productions that
are allowed. We will describe the classification scheme introduced by Noam Chomsky.

A type 0 grammar has no restrictions on its productions.

A type 1 grammar can have productions of the form w1 → w2, where w1 = lAr and w2
= lwr, where A is a nonterminal symbol, l and r are strings of zero or more terminal or
nonterminal symbols, and w is a nonempty string of terminal or nonterminal symbols. It can
also have the production S → λ as long as S does not appear on the right-hand side of any
other production.

A type 2 grammar can have productions only of the form w1 → w2, where w1 is a single
symbol that is not a terminal symbol.

A type 3 grammar can have productions only of the form w1 → w2 with w1 = A and
either w2 = aB or w2 = a, where A and B are nonterminal symbols and a is a terminal symbol,
or with w1 = S and w2 = λ.

Type 2 grammars are called context-free grammars because a
nonterminal symbol that is the left side of a production can be replaced in a string whenever
it occurs, no matter what else is in the string. A language generated by a type 2 grammar is
called a context-free language. When there is a production of the form lw1r → lw2r (but not
of the form w1 → w2), the grammar is called type 1 or context-sensitive because w1 can be
replaced by w2 only when it is surrounded by the strings l and r. A language generated by a
type 1 grammar is called a context-sensitive language. Type 3 grammars are also called
regular grammars. A language generated by a regular grammar is called regular.

Of the four types of grammars we have defined, context-sensitive grammars have the most
complicated definition. Sometimes, these grammars are defined in a different way. A
production of the form w1 → w2 is called noncontracting if the length of w1 is less than or
equal to the length of w2.

According to our characterization of context-sensitive languages, every production in
a type 1 grammar, other than the production S → λ, if it is present, is noncontracting. It
follows that the lengths of the strings in a derivation in a context-sensitive language are
nondecreasing unless the production S → λ is used. This means that the only way for the
empty string to belong to the language generated by a context-sensitive grammar is for the
production S → λ to be part of the grammar.

The other way that context-sensitive grammars are defined is by specifying that all
productions are noncontracting. A grammar with this property is called noncontracting or
monotonic. The class of noncontracting grammars is not the same as the class of
context-sensitive grammars. However, these two classes are closely related; it can be shown
that they define the same set of languages except that noncontracting grammars cannot
generate any language containing the empty string λ.
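The restrictions that define the four types can be checked production by production. The following Python sketch is one possible encoding, not a standard routine. It assumes every symbol is a single character, writes λ as the empty string, and uses illustrative names throughout. It reports which of the types 0 to 3, as defined in this section, a grammar satisfies.

```python
def is_noncontracting(lhs, rhs):
    """A production is noncontracting when len(lhs) <= len(rhs)."""
    return len(lhs) <= len(rhs)


def is_type1(lhs, rhs, nonterminals):
    """True if lhs = lAr and rhs = lwr, with A a nonterminal and w nonempty."""
    for i, symbol in enumerate(lhs):
        if symbol in nonterminals:
            l, r = lhs[:i], lhs[i + 1:]
            # len(rhs) >= len(lhs) guarantees a nonempty w between l and r.
            if len(rhs) >= len(lhs) and rhs.startswith(l) and rhs.endswith(r):
                return True
    return False


def is_type2(lhs, nonterminals):
    """True if the left side is a single nonterminal symbol."""
    return len(lhs) == 1 and lhs in nonterminals


def is_type3(lhs, rhs, nonterminals, start):
    """True for A -> aB, A -> a, or S -> λ (λ is the empty string here)."""
    if not is_type2(lhs, nonterminals):
        return False
    if rhs == "":
        return lhs == start
    if len(rhs) == 1:
        return rhs not in nonterminals
    return len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals


def grammar_types(productions, nonterminals, start="S"):
    """Return the set of types (0..3) whose restrictions every production meets."""
    start_on_rhs = any(start in rhs for _, rhs in productions)

    def type1_ok(lhs, rhs):
        if lhs == start and rhs == "":
            return not start_on_rhs   # S -> λ only if S never occurs on a right side
        return is_type1(lhs, rhs, nonterminals)

    types = {0}
    if all(type1_ok(l, r) for l, r in productions):
        types.add(1)
    if all(is_type2(l, nonterminals) for l, _ in productions):
        types.add(2)
    if all(is_type3(l, r, nonterminals, start) for l, r in productions):
        types.add(3)
    return types


G1 = [("S", "AB"), ("A", "Ca"), ("B", "Ba"), ("B", "Cb"),
      ("B", "b"), ("C", "cb"), ("C", "b")]
G2 = [("S", "0S"), ("S", "1")]
G3 = [("S", "0S1"), ("S", "")]

print(sorted(grammar_types(G1, set("SABC"))))   # [0, 1, 2]
print(sorted(grammar_types(G2, {"S"})))         # [0, 1, 2, 3]
print(sorted(grammar_types(G3, {"S"})))         # [0, 2]
print(all(is_noncontracting(l, r) for l, r in G3))   # False
```

The third grammar is type 2 but fails the type 1 test as stated here, because it contains S → λ while S also appears on a right-hand side.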

DERIVATION TREES

A derivation in the language generated by a context-free grammar can be represented
graphically using an ordered rooted tree, called a derivation tree, or parse tree. The root of this
tree represents the starting symbol. The internal vertices of the tree represent the
nonterminal symbols that arise in the derivation. The leaves of the tree represent the terminal
symbols that arise.

If the production A → w arises in the derivation, where w is a word, the vertex that
represents A has as children vertices that represent each symbol in w, in order from left to
right.

Example: Construct a derivation tree for the sentence "the hungry rabbit eats quickly", which
is generated by the grammar given in the introduction of this section. Solution: The derivation
tree is shown in Figure 1.
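A derivation tree of this kind can also be written as a nested data structure. The Python sketch below is an illustration only; the node representation and the helper names node, leaves, and show are assumptions. It builds the tree for "the hungry rabbit eats quickly" from the rules in the introduction, prints it with indentation, and reads the sentence back from the leaves in left-to-right order.

```python
def node(label, *children):
    """A tree vertex is a pair (label, list of children); leaves have no children."""
    return (label, list(children))


tree = node("sentence",
            node("noun phrase",
                 node("article", node("the")),
                 node("adjective", node("hungry")),
                 node("noun", node("rabbit"))),
            node("verb phrase",
                 node("verb", node("eats")),
                 node("adverb", node("quickly"))))


def leaves(t):
    """Return the labels of the leaves of `t`, from left to right."""
    label, children = t
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]


def show(t, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    label, children = t
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(show(child, depth + 1))
    return lines


print("\n".join(show(tree)))
print(" ".join(leaves(tree)))   # the hungry rabbit eats quickly
```

The internal vertices carry nonterminal symbols and the leaves carry terminal symbols, as described above.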

Example: Determine whether the word cbab belongs to the language generated by the grammar

G = (V, T, S, P),

where V = {a, b, c, A, B, C, S}, T = {a, b, c}, S is the starting symbol, and the productions are

S → AB, A → Ca, B → Ba, B → Cb, B → b, C → cb, C → b.

Solution: One way to approach this problem is to begin with S and attempt to derive cbab
using a series of productions. Because there is only one production with S on its left-hand
side, we must start with S ⇒ AB. Next we use the only production that has A on its left-hand
side, namely, A → Ca, to obtain S ⇒ AB ⇒ CaB. Because cbab begins with the symbols cb, we
use the production C → cb. This gives us S ⇒ AB ⇒ CaB ⇒ cbaB. We finish by using the
production B → b, to obtain S ⇒ AB ⇒ CaB ⇒ cbaB ⇒ cbab. The approach that we have used
is called top-down parsing, because it begins with the starting symbol and proceeds by
successively applying productions.
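The top-down search just described can be automated. The following Python sketch is a simple breadth-first search, not an efficient parser, and the names top_down, PRODUCTIONS, and NONTERMINALS are illustrative. It always rewrites the leftmost nonterminal. It relies on the fact that no production of this particular grammar shortens a string, so any string longer than the target can be discarded.

```python
from collections import deque

PRODUCTIONS = [("S", "AB"), ("A", "Ca"), ("B", "Ba"), ("B", "Cb"),
               ("B", "b"), ("C", "cb"), ("C", "b")]
NONTERMINALS = set("SABC")


def top_down(target, start="S"):
    """Search for a leftmost derivation of `target`; return its strings or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        derivation = queue.popleft()
        form = derivation[-1]
        if form == target:
            return derivation
        # Position of the leftmost nonterminal, if any.
        i = next((k for k, s in enumerate(form) if s in NONTERMINALS), None)
        if i is None:
            continue                      # all terminals, but not the target
        if form[:i] != target[:i]:
            continue                      # the terminal prefix already disagrees
        for lhs, rhs in PRODUCTIONS:
            if lhs == form[i]:
                new = form[:i] + rhs + form[i + 1:]
                # Strings never shrink in this grammar, so longer ones are hopeless.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(derivation + [new])
    return None


print(" => ".join(top_down("cbab")))   # S => AB => CaB => cbaB => cbab
print(top_down("cc"))                  # None
```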

There is another approach to this problem, called bottom-up parsing. In this approach, we
work backward. Because cbab is the string to be derived, we can use the production C → cb,
so that Cab ⇒ cbab. Then, we can use the production A → Ca, so that Ab ⇒ Cab ⇒ cbab. Using
the production B → b gives AB ⇒ Ab ⇒ Cab ⇒ cbab. Finally, using S → AB shows that a
complete derivation for cbab is S ⇒ AB ⇒ Ab ⇒ Cab ⇒ cbab.
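Working backward can be sketched in the same style. The Python code below is a plain depth-first search, not a practical bottom-up parser, and the names bottom_up and search are illustrative. It repeatedly replaces an occurrence of the right-hand side of a production by its left-hand side until only S remains. The derivation it finds depends on the order in which replacements are tried, so it need not be the same one given in the text.

```python
PRODUCTIONS = [("S", "AB"), ("A", "Ca"), ("B", "Ba"), ("B", "Cb"),
               ("B", "b"), ("C", "cb"), ("C", "b")]


def bottom_up(word, start="S"):
    """Reduce `word` back to `start`; return the derivation in forward order or None."""
    seen = set()

    def search(form):
        if form == start:
            return [form]
        seen.add(form)
        for lhs, rhs in PRODUCTIONS:
            pos = form.find(rhs)
            while pos != -1:
                # Undo the production lhs -> rhs at this position.
                previous = form[:pos] + lhs + form[pos + len(rhs):]
                if previous not in seen:
                    result = search(previous)
                    if result is not None:
                        return result + [form]
                pos = form.find(rhs, pos + 1)
        return None

    return search(word)


derivation = bottom_up("cbab")
print(" => ".join(derivation))
print(bottom_up("cc"))   # None
```

The search terminates for this grammar because every backward step either shortens the string or replaces a terminal by a nonterminal.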

Source:

151 discrete mathematics K. Rosen, discrete mathematics and its ... (n.d.). Retrieved March 26, 2023,
from https://faculty.ksu.edu.sa/sites/default/files/Math%20151-New%20Syllabus%20441_0.pdf
