SYNTAX AND SEMANTICS
Programming Language
What is Syntax?
• Syntax is the form in which programs are written.
• The programs admitted by the definition must be correct
programs.
4 General Syntactic Criteria
1. Readability
– This is the property whereby programs are self-
documenting, i.e., understandable without any
separate documentation.
– For an ordinary user who knows only English,
Cobol is the most readable.
– For a logician, Prolog is the most readable.
2. Writability
– Writability features are most often in conflict with
readability.
– Writability requires more concise and regular
structures, while readability requires more verbose
constructs.
– The more readable a program is, the more
difficult it is to write the same program.
3. Ease of Translation
– If readability and writability are directed
towards the users of the language, ease of
translation is directed towards the machine
where the language will be executed.
– It is usually measured by the size of the compiler
for the language; hence, the fewer the syntactic
constructs, the better.
– This also affects the popularity of the
language.
– Algol-68 never became popular because of its
complexity, which makes translation very
difficult to do.
4. Lack of Ambiguity
– It means that every program in the language
must have only one interpretation.
– Example, in Fortran:
– The code A(I,J) may be interpreted as an access to
an element of a two-dimensional array or as a
procedure call with two parameters.
– Example, in Algol-60:
– if C1 then if C2 then S1 else S2
– where the else may be matched with either the first
or the second if
Syntactic Elements
1. Character Set
– The set of symbols used in the PL is called
its character set or alphabet.
– This refers to all the characters that can be
used in writing a program, as inputs to the
program, and as output of the program.
– Machine language, for example, has as its
alphabet only 0 and 1.
2. Identifiers
– These are strings used to name data objects,
procedures, and keywords.
– The issues involved in the choice of what to
support are: rules on their use, maximum
length, and whether identifiers are case-sensitive
or not.
– For instance:
• Consider BASIC, where identifiers were
restricted to a single capital letter, optionally
followed by a digit or by $.
• Compare this to C, where an identifier can be of any
length but must start with a letter, and is
case-sensitive.
3. Operator Symbols
– These are the symbols used to represent the
primitive operations in the language.
– For Example:
• In Pascal, we have the symbols +,-,*,/
• Cobol uses “ADD” for addition
• Fortran uses “.EQ.” for the equality operator
4. Keywords and Reserved Words
– A keyword is an identifier used as a fixed part
of the syntax.
• For example:
– the keywords begin and end of Pascal
– the switch of C
– A reserved word is a keyword that may not be
used as a programmer-chosen identifier.
Reserved words are important because they
make the job of translation difficult if we do
not define them properly.
5. Comments and Noise Words
– Comments are words ignored during
translation.
• Pascal uses the markers “{“ and “}”
• C uses markers “/*” and “*/”
– Noise words are optional words included in
the statement to enhance readability.
• For example, in Cobol:
– GO TO <label>
– where TO is a noise word, since it can be deleted without
affecting the program
6. Delimiters and Brackets
– A delimiter is a syntactic element used to mark
the beginning or end of some syntactic
construct.
• For example:
– “;” for Pascal and “:” for Basic
– Brackets are paired delimiters used to enclose a
group of statements.
– In Pascal, “begin” and “end”
– In C, “{“ and “}”
7. Free-Field Format and Fixed-Field Format
– In free-field format, the program statements may
be written anywhere on an input line, without
regard for positioning on the line or for breaks
between lines.
• For example, in Pascal:
for j := 1 to n do
or
for
j := 1
to
n
do.
• The programmer has the option to write the program in any format,
as long as the words in the program are kept intact.
• In fixed-field format, the positioning of
program code on the line is used to
convey information.
– For example, in Fortran:
• columns 1 to 5 are for labels, the character “C” in
column 1 indicates that the line is a comment line,
column 6 marks a continuation line, and the rest of
the columns (7 to 72) are where the statements are
written.
8. Expressions
– These are used to indicate conditions and, in
some cases, to evaluate values that are
assigned to variables.
– Expressions may take different forms in different
languages, but all of them fall into either
prefix, postfix, or infix form.
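To make the three forms concrete, here is a minimal Python sketch (not tied to any of the languages above) that writes the expression 2 + 3 * 4 in all three notations and evaluates the postfix form with a stack:

```python
# The same expression written in the three common forms:
#   infix:   2 + 3 * 4
#   prefix:  + 2 * 3 4
#   postfix: 2 3 4 * +

def eval_postfix(tokens):
    """Evaluate a postfix (reverse Polish) expression with a stack."""
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # the right operand was pushed last
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_postfix("2 3 4 * +".split()))  # 2 + (3 * 4) = 14.0
```

Note that postfix needs neither parentheses nor precedence rules: the order of the operators in the string already fixes the order of evaluation.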
9. Statements
– There are several ways by which statements
are formulated.
– One is to adopt the single basic statement
format.
10. Overall program-subprogram
structure
– Several approaches have been used by
existing languages.
• One is the use of a separate subprogram definition
as in FORTRAN:
SUBROUTINE A
….
….
SUBROUTINE B
….
….
• Then, there is the nested subprogram of Pascal:
procedure A;
procedure B;
begin
…
end;
begin
…
end;
• Another case is the separation of the data definitions or
description from the executable statements found in
Cobol:
DATA DIVISION.
….
….
PROCEDURE DIVISION.
….
….
Formal Syntax
• The syntactic definition of a PL aims to define all
the strings of symbols that form the correct
programs in the language.
• We need another language, called a
metalanguage, to describe all these correct
programs.
– A metalanguage is a notation used to describe the
syntax or semantics of a language.
– The most natural metalanguage would have been
English, but its disadvantages are its inherent
ambiguity and lack of precision.
2 Types of Syntax
1. Abstract Syntax
– A simple listing of all possible forms for each
of the syntactic classes in the language.
– Gives the components of each language
construct, leaving out the representation
details.
– Usually composed of two sections:
• First Section or Section I
– Lists the syntactic classes along with the
symbols that stand for arbitrary elements of
the classes.
• Second Section or Section II
– Lists the alternatives for each of the non-
elementary classes to the right of a “::=”
symbol, separated by occurrences of “|”.
– Basically an abstract grammar with a finite set
of productions, each associated with a
construct.
• Abstract syntax is a syntax that can tell
what syntactic structures are available in a
language, but does not specify which
strings of characters are well-formed
program texts, nor their phrase structure.
• For Example, in Pascal:
if E then if E then C else C
• are possible phrase structures for a command C.
However, this does not specify whether:
if a then if b then p else q
• is a well-formed command text or, if it is, whether it
is to be analyzed so that the else part matches the
first or the second then part. Such problems are
settled in the concrete syntax.
2. Concrete Syntax
– Can detect whether a string is a well-formed
string in the language or not.
– To express a concrete syntax, we use three
different metalanguages:
» Backus Naur Formalism (BNF)
» Syntax Diagrams
» Context-free Grammars
a. Backus Naur Formalism (BNF)
– The BNF is a grammar developed for the
syntactic definition of Algol-60.
– It was developed almost at the same time as
the theory of Context-Free Grammars
(CFG).
– The BNF grammar is a set of rules or
productions of the form:
left-side ::= right-side
• where left-side is a non-terminal symbol
• right-side is a string of non-terminals and terminals.
• A terminal represents an atomic symbol
in the language.
• A non-terminal represents a symbol defined
by other rules.
• The symbol “::=” is read as “produces” or
“is defined as.”
• Two other metasymbols, aside from “::=”,
are also used:
• “|” is interpreted as an alternative; and
• “{}” denotes possible repetition of the
enclosed symbols zero or more times.
• For example: A ::= B | {C}
– Means “A produces B” or “A produces a string of
zero or more C’s.”
• The BNF should not be very strange to us
since it has been used often to explain how
to formulate English sentences.
– For example, we express English language
as:
sentence ::= subject predicate
and subject and predicate produce
subject ::= noun | article noun
predicate ::= verb | verb object
• To cut the story short, we assume that noun, article,
verb, and object produce atomic symbols. To
illustrate, these non-terminals may be defined as:
noun ::= man | woman
article ::= the | a
verb ::= runs | walks
object ::= home
• Combining all these rules, we produce the
following BNF rules for a “small” English
language:
sentence ::= subject predicate
subject ::= noun | article noun
predicate ::= verb | verb object
noun ::= man | woman
article ::= the | a
verb ::= runs | walks
object ::= home
• Once given the BNF grammar, how do we
construct a sentence?
• The grammar is basically a recipe that explains
how sentences can be constructed.
• To construct a sentence, we start with the start
symbol, which in our example is sentence.
sentence => subject predicate
=> article noun predicate
=> the noun predicate
=> the man predicate
=> the man verb object
=> the man walks object
=> the man walks home
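The construction above can be mechanized. The following Python sketch (the dictionary encoding of the grammar is our own convention, not part of BNF itself) enumerates every sentence this “small English” grammar can generate:

```python
import itertools

# The "small English" BNF above, written as a Python dict:
# each nonterminal maps to its list of alternatives.
grammar = {
    "sentence":  [["subject", "predicate"]],
    "subject":   [["noun"], ["article", "noun"]],
    "predicate": [["verb"], ["verb", "object"]],
    "noun":      [["man"], ["woman"]],
    "article":   [["the"], ["a"]],
    "verb":      [["runs"], ["walks"]],
    "object":    [["home"]],
}

def expand(symbol):
    """Yield every terminal string the symbol can derive."""
    if symbol not in grammar:          # terminal: it derives only itself
        yield symbol
        return
    for alternative in grammar[symbol]:
        # Cartesian product of the expansions of the symbols in the alternative
        for parts in itertools.product(*(expand(s) for s in alternative)):
            yield " ".join(parts)

sentences = sorted(expand("sentence"))
print(len(sentences))                      # 24 sentences in this tiny language
print("the man walks home" in sentences)   # True
```

Because every nonterminal has finitely many non-recursive alternatives, the language is finite (6 subjects × 4 predicates = 24 sentences) and can be listed exhaustively.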
• Let us now switch back to PL.
• Let us now use another convention in this
notation by enclosing nonterminals by the
symbols “<>”.
<expression> ::= <term> | <expression> <addoperator><term>
<term> ::= <factor> | <term><multoperator><factor>
<factor> ::= <identifier> | <literal> | (<expression>)
<identifier> ::= a | b | c |…| z
<literal> ::= 0|1|2|…|9
<addoperator> ::= + | - | or
<multoperator> ::= *| / | div | mod | and
• Let us use the above grammar to recognize valid
strings in the language instead of generating
strings in the language.
• For example, consider the string a+b*c. This
string may be derived as follows:
<expression> => <expression><addoperator><term>
=> <term><addoperator><term>
=> <factor><addoperator><term>
=> <identifier><addoperator><term>
=> a + <term><multoperator><factor>
=> a + <factor><multoperator><factor>
=> a + <identifier><multoperator><factor>
=> a + b <multoperator><factor>
=> a + b * <identifier>
=> a + b * c
• An alternative method of doing the derivation is
the use of Parse Tree.
• Parse Tree is a graphical method of showing a
derivation
a+b*c

<expression>
├─ <expression> ─ <term> ─ <factor> ─ <identifier> ─ a
├─ <addoperator> ─ +
└─ <term>
    ├─ <term> ─ <factor> ─ <identifier> ─ b
    ├─ <multoperator> ─ *
    └─ <factor> ─ <identifier> ─ c
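A derivation like the one above is exactly what a recursive-descent parser performs. The sketch below is a hand-rolled recognizer (not any real compiler's code) for the expression grammar; the left-recursive rules are rewritten as loops, the usual trick for top-down parsing:

```python
# A recursive-descent parser for the expression grammar above,
# returning nested (operator, left, right) tuples as the parse tree.
ADD_OPS  = {"+", "-"}    # "or" omitted for brevity
MULT_OPS = {"*", "/"}    # "div", "mod", "and" omitted for brevity

def parse(text):
    tokens = [c for c in text if not c.isspace()]
    pos = [0]            # current position, shared by the nested functions

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def advance():
        tok = tokens[pos[0]]; pos[0] += 1; return tok

    def expression():                        # <expression>
        node = term()
        while peek() in ADD_OPS:             # left recursion as iteration
            node = (advance(), node, term()) # builds left operand first
        return node

    def term():                              # <term>
        node = factor()
        while peek() in MULT_OPS:
            node = (advance(), node, factor())
        return node

    def factor():                            # <factor>
        if peek() == "(":
            advance()
            node = expression()
            assert advance() == ")", "missing )"
            return node
        return advance()                     # <identifier> or <literal>

    tree = expression()
    assert pos[0] == len(tokens), "trailing input"
    return tree

print(parse("a+b*c"))    # ('+', 'a', ('*', 'b', 'c')) -- * binds tighter
```

Because <term> sits below <expression> in the grammar, the * subtree is completed before it becomes the right operand of +, which is precisely the precedence the parse tree shows.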
b. Syntax Diagram
• Similar to BNF rules, except that instead of
grammar rules, directed graphs are used.
• For each grammar rule an equivalent syntax
diagram can be drawn.
<expression> ::= <term> | <expression> <addoperator><term>
<term> ::= <factor> | <term><multoperator><factor>
<factor> ::= <identifier> | <literal> | (<expression>)
<identifier> ::= a | b | c |…| z
<literal> ::= 0|1|2|…|9
<addoperator> ::= + | - | or
<multoperator> ::= *| / | div | mod | and
• The rectangles in the syntax diagrams represent
the nonterminals.
• The oval shapes represent the terminals
[Syntax diagrams, one per rule: <expression>, <term>, <factor>,
<identifier>, <addoperator>, and <multoperator>. Each diagram traces the
alternatives of the corresponding BNF rule above; e.g., <addoperator>
branches to +, -, or, and <multoperator> branches to *, /, div, mod, and.]
c. Context-free Grammar
– CFG is another method of expressing the
syntax of a language.
– This is more used in the study of formal
languages than used to express the syntax of
PL.
• Definition: A CFG is denoted by G = (V,T,P,S),
where V is the finite set of symbols called non-
terminals, T is a finite set of symbols called
terminals, S is an element of V called the start
symbol, and P is the finite set of productions.
• Each production is of the form A → α,
where:
A is a variable and α is a string of
symbols formed from the elements of the
non-terminals and terminals, i.e.,
α ∈ (V ∪ T)*
Conventions on CFGs
1. Capital letters denote variables (or non-
terminals), S being the start symbol unless
otherwise stated.
2. Small letters and digits are used to
represent terminals.
3. Lower-case Greek letters are used to
denote strings of variables and terminals.
– With this convention, we can immediately
define V, T, and S by simply examining
the set of productions.
• Another convention is the use of the symbol |
(read as “or”) to represent alternatives in the
productions, i.e.,
A → α1, A → α2, …, A → αk
may be written as:
A → α1 | α2 | … | αk
Example: The grammar for the language
composed of strings starting with a, followed
by either any number of a’s or any number of
b’s, and ended by a b, is given by
G = ({S,M,A,B},{a,b},P,S)
where P = {S→aMb, M→A|B, A→aA|ε, B→bB|ε}
Derivations
• Using the sample grammar, we can
derive the string aaab from S as follows:
S => aMb using S → aMb
=> aAb using M → A
=> aaAb using A → aA
=> aaaAb using A → aA
=> aaab using A → ε
Hence, we can say S =>*aaab.
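The derivation can be replayed mechanically. In the Python sketch below (our own illustration, with nonterminals written as capital letters), each step rewrites the leftmost occurrence of the production's left side:

```python
# Replaying the derivation of aaab by textual replacement of
# the leftmost nonterminal (capital letters are nonterminals).
steps = [
    ("S", "aMb"),   # S -> aMb
    ("M", "A"),     # M -> A
    ("A", "aA"),    # A -> aA
    ("A", "aA"),    # A -> aA
    ("A", ""),      # A -> epsilon
]

form = "S"
for lhs, rhs in steps:
    form = form.replace(lhs, rhs, 1)   # rewrite the leftmost occurrence
    print(form)                        # aMb, aAb, aaAb, aaaAb, aaab

assert form == "aaab"                  # hence S =>* aaab
```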
• A sentential form in Grammar G is a
string of symbols £ composed of
terminals and non-terminals such that
S =>* £
• The language generated by a grammar
G, denoted by L(G), is {w | w is in T* and
S =>*w}.
• Another way of saying this: a string is in
L(G) if it consists solely of terminals and
can be derived from S.
Leftmost and Rightmost Derivations
• A leftmost derivation is a derivation in
which at each step, the leftmost non-
terminal is replaced.
To illustrate this, consider the grammar:
G = ({S,A},{a,b},P,S)
where:
P = {S→aAS | a, A→SbA | SS | ba}
The leftmost derivation of the string aabbaa
is:
S => aAS => aSbAS => aabAS => aabbaS => aabbaa
• The rightmost derivation is a derivation in
which the rightmost nonterminal is
replaced at each step.
For example, the rightmost derivation of the string aabbaa is:
S => aAS => aAa => aSbAa => aSbbaa => aabbaa
Ambiguity
• A CFG is ambiguous if it generates some
sentences by two or more distinct
leftmost (rightmost) derivations.
Example:
CFG G = ({S,T},{a,b},P,S)
where:
P = {S→T, T→TT | ab}
• We can find a string with two distinct
leftmost derivations.
• One such string is ababab, where it can
be derived (leftmost) by the following
derivations:
S => T => TT => abT => abTT => ababT => ababab
and
S => T => TT => TTT => abTT => ababT => ababab
Derivation (Parse) Tree
• Let G = (V,T,P,S) be a CFG. A tree is a derivation
or parse tree in G if:
1. Every vertex has a label which is a symbol of V ∪ T
∪ {ε};
2. The label of the root is S;
3. If a vertex is an interior vertex and has label A,
then A must be in V;
4. If a vertex v has label A and vertices v1, v2,…, vk
are the sons of v, in order from left to right, with
labels X1, X2,…, Xk respectively, then A → X1X2…Xk
must be a production in P;
5. If vertex v has label ε, then v is a leaf and is the
only son of its father.
Example:
Consider the grammar G = ({S,R,T}, {(,)},P,S)
where:
P = {S→R, R→RT | T, T → (R) | ()}
The derivation tree for the string ()(()) is:
S
└─ R
    ├─ R ─ T ─ ( )
    └─ T
        ├─ (
        ├─ R ─ T ─ ( )
        └─ )
Operator Precedence
Again, let us consider the grammar for
expression given earlier:
<expression> ::= <term> | <expression> <addoperator><term>
<term> ::= <factor> | <term><multoperator><factor>
<factor> ::= <identifier> | <literal> | (<expression>)
<identifier> ::= a | b | c |…| z
<literal> ::= 0|1|2|…|9
<addoperator> ::= + | - | or
<multoperator> ::= *| / | div | mod | and
• Consider the string a + b * c. The phrase structure by
which the string is recognized as an expression is:
<expression><addoperator><term>

a+b*c

<expression>
├─ <expression> ─ <term> ─ <factor> ─ <identifier> ─ a
├─ <addoperator> ─ +
└─ <term>
    ├─ <term> ─ <factor> ─ <identifier> ─ b
    ├─ <multoperator> ─ *
    └─ <factor> ─ <identifier> ─ c

• Hence, a+b*c = a + (b*c), i.e., the multiplication is
performed before the addition.
Associativity
• Another aspect of grammars that we want to
illustrate, aside from operator precedence, is
associativity
• Consider the expression a-b+c. This is
recognized by the phrase structure:
<expression><addoperator><term>
where:
<expression> ::= <expression><addoperator><term>
and the inner <expression> derives a-b.
Therefore, a - b + c = (a - b) + c.
This implies that the add operators associate to the left, i.e., they
are evaluated from left to right.
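Real grammars encode the same choice. As an illustration (using Python's own expression grammar, in which + and - are likewise left-associative), the parse tree for a - b + c nests the subtraction as the left operand of the addition:

```python
import ast

# Python's grammar makes + and - left-associative, so a - b + c
# parses as (a - b) + c: the outer node is an Add whose left child
# is the Sub node.
tree = ast.parse("a - b + c", mode="eval").body
assert isinstance(tree, ast.BinOp) and isinstance(tree.op, ast.Add)
assert isinstance(tree.left, ast.BinOp) and isinstance(tree.left.op, ast.Sub)
print(ast.dump(tree))
```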
Ambiguity
• A syntactic description is termed
ambiguous if, for any text, it specifies more
than one phrase structure
• To show that a grammar is ambiguous, all
that is needed is to find a string in the
language that specifies more than one
phrase structure
• Alternatively, simply show that there is
more than one parse tree for the string
Consider the expression a-b+c:
– There are two phrase structures for this,
which are:
a-b+c = (a-b)+c,
» when the first <expression> derives “a-b”,
and
a-b+c = a-(b+c),
» when the second <expression> derives “b+c”
– (This assumes an ambiguous grammar such as
<expression> ::= <expression><addoperator><expression>;
the earlier grammar permits only the first structure.)
Formal Semantics
• The goal of formal semantics is to reveal
the essence of a language beneath its
syntactic surface.
• The formal semantics of a language is
given by a mathematical model to
represent the possible computations
described by the language
Three Methods used in defining the meaning
of languages:
1. Operational Semantics
– Describes how a valid program is interpreted
as a sequence of computational steps.
– These sequences then make up the
meaning of the program.
– Tells how a computation is performed by
defining how to simulate the execution of the
program.
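As a toy illustration (our own miniature language of arithmetic trees, not any standard formalism), an operational semantics can be given by a one-step reduction function that is iterated until only a value remains:

```python
# A toy operational semantics for arithmetic expression trees:
# the meaning of a program is the sequence of machine steps, so
# we define one "small step" and iterate it until a value remains.

def step(e):
    """Perform one reduction step on a tree ('+'|'*', left, right)."""
    op, left, right = e
    if isinstance(left, tuple):
        return (op, step(left), right)     # reduce inside the left operand first
    if isinstance(right, tuple):
        return (op, left, step(right))     # then inside the right operand
    return left + right if op == "+" else left * right  # both are values

expr = ("+", 2, ("*", 3, 4))               # 2 + 3 * 4
while isinstance(expr, tuple):
    expr = step(expr)
    print(expr)                            # ('+', 2, 12), then 14

assert expr == 14
```

The printed sequence of intermediate forms is exactly the "sequence of computational steps" that operational semantics takes as the program's meaning.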
2. Denotational Semantics
– Defined by a valuation function that maps
programs into mathematical objects
considered as their denotation (i.e.
meaning).
– A function that maps a valid expression onto
some mathematical object.
– For example: if I have the expression 2+2,
then the denotational semantics of this
expression might be the natural number 4.
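By contrast, a denotational-style valuation function for the same miniature expression trees (again a sketch of ours, not a formal definition) maps each tree directly to the number it denotes, with no notion of execution steps:

```python
# A denotational-style valuation function: it maps each expression
# tree directly to the mathematical object (a number) it denotes.

def meaning(e):
    if isinstance(e, int):                  # a literal denotes itself
        return e
    op, left, right = e
    if op == "+":
        return meaning(left) + meaning(right)  # the meaning of a sum is
    if op == "*":                              # the sum of the meanings,
        return meaning(left) * meaning(right)  # and so on
    raise ValueError(op)

assert meaning(("+", 2, 2)) == 4    # the denotation of 2+2 is the number 4
print(meaning(("+", 2, ("*", 3, 4))))  # 14
```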
3. Axiomatic Semantics
– Based on assertions about relationships that remain the
same each time the program is executed.
– Defined for each control structure and command of
the programming language.
– The semantic formulas are triples of the form: {P} S
{Q}
• where S is a command or control structure in the PL, and P and
Q are assertions or statements concerning the properties
of program objects (often program variables) which may be
true or false. P is called the pre-condition and Q the
post-condition. The pre- and post-conditions are formulas
in some arbitrary logic and summarize the progress of the
computation.
The meaning of {P} S {Q}
– is that if S is executed in a state in which assertion
P is satisfied and S terminates, then S terminates in
a state in which assertion Q is satisfied.
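This reading can be mimicked with runtime checks. The sketch below (a hypothetical integer-division routine, chosen only for illustration) encodes P and Q as assertions around the command S:

```python
# A Hoare triple {P} S {Q} rendered as runtime checks: if the
# precondition P holds before S runs and S terminates, then the
# postcondition Q holds afterwards.

def int_divide(x, y):
    # {P: x >= 0 and y > 0}
    assert x >= 0 and y > 0, "precondition violated"
    q, r = 0, x
    while r >= y:              # S: division by repeated subtraction
        r -= y
        q += 1
    # {Q: x == q * y + r and 0 <= r < y}
    assert x == q * y + r and 0 <= r < y, "postcondition violated"
    return q, r

print(int_divide(17, 5))       # (3, 2): 17 == 3*5 + 2
```

Note that the triple promises nothing when P fails or when S does not terminate; the assertions only check the states before and after the command, which is exactly the partial-correctness reading above.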