Defining Program Syntax: Chapter Two Modern Programming Languages, 2nd Ed. 1
Chapter Two
Programming language syntax: how programs look, their form and structure
Programming language semantics: what programs do, their behavior and meaning
Outline
Grammar and parse tree examples
BNF and parse tree definitions
Constructing grammars
Phrase structure and lexical structure
Other grammar forms
An English Grammar
A sentence is a noun phrase, a verb, and a noun phrase.
A noun phrase is an article and a noun.
A verb is...
An article is...
A noun is...
The rule <S> ::= <NP> <V> <NP> says you can add nodes <NP>, <V>, and <NP>, in that order, as children of <S>.
A Parse Tree
<S>
  <NP>
    <A> the
    <N> dog
  <V> loves
  <NP>
    <A> the
    <N> cat
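As a rough sketch, the parse tree above can be written down as nested Python tuples and its leaves read off from left to right. The tuple layout and the `leaves` helper are illustrative assumptions, not something the slides prescribe.

```python
# A parse tree node is (label, children); a leaf token is a plain string.
tree = ("<S>", [
    ("<NP>", [("<A>", ["the"]), ("<N>", ["dog"])]),
    ("<V>", ["loves"]),
    ("<NP>", [("<A>", ["the"]), ("<N>", ["cat"])]),
])


def leaves(node):
    """Return the tokens at the leaves of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


print(" ".join(leaves(tree)))  # prints: the dog loves the cat
```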
An expression can be the sum of two expressions, or the product of two expressions, or a parenthesized subexpression.
Or it can be one of the variables a, b or c.
A Parse Tree
Parse tree for ((a+b)*c):
<exp>
  (
  <exp>
    <exp>
      (
      <exp>
        <exp> a
        +
        <exp> b
      )
    *
    <exp> c
  )
<S> ::= <NP> <V> <NP>
<NP> ::= <A> <N>
<V> ::= loves | hates | eats
<A> ::= a | the
<S> is the start symbol. A line such as <NP> ::= <A> <N> is a production. The names in angle brackets are non-terminal symbols.
A grammar has four parts:
The set of tokens
The set of non-terminal symbols
The start symbol
The set of productions
Definition, Continued
Tokens are strings of one or more characters of program text. They are atomic: not treated as being composed from smaller parts.
Non-terminal symbols are strings enclosed in angle brackets, as in <NP>. They are not strings that occur literally in program text. The grammar says how they can be expanded into strings of tokens.
The start symbol is the particular non-terminal that forms the root of any parse tree for the grammar.
Definition, Continued
The productions are the tree-building rules. Each one has a left-hand side, the separator ::=, and a right-hand side.
The left-hand side is a single non-terminal. The right-hand side is a sequence of one or more things, each of which can be either a token or a non-terminal.
A production gives one possible way of building a parse tree: it permits the non-terminal symbol on the left-hand side to have the things on the right-hand side, in order, as its children in a parse tree.
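One way to make the four-part definition concrete is to store a grammar as plain data and check it against the rules just stated. This is a minimal sketch: the `Grammar` class and its `check` method are hypothetical names, and the noun tokens `dog` and `cat` are borrowed from the parse tree example as an assumed definition of <N>.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    tokens: set          # atomic strings of program text
    nonterminals: set    # names in angle brackets, e.g. "<NP>"
    start: str           # the root of every parse tree
    productions: list    # (left-hand side, right-hand side) pairs

    def check(self):
        """Raise ValueError if the parts do not fit the definition."""
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a non-terminal")
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                raise ValueError(f"left-hand side {lhs!r} is not a non-terminal")
            if not rhs:
                raise ValueError("right-hand side needs one or more symbols")
            for symbol in rhs:
                if symbol not in self.tokens and symbol not in self.nonterminals:
                    raise ValueError(f"unknown symbol {symbol!r}")
        return True


english = Grammar(
    tokens={"loves", "hates", "eats", "a", "the", "dog", "cat"},
    nonterminals={"<S>", "<NP>", "<V>", "<A>", "<N>"},
    start="<S>",
    productions=[
        ("<S>", ["<NP>", "<V>", "<NP>"]),
        ("<NP>", ["<A>", "<N>"]),
        ("<V>", ["loves"]), ("<V>", ["hates"]), ("<V>", ["eats"]),
        ("<A>", ["a"]), ("<A>", ["the"]),
        ("<N>", ["dog"]), ("<N>", ["cat"]),  # assumed nouns
    ],
)
print(english.check())
```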
Alternatives
When there is more than one production with the same left-hand side, an abbreviated form can be used.
The BNF grammar can give the left-hand side, the separator ::=, and then a list of possible right-hand sides separated by the special symbol |.
Example
<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c
Note that there are six productions in this grammar. It is equivalent to this one:
<exp> ::= <exp> + <exp>
<exp> ::= <exp> * <exp>
<exp> ::= ( <exp> )
<exp> ::= a
<exp> ::= b
<exp> ::= c
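The abbreviation can be undone mechanically. The sketch below splits an abbreviated rule into one production per alternative; the `expand` helper is a hypothetical name, and it assumes that symbols are separated by spaces and that | never appears as a token.

```python
def expand(rule):
    """Split 'lhs ::= alt1 | alt2 | ...' into one production per alternative."""
    lhs, rhs = rule.split("::=")
    lhs = lhs.strip()
    return [(lhs, alt.split()) for alt in rhs.split("|")]


rule = "<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c"
for lhs, rhs in expand(rule):
    print(lhs, "::=", " ".join(rhs))
```

Running it prints the six unabbreviated productions, one per line.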
Empty
The special nonterminal <empty> is for places where you want the grammar to generate nothing.
For example, this grammar defines a typical if-then construct with an optional else part:
<if-stmt> ::= if <expr> then <stmt> <else-part>
<else-part> ::= else <stmt> | <empty>
Parse Trees
To build a parse tree, put the start symbol at the root.
Add children to every non-terminal, following any one of the productions for that non-terminal in the grammar.
You are done when all the leaves are tokens.
Read off the leaves from left to right: that is the string derived by the tree.
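The tree-building procedure can be sketched in a few lines for the <exp> grammar used earlier. Picking productions at random and cutting recursion off at `max_depth` are assumptions of this sketch (the procedure itself allows any production at each step); `build` and `derived_string` are hypothetical names.

```python
import random

GRAMMAR = {
    "<exp>": [
        ["<exp>", "+", "<exp>"],
        ["<exp>", "*", "<exp>"],
        ["(", "<exp>", ")"],
        ["a"], ["b"], ["c"],
    ],
}


def build(symbol, rng, depth=0, max_depth=4):
    """Grow a parse tree below `symbol`; tokens become leaves."""
    if symbol not in GRAMMAR:
        return symbol  # a token: nothing to expand
    choices = GRAMMAR[symbol]
    if depth >= max_depth:
        # Assumption of this sketch: stop the recursion by using only
        # productions whose right-hand side contains no non-terminal.
        choices = [rhs for rhs in choices if not any(s in GRAMMAR for s in rhs)]
    rhs = rng.choice(choices)
    return (symbol, [build(s, rng, depth + 1, max_depth) for s in rhs])


def derived_string(tree):
    """Read the leaves off from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(derived_string(child) for child in tree[1])


tree = build("<exp>", random.Random(0))
print(derived_string(tree))
```

Each run with a different seed yields a different tree, but every leaf is a token and the root is always the start symbol.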
Practice
<exp> ::= <exp> + <exp> | <exp> * <exp> | ( <exp> ) | a | b | c
Show a parse tree for each of these strings:
a+b
a*b+c
(a+b)
(a+(b))
Compiler Note
What we just did is parsing: trying to find a parse tree for a given string.
That's what compilers do for every program you try to compile: they try to build a parse tree for your program, using the grammar for whatever language you used.
Take a course in compiler construction to learn about algorithms for doing this efficiently.
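For a grammar as small as the practice one, a brute-force check is enough to decide whether some parse tree exists for a string. The sketch below tries every way of splitting a substring according to the productions and memoizes the results; it is not one of the efficient algorithms a compiler would use, and `is_exp` is a hypothetical name. It assumes the input has no spaces.

```python
from functools import lru_cache


def is_exp(text):
    """True if `text` can be derived from <exp> in the practice grammar."""

    @lru_cache(maxsize=None)
    def exp(i, j):
        # Can text[i:j] be the leaves of a tree rooted at <exp>?
        if j - i == 1:
            return text[i] in "abc"          # <exp> ::= a | b | c
        if j - i < 3:
            return False
        # <exp> ::= ( <exp> )
        if text[i] == "(" and text[j - 1] == ")" and exp(i + 1, j - 1):
            return True
        # <exp> ::= <exp> + <exp> | <exp> * <exp>
        return any(
            text[k] in "+*" and exp(i, k) and exp(k + 1, j)
            for k in range(i + 1, j - 1)
        )

    return exp(0, len(text))


for s in ["a+b", "a*b+c", "(a+b)", "(a+(b))", "a+"]:
    print(s, is_exp(s))
```

The four practice strings are accepted; `a+` is rejected because no production lets an <exp> end in an operator.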
Language Definition
We use grammars to define the syntax of programming languages.
The language defined by a grammar is the set of all strings that can be derived by some parse tree for the grammar.
As in the previous example, that set is often infinite (though grammars are finite).
Constructing grammars is a little like programming...
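To see why the set of derived strings is infinite, it can help to enumerate it in layers. The sketch below collects every string of the <exp> grammar whose parse tree has at most `depth` nested levels of <exp> nodes; `strings_up_to` is a hypothetical name and the layering by depth is just one convenient way to cut the set into finite pieces.

```python
def strings_up_to(depth):
    """All strings derivable by parse trees with at most `depth` <exp> levels."""
    if depth == 0:
        return set()
    smaller = strings_up_to(depth - 1)
    result = {"a", "b", "c"}                           # <exp> ::= a | b | c
    result |= {"(" + s + ")" for s in smaller}         # <exp> ::= ( <exp> )
    result |= {x + op + y                              # <exp> ::= <exp> op <exp>
               for x in smaller for y in smaller for op in "+*"}
    return result


for d in range(1, 4):
    print(d, len(strings_up_to(d)))
```

Every extra level adds new, longer strings, so no finite depth exhausts the language even though the grammar itself has only six productions.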
Constructing Grammars
Most important trick: divide and conquer.
Example: the language of Java declarations: a type name, a list of variables separated by commas, and a semicolon.
Each variable can be followed by an initializer.
Example, Continued
This is easy if we postpone defining the comma-separated list of variables with initializers: a declaration is a <type-name>, then a <declarator-list>, then a semicolon.
Primitive type names are easy enough too:
<type-name> ::= boolean | byte | short | int | long | char | float | double
(Note: skipping constructed types: class names, interface names, and array types)
Example, Continued
That leaves the comma-separated list of variables with initializers.
Again, postpone defining variables with initializers, and just do the comma-separated list part:
<declarator-list> ::= <declarator> | <declarator> , <declarator-list>
Example, Continued
A declarator is a <variable-name>, optionally followed by = and an <expr>.
For full Java, we would need to allow pairs of square brackets after the variable name.
There is also a syntax for array initializers.
And we would need definitions for <variable-name> and <expr>.
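The divide-and-conquer structure of the grammar carries over directly to a hand-written parser: one small function per non-terminal. The sketch below is only illustrative. It assumes <variable-name> is a simple identifier and uses a stand-in for <expr> (names and integers joined by +), since the slides leave both undefined; `tokenize` and `parse_declaration` are hypothetical names.

```python
import re

TYPE_NAMES = {"boolean", "byte", "short", "int", "long", "char", "float", "double"}
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[=,;+])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_declaration(text):
    """Return (type name, [(variable, initializer tokens or None), ...])."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def operand():
        tok = take()
        if not re.fullmatch(r"[A-Za-z_]\w*|\d+", tok):
            raise SyntaxError(f"expected a name or integer, got {tok!r}")
        return tok

    def expr():
        # Assumed stand-in for <expr>: operands joined by +.
        parts = [operand()]
        while peek() == "+":
            parts.append(take())
            parts.append(operand())
        return parts

    def declarator():
        name = take()
        if not re.fullmatch(r"[A-Za-z_]\w*", name) or name in TYPE_NAMES:
            raise SyntaxError(f"expected a variable name, got {name!r}")
        init = None
        if peek() == "=":
            take()
            init = expr()
        return (name, init)

    def declarator_list():
        # <declarator-list> ::= <declarator> | <declarator> , <declarator-list>
        first = declarator()
        if peek() == ",":
            take()
            return [first] + declarator_list()
        return [first]

    type_name = take()
    if type_name not in TYPE_NAMES:
        raise SyntaxError(f"expected a primitive type name, got {type_name!r}")
    declarators = declarator_list()
    if take() != ";" or pos != len(tokens):
        raise SyntaxError("expected a single terminating semicolon")
    return type_name, declarators


print(parse_declaration("int a=1, b, c=1+2;"))
```

Note how `declarator_list` calls itself exactly where the production refers to <declarator-list> on its own right-hand side.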
Separate Grammars
One grammar says how to construct a sequence of tokens from a file of characters.
The other says how to construct a parse tree from a sequence of tokens.
<program-file> ::= <end-of-file> | <element> <program-file>
<element> ::= <token> | <one-white-space> | <comment>
<one-white-space> ::= <space> | <tab> | <end-of-line>
<token> ::= <identifier> | <operator> | <constant> | ...
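The first of the two grammars corresponds to a scanner: code that turns characters into tokens and throws away white space and comments. The sketch below uses Python's `re` module; the particular token shapes (identifiers, integer constants, a handful of operators, `//` comments) are assumptions made for illustration, not rules given in the slides.

```python
import re

# Hypothetical lexical rules for this sketch; a real language defines its own.
SCANNER = re.compile(r"""
    (?P<white>\s+)
  | (?P<comment>//[^\n]*)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<constant>\d+)
  | (?P<operator>[-+*/=<>();,])
""", re.VERBOSE)


def scan(text):
    """Turn a string of characters into a list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = SCANNER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        # White space and comments are elements of the file but not tokens.
        if match.lastgroup not in ("white", "comment"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(scan("x = y + 42; // add"))
```

The parser for the second grammar then works on the token list and never sees spaces or comments.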
Historical Note #1
Early languages sometimes did not separate lexical structure from phrase structure
Early Fortran and Algol dialects allowed spaces anywhere, even in the middle of a keyword.
Other languages, like PL/I, allow keywords to be used as identifiers.
This makes them harder to scan and parse.
It also reduces readability.
Historical Note #2
Some early languages had a fixed format: one statement per line (i.e. per card), the first few columns reserved for a statement label, and so on.
Early dialects of Fortran, Cobol, and Basic worked this way.
Most modern languages are free-format: column positions are ignored.
BNF Variations
Some use → or = instead of ::=.
Some leave out the angle brackets and use a distinct typeface for tokens.
Some allow single quotes around tokens, for example to distinguish '|' as a token from | as a meta-symbol.
EBNF Variations
{x} to mean zero or more repetitions of x
[x] to mean x is optional (i.e. x | <empty>)
() for grouping
| anywhere to mean a choice among alternatives
Quotes around tokens, if necessary, to distinguish from all these meta-symbols
EBNF Examples
<if-stmt> ::= if <expr> then <stmt> [else <stmt>]
<stmt-list> ::= {<stmt> ;}
<thing-list> ::= { (<stmt> | <declaration>) ;}
Anything that extends BNF this way is called an Extended BNF: EBNF.
There are many variations.
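The EBNF meta-symbols map naturally onto control flow in a hand-written parser: square brackets become an `if`, curly brackets become a `while` loop. The sketch below shows this for the first two example productions. To keep it short it assumes every <expr> and <stmt> is a single token; the function names are hypothetical.

```python
def parse_if_stmt(tokens):
    """<if-stmt> ::= if <expr> then <stmt> [else <stmt>]"""
    if len(tokens) < 4 or tokens[0] != "if" or tokens[2] != "then":
        raise SyntaxError("expected: if <expr> then <stmt>")
    tree = {"cond": tokens[1], "then": tokens[3], "else": None}
    rest = tokens[4:]
    if rest and rest[0] == "else":      # [ ... ]: the part is optional
        if len(rest) != 2:
            raise SyntaxError("expected one <stmt> after else")
        tree["else"] = rest[1]
    elif rest:
        raise SyntaxError("unexpected tokens after <stmt>")
    return tree


def parse_stmt_list(tokens):
    """<stmt-list> ::= {<stmt> ;}"""
    stmts, pos = [], 0
    while pos < len(tokens):            # { ... }: zero or more repetitions
        if tokens[pos] == ";" or pos + 1 >= len(tokens) or tokens[pos + 1] != ";":
            raise SyntaxError("each <stmt> must be followed by ;")
        stmts.append(tokens[pos])
        pos += 2
    return stmts


print(parse_if_stmt("if x then s1 else s2".split()))
print(parse_stmt_list("s1 ; s2 ;".split()))
```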
Syntax Diagrams
Syntax diagrams (railroad diagrams): start with an EBNF grammar.
A simple production is just a chain of boxes (for nonterminals) and ovals (for terminals):
<if-stmt> ::= if <expr> then <stmt> else <stmt>
[Diagram: if-stmt drawn as the chain if, expr, then, stmt, else, stmt]
Bypasses
Square-bracket pieces from the EBNF get paths that bypass them
Branching
[Diagram: exp, with a separate branch for each alternative]
Loops
[Diagram: exp, drawn as a loop of addend separated by +]
In the study of formal languages and automata, grammars are expressed in yet another notation:
S → aSb | X
X → cX | ε
These are called context-free grammars.
Other kinds of grammars are also studied: regular grammars (weaker), context-sensitive grammars (stronger), etc.
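Taking the grammar to be S → aSb | X and X → cX | ε (where ε stands for the empty string), each non-terminal can be turned into a small recursive function that asks whether a string can be derived from it. This is a sketch with hypothetical function names, written for clarity rather than speed.

```python
def derives_S(text):
    """S -> aSb | X"""
    if len(text) >= 2 and text[0] == "a" and text[-1] == "b":
        if derives_S(text[1:-1]):
            return True
    return derives_X(text)


def derives_X(text):
    """X -> cX | (empty string)"""
    if text == "":
        return True
    return text[0] == "c" and derives_X(text[1:])


for s in ["", "ccc", "ab", "aacbb", "aab", "acbb"]:
    print(repr(s), derives_S(s))
```

The accepted strings are some number of a's, then any number of c's, then the same number of b's. Matching the two counts is exactly what the recursive production S → aSb provides.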
BNF and EBNF ideas are widely used.
Exact notation differs, in spite of occasional efforts to get uniformity.
But as long as you understand the ideas, differences in notation are easy to pick up
Example
WhileStatement:
    while ( Expression ) Statement
DoStatement:
    do Statement while ( Expression ) ;
BasicForStatement:
    for ( ForInit_opt ; Expression_opt ; ForUpdate_opt ) Statement
[from The Java Language Specification, Third Edition, James Gosling et al.]
Conclusion
We use grammars to define programming language syntax, both lexical structure and phrase structure.
There is a connection between theory and practice:
Two grammars, two compiler passes.
Parser generators can write code for those two passes automatically from grammars.
Conclusion, Continued
Novices want to find out what legal programs look like.
Experts (advanced users and language system implementers) want an exact, detailed definition.
Tools (parser and scanner generators) want an exact, detailed definition in a particular, machine-readable form.