Parsing PDF
We now move to the second module of the front-end: the parser. Recall the front-end components:
The parser checks the stream of words (tokens) and their parts of speech for grammatical
correctness. It determines whether the input is syntactically well formed. It guides context-sensitive
(“semantic”) analysis such as type checking. Finally, it builds an IR for the source program.
Syntactic Analysis
Consider the sentence “He wrote the program”. The structure of the sentence can be described
using the grammar of the English language.
The analogy can be carried over to syntax of sentences in a programming language. For
example, an if-statement has the syntax
The parser ensures that sentences of a programming language that make up a program abide by
the syntax of the language. If there are errors, the parser detects them and reports them
accordingly. Consider the following code segment that contains a number of syntax errors:
This CFG defines the set of noises sheep make. We can use the SheepNoise grammar to create
sentences of the language. We use the productions as rewriting rules.
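The rewriting process can be sketched in a few lines of Python. The grammar itself is not reproduced in the text above, so this sketch assumes the usual two-rule textbook form (rule 1: SheepNoise → baa SheepNoise, rule 2: SheepNoise → baa). The names `RULES` and `derive` are hypothetical and exist only for this illustration.

```python
# Assumed grammar (an assumption; the original figure is not in the text):
#   1. SheepNoise -> baa SheepNoise
#   2. SheepNoise -> baa
RULES = {
    1: ["baa", "SheepNoise"],
    2: ["baa"],
}


def derive(rule_sequence):
    """Apply the numbered rules in order and return every sentential form."""
    form = ["SheepNoise"]
    steps = [" ".join(form)]
    for rule in rule_sequence:
        # Each sentential form holds at most one SheepNoise, so index() finds it.
        i = form.index("SheepNoise")
        form[i:i + 1] = RULES[rule]
        steps.append(" ".join(form))
    return steps


if __name__ == "__main__":
    for step in derive([1, 1, 2]):
        print(step)
```

Applying rule 1 twice and then rule 2 rewrites SheepNoise step by step into the sentence baa baa baa.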
While it is cute, this example quickly runs out of intellectual steam. To explore uses of CFGs, we need
a more complex grammar. Consider the grammar for arithmetic expressions:
Grammar rules in a similar form were first used in the description of the Algol-60 programming
language. The syntax of C, C++ and Java is derived heavily from Algol-60. The notation was
developed by John Backus and adapted by Peter Naur for the Algol-60 language report; thus the
term Backus-Naur Form (BNF). Let us use the expression grammar to derive the sentence
x – 2 * y
Such a process of rewrites is called a derivation, and the process of discovering a derivation is
called parsing. At each step, we choose a non-terminal to replace. Different choices can lead to
different derivations.
Two derivations are of interest
1. Leftmost: replace leftmost non-terminal (NT) at each step
2. Rightmost: replace rightmost NT at each step
The example on the preceding slides was a leftmost derivation. There is also a rightmost derivation.
In both cases we have
expr ⇒* id – num * id
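Both derivations can be replayed mechanically. The sketch below assumes the classic ambiguous expression grammar (expr → expr op expr | num | id, with op → + | – | * | /), since the grammar figure is not reproduced in the text. The helper `derive` and the production constants are hypothetical names, and an ASCII `-` stands in for the minus sign.

```python
NONTERMINALS = {"expr", "op"}

# Productions of the assumed grammar, written as (left-hand side, right-hand side).
BINARY = ("expr", ["expr", "op", "expr"])
ID = ("expr", ["id"])
NUM = ("expr", ["num"])
MINUS = ("op", ["-"])
TIMES = ("op", ["*"])


def derive(productions, leftmost=True):
    """Rewrite the leftmost (or rightmost) non-terminal at every step."""
    form = ["expr"]
    steps = [" ".join(form)]
    for lhs, rhs in productions:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"cannot rewrite {form[i]} with a rule for {lhs}")
        form[i:i + 1] = rhs
        steps.append(" ".join(form))
    return steps


leftmost_steps = derive([BINARY, ID, MINUS, BINARY, NUM, TIMES, ID], leftmost=True)
rightmost_steps = derive([BINARY, ID, TIMES, BINARY, NUM, MINUS, ID], leftmost=False)

if __name__ == "__main__":
    print("\n".join(leftmost_steps))
    print()
    print("\n".join(rightmost_steps))
```

The two runs pass through different intermediate sentential forms but end in the same sentence, id - num * id.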
The two derivations produce different parse trees. The parse trees imply different evaluation
orders!
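To see why the evaluation order matters, the two possible trees for x – 2 * y can be written as nested tuples and evaluated bottom-up. This is only an illustration: the tuple encoding, the `evaluate` helper, and the values chosen for x and y are assumptions made for this sketch.

```python
import operator

OPS = {"-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Evaluate a tree of the form (op, left, right), a variable name, or a number."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]  # identifier: look up its value
    return tree           # numeric literal


# Tree in which the multiplication is nested deeper: x - (2 * y)
multiply_first = ("-", "x", ("*", 2, "y"))
# Tree in which the subtraction is nested deeper: (x - 2) * y
subtract_first = ("*", ("-", "x", 2), "y")

env = {"x": 10, "y": 3}  # arbitrary sample values

if __name__ == "__main__":
    print(evaluate(multiply_first, env))  # 10 - (2 * 3) = 4
    print(evaluate(subtract_first, env))  # (10 - 2) * 3 = 24
```

The same sentence yields 4 under one tree and 24 under the other, so the shape of the parse tree determines the value the compiler computes.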
Parse Trees
The derivations can be represented in a tree-like fashion. The interior nodes contain the
non-terminals used during the derivation.
Precedence
These two derivations point out a problem with the grammar. It has no notion of precedence, or
implied order of evaluation. The normal arithmetic rules say that multiplication has higher
precedence than subtraction. To add precedence, create a non-terminal for each level of
precedence. Isolate the corresponding part of the grammar to force the parser to recognize
high-precedence sub-expressions first. Here is the revised grammar:
This grammar is larger and requires more rewriting to reach some of the terminal symbols, but
it encodes the expected precedence. Let us see how it parses the example.
This produces the same parse tree under leftmost and rightmost derivations, because the grammar
directly encodes the desired precedence.
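The idea of one non-terminal per precedence level can be sketched as a small hand-written parser. The revised grammar is not reproduced in the text, so the sketch assumes the classic layered form (expr → expr + term | expr – term | term; term → term * factor | term / factor | factor; factor → num | id). Note the substitution: each left-recursive rule is handled with a loop rather than recursion, which is a common hand-coding technique and not necessarily the parsing method these notes go on to use. All function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/])")
OPERATORS = {"+", "-", "*", "/"}


def tokenize(text):
    """Split the input into numbers, identifiers, and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return a parse tree as nested (op, left, right) tuples."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():  # factor -> num | id
        tok = peek()
        if tok is None or tok in OPERATORS:
            raise SyntaxError(f"expected a number or identifier, found {tok!r}")
        advance()
        return int(tok) if tok.isdigit() else tok

    def term():  # term -> term (* | /) factor | factor
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def expr():  # expr -> expr (+ | -) term | term
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse("x - 2 * y"))
```

Because `term` consumes every `*` and `/` before `expr` sees a `-`, the multiplication always ends up deeper in the tree than the subtraction, so x - 2 * y groups as x - (2 * y).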