Compilers - Week 3
I- introduction
Regular languages are the weakest formal languages, but they are widely
used.
Regular languages and finite automata are incapable of the
following:
i- counting to an arbitrary i (e.g., recognizing strings of the form a^i b^i).
ii- recognizing strings of balanced (nested) parentheses.
II- parsing
input: sequence of tokens from the lexer
output: parse tree of the program
Notes:
i- sometimes the parse tree is only implicit, so a compiler may
never actually build the full parse tree.
ii- some compilers combine the lexical analysis and parsing
phases into one, where everything is done by the parser, since
parsing technology is generally powerful enough to express
lexical analysis in addition to parsing. But most compilers still
divide up the work this way, because regular expressions are
such a good match for lexical analysis, and the parsing is then
handled separately.
III- context free grammar
-Since not all strings of tokens are valid, the parser has to tell the
difference and give error messages for the invalid strings of tokens.
Therefore, we need a way to describe valid strings (a context-free
grammar), and a method to distinguish valid strings from invalid
ones (a parsing algorithm, which also builds the parse tree).
-A CFG consists of the following:
i- a start symbol S ∈ N
ii- a set of terminals T
iii- a set of nonterminals N
iv- a set of productions X -> y1 | y2 | ... where:
X ∈ N and each yi ∈ (N ∪ T)*, i.e. a string of nonterminals
and terminals, possibly the empty string ε
Example: S -> (S) | ε
Terminals : {(, )}
Nonterminals : {S}
Start symbol : S
Notes:
i- productions can be viewed as replacement rules, since the
left hand side is replaced by any of the items on the right
hand side.
ii- we have to start with the start symbol, and then we keep
replacing nonterminals with production right-hand sides until
there are no more nonterminals to be replaced.
iii- terminals -which are the tokens of the language- cannot be
replaced (once generated, a terminal becomes a permanent
feature of the string).
iv- non-terminals are written in all caps.
v- many grammars generate the same language.
In conclusion:
if G is a context-free grammar with start symbol S, then the
language L(G) of G is defined as:
{ a1...an | ∀i. ai ∈ T and S →* a1...an }, i.e. the sequences of
characters a1 through an, such that every character ai is a terminal,
and the start symbol S goes to that sequence of characters in zero or
more steps.
IV- derivations
-A derivation is a sequence of productions.
-it can be drawn as a tree, where the start symbol is the root of the tree,
the interior nodes are the nonterminals, and the leaves are the
terminals. (this tree is known as the parse tree)
-types of derivation
i- left-most derivation
At each step, replace the left-most nonterminal.
ii- right-most derivation
At each step, replace the right-most nonterminal.
iii- other types
Notes:
-every parse tree has a right-most and a left-most
derivation (one parse tree may have many derivations).
-in the section 5-03 quiz, S cannot be directly replaced by aa; it
has to be replaced by aXa first, and then X is replaced by ε,
which gives aa.
V- ambiguity
Consider the two parse trees for the string id * id + id under the
grammar E -> E + E | E * E | id (one tree per derivation below):
Derivation for the left tree:
E -> E + E -> E * E + E -> id * E + E -> id * id + E -> id * id + id
Derivation for the right tree:
E -> E * E -> E * E + E -> E * E + id -> E * id + id -> id * id + id
-a grammar is ambiguous if it has more than one parse tree for some
string (equivalently, if there is more than one right-most derivation
or left-most derivation for some string).
-if you have multiple parse trees for some program, then you are
essentially leaving it up to the compiler to pick which of those
possible interpretations of the program it is going to generate code
for.
Notes:
i- Each parse tree has a leftmost derivation, a rightmost
derivation, and many other derivations, but a grammar is
ambiguous only when a single string has multiple parse trees
(equivalently, multiple leftmost derivations), not merely multiple
derivations of the same tree.
ii- E -> E + E | id is an ambiguous grammar, because the string
id + id + id can be derived using two different parse trees.
How to handle ambiguity
-The most direct method is to rewrite the grammar so it's
unambiguous, that is, to write a new grammar that generates the
same language as the old grammar but has only a single parse
tree for each string.
Example:
Consider the string id * id + id; we can rewrite the grammar
as follows:
E -> E’ + E | E’
E’ -> id * E’ | id | (E) * E’ | (E)
We've divided the productions into two classes: one that
handles + and one that handles *.
-The E production controls the generation of +,
resulting in a sequence of any number of E’ terms
summed together, ending with a final E’.
Disambiguation mechanism
-Instead of rewriting the grammar, use the more natural
(ambiguous) grammar, along with disambiguating
declarations.
-The most popular form of disambiguating declarations are
precedence and associativity declarations.
Example:
consider the grammar E -> E + E | int, which is
ambiguous since there are two parse trees for
the string int + int + int, because this grammar
doesn't tell us whether plus is left associative or
right associative. The solution here is to
declare + to be left associative, which excludes
the right-associative parse tree.
Example:
consider the ambiguous grammar E -> E + E | E *
E | int. The solution here is to
declare + to be left associative, and declare * to
be left associative, as follows:
%left +
%left *
The precedence between + and * is then given by
the order of the declarations: the fact that *
appears after + means that * has a higher
precedence than +, which excludes the parse tree
in which + binds tighter than *.
Note: It’s important to check the behavior of
your grammar after you add these declarations.
production: E -> T + E
function: bool E2() { return T() && term(PLUS) && E(); }
where E2 is the function that checks whether the production
E -> T + E matches the input.
-Some production of T has to match a portion of the input; after
that, we have to find a + in the input following whatever T
matched; and if the + matches, then some production of E has to
match some portion of the remaining input.
What happened?
While backtracking among a nonterminal's productions is available
while that nonterminal is being tried, there is no way to backtrack
and try a different production for a nonterminal X once one
production of X has already succeeded.
So we were not able to try the second production of T
instead of the first.
X- left recursion
Consider the grammar S -> Sa
bool S1() { return S() && term(a); }
bool S() { return S1(); }
-Since there's only one production for S, we don't need to worry about
backtracking.
When an input string is being parsed, the first step is to call S(),
which is going to call S1(), which will then call S() again before
consuming any input; as a result S() goes into infinite recursion, and
no input will be parsed successfully.
Solution for a left-recursive grammar (so that it works with the recursive
descent algorithm)
It can be rewritten using right recursion. For example, S -> Sa | b
(the example above extended with a base production, which it needs in
order to generate any strings) can be rewritten as follows:
S -> bS’
S’ -> aS’ | ε
Notes:
-A left-recursive grammar is not always that obvious.
Consider the following grammar:
S -> Aa | c
A -> Sb
S -> Aa -> Sba
In two steps we have produced another string with S at the
left end, so the grammar is still left-recursive.
-The recursive descent algorithm is used in production
compilers such as GCC.