Compilers - Week 3

Regular languages are useful but limited; they cannot count arbitrarily or recognize balanced parentheses. A compiler parses input tokens into a parse tree. Context-free grammars (CFGs) describe valid token strings through terminals, nonterminals, productions, and a start symbol. Derivations show how the start symbol produces a string via productions. Grammars can be ambiguous, allowing multiple parse trees for a string. Precedence and associativity declarations can disambiguate grammars. Compilers detect lexical, syntax, and semantic errors and should report errors clearly while recovering quickly.

I- introduction
Regular languages are the weakest class of formal languages, but
they're widely used.
Regular languages and finite automata are incapable of the
following:
i- counting to an arbitrary i (e.g. matching a^i b^i)
ii- recognizing strings of balanced parentheses
II- parsing
input: sequence of tokens from the lexer
output: parse tree of the program
Notes:
i- sometimes the parse tree is only implicit, so a compiler may
never actually build the full parse tree.
ii- some compilers combine the lexical analysis and parsing
phases into one, where everything is done by the parser, since
parsing technology is generally powerful enough to express
lexical analysis as well. But most compilers still divide up the
work this way, because regular expressions are such a good match
for lexical analysis, and the parsing is then handled separately.
III- context free grammar
-Since not all strings of tokens are valid, the parser has to tell the
difference and give error messages for the invalid strings of tokens.
Therefore, we need a way to describe the valid strings (a context-free
grammar) and a method to distinguish valid strings from invalid ones
(a parse tree).
-A CFG consists of the following:
i- a start symbol S ∈ N
ii- a set of terminals T
iii- a set of nonterminals N
iv- a set of productions X -> Y1...Yn, where:
X ∈ N and each Yi ∈ N ∪ T ∪ {ε}
Example: S -> (S) | ε
Terminals: {(, )}
Nonterminals: {S}
Start symbol: S
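For instance, this grammar derives the string (()) as follows:
S -> (S) -> ((S)) -> (())
where the last step replaces the innermost S with ε.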
Notes:
i- productions can be viewed as replacement rules: the
left-hand side is replaced by any of the alternatives on the
right-hand side.
ii- we start with the start symbol, and then keep replacing
nonterminals until there are no more nonterminals left.
iii- terminals -which are the tokens of the language- cannot be
replaced (once generated, a terminal is a permanent
feature of the string).
iv- nonterminals are written in capital letters.
v- many grammars can generate the same language.
In conclusion:
if G is a context-free grammar with start symbol S, then the
language L(G) of G is defined as:
L(G) = { a1...an | every ai ∈ T and S ->* a1...an }
i.e. the sequences of terminals a1 through an such that the start
symbol S derives that sequence in zero or more steps.
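For the parenthesis grammar above, L(G) = {ε, (), (()), ((())), ...}:
the nested strings of balanced parentheses.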
IV- derivations
-A derivation is a sequence of productions.
-it can be drawn as a tree, where the start symbol is the root of the tree,
the interior nodes are the nonterminals, and the leaves are the
terminals. (this tree is known as the parse tree)
-types of derivation
i- left-most derivation
At each step, replace the left-most nonterminal.
ii- right-most derivation
At each step, replace the right-most nonterminal.
iii- other types
Notes:
-every parse tree has a right-most and a left-most
derivation (one parse tree may have many derivations).
-in the section 5-03 quiz, S cannot be directly replaced by aa; it
has to be replaced by aXa first, and then X is replaced by ε,
giving aa.
V- ambiguity
Consider the string id * id + id, which has two parse trees: one
grouping it as (id * id) + id, and one as id * (id + id).
Derivation for the first tree:
E -> E + E -> E * E + E -> id * E + E -> id * id + E -> id * id + id
Derivation for the second tree:
E -> E * E -> E * E + E -> E * E + id -> E * id + id -> id * id + id
-a grammar is ambiguous if it has more than one parse tree for some
string (equivalently, if there is more than one right-most derivation
or more than one left-most derivation for some string).
-if a program has multiple parse trees, then you're essentially
leaving it up to the compiler to pick which of the possible
interpretations of the program it will generate code for.
Notes:
i- each parse tree has one leftmost derivation, one rightmost
derivation, and possibly many other derivations; the grammar is
ambiguous only if a single string has multiple parse trees.
ii- E -> E + E | id is an ambiguous grammar, because the string
id + id + id can be constructed using two different parse trees.
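For example, the two leftmost derivations of id + id + id under this
grammar are:
E -> E + E -> id + E -> id + E + E -> id + id + E -> id + id + id
(the right-nested tree id + (id + id)), and
E -> E + E -> E + E + E -> id + E + E -> id + id + E -> id + id + id
(the left-nested tree (id + id) + id).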
How to handle ambiguity
-The most direct method is to rewrite the grammar so it's
unambiguous, that is, to write a new grammar that generates the
same language as the old grammar but has only a single parse
tree for each string.

Example:
Consider the string id * id + id. We can rewrite the grammar
as follows:
E -> E' + E | E'
E' -> id * E' | id | (E) * E' | (E)
We've divided the productions into two classes: one that
handles + and one that handles *.
-The E productions control the generation of +,
producing a sum of one or more E' terms.

-The E' productions control the generation of *,
producing a sequence of ids and parenthesized
expressions multiplied together. The +'s are
generated at the outermost level, and the E''s then
generate the *'s inside the +'s, so the grammar
enforces that * binds more tightly than +.
Note: we can still have +'s inside *'s by using
parentheses (via the E' -> (E) productions).
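With the rewritten grammar, id * id + id has exactly one derivation:
E -> E' + E -> id * E' + E -> id * id + E -> id * id + E' -> id * id + id
so the * ends up below the + in the parse tree, as intended.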
Yet another example: if-then-else. The classic dangling-else
ambiguity (if E then if E then S else S) is handled the same way:
rewrite the grammar so that each else matches the closest
unmatched then.

Disambiguation mechanism
-Instead of rewriting the grammar, use the more natural
(ambiguous) grammar, along with disambiguating declarations.
-The most popular disambiguating declarations are
precedence and associativity declarations.
Example:
consider the grammar E -> E + E | int, which is
ambiguous since there are two parse trees for
the string int + int + int: the grammar
doesn't say whether + is left-associative or
right-associative. The solution is to declare +
left-associative, which excludes the
right-nested parse tree int + (int + int).

Example:
consider the ambiguous grammar E -> E + E | E *
E | int. The solution is to declare both + and *
left-associative, as follows:
%left +
%left *
The precedence between + and * is then given by the
order of the declarations: since * appears after +,
* has higher precedence than +, and the parse tree
that applies * at the top is excluded.
Note: it's important to check the behavior of
your grammar after you add these declarations.
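For example, with these declarations in force, int + int * int parses
only as int + (int * int), because * binds tighter than +, and
int + int + int parses only as (int + int) + int, because + is
left-associative.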

VI- error handling
Compilers have two relatively distinct jobs: the first is to translate
valid programs, and the other is to give good feedback for erroneous
programs.
Types of errors:
i- lexical errors: using a symbol that is not in the alphabet
of the language. Detected by the lexer.
ii- syntax errors: all the individual lexical units are correct,
but they're assembled in a way that doesn't make sense.
Detected by the parser.
iii- semantic errors: for example, a type mismatch. Detected
by the type checker.
iv- correctness errors: the program is valid but doesn't do
what you intended. Detected by a tester or a user.
Note: once we get past what the compiler can do, it's up to
testers and users to find the rest of the problems in the program.
Error handler should:
i- report errors accurately and clearly
ii- recover from an error quickly
iii- not slow down compilation of valid code.
Types of error handling
i- panic mode
-simplest and most popular method
-the basic idea: when an error is detected, the parser
begins discarding tokens until one with a clear role in the
language (a synchronizing token) is found, and then it
tries to restart itself and continue from that point on.
Note: synchronizing tokens are tokens with a well-known
role in the language, so we can reliably identify where we
are in the input.
Example: (1 ++ 2) + 3
the parser proceeds from left to right until it
reaches the second plus sign, at which point it won't
know what to do, because no expression in the
language has two plus signs in a row; so it throws
away that plus and restarts at the 2.
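A minimal sketch of this idea in C++ (the token names, the sync-set
parameter, and the helper name are our own, hypothetical choices;
real parsers embed this logic in the parse loop):

    #include <set>

    enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, SEMI, END };

    TOKEN *next;  // cursor into the token stream

    // On a syntax error, discard tokens until a synchronizing token
    // (one with a well-known role, e.g. ';' or end of input) is
    // found, so the parser can restart from a reliable point.
    void recoverByPanicMode(const std::set<TOKEN> &sync) {
        while (*next != END && sync.count(*next) == 0)
            ++next;  // throw away tokens that cannot anchor a restart
    }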

-Bison -a widely used parser generator- has a special terminal
symbol called error, which is used to describe how much input
to skip, for example:
E -> int | E + E | (E) | error int | (error)
If none of the ordinary productions apply, the parser can use
one of the last two. For example, if the error int production
is applied, 'error' matches all the input up to the next
integer, and that input plus the integer is treated as an E.
ii- error productions
Used to specify known common mistakes that
programmers make, such as writing 5 x instead of 5 * x; for
example, we can add a production that looks like this:
E -> E E
-A disadvantage of this strategy is that it complicates the
grammar.
iii- automatic local or global correction (error correction)
here we want to find programs that are nearby: programs
that aren't too different from the program the
programmer supplied, but that do compile correctly.
How do we decide if a program is nearby to the
program at hand?
One way is to try token insertions and deletions,
minimizing the edit distance, which is the metric used
to determine whether a program is close to the
original program. Another way is to do an exhaustive
search within some bound, trying all possible programs
that are close to the supplied program.
Disadvantages of this strategy:
-hard to implement
-slows down parsing of correct programs
-a "nearby" program is not necessarily the intended program

Note: only the first two methods are used in current compilers.

VII- abstract syntax tree
A parser traces the derivation of a sequence of tokens, but the rest
of the compiler needs some representation of the program: an actual
data structure that tells it what the operations in the program are
and how they're put together. That data structure is the abstract
syntax tree, not the parse tree.
Notes:
i- an abstract syntax tree (AST) is just like a parse tree but with
fewer details, since it suppresses details of the concrete syntax.
ii- the parse tree would actually be adequate for compilation,
since it traces the operation of the parser and captures nesting
structure, but it is inconvenient to use because it carries too
much information, like parentheses and single-successor nodes.
Parse trees vs ASTs
Consider the string 5 + (2 + 3). The parse tree (figure omitted)
records every detail of the concrete syntax, including the
parentheses and chains of single-successor nodes, while the
corresponding AST (figure omitted) abstracts from the concrete
syntax -which makes it easier to use- while still capturing the
nesting structure.
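As a concrete illustration, here is one way that AST could be
represented in code (a minimal C++ sketch; the node names Num and
Plus are our own, not fixed by the course):

    #include <memory>

    // An AST for expressions: no parentheses, no single-successor
    // chains, just the operations and how they nest.
    struct Expr {
        virtual ~Expr() = default;
    };
    struct Num : Expr {
        int value;
        explicit Num(int v) : value(v) {}
    };
    struct Plus : Expr {
        std::unique_ptr<Expr> lhs, rhs;
        Plus(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
            : lhs(std::move(l)), rhs(std::move(r)) {}
    };

    // 5 + (2 + 3) as an AST: the parentheses survive only as nesting.
    std::unique_ptr<Expr> example() {
        return std::make_unique<Plus>(
            std::make_unique<Num>(5),
            std::make_unique<Plus>(std::make_unique<Num>(2),
                                   std::make_unique<Num>(3)));
    }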
VIII- recursive descent parsing
It's a top-down parsing algorithm, in which the parse tree is
constructed from the top and from left to right.
When parsing a string, we work from left to right, trying the
productions in order; when a production fails, we may have to
backtrack in order to try alternative productions.
Note: we have to start with the root node.
Example:
Consider the token stream ( int5 ) and the following grammar:
E -> T | T + E
T -> int | int * T | (E)
IX- recursive descent algorithm
We're going to define a number of boolean functions, as follows:
1- a function that matches a given token in the input. It takes an
argument of type TOKEN and checks whether it matches the token
currently pointed to in the input, returning true on a match and
false otherwise:
bool term(TOKEN tok) { return *next++ == tok; }
Note: the pointer is incremented regardless of whether the match
succeeded or failed.
2- a function that checks for the success of one particular
production of S:
bool Sn() { ... }
3- a function that tries all the productions of S:
bool S() { ... }
Note: this function returns true if any production of S can match
the input.
Example:
i- functions for the nonterminal E
production: E -> T
function: bool E1() { return T(); }
Where E1 is the function that checks for the success of the first
production, i.e. it returns true only if the first production
succeeds in matching some input.
How would the first production match any input? It can only match
input if some production of T matches the input, and that is
checked through the call to T().

production: E -> T + E
function: bool E2() { return T() && term(PLUS) && E(); }
Where E2 checks for the success of the second production:
some production of T has to match a portion of the input;
after that, a + has to appear in the input following whatever
T matched; and if the + matches, some production of E has to
match a portion of the remaining input.

Combined production: E -> T | T + E
-We need to write the function that matches any alternative for
E; since there are only these two productions, it just has to
match one of them.
-The only bit of state we have to worry about for backtracking
is the next pointer, which needs to be restored whenever we have
to undo a decision.
How do we restore the value of the next pointer?
A local variable called save is declared inside the function; it
records the position of the next pointer before we try to match
any input.
function:
bool E() {
    TOKEN *save = next;
    return (next = save, E1()) || (next = save, E2());
}
Note: since E1() is the very first alternative tried, the first
next = save assignment is actually redundant.

How is the alternative matching done?
We first try E1(), and if it doesn't succeed we try E2(). If the E
function fails, meaning we are out of alternatives for E
(because E1() and E2() both failed), then the failure is
returned to the next higher-level production in the derivation,
which will have to backtrack and try another alternative.

ii- functions for the nonterminal T
production: T -> int
function: bool T1() { return term(INT); }
production: T -> int * T
function: bool T2() { return term(INT) && term(TIMES) && T(); }
production: T -> (E)
function: bool T3() { return term(OPEN) && E() && term(CLOSE); }

Combined production: T -> int | int * T | (E)
bool T() {
    TOKEN *save = next;
    return (next = save, T1()) || (next = save, T2())
        || (next = save, T3());
}
How to start the parser?
1- initialize the next pointer to the first token
2- invoke E() (the function that matches any alternative of the
start symbol)
In conclusion:
The implementation of the grammar is a set of mutually recursive
functions that together implement this simple recursive descent
strategy.
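Putting the fragments together gives a complete, runnable sketch of
the parser (the token-stream setup in main and the end-of-input check
are our own additions; the notes above only give the individual
functions):

    #include <cstdio>

    enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, END };

    TOKEN *next;  // global cursor into the token stream

    // Match one terminal; the cursor advances even on failure, so
    // callers must save and restore it when backtracking.
    bool term(TOKEN tok) { return *next++ == tok; }

    bool E();  // E and T are mutually recursive
    bool T();

    bool E1() { return T(); }                       // E -> T
    bool E2() { return T() && term(PLUS) && E(); }  // E -> T + E

    bool E() {
        TOKEN *save = next;          // remember the starting position
        return (next = save, E1())
            || (next = save, E2());  // backtrack, try next alternative
    }

    bool T1() { return term(INT); }                         // T -> int
    bool T2() { return term(INT) && term(TIMES) && T(); }   // T -> int * T
    bool T3() { return term(OPEN) && E() && term(CLOSE); }  // T -> (E)

    bool T() {
        TOKEN *save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }

    int main() {
        TOKEN input[] = { OPEN, INT, CLOSE, END };  // the stream "( int5 )"
        next = input;                   // step 1: point at the first token
        bool ok = E() && *next == END;  // step 2: invoke E(), require all input consumed
        std::printf("%s\n", ok ? "accepted" : "rejected");
        return 0;
    }

On the stream ( int5 ) this prints "accepted"; on int * int it prints
"rejected", which is exactly the limitation discussed in the next
section.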

X- recursive descent limitations
Consider the input int * int: first E is called, which calls E1,
and E1 calls T; T calls T1, which returns true because the first
token of the input is int. But the remaining portion of the input
(* int) is never consumed, so the input is rejected, when in fact
it should be accepted.

What happened?
While backtracking over the alternatives of a nonterminal is
available, there is no way to go back into a nonterminal X and try
a different production once one production of X has already
succeeded. So we were not able to try the second production of T
instead of the first.

-that means this particular recursive descent algorithm is not
general, but there are other recursive descent algorithms that
support full backtracking, making them suitable for implementing
any grammar.
-despite this limitation, the recursive descent algorithm is easy
to implement by hand, and it is sufficient for grammars where,
for any nonterminal, at most one production can succeed at any
point.
-the given grammar can be rewritten to work with this particular
algorithm; one way to do that is by left-factoring the grammar,
as sketched below.
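One possible left-factored version of the grammar from the earlier
example (a sketch; other factorings are possible):
E -> T E'
E' -> + E | ε
T -> int T' | (E)
T' -> * T | ε
Now the alternatives of each nonterminal start with distinct tokens,
so at most one production can succeed at each point, and no
backtracking into a successful nonterminal is ever needed.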

XI- left recursion
Consider the grammar S -> Sa
bool S1() { return S() && term(a); }
bool S() { return S1(); }
-since there's only one production for S, we don't need to worry
about backtracking.
When an input string is parsed, the first step is to call S(),
which calls S1(), which immediately calls S() again; as a result,
S() goes into an infinite loop and no input is ever parsed
successfully.

-the problem with this grammar is that it is left-recursive.
Left-recursive grammar:
a grammar with a nonterminal such that, starting from that
nonterminal and applying some non-empty sequence of
replacements, you get back to a string with the same symbol
still in the left-most (first) position.
Notes:
1- such a parser can never match any input, because the only
way we consume input is if the first thing we generate is a
terminal symbol; if the first symbol is always the same
nonterminal, we never make any progress.
2- the recursive descent algorithm therefore doesn't work with
left-recursive grammars.
Example:
Consider the grammar: S -> Sa | b
This produces all strings that start with a b followed by any
number of a's.
Note: this grammar produces the right end of the string first;
the very last thing it produces is the first symbol of the input
(the left end of the string). That's why a left-recursive grammar
doesn't work with the recursive descent algorithm, which starts
at the left end of the string and works its way to the right end.

Solution for a left-recursive grammar (so that it works with the
recursive descent algorithm):
it can be rewritten using right recursion, as follows:
S -> bS'
S' -> aS' | ε
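In general, a left-recursive pair of productions A -> Aα | β can be
rewritten as:
A -> βA'
A' -> αA' | ε
which generates the same language (a β followed by any number of α's)
but produces the left end of the string first.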
Notes:
-left recursion is not always this obvious.
Consider the following grammar:
S -> Aa | c
A -> Sb
S -> Aa -> Sba
In two steps we have produced another string with S at the
left end, so the grammar is left-recursive even though no
single production is.
-the recursive descent algorithm is used in production
compilers such as GCC.
