Compiler Notes
A pattern is the set of rules that determines whether a given lexeme is a valid token or not.
A token is a sequence of characters that can be treated as a single logical entity.
A lexeme is a sequence of characters in the source program that matches the pattern for a token.
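The relationship between the three terms can be sketched with a small regex-based scanner. This is a minimal illustration, not a definitive lexer: the token names and patterns below are assumptions made up for a tiny C-like language.

```python
import re

# Hypothetical patterns for a tiny C-like language (illustrative only).
# Each pattern describes the set of lexemes that form one kind of token.
TOKEN_PATTERNS = [
    ("KEYWORD", r"\b(?:int|float|return)\b"),
    ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",  r"\d+"),
    ("OP",      r"[=+\-*/;]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(source):
    """Return (token_name, lexeme) pairs; whitespace is discarded."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"no pattern matches at position {pos}: {source[pos]!r}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))  # (token, lexeme)
        pos = m.end()
    return tokens
```

For the input "int a = 10;" the lexeme "int" matches the KEYWORD pattern, "a" matches ID, and so on.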
Parse tree: a graphical representation of how the start symbol of a grammar derives
the string.
A top-down parser cannot accept a left-recursive grammar because it would fall into an
infinite loop, so left recursion has to be removed first. It also cannot handle ambiguous
or non-deterministic grammars. It uses leftmost derivation.
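Removing immediate left recursion can be sketched as follows. This is a minimal sketch under stated assumptions: productions are lists of symbols, an empty list stands for epsilon, and the new nonterminal is named by appending a prime, which is only a naming convention chosen here.

```python
def remove_immediate_left_recursion(nonterminal, productions):
    """Rewrite  A -> A a | b  as  A -> b A' ;  A' -> a A' | epsilon.

    `productions` is a list of right-hand sides, each a list of symbols.
    An empty list [] stands for epsilon.
    """
    # Tails of the left-recursive alternatives (the "a" parts).
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    # Alternatives that do not start with the nonterminal (the "b" parts).
    others = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}
    new_nt = nonterminal + "'"
    return {
        nonterminal: [rhs + [new_nt] for rhs in others],
        new_nt: [rhs + [new_nt] for rhs in recursive] + [[]],
    }
```

For E -> E + T | T this yields E -> T E' and E' -> + T E' | epsilon, which a top-down parser can expand without looping.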
A bottom-up parser works on left-recursive and non-deterministic grammars but not on
ambiguous grammars. The exception is the operator precedence parser, which can also handle
some ambiguous operator grammars. Bottom-up parsing produces a rightmost derivation in
reverse.
To convert an ambiguous grammar to an unambiguous one, we have to:
1. Ensure that higher-precedence operators appear at lower levels of the parse tree.
2. If an operator is left associative, make its production left recursive; otherwise make
it right recursive.
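The effect of the two rules can be seen in a small parser for a layered grammar. This is a sketch under assumptions: the grammar E -> E + T | T, T -> T * P | P, P -> F ^ P | F is a hypothetical example, operands are single tokens, and the left-recursive rules are coded as loops so that a hand-written parser can run them.

```python
def parse(tokens):
    """Build a nested-tuple tree for tokens such as ["a", "+", "b", "*", "c"]."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():   # E -> E + T | T  (lowest precedence, left associative)
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():   # T -> T * P | P  (higher precedence, one level lower)
        node = power()
        while peek() == "*":
            advance()
            node = ("*", node, power())
        return node

    def power():  # P -> F ^ P | F  (right recursive, so right associative)
        base = advance()
        if peek() == "^":
            advance()
            return ("^", base, power())
        return base

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

In the tree for a + b * c the "*" node sits below the "+" node, and a ^ b ^ c groups to the right while a + b + c groups to the left.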
If the RHS of more than one production for the same nonterminal starts with the same
symbol, the grammar is called a grammar with common prefixes, or a non-deterministic
grammar.
This kind of grammar is a problem for top-down parsers: they cannot decide which
production to choose to parse the string in hand.
To remove this confusion, we use left factoring.
Left factoring is the process of removing common prefixes, that is, converting a
non-deterministic grammar into a deterministic grammar.
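One round of left factoring for a single nonterminal might look like the sketch below. The representation (lists of symbols, [] for epsilon) and the primed names for new nonterminals are assumptions made for illustration; a full implementation would repeat the step until no common prefixes remain.

```python
def common_prefix(seqs):
    """Longest prefix shared by all symbol lists in `seqs`."""
    prefix = []
    for symbols in zip(*seqs):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(nonterminal, productions):
    """Rewrite  A -> x b1 | x b2 | c  as  A -> x A' | c ;  A' -> b1 | b2."""
    # Group alternatives by their first symbol.
    groups = {}
    for rhs in productions:
        key = rhs[0] if rhs else None
        groups.setdefault(key, []).append(rhs)

    result = {nonterminal: []}
    new_count = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[nonterminal].extend(group)   # nothing to factor
            continue
        prefix = common_prefix(group)
        new_count += 1
        new_nt = nonterminal + "'" * new_count
        result[nonterminal].append(prefix + [new_nt])
        result[new_nt] = [rhs[len(prefix):] for rhs in group]
    return result
```

For the dangling-else style grammar S -> i E t S | i E t S e S | a this gives S -> i E t S S' | a and S' -> epsilon | e S.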
A predictive parser is a recursive descent parser that can predict which production
to use to expand a nonterminal, so it does not suffer from backtracking.
To accomplish this, the predictive parser uses a lookahead pointer, which points to
the next input symbols. To stay backtracking-free, the predictive parser puts some
constraints on the grammar and accepts only the class of grammars known as LL(k)
grammars.
Table-driven predictive parsing uses a stack and a parsing table to parse the input
and generate a parse tree. Both the stack and the input contain an end symbol $ to
denote that the stack is empty and the input is consumed. The parser consults the
parsing table to decide what to do for each combination of stack top and input symbol.
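The stack-and-table loop described above can be sketched like this. The grammar (E -> T E', E' -> + T E' | epsilon, T -> id) and its hand-built LL(1) table are hypothetical examples, not taken from these notes.

```python
# Hand-built LL(1) table for:  E -> T E' ;  E' -> + T E' | epsilon ;  T -> id
# Key: (nonterminal on top of stack, lookahead). Value: right-hand side to push.
TABLE = {
    ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],            # E' -> epsilon
    ("T", "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}

def ll1_parse(tokens, start="E"):
    """Return the productions applied (a leftmost derivation) or raise SyntaxError."""
    input_ = list(tokens) + ["$"]     # $ marks the end of the input
    stack = ["$", start]              # $ marks the bottom of the stack
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        look = input_[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no table entry for ({top}, {look})")
            applied.append((top, tuple(rhs)))
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
        elif top == look:
            pos += 1                      # terminal (or $) matched
        else:
            raise SyntaxError(f"expected {top!r}, found {look!r}")
    return applied
```

Each step looks only at the stack top and one lookahead symbol, so no alternative is ever tried and abandoned.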
LR PARSER
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide
class of context-free grammars, which makes it one of the most general and widely used
syntax analysis techniques. LR parsers are also known as LR(k) parsers, where L stands
for left-to-right scanning of the input stream, R stands for the construction of a
rightmost derivation in reverse, and k denotes the number of lookahead symbols used to
make decisions.
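A shift-reduce driver for such a parser can be sketched as below. The grammar S -> ( S ) | x and its ACTION/GOTO tables are a hypothetical example worked out by hand with the SLR(1) method; real tables are normally produced by a parser generator.

```python
# Productions: 1: S -> ( S )    2: S -> x      value = (left-hand side, RHS length)
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}

# Hand-derived SLR(1) tables (states 0..5) for the grammar above.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    """Return the production numbers in the order they are reduced."""
    stack = [0]                       # stack of parser states
    input_ = list(tokens) + ["$"]
    pos = 0
    reductions = []
    while True:
        action = ACTION.get((stack[-1], input_[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {input_[pos]!r} in state {stack[-1]}")
        kind, arg = action
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]          # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])
            reductions.append(arg)
        else:                                        # accept
            return reductions
```

For the input ( x ) the reductions come out as S -> x followed by S -> ( S ), which is the rightmost derivation S => ( S ) => ( x ) read in reverse.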
There are three widely used algorithms available for constructing an LR parser:
SLR(1) (simple LR), LALR(1) (lookahead LR), and canonical LR(1) (CLR).
Quadruples
Each instruction in the quadruple representation is divided into four fields: operator,
arg1, arg2, and result. For example, the three-address code r1 = c * d; r2 = b + r1;
r3 = r2 + r1; a = r3 is represented in quadruple format as:
op   arg1   arg2   result
*    c      d      r1
+    b      r1     r2
+    r2     r1     r3
=    r3            a
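The quadruple table can be written down as data and executed directly. This is a small sketch: the `Quad` tuple and the `run_quads` interpreter are illustrative helpers, not part of any particular compiler.

```python
from collections import namedtuple

Quad = namedtuple("Quad", ["op", "arg1", "arg2", "result"])

# The table from the notes, one quadruple per row.
QUADS = [
    Quad("*", "c", "d", "r1"),
    Quad("+", "b", "r1", "r2"),
    Quad("+", "r2", "r1", "r3"),
    Quad("=", "r3", None, "a"),
]

def run_quads(quads, env):
    """Execute quadruples over a copy of `env` (a name -> value mapping)."""
    env = dict(env)
    for q in quads:
        if q.op == "=":
            env[q.result] = env[q.arg1]
        elif q.op == "+":
            env[q.result] = env[q.arg1] + env[q.arg2]
        elif q.op == "*":
            env[q.result] = env[q.arg1] * env[q.arg2]
        else:
            raise ValueError(f"unknown operator {q.op!r}")
    return env
```

Because every result has an explicit name (r1, r2, r3), the rows can be reordered by an optimizer as long as each temporary is still defined before it is used.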
Triples
Each instruction in the triple representation has three fields: op, arg1, and arg2. The
result of each sub-expression is denoted by the position of the instruction that computes
it. Triples resemble DAGs and syntax trees; when representing expressions they are
equivalent to a DAG.
     op   arg1   arg2
(0)  *    c      d
(1)  +    b      (0)
(2)  +    (1)    (0)
(3)  =    a      (2)
Triples make code hard to move during optimization: results are referred to by position,
so changing the order or position of an instruction invalidates the references to it.
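A positional reference can be modelled as an integer index into the triple list. In this sketch, strings are variable names and integers refer to the result of the triple at that position; both conventions, and writing the assignment as (=, a, (2)), are assumptions made for illustration.

```python
# Triples for a = b + c * d + c * d.
TRIPLES = [
    ("*", "c", "d"),   # (0)
    ("+", "b", 0),     # (1)  b + result of (0)
    ("+", 1, 0),       # (2)  result of (1) + result of (0)
    ("=", "a", 2),     # (3)  a = result of (2)
]

def run_triples(triples, env):
    """Execute triples in list order over a copy of `env`."""
    env = dict(env)
    values = []                      # values[i] holds the result of triple (i)

    def operand(x):
        return values[x] if isinstance(x, int) else env[x]

    for op, arg1, arg2 in triples:
        if op == "=":
            env[arg1] = operand(arg2)
            values.append(env[arg1])
        elif op == "+":
            values.append(operand(arg1) + operand(arg2))
        elif op == "*":
            values.append(operand(arg1) * operand(arg2))
        else:
            raise ValueError(f"unknown operator {op!r}")
    return env
```

If triple (0) were moved to another position, every operand written as 0 would have to be renumbered, which is the movability problem described above.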
Indirect Triples
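This representation is an enhancement over the triples representation. It keeps a
separate list of pointers to triples and uses that list, rather than the positions of
the triples themselves, to fix the order of execution. This lets the optimizer reposition
sub-expressions freely to produce optimized code, because reordering the pointer list
does not change the triples or the references between them.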
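The idea can be sketched with a separate statement list that holds indices into an unchanged triple table. The example expression c * d + e * f and the helper name `run_indirect` are hypothetical.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

# Triple table for c * d + e * f; integers refer to other triples by index.
TRIPLES = [
    ("*", "c", "d"),   # triple 0
    ("*", "e", "f"),   # triple 1
    ("+", 0, 1),       # triple 2
]

def run_indirect(order, triples, env):
    """Execute the triples in the sequence given by the pointer list `order`."""
    values = {}                      # results keyed by triple index, not by position

    def operand(x):
        return values[x] if isinstance(x, int) else env[x]

    for index in order:
        op, arg1, arg2 = triples[index]
        values[index] = OPS[op](operand(arg1), operand(arg2))
    return values[order[-1]]
```

Here the two multiplications are independent, so the pointer list can be [0, 1, 2] or [1, 0, 2] without touching the triple table.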
Implementation
If a compiler only has to handle a small amount of data, the symbol table can be
implemented as an unordered list, which is easy to code but suitable only for small
tables. A symbol table can be implemented in one of the following ways: linear (sorted
or unsorted) list, binary search tree, or hash table.
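The trade-off between an unordered list and a faster structure can be sketched with two tables that share one interface. The class names and the attribute dictionaries are assumptions made for illustration.

```python
class ListSymbolTable:
    """Unordered list of (name, attributes) pairs: easy to code, O(n) lookup."""

    def __init__(self):
        self._entries = []

    def insert(self, name, attrs):
        self._entries.append((name, attrs))

    def lookup(self, name):
        # Scan from the newest entry so a later declaration shadows an earlier one.
        for entry_name, attrs in reversed(self._entries):
            if entry_name == name:
                return attrs
        return None


class HashSymbolTable:
    """Hash table keyed by name: lookup cost does not grow with the table size."""

    def __init__(self):
        self._entries = {}

    def insert(self, name, attrs):
        self._entries[name] = attrs

    def lookup(self, name):
        return self._entries.get(name)
```

Both classes return None for an undeclared name, so the rest of a compiler could switch between them without other changes.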
The character ("blank space") beyond the token ("int") has to be examined before
the token ("int") can be determined.
After the token ("int") is processed, both pointers (the lexeme-begin pointer and the
lookahead pointer) are set to the start of the next token ('a'), and this process is
repeated for the whole program.
A buffer can be divided into two halves. When the lookahead pointer reaches the end of the
first half, the second half is filled with new characters to be read. When the lookahead
pointer reaches the right end of the second half, the first half is filled with new
characters, and so on.
Sentinels: Without sentinels, each time the forward pointer is advanced, a check has to be
made that it has not moved off one half of the buffer; if it has, the other half must be
reloaded. Placing a sentinel character (such as eof) at the end of each half lets the
scanner combine this end-of-buffer check with the test on the current character.
Buffer Pairs: A specialized buffering technique can reduce the overhead needed to process
each input character. It uses two buffers of N characters each, which are reloaded
alternately.
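A buffer pair with sentinels can be sketched as below. This is a simplified model under assumptions: the sentinel is the NUL character and is assumed never to occur in the source text, the class name `TwoBufferReader` is hypothetical, and only the forward pointer is modelled (no lexeme-begin pointer or retraction).

```python
import io

EOF = "\0"   # sentinel character; assumed never to appear in the source text

class TwoBufferReader:
    """Two halves of n characters each, with a sentinel after each half."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [half 0: n chars][sentinel][half 1: n chars][sentinel]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # A short read leaves EOF inside the half, marking real end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                          # common case: a single test per character
        if self.forward - 1 == self.n:         # sentinel at end of first half
            self._load(1)                      # forward already points at second half
            return self.next_char()
        if self.forward - 1 == 2 * self.n + 1: # sentinel at end of second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward -= 1                      # EOF inside a half: end of input
        return None
```

The buffer-boundary tests run only when a sentinel is seen, so ordinary characters cost one comparison each.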
In static allocation, the compiler can decide at compile time the amount of storage
needed by each data object. This makes it easy for the compiler to determine the address
of each object in the activation record. Variables whose size has to be determined at
run time cannot be used.
FORTRAN uses static allocation.
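Because every size is known at compile time, addresses reduce to fixed offsets that can be computed before the program runs. The sketch below assumes made-up type sizes and ignores alignment; the function name is hypothetical.

```python
# Assumed sizes in bytes (illustrative; real sizes depend on the target machine).
SIZES = {"int": 4, "float": 4, "double": 8, "char": 1}

def assign_static_offsets(declarations):
    """Map each name to a fixed offset; return (offsets, total size).

    `declarations` is a list of (name, type_name, element_count) tuples,
    where element_count must be a compile-time constant.
    """
    offsets = {}
    offset = 0
    for name, type_name, count in declarations:
        offsets[name] = offset
        offset += SIZES[type_name] * count
    return offsets, offset
```

An array whose element count is only known at run time has no place in this scheme, which matches the restriction stated above.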
Backpatching: While generating three-address code for a given expression or statement, the
compiler has to specify the address of the label in each goto statement. It is difficult
to assign the locations of these labels in a single pass, because the target of a forward
jump is not yet known when the jump is emitted. One solution is to use two passes: the
first pass leaves these addresses unspecified, and the second pass fills them in. Filling
in these incomplete jump instructions once their targets are known is called
backpatching.
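The filling-in step can be sketched as follows. This sketch uses the common one-pass variant, in which each incomplete jump is remembered in a list and patched as soon as its target is known; the `CodeBuffer` class, its method names, and the example statement are assumptions made for illustration.

```python
class CodeBuffer:
    """Holds emitted instructions; jump targets may be filled in later."""

    def __init__(self):
        self.instrs = []            # each entry is [text, target]; target may be None

    def emit(self, text):
        """Append an instruction and return its index."""
        self.instrs.append([text, None])
        return len(self.instrs) - 1

    def backpatch(self, jump_list, target):
        """Fill in `target` for every incomplete jump listed in `jump_list`."""
        for index in jump_list:
            self.instrs[index][1] = target

    def listing(self):
        lines = []
        for index, (text, target) in enumerate(self.instrs):
            suffix = "" if target is None else f" {target}"
            lines.append(f"{index}: {text}{suffix}")
        return lines


def demo():
    """Generate code for:  if a < b then x = 1 else x = 2"""
    buf = CodeBuffer()
    true_list = [buf.emit("if a < b goto")]       # target unknown for now
    false_list = [buf.emit("goto")]               # target unknown for now
    buf.backpatch(true_list, len(buf.instrs))     # then-branch starts here
    buf.emit("x = 1")
    end_list = [buf.emit("goto")]                 # jump over the else-branch
    buf.backpatch(false_list, len(buf.instrs))    # else-branch starts here
    buf.emit("x = 2")
    buf.backpatch(end_list, len(buf.instrs))      # first instruction after the if
    return buf.listing()
```

Calling `demo()` produces a listing in which every goto carries the index of its target instruction, even though each jump was emitted before that index was known.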