Compiler 3
Lex
Lex is a program that generates lexical analyzers. It is commonly used together with the YACC parser generator.
A lexical analyzer is a program that transforms an input stream into a sequence of tokens. Lex builds such an analyzer by generating a C program that implements it.
First, the lexical analyzer is specified as a program lex.l written in the Lex language. The Lex compiler then translates lex.l into a C program lex.yy.c.
Finally, the C compiler compiles lex.yy.c and produces an executable a.out.
a.out is the lexical analyzer: it transforms an input stream into a sequence of tokens.
A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Each rule has the form pi { actioni }, where pi is a regular expression (a pattern) and actioni describes the action the lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. These subroutines can be compiled separately and loaded with the lexical analyzer.
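As a concrete illustration, the following is a minimal sketch of a Lex specification, assuming a hypothetical scanner that prints the numbers found in its input (the DIGIT definition, the patterns, the messages, and the file name scanner.l are illustrative, not prescribed by the format above). The definitions section declares a named pattern, each rule pairs a pattern pi with an action, and the user-subroutines section supplies main and yywrap:

%{
#include <stdio.h>   /* definitions section: C declarations and Lex definitions */
%}
DIGIT   [0-9]

%%
{DIGIT}+    { printf("NUMBER: %s\n", yytext);  /* pattern p1 with its action1 */ }
[ \t\n]+    { /* skip whitespace */ }
.           { printf("OTHER: %s\n", yytext); }
%%

/* user subroutines: auxiliary C code used with the generated analyzer */
int main(void)
{
    yylex();                         /* run the generated lexical analyzer on stdin */
    return 0;
}

int yywrap(void) { return 1; }       /* report end of the input stream */

Following the toolchain described earlier, lex scanner.l produces lex.yy.c, and cc lex.yy.c -o a.out produces the lexical analyzer a.out.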
YACC
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the
basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and
pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the
limitations of regular expressions. Regular expressions cannot check balanced tokens, such
as parentheses. Therefore, this phase uses a context-free grammar (CFG), which is recognized by
a push-down automaton.
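To make the parenthesis example concrete, here is a minimal sketch of a YACC specification for the language of balanced parentheses, which no regular expression can describe (the rule names, the trivial character-by-character lexer, and the single-line input convention are illustrative assumptions):

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}

%%
input    : balanced '\n'            { printf("balanced\n"); }
         ;
balanced : /* empty */
         | '(' balanced ')' balanced
         ;
%%

/* a trivial lexer: return each input character as its own token */
int yylex(void)
{
    int c = getchar();
    return (c == EOF) ? 0 : c;
}

int main(void) { return yyparse(); }

The rule balanced : '(' balanced ')' balanced can nest to any depth, which is exactly what a finite-state regular expression cannot keep track of; the push-down automaton generated by YACC can.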
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce the terminology used in parsing technology. A context-free grammar G = ( V, Σ, P, S ) has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of
strings. The non-terminals define sets of strings that help define the language generated
by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from
which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of
a non-terminal called the left side of the production, an arrow, and a sequence of tokens
and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S), from which derivation begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a regular
expression. That is, L = { w | w = wR } (where wR denotes the reverse of w) is not a regular
language, but it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z, Q → N, Q → 0, Q → 1, Q → ε, Z → 0Q0, N → 1Q1 }
S = Q
This grammar describes the palindrome language over { 0, 1 }, generating strings such as 1001, 11100111, 00100, 1010101, 11111, and so on. (The productions Q → 0 and Q → 1 are needed for odd-length palindromes such as 00100.)
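For example, the string 1001 is derived from the start symbol as
Q → N → 1Q1 → 1Z1 → 10Q01 → 1001
using Q → ε in the last step, while the odd-length palindrome 00100 is derived as
Q → Z → 0Q0 → 0Z0 → 00Q00 → 00100
using Q → 1 in the last step.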
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams.
The parser analyzes the source code (token stream) against the production rules to detect any
errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks: parsing the code while checking for errors, and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.
Derivation
A derivation is basically a sequence of applications of production rules that produces the input string. During parsing, we take two decisions for each sentential form of the input: which non-terminal is to be replaced, and which production rule is used to replace it.
To decide which non-terminal to replace, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential
form.
Right-most Derivation
If we scan and replace the input with production rules from right to left, it is known as right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
The left-most derivation is:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most non-terminal is always expanded first.
The right-most derivation is:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the parse
tree. Let us see this with the left-most derivation from the last topic:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
The tree is built step by step: starting from the root E, each derivation step expands one leaf labelled with a non-terminal into children given by the right side of the production used, until every leaf is a terminal (id, + or *).
In a parse tree, all leaf nodes are terminals, all interior nodes are non-terminals, and an in-order traversal of the leaves gives the original input string.
A parse tree also depicts the associativity and precedence of operators. The deepest sub-tree is traversed first, so the operator in that sub-tree takes precedence over the operator in its parent node. In the tree built above, for example, the + sub-tree lies below the * node, so this particular tree corresponds to the grouping (id + id) * id.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one left-most or right-most derivation) for at least one string.
Example
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees: one corresponding to the grouping (id + id) – id and the other to id + (id – id).
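The ambiguity also shows up in practice: if this grammar is given to YACC as it stands, the parser cannot decide, after reading id + id with – as the next token, whether to reduce E + E or to shift the –, and YACC reports shift/reduce conflicts. A grammar-only sketch, just enough for yacc to analyze (the token name ID is illustrative, and '-' stands for the minus operator):

%token ID
%%
E : E '+' E
  | E '-' E
  | ID
  ;
%%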
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand is taken by the operator on its left; if the operation is right-associative, the operator on its right takes the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left-associative. If the
expression contains
id op id op id
it will be evaluated as
(id op id) op id
For example, id + id + id is evaluated as (id + id) + id. Operations such as Exponentiation are right-associative, so the order of evaluation in the same expression would be
id op (id op id)
Precedence
If two different operators share a common operand, the precedence of operators decides which
will take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to
(2+3)*4 and another corresponding to 2+(3*4). By setting precedence among operators, this
problem can be easily removed. As in the previous example, mathematically * (multiplication)
has precedence over + (addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
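In YACC, associativity and precedence are declared rather than encoded in extra grammar rules: %left and %right fix the associativity of a token, and tokens declared on a later line bind more tightly than those declared earlier. The following is a minimal calculator-style sketch (the token NUM, the single-digit lexer, and the file layout are illustrative assumptions); without the two %left lines, yacc would report shift/reduce conflicts for this ambiguous grammar:

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}

%token NUM
%left '+' '-'          /* lowest precedence, left-associative  */
%left '*' '/'          /* higher precedence, left-associative  */

%%
line : expr '\n'        { printf("= %d\n", $1); }
     ;
expr : expr '+' expr    { $$ = $1 + $3; }
     | expr '-' expr    { $$ = $1 - $3; }
     | expr '*' expr    { $$ = $1 * $3; }
     | expr '/' expr    { $$ = $1 / $3; }
     | NUM              { $$ = $1; }
     ;
%%

/* a trivial lexer: single-digit numbers, every other character as a literal token
   (no whitespace handling, to keep the sketch short) */
int yylex(void)
{
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return NUM; }
    return (c == EOF) ? 0 : c;
}

int main(void) { return yyparse(); }

Given the input 2+3*4, this parser prints = 14, that is, it interprets the expression as 2 + (3 * 4), because '*' is declared after '+' and therefore has higher precedence, while 2+3-4 is grouped as (2+3)-4 because '+' and '-' are declared %left.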