Compiler Design Unit 1
Language Processors
The Structure of a Compiler
Lexical Analysis: The Role of the Lexical
Analyzer
Specification of Tokens
Recognition of Tokens
The Lexical-Analyzer Generator Lex
Compiler:
A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.
Interpreter:
An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
For example, Java language processors combine compilation and interpretation. A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine.
Language Processors:
In addition to a compiler, several other programs may be required to
create an executable target program as shown in Fig
Preprocessor:
The source program may be divided into modules stored in separate files; the task of collecting these pieces is entrusted to a program called a preprocessor. The preprocessor may also expand shorthands, called macros, into source-language statements. The modified source program is then fed to a compiler.
Compiler:
The compiler may produce an assembly-language program as its
output, because assembly language is easier to produce as output and
is easier to debug.
Assembler:
The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Phases of a compiler:
Lexical Analysis:
•For each lexeme, the lexical analyzer produces a token of the form
  <token-name, attribute-value>
that it passes on to the subsequent phase, syntax analysis.
Syntax analysis:
•The parser uses the tokens produced by the lexical analyzer to create a tree-like intermediate representation (a syntax tree) that depicts the grammatical structure of the token stream.
Semantic analysis:
•The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition; an important part of this is type checking.
Intermediate Code Generation:
•The intermediate code should be generated in such a way that it can be easily translated into the target machine code; for example, a = b + c * 60 might be translated into the three-address instructions t1 = c * 60, t2 = b + t1, a = t2.
Code Optimization:
•The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code results, for example code that is faster or shorter.
Code Generation:
•The code generator takes the intermediate representation as input and maps it into the target language, selecting registers or memory locations for the variables and translating each intermediate instruction into a sequence of machine instructions.
Lexical Analysis:
•The first phase of a compiler.
•The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program.
The role of the lexical analyzer:
Figure: the lexical analyzer reads characters from the source program and returns a token to the parser each time the parser asks for one via getNextToken; the parser builds the parse tree, and both components read and update the symbol table.
Lexical Analysis Versus Parsing
There are several reasons why the analysis portion of a compiler is separated into lexical analysis and parsing:
•Simplicity of design is the most important consideration.
•Compiler efficiency is improved.
•Compiler portability is enhanced.
Tokens, Patterns and Lexemes
•A token is a pair consisting of a token name and an optional attribute value.
•A pattern is a description of the form that the lexemes of a token may take.
•A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Example:
In many programming languages, the following classes cover most or all of the tokens:

TOKEN        INFORMAL DESCRIPTION (PATTERN)         SAMPLE LEXEMES
if           characters i, f                        if
else         characters e, l, s, e                  else
comparison   < or > or <= or >= or == or !=         <=, !=
Attributes for tokens
For the statement
E = M * C ** 2
the token names and associated attribute values are written below as a sequence of pairs:
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
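To make the pair notation concrete, here is a minimal C sketch (the enum and field names are illustrative assumptions, not a standard interface) of how such <token-name, attribute-value> pairs could be represented and how the token stream for E = M * C ** 2 would look:

    #include <stdio.h>

    /* Token names for the example E = M * C ** 2 (names are illustrative). */
    enum TokenName { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER };

    /* A token is a pair: a token name plus an optional attribute value.
       For an id the attribute is a pointer into the symbol table;
       for a number it is the integer value itself. */
    struct Token {
        enum TokenName name;
        union {
            void *symtab_entry;   /* attribute for ID: symbol-table pointer */
            int   int_value;      /* attribute for NUMBER: its value        */
        } attr;
    };

    int main(void) {
        /* The token stream for E = M * C ** 2, with NULL standing in for
           real symbol-table entries in this sketch. */
        struct Token stream[] = {
            { ID,        { .symtab_entry = NULL } },  /* E */
            { ASSIGN_OP, { 0 } },
            { ID,        { .symtab_entry = NULL } },  /* M */
            { MULT_OP,   { 0 } },
            { ID,        { .symtab_entry = NULL } },  /* C */
            { EXP_OP,    { 0 } },
            { NUMBER,    { .int_value = 2 } },
        };
        printf("%d tokens\n", (int)(sizeof stream / sizeof stream[0]));
        return 0;
    }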
Lexical errors
Some errors are beyond the power of the lexical analyzer to recognize, for example:
fi (a == b) …
The lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier, so it simply returns it as an id and leaves the error to a later phase.
However, it may be able to recognize errors like:
d = 2r
Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
•Panic mode: successive characters are ignored until we reach a well-formed token.
Other possible recovery actions:
•Delete one character from the remaining input.
•Insert a missing character into the remaining input.
•Replace a character by another character.
•Transpose two adjacent characters.
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Regular expressions are a means for specifying regular languages.
Example:
letter (letter | digit)*
Regular expressions
•Ɛ is a regular expression, L(Ɛ) = {Ɛ}
•If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
•(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
•(r)(s) is a regular expression denoting the language L(r)L(s)
•(r)* is a regular expression denoting (L(r))*
•(r) is a regular expression denoting L(r)
Regular definitions
A regular definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
…
dn -> rn
Example:
letter -> A | B | … | Z | a | b | … | z
digit  -> 0 | 1 | … | 9
id     -> letter (letter | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
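As a small operational check of what this definition describes, the following hand-written C sketch (the function name is made up for illustration; a generated lexer would not look like this) tests whether a whole string matches letter_ (letter_ | digit)*:

    #include <ctype.h>
    #include <stdio.h>

    /* Returns 1 if s is a non-empty string matching letter_ (letter_ | digit)*. */
    int is_identifier(const char *s) {
        if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
            return 0;                              /* must start with a letter or '_' */
        for (int i = 1; s[i] != '\0'; i++)         /* rest: letters, digits, or '_' */
            if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
                return 0;
        return 1;
    }

    int main(void) {
        /* prints: 1 1 0  (rate and _tmp1 are identifiers, 2r is not) */
        printf("%d %d %d\n", is_identifier("rate"), is_identifier("_tmp1"), is_identifier("2r"));
        return 0;
    }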
Recognition of tokens
The starting point is the language grammar, which tells us what the tokens are:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
If     -> if
Then   -> then
Else   -> else
Relop  -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
Transition diagrams
Transition diagram for relop
Relop -> < | > | <= | >= | = | <>
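In code, a transition diagram is usually simulated by reading characters and branching on them. The C sketch below (the state handling is collapsed into nested tests, and the RelopToken names are assumptions made for this example) mimics the relop diagram, including the retraction of the extra lookahead character on the < and > branches:

    #include <stdio.h>

    enum RelopToken { LT, LE, NE, EQ, GT, GE, NOT_RELOP };

    /* Simulate the relop transition diagram on s starting at *pos.
       On success, *pos is advanced just past the relop lexeme. */
    enum RelopToken relop(const char *s, int *pos) {
        int i = *pos;
        switch (s[i]) {
        case '<':
            i++;
            if (s[i] == '=')      { *pos = i + 1; return LE; }
            else if (s[i] == '>') { *pos = i + 1; return NE; }
            else                  { *pos = i;     return LT; }  /* retract lookahead */
        case '=':
            *pos = i + 1; return EQ;
        case '>':
            i++;
            if (s[i] == '=')      { *pos = i + 1; return GE; }
            else                  { *pos = i;     return GT; }  /* retract lookahead */
        default:
            return NOT_RELOP;
        }
    }

    int main(void) {
        int pos = 0;
        /* "<=" is recognized as LE (code 1 with this enum ordering) */
        printf("%d\n", relop("<=", &pos));
        return 0;
    }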
Transition diagrams (cont.)
Transition diagram for reserved words and
identifiers
id -> letter (letter|digit)*
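Reserved words like if, then, and else match the same pattern as identifiers, so one common implementation strategy (sketched below in C; the table contents and function name are assumptions) is to pre-install the reserved words in a table and, after a lexeme of identifier shape has been read, look it up to decide whether to return a keyword token or id:

    #include <stdio.h>
    #include <string.h>

    /* A tiny reserved-word table; a real compiler would keep this in the symbol table. */
    static const char *reserved[] = { "if", "then", "else" };

    /* Returns the keyword name if the lexeme is reserved, otherwise "id". */
    const char *token_for(const char *lexeme) {
        for (size_t k = 0; k < sizeof reserved / sizeof reserved[0]; k++)
            if (strcmp(lexeme, reserved[k]) == 0)
                return reserved[k];   /* reserved word: return its own token */
        return "id";                  /* ordinary identifier */
    }

    int main(void) {
        printf("%s %s\n", token_for("else"), token_for("rate"));  /* prints: else id */
        return 0;
    }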
Transition diagrams (cont.)
Transition diagram for unsigned numbers
number -> digits (. digits)? (E [+-]? digits)?
Transition diagrams (cont.)
Transition diagram for whitespace
ws -> (blank | tab | newline)+
Lexical Analyzer Generator - Lex
Lexical analyzer with Lex:
Figure: the Lex compiler translates the Lex source program into a C program lex.yy.c, which the C compiler turns into an executable a.out, the lexical analyzer.
Structure of Lex programs:
A Lex program has the following form:

    declarations
    %%
    translation rules
    %%
    auxiliary functions
• The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.
• The translation rules each have the form
  Pattern { Action }
• The pattern is a regular expression.
• The action is a fragment of code written in C.
• The third section holds whatever additional functions are used in the actions.
• Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
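Putting the three sections together, a minimal Lex specification might look like the following sketch (the token codes and the choice of patterns are assumptions made for illustration; they are not dictated by Lex):

    %{
    /* declarations section: C code copied verbatim into lex.yy.c */
    #include <stdio.h>
    #define IF      256   /* manifest constants naming the tokens */
    #define ID      257
    #define NUMBER  258
    #define RELOP   259
    %}

    delim   [ \t\n]
    ws      {delim}+
    letter  [A-Za-z_]
    digit   [0-9]
    id      {letter}({letter}|{digit})*
    number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

    %%
    {ws}      { /* skip whitespace: no token returned */ }
    if        { return IF; }
    {id}      { return ID; }
    {number}  { return NUMBER; }
    "<"|"<="|"="|"<>"|">"|">="   { return RELOP; }
    %%
    /* auxiliary functions used by the generated analyzer */
    int yywrap(void) { return 1; }

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0)          /* yylex() returns 0 at end of input */
            printf("token %d, lexeme %s\n", tok, yytext);
        return 0;
    }

Running lex (or flex) on this file produces lex.yy.c, which is then compiled with a C compiler as described above.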
Conflict Resolution in Lex
There are two rules that Lex uses to decide on the proper lexeme to select, when several prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
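The tiny, self-contained Lex sketch below (the printed messages are arbitrary, chosen only for this illustration) shows both rules in action: on the input if ifx <= it reports IF, then ID ifx (the longest prefix wins over the keyword pattern), then LE (the two-character prefix wins over <):

    %{
    #include <stdio.h>
    %}
    %%
    if                      { printf("IF\n"); /* listed first: wins for "if" */ }
    [A-Za-z_][A-Za-z_0-9]*  { printf("ID %s\n", yytext); /* "ifx" -> one ID */ }
    "<="                    { printf("LE\n"); /* longest prefix beats "<" */ }
    "<"                     { printf("LT\n"); }
    .|\n                    { /* ignore anything else */ }
    %%
    int yywrap(void) { return 1; }
    int main(void) { yylex(); return 0; }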
The Lookahead Operator
Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so only the lexeme itself is consumed from the input.
Sometimes, however, we want a pattern to match only when it is followed by certain other text; for this Lex provides the lookahead operator /.
What follows / is an additional pattern that must be matched before we can decide that the token in question was seen, but what matches this second pattern is not part of the lexeme.
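A classic sketch of the lookahead operator (assuming a Fortran-like language in which IF may also begin an ordinary identifier; the patterns and messages are illustrative only):

    %{
    #include <stdio.h>
    %}
    letter  [A-Za-z]
    %%
    IF/\(.*\){letter}     { printf("keyword IF\n");
                            /* the "(...)" plus letter must follow,
                               but is not part of the IF lexeme */ }
    [A-Za-z][A-Za-z0-9]*  { printf("ID %s\n", yytext); }
    .|\n                  { /* skip everything else */ }
    %%
    int yywrap(void) { return 1; }
    int main(void) { yylex(); return 0; }

On a line like IF(A)THEN the first rule fires and only IF is consumed as the lexeme; on IF(I,J) = 3 the lookahead pattern fails (no letter follows the right parenthesis), so IF is returned as an ordinary identifier.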