Module 3
Compiler Design
The objective of this module is to get a clear understanding of the lexical phase and to learn about defining patterns using regular expressions.
This module discusses the functionalities of the lexical phase of the compiler: why we need a lexical phase, what it does, and how to do it efficiently. Among other duties, the lexical phase:
• Removes whitespace from the source program
• Groups the input characters into lexemes and produces the corresponding tokens (tokenizing, discussed below)
The lexical analyser does not have to be an individual phase, but having a separate phase simplifies the design and improves efficiency and portability. With these things in mind, and in order to perform these functions efficiently, the lexical analyser is not assigned a single pass of its own; rather, the lexical and the syntax analyser together function as one pass, as shown in Figure 3.1.
From Figure 3.1, it can be seen that the lexer and the parser are combined into one pass: the parser issues a nextToken() request, and the lexer returns the next token to the parser (a minimal sketch of this interaction follows the list below). This grouping of the phases helps in achieving the following:
• Enhances modularity and supports portability: the same analyser-parser combination can be used for various high-level languages across platforms, since the analysis phase is target independent.
• Reusability
• Efficiency
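The sketch below (a minimal Python illustration, not part of the module's figures) shows this one-pass arrangement: the parser repeatedly requests the next token, and the lexer scans just far enough to supply it. The class and method names here are assumptions for illustration.

from collections import namedtuple

Token = namedtuple("Token", ["kind", "lexeme"])  # a (class, value) pair

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        # Skip whitespace: one of the lexer's jobs.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return Token("eof", "")
        ch = self.text[self.pos]
        if ch.isalpha():
            # Consume a maximal run of letters/digits as an identifier.
            start = self.pos
            while self.pos < len(self.text) and self.text[self.pos].isalnum():
                self.pos += 1
            return Token("identifier", self.text[start:self.pos])
        self.pos += 1
        return Token("operator", ch)

# The "parser" simply pulls tokens on demand, as in Figure 3.1:
lexer = Lexer("pi = r")
token = lexer.next_token()
while token.kind != "eof":
    print(token)          # e.g. Token(kind='identifier', lexeme='pi')
    token = lexer.next_token()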
3.2 Definitions
Let us look at some basic definitions pertaining to the lexical phase of the compiler. A token is a group of characters having a collective meaning. A lexeme is a particular instance of a token. For example, consider the variable "pi". A token can be thought of as a name given to a lexeme: for this variable the token is "identifier" and the corresponding lexeme is "pi". Similarly, for the string "if", the lexeme is "if" and the token is "keyword". Lexemes are typically described by defining a pattern for them. The set of rules describing how a token can be formed is defined as its pattern. So, defining patterns for every possible lexeme is one of the main jobs of this phase of the compiler. For example, for the token identifier the pattern is ([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*. This pattern is a regular expression. The operators "[" and "]" indicate that any one character in the enclosed range may be used. The operator "|" is the union operator: it indicates either the alternative on its left or the one on its right. The "*" is the Kleene closure operator, which indicates zero or more occurrences of the expression it is applied to. The implicit operator here is the concatenation operator ".", which indicates a sequence of symbols occurring one after the other. We will discuss more on defining regular expressions in the subsequent sections.
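As a small illustration (a sketch, not from the module; Python's re library is used here only to exercise the pattern), the identifier pattern above can be tested directly:

import re

identifier = re.compile(r"([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*")

for lexeme in ["pi", "x1", "Rate2", "9lives"]:
    m = identifier.fullmatch(lexeme)
    print(lexeme, "->", "identifier" if m else "no match")
# "9lives" fails because the pattern requires the first symbol to be a letter.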
3.2.1 Issues
Tokenizing is the primary job of this phase of the compiler. Hence, the primary issue is how to identify tokens. Typically, tokens are identified by defining patterns, either as regular expressions or as automata. The next issue is how the lexical phase will recognize the tokens using a token specification; this indirectly poses the question of how to implement the nextToken() routine. In order to solve these issues, the first two phases of the compiler are integrated.
Formally stated: given a set of token descriptions, each in terms of a token name and a regular expression defining the pattern for its lexemes, the lexical phase accepts an input string and partitions it into tokens, each reported as a (class, value) pair. In choosing the lexeme, there could be two matching prefixes belonging to two different tokens. For example, consider the operator "**" for exponentiation. Will the compiler consider the input as a multiplication operator on seeing the first "*" and then log the second "*" as an error? No: it considers it as an exponentiation operator, because it tries to match the longest matching prefix. Thus, ambiguity encountered in this phase is typically resolved by choosing the longest matching token and, between two tokens of equal length, selecting the first one listed. Table 3.1 lists some examples of tokens and their corresponding example lexemes.
Token        Description                  Sample lexemes
if           the characters i, f          if
E=M*C**2 (3.1)
For the statement in (3.1), the lexemes are “E”, “=”, “M”, “*”, “C”, “**”, “2”.
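The following minimal scanner (an illustrative sketch; the token names are assumptions, not the module's) shows how longest-prefix matching tokenizes statement (3.1), treating "**" as one exponentiation token rather than two multiplications:

import re

# Order matters only for equal-length matches: the first listed pattern wins.
token_spec = [
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number",     r"[0-9]+"),
    ("exp_op",     r"\*\*"),   # two-character operator
    ("mul_op",     r"\*"),     # one-character prefix of "**"
    ("assign",     r"="),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best = None
        for kind, pattern in token_spec:
            m = re.match(pattern, text[pos:])
            # Keep the longest match; on a tie, keep the first one found.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (kind, m.group())
        kind, lexeme = best
        yield (kind, lexeme)
        pos += len(lexeme)

print(list(tokenize("E=M*C**2")))
# [('identifier', 'E'), ('assign', '='), ('identifier', 'M'),
#  ('mul_op', '*'), ('identifier', 'C'), ('exp_op', '**'), ('number', '2')]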
3.2.4 Patterns: The lexical phase decides the token based on defined patterns. Let us look at how to define patterns using regular expressions. Table 3.2 summarizes the basic operators of a regular expression.
R+           Positive closure operator: one or more occurrences of R. This is like the Kleene closure, except that it does not include the empty string ε.
Regular expressions are used to define the patterns using various regular expression operators. As far as the operators are concerned, * has the highest precedence, followed by concatenation ".", and "|" has the lowest precedence. Consider the example of defining Pascal language identifiers. An identifier starts with a letter, followed by zero or more letters or digits:

letter (letter | digit)*        (3.2)

In statement (3.2), letter denotes any alphabetic character and digit denotes any digit between 0 and 9. The expression (letter | digit)* can be expanded as the set { ε, letter, digit, letter.digit, digit.letter, letter.letter, digit.digit, … }.
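A quick way to see the precedence rules at work (an illustrative sketch using Python's re library) is to check which strings the pattern ab*|c accepts; because * binds tighter than concatenation, which binds tighter than |, it behaves as (a(b*))|c:

import re

pattern = re.compile(r"ab*|c")
for s in ["a", "abbb", "c", "ac", "abc"]:
    print(s, bool(pattern.fullmatch(s)))
# "a", "abbb" and "c" match; "ac" and "abc" do not,
# confirming the grouping (a(b*))|c rather than (ab)*|c or a(b|c)*.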
If r and s are regular expressions denoting the languages L(r) and L(s), then r | s is also a regular expression, denoting L(r) ∪ L(s), and so is r.s, denoting L(r).L(s); likewise, r* is a regular expression denoting (L(r))*. For example, if the input alphabet is ∑ = {a, b}, then a | b denotes the language {a, b}, (a | b)(a | b) denotes {aa, ab, ba, bb}, and a* denotes the set of all strings of zero or more a's, { ε, a, aa, aaa, … }.
Regular definitions are names given to certain regular expressions, and these names are later used for constructing patterns. A regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn

where each di is a symbol that is not in the input alphabet and each ri is a regular expression over the input alphabet together with the previously defined names. To generate a regular expression, each left-hand side di is replaced by its corresponding right-hand side ri. Consider the example of defining identifiers in Pascal:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | 3 | … | 9
id → letter (letter | digit)*        (3.3)
From statement (3.3), the identifier id is typically replaced with its RHS definition, whose constituents are in turn stated as regular definitions. Another example, defining number constants, is given in statement (3.4), while definitions used for statements are given in (3.5).

digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
number → digits optionalFraction        (3.4)

term → id | number        (3.5)
id → letter (letter | digit)*
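As a sketch of how such definitions become executable patterns, substituting each name with its right-hand side as described above (the Python variable names here are assumptions for illustration):

import re

letter = r"[A-Za-z]"
digit = r"[0-9]"
digits = digit + "+"

# id -> letter (letter | digit)*                               (3.3)
id_re = f"{letter}({letter}|{digit})*"
# number -> digits optionalFraction, optionalFraction -> . digits | ε   (3.4)
number_re = f"{digits}(\\.{digits})?"
# term -> id | number                                          (3.5)
term_re = f"({id_re})|({number_re})"

term = re.compile(term_re)
for lexeme in ["pi", "3", "3.14", "x2", "2x"]:
    print(lexeme, bool(term.fullmatch(lexeme)))
# "2x" fails: it is neither an id (must start with a letter) nor a number.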
3.2.7 Automata
Automata here typically refer to deterministic finite automata (DFA). A DFA is a five-tuple (Q, ∑, δ, q0, F), where Q is a finite set of states, q0 ∈ Q is the start state, F ⊆ Q is the set of final states, and δ is a mapping from Q × ∑ to Q. The transition function states that for every state, on every input symbol, there is exactly one transition. We define the language of the automaton as

L(A) = { w ∈ ∑* | δ̂(q0, w) ∈ F }

where δ̂ is the extension of δ from single symbols to strings. The interpretation of this definition is that the string w should take you from the start state q0 to one of the final states. Hence, every string has exactly one path, and so a DFA is faster for string matching. On the other hand, an NFA is similar to a DFA but gives some flexibility. It is a five-tuple (Q, ∑, δ, q0, F), where q0 ∈ Q, F ⊆ Q, and δ is a mapping from Q × ∑ to 2^Q (the power set of Q). The transition function here indicates that for every state, on every input symbol, there could be zero or more transitions.
Since multiple paths exist because of the multiple transitions, an NFA takes more time for string matching. An ε-NFA is the same as an NFA but offers more flexibility, allowing the automaton to change state without consuming any input symbol. The definition of an ε-NFA differs from that of an NFA only in the transition function, which is defined as δ: Q × (∑ ∪ {ε}) → 2^Q. Hence, it is slower than an NFA for string matching.
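To make the "exactly one transition" property concrete, here is a minimal DFA simulation (an illustrative sketch; the particular automaton, which accepts strings over {a, b} ending in "ab", is assumed for the example). Each input symbol causes exactly one transition, so recognition is a single left-to-right scan:

# DFA over {a, b} accepting strings that end in "ab".
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
q0, F = "q0", {"q2"}

def accepts(w):
    state = q0
    for symbol in w:
        state = delta[(state, symbol)]   # exactly one next state
    return state in F                    # w is in L(A) iff we end in a final state

print(accepts("aab"))   # True
print(accepts("aba"))   # False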
Summary: This module focused on constructing regular expressions and regular definitions as a way of defining patterns for tokens, which are used by the lexical phase of the compiler.