Chapter 2 - Lexical Analysis
Describe the specification of tokens: Strings and Languages, Operations on Languages, Regular Expressions, Regular Definitions, and Extensions of Regular Expressions.
Describe the generation of tokens: Transition Diagrams, Recognition of Reserved Words and Identifiers, and Completion of the Running Example.
Describe the basics of automata: Nondeterministic Finite Automata (NFA) vs. Deterministic Finite Automata (DFA), and conversion of an NFA to a DFA.
Introduction
The lexical analyzer is the first phase of compilation.
It takes a text file (the input program), which is a stream of characters, and converts it into a stream of tokens, which are logical units, each representing one or more characters that belong together.
The main role of the lexical analyzer is to read a sequence of characters from the source program and produce tokens
to be used by the parser.
Reasons for separating lexical analysis from syntax analysis (parsing) include the following:
1. Simplicity of design: - If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
2. Compiler efficiency is improved: - A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced: - Input-device-specific peculiarities can be restricted to the lexical analyzer.
A token can be represented in the form <token-name, attribute-value>, which is passed on to the subsequent phase, syntax analysis. Here, token-name is an abstract symbol that is used during syntax analysis, and attribute-value (which is optional for some tokens) points to an entry in the symbol table for this token.
A pattern is a rule describing the set of lexemes that can represent a particular token. Patterns are usually specified using regular expressions, e.g., [a-zA-Z]*, whose matched lexemes include a, ab, count, …
In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
Example 2: The following table shows some tokens and their lexemes in Pascal (a high-level, case-insensitive programming language).
Attributes of tokens
When more than one pattern matches a lexeme, the scanner must provide additional information
about the particular lexeme to the subsequent phases of the compiler. For example, both 0 and 1
match the pattern for the token num. But the code generator needs to know which number is
recognized.
The lexical analyzer collects information about tokens into their associated attributes.
Tokens influence parsing decisions; the attributes influence the translation of tokens. In practice, a token usually has a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. The symbol-table entry contains various information about the token, such as the lexeme, the line number in which it was first seen, …
Example 3. x = y + 2
The tokens and their attributes are written as:
<id, pointer to symbol-table entry for x>
<assign_op, >
<id, pointer to symbol-table entry for y>
<plus_op, >
<num, integer value 2>
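The pair notation above can be mirrored directly in code. The following is a minimal C sketch, assuming hypothetical names ID, ASSIGN_OP, PLUS_OP, and NUM for the token names and hard-coded symbol-table indexes; a real lexical analyzer would fill the symbol table itself.

#include <stdio.h>

/* Hypothetical token names for this sketch. */
enum token_name { ID, ASSIGN_OP, PLUS_OP, NUM };

/* A token pairs an abstract token name with an optional attribute value:
   here, an index into the symbol table (for ID) or a literal value (for NUM). */
struct token {
    enum token_name name;
    int attribute;   /* symbol-table index for ID, integer value for NUM, unused otherwise */
};

int main(void) {
    /* Token stream for the statement  x = y + 2,
       assuming x sits at symbol-table entry 0 and y at entry 1. */
    struct token stream[] = {
        { ID,        0 },   /* <id, pointer to symbol-table entry for x> */
        { ASSIGN_OP, 0 },   /* <assign_op, >                             */
        { ID,        1 },   /* <id, pointer to symbol-table entry for y> */
        { PLUS_OP,   0 },   /* <plus_op, >                               */
        { NUM,       2 }    /* <num, integer value 2>                    */
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("<%d, %d>\n", (int)stream[i].name, stream[i].attribute);
    return 0;
}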
Errors
Very few errors are detected by the lexical analyzer. For example, if the programmer mistypes a keyword such as while (writing, say, whle), the lexical analyzer cannot detect the error, since it will simply treat whle as an identifier.
Nonetheless, if a certain sequence of characters follows none of the specified patterns, the lexical analyzer
can detect the error.
For example, if a character such as ~ or ? appears in the source program and no pattern contains that symbol, the lexical analyzer can report an error.
Besides, a lexeme whose length exceeds the bound specified by the language is a lexical error, and unterminated strings or comments are also errors detected during lexical analysis.
Input Buffering
There are ways in which the task of reading the source program can be sped up.
This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme.
In the C language, we need to look at the character after -, =, or < to decide what token to return, because each of these could be the beginning of a two-character operator such as ->, ==, or <=.
A common scheme uses two buffers that are alternately reloaded.
Each buffer is of the same size N, where N is usually the size of a disk block, e.g., 4096 bytes.
Using one system read command we can read N characters into a buffer, rather than using one system call per character.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
Two pointers to the input are maintained:
Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end (which may involve retracting).
Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Sentinels / eof
If we use the previous scheme, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read.
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
Note that eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end.
The corresponding figure (omitted here) shows the same two-buffer arrangement as before, but with the sentinels added.
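The sentinel scheme can be sketched in C as follows. This is only an illustration of the idea, not production code: the names fill_buffer and advance are hypothetical helpers, and '\0' stands in for the special eof character.

#include <stdio.h>

#define N 4096            /* buffer size, typically one disk block   */
#define SENTINEL '\0'     /* stands in for the special eof character */

static char buf1[N + 1], buf2[N + 1];   /* one extra slot in each buffer for the sentinel */
static char *forward;                   /* the scanning pointer                            */

/* Hypothetical helper: refill one buffer from the source file and terminate it
   with the sentinel; returns a pointer to the first character of the buffer.   */
static char *fill_buffer(char *buf, FILE *src) {
    size_t n = fread(buf, 1, N, src);
    buf[n] = SENTINEL;    /* end of buffer, or end of input if n < N */
    return buf;
}

/* Advance forward by one character.  The end-of-buffer test is combined with the
   test for the current character, so the common case needs only one comparison.
   Before the first call, initialize with:  forward = fill_buffer(buf1, src);    */
static int advance(FILE *src) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward - 1 == buf1 + N)        /* sentinel at end of buffer 1: reload buffer 2 */
            forward = fill_buffer(buf2, src);
        else if (forward - 1 == buf2 + N)   /* sentinel at end of buffer 2: reload buffer 1 */
            forward = fill_buffer(buf1, src);
        else
            return EOF;                     /* sentinel elsewhere: the input really is at an end */
        c = *forward++;
        if (c == SENTINEL) return EOF;      /* input ended exactly on a buffer boundary */
    }
    return (unsigned char)c;
}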
Regular expressions are a means for specifying regular languages (patterns of tokens).
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string S, usually written |S|, is the number of occurrences of symbols in S.
4. The proper prefixes, suffixes, and substrings of a string S are those prefixes, suffixes, and substrings, respectively, of S that are neither Ɛ nor S itself.
5. A subsequence of S is any string formed by deleting zero or more symbols from not necessarily consecutive positions of S.
🖙 e.g. baan is a subsequence of banana.
Let L be the set of letters {A, B , . . . , Z , a, b, . . . , z} and let D be the set of digits {0, 1, . . . 9}. We may think of L and D in two,
essentially equivalent, ways.
One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of
digits.
The second way is that L and D are languages, all of whose strings happen to be of length one.
Here are some other languages that can be constructed from the languages L and D, using the operators union, concatenation, and closure:
1. L U D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L⁴ is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including Ɛ, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expressions
Each regular expression is a pattern specifying the form of strings. The regular expressions are built recursively out of smaller regular expressions, using the following two basis rules:
R1: - Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is, the language whose sole member is the empty string.
R2: - If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
Note: - By convention, we use italics for symbols, and boldface for their corresponding regular expression.
Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions.
There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively. Then:
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.
Regular expressions often contain unnecessary pairs of parentheses.
We may drop certain pairs of parentheses if we adopt the conventions that:
a) the unary operator * has the highest precedence and is left associative,
b) concatenation has the second highest precedence and is left associative, and
c) | has the lowest precedence and is left associative.
Under these conventions, for example, (a)|((b)*(c)) may be written as a|b*c.
There are a number of algebraic laws for regular expressions; each law asserts that expressions of two different forms are equivalent.
Table: The algebraic laws that hold for arbitrary regular expressions r, s, and t.
LAW : DESCRIPTION
r|s = s|r : | is commutative
r|(s|t) = (r|s)|t : | is associative
r(st) = (rs)t : concatenation is associative
r(s|t) = rs|rt and (s|t)r = sr|tr : concatenation distributes over |
Ɛr = rƐ = r : Ɛ is the identity for concatenation
r* = (r|Ɛ)* : Ɛ is guaranteed in a closure
r** = r* : * is idempotent
Regular Definitions
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
Each ri is a regular expression over the alphabet ∑ U {d1, d2, …, di-1}.
Example 4.
(A) Regular definition for Java identifiers
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
id → letter_ (letter_ | digit)*
(B) Regular definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optionalFraction → . digits | Ɛ
optionalExponent → (E (+ | - | Ɛ) digits) | Ɛ
number → digits optionalFraction optionalExponent
The regular definition is a precise specification for this set of strings.
That is, an optionalFraction is either a decimal point (dot) followed by one or more digits, or it is missing (the empty string).
An optionalExponent, if not missing, is the letter E followed by an optional + or - sign, followed by one or more digits. Note that at least one digit must follow the dot, so number does not match 1., but does match 1.0.
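As an illustration of exactly what this definition specifies, here is a small hand-written C sketch (not generated by any tool) that checks whether a whole string matches number; the function name match_number is made up for this example.

#include <ctype.h>
#include <stdio.h>

/* Sketch of a checker for  number -> digits optionalFraction optionalExponent.
   Returns 1 if the whole string s matches, 0 otherwise.                        */
static int match_number(const char *s) {
    const char *p = s;
    if (!isdigit((unsigned char)*p)) return 0;          /* digits: at least one digit        */
    while (isdigit((unsigned char)*p)) p++;
    if (*p == '.') {                                    /* optionalFraction: . digits | Ɛ    */
        p++;
        if (!isdigit((unsigned char)*p)) return 0;      /* at least one digit after the dot  */
        while (isdigit((unsigned char)*p)) p++;
    }
    if (*p == 'E') {                                    /* optionalExponent: (E(+|-|Ɛ)digits) | Ɛ */
        p++;
        if (*p == '+' || *p == '-') p++;
        if (!isdigit((unsigned char)*p)) return 0;
        while (isdigit((unsigned char)*p)) p++;
    }
    return *p == '\0';                                  /* the whole string must be consumed */
}

int main(void) {
    const char *tests[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "1.", "1.0" };
    for (int i = 0; i < 6; i++)
        printf("%-8s -> %s\n", tests[i], match_number(tests[i]) ? "number" : "no match");
    return 0;
}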
Extensions of regular expressions include the + operator (one or more occurrences), the ? operator (zero or one occurrence), and character classes.
Example:
[abc] is shorthand for a|b|c, and
[a-z] is shorthand for a|b|c|···|z.
Using the above shorthands, we can rewrite the regular definitions of Example 4 (A) and (B) as follows.
Examples
A. Regular definition for Java identifiers
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ (letter_ | digit)*
B. Regular definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
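A hand-written check for the id definition might look like the following C sketch; match_id is a hypothetical name, and the check mirrors letter_ (letter_ | digit)* directly.

#include <ctype.h>
#include <stdio.h>

/* Sketch of a checker for  id -> letter_ (letter_ | digit)*
   where letter_ is [A-Za-z_] and digit is [0-9].             */
static int match_id(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))      /* must start with a letter_      */
        return 0;
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))  /* remaining: letter_ or digit    */
            return 0;
    return 1;
}

int main(void) {
    printf("%d %d %d\n", match_id("count"), match_id("_tmp1"), match_id("2fast"));  /* 1 1 0 */
    return 0;
}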
Recognition of Tokens
Deals with how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
1. The starting point is the language grammar, to understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | Ɛ
expr → term relop term
     | term
term → id
     | number
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.
Transition Diagram
Transition diagrams have a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that matches
one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by other symbols as well).
If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads. We shall assume that all our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Example: consider a transition diagram for the token relop (the relational operators). We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1 and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator.
If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return
an indication that the not-equals operator has been found.
On any other character, the lexeme is <, and we enter state 4 to return that information. Note,
however, that state 4 has a * to indicate that we must retract the input one position.
If in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5.
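The relop states described above translate almost mechanically into code. The sketch below is one possible rendering in C; next_char, retract, and the attribute constants are stand-ins invented for this example, not part of any particular library.

#include <stdio.h>

/* Attribute values for the relop token (symbolic constants assumed for this sketch). */
enum relop_attr { NONE = -1, LT, LE, NE, EQ };

/* Tiny stand-ins for the buffer machinery: scan a fixed string. */
static const char *input = "<=";
static int pos = 0;
static int  next_char(void) { return (unsigned char)input[pos++]; }
static void retract(void)   { if (pos > 0) pos--; }   /* the * (retract) action */

/* Follows the transition diagram for relop, restricted to the cases discussed above. */
static int get_relop(void) {
    int c = next_char();           /* state 0, the start state       */
    if (c == '<') {                /* state 1                        */
        c = next_char();
        if (c == '=') return LE;   /* state 2: lexeme is <=          */
        if (c == '>') return NE;   /* state 3: lexeme is <>          */
        retract();                 /* state 4*: retract; lexeme is < */
        return LT;
    }
    if (c == '=') return EQ;       /* state 5: lexeme is =           */
    retract();                     /* no relop lexeme here           */
    return NONE;
}

int main(void) {
    printf("attribute = %d\n", get_relop());   /* prints 1 (LE) for the input "<=" */
    return 0;
}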
Fig d. Transition diagram for whitespace, where delim represents one or more whitespace characters.
In this section, we introduce a tool called Lex, or in a more recent implementation Flex, that allows one to specify a lexical analyzer by giving regular expressions that describe the patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
Finite Automata
Lexical-analyzer generators such as Lex use finite automata, which are at the heart of the transition from regular-expression patterns to a lexical analyzer. Finite automata are essentially graphs, like transition diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
Nondeterministic finite automata (NFA) have no restrictions on the labels of
their edges. A symbol can label several edges out of the same state, and Ɛ, the empty string, is a
possible label.
Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
Both deterministic and nondeterministic finite automata are capable of recognizing the same languages.
In fact, these are exactly the regular languages, the same languages that regular expressions can describe.
Finite Automata State Graphs
Figure: a finite-automaton state graph with transitions labeled 0 and 1 (graph omitted).
Q: Check that “1110” is accepted but “110…” is not.
An NFA (nondeterministic finite automaton) consists of:
1. A finite set of states S.
2. A set of input symbols ∑, the input alphabet; we assume that Ɛ, which stands for the empty string, is never a member of ∑.
3. A transition function that gives, for each state and for each symbol in ∑ U {Ɛ}, a set of next states.
4. A state s0 from S that is distinguished as the start state (or initial state).
5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the
transition function.
o There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a.
This graph is very much like a transition diagram, except:
o The same symbol can label edges from one state to several different states, and
o An edge may be labeled by Ɛ instead of, or in addition to, symbols from the input alphabet.
Example: consider the input a b a.
Rule: an NFA accepts an input string if it can get into a final state after reading the whole string.
Transition Tables
• We can also represent an NFA by a transition table, whose rows correspond to states, and whose
columns correspond to the input symbols and Ɛ.
The entry for a given state and input is the value of the transition function applied to those arguments.
If the transition function has no information about that state-input pair, we put ɸ in the table for the pair.
Example: - The transition table for the NFA of the previous state graph is represented in this form.
The transition table has the advantage that we can easily find the transitions on a given state and input. Its disadvantage is that it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
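Since the state graph and its table are not reproduced here, the following C sketch uses an assumed small NFA (a textbook-style NFA for (a|b)*abb with states 0..3, start state 0, accepting state 3, and no Ɛ-moves) purely to illustrate how a transition table whose entries are sets of states can be stored and simulated.

#include <stdio.h>

/* Assumed illustrative NFA for (a|b)*abb: states 0..3, start 0, accepting 3.
   Each table entry is a SET of next states, encoded as a bitmask (bit i = state i);
   0 plays the role of the empty-set entry written as ɸ in the table.               */
#define NSTATES 4
enum { SYM_A, SYM_B, NSYMS };                 /* input alphabet {a, b}; this NFA has no Ɛ-moves */

static const unsigned delta[NSTATES][NSYMS] = {
    /* on a            on b   */
    { 1u<<0 | 1u<<1,   1u<<0 },               /* state 0: a -> {0,1},  b -> {0} */
    { 0,               1u<<2 },               /* state 1: a -> ɸ,      b -> {2} */
    { 0,               1u<<3 },               /* state 2: a -> ɸ,      b -> {3} */
    { 0,               0     }                /* state 3: accepting, no moves   */
};

/* The NFA accepts if, after reading the whole input, SOME reachable state is accepting. */
static int nfa_accepts(const char *input) {
    unsigned current = 1u << 0;                          /* start in the set {0} */
    for (; *input; input++) {
        int sym = (*input == 'a') ? SYM_A : SYM_B;
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s)) next |= delta[s][sym];
        current = next;
    }
    return (current & (1u << 3)) != 0;                   /* is accepting state 3 reachable? */
}

int main(void) {
    printf("abb -> %s\n", nfa_accepts("abb") ? "accepted" : "rejected");
    printf("aba -> %s\n", nfa_accepts("aba") ? "accepted" : "rejected");
    return 0;
}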
o In a DFA, each entry of the transition table is a single state; we may therefore represent this state without the curly braces that we use to form sets.
While the NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings.
A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where
o Q is a finite set of states,
o ∑ is a finite set called the alphabet,
o δ is a total function from (Q x ∑) to Q known as the transition function (a function that takes a state and a symbol as inputs and returns a state),
o q0, an element of Q, is the start state, and
o F is a subset of Q called the final (accepting) states.
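Because δ is a total function, simulating a DFA is a single loop. The sketch below is illustrative only: it assumes a small DFA over {a, b} that accepts strings ending in abb (essentially the DFA one would derive from the NFA sketched earlier), not the example that follows.

#include <stdio.h>

/* Assumed illustrative DFA M = (Q, ∑, δ, q0, F) over {a, b} that accepts
   strings ending in abb: Q = {0,1,2,3}, q0 = 0, F = {3}.                 */
#define NSTATES 4
enum { SYM_A, SYM_B, NSYMS };

static const int delta[NSTATES][NSYMS] = {
    /* a  b */
    {  1, 0 },   /* state 0 */
    {  1, 2 },   /* state 1 */
    {  1, 3 },   /* state 2 */
    {  1, 0 }    /* state 3 (accepting) */
};
static const int accepting[NSTATES] = { 0, 0, 0, 1 };

/* δ is total: exactly one next state per (state, symbol) pair, so no sets are needed. */
static int dfa_accepts(const char *input) {
    int q = 0;                                          /* q0, the start state */
    for (; *input; input++)
        q = delta[q][(*input == 'a') ? SYM_A : SYM_B];
    return accepting[q];
}

int main(void) {
    printf("abb   -> %s\n", dfa_accepts("abb")   ? "accepted" : "rejected");
    printf("ababb -> %s\n", dfa_accepts("ababb") ? "accepted" : "rejected");
    printf("abab  -> %s\n", dfa_accepts("abab")  ? "accepted" : "rejected");
    return 0;
}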
Example: A DFA that can accept the strings which begin with a or b, or begin with c and contain at most one a.
It is possible that the number of DFA states is exponential in the number of NFA states,
which could lead to difficulties when we try to implement this DFA. However, part of the power of the
automaton-based approach to lexical analysis is that for real languages, the NFA and DFA have
approximately the same number of states, and the exponential behavior is not seen.
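The conversion of an NFA to a DFA (the subset construction) makes each DFA state a set of NFA states, which is why the number of DFA states can in principle be exponential. A compact C sketch, assuming the same small Ɛ-free NFA for (a|b)*abb used above (so no Ɛ-closure step is needed), is shown below.

#include <stdio.h>

/* Assumed Ɛ-free NFA for (a|b)*abb: states 0..3, start 0, accepting 3.
   Sets of NFA states are encoded as bitmasks; in the worst case there are 2^n of them. */
#define NNFA 4
enum { SYM_A, SYM_B, NSYMS };
static const unsigned nfa[NNFA][NSYMS] = {
    { 1u<<0 | 1u<<1, 1u<<0 },
    { 0,             1u<<2 },
    { 0,             1u<<3 },
    { 0,             0     }
};

#define MAXDFA (1u << NNFA)
static unsigned dstates[MAXDFA];          /* DFA states discovered so far (as NFA-state sets) */
static int      dtran[MAXDFA][NSYMS];     /* DFA transition table                             */
static int      ndfa;                     /* number of DFA states                             */

static unsigned move(unsigned set, int sym) {        /* union of NFA moves from a state set */
    unsigned out = 0;
    for (int s = 0; s < NNFA; s++)
        if (set & (1u << s)) out |= nfa[s][sym];
    return out;
}

static int lookup_or_add(unsigned set) {             /* index of this set, adding it if new */
    for (int i = 0; i < ndfa; i++)
        if (dstates[i] == set) return i;
    dstates[ndfa] = set;
    return ndfa++;
}

int main(void) {
    lookup_or_add(1u << 0);                          /* DFA start state = {0} */
    for (int i = 0; i < ndfa; i++)                   /* process each DFA state as it is discovered */
        for (int sym = 0; sym < NSYMS; sym++)
            dtran[i][sym] = lookup_or_add(move(dstates[i], sym));
    for (int i = 0; i < ndfa; i++) {
        printf("D%d = {", i);
        for (int s = 0; s < NNFA; s++)
            if (dstates[i] & (1u << s)) printf(" %d", s);
        printf(" }: a -> D%d, b -> D%d%s\n",
               dtran[i][SYM_A], dtran[i][SYM_B],
               (dstates[i] & (1u << 3)) ? "  (accepting)" : "");
    }
    return 0;
}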
Review Exercises
Note: attempt all questions individually.
Submit your answers to [email protected]
1. Consult the language reference manuals to determine:
A. The sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments),
B. The lexical form of numerical constants, and
C. The lexical form of identifiers, for the C++ and Java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b)*a
B. ((Ɛ|a)b*)*
C. (a|b)*a(a|b)(a|b)
D. a*ba*ba*ba*
E. (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
3. Write regular definitions for the following languages:
A. All strings of lowercase letters that contain the five vowels in order.
B. All strings of lowercase letters in which the letters are in ascending lexicographic order.
C. All strings of binary digits with no repeated digits.
D. All strings of binary digits with at most one repeated digit.
E. All strings of a's and b's where every a is preceded by b.
F. All strings of a's and b's that contain the substring abab.
4. Design finite automata (deterministic or nondeterministic) for each of the languages of question 3.
5. Give the transition tables for the following NFA