
Chapter 2 - Lexical Analysis

Chapter Two of Compiler Design focuses on Lexical Analysis, defining key terms such as tokens, patterns, and lexemes, and explaining the role of the lexical analyzer in converting source code into tokens. It discusses the specification and generation of tokens, the basics of automata, and the importance of regular expressions in defining token patterns. Additionally, it covers error detection, input buffering techniques, and the use of sentinels to enhance efficiency in lexical analysis.


Compiler Design

Chapter Two: Lexical Analysis


The objectives of this chapter are the following:
• Define the basic terms related to the lexical analyzer (LA): Lexical Analysis versus Parsing; Tokens, Patterns, and Lexemes; Attributes for Tokens; and Lexical Errors.
• Describe the specification of tokens: Strings and Languages, Operations on Languages, Regular Expressions, Regular Definitions, and Extensions of Regular Expressions.
• Describe the generation of tokens: Transition Diagrams, Recognition of Reserved Words and Identifiers, Completion of the Running Example.
• Describe the basics of automata: Nondeterministic Finite Automata (NFA) versus Deterministic Finite Automata (DFA), and conversion of an NFA to a DFA.

6/14/2024
Introduction
 The lexical analyzer is the first phase of compilation.

 It takes a text file (the input program), which is a stream of characters, and converts it into a stream of tokens: logical units, each representing one or more characters that belong together.
 The main role of the lexical analyzer is to read a sequence of characters from the source program and produce tokens
to be used by the parser.


The scanner can also perform the following secondary tasks:
 Lexical analyzer removes comments from source program.
 Lexical analyzer removes whitespace such as blank, newline, tab, and perhaps other characters that are used
to separate tokens in the input.
 Correlating error messages generated by the compiler with the source program.
 Keep track of line numbers (that could be used as a reference during error reporting)
 Lexical analyzer generates a symbol table that stores information about identifiers

Lexical Analysis Vs Parsing


There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis
and parsing (syntax analysis) phases:
1. Simplicity of Design: - is the most important consideration, and it often allows us to simplify
compilation tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would
be considerably more complex than one that can assume comments and whitespace have already been removed by
the lexical analyzer.
Compiler Design

 If we are designing a new language, separating lexical and syntactic concerns can lead to a
cleaner overall language design.
2. Compiler efficiency is improved: - A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing.
 In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced: - Input-device-specific peculiarities can be restricted to the lexical analyzer.


Token, pattern, lexeme


A token is a sequence of characters from the source program having a collective meaning. In general,
a single token can be produced by different input character sequences. These sequences follow a certain
pattern. Tokens are terminals in the grammar of the source language.

A token can be represented as a pair <token-name, attribute-value> that is passed on to the
subsequent phase, syntax analysis, where token-name is an abstract symbol that is used during
syntax analysis, and attribute-value (which is optional for some tokens) points to an entry in the
symbol table for this token.


Example 1:
 Identifiers: {i, sum, a1}
 Numbers: {10, 100, -5}
 Operators: {+, -}
 Keywords: {for, int}
 Separators: {;, ,}
A lexeme is a sequence of characters in the source program that is matched by a pattern for
a certain token.
Example:
token: identifier (ID)
lexemes: x, distance, count

A pattern is a rule describing the set of lexemes that represent a particular token. Patterns are
usually specified using regular expressions, e.g. [a-zA-Z]+, whose matched lexemes include a, ab,
count, …
In the case of a keyword as a token, the pattern is just the sequence of characters that form the
keyword.

Example 2: The following table shows some tokens and their lexemes in Pascal (a high-level, case-insensitive programming language).


Attributes of tokens
When more than one pattern matches a lexeme, the scanner must provide additional information
about the particular lexeme to the subsequent phases of the compiler. For example, both 0 and 1
match the pattern for the token num. But the code generator needs to know which number is
recognized.

The lexical analyzer collects information about tokens into their associated attributes.
Tokens influence parsing decisions; the attributes influence the translation of tokens. Practically, a
token has one attribute: a pointer to the symbol table entry in which the information about the
token is kept. The symbol table entry contains various information about the token such as the
lexeme, the line number in which it was first seen, …


Example 3. x = y + 2
The tokens and their attributes are written as:
<id, pointer to symbol-table entry for x>
<assign_op, >
<id, pointer to symbol-table entry for y>
<plus_op, >
<num, integer value 2>
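The token/attribute pairs above can be sketched as a tiny tokenizer. This is an illustrative sketch: the token names and the integer "pointers" standing in for symbol-table entries are assumptions, not a fixed compiler API.

```python
import re

# A minimal tokenizer sketch (hypothetical token names) that emits
# <token-name, attribute> pairs for a statement like "x = y + 2".
TOKEN_SPEC = [
    ("num",       r"\d+"),
    ("id",        r"[A-Za-z_]\w*"),
    ("assign_op", r"="),
    ("plus_op",   r"\+"),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    symtab = {}          # symbol table: lexeme -> entry index ("pointer")
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue     # whitespace is discarded, not passed to the parser
        if kind == "id":
            attr = symtab.setdefault(lexeme, len(symtab))
        elif kind == "num":
            attr = int(lexeme)
        else:
            attr = None  # operators carry no attribute
        tokens.append((kind, attr))
    return tokens, symtab

toks, table = tokenize("x = y + 2")
assert toks == [("id", 0), ("assign_op", None), ("id", 1), ("plus_op", None), ("num", 2)]
assert table == {"x": 0, "y": 1}
```

Note how both identifiers share the token name id and are distinguished only by their symbol-table attribute, exactly as in the example above.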

Errors
Very few errors are detected by the lexical analyzer. For example, if the programmer mistypes while as
whle, the lexical analyzer cannot detect the error, since it will simply consider whle a valid identifier.
Nonetheless, if a certain sequence of characters follows none of the specified patterns, the lexical analyzer
can detect the error.


For example, if a symbol such as ~ or ? appears in the source program and no pattern contains that symbol.
Besides, a lexeme whose length exceeds the bound specified by the language is a lexical error, and
unterminated strings or comments are also errors detected during lexical analysis.

When an error occurs, the lexical analyzer recovers by:


🖛 skipping (deleting) successive characters from the remaining input until the lexical
analyzer can find a well-formed token (panic mode recovery)
🖛 deleting extraneous characters
🖛 inserting missing characters
🖛 replacing an incorrect character by a correct character
🖛 transposing two adjacent characters

Input Buffering
 There are ways in which the task of reading the source program can be sped up.
 This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
 In the C language: we need to look ahead after -, = or < to decide what token to return, because they could be the beginning of two-character operators such as ->, == or <=.


 Therefore, we shall introduce a two-buffer scheme that handles large lookaheads safely.
 We then consider an improvement involving sentinels that saves time checking for the ends of buffers.
 Because of the amount of time taken to process characters and the large number of characters that must be processed
during compilation of a large source program, specialized buffering techniques have been developed to reduce the
amount of overhead to process a single input character.
 An important scheme involves two buffers that are alternately reloaded, as shown in the figure below.


 Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
 Using one system read command we can read N characters into a buffer, rather than using one system call per character.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.
 Two pointers to the input are maintained:
 Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
 Pointer forward scans ahead until a pattern match is found.
 Once the lexeme is determined, forward is set to the character at its right end (this may involve retracting).
 Then, after the lexeme is recorded as an attribute value of the token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
 Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

Sentinels/ eof
 If we use the previous scheme, we must check each time we advance forward, that we have not moved off one of the
buffers; if we do, then we must also reload the other buffer.
 Thus, for each character read, we must make two tests: one for the end of the buffer, and one to determine which character
is read.
 We can combine the buffer-end test with the test for the current character if we extend each buffer to hold sentinel
character at the end.

Fig. Sentinels at the end of each buffer
Sentinels

 The sentinel is a special character that cannot be part of the source program; a natural choice is the character eof.
 Note that eof retains its use as a marker for the end of the entire input.
 Any eof that appears other than at the end of a buffer means that the input is at an end.
 The above Figure shows the same arrangement as the previous Figure, but with the sentinels added.
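The two-buffer scheme with sentinels can be sketched as follows. All names here are hypothetical, and N is made tiny only to force buffer reloads; a real scanner would use a disk-block size such as 4096.

```python
# A sketch of the two-buffer input scheme with eof sentinels; in the common
# case, one test (c != EOF) covers both "end of buffer" and "which character".
EOF = "\0"   # stand-in for the eof sentinel character
N = 4        # buffer size (tiny here only for demonstration)

class DoubleBuffer:
    def __init__(self, text):
        self.chunks = [text[i:i + N] for i in range(0, len(text), N)]
        self.buf = [EOF] * (2 * (N + 1))   # two buffers, each ending in a sentinel slot
        self.forward = 0
        self._reload(half=0)

    def _reload(self, half):
        """Fill one half with the next chunk; unused slots keep the sentinel."""
        chunk = self.chunks.pop(0) if self.chunks else ""
        base = half * (N + 1)
        for i in range(N + 1):
            self.buf[base + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        c = self.buf[self.forward]
        if c != EOF:                       # common case: a single test per character
            self.forward += 1
            return c
        if self.forward == N:              # sentinel at end of first buffer
            self._reload(half=1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:      # sentinel at end of second buffer
            self._reload(half=0)
            self.forward = 0
            return self.next_char()
        return EOF                         # genuine end of the input

db = DoubleBuffer("abcdefghij")
out = ""
c = db.next_char()
while c != EOF:
    out += c
    c = db.next_char()
assert out == "abcdefghij"
```

The point of the sentinel is visible in `next_char`: only when the sentinel is actually hit do we pay for the extra "which end of which buffer?" tests.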

Specifying and recognizing tokens


Regular expressions are an important notation for specifying lexeme patterns. While they cannot express
all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.
 In the theory of compilation, regular expressions are used to formalize the specification of tokens.
 Regular expressions are a means for specifying regular languages (patterns of tokens).


Strings and Languages


An alphabet is any finite set of symbols.
• Typical examples of symbols are letters, digits, and punctuation.
 The set {0, 1} is the binary alphabet.
 ASCII is an important example of an alphabet; it is used in many software systems.
 Unicode, which includes approximately 100,000 characters from alphabets around the world, is another important
example of an alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
 In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
 The length of a string S, usually written |S|, is the number of occurrences of symbols in S.

o For example, banana is a string of length six.


 The empty string, denoted Ɛ, is the string of length zero.

A language is any countable set of strings over some fixed alphabet.
 Abstract languages like ∅, the empty set, or {Ɛ}, the set containing only the empty string, are languages under this definition.
 So too are the set of all syntactically well-formed Java programs and the set of all grammatically correct English sentences, although the latter two languages are difficult to specify exactly.
Note that the definition of "language" does not require that any meaning be ascribed to the strings in the language.


Terms for Parts of Strings


The following string-related terms are commonly used:
1. A prefix of string S is any string obtained by removing zero or more symbols from the end of S.
 e.g. ban, banana, and Ɛ are prefixes of banana.
2. A suffix of string S is any string obtained by removing zero or more symbols from the beginning of S.
 e.g. nana, banana, and Ɛ are suffixes of banana.
3. A substring of S is obtained by deleting any prefix and any suffix from S.
 e.g. banana, nan, and Ɛ are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string S are those prefixes, suffixes, and substrings, respectively, of S that
are neither Ɛ nor equal to S itself.
5. A subsequence of S is any string formed by deleting zero or more not necessarily consecutive positions of S.
🖙 e.g. baan is a subsequence of banana.
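The string parts defined above can be computed directly; a small sketch for "banana":

```python
# Compute the prefixes, suffixes, and substrings of a string, as defined above.
def prefixes(s):
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    # delete any prefix and any suffix
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

assert {"ban", "banana", ""} <= prefixes("banana")   # "" plays the role of Ɛ
assert {"nana", "banana", ""} <= suffixes("banana")
assert {"nan", ""} <= substrings("banana")
assert len(prefixes("banana")) == len("banana") + 1  # one prefix per length
```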

Let L be the set of letters {A, B , . . . , Z , a, b, . . . , z} and let D be the set of digits {0, 1, . . . 9}. We may think of L and D in two,
essentially equivalent, ways.

 One way is that L and D are, respectively, the alphabets of uppercase and lowercase letters and of
digits.
 The second way is that L and D are languages, all of whose strings happen to be of length one.

Here are some other languages that can be constructed from languages L and D, using the above operators:
1. L U D is the set of letters and digits - strictly speaking the language with 62 strings of length one, each of
which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including Ɛ, the empty string.
5. L (L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
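The six constructions above can be spot-checked by treating L and D as Python sets of length-one strings (the second of the two readings); this is a sketch, with `concat` and `power` standing in for concatenation and exponentiation:

```python
import string

# L and D as languages whose strings all have length one.
L = set(string.ascii_letters)        # 52 letters
D = set(string.digits)               # 10 digits

def concat(A, B):
    """The language AB = {ab : a in A, b in B}."""
    return {a + b for a in A for b in B}

def power(A, n):
    """A^n: n-fold concatenation of A with itself; A^0 = {Ɛ}."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

assert len(L | D) == 62              # 1. L U D
assert len(concat(L, D)) == 520      # 2. LD
assert len(power(D, 4)) == 10 ** 4   # D^4 (L^4 is analogous, with 52^4 strings)
assert "" in power(D, 0)             # Ɛ belongs to every Kleene closure A*
```

L* and D+ are infinite, so only bounded prefixes of them can be enumerated this way.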

Regular Expressions

Regular expressions are used to specify the patterns of tokens. Each pattern matches a set of strings.
e.g.
letter → A|B|C|…|Z|a|b|c|…|z (the vertical bar means union)
digit → 0|1|…|9
identifier → letter_(letter | digit)* (parentheses are used to group subexpressions)
Lex provides shortcuts for describing regular expressions in a compact manner.
e.g. [a-z] stands for a|b|c|…|z
[0-9] stands for 0|1|…|9
[abc] stands for a|b|c
Regular Expressions

Each regular expression is a pattern specifying the form of strings. The regular expressions
are built recursively out of smaller regular expressions, using the following two rules:
R1: - Ɛ is a regular expression, and L(Ɛ) = {Ɛ}, that is, the language whose sole member

is the empty string.


R2: - If a is a symbol in ∑ then a is a regular expression, L(a) = {a}, that is, the

language with one string, of length one, with a in its one position.
Note: - By convention, we use italics for symbols, and boldface for their corresponding
regular expression.
Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's subexpressions.

Regular Expressions
There are four parts to the induction whereby larger regular expressions are built from
smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s),
respectively.
 (r) |(s) is a regular expression denoting the language L(r) U L(s).
 (r) (s) is a regular expression denoting the language L(r)L (s).
 (r)* is a regular expression denoting (L (r))*.
 (r) is a regular expression denoting L(r).
 This last rule says that we can add additional pairs of parentheses around expressions
without changing the language they denote.

Regular Expressions
 Regular expressions often contain unnecessary pairs of parentheses.
 We may drop certain pairs of parentheses if we adopt the conventions that:

1. The unary operator * has highest precedence and is left associative.


2. Concatenation has second highest precedence and is left associative.
3. | has lowest precedence and is left associative.
Example: - (a) |((b) *(c)) can be represented by a|b*c. Both expressions denote the set of strings that
are either a single a or are zero or more b' s followed by one c.
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. For instance, (a|b) = (b|a).


There are a number of algebraic laws for regular expressions; each law asserts that expressions of two different forms are equivalent.

LAW                              DESCRIPTION
r|s = s|r                        | is commutative
r|(s|t) = (r|s)|t                | is associative
r(st) = (rs)t                    concatenation is associative
r(s|t) = rs|rt; (s|t)r = sr|tr   concatenation distributes over |
Ɛr = rƐ = r                      Ɛ is the identity for concatenation
r* = (r|Ɛ)*                      Ɛ is guaranteed in a closure
r** = r*                         * is idempotent

Table: The algebraic laws that hold for arbitrary regular expressions r, s, and t.
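Two of these laws can be spot-checked on small finite languages, using sets of strings as the denotation L(r); this is a sketch, with `star` a bounded approximation of the Kleene closure:

```python
# Spot-check algebraic laws of regular expressions on finite languages.
def concat(A, B):
    return {a + b for a in A for b in B}

def star(A, max_len):
    """All strings in A* of length at most max_len (bounded approximation)."""
    result, layer = {""}, {""}
    while True:
        layer = {s for s in concat(layer, A) if len(s) <= max_len}
        if layer <= result:
            return result
        result |= layer

r, s, t = {"a"}, {"b"}, {"c"}
# r(s|t) = rs|rt : concatenation distributes over |
assert concat(r, s | t) == concat(r, s) | concat(r, t)
# r* = (r|Ɛ)* : Ɛ is guaranteed in a closure (checked up to length 4)
assert star(r, 4) == star(r | {""}, 4)
```

A check like this is not a proof, but it catches a misremembered law quickly.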

If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
 Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
 Each ri is a regular expression over the alphabet ∑ U {d1, d2, …, di-1}.
Example 4.
(A) Regular Definition for Java Identifiers
letter → A|B|…|Z|a|b|…|z|_
digit → 0|1|2|…|9
id → letter (letter | digit)*

(B) Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4
digit → 0|1|2|…|9
digits → digit digit*
optionalFraction → . digits | Ɛ
optionalExponent → (E (+|-|Ɛ) digits) | Ɛ
number → digits optionalFraction optionalExponent
The regular definition is a precise specification for this set of strings.
That is, an optionalFraction is either a decimal point (dot) followed by one or more digits, or it
is missing (the empty string).
An optionalExponent, if not missing, is the letter E followed by an optional + or - sign, followed by
one or more digits. Note that at least one digit must follow the dot, so number does not match 1., but
does match 1.0.
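The same definition can be transcribed into Python's `re` syntax as a sketch (here `?` plays the role of the `|Ɛ` alternatives):

```python
import re

# The regular definition for unsigned numbers, transcribed into re syntax.
digit  = r"[0-9]"
digits = digit + "+"
number = re.compile(rf"{digits}(\.{digits})?(E[+-]?{digits})?\Z")

for s in ["5280", "0.01234", "6.336E4", "1.89E-4", "1.0"]:
    assert number.match(s), s
assert not number.match("1.")   # at least one digit must follow the dot
```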


Extensions of Regular Expressions


Many extensions have been added to regular expressions to enhance their ability to specify string
patterns. Here are few notational extensions:
 One or more instances (+): - is a unary postfix operator that represents the positive closure of a
regular expression and its language. That is, if r is a regular expression, then (r)+ denotes the language
(L (r)) +.
o The operator + has the same precedence and associativity as the operator *.
o Two useful algebraic laws, r*=r+|Ɛ and r+=rr*=r*r.
 Zero or one instance (?): - r? is equivalent to r|Ɛ; put another way, L(r?) = L(r) U {Ɛ}.
o The ? operator has the same precedence and associativity as * and +.
 Character classes: - A regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an].

Example:
[abc] is shorthand for a|b|c, and
[a-z] is shorthand for a|b|c|···|z.
Using the above shorthands, we can rewrite the regular definitions of Example 4 (A and B) as follows.
Examples
A. Regular Definition for Java Identifiers
letter → [A-Za-z_]
digit → [0-9]
id → letter (letter | digit)*
B. Regular Definition for unsigned numbers such as 5280, 0.01234, 6.336E4, or 1.89E-4
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?

Recognition of Tokens
 Deals with how to take the patterns for all the needed tokens and build a piece of code that examines
the input string and finds a prefix that is a lexeme matching one of the patterns.
1. The starting point is the language grammar, from which we understand the tokens:
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | number
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.

2. The next step is to formalize the patterns:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z_]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number.

3. We also need to handle whitespaces:


ws  (blank | tab | newline)+

Transition Diagram
 Transition diagrams have a collection of nodes or circles, called states. Each state represents a

condition that could occur during the process of scanning the input looking for a lexeme that matches
one of several patterns.
 Edges are directed from one state of the transition diagram to another. Each edge is labeled by a
symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for
an edge out of state s labeled by a (and perhaps by other symbols as well).


If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to
which that edge leads. We shall assume that all our transition diagrams are deterministic, meaning that
there is never more than one edge out of a given state with a given symbol among its labels.

Lexemes Token Names Attribute Value
Any ws - -
if if -
then then -
else else -
Any id id Pointer to table entry
Any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE

Table: Tokens, their patterns, and attribute values


Some important conventions about transition diagrams are:
 Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers.
 If it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
 One state is designated the start state, or initial state; it is indicated by an edge, labeled "start," entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.
Example: a transition diagram that recognizes the lexemes matching the token relop.


Fig a. Transition diagram for relop

We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for
relop we can only be looking at <, <>, or <=. We therefore go to state 1 and look at the next character. If it is =, then
we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic
constant representing this particular comparison operator.
 If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return
an indication that the not-equals operator has been found.
 On any other character, the lexeme is <, and we enter state 4 to return that information. Note,
however, that state 4 has a * to indicate that we must retract the input one position.
 If in state 0 the first character we see is =, then this one character must be the lexeme. We
immediately return that fact from state 5.
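The walkthrough above can be sketched as hand-written code; the state numbers in the comments follow the walkthrough (0 is the start state, and the starred states retract by returning a shorter advance). The function name and return shape are illustrative assumptions.

```python
# A sketch of the relop transition diagram as hand-coded state logic.
def relop(s, i):
    """Scan s from position i; return (token, attribute, next position) or None."""
    c = s[i] if i < len(s) else ""
    if c == "<":
        c2 = s[i + 1] if i + 1 < len(s) else ""
        if c2 == "=":
            return ("relop", "LE", i + 2)   # state 2
        if c2 == ">":
            return ("relop", "NE", i + 2)   # state 3
        return ("relop", "LT", i + 1)       # starred state: retract one position
    if c == "=":
        return ("relop", "EQ", i + 1)       # state 5
    if c == ">":
        c2 = s[i + 1] if i + 1 < len(s) else ""
        if c2 == "=":
            return ("relop", "GE", i + 2)
        return ("relop", "GT", i + 1)       # starred state: retract one position
    return None                             # no edge out of state 0: fail

assert relop("<=b", 0) == ("relop", "LE", 2)
assert relop("<b", 0) == ("relop", "LT", 1)
assert relop("<>", 0) == ("relop", "NE", 2)
assert relop("a", 0) is None
```

The retraction shows up as the returned position: after `<b` the next scan resumes at the `b`, which was only read to decide that the lexeme was `<`.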


Fig b. Transition diagram for reserved words and identifiers

Fig c. Transition diagram for unsigned numbers

Fig d. Transition diagram for whitespace, where delim represents one or more whitespace characters


A typical Lexical Analyzer Generator

In this section, we introduce a tool called Lex, or in a more recent implementation Flex, that allows one
to specify a lexical analyzer by specifying regular expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file
called lex.yy.c, that simulates this transition diagram.


Finite Automata
 The lexical analyzer tools use finite automata, at the heart of the transition, to convert the input
program into a lexical analyzer. These are essentially graphs, like transition diagrams, with a few
differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
 Nondeterministic finite automata (NFA) have no restrictions on the labels of
their edges. A symbol can label several edges out of the same state, and Ɛ, the empty string, is a
possible label.
 Deterministic finite automata (DFA) have, for each state, and for each symbol of its input
alphabet exactly one edge with that symbol leaving that state.


Finite Automata
 Both deterministic and nondeterministic finite automata are capable of recognizing the
same languages.
In fact these languages are exactly the same languages,
called the regular languages, that regular expressions can describe.
Finite Automata State Graphs


Example 6: A finite automaton that accepts only “1”.

Example 7: A finite automaton accepting any number of 1’s followed by a single 0.
Q: Check that “1110” is accepted but “110…” is not.

Question: What language does this automaton recognize?


Nondeterministic Finite Automata (NFA)


 A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols ∑, the input alphabet. We assume that Ɛ, which stands for the empty string, is never a member of ∑.
3. A transition function that gives, for each state, and for each symbol in ∑ U {Ɛ} a set of next states.

4. A state s0 from S that is distinguished as the start state (or initial state).

5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).
 We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the
transition function.
o There is an edge labeled a from states s to state t if and only if t is one of the next states for state s and input a.
 This graph is very much like a transition diagram, except:
o The same symbol can label edges from one state to several different states, and
o An edge may be labeled by Ɛ instead of, or in addition to, symbols from the input alphabet.


An NFA can get into multiple states.

Fig. A nondeterministic finite automaton

Input: a b a
Rule: the NFA accepts if it can get into a final state.


Transition Tables
• We can also represent an NFA by a transition table, whose rows correspond to states, and whose
columns correspond to the input symbols and Ɛ.
 The entry for a given state and input is the value of the transition function applied to those arguments.
 If the transition function has no information about that state-input pair, we put ɸ in the table for the pair.
Example: - The transition table for the NFA on the previous state graph is represented as:

The transition table has the advantage that we can easily find the transitions on a given state and
input. Its disadvantage is that it takes a lot of space, when the input alphabet is large, yet most states do
not have any moves on most of the input symbols.
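A transition table can be sketched as a dict of dicts, where a missing entry plays the role of ɸ. The three-state NFA below is an assumption standing in for the figure (it is the NFA for (a|b)*ba that accepts strings ending in ba, the one discussed over the next pages):

```python
# Transition table for an NFA accepting (a|b)*ba: rows are states,
# columns (dict keys) are input symbols; a missing entry means the empty set.
NFA_TABLE = {
    0: {"a": {0}, "b": {0, 1}},
    1: {"a": {2}},
    2: {},
}
START, FINAL = 0, {2}

def move(states, symbol):
    """Union of the transition function over a set of current states."""
    out = set()
    for s in states:
        out |= NFA_TABLE[s].get(symbol, set())
    return out

def accepts(word):
    states = {START}            # the NFA can be in multiple states at once
    for c in word:
        states = move(states, c)
    return bool(states & FINAL)

assert accepts("aaba") and accepts("bbbba")
assert not accepts("ab") and not accepts("")
```

This also illustrates the space trade-off mentioned above: the table stores an entry (or a gap) for every state/symbol pair, even where the NFA has no moves.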

Acceptance of Input Strings by Automata
 An NFA accepts input string x if and only if there is some path in the transition graph from the
start state to one of the accepting (Final) states, such that the symbols along the path spell out x.
o Note that Ɛ labels along the path are effectively ignored, since the empty string does not contribute
to the string constructed along the path.
Example:- Strings aaba and bbbba are accepted by the above NFA in page 41.
 The language defined (or accepted) by an NFA is the set of strings labeling some path from the start state to an accepting state.
o As was mentioned above, the NFA of page 41 defines the same language as does the regular
expression (a|b)* ba, that is, all strings from the alphabet {a, b} that end in ba.
o We may use L(A) to stand for the language accepted by automaton A.

Deterministic Finite Automata (DFA)


 A deterministic finite automaton (DFA) is a special case of an NFA where:
o There are no moves on input Ɛ, and
o For each state S and input symbol a, there is exactly one edge out of s labeled a.
 If we are using a transition table to represent a DFA, then each entry is a single state.

o we may therefore represent this state without the curly braces that we use to form sets.
 While the NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings.


Deterministic Finite Automata (DFA)


 It is fortunate indeed that every regular expression and every NFA can be converted to a DFA accepting the same
language, because it is the DFA that we really implement or simulate when building lexical analyzers.
 DFA recognize strings: - When a string is fed into a DFA, if the DFA recognizes the string, it accepts the string
otherwise it rejects the string.
 A DFA is a collection of states and transitions: - Given the input string, the transitions tell us how to move among
the states.
o One of the states is denoted as the initial state, and a subset of the states are final states.
o We start from the initial state, move from state to state via the transitions, and check to see if we are in the
final state when we have checked each character in the string.
o If we are, then the string is accepted; otherwise, the string is rejected.


 A DFA is a quintuple, a machine with five parameters, M = (Q, ∑, δ, q0, F), where
o Q is a finite set of states
o ∑ is a finite set called the alphabet
o δ is a total function from (Q x ∑) to Q known as transition function (a function that takes a state and a
symbol as inputs and returns a state)
o q0, an element of Q, is the start state, and
o F is a subset of Q called the set of final states.
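The quintuple can be written down directly and simulated. This sketch assumes the standard four-state DFA for (a|b)*abb (the language of the second example below); since δ is total, every state/symbol pair has exactly one entry:

```python
# M = (Q, Sigma, delta, q0, F): a DFA accepting (a|b)*abb.
Q     = {0, 1, 2, 3}
SIGMA = {"a", "b"}
DELTA = {                      # the total transition function delta
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
Q0, F = 0, {3}

def run(word):
    state = Q0
    for c in word:
        state = DELTA[(state, c)]   # exactly one next state: no search needed
    return state in F

assert run("abb") and run("aabb") and run("babb")
assert not run("ab") and not run("abba")
```

Determinism is what makes `run` a straight loop with no backtracking, which is why the DFA is what lexical analyzers actually implement.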

Examples
1) A DFA that can accept the strings which begin with a or b, or begin with c and contain at most one a.
2) A DFA accepting (a|b)*abb

Conversion of an NFA to a DFA


 The general idea behind the subset construction is that each state of the constructed DFA
corresponds to a
set of NFA states.
 l 2 n
After reading input a , a , . . ., a , the DFA is in that state which corresponds to the set of states that the NFA can reach, from its start state, following paths labeled a l a2 ...

n
a .

 It is possible that the number of DFA states is exponential in the number of NFA states,

which could lead to difficulties when we try to implement this DFA. However, part of the power of the
automaton-based approach to lexical analysis is that for real languages, the NFA and DFA have
approximately the same number of states, and the exponential behavior is not seen.
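The subset construction can be sketched on the (a|b)*ba NFA used earlier, with each DFA state a frozenset of NFA states. This toy NFA has no Ɛ-moves, so the Ɛ-closure step is omitted; the names here are illustrative.

```python
# Subset construction: build a DFA whose states are sets of NFA states.
NFA = {0: {"a": {0}, "b": {0, 1}}, 1: {"a": {2}}, 2: {}}   # NFA for (a|b)*ba
START, FINAL = frozenset({0}), {2}
ALPHABET = {"a", "b"}

def subset_construction():
    dfa, worklist = {}, [START]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue                     # this subset was already processed
        dfa[S] = {}
        for c in ALPHABET:
            T = frozenset(t for s in S for t in NFA[s].get(c, set()))
            dfa[S][c] = T
            worklist.append(T)
    return dfa

dfa = subset_construction()
assert len(dfa) == 3                     # {0}, {0,1}, {0,2}: far below 2**3

def dfa_accepts(word):
    S = START
    for c in word:
        S = dfa[S][c]
    return bool(S & FINAL)               # final iff it contains an NFA final state

assert dfa_accepts("aaba") and dfa_accepts("bbbba")
assert not dfa_accepts("ab")
```

Only 3 of the 2³ = 8 possible subsets are reachable here, which illustrates the closing remark: in practice the constructed DFA is far smaller than the exponential worst case.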

Review Exercises
Note: attempt all questions individually.
Submit your answer on [email protected]
1. Consult the language reference manuals to determine
A. The sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments),
B. The lexical form of numerical constants, and
C. The lexical form of identifiers, for the C++ and Java programming languages.
2. Describe the languages denoted by the following regular expressions:
A. a(a|b)*a
B. ((Ɛ|a)b*)*
C. (a|b)*a(a|b)(a|b)
D. a*ba*ba*ba*
E. (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
3. Write regular definitions for the following languages:
A. All strings of lowercase letters that contain the five vowels in order.
B. All strings of lowercase letters in which the letters are in ascending lexicographic order.
C. All strings of binary digits with no repeated digits.
D. All strings of binary digits with at most one repeated digit.
E. All strings of a's and b's where every a is preceded by b.
F. All strings of a's and b's that contain the substring abab.

4. Design finite automata (deterministic or nondeterministic) for each of the languages of question 3.
5. Give the transition tables for the following NFA.
