Lecture02 Scanning 1
Kenneth C. Louden
2. Scanning (Lexical Analysis)
PART ONE
Contents
PART ONE
2.1 The Scanning Process
2.2 Regular Expressions
2.3 Finite Automata
PART TWO
2.4 From Regular Expressions to DFAs
2.5 Implementation of a TINY Scanner
2.6 Use of Lex to Generate a Scanner Automatically
2.1 The Scanning Process
The Function of a Scanner
• Reads characters from the source code
and forms them into logical units called
tokens
• Tokens are logical entities defined as an
enumerated type
– typedef enum
{IF, THEN, ELSE, PLUS, MINUS, NUM, ID, …}
TokenType;
The Categories of Tokens
• RESERVED WORDS
– Such as IF and THEN, which represent the strings of
characters “if” and “then”
• SPECIAL SYMBOLS
– Such as PLUS and MINUS, which represent the
characters “+” and “-“
• OTHER TOKENS
– Such as NUM and ID, which represent numbers and
identifiers
Relationship between a Token and Its
String
• The string of characters represented by a token is called
its STRING VALUE or LEXEME
• Some tokens have only one lexeme, such as
reserved words
• A token may have infinitely many lexemes,
such as the token ID
• Any value associated with a token is called an attribute of the
token
– The string value is an example of an attribute.
– A NUM token may have a string value such as “32767” and the actual
value 32767
– A PLUS token has the string value “+” as well as the arithmetic
operation +
• The token can be viewed as the collection of all of its
attributes
– Only as many attributes as are necessary to allow further
processing need to be computed
– The numeric value of a NUM token need not be computed immediately
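The view of a token as a collection of attributes, some of them computed only when needed, can be sketched as follows. This is a minimal illustration in Python: the names `Token`, `lexeme`, and `numeric_value` are assumptions made for the example and are not part of the lecture's code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    # A few token kinds from the lecture; a real scanner defines more.
    IF = auto()
    PLUS = auto()
    NUM = auto()
    ID = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    lexeme: str  # the string value, always stored

    @property
    def numeric_value(self) -> int:
        # Computed on demand; only meaningful for NUM tokens.
        if self.type is not TokenType.NUM:
            raise ValueError("only NUM tokens have a numeric value")
        return int(self.lexeme)


tok = Token(TokenType.NUM, "32767")
```

Here the scanner stores only the lexeme, and the conversion to a number is postponed until some later phase asks for it.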
Some Practical Issues of the Scanner
[Figure: the input line a[index] = 4 + 2 as held in the input buffer, shown in two successive diagrams.]
2.2 Regular Expressions
Some Related Basic Concepts
• Regular expressions
– represent patterns of strings of characters.
• A regular expression r
– is completely defined by the set of strings it matches.
– This set is called the language of r, written L(r).
• The elements of the underlying character set
– are referred to as symbols.
• This set of legal symbols
– is called the alphabet and is written as the Greek symbol ∑.
• A regular expression r
– contains characters from the alphabet, which indicate
patterns: for example, a is the character a used as a pattern.
• A regular expression r
– may also contain special characters called meta-characters
or meta-symbols.
• An escape character can be used to turn off the
special meaning of a meta-character.
– Examples are the backslash and quotes.
More About Regular Expressions
Example 2:
– ∑ = {a, b, c}
– The set of all strings that contain at most one b.
– (a|c)*|(a|c)*b(a|c)* or, equivalently, (a|c)*(b|ε)(a|c)*
– The same language may be generated by many different regular
expressions.
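That the two expressions in Example 2 describe the same language can be checked by brute force on short strings. The sketch below uses Python's `re` module, writing (b|ε) as `(b|)`; the helper names and the length bound of 6 are arbitrary choices for the illustration, and a check on short strings is evidence, not a proof.

```python
import re
from itertools import product

FIRST = re.compile(r"(a|c)*|(a|c)*b(a|c)*")
SECOND = re.compile(r"(a|c)*(b|)(a|c)*")  # (b|) plays the role of (b|ε)


def all_strings(alphabet, max_len):
    # Every string over the alphabet with length 0..max_len.
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def agree(max_len=6):
    # Both expressions should match exactly the strings with at most one b.
    for s in all_strings("abc", max_len):
        expected = s.count("b") <= 1
        if bool(FIRST.fullmatch(s)) != expected:
            return False
        if bool(SECOND.fullmatch(s)) != expected:
            return False
    return True
```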
Examples of Regular Expressions
Example 3:
– ∑ = {a, b}
– The set of strings consisting of a single b surrounded by the same
number of a’s.
– S = {b, aba, aabaa, aaabaaa, …} = {aⁿbaⁿ | n ≥ 0}
– This set cannot be described by a regular expression.
• “Regular expressions can’t count.”
2.2.2 Extensions to Regular
Expressions
List of New Operations
1) one or more repetitions
– r+
2) any character
– the period “.”
3) a range of characters
– [0-9], [a-zA-Z]
4) any character not in a given set
– ∼(a|b|c): a character that is not a, b, or c
– [^abc] in Lex
5) optional sub-expressions
– r?: the strings matched by r are optional
2.2.3 Regular Expressions for
Programming Language Tokens
Numbers, Reserved Words, and
Identifiers
Numbers
– nat = [0-9]+
– signedNat = (+|-)?nat
– number = signedNat(“.”nat)? (E signedNat)?
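The three definitions for numbers translate almost directly into Python's `re` syntax. The sketch below is one such translation; the names `NAT`, `SIGNED_NAT`, `NUMBER`, and `is_number` are chosen for the example, and the exponent marker is the uppercase E exactly as in the definition.

```python
import re

NAT = r"[0-9]+"
SIGNED_NAT = rf"(\+|-)?{NAT}"
NUMBER = re.compile(rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?")


def is_number(s: str) -> bool:
    # True only if the whole string matches the number definition.
    return NUMBER.fullmatch(s) is not None
```

Under these definitions a string such as `2.` is not a number, because the fractional part requires at least one digit after the period.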
Reserved Words and Identifiers
– reserved = if | while | do | …
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
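Every reserved word also matches the identifier pattern, so one common way to separate the two (an assumption here, not something stated on the slide) is to match an identifier first and then look the lexeme up in a table of reserved words. The set `RESERVED` below is a hypothetical three-word subset, since the lecture's list is open-ended.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
RESERVED = {"if", "while", "do"}  # hypothetical subset for illustration


def classify(lexeme: str) -> str:
    # Match the identifier pattern first, then check the reserved-word table.
    if IDENTIFIER.fullmatch(lexeme) is None:
        return "ERROR"
    return "RESERVED" if lexeme in RESERVED else "ID"
```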
Comments
Several forms:
– Pascal: { this is a pascal comment }, matched by {(∼})*}
– C: /* this is a C comment */
• cannot be written as ba(∼(ab))*ab (with b standing for "/" and a for "*"), because ∼ is restricted to single characters
• one solution for ∼(ab): b*(a*∼(a|b)b*)*a*
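The claim that b*(a*∼(a|b)b*)*a* matches exactly the strings that do not contain ab can be tested by enumeration. The sketch below assumes the three-letter alphabet {a, b, c}, so that ∼(a|b) can be written as `[^ab]`; the names and the length bound are choices made for the example.

```python
import re
from itertools import product

# Over {a, b, c}, "any character other than a or b" is written [^ab]
# so that the pattern mirrors ∼(a|b).
NO_AB = re.compile(r"b*(a*[^ab]b*)*a*")


def check(max_len=6):
    # Compare the pattern with a direct substring test on every short string.
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            if (NO_AB.fullmatch(s) is not None) != ("ab" not in s):
                return False
    return True
```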
2.3 FINITE AUTOMATA
Introduction to Finite Automata
• Finite automata (finite-state machines) are a
mathematical way of describing particular kinds of
algorithms.
• There is a strong relationship between finite automata and
regular expressions.
• Example: identifier = letter (letter | digit)*
[Figure: DFA for identifier. State 1 goes to state 2 on letter; state 2 loops on letter and on digit.]
• Transition:
– Records a change from one state to another upon a match of the
character or characters by which it is labeled.
• Start state:
– The state in which the recognition process begins.
– Drawn with an unlabeled arrow coming into it “from nowhere”.
• Accepting states:
– Represent the end of the recognition process.
– Drawn with a double-line border around the state in the diagram.
More About Finite Automata
Definition of a DFA:
A DFA (deterministic finite automaton) M consists of
(1) an alphabet ∑,
(2) a set of states S,
(3) a transition function T : S × ∑ → S,
(4) a start state s0 ∈ S,
(5) and a set of accepting states A ⊂ S.
The Concept of DFA
The language accepted by a DFA M, written L(M),
is defined to be
the set of strings of characters c1c2c3…cn with each ci ∈ ∑ such that
there exist states s1 = T(s0, c1), s2 = T(s1, c2), …, sn = T(sn-1, cn) with sn an
element of A (i.e., an accepting state).
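The definition of L(M) can be read as a short program: start in s0, apply T once per character, and accept if the final state is in A. The sketch below does this for the identifier DFA. Two simplifications are assumptions of the example: transitions are keyed by character class rather than by individual character, and a missing table entry stands for the implicit error transition.

```python
def char_class(ch):
    # Map a character to the label used on the DFA's transitions.
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def accepts(T, start, accepting, s):
    """Return True if the DFA (T, start, accepting) accepts string s."""
    state = start
    for ch in s:
        key = (state, char_class(ch))
        if key not in T:
            return False  # implicit error transition
        state = T[key]
    return state in accepting


# identifier = letter (letter | digit)*
T = {
    (1, "letter"): 2,
    (2, "letter"): 2,
    (2, "digit"): 2,
}
```

For example, `accepts(T, 1, {2}, "x1y2")` follows the states 1, 2, 2, 2, 2 and ends in the accepting state 2.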
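[Figures: the identifier DFA with states start and in_id (transitions on letter and digit), and a DFA whose transitions are labeled b and not b.]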
Examples of DFA
Example 2.8:
digit = [0-9]
nat = digit+
signedNat = (+|-)? nat
number = signedNat(“.”nat)?(E signedNat)?
[Figure: a DFA of signedNat, with transitions labeled +, −, and digit.]
A DFA of number:
[Figure: DFA diagram with transitions labeled digit, −, and E, among others.]
Examples of DFA
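Example 2.9: A DFA for C comments
(easier than writing down a regular expression)
[Figure: DFA with states 1 to 5. 1 goes to 2 on /; 2 goes to 3 on *; 3 loops on other and goes to 4 on *; 4 loops on *, returns to 3 on other, and goes to 5 on /.]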
2.3.2 Lookahead, Backtracking,
and Nondeterministic Automata
Typical Actions of a DFA Algorithm
• Making a transition: move the character from the input
string to a string that accumulates the characters
belonging to a single token (the token string value or
lexeme of the token).
• Reaching an accepting state: return the token just
recognized, along with any associated attributes.
• Reaching an error state: either back up in the input
(backtracking) or generate an error token.
[Figure: finite automaton for an identifier with delimiter and return value. start goes to in_id on letter; in_id loops on letter and digit and goes to finish on [other]; finish returns ID.]
• The error state represents the fact that either
an identifier is not to be recognized (if we came
from the start state) or a delimiter has been seen
and we should now accept and generate an
identifier token.
• [other]: indicates that the delimiting character
should be considered lookahead; that is, it should be
returned to the input string and not consumed.
• This diagram also expresses the principle of
longest substring described in Section 2.2.4:
the DFA continues to match letters and digits (in
state in_id) until a delimiter is found.
• By contrast, the old diagram allowed the DFA
to accept at any point while reading an
identifier string.
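The longest-substring behavior and the unconsumed [other] delimiter can be sketched as a small function. The name `scan_identifier` and its return convention (lexeme plus the position of the next unread character) are assumptions made for this illustration.

```python
def scan_identifier(text, pos):
    """Try to scan an identifier starting at text[pos].

    Returns (lexeme, next_pos), or (None, pos) if no identifier starts here.
    """
    def is_letter(ch):
        return ch.isascii() and ch.isalpha()

    def is_digit(ch):
        return ch.isascii() and ch.isdigit()

    if pos >= len(text) or not is_letter(text[pos]):
        return None, pos               # error state reached from start
    end = pos + 1                      # now in state in_id
    while end < len(text) and (is_letter(text[end]) or is_digit(text[end])):
        end += 1                       # stay in in_id: longest match
    # text[end] (or end of input) is the [other] lookahead; it is not consumed.
    return text[pos:end], end
```

Because the delimiter is left in the input, the next call to the scanner starts at the delimiter itself and can recognize it as its own token.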
[Figure: the old identifier diagram (start, in_id) next to the new one (start, in_id, finish, return ID).]
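How to Arrive at the Start State in
the First Place
(combine all the tokens into one DFA)
• When each of the tokens begins with a
different character, their start states can simply be merged
– for example, := leads to return ASSIGN
• Consider tokens that share a first character, such as the ones
that return LT and LE
– joining their diagrams at the start state gives two transitions
on the same character, so the result is not a DFA
• The diagram is rearranged into a DFA
– after the shared first character, = leads to return LE
– [other] leads to return LT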
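The rearranged DFA for LT and LE reads the shared first character once and then decides on one character of lookahead. The sketch below assumes that LT and LE stand for the lexemes `<` and `<=`, which the token names suggest; the function name and return convention are made up for the example.

```python
def scan_relop(text, pos):
    """Scan '<' or '<=' at text[pos]; return (token name, next_pos)."""
    if pos >= len(text) or text[pos] != "<":
        return None, pos
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return "LE", pos + 2          # '=' is consumed
    return "LT", pos + 1              # [other]: the lookahead is not consumed
```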
Expand the Definition of a Finite
Automaton
• One solution to the problem is to expand
the definition of a finite automaton.
• More than one transition from a state
may exist for a particular character
(NFA: nondeterministic finite automaton).
• An algorithm is then developed for systematically
turning these NFAs into DFAs.
ε-transition
• A transition that may occur without consulting the
input string (and without consuming any characters).
• An ε-transition can also explicitly describe a match of the
empty string.
Definition of an NFA
• An NFA (nondeterministic finite automaton) M consists
of
– an alphabet Σ, a set of states S,
– a transition function T: S × (Σ ∪ {ε}) → ℘(S),
– a start state s0 from S, and a set of accepting states A from S.
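[Figure: an NFA with states 1 to 4 and transitions labeled a, b, and ε.]
• The string abb can be accepted by either of the following sequences of transitions:
→ 1 –a→ 2 –b→ 4 –ε→ 2 –b→ 4
→ 1 –a→ 3 –ε→ 4 –ε→ 2 –b→ 4 –ε→ 2 –b→ 4
• This NFA accepts the language of the regular expression
ab+|ab*|b*, which is the same language as (a|ε)b*.
[Figure: a second NFA with states 1 to 10 and transitions labeled a, b, c, and ε.]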
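An NFA with ε-transitions can be simulated by tracking the set of all states it could currently be in, which is an alternative to backtracking. The transition table below is an assumed reconstruction of the four-state NFA: it is consistent with the transition sequences 1, 2, 4, 2, 4 and 1, 3, 4, 2, 4, 2, 4 and with the language ab+|ab*|b*, but the diagram itself is not reproduced in the text.

```python
EPS = ""  # label used for ε-transitions

# Assumed reconstruction of the four-state NFA (see the note above).
NFA = {
    (1, "a"): {2, 3},
    (1, EPS): {4},
    (2, "b"): {4},
    (3, EPS): {4},
    (4, EPS): {2},
}
START, ACCEPTING = 1, {4}


def eps_closure(states):
    # All states reachable from `states` using ε-transitions only.
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(s):
    # Follow every possible transition at once, one input character at a time.
    current = eps_closure({START})
    for ch in s:
        moved = set()
        for state in current:
            moved |= NFA.get((state, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)
```

With this table the accepted strings are exactly those of the form (a|ε)b*, for example the empty string, `a`, `b`, and `abb`.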
2.3.3 Implementation of Finite
Automata in Code
Ways to Translate a DFA or NFA
into Code
The code for the DFA accepting identifiers (states 1, 2, 3; state 3 is reached on [other]):
  { starting in state 1 }
  if the next character is a letter then
    advance the input;
    { now in state 2 }
    while the next character is a letter or a digit do
      advance the input; { stay in state 2 }
    end while;
    { go to state 3 without advancing the input }
    accept;
  else
    { error or other cases }
  end if;
Two drawbacks:
• It is ad hoc: each DFA has to be treated slightly differently, and it is difficult
to state an algorithm that will translate every DFA to code in this way.
• The complexity of the code increases dramatically as the number of states rises or,
more specifically, as the number of different states along arbitrary paths rises.
The code for the DFA that accepts C comments (states 1 to 5):
  { state 1 }
  if the next character is "/" then
    advance the input; { state 2 }
    if the next character is "*" then
      advance the input; { state 3 }
      done := false;
      while not done do
        while the next input character is not "*" do
          advance the input;
        end while;
        advance the input; { state 4 }
        while the next input character is "*" do
          advance the input;
        end while;
        if the next input character is "/" then
          done := true;
        end if;
        advance the input;
      end while;
      accept; { state 5 }
    else { other processing }
    end if;
  else { other processing }
  end if;
A better method:
• Use a variable to maintain the current state, and
• write the transitions as a doubly nested case statement inside a loop,
• where the first case statement tests the current state and the nested second level tests the input
character.
[Figure: code outline beginning with "case state of", shown with a diagram of states 1, 2, 3.]
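The state-variable approach can be sketched for the five-state C comment DFA. Python's if/elif chains stand in for the nested case statements here: the outer level tests the current state and the inner level tests the input character. The function name and the choice to reject on any undefined transition are assumptions of the example.

```python
def is_c_comment(s):
    """Return True if s is exactly one C comment, using the five-state DFA."""
    state = 1
    for ch in s:
        if state == 1:
            if ch == "/":
                state = 2
            else:
                return False
        elif state == 2:
            if ch == "*":
                state = 3
            else:
                return False
        elif state == 3:
            if ch == "*":
                state = 4
            # any other character: stay in state 3
        elif state == 4:
            if ch == "/":
                state = 5
            elif ch == "*":
                state = 4
            else:
                state = 3
        else:
            return False  # state 5 has no outgoing transitions
    return state == 5
```

Every DFA fits this same loop shape, which addresses the ad hoc nature of the directly coded version: only the contents of the branches change.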
The table-driven form of this code assumes:
• The transitions are kept in a transition array T indexed by states and input characters;
• The transitions that advance the input (i.e., those not marked with brackets in the table) are given by
the Boolean array Advance, also indexed by states and input characters;
• Accepting states are given by the Boolean array Accept, indexed by states.
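A generic driver built on the three tables T, Advance, and Accept might look like the sketch below, shown for the identifier DFA with states 1 (start), 2 (in_id), and 3 (finish). Several details are assumptions of the example: the tables are dictionaries indexed by character class instead of arrays indexed by character, state 0 is used as an explicit error state, and end of input is treated as an "other" character.

```python
def char_class(ch):
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"  # also used for end of input (ch == "")


ERROR = 0
# Identifier DFA: 1 = start, 2 = in_id, 3 = finish (accepting).
T = {
    1: {"letter": 2, "digit": ERROR, "other": ERROR},
    2: {"letter": 2, "digit": 2, "other": 3},
}
Advance = {
    1: {"letter": True, "digit": False, "other": False},
    2: {"letter": True, "digit": True, "other": False},  # [other] is lookahead
}
Accept = {ERROR: False, 1: False, 2: False, 3: True}


def scan(text, pos=0):
    """Run the DFA from text[pos]; return (accepted, next_pos)."""
    state = 1
    while not Accept[state] and state != ERROR:
        ch = text[pos] if pos < len(text) else ""
        cls = char_class(ch)
        if Advance[state][cls]:
            pos += 1
        state = T[state][cls]
    return Accept[state], pos
```

The function `scan` contains no knowledge of identifiers; replacing the three tables is enough to recognize a different token.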
Features of Table-Driven Method
Table driven: tables are used to direct the progress of the algorithm.
The advantages:
• The size of the code is reduced, the same code will work for many different problems,
and the code is easier to change (maintain).
The disadvantages:
• The tables can become very large, causing a significant increase in the space used by the
program. Indeed, much of the space in the arrays we have just described is wasted.
• Table-driven methods often rely on table-compression methods such as sparse-array
representations, although there is usually a time penalty to be paid for such compression,
since table lookup becomes slower. Since scanners must be efficient, these methods are
rarely used for them.
NFAs can be implemented in similar ways to DFAs, except that, because NFAs are nondeterministic,
• there are potentially many different sequences of transitions that must be tried;
• a program that simulates an NFA must store up transitions that have not yet been tried
and backtrack to them on failure.
End of Part One