
Chapter 2_1: Lexical Analysis/scanning

Syntax Analysis: Scanner


Dataflow chart:

  Source Program
        |   (stream of characters)
        v
     Scanner  --->  Error Reports
        |   (stream of "tokens")
        v
     Parser   --->  Error Reports
        |
        v
  Abstract Syntax Tree

Steps for Developing a Scanner
1) Express the "lexical" grammar.
2) Implement the scanner based on this grammar.
3) Refine the scanner to keep track of the spelling and kind of the
   currently scanned token.

Systematic Development of a Scanner
(1) Express the (lexical) grammar.
(2) Create a scanning method scanN for each non-terminal N
    (a minimal sketch follows this list).
(3) Create a scanner class with
    - a private variable currentChar
    - private methods take and takeIt
    - the private scanning methods implemented in step (2)
    - a private scanN method for each non-terminal N,
      enhanced to record each token's kind and spelling
    - a public scan method that scans Separator* Token,
      discarding any separators but returning the token that
      follows them
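
As an illustration of step (2), here is a minimal sketch of a scanning method for one non-terminal, assuming the lexical rule Integer-Literal ::= Digit Digit* and a character-classification helper isDigit (the rule and the helper name are used for illustration only):

  private void scanIntegerLiteral() {   // Integer-Literal ::= Digit Digit*
    takeIt();                           // take the first digit
    while (isDigit(currentChar))        // then zero or more further digits
      takeIt();
  }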
Developing a Scanner
Implementation of the scanner:

public class Scanner {

  private char currentChar;
  private StringBuffer currentSpelling;
  private byte currentKind;

  private void take(char expectedChar) { ... }
  private void takeIt() { ... }

  // other private auxiliary methods and scanning
  // methods here.

  public Token scan() { ... }
}
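
For illustration, the parser (or a test driver) obtains tokens one at a time by calling scan(). A minimal sketch, assuming a no-argument constructor and a Token.EOT kind marking the end of the source text (both are assumptions, not shown on these slides):

  Scanner scanner = new Scanner();                         // constructor arguments not shown here
  Token t;
  do {
    t = scanner.scan();                                    // one token per call
    System.out.println(t.kind + " \"" + t.spelling + "\"");
  } while (t.kind != Token.EOT);                           // assumed end-of-text token kind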
Developing a Scanner
The scanner will return instances of Token:

public class Token {
  byte kind; String spelling;
  final static byte
    IDENTIFIER = 0, INTLITERAL = 1, OPERATOR = 2,
    BEGIN = 3, CONST = 4, ...;
  ...

  public Token(byte kind, String spelling) {
    this.kind = kind; this.spelling = spelling;
    // if spelling matches a keyword, change my kind automatically
  }
  ...
}
Developing a Scanner

public class Scanner {

  private char currentChar = /* get first source char */;
  private StringBuffer currentSpelling;
  private byte currentKind;

  private void take(char expectedChar) {
    if (currentChar == expectedChar) {
      currentSpelling.append(currentChar);
      currentChar = /* get next source char */;
    }
    else
      /* report lexical error */;
  }

  private void takeIt() {
    currentSpelling.append(currentChar);
    currentChar = /* get next source char */;
  }
  ...
Developing a Scanner

  ...
  public Token scan() {
    // Get rid of potential separators before
    // scanning a token
    while ( (currentChar == '!')
         || (currentChar == ' ')
         || (currentChar == '\n') )
      scanSeparator();
    currentSpelling = new StringBuffer();
    currentKind = scanToken();
    return new Token(currentKind,
                     currentSpelling.toString());
  }

  // scanSeparator and scanToken are developed much in the
  // same way as parsing methods:
  private void scanSeparator() { ... }
  private byte scanToken() { ... }
  ...
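
The body of scanSeparator is not shown on the slide. A minimal sketch, assuming (as in Triangle) that '!' begins a comment running to the end of the line, that blank and newline are the only other separators, that every comment is terminated by a newline, and using a hypothetical getNextSourceChar() helper standing in for "get next source char":

  private void scanSeparator() {
    switch (currentChar) {
      case '!':                               // a comment: skip to end of line
        do {
          currentChar = getNextSourceChar();
        } while (currentChar != '\n');
        currentChar = getNextSourceChar();    // skip the newline itself
        break;
      case ' ': case '\n':                    // a single blank or newline
        currentChar = getNextSourceChar();
        break;
    }
  }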
Developing a Scanner

  private byte scanToken() {
    switch (currentChar) {
      case 'a': case 'b': ... case 'z':
      case 'A': case 'B': ... case 'Z':
        scan Letter (Letter | Digit)*
        return Token.IDENTIFIER;
      case '0': ... case '9':
        scan Digit Digit*
        return Token.INTLITERAL;
      case '+': case '-': ... case '=':
        takeIt();
        return Token.OPERATOR;
      ...etc...
    }
  }
Developing a Scanner
Let's look at the identifier case in more detail. The pseudocode

      case 'a': case 'b': ... case 'z':
      case 'A': case 'B': ... case 'Z':
        scan Letter (Letter | Digit)*
        return Token.IDENTIFIER;
      case '0': ... case '9':
        ...

is refined first into

      case 'a': case 'b': ... case 'z':
      case 'A': case 'B': ... case 'Z':
        takeIt();
        scan (Letter | Digit)*
        return Token.IDENTIFIER;
      case '0': ... case '9':
        ...

and then into

      case 'a': case 'b': ... case 'z':
      case 'A': case 'B': ... case 'Z':
        takeIt();
        while (isLetter(currentChar)
            || isDigit(currentChar))
          takeIt();
        return Token.IDENTIFIER;
      case '0': ... case '9':
        ...
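
The helpers isLetter and isDigit used above are not defined on the slides. A minimal sketch, assuming plain ASCII letters and digits:

  private boolean isLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }

  private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
  }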
Developing a Scanner
The scanner will return instances of Token. The implementation below
is the one in the Triangle source code.

public class Token {
  ...
  public Token(byte kind, String spelling) {
    if (kind == Token.IDENTIFIER) {
      int currentKind = firstReservedWord;
      boolean searching = true;
      while (searching) {
        int comparison = tokenTable[currentKind].compareTo(spelling);
        if (comparison == 0) {
          this.kind = currentKind;
          searching = false;
        } else if (comparison > 0 || currentKind == lastReservedWord) {
          this.kind = Token.IDENTIFIER;
          searching = false;
        } else { currentKind++; }
      }
    } else
      this.kind = kind;
    ...
  }
}
Developing a Scanner
The scanner will return instances of Token:

public class Token {
  ...
  private static String[] tokenTable = new String[] {
    "<int>", "<char>", "<identifier>", "<operator>",
    "array", "begin", "const", "do", "else", "end",
    "func", "if", "in", "let", "of", "proc", "record",
    "then", "type", "var", "while",
    ".", ":", ";", ",", ":=", "~", "(", ")", "[", "]", "{", "}", "",
    "<error>" };

  private final static int firstReservedWord = Token.ARRAY,
                           lastReservedWord  = Token.WHILE;
  ...
}
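
To illustrate the keyword lookup performed by the constructor shown earlier (the resulting kinds are what the tokenTable search yields):

  Token t1 = new Token(Token.IDENTIFIER, "while");  // t1.kind becomes Token.WHILE
  Token t2 = new Token(Token.IDENTIFIER, "count");  // "count" is not a reserved word,
                                                    // so t2.kind stays Token.IDENTIFIER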
Generating Scanners
Generation of scanners is based on
• Regular Expressions: to describe the tokens to be recognized
• Finite State Machines: an execution model to which REs are
“compiled”
Recap: Regular Expressions

  ε      The empty string
  t      Generates only the string t
  X Y    Generates any string xy such that x is generated by X
         and y is generated by Y
  X | Y  Generates any string which is generated either by X or by Y
  X*     Generates the concatenation of zero or more strings generated by X
  (X)    For grouping
Generating Scanners
• Regular expressions can be recognized by a finite state machine
  (often-used synonym: finite automaton (FA)).

Definition: A finite state machine is a 5-tuple (States, Σ, start, δ, End) where

  States  A finite set of "states"
  Σ       An "alphabet": a finite set of symbols from which the
          strings we want to recognize are formed (for example:
          the ASCII character set)
  start   A "start state", start ∈ States
  δ       A transition relation, δ ⊆ States × Σ × States. These are
          the "arrows" between states, labeled by a letter from the
          alphabet.
  End     A set of final states, End ⊆ States
Generating Scanners
• Finite state machine: the easiest way to describe a finite state
  machine is by means of a picture.

Example: an FA that recognizes M r | M s

[Diagram: from the initial state, one branch reads M then r and another
branch reads M then s, each ending in a final state. The picture uses
distinct symbols for the initial state, final states and non-final states.]
Deterministic and non-deterministic FA
• An FA is called deterministic (DFA) if for every state and every
  possible input symbol there is only one possible transition to
  choose from. Otherwise it is called non-deterministic (NDFA or NFA).

Q: Is this FSM deterministic or non-deterministic?

[Diagram: the FA for M r | M s from the previous slide.]

It is non-deterministic: from the initial state there are two different
transitions labeled M.
Deterministic and non-deterministic FA
• Theorem: every NDFA can be converted into an equivalent DFA.

[Diagram: the NDFA for M r | M s, with the question of what an
equivalent DFA looks like.]
Deterministic and non-deterministic FA
• Theorem: every NDFA can be converted into an equivalent DFA.

Algorithm:
The basic idea: the DFA is defined as a machine that does a "parallel
simulation" of the NDFA.
• The states of the DFA are subsets of the states of the NDFA
  (i.e. every state of the DFA is a set of states of the NDFA).
  => Such a state can be interpreted as meaning "the simulated
  NDFA is now in any of these states". (A sketch of this construction
  in code is given below.)
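
A minimal sketch of this parallel simulation (the subset construction) for an NDFA without ε-moves, assuming the NDFA's transition relation is given as a map from state and symbol to a set of successor states; all names here are illustrative, not taken from the Triangle sources:

  import java.util.*;

  class SubsetConstruction {
    // Result: for each reachable subset of NDFA states, one DFA transition row.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDFA(
        Map<Integer, Map<Character, Set<Integer>>> delta,
        int start, Set<Character> alphabet) {
      Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
      Deque<Set<Integer>> work = new ArrayDeque<>();
      work.push(Set.of(start));
      while (!work.isEmpty()) {
        Set<Integer> current = work.pop();
        if (dfa.containsKey(current)) continue;          // already processed
        Map<Character, Set<Integer>> row = new HashMap<>();
        for (char sym : alphabet) {
          Set<Integer> next = new HashSet<>();
          for (int s : current)                           // "parallel simulation": follow
            next.addAll(delta.getOrDefault(s, Map.of())   // sym from every NDFA state
                             .getOrDefault(sym, Set.of())); // in the current subset
          if (!next.isEmpty()) { row.put(sym, next); work.push(next); }
        }
        dfa.put(current, row);
      }
      return dfa;   // a DFA state is final iff it contains a final NDFA state
    }
  }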
Deterministic and non-deterministic FA
Conversion algorithm example:

[Diagram: an NDFA with states 1, 2, 3 and 4 over the symbols M, r and s,
and the equivalent DFA with states {1}, {2,4}, {3,4} and {4}.]

  {2,4} --r--> {3,4}  because:
      2 --r--> 3
      4 --r--> 3
      4 --r--> 4

  {3,4} is a final state because 3 is a final state.
FA with ε-moves
(N)DFA-ε automata are like (N)DFA, except that in an (N)DFA-ε we are
also allowed to have transitions which are "ε-moves": transitions that
consume no input.

Example: M r (M r)*

[Diagram: a machine that reads M then r into a final state, with an
ε-move from the final state back to the initial state.]

Theorem: every (N)DFA-ε can be converted into an equivalent NDFA
(without ε-moves).

[Diagram: the equivalent NDFA, in which the ε-move is replaced by a
direct M-transition out of the final state.]
FA with ε-moves
Theorem: every (N)DFA-ε can be converted into an equivalent NDFA
(without ε-moves).

Algorithm:
1) Converting states into final states:
   if a final state can be reached from a state S using an ε-transition,
   convert S into a final state.
   Repeat this rule until no more states can be converted.

   For example:

   [Diagram: state 1 --ε--> state 2 --ε--> a final state. First state 2
   is converted into a final state, then state 1.]
FA with ε-moves
Algorithm:
1) Converting states into final states (as above).
2) Adding transitions (repeat until no more can be added):
   a) for every transition on t followed by an ε-transition
      (A --t--> B --ε--> C), add the transition A --t--> C;
   b) for every transition on t preceded by an ε-transition
      (A --ε--> B --t--> C), add the transition A --t--> C.
3) Delete all ε-transitions.
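
Both steps rely on following chains of ε-transitions. A minimal sketch of the underlying ε-closure computation, assuming the ε-moves are given as a map from each state to the set of states reachable by a single ε-transition (names illustrative): step 1 makes a state final if its ε-closure contains a final state.

  import java.util.*;

  class EpsilonClosure {
    // All states reachable from `state` using only ε-transitions (including state itself).
    static Set<Integer> of(int state, Map<Integer, Set<Integer>> epsMoves) {
      Set<Integer> closure = new HashSet<>();
      Deque<Integer> work = new ArrayDeque<>();
      work.push(state);
      while (!work.isEmpty()) {
        int s = work.pop();
        if (closure.add(s))                                  // a newly reached state:
          work.addAll(epsMoves.getOrDefault(s, Set.of()));   // follow its ε-moves too
      }
      return closure;
    }
  }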
Converting an RE into an NDFA-ε

RE: ε      FA: [a single state that is both initial and final]

RE: t      FA: [an initial state with a t-transition to a final state]

RE: X Y    FA: [the machine for X, followed by an ε-transition into the
                machine for Y]
Converting an RE into an NDFA-ε

RE: X|Y    FA: [from a new initial state, an ε-transition into the machine
                for X and an ε-transition into the machine for Y; from the
                end of each, an ε-transition into a common final state]

RE: X*     FA: [ε-transitions allow the machine for X to be bypassed
                entirely or traversed any number of times]
FA and the implementation of Scanners
• Regular expressions, (N)DFA-ε, NDFAs and DFAs are all equivalent
  formalisms in terms of what languages can be defined with them.
• Regular expressions are a convenient notation for describing the
  "tokens" of programming languages.
• Regular expressions can be converted into FAs (the algorithm for
  conversion into an NDFA-ε is straightforward).
• DFAs can easily be implemented as computer programs.
FA and the implementation of Scanners

What a typical scanner generator does:

  Token definitions            Scanner Generator            Scanner
  (regular expressions)  --->                         --->  (a DFA in Java or C or ...)

A possible algorithm:
  - Convert the REs into an NDFA-ε
  - Convert the NDFA-ε into an NDFA
  - Convert the NDFA into a DFA
  - Generate Java/C/... code

Note: in practice this exact algorithm is not used. For performance
reasons, sophisticated optimizations are applied, e.g.
  • direct conversion from REs to a DFA
  • minimizing the DFA
Implementing a DFA
Definition: A finite state machine is a 5-tuple (States, Σ, start, δ, End).

  States  N different states => integers {0,..,N-1} => int data type
  Σ       byte or char data type
  start   an integer number
  δ       the transition relation δ ⊆ States × Σ × States.
          For a DFA this is a function States × Σ -> States,
          represented by a two-dimensional array (one dimension
          for the current state, another for the current character);
          the contents of the array is the next state.
  End     a set of final states, represented (for example) by an
          array of booleans (final states marked true, other
          states false).
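
A minimal sketch of this table-driven representation, assuming states numbered 0..N-1 over the char alphabet (the table contents themselves would come from the token definitions and are not shown):

  class DFA {
    static final int REJECT = -1;   // marks "no transition" in the table

    int[][] delta;                  // delta[state][character] = next state (or REJECT)
    boolean[] isFinal;              // isFinal[state] = true iff state is a final state
    int start;                      // the start state

    boolean accepts(String input) {
      int state = start;
      for (int i = 0; i < input.length() && state != REJECT; i++)
        state = delta[state][input.charAt(i)];
      return state != REJECT && isFinal[state];
    }
  }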
JLex Regular Expressions
• Regular expressions are expressed using ASCII
characters (0 – 127).
• The following characters are metacharacters.
? * + | ( ) ^ $ . [ ] { } " \
• Metacharacters have special meaning; they do not
represent themselves.
• All other characters represent themselves.
THANK YOU
