Chapter 2 - Lexical Analyser
Instructor: Mohammed O.
Samara University
Chapter Two
This Chapter Covers:
Role of lexical analyser
Token Specification and Recognition
NFA to DFA
Lexical Analyzer
The lexical analyzer reads the source program character by
character to produce tokens.
Normally a lexical analyzer does not return a list of tokens
in one shot; it returns a token each time the parser asks
for one.
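The on-demand behaviour described above can be sketched with a Python generator. This is only an illustration under simple assumptions: the token names id, num and sym and the function name tokens are hypothetical, and only identifiers, unsigned integers and single-character symbols are handled.

```python
from typing import Iterator, Tuple

def tokens(source: str) -> Iterator[Tuple[str, str]]:
    """Yield (token, lexeme) pairs one at a time, reading character by character."""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():              # skip whitespace between tokens
            i += 1
        elif ch.isalpha():            # identifier: a letter, then letters and digits
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            yield ("id", source[start:i])
        elif ch.isdigit():            # number: one or more digits
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            yield ("num", source[start:i])
        else:                         # any other character is its own token
            i += 1
            yield ("sym", ch)

# The "parser" side: each call to next() asks the lexer for exactly one token.
stream = tokens("x1 = 42")
first = next(stream)
second = next(stream)
```

Nothing beyond the requested token is scanned: after the two calls above, the lexeme 42 has not been read yet.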
Token
A token represents a set of strings described by a pattern.
For example, identifier represents the set of strings that start
with a letter and continue with letters and digits.
Lexeme: a sequence of characters in the source
program that is matched by the pattern for a token.
Tokens: identifier, number, addop, delimiter, …
Since a token can represent more than one lexeme,
additional information must be held for each specific
lexeme. This additional information is called the
attribute of the token.
For simplicity, a token may have a single attribute which
holds the required information for that token.
For identifiers, this attribute is a pointer to the symbol table,
and the symbol table holds the actual attributes for that
token.
Token (Cont.)
Some attributes:
<id,attr> where attr is a pointer to the symbol table
<assgop,_> no attribute is needed (if there is only one
assignment operator)
<num,val> where val is the actual value of the number
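As a rough illustration of these three attribute forms, the sketch below models a token as a (kind, attr) pair. The names Token, symbol_table and install_id are hypothetical, and the "pointer" into the symbol table is simplified to a list index.

```python
from dataclasses import dataclass
from typing import Optional

symbol_table: list = []   # hypothetical symbol table: one entry per distinct identifier

def install_id(name: str) -> int:
    """Return the symbol-table index of name, adding a new entry if needed."""
    for index, entry in enumerate(symbol_table):
        if entry["name"] == name:
            return index
    symbol_table.append({"name": name})
    return len(symbol_table) - 1

@dataclass
class Token:
    kind: str                    # token name, e.g. "id", "assgop", "num"
    attr: Optional[int] = None   # single attribute; None when no attribute is needed

# Tokens for an input such as:  x := 1
tok_x = Token("id", install_id("x"))   # <id,attr>: attr is the symbol-table index
tok_assign = Token("assgop")           # <assgop,_>: no attribute
tok_num = Token("num", 1)              # <num,val>: val is the number's value
```

Looking up the same identifier again returns the same index, so every occurrence of x shares one symbol-table entry.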
The parser repeatedly calls the scanner until all the
tokens have been read from the input stream or an error is
detected (such as a syntax error).
Some tokens require some extra information.
For example, an identifier is a token (so it is represented by
some number) but it is also associated with a string that
holds the identifier name.
Scanner (Cont.)
For example, the token id(x) is associated with the string, "x".
Similarly, the token num(1) is associated with the number, 1.
Tokens are specified by patterns, which are written as regular expressions.
For example, the regular expression [a-z][a-zA-Z0-9]*
recognises all identifiers that begin with a lower-case letter
and continue with zero or more letters and digits.
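The pattern [a-z][a-zA-Z0-9]* can be tried out directly with Python's re module. This is a small sketch; the helper name is_identifier and the sample strings are made up for illustration.

```python
import re

# The identifier pattern from the text, compiled once.
IDENTIFIER = re.compile(r"[a-z][a-zA-Z0-9]*")

def is_identifier(text: str) -> bool:
    """True only if the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

results = {s: is_identifier(s) for s in ["x", "count1", "Count", "1x", ""]}
```

Here x and count1 are accepted, while Count (upper-case first letter), 1x (starts with a digit) and the empty string are rejected.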
A typical scanner:
recognises the keywords of the language (these are the
reserved words that have a special meaning in the language,
such as the word class in Java);
recognises and processes special directives (such as the
#include "file" directive in C);
Scanner (Cont.)
recognises special characters, such as parentheses ( and ),
or groups of special characters, such as := (equal by
definition) and ==;
recognises identifiers, integers, reals, decimals, strings, etc.;
ignores whitespace and comments.
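The duties listed above can be sketched as a small regular-expression-driven scanner. Everything here is an assumption made for illustration: the keyword set, the token names, the // comment syntax and the operator list are hypothetical, and only a subset of the listed token classes is covered.

```python
import re

KEYWORDS = {"class", "if", "while"}   # hypothetical keyword set

# Order matters: earlier alternatives win, so comments are tried before "/".
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),
    ("skip",    r"\s+"),
    ("real",    r"\d+\.\d+"),
    ("int",     r"\d+"),
    ("id",      r"[A-Za-z][A-Za-z0-9]*"),
    ("op",      r":=|==|[()=+\-*/;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Return the list of (kind, lexeme) pairs found in source."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("skip", "comment"):
            continue                  # whitespace and comments produce no token
        if kind == "id" and lexeme in KEYWORDS:
            kind = "keyword"          # reserved words look like identifiers
        tokens.append((kind, lexeme))
    return tokens

result = scan("while (x1 == 3.14) y := 2; // done")
```

Keywords are recognised by first matching the identifier pattern and then checking the lexeme against the keyword set, which keeps the pattern list short.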
Hand Implementation
There are two approaches to implementing a lexical analyser by hand:
Input buffer approach
Transition diagram approach
Input Buffering
The lexical analyser scans the characters of the source
program one at a time to discover tokens.
Cont.
Often, many characters beyond (in addition to) the next
token may have to be examined before the next token itself
can be determined.
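The simplest case of this is a single character of lookahead: after reading =, the scanner cannot tell whether the token is = or == until it has examined the next character. The sketch below is only an illustration; the function name scan_operator and the token names are hypothetical, and real lexers may need to look further ahead than one character.

```python
def scan_operator(buffer: str, pos: int):
    """Return (token, next_pos) for an operator starting at buffer[pos]."""
    ch = buffer[pos]
    # Peek at the following character without consuming it.
    lookahead = buffer[pos + 1] if pos + 1 < len(buffer) else ""
    if ch == "=" and lookahead == "=":
        return ("eq", pos + 2)       # "==" consumes both characters
    if ch == "=":
        return ("assign", pos + 1)   # lookahead is left for the next token
    if ch == "<" and lookahead == "=":
        return ("le", pos + 2)
    if ch == "<":
        return ("lt", pos + 1)
    raise ValueError(f"not an operator: {ch!r}")
```

When the lookahead character does not belong to the current token, the returned position leaves it in the buffer so that it becomes the start of the next token.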
Example
L1 = {a,b,c,d}   L2 = {1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}   (concatenation)
L1 ∪ L2 = {a,b,c,d,1,2}   (union)
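The two results in this example, the concatenation of L1 with L2 and their union, can be checked with Python sets of strings. The variable names are chosen here for illustration.

```python
L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}

# Concatenation: every string of L1 followed by every string of L2.
concatenation = {x + y for x in L1 for y in L2}

# Union: every string that is in L1 or in L2.
union = L1 | L2
```

The concatenation has 4 × 2 = 8 strings, while the union has 4 + 2 = 6 because the two languages share no strings.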
ε-closure({0}) = {0,1,2,4,7} = S0
mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S0,a] ← S1   transfunc[S0,b] ← S2
mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1   transfunc[S1,b] ← S2
mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1   transfunc[S2,b] ← S2
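The trace above can be reproduced with a short subset-construction sketch. The NFA diagram is not given in the text, so the transition table below is an assumption: it is a standard NFA for (a|b)*a reconstructed so that its closures agree with the sets listed above. The names NFA, EPS, eps_closure, move and subset_construction are chosen here for illustration.

```python
from collections import deque

EPS = ""  # label used in this sketch for ε-moves

# Assumed NFA for (a|b)*a, reconstructed to agree with the closures in the trace.
NFA = {
    (0, EPS): {1, 7},
    (1, EPS): {2, 4},
    (2, "a"): {3},
    (4, "b"): {5},
    (3, EPS): {6},
    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},
}

def eps_closure(states):
    """All NFA states reachable from `states` using ε-moves only."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, symbol):
    """All NFA states reachable from `states` on one `symbol` transition."""
    return {t for s in states for t in NFA.get((s, symbol), ())}

def subset_construction(start, alphabet):
    start_set = eps_closure({start})
    names = {start_set: "S0"}          # DFA state (set of NFA states) -> name
    transfunc = {}
    unmarked = deque([start_set])
    while unmarked:
        current = unmarked.popleft()   # "mark" the next unmarked DFA state
        for symbol in alphabet:
            target = eps_closure(move(current, symbol))
            if not target:
                continue               # no transition on this symbol
            if target not in names:
                names[target] = f"S{len(names)}"
                unmarked.append(target)
            transfunc[(names[current], symbol)] = names[target]
    return names, transfunc

names, transfunc = subset_construction(0, "ab")
```

Under this assumed NFA the construction stops after three DFA states, S0, S1 and S2, with the same transfunc entries as the trace.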
Converting an NFA into a DFA (Cont.)
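Syntax tree of (a|b)*a#
[Figure: syntax tree. The end marker # is the leaf at position 4, the final a is the leaf at position 3, and the star node * sits above the alternation node |.]
• each symbol is numbered (positions)
• each symbol is at a leaf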
G1 = {2}
G2 = {1,3}
State   a   b
1       2   3
2       2   3
3       4   3
RE 0* = ?
RE (0|1)* = ?
RE (0|1)*11 = ?