Lec2 LexicalAnalyser
Lexical Analyzer
● First phase of a compiler.
● Reads the input characters of the source program and groups them into lexemes.
● Produces as output a sequence of tokens, one for each lexeme in the source program.
Interaction between the parser and the lexical analyzer: the getNextToken command causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
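The pull-style interaction above can be sketched as follows (a minimal illustration; the class shape, token fields, and the tiny recognizer are our assumptions, not the slides' implementation):

```python
from collections import namedtuple

# Hypothetical token shape: a token name plus an optional attribute value.
Token = namedtuple("Token", ["name", "attribute"])

class LexicalAnalyzer:
    def __init__(self, source):
        self.source, self.pos = source, 0

    def getNextToken(self):
        src, n = self.source, len(self.source)
        while self.pos < n and src[self.pos].isspace():   # scanning: skip white space
            self.pos += 1
        if self.pos >= n:
            return Token("eof", None)
        if src[self.pos].isalpha():                       # group letters into an id lexeme
            start = self.pos
            while self.pos < n and src[self.pos].isalnum():
                self.pos += 1
            return Token("id", src[start:self.pos])
        ch = src[self.pos]
        self.pos += 1                                     # single-character token
        return Token(ch, None)

# The parser drives the lexer one token at a time:
lexer = LexicalAnalyzer("x = y")
tokens = []
while (tok := lexer.getNextToken()).name != "eof":
    tokens.append(tok)
```

Each call returns exactly one token, so the parser never sees raw characters.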
Lexical Analyzer
● Tasks of the Lexical Analyzer
– Scanning: stripping comments and white space.
– Lexical analysis: identifying lexemes and producing tokens from the output of the scanner.
– Correlating error messages issued by the compiler with the source program.
Lexical Analyzer
● Three terminologies
– Lexeme: a sequence of characters in the source program that matches the pattern for a token.
– Pattern: a description of the form that the lexemes of a token may take.
– Token: a pair of a token name (an abstract symbol, e.g., id for identifier) and an optional attribute value.
● Attribute value: differentiates tokens with the same name from each other; the attribute value describes the lexeme represented by the token. For example, number is a token with value 3.14, and number is another token with value 6.02.
● The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
● Token name = id (identifier)
– The attribute value will be a pointer to the symbol-table entry for that occurrence of id.
– Associated information in the symbol table: the lexeme, the position where it was first found, its type, etc.
Lexical Analyzer
● Patterns must cover all the tokens.
– One token for each keyword. The pattern for a keyword is the same as the keyword itself.
– Tokens for the operators, either individually or in classes, such as the token comparison mentioned for all comparative operators.
– One token representing all identifiers.
– One or more tokens representing constants, such as numbers and literal strings.
– Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Lexical Analyzer
● E = M * C ** 2
// Tokens generated by the lexical analyzer
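A small sketch of how a lexer might tokenize this statement (the token names id, assign, mult, exp, number and the pattern table are our labels, not from the slides; note that '**' must be listed before '*' so the longer lexeme wins):

```python
import re

# Hypothetical token patterns, tried left to right at each position.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+(?:\.\d+)?"),
    ("exp",    r"\*\*"),    # before '*' so ** is one lexeme
    ("mult",   r"\*"),
    ("assign", r"="),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Group the input characters into lexemes and emit (token, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "ws":          # the scanner strips white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("E = M * C ** 2"))
# [('id', 'E'), ('assign', '='), ('id', 'M'), ('mult', '*'),
#  ('id', 'C'), ('exp', '**'), ('number', '2')]
```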
Lexical Error
ofr i =1 to 10
{
//for loop block
}
L* denotes concatenating L zero or more times.
Constructing Patterns
L = {A, B, ..., Z, a, b, ..., z}
D = {0, 1, 2, ..., 9}
Two languages whose strings each have length 1.
- We use italics for symbols of the language, and boldface for their corresponding regular expressions.
r = L_( L_ | D )*
L_ - any letter or underscore,
() - groups subexpressions,
| - alternation (union),
* - zero or more occurrences of the preceding expression.
Fundamental rules:
- ϵ is a regular expression, and L(ϵ) = {ϵ}, the language whose sole member is the empty string.
- If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
Constructing Patterns using Regular
Expressions
Making larger regular expressions from smaller regular expressions r and s:
➢ (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
➢ (r)(s) is a regular expression denoting the language L(r)L(s).
➢ (r)* is a regular expression denoting L(r)*.
➢ (r)+ is a regular expression denoting L(r)+.
- r = (a|b)(a|b) gives L = {aa, ab, ba, bb}. Another regular expression for the same language is r = aa|ab|ba|bb.
- a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's ending in b.
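Both claims can be spot-checked with Python's re module by enumerating all strings over {a, b} up to length 3 (fullmatch tests language membership):

```python
import re
from itertools import product

# All strings over {a, b} of length 0..3.
words = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]

# (a|b)(a|b) and aa|ab|ba|bb denote the same language.
lang1 = {w for w in words if re.fullmatch(r"(a|b)(a|b)", w)}
lang2 = {w for w in words if re.fullmatch(r"aa|ab|ba|bb", w)}
print(lang1 == lang2 == {"aa", "ab", "ba", "bb"})   # → True

# a|a*b: the string a, plus zero or more a's followed by b.
assert all(re.fullmatch(r"a|a*b", w) for w in ["a", "b", "ab", "aab", "aaab"])
assert re.fullmatch(r"a|a*b", "ba") is None
```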
Algebraic laws for regular expression
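The table of laws does not survive in this copy of the slide; the usual laws (stated here from standard regular-expression algebra, not from the slide itself) can be spot-checked by brute-force enumeration:

```python
import re
from itertools import product

# Standard algebraic laws for regular expressions:
#   r|s = s|r              | is commutative
#   r|(s|t) = (r|s)|t      | is associative
#   r(st) = (rs)t          concatenation is associative
#   r(s|t) = rs|rt         concatenation distributes over |
#   ϵr = rϵ = r            ϵ is the identity for concatenation
#   r* = (r|ϵ)*            ϵ is guaranteed in a closure
#   r** = r*               * is idempotent

def language(pattern, max_len=4):
    """The strings of the pattern's language over {a, b}, up to max_len."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product("ab", repeat=n)
            if re.fullmatch(pattern, "".join(p))}

assert language(r"a(b|a)") == language(r"ab|aa")   # distributivity
assert language(r"(a*)*") == language(r"a*")       # idempotence of *
```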
Regular Definition
- For convenience, we give names to regular expressions to be used in subsequent expressions; this is called a regular definition.
d1 -> r1
d2 -> r2
...
dn -> rn
- Each di is a new symbol, not in the alphabet Σ and distinct from the other d's.
- Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
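Regular definitions can be expanded by textual substitution, since each name only refers to earlier names. A small sketch (the names letter_, digit, id follow the usual identifier example; the dict-based expansion is our illustration, lex-style tools do this with {name} references):

```python
import re

defs = {}
defs["letter_"] = r"[A-Za-z_]"
defs["digit"]   = r"[0-9]"
# id is defined over Σ plus the two earlier definitions:
defs["id"] = defs["letter_"] + "(?:" + defs["letter_"] + "|" + defs["digit"] + ")*"

print(bool(re.fullmatch(defs["id"], "count_1")))   # → True
print(bool(re.fullmatch(defs["id"], "1count")))    # → False
```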
- ^the[a-z]*: here ^ matches the beginning of the line. // ^ outside the character class []
Other regular expression operators
Recognition of Tokens
Grammar for a language of branching statements:
Possible terminals (which can be interpreted as tokens): if, else, then, relop, id, number.
- Initial state: the transition diagram starts with an initial/start state, marked by an incoming edge labeled 'start'.
- Final state: a final/accepting state indicates that a lexeme has been found, with an associated action if required. Represented by a double circle.
- Double circle with *: at times it is necessary to read one character beyond the end of the lexeme before the lexeme can be identified; that character is not part of the lexeme, so the forward pointer must be retracted one position. A * on a final state indicates one such retraction; more than one * means more than one character must be retracted.
Transition Diagrams
relop -> < | <= | <> | = | > | >=
- A field of the symbol table holds an entry for each reserved word of the programming language and provides the respective token name (this installation is not part of the lexical analysis process).
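The relop transition diagram above can be sketched directly as code (a minimal illustration; the function name and attribute labels LT/LE/EQ/NE/GT/GE are ours, and retraction is modeled by simply not consuming the lookahead character):

```python
def relop(text):
    """Recognize a relational-operator lexeme at the start of text.

    Returns ("relop", attribute) on success, or None if text does not
    begin with a relational operator.
    """
    if not text:
        return None
    ch, lookahead = text[0], text[1:2]
    if ch == "<":
        if lookahead == "=":
            return ("relop", "LE")   # <=
        if lookahead == ">":
            return ("relop", "NE")   # <>
        return ("relop", "LT")       # <  (retract: lookahead not consumed)
    if ch == "=":
        return ("relop", "EQ")       # =
    if ch == ">":
        if lookahead == "=":
            return ("relop", "GE")   # >=
        return ("relop", "GT")       # >  (retract: lookahead not consumed)
    return None                      # not a relational operator

print(relop("<= 10"))   # → ('relop', 'LE')
```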
Method-1
// We retract one position to get the actual lexeme.
- installID() checks whether the lexeme already exists in the symbol table; if not, it makes an entry in the symbol table for the lexeme. Either way, it returns a pointer to the symbol-table entry.
- getToken() returns the appropriate token name from the symbol table: either id or one of the keyword tokens that was initially installed in the table.
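A hedged sketch of installID()/getToken() over a dict-based symbol table pre-loaded with the reserved words (the dict stands in for the real table, and the returned key stands in for a pointer to the entry):

```python
# Keywords are installed up front, mapped to their own token names.
symbol_table = {"if": "if", "then": "then", "else": "else"}

def installID(lexeme):
    """Enter the lexeme if it is not already present; return its entry key."""
    if lexeme not in symbol_table:
        symbol_table[lexeme] = "id"   # an ordinary identifier
    return lexeme                     # stands in for a symbol-table pointer

def getToken(entry):
    """Return the token name stored for the entry: a keyword or 'id'."""
    return symbol_table[entry]

print(getToken(installID("if")))      # → if   (reserved word)
print(getToken(installID("count")))   # → id   (ordinary identifier)
```

This is the usual trick for reserved words: the lexer recognizes every letter-run the same way, and the table decides whether it was a keyword.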
- Lex translates all REs into automata in the background to check whether a string belongs to the RE's language or not.
- Automata: states and edges - nodes are states, and labeled edges are the transition function.
- Very similar to a transition diagram, except that
a) the same symbol can label edges from one state to several different states, and
b) an edge may be labeled by ϵ, the empty string, instead of, or in addition to, symbols from the input alphabet.
NFA
Transition Graph: Automata
- Lexical analyzer software is expected to implement automata in the background.
- Constructing an NFA is more straightforward than constructing a DFA on paper.
- Simulating an NFA, however, is less straightforward than simulating a DFA.
- Hence, every NFA is converted into an equivalent DFA.
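The NFA-to-DFA conversion mentioned above is the subset construction. A minimal sketch (the example NFA, which accepts strings over {a, b} ending in 'ab', and all helper names are our assumptions; the slides do not give an implementation):

```python
from collections import deque

# NFA transitions: (state, symbol) -> set of states; "" marks an ϵ-move.
nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},   # state 0 loops; 'a' may also go to 1
    (1, "b"): {2},                     # 'ab' just seen -> accept state 2
}
start, accept = 0, {2}

def eps_closure(states):
    """All states reachable from `states` via ϵ-moves (none in this NFA)."""
    stack, closure = list(states), set(states)
    while stack:
        for t in nfa.get((stack.pop(), ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction():
    """Build the DFA whose states are sets of NFA states."""
    d_start = eps_closure({start})
    dfa, seen, queue = {}, {d_start}, deque([d_start])
    while queue:
        S = queue.popleft()
        for sym in "ab":
            T = eps_closure({t for s in S for t in nfa.get((s, sym), ())})
            dfa[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                queue.append(T)
    return dfa

def accepts(word):
    """Run the constructed DFA; accept if the final set meets `accept`."""
    dfa, S = subset_construction(), eps_closure({start})
    for ch in word:
        S = dfa[(S, ch)]
    return bool(S & accept)

print(accepts("aab"), accepts("aba"))   # → True False
```

Each DFA state is a frozenset of NFA states, which is why simulating the resulting DFA needs no backtracking.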