Lecture 2.76
Compiler Construction
Lecture 2
Mahzaib Younas
Lecturer, Department of Computer Science
FAST NUCES CFD
Outline
• The role of lexical analyzer
• Input Buffering
• Specification of tokens
• Recognition of tokens
• Lexical Analyzer Generator Lex
• Finite Automata
• Design of a lexical analyzer generator
• Optimization of DFA-based pattern matchers
Lexical Analysis
• The main task of the lexical analyzer is to read the input characters of
the source program, group them into lexemes, and produce as output a
sequence of tokens, one for each lexeme in the source program.
• It takes the source code as modified by the language preprocessor and
breaks it into a series of tokens, discarding whitespace and comments
along the way.
Lexeme
• A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is therefore an instance of that token.
• Example:
• int c = 5;
Lexeme    Token
int       keyword
c         identifier
=         assignment operator
5         constant
;         symbol
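The table above can be reproduced with a few lines of code. The following is a minimal sketch, not a full lexer: the function name `tokenize` and the token-class strings are assumptions made for illustration, and `int` is the only keyword it knows about.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Splits a statement such as "int c = 5;" into (lexeme, token class) pairs.
// Only the token classes from the table are handled; "int" is the only
// keyword this sketch recognizes.
std::vector<std::pair<std::string, std::string>> tokenize(const std::string& src) {
    std::vector<std::pair<std::string, std::string>> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char ch = static_cast<unsigned char>(src[i]);
        if (std::isspace(ch)) {            // whitespace only separates lexemes
            ++i;
        } else if (std::isalpha(ch)) {     // keyword or identifier
            std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string lexeme = src.substr(start, i - start);
            out.emplace_back(lexeme, lexeme == "int" ? "keyword" : "identifier");
        } else if (std::isdigit(ch)) {     // integer constant
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.emplace_back(src.substr(start, i - start), "constant");
        } else if (ch == '=') {            // assignment operator
            out.emplace_back("=", "assignment operator");
            ++i;
        } else {                           // any other single character
            out.emplace_back(std::string(1, src[i]), "symbol");
            ++i;
        }
    }
    return out;
}
```

Calling `tokenize("int c = 5;")` yields five pairs, one per row of the table.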
Pattern
• A pattern is a description of the form that the lexemes of a token
may take. In the case of a keyword as a token, the pattern is just the
sequence of characters that form the keyword.
• For identifiers and some other tokens, the pattern is a more complex
structure that is matched by many strings.
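To make the two kinds of pattern concrete, here is a small sketch using `std::regex` from the C++ standard library. The identifier pattern (a letter or underscore followed by letters, digits, or underscores) is an assumption based on C-like languages, and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Pattern for the keyword token "int": exactly that character sequence.
bool matchesIntKeyword(const std::string& lexeme) {
    static const std::regex pattern("int");
    return std::regex_match(lexeme, pattern);
}

// Assumed identifier pattern: a letter or underscore, followed by any
// number of letters, digits, or underscores. Many strings match it.
bool matchesIdentifier(const std::string& lexeme) {
    static const std::regex pattern("[A-Za-z_][A-Za-z0-9_]*");
    return std::regex_match(lexeme, pattern);
}
```

Note that the string `int` matches both patterns, so a lexer has to check keywords before falling back to the identifier rule.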
Tokens
A token is a pair consisting of a token name and an optional attribute value.
Partition the input string into substrings and classify each one according to the token rules:
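Token nextToken() {
    if (idChar(next))        // next character can start an identifier
        return readId();
    if (number(next))        // next character is a digit
        return readNumber();
    if (next == '"')         // opening quote of a string literal
        return readString();
    ...
}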
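The definition of a token as a name plus an optional attribute value can be sketched as a small C++ type. The enumerator names and helper functions below are hypothetical; the slides do not define them.

```cpp
#include <string>

// Token names used in this sketch (hypothetical choices).
enum class TokenName { Keyword, Identifier, Number, String, Operator };

// A token is a name plus an optional attribute value (here, the lexeme text).
struct Token {
    TokenName name;
    bool hasAttribute;       // false when the token name alone is enough
    std::string attribute;   // meaningful only when hasAttribute is true
};

// An identifier token carries its lexeme so later phases can look it up.
Token identifierToken(const std::string& lexeme) {
    return Token{TokenName::Identifier, true, lexeme};
}

// An assignment operator needs no attribute in this sketch.
Token assignToken() {
    return Token{TokenName::Operator, false, ""};
}
```

With a C++17 compiler, `std::optional<std::string>` could replace the flag and string pair.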
How to perform tokenization in a program
• Here we declare the method of the lexer class:
Token nextToken() {
    if (idChar(next))
        return readId();      // the character can start an identifier, so apply the identifier rule
    if (number(next))
        return readNumber();  // the character is a digit, so apply the number rule
    if (next == '"')
        return readString();  // an opening quote starts a string literal
    ...
}
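How to make the token for an identifier
Token readId() {
    string id = "";
    while (true) {
        char c = input.read();
        if (idChar(c) == false)
            // c ends the identifier; a complete lexer must keep c as
            // lookahead so that it is not lost to the next token
            return Token(TID, id);
        id += c;              // append the identifier character
    }
}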
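A self-contained variant of `readId` can be written against `std::istream`, whose `peek()` looks at the next character without consuming it. This is a sketch under assumptions: the `Token` struct, the `TokenKind` enum, and the identifier rule in `idChar` (letters, digits, underscore) are illustrative and not taken from the slides.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical token type for this sketch; TID marks an identifier token.
enum TokenKind { TID };
struct Token {
    TokenKind kind;
    std::string text;
};

// Assumed identifier rule: letters, digits, and underscores.
bool idChar(int c) {
    return c != std::char_traits<char>::eof() && (std::isalnum(c) || c == '_');
}

// Reads the longest run of identifier characters, leaving the first
// non-identifier character unread for whichever rule comes next.
Token readId(std::istream& input) {
    std::string id;
    while (idChar(input.peek())) {
        id += static_cast<char>(input.get());
    }
    return Token{TID, id};
}
```

Reading from `std::istringstream in("count = 5")` returns the lexeme `count` and leaves the following space in the stream.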
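Ad-Hoc Lexer using C++
// Returns 1 if buffer holds one of the 32 C keywords, 0 otherwise.
// Needs <cstring> for strcmp.
int isKeyword(char buffer[]) {
    const char* keywords[32] = {
        "auto", "break", "case", "char", "const", "continue", "default", "do",
        "double", "else", "enum", "extern", "float", "for", "goto", "if",
        "int", "long", "register", "return", "short", "signed", "sizeof", "static",
        "struct", "switch", "typedef", "union", "unsigned", "void", "volatile", "while" };
    int i, flag = 0;
    for (i = 0; i < 32; ++i) {
        if (strcmp(keywords[i], buffer) == 0) {
            flag = 1;         // buffer matches a keyword
            break;
        }
    }
    return flag;
}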
Checking for operators
    // declaration of the operator characters
    char operators[] = "+-*/%=";
    for (i = 0; i < 6; ++i) {
        if (ch == operators[i])
            cout << ch << " is operator\n";
    }
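Checking identifiers and keywords
    if (isalnum(ch)) {
        buffer[j++] = ch;         // collect letters and digits into the current word
    }
    else if ((ch == ' ' || ch == '\n') && (j != 0)) {
        buffer[j] = '\0';         // terminate the collected word
        j = 0;
        if (isKeyword(buffer) == 1)
            cout << buffer << " is keyword\n";
        else
            cout << buffer << " is identifier\n";
    }
}   // closes the enclosing character-reading loop (its header is not shown here)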
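The fragments above can be combined into one function that scans a string and reports each operator and word. This is a sketch, not the original program: `classify` is a hypothetical name, the keyword list is shortened for brevity, and words are closed at any non-alphanumeric character and at end of input rather than only at a space or newline. As in the slide code, a run of digits would also be reported as an identifier, because `isalnum` accepts digits.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Reduced keyword list for the sketch; the slide version holds all 32 C keywords.
bool isKeyword(const std::string& word) {
    static const std::vector<std::string> keywords = {
        "int", "float", "if", "else", "while", "return"};
    for (const std::string& k : keywords)
        if (k == word) return true;
    return false;
}

// Scans src once and returns one "<lexeme> is <class>" line per operator or word.
std::vector<std::string> classify(const std::string& src) {
    const std::string operators = "+-*/%=";
    std::vector<std::string> report;
    std::string buffer;

    // Emits the word collected so far, if any, and resets the buffer.
    auto flush = [&]() {
        if (buffer.empty()) return;
        report.push_back(buffer + (isKeyword(buffer) ? " is keyword" : " is identifier"));
        buffer.clear();
    };

    for (char ch : src) {
        if (std::isalnum(static_cast<unsigned char>(ch))) {
            buffer += ch;                       // still inside a word
        } else {
            flush();                            // the word ended before this character
            if (operators.find(ch) != std::string::npos)
                report.push_back(std::string(1, ch) + " is operator");
        }
    }
    flush();                                    // word that runs up to end of input
    return report;
}
```

For the input `int a = b + c`, the report lists `int` as a keyword, `a`, `b`, and `c` as identifiers, and `=` and `+` as operators, in source order.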
How to describe the tokens?
Regular languages are the most popular formalism for specifying tokens.
• Implementation: Finite Automata
Finite Automata
Finite Automaton consists of
• An input alphabet (Σ)
• A set of states
• A start (initial) state
• A set of transitions
• A set of accepting (final) states
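The five components listed above map directly onto a data structure. The sketch below assumes a deterministic automaton (at most one next state per state and symbol); the names `FiniteAutomaton` and `isWellFormed` are hypothetical.

```cpp
#include <map>
#include <set>
#include <utility>

// One possible in-memory form of the five components of a finite automaton.
// Deterministic by construction: each (state, symbol) maps to one state.
struct FiniteAutomaton {
    std::set<char> alphabet;                          // input alphabet
    std::set<int> states;                             // set of states
    int start;                                        // start (initial) state
    std::map<std::pair<int, char>, int> transitions;  // (state, symbol) -> next state
    std::set<int> accepting;                          // accepting (final) states
};

// Checks that the start state, the accepting states, and every transition
// refer only to declared states and symbols.
bool isWellFormed(const FiniteAutomaton& fa) {
    if (!fa.states.count(fa.start)) return false;
    for (int s : fa.accepting)
        if (!fa.states.count(s)) return false;
    for (const auto& t : fa.transitions) {
        if (!fa.states.count(t.first.first)) return false;    // source state
        if (!fa.alphabet.count(t.first.second)) return false; // input symbol
        if (!fa.states.count(t.second)) return false;         // target state
    }
    return true;
}
```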
Finite Automata
• A finite automaton accepts a string if we can follow transitions
labelled with the characters of the string from the start state to some
accepting state.
• FA example: an FA that accepts any number of 1's followed by a
single 0.
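The acceptance rule and the example automaton can be sketched in a few lines. The state numbering is an assumption of this sketch: state 0 is the start state, state 1 is the accepting state reached after the single 0, and state 2 is a dead state that absorbs every other input.

```cpp
#include <string>

// Deterministic automaton for "any number of 1s followed by a single 0".
// State 0: start, still reading 1s. State 1: the single 0 has been read
// (accepting). State 2: dead state, no way back to acceptance.
bool accepts(const std::string& input) {
    int state = 0;
    for (char ch : input) {
        if (state == 0 && ch == '1') state = 0;       // loop on 1 in the start state
        else if (state == 0 && ch == '0') state = 1;  // the single 0
        else state = 2;                               // any other move rejects
    }
    return state == 1;   // accept only if we stopped in the accepting state
}
```

The strings `0` and `1110` are accepted, while the empty string, `111`, and `1100` are rejected.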