Chapter 2 - Lexical Analysis
Upon receiving a ‘get next token’ command from the parser, the lexical analyzer (LA) reads input characters until it can
identify the next token. The LA then returns to the parser a representation of the token it has found. The representation
is an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source
program the comments and the white space in the form of blank, tab and newline characters. Another is correlating
error messages from the compiler with the source program.
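The two secondary tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names TOKEN_OR_SKIP and scan are hypothetical, comments are assumed to use C-style // and /* */ syntax, and every maximal run of word characters or single punctuation character is treated as a lexeme. The scanner discards comments and white space but keeps counting lines, so that each lexeme can be reported with the row and column where it was found.

```python
import re

# Assumed C-style comment syntax; the group names are illustrative only.
TOKEN_OR_SKIP = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)   # line or block comment, discarded
  | (?P<space>[ \t]+)                 # blanks and tabs, discarded
  | (?P<newline>\n)                   # newline, discarded but counted
  | (?P<token>\w+|[^\w\s])            # anything else is passed on as a lexeme
""", re.VERBOSE | re.DOTALL)

def scan(source):
    """Yield (line, column, lexeme), skipping comments and white space."""
    line, line_start = 1, 0
    for m in TOKEN_OR_SKIP.finditer(source):
        text = m.group()
        if m.lastgroup == "token":
            yield line, m.start() - line_start + 1, text
        newlines = text.count("\n")
        if newlines:
            # Keep the position bookkeeping correct even inside block comments.
            line += newlines
            line_start = m.start() + text.rfind("\n") + 1

source = "int a; // counter\n/* two\n   lines */ a = @;\n"
for line, column, lexeme in scan(source):
    print(f"line {line}, column {column}: {lexeme}")
```

Because the position travels with each lexeme, a later phase that rejects the stray @ can point the user at line 3, column 17 of the original source even though the comments were removed.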
2.3 LEXICAL ANALYSIS VS PARSING:
Lexical analysis:
A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like
identifiers, parentheses, operators etc. The lexical analyzer (the "lexer") parses individual symbols from the source
code file into tokens. From there, the "parser" proper turns those whole tokens into sentences of your grammar.
Parsing:
A parser converts this list of tokens into a tree-like object to represent how the tokens fit together to form a
cohesive whole (sometimes referred to as a sentence). A parser does not give the nodes any meaning beyond structural
cohesion. The next thing to do is extract meaning from this structure (sometimes called contextual analysis).
Alphabets
An alphabet is any finite set of symbols. {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string tutorialspoint is 14, denoted by |tutorialspoint| = 14. A
string with no symbols, i.e. a string of length zero, is known as the empty string and is denoted by ε (epsilon).
Assignment =
Preprocessor #
Shift Operator >>, >>>, <<, <<<
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Examples of tokens:
Type tokens (id, number, real, . . . )
Alphabetic tokens (keywords such as if, void, return, . . . )
Keywords; examples: for, while, if etc.
Identifiers; examples: variable names, function names etc.
Operators; examples: '+', '++', '-' etc.
Separators; examples: ',' ';' etc.
Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline etc
Lexeme: A sequence of input characters that is matched by the pattern for a token, and so comprises a single
token, is called a lexeme,
e.g. “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.
Example:
Token      Lexeme                   Pattern
const      const                    const
if         if                       if
relation   <, <=, =, <>, >=, >      < or <= or = or <> or >= or >
id         pi                       letter followed by letters and digits
num        3.14                     any numeric constant
literal    "core"                   any characters between " and " except "
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
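One way to make the idea of a pattern concrete is to write each one as a regular expression and ask which pattern a given lexeme matches. The sketch below does this for the tokens in the example table. The regular expressions are simplified assumptions (for instance, num only covers plain integers and decimals), and the names PATTERNS and classify are hypothetical.

```python
import re

# (token, pattern) pairs; keywords are listed before id so that they win.
PATTERNS = [
    ("const",    r"const"),
    ("if",       r"if"),
    ("relation", r"<=|<>|>=|<|=|>"),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),   # letter followed by letters and digits
    ("num",      r"\d+(\.\d+)?"),            # simplified numeric constant
    ("literal",  r'"[^"]*"'),                # any characters between " and " except "
]

def classify(lexeme):
    """Return the first token whose pattern matches the whole lexeme, else None."""
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None

for lexeme in ["const", "if", "<=", "pi", "3.14", '"core"', "@"]:
    print(lexeme, "->", classify(lexeme))
```

Many lexemes map to the same token (pi and any other identifier both give id), which is exactly the sense in which a pattern describes a set of strings.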
The lexical analyzer identifies errors with the help of an automaton (a finite-state machine) and the grammar of the
language on which it is based, such as C or C++, and reports the row number and column number of each error.
Suppose we pass the statement a=b+c; through the lexical analyzer. It will generate a token sequence like this:
id=id+id;
where each id refers to its variable's entry in the symbol table, which holds all the details of that variable.
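The following minimal sketch shows how such a token sequence and symbol table could be produced for a=b+c;. It assumes a tiny language of identifiers, integers and a handful of one-character operators, and the function name tokenize and the (kind, attribute) token layout are illustrative choices rather than a fixed convention.

```python
import re

def tokenize(source):
    """Return (tokens, symbol_table) for a tiny expression language."""
    symbol_table = {}   # identifier name -> index of its symbol-table entry
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+\-*/;]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            # Every identifier becomes the token id plus a symbol-table reference.
            index = symbol_table.setdefault(lexeme, len(symbol_table))
            tokens.append(("id", index))
        elif lexeme.isdigit():
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme, None))   # operators and ';' stand for themselves
    return tokens, symbol_table

tokens, table = tokenize("a=b+c;")
print("".join(kind for kind, _ in tokens))   # id=id+id;
print(table)                                 # {'a': 0, 'b': 1, 'c': 2}
```

The parser only sees id=id+id; the attribute stored with each id is what lets later phases look up the right variable.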
For example, consider the program
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
As another example, consider the printf statement below.
Exercise 3:
Count number of tokens and values:
int studno=43609023;
Exercise 4:
Count number of tokens:
int main()                           4
{                                    1
    int a = 10, b = 20;              9
    printf("sum is :%d" , a + b );   9
    printf("HELLO ");                5
    printf( "\n") ;                  5
    return 0;                        3
}                                    1
Answer: Total number of tokens = 37
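The per-line counts in Exercise 4 can be checked mechanically. The sketch below is a rough token counter for this small C fragment only: the regular expression TOKEN_RE is an assumption that covers string literals, names, integers and the few operators and separators used here, not the full C lexical grammar, and characters it does not recognise are silently skipped.

```python
import re

TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"     # string literal, escapes allowed
  | [A-Za-z_]\w*          # keyword or identifier
  | \d+                   # integer constant
  | [-+*/=<>!]=?          # operator, optionally followed by an equals sign
  | [(){};,]              # separators
''', re.VERBOSE)

def count_tokens(line):
    """Number of tokens on one source line; white space is ignored."""
    return len(TOKEN_RE.findall(line))

program = [
    'int main()',
    '{',
    'int a = 10, b = 20;',
    'printf("sum is :%d" , a + b );',
    'printf("HELLO ");',
    r'printf( "\n") ;',
    'return 0;',
    '}',
]

counts = [count_tokens(line) for line in program]
for line, n in zip(program, counts):
    print(f"{line:35} {n}")
print("Total number of tokens:", sum(counts))
```

Note that a whole string literal such as "sum is :%d" counts as a single token, which is why the first printf line has 9 tokens rather than more.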
2.6 Lexical Grammar and FSMs
To recognize a token described by a regular definition, the regular expression in the definition is often transformed
into a finite-state machine (FSM). The resulting FSM has a finite number of states, including an initial state and a
set of accepting states.
For example, the regular expression a | b can be converted into the following FSM.
[Figures: FSMs for a|b, ab and a*]
[Figures: FSMs for R = (a|b)c, (a|b), R = (a|b)*c and R = a(bc)*]
Concatenating the above two FSMs will give us the FSM for a(bc)*.
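As a rough illustration of how such an FSM is used for recognition, the sketch below encodes a deterministic machine for a(bc)* as a transition table and runs it over an input string. The state numbering (0 initial, 1 accepting, 2 intermediate) and the names TRANSITIONS and accepts are assumptions made for this example, not taken from the diagrams.

```python
# Transition table: state -> {input symbol -> next state}
TRANSITIONS = {
    0: {"a": 1},   # initial state: expect the leading a
    1: {"b": 2},   # accepting state: a followed by zero or more bc
    2: {"c": 1},   # saw b, need c to complete one bc
}
START = 0
ACCEPTING = {1}

def accepts(text):
    """Return True if the FSM ends in an accepting state after reading text."""
    state = START
    for ch in text:
        state = TRANSITIONS[state].get(ch)
        if state is None:        # no transition defined: reject
            return False
    return state in ACCEPTING

for s in ["a", "abc", "abcbc", "", "ab", "bc"]:
    print(repr(s), accepts(s))
```

The table has one row per state and is finite, while the set of accepted strings (a, abc, abcbc, ...) is infinite because of the loop between states 1 and 2.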
Language Definition
• Appearance of a programming language
  o Vocabulary: regular expressions
  o Syntax: Backus-Naur Form (BNF) or context-free grammar (CFG)
• To specify the syntax of a language: CFG and BNF
  o Example: the if-else statement in C has the form
    statement → if ( expression ) statement else statement
• Alphabet: an alphabet of a language is a set of symbols.
  o Examples: {0,1} for the binary number system, language = {0,1,100,101,...}
  o {a,b,c} for language = {a,b,c,ac,abcc,...}
  o {if,(,),else,...} for if statements = {if(a==1)goto10, if--}
• String: a string over an alphabet is a sequence of zero or more symbols from the alphabet.
  o Examples: 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
  o The null string is the string that contains no symbols of the alphabet.
• Language: a language is a subset of all the strings over a given alphabet.
  o Alphabets Ai and languages Li over Ai:
    A0 = {0,1}              L0 = {0,1,100,101,...}
    A1 = {a,b,c}            L1 = {a,b,c,ac,abcc,...}
    A2 = {all C tokens}     L2 = {all sentences of C programs}
• Grammar G = (N, T, P, S)
  o N : a set of nonterminal symbols
  o T : a set of terminal symbols (tokens)
  o P : a set of production rules
  o S : a start symbol, S ∈ N
• Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
  o Language of expressions L = {9-5+2, 3-1, ...}
  o The productions of the grammar for this language L are:
    list → list + digit
    list → list - digit
    list → digit
    digit → 0|1|2|3|4|5|6|7|8|9
  o list, digit : grammar variables (nonterminal symbols)
  o 0,1,2,3,4,5,6,7,8,9,-,+ : tokens (terminal symbols)
• Convention for specifying a grammar
  o Terminal symbols : boldface strings, e.g. if, num, id
  o Nonterminal symbols (grammar variables) : italicized names, e.g. list, digit, A, B
Example G:
list → list + digit | list - digit | digit
digit → 0|1|2|3|4|5|6|7|8|9
Leftmost derivation for 9-5+2:
list ⇒ list+digit ⇒ list-digit+digit ⇒ digit-digit+digit ⇒ 9-digit+digit ⇒ 9-5+digit ⇒ 9-5+2
Rightmost derivation for 9-5+2:
list ⇒ list+digit ⇒ list+2 ⇒ list-digit+2 ⇒ list-5+2 ⇒ digit-5+2 ⇒ 9-5+2
[Figure: parse tree for 9-5+2]
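The two derivations differ only in which nonterminal is rewritten at each step. The sketch below replays them mechanically: the grammar is stored as a dictionary, and derive applies one chosen production per step to the leftmost or the rightmost nonterminal. The names GRAMMAR, derive, PLUS, MINUS and DIGIT are hypothetical helpers introduced for this illustration.

```python
GRAMMAR = {
    "list": [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}

def derive(start, choices, leftmost=True):
    """Apply one production body per step to the leftmost (or rightmost) nonterminal."""
    form = [start]
    steps = ["".join(form)]
    for body in choices:
        positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        i = positions[0] if leftmost else positions[-1]
        if body not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} -> {' '.join(body)} is not a production")
        form[i:i + 1] = body            # replace the nonterminal by the body
        steps.append("".join(form))
    return " ⇒ ".join(steps)

PLUS, MINUS, DIGIT = GRAMMAR["list"]

left = derive("list", [PLUS, MINUS, DIGIT, ["9"], ["5"], ["2"]])
right = derive("list", [PLUS, ["2"], MINUS, ["5"], DIGIT, ["9"]], leftmost=False)
print(left)
print(right)
```

Both runs use the same six productions, only in a different order, which is why they correspond to the same parse tree.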
Ambiguity
• A grammar is said to be ambiguous if it has more than one parse tree for some string of tokens.
• Example 2.5. Suppose a grammar G does not distinguish between lists and digits the way the grammar in Example 2.1 does.
• G : string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
• Under G, the string 9-5+2 has two parse trees, corresponding to (9-5)+2 and 9-(5+2).
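The ambiguity of G can be observed by brute force. The sketch below enumerates every parse tree of a token list under G by trying each + or - as the top-level operator; it is an exhaustive search meant only for short inputs, and the names parses and value are hypothetical.

```python
def parses(tokens):
    """All parse trees under: string -> string + string | string - string | digit."""
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0].isdigit() else []
    trees = []
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "-"):          # try this operator as the root
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    trees.append((left, tokens[i], right))
    return trees

def value(tree):
    """Evaluate a parse tree; a leaf is a digit string."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    return value(left) + value(right) if op == "+" else value(left) - value(right)

trees = parses(list("9-5+2"))
for tree in trees:
    print(tree, "=", value(tree))
```

The search finds two trees for 9-5+2, one evaluating to 2 and the other to 6, so the grammar alone does not determine what the expression means.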
Associativity of operators
An operator is said to be left associative if an operand with that operator on both sides of it is taken by the
operator to its left, e.g. 9+5+2 ≡ (9+5)+2. It is right associative if the operand is taken by the operator to its
right, e.g. a=b=c ≡ a=(b=c).
• Left associative grammar :
  list → list + digit | list - digit | digit
  digit → 0|1|…|9
• Right associative grammar :
  right → letter = right | letter
  letter → a|b|…|z
Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if * takes its operands before + does.
• e.g. 9+5*2 ≡ 9+(5*2), 9*5+2 ≡ (9*5)+2
• left associative operators : + , - , * , /
• right associative operators : = , **
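Precedence and associativity together decide how an expression is grouped. The sketch below uses precedence climbing, which is one common technique and not necessarily the one intended by these notes, to insert the implied parentheses. The table OPS and the function names are assumptions for this example; the assignment operator = is left out to keep the sketch short, and ** serves as the right associative operator.

```python
import re

# operator -> (precedence, associativity); a larger number binds tighter
OPS = {"+": (1, "L"), "-": (1, "L"), "*": (2, "L"), "/": (2, "L"), "**": (3, "R")}

def tokenize(text):
    """Split into integers and operators (no parentheses in this sketch)."""
    return re.findall(r"\d+|\*\*|[-+*/]", text)

def parse(tokens, min_prec=1):
    """Precedence climbing; returns the expression fully parenthesized."""
    left = tokens.pop(0)                       # an operand
    while tokens and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = OPS[op]
        # A left associative operator must not absorb another operator of the
        # same precedence on its right; a right associative one may.
        right = parse(tokens, prec + 1 if assoc == "L" else prec)
        left = f"({left}{op}{right})"
    return left

for text in ["9+5*2", "9*5+2", "9+5+2", "2**3**2"]:
    print(text, "≡", parse(tokenize(text)))
```

The first three lines reproduce the groupings given above, 9+(5*2), (9*5)+2 and (9+5)+2, while 2**3**2 groups as 2**(3**2) because ** is right associative.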