Chapter 2 - Lexical Analysis

Uploaded by

om55500r
Chapter 2- LEXICAL ANALYSIS

2.1 OVERVIEW OF LEXICAL ANALYSIS


o To identify the tokens, we need some method of describing the possible tokens that can appear in the input stream.
For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of
a programming language.
o Secondly, having decided what the tokens are, we need some mechanism to recognize them in the input stream.
This is done by token recognizers, which are designed using transition diagrams and finite automata.

2.2 ROLE OF LEXICAL ANALYSIS


Lexical analysis is the first phase of a compiler; the component that performs it is also known as a scanner (text
scanner). This phase scans the source code as a stream of characters and groups the characters into meaningful
lexemes. The lexical analyzer represents each lexeme as a token of the form
<token-name, attribute-value>
In this way it converts the high-level input program into a sequence of tokens.
• Lexical analysis can be implemented with a deterministic finite automaton (DFA).
• The output is a sequence of tokens that is sent to the parser for syntax analysis.

Upon receiving a 'get next token' command from the parser, the lexical analyzer reads input characters until it can
identify the next token. The LA then returns to the parser a representation of the token it has found. The
representation will be an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source
program comments and white space in the form of blank, tab and newline characters. Another is correlating
error messages from the compiler with the source program.
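
The 'get next token' interaction described above can be sketched as a Python generator. This is only an illustration under stated assumptions: the token names (id, num, op, punct), the regular expressions, and the list used as a symbol table are all hypothetical choices, and token names are strings here rather than the integer codes a real compiler might use.

```python
import re

# Hypothetical token specification; names and patterns are illustrative only.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("op", r"[=+\-*/]"),
    ("punct", r"[(),;:]"),
    ("skip", r"\s+"),  # white space is discarded, never returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokens(source):
    """Yield one (token-name, attribute-value) pair per request."""
    symbol_table = []  # identifiers seen so far, in order of first appearance
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = match.end()
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # The attribute of an id is its index in the symbol table.
            yield ("id", symbol_table.index(lexeme))
        elif kind == "num":
            yield ("num", int(lexeme))
        else:
            yield (kind, lexeme)


lexer = tokens("a = b + 10;")
first = next(lexer)   # the parser's "get next token" request
rest = list(lexer)    # the remaining tokens, pulled one at a time
```

Each call to next() plays the role of one 'get next token' command: the lexer reads only as many characters as it needs to identify the next token, then hands control back to the caller.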

2.3 LEXICAL ANALYSIS VS PARSING:
Lexical analysis:
o A scanner simply turns an input string (say, a file) into a list of tokens. These tokens represent things like
identifiers, parentheses, operators etc.
o The lexical analyzer (the "lexer") groups individual symbols from the source code file into tokens. From there, the
"parser" proper turns those whole tokens into sentences of your grammar.

Parsing:
o A parser converts this list of tokens into a tree-like object to represent how the tokens fit together to form a
cohesive whole (sometimes referred to as a sentence).
o A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is extract meaning
from this structure (sometimes called contextual analysis).

For example, in C language, the variable declaration line

Alphabets
An alphabet is any finite set of symbols: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the
hexadecimal alphabet, and {a-z, A-Z} is the alphabet of English letters.

Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of
symbol occurrences in it, e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string
with no symbols, i.e. a string of zero length, is known as the empty string and is denoted by ε (epsilon).

Category             Symbols
Arithmetic Symbols   Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)
Punctuation          Comma(,), Semicolon(;), Dot(.), Arrow(->)
Assignment           =
Special Assignment   +=, /=, *=, -=
Comparison           ==, !=, <, <=, >, >=
Preprocessor         #
Location Specifier   &
Logical              &, &&, |, ||, !
Shift Operator       >>, >>>, <<, <<<

2.4 TOKEN, LEXEME, PATTERN:


Token:
A lexical token is a sequence of characters that can be treated as a unit (a single logical entity) in the
grammar of the programming language.
Typical tokens are:
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants

Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.

For example, the C declarations

int value = 100; char name = "hello";

contain several tokens. Each keyword, identifier, constant and operator is a token to the lexical analyzer. The first
declaration consists of the tokens

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Examples of tokens:
• Type tokens (id, number, real, ...)
• Alphabetic tokens, i.e. keywords (if, void, return, ...)
• Punctuation tokens (operators and separators)
Keywords: for, while, if etc.
Identifiers: variable names, function names etc.
Operators: '+', '++', '-' etc.
Separators: ',' ';' etc.
Examples of non-tokens:
• Comments, preprocessor directives, macros, blanks, tabs, newlines etc.
Lexeme: The sequence of input characters that is matched by a pattern and so makes up a single token is called a
lexeme.
e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".

How Lexical Analyzer functions


1. Tokenization, i.e. dividing the program into valid tokens.
2. Removing white space characters.
3. Removing comments.
4. Helping to generate error messages by providing the row number and column number.

Example:
Token      Lexeme                  Description of pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >=, >     < or <= or = or <> or >= or >
id         pi                      letter followed by letters and digits
num        3.14                    any numeric constant
literal    "core"                  any characters between " and " except "
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program.
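
The token/lexeme/pattern relationship in the table can be tried out with regular expressions. The sketch below is an assumption-laden illustration: the regex spellings of the informal patterns and the function name classify are not from the notes. Keyword patterns are listed before id so that a lexeme such as if is not reported as an identifier.

```python
import re

# Regex spellings of the informal patterns in the table (assumed, not official).
PATTERNS = [
    ("const", r"const"),
    ("if", r"if"),
    ("relation", r"<=|<>|>=|<|=|>"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),     # letter followed by letters and digits
    ("num", r"\d+(\.\d+)?"),             # a simple numeric constant
    ("literal", r'"[^"]*"'),             # characters between " and " except "
]


def classify(lexeme):
    """Return the first token whose pattern matches the whole lexeme, else None."""
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None


for lexeme in ["const", "if", "<=", "pi", "3.14", '"core"']:
    print(lexeme, "->", classify(lexeme))
```

Because the match must cover the whole lexeme, a name such as constant falls through the const pattern and is classified as id.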

2.5 LEXICAL ERRORS:


Lexical errors are the errors reported by your lexer when it is unable to continue, which means that there is no
way to recognise a lexeme as a valid token for your lexer. Syntax errors, on the other hand, are reported
by your parser when a given sequence of already recognised valid tokens does not match any of the right-hand
sides of your grammar rules. A simple panic-mode error handling system requires that we return to a high-level
parsing function when a parsing or lexical error is detected.

Error-recovery actions are:


i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
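
The four recovery actions can be sketched as a function that enumerates every string reachable by a single action. This is a minimal illustration, not how any particular compiler recovers: the function name single_edit_repairs, the lower-case alphabet and the keyword set are all assumptions made for the example.

```python
def single_edit_repairs(text, alphabet):
    """Return every string reachable from text by exactly one recovery action."""
    repairs = set()
    for i in range(len(text)):
        # i. delete one character
        repairs.add(text[:i] + text[i + 1:])
        # iii. replace a character by another character
        for ch in alphabet:
            if ch != text[i]:
                repairs.add(text[:i] + ch + text[i + 1:])
        # iv. transpose two adjacent (different) characters
        if i + 1 < len(text) and text[i] != text[i + 1]:
            repairs.add(text[:i] + text[i + 1] + text[i] + text[i + 2:])
    # ii. insert a missing character at any position
    for i in range(len(text) + 1):
        for ch in alphabet:
            repairs.add(text[:i] + ch + text[i:])
    return repairs


KEYWORDS = {"while", "if", "else", "return"}  # hypothetical keyword set
candidates = single_edit_repairs("whlie", "abcdefghijklmnopqrstuvwxyz")
print(sorted(candidates & KEYWORDS))
```

Here the misspelled lexeme whlie is one transposition away from while, so intersecting the candidates with the keyword set suggests a plausible repair.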

• The lexical analyzer identifies the error with the help of the automaton and the grammar of the language on
which it is based, such as C or C++, and gives the row number and column number of the error.
Suppose we pass a statement through the lexical analyzer:
a=b+c; It will generate a token sequence like this:
id=id+id; where each id refers to its variable's entry in the symbol table, which holds all of its details.
For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}

All the valid tokens are:


'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'

You can observe that the comment has been omitted.
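
Token counts like this one can be checked mechanically. The following is a rough sketch of a regex-based scanner for a small C-like subset; the pattern set and the name tokenize are assumptions, and the scanner is far from a complete C lexer. Comments and white space are matched but never returned, and a string literal counts as one token.

```python
import re

# Illustrative scanner for a small C-like subset (not a full C lexer).
SCANNER = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)          # skipped: line and block comments
  | (?P<space>\s+)                           # skipped: blanks, tabs, newlines
  | (?P<string>"(?:\\.|[^"\\])*")            # a string literal is one token
  | (?P<number>\d+)
  | (?P<word>[A-Za-z_]\w*)                   # keywords and identifiers
  | (?P<symbol>&&|\|\||[-+*/=<>!]=?|[(){},;&|%])
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return the list of lexemes, leaving out comments and white space."""
    result, pos = [], 0
    while pos < len(source):
        match = SCANNER.match(source, pos)
        if match is None:
            raise SyntaxError(f"lexical error at offset {pos}: {source[pos]!r}")
        if match.lastgroup not in ("comment", "space"):
            result.append(match.group())
        pos = match.end()
    return result


PROGRAM = """
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
"""
program_tokens = tokenize(PROGRAM)
print(len(program_tokens), program_tokens)
```

For the program above this yields 18 tokens, and the comment text never appears in the output.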

As another example, consider the printf statement below.

There are 5 valid tokens in this printf statement.


Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}
Answer: Total number of token: 27.
Exercise 2:
Count number of tokens :
int max(int i);
• The lexical analyzer first reads int, finds it to be valid and accepts it as a token.
• max is read next and, once ( has been read, is found to be a valid function name.
• int is also a token, then i is another token, followed by ) and finally ;
Answer: Total number of tokens: 7
int, max, (, int, i, ), ;

Exercise 3:
Count number of tokens and values:
int studno=43609023;

Answer: Total number of tokens 5:


int(keyword), studno(identifier), =(operator), 43609023(constant), ;(separator)

Exercise 4:
Count number of tokens:
int main()                         4
{                                  1
int a = 10, b = 20;                9
printf("sum is :%d" , a + b );     9
printf("HELLO ");                  5
printf( "\n") ;                    5
return 0;                          3
}                                  1
Answer: Total number of tokens: 37

2.6 Lexical Grammar and FSMs

To recognize a token described by a regular definition, the regular expression in the definition is often transformed
into a finite state machine (FSM). The resulting FSM has a finite number of states, comprising an initial state and a
set of accepting states.

For example, the regular expression a | b can be converted into the following FSM.

[Figure: FSM for a | b]

[Figure: FSM for ab]

[Figure: FSM for a*]

R = (a|b)c

[Figure: FSM for (a|b)]

R = (a|b)*c

R = a(bc)*

The FSM for (bc)* would be represented with a loop on bc.

Concatenating the above two FSMs will give us the FSM for a(bc)*.
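
An FSM of this kind can be simulated with a transition table. The sketch below encodes one possible deterministic machine for a(bc)*; the state numbers and the names TRANSITIONS and accepts are arbitrary choices for illustration, not taken from the diagrams.

```python
# Transition table for a(bc)*: state 0 is the start state, state 1 is accepting.
# The loop 1 -b-> 2 -c-> 1 is the "loop on bc" for (bc)*.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 1,
}
START = 0
ACCEPTING = {1}


def accepts(text):
    """Run the FSM over text and report whether it ends in an accepting state."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING


for sample in ["a", "abc", "abcbc", "ab", "bc", ""]:
    print(repr(sample), accepts(sample))
```

The machine accepts a, abc and abcbc, and rejects ab (stuck in the middle of the loop), bc (no leading a) and the empty string.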

Language Definition
o Appearance of a programming language:
Vocabulary : regular expressions
Syntax : Backus-Naur Form (BNF) or context-free grammar (CFG)
o To specify the syntax of a language: CFG and BNF
o Example: the if-else statement in C has the form
statement → if ( expression ) statement else statement

• An alphabet of a language is a set of symbols.
o Examples:
1. {0,1} for a binary number system (language) = {0, 1, 100, 101, ...}
2. {a,b,c} for language = {a, b, c, ac, abcc, ...}
3. {if, (, ), else, ...} for if statements = {if(a==1)goto10, if ...}
• A string over an alphabet is a sequence of zero or more symbols from the alphabet.
o Examples: 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
o The null string is the string that does not contain any symbol of the alphabet.
• A language is a subset of all the strings over a given alphabet.
o Alphabets Ai              Languages Li for Ai
A0 = {0,1}                  L0 = {0, 1, 100, 101, ...}
A1 = {a,b,c}                L1 = {a, b, c, ac, abcc, acb, ...}
A2 = {all of C tokens}      L2 = {all sentences of C programs}
• Example 2.1. Grammar for expressions consisting of digits and plus and minus signs.
o Language of expressions L = {9-5+2, 3-1, ...}
o The productions of the grammar for this language L are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
o list, digit : grammar variables, grammar symbols
o 0,1,2,3,4,5,6,7,8,9,-,+ : tokens, terminal symbols
• Convention for specifying a grammar
o Terminal symbols : boldface strings such as if, num, id
o Nonterminal symbols, grammar symbols : italicized names such as list, digit, A, B
Grammar G=(N,T,P,S)
N : a set of nonterminal symbols
T : a set of terminal symbols, tokens
P : a set of production rules
S : a start symbol, S∈N

Example: Grammar G for a language L = {9-5+2, 3-1, ...}

o G = (N,T,P,S)
N = {list, digit}
T = {0,1,2,3,4,5,6,7,8,9,-,+}
P : list → list + digit
    list → list - digit
    list → digit
    digit → 0|1|2|3|4|5|6|7|8|9
S = list
Parse Tree
A derivation can be conveniently represented by a derivation tree (parse tree).
o The root is labeled by the start symbol.
o Each leaf is labeled by a token or ε.
o Each interior node is labeled by a nonterminal symbol.
o When a production A → x1 … xn is applied, nodes labeled x1 … xn are made children of the node labeled A.
• root : the start symbol
• internal nodes : nonterminal
• leaf nodes : terminal

Example G: list → list + digit | list - digit | digit
           digit → 0|1|2|3|4|5|6|7|8|9
Leftmost derivation for 9-5+2:
list ⇒ list+digit ⇒ list-digit+digit ⇒ digit-digit+digit ⇒ 9-digit+digit ⇒ 9-5+digit ⇒ 9-5+2
Rightmost derivation for 9-5+2:
list ⇒ list+digit ⇒ list+2 ⇒ list-digit+2 ⇒ list-5+2 ⇒ digit-5+2 ⇒ 9-5+2
[Figure: parse tree for 9-5+2]
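
A parse tree for this grammar can be built with a short loop. The sketch below is one possible illustration: the nested-tuple tree representation and the names parse_list and leaves are assumptions. Because the grammar is left recursive, the tree built so far becomes the left child each time another operator is read.

```python
DIGITS = "0123456789"


def parse_list(text):
    """Build a parse tree for list -> list + digit | list - digit | digit."""
    if not text or text[0] not in DIGITS:
        raise SyntaxError("expected a digit")
    tree = ("list", ("digit", text[0]))            # list -> digit
    pos = 1
    while pos < len(text):
        op = text[pos]
        if op not in ("+", "-") or pos + 1 >= len(text) or text[pos + 1] not in DIGITS:
            raise SyntaxError(f"unexpected input at position {pos}")
        # list -> list op digit: the tree so far becomes the left child
        tree = ("list", tree, op, ("digit", text[pos + 1]))
        pos += 2
    return tree


def leaves(tree):
    """Read the leaves left to right; they spell out the original string."""
    if isinstance(tree, str):
        return tree
    return "".join(leaves(child) for child in tree[1:])


tree = parse_list("9-5+2")
print(tree)
print(leaves(tree))
```

The root is labeled by the start symbol list, interior nodes are nonterminals, and reading the leaves from left to right gives back 9-5+2.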

Ambiguity
• A grammar is said to be ambiguous if it has more than one parse tree for some string of tokens.
• Example 2.5. Suppose a grammar G does not distinguish between lists and digits, as the grammar of Example 2.1 does.
G : string → string + string | string - string | 0|1|2|3|4|5|6|7|8|9
With this grammar the string 9-5+2 has two parse trees, one grouping it as (9-5)+2 and one as 9-(5+2).
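
The ambiguity can be made concrete by enumerating every parse tree of a string under this grammar. The sketch below is a brute-force illustration; the names parse_trees and value and the tuple representation of trees are assumptions. It tries each operator as the top-level split point.

```python
DIGITS = "0123456789"


def parse_trees(tokens):
    """All parse trees for string -> string + string | string - string | digit."""
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0] in DIGITS else []
    trees = []
    for i, tok in enumerate(tokens):
        # Try every operator as the root of the tree.
        if tok in ("+", "-") and 0 < i < len(tokens) - 1:
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, tok, right))
    return trees


def value(tree):
    """Evaluate a tree; different trees for one string may give different values."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    return value(left) + value(right) if op == "+" else value(left) - value(right)


trees = parse_trees(list("9-5+2"))
for t in trees:
    print(t, "=", value(t))
```

For 9-5+2 the function finds two trees: 9-(5+2), which evaluates to 2, and (9-5)+2, which evaluates to 6. The grammar of Example 2.1 allows only the second grouping.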

Associativity of operator
An operator is said to be left associative if an operand with that operator on both sides of it is taken by the
operator to its left, e.g. 9+5+2 ≡ (9+5)+2. An operator is right associative if such an operand is taken by the
operator to its right, e.g. a=b=c ≡ a=(b=c).
• Left associative grammar : list → list + digit | list - digit | digit   digit → 0|1|…|9
• Right associative grammar : right → letter = right | letter   letter → a|b|…|z
Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if the operator (*) takes its operands
before the other operator (+) does.
• e.g. 9+5*2 ≡ 9+(5*2), 9*5+2 ≡ (9*5)+2
• left associative operators : + , - , * , /
• right associative operators : = , **

Syntax of full expressions


Operator   Associativity   Precedence
+, -       left            1 (low)
*, /       left            2 (high)
• expr → expr + term | expr - term | term
  term → term * factor | term / factor | factor
  factor → digit | ( expr )
  digit → 0|1|…|9
• Syntax of statements
o stmt → id = expr ;
       | if ( expr ) stmt ;
       | if ( expr ) stmt else stmt ;
       | while ( expr ) stmt ;
  expr → expr + term | expr - term | term
  term → term * factor | term / factor | factor
  factor → digit | ( expr )
  digit → 0|1|…|9
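
The expr / term / factor grammar can be turned into a small recursive-descent evaluator, which shows how the layering of the rules produces precedence and left associativity. This is a sketch under stated assumptions: the function name evaluate is hypothetical, the left-recursive rules are replaced by loops (a common transformation, not something stated in the notes), and / is taken to be ordinary division.

```python
def evaluate(text):
    """Evaluate single-digit arithmetic using the expr/term/factor grammar."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():  # factor -> digit | ( expr )
        tok = advance()
        if tok is not None and tok in "0123456789":
            return int(tok)
        if tok == "(":
            result = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return result
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():  # term -> term * factor | term / factor | factor
        result = factor()
        while peek() in ("*", "/"):
            op = advance()
            right = factor()
            result = result * right if op == "*" else result / right
        return result

    def expr():  # expr -> expr + term | expr - term | term
        result = term()
        while peek() in ("+", "-"):
            op = advance()
            right = term()
            result = result + right if op == "+" else result - right
        return result

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("9+5*2"), evaluate("9*5+2"), evaluate("(9+5)*2"), evaluate("9-5+2"))
```

Since expr is defined in terms of term, every * and / is grouped before any + or - is applied, and the loops accumulate results from left to right, matching the left associativity in the table.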
