unit1
(CST327)
LEXICAL ANALYSIS
Unit-1
OVERVIEW
Lexical Analysis and Tokens
Process:
A stream of characters is read from left to right and grouped into
tokens; whitespace and comments are removed along the way.
The source code is scanned one character at a time.
When the scanner encounters whitespace, an operator symbol or a special
symbol, it decides that a word has been completed.
The Longest Match Rule is followed, and rule priority breaks ties (a
reserved word gets higher priority than an identifier).
Output: token stream.
Error generated:
When an invalid token is found.
When the name of an identifier matches an existing reserved word.
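The longest match rule and rule priority can be sketched in Python as follows. The rule list, the token names and the helper names (RULES, next_token, scan) are assumptions made for illustration, not part of any particular compiler.

```python
import re

# Hypothetical token rules, listed in priority order (reserved words first).
RULES = [
    ("KEYWORD", re.compile(r"if|else|while")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("OP", re.compile(r"<=|>=|==|!=|<|>|=|\+|\*")),
]

def next_token(text, pos):
    """Return (kind, lexeme) for the longest match at pos, or None.

    A later rule replaces the current best only when its match is strictly
    longer, so ties go to the rule listed first (rule priority).
    """
    best = None
    for kind, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best

def scan(text):
    """Group the characters of text into tokens, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"invalid token at position {pos}: {text[pos]!r}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens
```

With these assumed rules, if is returned as a KEYWORD (tie with ID, resolved by priority), while iffy is returned as an ID because the identifier match is longer.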
Lexical Analysis and Tokens
The functions of the lexical analyzer:
Strip out comments, whitespace and newline characters
from the source program.
Tokenize the source program. For example, the statement
a = (b + c) * 2; produces the token stream:
<id, 100>
<=, > OR <operator, address of =>
<(, >
<id, 101>
<+, >
<id, 102>
<), >
<*, >
<constant, 103>
<;, >
The number after the comma (,) is a pointer to a symbol table entry.
The symbol table entry at memory location 100 stores the identifier ‘a’,
101 stores ‘b’ and 102 stores ‘c’.
The constant value 2 is stored at location 103.
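A minimal Python sketch of how such a token stream and symbol table could be built. It assumes the source statement is a = (b + c) * 2; (consistent with the symbol table entries described above) and that entries are numbered from 100 in order of first appearance. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# One pattern with a named group per token class (assumed classes).
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<const>\d+)|(?P<op>[=()+*;]))"
)

def tokenize(source, base=100):
    """Return (tokens, symtab) for source.

    Identifiers and constants get a symbol table location starting at base;
    operators and punctuation carry no attribute (None).
    """
    symtab = {}  # lexeme -> symbol table location
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"invalid token at position {pos}")
        pos = m.end()
        if m.lastgroup == "op":
            tokens.append((m.group("op"), None))
        else:
            lexeme = m.group(m.lastgroup)
            # Reuse the existing entry if the lexeme was seen before.
            loc = symtab.setdefault(lexeme, base + len(symtab))
            kind = "id" if m.lastgroup == "id" else "constant"
            tokens.append((kind, loc))
    return tokens, symtab
```

Calling tokenize("a = (b + c) * 2;") gives the ten tokens listed above, with a, b, c and 2 stored at locations 100 to 103.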
Regular Expression
A regular expression is a notation for specifying patterns.
Programming language tokens can be described by regular
languages.
Notations used to represent regular expressions:
Notation   Description                        Example
{ }+       One or more repetitions            {a}+     The string contains a repeated one or more times
[ ]        One of several options             [abc]    The string can contain a or b or c
           (called a character class)
-          Range                              [a-z]    The string can contain any character from a to z
?          Zero or one instance               a?
( )        Grouping                           (a | b)
^          Except (inside a character class)  [^a]     a is not included
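These notations can be tried out with Python's re module. This is only a sketch: Python writes one-or-more repetition as a+ rather than {a}+, and the helper name matches is an assumption.

```python
import re

def matches(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, sample text) pairs, one or two per notation in the table.
SAMPLES = [
    (r"a+", "aaa"),      # one or more repetitions of a
    (r"a+", ""),         # the empty string is rejected
    (r"[abc]", "b"),     # character class: a or b or c
    (r"[a-z]", "q"),     # range: any character from a to z
    (r"[a-z]", "Q"),     # capital letters are outside the range
    (r"a?", ""),         # zero or one instance of a
    (r"(a|b)", "b"),     # grouping with alternation
    (r"[^a]", "a"),      # except: a is not included
]

def demo():
    """Return (pattern, text, result) for every sample."""
    return [(p, t, matches(p, t)) for p, t in SAMPLES]
```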
RE Examples- Questions
Regular expression for strings of 0’s and 1’s not
accepting empty string is_____
Regular expression for string abc___
Regular expression for string containing substring
ab and any letters before and after is _______
Regular expression for string containing small letters
and capital letters separated by @ is _____
Transition Diagram
A transition diagram keeps track of information about the characters that are
seen as the forward pointer scans the input.
Positions in a transition diagram are drawn as
circles and are called states.
The states are connected by arrows called edges.
A double circle indicates an accepting state, a state
in which a token has been found.
A * on an accepting state indicates that input retraction must take place.
Finite automata (FA) are used to represent the transition diagram.
Finite Automata
A finite automaton (FA) is a recognizer for regular expressions.
The mathematical model of a finite automaton consists of:
Finite set of input symbols (Σ)
Finite set of states (Q)
One start state (q0)
One or more final states (qf)
Transition function (δ)
The transition function (δ) maps a state and an input symbol to the
next state, δ: Q × Σ ➔ Q
Example: an FA that accepts a string starting with a, ending with b,
and having any number of b's
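The model can be sketched in Python as a transition table. The example language is read here as a followed by one or more b's; that reading, the state names q0 to q2 and the function name accepts are assumptions.

```python
SIGMA = {"a", "b"}            # input symbols
STATES = {"q0", "q1", "q2"}   # q0: start, q1: seen a, q2: seen a then b's
START = "q0"
FINAL = {"q2"}

# Transition function delta: (state, symbol) -> next state.
# Missing entries mean "no move", which rejects the string.
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "b"): "q2",
}

def accepts(text):
    """Run the automaton over text and report whether it ends in a final state."""
    state = START
    for ch in text:
        if ch not in SIGMA:
            return False
        state = DELTA.get((state, ch))
        if state is None:
            return False
    return state in FINAL
```

Under this reading, ab and abbb are accepted, while a, b and aba are rejected.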
Transition Diagram- Example 1
Transition diagram for relational operators in Java
language
< , <=, ==, !=, > and >=
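A transition diagram for these operators can be simulated with a small state machine. The sketch below is one possible encoding: the state numbers, the token names (LT, LE, EQ, NE, GT, GE) and the function name relop are assumptions. Retraction is modelled by not consuming the lookahead character.

```python
def relop(text, pos=0):
    """Return (token_name, next_pos) for a relational operator at pos, or None."""
    state = 0
    i = pos
    while True:
        ch = text[i] if i < len(text) else ""  # "" stands for end of input
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == ">":
                state = 2
            elif ch == "=":
                state = 3
            elif ch == "!":
                state = 4
            else:
                return None
            i += 1
        elif state == 1:
            # On any character other than '=', accept '<' and retract:
            # the lookahead is left unread (next_pos stays at i).
            return ("LE", i + 1) if ch == "=" else ("LT", i)
        elif state == 2:
            return ("GE", i + 1) if ch == "=" else ("GT", i)
        elif state == 3:
            # A single '=' is not a relational operator.
            return ("EQ", i + 1) if ch == "=" else None
        else:
            # A single '!' is not a relational operator.
            return ("NE", i + 1) if ch == "=" else None
```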
Transition Diagram- Example 2
Transition diagram for identifiers of C language
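A sketch of the same diagram in Python, assuming the usual C rule that an identifier starts with a letter or underscore and continues with letters, digits or underscores. The function name identifier is hypothetical. The character that ends the identifier is not consumed, which corresponds to retraction.

```python
import string

LETTER_OR_UNDERSCORE = set(string.ascii_letters + "_")
DIGIT = set(string.digits)

def identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier starting at pos, or None."""
    i = pos
    # Start state: the first character must be a letter or underscore.
    if i >= len(text) or text[i] not in LETTER_OR_UNDERSCORE:
        return None
    i += 1
    # Loop state: stay here on letters, digits and underscores.
    while i < len(text) and (text[i] in LETTER_OR_UNDERSCORE or text[i] in DIGIT):
        i += 1
    # Accepting state: text[i] (if any) is left unread.
    return text[pos:i], i
```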
Fconst →
Exponential numbers:
Example -1(TRY)
Transition diagram for accepting the data needed by a company for the
given employees
Consider a company that wants to accept data from its employees. The data is
described as follows:
Name of the employee- Can have only letters. The name can start with a capital letter.
Employee id- It has the format EMP_<number>
Address- The address must start with one or more digits. Then a comma separator is
allowed, followed by letters.
E-mail- Letters followed by @, then the mail provider (like gmail, yahoo), then .
and the domain name (like com, org)
Salary- A floating point number is allowed
The regular definition for a new language to be designed is as follows:
sletter → [a-z]
cletter→[A-Z]
digit→[0-9]
underscore→ _
provider→ gmail | yahoo | rediffmail | hotmail
domain→com |org | edu
name→
empid→
address→ // cletter and combination with sletter is also valid
email→
salary→
Example -1 (Solution)
The completed regular definition is as follows:
sletter → [a-z]
cletter → [A-Z]
digit → [0-9]
underscore → _
provider → gmail | yahoo | rediffmail | hotmail
domain → com | org | edu
name → cletter sletter+
empid → EMP underscore (digit)+
address → digit+ , sletter+ // cletter and combination with sletter is also valid
email → sletter+ @ provider . domain
salary → (digit)+ ([.] (digit)+)?
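The regular definitions can be checked with Python's re module. This sketch makes two assumptions: the e-mail pattern follows the prose description (small letters, @, provider, dot, domain), and the salary pattern treats the fractional part as optional. The names PATTERNS and classify are hypothetical.

```python
import re

# Building blocks taken from the regular definition.
SLETTER = "[a-z]"
CLETTER = "[A-Z]"
DIGIT = "[0-9]"
PROVIDER = "(?:gmail|yahoo|rediffmail|hotmail)"
DOMAIN = "(?:com|org|edu)"

PATTERNS = {
    "name": f"{CLETTER}{SLETTER}+",
    "empid": f"EMP_{DIGIT}+",
    "address": f"{DIGIT}+,{SLETTER}+",
    # Assumed from the description: letters @ provider . domain
    "email": f"{SLETTER}+@{PROVIDER}[.]{DOMAIN}",
    # Assumed: digits with an optional fractional part
    "salary": f"{DIGIT}+(?:[.]{DIGIT}+)?",
}

def classify(text):
    """Return the names of all fields whose pattern matches the whole of text."""
    return [kind for kind, p in PATTERNS.items() if re.fullmatch(p, text)]
```

For example, classify("EMP_1024") reports an employee id, and an address such as 12,mainstreet is accepted only without a space after the comma, as in the definition.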
Lexical Errors
Some of the errors can be detected only by the lexical analysis phase.
For example, consider a code fragment in C
if(a>b)
printf("a is greater");
esle
printf("b is greater");
The lexical analyser encounters esle and cannot judge whether it is a misspelling of
the keyword else or an identifier.
esle is a valid identifier, so the lexical analyser must return the token for an
identifier and let the later phases handle any error.
If the lexical analyser is unable to proceed, the error detection and handling routine is
invoked.
Error recovery strategies are:
Panic mode error recovery
Deleting an extraneous character
Inserting a missing character
Replacing an incorrect character with the correct character
Transposing two adjacent characters
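The strategies above can be sketched in Python. Both functions are illustrative assumptions rather than how any particular compiler works: scan_with_panic_mode skips characters until a token can start again, and single_edit_repairs lists the keywords reachable from a lexeme by one delete, insert, replace or transpose. The token pattern and the keyword set are also assumed.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[=+*;()<>]")  # assumed token set
KEYWORDS = {"if", "else", "while"}                      # assumed keywords
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def scan_with_panic_mode(text):
    """Return (tokens, skipped) where skipped holds the discarded character runs."""
    tokens, skipped, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
            continue
        # Panic mode: delete characters until a token (or whitespace) starts.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN_RE.match(text, pos)):
            pos += 1
        skipped.append(text[start:pos])
    return tokens, skipped

def single_edit_repairs(lexeme):
    """Return the sorted keywords that are one edit away from lexeme."""
    n = len(lexeme)
    candidates = set()
    for i in range(n):
        candidates.add(lexeme[:i] + lexeme[i + 1:])              # delete
        for c in ALPHABET:
            candidates.add(lexeme[:i] + c + lexeme[i + 1:])      # replace
    for i in range(n + 1):
        for c in ALPHABET:
            candidates.add(lexeme[:i] + c + lexeme[i:])          # insert
    for i in range(n - 1):
        swapped = lexeme[:i] + lexeme[i + 1] + lexeme[i] + lexeme[i + 2:]
        candidates.add(swapped)                                  # transpose
    return sorted(candidates & KEYWORDS)
```

For the misspelling esle, transposing the two middle characters gives else, so single_edit_repairs("esle") suggests that keyword.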