2 - Lexical Analysis
01/06/2025 2
Role of Lexical Analyzer
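[Figure: the source program enters the Lexical Analyzer, which returns a token to the Parser each time the Parser calls getNextToken; the Parser's output goes on to semantic analysis, and both the Lexical Analyzer and the Parser use the Symbol Table.]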
Role of Lexical Analyzer
• The lexical analyzer may also perform other tasks:
  • Stripping out comments and whitespace (blank, tab, newline).
  • Correlating error messages generated by the compiler with the source program.
  • Associating a line number with each error message.
  • Expanding macros, if the source language uses a macro preprocessor.
Role of Lexical Analyzer
• The lexical analyzer may be divided into a cascade of two processes:
  • Scanning
    • Simple processes that do not require tokenization of the input.
    • Deletion of comments.
    • Compaction of consecutive whitespace characters into one.
  • Lexical analysis proper
    • The more complex portion.
    • Produces tokens from the output of the scanner.
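The scanning pass described above can be sketched in a few lines. This is a minimal illustration, assuming C-style /* ... */ and // comments (the slides do not fix a comment syntax); the name scan is hypothetical. It also ignores the case of comment markers inside string literals, which a real scanner would have to handle.

```python
import re

# Assumption: C-style block and line comments.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Whitespace as listed on the slide: blank, tab, newline.
WHITESPACE_RUN = re.compile(r"[ \t\n]+")


def scan(source: str) -> str:
    """Scanning pass: delete comments, then compact whitespace runs into one blank."""
    # Replace each comment with a blank so the neighbouring lexemes stay separated.
    without_comments = COMMENT.sub(" ", source)
    return WHITESPACE_RUN.sub(" ", without_comments).strip()


if __name__ == "__main__":
    print(scan("a  =  b; /* note */\n\tc = d; // tail"))
```

The output of this pass is what the second process, lexical analysis proper, would turn into tokens.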
Tokens
• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit, for example:
  • A keyword
  • An identifier
Patterns
• A description of the form that the lexemes of a token may take.
Lexeme
• A sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an
instance of that token.
Tokens, Patterns and Lexemes
printf("Total = %d\n",score)
• printf and score are lexemes matching the pattern for token id.
• "Total = %d\n" is a lexeme matching the pattern for token literal.
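As a rough illustration of how lexemes are matched against patterns, the sketch below splits the printf statement into (token name, lexeme) pairs with Python's re module. The token names id and literal come from the slide; the name punct and the exact patterns are assumptions made for this example.

```python
import re

# (token name, pattern) pairs. "punct" is a hypothetical name for ( ) and ,
TOKEN_SPEC = [
    ("literal", r'"(?:\\.|[^"\\])*"'),       # a double-quoted string with escapes
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("punct", r"[(),]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def lexemes(text):
    """Return (token name, lexeme) pairs, dropping whitespace.

    Characters that match no pattern are silently skipped by finditer;
    this sketch does no error handling.
    """
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "ws"]


if __name__ == "__main__":
    for name, lexeme in lexemes(r'printf("Total = %d\n",score)'):
        print(name, lexeme)
```

Both printf and score come out under the same token name, id, while the quoted text comes out as a single literal lexeme.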
Attribute for Tokens
• More than one lexeme can match a pattern.
  • For example, 0 and 1 both match the pattern for token number.
• Additional information about the particular lexeme must then be provided to subsequent compiler phases.
• In many cases, the lexical analyzer returns the token name together with an attribute value.
  • The attribute value describes the lexeme represented by the token.
  • The token name influences parsing decisions.
  • The attribute value influences translation of the token after the parse.
Attribute for Tokens
• An attribute can be
  • A single value.
  • A structure combining several pieces of information.
• An id token might carry information about its
  • Lexeme
  • Type
  • Location at which it was first found
• These values are stored in the symbol table.
• Hence the appropriate attribute value for an identifier is a pointer to the symbol table entry for that identifier.
Attribute for Tokens
E = M * C ** 2
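One way to picture the token stream for this statement is the sketch below. The token names assign_op, mult_op and exp_op are assumptions for illustration (the slide does not name them), and a list index stands in for the pointer to a symbol table entry.

```python
import re

# Hypothetical token names and patterns for this one statement.
SPEC = [
    ("number", r"[0-9]+"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op", r"\*\*"),      # listed before mult_op so that ** is one lexeme
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))


def tokenize(text):
    """Return (tokens, symtab); each token is a (name, attribute) pair."""
    symtab = []   # symbol table: the index of a lexeme acts as its "pointer"
    tokens = []
    for m in MASTER.finditer(text):
        name, lexeme = m.lastgroup, m.group()
        if name == "ws":
            continue
        if name == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append((name, symtab.index(lexeme)))
        elif name == "number":
            tokens.append((name, int(lexeme)))   # attribute: the integer value
        else:
            tokens.append((name, None))          # operators need no attribute
    return tokens, symtab


if __name__ == "__main__":
    print(tokenize("E = M * C ** 2"))
```

The three identifiers share one token name and differ only in their attribute, which is the point made on the preceding slides.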
Lexical Errors
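• It is hard for a lexical analyzer to detect errors without the help of other components.
fi(a%2==0)
• Here fi is a valid lexeme for token id, so the lexical analyzer cannot tell whether it is a misspelling of the keyword if or the name of a function.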
Lexical Errors
• A lexical error occurs when the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• Simplest recovery strategy: panic mode recovery.
  • Delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
• Other possible error-recovery actions
  • Delete one character from the remaining input.
  • Insert a missing character into the remaining input.
  • Replace a character by another character.
  • Transpose two adjacent characters.
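Panic mode recovery can be sketched as follows. The token patterns and the function name tokenize_with_panic_mode are hypothetical and chosen only to make the example self-contained; the recovery loop is the part that corresponds to the slide.

```python
import re

# Hypothetical token patterns, for illustration only.
TOKEN = re.compile(
    r"(?P<number>[0-9]+)"
    r"|(?P<id>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<op>[-+*/=()])"
    r"|(?P<ws>[ \t\n]+)"
)


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors holds (position, deleted characters)."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # Panic mode: delete successive characters until some pattern
            # matches a prefix of the remaining input.
            start = pos
            while pos < len(text) and TOKEN.match(text, pos) is None:
                pos += 1
            errors.append((start, text[start:pos]))
            continue
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors


if __name__ == "__main__":
    print(tokenize_with_panic_mode("x = 4 @# y"))
```

The other recovery actions (insert, replace, transpose) would need a guess about what the programmer intended, which is why deletion is the simplest strategy.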
Input Buffering
• We often need to look one or more characters beyond the next lexeme to be sure we have found the right lexeme.
  • We must see a character that cannot be part of an identifier, such as a space, to determine the end of an identifier.
  • A single-character operator (<, =) can be the beginning of a two-character operator (<=, ==).
Buffer Pair
• Buffering is used to reduce the overhead required to process a single input character.
• One scheme uses two buffers that are alternately reloaded.
  • Each buffer is of the same size N.
  • N is normally the size of a disk block (e.g. 4096 bytes).
  • We can read N characters into a buffer with one system call.
Buffer Pair
• Two pointers into the input are maintained
  • lexemeBegin marks the beginning of the current lexeme.
  • forward scans ahead until a pattern match is found.
[Figure: buffer pair holding E = M * C * * 2 eof, with the lexemeBegin and forward pointers marked.]
Buffer Pair
• Once the next lexeme is found
  • forward is retracted one position to the left if it has passed the end of the lexeme.
  • After the lexeme is recorded, lexemeBegin is set to the character immediately after it.
• Advancing forward requires testing whether we have reached the end of a buffer.
  • If so, the other buffer is reloaded and forward is moved to the beginning of the newly loaded buffer.
Buffer Pair
• Two checks are necessary each time forward is advanced
  • Have we reached the end of the buffer?
  • Which character have we read?
• We can combine the buffer-end test with the test for the current character by placing a sentinel at the end of each buffer.
• The sentinel is a special character that cannot be part of the source program.
  • A natural choice is the character eof.
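The sentinel idea can be sketched as below. This is an illustrative simulation, not a production reader: the class name BufferPair, the method next_char, and the use of "\0" as a stand-in for eof are assumptions, and only the forward pointer is modelled (lexemeBegin is left out). The point to notice is that the common case costs one comparison per character.

```python
import io

EOF = "\0"   # assumed sentinel: a character that cannot occur in the source


class BufferPair:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=4096):
        self.stream, self.n = stream, n
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = 0

    def _load(self, half):
        start = half * (self.n + 1)
        chunk = self.stream.read(self.n)      # one read call per reload
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # Sentinel right after the data: at the end of the half when the
        # chunk is full, or earlier when the input has run out.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        ch = self.buf[self.forward]
        if ch != EOF:                          # common case: a single test
            self.forward += 1
            return ch
        if self.forward == self.n:             # sentinel ending the first half
            self._load(1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:     # sentinel ending the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                            # sentinel inside a half: real end


def read_all(pair):
    """Drain a BufferPair into a string (usage helper for the sketch)."""
    out = []
    ch = pair.next_char()
    while ch is not None:
        out.append(ch)
        ch = pair.next_char()
    return "".join(out)


if __name__ == "__main__":
    print(read_all(BufferPair(io.StringIO("E = M * C ** 2"), n=4)))
```

A small n is used in the usage line only to force several reloads; with a disk-block-sized N the reload branches run rarely.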
Specification of Token
• Regular expressions are used to specify lexeme patterns.
• They cannot express every possible pattern, but they are very effective for specifying the patterns that tokens actually need.
Strings and Language
• Alphabet
  • Any finite set of symbols.
  • {0,1} is the binary alphabet.
  • ASCII and Unicode are other examples.
• String
  • A finite sequence of symbols drawn from the alphabet.
  • 0, 1, 00, 01, 1111, … are strings over the binary alphabet.
• Length of a string s, written |s|
  • Number of occurrences of symbols in s.
  • ε is the empty string, with length 0.
Strings and Language
• Language
  • Any countable set of strings over some fixed alphabet.
  • A very broad definition, which includes for example:
    • The set of all syntactically well-formed C programs.
    • The set of all grammatically correct sentences.
Operations on Language
Example of Operations
• Let L be the set of letters {A, B, …, Z, a, b, …, z}
• Let D be the set of digits {0, 1, …, 9}
• L ∪ D
  • Set of letters and digits.
  • 62 strings of length one.
• LD
  • Set of 520 strings of length two.
  • Each is one letter followed by one digit.
• L^4
  • Set of all four-letter strings.
Example of Operations
• Let L be the set of letters {A, B, …, Z, a, b, …, z}
• Let D be the set of digits {0, 1, …, 9}
• L*
  • Set of all strings of letters, including the empty string ε.
• L(L ∪ D)*
  • Set of all strings of letters and digits beginning with a letter.
• D+
  • Set of all strings of one or more digits.
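The counts in these examples can be checked with ordinary Python sets, treating a language as a set of strings. The helper names concat, power and closure_up_to are made up for this sketch, and since L* and D+ are infinite, only a finite slice of each closure is built.

```python
import string

L = set(string.ascii_letters)   # the 52 letters A-Z, a-z
D = set(string.digits)          # the 10 digits 0-9


def concat(a, b):
    """Concatenation: every string of a followed by every string of b."""
    return {x + y for x in a for y in b}


def power(a, n):
    """a^n: a concatenated with itself n times; a^0 is {ε}."""
    result = {""}
    for _ in range(n):
        result = concat(result, a)
    return result


def closure_up_to(a, k):
    """Finite slice of the Kleene closure a*: the union of a^0 ... a^k."""
    result = set()
    for i in range(k + 1):
        result |= power(a, i)
    return result


if __name__ == "__main__":
    print(len(L | D))                    # union of letters and digits
    print(len(concat(L, D)))             # one letter followed by one digit
    print(len(power(L, 2)))              # all two-letter strings
    print(len(closure_up_to(D, 2)))      # ε, 10 digits, 100 digit pairs
```

L^4 is left out of the usage lines only because it has 52^4 members; the same power function would build it.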
Regular Expressions
• A regular expression is a sequence of characters specifying a pattern.
• If letter_ stands for any letter or the underscore,
• and digit stands for any digit,
• then the language of C identifiers is described by
  • letter_ (letter_ | digit)*
Formation of Regular Expressions
• Regular expressions are built recursively out of smaller regular expressions.
• Each regular expression r denotes a language L(r).
Formation of Regular Expressions
• Rules that define the regular expressions over an alphabet Σ
• Basis
  • ε is a regular expression, and L(ε) = {ε}.
  • If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}.
Formation of Regular Expressions
• Induction
  • Suppose r and s are regular expressions.
  • (r)|(s) is a regular expression denoting L(r) ∪ L(s).
  • (r)(s) is a regular expression denoting the language L(r)L(s).
  • (r)* is a regular expression denoting (L(r))*.
  • (r) is a regular expression denoting L(r).
    • That is, we can add additional pairs of parentheses without changing the meaning.
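The basis and induction rules translate almost line by line into a recursive function. In this sketch a regular expression is a nested tuple ("eps", "sym", "alt", "cat", "star"), a representation invented for the example, and lang(r, max_len) returns only the strings of L(r) up to a given length, since L(r) may be infinite. The rule for (r) needs no case of its own because parentheses are implicit in the nesting.

```python
def lang(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":                       # basis: L(ε) = {ε}
        return {""}
    if kind == "sym":                       # basis: L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                       # (r)|(s) denotes L(r) ∪ L(s)
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                       # (r)(s) denotes L(r)L(s)
        return {x + y
                for x in lang(r[1], max_len)
                for y in lang(r[2], max_len)
                if len(x + y) <= max_len}
    if kind == "star":                      # (r)* denotes (L(r))*
        base = lang(r[1], max_len) - {""}
        result, frontier = {""}, {""}
        while frontier:
            # Extend the strings found in the previous round by one more
            # string of L(r); stop when nothing new fits in max_len.
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind}")


if __name__ == "__main__":
    a, b = ("sym", "a"), ("sym", "b")
    a_or_b = ("alt", a, b)
    print(sorted(lang(("cat", a_or_b, a_or_b), 2)))
    print(sorted(lang(("star", a), 3)))
```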
Precedence and Associativity
• The unary operator * has the highest precedence.
• Concatenation has the second highest precedence.
• | has the lowest precedence.
• All three operators are left associative.
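Python's re module uses the same precedence for these three operators, so it can serve as a quick check of how an unparenthesized expression groups. The pattern a|b*c below is an example chosen for this sketch; under the rules above it should read as (a)|((b*)c).

```python
import re

PATTERN = re.compile(r"a|b*c")


def accepts(s):
    """True if the whole string s is in the language of a|b*c."""
    return PATTERN.fullmatch(s) is not None


if __name__ == "__main__":
    # Expected under the grouping (a)|((b*)c): a, c, bbc accepted; ac, ab rejected.
    for s in ["a", "c", "bbc", "ac", "ab"]:
        print(repr(s), accepts(s))
```

If | bound tighter than concatenation, giving (a|b*)c, the string ac would be accepted, which it is not.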
Regular Expression Example
• Let Σ = {a, b}
• a|b
  • Denotes the language {a, b}.
• (a|b)(a|b)
  • Denotes {aa, ab, ba, bb}.
• a*
  • Denotes the language of all strings of zero or more a's.
• (a|b)*
  • Denotes the set of all strings of zero or more instances of a or b.
  • ε, a, b, aa, ab, ba, bb, aaa, …
Regular Definition
• Used for notational convenience
  • Give names to certain regular expressions and use those names as symbols.
• If Σ is an alphabet
  • Then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
…
dn → rn
Regular Definition
• Each di is a new symbol, not in Σ and not the same as any other d.
• Each ri is a regular expression over the alphabet
  • Σ ∪ {d1, d2, …, d(i-1)}
  • That is, ri may use the symbols of Σ and only the names defined before di.
Regular Definition Example
• C identifiers are strings of letters, digits and underscores.
• The regular definition of identifiers is
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
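A regular definition can be turned into a single regular expression by substituting each name with its body, in order. The sketch below does this with str.format; the DEFINITIONS list and the expand helper are illustrative names, and the bodies are written in Python re syntax (bracketed ranges stand in for the long alternations A | B | … | Z).

```python
import re

# Each body may mention, in braces, only names defined earlier in the list.
DEFINITIONS = [
    ("letter_", "[A-Za-z_]"),
    ("digit", "[0-9]"),
    ("id", "{letter_}({letter_}|{digit})*"),
]


def expand(definitions):
    """Substitute earlier names into later bodies; return name -> regex."""
    expanded = {}
    for name, body in definitions:
        # str.format raises KeyError when a body uses a name that has not
        # been defined yet, which enforces the ordering rule.
        expanded[name] = "(?:" + body.format(**expanded) + ")"
    return expanded


if __name__ == "__main__":
    id_pattern = expand(DEFINITIONS)["id"]
    for candidate in ["score_1", "_tmp", "1score"]:
        print(candidate, re.fullmatch(id_pattern, candidate) is not None)
```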
Extension of Regular Expression
• One or more instances
  • Unary postfix operator +
  • Represents positive closure.
  • (r)+ denotes the language (L(r))+
• Zero or one instance
  • Unary postfix operator ?
  • r? is equivalent to r | ε.
  • ? has the same precedence as * and +.
Extension of Regular Expression
• Character classes
  • a1|a2|…|an, where each ai is a symbol of the alphabet, can be replaced with
    • [a1a2…an]
  • If a1, a2, …, an form a logical sequence
    • For example uppercase letters, lowercase letters, or digits
    • We can replace a1a2…an with a1-an
    • That is, the first and last symbols separated by a hyphen.
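The extensions add convenience, not power: each can be rewritten with the basic operators. The sketch below checks a few such equivalences by brute force over all short strings, using Python's re module. The helper names are made up, and agreement on short strings is evidence rather than a proof.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


def equivalent(p, q, alphabet="ab", max_len=3):
    """True if patterns p and q accept exactly the same short strings."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in strings_up_to(alphabet, max_len)
    )


if __name__ == "__main__":
    print(equivalent("a+", "aa*"))        # positive closure: r+ is rr*
    print(equivalent("a?", "(?:a|)"))     # r? is r | ε
    print(equivalent("[ab]", "a|b"))      # character class as alternation
    print(equivalent("[0-9]", "0|1|2|3|4|5|6|7|8|9", alphabet="059x"))
```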
Regular Expression Example
• Rewriting the regular definition of identifiers with character classes
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
Recognition of Token
• So far we have seen how to express patterns using regular expressions.
• Now we want to use these patterns to recognize lexemes in the input.
Recognition of Token
• Consider the example grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id | number
Recognition of Token
• The terminals of the grammar are:
  • if, then, else, relop, id, number.
• For relop we will use the comparison operators:
  • =, <>, <, >, <=, >=
Recognition of Token
digit → [0-9]
digits → digit+
number → digits(.digits)?(E[+-]? digits)?
letter → [A-Za-z]
id → letter(letter|digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
Recognition of Token
• We also need to remove whitespace.
  • ws → (blank | tab | newline)+
Tokens, Patterns and Attribute Values
Lexemes      Token Name   Attribute Value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
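The table can be read as the specification of a small lexer. The sketch below is one possible rendering with Python's re module, not the method the slides go on to develop (they use transition diagrams). The function name get_tokens is hypothetical, a list index stands in for a pointer to a table entry, and keywords are picked out by a simple set lookup.

```python
import re

KEYWORDS = {"if", "then", "else"}
RELOP_ATTR = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

# The regular definitions for ws, number, id and relop, in Python re syntax.
TOKEN = re.compile(r"""
    (?P<ws>[ \t\n]+)
  | (?P<number>[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?)
  | (?P<id>[A-Za-z][A-Za-z0-9]*)
  | (?P<relop><=|<>|>=|<|>|=)
""", re.VERBOSE)


def get_tokens(text):
    """Return (tokens, table); each token is a (token name, attribute) pair."""
    table, tokens, pos = [], [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        pos = m.end()
        name, lexeme = m.lastgroup, m.group()
        if name == "ws":
            continue                                   # ws yields no token
        if name == "relop":
            tokens.append(("relop", RELOP_ATTR[lexeme]))
        elif name == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))              # keyword: no attribute
        else:
            # id or number: the attribute is the index of its table entry.
            if lexeme not in table:
                table.append(lexeme)
            tokens.append((name, table.index(lexeme)))
    return tokens, table


if __name__ == "__main__":
    print(get_tokens("if x1 <= 2.5E+3 then y"))
```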
Transition Diagrams
• As an intermediate step, patterns are converted into stylized
flowcharts, called transition diagrams.
Transition Diagram
• Transition diagrams have a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process of scanning the input.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• We assume our diagrams are deterministic: at most one edge out of a given state carries a given symbol.
Transition Diagram
• Certain states are said to be accepting or final.
  • An accepting state indicates that a lexeme has been found.
  • It is drawn as a double circle.
  • An action may be attached to the accepting state.
  • The action is typically returning a token and its attribute value.
• One state is designated the start state or initial state.
  • It is indicated by an incoming edge labeled start.
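A transition diagram maps naturally onto a loop with one branch per state. The sketch below simulates a diagram for relop, using the attribute names LT, LE, EQ, NE, GT, GE. The state names start, seen_lt and seen_gt are labels chosen for this example, and returning the position after the lexeme plays the role of retracting forward.

```python
def relop(text, pos=0):
    """Simulate a transition diagram for relop starting at text[pos].

    Returns (attribute, position after the lexeme), or None if no relop
    starts at pos.
    """
    state = "start"
    forward = pos
    while True:
        # "" stands for the end of the input.
        ch = text[forward] if forward < len(text) else ""
        if state == "start":
            if ch == "<":
                state = "seen_lt"
            elif ch == "=":
                return ("EQ", forward + 1)     # accepting state
            elif ch == ">":
                state = "seen_gt"
            else:
                return None                    # not a relop lexeme
        elif state == "seen_lt":
            if ch == "=":
                return ("LE", forward + 1)
            if ch == ">":
                return ("NE", forward + 1)
            # Any other character belongs to the next lexeme: accept <
            # and retract by not consuming ch.
            return ("LT", forward)
        elif state == "seen_gt":
            if ch == "=":
                return ("GE", forward + 1)
            return ("GT", forward)             # accept > and retract
        forward += 1


if __name__ == "__main__":
    for sample in ["<=", "<a", "<>", ">", "=", "x"]:
        print(repr(sample), relop(sample))
```

Because each state has at most one edge per input character, the loop never has to backtrack, which is what the determinism assumption buys.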
Transition Diagram Example
Recognition of Reserved Words and Identifiers
• Keywords like if or then are reserved, so they are not identifiers even though they look like identifiers.
Methods to Handle Reserved Words
• Install the reserved words in the symbol table initially.
  • A field of the entry indicates that the string is not an identifier and which token it represents.
  • installID() places an identifier in the symbol table if it is not there already.
• Create a separate transition diagram for each keyword.
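The first method can be sketched as a symbol table that is preloaded with the reserved words. The class SymbolTable and the method names install_id and get_token are illustrative (install_id mirrors installID() from the slide, while get_token is a helper assumed for this example), and the reserved word list is the one from the example grammar.

```python
RESERVED = ("if", "then", "else")


class SymbolTable:
    def __init__(self):
        self.entries = []   # each entry: {"lexeme": ..., "token": ...}
        self.index = {}     # lexeme -> position in entries
        for word in RESERVED:
            # The token field marks the entry as a keyword, not an id.
            self._add(word, word)

    def _add(self, lexeme, token):
        self.index[lexeme] = len(self.entries)
        self.entries.append({"lexeme": lexeme, "token": token})
        return self.index[lexeme]

    def install_id(self, lexeme):
        """Return the entry position, adding lexeme as an id if it is absent."""
        if lexeme in self.index:
            return self.index[lexeme]
        return self._add(lexeme, "id")

    def get_token(self, lexeme):
        """Return the token name recorded for lexeme: a keyword or id."""
        return self.entries[self.install_id(lexeme)]["token"]


if __name__ == "__main__":
    table = SymbolTable()
    print(table.get_token("if"), table.get_token("score"))
```

With this approach a single identifier diagram is enough; the table lookup decides afterwards whether the lexeme was a keyword.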
Transition Diagram for Numbers
The End