2 - Lexical Analysis


Lexical Analysis

Role of Lexical Analyzer


• First phase of a compiler.
• Its main tasks are to
  • Read the input characters of the source program.
  • Group them into lexemes.
  • Produce a token for each lexeme, yielding a sequence of tokens.

01/06/2025 2
Role of Lexical Analyzer

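[Figure: the source program feeds the Lexical Analyzer, which passes each token to the Parser; the Parser requests the next token with getNextToken and sends its output on to semantic analysis; both the Lexical Analyzer and the Parser consult the Symbol Table.]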

Fig 1: Interaction between the lexical analyzer and the parser

• The lexical analyzer may perform some other tasks as well.
  • Stripping out comments and whitespace (blank, tab, newline).
  • Correlating error messages generated by the compiler with the source program.
    • For example, associating a line number with each error message.
  • Expanding macros, if the source language uses a macro preprocessor.

• The lexical analyzer may be divided into a cascade of two processes
  • Scanning
    • Simple processes that do not require tokenization of the input.
    • Deletion of comments.
    • Compaction of consecutive whitespace characters into one.
  • Lexical analysis
    • The more complex portion.
    • Produces tokens from the output of the scanner.

Tokens
• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit, for example
  • A particular keyword
  • An identifier

Patterns
• A pattern is a description of the form that the lexemes of a token may take.

Lexeme
• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Tokens, Patterns and Lexemes
printf("Total = %d\n", score);

• printf and score are lexemes matching the pattern for token id.
• "Total = %d\n" is a lexeme matching the pattern for token literal.


Token        Informal Description                      Sample Lexemes
if           Characters i, f                           if
else         Characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <=, !=
id           Letter followed by letters and digits     Pi, score
literal      Anything but ", surrounded by "s          "core dumped"
number       Any numeric constant                      3.14, 0, 6.02e23

Figure 2: Examples of Tokens

Attribute for Tokens
• More than one lexeme can match a pattern.
  • For example, 0 and 1 both match the pattern for number.
• Additional information about the matched lexeme must therefore be provided to subsequent phases.
• In many cases, the lexical analyzer returns the token name together with an attribute value.
  • The attribute value describes the lexeme represented by the token.
  • The token name influences parsing decisions.
  • The attribute value influences the translation of the token after the parse.

• An attribute can be
  • A single value.
  • A structure combining several pieces of information.
• The token id, for example, may carry information about its
  • Lexeme
  • Type
  • Location at which it is first found
• These values are stored in the symbol table.
• Hence the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
E = M * C ** 2

• Tokens for the statement:
  • <id, pointer to symbol-table entry for E>
  • <assign_op>
  • <id, pointer to symbol-table entry for M>
  • <mult_op>
  • <id, pointer to symbol-table entry for C>
  • <exp_op>
  • <number, integer value 2>
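
The token list for E = M * C ** 2 can be reproduced with a few lines of Python. This is only a sketch under assumptions: the regex table, the function name tokenize, and the use of a list index as the "pointer" into the symbol table are illustrative choices, not part of the slides.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("exp_op", r"\*\*"),    # listed before mult_op so ** is not read as two *
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table) for a tiny expression language."""
    symbol_table = []   # list of lexemes; the index stands in for a pointer
    tokens = []
    # Note: finditer silently skips characters no pattern matches.
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                      # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute
    return tokens, symbol_table
```

Calling tokenize("E = M * C ** 2") yields seven tokens in the same order as the list above, with the identifiers pointing at entries 0, 1 and 2 of the symbol table.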

Lexical Errors
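• It is hard for the lexical analyzer to detect errors without the help of other components.
fi(a%2==0)

• The lexical analyzer cannot tell whether "fi" is a misspelling of "if" or an undeclared function name.
• Since fi is a valid lexeme for the token id, it returns id and leaves the error to a later phase.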

• Suppose the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• Simplest recovery strategy: panic mode recovery.
  • Delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
• Other possible error-recovery actions
  • Delete one character from the remaining input.
  • Insert a missing character into the remaining input.
  • Replace a character by another character.
  • Transpose two adjacent characters.
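
Panic mode recovery can be sketched as a loop that discards characters until some token pattern matches again. The token patterns and the function name scan_with_panic_mode below are assumptions made for illustration; a real lexer would have many more patterns and would report errors differently.

```python
import re

# Assumed toy patterns: identifiers, integers, and a few one-character operators.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/=()]")

def scan_with_panic_mode(source):
    """Return (lexemes, errors); each error is (position, skipped_text)."""
    lexemes, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(source, pos)
        if m:
            lexemes.append(m.group())
            pos = m.end()
            continue
        # Panic mode: delete successive characters until a pattern matches
        # again (or whitespace / end of input is reached).
        start = pos
        while (pos < len(source)
               and not source[pos].isspace()
               and not TOKEN.match(source, pos)):
            pos += 1
        errors.append((start, source[start:pos]))
    return lexemes, errors
```

For the input "x = 3 @# y" this sketch keeps the lexemes x, =, 3 and y and records the skipped text "@#" as a single error.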

Input Buffering
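• We often need to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
  • We cannot be sure we have reached the end of an identifier until we see a character that cannot be part of it, such as a space.
  • A single-character operator (<, =) can also be the beginning of a two-character operator (<=, ==).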

Buffer Pair
• Buffering techniques are used to reduce the overhead required to process a single input character.
• One scheme uses two buffers that are alternately reloaded.
  • Each buffer is of the same size N.
  • N is usually the size of a disk block, e.g., 4096 bytes.
  • We can read N characters into a buffer with one system call.

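• Two pointers into the buffer are maintained
  • lexemeBegin marks the beginning of the current lexeme.
  • forward scans ahead until a pattern match is found.

[Figure: buffer contents E = M * C * * 2 eof, with lexemeBegin at the start of the current lexeme and forward scanning ahead of it.]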

• Once the next lexeme is found
  • forward, which may have moved one character past the lexeme, is retracted one position to the left.
  • lexemeBegin is set to the character immediately after the lexeme just found.

• Advancing forward requires testing whether we have reached the end of a buffer.
  • If so, the other buffer is reloaded from the input and forward is moved to the beginning of the newly loaded buffer.

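• Two checks are necessary each time forward is advanced
  • Have we reached the end of the buffer?
  • Which character have we read?
• We can combine the buffer-end test with the test for the current character by placing a sentinel at the end of each buffer.
• The sentinel is a special character that cannot be part of the source program
  • A natural choice is the character eof.

[Figure: buffer pair with sentinels: E = M * eof | C * * 2 eof … eof. The eof after * and the final eof are the sentinels that end each buffer; the eof after 2 marks the end of the input.]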
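
The sentinel scheme can be sketched in Python as follows. This is an illustration under assumptions, not the slides' algorithm: the class name BufferPair, the method next_char, the tiny buffer size N = 8, and the use of "\0" as the eof sentinel are all made up for the example, and the lexemeBegin pointer is left out for brevity.

```python
import io

EOF = "\0"   # assumed sentinel: a character that cannot occur in the source
N = 8        # tiny buffer size for illustration; a disk block in practice

class BufferPair:
    def __init__(self, stream):
        self.stream = stream
        # Two halves of N characters, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * N + 2)
        self.forward = 0
        self._reload(0)

    def _reload(self, start):
        chunk = self.stream.read(N)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        # After a full read this lands on the sentinel slot of the half;
        # after a short read it marks the real end of the input.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        if ch != EOF:                  # common case: one test per character
            self.forward += 1
            return ch
        if self.forward == N:          # sentinel ending the first half
            self._reload(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:  # sentinel ending the second half
            self._reload(0)
            self.forward = 0
            return self.next_char()
        return None                    # eof inside a buffer: end of input
```

The point of the design is visible in next_char: for ordinary characters only the comparison with the sentinel is made, and the buffer-boundary checks run only when a sentinel is actually seen.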

Specification of Token
• Regular expressions (REs) are used to specify lexeme patterns.
  • Regular expressions cannot express every possible pattern,
  • but they are very effective for specifying the patterns needed for tokens.

• Let's recap regular expressions.

Strings and Language
• Alphabet
  • Any finite set of symbols.
  • {0, 1} is the binary alphabet.
  • Other examples: ASCII, Unicode.
• String
  • A finite sequence of symbols drawn from the alphabet.
  • 0, 1, 00, 01, 1111, … are strings over the binary alphabet.
• Length of a string s, written |s|
  • The number of occurrences of symbols in s.
  • ε is the empty string, with length 0.

• Language
  • Any countable set of strings over some fixed alphabet.
  • This is a very broad definition. Examples:
    • The set of all syntactically well-formed C programs.
    • The set of all grammatically correct English sentences.

Operations on Language

Operation                   Definition and Notation
Union of L and M            L ∪ M = {s | s is in L or s is in M}
Concatenation of L and M    LM = {st | s is in L and t is in M}
Kleene closure of L         $L^{*} = \bigcup_{i=0}^{\infty} L^{i}$
Positive closure of L       $L^{+} = \bigcup_{i=1}^{\infty} L^{i}$

Example of Operations
• Let L be the set of letters {A, B, …, Z, a, b, …, z}
• Let D be the set of digits {0, 1, …, 9}
• L ∪ D
  • Set of letters and digits.
  • 62 strings of length one.
• LD
  • Set of 520 strings of length two.
  • Each is one letter followed by one digit.
• L^4
  • Set of all 4-letter strings.

• L*
  • Set of all strings of letters, including the empty string ε.
• L(L ∪ D)*
  • Set of all strings of letters and digits beginning with a letter.
• D+
  • Set of all strings of one or more digits.
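
These operations can be checked on small finite sets with Python. The helper names concat, power and closure_up_to are assumptions for illustration; since L* is infinite, closure_up_to only builds a finite approximation up to a chosen number of concatenations.

```python
from itertools import product
import string

L = set(string.ascii_letters)   # the 52 letters A-Z and a-z
D = set(string.digits)          # the 10 digits 0-9

def concat(A, B):
    """Concatenation AB = {st | s in A and t in B}."""
    return {s + t for s, t in product(A, B)}

def power(A, n):
    """A^n: A concatenated with itself n times; A^0 = {empty string}."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def closure_up_to(A, max_power):
    """Finite approximation of A*: the union of A^0 .. A^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(A, i)
    return result
```

With these definitions, len(L | D) is 62 and len(concat(L, D)) is 520, matching the counts given above.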

Regular Expressions
• A regular expression is a sequence of characters that specifies a pattern.
• If letter_ stands for any letter or the underscore, and digit stands for any digit,
• then the language of C identifiers can be described by
  • letter_ (letter_ | digit)*

Formation of Regular Expressions
• Regular expressions are built recursively out of smaller regular expressions.
• Each regular expression r denotes a language L(r).

• Rules that define the REs over an alphabet Σ
• Basis
  • ε is a regular expression, and L(ε) = {ε}.
  • If a is a symbol in Σ, then a is an RE, and L(a) = {a}.

• Induction
  • Suppose r and s are REs.
  • (r)|(s) is an RE denoting L(r) ∪ L(s).
  • (r)(s) is an RE denoting the language L(r)L(s).
  • (r)* is an RE denoting (L(r))*.
  • (r) is an RE denoting L(r).
    • That is, we can add extra pairs of parentheses around an expression without changing the language it denotes.

Precedence and Associativity
• The unary operator * has the highest precedence.
• Concatenation has the second highest precedence.
• | has the lowest precedence.
• All operators are left associative.
• Under these conventions, (a)|((b)*(c)) can be written as a|b*c.

Regular Expression Example
• Let Σ = {a, b}
• a|b denotes the language {a, b}.
• (a|b)(a|b)
  • {aa, ab, ba, bb}
• a*
  • The language of all strings of zero or more a's: {ε, a, aa, aaa, …}.
• (a|b)*
  • All strings of zero or more instances of a or b.
  • {ε, a, b, aa, ab, ba, bb, aaa, …}

Regular Definition
• Used for notational convenience
  • Give names to certain REs and use those names as symbols in later expressions.
• If Σ is an alphabet,
  • then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn

• Each d_i is a new symbol, not in Σ and not the same as any other of the d's.
• Each r_i is a regular expression over the alphabet
  • Σ ∪ {d_1, d_2, …, d_(i-1)}


Regular Definition Example
• C identifiers are strings of letters, digits and underscores.
• The regular definition of identifiers is

letter_ → A | B | … | Z | a | … | z | _
digit → 0 | 1 | … | 9
id → letter_ (letter_ | digit)*

Extension of Regular Expression
• One or more instances
  • Unary postfix operator +
  • Represents positive closure.
  • (r)+ denotes the language (L(r))+.
• Zero or one instance
  • Unary postfix operator ?
  • r? is equivalent to r | ε.
• + and ? have the same precedence and associativity as *.

• Character classes
  • A regular expression a1 | a2 | … | an, where each ai is a symbol of the alphabet, can be replaced by the shorthand [a1a2…an].
  • If a1, a2, …, an form a logical sequence (uppercase letters, lowercase letters, digits), we can write [a1-an] instead:
    • the first and last symbols separated by a hyphen.

Regular Expression Example
• Rewriting the regular definition of identifiers

letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ (letter_ | digit)*
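
The regular definition above maps almost directly onto Python's re module. In this sketch the variable names mirror the definition, while the helper is_identifier is an assumed name; note that the pattern alone does not exclude reserved words.

```python
import re

letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
# id -> letter_ (letter_ | digit)*
ID = re.compile(f"{letter_}({letter_}|{digit})*")

def is_identifier(s):
    """True if the whole string s matches the id pattern."""
    return ID.fullmatch(s) is not None
```

For example, is_identifier("score") and is_identifier("_tmp1") hold, while is_identifier("2pi") does not, because the first symbol must come from letter_.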

Recognition of Token
• So far we have seen how to express patterns using regular expressions.
• Now we want to use these patterns to recognize lexemes in the input.

• Consider the example
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id | number

• A grammar for branching statements and conditional expressions.

• The terminals of the grammar are:
  • if, then, else, relop, id, number.
• For relop we will use the comparison operators:
  • =, <>, <, >, <=, >=

digit → [0-9]
digits → digit+
number → digits(.digits)?(E[+-]? digits)?
letter → [A-Za-z]
id → letter(letter|digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
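
As a quick check of the number definition, it can be transcribed into Python's re syntax. This is a sketch: the names NUMBER and is_number are assumptions, and the dot is escaped because in re syntax a bare . matches any character.

```python
import re

digits = r"[0-9]+"
# number -> digits (. digits)? (E [+-]? digits)?
NUMBER = re.compile(rf"{digits}(\.{digits})?(E[+-]?{digits})?")

def is_number(s):
    """True if the whole string s matches the number pattern."""
    return NUMBER.fullmatch(s) is not None
```

Under this definition "0", "3.14" and "6.02E23" are numbers, while ".5" and "1." are not, since digits are required on both sides of the decimal point. A lowercase exponent marker, as in "6.02e23", is also rejected because the definition only allows E.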

• We also need to strip out whitespace.
  • ws → (blank | tab | newline)+

Tokens, Patterns and Attribute Values

Lexemes       Token Name   Attribute Value
Any ws        -            -
if            if           -
then          then         -
else          else         -
Any id        id           Pointer to table entry
Any number    number       Pointer to table entry
<             relop        LT
<=            relop        LE
=             relop        EQ
<>            relop        NE
>             relop        GT
>=            relop        GE

Transition Diagrams
• As an intermediate step, patterns are converted into stylized flowcharts, called transition diagrams.

• Transition diagrams have a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process of scanning the input.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• We assume our diagrams are deterministic: at most one edge out of a given state carries a given symbol.

• Certain states are said to be accepting or final.
  • They indicate that a lexeme has been found.
  • They are drawn as a double circle.
  • If an action is to be taken, it is attached to the accepting state.
  • The action is typically returning a token and its attribute value.
• One state is designated the start state or initial state.
  • It is indicated by an incoming edge labeled "start".

Transition Diagram Example
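
The diagram for this slide did not survive extraction. As a rough sketch of how a transition diagram turns into code, the function below simulates one plausible diagram for relop, using the lexemes and attribute values from the table above. The function name relop and the shape of its return value are assumptions for illustration; the states of the diagram are encoded as nested branches.

```python
def relop(text, pos=0):
    """Simulate a relop transition diagram starting at text[pos].

    Returns ((token_name, attribute), next_pos) on success, or None if no
    relational operator starts at pos.
    """
    ch = text[pos] if pos < len(text) else ""
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    if ch == "<":                       # state reached after reading '<'
        if nxt == "=":
            return ("relop", "LE"), pos + 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        # Any other character: accept '<' alone and retract, so that the
        # lookahead character is left for the next token.
        return ("relop", "LT"), pos + 1
    if ch == "=":
        return ("relop", "EQ"), pos + 1
    if ch == ">":                       # state reached after reading '>'
        if nxt == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1
    return None
```

The two branches that return pos + 1 after peeking at nxt correspond to accepting states that retract the forward pointer by one position.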

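Recognition of Reserved Words and Identifiers
• Keywords like if or then are reserved.
  • They are not identifiers, even though they look like identifiers.

• A transition diagram for id (a letter followed by letters or digits) will therefore also recognize if as an identifier.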

Methods to handle reserved word
• Install the reserved words in the symbol table initially.
  • A field of the entry indicates that the string is not an ordinary identifier.
  • installID() enters an identifier only if it is not already in the symbol table.
• Create a separate transition diagram for each keyword.
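
The first method can be sketched with a dictionary standing in for the symbol table. Only installID() is named on the slide; the Python spelling install_id, the helper get_token, and the choice to use the lexeme itself as the "pointer" are assumptions made for this example.

```python
# lexeme -> token name; a dict stands in for the real symbol table.
symbol_table = {}

# Reserved words are installed before any input is scanned.
for keyword in ("if", "then", "else"):
    symbol_table[keyword] = keyword

def install_id(lexeme):
    """Enter lexeme as an id unless it is already in the table.

    Returns the key of the entry, which plays the role of a pointer.
    """
    symbol_table.setdefault(lexeme, "id")
    return lexeme

def get_token(lexeme):
    """Return the token name recorded for lexeme: a keyword name or 'id'."""
    return symbol_table[lexeme]
```

After the identifier diagram accepts a lexeme, the scanner would call install_id and then get_token: for "if" the table already holds the keyword entry, so the token if is returned, while a new name such as "count" is entered and reported as id.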

Transition Diagram for Numbers

The End
