2 Compiler - Slide

This document provides an overview of compilation and lexical analysis. It defines key terms like compilers, lexical analysis, tokens, and finite state automata. It then gives examples of finite state automata for recognizing specific words and integers. The document discusses how lexical analysis works by stripping unnecessary characters and tracking line numbers from source code. It also introduces regular expressions, deterministic and nondeterministic finite automata as formalisms used for scanners in compilers.


CS 335: Lexical Analysis

Swarnendu Biswas
Semester 2022-2023-II
CSE, IIT Kanpur

Content influenced by many excellent references; see the References slide for acknowledgements.

An Overview of Compilation

• A compiler translates a source program into a target program through a pipeline of phases:
  lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator
• The symbol table and the error handler are shared by all phases

CS 335 Swarnendu Biswas

Overview of Lexical Analysis

• First stage of a three-part frontend that helps understand the source program
• Processes every character in the input program
• If a word is valid, it is assigned to a syntactic category
  • This is similar to identifying the part of speech of an English word (noun, verb, adjective, punctuation)

Compilers are engineered objects.

Description of Lexical Analysis

• Input
  • A high-level language (e.g., C++ or Java) program in the form of a sequence of ASCII characters
• Output
  • A sequence of tokens, along with attributes corresponding to different syntactic categories, that is forwarded to the parser for syntax analysis
• Functionality
  • Strips off blanks, tabs, newlines, and comments from the source program
  • Keeps track of line numbers and associates error messages from various parts of the compiler with line numbers
  • Performs some preprocessor functions in languages like C


Recognizing the Word “new”

c = getNextChar();
if (c == ‘n’)
  c = getNextChar();
  if (c == ‘e’)
    c = getNextChar();
    if (c == ‘w’)
      report success;
    else
      // Other logic
  else
    // Other logic
else
  // Other logic

The corresponding FSA: s0 —n→ s1 —e→ s2 —w→ s3 (accepting)

Formalism for Scanners
Regular expressions, DFAs, and NFAs

Definitions

• An alphabet is a finite set of symbols
  • Typical symbols are letters, digits, and punctuation
  • ASCII and Unicode are examples of alphabets
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet
• A language is any countable set of strings over a fixed alphabet

Finite State Automaton

• A finite state automaton (FSA) is a five-tuple, or quintuple, (S, Σ, δ, s0, SF)
  • S is a finite set of states
  • Σ is the alphabet or character set; it is the union of all edge labels in the FSA and is finite
  • δ(s, c) represents the transition from state s on input c
  • s0 ∈ S is the designated start state
  • SF ⊆ S is the set of final states
• An FSA accepts a string x if and only if
  i. the FSA starts in s0,
  ii. executes transitions for the sequence of characters in x, and
  iii. is in an accepting state ∈ SF after x has been consumed


FSA for Recognizing “new”

• FSA = (S, Σ, δ, s0, SF)
  • S = {s0, s1, s2, s3}
  • Σ = {n, e, w}
  • δ = {s0 —n→ s1, s1 —e→ s2, s2 —w→ s3}
  • Start state: s0
  • SF = {s3}

char = getNextChar()
state = s0                      // se is the error state
while (char ≠ EOF and state ≠ se)
  state = δ(state, char)
  char = getNextChar()
if (state ∈ SF)
  report success
else
  report failure

A string is recognized in time proportional to the length of the input.

FSA for Unsigned Integers

• FSA = (S, Σ, δ, s0, SF)
  • S = {s0, s1, s2, se}
  • Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  • δ = {s0 —0→ s1, s0 —1-9→ s2, s2 —0-9→ s2, s1 —0-9→ se}
  • Start state: s0
  • SF = {s1, s2}
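The unsigned-integer FSA above can be simulated directly from its transition table. A minimal sketch in Python (the string state names and the dictionary encoding of δ are conventions of this sketch, not part of the slide):

```python
# Simulation of the unsigned-integer FSA from the slide.
# States: s0 (start), s1 (accepts "0"), s2 (accepts other integers), se (error).
DELTA = {
    ("s0", "0"): "s1",
    **{("s0", d): "s2" for d in "123456789"},
    **{("s2", d): "s2" for d in "0123456789"},
    **{("s1", d): "se" for d in "0123456789"},  # a leading zero cannot be extended
}
ACCEPTING = {"s1", "s2"}

def accepts(s: str) -> bool:
    state = "s0"
    for ch in s:
        # Any missing transition goes to the error state.
        state = DELTA.get((state, ch), "se")
        if state == "se":
            return False
    return state in ACCEPTING
```

Note that because se is a trap state, the simulation can stop as soon as it is entered.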

Dealing with Erroneous Situations

• The FSA is in state s, the next input character is c, and δ(s, c) is not defined
• The FSA processes the complete input and is still not in a final state
• The input string is a proper prefix of some word accepted by the FSA

Nondeterministic Finite Automaton

• An NFA is an FSA that allows transitions on the empty string ε and can have states with multiple transitions on the same input character
• Simulating an NFA
  • Always make the correct nondeterministic choice, following transitions that lead to accepting state(s) for the input string, if such transitions exist
  • Or try all nondeterministic choices in parallel to search the space of all possible configurations
• Simulating a DFA is more efficient than simulating an NFA
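The “try all choices in parallel” strategy can be sketched by tracking the set of states the NFA could currently be in. The example NFA below, for (a|b)*ab, is an assumption of this sketch and is not taken from the slides:

```python
EPS = ""  # label used for epsilon transitions

def epsilon_closure(states, delta):
    """All states reachable from `states` via epsilon transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(s, delta, start, finals):
    # Track every state the NFA could be in after each character.
    current = epsilon_closure({start}, delta)
    for ch in s:
        moved = set()
        for st in current:
            moved |= delta.get((st, ch), set())
        current = epsilon_closure(moved, delta)
    return bool(current & finals)

# NFA for (a|b)*ab: state 0 loops on a and b and "guesses" when the final ab starts.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
```

Here the per-character cost is proportional to the number of live states, which is why simulating a DFA (one live state) is cheaper.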
Regular Expressions

• The set of words accepted by an FSA F is called its language L(F)
• For any FSA F, we can also describe L(F) using a notation called regular expressions (REs)
• The language described by an RE r is called a regular language, denoted L(r)

• ε is an RE, and L(ε) = {ε}
• Let Σ be an alphabet. For each a ∈ Σ, a is an RE, and L(a) = {a}
• Let r and s be REs denoting the languages R and S, respectively
  • Alternation (or union): (r|s) is an RE, L(r|s) = R ∪ S = {x | x ∈ R or x ∈ S} = L(r) ∪ L(s)
  • Concatenation: (rs) is an RE, L(rs) = R.S = {xy | x ∈ R ∧ y ∈ S}
  • Closure: (r*) is an RE, L(r*) = R* = ⋃_{i=0}^{∞} R^i
    • L* is called the Kleene closure, or simply closure, of L

Examples of Regular Expressions

L = set of all strings of 0’s and 1’s
r = (0 + 1)*

L = {w ∈ {0,1}* | w has two or three occurrences of 1, of which the first and second are not consecutive}
r = 0*10*010*(10* + ε)

L = {w ∈ {0,1}* | w has no pair of consecutive zeros}
r = (1 + 01)*(0 + ε)

L = {w | w ∈ {a,b}* ∧ w ends with a}
r = (a + b)*a

Unsigned real numbers with exponents:
r = (0 | [1-9][0-9]*)(.[0-9]* | ε)(E(+ | − | ε)(0 | [1-9][0-9]*) | ε)
Regular Expressions

• We can reduce the use of parentheses by introducing precedence and associativity rules
  • The binary operators, concatenation and alternation, are left-associative
  • The precedence order is
    parentheses > closure > concatenation > alternation

Algebraic Rules for REs

Rule                                 Description
r|s = s|r                            | is commutative
r|(s|t) = (r|s)|t                    | is associative
r(st) = (rs)t                        Concatenation is associative
r(s|t) = rs|rt;  (s|t)r = sr|tr      Concatenation distributes over |
εr = rε = r                          ε is the identity for concatenation
r* = (r|ε)*                          ε is guaranteed in a closure
(r*)* = r*                           * is idempotent

Regular Definitions

• Let ri be a regular expression and di be a distinct name
• A regular definition is a sequence of definitions of the form
  d1 → r1
  d2 → r2
  …
  dn → rn
• Each ri is a regular expression over the symbols Σ ∪ {d1, d2, …, d(i−1)}
• Each di is a new symbol not in Σ

Example of Regular Definitions

• Unsigned numbers (e.g., 5280, 0.01234, 6.336E4, or 1.89E-4)

digit       = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digits      = digit digit*
optfrac     = . digits | ε
optexp      = (E (+ | − | ε) digits) | ε
unsignednum = digits optfrac optexp


Extensions of Regular Expressions

• “.” is any character other than “\n”
• [xyz] is x | y | z
• [abg-pT-Y] is any one of the characters a, b, g, …, p, T, …, Y
• [^G-Q] is any character other than G, H, …, Q
• r+ is one or more r’s
• r? is zero or one r

Example of Regular Definitions

• Unsigned numbers (e.g., 5280, 0.01234, 6.336E4, or 1.89E-4)

digit       = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digits      = digit digit*
optfrac     = . digits | ε
optexp      = (E (+ | − | ε) digits) | ε
unsignednum = digits optfrac optexp

The extensions make the definitions simpler to write:

digit       = [0-9]
digits      = digit+
unsignednum = digits (. digits)? (E (+|−)? digits)?
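The shorthand definition of unsignednum maps almost directly onto a practical regular-expression library. A sketch in Python (here `\d` plays the role of digit, and `fullmatch` checks that the whole string matches):

```python
import re

# digits (. digits)? (E (+|-)? digits)?  from the slide, transliterated:
UNSIGNED_NUM = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

def is_unsigned_num(s: str) -> bool:
    return UNSIGNED_NUM.fullmatch(s) is not None
```

All four example lexemes from the slide (5280, 0.01234, 6.336E4, 1.89E-4) match this pattern.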

Equivalence of RE and FSA

• There exists an NFA with ε-transitions that accepts L(r), where r is an RE
• If L is accepted by a DFA, then L is generated by an RE

RE —Thompson’s construction→ NFA —subset construction→ DFA —DFA minimization→ minimal DFA → code for a scanner
Kleene’s construction recovers an RE from a DFA
• DFAs are easier to simulate than NFAs; minimization improves the run time and memory overhead of the scanner

NFA to DFA: Subset Construction

NFA = (N, Σ, δN, n0, NA), DFA = (D, Σ, δD, d0, DA)

Subset Construction
  q0 = ε-closure({n0})
  Q = {q0}
  WorkList = {q0}
  while (WorkList ≠ φ) do
    remove q from WorkList
    for each character c ∈ Σ do
      t = ε-closure(δN(q, c))
      T[q, c] = t
      if t ∉ Q then
        add t to Q and to WorkList

ε-closure (computed for all states)
  for each state n ∈ N do
    E(n) = {n}
  WorkList = N
  while (WorkList ≠ φ) do
    remove n from WorkList
    t = {n} ∪ ⋃_{n —ε→ p ∈ δN} E(p)
    if t ≠ E(n)
      E(n) = t
      WorkList = WorkList ∪ {m | m —ε→ n ∈ δN}
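The subset construction above can be sketched compactly in Python. Here the NFA is encoded as a dictionary `delta[(state, symbol)] → set of states`, with `""` as the ε label; these encoding choices are assumptions of the sketch:

```python
EPS = ""

def eps_closure(states, delta):
    """ε-closure computed on demand (the slide precomputes E(n) for all n)."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(delta, n0, sigma):
    q0 = eps_closure({n0}, delta)
    Q, worklist, T = {q0}, [q0], {}
    while worklist:
        q = worklist.pop()
        for c in sigma:
            moved = set()
            for s in q:
                moved |= delta.get((s, c), set())
            t = eps_closure(moved, delta)
            T[(q, c)] = t
            # The empty set is the dead state; this sketch does not enqueue it.
            if t and t not in Q:
                Q.add(t)
                worklist.append(t)
    return q0, Q, T
```

For example, Thompson’s NFA for a|b (states 0–5, ε-edges 0→1, 0→3, 2→5, 4→5, and edges 1—a→2, 3—b→4) yields the DFA start state {0, 1, 3}.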
DFA to Minimal DFA: Hopcroft’s Algorithm

• A DFA from subset construction can have a large number of states
  • This does not increase the time needed to scan a string
  • It does increase the space requirement of the scanner in memory
    • The speed of accesses to main memory may turn out to be the bottleneck
    • A smaller scanner has a better chance of fitting in the processor cache

Splitting a Partition

[Figure: a character a does not split a partition p1 = {di, dj, dk} when all of its states transition on a into the same partition; a splits a partition p3 when its states transition on a into different partitions, yielding partitions such as p6 = {di} and p7 = {dj, dk} after splitting on a.]
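The split test illustrated above, combined with the refinement loop of Hopcroft’s algorithm given on the next slide, can be sketched in Python. The dict encoding of δ and the use of `None` for the implicit error partition are assumptions of this sketch:

```python
def split(S, partition_of, delta, sigma):
    """Split block S on the first symbol whose target partitions disagree."""
    S = list(S)
    for c in sigma:
        target = lambda s: partition_of.get(delta.get((s, c)))
        first = target(S[0])
        s1 = {s for s in S if target(s) == first}
        s2 = set(S) - s1
        if s2:
            return [s1, s2]
    return [set(S)]

def minimize(states, finals, delta, sigma):
    # Initial partition: accepting states vs. everything else.
    T = [b for b in (set(finals), set(states) - set(finals)) if b]
    P = []
    while P != T:          # iterate to a fixed point
        P = T
        partition_of = {s: i for i, b in enumerate(P) for s in b}
        T = []
        for p in P:
            T.extend(split(p, partition_of, delta, sigma))
    return T
```

A DFA for a* with a redundant second state collapses to a single block, while distinguishing the states by acceptance keeps them apart.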

DFA to Minimal DFA: Hopcroft’s Algorithm

Minimization
  T = {DA, D − DA}
  P = φ
  while (P ≠ T) do
    P = T
    T = φ
    for each set p ∈ P do
      T = T ∪ Split(p)

Split(S)
  for each c ∈ Σ do
    if c splits S into s1 and s2
      return {s1, s2}
  return S

Realizing Scanners


Tokens

float abs_zero = -273; /* Kelvin */

• Token
  • A string of characters which logically belong together in a syntactic category
  • Sentences consist of a string of tokens (e.g., float, identifier, assign, minus, intnum, semicolon)
  • Tokens are treated as terminal symbols of the grammar specifying the source language
  • Tokens may have optional attributes
  • Examples of tokens in programming languages: keywords, operators, identifiers (names), constants, literal strings, and punctuation symbols (parentheses, brackets, commas, semicolons, and colons)

Patterns and Lexemes

• Pattern
  • The rule describing the set of strings for which the same token is produced
  • The pattern is said to match each string in the set
  • For the line above: float, letter(letter|digit|_)*, =, -, digit+, ;
• Lexeme
  • The sequence of characters matched by a pattern to form the corresponding token
  • For the line above: “float”, “abs_zero”, “=”, “-”, “273”, “;”

Attributes of Tokens

• An attribute of a token is a value that the scanner extracts from the corresponding lexeme and supplies to the syntax analyzer
• Example attributes for tokens
  • identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is stored by the LA; also the type of the identifier and the location where it was first found
  • intnum: the value of the integer (similarly for floatnum, etc.)
• The exact set of attributes is dependent on the compiler designer

Role of a Lexical Analyzer

• Identify tokens and corresponding lexemes
• Construct constants: for example, convert a number to token intnum and pass the value as its attribute
  • 31 becomes <intnum, 31>
• Recognize keywords and identifiers
  • counter = counter + increment becomes id = id + id
  • Check that id here is not a keyword
• Discard whatever does not contribute to parsing
  • White spaces (blanks, tabs, newlines) and comments


Specifying and Recognizing Patterns and Tokens

• Patterns are denoted with REs and recognized with FSAs
• Regular definitions, a mechanism based on regular expressions, are popular for the specification of tokens
• Transition diagrams, a variant of FSAs, are used to implement regular definitions and to recognize tokens
  • Usually used to model the LA before translating it to an executable program

Transition Diagrams

• Transition diagrams (TDs) are generalized DFAs with the following differences
  • Edges may be labelled by a symbol, a set of symbols, or a regular definition
  • A few accepting states may be indicated as retracting states
    • A retracting state indicates that the lexeme does not include the symbol that caused the transition to the accepting state
  • Each accepting state has an action attached to it
    • The action is executed when the state is reached (e.g., return a token and its attribute value)

Examples of Transition Diagrams

Identifiers and reserved words:
  letter     = [a-zA-Z]
  digit      = [0-9]
  identifier = letter(letter|digit)*

  start → (0) —letter→ (1) ⟲ letter/digit —other→ (2)*   return(get_token_code(), name)

• * indicates a retracting state
• get_token_code() searches a table to check if the name is a reserved word and returns its integer code if so
  • Otherwise, it returns the integer code of the IDENTIFIER token, with name containing the string of characters forming the token
  • The name is not relevant for reserved words

A Sample Specification

Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | number

Patterns:
  digit  → [0-9]
  digits → digit+
  number → digits (. digits)? (E (+|−)? digits)?
  letter → [A-Za-z]
  id     → letter (letter | digit)*
  if     → if
  then   → then
  else   → else
  relop  → < | > | <= | >= | = | <>
  ws     → (blank | tab | newline)+


Tokens, Lexemes, and Attributes

Lexemes        Token Name    Attribute Value
Any ws         --            --
if             if            --
then           then          --
else           else          --
Any id         id            Pointer to symbol table entry
Any number     number        Pointer to symbol table entry
<              relop         LT
<=             relop         LE
=              relop         ASSGN
<>             relop         NE
>              relop         GT
>=             relop         GE

Transition Diagram for relop

  start → (0)
  (0) —<→ (1);  (1) —=→ (2) return(relop, LE);  (1) —>→ (3) return(relop, NE);  (1) —other→ (4)* return(relop, LT)
  (0) —=→ (5) return(relop, ASSGN)
  (0) —>→ (6);  (6) —=→ (7) return(relop, GE);  (6) —other→ (8)* return(relop, GT)

• States marked * are retracting states: the “other” character is pushed back to the input
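The relop transition diagram translates into straight-line code, one branch per edge. A sketch in Python, where the returned flag says whether the last character must be retracted (the `(token, attribute, retract)` shape is a convention of this sketch):

```python
def relop(get_next_char):
    """Walk the relop transition diagram; get_next_char() returns "" at EOF."""
    c = get_next_char()
    if c == "<":
        c = get_next_char()
        if c == "=":
            return ("relop", "LE", False)
        if c == ">":
            return ("relop", "NE", False)
        return ("relop", "LT", True)    # retracting state (4)*: c is pushed back
    if c == "=":
        return ("relop", "ASSGN", False)
    if c == ">":
        c = get_next_char()
        if c == "=":
            return ("relop", "GE", False)
        return ("relop", "GT", True)    # retracting state (8)*
    raise ValueError("not a relational operator")
```

A caller would push the lookahead character back into its input buffer whenever the retract flag is set.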

Transition Diagrams for IDs and Keywords

  start → (9) —letter→ (10) ⟲ letter/digit —other→ (11)*   return(get_token_code(), name)

Transition Diagram for Unsigned Numbers

  start → (12) —digit→ (13) ⟲ digit —.→ (14) —digit→ (15) ⟲ digit —E→ (16) —+|−→ (17) —digit→ (18) ⟲ digit —other→ (19)*
  with (13) —E→ (16) and (16) —digit→ (18); additional retracting accepting states (e.g., (20)*, (21)*) accept numbers without a fraction or an exponent

Transition Diagram for Whitespace

  start → (22) —delim→ (23) ⟲ delim —other→ (24)*          (delim = blank | tab | newline)


Combining Transition Diagrams to form a Lexical Analyzer

• Different transition diagrams (TDs) must be combined appropriately to yield a scanner. How do we do this?
• One option: try the different transition diagrams one after another
  • For example, TDs for reserved words, constants, identifiers, and operators could be tried in that order
  • However, this does not implement the “longest match” characteristic
    • thenext should be an identifier, and not the reserved word then followed by the identifier ext
• To find the longest match, all TDs must be tried and the longest match must be used
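The longest-match rule can be sketched by trying every pattern at the current position and keeping the longest lexeme, with declaration order breaking ties. The pattern set below is an illustrative assumption:

```python
import re

# Order matters only for breaking length ties (keyword beats id for "then").
PATTERNS = [
    ("keyword", re.compile(r"then|if|else")),
    ("id",      re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number",  re.compile(r"[0-9]+")),
    ("ws",      re.compile(r"[ \t\n]+")),
]

def tokenize(src):
    pos, tokens = 0, []
    while pos < len(src):
        best = None
        for name, pat in PATTERNS:
            m = pat.match(src, pos)
            # Keep the strictly longest match; earlier patterns win ties.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"illegal character at position {pos}")
        name, lexeme = best
        if name != "ws":            # whitespace is discarded
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

With this rule, thenext scans as a single identifier, while then ext yields the keyword followed by an identifier.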

Challenges in Lexical Analysis

• Certain languages like PL/I do not have any reserved words
  • while, do, if, and else are reserved in C but not in PL/I
  • This makes it difficult for the scanner to distinguish between keywords and user-defined identifiers
  • The following are valid PL/I statements:
      if then then then = else else else = then
      if if then then = then + 1
• PL/I declarations
  • DECLARE(arg1,arg2,arg3,…,argn)
  • The scanner cannot tell whether DECLARE is a keyword with variable definitions or a procedure with arguments until after the “)”
  • This requires arbitrary lookahead and very large buffers
    • Worse, the buffers may have to be reloaded in case of wrong inferences


Challenges in Lexical Analysis

• fi (a == g(x)) …
  • Is fi a typo or a function call?
  • Remember, fi is a valid lexeme for IDENTIFIER
• Think of C++
  • Template syntax: Foo<Bar>
  • Stream syntax: cin >> var;
  • Nested templates: Foo<Bar<Bazz>>
• Can these problems be resolved by lexical analyzers alone? No; in some cases the parser needs to help.

• Consider a fixed-format language like Fortran
  • 80 columns per line
    • Columns 1-5: the statement number/label
    • Column 6: continuation mark
    • Columns 7-72: the program statements
    • Columns 73-80: ignored (used for other purposes)
  • The letter C in column 1 means the current line is a comment

Challenges in Lexical Analysis

• In fixed-format Fortran, some keywords are context-dependent
  • In the statement DO 10 I = 10.86, DO10I is an identifier, and DO is not a keyword
  • But in the statement DO 10 I = 10, 86, DO is a keyword
  • Blanks are not significant in Fortran and can appear in the midst of identifiers
    • The variable “counter” is the same as “count er”
    • In Fortran, blanks are important only in literal strings
  • Reading from left to right, one cannot distinguish between the two statements until the “,” or “.” is reached
  • This requires lookahead for resolution

Programming Languages vs Natural Languages

• The meaning of words in natural languages is often context-sensitive
  • An English word can be a noun or a verb (e.g., “stress”)
  • “are” is a verb, “art” is a noun, and “arz” is undefined
• Grammars of programming languages are rigorously specified to provide meaning
  • Words in a programming language are always lexically specified
  • For example, any string in (1…9)(0…9)* is a positive integer


Why Separate Tokens and Lexemes?

• The rules that govern the lexical structure of a programming language are called its microsyntax
• Separating syntax and microsyntax allows for a simpler parser
  • The parser only needs to deal with syntactic categories like IDENTIFIER
  • A parser is more complicated than a lexical analyzer, and shrinking the grammar makes the parser more efficient

Lexical Analysis as a Separate Phase

1. Simplifies the compiler design: I/O issues are limited to only the lexical analyzer, leading to better portability
2. Allows designing a more compact and faster parser
   • Comments and whitespace need not be handled by the parser
   • No rules for numbers, names, and comments are needed in the parser
3. Scanners based on finite automata are more efficient to implement than the stack-based pushdown automata used for parsing

Interfacing with Parser

• A unique integer representing the token is passed by the LA to the parser

  source program → Lexical Analyzer —token→ Syntax Analyzer → to semantic analysis
                                    ←get next token—
  (both the Lexical Analyzer and the Syntax Analyzer consult the symbol table)

Error Handling in Lexical Analysis

• The LA cannot catch any errors other than simple ones such as illegal symbols
• In such cases, the LA skips characters in the input until a well-formed token is found
  • This is called “panic mode” recovery
• We can think of other possible recovery strategies
  • Delete one character from the remaining input, or insert a missing character
  • Replace a character, or transpose two adjacent characters
  • The idea is to see if a single (or a few) transformation(s) can repair the error


Other Uses of Lexical Analysis Concepts

• UNIX command-line tools like grep, awk, and sed
• Search tools in editors
• Word-processing tools

Implementing Scanners

Implementing Scanners

1. Specify REs for each syntactic category in the PL
2. Construct an NFA for each RE
3. Join the NFAs with ε-transitions
4. Create the equivalent DFA
5. Minimize the DFA
6. Generate code to implement the DFA

Implementation Considerations

• Speed is paramount for scanning
  • The scanner processes every character from a possibly large input source program
• Repeatedly read input characters and simulate the corresponding DFA
• Types of scanner implementations: table-driven, direct-coded, and hand-coded
  • The asymptotic complexity is the same; they differ in run-time costs


High-Level Idea in Implementing Scanners

• Read input characters one by one
• Look up the transition based on the current state and the input character
• Switch to the new state
• Check for termination conditions, i.e., accept and error
• Repeat

Table-Driven Scanner

• A scanner generator takes the lexical patterns and produces lexical tables, which are interpreted by a generic FSA interpreter
• Running example: register specification, e.g., r1 and r27
  • RE: r [0-9]+
  • DFA: s0 —r→ s1 —[0-9]→ s2 ⟲ [0-9]

Table-Driven Scanner

state = s0; lexeme = “”;
clear stack; push(bad);

// Model the DFA
while (state ≠ se)
  char = getNextChar()
  lexeme = lexeme + char
  if state ∈ SA
    clear stack
  push(state)
  cat = lookup(char)           // classify the character
  state = δ(state, cat)        // involves two table lookups

// Rollback to the most recent accepting state
while (state ∉ SA and state ≠ bad)
  state = pop()
  truncate lexeme
  rollback()

if state ∈ SA
  return token
else
  return invalid

Character classification table:
  r → Register    0, 1, 2, …, 9 → Digit    EOF → EOF    everything else → Other

Transition table δ:
  δ    Register   Digit   Other
  s0   s1         se      se
  s1   se         s2      se
  s2   se         s2      se
  se   se         se      se
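The table-driven loop and its rollback stack can be sketched in Python for the register DFA above (returning the lexeme and the position after it is a convention of this sketch):

```python
BAD, ERR = "bad", "se"
ACCEPTING = {"s2"}
CLASSIFY = {"r": "Register", **{d: "Digit" for d in "0123456789"}}
DELTA = {("s0", "Register"): "s1", ("s1", "Digit"): "s2", ("s2", "Digit"): "s2"}

def next_token(src, pos):
    start = pos
    state, lexeme, stack = "s0", "", [BAD]
    while state != ERR and pos < len(src):
        ch = src[pos]; pos += 1
        lexeme += ch
        if state in ACCEPTING:
            stack.clear()               # no need to roll back past this point
        stack.append(state)
        cat = CLASSIFY.get(ch, "Other")          # first table lookup
        state = DELTA.get((state, cat), ERR)     # second table lookup
    # Roll back to the most recent accepting state.
    while state not in ACCEPTING and state != BAD:
        state = stack.pop()
        lexeme = lexeme[:-1]
        pos -= 1
    if state in ACCEPTING:
        return lexeme, pos
    return None, start                  # no prefix matched
```

On input r17+ the scanner overshoots into the error state on +, then pops back to the accepting state and returns the lexeme r17.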


Problem of Rollbacks

• A scanner’s aim is to recognize the longest match, but this can cause many rollbacks
  • Consider the RE ab | (ab)*c and the input abababab: the scanner repeatedly scans to the end of the input looking for a c, rolls back, and re-scans
• A scanner can avoid such pathological quadratic expense by remembering failed attempts
  • Such scanners are called maximal munch scanners

Addressing Excessive Rollbacks

state = s0; lexeme = “”;
clear stack; push(⟨bad, bad⟩);
inputPos = 0
for each state s ∈ DFA
  for i = 1:|input stream|
    Failed[s, i] = false

while (state ≠ se)
  char = getNextChar()
  lexeme = lexeme + char
  inputPos = inputPos + 1
  if Failed[state, inputPos]
    break
  if state ∈ SA
    clear stack
  push(⟨state, inputPos⟩)
  cat = lookup(char)
  state = δ(state, cat)

// Rollback
while (state ∉ SA and state ≠ bad)
  Failed[state, inputPos] = true
  ⟨state, inputPos⟩ = pop()
  truncate lexeme
  rollback()
if state ∈ SA
  return token
else
  return invalid

Overhead with Table Lookups

• A two-dimensional table with c columns and entries of width w stored at base is indexed as
    Address = base + (i*c + j) * w      for row i and column j
  while a one-dimensional table is indexed as
    Address = base + offset * w
• The table-driven scanner performs two address computations and two load operations for each character that it processes

Direct-Coded Scanner

lexeme = “”; clear stack;
push(bad); goto s0;

s0: char = getNextChar()
    lexeme = char
    if state ∈ SA
      clear stack
    push(s0)
    if (char == ‘r’)
      goto s1
    else
      goto se

s1: char = getNextChar()
    lexeme = lexeme + char
    if state ∈ SA
      clear stack
    push(s1)
    if (‘0’ ≤ char ≤ ‘9’)
      goto s2
    else
      goto se


Direct-Coded Scanner

s2: char = getNextChar()
    lexeme = lexeme + char
    if state ∈ SA
      clear stack
    push(s2)
    if (‘0’ ≤ char ≤ ‘9’)
      goto s2
    else
      goto se

se: while (state ∉ SA and state ≠ bad)
      state = pop()
      truncate lexeme
      rollback()
    if state ∈ SA
      return token
    else
      return invalid

Hand-Coded Scanner

• Many real-world compilers use hand-coded scanners for further efficiency
  • For example, gcc 4.0 uses hand-coded scanners in several of its front ends
• Two common optimizations:
  i. Fetching a character one by one from I/O is expensive; fetch a number of characters in one go and store them in a buffer
  ii. Use double buffering to simplify lookahead and rollback

Reading Characters from Input

• A scanner reads the input character by character
  • Reading the input would be very inefficient if it required a system call for every character read
• Input buffer
  • The OS reads a block of data, supplies the scanner the required amount, and stores the remaining portion in a buffer called the buffer cache
  • In subsequent calls, no actual I/O takes place as long as the data is available in the buffer cache
  • The scanner uses its own buffer as well, since requesting the OS for a single character is also costly due to context-switching overhead

Optimizing Reads from the Buffer

• A buffer may end with an initial portion of a lexeme
  • For the input E = M*C**2, a buffer may end after E = M *
• This creates a problem when refilling the buffer, so a two-buffer scheme is used where the two buffers are filled alternately

  E = M * C * * 2 eof
  (lexBegin marks the start of the current lexeme; forward scans ahead)


Optimizing Reads from the Buffer

• Each read from the buffer requires two tests: (1) check for the end of the buffer, and (2) test the type of the input character
  • If at the end of a buffer, reload the other buffer

Advance Forward Pointer

if (forward is at end of first buffer) {
  reload second buffer
  forward = beginning of second buffer
} else if (forward is at end of second buffer) {
  reload first buffer
  forward = beginning of first buffer
} else {
  forward++
}

Optimizing Reads from the Buffer

• A sentinel character (say eof) is placed at the end of each buffer to avoid the two comparisons per character

  Buffer 1: E = M eof    Buffer 2: * C * * 2 eof eof
  (lexBegin and forward as before; the final eof marks the true end of input)

switch (*forward++) {
case eof:
  if (forward is at end of first buffer) {
    reload second buffer
    forward = beginning of second buffer
  } else if (forward is at end of second buffer) {
    reload first buffer
    forward = beginning of first buffer
  } else { // end of input
    break
  }
// cases for other characters
}
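The two-buffer scheme with sentinels can be sketched as follows. The buffer size, the NUL sentinel, and the assumption that the input contains no NUL characters are all choices of this sketch:

```python
import io

SENTINEL = "\0"   # assumes the input itself never contains NUL
BUFSIZE = 4       # illustrative; real scanners use, e.g., 4096

class TwoBufferReader:
    def __init__(self, stream):
        self.stream = stream
        self.bufs = ["", ""]
        self.cur, self.pos = 0, 0
        self._reload(0)

    def _reload(self, which):
        # Fill one buffer from the stream and append the sentinel.
        self.bufs[which] = self.stream.read(BUFSIZE) + SENTINEL

    def next_char(self):
        ch = self.bufs[self.cur][self.pos]
        self.pos += 1
        if ch != SENTINEL:
            return ch
        # Sentinel: end of this buffer, or true end of input?
        if self.pos - 1 < BUFSIZE:    # buffer was not full: end of input
            self.pos -= 1             # stay on the sentinel for repeated calls
            return ""
        other = self.cur ^ 1
        self._reload(other)           # reload the other buffer
        self.cur, self.pos = other, 0
        return self.next_char()
```

The fast path tests only the character itself; the end-of-buffer check runs only when the sentinel is actually seen.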


Symbol Table

• A data structure that stores information for subsequent phases
• Symbol table interface
  • insert(s, t): save lexeme s and token t, and return a pointer to the entry
  • lookup(s): return the index of the entry for lexeme s, or 0 if s is not found

Implementation of Symbol Table

• Option 1: a fixed amount of space (say 32 bytes) for the lexeme in each entry, alongside the other attributes
  • A fixed amount of space to store lexemes might waste space
• Option 2: each entry stores a pointer (say 4 bytes) to the lexeme, alongside the other attributes
  • The lexemes themselves are kept in a separate character array: lexeme1 eos lexeme2 eos …
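The insert/lookup interface can be sketched with a Python dict. The sketch also preloads keyword lexemes, anticipating the scheme described on the Handling Keywords slide; the token names are illustrative:

```python
ID = "ID"

class SymbolTable:
    def __init__(self):
        self.entries = {}              # lexeme -> token

    def insert(self, lexeme, token):
        if lexeme in self.entries:     # subsequent inserts fail
            return False
        self.entries[lexeme] = token
        return True

    def lookup(self, lexeme):
        return self.entries.get(lexeme)

def token_for(table, lexeme):
    """Called when the ID transition diagram accepts `lexeme`."""
    token = table.lookup(lexeme)
    if token is None:
        table.insert(lexeme, ID)
        token = ID
    return token

table = SymbolTable()
table.insert("div", "DIV")    # seed keywords before scanning begins
table.insert("mod", "MOD")
```

With the table seeded, a lookup on div returns the keyword token, while any other accepted name is entered as an identifier.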

Handling Keywords

• Two choices: use separate REs for keywords, or recognize them as ID tokens and compare lexemes
• Consider tokens DIV and MOD with lexemes div and mod
  • Initialize the symbol table with insert(“div”, DIV) and insert(“mod”, MOD) before scanning begins
  • Any subsequent insert fails, and any subsequent lookup returns the keyword value
  • These lexemes can no longer be used as identifiers

References

• A. Aho et al. Compilers: Principles, Techniques, and Tools, 2nd edition, Chapter 3.
• K. Cooper and L. Torczon. Engineering a Compiler, 2nd edition, Chapter 2.
