2 Compiler - Slide
CS 335: Lexical Analysis
Swarnendu Biswas
Semester 2022-2023-II
CSE, IIT Kanpur
[Figure: compiler phases: lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer, code generator; shared components: symbol table, error handler]
Content influenced by many excellent references, see References slide for acknowledgements.
$L = \{\, w \mid w \in \{a, b\}^{*} \wedge w \text{ ends with } a \,\}$
$r = (a + b)^{*} a$
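The agreement between the set definition and the regular expression can be checked by brute force on short strings. The following is a minimal sketch in Python, not part of the original slides; Python's `re` writes alternation as `|` rather than `+`, and the helper names `in_language`, `strings_up_to`, and `agree` are illustrative.

```python
import re
from itertools import product

# (a + b)* a in the slide's notation; Python's re uses | for alternation.
R = re.compile(r"(?:a|b)*a")


def in_language(w: str) -> bool:
    """Set definition of L: w is a string over {a, b} that ends with a."""
    return set(w) <= {"a", "b"} and w.endswith("a")


def strings_up_to(n: int, alphabet: str = "ab"):
    """Yield every string over the alphabet with length 0..n."""
    for k in range(n + 1):
        for chars in product(alphabet, repeat=k):
            yield "".join(chars)


def agree(n: int) -> bool:
    """True if the regex and the set definition agree on all strings up to length n."""
    return all(
        (R.fullmatch(w) is not None) == in_language(w)
        for w in strings_up_to(n)
    )
```

A bounded check like `agree(6)` is evidence, not a proof, that the two descriptions denote the same language.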
CS 335 Swarnendu Biswas
Regular Expressions
• We can reduce the use of parentheses by introducing precedence and associativity rules
• The operators closure, concatenation, and alternation are all left associative
• Precedence rule is
  parentheses > closure > concatenation > alternation
Algebraic Rules for REs
Rule                                      Description
$r|s = s|r$                               | is commutative
$r|(s|t) = (r|s)|t$                       | is associative
$r(st) = (rs)t$                           Concatenation is associative
$r(s|t) = rs|rt$; $(s|t)r = sr|tr$        Concatenation distributes over |
$\epsilon r = r\epsilon = r$              $\epsilon$ is the identity of concatenation
$r^{*} = (r|\epsilon)^{*}$                $\epsilon$ is guaranteed in a closure
$r^{**} = r^{*}$                          * is idempotent
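These identities can be spot-checked by comparing the sets of short strings that two expressions accept. The sketch below is an illustration in Python rather than anything from the slides: the sample sub-expressions chosen for r, s, and t are arbitrary, the empty group `(?:)` stands in for ε, and a bounded comparison is only evidence of equivalence, not a proof.

```python
import re
from itertools import product


def language(pattern: str, n: int, alphabet: str = "ab") -> set:
    """Return every string of length <= n over the alphabet that pattern matches fully."""
    rx = re.compile(pattern)
    words = ("".join(p) for k in range(n + 1) for p in product(alphabet, repeat=k))
    return {w for w in words if rx.fullmatch(w)}


def equivalent(p: str, q: str, n: int = 6) -> bool:
    """Bounded check: p and q accept the same strings up to length n."""
    return language(p, n) == language(q, n)


# Arbitrary sample sub-expressions standing in for r, s, t; EPS is the empty string.
r, s, t = "(?:ab)", "(?:b)", "(?:ba)"
EPS = "(?:)"

RULES = {
    "| is commutative": (f"{r}|{s}", f"{s}|{r}"),
    "| is associative": (f"{r}|(?:{s}|{t})", f"(?:{r}|{s})|{t}"),
    "concatenation is associative": (f"{r}(?:{s}{t})", f"(?:{r}{s}){t}"),
    "left distribution": (f"{r}(?:{s}|{t})", f"{r}{s}|{r}{t}"),
    "right distribution": (f"(?:{s}|{t}){r}", f"{s}{r}|{t}{r}"),
    "epsilon is the identity": (f"{EPS}{r}", f"{r}{EPS}"),
    "epsilon is in a closure": (f"{r}*", f"(?:{r}|{EPS})*"),
    "* is idempotent": (f"(?:{r}*)*", f"{r}*"),
}

RESULTS = {name: equivalent(p, q) for name, (p, q) in RULES.items()}
```

Every entry of `RESULTS` should come out `True` for these samples; swapping in other sub-expressions for `r`, `s`, and `t` is an easy way to probe the rules further.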
NFA $= (N, \Sigma, \delta_N, n_0, N_A)$
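To make the five components concrete, here is a minimal Python sketch of an NFA and its simulation on an input string. It is an illustration under stated assumptions: the class layout and field names are ours, ε-transitions are left out for brevity, and the example automaton (strings over {a, b} that end with a) is hand-built rather than taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class NFA:
    states: frozenset      # N
    alphabet: frozenset    # Sigma
    delta: dict            # delta_N: (state, symbol) -> set of next states
    start: str             # n_0
    accepting: frozenset   # N_A

    def accepts(self, w: str) -> bool:
        """Track the set of states the NFA could be in after each symbol."""
        current = {self.start}
        for ch in w:
            if ch not in self.alphabet:
                return False
            current = {q for p in current for q in self.delta.get((p, ch), ())}
        return bool(current & self.accepting)


# Hand-built example: on 'a' the automaton may stay in n0 or guess that this
# is the final symbol and move to the accepting state n1.
ENDS_WITH_A = NFA(
    states=frozenset({"n0", "n1"}),
    alphabet=frozenset("ab"),
    delta={("n0", "a"): {"n0", "n1"}, ("n0", "b"): {"n0"}},
    start="n0",
    accepting=frozenset({"n1"}),
)
```

Tracking a set of current states is what lets a deterministic program follow all nondeterministic choices at once.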
[Figure: transition diagram for relop (fragment); state 8 is marked * and labeled return(relop, GT)]
Lexeme   Token   Attribute
<>       relop   NE
>        relop   GT
>=       relop   GE
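A transition diagram for relop can be coded directly as nested tests on the current and next character. The sketch below is a hedged illustration in Python: only the NE, GT, and GE rows appear in the surviving text, so the LT, LE, and EQ cases are assumptions added by analogy, and the function name `get_relop` and its return shape are ours.

```python
def get_relop(text: str, pos: int = 0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""

    def peek(i: int) -> str:
        # Empty string stands for "no more input".
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2   # assumed case
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        # Any other character: accept '<' alone and do not consume the lookahead.
        return ("relop", "LT"), pos + 1       # assumed case
    if c == "=":
        return ("relop", "EQ"), pos + 1       # assumed case
    if c == ">":
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2
        # Lookahead is not part of the lexeme, so it is left unread.
        return ("relop", "GT"), pos + 1
    return None
```

Leaving the lookahead character unread in the GT and LT branches plays the role of retracting the input pointer after reading one character too many.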
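Transition Diagrams for IDs and Keywords
[Figure: transition diagram for IDs and keywords; edge label: letter/digit]
Transition Diagram for Unsigned Numbers
[Figure: transition diagram for unsigned numbers; edge labels: digit, E]
Whitespace
[Figure: transition diagram for whitespace; edge label: delim]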
• To find the longest match, all transition diagrams (TDs) must be tried, and the longest match must be used
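The longest-match rule can be sketched by trying every pattern at the current position and keeping the one that consumes the most input. In this Python illustration each regular expression stands in for one transition diagram; the pattern set, the token names, and the tie-breaking rule (earlier pattern wins) are assumptions made for the example.

```python
import re

# Hypothetical token patterns; each one stands in for a transition diagram.
# Inside "relop" the longer alternatives come first because Python's
# alternation picks the first alternative that matches, not the longest.
PATTERNS = [
    ("relop", re.compile(r"<=|<>|>=|<|=|>")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("number", re.compile(r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?")),
    ("ws", re.compile(r"[ \t\n]+")),
]


def next_token(text: str, pos: int):
    """Try every pattern at pos and return (name, lexeme, end) for the longest match."""
    best = None
    for name, rx in PATTERNS:
        m = rx.match(text, pos)
        # Strictly longer matches replace the current best, so ties go to
        # the pattern listed first.
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best


def tokenize(text: str):
    """Split text into (name, lexeme) pairs, dropping whitespace."""
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no pattern matches at position {pos}")
        name, lexeme, pos = tok
        if name != "ws":
            out.append((name, lexeme))
    return out
```

For instance, on the input `<=` both `<` and `<=` are valid relop lexemes, and the rule selects `<=`.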
Implementing Scanners
[Figure: FSA interpreter]
• Look up the transition based on the current state and the input character
• Switch to the new state
• Check for termination conditions, i.e., accept and error
• Repeat
• Register specification: r [0…9]
• For example, r1 and r27
[Figure: FSA with states s0, s1, s2; edge labels: r, [0…9], [0…9]]
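The interpreter loop above can be sketched as a table-driven recognizer. This Python example assumes the register pattern means the letter r followed by one or more digits, which is consistent with r1 and r27 but is our reading of the slide; the explicit error state `se` and the function name `is_register` are also illustrative.

```python
DIGITS = "0123456789"

# Transition table for the assumed pattern: 'r' followed by one or more digits.
DELTA = {("s0", "r"): "s1"}
DELTA.update({("s1", d): "s2" for d in DIGITS})
DELTA.update({("s2", d): "s2" for d in DIGITS})

ACCEPTING = {"s2"}
ERROR = "se"  # hypothetical error state, entered on any missing transition


def is_register(word: str) -> bool:
    """Run the FSA over word and report whether it ends in an accepting state."""
    state = "s0"
    for ch in word:
        # Look up the transition and switch to the new state.
        state = DELTA.get((state, ch), ERROR)
        # Termination condition: error.
        if state == ERROR:
            return False
    # Termination condition: input exhausted; accept only in a final state.
    return state in ACCEPTING
```

Because the automaton lives entirely in `DELTA` and `ACCEPTING`, the same loop can interpret a different FSA by swapping in another table.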
Optimizing Reads from the Buffer
• A sentinel character (say eof) is placed at the end of each buffer so that advancing forward needs one comparison per character instead of two (an end-of-buffer test plus the test on the character itself)
[Figure: buffer pair containing "E = M eof * C * * 2 eof eof", with pointers lexBegin and forward]
switch (*forward++) {
  case eof:
    if (forward is at end of first buffer) {
      reload second buffer
      forward = beginning of second buffer
    } else if (forward is at end of second buffer) {
      reload first buffer
      forward = beginning of first buffer
    } else { // end of input
      break
    }
  …
  // case for other characters
}
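The sentinel scheme can be simulated in a few lines to see the reload logic in action. The following Python sketch makes several assumptions that are not in the slides: the sentinel is the NUL character and never occurs in the source text, each half holds only N = 4 characters so that reloads happen often, and lexBegin is not modelled. The class and method names are ours.

```python
import io

EOF = "\0"  # sentinel; assumed never to occur in the source text
N = 4       # tiny half-buffer size so that reloads are easy to observe


class BufferPair:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [first half | sentinel | second half | sentinel]
        self.buf = [EOF] * (2 * N + 2)
        self.forward = 0
        self._load(0)

    def _load(self, start: int) -> None:
        """Read up to N characters into the half that begins at index start."""
        chunk = self.stream.read(N)
        for i in range(N):
            # A short read is padded with the sentinel, which marks end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at end of input."""
        ch = self.buf[self.forward]
        self.forward += 1                 # mirrors *forward++
        if ch != EOF:
            return ch                     # common case: a single comparison
        if self.forward == N + 1:         # sentinel at end of first half
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 2:     # sentinel at end of second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        self.forward -= 1                 # real end of input: stay on the sentinel
        return None


def read_all(text: str) -> str:
    """Drain a BufferPair over text and return everything it produced."""
    pair = BufferPair(io.StringIO(text))
    return "".join(iter(pair.next_char, None))
```

Only the rare sentinel case pays for the extra position checks; every ordinary character costs one comparison.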
• Consider tokens DIV and MOD with lexemes div and mod
• Initialize the symbol table with insert("div", DIV) and insert("mod", MOD) before scanning begins
• Any subsequent insert of these lexemes fails, and any subsequent lookup returns the keyword value
• These lexemes can no longer be used as identifiers
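The reserved-word scheme can be sketched with a small dictionary-backed symbol table. In this Python illustration only `insert` and `lookup` come from the slide; the class `SymbolTable`, the helper `classify`, the string token names, and the convention that a failed insert returns `False` are assumptions.

```python
class SymbolTable:
    def __init__(self):
        self._entries = {}

    def insert(self, lexeme: str, token: str) -> bool:
        """Add lexeme with its token; fail (return False) if it is already present."""
        if lexeme in self._entries:
            return False
        self._entries[lexeme] = token
        return True

    def lookup(self, lexeme: str):
        """Return the token stored for lexeme, or None if it is absent."""
        return self._entries.get(lexeme)


def classify(table: SymbolTable, lexeme: str) -> str:
    """Token for an identifier-shaped lexeme: the keyword token if reserved, else ID."""
    token = table.lookup(lexeme)
    if token is None:
        table.insert(lexeme, "ID")
        token = "ID"
    return token


# Reserve the keywords before scanning starts.
table = SymbolTable()
table.insert("div", "DIV")
table.insert("mod", "MOD")
```

With the keywords preloaded, the scanner needs only the identifier pattern: `classify(table, "div")` yields `DIV`, while an ordinary name such as `count` is entered as `ID`.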