2 Lexing
Adapted from slides by Steve Zdancewic, UPenn
Lexical analysis, tokens, lexer generators, regular
expressions, automata
Compilation in a Nutshell
Source Code
(Character stream)
if (b == 0) { a = 0; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 0 ; }
Parsing
Analysis &
Transformation
Backend
Assembly Code
l1:
cmpq $0, %eax
je l2
jmp l3
l2:
…
First Step: Lexical Analysis
• Change the character stream “if (b == 0) { a = 0; }” into
tokens:
if ( b == 0 ) { a = 0 ; }
DEMO: HANDLEX
Lexing By Hand
• How hard can it be?
– Tedious and painful!
• Problems:
– Precisely define tokens
– Multiple tokens may match
– Each case’s behavior depends on other cases
– Error handling is tricky
– Hard to maintain
LEXER GENERATOR (LEX)
Regular Expressions: Refresher
• The key is a compact, checkable way of writing down what
all the tokens are
• We can use regular expressions!
• Example: two equivalent regular expressions describing one or
more a’s followed by one or more b’s:
a+b+
aa*bb*
Lexer Generators
• Read a list of regular expressions: R1,…,Rn , one per token.
• Each token has an attached “action” Ai (an arbitrary piece of
code to run when the regular expression is matched):
A lex input file has three sections, separated by %%:
definitions
%%
regular expressions    actions      ← one rule per token
%%
int main(){
t = yylex();
}
↑ end section: arbitrary C code, can call the lexing function yylex
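A complete minimal lex file following this skeleton might look like the sketch below. The token codes (IF, SEMICOLON, IDENT) are illustrative assumptions, not a fixed lex convention:

```lex
%{
/* definitions section: C declarations visible to the actions below.
   These token codes are made up for this example. */
enum { IF = 1, SEMICOLON, IDENT };
%}
%%
"if"        { return IF; }
";"         { return SEMICOLON; }
[a-z]+      { return IDENT; }
[ \t\n]     { /* skip whitespace: action returns no token */ }
.           { printf("Error: unrecognized token\n"); }
%%
int main() {
    int t;
    while ((t = yylex()) != 0)   /* yylex returns 0 at end of input */
        printf("token %d\n", t);
    return 0;
}
```

Built with the traditional toolchain this would be roughly `lex file.lex` followed by `cc lex.yy.c -ll` (with flex, link `-lfl` instead, which supplies the default `yywrap`).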
Lex Actions
"if" { return IF; } keywords and special chars:
";" { return SEMICOLON; } just return the token
. { printf("Error: unrecognized
token"); }
use a wildcard to catch input
that isn’t a token
Running the Lexer
• Running lex <filename>.lex generates a file lex.yy.c, which can
then be compiled with a C compiler
Program 1: Lexer
• Posted on the course website
(https://www.cs.uic.edu/~mansky/teaching/cs473/sp21/program1.html)
• Extend a simple lexer with support for more features
• Due next Wednesday at the start of class
• Submit via Gradescope
• Extending a lexer:
1. What do we want the syntax to look like?
2. What tokens does it use that we don’t already have?
3. Add cases and regexps for those tokens
4. Add actions for those regexps, returning appropriate values
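As a concrete (hypothetical) instance of steps 1–4: suppose the new syntax needs integer literals. The token is a digit sequence; calling it NUM is my assumption, and the rule added to the rules section would be roughly:

```lex
[0-9]+      { return NUM; }   /* NUM is an assumed token code */
```

If a later phase needs the literal's value, the action can also convert the matched text, which lex exposes as `yytext` (e.g., storing `atoi(yytext)` in `yylval` when paired with yacc).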