2 Lexing

CS 473: COMPILER DESIGN
Adapted from slides by Steve Zdancewic, UPenn
Lexical analysis, tokens, lexer generators, regular expressions, automata

LEXING

Compilation in a Nutshell

Source Code (character stream):
  if (b == 0) { a = 1; }

Lexical Analysis → Token stream:
  if ( b == 0 ) { a = 1 ; }

Parsing

Analysis & Transformation

Backend → Assembly Code:
  l1:
    cmpq $0, %eax
    je l2
    jmp l3
  l2:
    …
First Step: Lexical Analysis
• Change the character stream “if (b == 0) { a = 0; }” into
  tokens:

  if ( b == 0 ) { a = 0 ; }

  IF; LPAREN; ID(“b”); EQEQ; NUM(0); RPAREN; LBRACE;
  ID(“a”); EQ; NUM(0); SEMI; RBRACE

• Token: data type that represents indivisible “chunks” of text:
  – Identifiers: a y11 elsex _100
  – Keywords: if else while
  – Integers: 2 200 -500 5L
  – Floating point: 2.0 .02 1e5
  – Symbols: + * ` { } ( ) ++ << >> >>>
  – Strings: "x" "He said, \"Are you?\""
  – Comments: (* CS476: Project 1 … *) /* foo */

• Often delimited by whitespace (‘ ’, \t, etc.)
  – In some languages (e.g. Python or Haskell) whitespace is significant
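The token data type above can be sketched in C as a tagged union (illustrative names only; the course's actual definitions may differ):

```c
#include <assert.h>

/* Sketch: tokens as a tagged union (illustrative names only). */
typedef enum { IF, LPAREN, RPAREN, LBRACE, RBRACE,
               EQ, EQEQ, SEMI, ID, NUM } TokenKind;

typedef struct {
    TokenKind kind;
    union {
        int         num;   /* payload for NUM(…) */
        const char *name;  /* payload for ID(…)  */
    } val;
} Token;
```

With this representation, ID(“b”) is the value `(Token){ ID, { .name = "b" } }`, and keywords like IF carry no payload at all.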
How hard can it be?

DEMO: HANDLEX (handlex.c)
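A fragment of what such a hand-written lexer might look like (a hypothetical sketch in the spirit of handlex.c, not the actual demo code):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical hand-written lexer sketch (not the demo's code). */
typedef enum { T_IF, T_ID, T_NUM, T_PUNCT, T_EOF, T_ERR } TKind;

typedef struct {
    TKind kind;
    char  text[64];   /* the matched lexeme */
} Tok;

/* Read one token starting at *p, advancing *p past it. */
Tok next_token(const char **p) {
    Tok t = { T_EOF, "" };
    while (isspace((unsigned char)**p)) (*p)++;   /* skip whitespace */
    const char *start = *p;
    if (**p == '\0') return t;
    if (isalpha((unsigned char)**p) || **p == '_') {
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
        size_t n = (size_t)(*p - start);
        memcpy(t.text, start, n); t.text[n] = '\0';
        /* keyword check must come after the identifier is fully read,
           or "ifx" would wrongly lex as IF followed by ID("x") */
        t.kind = strcmp(t.text, "if") == 0 ? T_IF : T_ID;
    } else if (isdigit((unsigned char)**p)) {
        while (isdigit((unsigned char)**p)) (*p)++;
        size_t n = (size_t)(*p - start);
        memcpy(t.text, start, n); t.text[n] = '\0';
        t.kind = T_NUM;
    } else if (strchr("(){};", **p)) {   /* single-char punctuation */
        t.text[0] = *(*p)++; t.text[1] = '\0';
        t.kind = T_PUNCT;
    } else {
        t.kind = T_ERR;
        (*p)++;
    }
    return t;
}
```

Even this fragment dodges the hard cases: telling `=` from `==` needs lookahead, and every token class added risks interacting with the existing ones.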
Lexing By Hand
• How hard can it be?
– Tedious and painful!

• Problems:
– Precisely define tokens
– Multiple tokens may match
– Each case’s behavior depends on other cases
– Error handling is tricky
– Hard to maintain

LEXER GENERATOR (LEX)

Regular Expressions: Refresher
• The key is a compact, checkable way of writing down what
  all the tokens are
• We can use regular expressions!

• Exercise: Write a regular expression for “strings of one or
  more a’s followed by one or more b’s.”

• Example solutions:
  a+b+
  aa*bb*
Lexer Generators
• Read a list of regular expressions R1,…,Rn, one per token.
• Each token has an attached “action” Ai (an arbitrary piece of
  code to run when the regular expression is matched):

  '-'?digit+              { return NUM; }
  '+'                     { return PLUS; }
  'if'                    { return IF; }
  [a-z]([0-9]|[a-z]|'_')* { return ID; }
  whitespace+             { /* do nothing */ }

• Generates scanning code that:
  1. Decides whether the input is of the form (R1|…|Rn)*
  2. Whenever the scanner matches a (longest) token, runs the
     associated action
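The “longest token” rule in step 2 (maximal munch) can be sketched as follows, with hand-written matchers standing in for the compiled regular expressions (all names are illustrative):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Each matcher reports how many leading characters of s it can
   consume (0 = no match), standing in for a compiled regex. */
static size_t match_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;   /* the 'if' rule */
}
static size_t match_id(const char *s) {        /* [a-z]([0-9]|[a-z]|'_')* */
    size_t n = 0;
    if (islower((unsigned char)s[0]))
        while (islower((unsigned char)s[n]) ||
               isdigit((unsigned char)s[n]) || s[n] == '_') n++;
    return n;
}

/* Pick the longest match, breaking ties by rule order.
   Returns 0 for IF, 1 for ID, -1 for no match; *len gets the
   length of the chosen lexeme. */
int scan_one(const char *s, size_t *len) {
    size_t n_if = match_if(s), n_id = match_id(s);
    if (n_if == 0 && n_id == 0) return -1;
    if (n_if >= n_id) { *len = n_if; return 0; }  /* earlier rule wins ties */
    *len = n_id; return 1;
}
```

This is why "ifx" lexes as the single token ID("ifx") rather than IF followed by ID("x"): the ID rule's match is longer, so it wins even though the IF rule also matches a prefix.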
DEMO: LEX
https://sourceforge.net/projects/flex/
Manual: http://dinosaur.compilertools.net/lex/index.html
Anatomy of a Lex file
%{
  /* prelude: definitions and helper functions, written in C */
  typedef union { … } YYSTYPE;
  YYSTYPE yylval;
%}

%%

  /* body: regular expressions and associated actions
     (again, written in C) */
"if"    { return IF; }
";"     { return SEMICOLON; }
[0-9]+  { yylval.ival = atoi(yytext);
          return NUM; }

%%

/* end: arbitrary C code, can call the lexing function yylex */
int main() {
  int t = yylex();
}
Lex Actions
• Keywords and special chars: just return the token
  "if"    { return IF; }
  ";"     { return SEMICOLON; }

• Tokens with content: set yylval, return the token type
  [0-9]+  { yylval.ival = atoi(yytext); return NUM; }

• If it doesn’t affect the program, don’t make a token
  " "     { continue; }

• Use a wildcard to catch input that isn’t a token
  .       { printf("Error: unrecognized token"); }
Running the Lexer
• Running lex <filename>.lex generates a file lex.yy.c

• The file defines a function called yylex, which looks for
  matches to the regular expressions, and runs the associated
  action when it finds one
  – If there are multiple possible matches, it chooses the longest
    match

• If the lexer has a main function, we can just compile and run
  lex.yy.c

• Otherwise, we can use the lexer as a library, and call the
  generated yylex function in other files (the rest of the
  compiler)
Program 1: Lexer
• Posted on the course website
  (https://www.cs.uic.edu/~mansky/teaching/cs473/sp21/program1.html)
• Extend a simple lexer with support for more features
• Due next Wednesday at the start of class
• Submit via Gradescope

• Extending a lexer:
  1. What do we want the syntax to look like?
  2. What tokens does it use that we don’t already have?
  3. Add cases and regexps for those tokens
  4. Add actions for those regexps, returning appropriate values