COS 320 Compilers: David Walker
Compilers
David Walker
Outline
• Last Week
– Introduction to ML
• Today:
– Lexical Analysis
– Reading: Chapter 2 of Appel
The Front End

stream of characters --> Lexer --> stream of tokens --> Parser --> abstract syntax --> Type Checker
Lexical Analysis Example

x = ( y + 4.0 ) ;

==> ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
Lexer Implementation
• Implementation Options:
1. Write a Lexer from scratch
– Boring, error-prone and too much work
2. Use a Lexer Generator
– Quick and easy. Good for lazy compiler writers.
Lexer Generator

lexer specification --> lexer generator --> Lexer

stream of characters --> Lexer --> stream of tokens
• How do we specify the lexer?
– Develop another language: we will use a language of regular expressions to specify tokens
• Writing out (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9), and even worse (a | b | c | ...), gets tedious...
Regular Expressions
• Common abbreviations:
– [a-c] == (a | b | c)
– .  == any character except \n
– \n == the newline character
– a+ == one or more a's
– a? == zero or one a
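These abbreviations appear in most regex tools. As a quick sanity check (in Python's `re` module, used here purely for illustration; its syntax happens to mirror the slide's notation):

```python
import re

# [a-c] is shorthand for (a | b | c)
assert re.fullmatch(r"[a-c]", "b")
assert not re.fullmatch(r"[a-c]", "d")

# . matches any single character except newline
assert re.fullmatch(r".", "x")
assert not re.fullmatch(r".", "\n")

# a+ means one or more a's; a? means zero or one a
assert re.fullmatch(r"a+", "aaa")
assert not re.fullmatch(r"a+", "")
assert re.fullmatch(r"a?", "")
assert re.fullmatch(r"a?", "a")
```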
Ambiguous Token Rule Sets
• We resolve ambiguities using two rules:
– Longest match: the regular expression that matches the longest string takes precedence.
– Rule priority: the regular expressions identifying tokens are written down in sequence; if two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
Ambiguous Token Rule Sets
• Example:
– Identifier tokens: a-z (a-z | 0-9)*
– Sample keyword tokens: if, then, ...
• How do we tokenize?
– foobar ==> ID(foobar), not ID(foo) ID(bar) — by longest match
– if ==> IF, not ID(if) — by rule priority, since the keyword rules are listed first
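The two disambiguation rules can be prototyped directly. Below is an illustrative Python sketch (the course's actual tool, ML-Lex, compiles the rules into automata rather than scanning rule-by-rule as done here; token names follow the slide, the helper is invented):

```python
import re

# Ordered rule list: keywords come before identifiers (rule priority).
RULES = [
    ("IF",   re.compile(r"if")),
    ("THEN", re.compile(r"then")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
]

def tokenize(s):
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        # Longest match: try every rule at position i and keep the
        # longest lexeme; on a tie, the earliest rule in RULES wins.
        best = None
        for name, rx in RULES:
            m = rx.match(s, i)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"illegal character at position {i}")
        tokens.append(best)
        i += len(best[1])
    return tokens

# "foobar" is one identifier (longest match); "if" is a keyword (rule priority).
assert tokenize("foobar") == [("ID", "foobar")]
assert tokenize("if") == [("IF", "if")]
```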
ML-Lex Specification Structure
• An ML-Lex specification has three sections, separated by %%:

User Declarations
%%
ML-LEX Definitions
%%
Rules
User Declarations
• User Declarations:
– The user can define values that are available to the action fragments.
– Two values must be defined in this section:
• type lexresult
– the type of the value returned by each rule action.
• fun eof ()
– called by the lexer when the end of the input stream is reached.
ML-LEX Definitions
• ML-LEX Definitions:
– The user can define regular expression abbreviations:

DIGITS = [0-9]+;
LETTER = [a-zA-Z];
A Complete ML-Lex Specification

fun itos s = case Int.fromString s of SOME x => x | NONE => raise Fail "itos"
%%
NUM = [1-9][0-9]*;
ID = [a-zA-Z] ([a-zA-Z] | {NUM})*;
%%
if => (IF);
then => (THEN);
else => (ELSE);
{NUM} => (Num (itos yytext));
{ID} => (Id yytext);
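The spec above can be approximated in plain Python (used only for illustration; the course tooling is ML-Lex). A single alternation tried in rule order mimics rule priority; the \b after each keyword keeps, e.g., "iffy" from matching the if rule, approximating longest match:

```python
import re

# Alternatives are tried left to right, so keywords listed first
# beat the ID rule, as in the ML-Lex rules above. This is a sketch,
# not how a generated lexer actually works internally.
TOKEN_RE = re.compile(r"""
    (?P<IF>if\b) | (?P<THEN>then\b) | (?P<ELSE>else\b)
  | (?P<NUM>[1-9][0-9]*)
  | (?P<ID>[a-zA-Z][a-zA-Z0-9]*)
  | (?P<WS>\s+)
""", re.VERBOSE)

def scan(src):
    out, i = [], 0
    while i < len(src):
        m = TOKEN_RE.match(src, i)
        if not m:
            raise ValueError(f"illegal character {src[i]!r}")
        kind = m.lastgroup
        if kind == "NUM":
            out.append(("NUM", int(m.group())))  # like (Num (itos yytext))
        elif kind == "ID":
            out.append(("ID", m.group()))        # like (Id yytext)
        elif kind != "WS":
            out.append((kind, m.group()))        # keyword tokens
        i = m.end()
    return out

assert scan("if x then 42 else y2") == [
    ("IF", "if"), ("ID", "x"), ("THEN", "then"),
    ("NUM", 42), ("ELSE", "else"), ("ID", "y2"),
]
```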
Using Multiple Lexers
• Rules prefixed with a lexer name are matched
only when that lexer is executing
• Enter new lexer using command YYBEGIN
• Initial lexer is called INITIAL
Using Multiple Lexers
%%
%s COMMENT
%%
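The YYBEGIN mechanism can be sketched by hand: the scanner below keeps a current-state variable and routes input to that state's rules, here used only to skip ML-style (* ... *) comments. The function name and structure are invented for illustration, not ML-Lex output:

```python
def strip_comments(src):
    out, state, i = [], "INITIAL", 0
    while i < len(src):
        if state == "INITIAL":
            if src.startswith("(*", i):
                state = "COMMENT"       # like YYBEGIN COMMENT
                i += 2
            else:
                out.append(src[i])      # ordinary input passes through
                i += 1
        else:  # COMMENT state: only COMMENT-prefixed rules apply
            if src.startswith("*)", i):
                state = "INITIAL"       # like YYBEGIN INITIAL
                i += 2
            else:
                i += 1                  # discard comment text
    return "".join(out)

assert strip_comments("a (* hidden *) b") == "a  b"
```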
[Figures: finite-automaton transition diagrams (states 1-4, edges labeled a, b, c, +, =, and a-z), lost in extraction]
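The diagrams that originally appeared here showed transition diagrams for token rules. As a reconstruction of the identifier automaton only (the state numbering and transition table are my assumptions), a table-driven DFA for a-z (a-z | 0-9)* might look like:

```python
# States: 1 = start, 2 = accepting (at least one letter seen).
def char_class(c):
    if c.islower(): return "a-z"
    if c.isdigit(): return "0-9"
    return None  # any other character has no outgoing edge

DELTA = {
    (1, "a-z"): 2,   # first character must be a letter
    (2, "a-z"): 2,   # subsequent letters stay in the accept state
    (2, "0-9"): 2,   # digits allowed after the first letter
}
ACCEPT = {2}

def accepts(s):
    state = 1
    for c in s:
        state = DELTA.get((state, char_class(c)))
        if state is None:     # no transition: reject
            return False
    return state in ACCEPT

assert accepts("foo42")
assert not accepts("4foo")    # cannot start with a digit
assert not accepts("")
```

A lexer generator builds tables like DELTA automatically from the regular expressions, one combined automaton for all the rules at once.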