Lexing
ACKNOWLEDGEMENT: SLIDES ARE ADAPTED FROM MATERIALS BY JOSHUA ELLIS AND ANTHONY CLARK
INTERPRETER
Source Code (Plain Text) -> Lexer/Scanner -> Lexemes/Tokens
Lexer/Scanner (working on this): uses regular expressions and finite state machines to group characters into the smallest meaningful units.

Source code:

    // GCD Program (in C)
    int main() {
        int i = getint(), j = getint();
        while (i != j) {
            if (i > j) i = i - j;
            else j = j - i;
        }
        putint(i);
    }

Lexemes/tokens:

    int main ( ) {
    int i = getint ( ) , j = getint ( ) ;
    while ( i != j ) {
    if ( i > j ) i = i - j ;
    else j = j - i ;
    }
    putint ( i ) ;
    }
The Finite-State Machine Used by a Turnstile
[Figure: FSM diagram notation. Legend: State; Transition/Condition; State, or State and Details/information; Entrance; Exit.]
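The turnstile machine can be written down as a small transition function. The sketch below assumes the usual textbook turnstile (two states, Locked and Unlocked, driven by a coin or a push); the slide's own diagram did not survive extraction, so the names `State`, `Input`, `step`, and `run` are illustrative, not taken from the slides.

```rust
/// The two states of the (assumed) coin-operated turnstile.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Locked,
    Unlocked,
}

/// The inputs (transition conditions) the turnstile reacts to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Input {
    Coin,
    Push,
}

/// Transition function: current state + input -> next state.
fn step(state: State, input: Input) -> State {
    match (state, input) {
        (State::Locked, Input::Coin) => State::Unlocked,   // paying unlocks the arm
        (State::Locked, Input::Push) => State::Locked,     // pushing a locked arm does nothing
        (State::Unlocked, Input::Coin) => State::Unlocked, // extra coins change nothing
        (State::Unlocked, Input::Push) => State::Locked,   // passing through re-locks the arm
    }
}

/// Run the machine over a sequence of inputs, starting from Locked.
fn run(inputs: &[Input]) -> State {
    inputs.iter().fold(State::Locked, |state, &input| step(state, input))
}
```

A lexer has the same shape: the states record what has been seen so far, and each input character selects the next state.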
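INTERPRETER
[Figure: interpreter pipeline.]
Lexer: reads the source text.
Parser: requests a token from the lexer; the lexer sends a token (repeated for each token).
Tree Walker: requests the AST from the parser; the parser sends the AST. The tree walker does I/O with the console.
AST (Abstract Syntax Tree)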
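The request/send relationship in this pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's interpreter: the toy language (integers joined by `+`), the `Ast` type, and the names `next_token`, `parse`, and `walk` are all hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(i64),
    Plus,
    End,
}

#[derive(Debug, PartialEq)]
enum Ast {
    Num(i64),
    Add(Box<Ast>, Box<Ast>),
}

struct Lexer {
    input: Vec<char>,
    cursor: usize,
}

impl Lexer {
    fn new(src: &str) -> Self {
        Lexer { input: src.chars().collect(), cursor: 0 }
    }

    /// Called by the parser each time it requests a token.
    fn next_token(&mut self) -> Result<Token, String> {
        while self.cursor < self.input.len() && self.input[self.cursor].is_whitespace() {
            self.cursor += 1;
        }
        if self.cursor >= self.input.len() {
            return Ok(Token::End);
        }
        let c = self.input[self.cursor];
        if c == '+' {
            self.cursor += 1;
            return Ok(Token::Plus);
        }
        if c.is_ascii_digit() {
            // Group a run of digits into one number (overflow is not handled here).
            let mut value = 0i64;
            while self.cursor < self.input.len() && self.input[self.cursor].is_ascii_digit() {
                value = value * 10 + self.input[self.cursor].to_digit(10).unwrap() as i64;
                self.cursor += 1;
            }
            return Ok(Token::Number(value));
        }
        Err(format!("unexpected character '{}'", c))
    }
}

/// Parser: requests tokens one at a time and builds the AST.
fn parse(lexer: &mut Lexer) -> Result<Ast, String> {
    let mut tree = match lexer.next_token()? {
        Token::Number(n) => Ast::Num(n),
        other => return Err(format!("expected a number, found {:?}", other)),
    };
    loop {
        match lexer.next_token()? {
            Token::End => return Ok(tree),
            Token::Plus => match lexer.next_token()? {
                Token::Number(n) => tree = Ast::Add(Box::new(tree), Box::new(Ast::Num(n))),
                other => return Err(format!("expected a number, found {:?}", other)),
            },
            other => return Err(format!("expected '+', found {:?}", other)),
        }
    }
}

/// Tree walker: evaluates the AST it receives from the parser.
fn walk(ast: &Ast) -> i64 {
    match ast {
        Ast::Num(n) => *n,
        Ast::Add(left, right) => walk(left) + walk(right),
    }
}
```

The lexer never runs ahead on its own in this sketch; it produces a token only when the parser asks for one.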
LEXER
A scanner (lexer) takes in raw source code (plain text) and turns it into lexemes
• A lexeme is the smallest grouping of characters that represents
something useful/meaningful
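As a rough illustration of grouping characters into lexemes, the sketch below splits one line of C-like source. The function name `lexemes` and its rules (alphanumeric runs, four two-character operators, everything else as a single character) are simplifying assumptions for this example, not the scanner built in the course.

```rust
/// Split a line of C-like source into lexemes (simplified rules).
fn lexemes(src: &str) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            // Whitespace separates lexemes but is not a lexeme itself.
            i += 1;
        } else if c.is_alphanumeric() || c == '_' {
            // Group a run of letters, digits, and underscores
            // (an identifier, keyword, or integer).
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            out.push(chars[start..i].iter().collect());
        } else if i + 1 < chars.len() && chars[i + 1] == '=' && "!=<>".contains(c) {
            // Two-character operators: != == <= >=
            out.push(format!("{}=", c));
            i += 2;
        } else {
            // Any other character stands alone: ( ) { } ; , + - ...
            out.push(c.to_string());
            i += 1;
        }
    }
    out
}
```

For the loop header of the GCD program, `lexemes("while (i != j) {")` yields `while`, `(`, `i`, `!=`, `j`, `)`, `{`.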
GENERATING AND DESIGNING A SCANNER
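    fn get_next_token(&mut self) -> Result<Token, String> {
        // Ignore any whitespace before the next lexeme.
        self.skip_whitespace();

        // Past the end of the input: report the end-of-input token.
        if self.cursor >= self.input.len() {
            return Ok(Token::End);
        }

        // (The slide elides the code that recognizes the lexeme
        // at the cursor and binds it to new_token.)

        // Consume the character and hand the token to the caller.
        self.cursor += 1;
        Ok(new_token)
    }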
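One way the skeleton could be filled in is sketched below. Only `get_next_token`, `skip_whitespace`, `cursor`, `input`, and `Token::End` appear on the slide; the other `Token` variants, the `Lexer` struct layout, the `new` constructor, and the error message are assumptions made so the example compiles on its own.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Plus,
    Minus,
    LParen,
    RParen,
    End,
}

struct Lexer {
    input: Vec<char>, // source text, one entry per character
    cursor: usize,    // index of the next unread character
}

impl Lexer {
    fn new(source: &str) -> Self {
        Lexer { input: source.chars().collect(), cursor: 0 }
    }

    /// Advance the cursor past spaces, tabs, and newlines.
    fn skip_whitespace(&mut self) {
        while self.cursor < self.input.len() && self.input[self.cursor].is_whitespace() {
            self.cursor += 1;
        }
    }

    fn get_next_token(&mut self) -> Result<Token, String> {
        self.skip_whitespace();

        if self.cursor >= self.input.len() {
            return Ok(Token::End);
        }

        // Recognize a single-character token at the cursor.
        let new_token = match self.input[self.cursor] {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '(' => Token::LParen,
            ')' => Token::RParen,
            other => {
                return Err(format!(
                    "unexpected character '{}' at position {}",
                    other, self.cursor
                ))
            }
        };

        self.cursor += 1;
        Ok(new_token)
    }
}
```

Calling `get_next_token` repeatedly on `" ( + ) "` gives `LParen`, `Plus`, `RParen`, and then `End` on every later call. Multi-character tokens would need to advance the cursor by more than one.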
LEXEME TOKEN CATEGORIES
• Single-character punctuators ( + , ; - ( } etc. )
• Multi-character punctuators ( == <= etc. )
• Comments
• Literals (strings and numbers)
• Reserved keywords ( while for let int etc. )
• Identifiers
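These categories map naturally onto an enum with one variant per category. The sketch below is an assumption-laden illustration: the `Token` variants, the `KEYWORDS` table (limited to the four keywords named above), and the `classify` helper are hypothetical, and `classify` expects a lexeme that has already been cut out of the source.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Punctuator(String), // single- or multi-character: + , ; == <=
    Comment(String),
    Number(f64),        // numeric literal
    Str(String),        // string literal, without its quotes
    Keyword(String),
    Identifier(String),
}

/// Reserved keywords named on the slide (a real language has more).
const KEYWORDS: [&str; 4] = ["while", "for", "let", "int"];

/// Decide which category an already-isolated lexeme belongs to.
/// Returns None for an empty lexeme or a malformed number.
fn classify(lexeme: &str) -> Option<Token> {
    let first = lexeme.chars().next()?;
    let token = if lexeme.starts_with("//") || lexeme.starts_with("/*") {
        Token::Comment(lexeme.to_string())
    } else if first == '"' {
        Token::Str(lexeme.trim_matches('"').to_string())
    } else if first.is_ascii_digit() {
        Token::Number(lexeme.parse::<f64>().ok()?)
    } else if first.is_alphabetic() || first == '_' {
        // Keywords look like identifiers, so check the reserved list first.
        if KEYWORDS.iter().any(|k| *k == lexeme) {
            Token::Keyword(lexeme.to_string())
        } else {
            Token::Identifier(lexeme.to_string())
        }
    } else {
        Token::Punctuator(lexeme.to_string())
    };
    Some(token)
}
```

Checking the keyword table before falling back to `Identifier` is what makes `while` reserved while `getint` stays an ordinary name.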
ANTLR Syntax

    Comment    : '/*' .*? '*/' | '//' ~[\r\n]*;
    Assign     : ':=';
    Plus       : '+';
    Minus      : '-';
    Times      : '*';
    Divide     : '/';
    LParen     : '(';
    RParen     : ')';
    Read       : 'read';
    Write      : 'write';
    Identifier : Letter (Letter | Digit)*;
    Number     : Digit+ | Digit* ('.' Digit | Digit '.') Digit*;

    fragment Letter : [a-zA-Z];
    fragment Digit  : [0-9];

Sample program:

    // Read two values
    read A
    read B

    // Add them
    sum := A + B

    // Print stuff
    write sum
    write sum / 2

We'll talk more about regular expressions next.
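The Number rule, `Digit+ | Digit* ('.' Digit | Digit '.') Digit*`, can be recognized by a four-state machine. The sketch below is a hand-written illustration, not code ANTLR would generate; the state names and the `is_number` function are hypothetical.

```rust
/// States of a hand-written recognizer for the Number rule.
#[derive(Clone, Copy)]
enum State {
    Start,      // nothing read yet
    Int,        // one or more digits, no dot yet (accepting)
    LeadingDot, // a dot with no digit before it (not accepting)
    Fraction,   // a dot with a digit on at least one side (accepting)
}

/// Return true if the whole string matches the Number rule.
fn is_number(text: &str) -> bool {
    let mut state = State::Start;
    for c in text.chars() {
        state = match (state, c) {
            (State::Start, '0'..='9') => State::Int,
            (State::Start, '.') => State::LeadingDot,
            (State::Int, '0'..='9') => State::Int,
            (State::Int, '.') => State::Fraction, // "1." is allowed by the rule
            (State::LeadingDot, '0'..='9') => State::Fraction,
            (State::Fraction, '0'..='9') => State::Fraction,
            _ => return false, // no transition for this input: reject
        };
    }
    // Accept only if the input ended in an accepting state.
    matches!(state, State::Int | State::Fraction)
}
```

Under this rule `42`, `.5`, `1.`, and `3.14` are numbers, while a lone `.` is not, because the dot needs a digit on at least one side.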
CHOICES TO MAKE
• Do we implement lists/arrays?
• Do we implement functions?
• Do we allow for the declaration of functions?
What features does our language have?
What tokens are we looking for?
WHAT DOES OUR FSM/LEXER LOOK LIKE?