Lexing

LEXING

ACKNOWLEDGEMENT: SLIDES ARE ADAPTED FROM THE MATERIALS FROM JOSHUA ELLIS AND ANTHONY CLARK

INTERPRETER

Source Code (Plain Text) -> Lexer/Scanner -> Lexemes/Tokens

Lexer/Scanner (Regular Expressions and Finite State Machines): group characters into the smallest meaningful units. This is the stage we are working on.

Source code (plain text):

    // GCD Program (in C)
    int main() {
        int i = getint(), j = getint();
        while (i != j) {
            if (i > j) i = i - j;
            else j = j - i;
        }
        putint(i);
    }

Lexemes/tokens:

    int main ( ) {
    int i = getint ( ) , j = getint ( ) ;
    while ( i != j ) {
    if ( i > j ) i = i - j ;
    else j = j - i ;
    }
    putint ( i ) ;
    }

We’ll create our own simple calculator language.
[Figure: the finite-state machine used by a turnstile]

Finite-State Machine (FSM) (aka Finite-State Automaton (FSA) or State Machine)

A mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some inputs; the change from one state to another is called a transition.
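
The turnstile mentioned above can be written down directly as a transition function. The following is a minimal sketch in Rust, assuming the usual textbook turnstile with two states and two inputs; the slide's own diagram is not reproduced here, so the names `State`, `Input`, `step`, and `run` are illustrative choices rather than anything taken from the slides.

```rust
/// The two states of a coin-operated turnstile (assumed textbook model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Locked,
    Unlocked,
}

/// The inputs the machine can react to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Input {
    Coin,
    Push,
}

/// One transition: the next state depends only on the current state and the input.
fn step(state: State, input: Input) -> State {
    match (state, input) {
        (State::Locked, Input::Coin) => State::Unlocked,   // paying unlocks the arm
        (State::Locked, Input::Push) => State::Locked,     // pushing a locked arm does nothing
        (State::Unlocked, Input::Coin) => State::Unlocked, // extra coins change nothing
        (State::Unlocked, Input::Push) => State::Locked,   // passing through re-locks the arm
    }
}

/// Feed a sequence of inputs to the machine, starting from `Locked`.
fn run(inputs: &[Input]) -> State {
    inputs.iter().fold(State::Locked, |state, &input| step(state, input))
}
```

Because the machine is always in exactly one state, the whole model fits in a single `match` over (state, input) pairs, which is the same shape an ad-hoc scanner takes.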
FINITE STATE MACHINE DIAGRAMS

[Diagram legend: State (or State and Details information); Transition/Condition; Entrance; Exit]

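INTERPRETER

[Diagram: the Lexer reads the source. The Parser requests a token and the Lexer sends a token. The Tree Walker requests an AST (Abstract Syntax Tree) and the Parser sends the AST. The diagram also shows an I/O Console next to the Tree Walker.]
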
LEXER

This process is known as:


• Scanning, lexing (lexical analysis), or tokenizing

This is the first step for any compiler or interpreter

A scanner takes in a raw source (plain text) and turns it into lexemes
• A lexeme is the smallest grouping of characters that represents
something useful/meaningful

GENERATING AND DESIGNING A SCANNER

1. Look at your language

2. Find all of your lexemes
   • Lexemes are the smallest meaningful groupings of characters
   • You also must respect the maximal munch rule (always take more characters if possible)
     • == is usually an EqualEqual token, not two Equal tokens
     • let lettuce = … is Let, Identifier(“lettuce”), Equal, … not Let, Let, Identifier(“tuce”), …

3. Write a regular expression for each lexeme
   • All of these expressions together create a lexical grammar

4. Generate or design a scanner
   • Use ANTLR or Lex or Flex or JFlex or … to automatically generate a scanner
   • Write a scanner by hand (known as the ad-hoc method)
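
The maximal munch rule from step 2 can be sketched in a few lines of Rust. This is an illustration under assumptions, not the course's scanner: the `Token` variants mirror the names used in the bullets above, while the `scan` function and its choice to silently skip unrecognized characters are hypothetical simplifications.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Let,
    Identifier(String),
    Equal,
    EqualEqual,
}

/// Scan `src` into tokens, always taking the longest lexeme available.
/// Characters outside this tiny vocabulary are skipped to keep the sketch short.
fn scan(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '=' {
            // Peek one character ahead: prefer `==` over two `=` tokens.
            if i + 1 < chars.len() && chars[i + 1] == '=' {
                tokens.push(Token::EqualEqual);
                i += 2;
            } else {
                tokens.push(Token::Equal);
                i += 1;
            }
        } else if c.is_ascii_alphabetic() || c == '_' {
            // Consume the whole word first, then decide whether it is a keyword.
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            if word == "let" {
                tokens.push(Token::Let);
            } else {
                tokens.push(Token::Identifier(word));
            }
        } else {
            i += 1; // whitespace and anything unrecognized
        }
    }
    tokens
}
```

Checking for the keyword only after the whole word has been consumed is what keeps `lettuce` from being split into `let` followed by `tuce`.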


AD-HOC SCANNERS

For this class, we will be writing an ad-hoc scanner


• This has the benefit of not hiding any of the details
• This has the drawback of not hiding any of the details

An ad-hoc scanner is basically a match expression on steroids


• Or a switch statement if you're in C++
• Or branching if-statements if you're in some other language
• “…pretty much a switch statement with delusions of grandeur.”

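    // A method of the scanner; assumes self.input can be indexed by position (e.g. a Vec<char>).
    fn get_next_token(&mut self) -> Result<Token, String> {
        self.skip_whitespace();

        // Past the last character: signal the end of input.
        if self.cursor >= self.input.len() {
            return Ok(Token::End);
        }

        // Every token here is a single character, so one match is enough.
        let new_token = match self.input[self.cursor] {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Star,
            '/' => Token::Slash,
            '^' => Token::Caret,
            '(' => Token::LParen,
            ')' => Token::RParen,
            _ => {
                return Err(format!(
                    "Unexpected character: '{}'",
                    self.input[self.cursor]
                ));
            }
        };

        self.cursor += 1; // consume the character that was just matched
        Ok(new_token)
    }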
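
The method above depends on pieces the slide does not show: the `Token` type, the struct holding `input` and `cursor`, and `skip_whitespace`. The sketch below fills those in so the scanner can be compiled and tried out. The struct name `Lexer`, the `Number` variant, the `number` helper, and the `tokenize` driver are all assumptions added for illustration; only the operator arms come from the slide.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Plus, Minus, Star, Slash, Caret, LParen, RParen,
    Number(f64), // assumption: not on the slide, added so "1 + 2" can be lexed
    End,
}

struct Lexer {
    input: Vec<char>, // a Vec<char> can be indexed by position, unlike &str
    cursor: usize,
}

impl Lexer {
    fn new(source: &str) -> Self {
        Lexer { input: source.chars().collect(), cursor: 0 }
    }

    fn skip_whitespace(&mut self) {
        while self.cursor < self.input.len() && self.input[self.cursor].is_whitespace() {
            self.cursor += 1;
        }
    }

    /// Consume a run of digits and dots, then let f64 parsing validate it.
    fn number(&mut self) -> Result<Token, String> {
        let start = self.cursor;
        while self.cursor < self.input.len()
            && (self.input[self.cursor].is_ascii_digit() || self.input[self.cursor] == '.')
        {
            self.cursor += 1;
        }
        let text: String = self.input[start..self.cursor].iter().collect();
        text.parse::<f64>()
            .map(Token::Number)
            .map_err(|_| format!("Invalid number: '{}'", text))
    }

    fn get_next_token(&mut self) -> Result<Token, String> {
        self.skip_whitespace();
        if self.cursor >= self.input.len() {
            return Ok(Token::End);
        }
        let c = self.input[self.cursor];
        if c.is_ascii_digit() {
            return self.number();
        }
        let token = match c {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Star,
            '/' => Token::Slash,
            '^' => Token::Caret,
            '(' => Token::LParen,
            ')' => Token::RParen,
            _ => return Err(format!("Unexpected character: '{}'", c)),
        };
        self.cursor += 1;
        Ok(token)
    }
}

/// Hypothetical driver: pull tokens until End and collect them.
fn tokenize(source: &str) -> Result<Vec<Token>, String> {
    let mut lexer = Lexer::new(source);
    let mut tokens = Vec::new();
    loop {
        let token = lexer.get_next_token()?;
        if token == Token::End {
            return Ok(tokens);
        }
        tokens.push(token);
    }
}
```

Calling `tokenize("(1 + 2.5) * 3")` yields the parenthesis, number, and operator tokens in order, while an input containing a character such as `$` returns the "Unexpected character" error.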
LEXEME TOKEN CATEGORIES

• Single-character punctuators ( + , ; - ( } etc. )
• Multi-character punctuators ( == <= etc. )
• Comments
• Literals (strings and numbers)
• Reserved keywords ( while for let int etc. )
• Identifiers
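
As a rough illustration of how these categories partition lexemes, the Rust sketch below sorts an already-isolated lexeme into one of them. It is a simplification under assumptions: the `Category` enum and `categorize` function are hypothetical names, and the keyword and punctuator lists contain only the examples given in the bullets above.

```rust
#[derive(Debug, PartialEq)]
enum Category {
    SinglePunctuator,
    MultiPunctuator,
    Comment,
    Literal,
    Keyword,
    Identifier,
    Unknown,
}

/// Classify one lexeme that has already been cut out of the source text.
fn categorize(lexeme: &str) -> Category {
    // Only the examples listed on the slide; a real language has more of each.
    const KEYWORDS: [&str; 4] = ["while", "for", "let", "int"];
    const MULTI: [&str; 2] = ["==", "<="];
    const SINGLE: [&str; 6] = ["+", ",", ";", "-", "(", "}"];

    let first = match lexeme.chars().next() {
        Some(c) => c,
        None => return Category::Unknown, // the empty string is not a lexeme
    };

    if lexeme.starts_with("//") || lexeme.starts_with("/*") {
        Category::Comment
    } else if MULTI.contains(&lexeme) {
        Category::MultiPunctuator
    } else if SINGLE.contains(&lexeme) {
        Category::SinglePunctuator
    } else if first == '"' || first.is_ascii_digit() {
        Category::Literal // string literals open with a quote, numbers with a digit
    } else if KEYWORDS.contains(&lexeme) {
        Category::Keyword // checked before identifiers: keywords look like identifiers
    } else if first.is_ascii_alphabetic() || first == '_' {
        Category::Identifier
    } else {
        Category::Unknown
    }
}
```

The order of the checks matters: reserved keywords are tested before identifiers because every keyword would otherwise also qualify as an identifier.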

Lexical grammar (ANTLR Syntax):

    Comment    : '/*' .*? '*/' | '//' ~[\r\n]*;
    Assign     : ':=';
    Plus       : '+';
    Minus      : '-';
    Times      : '*';
    Divide     : '/';
    LParen     : '(';
    RParen     : ')';
    Read       : 'read';
    Write      : 'write';
    Identifier : Letter (Letter | Digit)*;
    Number     : Digit+ | Digit* ('.' Digit | Digit '.') Digit*;
    fragment Letter : [a-zA-Z];
    fragment Digit  : [0-9];

Example program:

    // Read two values
    read A
    read B

    // Add them
    sum := A + B

    // Print stuff
    write sum
    write sum / 2

We’ll talk more about regular expressions next.
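
The Number rule in the grammar above is small enough to turn into a finite-state machine by hand. The following Rust sketch is one possible encoding, not a generated scanner: the state names, `advance`, and `is_number` are assumptions made for illustration. It accepts exactly the strings the rule describes: digits only, or digits around a single '.' with at least one digit on either side of it.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NumState {
    Start,      // nothing read yet
    Int,        // one or more digits, no '.' yet (accepting)
    LeadingDot, // a '.' with no digit before it; a digit must follow
    Frac,       // a '.' plus at least one digit somewhere (accepting)
}

/// Transition function: `None` means the machine has no move for this character.
fn advance(state: NumState, c: char) -> Option<NumState> {
    use NumState::*;
    match (state, c) {
        (Start, '0'..='9') => Some(Int),
        (Start, '.') => Some(LeadingDot),
        (Int, '0'..='9') => Some(Int),
        (Int, '.') => Some(Frac),
        (LeadingDot, '0'..='9') => Some(Frac),
        (Frac, '0'..='9') => Some(Frac),
        _ => None,
    }
}

/// Run the machine over the whole string and check that it stops in an accepting state.
fn is_number(text: &str) -> bool {
    let mut state = NumState::Start;
    for c in text.chars() {
        match advance(state, c) {
            Some(next) => state = next,
            None => return false,
        }
    }
    matches!(state, NumState::Int | NumState::Frac)
}
```

The `LeadingDot` state exists only so that a lone "." is rejected while ".5" is still accepted, matching the `'.' Digit` alternative in the rule.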
CHOICES TO MAKE

• How do we represent comments? (// and /* */) or (# and ''' ''')

• How do we end statements/lines of code? (Return/newline or ;)

• Do we require type definitions?


• How are variables initialized?

• Do we implement lists/arrays?

• Do we implement functions?
• Do we allow for the declaration of functions?
WHAT DOES OUR FSM/LEXER LOOK LIKE?

• What features does our language have?
• What tokens are we looking for?
• What coding language should we use?
