
3 Regex


CS 473: COMPILER DESIGN
Adapted from slides by Steve Zdancewic, UPenn
PRINCIPLES OF LEXING

Regular Expressions: Definition
• Regular expressions precisely describe sets of strings.
• A regular expression R has one of the following forms:
– ε Epsilon stands for the empty string
– 'a' An ordinary character stands for itself
– R1 | R2 Alternatives, stands for choice of R1 or R2
– R1R2 Concatenation, stands for R1 followed by R2
– R* Kleene star, stands for zero or more repetitions of R
• Useful extensions:
– "foo" Strings, equivalent to 'f''o''o'
– R+ One or more repetitions of R, equivalent to RR*
– R? Zero or one occurrences of R, equivalent to (ε|R)
– ['a'-'z'] One of a or b or c or … z, equivalent to (a|b|…|z)
– [^'0'-'9'] Any character except 0 through 9
– . Any character
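The definition above can be sketched as a small data type plus a matcher. This is a minimal Python sketch, not the course's implementation: the class names (`Eps`, `Chr`, `Alt`, `Seq`, `Star`) and the helpers (`plus`, `opt`, `string`, `char_range`, `ends`, `matches`) are all hypothetical, and the negated class and `.` are left out. The extensions are written as the equivalences listed on the slide.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the five core forms."""


@dataclass(frozen=True)
class Eps(Regex):          # ε, the empty string
    pass


@dataclass(frozen=True)
class Chr(Regex):          # 'a', an ordinary character
    c: str


@dataclass(frozen=True)
class Alt(Regex):          # R1 | R2
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Seq(Regex):          # R1R2
    first: Regex
    second: Regex


@dataclass(frozen=True)
class Star(Regex):         # R*
    body: Regex


def plus(r):               # R+ is RR*
    return Seq(r, Star(r))


def opt(r):                # R? is (ε|R)
    return Alt(Eps(), r)


def string(s):             # "foo" is 'f''o''o'
    r = Eps()
    for c in reversed(s):
        r = Seq(Chr(c), r)
    return r


def char_range(lo, hi):    # ['a'-'z'] is (a|b|...|z)
    r = Chr(lo)
    for code in range(ord(lo) + 1, ord(hi) + 1):
        r = Alt(r, Chr(chr(code)))
    return r


def ends(r, s, i):
    """Return every position j such that r matches s[i:j]."""
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Chr):
        return {i + 1} if i < len(s) and s[i] == r.c else set()
    if isinstance(r, Alt):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Seq):
        return {k for j in ends(r.first, s, i) for k in ends(r.second, s, j)}
    if isinstance(r, Star):
        seen, frontier = {i}, {i}
        while frontier:    # keep repeating the body until nothing new appears
            frontier = {k for j in frontier for k in ends(r.body, s, j)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"not a regex: {r!r}")


def matches(r, s):
    """True when r describes the whole string s."""
    return len(s) in ends(r, s, 0)


digit = char_range("0", "9")
int_literal = Seq(opt(Chr("-")), plus(digit))   # '-'?['0'-'9']+
print(matches(int_literal, "-42"))  # True
print(matches(int_literal, "-"))    # False
```

Tracking a set of end positions keeps the matcher short, but it is far slower than the automata introduced later.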

Example Regular Expressions
• Recognize the keyword “if”: "if"
• Recognize a digit: ['0'-'9']
• Recognize an integer literal: '-'?['0'-'9']+
• Recognize an identifier:
(['a'-'z']|['A'-'Z'])(['0'-'9']|'_'|['a'-'z']|['A'-'Z'])*
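The same four patterns can be tried out with Python's standard `re` module. This is only an assumed translation of the slide's notation into `re` syntax; the constant names and the `recognizes` helper are made up for illustration.

```python
import re

# Assumed `re` spellings of the slide's patterns.
KEYWORD_IF = re.compile(r"if")
DIGIT = re.compile(r"[0-9]")
INT_LITERAL = re.compile(r"-?[0-9]+")
IDENTIFIER = re.compile(r"[a-zA-Z][0-9_a-zA-Z]*")


def recognizes(pattern, text):
    """True when the whole of `text` is in the set described by `pattern`."""
    return pattern.fullmatch(text) is not None


print(recognizes(INT_LITERAL, "-42"))   # True
print(recognizes(IDENTIFIER, "1x"))     # False
print(recognizes(IDENTIFIER, "if"))     # True
```

Note that "if" is recognized by both the keyword pattern and the identifier pattern, which is why a lexer needs a priority among its rules.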

Finite Automata
• Every regular expression can be recognized by a finite
automaton
• Consider the regular expression: '"'[^'"']*'"'
• An automaton (DFA) can be represented as:
– A transition table:
State    "        Non-"
0        1        ERROR
1        2        1
2        ERROR    ERROR

– A graph:
0 --"--> 1 --"--> 2, with a self-loop on state 1 labeled Non-"
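A transition table like this one can be run directly. The sketch below is a minimal Python version under a few assumptions: state 0 is the start state, state 2 is the only accepting state (as the regular expression implies), and the names `TABLE`, `column`, and `dfa_accepts` are hypothetical.

```python
ERROR = None
START = 0
ACCEPTING = {2}

# One row per state, one column per input class.
TABLE = {
    0: {"quote": 1, "other": ERROR},
    1: {"quote": 2, "other": 1},
    2: {"quote": ERROR, "other": ERROR},
}


def column(ch):
    """Map an input character to its table column."""
    return "quote" if ch == '"' else "other"


def dfa_accepts(text):
    """Follow one transition per character; reject on ERROR."""
    state = START
    for ch in text:
        state = TABLE[state][column(ch)]
        if state is ERROR:
            return False
    return state in ACCEPTING


print(dfa_accepts('"hello"'))  # True
print(dfa_accepts('"hello'))   # False
```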

RE to Finite Automaton
• Every regular expression can be recognized by a finite
automaton

• Strategy: consider every possible regular expression:


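[Figure: an automaton with a single transition labeled a recognizes 'a'. What about ε and R1|R2? For R1R2, the automata for R1 and R2 are drawn side by side, marked "??"]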

Nondeterministic Finite Automata
• A finite set of states, a start state, and accepting state(s)
• Transition arrows connecting states
– Labeled by input symbols
– Or ε (which does not consume input)
• Nondeterministic: two arrows leaving the same state may
have the same label

[Figure: an example NFA with transitions labeled a, b, and ε]
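An NFA can be written down as a table from (state, label) pairs to sets of target states. The sketch below is a made-up example, not the automaton from the slide's figure: it accepts "a" followed by any number of "b", or the string "aa". Two arrows labeled a leave state 0, and `None` stands for ε. The names `EDGES`, `eps_closure`, and `nfa_accepts` are hypothetical.

```python
EPS = None  # label for ε transitions

# Hypothetical NFA: two arrows labeled "a" leave state 0.
EDGES = {
    (0, "a"): {1, 2},
    (1, "b"): {1},
    (1, EPS): {3},
    (2, "a"): {3},
}
START = 0
ACCEPTING = {3}


def eps_closure(states):
    """Add every state reachable through ε transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EDGES.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def nfa_accepts(text):
    """Track every state the NFA could be in after each character."""
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= EDGES.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


print(nfa_accepts("abb"))  # True
print(nfa_accepts("aab"))  # False
```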

RE to NFA
• Converting regular expressions to NFAs is easy.
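• Assume each NFA has one start state and a unique accept state
[Figure: the NFA for 'a' is a single transition labeled a; the NFA for R1R2 joins the NFA for R1 to the NFA for R2 with an ε transition]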

RE to NFA (cont’d)
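• Sums (alternatives) and Kleene star are easy with NFAs
[Figure: the NFA for R1|R2 uses ε transitions to branch into R1 and R2 and to join them again; the NFA for R* uses ε transitions to loop around R and to bypass it]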
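The constructions for characters, concatenation, alternatives, and Kleene star can be sketched in Python as follows. This follows the usual Thompson-style construction, which may differ in small details from the figures. Regexes are written as nested tuples such as `("seq", r1, r2)`, a hypothetical encoding chosen to keep the example short, and `Builder` and `accepts` are made-up names. Every fragment has one start state and one accept state, as assumed on the slides.

```python
import itertools

EPS = None  # label for ε transitions


class Builder:
    def __init__(self):
        self.edges = {}                 # (state, label) -> set of targets
        self._ids = itertools.count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    def build(self, r):
        """Return (start, accept) of an NFA fragment for regex r."""
        kind = r[0]
        if kind == "seq":               # R1R2: accept of R1 --ε--> start of R2
            s1, a1 = self.build(r[1])
            s2, a2 = self.build(r[2])
            self.add(a1, EPS, s2)
            return s1, a2
        s, a = self.new_state(), self.new_state()
        if kind == "eps":
            self.add(s, EPS, a)
        elif kind == "chr":
            self.add(s, r[1], a)
        elif kind == "alt":             # R1|R2: branch with ε, join with ε
            for sub in (r[1], r[2]):
                si, ai = self.build(sub)
                self.add(s, EPS, si)
                self.add(ai, EPS, a)
        elif kind == "star":            # R*: enter, bypass, repeat, leave
            si, ai = self.build(r[1])
            self.add(s, EPS, si)
            self.add(s, EPS, a)
            self.add(ai, EPS, si)
            self.add(ai, EPS, a)
        else:
            raise ValueError(f"unknown regex form: {kind}")
        return s, a


def accepts(r, text):
    """Build the NFA for r and simulate it on text."""
    b = Builder()
    start, accept = b.build(r)

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in b.edges.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        moved = set()
        for s in current:
            moved |= b.edges.get((s, ch), set())
        current = closure(moved)
    return accept in current


a_star = ("star", ("chr", "a"))
b_star = ("star", ("chr", "b"))
regex = ("alt", ("seq", a_star, b_star), ("seq", b_star, a_star))
print(accepts(regex, "aab"))  # True
print(accepts(regex, "aba"))  # False
```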
Exercise: RE to NFA
• Construct an NFA for the following regular expression:
(a*b*)|(b*a*)

[Figure: an NFA for (a*b*)|(b*a*), with transitions labeled a, b, and ε]

Deterministic Finite Automata
• An NFA accepts a string if there is any way to get to an
accepting state
– To implement, we either have to try all possibilities or get good
at guessing!

• A deterministic finite automaton (DFA) never has to guess: two arrows leaving the same state must have different labels, and no arrow is labeled ε
• This means that the action for each input is fully determined!
• We can make a table for each state: “if you see symbol X,
go to state Y”

• Fortunately, we can convert any NFA into a DFA!

NFA to DFA conversion (Intuition)
• Idea: Run all possible executions of the NFA “in parallel”
• Keep track of a set of possible states: “finite fingers”
• Consider: -?[0-9]+

• NFA representation:
[Figure: NFA with states 0, 1, 2, 3 and transitions labeled -, [0-9], and ε]
• DFA representation:
[Figure: DFA with states {0,1}, {1}, {2,3} and transitions labeled - and [0-9]]
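The "in parallel" idea is usually implemented as the subset construction, sketched below in Python for -?[0-9]+. The NFA edges are an assumed reading of the figure (0 to 1 on - or ε, 1 to 2 on a digit, a digit loop on 2, 2 to 3 on ε, with 3 accepting); the figure's exact arrows may differ. The names `NFA`, `subset_construction`, and `dfa_accepts` are hypothetical.

```python
EPS = None  # label for ε transitions

# Assumed NFA for -?[0-9]+ ("digit" stands for the class [0-9]).
NFA = {
    (0, "-"): {1},
    (0, EPS): {1},
    (1, "digit"): {2},
    (2, "digit"): {2},
    (2, EPS): {3},
}
NFA_START, NFA_ACCEPT = 0, 3
ALPHABET = ("-", "digit")


def eps_closure(states):
    """Add every state reachable through ε transitions alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction():
    """Each DFA state is the set of NFA states the "fingers" rest on."""
    start = eps_closure({NFA_START})
    table, worklist = {}, [start]
    while worklist:
        current = worklist.pop()
        if current in table:
            continue
        table[current] = {}
        for symbol in ALPHABET:
            moved = set()
            for s in current:
                moved |= NFA.get((s, symbol), set())
            if moved:                       # no entry means ERROR
                target = eps_closure(moved)
                table[current][symbol] = target
                worklist.append(target)
    accepting = {state for state in table if NFA_ACCEPT in state}
    return start, table, accepting


def dfa_accepts(text):
    """Run the constructed DFA: one table lookup per character."""
    state, table, accepting = subset_construction()
    for ch in text:
        symbol = "digit" if ch in "0123456789" else ch
        state = table[state].get(symbol)
        if state is None:
            return False
    return state in accepting


start, table, accepting = subset_construction()
print(sorted(sorted(state) for state in table))  # [[0, 1], [1], [2, 3]]
print(dfa_accepts("-42"))                        # True
```

Under this reading, the construction yields exactly the three DFA states named on the slide: {0,1}, {1}, and {2,3}.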

Summary of Lexer Generator Behavior
• Take each regular expression Ri and its action Ai
• Compute the NFA formed by (R1 | R2 | … | Rn)
– Remember the actions associated with the accepting states of
the Ri
• Compute the DFA for this big NFA
– There may be multiple accept states
– A single accept state may correspond to one or more actions
• Compute the minimal equivalent DFA
– There is a standard algorithm due to Myhill & Nerode
• Produce the transition table
• Implement longest match:
– Start from initial state
– Follow transitions, remember last accept state entered (if any)
– Accept input until no transition is possible (i.e. next state is “ERROR”)
– Perform the highest-priority action associated with the last accept state; if no accept state was entered, there is a lexing error
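The longest-match loop can be sketched in Python as follows. The DFA here is hand-written for three assumed rules ("if", [a-z]+, [0-9]+, in that priority order) rather than produced by a generator, and the names `step`, `ACTIONS`, `next_token`, and `tokenize` are hypothetical. Skipping spaces by hand is also an assumption made to keep the example small.

```python
# Accept states and their highest-priority action.
# State 2 accepts both "if" and [a-z]+; the keyword rule wins.
ACTIONS = {1: "ID", 2: "IF", 3: "ID", 4: "NUM"}


def step(state, ch):
    """Transition function of the hand-written DFA; None means ERROR."""
    letter = "a" <= ch <= "z"
    digit = "0" <= ch <= "9"
    if state == 0:
        if ch == "i":
            return 1
        if letter:
            return 3
        if digit:
            return 4
    elif state == 1:
        if ch == "f":
            return 2
        if letter:
            return 3
    elif state in (2, 3):
        if letter:
            return 3
    elif state == 4:
        if digit:
            return 4
    return None


def next_token(text, pos):
    """Longest match: run until ERROR, then fall back to the last accept."""
    state, last_accept, i = 0, None, pos
    while i < len(text):
        state = step(state, text[i])
        if state is None:
            break
        i += 1
        if state in ACTIONS:
            last_accept = (ACTIONS[state], i)   # remember action and end
    if last_accept is None:
        raise ValueError(f"lexing error at position {pos}")
    kind, end = last_accept
    return kind, text[pos:end], end


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos] == " ":                    # assumption: skip spaces
            pos += 1
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("if iffy 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('NUM', '42')]
```

Because the loop keeps going past state 2, "iffy" is lexed as one identifier rather than the keyword "if" followed by "fy".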
Lex: Start States
• Sometimes we want to use different lexers for different
parts of a program
• For instance, strings:
if (a == "\"if\" 0") return 0;

• Start states let us specify multiple sets of lexing rules and switch between them

%s STRING           // define a new ruleset for strings

// INITIAL is the default lexer
<INITIAL>[a-z]+     { return ID; }
<INITIAL>[0-9]+     { return NUM; }
// switch to the string lexer
<INITIAL>\"         { BEGIN STRING; }
<STRING>.           { /* store characters */; }
// switch back when we’re done

• Demo: states.lex
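The effect of start states can be imitated in Python by keeping one rule list per state and a variable that records the current state. This is only a sketch of the idea, not how Lex implements it. The rule tables, the action names, the whitespace rule, the escape rule, and the rule that switches back on a closing quote are all assumptions, since the slide does not show them.

```python
import re

# One rule list per start state; earlier rules win ties, as in Lex.
RULES = {
    "INITIAL": [
        (re.compile(r"[a-z]+"), "ID"),
        (re.compile(r"[0-9]+"), "NUM"),
        (re.compile(r'"'), "BEGIN_STRING"),     # switch to the string lexer
        (re.compile(r"\s+"), "SKIP"),           # assumption: drop whitespace
    ],
    "STRING": [
        (re.compile(r"\\."), "ESCAPE"),         # assumption: \" inside strings
        (re.compile(r'"'), "END_STRING"),       # assumption: switch back
        (re.compile(r"."), "CHAR"),             # store characters
    ],
}


def tokenize(text):
    state, pos, tokens, buffer = "INITIAL", 0, [], []
    while pos < len(text):
        best = None
        for pattern, action in RULES[state]:    # only the active ruleset
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, action)              # longest match so far
        if best is None:
            raise ValueError(f"lexing error at position {pos}")
        m, action = best
        pos = m.end()
        if action == "BEGIN_STRING":
            state, buffer = "STRING", []
        elif action == "END_STRING":
            tokens.append(("STRING", "".join(buffer)))
            state = "INITIAL"
        elif action == "ESCAPE":
            buffer.append(m.group()[1])         # keep the escaped character
        elif action == "CHAR":
            buffer.append(m.group())
        elif action != "SKIP":
            tokens.append((action, m.group()))
    if state == "STRING":
        raise ValueError("unterminated string")
    return tokens


print(tokenize('if "if" 0'))
# [('ID', 'if'), ('STRING', 'if'), ('NUM', '0')]
```

Inside the quotes, "if" is stored as string contents and never reaches the [a-z]+ rule, which is the point of switching rulesets.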