3 Regex
DESIGN
Adapted from slides by Steve Zdancewic, UPenn
PRINCIPLES OF LEXING
Regular Expressions: Definition
• Regular expressions precisely describe sets of strings.
• A regular expression R has one of the following forms:
– ε Epsilon stands for the empty string
– 'a' An ordinary character stands for itself
– R1 | R2 Alternatives, stands for choice of R1 or R2
– R1R2 Concatenation, stands for R1 followed by R2
– R* Kleene star, stands for zero or more repetitions of R
• Useful extensions:
– "foo" Strings, equivalent to 'f''o''o'
– R+ One or more repetitions of R, equivalent to RR*
– R? Zero or one occurrences of R, equivalent to (ε|R)
– ['a'-'z'] One of a or b or c or … z, equivalent to (a|b|…|z)
– [^'0'-'9'] Any character except 0 through 9
– . Any character
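A minimal sketch of the definition above, written in Python as a small datatype with one class per core form. The class and function names (Eps, Chr, Alt, Cat, Star, plus, opt, string, char_range, ends, matches) are illustrative assumptions, not part of the slides. The extensions are defined only in terms of the core forms, and the naive matcher is there just to check the definitions, not to show how a lexer generator works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Eps:            # ε: matches only the empty string
    pass

@dataclass(frozen=True)
class Chr:            # 'a': an ordinary character stands for itself
    c: str

@dataclass(frozen=True)
class Alt:            # R1 | R2
    left: object
    right: object

@dataclass(frozen=True)
class Cat:            # R1R2
    left: object
    right: object

@dataclass(frozen=True)
class Star:           # R*
    inner: object

# Extensions, expressed purely with the core forms.
def plus(r):          # R+  is  RR*
    return Cat(r, Star(r))

def opt(r):           # R?  is  (ε|R)
    return Alt(Eps(), r)

def string(s):        # "foo"  is  'f''o''o'
    r = Eps()
    for ch in reversed(s):
        r = Cat(Chr(ch), r)
    return r

def char_range(lo, hi):   # ['a'-'z']  is  (a|b|...|z)
    r = Chr(lo)
    for code in range(ord(lo) + 1, ord(hi) + 1):
        r = Alt(r, Chr(chr(code)))
    return r

def ends(r, s, i):
    """Return every position j such that r matches s[i:j]."""
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Chr):
        return {i + 1} if s[i:i + 1] == r.c else set()
    if isinstance(r, Alt):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Cat):
        return {k for j in ends(r.left, s, i) for k in ends(r.right, s, j)}
    if isinstance(r, Star):
        seen, frontier = {i}, {i}      # zero repetitions always match
        while frontier:                # each round adds one more repetition
            frontier = {k for j in frontier for k in ends(r.inner, s, j)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"not a regular expression: {r!r}")

def matches(r, s):
    """True when r matches the whole string s."""
    return len(s) in ends(r, s, 0)

# '-'?['0'-'9']+ built from the pieces above
integer = Cat(opt(Chr("-")), plus(char_range("0", "9")))
```

For example, matches(integer, "-42") holds, while matches(integer, "-") does not, because R+ requires at least one digit.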
Example Regular Expressions
• Recognize the keyword “if”: "if"
• Recognize a digit: ['0'-'9']
• Recognize an integer literal: '-'?['0'-'9']+
• Recognize an identifier:
(['a'-'z']|['A'-'Z'])(['0'-'9']|'_'|['a'-'z']|['A'-'Z'])*
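The same four examples can be tried out with Python's standard re module. This is only a sketch: the slides use a lexer-generator notation with quoted characters, so the patterns below are hand translations into Python syntax, and the constant names and the recognizes helper are assumptions made for illustration.

```python
import re

# Hand translations of the slide's examples into Python regex syntax.
KEYWORD_IF = re.compile(r"if")                      # "if"
DIGIT      = re.compile(r"[0-9]")                   # ['0'-'9']
INTEGER    = re.compile(r"-?[0-9]+")                # '-'?['0'-'9']+
IDENTIFIER = re.compile(r"[a-zA-Z][0-9_a-zA-Z]*")   # letter, then letters/digits/_

def recognizes(pattern, text):
    """True when the whole of `text` is in the language of `pattern`."""
    return pattern.fullmatch(text) is not None
```

Note that the identifier expression requires a leading letter, so recognizes(IDENTIFIER, "x_1") holds but "_x" and "1x" are rejected.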
Finite Automata
• Every regular expression can be recognized by a finite
automaton
• Consider the regular expression: '"'[^'"']*'"'
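• An automaton (DFA) can be represented as:
– A transition table:
State   "       Non-"
0       1       ERROR
1       2       1
2       ERROR   ERROR
– A graph:
0 --"--> 1 --"--> 2, with a self-loop on state 1 labeled Non-"
(0 is the start state, 2 is the accepting state)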
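The transition table can be executed directly. The sketch below encodes it as a Python dictionary; the names TABLE, char_class, and accepts, and the use of None for ERROR, are assumptions made for this example.

```python
ERROR = None                      # stands for the ERROR entries in the table

def char_class(ch):
    """Map an input character to a table column: the quote or anything else."""
    return "quote" if ch == '"' else "other"

# Rows are states, columns are input classes, as in the transition table.
TABLE = {
    0: {"quote": 1,     "other": ERROR},
    1: {"quote": 2,     "other": 1},
    2: {"quote": ERROR, "other": ERROR},
}
START, ACCEPTING = 0, {2}

def accepts(text):
    """Run the DFA over the whole input and report whether it ends in state 2."""
    state = START
    for ch in text:
        state = TABLE[state][char_class(ch)]
        if state is ERROR:
            return False
    return state in ACCEPTING
```

Each input character costs one table lookup, which is why lexer generators compile regular expressions down to tables like this one.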
RE to Finite Automaton
• Every regular expression can be recognized by a finite
automaton
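[Figure: automaton sketches labeled ε, R1, R2, R1|R2, and R1R2, with "??" marking the open question of how to combine them]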
Nondeterministic Finite Automata
• A finite set of states, a start state, and accepting state(s)
• Transition arrows connecting states
– Labeled by input symbols
– Or ε (which does not consume input)
• Nondeterministic: two arrows leaving the same state may
have the same label
[Figure: example NFA whose arrows are labeled a, b, and ε]
RE to NFA
• Converting regular expressions to NFAs is easy.
• Assume each NFA has one start state, unique accept state
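[Figure: NFA for 'a': a single arrow labeled a from the start state to the accept state]
[Figure: NFA for R1R2: the NFAs for R1 and R2 joined by an ε arrow]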
RE to NFA (cont’d)
• Sums (alternatives) and Kleene star are easy with NFAs
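[Figure: NFA for R1|R2: the NFAs for R1 and R2 side by side, joined by ε arrows]
[Figure: NFA for R*: the NFA for R with added ε arrows]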
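The constructions for 'a', R1R2, R1|R2, and R* can be sketched in Python as follows. This follows the textbook Thompson-style construction, which may differ in small details (such as the exact number of ε arrows) from the slide's figures. The NFA class, its method names, and the use of None as the ε label are assumptions made for this sketch. Every fragment is a (start, accept) pair, which keeps the "one start state, unique accept state" invariant.

```python
import itertools

EPS = None  # label used for ε arrows (assumption of this sketch)

class NFA:
    """Collects arrows in one shared list; a fragment is a (start, accept) pair."""

    def __init__(self):
        self.edges = []                  # (source, label, target) triples
        self._ids = itertools.count()    # fresh state numbers 0, 1, 2, ...

    def new_state(self):
        return next(self._ids)

    def char(self, c):                   # 'a': start --a--> accept
        s, t = self.new_state(), self.new_state()
        self.edges.append((s, c, t))
        return s, t

    def cat(self, f1, f2):               # R1R2: accept of R1 --ε--> start of R2
        self.edges.append((f1[1], EPS, f2[0]))
        return f1[0], f2[1]

    def alt(self, f1, f2):               # R1|R2: new start and accept states
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, f1[0]), (s, EPS, f2[0]),
                       (f1[1], EPS, t), (f2[1], EPS, t)]
        return s, t

    def star(self, f):                   # R*: allow skipping R or repeating it
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, f[0]), (f[1], EPS, t),
                       (s, EPS, t),          # zero repetitions
                       (f[1], EPS, f[0])]    # one more repetition
        return s, t

# Building an NFA for (a*b*)|(b*a*)
nfa = NFA()
a_star_b_star = nfa.cat(nfa.star(nfa.char("a")), nfa.star(nfa.char("b")))
b_star_a_star = nfa.cat(nfa.star(nfa.char("b")), nfa.star(nfa.char("a")))
start, accept = nfa.alt(a_star_b_star, b_star_a_star)
```

The result is deliberately unoptimized: it uses far more states and ε arrows than a hand-drawn NFA would, which is acceptable because the later NFA-to-DFA and minimization steps clean this up.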
Exercise: RE to NFA
• Construct an NFA for the following regular expression:
(a*b*)|(b*a*)
[Figure: NFA with arrows labeled a, b, and ε]
Deterministic Finite Automata
• An NFA accepts a string if there is any way to get to an
accepting state
– To implement, we either have to try all possibilities or get good
at guessing!
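"Trying all possibilities" can be sketched as a backtracking search over the NFA. The NFA below, for (a|b)*ab, is a hypothetical example chosen because state 0 has two arrows labeled a; it is not taken from the slides. The names DELTA and accepts, and the use of the empty string as the ε label, are likewise assumptions.

```python
EPS = ""   # label for ε arrows; never equal to a real input character

# Hypothetical NFA for (a|b)*ab: loop in state 0, then read a, then read b.
DELTA = {
    (0, "a"): {0, 1},   # nondeterministic: two a-arrows leave state 0
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPTING = 0, {2}

def accepts(text):
    """True if at least one path through the NFA consumes `text` and accepts."""
    seen = set()        # (state, position) pairs already explored

    def search(state, pos):
        if (state, pos) in seen:         # already tried; also stops ε cycles
            return False
        seen.add((state, pos))
        if pos == len(text) and state in ACCEPTING:
            return True
        # Choice 1: follow an ε arrow without consuming input.
        for nxt in DELTA.get((state, EPS), ()):
            if search(nxt, pos):
                return True
        # Choice 2: consume the next character.
        if pos < len(text):
            for nxt in DELTA.get((state, text[pos]), ()):
                if search(nxt, pos + 1):
                    return True
        return False

    return search(START, 0)
```

The search has to remember which (state, position) pairs it has already tried, and it may back up over the input. Converting to a DFA avoids both costs.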
NFA to DFA conversion (Intuition)
• Idea: Run all possible executions of the NFA “in parallel”
• Keep track of a set of possible states: “finite fingers”
• Consider: -?[0-9]+
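• NFA representation:
[Figure: NFA with states 0, 1, 2, 3 and arrows labeled -, [0-9], and ε]
• DFA representation:
[Figure: DFA whose states are sets of NFA states, such as {1}, with arrows labeled - and [0-9]]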
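The "set of possible states" idea is usually called the subset construction, sketched below in Python for -?[0-9]+. The exact arrows of the NFA are an assumption: they are one plausible reading of a four-state NFA with arrows labeled -, [0-9], and ε, and may not match the original figure. The function names and the treatment of the whole digit class as a single input symbol are also assumptions.

```python
EPS = "ε"
DIGIT = "[0-9]"        # treat the whole digit class as one input symbol

# Assumed NFA: 0 --'-'--> 1, 0 --ε--> 1, 1 --digit--> 2, 2 --digit--> 2, 2 --ε--> 3
NFA = {
    (0, "-"): {1},
    (0, EPS): {1},
    (1, DIGIT): {2},
    (2, DIGIT): {2},
    (2, EPS): {3},
}
NFA_START, NFA_ACCEPTING = 0, {3}
ALPHABET = ["-", DIGIT]

def eps_closure(states):
    """All NFA states reachable from `states` using only ε arrows."""
    closure, stack = set(states), list(states)
    while stack:
        for nxt in NFA.get((stack.pop(), EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)

def subset_construction():
    """Build a DFA whose states are sets of NFA states."""
    start = eps_closure({NFA_START})
    table, worklist = {}, [start]
    while worklist:
        current = worklist.pop()
        if current in table:
            continue
        table[current] = {}
        for symbol in ALPHABET:
            moved = {t for s in current for t in NFA.get((s, symbol), ())}
            if moved:                    # a missing entry plays the role of ERROR
                target = eps_closure(moved)
                table[current][symbol] = target
                worklist.append(target)
    accepting = {d for d in table if d & NFA_ACCEPTING}
    return start, table, accepting

def accepts(text):
    """Run the constructed DFA over the whole input."""
    start, table, accepting = subset_construction()
    state = start
    for ch in text:
        symbol = DIGIT if "0" <= ch <= "9" else ch
        state = table[state].get(symbol)
        if state is None:
            return False
    return state in accepting
```

Under these assumptions the construction yields three DFA states: {0,1} as the start state, {1} after reading -, and {2,3} as the only accepting state.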
Summary of Lexer Generator Behavior
• Take each regular expression Ri and its action Ai
• Compute the NFA formed by (R1 | R2 | … | Rn)
– Remember the actions associated with the accepting states of the Ri
• Compute the DFA for this big NFA
– There may be multiple accept states
– A single accept state may correspond to one or more actions
• Compute the minimal equivalent DFA
– There is a standard algorithm due to Myhill & Nerode
• Produce the transition table
• Implement longest match:
– Start from initial state
– Follow transitions, remember last accept state entered (if any)
– Accept input until no transition is possible (i.e., the next state is “ERROR”)
– Perform the highest-priority action associated with the last accept state; if there is no accept state, there is a lexing error
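The longest-match loop can be sketched in Python over a small hand-written DFA. The token set (IF, IDENT, INT, FLOAT, and skipped spaces), the state numbering, and the names step, ACTIONS, and tokenize are all assumptions made for illustration. A real lexer generator would produce the table from the regular expressions instead.

```python
ERROR = None

def step(state, ch):
    """Hand-written DFA for: "if", [a-z]+, [0-9]+, [0-9]+.[0-9]+, and spaces."""
    lower = "a" <= ch <= "z"
    digit = "0" <= ch <= "9"
    if state == 0:
        if ch == "i": return 1
        if lower: return 3
        if digit: return 4
        if ch == " ": return 5
    elif state == 1:                 # read "i"
        if ch == "f": return 2
        if lower: return 3
    elif state in (2, 3):            # read "if" or some other identifier
        if lower: return 3
    elif state == 4:                 # read digits
        if digit: return 4
        if ch == ".": return 6
    elif state == 5:                 # read spaces
        if ch == " ": return 5
    elif state in (6, 7):            # read digits and a dot
        if digit: return 7
    return ERROR

# Accept states with their highest-priority action already chosen:
# state 2 matches both IF and IDENT, and IF has the higher priority.
ACTIONS = {1: "IDENT", 2: "IF", 3: "IDENT", 4: "INT", 5: "SKIP", 7: "FLOAT"}

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = 0, pos            # start from the initial state
        last_accept = None           # (end position, action) of last accept state
        while i < len(text):
            state = step(state, text[i])
            if state is ERROR:       # no transition possible: stop scanning
                break
            i += 1
            if state in ACTIONS:
                last_accept = (i, ACTIONS[state])
        if last_accept is None:
            raise ValueError(f"lexing error at position {pos}")
        end, action = last_accept    # back up to the last accept state
        if action != "SKIP":
            tokens.append((action, text[pos:end]))
        pos = end
    return tokens
```

State 6 (digits followed by a dot) is not accepting, so on input such as "7.x" the lexer backs up and emits INT "7" before reporting an error at the dot. This is the reason for remembering the last accept state instead of only the current one.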
Lex: Start States
• Sometimes we want to use different lexers for different
parts of a program
• For instance, strings:
if (a == "\"if\" 0") return 0;
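One way to picture start states is as a lexer with several rule tables and a current mode that selects among them. The Python sketch below uses the standard re module for the individual rules. The mode names CODE and STRING, the token kinds, and the tokenize function are assumptions for illustration and do not reflect the syntax of any particular lexer generator.

```python
import re

# One rule table per start state; earlier rules win ties (higher priority).
RULES = {
    "CODE": [
        (re.compile(r"\s+"), None),                       # skip whitespace
        (re.compile(r"if|return"), "KEYWORD"),
        (re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*"), "IDENT"),
        (re.compile(r"[0-9]+"), "INT"),
        (re.compile(r"==|[();]"), "PUNCT"),
        (re.compile(r'"'), "ENTER_STRING"),               # switch to STRING
    ],
    "STRING": [
        (re.compile(r"\\."), "ESCAPE"),                   # e.g. \" inside a string
        (re.compile(r'[^"\\]+'), "CHARS"),
        (re.compile(r'"'), "EXIT_STRING"),                # switch back to CODE
    ],
}

def tokenize(text):
    tokens, pos, mode, buf = [], 0, "CODE", []
    while pos < len(text):
        best = None                      # longest match among the current rules
        for pattern, kind in RULES[mode]:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, kind)
        if best is None:
            raise ValueError(f"lexing error at position {pos}")
        m, kind = best
        pos = m.end()
        if kind == "ENTER_STRING":
            mode, buf = "STRING", []
        elif kind == "EXIT_STRING":
            tokens.append(("STRING", "".join(buf)))
            mode = "CODE"
        elif kind == "ESCAPE":
            buf.append(m.group()[1])     # keep only the escaped character
        elif kind == "CHARS":
            buf.append(m.group())
        elif kind is not None:
            tokens.append((kind, m.group()))
    if mode == "STRING":
        raise ValueError("unterminated string literal")
    return tokens
```

On the example line, the if inside the string literal is collected as ordinary string characters, because the KEYWORD rule is active only in the CODE mode.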