Implementation of The Regular Expression

- Regular expressions can be used to specify lexical structures and partition input into tokens.
- Finite automata are used to implement regular expressions and recognize regular languages. They consist of states, transitions between states based on input symbols, a start state, and accepting states.
- Regular expressions are first converted to non-deterministic finite automata (NFAs), which are then converted to deterministic finite automata (DFAs) for implementation of a lexical analyzer.

Uploaded by

Param Ahir

Implementation of Lexical Analysis

Outline

• Specifying lexical structure using regular expressions

• Finite automata
– Deterministic Finite Automata (DFAs)
– Non-deterministic Finite Automata (NFAs)

• Implementation of regular expressions
RegExp ⇒ NFA ⇒ DFA ⇒ Tables

Compiler Design 1 (2011)

Notation

• For convenience, we use a variation (allow user-defined abbreviations) in regular expression notation
• Union: A + B ≡ A|B
• Option: A + ε ≡ A?
• Range: ‘a’+’b’+…+’z’ ≡ [a-z]
• Excluded range: complement of [a-z] ≡ [^a-z]

Regular Expressions in Lexical Specification

• Last lecture: a specification for the predicate s ∈ L(R)
• But a yes/no answer is not enough!
• Instead: partition the input into tokens
• We will adapt regular expressions to this goal


Regular Expressions ⇒ Lexical Spec. (1)

1. Select a set of tokens
• Integer, Keyword, Identifier, OpenPar, ...
2. Write a regular expression (pattern) for the lexemes of each token
• Integer = digit⁺
• Keyword = ‘if’ + ‘else’ + …
• Identifier = letter (letter + digit)*
• OpenPar = ‘(‘
• …

Regular Expressions ⇒ Lexical Spec. (2)

3. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Integer + …
  = R1 + R2 + R3 + …

Facts: If s ∈ L(R) then s is a lexeme
– Furthermore s ∈ L(Ri) for some “i”
– This “i” determines the token that is reported
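Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration: the rule list, the `classify` helper, and the concrete patterns for `letter` and `digit` are assumptions, not part of the slides. It shows how the index of the first matching Ri decides which token is reported.

```python
import re

# Hypothetical token rules R1, R2, R3, ... in the order used on the slide.
TOKEN_RULES = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),  # letter (letter + digit)*
    ("Integer", r"[0-9]+"),                   # one or more digits
    ("OpenPar", r"\("),
]

def classify(lexeme):
    """Return the name of the first rule Ri with lexeme in L(Ri), or None."""
    for name, pattern in TOKEN_RULES:
        if re.fullmatch(pattern, lexeme):
            return name
    return None
```

For example, `classify("x1")` reports `Identifier`, while a string such as `"4x"` is in no L(Ri) and therefore not a lexeme.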

Regular Expressions ⇒ Lexical Spec. (3)

4. Let input be x1…xn
• (x1 ... xn are characters)
• For 1 ≤ i ≤ n check
  x1…xi ∈ L(R) ?
5. It must be that
  x1…xi ∈ L(Rj) for some j
  (if there is a choice, pick the smallest such j)
6. Remove x1…xi from input and go back to step 4

How to Handle Spaces and Comments?

1. We could create a token Whitespace
  Whitespace = (‘ ’ + ‘\n’ + ‘\t’)⁺
– We could also add comments in there
– An input “ \t\n 5555 “ is transformed into
  Whitespace Integer Whitespace
2. Lexer skips spaces (preferred)
• Modify step 5 from before as follows:
  It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace)
• Parser is not bothered with spaces


Ambiguities (1)

• There are ambiguities in the algorithm

• How much input is used? What if
• x1…xi ∈ L(R) and also
• x1…xK ∈ L(R)
– Rule: Pick the longest possible substring
– The “maximal munch”

Ambiguities (2)

• Which token is used? What if
• x1…xi ∈ L(Rj) and also
• x1…xi ∈ L(Rk)
– Rule: use rule listed first (j if j < k)

• Example:
– R1 = Keyword and R2 = Identifier
– “if” matches both
– Treats “if” as a keyword not an identifier
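The two disambiguation rules (longest match first, then earliest rule) can be sketched as a small tokenizer. This is a hedged illustration only: the `RULES` list, the `tokenize` function, and the use of Python's `re` module in place of a generated automaton are assumptions. Whitespace is skipped rather than reported, following the preferred option for spaces.

```python
import re

# Hypothetical rules in priority order; each pattern here is greedy, so
# rx.match returns that rule's longest match at the current position.
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
]
WHITESPACE = re.compile(r"[ \n\t]+")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:                          # lexer skips spaces
            pos = ws.end()
            continue
        best_name, best_end = None, pos
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Strict '>' implements both rules: a longer match wins
            # (maximal munch), and on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

On the input `"if ifx"`, the first lexeme is reported as a Keyword (tie broken by rule order), while `ifx` is an Identifier because the Identifier rule matches a longer prefix than the Keyword rule.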

Error Handling

• What if no rule matches a prefix of input?
• Problem: Can’t just get stuck …
• Solution:
– Write a rule matching all “bad” strings
– Put it last
• Lexer tools allow the writing of:
  R = R1 + ... + Rn + Error
– Token Error matches if nothing else matches

Summary

• Regular expressions provide a concise notation for string patterns
• Use in lexical analysis requires small extensions
– To resolve ambiguities
– To handle errors
• Good algorithms known (next)
– Require only single pass over the input
– Few operations per character (table lookup)
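The error rule from the Error Handling slide can be sketched as a catch-all listed last. This is a simplified illustration under stated assumptions: the rule names and `tokenize_with_errors` are hypothetical, and the Error rule here matches exactly one character, which is one possible way to approximate "all bad strings".

```python
import re

# Hypothetical rule list; Error is last, so it wins only when no other
# rule matches at least one character at the current position.
RULES = [
    ("Integer", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Error", re.compile(r".", re.DOTALL)),  # any single character
]

def tokenize_with_errors(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in " \n\t":        # skip whitespace
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and m.end() > best_end:  # longest match, earliest rule on ties
                best_name, best_end = name, m.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Because the Error rule always matches one character, the lexer never gets stuck; it reports an Error token and continues with the rest of the input.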


Regular Languages & Finite Automata

Basic formal language theory result:
Regular expressions and finite automata both define the class of regular languages.

Thus, we are going to use:
• Regular expressions for specification
• Finite automata for implementation (automatic generation of lexical analyzers)

Finite Automata

A finite automaton is a recognizer for the strings of a regular language

A finite automaton consists of
– A finite input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state →input state

Finite Automata

• Transition
  s1 →a s2
• Is read
  In state s1 on input “a” go to state s2

• If end of input (or no transition possible)
– If in accepting state ⇒ accept
– Otherwise ⇒ reject

Finite Automata State Graphs

• A state
• The start state
• An accepting state
• A transition, labeled with its input symbol (here a)
[Figure: the graph symbol drawn next to each bullet is not reproduced.]


A Simple Example

• A finite automaton that accepts only “1”
[Figure: state graph with one edge labeled 1.]

Another Simple Example

• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
[Figure: state graph; only an edge label 1 survives.]
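The second example (any number of 1's followed by a single 0) can be written down as a transition map and run directly. This is a minimal sketch: the state names `start` and `done` are hypothetical, since the slide's drawing is not available, and the function treats the automaton as a whole-string recognizer that rejects when it gets stuck with input remaining.

```python
# Hypothetical state names; "done" is the only accepting state.
TRANSITIONS = {
    ("start", "1"): "start",  # loop: any number of 1's
    ("start", "0"): "done",   # a single 0 leads to the accepting state
}
ACCEPTING = {"done"}

def accepts(s):
    state = "start"
    for ch in s:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False      # no transition possible: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING  # end of input: accept iff in accepting state
```

With these transitions, `"0"` and `"1110"` are accepted, while `"111"` (no final 0) and `"100"` (input left over after the 0) are rejected.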

And Another Example

• Alphabet {0,1}
• What language does this recognize?
[Figure: state graph with edges labeled 0 and 1.]

And Another Example

• Alphabet still { 0, 1 }
[Figure: state graph with two edges labeled 1.]
• The operation of the automaton is not completely defined by the input
– On input “11” the automaton could be in either state


Epsilon Moves

• Another kind of transition: ε-moves
[Figure: state A with an edge labeled ε to state B.]
• Machine can move from state A to state B without reading input

Deterministic and Non-Deterministic Automata

• Deterministic Finite Automata (DFA)
– One transition per input per state
– No ε-moves
• Non-deterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves
• Finite automata have finite memory
– Enough to only encode the current state

Execution of Finite Automata

• A DFA can take only one path through the state graph
– Completely determined by input

• NFAs can choose
– Whether to make ε-moves
– Which of multiple transitions for a single input to take

Acceptance of NFAs

• An NFA can get into multiple states
[Figure: NFA state graph with edges labeled 0 and 1.]

• Input: 1 0 1

• Rule: NFA accepts an input if it can get in a final state

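The acceptance rule for NFAs can be sketched by tracking the set of states the NFA could be in. The code below is an illustration under assumptions: the NFA (`DELTA`, `EPS`, states `q0` to `q2`) is a made-up example that accepts strings over {0,1} ending in 1; it is not the automaton drawn on the slide.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only ε-moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(text, start, accepting, delta, eps):
    current = epsilon_closure({start}, eps)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = epsilon_closure(moved, eps)
    # Accept if at least one possible state is final.
    return bool(current & accepting)

# Hypothetical NFA: q0 -ε-> q1, q1 loops on 0 and 1, q1 -1-> q2 (final).
EPS = {"q0": {"q1"}}
DELTA = {("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"}}
```

For the input 1 0 1, the set of possible states ends as {q1, q2}, which contains the final state, so the input is accepted.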
NFA vs. DFA (1)

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are easier to implement
– There are no choices to consider

NFA vs. DFA (2)

• For a given language the NFA can be simpler than the DFA
[Figure: an NFA and a DFA for the same language, with edges labeled 0 and 1.]

• DFA can be exponentially larger than NFA

Regular Expressions to Finite Automata

• High-level sketch
  Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA

Regular Expressions to NFA (1)

• For each kind of reg. expr, define an NFA
– Notation: NFA for regular expression M
[Figure: box labeled M.]

• For ε
[Figure: a single edge labeled ε.]

• For input a
[Figure: a single edge labeled a.]


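Regular Expressions to NFA (2)

• For AB
[Figure: the NFA for A connected by an ε-move to the NFA for B.]

• For A + B
[Figure: ε-moves branch from a new start state into the NFAs for A and B, and ε-moves join them in a new final state.]

Regular Expressions to NFA (3)

• For A*
[Figure: the NFA for A with ε-moves that allow skipping A and repeating it.]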
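The per-operator constructions can be sketched as functions that glue NFA fragments together with ε-moves. This is a Thompson-style sketch, not a transcription of the slide drawings: the `Frag` type, the helper names, and the exact placement of ε-edges are assumptions, and the wiring may differ in detail from the figures. A small `accepts` simulator is included only to check the result.

```python
from dataclasses import dataclass
from itertools import count

_fresh = count()

@dataclass
class Frag:
    start: int
    accept: int
    edges: list  # (source, label, target); label None is an ε-move

def _pair():
    return next(_fresh), next(_fresh)

def eps():
    s, t = _pair()
    return Frag(s, t, [(s, None, t)])

def sym(a):
    s, t = _pair()
    return Frag(s, t, [(s, a, t)])

def concat(a, b):   # AB: ε-move from A's final state to B's start
    return Frag(a.start, b.accept, a.edges + b.edges + [(a.accept, None, b.start)])

def union(a, b):    # A + B: new start branches to both, both join a new final
    s, t = _pair()
    extra = [(s, None, a.start), (s, None, b.start),
             (a.accept, None, t), (b.accept, None, t)]
    return Frag(s, t, a.edges + b.edges + extra)

def star(a):        # A*: allow skipping A and repeating A
    s, t = _pair()
    extra = [(s, None, a.start), (s, None, t),
             (a.accept, None, a.start), (a.accept, None, t)]
    return Frag(s, t, a.edges + extra)

def accepts(frag, text):
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for src, label, dst in frag.edges:
                if src == s and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen
    current = closure({frag.start})
    for ch in text:
        current = closure({dst for src, label, dst in frag.edges
                           if src in current and label == ch})
    return frag.accept in current

# The regular expression (1+0)*1 built from the pieces above.
nfa = concat(star(union(sym("1"), sym("0"))), sym("1"))
```

Each constructor adds only a constant number of states and ε-edges, so the NFA grows linearly with the size of the regular expression.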

Example of Regular Expression → NFA conversion

• Consider the regular expression
  (1+0)*1
• The NFA is
[Figure: NFA with states A to J; edges labeled ε, 1 and 0.]

NFA to DFA. The Trick

• Simulate the NFA
• Each state of DFA
  = a non-empty subset of states of the NFA
• Start state
  = the set of NFA states reachable through ε-moves from NFA start state
• Add a transition S →a S’ to DFA iff
– S’ is the set of NFA states reachable from any state in S after seeing the input a
  • considering ε-moves as well


NFA to DFA. Remark

• An NFA may be in many states at any time

• How many different states?

• If there are N states, the NFA must be in some subset of those N states

• How many subsets are there?
– 2^N - 1 = finitely many

NFA to DFA Example

[Figure: the NFA for (1+0)*1 with states A to J, and the DFA obtained from it.]
• DFA states: ABCDHI, FGABCDHI, EJGABCDHI
• DFA edges are labeled 0 and 1
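The subset construction can be sketched on the (1+0)*1 example. The NFA drawing did not survive extraction, so the edge sets `EPS` and `DELTA` below are an assumption: they are one edge set over states A to J that reproduces the three DFA state names shown on the slide. The choice of A as start state and J as final state is likewise assumed.

```python
# Assumed ε-moves and labeled moves for the NFA with states A..J.
EPS = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"}, "F": {"G"},
       "G": {"A", "H"}, "H": {"I"}}
DELTA = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}
START, ALPHABET = "A", "01"

def closure(states):
    """ε-closure, returned as a hashable DFA state."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction():
    start = closure({START})
    dfa, todo = {}, [start]
    while todo:
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in ALPHABET:
            moved = set()
            for s in S:
                moved |= DELTA.get((s, a), set())
            T = closure(moved)
            if T:                 # DFA states are non-empty subsets
                dfa[S][a] = T
                todo.append(T)
    return start, dfa
```

Only subsets reachable from the start state are generated, which is why this example yields 3 DFA states rather than all 2^10 - 1 non-empty subsets.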

Implementation

• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbols”
– For every transition Si →a Sk define T[i,a] = k

• DFA “execution”
– If in state Si and input a, read T[i,a] = k and skip to state Sk
– Very efficient

Table Implementation of a DFA

[Figure: DFA with states S, T and U; edges labeled 0 and 1.]

      0   1
  S   T   U
  T   T   U
  U   T   U
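The S/T/U table translates directly into a 2D array indexed by state and input symbol. A minimal sketch follows; the integer encoding of states and symbols is a choice made here, and treating U as the accepting state is an assumption, since the table itself does not mark accepting states.

```python
STATES = ["S", "T", "U"]          # row index -> state name
SYMBOLS = {"0": 0, "1": 1}        # input symbol -> column index

# T[i][a] = k encodes the transition Si ->a Sk from the table above.
T = [
    [1, 2],  # S: on 0 go to T, on 1 go to U
    [1, 2],  # T
    [1, 2],  # U
]
ACCEPTING = {2}  # assumption: U is the accepting state

def run(text, state=0):
    for ch in text:
        state = T[state][SYMBOLS[ch]]  # one table lookup per character
    return state

def accepts(text):
    return run(text) in ACCEPTING
```

Execution costs a single array lookup per input character, which is the "very efficient" property the slide refers to.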


Implementation (Cont.)

• NFA → DFA conversion is at the heart of tools such as lex, ML-Lex or flex

• But, DFAs can be huge

• In practice, lex/ML-Lex/flex-like tools trade off speed for space in the choice of NFA and DFA representations

Theory vs. Practice

Two differences:

• DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication.

• DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream and then find the next one, etc.
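Both differences can be handled by a small driver loop around the DFA. The sketch below is one common approach, not necessarily what lex or flex do internally: accepting states carry a token type, and the driver remembers the last accepting position so it can cut the lexeme there and restart. The DFA (`TABLE`, `ACCEPTING`) and the function names are hypothetical.

```python
import string

DIGITS, LETTERS = string.digits, string.ascii_letters

# Hypothetical DFA: state 0 is the start, 1 reads an integer, 2 an identifier.
TABLE = {
    0: {**{c: 1 for c in DIGITS}, **{c: 2 for c in LETTERS}},
    1: {c: 1 for c in DIGITS},
    2: {c: 2 for c in DIGITS + LETTERS},
}
ACCEPTING = {1: "Integer", 2: "Identifier"}  # accepting state -> token type

def longest_match(text, pos):
    """Run the DFA from text[pos:]; return (end, token type) of the
    last accepting position seen, or None if there was none."""
    state, last, i = 0, None, pos
    while i < len(text):
        nxt = TABLE.get(state, {}).get(text[i])
        if nxt is None:
            break                  # DFA is stuck: stop scanning
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last = (i, ACCEPTING[state])
    return last

def tokens(text):
    pos, out = 0, []
    while pos < len(text):
        if text[pos] in " \t\n":
            pos += 1
            continue
        hit = longest_match(text, pos)
        if hit is None:
            raise ValueError(f"lexical error at position {pos}")
        end, kind = hit
        out.append((kind, text[pos:end]))
        pos = end                  # continue with the next lexeme
    return out
```

On `"12ab"` the DFA gets stuck after `12`, so the driver reports an Integer and restarts at `a`, producing an Identifier for `ab`.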
