Lexical Analysis I: Prof. Bodik CS 164 Lecture 2 1

This lecture discusses lexical analysis and how to specify a lexer. It introduces regular expressions as a way to describe the lexemes of a language and maps them to tokens. It discusses problems like the lexer specification language, the scanning mechanism, and ambiguities. The key problems are: (1) describing lexemes using regular expressions, (2) breaking the input string into lexemes using a finite automaton implementation of regular expressions, (3) generating the lexer from the regular expression specification, and (4) handling ambiguities using lookahead.

Lexical Analysis I

Lecture 2

Prof. Bodik CS 164 Lecture 2 1


Course Administration

• Class accounts still available


– See Bowei
– warning: misprint in passwords in accounts handed out on Wed; to fix, remove the leading ‘2’

• Extra credit for bugs in project assignments


– in starter kits and handouts
– TAs are final arbiters of what’s a bug
– only the first student to report the bug gets credit



Outline

• What we want to accomplish:

– given a description of the input language, automatically generate the lexer.

• This is exactly the topic of PA2



Outline (continued)

Problems we will solve in the process:

1. lexer specification language


– how to describe lexemes of the input language?
2. the scanning mechanism
– how to break input string into the lexemes
3. lexer generator
– how to translate from (1) to (2)
4. ambiguities
– the need for lookahead



Recall: The Structure of a Compiler

Decaf program (stream of characters)
→ PA2: lexer (Lecture 2, 3)
stream of tokens
→ PA3: parser
Abstract Syntax Tree (AST)
→ PA4: checker
AST with annotations (types, declarations)
→ PA5: code gen
MIPS code (maybe x86)



Recall: Lexical Analysis

• The input is just a sequence of characters. Example:


if (i == j)
z = 0;
else
z = 1;

• More accurately, the input is a string:


\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

• Goal: Partition input string into substrings


– And classify them according to their role (role = token)



Continued

• Lexer input:
\tif(i==j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Lexer output:
a sequence of token-lexeme pairs:

(Whitespace, “\t”),
(Keyword, “if”),
(OpenPar, “(”),
(Identifier, “i”),
(Relation, “==”),
(Identifier, “j”),



What’s a Token?

• A token is a syntactic category


– In English:
noun, verb, adjective, …
– In a programming language:
Identifier, Integer, Keyword, Whitespace, …

• Parser relies on the token distinctions:


– E.g., identifiers are treated differently than keywords



What are lexemes?

• Webster:
– “an item in the vocabulary of a language”
• cs164:
– strings into which the input string is partitioned.
– serve as attributes of tokens:

(Whitespace, “\t”),
(Keyword, “if”), (Keyword, “class”)
(OpenPar, “(”),
(Identifier, “i”), (Identifier, “Foo”)
(Relation, “==”), (Relation, “>”)



The problem statement



What do we want to accomplish (in PA2)?

• given:
– D: description of the lexical part of the input language L
– in our case (L=Decaf)
• deliver:
– the lexer for the language L
– such that you produce the lexer directly from D



… graphically

Input:
\tif(i==j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

lexer description → PA2: lexer

Output:
(Whitespace, “\t”),
(Keyword, “if”),
(OpenPar, “(”),
(Identifier, “i”),
(Relation, “==”),
(Identifier, “j”),



Problem 1: how to specify the lexer

• We want a high-level language D that


1. describes lexemes, and
2. maps them to tokens
3. but doesn’t describe the lexer algorithm itself!

– Point 3 is very important
– allows focusing on what, not on how
– therefore, D is sometimes called a specification language, not a programming language

• Point 2 is easy, so let’s focus on Points 1 and 3



Example

• lexeme → Token

• strings of letters or digits, starting with a letter → Identifier

• a non-empty string of digits → Integer

• “else” or “if” or “begin” or … → Keyword

• a non-empty sequence of blanks, newlines, and tabs → Whitespace

• a left-parenthesis → OpenPar
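
As a rough illustration of such a specification, the table above can be written as a list of (token, regular expression) pairs. The sketch below uses Python's `re` module as a stand-in for the specification language D. The names `TOKEN_SPEC`, `MASTER`, and `tokenize` are invented for this example, and the keyword list is limited to the three keywords named on the slide. Python's `re` is a backtracking matcher that takes the first alternative that matches, so this is not the automaton-based mechanism developed later in the lecture.

```python
import re

# Hypothetical specification: (token name, regular expression) pairs.
# Keyword is listed before Identifier so that "if" is classified as a keyword;
# the \b keeps "iffy" from being split into the keyword "if" plus "fy".
TOKEN_SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword",    r"(?:else|if|begin)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("OpenPar",    r"\("),
]

# One combined pattern with a named group per token.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs; raise if no token matches."""
    pairs, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}: {text[pos]!r}")
        pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs


print(tokenize("\tif (i"))
```

The specification only says what the lexemes look like. The loop in `tokenize` is the part a lexer generator would produce automatically.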



Regular Expressions

• There are several formalisms for specifying lexemes

• Regular expressions are the most popular


– Example:
• ‘a’ | ‘b’ denotes the set of strings { “a”, “b” }
• ‘a’ ‘b’* denotes the set of strings { “a”, “ab”, “abb”, “abbb”, … }

• Regular expressions
– Simple and useful theory
– Easy to understand
– Efficient implementations



The language of regular expressions

• Def. A language is a set of strings

– More precisely, given a set of characters Σ called the alphabet, a language is a set of strings over the alphabet.

• Each regular expression denotes a language

– If A is a regular expression, then we write L(A) to refer to the language denoted by A

• if A = ‘a’ | ‘b’ then L(A) = { “a”, “b” }
• if B = ‘a’ ‘b’* then L(B) = { “a”, “ab”, “abb”, “abbb”, … }
• in both cases, Σ = { “a”, “b” }

Atomic Regular Expressions

• Single character: ‘c’
L(‘c’) = { “c” } (for any c ∈ Σ)

• Concatenation: A.B (where A and B are reg. exp.)
L(A.B) = { a.b | a ∈ L(A) and b ∈ L(B) }

• Example: L(‘i’.‘f’) = { “if” }


– we will abbreviate ‘i’.‘f’ as ‘if’
– the . operator can be omitted



Compound Regular Expressions

• Union
L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }

• Examples:
L(‘if’ | ‘then’ | ‘else’) = { “if”, “then”, “else” }
L(‘0’ | ‘1’ | … | ‘9’) = { “0”, “1”, …, “9” }
(note the … are just an abbreviation in this slide)
• Another example:
L((‘0’ | ‘1’) (‘0’ | ‘1’)) = { “00”, “01”, “10”, “11” }

More Compound Regular Expressions

• Now, a notation for infinite languages:


• Iteration: A*
L(A*) = { “” } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ …
• Examples:
L(‘0’*) = { “”, “0”, “00”, “000”, … }
L(‘1’ ‘0’*) = { strings consisting of a 1 followed by zero or more 0’s }
• Epsilon: ε
L(ε) = { “” }
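
The definitions of union, concatenation, and iteration can be tried out directly by treating a language as a Python set of strings. This is only a sketch: the function names are invented here, and because L(A*) is infinite, `star` takes an assumed cutoff `max_reps` and returns a finite approximation.

```python
def union(A, B):
    """L(A | B): strings that are in L(A) or in L(B)."""
    return A | B


def concat(A, B):
    """L(A.B): every string of L(A) followed by every string of L(B)."""
    return {a + b for a in A for b in B}


def star(A, max_reps):
    """Finite approximation of L(A*): at most max_reps repetitions of A."""
    result = {""}          # zero repetitions: the empty string
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, A)   # one more repetition
        result |= layer
    return result


bit = union({"0"}, {"1"})
print(sorted(concat(bit, bit)))    # ['00', '01', '10', '11']
print(sorted(star({"0"}, 3)))      # ['', '0', '00', '000']
```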


Now back to our lexemes: Keyword

– Keyword: “else” or “if” or “begin” or …

‘else’ | ‘if’ | ‘begin’ | …

(Recall: ‘else’ abbreviates ‘e’ ‘l’ ‘s’ ‘e’ )



Example: Integers

Integer: a non-empty string of digits

digit = ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’
number = digit digit*

Abbreviation: A+ = A A*



Example: Identifier

Identifier: strings of letters or digits, starting with a letter

letter = ‘A’ | … | ‘Z’ | ‘a’ | … | ‘z’
identifier = letter (letter | digit)*

Is (letter* | digit*) the same?
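
One way to explore the question is to test both expressions on a few strings. The sketch below transcribes them into Python `re` syntax; the names `IDENT` and `OTHER` and the sample strings are chosen for this illustration only.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # letter (letter | digit)*
OTHER = re.compile(r"(?:[A-Za-z]*|[0-9]*)")   # (letter* | digit*)

for s in ["x1", "abc", "123", ""]:
    # fullmatch requires the whole string to be in the language
    print(repr(s), bool(IDENT.fullmatch(s)), bool(OTHER.fullmatch(s)))
```

On these samples the two expressions disagree: “x1” mixes letters and digits, so only the first accepts it, while “123” and the empty string are accepted only by the second.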



Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, and tabs

(‘ ’ | ‘\t’ | ‘\n’)+

(Can you spot a small mistake?)



Example: Phone Numbers

• Regular expressions are all around you!


• Consider (510) 643-1481
Σ = { 0, 1, 2, 3, …, 9, (, ), - }
area = digit³
exchange = digit³
phone = digit⁴
number = ‘(’ area ‘)’ exchange ‘-’ phone
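
The phone-number expression translates almost symbol for symbol into Python `re` syntax, where `{3}` plays the role of the superscript. The name `PHONE` is made up for this sketch. Note that the alphabet Σ above contains no blank, so the expression as written matches the number only without the space after the area code.

```python
import re

# number = '(' area ')' exchange '-' phone, with area = exchange = digit^3, phone = digit^4
PHONE = re.compile(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}")

print(bool(PHONE.fullmatch("(510)643-1481")))    # True
print(bool(PHONE.fullmatch("(510) 643-1481")))   # False: the blank is not in the alphabet
```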



Example: Email Addresses

• Consider [email protected]

Σ = letters ∪ { ., @ }
name = letter+
address = name ‘@’ name (‘.’ name)*



Outline

Problems we will solve in the process:

1. what: lexer specification language … done


– how to describe lexemes of the input language?
2. how: the lexer mechanism
– how to break input string into the lexemes
3. lexer generator
– how to translate from (1) to (2)
4. ambiguities
– the need for lookahead



The lexer algorithm

• Finite automata
– Deterministic Finite Automata (DFAs)
– Non-deterministic Finite Automata (NFAs)

• Implementation of the lexer


problem 2:
lexeme1 → RegExp1 → NFA1
lexeme2 → RegExp2 → NFA2 → NFA → DFA → Tables
lexeme3 → RegExp3 → NFA3
… … …

Finite Automata

• Regular expressions = specification


• Finite automata = implementation

• A finite automaton consists of


– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state --input--> state

Finite Automata

• Transition
s1 a s2
• Is read
In state s1 on input “a” go to state s2

• If end of input
– If in accepting state => accept
– Otherwise => reject



Finite Automata State Graphs

[Figure: graphical symbols for the following elements]
• A state
• The start state
• An accepting state
• A transition, labeled with its input symbol (here a)



A Simple Example
• A finite automaton that accepts only “1”

• A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state



Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}

• Check that “1110” is accepted but “110…” is not
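
The check above can be carried out mechanically. The sketch below encodes one possible automaton for "any number of 1’s followed by a single 0"; the state names `A` and `B` are assumptions, since the slide's picture is not reproduced here. A missing transition means the automaton is stuck and the string is rejected.

```python
START = "A"
ACCEPTING = {"B"}

# (state, input) -> next state; hypothetical names for the two states
DELTA = {
    ("A", "1"): "A",   # loop on 1
    ("A", "0"): "B",   # the single 0 leads to the accepting state
}


def accepts(s):
    """Follow transitions from the start state; accept iff we end in ACCEPTING."""
    state = START
    for ch in s:
        if (state, ch) not in DELTA:
            return False          # no transition for this input: reject
        state = DELTA[(state, ch)]
    return state in ACCEPTING


print(accepts("1110"))   # True
print(accepts("1100"))   # False: input continues after the single 0
```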



And Another Example

• Alphabet {0,1}
• What language does this recognize?

[Figure: state graph with transitions labeled 0 and 1]


And Another Example

• Alphabet still { 0, 1 }
[Figure: state graph with transitions labeled 1]

• The operation of the automaton is not completely defined by the input
– On input “11” the automaton could be in either state



Epsilon Moves

• Another kind of transition: ε-moves

A --ε--> B

• Machine can move from state A to state B without reading input



Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)


– One transition per input per state
– No ε-moves
• Nondeterministic Finite Automata (NFA)
– Can have multiple transitions for one input in a given state
– Can have ε-moves
• Finite automata have finite memory
– Need only to encode the current state



Execution of Finite Automata

• A DFA can take only one path through the state graph
– Completely determined by input

• NFAs can choose
– Whether to make ε-moves
– Which of multiple transitions for a single input to take



Acceptance of NFAs

• An NFA can get into multiple states


[Figure: NFA state graph with transitions labeled 0 and 1]

• Input: 1 0 1

• Rule: NFA accepts if it can get in a final state
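
The acceptance rule can be simulated by tracking the set of all states the NFA could be in after each input character. The NFA below is hypothetical (it is not the one drawn on the slide): it accepts strings over {0, 1} that end in 1 and includes one ε-move so that the closure step has something to do.

```python
EPS = ""   # label used here for ε-moves (an arbitrary choice)

# Hypothetical NFA: (state, input) -> set of possible next states
NFA = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},   # two choices on input 1
    ("q1", EPS): {"q2"},         # ε-move into the accepting state
}
START, ACCEPTING = "q0", {"q2"}


def eps_closure(states):
    """All states reachable from `states` using only ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(text):
    """Accept iff some choice of moves ends in an accepting state."""
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


print(accepts("101"))   # True
print(accepts("10"))    # False
```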


NFA vs. DFA (1)

• NFAs and DFAs recognize the same set of languages (regular languages)

• DFAs are easier to implement


– There are no choices to consider



NFA vs. DFA (2)

• For a given language the NFA can be simpler than the DFA

[Figure: an NFA and a DFA for the same language, with transitions labeled 0 and 1]

• DFA can be exponentially larger than NFA


Reg Expressions to NFAs

lexeme1 → RegExp1 → NFA1
lexeme2 → RegExp2 → NFA2 → NFA → DFA → Tables
lexeme3 → RegExp3 → NFA3
… … …



Regular Expressions to NFA (1)

• For each kind of rexp, define an NFA


– Notation: NFA for rexp A
[Figure: symbol used for the NFA of A]

• For ε
[Figure: NFA for ε]

• For input a
[Figure: NFA with a transition labeled a]



Regular Expressions to NFA (2)

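• For AB
[Figure: NFA for A joined to the NFA for B by an ε-move]

• For A | B
[Figure: NFAs for A and B side by side, joined by ε-moves]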



Regular Expressions to NFA (3)

• For A*
[Figure: NFA for A*, built around the NFA for A]
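
The per-operator constructions can be sketched as a recursive function over a regular-expression tree. This is one common variant of the construction (Thompson-style, with a fresh start and accept state per fragment) and may use more ε-moves than the pictures on the slides. The tuple encoding of expressions and all function names are assumptions made for this example.

```python
import itertools

EPS = None                     # label for ε-moves
_fresh = itertools.count()     # supplies new state numbers


def build(rx, edges):
    """Add an NFA fragment for rx to `edges`; return its (start, accept) states.

    rx is a tuple tree: ("eps",), ("char", c), ("cat", A, B),
    ("alt", A, B) or ("star", A).
    """
    s, f = next(_fresh), next(_fresh)
    kind = rx[0]
    if kind == "eps":
        edges.append((s, EPS, f))
    elif kind == "char":
        edges.append((s, rx[1], f))
    elif kind == "cat":        # A, then an ε-move, then B
        s1, f1 = build(rx[1], edges)
        s2, f2 = build(rx[2], edges)
        edges += [(s, EPS, s1), (f1, EPS, s2), (f2, EPS, f)]
    elif kind == "alt":        # ε-moves into either A or B
        s1, f1 = build(rx[1], edges)
        s2, f2 = build(rx[2], edges)
        edges += [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]
    elif kind == "star":       # skip A entirely, or loop back and repeat it
        s1, f1 = build(rx[1], edges)
        edges += [(s, EPS, s1), (f1, EPS, f), (s, EPS, f), (f, EPS, s)]
    else:
        raise ValueError(f"unknown node {kind!r}")
    return s, f


def closure(states, edges):
    """States reachable from `states` through ε-moves only."""
    seen, stack = set(states), list(states)
    while stack:
        u = stack.pop()
        for a, label, b in edges:
            if a == u and label is EPS and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def matches(rx, text):
    """Build the NFA for rx and simulate it on text."""
    edges = []
    start, accept = build(rx, edges)
    current = closure({start}, edges)
    for ch in text:
        step = {b for a, label, b in edges if a in current and label == ch}
        current = closure(step, edges)
    return accept in current


# (1 | 0)*1, the expression used in the lecture's conversion example
RX = ("cat", ("star", ("alt", ("char", "1"), ("char", "0"))), ("char", "1"))
print(matches(RX, "0101"), matches(RX, "10"))
```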


Example of RegExp -> NFA conversion

• Consider the regular expression


(1 | 0)*1
• The NFA is

[Figure: NFA with states A through J and transitions labeled ε, 0, and 1]

Next

Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA



NFA to DFA. The Trick

• Simulate the NFA


• Each state of DFA
= a non-empty subset of states of the NFA
• Start state
= the set of NFA states reachable through ε-moves from NFA start state
• Add a transition S --a--> S’ to DFA iff
– S’ is the set of NFA states reachable from the states in S after seeing the input a
• considering ε-moves as well
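
The rules above can be sketched as a worklist algorithm (the subset construction). The NFA used here is meant to be the one for (1 | 0)*1 with states A through J. Its edge list is reconstructed from the state names in the lecture's example DFA (ABCDHI, FGABCDHI, EJGABCDHI) and should be treated as an assumption, as should all Python names.

```python
EPS = "eps"   # label for ε-moves

# Assumed edges of the NFA for (1|0)*1: (state, label) -> set of next states
NFA = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", EPS): {"I"},
    ("I", "1"): {"J"},
}
NFA_START, NFA_ACCEPT, ALPHABET = "A", {"J"}, "01"


def closure(states):
    """Set of NFA states reachable from `states` through ε-moves."""
    seen, stack = set(states), list(states)
    while stack:
        for t in NFA.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction():
    """Return (start state, transition dict, accepting states) of the DFA."""
    start = closure({NFA_START})
    dfa, seen, worklist = {}, {start}, [start]
    while worklist:
        S = worklist.pop()
        for a in ALPHABET:
            moved = set()
            for s in S:
                moved |= NFA.get((s, a), set())
            T = closure(moved)
            if not T:
                continue            # only non-empty subsets become DFA states
            dfa[(S, a)] = T         # transition S --a--> T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    accepting = {S for S in seen if S & NFA_ACCEPT}
    return start, dfa, accepting


start, dfa, accepting = subset_construction()
for (S, a), T in sorted(dfa.items(), key=lambda kv: (len(kv[0][0]), kv[0][1])):
    print("".join(sorted(S)), a, "".join(sorted(T)))
```

Under the assumed edges, only three subsets are ever reached, far fewer than the 2^10 - 1 non-empty subsets of ten NFA states.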

NFA -> DFA Example

[Figure: the NFA for (1 | 0)*1 with states A through J, and the resulting DFA with states ABCDHI, FGABCDHI, and EJGABCDHI connected by transitions labeled 0 and 1]

NFA to DFA. Remark

• An NFA may be in many states at any time

• How many different states?

• If there are N states, the NFA must be in some subset of those N states

• How many non-empty subsets are there?
– 2^N - 1 = finitely many

Implementation

• A DFA can be implemented by a 2D table T


– One dimension is “states”
– Other dimension is “input symbols”
– For every transition Si --a--> Sk define T[i,a] = k
• DFA “execution”
– If in state Si and input a, read T[i,a] = k and skip to state Sk
– Very efficient
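
A table-driven DFA can be sketched in a few lines. The table below follows the lecture's three-state example with states S, T, and U, encoded as row indices 0, 1, and 2. Which state is accepting cannot be read off the table alone, so marking U as accepting is an assumption, as are the Python names.

```python
STATES = ["S", "T", "U"]       # row index -> state name
COLUMN = {"0": 0, "1": 1}      # input symbol -> column index

# TABLE[i][a] = k means: in state i on input a, go to state k
TABLE = [
    [1, 2],   # S: on 0 go to T, on 1 go to U
    [1, 2],   # T
    [1, 2],   # U
]
START = 0
ACCEPTING = {2}                # assumed: U is the accepting state


def run(text):
    """Return the name of the state reached after reading text."""
    i = START
    for ch in text:
        i = TABLE[i][COLUMN[ch]]   # one table lookup per input character
    return STATES[i]


def accepts(text):
    return STATES.index(run(text)) in ACCEPTING


print(run("0101"), accepts("0101"))   # U True
print(run("10"), accepts("10"))       # T False
```

Each input character costs one array lookup, which is why the slide calls the execution very efficient.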



Table Implementation of a DFA

[Figure: DFA with states S, T, and U and transitions labeled 0 and 1]

     0   1
S    T   U
T    T   U
U    T   U

Implementation (Cont.)

• NFA -> DFA conversion is at the heart of tools such as flex or jlex

• But, DFAs can be huge

• In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
