Lecture2 Web
Environments (ID2202)
Fall 2020
Lecture 2: Lexical Analysis
David Broman
Associate Professor, KTH Royal Institute of Technology
Associate Director Operations, Digital Futures
Part I
Lexical Analysis
Languages
Language
A set of strings
String
A finite sequence of symbols
Symbol
An element in a finite set (the alphabet)
Chomsky Hierarchy
For details, see, for instance, “Languages and Machines” by Sudkamp, 2006.
Part II
Regular Expressions
Regular Expressions
Regular expression
A way to specify a (possibly infinite) set of strings
Example: Alphabet {a, b, c}
Symbol a
A language containing just the string “a”
Alternation M|N
M|N is a new regular expression, where M and N are regular expressions
Regex a|b represents the language {“a”, “b”}
Concatenation M⋅N
M⋅N is a new regular expression, forming the concatenation of M and N, where M and N are regular expressions
Regex (a|b)⋅c represents the language {“ac”, “bc”}
Epsilon ε
Regular expression ε represents the language containing only the empty string
Regex (a⋅c)|ε represents the language {“ac”, “”}
Repetition M*
M* is called the Kleene closure of M. M* is a regular expression representing the concatenation of zero or more strings from M
Regex ((a⋅b)|c)* represents the language {“”, “ab”, “c”, “abc”, “abcc”, “abccc”, “abccabab”, …}
Structure from Appel (1998)
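The operators above can be read as operations on sets of strings. The following is a minimal Python sketch of that reading, not part of the original slides. The function names (`alternation`, `concatenation`, `kleene_star`) and the `max_len` cut-off are assumptions made for illustration; the cut-off is needed because M* is usually an infinite set.

```python
from itertools import product

def alternation(m, n):
    """M|N: the union of the two languages."""
    return m | n

def concatenation(m, n):
    """M⋅N: every string of M followed by every string of N."""
    return {x + y for x, y in product(m, n)}

def kleene_star(m, max_len):
    """M*: zero or more concatenations of M, cut off at length max_len."""
    result = {""}      # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        # Extend each newly found string by one more string from M.
        frontier = {x + y for x in frontier for y in m
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

a, b, c = {"a"}, {"b"}, {"c"}
epsilon = {""}

print(sorted(alternation(a, b)))                          # ['a', 'b']
print(sorted(concatenation(alternation(a, b), c)))        # ['ac', 'bc']
print(sorted(alternation(concatenation(a, c), epsilon)))  # ['', 'ac']
print(sorted(kleene_star(alternation(concatenation(a, b), c), 3)))
# ['', 'ab', 'abc', 'c', 'cab', 'cc', 'ccc']
```

The last line lists only the strings of ((a⋅b)|c)* with at most three symbols; raising `max_len` yields longer members of the same language.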
Conventions
May omit ⋅ or ε
ab is the same as a⋅b
a| is the same as a|ε
Kleene closure binds tighter than concatenation
ab* is the same as a(b)*
Concatenation binds tighter than alternation
ab|c is the same as (ab)|c
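The precedence conventions can be checked experimentally. The sketch below uses Python's standard `re` module, whose syntax follows the same conventions but has no explicit ⋅ or ε symbol. The helper `language` and its `max_len` bound are assumptions for illustration: it enumerates all strings over {a, b, c} up to a fixed length and keeps those a pattern matches in full.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """All strings over `alphabet` of length <= max_len that `pattern` matches in full."""
    regex = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    }

# Kleene closure binds tighter than concatenation.
assert language("ab*") == language("a(b)*")
assert language("ab*") != language("(ab)*")

# Concatenation binds tighter than alternation.
assert language("ab|c") == language("(ab)|c")
assert language("ab|c") != language("a(b|c)")

# An omitted alternative stands for the empty string.
assert language("a|") == {"a", ""}
```

Agreement on strings up to length 4 is evidence rather than proof of equivalence, but it is enough to tell the two possible parses apart in each case.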
Abbreviations
Alternation abbreviation
[abc] is the same as a|b|c
[b-h] is the same as [bcdefgh]
[b-d01A-C] is the same as [bcd01ABC]
Optional and repetition
M? with the meaning (M|ε)
M+ with the meaning (M⋅M*)
Any character is represented by a period .
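Python's `re` module supports the same abbreviations, so each one can be compared against its expanded form. This is a small sketch under that assumption; the helper name `accepts` is hypothetical, and an empty alternative is used in place of ε because `re` has no ε symbol.

```python
import re

def accepts(pattern, text):
    """True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None

# Character classes abbreviate alternation over single symbols.
for ch in "abcdefghi01ABCD":
    assert accepts("[abc]", ch) == accepts("a|b|c", ch)
    assert accepts("[b-h]", ch) == accepts("[bcdefgh]", ch)
    assert accepts("[b-d01A-C]", ch) == accepts("[bcd01ABC]", ch)

for text in ["", "ab", "abab"]:
    # M? means (M|ε).
    assert accepts("(ab)?", text) == accepts("(ab|)", text)
    # M+ means M⋅M*.
    assert accepts("(ab)+", text) == accepts("(ab)(ab)*", text)

# The period matches any single character
# (in Python, any character except a newline by default).
assert accepts(".", "x") and not accepts(".", "xy")
```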
Part III
Deterministic Finite Automata
Deterministic Finite Automata
A DFA is a machine M, represented as a 5-tuple:
M = (S, Σ, δ, s0, F)
S: set of states
Σ: finite set of input symbols (the alphabet)
δ : S × Σ → S: transition function (deterministic because there is only one possible output)
s0: start state
F: set of final states
The language recognized by M is the set of strings that M accepts. It is written as L(M).
What is the formalization of the DFA? What is L(M)?
[Figure: DFA with states 1, 2, 3, 4; edge labels f, o, r]
S = {1, 2, 3, 4} (1)
Σ = {r, f, o} (2)
δ = {((1, f), 2), ((2, o), 3), ((3, r), 4)} (3)
s0 = 1 (4)
F = {4} (5)
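A DFA given as a 5-tuple can be simulated directly. The following is a minimal Python sketch using the example with states 1 to 4 and the alphabet {r, f, o}. The function name `accepts` and the dictionary encoding of the transition function are assumptions for illustration. The example lists only three transitions, so the sketch also assumes that a missing (state, symbol) entry means the string is rejected.

```python
def accepts(delta, start, finals, text):
    """Run the DFA on `text` and report whether it ends in a final state."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            # Assumption: an undefined transition rejects the input.
            return False
        state = delta[(state, symbol)]
    return state in finals

# The example DFA: S = {1, 2, 3, 4}, alphabet {r, f, o}.
delta = {(1, "f"): 2, (2, "o"): 3, (3, "r"): 4}
start = 1
finals = {4}

for text in ["for", "fo", "fro", "forr", ""]:
    print(repr(text), accepts(delta, start, finals, text))
```

Because each (state, symbol) pair maps to at most one next state, the loop never has to choose between alternatives, which is what makes the automaton deterministic.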
Deterministic Finite Automata (DFA)
Exercise: Write down a regular expression for a lower case identifier (can include an underscore as first character)
[_a-z][a-z0-9]*
[Figure: DFA with states 1 and 2; edge labels _, a-z, 0-9]
S = {1, 2} (6)
Σ = {_, a, b, …, z, 0, 1, …, 9} (7)
δ = {((1, a), 2), …, ((2, 0), 2), …} (8)
s0 = 1 (9)
F = {2} (10)
Abbreviation: an edge label such as a-z represents one transition line for each character.
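The identifier exercise can be cross-checked by building the two-state DFA and comparing it with the regular expression [_a-z][a-z0-9]* on a few inputs. This is a sketch under assumptions: the transition table is expanded from the edge labels with one entry per character, the names `delta` and `dfa_accepts` are hypothetical, and a missing transition is taken to mean rejection.

```python
import re
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits

# Expand the abbreviated edge labels into one transition per character.
delta = {}
for ch in "_" + LETTERS:
    delta[(1, ch)] = 2      # first character: underscore or a-z
for ch in LETTERS + DIGITS:
    delta[(2, ch)] = 2      # remaining characters: a-z or 0-9

def dfa_accepts(text, start=1, finals=frozenset({2})):
    """Simulate the identifier DFA; an undefined transition rejects."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

IDENTIFIER = re.compile(r"[_a-z][a-z0-9]*")

for text in ["x1", "_tmp", "_", "9lives", "a_b", "Abc", ""]:
    by_regex = IDENTIFIER.fullmatch(text) is not None
    assert dfa_accepts(text) == by_regex
    print(repr(text), by_regex)
```

Note that "a_b" is rejected by both: the exercise allows an underscore only as the first character.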
1. “aavaN” is rejected by M
2. “//avvN” is accepted by M
Reading Guidelines
See the course webpage
for more information.
Conclusions