0% found this document useful (0 votes)
22 views

Lecture2 Web

The document discusses lexical analysis and regular expressions. It covers topics like lexical analysis/scanning, regular expressions which are used to specify languages, the Chomsky hierarchy and different types of grammars and machines. It also provides examples of regular expressions and their use.

Uploaded by

jmi40724
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views

Lecture2 Web

The document discusses lexical analysis and regular expressions. It covers topics like lexical analysis/scanning, regular expressions which are used to specify languages, the Chomsky hierarchy and different types of grammars and machines. It also provides examples of regular expressions and their use.

Uploaded by

jmi40724
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

Compilers and Execution

Environments (ID2202)
Fall 2020
Lecture 2: Lexical Analysis

David Broman
Associate Professor, KTH Royal Institute of Technology
Associate Director Operations, Digital Futures
David Broman

Part I
Lexical Analysis

Part II Part III


Regular Expressions Deterministic Finite Automata

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Part I
Lexical Analysis

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Lexical Analysis / Scanning


Line comment: Recognize “//“,
int
ignore until new line Identifier token, IDENT
foo
( “foo” is a specific
lexeme
// A function int
int foo(int x){ x
Lexical Analyzer /
return x + 1; Scanner )
} {
Source Text Tokens return Keyword, RETURN
x
Ignores white space +
and comments. 1 Integer Literal / Constant
Space 0x20, Tab 0x09, Lexical Error ; INT
Line Feed 0xa, Carriage }
Return 0xd Specific token
CR LF Windows/DOS RCURLY
LF Unix, MacOS X, Linux
CR Classic Mac, C64

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Lexical Analysis / Scanning

Exercise: which lexical errors would Not a lexical error. Would be


a C/C++ compiler detect? detected by the parser

void foo(int x){


while(x < 10]{
if(x == 7) // Works?
x = x + 2;
braek;
x = x + 1 Not a lexical error. “braek”
is an identifier.
}
return "Hello\z";
}

How do we implement scanning?


Lexical warning. Unknown
1. By hand, programmatically longest match
escape sequence ‘\z’
2. Scanner tool
Specify => regular expressions (regex) NOTE: most of the errors in this example will
Implement => deterministic finate automata (DFA) be parsing errors.

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Languages

Language
A set of strings
String
A finite sequence of symbols

Symbol
An element in a finite set (the alphabet)

Example 1: set of strings representing an integer


Finite or infinite set? Answer: infinite

Example 2: set of strings forming an identifier, which


represents a keyword in the C language

Finite or infinite set? Answer: finite

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Chomsky Hierarchy

Grammars Languages Accepting Machines


Type 0 Recursively Turing machine
Unrestricted grammars enumerable

Type 1 Context- Linear-bounded


Context-sensitive sensitive automata
grammars
Used in parsing
Type 2 Context-free Pushdown
Context-free automata
grammars
Part of this
course
Type 3 Regular Deterministic finite Used in lexing / scanning
Regular grammars language automaton
Nondeterministic finite
automaton

For details, see for instance “Languages and Machines By Sudkamp, 2006
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Part II
Regular Expressions

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Regular Expressions
Regular expression
A way to specify a (possibly infinite) set of strings
Example: Alphabet {a, b, c}
Symbol a
A language just containing a
Alternation M|N
M | N is a new regular expression,
Regex a|b represents the
where M and N are regular expressions
language {“a”, “b”}
Concatenation M⋅N
M⋅N is a new regular expression, forming the
Regex (a|b)⋅c represents the
concatenation of M and N, where M and N are
language {“ac”, “bc”}
regular expressions
Epsilon ε Regex (a⋅c)|ε represents the
Regular expression ε is the language of an language {“ac”, “”}
empty string
Repetition M* Regex ((a⋅b)|c)* represents the
* is called the Kleene closure of M. M* forms a language {“abc”, “abc”, “abcc”,
regular expression, representing the “abccc”, “abccabab …}
concatenation of zero or more M Structure from Apple (1998)

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Regular Expressions, Convensions

Conventions
Symbol a
A language just containing a
May omit ⋅ or ε
Alternation M|N
ab is the same as a⋅b
M | N is a new regular expression,
where M and N are regular expressions a| is the same as a|ε

Concatenation M⋅N
Kleene closure binds tighter
M⋅N is a new regular expression, forming the
concatenation of M and N, where M and N are ab* is the same as a(b)*
regular expressions
Epsilon ε Concatenations binds tighter than
Regular expression ε is the language of an alternation
empty string ab|c is the same as (ab)|c

Repetition M*
* is called the Kleene closure of M. M* forms a
regular expression, representing the
concatenation of zero or more M

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Regular Expressions, Abbreviations

Abbreviations
Symbol a
A language just containing a
Alternation abbreviation
Alternation M|N [abc] is the same as a|b|c
M | N is a new regular expression,
[b-h] is the same as [bcdefgh]
where M and N are regular expressions
[b-d01A-C] is the same as [bcd01ABC]
Concatenation M⋅N
M⋅N is a new regular expression, forming the
concatenation of M and N, where M and N are Optional and repetition
regular expressions M? with the meaning (M| ε)
Epsilon ε M+ with the meaning (M⋅M*)
Regular expression ε is the language of an
empty string Any characters represented by a
period .
Repetition M*
* is called the Kleene closure of M. M* forms a
regular expression, representing the
concatenation of zero or more M

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Regular Expressions, Abbreviations

Symbol a Alternation abbreviation


Alternation M|N Optional and repetition
[abc] is the same as a|b|c
Concatenation M⋅N M? with the meaning (M| ε)
[b-h] is the same as [bcdefgh]
Epsilon ε M+ with the meaning (M⋅M*)
[b-d01A-C] is the same as [bcd01ABC]
Repetition M*

Unsigned integer [0-9]+


Keyword while “while”
Unsigned integer,
avoiding octal number 0|[1-9][0-9]* Keyword if “if”
Hexadecimal number,
0[xX][0-9a-fA-F]+ What about tokenizing the string “ifwhile”?
“0x01a” “0XfF31”
Typically, match identifier,
Identifiers, e.g. “foo”, check if keyword
“Bar”, “_foo12”. It is not [_a-zA-Z][_a-zA-Z0-9]*
allowed to start with a What about matching operators + and ++
digit. Two rules to disambiguate: longest
match and rule priority
Part I Part II Part III
Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Part III
Deterministic Finite Automata

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Deterministic Finite Automata (DFA)

1 Finite Automata
A DFA is a machine M, represented as is a 5-tuple:
The language recognized by M is the
set of strings that M accepts. It is
ata M = (S, ⌃, , s0 , F ) written as L(M).

1 Finite Automata
Set of final states
What is the Set of states
L(M )
M = (S, ⌃, , s0 , F )
formalization of the :Q⇥⌃!Q Start state
DFA? Transition
Finite set of L(M )
function Note: deterministic
input symbols because only one
S = {1, 2, 3, 4} (the alphabet) : S ⇥ ⌃ ! S(1) possible output
⌃ = {r, f, o} (2)S= {1,o2, 3, 4} r
f
= {((1, f), 2), ((2, o), 3), ((3, r), 4)} (3)
⌃= {r, f, o}
S = {1, 2, 3, 4}
s0 = 1 (4)
1 2 3 4
F = {4} (5) = {((1,
⌃ = {r, f),f,2),
o}((2, o), 3), (

Part I Part II
s0 = 1 Part=III {((1, f), 2), ((2, o), 3),
Lexical Analysis Regular Expressions sDeterministic
0 = 1 Finite Automata
David Broman

⇥⌃!Q
Deterministic Finite Automata (DFA)
S = {1, 2, 3, 4} (1)
⌃ = {r, f, o} (2)
= {((1, f),
Exercise: Write ((2,ao),
2),down 3), ((3,
regular r), 4)}for
expression (3)
s0a =
lower
1 case identifier (can include an (4) a-z
underscore as first character)
F = {4} (5)
[_a-z][a-z0-9]* _

1 2
S = {1, 2} (6)
⌃ = { , a, b, . . . , z, 0, 1, . . . , 9} (7)
a-z
= {((1, a), 2), . . . , (2, 0), 2), . . . , } (8)
s0 = 1 (9) 0-9
F = {2} (10)
Abbreviation. Represents one
transition line for each character

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Deterministic Finite Automata (DFA)


Task 1. Suppose your alphabet contains the
characters
Task 2. Write out
N, a, v, /
the DFA
We assume that N represents a new line
a
Write down the regular expression that matches a
line comment, starting with “//“
//[av/]*N / /
N
Task 3. Given the DFA M to the
right, which of the the following 1 2 3 4
strings is a string in L(M)?

1 “aavaN” Is rejected by M
2. “//avvN” Is accepted by M v /

This is how you would write it in


"//" [^'\n']* '\n'
ocamlflex. ^'\n' means not a new line.

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Reading Guidelines - Module 1

• Compiler overview: C&T Chapters 1.1 - 1.4


• Lexing and scanning: C&T Chapters 2.1 - 2.5
• Parsing: C&T Chapters 3.1 - 3.5. Focus on 3.2 and 3.3.

C&T = the course book by Cooper and Torczon

Reading Guidelines
See the course webpage
for more information.

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Soon time for coffee…

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata
David Broman

Conclusions

Some key take away points:

• Lexical analysis (scanning) is the first phase in the


front end. The output is a stream of tokens

• Regular expressions can be used when specifying the


scanner.

• A deterministic finite automata (DFA) may be used


when implementing the lexical analysis.

Thanks for listening!

Part I Part II Part III


Lexical Analysis Regular Expressions Deterministic Finite Automata

You might also like