CPSC 388 - Compiler Design and Construction: Scanners - Regular Expressions

The document provides information about a CPSC 388 compiler design course. It includes announcements about due dates for homework, programming assignments, and reading. It also discusses topics that will be covered, including scanners, regular expressions, finite state automata, and creating a scanner from regular expressions.

Uploaded by

Kashif Raffat
Copyright © All Rights Reserved
CPSC 388 – Compiler Design and Construction
Scanners – Regular Expressions


Announcements
 Last day to Add/Drop Sept 11
 Wipe Down Computer Keyboards and Mice
 ACM programming contest
 Read Chapter 3
 Homework 2 due this Friday
 PROG1 due this Friday
 FSA for Java string constants (Anybody figure it out?)
Homework 1 returned
 Summary – In addition to your summary please answer the following questions in your report:
   Context – What is the context of the research presented in the paper? What new ideas or concepts do you think were presented in the paper?
   Evaluation – How was the research evaluated? What evaluation techniques would you like to see to compare the research to the state of the art before the paper?
   Significance – What is the significance of the research? Do you feel the work is a minor improvement or a major step in the field?
 Grammar / spelling / clarity / support for statements made
Scanner Generator

.jlex file (containing regular expressions) → Scanner Generator → .java file (containing scanner code)

To understand regular expressions, you need to understand finite-state automata.
FSA Formal Definition (5-tuple)
Q – a finite set of states
Σ – The alphabet of the automaton (a finite set of characters used to label edges)
δ – The state transition function: δ(state_i, character) → state_j
q – The start state
F – The set of final states
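The 5-tuple maps directly onto a small data structure. The following is a minimal sketch in Python, assuming a hypothetical DFA that accepts binary strings ending in 1; the class name `DFA` and the state names `saw0` and `saw1` are illustrative and do not come from the course materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    """A deterministic FSA stored as the 5-tuple (Q, Sigma, delta, q, F)."""
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    delta: dict          # maps (state, character) to the next state
    start: str           # q, the start state
    finals: frozenset    # F

    def accepts(self, text: str) -> bool:
        state = self.start
        for ch in text:
            # A missing edge means the automaton is stuck, so reject.
            if (state, ch) not in self.delta:
                return False
            state = self.delta[(state, ch)]
        return state in self.finals


# Hypothetical example: binary strings whose last symbol is 1.
ends_in_one = DFA(
    states=frozenset({"saw0", "saw1"}),
    alphabet=frozenset({"0", "1"}),
    delta={
        ("saw0", "0"): "saw0",
        ("saw0", "1"): "saw1",
        ("saw1", "0"): "saw0",
        ("saw1", "1"): "saw1",
    },
    start="saw0",
    finals=frozenset({"saw1"}),
)

print(ends_in_one.accepts("0101"))  # ends in 1
print(ends_in_one.accepts("10"))    # ends in 0
```

Storing δ as a dictionary keyed by (state, character) keeps the code close to the formal definition; a table indexed by state and character code would be the usual choice in a generated scanner.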
Types of FSA
 Deterministic (DFA)
   No state has more than one outgoing edge with the same label
 Non-Deterministic (NFA)
   States may have more than one outgoing edge with the same label.
   Edges may be labeled with ε, the empty string. The FSA can take an epsilon transition without looking at the current input character.
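One common way to run an NFA is to track the set of all states it could be in at once. The sketch below assumes a hypothetical NFA for (a|b)*ab with one extra ε edge out of the start state; the function names and the use of the empty string as the ε label are choices made for this example only.

```python
EPSILON = ""  # label used for epsilon edges in this sketch


def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon edges."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in delta.get((state, EPSILON), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(delta, start, finals, text):
    """Follow every possible edge in parallel, one input character at a time."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for state in current:
            moved |= set(delta.get((state, ch), ()))
        current = epsilon_closure(moved, delta)
    return bool(current & finals)


# Hypothetical NFA for (a|b)*ab.
# State "0" has two outgoing edges labeled a, which a DFA would not allow.
delta = {
    ("s", EPSILON): {"0"},
    ("0", "a"): {"0", "1"},
    ("0", "b"): {"0"},
    ("1", "b"): {"2"},
}
finals = {"2"}

print(nfa_accepts(delta, "s", finals, "abab"))
print(nfa_accepts(delta, "s", finals, "aba"))
```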
Terms to Know
 Alphabet (Σ) – any finite set of symbols, e.g. binary, ASCII, Unicode
 String – finite sequence of symbols, e.g. 010001, banana, bãër
 Language – any countable set of strings, e.g.
   Empty set
   Well-formed C programs
   English words
Regular Expressions
 An easy way to express a language that is accepted by an FSA
 Rules:
   ε is a regular expression
   Any symbol in Σ is a regular expression
  If r and s are any regular expressions, then so are:
   r|s denotes union, i.e. “r or s”
   rs denotes r followed by s (concatenation)
   (r)* denotes concatenation of r with itself zero or more times (Kleene closure)
   () used for controlling order of operations
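Each rule above says what set of strings an expression denotes, so the rules can be read as a recursive function. The sketch below is one possible rendering, assuming regular expressions are written as nested tuples (a representation invented for this example) and that only strings up to a length bound are collected, since a Kleene closure usually denotes an infinite set.

```python
def language(regex, max_len):
    """Strings of length <= max_len in the language denoted by `regex`.

    `regex` is a nested tuple (an assumed representation):
    ("eps",), ("sym", c), ("alt", r, s), ("cat", r, s), ("star", r).
    """
    kind = regex[0]
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "alt":  # r|s: union of the two languages
        return language(regex[1], max_len) | language(regex[2], max_len)
    if kind == "cat":  # rs: a string from r followed by a string from s
        left = language(regex[1], max_len)
        right = language(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":  # (r)*: zero or more strings from r, concatenated
        base = language(regex[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regex node: {kind}")


a_or_b = ("alt", ("sym", "a"), ("sym", "b"))
print(sorted(language(("star", a_or_b), 2)))
# -> ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```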
Example Regular Expressions
Regular Expression      Corresponding Language
ε                       {“”}
a                       {“a”}
abc                     {“abc”}
a|b|c                   {“a”, “b”, “c”}
(a|b|c)*                {“”, “a”, “b”, “c”, “aa”, “ab”, “ac”, “aaa”, …}
a|b|c|…|z|A|B|…|Z       Any letter
0|1|2|…|9               Any digit
Precedence in Regular Expressions
 * has highest precedence, left associative
 Concatenation has second highest precedence, left associative
 | has lowest precedence, left associative
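The precedence rules can be checked by comparing an expression against fully parenthesized versions of it. The sketch below uses Python's `re` module, which applies the same precedence to these three operators; the helper name `bounded_language` and the sample patterns are made up for this example.

```python
import re
from itertools import product


def bounded_language(pattern, alphabet="abc", max_len=4):
    """All strings over `alphabet`, up to `max_len` long, that the whole pattern matches."""
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            if re.fullmatch(pattern, text):
                found.add(text)
    return found


implicit = "a|bc*"      # relies on the precedence rules
explicit = "a|(b(c*))"  # the grouping those rules imply
wrong = "(a|b)c*"       # what you would get if | bound tighter

print(bounded_language(implicit) == bounded_language(explicit))  # True
print(bounded_language(implicit) == bounded_language(wrong))     # False
```

The second comparison fails because “ac” belongs to (a|b)c* but not to a|bc*.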


More Regular Expression Examples
Regular Expression      Corresponding Language
ε|a|b|ab*               {“”, “a”, “b”, “ab”, “abb”, “abbb”, …}
ab*c                    {“ac”, “abc”, “abbc”, …}
ab*|a*                  {“”, “a”, “ab”, “aa”, “aaa”, “abb”, …}
a(b*|a*)                {“a”, “ab”, “aa”, “abb”, “aaa”, …}
a(b|a)*                 {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”, …}
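The rows of the table above can be checked mechanically. The sketch below uses Python's `re` module as a stand-in for the formal notation; ε has no symbol of its own there, so it is written as an empty alternative. The names `TABLE` and `shortest_members` are illustrative.

```python
import re
from itertools import product

# Each pattern is paired with the sample strings listed for it in the table.
TABLE = {
    "|a|b|ab*": ["", "a", "b", "ab", "abb", "abbb"],
    "ab*c": ["ac", "abc", "abbc"],
    "ab*|a*": ["", "a", "ab", "aa", "aaa", "abb"],
    "a(b*|a*)": ["a", "ab", "aa", "abb", "aaa"],
    "a(b|a)*": ["a", "ab", "aa", "aaa", "aab", "aba"],
}


def shortest_members(pattern, alphabet="abc", max_len=3):
    """Members of the language up to `max_len` characters, shortest first."""
    members = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            if re.fullmatch(pattern, text):
                members.append(text)
    return members


for pattern, listed in TABLE.items():
    # Every string listed in the table should be in the language.
    assert all(re.fullmatch(pattern, s) for s in listed)
    print(pattern, shortest_members(pattern))
```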


You Try
 What is the language described by each Regular Expression?
a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d_1 → r_1
d_2 → r_2
…
d_n → r_n
where:
1. Each d_i is a new symbol not in Σ and not the same as any other of the d’s.
2. Each r_i is a regular expression over Σ ∪ {d_1, d_2, …, d_(i-1)}.
Regular Definitions Example
Example C identifiers:
Σ = ASCII

letter_ → a|b|c|…|z|A|B|C|…|Z|_
digit → 0|1|2|…|9
id → letter_(letter_|digit)*
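A regular definition can be mimicked in code by building each pattern from the names defined before it. The sketch below does this with Python's `re` module, using character classes as shorthand for the long alternations; the variable and function names are illustrative.

```python
import re

# Each definition may use only the names defined above it.
letter_ = "[A-Za-z_]"                       # letter_ -> a|b|...|z|A|B|...|Z|_
digit = "[0-9]"                             # digit   -> 0|1|...|9
ident = f"{letter_}(?:{letter_}|{digit})*"  # id      -> letter_(letter_|digit)*

ID_RE = re.compile(ident)


def is_identifier(text):
    """True if the whole string matches the `id` definition."""
    return ID_RE.fullmatch(text) is not None


print(is_identifier("_tmp1"))  # starts with letter_
print(is_identifier("1abc"))   # starts with a digit
```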
Regular Definitions Example
Example Unsigned Numbers (integer or float):
Σ = ASCII

digit → 0|1|2|…|9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E(+|-| ε)digits)| ε
number → digits optionalFraction optionalExponent
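The same substitution approach works for the number definitions. The sketch below is written for Python's `re` module, with ε expressed as an empty alternative and the special characters `.` and `+` escaped; the names are illustrative.

```python
import re

digit = "[0-9]"                                  # digit  -> 0|1|...|9
digits = f"{digit}{digit}*"                      # digits -> digit digit*
optional_fraction = rf"(?:\.{digits}|)"          # . digits | ε
optional_exponent = rf"(?:E(?:\+|-|){digits}|)"  # (E(+|-|ε)digits) | ε
number = f"{digits}{optional_fraction}{optional_exponent}"

NUMBER_RE = re.compile(number)


def is_number(text):
    """True if the whole string matches the `number` definition."""
    return NUMBER_RE.fullmatch(text) is not None


for sample in ["42", "3.14", "6.02E+23", "1E-9", ".5", "5.", "1e5"]:
    print(sample, is_number(sample))
```

Under these definitions “.5”, “5.” and “1e5” are rejected: a number must start with digits, a fraction needs digits after the point, and the exponent marker is an uppercase E.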
Special Characters in Reg. Exp.
What does each of the following mean?
* – Kleene Closure
| – or
() – grouping
[] – creates a character class
+ – Positive Closure
? – zero or one instance
“” – anything in quotes means itself, e.g. “*”
. – matches any single character (except newline)
\ – used for escape characters (newline, tab, etc.)
^ – matches beginning of a line
$ – matches the end of a line
Extensions to Regular Expressions
 + means one or more occurrences (positive closure)
 ? means zero or one occurrence
 Character classes
   a|r|t can be written [art]
   a|b|…|z can be written [a-z], as long as there is a clear ordering to the characters
   [^a-z] matches any character except a-z
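These extensions add convenience rather than power: each one can be rewritten with the basic operators. The sketch below checks a few of those equivalences on sample strings using Python's `re` module; the helper `agree` and the sample list are made up for this example.

```python
import re


def agree(extended, basic, samples):
    """True if both patterns accept exactly the same strings among `samples`."""
    return all(
        bool(re.fullmatch(extended, s)) == bool(re.fullmatch(basic, s))
        for s in samples
    )


samples = ["", "a", "r", "t", "x", "aa", "art", "A", "7"]

print(agree("a+", "aa*", samples))       # positive closure
print(agree("a?", "a|", samples))        # empty alternative stands in for ε
print(agree("[art]", "a|r|t", samples))  # character class
print(bool(re.fullmatch("[a-z]", "q")), bool(re.fullmatch("[^a-z]", "q")))
```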
Example Using Character Classes
^[^aeiou]*$
Matches any complete line that does not contain a lowercase vowel
How do you tell which meaning of ^ is intended?
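In Python's `re` syntax, which may differ in details from JLex, position decides the meaning: a ^ at the start of the pattern anchors to the beginning of the line, a ^ immediately after [ negates the class, and a ^ anywhere else inside a class is an ordinary character. The sketch below applies the example pattern one line at a time; the function name and sample text are illustrative.

```python
import re

NO_VOWELS = re.compile(r"^[^aeiou]*$")


def vowel_free_lines(text):
    """Lines of `text` that contain no lowercase vowel."""
    return [line for line in text.splitlines() if NO_VOWELS.match(line)]


sample = "rhythm\nbanana\nTV\n\nsky"
print(vowel_free_lines(sample))
# -> ['rhythm', 'TV', '', 'sky']

# Inside a class but not first, ^ is just a character.
print(bool(re.fullmatch("[a^]", "^")))
```

The empty line is accepted because * allows zero occurrences, and “TV” is accepted because the class excludes only lowercase vowels.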
Try It
 Create Character Classes for:
 First ten letters (up to “j”)
 Lowercase consonants
 Digits in hexadecimal
 Create Regular Expressions for:
 Case Insensitive keyword such as SELECT (or Select or SeLeCt) in SQL
 Java string constants
 Any string of whitespace characters
Creating a Scanner
 Create a set of regular expressions, one for each token to be recognized
 Convert the regular expressions into one combined DFA
 Run the DFA over the input character stream
   The longest matching regular expression is selected
   If there is a tie, use the first matching regular expression
 Attach code to run when a regular expression matches
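The longest-match and first-rule-wins policies can be sketched without building the combined DFA. The example below is a simplification: it tries each rule's regular expression separately with Python's `re` module instead of running one automaton, and the token names and rules are hypothetical. The patterns are chosen so that each one's greedy match is also its longest match.

```python
import re

# Hypothetical token rules in priority order (earlier rules win ties).
RULES = [
    ("IF", r"if"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUM", r"[0-9]+"),
    ("ASSIGN", r"="),
    ("EQ", r"=="),
    ("WS", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def scan(text):
    """Return (token, lexeme) pairs using longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Strict > keeps the earlier rule when two matches are equally long.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"illegal character {text[pos]!r} at offset {pos}")
        # The "attached code" here is minimal: skip whitespace, record the rest.
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(scan("if x == 42"))
print(scan("iffy = 1"))
```

In the first input, “if” matches both IF and ID with the same length, so the earlier rule IF wins; in the second, ID matches the longer lexeme “iffy” and is selected instead.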
