Specification of Tokens
INTRODUCTION TO COMPILERS
Lexical Analysis
Strings and Languages
• Regular expressions are an important notation
for specifying lexeme patterns
• Alphabet
– Any finite set of symbols
– Examples:
• The set {0, 1} is the binary alphabet
• ASCII
• Unicode
• String (sentence or word)
– A string over an alphabet is a finite sequence of symbols
drawn from that alphabet
– Length of a string s
• Written |s|
• The number of occurrences of symbols in s
• Example:
– banana is a string of length six
• The empty string, denoted ε, is the string of length zero
• Language
– Any countable set of strings over some fixed alphabet
• ∅, the empty set
• {ε} , the set containing only the empty string
• Parts of strings
– A prefix of string s is any string obtained by
removing zero or more symbols from the end of s
• Example: ban, banana, and ε are prefixes of banana
– A suffix of string s is any string obtained by
removing zero or more symbols from the
beginning of s
• Example: nana, banana, and ε are suffixes of banana
– A substring of s is obtained by deleting any prefix
and any suffix from s
• Example: banana, nan, and ε are substrings of banana
– The proper prefixes, suffixes, and substrings of a
string s are those prefixes, suffixes, and
substrings, respectively, of s that are neither ε nor
equal to s itself
– A subsequence of s is any string formed by
deleting zero or more not necessarily consecutive
positions of s
• Example: baan is a subsequence of banana
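The definitions of prefix, suffix, substring, and subsequence can be sketched in a few lines of Python. This is only an illustration: the function names (prefixes, suffixes, substrings, proper, is_subsequence) are made up for this example, and the empty Python string "" stands in for ε.

```python
def prefixes(s):
    """All prefixes of s: remove zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s: remove zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All substrings of s: delete any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the parts that are neither the empty string nor s itself."""
    return {p for p in parts if p != "" and p != s}


def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so symbols must appear in order.
    return all(ch in remaining for ch in t)


print(sorted(prefixes("banana"), key=len))
print(is_subsequence("baan", "banana"))
```

Note that every prefix and every suffix is also a substring, and every substring is also a subsequence, but not the other way around (baan is a subsequence of banana without being a substring).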
• If x and y are strings, then the concatenation of x
and y, denoted xy, is the string formed by
appending y to x
– Example: If x = dog and y = house, then xy = doghouse
• The empty string is the identity under
concatenation
– That is, for any string s, εs = sε = s
• Exponentiation of strings
– s^0 is defined to be ε
– For all i > 0, s^i is s^(i-1)s
• s^1 = s^0 s = εs = s, s^2 = ss, s^3 = sss, and so on
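As a small sketch, concatenation and exponentiation of strings map directly onto Python string operations. The helper name power is hypothetical, and "" plays the role of ε.

```python
def power(s, i):
    """s^i by the recursive definition: s^0 = ε and s^i = s^(i-1) s."""
    if i == 0:
        return ""  # the empty string ε
    return power(s, i - 1) + s


x, y = "dog", "house"
xy = x + y  # concatenation: append y to x

# ε is the identity under concatenation: εs = sε = s
s = "banana"
identity_holds = ("" + s == s) and (s + "" == s)

print(xy)
print(power("ab", 3))
```

Python's built-in repetition operator gives the same result as the recursive definition, so power(s, i) == s * i for every i >= 0.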
Operations on Languages
• In lexical analysis, the most important
operations on languages are union,
concatenation, and closure
• Union of languages
– Strings are found either in the first language or in the second
language
• Concatenation of languages
– Strings formed by taking a string from the first language and a string
from the second language, and concatenating them
• Kleene closure of a language L
– Denoted by L*
– The set of strings obtained by concatenating L zero or more times
• L^0 = {ε}
• L^i = L^(i-1)L
• L* = L^0 ∪ L^1 ∪ L^2 ∪ ···
• Positive closure
– Denoted by L+
– Same as Kleene closure, but without the term L^0
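The three operations can be sketched on finite languages represented as Python sets of strings. Because L* is infinite whenever L contains a non-empty string, the closure functions below compute only a finite approximation up to a chosen power. All function names and the max_power parameter are assumptions made for this illustration.

```python
def union(L, M):
    """Strings that are in L or in M."""
    return L | M


def concat(L, M):
    """Every string of L followed by every string of M."""
    return {x + y for x in L for y in M}


def lang_power(L, i):
    """L^i, using L^0 = {ε} and L^i = L^(i-1) L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene(L, max_power):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= lang_power(L, i)
    return result


def positive(L, max_power):
    """Finite approximation of L+: the same union without the term L^0."""
    result = set()
    for i in range(1, max_power + 1):
        result |= lang_power(L, i)
    return result


print(sorted(kleene({"a", "b"}, 2), key=lambda w: (len(w), w)))
```

One detail worth noticing: L+ omits the term L^0, but it still contains ε whenever L itself contains ε.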
• L = {A, B, ..., Z, a, b, ..., z} and D = {0, 1, ..., 9}
– L ∪ D = Set of letters and digits: the language
with 62 strings of length one, each of which
is either one letter or one digit
– LD = Set of 520 strings of length two, each
consisting of one letter followed by one digit
– L^4 = Set of all 4-letter strings
– L* = Set of all strings of letters, including ε, the
empty string
– L(L ∪ D)* = Set of all strings of letters and digits
beginning with a letter
– D+ = Set of all strings of one or more digits
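The counts in this example are easy to check mechanically. The sketch below builds L and D from Python's string module and tests membership in the two infinite languages with simple predicates. The names letters_or_digits, in_L_LD_star, and in_D_plus are invented for this illustration.

```python
import string

L = set(string.ascii_letters)  # {A, ..., Z, a, ..., z}
D = set(string.digits)         # {0, ..., 9}

letters_or_digits = L | D                    # L ∪ D
LD = {x + y for x in L for y in D}           # one letter followed by one digit
L4_count = len(L) ** 4                       # size of L^4, counted rather than enumerated


def in_L_LD_star(w):
    """Membership in L(L ∪ D)*: letters and digits, beginning with a letter."""
    return len(w) >= 1 and w[0] in L and all(c in letters_or_digits for c in w[1:])


def in_D_plus(w):
    """Membership in D+: one or more digits."""
    return len(w) >= 1 and all(c in D for c in w)


print(len(letters_or_digits), len(LD), L4_count)
```

The language L(L ∪ D)* is the same shape as an identifier without underscores, which is why this example is a natural lead-in to token specification.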
Regular Expressions
• Regular expressions over some alphabet and the
languages that those expressions denote
– BASIS: There are two rules
• ε is a regular expression, and L(ε) = {ε} , the language whose sole
member is the empty string
• If a is a symbol in Σ, then a is a regular expression, and L(a) = {a},
the language with one string, of length one, with a in its one position
– INDUCTION: There are four parts. Suppose r and s are
regular expressions denoting languages L(r) and L(s),
respectively
• (r)|(s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
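The basis and induction rules translate almost line by line into a recursive function over a small syntax tree. The sketch below is an illustration only: the class names (Eps, Sym, Alt, Cat, Star) and the function lang are hypothetical, and because L(r) can be infinite, lang returns only the strings of L(r) whose length is at most n. The parenthesis rule needs no class of its own, since grouping is already expressed by the shape of the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:            # the regular expression ε
    pass


@dataclass(frozen=True)
class Sym:            # a single symbol a of the alphabet
    a: str


@dataclass(frozen=True)
class Alt:            # (r)|(s)
    r: object
    s: object


@dataclass(frozen=True)
class Cat:            # (r)(s)
    r: object
    s: object


@dataclass(frozen=True)
class Star:           # (r)*
    r: object


def lang(r, n):
    """Return the strings of L(r) that have length at most n."""
    if isinstance(r, Eps):
        return {""}                                   # L(ε) = {ε}
    if isinstance(r, Sym):
        return {r.a} if n >= 1 else set()             # L(a) = {a}
    if isinstance(r, Alt):
        return lang(r.r, n) | lang(r.s, n)            # L(r) ∪ L(s)
    if isinstance(r, Cat):                            # L(r)L(s)
        return {x + y for x in lang(r.r, n) for y in lang(r.s, n)
                if len(x + y) <= n}
    if isinstance(r, Star):                           # (L(r))*
        base = lang(r.r, n)
        result = {""}
        frontier = {""}
        while frontier:  # keep appending until no new short string appears
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression node: {r!r}")


# a|b*c written out as a tree
expr = Alt(Sym("a"), Cat(Star(Sym("b")), Sym("c")))
print(sorted(lang(expr, 3), key=lambda w: (len(w), w)))
```

The length bound is what makes the Star case terminate: only finitely many strings of length at most n exist, so the frontier eventually becomes empty.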
• The unary operator * has highest precedence
and is left associative
• Concatenation has second highest precedence
and is left associative
• | has lowest precedence and is left associative
• Example
– (a)|((b)*(c)) = a|b*c
• Set of strings that are either a single a or are zero or more b's followed by one c
– (b)* = b* = {ε, b, bb, bbb, …}
– (bc)* ≠ bc*
• (bc)* = {ε, bc, bcbc, …}
• bc* = {b, bc, bcc, …}
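The precedence rules can be checked experimentally with Python's re module, which uses the same relative precedence for *, concatenation, and |. This is a sketch under one assumption worth stating: Python patterns support many extensions beyond textbook regular expressions, and only the three basic operators are used here. The helper name matches is made up for this example.

```python
import re


def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None


explicit = r"(a)|((b)*(c))"   # fully parenthesized
minimal = r"a|b*c"            # parentheses dropped using the precedence rules

samples = ["", "a", "b", "c", "bc", "bbc", "ab", "abc", "bcbc", "bcc"]
same_on_samples = all(matches(explicit, s) == matches(minimal, s) for s in samples)

# (bc)* and bc* denote different languages
print(matches(r"(bc)*", "bcbc"), matches(r"bc*", "bcbc"))
print(matches(r"(bc)*", "bcc"), matches(r"bc*", "bcc"))
```

Agreement on a handful of sample strings is evidence rather than proof that two expressions are equivalent, but a single string on which they disagree is enough to show that they are not.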
• Example: Let Σ = {a, b}
– a | b = {a, b}
– (a | b)(a | b) = {aa, ab, ba, bb}
• Another regular expression for the same language is aa | ab | ba | bb
– a* = Set of all strings of zero or more a's
= {ε, a, aa, aaa, ... }
– (a | b)* = Set of all strings of zero or more instances of a or b
= {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, ... }
• Another regular expression for the same language is (a*b*)*
– a | a*b = {a, b, ab, aab, aaab, ... }
• Set consisting of the string a and all strings of zero or more a's ending in b
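The sets listed in this example can be reproduced by enumerating every string over {a, b} up to a small length and keeping those that a pattern matches in full. The function name language and its parameters are assumptions for this sketch, and Python's re module stands in for a textbook regular expression matcher.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """List the strings over alphabet, shortest first, that pattern matches."""
    found = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            if re.fullmatch(pattern, w):
                found.append(w)
    return found


print(language(r"(a|b)(a|b)"))
print(language(r"a*"))
print(language(r"a|a*b"))

# Two different expressions for the same language, compared up to length 3
print(language(r"(a|b)*") == language(r"(a*b*)*"))
```

Enumerating by increasing length mirrors the way the slides list each language, starting with ε and the one-symbol strings.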
• A language that can be defined by a regular
expression is called a regular set.
• If two regular expressions r and s denote the
same regular set, they are said to be
equivalent (r = s)
• There are a number of algebraic laws for regular
expressions
• Consider arbitrary regular expressions r, s, and t
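Standard identities of this kind include commutativity and associativity of |, associativity of concatenation, distributivity of concatenation over |, ε as the identity for concatenation, and idempotence of *. The sketch below spot-checks them for one arbitrary choice of r, s, and t by comparing the two sides on every string over {a, b, c} up to length 4. The helper same_language, the sample expressions, and the length bound are all assumptions for this illustration, and a bounded comparison is evidence for a law, not a proof of it.

```python
import re
from itertools import product


def same_language(p, q, alphabet="abc", max_len=4):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            if bool(re.fullmatch(p, w)) != bool(re.fullmatch(q, w)):
                return False
    return True


# Hypothetical sample expressions; any others could be substituted.
r, s, t = "(?:a)", "(?:b*)", "(?:ab|c)"
EPS = "(?:)"  # an empty group plays the role of ε

laws = [
    ("r|s = s|r",         f"{r}|{s}",         f"{s}|{r}"),
    ("r|(s|t) = (r|s)|t", f"{r}|(?:{s}|{t})", f"(?:{r}|{s})|{t}"),
    ("r(st) = (rs)t",     f"{r}(?:{s}{t})",   f"(?:{r}{s}){t}"),
    ("r(s|t) = rs|rt",    f"{r}(?:{s}|{t})",  f"{r}{s}|{r}{t}"),
    ("(s|t)r = sr|tr",    f"(?:{s}|{t}){r}",  f"{s}{r}|{t}{r}"),
    ("εr = r",            f"{EPS}{r}",        r),
    ("rε = r",            f"{r}{EPS}",        r),
    ("r* = (r|ε)*",       f"{r}*",            f"(?:{r}|{EPS})*"),
    ("r** = r*",          f"(?:{r}*)*",       f"{r}*"),
]

for name, lhs, rhs in laws:
    print(name, same_language(lhs, rhs))
```

Note that concatenation is not commutative: rs and sr generally denote different languages, which the same helper can confirm.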
Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d_1 → r_1
d_2 → r_2
...
d_n → r_n
where:
1. Each d_i is a new symbol, not in Σ and not the same as any
other of the d's, and
2. Each r_i is a regular expression over the alphabet
Σ ∪ {d_1, d_2, ..., d_(i-1)}
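Because each r_i may mention only earlier names, a regular definition can always be expanded into a single regular expression by substituting definitions in order. The sketch below shows one way to do that. The function expand and the {name} placeholder convention are assumptions for this illustration (they rely on Python's str.format, so the templates must not contain literal braces), and the resulting patterns are Python re patterns.

```python
import re


def expand(definitions):
    """Expand an ordered list of (name, template) pairs into patterns.

    A template refers to an earlier name d as {d}. Referring to a name that
    has not been defined yet raises KeyError, which enforces rule 2.
    """
    env = {}
    for name, template in definitions:
        if name in env:
            raise ValueError(f"{name} is defined twice")  # enforces rule 1
        # Wrap in a group so the substituted text keeps its own precedence.
        env[name] = "(?:" + template.format(**env) + ")"
    return env


env = expand([
    ("digit", "[0-9]"),
    ("digits", "{digit}{digit}*"),
    ("number", r"{digits}(?:\.{digits})?"),
])

print(re.fullmatch(env["number"], "3.14") is not None)
```

Wrapping each substituted definition in a group matters: without it, a definition such as a|b substituted into x{d}y would be read as xa|by.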
• Regular definition for the language of C
identifiers
letter_ → A | B | ··· | Z | a | b | ··· | z | _
digit → 0 | 1 | ··· | 9
id → letter_ (letter_ | digit)*
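This regular definition can be transcribed into a Python pattern by spelling out each alternation and substituting letter_ and digit into id. The variable and function names are chosen for this sketch only. Note that the pattern describes the shape of an identifier and nothing more: reserved words such as while match it as well, and telling them apart is a separate step.

```python
import re
import string

# letter_ → A | B | ... | Z | a | b | ... | z | _
letter_ = "(?:" + "|".join(string.ascii_letters + "_") + ")"
# digit → 0 | 1 | ... | 9
digit = "(?:" + "|".join(string.digits) + ")"
# id → letter_ (letter_ | digit)*
id_pattern = re.compile(f"{letter_}(?:{letter_}|{digit})*")


def is_identifier(s):
    """True if the whole string s matches the id definition."""
    return id_pattern.fullmatch(s) is not None


for candidate in ["count", "_tmp1", "9lives", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```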
• Regular definition for unsigned numbers
digit → 0 | 1 | ··· | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E (+ | - | ε) digits) | ε
number → digits optionalFraction optionalExponent
• Example: 100.05
Regular Expressions
• One or more instances
– Positive closure (+)
– If r is a regular expression, then (r)+ denotes the language (L(r))+
– r* = r+ | ε and r+ = rr* = r*r
• Zero or one instance
– Operator: ?
– r? = r | ε
– L(r?) = L(r) ∪ {ε}
• Character classes
– A regular expression a_1 | a_2 | ··· | a_n, where the a_i's are each
symbols of the alphabet, can be replaced by the shorthand [a_1a_2···a_n]
– When a_1, a_2, ..., a_n form a logical sequence, it can be replaced by [a_1-a_n];
for example, [a-z] is shorthand for a | b | ··· | z
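Each shorthand is only an abbreviation for an expression built from the basic operators, and Python's re module happens to support the same three notations. The sketch below compares every shorthand with its expansion on all strings over {a, b, c} up to length 3. The helper accepts, the sample patterns, and the length bound are choices made for this illustration.

```python
import re
from itertools import product


def accepts(pattern, w):
    """True if the whole string w is in the language of pattern."""
    return re.fullmatch(pattern, w) is not None


words = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]

# (shorthand, equivalent expression using only |, concatenation, * and ε)
pairs = [
    ("a+", "aa*"),         # r+ = rr*
    ("a+", "a*a"),         # r+ = r*r
    ("a*", "(?:a+|)"),     # r* = r+ | ε  (an empty alternative stands for ε)
    ("a?", "(?:a|)"),      # r? = r | ε
    ("[abc]", "a|b|c"),    # character class
    ("[a-c]", "a|b|c"),    # range form of the same class
]

for short, full in pairs:
    agree = all(accepts(short, w) == accepts(full, w) for w in words)
    print(short, "=", full, agree)
```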
Regular Definitions
• Regular definition for the language of C identifiers
letter_ → A | B | ··· | Z | a | b | ··· | z | _
digit → 0 | 1 | ··· | 9
id → letter_ (letter_ | digit)*
• The same definition using character classes
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ (letter_ | digit)*
• Regular definition for unsigned numbers
digit → 0 | 1 | ··· | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E (+ | - | ε) digits) | ε
number → digits optionalFraction optionalExponent
• The same definition using the shorthand operators
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
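The shorthand form of the definition carries over to a Python pattern almost symbol for symbol. In this sketch the names digit, digits, number_pattern, and is_number are chosen for illustration. The one change needed is that the dot must be escaped, because an unescaped . in a Python pattern matches any character.

```python
import re

digit = r"[0-9]"                                            # digit → [0-9]
digits = rf"{digit}+"                                       # digits → digit+
# number → digits (. digits)? (E [+-]? digits)?
number_pattern = re.compile(rf"{digits}(?:\.{digits})?(?:E[+-]?{digits})?")


def is_number(s):
    """True if the whole string s matches the number definition."""
    return number_pattern.fullmatch(s) is not None


for candidate in ["42", "100.05", "6.02E+23", "1E5", ".5", "5.", "1e5"]:
    print(repr(candidate), is_number(candidate))
```

As written, the definition requires at least one digit on both sides of the decimal point and accepts only an uppercase E, so strings such as .5, 5., and 1e5 are rejected.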