Specification of Tokens

The document provides an introduction to compilers, focusing on lexical analysis and the specification of tokens using regular expressions. It explains the concepts of strings, languages, and operations on languages, including union, concatenation, and closure. Additionally, it covers regular definitions and their applications in defining languages, such as C identifiers and unsigned numbers.

UNIT I

INTRODUCTION TO COMPILERS

Lexical Analysis
Specification of Tokens
Strings and Languages
• Regular expressions are an important notation
for specifying lexeme patterns
• Alphabet
– Any finite set of symbols
– Examples:
• The set {0, 1} is the binary alphabet
• ASCII
• Unicode
Strings and Languages
• String (sentence or word)
– A string over an alphabet is a finite sequence of symbols
drawn from that alphabet
– Length of a string s
• Written |s|
• The number of occurrences of symbols in s
• Example:
– banana is a string of length six
• The empty string, denoted ε, is the string of length zero
• Language
– Any countable set of strings over some fixed alphabet
• ∅, the empty set
• {ε} , the set containing only the empty string
Strings and Languages
• Parts of strings
– A prefix of string s is any string obtained by
removing zero or more symbols from the end of s
• Example: ban, banana, and ε are prefixes of banana
– A suffix of string s is any string obtained by
removing zero or more symbols from the
beginning of s
• Example: nana, banana, and ε are suffixes of banana
– A substring of s is obtained by deleting any prefix
and any suffix from s
• Example: banana, nan, and ε are substrings of banana
Strings and Languages
– The proper prefixes, suffixes, and substrings of a
string s are those prefixes, suffixes, and
substrings, respectively, of s that are neither ε nor
equal to s itself
– A subsequence of s is any string formed by
deleting zero or more not necessarily consecutive
positions of s
• Example: baan is a subsequence of banana
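The definitions of prefix, suffix, substring, and subsequence above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the helper names (prefixes, suffixes, substrings, proper, is_subsequence) are illustrative assumptions, and the empty Python string "" stands for ε.

```python
def prefixes(s):
    """All prefixes of s: remove zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s: remove zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All substrings of s: delete any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts, s):
    """Keep only the proper parts: neither the empty string nor s itself."""
    return parts - {"", s}


def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so the symbols of t must appear in s in the same order.
    return all(ch in remaining for ch in t)


if __name__ == "__main__":
    print(sorted(prefixes("banana"), key=len))
    print(is_subsequence("baan", "banana"))
```

Modeling each family as a Python set makes the "proper" variants a simple set difference.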
Strings and Languages
• If x and y are strings, then the concatenation of x
and y, denoted xy, is the string formed by
appending y to x
– Example: If x = dog and y = house, then xy = doghouse
• The empty string is the identity under
concatenation
– That is, for any string s, εs = sε = s
• Exponentiation of strings
– Define s⁰ to be ε
– For all i > 0, sⁱ is sⁱ⁻¹s
• s¹ = s⁰s = εs = s, s² = ss, s³ = sss, and so on
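The recursive definition of string exponentiation translates directly into code. This is a small sketch under the assumption that strings are Python str values and ε is ""; the function name power is hypothetical.

```python
def power(s, i):
    """Return s concatenated with itself i times, following the
    recursive definition: s^0 = ε and s^i = s^(i-1) s for i > 0."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return ""                 # s^0 is the empty string ε
    return power(s, i - 1) + s    # s^i = s^(i-1) s


if __name__ == "__main__":
    for i in range(4):
        print(i, repr(power("ab", i)))
```

Python's built-in repetition operator (`"ab" * 3`) gives the same result; the recursion is spelled out only to mirror the definition.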
Operations on Languages
• In lexical analysis, the most important
operations on languages are union,
concatenation, and closure
Operations on Languages
• Union of languages
– Strings are found either in the first language or in the second
language
• Concatenation of languages
– Strings formed by taking a string from the first language and a string
from the second language, and concatenating them
• Kleene closure of a language L
– Denoted by L*
– The set of strings obtained by concatenating L with itself zero or more times
• L⁰ = {ε}
• Lⁱ = Lⁱ⁻¹L
• Positive closure
– Denoted by L+
– Same as Kleene closure, but without the term L⁰
Operations on Languages
• L = {A, B, ..., Z, a, b, ..., z} and D = {0, 1, ..., 9}
– L ∪ D = Set of letters and digits: the language
with 62 strings of length one, each of which
is either one letter or one digit
– LD = Set of 520 strings of length two, each
consisting of one letter followed by one digit
– L⁴ = Set of all 4-letter strings
– L* = Set of all strings of letters, including ε, the
empty string
– L(L ∪ D)* = Set of all strings of letters and digits
beginning with a letter
– D+ = Set of all strings of one or more digits
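The counts quoted above (62 and 520) can be reproduced by modeling languages as Python sets of strings. This is a sketch only: the function names are assumptions, and because L* and L+ are infinite, the closures are approximated by the union of powers up to a chosen bound n.

```python
import string


def concat(A, B):
    """Concatenation of languages: every x in A followed by every y in B."""
    return {x + y for x in A for y in B}


def lang_power(A, i):
    """A^0 = {ε}; A^i = A^(i-1) A."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result


def closure_up_to(A, n):
    """Finite approximation of the Kleene closure: A^0 ∪ A^1 ∪ ... ∪ A^n."""
    result = set()
    for i in range(n + 1):
        result |= lang_power(A, i)
    return result


def positive_closure_up_to(A, n):
    """Finite approximation of the positive closure: A^1 ∪ ... ∪ A^n."""
    result = set()
    for i in range(1, n + 1):
        result |= lang_power(A, i)
    return result


L = set(string.ascii_letters)   # 52 one-letter strings
D = set(string.digits)          # 10 one-digit strings

if __name__ == "__main__":
    print(len(L | D))           # union of letters and digits
    print(len(concat(L, D)))    # one letter followed by one digit
    # L^4 would hold 52**4 strings, so it is not materialized here.
    print(sorted(closure_up_to({"a", "b"}, 2), key=len))
```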
Regular Expressions
• Regular expressions over some alphabet Σ and the
languages that those expressions denote
– BASIS: There are two rules
• ε is a regular expression, and L(ε) = {ε}, the language whose sole
member is the empty string
• If a is a symbol in Σ, then a is a regular expression, and L(a) = {a},
the language with one string, of length one, with a in its one position
– INDUCTION: There are four parts. Suppose r and s are
regular expressions denoting languages L(r) and L(s),
respectively
• (r)|(s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r)) *
• (r) is a regular expression denoting L(r)
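The basis and induction rules above amount to a recursive definition of L(r), which can be sketched as a recursive function. In this illustration, regular expressions are represented as nested tuples and only strings up to a length bound n are produced, since L(r) may be infinite. The tuple encoding and the names EPS, sym, alt, cat, star, and lang are assumptions made for the example.

```python
EPS = ("eps",)                       # the regular expression ε


def sym(a):
    return ("sym", a)                # a single symbol of the alphabet


def alt(r, s):
    return ("alt", r, s)             # (r)|(s)


def cat(r, s):
    return ("cat", r, s)             # (r)(s)


def star(r):
    return ("star", r)               # (r)*


def lang(r, n):
    """Return the strings of L(r) whose length is at most n."""
    kind = r[0]
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if n >= 1 else set()
    if kind == "alt":                # L(r) ∪ L(s)
        return lang(r[1], n) | lang(r[2], n)
    if kind == "cat":                # L(r)L(s), truncated to length n
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x) + len(y) <= n}
    if kind == "star":               # (L(r))*, truncated to length n
        base = lang(r[1], n) - {""}
        result, frontier = {""}, {""}
        while frontier:
            # Extend every newly found string by one more piece of L(r).
            frontier = {x + y for x in frontier for y in base
                        if len(x) + len(y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")


if __name__ == "__main__":
    a, b, c = sym("a"), sym("b"), sym("c")
    print(sorted(lang(alt(a, cat(star(b), c)), 3), key=len))   # a|b*c
```

Parenthesization needs no separate case here because nesting of tuples plays the role of the rule that (r) denotes L(r).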
Regular Expressions
• The unary operator * has highest precedence
and is left associative
• Concatenation has second highest precedence
and is left associative
• | has lowest precedence and is left associative
• Example
– (a)|((b)*(c)) = a|b*c
• Set of strings that are either a single a or are zero or
more b's followed by one c
– (b)* = b* = {ε, b, bb, bbb, …}
– (bc)* ≠ bc*
• (bc)* = {ε, bc, bcbc, …}
• bc* = {b, bc, bcc, …}
Regular Expressions
• Example: Let Σ = {a, b}
– a | b = {a, b}
– (a | b)(a | b) = {aa, ab, ba, bb}
• Another regular expression for the same language is aa | ab | ba | bb
– a* = Set of all strings of zero or more a's
= {ε, a, aa, aaa, ... }
– (a | b)* = Set of all strings of zero or more instances of a or
b
= {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, ... }
• Another regular expression for the same language is (a*b*)*
– a | a*b = {a, b, ab, aab, aaab, ... }
• Set consisting of the string a and all strings of zero or more a's
ending in b
Regular Expressions
• A language that can be defined by a regular
expression is called a regular set.
• If two regular expressions r and s denote the
same regular set, they are said to be
equivalent (r = s)
Regular Expressions
• There are a number of algebraic laws for regular
expressions
• Consider arbitrary regular expressions r, s, and t
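Equivalence of two regular expressions can be probed experimentally by comparing the strings they accept up to some length. The sketch below uses Python's `re.fullmatch` for this. The specific identities checked in the demo (commutativity and associativity of |, associativity of concatenation, distributivity, idempotence of *) are standard textbook laws assumed for illustration rather than copied from these slides, and the function names are hypothetical. A bounded comparison can refute an equivalence but only gives evidence for one; it is not a proof.

```python
import re
from itertools import product


def language(pattern, alphabet="abc", max_len=4):
    """Strings over the alphabet, up to max_len, that the pattern matches fully."""
    found = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                found.add(s)
    return found


def equivalent_up_to(p, q, alphabet="abc", max_len=4):
    """True if p and q accept exactly the same strings up to max_len."""
    return language(p, alphabet, max_len) == language(q, alphabet, max_len)


if __name__ == "__main__":
    checks = [
        ("a|b", "b|a"),                  # | is commutative
        ("a|(b|c)", "(a|b)|c"),          # | is associative
        ("a(bc)", "(ab)c"),              # concatenation is associative
        ("a(b|c)", "ab|ac"),             # concatenation distributes over |
        ("(a*)*", "a*"),                 # * is idempotent
        ("(a|b)(a|b)", "aa|ab|ba|bb"),   # example from the slides
        ("(a|b)*", "(a*b*)*"),           # example from the slides
        ("(bc)*", "bc*"),                # not equivalent
    ]
    for p, q in checks:
        print(f"{p:12} vs {q:12} -> {equivalent_up_to(p, q)}")
```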
Regular Definitions
• If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
d₁ → r₁
d₂ → r₂
...
dₙ → rₙ
where:
1. Each dᵢ is a new symbol, not in Σ and not the same as any
other of the d's, and
2. Each rᵢ is a regular expression over the alphabet
Σ ∪ {d₁, d₂, ..., dᵢ₋₁}
Regular Definitions
• Regular definition for the language of C
identifiers
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ (letter_ | digit)*
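A regular definition can be turned into a single regular expression by substituting each name with its already expanded body, in order. The sketch below does this for the C identifier definition. The `{name}` reference syntax is a Lex-like convention assumed for this example, and the function name expand is hypothetical. Because a body may only refer to names defined earlier, a forward reference is rejected, which mirrors rule 2 of the definition.

```python
import re
import string


def expand(definitions):
    """Expand an ordered list of (name, body) pairs into plain regexes.

    A body may refer to an earlier name as {name}; references to names
    that are not yet defined raise KeyError.
    """
    expanded = {}
    for name, body in definitions:
        if name in expanded:
            raise ValueError(f"duplicate definition: {name}")

        def substitute(match):
            # Wrap in a non-capturing group so precedence is preserved.
            return "(?:" + expanded[match.group(1)] + ")"

        expanded[name] = re.sub(r"\{(\w+)\}", substitute, body)
    return expanded


C_IDENTIFIER = [
    ("letter_", "|".join(string.ascii_letters + "_")),  # A | B | ... | z | _
    ("digit", "|".join(string.digits)),                 # 0 | 1 | ... | 9
    ("id", "{letter_}(?:{letter_}|{digit})*"),
]

ID = re.compile(expand(C_IDENTIFIER)["id"])

if __name__ == "__main__":
    for text in ["count", "_tmp1", "x", "1abc", "a-b", ""]:
        print(repr(text), bool(ID.fullmatch(text)))
```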
Regular Definitions
• Regular definition for unsigned numbers

digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E ( + | - | ε ) digits) | ε
number → digits optionalFraction optionalExponent
• Example: 100.05
Regular Expressions
• One or more instances
– Positive closure (+)
– If r is a regular expression, then (r)+ denotes the language (L(r))+
– r* = r+ | ε and r+ = rr* = r*r
• Zero or one instance
– Operator: ?
• Example: a?.pdf
– r? = r| ε
– L(r?) = L(r) ∪ {ε}
• Character classes
– A regular expression a₁ | a₂ | ... | aₙ, where the aᵢ's are each
symbols of the alphabet, can be replaced by the shorthand [a₁a₂...aₙ]
– When a₁, a₂, ..., aₙ form a logical sequence, it can be replaced by
a₁-aₙ (for example, [a-z] for a | b | ... | z)
Regular Definitions
• Regular definition for the language of C identifiers
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ (letter_ | digit)*

– Using character classes:
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ (letter_ | digit)*
Regular Definitions
• Regular definition for unsigned numbers
digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E ( + | - | ε ) digits) | ε
number → digits optionalFraction optionalExponent

– Using the shorthand operators:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
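The long and shorthand definitions of unsigned numbers should denote the same language. As a rough check, the sketch below writes both in Python's `re` syntax and compares them on a handful of sample strings. The variable names and the sample strings are assumptions for illustration. Note that in Python's syntax the dot must be escaped as `\.` to stand for a literal period, and ε is written as an empty alternative.

```python
import re

# Shorthand form: digits (. digits)? (E [+-]? digits)?
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"
NUMBER = re.compile(rf"{DIGITS}(\.{DIGITS})?(E[+-]?{DIGITS})?")

# Long form, built only from |, concatenation, * and ε (empty alternative).
LONG_DIGIT = "(?:" + "|".join("0123456789") + ")"
LONG_DIGITS = LONG_DIGIT + LONG_DIGIT + "*"                  # digit digit*
LONG_FRACTION = r"(?:\." + LONG_DIGITS + "|)"                # . digits | ε
LONG_EXPONENT = r"(?:E(?:\+|-|)" + LONG_DIGITS + "|)"        # (E (+|-|ε) digits) | ε
LONG_NUMBER = re.compile(LONG_DIGITS + LONG_FRACTION + LONG_EXPONENT)

ACCEPTED = ["100.05", "5280", "6.336E4", "1.89E-4", "0E+0"]
REJECTED = [".5", "1.", "E10", "1e5", "+3", ""]

if __name__ == "__main__":
    for text in ACCEPTED + REJECTED:
        short = bool(NUMBER.fullmatch(text))
        long_ = bool(LONG_NUMBER.fullmatch(text))
        print(f"{text!r:10} shorthand={short} long={long_}")
```

Agreement on these samples is only a spot check of the two forms, not a proof that they are equivalent.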