Ch2+3 Compiler
1
COP5621 Compiler Construction
Copyright Robert van Engelen, Florida State University, 2007-2017
Lexical Analysis
• Besides identifying lexemes, the lexical analyzer performs other tasks.
One such task is stripping out comments and whitespace (blank, newline,
tab, and perhaps other characters that are used to separate tokens in the
input).
• Another task is correlating error messages generated by the compiler
with the source program. For instance, the lexical analyzer may keep
track of the number of newline characters seen, so it can associate a line
number with each error message.
• In some compilers, the lexical analyzer makes a copy of the source
program with the error messages inserted at the appropriate positions.
• If the source program uses a macro-preprocessor, the expansion of
macros may also be performed by the lexical analyzer.
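A minimal sketch of the first two tasks, assuming "//" starts a comment that runs to the end of the line (the slides do not fix a comment syntax). The hypothetical helper significant_chars drops whitespace and comments and tags every remaining character with its line number, so an error message can refer to the right line.

```python
def significant_chars(source):
    """Return (character, line_number) pairs with whitespace and comments removed."""
    line = 1          # number of newlines seen so far, plus one
    i = 0
    out = []
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1                      # count newlines for error messages
            i += 1
        elif ch in " \t":
            i += 1                         # strip blanks and tabs
        elif source.startswith("//", i):   # assumed comment syntax
            while i < len(source) and source[i] != "\n":
                i += 1                     # skip to end of line, keep the newline
        else:
            out.append((ch, line))
            i += 1
    return out
```

For example, significant_chars("a // c\n\tb") keeps only a (line 1) and b (line 2).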
2
The Reason Why Lexical Analysis is a
Separate Phase
• Simplifies the design of the compiler
• For example, a parser that had to deal with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and whitespace
have already been removed by the lexical analyzer.
• Improves portability
• Non-standard symbols and alternate character encodings
can be normalized in the lexical analyzer (e.g. UTF-8, trigraphs)
3
Interaction of the Lexical Analyzer with the
Parser
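[Figure: the lexical analyzer reads the source program and, on each
"get next token" request from the parser, returns a token and its
attribute tokenval. Both the lexical analyzer and the parser can report
errors, and both access the symbol table.]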
4
Attributes of Tokens
<id, “y”> <assign, > <num, 31> <‘+’, > <num, 28> <‘*’, > <id, “x”>
[Figure: each token in the stream above (the parser's lookahead) is
passed to the parser together with tokenval (the token attribute).]
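A rough sketch of how such <token, tokenval> pairs could be produced. The input string "y := 31 + 28*x", the ":=" spelling of the assignment operator, and the names TOKEN_SPEC and tokenize are assumptions made for illustration; only the token names id, assign and num come from the slide.

```python
import re

# Hypothetical token patterns; ":=" as the assignment operator is an assumption.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r":="),
    ("op",     r"[+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, tokenval) pairs; tokenval is None when unused."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                          # whitespace produces no token
        if kind == "num":
            tokens.append(("num", int(lexeme)))   # attribute: numeric value
        elif kind == "id":
            tokens.append(("id", lexeme))         # attribute: the lexeme
        elif kind == "op":
            tokens.append((lexeme, None))         # the operator is its own token
        else:
            tokens.append((kind, None))
    return tokens
```

Calling tokenize("y := 31 + 28*x") yields the seven pairs listed above, with None standing in for an empty attribute.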
5
Tokens, Patterns, and Lexemes
6
Conditions:
7
Lexical Errors
8
• The simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input
until the lexical analyzer can find a well-formed token at the
beginning of what input is left. This recovery technique may
confuse the parser, but in an interactive computing
environment it may be quite adequate.
• Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
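Panic-mode recovery can be sketched as follows. The token set in TOKEN and the function name scan_with_panic_mode are hypothetical; the point is only the inner loop that deletes characters until a well-formed token can start at the current position.

```python
import re

# Hypothetical token set: numbers, identifiers, and a few operators.
TOKEN = re.compile(r"\d+|[A-Za-z][A-Za-z0-9]*|[-+*/()=]")

def scan_with_panic_mode(text):
    """Return (tokens, skipped), where skipped lists the deleted character runs."""
    tokens, skipped = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: delete successive characters until a token
            # (or whitespace) begins at the current position.
            start = pos
            while (pos < len(text) and not text[pos].isspace()
                   and not TOKEN.match(text, pos)):
                pos += 1
            skipped.append(text[start:pos])
    return tokens, skipped
```

On the input "x = 3 @# + y" the sketch deletes "@#" and still delivers the tokens x, =, 3, +, y to the parser.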
9
Specification of Patterns for Tokens:
Definitions
• An alphabet Σ is a finite set of symbols. Typical examples of symbols
are letters, digits, and punctuation. The set {0,1} is the binary alphabet.
ASCII is an important example of an alphabet; it is used in many
software systems.
• A string s is a finite sequence of symbols from Σ (also called a sentence or word).
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
10
Specification of Patterns for Tokens: String
Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
s^0 = ε
s^i = s^(i-1) s for i > 0
note that sε = εs = s
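The two string operations map directly onto Python strings. In this small sketch the names EPSILON and power are chosen here for illustration, and the empty Python string stands in for ε.

```python
EPSILON = ""  # the empty string plays the role of ε

def power(s, i):
    """String exponentiation: s^0 = ε and s^i = s^(i-1) s for i > 0."""
    return EPSILON if i == 0 else power(s, i - 1) + s

x, y = "ab", "c"
xy = x + y                 # concatenation of x and y
cubed = power("ab", 3)     # ab concatenated with itself three times
```

Concatenating with EPSILON on either side leaves a string unchanged, which mirrors the note on the slide.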
11
Specification of Patterns for Tokens:
Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1) L
• Kleene closure
L* = ∪ L^i for i = 0,…,∞
• Positive closure
L+ = ∪ L^i for i = 1,…,∞
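The language operations can be tried out on finite languages represented as Python sets of strings. This is only a sketch: the closures are infinite in general, so the hypothetical max_power parameter cuts the union off after finitely many powers.

```python
def union(L, M):
    """L ∪ M"""
    return L | M

def concat(L, M):
    """LM = {xy | x in L and y in M}"""
    return {x + y for x in L for y in M}

def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L"""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene(L, max_power):
    """Finite approximation of L*: union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive(L, max_power):
    """Finite approximation of L+: union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result
```

With L = {"a", "b"}, kleene(L, 2) contains the empty string and every string over a and b of length at most two; positive(L, 2) is the same set without the empty string.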
12
Specification of Patterns for Tokens: Regular
Expressions
• Basis symbols:
• ε is a regular expression denoting language {ε}
• a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
• r|s is a regular expression denoting L(r) ∪ M(s)
• rs is a regular expression denoting L(r)M(s)
• r* is a regular expression denoting L(r)*
• (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called
a regular set
13
• As defined, regular expressions often contain unnecessary pairs of
parentheses. We may drop certain pairs of parentheses if we adopt
the conventions that:
• a) The unary operator * has highest precedence and is left
associative.
• b) Concatenation has second highest precedence and is left
associative.
• c) | has lowest precedence and is left associative.
• Under these conventions, for example, we may replace
the regular expression (a)|((b)*(c)) by a|b*c. Both
expressions denote the set of strings that are either a
single a or are zero or more b's followed by one c.
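Python's re module uses the same precedence for *, concatenation and |, so the equivalence can be spot-checked there. This is a quick check on a handful of sample strings chosen here, not a proof.

```python
import re

explicit = re.compile(r"(a)|((b)*(c))")   # fully parenthesized form
implicit = re.compile(r"a|b*c")           # relies on the precedence conventions

samples = ["a", "c", "bc", "bbbc", "", "ab", "ac", "b", "abc"]

# For each sample, record whether each pattern matches the whole string.
agreement = {
    s: (bool(explicit.fullmatch(s)), bool(implicit.fullmatch(s)))
    for s in samples
}
```

Both patterns accept a, c, bc and bbbc and reject the rest. The string ac is a useful probe: it would be accepted if | bound tighter than concatenation, that is, if a|b*c meant (a|b*)c.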
14
• Example 3.4: Let Σ = {a, b}.
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of
length two over the alphabet Σ. Another regular expression for the same
language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's,
that is, {ε, a, aa, aaa, . . . }.
4. (a|b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba,
bb, aaa, . . .}.
Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, . . .}, that is, the
string a and all strings consisting of zero or more a's and ending in b.
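The five cases can be checked by brute force: enumerate every string over {a, b} up to some length and keep those that a pattern matches in full. The helper language and its max_len cut-off are assumptions of this sketch, and Python's re syntax stands in for the regular expression notation of the slides.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """Strings over alphabet, up to max_len long, that pattern matches entirely."""
    regex = re.compile(pattern)
    return {
        "".join(p)
        for n in range(max_len + 1)
        for p in product(alphabet, repeat=n)
        if regex.fullmatch("".join(p))
    }

case1 = language("a|b")
case2 = language("(a|b)(a|b)")
case3 = language("a*", max_len=3)
case4 = language("(a|b)*", max_len=3)
case5 = language("a|a*b", max_len=4)
```

Within the length bound, language("aa|ab|ba|bb") equals case2 and language("(a*b*)*", max_len=3) equals case4, in line with the alternative expressions given in the example.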
15
16
Specification of Patterns for Tokens: Regular
Definitions
• Regular definitions introduce a naming convention
with name-to-regular-expression bindings:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to obtain
an equivalent set of definitions
17
Specification of Patterns for Tokens: Regular
Definitions
• Example:
letter → A|B|…|Z|a|b|…|z
digit → 0|1|…|9
id → letter ( letter|digit )*
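Textual substitution of earlier names can be sketched in a few lines. Writing a defined name in braces, as in {letter}, is a notation assumed here (similar to what Lex uses), and the character classes [A-Za-z] and [0-9] abbreviate the long alternations. DEFINITIONS and expand are hypothetical names.

```python
import re

# Definitions in order; each body may refer only to earlier names.
DEFINITIONS = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]

def expand(definitions):
    """Substitute every earlier name textually, yielding plain regular expressions."""
    expanded = {}
    for name, regex in definitions:
        for earlier, body in expanded.items():
            # Wrap the substituted body in a group so precedence is preserved.
            regex = regex.replace("{" + earlier + "}", "(?:" + body + ")")
        expanded[name] = regex
    return expanded

ID = re.compile(expand(DEFINITIONS)["id"])
```

After expansion, ID.fullmatch("x27") succeeds while ID.fullmatch("9x") does not, since an id must start with a letter.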
18
Specification of Patterns for Tokens:
Notational Shorthand
• The following shorthands are often used:
r+ = rr*
r? = r|ε
[a-z] = a|b|c|…|z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
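The num definition carries over almost symbol for symbol to Python's re syntax, which supports the same +, ? and character-class shorthands. The only changes in this sketch are that the literal dot is escaped and (+|-) is written as the class [+-].

```python
import re

# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?   with digit -> [0-9]
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

def is_num(lexeme):
    """True if the whole lexeme matches the num pattern."""
    return NUM.fullmatch(lexeme) is not None
```

Under this pattern 31, 3.14 and 6.02E+23 are accepted, while "3." and ".5" are rejected because digit+ requires at least one digit on each side of the dot.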
19
• L={a,b} , M={1,2}
L∣M=
LM=
L∗=
L∗M={ε1,ε2,a1,a2,b1,b2,aa1,aa2,ab1,ab2,ba1,ba2,bb1,bb2,…}
L+={a,b,aa,ab,ba,bb,aaa,…}
L**=?????
L|ε =???
L|L∗M={a,b,1,2,a1,a2,b1,b2,aa1,aa2,…}
(L∣M)∗={ε,a,b,1,2,aa,ab,ba,bb,a1,a2,b1,b2,…}
M∗={ε,1,2,11,22,121,211,112,1221,…}
M^0={ε}
20
• L={a,b} , M={1,2}
L∣M={a,b,1,2}
LM={a1,a2,b1,b2}
L∗={ε,a,b,aa,ab,ba,bb,aaa,…}
L∗M={ε1,ε2,a1,a2,b1,b2,aa1,aa2,ab1,ab2,ba1,ba2,bb1,bb2,…}
L+={a,b,aa,ab,ba,bb,aaa,…}
L**=?????
L|ε =???
L|L∗M={a,b,1,2,a1,a2,b1,b2,aa1,aa2,…}
(L∣M)∗={ε,a,b,1,2,aa,ab,ba,bb,a1,a2,b1,b2,…}
M∗={ε,1,2,11,22,121,211,112,1221,…}
M^0={ε}
21