Ch2+3 Compiler

Uploaded by

shahad Channel
Lexical Analysis and Lexical Analyzer Generators


Chapter 2&3

COP5621 Compiler Construction
Copyright Robert van Engelen, Florida State University, 2007-2017
Lexical Analysis

• Besides identifying lexemes, the lexical analyzer performs other tasks.
One is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input).
• Another task is correlating error messages generated by the compiler
with the source program. For instance, the lexical analyzer may keep
track of the number of newline characters seen, so it can associate a line
number with each error message.
• In some compilers, the lexical analyzer makes a copy of the source
program with the error messages inserted at the appropriate positions.
• If the source program uses a macro-preprocessor, the expansion of
macros may also be performed by the lexical analyzer.
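
The first two tasks above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and the comment syntax ('#' to end of line) is an assumption chosen only to keep the example short.

```python
def scan_chunks(source):
    """Return (line_number, chunk) pairs with comments and whitespace removed.

    Assumption: comments run from '#' to the end of the line.
    """
    chunks = []
    line_number = 1
    for line in source.split("\n"):
        code = line.split("#", 1)[0]       # drop the comment, if any
        for chunk in code.split():          # split() discards blanks and tabs
            chunks.append((line_number, chunk))
        line_number += 1                    # one more newline has been consumed
    return chunks


def error_message(line_number, text):
    """Attach a line number to a diagnostic message."""
    return f"line {line_number}: {text}"
```

Because the scanner counts newlines as it goes, any chunk it later rejects can be reported with `error_message(line_number, ...)` without rereading the source.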

The Reason Why Lexical Analysis is a
Separate Phase
• Simplifies the design of the compiler
• For example, a parser that had to deal with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and whitespace
have already been removed by the lexical analyzer.

• Provides efficient implementation
• A separate lexical analyzer allows us to apply specialized techniques that serve only the
lexical task, not the job of parsing. In addition, specialized buffering techniques for reading
input characters can speed up the compiler significantly.

• Improves portability
• Non-standard symbols and alternate character encodings
can be normalized (e.g. UTF8, trigraphs)

Interaction of the Lexical Analyzer with the
Parser

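[Figure: the lexical analyzer reads the source program. On each "get next
token" request from the parser it returns a token and its tokenval. Both the
lexical analyzer and the parser report errors and access the symbol table.]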

Attributes of Tokens

Input: y := 31 + 28*x

The lexical analyzer produces:
<id, “y”> <assign, > <num, 31> <‘+’, > <num, 28> <‘*’, > <id, “x”>

The parser receives each pair as token (the lookahead) and tokenval (the
token attribute).
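
A small Python sketch can reproduce this token stream. It is only an illustration under stated assumptions: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, the patterns cover just the symbols in this one statement, and a missing attribute is represented as `None`.

```python
import re

# (token name, pattern) pairs; "skip" marks whitespace that is discarded.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r":="),
    ("op",     r"[+*]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, tokenval) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue                                # whitespace yields no token
        if kind == "num":
            tokens.append(("num", int(lexeme)))     # attribute: numeric value
        elif kind == "id":
            tokens.append(("id", lexeme))           # attribute: the name itself
        elif kind == "assign":
            tokens.append(("assign", None))         # no attribute needed
        else:
            tokens.append((lexeme, None))           # '+' and '*' are their own tokens
    return tokens
```

Calling `tokenize("y := 31 + 28*x")` yields seven pairs in the same order as the slide, with `None` standing in for the empty attribute slots.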
Tokens, Patterns, and Lexemes

• A token is a classification of lexical units
• For example: id and num
• Written as <token name, optional attribute value>, with the token name in bold
• Lexemes are the specific character strings that make
up a token
• For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
• For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
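
The relationship between the three terms can be sketched as a lookup from lexeme to token by way of patterns. The dictionary `PATTERNS` and the function `classify` are hypothetical names, and Python regular expressions stand in for the informal pattern descriptions above.

```python
import re

# token name -> pattern describing the lexemes that belong to it
PATTERNS = {
    "id":  r"[A-Za-z][A-Za-z0-9]*",   # letter followed by letters and digits
    "num": r"[0-9]+",                 # non-empty sequence of digits
}


def classify(lexeme):
    """Return the token whose pattern matches the whole lexeme, or None."""
    for token, pattern in PATTERNS.items():
        if re.fullmatch(pattern, lexeme):
            return token
    return None
```

Here `classify("abc")` gives `"id"` and `classify("123")` gives `"num"`, while a string such as `"1abc"` fits neither pattern as a whole.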

Conditions:

• 1. One token for each keyword. The pattern for a
keyword is the same as the keyword itself.
• 2. Tokens for the operators, either individually or in
classes such as the token comparison.
• 3. One token representing all identifiers.
• 4. One or more tokens representing constants, such
as numbers and literal strings.
• 5. Tokens for each punctuation symbol, such as left
and right parentheses, comma, and semicolon.

Lexical Errors

• A lexical analyzer cannot tell whether fi is a misspelling of
the keyword if or an undeclared function identifier. Since fi
is a valid lexeme for the token id, the lexical analyzer must
return the token id to the parser and let some other phase of
the compiler, probably the parser in this case, handle an
error due to transposition of the letters.
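
A minimal sketch of why the lexical analyzer accepts fi: keywords are recognized by exact comparison, and every other word falls through to id. The keyword set and the function name `word_token` are assumptions made for illustration.

```python
KEYWORDS = {"if", "else", "while"}  # assumed keyword set, for illustration only


def word_token(lexeme):
    """Classify a lexeme that has already matched the identifier pattern."""
    if lexeme in KEYWORDS:
        return (lexeme, None)      # one token per keyword
    return ("id", lexeme)          # every other word is an identifier
```

`word_token("fi")` returns `("id", "fi")` without complaint; only a later phase can notice that the surrounding program makes no sense with an identifier in that position.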

• When no token pattern matches a prefix of the remaining input, the
simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input
until the lexical analyzer can find a well-formed token at the
beginning of what input is left. This recovery technique may
confuse the parser, but in an interactive computing
environment it may be quite adequate.
• Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
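
Panic mode recovery, as described above, might look like the following sketch. The token patterns, the function name `tokenize_with_panic_mode`, and the choice to record each deleted run of characters as an error are all assumptions, not something the slides prescribe.

```python
import re

# Assumed token patterns for the demonstration.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<id>[A-Za-z]\w*)|(?P<op>[-+*/=()])")


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors are (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1                              # whitespace separates tokens
            continue
        match = TOKEN.match(text, pos)
        if match:
            tokens.append((match.lastgroup, match.group()))
            pos = match.end()
            continue
        # Panic mode: delete characters until a token can start again.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN.match(text, pos)):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors
```

For the input `"x = 3 $# y"` the sketch deletes `$#`, records it as one error at position 6, and resumes with the identifier `y`.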

Specification of Patterns for Tokens:
Definitions
• An alphabet Σ is a finite set of symbols. Typical examples of symbols
are letters, digits, and punctuation. The set {0,1} is the binary alphabet.
ASCII is an important example of an alphabet; it is used in many
software systems.
• A string s is a finite sequence of symbols from Σ (also called a sentence or word).
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

Specification of Patterns for Tokens: String
Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
s^0 = ε
s^i = s^(i-1) s for i > 0

note that sε = εs = s
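
The recursive definition translates directly into code. In this small sketch the function name `string_power` is hypothetical and the Python empty string `""` plays the role of ε.

```python
def string_power(s, i):
    """Return s^i using s^0 = ε and s^i = s^(i-1) s."""
    if i == 0:
        return ""                          # ε, the empty string
    return string_power(s, i - 1) + s      # append one more copy of s
```

For example, `string_power("ab", 3)` is `"ababab"`, and concatenating `""` on either side of a string leaves it unchanged, matching sε = εs = s.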

Specification of Patterns for Tokens:
Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1) L
• Kleene closure
L* = ∪ (i = 0, …, ∞) L^i
• Positive closure
L+ = ∪ (i = 1, …, ∞) L^i
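
These operations can be tried out on small finite languages represented as Python sets of strings. The sketch below is an approximation under an explicit assumption: the closures are infinite in general, so the hypothetical `max_power` parameter cuts the union off after finitely many powers.

```python
def union(L, M):
    """L ∪ M"""
    return L | M


def concat(L, M):
    """LM = {xy | x in L and y in M}"""
    return {x + y for x in L for y in M}


def power(L, i):
    """L^i, built up from L^0 = {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """Union of L^0 .. L^max_power (a finite approximation of L*)."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """Union of L^1 .. L^max_power (a finite approximation of L+)."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result
```

With `L = {"a", "b"}`, `kleene_closure(L, 2)` contains the empty string and every string of a's and b's up to length two, while `positive_closure(L, 2)` is the same set without the empty string.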
Specification of Patterns for Tokens: Regular
Expressions
• Basis symbols:
• ε is a regular expression denoting language {ε}
• a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
• r|s is a regular expression denoting L(r) ∪ M(s)
• rs is a regular expression denoting L(r)M(s)
• r* is a regular expression denoting L(r)*
• (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called
a regular set
• As defined, regular expressions often contain unnecessary pairs of
parentheses. We may drop certain pairs of parentheses if we adopt
the conventions that:
• a) The unary operator * has highest precedence and is left
associative.
• b) Concatenation has second highest precedence and is left
associative.
• c) | has lowest precedence and is left associative.
• Under these conventions, for example, we may replace
the regular expression (a)|((b)*(c)) by a|b*c. Both
expressions denote the set of strings that are either a
single a or are zero or more b's followed by one c.
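
Python's `re` module follows the same precedence conventions for `*`, concatenation, and `|`, so the claim can be checked by brute force on short strings. The helper `language_of` is a hypothetical name, and the length bound of four is an arbitrary choice for the demonstration.

```python
import re
from itertools import product


def language_of(pattern, alphabet, max_len):
    """Return every string over alphabet, up to max_len long, that pattern matches entirely."""
    regex = re.compile(pattern)
    matched = set()
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            if regex.fullmatch(candidate):
                matched.add(candidate)
    return matched


with_parentheses = language_of(r"(a)|((b)*(c))", "abc", 4)
without_parentheses = language_of(r"a|b*c", "abc", 4)
```

Both sets come out equal: the single string a, plus c preceded by zero to three b's. A bounded check like this is evidence for the equivalence on short strings, not a proof.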

• Example 3.4: Let Σ = {a, b}.
• 1. The regular expression a|b denotes the language {a, b}.
• 2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of
length two over the alphabet Σ. Another regular expression for the same
language is aa|ab|ba|bb.
• 3. a* denotes the language consisting of all strings of zero or more a's,
that is, {ε, a, aa, aaa, …}.
• 4. (a|b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba,
bb, aaa, …}.
Another regular expression for the same language is (a*b*)*.
• 5. a|a*b denotes the language {a, b, ab, aab, aaab, …}, that is, the
string a and all strings consisting of zero or more a's and ending in b.
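
The items of Example 3.4 can be spot-checked with Python's `re` module by enumerating all strings over {a, b} up to a small length. The helper `matching_strings` and the bound of three are assumptions for the sketch, so this confirms the listed members only up to that length.

```python
import re
from itertools import product


def matching_strings(pattern, max_len, alphabet="ab"):
    """Return all strings over alphabet, at most max_len long, fully matched by pattern."""
    regex = re.compile(pattern)
    found = set()
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            if regex.fullmatch(candidate):
                found.add(candidate)
    return found


length_two = matching_strings(r"(a|b)(a|b)", 3)      # item 2
all_strings = matching_strings(r"(a|b)*", 3)         # item 4
alternative = matching_strings(r"(a*b*)*", 3)        # item 4, second form
a_or_a_star_b = matching_strings(r"a|a*b", 3)        # item 5
```

Up to length three, `all_strings` and `alternative` contain the same 15 strings, and `a_or_a_star_b` is {a, b, ab, aab}.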

Specification of Patterns for Tokens: Regular
Definitions
• Regular definitions introduce a naming convention
with name-to-regular-expression bindings:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to obtain
an equivalent set of definitions
Specification of Patterns for Tokens: Regular
Definitions
• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

• Regular definitions cannot be recursive:

digits → digit digits | digit     wrong!
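
Textual substitution of earlier definitions can be sketched as follows. The `{name}` reference syntax is borrowed from Lex-style tools as an assumption, and `DEFINITIONS` and `expand` are hypothetical names. Each definition may only mention names bound before it, which is exactly why a recursive definition cannot be expanded this way.

```python
import re

# Ordered definitions; {name} refers to an earlier definition (assumed syntax).
DEFINITIONS = [
    ("letter", "[A-Za-z]"),
    ("digit",  "[0-9]"),
    ("id",     "{letter}({letter}|{digit})*"),
]


def expand(definitions):
    """Substitute earlier definitions into later ones, in order."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            # Wrap in a non-capturing group so precedence is preserved.
            body = body.replace("{" + earlier + "}", "(?:" + regex + ")")
        expanded[name] = body
    return expanded


ID_PATTERN = expand(DEFINITIONS)["id"]
```

After expansion `ID_PATTERN` contains no remaining `{name}` references and can be handed to `re.fullmatch` directly. A self-referencing entry such as `("digits", "{digit}{digits}|{digit}")` would keep its `{digits}` placeholder, because that name is not yet bound when its own body is processed.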

Specification of Patterns for Tokens:
Notational Shorthand
• The following shorthands are often used:

r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z

• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+ | -)? digit+ )?
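
The same shorthands exist in Python's `re` syntax, so the num definition can be written almost verbatim. In this sketch the names `NUM` and `is_num` are hypothetical, and the dot is escaped because an unescaped `.` matches any character in Python regular expressions.

```python
import re

# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def is_num(lexeme):
    """Return True if the whole lexeme matches the num pattern."""
    return NUM.fullmatch(lexeme) is not None
```

Under this definition `31`, `3.14`, and `6.02E+23` are accepted, while `3.` and `.5` are rejected because digit+ requires at least one digit on each side of the dot.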

• L={a,b} , M={1,2}
L∣M=
LM=
L∗=
L∗M={1,2,a1,a2,b1,b2,aa1,aa2,ab1,ab2,ba1,ba2,bb1,bb2,…}
L+={a,b,aa,ab,ba,bb,aaa,…}
L**=?????
L|ε =???
L|L∗M={a,b,1,2,a1,a2,b1,b2,aa1,aa2,…}
(L∣M)∗={ε,a,b,1,2,aa,ab,ba,bb,a1,a2,b1,b2,…}
M∗={ε,1,2,11,12,21,22,111,…}
M^0={ε}

• L={a,b} , M={1,2}
L∣M={a,b,1,2}
LM={a1,a2,b1,b2}
L∗={ε,a,b,aa,ab,ba,bb,aaa,…}
L∗M={1,2,a1,a2,b1,b2,aa1,aa2,ab1,ab2,ba1,ba2,bb1,bb2,…}
L+={a,b,aa,ab,ba,bb,aaa,…}
L**=?????
L|ε =???
L|L∗M={a,b,1,2,a1,a2,b1,b2,aa1,aa2,…}
(L∣M)∗={ε,a,b,1,2,aa,ab,ba,bb,a1,a2,b1,b2,…}
M∗={ε,1,2,11,12,21,22,111,…}
M^0={ε}
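
One way to explore these sets, including the two open items, is to compute them with a length bound. The sketch below is an assumption-laden illustration: `concat` and `star` are hypothetical helpers that keep only strings of at most `max_len` symbols, so they show finite slices of languages that are infinite in general.

```python
def concat(L, M, max_len):
    """Strings xy with x in L and y in M, keeping only those of length <= max_len."""
    return {x + y for x in L for y in M if len(x + y) <= max_len}


def star(L, max_len):
    """All strings of L* whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more element of L; keep only new ones.
        frontier = concat(frontier, L, max_len) - result
        result |= frontier
    return result


L = {"a", "b"}
M = {"1", "2"}
BOUND = 3

L_union_M = L | M
L_star = star(L, BOUND)
L_star_star = star(L_star, BOUND)
L_or_epsilon = L | {""}
L_or_L_star_M = L | concat(L_star, M, BOUND)
```

Within the bound, `L_star_star` equals `L_star`, `L_or_epsilon` is {ε, a, b}, and `L_or_L_star_M` contains a and b together with the digit-terminated strings such as 1, a1, and ab2. Agreement up to a fixed length suggests the general identities but does not prove them.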
