Ch2+3 Compiler

Uploaded by

shahad Channel

Lexical Analysis and Lexical Analyzer Generators

Chapter 2&3

COP5621 Compiler Construction
Copyright Robert van Engelen, Florida State University, 2007-2017

Lexical Analysis

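• Because the lexical analyzer is the part of the compiler that reads the source text, it may perform tasks other than identifying tokens. One such task is stripping out comments and whitespace (blanks, newlines, tabs, and perhaps other characters that are used to separate tokens in the input).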
• Another task is correlating error messages generated by the compiler
with the source program. For instance, the lexical analyzer may keep
track of the number of newline characters seen, so it can associate a line
number with each error message.
• In some compilers, the lexical analyzer makes a copy of the source
program with the error messages inserted at the appropriate positions.
• If the source program uses a macro-preprocessor, the expansion of
macros may also be performed by the lexical analyzer.
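The first two tasks above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular compiler: the function name scan_chunks and the use of # as a line-comment marker are assumptions made for the example.

```python
def scan_chunks(source, comment_start="#"):
    """Yield (line_number, chunk) pairs, skipping whitespace and line comments.

    The comment marker is an assumption for this sketch; real languages differ.
    """
    line = 1
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":
            # Count newlines so every chunk can be tied to a source line.
            line += 1
            i += 1
        elif ch in " \t\r":
            i += 1
        elif source.startswith(comment_start, i):
            # Skip to the end of the line; the newline itself is counted above.
            while i < n and source[i] != "\n":
                i += 1
        else:
            start = i
            while (i < n and source[i] not in " \t\r\n"
                   and not source.startswith(comment_start, i)):
                i += 1
            yield line, source[start:i]


program = "x = 1  # set x\n\n  y = x @ 2\n"
chunks = list(scan_chunks(program))
for line_number, chunk in chunks:
    print(line_number, chunk)
```

Because each chunk carries its line number, a later phase that rejects the chunk @ can report the error against line 3 even though the comment and the blank line never reach it.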

The Reason Why Lexical Analysis is a Separate Phase

• Simplifies the design of the compiler
  • For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.
• Provides efficient implementation
  • A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
• Improves portability
  • Non-standard symbols and alternate character encodings can be normalized (e.g. UTF-8, trigraphs)

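Interaction of the Lexical Analyzer with the Parser

[Figure: the source program feeds the lexical analyzer, which returns a token and its tokenval to the parser each time the parser issues "get next token". Both the lexical analyzer and the parser report errors and access the symbol table.]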

Attributes of Tokens

[Figure: the lexical analyzer reads the input y := 31 + 28*x and produces the token stream
<id, "y"> <assign, > <num, 31> <'+', > <num, 28> <'*', > <id, "x">
The parser receives each token (the lookahead) together with tokenval (the token attribute).]
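The token stream in the figure can be reproduced with a small sketch. The patterns in TOKEN_SPEC and the function name tokenize are assumptions for illustration. The token names come from the slide, and the attribute of an id is kept as the lexeme itself, although a real compiler would typically store a symbol-table reference instead.

```python
import re

# Assumed patterns for this example; token names follow the slide.
TOKEN_SPEC = [
    ("num", re.compile(r"[0-9]+")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("assign", re.compile(r":=")),
    ("+", re.compile(r"\+")),
    ("*", re.compile(r"\*")),
]


def tokenize(text):
    """Return a list of (token, tokenval) pairs; tokenval is None when unused."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                lexeme = match.group()
                if name == "num":
                    tokenval = int(lexeme)   # numeric value of the constant
                elif name == "id":
                    tokenval = lexeme        # stand-in for a symbol-table entry
                else:
                    tokenval = None          # operators carry no attribute
                tokens.append((name, tokenval))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens


tokens = tokenize("y := 31 + 28*x")
print(tokens)
# [('id', 'y'), ('assign', None), ('num', 31), ('+', None), ('num', 28), ('*', None), ('id', 'x')]
```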

Tokens, Patterns, and Lexemes

• A token is a classification of lexical units
  • For example: id and num
  • Written as <token name, optional attribute value>, with token names shown in bold
• Lexemes are the specific character strings that make up a token
  • For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token
  • For example: "letter followed by letters and digits" and "non-empty sequence of digits"

Typical token classes:

In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

Lexical Errors

• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.

• "Panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
• Other possible error-recovery actions are:
  1. Delete one character from the remaining input.
  2. Insert a missing character into the remaining input.
  3. Replace a character by another character.
  4. Transpose two adjacent characters.
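Panic-mode recovery can be sketched as follows. The token patterns and the function name tokenize_with_panic_mode are assumptions for illustration. The sketch also stops deleting at whitespace, a simplification chosen here because whitespace is skipped anyway.

```python
import re

# Assumed token patterns, for illustration only.
TOKEN = re.compile(
    r"(?P<num>[0-9]+)|(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<op>[-+*/=()])"
)


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors holds (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match:
            tokens.append((match.lastgroup, match.group()))
            pos = match.end()
            continue
        # Panic mode: delete successive characters until a well-formed
        # token can start at the beginning of the remaining input.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN.match(text, pos)):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors


tokens, errors = tokenize_with_panic_mode("x = 3 $# + y")
print(tokens)
print(errors)
```

The deleted characters are recorded instead of silently dropped, so the compiler can still issue a diagnostic for them while the parser continues with the remaining tokens.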

Specification of Patterns for Tokens: Definitions

• An alphabet Σ is a finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0,1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems.
• A string s is a finite sequence of symbols from Σ (also called a sentence or word).
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

Specification of Patterns for Tokens: String Operations

• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
    s⁰ = ε
    sⁱ = sⁱ⁻¹s for i > 0
  Note that sε = εs = s

Specification of Patterns for Tokens: Language Operations

• Union
    L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
    LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
    L⁰ = {ε}; Lⁱ = Lⁱ⁻¹L
• Kleene closure
    L* = ∪ (i = 0, …, ∞) Lⁱ
• Positive closure
    L+ = ∪ (i = 1, …, ∞) Lⁱ
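These operations map directly onto Python sets of strings. The sketch below is an illustration under stated assumptions: the function names are invented for the example, the sample sets L = {a, b} and M = {1, 2} are arbitrary, and because L* and L+ are infinite, closure only computes a finite approximation up to a chosen power.

```python
from itertools import product

EPSILON = ""  # the empty string ε


def union(L, M):
    """L ∪ M"""
    return L | M


def concat(L, M):
    """LM = {xy | x in L and y in M}"""
    return {x + y for x, y in product(L, M)}


def power(L, i):
    """L^0 = {ε}; L^i = L^(i-1) L"""
    result = {EPSILON}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure(L, max_power, positive=False):
    """Finite approximation of L* (or L+): the union of L^i up to max_power."""
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= power(L, i)
    return result


L = {"a", "b"}
M = {"1", "2"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(closure(L, 2), key=lambda s: (len(s), s)))
```

The same helpers can be used to check identities on small cases, for example that applying the closure twice adds no new strings within a fixed length bound.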

Specification of Patterns for Tokens: Regular Expressions

• Basis symbols:
  • ε is a regular expression denoting language {ε}
  • a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
  • r|s is a regular expression denoting L(r) ∪ M(s)
  • rs is a regular expression denoting L(r)M(s)
  • r* is a regular expression denoting L(r)*
  • (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set

• As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:
  a) The unary operator * has highest precedence and is left associative.
  b) Concatenation has second highest precedence and is left associative.
  c) | has lowest precedence and is left associative.
• Under these conventions, for example, we may replace the regular expression (a)|((b)*(c)) by a|b*c. Both expressions denote the set of strings that are either a single a or are zero or more b's followed by one c.

• Example 3.4: Let Σ = {a, b}.
  1. The regular expression a|b denotes the language {a, b}.
  2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet Σ. Another regular expression for the same language is aa|ab|ba|bb.
  3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, …}.
  4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, …}. Another regular expression for the same language is (a*b*)*.
  5. a|a*b denotes the language {a, b, ab, aab, aaab, …}, that is, the string a and all strings consisting of zero or more a's and ending in b.
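The five cases can be checked by brute force for short strings. The sketch below relies on Python's re module, whose syntax happens to agree with the notation used here for |, * and parentheses. The helper name language and the length bound are assumptions for illustration.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """List all strings over `alphabet` up to max_len that the pattern matches fully."""
    compiled = re.compile(pattern)
    matches = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if compiled.fullmatch(s):
                matches.append(s)
    return matches


print(language("a|b"))          # ['a', 'b']
print(language("(a|b)(a|b)"))   # ['aa', 'ab', 'ba', 'bb']
print(language("a*"))           # ['', 'a', 'aa', 'aaa']
print(language("a|a*b"))        # ['a', 'b', 'ab', 'aab']

# (a|b)* and (a*b*)* agree on every string up to the length bound.
print(language("(a|b)*") == language("(a*b*)*"))
```

Agreement up to a length bound is evidence, not a proof, that two expressions denote the same language.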

Specification of Patterns for Tokens: Regular Definitions

• Regular definitions introduce a naming convention with name-to-regular-expression bindings:
    d₁ → r₁
    d₂ → r₂
    …
    dₙ → rₙ
  where each rᵢ is a regular expression over Σ ∪ {d₁, d₂, …, dᵢ₋₁}
• Any dⱼ in rᵢ can be textually substituted in rᵢ to obtain an equivalent set of definitions

Specification of Patterns for Tokens: Regular Definitions

• Example:

    letter → A|B|…|Z|a|b|…|z
    digit → 0|1|…|9
    id → letter ( letter|digit )*

• Regular definitions cannot be recursive:

    digits → digit digits | digit    wrong!
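Textual substitution of earlier definitions into later ones can be sketched as below. The {name} reference syntax and the function name expand are assumptions for this example, and character classes stand in for the long alternations. Because a definition may only use names bound before it, a recursive definition such as digits is rejected.

```python
import re

# Definitions in order; each may refer only to names defined before it.
definitions = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]


def expand(definitions):
    """Textually substitute earlier names into later definitions."""
    expanded = {}
    for name, body in definitions:
        for earlier, regex in expanded.items():
            body = body.replace("{" + earlier + "}", "(?:" + regex + ")")
        if "{" in body:
            # A leftover reference names something not yet defined,
            # which includes the definition referring to itself.
            raise ValueError(f"{name} refers to an undefined or later name")
        expanded[name] = body
    return expanded


patterns = expand(definitions)
id_re = re.compile(patterns["id"])
print(patterns["id"])
print(bool(id_re.fullmatch("abc123")), bool(id_re.fullmatch("9lives")))
```

Passing the recursive definition ("digits", "{digit}{digits}|{digit}") to expand raises ValueError, since {digits} is still unresolved after all earlier names have been substituted.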

Specification of Patterns for Tokens: Notational Shorthand

• The following shorthands are often used:

    r+ = rr*
    r? = r|ε
    [a-z] = a|b|c|…|z

• Examples:
    digit → [0-9]
    num → digit+ (. digit+)? ( E (+|-)? digit+ )?
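The num definition translates almost symbol for symbol into Python's re syntax. The only changes in this sketch are that the literal dot must be escaped and (+|-) is written as the character class [+-]. The sample inputs are chosen for illustration.

```python
import re

# num → digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

samples = ["31", "3.14", "6.02E23", "1E-9", ".5", "2.", "1E"]
results = {text: bool(NUM.fullmatch(text)) for text in samples}
for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

The last three samples are rejected because the pattern requires at least one digit before the dot, after the dot, and after E.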

• L = {a, b}, M = {1, 2}
    L|M =
    LM =
    L* =
    L*M = {1, 2, a1, a2, b1, b2, aa1, aa2, ab1, ab2, ba1, ba2, …}
    L+ = {a, b, aa, ab, ba, bb, aaa, …}
    L** = ?????
    L|ε = ???
    L|L*M = {a, b, 1, 2, a1, a2, b1, b2, aa1, aa2, …}
    (L|M)* = {ε, a, b, 1, 2, aa, ab, ba, bb, a1, a2, b1, b2, …}
    M* = {ε, 1, 2, 11, 22, 121, 211, 112, 1221, …}
    M⁰ = {ε}
• L = {a, b}, M = {1, 2}
    L|M = {a, b, 1, 2}
    LM = {a1, a2, b1, b2}
    L* = {ε, a, b, aa, ab, ba, bb, aaa, …}
    L*M = {1, 2, a1, a2, b1, b2, aa1, aa2, ab1, ab2, ba1, ba2, bb1, bb2, …}
    L+ = {a, b, aa, ab, ba, bb, aaa, …}
    L** = ?????
    L|ε = ???
    L|L*M = {a, b, 1, 2, a1, a2, b1, b2, aa1, aa2, …}
    (L|M)* = {ε, a, b, 1, 2, aa, ab, ba, bb, a1, a2, b1, b2, …}
    M* = {ε, 1, 2, 11, 22, 121, 211, 112, 1221, …}
    M⁰ = {ε}
