
Programming Languages Lexical and Syntax Analysis

CMSC 4023 Chapter 4

4. Lexical and Syntax Analysis


Why should we discuss the implementation of parts of a compiler?
• Syntax analyzers are based directly on the grammars discussed in Chapter 3.
• Lexical and syntax analyzers are needed in numerous situations outside compiler design
including
o program listing formatters
o programs that compute the complexity of programs
o programs that must analyze and react to the contents of a configuration file
4.1. Introduction
Lexical and Syntax Analysis are the first two phases of compilation as shown below.

characters → Lexical Analysis (Scanner) → tokens → Syntax Analysis (Parser) → abstract syntax tree

Figure 4.1 Lexical and Syntax Analysis


A formal notation exists for describing the input of each phase:
• For characters, we have the language of regular expressions to recognize tokens.
• For tokens, we have context-free grammars to recognize syntactically correct programs.

Reasons for separating lexical analysis from syntax analysis are:


1. Simplicity – Techniques for lexical analysis are less complex than those required for
syntax analysis, so the lexical-analysis process can be simpler if it is separate. Also,
removing the low-level details of lexical analysis from the syntax analyzer makes the
syntax analyzer both smaller and cleaner.
2. Efficiency – Although it pays to optimize the lexical analyzer, because lexical analysis
requires a significant portion of total compilation time, it is not fruitful to optimize the
syntax analyzer. Separation facilitates this selective optimization.
3. Portability – Because the lexical analyzer reads input program files and often includes
buffering of that input, it is somewhat platform dependent. However, the syntax
analyzer can be platform independent. It is always a good practice to isolate machine-
dependent parts of any software system.
4.2. Lexical Analysis
• A lexical analyzer is a pattern matcher.
• A lexical analyzer recognizes strings of characters as tokens.
• A token is a tuple (code,spelling)
o code – an integer code is given to every unique pattern. Separate codes are
assigned to all punctuation, every reserved word, all types of constants, and to
identifiers.
o spelling – the spelling is the actual string that was recognized. For example, for the
identifier result, the string “result” is the spelling of the token.


Consider the following example of an assignment statement together with C++ declarations

result = oldsum - value / 100;

Spelling   Symbol      Code
result     ID          1
=          ASSIGN      3
oldsum     ID          1
-          MINUS       4
value      ID          1
/          SLASH       5
100        INTLIT      2
;          SEMICOLON   6
Figure 4.2 Tokens of the statement result = oldsum - value / 100;

#define ID 1
#define INTLIT 2
#define ASSIGN 3
#define MINUS 4
#define SLASH 5
#define SEMICOLON 6
Figure 4.3 C++ constant definitions that support tokens
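
As a minimal sketch of how the (code, spelling) pairs of Figure 4.2 could be produced, the hand-written scanner below recognizes only the token kinds used in the example statement. The names Token, scan, and UNKNOWN are assumptions made for this illustration and are not part of the course code; the token codes are written as an enum with the same values as Figure 4.3.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Token codes with the same values as Figure 4.3.
// UNKNOWN is added for this sketch to flag characters that match no pattern.
enum Code { UNKNOWN = 0, ID = 1, INTLIT = 2, ASSIGN = 3, MINUS = 4, SLASH = 5, SEMICOLON = 6 };

// A token is the pair (code, spelling).
struct Token {
    int code;
    std::string spelling;
};

// Scan the whole input and return its tokens in order.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }            // skip blanks
        std::size_t start = i;
        if (std::isalpha(c)) {                             // identifier: letter (letter|digit)*
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back(Token{ID, src.substr(start, i - start)});
        } else if (std::isdigit(c)) {                      // integer literal: digit+
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back(Token{INTLIT, src.substr(start, i - start)});
        } else {                                           // single-character tokens
            int code = UNKNOWN;
            switch (c) {
                case '=': code = ASSIGN; break;
                case '-': code = MINUS; break;
                case '/': code = SLASH; break;
                case ';': code = SEMICOLON; break;
            }
            out.push_back(Token{code, std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}
```

Calling scan on the example statement should yield eight tokens in the order listed in Figure 4.2. A production scanner would also skip comments and recognize reserved words, which this sketch leaves out.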
• Lexical analyzers (scanners) extract lexemes from a given input string and produce the
corresponding tokens.
• Lexical analyzers skip comments and blanks.
• There are three approaches to building a lexical analyzer:
1. Write a formal description of the token patterns of the language using a
descriptive language related to regular expressions. These descriptions are
used as input to a software tool that automatically generates a lexical analyzer.
The oldest and most accessible of these, named lex, is commonly included as
part of UNIX systems.
2. Design a state transition diagram that describes the token patterns of the
language and write a program that implements the diagram.
3. Design a state transition diagram that describes the token patterns of the
language and hand-construct a table-driven implementation of the state
diagram.
• A state diagram is a directed graph. The nodes of a state diagram are labeled with state
names. The edges are labeled with the input characters that cause the transitions among
the states.
• State diagrams of this form represent a class of mathematical machines called finite
automata (finite state machines).
• Regular expressions, which describe the class of languages called regular languages, can
be translated to finite automata.


[State diagram with states 0, 1, 2, 3 and start state 0; its transitions are listed in Table 1.]
Figure 1. DFA accepting (a|b)*abb

State   Input Symbol
        a    b
0       1    0
1       1    2
2       1    3
3       1    0
Table 1. Transition function for the DFA of Figure 1.

The DFA of Figure 1 consists of:
1. The set of states S = {0, 1, 2, 3}
2. The transition function move, shown in Table 1.
3. The input alphabet Σ = {a, b}
4. The start state s0 = 0
5. The set of accepting states F = {3}
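
The five components above translate almost directly into a table-driven recognizer. The following is a small sketch under that reading; the names MOVE and accepts are chosen for this example only.

```cpp
#include <string>

// Transition function "move" from Table 1: rows are states 0..3,
// columns are the input symbols a (index 0) and b (index 1).
const int MOVE[4][2] = {
    {1, 0},   // state 0
    {1, 2},   // state 1
    {1, 3},   // state 2
    {1, 0},   // state 3
};

// Return true if the DFA ends in the accepting state 3 after reading the input.
bool accepts(const std::string& input) {
    int state = 0;                                   // s0 = 0
    for (char c : input) {
        if (c != 'a' && c != 'b') return false;      // symbol outside the alphabet {a, b}
        state = MOVE[state][c == 'a' ? 0 : 1];
    }
    return state == 3;                               // F = {3}
}
```

With this table, accepts("abb") and accepts("aababb") return true, while accepts("abba") and accepts("") return false, matching the language (a|b)*abb.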


4.3. The Parsing Problem


• Analyzing a sequence of tokens to determine if they form a sentence in the grammar of the
programming language is called syntax analysis.
• Syntax analysis is often called parsing.
4.3.1. Introduction to Parsing
• Parsers for programming languages construct parse trees for given programs.
• There are two distinct goals of syntax analysis:
1. The parser determines if the input program is syntactically correct.
1.1. If an error is found, the parser generates a diagnostic message indicating the
location of the error and a message that indicates why the program is not
correct.
2. The parser produces a parse tree of a syntactically correct program.
• There are two broad classes of parsers.
1. top-down: A top-down parser attempts to construct the parse tree from the root
down to its leaves.
2. bottom-up: A bottom-up parser attempts to construct the parse tree from its
leaves upward to the root.
• Parsing terminology.
1. Terminal symbols – lowercase letters at the beginning of the alphabet. (𝑎, 𝑏, ⋯ ).
2. Nonterminal symbols – uppercase letters at the beginning of the alphabet.
(𝐴, 𝐵, ⋯ ).
3. Terminals or nonterminals – uppercase letters at the end of the alphabet.
(𝑊, 𝑋, 𝑌, 𝑍).
4. Strings of terminals – lowercase letters at the end of the alphabet. (𝑤, 𝑥, 𝑦, 𝑧).
5. Mixed strings (terminals or nonterminals) – lowercase Greek letters. (𝛼, 𝛽, 𝛾, 𝛿)
• Programming language terminology.
1. Terminal symbols – terminal symbols are printed in bold. For example, for, while,
+, -
2. Nonterminal symbols – Nonterminal symbols are printed in italics. For example,
expression, term, factor.
4.3.2. Top-Down Parsers
• A top-down parser traces or builds a parse tree in preorder. This corresponds to a
leftmost derivation.
• The general form of a left sentential form is 𝑥𝐴𝛼, where 𝑥 is a string of
terminal symbols, 𝐴 is a nonterminal, and 𝛼 is a mixed string.
• Because 𝑥 contains only terminal symbols, 𝐴 is the leftmost nonterminal in the
sentential form, so it is the one that must be expanded to get the next sentential form
in a leftmost derivation.


• In the sentential form 𝑥𝐴𝛼 a top-down parser must select one of the rules having 𝐴 on
the left hand side. Given the following 𝐴-rules,
𝐴 → 𝑏𝐵
𝐴 → 𝑐𝐵𝑏
𝐴→𝑎
A top-down parser must use one of the foregoing rules to transform the left sentential
form 𝑥𝐴𝛼 to
𝑥𝑏𝐵𝛼
𝑥𝑐𝐵𝑏𝛼
𝑥𝑎𝛼
• Recursive-descent parsers are the most common method of implementing a top-
down parser.
• A recursive-descent parser is coded directly from the BNF grammar and has one
function or procedure for each nonterminal symbol.
• Recursive-descent parsers employ LL algorithms. The first L is for a Left to right scan
of the input. The second L is for a Leftmost derivation.
4.3.3. Bottom-Up Parsers
• A bottom-up parser constructs a parse tree by beginning at the leaves and progressing
toward the root.
• Given a right sentential form 𝛼, the parser must determine what substring of 𝛼 is the
RHS (right-hand side) of the rule in the grammar that must be reduced to its LHS (left-
hand side) to produce the previous sentential form in the rightmost derivation.
• Consider the following grammar and derivation
LHS RHS
S → aAc
A → aA
A → b

S => aAc => aaAc => aabc

1. Start with aabc, a string of terminal symbols


2. Replace terminal symbol b with its LHS, A, yielding aaAc.
3. The LR parsing algorithm correctly selects the handle aA. The handle aA is
replaced by A yielding aAc.
4. Finally aAc is replaced by the goal symbol S and parsing terminates.
4.3.4. The Complexity of Parsing
• Parsing algorithms that work for any unambiguous grammar require 𝑂(𝑛³) time.
• By using a subset of context free grammars, the time complexity of parsing can be
reduced to 𝑂(𝑛) time.


4.4. Recursive-Descent Parsing


4.4.1. The Recursive-Descent Parsing Process
• A recursive-descent parser is so named because it consists of a collection of
subprograms, many of which are recursive, and it produces a parse tree in top-down
order.
• EBNF is ideally suited for recursive-descent parsers.
• A recursive-descent parser has a subprogram for each nonterminal in the grammar.
• Consider the following EBNF for arithmetic expressions.
LHS RHS
expression → term { (+|-) term }
term → factor { (*|/) factor }
factor → id | intlit | ( expression )

//-------------------------------------------------------------
// factor -> ID | INTLIT | ( expr )
//-------------------------------------------------------------
void factor(bool get)
{   ParsePrint("Enter factor");
    switch (Token()) {
    case ID:
    case INTLIT:
        Lex(); LexPrint(*o);            // consume the identifier or literal
        break;
    case LPAREN:
        Lex(); LexPrint(*o);            // consume "("
        expr(false);
        Expected[0]=RPAREN;
        if (Token()!=RPAREN) throw ParseException(Expected,1,Token());
        Lex(); LexPrint(*o);            // consume ")"
        break;
    default:                            // no factor starts with this token
        Expected[0]=ID;Expected[1]=INTLIT;Expected[2]=LPAREN;
        ParsePrint("Exit factor");
        throw ParseException(Expected,3,Token());
    }
    ParsePrint("Exit factor");
}
Function factor


//-------------------------------------------------------------
// term -> factor { (*|/) factor }
//-------------------------------------------------------------
void term(bool get)
{   ParsePrint("Enter term");
    factor(get);
    while (Token()==MUL_OP||Token()==DIV_OP) {
        Lex(); LexPrint(*o);            // consume the operator
        factor(false);
    }
    ParsePrint("Exit term");
}
Function term

//-------------------------------------------------------------
// expr -> term { (+|-) term }
//-------------------------------------------------------------
void expr(bool get)
{   ParsePrint("Enter expr");
    term(get);
    while (Token()==ADD_OP||Token()==DIF_OP) {
        Lex(); LexPrint(*o);            // consume the operator
        term(false);
    }
    ParsePrint("Exit expr");
}
Function expr
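
The three functions above depend on a scanner and helper routines (Token, Lex, LexPrint, ParsePrint, ParseException) that are not shown in these notes. As a self-contained sketch of the same technique, the class below pairs a tiny scanner with one member function per nonterminal. All names in it (Parser, Tok, isExpression) are assumptions made for this illustration, and it only checks syntax rather than printing a trace.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical token codes for this sketch (not the codes used by the course scanner).
enum Tok { T_END, T_ID, T_INTLIT, T_ADD, T_SUB, T_MUL, T_DIV, T_LPAREN, T_RPAREN, T_BAD };

class Parser {
public:
    explicit Parser(const std::string& s) : src(s), pos(0), tok(T_END) { lex(); }

    // Parse one complete expression; throws std::runtime_error on a syntax error.
    void parse() {
        expr();
        if (tok != T_END) throw std::runtime_error("unexpected input after expression");
    }

private:
    std::string src;
    std::size_t pos;
    Tok tok;   // current (lookahead) token

    // Advance to the next token, skipping blanks.
    void lex() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
        if (pos >= src.size()) { tok = T_END; return; }
        unsigned char c = static_cast<unsigned char>(src[pos]);
        if (std::isalpha(c)) {
            while (pos < src.size() && std::isalnum(static_cast<unsigned char>(src[pos]))) ++pos;
            tok = T_ID;
        } else if (std::isdigit(c)) {
            while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) ++pos;
            tok = T_INTLIT;
        } else {
            ++pos;
            switch (c) {
                case '+': tok = T_ADD; break;
                case '-': tok = T_SUB; break;
                case '*': tok = T_MUL; break;
                case '/': tok = T_DIV; break;
                case '(': tok = T_LPAREN; break;
                case ')': tok = T_RPAREN; break;
                default:  tok = T_BAD; break;
            }
        }
    }

    // expr -> term { (+|-) term }
    void expr() {
        term();
        while (tok == T_ADD || tok == T_SUB) { lex(); term(); }
    }

    // term -> factor { (*|/) factor }
    void term() {
        factor();
        while (tok == T_MUL || tok == T_DIV) { lex(); factor(); }
    }

    // factor -> id | intlit | ( expr )
    void factor() {
        if (tok == T_ID || tok == T_INTLIT) {
            lex();
        } else if (tok == T_LPAREN) {
            lex();
            expr();
            if (tok != T_RPAREN) throw std::runtime_error("expected )");
            lex();
        } else {
            throw std::runtime_error("expected id, intlit, or (");
        }
    }
};

// Convenience wrapper: true if the text is a syntactically correct expression.
bool isExpression(const std::string& text) {
    try {
        Parser p(text);
        p.parse();
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Each EBNF repetition { ... } becomes a while loop and each alternative becomes a branch on the current token, which is why EBNF maps so directly onto this style of parser. For instance, isExpression("(sum+47)/total") returns true and isExpression("(sum+47") returns false.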


• Example expression (sum+47)/total.


• Recursive-descent parse trace for the example expression

Token Code(8) Token Name(LPAREN) Token String="("


Enter expr
Enter term
Enter factor
Token Code(1) Token Name( ID) Token String="sum"
Enter expr
Enter term
Enter factor
Token Code(4) Token Name(ADD_OP) Token String="+"
Exit factor
Exit term
Token Code(2) Token Name(INTLIT) Token String="47"
Enter term
Enter factor
Token Code(9) Token Name(RPAREN) Token String=")"
Exit factor
Exit term
Exit expr
Token Code(7) Token Name(DIV_OP) Token String="/"
Exit factor
Token Code(1) Token Name( ID) Token String="total"
Enter factor
Token Code(0) Token Name( END) Token String=""
Exit factor
Exit term
Exit expr
Parse trace of the expression (sum+47)/total


Id LHS RHS
1 expression → term
2 expression → expression + term
3 expression → expression - term
4 term → factor
5 term → term * factor
6 term → term / factor
7 factor → ( expression )
8 factor → id
9 factor → intlit

expression
  term
    term
      factor
        (
        expression
          expression
            term
              factor
                id(sum)
          +
          term
            factor
              intlit(47)
        )
    /
    factor
      id(total)
Parse tree of (sum+47)/total


Id LHS RHS
1 E → TE’
2 E’ → +TE’
3 E’ → -TE’
4 E’ → 𝝐
5 T → FT’
6 T’ → *FT’
7 T’ → /FT’
8 T’ → 𝝐
9 F → (E)
10 F → id
11 F → intlit

E
  T
    F
      (
      E
        T
          F
            id(sum)
          T’
            𝜀
        E’
          +
          T
            F
              intlit(47)
            T’
              𝜀
          E’
            𝜀
      )
    T’
      /
      F
        id(total)
      T’
        𝜀
  E’
    𝜀

Parse tree of (sum+47)/total

4.4.2. The LL Grammar Class


• Eliminating direct left recursion.


For each nonterminal, 𝐴,
1. Group the 𝐴-rules as 𝐴 → 𝐴𝛼1 | ⋯ | 𝐴𝛼𝑚 | 𝛽1 | 𝛽2 | ⋯ | 𝛽𝑛 where none of the 𝛽’s
begin with 𝐴.
2. Replace the original 𝐴-rules with
𝐴 → 𝛽1 𝐴′ | 𝛽2 𝐴′ | ⋯ | 𝛽𝑛 𝐴′
𝐴′ → 𝛼1 𝐴′ | 𝛼2 𝐴′ | ⋯ | 𝛼𝑚 𝐴′ | 𝜀 where 𝜀 is the empty string.
• Apply the foregoing transformation rules to the canonical expression grammar given
below.
E → T
E → E+T
T → F
T → T*F
F → id
F → (E)

Applying the transformation rules:


E → TE’
E’ → +TE’
E’ → 𝜀
T → FT’
T’ → *FT’
T’ → 𝜀
F → id
F → (E)
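
The two-step transformation can also be written as a short routine. The sketch below assumes a grammar is stored as a map from each nonterminal to its list of right-hand sides, with an empty right-hand side standing for 𝜀; the type names Rhs and Grammar and the function name are choices made for this example.

```cpp
#include <map>
#include <string>
#include <vector>

using Rhs = std::vector<std::string>;                      // one right-hand side; empty = epsilon
using Grammar = std::map<std::string, std::vector<Rhs>>;   // nonterminal -> its right-hand sides

// Remove direct left recursion from every nonterminal of g.
Grammar eliminateDirectLeftRecursion(const Grammar& g) {
    Grammar out;
    for (const auto& entry : g) {
        const std::string& A = entry.first;
        const std::vector<Rhs>& rules = entry.second;

        // Step 1: split the A-rules into A -> A alpha and A -> beta.
        std::vector<Rhs> alphas, betas;
        for (const Rhs& rhs : rules) {
            if (!rhs.empty() && rhs[0] == A)
                alphas.emplace_back(rhs.begin() + 1, rhs.end());   // keep only alpha
            else
                betas.push_back(rhs);
        }
        if (alphas.empty()) { out[A] = rules; continue; }          // no left recursion here

        // Step 2: emit A -> beta A' and A' -> alpha A' | epsilon.
        const std::string A1 = A + "'";                            // the new nonterminal A'
        for (Rhs b : betas)  { b.push_back(A1); out[A].push_back(b); }
        for (Rhs a : alphas) { a.push_back(A1); out[A1].push_back(a); }
        out[A1].push_back(Rhs{});                                  // A' -> epsilon
    }
    return out;
}
```

Applied to the canonical expression grammar, this should produce E → T E', E' → + T E' | 𝜀, T → F T', T' → * F T' | 𝜀, and leave the F-rules unchanged. The sketch handles only direct left recursion, as in the procedure above.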
• The parser must always be able to select the correct RHS based on the next token of
input, using only the first token generated by the leftmost nonterminal in the current
sentential form. A grammar is checked for this property with the pairwise disjointness test.
• The pairwise disjointness test is:
For each nonterminal, 𝐴, in the grammar that has more than one RHS, for each pair of
rules, 𝐴 → 𝛼𝑖 and 𝐴 → 𝛼𝑗 , it must be true that
𝐹𝐼𝑅𝑆𝑇(𝛼𝑖) ∩ 𝐹𝐼𝑅𝑆𝑇(𝛼𝑗) = ∅
In other words, if a nonterminal 𝐴 has more than one RHS, the first terminal symbol
that can be generated in a derivation for each of them must be unique to that RHS.
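
Once the FIRST set of each RHS is known, the test itself is a pairwise intersection check. The sketch below takes those sets as given (here they would be worked out by hand); the function name pairwiseDisjoint is an assumption of this example.

```cpp
#include <set>
#include <string>
#include <vector>

// firstSets[i] is FIRST of the i-th RHS of one nonterminal.
// Return true if no terminal appears in the FIRST sets of two different RHSs.
bool pairwiseDisjoint(const std::vector<std::set<std::string>>& firstSets) {
    for (std::size_t i = 0; i < firstSets.size(); ++i)
        for (std::size_t j = i + 1; j < firstSets.size(); ++j)
            for (const std::string& t : firstSets[i])
                if (firstSets[j].count(t)) return false;   // shared first terminal
    return true;
}
```

As an illustration with two hypothetical grammars that are not part of these notes: for A → aB | bAb | Bb with B → cB | d, the FIRST sets of the three A-rules are {a}, {b}, and {c, d}, so the test passes. For A → aB | BAb with B → aB | b, the sets are {a} and {a, b}, which share a, so the test fails and a single token of lookahead cannot choose between the two rules.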


• Define 𝐹𝐼𝑅𝑆𝑇(𝛼), where 𝛼 is any string of grammar symbols, to be the set of terminals
that begin strings derived from 𝛼. If 𝛼 ⇒* 𝜀, then 𝜀 is also in 𝐹𝐼𝑅𝑆𝑇(𝛼).

To compute 𝐹𝐼𝑅𝑆𝑇(𝑋):

1. If 𝑋 is a terminal, then 𝐹𝐼𝑅𝑆𝑇(𝑋) = {𝑋}.
2. If 𝑋 is a nonterminal and 𝑋 → 𝑌1 𝑌2 ⋯ 𝑌𝑘 is a production, then place 𝑎 in
𝐹𝐼𝑅𝑆𝑇(𝑋) if for some 𝑖, 𝑎 is in 𝐹𝐼𝑅𝑆𝑇(𝑌𝑖) and 𝜀 is in all of
𝐹𝐼𝑅𝑆𝑇(𝑌1), ⋯, 𝐹𝐼𝑅𝑆𝑇(𝑌𝑖−1); that is, 𝑌1 ⋯ 𝑌𝑖−1 ⇒* 𝜀. If 𝜀 is in 𝐹𝐼𝑅𝑆𝑇(𝑌𝑗) for
all 𝑗 = 1, 2, ⋯, 𝑘, then add 𝜀 to 𝐹𝐼𝑅𝑆𝑇(𝑋). For example, everything in
𝐹𝐼𝑅𝑆𝑇(𝑌1) other than 𝜀 is surely in 𝐹𝐼𝑅𝑆𝑇(𝑋). If 𝑌1 does not derive 𝜀, then we add
nothing more to 𝐹𝐼𝑅𝑆𝑇(𝑋), but if 𝑌1 ⇒* 𝜀, then we add 𝐹𝐼𝑅𝑆𝑇(𝑌2), and so on.
3. If 𝑋 → 𝜀 is a production, then add 𝜀 to 𝐹𝐼𝑅𝑆𝑇(𝑋).
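
The three rules are usually applied repeatedly until no FIRST set changes. The following sketch does that for a grammar stored as a map from nonterminals to right-hand sides; any symbol that is not a key of the map is treated as a terminal, an empty right-hand side stands for 𝜀, and the string "eps" marks 𝜀 inside the sets. These representation choices and the names Grammar, FirstSets, EPS, and computeFirst are assumptions of this example.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using Rhs = std::vector<std::string>;                      // empty vector = epsilon production
using Grammar = std::map<std::string, std::vector<Rhs>>;   // nonterminal -> its right-hand sides
using FirstSets = std::map<std::string, std::set<std::string>>;

const std::string EPS = "eps";   // marker used for epsilon in this sketch

// Compute FIRST(X) for every nonterminal X by iterating the rules to a fixed point.
FirstSets computeFirst(const Grammar& g) {
    FirstSets first;
    for (const auto& entry : g) first[entry.first];        // start with empty sets
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& entry : g) {
            std::set<std::string>& fx = first[entry.first];
            for (const Rhs& rhs : entry.second) {
                const std::size_t before = fx.size();
                bool allNullable = true;                    // Y1 ... Yi-1 all derive epsilon so far
                for (const std::string& Y : rhs) {
                    if (!g.count(Y)) {                      // rule 1: Y is a terminal
                        fx.insert(Y);
                        allNullable = false;
                        break;
                    }
                    const std::set<std::string> fy = first[Y];   // copy, since Y may equal X
                    for (const std::string& a : fy)
                        if (a != EPS) fx.insert(a);         // rule 2: FIRST(Yi) without epsilon
                    if (!fy.count(EPS)) { allNullable = false; break; }
                }
                if (allNullable) fx.insert(EPS);            // rules 2 and 3: whole RHS derives epsilon
                if (fx.size() != before) changed = true;
            }
        }
    }
    return first;
}
```

For the grammar E → TE’, E’ → +TE’ | -TE’ | 𝜀, T → FT’, T’ → *FT’ | /FT’ | 𝜀, F → (E) | id | intlit, this should give FIRST(E) = FIRST(T) = FIRST(F) = {(, id, intlit}, FIRST(E’) = {+, -, 𝜀}, and FIRST(T’) = {*, /, 𝜀}.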

4.5. Bottom-Up Parsing


4.5.1. The Parsing Problem for Bottom-Up Parsers

Consider an abbreviated grammar for expressions and the sentence


id+id*id
Id LHS RHS
1 E → T
2 E → E+T
3 T → F
4 T → T*F
5 F → id
6 F → (E)
An LR parse, which is a bottom-up parse, is given below. An LR parse performs a Left-to-right
scan of the input and traces a Rightmost derivation in reverse. Reading the table from
bottom to top gives the order in which the parser performs its reductions.

Sentential form           Id   LHS   RHS
(rightmost derivation)
E                                          Start with the start symbol
E + T                     2    E  →  E+T
E + T * F                 4    T  →  T*F
E + T * id                5    F  →  id
E + F * id                3    T  →  F
E + id * id               5    F  →  id
T + id * id               1    E  →  T
F + id * id               3    T  →  F
id + id * id              5    F  →  id


4.5.2. Shift-Reduce Algorithms


4.5.3. LR Parsers

1. The ACTION function takes as arguments a state i and a terminal a (or $, the input
endmarker). The value of ACTION[i,a] can have one of four forms:
1.1. Shift j where j is a state. The action taken by the parser effectively shifts input a
to the stack, but uses state j to represent a.
1.2. Reduce 𝐴 → 𝛽. The action of the parser effectively reduces 𝛽 on the top of the
stack to head 𝐴.
1.3. Accept. The parser accepts the input and finishes parsing.
1.4. Error. The parser discovers an error in its input and takes some corrective
action.
2. We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii,A]=Ij,
then GOTO also maps a state i and a nonterminal A to state j.

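[Diagram: the input buffer a1 ⋯ ai ⋯ an $, a stack holding states sm, sm-1, ⋯ above the
bottom marker $, the LR parsing program that reads both and produces the output, and the
ACTION and GOTO tables that drive it.]
Figure 1. LR Parser Model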
left side right side
1 E → E+T
2 E → T
3 T → T*F
4 T → F
5 F → (E)
6 F → id
Table 1. Productions of the expression grammar

let a be the first symbol of w$;
while (1) {
    let s be the state on top of the stack;
    if (ACTION[s,a]==shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[s,a]==reduce 𝐴 → 𝛽) {
        pop |𝛽| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t,A] onto the stack;
        output the production 𝐴 → 𝛽;
    } else if (ACTION[s,a]==accept) break;   // parsing is done
    else error();
}


       STACK        SYMBOLS      INPUT             ALGORITHM           ACTION
(1)    0                         id * id + id $    ACTION[0,id]=s5     shift
(2)    0 5          id           * id + id $       ACTION[5,*]=r6      reduce by F → id
                                                   GOTO[0,F]=3
(3)    0 3          F            * id + id $       ACTION[3,*]=r4      reduce by T → F
                                                   GOTO[0,T]=2
(4)    0 2          T            * id + id $       ACTION[2,*]=s7      shift
(5)    0 2 7        T *          id + id $         ACTION[7,id]=s5     shift
(6)    0 2 7 5      T * id       + id $            ACTION[5,+]=r6      reduce by F → id
                                                   GOTO[7,F]=10
(7)    0 2 7 10     T * F        + id $            ACTION[10,+]=r3     reduce by T → T * F
                                                   GOTO[0,T]=2
(8)    0 2          T            + id $            ACTION[2,+]=r2      reduce by E → T
                                                   GOTO[0,E]=1
(9)    0 1          E            + id $            ACTION[1,+]=s6      shift
(10)   0 1 6        E +          id $              ACTION[6,id]=s5     shift
(11)   0 1 6 5      E + id       $                 ACTION[5,$]=r6      reduce by F → id
                                                   GOTO[6,F]=3
(12)   0 1 6 3      E + F        $                 ACTION[3,$]=r4      reduce by T → F
                                                   GOTO[6,T]=9
(13)   0 1 6 9      E + T        $                 ACTION[9,$]=r1      reduce by E → E + T
                                                   GOTO[0,E]=1
(14)   0 1          E            $                 ACTION[1,$]=acc     accept

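STATE   ACTION                                GOTO
        id    +     *     (     )     $       E    T    F
0       s5    -     -     s4    -     -       1    2    3
1       -     s6    -     -     -     acc     -    -    -
2       -     r2    s7    -     r2    r2      -    -    -
3       -     r4    r4    -     r4    r4      -    -    -
4       s5    -     -     s4    -     -       8    2    3
5       -     r6    r6    -     r6    r6      -    -    -
6       s5    -     -     s4    -     -       -    9    3
7       s5    -     -     s4    -     -       -    -    10
8       -     s6    -     -     s11   -       -    -    -
9       -     r1    s7    -     r1    r1      -    -    -
10      -     r3    r3    -     r3    r3      -    -    -
11      -     r5    r5    -     r5    r5      -    -    -
A dash marks an empty entry; an empty ACTION entry is an error.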



       STACK          SYMBOLS       INPUT             ALGORITHM           ACTION
(1)    0                            id + id * id $    ACTION[0,id]=s5     shift
(2)    0 5            id            + id * id $       ACTION[5,+]=r6      reduce by F → id
                                                      GOTO[0,F]=3
(3)    0 3            F             + id * id $       ACTION[3,+]=r4      reduce by T → F
                                                      GOTO[0,T]=2
(4)    0 2            T             + id * id $       ACTION[2,+]=r2      reduce by E → T
                                                      GOTO[0,E]=1
(5)    0 1            E             + id * id $       ACTION[1,+]=s6      shift
(6)    0 1 6          E +           id * id $         ACTION[6,id]=s5     shift
(7)    0 1 6 5        E + id        * id $            ACTION[5,*]=r6      reduce by F → id
                                                      GOTO[6,F]=3
(8)    0 1 6 3        E + F         * id $            ACTION[3,*]=r4      reduce by T → F
                                                      GOTO[6,T]=9
(9)    0 1 6 9        E + T         * id $            ACTION[9,*]=s7      shift
(10)   0 1 6 9 7      E + T *       id $              ACTION[7,id]=s5     shift
(11)   0 1 6 9 7 5    E + T * id    $                 ACTION[5,$]=r6      reduce by F → id
                                                      GOTO[7,F]=10
(12)   0 1 6 9 7 10   E + T * F     $                 ACTION[10,$]=r3     reduce by T → T * F
                                                      GOTO[6,T]=9
(13)   0 1 6 9        E + T         $                 ACTION[9,$]=r1      reduce by E → E + T
                                                      GOTO[0,E]=1
(14)   0 1            E             $                 ACTION[1,$]=acc     accept

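A configuration of an LR parser is the pair (s0 s1 ⋯ sm, ai ai+1 ⋯ an $): the stack
contents followed by the remaining input. It represents the right sentential form
X1 X2 ⋯ Xm ai ai+1 ⋯ an, where Xi is the grammar symbol represented by state si.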
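
The LR parsing loop and the ACTION/GOTO table can be combined into a small runnable driver. The sketch below hard-codes the table for the six expression productions, with an empty string for an error entry and 0 for a missing GOTO entry (state 0 is never a GOTO target). The encoding and the names ACTION_TABLE, GOTO_TABLE, column, and lrParse are assumptions of this example.

```cpp
#include <string>
#include <vector>

// ACTION columns: id + * ( ) $      ("" = error entry)
const char* const ACTION_TABLE[12][6] = {
    {"s5", "",   "",   "s4", "",    ""   },   // state 0
    {"",   "s6", "",   "",   "",    "acc"},   // state 1
    {"",   "r2", "s7", "",   "r2",  "r2" },   // state 2
    {"",   "r4", "r4", "",   "r4",  "r4" },   // state 3
    {"s5", "",   "",   "s4", "",    ""   },   // state 4
    {"",   "r6", "r6", "",   "r6",  "r6" },   // state 5
    {"s5", "",   "",   "s4", "",    ""   },   // state 6
    {"s5", "",   "",   "s4", "",    ""   },   // state 7
    {"",   "s6", "",   "",   "s11", ""   },   // state 8
    {"",   "r1", "s7", "",   "r1",  "r1" },   // state 9
    {"",   "r3", "r3", "",   "r3",  "r3" },   // state 10
    {"",   "r5", "r5", "",   "r5",  "r5" },   // state 11
};

// GOTO columns: E T F      (0 = no entry)
const int GOTO_TABLE[12][3] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};

// Productions 1..6: length of the RHS and GOTO column of the LHS (index 0 unused).
const int RHS_LENGTH[7] = {0, 3, 1, 3, 1, 3, 1};
const int LHS_COLUMN[7] = {0, 0, 0, 1, 1, 2, 2};

// Map a terminal to its ACTION column, or -1 if it is not in the grammar.
int column(const std::string& t) {
    static const char* const terminals[6] = {"id", "+", "*", "(", ")", "$"};
    for (int i = 0; i < 6; ++i)
        if (t == terminals[i]) return i;
    return -1;
}

// Run the LR parsing loop on a token sequence given without the endmarker.
// On success, returns true and leaves the production numbers used, in order, in output.
bool lrParse(std::vector<std::string> input, std::vector<int>& output) {
    input.push_back("$");                        // append the input endmarker
    std::vector<int> stack(1, 0);                // the stack starts with state 0
    std::size_t ip = 0;
    output.clear();
    while (true) {
        const int s = stack.back();
        const int col = column(input[ip]);
        if (col < 0) return false;               // unknown token
        const std::string act = ACTION_TABLE[s][col];
        if (act.empty()) return false;           // error entry
        if (act == "acc") return true;           // parsing is done
        const int n = std::stoi(act.substr(1));
        if (act[0] == 's') {                     // shift n
            stack.push_back(n);
            ++ip;
        } else {                                 // reduce by production n
            for (int k = 0; k < RHS_LENGTH[n]; ++k) stack.pop_back();
            const int next = GOTO_TABLE[stack.back()][LHS_COLUMN[n]];
            if (next == 0) return false;
            stack.push_back(next);
            output.push_back(n);
        }
    }
}
```

For the input id * id + id the driver reports the productions 6, 4, 6, 3, 2, 6, 4, 1, which is the sequence of reductions in the first trace; for id + id * id it reports 6, 4, 2, 6, 4, 6, 3, 1, as in the second trace. Only the state stack is kept here, since the SYMBOLS column of the traces can be recovered from the states.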
