Unit 3

The document outlines the syllabus for Unit 3, covering Turing Machines, Compilers, and Syntax Analysis. It details the structure and functions of Turing Machines, including their definitions, configurations, and types of operations, as well as the phases of a compiler, including lexical analysis, syntax analysis, and code generation. Additionally, it discusses input buffering techniques and the importance of token specification in programming languages.


UNIT 3

As per RKR21
Syllabus:
Turing Machine: Definition of TM, Structural representation of TM, Construction of TM
Compiler: Definition of Compiler, Phases of Compiler, Lexical Analysis, Input Buffering.
Syntax Analysis: Types of parsing, Top-Down Parsing: Recursive Descent Parsing, Predictive
Parsing, Bottom-Up Parsing: SLR, CLR, LALR

Turing Machine

Introduction
→ Mathematical Model of General-Purpose Computer
→ Languages accepted by TM are called Recursively Enumerable Language.
→ Consists of input tape of Infinite length, read-write head and a finite control.

Configuration

 A TM can also be described as FA + [ R/W head ] + [ Left | Right movement ].
 A TM can also be viewed as a two-stack FA.
 (Compare: PDA = FA + 1 stack, a 7-tuple system.)
Definitions

A TM is a 7-tuple system M = (Q, ∑, Γ, δ, q0, B, F)


Q: Finite set of states
∑: Input alphabet
δ: Transition function, mapped as
DTM → δ: Q × Γ → Q × Γ × {L, R}
NTM → δ: Q × Γ → 2^(Q × Γ × {L, R})
q0: Initial state
F: Set of final states
Γ: Tape alphabet
B: Blank symbol

========================================================================
Functions of TM
1. Acceptor: the TM accepts recursively enumerable languages (REL). (It can also act as a
transducer, computing an output from the input.)

 If w ∈ L, the TM halts at a final state.
 If w ∉ L, it may halt at a non-final state, or it may never halt (enters into a
loop). (For a recursive language, the TM always halts.)

2. Enumerator: It enumerates (lists) the strings of the language.

========================================================================
TM that halts at a final state:
1. Final state + Halt → accepts, w ∈ L (TM)
2. Non-final state + Halt → does not accept, w ∉ L (TM)
3. ∞-loop / no Halt → cannot conclude w ∈ L (TM) or
w ∉ L (TM)

Halting TM (HTM):

 If w ∈ L, it halts at a final state.
 If w ∉ L, it halts at a non-final state.

An enumerator lists the strings of the language in an effective order, e.g., lexicographic order.

Construction of TM: How to remember a symbol
Practice Questions
1. Construct a TM for L = {aⁿbⁿ | n ≥ 1}

Number of states required: 5
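The 5-state construction can be checked mechanically with a small simulator. The sketch below is illustrative only; the state names q0–q4 and the marker symbols X, Y are our own choices, not fixed by the notes. The idea is the standard one: q0 marks an a as X, q1 walks right and marks the matching b as Y, q2 walks back, and q3 verifies that only Y's remain before accepting in q4.

```python
# Transition table: delta[(state, symbol)] = (new state, written symbol, move)
B = ' '  # blank symbol

delta = {
    ('q0', 'a'): ('q1', 'X', 'R'),   # mark an a
    ('q0', 'Y'): ('q3', 'Y', 'R'),   # all a's marked: go verify
    ('q1', 'a'): ('q1', 'a', 'R'),
    ('q1', 'Y'): ('q1', 'Y', 'R'),
    ('q1', 'b'): ('q2', 'Y', 'L'),   # mark the matching b
    ('q2', 'a'): ('q2', 'a', 'L'),
    ('q2', 'Y'): ('q2', 'Y', 'L'),
    ('q2', 'X'): ('q0', 'X', 'R'),   # back at the last X: repeat
    ('q3', 'Y'): ('q3', 'Y', 'R'),
    ('q3', B):   ('q4', B, 'R'),     # only Y's remained: accept in q4
}

def tm_accepts(w, max_steps=10_000):
    tape = dict(enumerate(w))
    state, head = 'q0', 0
    for _ in range(max_steps):
        if state == 'q4':
            return True                       # halted in the final state
        key = (state, tape.get(head, B))
        if key not in delta:
            return False                      # no move defined: halt and reject
        state, sym, move = delta[key]
        tape[head] = sym
        head += 1 if move == 'R' else -1
    return False

print([w for w in ['ab', 'aabb', 'aab', 'ba'] if tm_accepts(w)])
```

Exactly the strings aⁿbⁿ with n ≥ 1 are accepted; on any other input the machine halts in a non-final state because no transition is defined.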
========================================================================
2. Construct a TM for L = {aⁿbⁿcⁿ | n ≥ 1} → a PDA does not accept this language.

Number of states required: 6.
5. L = {Ending with 00}

========================================================================
6. L = {Ending with abb}

========================================================================
7. TM that accepts palindromes of strings over the alphabet ∑ = {a, b}

Q: # states for even palindrome ---------------------
Q: # states for odd palindrome ----------------------
========================================================================
8. 1’s complement of a binary number.

Compilers
A compiler is a program which translates a program written in one language (the source language)
into an equivalent program in another language (the target language). The source language is usually
a high-level language like Java, C, or Fortran, whereas the target language is machine code
that a computer's processor understands.

The source language is optimized for humans. It is more user-friendly and, to some extent, platform-
independent. High-level programs are easier to read, write and maintain; hence, it is easier to avoid errors.
Ultimately, programs written in a high-level language must be translated into machine language by
a compiler. The target machine language is efficient for hardware but lacks readability.
When a compiler runs on a machine and produces machine code for the same machine on which it is
running, it is called a self compiler or resident compiler. A compiler may run on one machine and
produce machine code for another machine; in that case it is called a cross compiler.

We use Compiler for following reasons:

 Translates from one representation of the program to another.


 Typically, from high level source code to low level machine code or object code.
 Source code is normally optimized for human readability.
 Machine code is optimized for hardware.
 Redundancy is reduced.

Pictorial Representation

 Compiler is part of program development environment


 The other typical components of this environment are editor, assembler, linker, loader,
debugger, profiler etc.
 The compiler (and all other tools) must support each other for easy program
development

All development systems are essentially a combination of many tools. For the compiler, the other
tools are the debugger, assembler, linker, loader, profiler, editor etc. If these tools have support for
each other, then program development becomes a lot easier.

This is how the various tools work in coordination to make programming easier and better. Each of them
has a specific task to accomplish in the process, from writing code to compiling it and running /
debugging it. After examining the debugging results, the programmer corrects the code manually if
needed. It is the combined contribution of these tools that makes programming a lot easier and more
efficient.

PHASES OF A COMPILER:
Compiler phases are the individual modules which are executed in chronological order to perform their
respective sub-activities, and which finally integrate the solutions to give the target code.
The following diagram depicts the phases of a compiler through which it goes during compilation.
The compiler has the following phases:

1. Lexical Analyzer (Scanner),


2. Syntax Analyzer (Parser),

3. Semantic Analyzer,
4. Intermediate Code Generator(ICG),
5. Code Optimizer(CO) , and
6. Code Generator(CG).

In addition to these, it also has Symbol table management and Error handler phases.
The phases of a compiler are divided into two parts: the first three phases are called the Analysis part,
and the remaining three are called the Synthesis part.
The analysis part converts source code to an intermediate representation.
The synthesis part constructs the desired target program from the intermediate representation and the
information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates as a sequence of phases,
each of which transforms one representation of the source program into another.
A typical decomposition of a compiler into phases is shown in Fig. In practice, several phases may be
grouped together, and the intermediate representations between the grouped phases need not be
constructed explicitly.

PHASES, PASSES OF A COMPILER:


In some applications we can have a compiler that is organized into what are called passes, where a pass is a
collection of phases that converts the input from one representation to a completely different representation.
Each pass makes a complete scan of the input and produces its output to be processed by the subsequent pass.
For example, a two-pass assembler.

Figure: Phases of a Compiler

LEXICAL ANALYZER (SCANNER): The Scanner is the first phase, works as an interface
between the compiler and the source language program, and performs the following functions:

o Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an identifier,
a keyword, a punctuation mark, or a multi-character operator like := .

o The character sequence forming a token is called a lexeme of the token.

o The Scanner generates a token-id, and also enters the identifier's name in the Symbol
table if it doesn't exist.

o It also removes the comments and unnecessary spaces.

 The format of a token is < Token name, Attribute value >

SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner and with its subsequent
phase, the Semantic Analyzer, and performs the following functions:

o Groups the received and recorded token stream into syntactic structures, usually
into a structure called a Parse Tree whose leaves are tokens.

o An interior node of this tree represents a stream of tokens that logically belongs
together.

o In other words, it checks the syntax of the program elements.

SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the
semantic correctness of the program. Though the tokens may be valid and syntactically correct, it
may happen that they are not correct semantically.

Therefore the semantic analyzer checks the semantics (meaning) of the statements formed.
o The syntactically and semantically correct structures are produced here in the form of a
syntax tree or DAG or some other sequential representation like a matrix.

INTERMEDIATE CODE GENERATOR(ICG): This phase takes the syntactically and


semantically correct structure as input, and produces its equivalent intermediate notation of the
source program. The Intermediate Code (IC) should have two important properties specified below:

o It should be easy to produce, and Easy to translate into the target program. Example
intermediate code forms are:

o Three address codes,

o Polish notations, etc.

CODE OPTIMIZER: This phase is optional in some Compilers, but so useful and beneficial in
terms of saving development time, effort, and cost. This phase performs the following specific
functions:
Attempts to improve the IC so as to produce faster machine code. Typical functions include
loop optimization, removal of redundant computations, strength reduction, frequency reduction,
etc.

 Sometimes the data structures used in representing the intermediate forms may
also be changed.

CODE GENERATOR:
This is the final phase of the compiler and generates the target code, normally consisting of the
relocatable machine code or Assembly code or absolute machine code.
 Memory locations are selected for each variable used, and assignment of variables to
registers is done.
 Intermediate instructions are translated into a sequence of machine instructions.

The Compiler also performs the Symbol table management and Error handling throughout the
compilation process. Symbol table is nothing but a data structure that stores different source
language constructs, and tokens generated during the compilation. These two interact with all
phases of the Compiler.

For example the source program is an assignment statement; the following figure shows how the
phases of compiler will process the program.

Fig: Translation of assignment statement for position=initial +rate*60

The input source program is Position=initial +rate*60

Lexical Analysis
 The first phase of the compiler is lexical analysis.
 The lexical analyzer breaks a sentence into a sequence of words or tokens and
ignores white spaces and comments.
 It generates a stream of tokens from the input.
 This is modelled through regular expressions and the structure is recognized
through finite state automata.
 If the token is not valid i.e., does not fall into any of the identifiable groups, then the
lexical analyser reports an error.
 Lexical analysis involves recognizing the tokens in the source program and reporting errors,
if any.
 Token: A token is a syntactic category. Sentences consist of a string of tokens. For
example constants, identifier, special symbols, keyword, operators etc are tokens.

 Lexeme: A sequence of characters forming a token is a lexeme. For example, 100.01,
counter, const, "How are you?" etc. are lexemes.

Interface to other phases


The lexical analyzer reads characters from the input and passes the tokens to the syntax
analyzer whenever it is asked for a token. For many source languages, there are occasions when
the lexical analyzer needs to look ahead several characters beyond the current lexeme for a
pattern before a match can be announced.

For example, > and >= cannot be distinguished merely on the basis of the first character
>. Hence there is a need to maintain a buffer of the input for look ahead and push back.

We keep the input in a buffer and move pointers over the input. Sometimes, we may also need to
push back extra characters due to this lookahead character.

Recognize tokens and ignore white spaces, comments

Generates token stream

Input Buffering
 The lexical analyzer scans the input string from left to right, one character at a time.
 It uses two pointers begin_ptr (bp) and forward_ptr (fp) to keep track of the portion of the
input string scanned.
 Initially both the pointers point the first character of the input string. bp

i n t a , b ; a = a + 5 ; b = b * 3 ;

fp
 The forward ptr moves by searching the end of lexeme.
 When it finds the blank space, it indicates end of lexeme.
 In the above example fp finds a blank space then the lexeme int is identified.

bp

i n t a , b ; a = a + 5 ; b = b * 3 ;

fp
 Then both the bp and fp are set at next token.
 The fp will be moved ahead at white space.
 When fp encounters white space, it ignores and moves ahead.

bp

i n t a , b ; a = a + 5 ; b = b * 3 ;

fp
 The input character is read from secondary memory. But reading from secondary
memory is costly.
 Hence, buffering technique is introduced.
 A block of data is first read into a buffer and then scanned by lexical analyser.
 There are two methods used: one buffer scheme and two buffer scheme.

One buffer scheme

 In one buffer scheme, only one buffer is used to store the input string.
 If the lexeme is very long then it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

bp

i n t a = a + 5

fp
one buffer scheme
It is the problem with this scheme.

Two buffer scheme

 To overcome the above said problem, two buffers are used to store the input string. The
first buffer and second buffer are scanned alternately.
 When the end of the current buffer is reached the other buffer is filled.
 The problem with this scheme is that if the length of the lexeme is longer than the length of
the buffer, then the input cannot be scanned completely.
 Initially, both the pointers bp and fp are pointing the first buffer. Then the fp moves
towards right in search of the end of lexeme.
 When blank character is identified, the string between bp and fp is identified as
corresponding token.
 To identify the boundary of the first buffer end of buffer character should be placed at the
end of first buffer.
 In the same way, end of second buffer is also recognized by the end of buffer mark
present at the end of second buffer.
 When fp encounters first eof, then one can recognize end of first buffer and hence filling
up of second buffer is started.
 In the same way when second eof is obtained then it indicates end of second buffer.
 Alternately, both the buffers can be filled up until end of the input program and stream
of tokens identified.
 This eof character introduced at the end is called a sentinel, which is used to identify the
end of the buffer.

Buffer 1 bp

i n t a = a + 5

; b = b + 1 ; eof

fp

Code for input buffering


if (fp == eof(buff1))            /* sentinel at the end of buffer 1 */
{
    /* refill buffer 2 */
    fp++;                        /* continue scanning in buffer 2 */
}
else if (fp == eof(buff2))       /* sentinel at the end of buffer 2 */
{
    /* refill buffer 1 */
    fp++;                        /* continue scanning in buffer 1 */
}
else if (fp == eof(input))       /* end of the input file */
    return;
else
{
    fp++;                        /* ordinary character: keep scanning */
}

Specification of Tokens
 For programming languages, there are many types of tokens. They are constants, identifiers,
symbols and so on. The token is normally represented by a pair of token type and token
value.
 The token type tells the category of token and token value gives us the information regarding
token. The token value is also called token attribute.
 Lexical analysis creates the symbol table. The token value can be a pointer into the symbol table in
the case of identifiers and constants.
 The lexical analyser reads the input program and generates table for tokens.

main()
{
int x = 10;
x = x * 5;
}

Lexeme Token
main Identifier
( Special symbol
) Special symbol
{ Special symbol
int Keyword
x Identifier
= Operator
10 Constant
; Special symbol
5 Constant
* Operator
} Special symbol

The white spaces and newline characters are ignored. This stream of tokens will be given to the
syntax analyser.
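The lexeme/token table above can be reproduced with a toy scanner. The sketch below is illustrative only; the regular expressions and category names are our own choices, and the punctuation characters ( ) { } ; are all grouped here as symbols.

```python
import re

# Token specification: order matters, so the keyword pattern is tried
# before the general identifier pattern.
TOKEN_SPEC = [
    ('KEYWORD',    r'\bint\b'),
    ('IDENTIFIER', r'[A-Za-z_]\w*'),
    ('CONSTANT',   r'\d+'),
    ('OPERATOR',   r'[=*+\-/]'),
    ('SYMBOL',     r'[(){};,]'),
    ('SKIP',       r'\s+'),
]
SCANNER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(src):
    """Return the <token-name, lexeme> pairs, ignoring white space."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(src)
            if m.lastgroup != 'SKIP']

print(tokenize('main() { int x = 10; x = x * 5; }'))
```

Each match carries its category in `m.lastgroup`, so the scanner emits exactly the pairs listed in the table while the SKIP category silently drops white space.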

Syntax Analysis
========================================================================
 This is the second phase of the compiler. It can also be called Syntax analyser / Syntax Phase /
Syntax Verification / Parsing /Parse Tree Generator.

 Definition of a Parser: A parser or syntax analyser is a process that takes the input string w and
either produces a parse tree or reports the syntactic errors.

Role of Parser
 The parser obtains a string of tokens from the lexical analyser and verifies that the string of token
names can be generated by the grammar for the source language.
 It checks for syntax errors, if any, and tries to recover from commonly occurring errors to continue
processing the rest of the program.
 The parser constructs a parse tree and passes it to the rest of the compiler for further processing.
 Error reporting and recovery form an important part of the syntax analyser.
 The error handler in the parser has the following goals -
i. It should report the presence of errors clearly and accurately.
ii. It should recover from each error quickly to detect subsequent errors.
iii. It should not slow down the processing of correct programs.
 This phase is modelled with context free grammars and the structure is recognized with push down
automata or table-driven parsers.

Context Free Grammar
A context free grammar has four components G = (V,T,P,S)
V→A set of non-terminals. Non-terminals are syntactic variables that denote sets of strings. The non-
terminals define sets of strings that help define the language generated by the grammar.
T→A set of tokens, known as terminal symbols. Terminals are the basic symbols from which strings are
formed.
P→A set of productions. The productions of a grammar specify the manner in which the terminals and non-
terminals can be combined to form strings. Each production consists of a non-terminal called the left side of
the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the
production.
S→A designation of one of the non-terminals as the start symbol, and the set of strings it denotes is the
language defined by the grammar.
Derivation: The process of deriving a string / sentence from a grammar is called derivation. It generates
strings / sentences of a language.
Input: id + id * id
Given Grammar
E →E + E / E * E / id
It can generate two leftmost derivations for the sentence id + id * id.
E → E + E
→ id + E
→ id + E * E
→ id + id * E
→ id + id * id

There are two types of derivations
1. Left Most Derivation (LMD) 2. Right Most Derivation (RMD)
Left Most Derivation The left most derivation is a derivation in which the leftmost non terminal is
replaced first from the sentential form.
Eg:
E →E +E
→ id + E
→ id + E * E
→ id + id * E
→ id + id * id
Right Most Derivation The right most derivation is a derivation in which the rightmost non
terminal is replaced first from the sentential form.
E →E + E
→ E + E * E
→ E + E * id
→ E + id * id
→ id + id * id
Ambiguous Grammar
Depending on Number of Derivation trees, CFGs are sub-divided into 2 types:
 Ambiguous grammars
 Unambiguous grammars
A CFG is said to ambiguous if there exists more than one derivation tree for the given input string i.e., more
than one Left Most Derivation Tree (LMDT) or Right Most Derivation Tree (RMDT).
Definition: A CFG G = (V,T,P,S) is said to be ambiguous if and only if there exists a string in T* that has
more than one parse tree.
where V is a finite set of variables.
T is a finite set of terminals.
P is a finite set of productions of the form A -> α, where A is a variable and α ∈ (V ∪ T)*.
S is a designated variable called the start symbol.
Example: Two derivations of id + id * id:

Derivation 1:
E → E + E
→ id + E
→ id + E * E
→ id + id * E
→ id + id * id

Derivation 2:
E → E * E
→ E + E * E
→ id + E * E
→ id + id * E
→ id + id * id
We can create 2 parse trees from this grammar for the string id + id * id. The following
are the 2 parse trees generated by leftmost derivation:

Both the above parse trees are derived from same grammar but both parse trees are different. Hence the
grammar is ambiguous.
Disambiguate the grammar i.e., rewriting the grammar such that there is only one derivation or parse tree
possible for a string of the language which the grammar represents.

Removing Left Recursion


A grammar is left recursive if it has a non-terminal (variable) A such that there is a derivation A -> Aα | β,
where α ∈ (V ∪ T)* and β ∈ (V ∪ T)* (β is a sequence of terminals and non-terminals that does not start with A).
Due to the presence of left recursion some top-down parsers enter into an infinite loop, so we have to eliminate
left recursion.

Left recursion (A → Aα | β)          Right recursion (A → αA | β)
Language generated: βα*              Language generated: α*β

A( )                                 A( )
{                                    {
    A( );                                α;
    α;                                   A( );
}                                    }
Infinite loop may occur              No problem of infinite loop

Let the productions be of the form A -> Aα1 | Aα2 | Aα3 | ….. | Aαm | β1 | β2 | …. | βn
where no βi begins with an A. Then we replace the A-productions by
A -> β1 A’ | β2 A’ | ….. | βn A’
A’ -> α1A’ | α2A’ | α3A’ | ….. | αmA’ | ε
The nonterminal A generates the same strings as before but is no longer left recursive.
Examples
1. E →E + T | T
T →T * F | F
F→ id → left recursive
Now, after removing left recursion we get the equivalent non-left-recursive grammar below:
E→TE'
E' →+TE'| ε
T →FT'
T' →*FT'| ε
F→ID
2. S → S0S1S | 01
Now, after removing left recursion we get the equivalent non-left-recursive grammar below:
S → 01S'
S'→0S1S S'| ε

3. S → S( L ) | x
Now, after removing left recursion we get the equivalent non-left-recursive grammar below:
S → xS'
S'→ ( L )S ' | ε
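The transformation above can be sketched as a small routine. This is an illustrative sketch, not part of the notes: the helper name and the encoding of productions as lists of symbols are our own choices.

```python
def remove_left_recursion(nonterm, productions):
    """Productions are lists of symbols, e.g. ['E', '+', 'T'] for E -> E + T."""
    alphas = [p[1:] for p in productions if p and p[0] == nonterm]   # A -> A alpha
    betas  = [p     for p in productions if not p or p[0] != nonterm]
    if not alphas:
        return {nonterm: productions}        # not left recursive: unchanged
    new = nonterm + "'"
    return {
        nonterm: [beta + [new] for beta in betas],               # A  -> beta A'
        new:     [alpha + [new] for alpha in alphas] + [['ε']],  # A' -> alpha A' | ε
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | ε  (as in Example 1 above)
print(remove_left_recursion('E', [['E', '+', 'T'], ['T']]))
```

Applying it to the E-productions reproduces the E → TE', E' → +TE' | ε result shown above; a grammar with no left-recursive alternative is returned unchanged.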
Removing Left Factoring
The process of converting a non-deterministic grammar to a deterministic grammar is called left
factoring.
A grammar needs left factoring when it is of the form
A -> αβ1 | αβ2 | αβ3 | …… | αβn | γ i.e. several productions start with the same terminal (or string of symbols). On
seeing the input α we cannot immediately tell which production to choose to expand A.
 Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive or top-down parsing.
 When the choice between two alternative A-productions is not clear, we may be able to rewrite the
productions to defer the decision until enough of the input has been seen to make the right choice.
For the grammar A -> αβ1 | αβ2 | αβ3 | …… | αβn | γ The
equivalent left factored grammar will be –
A → αA' | γ
A' → β1 | β2 | β3 | …… | βn
Eg:
S → iEtS | iEtSeS
E → b
Now after left factoring, we get the grammar below:
S → iEtSS'
S' → ε | eS
E → b
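Left factoring, too, can be sketched in a few lines. The code below is an illustrative sketch with our own helper names; it groups alternatives by their first symbol, pulls out the longest shared prefix, and moves the differing tails into a fresh non-terminal.

```python
from functools import reduce

def common_prefix(a, b):
    """Longest common leading segment of two symbol lists."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor(nonterm, productions):
    groups = {}
    for p in productions:                    # group alternatives by first symbol
        groups.setdefault(p[0], []).append(p)
    rules = {nonterm: []}
    for alts in groups.values():
        if len(alts) == 1:
            rules[nonterm].append(alts[0])   # nothing to factor
            continue
        prefix = reduce(common_prefix, alts)           # the shared alpha
        new = nonterm + "'"
        rules[nonterm].append(prefix + [new])                  # A  -> alpha A'
        rules[new] = [p[len(prefix):] or ['ε'] for p in alts]  # A' -> tails | ε
    return rules

# S -> iEtS | iEtSeS   becomes   S -> iEtS S',  S' -> ε | eS
print(left_factor('S', [['i','E','t','S'], ['i','E','t','S','e','S']]))
```

On the example above it yields S → iEtSS' with S' → ε | eS, matching the hand-derived result.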

Parsing / Types of parsers
Parsing is the process of analyzing a stream of input to determine its grammatical structure for a given
grammar. The task of the parser is to determine how the input can be derived from the start symbol using the
rules of the grammar. This can be done in essentially two ways:
 Top-down parsing - Construction of the parse tree starts at the root /start symbol and proceeds
towards leaves / terminals.

 Bottom-up parsing - Construction of the parse tree starts from the leaf nodes / terminals and proceeds
towards root / start symbol.

Example

Top down Parser Bottom up Parser

 Construction of a parse tree is done by starting at the root, labeled by the start symbol.
 Repeat the following two steps:
- At a node labeled with non-terminal A, select one of the productions of A and
construct the children nodes.
- Find the next node at which a subtree is to be constructed.

Parse array [ num dotdot num ] of integer using the grammar:
type → simple | id | array [ simple ] of type

simple → integer | char | num dotdot num

 Initially, the token array is the look ahead symbol and the known part of the parse tree consists of
the root, labelled with the starting non- terminal type.
 For a match to occur, non-terminal type must derive a string that starts with the look ahead symbol
array. In the grammar, there is just one production of such type, so we select it, and construct the
children of the root labelled with the right side of the production.
 In this way we continue, when the node being considered on the parse tree is for a terminal and the
terminal matches the look ahead symbol, then we advance in both the parse tree and the input.
 The next token in the input becomes the new look ahead symbol and the next child in the parse tree
is considered.

BackTracking
 A backtracking parser tries different production rules to find a match for the input string,
backtracking each time a rule fails.
 It is more powerful than predictive parsing.
 But it is slower and may require exponential time.
 Hence, it is not preferred for practical compilers.

Example
S → cAd
A → ab / a
Input string: w = cad

Advantages: It is very easy to implement
Disadvantages: Time consuming
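The backtracking idea on the grammar S → cAd, A → ab | a can be sketched directly. This is an illustrative sketch (the set-of-positions encoding is our own): each alternative is retried from the same input position, which is exactly the "rewind and try the next rule" behaviour described above.

```python
GRAMMAR = {'S': [['c', 'A', 'd']], 'A': [['a', 'b'], ['a']]}

def parse(sym, s, i):
    """Set of input positions reachable after deriving `sym` from position i."""
    if sym not in GRAMMAR:                       # terminal: must match s[i]
        return {i + 1} if i < len(s) and s[i] == sym else set()
    out = set()
    for alt in GRAMMAR[sym]:                     # try every alternative...
        positions = {i}                          # ...from the same start: backtracking
        for x in alt:
            positions = {j for p in positions for j in parse(x, s, p)}
        out |= positions
    return out

def accepts(w):
    return len(w) in parse('S', w, 0)            # S must consume the whole input

print(accepts('cad'), accepts('cd'))
```

For w = cad the first alternative A → ab fails on the d, the parser falls back to A → a, and the parse succeeds.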

Recursive Descent Parser


 A parser that uses a collection of recursive procedures for parsing the given input string is called a
Recursive Descent Parser.
 The context free grammar is used to build the recursive routines.
 The RHS of each production rule is directly converted to program code.
 For each non-terminal, a separate procedure is written, and the body of the procedure (code) is the
RHS of the corresponding non-terminal.

Steps for construction of RD Parser


1. If the input symbol is a non-terminal, then call the procedure corresponding to that non-
terminal.
2. If the input symbol is a terminal, then match it with the look-ahead from the input.
The look-ahead pointer points to the next symbol.
3. If the production rule has many alternatives, then combine them into a single body
of procedure.
4. The parser should be activated by a procedure corresponding to the start symbol.
Example:

E → iE’
E’ → +i E’ | ε

Procedures
E( )
{
    if (l == ‘i’)
    {
        match ( ‘i’ );
        E’ ( );
    }
}

E’( )
{
    if (l == ‘+’)
    {
        match (‘+’);
        match (‘i’);
        E’ ( );
    }
    else return;
}

match (char t)
{
    if (l == t)
        l = getchar();
    else
        printf (“ERROR”);
}

main( )
{
    E ( );
    if (l == ‘$’)
        printf(“ Parsing Successful ”);
}

Note: the alternative in E’ is not E but ε; when the look-ahead is not ‘+’, E’( ) simply returns.
Recursive Descent Parser

LL(1) or Predictive Parser

Algorithm to construct LL(1) Parsing Table:

Step 1: First check that the grammar is not non-deterministic, not left recursive and not ambiguous.
Step 2: Calculate First() and Follow() for all non-terminals.
First(): If we try to derive all the strings from a variable, the terminal symbols that can appear at the
beginning form the First set of that variable.
Follow(): The terminal symbols which can follow a variable in the process of derivation form the Follow set.
Step 3: For each production A –> α. (A tends to alpha)
1. Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each terminal in Follow(A), make
entry A –> ε in the table.
3. If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A –> ε in the table for the $.
To construct the parsing table, we use the two functions First() and Follow().
In the table, rows contain the non-terminals and columns contain the terminal symbols. All
the null (ε) productions of the grammar go under the Follow elements, and the remaining productions
lie under the elements of the First set.

Example 1: Consider the Grammar:

E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)

*ε denotes epsilon
Step 1: The grammar satisfies all properties in step 1.
Step 2: Calculate first() and follow().
Find their First and Follow sets:
Production        First          Follow

E → TE'           { id, ( }      { $, ) }

E' → +TE' | ε     { +, ε }       { $, ) }

T → FT'           { id, ( }      { +, $, ) }

T' → *FT' | ε     { *, ε }       { +, $, ) }

F → id | (E)      { id, ( }      { *, +, $, ) }
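The First() computation of Step 2 can be sketched as a fixed-point iteration over the example grammar. This is an illustrative sketch (the grammar encoding and function name are our own); it keeps adding symbols to the First sets until nothing changes, and matches the First column of the table above.

```python
GRAMMAR = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], ['ε']],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], ['ε']],
    'F':  [['id'], ['(', 'E', ')']],
}

def first_sets(g):
    first = {A: set() for A in g}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for A, alts in g.items():
            for alt in alts:
                for X in alt:
                    f = first[X] if X in g else {X}   # terminal: First is itself
                    before = len(first[A])
                    first[A] |= f - {'ε'}
                    if len(first[A]) != before:
                        changed = True
                    if 'ε' not in f:     # X cannot vanish: stop scanning this alt
                        break
                else:                    # every symbol of the alt may derive ε
                    if 'ε' not in first[A]:
                        first[A].add('ε')
                        changed = True
    return first

print(first_sets(GRAMMAR))
```

The result gives First(E) = {id, (}, First(E') = {+, ε}, First(T') = {*, ε}, and so on, agreeing with the table.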

25
Step 3: Make a parser table.
Now, the LL(1) Parsing Table is:

id + * ( ) $

E E –> TE’ E –> TE’

E’ E’ –> +TE’ E’ –> ε E’ –> ε

T T –> FT’ T –> FT’

T’ T’ –> ε T’ –> *FT’ T’ –> ε T’ –> ε

F F –> id F –> (E)
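The table above can drive a simple stack-based predictive parser. The sketch below is illustrative (the dictionary encoding and function name are our own; [] encodes an ε-production): the parser repeatedly pops the stack top, matches it against the look-ahead if it is a terminal, and otherwise expands it using table[top][lookahead].

```python
# LL(1) table entries copied from the worked example above; [] means ε.
table = {
    'E':  {'id': ['T', "E'"], '(': ['T', "E'"]},
    "E'": {'+': ['+', 'T', "E'"], ')': [], '$': []},
    'T':  {'id': ['F', "T'"], '(': ['F', "T'"]},
    "T'": {'+': [], '*': ['*', 'F', "T'"], ')': [], '$': []},
    'F':  {'id': ['id'], '(': ['(', 'E', ')']},
}

def ll1_parse(tokens):
    stack = ['$', 'E']                 # start symbol on top of the end marker
    tokens = tokens + ['$']
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:           # terminal (or $) matches: consume input
            i += 1
        elif top in table and tokens[i] in table[top]:
            stack.extend(reversed(table[top][tokens[i]]))   # expand non-terminal
        else:
            return False               # empty table cell: syntax error
    return i == len(tokens)

print(ll1_parse(['id', '+', 'id', '*', 'id']))
```

Running it on id + id * id succeeds, while inputs such as id + are rejected because the parser reaches an empty table cell.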

Bottom Up Parser
1. Shift Reduce Parser

The parser constructs a parse tree from the leaves to the root; thus, it works on the principle of bottom-up
parsing. The parser requires the following data structures:
1. Input Buffer: It stores the input string.
2. Stack: It is used to store and access the LHS and RHS of rules.

The initial configuration of a Shift Reduce Parser is as follows:

Stack: $          Input Buffer: w $

(On acceptance the configuration becomes Stack: $ S, Input Buffer: $.)

Parsing Algorithm (Actions)


1. Shift: Moving of the symbols from input buffer onto the stack. This action is called shift action.
2. Reduce: If the handle appears on the top of the stack then reduction of it by appropriate rule. That
means RHS of rule is popped off and LHS is pushed in. This action is called reduce action.
3. Accept: If stack contains start symbol(S) and input buffer is empty ($) then the action is called
accept. When accept state is obtained in the process of parsing, it means that a successful parsing is
done.
4. Error: A situation in which the parser can neither shift nor reduce the symbols, and cannot
perform the accept action, is called an error.

Handle: A substring of the sentential form that matches the RHS of a production, and whose
reduction represents one step of a rightmost derivation in reverse, is called a handle.

Handle Pruning: In the derivation, replacing the handle with its LHS of a production is called Handle
Pruning.
LR(k) Parsers(Bottom up parser)

Types of LR (k) Parsers

 Every LR (0) parser is SLR (1) parser but may not be reverse.
 Every SLR (1) parser is LALR (1) parser but may not be reverse.
 Every LALR (1) parser is CLR (1) parser but may not be reverse.
 Every LL (1) parser is also LR (1) but every LR (1) need not be LL (1).

Bottom up parsing Configuration

Types of Items

Handle: RHS of a production is said to be handle.

A → a AB e
LHS Handle
Closure ( ): When there is a dot to the left of a variable, add all the productions of that variable (with the dot at the beginning).

S’ → .S
S → .AA
A → .aA
A → .b

Goto ( ): Moving the dot from the left of a symbol to its right, e.g.,
goto (I0, S) gives S’ → S.

 LR parsing is divided into four parts: LR (0) parsing, SLR parsing, CLR parsing and
LALR parsing.

LR ( 0 ) Parser

 LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
 In LR parsing, "L" stands for left-to-right scanning of the input and "R" for constructing a
rightmost derivation in reverse.
 An LR (0) item is a production of G with a dot at some position on the right side of the production.
LR(0) items are useful to indicate how much of the input has been scanned up to a given point in
the process of parsing.
 In LR (0), we place the reduce move in the entire row.
 To construct a parsing table for LR (0) and SLR (1), we use the canonical collection of LR (0) items.
 To construct a parsing table for LALR (1) and CLR (1), we use the canonical collection of LR (1) items.

Eg: Construct LR (0) parsing table for the given grammar

S→AA
A→aA
A→b

Solution:
Step 1: Constructing augmented grammar.

Step 2:

Note: The number of states in the FA equals the number of rows in the parsing table.

Step 3:

Constructing parse table for LR(0)

STATE    ACTION part           GOTO part
         a    b    $           S    A
0        S3   S4               1    2
1                  Accept
2        S3   S4               5
3        S3   S4               6
4        R3   R3   R3
5        R1   R1   R1
6        R2   R2   R2

Step 4: Parsing the input string w = a a b b $

Stack:
$ 0 a 3 a 3 b 4 A 6 A 6 A 2 b 4 A 5 S 1
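The shift-reduce moves traced above can be reproduced with a small table-driven driver. The sketch below hard-codes the LR(0) table from Step 3 in a hypothetical, minimal encoding (names like `lr_parse` are illustrative); productions are numbered 1: S → AA, 2: A → aA, 3: A → b.

```python
# (lhs, rhs length) per numbered production; used when reducing.
PRODS = {1: ("S", 2), 2: ("A", 2), 3: ("A", 1)}

# ACTION entries: ("s", state) = shift, ("r", prod) = reduce, "acc" = accept.
ACTION = {
    (0, "a"): ("s", 3), (0, "b"): ("s", 4),
    (1, "$"): "acc",
    (2, "a"): ("s", 3), (2, "b"): ("s", 4),
    (3, "a"): ("s", 3), (3, "b"): ("s", 4),
    (4, "a"): ("r", 3), (4, "b"): ("r", 3), (4, "$"): ("r", 3),
    (5, "a"): ("r", 1), (5, "b"): ("r", 1), (5, "$"): ("r", 1),
    (6, "a"): ("r", 2), (6, "b"): ("r", 2), (6, "$"): ("r", 2),
}
GOTO = {(0, "S"): 1, (0, "A"): 2, (2, "A"): 5, (3, "A"): 6}

def lr_parse(tokens):
    stack = [0]                      # stack of states; starts in state 0
    i = 0
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act == "acc":
            return True
        if act is None:
            return False             # blank table entry = error
        if act[0] == "s":            # shift: push the new state, consume input
            stack.append(act[1])
            i += 1
        else:                        # reduce: pop |rhs| states, then goto on LHS
            lhs, n = PRODS[act[1]]
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])

print(lr_parse(list("aabb") + ["$"]))   # True
print(lr_parse(list("ab") + ["$"]))     # False: "ab" gives only one A, S needs AA
```

The same driver works unchanged for the SLR, CLR and LALR tables in the following sections; only the ACTION/GOTO contents differ.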

SR Conflicts: LR(0) puts a reduce production in the entire row, for every terminal. If some column in the action part (a terminal) contains both a shift and a reduce move, it is a "shift-reduce" (s/r) conflict.

Reduce-Reduce Conflicts: A reduce/reduce conflict occurs if there are two or more rules that apply to the same sequence of input. This usually indicates a serious error in the grammar.

SLR (1) Parser

 The SLR parser is similar to the LR(0) parser except for the reduce entries: a reduce
production is written only under the terminals in FOLLOW of the variable being reduced.
Construction of the SLR parsing table –
 Construct C = { I0, I1, ……. In }, the collection of sets of LR(0) items for G’.
 State i is constructed from Ii. The parsing actions for state i are determined as follows:
o If [ A → α.aβ ] is in Ii and GOTO(Ii , a) = Ij , then set ACTION[i, a] to “shift j”. Here a must
be a terminal.
o If [ A → α. ] is in Ii, then set ACTION[i, a] to “reduce A → α” for all a in
FOLLOW(A); here A may not be S’.
o If [ S’ → S. ] is in Ii, then set ACTION[i, $] to “accept”. If any conflicting actions are generated
by the above rules, we say that the grammar is not SLR.
 The goto transitions for state i are constructed for all non-terminals A using the rule:
if GOTO( Ii , A ) = Ij then GOTO[i, A] = j.
 All entries not defined by rules 2 and 3 are made “error”.
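Since the reduce entries depend on FOLLOW sets, it helps to see how those are computed. Below is a minimal sketch for this example grammar (single-character symbols, no ε-productions); the function names are illustrative, not from any library.

```python
# Grammar from the notes: S' -> S, S -> AA, A -> aA | b
GRAMMAR = {"S'": ["S"], "S": ["AA"], "A": ["aA", "b"]}
NT = set(GRAMMAR)

def first(sym):
    """FIRST of a single symbol; with no epsilon rules, only the
    leading symbol of each alternative matters."""
    if sym not in NT:
        return {sym}
    out = set()
    for rhs in GRAMMAR[sym]:
        out |= first(rhs[0])
    return out

def follow():
    """Iterate to a fixed point: terminals that can appear right after
    each nonterminal in some sentential form."""
    flw = {n: set() for n in NT}
    flw["S'"] = {"$"}                 # end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhss in GRAMMAR.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in NT:
                        continue
                    # FIRST of what follows, or FOLLOW(lhs) at the end
                    new = first(rhs[i + 1]) if i + 1 < len(rhs) else flw[lhs]
                    if not new <= flw[sym]:
                        flw[sym] |= new
                        changed = True
    return flw

print(follow())   # FOLLOW(S) = {$}, FOLLOW(A) = {a, b, $}
```

This explains the table below: R1 (S → AA) appears only under $, while R3 and R2 (the A-productions) appear under a, b and $.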

Eg: Construct SLR (1) parsing table for the given grammar
S → A A
A → a A
A → b

Solution:
Step 1: Constructing the augmented grammar by adding S’ → S.
Step 2: Computing the canonical collection of LR (0) items, starting from the closure of the initial item:
S’ → .S
S → .A A
A → .a A
A → .b

Step 3: Constructing Parsing table for SLR(1) Parser

         Action part               Go to part
States   a       b       $         S       A
0        S3      S4                1       2
1                        Accept
2        S3      S4                        5
3        S3      S4                        6
4        R3      R3      R3
5                        R1
6        R2      R2      R2

Step 4: Parsing the string w = a a b b $

$ 0 a 3 a 3 b 4 A 6 A 6 A 2 b 4 A 5 S 1

Thus the string is parsed.

CLR ( 1) Parser

 CLR refers to Canonical LR. CLR parsing uses the canonical collection of LR (1) items to
build the CLR (1) parsing table.
 The CLR (1) parsing table produces more states than the SLR (1) parsing table.
 In CLR (1), we place a reduce entry only under the lookahead symbols of the reduced item.
 In CLR parsing we use LR(1) items. An LR(k) item is defined to be an item using lookaheads
of length k. So an LR(1) item is comprised of two parts: the LR(0) item and the lookahead
associated with the item. LR(1) parsers are more powerful parsers.

Construction of CLR parsing table-

Input – augmented grammar G’


1. Construct C = { I0, I1, ……. In }, the collection of sets of LR(1) items for G’.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
i) If [ A → α.aβ, b ] is in Ii and GOTO(Ii , a) = Ij, then set ACTION[i, a] to “shift j”. Here a must
be a terminal.
ii) If [ A → α. , a ] is in Ii , A ≠ S’, then set ACTION[i, a] to “reduce A → α”.
iii) If [ S’ → S. , $ ] is in Ii, then set ACTION[i, $] to “accept”.
If any conflicting actions are generated by the above rules, we say that the grammar is
not CLR.
3. The goto transitions for state i are constructed for all non-terminals A using the rule: if GOTO( Ii,
A ) = Ij then GOTO[i, A] = j.
4. All entries not defined by rules 2 and 3 are made “error”.

Eg: Construct CLR (1) Parsing Table for the Given Grammar
S → A A
A → a A
A → b

Parsing Table for CLR(1)
         Action part               Go to part
States   a       b       $         S       A
0        S3      S4                1       2
1                        Accept
2        S6      S7                        5
3        S3      S4                        8
4        R3      R3
5                        R1
6        S6      S7                        9
7                        R3
8        R2      R2
9                        R2

Parsing the input string w = a a b b $

Stack:
$ 0 a 3 a 3 b 4 A 8 A 8 A 2 b 7 A 5 S 1

SR Conflict: If a state has a reduction, and there is a shift from that state on a terminal that is the same as the lookahead of the reduction, then this leads to multiple entries in the parsing table, i.e., a conflict. The LALR parser is the same as the CLR parser with one difference: merging of states.

LALR (1) Parser

1. LALR means Look-Ahead LR parser. It is a bottom-up parser. The idea of the parser is to build the LALR
parsing table.
2. The LALR parser is slightly less powerful than CLR but more powerful than the SLR parser.
3. The number of rows in an LALR table equals that of SLR and is less than or equal to that of a CLR table.
4. The CLR parser avoids conflicts in the parsing table, but it produces a greater number of states than the
SLR parser; hence its table occupies more space in memory.
5. Since LALR is almost as powerful as CLR, LALR parsing can be used instead; it obtains smaller
parsing tables than CLR parse tables.
6. LALR parse tables are constructed from LR (1) items.
7. The LR (1) items that have the same productions (the same core) but different lookaheads are combined
to form a single set of items.
8. This merging cannot introduce shift-reduce conflicts, but reduce-reduce conflicts may occur.

Eg: Construct LALR (1) Parsing table for the given grammar

In the above diagrams, some states have the same core items and can be merged. For example, I3 and I6
have the same items; they are merged as I36.

In the same way, I4 and I7 have the same items. They are merged as I47.

Similarly, the states I8 and I9 differ only in their lookaheads. Hence I8 and I9 are combined to form the
state I89.
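The merging step itself is mechanical: group LR(1) states by their LR(0) core (the items with lookaheads stripped) and union the lookaheads within each group. A small illustrative sketch, using states I4, I7 and I5 of this example (the representation and names are assumptions of this sketch):

```python
from collections import defaultdict

def core(state):
    """The LR(0) core of an LR(1) state: items with lookaheads stripped."""
    return frozenset((l, r, d) for l, r, d, _ in state)

def merge_states(states):
    """Group states by core; union the items so all lookaheads survive."""
    groups = defaultdict(set)
    for st in states:
        groups[core(st)] |= set(st)
    return [frozenset(s) for s in groups.values()]

# I4 = {A -> b., a/b}, I7 = {A -> b., $}: same core, different lookaheads.
I4 = frozenset({("A", "b", 1, "a"), ("A", "b", 1, "b")})
I7 = frozenset({("A", "b", 1, "$")})
# I5 = {S -> AA., $}: a different core, left alone.
I5 = frozenset({("S", "AA", 2, "$")})

merged = merge_states([I4, I7, I5])
print(len(merged))   # 2: I4 and I7 collapse into one state (I47), I5 survives
```

After merging, the reduce A → b in state I47 carries the union of lookaheads {a, b, $}, which is exactly row 47 of the LALR table below.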

LALR (1) Parsing Table

         Action part               Go to part
States   a       b       $         S       A
0        S36     S47               1       2
1                        Accept
2        S36     S47                       5
36       S36     S47                       89
47       R3      R3      R3
5                        R1
89       R2      R2      R2

Thus, the resultant LALR parse table, formed from the LR(1) items after merging states 3 and 6, 4 and 7, and 8 and 9, is shown above.
Merging states can never produce a shift-reduce conflict; however, it can produce a reduce-reduce (RR) conflict.
The number of rows has been reduced from 10 to 7.

Parsing the input string w = a a b b $

Stack:
$ 0 a 36 a 36 b 47 A 89 A 89 A 2 b 47 A 5 S 1

RR Conflict: Merging reduces the power of the parser, because losing the lookahead distinctions can
confuse the parser as to which grammar rule to pick next, resulting in reduce/reduce conflicts. All
conflicts that arise in applying an LALR(1) parser to an unambiguous LR(1) grammar are reduce/reduce
conflicts.

