Department of Computer Science & Engineering
Course File: Compiler Design

UNIT-I

INTRODUCTION TO LANGUAGE PROCESSING:
As computers became an inevitable and indispensable part of human life, and several languages with different and more advanced features evolved to make communication with the machine easier and more comfortable, the development of translator (mediator) software became essential to bridge the huge gap between human and machine understanding. This process is called Language Processing, a name that reflects the goal and intent of the process. To understand it better, we must first be familiar with some key terms and concepts, explained in the following sections.

LANGUAGE TRANSLATORS:
A language translator is a computer program which translates a program written in one (source) language into its equivalent program in another (target) language. The source program is in a high-level language, whereas the target language can be anything from the machine language of a target machine (anywhere between a microprocessor and a supercomputer) to another high-level language. Two commonly used translators are the Compiler and the Interpreter.

1. Compiler: A compiler is a program that reads a program in one language, called the source language, and translates it into its equivalent program in another language, called the target language. In addition, it presents error information to the user.
[Figure: A compiler takes a source program and produces an equivalent program in the target language (relocatable object or target program), along with error information.]
If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.
Figure 1.1: Running the target program.

2. Interpreter: An interpreter is another commonly used language processor. Instead of producing a target program as a single translation unit, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
Figure 1.2: Running the source program through an interpreter.

LANGUAGE PROCESSING SYSTEM:
Based on the input a translator takes and the output it produces, a language translator can be called any one of the following.

Preprocessor: A preprocessor takes the skeletal source program as input and produces an extended version of it, which is the result of expanding the macros and manifest constants, if any, and of including the header files in the source file. For example, the C preprocessor is a macro processor that is used automatically by the C compiler to transform our source before actual compilation. In addition, a preprocessor performs the following activities:
1. Collects all the modules and files, in case the source program is divided into different modules stored in different files.
2. Expands shorthands / macros into source language statements.

Compiler: A translator that takes as input a source program written in a high-level language and converts it into its equivalent target program in machine language. In addition to the above, the compiler also:
1. Reports to its user the presence of errors in the source program.
2. Facilitates the user in rectifying the errors and executing the code.

Assembler: A program that takes as input an assembly language program and converts it into the equivalent machine language code.

Loader / Linker: A program that takes as input relocatable code, collects the library functions and relocatable object files, and produces the equivalent absolute machine code.
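As a concrete illustration of the preprocessor's role (our own example, not from the course file), consider the C fragment below. The preprocessor expands the header inclusion and the manifest constant MAX_SCORE before the compiler proper ever sees them; with gcc, the -E option stops after this stage and shows the expanded source.

#include <stdio.h>
#define MAX_SCORE 100    /* manifest constant, expanded by the preprocessor */

int main(void) {
    /* After preprocessing, the compiler proper sees: printf("%d\n", 100); */
    printf("%d\n", MAX_SCORE);
    return 0;
}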
Specifically:
1. Loading consists of taking the relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations.
2. Linking allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library routines provided by the system, available to any program that needs them.

In addition to these translators, programs like interpreters, text formatters, etc., may be used in a language processing system. To translate a program in a high-level language to an executable one, the compiler performs by default the compile and linking functions. Normally, the steps in a language processing system include preprocessing the skeletal source program, which produces an extended or expanded source program (a ready-to-compile unit), followed by compiling the result, then linking / loading, and finally producing its equivalent executable code. As said earlier, not all these steps are mandatory; in some cases the compiler performs the linking and loading functions implicitly. The steps involved in a typical language processing system can be understood from the following diagram.

Source Program [Example: filename.c]
-> Preprocessor
-> Modified Source Program [Example: filename.c]
-> Compiler
-> Target Assembly Program
-> Assembler
-> Relocatable Machine Code [Example: filename.obj]
-> Loader/Linker (with library files and relocatable object files)
-> Target Machine Code [Example: filename.exe]
Figure 1.3: Context of a compiler in a language processing system.

TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, compilers can be classified into the following types.

Traditional Compilers (C, C++, Pascal): These compilers convert a source program in a HLL into its equivalent in native machine code or object code.

Interpreters (LISP, SNOBOL, Java 1.0): These first convert the source code into intermediate code, and then interpret (emulate) it to its equivalent machine code.

Cross-Compilers: These are compilers that run on one machine and produce code for another machine.

Incremental Compilers: These compilers separate the source into user-defined steps, compiling / recompiling step by step and interpreting the steps in a given order.

Converters (e.g., COBOL to C++): These programs compile from one high-level language to another.

Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an intermediate language (byte code, MSIL) to executable code or native machine code. They perform type-based verification, which makes the executable code more trustworthy.

Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native code for Java and .NET.

Binary Compilation: These compilers compile object code of one platform into object code of another platform.

PHASES OF A COMPILER:
Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of compilation phases. The phases communicate with each other via clearly defined interfaces. Generally, an interface contains a data structure (e.g., a tree) and a set of exported functions.
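A rough sketch of such an interface in C (every name here is invented for illustration, not taken from the course file): the scanner phase could expose the token data structure it shares with the parser, plus the functions it exports.

/* scanner.h: a hypothetical interface between the scanner and the parser. */
#ifndef SCANNER_H
#define SCANNER_H

enum token_name { ID, NUMBER, IF, THEN, ELSE, RELOP, END_OF_INPUT };

struct token {
    enum token_name name;   /* the abstract kind of lexical unit */
    int attribute;          /* e.g. a symbol-table index for an ID */
};

void scanner_init(const char *source_path);   /* exported functions */
struct token get_next_token(void);            /* called by the parser */

#endif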
Each phase works on an abstract intermediate representation of the source program, not the source program text itself (except the first phase). Compiler phases are the individual modules that are chronologically executed to perform their respective sub-activities, finally integrating the solutions to give the target code. It is desirable to have relatively few phases, since it takes time to read and write intermediate files. The following diagram (Figure 1.4) depicts the phases a compiler goes through during compilation. A typical compiler therefore has the following phases:
1. Lexical Analyzer (Scanner), 2. Syntax Analyzer (Parser), 3. Semantic Analyzer, 4. Intermediate Code Generator (ICG), 5. Code Optimizer (CO), and 6. Code Generator (CG).
In addition to these, it also has symbol table management and error handler components. Not all phases are mandatory in every compiler; e.g., the code optimizer phase is optional in some cases. The description is given in the next section. The phases of a compiler are divided into two parts: the first three phases are called the analysis part, and the remaining three the synthesis part.

Source program -> Lexical analyzer -> Syntax analyzer -> Semantic analyzer -> Intermediate code generator -> Code optimizer -> Code generator -> Target program, with the symbol-table manager and the error handler interacting with every phase.
Figure 1.4: Phases of a compiler.

PHASES AND PASSES OF A COMPILER:
In some applications we can have a compiler that is organized into what are called passes, where a pass is a collection of phases that convert the input from one representation to a completely different representation. Each pass makes a complete scan of the input and produces output to be processed by the subsequent pass. An example is a two-pass assembler.

THE FRONT-END & BACK-END OF A COMPILER:
All of these phases of a general compiler are conceptually divided into the front-end and the back-end. This division is due to their dependence on either the source language or the target machine. This model is called the analysis & synthesis model of a compiler.
The front-end of the compiler consists of the phases that depend primarily on the source language and are largely independent of the target machine. For example, the front-end of the compiler includes the scanner, the parser, creation of the symbol table, the semantic analyzer, and the intermediate code generator.
The back-end of the compiler consists of the phases that depend on the target machine; those portions do not depend on the source language, just on the intermediate language. Here we have the different aspects of the code optimization phase and code generation, along with the necessary error handling and symbol table operations.

LEXICAL ANALYZER (SCANNER):
The scanner is the first phase, working as the interface between the compiler and the source language program, and it performs the following functions:
1. Reads the characters in the source program and groups them into a stream of tokens, in which each token specifies a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator like ++. The character sequence forming a token is called a lexeme of the token.
2. Generates a token-id, and also enters the identifier's name in the symbol table if it does not already exist there.
3. Removes comments and unnecessary spaces.
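As a minimal sketch of function 1 above (our own toy code, recognizing only identifiers), the scanner skips blanks and then groups a letter followed by letters or digits into one lexeme:

#include <ctype.h>
#include <stdio.h>

/* Read one identifier from src starting at *pos into lexeme;
   return 1 if an identifier lexeme was found. */
int scan_id(const char *src, int *pos, char *lexeme) {
    int i = *pos, n = 0;
    while (src[i] == ' ' || src[i] == '\t' || src[i] == '\n')
        i++;                                /* discard unnecessary spaces */
    if (!isalpha((unsigned char)src[i]))
        return 0;                           /* not the start of an identifier */
    while (isalnum((unsigned char)src[i]))
        lexeme[n++] = src[i++];             /* group the characters of the lexeme */
    lexeme[n] = '\0';
    *pos = i;
    return 1;
}

int main(void) {
    char lexeme[64];
    int pos = 0;
    if (scan_id("   rate60 ", &pos, lexeme))
        printf("id lexeme: %s\n", lexeme);  /* prints: id lexeme: rate60 */
    return 0;
}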
The format of the token is <token name, attribute value>.

SYNTAX ANALYZER (PARSER):
The parser interacts with the scanner and with its subsequent phase, the semantic analyzer, and performs the following functions:
1. Groups the received and recorded token stream into syntactic structures, usually into a structure called a parse tree, whose leaves are tokens.
2. The interior nodes of this tree represent streams of tokens that logically belong together.
3. In effect, it checks the syntax of the program elements.

SEMANTIC ANALYZER:
This phase receives the syntax tree as input and checks the semantic correctness of the program. Though the tokens are valid and syntactically correct, it may happen that they are not correct semantically. Therefore the semantic analyzer checks the semantics (meaning) of the statements formed. The syntactically and semantically correct structures are produced here in the form of a syntax tree, a DAG, or some other sequential representation such as a matrix.

INTERMEDIATE CODE GENERATOR (ICG):
This phase takes the syntactically and semantically correct structure as input and produces its equivalent intermediate notation of the source program. The intermediate code should have two important properties:
1. It should be easy to produce, and
2. It should be easy to translate into the target program.
Example intermediate code forms are three-address code, Polish notation, etc.

CODE OPTIMIZER:
This phase is optional in some compilers, but it is very useful and beneficial in terms of saving development time, effort, and cost. It performs the following specific functions:
1. Attempts to improve the intermediate code so as to obtain faster machine code. Typical functions include loop optimization, removal of redundant computations, strength reduction, frequency reduction, etc.
2. Sometimes the data structures used in representing the intermediate forms may also be changed.

CODE GENERATOR:
This is the final phase of the compiler; it generates the target code, normally consisting of relocatable machine code, assembly code, or absolute machine code.
1. Memory locations are selected for each variable used, and assignment of variables to registers is done.
2. Intermediate instructions are translated into a sequence of machine instructions.

The compiler also performs symbol table management and error handling throughout the compilation process. The symbol table is simply a data structure that stores the different source language constructs and the tokens generated during compilation. These two interact with all phases of the compiler.

For example, suppose the source program is an assignment statement; the following figure shows how the phases of the compiler process it. The input source program is
position = initial + rate * 60
After lexical analysis this becomes id1 = id2 + id3 * 60, from which the parser and semantic analyzer build the syntax tree, and the intermediate code generator produces:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Finally the code generator produces the target code:
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Figure 1.5: Translation of an assignment statement.
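The symbol table mentioned above can be pictured as a simple table of entries. Below is a minimal C sketch (the struct layout and the name install_id are hypothetical; a real compiler would use a hash table rather than a linear scan):

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];     /* the lexeme, e.g. "position" */
    int  token;        /* the token code, e.g. ID */
    int  type;         /* filled in later, e.g. by semantic analysis */
};

static struct symbol table[256];
static int nsyms = 0;

/* Return the index of name, entering it on first sight. */
int install_id(const char *name, int token) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                        /* already present */
    strcpy(table[nsyms].name, name);         /* enter the new identifier */
    table[nsyms].token = token;
    return nsyms++;
}

int main(void) {
    int i = install_id("position", 1);
    int j = install_id("position", 1);       /* second call finds the same slot */
    printf("%d %d %s\n", i, j, table[i].name);
    return 0;
}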
LEXICAL ANALYSIS:
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a token for each lexeme in the source program. This stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. This process is shown in the following figure.

Source program -> Lexical Analyzer <-> (token / getNextToken) <-> Parser -> to semantic analysis, with both phases consulting the symbol table.
Figure 1.6: Lexical analyzer.

When the lexical analyzer identifies the first token, it sends it to the parser; the parser receives the token and calls the lexical analyzer for the next token by issuing the getNextToken() command. This process continues until the lexical analyzer has identified all the tokens. During this process the lexical analyzer neglects (discards) the white space and comment lines.

TOKENS, PATTERNS AND LEXEMES:
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface, and we will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take (or match). In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example: In the following C language statement,
printf("Total = %d\n", score);
both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the token literal (or string).

TOKEN        INFORMAL DESCRIPTION                    SAMPLE LEXEMES
if           characters i, f                         if
else         characters e, l, s, e                   else
comparison   < or > or <= or >= or == or !=          <=, !=

INPUT BUFFERING:
The scanner may need to look ahead several characters beyond the next lexeme before the right lexeme can be determined. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

Buffer Pairs: Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded.

Figure 1.7: Using a pair of input buffers (here holding E = M * C * * 2 followed by eof), with the pointers lexemeBegin and forward into the buffered input.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.

Two pointers to the input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.
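In C, the arrangement just described might be declared roughly as follows (a sketch with invented names; EOF_CHAR stands in for the special eof character):

#include <stdio.h>

#define N 4096                  /* buffer size, typically one disk block */
#define EOF_CHAR '\0'           /* stand-in for the special eof sentinel */

char buffer[2 * (N + 1)];       /* two halves, each with room for a sentinel */
char *lexemeBegin = buffer;     /* marks the start of the current lexeme */
char *forward     = buffer;     /* scans ahead until a pattern match is found */

/* Refill the given half (0 or 1) with one system-level read, then
   plant the sentinel just past the characters actually read. */
size_t reload(FILE *src, int half) {
    char *base = buffer + half * (N + 1);
    size_t got = fread(base, 1, N, src);
    base[got] = EOF_CHAR;
    return got;
}

int main(void) {
    reload(stdin, 0);           /* fill the first half before scanning begins */
    return 0;
}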
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure, forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one of the buffers; if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its buffer before determining it.

Sentinels To Improve Scanner Performance: If we use the above scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.

Figure 1.8: Sentinels at the end of each buffer.

Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 1.9 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or at the end of the input.

switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
Figure 1.9: Use of switch-case for the sentinel.

SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

LEX, the Lexical Analyzer Generator:
Lex is a tool used to generate a lexical analyzer; the input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code in a file called lex.yy.c; this is a C program which, given to the C compiler, yields the object code. Here we need to know how to write in the Lex language. The structure of a Lex program is given below.
Structure of a LEX Program: A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary function definitions

The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions. It appears between %{ ... %}.
In the translation rules section, we place pattern-action pairs, where each pair has the form
pattern { action }
The auxiliary function definitions section includes the definitions of the functions used to install identifiers and numbers in the symbol table.

LEX Program Example:

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* no action and no return */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = (int) installID();  return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }

(The rules for the remaining relational operators follow the same pattern.)

CONTEXT FREE GRAMMARS (CFG):
A context-free grammar consists of terminals (T), non terminals (V), a start symbol, and productions of the form A -> B, where A is a single non terminal and B is a string in (V U T)*. Each production consists of:
(a) A non terminal called the head or left side of the production; the production defines some of the strings denoted by the head.
(b) The symbol ->. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and non terminals. The components of the body describe one way in which strings of the non terminal at the head can be constructed.
Conventionally, the productions for the start symbol are listed first.

Example: a context-free grammar to accept arithmetic expressions. The terminals are +, -, *, /, (, ), id. The non terminal symbols are expression, term, and factor, and expression is the start symbol.

expression -> expression + term
expression -> expression - term
expression -> term
term -> term * factor
term -> term / factor
term -> factor
factor -> ( expression )
factor -> id
Figure 2.3: Grammar for simple arithmetic expressions.

Notational Conventions Used in Writing CFGs: To avoid always having to state that "these are the terminals," "these are the non terminals," and so on, the following notational conventions for grammars will be used throughout our discussions.
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ..., 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are non terminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent non terminals for the constructs. For example, non terminals for expressions, terms, and factors are often represented by E, T, and F, respectively.
Using these conventions, the grammar for arithmetic expressions can be written as
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> (E) | id
DERIVATIONS:
The construction of a parse tree can be made precise by taking a derivational view, in which productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step replaces a non terminal by the body of one of its productions. This derivational view corresponds to the top-down construction of a parse tree, as well as to the bottom-up construction. Derivations are classified into leftmost derivations and rightmost derivations.

Left Most Derivation (LMD): This is the process of constructing the parse tree (or accepting the given input string) in which, every time a production rule must be rewritten, it is applied to the leftmost non terminal only.
Example: let the grammar be E -> E+E | E*E | -E | (E) | id and the input string id + id * id.
The production E -> -E signifies that if E denotes an expression, then -E must also denote an expression. The replacement of a single E by -E is described by writing E => -E, which is read as "E derives -E". For a general definition of derivation, consider a non terminal A in the middle of a sequence of grammar symbols, as in aAb, where a and b are arbitrary strings of grammar symbols. Suppose A -> g is a production. Then we write aAb => agb. The symbol => means "derives in one step". Often we wish to say "derives in zero or more steps"; for this purpose we use the symbol =>*. If we wish to say "derives in one or more steps", we use the symbol =>+. If S =>* a, where S is the start symbol of a grammar G, we say that a is a sentential form of G.
The leftmost derivation for the given input string id + id * id is
E => E + E
  => id + E
  => id + E * E
  => id + id * E
  => id + id * id
NOTE: We always start from the root production; the underline at a non terminal indicates that it is the (leftmost) non terminal chosen when rewriting the productions to accept the string.

Right Most Derivation (RMD): This is the process of constructing the parse tree (or accepting the given input string) in which, every time a production rule must be rewritten, it is applied to the rightmost non terminal only. The rightmost derivation for the given input string id + id * id is
E => E + E
  => E + E * E
  => E + E * id
  => E + id * id
  => id + id * id
NOTE: We always start from the root production; the underline at a non terminal indicates that it is the (rightmost) non terminal chosen when rewriting the productions to accept the string.

What is a Parse Tree?
A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace non terminals. Each interior node of a parse tree represents the application of a production. All the interior nodes are non terminals and all the leaf nodes are terminals; the leaf nodes read from left to right give the yield (output) of the parse tree. If a node n is labeled X and has children n1, n2, ..., nk with labels X1, X2, ..., Xk respectively, then there must be a production X -> X1 X2 ... Xk in the grammar.

Example 1: The parse tree for the input string -(id + id), using the above context-free grammar:
Figure 2.4: Parse tree for the input string -(id + id).
The following figure shows the step-by-step construction of this parse tree using the CFG.
Figure 2.5: Sequence of outputs of the parse tree construction process for the input string -(id + id).
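One possible C representation of such a parse tree node (a sketch, not a definition from the course file): each node stores its label and its children left to right, and printing the leaves in order gives the yield of the tree.

#include <stdio.h>

#define MAX_CHILDREN 4

struct ptnode {
    const char *label;                     /* "E", "T", "id", "+", ... */
    int nchildren;                         /* 0 for a leaf (terminal) */
    struct ptnode *child[MAX_CHILDREN];    /* X1 ... Xk, left to right */
};

/* Print the leaves from left to right: the yield of the parse tree. */
void yield(const struct ptnode *n) {
    if (n->nchildren == 0) { printf("%s ", n->label); return; }
    for (int i = 0; i < n->nchildren; i++)
        yield(n->child[i]);
}

int main(void) {
    struct ptnode id1 = { "id", 0, { 0 } }, id2 = { "id", 0, { 0 } };
    struct ptnode plus = { "+", 0, { 0 } };
    struct ptnode e1 = { "E", 1, { &id1 } }, e2 = { "E", 1, { &id2 } };
    struct ptnode root = { "E", 3, { &e1, &plus, &e2 } };   /* E -> E + E */
    yield(&root);                          /* prints: id + id */
    printf("\n");
    return 0;
}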
Example 2: The parse tree for the input string id + id * id, using the above context-free grammar:
Figure 2.6: Parse tree for the input string id + id * id.

AMBIGUITY IN CFGs:
Definition: A grammar that produces more than one parse tree for some sentence (input string) is said to be ambiguous. In other words, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence. Equivalently, when the right-hand side of a production contains two occurrences of the non terminal that appears on its left-hand side (as in E -> E + E), the grammar is liable to be ambiguous.

Example: Let the grammar be E -> E+E | E*E | -E | (E) | id and the input string id + id * id. There are two parse trees for the given input string: one grouping the operands as id + (id * id), the other as (id + id) * id.
The two leftmost derivations for the given input string are:
(a) E => E + E => id + E => id + E * E => id + id * E => id + id * id
(b) E => E * E => E + E * E => id + E * E => id + id * E => id + id * id
The above grammar gives two parse trees (two leftmost derivations) for the given input string, so it is an ambiguous grammar.
Note: An LL(1) parser will not accept an ambiguous grammar; that is, we cannot construct an LL(1) parser for an ambiguous grammar, because such a grammar may cause the top-down parser to go into an infinite loop or make it consume more time for parsing. If necessary, we must remove all types of ambiguity from the grammar before constructing the parser.

ELIMINATING AMBIGUITY:
Since ambiguous grammars may cause the top-down parser to go into an infinite loop or to consume more time during parsing, an ambiguous grammar can sometimes be rewritten to eliminate the ambiguity. The general form of the productions that cause ambiguity in grammars is
A -> Aa | B
This can be written as (introducing one new non terminal in place of the second non terminal):
A -> BA'
A' -> aA' | e

Example: Let the grammar be E -> E+E | E*E | -E | (E) | id, which was shown to be ambiguous. It can be written production by production as
E -> E+E
E -> E*E
E -> -E
E -> (E)
E -> id
The 1st and 2nd productions carry the ambiguity, so E -> E+E | E*E can be written as E -> E+E | B, where B stands for E*E. This matches the general form above, so it can be rewritten as
1) E -> E + T | T
2) T -> E * E
The first production is now free from ambiguity. Substituting E -> T in the 2nd production gives T -> T*T | -E | (E) | id, which again can be written as T -> T*T | B, where B stands for -E | (E) | id. Introducing a new non terminal on the right-hand side, it becomes
T -> T * F | F
F -> -E | (E) | id
Now the entire grammar has been turned into an equivalent unambiguous one:
1) E -> E + T | T
2) T -> T * F | F
3) F -> -E | (E) | id

LEFT RECURSION:
Another feature of CFGs which is not desirable in top-down parsers is left recursion. A grammar is left-recursive if it has a non terminal A such that there is a derivation A =>+ Aa for some string a in (T U V)*. LL(1) or top-down parsers cannot handle left-recursive grammars, so we need to remove the left recursion from a grammar before it is used in top-down parsing.
The general form of left recursion is
A -> Aa | B
The above left-recursive production can be written in the equivalent non-left-recursive form:
A -> BA'
A' -> aA' | e

Example: Is the following grammar left-recursive? If so, find a non-left-recursive grammar equivalent to it.
E -> E + T | T
T -> T * F | F
F -> -E | (E) | id
Yes, the grammar is left-recursive due to the first two productions, which match the general form of left recursion. After removing the left recursion from E -> E + T and T -> T * F, the grammar can be rewritten as
E -> T E'
E' -> + T E' | e
T -> F T'
T' -> * F T' | e
F -> -E | (E) | id

LEFT FACTORING:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive, or top-down, parsing. A grammar in which more than one production has a common prefix is rewritten by factoring out the prefixes. For example, in the following grammar there are n A-productions with the common prefix a, which should be factored out without changing the language defined for A:
A -> aA1 | aA2 | aA3 | aA4 | ... | aAn
We can factor out the a from all n productions by adding a new A' production and rewriting the grammar as
A -> aA'
A' -> A1 | A2 | A3 | A4 | ... | An

FIRST AND FOLLOW:
The construction of both top-down and bottom-up parsers is aided by two functions, FIRST and FOLLOW, associated with a grammar G. During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input (lookahead) symbol.

Computation of FIRST: The FIRST function computes the set of terminal symbols with which the right-hand sides of the productions can begin. To compute FIRST(A) for all grammar symbols, apply the following rules until no more terminals or e can be added to any FIRST set:
1. If A is a terminal, then FIRST(A) = {A}.
2. If A is a non terminal and A -> X1 X2 ... Xi is a production, then FIRST(A) includes FIRST(X1) if X1 does not derive e. If X1 is a non terminal and X1 -> e, add FIRST(X2) to FIRST(A); if X2 -> e, add FIRST(X3) to FIRST(A); and so on. If every Xi derives e, add e to FIRST(A).
3. If A -> e is a production, then add e to FIRST(A).

Computation of FOLLOW: FOLLOW(A) is the set of terminal symbols of the grammar that can appear immediately to the right of the non terminal A; if a is to the immediate right of non terminal A in some sentential form, then a is in FOLLOW(A). To compute FOLLOW(A) for all non terminals A, apply the following rules until no more symbols can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right-end marker.
2. If there is a production A -> aBb, then everything in FIRST(b) except e is in FOLLOW(B).
3. If there is a production A -> aB, or a production A -> aBb where FIRST(b) contains e, then everything in FOLLOW(A) is in FOLLOW(B).
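"Apply the rules until nothing more can be added" is a fixpoint iteration, and it can be sketched compactly in C. The encoding below is invented for the sketch (not the course's code): each grammar symbol is one character, with R standing for E', S for T', '#' for e (epsilon), and 'i' for id, using the expression grammar of the worked example that follows.

#include <stdio.h>
#include <string.h>

/* E -> TR, R -> +TR | #, T -> FS, S -> *FS | #, F -> (E) | i */
const char *prods[] = { "E:TR", "R:+TR", "R:#", "T:FS", "S:*FS", "S:#",
                        "F:(E)", "F:i" };
#define NP 8

int first[128][128];                 /* first[A][a] = 1 iff a is in FIRST(A) */

int is_nt(char c) { return c >= 'A' && c <= 'Z'; }

int add(int set[], char c) {         /* returns 1 if the set changed */
    if (set[(unsigned char)c]) return 0;
    set[(unsigned char)c] = 1;
    return 1;
}

int main(void) {
    int changed = 1;
    while (changed) {                /* iterate until no FIRST set changes */
        changed = 0;
        for (int p = 0; p < NP; p++) {
            char A = prods[p][0];
            const char *body = prods[p] + 2;
            int i, nullable = 1;
            for (i = 0; body[i] && nullable; i++) {
                char X = body[i];
                nullable = 0;
                if (!is_nt(X)) {     /* terminal (or '#' for epsilon): add, stop */
                    changed |= add(first[(int)A], X);
                    break;
                }
                for (int a = 0; a < 128; a++)   /* add FIRST(X) minus epsilon */
                    if (first[(int)X][a] && a != '#')
                        changed |= add(first[(int)A], (char)a);
                if (first[(int)X]['#'])
                    nullable = 1;    /* X can vanish: look at the next symbol */
            }
            if (nullable && i > 0 && body[i] == '\0')
                changed |= add(first[(int)A], '#');  /* whole body nullable */
        }
    }
    for (int A = 'A'; A <= 'Z'; A++) {
        int any = 0;
        for (int a = 0; a < 128; a++) if (first[A][a]) any = 1;
        if (!any) continue;
        printf("FIRST(%c) = {", A);
        for (int a = 0; a < 128; a++) if (first[A][a]) printf(" %c", a);
        printf(" }\n");
    }
    return 0;
}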
Example: Compute the FIRST and FOLLOW values of the expression grammar
1. E -> T E'
2. E' -> + T E' | e
3. T -> F T'
4. T' -> * F T' | e
5. F -> (E) | id

Computing FIRST values:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, e}
FIRST(T') = {*, e}

Computing FOLLOW values:
FOLLOW(E) = {$, )}, because E is the start symbol of the grammar and E is followed by ) in F -> (E).
FOLLOW(E') = FOLLOW(E) = {$, )}, satisfying the 3rd rule of FOLLOW.
FOLLOW(T) = (FIRST(E') - {e}) U FOLLOW(E) U FOLLOW(E') = {+, $, )}, satisfying the 2nd and 3rd rules.
FOLLOW(T') = FOLLOW(T) = {+, $, )}, satisfying the 3rd rule.
FOLLOW(F) = (FIRST(T') - {e}) U FOLLOW(T) U FOLLOW(T') = {*, +, $, )}, satisfying the 2nd and 3rd rules.

NON TERMINAL   FIRST      FOLLOW
E              {(, id}    {$, )}
E'             {+, e}     {$, )}
T              {(, id}    {+, $, )}
T'             {*, e}     {+, $, )}
F              {(, id}    {*, +, $, )}
Table 2.1: FIRST and FOLLOW values.

Constructing the Predictive or LL(1) Parse Table: This is the process of placing all the productions of the grammar in the parse table, based on the FIRST and FOLLOW values of the productions. The rules to construct the parsing table M are:
1. For each production A -> a of the grammar, do steps 2 and 3.
2. For each terminal symbol b in FIRST(a), add the production A -> a to M[A, b].
3. i. If e is in FIRST(a), add the production A -> a to M[A, b] for every terminal b in FOLLOW(A).
   ii. If e is in FIRST(a) and $ is in FOLLOW(A), then add the production A -> a to M[A, $].
4. Mark all other entries in the parsing table as errors.

NON-TERMINAL   +            *             (          )         id         $
E                                         E -> TE'             E -> TE'
E'             E' -> +TE'                            E' -> e              E' -> e
T                                         T -> FT'             T -> FT'
T'             T' -> e      T' -> *FT'               T' -> e              T' -> e
F                                         F -> (E)             F -> id
Table 2.2: LL(1) parsing table for the expression grammar.

Note: If there are no multiple entries in the table for any single terminal, then the grammar is accepted by an LL(1) parser.

LL(1) Parsing Algorithm: The parser acts on the basis of two symbols:
i. A, the symbol on the top of the stack, and
ii. a, the current input symbol.
There are three conditions on A and a that drive the parsing program:
1. If A = a = $, then parsing is successful.
2. If A = a and they are not $, then the parser pops A off the stack and advances the current input pointer to the next symbol.
3. If A is a non terminal, the parser consults the entry M[A, a] in the parsing table. If M[A, a] is a production A -> X1 X2 ... Xn, then the program replaces the A on top of the stack by X1 X2 ... Xn in such a way that X1 comes on top.

STRING ACCEPTANCE BY THE PARSER:
If the input string for the parser is id + id * id, the following table shows how the parser accepts it with the help of the stack (the stack is written with $ at the bottom on the left and the top of the stack on the right).

STACK       INPUT             ACTION / COMMENTS
$E          id + id * id $    E on top of the stack is replaced by TE'
$E'T        id + id * id $    T on top of the stack is replaced by FT'
$E'T'F      id + id * id $    F on top of the stack is replaced by id
$E'T'id     id + id * id $    pop and remove id (condition 2 is satisfied)
$E'T'       + id * id $       T' on top of the stack is replaced by e
$E'         + id * id $       E' on top of the stack is replaced by +TE'
$E'T+       + id * id $       pop and remove + (condition 2 is satisfied)
$E'T        id * id $         T on top of the stack is replaced by FT'
$E'T'F      id * id $         F on top of the stack is replaced by id
$E'T'id     id * id $         pop and remove id (condition 2 is satisfied)
$E'T'       * id $            T' on top of the stack is replaced by *FT'
$E'T'F*     * id $            pop and remove * (condition 2 is satisfied)
$E'T'F      id $              F on top of the stack is replaced by id
$E'T'id     id $              pop and remove id (condition 2 is satisfied)
$E'T'       $                 T' on top of the stack is replaced by e
$E'         $                 E' on top of the stack is replaced by e
$           $                 parsing is successful (condition 1 is satisfied)
Table 2.3: Sequence of steps taken by the parser in parsing the input token stream id + id * id.

Figure 2.7: Parse tree for the input id + id * id.
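The three conditions above give a compact driver loop. The C sketch below is our own illustration: it hard-codes Table 2.2 in a small function instead of a two-dimensional array, and uses the same invented one-character encoding as before (R for E', S for T', 'i' for id).

#include <stdio.h>
#include <string.h>

const char *rule(char A, char a) {   /* the parsing table entry M[A, a] */
    switch (A) {
    case 'E': if (a == '(' || a == 'i') return "TR";  break;
    case 'R': if (a == '+') return "+TR";
              if (a == ')' || a == '$') return "";    break;  /* E' -> epsilon */
    case 'T': if (a == '(' || a == 'i') return "FS";  break;
    case 'S': if (a == '*') return "*FS";
              if (a == '+' || a == ')' || a == '$') return "";
              break;                                          /* T' -> epsilon */
    case 'F': if (a == '(') return "(E)";
              if (a == 'i') return "i";               break;
    }
    return NULL;                     /* blank entry: error */
}

int is_nt(char c) { return c && strchr("ERTSF", c) != NULL; }

int parse(const char *a) {
    char stack[100] = "$E";          /* bottom marker, then the start symbol */
    int top = 1;                     /* stack[top] is the top of the stack */
    while (1) {
        char A = stack[top];
        if (A == '$' && *a == '$') return 1;     /* condition 1: success */
        if (!is_nt(A)) {                         /* condition 2: match */
            if (A != *a) return 0;
            top--; a++;
            continue;
        }
        const char *body = rule(A, *a);          /* condition 3: expand */
        if (body == NULL) return 0;
        top--;                                   /* pop A ... */
        for (int i = (int)strlen(body) - 1; i >= 0; i--)
            stack[++top] = body[i];              /* ... push the body reversed */
    }
}

int main(void) {
    printf("%s\n", parse("i+i*i$") ? "accepted" : "rejected");
    return 0;
}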
ERROR HANDLING (RECOVERY) IN PREDICTIVE PARSING:
In table-driven predictive parsing, it is clear which terminals and non terminals the parser expects from the rest of the input. An error can be detected in the following situations:
1. When the terminal on top of the stack does not match the current input symbol.
2. When a non terminal A is on top of the stack, a is the current input symbol, and M[A, a] is empty or marked as an error.
The parser must recover from the error and continue its process. The following error recovery schemes are used in predictive parsing:

Panic Mode Error Recovery: This is based on the idea that, when an error is detected, the parser skips the remaining input until a synchronizing token is encountered. Some examples are listed below:
1. For a non terminal A, place all symbols in FOLLOW(A) into the synchronizing set of A. For example, consider the assignment statement "c = ;". Here the expression on the right-hand side is missing, so the FOLLOW of the expression is considered: it is ";", which is taken as the synchronizing token. On encountering it, the parser emits the error message "missing expression".
2. For a non terminal A, place all symbols in FIRST(A) into the synchronizing set of A. For example, consider the assignment statement "22 c = a + b;". Here 22 cannot begin the statement; the parser skips ahead until a symbol in FIRST of the statement (the identifier c) appears, and reports the error "extraneous token".

Phrase Level Recovery: This can be implemented in predictive parsing by filling the blank entries in the predictive parsing table with pointers to error-handling routines. These routines can insert, modify, or delete symbols in the input.

RECURSIVE DESCENT PARSING:
A recursive-descent parsing program consists of a set of recursive procedures, one for each non terminal. Each procedure is responsible for parsing the constructs defined by its non terminal. Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string. If the given grammar is
E -> T E'
E' -> + T E' | e
T -> F T'
T' -> * F T' | e
F -> (E) | id
the recursive procedures of the recursive descent parser for this grammar are given below.

procedure E()
{
    T();
    E'();
}

procedure E'()
{
    if input = '+'
    {
        advance();
        T();
        E'();
    }
    return true;
}

procedure T()
{
    F();
    T'();
}

procedure T'()
{
    if input = '*'
    {
        advance();
        F();
        T'();
    }
    return true;
}

procedure F()
{
    if input = '('
    {
        advance();
        E();
        if input = ')'
        {
            advance();
            return true;
        }
    }
    else if input = 'id'
    {
        advance();
        return true;
    }
    else
        return error;
}

procedure advance()
{
    input = next token;
}

BACKTRACKING:
This parsing method uses the technique called the brute force method during the parse tree construction process. It allows the process to go back (backtrack) and redo the steps by undoing the work done so far at the point of processing.

Brute force method: It is a top-down parsing technique that applies when there is more than one alternative in the productions to be tried while parsing the input string. It selects the alternatives in the order they appear, and when it realizes that something has gone wrong, it tries the next alternative.

For example, consider the grammar below:
S -> cAd
A -> ab | a
To generate the input string "cad", initially the first parse tree given below is generated. As the string generated is not "cad", the input pointer is backtracked to the position of "A", to examine the next alternative of "A". Now a match to the input string occurs, as shown in the second parse tree.
Figure: (a) the tree with A expanded to ab, which fails to match; (b) backtracking at A; (c) the tree with A expanded to a, which yields "cad".
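A brute-force backtracking parser for this grammar can be sketched in C as below (illustrative only). The key move is saving the input position before trying an alternative of A and restoring it when that alternative fails.

#include <stdio.h>

const char *in;                 /* the remaining input */

int match(char c) {             /* consume c if it is the next character */
    if (*in == c) { in++; return 1; }
    return 0;
}

int A(void) {
    const char *save = in;      /* remember the position for backtracking */
    if (match('a') && match('b'))
        return 1;               /* alternative A -> ab succeeded */
    in = save;                  /* undo the partial match, try A -> a */
    return match('a');
}

int S(void) {
    return match('c') && A() && match('d');
}

int main(void) {
    in = "cad";
    printf("%s\n", (S() && *in == '\0') ? "accepted" : "rejected");
    return 0;
}

For this grammar a single saved pointer suffices; in general, every non terminal with alternatives needs the same save-and-restore discipline.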
Give the algorithm for computing FIRST n FOLLOW of grammar symbols with an example? 3. Construct the LL (1) Parsing table for the following grammar? ESEsTT TOPE F> @)lid 4. For the above grammar construct, and explain the Recursive Descent Parser? 5. What happens if multiple entries occurring in your LL (1) Parsing table? Justify your answer? How does the Parser ASSIGNMENT QUESTIONS 1, Eliminate the Left recursion from the below grammar? As> Aab | AcBIb B-> Bald Explain the procedure to remove the ambiguity from the given grammar with your own example? 3. Write the grammar for the if-else statement in the C programming and check for the left factoring? 4. Will the Predictive parser accept the ambiguous Grammar justify your answer? 5. Is the grammar G = { S>L=R, S->R, R>L, L>*R | id } an LL(1) grammar? 33|Page Department of Computer Science & Engineering BOTTOM-UP PARSING Course File: Compiler Design Bottom-up parsing corresponds to the construction of a parse tree for an input string beginning at the leaves (the bottom nodes) and working up towards the root (the top node). It involves “reducing an input string ‘w’ to the Start Symbol of the grammar. in each reduction step, a perticular substring matching the right side of the production is replaced by symbol on the Jeft of that production and it is the Right most derivation. For example consider the following Grammar: E> Est TOTP F +> @)jid Bottom up parsing of the input string “id * Id “is as follows: INPUT STRING ‘SUB STRING REDUCING PRODUCTION iid a Foid Frid T Fost Tid 1a Fsid TE * TST T TE ST E Start symbol. Hence, the input String is accepted Parse Tree representation is as follow: 34|Pave Department of Computer Science & Engineering Course File : Compiler Design id + id Fo «id T « id T+F T. B - | 17 | /\\ I id F FF id T+*F T | | 17 iy /I\ id id Pr id f * f id FP id ia Figure 3.1 : A Bottom-up Parse tree for the input String “id*id” Bottom up parsing is classified in to 1. Shift-Reduce Parsing, 2. Operator Precedence parsing , and 3. [Table Driven] L R Parsing i. SLR(1) CALR (1) .LALR( 1) SHIET-REDUCE PARSING: Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed, We use $ to mark the bottom of the stack and also the right end of the input. And it makes use of the process of shift and reduce actions to accept the input string. Here, the parse tree is Constructed bottom up from the leaf nodes towards the root node. When we are parsing the given input string, if the match occurs the parser takes the reduce action otherwise it will go for s mn, And it can accept ambiguous grammars also. For example, consider the below grammar to accept the input string “id * id“, using S-R parser ESEsTIT TTF |F F>@®)lid Actions of the Shift-reduce parser using Stack implementation STACK INPUT ‘ACTION $ Teds Shift Sid ids Reduce with Fd SE ids Reduce with TF ST ids Shift ST ids Shift sTd 3 Reduce with F>id SPE 5 Reduce wih T> TF ST § Reduce with ET SE s ‘Accept 35|Pave
