Compiler
The course covers how high level languages are implemented to generate machine code, and the complete structure of compilers: how the various parts are composed together to form a compiler. The course has theoretical and practical components; both are needed in implementing programming languages, though the focus will be on practical application of the theory. Emphasis will be on algorithms and data structures rather than on proofs of correctness of algorithms. Topics include the theory of lexical analysis, parsing, type checking, runtime systems, code generation and optimization (without going too deep into the proofs), techniques for developing lexical analyzers, parsers, type checkers, runtime systems, code generators and optimizers, and the use of tools and specifications for developing the various parts of compilers.
Compilers are very well understood, with strong mathematical foundations. Some environments provide both an interpreter and a compiler: Lisp, Scheme etc. provide an interpreter for development and a compiler for deployment. Java uses both: the Java compiler translates Java into interpretable bytecode, and the Java JIT compiles bytecode into an executable image.
Some early machines and implementations: IBM developed the 704 in 1954, when all programming was done in assembly language. The cost of software development far exceeded the cost of hardware, and productivity was low. The Speedcoding interpreter helped, but programs ran about 10 times slower than hand-written assembly code. In 1954 John Backus proposed a program that would translate high level expressions into native machine code. There was skepticism all around; most people thought it was impossible. The Fortran I project (1954-1957) released the first compiler.
Fortran I, the first compiler, had a huge impact on programming languages and computer science: the whole new field of compiler design was started. More than half of all programmers were using Fortran by 1958, and development time was cut in half. It led to an enormous amount of theoretical work (lexical analysis, parsing, optimization, structured programming, code generation, error recovery etc.). Modern compilers still preserve the basic structure of the Fortran I compiler!
References
. Compilers: Principles, Techniques, and Tools by Aho, Sethi and Ullman (soon to be replaced by "21st Century Compilers" by Aho, Sethi, Ullman, and Lam)
. Crafting a Compiler in C by Fischer and LeBlanc (soon to be replaced by "Crafting a Compiler" by Fischer)
. Compiler Design in C by Holub
. Programming Language Pragmatics by Scott
. Engineering a Compiler by Cooper and Torczon
. Modern Compiler Implementation (in C and in Java) by Appel
. Writing Compilers and Interpreters by Mak
. Compiler Design by Wilhelm and Maurer
. A Retargetable C Compiler: Design and Implementation by Fraser and Hanson
. Compiler Construction by Wirth
. The Theory of Parsing, Translation and Compiling (vol. 1 & 2) by Aho and Ullman (old classic)
. Introduction to Compiler Construction with Unix by Schreiner and Friedman
. Compiler Design and Construction: Tools and Techniques by Pyster
. Engineering a Compiler by Anklam and Cutler
. Object Oriented Compiler Construction by Holmes
. The Compiler Design Handbook edited by Srikant and Shankar
Introduction to Compilers
How to translate?
The high level languages and machine languages differ in their level of abstraction. At the machine level we deal with memory locations and registers, whereas these resources are never accessed directly in high level languages. The level of abstraction also differs from language to language, and some languages are farther from machine code than others.

Goals of translation:
- Good performance for the generated code: the metric for the quality of the generated code is the ratio between the size of hand-written machine code and of the compiled machine code for the same program. A better compiler generates code that is closer to the hand-written version; for optimizing compilers this ratio is closer to one.
- Good compile time performance: hand-written machine code is more efficient than compiled code, i.e., a program hand-written in machine code will generally run faster than compiled code. If a compiler produces code that is only 20-30% slower than the hand-written code, it is considered acceptable. In addition, the compiler itself must run fast (compilation time should be proportional to program size).
- Maintainable code.
- A high level of abstraction.

Correctness is a very important issue. A compiler's most important goal is correctness: all valid programs must compile correctly. How do we check whether a compiler is correct, i.e., whether it generates correct machine code for every program in the language? The complexity of writing a correct compiler is a major limitation on the amount of optimization that can be done. Can compilers be proven to be correct? It is very tedious, and correctness has an implication on the development cost.

Many modern compilers share a common two-stage design. The "front end" translates the source language, i.e., the high level program, into an intermediate representation. The second stage, the "back end", works with this internal representation to produce code in the output language, which is a low level code. The higher the abstraction a compiler can support, the better it is.
All development systems are essentially a combination of many tools. For the compiler, the other tools are the debugger, assembler, linker, loader, profiler, editor etc. If these tools have support for each other, then program development becomes a lot easier.
This is how the various tools work in coordination to make programming easier and better. Each has a specific task to accomplish in the process, from writing the code to compiling it and running/debugging it. After examining the debugging results, the programmer manually corrects the code if needed. It is the combined contribution of these tools that makes programming a lot easier and more efficient.
. Translate in steps. Each step handles a reasonably simple, logical, and well defined task . Design a series of program representations . Intermediate representations should be amenable to program manipulation of various kinds (type checking, optimization, code generation etc.) . Representations become more machine specific and less language specific as the translation proceeds
Lexical Analysis
. Recognizing words is not completely trivial. For example: ist his ase nte nce?
. Therefore, we must know what the word separators are . The language must define rules for breaking a sentence into a sequence of words. . Normally white spaces and punctuations are word separators in languages. . In programming languages a character from a different class may also be treated as word separator. . The lexical analyzer breaks a sentence into a sequence of words or tokens: - If a == b then a = 1 ; else a = 2 ; - Sequence of words (total 14 words) if a == b then a = 1 ; else a = 2 ; In simple words, lexical analysis is the process of identifying the words from an input string of characters, which may be handled more easily by a parser. These words must be separated by some predefined delimiter or there may be some rules imposed by the language for breaking the sentence into tokens or words which are then passed on to the next phase of syntax analysis. In programming languages, a character from a different class may also be considered as a word separator.
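As a toy illustration in C (an illustrative sketch, not from the original notes; it merely splits on blanks, whereas a real lexical analyzer does not rely on white space alone), the statement above can be broken into its 14 words as follows:

#include <stdio.h>
#include <string.h>

int main(void) {
    /* the example statement, with its tokens already separated by blanks */
    char input[] = "if a == b then a = 1 ; else a = 2 ;";
    int count = 0;
    for (char *word = strtok(input, " "); word != NULL; word = strtok(NULL, " "))
        printf("word %2d: %s\n", ++count, word);
    printf("total: %d words\n", count);   /* prints total: 14 words */
    return 0;
}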
Syntax analysis (also called parsing) is the process of imposing a hierarchical (tree-like) structure on the token stream. This is done using language specific grammatical rules, just as in natural language, e.g. sentence → subject + verb + object. The example drawn above shows how a sentence in English (a natural language) can be broken down into a tree form depending on the construct of the sentence.
Parsing
Just like a natural language, a programming language also has a set of grammatical rules and hence can be broken down into a parse tree by the parser. It is on this parse tree that the further steps of semantic analysis are carried out. This is also used during generation of the intermediate language code. Yacc (yet another compiler compiler) is a program that generates parsers in the C programming language.
Consider the sentence "Prateek said Nitin left his assignment at home". What does his refer to? Prateek or Nitin? An even worse case: "Amit said Amit left his assignment at home". How many Amits are there? Which one left the assignment? Semantic analysis is the process of examining the statements to make sure that they make sense. During semantic analysis the types, values, and other required information about statements are recorded, checked, and transformed appropriately to make sure the program makes sense. Ideally there should be no ambiguity in the grammar of the language: each sentence should have just one meaning.
Semantic Analysis
. Too hard for compilers. They do not have capabilities similar to human understanding . However, compilers do perform analysis to understand the meaning and catch inconsistencies . Programming languages define strict rules to avoid such ambiguities
{ int Amit = 3; { int Amit = 4; cout << Amit; } } Since it is too hard for a compiler to do full semantic analysis, programming languages define strict rules to avoid ambiguities and make the analysis easier. In the code written above, there is a clear demarcation between the two instances of Amit: one is placed outside the scope of the other, so the compiler knows that the two Amits are different by virtue of their different scopes.
. Compilers perform many other checks besides variable bindings . Type checking: consider "Amit left her work at home" . There is a type mismatch between her and Amit . Presumably Amit is a male, and they are not the same person.
From this we can draw an analogy with a programming statement. In the statement: double y = "Hello World"; The semantic analysis would reveal that "Hello World" is a string, and y is of type double, which is a type mismatch and hence, is wrong.
Till now we have conceptualized the front end of the compiler with its 3 phases, viz. Lexical Analysis, Syntax Analysis and Semantic Analysis; and the work done in each of the three phases. Next, we look into the backend in the forthcoming slides.
Lexical analysis is based on finite state automata: it finds the lexemes in the input on the basis of the corresponding regular expressions. If there is some input which it cannot recognize, then it generates an error. In the above example, the delimiter is a blank space. See for yourself that the lexical analyzer recognizes identifiers, numbers, brackets etc.
Syntax Analysis
. Error reporting and recovery . Model using context free grammars . Recognize using push down automata/table driven parsers Syntax analysis is modeled using context free grammars: programming languages can be described by context free grammars. Based on the rules of the grammar, a syntax tree can be built from a correct program in the language. Code conforming to a CFG is recognized using a push down automaton. If there is any error in the syntax of the code, then an error is generated by the compiler; some compilers also report what exactly the error is, if possible.
Semantic Analysis
. Check semantics . Error reporting . Disambiguate overloaded operators . Type coercion . Static checking - Type checking - Control flow checking - Uniqueness checking - Name checks
Semantic analysis should ensure that the code is unambiguous. It should also do type checking wherever needed. For example, int y = "Hi"; should generate an error. Type coercion can be explained by the following example: int y = 5.6 + 1; The actual value stored in y will be 6, since y is an integer. The compiler knows that y, being an integer, cannot hold the value 6.6, so it truncates the value to the greatest integer not exceeding 6.6. This implicit conversion is called type coercion.
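A minimal C program (an illustrative sketch, not part of the original notes) that demonstrates this implicit conversion:

#include <stdio.h>

int main(void) {
    int y = 5.6 + 1;     /* 6.6 is implicitly converted (truncated) to 6 */
    double z = 5 + 1;    /* the int result 6 is coerced to the double 6.0 */
    printf("y = %d, z = %f\n", y, z);   /* prints y = 6, z = 6.000000 */
    return 0;
}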
Code Optimization
. There is no strong counterpart in English, but it is similar to editing/précis writing . Automatically modify programs so that they - Run faster - Use less resources (memory, registers, space, fewer fetches etc.) . Some common optimizations - Common sub-expression elimination - Copy propagation - Dead code elimination - Code motion - Strength reduction - Constant folding . Example: x = 15 * 3 is transformed to x = 45 There is no strong counterpart in English; optimization is similar to précis writing, where one cuts out the redundant words. It basically cuts down redundancy. We modify the compiled code to make it more efficient, so that it can run faster and use fewer resources (memory, registers, space, fetches etc.).
Example of Optimizations
Example: consider the following code:
int x = 2; int y = 3; int array[5];
for (i = 0; i < 5; i++) array[i] = x + y;
Because x and y are invariant and do not change inside the loop, their addition does not need to be performed for each loop iteration. Almost any good compiler optimizes the code: an optimizer moves the addition of x and y outside the loop, thus creating a more efficient loop. The optimized code in this case could look like the following:
int x = 2; int y = 3; int z = x + y; int array[5];
for (i = 0; i < 5; i++) array[i] = z;
Code Generation
. Usually a two step process - Generate intermediate code from the semantic representation of the program
- Generate machine code from the intermediate code . The advantage is that each phase is simple . Requires design of an intermediate language . Most compilers perform translation between successive intermediate representations . Intermediate languages are generally ordered in decreasing level of abstraction from highest (source) to lowest (machine) . However, the representation produced just after intermediate code generation is typically the most important one
The final phase of the compiler is generation of the relocatable target code. First of all, Intermediate code is generated from the semantic representation of the source program, and this intermediate code is used to generate machine code.
Thus the intermediate representation must relate not only to identifiers, expressions, functions and classes but also to opcodes, registers, etc., and it must map one level of abstraction to the other. These are some of the things to be taken care of in intermediate code generation.
Some of the different optimization methods are:
1) Constant Folding - replacing y = 5 + 7 with y = 12, or y = x * 0 with y = 0
2) Dead Code Elimination - e.g., replacing if (false) a = 1; else a = 2; with a = 2;
3) Peephole Optimization - a machine-dependent optimization that makes a pass through short, low-level, assembly-like instruction sequences of the program (called a peephole) and replaces them with faster (usually shorter) sequences, for example by removing redundant register loads and stores where possible
4) Flow of Control Optimizations
5) Strength Reduction - replacing more expensive expressions with cheaper ones, like pow(x,2) with x*x
6) Common Sub-expression Elimination - like replacing a = b*c; f = b*c*d; with temp = b*c; a = temp; f = temp*d;
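As an illustrative sketch (not taken from the notes; the variable names and values are invented for this example), here is a small C program showing, side by side, a fragment before optimization and what an optimizer might conceptually produce:

#include <math.h>
#include <stdio.h>

int main(void) {
    int b = 2, c = 3, d = 4;
    double x = 1.5;

    /* before optimization */
    int a = b * c;
    int f = b * c * d;          /* b*c is a common sub-expression */
    double p = pow(x, 2);       /* relatively expensive library call */
    int y = 5 + 7;              /* constant expression */

    /* what an optimizer might produce */
    int temp = b * c;           /* common sub-expression elimination */
    int a2 = temp;
    int f2 = temp * d;
    double p2 = x * x;          /* strength reduction: pow(x,2) -> x*x */
    int y2 = 12;                /* constant folding: 5 + 7 -> 12 */

    printf("%d %d %.2f %d and %d %d %.2f %d\n", a, f, p, y, a2, f2, p2, y2);
    return 0;
}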
This illustrates intermediate code optimization: two different code sequences with two different parse trees, where the optimized code does away with the redundancy in the original code and produces the same result.
Compiler structure
These are the various stages in the process of generating the target code from the source code. These stages can be broadly classified into the Front End (language specific) and the Back End (machine specific) parts of compilation.

Information required about the program variables during compilation:
- Class of variable: keyword, identifier etc.
- Type of variable: integer, float, array, function etc.
- Amount of storage required
- Address in the memory
- Scope information

Where to store this information:
- As attributes with the variable (this has obvious problems), or
- In a central repository, with every phase referring to the repository whenever information is required.

Normally the second approach is preferred, using a data structure called the symbol table. For each lexeme, additional information may be needed along with its name: whether it is a keyword or an identifier, its data type, value, scope, etc. may need to be known during the later phases of compilation. However, all this information is not available straightaway; it has to be found and stored somewhere. We store it in a data structure called the symbol table. Thus each phase of the compiler can read data from the symbol table and write data to it.
The method of retrieval of data is that a symbol table entry is associated with each lexeme. A pointer to this entry in the table can be used to retrieve more information about the lexeme.
This diagram elaborates what's written in the previous slide. You can see that each stage can access the Symbol Table. All the relevant information about the variables, classes, functions etc. are stored in it.
The Back End phases generate the target program from the representation created by the Front End phases. The advantages are that not only can a lot of code be reused, but also, since the compiler is well structured, it is easy to maintain and debug.
We design the front end independent of machines and the back end independent of the source language. For this, we require a Universal Intermediate Language (UIL) that acts as an interface between front end and back end. The front end converts code written in the particular source language into code in the UIL, and the back end converts the code in the UIL into the equivalent code in the particular machine language. So, we need to design only M front ends and N back ends. To design a compiler for language L that produces output for machine C, we take the front end for L and the back end for C. In this way, we require only M front ends and N back ends to obtain compilers for M source languages and N machine architectures, instead of M x N separately written compilers. For large M and N, this is a significant reduction in the effort.
- the IR semantics should ideally be independent of both the source and the target language (i.e. the target processor). Accordingly, already in the 1950s many researchers tried to define a single universal IR language, traditionally referred to as UNCOL (UNiversal Computer Oriented Language). First suggested in 1958, its first version was proposed in 1961. The semantics of this language would be quite independent of the target language, and hence apt to be used as an intermediate language.
One approach to checking compiler correctness is the formal verification method. Such methods mainly rely on writing the state of a program before and after the execution of each statement, where the state consists of the values of the program's variables at that step. In large programs like compilers, the number of variables is too large and so defining the state is very difficult; formal verification of compilers has therefore not yet been put into practice. The practical solution is systematic testing: we do not prove that the compiler will work correctly in all situations, but instead test the compiler on different programs. Correct results increase the confidence that the compiler is correct. Test suites generally contain 5000-10000 programs of various kinds and sizes. Such test suites are expensive, as they are very intelligently designed to test every aspect of the compiler.
To generate a compiler for language L that produces code for machine M, we need to give the compiler generator the specifications of L and M. This would greatly reduce the effort of compiler writing, as the compiler generator needs to be written only once and all compilers could then be produced automatically.
. Can target machine be described using specifications? There are ways to break down the source code into different components like lexemes, structure, semantics etc. Each component can be specified separately. The above example shows the way of recognizing identifiers for lexical analysis. Similarly there are rules for semantic as well as syntax analysis. Can we have some specifications to describe the target machine?
Tools for each stage of compiler design have been designed that take in the specifications of the stage and output the compiler fragment of that stage. For example , lex is a popular tool for lexical analysis, yacc is a popular tool for syntactic analysis. Similarly, tools have been designed for each of these stages that take in specifications required for that phase e.g., the code generator tool takes in machine specifications and outputs the final compiler code. This design of having separate tools for each stage of compiler development has many advantages that have been described on the next slide.
. Compiler performance can be improved by improving a tool and/or the specification for a particular phase. In tool-based compilers, a change in one phase of the compiler does not affect the other phases: the phases are independent of each other, and hence the cost of maintenance is cut down drastically. A tool is built once and then used as many times as needed; with tools, each time you need a compiler you do not have to write it, you can simply "generate" it.
Bootstrapping
. A compiler is a complex program and should not be written in assembly language . How to write a compiler for a language in the same language (the first time!)? . The first time this experiment was done for Lisp . Initially, Lisp was used as a notation for writing functions . Functions were then hand translated into assembly language and executed . McCarthy wrote a function eval[e,a] in Lisp that took a Lisp expression e as an argument . The function was later hand translated and it became an interpreter for Lisp

Writing a compiler in assembly language directly can be a very tedious task, so compilers are generally written in some high level language. But what if the compiler is written in its intended source language itself? This was done for the first time for Lisp. Initially, Lisp was used as a notation for writing functions, which were then hand translated into assembly language and executed. McCarthy wrote a function eval[e, a] in Lisp that took a Lisp expression e as an argument (together with an environment a) and evaluated it. The function was later hand translated into assembly language, and it became an interpreter for Lisp.
Bootstrapping .
. A compiler can be characterized by three languages: the source language (S), the target language (T), and the implementation language (I) . The three languages S, I, and T can be quite different; when the implementation language differs from the target language, the compiler is called a cross-compiler
A native compiler is one whose implementation language is the same as its target language (a compiler is written here as a triple: source, implementation, target). For example, SMM is a compiler for the language S that is written in a language that runs on machine M and generates output code that runs on machine M.

A cross compiler is one whose implementation language differs from its target language. For example, SNM is a compiler for the language S that is written in a language that runs on machine N and generates output code that runs on machine M.
Bootstrapping .
The compiler LSN (source L, implementation S, target N) is written in language S. This compiler code is compiled once on SMM to generate the compiler's code in a language that runs on machine M. So, in effect, we get a compiler that converts code in language L to code that runs on machine N, while the compiler itself is in the language of machine M. In other words, we get LMN.
Bootstrapping a Compiler
Using the technique described in the last slide, we try to obtain a compiler for a language L that is written in L itself. For this we require a compiler for L that runs on machine M and outputs code for machine M. First we write LLN, i.e. a compiler written in L that converts code written in L to code that can run on machine N. We then compile this compiler program, written in L, on the available compiler LMM. So, we get a compiler program that runs on machine M and converts code written in L to code that can run on machine N, i.e. we get LMN. Now, we again compile the originally written compiler LLN on the new compiler LMN obtained in the last step. This compilation converts the compiler code written in L to code that can run on machine N. So, we finally have a compiler that runs on machine N and converts code in language L to code that will run on machine N, i.e. we get LNN.
Bootstrapping is obtaining a compiler for a language L by writing the compiler code in the same language L. We have discussed the steps involved in the last three slides. This slide shows the complete diagrammatical representation of the process.
Lexical Analysis
. Error reporting . Model using regular expressions . Recognize using Finite State Automata The first phase of the compiler is lexical analysis. The lexical analyzer breaks a sentence into a sequence of words or tokens and ignores white spaces and comments. It generates a stream of tokens from the input. This is modeled through regular expressions and the structure is recognized through finite state automata. If the token is not valid i.e., does not fall into any of the identifiable groups, then the lexical analyzer reports an error. Lexical analysis thus involves recognizing the tokens in the source program and reporting errors, if any. We will study more about all these processes in the subsequent slides
Lexical Analysis
. Sentences consist of strings of tokens (a syntactic category), for example: number, identifier, keyword, string . A sequence of characters in a token is a lexeme, for example: 100.01, counter, const, "How are you?" . A rule of description is a pattern, for example: letter(letter|digit)* . Discard whatever does not contribute to parsing, like white spaces (blanks, tabs, newlines) and comments . Construct constants: convert numbers to the token num and pass the number as its attribute; for example, integer 31 becomes <num, 31> . Recognize keywords and identifiers; for example counter = counter + increment becomes id = id + id /*check if id is a keyword*/ We often use the terms "token", "pattern" and "lexeme" while studying lexical analysis. Let's see what each term stands for.
Token: A token is a syntactic category. Sentences consist of a string of tokens. For example number, identifier, keyword, string etc are tokens. Lexeme: Sequence of characters in a token is a lexeme. For example 100.01, counter, const, "How are you?" etc are lexemes. Pattern: Rule of description is a pattern. For example letter (letter | digit)* is a pattern to symbolize a set of strings which consist of a letter followed by a letter or digit. In general, there is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. This pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. The patterns are specified using regular expressions. For example, in the Pascal statement Const pi = 3.1416; The substring pi is a lexeme for the token "identifier". We discard whatever does not contribute to parsing like white spaces (blanks, tabs, new lines) and comments. When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. For example, the pattern num matches both 1 and 0 but it is essential for the code generator to know what string was actually matched. The lexical analyzer collects information about tokens into their associated attributes. For example integer 31 becomes <num, 31>. So, the constants are constructed by converting numbers to token 'num' and passing the number as its attribute. Similarly, we recognize keywords and identifiers. For example count = count + inc becomes id = id + id.
. Push back is required due to lookahead, for example >= and >
. It is implemented through a buffer - Keep input in a buffer - Move pointers over the input The lexical analyzer reads characters from the input and passes tokens to the syntax analyzer whenever it asks for one. For many source languages, there are occasions when the lexical analyzer needs to look ahead several characters beyond the current lexeme for a pattern before a match can be announced. For example, > and >= cannot be distinguished merely on the basis of the first character >. Hence there is a need to maintain a buffer of the input for look ahead and push back. We keep the input in a buffer and move pointers over the input. Sometimes, we may also need to push back extra characters due to this lookahead character.
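A minimal sketch in C (illustrative only; the token codes GT and GE are assumed names, not from the notes) of how one character of lookahead plus push back distinguishes > from >=:

#include <stdio.h>

enum { GT = 256, GE };                    /* assumed token codes */

/* Called after a '>' has been read; uses one character of lookahead. */
int scan_greater(FILE *in) {
    int c = fgetc(in);
    if (c == '=')
        return GE;                        /* the lexeme is ">=" */
    ungetc(c, in);                        /* push back: it belongs to the next token */
    return GT;                            /* the lexeme is ">" */
}

int main(void) {
    int c;
    while ((c = getchar()) != EOF)
        if (c == '>')
            printf("%s\n", scan_greater(stdin) == GE ? ">=" : ">");
    return 0;
}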
Approaches to implementation
. Use assembly language Most efficient but most difficult to implement . Use high level languages like C Efficient but difficult to implement . Use tools like lex, flex Easy to implement but not as efficient as the first two cases Lexical analyzers can be implemented using many approaches/techniques:
. Assembly language: We have to take the input and read it character by character, so we need to have control over low level I/O. Assembly language is the best option for that because it is the most efficient. This implementation produces very efficient lexical analyzers. However, it is the most difficult to implement, debug and maintain. . High level language like C: Here we have reasonable control over I/O because of high-level constructs. This approach is efficient but still difficult to implement. . Tools like lexical generators and parsers: This approach is very easy to implement; only the specifications of the lexical analyzer or parser need to be written. The lex tool produces the corresponding C code. But this approach is not very efficient, which can sometimes be an issue. We can also use a hybrid approach wherein we use high level languages or efficient tools to produce the basic code, and if there are some hot-spots (some functions are a bottleneck) then they can be replaced by fast and efficient assembly language routines.

Construct a lexical analyzer
. Allow white spaces, numbers and arithmetic operators in an expression . Return tokens and attributes to the syntax analyzer . A global variable tokenval is set to the value of the number . Design requires that - A finite set of tokens be defined - Strings belonging to each token be described We now try to construct a lexical analyzer for a language in which white spaces, numbers and arithmetic operators are allowed in an expression. From the input stream, the lexical analyzer recognizes the tokens and their corresponding attributes and returns them to the syntax analyzer. To achieve this, the function returns the corresponding token for the lexeme and sets a global variable, say tokenval, to the value of that token. Thus, we must define a finite set of tokens and specify the strings belonging to each token. We must also keep a count of the line number for the purposes of reporting errors and debugging. A typical code snippet which implements such a lexical analyzer is shown below (the token code NUM and the marker NONE are defined here so that the fragment compiles on its own):

#include <stdio.h>
#include <ctype.h>

#define NUM 256     /* token code returned for numbers */
#define NONE -1     /* "no attribute" marker */

int lineno = 1;
int tokenval = NONE;

int lex() {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                                  /* strip blanks and tabs */
        else if (t == '\n')
            lineno = lineno + 1;               /* keep track of line numbers */
        else if (isdigit(t)) {
            tokenval = t - '0';
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);                  /* push back the lookahead character */
            return NUM;
        }
        else {
            tokenval = NONE;
            return t;                          /* any other character is returned as itself */
        }
    }
}

This crude implementation of the lex() analyzer eliminates white space and collects numbers. Every time the body of the while statement is executed, a character is read into t. If the character is a blank (written ' ') or a tab (written '\t'), then no token is returned to the parser; we merely go around the while loop again. If the character is a newline (written '\n'), then the global variable lineno is incremented, thereby keeping track of line numbers in the input, but again no token is returned. Supplying a line number with the error messages helps pinpoint errors. The code for reading a sequence of digits uses the predicate isdigit(t) from the include file <ctype.h> to determine whether an incoming character t is a digit. If it is, then its integer value is given by the expression t - '0' in both ASCII and EBCDIC. With other character sets, the conversion may need to be done differently.
Problems
. Scans text character by character . Lookahead character determines what kind of token to read and when the current token ends . First character cannot determine what kind of token we are going to read The problem with the lexical analyzer is that the input is scanned character by character. It is not possible to determine, by looking only at the first character, what kind of token we are going to read, since that character might be common to multiple tokens; we saw one such example with > and >= previously. So one needs to use a lookahead character, depending on which one can determine what kind of token to read or when a particular token ends. The token boundary may not be a punctuation mark or a blank but just another kind of token that acts as the word boundary. The lexical analyzer that we just saw used the function ungetc() to push lookahead characters back into the input stream. Because a large amount of time can be consumed moving characters, there is actually a lot of overhead in processing an input character. To reduce the amount of such overhead, many specialized buffering schemes have been developed and used.
Symbol Table
. Stores information for subsequent phases . Interface to the symbol table - Insert(s,t): save lexeme s and token t and return a pointer - Lookup(s): return the index of the entry for lexeme s, or 0 if s is not found Implementation of the symbol table . Fixed amount of space to store lexemes: not advisable as it wastes space . Store lexemes in a separate array, each lexeme terminated by eos, with the symbol table holding pointers to the lexemes A data structure called the symbol table is generally used to store information about various source language constructs. The lexical analyzer stores information in the symbol table for the subsequent phases of the compilation process. The symbol table routines are concerned primarily with saving and retrieving lexemes. When a lexeme is saved, we also save the token associated with it. As an interface to the symbol table, we have two functions: - Insert(s, t): saves and returns the index of a new entry for string s and token t - Lookup(s): returns the index of the entry for string s, or 0 if s is not found Next, we come to the issue of implementing the symbol table. Symbol table access should not be slow, so the data structure used for storing it should be efficient. However, having a fixed amount of space to store each lexeme is not advisable, because a fixed amount of space may not be large enough to hold a very long identifier and may be wastefully large for a short identifier such as i. An alternative is to store lexemes in a separate array, each lexeme terminated by an end-of-string marker, denoted by EOS, that may not appear in identifiers. The symbol table then has pointers to these lexemes.
Here, we have shown the two methods of implementing the symbol table which we discussed in the previous slide. As we can see, the first one, which allots a fixed amount of space for each lexeme, tends to waste a lot of space, since a lexeme might not require the whole of the 32 bytes of fixed space. The second representation, which stores pointers to a separate array containing lexemes terminated by an EOS, is a better, space-saving implementation. Although each lexeme now has an additional overhead of five bytes (four bytes for the pointer and one byte for the EOS), we still save about 70% of the space which we were wasting in the earlier implementation. We allocate extra space for 'Other Attributes' which are filled in during the later phases.
How to handle keywords? . Consider token DIV and MOD with lexemes div and mod.
. Initialize symbol table with insert( "div" , DIV ) and insert( "mod" , MOD). . Any subsequent lookup returns a nonzero value, therefore, cannot be used as an identifier
To handle keywords, we consider the keywords themselves as lexemes. We store all the entries corresponding to keywords in the symbol table while initializing it and do lookup whenever we see a new lexeme. Now, whenever a lookup is done, if a nonzero value is returned, it means that there already exists a corresponding entry in the Symbol Table. So, if someone tries to use a keyword as an identifier, it will not be allowed as an identifier with this name already exists in the Symbol Table. For instance, consider the tokens DIV and MOD with lexemes "div" and "mod". We initialize symbol table with insert("div", DIV) and insert("mod", MOD). Any subsequent lookup now would return a nonzero value, and therefore, neither "div" nor "mod" can be used as an identifier.
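A minimal sketch of such a symbol table in C (illustrative only and not the implementation from these notes; the array sizes, the token codes DIV, MOD and ID, the lack of bounds checking, and the linear search are simplifying assumptions):

#include <stdio.h>
#include <string.h>

#define STRMAX  999          /* size of the lexeme array */
#define SYMMAX  100          /* size of the symbol table */
enum { DIV = 256, MOD, ID }; /* assumed token codes */

struct entry { char *lexptr; int token; };

char lexemes[STRMAX];        /* lexemes stored back to back, each ending in '\0' (EOS) */
int  lastchar = -1;          /* last used position in lexemes */
struct entry symtable[SYMMAX];
int  lastentry = 0;          /* entry 0 is reserved: lookup returns 0 for "not found" */

int lookup(const char *s) {                 /* return index of entry for s, or 0 */
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

int insert(const char *s, int tok) {        /* save lexeme s and token tok, return index */
    int len = strlen(s);
    lastentry = lastentry + 1;
    symtable[lastentry].token  = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar = lastchar + len + 1;
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}

int main(void) {
    insert("div", DIV);                     /* keywords are entered first ...   */
    insert("mod", MOD);
    insert("count", ID);                    /* ... then user identifiers        */
    printf("%d %d\n", lookup("div"), lookup("x"));  /* nonzero for "div", 0 for "x" */
    return 0;
}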
The design of a lexical analyzer is quite complicated and not as simple as it looks. There are several kinds of problems because of all the different types of languages we have. Let us have a look at some of them.

1. We have both fixed format and free format languages. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. FORTRAN has lexemes in fixed positions; these white space and fixed format rules came into force due to punch cards and errors in punching. Fixed format languages make life difficult because in this case we have to look at the position of the tokens as well.

2. Handling of blanks. We have to decide how to handle blanks, since many languages (like Pascal, FORTRAN etc.) attach significance to blanks and void spaces. When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. In Pascal, blanks separate identifiers. In FORTRAN, blanks are important only in literal strings; for example, the variable "counter" is the same as "count er". Another example:
DO 10 I = 1.25
DO 10 I = 1,25
The first line is a variable assignment DO10I = 1.25; the second line is the beginning of a DO loop. In such a case we might need an arbitrarily long lookahead: reading from left to right, we cannot distinguish between the two until the "," or "." is reached.

FORTRAN thus has language conventions which impact the difficulty of lexical analysis. The alignment of a lexeme may be important in determining the correctness of the source program, and the treatment of blanks varies from language to language, for example between FORTRAN and ALGOL 68, where blanks are not significant except in literal strings. These conventions regarding blanks can greatly complicate the task of identifying tokens.
PL/1 Problems
. Keywords are not reserved in PL/1:
if then then then = else; else else = then
if if then then = then + 1
. PL/1 declarations:
Declare(arg1, arg2, arg3, ..., argn)
. We cannot tell whether Declare is a keyword or an array reference until after the ")" . Requires arbitrary lookahead and very large buffers . Worse, the buffers may have to be reloaded

In many languages certain strings are reserved, i.e., their meaning is predefined and cannot be changed by the user. If keywords are not reserved, then the lexical analyzer must distinguish between a keyword and a user defined identifier. PL/1 has several problems:
1. In PL/1 keywords are not reserved; thus, the rules for distinguishing keywords from identifiers are quite complicated, as the PL/1 statements above illustrate.
2. PL/1 declarations, for example Declare(arg1, arg2, arg3, ..., argn). In this statement, we cannot tell whether Declare is a keyword or an array name until we see the character that follows the ")". This requires arbitrary lookahead and very large buffers. A buffering scheme works quite well most of the time, but the amount of lookahead it allows is limited, and this limited lookahead may make it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer. The situation worsens further if the buffers have to be reloaded.
- Tokens may have similar prefixes
- Each character should be looked at only once
The various issues which concern the specification of tokens are:
1. How to describe complicated tokens like e0, 20.e-01, 2.000
2. How to break input statements such as
if (x==0) a = x << 1;
iff (x==0) a = x < 1;
into tokens
3. How to break the input into tokens efficiently? The following problems are encountered:
- Tokens may have similar prefixes
- Each character should be looked at only once
Operations on languages
. L U M = {s | s is in L or s is in M}
. LM = {st | s is in L and t is in M}
The various operations on languages are:
. Union of two languages L and M, written L U M = {s | s is in L or s is in M}
. Concatenation of two languages L and M, written LM = {st | s is in L and t is in M}
. The Kleene closure of a language L, written L*, which denotes zero or more concatenations of L, i.e., L* = {ε} U L U LL U LLL U ...
Example
. Let L = {a, b, ..., z} and D = {0, 1, 2, ..., 9} then
. L U D is the set of letters and digits
. LD is the set of strings consisting of a letter followed by a digit
. L* is the set of all strings of letters, including ε
. L(L U D)* is the set of all strings of letters and digits beginning with a letter
. D+ is the set of strings of one or more digits

Example: Let L be the set of letters defined as L = {a, b, ..., z} and D be the set of all digits defined as D = {0, 1, 2, ..., 9}. We can think of L and D in two ways. We can think of L as the alphabet consisting of the set of lower case letters, and D as the alphabet consisting of the set of the ten decimal digits. Alternatively, since a symbol can be regarded as a string of length one, the sets L and D are each finite languages. Here are some examples of new languages created from L and D by applying the operators defined in the previous slide:
. The union of L and D, L U D, is the set of letters and digits.
. The concatenation of L and D, LD, is the set of strings consisting of a letter followed by a digit.
. The Kleene closure of L, L*, is the set of all strings of letters, including ε.
. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
. D+ is the set of strings of one or more digits.

Notation
. Let S be a set of characters. A language over S is a set of strings of characters belonging to S
. A regular expression r denotes a language L(r) . Rules that define the regular expressions over S - ε is a regular expression that denotes {ε}, the set containing the empty string - If a is a symbol in S then a is a regular expression that denotes {a} Let S be a set of characters. A language over S is a set of strings of characters belonging to S. A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining in various ways the languages denoted by the sub expressions of r. Following are the rules that define the regular expressions over S: . ε is a regular expression that denotes {ε}, that is, the set containing the empty string. . If a is a symbol in S then a is a regular expression that denotes {a}, i.e., the set containing the string a. Although we use the same notation for all three, technically, the regular expression a is different from the string a or the symbol a. It will be clear from the context whether we are talking about a as a regular expression, string or symbol.
Notation ....
. If r and s are regular expressions denoting the languages L(r) and L(s) then . (r)|(s) is a regular expression denoting L(r) U L(s) . (r)(s) is a regular expression denoting L(r)L(s) . (r)* is a regular expression denoting (L(r))* . (r) is a regular expression denoting L(r )
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
. (r)|(s) is a regular expression denoting L(r) U L(s). . (r)(s) is a regular expression denoting L(r)L(s). . (r)* is a regular expression denoting (L(r))*. . (r) is a regular expression denoting L(r). Let us take an example to illustrate. Let S = {a, b}. 1. The regular expression a|b denotes the set {a, b}. 2. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another regular expression for this same set is aa|ab|ba|bb. 3. The regular expression a* denotes the set of all strings of zero or more a's, i.e., {ε, a, aa, aaa, ...}.
4. The regular expression (a|b)* denotes the set of all strings containing zero or more instances of an a or b, that is, the set of all strings of a's and b's. Another regular expression for this set is (a*b*)*. 5. The regular expression a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b. If two regular expressions denote the same language, we say r and s are equivalent and write r = s. For example, (a|b) = (b|a).
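As a quick illustration (this uses the POSIX regex library, which is not part of the notes), membership of a string in the language denoted by the regular expression a | a*b from item 5 can be tested mechanically:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* the regular expression a | a*b, anchored so the whole string must match */
    regcomp(&re, "^(a|a*b)$", REG_EXTENDED);

    const char *tests[] = { "a", "b", "aaab", "ba" };
    for (int i = 0; i < 4; i++)
        printf("%-4s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "is in L(a|a*b)"
                                                       : "is not in L(a|a*b)");
    regfree(&re);
    return 0;
}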
Notation ....
. Precedence and associativity . *, concatenation, and | are left associative . * has the highest precedence . Concatenation has the second highest precedence . | has the lowest precedence
Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that: . The unary operator * has the highest precedence and is left associative. . Concatenation has the second highest precedence and is left associative. . | has the lowest precedence and is left associative. Under these conventions, (a)|((b)*(c)) is equivalent to a|b*c. Both expressions denote the set of strings that are either a single a or zero or more b's followed by one c.

How to specify tokens
If S is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
...
dn → rn
where each di is a distinct name, and each ri is a regular expression over the symbols in S U {d1, d2, ..., di-1}, i.e., the basic symbols and the previously defined names. By restricting each ri to symbols of S and the previously defined names, we can construct a regular expression over S for any ri by repeatedly replacing regular-expression names by the expressions they denote. If ri could use dk for some k >= i, then ri might be recursively defined and this substitution process would not terminate. So, we treat tokens as terminal symbols in the grammar for the source language. The lexeme matched by the pattern for a token consists of a string of characters in the source program and can be treated as a lexical unit. The lexical analyzer collects information about tokens into their associated attributes. As a practical matter, a token usually has only a single attribute: a pointer to the symbol table entry in which the information about the token is kept; the pointer becomes the attribute for the token.
Examples
Now we look at the regular definitions for writing an email address [email protected], the set of alphabets being S = letter U {@, .}:
letter → a | b | ... | z | A | B | ... | Z i.e., any lower case or upper case letter
name → letter+ i.e., a string of one or more letters
address → name '@' name '.' name '.' name
Examples
. Identifier
letter → a | b | ... | z | A | B | ... | Z
digit → 0 | 1 | ... | 9
identifier → letter (letter | digit)*
. Unsigned number in Pascal
digit → 0 | 1 | ... | 9
digits → digit+
fraction → '.' digits | ε
exponent → (E ('+' | '-' | ε) digits) | ε
number → digits fraction exponent

Here are some more examples. The set of identifiers is the set of strings of letters and digits beginning with a letter. Here is a regular definition for this set:
letter → a | b | ... | z | A | B | ... | Z i.e., any lower case or upper case letter
digit → 0 | 1 | ... | 9 i.e., a single digit
identifier → letter (letter | digit)* i.e., a string of letters and digits beginning with a letter

Unsigned numbers in Pascal are strings such as 5280, 39.37, 6.336E4, 1.894E-4. Here is a regular definition for this set:
digit → 0 | 1 | ... | 9 i.e., a single digit
digits → digit+ i.e., a string of one or more digits
fraction → '.' digits | ε i.e., a decimal point followed by one or more digits, or the empty string
exponent → (E ('+' | '-' | ε) digits) | ε i.e., an optional exponent part
number → digits fraction exponent
Regular expressions describe many useful languages. A regular expression is built out of simpler regular expressions using a set of defining rules, and each regular expression R denotes a regular language L(R). The defining rules specify how L(R) is formed by combining in various ways the languages denoted by the sub expressions of R. But regular expressions are only specifications; an implementation is still required. The problem before us is: given a string s and a regular expression R, determine whether s ∈ L(R). The solution to this problem is the basis of lexical analyzers. However, just determining whether s ∈ L(R) is not enough. In fact, the goal is to partition the input into tokens. Apart from this we have to do bookkeeping and push back extra characters.
The algorithm gives priority to tokens listed earlier
- Treats "if" as a keyword and not an identifier
. How much input is used? What if
- x1...xi ∈ L(R)
- x1...xj ∈ L(R)
- Pick the longest possible string in L(R)
- The principle of "maximal munch"
. Regular expressions provide a concise and useful notation for string patterns
. Good algorithms require a single pass over the input
A simple technique for separating keywords and identifiers is to initialize the symbol table appropriately, i.e., with the keywords entered in advance. The algorithm gives priority to the tokens listed earlier and hence treats "if" as a keyword and not an identifier. The technique of placing keywords in the symbol table is almost essential if the lexical analyzer is coded by hand: without it the number of states in a lexical analyzer for a typical programming language is several hundred, while using the trick, fewer than a hundred states suffice. If a token belongs to more than one category, then we go by priority rules such as "first match" or "longest match"; we have to prioritize our rules to remove ambiguity. If both x1...xi and x1...xj are in L(R), then we pick the longest possible string in L(R). This is the principle of "maximal munch". Regular expressions provide a concise and useful notation for string patterns. Our goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute value, using the translation table. We try to use an algorithm such that we are able to tokenize the data in a single pass, i.e., to tokenize the input efficiently and correctly.

How to break up text
The string elsex=0 can be grouped as
else x = 0
or as
elsex = 0
. Regular expressions alone are not enough . Normally the longest match wins . Ties are resolved by prioritizing tokens . Lexical definitions consist of regular definitions, priority rules and the maximal munch principle We can see that regular expressions are not sufficient by themselves to help us break up the text. Consider the example "elsex=0": in different programming languages this might mean "else x = 0" or "elsex = 0". So regular expressions alone are not enough. In case there are multiple possibilities, normally the longest match wins and further ties are resolved by prioritizing tokens. Hence lexical definitions consist of regular definitions, priority rules and principles like maximal munch. Information about the language that is not in the regular language of the tokens can be used to pinpoint errors in the input. There are several ways in which redundant matching in the transition diagrams can be avoided.
Finite Automata
. Regular expressions are declarative specifications . A finite automaton is an implementation . A finite automaton consists of - An input alphabet belonging to S - A set of states S - A set of transitions statei → statej - A set of final states F - A start state n . A transition s1 → s2 on input a is read: in state s1, on input a, go to state s2 . If the end of input is reached in a final state then accept

A recognizer for a language is a program that takes as input a string x and answers yes if x is a sentence of the language and no otherwise. We compile a regular expression into a recognizer by constructing a generalized transition diagram called a finite automaton. Regular expressions are declarative specifications and the finite automaton is the implementation. It can be deterministic or non-deterministic; both are capable of recognizing precisely the regular sets. The mathematical model of a finite automaton consists of:
- an input alphabet (the set of input symbols),
- a set of states S,
- a set of transitions, i.e., a transition function move that maps state-symbol pairs to states (or to sets of states in the non-deterministic case),
- a set of final or accepting states F, and
- a start state n.
If the end of input is reached in a final state then we accept the string; otherwise, we reject it.
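As an illustrative sketch (not part of the original notes), a table-driven deterministic automaton for the pattern letter(letter|digit)* can be written in C as follows; the state numbering and character classes are assumptions made for this example:

#include <stdio.h>
#include <ctype.h>

/* character classes: 0 = letter, 1 = digit, 2 = other */
static int class_of(int c) {
    if (isalpha(c)) return 0;
    if (isdigit(c)) return 1;
    return 2;
}

/* states: 0 = start, 1 = in identifier (accepting), 2 = dead */
static const int move[3][3] = {
    /* letter digit other */
    {  1,     2,    2 },   /* state 0 */
    {  1,     1,    2 },   /* state 1 */
    {  2,     2,    2 }    /* state 2 (dead) */
};

/* returns 1 if the whole string s matches letter(letter|digit)*, 0 otherwise */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = move[state][class_of((unsigned char)*s)];
    return state == 1;
}

int main(void) {
    printf("%d %d %d\n", is_identifier("counter"), is_identifier("x1"), is_identifier("1x"));
    /* prints 1 1 0 */
    return 0;
}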
Pictorial notation
. A state
. Transition from state i to state j on an input a
A state is represented by a circle, a final state by two concentric circles, and a transition by an arrow.
. Construct an analyzer that will return <token, attribute> pairs We now consider the following grammar and try to construct an analyzer that will return <token, attribute> pairs:
relop → < | <= | = | <> | >= | >
id → letter (letter | digit)*
num → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
delim → blank | tab | newline
ws → delim+
Using set of rules as given in the example above we would be able to recognize the tokens. Given a regular expression R and input string x, we have two methods for determining whether x is in
L(R). One approach is to use an algorithm to construct an NFA N from R, and the other approach is to use a DFA. We will study both these approaches in detail in future slides.
token is relop, lexeme is <
token is relop, lexeme is <>
token is relop, lexeme is <=
token is relop, lexeme is =
token is relop, lexeme is >=
token is relop, lexeme is >
In the case of < or >, we need a lookahead character to see whether the lexeme is <, <= or <>, or whether it is > or >=. We also need a global data structure which stores all the characters; in lex, the lexeme is stored in yytext. We can recognize the lexeme by using the transition diagram shown in the slide. Depending upon the number of checks a relational operator needs, we land up in a different kind of final state, e.g. >= and > are different. From the transition diagram in the slide it is clear that we can land up in six kinds of relops.
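A sketch of this transition diagram in C (illustrative only; the attribute names LT, LE, NE, EQ, GE, GT are assumptions made for this example):

#include <stdio.h>

enum { LT, LE, NE, EQ, GE, GT, OTHER };   /* assumed attribute codes for relop */

/* Recognizes a relational operator at the start of s.
   Returns the attribute and stores the number of characters consumed in *len. */
int relop(const char *s, int *len) {
    *len = 1;
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        return LT;                        /* lookahead character left for the next token */
    case '=':
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        return GT;
    default:
        *len = 0;
        return OTHER;
    }
}

int main(void) {
    int n;
    printf("%d %d %d\n", relop("<=", &n), relop("<>", &n), relop(">", &n));
    /* prints 1 2 5, i.e., LE, NE, GT */
    return 0;
}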
Transition diagram for identifiers: in order to reach the final state, the analyzer must encounter a letter followed by zero or more letters or digits and then some other symbol. Transition diagram for white spaces: in order to reach the final state, it must encounter a delimiter (tab, white space) followed by zero or more delimiters and then some other symbol.
Transition diagram for Unsigned Numbers: We can have three kinds of unsigned numbers and hence need three transition diagrams which distinguish each of them. The first one recognizes exponential numbers. The second one recognizes real numbers. The third one recognizes integers.
Transition diagram for unsigned numbers . The lexeme for a given token must be the longest possible
. Assume input to be 12.34E56 . Starting in the third diagram the accept state will be reached after 12 . Therefore, the matching should always start with the first transition diagram . If failure occurs in one transition diagram then retract the forward pointer to the start state and activate the next diagram . If failure occurs in all diagrams then a lexical error has occurred The lexeme for a given token must be the longest possible. For example, let us assume the input to be 12.34E56 . In this case, the lexical analyzer must not stop after seeing 12 or even 12.3. If we start at the third diagram (which recognizes the integers) in the previous slide, the accept state will be reached after 12. Therefore, the matching should always start with the first transition diagram. In case a failure occurs in one transition diagram then we retract the forward pointer to the start state and start analyzing using the next diagram. If failure occurs in all diagrams then a lexical error has occurred i.e. the input doesn't pass through any of the three transition diagrams. So we need to prioritize our rules and try the transition diagrams in a certain order (changing the
order may put us into trouble). We also have to take care of the principle of maximal munch i.e. the automata should try matching the longest possible token as lexeme.
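The following is a minimal sketch, in C, of this "try the diagrams in priority order, retract on failure" strategy for the three unsigned-number diagrams. The function names and the exact shapes of the three recognizers are simplified assumptions made only for illustration; each returns the length of the prefix it accepts, or 0 when its diagram fails.

#include <ctype.h>
#include <stdio.h>

/* Helper: length of the run of digits at the start of s (0 if none). */
static int digits(const char *s)
{
    int n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Illustrative, simplified recognizers for the three kinds of
 * unsigned numbers.  Each returns the length of the prefix of s that
 * its transition diagram accepts, or 0 if the diagram fails.         */
static int try_integer(const char *s)            /* digit+                 */
{
    return digits(s);
}

static int try_real(const char *s)               /* digit+ '.' digit+      */
{
    int i = digits(s), f;
    if (i == 0 || s[i] != '.') return 0;
    f = digits(s + i + 1);
    return f ? i + 1 + f : 0;
}

static int try_exponential(const char *s)        /* real E ('+'|'-')? digit+ */
{
    int r = try_real(s), e, sign = 0;
    if (r == 0 || s[r] != 'E') return 0;
    if (s[r + 1] == '+' || s[r + 1] == '-') sign = 1;
    e = digits(s + r + 1 + sign);
    return e ? r + 1 + sign + e : 0;
}

/* Try the diagrams in priority order so that the longest lexeme wins.
 * "Retracting" here just means restarting from `start` for the next
 * diagram; a return value of 0 signals a lexical error.              */
int match_unsigned_number(const char *start)
{
    int (*diagram[3])(const char *) = { try_exponential, try_real, try_integer };
    for (int i = 0; i < 3; i++) {
        int len = diagram[i](start);
        if (len > 0) return len;
    }
    return 0;
}

int main(void)
{
    printf("longest lexeme in \"12.34E56\": %d characters\n",
           match_unsigned_number("12.34E56"));              /* prints 8 */
    return 0;
}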
A more complex transition diagram is difficult to implement and may give rise to errors during coding; however, there are ways to implement it better.
We can reduce the number of transition diagrams (automata) by clubbing all these diagrams into a single diagram in some cases. But because of too many arrows going out of each state, the complexity of the code may increase very much. This may lead to errors creeping in during coding. So it is not advisable to reduce the number of transition diagrams at the cost of making them too complex to understand. However, if we use multiple transition diagrams, then the tradeoff is that we may have to unget() a large number of characters as we need to recheck the entire input in some other transition diagram.
Lexical analyzer generator
. Input to the generator - List of regular expressions in priority order - Associated actions for each regular expression (generates kind of token and other book keeping information) . Output of the generator - Program that reads the input character stream and breaks it into tokens - Reports lexical errors (unexpected characters), if any
We assume that we have a specification of lexical analyzers in the form of regular expressions and the corresponding action parameters. An action parameter is the program segment that is to be executed whenever a lexeme matched by its regular expression is found in the input. So, the input to the generator is a list of regular expressions in priority order and an associated action for each of the regular expressions. These actions generate the kind of token and other book keeping information. Our problem is to construct a recognizer that looks for lexemes in the input buffer. If more than one pattern matches, the recognizer is to choose the longest lexeme matched. If there are two or more patterns that match the longest lexeme, the first listed matching pattern is chosen. So, the output of the generator is a program that reads the input character stream and breaks it into tokens. It also reports lexical errors, i.e. cases where unexpected characters occur or an input string doesn't match any of the regular expressions.
Refer to LEX User's Manual In this section, we consider the design of a software tool that automatically constructs the lexical analyzer code from the LEX specifications. LEX is one such lexical analyzer generator which produces C code based on the token specifications. This tool has been widely used to specify lexical analyzers for a variety of languages. We refer to the tool as Lex Compiler, and to its input specification as the Lex language. Lex is generally used in the manner depicted in the slide. First, a specification of a lexical analyzer is prepared by creating a program lex.l in the lex language. Then, the lex.l is run through the Lex compiler to produce
a C program lex.yy.c . The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular expressions of the lex.l, together with a standard routine that uses the table to recognize lexemes. The actions associated with the regular expressions in lex.l are pieces of C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run through the C compiler to produce an object program a.out which is the lexical analyzer that transforms the input stream into a sequence of tokens.
Syntax Analysis
. Check syntax and construct abstract syntax tree
. Error reporting and recovery . Model using context free grammars . Recognize using Push down automata/Table Driven Parsers This is the second phase of the compiler. In this phase, we check the syntax and construct the abstract syntax tree. This phase is modeled through context free grammars and the structure is recognized through push down automata or table-driven parsers. The syntax analysis phase verifies that the string can be generated by the grammar for the source language. In case of any syntax errors in the program, the parser tries to report as many errors as possible. Error reporting and recovery form a very important part of the syntax analyzer. The error handler in the parser has the following goals: . It should report the presence of errors clearly and accurately. . It should recover from each error quickly enough to be able to detect subsequent errors. . It should not significantly slow down the processing of correct programs.
Syntax definition
. Context free grammars - a set of tokens (terminal symbols) - a set of non terminal symbols - a set of productions of the form nonterminal → string of terminals and non terminals - a start symbol; together written as <T, N, P, S> . A grammar derives strings by beginning with the start symbol and repeatedly replacing a non terminal by the right hand side of a production for that non terminal. . The strings that can be derived from the start symbol of a grammar G form the language L(G) defined by the grammar. In this section, we review the definition of a context free grammar and introduce terminology for talking about parsing. A context free grammar has four components:
A set of tokens , known as terminal symbols. Terminals are the basic symbols from which strings are formed. A set of non-terminals . Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar. A set of productions . The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production. A designation of one of the non-terminals as the start symbol , and the set of strings it denotes is the language defined by the grammar.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right hand side of a production for that non-terminal.
Examples
. String of balanced parentheses: S → ( S ) S | ε
. Grammar:
list → list + digit | list - digit | digit
digit → 0 | 1 | ... | 9
consists of the language which is a list of digits separated by + or -.
S → ( S ) S | ε is the grammar for a string of balanced parentheses. For example, consider the string (( )( )). It can be derived as:
S ⇒ (S)S ⇒ ((S)S)S ⇒ (( )S)S ⇒ (( )(S)S)S ⇒ (( )( )S)S ⇒ (( )( ))S ⇒ (( )( ))
Derivation
list ⇒ list + digit
     ⇒ list - digit + digit
     ⇒ digit - digit + digit
     ⇒ 9 - digit + digit
     ⇒ 9 - 5 + digit
     ⇒ 9 - 5 + 2
Therefore, the string 9-5+2 belongs to the language specified by the grammar. The name context free comes from the fact that the use of a production X → α does not depend on the context of X. For example, consider the string 9 - 5 + 2. It can be derived as:
list ⇒ list + digit ⇒ list - digit + digit ⇒ digit - digit + digit ⇒ 9 - digit + digit ⇒ 9 - 5 + digit ⇒ 9 - 5 + 2
It would be interesting to know that the name context free grammar comes from the fact that the use of a production X → α does not depend on the context of X.
Examples .
. Grammar for Pascal block
block → begin statements end
statements → stmt-list | ε
stmt-list → stmt-list ; stmt | stmt
Syntax analyzers
. Testing for membership whether w belongs to L(G) is just a "yes" or "no" answer . However the syntax analyzer - Must generate the parse tree
- Handle errors gracefully if string is not in the language . Form of the grammar is important - Many grammars generate the same language - Tools are sensitive to the grammar
A parse tree may be viewed as a graphical representation for a derivation that filters out the choice regarding replacement order. Each interior node of a parse tree is labeled by some nonterminal A, and the children of the node are labeled, from left to right, by the symbols in the right side of the production by which this A was replaced in the derivation. A syntax analyzer not only tests whether a construct is syntactically correct, i.e. belongs to the language represented by the specified grammar, but also generates the parse tree. It also reports appropriate error messages in case the string is not in the language represented by the grammar specified. It is possible that many grammars represent the same language. However, tools such as yacc or other parser generators are sensitive to the grammar form. For example, if the grammar has shift-reduce or reduce-reduce conflicts, the parser tool will give an appropriate warning message. We will study these in detail in the subsequent sections.
Derivation
. If there is a production A → α then we say that A derives α, and this is denoted by A ⇒ α
. If α1 ⇒ α2 ⇒ ... ⇒ αn then we say α1 derives αn
. If S ⇒+ w, where w is a string of terminals, then w belongs to L(G)
. If S ⇒* α, where α is a string of terminals and non terminals of G, then we say that α is a sentential form of G
If there is a production A → α then it is read as "A derives α" and is denoted by A ⇒ α. The production tells us that we could replace one instance of an A in any string of grammar symbols by α. In a more abstract setting, we say that βAγ ⇒ βαγ if A → α is a production and β and γ are arbitrary strings of grammar symbols.
If α1 ⇒ α2 ⇒ ... ⇒ αn then we say α1 derives αn. The symbol ⇒ means "derives in one step". Often we wish to say "derives in one or more steps"; for this purpose, we can use the symbol ⇒ with a + on its top, written ⇒+ (and ⇒* for "derives in zero or more steps"). Thus, if a string w of terminals belongs to a grammar G, it is written as S ⇒+ w. If S ⇒* α, where α may contain non-terminals, then we say that α is a sentential form of G. A sentence is a sentential form with no non-terminals.
Derivation .
. If in a sentential form only the leftmost non terminal is replaced then it becomes a leftmost derivation . Every leftmost step can be written as wAγ ⇒lm wδγ where w consists of terminals only and A → δ is the production applied . Similarly, rightmost derivation can be defined . An ambiguous grammar is one that produces more than one leftmost/rightmost derivation of a sentence
Consider the derivations in which only the leftmost non-terminal in any sentential form is replaced at each step. Such derivations are termed leftmost derivations. If α ⇒ β by a step in which the leftmost non-terminal in α is replaced, we write α ⇒lm β. Using our notational conventions, every leftmost step can be written wAγ ⇒lm wδγ where w consists of terminals only, A → δ is the production applied, and γ is a string of grammar symbols. If α derives β by a leftmost derivation, then we write α ⇒lm* β. If S ⇒lm* α, then we say α is a left-sentential form of the grammar at hand. Analogous definitions hold for rightmost derivations, in which the rightmost non-terminal is replaced at each step. Rightmost derivations are sometimes called canonical derivations. A grammar that produces more than one leftmost or more than one rightmost derivation for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one parse tree for some sentence.
Parse tree
It shows how the start symbol of a grammar derives a string in the language
- the root is labeled by the start symbol
- leaf nodes are labeled by tokens
- each internal node is labeled by a non terminal
- if A is a non-terminal labeling an internal node and x1, x2, ..., xn are the labels of the children of that node then A → x1 x2 ... xn is a production
A parse tree may be viewed as a graphical representation for a derivation that filters out the choice regarding replacement order. Thus, a parse tree pictorially shows how the start symbol of a grammar derives a string in the language. Each interior node of a parse tree is labeled by some non-terminal A, and the children of the node are labeled, from left to right, by the symbols in the right side of the production by which this A was replaced in the derivation. The root of the parse tree is labeled by the start symbol and the leaves by non-terminals or terminals and, read from left to right, they constitute a sentential form, called the yield or frontier of the tree. So, if A is a non-terminal labeling an internal node and x1, x2, ..., xn are the labels of the children of that node then A → x1 x2 ... xn is a production. We will consider an example in the next slide.
Example
Parse tree for 9-5+2
The parse tree for 9-5+2 implied by the derivation in one of the previous slides is shown. . 9 is a list by production (3), since 9 is a digit. . 9-5 is a list by production (2), since 9 is a list and 5 is a digit. . 9-5+2 is a list by production (1), since 9-5 is a list and 2 is a digit.
Production 1: list → list + digit
Production 2: list → list - digit
Production 3: list → digit
digit → 0|1|2|3|4|5|6|7|8|9
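As a concrete illustration (not a representation prescribed by these notes), a parse tree like this one could be stored in a C program with a node structure such as the following; the field and function names are assumptions of the sketch.

#include <stdio.h>
#include <stdlib.h>

/* An illustrative representation of a parse-tree node: internal
 * nodes carry a non-terminal name, leaves carry a token, and the
 * children are stored left to right, so reading the leaves left to
 * right gives back the derived string (the yield).                  */
struct node {
    const char  *label;        /* e.g. "list", "9", "+"               */
    struct node *child[3];     /* at most three children needed here  */
    int          nchild;
};

static struct node *mknode(const char *label, int nchild,
                           struct node *c0, struct node *c1, struct node *c2)
{
    struct node *n = malloc(sizeof *n);
    n->label = label; n->nchild = nchild;
    n->child[0] = c0; n->child[1] = c1; n->child[2] = c2;
    return n;
}

/* Parse tree for 9-5+2 (the intermediate digit nodes are elided):
 *   production 3:  list -> digit            covers "9"
 *   production 2:  list -> list - digit     covers "9-5"
 *   production 1:  list -> list + digit     covers "9-5+2"           */
static struct node *tree_9_minus_5_plus_2(void)
{
    struct node *nine = mknode("9", 0, NULL, NULL, NULL);
    struct node *five = mknode("5", 0, NULL, NULL, NULL);
    struct node *two  = mknode("2", 0, NULL, NULL, NULL);
    struct node *l1 = mknode("list", 1, nine, NULL, NULL);
    struct node *l2 = mknode("list", 3, l1, mknode("-", 0, NULL, NULL, NULL), five);
    return            mknode("list", 3, l2, mknode("+", 0, NULL, NULL, NULL), two);
}

static void print_yield(const struct node *n)    /* leaves, left to right */
{
    if (n->nchild == 0) { printf("%s", n->label); return; }
    for (int i = 0; i < n->nchild; i++) print_yield(n->child[i]);
}

int main(void)
{
    print_yield(tree_9_minus_5_plus_2());        /* prints 9-5+2 */
    printf("\n");
    return 0;
}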
Ambiguity
. A grammar can have more than one parse tree for a string. Consider the grammar
string → string + string | string - string | 0 | 1 | ... | 9
. The string 9-5+2 has two parse trees
A grammar is said to be an ambiguous grammar if there is some string that it can generate in more than one way (i.e., the string has more than one parse tree or more than one leftmost derivation). A language is inherently ambiguous if it can only be generated by ambiguous grammars. For example, consider the following grammar:
string → string + string | string - string | 0 | 1 | ... | 9
In this grammar, the string 9-5+2 has two possible parse trees as shown in the next slide.
Consider the parse trees for the string 9-5+2; an expression like this has more than one parse tree. The two trees for 9-5+2 correspond to the two ways of parenthesizing the expression: (9-5)+2 and 9-(5+2). The second parenthesization gives the expression the value 2 instead of 6.
Ambiguity .
. Ambiguity is problematic because the meaning of the program can be incorrect . Ambiguity can be handled in several ways - Enforce associativity and precedence - Rewrite the grammar (cleanest way) . There are no general techniques for handling ambiguity . It is impossible to convert automatically an ambiguous grammar to an unambiguous one
Ambiguity is harmful to the intent of the program. The input might be deciphered in a way which was not really the intention of the programmer, as shown above in the 9-5+2 example. There is no general technique to handle ambiguity, i.e. it is not possible to develop some feature which automatically identifies and removes ambiguity from any grammar. However, broadly speaking, it can be removed in the following possible ways: 1) rewriting the whole grammar unambiguously, or 2) implementing precedence and associativity rules in the grammar. We shall discuss this technique in the later slides.
Associativity
. If an operand has operators on both sides, the side on which the operator takes this operand is the associativity of that operator . In a+b+c, b is taken by the left +
. +, -, *, / are left associative . ^, = are right associative . Grammar to generate strings with right associative operators:
right → letter = right | letter
letter → a | b | ... | z
A binary operation * on a set S that does not satisfy the associative law is called non-associative. A left-associative operation is a non-associative operation that is conventionally evaluated from left to right, i.e. the operand is taken by the operator on its left side. For example,
6*5*4 = (6*5)*4 and not 6*(5*4)
6/5/4 = (6/5)/4 and not 6/(5/4)
A right-associative operation is a non-associative operation that is conventionally evaluated from right to left, i.e. the operand is taken by the operator on its right side. For example,
6^5^4 = 6^(5^4) and not (6^5)^4
x=y=z=5 means x=(y=(z=5))
Following is the grammar to generate strings with left associative operators. (Note that this is left recursive and may go into an infinite loop. But we will handle this problem later on by making it right recursive.)
left → left + letter | letter
letter → a | b | ... | z
Precedence
. String a+5*2 has two possible interpretations because of two different parse trees corresponding to (a+5)*2 and a+(5*2) . Precedence determines the correct interpretation. Precedence is a simple ordering, based on either importance or sequence. One thing is said to "take precedence" over another if it is either regarded as more important or is to be performed first. For example, consider the string a+5*2. It has two possible interpretations because of two different parse trees corresponding to (a+5)*2 and a+(5*2). But the * operator has precedence over the + operator. So, the second interpretation is correct. Hence, the precedence determines the correct interpretation.
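One common way to build the correct interpretation into the grammar itself is to introduce one non-terminal per precedence level, in the style of the E, T, F expression grammar that appears later in these notes:
E → E + T | T
T → T * F | F
F → ( E ) | id
Since * is introduced one level below +, a string such as id + id * id has only one parse tree, the one corresponding to id + (id * id).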
Parsing
. Process of determination whether a string can be generated by a grammar
. Parsing falls in two categories: - Top-down parsing: Construction of the parse tree starts at the root (from the start symbol) and proceeds towards leaves (token or terminals) - Bottom-up parsing: Construction of the parse tree starts from the leaf nodes (tokens or terminals of the grammar) and proceeds towards root (start symbol) Parsing is the process of analyzing a continuous stream of input (read from a file or a keyboard, for example) in order to determine its grammatical structure with respect to a given formal grammar. The task of the parser is essentially to determine if and how the input can be derived from the start symbol within the rules of the formal grammar. This can be done in essentially two ways: . Top-down parsing - A parser can start with the start symbol and try to transform it to the input. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. LL parsers are examples of top-down parsers. We will study about these in detail in the coming slides. . Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. We will study about these in detail in the coming slides.
Example .
. Construction of a parse tree is done by starting with the root labeled by the start symbol . repeat the following two steps - at a node labeled with non terminal A select one of the productions of A and construct children nodes - find the next node at which a subtree is constructed
To construct a parse tree for a string, we initially create a tree consisting of a single node (root node) labeled by the start symbol. Thereafter, we repeat the following steps to construct the parse tree by starting at the root labeled by the start symbol: . At a node labeled with non terminal A, select one of the productions of A and construct the children nodes. . Find the next node at which a subtree is constructed.
Example .
. Parse array [ num dotdot num ] of integer
. Cannot proceed as the non terminal "simple" never generates a string beginning with token "array". Therefore, this requires back-tracking. . Back-tracking is not desirable, therefore, we take help of a "look-ahead" token. The current token is treated as the look-ahead token. (This restricts the class of grammars.)
To construct a parse tree corresponding to the string array [ num dotdot num ] of integer, we start with the start symbol type. Then, we use the production type → simple to expand the tree further and construct the first child node. Now, finally, the non-terminal simple should lead to the original string. But, as we can see from the grammar, the expansion of the non-terminal simple never generates a string beginning with the token "array". So, at this stage, we come to know that we had used the wrong production to expand the tree in the first step and we should have used some other production. So, we need to backtrack now. This backtracking tends to cause a lot of overhead during the parsing of a string and is therefore not desirable. To overcome this problem, a "look-ahead" token can be used. In this method, the current token is treated as the look-ahead token and the parse tree is expanded by using the production which is determined with the help of the look-ahead token.
Parse array [ num dotdot num ] of integer using the grammar:
type → simple | ↑ id | array [ simple ] of type
simple → integer | char | num dotdot num
Initially, the token array is the lookahead symbol and the known part of the parse tree consists of the root, labeled with the starting non-terminal type. For a match to occur, non-terminal type must derive a string that starts with the lookahead symbol array. In the grammar, there is just one production of such type, so we select it, and construct the children of the root labeled with the right side of the production. In this way we continue; when the node being considered in the parse tree is for a terminal and the terminal matches the lookahead symbol, then we advance in both the parse tree and the input. The next token in the input becomes the new lookahead symbol and the next child in the parse tree is considered.
. If α is a string of terminals and non terminals, then First(α) is the set of tokens that appear as the first token in the strings generated from α
For example: First(simple) = {integer, char, num}, First(num dotdot num) = {num}
Recursive descent parsing is a top down method of syntax analysis in which a set of recursive procedures is executed to process the input. A procedure is associated with each non-terminal of the grammar. Thus, a recursive descent parser is a top-down parser built from a set of mutually-recursive procedures or a non-recursive equivalent, where each such procedure usually implements one of the production rules of the grammar. For example, consider the grammar
type → simple | ↑ id | array [ simple ] of type
simple → integer | char | num dotdot num
First(α) is the set of terminals that begin the strings derived from α. If α derives ε then ε too is in First(α). This set is called the first set of the symbol α. Therefore,
First(simple) = {integer, char, num}
First(num dotdot num) = {num}
First(type) = {integer, char, num, ↑, array}
Apart from a procedure for each non-terminal we also need an additional procedure named match. match advances to the next input token if its argument t matches the lookahead symbol.
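Below is a minimal recursive-descent sketch for this type/simple grammar, assuming C; the token codes, the tiny array-based stub lexer, and the helper names are all assumptions made only to illustrate the one-procedure-per-non-terminal structure and the role of match.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical token codes; a real lexer would supply these. */
enum token { ARRAY, OF, NUM, DOTDOT, INTEGER, CHAR, ID, UPARROW,
             LBRACK, RBRACK, DONE };

/* A tiny stub lexer over a fixed token sequence, standing in for a
 * real one; it encodes the input  array [ num dotdot num ] of integer */
static enum token input[] = { ARRAY, LBRACK, NUM, DOTDOT, NUM,
                              RBRACK, OF, INTEGER, DONE };
static int ip = 0;
static enum token lookahead;
static enum token next_token(void) { return input[ip++]; }

static void error(const char *msg)
{
    fprintf(stderr, "syntax error: %s\n", msg);
    exit(1);
}

/* match advances to the next input token if its argument t matches
 * the lookahead symbol, otherwise it reports an error.               */
static void match(enum token t)
{
    if (lookahead == t) lookahead = next_token();
    else                error("unexpected token");
}

static void type(void);          /* one procedure per non-terminal    */
static void simple(void);

/* type -> simple | ^ id | array [ simple ] of type
 * The production is selected by looking at First of each alternative. */
static void type(void)
{
    if (lookahead == INTEGER || lookahead == CHAR || lookahead == NUM)
        simple();
    else if (lookahead == UPARROW) { match(UPARROW); match(ID); }
    else if (lookahead == ARRAY) {
        match(ARRAY); match(LBRACK); simple();
        match(RBRACK); match(OF); type();
    } else
        error("expected a type");
}

/* simple -> integer | char | num dotdot num */
static void simple(void)
{
    if      (lookahead == INTEGER) match(INTEGER);
    else if (lookahead == CHAR)    match(CHAR);
    else if (lookahead == NUM)   { match(NUM); match(DOTDOT); match(NUM); }
    else                           error("expected a simple type");
}

int main(void)
{
    lookahead = next_token();
    type();
    if (lookahead == DONE) puts("accepted");
    return 0;
}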
Ambiguity
. Dangling else problem
stmt → if expr then stmt
     | if expr then stmt else stmt
. According to this grammar, the string if e1 then if e2 then S1 else S2 has two parse trees
The dangling else is a well-known problem in computer programming in which a seemingly well-defined grammar can become ambiguous. In many programming languages you can write code like if a then if b then s1 else s2, which can be understood in two ways: either as
if a then (if b then s1) else s2
or as
if a then (if b then s1 else s2)
So, according to the above grammar, the string if e1 then if e2 then S1 else S2 will have two parse trees as shown in the next slide.
The two parse trees corresponding to the string if a then if b then s1 else s2 (using the grammar stmt → if expr then stmt | if expr then stmt else stmt) are shown here.
So, we need to have some way to decide to which if an ambiguous else should be associated. It can be solved either at the implementation level, by telling the parser the right way to resolve the ambiguity, or at the grammar level, by using a parsing expression grammar or an equivalent rewriting. Basically, the idea is that a statement appearing between a then and an else must be matched, i.e. it must not end with an unmatched then followed by any statement, for the else would then be forced to match this unmatched then. So, the general rule is: "Match each else with the closest previous unmatched then". Thus, we can rewrite the grammar as the following unambiguous grammar to eliminate the dangling else problem:
stmt → matched-stmt | unmatched-stmt
matched-stmt → if expr then matched-stmt else matched-stmt
             | others
unmatched-stmt → if expr then stmt
               | if expr then matched-stmt else unmatched-stmt
A matched statement is either an if-then-else statement containing no unmatched statements or it is any other kind of unconditional statement.
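At the implementation level, the usual resolution is simply to let a recursive-descent parser consume an else greedily, binding it to the closest unmatched then. The following is a hedged, self-contained sketch of just that decision in C; the token codes (with E standing for an already-recognized condition and S for a basic statement) and the array-based stub input are assumptions made only for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical token codes: E stands for an already-recognized
 * condition and S for a basic statement, so the sequence below
 * encodes  if E then if E then S else S .                            */
enum token { IF, THEN, ELSE, E, S, DONE };

static enum token input[] = { IF, E, THEN, IF, E, THEN, S, ELSE, S, DONE };
static int ip = 0;
static enum token lookahead;

static void advance(void) { lookahead = input[ip++]; }
static void match(enum token t)
{
    if (lookahead == t) advance();
    else { fprintf(stderr, "syntax error\n"); exit(1); }
}

/* stmt -> if expr then stmt [ else stmt ] | S
 * The optional else is consumed greedily, which binds every else to
 * the closest previous unmatched then (the "closest if").            */
static int depth = 0;
static void stmt(void)
{
    if (lookahead == IF) {
        int my_if = ++depth;
        match(IF); match(E); match(THEN); stmt();
        if (lookahead == ELSE) {             /* greedy: else binds here */
            printf("else bound to if #%d\n", my_if);
            match(ELSE); stmt();
        }
    } else {
        match(S);
    }
}

int main(void)
{
    advance();
    stmt();
    if (lookahead == DONE) puts("accepted");  /* reports: else bound to if #2 */
    return 0;
}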
Left recursion is an issue of concern in top down parsers. A grammar is left-recursive if we can find some non-terminal A which will eventually derive a sentential form with itself as the leftmost symbol. In other words, a grammar is left recursive if it has a non terminal A such that there is a derivation A ⇒+ Aα for some string α. These derivations may lead to an infinite loop. Removal of left recursion:
This slide shows the parse trees corresponding to a string of the form βα* using the original grammar (with left recursion) and the modified grammar (without left recursion).
Example
. Consider the grammar for arithmetic expressions
E → E + T | T
T → T * F | F
F → ( E ) | id
As another example, a grammar having left recursion and its modified version with left recursion removed has been shown.
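Although the modified slide itself is not reproduced here, applying the transformation described below to this expression grammar gives the left-recursion-free form that also appears later in the predictive parsing example:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id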
The general algorithm to remove left recursion follows. Several improvements to this method have been made. For each rule of the form
A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
Where: . A is a left-recursive non-terminal. . α is a sequence of non-terminals and terminals that is not null (α ≠ ε). . β is a sequence of non-terminals and terminals that does not start with A.
Replace the A-productions by the productions:
A → β1 A' | β2 A' | ... | βn A'
A' → α1 A' | α2 A' | ... | αm A' | ε
This newly created symbol is often called the "tail", or the "rest".
S → Aa | b
A → Ac | Sd | ε
. In such cases, left recursion is removed systematically - starting from the first rule and replacing all the occurrences of the first non terminal symbol - and then removing left recursion from the modified grammar
What we saw earlier was an example of immediate left recursion, but there may be subtle cases where left recursion occurs involving two or more productions. For example, in the grammar
S → Aa | b
A → Ac | Sd | ε
More generally, for the non-terminals A1, A2, ..., An, indirect left recursion can be defined as being of the form:
A1 → A2 α1 | ...
A2 → A3 α2 | ...
...
An → A1 α(n+1) | ...
Where α1, α2, ..., α(n+1) are sequences of non-terminals and terminals. The following algorithm may be used for removal of left recursion in the general case:
Input: Grammar G with no cycles or ε-productions.
Output: An equivalent grammar with no left recursion.
Algorithm: Arrange the non-terminals in some order A1, A2, A3, ..., An.
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai → Aj γ by Ai → δ1 γ | δ2 γ | ... | δk γ, where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions
  end;
  eliminate the immediate left recursion among the Ai-productions
end
After the first step (substitute S by its R.H.S. in the rules), the grammar becomes
S → Aa | b
A → Ac | Aad | bd | ε
After the second step (removal of left recursion from the modified grammar obtained after the first step), the grammar becomes
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
Left factoring
. In top-down parsing when it is not clear which production to choose for expansion of a symbol, defer the decision till we have seen enough input. In general, if A → αβ1 | αβ2, defer the decision by rewriting the productions as A → αA' and A' → β1 | β2
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea is that when it is not clear which of two or more alternative productions to use to expand a non-terminal A, we defer the decision till we have seen enough input to make the right choice. In general, if A → αβ1 | αβ2, we defer the decision by expanding A to αA', with A' → β1 | β2.
stmt → if expr then stmt else stmt
     | if expr then stmt
can be transformed to
stmt → if expr then stmt S'
S' → else stmt | ε
We can also take care of the dangling else problem by left factoring. This can be done by left factoring the original grammar, thus transforming it to the left factored form:
stmt → if expr then stmt else stmt
     | if expr then stmt
is transformed to
stmt → if expr then stmt S'
S' → else stmt | ε
Predictive parsers
. A non recursive top down parsing method . Parser "predicts" which production to use . It removes backtracking by fixing one production for every non-terminal and input token(s) . Predictive parsers accept LL(k) languages - First L stands for left to right scan of input - Second L stands for leftmost derivation - k stands for the number of lookahead tokens . In practice LL(1) is used
In general, the selection of a production for a non-terminal may involve trial-and-error; that is, we may have to try a production and backtrack to try another production if the first is found to be unsuitable. A production is unsuitable if, after using the production, we cannot complete the tree to match the input string. Predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step. After eliminating left recursion and left factoring, we can obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking. Basically, it removes the need for backtracking by fixing one production for every non-terminal and input token(s). Predictive parsers accept LL(k) languages where: . First L: The input is scanned from left to right. . Second L: Leftmost derivations are derived for the strings. . k: The number of lookahead tokens is k. However, in practice, LL(1) grammars are used, i.e. one lookahead token is used.
Predictive parsing
. Predictive parsers can be implemented by maintaining an explicit stack
Parse table is a two dimensional array M[X,a] where "X" is a non terminal and "a" is a terminal of the grammar
It is possible to build a non recursive predictive parser maintaining a stack explicitly, rather than implicitly via recursive calls. A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a right end marker to indicate the end of the input string. The stack contains a sequence of grammar symbols with a $ on the bottom, indicating the bottom of the stack. Initially the stack contains the start symbol of the grammar on top of $. The parsing table is a two-dimensional array M [X,a] , where X is a non-terminal, and a is a terminal or the symbol $ . The key problem during predictive parsing is that of determining the production to be applied for a non-terminal. The non-recursive parser looks up the production to be applied in the parsing table. We shall see how a predictive parser works in the subsequent slides.
Parsing algorithm
. The parser considers 'X', the symbol on top of the stack, and 'a', the current input symbol . These two symbols determine the action to be taken by the parser . Assume that '$' is a special token that is at the bottom of the stack and terminates the input string
if X = a = $ then halt
if X = a ≠ $ then pop(X) and ip++
if X is a non terminal then
  if M[X,a] = {X → UVW} then begin pop(X); push(W,V,U) end
  else error
The parser is controlled by a program that behaves as follows. The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols determine the action of the parser. Let us assume that a special symbol '$' is at the bottom of the stack and terminates the input string. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing. 2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol. 3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X,a] = {X → UVW}, the parser replaces X on top of the stack by UVW (with U on the top). If M[X,a] = error, the parser calls an error recovery routine. The behavior of the parser can be described in terms of its configurations, which give the stack contents and the remaining input.
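A compact sketch of this driver loop, assuming C, is given below. To keep grammar symbols to single characters, E' and T' of the grammar in the next example are renamed P and Q, id is abbreviated to i, and the parsing-table entries are hard-coded in the helper M(); all of these encodings are assumptions made only for this illustration.

#include <stdio.h>
#include <string.h>

/* Grammar (E' and T' renamed P and Q, id abbreviated to i):
 *   E -> T P     P -> + T P | epsilon
 *   T -> F Q     Q -> * F Q | epsilon
 *   F -> ( E ) | i
 * M(X,a) returns the right-hand side used to expand non-terminal X
 * on lookahead a; "" encodes an epsilon entry and NULL an error.     */
static const char *M(char X, char a)
{
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "TP"  : NULL;
    case 'P': return (a == '+') ? "+TP" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "FQ"  : NULL;
    case 'Q': return (a == '*') ? "*FQ"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return (a == 'i') ? "i" : (a == '(') ? "(E)" : NULL;
    default:  return NULL;
    }
}

static int is_nonterminal(char c) { return c >= 'A' && c <= 'Z'; }

/* Table-driven predictive parser: X is the top of stack, a the
 * current input symbol; the loop follows the three cases above.      */
int predictive_parse(const char *input)
{
    char stack[256];
    int  top = 0, ip = 0;

    stack[top++] = '$';                       /* bottom-of-stack marker */
    stack[top++] = 'E';                       /* start symbol on top    */

    while (top > 0) {
        char X = stack[top - 1];
        char a = input[ip];

        if (X == '$' && a == '$') return 1;   /* halt: success          */
        if (!is_nonterminal(X)) {             /* X is a terminal        */
            if (X != a) return 0;             /* mismatch: error        */
            top--; ip++;                      /* pop(X) and ip++        */
        } else {                              /* consult the table      */
            const char *rhs = M(X, a);
            if (rhs == NULL) return 0;        /* error entry            */
            top--;                            /* pop(X)                 */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[top++] = rhs[i];        /* push RHS reversed      */
        }
    }
    return 0;
}

int main(void)
{
    /* i+i*i$ stands for id + id * id $ from the worked example. */
    puts(predictive_parse("i+i*i$") ? "accepted" : "rejected");
    return 0;
}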
Example
. Consider the grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
As an example, we shall consider the grammar shown. A predictive parsing table for this grammar is shown in the next slide. We shall see how to construct this table later.
Blank entries are error states. For example E cannot derive a string starting with '+' A predictive parsing table for the grammar in the previous slide is shown. In the table, the blank entries denote the error states; non-blanks indicate a production with which to expand the top nonterminal on the stack.
Example
input                 action
id + id * id $        expand by E → T E'
id + id * id $        expand by T → F T'
id + id * id $        expand by F → id
id + id * id $        pop id and ip++
+ id * id $           expand by T' → ε
+ id * id $           expand by E' → + T E'
+ id * id $           pop + and ip++
id * id $             expand by T → F T'
Let us work out an example assuming that we have a parse table. We follow the predictive parsing algorithm which was stated a few slides ago. With input id + id * id, the predictive parser makes the sequence of moves as shown. The input pointer points to the leftmost symbol of the string in the INPUT column. If we observe the actions of this parser carefully, we see that it is tracing out a leftmost derivation for the input, that is, the productions output are those of a leftmost derivation. The input symbols that have already been scanned, followed by the grammar symbols on the stack (from top to bottom), make up the left-sentential forms in the derivation.
Constructing parse table
. Table can be constructed if for every non terminal, every lookahead symbol can be handled by at most one production
. First(α) for a string of terminals and non terminals α is - the set of symbols that might begin the fully expanded (made of only tokens) version of α
. Follow(X) for a non terminal X is - the set of symbols that might follow the derivation of X in the input stream
The construction of the parse table is aided by two functions associated with a grammar G. These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing table for G, whenever possible. If α is any string of grammar symbols, FIRST(α) is the set of terminals that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α). If X is a non-terminal, FOLLOW(X) is the set of terminals a that can appear immediately to the right of X in some sentential form, that is, the set of terminals a such that there exists a derivation of the form S ⇒* αXaβ for some α and β. Note that there may, at some time during the derivation, have been symbols between X and a, but if so, they derived ε and disappeared. Also, if X can be the rightmost symbol in some sentential form, then $ is in FOLLOW(X).
. If X is a non terminal and X → Y1 Y2 ... Yk is a production, then if for some i, a is in First(Yi) and ε is in all of First(Yj) (such that j < i) then a is in First(X) . If ε is in First(Y1) ... First(Yk) then ε is in First(X)
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}. 2. If X → ε is a production, then add ε to FIRST(X). 3. If X is a non terminal and X → Y1 Y2 ... Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), FIRST(Y2), ..., FIRST(Yi-1); that is, Y1 ... Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2) and so on.
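As a rough sketch of how these rules can be turned into code (assumptions: C, single-character grammar symbols with uppercase letters as non-terminals, '#' standing for ε, and the small left-recursion-free expression grammar with E' and T' renamed P and Q), the FIRST sets can be computed by iterating the rules to a fixed point:

#include <stdio.h>

/* Grammar (E' and T' renamed P and Q, id abbreviated to i, '#'
 * standing for epsilon) -- all encodings are assumptions made only
 * for this sketch:
 *   E -> TP   P -> +TP | #   T -> FQ   Q -> *FQ | #   F -> (E) | i   */
static const char *lhs[] = { "E", "P", "P", "T", "Q", "Q", "F", "F" };
static const char *rhs[] = { "TP", "+TP", "#", "FQ", "*FQ", "#", "(E)", "i" };
#define NPROD 8

static int first[128][128];      /* first[X][a] != 0  <=>  a is in FIRST(X) */

static int nonterminal(char c) { return c >= 'A' && c <= 'Z'; }

static int add(char X, char a)   /* add a to FIRST(X); returns 1 if newly added */
{
    if (first[(int)X][(int)a]) return 0;
    first[(int)X][(int)a] = 1;
    return 1;
}

static void compute_first(void)
{
    int changed = 1;
    while (changed) {                       /* iterate the rules to a fixed point */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            char X = lhs[p][0];
            const char *Y = rhs[p];
            int all_eps = 1;
            for (int i = 0; Y[i] && all_eps; i++) {
                if (!nonterminal(Y[i])) {   /* a terminal (or '#') begins the string */
                    changed |= add(X, Y[i]);
                    all_eps = 0;
                } else {                    /* rule 3: copy FIRST(Yi) minus epsilon  */
                    for (int a = 0; a < 128; a++)
                        if (a != '#' && first[(int)Y[i]][a])
                            changed |= add(X, (char)a);
                    all_eps = first[(int)Y[i]]['#'];
                }
            }
            if (all_eps)                    /* every Yi can derive epsilon */
                changed |= add(X, '#');
        }
    }
}

int main(void)
{
    compute_first();
    const char *nts = "EPTQF";
    for (int i = 0; nts[i]; i++) {
        printf("FIRST(%c) = {", nts[i]);
        for (int a = 0; a < 128; a++)
            if (first[(int)nts[i]][a]) printf(" %c", a);
        printf(" }\n");
    }
    return 0;
}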