Modern Compiler Design
WORLDWIDE SERIES IN COMPUTER SCIENCE

Series Editors
Professor David Barron, Southampton University, UK
Professor Peter Wegner, Brown University, USA

The Worldwide series in Computer Science has been created to publish textbooks which both address and anticipate the needs of an ever evolving curriculum, thereby shaping its future. It is designed for undergraduates majoring in Computer Science and practitioners who need to reskill. Its philosophy derives from the conviction that the discipline of computing needs to produce technically skilled engineers who will inevitably face, and possibly invent, radically new technologies throughout their future careers. New media will be used innovatively to support high quality texts written by leaders in the field.

Books in Series
Ammeraal, Computer Graphics for Java Programmers
Ammeraal, C++ for Programmers 3rd Edition
Barron, The World of Scripting Languages
Ben-Ari, Ada for Software Engineers
Chapman & Chapman, Digital Multimedia
Gollmann, Computer Security
Goodrich & Tamassia, Data Structures and Algorithms in Java
Kotonya & Sommerville, Requirements Engineering: Processes and Techniques
Lowe & Hall, Hypermedia & the Web: An Engineering Approach
Magee & Kramer, Concurrency: State Models and Java Programs
Peters, Software Engineering: An Engineering Approach
Preiss, Data Structures and Algorithms with Object-Oriented Design Patterns in C++
Preiss, Data Structures and Algorithms with Object-Oriented Design Patterns in Java
Reiss, A Practical Introduction to Software Design with C++
Schneider, Concurrent and Real-time Systems
Winder & Roberts, Developing Java Software 2nd Edition

modern compiler design

Dick Grune, Vrije Universiteit, Amsterdam
Henri E. Bal, Vrije Universiteit, Amsterdam
Ceriel J. H. Jacobs, Vrije Universiteit, Amsterdam
Koen G. Langendoen, Delft University of Technology

JOHN WILEY & SONS, LTD
Chichester - New York - Weinheim - Brisbane - Singapore - Toronto

Copyright © 2000 by John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England
Telephone (+44) 1243 779777
Email (for orders and customer service enquiries):
[email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com

Reprinted March 2001, October 2002, December 2003

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to
[email protected]
, or faxed to (+44) 1243 770571.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices
John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA
Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA
Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Etobicoke, Ontario M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.

ISBN 0-471-97697-0

Printed and bound, from authors' own electronic files, in Great Britain by Biddles Ltd, King's Lynn, Norfolk.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry, for which at least two trees are planted for each one used for paper production.

Contents

Preface

1 Introduction
1.1 Why study compiler construction?
1.1.1 Compiler construction is very successful
1.1.2 Compiler construction has a wide applicability
1.1.3 Compilers contain generally useful algorithms
1.2 A simple traditional modular compiler/interpreter
1.2.1 The abstract syntax tree
1.2.2 Structure of the demo compiler
1.2.3 The language for the demo compiler
1.2.4 Lexical analysis for the demo compiler
1.2.5 Syntax analysis for the demo compiler
1.2.6 Context handling for the demo compiler
1.2.7 Code generation for the demo compiler
1.2.8 Interpretation for the demo compiler
1.3 The structure of a more realistic compiler
1.3.1 The structure
1.3.2 Run-time systems
1.3.3 Short-cuts
1.4 Compiler architectures
1.4.1 The width of the compiler
1.4.2 Who's the boss?
1.5 Properties of a good compiler
1.6 Portability and retargetability
1.7 Place and usefulness of optimizations
1.8 A short history of compiler construction
1.8.1 1945-1960: code generation
1.8.2 1960-1975: parsing
1.8.3 1975-present: code generation and code optimization; paradigms
1.9 Grammars
1.9.1 The form of a grammar
1.9.2 The production process
1.9.3 Extended forms of grammars
1.9.4 Properties of grammars
1.9.5 The grammar formalism
1.10 Closure algorithms
1.10.1 An iterative implementation of the closure algorithm
1.11 The outline code used in this book
1.12 Conclusion
Summary
Further reading
Exercises

2 From program text to abstract syntax tree
2.1 From program text to tokens - the lexical structure
2.1.1 Reading the program text
2.1.2 Lexical versus syntactic analysis
2.1.3 Regular expressions and regular descriptions
2.1.4 Lexical analysis
2.1.5 Creating a lexical analyzer by hand
2.1.6 Creating a lexical analyzer automatically
2.1.7 Transition table compression
2.1.8 Error handling in lexical analyzers
2.1.9 A traditional lexical analyzer generator - lex
2.1.10 Lexical identification of tokens
2.1.11 Symbol tables
2.1.12 Macro processing and file inclusion
2.1.13 Conclusion
2.2 From tokens to syntax tree - the syntax
2.2.1 Two classes of parsing methods
2.2.2 Error detection and error recovery
2.2.3 Creating a top-down parser manually
2.2.4 Creating a top-down parser automatically
2.2.5 Creating a bottom-up parser automatically
2.3 Conclusion
Summary
Further reading
Exercises

3 Annotating the abstract syntax tree - the context
3.1 Attribute grammars
3.1.1 Dependency graphs
3.1.2 Attribute evaluation
3.1.3 Cycle handling
3.1.4 Attribute allocation
3.1.5 Multi-visit attribute grammars
3.1.6 Summary of the types of attribute grammars
3.1.7 L-attributed grammars
3.1.8 S-attributed grammars
3.1.9 Equivalence of L-attributed and S-attributed grammars
3.1.10 Extended grammar notations and attribute grammars
3.1.11 Conclusion
3.2 Manual methods
3.2.1 Threading the AST
3.2.2 Symbolic interpretation
3.2.3 Data-flow equations
3.2.4 Interprocedural data-flow analysis
3.2.5 Carrying the information upstream - live analysis
3.2.6 Comparing symbolic interpretation and data-flow equations
3.3 Conclusion
Summary
Further reading
Exercises

4 Processing the intermediate code
4.1 Interpretation
4.1.1 Recursive interpretation
4.1.2 Iterative interpretation
4.2 Code generation
4.2.1 Avoiding code generation altogether
4.2.2 The starting point
4.2.3 Trivial code generation
4.2.4 Simple code generation
4.2.5 Code generation for basic blocks
4.2.6 BURS code generation and dynamic programming
4.2.7 Register allocation by graph coloring
4.2.8 Supercompilation
4.2.9 Evaluation of code generation techniques
4.2.10 Debugging of code optimizers
4.2.11 Preprocessing the intermediate code
4.2.12 Postprocessing the target code
4.2.13 Machine code generation
4.3 Assemblers, linkers, and loaders
4.3.1 Assembler design issues
4.3.2 Linker design issues
4.4 Conclusion
Summary
Further reading
Exercises

5 Memory management
5.1 Data allocation with explicit deallocation
5.1.1 Basic memory allocation
5.1.2 Linked lists
5.1.3 Extensible arrays
5.2 Data allocation with implicit deallocation
5.2.1 Basic garbage collection algorithms
5.2.2 Preparing the ground
5.2.3 Reference counting
5.2.4 Mark and scan
5.2.5 Two-space copying
5.2.6 Compaction
5.2.7 Generational garbage collection
5.3 Conclusion
Summary
Further reading
Exercises

6 Imperative and object-oriented programs
6.1 Context handling
6.1.1 Identification
6.1.2 Type checking
6.1.3 Conclusion
6.2 Source language data representation and handling
6.2.1 Basic types
6.2.2 Enumeration types
6.2.3 Pointer types
6.2.4 Record types
6.2.5 Union types
6.2.6 Array types
6.2.7 Set types
6.2.8 Routine types
6.2.9 Object types
6.2.10 Interface types
6.3 Routines and their activation
6.3.1 Activation records
6.3.2 Routines
6.3.3 Operations on routines
6.3.4 Non-nested routines
6.3.5 Nested routines
6.3.6 Lambda lifting
6.3.7 Iterators and coroutines
6.4 Code generation for control flow statements
6.4.1 Local flow of control
6.4.2 Routine invocation
6.4.3 Run-time error handling
6.5 Code generation for modules
6.5.1 Name generation
6.5.2 Module initialization
6.5.3 Code generation for generics
6.6 Conclusion
Summary
Further reading
Exercises

7 Functional programs
7.1 A short tour of Haskell
7.1.1 Offside rule
7.1.2 Lists
7.1.3 List comprehension
7.1.4 Pattern matching
7.1.5 Polymorphic typing
7.1.6 Referential transparency
7.1.7 Higher-order functions
7.1.8 Lazy evaluation
7.2 Compiling functional languages
7.2.1 The functional core
7.3 Polymorphic type checking
7.3.1 Polymorphic function application
7.4 Desugaring
7.4.1 The translation of lists
7.4.2 The translation of pattern matching
7.4.3 The translation of list comprehension
7.4.4 The translation of nested functions
7.5 Graph reduction
7.5.1 Reduction order
7.5.2 The reduction engine
7.6 Code generation for functional core programs
7.6.1 Avoiding the construction of some application spines
7.7 Optimizing the functional core
7.7.1 Strictness analysis
7.7.2 Boxing analysis
7.7.3 Tail calls
7.7.4 Accumulator transformation
7.7.5 Limitations
7.8 Advanced graph manipulation
7.8.1 Variable-length nodes
7.8.2 Pointer tagging
7.8.3 Aggregate node allocation
7.8.4 Vector apply nodes
7.9 Conclusion
Summary
Further reading
Exercises

8 Logic programs
8.1 The logic programming model
8.1.1 The building blocks
8.1.2 The inference mechanism
8.2 The general implementation model, interpreted
8.2.1 The interpreter instructions
8.2.2 Avoiding redundant goal lists
8.2.3 Avoiding copying goal list tails
8.3 Unification
8.3.1 Unification of structures, lists, and sets
8.3.2 The implementation of unification
8.3.3 Unification of two unbound variables
8.3.4 Conclusion
8.4 The general implementation model, compiled
8.4.1 List procedures
8.4.2 Compiled clause search and unification
8.4.3 Optimized clause selection in the WAM
8.4.4 Implementing the 'cut' mechanism
8.4.5 Implementing the predicates assert and retract
8.5 Compiled code for unification
8.5.1 Unification instructions in the WAM
8.5.2 Deriving a unification instruction by manual partial evaluation
8.5.3 Unification of structures in the WAM
8.5.4 An optimization: read/write mode
8.5.5 Further unification optimizations in the WAM
8.5.6 Conclusion
Summary
Further reading
Exercises

9 Parallel and distributed programs
9.1 Parallel programming models
9.1.1 Shared variables and monitors
9.1.2 Message passing models
9.1.3 Object-oriented languages
9.1.4 The Linda Tuple space
9.1.5 Data-parallel languages
9.2 Processes and threads
9.3 Shared variables
9.3.1 Locks
9.3.2 Monitors
9.4 Message passing
9.4.1 Locating the receiver
9.4.2 Marshaling
9.4.3 Type checking of messages
9.4.4 Message selection
9.5 Parallel object-oriented languages
9.5.1 Object location
9.5.2 Object migration
9.5.3 Object replication
9.6 Tuple space
9.6.1 Avoiding the overhead of associative addressing
9.6.2 Distributed implementations of the tuple space
9.7 Automatic parallelization
9.7.1 Exploiting parallelism automatically
9.7.2 Data dependencies
9.7.3 Loop transformations
9.7.4 Automatic parallelization for distributed-memory machines
9.8 Conclusion
Summary
Further reading
Exercises

Appendix A - A simple object-oriented compiler/interpreter
A.1 Syntax-determined classes and semantics-determining methods
A.2 The simple object-oriented compiler
A.3 Object-oriented parsing
A.4 Evaluation
Exercise

Answers to exercises
References
Index

Preface

In the 1980s and 1990s, while the world was witnessing the rise of the PC and the Internet on the front pages of the daily newspapers, compiler design methods developed with less fanfare, developments seen mainly in the technical journals, and - more importantly - in the compilers that are used to process today's software. These developments were driven partly by the advent of new programming paradigms, partly by a better understanding of code generation techniques, and partly by the introduction of faster machines with large amounts of memory.

The field of programming languages has grown to include, besides the traditional imperative paradigm, the object-oriented, functional, logical, and parallel/distributed paradigms, which inspire novel compilation techniques and which often require more extensive run-time systems than do imperative languages. BURS techniques (Bottom-Up Rewriting Systems) have evolved into very powerful code generation techniques which cope superbly with the complex machine instruction sets of present-day machines. And the speed and memory size of modern machines allow compilation techniques and programming language features that were unthinkable before. Modern compiler design methods meet these challenges head-on.

The audience

Our audience are mature students in one of their final years, who have at least used a compiler occasionally and given some thought to the concept of compilation. When these students leave the university, they will have to be familiar with language processors for each of the modern paradigms, using modern techniques. Although curriculum requirements in many universities may have been lagging behind in this respect, graduates entering the job market cannot afford to ignore these developments.
Experience has shown us that a considerable number of techniques traditionally taught in compiler construction are special cases of more fundamental techniques. Often these special techniques work for imperative languages only; the fundamental techniques have a much wider application. An example is the stack as an optimized representation for activation records in strictly last-in-first-out languages.

Therefore, this book
- focuses on principles and techniques of wide application, carefully distinguishing between the essential (= material that has a high chance of being useful to the student) and the incidental (= material that will benefit the student only in exceptional cases);
- provides a first level of implementation details and optimizations;
- augments the explanations by pointers for further study.

The student, after having finished the book, can expect to:
- have obtained a thorough understanding of the concepts of modern compiler design and construction, and some familiarity with their practical application;
- be able to start participating in the construction of a language processor for each of the modern paradigms with a minimal training period;
- be able to read the literature.
The first two provide a firm basis; the third provides potential for growth.

The structure of the book

This book is conceptually divided into two parts. The first, comprising Chapters 1 through 5, is concerned with techniques for program processing in general; it includes a chapter on memory management, both in the compiler and in the generated code. The second part, Chapters 6 through 9, covers the specific techniques required by the various programming paradigms. The interactions between the parts of the book are outlined in the table below.

The leftmost column shows the four phases of compiler construction: analysis, context handling, synthesis, and run-time systems. Chapters in this column cover both the manual and the automatic creation of the pertinent software but tend to emphasize automatic generation. The other columns show the four paradigms covered in this book; for each paradigm an example of a subject treated by each of the phases is shown. These chapters tend to contain manual techniques only, all automatic techniques having been delegated to Chapters 2 through 4.

    How to do:              in imperative and       in functional      in logic           in parallel/distributed
                            object-oriented         programs           programs           programs
                            programs (Chapter 6)    (Chapter 7)        (Chapter 8)        (Chapter 9)
    analysis (Chapter 2)    --                      --                 --                 --
    context handling        identifier              polymorphic        static rule        Linda static
    (Chapter 3)             identification          type checking      matching           analysis
    synthesis               code for                code for list      structure          marshaling
    (Chapter 4)             while-statement         comprehension      unification
    run-time systems        stack                   reduction          Warren Abstract    replication
    (no chapter)                                    machine            Machine

The scientific mind would like the table to be nice and square, with all boxes filled - in short 'orthogonal' - but we see that the top right entries are missing and that there is no chapter for 'run-time systems' in the leftmost column. The top right entries would cover such things as the special subjects in the text analysis of logic languages, but present text analysis techniques are powerful and flexible enough - and languages similar enough - to handle all language paradigms: there is nothing to be said there, for lack of problems. The chapter missing from the leftmost column would discuss manual and automatic techniques for creating run-time systems. Unfortunately there is little or no theory on this subject: run-time systems are still crafted by hand by programmers on an intuitive basis; there is nothing to be said there, for lack of solutions.

Chapter 1 introduces the reader to compiler design by examining a simple traditional modular compiler/interpreter in detail. Several high-level aspects of compiler construction are discussed, followed by a short history of compiler construction and an introduction to formal grammars.

Chapter 2 treats the analysis phase of a compiler: the conversion of the program text to an abstract syntax tree. Techniques for lexical analysis, lexical identification of tokens, and syntax analysis are discussed.

Chapter 3 covers the second phase of a compiler: context handling. Several methods of context handling are discussed: automated ones using attribute grammars, manual ones using L-attributed and S-attributed grammars, and semi-automated ones using symbolic interpretation and data-flow analysis.

Chapter 4 covers the synthesis phase of a compiler, covering both interpretation and code generation. The section on code generation is mainly concerned with machine code generation; the intermediate code required for paradigm-specific constructs is treated in Chapters 6 through 9.

Chapter 5 concerns memory management techniques, both for use in the compiler and in the generated program.

Chapters 6 through 9 address the special problems in compiling for the various paradigms - imperative, object-oriented, functional, logic, and parallel/distributed. Compilers for imperative and object-oriented programs are similar enough to be treated together in one chapter, Chapter 6.

Appendix A discusses a possible but experimental method of object-oriented compiler construction, in which an attempt is made to exploit object-oriented concepts to simplify compiler design.

Several subjects in this book are treated in a non-traditional way, and some words of justification may be in order.

Lexical analysis is based on the same dotted items that are traditionally reserved for bottom-up syntax analysis, rather than on Thompson's NFA construction. We see the dotted item as the essential tool in bottom-up pattern matching, unifying lexical analysis, LR syntax analysis, and bottom-up code generation. The traditional lexical algorithms are just low-level implementations of item manipulation. We consider the different treatment of lexical and syntax analysis to be a historical artifact. Also, the difference between the lexical and the syntax levels tends to disappear in modern software.

Considerable attention is being paid to attribute grammars, in spite of the fact that their impact on compiler design has been limited. Still, they are the only known way of automating context handling, and we hope that the present treatment will help to lower the threshold of their application.

Functions as first-class data are covered in much greater depth in this book than is usual in compiler design books.
After a good start in Algol 60, functions lost much status as manipulatable data in languages like C, Pascal, and Ada, although Ada 95 rehabilitated them somewhat. The implementation of some modern concepts, for example functional and logic languages, iterators, and continuations, however, requires functions to be manipulated as normal data. The fundamental aspects of the implementation are covered in the chapter on imperative and object-oriented languages; specifics are given in the chapters on the various other paradigms.

An attempt at justifying the outline code used in this book to specify algorithms can be found in Section 1.11.

Additional material, including more answers to exercises, and all diagrams and all code from the book, are available through John Wiley's Web page.

Use as a course book

The book contains far too much material for a compiler design course of 13 lectures of two hours each, as given at our university, so a selection has to be made. Depending on the maturity of the audience, an introductory, more traditional course can be obtained by including, for example,

Chapter 1; Chapter 2 up to 2.1.7; 2.1.10; 2.1.11; 2.2 up to 2.2.4.5; 2.2.5 up to 2.2.5.7; Chapter 3 up to 3.1.2; 3.1.7 up to 3.1.10; 3.2 up to 3.2.2.2; 3.2.3; Chapter 4 up to 4.1; 4.2 up to 4.2.4.3; 4.2.6 up to 4.2.6.4; 4.2.11; Chapter 5 up to 5.1.1.1; 5.2 up to 5.2.4; Chapter 6 up to 6.2.3.2; 6.2.4 up to 6.2.10; 6.4 up to 6.4.2.3.

A more advanced course would include all of Chapters 1 to 6, excluding Section 3.1. This could be augmented by one of Chapters 7 to 9 and perhaps Appendix A.

An advanced course would skip much of the introductory material and concentrate on the parts omitted in the introductory course: Section 3.1 and all of Chapters 5 to 9, plus Appendix A.

Acknowledgments

We owe many thanks to the following people, who were willing to spend time and effort on reading drafts of our book and to supply us with many useful and sometimes very detailed comments: Mirjam Bakker, Raoul Bhoedjang, Wilfred Dittmer, Thomer M. Gil, Ben N. Hasnai, Bert Huijpen, Jaco A. Imthorn, John Romein, Tim Rühl, and the anonymous reviewers. We thank Ronald Veldema for the Pentium code segments. We are grateful to Simon Plumtree, Gaynor Redvers-Mutton, Dawn Booth, and Jane Kerr of John Wiley & Sons Ltd, for their help and encouragement in writing this book. Lambert Meertens kindly provided information on an older ABC compiler, and Ralph Griswold on an Icon compiler.

We thank the Faculteit Wiskunde en Informatica (now part of the Faculteit der Exacte Wetenschappen) of the Vrije Universiteit for their support and the use of their equipment.

Dick Grune, [email protected], http://www.cs.vu.nl/~dick
Henri E. Bal, http://www.cs.vu.nl/~bal
Ceriel J.H. Jacobs, http://www.cs.vu.nl/~ceriel
Koen G. Langendoen, http://pds.twi.tudelft.nl/~koen

Amsterdam, May 2000

Trademark notice

Java is a trademark of Sun Microsystems
Miranda is a trademark of Research Software Ltd
MS-DOS is a trademark of Microsoft Corporation
Pentium is a trademark of Intel
PostScript is a trademark of Adobe Systems Inc
Smalltalk is a trademark of Xerox Corporation
UNIX is a trademark of AT&T

1 Introduction

In its most general form, a compiler is a program that accepts as input a program text in a certain language and produces as output a program text in another language, while preserving the meaning of that text. This process is called translation, as it would be if the texts were in natural languages. Almost all compilers translate from one input language, the source language, to one output language, the target language, only. One normally expects the source and target language to differ greatly: the source language could be C and the target language might be machine code for the Pentium processor series. The language the compiler itself is written in is called the implementation language.

The main reason why one wants such a translation is that one has hardware on which one can 'run' the translated program, or more precisely: have the hardware perform the actions described by the semantics of the program. After all, hardware is the only real source of computing power. Running a translated program often involves feeding it input data in some format, and will probably result in some output data in some other format. The input data can derive from a variety of sources; examples are files, keystrokes, and network packages. Likewise, the output can go to a variety of places; examples are files, monitor screens, and printers.

To obtain the translated program, we run a compiler, which is just another program whose input data is a file with the format of program source text and whose output data is a file with the format of executable code. A subtle point here is that the file containing the executable code is (almost) tacitly converted to a runnable program; on some operating systems this requires some action, for example setting the 'execute' attribute.

To obtain the compiler, we run another compiler whose input consists of compiler source text and which will produce executable code for it, as it would for any program source text. This process of compiling and running a compiler is depicted in Figure 1.1; that compilers can and do compile compilers sounds more confusing than it is. When the source language is also the implementation language and the source text to be compiled is actually a new version of the compiler itself, the process is called bootstrapping. The term 'bootstrapping' is traditionally attributed to a story of Baron von Münchhausen (1720-1797), although in the original story the baron pulled himself from a swamp by his hair plait, rather than by his bootstraps (Anonymous, 1840).

Figure 1.1 Compiling and running a compiler.

Compilation does not differ fundamentally from file conversion but it does differ in degree.
One clear aspect of compilation is that the input has a property called semantics - its 'meaning' - which must be preserved by the process and which is often less clearly identifiable in a traditional file conversion program, for example one that converts EBCDIC to ASCII. On the other hand, a GIF to JPEG converter has to preserve the visual impression of the picture, which might with some justification be called its semantics. In the final analysis, a compiler is just a giant file conversion program.

The compiler can work its magic because of two factors:
- the input is in a language and consequently has a structure, which is described in the language reference manual;
- the semantics of the input is described in terms of and is attached to that structure.
These factors enable the compiler to 'understand' the program and to collect its semantics in a semantic representation. The same two factors exist with respect to the target language. This allows the compiler to reformulate the collected semantics in terms of the target language. How all this is done in detail is the subject of this book.

Figure 1.2 Conceptual structure of a compiler.

The part of a compiler that performs the analysis of the source language text is called the front-end, and the part that does the target language synthesis is the back-end; see Figure 1.2. If the compiler has a very clean design, the front-end is totally unaware of the target language and the back-end is totally unaware of the source language: the only thing they have in common is knowledge of the semantic representation. There are technical reasons why such a strict separation is inefficient, and in practice even the best-structured compilers compromise.

The above description immediately suggests another mode of operation for a compiler: if all required input data are available, the compiler could perform the actions specified by the semantic representation rather than re-express them in a different form. The code-generating back-end is then replaced by an interpreting back-end, and the whole program is called an interpreter. There are several reasons for doing this, some fundamental and some more opportunistic.

One fundamental reason is that an interpreter is normally written in a high-level language and will therefore run on most machine types, whereas generated object code will only run on machines of the target type: in other words, portability is increased. Another is that writing an interpreter is much less work than writing a back-end.

A third reason for using an interpreter rather than a compiler is that performing the actions straight from the semantic representation allows better error checking and reporting to be done. This is not fundamentally so, but is a consequence of the fact that compilers (front-end/back-end combinations) are expected to generate efficient code. As a result, most back-ends throw away any information that is not essential to the program execution in order to gain speed; this includes much information that could have been useful in giving good diagnostics, for example source code line numbers.

A fourth reason is the increased security that can be achieved by interpreters; this effect has played an important role in Java's rise to fame. Again, this increased security is not fundamental since there is no reason why compiled code could not do the same checks an interpreter can.
Yet it is considerably easier to convince oneself that an interpreter does not play dirty tricks than that there are no booby traps hidden in binary executable code.

It should be pointed out that there is no fundamental difference between using a compiler and using an interpreter. In both cases the program text is processed into an intermediate form, which is then interpreted by some interpreting mechanism. In compilation,
- the program processing is considerable;
- the resulting intermediate form, machine-specific binary executable code, is low-level;
- the interpreting mechanism is the hardware CPU; and
- program execution is relatively fast.
In interpretation,
- the program processing is minimal to moderate;
- the resulting intermediate form, some system-specific data structure, is high- to medium-level;
- the interpreting mechanism is a (software) program; and
- program execution is relatively slow.
These relationships are summarized graphically in Figure 1.3. Section 4.2.3 shows how a fairly smooth shift from interpreter to compiler can be made.

Figure 1.3 Comparison of a compiler and an interpreter.

Why is a compiler called a compiler? The original meaning of 'to compile' is 'to select representative material and add it to a collection'; present-day makers of compilation compact discs use the term in its proper meaning. In its early days programming language translation was viewed in the same way: when the input contained, for example, a + b, a prefabricated code fragment 'load a in register; add b to register' was selected and added to the output. A compiler compiled a list of code fragments to be added to the translated program. Today's compilers, especially those for the non-imperative programming paradigms, often perform much more radical transformations on the input program.

After considering the question of why one should study compiler construction (Section 1.1) we will address some general issues in compiler construction, including compiler architecture (Sections 1.2 to 1.5), retargetability (Section 1.6), and the place of optimizations (Section 1.7). This is followed by a short history of compiler construction (Section 1.8). Next are two more theoretical subjects: an introduction to context-free grammars (Section 1.9), and a general closure algorithm (Section 1.10). A brief explanation of the pseudo-code used in the book (Section 1.11) concludes this introductory chapter. Occasionally, the structure of the text will be summarized in a 'roadmap' like the one below.

Roadmap
Introduction
1.1 Why study compiler construction?
1.2 A simple traditional modular compiler/interpreter
1.3 The structure of a more realistic compiler
1.4 Compiler architectures
1.5-1.7 Properties of a good compiler
1.8 A short history of compiler construction
1.9 Grammars
1.10 Closure algorithms
1.11 The outline code used in this book

1.1 Why study compiler construction?

There are a number of objective reasons why studying compiler construction is a good idea:
- compiler construction is a very successful branch of computer science, and one of the earliest to earn that predicate;
- given its close relation to file conversion, it has wider application than just compilers;
- it contains many generally useful algorithms in a realistic setting.
We will have a closer look at each of these below. The main subjective reason to study compiler construction is of course plain curiosity: it is fascinating to see how compilers manage to do what they do.

1.1.1 Compiler construction is very successful

Compiler construction is a very successful branch of computer science. Some of the reasons for this are the proper structuring of the problem, the judicious use of formalisms, and the use of tools wherever possible.

1.1.1.1 Proper structuring of the problem

Compilers analyze their input, construct a semantic representation, and synthesize their output from it. This analysis-synthesis paradigm is very powerful and widely applicable. A program for tallying word lengths in a text could for example consist of a front-end which analyzes the text and constructs internally a table of (length, frequency) pairs, and a back-end which then prints this table. Extending this program, one could replace the text-analyzing front-end by a module that collects file sizes in a file system; alternatively, or additionally, one could replace the back-end by a module that produces a bar graph rather than a printed table; we use the word 'module' here to emphasize the exchangeability of the parts. In total, four programs have already resulted, all centered around the semantic representation and each reusing lots of code from the others.

Likewise, without the strict separation of analysis and synthesis phases, programming languages and compiler construction would not be where they are today. Without it, each new language would require a completely new set of compilers for all interesting machines - or die for lack of support. With it, a new front-end for that language suffices, to be combined with the existing back-ends for the current machines: for L languages and M machines, L front-ends and M back-ends are needed, requiring L+M modules, rather than L×M programs. See Figure 1.4.

It should be noted immediately, however, that this strict separation is not completely free of charge. If, for example, a front-end knows it is analyzing for a machine with special machine instructions for multi-way jumps, it can probably analyze case/switch statements so that they can benefit from these machine instructions. Similarly, if a back-end knows it is generating code for a language which has no nested routine declarations, it can generate simpler code for routine calls.
Many professional compilers are integrated compilers, for one programming language and one machine architecture, using a semantic representation which derives from the source language and which may already contain elements of the target machine. Still, the structuring has played and still plays a large role in the rapid introduction of new languages and new machines.

Figure 1.4 Front-ends for L languages combined with back-ends for M machines.

1.1.1.2 Judicious use of formalisms

For some parts of compiler construction excellent standardized formalisms have been developed, which greatly reduce the effort to produce these parts. The best examples are regular expressions and context-free grammars, used in lexical and syntactic analysis. Enough theory about these has been developed from the 1960s onwards to fill an entire course, but the practical aspects can be taught and understood without going too deeply into the theory. We will consider these formalisms and their applications in Chapter 2.

Attribute grammars are a formalism that can be used for handling the context, the long-distance relations in a program that link, for example, the use of a variable to its declaration. Since attribute grammars are capable of describing the full semantics of a language, their use can be extended to interpretation or code generation, although other techniques are perhaps more usual. There is much theory about them, but they are less well standardized than regular expressions and context-free grammars. Attribute grammars are covered in Section 3.1.

Object code generation for a given machine involves a lot of nitty-gritty programming when done manually, but the process can be automated, for example by using pattern matching and dynamic programming techniques. Quite a number of formalisms have been designed for the description of machine code, both at the assembly level and at the binary level, but none has gained wide acceptance to date and each compiler writing system has its own version. Automated code generation is treated in Section 4.2.6.
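As a small illustration of the kind of descriptions these formalisms allow (this example is not taken from the book, and the notation is generic; the precise notations used in this book are introduced in Chapter 2), a token and a syntax rule might be specified as follows:

    digit       →  [0-9]                                (regular description of a character class)
    integer     →  digit digit*                         (regular description of a token)
    expression  →  term | expression '+' term           (context-free grammar rule)

It is exactly descriptions of this kind that the program-generating tools of the next subsection take as input.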
1.1.1.3 Use of program-generating tools

Once one has the proper formalism in which to describe what a program should do, one can generate a program from it, using a program generator. Examples are lexical analyzers generated from regular descriptions of the input, parsers generated from grammars (syntax descriptions), and code generators generated from machine descriptions. All these are generally more reliable and easier to debug than their handwritten counterparts; they are often more efficient too.

Generating programs rather than writing them by hand has several advantages:
- The input to a program generator is of a much higher level of abstraction than the handwritten program would be. The programmer needs to specify less, and the tools take responsibility for much error-prone housekeeping. This increases the chances that the program will be correct. For example, it would be cumbersome to write parse tables by hand.
- The use of program-generating tools allows increased flexibility and modifiability. For example, if during the design phase of a language a small change in the syntax is considered, a handwritten parser would be a major stumbling block to any such change. With a generated parser, one would just change the syntax description and generate a new parser.
- Pre-canned or tailored code can be added to the generated program, enhancing its power at hardly any cost. For example, input error handling is usually a difficult affair in handwritten parsers; a generated parser can include tailored error correction code with no effort on the part of the programmer.
- A formal description can sometimes be used to generate more than one type of program. For example, once we have written a grammar for a language with the purpose of generating a parser from it, we may use it to generate a syntax-directed editor, a special-purpose program text editor that guides and supports the user in editing programs in that language.

In summary, generated programs may be slightly more or slightly less efficient than handwritten ones, but generating them is so much more efficient than writing them by hand that whenever the possibility exists, generating a program is almost always to be preferred.

The technique of creating compilers by program-generating tools was pioneered by Brooker et al. (1963), and its importance has continually risen since. Programs that generate parts of a compiler are sometimes called compiler compilers, although this is clearly a misnomer. Yet the term lingers on.

1.1.2 Compiler construction has a wide applicability

Compiler construction techniques can be and are applied outside compiler construction in its strictest sense. Alternatively, more programming can be considered compiler construction than one would traditionally assume. Examples are reading structured data, rapid introduction of new formats, and general file conversion problems.

If data has a clear structure it is generally possible to write a grammar for it. Using a parser generator, a parser can then be generated automatically. Such techniques can, for example, be applied to rapidly create 'read' routines for HTML files, PostScript files, etc. This also facilitates the rapid introduction of new formats. Examples of file conversion systems that have profited considerably from compiler construction techniques are TeX text formatters, which convert TeX text to dvi format, and PostScript interpreters, which convert PostScript text to instructions for a specific printer.

1.1.3 Compilers contain generally useful algorithms

A third reason to study compiler construction lies in the generally useful data structures and algorithms compilers contain. Examples are hashing, precomputed tables, the stack mechanism, garbage collection, dynamic programming, and graph algorithms. Although each of these can be studied in isolation, it is educationally more valuable and satisfying to do so in a meaningful context.

1.2 A simple traditional modular compiler/interpreter

In this section we will show and discuss a simple demo compiler and interpreter, to introduce the concepts involved and to set the framework for the rest of the book. Turning to Figure 1.2, we see that the heart of a compiler is the semantic representation of the program being compiled. This semantic representation takes the form of a data structure, called the 'intermediate code' of the compiler. There are many possibilities for the form of the intermediate code; two usual choices are linked lists of pseudo-instructions and annotated abstract syntax trees. We will concentrate here on the latter, since the semantics is primarily attached to the syntax tree.

1.2.1 The abstract syntax tree

The syntax tree of a program text is a data structure which shows precisely how the various segments of the program text are to be viewed in terms of the grammar.
The syntax tree can be obtained through a process called 'parsing'; in other words, 'parsing' is the process of structuring a text according to a given grammar. For this reason, syntax trees are also called parse trees; we will use the terms interchangeably, with a slight preference for 'parse tree' when the emphasis is on the actual parsing. Conversely, parsing is also called syntax analysis, but this has the problem that there is no corresponding verb 'to syntax-analyze'. The parser can be written by hand if the grammar is very small and simple; for larger and/or more complicated grammars it can be generated by a parser generator. Parser generators are discussed in Chapter 2.

(In linguistic and educational contexts, the verb 'to parse' is also used for the determination of word classes: determining that in 'to go by' the word 'by' is an adverb and in 'by the way' it is a preposition. In computer science the word is used exclusively to refer to syntax analysis.)

The exact form of the parse tree as required by the grammar is often not the most convenient one for further processing, so usually a modified form of it is used, called an abstract syntax tree, or AST. Detailed information about the semantics can be attached to the nodes in this tree through annotations, which are stored in additional data fields in the nodes; hence the term annotated abstract syntax tree. Since unannotated ASTs are of limited use, ASTs are always more or less annotated in practice, and the abbreviation 'AST' is used also for annotated ASTs.

Examples of annotations are type information ('this assignment node concerns a Boolean array assignment') and optimization information ('this expression does not contain a function call'). The first kind is related to the semantics as described in the manual, and is used, among other things, for context error checking. The second kind is not related to anything in the manual but may be important for the code generation phase. The annotations in a node are also called the attributes of that node and since a node represents a grammar symbol, one also says that the grammar symbol has the corresponding attributes. It is the task of the context handling module to determine and place the annotations or attributes.

Figure 1.5 shows the expression b*b - 4*a*c as a parse tree; the grammar used for expression is similar to those found in the Pascal, Modula-2, or C manuals:

    expression → expression '+' term | expression '-' term | term
    term       → term '*' factor | term '/' factor | factor
    factor     → identifier | constant | '(' expression ')'

Figure 1.5 The expression b*b - 4*a*c as a parse tree.

Figure 1.6 shows the same expression as an AST and Figure 1.7 shows it as an annotated AST in which possible type and location information has been added. The location information in Figure 1.7 suggests offsets from a stack pointer and allocation in a machine register, but the precise nature of the information is not important at this point. What is important is that we see a shift in emphasis from syntactic structure to semantic contents.
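As a plain-text sketch of the AST that Figure 1.6 depicts (reconstructed here from the grammar and the expression; the type and location annotations of Figure 1.7 are omitted):

             '-'
            /   \
         '*'     '*'
         / \     / \
        b   b  '*'  c
               / \
              4   a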
Usually the grammar of a programming language is not specified in terms of input characters but of input 'tokens'. Examples of input tokens are identifiers (for example length or a5), strings ("Hello!", "!@#"), numbers (0, 123e-5), keywords (begin, real), compound operators (++, :=), separators (;, [), etc. Input tokens may be and sometimes must be separated by white space, which is otherwise ignored. So before feeding the input program text to the parser, it must be divided into tokens. Doing so is the task of the lexical analyzer; the activity itself is sometimes called 'to tokenize', but the literary value of this word is doubtful.

Figure 1.6 The expression b*b - 4*a*c as an AST.

Figure 1.7 The expression b*b - 4*a*c as an annotated AST.

1.2.2 Structure of the demo compiler

We see that the front-end in Figure 1.2 must at least contain a lexical analyzer, a syntax analyzer (parser), and a context handler, in that order. This leads us to the structure of the demo compiler/interpreter shown in Figure 1.8.

Figure 1.8 Structure of the demo compiler.

The back-end allows two intuitively different implementations: a code generator and an interpreter. Both use the AST, the first for generating machine code, the second for performing the implied actions immediately.

1.2.3 The language for the demo compiler

To keep the example small and to avoid the host of detailed problems that marks much of compiler writing, we will base our demonstration compiler on fully parenthesized expressions with operands of one digit. An arithmetic expression is 'fully parenthesized' if each operator plus its operands is enclosed in a set of parentheses and no other parentheses occur. This makes parsing almost trivial, since each open parenthesis signals the start of a lower level in the parse tree and each close parenthesis signals the return to the previous, higher level: a fully parenthesized expression can be seen as a linear notation of a parse tree.

    expression → digit | '(' expression operator expression ')'
    operator   → '+' | '*'
    digit      → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Figure 1.9 Grammar for simple fully parenthesized expressions.

To simplify things even further, we will have only two operators, + and *. On the other hand, we will allow white space, including tabs and newlines, in the input. The grammar in Figure 1.9 produces such forms as 3, (5+8), and (2*((3*4)+9)).

Even this almost trivial language allows us to demonstrate the basic principles of both compiler and interpreter construction, with the exception of context handling: the language just has no context to handle.

    #include  "parser.h"    /* for type AST_node */
    #include  "backend.h"   /* for Process() */
    #include  "error.h"     /* for Error() */

    int main(void) {
        AST_node *icode;

        if (!Parse_program(&icode)) Error("No top-level expression");
        Process(icode);
        return 0;
    }

Figure 1.10 Driver for the demo compiler.

Figure 1.10 shows the driver of the compiler/interpreter, in C. It starts by including the definition of the syntax analyzer, to obtain the definitions of type AST_node and of the routine Parse_program(), which reads the program and constructs the AST. Next it includes the definition of the back-end, to obtain the definition of the routine Process(), for which either a code generator or an interpreter can be linked in. It then calls the front-end and, if it succeeds, the back-end. (It should be pointed out that the condensed layout used for the program texts in the following sections is not really favored by any of the authors but is solely intended to keep each program text on a single page. Also, the #include commands for various system routines have been omitted.)
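The header backend.h itself is not part of this excerpt; a minimal sketch of what the driver relies on - an assumption for illustration, not the book's listing - would be:

    /* backend.h - hypothetical sketch; only the interface the driver above uses.
       Requires parser.h to have been included first, for AST_node. */
    extern void Process(AST_node *icode);   /* implemented by either the code generator or the interpreter */

Keeping this interface to a single routine is what makes the two back-ends interchangeable at link time.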
1.2.4 Lexical analysis for the demo compiler

The tokens in our language are (, ), +, *, and digit. Intuitively, these are five different tokens, but actually digit consists of ten tokens, for a total of 14. Our intuition is based on the fact that the parser does not care exactly which digit it sees, so as far as the parser is concerned, all digits are one and the same token: they form a token class. On the other hand, the back-end is interested in exactly which digit is present in the input, so we have to preserve the digit after all. We therefore split the information about a token into two parts, the class of the token and its representation. This is reflected in the definition of the type Token_type in Figure 1.11, which has two fields, one for the class of the token and one for its representation.

    /* Define class constants */
    /* Values 0-255 are reserved for ASCII characters */
    #define EoF    256
    #define DIGIT  257

    typedef struct {int class; char repr;} Token_type;

    extern Token_type Token;
    extern void get_next_token(void);

Figure 1.11 Header file lex.h for the demo lexical analyzer.

For token classes that contain only one token which is also an ASCII character (for example +), the class is the ASCII value of the character itself. The class of digits is DIGIT, which is defined in lex.h as 257, and the repr field is set to the representation of the digit. The class of the pseudo-token end-of-file is EoF, which is defined as 256; it is useful to treat the end of the file as a genuine token. These numbers over 255 are chosen to avoid collisions with any ASCII values of single characters.

The representation of a token has at least two important uses. First, it is processed in one or more phases after the parser to produce semantic information; examples are a numeric value produced from an integer token, and an identification in some form from an identifier token. Second, it is used in error messages, to display the exact form of the token. In this role the representation is useful for all tokens, not just for those that carry semantic information, since it enables any part of the compiler to produce directly the correct printable version of any token.

The representation of a token is usually a string, implemented as a pointer, but in our demo compiler all tokens are single characters, so a field of type char suffices.

The implementation of the demo lexical analyzer, as shown in Figure 1.12, defines a global variable Token and a procedure get_next_token(). A call to get_next_token() skips possible layout characters (white space) and stores the next single character as a (class, repr) pair in Token. A global variable is appropriate here, since the corresponding input file is also global. In summary, a stream of tokens can be obtained by calling get_next_token() repeatedly.
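To see the lexical analyzer in action on its own, a small test driver along the following lines could be used; this driver is not part of the book's code, and its file name and output format are made up for illustration:

    /* try_lex.c - hypothetical test driver for the demo lexical analyzer */
    #include <stdio.h>
    #include "lex.h"

    int main(void) {
        do {
            get_next_token();
            printf("class %d, repr '%c'\n", Token.class, Token.repr);
        } while (Token.class != EoF);
        return 0;
    }

For the input (2*5) it would report the classes of '(', DIGIT, '*', DIGIT, and ')', followed by the end-of-file token.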
1.2.5 Syntax analysis for the demo compiler

It is the task of syntax analysis to structure the input into an AST. The grammar in Figure 1.9 is so simple that this can be done by two simple Boolean read routines, Parse_operator() for the non-terminal operator and Parse_expression() for the non-terminal expression. Both routines are shown in Figure 1.13 and the driver of the parser, which contains the initial call to Parse_expression(), is in Figure 1.14.

Each of the routines tries to read the syntactic construct it is named after, using the following strategy. The routine for the non-terminal N tries to read the alternatives of N in order. For each alternative A it tries to read its first member A1. If A1 is found present, the routine assumes that A is the correct alternative and it then requires the presence of the other members of A. This assumption is not always warranted, which is why this parsing method is quite weak. But for the grammar of Figure 1.9 the assumption holds.

If the routine succeeds in reading the syntactic construct in this way, it yields a pointer to the corresponding AST as an output parameter, and returns a 1 for success; the output parameter is implemented as a pointer to the location where the output value must be stored, a usual technique in C. If the routine fails to find the first member of any alternative of N it does not consume any input, does not set its output parameter, and returns a 0 for failure. And if it gets stuck in the middle it stops with a syntax error message.

    #include  "lex.h"   /* for self check */

    /* PRIVATE */
    static int Layout_char(int ch) {
        switch (ch) {
        case ' ': case '\t': case '\n': return 1;
        default:                        return 0;
        }
    }

    /* PUBLIC */
    Token_type Token;

    void get_next_token(void) {
        int ch;

        /* get a non-layout character: */
        do {
            ch = getchar();
            if (ch < 0) {
                Token.class = EoF; Token.repr = '#';
                return;
            }
        } while (Layout_char(ch));

        /* classify it: */
        if ('0' <= ch && ch <= '9') {Token.class = DIGIT;}
        else                        {Token.class = ch;}
        Token.repr = ch;
    }

Figure 1.12 Lexical analyzer for the demo compiler.

The C template used for a rule

    P → A1 A2 ... An | B1 B2 ... | ...

is presented in Figure 1.15. More detailed code is required if any of A1, B1, ... is a terminal symbol; see the examples in Figure 1.13. An error in the input is detected when we require a certain syntactic construct and find it is not there. We then give an error message by calling Error() with an appropriate message; this routine does not return and terminates the program, after displaying the message to the user.

This approach to parsing is called recursive descent parsing, because a set of routines descend recursively to construct the parse tree.
It is a rather weak parsing method and makes for inferior error diagnostics, but is, if applicable at all, very simple to implement, Much stronger parsing methods are discussed in Chapter 2, but recursive descent is suffi-16 SECTION 1.2 A simple traditional modular compiler/interpreter static int Parse_operator (operator *oper) ( Af (Token.clasa «- +7) ( toper = "0"; get_next_token(); return 2 ) if (Token.class == '#") { Soper = '*"; get_next_token(); return 1; t return 0; | static int Parse_expression (Expression *texpr_p) { Expression ‘expr = Yexpr_p = new_expression(); J+ try to parse a digit: +/ Af (Token.clase == DIGIT) { expr->type = 'D'; expr-svalue = Token.repr - ‘0"; get_next_token() ; return 1; } /* try to parse a parenthesized expression: */ Af (Token.class == *(') { expr->type = ‘BP’: get_next_token() : Af (iparge_expression (sexpr->left)) { Brror ("Missing expression"); z Af (1Parse_operator (kexpr->oper}) { Error ("Missing operator") ; } Af (1Paree_expression(Gexpr->right)) { Error (Missing expressicn*) ; ) Af (Token.class != *)") [ Error("Missing )"); z get_next_token(); return 13 } /* failed on both attempts */ free_expression (expr); return 0; Figure 1.13 Parsing routines fo the demo compilerSUBSECTION 1.25 Syntax analysis for the demo compiler 17 #include — "lex.ht include "error.h* /* for Error() */ Hinclude — "parser.h" 7* for eelf check */ J PRIVATE © static Expression *new_expression (void) { return (Expression *)malloc (sizeof (Expression) ); , static void free_expression (Expression ‘expr) (free ((void *)expr) :) static int Parse_operator (Operator toper_p) ; static int Parse_expression (Expression **expr_p) ; f+ puBLIc */ int Parae_program(AsT_node **icode_p) ( Expreasion ‘expr; get_next_token (); /* start the lexical analyzer #/ if [Parse_expression(sexpr)) Af (Token.class != EOF) { Brror ("Garbage after end of program"); } Sicode_p = expr; return 1; } return 0; Figure 1:4 Parser envionment for the demo compiler. cient for our present needs. The recursive descent parsing presented here is not to be con- fused with the much stronger predictive recursive descent parsing, which is discussed amply in Section 2.2.4.1. ‘The latter is an implementation of LL(1) parsing, and includes having look-ahead sets to base decisions on. ‘Although in theory we should have different node types for the ASTs of different syn- tactic constructs, itis more convenient to group them in broad classes and have only one node type for each of these classes. This is one of the differences between the parse tree, which follows the grammar faithfully, and the AST, which serves the convenience of the compiler writer. More in particular, in our example all nodes in an expression are of type Expression, and, since we have only expressions, that is the only possibility for the type of AST_node. To differentiate the nodes of type Expression, each such node contains a type attribute, set with a characteristic value: ‘D’ for a digit and *P" for a parenthesized expression. The type attribute tells us how to interpret the fields inthe rest of the node. Such interpretation is needed in the code generator and the interpreter. The header file with the definition of the node type Expression is shown in Figure 1.16. ‘The syntax analysis module shown in Figure 1.14 defines a single Boolean routine Parse_program() which ties to read the program as an expression by calling18 SECTION 1.2 A simple traditional modular compiler/interpreter int PG.) { /* try to parse the alternative A; Az... 
Ae */ fe Ge) { Af (iAg(..-)) Brror("Missing Ag"); if (AS(...)) Beror ("Missing Ay" return 1; } /* try to parse the alternative By By... */ if (Bil...)) { if (1B) (...)) Error("Migeing By"); return 1; } /* failed to find any alternative of P */ return 0; Figure 115 A C template fora grammar ele. typedef int operator; typedef struct expression { char type; [+ 1Dt or "Pt #/ int value; Y* for’! #/ struct expression *left, tright; /* for ‘Pp’ */ Operator oper; Ys for 1BY +/ } Expression; typedef Expression AST_node; /* the top node is an Expression extern int Parse_program(AST_node **); Figure 1.16 Pare header file forthe demo compiler Parse_expression() and, if it succeeds, converts the pointer to the expression to a pointer to AST_node, which it subsequently yields as its output parameter. It also checks if the input is indeed finished after the expression. Figure 1.17 shows the AST that results from parsing the expression (2* ( (3*4) +9) ). Depending on the value of the type attribute, a node contains either a value attribute or three attributes Left, oper, and right. In the diagram, the non- applicable attributes have been crossed out in each node.SUBSECTION 1.27 Code generation for the demo compiler 19 eye value 5 Lett mie right ot feaeeee iB XE a = Figure 1.17 An AST for the expression (2 ( (344) +9) ) 1.26 Context handling for the demo compiler As mentioned befor, there is no context to handle in our simple language. We could have introduced the need for some context handling in the form of a context check by allowing the logical values ¢ and £ as additional operands (for true and false) and defining + as log- ical or and * as logical and. The context check would then be that the operands must be either both numeric or both logical. Alternatively, we could have collected optimization information, for example by doing all arithmetic that can be done at compile time, Both would have required code that is very similar to that shown in the code generation and interpretation sections below. (Also, the optimization proposed above would have made the code generation and interpretation trivial!) 1.2.7 Code generation for the demo compiler ‘The code generator receives the AST (actually a pointer to it) and generates code from it fora simple stack machine. This machine has four instructions, which work on integers: PUSH n pushes the integer n onto the stack ADD replaces the topmost two elements by their sum MULT replaces the topmost iwo elements by their product PRINT pops the top element and prints is value20 SECTION 1.2 A simple traditional modular compiler/interpreter ‘The module, which is shown in Figure 1.18, defines one routine Process () ‘one parameter, a pointer to the AST. Its purpose is to emit ~ to add to the object file — ‘code with the same semantics as the AST. It first generates code for the expression by cal- ling Code_gen_expression() and then emits a PRINT instruction. When run, the code for the expression will leave its value on the top of the stack where PRINT will find 1 the end of the program run the stack will again be empty (provided the machine started with an empty stack), finclude —*parser-h" /* for types AST_node and Expression */ finclude "backend.h" ——/* for self check */ /* PRIVATE */ static void Code_gen_expression (Expression *expr) { switch (expr->type) ( cage 'D! 
printf ("PUSH td\n", expr->value) ; break; case "Ps Cote_gen_expression(expr->left) ; Code_gen_expression (expr->right) ; switch (Gper) { case '4": print£("ApD\n"); break; cage (*°: printf (MULT\n") ; brea) } break; i: vold Process ast_node *icode) ( Code_gen_expression(icode); printf ("PRINT\n"); ) y+ PUBLIC */ Figure 118 Code generation backend or the demo compiler ‘The routine Code_gen_expression() checks the type attribute of its parame- ter to see if it is a digit node or a parenthesized expression node. In both cases it has to ‘generate code to put the eventual value on the top of the stack. If the input node is a digit node, the routine obtains the value directly from the node and generates code to push it ‘onto the stack: it emits @ PUSH instruction, Otherwise the input node is a parenthesized expression node; the routine first has to generate code for the left and right operands recur- sively, and then emit an ADD or MULT instruction, When run with the expression (2* ((3*4) +8) ) as input, the compiler that results from combining the above modules produces the following code:SUBSECTION 1.28 Interpretation for the demo compiler 21 pus 2 pus 3 Pus 4 PusH 9 vor PRINT 1.2.8 Interpretation for the demo compiler The interpreter (See Figure 1.19) is very similar to the code generator. Both perform depth-first scan of the AST, but where the code generator emits code t0 have the actions performed by a machine at a later time, the interpreter performs the actions right away. The extra set of braces ({...}) after case 'P’: is needed because we need two local variables and the C language does not allow declarations in the case parts of a switch state- ‘ment. /* for types AST_node and Expression */ Res /* PRIVATE */ static int Interpret_expression (Expression *expr) { ewitch (expr-stype) { case 'D! return expr->value; break; case ‘Pr: { int @_left = interpret_expression (expr-mleft); int euright = Interpret_expression (expr->right) ; switch (expr-soper) { case ‘4': return e left + e right; case '#; return e left * e right; y break; include "parser-h Hinclude "backend." /* for self } void Process (AsT_node *icode) ( printf ("$d\n", Interpret_expression(icode)) } /* PUBLIC */ Figure 1.19 Interpreter back-end forthe demo compiler,22 SECTION 1.3 The structure of a more realistic compiler Note that the code generator code (Figure 1.18) and the interpreter code (Figure 1.19) share the same module definition file (called a ‘header file’ in C), backend .bh, shown in Figure 1.20. This is possible because they both implement the same interface: a single rou- tine Process (AST_node *). In Chapter 4 we will see an example of a different type of interpreter (Section 4.1.2) and two other code generators (Sections 4.2.3 and 4.2.3.2), each using this same interface. Another module that implements the back-end interface ‘meaningfully might be a module that displays the AST graphically. Each of these can be combined with the lexical and syntax modules, to produce a program processor. extern void Process(AsT_node *) Figure 1.20 Common back-end header for cade generator and interpreter, 1.3 The structure of a more realistic compiler Figure 1.8 showed that in order to describe the demo compiler we had to decompose the front-end into three modules and that the back-end could stay as a single module. It will be clear that this is not sufficient for a real-world compiler. A more realistic picture is shown in Figure 1.21, in which front-end and backend each consists of five modules. 
In addition to these, the compiler will contain modules for symbol table handling and error reporting; these modules will be called upon by almost all other modules.

[Figure 1.21 Structure of a compiler: a diagram in which the front-end modules (program text input, lexical analysis, syntax analysis, context handling, intermediate code generation) produce the intermediate code, from which the back-end modules (intermediate code optimization, code generation, target code optimization, machine code generation, executable code output) produce the executable code file.]

1.3.1 The structure

A short description of each of the modules follows, together with an indication of where the material is discussed in detail.

The program text input module finds the program text file, reads it efficiently, and turns it into a stream of characters, allowing for different kinds of newlines, escape codes, etc. It may also switch to other files, when these are to be included. This function may require cooperation with the operating system on the one hand and with the lexical analyzer on the other.

The lexical analysis module isolates tokens in the input stream and determines their class and representation. It can be written by hand or generated from a description of the tokens. Additionally, it may do some limited interpretation on some of the tokens, for example to see if an identifier is perhaps actually a macro or a keyword (reserved word).

The syntax analysis module converts the stream of tokens into an abstract syntax tree (AST). Some syntax analyzers consist of two modules. The first one reads the token stream and calls a function from the second module for each syntax construct it recognizes; the functions in the second module then construct the nodes of the AST and link them. This has the advantage that one can replace the AST generation module to obtain a different AST from the same syntax analyzer, or, alternatively, one can replace the syntax analyzer and obtain the same type of AST from a (slightly) different language. The above modules are the subject of Chapter 2.

The context handling module collects context information from various places in the program and annotates nodes with the results. Examples are: relating type information from declarations to expressions; connecting goto statements to their labels, in imperative languages; deciding which routine calls are local and which are remote, in distributed languages. These annotations are then used for performing context checks or are passed on to subsequent modules, for example to aid in code generation. This module is discussed in Chapter 3.

The intermediate code generation module translates language-specific constructs in the AST into more general constructs; these general constructs then constitute the intermediate code, sometimes abbreviated IC. Deciding what is a language-specific and what is a more general construct is up to the compiler designer, but usually the choice is not very difficult. One criterion for the level of the intermediate code is that it should be reasonably straightforward to generate machine code from it for various machines, as suggested by Figure 1.4. Usually the intermediate code consists almost exclusively of expressions and flow-of-control instructions.
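To give a feel for this level of code, the sketch below shows one plausible way a source-level while statement could be reduced to such flow-of-control instructions; it is written as ordinary C with labels and gotos rather than in any particular intermediate code notation, and the function names are invented for the illustration:

    #include <stdio.h>

    /* A while statement at the source level... */
    int sum_first(const int a[], int n) {
        int sum = 0, i = 0;
        while (i < n) { sum += a[i]; i++; }
        return sum;
    }

    /* ...and one possible lowering to a test, labels, and jumps. */
    int sum_first_lowered(const int a[], int n) {
        int sum = 0, i = 0;
    test:
        if (!(i < n)) goto done;    /* the loop test */
        sum += a[i];                /* the loop body */
        i++;
        goto test;                  /* jump back to the test */
    done:
        return sum;
    }

    int main(void) {
        int a[] = {1, 2, 3, 4};
        printf("%d %d\n", sum_first(a, 4), sum_first_lowered(a, 4));  /* both print 10 */
        return 0;
    }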
Examples of the translations done by the intermediate code generation module are: replacing a while statement by tests, labels, and jumps in imperative languages; inserting code for determining which method to call for an object in languages with dynamic bind- ing; replacing a Prolog rule by a routine that does the appropriate backtracking search. In each of these cases an altemative translation would be 2 call to a routine in the run-time system, with the appropriate parameters: the Prolog rule could stay in symbolic form and be interpreted by a run-time routine, a run-time routine could dynamically find the method to be called, and even the while statement could be performed by a run-time routine if the test and the body were converted to anonymous subroutines. ‘The intermediate code gen- eration module isthe place where the division of labor between in-line code and the run- time system is decided. This module is treated in Chapters 6 through 9, for the imperative, object-oriented, functional, logic, and parallel and distributed programming paradigms. ‘The intermediate code optimization module performs preprocessing on the intermedi ate code, with the intention of improving the effectiveness of the code generation module, ‘An example of straightforward preprocessing is constant folding, in which all operations in expressions with known simple operands are performed. A more sophisticated example is inclining, in which carefully chosen calls to some routines are replaced by the bodies of those routines, while atthe same time substituting the parameters. ‘The code generation module rewrites the AST into a linear list of target machine instructions, in more or less symbolic form. To this end, it scleets instructions for seg- iments of the AST, allocates registers to hold data and arranges the instructions in the proper order. ‘The target code optimization module considers the list of symbolic machine instruc- tions and tries to optimize it by replacing sequences of machine instructions by faster or shorter sequences. It uses target-machine-specific properties. ‘The precise boundaries between intermediate code optimization, code generation, and target code optimization are floating: ifthe code generation is particularly good, little tar get code optimization may be needed or even possible, and constant folding can be done during code generation or even on the target code. Still, some optimizations fit better in ‘one module than in another, and it is useful to distinguish the above three levels. ‘The machine code generation module converts the symbolic machine instructions into the corresponding bit pattems. It determines machine addresses of program code and data and produces tables of constants and relocation tables. ‘The executable code output module combines the encoded machine instructions, the ‘constant tables, the relocation tables, and the headers, trailers, and other material required by the operating system into an executable code file. ‘The back-end modules are discussed in Chapter 4. 1.3.2 Run-time systems ‘There is one important component of a compiler that is aditionally left outside compiler structure pictures: the run-time system of the compiled programs. Some of the actions required by a running program will be of a general, language-dependent and/or machineSECTION 1.4 Compiler architectures 25 ependent housekeeping nature; examples are code for allocating arrays, manipulating stack frames, and finding the proper method during method invocation in an object cviented language. 
Although itis quite possible to generate code fragments for these actions wherever they are needed, these fragments are usually very repetitive and it is often, more convenient to compile them once and store the result in library modules. These library modules together form the runstime system. Some imperative languages need only a minimal runtime system; others, especially the logic and distributed languages, may require run-time systems of considerable size, containing code for parameter unification, remote procedure call, task scheduling, etc. The parts of the run-time system needed by a specific program can be linked in by the linker when the complete object program is con- structed, or even be linked in dynamically when the compiled program is ealled; object Programs and linkers are explained in Section 4.3. If the back-end is an interpreter, the rur-time system must be incorporated in it. ‘As an aside it should be pointed out that run-time systems are not only traditionally left out of compiler overview pictures like those in Figure 1.8 and Figure 1.21, they are also sometimes overlooked or underestimated in compiler construction planning. Given the fact that they may contain such beauties as print (),malLoc(), and concurrent task management, overlooking them is definitely inadvisable, 1.3.3 Short-cuts Itis by no means always necessary to implement all modules of the back-end: = Writing the modules for generating machine code and executable code can be avoided by using the local assembler, which is almost always available. ~ Writing the entire back-end can ofien be avoided by generating C code from the inter- mediate code. This exploits the fact that good C compilers are available on virtually any platform, which is why C is sometimes called, half jokingly, ‘The Machine-Independent Assembler’, This is the usual approach taken by compilers for the more advanced para- digms, but it is certainly also recommendable for first implementations of compilers for new imperative and object-oriented languages. ‘The object code produced by the above ‘short-cuts’ is often of good to excellent quality, but the increased compilation time may be a disadvantage. Most C compilers are quite substantial programs and calling them may well cost noticeable time; their availability may, however, make them worth it, 1.4 Compiler architectures Compilers can differ considerably with regard to their architecture; unfortunately, termi- nology to describe the different types is lacking or confusing. Two architectural questions dominate the scene. One is concemed with the granularity of the data that is passed between the compiler modules: is it bits and pieces or is it the entire program? In other ‘words, how wide is the compiler? The second concems the flow of control between the compiler modules: which of the modules is the boss?26 SECTION 1.4 Compiler architectures 1.4.1. The width of the compiler A compiler consists of a series of modules that transform, refine, and pass on information between them. Information passes mainly from the front to the end, from module M, to module M,.1. Each such consecutive pair of modules defines an interface, and although in the end all information has to pass through all these interfaces, the size of the chunks of information that are passed on makes a considerable difference to the structure of the com- piler. Two reasonable choices for the size of the chunks of information are the smallest unit that is meaningful between the two modules; and the entire program. 
This leads to two types of compilers, neither of which seems to have a name; we will call them 'narrow' and 'broad' compilers, respectively.

A narrow compiler reads a small part of the program, typically a few tokens, processes the information obtained, produces a few bytes of object code if appropriate, discards most of the information about these tokens, and repeats this process until the end of the program text is reached.

A broad compiler reads the entire program and applies a series of transformations to it (lexical, syntactic, contextual, optimizing, code generating, etc.), which eventually result in the desired object code. This object code is then generally written to a file.

It will be clear that a broad compiler needs an amount of memory that is proportional to the size of the source program, which is the reason why this type has always been rather unpopular. Until the 1980s, a broad compiler was unthinkable, even in academia. A narrow compiler needs much less memory; its memory requirements are still linear in the length of the source program, but the proportionality constant is much lower since it gathers permanent information (for example about global variables) at a much slower rate.

From a theoretical, educational, and design point of view, broad compilers are preferable, since they represent a simpler model, more in line with the functional programming paradigm. A broad compiler consists of a series of function calls (Figure 1.22) whereas a narrow compiler consists of a typically imperative loop (Figure 1.23).

    SET Object code TO
        Assembly (
            Code generation (
                Context check (
                    Parse (
                        Tokenize (
                            Source code
                        )
                    )
                )
            )
        );

    Figure 1.22 Flow-of-control structure of a broad compiler.

    WHILE NOT Finished:
        Read some data D from the source code;
        Process D and produce the corresponding object code, if any;

    Figure 1.23 Flow-of-control structure of a narrow compiler.

In practice, 'real' compilers are often implemented as narrow compilers. Still, a narrow compiler may compromise and have a broad component: it is quite natural for a C compiler to read each routine in the C program in its entirety, process it, and then discard all but the global information it has obtained.

In the future we expect to see more broad compilers and fewer narrow ones. Most of the compilers for the new programming paradigms are already broad, since they often started out as interpreters. Since scarcity of memory will be less of a problem in the future, more and more imperative compilers will be broad. On the other hand, almost all compiler construction tools have been developed for the narrow model and thus favor it. Also, the narrow model is probably better for the task of writing a simple compiler for a simple language by hand, since it requires much less dynamic memory allocation.

Since the 'field of vision' of a narrow compiler is, well, narrow, it is possible that it cannot manage all its transformations on the fly. Such compilers then write a partially transformed version of the program to disk and, often using a different program, continue with a second pass; occasionally even more passes are used. Not surprisingly, such a compiler is called a 2-pass (or N-pass) compiler, or a 2-scan (N-scan) compiler. If a distinction between these two terms is made, '2-scan' often indicates that the second pass actually re-reads (re-scans) the original program text, the difference being that it is now armed with information extracted during the first scan.
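As a concrete, if much simplified, illustration of the difference in structure and memory behavior, the toy 'compiler' below merely upper-cases its input; the task and all names are invented for this sketch. The broad version holds the entire input in memory before transforming it, the narrow version processes one character at a time:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    /* Broad style: read the entire input, then transform it as a whole.
       (Allocation checks are omitted for brevity.) */
    static void compile_broad(FILE *in, FILE *out) {
        size_t size = 0, capacity = 1024;
        char *text = malloc(capacity);
        int ch;

        while ((ch = getc(in)) != EOF) {            /* slurp the whole input */
            if (size == capacity) text = realloc(text, capacity *= 2);
            text[size++] = (char)ch;
        }
        for (size_t i = 0; i < size; i++)           /* transform and emit */
            putc(toupper((unsigned char)text[i]), out);
        free(text);
    }

    /* Narrow style: read a small chunk, process it, emit it, repeat. */
    static void compile_narrow(FILE *in, FILE *out) {
        int ch;

        while ((ch = getc(in)) != EOF)
            putc(toupper((unsigned char)ch), out);
    }

    int main(int argc, char *argv[]) {
        (void)argv;
        if (argc > 1) compile_broad(stdin, stdout); /* any argument selects the broad version */
        else          compile_narrow(stdin, stdout);
        return 0;
    }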
The major transformations performed by a compiler and shown in Figure 1.21 are sometimes called phases, giving rise to the term N-phase compiler, which is of course not the same as an N-pass compiler. Since on a very small machine each phase could very well correspond to one pass, these notions are sometimes confused. With larger machines, better syntax analysis techniques, and simpler programming language grammars, N-pass compilers with N>1 are going out of fashion. It turns out that not only compilers but also people like to read their programs in one scan. This observation has led to syntactically stronger programming languages, which are correspondingly easier to process.

Many algorithms in a compiler use only local information; for these it makes little difference whether the compiler is broad or narrow. Where it does make a difference, we will show the broad method first and then explain the narrow method as an optimization, if appropriate.

1.4.2 Who's the boss?

In a broad compiler, control is not a problem: the modules run in sequence and each module has full control when it runs, both over the processor and over the data. A simple driver can activate the modules in the right order, as already shown in Figure 1.22. In a narrow compiler, things are more complicated. While pieces of data are moving forward from module to module, control has to shuttle forward and backward, to activate the proper module at the proper time. We will now examine the flow of control in narrow compilers in more detail.

The modules in a compiler are essentially 'filters', reading chunks of information, processing them, and writing the result. Such filters are most easily programmed as loops which execute function calls to obtain chunks of information from the previous module and routine calls to write chunks of information to the next module. An example of a filter as a main loop is shown in Figure 1.24.

    WHILE Obtained input character Ch from previous module:
        IF Ch = 'a':
            // See if there is another 'a':
            IF Obtained input character Ch1 from previous module:
                IF Ch1 = 'a':
                    // We have 'aa':
                    Output character 'b' to next module;
                ELSE Ch1 /= 'a':
                    Output character 'a' to next module;
                    Output character Ch1 to next module;
            ELSE Ch1 not obtained:
                Output character 'a' to next module;
                EXIT WHILE;
        ELSE Ch /= 'a':
            Output character Ch to next module;

    Figure 1.24 The filter aa -> b as a main loop.

It describes a simple filter which copies input characters to the output while replacing the sequence aa by b; the filter is representative of, but of course much simpler than, the kind of transformations performed by an actual compiler module. The reader may nevertheless be surprised at the complexity of the code, which is due to the requirements for the proper termination of the previous, the present, and the next module. The need for proper handling of end of input is, however, very much a fact of life in compiler construction and we cannot afford to sweep its complexities under the rug. The filter obtains its input characters by calling upon its predecessor in the module sequence; such a call may succeed and yield a character, or it may fail. The transformed characters are passed on to the next module. Except for routine calls to the previous and the next module, control remains inside the while loop all the time, and no global variables are needed.
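For comparison with the outline code of Figure 1.24, a direct rendering in C might look as follows; this is a sketch rather than code from the book, with standard input and output standing in for the previous and next modules, and getchar() returning EOF playing the role of 'character not obtained':

    #include <stdio.h>

    /* The filter aa -> b as a main loop; stdin is the previous module,
       stdout is the next module. */
    int main(void) {
        int ch;

        while ((ch = getchar()) != EOF) {
            if (ch == 'a') {
                int ch1 = getchar();            /* see if there is another 'a' */
                if (ch1 == EOF) {               /* input exhausted after the 'a' */
                    putchar('a');
                    break;
                }
                if (ch1 == 'a') {               /* we have 'aa' */
                    putchar('b');
                } else {
                    putchar('a');
                    putchar(ch1);
                }
            } else {
                putchar(ch);
            }
        }
        return 0;
    }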
Although main loops are efficient, easy to program and easy to understand, they have one serious flaw which prevents them from being used as the universal programming. model for compiler modules: a main loop does not interface well with another main loop in traditional programming languages. Linking the output of one main loop to the input ofSECTION 1.5 Properties of a good compiler 29 another involves a transfer of control that leaves both the environment of the callee and that of the caer intact, regardless of whether the consumer ealls the producer to obtain new information or the producer calls the consumer to pass on processed information. The traditional function call creates a new environment for the callee and traditional function return destroys the environment of the callee. So they cannot serve to link two loops. A transfer of control that does possess the desired properties is the coroutine call, Which involves having separate stacks for the two loops to preserve both environments. The coroutine linkage also takes care of the end-of-input handling: an attempt to obtain information from a module whose loop has terminated fails. A well-known implementa- tion of the eoroutine linkage is the UNIX pipe, in which the two separate stacks reside in different processes and therefore in different address spaces. Implementation of coroutines inimperative languages is discussed in Seetion 6.3.7, ‘Although coroutine linkage was proposed by Conway (1963) early in the history of compiler construction, no programming language except perhaps Icon (Griswold and Griswold, 1983) has ever featured a usable implementation of it. In the absence of corou- tines we have 10 choose one of our modules as the main loop in a narrow compiler and implement the other loops through trickery. That this implies major surgery to these loops is shown by Figure 1.25, which shows our filter asa loop-less module preceding the main loop, and Figure Answers.1, which shows it as a loop-Less module following the main loop. We see that global variables are needed to record information that must remain avail- able between two successive calls of the function, The variable Input exhausted records whether the previous call of the function retumed from the position before the EXIT WHILE in Figure 1.24, and the variable There is a stored character records whether it retumed from before outputting Ch1. Some additional code is required for proper end-of-input handling. Note thatthe code is 29 lines long as opposed to 15 for the main loop. Similar considerations apply to the post-main variant, which is given as an exercise. An additional complication is that proper end-of-input hanaling requires thatthe filter be flushed by the using module when it has supplied its final chunk of information See Exercise I-11 Looking at Figures 1.25 and Answers.1 in the answers to the exercises, we see thatthe ‘complication comes from having to save program state that resides on the stack. So it will te convenient 0 choose for the main loop the module that has the most state on the stack. That module will almost always be the parser; the code generator may gather more state, butt is usually stored in a global data structure rather than on the stack. This explains why ‘we almost universally find the parser as the main module in a narrow compiler: in very simple-minded wording, the parser pulls the program text in through the lexical analyzer, and pushes the cade out through the code generator. 
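For reference, the restructured filter of Figure 1.25 can be expressed in C along the following lines; this is again a sketch with invented details rather than the book's code. The state that lived on the stack in the main-loop version now lives in static variables, and each call delivers at most one output character:

    #include <stdio.h>

    /* The filter aa -> b as a pre-main module: each call yields at most one
       filtered character, and returns EOF once the input is exhausted. */
    static int Input_exhausted = 0;
    static int There_is_a_stored_character = 0;
    static int Stored_character;                 /* can never be an 'a' */

    int Filtered_character(void) {
        int ch, ch1;

        if (Input_exhausted) return EOF;
        if (There_is_a_stored_character) {       /* deliver the saved look-ahead */
            There_is_a_stored_character = 0;
            return Stored_character;
        }
        ch = getchar();
        if (ch == EOF) { Input_exhausted = 1; return EOF; }
        if (ch != 'a') return ch;
        ch1 = getchar();                         /* see if there is another 'a' */
        if (ch1 == EOF) { Input_exhausted = 1; return 'a'; }
        if (ch1 == 'a') return 'b';              /* we have 'aa' */
        Stored_character = ch1;                  /* save it for the next call */
        There_is_a_stored_character = 1;
        return 'a';
    }

    int main(void) {                             /* drive the filter to completion */
        int ch;
        while ((ch = Filtered_character()) != EOF) putchar(ch);
        return 0;
    }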
1.5 Properties of a good compiler The foremost property of a good compiler is of course that it generates correct code. A compiler that occasionally generates incorrect code is useless; a compiler that generates incorrect code once a year may seem useful but is dangerous Itis also important that a compiler conform completely to the language specification Itmay be tempting to implement a subset ofthe language, a superset or even what is some-30 CHAPTER 1 Introduction SET the flag Input exhausted TO False; SET the flag There is a stored character 70 False; SET Stored character 70 Undefined; // can never be an ‘a’ FUNCTION Filtered character RETURNING a Boolean, a character. IF Input Exhausted: RETURN False, No character; ELSE IP There is a stored character: {7 Te cannot be an ‘a’: SET There is a stored character 70 False; RETURN True, Stored character; ELSE Input not exhausted AND There is no stored character! IF Obtained input character Ch from previous module IP Ch = ‘a’ /{ See if there is another ‘a’ IP Obtained input character Chi from previous module: IP Chl = ‘a’: V1 We have ‘aa’: RETURN True, ‘b’; BLSE Chi /= ‘a’ SET Stored character TO Chi; SET There le a stored character TO True; RETORN True, ‘a! ELSE chi not obtained: SBT Input exhausted TO True; ELSE ch /= ‘a’ RETURN True, Ch; ELSE ch not obtained: ‘SET Input exhausted 70 True; RETURN False, No character; Figure 1.25 The filer aa +b asa pre-main module, times sarcastically called an ‘extended subset’, and users may even be grateful, but those same users will soon find that programs developed with such a compiler are much less portable than those written using a fully conforming compiler. (For more about the notion of ‘extended subset’, see Exercise 1.13.) Another property of a good compiler, one that is often overlooked, is that it should be able 10 handle programs of essentially arbitrary size, as far a available memory permits. It seems very reasonable to say that no sane programmer uses more than 32 parameters in a routine or more than 128 declarations in » block and that one may therefore allocate a fixed amount of space for each in the compiler. One should, however, keep in mind that pro- grammers are not the only ones who Write programs. More and more sofiware is generated by other programs, and such generated software may easily contain more than 128 declara- tions in one block ~ although more than 32 parameters t0 a routine seems excessive, evenSECTION 1.5 Properties of a good compiler 31 for a generated program; famous last words .. Especially any assumptions about limits on the number of cases in a case/switch statement are unwarranted: very large case statements are often used in the implementation of automatically generated parsers and code genera- tors. Section 5.1.3 shows how the flexible memory allocation needed for handling pro- srams of essentially arbitrary size can be achieved at an almost negligible increase in cost. ‘Compilation speed is an issue but not a major one. Small programs can be expected to compile in under a second on modern machines. Larger programming projects are usually organized in many relatively small subprograms, modules, library routines, etc, together called compilation units. Each of these compilation units can be compiled separately, and recompilation after program modification is usually restricted to the modi- fied compilation units only. 
Also, compiler writers have traditionally been careful to keep their compilers ‘linear in the input, which means thatthe compilation time isa linear func- tion ofthe length of the input file. This is even more important when generated programs are being compiled, since these can be of considerable length. ‘There are several possible sources of not-linearity in compilers. Firs, all linear-time parsing techniques are rather inconvenient, and the worry-free parsing techniques can be ‘aubie in the size of the input in the worst case, Second, many code optimizations are potentially exponential in the size of the input, since often the best code can only be found by considering all possible combinations of machine instructions. Third, naive memory ‘management can result in quadratic time consumption. Fortunately, good linear-time solu- tions or heuristics are available forall these problems. Compiler size is hardly an issue anymore, with most computers having many mege- bytes of primary memory nowadays. ‘The user-friendliness of a compiler shows mainly in the quality of its error reporting. At the least, the user should be presented with a clear error message which includes the perceived cause of the error. the name of the input file, and the position in it. Giving a really good error cause description is often hard or impossible, due to the limited insight compilers have into incorrect programs, Pinpointing the error is aided by recording the file ‘name and line number with every token and every node in the AST. More fancy reporting ‘mechanisms, including showing parts of the syntax tree, may not have the beneficial effect the compiler writer may expect from them, but it may be useful to provide some visual display mechanism, for example opening a text editor atthe point of the error. ‘The importance of the speed and the size of the generated code depends totally on the purpose of the compiler. Normally one can expect that the user is more interested in high speed than in small size (code for embedded applications, such as washing machines, port- able telephones, etc, is an exception). Moderately advanced optimization techniques will pethaps provide a factor of three improvement over very naive code generation; imple ‘menting such optimizations may take about the same amount of time as the entire compiler ‘writing project. Gaining another factor of two or even three over this may be possible through extensive and aggressive optimization; one can expect to spend many times the ‘original effort on an optimization phase of this nature.32 CHAPTER 1 Introduction 1.6 Portability and retargetability A program is considered portable if it takes a limited and reasonable effort to make it run on different machine types. What constitutes ‘a limited and reasonable effort” is, of course, a matter of opinion, but today many programs can be ported by just editing the ‘makefile’ to reflect the local situation and recompiling. With compilers, machine dependence not only resides in the program itself, it resides also ~ perhaps even mainly ~ in the output. Therefore, with compilers one has to distin- ‘guish two forms of portability: the ease with which the compiler itself can be made to run ‘on another machine, and the ease With which it can be made to generate code for another machine. The first is called the portability of the compiler and the second is called its retargetability. If the compiler is writen in a reasonably good style in a modern high- level language, good portability can be expected. 
Retargeting is achieved by replacing the entire back-end; the retargetability is thus inversely related to the effort to create a new back-end. In this context itis important to note that ereating a new back-end does not necessarily meen writing one from scratch. Some of the code in a back-end is of course machine- dependent, but much of itis not, If structured properly, some parts can be reused from other back-ends and other parts can perhaps be generated from formalized machine- descriptions. This approach can reduce creating a back-end from a major enterprise to a reasonable effort. Given the proper tools, creating a back-end for a new machine may cost ‘between one and four programmer-months for an experienced compiler Writer. Machine descriptions range in size between a few hundred lines and many thousands of lines. 1.7 Place and usefulness of optimizations Optimizations are attractive: much research in compiler construction is concemed with them, and compiler writers regularly see all kinds of opportunities for optimizations. It should, however, be kept in mind that implementing optimizations is the last phase in com- piler construction: unlike correctness, optimizations are an add-on feature. In program- ing, itis easier to make a correct program fast than a fast program correct; likewise itis easier to make correct generated object code fast than to make fast generated object code correct. ‘There is another reason besides correctness why we tend to focus on the unoptimized algorithm in this book: some traditional algorithms are actually optimized versions of more basic algorithms. Sometimes the basic algorithm has wider applicability than the optim- ized version and in any case the basic version will provide us with more insight and free- dom of design than the optimized version. A good example is the stack in implementations of imperative languages. At any ‘moment the stack holds the pertinent data — administration, parameters, and local data — for each active routine, a routine that has been called and has not yet terminated. This set Of data is called the ‘activation record’ of this activation of the routine. Traditionally, activation records are found only on the stack, and only the one on the top of the stack reptesents a running routine; we consider the stack as the primary mechanism of which activation records are just parts. It is, however, profitable to recognize the activationSECTION 1.8 A short history of compiler construction 33 record as the primary item: it arises naturally when a routine is called (‘activated’) since it is obvious that its pertinent data has to be stored somewhere. Its allocation on a stack is just an optimization that happens to be possible in many ~ but not all — imperative and object-oriented languages. From this point of view it is easier to understand the implemen- tation of those languages for which stack allocation is not a good optimization: imperative languages with coroutines or Ada-like tasks, object-oriented languages with active Smalltalk-like objects, functional languages, Icon, ete. Probably the best attitude towards optimization is to first understand and implement the basic structure and algorithm, then see which optimizations the actual situation allows, and only implement them if they are considered worthwhile (considering their cost). 
In situations in which the need for optimization is obvious from the start, a$ for example in code generators, the basic structure would include a framework for these optimizations, ‘This framework can then be filled in as the project progresses. ‘This concludes our introductory part on actually constructing @ compiler. In the remainder of this chapter we consider three further issues: the history of compiler construe- tion, formal grammars, and closure algorithms. 1.8 A short history of compiler construction Three periods can be distinguished in the history of compiler construction: 1945-1960, 1960-1975, and 1975-present. Of course, the years are approximate. 1.8.1 1945-1960: code generation During this period languages developed relatively slowly and machines were idiosyncratic. ‘The primary problem was how to generate code for a given machine. The problem was exacerbated by the fact that assembly programming was held in high esteem, and high(er)- level languages and compilers were looked at with a mixture of suspicion and awe: using a compiler was often called ‘automatic programming’. Proponents of high-level languages feared, not without reason, that the idea of high-level programming would never catch on if compilers produced code that was less efficient than what assembly programmers pro- duced by hand. The first FORTRAN compiler (Sheridan, 1959) optimized heavily and was far ahead of it time in that respect. 1.8.2 1960-1975: parsing The 1960s and 1970s saw a proliferation of new programming languages, and language designers began to believe that having a compiler for a new language quickly as more important than having one that generated very efficient code. This shifted the emphasis in compiler construction from back-ends to front-ends. At the same time, studies in formal languages revealed a number of powerful techniques that could be applied profitably in frontend construction, notably in parser generation.34 CHAPTER 1 Introduction 1.8.3 1975-present: code generation and code optimization; paradigms From 1975 to the present, both the number of new languages proposed and the number of different machine types in regular use decreased, which reduced the need for quick-and- simplefquick-and-dirty compilers for new languages and/or machines, ‘The greatest tur- rmoil in language and machine design being over, people began to demand professional compilers that were reliable, efficient, both in use and in generated code, and preferably with pleasant user interfaces. This called for more attention to the quality of the generated code, which was easier now, since with the slower change in machines the expected life- time ofa code generator increased Also, atthe same time new paradigms in programming were developed, with func- tional, logic, and distributed programming as the most prominent examples. Almost invariably, the run-time requirements of the corresponding languages far exceeded those of the imperative languages: automatic date allocation and deallocation, list comprehensions, unification, remote procedure call and many others, ae features which require much run- time effort that corresponds to hardly any code inthe program text. More and mor, the emphasis shifts from ‘how to compile’ “what to compile to” 1.9 Grammars Grammars, of more precisely context-free grammars, are the essential formalism for describing the structure of programs in a programming language. 
In principle, the gram- mar of a language describes the syntactic structure only, but since the semantics of a language is defined in terms of the syntax, the grammar is also instrumental in the defini- tion of the semantics. ‘There are other grammar types besides context-free grammars, but we will be mainly concerned with context-free grammars. We will also meet regular grammars, which more often go by the name of ‘regular expressions’ and which result ftom a severe restriction on the context-free grammars; and attribute grammars, which are context-free grammars extended with parameters and code. Other types of grammars play no mote than a margi- nal role in compiler construction. The term ‘context-free’ is often abbreviated to CF. We will give here a brief summary of the features of CF grammars. ‘A grammar is a recipe for constructing elements of a set of strings of symbols. When applied to programming languages, the symbols are the tokens in the language, the strings of symbols are program texts, and the set of strings of symbols is the programming language. The string BEGIN print ( "Hit* ) END consists of 6 symbols (tokens) and could be an element of the set of strings of symbols generated by a programming language grammar, or in more normal words, be a program in some programming language. This cut-and-dried view of a programming language would be useless but for the fact that the strings are constructed in a structured fashion; and to this structure semantics can be attached.SECTION 1.9 Grammars 35 1.9.1. The form of a grammar A grammar consists of a set of production rules and a start symbol. Each production rule defines a named syntactic construct. A production rule consists of two parts, a left-hand side and a right-hand side, separated by a left-to-right arrow. The left-hand side is the name of the syntactic construct; the right-hand side shows a possible form of the syntactic construct. An example of a production rule is expression > '(" expression operator expression ‘)’ ‘The right-hand side of a production rule can contain two kinds of symbols, terminal sym- bols and non-terminal symbols. As the word says, a terminal symbol (or terminal for short) is an end point of the production process, and can be part of the strings produced by the grammar. A non-terminal symbol (or non-terminal for short) must occur as the Ieft- hand side the name) of one ot more production rules, and cannot be part of the strings pro- duced by the grammar. Terminals are also called tokens, especially when they are part of an input to be analyzed. Non-terminals and terminals together are called grammar sym- ols. The grammar symbols in the right-hand side of a rule are collectively called its ‘members; when they occur as nodes in a syntax tree they are more often called its ‘chil- dren’. In discussing grammars, it is customary to use some conventions that allow the class of a symbol to be deduced from its typographical form. (on-terminals are denoted by capital letters, mostly A, B, C, and N. ~ Terminals are denoted by lower-case letters near the end of the alphabet, mostly x, y.and ~ Sequences of grammar symbols are denoted by Greek letters near the beginning of the alphabet, mostly c.(alpha), B (beta), and (gamma). ~ Lower-case letters near the beginning of the alphabet (a, b,¢, et.) stand for themselves, as terminals ~ The empty sequence is denoted by € (epsilon. 1.92 The production process The central data structure in the production process is the sentential form. 
It is usually described as a string of grammar symbols, and can then be thought of as representing a partially produced program text. For our purposes, however, we want to represent the syntactic structure of the program too. The syntactic structure can be added to the flat interpretation of a sentential form as a tree positioned above the sentential form, so that the leaves of the tree are the grammar symbols. This combination is also called a production tree.

A string of terminals can be produced from a grammar by applying so-called production steps to a sentential form, as follows. The sentential form is initialized to a copy of the start symbol. Each production step finds a non-terminal N in the leaves of the sentential form, finds a production rule N → α with N as its left-hand side, and replaces the N in the sentential form with a tree having N as the root and the right-hand side of the production rule, α, as the leaf or leaves. When no more non-terminals can be found in the leaves of the sentential form, the production process is finished, and the leaves form a string of terminals in accordance with the grammar. Using the conventions described above, we can write that the production process replaces the sentential form βNγ by βαγ.

The steps in the production process leading from the start symbol to a string of terminals are called the derivation of that string. Suppose our grammar consists of the four numbered production rules

    [1] expression → '(' expression operator expression ')'
    [2] expression → '1'
    [3] operator   → '+'
    [4] operator   → '*'

in which the terminal symbols are surrounded by apostrophes and the non-terminals are identifiers, and suppose the start symbol is expression. Then the sequence of sentential forms shown in Figure 1.26 forms the derivation of the string (1*(1+1)). More in particular, it forms a leftmost derivation, a derivation in which it is always the leftmost non-terminal in the sentential form that is rewritten. An indication R@P in the left margin in Figure 1.26 shows that grammar rule R is used to rewrite the non-terminal at position P. The resulting parse tree (in which the derivation order is no longer visible) is shown in Figure 1.27.

          expression
    1@1   '(' expression operator expression ')'
    2@2   '(' '1' operator expression ')'
    4@3   '(' '1' '*' expression ')'
    1@4   '(' '1' '*' '(' expression operator expression ')' ')'
    2@5   '(' '1' '*' '(' '1' operator expression ')' ')'
    3@6   '(' '1' '*' '(' '1' '+' expression ')' ')'
    2@7   '(' '1' '*' '(' '1' '+' '1' ')' ')'

    Figure 1.26 Leftmost derivation of the string (1*(1+1)).

We see that recursion, the ability of a production rule to refer directly or indirectly to itself, is essential to the production process; without recursion, a grammar would produce only a finite set of strings.

The production process is kind enough to produce the program text together with the production tree, but then the program text is committed to a linear medium (paper, computer file) and the production tree gets stripped off in the process. Since we need the tree to find out the semantics of the program, we use a special program, called a 'parser', to retrieve it. The systematic construction of parsers is treated in Section 2.2.
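Since the production process is entirely mechanical, it can be acted out in a few lines of code. The sketch below is an illustration only: it abbreviates the non-terminals expression and operator to the single letters E and O so that a sentential form fits in a plain C string, applies the rule sequence of Figure 1.26 to the start symbol, and prints each sentential form of the leftmost derivation of (1*(1+1)):

    #include <stdio.h>
    #include <string.h>

    /* Each production step replaces the leftmost non-terminal ('E' or 'O')
       in the sentential form by the right-hand side of the rule applied. */
    static void apply(char form[], char nonterm, const char *rhs) {
        char *p = strpbrk(form, "EO");      /* find the leftmost non-terminal */
        char tail[128];

        if (p == NULL || *p != nonterm) { printf("rule not applicable\n"); return; }
        strcpy(tail, p + 1);                /* save what follows the non-terminal */
        sprintf(p, "%s%s", rhs, tail);      /* splice in the right-hand side */
        printf("%s\n", form);
    }

    int main(void) {
        char form[128] = "E";               /* the start symbol */

        printf("%s\n", form);
        apply(form, 'E', "(EOE)");          /* rule [1] */
        apply(form, 'E', "1");              /* rule [2] */
        apply(form, 'O', "*");              /* rule [4] */
        apply(form, 'E', "(EOE)");          /* rule [1] */
        apply(form, 'E', "1");              /* rule [2] */
        apply(form, 'O', "+");              /* rule [3] */
        apply(form, 'E', "1");              /* rule [2] */
        return 0;                           /* final form: (1*(1+1)) */
    }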