Polyglot: An Extensible Compiler Framework for Java
Nathaniel Nystrom, Michael R. Clarkson, and Andrew C. Myers
Cornell University
{nystrom,clarkson,andru}@cs.cornell.edu

Abstract. Polyglot is an extensible compiler framework that supports the easy creation of compilers for languages similar to Java, while avoiding code duplication. The Polyglot framework is useful for domain-specific languages, exploration of language design, and for simplified versions of Java for pedagogical use. We have used Polyglot to implement several major and minor modifications to Java; the cost of implementing language extensions scales well with the degree to which the language differs from Java. This paper focuses on the design choices in Polyglot that are important for making the framework usable and highly extensible. Polyglot source code is available.

1 Introduction

Domain-specific extension or modification of an existing programming language enables more concise, maintainable programs. However, programmers construct domain-specific language extensions infrequently because building and maintaining a compiler is onerous. Better technology is needed. This paper presents a methodology for the construction of extensible compilers and also an application of this methodology in our implementation of Polyglot, a compiler framework for creating extensions to Java [14].

Language extension or modification is useful for many reasons:

Security. Systems that enforce security at the language level may find it useful to add security annotations or rule out unsafe language constructs.
Static checking. A language might be extended to support annotations necessary for static verification of program correctness [23], more powerful static checking of program invariants [10], or heuristic methods [8].
Language design. Implementation helps validate programming language designs.
Optimization. New passes may be added to implement optimizations not performed by the base compiler or not permitted by the base language specification.
Style.
Some language features or idioms may be deemed to violate good style but may not be easy to detect with simple syntactic analysis.
Teaching. Students may learn better using a language that does not expose them to difficult features (e.g., inner classes [14]) or confusing error messages [9].

This research was supported in part by DARPA Contract F30602-99-1-0533, monitored by USAF Rome Laboratory, in part by ONR Grant N00014-01-1-0968, in part by NSF awards 0133302 and 0208642, and in part by a Sloan Research Fellowship. The views herein should not be interpreted as representing the policies or endorsement of NSF, DARPA or AFRL.

Proceedings of the 12th International Conference on Compiler Construction, Warsaw, Poland, April 2003. LNCS 2622, pages 138-152.

We refer to the original unmodified language as the base language; we call the modified language a language extension even if it is not backwards compatible. When developing a compiler for a language extension, it is clearly desirable to build upon an existing compiler for the base language. The simplest approach is to copy the source code of the base compiler and edit it in place. This may be fairly effective if the base compiler is carefully written, but it duplicates code. Changes to the base compiler, perhaps to fix bugs, may then be difficult to apply to the extended compiler. Without considerable discipline, the code of the two compilers diverges, leading to duplication of effort.

Our approach is different: the Polyglot framework implements an extensible compiler for the base language Java 1.4. This framework, also written in Java, is by default simply a semantic checker for Java. However, a programmer implementing a language extension may extend the framework to define any necessary changes to the compilation process, including the abstract syntax tree (AST) and semantic analysis.
An important goal for Polyglot is scalable extensibility: an extension should require programming effort proportional only to the magnitude of the difference between the extended and base languages. Adding new AST node types or new compiler passes should require writing code whose size is proportional to the change. Language extensions often require uniformly adding new fields and methods to an AST node and its subclasses; we require that this uniform mixin extension be implementable without subclassing all the extended node classes. Scalable extensibility is a challenge because it is difficult to simultaneously extend both types and the procedures that manipulate them [30, 38]. Existing programming methodologies such as visitors [13] improve extensibility but are not a complete solution. In this paper we present a methodology that supports extension of both compiler passes and AST nodes, including mixin extension. The methodology uses abstract factories, delegation, and proxies [13] to permit greater extensibility and code reuse than in previous extensible compiler designs.

Polyglot has been used to implement more than a dozen Java language extensions of varying complexity. Our experience using Polyglot suggests that it is a useful framework for developing compilers for new Java-like languages. Some of the complex extensions implemented are Jif [26], which extends Java with security types that regulate information flow; PolyJ [27], which adds bounded parametric polymorphism to Java; and JMatch [24], which extends Java with pattern matching and iteration features. Compilers built using Polyglot are themselves extensible; complex extensions such as Jif and PolyJ have themselves been extended. The framework is not difficult to learn: users have been able to build interesting extensions to Java within a day of starting to use Polyglot. The Polyglot source code is available (footnote 1).

The rest of the paper is structured as follows.
Section 2 gives an overview of the Polyglot compiler. Section 3 describes in detail our methodology for providing scalable extensibility. Other Polyglot features that make writing an extensible compiler convenient are described in Section 4. Our experience using the Polyglot system to build various languages is reported in Section 5. Related work on extensible compilers and macro systems is discussed in Section 6, and we conclude in Section 7.

Footnote 1: At http://www.cs.cornell.edu/Projects/polyglot

[Fig. 1. Polyglot Architecture. Stages shown: extended parser, scheduled compiler passes, code generation. Artifacts shown: Ext source code, Ext AST, Java AST, bytecode, serialized type info.]

2 Polyglot Overview

This section presents an overview of the various components of Polyglot and describes how they can be extended to implement a language extension. An example of a small extension is given to illustrate this process.

2.1 Architecture

A Polyglot extension is a source-to-source compiler that accepts a program written in a language extension and translates it to Java source code. It also may invoke a Java compiler such as javac to convert its output to bytecode.

The compilation process offers several opportunities for the language extension implementer to customize the behavior of the framework. This process, including the eventual compilation to Java bytecode, is shown in Fig. 1. In the figure, the name Ext stands for the particular extended language.

The first step in compilation is parsing input source code to produce an AST. Polyglot includes an extensible parser generator, PPG, that allows the implementer to define the syntax of the language extension as a set of changes to the base grammar for Java. PPG provides grammar inheritance [29], which can be used to add, modify, or remove productions and symbols of the base grammar. PPG is implemented as a preprocessor for the CUP LALR parser generator [17].
The extended AST may contain new kinds of nodes either to represent syntax added to the base language or to record new information in the AST. These new node types are added by implementing the Node interface and optionally subclassing from an existing node implementation.

The core of the compilation process is a series of compilation passes applied to the abstract syntax tree. Both semantic analysis and translation to Java may comprise several such passes. The pass scheduler selects passes to run over the AST of a single source file, in an order defined by the extension, ensuring that dependencies between source files are not violated. Each compilation pass, if successful, rewrites the AST, producing a new AST that is the input to the next pass. Some analysis passes (e.g., type checking) may halt compilation and report errors instead of rewriting the AST. A language extension may modify the base language pass schedule by adding, replacing, reordering, or removing compiler passes. The rewriting process is entirely functional; compilation passes do not destructively modify the AST. More details on our methodology are described in Section 3.

Compilation passes do their work using objects that define important characteristics of the source and target languages. A type system object acts as a factory for objects representing types and related constructs such as method signatures. The type system object also provides some type checking functionality. A node factory constructs AST nodes for its extension. In extensions that rely on an intermediate language, multiple type systems and node factories may be used during compilation.

    1 tracked(F) class FileReader {
    2   FileReader(File f) [] -> [F] throws IOException[] { ... }
    3   int read() [F] -> [F] throws IOException[F] { ... }
    4   void close() [F] -> [] { ... ; free this; }
    5 }

Fig. 2. Example Coffer FileReader

After all compilation passes complete, the usual result is a Java AST.
A Java compiler such as javac is invoked to compile the Java code to bytecode. The bytecode may contain serialized extension-specific type information used to enable separate compilation; we discuss separate compilation in more detail in Section 4.

2.2 An Example: Coffer

To motivate our design, we describe a simple extension of Java that supports some of the resource management facilities of the Vault language [7]. This language, called Coffer, is a challenge for extensible compilers because it makes substantial changes to both the syntax and semantics of Java and requires identical modifications to many AST node types.

Coffer allows a linear capability, or key, to be associated with an object. Methods of the object may be invoked only when the key is held. A key is allocated when its object is created and deallocated by a free statement in a method of the object. The Coffer type system regulates allocation and freeing of keys to guarantee statically that keys are always deallocated.

Fig. 2 shows a small Coffer program declaring a FileReader class that guarantees the program cannot read from a closed reader. The annotation tracked(F) on line 1 associates a key named F with instances of FileReader. Pre- and post-conditions on method and constructor signatures, written in brackets, specify how the set of held keys changes through an invocation. For example, on line 2, the precondition [] indicates that no key need be held to invoke the constructor, and the postcondition [F] specifies that F is held when the constructor returns normally. The close method (line 4) frees the key; no subsequent method that requires F can be invoked.

The Coffer extension is used as an example throughout the next section. It is implemented by adding new compiler passes for computing and checking held key sets at each program point. Coffer's free statements and additional type annotations are implemented by adding new AST nodes and extending existing nodes and passes.
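
To make the pre- and post-condition rule concrete, the following is a minimal sketch in plain Java of how a held key set might evolve along a single straight-line sequence of calls. It is an illustration only, not the Coffer implementation: the class names KeySignature, HeldKeys, and CofferKeyDemo are hypothetical, the update rule (remove the entry keys, then add the exit keys) is our reading of the bracket notation in Fig. 2, and the sketch checks at run time what Coffer checks statically. It does not model the guarantee that every key is eventually freed.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a signature: keys required on entry, keys held on normal return. */
final class KeySignature {
    final String name;
    final Set<String> entryKeys;
    final Set<String> exitKeys;

    KeySignature(String name, Set<String> entryKeys, Set<String> exitKeys) {
        this.name = name;
        this.entryKeys = entryKeys;
        this.exitKeys = exitKeys;
    }
}

/** Tracks the held key set along one straight-line path (illustrative only). */
final class HeldKeys {
    private final Set<String> held = new HashSet<>();

    /** Checks the precondition, then replaces the entry keys with the exit keys. */
    void invoke(KeySignature sig) {
        if (!held.containsAll(sig.entryKeys)) {
            throw new IllegalStateException(sig.entryKeys + " not held at " + sig.name);
        }
        held.removeAll(sig.entryKeys);
        held.addAll(sig.exitKeys);
    }
}

final class CofferKeyDemo {
    // Signatures transcribed from Fig. 2 (exception annotations omitted).
    static final KeySignature CTOR  = new KeySignature("FileReader(File)", Set.of(), Set.of("F"));
    static final KeySignature READ  = new KeySignature("read()", Set.of("F"), Set.of("F"));
    static final KeySignature CLOSE = new KeySignature("close()", Set.of("F"), Set.of());

    /** Returns true if every call in the sequence finds its precondition satisfied. */
    static boolean accepts(KeySignature... calls) {
        HeldKeys keys = new HeldKeys();
        try {
            for (KeySignature call : calls) {
                keys.invoke(call);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(CTOR, READ, CLOSE)); // prints true
        System.out.println(accepts(CTOR, CLOSE, READ)); // prints false: F was freed by close()
    }
}
```

The second sequence is rejected because close() leaves the key set empty, so the precondition [F] of read() no longer holds.
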
3 A Methodology for Scalable Extensibility

Our goal is a mechanism that supports scalable extension of both the syntax and semantics of the base language. The programmer effort required to add or extend a pass should be proportional to the number of AST nodes non-trivially affected by that pass; the effort required to add or extend a node should be proportional to the number of passes the node must implement in an interesting way.

When extending or overriding the behavior of existing AST nodes, it is often necessary to extend a node class that has more than one subclass. For instance, the Coffer extension adds identical pre- and post-condition syntax to both methods and constructors; to avoid code duplication, these annotations should be added to the common base class of method and constructor nodes. The programmer effort to make such changes should be constant, irrespective of the number of subclasses of this base class. Inheritance is the appropriate mechanism for adding a new field or method to a single class. However, adding the same member to many different classes can quickly become tedious. This is true even in languages with multiple inheritance: a new subclass must be created for every class affected by the change. Modifying these subclasses later requires making identical changes to each subclass. Mixin extensibility is a key goal of our methodology: a change that affects multiple classes should require no code duplication.

Compilers written in object-oriented languages often implement compiler passes using the Visitor design pattern [13]. However, visitors present several problems for scalable extensibility. In a non-extensible compiler, the set of AST nodes is usually fixed. The Visitor pattern permits scalable addition of new passes, but sacrifices scalable addition of AST node types.
To allow specialization of visitor behavior for both the AST node type and the visitor itself, each visitor class implements a separate callback method for every node type. Thus, adding a new kind of AST node requires modifying all existing visitors to insert a callback method for the node. Visitors written without knowledge of the new node cannot be used with the new node because they do not implement the callback. The Visitor pattern also does not provide mixin extensibility. A separate mechanism is needed to address this problem.

An alternative to the Visitor pattern is for each AST node class to implement a method for each compiler pass. However, this technique suffers from the dual problem: adding a new pass requires adding a method to all existing node types.

The remainder of this section presents a mechanism that achieves the goal of scalable extensibility. We first describe our approach to providing mixin extensibility. We then show how our solution also addresses the other aspects of scalable extensibility.

3.1 Node Extension Objects and Delegates

We implement passes as methods associated with AST node objects; however, to provide scalable extensibility, we introduce a delegation mechanism, illustrated in Fig. 3, that enables orthogonal extension and method override of nodes.

Since subclassing of node classes does not adequately address orthogonal extension of methods in classes with multiple subclasses, we add to each node object a field, labeled ext in Fig. 3, that points to a (possibly null) node extension object. The extension object (CofferExt in the figure) provides implementations of new methods and fields, thus extending the node interface without subclassing. These members are accessed by following the ext pointer and casting to the extension object type. In the example, CofferExt extends Node with keyFlow() and checkKeys() methods.
Each AST node class to be extended with a given implementation of these members uses the same extension object class. Thus, several node classes can be orthogonally extended with a single implementation, avoiding code duplication. Since language extensions can themselves be extended, each extension object has an ext field similar to the one located in the node object. In effect, a node and its extension object together can be considered a single node.

[Fig. 3. Delegates and extensions. The figure shows a Node (typeCheck() {...}, print() {...}) with del and ext fields; a NodeDelegate (typeCheck() {...}, print() {node.print();}) with a node field, labeled as the del of a Coffer node; and a CofferExt extension object (keyFlow() {...}, checkKeys() {...}) with node, del, and ext fields, where ext points to a possible further extension.]

Extension objects alone, however, do not adequately handle method override when the base language is extended multiple times. The problem is that any one of a node's extension objects can implement the overridden method; a mechanism is needed to invoke the correct implementation. A possible solution to this problem is to introduce a delegate object for each method in the node interface. For each method, a field in the node points to an object implementing that method. Calls to the method are made through its delegate object; language extensions can override the method simply by replacing the delegate. The delegate may implement the method itself or may invoke methods in the node or in the node's extension objects.

Because maintaining one object per method is cumbersome, the solution used in Polyglot is to combine delegate objects and to introduce a single delegate field for each node object, illustrated by the del field in Fig. 3. This field points to an object implementing the entire Node interface, by default the node itself. To override a method, a language extension writer creates a new delegate object containing the new implementation or code to dispatch to the new implementation. The delegate implements Node's other methods by dispatching back to the node.
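
As an illustration of the del and ext fields just described, the following is a minimal, self-contained sketch. It is not Polyglot's code: NodeOps, SketchNode, SketchExt, StrictDelegate, and DelegateDemo are hypothetical names, the node interface is reduced to two methods, and the ext field is given a concrete type so that the cast mentioned in the text is unnecessary here.

```java
/** Simplified stand-in for the node interface; the real interface is much larger. */
interface NodeOps {
    String typeCheck();
    String print();
}

/** A node holds a delegate (by default itself) and an optional extension object. */
class SketchNode implements NodeOps {
    NodeOps del = this;   // all calls go through del so an extension can override them
    SketchExt ext;        // possibly null
    final String label;

    SketchNode(String label) { this.label = label; }

    public String typeCheck() { return "base typeCheck of " + label; }
    public String print()     { return label; }
}

/** Extension object: adds members to a node without subclassing the node class. */
class SketchExt {
    SketchNode node;
    SketchExt del = this; // extension methods are also invoked through a delegate
    String checkKeys() { return "checkKeys at " + node.label; }
}

/** Delegate that overrides typeCheck and dispatches every other method back to the node. */
class StrictDelegate implements NodeOps {
    private final SketchNode node;
    StrictDelegate(SketchNode node) { this.node = node; }

    public String typeCheck() { return "strict typeCheck of " + node.label; }
    public String print()     { return node.print(); }
}

final class DelegateDemo {
    /** Hypothetical factory method wiring up a node, its extension, and an optional delegate. */
    static SketchNode make(String label, boolean strict) {
        SketchNode n = new SketchNode(label);
        SketchExt e = new SketchExt();
        e.node = n;
        n.ext = e;
        if (strict) {
            n.del = new StrictDelegate(n);
        }
        return n;
    }

    public static void main(String[] args) {
        SketchNode plain = make("call", false);
        SketchNode strict = make("field", true);
        System.out.println(plain.del.typeCheck());      // base implementation
        System.out.println(strict.del.typeCheck());     // overridden by the delegate
        System.out.println(strict.del.print());         // dispatched back to the node
        System.out.println(strict.ext.del.checkKeys()); // method added by the extension
    }
}
```

The same SketchExt and StrictDelegate classes could be attached to any node class, which is the property the text calls mixin extensibility.
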
Extension objects also contain a del field used to override methods declared in the extension object interface.

Calls to all node methods are made through the del pointer, thus ensuring that the correct implementation of the method is invoked if the delegate object is replaced by a language extension. Thus, in our example, the node's typeCheck method is invoked via n.del.typeCheck(); the Coffer checkKeys method is invoked by following the node's ext pointer and invoking through the extension object's delegate: ((CofferExt) n.ext).del.checkKeys().

An extension of Coffer could replace the extension object's delegate to override methods declared in the extension, or it could replace the node's delegate to override methods of the node. To access Coffer's type-checking functionality, this new node delegate may be a subclass of Coffer's node delegate class or may contain a pointer to the old delegate object.

The overhead of indirecting through the del pointer accounts for less than 2% of the total compilation time.

3.2 AST Rewriters

Most passes in Polyglot are structured as functional AST rewriting passes. Factoring out AST traversal code eliminates the need to duplicate this code when implementing new passes. Each pass implements an AST rewriter object to traverse the AST and invoke the pass's method at each node. At each node, the rewriter invokes a visitChildren method to recursively rewrite the node's children using the rewriter and to reconstruct the node if any of the children are modified. A key implementation detail is that when a node is reconstructed, the node is cloned and the clone is returned. Cloning ensures that class members added by language extensions are correctly copied into the new node. The node's delegates and extensions are cloned with the node.

Each rewriter implements enter and leave methods, both of which take a node as argument.
The enter method is invoked before the rewriter recurses on the node's children using visitChildren and may return a new rewriter to be used for rewriting the children. This provides a convenient means for maintaining symbol table information as the rewriter crosses lexical scopes; the programmer need not write code to explicitly manage the stack of scopes, eliminating a potential source of errors. The leave method is called after visiting the children and returns the rewritten AST rooted at the node.

3.3 Scalable Extensibility

A language extension may extend the interface of an AST node class through an extension object interface. For each new pass, a method is added to the extension object interface and a rewriter class is created to invoke the method at each node. For most nodes, a single extension object class is implemented to define the default behavior of the pass, typically just an identity transformation on the AST node. This class is overridden for individual nodes where non-trivial work is performed for the pass.

To change the behavior of an existing pass at a given node, the programmer creates a new delegate class implementing the new behavior and associates the delegate with the node at construction time. Like extension classes, the same delegate class may be used for several different AST node classes, allowing functionality to be added to node classes at arbitrary points in the class hierarchy without code duplication.

New kinds of nodes are defined by new node classes; existing node types are extended by adding an extension object to instances of the class. A factory method for the new node type is added to the node factory to construct the node and, if necessary, its delegate and extension objects. The new node inherits default implementations of all compiler passes from its base class and from the extension's base class. The new node may provide new implementations using method override, possibly via delegation.
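
The enter/leave protocol of Section 3.2 can be sketched in a few lines. The code below is an illustration under stated assumptions, not Polyglot's rewriter: Tree, TreeRewriter, and DepthTagger are hypothetical names, nodes are rebuilt with a constructor call rather than by cloning, and leave is invoked on the rewriter that visited the node. The example pass uses the rewriter returned by enter to carry scope information (here, block nesting depth) without an explicit stack.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal immutable tree node; a hypothetical stand-in for an AST node. */
final class Tree {
    final String name;
    final List<Tree> children;

    Tree(String name, List<Tree> children) {
        this.name = name;
        this.children = List.copyOf(children);
    }

    /** Rewrites the children; returns this if none changed, otherwise a fresh node. */
    Tree visitChildren(TreeRewriter r) {
        List<Tree> rewritten = new ArrayList<>();
        boolean changed = false;
        for (Tree child : children) {
            Tree result = r.visit(child);
            changed |= (result != child);
            rewritten.add(result);
        }
        return changed ? new Tree(name, rewritten) : this;
    }

    @Override public String toString() {
        return children.isEmpty() ? name : name + children;
    }
}

/** Functional rewriter: enter may return a different rewriter for the subtree. */
class TreeRewriter {
    TreeRewriter enter(Tree n) { return this; }
    Tree leave(Tree n) { return n; }

    final Tree visit(Tree n) {
        TreeRewriter inner = enter(n);          // rewriter used below this node
        Tree rebuilt = n.visitChildren(inner);  // never mutates n
        return leave(rebuilt);
    }
}

/** Example pass: tags each "var" leaf with the number of enclosing "block" nodes. */
final class DepthTagger extends TreeRewriter {
    private final int depth;
    DepthTagger(int depth) { this.depth = depth; }

    @Override TreeRewriter enter(Tree n) {
        return n.name.equals("block") ? new DepthTagger(depth + 1) : this;
    }

    @Override Tree leave(Tree n) {
        return n.name.equals("var") ? new Tree("var@" + depth, n.children) : n;
    }
}

final class RewriterDemo {
    public static void main(String[] args) {
        Tree inner = new Tree("block", List.of(new Tree("var", List.of())));
        Tree root = new Tree("block",
                List.of(new Tree("var", List.of()), inner, new Tree("lit", List.of())));
        System.out.println(new DepthTagger(0).visit(root)); // prints block[var@1, block[var@2], lit]
        System.out.println(root);                           // prints block[var, block[var], lit]
    }
}
```

Because the input tree is never modified, unchanged subtrees (the lit leaf above) are shared between the old and new trees.
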
Methods need be overridden only for those passes that need to perform non-trivial work for that node type.

Fig. 4 shows a portion of the code implementing the Coffer key-checking pass, which checks the set of keys held when control enters a node. The code has been simplified in the interests of space and clarity. At each node in the AST, the pass invokes through the del pointer the checkKeys method in the Coffer extension, passing in the set of held keys (computed by a previous data-flow analysis pass). Since most AST nodes are not affected by the key-checking pass, a default checkKeys method implemented in the base CofferExt class is used for these nodes. For other nodes, a non-trivial implementation of key checking is required.

    class KeyChecker extends Rewriter {
        Node leave(Node n) {
            ((CofferExt) n.ext).del.checkKeys(held_keys(n));
            return n;
        }
    }

    class CofferExt {
        Node node; CofferExt del;
        void checkKeys(Set held_keys) { /* empty */ }
    }

    class ProcedureCallExt extends CofferExt {
        void checkKeys(Set held_keys) {
            ProcedureCall c = (ProcedureCall) node;
            CofferProcedureType p = (CofferProcedureType) c.callee();
            if (! held_keys.containsAll(p.entryKeys()))
                error(p.entryKeys() + " not held at " + c);
        }
    }

Fig. 4. Coffer key checking

Fig. 4 also contains an extension class used to compute the held keys for method and constructor calls. ProcedureCall is an interface implemented by the classes for three AST nodes that invoke either methods or constructors: method calls, new expressions, and explicit constructor calls (e.g., super()). All three nodes implement the checkKeys method identically. By using an extension object, we need only to write this code once.

4 Other Implementation Details

In this section we consider some aspects of the Polyglot implementation that are not directly related to scalable extensibility.

Data-Flow Analysis. Polyglot provides an extensible data-flow analysis framework.
In the Java implementation, this framework is used to check that variables are initialized before use and that all statements are reachable; extensions may perform additional data-flow analyses to enable optimizations or to perform other transformations. Polyglot provides a rewriter in the base compiler framework that constructs the control-flow graph of the program. Intraprocedural data-flow analyses can then be performed on this graph by implementing the meet and transfer functions for the analysis.

Separate Compilation. Java compilers use type information stored in Java class files to support separate compilation. For many extensions, the standard Java type information in the class file is insufficient. Polyglot injects type information into class files that can be read by later invocations of the compiler to provide separate compilation. No code need be written for a language extension to use this functionality for its extended types. Before performing Java code generation, Polyglot uses the Java serialization facility to encode the type information for a given class into a string, which is then compressed and inserted as a final static field into the AST for the class being serialized. When compiling a class, the first time a reference to another class is encountered, Polyglot loads the class file for the referenced class and extracts the serialized type information. The type information is decoded and may be immediately used by the extension.

Quasiquoting. To generate Java output, language extensions translate their ASTs to Java ASTs and rely on the code generator of the base compiler to output Java code. To enable AST rewriting, we have used PPG to extend Polyglot's Java parser with the ability to generate an AST from a string of Java code and a collection of AST nodes to substitute into the generated AST. This feature provides many of the benefits of quasiquoting in Scheme [19].
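
The encoding step described under Separate Compilation above can be approximated with standard library classes. The sketch below is an assumption-laden illustration, not Polyglot's code: TypeInfo and TypeInfoCodec are hypothetical names, GZIP is one possible compression scheme, and Base64 is our own choice for turning the compressed bytes into a string that could be stored in a static final field; the paper does not say which compression or text encoding Polyglot uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical extension-specific type information for one class. */
final class TypeInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String className;
    final String[] trackedKeys;

    TypeInfo(String className, String... trackedKeys) {
        this.className = className;
        this.trackedKeys = trackedKeys.clone();
    }
}

/** Serialize, compress, and text-encode an object; and the reverse (illustrative only). */
final class TypeInfoCodec {
    static String encode(Serializable info) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the object stream also finishes the GZIP stream.
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(info);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    static Object decode(String text) {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(raw)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("type information class not found", e);
        }
    }

    public static void main(String[] args) {
        String field = encode(new TypeInfo("FileReader", "F"));
        TypeInfo back = (TypeInfo) decode(field);
        System.out.println(back.className); // prints FileReader
    }
}
```

In a compiler, the string returned by encode would become the initializer of a generated field, and decode would run the first time another compilation refers to the class.
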
5 Experience

More than a dozen extensions of varying sizes have been implemented using Polyglot, for example:

Jif is a Java extension that provides information flow control and features to ensure the confidentiality and integrity of data [26].
Jif/split is an extension to Jif that partitions programs across multiple hosts based on their security requirements [37].
PolyJ is a Java extension that supports bounded parametric polymorphism [27].
Param is an abstract extension that provides support for parameterized classes. This extension is not a complete language, but instead includes code implementing lazy substitution of type parameters. Jif, PolyJ, and Coffer extend Param.
JMatch is a Java extension that supports pattern matching and logic programming features [24].
Coffer, as previously described, adds resource management facilities to Java.
PAO (primitives as objects) allows primitive values to be used transparently as objects via automatic boxing and unboxing.
A covariant return extension restores the subtyping rules of Java 1.0 Beta [33] in which the return type of a method could be covariant in subclasses. The language was changed in the final version of Java 1.0 [14] to require the invariance of return types.

The major extensions add new syntax and make substantial changes to the language semantics. We describe the changes for Jif and PolyJ in more detail below. The simpler extensions, such as support for covariant return types, require more localized changes.

5.1 Jif

Jif is an extension to Java that permits static checking of information flow policies. In Jif, the type of a variable may be annotated with a label specifying a set of principals who own the data and a set of principals that are permitted to read the data. Labels are checked by the compiler to ensure that the information flow policies are not violated.

The base Polyglot parser is extended using PPG to recognize security annotations and new statement forms.
New AST node classes are added for labels and for new statement and expression forms concerning security checks.

The new AST nodes and nearly all existing AST nodes are also extended with security context annotations. These new fields are added to a Jif extension class. To implement information flow checking, a labelCheck method is declared in the Jif extension object. Many nodes do no work for this pass and therefore can inherit a default implementation declared in the base Jif extension class. Extension objects installed for expression and statement nodes override the labelCheck method to implement the security typing judgment for the node. Delegates were used to override type checking of some AST nodes to disallow static fields and inner classes since they may provide an avenue for information leaks.

Following label checking, the Jif AST is translated to a Java AST, largely by erasing security annotations. The new statement and expression forms are rewritten to Java syntax using the quasiquoting facility discussed in Section 4.

Jif/split further extends Jif to partition programs across multiple hosts based on their security requirements. The syntax of Jif is modified slightly to also support integrity annotations. New passes, implemented in extension objects, partition the Jif/split program into several Jif programs, each of which will run on a separate host.

5.2 PolyJ

PolyJ is an extension to Java that supports parametric polymorphism. Classes and interfaces may be declared with zero or more type parameters constrained by where clauses. The base Java parser is extended using PPG, and AST node classes are added for where clauses and for new type syntax. Further, the AST node for class declarations is extended via inheritance to allow for type parameters and where clauses.

The PolyJ type system customizes the behavior of the base Java type system and introduces judgments for parameterized and instantiated types.
A new pass is introduced to check that the types on which a parameterized class is instantiated satisfy the constraints for that parameter, as described in [27]. The base compiler code generator is extended to generate code not only for each PolyJ source class, but also an adapter class for each instantiation of a parameterized class.

5.3 Results

As a measure of the programmer effort required to implement the extensions discussed in this paper, the sizes of the code for these extensions are shown in Table 1. To eliminate bias due to the length of identifiers in the source, sizes are given in number of tokens for source files, including Java, CUP, and PPG files.

Table 1. Extension size

    Extension           Token count   Percent of base Polyglot
    base Polyglot       164136        100%
    Jif                 126188        77%
    JMatch              105269        64%
    PolyJ               78159         48%
    Coffer              21251         13%
    PAO                 3422          2%
    Param               3233          2%
    covariant return    1562          1%
    empty               691           < 1%

These results demonstrate that the cost of implementing language extensions scales well with the degree to which the extension differs from its base language. Simple extensions such as the covariant return extension that differ from Java in small, localized ways can be implemented by writing only small amounts of code. To measure the overhead of simply creating a language extension, we implemented an empty extension that makes no changes to the Java language; the overhead includes empty subclasses of the base compiler node factory and type system classes, an empty PPG parser specification, and code for allocating these subclasses.

PolyJ, which has large changes to the type system and to code generation, requires only about half as much code as the base Java compiler. For historical reasons, PolyJ generates code by overriding the Polyglot code generator to directly output Java. The size of this code could be reduced by using quasiquoting. Jif requires a large amount of extension code because label checking in Jif is more complex than the Java type checking that it extends.
Much of the JMatch overhead is accounted for by extensive changes to add complex statement and expression translations.

As a point of comparison, the base Polyglot compiler (which implements Java 1.4) and the Java 1.1 compiler, javac, are nearly the same size when measured in tokens. Thus, the base Polyglot compiler implementation is reasonably efficient. To be fair to javac, we did not count its code for bytecode generation. About 10% of the base Polyglot compiler consists of interfaces used to separate the interface hierarchy from the class hierarchy. The javac compiler is not implemented this way.

Implementing small extensions has proved to be fairly easy. We asked a programmer previously unfamiliar with the framework to implement the covariant return type extension; this took one day. The same programmer implemented several other small extensions within a few days.

5.4 Discussion

In implementing Polyglot we found, not surprisingly, that application of good object-oriented design principles greatly enhances Polyglot's extensibility. Rigorous separation of interfaces and classes permits implementations to be more easily extended and replaced; calls through interfaces ensure the framework is not bound to any particular implementation of an interface. The Polyglot framework almost exclusively uses factory methods to create objects [13], giving language extensions more freedom to change the implementation provided by the base compiler by avoiding explicitly tying code to a particular class.

We chose to implement Polyglot using only standard Java features, but it is clear that several language extensions, some of which we have implemented using Polyglot, would have made it easier to implement Polyglot. Multimethods (e.g., [5]) would have simplified the dispatching mechanism needed for our methodology. Open classes [6] might provide a cleaner solution to the extensibility problem, particularly in conjunction with multimethods.
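
The combination of interface separation and factory methods described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the Polyglot API: every interface, class, and method name here is hypothetical. The point is that client code creates nodes only through the factory, so an extension can substitute its own node class by overriding a single factory method.

```java
/** What compiler passes see: an interface, never a concrete class. (Hypothetical name.) */
interface IntLit {
    String translate();
}

/** Base-compiler implementation of the interface. */
class BaseIntLit implements IntLit {
    protected final int value;

    BaseIntLit(int value) { this.value = value; }

    @Override public String translate() { return Integer.toString(value); }
}

/** All node creation goes through factory methods. (Hypothetical name.) */
interface NodeFactory {
    IntLit intLit(int value);
}

/** Factory supplied by the base compiler. */
class BaseNodeFactory implements NodeFactory {
    @Override public IntLit intLit(int value) { return new BaseIntLit(value); }
}

/** Node class supplied by an extension; it changes how the literal is translated. */
class ExtIntLit extends BaseIntLit {
    ExtIntLit(int value) { super(value); }

    @Override public String translate() { return "/* ext */ " + value; }
}

/** The extension overrides one factory method to install its node class. */
class ExtNodeFactory extends BaseNodeFactory {
    @Override public IntLit intLit(int value) { return new ExtIntLit(value); }
}

class FactoryDemo {
    /** Client code names neither BaseIntLit nor ExtIntLit. */
    static String build(NodeFactory nf) {
        return nf.intLit(42).translate();
    }

    public static void main(String[] args) {
        System.out.println(build(new BaseNodeFactory()));
        System.out.println(build(new ExtNodeFactory()));
    }
}
```

Because `build` depends only on the two interfaces, swapping the factory changes the node implementation without touching any client code.
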
Aspect-oriented programming [20] is another technique for adding and overriding methods in an existing class hierarchy. Hierarchically extensible datatypes and functions [25] offer another solution to the extensibility problem. Multiple inheritance and, in particular, mixins (e.g., [4, 11]) would facilitate application of an extension to many AST nodes at once. Built-in quasiquoting support would make translation more efficient, though the need to support several target languages would introduce some difficulties. Covariant modification of method return types would eliminate many unnecessary type casts, as would parametric polymorphism [27, 28].

6 Related Work

There is much work that is related to Polyglot, including other extensible compilers, macro systems, and visitor patterns.

JaCo is an extensible compiler for Java written in an extended version of Java [39] that supports ML-style pattern matching. JaCo does not provide mixin extensibility. It relies on a new language feature, extensible algebraic datatypes [38], to address the difficulty of handling new data types without changing existing code. Polyglot achieves scalable extensibility while relying only on features available in Java.

CoSy [1] is a framework for combining compiler phases to create an optimizing compiler. Compiler phases can be added and reused in multiple contexts without changing existing code. The framework was not designed for syntax extension.

In the SUIF compiler [36], data structures can be extended with annotations, similar to Polyglot's extension objects; new annotations are ignored by existing compiler passes. Scorpion [31, 32] is a meta-programming environment that has a similar extension mechanism. Neither SUIF nor Scorpion has a mechanism like Polyglot's delegate objects to mix in method overrides.
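
The idea of a delegate object that mixes in a method override can be sketched as follows. This is a simplified illustration, not Polyglot's actual classes: the names `NodeOps`, `FieldDecl`, `NoStaticFieldsDel`, `del`, and `setDel` are assumptions made for the example. It models the use of delegates mentioned in Section 5.1, where an extension overrides type checking to disallow static fields.

```java
/** Operations that a pass invokes on a node; nodes and delegates both implement it. */
interface NodeOps {
    String typeCheck();
}

/** A field declaration node. Passes call del().typeCheck(), never typeCheck() directly. */
class FieldDecl implements NodeOps {
    final String name;
    final boolean isStatic;
    private NodeOps del = this; // by default the node acts as its own delegate

    FieldDecl(String name, boolean isStatic) {
        this.name = name;
        this.isStatic = isStatic;
    }

    NodeOps del() { return del; }

    void setDel(NodeOps d) { this.del = d; }

    /** Base-language behavior: static fields are accepted. */
    @Override public String typeCheck() { return "ok: " + name; }
}

/** Delegate installed by an extension that disallows static fields. */
class NoStaticFieldsDel implements NodeOps {
    private final FieldDecl node;

    NoStaticFieldsDel(FieldDecl node) { this.node = node; }

    @Override public String typeCheck() {
        if (node.isStatic) {
            return "error: static field " + node.name + " not allowed";
        }
        return node.typeCheck(); // otherwise fall back to the base implementation
    }
}

class DelegateDemo {
    /** How a pass would dispatch: always through the delegate. */
    static String check(FieldDecl f) {
        return f.del().typeCheck();
    }

    public static void main(String[] args) {
        FieldDecl f = new FieldDecl("count", true);
        System.out.println(check(f));    // base behavior
        f.setDel(new NoStaticFieldsDel(f));
        System.out.println(check(f));    // overridden by the delegate
    }
}
```

The node class itself is never subclassed here; the override is attached per node, which is what allows the same delegate to be applied across unrelated node classes.
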
JastAdd [16] is a compiler framework that uses aspect-oriented programming to add methods and fields into the AST node class hierarchy to implement new passes or to override existing passes. The AST node hierarchy may be extended via inheritance, but duplicate code may need to be written for each pass to support new nodes.

Macro systems and preprocessors are generally concerned only with syntactic extensions to a language. Recent systems for use in Java include EPP [18], JSE [12], and JPP [21]. Maya [2] is a generalization of macro systems that uses generic functions and multimethods to allow extension of Java syntax. Semantic actions can be defined as multimethods on those generic functions. It is not clear how these systems scale to support semantic checking for large extensions to the base language.

The Jakarta Tools Suite (JTS) [3] is a toolkit for implementing Java preprocessors to create domain-specific languages. Extensions of a base language are encapsulated as components that define the syntax and semantics of the extension. A fundamental difference between JTS and Polyglot is that JTS is concerned primarily with the syntactic analysis of the extension language, not with semantic analysis [3, section 4]. This makes JTS more like a macro system in which the macros are defined by extending the compiler rather than declaring them in the source code.

OpenJava [34] uses a meta-object protocol (MOP) similar to Java's reflection API to allow manipulation of a program's structure. OpenJava allows very limited extension of syntax, but through its MOP exposes much of the semantic structure of the program.

The original Visitor design pattern [13] has led to many refinements. Extensible Visitors [22] and Staggered Visitors [35] both enhance the extensibility of the visitor pattern to facilitate adding new node types, but neither these nor the other refinements mentioned above support mixin extensibility.
Staggered Visitors rely on multiple inheritance to extend visitors with support for new nodes.

7 Conclusions

Our original motivation for developing the Polyglot compiler framework was simply to provide a publicly available Java front end that could be easily extended to support new languages. We discovered that the existing approaches to extensible compiler construction within Java did not solve to our satisfaction the problem of scalable extensibility including mixins. Our extended visitor methodology is simple, yet improves on the previous solutions to the extensibility problem. Other Polyglot features such as extensible parsing, pass scheduling, quasiquoting, and type signature insertion are also useful.

Our experience using Polyglot has shown that it is an effective way to produce compilers for Java-like languages. We have used the framework for several significant language extensions that modify Java syntax and semantics in complex ways. We hope that the public release of this software in source code form will facilitate experimentation with new features for object-oriented languages.

References

1. Martin Alt, Uwe Aßmann, and Hans van Someren. Cosy compiler phase embedding with the CoSy compiler model. In Peter A. Fritzson, editor, Proceedings of the 5th International Compiler Construction Conference (CC'94), volume 786 of Lecture Notes in Computer Science, pages 278-293, Edinburgh, UK, April 1994.
2. Jason Baker and Wilson C. Hsieh. Maya: Multiple-dispatch syntax extension in Java. In Proc. of the ACM SIGPLAN '02 Conference on Programming Language Design and Implementation (PLDI), pages 270-281, Berlin, Germany, June 2002.
3. Don Batory, Bernie Lofaso, and Yannis Smaragdakis. JTS: tools for implementing domain-specific languages. In Proceedings Fifth International Conference on Software Reuse, pages 143-53, Victoria, BC, Canada, 1998. IEEE.
4. Gilad Bracha. The Programming Language Jigsaw: Mixins, Modularity and Multiple Inheritance.
PhD thesis, University of Utah, 1992.
5. Craig Chambers. Object-oriented multi-methods in Cecil. In Ole Lehrmann Madsen, editor, Proceedings of the 6th European Conference on Object-Oriented Programming (ECOOP), volume 615, pages 33-56, Berlin, Heidelberg, New York, Tokyo, 1992. Springer-Verlag.
6. Curtis Clifton, Gary T. Leavens, Craig Chambers, and Todd Millstein. MultiJava: Modular open classes and symmetric multiple dispatch for Java. In OOPSLA 2000 Conference on Object-Oriented Programming, Systems, Languages, and Applications, Minneapolis, Minnesota, volume 35(10), pages 130-145, 2000.
7. Robert DeLine and Manuel Fähndrich. Enforcing high-level protocols in low-level software. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 59-69, June 2001.
8. Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proceedings of Fourth Usenix Symposium on Operating Systems Design and Implementation, San Diego, California, October 2000.
9. Robert Bruce Findler, Cormac Flanagan, Matthew Flatt, Shriram Krishnamurthi, and Matthias Felleisen. DrScheme: A pedagogic programming environment for Scheme. In Proc. International Symposium on Programming Languages: Implementations, Logics, and Programs, pages 369-388, 1997.
10. Cormac Flanagan, K. Rustan M. Leino, Mark Lillibridge, Greg Nelson, James B. Saxe, and Raymie Stata. Extended static checking for Java. In Proc. of the ACM SIGPLAN '02 Conference on Programming Language Design and Implementation (PLDI), pages 234-245, Berlin, Germany, June 2002.
11. Matthew Flatt, Shriram Krishnamurthi, and Matthias Felleisen. Classes and mixins. In Conference Record of POPL '98: The 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, California, pages 171-183, New York, NY, 1998.
12. D. K. Frayne and Keith Playford. The Java syntactic extender (JSE).
In Proceedings of the 2001 Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '01), pages 31-42, Tampa, FL, USA, 2001.
13. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley, Reading, MA, 1994.
14. James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, August 1996. ISBN 0-201-63451-1.
15. Carl Gunter and John C. Mitchell, editors. Theoretical aspects of object-oriented programming. MIT Press, 1994.
16. Görel Hedin and Eva Magnusson. JastAdd: an aspect-oriented compiler construction system. Science of Computer Programming, 47(1):37-58, November 2002.
17. Scott E. Hudson, Frank Flannery, C. Scott Ananian, Dan Wang, and Andrew Appel. CUP LALR parser generator for Java, 1996. Software release. Located at http://www.cs.princeton.edu/~appel/modern/java/CUP/.
18. Yuuji Ichisugi and Yves Roudier. The extensible Java preprocessor kit and a tiny data-parallel Java. In Proc. ISCOPE '97, LNCS 1343, pages 153-160. Springer, 1997.
19. Richard Kelsey, William Clinger, and Jonathan Rees (editors). Revised^5 report on the algorithmic language Scheme. ACM SIGPLAN Notices, 33(9):26-76, October 1998. Available at http://www.schemers.org/Documents/Standards/R5RS.
20. Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Proceedings of 11th European Conference on Object-Oriented Programming (ECOOP'97), number 1241 in Lecture Notes in Computer Science, pages 220-242, Jyväskylä, Finland, June 1997. Springer-Verlag.
21. Joseph R. Kiniry and Elaine Cheong. JPP: A Java pre-processor. Technical Report CS-TR-98-15, California Institute of Technology, Pasadena, CA, September 1998.
22. Shriram Krishnamurthi, Matthias Felleisen, and Daniel P. Friedman.
Synthesizing object-oriented and functional design to promote re-use. In Proc. ECOOP '98, pages 91-113, 1998.
23. Gary T. Leavens, K. Rustan M. Leino, Erik Poll, Clyde Ruby, and Bart Jacobs. JML: notations and tools supporting detailed design in Java. In OOPSLA 2000 Companion, pages 105-106, Minneapolis, Minnesota, 2000.
24. Jed Liu and Andrew C. Myers. JMatch: Abstract iterable pattern matching for Java. In Proc. 5th Int'l Symp. on Practical Aspects of Declarative Languages, New Orleans, LA, January 2003.
25. Todd Millstein, Colin Bleckner, and Craig Chambers. Modular typechecking for hierarchically extensible datatypes and functions. In Proc. 7th ACM SIGPLAN International Conference on Functional Programming (ICFP), pages 110-122, Philadelphia, PA, USA, October 2002.
26. Andrew C. Myers. JFlow: Practical mostly-static information flow control. In Proc. 26th ACM Symp. on Principles of Programming Languages (POPL), pages 228-241, San Antonio, TX, January 1999.
27. Andrew C. Myers, Joseph A. Bank, and Barbara Liskov. Parameterized types for Java. In Proc. 24th ACM Symp. on Principles of Programming Languages (POPL), pages 132-145, Paris, France, January 1997.
28. Martin Odersky and Philip Wadler. Pizza into Java: Translating theory into practice. In Proc. 24th ACM Symp. on Principles of Programming Languages (POPL), pages 146-159, Paris, France, January 1997.
29. Terence Parr and Russell Quong. ANTLR: A predicated-LL(k) parser generator. Journal of Software Practice and Experience, 25(7), 1995.
30. John C. Reynolds. User-defined types and procedural data structures as complementary approaches to data abstraction. In Stephen A. Schuman, editor, New Directions in Algorithmic Languages, pages 157-168. Institut de Recherche d'Informatique et d'Automatique, Le Chesnay, France, 1975. Reprinted in [15], pages 13-23.
31. Richard Snodgrass. The Scorpion system, August 1995. Software release. Located at ftp://ftp.cs.arizona.edu/scorpion.
32. Richard Snodgrass and Karen Shannon.
Supporting flexible and efficient tool integration. In Proceedings of the International Workshop on Advanced Programming Environments, number 244 in Lecture Notes in Computer Science, pages 290-313, Trondheim, Norway, June 1986.
33. Sun Microsystems. Java Language Specification, version 1.0 beta edition, October 1995. Available at ftp://ftp.javasoft.com/docs/javaspec.ps.zip.
34. Michiaki Tatsubori, Shigeru Chiba, Marc-Oliver Killijian, and Kozo Itano. OpenJava: A class-based macro system for Java. In Walter Cazzola, Robert J. Stroud, and Francesco Tisato, editors, Reflection and Software Engineering, LNCS 1826, pages 119-135. Springer-Verlag, July 2000.
35. J. Vlissides. Visitors in frameworks. C++ Report, 11(10), November 1999.
36. R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31-37, 1994.
37. Steve Zdancewic, Lantian Zheng, Nathaniel Nystrom, and Andrew C. Myers. Untrusted hosts and confidentiality: Secure program partitioning. In Proc. 18th ACM Symp. on Operating System Principles (SOSP), pages 1-14, Banff, Canada, October 2001.
38. Matthias Zenger and Martin Odersky. Extensible algebraic datatypes with defaults. In Proc. 6th ACM SIGPLAN International Conference on Functional Programming (ICFP), Firenze, Italy, September 2001.
39. Matthias Zenger and Martin Odersky. Implementing extensible compilers. In ECOOP Workshop on Multiparadigm Programming with Object-Oriented Languages, Budapest, Hungary, June 2001.