Polyglot: An Extensible Compiler Framework for Java

Nathaniel Nystrom, Michael R. Clarkson, and Andrew C. Myers

Cornell University
{nystrom,clarkson,andru}@cs.cornell.edu
Abstract. Polyglot is an extensible compiler framework that supports the easy creation of compilers for languages similar to Java, while avoiding code duplication. The Polyglot framework is useful for domain-specific languages, exploration of language design, and for simplified versions of Java for pedagogical use. We have used Polyglot to implement several major and minor modifications to Java; the cost of implementing language extensions scales well with the degree to which the language differs from Java. This paper focuses on the design choices in Polyglot that are important for making the framework usable and highly extensible. Polyglot source code is available.
1 Introduction
Domain-specific extension or modification of an existing programming language enables more concise, maintainable programs. However, programmers construct domain-specific language extensions infrequently because building and maintaining a compiler is onerous. Better technology is needed. This paper presents a methodology for the construction of extensible compilers and also an application of this methodology in our implementation of Polyglot, a compiler framework for creating extensions to Java [14].
Language extension or modification is useful for many reasons:
Security. Systems that enforce security at the language level may find it useful to add security annotations or rule out unsafe language constructs.
Static checking. A language might be extended to support annotations necessary for static verification of program correctness [23], more powerful static checking of program invariants [10], or heuristic methods [8].
Language design. Implementation helps validate programming language designs.
Optimization. New passes may be added to implement optimizations not performed by the base compiler or not permitted by the base language specification.
Style. Some language features or idioms may be deemed to violate good style but may not be easy to detect with simple syntactic analysis.
Teaching. Students may learn better using a language that does not expose them to difficult features (e.g., inner classes [14]) or confusing error messages [9].
This research was supported in part by DARPA Contract F30602-99-1-0533, monitored by USAF
Rome Laboratory, in part by ONR Grant N00014-01-1-0968, in part by NSF awards 0133302 and
0208642, and in part by a Sloan Research Fellowship. The views herein should not be interpreted
as representing the policies or endorsement of NSF, DARPA or AFRL.
Proceedings of the 12th International Conference on Compiler Construction, Warsaw, Poland, April 2003. LNCS 2622, pages 138-152.
We refer to the original unmodified language as the base language; we call the modified language a language extension even if it is not backwards compatible.
When developing a compiler for a language extension, it is clearly desirable to build upon an existing compiler for the base language. The simplest approach is to copy the source code of the base compiler and edit it in place. This may be fairly effective if the base compiler is carefully written, but it duplicates code. Changes to the base compiler (perhaps to fix bugs) may then be difficult to apply to the extended compiler. Without considerable discipline, the code of the two compilers diverges, leading to duplication of effort.
Our approach is different: the Polyglot framework implements an extensible compiler for the base language Java 1.4. This framework, also written in Java, is by default simply a semantic checker for Java. However, a programmer implementing a language extension may extend the framework to define any necessary changes to the compilation process, including the abstract syntax tree (AST) and semantic analysis.
An important goal for Polyglot is scalable extensibility: an extension should require programming effort proportional only to the magnitude of the difference between the extended and base languages. Adding new AST node types or new compiler passes should require writing code whose size is proportional to the change. Language extensions often require uniformly adding new fields and methods to an AST node and its subclasses; we require that this uniform mixin extension be implementable without subclassing all the extended node classes. Scalable extensibility is a challenge because it is difficult to simultaneously extend both types and the procedures that manipulate them [30, 38]. Existing programming methodologies such as visitors [13] improve extensibility but are not a complete solution. In this paper we present a methodology that supports extension of both compiler passes and AST nodes, including mixin extension. The methodology uses abstract factories, delegation, and proxies [13] to permit greater extensibility and code reuse than in previous extensible compiler designs.
Polyglot has been used to implement more than a dozen Java language extensions of varying complexity. Our experience using Polyglot suggests that it is a useful framework for developing compilers for new Java-like languages. Some of the complex extensions implemented are Jif [26], which extends Java with security types that regulate information flow; PolyJ [27], which adds bounded parametric polymorphism to Java; and JMatch [24], which extends Java with pattern matching and iteration features. Compilers built using Polyglot are themselves extensible; complex extensions such as Jif and PolyJ have themselves been extended. The framework is not difficult to learn: users have been able to build interesting extensions to Java within a day of starting to use Polyglot. The Polyglot source code is available.¹
The rest of the paper is structured as follows. Section 2 gives an overview of the
Polyglot compiler. Section 3 describes in detail our methodology for providing scalable
extensibility. Other Polyglot features that make writing an extensible compiler conve-
nient are described in Section 4. Our experience using the Polyglot system to build
various languages is reported in Section 5. Related work on extensible compilers and
macro systems is discussed in Section 6, and we conclude in Section 7.
¹ At http://www.cs.cornell.edu/Projects/polyglot
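[Figure: compilation pipeline. Ext source code is read by the extended parser to give an Ext AST; scheduled compiler passes produce a Java AST; code generation produces bytecode. Serialized type info appears both as an input to and as an output of the pipeline.]
Fig. 1. Polyglot Architecture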
2 Polyglot Overview
This section presents an overview of the various components of Polyglot and describes
how they can be extended to implement a language extension. An example of a small
extension is given to illustrate this process.
2.1 Architecture
A Polyglot extension is a source-to-source compiler that accepts a program written in
a language extension and translates it to Java source code. It also may invoke a Java
compiler such as javac to convert its output to bytecode.
The compilation process offers several opportunities for the language extension implementer to customize the behavior of the framework. This process, including the eventual compilation to Java bytecode, is shown in Fig. 1. In the figure, the name Ext stands for the particular extended language.
The first step in compilation is parsing input source code to produce an AST. Polyglot includes an extensible parser generator, PPG, that allows the implementer to define the syntax of the language extension as a set of changes to the base grammar for Java. PPG provides grammar inheritance [29], which can be used to add, modify, or remove productions and symbols of the base grammar. PPG is implemented as a preprocessor for the CUP LALR parser generator [17].
The extended AST may contain new kinds of nodes either to represent syntax added
to the base language or to record new information in the AST. These new node types are
added by implementing the Node interface and optionally subclassing from an existing
node implementation.
The core of the compilation process is a series of compilation passes applied to the abstract syntax tree. Both semantic analysis and translation to Java may comprise several such passes. The pass scheduler selects passes to run over the AST of a single source file, in an order defined by the extension, ensuring that dependencies between source files are not violated. Each compilation pass, if successful, rewrites the AST, producing a new AST that is the input to the next pass. Some analysis passes (e.g., type checking) may halt compilation and report errors instead of rewriting the AST. A language extension may modify the base language pass schedule by adding, replacing, reordering, or removing compiler passes. The rewriting process is entirely functional; compilation passes do not destructively modify the AST. More details on our methodology are described in Section 3.
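The idea of a schedule of functional passes can be illustrated with a minimal sketch. Everything named here (`PassPipelineDemo`, `Expr`, `fold`, `checkNonNegative`, `run`) is hypothetical and chosen for illustration; it is not Polyglot's scheduler or AST, only a small model of passes that each return a new tree, with one analysis pass that reports an error instead of rewriting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical, simplified illustration; these are not Polyglot's classes.
final class PassPipelineDemo {
    // Immutable expression AST: integer literals and additions.
    static abstract class Expr {}

    static final class Num extends Expr {
        final int value;
        Num(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    // Rewriting pass: folds additions of two literals into one literal.
    static Expr fold(Expr e) {
        if (!(e instanceof Add)) return e;
        Expr l = fold(((Add) e).left);
        Expr r = fold(((Add) e).right);
        if (l instanceof Num && r instanceof Num) {
            return new Num(((Num) l).value + ((Num) r).value);
        }
        return new Add(l, r); // build a new node; never mutate the input
    }

    // Analysis pass: reports an error instead of rewriting.
    static Expr checkNonNegative(Expr e) {
        if (e instanceof Num && ((Num) e).value < 0) {
            throw new IllegalStateException("negative literal: " + e);
        }
        if (e instanceof Add) {
            checkNonNegative(((Add) e).left);
            checkNonNegative(((Add) e).right);
        }
        return e;
    }

    // Runs the schedule in order; each pass's output is the next pass's input.
    static Expr run(List<UnaryOperator<Expr>> schedule, Expr ast) {
        Expr current = ast;
        for (UnaryOperator<Expr> pass : schedule) current = pass.apply(current);
        return current;
    }

    // Builds a small AST, runs two passes, and reports input and output.
    static String demo() {
        List<UnaryOperator<Expr>> schedule = new ArrayList<>();
        schedule.add(PassPipelineDemo::checkNonNegative);
        schedule.add(PassPipelineDemo::fold);
        Expr ast = new Add(new Num(1), new Add(new Num(2), new Num(3)));
        Expr out = run(schedule, ast);
        return ast + " -> " + out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model an extension would change the schedule by adding, replacing, or removing entries in the list; the input tree is still intact after the run because no pass mutates it.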
1 tracked(F) class FileReader {
2   FileReader(File f) [] -> [F] throws IOException[] { ... }
3   int read() [F] -> [F] throws IOException[F] { ... }
4   void close() [F] -> [] { ... ; free this; }
5 }
Fig. 2. Example Coffer FileReader
Compilation passes do their work using objects that define important characteristics of the source and target languages. A type system object acts as a factory for objects representing types and related constructs such as method signatures. The type system object also provides some type checking functionality. A node factory constructs AST nodes for its extension. In extensions that rely on an intermediate language, multiple type systems and node factories may be used during compilation.
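The role of a node factory can be sketched as an ordinary abstract factory. The names below (`NodeFactoryDemo`, `JavaNodeFactory`, `CofferNodeFactory`, `methodDecl`, `CofferMethodDecl`) are assumptions made for this illustration and do not reproduce Polyglot's actual interfaces; the point is only that framework code creates nodes through the factory, so an extension can substitute its own node classes.

```java
// Hypothetical sketch of the abstract-factory idea; not Polyglot's interfaces.
final class NodeFactoryDemo {
    static class MethodDecl {
        final String name;
        MethodDecl(String name) { this.name = name; }
        String describe() { return "method " + name; }
    }

    // Extension node: a method declaration that also records key sets.
    static class CofferMethodDecl extends MethodDecl {
        final String entryKeys, returnKeys;
        CofferMethodDecl(String name, String entryKeys, String returnKeys) {
            super(name);
            this.entryKeys = entryKeys;
            this.returnKeys = returnKeys;
        }
        @Override String describe() {
            return super.describe() + " [" + entryKeys + "] -> [" + returnKeys + "]";
        }
    }

    // Framework code asks the factory for nodes instead of calling `new`.
    interface NodeFactory {
        MethodDecl methodDecl(String name);
    }

    static class JavaNodeFactory implements NodeFactory {
        @Override public MethodDecl methodDecl(String name) {
            return new MethodDecl(name);
        }
    }

    // The extension overrides only the factory methods whose nodes change.
    static class CofferNodeFactory extends JavaNodeFactory {
        @Override public MethodDecl methodDecl(String name) {
            return new CofferMethodDecl(name, "", "");
        }
    }

    // Written once against the interface; works for base and extension.
    static String build(NodeFactory nf, String name) {
        return nf.methodDecl(name).describe();
    }

    public static void main(String[] args) {
        System.out.println(build(new JavaNodeFactory(), "read"));
        System.out.println(build(new CofferNodeFactory(), "read"));
    }
}
```

The `build` method stands in for any framework code, such as a parser action, that must create nodes without knowing which language extension is active.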
After all compilation passes complete, the usual result is a Java AST. A Java compiler such as javac is invoked to compile the Java code to bytecode. The bytecode may contain serialized extension-specific type information used to enable separate compilation; we discuss separate compilation in more detail in Section 4.
2.2 An Example: Coffer
To motivate our design, we describe a simple extension of Java that supports some of the resource management facilities of the Vault language [7]. This language, called Coffer, is a challenge for extensible compilers because it makes substantial changes to both the syntax and semantics of Java and requires identical modifications to many AST node types. Coffer allows a linear capability, or key, to be associated with an object. Methods of the object may be invoked only when the key is held. A key is allocated when its object is created and deallocated by a free statement in a method of the object. The Coffer type system regulates allocation and freeing of keys to guarantee statically that keys are always deallocated.
Fig. 2 shows a small Coffer program declaring a FileReader class that guarantees the program cannot read from a closed reader. The annotation tracked(F) on line 1 associates a key named F with instances of FileReader. Pre- and post-conditions on method and constructor signatures, written in brackets, specify how the set of held keys changes through an invocation. For example on line 2, the precondition [] indicates that no key need be held to invoke the constructor, and the postcondition [F] specifies that F is held when the constructor returns normally. The close method (line 4) frees the key; no subsequent method that requires F can be invoked.
The Coffer extension is used as an example throughout the next section. It is implemented by adding new compiler passes for computing and checking held key sets at each program point. Coffer's free statements and additional type annotations are implemented by adding new AST nodes and extending existing nodes and passes.
3 A Methodology for Scalable Extensibility
Our goal is a mechanism that supports scalable extension of both the syntax and se-
mantics of the base language. The programmer effort required to add or extend a pass
should be proportional to the number of AST nodes non-trivially affected by that pass;
the effort required to add or extend a node should be proportional to the number of
passes the node must implement in an interesting way.
When extending or overriding the behavior of existing AST nodes, it is often necessary to extend a node class that has more than one subclass. For instance, the Coffer extension adds identical pre- and post-condition syntax to both methods and constructors; to avoid code duplication, these annotations should be added to the common base class of method and constructor nodes. The programmer effort to make such changes should be constant, irrespective of the number of subclasses of this base class. Inheritance is the appropriate mechanism for adding a new field or method to a single class. However, adding the same member to many different classes can quickly become tedious. This is true even in languages with multiple inheritance: a new subclass must be created for every class affected by the change. Modifying these subclasses later requires making identical changes to each subclass. Mixin extensibility is a key goal of our methodology: a change that affects multiple classes should require no code duplication.
Compilers written in object-oriented languages often implement compiler passes using the Visitor design pattern [13]. However, visitors present several problems for scalable extensibility. In a non-extensible compiler, the set of AST nodes is usually fixed. The Visitor pattern permits scalable addition of new passes, but sacrifices scalable addition of AST node types. To allow specialization of visitor behavior for both the AST node type and the visitor itself, each visitor class implements a separate callback method for every node type. Thus, adding a new kind of AST node requires modifying all existing visitors to insert a callback method for the node. Visitors written without knowledge of the new node cannot be used with the new node because they do not implement the callback. The Visitor pattern also does not provide mixin extensibility. A separate mechanism is needed to address this problem.
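The trade-off described above can be seen in a small, conventional Visitor implementation. This is a generic textbook sketch with hypothetical names (`VisitorDemo`, `Num`, `Add`, `Eval`, `Print`), not code from Polyglot: adding the second pass costs one new class, while adding a node class would force a change to the `Visitor` interface and to every visitor already written.

```java
// Hypothetical sketch of the Visitor pattern's extensibility trade-off.
final class VisitorDemo {
    // One callback per node class. Adding a node class (say, a Coffer
    // `free` statement) means adding a callback here and updating every
    // visitor that already exists.
    interface Visitor<R> {
        R visitNum(Num n);
        R visitAdd(Add a);
    }

    interface Node {
        <R> R accept(Visitor<R> v);
    }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override public <R> R accept(Visitor<R> v) { return v.visitNum(this); }
    }

    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        @Override public <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
    }

    // Adding a pass is cheap: each pass is one more Visitor implementation.
    static final class Eval implements Visitor<Integer> {
        @Override public Integer visitNum(Num n) { return n.value; }
        @Override public Integer visitAdd(Add a) {
            return a.left.accept(this) + a.right.accept(this);
        }
    }

    static final class Print implements Visitor<String> {
        @Override public String visitNum(Num n) { return Integer.toString(n.value); }
        @Override public String visitAdd(Add a) {
            return "(" + a.left.accept(this) + " + " + a.right.accept(this) + ")";
        }
    }

    public static void main(String[] args) {
        Node ast = new Add(new Num(1), new Num(2));
        System.out.println(ast.accept(new Print()) + " = " + ast.accept(new Eval()));
    }
}
```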
An alternative to the Visitor pattern is for each AST node class to implement a
method for each compiler pass. However, this technique suffers from the dual problem:
adding a new pass requires adding a method to all existing node types.
The remainder of this section presents a mechanism that achieves the goal of scalable extensibility. We first describe our approach to providing mixin extensibility. We then show how our solution also addresses the other aspects of scalable extensibility.
3.1 Node Extension Objects and Delegates
We implement passes as methods associated with AST node objects; however, to pro-
vide scalable extensibility, we introduce a delegation mechanism, illustrated in Fig. 3,
that enables orthogonal extension and method override of nodes.
[Figure: a Node object with typeCheck() and print() methods and del and ext fields; a CofferExt extension object with keyFlow() and checkKeys() methods and node, del, and ext fields, where ext may point to a possible extension of the Coffer node; and a NodeDelegate object with typeCheck() and print() {node.print();} methods and a node field.]
Fig. 3. Delegates and extensions
Since subclassing of node classes does not adequately address orthogonal extension of methods in classes with multiple subclasses, we add to each node object a field, labeled ext in Fig. 3, that points to a (possibly null) node extension object. The extension object (CofferExt in the figure) provides implementations of new methods and fields, thus extending the node interface without subclassing. These members are accessed by following the ext pointer and casting to the extension object type. In the example, CofferExt extends Node with keyFlow() and checkKeys() methods. Each AST node class to be extended with a given implementation of these members uses the same extension object class. Thus, several node classes can be orthogonally extended with a single implementation, avoiding code duplication. Since language extensions can themselves be extended, each extension object has an ext field similar to the one located in the node object. In effect, a node and its extension object together can be considered a single node.
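A minimal sketch of the extension-object idea follows. The class and member names (`ExtensionObjectDemo`, `extend`, `entryKeys`, and the bodies of `keyFlow`) are assumptions for illustration and only loosely follow the labels in Fig. 3; the sketch shows one extension class supplying the same new field and method to two different node classes, reached through `ext` and a cast.

```java
// Hypothetical sketch of node extension objects; names follow Fig. 3 loosely.
final class ExtensionObjectDemo {
    static class Node {
        final String name;
        Object ext; // possibly null extension object
        Node(String name) { this.name = name; }
    }

    // Two different node classes from the base compiler.
    static class MethodDecl extends Node {
        MethodDecl(String name) { super(name); }
    }
    static class ConstructorDecl extends Node {
        ConstructorDecl(String name) { super(name); }
    }

    // One extension class supplies the new field and method for both.
    static class CofferExt {
        Node node;          // back pointer to the extended node
        Object ext;         // allows extending Coffer itself later
        String entryKeys = "";
        String keyFlow() { return node.name + " requires [" + entryKeys + "]"; }
    }

    // Attaches an extension object to any node, without subclassing the node.
    static <N extends Node> N extend(N n, String entryKeys) {
        CofferExt e = new CofferExt();
        e.node = n;
        e.entryKeys = entryKeys;
        n.ext = e;
        return n;
    }

    public static void main(String[] args) {
        Node m = extend(new MethodDecl("read"), "F");
        Node c = extend(new ConstructorDecl("FileReader"), "");
        // New members are reached by following ext and casting.
        System.out.println(((CofferExt) m.ext).keyFlow());
        System.out.println(((CofferExt) c.ext).keyFlow());
    }
}
```

Neither `MethodDecl` nor `ConstructorDecl` was subclassed, which is the mixin property the text asks for.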
Extension objects alone, however, do not adequately handle method override when the base language is extended multiple times. The problem is that any one of a node's extension objects can implement the overridden method; a mechanism is needed to invoke the correct implementation. A possible solution to this problem is to introduce a delegate object for each method in the node interface. For each method, a field in the node points to an object implementing that method. Calls to the method are made through its delegate object; language extensions can override the method simply by replacing the delegate. The delegate may implement the method itself or may invoke methods in the node or in the node's extension objects.
Because maintaining one object per method is cumbersome, the solution used in Polyglot is to combine delegate objects and to introduce a single delegate field for each node object, illustrated by the del field in Fig. 3. This field points to an object implementing the entire Node interface, by default the node itself. To override a method, a language extension writer creates a new delegate object containing the new implementation or code to dispatch to the new implementation. The delegate implements Node's other methods by dispatching back to the node. Extension objects also contain a del field used to override methods declared in the extension object interface.
Calls to all node methods are made through the del pointer, thus ensuring that the correct implementation of the method is invoked if the delegate object is replaced by a language extension. Thus, in our example, the node's typeCheck method is invoked via n.del.typeCheck(); the Coffer checkKeys method is invoked by following the node's ext pointer and invoking through the extension object's delegate: ((CofferExt) n.ext).del.checkKeys(). An extension of Coffer could replace the extension object's delegate to override methods declared in the extension, or it could replace the node's delegate to override methods of the node. To access Coffer's type-checking functionality, this new node delegate may be a subclass of Coffer's node delegate class or may contain a pointer to the old delegate object. The overhead of indirecting through the del pointer accounts for less than 2% of the total compilation time.
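The single-delegate scheme can be sketched as follows. The names `DelegateDemo`, `NodeOps`, `FieldDecl`, and `NoStaticFieldsDelegate` are hypothetical; the example assumes an extension that rejects static fields during type checking, in the spirit of the Jif use of delegates mentioned in Section 5.1, and it is not Polyglot's actual code.

```java
// Hypothetical sketch of the single-delegate scheme; not Polyglot's classes.
final class DelegateDemo {
    interface NodeOps {
        String typeCheck();
        String print();
    }

    static class FieldDecl implements NodeOps {
        final String name;
        final boolean isStatic;
        NodeOps del = this; // by default a node is its own delegate
        FieldDecl(String name, boolean isStatic) {
            this.name = name;
            this.isStatic = isStatic;
        }
        @Override public String typeCheck() { return "ok: " + name; }
        @Override public String print() { return (isStatic ? "static " : "") + name; }
    }

    // Overrides typeCheck and dispatches every other method back to the node.
    static class NoStaticFieldsDelegate implements NodeOps {
        final FieldDecl node;
        NoStaticFieldsDelegate(FieldDecl node) { this.node = node; }
        @Override public String typeCheck() {
            return node.isStatic ? "error: static field " + node.name : node.typeCheck();
        }
        @Override public String print() { return node.print(); }
    }

    // Passes always call through del, so a replaced delegate takes effect.
    static String check(FieldDecl n) { return n.del.typeCheck(); }

    public static void main(String[] args) {
        FieldDecl f = new FieldDecl("count", true);
        System.out.println(check(f));          // base behavior
        f.del = new NoStaticFieldsDelegate(f); // extension installs a delegate
        System.out.println(check(f) + " / " + f.del.print());
    }
}
```

Because `check` never calls `n.typeCheck()` directly, code written before the extension existed picks up the overriding behavior as soon as the delegate is replaced.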
3.2 AST Rewriters
Most passes in Polyglot are structured as functional AST rewriting passes. Factoring out AST traversal code eliminates the need to duplicate this code when implementing new passes. Each pass implements an AST rewriter object to traverse the AST and invoke the pass's method at each node. At each node, the rewriter invokes a visitChildren method to recursively rewrite the node's children using the rewriter and to reconstruct the node if any of the children are modified. A key implementation detail is that when a node is reconstructed, the node is cloned and the clone is returned. Cloning ensures that class members added by language extensions are correctly copied into the new node. The node's delegates and extensions are cloned with the node.
Each rewriter implements enter and leave methods, both of which take a node as argument. The enter method is invoked before the rewriter recurses on the node's children using visitChildren and may return a new rewriter to be used for rewriting the children. This provides a convenient means for maintaining symbol table information as the rewriter crosses lexical scopes; the programmer need not write code to explicitly manage the stack of scopes, eliminating a potential source of errors. The leave method is called after visiting the children and returns the rewritten AST rooted at the node.
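The enter/visitChildren/leave protocol can be sketched on a toy tree. All names (`RewriterDemo`, `DepthTagger`, the `visit` driver, string-labelled nodes) are assumptions for this illustration, and the sketch rebuilds nodes with a constructor rather than cloning them as Polyglot is described as doing. It shows how returning a new rewriter from `enter` tracks lexical nesting without an explicit scope stack.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a functional AST rewriter; not Polyglot's classes.
final class RewriterDemo {
    // Immutable tree node: a label plus children.
    static final class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... children) {
            this(label, Arrays.asList(children));
        }
        Node(String label, List<Node> children) {
            this.label = label;
            this.children = Collections.unmodifiableList(new ArrayList<>(children));
        }
        // Rewrites the children; returns this node when nothing changed,
        // otherwise a new node holding the rewritten children.
        Node visitChildren(Rewriter r) {
            List<Node> out = new ArrayList<>();
            boolean changed = false;
            for (Node c : children) {
                Node c2 = r.visit(c);
                changed |= (c2 != c);
                out.add(c2);
            }
            return changed ? new Node(label, out) : this;
        }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            List<String> parts = new ArrayList<>();
            for (Node c : children) parts.add(c.toString());
            return label + "(" + String.join(", ", parts) + ")";
        }
    }

    static class Rewriter {
        Rewriter enter(Node n) { return this; } // may return a new rewriter
        Node leave(Node n) { return n; }        // returns the rewritten node
        final Node visit(Node n) {
            Rewriter inner = enter(n);
            return leave(n.visitChildren(inner));
        }
    }

    // Example pass: tags each "var" leaf with its block nesting depth.
    static final class DepthTagger extends Rewriter {
        final int depth;
        DepthTagger(int depth) { this.depth = depth; }
        @Override Rewriter enter(Node n) {
            return n.label.equals("block") ? new DepthTagger(depth + 1) : this;
        }
        @Override Node leave(Node n) {
            return n.label.equals("var") ? new Node("var@" + depth, n.children) : n;
        }
    }

    static String demo() {
        Node root = new Node("block", new Node("var"), new Node("block", new Node("var")));
        Node out = new DepthTagger(0).visit(root);
        return root + " -> " + out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The depth is carried by the rewriter object itself, so leaving a block needs no cleanup: the outer rewriter, with the outer depth, simply resumes.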
3.3 Scalable Extensibility
A language extension may extend the interface of an AST node class through an extension object interface. For each new pass, a method is added to the extension object interface and a rewriter class is created to invoke the method at each node. For most nodes, a single extension object class is implemented to define the default behavior of the pass, typically just an identity transformation on the AST node. This class is overridden for individual nodes where non-trivial work is performed for the pass.
To change the behavior of an existing pass at a given node, the programmer creates
a new delegate class implementing the new behavior and associates the delegate with
the node at construction time. Like extension classes, the same delegate class may be
used for several different AST node classes, allowing functionality to be added to node
classes at arbitrary points in the class hierarchy without code duplication.
New kinds of nodes are defined by new node classes; existing node types are extended by adding an extension object to instances of the class. A factory method for the new node type is added to the node factory to construct the node and, if necessary, its delegate and extension objects. The new node inherits default implementations of all compiler passes from its base class and from the extension's base class. The new node may provide new implementations using method override, possibly via delegation. Methods need be overridden only for those passes that need to perform non-trivial work for that node type.
Fig. 4 shows a portion of the code implementing the Coffer key-checking pass, which checks the set of keys held when control enters a node. The code has been simplified in the interests of space and clarity. At each node in the AST, the pass invokes through the del pointer the checkKeys method in the Coffer extension, passing in the set of held keys (computed by a previous data-flow analysis pass). Since most AST nodes are not affected by the key-checking pass, a default checkKeys method implemented in the base CofferExt class is used for these nodes. For other nodes, a non-trivial implementation of key checking is required.
class KeyChecker extends Rewriter {
    Node leave(Node n) {
        ((CofferExt) n.ext).del.checkKeys(held_keys(n));
        return n;
    }
}
class CofferExt {
    Node node; CofferExt del;
    void checkKeys(Set held_keys) { /* empty */ }
}
class ProcedureCallExt extends CofferExt {
    void checkKeys(Set held_keys) {
        ProcedureCall c = (ProcedureCall) node;
        CofferProcedureType p = (CofferProcedureType) c.callee();
        if (! held_keys.containsAll(p.entryKeys()))
            error(p.entryKeys() + " not held at " + c);
    }
}
Fig. 4. Coffer key checking
Fig. 4 also contains an extension class used to compute the held keys for method
and constructor calls. ProcedureCall is an interface implemented by the classes for
three AST nodes that invoke either methods or constructors: method calls, new expres-
sions, and explicit constructor calls (e.g., super()). All three nodes implement the
checkKeys method identically. By using an extension object, we need only to write
this code once.
4 Other Implementation Details
In this section we consider some aspects of the Polyglot implementation that are not
directly related to scalable extensibility.
Data-Flow Analysis. Polyglot provides an extensible data-flow analysis framework. In the Java implementation, this framework is used to check that variables are initialized before use and that all statements are reachable; extensions may perform additional data-flow analyses to enable optimizations or to perform other transformations. Polyglot provides a rewriter in the base compiler framework that constructs the control-flow graph of the program. Intraprocedural data-flow analyses can then be performed on this graph by implementing the meet and transfer functions for the analysis.
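To make the division of labor concrete, here is a minimal sketch of a forward worklist solver in which an analysis supplies only meet and transfer functions. The interface and class names (`DataFlowDemo`, `Analysis`, `solve`, `DefiniteAssignment`) and the integer-indexed control-flow graph are assumptions for illustration, not Polyglot's data-flow API. The example analysis is a simple definite-assignment check that uses set intersection as its meet.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a data-flow framework; not Polyglot's API.
final class DataFlowDemo {
    // An analysis supplies only the meet and transfer functions.
    interface Analysis<T> {
        T meet(T a, T b);           // merge facts from two incoming edges
        T transfer(int node, T in); // fact after executing `node`
    }

    // Forward worklist solver. succ.get(n) lists the successors of node n;
    // node 0 is the entry and starts with fact `init`.
    static <T> Map<Integer, T> solve(List<List<Integer>> succ, T init, Analysis<T> a) {
        Map<Integer, T> in = new HashMap<>();
        in.put(0, init);
        Deque<Integer> work = new ArrayDeque<>();
        work.add(0);
        while (!work.isEmpty()) {
            int n = work.remove();
            T out = a.transfer(n, in.get(n));
            for (int s : succ.get(n)) {
                T old = in.get(s);
                T merged = (old == null) ? out : a.meet(old, out);
                if (!merged.equals(old)) { // fact changed: revisit successor
                    in.put(s, merged);
                    work.add(s);
                }
            }
        }
        return in;
    }

    // Definite assignment: a variable counts only if assigned on every path.
    static final class DefiniteAssignment implements Analysis<Set<String>> {
        final List<Set<String>> assigns; // variables assigned by each node
        DefiniteAssignment(List<Set<String>> assigns) { this.assigns = assigns; }
        @Override public Set<String> meet(Set<String> a, Set<String> b) {
            Set<String> r = new TreeSet<>(a);
            r.retainAll(b); // intersection
            return r;
        }
        @Override public Set<String> transfer(int node, Set<String> in) {
            Set<String> r = new TreeSet<>(in);
            r.addAll(assigns.get(node));
            return r;
        }
    }

    // CFG: 0 assigns y and branches to 1 and 2; 1 assigns x; both reach 3.
    static Set<String> demo() {
        List<List<Integer>> succ =
            List.of(List.of(1, 2), List.of(3), List.of(3), List.<Integer>of());
        List<Set<String>> assigns =
            List.of(Set.of("y"), Set.of("x"), Set.<String>of(), Set.<String>of());
        return solve(succ, new TreeSet<String>(), new DefiniteAssignment(assigns)).get(3);
    }

    public static void main(String[] args) {
        System.out.println("definitely assigned at node 3: " + demo());
    }
}
```

In the example, `x` is assigned on only one branch, so it is not in the fact computed for the join node, while `y` is.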
Separate Compilation. Java compilers use type information stored in Java class files to support separate compilation. For many extensions, the standard Java type information in the class file is insufficient. Polyglot injects type information into class files that can be read by later invocations of the compiler to provide separate compilation. No code need be written for a language extension to use this functionality for its extended types. Before performing Java code generation, Polyglot uses the Java serialization facility to encode the type information for a given class into a string, which is then compressed and inserted as a final static field into the AST for the class being serialized. When compiling a class, the first time a reference to another class is encountered, Polyglot loads the class file for the referenced class and extracts the serialized type information. The type information is decoded and may be immediately used by the extension.
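The encode and decode steps can be sketched with standard library classes. The paper does not say which compression or string encoding Polyglot uses, so GZIP and Base64 below are assumptions, as are the names `TypeInfoCodec` and `ClassTypeInfo`; only Java serialization itself is taken from the text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; GZIP and Base64 are assumed, not taken from Polyglot.
final class TypeInfoCodec {
    // Stand-in for an extension's type object for one class.
    static final class ClassTypeInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String className;
        final String trackedKey;
        ClassTypeInfo(String className, String trackedKey) {
            this.className = className;
            this.trackedKey = trackedKey;
        }
    }

    // Serializes, compresses, and encodes the object as a printable string
    // that could be stored in a static final String field.
    static String encode(Serializable info) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(info);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode: decodes the string, decompresses, and deserializes.
    static Object decode(String s) {
        byte[] raw = Base64.getDecoder().decode(s);
        try (ObjectInputStream in =
                 new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(raw)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String field = encode(new ClassTypeInfo("FileReader", "F"));
        ClassTypeInfo back = (ClassTypeInfo) decode(field);
        System.out.println(back.className + " tracked(" + back.trackedKey + ")");
    }
}
```

Because the extension's type objects only need to be serializable, the same two methods work for any extension, which matches the claim that no extension-specific code is required.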
Quasiquoting. To generate Java output, language extensions translate their ASTs to Java ASTs and rely on the code generator of the base compiler to output Java code. To enable AST rewriting, we have used PPG to extend Polyglot's Java parser with the ability to generate an AST from a string of Java code and a collection of AST nodes to substitute into the generated AST. This feature provides many of the benefits of quasiquoting in Scheme [19].
5 Experience
More than a dozen extensions of varying sizes have been implemented using Polyglot,
for example:
Jif is a Java extension that provides information flow control and features to ensure the confidentiality and integrity of data [26].
Jif/split is an extension to Jif that partitions programs across multiple hosts based on their security requirements [37].
PolyJ is a Java extension that supports bounded parametric polymorphism [27].
Param is an abstract extension that provides support for parameterized classes. This extension is not a complete language, but instead includes code implementing lazy substitution of type parameters. Jif, PolyJ, and Coffer extend Param.
JMatch is a Java extension that supports pattern matching and logic programming features [24].
Coffer, as previously described, adds resource management facilities to Java.
PAO (primitives as objects) allows primitive values to be used transparently as objects via automatic boxing and unboxing.
A covariant return extension restores the subtyping rules of Java 1.0 Beta [33] in which the return type of a method could be covariant in subclasses. The language was changed in the final version of Java 1.0 [14] to require the invariance of return types.
The major extensions add new syntax and make substantial changes to the language
semantics. We describe the changes for Jif and PolyJ in more detail below. The simpler
extensions, such as support for covariant return types, require more localized changes.
5.1 Jif
Jif is an extension to Java that permits static checking of information flow policies. In Jif, the type of a variable may be annotated with a label specifying a set of principals who own the data and a set of principals that are permitted to read the data. Labels are checked by the compiler to ensure that the information flow policies are not violated.
The base Polyglot parser is extended using PPG to recognize security annotations and new statement forms. New AST node classes are added for labels and for new statement and expression forms concerning security checks. The new AST nodes and nearly all existing AST nodes are also extended with security context annotations. These new fields are added to a Jif extension class. To implement information flow checking, a labelCheck method is declared in the Jif extension object. Many nodes do no work for this pass and therefore can inherit a default implementation declared in the base Jif extension class. Extension objects installed for expression and statement nodes override the labelCheck method to implement the security typing judgment for the node. Delegates were used to override type checking of some AST nodes to disallow static fields and inner classes since they may provide an avenue for information leaks.
Following label checking, the Jif AST is translated to a Java AST, largely by erasing
security annotations. The new statement and expression forms are rewritten to Java
syntax using the quasiquoting facility discussed in Section 4.
Jif/split further extends Jif to partition programs across multiple hosts based on their security requirements. The syntax of Jif is modified slightly to also support integrity annotations. New passes, implemented in extension objects, partition the Jif/split program into several Jif programs, each of which will run on a separate host.
5.2 PolyJ
PolyJ is an extension to Java that supports parametric polymorphism. Classes and inter-
faces may be declared with zero or more type parameters constrained by where clauses.
The base Java parser is extended using PPG, and AST node classes are added for where
clauses and for new type syntax. Further, the AST node for class declarations is ex-
tended via inheritance to allow for type parameters and where clauses.
The PolyJ type system customizes the behavior of the base Java type system and
introduces judgments for parameterized and instantiated types. A new pass is intro-
duced to check that the types on which a parameterized class is instantiated satisfy the
constraints for that parameter, as described in [27].
The base compiler code generator is extended to generate code not only for each
PolyJ source class, but also an adapter class for each instantiation of a parameterized
class.
5.3 Results
As a measure of the programmer effort required to implement the extensions discussed in this paper, the sizes of the code for these extensions are shown in Table 1. To eliminate bias due to the length of identifiers in the source, sizes are given in number of tokens for source files, including Java, CUP, and PPG files.
Table 1. Extension size
Extension          Token count   Percent of base Polyglot
base Polyglot           164136   100%
Jif                     126188    77%
JMatch                  105269    64%
PolyJ                    78159    48%
Coffer                   21251    13%
PAO                       3422     2%
Param                     3233     2%
covariant return          1562     1%
empty                      691   < 1%
These results demonstrate that the cost of implementing language extensions scales well with the degree to which the extension differs from its base language. Simple extensions such as the covariant return extension that differ from Java in small, localized ways can be implemented by writing only small amounts of code. To measure the overhead of simply creating a language extension, we implemented an empty extension that makes no changes to the Java language; the overhead includes empty subclasses of the base compiler node factory and type system classes, an empty PPG parser specification, and code for allocating these subclasses.
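The percentage column of Table 1 follows directly from the token counts. As a quick arithmetic check, the sketch below recomputes each percentage relative to the 164136 tokens of base Polyglot and rounds to the nearest integer; the class and method names are hypothetical, and the rounding rule is an assumption that happens to reproduce the published column.

```java
// Hypothetical helper that recomputes the percentage column of Table 1.
final class TablePercentages {
    static final long BASE_TOKENS = 164136; // base Polyglot, from Table 1

    // Size relative to base Polyglot, rounded to the nearest whole percent.
    static long percentOfBase(long tokens) {
        return Math.round(100.0 * tokens / BASE_TOKENS);
    }

    public static void main(String[] args) {
        String[] names = {"Jif", "JMatch", "PolyJ", "Coffer", "PAO",
                          "Param", "covariant return", "empty"};
        long[] tokens = {126188, 105269, 78159, 21251, 3422, 3233, 1562, 691};
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + percentOfBase(tokens[i]) + "%");
        }
    }
}
```

The empty extension rounds to 0% under this rule, which the table reports as "< 1%".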
PolyJ, which has large changes to the type system and to code generation, requires
only about half as much code as the base Java compiler. For historical reasons, PolyJ
generates code by overriding the Polyglot code generator to directly output Java. The
size of this code could be reduced by using quasiquoting. Jif requires a large amount
of extension code because label checking in Jif is more complex than the Java type
checking that it extends. Much of the JMatch overhead is accounted for by extensive
changes to add complex statement and expression translations.
As a point of comparison, the base Polyglot compiler (which implements Java 1.4)
and the Java 1.1 compiler, javac, are nearly the same size when measured in tokens.
Thus, the base Polyglot compiler implementation is reasonably efficient. To be fair
to javac, we did not count its code for bytecode generation. About 10% of the base
Polyglot compiler consists of interfaces used to separate the interface hierarchy from
the class hierarchy. The javac compiler is not implemented this way.
Implementing small extensions has proved to be fairly easy. We asked a program-
mer previously unfamiliar with the framework to implement the covariant return type
extension; this took one day. The same programmer implemented several other small
extensions within a few days.
5.4 Discussion
In implementing Polyglot we found, not surprisingly, that application of good object-
oriented design principles greatly enhances Polyglot's extensibility. Rigorous separa-
tion of interfaces and classes permits implementations to be more easily extended and
replaced; calls through interfaces ensure the framework is not bound to any particular
implementation of an interface. The Polyglot framework almost exclusively uses fac-
tory methods to create objects [13], giving language extensions more freedom to change
the implementation provided by the base compiler by avoiding explicitly tying code to
a particular class.
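The role of factory methods can be illustrated with a small, self-contained sketch. All names below (Expr, NodeFactory, BaseIntLit, and so on) are hypothetical and chosen for illustration; they are not Polyglot's actual classes or signatures. The sketch assumes only what the text states: client code creates objects through a factory interface, so an extension can substitute its own implementation by overriding a factory method.

```java
// Hypothetical names throughout; this is an illustration, not Polyglot's API.

// Node interface: clients see only this, never a concrete class.
interface Expr {
    String describe();
}

// Factory interface: the only way client code creates nodes.
interface NodeFactory {
    Expr intLit(int value);
}

// Base compiler implementation of the node.
class BaseIntLit implements Expr {
    protected final int value;
    BaseIntLit(int value) { this.value = value; }
    public String describe() { return "IntLit(" + value + ")"; }
}

// Base compiler factory.
class BaseNodeFactory implements NodeFactory {
    public Expr intLit(int value) { return new BaseIntLit(value); }
}

// An extension changes the node implementation...
class ExtIntLit extends BaseIntLit {
    ExtIntLit(int value) { super(value); }
    @Override public String describe() { return "ExtIntLit(" + value + ")"; }
}

// ...and overrides one factory method so that every client picks it up.
class ExtNodeFactory extends BaseNodeFactory {
    @Override public Expr intLit(int value) { return new ExtIntLit(value); }
}

// Client code is written once against the interfaces and never names
// a concrete node class, so it works unchanged with either factory.
class ParserStub {
    private final NodeFactory nf;
    ParserStub(NodeFactory nf) { this.nf = nf; }
    Expr parseLiteral(String text) { return nf.intLit(Integer.parseInt(text)); }
}

class FactoryDemo {
    public static void main(String[] args) {
        System.out.println(new ParserStub(new BaseNodeFactory()).parseLiteral("42").describe());
        System.out.println(new ParserStub(new ExtNodeFactory()).parseLiteral("42").describe());
    }
}
```

In this sketch ParserStub contains no `new BaseIntLit(...)` expression, which is the sense in which code is not explicitly tied to a particular class.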
We chose to implement Polyglot using only standard Java features, but it is clear that
several language extensions (some of which we have implemented using Polyglot)
would have made it easier to implement Polyglot. Multimethods (e.g., [5]) would have
simplified the dispatching mechanism needed for our methodology. Open classes [6]
might provide a cleaner solution to the extensibility problem, particularly in conjunc-
tion with multimethods. Aspect-oriented programming [20] is another technique for
adding and overriding methods in an existing class hierarchy. Hierarchically extensible
datatypes and functions [25] offer another solution to the extensibility problem. Mul-
tiple inheritance and, in particular, mixins (e.g., [4, 11]) would facilitate application of
an extension to many AST nodes at once. Built-in quasiquoting support would make
translation more efficient, though the need to support several target languages would
introduce some difficulties. Covariant modification of method return types would elim-
inate many unnecessary type casts, as would parametric polymorphism [27, 28].
6 Related Work
There is much work that is related to Polyglot, including other extensible compilers,
macro systems, and visitor patterns.
JaCo is an extensible compiler for Java written in an extended version of Java [39]
that supports ML-style pattern matching. JaCo does not provide mixin extensibility. It
relies on a new language feature, extensible algebraic datatypes [38], to address the
difficulty of handling new data types without changing existing code. Polyglot achieves
scalable extensibility while relying only on features available in Java.
CoSy [1] is a framework for combining compiler phases to create an optimiz-
ing compiler. Compiler phases can be added and reused in multiple contexts without
changing existing code. The framework was not designed for syntax extension. In the
SUIF compiler [36], data structures can be extended with annotations, similar to Poly-
glot's extension objects; new annotations are ignored by existing compiler passes. Scor-
pion [31, 32] is a meta-programming environment that has a similar extension mecha-
nism. Neither SUIF nor Scorpion has a mechanism like Polyglot's delegate objects to
mix in method overrides.
JastAdd [16] is a compiler framework that uses aspect-oriented programming to add
methods and fields into the AST node class hierarchy to implement new passes or to
override existing passes. The AST node hierarchy may be extended via inheritance, but
duplicate code may need to be written for each pass to support new nodes.
Macro systems and preprocessors are generally concerned only with syntactic ex-
tensions to a language. Recent systems for use in Java include EPP [18], JSE [12], and
JPP [21]. Maya [2] is a generalization of macro systems that uses generic functions
and multimethods to allow extension of Java syntax. Semantic actions can be defined
as multimethods on those generic functions. It is not clear how these systems scale to
support semantic checking for large extensions to the base language.
The Jakarta Tools Suite (JTS) [3] is a toolkit for implementing Java preprocessors
to create domain-specific languages. Extensions of a base language are encapsulated
as components that define the syntax and semantics of the extension. A fundamental
difference between JTS and Polyglot is that JTS is concerned primarily with the syn-
tactic analysis of the extension language, not with semantic analysis [3, section 4]. This
makes JTS more like a macro system in which the macros are defined by extending the
compiler rather than declaring them in the source code.
OpenJava [34] uses a meta-object protocol (MOP) similar to Java's reflection API
to allow manipulation of a program's structure. OpenJava allows very limited extension
of syntax, but through its MOP exposes much of the semantic structure of the program.
The original Visitor design pattern [13] has led to many refinements. Extensible
Visitors [22] and Staggered Visitors [35] both enhance the extensibility of the visitor
pattern to facilitate adding new node types, but neither these nor the other refinements
mentioned above support mixin extensibility. Staggered Visitors rely on multiple inher-
itance to extend visitors with support for new nodes.
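For readers unfamiliar with the problem these refinements address, the following minimal sketch shows the original Visitor pattern [13] on a two-node expression language. The node and visitor names are hypothetical and are not taken from Polyglot or from the cited systems. New passes are easy to add as new visitor classes, but a new node type would require a new method in the Visitor interface and therefore changes to every existing visitor.

```java
// Hypothetical two-node AST; an illustration of the original Visitor
// pattern, not the visitor methodology used in Polyglot.
interface Node {
    <R> R accept(Visitor<R> v);
}

// One visit method per node type. Adding a node type (say, Neg) means
// adding a method here, which breaks every existing Visitor implementation.
interface Visitor<R> {
    R visitLit(Lit n);
    R visitAdd(Add n);
}

final class Lit implements Node {
    final int value;
    Lit(int value) { this.value = value; }
    public <R> R accept(Visitor<R> v) { return v.visitLit(this); }
}

final class Add implements Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    public <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
}

// A pass that evaluates the expression.
class Evaluator implements Visitor<Integer> {
    public Integer visitLit(Lit n) { return n.value; }
    public Integer visitAdd(Add n) { return n.left.accept(this) + n.right.accept(this); }
}

// A second pass, added without touching the node classes.
class Printer implements Visitor<String> {
    public String visitLit(Lit n) { return Integer.toString(n.value); }
    public String visitAdd(Add n) {
        return "(" + n.left.accept(this) + " + " + n.right.accept(this) + ")";
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        Node e = new Add(new Lit(1), new Add(new Lit(2), new Lit(3)));
        System.out.println(e.accept(new Printer()) + " = " + e.accept(new Evaluator()));
    }
}
```

The asymmetry is visible in the code: Printer was added without editing Lit or Add, whereas a third node class could not be added without editing Visitor, Evaluator, and Printer.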
7 Conclusions
Our original motivation for developing the Polyglot compiler framework was simply to
provide a publicly available Java front end that could be easily extended to support new
languages. We discovered that the existing approaches to extensible compiler construc-
tion within Java did not solve to our satisfaction the problem of scalable extensibility
including mixins. Our extended visitor methodology is simple, yet improves on the pre-
vious solutions to the extensibility problem. Other Polyglot features such as extensible
parsing, pass scheduling, quasiquoting, and type signature insertion are also useful.
Our experience using Polyglot has shown that it is an effective way to produce
compilers for Java-like languages. We have used the framework for several significant
language extensions that modify Java syntax and semantics in complex ways. We hope
that the public release of this software in source code form will facilitate experimenta-
tion with new features for object-oriented languages.
References
1. Martin Alt, Uwe Aßmann, and Hans van Someren. Cosy compiler phase embedding with
the CoSy compiler model. In Peter A. Fritzson, editor, Proceedings of the 5th International
Compiler Construction Conference (CC'94), volume 786 of Lecture Notes in Computer Sci-
ence, pages 278-293, Edinburgh, UK, April 1994.
2. Jason Baker and Wilson C. Hsieh. Maya: Multiple-dispatch syntax extension in Java. In
Proc. of the ACM SIGPLAN '02 Conference on Programming Language Design and Imple-
mentation (PLDI), pages 270-281, Berlin, Germany, June 2002.
3. Don Batory, Bernie Lofaso, and Yannis Smaragdakis. JTS: tools for implementing domain-
specific languages. In Proceedings Fifth International Conference on Software Reuse, pages
143-53, Victoria, BC, Canada, 1998. IEEE.
4. Gilad Bracha. The Programming Language Jigsaw: Mixins, Modularity and Multiple Inher-
itance. PhD thesis, University of Utah, 1992.
5. Craig Chambers. Object-oriented multi-methods in Cecil. In Ole Lehrmann Madsen, editor,
Proceedings of the 6th European Conference on Object-Oriented Programming (ECOOP),
volume 615, pages 33-56, Berlin, Heidelberg, New York, Tokyo, 1992. Springer-Verlag.
6. Curtis Clifton, Gary T. Leavens, Craig Chambers, and Todd Millstein. MultiJava: Modular
open classes and symmetric multiple dispatch for Java. In OOPSLA 2000 Conference on
Object-Oriented Programming, Systems, Languages, and Applications, Minneapolis, Min-
nesota, volume 35(10), pages 130-145, 2000.
7. Robert DeLine and Manuel Fähndrich. Enforcing high-level protocols in low-level software.
In Proceedings of the ACM Conference on Programming Language Design and Implemen-
tation, pages 59-69, June 2001.
8. Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system rules using
system-specific, programmer-written compiler extensions. In Proceedings of Fourth Usenix
Symposium on Operating Systems Design and Implementation, San Diego, California, Octo-
ber 2000.
9. Robert Bruce Findler, Cormac Flanagan, Matthew Flatt, Shriram Krishnamurthi, and
Matthias Felleisen. DrScheme: A pedagogic programming environment for Scheme. In
Proc. International Symposium on Programming Languages: Implementations, Logics, and
Programs, pages 369-388, 1997.
10. Cormac Flanagan, K. Rustan M. Leino, Mark Lillibridge, Greg Nelson, James B. Saxe, and
Raymie Stata. Extended static checking for Java. In Proc. of the ACM SIGPLAN '02 Con-
ference on Programming Language Design and Implementation (PLDI), pages 234-245,
Berlin, Germany, June 2002.
11. Matthew Flatt, Shriram Krishnamurthi, and Matthias Felleisen. Classes and mixins. In Con-
ference Record of POPL '98: The 25TH ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages, San Diego, California, pages 171-183, New York, NY, 1998.
12. D. K. Frayne and Keith Playford. The Java syntactic extender (JSE). In Proceedings of the
2001 Conference on Object Oriented Programming Systems Languages and Applications
(OOPSLA '01), pages 31-42, Tampa, FL, USA, 2001.
13. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements
of Reusable Object-Oriented Software. Addison Wesley, Reading, MA, 1994.
14. James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley,
August 1996. ISBN 0-201-63451-1.
15. Carl Gunter and John C. Mitchell, editors. Theoretical aspects of object-oriented program-
ming. MIT Press, 1994.
16. Görel Hedin and Eva Magnusson. JastAdd: an aspect-oriented compiler construction sys-
tem. Science of Computer Programming, 47(1):37-58, November 2002.
17. Scott E. Hudson, Frank Flannery, C. Scott Ananian, Dan Wang, and Andrew Appel. CUP
LALR parser generator for Java, 1996. Software release. Located at
http://www.cs.princeton.edu/~appel/modern/java/CUP/.
18. Yuuji Ichisugi and Yves Roudier. The extensible Java preprocessor kit and a tiny data-parallel
Java. In Proc. ISCOPE '97, LNCS 1343, pages 153-160. Springer, 1997.
19. Richard Kelsey, William Clinger, and Jonathan Rees (editors). Revised^5 report on the algo-
rithmic language Scheme. ACM SIGPLAN Notices, 33(9):26-76, October 1998. Available
at http://www.schemers.org/Documents/Standards/R5RS.
20. Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes,
Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Proceedings of 11th
European Conference on Object-Oriented Programming (ECOOP'97), number 1241 in Lec-
ture Notes in Computer Science, pages 220-242, Jyväskylä, Finland, June 1997. Springer-
Verlag.
21. Joseph R. Kiniry and Elaine Cheong. JPP: A Java pre-processor. Technical Report CS-TR-
98-15, California Institute of Technology, Pasadena, CA, September 1998.
22. Shriram Krishnamurthi, Matthias Felleisen, and Daniel P. Friedman. Synthesizing object-
oriented and functional design to promote re-use. In Proc. ECOOP '98, pages 91-113,
1998.
23. Gary T. Leavens, K. Rustan M. Leino, Erik Poll, Clyde Ruby, and Bart Jacobs. JML: no-
tations and tools supporting detailed design in Java. In OOPSLA 2000 Companion, pages
105-106, Minneapolis, Minnesota, 2000.
24. Jed Liu and Andrew C. Myers. JMatch: Abstract iterable pattern matching for Java. In Proc.
5th Int'l Symp. on Practical Aspects of Declarative Languages, New Orleans, LA, January
2003.
25. Todd Millstein, Colin Bleckner, and Craig Chambers. Modular typechecking for hierarchi-
cally extensible datatypes and functions. In Proc. 7th ACM SIGPLAN International Confer-
ence on Functional Programming (ICFP), pages 110-122, Philadelphia, PA, USA, October
2002.
26. Andrew C. Myers. JFlow: Practical mostly-static information flow control. In Proc. 26th
ACM Symp. on Principles of Programming Languages (POPL), pages 228-241, San Anto-
nio, TX, January 1999.
27. Andrew C. Myers, Joseph A. Bank, and Barbara Liskov. Parameterized types for Java. In
Proc. 24th ACM Symp. on Principles of Programming Languages (POPL), pages 132-145,
Paris, France, January 1997.
28. Martin Odersky and Philip Wadler. Pizza into Java: Translating theory into practice. In Proc.
24th ACM Symp. on Principles of Programming Languages (POPL), pages 146-159, Paris,
France, January 1997.
29. Terence Parr and Russell Quong. ANTLR: A predicated-LL(k) parser generator. Journal of
Software Practice and Experience, 25(7), 1995.
30. John C. Reynolds. User-defined types and procedural data structures as complementary
approaches to data abstraction. In Stephen A. Schuman, editor, New Directions in Algorith-
mic Languages, pages 157-168. Institut de Recherche d'Informatique et d'Automatique, Le
Chesnay, France, 1975. Reprinted in [15], pages 13-23.
31. Richard Snodgrass. The Scorpion system, August 1995. Software release. Located at
ftp://ftp.cs.arizona.edu/scorpion.
32. Richard Snodgrass and Karen Shannon. Supporting flexible and efficient tool integration.
In Proceedings of the International Workshop on Advanced Programming Environments,
number 244 in Lecture Notes in Computer Science, pages 290-313, Trondheim, Norway,
June 1986.
33. Sun Microsystems. Java Language Specification, version 1.0 beta edition, October 1995.
Available at ftp://ftp.javasoft.com/docs/javaspec.ps.zip.
34. Michiaki Tatsubori, Shigeru Chiba, Marc-Oliver Killijian, and Kozo Itano. OpenJava: A
class-based macro system for Java. In Walter Cazzola, Robert J. Stroud, and Francesco
Tisato, editors, Reflection and Software Engineering, LNCS 1826, pages 119-135. Springer-
Verlag, July 2000.
35. J. Vlissides. Visitors in frameworks. C++ Report, 11(10), November 1999.
36. R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K.
Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An
infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices,
29(12):31-37, 1994.
37. Steve Zdancewic, Lantian Zheng, Nathaniel Nystrom, and Andrew C. Myers. Untrusted
hosts and confidentiality: Secure program partitioning. In Proc. 18th ACM Symp. on Oper-
ating System Principles (SOSP), pages 1-14, Banff, Canada, October 2001.
38. Matthias Zenger and Martin Odersky. Extensible algebraic datatypes with defaults. In Proc.
6th ACM SIGPLAN International Conference on Functional Programming (ICFP), Firenze,
Italy, September 2001.
39. Matthias Zenger and Martin Odersky. Implementing extensible compilers. In ECOOP Work-
shop on Multiparadigm Programming with Object-Oriented Languages, Budapest, Hungary,
June 2001.