Use JavaCC To Build A User Friendly
Use JavaCC To Build A User Friendly
language
Now you're talking my language!
JavaCC is a very powerful "compiler compiler" tool that can be used to formulate context-free
grammars. This article demonstrates how JavaCC can be used to allow end users to formulate
simple boolean queries against a DB2 UDB database.
Introduction to JavaCC
Many Web-based projects contain ad-hoc query systems that allow end users to search for
information. For this, some sort of language is needed for the end users to convey what it is
they wish to search for. Sometimes, a user query language is defined quite simply. If your end
user is content to have a language as simple as your most typical Google search, then Java's
StringTokenizer is more than up to the task for parsing. However, if a customer wishes to have a
more robust language, perhaps adding in parentheses and "AND"/"OR" logic, then we quickly find
ourselves in need of a bigger hammer. We need a means to first define the language that the user
will be using and then use that definition to parse the user's entries and formulate the correct query
to whatever back-end database we have.
That's where a tool like JavaCC comes in. JavaCC stands for "Java® Compiler Compiler", a name
that stands in homage to YACC, "Yet Another Compiler Compiler", the C-based tool developed
by AT&T for the purposes of building parsers for C and other high-level languages. YACC and
its cohort lexical tokenizer, "Lex", accepted as input a definition of a language in a common form
known as Backus-Naur form (BNF, aka Backus Normal Form) and emitted a "C" program that
could parse input in that language and perform functions on it. JavaCC, like YACC, is designed
to speed the process of developing parser logic for languages. Whereas YACC emitted C code,
however, JavaCC, as you might expect, emits Java code.
JavaCC's history is storied. It began its life in Sun, under the name "Jack". Jack later became
JavaCC which passed through several owners, notably Metamata and WebGain, before coming
back to Sun, who then released it as Open Source code under the BSD license.
JavaCC's strength is in its simplicity and extensibility. To compile the Java code generated by
JavaCC, no outside JAR files or directories are required. Simply compile with the base 1.2 Java
compiler. The layout of the language also makes it easy to add production rules and behaviors.
The Web site even describes how you can tailor exceptions to give your user targetted syntax
hints.
The customer would like to be able to issue somewhat sophisticated queries against this database.
He would, for example, like to enter a query of the form
and have it return the Superman movies featuring Christopher Reeve. He would also like to use
parentheses to indicate order of evaluation to distinguish queries like
which might return movies that do not feature Christopher Reeve from
The Solution
For this exercise, you will define the solution in two stages. In the stage 1, you will define the
language in JavaCC, ensuring that end-user queries are parsed correctly. In stage 2, you will add
behavior to JavaCC code in order to produce DB2® SQL code that will ensure that the correct
movies are returned in answer to the end-user queries.
src for steps 1 through 4. Step 5 is covered in JavaCCPaper/stage2/src. DDL for the sample
database can be found in JavaCCPaper/moviedb.sql. The example works best if the database is
created and the parser run using the same userid. Ant files (build.xml) are provided to speed to
compilation process.
JavaCC .jj files are translated by the JavaCC executable into .java files. JavaCC takes the portion
of the .jj file that lives between PARSER_BEGIN and ends with PARSER_END and copies it
directly into the resulting java file. This is the place where you, the parser designer, can place all of
your pre- and post-parsing activity related to your parser. This is also the place where you can link
your java code to the parser actions you will define in steps 4 and 5.
In the example shown below, the parser is relatively simple. The constructor UQLParser takes
as input a String, reading it in through Java's java.io.StringReader class, and then calls another,
unseen constructor with the StringReader cast to Reader. The only other method defined here
is the static main method which calls the constructor and then calls an as-yet undefined method
called parse().
As you might have already guessed, JavaCC already provides a constructor from Java's Reader
class. We have added the String-based constructor for ease of use and testing.
package com.demo.stage1;
import java.io.StringReader;
import java.io.Reader;
/**
A String based constructor for ease of use.
**/
public UQLParser(String s)
{
this((Reader)(new StringReader(s)));
In this language, you wish to treat as delimiters but otherwise ignore spaces, tabs, carriage
returns and newlines. These characters are known as white space. In JavaCC, we define these
characters in the SKIP section, as indicated in Listing 2.
Next, you define what tokens you will recognize in the language. A token is the smallest unit of
parsed string that will have meaning to the parsing program. The process that scans an input
string and determines what the tokens are is called the tokenizer. In the query,
• actor
• =
• "Christopher Reeve"
In your language, you will make actor and the equal sign (=) reserved tokens in the language,
much as the words if and instanceof are reserved tokens with particular meaning within the
Java language. By reserving words and other special tokens, the programmer promises that
the parser will recognize these words literally, assigning a specific meaning to them. So long as
you're reserving words, go ahead and reserve the not equal sign (<>), and, or, the left and right
parentheses. Also reserve title, director, and keyword to represent specific fields for the user to
search.
To define all of these, use the TOKEN directive of JavaCC. Each token definition is enclosed in
angle-brackets, (< and >). The token's name is given on the left hand side of the colon ( : ) and
a regular expression is given on the right hand side. A regular expression is a way of defining a
portion of text to be matched. In its simplest form, regular expressions can match exact sequences
of characters. Use the code below to define six tokens that match exact words and four others that
match symbols. When the parser sees any of the words, it will match them symbolically as AND,
OR, TITLE, ACTOR, DIRECTOR, or KEYWORD. When the symbols are matched, the parser will
return LPAREN, RPAREN, EQUALS, or NOTEQUAL accordingly. Listing 3 shows the JavaCC
reserved tokens definition.
For strings like "Christopher Reeve", you cannot possibly set aside all the names of actors
as reserved words in our language. Instead, you will recognize tokens of type STRING or
QUOTED_STRING using a character pattern defined using regular expressions. Regular
expressions are strings of characters that define a pattern to be matched. Defining the regular
expression that will match all strings or quoted strings is a bit trickier than defining exact word
matches.
You will define a STRING as a series of one or more characters, the valid characters being A
through Z, upper and lower case, and the numerals 0 through 9. For the purposes of simplification,
don't worry about movie stars or movie titles with accented characters or other anomalies. You can
write this pattern as a regular expression in the following way.
The plus sign indicates that the pattern enclosed in the parentheses (any character from A to Z,
a to z, or 0 to 9) should occur one or more times in sequence. In JavaCC, you may also use the
asterisk (*) to indicate zero or more appearances of a pattern or a question mark (?) to indicate 0
or 1 repetition.
You should add the tokenizer rules for STRING and QUOTED_STRING after the rules for the
reserved words. It is important to preserve this ordering because the order that tokenizer rules
appear in the file are the order in which token rules are applied. You want to be very sure that
"title" is taken as the reserved word rather than as a STRING. The complete STRING and
QUOTED_STRING token definition appears in Listing 4.
Now that you have the tokens defined, it is time to define the parsing rules in terms of the tokens.
The user enters the query in the form of an expression. An expression is defined as a series of
one or more query terms joined by the boolean operators and or or.
To express this, we need to write a parsing rule, also known as a production. Write the production
in Listing 5 into the JavaCC UQLParser.JJ file.
When Javacc is run against a .jj file, productions are translated into methods. The first set of
braces will enclose any declarations that will be needed by the production method. For now, this
is left blank. The second set of braces encloses the production rule written in a way that JavaCC
understands. Note the use of the AND and OR tokens that were defined earlier. Also note that
queryTerm() is written as a method call. queryTerm() in fact, is another production method.
Now, let us define the queryTerm() production. queryTerm() is either a single criterion (such as
title="The Matrix") or an expression surrounded by parentheses. queryTerm() is defined in JavaCC
recursively, via expression(), which allows you to sum up the language succinctly using the code
shown in Listing 6.
That's all the rules we need. The entire language parser has been summed up in two productions.
The ZIP file provided with this paper should include the sample stage 1 JavaCC file, UQLParser.jj.
Unzip the entire ZIP file into an empty directory. To compile stage1/UQLParser.jj, first you'll need to
download JavaCC and install according to the directions at the JavaCC Web page. For simplicity's
sake, be sure to put the executable path for Javacc.bat into your PATH environment variable.
Compilation is easy; cd into the directory where you've unloaded UQLParser.jj and enter the
following.
You may also use the enclosed Ant file, build.xml if you prefer. You will have to make some
adjustments to the top of the properties file to point to your JavaCC installation. JavaCC should
produce the messages shown in Listing 7 the first time you run it.
In addition to the four files mentioned, JavaCC will also produce UQLParser.java,
UQLParserConstants.java, and UQLParserTokenManager.java. All of these are written to the
com\demo\stage1 directory. From here, you should be able to compile these files without any
additions to the default runtime classpath. The Ant file default target will perform the Java compile
automatically if the JavaCC step runs successfully. If not, you can compile your files from the top
level directory (JavaCCPaper/stage1) with
Once you have Java class files in place, you may test your new parser by entering the following
sample user query into the "main" java method you have defined. If you are using the same code,
begin from the JavaCCPaper/stage1 directory and enter the following on the command line.
The ?debug_parser option that we used during the JavaCC step ensures that the following useful
trace messages are output, indicating how the user query is parsed. The output should appear as
in Listing 8.
To test the recursive path for parenthesized expressions, try the following test.
This output is useful because it illustrates the recursion through queryTerm and expression. The
first instance of queryTerm is actually an expression made up of two queryTerms. Figure 1 shows a
pictorial view of this parse path.
If you're curious as to what Java code was produced by JavaCC, by all means, have a look (but
don't try to change any of it!). Here is what you'll find.
UQLParser.java ? In this file, you will find the code that you placed between PARSER_BEGIN and
PARSER_END in your UQLParser.jj file. You will also discover that your JJ production methods
have been mutated into Java methods.
For example, expression() rule has been expanded into the code found in Listing 10.
break;
default:
jj_la1[1] = jj_gen;
jj_consume_token(-1);
throw new ParseException();
}
queryTerm();
}
} finally {
trace_return("expression");
}
}
It bears some resemblance to what you originally wrote in that queryTerm(), AND, and OR make
an appearance but the rest is parsing detail that JavaCC has added around it.
UQLParserConstants.java ? This file is easy to figure out. All of the tokens that you have defined
are here. JavaCC has merely listed them in an array and provided integer constants to reference
into that array. Listing 11 shows the contents of UQLParserConstants.java
package com.demo.stage1;
int EOF = 0;
int AND = 5;
int OR = 6;
int TITLE = 7;
int ACTOR = 8;
int DIRECTOR = 9;
int KEYWORD = 10;
int LPAREN = 11;
int RPAREN = 12;
int EQUALS = 13;
int NOTEQUAL = 14;
int STRING = 15;
int QUOTED_STRING = 16;
int DEFAULT = 0;
String[] tokenImage = {
"<EOF>",
"\" \"",
"\"\\t\"",
"\"\\r\"",
"\"\\n\"",
"\"and\"",
"\"or\"",
"\"title\"",
"\"actor\"",
"\"director\"",
"\"keyword\"",
"\"(\"",
"\")\"",
"\"=\"",
"\"<>\"",
"<STRING>",
"<QUOTED_STRING>",
};
UQLParserTokenManager.java ? This is one scary file. JavaCC uses this class as the tokenizer.
This is the piece of code that determines what the tokens are. The primary routine of interest here
is GetNextToken. This is the routine used by the parser production methods to determine which
path to take.
Token.java ? The Token class is provided to represent the tokens themselves. The next part of
this paper will demonstrate the usefulness of the Token class.
Now that we've parsed, we need to take action on the individual expressions. The goal of this
stage is to produce a runnable DB2 SQL query that will return what the user expects.
The process should start with a boilerplate SELECT containing a blank spot where the parser will
fill in the rest. The SELECT template appears in Listing 12. The query produced by the parser may
not be as optimal as a human DBA might write, but it will return the correct results expected by the
end user.
What the parser fills in depends on the path it takes through the tokenizer. For example, if the user
enters the query from above:
then the parser should emit text into the missing portion of the SQL query according to Figure
2. It will echo the parentheses, enter subqueries for the terminal queryTerms and substitute
INTERSECT for AND and UNION for OR.
As mentioned earlier, there are probably shorter, more optimal ways to write this particular
query but this SQL will produce correct results. Generally, the DB2 optimizer can smooth out
performance deficiencies.
So, what needs to be added to the JavaCC source to produce this? You must add actions
and other supporting code to the defined grammar. An action is Java code that is executed in
response to a particular production. Before adding actions, first add a method that will return
the completed SQL to the caller. To do this, add a method called getSQL() to the top part of the
JavaCC file. You should also add the private StringBuffer sqlSB to the parser's internal members.
This variable will represent the current SQL string at any stage of the parsing. Listing 14 shows the
PARSER_BEGIN/PARSER_END section of UQLParser.jj. Finally, add some code in your main()
test method to print out and execute the generated SQL query.
package com.demo.stage2;
import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.lang.StringBuffer;
import java.io.StringReader;
import java.io.Reader;
public UQLParser(String s)
{
this((Reader)(new StringReader(s)));
sqlSB = new StringBuffer();
}
Class.forName(
"COM.ibm.db2.jdbc.app.DB2Driver"
).newInstance();
Connection con =
DriverManager.getConnection(
"jdbc:db2:moviedb");
Statement stmt =
con.createStatement();
ResultSet rs =
stmt.executeQuery(parser.getSQL());
while(rs.next())
{
System.out.println("Movie Title = " +
rs.getString("title") +
" Director = " +
rs.getString("director"));
}
rs.close();
stmt.close();
con.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
PARSER_END(UQLParser)
Now fill in the actions taken by the parser. We'll start with the easy one first. When an expression
is being parsed, the parser should emit the word "INTERSECT" whenever it parses "AND" and
"UNION" whenever it parses "OR". To do that, insert self-contained blocks of Java code after the
<AND> and <OR> tokens in the expression production. The code should append INTERSECT or
UNION to the sqlSB StringBuffer. This is illustrated in Listing 15.
Several actions need to be taken within the queryTerm()production. These tasks are as follows:
1. Map the search names to their respective DB2 tables and columns
2. Save the comparator token.
3. Put the comparands into a form that DB2 can understand, for example remove the double
quotes from QUOTED_STRING tokens
4. Output the proper subselect into sqlSB
5. For the recursive expression case, emit the parentheses as is.
For all of these tasks, you will need some local variables. These are defined between the first pair
of braces in the production, as shown in Listing 16.
The first task can be accomplished with the code in Listing 17. Set the proper DB2 table and
column associated with the token encountered.
The second task can be accomplished with the code in Listing 18. Save the token so that it can be
used in the SQL buffer.
The third task can be accomplished with the code in Listing 19. Set the comparand value
accordingly, stripping the double quotes from the QUOTED_STRING token if necessary.
The fourth task can be accomplished with the code in Listing 20. The complete query term is
appended to the sql buffer.
Finally, for the case of recursive expressions, the parser should simply echo parentheses when it
sees them within the expression recursion, as illustrated in Listing 21.
( tComparator=<EQUALS> |
tComparator=<NOTEQUAL> )
(
tComparand=<STRING>
{ sComparand = tComparand.image; } |
tComparand=<QUOTED_STRING>
{ // need to get rid of quotes.
sComparand = tComparand.image.substring(1,
tComparand.image.length() - 1);
}
)
{
sqlSB.append("SELECT MOVIE_ID FROM ").append(table);
sqlSB.append("\nWHERE ").append(columnName);
sqlSB.append(" ").append(tComparator.image);
sqlSB.append(" '").append(sComparand).append("'");
}
|
<LPAREN>
{ sqlSB.append("("); }
expression()
<RPAREN>
{ sqlSB.append(")"); }
}
Compile and run UQLParser.jj as before. Visit UQLParser.java and note how your production rules
have been neatly inserted into the generated code. An example of the expanded expression()
method is shown in Listing 23. Note the code after the jj_consume_token calls.
Run this code as before. You will have to include db2java.zip in the CLASSPATH. This time, when
you run
Call: expression
Call: queryTerm
Consumed token: <"(">
Call: expression
Call: queryTerm
Consumed token: <"actor">
Consumed token: <"=">
Consumed token: <<QUOTED_STRING>:
""Tom Cruise"">
Return: queryTerm
Consumed token: <"or">
Call: queryTerm
Consumed token: <"actor">
Consumed token: <"=">
Consumed token: <<QUOTED_STRING>:
""Kelly McGillis"">
Return: queryTerm
Return: expression
Consumed token: <")">
Return: queryTerm
Consumed token: <"and">
Call: queryTerm
Consumed token: <"keyword">
Consumed token: <"=">
Consumed token: <<STRING>: "drama">
Return: queryTerm
Return: expression
Consumed token: <<EOF>>
Return: parse
Results of Query
Movie Title = Top Gun Director = Tony Scott
Movie Title = Witness Director = Peter Weir
Try a few more queries to get used to your parser. Try a query using the NOTEQUAL token, as in
actor<>"Harrison Ford". Try some illegal queries like "title=" to see what happens. With very few
lines of JavaCC code, you have produced a very effective end-user query language.
JavaCC also provide a tool called JJTree. This tool provides tree and node classes which make
it easy for you to partition your code into separate parsing and action classes. Moving forward
with this example, you might consider writing a modest optimizer for your query to eliminate
unnecessary INTERSECTs and UNIONS. You could accomplish this by visiting the nodes of
the parse tree and consolidating similar adjacent nodes (actor="Tom Cruise" and actor="Kelly
McGillis", for example).
JavaCC has a rich library of grammars. Before writing a parser yourself, be sure to look in
JavaCC's examples directory for a possible ready-built solution.
Be sure to read the Frequently Asked Questions on the JavaCC Web page and visit the javacc
newsgroup at comp.compilers.tools.javacc to get a better understanding of all the capabilities and
features of JavaCC.
Conclusion
JavaCC is a robust tool that can be used to define grammars and easily incorporate the parsing
and exception handling for that grammar into your Java business application. With this article, we
have demonstrated that JavaCC can be used to provide a powerful, yet simple query language for
end users of a database system.
Downloadable resources
Description Name Size
JavaCCPaper.zip ( HTTP | FTP ) 8.47KB
Related topics
• YACC: Yet Another Compiler Compiler The very first widely available "compiler compiler" is
described in a paper by Johnson, Stephen C, AT&T Bell Laboratories, Murray Hill, New Jersey
07974
• Build your own Languages with JavaCC is a very good JavaWorld article written by Ensileng,
Oliver.