Macho: Programming With Man Pages: Anthony Cozzie, Murph Finnicum, and Samuel T. King University of Illinois
Abstract

Despite years of work on programming languages, programming is still slow and error-prone. In this paper we describe Macho, a system which combines a natural language parser, a database of code, and an automated debugger to write simple programs from natural language and examples of their correct execution. Adding examples to natural language makes it easier for Macho to actually generate a correct program, because it can test its candidate solutions and fix simple errors. Macho is able to synthesize basic versions of six out of nine small coreutils from short natural language descriptions based on their man pages and sample runs.

1 Introduction

Programming is hard. Because computers can only execute simple instructions, the programmer must spell out the application’s behavior in excruciating detail. Because computers slavishly follow their instructions, any trivial error will result in a crash or, worse, a security exploit. Together they make computer code difficult and time consuming to write, read, and debug.

Programmers write software the same way they do everything else: by imitating other people. The first response to a new problem is often to google it, and ideally find code snippets or examples of library calls. The programmer then combines these chunks of code, writes some test cases, and makes small changes to the program until its output is correct for the inputs he has considered.

Software engineering researchers have developed techniques to help automate each of these parts of the programming process. Code search tools scan through databases of source code to find code samples related to programmer queries. For example, SNIFF [2] uses source code comments to help find snippets of code, and Prospector [4] finds library calls that convert from one language type to another. Automated debugging tools not only help find problems [6], but sometimes even suggest solutions [7]. For example, recent work by Weimer et al. [5], describes how to use genetic programming algorithms to modify buggy source code automatically until the modified programs pass a set of test cases.

Although these techniques do save time, the programmer is still responsible for selecting code snippets, arranging them into a program, and debugging the result. In this paper we describe Macho, a system that generates simple Java programs from a combination of natural language, examples (unit tests), and a large repository of Java source code (mostly from Sourceforge projects). It contains four subsystems: a natural language parser that maps English into database queries, a large database that maps programmer abstractions to snippets of Java code, a stitcher that combines code snippets in “reasonable” ways, and an automated debugger that tests the resulting candidate programs against the examples and makes simple fixes automatically.

Because database search and automated debugging are still hard problems with immature tools, Macho’s abilities are correspondingly basic. Our current version of Macho was able to synthesize simple versions (no options, one or two arguments) of various Unix core utilities from simple natural language specifications and examples of correct behavior, including versions of ls, pwd, cat, cp, sort, and grep. Macho was unable to generate correct solutions for wget, head, and uniq. Macho is still under construction, but it has already provided us with several interesting results.

Macho is a remarkably simple attack on an extraordinarily difficult task. Natural language understanding is considered one of the hardest problems in Artificial Intelligence with a huge body of current research. Generalizing from examples is similarly difficult. And even once a computer system “understands” the problem it still must actually write suitable Java code.

Our key insight is that natural language and examples have considerable synergy. Macho has a fighting chance
to generate correct programs because each component can partially correct for the mistakes of the others. For example, a database query will return many possible results, most of which will be incorrect, but by leveraging the type system the stitcher can eliminate many unlikely solutions. Even more importantly, the test cases allow Macho to partially detour around the difficult problem of natural language processing. Modern machine learning techniques provide probabilistic answers, whether the question is the meaning of a piece of natural language or the best sample function in the database to use. Backed by its automated debugger, Macho can afford to try multiple solutions.

In addition, combining examples and natural language greatly reduces their ambiguity: the set of programs that satisfies both the natural language and the test cases is much smaller than the sets that satisfy each input individually, although there are some exceptions: Macho found it surprisingly easy to synthesize cat from a unit test using the empty files it used for generating ls. However, we found that most of the time a program that passed even one reasonable test case would be correct. Together natural language and examples form a fairly concrete specification.

2 Architecture

Macho’s workflow mirrors a human programmer. It maps the natural language to implied computation, maps those abstractions to concrete Java code, combines the code chunks into a candidate solution, and finally debugs the resulting program. The goal of each subsystem is therefore to minimize the amount of brute force and thereby synthesize the largest possible programs.

2.1 Natural Language Parser

Our natural language parsing subsystem attempts to extract implied chunks of computation and the data flow between them from the words and phrases it receives, and encode that knowledge for the database. Usually the structure of the sentence can be directly transformed to requested computation: verbs imply action, nouns imply objects, and two nouns linked by a preposition imply some sort of conversion code. This mapping is conceptually similar to previous work [1], but Macho’s database “understands” a much larger number of concepts, including abbreviations. In order to handle these more varied sentences, we began with an off-the-shelf system provided by the University of Illinois Cognitive Computation group to tag individual words with their part of speech (noun, verb, adjective, etc.) and to split sentences apart into smaller phrases.

Our main problem was fixing the errors of the parser, which was trained on a standard corpus of newspaper articles, not jargon filled man pages. For example, ‘file’ is usually a verb, like “the SEC filed charges against Enron today.” and print is often a noun, e.g., “Their foul prints will not soon be cleansed from the financial system.” These kinds of errors were quite common.

To help detect what words were intended to act as actions, we build a graph of prepositions linking the objects in a sentence together into a tree. A traversal of this tree reveals the relationship between the nouns at its leaves. When we find words that are not linked to the rest of the sentence by this graph, we can guess that they are misclassified verbs. The parser also provides some hints as to likely control flow. For example, plural adjective or adverbial phrases often imply a filter operation that is implemented as an if statement. The description of grep contains ‘lines matching a pattern’ which implies only some lines will be used.

2.2 Database

As the subsystem that maps natural language abstractions to concrete Java code, the database is the engine that powers Macho. When the database can suggest reasonable code chunks, the stitching can usually find a correct solution, but when the database fails the space of candidate programs is simply too large to succeed by flailing randomly.

Our original plan was to use Google Code, but we almost immediately dismissed it as completely inadequate. Google Code indexes a huge number of files, but it appears to only perform keyword search on the raw text of the source files, which we found to be inadequate for our problem. Instead, we developed our own database for Macho.

Our first step was to obtain a data set of about 200,000 Java files from open source projects and compile them using a special version of javac that we modified to emit abstract syntax trees. We compiled rather than parsed because we wanted exact global locations for each function call, and because we didn’t want to reuse broken code. Since open source programmers are not exactly paragons of code maintenance, only about half of our source files compiled successfully.

Our database returns candidate methods based on input and output variables, e.g. the query directory → files would return all functions called with an input variable named directory and assigned to a variable named files. This nicely captured the different abstractions that different programmers used to represent code, which is important because functions have only one name. The problem with this approach is that many things aren’t usually implemented as functions. Higher level concepts
like ignore, first, or adjacent usually appear as operations or even control flow. Often they have no input variables or are only tagged in the comments.

[Figure: diagram of Macho's subsystems (NL Parser, Database, Stitching, Automated Debugger). Labels in the diagram: Natural Language, Examples, Queries, Code Chunks, Candidate Solutions, Glue Code Requests, Change Requests.]

2.3 Stitching

Macho’s stitching subsystem combines results from database queries into candidate programs. Its main guide is the type system; two expressions can be linked by a variable if the output type of one matches the input type of the other. If the types don’t match, the stitcher will query the database for common chunks of code that were used to convert between those types.

Macho also generates a small amount of control flow. If statements are generated only from hints by the natural language parser and the synthesizer. Map loops are generated when suggested by the type system. Macho tries to limit control flow generation because it swiftly increases the solution space; an upstream chunk may be placed in any block above the downstream chunk.

The most difficult part of stitching is keeping track of the data flow between expressions in the presence of control flow. The natural language gives a great deal of information for how information is supposed to flow from one chunk to another; previous natural language programming systems generated code without any search at all.

2.4 Automated debugger

Macho’s automated debugging subsystem attempts to debug candidate programs. This type of automated debugging is potentially extremely difficult, but many of the automatically generated candidate programs will have utterly obvious errors that can be fixed easily. The primary difference between stitching and automated debugging is that debugging is dynamic rather than static and has access to the behavior of the program. Currently the automated debugger runs the candidate in a sandbox and performs a diff between the output of the candidate and the unit test and classifies the candidate into one of five simple cases: exception thrown (try to insert an if block around the offending statement), a superset of correct output (insert if blocks around the offending print), garbage (try the next program), a subset of correct output (try adding a few prints), or, in the best case, correct output (declare victory).

These components have synergy beyond simply correcting mistakes. For example, our automated debugger leverages the database to suggest changes to buggy programs. When it is faced with a potential solution for ls which incorrectly prints hidden files, the debugger queries the database for commonly used functions of java.io.File which could be used in an if statement to restrict the obstreperous print. This simple probabilistic model allows it to try the isHidden method even though it is not used elsewhere in the candidate solution.

Although the automated debugging seems superficially simple, it actually solves a very difficult problem of library combination. Macho’s database finds candidate functions entirely by name, which may be unrelated to their purpose. Running the code allows the debugger to eliminate these imposter functions.

3 Evaluation

Objectively evaluating Macho is very difficult. There is no standard test suite where we can benchmark our results against other systems, and using the language from the man pages directly is almost impossible. Consider the byzantine man page description for wget:

    GNU Wget is a free utility for non-interactive
    download of files from the Web. It supports
    HTTP, HTTPS, and FTP protocols, as well as
    retrieval through HTTP proxies.
Program | Result | Input | Notes
pwd | success | Print the current working directory. | Difficult as there is no input.
pwd | success | Print the user directory. | CWD = “user.dir” in Java.
pwd | success | Print the current directory. | Abbreviation!
pwd | fail | Print the working directory. | Breaks NLP for arcane reasons.
pwd | fail | Show the current working directory. | Database entries for show are mostly graphics.
cat | success | Print the lines of a file. | Vanilla.
cat | success | Read a file. | Print is synthesized.
cat | fail | Display the contents of a file. | Database entries for contents are mostly graphics.
cat | fail | Print a file | Solutions print the file name.
sort | success | Sort the lines of a file. | Print is synthesized.
sort | success | Sort a file by line. |
sort | fail | Sort a file. | Insufficiently precise specification.
sort | fail | Sort the contents of a file | Database entries for contents are mostly graphics.
grep | success | Print the lines in a file matching a pattern. | Solutions using both JavaLib and GNU regexes.
grep | fail | Find a pattern in the lines of a file. | Correct except for if statement linking test and print.
grep | fail | Search file for a pattern. | Poor resiliency for function names.
ls | success | Print the names of files in a directory. Sort the names. |
ls | success | Print the contents of a folder. Sort the names. |
ls | fail | Print the names of the entries in a directory. | Entries to names fails.
ls | fail | Print the files in a directory. | Does not synthesize sort.
cp | success | Copy src file to dest file. | Programmer abbreviation!
cp | success | Copy file to file. | Ugly but Macho needs to know there are two inputs.
cp | fail | Duplicate file to file. | No candidate in database.
wget | fail | Download file. | Candidates have extra functionality.
wget | fail | Open network connection. Download file. | Macho can’t create buffer transfer loop.
head | fail | Print the first ten lines of a file. | ‘First’ is incomprehensible.
uniq | fail | Print a file. Ignore adjacent lines. | ‘Ignore’ and ‘adjacent’ don’t map to libraries.
perl | fail | The answer to life, the universe, and everything. | Seems to work, but it’s still running.
Figure 2: Macho’s results for generating select core utils. This figure shows the results for pwd, cat, sort, grep, ls, cp,
wget, head, and uniq, and the natural language input we used for each of these programs.
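
Several of the failure notes in Figure 2 point at the database, which Section 2.2 describes as a lookup keyed on the names of input and output variables (for example, directory → files). A minimal sketch of such a lookup is given below to make those notes easier to interpret. It is an illustration under stated assumptions, not Macho's actual database: the class SnippetIndex, its record and query methods, and the ranking by call-site frequency are all hypothetical, and the recorded call sites are made-up sample data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SnippetIndex {
    // Key: "inputVar->outputVar"; value: method name -> number of call sites seen.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one call site, e.g. "files = directory.listFiles();". */
    public void record(String inputVar, String outputVar, String method) {
        counts.computeIfAbsent(key(inputVar, outputVar), k -> new HashMap<>())
              .merge(method, 1, Integer::sum);
    }

    /**
     * Returns the methods seen for inputVar -> outputVar.
     * Assumption: most frequent first, ties broken alphabetically.
     */
    public List<String> query(String inputVar, String outputVar) {
        Map<String, Integer> hits =
                counts.getOrDefault(key(inputVar, outputVar), new HashMap<>());
        List<String> methods = new ArrayList<>(hits.keySet());
        methods.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return methods;
    }

    private static String key(String inputVar, String outputVar) {
        return inputVar + "->" + outputVar;
    }

    public static void main(String[] args) {
        SnippetIndex index = new SnippetIndex();
        // Hypothetical call sites harvested from compiled source files.
        index.record("directory", "files", "java.io.File.listFiles");
        index.record("directory", "files", "java.io.File.listFiles");
        index.record("directory", "files", "java.io.File.list");
        System.out.println(index.query("directory", "files"));
        // prints [java.io.File.listFiles, java.io.File.list]
    }
}
```

A query for a pair that was never recorded returns an empty list, which corresponds to notes such as "No candidate in database."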
Giving out partial credit is also difficult. Some of Macho’s solutions are very close but not byte identical, but automatically determining whether or not an output is sufficiently close to the test case is approximately as hard as generating the program, an artificial version of the Dunning-Kruger effect. Under these circumstances we decided to try to pick an interesting set of natural language inputs right on the border of Macho’s capabilities and use our best judgement when the test cases were “close”.

Macho succeeded in generating simple versions of six out of nine coreutils - pwd, cat, sort, grep, cp, and ls - and failed to synthesize wget, head, and uniq. For each core utility, we targeted its default behavior: no options and the minimum number of arguments possible. Since we had the programs available anyway, we used them to generate our unit tests. All of the programs had only one short test and the results are shown in Figure 2.

4 Lessons Learned

4.1 The Database is King

Although most of the programs Macho writes are 10-15 lines or less, there are a lot of potential 10-line Java programs. Brute force really does not get very far - the ability of the database to select reasonable pieces from the natural language heuristics is absolutely critical. In general, when the stitching failed, it was often reasonable to think of a hack, or a simple fix, or just let it run a little longer, but when the database failed Macho had no hope of ever generating a correct solution. Improving Macho will require a superior database above everything else.

4.2 Pure NLP is Bad

Programming with natural language is generally considered a bad idea because specifying details gradually mutates the natural language into a wordy version of Visual Basic. Consider a natural language spec for ls:

    Take the path "/home/zerocool/"
    If the path is a file, print
    it. Otherwise get the list of
    files in the directory. Sort the
    result alphabetically. Go over
    the result from the beginning to
    the end: If the current element’s
    filename does not begin with ".",
    print it.

which is our best guess for the input required for Pegasus [3]; it is obvious why most programmers would
prefer to use Python instead. Instead, a Macho programmer can specify the basic task very simply:

    Print the names of files in a
    directory. Sort the names.

Even an almost trivial program like this leaves many details unspecified: should the sort be alphabetically by filename, size, file extension, or date? Should the program print the full path, the relative path, or just the name of the files? Does “files” include subdirectories or hidden files? All of these questions are easily cleared up by an example of correct operation. Such examples not only have a higher information density than tedious pages of pseudocode or UML, but they also reduce the workload of the programmer by allowing him to think about one case at a time, rather than all possible cases. In other words, examples allow a user to be concrete without being formal.

4.3 Interactive Programming is the Answer

A traditional programmer must write code that satisfies all possible inputs his program will encounter, while a Macho Programmer can consider each input individually. Macho therefore not only saves the programmer the work of writing code but also frees the programmer from difficult formal reasoning.

Ideally, however, the programmer would only be required to verify, not generate, concrete values. In this rosy scenario the programmer would input natural language and the system would offer a set of alternatives. The programmer could then reject incorrect cases, or suggest modifications, until eventually a correct program is negotiated. This is important because programming is not simply the act of transferring a mental vision into machine code. In reality, the requirements are fuzzy. Some things are more important than others, and still others can be waived or changed if they are difficult to implement. Interactive programming allows the programmer to take the path of least resistance to a satisfactory program.

Of course, this also requires considerably more accurate program synthesis from pure natural language, as well as much better understanding of general concepts, which no one really knows how to do at the moment.

5 Conclusions

In this paper we have discussed Macho, a system that synthesizes programs from a combination of natural language, unit tests, and a large database of source code samples. A few of our technical findings are that the natural language can give implicit hints about the control flow in a program, variable names contain useful information about the functionality of code, and the automatic debugger can use the database to add new code to a candidate solution.

Macho is a simple proof of concept system, not yet directly useful for most programmers, but it can still synthesize basic versions of six small coreutils. By improving the source code database we believe that Macho can be a practical system for helping programmers.

References

[1] A. W. Biermann and B. W. Ballard. Toward natural language computation. Comput. Linguist., 6(2):71–86, 1980.

[2] S. Chatterjee, S. Juvekar, and K. Sen. Sniff: A search engine for Java using free-form queries. In FASE ’09: Proceedings of the 12th International Conference on Fundamental Approaches to Software Engineering, pages 385–400, Berlin, Heidelberg, 2009. Springer-Verlag.

[3] R. Knöll and M. Mezini. Pegasus: first steps toward a naturalistic programming language. In OOPSLA ’06: Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications, pages 542–559, New York, NY, USA, 2006. ACM.

[4] D. Mandelin, L. Xu, R. Bodík, and D. Kimelman. Jungloid mining: helping to navigate the API jungle. In PLDI ’05: Proceedings of the 2005 Conference on Programming Language Design and Implementation, pages 48–61, New York, NY, USA, 2005. ACM.

[5] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. Automatically finding patches using genetic programming. In ICSE ’09: Proceedings of the 31st International Conference on Software Engineering, pages 364–374, Washington, DC, USA, 2009. IEEE Computer Society.

[6] A. Zeller. Yesterday, my program worked. Today, it does not. Why? In ESEC/FSE-7: Proceedings of the 7th European software engineering conference held jointly with the 7th ACM SIGSOFT international symposium on Foundations of software engineering, pages 253–267, London, UK, 1999. Springer-Verlag.

[7] X. Zhang, N. Gupta, and R. Gupta. Locating faults through automated predicate switching. In ICSE ’06: Proceedings of the 28th international conference on Software engineering, pages 272–281, New York, NY, USA, 2006. ACM.