How To Build Your Own Language
01-overview.ipynb
We tend to engage in magical thinking when it comes to programming: we say that the
program (a written text) does X, Y, or Z.
Some languages are better at describing the state of the beans than
others.
https://nbviewer.jupyter.org/github/jpivarski/2019-05-06-adl-language-tools/blob/master/01-overview.ipynb 1/3
5/7/2019 Jupyter Notebook Viewer
Modern programming languages are attached to the device in such a way that they push
the beans for us—that's where the confusion comes from. The distinction between talking
about the beans and moving the beans is hidden. It wasn't always so.
John McCarthy, creator of Lisp: "This EVAL was written and published in the paper and
Steve Russell said, 'Look, why don't I program this EVAL?' and I said to him, 'Ho, ho, you're
confusing theory with practice—this EVAL is intended for reading, not for computing!' But
he went ahead and did it." (Talk at MIT in 1974, published here
(https://dl.acm.org/citation.cfm?id=802047&dl=ACM&coll=DL).)
APL (ancestor of MATLAB, R, and Numpy) was also a notation for understanding programs
years before it was executable. The book was named A Programming Language
(http://www.softwarepreservation.org/projects/apl/Books/APROGRAMMING%20LANGUAGE
Despite the physics focus of our goals, the techniques for creating languages are standard.
This tutorial will introduce specific tools in Python along with their general concepts.
1. (this introduction)
2. Language basics: parsing and interpreting (02-parsers-and-interpreters.ipynb)
3. Compiling and transpiling to another language (03-compiling-and-transpiling.ipynb)
4. Type checking: proving program correctness (04-type-checking.ipynb)
02-parsers-and-interpreters.ipynb
Parsing is the conversion of source code text into a tree representing relationships among tokens
(words & symbols).
Reports about medicines in newspapers and on television commonly contain little or no information about
drugs' risks and cost, and often cite medical "experts" without disclosing their financial ties to the
pharmaceutical industry, according to a new study.
Susan Okie, The Washington Post (published on June 1, 2000, in Louisville, KY, in The Courier-
Journal, page A3)
https://nbviewer.jupyter.org/github/jpivarski/2019-05-06-adl-language-tools/blob/master/02-parsers-and-interpreters.ipynb 1/24
Grammar: a list of rules to convert tokens into trees and trees into bigger trees.
Mathematical expressions and computer programs can be parsed the same way.
I used to have a favorite (PLY), but while preparing this demo, I found a better
one (Lark).
Beginners: Lark is not just another parser. It can parse any grammar you throw at it, no matter how
complicated or ambiguous, and do so efficiently. It also constructs a parse-tree for you, without additional
code on your part.
Experts: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can
trade-off power and speed, according to your requirements. It also provides a variety of sophisticated
features and utilities.
expression_grammar = """
start: expression
expression: arith

arith: term | term "+" arith -> add | term "-" arith -> sub
term: factor | factor "*" term -> mul | factor "/" term -> div
factor: pow | "+" factor -> pos | "-" factor -> neg
pow: call ["**" factor]
call: atom | call trailer
atom: "(" expression ")" | CNAME -> symbol | NUMBER -> literal
trailer: "(" arglist ")"
arglist: expression ("," expression)*

%import common.CNAME
%import common.NUMBER
%import common.WS
%ignore WS
"""
parser = lark.Lark(expression_grammar)
start
  expression
    add
      term
        factor
          pow
            call
              literal 2
      term
        factor
          pow
            call
              literal 2
If the prepositional phrase "in my pajamas" had a well-defined operator precedence in "I shot an elephant
in my pajamas," there would be no ambiguity.
Building the operator precedence into the grammar created a lot of superfluous tree nodes, though.
The parsing tree has too much detail because it includes nodes
for rules even if they were just used to set up operator
precedence.
In [4]: def toast(ptnode):  # Recursively convert parsing tree (PT) into abstract syntax tree (AST)
            if ptnode.data in ("add", "sub", "mul", "div", "pos", "neg"):
                arguments = [toast(x) for x in ptnode.children]
                return Call(Symbol(str(ptnode.data), line=arguments[0].line), arguments)
            elif ptnode.data == "pow" and len(ptnode.children) == 2:
                arguments = [toast(ptnode.children[0]), toast(ptnode.children[1])]
                return Call(Symbol("pow", line=arguments[0].line), arguments)
            elif ptnode.data == "call" and len(ptnode.children) == 2:
                return Call(toast(ptnode.children[0]), toast(ptnode.children[1]))
            elif ptnode.data == "symbol":
                return Symbol(str(ptnode.children[0]), line=ptnode.children[0].line)
            elif ptnode.data == "literal":
                return Literal(float(ptnode.children[0]), line=ptnode.children[0].line)
            elif ptnode.data == "arglist":
                return [toast(x) for x in ptnode.children]
            else:
                return toast(ptnode.children[0])   # many other cases, all of them single-child pass-throughs

        print(toast(parser.parse("2 + 2")))
add(2.0, 2.0)
Execution
The simplest way to run a program is to repeatedly walk over the AST, evaluating each step. This is an
interpreter.
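The idea can be sketched with nested tuples standing in for the notebook's AST classes (the node shapes here are invented for illustration):

```python
# A minimal tree-walking interpreter over a toy AST of nested tuples.
# ("call", fcn_name, args...) and ("lit", value) are illustrative node shapes,
# not the notebook's real AST classes.
import math

builtins = {"add": lambda x, y: x + y, "sqrt": math.sqrt}

def run(node):
    kind = node[0]
    if kind == "lit":
        return node[1]
    elif kind == "call":                    # evaluate arguments, then apply
        args = [run(x) for x in node[2:]]
        return builtins[node[1]](*args)

ast = ("call", "add", ("lit", 2), ("lit", 2))
print(run(ast))   # 4
```

Every evaluation walks the tree again from the top, which is exactly what makes an interpreter simple and slow.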
Historical interlude:
A compiler scans the AST to generate a sequence of machine instructions, natively recognized and
executed by the computer.
Out[5]: 4.0
This is the pattern we will use for the rest of this tutorial:
We will just be adding to the grammar, the AST, and the interpreter as we go.
Error handling
If a bad condition is encountered at runtime, like sqrt(-5), the interpreter stops because the
underlying Python execution engine raises an exception.
When writing a language, we must distinguish between our own internal errors and the users' logic
mistakes. In the latter case, we have to let them know that they can fix it and provide a hint about where
to start.
Line numbers are the most useful hint—but only when they're lines in the user's code, not the execution
engine itself. The parser knows about line numbers—we must propagate that information into the AST
(for an interpreter) and the final executable (for a compiler with debugging symbols included).
In [6]: # We've already propagated line numbers from parsing tree tokens to all AST nodes.
        def showline(ast):
            if isinstance(ast, list):
                for x in ast:
                    showline(x)
            if isinstance(ast, AST):
                print("{0:5s} {1:10s} {2}".format(str(ast.line), type(ast).__name__, ast))
                for n in ast._fields:
                    showline(getattr(ast, n))
In [7]: # Short exercise: change the line below to report UserErrors with line numbers
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-c8ce6ea2c630> in <module>
16 raise err # CHANGE THIS LINE
17
---> 18 run(toast(parser.parse("""sqrt(-5)""")), {**operator.__dict__, **math.__dict__})
Assignments
So far, all we've implemented is a calculator. As a next step, let's extend the language to include
assignments.
These are our first statements, which do not compose as expressions do. Whereas expressions can be
nested in parentheses like mathematical formulae, statements form a sequence that can only be nested
with some block-syntax. We'll use curly brackets: {...}.
A block of statements could be used as an expression if it has a value. We'll use a common convention
(among functional languages) in which the last statement of a block is an expression, and its value is the
value of the block.
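That convention can be sketched in a few lines; `eval_stmt` stands in for the interpreter's real statement evaluator, so the names here are illustrative:

```python
# Sketch of the convention: the value of a block is the value of its
# last statement. eval_stmt is a stand-in for the real evaluator.
def eval_block(statements, eval_stmt):
    result = None
    for stmt in statements:
        result = eval_stmt(stmt)
    return result          # value of the final statement

print(eval_block([1, 2, 3], lambda s: s * 10))   # 30
```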
%import common.WS_INLINE
%import common.NEWLINE
%ignore WS_INLINE
"""
In [9]: print(parser.parse("""
x := 5
x
""").pretty())
start
  statements
    assignment
      x
      expression
        arith
          term
            factor
              pow
                call
                  literal 5
    expression
      arith
        term
          factor
            pow
              call
                symbol x
{x := 5.0; x}
In [13]: # Short exercise: change toast so that x := {5} produces the same AST as x := 5
         def toast(ptnode):
             if ptnode.data == "statements":
                 statements = [toast(x) for x in ptnode.children if x != "\n"]
                 return Block(statements, line=statements[0].line)
{x := {5.0}; x}
Out[14]: 7.0
In [15]: # Should variables be accessible outside of the block where they're defined (i.e. should they leak)?
Out[15]: 5.0
In [17]: # Short exercise: change the following so that scopes are nested.
         # Bonus: what SHOULD reassigning a variable from a parent's scope do? Can you implement it?
def run(astnode, symbols):
if isinstance(astnode, Literal):
return astnode.value
elif isinstance(astnode, Symbol):
return symbols[astnode.symbol]
elif isinstance(astnode, Call):
function = run(astnode.function, symbols)
arguments = [run(x, symbols) for x in astnode.arguments]
return function(*arguments)
elif isinstance(astnode, Block):
for statement in astnode.statements:
last = run(statement, symbols)
return last
elif isinstance(astnode, Assign):
symbols[astnode.symbol] = run(astnode.value, symbols)
Out[17]: 5.0
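One possible answer to the nesting exercise, sketched with `collections.ChainMap` rather than the notebook's own symbol table (so treat the details as an assumption): each block evaluates in a child mapping that reads through to its parent but keeps new assignments local.

```python
import collections

# Sketch: each block gets a child scope. Lookups fall back to the parent;
# new assignments land in the innermost scope (one possible policy for
# the "what should reassignment do?" question).
def run_block(statements, run, symbols):
    scope = collections.ChainMap({}, symbols)
    last = None
    for statement in statements:
        last = run(statement, scope)
    return last

outer = {"x": 5.0}
def assign_y_read_x(stmt, scope):      # stand-in for a real statement evaluator
    scope["y"] = 1.0                   # goes into the child scope only
    return scope["x"]                  # visible through the parent

print(run_block(["stmt"], assign_y_read_x, outer))   # 5.0
print("y" in outer)                                  # False: y did not leak out
```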
Branching
A calculator that can assign quantities is still just a calculator—though it may make the expressions easier
to read.
The next level is to introduce control structures—if/then/else, while loops, and subroutines. For brevity,
we'll just do if/then/else.
We could either allow variable definitions in then/else clauses to leak or we could always require an else
clause and let the if/then/else block have a value. The former is a state-changing imperative language;
the latter is functional (and easier to implement). We'll do the latter.
start
  statements
    expression
      branch
        expression
          or
            and
              not
                gt
                  arith
                    term
                      factor
                        pow
                          call
                            symbol x
                  arith
                    term
                      factor
                        pow
                          call
                            literal 0
        block
          expression
            or
              and
                not
                  comparison
                    arith
                      term
                        factor
                          pow
                            call
                              literal 1
        block
          expression
            or
              and
                not
                  comparison
                    arith
                      term
                        neg
                          factor
                            pow
                              call
                                literal 1
        elif ptnode.data in ("or", "and", "eq", "ne", "gt", "ge", "lt", "le") and len(ptnode.children) == 2:
            arguments = [toast(x) for x in ptnode.children]
            return Call(Symbol(str(ptnode.data), line=arguments[0].line), arguments)
In [22]: # The interpreter doesn't need any new features; all of these operators are just functions.
Out[22]: 4.0
In [23]: # However, both then/else clauses are evaluated, regardless of the predicate.
def show(x, y, f):
print(f, x, y, f(x, y))
return f(x, y)
builtins["add"] = lambda x, y: show(x, y, operator.add)
builtins["mul"] = lambda x, y: show(x, y, operator.mul)
Out[23]: 4.0
In [24]: # same for and/or, which are traditionally only evaluated fully if their result is not already determined by the first argument (short-circuit evaluation)
Out[24]: False
Much like the decision that if/then/else would return a value, rather than changing state, decisions about
order of evaluation have a subtle effect on what kinds of programs will be written in the language.
Both are mathematically valid, but (2) would break every bash script on Earth.
Out[25]: 4.0
To evaluate only one of the two branches, we had to implement a special rule in the interpreter because
the general rule is "evaluate all function arguments before evaluating the function."
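Concretely, that special case looks something like this sketch (tuple-shaped nodes invented for illustration, not the notebook's real AST):

```python
# Sketch: an interpreter that special-cases "if" so only one branch runs.
# ("if", pred, then, else), ("call", name, args...), ("lit", value) are
# illustrative node shapes.
def run(node, symbols):
    kind = node[0]
    if kind == "lit":
        return node[1]
    elif kind == "if":                   # special rule: branches are NOT pre-evaluated
        _, predicate, then_clause, else_clause = node
        if run(predicate, symbols):
            return run(then_clause, symbols)
        else:
            return run(else_clause, symbols)
    elif kind == "call":                 # general rule: arguments first
        args = [run(x, symbols) for x in node[2:]]
        return symbols[node[1]](*args)

symbols = {"crash": lambda: 1 / 0}
ast = ("if", ("lit", True), ("lit", 4.0), ("call", "crash"))
print(run(ast, symbols))   # 4.0 — the else branch is never evaluated
```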
Some languages provide control over argument evaluation, such that users of the language can create
control structures like if/then/else. Lisp has a general-purpose "quote" form:
We can also do this in a language that allows functions to be defined and passed around as objects.
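In Python terms, that amounts to passing zero-argument functions ("thunks") and letting the callee decide what to evaluate; this is a sketch of the idea, not the notebook's exact code:

```python
# "if" as an ordinary function over thunks: evaluation is deferred until
# the called function decides to invoke a thunk.
def if_(predicate, then_thunk, else_thunk):
    return then_thunk() if predicate else else_thunk()

print(if_(True, lambda: 4.0, lambda: 1 / 0))   # 4.0 — the losing thunk never runs
```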
start
  statements
    expression
      function
        paramlist x
        block
          expression
            or
              and
                not
                  comparison
                    arith
                      term
                        factor
                          pow
                            call
                              symbol x
                            factor
                              pow
                                call
                                  literal 2
def toast(ptnode):
if ptnode.data == "function":
paramlist = [str(x) for x in ptnode.children[0].children]
body = toast(ptnode.children[1])
return Function(paramlist, body, line=body.line)
In [29]: print(toast(parser.parse("""
f := x => x**2
f(y)
""")))
run(toast(parser.parse("""
f := x => 2*x
f(y)
""")), SymbolTable(symboltable, y = 5))
Out[31]: 10.0
In [32]: # Now we can define "if" as a plain function that takes and calls zero-argument functions
         # to customize the order of evaluation.
Out[32]: 4.0
Recently, we have been talking about Domain Specific Languages (DSL) but referring to them as
"declarative programming."
Strictly Evaluated languages evaluate expressions in lexical order (i.e. arguments before function
calls).
Lazily Evaluated languages eventually produce the same results but give the program more
flexibility in deciding when or where the code will run. There are several kinds of objects representing
a calculation that has not been executed.
Declarative languages produce the same output as strictly or lazily evaluated languages, but hide
the distinction between them: the order in which the code runs is an implementation detail.
rendering HTML: order determines placement, but not the sequence of graphics commands to draw
it
execution of SQL: user's queries are rewritten and optimized by query planner
commands in a Makefile: only executed if targets are out of date
cached function calls ("memoized"): function is only executed if not cached; e.g. an HTTP GET
request
machine code instructions in a CPU: modern processors sometimes engage in speculative execution
(https://en.wikipedia.org/wiki/Out-of-order_execution).
One place we might consider declarative evaluation is in hiding the distinction between columnar and
fused array operations (ask me later).
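The memoized case in the list above is easy to demonstrate with Python's `functools.lru_cache`: whether the function body runs at all is decided by the cache, not by the call site.

```python
import functools

calls = []

@functools.lru_cache(maxsize=None)
def fetch(url):
    calls.append(url)            # stand-in for an expensive HTTP GET
    return "response for " + url

fetch("http://example.com")
fetch("http://example.com")      # cached: the body does not run again
print(len(calls))                # 1
```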
Parsing: conversion of source code text into a Parsing Tree, then an Abstract Syntax Tree (AST).
Interpreter: runs the program by walking over the AST at runtime.
Compiler: converts the program to another language, such as machine instructions, and runs that.
Expressions: nested elements of a mathematical formula.
Literal: values in the source code text.
Symbol: named reference to a value.
Function Call: evaluation of a function (including binary operations).
Function Definition: creation of a new function (named or unnamed).
Statements: sequential elements defining a process.
Assign: creation or replacement of a named reference (possibly to a function).
Symbol Scope: the parts of a program in which symbols are bound to values (we only considered Lexical
Scope).
Evaluation Order: temporal order in which expressions and statements are evaluated.
Strict/Eager Evaluation: the usual order; arguments of a function call before the function call.
Lazy Evaluation: pass unevaluated code (e.g. function definition) to let the called function
decide when to evaluate.
Declarative: evaluation order is not visible to the programmer.
03-compiling-and-transpiling.ipynb
In the previous tutorial, we interpreted our little language, rather than compiling it.
The interpreter simply did in Python what was written in the new language—we filled the symbol table with
functions like:
builtins["+"] = lambda x, y: x + y
What we'll see in this tutorial is that a compiler isn't much more substantial than a translator, either.
Ultimately, it just replaces each + AST node with the machine instruction for + .
Programming languages do not perform actions. They only express a user's intention in another
layer of abstraction.
https://nbviewer.jupyter.org/github/jpivarski/2019-05-06-adl-language-tools/blob/master/03-compiling-and-transpiling.ipynb 1/9
transistor gates → machine code → C compiler → Python implementation (C code) → our interpreter →
our little language
Interpreters vs compilers:
An interpreter walks over the AST (or even parses the source code!) at runtime.
A compiler serializes the AST into a state machine or a sequence of instructions, virtual or physical.
A transpiler serializes the AST into code in another human-readable language. (Subjective: what's
human-readable?)
Compilation targets:
A finite state machine is a graph of executable steps only (not a full interpreter). Regular expressions
are often compiled to finite state machines. A non-recurrent neural network is also a finite state
machine.
A push-down machine is a state machine with a stack of memory—parsers are almost always push-
down machines.
A virtual machine is a complete processor+memory driven by a sequence of instructions, like a
physical computer, but implemented in software. Python and Java are not interpreters: they compile
their source code to virtual machines.
A Von Neumann machine is a physical computer driven by a sequence of instructions.
Other:
FPGA/ASIC: physical computer consisting of raw gates, not instructions; Verilog isn't compiled like C,
it's synthesized.
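Python itself illustrates the virtual-machine case above: the standard-library `dis` module lists the instructions that CPython compiles a function into.

```python
import dis

def f(x):
    return x * x

# Each entry is one instruction for CPython's virtual machine.
# (Exact opcode names vary between Python versions, so no fixed
# output is shown here.)
instructions = [ins.opname for ins in dis.get_instructions(f)]
print(instructions)
```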
linear (source code) → tree-like data structure (AST) → linear (instructions or other source code)
In this notebook, we'll write a transpiler, converting our little language into C++.
In [1]: # First of all, did you know that you can do this?
import ROOT
ROOT.gInterpreter.Declare("""
double new_function(double x, double y) {
return sqrt(x*x + y*y);
}""")
ROOT.new_function(3, 4)
Out[1]: 5.0
c_parser = pycparser.c_parser.CParser()
ast = c_parser.parse("double f(double x) { return x*x; }")
ast.show()
FileAST:
  FuncDef:
    Decl: f, [], [], []
      FuncDecl:
        ParamList:
          Decl: x, [], [], []
            TypeDecl: x, []
              IdentifierType: ['double']
        TypeDecl: f, []
          IdentifierType: ['double']
    Compound:
      Return:
        BinaryOp: *
          ID: x
          ID: x
print(c_generator.visit(ast))
double f(double x)
{
return x * x;
}
We can compile and run C++ code in ROOT (Cling) and we have a general C99 AST in a Python library
(pycparser). The compilation chain could look like this:
our source language → our AST → C99 AST → C++ source code → compile and run in ROOT
It's hard to compose source code strings properly and it's hard to debug them. (CoffeeScript famously
skipped this step (https://www.kickstarter.com/projects/michaelficarra/make-a-better-coffeescript-compiler).)
%import common.CNAME
%import common.NUMBER
%import common.WS
%ignore WS
"""
parser = lark.Lark(grammar)
In [5]: # the cute thing about bra-ket notation is that it turns type errors into syntax errors
print(parser.parse("<0 1| σ₁* σ₂ |0 1>").pretty())
start
  term
    factor
      bra
        value
          real 0
          real 1
      operators
        conjugate
          operator
            s1
        operators
          operator
            s2
      ket
        value
          real 0
          real 1
In [6]: # I said that it's unwise to turn our AST directly into strings of the output language.
        # That's because strings are hard to COMPOSE. Don't avoid using strings to create the pieces:
        def c_ast(c_code):
            return c_parser.parse("void f() {" + c_code + ";}").ext[0].body.block_items[0]
Out[6]: BinaryOp(op='*',
                 left=ID(name='x'),
                 right=ID(name='z'))
        else:
            return transpile(ast.children[0], names)   # pass-through for single-child rules
In [8]: names = []
ast = transpile(parser.parse("<0 1| σ₁* σ₂ |0 1>"), names)
ast.show()
# print(c_generator.visit(ast))
names
FuncCall:
  BinaryOp: *
    BinaryOp: *
      FuncCall:
        ID: bra
        ExprList:
          FuncCall:
            ID: C
            ExprList:
              Constant: double, 0
              Constant: double, 0
          FuncCall:
            ID: C
            ExprList:
              Constant: double, 1
              Constant: double, 0
      BinaryOp: *
        FuncCall:
          ID: conjugate
          ExprList:
            ID: s1
        ID: s2
    FuncCall:
      ID: ket
      ExprList:
        FuncCall:
          ID: C
          ExprList:
            Constant: double, 0
            Constant: double, 0
        FuncCall:
          ID: C
          ExprList:
            Constant: double, 1
            Constant: double, 0
  ExprList:
    Constant: int, 0
    Constant: int, 0
Out[8]: []
In [9]: # Define some helper functions in the output language instead of complicating the transpiler.
ROOT.gInterpreter.Declare("""
complex<double> C(double real, double imag) {
    return complex<double>(real, imag);
}
ROOT::Math::SMatrix<complex<double>, 1, 2> bra(complex<double> up, complex<double> down) {
    return ROOT::Math::SMatrix<complex<double>, 1, 2>((complex<double>[]){up, down}, 2);
}
ROOT::Math::SMatrix<complex<double>, 2, 1> ket(complex<double> up, complex<double> down) {
    return ROOT::Math::SMatrix<complex<double>, 2, 1>((complex<double>[]){up, down}, 2);
}
ROOT::Math::SMatrix<complex<double>, 2, 2> conjugate(ROOT::Math::SMatrix<complex<double>, 2, 2> S) {
    auto out = ROOT::Math::Transpose(S);
    out(0, 0) = conj(out(0, 0));
    out(0, 1) = conj(out(0, 1));
    out(1, 0) = conj(out(1, 0));
    out(1, 1) = conj(out(1, 1));
    return out;
}
ROOT::Math::SMatrix<complex<double>, 2, 2> matrix(complex<double> a, complex<double> b, complex<double> c, complex<double> d) {
    return ROOT::Math::SMatrix<complex<double>, 2, 2>((complex<double>[]){a, b, c, d}, 4);
}
auto s1 = matrix(C(0, 0), C(1, 0), C(1, 0), C( 0, 0));
auto s2 = matrix(C(0, 0), C(0, -1), C(0, 1), C( 0, 0));
auto s3 = matrix(C(1, 0), C(0, 0), C(0, 0), C(-1, 0));
""")
Out[9]: True
In [10]: # Much of the work is actually rearranging inputs and outputs (matching Python conventions to C++).
         def c_args(names):
             return ", ".join("double r1_{0}, double i1_{0}, double r2_{0}, double i2_{0}".format(x) for x in names)
         def python_ret(root_complex):
             return complex(root_complex.real(), root_complex.imag())
         print(c_args(["x", "y"]))
         # python_args(["x", "y"], print)(x=[0, 1j], y=[1, 0])

double r1_x, double i1_x, double r2_x, double i2_x, double r1_y, double i1_y, double r2_y, double i2_y
In [11]: # Again, don't shy away from using strings of the output language when COMPOSING isn't the problem.
def braket(code):
names = []
c_ast = transpile(parser.parse(code), names)
(1+0j)
(1+0j)
operator: spinor
| operator "*" -> conjugate
| operator "ᵀ" -> transpose | operator "T" -> transpose
and propagate it through the transpiler to the final output so that <0 1| σ₁ᵀ σ₂ |0 1> returns -1j .
Summary
Some of the work that an interpreter performs is unnecessarily repeated. Performing that part once to
create a lean executable is compilation.
All compilation simply translates code from one language to another, though machine instructions are
rather hard to read. Compilation to a human-readable language is called source-to-source compilation or
transpilation.
When transpiling, it's recommended to convert the input language's AST into the output language's AST,
rather than directly emitting strings in the output language because strings of code are hard to compose.
It's not a hard-edged rule to always avoid strings, but don't rely on composing strings.
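Python's standard-library `ast` module shows the AST-to-AST principle in miniature: compose output-language nodes directly, and let a code generator render the string only at the very end.

```python
import ast

# Compose the expression x * x as AST nodes, not by pasting strings together.
node = ast.BinOp(left=ast.Name(id="x", ctx=ast.Load()),
                 op=ast.Mult(),
                 right=ast.Name(id="x", ctx=ast.Load()))

# ast.unparse (Python 3.9+) is the code generator.
print(ast.unparse(node))   # x * x
```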
04-type-checking.ipynb
Generally, an untyped AST (such as the ones we've been dealing with) gets replaced by a
typed AST, in which each node is marked by a data type, such as double or
boolean . (It's also possible to mark an AST in-place with type labels, but if so, be sure
that node instances are unique!)
Type checking was traditionally motivated by the need to generate the right instructions in
the output language (e.g. __add_int32__ vs __add_float32__ on unlabeled 32-bit
registers), but it can be much more general than that:
type checking is a formal proof that the program satisfies certain properties.
The properties to prove are encoded in the type system, which can be specialized to a
domain like particle physics.
https://nbviewer.jupyter.org/github/jpivarski/2019-05-06-adl-language-tools/blob/master/04-type-checking.ipynb 1/13
Some terminology:
A type is a set of possible values that a symbol or expression can have at runtime.
Types may be
abstract if they're specified without reference to a bit-representation, like "all non-
negative integers less than 2**32 "
concrete if a bit-representation is given, like "two's complement 32-bit integers
without a sign bit."
A strongly typed language stops processing if it encounters values that do not match
function argument types: it either stops the compilation or the runtime execution.
A weakly typed language either passes bits without checking them or converts values
to fit expectations.
A dynamically typed language checks types at runtime. Types may be valid at one
time and invalid at another.
Most assembly languages treat all values as raw bits; the programmer has to keep track of
types and call the right instructions.
C is often used as a weakly typed language (e.g. passing everything as void*).
Statically typed languages (checked before the program runs): C++, Java, C#, Rust, Go, Swift, Fortran, Haskell, ML, Scala, Julia, mypy (Python
linter), LLVM's assembly language...
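A quick Python illustration of the dynamically typed case: nothing is checked until a value actually flows through the code.

```python
# Dynamic typing: the same function is checked per call, at runtime.
def double(x):
    return x + x

print(double(2))      # 4
print(double("ab"))   # abab — no compile-time complaint
try:
    double(None)      # fails only when actually executed
except TypeError as err:
    print("runtime TypeError:", err)
```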
%import common.CNAME
%import common.INT
%import common.FLOAT
%import common.WS
%ignore WS
"""
parser = lark.Lark(grammar)
start
  pass
    and
      not
        pass
          gt
            pass
              pass
                symbol x
            pass
              pass
                float 0.0
      pass
        pass
          add
            pass
              int 2
            pass
              pass
                int 2
class AST:
_fields = ()
def __init__(self, *args):
for n, x in zip(self._fields, args):
setattr(self, n, x)
class Symbol(AST):
_fields = ("symbol",)
def __str__(self): return self.symbol
class Call(AST):
    _fields = ("function", "arguments")
    def __str__(self):
        return "{0}({1})".format(str(self.function), ", ".join(str(x) for x in self.arguments))
In [4]: # Simplify the Parsing Tree (PT) into an Abstract Syntax Tree (AST)
        def toast(ptnode):
            if ptnode.data in ("start", "pass"):
                return toast(ptnode.children[0])
            elif ptnode.data == "int":
                return Literal(int(ptnode.children[0]), int)
            elif ptnode.data == "float":
                return Literal(float(ptnode.children[0]), float)
            elif ptnode.data == "symbol":
                return Symbol(str(ptnode.children[0]))
            else:
                return Call(str(ptnode.data), [toast(x) for x in ptnode.children])
In [5]: # The typed AST is just like the untyped AST except that each node also carries a type
class Typed:
    def __init__(self, thetype, *args):
        self.type = thetype
        super(Typed, self).__init__(*args)
    def __str__(self):
        return "{0} as {1}".format(super(Typed, self).__str__(), str(self.type))
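The Typed mixin composes with the untyped nodes through multiple inheritance: it consumes the first constructor argument and forwards the rest along the MRO. A self-contained sketch (re-declaring minimal AST and Symbol classes so it runs on its own):

```python
class AST:
    _fields = ()
    def __init__(self, *args):
        for n, x in zip(self._fields, args):
            setattr(self, n, x)

class Symbol(AST):
    _fields = ("symbol",)
    def __str__(self): return self.symbol

class Typed:
    # mixin: takes the type, forwards remaining args to the untyped node
    def __init__(self, thetype, *args):
        self.type = thetype
        super(Typed, self).__init__(*args)
    def __str__(self):
        return "{0} as {1}".format(super(Typed, self).__str__(), str(self.type))

class TypedSymbol(Typed, Symbol):
    pass    # MRO: Typed runs first, then Symbol/AST handle the rest

print(TypedSymbol(float, "x"))    # x as <class 'float'>
```

Listing Typed first in the bases is what makes `super(Typed, self)` resolve to the untyped node's methods.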
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-27844ff319a2> in <module>
      3 code = "not x > 0.0 and 2 + 2"
      4 print(toast(parser.parse(code)))
----> 5 print(totyped(toast(parser.parse(code)), symbols={"x": float}))
    else:
        arguments = [totyped(x, signatures, symbols) for x in ast.arguments]
        types = [x.type for x in arguments]
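The totyped function itself is mostly lost in this capture; its idea is a recursive walk that assigns each node a type and rejects calls with no matching signature. A simplified, hypothetical sketch (it returns bare types instead of building a typed AST, and the signatures table here is invented for illustration):

```python
class AST:
    _fields = ()
    def __init__(self, *args):
        for n, x in zip(self._fields, args):
            setattr(self, n, x)

class Literal(AST):  _fields = ("value", "type")
class Symbol(AST):   _fields = ("symbol",)
class Call(AST):     _fields = ("function", "arguments")

# which argument types each function accepts, and what it returns
signatures = {"add": {(int, int): int, (float, float): float},
              "gt":  {(float, float): bool},
              "and": {(bool, bool): bool},
              "not": {(bool,): bool}}

def totyped(ast, signatures, symbols):
    if isinstance(ast, Literal):
        return ast.type
    elif isinstance(ast, Symbol):
        return symbols[ast.symbol]      # free variables typed by the caller
    else:
        types = tuple(totyped(x, signatures, symbols) for x in ast.arguments)
        if types not in signatures[ast.function]:
            raise TypeError("no signature {0}{1}".format(ast.function, types))
        return signatures[ast.function][types]

print(totyped(Call("gt", [Symbol("x"), Literal(0.0, float)]),
              signatures, symbols={"x": float}))    # <class 'bool'>
```

Feeding it `and(bool, int)`, as in the failing cell above where `2 + 2` is combined with a boolean, raises the same kind of TypeError.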
In [9]: # Short exercise: add and test truediv. How does its signature differ from the others?
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-4434fdbc0d78> in <module>
     15 code = "not x > 0.0 and 3 / 2"
     16 print(toast(parser.parse(code)))
---> 17 print(totyped(toast(parser.parse(code)), signatures, symbols))
<ipython-input-8-0b04e1e1e8f5> in <listcomp>(.0)
      9
     10 else:
---> 11     arguments = [totyped(x, signatures, symbols) for x in ast.arguments]
     12     types = [x.type for x in arguments]
     13
Parameterized types
To support more data structures, we can consider "functions of types." Like functions in a
programming language, they allow us to build what we need from simpler primitives.
Examples include:
C++ templates: think of the < > brackets as ( ) around a function's arguments.
Arrays, structs, and unions in C, which don't have a function-like syntax but compose
like functions.
tuple<T1, T2, T3> values are points in T1 and T2 and T3 (product type).
variant<T1, T2, T3> values are each in T1 or T2 or T3 (sum type).
Type specifications are described in some language: in C++, they're in the same source file
but template arguments have a different syntax than runtime functions. Scala uses square
brackets for type functions and parentheses for runtime functions. Dynamically typed
languages like Lisp and Python manipulate types at runtime using ordinary functions.
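In Python, a "function of types" really is just an ordinary function: it takes parameters and returns a newly constructed type. A hypothetical sketch of a Matrix type constructor like the `Matrix(1, 5)` and `Matrix(5, 4)` that appear in the matrix-expression example:

```python
def Matrix(rows, cols):
    # returns a NEW type, parameterized by its dimensions
    return type("Matrix({0}, {1})".format(rows, cols), (), {"rows": rows, "cols": cols})

A = Matrix(5, 4)
print(A.__name__)        # Matrix(5, 4)
print(A.rows, A.cols)    # 5 4

# a type-checker for matrix multiplication only needs the type parameters:
def check_matmul(left, right):
    if left.cols != right.rows:
        raise TypeError("{0} x {1}: inner dimensions do not agree".format(
            left.__name__, right.__name__))
    return Matrix(left.rows, right.cols)

print(check_matmul(Matrix(1, 5), Matrix(5, 4)).__name__)    # Matrix(1, 4)
```

This is the runtime-manipulation style mentioned above: the "type function" and the "runtime function" use exactly the same syntax.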
%import common.CNAME
%import common.WS
%ignore WS
"""
parser = lark.Lark(grammar)
print(toast(parser.parse("x + A y")))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-052d33f5e167> in <module>
      2 print(toast(parser.parse(code)))
      3 print(totyped(toast(parser.parse(code)),
----> 4               symbols={"x": Matrix(1, 5), "A": Matrix(5, 4)}))
Traditionally (in FORTRAN and C, for instance), the reason for the type-check is to ensure
that the right machine instructions are applied to each value. Recently (particularly in
Haskell and Scala), the focus has been to ensure that the programmer is not making other
kinds of mistakes.
Refinement types
One of these new ideas is to specify the set of values that a type can represent with a
predicate. For instance, instead of just int and unsigned int , how about {x:ℤ |
x > 0} (positive integers) or {x:ℤ | x % 2 == 0} (even integers)?
Unlike other non-mainstream type ideas, this could be particularly relevant for us because
we deal in measurable quantities with intervals of validity. It could help to know at
compile-time whether ϕ ∈ (‒π, π] or [0, 2π), or to forbid 0/0 and ∞/∞ at compile-time,
rather than silently returning nan .
In this next example, we'll implement a compile-time 0/0 and ∞/∞ check.
grammar = """
start: term
term: factor -> pass | term "+" factor -> add | term "-" factor -> sub
factor: atom -> pass | atom "*" factor -> mul | atom "/" factor -> truediv
atom: "(" term ")" | CNAME -> symbol
%import common.CNAME
%import common.WS
%ignore WS
"""
parser = lark.Lark(grammar)
print(toast(parser.parse("(x + y) / z")))
truediv(add(x, y), z)
def __str__(self):
    return "{0}{1}, {2}{3}".format("[" if self.includeslow else "(", self.low,
                                   self.high, "]" if self.includeshigh else ")")
[0, inf)
[13, 28]
truediv(add(x, y), z)
truediv(add(x as [3, 8], y as [10, 20]) as [13, 28], z as [0, 100])
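The interval propagation shown above ([3, 8] + [10, 20] → [13, 28], plus the division guard) can be sketched with closed intervals only (a simplification of the notebook's Interval, which also tracks open vs closed endpoints):

```python
class Interval:
    def __init__(self, low, high):
        self.low, self.high = low, high
    def __str__(self):
        return "[{0}, {1}]".format(self.low, self.high)

def add(a, b):
    # addition is monotonic, so endpoints combine directly
    return Interval(a.low + b.low, a.high + b.high)

def truediv(a, b):
    if b.low <= 0 <= b.high:
        # the compile-time 0/0 and inf/inf guard: reject any denominator
        # interval that contains zero
        raise TypeError("denominator {0} contains zero".format(b))
    corners = [a.low / b.low, a.low / b.high, a.high / b.low, a.high / b.high]
    return Interval(min(corners), max(corners))

x, y = Interval(3, 8), Interval(10, 20)
print(add(x, y))                              # [13, 28]
print(truediv(add(x, y), Interval(1, 100)))   # [0.13, 28.0]
# truediv(add(x, y), Interval(0, 100)) raises TypeError, as in the cell above
```

The check happens while walking the typed AST, before any value is computed: division by an interval straddling zero never reaches runtime.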
Add literal numbers, a few comparisons (e.g. < and > ), and if/then/else as in
the parsing tutorial.
Add a boolean type in addition to Interval . (Should it restrict to only one variable?
Is Boolean(only=True) useful for anything?)
Let the predicate of if/then/else refine the type that is passed into the
then/else clauses. For instance, if the type of x is [3, 8] and it passes a test like
x > 5, the then clause may assume x is (5, 8].
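That last exercise might start from a helper like this (hypothetical; plain (low, high) tuples stand in for the Interval class, and open vs closed endpoints are ignored):

```python
def refine_gt(interval, cutoff):
    # what the `then` branch may assume about x after `x > cutoff` succeeds
    low, high = interval
    if high <= cutoff:
        return None                     # the branch can never be taken
    return (max(low, cutoff), high)     # left endpoint would be open in a full version

print(refine_gt((3, 8), 5))     # (5, 8): inside the then-clause, x exceeds 5
print(refine_gt((3, 8), 10))    # None: the comparison can never succeed
```

A None result is itself useful information: the type-checker has proved the branch dead.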
A reading list:
Curry-Howard correspondence
(https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence): programs
are equivalent to proofs, and type-checkers are automated theorem provers. (Really!)
Total functional programming
(https://en.wikipedia.org/wiki/Total_functional_programming): if we don't need a Turing
complete (https://en.wikipedia.org/wiki/Turing_completeness) language, total functional
languages are guaranteed to never raise exceptions at runtime: if it compiles, it will
run. Recursion is hard in these languages (leading to a long discussion of primitive
recursion and data vs codata), but we don't care about that: non-recursive math would
be enough for a lot of particle physics applications.
Structural type system (https://en.wikipedia.org/wiki/Structural_type_system): instead
of checking type equivalence by name or inheritance, only check to see if it has the
components the function needs. This is the duck typing
(https://en.wikipedia.org/wiki/Duck_typing) of type-safe languages.
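Python itself has a structural-typing feature along these lines: typing.Protocol (Python 3.8+). A class matches a protocol if it has the right members, regardless of inheritance (an illustration, not from the notebook):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HasMass(Protocol):
    def mass(self) -> float: ...

class Muon:                  # never mentions HasMass
    def mass(self):
        return 0.10566       # GeV

class Photon:
    pass

print(isinstance(Muon(), HasMass))      # True: it has a mass() method
print(isinstance(Photon(), HasMass))    # False: structurally missing
```

Static checkers like mypy apply the same structural test at compile time, without the runtime_checkable decorator.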
2019-05-06-adl-language-tools (/github/jpivarski/2019-05-06-adl-language-tools/tree/master)
/
05-embedded-dsl.ipynb (/github/jpivarski/2019-05-06-adl-language-tools/tree/master/05-embedded-dsl.ipynb)
Libraries are more on the "embedded DSL" end of the spectrum if they consist of shallow
functions that only do interesting things when combined, like the constructs of a
programming language.
In Scala, a class's methods can be accessed through a dot ( . , like most languages) or a
space. Methods taking a single argument need not have parentheses, and methods can be
named with Unicode characters.
Scala also lets functions decide whether they want to be eagerly or lazily evaluated, and
has Lisp-like macros. This leads to some extreme DSLs, like ScalaTest
(http://www.scalatest.org/):
Python's syntax is not nearly as flexible as Scala's, nor does it give control over eager/lazy
evaluation like Scala and C# (unless you ask the user to always pass functions as
arguments!), so options for embedded DSLs in Python are limited.
In all the demos I've shown so far, we've defined new grammars.
That's fine for small snippets of the new language, like regular expressions or SQL queries,
but for longer programs, users are limited by an editor that doesn't understand the
language: no syntax highlighting, auto-indentation, tab-completion, integrated
documentation...
In [1]: # Best of both? Code in Python FUNCTIONS is syntax-checked but otherwise uninterpreted.
# The source code has been compiled for Python's virtual machine, which is another language,
# but that language can be parsed (though not by humans).
def function(x, y):
    return x**2 + y**2
print(function.__code__.co_code)
b'|\x00d\x01\x13\x00|\x01d\x01\x13\x00\x17\x00S\x00'
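Those raw bytes can be decoded with the standard library's dis module (not part of the notebook; opcode numbers and even some instruction names vary across Python versions, so no exact output is shown):

```python
import dis

def function(x, y):
    return x**2 + y**2

# list the instruction stream behind co_code in human-readable form
for instruction in dis.get_instructions(function):
    print(instruction.opname, instruction.argrepr)
```

On the 2019-era Python used in the notebook, this stream is LOAD_FAST / LOAD_CONST / BINARY_POWER pairs followed by BINARY_ADD and RETURN_VALUE, matching the parse below.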
import sys
import uncompyle6
def parse(function):
    code = function.__code__
    scanner = uncompyle6.scanner.get_scanner(float(sys.version[0:3]))
    parser = uncompyle6.parser.get_python_parser(float(sys.version[0:3]))
    tokens, customize = scanner.ingest(code)
    return uncompyle6.parser.parse(parser, tokens, customize)
Out[2]: stmts
          sstmt
            stmt
              return (2)
                0. ret_expr
                  expr
                    binary_expr (3)
                      0. expr
                        binary_expr (3)
                          0. expr
                            L. 10     0  LOAD_FAST
                          1. expr
                                      2  LOAD_CONST
                          2. binary_op
                                      4  BINARY_POWER
                      1. expr
                        binary_expr (3)
                          0. expr
                                      6  LOAD_FAST
                          1. expr
                                      8  LOAD_CONST
                          2. binary_op
                                     10  BINARY_POWER
                      2. binary_op
                                     12  BINARY_ADD
                1.   14  RETURN_VALUE
In [3]: # Make our own AST, not that parsing tree or Python's own AST.
class AST:
_fields = ()
def __init__(self, *args):
for n, x in zip(self._fields, args):
setattr(self, n, x)
class Literal(AST):
_fields = ("value",)
def __str__(self): return str(self.value)
class Symbol(AST):
_fields = ("symbol",)
def __str__(self): return self.symbol
class Call(AST):
_fields = ("function", "arguments")
    def __str__(self):
        return "{0}({1})".format(str(self.function), ", ".join(str(x) for x in self.arguments))
def toast(ptnode):
if ptnode.kind == "binary_expr":
        return Call(toast(ptnode[2]), [toast(x) for x in ptnode[:2]])
elif ptnode.kind == "BINARY_ADD":
return Symbol("add")
elif ptnode.kind == "BINARY_POWER":
return Symbol("pow")
elif ptnode.kind == "LOAD_FAST":
return Symbol(ptnode.pattr)
elif ptnode.kind == "LOAD_CONST":
return Literal(ptnode.pattr)
else:
# a lot of nodes are purely structural
return toast(ptnode[0])
@quote
def sum_quadrature(x, y):
return x**2 + y**2
print(sum_quadrature)
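The same PT-to-AST conversion can be run against Python's own ast module instead of uncompyle6's bytecode grammar (a sketch for comparison only; the notebook deliberately builds its own AST rather than reusing Python's):

```python
import ast

def toast(node):
    # convert Python's AST for an expression into the add(...)/pow(...) form above
    if isinstance(node, ast.BinOp):
        name = {ast.Add: "add", ast.Pow: "pow"}[type(node.op)]
        return "{0}({1}, {2})".format(name, toast(node.left), toast(node.right))
    elif isinstance(node, ast.Name):
        return node.id
    elif isinstance(node, ast.Constant):
        return str(node.value)

tree = ast.parse("x**2 + y**2", mode="eval").body
print(toast(tree))    # add(pow(x, 2), pow(y, 2))
```

Working from source text avoids the bytecode round-trip, at the cost of needing the source (which __code__ always has, but a lambda typed at the REPL may not).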
With the ability to turn any Python function into an AST, we can add
New syntax vs Python syntax is a separate question from the above features. It's also
possible to have two syntaxes for the same language (different parsing steps produce the
same ASTs).
Using Python as a syntax limits us to Python syntax rules and Pythonic expectations (users
would be upset if we changed the meaning of the operators), but it provides a better editing
experience.