Citation: Michaud, E.J.; Liao, I.; Lad, V.; Liu, Z.; Mudide, A.; Loughridge, C.; Guo, Z.C.; Kheirkhah, T.R.; Vukelić, M.; Tegmark, M. Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code. Entropy 2024, 26, 1046. http://dx.doi.org/10.3390/e26121046
Abstract: Can we turn AI black boxes into code? Although this mission sounds extremely challenging,
we show that it is not entirely impossible by presenting a proof-of-concept method, MIPS, that can
synthesize programs based on the automated mechanistic interpretability of neural networks trained
to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a
benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary
to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30).
MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean
or integer symbolic regression to capture the learned algorithm. As opposed to large language models,
this program synthesis technique makes no use of (and is therefore not limited by) human training
data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling
up this approach to make machine-learned models more interpretable and trustworthy.
Here, AutoMI means that black box models can be turned into programs automatically, without
human inspection. Specifically, we present a proof-of-concept method, MIPS (mechanistic-
interpretability-based program synthesis), which can distill simple learned algorithms from
neural networks into Python code, for small-scale algorithmic tasks. The main goal of this
paper is not to present a method that fully solves AutoMI, but to demonstrate progress
toward AutoMI with a simple proof of concept. The rest of this paper is organized as
follows. After reviewing prior work in Section 2, we present our method in Section 3, test it
on a benchmark in Section 4 and summarize our conclusions in Section 5.
2. Related Work
Program synthesis is a venerable field dating back to Alonzo Church in 1957; Zhou
and Ding [4] and Odena et al. [5] provide recent reviews of the field. Large language
models (LLMs) have become increasingly good at writing code based on verbal problem
descriptions or auto-complete. We instead study the common alternative problem setting
known as “programming by example” (PBE), where the desired program is specified by
giving examples of input–output pairs [6]. The aforementioned papers review a wide
variety of program synthesis methods, many of which involve some form of search over
a space of possible programs. LLMs that synthesize code directly have recently become
quite competitive with such search-based approaches [7]. Our work provides an alternative
search-free approach where the program learning happens during neural network training
rather than execution.
Our work builds on the recent progress in mechanistic interpretability (MI) of neural
networks [8–11]. Much MI work has tried to understand how neural networks represent
various types of information, e.g., geographic information [12–14], truth [15,16] and the
state of board games [17–19]. Another major MI thrust has been to understand how neural
networks perform algorithmic tasks, e.g., modular arithmetic [20–23] and other group
operations [24], greater-than comparison [25], and the greatest common divisor [26]. The key step of
mechanistic interpretability is to discover structure in hidden representations.
In this paper, we manage to discover bit representations, integer representations, and
clusters (finite state machines).
Whereas Lindner et al. [27] automatically convert traditional code into a neural net-
work, we aim to do the opposite, as was also recently demonstrated in Friedman et al. [28].
One direction in automating mechanistic interpretability uses LLMs to label internal
units of neural networks such as neurons [29] and features discovered by sparse autoen-
coders [30,31]. Another recent effort in automating MI involves automatically identifying
which internal units causally influence each other and the network output for a given set
of inputs [32–34]. However, these methods do not automatically generate pseudocode
or give a description of how the states of downstream units are computed from upstream
units, which we aim to do in this work.
In this work, we focus on automating mechanistic interpretability for recurrent neural
networks (RNNs), building on the rich existing literature on interpreting RNN inter-
nals [35,36] and on extracting finite state machines from trained RNNs [36–41]. In our work,
we seek exceptionally simple descriptions of RNNs by factoring network hidden states into
discrete variables and representing state transitions with symbolic formulae.
Figure 1. The pipeline of our program synthesis method. MIPS relies on discovering integer
representations and bit representations of hidden states, which enable regression methods to figure
out the exact symbolic relations between input, hidden, and output states.
Step 1 is to train a black box neural network to learn an algorithm that performs the
desired task. In this paper, we use a recurrent neural network (RNN) of the general form
$$h_i = f(h_{i-1}, x_i), \qquad (1)$$
$$y_i = g(h_i), \qquad (2)$$
that maps input vectors x_i into output vectors y_i via hidden states h_i. The RNN is defined
by the two functions f and g, which are implemented as feed-forward neural networks
(MLPs) to allow more model expressivity than a vanilla RNN. The techniques described
below can also be applied to more general neural network architectures.
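For concreteness, here is a minimal PyTorch sketch of this architecture (our own illustration; the layer widths and names such as MLPRNN and hidden_dim are assumptions, not the paper's exact configuration):

import torch
import torch.nn as nn

class MLPRNN(nn.Module):
    """RNN of the form h_i = f(h_{i-1}, x_i), y_i = g(h_i), with f and g implemented as MLPs."""
    def __init__(self, input_dim=1, hidden_dim=8, output_dim=1, width=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(hidden_dim + input_dim, width),
                               nn.ReLU(), nn.Linear(width, hidden_dim))
        self.g = nn.Sequential(nn.Linear(hidden_dim, width),
                               nn.ReLU(), nn.Linear(width, output_dim))
        self.hidden_dim = hidden_dim

    def forward(self, x):                                  # x: (batch, seq_len, input_dim)
        h = torch.zeros(x.shape[0], self.hidden_dim)
        ys = []
        for i in range(x.shape[1]):
            h = self.f(torch.cat([h, x[:, i]], dim=-1))    # h_i = f(h_{i-1}, x_i)
            ys.append(self.g(h))                           # y_i = g(h_i)
        return torch.stack(ys, dim=1)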
Step 2 attempts to automatically simplify the learned neural network without reducing
its accuracy. Steps 3 and 4 automatically distill this simplified learned algorithm into Python
code. When the training data are discrete (consisting of, say, text tokens, integers, or pixel
colors), the neural network will be a finite state machine: the activation vectors for each of
its neuron layers define finite sets and the entire working of the network can be defined
by look-up tables specifying the update rules for each layer. For our RNN, this means
that the space of hidden states h is discrete, so the functions f and g can be defined by
lookup tables. As we will see below, the number of hidden states that MIPS needs to
keep track of can often be greatly reduced by clustering them, corresponding to learned
representations. After this, the geometry of the cluster centers in the hidden space often
reveals that they form either an incomplete multidimensional lattice whose points represent
integer tuples, or a set whose cardinality is a power of two, whose points represent Boolean
tuples. In both of these cases, the aforementioned lookup tables simply specify integer or
Boolean functions, which MIPS attempts to discover via symbolic regression. Below, we
present an integer autoencoder and a Boolean autoencoder to discover such integer/Boolean
representations from arbitrary point sets.
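As a hedged illustration of this lookup-table view (our own sketch, not the MIPS implementation; the helper names and the discretize callback are hypothetical), one can enumerate the discrete states actually visited and tabulate the learned transition function:

from collections import defaultdict

def extract_transition_table(hidden_states_fn, sequences, discretize):
    # hidden_states_fn(seq) -> [h_0, h_1, ..., h_T]: hidden states before/after each token
    # discretize(h)         -> hashable cluster label for hidden state h
    # Returns a dict mapping (state_label, input_token) -> next state_label.
    table = defaultdict(set)
    for seq in sequences:
        hs = hidden_states_fn(seq)
        for i, token in enumerate(seq):
            table[(discretize(hs[i]), token)].add(discretize(hs[i + 1]))
    # If the network really implements a finite state machine, every entry is a single state.
    return {key: states.pop() for key, states in table.items() if len(states) == 1}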
We will now describe each of these steps of MIPS in greater detail.
3.2. Auto-Simplification
After finding a minimal neural network architecture that can solve a task, the resulting
neural network weights typically seem random and un-interpretable. This is because there
exist symmetry transformations of the weights that leave the overall input–output behavior
of the neural network unchanged. The random initialization of the network has therefore
caused random symmetry transformations to be applied to the weights. In other words,
the learned network belongs to an equivalence class of neural networks with identical
behavior and performance, corresponding to a submanifold of the parameter space. We
exploit these symmetry transformations to simplify the neural network into a normal form,
which in a sense is the simplest member of its equivalence class. Conversion of objects into
a normal/standard form is a common concept in mathematics and physics (for example,
conjunctive normal form, wavefunction normalization, reduced row echelon form, and
gauge fixing).
Two of our simplification strategies below exploit a symmetry of the RNN hidden state space h. We can always write the MLP g in the form g(h) = G(Uh + c) for some function G. So if f is affine, i.e., of the form f(h, x) = Wh + Vx + b, then the symmetry transformation W' ≡ AWA^{-1}, V' ≡ AV, U' ≡ UA^{-1}, h' ≡ Ah, b' ≡ Ab keeps the RNN in the same form:
$$h'_i = A h_i = A W A^{-1} A h_{i-1} + A V x_i + A b = W' h'_{i-1} + V' x_i + b', \qquad (3)$$
$$y_i = G(U h_i + c) = G(U A^{-1} h'_i + c) = G(U' h'_i + c). \qquad (4)$$
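A minimal numerical check of this symmetry (our own illustration with randomly drawn weights, not code from the paper):

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 2                                    # hidden and input dimensions (arbitrary)
W, V, b = rng.normal(size=(n, n)), rng.normal(size=(n, d)), rng.normal(size=n)
U, A = rng.normal(size=(1, n)), rng.normal(size=(n, n))   # A: any invertible change of basis

h, x = rng.normal(size=n), rng.normal(size=d)
h_next = W @ h + V @ x + b                     # original affine update
W2, V2, b2, U2 = A @ W @ np.linalg.inv(A), A @ V, A @ b, U @ np.linalg.inv(A)
h2_next = W2 @ (A @ h) + V2 @ x + b2           # transformed update acting on h' = A h

print(np.allclose(A @ h_next, h2_next))        # True: same dynamics in the new basis
print(np.allclose(U @ h_next, U2 @ h2_next))   # True: identical readout U h + c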
Since the tasks in our benchmark involve bits and integers, which are already discrete,
the only non-discrete parts in a recurrent neural network are its hidden representations.
Here, we show two cases when hidden states can be discretized: they are (1) a bit represen-
tation or (2) a (typically incomplete) integer lattice. Generalizing to the mixed case of bits
and integers is straightforward. Figure 2 shows all hidden state activation vectors hi for
all steps with all training examples for two of our tasks. The left panel shows that the 10^4
points h_i form 2^2 = 4 tight clusters, which we interpret as representing two bits. The right
panel reveals that the points hi form an incomplete 2D lattice that we interpret as secretly
representing a pair of integers.
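A hedged sketch of how the bit-representation case can be detected automatically (our own illustration; using scikit-learn's KMeans and the tolerance below are our choices, not necessarily what MIPS does):

import numpy as np
from sklearn.cluster import KMeans

def try_bit_representation(hidden_states, max_bits=4, tol=1e-2):
    # Check whether the hidden states fall into 2^b tight clusters for some small b.
    # Returns (b, cluster labels) on success, or None if no clustering is tight enough.
    H = np.asarray(hidden_states)
    for b in range(1, max_bits + 1):
        km = KMeans(n_clusters=2 ** b, n_init=10).fit(H)
        spread = km.inertia_ / len(H)        # mean squared distance to assigned center
        if spread < tol:
            return b, km.labels_
    return None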
For the special case where the MLP defining the function f is affine or can be accurately
approximated as affine, we use a simpler method we term the Linear lattice finder, also
described in Appendix B. Here, the idea is to exploit the fact that the lattice is simply an
affine transformation of a regular integer lattice (the input data), so we can simply "read off" the desired lattice basis vectors from this affine transformation.
def f(s, t):
    a = 0; b = 0;
    ys = []
    for i in range(10):
        c = s[i]; d = t[i];
        next_a = b ^ c ^ d
        next_b = b + c + d > 1
        a = next_a; b = next_b;
        y = a
        ys.append(y)
    return ys
Figure 3. The generated program for the addition of two binary numbers represented as bit sequences.
Note that MIPS rediscovers the “ripple adder”, where the variable b above is the carry bit.
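As a usage example (ours, not from the paper), feeding the program the bits of 3 and 5 in least-significant-bit-first order returns the bits of 8:

three = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 3 in LSB-first binary
five  = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 5 in LSB-first binary
print(f(three, five))                    # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], i.e., 8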
def f(s):
    a = 198; b = -11; c = -3; d = 483; e = 0;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + c + 190
        next_b = b - c - d - e + x + 480
        next_c = b - e + 8
        next_d = -b + e - x + 472
        next_e = a + b - e - 187
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e;
        y = -d + 483
        ys.append(y)
    return ys

def f(s):
    a = 0; b = 0; c = 0; d = 0; e = 0;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = x
        next_b = a
        next_c = b
        next_d = c
        next_e = d
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e;
        y = a + b + c + d + e
        ys.append(y)
    return ys
Figure 4. Comparison of code generated from an RNN trained on Sum_Last5, without (top) and with
(bottom) normalizers.
4. Results
We will now test the program synthesis abilities of our MIPS algorithm on a benchmark
of algorithmic tasks specified by numerical examples. For comparison, we try the same
benchmark on GPT-4 Turbo, which (as of January 2024) OpenAI describes as its latest-generation model, with a 128k context window and greater capability than the original GPT-4.
4.1. Benchmark
Our benchmark consists of the 62 algorithmic tasks listed in Table 1. They each map
one or two integer lists of length 10 or 20 into a new integer list. We refer to integers
whose range is limited to {0, 1} as bits. We generated this task list manually, attempting to
produce a collection of diverse tasks that would in principle be solvable by an RNN. We
also focused on tasks whose known algorithms involved majority, minimum, maximum,
and absolute value functions because we believed they would be more easily learnable than
other nonlinear algorithms due to our choice of the ReLU activation for our RNNs. The
benchmark training data and project code are available at https://github.com/ejmichaud/
neural-verification (accessed on 26 November 2024). The tasks are described in Table 1,
with additional details in Appendix E. The benchmark aims to cover a diverse range of
algorithmic tasks. To balance between different families, when a group of tasks is similar
(e.g., summing up the last k bits), our convention is to keep (at most) six of them.
Since the focus of our paper is not on whether RNNs can learn algorithms, but on
whether learned algorithms can be auto-extracted into Python, we discarded from our
benchmark any generated tasks on which our RNN training failed to achieve 100% accuracy.
Our benchmark can never show that MIPS outperforms any large language model
(LLM). Because LLMs are typically trained on GitHub, many LLMs can produce Python
code for complicated programming tasks that fall outside of the class we study. Instead, the
question that our MIPS-LLM comparison addresses is whether MIPS complements LLMs
by being able to solve some tasks where an LLM fails.
Table 1. Benchmark results. For tasks with the note "see text", please refer to Appendix E. The last column lists the MIPS module responsible for each task: BR = Boolean regression, LR = linear regression, SR = symbolic regression; NA means MIPS is not expected to solve this problem. In the "Solved by" columns, 1 indicates success and 0 indicates failure.
Columns: Task # | # Input Strings | Element Type | Task Description | Task Name | Solved by GPT-4? | Solved by MIPS? | MIPS Module
1 2 bit Binary addition of two bit strings Binary_Addition 0 1 BR
2 2 int Ternary addition of two digit strings Base_3_Addition 0 0 SR
3 2 int Base 4 addition of two digit strings Base_4_Addition 0 0 SR
4 2 int Base 5 addition of two digit strings Base_5_Addition 0 0 SR
5 2 int Base 6 addition of two digit strings Base_6_Addition 1 0 SR
6 2 int Base 7 addition of two digit strings Base_7_Addition 0 0 SR
7 2 bit Bitwise XOR Bitwise_Xor 1 1 BR
8 2 bit Bitwise OR Bitwise_Or 1 1 BR
9 2 bit Bitwise AND Bitwise_And 1 1 BR
10 1 bit Bitwise NOT Bitwise_Not 1 1 BR
11 1 bit Parity of last 2 bits Parity_Last2 1 1 BR
12 1 bit Parity of last 3 bits Parity_Last3 0 1 BR
13 1 bit Parity of last 4 bits Parity_Last4 0 0 BR
14 1 bit Parity of all bits seen so far Parity_All 0 1 BR
15 1 bit Parity of number of zeros seen so far Parity_Zeros 0 1 BR
16 1 int Cumulative number of even numbers Evens_Counter 0 0 BR
17 1 int Cumulative sum Sum_All 1 1 LR
18 1 int Sum of last 2 numbers Sum_Last2 0 1 LR
19 1 int Sum of last 3 numbers Sum_Last3 0 1 LR
20 1 int Sum of last 4 numbers Sum_Last4 1 1 LR
21 1 int Sum of last 5 numbers Sum_Last5 1 1 LR
22 1 int Sum of last 6 numbers Sum_Last6 1 1 LR
23 1 int Sum of last 7 numbers Sum_Last7 1 1 LR
24 1 int Current number Current_Number 1 1 LR
25 1 int Number 1 step back Prev1 1 1 LR
26 1 int Number 2 steps back Prev2 1 1 LR
27 1 int Number 3 steps back Prev3 1 1 LR
28 1 int Number 4 steps back Prev4 1 1 LR
29 1 int Number 5 steps back Prev5 1 1 LR
30 1 int 1 if last two numbers are equal Previous_Equals_Current 0 1 SR
31 1 int current − previous Diff_Last2 0 1 SR
32 1 int |current − previous| Abs_Diff 0 1 SR
33 1 int |current| Abs_Current 1 1 SR
34 1 int |current| − |previous| Diff_Abs_Values 1 0 SR
35 1 int Minimum of numbers seen so far Min_Seen 1 0 SR
36 1 int Maximum of integers seen so far Max_Seen 1 0 SR
37 1 int Integer in 0-1 with highest frequency Majority_0_1 1 0 SR
38 1 int Integer in 0-2 with highest frequency Majority_0_2 0 0 SR
39 1 int Integer in 0-3 with highest frequency Majority_0_3 0 0 SR
40 1 int 1 if even, otherwise 0 Evens_Detector 1 0 SR
41 1 int 1 if perfect square, otherwise 0 Perfect_Square_Detector 0 0 SR
42 1 bit 1 if bit string seen so far is a palindrome Bit_Palindrome 1 0 NA
43 1 bit 1 if parentheses balanced so far, else 0 Balanced_Parenthesis 0 0 SR
44 1 bit Number of bits seen so far mod 2 Parity_Bits_Mod2 1 0 BR
45 1 bit 1 if last 3 bits alternate Alternating_Last3 0 0 BR
46 1 bit 1 if last 4 bits alternate Alternating_Last4 1 0 BR
47 1 bit Bit shift to right (same as Prev1) Bit_Shift_Right 1 1 LR
48 2 bit Cumulative dot product of bits mod 2 Bit_Dot_Prod_Mod2 0 1 BR
49 1 bit Binary division by 3 (see text) Div_3 1 0 SR
50 1 bit Binary division by 5 (see text) Div_5 0 0 SR
51 1 bit Binary division by 7 (see text) Div_7 0 0 SR
52 1 int Cumulative addition modulo 3 Add_Mod_3 1 1 SR
53 1 int Cumulative addition modulo 4 Add_Mod_4 0 0 SR
54 1 int Cumulative addition modulo 5 Add_Mod_5 0 0 SR
55 1 int Cumulative addition modulo 6 Add_Mod_6 0 0 SR
56 1 int Cumulative addition modulo 7 Add_Mod_7 0 0 SR
57 1 int Cumulative addition modulo 8 Add_Mod_8 0 0 SR
58 1 int 1D dithering, 4-bit to 1-bit (see text) Dithering 1 0 NA
59 1 int Newton's law, free body (integer input) Newton_Freebody 0 1 LR
60 1 int Newton’s law of gravity (see text) Newton_Gravity 0 1 LR
61 1 int Newton’s law w. spring (see text) Newton_Spring 0 1 LR
62 2 int Newton’s law w. magnetic field (see text) Newton_Magnetic 0 0 LR
Total solved: 30 (GPT-4), 32 (MIPS)
4.2. Evaluation
For both our method and GPT-4 Turbo, a task is considered solved if and only if
a Python program is produced that solves the task with 100% accuracy. GPT-4 Turbo
is prompted using the “chain-of-thought” approach described below and illustrated
in Figure 5.
Figure 5. We compare MIPS against program synthesis with the large language model GPT-4 Turbo,
prompted with a “chain-of-thought” approach. It begins with the user providing a task, followed by
the model’s response, and culminates in assessing the success or failure of the generated Python code
based on its accuracy in processing the provided lists.
For a given task, the LLM receives two lists of length 10 sourced from the respective
RNN training set. The model is instructed to generate a formula that transforms the
elements of list “x” (features) into the elements of list “y” (labels). Subsequently, the model
is instructed to translate this formula into Python code. The model is specifically asked
to use elements of the aforementioned lists as a test case and print “Success” or “Failure”
if the generated function achieves full accuracy on the test case. An external program
extracts a fenced markdown codeblock from the output, which is saved to a separate file
and executed to determine if it successfully completes the task. To improve the chance of
success, this GPT-4 Turbo prompting process is repeated three times; the task counts as solved if at least one of the attempts succeeds. We run GPT-4 Turbo with default temperature settings.
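A hedged sketch of the shared success criterion (our own harness with hypothetical names such as candidate_program; both MIPS and GPT-4 Turbo outputs are judged by exact agreement on the examples):

def solves_task(candidate_program, examples):
    # examples: list of (inputs, expected_output) pairs, where inputs is a tuple of
    # one or two input lists depending on the task.
    try:
        return all(candidate_program(*inputs) == expected
                   for inputs, expected in examples)
    except Exception:
        return False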
4.3. Performance
As seen in Table 1, MIPS is highly complementary to GPT-4 Turbo: MIPS solves 32 of
our tasks, including 13 that are not solved by GPT-4 Turbo (which solves 30).
The AutoML process of Section 3.1 discovers networks of varying task-dependent
shape and size. Table A1 shows the parameters p discovered for each task. Across our
62 tasks, 16 tasks could be solved by a network with hidden dimensions n = 1, and the
largest n required was 81. For many tasks, there was an interpretable meaning to the shape
of the smallest network we discovered. For instance, on tasks where the output is the
element occurring k steps earlier in the list, we found n = k + 1, since the current element
and the previous k elements must be stored for later recall.
We found two main failure modes for MIPS:
1. Noise and non-linearity. The latent space is still close to being a finite state machine,
but the non-linearity and/or noise present in an RNN is so dominant that the integer
autoencoder fails, e.g., for Diff_Abs_Values. Humans can stare at the lookup table
and regress the symbolic function with their brains, but since the lookup table is not
perfect, i.e., it has the wrong integer in a few examples, MIPS fails to symbolically
regress the function. This can probably be mitigated by learning and generalizing
from a training subset with a smaller dynamic range.
2. Continuous computation. A key assumption of MIPS is that RNNs are finite-state ma-
chines. However, RNNs can also use continuous variables to represent information—the
Majority_0_X tasks fail for this reason. This can probably be mitigated by identifying
and implementing floating-point variables.
Figure 3 shows an example of MIPS rediscovering the "ripple-carry adder" algorithm.
The normalizers significantly simplified some of the resulting programs, as illustrated
in Figure 4, and sometimes made the difference between MIPS failing and succeeding.
We found that applying a small L1 weight regularization sometimes facilitated integer
autoencoding by axis-aligning the lattice.
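A minimal sketch of that L1 penalty (our own illustration; the coefficient l1_coeff is an assumption):

import torch

def loss_with_l1(model, task_loss, l1_coeff=1e-4):
    # Per the observation above, a small L1 penalty on the weights tends to
    # axis-align the learned lattice, which helps the integer autoencoder.
    l1 = sum(p.abs().sum() for p in model.parameters())
    return task_loss + l1_coeff * l1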
5. Discussion
We have presented MIPS, a novel method for program synthesis based on the au-
tomated mechanistic interpretability of neural networks trained to perform the desired
task, auto-distilling the learned algorithm into Python code. Its essence is to first train a
recurrent neural network to learn a clever finite state machine that performs the task and
then automatically figure out how this machine works.
5.1. Findings
We found MIPS to be highly complementary to LLM-based program synthesis with
GPT-4 Turbo, with each approach solving many tasks that stumped the other. Please note
that our motivation is not to outcompete other program synthesis methods, but instead to
provide a proof of principle that fully automated distillation of machine-learned algorithms
is not impossible.
Whereas LLM-based methods have the advantage of drawing upon a vast corpus
of human training data, MIPS has the advantage of discovering algorithms from scratch
without human hints, with the potential to discover entirely new algorithms. As opposed to
genetic programming approaches, MIPS leverages the power of deep learning by exploiting
gradient information.
Program synthesis aside, our results shed further light on mechanistic interpretability, specifically on how neural networks represent bits and integers.
5.2. Outlook
Our work is merely a modest first attempt at mechanistic-interpretability-based pro-
gram synthesis, and there are many obvious generalizations worth trying in future work,
for example,
1. Improvements in training and integer autoencoding (since many of our failed exam-
ples failed only just barely);
2. Generalization from RNNs to other architectures such as transformers;
3. Generalization from bits and integers to more general extractable data types such as
floating-point numbers and various discrete mathematical structures and knowledge
representations;
4. Scaling to tasks requiring much larger neural networks;
5. Automated formal verification of synthesized programs (we perform such verification
with Dafny in Appendix G—Formal Verification to show that our MIPS-learned ripple
adder correctly adds any binary numbers, not merely those in the test set, but such
manual work should ideally be fully automated).
LLM-based coding co-pilots are already highly useful for program synthesis tasks
based on verbal problem descriptions or auto-complete and will only get better. MIPS
instead tackles program synthesis based on test cases alone. This makes it analogous
to symbolic regression [42,43], which has already proven useful for various science and
engineering applications [44,45] where one wishes to approximate data relationships with
symbolic formulae. The MIPS framework generalizes symbolic regression from feed-
forward formulae to programs with loops, which are in principle Turing-complete. If this
approach can be scaled up, it may enable promising opportunities for making machine-
learned algorithms more interpretable, verifiable, and trustworthy.
Author Contributions: Conceptualization M.T.; software E.J.M., I.L., V.L., Z.L., A.M., C.L., Z.C.G.,
T.R.K. and M.V.; writing: M.T., E.J.M., I.L., V.L., Z.L., A.M. and C.L.; investigation E.J.M., I.L., V.L.,
Z.L., A.M. and C.L.; supervision: M.T., E.J.M. and Z.L. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by Erik Otto, Jaan Tallinn, the Rothberg Family Fund for Cognitive
Science, the NSF Graduate Research Fellowship (Grant No. 2141064), and IAIFI through NSF grant
PHY-2019786.
Institutional Review Board Statement: Not applicable
Data Availability Statement: Code for reproducing our experiments can be found at https://github.
com/ejmichaud/neural-verification (accessed on 26 November 2024).
Acknowledgments: We thank Wes Gurnee, James Liu, and Armaun Sanayei for helpful conversations
and suggestions.
Conflicts of Interest: The authors declare no conflicts of interest.
$$x_j = \sum_{i=1}^{D} a_{ji} b_i + c, \qquad (A1)$$
It suffices, without loss of generality, to consider the case n = 2. A common algorithm to compute the GCD of two numbers is the Euclidean algorithm. We start with two numbers r_0, r_1 with r_0 > r_1, which is step 0. At the kth step, we perform division with remainder to find the quotient q_k and the remainder r_k such that r_{k-2} = q_k r_{k-1} + r_k with |r_{k-1}| > |r_k| (we are considering the general case where r_0 and r_1 may be negative; otherwise the r_k are always positive and the absolute values are unnecessary). The algorithm eventually produces a zero remainder r_N = 0, and the last non-zero remainder r_{N-1} is the greatest common divisor. For example, GCD(55, 45) = 5, because
$$55 = 1 \times 45 + 10,$$
$$45 = 4 \times 10 + 5, \qquad (A3)$$
$$10 = 2 \times 5 + 0.$$
The goal of D-dimensional GCD is to find a "minimal" parallelogram, such that its volume (which is det(q_1, q_2, ..., q_D)) is the GCD of the volumes of all other possible parallelograms. Once a minimal parallelogram is found (there could be many minimal parallelograms, but finding one is sufficient), we can also determine b_i in Equation (A1), since b_i is exactly q_i. To find the minimal parallelogram, we need two steps: (1) figure out the unit volume; (2) figure out the q_i (i = 1, 2, ...) whose volume is the unit volume.
Step 1: Compute the unit volume V_0. We first define representative parallelograms as those for which all m_i ≡ (m_{i1}, m_{i2}, ..., m_{iD}), i = 1, 2, ..., D, are one-hot vectors, i.e., with only one element equal to 1 and the rest 0. It is easy to show that the volume of any parallelogram is a linear integer combination of the volumes of representative parallelograms, so without loss of generality we can focus on representative parallelograms. We compute the volumes of all representative parallelograms, which gives a volume array. Since volumes are just scalars, we can obtain the unit volume V_0 by calling the regular GCD on the volume array.
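As a concrete illustration of Step 1 (our own sketch, not the paper's code; enumerating all D-element subsets rather than only the representative parallelograms is our simplification):

import math
from functools import reduce
from itertools import combinations
import numpy as np

def unit_volume(vectors):
    # GCD of the (integer) volumes spanned by D-element subsets of the candidate vectors.
    D = len(vectors[0])
    volumes = []
    for subset in combinations(vectors, D):
        v = abs(int(round(np.linalg.det(np.array(subset, dtype=float)))))
        if v:
            volumes.append(v)
    return reduce(math.gcd, volumes) if volumes else 0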
Step 2: Find a minimal parallelogram (whose volume is the unit volume computed in Step 1). Recall that in regular GCD we deal with two numbers (scalars). To leverage this in the vector case, we need to create scalars out of vectors and make sure that the vectors share the same linear structure as the scalars, so that we can extend division and remainder to vectors. A natural scalar is volume. Now consider two parallelograms P1 and P2 which share D − 1 basis vectors (y_3, ..., y_{D+1}), but whose last basis vector differs: y_1 for P1 and y_2 for P2. Denote their volumes by V_1 and V_2:
$$V_1 = \det(y_1, y_3, y_4, \ldots, y_D), \qquad V_2 = \det(y_2, y_3, y_4, \ldots, y_D). \qquad (A5)$$
Since
$$a V_1 + b V_2 = \det(a y_1 + b y_2, y_3, y_4, \ldots, y_D), \qquad (A6)$$
the pair (V_1, V_2) and the pair (y_1, y_2) share the same linear structure, and we can apply division with remainder to V_1 and V_2 as in the regular GCD. When the remainder volume reaches zero, this means that y_2' is a linear combination of (y_3, ..., y_D) and hence can be removed from the vector list.
Step 3: Simplification of basis vectors. We want to further simplify the basis vectors, since those obtained in Step 2 may have large norms. For example, for D = 2 the standard integer lattice has b_1 = (1, 0) and b_2 = (0, 1), but there are infinitely many possibilities after Step 2, as long as pt − sq = ±1 for b_1 = (p, q) and b_2 = (s, t), e.g., b_1 = (3, 5) and b_2 = (4, 7).
To minimize ℓ_2 norms, we choose one basis vector and project-and-subtract the others against it, as in the sketch below. Note that (1) again, we are only allowed to subtract integer multiples of the chosen basis vector; (2) the volume of the parallelogram does not change, since the project-and-subtract matrix has determinant 1 (suppose b_i (i = 2, 3, ..., D) are projected onto b_1 and subtracted by integer multiples of b_1, with p_* denoting the projection integers).
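A hedged sketch of this project-and-subtract reduction (our own code; the pairwise sweep and iteration cap are our choices):

import numpy as np

def size_reduce(basis, max_iters=100):
    # Subtract integer multiples of other basis vectors to shrink l2 norms.
    # Only determinant-1 integer operations are used, so the lattice and the
    # cell volume are unchanged.
    B = np.array(basis, dtype=float)
    for _ in range(max_iters):
        changed = False
        for i in range(len(B)):
            for j in range(len(B)):
                if i == j:
                    continue
                p = int(round(B[i] @ B[j] / (B[j] @ B[j])))      # projection integer
                if p != 0 and np.linalg.norm(B[i] - p * B[j]) < np.linalg.norm(B[i]):
                    B[i] = B[i] - p * B[j]
                    changed = True
        if not changed:
            break
    return B

For the example in the text, size_reduce([[3, 5], [4, 7]]) recovers a short basis equivalent to [[1, 0], [0, 1]] up to signs and ordering.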
Figure A1. Both red and blue bases form a minimal parallelogram (in terms of cell volume) but one
can further simplify red to blue by linear combination (simplicity in the sense of small ℓ2 norm).
$$h(t) = \sum_{j=1}^{t} W_h^{\,t-j} W_i\, x_j, \qquad (A11)$$
Since the x_j values themselves lie on integer lattices, we could then interpret the following as basis vectors:
$$W_h^{\,t-j} W_i, \qquad j = 1, 2, \cdots, t, \qquad (A12)$$
which are not necessarily independent. For example, for the task of summing up the last two numbers, W_h W_i and W_i are non-zero vectors and are independent, while the others satisfy W_h^n W_i ≈ 0 for n ≥ 2. Then, W_h W_i and W_i are the two basis vectors for the lattice. In general, we measure the norm of all the candidate basis vectors and select the first k vectors with the highest norms, which are exactly the basis vectors of the hidden lattice.
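A hedged sketch of this linear lattice finder (our own code; W_i is assumed to be an n x 1 input matrix, i.e., scalar inputs):

import numpy as np

def linear_lattice_basis(W_h, W_i, t, k):
    # Candidate basis vectors W_h^(t-j) W_i for j = 1, ..., t; keep the k largest in norm.
    candidates = [(np.linalg.matrix_power(W_h, t - j) @ W_i).ravel()
                  for j in range(1, t + 1)]
    norms = np.array([np.linalg.norm(c) for c in candidates])
    keep = np.argsort(norms)[::-1][:k]
    return [candidates[idx] for idx in sorted(keep)]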
$$W \Longrightarrow A W A^{-1}, \qquad V \Longrightarrow A V, \qquad b \Longrightarrow A b, \qquad U \Longrightarrow U A^{-1}.$$
As we can see, all of the matrices in the decomposition are unstable near δ = 0, so the issue
of error thresholding is not only numerical but is mathematical in nature as well.
We would like to construct an algorithm that computes the Jordan normal form with
an error threshold |δ| < ϵ = 0.7 within which the algorithm will pick the transformation T
from Equation (A14) instead of from Equation (A13).
Our algorithm first computes the eigenvalues λ_i and then iteratively solves for the generalized eigenvectors that lie in ker((W − λI)^k) for increasing k. The approximation occurs whenever we compute the kernel (of unknown dimension) of a matrix X; we take the SVD of X and treat any singular vectors as part of the nullspace if their singular values are lower than the threshold ϵ, calling the result ϵ-ker(X).
Spaces are always stored in the form of a rectangular matrix F of orthonormal vectors, and their dimension is always the width of the matrix. We build projections using proj(F) = F F^H, where F^H denotes the conjugate transpose of F. We compute kernels ker(X) of known dimension of matrices X by taking the SVD X = V_1 S V_2^H and taking the last singular vectors in V_2^H. We compute column spaces of projectors of known dimension by taking the top singular vectors of the SVD.
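A hedged numpy sketch of these two primitives (our own code, not the paper's implementation):

import numpy as np

def eps_ker(X, eps=0.7):
    # Right singular vectors with singular values below eps (plus any extra rows of Vh
    # for a wide matrix) span the approximate nullspace; return them as orthonormal columns.
    _, S, Vh = np.linalg.svd(X)
    rows = [Vh[i] for i in range(Vh.shape[0]) if i >= len(S) or S[i] < eps]
    return np.array(rows).conj().T if rows else np.zeros((X.shape[1], 0))

def proj(F):
    # Orthogonal projector onto the column space of F (columns assumed orthonormal).
    return F @ F.conj().T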
The steps in our algorithm are as follows:
1. Solve for the eigenvalues λ_i of W, and check that eigenvalues within ϵ of each other form a group, i.e., that |λ_i − λ_j| ≤ ϵ and |λ_j − λ_k| ≤ ϵ always imply |λ_k − λ_i| ≤ ϵ. Compute the mean eigenvalue for every group.
2. Solve for the approximate kernels of W − λI for each mean eigenvalue λ. We will denote this operation by ϵ-ker(W − λI). We represent these kernels by storing the singular vectors whose singular values are lower than ϵ. Also, we construct a "corrected matrix" of W − λI for every λ by taking the SVD, discarding the low singular values, and multiplying the pruned decomposition back together again.
3. Solve for successive spaces F_k of generalized eigenvectors at increasing depths k along the set of Jordan chains with eigenvalue λ, for all λ. In other words, find chains of mutually orthogonal vectors that are mapped to zero after exactly k applications of the map W − λI. We first solve for F_0 = ker(W − λI). Then for k > 0, we first solve for J_k = ϵ-ker((I − proj(F_{k−1}))(W − λI)) and deduce the number of chains which reach depth k from the dimension of J_k, and then solve for F_k = col(proj(J_k) − proj(F_0)).
4. Perform a consistency check to verify that the dimensions of F_k always stay the same or decrease with k. Go through the spaces F_k in reverse order, and whenever the dimension of F_k decreases, figure out which direction(s) are not mapped to by applying W − λI to F_{k+1}. Carry this out by building a projector J from mapping vectors representing F_{k+1} through W − λI and taking col(proj(F_k) − J). Solve for the Jordan chain by repeatedly applying proj(F_i)(W_i − λI) for i starting from k − 1 and going all the way down to zero.
5. Concatenate all the Jordan chains together to form the transformation matrix T.
The transformation T consists of generalized eigenvectors that need not be com-
pletely real but may also include pairs of generalized eigenvectors that are complex
conjugates of each other. Since we do not want the weights of our normalized network
to be complex, we also apply a unitary transformation that changes any pair of complex
generalized eigenvectors into a pair of real vectors and the resulting block of W into a
multiple of a rotation matrix. As an example, for a real 2-by-2 matrix W with complex
eigenvectors, we have
$$W = T \begin{pmatrix} a+bi & 0 \\ 0 & a-bi \end{pmatrix} T^{-1} = (TT')\begin{pmatrix} a & -b \\ b & a \end{pmatrix}(TT')^{-1}, \qquad T' = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & i \\ 1 & -i \end{pmatrix}.$$
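A minimal numerical check (our own illustration) that this T' indeed maps the complex pair a ± bi to the real block [[a, -b], [b, a]]:

import numpy as np

a, b = 0.8, 0.6
D = np.diag([a + 1j * b, a - 1j * b])
Tp = np.array([[1, 1j], [1, -1j]]) / np.sqrt(2)   # T' from the equation above
R = np.linalg.inv(Tp) @ D @ Tp                    # T'^{-1} D T'
print(np.round(R.real, 6))                        # [[a, -b], [b, a]] = [[0.8, -0.6], [0.6, 0.8]]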
Table A1. AutoML architecture search results. All networks achieved 100% accuracy on at least one
test batch.
Bitwise-Xor

def f(s, t):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]; c = t[i];
        next_a = b ^ c
        a = next_a;
        y = a
        ys.append(y)
    return ys

Bitwise-Or

def f(s, t):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]; c = t[i];
        next_a = b + c > 0
        a = next_a;
        y = a
        ys.append(y)
    return ys

Bitwise-And

def f(s, t):
    a = 0; b = 1;
    ys = []
    for i in range(10):
        c = s[i]; d = t[i];
        next_a = (not a and not b and c and d) or (not a and b and not c and d) or (not a and b and c and not d) or (not a and b and c and d) or (a and not b and c and d) or (a and b and c and d)
        next_b = c + d == 0 or c + d == 2
        a = next_a; b = next_b;
        y = a + b > 1
        ys.append(y)
    return ys
Bitwise-Not

def f(s):
    a = 1;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = x
        a = next_a;
        y = -a + 1
        ys.append(y)
    return ys

Parity-Last2

def f(s):
    a = 0; b = 0;
    ys = []
    for i in range(10):
        c = s[i]
        next_a = c
        next_b = a ^ c
        a = next_a; b = next_b;
        y = b
        ys.append(y)
    return ys

Parity-Last3

def f(s):
    a = 0; b = 0; c = 0;
    ys = []
    for i in range(10):
        d = s[i]
        next_a = d
        next_b = c
        next_c = a
        a = next_a; b = next_b; c = next_c;
        y = a ^ b ^ c
        ys.append(y)
    return ys

Parity-All

def f(s):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]
        next_a = a ^ b
        a = next_a;
        y = a
        ys.append(y)
    return ys

Parity-Zeros

def f(s):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]
        next_a = a + b == 0 or a + b == 2
        a = next_a;
        y = a
        ys.append(y)
    return ys
Sum-All

def f(s):
    a = 884;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x
        a = next_a;
        y = -a + 884
        ys.append(y)
    return ys

Sum-Last2

def f(s):
    a = 0; b = 99;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + x + 99
        next_b = -x + 99
        a = next_a; b = next_b;
        y = a
        ys.append(y)
    return ys

Sum-Last3

def f(s):
    a = 0; b = 198; c = 0;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = x
        next_b = -a - x + 198
        next_c = -b + 198
        a = next_a; b = next_b; c = next_c;
        y = a + c
        ys.append(y)
    return ys

Sum-Last4

def f(s):
    a = 0; b = 99; c = 0; d = 99;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = c
        next_b = -x + 99
        next_c = -b - d + 198
        next_d = b
        a = next_a; b = next_b; c = next_c; d = next_d;
        y = a - b - d + 198
        ys.append(y)
    return ys
Sum-Last5

def f(s):
    a = 198; b = -10; c = -2; d = 482; e = 1;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + c + 190
        next_b = b - c - d - e + x + 480
        next_c = b - e + 8
        next_d = -b + e - x + 472
        next_e = a + b - e - 187
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e;
        y = -d + 483
        ys.append(y)
    return ys

Sum-Last6

def f(s):
    a = 0; b = 295; c = 99; d = 0; e = 297; f = 99;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + 295
        next_b = b - c + f
        next_c = b - c + d - 97
        next_d = -f + 99
        next_e = -a + 297
        next_f = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f;
        y = -b + c - e - f + 592
        ys.append(y)
    return ys

Sum-Last7

def f(s):
    a = 297; b = 198; c = 0; d = 99; e = 0; f = -15; g = 0;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -a + d - f + g + 480
        next_b = a - d
        next_c = d + e - 99
        next_d = -c + 99
        next_e = -b + 198
        next_f = -c + f + x
        next_g = x
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f; g = next_g;
        y = -d + f + 114
        ys.append(y)
    return ys

Current-Number

def f(s):
    a = 99;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -x + 99
        a = next_a;
        y = -a + 99
        ys.append(y)
    return ys
Prev1

def f(s):
    a = 0; b = 99;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + 99
        next_b = -x + 99
        a = next_a; b = next_b;
        y = a
        ys.append(y)
    return ys

Prev2

def f(s):
    a = 99; b = 0; c = 0;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -x + 99
        next_b = -a + 99
        next_c = b
        a = next_a; b = next_b; c = next_c;
        y = c
        ys.append(y)
    return ys

Prev3

def f(s):
    a = 0; b = 0; c = 99; d = 99;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = b
        next_b = -c + 99
        next_c = d
        next_d = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d;
        y = a
        ys.append(y)
    return ys

Prev4

def f(s):
    a = 0; b = 99; c = 0; d = 99; e = 0;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = c
        next_b = -a + 99
        next_c = -d + 99
        next_d = -e + 99
        next_e = x
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e;
        y = -b + 99
        ys.append(y)
    return ys
Prev5

def f(s):
    a = 0; b = 0; c = 99; d = 99; e = 99; f = 99;
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -c + 99
        next_b = -d + 99
        next_c = -b + 99
        next_d = e
        next_e = f
        next_f = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f;
        y = a
        ys.append(y)
    return ys

Previous-Equals-Current

def f(s):
    a = 0; b = 0;
    ys = []
    for i in range(10):
        c = s[i]
        next_a = delta(c - b)   # delta is presumably the Kronecker delta: 1 if its argument is 0, else 0
        next_b = c
        a = next_a; b = next_b;
        y = a
        ys.append(y)
    return ys

Diff-Last2

def f(s):
    a = 199; b = 100;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -a - b + x + 498
        next_b = a + b - 199
        a = next_a; b = next_b;
        y = a - 199
        ys.append(y)
    return ys

Abs-Diff

def f(s):
    a = 100; b = 100;
    ys = []
    for i in range(10):
        c = s[i]
        next_a = b
        next_b = c + 100
        a = next_a; b = next_b;
        y = abs(b - a)
        ys.append(y)
    return ys
Abs-Current

def f(s):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]
        next_a = abs(b)
        a = next_a;
        y = a
        ys.append(y)
    return ys

Bit-Shift-Right

def f(s):
    a = 0; b = 1;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + 1
        next_b = -x + 1
        a = next_a; b = next_b;
        y = a
        ys.append(y)
    return ys

Bit-Dot-Prod-Mod2

def f(s, t):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]; c = t[i];
        next_a = (not a and b and c) or (a and not b and not c) or (a and not b and c) or (a and b and not c)
        a = next_a;
        y = a
        ys.append(y)
    return ys

Add-Mod-3

def f(s):
    a = 0;
    ys = []
    for i in range(10):
        b = s[i]
        next_a = (b + a) % 3
        a = next_a;
        y = a
        ys.append(y)
    return ys
Newton-Freebody

def f(s):
    a = 82; b = 393;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x
        next_b = -a + b + 82
        a = next_a; b = next_b;
        y = -a + b - 311
        ys.append(y)
    return ys

Newton-Gravity

def f(s):
    a = 72; b = 513;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x + 1
        next_b = -a + b + x + 71
        a = next_a; b = next_b;
        y = b - 513
        ys.append(y)
    return ys

Newton-Spring

def f(s):
    a = 64; b = 57;
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a + b - x - 57
        next_b = -a + 121
        a = next_a; b = next_b;
        y = -a + 64
        ys.append(y)
    return ys
Formal Verification
The Dafny programming language is designed so that programs can be formally
verified for correctness. The desired behavior of a program can be explicitly specified via
preconditions, postconditions, and invariants, which are verified via automated theorem
proving. These capabilities make Dafny useful in fields where correctness and safety
are crucial.
We leverage Dafny’s robust verification capabilities to prove the correctness of the
bit addition Python program synthesized by MIPS. The bit addition Python program
was first converted to Dafny, then annotated with specific assertions, preconditions, and
postconditions that defined the expected behavior of the code. Each annotation in the code
was then formally verified by Dafny, ensuring that under all possible valid inputs, the
code’s output would be consistent with the expected behavior. On line 79, we show that
the algorithm found by MIPS is indeed equivalent to performing bit addition with length
10 bitvectors in Dafny.
Dafny-Code
1
2 function ArrayToBv10 ( arr : array < bool >) : bv10 // Converts boolean array to
bitvector
3 reads arr
4 requires arr . Length == 10
5 {
6 ArrayToBv10Helper(arr, arr.Length - 1)
7 }
8
9 function ArrayToBv10Helper(arr: array<bool>, index: nat): bv10
10 reads arr
11 requires arr . Length == 10
12 requires 0 <= index < arr . Length
13 decreases index
14 ensures forall i :: 0 <= i < index ==> ((ArrayToBv10Helper(arr, i) >> i) & 1) == (if arr[i] then 1 else 0)
15 {
16 if index == 0 then
17 ( if arr [0] then 1 else 0) as bv10
18 else
19 var bit : bv10 := if arr [ index ] then 1 as bv10 else 0 as bv10 ;
20 (bit << index) + ArrayToBv10Helper(arr, index - 1)
21 }
22
23 method ArrayToSequence(arr: array<bool>) returns (res: seq<bool>) // Converts boolean array to boolean sequence
24 ensures | res | == arr . Length
25 ensures forall k :: 0 <= k < arr . Length == > res [ k ] == arr [ k ]
26 {
27 res := [];
28 var i := 0;
29 while i < arr . Length
30 invariant 0 <= i <= arr . Length
31 invariant | res | == i
32 invariant forall k :: 0 <= k < i == > res [ k ] == arr [ k ]
33 {
34 res := res + [ arr [ i ]];
35 i := i + 1;
36 }
37 }
38
39 function isBitSet ( x : bv10 , bitIndex : nat ) : bool
40 requires bitIndex < 10
41 ensures isBitSet (x , bitIndex ) <== > ( x & (1 << bitIndex ) ) != 0
42 {
43 ( x & (1 << bitIndex ) ) != 0
44 }
45
46 function Bv10ToSeq ( x : bv10 ) : seq < bool > // Converts bitvector to boolean
sequence
47 ensures | Bv10ToSeq ( x ) | == 10
48 ensures forall i : nat :: 0 <= i < 10 == > Bv10ToSeq ( x ) [ i ] == isBitSet (x , i
)
49 {
50 [ isBitSet (x , 0) , isBitSet (x , 1) , isBitSet (x , 2) , isBitSet (x , 3) ,
51 isBitSet (x , 4) , isBitSet (x , 5) , isBitSet (x , 6) , isBitSet (x , 7) ,
52 isBitSet (x , 8) , isBitSet (x , 9) ]
53 }
54
55 function BoolToInt ( a : bool ) : int {
56 if a then 1 else 0
57 }
58
59 function XOR ( a : bool , b : bool ) : bool {
60 ( a || b ) && !( a && b )
61 }
62
63 function BitAddition ( s : array < bool > , t : array < bool >) : seq < bool > // Performs
traditional bit addition
64 reads s
65 reads t
66 requires s . Length == 10 && t . Length == 10
67 {
68 var a : bv10 := ArrayToBv10 ( s ) ;
69 var b : bv10 := ArrayToBv10 ( t ) ;
70 var c : bv10 := a + b ;
71
72 Bv10ToSeq ( c )
73 }
74
75 method f ( s : array < bool > , t : array < bool >) returns ( sresult : seq < bool >) //
Generated program for bit addition
76 requires s . Length == 10 && t . Length == 10
77 ensures | sresult | == 10
78 ensures forall i :: 0 <= i && i < | sresult | == > sresult [ i ] == (( s [ i ] != t
[ i ]) != ( i > 0 && (( s [i -1] || t [i -1]) && !( sresult [i -1] && ( s [i -1] != t
[i -1]) ) ) ) )
79 ensures BitAddition (s , t ) == sresult // Verification of correctness
80 {
81 var a : bool := false ;
82 var b : bool := false ;
83 var result : array < bool > := new bool [10];
84 var i : int := 0;
85
86 while i < result . Length
87 invariant 0 <= i <= result . Length
88 invariant forall j :: 0 <= j < i == > result [ j ] == false
89 {
90 result [ i ] := false ;
91 i := i + 1;
92 }
93
94 i := 0;
95
96 assert forall j :: 0 <= j < result . Length == > result [ j ] == false ;
97
98 while i < result . Length
99 invariant 0 <= i <= result . Length
100 invariant b == ( i > 0 && (( s [i -1] || t [i -1]) && !( result [i -1] && ( s [i
-1] != t [i -1]) ) ) )
101 invariant forall j :: 0 <= j < i == > result [ j ] == (( s [ j ] != t [ j ]) != ( j
> 0 && (( s [j -1] || t [j -1]) && !( result [j -1] && ( s [j -1] != t [j -1]) ) ) ) )
102 {
103 assert b == ( i > 0 && (( s [i -1] || t [i -1]) && !( result [i -1] && ( s [i -1]
!= t [i -1]) ) ) ) ;
104
105 result [ i ] := XOR (b , XOR ( s [ i ] , t [ i ]) ) ;
106 b := BoolToInt ( b ) + BoolToInt ( s [ i ]) + BoolToInt ( t [ i ]) > 1;
107 assert b == (( s [ i ] || t [ i ]) && !( result [ i ] && ( s [ i ] != t [ i ]) ) ) ;
108
109 i := i + 1;
110 }
111
112 sresult := ArrayToSequence(result);
113 }
References
1. Center for AI Safety. Statement on AI Risk. 2023. Available online: https://www.safe.ai/work/statement-on-ai-risk (accessed
on 4 September 2024).
2. Tegmark, M.; Omohundro, S. Provably safe systems: The only path to controllable agi. arXiv 2023, arXiv:2309.01933.
3. Dalrymple, D.; Skalse, J.; Bengio, Y.; Russell, S.; Tegmark, M.; Seshia, S.; Omohundro, S.; Szegedy, C.; Goldhaber, B.; Ammann, N.;
et al. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems. arXiv 2024, arXiv:2405.06624.
4. Zhou, B.; Ding, G. Survey of intelligent program synthesis techniques. In Proceedings of the International Conference on
Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023), Yinchuan, China, 18–19 August 2023 ;
SPIE; Springer: Bellingham, WA, USA, 2023; Volume 12941, pp. 1122–1136.
5. Odena, A.; Shi, K.; Bieber, D.; Singh, R.; Sutton, C.; Dai, H. BUSTLE: Bottom-Up program synthesis through learning-guided
exploration. arXiv 2020, arXiv:2007.14381.
6. Wu, J.; Wei, L.; Jiang, Y.; Cheung, S.C.; Ren, L.; Xu, C. Programming by Example Made Easy. ACM Trans. Softw. Eng. Methodol.
2023, 33, 1–36. [CrossRef]
7. Sobania, D.; Briesch, M.; Rothlauf, F. Choose your programming copilot: A comparison of the program synthesis performance of
github copilot and genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference, Boston, MA,
USA, 9–13 July 2022; pp. 1019–1027.
8. Olah, C.; Cammarata, N.; Schubert, L.; Goh, G.; Petrov, M.; Carter, S. Zoom in: An introduction to circuits. Distill 2020,
5, e00024-001. [CrossRef]
9. Cammarata, N.; Goh, G.; Carter, S.; Schubert, L.; Petrov, M.; Olah, C. Curve Detectors. Distill 2020. Available online:
https://distill.pub/2020/circuits/curve-detectors (accessed on 26 November 2024 ). [CrossRef]
10. Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; Steinhardt, J. Interpretability in the wild: A circuit for indirect object
identification in gpt-2 small. arXiv 2022, arXiv:2211.00593.
11. Olsson, C.; Elhage, N.; Nanda, N.; Joseph, N.; DasSarma, N.; Henighan, T.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; et al. In-context
Learning and Induction Heads. Transform. Circuits Thread 2022. Available online: https://transformer-circuits.pub/2022/in-
context-learning-and-induction-heads/index.html (accessed on 26 November 2024).
12. Goh, G.; Cammarata, N.; Voss, C.; Carter, S.; Petrov, M.; Schubert, L.; Radford, A.; Olah, C. Multimodal Neurons in Artificial
Neural Networks. Distill 2021. Available online: https://distill.pub/2021/multimodal-neurons (accessed on 26 November 2024).
[CrossRef]
13. Gurnee, W.; Tegmark, M. Language models represent space and time. arXiv 2023, arXiv:2310.02207.
14. Vafa, K.; Chen, J.Y.; Kleinberg, J.; Mullainathan, S.; Rambachan, A. Evaluating the World Model Implicit in a Generative Model.
arXiv 2024, arXiv:2406.03689.
15. Burns, C.; Ye, H.; Klein, D.; Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv 2022,
arXiv:2212.03827.
16. Marks, S.; Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false
datasets. arXiv 2023, arXiv:2310.06824.
17. McGrath, T.; Kapishnikov, A.; Tomašev, N.; Pearce, A.; Wattenberg, M.; Hassabis, D.; Kim, B.; Paquet, U.; Kramnik, V. Acquisition
of chess knowledge in AlphaZero. Proc. Natl. Acad. Sci. USA 2022, 119, e2206625119. [CrossRef] [PubMed]
18. Toshniwal, S.; Wiseman, S.; Livescu, K.; Gimpel, K. Chess as a testbed for language model state tracking. In Proceedings of the
AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 11385–11393.
19. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent world representations: Exploring a sequence
model trained on a synthetic task. arXiv 2022, arXiv:2210.13382.
20. Nanda, N.; Chan, L.; Liberum, T.; Smith, J.; Steinhardt, J. Progress measures for grokking via mechanistic interpretability. arXiv
2023, arXiv:2301.05217.
21. Liu, Z.; Kitouni, O.; Nolte, N.; Michaud, E.J.; Tegmark, M.; Williams, M. Towards Understanding Grokking: An Effective
Theory of Representation Learning. In Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems,
New Orleans, LA, USA, 28 November 2022.
22. Zhong, Z.; Liu, Z.; Tegmark, M.; Andreas, J. The clock and the pizza: Two stories in mechanistic explanation of neural networks.
In Advances in Neural Information Processing Systems: 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New
Orleans, LA, USA, 10–16 December 2023; Volume 36.
23. Quirke, P.; Barez, F. Understanding Addition in Transformers. arXiv 2023, arXiv:2310.13121.
24. Chughtai, B.; Chan, L.; Nanda, N. A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations. In
Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill,
E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR (Proceedings of Machine Learning Research) 2023; Volume 202,
pp. 6243–6267.
25. Hanna, M.; Liu, O.; Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained
language model. arXiv 2023, arXiv:2305.00586.
26. Charton, F. Can transformers learn the greatest common divisor? arXiv 2023, arXiv:2308.15594.
27. Lindner, D.; Kramár, J.; Farquhar, S.; Rahtz, M.; McGrath, T.; Mikulik, V. Tracr: Compiled transformers as a laboratory for
interpretability. arXiv 2023, arXiv:2301.05062.
28. Friedman, D.; Wettig, A.; Chen, D. Learning transformer programs. Adv. Neural Inf. Process. Syst. 2023, 36, 49044–49067.
29. Bills, S.; Cammarata, N.; Mossing, D.; Tillman, H.; Gao, L.; Goh, G.; Sutskever, I.; Leike, J.; Wu, J.; Saunders, W. Language
Models Can Explain Neurons in Language Models. 2023. Available online: https://openaipublic.blob.core.windows.net/neuron-
explainer/paper/index.html (accessed on 26 November 2024).
30. Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; Sharkey, L. Sparse autoencoders find highly interpretable features in language
models. arXiv 2023, arXiv:2309.08600.
31. Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; et al. Towards
Monosemanticity: Decomposing Language Models with Dictionary Learning. Transform. Circuits Thread 2023. Available online:
https://transformer-circuits.pub/2023/monosemantic-features/index.html (accessed on 26 November 2024).
32. Conmy, A.; Mavor-Parker, A.N.; Lynch, A.; Heimersheim, S.; Garriga-Alonso, A. Towards automated circuit discovery for
mechanistic interpretability. arXiv 2023, arXiv:2304.14997.
33. Syed, A.; Rager, C.; Conmy, A. Attribution Patching Outperforms Automated Circuit Discovery. arXiv 2023, arXiv:2310.10348.
34. Marks, S.; Rager, C.; Michaud, E.J.; Belinkov, Y.; Bau, D.; Mueller, A. Sparse feature circuits: Discovering and editing interpretable
causal graphs in language models. arXiv 2024, arXiv:2403.19647.
35. Karpathy, A.; Johnson, J.; Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv 2015, arXiv:1506.02078.
36. Strobelt, H.; Gehrmann, S.; Pfister, H.; Rush, A.M. LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent
Neural Networks. IEEE Trans. Vis. Comput. Graph. 2018, 24, 667–676. [CrossRef]
37. Giles, C.L.; Horne, B.G.; Lin, T. Learning a class of large finite state machines with a recurrent neural network. Neural Netw. 1995,
8, 1359–1365. [CrossRef]
38. Wang, Q.; Zhang, K.; Ororbia II, A.G.; Xing, X.; Liu, X.; Giles, C.L. An empirical evaluation of rule extraction from recurrent
neural networks. arXiv 2017, arXiv:1709.10380. [CrossRef] [PubMed]
39. Weiss, G.; Goldberg, Y.; Yahav, E. Extracting automata from recurrent neural networks using queries and counterexamples. In
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5247–5256.
40. Oliva, C.; Lago-Fernández, L.F. On the interpretation of recurrent neural networks as finite state machines. In Proceedings of the
Artificial Neural Networks and Machine Learning–ICANN 2019: Theoretical Neural Computation: 28th International Conference
on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; Proceedings, Part I 28; Springer: Berlin, Germany, 2019;
pp. 312–323.
41. Muškardin, E.; Aichernig, B.K.; Pill, I.; Tappler, M. Learning finite state models from recurrent neural networks. In Proceedings
of the International Conference on Integrated Formal Methods, Lugano, Switzerland, 7–10 June 2022; Springer: Berlin, Germany,
2022; pp. 229–248.
42. Udrescu, S.M.; Tan, A.; Feng, J.; Neto, O.; Wu, T.; Tegmark, M. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting
graph modularity. Adv. Neural Inf. Process. Syst. 2020, 33, 4860–4871.
43. Cranmer, M. Interpretable machine learning for science with PySR and SymbolicRegression. jl. arXiv 2023, arXiv:2305.01582.
44. Cranmer, M.; Sanchez Gonzalez, A.; Battaglia, P.; Xu, R.; Cranmer, K.; Spergel, D.; Ho, S. Discovering symbolic models from deep
learning with inductive biases. Adv. Neural Inf. Process. Syst. 2020, 33, 17429–17442.
45. Ma, H.; Narayanaswamy, A.; Riley, P.; Li, L. Evolving symbolic density functionals. Sci. Adv. 2022, 8, eabq0279. [CrossRef]
46. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.