Logic Synthesis Meets Machine Learning: Trading Exactness For Generalization
Rasit O. Topaloglu^(k,7), Yuan Zhou^(l,8), Jordan Dotzel^(l,8), Yichi Zhang^(l,8), Hanyu Wang^(l,8), Zhiru Zhang^(l,8),
Valerio Tenace^(n,10), Pierre-Emmanuel Gaillardon^(n,10), Alan Mishchenko^(o,†), and Satrajit Chatterjee^(p,†)
^a University of Tokyo, Japan; ^b Universidade Federal de Pelotas, Brazil; ^c National Taiwan University, Taiwan; ^d University of Texas at Austin, USA; ^e Universidade Federal do Rio Grande do Sul, Brazil; ^f Technische Universitaet Dresden, Germany; ^j University of Wisconsin–Madison, USA; ^k IBM, USA; ^l Cornell University, USA; ^m Universidade Federal de Santa Catarina, Brazil; ^n University of Utah, USA; ^o UC Berkeley, USA; ^p Google AI, USA
The alphabetic superscripts denote affiliations and the numeric superscripts denote team numbers. ^† Equal contribution.
Email: [email protected], [email protected], [email protected], [email protected]
Abstract—Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely specified, the implementation accurately represents the function. If the function is incompletely specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthesis rely on SAT and Boolean methods to exactly implement the care set, we investigate learning in logic synthesis, attempting to trade exactness for generalization. This work is directly related to machine learning where the care set is the training set and the implementation is expected to generalize on a validation set. We present learning of incompletely-specified functions based on the results of a competition conducted at IWLS 2020. The goal of the competition was to implement 100 functions given by a set of care minterms for training, while testing the implementation using a set of validation minterms sampled from the same function. We make this benchmark suite available and offer a detailed comparative analysis of the different approaches to learning.

(This work was supported in part by the Semiconductor Research Corporation under Contract 2867.001.)

I. INTRODUCTION

Logic synthesis is a key ingredient in modern electronic design automation flows. A central problem in logic synthesis is the following: Given a Boolean function f : B^n → B (where B denotes the set {0, 1}), construct a logic circuit that implements f with the minimum number of logic gates. The function f may be completely specified, i.e., we are given f(x) for all x ∈ B^n, or it may be incompletely specified, i.e., we are only given f(x) for a subset of B^n called the careset. An incompletely specified function provides more flexibility for optimizing the circuit, since the values produced by the circuit outside the careset are not of interest.

Recently, machine learning has emerged as a key enabling technology for a variety of breakthroughs in artificial intelligence. A central problem in machine learning is that of supervised learning: Given a class H of functions from a domain X to a co-domain Y, find a member h ∈ H that best fits a given set of training examples of the form (x, y) ∈ X × Y. The quality of the fit is judged by how well h generalizes, i.e., how well h fits examples that were not seen during training.

Thus, logic synthesis and machine learning are closely related. Supervised machine learning can be seen as logic synthesis of an incompletely specified function with a different constraint (or objective): the circuit must also generalize well outside the careset (i.e., to the test set), possibly at the expense of reduced accuracy on the careset (i.e., on the training set). Conversely, logic synthesis may be seen as a machine learning problem where, in addition to generalization, we care about finding an element of H that has small size, and the sets X and Y are not smooth but discrete.

To explore this connection between the two fields, the two last authors of this paper organized a programming contest at the 2020 International Workshop on Logic Synthesis. The goal of this contest was to come up with an algorithm to synthesize a small circuit for a Boolean function f : B^n → B learnt from a training set of examples. Each example (x, y) in the training set is an input-output pair, i.e., x ∈ B^n and y ∈ B. The training set was chosen at random from the 2^n possible inputs of the function (and in most cases was much smaller than 2^n). The quality of the solution was evaluated by measuring accuracy on a test set not provided to the participants.

The synthesized circuit for f had to be in the form of an And-Inverter Graph (AIG) [1, 2] with no more than 5000 nodes. An AIG is a standard data structure used in logic synthesis to represent Boolean functions, where a node corresponds to a 2-input And gate and edges represent direct or inverted connections. Since an AIG can represent any Boolean function, in this problem H is the full set of Boolean functions on n variables.

To evaluate the algorithms proposed by the participants, we created a set of 100 benchmarks drawn from a mix of standard problems in logic synthesis, such as synthesis of arithmetic circuits and random logic from standard logic synthesis benchmarks. We also included some tasks from standard machine learning benchmarks. For each benchmark the participants were provided with the training set (which was sub-divided into a training set proper of 6400 examples and a validation set of another 6400 examples, though the participants were free to use these subsets as they saw fit), and the circuits returned by their algorithms were evaluated on the corresponding test set (again with 6400 examples) that was kept private until the competition was over. The training, validation and test sets were created in the PLA format [3]. The score assigned to each participant was the average test accuracy over all the benchmarks, with possible ties being broken by the circuit size.

Ten teams spanning 6 countries took part in the contest. They explored many different techniques to solve this problem. In this paper we present short overviews of the techniques used by the different teams (the superscript for an author indicates their team number), as well as a comparative analysis of these techniques. The following are our main findings from the analysis:
• No one technique dominated across all the benchmarks, and most teams, including the winning team, used an ensemble of techniques.
• Random forests (and decision trees) were very popular and form a strong baseline, and may be a useful technique for approximate logic synthesis.
• Sacrificing a little accuracy allows for a significant reduction in the size of the circuit.

These findings suggest an interesting direction for future work: Can machine learning algorithms be used for approximate logic synthesis to greatly reduce power and area when exactness is not needed?

Finally, we believe that the set of benchmarks used in this contest, along with the solutions provided by the participants (based on the methods described in this paper), provide an interesting framework to evaluate further advances in this area. To that end we are making these available at https://github.com/iwls2020-lsml-contest/.

II. BACKGROUND AND PRELIMINARIES

We review briefly the more popular techniques used.

Sum-of-Products (SOP), or disjunctive normal form, is a two-level logic representation commonly used in logic synthesis. Minimizing the SOP representation of an incompletely specified Boolean function is a well-studied problem, with a number of exact approaches [4, 5, 6] as well as heuristics [7, 8, 9, 10], with ESPRESSO [7] being the most popular.

Decision Trees (DT) and Random Forests (RF) are very popular techniques in machine learning, and they were used by many of the teams. In the contest scope, the decision trees were applied as classification trees, where the internal nodes are associated with the function input variables, and the terminal nodes classify the function as 1 or 0, given the association of internal nodes. Thus, each internal node has two outgoing edges: a true edge if the variable value exceeds a threshold value, and a false edge otherwise. The threshold value is defined during training. Hence, each internal node can be seen as a multiplexer, with the selector given by the threshold value. Random forests are composed of multiple decision trees, where each tree is trained over a distinct feature set, so that the trees are not very similar. The output is given by the combination of the individual predictions.

Look-up Table (LUT) Network is a network of randomly connected k-input LUTs, where each k-input LUT can implement any function of up to k variables. LUT networks were first employed in a theoretical study to understand whether pure memorization (i.e., fitting without any explicit search or optimization) could lead to generalization [11].

III. BENCHMARKS

The set of 100 benchmarks used in the contest can be broadly divided into 10 categories, each with 10 test-cases. The summary of categories is shown in Table I. For example, the first 10 test-cases are created by considering the two most-significant bits (MSBs) of k-input adders for k ∈ {16, 32, 64, 128, 256}.

Table I: An overview of the different types of functions in the benchmark set. They are selected from three domains: Arithmetic, Random Logic, and Machine Learning.
00-09: 2 MSBs of k-bit adders for k ∈ {16, 32, 64, 128, 256}
10-19: MSB of k-bit dividers and remainder circuits for k ∈ {16, 32, 64, 128, 256}
20-29: MSB and middle bit of k-bit multipliers for k ∈ {8, 16, 32, 64, 128}
30-39: k-bit comparators for k ∈ {10, 20, ..., 100}
40-49: LSB and middle bit of k-bit square-rooters with k ∈ {16, 32, 64, 128, 256}
50-59: 10 outputs of PicoJava design with 16-200 inputs and roughly balanced onset & offset
60-69: 10 outputs of MCNC i10 design with 16-200 inputs and roughly balanced onset & offset
70-79: 5 other outputs from MCNC benchmarks + 5 symmetric functions of 16 inputs
80-89: 10 binary classification problems from MNIST group comparisons
90-99: 10 binary classification problems from CIFAR-10 group comparisons

Test-cases ex60 through ex69 were derived from the MCNC benchmark [12] i10 by extracting outputs 91, 128, 150, 159, 161, 163, 179, 182, 187, and 209 (zero-based indexing). For example, ex60 was derived using the ABC command line: &read i10.aig; &cone -O 91.

Five test-cases, ex70 through ex74, were similarly derived from the MCNC benchmarks cordic (both outputs), too large (zero-based output 2), t481, and parity. Five 16-input symmetric functions used in ex75 through ex79 have the following signatures: 00000000111111111, 11111100000111111, 00011110001111000, 00001110101110000, and 00000011111000000. They were generated by ABC using the command symfun ⟨signature⟩.

Table II shows the rules used to generate the last 20 benchmarks. Each of the 10 rows of the table contains two groups of labels, which were compared to generate one test-case. Group A results in value 0 at the output, while Group B results in value 1. The same groups were used for MNIST [13] and CIFAR-10 [14]. For example, benchmark ex81 compares odd and even labels in MNIST, while benchmark ex91 compares the same labels in CIFAR-10.

Table II: Group comparisons for MNIST and CIFAR-10
ex   Group A   Group B
0    0-4       5-9
1    odd       even
2    0-2       3-5
3    01        23
4    45        67
5    67        89
6    17        38
7    09        38
8    13        78
9    03        89

In generating the benchmarks, the goal was to fulfill the following requirements: (1) Create problems that are non-trivial to solve. (2) Consider practical functions, such as arithmetic logic and symmetric functions, extract logic cones from the available benchmarks, and derive binary classification problems from the MNIST and CIFAR-10 machine learning challenges. (3) Limit the number of AIG nodes in the solution to 5000, to prevent the participants from generating large AIGs and to have them concentrate instead on algorithmic improvements aiming at high solution quality using fewer nodes.

There was also an effort to discourage the participants from developing strategies for reverse-engineering the test-cases based on their functionality, for example, detecting that some test-cases are outputs of arithmetic circuits, such as adders or multipliers. Instead, the participants were encouraged to look for algorithmic solutions that handle arbitrary functions and produce consistently good solutions for every test-case independently of its origin.

IV. OVERVIEW OF THE VARIOUS APPROACHES

Team 1's solution is to take the best result among ESPRESSO, a LUT network, an RF, and pre-defined standard function matching (with some arithmetic functions). If the AIG size exceeds the limit, a simple approximation method is applied to the AIG.

ESPRESSO is used with an option to finish optimization after the first irredundant operation. The LUT network has several parameters: the number of levels, the number of LUTs in each level, and the size of each LUT. These parameters are incremented like a beam search as long as the accuracy improves. The number of estimators in the random forest is explored from 4 to 16.

A simple approximation method is used if the number of AIG nodes is more than 5000. The AIG is simulated with thousands of random input patterns, and the node which most frequently outputs 0 is replaced by constant-0, while taking the negation (replacing with constant-1) into account. This is repeated until the AIG size meets the condition. The nodes near the outputs are excluded from the candidates by setting a threshold on levels. The threshold is explored through trial and error. It was observed that the accuracy drops by about 5% when the AIG is reduced by 3000-5000 nodes.

Team 2's solution uses the J48 and PART classifiers to learn the unknown Boolean function from a single training set that combines the training and validation sets. The algorithm first transforms the PLA file into an ARFF (Attribute-Relation File Format) description for the WEKA tool [15]. We used the WEKA tool to run five different configurations of the J48 classifier and five configurations of the PART classifier, varying the confidence factor. The J48 classifier creates a decision tree that the developed software converts into a PLA file; the ABC tool then transforms the PLA file into an AIG file. The PART classifier creates a set of rules that the developed software converts into an AAG file; the AIGER tool then transforms the AAG file into an AIG file, so that the best configuration for each classifier can be determined. Also, we use the minimum number of objects to determine the best classifier. Finally, the ABC tool checks the size of the generated AIGs to match the contest requirements.

Team 3's solution consists of decision-tree-based and neural-network (NN)-based methods. For each benchmark, multiple models are trained and 3 are selected for the ensemble. For the DT-based method, the fringe feature extraction process proposed in [16, 17] is adopted. The DT is trained and modified for multiple iterations. In each iteration, the patterns near the fringes (leaf nodes) of the DT are identified as composite features of 2 decision variables. These newly detected features are then added to the list of decision variables for the DT training in the next iteration. The procedure terminates when no new features are found or the number of extracted features exceeds the preset limit.

For the NN-based method, a 3-layer network is employed, where each layer is fully-connected and uses sigmoid as the activation function. As the synthesized circuit size of a typical NN could be quite large, the connection pruning technique proposed in [18] is adopted to meet the stringent size restriction. The NN is pruned until the number of fanins of each neuron is at most 12. Each neuron is then synthesized into a LUT by rounding its activation [11]. The overall dataset (training and validation sets combined) for each benchmark is re-divided into 3 partitions before training. Two partitions are selected as the new training set, and the remaining one as the new validation set, resulting in 3 different grouping configurations. Under each configuration, multiple models are trained with different methods and hyper-parameters, and the one with the highest validation accuracy is chosen for the ensemble.

Team 4's solution is based on multi-level ensemble-based feature selection, recommendation-network-based model training, subspace-expansion-based prediction, and accuracy-node joint exploration during synthesis.

Given the high sparsity of the high-dimensional Boolean space, a multi-level feature importance ranking is adopted to reduce the learning space. Level 1: a 100-ExtraTree based AdaBoost [19] ensemble classifier is used with 10-repeat permutation importance [20] ranking to select the top-k important features, where k ∈ [10, 16]. Level 2: a 100-ExtraTree based AdaBoost classifier and an XGB classifier with 200 trees are used with stratified 10-fold cross-validation to select the top-k important features, where k ranges from 10 to 16, given the 5,000 node constraint.

Based on the above 14 groups of selected features, 14 state-of-the-art recommendation models, Adaptive Factorization Networks (AFN) [21], are independently learned as DNN-based Boolean function approximators. A 128-dimensional logarithmic neural network is used to learn sparse Boolean feature interactions, and a 4-layer MLP is used to combine the formed cross features, with overfitting handled by fine-tuned dropout. After training, a k-feature trained model predicts the output for all 2^k input combinations to expand the full k-dimensional hypercube, where the other, pruned features are set to DON'T CARE in the predicted .pla file to allow enough smoothness in the Boolean hypercube. Such a subspace expansion technique can fully leverage the prediction capability of the model to maximize the accuracy on the validation/test dataset while constraining the maximum number of product terms for node minimization during synthesis.

Team 5's solution explores the use of DTs and RFs, along with NNs, to learn the required Boolean functions. DTs/RFs are easy to convert into SOP expressions. To evaluate this proposal, the implementation obtains the models using the Scikit-learn Python library [22]. The solution is chosen from simulations using DecisionTreeClassifier for the DTs, and an ensemble of DecisionTreeClassifier for the RFs (the RandomForestClassifier structure would be inconvenient, considering the 5000-gate limit, given that it employs a weighted average of each tree).

The simulations are performed using different tree depths and feature selection methods (SelectKBest and SelectPercentile). NNs are also employed to enhance the exploration capabilities, using the MLPClassifier structure. Given that SOPs cannot be directly obtained from the output of the NN employed, the NN is used as a feature selection method to obtain the importance of each input based on its weight values. With a small sub-set of weights obtained from this method, the proposed solution performs a small exhaustive search by applying combinations of functions on the four features with the highest importance, considering OR, XOR, AND, and NOT functions. The SOP with the highest accuracy (respecting the 5000-gate limit) out of the DTs/RFs and NNs tested was chosen to be converted to an AIG file. The data sets were split into an 80%-20% ratio, preserving the original data set's target distribution. The simulations were run using half of the newly obtained training set (40%) and the whole training set, to increase the exploration.

Team 6's solution learns the unknown Boolean function using the method described in [11]. In order to construct the LUT network, we use the minterms as input features to construct layers of LUTs with connections starting from the input layer. We then carry out two schemes of connections between the layers: 'random set of inputs' and 'unique but random set of inputs'. By 'random set of inputs', we imply that we just randomly select the outputs of the preceding layer and feed them to the next layer; this is the default flow. By 'unique but random set of inputs', we mean that we ensure that all outputs from a preceding layer are used before any connection is duplicated. We carry out experiments with four hyper-parameters to achieve accuracy: the number of inputs per LUT, the number of LUTs per layer, the selection of connecting edges from the preceding layer to the next layer, and the depth (number of LUT layers) of the model. We experiment with a varying number of inputs for each LUT in order to get the maximum accuracy. We notice from our experiments that 4-input LUTs return the best average numbers across the benchmark suite.

Once the network is created, we convert it into an SOP form using the sympy package in Python. This is done in reverse topological order, starting from the outputs back to the inputs. Using the SOP form, we generate a Verilog file, which is then used with ABC to calculate the accuracy.

Team 7's solution is a mix of conventional ML and pre-defined standard function matching. If a training set matches a pre-defined standard function, a custom AIG of the identified function is written out. Otherwise, an ML model is trained and translated to an AIG.

Team 7 adopts tree-based ML models for the straightforward conversion from tree nodes to SOP terms. The model is either a decision tree with unlimited depth, or an extreme gradient boosting (XGBoost) ensemble of 125 trees with a maximum depth of five, depending on the results of a 10-fold cross-validation on the training data.

With the learned model, all underlying tree leaves are converted to SOP terms, which are minimized and compiled to AIGs with ESPRESSO and ABC, respectively. If the model is a decision tree, the converted AIG is final. If the model is XGBoost, the value of each tree leaf is first quantized to one bit, and then aggregated with a 3-layer network of 5-input majority gates for an efficient AIG implementation.

Tree-based models may not perform well on symmetric functions or complex arithmetic functions. However, patterns in the importance of input bits can be observed for some pre-defined standard functions such as adders, comparators, and outputs of XOR or MUX. Before ML, Team 7 checks whether the training data come from a symmetric function, and compares the training data with each identified special function. In case of a match, an AIG of the identified function is constructed directly without ML.

Team 8's solution is an ensemble drawing from multiple classes of models. It includes a multi-layer perceptron (MLP), a binary decision tree (BDT) augmented with functional decomposition, and an RF. These models are selected to capture various types of circuits. For all benchmarks, all models are trained independently, and the model with the best validation accuracy that results in a circuit with under 5000 gates is selected. The MLP uses a periodic activation instead of the traditional ReLU to learn additional periodic features in the input. It has three layers, with the number of neurons divided in half between each layer. The BDT is a customized implementation of the C4.5 tree that has been modified with functional decomposition in the cases where the information gain is below a threshold. The RF is a collection of 17 trees limited to a maximum depth of 8. The RF helps especially in the cases where the BDT overfits.

After training, the AIGs of the trained models are generated to ensure they are under 5000 gates. In all cases, the generated AIGs are simplified using the Berkeley ABC tool to produce the final AIG graph.

Team 9 proposes a bootstrapped flow that explores the search algorithm Cartesian Genetic Programming (CGP). CGP is an evolutionary approach proposed as a generalization of Genetic Programming used in the digital circuit domain. It is called Cartesian because the candidate solutions are composed of a two-dimensional network of nodes. CGP is a population-based approach often using the (1+λ)-ES evolution strategy for searching the parameter space. Each individual is a circuit, represented by a two-dimensional integer matrix describing the functions and the connections among nodes.

The proposed flow decides between two initializations: 1) starting the CGP search from random (unbiased) individuals seeking optimal circuits; or 2) exploring a bootstrapped initialization with individuals generated from previously optimized SOPs created by decision trees or ESPRESSO, when they provide AIGs with more than 55% accuracy. The flow restricts the node functions to XORs, ANDs, and Inverters; in other words, it may use AIGs or XAIGs to learn the circuits. Variation is added to the individuals through mutations. When starting from a bootstrapped SOP, the circuit is fine-tuned with the whole training set. When the random initialization is used, it was tested with multiple configurations of sizes and mini-batches of the training set that change based on the number of generations processed.

Team 10's solution learns Boolean function representations using DTs. We developed a Python program using the Scikit-learn library where the parameter max_depth serves as an upper bound on the growth of the trees, and is set to 8. The training set PLA, treated as a numpy matrix, is used to train the DT. The validation set PLA is then used to test whether the obtained DT meets the minimum validation accuracy, which we empirically set to 70%. If this condition is not met, the validation set is merged with the training set. According to empirical evaluations, most of the benchmarks with accuracy < 70% showed a validation accuracy fluctuating around 50%, regardless of the size and shape of the DTs. This suggests that the training sets were not able to provide enough representative cases to effectively exploit the adopted technique, thus leading to DTs with very high training accuracy but negligible performance on unseen data. For DTs having a validation accuracy ≥ 70%, the tree structure is annotated as a Verilog netlist, where each DT node is replaced with a multiplexer. The obtained Verilog netlist is then processed with the ABC synthesis tool in order to generate a compact and optimized AIG structure. This approach has shown an average accuracy over the validation set of 84%, with an average AIG size of 140 nodes (and no AIG with more than 300 nodes). More detailed information about the adopted technique can be found in [24].

V. RESULTS

A. Accuracy

Table III shows the average accuracy of the solutions found by all the 10 teams, along with the average circuit size, the average number of levels in the circuit, and the overfit, measured as the average difference between the accuracy on the validation set and the test set. The following interesting observations can be made: (i) most of the teams achieved more than 80% accuracy; (ii) the teams were able to find circuits with much fewer gates than the specification allows.

Table III: Performance of the different teams.

When it comes to comparing network size vs. accuracy, there is no clear trend. For instance, teams 1 and 7 have similar accuracy with a very divergent number of nodes, as seen in Table III. For teams that relied on just one approach, such as Teams 10 and 2, who used only decision trees, it seems that more AND nodes might lead to better accuracy. Most of the teams, however, use a portfolio approach, and for each benchmark choose an appropriate technique. It is worth pointing out that there is no approach which is consistently better across all the considered benchmarks. Thus, applying several approaches and deciding which one to use depending on the target Boolean function seems to be the best strategy. Fig. 1 presents the approaches used by each team.

While the size of the network was not one of the optimization criteria in the contest, it is an important parameter considering the hardware implementation, as it impacts area, delay, and power. The average area reported by the individual teams is shown in Fig. 2 as '×'. Certain interesting observations can be made from Fig. 2. Apart from showing the average size reached by the various teams, it also shows the Pareto curve between the average accuracy across all benchmarks and the size in terms of the number of AND gates. It can be observed that while 91%

Fig. 2: Accuracy-size trade-off across teams and for the virtual best (legend: Pareto curve for virtual best; average accuracy by team; top accuracy achieved by Team 1, with sample points (537, 89.88) and (1140.76, 91.0); x-axis: number of And gates; y-axis: test accuracy).
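As a concrete point of reference for the decision-tree baselines discussed above, the following is a minimal sketch (not any team's actual code) of how one contest benchmark could be attacked: parse the care minterms from a PLA file, fit a depth-limited scikit-learn decision tree, and score it on the validation split. The file names and the simplistic PLA reader are illustrative assumptions.

```python
# Minimal decision-tree baseline for one contest benchmark (hypothetical file names).
from sklearn.tree import DecisionTreeClassifier

def read_pla(path):
    """Read input/output minterm pairs from a simple .pla file."""
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("."):      # skip .i/.o/.p/.ilb/.e headers
                continue
            bits, out = line.split()
            X.append([int(b) for b in bits])          # assumes fully specified care minterms
            y.append(int(out))
    return X, y

X_train, y_train = read_pla("ex00.train.pla")         # hypothetical paths
X_valid, y_valid = read_pla("ex00.valid.pla")

clf = DecisionTreeClassifier(max_depth=8)             # a depth limit keeps the derived AIG small
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_valid, y_valid))
```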
For the 6-word adder tree, the accuracy of the level-based method was around 80%. We came up with another heuristic: if both straight two-sided matching and complemented two-sided matching are available, the one with the smaller gain is used, under a bias of 100 nodes on the complemented matching. This heuristic increased the accuracy of the level-based method to 85-90%. However, none of the methods above obtained meaningful (more than 50%) accuracy for the 8-word adder tree.

We conclude that a BDD can learn a function if the BDD of its underlying function is small under some input order and we know that order. The main reason for minimization failure is that merging of inappropriate nodes is mistakenly performed due to a lack of contradictory patterns. Our heuristics prevent this to some degree. If we have a black-box simulator, simulating patterns to distinguish straight and complemented two-sided matching would be helpful. Reordering using don't-cares is another topic to explore.

Fig. 6: The resulting AIG size of the methods (ESPRESSO, RandomForest, LUTNetwork; x-axis: benchmark; y-axis: AIG size).
(AFN) [18], to fit the sparse Boolean dataset. Fig. 20 demonstrates the AFN structure and network configuration we use to fit the Boolean dataset. The embedding layer maps the high-dimensional Boolean feature to a 10-d space and transforms the sparse feature with a logarithmic transformation layer. In the logarithmic neural network,
multiple vector-wise logarithmic neurons are constructed to represent any cross features to obtain different higher-order input combinations [18],

y_j = \exp\Big( \sum_{i=1}^{m} w_{ij} \ln\big(\mathrm{Embed}(F(d))\big) \Big). \qquad (2)

Fig. 21: Evaluation results on IWLS 2020 benchmarks (validation accuracy in % and node count per benchmark).
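To make Eq. (2) concrete, here is a small NumPy sketch of the logarithmic-neuron transform. The embedding vector, weight shapes, and clipping constant are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def logarithmic_neurons(embedded, W, eps=1e-6):
    """embedded: (m,) positive embedding values (stand-in for Embed(F(d)));
    W: (m, n_neurons) learned exponents. Returns y_j = exp(sum_i W[i, j] * ln(embedded[i]))."""
    log_e = np.log(np.clip(embedded, eps, None))   # keep the log finite near zero
    return np.exp(log_e @ W)

emb = np.random.rand(10)                # illustrative 10-d embedding
W = np.random.randn(10, 128)            # 128 logarithmic neurons, as in the text
print(logarithmic_neurons(emb, W).shape)   # -> (128,)
```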
Three small-scale fully-connected layers are used to combine and transform the crossed features after the logarithmic transformation layers. Dropout layers after each hidden layer are used during training to improve the generalization on the unknown don't-care set.

Fig. 20: AFN [18] and configurations used for Boolean function approximation (input sparse Boolean feature → embedding (d to 10) → logarithmic transform → LNN 128 → exponential & concatenation → FC-80 + Dropout-0.5 → FC-64 + Dropout-0.5 → FC-64 + Dropout-0.4 → MLP classifier → binary output).

2) Inference with Sub-Space Expansion: After training the AFN-based function approximator, we need to generate the dot-product terms in the PLA file and synthesize the AIG representation. Since we ignore all pruned input dimensions, we only assume our model can generalize in the reduced d-dimensional hypercube. Hence, we predict all 2^d input combinations with our trained approximator, and set all other pruned inputs to the don't-care state. On 14 different feature groups F, we trained 14 different models {AFN_0, ..., AFN_d, ...}, sorted in descending order of validation accuracy. With the above 14 models, we predict 14 corresponding PLA files {P_0, ..., P_d, ...} with sub-space expansion to maximize the accuracy in our target space while minimizing the node count by pruning all other product terms, as shown in Fig. 18. In the ABC [19] tool, we use the node optimization command
sequence resyn2, resyn2a, resyn3, resyn2rs, and compress2rs.

At the first level of our feature engineering, we evaluate the feature importance. We pre-train an AdaBoost [16] ensemble classifier with 100 ExtraTree sub-classifiers on the training set to generate the importance score for all features. Then we perform permutation importance ranking [17] 10 times to select the top-d important features as the care set variables F^1(d). Given that the ultimate accuracy is sensitive to the feature selection, we generate another feature group at the second level to expand the search space. At the second level, we train two classifier ensembles, one an XGB classifier with 200 sub-trees and the other a 100-ExtraTree based AdaBoost classifier. Besides, a stratified 10-fold cross-validation is used to select the top-d important features F^2(d) based on the average scores from the above two models. The entire set of 14 candidate input feature groups for each benchmark is F = {F^1(d), F^2(d)}_{d=10}^{16}.

1) Deep Learning in the Sparse High-Dimensional Boolean Space: This learning problem is different from continuous-space learning tasks, e.g., time-sequence prediction or computer-vision-related tasks, since its inputs are binarized with poor smoothness, which means that high-frequency patterns in the input features, e.g., XNOR and XOR, are important to the model prediction. Besides, the extremely limited training set gives an under-sampling of the real distribution, such that a simple multi-layer perceptron is barely capable of fitting the dataset while still having good generalization. Therefore, motivated by

3) Accuracy-Node Joint Search with ABC: For each benchmark, we obtain multiple predicted PLA files to be selected based on the node constraints. We search for the PLA with the best accuracy that meets the node constraints,

P^{*} = \arg\max_{P \in \{\dots, P_d, \dots\}} \mathrm{Acc}(\mathrm{AIG}(P), D_{val}), \quad \text{s.t. } N(\mathrm{AIG}(P)) \le 5{,}000. \qquad (3)

If the accuracy is still very low, e.g., 60%, we re-split the dataset D_trn and D_val and go to step (1) again in Fig. 18.

4) Results and Analysis: Fig. 21 shows our validation accuracy and number of nodes after node optimization. Our model achieves high accuracy on most benchmarks, while on certain cases, regardless of the input count, our model fails to achieve the desired accuracy. An intuitive explanation is that the feature-pruning-based dimension reduction is sensitive to the feature selection. A repeated procedure of re-splitting the dataset may help find a good feature combination to improve accuracy.

C. Conclusion and Future Work

We introduce the detailed framework we use to learn the high-dimensional unknown Boolean function for the
IWLS'20 contest. Our recommendation-system-based model achieves a top-3 smallest generalization gap on the test set (0.48%), which is a suitable selection for this task. A future direction is to combine more networks and explore the unique characteristics of the various benchmarks.

Fig. 22: Design flow employed in this proposal (merge training and validation sets; obtain proportions; generate new training and validation sets; train NN to get important features; train DTs/RFs; generate SOP; generate EQN/AIG; reduce node count and logic level; evaluate the AIG; keep the AIG with the highest accuracy within the 5000-AND limit).

Based on preliminary testing, we found that RFs could not scale due to the contest's 5000-gate limitation. This was mainly due to the use of majority voting, considering that the preliminary expressions for it are too large to be combined with each other. Therefore, for this proposal, we opted to limit the number of trees used in the RFs to a value of three.

Other parameters in the DecisionTreeClassifier could be varied as well, such as the split metric, by changing the Gini metric to Entropy. However, preliminary analyses showed that both metrics led to very similar results. Since Gini was slightly better in most scenarios and is also less computationally expensive, it was chosen.

Even though timing was not a limitation described by the contest, we still had to provide a solution that we could verify was yielding the same AIGs as the ones submitted. Therefore, for every configuration tested, we had to test every example given in the problem. Hence, even though higher depths could be used without surpassing the 5000-gate limitation, we opted for 10 and 20 only, so that we could evaluate every example in a feasible time.
V. TEAM 5 Besides training the DTs and RFs with varying depths,
Authors: Brunno Alves de Abreu, Isac de Souza we also considered using feature selection methods
Campos, Augusto Berndt, Cristina Meinhardt, from the Scikit-learn library. We opted for the use
Jonata Tyska Carvalho, Mateus Grellert and Sergio of the SelectKBest and SelectP ercentile methods.
Bampi, Universidade Federal do Rio Grande do Sul, These methods perform a pre-processing in the features,
Universidade Federal de Santa Catarina, Brazil eliminating some of them prior to the training stage.
The SelectKBest method selects features according to
Fig. 22 presents the process employed in this proposal. the k highest scores, based on a score function, namely
In the first stage, the training and valid sets provided in the f classif , mutual inf o classif or chi2, according to
problem description are merged. The ratios of 0’s and 1’s the Scikit-learn library [12]. The SelectP ercentile is
in the output of the newly merged set are calculated, and similar but selects features within a percentile range given
we split the set into two new training and validation sets, as parameter, based on the same score functions [12].
considering an 80%-20% ratio, respectively, preserving We used the values of 0.25, 0.5, and 0.75 for k, and the
the output variable distribution. Additionally, a second percentiles considered were 25%, 50%, and 75%.
training set is generated, equivalent to half of the The solution employing neural networks (NNs) was
previously generated training set. These two training considered after obtaining the accuracy results for the
sets are used separately in our training process, and their DTs/RFs configurations. In this case, we train the model
accuracy results are calculated using the same validation using the M LP Classif ier structure, to which we used
set to enhance our search for the best models. This was the default values for every parameter. Considering that
done because the model with the training set containing NNs present an activation function in the output, which
80% of the entire set could potentially lead to models is non-linear, the translation to a SOP would not be
with overfitting issues, so using another set with 40% of possible using conventional NNs. Therefore, this solution
the entire provided set could serve as an alternative. only uses the NNs to obtain a subset of features based
After the data sets are prepared, we train the DTs on their importance, i.e., select the set of features with
and RFs models. Every decision model in this proposal the corresponding highest weights. With this subset of
uses the structures and methods from the Scikit-learn features obtained, we evaluate combinations of functions,
Python library. The DTs and RFs from this library use using ”OR,” ”AND,” ”XOR,” and ”NOT” operations
the Classification and Regression Trees (CART) algorithm among them. Due to the fact that the combinations of
to assemble the tree-based decision tools [20]. To train functions would not scale well, in terms of time, with the
the DT, we use the DecisionTreeClassifier structure,
limiting its max depth hyper-parameter to values of 10 four features. The number of expressions evaluated for
and 20 due to the 5000-gate limitation of the contest. The each NN model trained was 792. This part of the proposal
RF model was trained similarly, but with an ensemble was mainly considered given the difficulty of DTs/RFs
of DecisionTreeClassifier: we opted not to employ
the RandomForestClassifier structure given that it
applied a weighted average of the preliminary decisions solution was a XOR2 between two of the inputs, with
of each DT within it. Considering that this would require a 100% accuracy, we were able to slightly increase the
the use of multipliers, and this would not be suitable to maximum accuracy of other examples through this scan of
obtain a Sum-Of-Products (SOP) equivalent expression, functions. The parameters used by the M LP Classif ier
using several instances of the DecisionTreeClassifier
with a simple majority voter in the output was the choice activation function [12].
we adopted. In this case, each DT was trained with a The translation from the DT/RF to SOP was imple-
random subset of the total number of features.
every tree path, concatenating every comparison. When limitation, as it restrained us from using a higher number
it moves to the left child, this is equivalent to a ”true” of trees. The NNs were mainly useful to solve XOR2
result in the comparison. In the right child, we need a problems and in problems whose best accuracy results
”NOT” operator in the expression as well. In a single from DTs/RFs were close to 50%. It can also be observed
path to a leaf node, the comparisons are joined through that the use of the feature selection methods along with
an ”AND” operation, given that the leaf node result will DTs/RFs was helpful; therefore, cropping a sub-set of
only be true when all the comparisons conditions are true. features based on these methods can definitely improve
However, given that this is a binary problem, we only the model obtained by the classifiers. The number of best
consider the ”AND” expression of a path when the leaf examples based on the different scoring functions used in
leads to a value of 1. After that, we perform an ”OR” the feature selection methods shows that the chi2 function
operation between the ”AND” expressions obtained for was the most useful. This is understandable given that
each path, which yields the final expression of that DT. this is the default function employed by the Scikit-learn
The RF scenario works the same way, but it considers library. Lastly, even though the proportions from 80%-
the expression required for the majority gate as well, 20% represented the majority of the best results, it can
whose inputs are the ”OR” expressions of each DT. be seen that running every model with a 40%-20% was
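The recursive path-walk just described can be sketched against scikit-learn's tree_ arrays as follows. This is a hedged illustration of the idea, not the team's exact program; it assumes binary 0/1 features, so the left branch corresponds to the negated literal.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_to_sop(clf, feature_names):
    """Walk every root-to-leaf path of a fitted DecisionTreeClassifier and
    collect the AND terms of paths whose leaf predicts 1, joined by OR."""
    tree = clf.tree_
    terms = []

    def walk(node, literals):
        if tree.children_left[node] == -1:            # leaf node
            if tree.value[node][0].argmax() == 1:     # keep only paths that predict 1
                terms.append(" AND ".join(literals) or "1")
            return
        f = feature_names[tree.feature[node]]
        walk(tree.children_left[node], literals + [f"NOT {f}"])   # x <= 0.5, i.e. x = 0
        walk(tree.children_right[node], literals + [f])           # x > 0.5, i.e. x = 1

    walk(0, [])
    return " OR ".join(f"({t})" for t in terms) or "0"
```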
From the SOP, we obtain the AIG file. The generated also a useful approach.
AIG file is then optimized using commands of the ABC
tool [14], attempting to reduce the number of AIG nodes and the number of logic levels by performing iterative collapsing and refactoring of logic cones in the AIG, and rewrite operations. Even if these commands could be iteratively executed, we decided to run them only once, given that the 5000-gate limitation was not a significant issue for our best solutions, and a single sequence of them was enough for our solutions to adhere to the restriction. Finally, we run the AIG file using the AIG evaluation commands provided by ABC to collect the desired results of accuracy, number of nodes, and number of logic levels for both validation sets generated at the beginning of our flow.

Table VI: Number of best examples based on the characteristics of the configurations.
Decision Tool:      DT 55, RF 28, NN 17
Feature Selection:  SelectKBest 48, SelectPercentile 11, None 41
Scoring Function:   chi2 34, f_classif 6, mutual_info_classif 19, None 41
Proportion:         40%-20% 23, 80%-20% 77

All experiments were performed three times with different seed values using the Numpy random seed method. This was necessary given that the
DecisionTreeClassifier structure has a degree of
randomness. Considering it was used for both DTs and
RFs, this would yield different results at each execution. VI. TEAM 6
The M LP Classif ier also inserts randomnesses in the Authors: Aditya Lohana, Shubham Rai and Akash Ku-
initialization of weights and biases. Therefore, to ensure mar, Chair for Processor Design, Technische Universitaet
that the contest organizers could perfectly replicate the Dresden, Germany
code with the same generated AIGs, we fixed the seeds To start of with, we read the training pla files using
to values of 0, 1, and 2. Therefore, we evaluated two ABC and get the result of &mltest directly on the train
classifiers (DTs/RFs), with two maximum depths, two data. This gives the upper limit of the number of AND
different proportions, and three different seeds, which gates which are used. Then, in order to learn the unknown
leads to 24 configurations. For each of the SelectKBest Boolean function, we have used the method as mentioned
and SelectP ercentile methods, considering that we ana- in [11]. We used LUT network as a means to learn from
lyzed three values of K and percentage each, respectively, the known set to synthesize on an unknown set. We
along with three scoring functions, we have a total of 18 use the concept of memorization and construct a logic
feature selection methods. Given that we also tested the network.
models without any feature selection method, we have In order to construct the LUT network, we use the
19 possible combinations. By multiplying the number minterms as input features to construct layers of LUTs
of configurations (24) with the number of combinations with connections starting from the input layer. We then
with and without feature selection methods (adding up try out two schemes of connections between the layers:
to 19), we obtain a total of 456 different DT/RF models ‘random set of input’ and ‘unique but random set of
being evaluated. Additionally, each NN, as mentioned, inputs’. By ‘random set of inputs’, we imply that we
evaluated 792 expressions. These were also tested with just randomly select the outputs of preceding layer and
two different training set proportions and three different feed it to the next layer. This is the default flow. By
seeds, leading to a total of 4752 expressions for each of ‘unique but random set of inputs’, we mean that we
the 100 examples of the contest. ensure that all outputs from a preceding layer is used
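The configuration counts quoted above can be checked with a few lines of arithmetic:

```python
# Verify the model counts stated in the text.
configs = 2 * 2 * 2 * 3        # {DT, RF} x {depth 10, 20} x {40%-20%, 80%-20%} x 3 seeds = 24
selectors = 2 * 3 * 3 + 1      # {SelectKBest, SelectPercentile} x 3 values x 3 score functions, plus "none" = 19
print(configs * selectors)     # 456 DT/RF models
print(792 * 2 * 3)             # 4752 NN-derived expressions (792 per model, 2 proportions, 3 seeds)
```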
Table VI presents some additional information on the before duplication of connection. This obviously makes
configurations that presented the best accuracy for the sense when the number of connections is more than the
examples from the contest. As it can be seen, when we number of outputs of the preceding layer.
split the 100 examples by the decision tool employed, We have four hyper parameters to experiment with in
most of them obtained the best accuracy results when order to achieve good accuracy– number of inputs per
using DTs, followed by RFs and NNs, respectively. The LUT, number of LUTS per layers, selection of connecting
use of RFs was not as significant due to the 5000-gate edges from the preceding layer to the next layer and the
VII. TEAM 7: LEARNING WITH TREE-BASED MODELS AND EXPLANATORY ANALYSIS

Authors: Wei Zeng, Azadeh Davoodi, and Rasit Onur Topaloglu, University of Wisconsin–Madison, IBM, USA

Team 7's solution is a mix of conventional machine learning (ML) and pre-defined standard function matching. If the training set matches a pre-defined standard function, a custom AIG of the identified function is written out. Otherwise, an ML model is trained and translated to an AIG.

Fig. 23: (a) A decision tree and (b) its corresponding PLA. The PLA reads: .i 3 / .o 1 / .p 3 / .ilb x1 x2 x3 / 0-0 1 / 0-1 0 / 1-- 1 / .e
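A hedged sketch of the model selection step Team 7 describes below, a single unlimited-depth decision tree versus an XGBoost ensemble of 125 depth-5 trees chosen by 10-fold cross-validation, could look as follows (function and variable names are illustrative, not the team's code):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def pick_model(X, y):
    """Return the model (and its CV accuracy) that wins 10-fold cross-validation."""
    dt = DecisionTreeClassifier()                        # unlimited depth
    xgb = XGBClassifier(n_estimators=125, max_depth=5)   # as described in the text
    dt_acc = cross_val_score(dt, X, y, cv=10).mean()
    xgb_acc = cross_val_score(xgb, X, y, cv=10).mean()
    return (dt, dt_acc) if dt_acc >= xgb_acc else (xgb, xgb_acc)
```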
Team 7 adopts tree-based ML models, considering the
straightforward correspondence between tree nodes and
x1? x2? SOP terms. The model is either a single decision tree
0 1 0 1 with unlimited depth, or an extreme gradient boosting
(XGBoost) [21] of 125 trees with a maximum depth of
x3? +0.3 + x1? x3? +… 5, depending on the results of a 10-fold cross validation
0 1 0 1 0 1 on training data.
It is straightforward to convert a decision tree to SOP
+0.8 -0.2 +0.3 -0.5 -0.7 +0.6 terms used in PLA. Fig. 23 shows a simple example of
(a)
decision tree and its corresponding SOP terms. Numbers
in leaves (rectangular boxes) indicate the predicted output
x1? x2? values. XGBoost is a more complex model in two aspects.
0 1 0 1 First, it is a boosting ensemble of many shallow decision
trees, where the final output is the sum of all leaf values
x3? 1 + x1? x3? +… that a testing data point falls into, expressed in log odds
0 1 0 1 0 1 log[P (y = 1)/P (y = 0)]. Second, since the output is
1 0
a real number instead of a binary result, a threshold of
1 0 0 1
classification is needed to get the binary prediction (the
(b) default value is 0). Fig. 24(a) shows a simple example
of a trained XGBoost of trees.
Fig. 24: XGBoost of decision trees, (a) before and (b) after With the trained model, each underlying tree is
quantization of leaf values. Plus signs mean to add up resulting converted to a PLA, where each leaf node in the tree
leaf values, one from each tree. corresponds to a SOP term in the PLA. Each PLA are
minimized and compiled to an AIG with the integrated
espresso and ABC, respectively. If the function is
depth (number of LUT layers) of the model. We carry learned by a single decision tree, the converted AIG is
out experiments with varying number of inputs for each final. If the function is learned by XGBoost of trees, the
LUT in order to get the maximum accuracy. We notice exact final output of a prediction would be the sum of the
from our experiments that 4-input LUTs returns the best 125 leaves (one for each underlying tree) where the testing
average numbers across the benchmark suite. We also data point falls into. In order to implement the AIGs
found that increasing the number of LUTs per layer or efficiently, the model is approximated in the following
number of layers does not directly increases the accuracy. two steps. First, the value of each underlying tree leaf
This is due to the fact, because increasing the number is quantized to one bit, as shown in Fig. 24(b). Each
of LUTs allows repeatability of connections from the test data point will fall into 125 specific leaves, yielding
preceding layer. This leads to passing of just zeros/ones 125 output bits. To mimic the summation of leaf values
to the succeeding layers. Hence, the overall network tends and the default threshold of 0 for classification, a 125-
towards a constant zero or constant one. input majority gate could be used to get the final output.
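The LUT-network memorization idea used in this section can be illustrated with the simplified sketch below. The data structures are assumptions (each LUT keeps a dictionary truth table filled from training patterns), and a fuller version would resolve conflicting entries by a majority vote over the training patterns, as in [11].

```python
import random

def build_layer(n_luts, k, n_prev):
    # each LUT draws k random inputs from the previous layer; truth table starts empty
    return [{"ins": random.sample(range(n_prev), k), "tt": {}} for _ in range(n_luts)]

def forward(layers, x, label=None):
    """Evaluate the network on pattern x; if label is given, memorize it in every
    truth-table entry the pattern touches (first write wins in this simplified sketch)."""
    vals = list(x)
    for layer in layers:
        nxt = []
        for lut in layer:
            key = tuple(vals[i] for i in lut["ins"])
            if label is not None:
                lut["tt"].setdefault(key, label)
            nxt.append(lut["tt"].get(key, 0))   # unseen entries default to 0
        vals = nxt
    return vals[0]

# Hypothetical usage: 16 primary inputs, three layers of 4-input LUTs.
# layers = [build_layer(32, 4, 16), build_layer(8, 4, 32), build_layer(1, 4, 8)]
# for x, y in training_minterms: forward(layers, x, label=y)
# prediction = forward(layers, x_new)
```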
After coming up with an appropriate set of connections, However, again for efficient implementation, the 125-
we create the whole network. Once we create the network, input majority gate is further approximated by a 3-layer
we convert the network into an SOP form using sympy network of 5-input majority gates as shown in Fig. 25.
package in python. This is done from reverse topological In the above discussion, it is assumed that the gate
order starting from the outputs back to the inputs. We count does not exceed the limit of 5000. If it is not the
then pass the SOP form of the final output and create a case, the maximum depth of the decision tree and the
verilog file out of that. Using the verilog file, we convert trees in XGBoost, and/or the number of trees in XGBoost,
it into AIG format and then use the ABC command to list can be reduced at the cost of potential loss of accuracy.
out the output of the &mltest. We also report accuracy Tree-based models may not perform well in some
of our network using sklearn. pre-defined standard functions, especially symmetric
We have also incorporated use of data from validation functions and complex arithmetic functions. However,
testcase. For our training model, we have used ‘0.4’ part symmetric functions are easy to be identified by com-
of the minterms in our training. paring the number of ones and the output bit. And it
can be implemented by adding a side circuit that counts
N1 , i.e., the number of ones in the input bits, and a
decision tree that learns the relationship between N1 and
the original output. For arithmetic functions, patterns in
4 the importance of input bits can be observed in some pre-
defined standard functions, such as adders, comparators,
outputs of XOR or MUX. This is done by training an
2 initial XGBoost of trees and use SHAP tree explainer
[1] to evaluate the importance of each input bit. Fig. 26
shows that SHAP importance shows a pattern in training
Mean SHAP
2.0
ideal targets for decision trees. In the rest of this section
1.5 we introduce our binary decision tree implementation, and
1.0
provide some insights on the connection between decision
tree and Espresso [23], a successful logic minimization
0.5
algorithm.
0.0
0 10 20 30 40 50 60
Input bit
Table VII: Hyper-Parameters Dependent on Initialization (recovered entries: 500, 1000, 1024, 2000, 5000; Random; 80/20; Complete Train Set).

X. TEAM 10

Authors: Valerio Tenace, Walter Lau Neto, and Pierre-Emmanuel Gaillardon, University of Utah, USA
In order to learn incompletely specified Boolean functions for the contest, we decided to resort to decision trees (DTs). Fig. 31 presents an overview of the adopted design flow.

Fig. 31: Overview of the adopted design flow (training PLA and validation PLA → DT training → training augmentation → accuracy check with a custom Python flow: if validation accuracy ≤ 70%, merge the sets and retrain; if validation accuracy > 70%, proceed to Verilog generation).

copies of the next generation. Variation is added to the individuals through mutations. The mutation rate is under optimization according to the 1/5th rule [37], in which the mutation rate varies jointly with the proportion of individuals being better or worse than their ancestor. The CGP was run with three main hyper-parameters: (1) the number of generations, with values of 10, 20, 25, 50 and 100 thousand; (2) the two logic structures available, AIG and XAIG; and (3) the option to check all nodes as being a PO during every training generation. The latter is a computationally intensive strategy, but some exemplars demonstrated good results with it. The CGP flow runs exhaustively, combining these three main hyper-parameter options.
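A minimal sketch of the Team 10 flow summarized above, train a depth-8 decision tree, accept it only if validation accuracy reaches 70%, and otherwise merge the validation set into the training set and retrain, assuming list-like datasets and hypothetical variable names:

```python
from sklearn.tree import DecisionTreeClassifier

def team10_style_flow(X_train, y_train, X_valid, y_valid, threshold=0.70):
    """Train a depth-limited DT; fall back to training on the merged sets if
    the validation accuracy is below the empirically chosen threshold."""
    clf = DecisionTreeClassifier(max_depth=8)
    clf.fit(X_train, y_train)
    if clf.score(X_valid, y_valid) >= threshold:
        return clf                      # next step: annotate as a Verilog netlist of MUXes
    clf = DecisionTreeClassifier(max_depth=8)
    clf.fit(X_train + X_valid, y_train + y_valid)   # merge validation into training
    return clf
```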