0% found this document useful (0 votes)
49 views23 pages

Logic Synthesis Meets Machine Learning: Trading Exactness For Generalization

This document describes a competition held at IWLS 2020 that challenged participants to synthesize logic circuits for Boolean functions. Participants were given training examples of input-output pairs for each function and had to design a circuit no larger than 5000 nodes that generalized well to unseen examples. The goal was to explore connections between logic synthesis and machine learning by framing logic synthesis as a supervised learning problem of finding a compact circuit that fits training data while also generalizing. A benchmark set of 100 functions was created from domains like arithmetic circuits, random logic, and machine learning to evaluate the algorithms.

Uploaded by

Guru Velmathi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views23 pages

Logic Synthesis Meets Machine Learning: Trading Exactness For Generalization

This document describes a competition held at IWLS 2020 that challenged participants to synthesize logic circuits for Boolean functions. Participants were given training examples of input-output pairs for each function and had to design a circuit no larger than 5000 nodes that generalized well to unseen examples. The goal was to explore connections between logic synthesis and machine learning by framing logic synthesis as a supervised learning problem of finding a compact circuit that fits training data while also generalizing. A benchmark set of 100 functions was created from domains like arithmetic circuits, random logic, and machine learning to evaluate the algorithms.

Uploaded by

Guru Velmathi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Logic Synthesis Meets Machine Learning:

Trading Exactness for Generalization


Shubham Raif,6,† , Walter Lau Neton,10,† , Yukio Miyasakao,1 , Xinpei Zhanga,1 , Mingfei Yua,1 , Qingyang Yia,1 ,
Masahiro Fujitaa,1 , Guilherme B. Manskeb,2 , Matheus F. Pontesb,2 , Leomar S. da Rosa Juniorb,2 ,
Marilton S. de Aguiarb,2 , Paulo F. Butzene,2 , Po-Chun Chienc,3 , Yu-Shan Huangc,3 , Hoa-Ren Wangc,3 ,
Jie-Hong R. Jiangc,3 , Jiaqi Gud,4 , Zheng Zhaod,4 , Zixuan Jiangd,4 , David Z. Pand,4 , Brunno A. de Abreue,5,9 ,
Isac de Souza Camposm,5,9 , Augusto Berndtm,5,9 , Cristina Meinhardtm,5,9 , Jonata T. Carvalhom,5,9 ,
Mateus Grellertm,5,9 , Sergio Bampie,5 , Aditya Lohanaf,6 , Akash Kumarf,6 , Wei Zengj,7 , Azadeh Davoodij,7 ,
arXiv:2012.02530v2 [cs.LG] 15 Dec 2020

Rasit O. Topalogluk,7 , Yuan Zhoul,8 , Jordan Dotzell,8 , Yichi Zhangl,8 , Hanyu Wangl,8 , Zhiru Zhangl,8 ,
Valerio Tenacen,10 , Pierre-Emmanuel Gaillardonn,10 , Alan Mishchenkoo,† , and Satrajit Chatterjeep,†
a
University of Tokyo, Japan, b Universidade Federal de Pelotas, Brazil, c National Taiwan University,
Taiwan, d University of Texas at Austin, USA, e Universidade Federal do Rio Grande do Sul, Brazil,
f
Technische Universitaet Dresden, Germany, j University of Wisconsin–Madison, USA, k IBM, USA,
l
Cornell University, USA, m Universidade Federal de Santa Catarina, Brazil, n University of Utah, USA,
o
UC Berkeley, USA, p Google AI, USA
The alphabetic characters in the superscript represent the affiliations while the digits represent the team numbers

Equal contribution. Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Logic synthesis is a fundamental step in hard- artificial intelligence. A central problem in machine
ware design whose goal is to find structural representations learning is that of supervised learning: Given a class
of Boolean functions while minimizing delay and area. H of functions from a domain X to a co-domain Y , find
If the function is completely-specified, the implementa-
tion accurately represents the function. If the function is a member h ∈ H that best fits a given set of training
incompletely-specified, the implementation has to be true examples of the form (x, y) ∈ X × Y . The quality of the
only on the care set. While most of the algorithms in logic fit is judged by how well h generalizes, i.e., how well h
synthesis rely on SAT and Boolean methods to exactly fits examples that were not seen during training.
implement the care set, we investigate learning in logic
synthesis, attempting to trade exactness for generalization. Thus, logic synthesis and machine learning are closely
This work is directly related to machine learning where related. Supervised machine learning can be seen as logic
the care set is the training set and the implementation synthesis of an incompletely specified function with a
is expected to generalize on a validation set. We present different constraint (or objective): the circuit must also
learning incompletely-specified functions based on the re- generalize well outside the careset (i.e., to the test set)
sults of a competition conducted at IWLS 2020. The goal
of the competition was to implement 100 functions given possibly at the expense of reduced accuracy on the careset
by a set of care minterms for training, while testing the (i.e., on the training set). Conversely, logic synthesis may
implementation using a set of validation minterms sampled be seen as a machine learning problem where in addition
from the same function. We make this benchmark suite to generalization, we care about finding an element of H
available and offer a detailed comparative analysis of the that has small size, and the sets X and Y are not smooth
different approaches to learning.
but discrete.
I. I NTRODUCTION To explore this connection between the two fields, the
two last authors of this paper organized a programming
Logic synthesis is a key ingredient in modern electronic contest at the 2020 International Workshop in Logic
design automation flows. A central problem in logic Synthesis. The goal of this contest was to come up
synthesis is the following: Given a Boolean function with an algorithm to synthesize a small circuit for a
f : Bn → B (where B denotes the set {0, 1}), construct Boolean function f : Bn → B learnt from a training set
a logic circuit that implements f with the minimum of examples. Each example (x, y) in the training set is an
number of logic gates. The function f may be completely input-output pair, i.e., x ∈ Bn and y ∈ B. The training set
specified, i.e., we are given f (x) for all x ∈ Bn , or it may was chosen at random from the 2n possible inputs of the
be incompletely specified, i.e., we are only given f (x) function (and in most cases was much smaller than 2n ).
for a subset of Bn called the careset. An incompletely The quality of the solution was evaluated by measuring
specified function provides more flexibility for optimizing accuracy on a test set not provided to the participants.
the circuit since the values produced by the circuit outside
the careset are not of interest. The synthesized circuit for f had to be in the form
Recently, machine learning has emerged as a key of an And-Inverter Graph (AIG) [1, 2] with no more
enabling technology for a variety of breakthroughs in than 5000 nodes. An AIG is a standard data structure
used in logic synthesis to represent Boolean functions
This work was supported in part by the Semiconductor Research where a node corresponds to a 2-input And gate and
Corporation under Contract 2867.001. edges represent direct or inverted connections. Since an
Table I: An overview of different types of functions in
AIG can represent any Boolean function, in this problem the benchmark set. They are selected from three domains:
H is the full set of Boolean functions on n variables. Arithmetic, Random Logic, and Machine Learning.
To evaluate the algorithms proposed by the participants, 00-09 2 MSBs of k-bit adders for k ∈ {16, 32, 64, 128, 256}
we created a set of 100 benchmarks drawn from a 10-19 MSB of k-bit dividers and remainder circuits for k ∈ {16, 32, 64, 128, 256}
mix of standard problems in logic synthesis such as 20-29 MSB and middle bit of k-bit multipliers for k ∈ {8, 16, 32, 64, 128}
synthesis of arithmetic circuits and random logic from 30-39 k-bit comparators for k ∈ {10, 20, . . . , 100}
40-49 LSB and middle bit of k-bit square-rooters with k ∈ {16, 32, 64, 128, 256}
standard logic synthesis benchmarks. We also included 50-59 10 outputs of PicoJava design with 16-200 inputs and roughly balanced onset & offset
some tasks from standard machine learning benchmarks. 60-69 10 outputs of MCNC i10 design with 16-200 inputs and roughly balanced onset & offset
For each benchmark the participants were provided with 70-79 5 other outputs from MCNC benchmarks + 5 symmetric functions of 16 inputs
the training set (which was sub-divided into a training set 80-89 10 binary classification problems from MNIST group comparisons
90-99 10 binary classification problems from CIFAR-10 group comparisons
proper of 6400 examples and a validation set of another
6400 examples though the participants were free to use
these subsets as they saw fit), and the circuits returned
by their algorithms were evaluated on the corresponding false value otherwise. The threshold value is defined
test set (again with 6400 examples) that was kept private during training. Hence, each internal node can be seen
until the competition was over. The training, validation as a multiplexer, with the selector given by the threshold
and test sets were created in the PLA format [3]. The value. Random forests are composed by multiple decision
score assigned to each participant was the average test trees, where each tree is trained over a distinct feature,
accuracy over all the benchmarks with possible ties being so that trees are not very similar. The output is given by
broken by the circuit size. the combination of individual predictions.
Ten teams spanning 6 countries took part in the contest. Look-up Table (LUT) Network is a network of
They explored many different techniques to solve this randomly connected k-input LUTs, where each k-input
problem. In this paper we present short overviews of the LUT can implement any function with up to k variables.
techniques used by the different teams (the superscript LUT networks were first employed in a theoretical
for an author indicates their team number), as well a study to understand if pure memorization (i.e., fitting
comparative analysis of these techniques. The following without any explicit search or optimization) could lead
are our main findings from the analysis: to generalization [11].
• No one technique dominated across all the bench- III. B ENCHMARKS
marks, and most teams including the winning team
used an ensemble of techniques. The set of 100 benchmarks used in the contest can
• Random forests (and decision trees) were very
be broadly divided into 10 categories, each with 10 test-
popular and form a strong baseline, and may be cases. The summary of categories is shown in Table I. For
a useful technique for approximate logic synthesis. example, the first 10 test-cases are created by considering
• Sacrificing a little accuracy allows for a significant
the two most-significant bits (MSBs) of k-input adders
reduction in the size of the circuit. for k ∈ {16, 32, 64, 128, 256}.
Test-cases ex60 through ex69 were derived from
These findings suggest an interesting direction for future MCNC benchmark [12] i10 by extracting outputs 91,
work: Can machine learning algorithms be used for 128, 150, 159, 161, 163, 179, 182, 187, and 209 (zero-
approximate logic synthesis to greatly reduce power and based indexing). For example, ex60 was derived using
area when exactness is not needed? the ABC command line: &read i10.aig; &cone -O 91.
Finally, we believe that the set of benchmarks used Five test-cases ex70 through ex74 were similarly
in this contest along with the solutions provided by the derived from MCNC benchmarks cordic (both outputs),
participants (based on the methods described in this paper) too large (zero-based output 2), t481, and parity.
provide an interesting framework to evaluate further Five 16-input symmetric functions used in ex75
advances in this area. To that end we are making these through ex79 have the following signatures:
available at https://github.com/iwls2020-lsml-contest/. 00000000111111111, 11111100000111111,
00011110001111000, 00001110101110000, and
II. BACKGROUND AND P RELIMINARIES 00000011111000000.
We review briefly the more popular techniques used. They were generated by ABC using command sym-
Sum-of-Products (SOP), or disjunctive normal form, fun hsignaturei.
is a two-level logic representation commonly used in Table II shows the rules used to generate the last 20
logic synthesis. Minimizing the SOP representation of an benchmarks. Each of the 10 rows of the table contains
incompletely specified Boolean function is a well-studied two groups of labels, which were compared to generate
problem with a number of exact approaches [4, 5, 6] as one test-case. Group A results in value 0 at the output,
well as heuristics [7, 8, 9, 10] with ESPRESSO [7] being while Group B results in value 1. The same groups
the most popular. were used for MNIST [13] and CIFAR-10 [14]. For
Decision Trees (DT) and Random Forests (RF) are example, benchmark ex81 compares odd and even labels
very popular techniques in machine learning and they in MNIST, while benchmark ex91 compares the same
were used by many of the teams. In the contest scope, labels in CIFAR-10.
the decision trees were applied as a classification tree, In generating the benchmarks, the goal was to fulfill the
where the internal nodes were associated to the function following requirements: (1) Create problems, which are
input variables, and terminal nodes classify the function non-trivial to solve. (2) Consider practical functions, such
as 1 or 0, given the association of internal nodes. Thus, as arithmetic logic and symmetric functions, extract logic
each internal node has two outgoing-edges: a true edge cones from the available benchmarks, and derive binary
if the variable value exceeds a threshold value, and a classification problems from the MNIST and CIFAR-10
Table II: Group comparisons for MNIST and CIFAR10
we use the minimum number of objects to determine the
ex Group A Group B best classifier. Finally, the ABC tool checks the size of
0 0-4 5-9
the generated AIGs to match the contest requirements.
1 odd even Team 3’s solution consists of decision tree based
2 0-2 3-5 and neural network (NN) based methods. For each
3 01 23 benchmark, multiple models are trained and 3 are selected
4 45 67
5 67 89 for ensemble. For the DT-based method, the fringe feature
6 17 38 extraction process proposed in [16, 17] is adopted. The
7 09 38 DT is trained and modified for multiple iterations. In
8 13 78
9 03 89 each iteration, the patterns near the fringes (leave nodes)
of the DT are identified as the composite features of
machine learning challenges. (3) Limit the number of AIG 2 decision variables. These newly detected features are
nodes in the solution to 5000 to prevent the participants then added to the list of decision variables for the DT
from generating large AIGs and rather concentrate on training in the next iteration. The procedure terminates
algorithmic improvements aiming at high solution quality when there are no new features found or the number of
using fewer nodes. the extracted features exceeds the preset limit.
There was also an effort to discourage the participants For the NN-based method, a 3-layer network is
from developing strategies for reverse-engineering the employed, where each layer is fully-connected and uses
test-cases based on their functionality, for example, detect- sigmoid as the activation function. As the synthesized
ing that some test-cases are outputs of arithmetic circuits, circuit size of a typical NN could be quite large, the
such as adders or multipliers. Instead, the participants connection pruning technique proposed in [18] is adopted
were encouraged to look for algorithmic solutions to to meet the stringent size restriction. The NN is pruned
handle arbitrary functions and produce consistently good until the number of fanins of each neuron is at most
solutions for every one independently of its origin. 12. Each neuron is then synthesized into a LUT by
rounding its activation [11]. The overall dataset, training
IV. OVERVIEW OF THE VARIOUS A PPROACHES and validation set combined, for each benchmark is re-
divided into 3 partitions before training. Two partitions
Team 1’s solution is to take the best one among are selected as the new training set, and the remaining one
ESPRESSO, LUT network, RF, and pre-defined standard as the new validation set, resulting in 3 different grouping
function matching (with some arithmetic functions). If configurations. Under each configuration, multiple models
the AIG size exceeds the limit, a simple approximation are trained with different methods and hyper-parameters,
method is applied to the AIG. and the one with the highest validation accuracy is chosen
ESPRESSO is used with an option to finish optimiza- for ensemble.
tion after the first irredundant operation. LUT network
has some parameters: the number of levels, the number Team 4’s solution is based on multi-level ensemble-
of LUTs in each level, and the size of each LUT. These based feature selection, recommendation-network-based
parameters are incremented like a beam search as long model training, subspace-expansion-based prediction, and
as the accuracy is improved. The number of estimators accuracy-node joint exploration during synthesis.
in random forest is explored from 4 to 16. Given the high sparsity in the high-dimensional boolean
A simple approximation method is used if the number space, a multi-level feature importance ranking is adopted
of AIG nodes is more than 5000. The AIG is simulated to reduce the learning space. Level 1: a 100-ExtraTree
with thousands of random input patterns, and the node based AdaBoost [19] ensemble classifier is used with
which most frequently outputs 0 is replaced by constant- 10-repeat permutation importance [20] ranking to select
0 while taking the negation (replacing with constant-1) the top-k important features, where k ∈ [10, 16]. Level 2:
into account. This is repeated until the AIG size meets a 100-ExtraTree based AdaBoost classifier and an
the condition. The nodes near the outputs are excluded XGB classifier with 200 trees are used with stratified 10-
from the candidates by setting a threshold on levels. fold cross-validation to select top-k important features,
The threshold is explored through try and error. It was where k ranges from 10 to 16, given the 5,000 node
observed that the accuracy drops 5% when reducing constraints.
3000-5000 nodes. Based on the above 14 groups of selected features, 14
Team 2’s solution uses J48 and PART AI classifiers state-of-the-art recommendation models, Adaptive Factor-
to learn the unknown Boolean function from a single ization Network (AFN) [21], are independently learned
training set that combines the training and validation as DNN-based boolean function approximators. A 128-
sets. The algorithm first transforms the PLA file in an dimensional logarithmic neural network is used to learn
ARFF (Attribute-Relation File Format) description to sparse boolean feature interaction, and a 4-layer MLP is
handle the WEKA tool [15]. We used the WEKA tool to used to combine the formed cross features with overfitting
run five different configurations to the J48 classifier and being handled by fine-tuned dropout. After training, a
five configurations to the PART classifier, varying the k-feature trained model will predict the output for 2k
confidence factor. The J48 classifier creates a decision input combinations to expand the full k-dimensional
tree that the developed software converts in a PLA file. In hypercube, where other pruned features are set to DON’T
the sequence, the ABC tool transforms the PLA file into CARE type in the predicted .pla file to allow enough
an AIG file. The PART classifier creates a set of rules that smoothness in the Boolean hypercube. Such a subspace
the developed software converts in an AAG file. After, expansion technique can fully-leverage the prediction
the AIGER transforms the AAG file into an AIG file capability of our model to maximize the accuracy on the
to decide the best configuration for each classifier. Also, validation/test dataset while constraining the maximum
number of product terms for node minimization during Team 7 adopts tree-based ML models for the straight-
synthesis. forward conversion from tree nodes to SOP terms. The
Team 5’s solution explores the use of DTs and RFs, model is either a decision tree with unlimited depth, or
along with NNs, to learn the required Boolean functions. an extreme gradient boosting (XGBoost) of 125 trees
DTs/RFs are easy to convert into SOP expressions. To with a maximum depth of five, depending on the results
evaluate this proposal, the implementation obtains the of a 10-fold cross validation on training data.
models using the Scikit-learn Python library [22]. The With the learned model, all underlying tree leaves
solution is chosen from simulations using Decision- are converted to SOP terms, which are minimized and
TreeClassifier for the DTs, and an ensemble of Decision- compiled to AIGs with ESPRESSO and ABC, respectively.
TreeClassifier for the RFs – the RandomForestClassifier If the model is a decision tree, the converted AIG is final.
structure would be inconvenient, considering the 5000- If the model is XGBoost, the value of each tree leaf is
gate limit, given that it employs a weighted average of first quantized to one bit, and then aggregated with a
each tree. 3-layer network of 5-input majority gates for efficient
The simulations are performed using different tree implementation of AIGs.
depths and feature selection methods (SelectKBest and Tree-based models may not perform well in symmetric
SelectPercentile). NNs are also employed to enhance functions or complex arithmetic functions. However,
our exploration capabilities, using the MLPClassifier patterns in the importance of input bits can be observed
structure. Given that SOPs cannot be directly obtained for some pre-defined standard functions such as adders,
from the output of the NN employed, the NN is used as a comparators, outputs of XOR or MUX. Before ML, Team
feature selection method to obtain the importance of each 7 checks if the training data come from a symmetric
input based on their weight values. With a small sub- function, and compares training data with each identified
set of weights obtained from this method, the proposed special function. In case of a match, an AIG of the
solution performs a small exhaustive search by applying identified function is constructed directly without ML.
combinations of functions on the four features with the Team 8’s solution is an ensemble drawing from
highest importance, considering OR, XOR, AND, and multiple classes of models. It includes a multi-layer
NOT functions. The SOP with the highest accuracy perceptron (MLP), binary decision tree (BDT) augmented
(respecting the 5001-gate limit) out of the DTs/RFs with functional decomposition, and a RF. These models
and NNs tested was chosen to be converted to an AIG are selected to capture various types of circuits. For all
file. The data sets were split into an 80%-20% ratio, benchmarks, all models are trained independently, and the
preserving the original data set’s target distribution. The model with the best validation accuracy that results in a
simulations were run using half of the newly obtained circuit with under 5000 gates is selected. The MLP uses a
training set (40%) and the whole training set to increase periodic activation instead of the traditional ReLU to learn
our exploration. additional periodic features in the input. It has three layers,
Team 6’s solution learns the unknown Boolean func- with the number of neurons divided in half between
tion using the method as mentioned in [11]. In order to each layer. The BDT is a customized implementation
construct the LUT network, we use the minterms as input of the C4.5 tree that has been modified with functional
features to construct layers of LUTs with connections decomposition in the cases where the information gain
starting from the input layer. We then carry out two is below a threshold. The RF is a collection of 17 trees
schemes of connections between the layers: ‘random limited to a maximum depth of 8. RF helps especially
set of input’ and ‘unique but random set of inputs’. By in the cases where BDT overfits.
‘random set of inputs’, we imply that we just randomly After training, the AIGs of the trained models are
select the outputs of preceding layer and feed it to the next generated to ensure they are under 5000 gates. In all cases,
layer. This is the default flow. By ‘unique but random set the generated AIGs are simplified using the Berkeley
of inputs’, we mean that we ensure that all outputs from a ABC tool to produce the final AIG graph.
preceding layer is used before duplication of connection. Team 9’s proposes a Bootstrapped flow that explores
We carry out experiments with four hyper parameters to the search algorithm Cartesian Genetic Programming
achieve accuracy– number of inputs per LUT, number of (CGP). CGP is an evolutionary approach proposed as
LUTS per layers, selection of connecting edges from the a generalization of Genetic Programming used in the
preceding layer to the next layer and the depth (number digital circuit’s domain. It is called Cartesian because the
of LUT layers) of the model. We experiment with varying candidate solutions are composed of a two-dimensional
number of inputs for each LUT in order to get the network of nodes. CGP is a population-based approach
maximum accuracy. We notice from our experiments that often using the evolution strategies (1+λ)-ES algorithm
4-input LUTs returns the best average numbers across for searching the parameter space. Each individual is a
the benchmark suite. circuit, represented by a two-dimensional integer matrix
Once the network is created, we convert the network describing the functions and the connections among
into an SOP form using sympy package in python. This nodes.
is done from reverse topological order starting from the The proposed flow decides between two initialization:
outputs back to the inputs. Using the SOP form, we 1) starts the CGP search from random (unbiased) indi-
generate the verilog file which is then used with ABC to viduals seeking for optimal circuits; or, 2) exploring a
calculate the accuracy. bootstrapped initialization with individuals generated by
Team 7’s solution is a mix of conventional ML and previously optimized SOPs created by decision trees or
pre-defined standard function matching. If a training set ESPRESSO when they provide AIGs with more than
matches a pre-defined standard function, a custom AIG 55% of accuracy. This flow restricts the node functions
of the identified function is written out. Otherwise, an to XORs, ANDs, and Inverters; in other words, we may
ML model is trained and translated to an AIG. use AIG or XAIG to learn the circuits. Variation is
Table III: Performance of the different teams

team ↓ test accuracy And gates levels overfit ✔ ✔ ✔ ✔


1 88.69 2517.66 39.96 1.86 ✔
7 87.50 1167.50 32.02 0.05 ✔ ✔
8 87.32 1293.92 21.49 0.14 ✔
✔ ✔
3 87.25 1550.33 21.08 5.76 ✔
2 85.95 731.92 80.63 8.70 ✔ ✔
9 84.65 991.89 103.42 1.75 ✔ ✔
4 84.64 1795.31 21.00 0.48 ✔ ✔
5 84.08 1142.83 145.87 4.17 ✔
10 80.25 140.25 10.90 3.86
6 62.40 356.26 8.73 0.88 Fig. 1: Representation used by various teams

added to the individuals through mutations seeking to 100


find a circuit that optimizes a given fitness function. The
mutation rate is adaptive, according to the 1/5th rule 90
[23]. When the population is bootstrapped with DTs or

test accuracy
80
SOP, the circuit is fine-tuned with the whole training
set. When the random initialization is used, it was tested 70 (537, 89.88) (1140.76, 91.0)
with multiple configurations of sizes and mini-batches
of the training set that change based on the number of 60 Pareto curve for virtual best
generations processed. Average accuracy by team
Top accuracy achieved by Team 1
Team 10’s solution learns Boolean function representa- 50
500 1000 1500 2000 2500
tions, using DTs. We developed a Python program using number of And gates
the Scikit-learn library where the parameter max depth
serves as an upper-bound to the growth of the trees, Fig. 2: Acc-size trade-off across teams and for virtual best
and is set to be 8. The training set PLA, treated as a similar accuracy, with very divergent number of nodes,
numpy matrix, is used to train the DT. On the other hand, as seen in Table III. For teams that have relied on just
the validation set PLA is then used to test whether the one approach, such as Team 10 and 2 who used only
obtained DT meets the minimum validation accuracy, decision trees, it seems that more AND nodes might lead
which we empirically set to be 70%. If such a condition to better accuracy. Most of the teams, however, use a
is not met, the validation set is merged with the training portfolio approach, and for each benchmark choose an
set. According to empirical evaluations, most of the appropriate technique. Here it is worth pointing out that
benchmarks with accuracy < 70% showed a validation there is no approach, which is consistently better across
accuracy fluctuating around 50%, regardless of the size all the considered benchmarks. Thus, applying several
and shapes of the DTs. This suggests that the training approaches and deciding which one to use, depending
sets were not able to provide enough representative cases on the target Boolean functions, seems to be the best
to effectively exploit the adopted technique, thus leading strategy. Fig. 1 presents the approaches used by each
to DTs with very high training accuracy, but completely team.
negligible performances. For DTs having a validation While the size of the network was not one of the
accuracy ≥ 70%, the tree structure is annotated as a optimization criteria in the contest, it is an important
Verilog netlist, where each DT node is replaced with a parameter considering the hardware implementation, as
multiplexer. The obtained Verilog netlist is then processed it impacts area, delay, and power. The average area
with the ABC Synthesis Tool in order to generate a reported by individual teams are shown in Fig. 2 as ‘×’.
compact and optimized AIG structure. This approach has Certain interesting observations can be made from Fig. 2.
shown an average accuracy over the validation set of 84%, Apart from showing the average size reached by various
with an average size of AIG of 140 nodes (and no AIG teams, it also shows the Pareto-curve between the average
with more than 300 nodes). More detailed information accuracy across all benchmarks and their size in terms of
about the adopted technique can be found in [24]. number of AND gates. It can be observed that while 91%
V. R ESULTS
A. Accuracy
Table III shows the average accuracy of the solutions 100
found by all the 10 teams, along with the average circuit
size, the average number of levels in the circuit, and 90
the overfit measured as the average difference between
test accuracy

the accuracy on the validation set and the test set. The 80
following interesting observations can be made: (i) most 70
of the teams achieved more than 80% accuracy. (ii) the
teams were able to find circuits with much fewer gates 60
than the specification.
When it comes to comparing network size vs accuracy, 50
0 20 40 60 80 100
there is no clear trend. For instance, teams 1 and 7 have benchmarks

Fig. 3: Maximum accuracy achieved for each example


VI. C ONCLUSION
70 In this work, we explored the connection between
number of examples with highest accuracy
Best
Within top-1%
60 logic synthesis of incompletely specified functions and
50
supervised learning. This was done via a programming
contest held at the 2020 International Workshop on Logic
40 and Synthesis where the objective was to synthesize small
30 circuits that generalize well from input-output samples.
20 The solutions submitted to the contest used a variety
10
of techniques spanning logic synthesis and machine
learning. Portfolio approaches ended up working better
0 than individual techniques, though random forests formed
1 2 3 4 5 6 7 8 9 10
team a strong baseline. Furthermore, by sacrificing a little
Fig. 4: Top-accuracy results achieved by different teams accuracy, the size of the circuit could be greatly reduced.
These findings suggest an interesting direction for future
accuracy constraint requires about 1141 gates, a reduction work: When exactness is not needed, can synthesis be
in accuracy constraint of merely 2%, requires a circuit done using machine learning algorithms to greatly reduce
of only half that size. This is an insightful observation area and power?
which strongly suggests that with a slight compromise Future extensions of this contest could target circuits
in the accuracy, much smaller size requirements can be with multiple outputs and algorithms generating an
satisfied. optimal trade-off between accuracy and area (instead
Besides area, it is also worth to look for the number of a single solution).
of logic-levels in the generated implementation, as it
correlates with circuit delay. Similar to the number of R EFERENCES
nodes, there is no clear distinction on how the number of [1] Satrajit Chatterjee. “On Algorithms for Technology Mapping”.
PhD thesis. University of California, Berkeley, 2007.
levels impacts the final accuracy. Team 6 has delivered [2] Armin Biere, Keijo Heljanko, and Siert Wieringa. AIGER
the networks with the smallest depth, often at the cost 1.9 And Beyond. Tech. rep. Institute for Formal Models and
of accuracy. In practice, the winning team has just the Verification, Johannes Kepler University, 2011.
[3] ESPRESSO(5OCTTOOLS) Manual Page. https://ultraespresso.
4th larger depth among the 10 teams. di.univr.it/assets/data/espresso/espresso5.pdf.
Finally, Fig. 3 shows the maximum accuracy achieved [4] Olivier Coudert. “Two-level logic minimization: an overview”.
for each benchmarks. While most of the benchmarks In: Integration (1994).
[5] O. Coudert. “On Solving Covering Problems”. In: DAC. 1996.
achieved a 100% accuracy, several benchmarks only [6] Goldberg et al. “Negative thinking by incremental problem
achieved close to 50%. That gives an insight on which solving: application to unate covering”. In: ICCAD. 1997.
benchmarks are harder to generalize, and these bench- [7] Robert K Brayton et al. Logic minimization algorithms for
VLSI synthesis. Vol. 2. 1984.
marks might be used as test-case for further developments [8] R. L. Rudell and A. Sangiovanni-Vincentelli. “Multiple-Valued
on this research area. Minimization for PLA Optimization”. In: IEEE TCAD (1987).
[9] P. C. McGeer et al. “ESPRESSO-SIGNATURE: a new exact
B. Generalization gap minimizer for logic functions”. In: IEEE TVLSI (1993).
[10] J. Hlavicka and P. Fiser. “BOOM-a heuristic Boolean mini-
The generalization gap for each team is presented in mizer”. In: ICCAD. 2001.
the last column of Table III. This value presents how [11] S. Chatterjee. “Learning and memorization”. In: ICML. 2018.
well the learnt model can generalize on an unknown set. [12] Saeyang Yang. Logic synthesis and optimization benchmarks
user guide: version 3.0. Microelectronics Center of North
Usually, a generalization gap ranging from 1% to 2% is Carolina (MCNC), 1991.
considered to be good. It is possible to note that most of [13] Yann LeCun, Corinna Cortes, and CJ Burges. “MNIST hand-
the teams have reached this range, with team 7 having written digit database”. In: ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist (2010).
a very small gap of 0.05%. Furthermore, given that the [14] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple
benchmark functions are incompletely-specified, with layers of features from tiny images”. In: (2009).
a very small subset of minterms available for training, [15] Mark Hall et al. “The WEKA Data Mining Software: An
Update”. In: SIGKDD Explor. Newsl. (2009).
reaching generalization in small networks is a challenge. [16] Giulia Pagallo and David Haussler. “Boolean Feature Discovery
Therefore, a special mention must be given to Team 10, in Empirical Learning”. In: Machine Learning (1990).
who reached a high level of accuracy with extremely [17] Arlindo L. Oliveira and Alberto Sangiovanni-Vincentelli.
“Learning Complex Boolean Functions: Algorithms and Appli-
small network sizes. cations”. In: NeurIPS. 1993.
[18] Song Han et al. “Learning Both Weights and Connections for
C. Win-rate for different teams Efficient Neural Networks”. In: NeurIPS. 2015.
Fig. 4 shows a bar chart showing which team achieved [19] Yoav Freund and Robert E. Schapire. “A Decision-Theoretic
Generalization of On-Line Learning and an Application to
the best accuracy and the top-1% for the largest number Boosting”. In: JCSS (1997).
of benchmarks. Team 3 is the winner in terms of both [20] Leo Breiman. “Random Forests”. In: Machine Learning (2001),
of these criteria, achieving the best accuracy among all pp. 5–32.
[21] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. “Adaptive
the teams for 42 benchmark. Following Team 3, is Team Factorization Network: Learning Adaptive-Order Feature Inter-
7, and then the winning Team 1. Still, when it comes to actions”. In: Proc. AAAI. 2020.
the best average accuracy, Team 1 has won the contest. [22] Fabian Pedregosa et al. “Scikit-learn: Machine learning in
Python”. In: JMLR (2011).
This figure gives insights on which approaches have been [23] Benjamin Doerr and Carola Doerr. “Optimal parameter choices
in the top of achieved accuracy more frequently, and through self-adjustment: Applying the 1/5-th rule in discrete
give a pointer to what could be the ideal composition of settings”. In: ACGEC. 2015.
[24] R. G. Rizzo, V. Tenace, and A. Calimera. “Multiplication by
techniques to achieve high-accuracy. As shown in Fig. 1, Inference using Classification Trees: A Case-Study Analysis”.
indeed a portfolio of techniques needs to be employed to In: ISCAS. 2018.
achieve high-accuracy, since there is no single technique
that dominates.
A PPENDIX inputs of test cases are ordered regularly from LSB
The detailed version of approaches adopted by indi- to MSB for each word. Nevertheless, it seems almost
vidual teams are described below. impossible to realize significantly accurate multiplier and
square rooters with more than 100 inputs within 5000
I. TEAM 1 AIG nodes.
Authors: Yukio Miyasaka, Xinpei Zhang, Mingfei Yu, D. Exploration After The Contest
Qingyang Yi, Masahiro Fujita, The University of Tokyo, 1) Binary Decision Tree: We examined BDT (Binary
Japan and University of California, Berkeley, USA Decision Tree), inspired by the methods of other teams.
Our target is the second MSB of adder because only BDT
A. Learning Methods was able to learn the second MSB of adder with more
than 90% accuracy according to the presentation of the
We tried ESPRESSO, LUT network, and Random 3rd place team. In normal BDT, case-splitting by adding a
forests. ESPRESSO will work well if the underlying new node is performed based on the entropy gain. On the
function is small as a 2-level logic. On the other hand, other hand, in their method, when the best entropy gain
LUT network has a multi-level structure and seems good is less than a threshold, the number of patterns where the
for the function which is small as a multi-level logic. negation of the input causes the output to be negated too
Random forests also support multi-level logic based on is counted for each input, and case-splitting is performed
a tree structure. at the input which has such patterns the most.
ESPRESSO is used with an option to finish optimiza- First of all, the 3rd team’s tree construction highly
tion after the first irredundant operation. LUT network depends on the order of inputs. Even in the smallest
has some parameters: the number of levels, the number adder (16-bit adder, case 0), there is no pattern such that
of LUTs in each level, and the size of each LUT. These the pattern with an input negated is also included in the
parameters are incremented like a beam search as long given set of patterns. Their program just chooses the last
as the accuracy is improved. The number of estimators input in such case and fortunately works in the contest
in Random forests is explored from 4 to 16. benchmark, where the last input is the MSB of one input
If the number of AIG nodes exceeds the limit (5000), word. However, if the MSB is not selected at first, the
a simple approximation method is applied to the AIG. accuracy dramatically drops. When we sort the inputs at
The AIG is simulated with thousands of random input random, the accuracy was 59% on average of 10 times.
patterns, and the node which most frequently outputs Next, we tried SHAP analysis [1] on top of XGBoost,
0 is replaced by constant-0 while taking the negation based on the method of the 2nd place team, to find out the
(replacing with constant-1) into account. This is repeated MSBs of input-words. The SHAP analysis successfully
until the AIG size meets the condition. To avoid the result identifies the MSBs for a 16-bit adder, but not that for a
being constant 0 or 1, the nodes near the outputs are larger adder (32-bit adder, case 2).
excluded from the candidates by setting a threshold on
level. The threshold is explored through try and error.
1
B. Preliminary Experiment
We conducted a simple experiment. The parameters of 0.9
LUT was fixed as follows: the number of levels was 8, 0.8
accuracy

the number of LUTs in a level was 1028, and the size of


each LUT was 4. The number of estimators in Random 0.7
forests was 8. The test accuracy and the AIG size of the
methods is shown at Fig. 5 and Fig. 6. Generally Random 0.6
forests works best, but LUT network works better in a 0.5
few cases among case 90-99. All methods failed to learn
case 0, 2, 4, 6, 8, 20, 22, 24, 26, 28, and 40-49. The 0.4
approximation was applied to AIGs generated by LUT 0 10 20 30 40 50 60 70 80 90 100
network and Random forests for these difficult cases and
benchmark
case 80-99. ESPRESSO always generates a small AIG ESPRESSO RandomForest
with less than 5000 nodes as it conforms to less than ten LUTNetwork
thousands of min-terms.
The effect of approximation of the AIGs generated by Fig. 5: The test accuracy of the methods
LUT network is shown at Fig. 7. For difficult cases, the
accuracy was originally around 50%. For case 80-99, the In conclusion, it is almost impossible for BDT to learn
accuracy drops at most 5% while reducing 3000-5000 32-bit or larger adders with random input order. If the
nodes. Similar thing was observed in the AIGs generated problem provides a black box simulator as in ICCAD
by Random forests. contest 2019 problem A [2], we may be able to know
the MSBs by simulating one-bit flipped patterns, such as
C. Pre-defined standard function matching one-hot patterns following an all-zero pattern. Another
The most important method in the contest was actually mention is that BDT cannot learn a large XOR (16-XOR,
matching with a pre-defined standard functions. There case 74). This is because the patterns are divided into two
are difficult cases where all of the methods above fail to parts after each case-splitting and the entropy becomes
get meaningful accuracy. We analyzed these test cases zero at a shallow level. So, BDT cannot learn a deep
by sorting input patterns and was able to find adders, adder tree (adder with more than 2 input-words), much
multipliers, and square rooters fortunately because the less multipliers.
5000 does not exceed the threshold, achieved more than 95%
accuracy. These accuracy on 4-word adder tree is high
4000 compared to BDT, whose accuracy was only 60% even
with the best ordering.
AIG nodes

3000
For 6-word adder tree, the accuracy of the level-based
method was around 80%. We came up with another
heuristic that if both straight two-sided matching and
2000
complemented two-sided matching are available, the one
with the smaller gain is used, under a bias of 100 nodes
1000 on the complemented matching This heuristic increased
the accuracy of the level-based method to be 85-90%.
0 However, none of the methods above obtained meaningful
0 10 20 30 40 50 60 70 80 90 100
(more than 50%) accuracy for 8-word adder tree.
benchmark We conclude that BDD can learn a function if the
ESPRESSO RandomForest BDD of its underlying function is small under some
LUTNetwork input order and we know that order. The main reason for
minimization failure is that merging inappropriate nodes
Fig. 6: The resulting AIG size of the methods
is mistakenly performed due to a lack of contradictory
patterns. Our heuristics prevent it to some degree. If
1 12000 we have a black box simulator, simulating patterns
to distinguish straight and complemented two-sided
0.9 10000
matching would be helpful. Reordering using don’t cares
0.8 8000
is another topic to explore.
AIG nodes
accuracy

0.7 6000 II. TEAM 2


Authors: Guilherme Barbosa Manske, Matheus
0.6 4000
Ferreira Pontes, Leomar Soares da Rosa Junior,
0.5 2000
Marilton Sanchotene de Aguiar, Paulo Francisco Butzen,
Universidade Federal de Pelotas, Universidade Federal
0.4 0 do Rio Grande do Sul, Brazil
0 10 20 30 40 50 60 70 80 90 100
benchmark
LUTNetwork
LUTNetwork-OriginalA. Proposed solution
SizeOfLUTNetwork
SizeOfLUTNetwork-Original Our solution is a mix of two machine learning
techniques, J48 [4] and PART [5]. We will first present a
Fig. 7: The test accuracy and AIG size of LUT network before general overview of our solution. Then we will focus on
and after approximation individual machine learning classifier techniques. Finally,
we present the largest difference that we have found
between both strategies, showing the importance of
2) Binary Decision Diagram: We also tried BDD exploring more than only one solution.
(Binary Decision Diagram) to learn adder. BDD mini- The flowchart in Fig. 8 illustrates our overall solution.
mization using don’t cares [3] is applied to the BDD The first step was to convert the PLA structure into one
of the given on-set. Given an on-set and a care-set, we that Weka [6] could interpret. We decided to use the
traverse the BDD of on-set while replacing a node by its ARFF structure due to how the structure of attributes and
child if the other child is don’t care (one-sided matching), classes is arranged.
by an intersection of two children if possible (two-sided
matching), or by an intersection between a complemented
child and the other child if possible (complemented
two-sided matching). Unlike BDT, BDD can learn a
large XOR up to 24-XOR (using 6400 patterns) because
patterns are shared where nodes are shared.
BDD was able to learn the second MSB of adder tree
only if the inputs are sorted from MSB to LSB mixing all Fig. 8: Solution’s flowchart.
input-words (the MSB of the first word, the MSB of the
second word, the MSB of the third word, ..., the MSB of After converting to ARFF, as shown in the flowchart’s
the last word, the second MSB of the first word, ...). For second block, the J48 and PART algorithms are executed.
normal adder (2 words), one-sided matching achieved In this stage of the process, the confidence factor is varied
98% accuracy. The accuracy was almost the same among in five values (0.001, 0.01, 0.1, 0.25, and 0.5) for each
any bit-width because the top dozens of bits control the algorithm. In total, this step will generate ten results.
output. For 4-word adder tree, one-sided matching got The statistics extracted from cross-validation were used
around 80% accuracy, while two-sided matching using a to determine the best classifier and the best confidence
threshold on the gain of substitution achieved around 90% factor.
accuracy. Note that naive two-sided matching fails (gets This dynamic selection between the two classifiers
50% accuracy). Furthermore, a level-based minimization, and confidence factors was necessary since a common
where nodes in the same level are merged if the gain
configuration was not found for all the examples provided for a given input. We transform this set of rules in an
in the IWLS Contest. After selecting the best classifier AAG file, and to follow the order of the rules, we have
and the best confidence factor, six new configurations are created a circuit that guarantees this order. Each rule is
performed. At this point, the parameter to be varied is an AND logic gate with a varied number of inputs.
the minimum number of instances per sheet, that is, the First, we go through the PART file and create all the
Weka parameter ”-M”. The minimum number of instances rules (ANDs), inverting the inputs that are 0. We need
per sheet was defined (0, 1, 3, 4, 5, and 10). Again, the to save the values and positions of all rules in a data
selection criterion was the result of the cross-validation structure. After, we read this data structure, connecting
statistic. all the outputs. If a rule makes the output goes to 1,
1) J48: Algorithms for constructing decision trees are we add an OR logic gate to connect with the rest of
among the most well known and widely used machine the rules. If a rule makes the output goes to 0, we add
learning methods. With decision trees, we can classify an AND logic gate, with an inverter in the first input.
instances by sorting them based on feature values. We These guarantees that the first correct rule will define the
classify the samples starting at the root node and sorted output. Finally, we use the AIGER [7] library to convert
based on their feature values so that in each node, we the created AAG to AIG.
represent a feature in an example to be classified, and each In Fig. 10, we can see how this circuit is cre-
branch represents a value that the node can assume. In the ated. Fig. 10(a) shows a set of rules (PART file) with four
machine learning community, J. Ross Quinlan’s ID3 and rules, where a blank line separates each rule. Fig. 10 (b)
its successor, C4.5 [4], are the most used decision tree shows the circuit created with this set of rules.
algorithms. J48 is an open-source Java implementation
of the C4.5 decision tree algorithm in Weka.
The J48 classifier output is a decision tree, which we
transform into a new PLA file. First, we go through the
tree until we reach a leaf node, saving all internal nodes’
values in a vector. When we get to the leaf node, we use
the vector data to write a line of the PLA file. After, we
read the next node, and the amount of data that we keep
in the vector depends on the height of this new node.
Our software keeps going through the tree until the
end of the J48 file, and then it finishes the PLA file
writing the metadata. Finally, we use the ABC tool to
create the AIG file, using the PLA file that our software
has created.
In Fig. 9, we show an example of how our software
works. Fig. 9 (a) shows a decision tree (J48 output) with
7 nodes, 4 of which are leaves. Fig. 9 (b) shows the PLA
file generated by our software. The PLA file has 1 line Fig. 10: (a) Set of rules (PART file) and (b) circuit created
for every leaf in the tree. Fig. 9 (c) shows the pseudocode with this set of rules.
j48topla, where n control the data in the vector and x is
the node read in the line.
B. Results
Fig. 11 shows the accuracy of the ten functions that
varied the most between the J48 and PART classifiers.
We compared the best result of J48 and the best result
of PART with the Weka parameter ”-M” fixed in 2. The
biggest difference happened in circuit 29, with J48 getting
69.74% of accuracy and PART 99.27%, resulting in a
difference of 29.52%.
Most of the functions got similar accuracy for both
classifiers. The average accuracy of the J48 classifier
was 83.50%, while the average accuracy of the PART
classifier was 84.53%, a difference of a little over 1%.
After optimizing all the parameters in the Weka tool, we
got an average accuracy of 85.73%. All accuracy values
Fig. 9: (a) Decision tree (J48 output), (b) PLA file generated were obtained with cross-validation.
by our software and (c) pseudocode j48topla. In Fig. 12, we compare the number of ANDs in the
AIG in the same ten functions. The interesting point
2) PART: In the PART algorithm [5], we can infer observed in this plot refers to circuits 43, 51, and 52.
rules generating partial decision trees. Thus two major The better accuracy for these circuits is obtained through
paradigms for rule generation are combined: creating the J48 classifier, while the resulting AIG is smaller than
rules from decision trees and the separate-and-conquer the ones created from PART solution. The complementary
rule learning technique. Once we build a partial tree, behavior is observed in circuits 4, and 75. This behavior
we extract a single rule from it, and for this reason, the reinforces the needed for diversification in machine
PART algorithm avoids post-processing.
The PART classifier’s output is a set of rules, which learning classifiers.
checks from the first to the last rule to define the output
III. TEAM 3
Authors: Po-Chun Chien, Yu-Shan Huang, Hoa-Ren
Wang, and Jie-Hong R. Jiang, Graduate Institute of
Electronics Engineering, Department of Electrical
Engineering, National Taiwan University, Taipei, Taiwan
Team 3’s solution consists of decision tree (DT) based
and neural network (NN) based methods. For each
benchmark, multiple models are generated and 3 are
selected for ensemble.
A. DT-based method
For the DT-based method, the fringe feature extraction
Fig. 13: Fringe DT learning flow. process proposed in [8, 9] is adopted. The overall learning
procedure is depicted in Fig. 13. The DT is trained and
modified for multiple iterations. In each iteration, the
patterns near the fringes (leave nodes) of the DT are
identified as the composite features of 2 decision variables.
As illustrated in Fig. 14, 12 different fringe patterns
can be recognized, each of which is the combination of
2 decision variables under various Boolean operations.
These newly detected features are then added to the list
of decision variables for the DT training in the next
iteration. The procedure terminates when there are no
new features found or the number of the extracted features
exceeds the preset limit. After training, the DT model
can be synthesized into a MUX-tree in a straightforward
manner, which will not be covered in detail in the paper.
B. NN-based method
For the NN-based method, a 3-layer network is
employed, where each layer is fully-connected and uses
Fig. 14: 12 fringe patterns. sigmoid(σ) as the activation function. As the synthesized
circuit size of a typical NN could be quite large, the
connection pruning technique proposed in [10] is adopted
in order to meet the stringent size restriction. Network
pruning is an iterative process. In each iteration, a portion
of unimportant connections (the ones with weights close
to 0) are discarded and the network is then retrained to
recover its accuracy. The NN is pruned until the number
of fanins of each neuron is at most 12. To synthesize the
network into a logic circuit, an alternative can be done
by utilizing the arithmetic modules, such as adders and
multipliers, for the intended computation. Nevertheless,
the synthesized circuit size can easily exceed the limit
due to the high complexity of the arithmetic units. Instead,
each neuron in the NN is converted into a LUT by
rounding and quantizing its activation. Fig. 15 shows an
Fig. 11: Accuracy of the ten functions that had the biggest example transformation of a neuron into a LUT, where
difference in accuracy between J48 and PART classifiers.The all possible input assignments are enumerated, and the
10 functions are 0, 2, 4, 6, 27, 29, 43, 51, 52 and 75 neuron output is quantized to 0 or 1 as the LUT output
under each assignment. The synthesis of the network can
be done quite fast, despite the exponential growth of the
enumeration step, since the number of fanins of each
neuron has been restricted to a reasonable size during the
previous pruning step. The resulting NN after pruning
and synthesis has a structure similar to the LUT-network
in [11], where, however, the connections were assigned
randomly instead of learned iteratively.

Fig. 12: Number of ANDs in ten functions used in Fig. 11.


D. Experimental results
The DT-based and NN-based methods were imple-
mented with scikit-learn [12] and PyTorch [13],
respectively. After synthesis of each model, the circuit
was then passed down to ABC [14] for optimization. Both
methods were evaluated on the 100 benchmarks provided
by the contest.
Table IV summarizes the experimental results. The first
column lists various methods under examination, where
Fr-DT and DT correspond to the DT-based method with
or without fringe feature extraction, NN correspond to
the NN-based method, LUT-Net is the learning procedure
proposed in [11], and ensemble is the combination of
Fig. 16: Test accuracy of each benchmark by different methods. Fr-DT, DT and NN. The remaining columns of Table IV
specify the average training, validation and testing accu-
racies along with the circuit sizes (in terms of the number
of AIG nodes) of the 100 benchmarks. Fig. 16 and 17
plot the testing accuracy and circuit size of each case,
where different methods are marked in different colors.
Table IV: Summary of experimental results.
method avg. train acc. avg. valid acc. avg. test acc. avg. size
DT 90.41% 80.33% 80.15% 303.90
Fr-DT 92.47% 85.37% 85.23% 241.47
NN 82.64% 80.91% 80.90% 10981.38
LUT-Net [11] 98.37% 72.78% 72.68% 64004.39
ensemble - - 87.25% 1550.33

Fr-DT performed the best of the 4 methods under


comparison. It increased the average testing accuracy by
Fig. 17: Circuit size of each benchmark by different methods. over 5% when compared to the ordinary DT, and even
attained circuit size reduction. Fr-DT could successfully
identify and discover the important features of each
benchmark. Therefore, by adding the composite features
into the variable list, more branching options are provided
and thus the tree has the potential to create a better
split during the training step. On the other hand, even
though NN could achieved a slightly higher accuracy
than DT on average, its circuit size exceeded the limit in
75 cases, which is undesirable. When comparing NN to
LUT-Net, which was built in a way so that it had the same
number of LUTs and average number of connections as
NN, NN clearly has an edge. The major difference of
NN and LUT-Net exists in the way they connect LUTs
from consecutive layers, the former learn the connections
iteratively from a fully-connected network, whereas the
latter assign them randomly. Moreover, Table V shows
the accuracy degradation of NN after connection pruning
and neuron-to-LUT conversion. It remains future work to
Fig. 15: Neuron to LUT transformation. mitigate this non-negligible ∼2% accuracy drop. Finally,
by ensemble, models with the highest testing accuracies
could be obtained.
C. Model ensemble Table V: Accuracy degradation of NN after pruning and
The overall dataset, training and validation set com- synthesis.
bined, of each benchmark is re-divided into 3 partitions
before training. Two partitions are selected as the new NN config. avg. train acc. avg. valid acc. avg. test acc.
training set, and the remaining one as the new validation initial 87.30% 83.14% 82.87%
set, resulting in 3 different grouping configurations. Under after pruning 89.06% 82.60% 81.88%
each configuration, multiple models are trained with after synthesis 82.64% 80.91% 80.90%
different methods and hyper-parameters, and the one with
the highest validation accuracy is chosen for ensemble.
Therefore, the obtained circuit is a voting ensemble of 3 Of the 300 selected models during ensemble, Fr-
distinct models. If the circuit size exceeds the limit, the DT and ordinary DT account for 80.3% and 16.0%,
largest model is then removed and re-selected from its
corresponding configuration.
N-dimensional A. Deep-Learning-Based Boolean Function Approxima-
Training dataset (2) Sparse Feature Learning
via AFN
tion
Given the universal approximation theorem, neural
(1) Multi-Level networks can be used to fit arbitrarily complex functions.
Feature Selection For example, multi-layer perceptrons (MLPs) with enough
width and depth are capable to learn the most non-smooth
binary function in the high-dimensional hypercube, i.e.,
XNOR or XOR [15]. For this contest, we adopt a
(3) Inference with Sub-space deep-learning-based method to learn an unknown high-
CrossValidation Expansion
--0000 1
dimensional Boolean function with 5k node constraints
.pla
Lower imp. 0000
0001
--0001
--0010
0
0
and extremely-limited training data, formulated as fol-
0010 AFN 1

...
...
--0011
...
1
lows,
1111 --1111 1

min L(W ; Dtrn ), (1)


(4) Node Constrained AIG Search Resplit data
Next best >5000 go to (1) s.t. N (AIG(W )) ≤ 5, 000,
pla <= 5000
--0000
--0001
1
0 .pla
and low
acc.
where L(W, Dtrn ) is the binary classification loss func-
.aig
tion on the training set, N (AIG(W )) is the num-
--0010 0
--0011 1 (i) Synth. (i)
mltest
...
--1111 1
done
<=5000 and high acc. ber of node of the synthesized AIG representation
based on the learned model. To solve this constrained
Fig. 18: Deep-learning-based Boolean function approximation stochastic optimization, we adopt the following tech-
framework. niques, 1) multi-level ensemble-based feature selection,
2) recommendation-network-based model training, 3)
subspace-expansion-based prediction, and 4) accuracy-
DON’T CARE node joint exploration during synthesis. The framework
0 is shown in Fig. 18. We demonstrate the test result and
1 0 give an analysis to show our performance on the 100
public benchmarks.
B. Feature Selection with Multi-Level Model Ensemble
The first critical step for this learning task is to perform
1
Sparsely data pre-processing and feature engineering. The public
Sampled Space
0 100 benchmarks have very unbalanced input dimensions,
ranging from 10 to over 700, but the training dataset
merely has 6,400 examples per benchmark, which gives
an inadequate sampling of the true distribution. We
Fig. 19: Smoothness assumption in the sparse high-dimensional also observe that a naive AIG representation directly
Boolean space. synthesized from the given product terms has orders-
of-magnitude more AND gates than the 5,000 node
respectively, with NN taking up the remaining 3.7%. constraint. The number of AND gate after node optimiza-
As expected, the best-performing Fr-DT models are in tion of the synthesizer highly depends on the selection
the majority. It seems that the DT-based method is better- of don’t-care set. Therefore, we make a smoothness
suited for this learning task. However, there were several assumption in the sparse high-dimensional hypercube
cases, such as case 75 where NN achieved 89.97% testing that any binary input combinations that are not explicitly
accuracy over Fr-DT’s 87.38%, with an acceptable circuit described in the PLA file are set to don’t-care state, such
size (2320 AIG nodes). that the synthesizer has enough optimization space to cut
down the circuit scale, shown in Fig. 19.
E. Take-Away Based on this smoothness assumption, we perform
Team 3 adopted DT-based and NN-based methods input dimension reduction to prune the Boolean space
to tackle the problem of learning an unknown Boolean by multi-level feature selection. Since we have 6,400
function. The team ranked 4 in terms of testing accuracy randomly sampled examples for each benchmark, we
among all the contestants of the IWLS 2020 programming assume the training set is enough to recover the true
contest. From the experimental evaluation, the approach functionality of circuits with less than blog2 6, 400c = 12
that utilized decision tree with fringe feature extraction inputs and do not perform dimension reduction on those
could achieve the highest accuracy with the lowest circuit benchmarks. For benchmarks with more than 13 inputs,
size in average. This approach is well-suited for this we empirically pre-define a feature dimension d ranging
problem and can generate a good solution for almost from 10 to 16, which is appropriate to cover enough
every benchmark, regardless of its origin. optimization space under accuracy and node constraints.
For each dimension d, we first perform the first level
IV. TEAM 4 of feature selection by pre-training a machine learning
model ensemble.
Authors: Jiaqi Gu, Zheng Zhao, Zixuan Jiang, Given the good interpretability and generalization,
David Z. Pan, Department of Electrical and Computer traditional machine learning models are widely used in
Engineering, University of Texas at Austin, USA
100
100
100 6000
5000
5000
a similar problem, the recommendation system design
5000 which targets at predicting the click rate based on
90
90 90 4000
4000
dense and sparse features, we adopt a state-of-the-art
4000 recommendation model, adaptive factorization network
Accuracy (%)
Accuracy (%)
Accuracy (%)

80

#Node (k)
80 80 3000
3000
(AFN) [18], to fit the sparse Boolean dataset. Fig. 20

#Node
#Node
#Node
3000
70
70 70 2000
2000
demonstrates the AFN structure and network configuration
2000 we use to fit the Boolean dataset. The embedding layer
60
60 60 1000
1000
1000
maps the high-dimensional Boolean feature to a 10-d
space and transform the sparse feature with a logarithmic
50
50 50 0 00 transformation layer. In the logarithmic neural network,
ex20

ex40

ex88
ex00
ex04
ex08
ex12
ex16

ex24
ex28
ex32
ex36

ex44
ex48
ex52
ex56
ex60
ex64
ex68
ex72
ex76
ex80
ex84

ex92
ex96
ex12

ex48

ex84
ex00
ex04
ex08

ex16
ex20
ex24
ex28
ex32
ex36
ex40
ex44

ex52
ex56
ex60
ex64
ex68
ex72
ex76
ex80

ex88
ex92
ex96
ex12

ex48

ex84
ex00
ex04
ex08

ex16
ex20
ex24
ex28
ex32
ex36
ex40
ex44

ex52
ex56
ex60
ex64
ex68
ex72
ex76
ex80

ex88
ex92
ex96
multiple vector-wise logarithmic neurons are constructed
Validation
Validation Acc.
Validation Acc.
Acc. (%)
(%)(%) #Node
#Node
#Node
to represent any cross features to obtain different higher-
order input combinations [18],
Fig. 21: Evaluation results on IWLS 2020 benchmarks. m
X 
yj = exp wij ln(Embed(F (d))) . (2)
Input Sparse Boolean Feature i=1
...
Sparse Feature Three small-scale fully-connected layers are used to
Embedding d to 10 embedding
combine and transform the crossed features after log-
arithmic transformation layers. Dropout layers after each
Logarithmic Transform
hidden layers are used to during training to improve the
generalization on the unknown don’t-care set.
LNN 128
2) Inference with Sub-Space Expansion: After training
the AFN-based function approximator, we need to gener-
Logarithmic NN Exponential & Concatenation
ate the dot-product terms in the PLA file and synthesize
FC-80 + Dropout-0.5
the AIG representation. Since we ignore all pruned input
dimension, we only assume our model can generalize in
FC-64 + Dropout-0.5
the reduced d-dimensional hypercube. Hence, we predict
all 2d input combinations with our trained approximator,
MLP Classifier FC-64 + Dropout-0.4 and set all other pruned inputs to don’t-care state. On 14
different feature groups F , we trained 14 different models
Binary Output {AFN0 , · · · , AFNd , · · · }, sorted in a descending order in
terms of validation accuracy. With the above 14 models,
we predict 14 corresponding PLA files {P0 , · · · , Pd . · · · }
Fig. 20: AFN [18] and configurations used for Boolean function with sub-space expansion to maximize the accuracy in
approximation. our target space while minimizing the node count by
pruning all other product terms, shown in Fig. 18. In the
ABC [19] tool, we use the node optimization command
feature engineering to evaluate the feature importance. sequence as resyn2, resyn2a, resyn3, resyn2rs,
We pre-train an AdaBoost [16] ensemble classifier and compress2rs.
with 100 ExtraTree sub-classifier on the training 3) Accuracy-Node Joint Search with ABC: For each
set to generate the importance score for all features. benchmark, we obtain multiple predicted PLA files to
Then we perform permutation importance ranking [17] be selected based on the node constraints. We search
for 10 times to select the top-d important features as the PLA with the best accuracy that meets the node
the care set variables F 1 (d). Given that the ultimate constraints,
accuracy is sensitive to the feature selection, we generate
another feature group at the second level to expand the P ∗ = arg max Acc(AIG(P), Dval ), (3)
search space. At the second level, we train two classifier P∼{··· ,Pd ,··· }
ensembles, one is an XGB classifier with 200 sub-trees s.t. N (AIG(P)) ≤ 5, 000.
and another is a 100-ExtraTree based AdaBoost
classifier. Besides, a stratified 10-fold cross-validation is If the accuracy is still very low, e.g., 60%, we resplit the
used to select top-d important features F 2 (d) based on dataset Dtrn and Dval and go to step (1) again in Fig. 18.
the average scores from the above two models. The entire 4) Results and Analysis: Fig. 21 shows our validation
14 candidates of input feature groups for each benchmark accuracy and number of node after node optimization.
are F = {F 1 (d), F 2 (d)}16d=10 . Our model achieves high accuracy on most benchmarks.
1) Deep Learning in the Sparse High-Dimensional While on certain cases regardless of the input count, our
Boolean Space: This learning problem is different from model fails to achieve a desired accuracy. An intuitive
continuous-space learning tasks, e.g., time-sequence- explanation is that the feature-pruning-based dimension
prediction, computer vision-related tasks, since its inputs reduction is sensitive on the feature selection. A repeating
are binarized with poor smoothness, which means high- procedure by re-splitting the dataset may help find a good
frequency patterns in the input features are important to feature combination to improve accuracy.
the model prediction, e.g., XNOR and XOR. Besides, the
extremely-limited training set gives an under-sampling C. Conclusion and Future Work
of the real distribution, such that a simple multi-layer We introduce the detailed framework we use to learn
perceptron is barely capable of fitting the dataset while the high-dimensional unknown Boolean function for
still having good generalization. Therefore, motivated by
Obtain Get important
on preliminary testing, we found that RFs could not
proportions
Train NN
features scale due to the contest’s 5000-gate limitation. This was
mainly due to the use of the majority voting, considering
Train. set
that the preliminaries expressions for that are too large
Generate new
Generate
to be combined between each other. Therefore, for this
Merge sets train. and Train DTs/RFs
Valid. set valid. sets
SOP proposal, we opted to limit the number of trees used in
the RFs to a value of three.
Other parameters in the DecisionT reeClassif ier
AIG with
highest Reduce Node
could be varied as well, such as the split metric, by
accuracy within
5000-AND limit
AIG
Evaluation
and Logic
Generate
EQN/AIG
changing the Gini metric to Entropy. However, prelimi-
Level
nary analyses showed that both metrics led to very similar
results. Since Gini was slightly better in most scenarios
Fig. 22: Design flow employed in this proposal. and is also less computationally expensive, it was chosen.
Even though timing was not a limitation described
by the contest, we still had to provide a solution that
we could verify that was yielding the same AIGs as
IWLS’20 contest. Our recommendation system based the ones submitted. Therefore, for every configuration
model achieves a top 3 smallest generalization gap on tested, we had to test every example given in the problem.
the test set (0.48%), which is a suitable selection for this Hence, even though higher depths could be used without
task. A future direction is to combine more networks and surpassing the 5000-gate limitation, we opted for 10 and
explore the unique characteristic of various benchmarks. 20 only, so that we could evaluate every example in a
feasible time.
V. TEAM 5 Besides training the DTs and RFs with varying depths,
Authors: Brunno Alves de Abreu, Isac de Souza we also considered using feature selection methods
Campos, Augusto Berndt, Cristina Meinhardt, from the Scikit-learn library. We opted for the use
Jonata Tyska Carvalho, Mateus Grellert and Sergio of the SelectKBest and SelectP ercentile methods.
Bampi, Universidade Federal do Rio Grande do Sul, These methods perform a pre-processing in the features,
Universidade Federal de Santa Catarina, Brazil eliminating some of them prior to the training stage.
The SelectKBest method selects features according to
Fig. 22 presents the process employed in this proposal. the k highest scores, based on a score function, namely
In the first stage, the training and valid sets provided in the f classif , mutual inf o classif or chi2, according to
problem description are merged. The ratios of 0’s and 1’s the Scikit-learn library [12]. The SelectP ercentile is
in the output of the newly merged set are calculated, and similar but selects features within a percentile range given
we split the set into two new training and validation sets, as parameter, based on the same score functions [12].
considering an 80%-20% ratio, respectively, preserving We used the values of 0.25, 0.5, and 0.75 for k, and the
the output variable distribution. Additionally, a second percentiles considered were 25%, 50%, and 75%.
training set is generated, equivalent to half of the The solution employing neural networks (NNs) was
previously generated training set. These two training considered after obtaining the accuracy results for the
sets are used separately in our training process, and their DTs/RFs configurations. In this case, we train the model
accuracy results are calculated using the same validation using the M LP Classif ier structure, to which we used
set to enhance our search for the best models. This was the default values for every parameter. Considering that
done because the model with the training set containing NNs present an activation function in the output, which
80% of the entire set could potentially lead to models is non-linear, the translation to a SOP would not be
with overfitting issues, so using another set with 40% of possible using conventional NNs. Therefore, this solution
the entire provided set could serve as an alternative. only uses the NNs to obtain a subset of features based
After the data sets are prepared, we train the DTs on their importance, i.e., select the set of features with
and RFs models. Every decision model in this proposal the corresponding highest weights. With this subset of
uses the structures and methods from the Scikit-learn features obtained, we evaluate combinations of functions,
Python library. The DTs and RFs from this library use using ”OR,” ”AND,” ”XOR,” and ”NOT” operations
the Classification and Regression Trees (CART) algorithm among them. Due to the fact that the combinations of
to assemble the tree-based decision tools [20]. To train functions would not scale well, in terms of time, with the
the DT, we use the DecisionT reeClassif ier structure, number of features, we limit the sub-set to contain only
limiting its max depth hyper-parameter to values of 10 four features. The number of expressions evaluated for
and 20 due to the 5000-gate limitation of the contest. The each NN model trained was 792. This part of the proposal
RF model was trained similarly, but with an ensemble was mainly considered given the difficulty of DTs/RFs
of DecisionT reeClassif ier: we opted not to employ in finding trivial solutions for XOR problems. Despite
the RandomF orestClassif ier structure given that it solving the problems from specific examples whose
applied a weighted average of the preliminary decisions solution was a XOR2 between two of the inputs, with
of each DT within it. Considering that this would require a 100% accuracy, we were able to slightly increase the
the use of multipliers, and this would not be suitable to maximum accuracy of other examples through this scan of
obtain a Sum-Of-Products (SOP) equivalent expression, functions. The parameters used by the M LP Classif ier
using several instances of the DecisionT reeClassif ier were the default ones: 100 hidden layers and ReLu
with a simple majority voter in the output was the choice activation function [12].
we adopted. In this case, each DT was trained with a The translation from the DT/RF to SOP was imple-
random subset of the total number of features. Based mented as follows: the program recursively passes through
every tree path, concatenating every comparison. When limitation, as it restrained us from using a higher number
it moves to the left child, this is equivalent to a ”true” of trees. The NNs were mainly useful to solve XOR2
result in the comparison. In the right child, we need a problems and in problems whose best accuracy results
”NOT” operator in the expression as well. In a single from DTs/RFs were close to 50%. It can also be observed
path to a leaf node, the comparisons are joined through that the use of the feature selection methods along with
an ”AND” operation, given that the leaf node result will DTs/RFs was helpful; therefore, cropping a sub-set of
only be true when all the comparisons conditions are true. features based on these methods can definitely improve
However, given that this is a binary problem, we only the model obtained by the classifiers. The number of best
consider the ”AND” expression of a path when the leaf examples based on the different scoring functions used in
leads to a value of 1. After that, we perform an ”OR” the feature selection methods shows that the chi2 function
operation between the ”AND” expressions obtained for was the most useful. This is understandable given that
each path, which yields the final expression of that DT. this is the default function employed by the Scikit-learn
The RF scenario works the same way, but it considers library. Lastly, even though the proportions from 80%-
the expression required for the majority gate as well, 20% represented the majority of the best results, it can
whose inputs are the ”OR” expressions of each DT. be seen that running every model with a 40%-20% was
From the SOP, we obtain the AIG file. The generated also a useful approach.
AIG file is then optimized using commands of the ABC
tool [14] attempting to reduce the number of AIG nodes Table VI: Number of best examples based on the characteristics
and the number of logic levels by performing iterative of the configurations.
collapsing and refactoring of logic cones in the AIG, and
Characteristic Parameter # of examples
rewrite operations. Even if these commands could be
iteratively executed, we decided to run them only once DT 55
given that the 5000-gate limitation was not a significant Decision Tool RF 28
NN 17
issue for our best solutions, and a single sequence of them
was enough for our solutions to adhere to the restriction. Select K Best 48
Finally, we run the AIG file using the AIG evaluation Feature Selection Select Percentile 11
commands provided by ABC to collect the desired results None 41
of accuracy, number of nodes, and number of logic levels chi2 34
for both validation sets generated at the beginning of our f classif 6
Scoring Function
flow. mutual info classif 19
None 41
All experiments were performed three times with
different seed values using the Numpy random 40%-20% 23
Proportion
seed method. This was necessary given that the 80%-20% 77
DecisionT reeClassif ier structure has a degree of
randomness. Considering it was used for both DTs and
RFs, this would yield different results at each execution. VI. TEAM 6
The M LP Classif ier also inserts randomnesses in the Authors: Aditya Lohana, Shubham Rai and Akash Ku-
initialization of weights and biases. Therefore, to ensure mar, Chair for Processor Design, Technische Universitaet
that the contest organizers could perfectly replicate the Dresden, Germany
code with the same generated AIGs, we fixed the seeds To start of with, we read the training pla files using
to values of 0, 1, and 2. Therefore, we evaluated two ABC and get the result of &mltest directly on the train
classifiers (DTs/RFs), with two maximum depths, two data. This gives the upper limit of the number of AND
different proportions, and three different seeds, which gates which are used. Then, in order to learn the unknown
leads to 24 configurations. For each of the SelectKBest Boolean function, we have used the method as mentioned
and SelectP ercentile methods, considering that we ana- in [11]. We used LUT network as a means to learn from
lyzed three values of K and percentage each, respectively, the known set to synthesize on an unknown set. We
along with three scoring functions, we have a total of 18 use the concept of memorization and construct a logic
feature selection methods. Given that we also tested the network.
models without any feature selection method, we have In order to construct the LUT network, we use the
19 possible combinations. By multiplying the number minterms as input features to construct layers of LUTs
of configurations (24) with the number of combinations with connections starting from the input layer. We then
with and without feature selection methods (adding up try out two schemes of connections between the layers:
to 19), we obtain a total of 456 different DT/RF models ‘random set of input’ and ‘unique but random set of
being evaluated. Additionally, each NN, as mentioned, inputs’. By ‘random set of inputs’, we imply that we
evaluated 792 expressions. These were also tested with just randomly select the outputs of preceding layer and
two different training set proportions and three different feed it to the next layer. This is the default flow. By
seeds, leading to a total of 4752 expressions for each of ‘unique but random set of inputs’, we mean that we
the 100 examples of the contest. ensure that all outputs from a preceding layer is used
Table VI presents some additional information on the before duplication of connection. This obviously makes
configurations that presented the best accuracy for the sense when the number of connections is more than the
examples from the contest. As it can be seen, when we number of outputs of the preceding layer.
split the 100 examples by the decision tool employed, We have four hyper parameters to experiment with in
most of them obtained the best accuracy results when order to achieve good accuracy– number of inputs per
using DTs, followed by RFs and NNs, respectively. The LUT, number of LUTS per layers, selection of connecting
use of RFs was not as significant due to the 5000-gate edges from the preceding layer to the next layer and the
x1? .i 3 VII. TEAM 7: L EARNING WITH T REE -BASED
.o 1 M ODELS AND E XPLANATORY A NALYSIS
0 1
.p 3
.ilb x1 x2 x3
Authors: Wei Zeng, Azadeh Davoodi, and Rasit Onur
x3? 1 Topaloglu, University of Wisconsin–Madison, IBM, USA
0-0 1
0 1 0-1 0 Team 7’s solution is a mix of conventional machine
1-- 1 learning (ML) and pre-defined standard function match-
1 0 .e ing. If the training set matches a pre-defined standard
function, a custom AIG of the identified function is
(a) (b) written out. Otherwise, an ML model is trained and
Fig. 23: (a) A decision tree and (b) its corresponding PLA. translated to an AIG.
Team 7 adopts tree-based ML models, considering the
straightforward correspondence between tree nodes and
x1? x2? SOP terms. The model is either a single decision tree
0 1 0 1 with unlimited depth, or an extreme gradient boosting
(XGBoost) [21] of 125 trees with a maximum depth of
x3? +0.3 + x1? x3? +… 5, depending on the results of a 10-fold cross validation
0 1 0 1 0 1 on training data.
It is straightforward to convert a decision tree to SOP
+0.8 -0.2 +0.3 -0.5 -0.7 +0.6 terms used in PLA. Fig. 23 shows a simple example of
(a)
decision tree and its corresponding SOP terms. Numbers
in leaves (rectangular boxes) indicate the predicted output
x1? x2? values. XGBoost is a more complex model in two aspects.
0 1 0 1 First, it is a boosting ensemble of many shallow decision
trees, where the final output is the sum of all leaf values
x3? 1 + x1? x3? +… that a testing data point falls into, expressed in log odds
0 1 0 1 0 1 log[P (y = 1)/P (y = 0)]. Second, since the output is
1 0
a real number instead of a binary result, a threshold of
1 0 0 1
classification is needed to get the binary prediction (the
(b) default value is 0). Fig. 24(a) shows a simple example
of a trained XGBoost of trees.
Fig. 24: XGBoost of decision trees, (a) before and (b) after With the trained model, each underlying tree is
quantization of leaf values. Plus signs mean to add up resulting converted to a PLA, where each leaf node in the tree
leaf values, one from each tree. corresponds to a SOP term in the PLA. Each PLA are
minimized and compiled to an AIG with the integrated
espresso and ABC, respectively. If the function is
depth (number of LUT layers) of the model. We carry learned by a single decision tree, the converted AIG is
out experiments with varying number of inputs for each final. If the function is learned by XGBoost of trees, the
LUT in order to get the maximum accuracy. We notice exact final output of a prediction would be the sum of the
from our experiments that 4-input LUTs returns the best 125 leaves (one for each underlying tree) where the testing
average numbers across the benchmark suite. We also data point falls into. In order to implement the AIGs
found that increasing the number of LUTs per layer or efficiently, the model is approximated in the following
number of layers does not directly increases the accuracy. two steps. First, the value of each underlying tree leaf
This is due to the fact, because increasing the number is quantized to one bit, as shown in Fig. 24(b). Each
of LUTs allows repeatability of connections from the test data point will fall into 125 specific leaves, yielding
preceding layer. This leads to passing of just zeros/ones 125 output bits. To mimic the summation of leaf values
to the succeeding layers. Hence, the overall network tends and the default threshold of 0 for classification, a 125-
towards a constant zero or constant one. input majority gate could be used to get the final output.
After coming up with an appropriate set of connections, However, again for efficient implementation, the 125-
we create the whole network. Once we create the network, input majority gate is further approximated by a 3-layer
we convert the network into an SOP form using sympy network of 5-input majority gates as shown in Fig. 25.
package in python. This is done from reverse topological In the above discussion, it is assumed that the gate
order starting from the outputs back to the inputs. We count does not exceed the limit of 5000. If it is not the
then pass the SOP form of the final output and create a case, the maximum depth of the decision tree and the
verilog file out of that. Using the verilog file, we convert trees in XGBoost, and/or the number of trees in XGBoost,
it into AIG format and then use the ABC command to list can be reduced at the cost of potential loss of accuracy.
out the output of the &mltest. We also report accuracy Tree-based models may not perform well in some
of our network using sklearn. pre-defined standard functions, especially symmetric
We have also incorporated use of data from validation functions and complex arithmetic functions. However,
testcase. For our training model, we have used ‘0.4’ part symmetric functions are easy to be identified by com-
of the minterms in our training. paring the number of ones and the output bit. And it
can be implemented by adding a side circuit that counts
N1 , i.e., the number of ones in the input bits, and a
decision tree that learns the relationship between N1 and
the original output. For arithmetic functions, patterns in
4 the importance of input bits can be observed in some pre-
defined standard functions, such as adders, comparators,
outputs of XOR or MUX. This is done by training an
2 initial XGBoost of trees and use SHAP tree explainer
[1] to evaluate the importance of each input bit. Fig. 26
shows that SHAP importance shows a pattern in training
Mean SHAP

0 sets that suggests important input bits, while correlation


coefficient fails to show a meaningful pattern. Fig. 27
shows an pattern of mean SHAP values of input bits,
−2 which suggests the possible existence of two signed binary
coded integers with opposite polarities in the function.
Based on these observations, Team 7 checks before ML
−4 if the training data come from a symmetric function, and
compares training data with each identified pre-defined
0 20 40 60 80 100 120
Input bit standard function. In case of a match, an AIG of the
standard function is constructed directly without ML.
Fig. 27: Mean SHAP values of input bits in ex35, comparator With function matching, all six symmetric functions and
of two 60-bit signed integers, showing a pattern of “weights” 25 out of 40 arithmetic functions can be identified with
that correspond to two signed integers with opposite polarities. close to 100% accuracy.
VIII. TEAM 8: L EARNING B OOLEAN
(Final prediction) C IRCUITS WITH ML M ODEL E NSEMBLE
Authors: Yuan Zhou, Jordan Dotzel, Yichi Zhang,
Hanyu Wang, Zhiru Zhang, School of Electrical and
MAJ5 Computer Engineering, Cornell University, Ithaca, NY,
USA
MAJ5 …
A. Machine Learning Model Ensemble

MAJ5 … MAJ5 … Model ensemble is a common technique in the machine


learning community to improve the performance on a
classification or regression task. For this contest, we use
a specific type of ensemble called “bucket of models”,
(125 outputs from underlying trees) where we train multiple machine learning models for each
benchmark, generate the AIGs, and select the model that
Fig. 25: Network of 5-input majority gates as an approximation achieves the best accuracy on the validation set from
of a 125-input majority gate. all models whose AIGs are within the area constraint
of 5k AND gates. The models we have explored in
0.03 this contest include decision trees, random forests, and
multi-layer perceptrons. The decision tree and multi-layer
0.02
perceptron models are modified to provide better results
0.01 on the benchmarks, and our enhancements to these models
Corr. coef.

are introduced in Section VIII-B and Section VIII-D,


0.00 respectively. We perform a grid search to find the best
−0.01
combination of hyper-parameters for each benchmark.
−0.02 B. Decision Tree for Logic Learning
0 10 20 30
Input bit
40 50 60 Using decision trees to learn Boolean functions can be
dated back to the 1990s [22]. As a type of light-weight
(a) machine learning models, decision trees are very effective
when the task’s solution space has a “nearest-neighbor”
3.0 property and can be divided into regions with similar
2.5 labels by cuts that are parallel to the feature axes. Many
Boolean functions have these two properties, making them
Mean |SHAP|

2.0
ideal targets for decision trees. In the rest of this section
1.5 we introduce our binary decision tree implementation, and
1.0
provide some insights on the connection between decision
tree and Espresso [23], a successful logic minimization
0.5
algorithm.
0.0
0 10 20 30 40 50 60
Input bit

Fig. 26: Comparison of two importance metrics: (a) correlation


coefficient and (b) mean absolute SHAP value in ex25, the
MSB of a 32x32 multiplier.
the samples with zero and one labels. In such cases, it is
𝑓 = 𝐴 𝑥𝑜𝑟 𝐵 A=1 likely that the decision tree will pick an irrelevant feature
No Yes
at the root node.
ABC f If the decision tree selects an irrelevant or incorrect
Entries Entries feature at the beginning, it is very difficult for it to
000 0
0,1,2,3 4,5,6,7 ”recover” during the rest of the training process due to
001 0 the lack of a backtracking procedure. Finding the best
Mutual information = 0
010 1 decision tree for any arbitrary dataset requires exponential
011 1 C=1 time. While recent works [25] propose to use smart
No Yes branch-and-bound methods to find optimal sparse decision
100 1 trees, it is still impractical to use such approaches for
101 1 Entries Entries logic learning when the number of inputs is large. As
110 0 0,2,4,6 1,3,5,7 a result, we propose to apply single-variable functional
decomposition when the maximum mutual information is
111 0 Mutual information = 0 lower than a set threshold τ during splitting. The value
of τ is a hyper-parameter and is included in the grid
Fig. 28: Example of learning a two-input XOR with an search for tuning. At each none-leaf node, we test all
irrelevant input. The whole truth table is provided for illustration non-used features to find a feature that satisfies either of
purposes. The impurity function based on mutual information the following two requirements:
cannot distinguish the useful inputs (A and B) from the 1) At least one branch is constant after splitting by this
irrelevant input C. This is because mutual information only feature.
considers the probability distributions at the current node (root) 2) One branch is the complement of the other after
and its children. splitting by this feature.
In many cases we don’t have enough data samples to
fully test the second requirement. As a result, we take
1) Binary Decision Tree: We use our own implemen- an aggressive approach where we consider the second
tation of the C4.5 decision tree [24] for the IWLS’20 requirement as being satisfied unless we find a counter
programming contest. The decision trees directly take the example. While such an aggressive approach may not
Boolean variables as inputs. We use mutual information find the best feature to split, it provides an opportunity to
as the impurity function: when evaluating the splitting avoid the effect of sampling noise. In our implementation,
criterion at each non-leaf node, the feature split that functional decomposition is triggered more often at the
provides the maximum mutual information is selected. beginning of the training and less often when close to
During training, our decision tree continues splitting at the leaf nodes. This is because the number of samples at
each node unless one of the following three conditions each non-leaf node decreases exponentially with respect
is satisfied: to the depth of the node, and the sampling noise becomes
1) All samples at a node have the same label. more salient with few data samples. When the number
2) All samples at a node have exactly the same features. of samples is small enough, the mutual information is
3) The number of samples at a node is smaller than a unlikely to fall below τ , so functional decomposition will
hyper-parameter N set by the user. no longer be triggered.
The hyper-parameter N helps avoid overfitting by allow- We observed significant accuracy improvement after
ing the decision tree to tolerate errors in the training incorporating functional decomposition into the training
set. With larger N , the decision tree learns a simpler process. However, we also realized that the improve-
classification function, which may not achieve the best ment in some benchmarks might be because of an
accuracy on the training set but generalizes better to implementation detail: if multiple features satisfy the
unseen data. two requirements listed above, we select the last one.
2) Functional Decomposition for Split Criterion Selec- This implementation decision happens to help for some
tion: The splitting criterion selection method described benchmarks, mostly because our check for the second
in Section VIII-B1 is completely based on the probability requirement is too aggressive. This finding was also
distribution of data samples. One drawback of such confirmed by team 1 after our discussion with them.
probability-based metrics in logic learning is that they are 3) Connection with Espresso: We found an interesting
very sensitive to the sampling noise, especially when the connection between decision tree and the Espresso logic
sampled data is only a very small portion of the complete optimization method. For the contest benchmarks, only
truth table. Fig. 28 shows an example where we try to 6.4k entries of the truth table are given and all other
learn a three-input function whose output is the XOR of entries are don’t cares. Espresso exploits don’t cares by
two inputs A and B. An optimal decision tree for learning expanding the min-terms in the SOP, which also leverages
this function should split by either variable A or B at the “nearest neighbor” property in the Boolean input
the root node. However, splitting by the correct variable space. Decision trees also leverage this property, but
A or B does not provide higher gain than splitting by they exploit don’t cares by making cuts in the input
the irrelevant variable C. Depending on the tie-breaking space. As a result, neither Espresso nor decision tree is
mechanism, the decision tree might choose to split by able to learn a logic function from an incomplete truth
either of the three variables at the root node. For the table if the nearest neighbor property does not hold. For
contest benchmarks, only a small portion of the truth example, according to our experiments, neither technique
table is available. As a result, it is possible that the can correctly learn a parity function even if 90% of the
sampling noise will cause a slight imbalance between truth table has been given.
4) Converting into AIG: One advantage of using disadvantage is the exponential increase in local minima
decision trees for logic learning is that converting a [29], which makes training more unstable.
decision tree into a logic circuit is trivial. Each non-
leaf node in the tree can be implemented as a two-input F. Logic Synthesis
multiplexer, and the multiplexers can be easily connected After training the (sine-based) MLP models, we convert
by following the structure of the tree. The values at the them into AIGs. Some common methods for this conver-
leaf nodes are simply implemented as constant inputs to sion are binary neural networks [30], quantized neural
the multiplexers. We follow this procedure to convert each networks [31], or directly learned logic [32]. Instead, we
decision tree into an AIG, and use ABC [14] to further took advantage of the small input size and generated the
optimize the AIG to reduce area. After optimization, none full truth tables of the neural network by enumerating
of the AIGs generated from decision trees exceeds the all inputs and recording their outputs. Then, we passed
area budget of 5k AND gates. this truth table to ABC for minimization. This method
is a simpler variant to one described in LogicNets [33].
C. Random Forest It easily produces models within the 5k gate limit for
Random forest can be considered as an ensemble of smaller models, which with our constraints corresponds
bagged decision trees. Each random forest model contains to benchmarks with fewer than 20 inputs.
multiple decision trees, where each tree is trained with a
different subset of data samples or features so that the G. Summary and Future Work
trees won’t be very similar to each other. The final output We have described the ML models we used for
of the model is the majority of all the decision trees’ the IWLS’20 contest. Based on our evaluation on the
predictions. validation set, different models achieve strong results
We used the random forest implementation from sci- on different benchmarks. Decision trees achieve the
kit learn [12] and performed grid search to find the best best accuracy for simple arithmetic benchmarks (0-39),
combination of hyper-parameters. For all benchmarks random forest gives better performance on the binary
we use a collection of seventeen trees with a maximum classification problems (80-99), while our special MLP
depth of eight. To generate the AIG, we first convert dominates on the symmetric functions (70-79). It remains
each tree into a separate AIG, and then connect all a challenge to learn complex arithmetic functions (40-49)
the AIGs together with a seventeen-input majority gate. from a small amount of data. Another direction for future
The generated AIG is further optimized using ABC. work is to convert MLP with a large number of inputs
None of the AIGs exceeds the 5K gate constraint after into logic.
optimization.
D. Multi-Layer Perceptron IX. TEAM 9:B OOTSTRAPPED CGP L EARNING
Our experiments find that multi-layer perceptrons F LOW
(MLP) with a variety of activation functions perform Authors: Augusto Berndt, Brunno Alves de Abreu,
well on certain benchmark classes. These models come Isac de Souza Campos, Cristina Meinhardt, Mateus
with a few core challenges. First, it is difficult to Grellert, Jonata Tyska Carvalho, Universidade Federal
translate a trained floating-point model to an AIG without de Santa Catarina, Universidade Federal do Rio Grande
significantly decreasing the accuracy. Thus, we limit do Sul, Brazil
the size of the MLP, synthesize it precisely through
enumeration of all input-output pairs, and carefully reduce
resultant truth table. Second, MLP architectures have A. Introduction
difficulties with latent periodic features. To address this, Team 9 uses a population-based optimization method
we utilize the sine activation function which can select called Cartesian Genetic Programming (CGP) for synthe-
for periodic features in the input, validating previous sizing circuits for solving the IWLS’20 contest problems.
work in its limited applicability to learning some classes The flow used includes bootstrapping the initial pop-
of functions. ulation with solutions found by other well-performing
methods as decision trees and espresso technique, and
E. Sine Activation further improving them. The results of this flow are
We constructed a sine-based MLP that performed as two-fold: (i) it allows to improve further the solutions
well or better than the equivalent ReLU-based MLP found by the other techniques used for bootstrapping
on certain benchmark classes. This work is concurrent the evolutionary process, and (ii) alternatively, when
with a recent work that applies periodic activations no good solutions are provided by the bootstrapping
to implicit neural representation [26], and draws from methods, CGP starts the search from random (unbiased)
intermittent work over the last few decades on periodic individuals seeking optimal circuits. This text starts with
activation functions [27][28]. In our case, we find the sine a brief introduction to the CGP technique, followed by
activation function performs well in certain problems that presenting the proposed flow in detail.
have significant latent frequency components, e.g. parity CGP is a stochastic search algorithm that uses an
circuit. It also performs well in other cases since for evolutionary approach for exploring the solution’s param-
sufficiently small activations sine is monotonic, which eter space, usually starting from a random population of
passes gradients in a similar manner to the standard candidate solutions. Through simple genetic operators
activation functions, e.g ReLU. We learn the period of such as selection, mutation and recombination, the search
the activation function implicitly through the weights process often leads to incrementally improved solutions,
inputted into the neuron. However, we find that one despite there is no guarantee of optimality.
The evolutionary methods seek to maximize a fitness time, its complete genetic code representation is known
function, which usually is user-defined and problem- as its Genotype.
related, that measures each candidate solution’s per- As demonstrated by [35], the CGP search has a
formance, also called individual. The parameter space better convergence if phenotypically larger solutions
search is conducted by adding variation to the individuals are considered preferred candidates when analyzing the
by mutation or recombination and retaining the best individual’s fitness scores. In other words, if, during the
performing individuals through the selection operator evolution, there are two individuals with equal accuracy
after evaluation. Mutation means that specific individ- but different functional sizes, the larger one will be
ual’s parameters are randomly modified with a certain preferred. The proposed flow makes use of such a
probability. Recombination means that individuals can technique by always choosing AIGs with larger available
be combined to generate new individuals, also with sizes as fathers when a tie in fitness values happens.
a given probability. This latter process is referred to The proposed CGP flow is presented in Fig. 30. It
as crossover. However, the CGP variation does not first tries to start from a well-performing AIG previously
employ the crossover process since it does not improve found using some other method as, for instance, Decision
CGP efficiency [34]. The genetic operators are applied Trees(DT) or SOP via the Espresso technique. We call this
iteratively, and each iteration is usually called generation. initialization process bootstrap. It is noteworthy that the
Fig. 29 presents an example of a CGP implementation proposed flow can use any AIG as initialization, making
with one individual, its set of primitive logic functions, it possible to explore previous AIGs from different
and the individual’s genetic code. The CGP structure is syntheses and apply the flow as a fine-tuning process.
represented as a logic circuit, and the CGP Cartesian If the bootstrapped AIG has an accuracy inferior to
values are also shown. Notice that there is only one 55%, the CGP starts the search from a random (unbiased)
line present in the two-dimensional graph, in reference population seeking optimal circuits; otherwise, it runs a
[35] it is shown that such configuration provides faster bootstrapped initialization with the individual generated
convergence when executing the CGP search. Recent by previously optimized Sum of Products created by
works also commonly make use of such configuration Decision Trees or Espresso. Summarizing, the flow’s first
[36]. The CGP implemented in work presented herein step is to evaluate the input AIG provided by another
used such configuration of only one line and multiple method and check whether it will be used to run the
columns. Please note that one may represent any logic CGP search, or the search will be conducted using a new
circuit in a single line as one would in a complete two- random AIG.
dimensional graph.
Fig. 29 also presents the individuals’ representation,
sometimes called genotype or genetic coding. This code
is given by a vector of integers, where a 3-tuple describes
each node, the first integer corresponds to the node’s logic
function, and the other two integers correspond to the
fan-ins’ Cartesian values. The number of primary inputs,
primary outputs, and internal CGP nodes is fixed, i.e.,
are not under optimization. For the contest specifically,
the PIs and POs sizes will always correspond to the
input PLA file’s values. At the same time, the number
of internal CGP nodes are parametrized and set as one
of the program’s input information.
Notice that the circuit presented in Fig. 29 is composed
of a functional and a non-functional portion, meaning
that a specific section of the circuit does not influence its
primary output. The non-functional portion of the circuit
is drawn in gray, while the functional portion is black.
The functional portion of a CGP individual is said to
be its Phenotype, in other words, the observable portion
with relation to the circuit’s primary outputs. At the same Fig. 30: Proposed CGP Flow.

Due to the CGP training approach, the learning set


must be split among the CGP and the other method for
using the Bootstrap option. When this is the case, the
training set is split in a 40%-40%/20% format, meaning
half the training set is used by the initialization method
and the other half by the CGP, while leaving 20% of
the dataset for testing with both methods involved in the
AIG learning.
An empiric exploration of input hyper-parameters con-
figurations was realized, attempting to look for superior
solutions. The evolutionary approach used is the (1+4)-
ES rule. In this approach, one individual generates four
Fig. 29: Single Line CGP Example. new mutated copies of itself, and the one with the
best performance is selected for generating the mutated
Train/Test
Initialization AIG
Format
Batch Change Ubuntu 18.04LTS image. The CGP was implemented in
Type Size Size Each C++ with object-oriented paradigm, codes were compiled
(%)
Bootstrap
Twice the
40-40/20
Half Not on GCC 7.5.
Original AIG Train Set Applicable

1024,
X. TEAM 10
500, 1000,
Random
5000
80/20 Complete
2000 Authors: Valerio Tenace, Walter Lau Neto, and
Train Set Pierre-Emmanuel Gaillardon, University of Utah, USA
Table VII: Hyper-Parameters Dependent on Initialization
In order to learn incompletely specified Boolean
functions for the contest, we decided to resort to decision
trees (DTs). Fig. 31 presents an overview of the adopted
copies of the next generation. Variation is added to the design flow.
individuals through mutations. The mutation rate is under
optimization according to the 1/5th rule [37], in which Training Validation
mutation rate value varies jointly with the proportion of PLA PLA
individuals being better or worst than their ancestor.
The CGP was run with three main hyper-parameter:
(1) the number of generations with values of 10, 20, 25, DT Training
50 and 100 thousand; (2) two logic structures available:
Training
AIG and XAIG; and (3) the program had the option to Augmentation
check for all nodes as being a PO during every training Accuracy
Custom Python
Check
generation. The latter is a computationally intensive Validation <= 70%
strategy, but some exemplars demonstrated good results Validation > 70%

with it. The CGP flow runs exhaustively, combining these Verilog
three main hyper-parameter options. Generation

Together with the previously mentioned hyper-


parameters, there is a set of four other hyper-parameters Synthesis and
dependent on the initialization option, presented in Ta- Optimization
ABC

ble VII. For a bootstrap option of a previous AIG, there


is a single configuration where the CGP size is twice Fig. 31: Overview of adopted flow.
the Original AIG, meaning that for each node in the
Original AIG, a non-functional node is incremented to the We developed a Python program using the Scikit-learn
learning model, increasing the search space. The training library where the parameter max depth serves as an upper-
configuration takes half the training set since the other bound to the growth of the trees. Through many rounds of
half was already used by the initialization method. This evaluation, we opted to set the max depth to 8, as it gives
bootstrap option does not execute with mini-batches since small AIG networks without sacrificing the accuracy too
they add stochasticity for the search and the intention is much. The training set PLAs, provided by the contest, are
to fine-tune the AIG solution. read into a numpy matrix, which is used to train the DT.
For a random initialization, multiple hyper-parameters On the other hand, the validation set PLA is used to test
alternatives are explored in parallel. The AIG size is whether the obtained DT meets the minimum validation
configured to 500 and 5000 nodes. The training set does accuracy, which is empirically set to be 70%. If such a
not have to be shared, meaning that the whole training set condition is not met, the training set is augmented by
is employed for the search. Training mini-batches of size merging it with the validation set. According to empirical
1024 are used, meaning that the individuals are evaluated evaluations, most of these benchmarks with performance
using the same examples during a certain number of < 70% showed a validation accuracy fluctuating around
consecutive generations. The number of generations a 50%, regardless of the size and shapes of the DTs. This
mini-batch is kept is determined by the change each suggests that the training sets were not able to provide
hyper-parameter, meaning that the examples used for enough representative cases to effectively exploit the
evaluating individuals change each time that number of adopted technique, thus leading to DTs with very high
generations is reached. A previous study showed that training accuracy, but completely negligible performances.
this technique could lead to the synthesis of more robust For DTs having a validation accuracy ≥ 70%, the tree
solutions, i. e., solutions that generalize well to unseen structure is annotated as a Verilog netlist, where each DT
instances [38]. The random initialization could also use node is replaced with a multiplexer. In cases where the
the complete train set as if the batch size was the same accuracy does not achieve the lower bound of 70%, we re-
as the training set available. train the DT with the augmented set, and then annotate it
Finally, the last step evaluates the generated AIGs, collecting the accuracy, the number of nodes, and the number of logic levels on the validation sets. All AIGs generated by the CGP respect the size limit of 5000 AND nodes, and the AIG with the highest validation accuracy is chosen as the final solution for each benchmark, as sketched below. The CGP flow thus shows potential for optimizing previously obtained solutions, with an easily configurable optimization metric. Experiments were executed on machines of the Emulab Utah [39] research cluster.
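The selection rule amounts to filtering out oversized candidates and taking the one with the highest validation accuracy. The following minimal sketch assumes each candidate is recorded as a dictionary holding its AIG, validation accuracy, node count, and level count; this record layout is our assumption, not the flow's actual data structure.

def pick_final_aig(candidates, max_nodes=5000):
    # Keep only candidates within the contest size limit, then return
    # the one with the highest validation accuracy.
    feasible = [c for c in candidates if c["nodes"] <= max_nodes]
    return max(feasible, key=lambda c: c["accuracy"])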
Fig. 31: Overview of adopted flow (a custom Python step trains the DT and checks its validation accuracy; accuracy <= 70% triggers training-set augmentation, accuracy > 70% proceeds to Verilog generation, followed by synthesis and optimization with ABC).

We developed a Python program using the Scikit-learn library, in which the parameter max_depth serves as an upper bound on the growth of the trees. Through many rounds of evaluation, we opted to set max_depth to 8, as it gives small AIG networks without sacrificing too much accuracy. The training-set PLAs provided by the contest are read into a numpy matrix, which is used to train the DT. The validation-set PLA, in turn, is used to test whether the obtained DT meets the minimum validation accuracy, which is empirically set to 70%. If this condition is not met, the training set is augmented by merging it with the validation set. According to our empirical evaluations, most of the benchmarks with performance below 70% showed a validation accuracy fluctuating around 50%, regardless of the size and shape of the DTs. This suggests that their training sets were not able to provide enough representative cases to effectively exploit the adopted technique, leading to DTs with very high training accuracy but negligible validation performance.
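The training-and-check step described above can be sketched as follows. This is a reconstruction from the description, not the team's actual program; the minimal PLA reader and the file names train.pla and validation.pla are our assumptions.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def load_pla(path):
    # Tiny PLA reader: keeps only "<input bits> <output bit>" cube lines
    # and skips the .i/.o/.p/.type/.e directives. Fully specified
    # minterms (no '-' in the input part) are assumed.
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("."):
                continue
            bits, out = line.split()
            X.append([int(b) for b in bits])
            y.append(int(out))
    return np.array(X), np.array(y)

X_train, y_train = load_pla("train.pla")
X_val, y_val = load_pla("validation.pla")

# Depth-limited decision tree, with max_depth = 8 as described above.
dt = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)

# Below the 70% bound: augment the training set with the validation set
# and retrain before emitting Verilog.
if accuracy_score(y_val, dt.predict(X_val)) < 0.70:
    X_aug = np.vstack([X_train, X_val])
    y_aug = np.concatenate([y_train, y_val])
    dt = DecisionTreeClassifier(max_depth=8).fit(X_aug, y_aug)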
For DTs having a validation accuracy ≥ 70%, the tree structure is annotated as a Verilog netlist, where each DT node is replaced with a multiplexer. In cases where the accuracy does not achieve the lower bound of 70%, we re-train the DT with the augmented set and then annotate it in Verilog.
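The annotation maps each internal DT node (a test on one input bit) to a 2-to-1 multiplexer that selects between its two subtrees, and each leaf to a constant. The sketch below shows one way to emit such a netlist from a trained Scikit-learn tree; it is our illustration rather than the generator described in [40], and the module and signal names are placeholders.

def tree_to_verilog(dt, n_inputs, module_name="ex_dt"):
    # Emit a Verilog module whose expression mirrors the decision tree:
    # each internal node becomes a conditional on the tested input bit
    # (i.e., a 2-to-1 mux), each leaf the constant of its majority class.
    tree = dt.tree_

    def expr(node):
        if tree.children_left[node] == -1:  # leaf node
            cls = dt.classes_[tree.value[node][0].argmax()]
            return "1'b1" if cls == 1 else "1'b0"
        bit = tree.feature[node]
        low = expr(tree.children_left[node])    # taken when x[bit] == 0
        high = expr(tree.children_right[node])  # taken when x[bit] == 1
        return "(x[%d] ? %s : %s)" % (bit, high, low)

    ports = "input [%d:0] x, output y" % (n_inputs - 1)
    return "module %s(%s);\n  assign y = %s;\nendmodule\n" % (
        module_name, ports, expr(0))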
The obtained netlist is then processed with the ABC synthesis tool in order to generate a compact and optimized AIG structure. More detailed information about the adopted technique can be found in [40].
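This step can be driven from the same Python program by handing ABC a short command script, as in the sketch below. The recipe shown (read_verilog, strash, dc2, write_aiger) is a common ABC sequence and an assumption on our part, not necessarily the exact script used by the team; the abc binary is assumed to be on the PATH.

import subprocess

def synthesize_with_abc(verilog_path, aig_path):
    # Structurally hash the netlist into an AIG, optimize it, report its
    # size, and write it out in AIGER format.
    script = "read_verilog %s; strash; dc2; print_stats; write_aiger %s" % (
        verilog_path, aig_path)
    subprocess.run(["abc", "-c", script], check=True)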
Overall, our approach was capable of achieving very good accuracy for most of the cases without exceeding 300 AIG nodes in any benchmark, thus yielding the smallest average network sizes among all the participants. In many cases, we achieved an accuracy greater than 90% with less than 50 nodes, and a mean accuracy of 84% with only 140 AND gates on average. Fig. 32 presents the accuracy achieved for all the adopted benchmarks, whereas Fig. 33 shows the AIG size for the same set of circuits. These results clearly show that DTs are a viable technique to learn Boolean functions efficiently.

Fig. 32: Accuracy for different benchmarks (y-axis: Accuracy (%), 0–100; x-axis: benchmarks ex00–ex98).

Fig. 33: Number of AIG nodes for different benchmarks (y-axis: AND gates, 0–300; x-axis: benchmarks ex00–ex98).
REFERENCES

[1] Scott M. Lundberg et al. “From local explanations to global understanding with explainable AI for trees”. In: Nature Machine Intelligence 2.1 (2020), pp. 56–67.
[2] Pei-wei Chen et al. “Circuit Learning for Logic Regression on High Dimensional Boolean Space”. In: Proceedings of Design Automation Conference (DAC). 2020.
[3] Thomas R. Shiple et al. “Heuristic minimization of BDDs using don’t cares”. In: Proceedings of Design Automation Conference (DAC). New York, New York, USA: ACM Press, 1994, pp. 225–231.
[4] J. Ross Quinlan. C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[5] Eibe Frank and Ian H. Witten. “Generating Accurate Rule Sets Without Global Optimization”. In: Proceedings of the Fifteenth International Conference on Machine Learning. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 144–151.
[6] Mark Hall et al. “The WEKA Data Mining Software: An Update”. In: SIGKDD Explor. Newsl. (2009).
[7] Armin Biere, Keijo Heljanko, and Siert Wieringa. AIGER 1.9 And Beyond. Tech. rep. Institute for Formal Models and Verification, Johannes Kepler University, 2011.
[8] Giulia Pagallo and David Haussler. “Boolean Feature Discovery in Empirical Learning”. In: Machine Learning (1990).
[9] Arlindo L. Oliveira and Alberto Sangiovanni-Vincentelli. “Learning Complex Boolean Functions: Algorithms and Applications”. In: NeurIPS. 1993.
[10] Song Han et al. “Learning Both Weights and Connections for Efficient Neural Networks”. In: NeurIPS. 2015.
[11] S. Chatterjee. “Learning and memorization”. In: ICML. 2018.
[12] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: JMLR (2011).
[13] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems (NeurIPS) 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024–8035.
[14] Robert Brayton and Alan Mishchenko. “ABC: An Academic Industrial-Strength Verification Tool”. In: Proc. CAV. 2010, pp. 24–40.
[15] K. Hornik, M. Stinchcombe, and H. White. “Multilayer Feedforward Networks Are Universal Approximators”. In: Neural Netw. 2.5 (1989), pp. 359–366.
[16] Yoav Freund and Robert E. Schapire. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”. In: JCSS (1997).
[17] Leo Breiman. “Random Forests”. In: Machine Learning (2001), pp. 5–32.
[18] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. “Adaptive Factorization Network: Learning Adaptive-Order Feature Interactions”. In: Proc. AAAI. 2020.
[19] Robert Brayton and Alan Mishchenko. “ABC: An Academic Industrial-Strength Verification Tool”. In: Proc. CAV. 2010, pp. 24–40.
[20] Dan Steinberg and Phillip Colla. “CART: classification and regression trees”. In: The top ten algorithms in data mining 9 (2009), p. 179.
[21] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM, 2016, pp. 785–794.
[22] Arlindo L. Oliveira and Alberto Sangiovanni-Vincentelli. “Learning complex boolean functions: Algorithms and applications”. In: Advances in Neural Information Processing Systems. 1994, pp. 911–918.
[23] R. L. Rudell and A. Sangiovanni-Vincentelli. “Multiple-Valued Minimization for PLA Optimization”. In: IEEE TCAD (1987).
[24] John Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[25] Xiyang Hu, Cynthia Rudin, and Margo Seltzer. “Optimal sparse decision trees”. In: Advances in Neural Information Processing Systems. 2019, pp. 7267–7275.
[26] Vincent Sitzmann et al. “Implicit Neural Representations with Periodic Activation Functions”. In: CVPR. 2020.
[27] Alan Lapedes and Robert Farber. “How Neural Nets Work”. In: Advances in Neural Information Processing Systems. MIT Press, 1987, pp. 442–456.
[28] J. M. Sopena, Enrique Romero, and Rene Alquezar. “Neural Networks with Periodic and Monotonic Activation Functions: A Comparative Study in Classification Problems”. In: International Conference on Artificial Neural Networks (ICANN). 1999, 323–328 vol. 1.
[29] Giambattista Parascandolo, H. Huttunen, and T. Virtanen. Taming the waves: sine as activation function in deep neural networks. 2017.
[30] Zechun Liu et al. ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions. 2020.
[31] Ritchie Zhao et al. “Improving Neural Network Quantization without Retraining using Outlier Channel Splitting”. In: International Conference on Machine Learning (ICML) (2019), pp. 7543–7552.
[32] E. Wang et al. “LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference”. In: IEEE Transactions on Computers (2020).
[33] Yaman Umuroglu et al. LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications. 2020.
[34] Julian Francis Miller. “Cartesian genetic programming: its status and future”. In: Genetic Programming and Evolvable Machines (2019), pp. 1–40.
[35] Nicola Milano and Stefano Nolfi. “Scaling Up Cartesian Genetic Programming through Preferential Selection of Larger Solutions”. In: arXiv preprint arXiv:1810.09485 (2018).
[36] Abdul Manazir and Khalid Raza. “Recent developments in Cartesian genetic programming and its variants”. In: ACM Computing Surveys (CSUR) 51.6 (2019), pp. 1–29.
[37] Benjamin Doerr and Carola Doerr. “Optimal parameter choices through self-adjustment: Applying the 1/5-th rule in discrete settings”. In: ACGEC. 2015.
[38] Jônata Tyska Carvalho, Nicola Milano, and Stefano Nolfi. “Evolving Robust Solutions for Stochastically Varying Problems”. In: 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2018, pp. 1–8.
[39] Brian White et al. “An Integrated Experimental Environment for Distributed Systems and Networks”. In: OSDI02. USENIX-ASSOC. Boston, MA, Dec. 2002, pp. 255–270.
[40] R. G. Rizzo, V. Tenace, and A. Calimera. “Multiplication by Inference using Classification Trees: A Case-Study Analysis”. In: ISCAS. 2018.