
Statistics Surveys
Vol. 16 (2022) 1–85
ISSN: 1935-7516
https://doi.org/10.1214/21-SS133
arXiv: 2103.11251

Interpretable machine learning: Fundamental principles and 10 grand challenges∗

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, Chudi Zhong†
e-mail: [email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]

∗ Partial support provided by grants DOE DE-SC0021358, NSF DGE-2022040, NSF CCF-1934964, and NIDA DA054994-01.
† Equal contribution from C. Chen, Z. Chen, H. Huang, L. Semenova, and C. Zhong.

Abstract: Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamen-
tal principles for interpretable ML, and dispel common misunderstandings
that dilute the importance of this crucial topic. We also identify 10 technical
challenge areas in interpretable machine learning and provide history and
background on each problem. Some of these problems are classically impor-
tant, and some are recent problems that have arisen in the last few years.
These problems are: (1) Optimizing sparse logical models such as decision
trees; (2) Optimization of scoring systems; (3) Placing constraints into gen-
eralized additive models to encourage sparsity and better interpretability;
(4) Modern case-based reasoning, including neural networks and match-
ing for causal inference; (5) Complete supervised disentanglement of neural
networks; (6) Complete or even partial unsupervised disentanglement of
neural networks; (7) Dimensionality reduction for data visualization; (8)
Machine learning models that can incorporate physics and other generative
or causal constraints; (9) Characterization of the “Rashomon set” of good
models; and (10) Interpretable reinforcement learning. This survey is suit-
able as a starting point for statisticians and computer scientists interested
in working in interpretable machine learning.

MSC2020 subject classifications: Primary 68T01; secondary 62-02.


Keywords and phrases: Interpretable machine learning, explainable ma-
chine learning.

Received March 2021.

Contents

Introduction
1 Sparse logical models: Decision trees, decision lists, decision sets
2 Scoring systems
3 Generalized additive models
4 Modern case-based reasoning
5 Complete supervised disentanglement of neural networks
6 Unsupervised disentanglement of neural networks
7 Dimension reduction for data visualization
8 Machine learning models that incorporate physics and other generative or causal constraints
9 Characterization of the “Rashomon” set of good models
10 Interpretable reinforcement learning
11 Problems that were not in our top 10 but are really important
12 Conclusion
Acknowledgments
References

Introduction

With widespread use of machine learning (ML), the importance of interpretability has become clear in avoiding catastrophic consequences. Black box predictive
models, which by definition are inscrutable, have led to serious societal prob-
lems that deeply affect health, freedom, racial bias, and safety. Interpretable
predictive models, which are constrained so that their reasoning processes are
more understandable to humans, are much easier to troubleshoot and to use in
practice. It is universally agreed that interpretability is a key element of trust
for AI models [304, 257, 235, 17, 287, 278, 42]. In this survey, we provide funda-
mental principles, as well as 10 technical challenges in the design of inherently
interpretable machine learning models.
Let us provide some background. A black box machine learning model is a
formula that is either too complicated for any human to understand, or pro-
prietary, so that one cannot understand its inner workings. Black box models
are difficult to troubleshoot, which is particularly problematic for medical data.
Black box models often predict the right answer for the wrong reason (the
“Clever Hans” phenomenon), leading to excellent performance in training but
poor performance in practice [268, 172, 225, 331, 19, 119]. There are numerous
other issues with black box models. In criminal justice, individuals may have
been subjected to years of extra prison time due to typographical errors in black
box model inputs [317] and poorly-designed proprietary models for air quality
have had serious consequences for public safety during wildfires [203]; both of
these situations may have been easy to avoid with interpretable models. In cases
where the underlying distribution of data changes (called domain shift, which
occurs often in practice), problems arise if users cannot troubleshoot the model
in real-time, which is much harder with black box models than interpretable
models. Determining whether a black box model is fair with respect to gender or
racial groups is much more difficult than determining whether an interpretable
model has such a bias. In medicine, black box models turn computer-aided deci-
sions into automated decisions, precisely because physicians cannot understand
the reasoning processes of black box models. Explaining black boxes, rather
than replacing them with interpretable models, can make the problem worse
by providing misleading or false characterizations [250, 173, 171], or adding unnecessary authority to the black box [253]. There is a clear need for innovative
machine learning models that are inherently interpretable.
There is now a vast and confusing literature on some combination of in-
terpretability and explainability. Much literature on explainability confounds it
with interpretability/comprehensibility, thereby obscuring the arguments (and detracting from their precision), and failing to convey the relative importance
and use-cases of the two topics in practice. Some of the literature discusses top-
ics in such generality that its lessons have little bearing on any specific problem.
Some of it aims to design taxonomies that miss vast topics within interpretable
ML. Some of it provides definitions that we disagree with. Some of it even
provides guidance that could perpetuate bad practice. Importantly, most of it
assumes that one would explain a black box without consideration of whether
there is an interpretable model of the same accuracy. In what follows, we provide
some simple and general guiding principles of interpretable machine learning.
These are not meant to be exhaustive. Instead they aim to help readers avoid
common but problematic ways of thinking about interpretability in machine
learning.
The major part of this survey outlines a set of important and fundamental
technical grand challenges in interpretable machine learning. These are both
modern and classical challenges, and some are much harder than others. They
are all either hard to solve, or difficult to formulate correctly. While there
are numerous sociotechnical challenges about model deployment (that can be
much more difficult than technical challenges), human-computer interaction
challenges, and how robustness and fairness interact with interpretability, those
topics can be saved for another day. We begin with the most classical and most
canonical problems in interpretable machine learning: how to build sparse mod-
els for tabular data, including decision trees (Challenge #1) and scoring systems
(Challenge #2). We then delve into a challenge involving additive models (Chal-
lenge #3), followed by another in case-based reasoning (Challenge #4), which
is another classic topic in interpretable artificial intelligence. We then move to
more exotic problems, namely supervised and unsupervised disentanglement of
concepts in neural networks (Challenges #5 and #6). Back to classical prob-
lems, we discuss dimension reduction (Challenge #7). Then, how to incorporate
physics or causal constraints (Challenge #8). Challenge #9 involves understand-
ing, exploring, and measuring the Rashomon set of accurate predictive models.
Challenge #10 discusses interpretable reinforcement learning. Table 1 provides
a guideline that may help users to match a dataset to a suitable interpretable
supervised learning technique. We will touch on all of these techniques in the
challenges.

General principles of interpretable machine learning

Our first fundamental principle defines interpretable ML, following [250]:

Principle 1. An interpretable machine learning model obeys a domain-specific set of constraints to allow it (or its predictions, or the data) to be more easily understood by humans. These constraints can differ dramatically depending on the domain.

Table 1
Rule of thumb for the types of data that naturally apply to various supervised learning algorithms. “Clean” means that the data do not have too much noise or systematic bias. “Tabular” means that the features are categorical or real, and that each feature is a meaningful predictor of the output on its own. “Raw” data is unprocessed and has a complex data type, e.g., image data where each pixel is a feature, medical records, or time series data.

decision trees / decision lists (rule lists) / decision sets: somewhat clean tabular data with interactions, including multiclass problems. Particularly useful for categorical data with complex interactions (i.e., more than quadratic). Robust to outliers.
scoring systems: somewhat clean tabular data, typically used in medicine and criminal justice because they are small enough that they can be memorized by humans.
generalized additive models (GAMs): continuous data with at most quadratic interactions, useful for large-scale medical record data.
case-based reasoning: any data type (different methods exist for different data types), including multiclass problems.
disentangled neural networks: data with raw inputs (computer vision, time series, textual data), suitable for multiclass problems.

A typical interpretable supervised learning setup, with data $\{z_i\}_i$ and models chosen from function class $\mathcal{F}$, is:

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_i \mathrm{Loss}(f, z_i) + C \cdot \mathrm{InterpretabilityPenalty}(f), \quad \text{subject to } \mathrm{InterpretabilityConstraint}(f), \tag{*}$$

where the loss function, as well as soft and hard interpretability constraints, are chosen to match the domain. (For classification, $z_i$ might be $(x_i, y_i)$, with $x_i \in \mathbb{R}^p$, $y_i \in \{-1, 1\}$.) The goal of these constraints is to make the resulting model $f$ or its
predictions more interpretable. While solutions of (*) would not necessarily be
sufficiently interpretable to use in practice, the constraints would generally help
us find models that would be interpretable (if we design them well), and we might
also be willing to consider slightly suboptimal solutions to find a more useful
model. The constant C trades off between accuracy and the interpretability
penalty, and can be tuned, either by cross-validation or by taking into account
the user’s desired tradeoff between the two terms.
Equation (*) can be generalized to unsupervised learning, where the loss term
would simply be replaced by a loss term for the unsupervised problem, whether
it is novelty detection, clustering, dimension reduction, or another task.
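
To make the abstract setup in (*) concrete, here is a minimal Python sketch (ours, not from the original work) that selects a model from a tiny hand-built candidate class by minimizing average loss plus C times an interpretability penalty, subject to a hard constraint. The specific penalty (number of nonzero coefficients), the constraint (at most three nonzero coefficients), and the toy data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of (*): pick the model from a small candidate class F that
# minimizes average loss + C * interpretability penalty, subject to a hard
# interpretability constraint. Penalty and constraint choices are illustrative.

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n) > 0, 1, -1)

def logistic_loss(w, X, y):
    """Average logistic loss of the linear model f(x) = w.x."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def interpretability_penalty(w):
    """Soft penalty: number of nonzero coefficients (one sparsity measure)."""
    return np.count_nonzero(w)

def satisfies_constraint(w, max_nonzeros=3):
    """Hard constraint: the model may use at most max_nonzeros features."""
    return np.count_nonzero(w) <= max_nonzeros

# A toy candidate class F: a handful of integer-coefficient linear models.
candidates = [np.array(w) for w in
              [(1, 0, 0, 0, 0, 0), (1, 1, 0, 0, 0, 0), (2, 1, 0, 0, 0, 0),
               (1, 1, 1, 1, 0, 0), (1, 1, 1, 1, 1, 1)]]

C = 0.02  # trade-off between accuracy and the interpretability penalty
feasible = [w for w in candidates if satisfies_constraint(w)]
best = min(feasible,
           key=lambda w: logistic_loss(w, X, y) + C * interpretability_penalty(w))
print("selected coefficients:", best)
```

Real methods search vastly larger model classes than this hand-built list; the challenges below are largely about making that search tractable.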
Creating interpretable models can sometimes be much more difficult than
creating black box models for many different reasons including: (i) Solving the
optimization problem may be computationally hard, depending on the choice of
constraints and the model class F. (ii) When one does create an interpretable
model, one invariably realizes that the data are problematic and require trou-
bleshooting, which slows down deployment (but leads to a better model). (iii)
It might not be initially clear which definition of interpretability to use. This
definition might require refinement, sometimes over multiple iterations with do-
main experts. There are many papers detailing these issues, the earliest dating
from the mid-1990s [e.g., 157].
Interpretability differs across domains just as predictive performance met-
rics vary across domains. Just as we might choose from a variety of perfor-
mance metrics (e.g., accuracy, weighted accuracy, precision, average precision,
precision@N, recall, recall@N, DCG, NDCG, AUC, partial AUC, mean-time-
to-failure, etc.), or combinations of these metrics, we might also choose from
a combination of interpretability metrics that are specific to the domain. We
may not be able to define a single best definition of interpretability; regardless,
if our chosen interpretability measure is helpful for the problem at hand, we
are better off including it. Interpretability penalties or constraints can include
sparsity of the model, monotonicity with respect to a variable, decomposability
into sub-models, an ability to perform case-based reasoning or other types of
visual comparisons, disentanglement of certain types of information within the
model’s reasoning process, generative constraints (e.g., laws of physics), pref-
erences among the choice of variables, or any other type of constraint that is
relevant to the domain. Just as it would be futile to create a complete list of
performance metrics for machine learning, any list of interpretability metrics
would be similarly fated.
Our 10 challenges involve how to define some of these interpretability con-
straints and how to incorporate them into machine learning models. For tabular
data, sparsity is usually part of the definition of interpretability, whereas for
computer vision of natural images, it generally is not. (Would you find a model
for natural image classification interpretable if it uses only a few pixels from the
image?) For natural images, we are better off with interpretable neural networks
that perform case-based reasoning or disentanglement, and provide us with a
visual understanding of intermediate computations; we will describe these in
depth. Choices of model form (e.g., the choice to use a decision tree, or a spe-
cific neural architecture) are examples of interpretability constraints. For most
problems involving tabular data, a fully interpretable model, whose full calcula-
tions can be understood by humans such as a sparse decision tree or sparse linear
model, is generally more desirable than either a model whose calculations can
only be partially understood, or a model whose predictions (but not its model)
can be understood. Thus, we make a distinction between fully interpretable and
partially interpretable models, often preferring the former.
Interpretable machine learning models are not needed for all machine learn-
ing problems. For low-stakes decisions (e.g., advertising), for decisions where an
explanation would be trivial and the model is 100% reliable (e.g., “there is no le-
sion in this mammogram” where the explanation would be trivial), for decisions
where humans can verify or modify the decision afterwards (e.g., segmentation
of the chambers of the heart), interpretability is probably not needed.1 On the
other hand, for self-driving cars, even if they are very reliable, problems would
arise if the car’s vision system malfunctions causing a crash and no reason for
the crash is available. Lack of interpretability would be problematic in this case.

Footnote 1: Obviously, this document does not apply to black box formulas that do not depend on randomness in the data, i.e., a calculation of a deterministic function, not machine learning.
Our second fundamental principle concerns trust:

Principle 2. Despite common rhetoric, interpretable models do not necessarily create or enable trust – they could also enable distrust. They simply allow users
to decide whether to trust them. In other words, they permit a decision of trust,
rather than trust itself.

With black boxes, one needs to make a decision about trust with much less
information; without knowledge about the reasoning process of the model, it
is more difficult to detect whether it might generalize beyond the dataset. As
stated by Afnan et al. [4] with respect to medical decisions, while interpretable
AI is an enhancement of human decision making, black box AI is a replacement
of it.
An important point about interpretable machine learning models is that there
is no scientific evidence for a general tradeoff between accuracy and interpretabil-
ity when one considers the full data science process for turning data into knowl-
edge. (Examples of such pipelines include KDD, CRISP-DM, or the CCC Big
Data Pipelines; see Figure 1, or [95, 51, 7].)

Fig 1. Knowledge Discovery Process. Figure adapted from [95].

In real problems, interpretability is useful for troubleshooting, which leads to better accuracy, not worse. In that
sense, we have the third principle:

Principle 3. It is important not to assume that one needs to make a sacrifice in accuracy in order to gain interpretability. In fact, interpretability often begets
accuracy, and not the reverse. Interpretability versus accuracy is, in general, a
false dichotomy in machine learning.

Interpretability has traditionally been associated with complexity, and specifically, sparsity, but model creators generally would not equate interpretability
with sparsity. Sparsity is often one component of interpretability, and a model
that is sufficiently sparse but has other desirable properties is more typical.
While there is almost always a tradeoff of accuracy with sparsity (particularly
for extremely small models), there is no evidence of a general tradeoff of ac-
curacy with interpretability. Let us consider both (1) development and use of
ML models in practice, and (2) experiments with static datasets; in neither case
have interpretable models proven to be less accurate.

Development and use of ML models in practice. A key example of the pitfalls of black box models when used in practice is the story of COMPAS [224, 258].
COMPAS is a black box because it is proprietary: no one outside of its design-
ers knows its secret formula for predicting criminal recidivism, yet it is used
widely across the United States and influences parole, bail, and sentencing de-
cisions that deeply affect people’s lives [13]. COMPAS is error-prone because
it could require over 130 variables, and typographical errors in those variables
influence outcomes [317]. COMPAS has been explained incorrectly, where the
news organization ProPublica mistakenly assumed that an important variable
in an approximation to COMPAS (namely race) was also important to COM-
PAS itself, and used this faulty logic to conclude that COMPAS depends on
race other than through age and criminal history [13, 258]. While the racial bias
of COMPAS does not seem to be what ProPublica claimed, COMPAS still has
unclear dependence on race. Worse, COMPAS seems to be unnecessarily
complicated as it does not seem to be any more accurate than a very sparse de-
cision tree [10, 11] involving only a couple of variables. The story of COMPAS
is a key example simultaneously demonstrating many pitfalls of black boxes in
practice. And, it shows an example of when black boxes are unnecessary, but
used anyway.
Other systemic issues arise when using black box models in practice, or when
developing them as part of a data science process, that would cause them to be
less accurate than an interpretable model. A full list is beyond the scope of this
survey, but an article that points out many serious issues that arise with black
box models in a particular domain is that of Afnan et al. [4], who discuss in
vitro fertilization (IVF). In modern IVF, black box models that have not been
subjected to randomized controlled trials determine who comes into existence.
Afnan et al. [4] point out ethical and practical issues, ranging from the inabil-
ity to perform shared decision-making with patients, to economic consequences
of clinics needing to “buy into” environments that are similar to those where
the black box models are trained so as to avoid distribution shift; this is nec-
essary precisely because errors cannot be detected effectively in real-time using
the black box. They also discuss accountability issues (“Who is accountable
when the model causes harm?”). Finally, they present a case where a standard
performance metric (namely area under the ROC curve – AUC) has been mis-
construed as representing the value of a model in practice, potentially leading to
overconfidence in the performance of a black box model. Specifically, a reported
AUC was inflated by including many “obvious” cases in the sample over which
it was computed. If we cannot trust reported numerical performance results,
then interpretability would be a crucial remaining ingredient to assess trust.
In a full data science process, like the one shown in Figure 1, interpretability
plays a key role in determining how to update the other steps of the process for
the next iteration. One interprets the results and tunes the processing of the
data, the loss function, the evaluation metric, or anything else that is relevant,
as shown in the figure. How can one do this without understanding how the
model works? It may be possible, but might be much more difficult. In essence,
the messiness that comes with messy data and complicated black box models
causes lower quality decision-making in practice.
Let’s move on to a case where the problem is instead controlled, so that we
have a static dataset and a fixed evaluation metric.

Static datasets. Most benchmarking of algorithms is done on static datasets, where the data and evaluation metric are not cleaned or updated as a result of a run of an algorithm. In other words, these experiments are not done as part of a data science process; they are only designed to compare algorithms in a controlled experimental environment.
Even with static datasets and fixed evaluation metrics, interpretable models
do not generally lead to a loss in accuracy over black box models. Even for deep
neural networks for computer vision, even on the most challenging of benchmark
datasets, a plethora of machine learning techniques have been designed that do
not sacrifice accuracy but gain substantial interpretability [333, 53, 158, 58, 12,
216].
Let us consider two extremes of data types: tabular data, where all vari-
ables are real or discrete features, each of which is meaningful (e.g., age, race,
sex, number of past strokes, congestive heart failure), and “raw” data, such as
images, sound files or text, where each pixel, bit, or word is not useful on its
own. These types of data have different properties with respect to both machine
learning performance and interpretability.
For tabular data, most machine learning algorithms tend to perform similarly
in terms of prediction accuracy. This means it is often difficult even to beat
logistic regression, assuming one is willing to perform minor preprocessing such
as creating dummy variables [e.g., see 61]. In these domains, neural networks
generally find no advantage. It has been known for a very long time that very
simple models perform surprisingly well for tabular data [128]. The fact that
simple models perform well for tabular data could arise from the Rashomon
Effect discussed by Leo Breiman [41]. Breiman posits the possibility of a large
Rashomon set, i.e., a multitude of models with approximately the minimum
error rate, for many problems. Semenova et al. [269] show that as long as a large
Rashomon set exists, it is more likely that some of these models are interpretable.
For raw data, on the other hand, neural networks have an advantage cur-
rently over other approaches [164]. In these raw data cases, the definition of
interpretability changes; for visual data, one may require visual explanations.
In such cases, as discussed earlier and in Challenges 4 and 5, interpretable neural
networks suffice, without losing accuracy.

These two data extremes show that in machine learning, the dichotomy be-
tween the accurate black box and the less-accurate interpretable model is false.
The often-discussed hypothetical choice between the accurate machine-learning-
based robotic surgeon and the less-accurate human surgeon is moot once some-
one builds an interpretable robotic surgeon. Given that even the most difficult
computer vision benchmarks can be solved with interpretable models, there is no
reason to believe that an interpretable robotic surgeon would be worse than its
black box counterpart. The question ultimately becomes whether the Rashomon
set should permit such an interpretable robotic surgeon–and all scientific evi-
dence so far (including a large-and-growing number of experimental papers on
interpretable deep learning) suggests it would.
Our next principle returns to the data science process.

Principle 4. As part of the full data science process, one should expect both the performance metric and interpretability metric to be iteratively refined.

The knowledge discovery process in Figure 1 explicitly shows these important
feedback loops. We have found it useful in practice to create many interpretable
models (satisfying the known constraints) and have domain experts choose be-
tween them. Their rationale for choosing one model over another helps to refine
the definition of interpretability. Each problem can thus have its own unique
interpretability metrics (or set of metrics).
The fifth principle is as follows:

Principle 5. For high stakes decisions, interpretable models should be used if possible, rather than “explained” black box models.

Hence, this survey concerns the former. This is not a survey on Explainable
AI (XAI, where one attempts to explain a black box using an approximation
model, derivatives, variable importance measures, or other statistics), it is a sur-
vey on Interpretable Machine Learning (creating a predictive model that is not
a black box). Unfortunately, these topics are much too often lumped together
within the misleading term “explainable artificial intelligence” or “XAI” despite
a chasm separating these two concepts [250]. Explainability and interpretability
techniques are not alternative choices for many real problems, as the recent sur-
veys often imply; one of them (XAI) can be dangerous for high-stakes decisions
to a degree that the other is not.
Interpretable ML is not a subset of XAI. The term XAI dates from ∼2016,
and grew out of work on function approximation; i.e., explaining a black box
model by approximating its predictions by a simpler model [e.g., 70, 69], or
explaining a black box using local approximations. Interpretable ML also has a
(separate) long and rich history, dating back to the days of expert systems in
the 1950s, and the early days of decision trees. While these topics may sound
similar to some readers, they differ in ways that are important in practice.
In particular, there are many serious problems with the use of explaining black
boxes posthoc, as outlined in several papers that have shown why explaining
black boxes can be misleading and why explanations do not generally serve their
intended purpose [250, 173, 171]. The most compelling such reasons are:
• Explanations for black boxes are often problematic and misleading, poten-
tially creating misplaced trust in black box models. Such issues with ex-
planations have arisen with assessment of fairness and variable importance
[258, 82] as well as uncertainty bands for variable importance [113, 97].
There is an overall difficulty in troubleshooting the combination of a black
box and an explanation model on top of it; if the explanation model is
not always correct, it can be difficult to tell whether the black box model
is wrong, or if it is right and the explanation model is wrong. Ultimately,
posthoc explanations are wrong (or misleading) too often.
One particular type of posthoc explanation, saliency maps (also called attention maps), has become particularly popular in radiology and other computer vision domains despite known problems [2, 53, 334].
Saliency maps highlight the pixels of an image that are used for a predic-
tion, but do not explain how the pixels are used. As an analogy, consider
a real estate agent who is pricing a house. A “black box” real estate agent
would provide the price with no explanation. A “saliency” real estate agent
would say that the price is determined from the roof and backyard, but
doesn’t explain how the roof and backyard were used to determine the
price. In contrast, an interpretable agent would explain the calculation in
detail, for instance, using “comps” or comparable properties to explain
how the roof and backyard are comparable between properties, and how
these comparisons were used to determine the price. One can see from this
real estate example how the saliency agent’s explanation is insufficient.
Saliency maps also tend to be unreliable; researchers often report that dif-
ferent saliency methods provide different results, making it unclear which
one (if any) actually represents the network’s true attention.2
• Black boxes are generally unnecessary, given that their accuracy is gener-
ally not better than a well-designed interpretable model. Thus, explana-
tions that seem reasonable can undermine efforts to find an interpretable
model of the same level of accuracy as the black box.
• Explanations for complex models hide the fact that complex models are difficult to use in practice for many different reasons. Typographical errors in input data are a prime example of this issue [as in the use of COMPAS in practice, see 317]. A model with 130 hand-typed inputs is more error-prone than one involving 5 hand-typed inputs.

Footnote 2: To clear possible confusion, techniques such as SHAP and LIME are tools for explaining black box models, are not needed for inherently interpretable models, and thus do not belong in this survey. These methods determine how much each variable contributed to a prediction. Interpretable models do not need SHAP values because they already explicitly show what variables they are using and how they are using them. For instance, sparse decision trees and sparse linear models do not need SHAP values because we know exactly what variables are being used and how they are used. Interpretable supervised deep neural networks that use case-based reasoning (Challenge #4) do not need SHAP values because they explicitly reveal what part of the observation they are paying attention to, and in addition, how that information is being used (e.g., what comparison is being made between part of a test image and part of a training image). Thus, if one creates an interpretable model, one does not need LIME or SHAP whatsoever.
In that sense, explainability methods are often used as an excuse to use a black
box model–whether or not one is actually needed. Explainability techniques give
authority to black box models rather than suggesting the possibility of models
that are understandable in the first place [253].
XAI surveys have (thus far) universally failed to acknowledge the important
point that interpretability begets accuracy when considering the full data science
process, and not the other way around. Perhaps this point is missed because of
the more subtle fact that one does generally lose accuracy when approximating
a complicated function with a simpler one, so it would appear that the simpler
approximation is less accurate. (Again the approximations must be imperfect,
otherwise one would throw out the black box and instead use the explanation as
an inherently interpretable model.) But function approximators are not used in
interpretable ML; instead of approximating a known function (a black box ML
model), interpretable ML can choose from a potential myriad of approximately-
equally-good models, which, as we noted earlier, is called “the Rashomon set”
[41, 97, 269]. We will discuss the study of this set in Challenge 9. Thus, when
one explains black boxes, one expects to lose accuracy, whereas when one creates
an inherently interpretable ML model, one does not.
In this survey, we do not aim to provide yet another dull taxonomy of “ex-
plainability” terminology. The ideas of interpretable ML can be stated in just
one sentence: an interpretable model is constrained, following a domain-specific
set of constraints that make reasoning processes understandable. Instead, we
highlight important challenges, each of which can serve as a starting point for
someone wanting to enter into the field of interpretable ML.

1. Sparse logical models: Decision trees, decision lists, decision sets

The first two challenges involve optimization of sparse models. We discuss both
sparse logical models in Challenge #1 and scoring systems (which are sparse
linear models with integer coefficients) in Challenge #2. Sparsity is often used as
a measure of interpretability for tabular data where the features are meaningful.
Sparsity is useful because humans can handle only 7±2 cognitive entities at
the same time [208], and sparsity makes it easier to troubleshoot, check for
typographical errors, and reason about counterfactuals (e.g., “How would my
prediction change if I changed this specific input?”). Sparsity is rarely the only
consideration for interpretability, but if we can design models to be sparse, we
can often handle additional constraints. Also, if one can optimize for sparsity,
a useful baseline can be established for how sparse a model could be with a
particular level of accuracy.
We remark that more sparsity does not always equate to more interpretabil-
ity. This is because “humans by nature are mentally opposed to too simplistic
representations of complex relations” [93, 100]. For instance, in loan decisions,
we may choose to have several sparse mini-models for length of credit, history of
default, etc., which are then assembled at the end into a larger model composed
of the results of the mini-models [see 54, who attempted this]. On the other hand,
sparsity is necessary for many real applications, particularly in healthcare and
criminal justice where the practitioner needs to memorize the model.
Logical models, which consist of logical statements involving “if-then,” “or,”
and “and” clauses are among the most popular algorithms for interpretable
machine learning, since their statements provide human-understandable reasons
for each prediction.
When would we use logical models? Logical models are usually an excellent
choice for modeling categorical data with potentially complicated interaction
terms (e.g., “IF (female AND high blood pressure AND congenital heart fail-
ure), OR (male AND high blood pressure AND either prior stroke OR age >
70) THEN predict Condition 1 = true”). Logical models are also excellent for
multiclass problems. Logical models are also known for their robustness to out-
liers and ease of handling missing data. Logical models can be highly nonlinear,
and even classes of sparse nonlinear models can be quite powerful.
Figure 2 visualizes three logical models: a decision tree, a decision list, and
a decision set. Decision trees are tree-structured predictive models where each
branch node tests a condition and each leaf node makes a prediction. Decision
lists, identical to rule lists or one-sided decision trees, are composed of if-then-
else statements. The rules are tried in order, and the first rule that is satisfied
makes the prediction. Sometimes rule lists have multiple conditions in each
split, whereas decision trees typically do not. A decision set, also known as a
“disjunction of conjunctions,” “disjunctive normal form” (DNF), or an “OR of
ANDs” is comprised of an unordered collection of rules, where each rule is a
conjunction of conditions. A positive prediction is made if at least one of the
rules is satisfied. Even though these logical models seem to have very different
forms, they are closely related: every decision list is a (one-sided) decision tree
and every decision tree can be expressed as an equivalent decision list (by listing
each path to a leaf as a decision rule). The collection of leaves of a decision tree
(or a decision list) also forms a decision set.
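
To illustrate the equivalence just described, the following sketch (ours; the tiny tree and its feature names are hypothetical) converts a decision tree into an equivalent decision list by listing each root-to-leaf path as a rule.

```python
# Sketch: convert a decision tree into an equivalent decision list by listing
# each root-to-leaf path as an if-then rule. The toy tree below is hypothetical.

tree = {
    "condition": "prior_offenses > 3",
    "true": {"condition": "age < 26",
             "true": {"predict": "arrested"},
             "false": {"predict": "not arrested"}},
    "false": {"predict": "not arrested"},
}

def tree_to_decision_list(node, conditions=()):
    """Enumerate root-to-leaf paths; each path becomes one rule of the list."""
    if "predict" in node:                      # leaf: emit the accumulated rule
        return [(conditions, node["predict"])]
    cond = node["condition"]
    rules = tree_to_decision_list(node["true"], conditions + (cond,))
    rules += tree_to_decision_list(node["false"], conditions + ("NOT " + cond,))
    return rules

for conds, outcome in tree_to_decision_list(tree):
    print("IF", " AND ".join(conds) if conds else "TRUE", "THEN predict", outcome)
```

The rules produced this way are mutually exclusive, so the same collection can be read as a decision list (tried in order) or as a decision set.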
Let us provide some background on decision trees. Since Morgan and Son-
quist [212] developed the first decision tree algorithm, many works have been
proposed to build decision trees and improve their performance. However, learn-
ing decision trees with high performance and sparsity is not easy. Full decision
tree optimization is known to be an NP-complete problem [174], and heuristic
greedy splitting and pruning procedures have been the major type of approach
since the 1980s to grow decision trees [40, 239, 192, 205]. These greedy methods
for building decision trees create trees from the top down and prune them back
afterwards. They do not go back to fix a bad split if one was made. Consequently,
the trees created from these greedy methods tend to be both less accurate and
less interpretable than necessary. That is, greedy induction algorithms are not
designed to optimize any particular performance metric, leaving a gap between
the performance that a decision tree might obtain and the performance that
the algorithm’s decision tree actually attains, with no way to determine how
large the gap is (see Figure 3 for a case where the 1984 CART algorithm did
not obtain an optimal solution, as shown by the better solution from a 2020
algorithm called “GOSDT,” to the right). This gap can cause a problem in
practice because one does not know whether poor performance is due to the
choice of model form (the choice to use a decision tree of a specific size) or poor
optimization (not fully optimizing over the set of decision trees of that size).
When fully optimized, single trees can be as accurate as ensembles of trees, or
neural networks, for many problems. Thus, it is worthwhile to think carefully
about how to optimize them.

Fig 2. Predicting which individuals are arrested within two years of release by a decision tree (a), a decision list (b), and a decision set (c). The dataset used here is the ProPublica recidivism dataset [13].

Fig 3. (a) 16-leaf decision tree learned by CART [40] and (b) 9-leaf decision tree generated by GOSDT [185] for the classic Monk 2 dataset [88]. The GOSDT tree is optimal with respect to a balance between accuracy and sparsity.

GOSDT and related modern decision tree methods solve an optimization problem that is a special case of (*):

$$\min_{f \in \text{set of trees}} \; \frac{1}{n} \sum_i \mathrm{Loss}(f, z_i) + C \cdot (\text{Number of leaves of } f), \tag{1.1}$$

where the user specifies the loss function and the trade-off (regularization) parameter.
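
As a minimal illustration of objective (1.1), the sketch below scores a few greedy CART-style trees from scikit-learn by training loss plus a per-leaf penalty. These greedy trees are only stand-ins: an optimal method such as GOSDT searches the space of trees directly rather than comparing a handful of greedy fits. The dataset and the value of C are arbitrary choices for illustration.

```python
# Sketch of objective (1.1), using greedy trees as stand-ins for candidate trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
C = 0.005  # per-leaf penalty; an illustrative value, tuned in practice

for max_leaves in (2, 4, 8, 16, 32):
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0).fit(X, y)
    loss = np.mean(tree.predict(X) != y)          # 0-1 training loss
    objective = loss + C * tree.get_n_leaves()    # regularized objective (1.1)
    print(f"{tree.get_n_leaves():2d} leaves: loss={loss:.3f}, objective={objective:.3f}")
```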
Efforts to fully optimize decision trees, solving problems related to (1.1),
have been made since the 1990s [28, 85, 94, 220, 221, 131, 185]. Many recent
papers directly optimize the performance metric (e.g., accuracy) with soft or
hard sparsity constraints on the tree size, where sparsity is measured by the
number of leaves in the tree. Three major groups of these techniques are (1)
mathematical programming, including mixed integer programming (MIP) [see
the works of 28, 29, 251, 301, 302, 118, 6] and SAT solvers [214, 130] [see also the
review of 47], (2) stochastic search through the space of trees [e.g., 321, 114, 228],
and (3) customized dynamic programming algorithms that incorporate branch-
and-bound techniques for reducing the size of the search space [131, 185, 222, 78].
Decision list and decision set construction lead to the same challenges as
decision tree optimization, and have a parallel development path. Dating back
to the 1980s, decision lists have often been constructed in a top-down greedy fashion.
Associative classification methods assemble decision lists or decision sets from
a set of pre-mined rules, generally either by greedily adding rules to the model
one by one, or simply including all “top-scoring” rules into a decision set, where
each rule is scored separately according to a scoring function [247, 63, 188,
184, 325, 275, 200, 298, 252, 62, 108, 65, 99, 104, 199, 198]. Sometimes decision
lists or decision sets are optimized by sampling [180, 321, 306], providing a
Bayesian interpretation. Some recent works can jointly optimize performance
metrics and sparsity for decision lists [251, 327, 10, 11, 8] and decision sets
[311, 110, 170, 132, 77, 197, 109, 80, 328, 45]. Some works optimize for individual
rules [77, 255].
In recent years, great progress has been made on optimizing the combination
of accuracy and sparsity for logical models, but there are still many challenges
that need to be solved. Some important ones are as follows:
1.1 Can we improve the scalability of optimal sparse decision trees?
A lofty goal for optimal decision tree methods is to fully optimize trees
as fast as CART produces its (non-optimal) trees. Current state-of-the-art
optimal decision tree methods can handle medium-sized datasets (thousands
of samples, tens of binary variables) in a reasonable amount of time (e.g.,
within 10 minutes) when appropriate sparsity constraints are used. But how
to scale up to deal with large datasets or to reduce the running time remains
a challenge.
These methods often scale exponentially in p, the number of dimensions
of the data. Developing algorithms that reduce the number of dimensions
through variable screening theorems or through other means could be ex-
tremely helpful.

For methods that use the mathematical programming solvers, a good for-
mulation is key to reducing training time. For example, MIP solvers use
branch-and-bound methods, which partition the search space recursively
and solve Linear Programming (LP) relaxations for each partition to pro-
duce lower bounds. Small formulations with fewer variables and constraints
can enable the LP relaxations to be solved faster, while stronger LP relax-
ations (which usually involve more variables) can produce high quality lower
bounds to prune the search space faster and reduce the number of LPs to
be solved. How to formulate the problem to leverage the full power of MIP
solvers is an open question. Currently, mathematical programming solvers
are not as efficient as the best customized algorithms.
For customized branch-and-bound search algorithms such as GOSDT and
OSDT [185, 131], there are several mechanisms to improve scalability: (1)
effective lower bounds, which prevent branching into parts of the search
space where we can prove there is no optimal solution, (2) effective schedul-
ing policies, which help us search the space to find close-to-optimal solu-
tions quickly, which in turn improves the bounds and again prevents us
from branching into irrelevant parts of the space, (3) computational reuse,
whereby if a computation involves a sum over (even slightly) expensive
computations, and part of that sum has previously been computed and
stored, we can reuse the previous computation instead of computing the
whole sum over again, (4) efficient data structures to store subproblems
that can be referenced later in the computation should that subproblem
arise again.
1.2 Can we efficiently handle continuous variables? While decision trees
handle categorical variables and complicated interactions better than other
types of approaches (e.g., linear models), one of the most important chal-
lenges for decision tree algorithms is to optimize over continuous features.
Many current methods use binary variables as input [131, 185, 301, 222, 78],
which assumes that continuous variables have been transformed into indi-
cator variables beforehand (e.g., age> 50). These methods are unable to
jointly optimize the selection of variables to split at each internal tree node,
the splitting threshold of that variable (if it is continuous), and the tree
structure (the overall shape of the tree). Lin et al. [185] preprocess the data
by transforming continuous features into a set of dummy variables, with
many different split points; they take split points between every ordered pair of unique values present in the training data (a small code sketch of this binarization follows this list). Doing this preserves opti-
mality, but creates a huge number of binary features, leading to a dramatic
increase in the size of the search space, and the possibility of hitting either
time or memory limits. Some methods [301, 222] preprocess the data using
an approximation, whereby they consider a much smaller subset of possible
thresholds, potentially sacrificing the optimality of the solution [see 185,
Section 3, which explains this]. One possible technique to help with this
problem is to use similar support bounds, identified by Angelino et al. [11],
but in practice these bounds have been hard to implement because checking
the bounds repeatedly is computationally expensive, to the point where the
bounds have never been used (as far as we know). Future work could go
into improving the determination of when to check these bounds, or prov-
ing that a subset of all possible dummy variables still preserves closeness to
optimality.
1.3 Can we handle constraints more gracefully? Particularly for greedy
methods that create trees using local decisions, it is difficult to enforce
global constraints on the overall tree. Given that domain-specific constraints
may be essential for interpretability, an important challenge is to deter-
mine how to incorporate such constraints. Optimization approaches (mathe-
matical programming, dynamic programming, branch-and-bound) are more
amenable to global constraints, but the constraints can make the prob-
lem much more difficult. For instance, falling constraints [306, 55] enforce
decreasing probabilities along a rule list, which make the list more inter-
pretable and useful in practice, but make the optimization problem harder,
even though the search space itself becomes smaller.
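
The sketch below (ours) illustrates the all-thresholds binarization mentioned in Challenge 1.2, placing one indicator threshold at the midpoint between each pair of adjacent unique values of a continuous feature; it shows why the number of binary features, and hence the search space, can grow quickly.

```python
# Sketch of all-thresholds binarization: expand a continuous feature into
# indicator variables, one threshold between each pair of adjacent unique values.
import numpy as np

def binarize_all_thresholds(x, name="age"):
    """Return indicator columns (x <= t) for midpoints t between unique values."""
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2.0   # midpoints between neighbors
    return {f"{name} <= {t:g}": (x <= t).astype(int) for t in thresholds}

age = np.array([23, 31, 31, 45, 45, 52, 67])        # made-up training values
dummies = binarize_all_thresholds(age)
print(list(dummies))  # the number of binary features grows with the unique values
```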
Example: Suppose a hospital would like to create a decision tree that will
be used to assign medical treatments to patients. A tree is convenient because
it corresponds to a set of questions to ask the patient (one for each internal
node along a path to a leaf). A tree is also convenient in its handling of multiple
medical treatments; each leaf could even represent a different medical treatment.
The tree can also handle complex interactions, where patients can be asked
multiple questions that build on each other to determine the best medication for
the patient. To train this tree, the proper assumptions and data handling were
made to allow us to use machine learning to perform causal analysis (in practice
these are more difficult than we have room to discuss here). The questions
we discussed above arise when the variables are continuous; for instance, if
we split on age somewhere in the tree, what is the optimal age to split at
in order to create a sparse tree? (See Challenge 1.2.) If we have many other
continuous variables (e.g., blood pressure, weight, body mass index), scalability
in choosing how to split them all becomes an issue. Further, if the hospital
has additional preferences, such as “falling probabilities,” where fewer questions
should be asked to determine whether a patient is in the most urgent treatment
categories, again it could affect our ability to find an optimal tree given limited
computational resources (see Challenge 1.3).

2. Scoring systems

Scoring systems are linear classification models that require users to add, sub-
tract, and multiply only a few small numbers in order to make a prediction.
These models are used to assess the risk of numerous serious medical con-
ditions since they allow quick predictions, without using a computer. Such
models are also heavily used in criminal justice. Table 2 shows an example
of a scoring system. A doctor can easily determine whether a patient screens
positive for obstructive sleep apnea by adding points for the patient’s age,
whether they have hypertension, body mass index, and sex. If the score is above
a threshold, the patient would be recommended to a clinic for a diagnostic test.
Scoring systems commonly use binary or indicator variables (e.g., age ≥ 60)
and point values (e.g., Table 2) which make computation of the score easier for
humans. Linear models do not handle interaction terms between variables like
the logical models discussed above in Challenge #1, and they are not particularly
useful for multiclass problems, but they are useful for counterfactual reasoning:
if we ask “What changes could keep someone’s score low if they developed
hypertension?” the answer would be easy to compute using point scores such as
“4, 4, 2, 2, and -6.” Such logic is more difficult for humans if the point scores
are instead, for instance, 53.2, 41.1, 16.3, 23.6 and -61.8.

Table 2
A scoring system for sleep apnea screening [295]. Patients that screen positive may need to come to the clinic to be tested.

Patient screens positive for obstructive sleep apnea if Score > 1:
  1. age ≥ 60                      4 points    ......
  2. hypertension                  4 points   +......
  3. body mass index ≥ 30          2 points   +......
  4. body mass index ≥ 40          2 points   +......
  5. female                       -6 points   +......
     Add points from rows 1-5      Score    =  ......
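
As a small illustration, the sketch below applies the point values and the "Score > 1" threshold from Table 2 to a single made-up patient; the function name and the patient are ours.

```python
# Sketch: applying the sleep apnea scoring system in Table 2 to one hypothetical
# patient. The point values and threshold come from the table; the patient is made up.

def sleep_apnea_score(age, hypertension, bmi, female):
    score = 0
    score += 4 if age >= 60 else 0          # row 1
    score += 4 if hypertension else 0       # row 2
    score += 2 if bmi >= 30 else 0          # row 3
    score += 2 if bmi >= 40 else 0          # row 4
    score += -6 if female else 0            # row 5
    return score

patient = dict(age=64, hypertension=True, bmi=33, female=False)
score = sleep_apnea_score(**patient)
print(score, "-> screens positive" if score > 1 else "-> screens negative")
```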

Risk scores are scoring systems that have a conversion table to probabilities.
For instance, a 1 point total might convert to probability 15%, 2 points to 33%
and so on. Whereas scoring systems with a threshold (like the one in Table 2)
would be measured by false positive and false negative rates, a risk score might
be measured using the area under the ROC curve (AUC) and calibration.
The development of scoring systems dates back at least to criminal justice
work in the 1920s [44]. Since then, many scoring systems have been designed
for healthcare [154, 152, 153, 38, 175, 14, 107, 211, 274, 315, 286]. However,
none of the scoring systems mentioned so far was optimized purely using an
algorithm applied to data. Each scoring system was created using a different
method involving different heuristics. Some of them were built using domain
expertise alone without data, and some were created using rounding heuristics
for logistic regression coefficients and other manual feature selection approaches
to obtain integer-valued point scores [see, e.g., 175].
Such scoring systems could be optimized using a combination of the user’s
preferences (and constraints) and data. This optimization should ideally be ac-
complished by a computer, leaving the domain expert only to specify the prob-
lem. However, jointly optimizing for predictive performance, sparsity, and other
user constraints may not be an easy task. Equation (2.1) shows an example of
a generic optimization problem for creating a scoring system:

$$\begin{aligned}
\min_{f \in \mathcal{F}} \;\; & \frac{1}{n} \sum_i \mathrm{Loss}(f, z_i) + C \cdot (\text{Number of nonzero terms of } f), \quad \text{subject to} \\
& f \text{ is a linear model, } f(x) = \sum_{j=1}^{p} \lambda_j x_j, \text{ with small integer coefficients, } \forall j, \ \lambda_j \in \{-10, -9, \dots, 0, \dots, 9, 10\}, \\
& \text{and additional user constraints.}
\end{aligned} \tag{2.1}$$

Here, the user would specify the loss function (logistic loss, etc.), the tradeoff
parameter C between the number of nonzero coefficients and the training loss,
and possibly some additional constraints, depending on the domain.
The integrality constraint on the coefficients makes the optimization problem
very difficult. The easiest way to satisfy these constraints is to fit real coefficients
(e.g., run logistic regression, perhaps with ℓ1 regularization) and then round
these real coefficients to integers. However, rounding can go against the loss
gradient and ruin predictive performance. Here is an example of a coefficient
vector that illustrates why rounding might not work:

[5.3, 6.1, 0.31, 0.30, 0.25, 0.25, 0.24, ..., 0.05] → (rounding) → [5, 6, 0, 0, ..., 0].

When rounding, we lose all signal coming from all variables except the first
two. The contribution from the eliminated variables may together be significant
even if each individual coefficient is small, in which case, we lose predictive
performance.
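
This loss of signal is easy to see numerically. The sketch below (ours) uses the coefficient vector from the example above, with the elided entries filled in by arbitrary small values for illustration, and compares the aggregate contribution of the small coefficients before and after naive rounding.

```python
# Sketch: naive rounding removes every variable except the first two, even though
# the many small coefficients can contribute substantially in aggregate. The
# trailing coefficients and the observation are made up for illustration.
import numpy as np

coefficients = np.array([5.3, 6.1, 0.31, 0.30, 0.25, 0.25, 0.24, 0.20, 0.15, 0.05])
rounded = np.round(coefficients)             # -> [5, 6, 0, 0, 0, 0, 0, 0, 0, 0]

x = np.ones(len(coefficients))               # a hypothetical observation
print("contribution of the small coefficients:", coefficients[2:] @ x[2:])  # about 1.75
print("after rounding:", rounded[2:] @ x[2:])                               # 0.0
```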
Compounding the issue with rounding is the fact that ℓ1 regularization introduces a strong bias for very sparse problems. To understand why, consider that the regularization parameter must be set to a very large number to get a very sparse solution. In that case, the ℓ1 regularization does more than make the solution sparse; it also imposes a strong ℓ1 bias. The solutions deteriorate in quality as they become sparser, and rounding to integers only makes them worse.
An even bigger problem arises when trying to incorporate additional con-
straints, as we allude to in (2.1). Even simple constraints such as “ensure pre-
cision is at least 20%” when optimizing recall would be very difficult to satisfy
manually with rounding. There are four main types of approaches to building
scoring systems: i) exact solutions using optimization techniques, ii) approxi-
mation algorithms using linear programming, iii) more sophisticated rounding
techniques, iv) computer-aided exploration techniques.

Exact solutions. There are several methods that can solve (2.1) directly [294,
291, 292, 293, 256]. To date, the most promising approaches use mixed-integer
linear programming solvers (MIP solvers) which are generic optimization soft-
ware packages that handle systems of linear equations, where variables can be
either linear or integer. Commercial MIP solvers (currently CPLEX and Gurobi)
are substantially faster than free MIP solvers, and have free academic licenses.
MIP solvers can be used directly when the problem is not too large and when
the loss function is discrete or linear (e.g., classification error is discrete, as it
takes values either 0 or 1). These solvers are flexible and can handle a huge
variety of user-defined constraints easily. However, in the case where the loss
function is nonlinear, like the classical logistic loss $\sum_i \log(1 + \exp(-y_i f(x_i)))$,
MIP solvers cannot be used directly. In that case, it is possible to use an algo-
rithm called RiskSLIM [293] that uses sophisticated optimization tools: cutting
planes within a branch-and-bound framework, using “callback” functions to a
MIP solver. A major benefit of scoring systems is that they can be used as
decision aids in very high stakes settings; RiskSLIM has been used to create a
model (the 2HELPS2B score) that is used in intensive care units of hospitals to
make treatment decisions about critically ill patients [281].
While exact optimization approaches provide optimal solutions, they strug-
gle with larger problems. For instance, to handle nonlinearities in continuous
covariates, these variables are often discretized to form dummy variables by
splitting on all possible values of the covariate (similar to the way continuous
variables are handled for logical model construction as discussed above, e.g.,
create dummy variables for age<30, age<31, age<32, etc.). Obviously doing
this can turn a small number of continuous variables into a large number of
categorical variables. One way to reduce the size of the problem is to use only
a subset of thresholds (e.g., age<30, age<35, age<40, etc.), but it is possible
to lose accuracy if not enough thresholds are included. Approximation methods
can be valuable in such cases.

Approximate methods. Several works solve approximate versions of (2.1), includ-


ing works of [277, 276, 35, 34, 48]. These works generally use a piecewise linear or
piecewise constant loss function, and sometimes use the ℓ1 norm for regulariza-
tion as a proxy for the number of terms. This allows for the possibility of solving
linear programs using a mathematical programming solver, which is generally
computationally efficient since linear programs can be solved much faster than
mixed-integer programs. The main problem with approximation approaches is
that it is not clear how close the solution of the approximate optimization prob-
lem is to the solution of the desired optimization problem, particularly when
user-defined constraints or preferences are required. Some of these constraints
may be able to be placed into the mathematical program, but it is still not clear
whether the solution of the optimization problem one solves would actually be
close to the solution of the optimization problem we actually care about.
It is possible to use sampling to try to find useful scoring systems, which can
be useful for Bayesian interpretations, though the space of scoring systems can
be quite large and hard to sample [251].

Sophisticated rounding methods. Keeping in mind the disadvantages to rounding
discussed above, there are some compelling advantages to sophisticated round-
ing approaches, namely that they are easy to program and use in practice.
Rounding techniques cannot easily accommodate constraints, but they can be
used for problems without constraints, problems where the constraints are weak
enough that rounding would not cause us to violate them, or within algorithms
like RiskSLIM [293] to help find optimal solu-
tions faster. There are several variations of sophisticated rounding. Chevaleyre
et al. [60] propose randomized rounding, where non-integers are rounded up or
down randomly. They also propose a greedy method where the sum of coef-
ficients is fixed and coefficients are rounded one at a time. Sokolovska et al.
[277] propose an algorithm that finds a local minimum by improving the so-
lution at each iteration until no further improvements are possible. Ustun and
Rudin [293] propose a combination of rounding and “polishing.” Their rounding
method is called Sequential Rounding. At each iteration, Sequential Rounding
chooses a coefficient to round and whether to round it up or down. It makes this
choice by evaluating each possible coefficient rounded both up and down, and
chooses the option with the best objective. After Sequential Rounding produces
an integer coefficient vector, a second algorithm, called Discrete Coordinate De-
scent (DCD), is used to “polish” the rounded solution. At each iteration, DCD
chooses a coefficient, and optimizes its value over the set of integers to obtain
a feasible integer solution with a better objective. All of these algorithms are
easy to program and might be easier to deal with than troubleshooting a MIP
or LP solver.
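
As a concrete illustration of these two steps, the following numpy sketch (a
simplification under the assumption that the objective is plain 0–1 loss, with no
regularization or feasibility constraints) mirrors the Sequential Rounding and
DCD ideas: round one coefficient at a time in whichever direction hurts the
objective least, then polish by re-optimizing each integer coefficient over a
small range.

# Simplified sketch of Sequential Rounding and a DCD-style polish,
# using plain 0-1 loss as the objective (no constraints or regularization).
import numpy as np

def zero_one_loss(w, X, y):
    return np.mean(np.sign(X @ w) != y)

def sequential_rounding(w_real, X, y):
    w = w_real.astype(float).copy()
    unrounded = list(range(len(w)))
    while unrounded:
        best = None  # (loss, coefficient index, rounded value)
        for j in unrounded:
            for v in (np.floor(w_real[j]), np.ceil(w_real[j])):
                cand = w.copy(); cand[j] = v
                loss = zero_one_loss(cand, X, y)
                if best is None or loss < best[0]:
                    best = (loss, j, v)
        _, j, v = best
        w[j] = v                 # commit the best single rounding choice
        unrounded.remove(j)
    return w

def dcd_polish(w, X, y, coef_range=range(-5, 6), n_passes=2):
    w = w.copy()
    for _ in range(n_passes):
        for j in range(len(w)):
            losses = [(zero_one_loss(np.r_[w[:j], v, w[j+1:]], X, y), v) for v in coef_range]
            w[j] = min(losses)[1]  # best integer value for coefficient j, others held fixed
    return w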

Computer-aided exploration techniques. These are design interfaces where do-
main experts can modify the model itself rather than relying directly on opti-
mization techniques to encode constraints. Billiet et al. [35, 34] created a toolbox
that allows users to make manual adjustments to the model, which could po-
tentially help users design interpretable models according to certain types of
preferences. Xie et al. [320] also suggest an expert-in-the-loop approach, where
heuristics such as the Gini splitting criteria can be used to help discretize con-
tinuous variables. Again, with these approaches, the domain expert must know
how they want the model to depend on its variables, rather than considering
overall performance optimization.

2.1 Improve the scalability of optimal sparse scoring systems: As dis-
cussed, for scoring systems, the only practical approaches that produce op-
timal scoring systems require a MIP solver, and these approaches may not
be able to scale to large problems, or optimally handle continuous vari-
ables. Current state-of-the-art methods for optimal scoring systems (like
RiskSLIM) can handle a dataset with thousands of samples and
tens of variables within an hour. However, an hour is quite long if one wants
to adjust constraints and rerun it several times. How to scale up to large
datasets or to reduce solving time remains a challenge, particularly when
including complicated sets of constraints.
2.2 Ease of constraint elicitation and handling: Since domain experts
often do not know the full set of constraints they might want to use in ad-
vance, and since they also might want to adjust the model manually [34],
a more holistic approach to scoring systems design might be useful. There
are many ways to cast the problem of scoring system design with feedback
from domain experts. For instance, if we had better ways of representing
and exploring the Rashomon set (see Challenge 9), domain experts might
be able to search within it effectively, without fear of leaving that set and
producing a suboptimal model. If we knew domain experts’ views about the
importance of features, we should be able to incorporate that through regu-
larization [307]. Better interfaces might elicit better constraints from domain
experts and incorporate such constraints into the models. Faster optimiza-
tion methods for scoring systems would allow users faster turnaround for
creating these models interactively.
Example: A physician wants to create a scoring system for predicting seizures
in critically ill patients [similarly to 281]. The physician has upwards of 70
clinical variables, some of which are continuous (e.g., patient age). The physi-
cian creates dummy variables for age and other continuous features, combines
them with the variables that are already binarized, runs ℓ1-regularized logis-
tic regression and rounds the coefficients. However, the model does not look
reasonable, as it uses only one feature, and isn’t very accurate. The physician
thinks that the model should instead depend heavily on age, and have a false
positive rate below 20% when the true positive rate is above 70% (see Chal-
lenge 2.2). The physician downloads a piece of software for developing scoring
systems. The software reveals that a boosted decision tree model is much more
accurate, and that there is a set of scoring systems with approximately the
same accuracy as the boosted tree. Some of these models do not use age and
violate the physician’s other constraint, so these constraints are then added to
the system, which restricts its search. The physician uses a software tool to
look at models provided by the system and manually adjusts the dependence
of the model on age and other important variables, in a way that still main-
tains predictive ability. Ideally, this entire process takes a few hours from start
to finish (see Challenge 2.1). Finally, the physician takes the resulting model
into the clinic to perform a validation study on data the model has not been
trained with, and the model is adopted into regular use as a decision aid for
physicians.

3. Generalized additive models

Generalized additive models (GAMs) were introduced to present a flexible exten-
sion of generalized linear models (GLMs) [217], allowing for arbitrary functions
for modeling the influence of each feature on a response ([122], see also [318]).
The set of GAMs includes the set of additive models, which, in turn, includes
the set of linear models, which includes scoring systems (and risk scores). Figure
4 shows these relationships.
The standard form of a GAM is:

g(E[y]) = β0 + f1(x·1) + ... + fp(x·p),

where x·j indicates the jth feature, g(·) is a link function and the fi ’s are
univariate component functions that are possibly nonlinear; common choices are
step functions and splines. If the link function g(·) is the identity, the expression
describes an additive model such as a regression model; if the link function
is the logistic function, then the expression describes a generalized additive
Fig 4. Hierarchical relationships between GAMs, additive models, linear models, and scoring
systems.

model that could be used for classification. The standard form of GAMs is
interpretable because the model is constrained to be a linear combination of
univariate component functions. We can plot each component function with
fj (x·j ) as a function of x·j to see the contribution of a single feature to the
prediction. The left part of Figure 5 shows all component functions of a GAM
(with no interactions) that predicts whether a patient has diabetes. The right
enlarged figure visualizes the relationship between plasma glucose concentration
after 2 hours into an oral glucose tolerance test and the risk of having diabetes.

Fig 5. Left: All component functions of a GAM model trained using the interpret package
[223] on a diabetes dataset [88]; Right: zoom-in of component function for glucose concentra-
tion.
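
For readers who want to reproduce this kind of plot, a minimal sketch using the
interpret package's Explainable Boosting Machine (a tree-based GAM) follows;
the file path and column names are placeholders rather than the exact dataset
used in the figure.

# Sketch: fit a GAM-style model (Explainable Boosting Machine) with the interpret
# package and plot its component functions; path and column names are placeholders.
import pandas as pd
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

df = pd.read_csv("diabetes.csv")                       # hypothetical file
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

ebm = ExplainableBoostingClassifier(interactions=0)    # no pairwise terms
ebm.fit(X, y)
show(ebm.explain_global())                             # one plot per component function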

If the features are all binary (or categorical), the GAM becomes a linear
model and the visualizations are just step functions. The visualizations become
more interesting for continuous variables, like the ones shown in Figure 5. If a
GAM has bivariate component functions (that is, if we choose an fj to depend
on two variables, which permits an interaction between these two variables), a
heatmap can be used to visualize the component function on the two dimensional
plane and understand the pairwise interactions [196]. As a comparison point with
decision trees, GAMs typically do not handle more than a few interaction terms,
and all of these would be quadratic (i.e., involve 2 variables); this contrasts
with decision trees, which handle complex interactions of categorical variables.
GAMs, like other linear models, do not handle multiclass problems in a natural
way. GAMs have been particularly successful for dealing with large datasets of
medical records that have many continuous variables because they can elucidate
complex relationships between, for instance, age, disease and mortality. (Of
course, dealing with large raw medical datasets, we would typically encounter
serious issues with missing data, or bias in the labels or variables, which would
be challenging for any method, including GAMs.)
A component function fj can take different forms. For example, it can be a
weighted sum of indicator functions, that is:

fj (x·j ) = Σ_{j′ ∈ thresholdsj} cj,j′ 1[x·j > θj′ ].                    (3.1)

If the weights on the indicator functions are integers, and only a small set of
weights are nonzero, then the GAM becomes a scoring system. If the indicators
are all forced to aim in one direction (e.g., 1[x·j > θj′ ] for all j′, with no
indicators in the other direction, 1[x·j < θj′ ]) and the coefficients cj,j′ are all
constrained to be nonnegative, then the function will be monotonic. In the case
that splines are used as component functions, the GAM can be a weighted sum
of the splines' basis functions, i.e., fj (x·j ) = Σ_{k=1}^{Kj} βjk bjk (x·j ).
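
The indicator-sum form (3.1) is easy to state in code; the short sketch below
uses made-up thresholds and coefficients for an "age" feature, and with small
nonnegative integer coefficients it is exactly a monotonically increasing
scoring-system component.

# Evaluate an indicator-sum component function as in (3.1);
# thresholds and coefficients are made-up illustrations.
import numpy as np

def component(x, thresholds, coefs):
    # f_j(x) = sum_{j'} c_{j,j'} * 1[x > theta_{j'}]
    x = np.asarray(x, dtype=float)[:, None]
    return (x > np.asarray(thresholds)[None, :]) @ np.asarray(coefs)

age = np.array([25, 42, 67, 80])
print(component(age, thresholds=[30, 50, 70], coefs=[1, 2, 1]))  # -> [0 1 3 4]
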
There are many different ways to fit GAMs. The traditional way is to use
backfitting, where we iteratively train a component function to best fit the
residuals from the other (already-chosen) components [122]. If the model is fit-
ted using boosting methods [101, 102, 103], we learn a tree on each single feature
in each iteration and then aggregate them together [195]. Among different es-
timations of component functions and fitting procedures, Binder and Tutz [36]
found that boosting performed particularly well in high-dimensional settings,
and Lou et al. [195] found that using a shallow bagged ensemble of trees on
a single feature in each step of stochastic gradient boosting generally achieved
better performance.
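
A minimal numpy sketch of the backfitting idea for an additive regression model
is given below; the per-feature "smoother" here is a simple binned average,
standing in for the splines or single-feature tree ensembles described above.

# Minimal backfitting sketch for an additive regression model; the per-feature
# "smoother" is a binned average (an illustrative stand-in for splines or trees).
import numpy as np

def binned_smoother(x, r, n_bins=10):
    bins = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    idx = np.digitize(x, bins)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return lambda xnew: means[np.digitize(xnew, bins)]

def backfit(X, y, n_iters=20):
    n, p = X.shape
    intercept = y.mean()
    components = [lambda xnew: np.zeros(len(xnew)) for _ in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            # partial residuals: what the other components do not explain
            others = sum(components[k](X[:, k]) for k in range(p) if k != j)
            r = y - intercept - others
            components[j] = binned_smoother(X[:, j], r)
    return intercept, components

def predict(intercept, components, X):
    return intercept + sum(components[j](X[:, j]) for j in range(X.shape[1]))
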
We remark that GAMs have the advantage that they are very powerful,
particularly if they are trained as boosted stumps or trees, which are reliable
out-of-the-box machine learning techniques. The AdaBoost algorithm also has
the advantage that it maximizes convex proxies for both classification error and
area under the ROC curve (AUC) simultaneously [254, 72]. This connection
explains why boosted models tend to have both high AUC and accuracy. How-
ever, boosted models are not naturally sparse, and issues with bias arise under
ℓ1 regularization, as discussed in the scoring systems section.
We present two interesting challenges involving GAMs:
3.1 How to control the simplicity and interpretability for GAMs: The
simplicity of GAMs arises in at least two ways: sparsity in the number of
component functions and smoothness of the component functions. Imposing
monotonicity of the component functions also helps with interpretability
when we have prior knowledge, e.g., that risk increases with age. In the
case when component functions are estimated by splines, many works ap-
ply convex regularizers (e.g., ℓ1) to control both smoothness and sparsity
[186, 245, 206, 324, 234, 194, 262, 121]. For example, they add “roughness”
penalties and lasso type penalties on the fj ’s in the objective function to con-
trol both the smoothness of component functions and sparsity of the model.
Similarly, if the fj ’s are sums of indicators, as in (3.1), we could regularize
to reduce the cj,j  coefficients to induce smoothness. These penalties are
usually convex, therefore, when combined with convex loss functions, con-
vex optimization algorithms minimize their (regularized) objectives. There
could be some disadvantages to this setup: (1) as we know, ℓ1 regulariza-
tion imposes a strong unintended bias on the coefficients when aiming for
very sparse solutions. (2) Lou et al. [195] find that imposing smoothness
may come at the expense of accuracy, (3) imposing smoothness may miss
important naturally-occurring patterns like a jump in a component func-
tion; in fact, they found such a jump in mortality as a function of age that
seems to occur around retirement age. In that case, it might be more in-
terpretable to include a smooth increasing function of age plus an indicator
function around retirement age. At the moment, these types of choices are
hand-designed, rather than automated.
As mentioned earlier, boosting can be used to train GAMs to produce ac-
curate models. However, sparsity and smoothness are hard to control with
AdaBoost since it adds a new term to the model at each iteration.
3.2 How to use GAMs to troubleshoot complex datasets? GAMs are
often used on raw medical records or other complex data types, and these
datasets are likely to benefit from troubleshooting. Using a GAM, we might
find counterintuitive patterns; e.g., as shown in [49], asthma patients fared
better than non-asthma patients in a health outcomes study. Caruana et
al. [49] provides a possible reason for this finding, which is that asthma
patients are at higher natural risk, and are thus given better care, leading
to lower observed risk. Medical records are notorious for missing important
information or providing biased information such as billing codes. Could
a GAM help us to identify important missing confounders, such as retire-
ment effects or special treatment for asthma patients? Could GAMs help us
reconcile medical records from multiple data storage environments? These
data quality issues can be really important.

Example: Suppose a medical researcher has a stack of raw medical records and
would like to predict the mortality risk for pneumonia patients. The data are
challenging, including missing measurements (structural missingness as well as
data missing not at random, and unobserved variables), insurance codes that
do not convey exactly what happened to the patient, nor what their state was.
However, the researcher decides that there is enough signal in the data that it
could be useful in prediction, given a powerful machine learning method, such
as a GAM trained with boosted trees. There are also several important con-
tinuous variables, such as age, that could be visualized. A GAM with a small
number of component functions might be appropriate since the doctor can vi-
sualize each component function. If there are too many component functions
(GAM without sparsity control), analyzing contributions from all of them could
be overwhelming (see Challenge 3.1). If the researcher could control the sparsity,
smoothness, and monotonicity of the component functions, she might be able to
design a model that not only predicts well, but also reveals interesting relation-
ships between observed variables and outcomes. This model could also help us
to determine whether important variables were missing, recorded inconsistently
or incorrectly, and could help identify key risk factors (see Challenge 3.2).
From there, the researcher might want to develop even simpler models, such
as decision trees or scoring systems, for use in the clinic (see Challenges 1 and 2).

4. Modern case-based reasoning

Case-based reasoning is a paradigm that involves solving a new problem using
known solutions to similar past problems [1]. It is a problem-solving strategy
that we humans use naturally in our decision-making processes [219]. For ex-
ample, when ornithologists classify a bird, they will look for specific features
or patterns on the bird and compare them with those from known bird species
to decide which species the bird belongs to. The interesting question is: can a
machine learning algorithm emulate the case-based reasoning process that we
humans are accustomed to? A model that performs case-based reasoning is ap-
pealing, because by emulating how humans reason, the model can explain its
decision-making process in an interpretable way.
The potential uses for case-based reasoning are incredibly broad: whereas
the earlier challenges apply only to tabular data, case-based reasoning applies
to both tabular and raw data, including computer vision. For computer vision
and other raw data problems, we distinguish between the feature extraction and
prediction steps. The feature extraction can be verified by a human, and the
prediction steps are simple calculations. That is, while a human may not under-
stand the mapping from the original image to the feature space, they can visually
verify that a particular interpretable concept/feature has been extracted. Then,
the features/concepts are combined to form a prediction, which is done via a
sparse linear combination, or another calculation that a human can understand.
Case-based reasoning has long been a subject of interest in the artificial
intelligence (AI) community [246, 160, 120]. There are, in general, two types of
case-based reasoning techniques: (i) nearest neighbor-based techniques, and (ii)
prototype-based techniques, illustrated in Figure 6. There are many variations
of each of these two types.

Fig 6. Case-based reasoning types. Left: Nearest neighbors (just some arrows are shown for
3-nearest neighbors). Right: Prototype-based reasoning, shown with two prototypes.
Nearest neighbor-based techniques. These techniques make a decision
for a previously unseen test instance, by finding training instances that most
closely resemble the particular test instance (i.e., the training instances that have
the smallest “distance” or the largest “similarity” measures to the test instance).
A classic example of nearest neighbor-based techniques is k-nearest neighbors
(kNN) [98, 68]. Traditionally, a kNN classifier is non-parametric and requires no
training at all – given a previously unseen test instance, a kNN classifier finds the
k training instances that have the smallest ℓ2 distances to the test instance, and
the class label of the test instance is predicted to be the majority label of those k
training instances. Many variants of the original kNN classification scheme have
been developed in the 1970s and 80s [89, 106, 142, 30]. Some later papers on
nearest neighbor-based techniques focused on the problem of “adaptive kNN,”
where the goal is to learn a suitable distance metric to quantify the dissimilarity
between any pair of input instances (instead of using a pre-determined distance
metric such as the Euclidean ℓ2 distance measure), to improve the performance
of nearest neighbor-based techniques. For example, [316] proposed a method to
learn a parametrized distance metric (such as the matrix in a Mahalanobis dis-
tance measure) for kNN. Their method involves minimizing a loss function on
the training data, such that for every training instance, the distance between the
training instance and its “target” neighbors (of the same class) will be minimized
while the distance between the training instance and its “imposter” neighbors
(from other classes) will be maximized (until those neighbors are at least 1 dis-
tance unit away from the training instance). More recently, there are works that
began to focus on “performing kNN in a learned latent space,” where the latent
space is often learned using a deep neural network. For example, Salakhutdinov
and Hinton [264] proposed to learn a nonlinear transformation, using a deep
neural network that transforms the input space into a feature space where a
kNN classifier will perform well (i.e., deep kNN). Papernot and McDaniel [229]
proposed an algorithm for deep kNN classification, which uses the k-nearest
neighbors of a test instance from every hidden layer of a trained neural network.
Card et al. [46] introduced a deep weighted averaging classifier, which classifies
an input based on its latent-space distances to other training examples.
We remark that the notion of “adaptive kNN” is mathematically the same as
“performing kNN in a learned latent space.” Let us show why that is. In adaptive
kNN, we would learn a distance metric d(·, ·) such that k-nearest neighbors
tends to be an accurate classifier. The distance between points x1 and x2 would
be d(x1 , x2 ). For latent space nearest neighbor classification, we would learn a
mapping φ : x → φ(x) from our original space to the latent space, and then
compute ℓ2 distances in the latent space. That is, ‖φ(x1) − φ(x2)‖2. Equating
these two perspectives, we have d(x1, x2) = ‖φ(x1) − φ(x2)‖2. That is, the latent
space mapping acts as a transformation of the original metric so that ℓ2 distance
works well for kNN.
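
This equivalence is straightforward to express in code. In the sketch below, φ
is a stand-in embedding (a random one-layer map rather than a trained deep
network), and classification is a majority vote over the k smallest ℓ2 distances
in the latent space; class labels are assumed to be nonnegative integers.

# kNN in a learned latent space: distances are ell_2 distances between embeddings
# phi(x). Here phi is a random one-layer map standing in for a trained network.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 10))            # stand-in embedding parameters

def phi(X):
    return np.maximum(X @ W.T, 0.0)     # a one-layer "network" with ReLU

def knn_predict(X_train, y_train, X_test, k=3):
    Z_train, Z_test = phi(X_train), phi(X_test)
    preds = []
    for z in Z_test:
        dists = np.linalg.norm(Z_train - z, axis=1)      # ell_2 in latent space
        neighbors = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(neighbors).argmax())    # majority vote
    return np.array(preds)
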
As an aside, the problem of matching in observational causal inference also
can use case-based reasoning. Here treatment units are matched to similar con-
trol units to estimate treatment effects. We refer readers to recent work on this
topic for more details [310, 231].
Prototype-based techniques. Despite the popularity of nearest-neighbor
techniques, those techniques often require a substantial amount of distance com-
putations (e.g., to find out the nearest neighbors of a test input), which can be
slow in practice. Also, it is possible that the nearest neighbors may not be par-
ticularly good representatives of a class, so that reasoning about nearest neigh-
bors may not be interpretable. Prototype-based techniques are an alternative to
nearest-neighbor techniques that have neither of these disadvantages. Prototype-
based techniques learn, from the training data, a set of prototypical cases for
comparison. Given a previously unseen test instance, they make a decision by
finding prototypical cases (instead of training instances from the entire training
set) that most closely resemble the particular test instance. One of the earliest
prototype learning techniques is learning vector quantization (LVQ) [159]. In
LVQ, each class is represented by one or more prototypes, and points are as-
signed to the nearest prototype. During training, if the training example’s class
agrees with the nearest prototype’s class, then the prototype is moved closer
to the training example; otherwise the prototype is moved further away from
the training example. In more recent works, prototype learning is also achieved
by solving a discrete optimization program, which selects the “best” prototypes
from a set of training instances according to some training objective. For exam-
ple, Bien and Tibshirani [33] formulated the prototype learning problem as a
set-cover integer program (an NP-complete problem), which can be solved using
standard approximation algorithms such as relaxation-and-rounding and greedy
algorithms. Kim et al. [145] formulated the prototype learning problem as an
optimization program that minimizes the squared maximum mean discrepancy,
which is a submodular optimization problem and can be solved approximately
using a greedy algorithm.
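
To make the LVQ update rule described above concrete, a minimal LVQ1-style
sketch follows (one prototype per class, with an illustrative initialization and
learning rate; this is a simplification of the family of methods in [159]).

# Minimal LVQ1-style sketch: one prototype per class, moved toward examples of
# its own class and away from examples of other classes.
import numpy as np

def train_lvq1(X, y, lr=0.05, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    protos = np.array([X[rng.choice(np.flatnonzero(y == c))] for c in classes], dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            j = np.argmin(np.linalg.norm(protos - X[i], axis=1))   # nearest prototype
            sign = 1.0 if classes[j] == y[i] else -1.0             # attract or repel
            protos[j] += sign * lr * (X[i] - protos[j])
    return protos, classes

def predict_lvq(protos, classes, X):
    dists = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
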
Part-based prototypes. One issue that arises with both nearest neighbor
and prototype techniques is the comparison of a whole observation to another
whole observation. This makes little sense, for instance, with images, where some
aspects resemble a known past image, but other aspects resemble a different im-
age. For example, we consider architecture of buildings: while some architectural
elements of a building may resemble one style, other elements resemble another.
Another example is recipes. A recipe for a cheesecake with strawberry topping
may call for part of a strawberry pancake recipe according to the typical prepa-
ration for a strawberry sauce, while the cheesecake part of the recipe could
follow a traditional plain cheesecake recipe. In that case, it makes more sense to
compare the strawberry cheesecake recipe to both the pancake recipe and the
plain cheesecake recipe. Thus, some newer case-based reasoning methods have
been comparing parts of observations to parts of other observations, by creating
comparisons on subsets of features. This allows case-based reasoning techniques
both more flexibility and more interpretability.
Kim et al. [146] formulated a prototype-parts learning problem for structured
(tabular) data using a Bayesian generative framework. They considered the
example (discussed above) of recipes in their experiments. Wu and Tabak [319]
used a convex combination of training instances to represent a prototype, where
the prototype does not necessarily need to be a member of the training set. Using
Fig 7. (a) Prototype-based classification of the network from Li et al. [183]. The network
compares a previously unseen image of “6” with 15 prototypes of handwritten digits, learned
from the training set, and classifies the image as a 6 because it looks like the three proto-
types of handwritten 6’s, which have been visualized by passing them through a decoder from
latent space into image space. (b) Part-based prototype classification of a ProtoPNet [53].
The ProtoPNet compares a previously unseen image of a bird with prototypical parts of a
clay colored sparrow, which are learned from the training set. It classifies the image as a
clay colored sparrow because (the network thinks that) its head looks like a prototypical head
from a clay-colored sparrow, its wing bars look like prototypical wing bars from a clay-colored
sparrow, and so on. Here, the prototypes do not need to be passed through a decoder, they are
images from the training set.

a convex combination of training examples as a prototype would be suitable for
some data types (e.g., tabular data, where a convex combination of real training
examples might resemble a realistic observation), but for images, averaging the
latent positions of units in latent space may not correspond to a realistic-looking
image, which means the prototype may not look like a real image, which could
be a disadvantage to this type of approach.
Recently, there are works that integrate deep learning with prototype- and
prototype-parts-based classification. This idea was first explored by Li et al.
[183] for image classification. They created a neural network architecture that
contains an autoencoder and a special prototype layer, where each unit of that
layer (i.e., a prototype) stores a weight vector that resembles some encoded
training input. Given an input instance, the network compares the encoded input
instance with the learned prototypes (stored in the prototype layer), which can
be visualized using the decoder. The prediction of the network is based on the ℓ2
distances between the encoded input instance and the learned prototypes. The
authors applied the network to handwritten digit recognition [MNIST, 177], and
the network was able to learn prototypical cases of how humans write digits by
hand. Given a test image of a handwritten digit, the network was also able to
find prototypical cases that are similar to the test digit (Figure 7(a)).
More recently, Chen et al. [53] extended the work of Li et al. [183] to create
a prototypical part network (ProtoPNet) whose prototype layer stores proto-
typical parts of encoded training images. The prototypical parts are patches
of convolutional-neural-network-encoded training images, and represent typical
features observed for various image classes. Given an input instance, the net-
work compares an encoded input image with each of the learned prototypical
parts, and generates a prototype activation map that indicates both the location
and the degree of the image patch most similar to that prototypical part. The
authors applied the network to the benchmark Caltech-UCSD Birds-200-2011
(CUB-200-2011) dataset [305] for bird recognition, and the ProtoPNet was able
to learn prototypical parts of 200 classes of birds, and use these prototypes to
classify birds with an accuracy comparable to non-interpretable black-box mod-
els. Given a test image of a bird, the ProtoPNet was able to find prototypical
parts that are similar to various parts of the test image, and was able to provide
an explanation for its prediction, such as “this bird is a clay-colored sparrow,
because its head looks like that prototypical head from a clay-colored sparrow,
and its wing bars look like those prototypical wing bars from a clay-colored spar-
row” (Figure 7(b)). In their work, Chen et al. [53] also removed the decoder, and
instead introduced prototype projection, which pushes every prototypical part
to the nearest encoded training patch of the same class for visualization. This
improved the visualization quality of the learned prototypes (in comparison to
the approach of Li et al. [183] which used a decoder).
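
A simplified numpy sketch of the core similarity computation in this line of
work is shown below (the shapes and the distance-to-similarity transform follow
the general recipe of [53], but this is a stand-in, not the released
implementation): each prototypical part is compared with every spatial patch of
the encoded image, and the best match per prototype is kept; a final sparse
linear layer on these similarity scores would produce the class prediction.

# Sketch of a prototypical-part layer: compare each prototype against all spatial
# positions of a conv feature map and keep the best (smallest-distance) match,
# converted to a similarity score. Shapes are illustrative.
import numpy as np

def prototype_similarities(feature_map, prototypes, eps=1e-4):
    # feature_map: (d, h, w) encoded image; prototypes: (m, d) prototypical parts
    d, h, w = feature_map.shape
    patches = feature_map.reshape(d, h * w).T           # (h*w, d) patch vectors
    dists = np.linalg.norm(patches[:, None, :] - prototypes[None, :, :], axis=2)
    min_dists = dists.min(axis=0)                       # best match per prototype
    return np.log((min_dists + 1.0) / (min_dists + eps))  # small distance -> large similarity

rng = np.random.default_rng(0)
fmap = rng.normal(size=(64, 7, 7))       # stand-in for a CNN feature map
protos = rng.normal(size=(10, 64))       # 10 learned prototypical parts
scores = prototype_similarities(fmap, protos)
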
These two works (i.e., [183, 53]) have been extended in the domain of deep
case-based reasoning and deep prototype learning. In the image recognition
domain, Nauta et al. [215] proposed a method for explaining what visual char-
acteristics a prototype (in a trained ProtoPNet) is looking for; Nauta et al.
[216] proposed a method for learning neural prototype trees based on a proto-
type layer; Rymarczyk et al. [260] proposed data-dependent merge-pruning of
the prototypes in a ProtoPNet, to allow prototypes that activate on similarly
looking parts from various classes to be pruned and shared among those classes.
In the sequence modeling domain (such as natural language processing), Ming
et al. [209] and Hong et al. [129] took the concepts in [183] and [53], and inte-
grated prototype learning into recurrent neural networks for modeling sequential
data. Barnett et al. [21] extended the ideas of Chen et al. [53] and developed an
application to interpretable computer-aided digital mammography.
Despite the recent progress, many challenges still exist in the domain of case-
based reasoning, including:

4.1 How to extend the existing case-based reasoning approach to han-
dling more complex data, such as video? Currently, case-based reason-
ing has been used for structured (tabular) data, static images, and simple
sequences such as text. It remains a challenge to extend the existing case-
based reasoning approach to handling more complex data, such as video
data, which are sequences of static images. While Trinh et al. [290] recently
extended the ideas in Chen et al. [53] and developed a dynamic prototype
network (DPNet) which learns prototypical video patches from deep-faked
and real videos, how to efficiently compare various videos and find simi-
lar videos (for either nearest-neighbor or prototype-based classification) re-
mains an open challenge for case-based video classification tasks. Performing
case-based reasoning on video data is technically challenging because of the
high dimensionality of the input data. On the other hand, a video is an or-
dered combination of frames and we can take advantage of the sequential na-
ture of the data. For example, current algorithms can be refined or improved
with more information that comes from neighboring video frames; could
prototypes be designed from neighboring frames or parts of the frames?
4.2 How to integrate prior knowledge or human supervision into pro-
totype learning? Current approaches to prototype learning do not take
into account prior knowledge or expert opinions. At times, it may be ad-
vantageous to develop prototype-learning algorithms that collaborate with
human experts in choosing prototypical cases or prototypical features. For
example, in healthcare, it would be beneficial if a prototype-based classi-
fier learns, under the supervision of human doctors, prototypical signs of
cancerous growth. For example, the doctors might prune prototypes, design
them, or specify a region of interest where the prototypes should focus. Such
human-machine collaboration would improve the classifier’s accuracy and
interpretability, and would potentially reduce the amount of data needed to
train the model. However, human-machine collaboration is rarely explored
in the context of prototype learning, or case-based reasoning in general.
4.3 How to troubleshoot a trained prototype-based model to improve
the quality of prototypes? Prototype-based models make decisions based
on similarities with learned prototypes. However, sometimes, the prototypes
may not be learned well enough, in the sense that they may not capture
the most representative features of a class. This is especially problematic
for part-based prototype models, because these models reason by compar-
ing subsets of features they deem important. In the case of a (part-based)
prototype model with “invalid” prototypes that capture non-representative
or undesirable features (e.g., a prototype of text in medical images, which
represents information that improves training accuracy but not test accu-
racy), one way to “fix” the model is to get rid of the invalid prototypes, but
this may lead to an imbalance in the number of prototypes among different
classes and a bias for some classes that are over-represented with abundant
prototypes. An alternative solution is to replace each undesirable prototype
with a different prototype that does not involve the undesirable features.
The challenge here lies in how we can replace undesirable prototypes sys-
tematically without harming the model performance, and at the same time
improve the given model incrementally without retraining from scratch.

Example: Suppose that a doctor wants to evaluate the risk of breast cancer
among patients [see 21]. The dataset used to predict malignancy of breast can-
cer usually contains a set of mammograms, and a set of patient features (e.g.,
age). Given a particular patient, how do we characterize prototypical signs of
cancerous growth from a sequence of mammograms taken at various times (this
is similar to a video, see Challenge 4.1)? How can a doctor supervise prototype
learning by telling a prototype-based model what (image/patient) features are
typical of breast cancer (Challenge 4.2)? If a trained model contains a prototype
of “text” in the mammograms (such text could be doctor’s notes or patient IDs
left on some training images), how can we replace the unwanted prototype with
another prototype that is more medically relevant, without retraining the model
from scratch (Challenge 4.3)?

5. Complete supervised disentanglement of neural networks

Deep neural networks (DNNs) have achieved state-of-the-art predictive perfor-
mance for many important tasks [176]. DNNs are the quintessential “black box”
because the computations within its hidden layers are typically inscrutable. As a
result, there have been many works that have attempted to “disentangle” DNNs
in various ways so that information flow through the network is easier to under-
stand. “Disentanglement” here refers to the way information travels through the
network: we would perhaps prefer that all information about a specific concept
(say “lamps”) traverse through one part of the network while information about
another concept (e.g., “airplane”) traverse through a separate part. Therefore,
disentanglement is an interpretability constraint on the neurons. For example,
suppose we hope neuron “neur” in layer l is aligned with concept c, in which
case, the disentanglement constraint is

DisentanglementConstraint(c, neur, l) : Signal(c, neur) = Signal(c, l), (5.1)

where Signal(c, x) means the signal of concept c that passes through x, which
measures similarity between its two arguments. This constraint means all signal
about concept c in layer l will only pass through neuron neur. In other words,
using these constraints, we could constrain our network to have the kind of
“Grandmother node” that scientists have been searching for in both real and
artificial convolutional neural networks [115].
In this challenge, we consider the possibility of fully disentangling a DNN
so that each neuron in a piece of the network represents a human-interpretable
concept.
The vector space whose axes are formed by activation values on the hidden
layer’s neurons is known as the latent space of a DNN. Figure 8 shows an exam-
ple of what an ideal interpretable latent space might look like. The axes of the
latent space are aligned with individual visual concepts, such as “lamp,” “bed,”
“nightstand,” “curtain.” Note that the “visual concepts” are not restricted to
objects but can also be things such as weather or materials in a scene. We hope
all information about the concept travels through that concept’s corresponding
neuron on the way to the final prediction. For example, the “lamp” neuron will
be activated if and only if the network thinks that the input image contains in-
formation about lamps. This kind of representation makes the reasoning process
of the DNN much easier to understand: the image is classified as “bedroom”
because it contains information about “bed” and “lamp.”
Such a latent space, made of disjoint data generation factors, is a disentangled
latent space [26, 123]. An easy way to create a disentangled latent space is just
Fig 8. Disentangled latent space. The axes (neurons) of the latent space are aligned with
supervised concepts, e.g. “lamp,” “bed,” “nightstand,” “curtain.” All information about the
concept up to that point in the network travels through that concept’s corresponding neuron.

to create a classifier for each concept (e.g., create a lamp classifier), but this
is not a good strategy: it might be that the network only requires the light
of the lamp rather than the actual lamp body, so creating a lamp classifier
could actually reduce performance. Instead, we would want to encourage the
information that is used about a concept (if any is used) to go along one path
through the network.
Disentanglement is not guaranteed in standard neural networks. In fact, in-
formation about any concept could be scattered throughout the latent space of
a standard DNN. For example, post hoc analyses on neurons of standard con-
volutional neural networks [336, 337] show that concepts that are completely
unrelated could be activated on the same axis, as shown in Figure 9. Even if we
create a vector in the latent space that is aimed towards a single concept [as is
done in 147, 338], that vector could activate highly on multiple concepts, which
means the signal for the two concepts is not disentangled. In that sense, vectors
in the latent space are “impure” in that they do not naturally represent single
concepts [see 58, for a detailed discussion].
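
One simple way to quantify this kind of impurity (a hedged sketch, not the exact
metric used in [58]) is to treat each latent axis as a detector for each concept
and compute its AUC: a well-disentangled axis should score high for its own
concept and near chance (0.5) for all others. The activations and concept labels
below are random stand-ins.

# Sketch: measure how "pure" each latent axis is by computing, for every
# (neuron, concept) pair, the AUC of the neuron's activation for detecting
# that concept. Activations and concept labels are random stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 8))            # latent activations (n, neurons)
concept_labels = rng.integers(0, 2, size=(1000, 3)) # binary labels for 3 concepts

purity = np.array([[roc_auc_score(concept_labels[:, c], activations[:, j])
                    for c in range(concept_labels.shape[1])]
                   for j in range(activations.shape[1])])
# Row j gives neuron j's AUC for each concept; an axis-aligned ("pure") neuron
# has one entry near 1 and the rest near 0.5.
print(np.round(purity, 2))
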
This challenge focuses on supervised disentanglement of neural networks, i.e.,
the researcher specifies which concepts to disentangle in the latent space. (In the
next section we will discuss unsupervised disentanglement.) Earlier work in this
domain disentangles the latent space for specific applications, such as face recog-
nition, where we might aim to separate identity and pose [341]. Recent work
in this area aims to disentangle the latent space with respect to a collection of
predefined concepts [58, 158, 193, 3]. For example, Chen et al. [58] adds con-
straints to the latent space to decorrelate the concepts and align them with axes
in the latent space; this is called “concept whitening.” (One could think of this
Fig 9. Example of an impure neuron in standard neural network from [337]. This figure shows
images that highly activate this neuron. Both dining tables (green) and Greek-style buildings
(red) are highly activated on the neuron, even though these two concepts are unrelated.

as analogous to a type of principal components analysis for neural networks.)
To define the concepts, a separate dataset is labeled with concept information
(or the main training dataset could also serve as this separate dataset as long
as it has labels for each concept.) These concept datasets are used to align the
network’s axes along the concepts. This type of method can create disentangled
latent spaces where concepts are more “pure” than those of a standard DNN,
without hurting accuracy. With the disentangled latent space, one can answer
questions about how the network gradually learns concepts over layers [58], or
one could interact with the model by intervening on the activations of con-
cept neurons [158]. Many challenges still exist for supervised disentanglement
of DNNs, including:
5.1 How to make a whole layer of a DNN interpretable? Current meth-
ods can make neurons in a single layer interpretable but all of them have
limitations. Several works [158, 193, 3] have tried to directly learn a con-
cept classifier in the latent space. However, good discriminative power on
concepts does not guarantee concept separation and disentanglement [58].
That is, we could easily have our network classify all the concepts correctly,
in addition to the overall classification task, but this would not mean that
the information flow concerning each concept goes only though that con-
cept’s designated path through the network; the activations of the different
concept nodes would likely still be correlated. The method discussed earlier,
namely that of [58], can fully disentangle the latent space with respect to
a few pre-defined concepts. But it also has two disadvantages if we want to
disentangle all neurons in a layer, namely: (i) this method currently has a
group of unconstrained neurons to handle residual information, which means
it does not disentangle all neurons in a layer, just some of them, and (ii) it
requires loading samples of all target concepts to train the latent space. If
the layer to be disentangled contains 500 neurons, we would need to load
samples for all 500 concepts, which takes a long time. Perhaps the latter
problem could be solved by training with a small random subset of concepts
at each iteration; but this has not been tried in current methods at the time
of this writing. The first problem might be solved by using an unsupervised
concept detector from Challenge 6 and having a human interact with the
concepts to determine whether each one is interpretable.
5.2 How to disentangle all neurons of a DNN simultaneously? Current
methods try to disentangle at most a single layer in the DNN (that is, they
attempt only the problem discussed above). This means one could only in-
terpret neurons in that specific layer, while semantic meanings of neurons
in other layers remain unknown. Ideally, we want to be able to completely
understand and modify the information flowing through all neurons in the
network. This is a challenging task for many obvious reasons, the first one
being that it is hard practically to define what all these concepts could
possibly be. We would need a comprehensive set of human-interpretable
concepts, which is hard to locate, create, or even to parameterize. Even if
we had this complete set, we would probably not want to manually specify
exactly what part of the network would be disentangled with respect to
each of these numerous concepts. For instance, if we tried to disentangle
the same set of concepts in all layers, it would be immediately problem-
atic because DNNs are naturally hierarchical: high-level concepts (objects,
weather, etc.) are learned in deeper layers, and the deeper layers leverage
low-level concepts (color, texture, object parts, etc.) learned in lower layers.
Clearly, complex concepts like “weather outside” could not be learned well
in earlier layers of the network, so higher-level concepts might be reserved
for deeper layers. Hence, we also need to know the hierarchy of the concepts
to place them in the correct layer. Defining the concept hierarchy manually
is almost impossible, since there could be thousands of concepts. But how
to automate it is also a challenge.
5.3 How to choose good concepts to learn for disentanglement? In su-
pervised disentanglement, the concepts are chosen manually. To gain useful
insights from the model, we need good concepts. But what are good con-
cepts in specific application domains? For example, in medical applications,
past works mostly use clinical attributes that already exist in the datasets.
However, Chen et al. [58] found that attributes in the ISIC dataset might
be missing the key concept used by the model to classify lesion malignancy.
Active learning approaches could be incredibly helpful in interfacing with
domain experts to create and refine concepts.
Moreover, it is challenging to learn concepts with continuous values. These
concepts might be important in specific applications, e.g., age of the patient
and size of tumors in medical applications. Current methods either define
a concept by using a set of representative samples or treat the concept as a
binary variable, where both are discrete. Therefore, for continuous concepts,
a challenge is how to choose good thresholds to transform the continuous
concept into one or multiple binary variables.
5.4 How to make the mapping from the disentangled layer to the out-
put layer interpretable? The decision process of current disentangled
neural networks contains two parts, x → c mapping the input x to the
disentangled representation (concepts) c, and c → y mapping the disen-
tangled representation c to the output y. [The notation is adopted from
158]. All current methods on neural disentanglement aim at making the
c interpretable, i.e., making the neurons in the latent space aligned with
human understandable concepts, but how these concepts combine to make
the final prediction, i.e., c → y, often remains a black box. This leaves a
gap between the interpretability of the latent space and the interpretability
of the entire model. Current methods either rely on variable importance
methods to explain c → y posthoc [58], or simply make c → y a linear
layer [158]. However, a linear layer might not be expressive enough to learn
c → y. [158] also shows that a linear function c → y is less effective than
nonlinear counterparts when the user wants to intervene in developing the
disentangled representation, e.g., replacing predicted concept values ĉj with
true concept values cj . Neural networks like neural additive models [5] and
neural decision trees [322] could potentially be used to model c → y, since
they are both differentiable, nonlinear, and intrinsically interpretable once
the input features are interpretable.
Example: Suppose machine learning practitioners and doctors want to build
a supervised disentangled DNN on X-ray data to detect and predict arthritis.
First, they aim to choose a set of relevant concepts that have been assessed by
doctors (Challenge 5.3). They should also choose thresholds to turn continuous
concepts (e.g., age) into binary variables to create the concept datasets (Chal-
lenge 5.3). Using the concept datasets, they can use supervised disentanglement
methods like concept whitening [58] to build a disentangled DNN. However, if
they choose too many concepts to disentangle in the neural network, loading
samples from all of the concept datasets may take a very long time (Challenge
5.1). Moreover, doctors may have chosen different levels of concepts, such as
bone spur (high-level) and shape of joint (low-level), and they would like the
low-level concepts to be disentangled by neurons in earlier layers, and high-level
concepts to be disentangled in deeper layers, since these concepts have a hi-
erarchy according to medical knowledge. However, current methods only allow
placing them in the same layer (Challenge 5.2). Finally, all previous steps can
only make neurons in the DNN latent space aligned with medical concepts, while
the way in which these concepts combine to predict arthritis remains uninter-
pretable (Challenge 5.4).

6. Unsupervised disentanglement of neural networks

The major motivation of unsupervised disentanglement is the same as Challenge
5, i.e., making information flow through the network easier to understand and
interact with. But in the case where we do not know the concepts, or in the
case where the concepts are numerous and we do not know how to parameter-
ize them, we cannot use the techniques from Challenge 5. In other words, the
concept c in Constraint (5.1) is no longer a concept we predefine, but it must
still be an actual concept in the existing universe of concepts. There are situ-
ations where concepts are actually unknown; for example, in materials science,
the concepts, such as key geometric patterns in the unit cells of materials, have
generally not been previously defined. There are also situations where concepts
are generally known but too numerous to handle; a key example is computer
vision for natural images. Even though we could put a name to many concepts
that exist in natural scenes, labeled datasets for computer vision have a severe
labeling bias: we tend only to label entities in images that are useful for a spe-
cific task (e.g., object detection), thus ignoring much of the information found
in images. If we could effectively perform unsupervised disentanglement, we can
rectify problems caused by human bias, and potentially make scientific discov-
eries in uncharted domains. For example, an unsupervised disentangled neural
network can be used to discover key patterns in materials and characterize their
relation to the physical properties of the material (e.g., “will the material al-
low light to pass through it?”). Figure 10 shows such a neural network, with
a latent space completely disentangled and aligned with the key patterns dis-
covered without supervision: in the latent space, each neuron corresponds to
a key pattern and all information about the pattern flows through the corre-
sponding neuron. Analyzing these patterns’ contribution to the prediction of a
desired physical property could help material scientists understand what corre-
lates with the physical properties and could provide insight into the design of
new materials.

Fig 10. Neural network with unsupervised disentanglement in the latent space. The input (on
the left) is a unit cell of a metamaterial, made of stiff (yellow) and soft (purple) constituent
materials. The target of the neural network is predicting whether the material with unit cell
on the left supports the formation of forbidden frequency bands of propagation, i.e. existence
of “band gaps.” Key patterns related to the band gap (illustrated in blue) have been discov-
ered without supervision. Neurons in the latent space are aligned with these patterns. Figure
adapted from [59].

Multiple branches of works are related to unsupervised disentanglement of
deep neural networks, and we will describe some of them.
Disentangled representations in deep generative models: Disentanglement of
generative models have long been studied [267, 79]. Deep generative models such
as generative adversarial networks [GANs, 112] and variational auto-encoders
[VAE, 149] try to learn a mapping from points in one probability distribution
to points in another. The first distribution is over points in the latent space,
and these points are chosen i.i.d. according to a zero-centered Gaussian dis-
tribution. The second distribution is over the space from which the data are
drawn (e.g., random natural images in the space of natural images). The statis-
tical independence between the latent features makes it easy to disentangle the
representation [27]: a disentangled representation simply guarantees that know-
ing or changing one latent feature (and its corresponding concept) does not
affect the distribution of any other. Results on simple imagery datasets show
that these generative models can decompose the data generation process into
disjoint factors (for instance, age, pose, identity of a person) and explicitly rep-
resent them in the latent space, without any supervision on these factors; this
happens based purely on statistical independence of these factors in the data.
Recently, the quality of disentanglement in deep generative models has been
improved [57, 124]. These methods achieve full disentanglement without super-
vision by maximizing the mutual information between latent variables and the
observations. However, these methods only work for relatively simple imagery
data, such as faces or single 3D objects (that is, in the image, there is only one
object, not a whole scene). Learning the decomposition of a scene into groups of
objects in the latent space, for example, is not yet achievable by these deep gen-
erative models. One reason for the failure of these approaches might be that the
occurrence of objects may not be statistically independent; for example, some
objects such as “bed” and “lamp” tend to co-occur in the same scene. Also, the
same type of object may occur multiple times in the scene, which cannot be
easily encoded by a single continuous latent feature.
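
As a small illustration of how such objectives encourage independent latent
features, the sketch below writes out a VAE-style loss in which the KL term to a
factorized standard-normal prior is up-weighted (a β-weighted variant; this is
one common recipe in this family, not necessarily the exact objective of the
works cited above). Only the loss is shown, not the encoder, decoder, or
training loop.

# Hedged sketch of a VAE-style objective where the KL term (to a factorized
# standard-normal prior) is up-weighted to encourage statistically independent,
# and hence more disentangled, latent features. Loss only; no model or training loop.
import numpy as np

def vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term (squared error here; Bernoulli likelihoods are also common).
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder, in closed form.
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)
    return np.mean(recon + beta * kl)
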
Neural networks that incorporate compositional inductive bias: Another line
of work designs neural networks that directly build compositional structure into
a neural architecture. Compositional structure occurs naturally in computer
vision data, as objects in the natural world are made of parts. In areas that
have been widely studied beyond computer vision, including speech recognition,
researchers have already summarized a series of compositional hypotheses, and
incorporated them into machine learning frameworks. For example, in computer
vision, the “vision as inverse graphics” paradigm [23] tries to treat vision tasks
as the inverse of the computer graphics rendering process. In other words, it tries
to decode images into a combination of features that might control rendering of
a scene, such as object position, orientation, texture and lighting. Many studies
on unsupervised disentanglement have been focused on creating this type of rep-
resentation because it is intrinsically disentangled. Early approaches toward this
goal includes DC-IGN [165] and Spatial Transformers [135]. Recently, Capsule
Networks [126, 261] have provided a new way to incorporate compositional as-
sumptions into neural networks. Instead of using neurons as the building blocks,
Capsule Networks combine sets of neurons into larger units called “Capsules,”
and force them to represent information such as pose, color and location of ei-
ther a particular part or a complete object. This method was later combined
with generative models in the Stack Capsule Autoencoder (SCAE) [163]. With
the help of the Set Transformer [179] in combining information between layers,
SCAE discovers constituents of the image and organizes them into a smaller set
of objects. Slot Attention modules [191] further use an iterative attention mech-
anism to control information flow between layers, and achieve better results on
unsupervised object discovery. Nevertheless, similar to the generative models,
these networks perform poorly when aiming to discover concepts on more realis-
tic datasets. For example, SCAE can only discover stroke-like structures that are
uninterpretable to humans on the Street View House Numbers (SVHN) dataset
[218] (see Figure 11). The reason is that the SCAE can only discover visual
structures that appear frequently in the dataset, but in reality the appearance
of objects that belong to the same category can vary a lot. There have been
proposals [e.g., GLOM 125] on how a neural network with a fixed architecture
could potentially parse an image into a part-whole hierarchy. Although the idea
seems to be promising, no working system has been developed yet. There is still
a lot of room for development for this type of method.

Fig 11. Left: Sample images collected from the SVHN dataset [218]. Right: Stroke-like tem-
plates discovered by the SCAE. Although the objects in the dataset are mostly digits, current
capsule networks are unable to discover them as concepts without supervision.

Works that perform unsupervised disentanglement implicitly: Many other in-
terpretable neural networks can also learn disjoint concepts in the latent space
although the idea of “disentanglement” is not explicitly mentioned in these pa-
pers. For example, as mentioned in Section 4, Chen et al. [53] propose to create
a prototype layer, storing prototypical parts of training images, to do case-based
reasoning. When classifying birds, the prototypical parts are usually object parts
such as heads and wings of birds. Interestingly, a separation cost is applied to
the learned prototypes to encourage diversity of the prototypical parts, which
is very similar to the idea of disentanglement. Zhang et al. [333] propose to
maximize mutual information between the output of convolutional filters and
predefined part templates (feature masks with positive values in a localized re-
gion and negative in other places). The mutual information regularization term
is essentially the sum of two disentanglement losses, (a) an inter-category en-
tropy loss that encourages each filter to be exclusively activated by images of
one category and not activated on other categories; (b) a spatial entropy loss
that encourages each filter to be activated only on a local region of the image.
Results show that the convolutional filters with the mutual information loss tend
to be activated only on a specific part of the object. These two methods are
capable of learning single objects or parts in the latent space but have not yet
been generalized to handle more comprehensive concepts (such as properties of
scenes, including style of an indoor room – e.g., cozy, modern, etc., weather for
outdoor scenes, etc.), since these concepts are not localized in the images.
Despite multiple branches of related work targeting concept discovery in dif-
ferent ways, many challenges still exist:

6.1 How to quantitatively evaluate unsupervised disentanglement?


Recall that unsupervised disentanglement is desired (but challenging) in
two different types of domains: (a) domains in which we do not know what
the concepts are (e.g., materials science); (b) domains in which concepts are
known but labeling biases exist.
Let us start with a version of case (b) where some concepts are known,
such as objects in natural images. Let us say we are just missing a subset
of concept labels in the dataset, which is a simple type of missing data
bias. If we build a neural network that disentangles the latent space, we can
quantitatively evaluate the quality of disentanglement. For instance, let us
say we know a collection of concepts that we would like disentangled (say,
a collection of specific objects that appear in natural images). If we use
an unsupervised algorithm for disentangling the space, and then evaluate
whether it indeed disentangled the known concepts, then we have a useful
quantitative evaluation [124, 56, 148, 92, 84].
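To make this kind of check concrete, here is a minimal sketch (in Python) of one possible quantitative evaluation: measure how much information each learned latent dimension carries about each known concept, and look for a one-to-one correspondence. This is an illustrative computation rather than any specific published metric, and the synthetic data, the binning, and the variable names are our own assumptions.

```python
# A minimal sketch (not any specific published metric): given learned latents Z
# (n x d) and known discrete concept labels F (n x k), check whether each concept
# is captured by one latent dimension. Binning and variable names are assumptions.
import numpy as np
from sklearn.metrics import mutual_info_score

def disentanglement_matrix(Z, F, n_bins=20):
    """Mutual information between each latent dimension and each known concept."""
    n, d = Z.shape
    _, k = F.shape
    M = np.zeros((d, k))
    for j in range(d):
        edges = np.quantile(Z[:, j], np.linspace(0, 1, n_bins))
        z_binned = np.digitize(Z[:, j], edges)
        for c in range(k):
            M[j, c] = mutual_info_score(z_binned, F[:, c])
    # Roughly one high entry per column suggests the known concepts were disentangled.
    return M

# Example with synthetic data: two concepts, each noisily encoded by one latent.
rng = np.random.default_rng(0)
F = rng.integers(0, 3, size=(1000, 2))                 # two known discrete concepts
Z = np.c_[F[:, 0] + 0.1 * rng.normal(size=1000),       # latent 0 tracks concept 0
          F[:, 1] + 0.1 * rng.normal(size=1000),       # latent 1 tracks concept 1
          rng.normal(size=1000)]                       # latent 2 is noise
print(disentanglement_matrix(Z, F).round(2))
```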
If we are working in a domain where we do not know the concepts to disen-
tangle (case (a)), we also might not know what regularization (or other type
of inductive bias) to add to the algorithm so that it can discover concepts
that we would find interpretable. In that case, it becomes difficult to eval-
uate the results quantitatively. Materials science, as discussed above, is an
example of one of these domains, where we cannot find ground-truth labels
of key patterns, nor can we send queries to human evaluators (since even
material scientists do not know the key patterns in many cases). Evaluation
metrics from natural images do not work. Evaluating disentanglement in
these domains is thus a challenge.
Going back to case (b), in situations where labeling biases exist (i.e., when
only some concepts are labeled, and the set of labels are biased), current
evaluation metrics of disentanglement that rely only on human annota-
tions can be problematic. For instance, most annotations in current im-
agery datasets concern only objects in images but ignore useful information
such as lighting and style of furniture. An unsupervised neural network may
discover important types of concepts that do not exist in the annotations.
Medical images may be an important domain here, where some concepts
are clearly known to radiologists, but where labels in available datasets are
extremely limited.
6.2 How to adapt neural network architectures designed with compo-
sitional constraints to other domains?
Incorporating compositional assumptions into network architecture design
is a way to create intrinsically interpretable disentangled representations.
That said, such assumptions are usually modality- and task-specific, severely
limiting the general applicability of such designs. Let us take Capsule Net-
works [261] and the Slot Attention module [191] in the computer vision field
as examples: these modules try to create object-centric and part-centric rep-
resentations inside the latent space of the network. These ideas are based on
the assumption that an image can be understood as a composition of differ-
ent objects, and the interference between objects is negligible. Nevertheless,
such an assumption cannot necessarily be applied to materials science, in
which a material’s physical properties could depend jointly, in a complex
way, on the patterns of constituent materials within it. How would we re-
define the compositional constraints derived in computer vision for natural
images to work for other domains such as materials science?
6.3 How to learn part-whole disentanglement for more complicated
patterns in large vision datasets? A specific area within unsupervised
disentanglement in computer vision is to semantically segment the image
into different parts, where each part represents an object or a part of an
object [71, 261, 163, 191]. Current networks can achieve convincing results
over simple, synthetic datasets (where objects consist of simple geometric
shapes or digits). (Such datasets include CLEVR6 [137], d-Sprite [202],
and Objects Room and Tetrominoes [139].) However, no work has suc-
cessfully learned part-whole relationships on more realistic datasets such as
ImageNet. New techniques may be needed to handle various interactions
between different objects in a real-world dataset.
Example: Suppose material scientists want to build a classification model that
can predict whether the designs of metamaterials support the existence of band
gaps (same example as Figure 10). Because unit cells of metamaterials are usu-
ally represented as a matrix/tensor of what constituent materials are placed at
each location, the researchers plan to build a deep neural network, since neural
networks excel at extracting useful information from raw inputs. They might en-
counter several challenges when building the disentangled neural network. First,
they need to identify the disentanglement constraints to build into the network
architecture. Architectures that work well for imagery data may not work for
unit cells of metamaterials, since key patterns in unit cells can be completely
different from objects/concepts in imagery data (Challenge 6.2 above). More-
over, evaluation of disentanglement can be challenging as the material patterns
have no ground truth labels (Challenge 6.1 above).

7. Dimension reduction for data visualization

Even in data science, a picture is worth a thousand words. Dimension reduc-
tion (DR) techniques take, as input, high-dimensional data and project it down
to a lower-dimensional space (usually 2D or 3D) so that a human can better
comprehend it. Data visualization can provide an intuitive understanding of the
underlying structure of the dataset. DR can help us gain insight and build hy-
potheses. DR can help us design features so that we can build an interpretable
supervised model, allowing us to work with high-dimensional data in a way we
would not otherwise be able to. With DR, biases or pervasive noise in the data
may be illuminated, allowing us to be better data caretakers. However, with
the wrong DR method, information about the high-dimensional relationships
between points can be lost when projecting onto a 2D or 3D space.
DR methods are unsupervised ML methods that are constrained to be in-
terpretable. Referring back to our generic interpretable ML formulation (*),
DR methods produce a function mapping data x1 , x2 , ..., xn from p-dimensional
space to y1 , y2 , ... yn in a low dimensional space (usually 2D). The constraint of
mapping to 2 dimensions is an interpretability constraint. DR methods typically
have a loss function that aims to preserve a combination of distance informa-
tion between points and neighborhood information around each point when
projecting to 2D. Each DR algorithm chooses the loss function differently. Each
algorithm also chooses its own distance or neighborhood information to preserve.

Fig 12. Visualization of the MNIST dataset [177] using different kinds of DR methods: PCA
[232], t-SNE [297, 187, 236], UMAP [204], and PaCMAP [313]. The axes are not quantified
because these are projections into an abstract 2D space.

Generally speaking, there are two primary types of approaches to DR for
visualization, commonly referred to as local and global methods. Global meth-
ods aim mainly to preserve distances between any pair of points (rather than
neighborhoods), while the local methods emphasize preservation of local neigh-
borhoods (that is, which points are nearest neighbors). As a result, local meth-
ods can preserve the local cluster structure better, while failing to preserve the
overall layout of clusters in the space, and vice versa. Figure 12 demonstrates
the difference between the two kinds of algorithms over the MNIST handwritten
digit dataset [177], which is a dataset where local structure tends to be more
important than global structure. The only global method here, Principal Com-
ponent Analysis (PCA) [232], fails to separate different digits into clusters, but
it gives a sense of which digits are different from each other. t-SNE [297], which
is a local method, successfully separated all the digits, but does not keep the
scale information that is preserved in the PCA embedding. More recent meth-
ods, such as UMAP [204] and PaCMAP [313] also separated the digits while
preserving some of the global information.
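As a small usage sketch of this kind of comparison (using scikit-learn's small digits dataset as a stand-in for MNIST, with hyperparameter values that are illustrative choices rather than recommendations), one can place a global method and a local method side by side:

```python
# Sketch comparing a global method (PCA) and a local method (t-SNE) on a small
# digits dataset, a stand-in for MNIST; hyperparameters are illustrative choices.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 1797 samples, 64 dimensions
emb_pca = PCA(n_components=2).fit_transform(X)
emb_tsne = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, [emb_pca, emb_tsne], ["PCA (global)", "t-SNE (local)"]):
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=4, cmap="tab10")
    ax.set_title(title)
    ax.set_xticks([]); ax.set_yticks([])       # the 2D axes are abstract, as in Figure 12
plt.show()
```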
Early approaches toward this problem, including PCA [232] and Multidi-
mensional Scaling (MDS) [289], mostly fall into the global category. They aim
to preserve as much information as possible from the high-dimensional space,
including the distances or rank information between pairs of points. These meth-
ods usually apply matrix decomposition over the data or pairwise distance ma-
trix, and are widely used for data preprocessing. These methods usually fail to
preserve local structure, including cluster structure.
To solve these problems, researchers later (early 2000s) developed methods
that aimed at local structure, because they had knowledge that high dimensional
data lie along low-dimensional manifolds. Here, it was important to preserve the
local information along the manifolds. Isomap [285], Local Linear Embedding
(LLE) [249], Hessian Local Linear Embedding [87], and Laplacian Eigenmaps
[24] all try to preserve exact local Euclidean distances from the original space
when creating low-dimensional embeddings. But distances between points be-
have differently in high dimensions than in low dimensions, leading to problems
preserving the distances. In particular, these methods tended to exhibit what
is called the “crowding problem,” where samples in the low-dimensional space
crowd together, devastating the local neighborhood structure and losing infor-
mation. t-SNE [297] was able to handle this problem by transforming the high-
dimensional distances between each pair of points into probabilities of whether
the two points should be neighbors in the low-dimensional space. Doing this
aims to ensure that local neighborhood structure is preserved. Then, during
the projection to low dimensions, t-SNE enforces that the distribution over dis-
tances between points in the low-dimensional space is a specific transformation
of the distribution over distances between points in the high-dimensional space.
Forcing distances in the low-dimensional space to follow a specific distribution
avoids the crowding problem, which is when a large proportion of the distances
are almost zero.
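The following sketch shows the two similarity kernels at the heart of this construction; it omits the per-point bandwidth calibration that t-SNE performs via its perplexity parameter and the KL-divergence optimization itself, so it is a simplified illustration rather than a full implementation.

```python
# Simplified sketch of the two t-SNE similarity kernels: a Gaussian kernel in the
# high-dimensional space and a heavy-tailed Student-t (Cauchy) kernel in the
# low-dimensional space. The heavy tail lets moderately distant points sit far
# apart in 2D, which counteracts crowding. Perplexity calibration is omitted.
import numpy as np

def high_dim_affinities(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    P = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()                                     # joint probabilities p_ij

def low_dim_affinities(Y):
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    Q = 1.0 / (1.0 + d2)                                   # Student-t with one degree of freedom
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()                                     # joint probabilities q_ij

# t-SNE then moves the 2D points Y to minimize KL(P || Q) = sum_ij p_ij log(p_ij / q_ij).
```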
Though more successful than previous methods, t-SNE still suffers from slow
running times, sensitivity to hyperparameter changes, and lack of preservation
of global structure. A series of t-SNE variants aim to improve on these short-
comings. The most famous among them are BH t-SNE [296] and FIt-SNE [187].
BH t-SNE constructs a k-nearest neighbor graph over the high dimensional
space to record local neighborhood structure, and utilizes the Barnes-Hut force
calculation algorithm [which is typically used for multi-body simulation; 20]
to accelerate optimization of the low-dimensional embedding. These choices re-
duced the complexity for each step in the optimization from O(n²), where n is
the number of samples in the dataset, to a smaller O(n log n). FIt-SNE further
accelerates the rendering phase with a Fast Fourier Transform, and brings down
the complexity to O(n). Besides these variants, multiple new algorithms have
been created based on the framework of t-SNE. Prominent examples include
LargeVis [284], which is widely used in network analysis, and UMAP [204],
which is widely used in computational biology. With a better initialization cre-
ated by unsupervised machine learning algorithms (such as spectral clustering)
and better loss functions, these algorithms improve global structure preservation
and run-time efficiency. Figure 13 demonstrates differences in results from DR
methods on a dataset of isotropic Gaussians. As the figure shows, the results of
different DR techniques look quite different from each other.

Fig 13. Visualization of 4000 points sampled from 20 isotropic Gaussians using Laplacian
Eigenmap [24, 233], t-SNE [297, 187, 236], ForceAtlas2 [134], UMAP [204], TriMap [9] and
PaCMAP [313]. The 20 Gaussians are equally spaced on an axis in 50-dimensional space,
labelled by the gradient colors. The best results are arguably those of t-SNE and PaCMAP in
this figure, which preserve clusters compactly and their relative placement (yellow on the left,
purple on the right).

Recent studies on DR algorithms shed light on how the loss function affects the rendering of local
structure [37], and provide guidance on how to design good loss functions so
that the local and global structure can both be preserved simultaneously [313].
Nevertheless, several challenges still exist for DR methods:
7.1 How to capture information from the high dimensional space more
accurately?
Most of the recent DR methods capture information in the high dimensional
space mainly from the k-nearest-neighbors and their relative distances, at
the expense of information from points that are more distant, which would
allow the preservation of more global structure. Several works [314, 64] dis-
cuss possible pitfalls in data analysis created by t-SNE and UMAP due to
loss of non-local information. Recent methods mitigate the loss of global in-
formation by using global-aware initialization [204, 155] (that is, initializing
the distances in the low-dimensional space using PCA) and/or selectively
preserving distances between non-neighbor samples [105, 313]. Neverthe-
less, these methods are still designed and optimized under the assumption
that the nearest neighbors, defined by the given metric (usually Euclidean
distance in the high dimensional space), can accurately depict the relation-
ships between samples. This assumption may not hold true for some data,
for instance, Euclidean distance may not be suitable for measuring distances
between weights (or activations) of a neural network [see 50, for a detailed
example of such a failure]. We would like DR methods to better capture
information from the high dimensional space to avoid such pitfalls.
7.2 How should we select hyperparameters for DR?
Modern DR methods, due to their multi-stage characteristics, involve a large
number of hyperparameters, including the number of high-dimensional near-
est neighbors to be preserved in the low-dimensional space and the learn-
ing rate used to optimize the low-dimensional embedding. There are often
dozens of hyperparameters in any given DR method, and since DR meth-
ods are unsupervised and we do not already know the structure of the
high-dimensional data, it is difficult to tune them. A poor choice of hy-
perparameters may lead to disappointing (or even misleading) DR results.
Figure 14 shows some DR results for the Mammoth dataset [133, 64] using
t-SNE, LargeVis, UMAP, TriMAP and PaCMAP with different sets of rea-
sonable hyperparameters. When their perplexity parameter or the number
of nearest neighbors is not chosen carefully, algorithms can fail to preserve
the global structure of the mammoth (specifically, the overall placement of
the mammoth’s parts), and they create spurious clusters (losing connectiv-
ity between parts of the mammoth) and lose details (such as the toes on the
feet of the mammoth). For more detailed discussions about the effect of dif-
ferent hyperparameters, see [314, 64]. Multiple works [for example 25, 313]
aimed to alleviate this problem for the most influential hyperparameters,
but the problem still exists, and the set of hyperparameters remains data-
dependent. The tuning process, which sometimes involves many runs of a
DR method, is time and power consuming, and requires user expertise in
both the data domain and in DR algorithms. It could be extremely helpful
to achieve better automatic hyperparameter selection for DR algorithms.
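One practical, if brute-force, way to probe the sensitivity discussed in Challenge 7.2 is to sweep the most influential hyperparameter and inspect the resulting embeddings side by side. The sketch below does this for t-SNE's perplexity; the particular values and the assumed data matrix X are illustrative.

```python
# Sketch of a perplexity sweep for t-SNE; the values are illustrative choices, and
# X is assumed to be an (n_samples, n_features) array with labels for coloring only.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def perplexity_sweep(X, labels, perplexities=(5, 30, 100)):
    fig, axes = plt.subplots(1, len(perplexities), figsize=(4 * len(perplexities), 4))
    for ax, perp in zip(axes, perplexities):
        emb = TSNE(n_components=2, perplexity=perp, init="pca",
                   random_state=0).fit_transform(X)
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, s=4)
        ax.set_title(f"perplexity = {perp}")
        ax.set_xticks([]); ax.set_yticks([])
    plt.show()
```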
7.3 Can the DR transformation from high- to low-dimensions be made
more interpretable or explainable? The DR mapping itself – that is,
the transformation from high to low dimensions – typically is complex.
There are some cases in which insight into this mapping can be gained, for
instance, if PCA is used as the DR method, we may be able to determine
which of the original dimensions are dominant in the first few principal
components. It may be useful to design modern approaches to help users
understand how the final two or three dimensions are defined in terms of the
high-dimensional features. This may take the form of explanatory post-hoc
visualizations or constrained DR methods.
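For the PCA case mentioned above, the mapping is linear, so one can at least read off which original features dominate the two displayed axes. A minimal sketch (with placeholder feature names, and assuming X has already been standardized) is:

```python
# Sketch of inspecting which original features dominate a 2D PCA projection.
# X is assumed to be standardized; feature_names is a placeholder list of strings.
import numpy as np
from sklearn.decomposition import PCA

def top_loadings(X, feature_names, n_top=5):
    pca = PCA(n_components=2).fit(X)
    for i, component in enumerate(pca.components_):      # rows are the principal axes
        order = np.argsort(np.abs(component))[::-1][:n_top]
        print(f"Component {i + 1} (explains {pca.explained_variance_ratio_[i]:.1%} of variance):")
        for j in order:
            print(f"  {feature_names[j]}: weight {component[j]:+.3f}")
```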

Example: Computational biologists often apply DR methods to single-cell
RNA sequence data to understand the cell differentiation process and discover
previously-unknown subtypes of cells. Without a suitable way to tune parame-
ters, they may be misled by a DR method into thinking that a spurious cluster
from a DR method is actually a new subtype of cell, when it is simply a failure
of the DR method to capture local or global structure (Challenge 7.2 above).
Since tuning hyperparameters in a high-dimensional space is difficult

Fig 14. Projection from [313] of the Mammoth dataset into 2D using t-SNE [297, 187, 236],
LargeVis [284], UMAP [204], TriMap [9] and PaCMAP [313]. Incorrectly-chosen hyperpa-
rameters will lead to misleading results even in a simple dataset. This issue is particularly
visible for t-SNE (first two columns) and UMAP (fourth column). The original dataset is 3
dimensional and is shown at the top.

(without the ground truth afforded to supervised methods), the researchers have no
way to see whether this cluster is present in the high-dimensional data or not
(Challenge 7.1 above). Scientists could waste a lot of time examining each such
spurious cluster. If we were able to solve the problems with DR tuning and
structure preservation discussed above, it would make DR methods more reliable,
leading to potentially increased understanding of many datasets.

8. Machine learning models that incorporate physics and other
generative or causal constraints

There is a growing trend towards developing machine learning models that incor-
porate physics (or other) constraints. These models are not purely data-driven,
in the sense that their training may require little data or no data at all [e.g.,
244]. Instead, these models are trained to observe physical laws, often in the form
of ordinary (ODEs) and partial differential equations (PDEs). These physics-
guided models provide alternatives to traditional numerical methods (e.g., finite
element methods) for solving PDEs, and are of immense interest to physicists,
chemists, and materials scientists. The resulting models are interpretable, in the
sense that they are constrained to follow the laws of physics that were provided
to them. (It might be easier to think conversely: physicists might find that a
standard supervised machine learning model that is trained on data from a
known physical system – but that does not follow the laws of physics – would
be uninterpretable.)
The idea of using machine learning models to approximate ODEs and PDEs
solutions is not new. Lee and Kang [178] developed highly parallel algorithms,
based on neural networks, for solving finite difference equations (which are them-
selves approximations of original differential equations). Psichogios and Ungar
[238] created a hybrid neural network-first principles modeling scheme, in which
neural networks are used to estimate parameters of differential equations. La-
garis et al. [168, 169] explored the idea of using neural networks to solve initial
and boundary value problems. Several additional works [167, 43, 259] used sparse
regression and dynamic mode decomposition to discover the governing equa-
tions of dynamical systems directly from data. More recently, Raissi et al. [242]
extended the earlier works and developed the general framework of a physics-
informed neural network (PINN). In general, a PINN is a neural network that
approximates the solution of a set of PDEs with initial and boundary conditions.
The training of a PINN minimizes the residuals from the PDEs as well as the
residuals from the initial and boundary conditions. In general, physics-guided
models (neural networks) can be trained without supervised training data. Let
us explain how this works. Given a differential equation, say, f′(t) = af(t) + bt + c, where a, b and c are known constants, we could train a neural network g to approximate f, by minimizing (g′(t) − ag(t) − bt − c)² at finitely many points t. Thus, no labeled data in the form (t, f(t)) (what we would need for conventional supervised machine learning) is needed. The derivative g′(t) with respect to input t (at each of those finitely many points t used for training) is found by leveraging the existing network structure of g using back-propagation. Figure 15 illustrates the training process of a PINN for approximating the solution of the one-dimensional heat equation ∂u/∂t = k ∂²u/∂x² with initial condition u(x, 0) = f(x)
and Dirichlet boundary conditions u(0, t) = 0 and u(L, t) = 0. If observed data
are available, a PINN can be optimized with an additional mean-squared-error
term to encourage data fit. Many extensions to PINNs have since been devel-
oped, including fractional PINNs [226] and parareal PINNs [207]. PINNs have
been extended to convolutional [340] and graph neural network [270] backbones.
They have been used in many scientific applications, including fluid mechanics
modeling [243], cardiac activation mapping [263], stochastic systems modeling
[323], and discovery of differential equations [241, 240].
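To make the simple ODE example above concrete, here is a minimal sketch of such a training loop in PyTorch. The constants, the assumed initial condition g(0) = 1, the network size, and the optimizer settings are all illustrative choices, not the setup of any of the cited papers.

```python
# Minimal sketch of training a network g(t) so that g'(t) ≈ a g(t) + b t + c on [0, 1],
# with an assumed initial condition g(0) = 1; all constants and settings are illustrative.
import torch

a, b, c = -1.0, 0.5, 0.2
g = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

for _ in range(5000):
    t = torch.rand(128, 1, requires_grad=True)                     # collocation points; no labels needed
    u = g(t)
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]  # g'(t) via back-propagation
    ode_residual = du_dt - (a * u + b * t + c)
    ic_residual = g(torch.zeros(1, 1)) - 1.0                       # assumed initial condition g(0) = 1
    loss = (ode_residual ** 2).mean() + (ic_residual ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```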
In addition to neural networks, Gaussian processes are also popular mod-
els for approximating solutions of differential equations. For example, Archam-
beau et al. [16] developed a variational approximation scheme for estimating
the posterior distribution of a system governed by a general stochastic differen-
tial equation, based on Gaussian processes. Zhao et al. [335] developed a PDE-
constrained Gaussian process model, based on the global Galerkin discretization
of the governing PDEs for the wire saw slicing process.

Fig 15. Physics-informed neural network (PINN) framework for solving the one-dimensional heat equation ∂u/∂t = k ∂²u/∂x² with initial condition u(x, 0) = f(x) and Dirichlet boundary condi-
tions u(0, t) = 0 and u(L, t) = 0. Here, u is the model, which is the output of a neural network
(left), obeys the heat equation (upper right) and initial and boundary conditions (lower right).
The degree to which u does not obey the heat equation or initial or boundary conditions is
reflected in the loss function.

More recently, Pang et al. [227] used the neural-network-induced Gaussian process (NNGP) regression
for solving PDEs.
Despite recent success of PINNs or physics-guided machine learning in gen-
eral, challenges still exist, including:
8.1 How to improve training of a PINN? Wang et al. [308] discussed the
problem of numerical stiffness in gradient backpropagation during training
of a PINN. To mitigate the problem, they proposed an algorithm to dynam-
ically balance various loss terms during training. How to further improve
training of a PINN is still an open challenge that needs to be addressed.
8.2 How do we combine PINNs with optimal experimental/simulation
design to quickly reduce uncertainty in approximating physical
models? Currently, data that might be used for training PINNs are often
collected (from experiments or simulations) beforehand (i.e., prior to train-
ing of PINNs). This could lead to large uncertainty bands on areas of the
input domain where data are scarce, and when the parameters of the PDE
need to be learned from these data. Integrating experimental/simulation
design with training of PINNs could lead to a more efficient data collection
process and could reduce uncertainty of PINNs in approximating physical
models. However, how to achieve this is still an open question.
8.3 Are there other ways of incorporating PDE (or other) constraints
into models other than neural networks? Neural networks can be
easily differentiated using the chain rule. This makes them ideal for ap-
proximating solutions to PDEs. Other models, especially non-parametric
regression models cannot be differentiated easily – even though there have
been some works that used Gaussian processes for approximating solutions
to PDEs, these Gaussian-process solutions often rely on discrete (finite-
difference) approximations for the original PDEs [see 335], or they could be
non-differentiable at times where there are observations [see 16]. How one
can incorporate differential constraints (in terms of differential equations)
into non-parametric regression models remains an open question.
Going beyond PDEs to other types of complex constraints that encode prior
knowledge, Kursuncu et al. [166] proposed the concept of knowledge infused
deep learning, where the learning of a deep neural network is guided by
domain knowledge represented as a knowledge graph. Knowledge graphs
encode relationships between entities in the world, including common sense
knowledge. Thus, being able to incorporate information from a knowledge
graph could be very helpful if we can figure out ways to do it effectively.
Example: Suppose that a team of aerospace engineers wants to study the aero-
dynamics for a new design of aircraft wing. They want to use a physics-guided
machine learning model to approximate a solution to the physical model (a set
of PDEs), because numerically computing the PDE solution is computationally
expensive. They want to try different machine learning models, but they need
to first think about how they can integrate the PDE constraints into machine
learning models that are not easily differentiable with respect to input (Chal-
lenge 8.3). If they use a PINN, they need to ensure that training does not suffer
from gradient pathologies (Challenge 8.1). In addition, collecting data to sup-
port training of a PINN by conducting experiments or running simulations can
be expensive. They want to know how to integrate experimental/simulation de-
sign with training of PINNs to achieve the best model (Challenge 8.2). How to
integrate constraints into various models, and how to improve training and to
co-design experiments/simulations with model training are challenges that bur-
den not only aerospace engineers but any scientist working with physics-guided
machine learning models.

9. Characterization of the “Rashomon” set of good models

In many practical machine learning problems, there is a multiplicity of almost-
equally-accurate models. This set of high performing models is called the
Rashomon set, based on an observation of the Rashomon effect by the statis-
tician Leo Breiman. The Rashomon effect occurs when there are multiple de-
scriptions of the same event [41] with possibly no ground truth. The Rashomon
set includes these multiple descriptions of the same dataset [and again, none of
them are assumed to be the truth, see 269, 97, 86, 201]. Rashomon sets occur
in multiple domains, including credit score estimation, medical imaging, nat-
ural language processing, health records analysis, recidivism prediction, model
explanations for planning and robotics, and so on [74, 201, 280]. We have dis-
cussed in Challenges 4 and 5 that even deep neural networks for computer
vision exhibit Rashomon sets, because neural networks that perform case-based reasoning or disentanglement still yielded models that were as accurate as their unconstrained counterparts; thus, these interpretable deep neural models are
within the Rashomon set. Rashomon sets present an opportunity for data scien-
tists: if there are many equally-good models, we can choose one that has desired
properties that go beyond minimizing an objective function. In fact, the model
that optimizes the training loss might not be the best to deploy in practice
anyway due to the possibilities of poor generalization, trust issues, or encoded
inductive biases that are undesirable [74]. More careful approaches to problem
formulation and model selection could be taken that include the possibility of
model multiplicity in the first place. Simply put – we need ways to explore the
Rashomon set, particularly if we are interested in model interpretability.
Formally, the Rashomon set is the set of models whose training loss is below a
specific threshold, as shown in Figure 16 (a). Given a loss function and a model
class F, the Rashomon set can be written as

R(F, f∗, ε) = {f ∈ F such that Loss(f) ≤ Loss(f∗) + ε},

where f ∗ can be an empirical risk minimizer, optimal model or any other refer-
ence model. We would typically choose F to be complex enough to contain
models that fit the training data well without overfitting. The threshold ε, which is called the Rashomon parameter, can be a hard hyperparameter set by a machine learning practitioner or a percentage of the loss (i.e., ε becomes γ Loss(f∗)). We would typically choose ε or γ to be small enough that suffering
this additional loss would have little to no practical significance on predictive
performance. For instance, we might choose it to be much smaller than the
(generalization) error between training and test sets. We would conversely want
to choose ε or γ to be as large as permitted so that we have more flexibility to
choose models within a bigger Rashomon set.
It has been shown by Semenova et al. [269] that when the Rashomon set is
large, under weak assumptions, it must contain a simple (perhaps more inter-
pretable) model within it. The argument goes as follows: assume the Rashomon
set is large, so that it contains a ball of functions from a complicated function
class Fcomplicated (think of this as high-dimensional polynomials that are com-
plex enough to fit the data well without overfitting). If a set of simpler functions
Fsimpler could serve as approximating set for Fcomplicated (think decision trees
of a certain depth approximating the set of polynomials), it means that each
complicated function could be well-approximated by a simpler function (and
indeed, polynomials can be well-approximated by decision trees). By this logic,
the ball of Fcomplicated that is within the Rashomon set must contain at least
one function within Fsimpler , which is the simple function we were looking for.
Semenova et al. [269] also suggested a useful rule of thumb for determining
whether a Rashomon set is large: run many different types of machine learn-
ing algorithms (e.g., boosted decision trees, support vector machines, neural
networks, random forests, logistic regression) and if they generally perform sim-
ilarly, it correlates with the existence of a large Rashomon set (and thus the
possibility of a simpler function also achieving a similar level of accuracy). The
knowledge that there might exist a simple-yet-accurate function before finding
it could be very useful, particularly in the cases where finding an optimal sparse
model is NP-hard, as in Challenge 1. Here, the user would run many differ-
ent algorithms to determine whether it would be worthwhile to solve the hard
problem of finding an interpretable sparse model.
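A rough sketch of this rule of thumb follows: fit several very different model classes and check how many land within ε of the best cross-validated loss. The dataset, the particular models, and the value of ε below are illustrative assumptions.

```python
# Sketch of the rule of thumb: train several very different algorithms and check
# whether they all land within epsilon of the best observed loss. The dataset,
# model choices, and epsilon are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(random_state=0),
    "boosted trees": GradientBoostingClassifier(random_state=0),
    "neural network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}
losses = {name: 1 - cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

epsilon = 0.02                           # an illustrative Rashomon parameter
best = min(losses.values())
in_rashomon = [name for name, loss in losses.items() if loss <= best + epsilon]
print(losses)
print("Within epsilon of the best:", in_rashomon)
# If most of these very different methods land within epsilon of the best, that
# correlates with a large Rashomon set, suggesting it may be worth searching for
# a simple, interpretable model of similar accuracy.
```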
Knowledge of the functions that lie within Rashomon sets is valuable in multiple interesting use cases. Models with various important properties be-
sides interpretability can exist in the Rashomon set, including fairness [67] and
monotonicity. Thus, understanding the properties of the Rashomon set could
be pivotal for analysis of a complex machine learning problem and its possible
modeling choices.
The size of the Rashomon set can be considered as a way of measuring the
complexity of a learning problem. Problems with large Rashomon sets are less
complicated problems, since more good solutions exist to these problems. The
Rashomon set is a property of the model class and dataset. The size of the
Rashomon set differs from other known characterizations of the complexity in
machine learning. Ways that complexity of function classes, algorithms, or learn-
ing problems are typically measured in statistical learning theory include the
VC dimension, covering numbers, algorithmic stability, Rademacher complexity,
or flat minima [see, for instance, 279, 140, 339, 266, 161, 39, 299]; the size of the
Rashomon set differs from all of these quantities in fundamental ways, and it is
important in its own right for showing the existence of simpler models.
A useful way to represent the hypothesis for a problem is by projecting it
to variable importance space, where each axis represents the importance of a
variable. That way, a single function is represented by a point in this space (i.e.,
a vector of coordinates), where each coordinate represents how important a
variable is to that model. (Here, variable importance for variable v is measured
by model reliance or conditional model reliance, which represents the change
in loss we would incur if we scrambled variable v.) The Rashomon set can be
represented as a subset of this variable importance space. Fisher et al. [97] used
this representation to create a model independent notion of variable importance:
specifically, they suggest to consider the maximum and minimum of variable
importance across all models in the Rashomon set. This is called the Model
Class Reliance (MCR). For any user who picks a “good” model (i.e., a model in the Rashomon set), the importance of the variable is within the
range given by the MCR. Dong and Rudin [86] expand on this by suggesting
to visualize the “cloud” of variable importance values for models within the
Rashomon set. This cloud helps us understand the Rashomon set well enough
to choose a model we might find interpretable.
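A crude empirical version of this idea (not the exact MCR estimator of Fisher et al. [97]) is to collect a set of near-optimal models, compute a permutation-style importance of each variable under each model, and report the min-max range across the set. Everything below (the data, the candidate models, and the value of ε) is an illustrative assumption.

```python
# Crude sketch of a variable-importance range across near-optimal models
# (not the exact Model Class Reliance estimator); data, models, and epsilon
# are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small proxy for a Rashomon set: one model class fit with several hyperparameter
# settings, keeping only those within epsilon of the best test loss.
candidates = [RandomForestClassifier(max_depth=d, n_estimators=n, random_state=0).fit(X_tr, y_tr)
              for d in (3, 5, None) for n in (50, 200)]
losses = np.array([1 - m.score(X_te, y_te) for m in candidates])
epsilon = 0.01
good = [m for m, l in zip(candidates, losses) if l <= losses.min() + epsilon]

# For each variable, record the min and max permutation importance across the good models.
imps = np.array([permutation_importance(m, X_te, y_te, n_repeats=5,
                                         random_state=0).importances_mean for m in good])
for j in np.argsort(-imps.max(axis=0))[:5]:
    print(f"feature {j}: importance range [{imps[:, j].min():.3f}, {imps[:, j].max():.3f}]")
```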
Current research has barely scratched the surface of characterizing and using model multiplicity and Rashomon sets in machine learning. Exploring the
Rashomon set is critical if we hope to find interpretable models within the set
of accurate models. We ask the following questions about the Rashomon set:

9.1 How can we characterize the Rashomon set? As discussed above, the volume of the Rashomon set is a useful indirect indicator of the existence of interpretable models.

Fig 16. (a) An illustration of a possible Rashomon set in two-dimensional hypothesis space.
Models from two local minima that are below the red plane belong to the Rashomon set. (b)
An illustration of a possible visualization of the Rashomon set. Models inside the shaded red
regions belong to the Rashomon set. The plot is created as a contour plot of the loss over the
hypothesis space. Green dots represent a few simpler models inside the Rashomon set. These
models are simpler because they are sparse: they depend on one less dimension than other
models in the Rashomon set.

There have been several works that suggest computing
statistics of the Rashomon set in parameter space, such as the volume in pa-
rameter space, which is called the Rashomon volume. The Rashomon volume
or other statistics that characterize the Rashomon set can be computed more
easily if the loss in parameter space has a single global minimum [127, 83, 144, 52]; these variations include ε-flatness [127, 83] and ε-sharpness [144, 83]. The Rashomon ratio provides some perspective on the size of the
Rashomon set, defined as the ratio of the Rashomon volume to the volume
of the whole hypothesis space [269], assuming that both are bounded.
There are several problems that arise when computing the size of the
Rashomon set in parameter space. First, the volume of a set of functions
can be hard to compute or estimate, but there are ways to do it, includ-
ing sampling (or analytical computation in some simple cases when there
is one minimum). Second, the choice of parameterization of the hypothe-
sis space matters when working in parameter space. Overparametrization
or underparametrization can cause volumes in parameter space to be ar-
tificially made larger or smaller without a change in the actual space of
functions [83]. For instance, if we make a copy of a variable and include it
in the datasets, it can change the Rashomon ratio. A third but separate
problem with the Rashomon ratio is its denominator: the denominator is
the volume of all models we might reasonably consider before seeing the
data. If we change the set of models we might consider, we change the de-
nominator, and the ratio changes. And, because different model spaces are
different, there might not be a way to directly compare the Rashomon ratios computed on different hypothesis spaces to each other.

Fig 17. A one-dimensional example of computation of the Rashomon ratio. The hypothesis
space consists of decision stumps f(x) = 1[x ≤ a], a ∈ [0, 1]. (The function is 1 if x ≤ a and 0 otherwise.) The loss function is the zero-one loss: loss(f, x, y) = 1[f(x) ≠ y], which is 1 if the model makes a mistake and zero otherwise. A model f belongs to the Rashomon set if (1/n) Σi loss(f, xi, yi) ≤ 0.2, that is, the model makes at most 2 mistakes. The Rashomon ratio in (a) is
equal to 0.4 and is computed as the ratio of volumes of decision stumps in the Rashomon set
(blue shaded region) to the volume of all possible decision stumps in the area where data reside
(blue and orange shaded region). The pattern Rashomon ratio in (b) is equal to 3/11 and is
computed as the ratio of classifiers that belong to the Rashomon set (blue stumps, of which
there are 3 unique stumps with respect to the data) to the number of all possible classifiers
(orange and blue stumps, of which there are 11). This figure shows that there are multiple
ways to compute the size of the Rashomon set and it is not clear which one to choose.

This brings us to another way to measure the Rashomon set: The pat-
tern Rashomon ratio [269, 201], which could potentially be used to handle
some of these problems. The pattern Rashomon ratio considers unique pre-
dictions on the data (called “patterns”) rather than the count of functions
themselves. In other words, the pattern Rashomon ratio considers the count
of predictions that could be made rather than a count of functions. Figure
17 illustrates the difference between the Rashomon ratio and the pattern
Rashomon ratio. The pattern Rashomon ratio has a fixed denominator (all
possible patterns one could produce on the data using functions from the
class). A direct comparison of the pattern Rashomon ratio over function
classes is meaningful. However, it is still difficult to compute. It is also more
sensitive to small shifts and changes in data.
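For the one-dimensional decision-stump class of Figure 17, both quantities can be computed directly. The sketch below uses made-up data and takes the whole interval [0, 1] as the denominator of the Rashomon ratio, so it illustrates the computations rather than reproducing the numbers in the figure.

```python
# Sketch of the Rashomon ratio and pattern Rashomon ratio for one-dimensional
# decision stumps f(x) = 1[x <= a], a in [0, 1], with zero-one loss; the data are made up.
import numpy as np

x = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
y = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])     # mostly "1 on the left", with some noise
theta = 0.2                                      # allowed average zero-one loss

def avg_loss(a):
    return np.mean((x <= a).astype(int) != y)

# Rashomon ratio: fraction of the threshold interval [0, 1] whose stump is in the set.
grid = np.linspace(0, 1, 100001)
in_set = np.array([avg_loss(a) <= theta for a in grid])
rashomon_ratio = in_set.mean()

# Pattern Rashomon ratio: fraction of distinct prediction patterns on the data
# (there are n + 1 of them for stumps) achievable by models in the Rashomon set.
cut_points = np.r_[x.min() - 1, np.sort(x)]      # one representative threshold per pattern
patterns_in_set = sum(avg_loss(a) <= theta for a in cut_points)
pattern_ratio = patterns_in_set / len(cut_points)

print(rashomon_ratio, pattern_ratio)
```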
Although multiple measures, including ε-flatness, ε-sharpness, the Rashomon ratio, and the pattern Rashomon ratio, have been proposed to measure the size of the Rashomon set, none of them provides a universal and simple-to-calculate characterization of that size. Many challenges remain open:
(a) What is a good space in which to measure the size of the
Rashomon set? What is a good measure of the size of the
Rashomon set? As discussed above, the parameter space suffers from
issues with parameterization, but the pattern Rashomon ratio has its
own problems. Would variable importance space be suitable, where
each axis represents the importance of a raw variable, as in the work
of Dong and Rudin [86]? Or is there an easier space and metric to work
with?
(b) Are there approximation algorithms or other techniques that
will allow us to efficiently compute or approximate the size
of the Rashomon set? At worst, computation over the Rashomon
set requires a brute force calculation over a large discrete hypothesis
space. In most cases, the computation should be much easier. Perhaps
we could use dynamic programming or branch and bound techniques
to reduce the search space so that it encompasses the Rashomon set
but not too much more than that? In some cases, we could actually
compute the size of the Rashomon set analytically. For instance, for
linear regression, a closed-form solution in parameter space for the vol-
ume of the Rashomon set has been derived based on the singular values
of the data matrix [269].
9.2 What techniques can be used to visualize the Rashomon set? Visu-
alization of the Rashomon set can potentially help us to understand its prop-
erties, issues with the data, biases, or underspecification of the problem. To
give an example of how visualization can be helpful in troubleshooting mod-
els generally, Li et al. [181] visualized the loss landscape of neural networks
and, as a result, found answers to questions about the selection of the batch
size, optimizer, and network architecture. Kissel and Mentch [151] proposed
a tree-based graphical visualization to display outputs of the model path
selection procedure that finds models from the Rashomon set based on a for-
ward selection of variables. The visualization helps us to understand the sta-
bility of the model selection process, as well as the richness of the Rashomon
set, since wider graphical trees imply that there are more models available
for the selection procedure. To characterize the Rashomon set in variable
importance space, variable importance diagrams have been proposed [86],
which are 2-dimensional projections of the variable importance cloud. The
variable importance cloud is created by mapping every variable to its impor-
tance for every good predictive model (every model in the Rashomon set).
Figure 16(b) uses a similar technique of projection in two-dimensional space
and depicts the visualization of the example of the Rashomon set of Figure
16(a). This simple visualization allows us to see the Rashomon set’s layout,
estimate its size or locate sparser models within it. Can more sophisticated
approaches be developed that would allow good visualization? The success
of these techniques most likely will have to depend on whether we can design
a good metric for the model class. After the metric is designed, perhaps we
might be able to utilize techniques from Challenge 7 for the visualization.
9.3 What model to choose from the Rashomon set? When the Rashomon
set is large, it can contain multiple accurate models with different properties.
Choosing between them might be difficult, particularly if we do not know
how to explore the Rashomon set. Interactive methods might rely on dimen-
sion reduction techniques (that allow users to change the location of data
on the plot or to change the axis of visualization), weighted linear models
(that allow the user to compute weights on specific data points), continuous
feedback from the user (that helps to continuously improve models predic-
tion in changing environments, for example, in recommender systems) in
order to interpret or choose a specific model with desired property. Das et
al. [75] design a system called BEAMES that allows users to interactively
select important features, change weights on data points, visualize and se-
lect a specific model or even an ensemble of models. BEAMES searches the
hypothesis space for models that are close to the practitioners’ constraints
and design choices. The main limitation of BEAMES is that it works with
linear regression classifiers only. Can a similar framework that searches the
Rashomon set, instead of the whole hypothesis space, be designed? What
would the interactive specification of constraints look like in practice to help
the user choose the right model? Can collaboration with domain experts in
other ways be useful to explore the Rashomon set?
Example: Suppose that a financial institution would like to make data-driven
loan decisions. The institution must have a model that is as accurate as possible,
and must provide reasons for loan denials. In practice, loan decision prediction
problems have large Rashomon sets, and many machine learning methods per-
form similarly despite their different levels of complexity. To check whether the
Rashomon set is indeed large, the financial institution wishes to measure the
size of the Rashomon set (Challenge 9.1) or visualize the layout of it (Challenge
9.2) to understand how many accurate models exist, and how many of them
are interpretable. If the Rashomon set contains multiple interpretable models,
the financial institution might wish to use an interactive framework (Challenge
9.3) to navigate the Rashomon set and design constraints that would help to
locate the best model for their purposes. For example, the institution could
additionally optimize for fairness or sparsity.

10. Interpretable reinforcement learning

Reinforcement learning (RL) methods determine what actions to take at differ-
ent states in order to maximize a cumulative numerical reward [e.g., see 282]. In
RL, a learning agent interacts with the environment by a trial and error process,
receiving signals (rewards) for the actions it takes. Recently, deep reinforcement
learning algorithms have achieved state-of-the-art performance in mastering the
game of Go and various Atari games [273, 210] and have been actively used in
robotic control [156, 283], simulated autonomous navigation and self-driving
cars [150, 265], dialogue summarization, and question answering [182]. RL has
also been applied to high-stakes decisions, such as healthcare and personalized
medicine, including approaches for dynamic treatment regimes [190, 326], treat-
ment of sepsis [162], treatment of chronic disorders including epilepsy [117], for
HIV therapy [230], and management of Type 1 diabetes [136].
In reinforcement learning, at each timestep, an agent observes the state (sit-
uation) that it is currently in, chooses an action according to a policy (a learned
mapping from states to actions), and receives an intermediate reward that de-
pends on the chosen action and the current state, which leads the agent to the
next state, where the process repeats. For example, let us consider an RL system
that is used for the management of glucose levels of Type 1 diabetes patients
by regulating the amount of synthetic insulin in the blood. The state is repre-
sented by the patient’s information, including diet, exercise level, weight, and so on;
actions include the amount of insulin to inject; rewards are positive if the blood
sugar level is within a healthy range and negative otherwise. Figure 18 shows
the schematic diagram of a reinforcement learning system based on the diabetes
example.
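As a toy sketch of how the pieces of this loop line up in code, the state and reward for the diabetes example could be encoded as follows; the fields, ranges, and reward values are invented for illustration and are not clinical guidance.

```python
# Toy sketch of the state/reward interface described above; all fields, ranges,
# and reward values are invented for illustration, not clinical guidance.
from dataclasses import dataclass

@dataclass
class PatientState:
    glucose_mg_dl: float
    carbs_eaten_g: float
    exercise_minutes: float
    weight_kg: float

def reward(state: PatientState, low: float = 70.0, high: float = 180.0) -> float:
    # Positive reward while blood sugar stays in the assumed healthy range, negative otherwise.
    return 1.0 if low <= state.glucose_mg_dl <= high else -1.0

# One step of the loop: the policy maps a state to an insulin dose (the action), the
# environment transitions to the next state, and the agent receives reward(next_state).
```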
Reinforcement learning is more challenging than a standard supervised set-
ting: the agent does not receive all the data at once (in fact, the agent is re-
sponsible for collecting data by exploring the environment); sometimes besides
immediate rewards, there are delayed rewards when feedback is provided only
after some sequence of actions, or even after the goal is achieved (for example,
the positive reward on a self-driving car agent is provided after it reaches a desti-
nation). Because reinforcement learning involves long-term consequences to ac-
tions, because actions and states depend on each other, and because agents need
to collect data through exploration, interpretability generally becomes more dif-
ficult than in standard supervised learning. Search spaces for RL can be mas-
sive since we need to understand which action to take in each state based on
an estimate of future rewards, uncertainty of these estimates, and exploration
strategy. In deep reinforcement learning [210, 273], policies are defined by deep
neural networks, which helps to solve complex applied problems, but typically
the policies are essentially impossible to understand or trust.
Interpretability could be particularly valuable in RL. Interpretability might
help to reduce the RL search space; if we understand the choices and intent of
the RL system, we can remove actions that might lead to harm. Interpretabil-
ity could make it easier to troubleshoot (and use) RL for long-term medical
treatment or to train an agent for human-robot collaboration.
One natural way to include interpretability in a reinforcement learning sys-
tem is by representing the policy or value function by a decision tree or rule list
[248, 272]. In a policy tree representation, the nodes might contain a small set
of logical conditions, and leaves might contain an estimate of the optimal prob-
ability distribution over actions (see Figure 19). A policy tree can be grown, for
example, in a greedy incremental way by updating the tree only when the esti-
mated discounted future reward would increase sufficiently [248]; in the future,
one could adapt methods like those from Challenge 1 to build such trees.
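To give a flavor of what such a policy looks like in code, here is a hand-written tree policy in the spirit of the Cart-Pole example of Figure 19; the split thresholds are invented for illustration and are not the learned values from Silva et al. [272].

```python
# A hand-written decision-tree policy in the spirit of Figure 19 (Cart-Pole);
# the split thresholds are invented for illustration, not learned values.
LEFT, RIGHT = 0, 1

def tree_policy(cart_position, cart_velocity, pole_angle, pole_velocity_at_tip):
    # Each internal node tests one feature; each leaf returns an action.
    if pole_angle > 0.0:                      # pole leaning right
        return RIGHT if pole_velocity_at_tip > -0.5 else LEFT
    else:                                     # pole leaning left (or upright)
        return LEFT if pole_velocity_at_tip < 0.5 else RIGHT

# Such a policy can be read directly: every action traces back to a short chain of
# feature comparisons, unlike a policy parameterized by a deep neural network.
```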
Other methods make specific assumptions about the domain or problem that
allows for interpretable policies. One of the most popular simplifications is to
consider a symbolic or relational state representation. In this case, the state
is a combination of logical or relational statements. For example, in the classic
Blocks World domain, where the objective is to learn how to stack blocks on top
of each other to achieve a target configuration, these statements comprising the state might look like on(block1, block2); on(block1, table); free(block2).

Fig 18. An illustration of an RL system based on the example of Type 1 diabetes manage-
ment. This is theoretically a closed-loop system, where the agent observes the state from the
environment, makes an action, and receives a reward based on the action. For diabetes con-
trol, the state consists of measurements of glucose and other patient data, the actions are
injections of insulin, and the rewards are based on sugar levels.

Typically,
in relational reinforcement learning, policies are represented as relational regres-
sion trees [90, 91, 143, 96, 76] that are more interpretable than neural networks.
A second set of assumptions is based on a natural decomposition of the prob-
lem for multi-task or skills learning. For example, Shu et al. [271] use hierarchical
policies that are learned over multiple tasks, where each task is decomposed into
multiple sub-tasks (skills). The description of the skills is created by a human,
so that the agent learns these understandable skills (for instance, task stack blue
block can be decomposed into the skills find blue block, get blue block, and put
blue block).
As far as we know, there are currently no general interpretable well-performing
methods for deep reinforcement learning that allow transparency in the agent’s
actions or intent. Progress has been made instead on explainable deep rein-
forcement learning (posthoc approximations), including tree-based explanations
[66, 189, 81, 22, 73], reward decomposition [138], and attention-based methods
[330, 213]. An interesting approach is that of Verma et al. [300], who define
rule-based policies through a domain-specific, human-readable programming
language that generalizes to unseen environments, but shows slightly worse per-
formance than neural network policies it was designed to explain. The approach
works for deterministic policies and symbolic domains only and will not work
for the domains where the state is represented as a raw data image, unless an
additional logical relation extractor is provided.
Based on experience from supervised learning (discussed above), we know
that post-hoc explanations suffer from multiple problems, in that explanations
are often incorrect or incomplete. Atrey et al. [18] have argued that for deep
reinforcement learning, saliency maps should be used for exploratory, and not
explanatory, purposes. Therefore, developing interpretable reinforcement learn-
ing policies and other interpretable RL models is important. The most crucial
challenges for interpretable reinforcement learning include:

Fig 19. (a) An example of a learned decision tree policy for the Cart-Pole balancing problem
adapted from Silva et al. [272]. In the Cart-Pole environment, a pole is attached to a cart
by a joint. The cart moves along the horizontal axis. The goal is to keep the pole balanced
upright, preventing it from falling. The state consists of four features: Cart Position, Cart
Velocity, Pole Angle and Pole Velocity At Tip. Here, angle 0◦ means straight upwards. There
are two possible actions at each state, including push cart to the left or push cart to the right.
(b) is a visualization of the policy from Silva et al. [272] based on two features: Pole Angle
and Pole Velocity At Tip. Interestingly, it is easy to see from this figure that the policy is not
left-right symmetric. Note that there are many reasonable policies for this problem, including
many that are asymmetric.

10.1 What constraints would lead to an accurate interpretable policy?


In relational methods, policies or value functions could be represented by
decision or regression trees, however, since these methods work for sym-
bolic domains only, they have limited practical use without a method of
extracting interpretable relations from the domain. Can interpretable deep
neural network methods similar to those in Challenge 5 provide an inter-
pretable policy without the performance compromise? Are there other ways
of adding constraints or structure to the policy that will ensure human-
understandable actions and intent? Will adding other sparsity or inter-
pretability constraints, as in Challenges 1 and 2, to the policy or value
function help improve interpretability without sacrificing accuracy? Can
we design interpretable models whose explanations can capture longer term
consequences of actions?
10.2 Under what assumptions does there exist an interpretable policy that is as accurate as the more complicated black-box model?
Reinforcement learning systems are tricky to train, understand and debug.
It is a challenge in itself to decide what constraints will lead to an inter-
pretable policy without loss of performance. For example, for PIRL [300],
there was a drop in performance of the resulting rule-based policy. One
possible reason for this drop is that PIRL relies on posthoc approximations
rather than inherently interpretable models; another is that a well-performing
sparse rule-based policy might not exist for the task. So, the question is: how would we know if it is useful to
optimize for an interpretable policy? A simple test that tells the modeler if
such a model exists (similar to the characterizations of the Rashomon sets
in Challenge 9) would be helpful before trying to develop an algorithm to
find this well-performing model (if it exists).
10.3 Can we simplify or interpret the state space? Deep reinforcement
learning often operates on complex raw data or on huge, if not continuous,
state spaces. The agent might be required to do a lot of exploration to
learn useful policies in this case. Simplifying the state space might not only
make the RL system learn more efficiently, but might also help humans to
understand the decisions it is making. For example, by applying t-SNE to
the last layer activation data of Mnih et al.’s deep Q-learning model [210],
Zahavy [329] showed that the network automatically discovers hierarchical
structures. Previously, these structures had not been used to simplify the
state space. Guan et al. [116] improved sample efficiency by incorporat-
ing domain knowledge through human feedback and visual explanations.
Zhang et al. [332] improved over batch RL by computing the policy only at
specific decision points where clinicians treat patients differently, instead
of at every timestep. This resulted in a significantly smaller state space
and faster planning.
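
As a concrete illustration of this kind of state-space inspection, the sketch below, in
the spirit of Zahavy [329], embeds the last-layer activations of a trained Q-network in
two dimensions with t-SNE and colors points by their value estimates. The arrays
named activations and values are placeholders (random here); in practice they would be
collected by running the trained agent and recording its hidden activations and value
estimates over many visited states.

    # Minimal sketch: project a Q-network's internal representation of
    # visited states into 2D with t-SNE. "activations" and "values" are
    # placeholders for data collected from a trained agent.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    activations = rng.normal(size=(2000, 512))   # placeholder activations
    values = rng.normal(size=2000)               # placeholder value estimates

    embedding = TSNE(n_components=2, perplexity=30, init="pca",
                     random_state=0).fit_transform(activations)

    plt.scatter(embedding[:, 0], embedding[:, 1], c=values, s=3)
    plt.colorbar(label="estimated value")
    plt.title("t-SNE of Q-network activations (sketch)")
    plt.show()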
Example: Consider a machine learning practitioner who is working on plan-
ning for assistive robots that help the elderly perform day-to-day duties. In
cases where the robot causes damage during task execution or chooses an
unusual plan to execute, it might be hard to understand the logic behind the
robot's choices. If the practitioner can check that an interpretable model exists
(Challenge 10.2), optimize for it (Challenge 10.1), and interpret the state space
(Challenge 10.3), debugging and correcting the robot's failed behavior can be
much faster than without these techniques. Further, interpretability will help to
strengthen trust between the robot and the human it is assisting, and help the
robot generalize to new, unforeseen cases.

11. Problems that were not in our top 10 but are really important

We covered a lot of ground in the 10 challenges above, but we certainly did not
cover all of the important topics related to interpretable ML. Here are a few
that we left out:
• Can we improve preprocessing to help with both interpretabil-
ity and test accuracy? As we discussed, some interpretable modeling
problems are computationally difficult (or could tend to overfit) without
preprocessing. For supervised problems with tabular data, popular meth-
ods like Principal Component Analysis (PCA) would generally transform
the data in a way that damages interpretability of the model. This is
because each transformed feature is a combination of all of the original
features. Perhaps there are other general preprocessing tools that would
preserve predictive power (or improve it), yet retain interpretability.
• Can we convey uncertainty clearly? Uncertainty quantification is al-
ways important. Tomsett et al. [288] discuss the importance of uncertainty
quantification and interpretability, and Antoran et al. [15] discuss inter-
pretable uncertainty estimation.
• Can we divide the observations into easier cases and harder
cases, assigning more interpretable models to the easier cases?
Not all observations are equally easy to classify. As noted in the work
of Wang et al. [309, 312], it is possible that easier cases could be han-
dled by simpler models, leaving only the harder cases for models with less
interpretability.
• Do we need interpretable neural networks for tabular data? There
have been numerous attempts to design interpretable neural networks for
tabular data, imposing sparsity, monotonicity, etc. However, it is unclear
whether there is motivation for such formulations, since neural networks
do not generally provide much benefit for tabular data over other kinds
of models. One nice benefit of the “optimal” methods discussed above is
that when we train them we know how far we are from a globally optimal
solution; the same is not true for neural networks, rendering them not just
less interpretable but also harder to implement reliably.
• What are good ways to co-design models with their visualiza-
tions? Work in data and model visualization will be important in convey-
ing information to humans. Ideally, we would like to co-design our model
in a way that lends itself to easier visualization. One particularly impres-
sive interactive visualization was done by Gomez et al. [111] for a model
on loan decisions. Another example of an interactive model visualization
tool is that of Chen et al. [54]. These works demonstrate that as long as
the model can be visualized effectively, it can be non-sparse but still be
interpretable. In fact, these works on loan decisions constitute an example
of when a sparse model might not be interpretable, since people might
find a loan decision model unacceptable if it does not include many types
of information about the applicant’s credit history. Visualizations can allow
the user to home in on the important pieces of the model for each
prediction.
• Can we do interpretable matching in observational causal infer-
ence? In the classical potential outcomes framework of causal inference,
matching algorithms are often used to match treatment units to con-
trol units. Treatment effects can be estimated using the matched groups.
Matching in causal inference can be considered a form of case-based rea-
soning (Challenge 4). Ideally, units within each matched group would be
similar on important variables. One such framework for matching is the
Almost Matching Exactly framework [also called “learning to match,” see
310], which learns the important variables using a training set, and prior-
itizes matching on those important variables.
Matching opens the door for interpretable machine learning in causal infer-
ence because it produces treatment and control data whose distributions
overlap. Interpretable machine learning algorithms for supervised classi-
fication discussed in Challenges 1, 2, and 3 can be directly applied to
matched data; for instance, a sparse decision tree can be used on matched
data to obtain an interpretable model for individual treatment effects, or
an interpretable policy (i.e., treatment regimes).
• How should we measure variable importance? There are several
types of variable importance measures: those that measure how important
a variable is to a specific prediction, those that measure how important a
variable is generally to a model, and those that measure how important
a variable is independently of a model. Inherently interpretable models
should generally not need the first one, as it should be clear how much a
variable contributed to the prediction; for instance, for decision trees and
other logical models, scoring systems, GAMs, case-based reasoning models,
and disentangled neural networks, the reasoning process that comes with
each prediction tells us explicitly how a variable (or concept) contributed
to a given prediction. There are many explainable ML tutorials that describe
how to estimate the importance of variables for black box models. For the
other two types of variable importance, we refer readers to the work of
Fisher et al. [97], which has relevant references; a minimal permutation-based
sketch of model-level variable importance appears after this list. If we can
decide on an ideal variable importance measure, we may be able to create
interpretable models with specific preferences on variable importance.
• What are the practical challenges in deploying interpretable ML
models? There are many papers that provide useful background about
the use of interpretability and explainability in practice. For instance,
an interesting perspective is provided by Bhatt et al. [31] who conduct
an explainability survey showing that explanations are often used only
internally for troubleshooting rather than being shown to users. Several
studies, e.g. [141, 237], have performed extensive human experiments on
interpretable models and posthoc explanations of black box models, with
some interesting and sometimes nonintuitive results.
• What type of explanation is required by law? The question of what
explanations are legally required is important and interesting, but we will
leave those questions to legal scholars (e.g., [32, 303]). Interestingly, schol-
ars have explicitly stated that a “Right to Explanation” for automated
decision-making does not actually exist in the European Union’s General
Data Protection Regulation, despite the intention [303]. Rudin [250] pro-
poses that for a high-stakes decision that deeply affects lives, no black box
model should be used unless the deployer can prove that no interpretable
model exists with a similar accuracy. If enacted, this would likely mean
that black box models rarely (if ever) would be deployed for high-stakes
decisions.
• What are other forms of interpretable models? It is not possible
to truly review all interpretable ML models. One could argue that most
of applied Bayesian statistics fits into our definition of interpretable ML
because the models are constrained to be formed through an interpretable
generative process. This is indeed a huge field, and numerous topics that
we were unable to cover also would have deserved a spot in this survey.
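
As promised in the variable-importance bullet above, the following is a minimal sketch
of a model-level, permutation-style importance measure in the spirit of model reliance
[97]: each feature is shuffled in turn and the resulting drop in held-out performance is
recorded. The random forest and the breast-cancer dataset bundled with scikit-learn are
used purely for illustration.

    # Minimal sketch of permutation-based (model-level) variable importance.
    # The model and dataset are chosen only for illustration.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in turn and measure the drop in held-out accuracy.
    result = permutation_importance(model, X_test, y_test,
                                    n_repeats=20, random_state=0)
    order = np.argsort(result.importances_mean)[::-1]
    for i in order[:5]:
        print(f"{X.columns[i]:30s} {result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")

Fisher et al. [97] additionally study how such importances vary across an entire class
of well-performing models (the Rashomon set of Challenge 9), rather than for a single
model.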
12. Conclusion

In this survey, we hoped to provide a pathway for readers into important topics
in interpretable machine learning. The literature currently being generated on
interpretable and explainable AI can be downright confusing. The sheer diversity
of individuals weighing in on this field includes not just statisticians and com-
puter scientists but legal experts, philosophers, and graduate students, many of
whom have not either built or deployed a machine learning model ever. It is
easy to underestimate how difficult it is to convince someone to use a machine
learning model in practice, and interpretability is a key factor. Many works over
the last few years have contributed new terminology, mistakenly subsumed the
older field of interpretable machine learning into the new field of “XAI,” and
review papers have universally failed even to truly distinguish between the ba-
sic concepts of explaining a black box and designing an interpretable model.
Because of the misleading terminology, where papers titled “explainability” are
sometimes about “interpretability” and vice versa, it is very difficult to follow
the literature (even for us). At the very least, we hoped to introduce some fun-
damental principles, and cover several important areas of the field and show
how they relate to each other and to real problems. Clearly this is a massive
field that we cannot truly hope to cover, but we hope that the diverse areas we
covered and problems we posed might be useful to those needing an entrance
point into this maze.
Interpretable models are not just important for society; they are also beautiful.
One might also find it absolutely magical that simple-yet-accurate models
exist for so many real-world datasets. We hope this document allows you to see
not only the importance of this topic but also the elegance of its mathematics
and the beauty of its models.

Acknowledgments

We thank Leonardo Lucio Custode for pointing out several useful references to
Challenge 10. Thank you to David Page for providing useful references on early
explainable ML. Thank you to the anonymous reviewers who made extremely
helpful comments.

References

[1] Aamodt, A. and Plaza, E. (1994). Case-Based Reasoning: Foundational
Issues, Methodological Variations, and System Approaches. AI Communi-
cations 7 39–59.
[2] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M.
and Kim, B. (2018). Sanity Checks for Saliency Maps. In Proceedings of
Conference on Neural Information Processing Systems (NeurIPS).
[3] Adel, T., Ghahramani, Z. and Weller, A. (2018). Discovering in-
terpretable representations for both deep generative and discriminative
models. In Proceedings of International Conference on Machine Learning
(ICML) 50–59.
[4] Afnan, M. A. M., Rudin, C., Conitzer, V., Savulescu, J.,
Mishra, A., Liu, Y. and Afnan, M. (2021). Ethical Implementation of
Artificial Intelligence to Select Embryos in In Vitro Fertilization. In Pro-
ceedings of the AAAI/ACM Conference on Artificial Intelligence, Ethics,
and Society (AIES).
[5] Agarwal, R., Frosst, N., Zhang, X., Caruana, R. and Hin-
ton, G. E. (2020). Neural additive models: Interpretable machine learning
with neural nets. In Proceedings of the ICML Workshop on Human Inter-
pretability in Machine Learning.
[6] Aghaei, S., Gomez, A. and Vayanos, P. (2020). Learning Opti-
mal Classification Trees: Strong Max-Flow Formulations. arXiv e-print
arXiv:2002.09142.
[7] Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U.,
Franklin, M. and Widom, J. (2012). Challenges and Opportunities with
Big Data: A white paper prepared for the Computing Community Consor-
tium committee of the Computing Research Association Technical Report,
CRA.
[8] Aïvodji, U., Ferry, J., Gambs, S., Huguet, M.-J. and Siala, M.
(2019). Learning Fair Rule Lists. arXiv e-print arXiv:1909.03977.
[9] Amid, E. and Warmuth, M. K. (2019). TriMap: Large-scale Dimension-
ality Reduction Using Triplets. arXiv e-print arXiv:1910.00204.
[10] Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M. and
Rudin, C. (2017). Learning certifiably optimal rule lists for categorical
data. In Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD).
[11] Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M. and
Rudin, C. (2018). Certifiably optimal rule lists for categorical data. Jour-
nal of Machine Learning Research 18 1–78. MR3845533
[12] Angelov, P. and Soares, E. (2020). Towards Explainable Deep Neural
Networks (xDNN). Neural Networks 130 185–194.
[13] Angwin, J., Larson, J., Mattu, S. and Kirchner, L. (2016). Machine
Bias. Available from: https://www.propublica.org/article/machine-
bias-risk-assessments-in-criminal-sentencing.
[14] Antman, E. M., Cohen, M., Bernink, P. J., McCabe, C. H., Ho-
racek, T., Papuchis, G., Mautner, B., Corbalan, R., Radley, D.
and Braunwald, E. (2000). The TIMI risk score for unstable angina/non–
ST elevation MI: a method for prognostication and therapeutic decision
making. Journal of the American Medical Association 284 835–842.
[15] Antorán, J., Bhatt, U., Adel, T., Weller, A. and Hernández-
Lobato, J. M. (2021). Getting a CLUE: A Method for Explaining Uncer-
tainty Estimates. In Proceedings of the International Conference on Learn-
ing Representations (ICLR).
[16] Archambeau, C., Cornford, D., Opper, M. and Shawe-Taylor, J.
(2007). Gaussian process approximations of stochastic differential equa-
tions. In Gaussian Processes in Practice 1–16.
[17] Ashoori, M. and Weisz, J. D. (2019). In AI We Trust? Factors That
Influence Trustworthiness of AI-infused Decision-Making Processes. arXiv
e-print arXiv:1912.02675.
[18] Atrey, A., Clary, K. and Jensen, D. (2020). Exploratory not explana-
tory: Counterfactual analysis of saliency maps for deep reinforcement learn-
ing. In Proceedings of the International Conference on Learning Represen-
tations (ICLR).
[19] Badgeley, M. A., Zech, J. R., Oakden-Rayner, L., Glicks-
berg, B. S., Liu, M., Gale, W., McConnell, M. V., Percha, B.,
Snyder, T. M. and Dudley, J. T. (2019). Deep learning predicts hip
fracture using confounding patient and healthcare variables. npj Digital
Medicine 2.
[20] Barnes, J. and Hut, P. (1986). A hierarchical O(N log N ) force-
calculation algorithm. Nature 324 446–449.
[21] Barnett, A. J., Schwartz, F. R., Tao, C., Chen, C., Ren, Y.,
Lo, J. Y. and Rudin, C. (2021). IAIA-BL: A Case-based Interpretable
Deep Learning Model for Classification of Mass Lesions in Digital Mam-
mography. Nature Machine Intelligence 3 1061–1070.
[22] Bastani, O., Pu, Y. and Solar-Lezama, A. (2018). Verifiable reinforce-
ment learning via policy extraction. In Proceedings of Conference on Neural
Information Processing Systems (NeurIPS) 2499–2509.
[23] Baumgart, B. G. (1974). Geometric modeling for computer vision Tech-
nical Report, Stanford University Department of Computer Science.
[24] Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and spectral
techniques for embedding and clustering. In Proceedings of Conference on
Neural Information Processing Systems (NeurIPS) 14 585–591. MIT Press.
[25] Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spi-
dlen, J. and Snyder-Cappione, J. E. (2019). Automated optimized
parameters for T-distributed stochastic neighbor embedding improve vi-
sualization and analysis of large datasets. Nature Communications 10.
[26] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers
Inc.
[27] Bengio, Y., Courville, A. and Vincent, P. (2013). Representation
learning: A review and new perspectives. IEEE Transactions on Pattern
Analysis and Machine Intelligence 35 1798–1828.
[28] Bennett, K. P. and Blue, J. A. (1996). Optimal Decision Trees Techni-
cal Report, R.P.I. Math Report No. 214, Rensselaer Polytechnic Institute.
[29] Bertsimas, D. and Dunn, J. (2017). Optimal classification trees. Machine
Learning 106 1039–1082. MR3665788
[30] Bezdek, J. C., Chuah, S. K. and Leep, D. (1986). Generalized k-nearest
neighbor rules. Fuzzy Sets and Systems 18 237–256. MR0834703
[31] Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y.,
Ghosh, J., Puri, R., Moura, J. M. F. and Eckersley, P. (2020).
Explainable Machine Learning in Deployment. In Proceedings of the 2020
Conference on Fairness, Accountability, and Transparency 648–657. Asso-
ciation for Computing Machinery.
[32] Bibal, A., Lognoul, M., de Streel, A. and Frénay, B. (2020). Legal
Requirements on Explainability in Machine Learning. Artificial Intelligence
and Law July 30.
[33] Bien, J. and Tibshirani, R. (2011). Prototype Selection for Interpretable
Classification. The Annals of Applied Statistics 2403–2424. MR2907120
[34] Billiet, L., Van Huffel, S. and Van Belle, V. (2017). Interval Coded
Scoring extensions for larger problems. In Proceedings of the IEEE Sympo-
sium on Computers and Communications 198–203. IEEE.
[35] Billiet, L., Van Huffel, S. and Van Belle, V. (2018). Interval Coded
Scoring: A toolbox for interpretable scoring systems. PeerJ Computer Sci-
ence 4 e150.
[36] Binder, H. and Tutz, G. (2008). A comparison of methods for the fit-
ting of generalized additive models. Statistics and Computing 18 87–99.
MR2416441
[37] Böhm, J. N., Berens, P. and Kobak, D. (2020). A Unifying Perspective
on Neighbor Embeddings along the Attraction-Repulsion Spectrum. arXiv
e-print arXiv:2007.08902.
[38] Bone, R. C., Balk, R. A., Cerra, F. B., Dellinger, R. P.,
Fein, A. M., Knaus, W. A., Schein, R. M. and Sibbald, W. J. (1992).
Definitions for sepsis and organ failure and guidelines for the use of inno-
vative therapies in sepsis. Chest 101 1644–1655.
[39] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization.
Journal of Machine Learning Research 2 499–526. MR1929416
[40] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984).
Classification and Regression Trees. Wadsworth. MR0726392
[41] Breiman, L. et al. (2001). Statistical modeling: The two cultures (with
comments and a rejoinder by the author). Statistical Science 16 199–231.
MR1874152
[42] Brundage, M., Avin, S., Wang, J., Belfield, H., Krueger, G., Hadfield, G. et al. (2020). Toward Trustworthy AI Development: Mechanisms
for Supporting Verifiable Claims. arXiv e-print arXiv:2004.07213.
[43] Brunton, S. L., Proctor, J. L. and Kutz, J. N. (2016). Discover-
ing governing equations from data by sparse identification of nonlinear
dynamical systems. Proceedings of the national academy of sciences 113
3932–3937. MR3494081
[44] Burgess, E. W. (1928). Factors determining success or failure on parole.
The workings of the indeterminate sentence law and the parole system in
Illinois 221–234.
[45] Cao, H. E., Sarlin, R. and Jung, A. (2020). Learning Explainable De-
cision Rules via Maximum Satisfiability. IEEE Access 8 218180–218185.
[46] Card, D., Zhang, M. and Smith, N. A. (2019). Deep Weighted Averag-
ing Classifiers. In Proceedings of the Conference on Fairness, Accountabil-
ity, and Transparency 369–378.
[47] Carrizosa, E., Molero-Río, C. and Romero Morales, D. (2021).
Mathematical optimization in classification and regression trees. TOP, An
Official Journal of the Spanish Society of Statistics and Operations Research
29 5–33. MR4236627
[48] Carrizosa, E., Nogales-Gómez, A. and Romero Morales, D.
(2016). Strongly agree or strongly disagree?: Rating features in Support
Vector Machines. Information Sciences 329 256–273. Special issue on Dis-
covery Science.
[49] Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M. and El-
hadad, N. (2015). Intelligible models for healthcare: Predicting pneumonia
risk and hospital 30-day readmission. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD)
1721–1730.
[50] Chan, D. M., Rao, R., Huang, F. and Canny, J. F. (2019). GPU ac-
celerated t-distributed stochastic neighbor embedding. Journal of Parallel
and Distributed Computing 131 1–13.
[51] Chapman, P. et al. (2000). CRISP-DM 1.0 – Step-by-step data mining
guide Technical Report, SPSS.
[52] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Bal-
dassi, C., Borgs, C., Chayes, J., Sagun, L. and Zecchina, R. (2019).
Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Sta-
tistical Mechanics: Theory and Experiment 2019 124018. MR4063616
[53] Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C. and Su, J. K.
(2019). This Looks Like That: Deep Learning for Interpretable Image
Recognition. In Proceedings of Conference on Neural Information Process-
ing Systems (NeurIPS) 8930–8941.
[54] Chen, C., Lin, K., Rudin, C., Shaposhnik, Y., Wang, S. and
Wang, T. (2021). An interpretable model with globally consistent expla-
nations for credit risk. Decision Support Systems.
[55] Chen, C. and Rudin, C. (2018). An optimization approach to learning
falling rule lists. In Proceedings of Artificial Intelligence and Statistics (AIS-
TATS) 604–612.
[56] Chen, R. T. Q., Li, X., Grosse, R. B. and Duvenaud, D. K.
(2018). Isolating Sources of Disentanglement in Variational Autoencoders.
In Proceedings of Conference on Neural Information Processing Systems
(NeurIPS) (S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi and R. Garnett, eds.) 31.
[57] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I.
and Abbeel, P. (2016). Infogan: Interpretable representation learning by
information maximizing generative adversarial nets. In Proceedings of Con-
ference on Neural Information Processing Systems (NeurIPS) 2172–2180.
[58] Chen, Z., Bei, Y. and Rudin, C. (2020). Concept whitening for inter-
pretable image recognition. Nature Machine Intelligence 2 772-782.
[59] Chen, Z., Ogren, A., Lin, A., Daraio, C., Brinson, C. and Rudin, C.
(2021). How to See Hidden Patterns in Metamaterials with Interpretable
Machine Learning. (In progress).
[60] Chevaleyre, Y., Koriche, F. and Zucker, J.-D. (2013). Rounding
methods for discrete linear classification. In Proceedings of International
Conference on Machine Learning (ICML) 651–659.
[61] Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W.,
Verbakel, J. Y. and Van Calster, B. (2019). A systematic review
shows no performance benefit of machine learning over logistic regression
for clinical prediction models. Journal of Clinical Epidemiology 110 12–22.
MR4156846
[62] Clark, P. and Boswell, R. (1991). Rule induction with CN2: Some
recent improvements. In European Working Session on Learning 151–163.
Springer. MR1101394
[63] Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Ma-
chine Learning 3 261–283.
[64] Coenen, A. and Pearce, A. (2019). Understanding UMAP. https://pair-code.github.io/understanding-umap/.
[65] Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the
Twelfth International Conference on Machine Learning (ICML) 115–123.
[66] Coppens, Y., Efthymiadis, K., Lenaerts, T., Nowé, A., Miller, T.,
Weber, R. and Magazzeni, D. (2019). Distilling deep reinforcement
learning policies in soft decision trees. In Proceedings of the IJCAI Work-
shop on Explainable Artificial Intelligence 1–6.
[67] Coston, A., Rambachan, A. and Chouldechova, A. (2021). Charac-
terizing Fairness Over the Set of Good Models Under Selective Labels. In
Proceedings of International Conference on Machine Learning (ICML).
[68] Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification.
IEEE Transactions on Information Theory 13 21-27.
[69] Craven, M. W. (1996). Extracting comprehensible models from trained
neural networks, PhD thesis, University of Wisconsin-Madison Department
of Computer Sciences.
[70] Craven, M. W. and Shavlik, J. W. (1995). Extracting Tree-Structured
Representations of Trained Networks. In Proceedings of Conference on Neu-
ral Information Processing Systems (NeurIPS) 24–30.
[71] Crawford, E. and Pineau, J. (2019). Spatially invariant unsupervised
object detection with convolutional neural networks. In Proceedings of
AAAI Conference on Artificial Intelligence (AAAI) 33 3412–3420.
[72] Ertekin, Ş. and Rudin, C. (2011). On Equivalence Relationships Be-
tween Classification and Ranking Algorithms. Journal of Machine Learning
Research 12 2905–2929. MR2854352
[73] Dahlin, N., Kalagarla, K. C., Naik, N., Jain, R. and Nuzzo, P.
(2020). Designing interpretable approximations to deep reinforcement
learning with soft decision trees. arXiv e-print arXiv:2010.14785.
[74] D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B.,
Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D.
et al. (2020). Underspecification presents challenges for credibility in mod-
ern machine learning. arXiv e-print arXiv:2011.03395.
[75] Das, S., Cashman, D., Chang, R. and Endert, A. (2019). BEAMES:
Interactive Multimodel Steering, Selection, and Inspection for Regression
Tasks. IEEE Computer Graphics and Applications 39 20–32.
[76] Das, S., Natarajan, S., Roy, K., Parr, R. and Kersting, K.
(2020). Fitted Q-Learning for Relational Domains. arXiv e-print
arXiv:2006.05595.
[77] Dash, S., Günlük, O. and Wei, D. (2018). Boolean Decision Rules via
Column Generation. In Proceedings of Conference on Neural Information
Processing Systems (NeurIPS) 31 4655–4665.
[78] Demirović, E., Lukina, A., Hebrard, E., Chan, J., Bailey, J.,
Leckie, C., Ramamohanarao, K. and Stuckey, P. J. (2020). MurTree:
Optimal Classification Trees via Dynamic Programming and Search. arXiv
e-print arXiv:2007.12652.
[79] Desjardins, G., Courville, A. and Bengio, Y. (2012). Disen-
tangling factors of variation via generative entangling. arXiv e-print
arXiv:1210.5474.
[80] Dhamnani, S., Singal, D., Sinha, R., Mohandoss, T. and Dash, M.
(2019). RAPID: Rapid and Precise Interpretable Decision Sets. In 2019
IEEE International Conference on Big Data (Big Data) 1292–1301. IEEE.
[81] Dhebar, Y., Deb, K., Nageshrao, S., Zhu, L. and Filev, D. (2020).
Interpretable-AI policies using evolutionary nonlinear decision trees for dis-
crete action systems. arXiv e-print arXiv:2009.09521.
[82] Dimanov, B., Bhatt, U., Jamnik, M. and Weller, A. (2020). You
Shouldn’t Trust Me: Learning Models Which Conceal Unfairness From
Multiple Explanation Methods. In 24th European Conference on Artificial
Intelligence (ECAI).
[83] Dinh, L., Pascanu, R., Bengio, S. and Bengio, Y. (2017). Sharp min-
ima can generalize for deep nets. In Proceedings of the International Con-
ference on Machine Learning (ICML) 1019–1028.
[84] Do, K. and Tran, T. (2019). Theory and evaluation metrics for learning
disentangled representations. In Proceedings of the International Confer-
ence on Learning Representations (ICLR).
[85] Dobkin, D., Fulton, T., Gunopulos, D., Kasif, S. and Salzberg, S.
(1997). Induction of shallow decision trees. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[86] Dong, J. and Rudin, C. (2020). Exploring the cloud of variable impor-
tance for the set of all good models. Nature Machine Intelligence 2 810–824.
[87] Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: Locally lin-
ear embedding techniques for high-dimensional data. Proceedings of the
National Academy of Arts and Sciences 100 5591-5596. MR1981019
[88] Dua, D. and Graff, C. (2017). UCI Machine Learning Repository.
[89] Dudani, S. A. (1976). The Distance-Weighted k-Nearest-Neighbor Rule.
IEEE Transactions on Systems, Man, and Cybernetics SMC-6 325-327.
[90] Džeroski, S., De Raedt, L. and Blockeel, H. (1998). Relational re-
inforcement learning. In International Conference on Inductive Logic Pro-
gramming 11–22. Springer.
[91] Džeroski, S., De Raedt, L. and Driessens, K. (2001). Relational re-
inforcement learning. Machine Learning 43 7–52.
[92] Eastwood, C. and Williams, C. K. (2018). A framework for the quan-
titative evaluation of disentangled representations. In Proceedings of the
International Conference on Learning Representations (ICLR).
[93] Elomaa, T. (1994). In Defense of C4.5: Notes on Learning One-Level Deci-
sion Trees. In Proceedings of International Conference on Machine Learning
(ICML) 62–69. Morgan Kaufmann.
[94] Farhangfar, A., Greiner, R. and Zinkevich, M. (2008). A Fast Way to
Produce Optimal Fixed-Depth Decision Trees. In International Symposium
on Artificial Intelligence and Mathematics (ISAIM).
[95] Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data
mining to knowledge discovery in databases. AI Magazine 17 37–54.
[96] Finney, S., Gardiol, N. H., Kaelbling, L. P. and Oates, T. (2002).
The thing that we tried didn’t work very well: deictic representation in re-
inforcement learning. In Proceedings of Conference on Uncertainty in Ar-
tificial Intelligence (UAI) 154–161.
[97] Fisher, A., Rudin, C. and Dominici, F. (2019). All Models are Wrong,
but Many are Useful: Learning a Variable’s Importance by Studying an En-
tire Class of Prediction Models Simultaneously. Journal of Machine Learn-
ing Research 20 1–81. MR4048988
[98] Fix, E. and Hodges, J. L. (1951). Discriminatory analysis, nonparamet-
ric discrimination: Consistency properties Technical Report No. 4, USAF
School of Aviation Medicine, Randolph Field, Texas.
[99] Frank, E. and Witten, I. H. (1998). Generating accurate rule sets with-
out global optimization. In Proceedings of International Conference on Ma-
chine Learning (ICML) 144–151.
[100] Freitas, A. A. (2014). Comprehensible classification models: a position
paper. ACM SIGKDD Explorations Newsletter 15 1–10.
[101] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic gener-
alization of on-line learning and an application to boosting. Journal of
Computer and System Sciences 55 119–139. MR1473055
[102] Friedman, J. H. (2001). Greedy function approximation: a gradient
boosting machine. Annals of Statistics 1189–1232. MR1873328
[103] Friedman, J. H. (2002). Stochastic gradient boosting. Computational
Statistics & Data Analysis 38 367–378. MR1884869
[104] Friedman, J. H. and Fisher, N. I. (1999). Bump hunting in high-
dimensional data. Statistics and Computing 9 123–143.
[105] Fu, C., Zhang, Y., Cai, D. and Ren, X. (2019). AtSNE: Efficient and
Robust Visualization on GPU through Hierarchical Optimization. In Pro-
ceedings of the ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining (KDD) 176–186.
[106] Fukunaga, K. and Hostetler, L. (1973). Optimization of k nearest
neighbor density estimates. IEEE Transactions on Information Theory 19
320–326. MR0405701
[107] Gage, B. F., Waterman, A. D., Shannon, W., Boechler, M.,
Rich, M. W. and Radford, M. J. (2001). Validation of clinical classifi-
cation schemes for predicting stroke: results from the National Registry of
Atrial Fibrillation. Journal of the American Medical Association (JAMA)
285 2864–2870.
[108] Gaines, B. R. and Compton, P. (1995). Induction of ripple-down rules
applied to modeling large databases. Journal of Intelligent Information
Systems 5 211–228.
[109] Ghosh, B. and Meel, K. S. (2019). IMLI: An incremental framework for
MaxSAT-based learning of interpretable classification rules. In Proceedings
of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) 203–210.
[110] Goh, S. T. and Rudin, C. (2014). Box Drawings for Learning with Imbal-
anced Data. In Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD) 333–342.
[111] Gomez, O., Holter, S., Yuan, J. and Bertini, E. (2020). ViCE: Vi-
sual Counterfactual Explanations for Machine Learning Models. In Pro-
ceedings of the 25th International Conference on Intelligent User Interfaces
(IUI’20).
[112] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-
Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Genera-
tive Adversarial Nets. In Proceedings of Conference on Neural Information
Processing Systems (NeurIPS) 27 2672–2680.
[113] Gosiewska, A. and Biecek, P. (2020). Do Not Trust Additive Expla-
nations. arXiv e-print arXiv:1903.11420.
[114] Gray, J. B. and Fan, G. (2008). Classification tree analysis us-
ing TARGET. Computational Statistics & Data Analysis 52 1362–1372.
MR2422741
[115] Gross, C. C. (2002). Genealogy of the “Grandmother Cell”. The Neu-
roscientist 8 512-518.
[116] Guan, L., Verma, M., Guo, S., Zhang, R. and Kambhampati, S.
(2020). Widening the Pipeline in Human-Guided Reinforcement Learning
with Explanation and Context-Aware Data Augmentation.
[117] Guez, A., Vincent, R. D., Avoli, M. and Pineau, J. (2008). Adaptive
Treatment of Epilepsy via Batch-mode Reinforcement Learning. In Proceed-
ings of AAAI Conference on Artificial Intelligence (AAAI) 1671–1678.
[118] Günlük, O., Kalagnanam, J., Li, M., Menickelly, M. and Schein-
berg, K. (2021). Optimal decision trees for categorical data via integer
programming. Journal of Global Optimization 1–28. MR4299184
[119] Hamamoto, R., Suvarna, K., Yamada, M., Kobayashi, K.,
Shinkai, N., Miyake, M., Takahashi, M., Jinnai, S., Shimoyama, R.,
Sakai, A., Takasawa, K., Bolatkan, A., Shozu, K., Dozen, A.,
Machino, H., Takahashi, S., Asada, K., Komatsu, M., Sese, J. and
Kaneko, S. (2020). Application of Artificial Intelligence Technology in
Oncology: Towards the Establishment of Precision Medicine. Cancers 12.
[120] Hammond, K. (1989). Proceedings of the Second Case-Based Reasoning
Workshop (DARPA). Morgan Kaufmann Publishers, Inc., San Mateo, CA.
[121] Haris, A., Simon, N. and Shojaie, A. (2019). Generalized sparse addi-
tive models. arXiv e-print arXiv:1903.04641.
[122] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models
43. CRC press. MR1082147
[123] Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L.,
Rezende, D. and Lerchner, A. (2018). Towards a definition of disen-
tangled representations. arXiv e-print arXiv:1812.02230.
[124] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X.,
Botvinick, M., Mohamed, S. and Lerchner, A. (2017). beta-VAE:
Learning basic visual concepts with a constrained variational framework.
In Proceedings of the International Conference on Learning Representations
(ICLR).
[125] Hinton, G. (2021). How to represent part-whole hierarchies in a neural
network. arXiv e-print arXiv:2102.12627.
[126] Hinton, G. E., Krizhevsky, A. and Wang, S. D. (2011). Transforming
Auto-Encoders. In Proceedings of International Conference on Artificial
Neural Networks 44–51. Springer.
[127] Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural
Computation 9 1–42.
[128] Holte, R. C. (1993). Very simple classification rules perform well on
most commonly used datasets. Machine Learning 11 63-91.
[129] Hong, D., Baek, S. S. and Wang, T. (2020). Interpretable Sequence
Classification Via Prototype Trajectory. arXiv e-print arXiv:2007.01777.
[130] Hu, H., Siala, M., Hébrard, E. and Huguet, M.-J. (2020). Learning
Optimal Decision Trees with MaxSAT and its Integration in AdaBoost.
In IJCAI-PRICAI 2020, 29th International Joint Conference on Artificial
Intelligence and the 17th Pacific Rim International Conference on Artificial
Intelligence.
[131] Hu, X., Rudin, C. and Seltzer, M. (2019). Optimal Sparse Decision
Trees. In Proceedings of Conference on Neural Information Processing Sys-
tems (NeurIPS).
[132] Ignatiev, A., Pereira, F., Narodytska, N. and Marques-Silva, J.
(2018). A SAT-based approach to learn explainable decision sets. In Inter-
national Joint Conference on Automated Reasoning 627–645. MR3836041
[133] The Smithsonian Institute (2020). Mammuthus primigenius
(Blumbach). https://3d.si.edu/object/3d/mammuthus-primigenius-
blumbach:341c96cd-f967-4540-8ed1-d3fc56d31f12.
[134] Jacomy, M., Venturini, T., Heymann, S. and Bastian, M. (2014).
ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network
Visualization Designed for the Gephi Software. PLoS One 9 1-12.
[135] Jaderberg, M., Simonyan, K., Zisserman, A. and
Kavukcuoglu, K. (2015). Spatial transformer networks. In Proceedings
of Conference on Neural Information Processing Systems (NeurIPS)
2017–2025.
[136] Javad, M. O. M., Agboola, S. O., Jethwani, K., Zeid, A. and Ka-
marthi, S. (2019). A reinforcement learning–based method for manage-
ment of type 1 diabetes: exploratory study. JMIR Diabetes 4 e12905.
[137] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L.,
Lawrence Zitnick, C. and Girshick, R. (2017). CLEVR: A Diagnos-
tic Dataset for Compositional Language and Elementary Visual Reasoning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
[138] Juozapaitis, Z., Koul, A., Fern, A., Erwig, M. and Doshi-
Velez, F. (2019). Explainable reinforcement learning via reward decom-
position. In Proceedings of the IJCAI/ECAI Workshop on Explainable Ar-
tificial Intelligence.
[139] Kabra, R., Burgess, C., Matthey, L., Kaufman, R. L., Gr-
eff, K., Reynolds, M. and Lerchner, A. (2019). Multi-Object
Datasets. https://github.com/deepmind/multi-object-datasets/.
[140] Kakade, S. M., Sridharan, K. and Tewari, A. (2009). On the com-
plexity of linear prediction: Risk bounds, margin bounds, and regulariza-
tion. In Proceedings of Conference on Neural Information Processing Sys-
tems (NeurIPS) 793–800.
[141] Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H. and
Wortman Vaughan, J. (2020). Interpreting Interpretability: Under-
standing Data Scientists’ Use of Interpretability Tools for Machine Learn-
ing. In Proceedings of the 2020 CHI Conference on Human Factors in Com-
puting Systems 1–14.
[142] Keller, J. M., Gray, M. R. and Givens, J. A. (1985). A fuzzy k-
nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cy-
bernetics 4 580–585.
[143] Kersting, K. and Driessens, K. (2008). Non-parametric policy gradi-
ents: A unified treatment of propositional and relational domains. In Pro-
ceedings of the 25th International Conference on Machine Learning 456–
463.
[144] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. and
Tang, P. T. P. (2017). On large-batch training for deep learning: Gener-
alization gap and sharp minima. In Proceedings of the International Con-
ference on Learning Representations (ICLR).
[145] Kim, B., Khanna, R. and Koyejo, O. O. (2016). Examples are not
enough, learn to criticize! Criticism for Interpretability. In Proceedings of
Conference on Neural Information Processing Systems (NeurIPS) (D. Lee,
M. Sugiyama, U. Luxburg, I. Guyon and R. Garnett, eds.) 29 2280–
2288.
[146] Kim, B., Rudin, C. and Shah, J. A. (2014). The Bayesian Case Model:
A Generative Approach for Case-Based Reasoning and Prototype Classi-
fication. In Proceedings of Conference on Neural Information Processing
Systems (NeurIPS) 27 1952–1960.
[147] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Vie-
gas, F. et al. (2018). Interpretability beyond feature attribution: Quan-
titative testing with concept activation vectors (tcav). In Proceedings of
International Conference on Machine Learning (ICML) 2668–2677.
[148] Kim, H. and Mnih, A. (2018). Disentangling by factorising. In Proceed-
ings of International Conference on Machine Learning (ICML) 2649–2658.
[149] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational
bayes. In Proceedings of the International Conference on Learning Repre-
sentations (ICLR).
[150] Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Al Sal-
lab, A. A., Yogamani, S. and Pérez, P. (2021). Deep reinforcement
learning for autonomous driving: A survey. IEEE Transactions on Intelli-
gent Transportation Systems 1-18.
[151] Kissel, N. and Mentch, L. (2021). Forward Stability and Model Path
Selection. arXiv e-print arXiv:2103.03462.
[152] Knaus, W. A., Draper, E. A., Wagner, D. P. and Zimmerman, J. E.
(1985). APACHE II: a severity of disease classification system. Critical Care
Medicine 13 818–829.
[153] Knaus, W. A., Wagner, D. P., Draper, E. A., Zimmerman, J. E.,
Bergner, M., Bastos, P. G., Sirio, C. A., Murphy, D. J.,
Lotring, T., Damiano, A. et al. (1991). The APACHE III prognostic
system: risk prediction of hospital mortality for critically ill hospitalized
adults. Chest 100 1619–1636.
[154] Knaus, W. A., Zimmerman, J. E., Wagner, D. P., Draper, E. A.
and Lawrence, D. E. (1981). APACHE-acute physiology and chronic
health evaluation: a physiologically based classification system. Critical
Care Medicine 9 591–597.
[155] Kobak, D. and Berens, P. (2019). The art of using t-SNE for single-cell
transcriptomics. Nature Communication 10 5416.
[156] Kober, J., Bagnell, J. A. and Peters, J. (2013). Reinforcement learn-
ing in robotics: A survey. The International Journal of Robotics Research
32 1238–1274.
[157] Kodratoff, Y. (1994). The comprehensibility manifesto. KDD Nugget
Newsletter 94.
[158] Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E.,
Kim, B. and Liang, P. (2020). Concept bottleneck models. In Proceedings
of International Conference on Machine Learning (ICML) 5338–5348.
[159] Kohonen, T. (1995). Learning Vector Quantization. The Handbook of
Brain Theory and Neural Networks 537–540.
[160] Kolodner, J. L. (1988). Proceedings of the Case-Based Reasoning Work-
shop (DARPA). Morgan Kaufmann Publishers, Inc., San Mateo, CA.
[161] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distri-
butions and bounding the generalization error of combined classifiers. The
Annals of Statistics 30 1–50. MR1892654
[162] Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. and
Faisal, A. A. (2018). The artificial intelligence clinician learns optimal
treatment strategies for sepsis in intensive care. Nature Medicine 24 1716–
1720.
[163] Kosiorek, A., Sabour, S., Teh, Y. W. and Hinton, G. E. (2019).
Stacked capsule autoencoders. In Proceedings of Conference on Neural In-
formation Processing Systems (NeurIPS) 15512–15522.
[164] Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2017). ImageNet
classification with deep convolutional neural networks. Communications of
the ACM 60.
[165] Kulkarni, T. D., Whitney, W., Kohli, P. and Tenenbaum, J. B.
(2015). Deep convolutional inverse graphics network. In Proceedings of Con-
ference on Neural Information Processing Systems (NeurIPS).
[166] Kursuncu, U., Gaur, M. and Sheth, A. (2020). Knowledge Infused
Learning (K-IL): Towards Deep Incorporation of Knowledge in Deep Learn-
ing. In Proceedings of the AAAI 2020 Spring Symposium on Combining
Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE
2020) I.
[167] Kutz, J. N., Brunton, S. L., Brunton, B. W. and Proctor, J. L.
(2016). Dynamic mode decomposition: data-driven modeling of complex sys-
tems. SIAM. MR3602007
[168] Lagaris, I. E., Likas, A. and Fotiadis, D. I. (1998). Artificial neu-
ral networks for solving ordinary and partial differential equations. IEEE
Transactions on Neural Networks 9 987–1000.
[169] Lagaris, I. E., Likas, A. C. and Papageorgiou, D. G. (2000). Neural-
network methods for boundary value problems with irregular boundaries.
IEEE Transactions on Neural Networks 11 1041-1049.
[170] Lakkaraju, H., Bach, S. H. and Leskovec, J. (2016). Interpretable
decision sets: A joint framework for description and prediction. In Proceed-
ings of the ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining (KDD) 1675–1684.
[171] Lakkaraju, H. and Bastani, O. (2020). “How do I fool you?”: Manip-
ulating User Trust via Misleading Black Box Explanations. In Proceedings
of the AAAI/ACM Conference on AI, Ethics, and Society (AIES) 79–85.
[172] Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G.,
Samek, W. and Müller, K.-R. (2019). Unmasking Clever Hans pre-
dictors and assessing what machines really learn. Nature Communications
10.
[173] Laugel, T., Lesot, M.-J., Marsala, C., Renard, X. and De-
tyniecki, M. (2019). The Dangers of Post-hoc Interpretability: Unjustified
Counterfactual Explanations. In Proceedings of the Twenty-Eighth Interna-
tional Joint Conference on Artificial Intelligence, (IJCAI) 2801–2807.
[174] Laurent, H. and Rivest, R. L. (1976). Constructing optimal binary
decision trees is NP-complete. Information Processing Letters 5 15–17.
MR0413598
[175] Le Gall, J.-R., Lemeshow, S. and Saulnier, F. (1993). A new simpli-
fied acute physiology score (SAPS II) based on a European/North American
multicenter study. Journal of the American Medical Association (JAMA)
270 2957–2963.
[176] LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature
521 436–444.
[177] LeCun, Y., Cortes, C. and Burges, C. (2010). MNIST handwritten
digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
[178] Lee, H. and Kang, I. S. (1990). Neural algorithm for solving differential
equations. Journal of Computational Physics 91 110–131. MR1078748
[179] Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S. and Teh, Y. W.
(2019). Set transformer: A framework for attention-based permutation-
invariant neural networks. In Proceedings of International Conference on
Machine Learning (ICML) 3744–3753.
[180] Letham, B., Rudin, C., McCormick, T. H., Madigan, D. et al.
(2015). Interpretable classifiers using rules and bayesian analysis: Build-
ing a better stroke prediction model. The Annals of Applied Statistics 9
1350–1371. MR3418726
[181] Li, H., Xu, Z., Taylor, G., Studer, C. and Goldstein, T. (2018).
Visualizing the loss landscape of neural nets. In Proceedings of Conference
on Neural Information Processing Systems (NeurIPS).
[182] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J. and Juraf-
sky, D. (2016). Deep reinforcement learning for dialogue generation. In
Proceedings of Conference on Empirical Methods in Natural Language Pro-
cessing (EMNLP).
[183] Li, O., Liu, H., Chen, C. and Rudin, C. (2018). Deep Learning for Case-
Based Reasoning through Prototypes: A Neural Network that Explains Its
Predictions. In Proceedings of AAAI Conference on Artificial Intelligence
(AAAI).
[184] Li, W., Han, J. and Pei, J. (2001). CMAR: Accurate and efficient classi-
fication based on multiple class-association rules. In Proceedings 2001 IEEE
International Conference on Data Mining (ICDM) 369–376. IEEE.
[185] Lin, J., Zhong, C., Hu, D., Rudin, C. and Seltzer, M. (2020). Gen-
eralized and scalable optimal sparse decision trees. In Proceedings of Inter-
national Conference on Machine Learning (ICML) 6150–6160.
[186] Lin, Y., Zhang, H. H. et al. (2006). Component selection and smoothing
in multivariate nonparametric regression. The Annals of Statistics 34 2272–
2297. MR2291500
[187] Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S.
and Kluger, Y. (2019). Fast interpolation-based t-SNE for improved vi-
sualization of single-cell RNA-seq data. Nature Methods 16 243-245.
[188] Liu, B., Hsu, W., Ma, Y. et al. (1998). Integrating classification and
association rule mining. In Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD) 98 80–86.
[189] Liu, G., Schulte, O., Zhu, W. and Li, Q. (2018). Toward interpretable
deep reinforcement learning with linear model u-trees. In Joint European
Conference on Machine Learning and Knowledge Discovery in Databases
414–429. Springer.
[190] Liu, Y., Logan, B., Liu, N., Xu, Z., Tang, J. and Wang, Y. (2017).
Deep reinforcement learning for dynamic treatment regimes on medical
registry data. In 2017 IEEE International Conference on Healthcare Infor-
matics (ICHI) 380–385. IEEE.
[191] Locatello, F., Weissenborn, D., Unterthiner, T., Mahen-
dran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A. and Kipf, T.
(2020). Object-Centric Learning with Slot Attention. In Proceedings of
Conference on Neural Information Processing Systems (NeurIPS).
[192] Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classifi-
cation trees. Statistica Sinica 7 815–840. MR1488644
[193] Losch, M., Fritz, M. and Schiele, B. (2019). Interpretability be-
yond classification output: Semantic bottleneck networks. arXiv e-print
arXiv:1907.10882.
[194] Lou, Y., Bien, J., Caruana, R. and Gehrke, J. (2016). Sparse par-
tially linear additive models. Journal of Computational and Graphical
Statistics 25 1126–1140. MR3572032
[195] Lou, Y., Caruana, R. and Gehrke, J. (2012). Intelligible models for
classification and regression. In Proceedings of the ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining (KDD) 150–
158.
[196] Lou, Y., Caruana, R., Gehrke, J. and Hooker, G. (2013). Accurate
intelligible models with pairwise interactions. In Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Min-
ing (KDD) 623–631.
[197] Malioutov, D. and Meel, K. S. (2018). MLIC: A MaxSAT-based
framework for learning interpretable classification rules. In International
Conference on Principles and Practice of Constraint Programming 312–
327. Springer.
[198] Malioutov, D. and Varshney, K. (2013). Exact rule learning via
boolean compressed sensing. In Proceedings of International Conference
on Machine Learning (ICML) 765–773.
[199] Marchand, M. and Shawe-Taylor, J. (2002). The set covering ma-
chine. Journal of Machine Learning Research 3 723–746. MR1983944
[200] Marchand, M. and Sokolova, M. (2005). Learning with decision lists
of data-dependent features. Journal of Machine Learning Research 6 427–
451. MR2249827
[201] Marx, C., Calmon, F. and Ustun, B. (2020). Predictive multiplicity
in classification. In Proceedings of International Conference on Machine
Learning (ICML) 6765–6774.
[202] Matthey, L., Higgins, I., Hassabis, D. and Lerchner, A. (2017).
dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/.
[203] McGough, M. (2018). How bad is Sacramento’s air, exactly? Google
results appear at odds with reality, some say. Sacramento Bee.
[204] McInnes, L., Healy, J. and Melville, J. (2018). UMAP: Uniform
Manifold Approximation and Projection for Dimension Reduction. arXiv
e-print arXiv:1802.03426.
[205] Mehta, M., Agrawal, R. and Rissanen, J. (1996). SLIQ: A fast scal-
able classifier for data mining. In International Conference on Extending
Database Technology 18–32. Springer.
[206] Meier, L., Van de Geer, S., Bühlmann, P. et al. (2009). High-
dimensional additive modeling. The Annals of Statistics 37 3779–3821.
MR2572443
[207] Meng, X., Li, Z., Zhang, D. and Karniadakis, G. E. (2020).
PPINN: Parareal physics-informed neural network for time-dependent
PDEs. Computer Methods in Applied Mechanics and Engineering 370
113250. MR4121078
[208] Miller, G. (1956). The magical number seven, plus or minus two: Some
limits on our capacity for processing information. The Psychological Review
63 81–97.
[209] Ming, Y., Xu, P., Qu, H. and Ren, L. (2019). Interpretable and
Steerable Sequence Learning via Prototypes. In Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Min-
ing (KDD) 903–913.
[210] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J.,
Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K.,
Ostrovski, G. et al. (2015). Human-level control through deep reinforce-
ment learning. Nature 518 529–533.
[211] Moreno, R. P., Metnitz, P. G., Almeida, E., Jordan, B.,
Bauer, P., Campos, R. A., Iapichino, G., Edbrooke, D., Ca-
puzzo, M., Le Gall, J.-R. et al. (2005). SAPS 3 — From evaluation of
the Patient to Evaluation of the Intensive Care Unit. Part 2: Development
of a prognostic model for hospital mortality at ICU admission. Intensive
Care Medicine 31 1345–1355.
[212] Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analy-
sis of survey data, and a proposal. Journal of the American Statistical Association 58 415–434.
MR3646572
[213] Mott, A., Zoran, D., Chrzanowski, M., Wierstra, D. and
Rezende, D. J. (2019). Towards interpretable reinforcement learning us-
ing attention augmented agents. In Proceedings of Conference on Neural
Information Processing Systems (NeurIPS) 12350–12359.
[214] Narodytska, N., Ignatiev, A., Pereira, F. and Marques-Silva, J.
(2018). Learning Optimal Decision Trees with SAT. In Proceedings of the
Twenty-Eighth International Joint Conference on Artificial Intelligence
(IJCAI) 1362–1368.
[215] Nauta, M., Jutte, A., Provoost, J. and Seifert, C. (2020). This
Looks Like That, Because... Explaining Prototypes for Interpretable Image
Recognition. arXiv e-print arXiv:2011.02863.
[216] Nauta, M., van Bree, R. and Seifert, C. (2021). Neural Proto-
type Trees for Interpretable Fine-grained Image Recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR).
[217] Nelder, J. A. and Wedderburn, R. W. (1972). Generalized linear
models. Journal of the Royal Statistical Society: Series A (General) 135
370–384. MR0375592
[218] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. and
Ng, A. Y. (2011). Reading Digits in Natural Images with Unsupervised
Feature Learning. In Proceedings of the NeurIPS Workshop on Deep Learn-
ing and Unsupervised Feature Learning.
[219] Newell, A., Simon, H. A. et al. (1972). Human Problem Solving.
Prentice-Hall Englewood Cliffs, NJ.
[220] Nijssen, S. and Fromont, E. (2007). Mining optimal decision trees
from itemset lattices. In Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD) 530–539.
ACM.
[221] Nijssen, S. and Fromont, E. (2010). Optimal constraint-based decision
tree induction from itemset lattices. Data Mining and Knowledge Discovery
21 9-51. MR2720512
[222] Nijssen, S., Schaus, P. et al. (2020). Learning Optimal Decision Trees
Using Caching Branch-and-Bound Search. In Proceedings of AAAI Confer-
ence on Artificial Intelligence (AAAI).
[223] Nori, H., Jenkins, S., Koch, P. and Caruana, R. (2019). Inter-
pretML: A unified framework for machine learning interpretability. arXiv
e-print arXiv:1909.09223.
[224] Northpointe (2013). Practitioner’s Guide to COMPAS Core Technical
Report, Northpointe.
[225] O’Connor, M. (2021). Algorithm’s ‘unexpected’ weakness raises larger
concerns about AI’s potential in broader populations. Health Imaging, Ar-
tificial Intelligence section April 05.
[226] Pang, G., Lu, L. and Karniadakis, G. E. (2019). fPINNs: Fractional
physics-informed neural networks. SIAM Journal on Scientific Computing
41 A2603–A2626. MR3995303
[227] Pang, G., Yang, L. and Karniadakis, G. E. (2019). Neural-net-
induced Gaussian process regression for function approximation and PDE
solution. Journal of Computational Physics 384 270–288. MR3920923
[228] Papagelis, A. and Kalles, D. (2000). GA Tree: genetically evolved de-
cision trees. In Proceedings of the 12th IEEE International Conference on Tools
with Artificial Intelligence (ICTAI) 203–206. IEEE.
[229] Papernot, N. and McDaniel, P. (2018). Deep k-Nearest Neighbors:
Towards Confident, Interpretable and Robust Deep Learning. arXiv e-print
arXiv:1803.04765.
[230] Parbhoo, S., Bogojeska, J., Zazzi, M., Roth, V. and Doshi-
Velez, F. (2017). Combining kernel and model based learning for HIV
therapy selection. AMIA Summits on Translational Science Proceedings
2017 239.
[231] Parikh, H., Rudin, C. and Volfovsky, A. (2019). An Application of
Matching After Learning To Stretch (MALTS) to the ACIC 2018 Causal
Inference Challenge Data. Observational Studies 118–130.
[232] Pearson, K. (1901). On Lines and Planes of Closest Fit to Systems of
Points in Space. Philosophical Magazine 2 559–572.
[233] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R.,
Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn:
Machine Learning in Python. Journal of Machine Learning Research 12
2825–2830. MR2854348
[234] Petersen, A., Witten, D. and Simon, N. (2016). Fused lasso additive
model. Journal of Computational and Graphical Statistics 25 1005–1025.
MR3572026
[235] Lo Piano, S. (2020). Ethical principles in machine learning and artificial
intelligence: cases from the field and possible ways forward. Humanities and Social Sciences Communications 7.
[236] Poličar, P. G., Stražar, M. and Zupan, B. (2019). openTSNE: a
modular Python library for t-SNE dimensionality reduction and embed-
ding. bioRxiv.
[237] Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M.,
Vaughan, J. W. and Wallach, H. M. (2021). Manipulating and Measur-
ing Model Interpretability. In Proceedings of Conference on Human Factors
in Computing Systems (CHI).
[238] Psichogios, D. C. and Ungar, L. H. (1992). A hybrid neural network-
first principles approach to process modeling. AIChE Journal 38 1499–
1511.
[239] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan
Kaufmann. MR3729316
[240] Raissi, M. (2018). Deep Hidden Physics Models: Deep Learning of Non-
linear Partial Differential Equations. Journal of Machine Learning Research
19 1–24. MR3862432
[241] Raissi, M., Perdikaris, P. and Karniadakis, G. E. (2018). Multi-
step Neural Networks for Data-driven Discovery of Nonlinear Dynamical
Systems. arXiv e-print arXiv:1801.01236. MR3881695
[242] Raissi, M., Perdikaris, P. and Karniadakis, G. E. (2019). Physics-
informed neural networks: A deep learning framework for solving forward
and inverse problems involving nonlinear partial differential equations.
Journal of Computational Physics 378 686–707. MR3881695
[243] Raissi, M., Yazdani, A. and Karniadakis, G. E. (2020). Hidden fluid
mechanics: Learning velocity and pressure fields from flow visualizations.
Science 367 1026–1030. MR4265157
[244] Rao, C., Sun, H. and Liu, Y. (2021). Physics-Informed Deep Learn-
ing for Computational Elastodynamics without Labeled Data. Journal of
Engineering Mechanics 147 04021043.
[245] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009).
Sparse additive models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 71 1009–1030. MR2750255
[246] Riesbeck, C. K. and Schank, R. S. (1989). Inside Case-Based Reason-
ing. Lawrence Erlbaum Assoc., Inc., Hillsdale, NJ.
[247] Rivest, R. L. (1987). Learning decision lists. Machine Learning 2 229–
246.
[248] Roth, A. M., Topin, N., Jamshidi, P. and Veloso, M. (2019).
Conservative Q-improvement: Reinforcement learning for an interpretable
decision-tree policy. arXiv e-print arXiv:1907.01180.
[249] Roweis, S. T. and Saul, L. K. (2000). Nonlinear Dimensionality Re-
duction by Locally Linear Embedding. Science 290 2323–2326.
[250] Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models
for High Stakes Decisions and Use Interpretable Models Instead. Nature
Machine Intelligence 1 206–215.
[251] Rudin, C. and Ertekin, S. (2018). Learning Customized and Optimized
Lists of Rules with Mathematical Programming. Mathematical Program-
ming C (Computation) 10 659–702. MR3863707
[252] Rudin, C., Letham, B. and Madigan, D. (2013). Learning theory anal-
ysis for association rules and sequential event prediction. Journal of Ma-
chine Learning Research 14 3441–3492. MR3144468
[253] Rudin, C. and Radin, J. (2019). Why Are We Using Black Box Mod-
els in AI When We Don’t Need To? A Lesson From An Explainable AI
Competition. Harvard Data Science Review 1.
[254] Rudin, C. and Schapire, R. E. (2009). Margin-Based Ranking and
an Equivalence between AdaBoost and RankBoost. Journal of Machine
Learning Research 10 2193–2232. MR2563980
[255] Rudin, C. and Shaposhnik, Y. (2019). Globally-Consistent Rule-
Based Summary-Explanations for Machine Learning Models: Application
to Credit-Risk Evaluation. May 28, 2019. Available at SSRN: https://ssrn.com/abstract=3395422.
[256] Rudin, C. and Ustun, B. (2018). Optimized scoring systems: Toward
trust in machine learning for healthcare and criminal justice. Interfaces 48
449–466.
[257] Rudin, C. and Wagstaff, K. L. (2014). Machine Learning for Science
and Society. Machine Learning 95. MR3179975
[258] Rudin, C., Wang, C. and Coker, B. (2020). The Age of Secrecy and
Unfairness in Recidivism Prediction. Harvard Data Science Review 2.
[259] Rudy, S. H., Brunton, S. L., Proctor, J. L. and Kutz, J. N. (2017).
Data-driven discovery of partial differential equations. Science Advances 3
e1602614.
[260] Rymarczyk, D., Struski, Ł., Tabor, J. and Zieliński, B. (2020).
ProtoPShare: Prototype Sharing for Interpretable Image Classification and
Similarity Discovery. arXiv e-print arXiv:2011.14340.
[261] Sabour, S., Frosst, N. and Hinton, G. E. (2017). Dynamic rout-
ing between capsules. In Proceedings of Conference on Neural Information
Processing Systems (NeurIPS) 3856–3866.
[262] Sadhanala, V., Tibshirani, R. J. et al. (2019). Additive models with
trend filtering. The Annals of Statistics 47 3032–3068. MR4025734
[263] Sahli Costabal, F., Yang, Y., Perdikaris, P., Hurtado, D. E. and
Kuhl, E. (2020). Physics-informed neural networks for cardiac activation
mapping. Frontiers in Physics 8 42.
[264] Salakhutdinov, R. and Hinton, G. (2007). Learning a Nonlinear Em-
bedding by Preserving Class Neighbourhood Structure. In Artificial Intel-
ligence and Statistics (AISTATS) 412–419.
[265] Sallab, A. E., Abdou, M., Perot, E. and Yogamani, S. (2017).
Deep reinforcement learning framework for autonomous driving. Electronic
Imaging 2017 70–76.
[266] Schapire, R. E., Freund, Y., Bartlett, P., Lee, W. S. et al. (1998).
Boosting the margin: A new explanation for the effectiveness of voting
methods. The Annals of Statistics 26 1651–1686. MR1673273
[267] Schmidhuber, J. (1992). Learning factorial codes by predictability min-
imization. Neural Computation 4 863–879.
[268] Schramowski, P., Stammer, W., Teso, S., Brugger, A., Her-
bert, F., Shao, X., Luigs, H.-G., Mahlein, A.-K. and Kersting, K.
(2020). Making deep neural networks right for the right scientific reasons
by interacting with their explanations. Nature Machine Intelligence 2 476–
486.
[269] Semenova, L., Rudin, C. and Parr, R. (2019). A study in Rashomon
curves and volumes: A new perspective on generalization and model sim-
plicity in machine learning. arXiv e-print arXiv:1908.01755.
[270] Seo, S., Meng, C. and Liu, Y. (2019). Physics-aware difference graph
networks for sparsely-observed dynamics. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR).
[271] Shu, T., Xiong, C. and Socher, R. (2018). Hierarchical and inter-
pretable skill acquisition in multi-task reinforcement learning. In Proceed-
ings of the International Conference on Learning Representations (ICLR).
[272] Silva, A., Gombolay, M., Killian, T., Jimenez, I. and Son, S.-H.
(2020). Optimization methods for interpretable differentiable decision trees
applied to reinforcement learning. In International Conference on Artificial
Intelligence and Statistics (AISTATS) 1855–1865.
[273] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van
Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneer-
shelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with
deep neural networks and tree search. Nature 529 484–489.
[274] Six, A., Backus, B. and Kelder, J. (2008). Chest pain in the emergency
room: value of the HEART score. Netherlands Heart Journal 16 191–196.
[275] Sokolova, M., Marchand, M., Japkowicz, N. and Shawe-
Taylor, J. S. (2003). The decision list machine. In Proceedings of Confer-
ence on Neural Information Processing Systems (NeurIPS) 945–952.
[276] Sokolovska, N., Chevaleyre, Y., Clément, K. and Zucker, J.-D.
(2017). The fused lasso penalty for learning interpretable medical scor-
ing systems. In 2017 International Joint Conference on Neural Networks
(IJCNN) 4504–4511. IEEE.
[277] Sokolovska, N., Chevaleyre, Y. and Zucker, J.-D. (2018). A Prov-
able Algorithm for Learning Interpretable Scoring Systems. In Proceedings
of Machine Learning Research Vol. 84: Artificial Intelligence and Statistics
(AISTATS) 566–574.
[278] Spiegelhalter, D. (2020). Should We Trust Algorithms? Harvard Data
Science Review 2.
[279] Srebro, N., Sridharan, K. and Tewari, A. (2010). Smoothness, low
noise and fast rates. In Proceedings of Conference on Neural Information
Processing Systems (NeurIPS) 2199–2207.
[280] Sreedharan, S., Chakraborti, T. and Kambhampati, S. (2018). Handling Model Uncertainty and Multiplicity in Explanations via Model Reconciliation. In Proceedings of the International Conference on Automated Planning and Scheduling 28 518–526.
[281] Struck, A. F., Ustun, B., Ruiz, A. R., Lee, J. W., LaRoche, S. M.,
Hirsch, L. J., Gilmore, E. J., Vlachy, J., Haider, H. A., Rudin, C.
and Westover, M. B. (2017). Association of an Electroencephalography-
Based Risk Score With Seizure Probability in Hospitalized Patients. JAMA
Neurology 74 1419–1424.
[282] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An
introduction. MIT Press. MR3889951
[283] Tai, L., Zhang, J., Liu, M., Boedecker, J. and Burgard, W. (2016).
A survey of deep network solutions for learning control in robotics: From
reinforcement to imitation. arXiv e-print arXiv:1612.07139.
[284] Tang, J., Liu, J., Zhang, M. and Mei, Q. (2016). Visualizing large-
scale and high-dimensional data. In Proceedings of the 25th International
Conference on the World Wide Web 287–297. International World Wide
Web Conferences Steering Committee.
[285] Tenenbaum, J. B., de Silva, V. and Langford, J. C. (2000). A
Global Geometric Framework for Nonlinear Dimensionality Reduction. Sci-
ence 290 2319–2323.
[286] Than, M., Flaws, D., Sanders, S., Doust, J., Glasziou, P.,
Kline, J., Aldous, S., Troughton, R., Reid, C., Parsonage, W. A.
et al. (2014). Development and validation of the Emergency Department
Assessment of Chest pain Score and 2h accelerated diagnostic protocol.
Emergency Medicine Australasia 26 34–44.
[287] Thiebes, S., Lins, S. and Sunyaev, A. (2020). Trustworthy artificial
intelligence. Electronic Markets.
[288] Tomsett, R., Preece, A., Braines, D., Cerutti, F.,
Chakraborty, S., Srivastava, M., Pearson, G. and Kaplan, L.
(2020). Rapid Trust Calibration through Interpretable and Uncertainty-
Aware AI. Patterns 1 100049.
[289] Torgerson, W. (1952). Multidimensional scaling: I. Theory and Method.
Psychometrika 17 401–419. MR0054219
[290] Trinh, L., Tsang, M., Rambhatla, S. and Liu, Y. (2021). Inter-
pretable and Trustworthy Deepfake Detection via Dynamic Prototypes.
In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV) 1973–1983.
[291] Ustun, B. and Rudin, C. (2016). Supersparse linear integer models
for optimized medical scoring systems. Machine Learning 102 349–391.
MR3463093
[292] Ustun, B. and Rudin, C. (2017). Optimized risk scores. In Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD) 1125–1134.
[293] Ustun, B. and Rudin, C. (2019). Learning Optimized Risk Scores. Jour-
nal of Machine Learning Research 20 1–75. MR4030164
[294] Ustun, B., Traca, S. and Rudin, C. (2013). Supersparse linear integer models for predictive scoring systems. In Proceedings of the AAAI Workshop Late Breaking Track.
[295] Ustun, B., Westover, M. B., Rudin, C. and Bianchi, M. T. (2016).
Clinical Prediction Models for Sleep Apnea: The Importance of Medical
History over Symptoms. Journal of Clinical Sleep Medicine 12 161–168.
[296] van der Maaten, L. (2014). Accelerating t-SNE using tree-based algo-
rithms. Journal of Machine Learning Research 15 3221–3245. MR3277169
[297] van der Maaten, L. and Hinton, G. (2008). Visualizing data using
t-SNE. Journal of Machine Learning Research 9 2579–2605.
[298] Vanhoof, K. and Depaire, B. (2010). Structure of association rule clas-
sifiers: a review. In 2010 IEEE International Conference on Intelligent Sys-
tems and Knowledge Engineering (ISKE) 9–12. IEEE.
[299] Vapnik, V. and Chervonenkis, A. Y. (1971). On the Uniform Conver-
gence of Relative Frequencies of Events to Their Probabilities. Theory of
Probability and its Applications 16 264. MR0288823
[300] Verma, A., Murali, V., Singh, R., Kohli, P. and Chaudhuri, S.
(2018). Programmatically interpretable reinforcement learning. In Proceed-
ings of International Conference on Machine Learning (ICML) 5045–5054.
[301] Verwer, S. and Zhang, Y. (2019). Learning optimal classification trees
using a binary linear program formulation. In Proceedings of AAAI Con-
ference on Artificial Intelligence (AAAI).
[302] Vilas Boas, M. G., Santos, H. G., Merschmann, L. H. d. C. and
Vanden Berghe, G. (2019). Optimal decision trees for the algorithm
selection problem: integer programming based approaches. International
Transactions in Operational Research.
[303] Wachter, S., Mittelstadt, B. and Floridi, L. (2017). Why a Right
to Explanation of Automated Decision-Making Does Not Exist in the Gen-
eral Data Protection Regulation. International Data Privacy Law.
[304] Wagstaff, K. L. (2012). Machine Learning that Matters. In Proceedings
of International Conference on Machine Learning (ICML) 529–536.
[305] Wah, C., Branson, S., Welinder, P., Perona, P. and Belongie, S.
(2011). The Caltech-UCSD Birds-200-2011 Dataset Technical Report No.
CNS-TR-2011-001, California Institute of Technology.
[306] Wang, F. and Rudin, C. (2015). Falling rule lists. In Proceedings of
Artificial Intelligence and Statistics (AISTATS) 1013–1022.
[307] Wang, J., Oh, J., Wang, H. and Wiens, J. (2018). Learning Credible
Models. In Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD) 2417–2426.
[308] Wang, S., Teng, Y. and Perdikaris, P. (2020). Understanding and
mitigating gradient pathologies in physics-informed neural networks. arXiv
e-print arXiv:2001.04536.
[309] Wang, T. (2019). Gaining Free or Low-Cost Interpretability with Inter-
pretable Partial Substitute. In Proceedings of the 36th International Con-
ference on Machine Learning (ICML) 97 6505–6514.
[310] Wang, T., Morucci, M., Awan, M. U., Liu, Y., Roy, S., Rudin, C. and Volfovsky, A. (2021). FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference. Journal of Machine Learning Research 22 1–41.
[311] Wang, T. and Rudin, C. (2015). Learning optimized Or’s of And’s. arXiv
e-print arXiv:1511.02210.
[312] Wang, T., Yang, J., Li, Y. and Wang, B. (2021). Partially In-
terpretable Estimators (PIE): Black-Box-Refined Interpretable Machine
Learning. arXiv e-print arXiv:2105.02410.
[313] Wang, Y., Huang, H., Rudin, C. and Shaposhnik, Y. (2021). Under-
standing How Dimension Reduction Tools Work: An Empirical Approach
to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualiza-
tion. Journal of Machine Learning Research.
[314] Wattenberg, M., Viégas, F. and Johnson, I. (2016). How to use
t-SNE effectively. Distill 1 e2.
[315] Weathers, F. W., Litz, B. T., Keane, T. M., Palmieri, P. A.,
Marx, B. P. and Schnurr, P. P. (2013). The PTSD checklist for DSM-
5 (PCL-5) – Standard [Measurement instrument]. Scale available from the
National Center for PTSD at www.ptsd.va.gov.
[316] Weinberger, K. Q. and Saul, L. K. (2009). Distance Metric Learn-
ing for Large Margin Nearest Neighbor Classification. Journal of Machine
Learning Research 10.
[317] Wexler, R. (2017). When a Computer Program Keeps You in Jail: How
Computers are Harming Criminal Justice. New York Times.
[318] Wood, S. N. (2017). Generalized additive models: an introduction with
R. CRC Press.
[319] Wu, C. and Tabak, E. G. (2017). Prototypal Analysis and Prototypal
Regression. arXiv e-print arXiv:1701.08916.
[320] Xie, F., Chakraborty, B., Ong, M. E. H., Goldstein, B. A. and
Liu, N. (2020). AutoScore: A Machine Learning–Based Automatic Clinical
Score Generator and Its Application to Mortality Prediction Using Elec-
tronic Health Records. JMIR Medical Informatics 8 e21798.
[321] Yang, H., Rudin, C. and Seltzer, M. (2017). Scalable Bayesian rule
lists. In Proceedings of International Conference on Machine Learning
(ICML) 3921–3930.
[322] Yang, Y., Morillo, I. G. and Hospedales, T. M. (2018). Deep neu-
ral decision trees. In Proceedings of the ICML Workshop on Human Inter-
pretability in Machine Learning.
[323] Yang, Y. and Perdikaris, P. (2019). Conditional deep surrogate models
for stochastic, high-dimensional, and multi-fidelity systems. Computational
Mechanics 64 417–434.
[324] Yin, J. and Yu, Y. (2017). Convex-constrained sparse additive modeling
and its extensions. In Conference on Uncertainty in Artificial Intelligence
(UAI).
[325] Yin, X. and Han, J. (2003). CPAR: Classification based on predictive as-
sociation rules. In Proceedings of the 2003 SIAM International Conference
on Data Mining (SDM) 331–335. SIAM.
[326] Yu, C., Liu, J. and Nemati, S. (2019). Reinforcement learning in health-
care: A survey. arXiv e-print arXiv:1908.08796.
[327] Yu, J., Ignatiev, A., Bodic, P. L. and Stuckey, P. J. (2020). Optimal
Decision Lists using SAT. arXiv e-print arXiv:2010.09919.
[328] Yu, J., Ignatiev, A., Stuckey, P. J. and Le Bodic, P. (2020). Com-
puting Optimal Decision Sets with SAT. In International Conference on
Principles and Practice of Constraint Programming 952–970. Springer.
[329] Zahavy, T., Ben-Zrihem, N. and Mannor, S. (2016). Graying the
black box: Understanding DQNs. In International Conference on Machine
Learning 1899–1908.
[330] Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y.,
Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lock-
hart, E. et al. (2019). Relational deep reinforcement learning. In Proceed-
ings of the International Conference on Learning Representations (ICLR).
[331] Zech, J. R. et al. (2018). Variable generalization performance of a deep
learning model to detect pneumonia in chest radiographs: A cross-sectional
study. PLoS Medicine 15.
[332] Zhang, K., Wang, Y., Du, J., Chu, B., Celi, L. A., Kindle, R. and
Doshi-Velez, F. (2021). Identifying Decision Points for Safe and Inter-
pretable Reinforcement Learning in Hypotension Treatment. In Proceedings
of the NeurIPS Workshop on Machine Learning for Health.
[333] Zhang, Q., Nian Wu, Y. and Zhu, S.-C. (2018). Interpretable convolu-
tional neural networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) 8827–8836.
[334] Zhang, Y., Song, K., Sun, Y., Tan, S. and Udell, M. (2019). “Why
Should You Trust My Explanation?” Understanding Uncertainty in LIME
Explanations. In Proceedings of the ICML AI for Social Good Workshop.
[335] Zhao, H., Jin, R., Wu, S. and Shi, J. (2011). PDE-constrained Gaussian
process model on material removal rate of wire saw slicing process. Journal
of Manufacturing Science and Engineering 133.
[336] Zhou, B., Bau, D., Oliva, A. and Torralba, A. (2018). Interpreting
deep visual representations via network dissection. IEEE Transactions on
Pattern Analysis and Machine Intelligence.
[337] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A.
(2014). Object detectors emerge in deep scene CNNs. In Proceedings of the
International Conference on Learning Representations (ICLR).
[338] Zhou, B., Sun, Y., Bau, D. and Torralba, A. (2018). Interpretable
basis decomposition for visual explanation. In Proceedings of the European
Conference on Computer Vision (ECCV) 119–134.
[339] Zhou, D.-X. (2002). The covering number in learning theory. Journal of
Complexity 18 739–767.
[340] Zhu, Y., Zabaras, N., Koutsourelakis, P.-S. and Perdikaris, P.
(2019). Physics-constrained deep learning for high-dimensional surrogate
modeling and uncertainty quantification without labeled data. Journal of
Computational Physics 394 56–81.
[341] Zhu, Z., Luo, P., Wang, X. and Tang, X. (2014). Multi-view per-
ceptron: a deep model for learning face identity and view representations.
In Proceedings of Conference on Neural Information Processing Systems
(NeurIPS) 217–225.
