XEMP Prediction Explanations With DataRobot White Paper
• Which input feature values in this data point caused the prediction to be
so high (or so low)?
• Which input feature values in this data point caused the decision to be
negative (or positive)?
• Was my sensitive feature (e.g. gender, race, age etc.) a factor in the
prediction or decision?
• Which input feature values for this data point were the most important
in determining the result?
• Which feature values made the prediction lower? Which feature values
made the prediction higher?
The table below shows an example of a prediction by DataRobot. The model predicts
that hospital patient Lester's readmission probability is 63.7%, which is quite high
compared to a typical patient. Naturally, doctors and administrators have questions
about why his readmission risk is so high. By adding explanations, it is possible to get
a ranked list of the key factors behind the prediction, and we can see that the top factor is
the primary diagnosis feature, or variable. In this example, the diagnosis code for abdominal
pain at an unspecified site has high positive strength as a factor in the prediction. The positive
or negative effect of each factor is indicated by an orange (positive) or blue (negative)
background color. A secondary factor is the number of prior inpatient stays Lester had in
the last year, since patients who have had previous admissions tend to be readmitted at a
higher rate. Because this is Lester's first visit in over a year, this factor has a moderate
negative effect on the readmission probability. These are the two most important
factors for Lester's high readmission score, and they provide the necessary context for
understanding the prediction.
Explanations offer clear value, yet they are not widely used within enterprises.
We have found three factors that limit the widespread use of explanations.
1. Explanations are a new technique in data science, and there is a clear lack of criteria for
what constitutes a good explanation.
2. While data scientists have access to many open source explanation methodologies,
such as LIME, SHAP, and Breakdown, there is a lack of benchmarks for evaluating which
of these approaches offer suitable accuracy. Without a way to evaluate multiple
approaches, it is not clear which approach an enterprise should adopt widely.
3. Enterprises require reliable, scalable algorithms that run across large, complex datasets.
Most explanation approaches have not been designed with scalability in mind, and
some can take hours to return a single prediction explanation, which is simply not
workable for a modern enterprise.
This white paper begins by identifying the key criteria for a good explanation. After
establishing the criteria, the paper highlights some of the shortcomings of the most widely
used explanation approach, Local Interpretable Model-Agnostic Explanations (LIME). Next,
we introduce and review DataRobot's eXemplar-based Explanations of Model Predictions
(XEMP). The last section of the paper provides several benchmark test results in order to
quantify the advantages of XEMP over LIME.
The decisions made by AIs affecting a person’s life, quality of life, or reputation should
always be justifiable in a language that is understood by the people who use them or
who are subjected to the consequences of their use. Justification consists in making
transparent the most important factors and parameters shaping the decision, and
should take the same form as the justification we would demand of a human making
the same kind of decision.
Using this framework, there are at least three different properties that a good
explanation should have:
1. Explanations need to be readable and suitable for all stakeholders, including
those who are non-technical. At a practical level, this means that explanations should
be stated in language understandable by the people affected by the model.
2. The explanation should contain the most important factors for the decision. This can
be evaluated in several ways: overall strength, fidelity to the prediction, and consistency.
A good explanation should highlight just a handful of the most important factors, and
those factors should have a strong influence on the prediction. Later, in the quantitative
testing section of this document, we compare different explanation methodologies
based on how strong an influence the explanation has on its prediction, using the
concept of explanation accuracy.
3. Explanations should be identical and replicable given the same model, the same input
feature values, and the same prediction to be explained. This type of consistency is
important for having confidence in explanations. After all, a prediction that can generate
two different explanations is problematic. If you asked a human the reason for a specific
decision, and each time you asked they told you a different story, then you wouldn't give
much credibility to those explanations. You might even suspect that the person was
just inventing the reasons! It will not be possible to trust an explanation if stakeholders
don't get identical explanations for the same model and prediction.
The inability of a linear surrogate model, such as the one used by LIME, to capture the
behavior of the underlying model raises doubt as to whether it is an appropriate technique.
After all, most interpretability methods, such as feature importance and partial dependence,
are based on directly querying the underlying model. Moreover, legal requirements about
explanations typically require explanations based on the actual model (not a surrogate).
For example, the Equal Credit Opportunity Act (ECOA), as implemented by Regulation B,
and the Fair Credit Reporting Act (FCRA) were designed to protect consumers and
businesses against potential discrimination by requiring that creditors support their credit
decisioning processes and alert the consumer when negative information was the basis
for "adverse action." Specifically, if a creditor has denied a consumer credit,
the creditor must provide an adverse action notice to the consumer that includes the
top four factors that adversely affected the credit score; if one of the key factors was the
number of inquiries on the consumer's report, the creditor must list five key factors. There
is no allowance for approximations or surrogate models in this regulatory requirement.
Instead, lenders must be able to accurately interpret and explain the actual model (not a
surrogate) to ensure alignment with regulatory requirements.
In the process of creating an explanation, LIME generates synthetic data upon which
the surrogate model is built. This leads to a myriad of problems. First, LIME generates
synthetic data randomly. This means that if LIME is run twice, two different sets of
synthetic data can be generated, and hence two different explanations. The result is that
LIME can supply multiple contradictory explanations. The table below shows an example
from the Boston Housing dataset where LIME provides two different explanations for
the same data point. The only commonality is that RM (average number of rooms
per dwelling) and LSTAT (percentage of the population with lower education and
employment status) appear near the top in both explanations.
Explanation 1    Explanation 2
RM               LSTAT
DIS              TAX
B                PTRATIO
ZN               CRIM
This lack of consistency is built into the LIME approach. LIME's reliance on randomly
generated data means that, given the same dataset and the same underlying model, two
runs can produce two different explanations for the same prediction.
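As a concrete illustration of this behavior, the sketch below (a minimal example, not DataRobot code) trains a simple regressor on synthetic data and runs the open source LIME tabular explainer twice on the same row without fixing a random seed; comparing the ranked feature lists from the two runs shows how the random sampling step can reorder or swap the top factors.

```python
# Minimal sketch: running LIME twice on the same row can yield different
# ranked explanations because LIME's synthetic sampling step is random.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestRegressor(random_state=0).fit(X, y)

row = X[0]
for run in (1, 2):
    # No random_state is passed, mirroring a naive/default usage of LIME.
    explainer = LimeTabularExplainer(X, mode="regression",
                                     feature_names=feature_names)
    exp = explainer.explain_instance(row, model.predict, num_features=4)
    print(f"Run {run}:", [name for name, weight in exp.as_list()])
```

Passing a fixed `random_state` to the explainer makes the output repeatable, but that only pins down one of the many explanations the random sampling could have produced; it does not tell you which one is right.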
A related issue is ensuring the generated synthetic data accurately represents the actual
data. In the widely used Python implementation of LIME created by LIME’s authors, there
is an assumption of normality when generating synthetic data for continuous features.
However, real-world datasets often include zero-inflated, skewed, and truncated features
that are not normally distributed. Commonly found examples of these features include:
how many purchases a customer has made; how much a customer spent on purchases;
how long a person has been married; number of criminal convictions; and years of
education. The inability of LIME to handle non-normal distributions can lead to inaccurate
explanations. In the quantitative benchmarking section of this document, one of the
examples will illustrate this point.
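A quick way to see the problem is to compare a zero-inflated feature with the normal approximation that a mean-and-standard-deviation sampler would use for it. The sketch below is illustrative only and does not reproduce LIME's exact sampler: it generates a zero-inflated "number of purchases" feature, then draws synthetic values from a normal distribution fitted to its mean and standard deviation; the synthetic sample contains negative counts and almost no exact zeros, unlike the real data.

```python
# Illustrative sketch: a normal approximation misrepresents a zero-inflated feature.
import numpy as np

rng = np.random.default_rng(0)

# Real-world style feature: ~60% of customers made zero purchases,
# the rest follow a right-skewed (Poisson-like) distribution.
n = 10_000
purchases = np.where(rng.random(n) < 0.6, 0, rng.poisson(5, n))

# Normal sampler fitted to the feature's mean and standard deviation,
# analogous to sampling a continuous feature from its summary statistics.
synthetic = rng.normal(purchases.mean(), purchases.std(), n)

print(f"real:      {np.mean(purchases == 0):.0%} zeros, min {purchases.min()}")
print(f"synthetic: {np.mean(np.isclose(synthetic, 0)):.0%} zeros, "
      f"min {synthetic.min():.1f} (negative purchase counts are impossible)")
```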
Once LIME generates the data, it builds a model around those data points. A critical issue
is the size of the local region. If the region is too big, the explanations simply mirror the
overall global feature importance, and there is no fidelity to the data point being explained.
However, if the local region is too small, LIME may provide counterintuitive explanations.
Consider, for example, income in a model for the probability of defaulting on a loan. Overall,
there is a general trend that the higher the income, the less likely a person is to default.
However, there are some ranges, say between 90K and 100K or between 140K and 200K,
where the model actually increases the probability of default as income goes up. If our
data point falls into such a range and the local region is so small that it only encompasses
that range, the explanations that come out of LIME may say that one of the risks for
defaulting is that the person has 100K of income. While this is strictly true compared to
someone who has 90K in this dataset, it really isn't a very useful or fair comparison. As we
see later, other methods are more explicit about their comparison and don't suffer from
this issue of local non-monotonicity.
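The sketch below makes this concrete under stated assumptions: it builds a synthetic default-probability curve (a stand-in, not the model from the example above) that decreases with income overall but rises locally approaching 100K, then fits a linear surrogate only to points inside the 90K to 100K window. The locally fitted slope comes out positive, so a surrogate confined to that narrow region would report higher income as increasing default risk, even though the global trend is the opposite.

```python
# Sketch: a local linear surrogate fitted to a narrow, non-monotone region
# can report the opposite sign from the global trend.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 200_000, 5_000)

def default_probability(x):
    # Globally decreasing in income, with a local rise approaching 100K.
    base = 0.4 - 0.3 * (x - 20_000) / 180_000
    bump = 0.05 * np.exp(-((x - 100_000) / 6_000) ** 2)
    return base + bump

p = default_probability(income)

# Global trend: slope of default probability vs. income over all the data.
global_slope = LinearRegression().fit(income.reshape(-1, 1), p).coef_[0]

# "Local region" surrogate: only points with income between 90K and 100K.
mask = (income > 90_000) & (income < 100_000)
local_slope = LinearRegression().fit(income[mask].reshape(-1, 1), p[mask]).coef_[0]

print(f"global slope: {global_slope:+.2e} (negative: more income, less risk)")
print(f"local  slope: {local_slope:+.2e} (positive inside the 90K-100K window)")
```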
1 https://github.com/marcotcr/lime/issues/119
Introducing XEMP
XEMP was designed within DataRobot and introduced to our customers in 2016. XEMP
had several design objectives. First, the approach needed to be model agnostic. In a
modern enterprise, data scientists use a variety of algorithms and ensemble methods, and
any approach that is limited to a particular model structure will be very limited in its
usage. This reflects the "no free lunch" concept in machine learning: there is no one best
model or approach for every problem, and you cannot know in advance which algorithm
will work best on your data. Consequently, there is a need for a diversity of model types
and for ensembling of models. As a result, an explanation approach needs to work with
different types of models and combinations of models, whether Random Forest, SVM, or
deep learning. This standard is no different from how we treat other explanatory tools,
such as partial dependence.
Finally, XEMP needed to be flexible enough to handle diverse data types while maintaining
computational efficiency. Our customers must calculate millions of explanations
in a reasonable amount of time, which requires some creativity in the design process.
While there are many explanation approaches, most of them have not considered
computational efficiency in their design. As an example, one approach, Breakdown, can
take 3.5 hours to produce just one explanation.
Since XEMP explains predictions by how they differ from a typical value, it uses exemplars
to represent a baseline "typical" value. No single input feature value can truly represent
the full range of the data, so an exemplar uses a series of synthetic data rows spanning
the range of values that an input feature can take, along with the resulting variations in
predicted values. Unlike LIME, the synthetic data created by XEMP takes actual values from
the entire range of values found in the population, and these synthetic data are consistent
each time an explanation is calculated, rather than varying randomly.
To illustrate XEMP, let's start with a simple insurance model that predicts Sally should pay
$2,000 for car insurance. Sally wants to know: which facts about her are leading to
that price? For background, Sally is 43, female, and lives in the Southwest. XEMP compares
each of the features, or variables, about Sally to a baseline exemplar value.
The first step for XEMP is to evaluate each feature to calculate its exemplar values. For
Sally, we get predictions for all the possible values of location, as shown below.
The next step is to average all of these predictions to generate a baseline exemplar
prediction. The value of $2,040 represents the baseline price for Sally averaged across all
the possible locations, and it is used as our baseline exemplar prediction. The table
below shows the predictions for Sally across all the locations, including "Southwest,"
the actual value we are evaluating.
Age   Gender   Location    Predicted Price ($)
43    F        Northeast   1880
43    F        Southeast   1940
43    F        Midwest     2230
43    F        Northwest   2100
43    F        Southwest   2000
43    F        Texas       2090
The next step is to identify how much of a factor location is for Sally. In this case, the
exemplar value of $2,040 is subtracted from the actual prediction of $2,000, leading to
a difference of -$40, meaning that the influence of location reduces the prediction
by $40 relative to the exemplar value. The same step can be repeated for the other features.
In this case, let's say the effect of gender was $100 and the effect of age was $20. The final
explanation for Sally's $2,000 prediction is that gender was the strongest factor in the
prediction, followed by location and age.
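The sketch below walks through the same arithmetic in code. The toy pricing function is a hypothetical stand-in for the deployed insurance model, chosen so its outputs match the table above; this is an illustration of the exemplar logic just described, not DataRobot's implementation.

```python
# Sketch of the exemplar logic described above: for one feature, predict
# across all of its possible values, average those predictions to get the
# exemplar, and report (actual prediction - exemplar) as the feature's effect.

def predict_price(row):
    # Toy stand-in for the deployed insurance model; its outputs match the
    # table above for Sally (age 43, F) across locations.
    location_price = {"Northeast": 1880, "Southeast": 1940, "Midwest": 2230,
                      "Northwest": 2100, "Southwest": 2000, "Texas": 2090}
    return location_price[row["location"]]

sally = {"age": 43, "gender": "F", "location": "Southwest"}
actual_prediction = predict_price(sally)                 # 2000

# Step 1: score Sally under every possible value of the feature being explained.
predictions = [predict_price({**sally, "location": loc})
               for loc in ["Northeast", "Southeast", "Midwest",
                           "Northwest", "Southwest", "Texas"]]

# Step 2: the exemplar prediction is the average of those predictions.
exemplar = sum(predictions) / len(predictions)           # 2040

# Step 3: the feature's effect is the actual prediction minus the exemplar.
effect = actual_prediction - exemplar                    # -40
print(f"exemplar = {exemplar:.0f}, effect of location = {effect:+.0f}")
```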
The exemplar prediction is the weighted average of the partial dependencies, in this
case approximately 0.14. The explanation for an income with a partial dependence lower
than 0.14 should show a negative effect upon the prediction, while the explanation for
an income with a partial dependence higher than 0.14 should show a positive effect
upon the prediction. The prediction explanation for a person with $180,000 income
should show that income has a negative effect upon the prediction, because the partial
dependence for high incomes is substantially lower than for most other incomes.
Similarly, the prediction explanation for a person with a $38,000 income shows a positive
effect upon the prediction. This is true even though the predicted value at $38,000 is lower
than at $36,000 or $40,000, because the partial dependence at this income is higher than
the exemplar.
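A minimal sketch of this comparison is shown below. It assumes a fitted binary classifier `model` with a `predict_proba` method and a pandas DataFrame `X` containing an `income` column (all of these names are assumptions for illustration); it computes a simple partial dependence curve by hand, takes its average weighted by how often each income bin occurs, and signs each income's effect relative to that exemplar value.

```python
# Sketch: compute a partial dependence curve for "income" by hand, take its
# frequency-weighted average as the exemplar prediction, and sign each
# income's effect relative to that exemplar. `model` and `X` are assumed.
import numpy as np
import pandas as pd

def income_effects(model, X: pd.DataFrame, n_bins: int = 20) -> pd.DataFrame:
    # Bin the observed incomes; the bin counts are the weights for the exemplar.
    counts, edges = np.histogram(X["income"], bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2

    # Partial dependence: for each bin center, overwrite income for every row
    # and average the predicted probability of the positive class.
    pd_curve = np.array([
        model.predict_proba(X.assign(income=value))[:, 1].mean()
        for value in centers
    ])

    # Exemplar prediction = frequency-weighted average of the partial
    # dependencies (about 0.14 in the example discussed above).
    exemplar = np.average(pd_curve, weights=counts)

    # The sign of (partial dependence - exemplar) gives the direction of the
    # income explanation: negative below the exemplar, positive above it.
    return pd.DataFrame({"income": centers,
                         "partial_dependence": pd_curve,
                         "effect_vs_exemplar": pd_curve - exemplar})
```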
[Figure 2: histogram (y-axis: count)]
While XEMP creates both positive and negative reasons, when we ran LIME on this same
data, it did not always retain fidelity to the model. If the hyperparameters were not tuned
to use binning for numeric features, we found that LIME always produced small negative
explanations for annual income, no matter what the value of annual income was. This is
counterintuitive: an input feature cannot always have a negative effect on the prediction
regardless of the value it takes; there must be some values that give a better result. This is
a reminder that methods that require hyperparameter tuning can be problematic.
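The behavior observed here depends on how the explainer is configured. In the open source LIME package, binning of numeric features is controlled by the `discretize_continuous` argument; the short sketch below simply contrasts the two settings for one row. The fitted classifier `model`, training matrix `X_train`, `feature_names`, and `row` are assumed inputs, along the lines of the earlier consistency example.

```python
# Sketch: the discretize_continuous flag controls binning of numeric features
# in the open source LIME tabular explainer. Assumes a fitted classifier
# `model` and a numeric training matrix `X_train` with matching feature_names.
from lime.lime_tabular import LimeTabularExplainer

def explain_with_and_without_binning(model, X_train, feature_names, row):
    for binned in (True, False):
        explainer = LimeTabularExplainer(X_train,
                                         feature_names=feature_names,
                                         discretize_continuous=binned,
                                         random_state=0)
        exp = explainer.explain_instance(row, model.predict_proba, num_features=5)
        print(f"binning={binned}:", exp.as_list())
```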
EXPLANATION DIVERSITY
An important trait for explanations is that they maintain fidelity to the data being
explained. A common issue for the fidelity of explanations is their relationship to feature
importance. Feature importance identifies the most important features affecting
predictions across the whole model; typically, it looks like the ranked list in Figure 2.
However, when it comes to explanations for individual predictions, we expect them to
deviate from the overall feature importance. After all, if all explanations simply mirrored
the overall feature importance, they would not be showing fidelity to the individual feature
values and the predicted value for each data point.
Clearly, LIME is lacking in diversity of prediction explanations: it favors the top feature
from feature impact in each and every explanation. This lack of diversity raises
concerns when using an explanation approach. After all, the goal is to understand why a
prediction is being made for a particular example, and having the same explanation for
every example isn't very useful. One reason for the lack of diversity could be that capital
gain is a zero-inflated feature and LIME's synthetic data generation step cannot adequately
model zero-inflated distributions.
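One way to quantify the diversity discussed in this section is simply to count how many distinct features appear as the top-ranked factor across a sample of explanations. The helper below is a minimal sketch under the assumption (ours, not the paper's) that each explanation is available as a list of (feature name, strength) pairs sorted from strongest to weakest, whether it came from XEMP, LIME, or any other method.

```python
# Sketch: quantify explanation diversity as the number of distinct features
# that appear as the top-ranked factor across a sample of explanations.
# Assumes each explanation is a list of (feature_name, strength) pairs,
# already sorted by absolute strength (strongest first).
from collections import Counter

def top_feature_diversity(explanations):
    top_features = Counter(expl[0][0] for expl in explanations if expl)
    distinct = len(top_features)
    print(f"{distinct} distinct top features across {len(explanations)} rows")
    print("most common:", top_features.most_common(3))
    return distinct
```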
EXPLANATION ACCURACY
The accuracy of the explanations was evaluated using a permutation test with both XEMP
and LIME. The permutation test used by Lundberg operates by permuting, or changing, the
value of the top feature in an explanation to a randomly selected value and measuring
the difference in predictions. If the explanation identifies a strongly explanatory feature,
you would expect a strong effect on the prediction.
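Below is a sketch of this style of permutation test under stated assumptions: a fitted `model` with a `predict` method, a pandas DataFrame `X`, and a callable `top_feature(row)` that returns the name of the top-ranked explanation feature for a row (produced by whichever explanation method is under test). All of these names are placeholders for illustration, not an exact reproduction of the benchmark code.

```python
# Sketch of the permutation test described above: replace the value of each
# row's top-ranked explanation feature with a randomly drawn value from the
# same column and measure how much the prediction moves on average.
import numpy as np
import pandas as pd

def explanation_strength(model, X: pd.DataFrame, top_feature, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    original = model.predict(X)

    X_perm = X.copy()
    for i, (_, row) in enumerate(X.iterrows()):
        col = top_feature(row)
        # Swap in a value drawn at random from the same column.
        X_perm.iloc[i, X_perm.columns.get_loc(col)] = X[col].iloc[rng.integers(len(X))]

    permuted = model.predict(X_perm)
    # Stronger explanations should move the prediction more when permuted.
    return float(np.mean(np.abs(permuted - original)))
```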
[Chart: explanation calculation time in seconds, LIME vs. XEMP]
Conclusion
The table below shows a side-by-side comparison of the LIME and XEMP prediction
explanation methodologies.
Dataset          LIME   XEMP
Adult dataset    1      13
Boston dataset   3      7
As the table above shows, XEMP is generally more suitable and capable than LIME.
1. From a theoretical perspective, XEMP doesn’t make assumptions about local linearity.
2. XEMP works directly on the actual model that is deployed, and doesn’t rely on a linear
approximation model.
3. XEMP is stable: every prediction yields a unique and replicable explanation. You
won't get two different explanations for the same prediction.
4. XEMP is easily explainable, answering the questions asked by stakeholders without
relying on abstract mathematical concepts such as gradients.
5. XEMP provides a greater variety of explanations, keeping fidelity with the individual
feature values used within a prediction.
6. XEMP provides stronger explanations. Stronger explanations are very important
because they provide a quantitative basis for evaluating predictions.
7. XEMP is much more scalable, since it is fast, with calculation times that are
approximately linear in the number of input features. XEMP is an enterprise-ready
solution for providing prediction explanations.
DataRobot helps enterprises embrace artificial intelligence (AI). Invented by
DataRobot, automated machine learning enables organizations to build predictive
models that unlock value in data, making machine learning accessible to business
analysts and allowing data scientists to accomplish more, faster. With DataRobot,
organizations become AI-driven and are enabled to automate processes, optimize
outcomes, and extract deeper insights.
© 2019 DataRobot, Inc. All rights reserved. DataRobot and the DataRobot logo are trademarks of DataRobot, Inc.
All other marks are trademarks or registered trademarks of their respective holders.