Chapter 3: Background and Related Work
Explanation methods
It is vital to ensure that ML models make decisions for the right reasons.
Understanding why a model makes a particular prediction can help the user debug, gain
trust in, and further improve the existing model. This process falls into a research
area called interpretability. Interpretability is defined as the ability to explain or
to present in understandable terms to a human.
Simple models such as linear models and decision trees usually cannot handle
difficult tasks, but they can be interpreted through their weights and decision nodes.
Complex models such as ensemble methods or neural networks typically have
better predictive ability, but they are hard for humans to interpret. Hence, an
explanation model uses a simpler model to approximate the original complex
model locally and returns that approximation as the explanation.
LIME
The local explanation method LIME interprets an individual prediction by
learning an interpretable model locally. The intuition behind LIME is that it
samples instances both in the vicinity of and far away from the interpretable
representation of the original input. It then takes the interpretable
representation of these sampled points, obtains their predictions, and builds a
weighted linear model by minimizing a loss and a complexity term. The samples
are weighted by their distance from the original point: the weights decrease as
the points move further away. This is illustrated in Figure.
caption: The black curve is the decision boundary of a complex model f. The red
cross is the instance being explained. LIME samples instances (red and blue
dots), weights them based on their distances from the original instance
(represented by size), and gets their predictions using f. The dashed grey line is
the learned explanation model g that is locally faithful.
We now formalize this intuition. Let f denote the complex model being explained.
In classification, f(x) is the probability that an instance x belongs to a certain
class. Let g in G denote the explanation model, where G is a class of
interpretable models such as linear models and decision trees. In the current
version of LIME, G is the class of linear models, such that g(z') = w_g * z'.
To ensure interpretability and local fidelity, LIME produces the
explanation xi(x) by minimizing a loss and a complexity term. The loss L measures how
unfaithful g is in approximating f. Omega(g) measures the complexity of the
explanation model; for example, Omega(g) can be the number of non-zero
weights of a linear model or the depth of a decision tree. The proximity
measure between x and z is an exponential kernel denoted pi_x. Given the
dataset Z of perturbed samples with their associated labels, the explanation is
obtained by optimizing the following equation:
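\xi(x) = \arg\min_{g \in G} \; L(f, g, \pi_x) + \Omega(g)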
An advantage of LIME is that its runtime does not depend on the size of the
dataset, but on the complexity of f(x) and the number of samples.
However, any choice of interpretable representation and G has some
drawbacks. Since the complex model is treated as a black box, the chosen
interpretable representation might not fully explain certain behaviours.
Additionally, a linear model g cannot give a faithful local explanation if the input
model is highly non-linear.
SHAP
SHAP assigns each feature an importance value based on Shapley values from game
theory. It is the unique solution in the class of additive feature attribution methods that
satisfies three desired properties (described later). While these properties are
familiar from classical Shapley value estimation, they were previously
unknown for other additive feature attribution methods. We start by explaining
Shapley values.
The Shapley value is one way to distribute the total gains among the contributors,
assuming that all the contributors collaborate. It can be interpreted as the
expected marginal contribution: it is calculated by considering all possible
orders in which the contributors arrive and computing the marginal contribution of
each contributor. The following example illustrates this.
Suppose there are two players A and B in a coalitional game, and their
contributions v are v({A}) = 6, v({B}) = 10, v({A, B}) = 24. The following table
displays the marginal contributions of both players for each arrival order.

Arrival order    Marginal contribution of A      Marginal contribution of B
A, B             v({A}) - v({}) = 6              v({A, B}) - v({A}) = 18
B, A             v({A, B}) - v({B}) = 14         v({B}) - v({}) = 10

Therefore, the expected marginal contribution and Shapley value of player A is
0.5*6 + 0.5*14 = 10 and that of player B is 0.5*18 + 0.5*10 = 14.
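In general, for a model with feature set F, the Shapley value of feature i is

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f(S \cup \{i\}) - f(S) \right]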
where M is the number of interpretable input features and f(S) denotes the prediction
obtained using only the features in the subset S. The Shapley value is thus the marginal
contribution of feature i when it is added to the feature subset S, weighted by the number
of orderings in which this marginal contribution occurs divided by the number of
permutations of all the features.
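As a small, purely illustrative sketch, the following code enumerates all arrival orders for the two-player example above and recovers the Shapley values of 10 and 14.

```python
from itertools import permutations

# Coalition values from the two-player example
v = {frozenset(): 0,
     frozenset({"A"}): 6,
     frozenset({"B"}): 10,
     frozenset({"A", "B"}): 24}
players = ["A", "B"]

def shapley(player):
    # Average the player's marginal contribution over all arrival orders
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print(shapley("A"))  # 10.0
print(shapley("B"))  # 14.0
```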
SHAP is also guaranteed to satisfy the following three properties: local accuracy,
missingness, and consistency. Local accuracy requires the sum of the feature
contributions to equal the model's prediction. Missingness requires features that
are absent from the original input to have no impact. Consistency ensures that
when the model changes such that a feature has a higher impact on the model, the
importance of that feature never decreases. Lundberg et al. prove that the SHAP
method is the only locally accurate and consistent method that also satisfies
the missingness property.
Building on this theory, SHAP is combined with LIME and with tree-based
classification models to obtain full explanations for ML models.
Kernel SHAP
LIME chooses its parameters heuristically and does not recover the Shapley
values. Hence, the idea behind Kernel SHAP is to find the loss function L, the
weighting kernel pi_x, and the regularization term Omega that do recover the Shapley
values.
Lundberg et al. prove that the following choices of Omega, pi_x, and L make Equation
recover the Shapley values and satisfy the above three properties:
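\Omega(g) = 0, \qquad \pi_x(z') = \frac{M - 1}{\binom{M}{|z'|} \, |z'| \, (M - |z'|)}, \qquad L(f, g, \pi_x) = \sum_{z' \in Z} \left[ f(h_x(z')) - g(z') \right]^2 \pi_x(z')

where |z'| is the number of non-zero elements in z' and h_x maps the simplified inputs z' back to the original input space.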
The connection between linear regression and Shapley values is that the classical
Shapley value estimate (Equation) is a difference of means; since the mean is also the
best least-squares point estimate, a suitably weighted linear regression can recover the
Shapley values.
Tree SHAP
Popular feature attribution methods for tree ensembles are currently
inconsistent. This means that when a model changes such that a feature has a higher
impact on the model's output, these methods can actually lower the importance assigned
to that feature. Tree SHAP is proposed as a theoretical improvement that solves this
inconsistency problem.
Summary
We summarize the pros and cons of the two methods and connect their
advantages to the properties of a useful explanation. LIME provides a selective
explanation, which explains an instance with a small number of important features.
This satisfies the selectivity property of a good explanation, since humans usually
prefer sparse explanations. LIME is also capable of predicting how the probability of a
label changes when an input feature changes, which provides a good contrastive
explanation.
On the other hand, SHAP provides a full explanation, which is crucial for decisions
that need to consider all effects, for example assisting doctors in clinical
diagnosis or helping judges make careful decisions about defendants. SHAP also
provides contrastive explanations that allow the user to compare explanations at
the group level. In addition, a SHAP value cannot be interpreted, as in LIME, as the
probability decrease after removing the feature. Instead, the value should be interpreted
as the feature's contribution to the difference between the actual prediction and the
mean prediction.
FPs Elimination Framework
The framework for FPs elimination is illustrated in Figure. The proposed novel
approaches are shown in the grey frame. Firstly, we start by preprocessing the
transaction data. The transaction dataset has a large number of features and is
highly imbalanced; under-sampling and feature selection techniques are applied
to deal with these problems. Secondly, we build an ML model on the training set
and predict on a test set. The cases predicted as the positive class are considered
alerts, while the cases predicted as negative are non-alerts. Thirdly, we use the
explanation model Tree SHAP to obtain explanations for the alerts. Each
explanation of an alert consists of the SHAP values of all the features. After the
SHAP values are obtained, we propose the following two approaches to filter out
FPs automatically.
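As a minimal sketch of the pipeline up to the explanation step (synthetic data and a scikit-learn gradient boosting model stand in for the transaction data and the actual model, which are not specified here; the shap package provides Tree SHAP):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced data as a stand-in for the transaction dataset
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative tree-based model (the model and tuning used in the thesis may differ)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Alerts are the test cases predicted as positive
pred = model.predict(X_test)
alerts = X_test[pred == 1]

# Tree SHAP: one SHAP value per feature for every alert
# (for some model types shap returns a list with one array per class)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(alerts)
```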
In the first approach, SHAP values are applied to distinguish between FPs and
TPs. Since the SHAP values of the features decompose the output probability,
the sum of the SHAP values should reflect that TPs have a higher probability of being
an alert than FPs. Moreover, because the SHAP values decompose this confidence, we
expect that some features in this decomposition will dominate the true positive
classifications and other features the false positive classifications. Therefore, the
SHAP values are used as features to build another ML model for predicting
FPs and TPs.
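A minimal sketch of this first approach, continuing the snippet above; the second-stage classifier (a logistic regression here) and the evaluation setup are illustrative assumptions rather than the exact experimental configuration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Label each alert: 1 = true positive, 0 = false positive
is_tp = y_test[pred == 1]

# Second-stage model: predict TP vs FP from the SHAP values of the alerts
fp_filter = LogisticRegression(max_iter=1000)
auc = cross_val_score(fp_filter, shap_values, is_tp, scoring="roc_auc", cv=5)
print(auc.mean())
```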
In the second approach, SHAP values are applied to discriminate between FPs and TPs in
the original feature space. The elimination process then does not rely on the
SHAP values and can filter out FPs using the original features. We relate the SHAP
values to the original features by using clustering and subgroup discovery
techniques. The process is constructed as follows. Clustering is applied to find
structure in the SHAP values; after forming the clusters, the instances in each
cluster are dominated by similar features. The remaining problem is how
to extract patterns that distinguish FPs from TPs. Subgroup discovery is
applied to discover rules that distinguish FPs and TPs in each cluster. To
ensure the extracted rules also reach a certain confidence over all the alerts, the
support and confidence of each rule are computed. Finally, the rules with higher
FP confidence and lower TP support are used in the filter.
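A sketch of the second approach, again continuing the snippets above. KMeans is one possible clustering choice, and the example rule on the original features is purely hypothetical; in the proposed method such rules come from subgroup discovery rather than being hand-written.

```python
from sklearn.cluster import KMeans

# Cluster the alerts by their SHAP values, so each cluster is dominated by similar features
clusters = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(shap_values)

# Evaluate a candidate rule on the ORIGINAL feature space within one cluster,
# e.g. "feature 3 > 0.5" (hypothetical rule, for illustration only)
in_cluster = clusters == 0
rule = alerts[:, 3] > 0.5
covered = in_cluster & rule

support = covered.sum() / len(alerts)                                    # fraction of all alerts covered
fp_confidence = (covered & (is_tp == 0)).sum() / max(covered.sum(), 1)   # fraction of covered alerts that are FPs
print(support, fp_confidence)
```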
The two approaches are evaluated with the ROC curve as the performance metric,
comparing the ML model against the respective baseline. The extracted rules from the
second approach are presented to a domain expert to validate their genuineness.
Preprocessing
The transaction dataset has a large number of features and imbalanced classes. A
useful explanation requires an ML model with high accuracy; therefore,
the preprocessing step is necessary. We summarize the existing methods that deal
with class imbalance and feature selection, and then describe the methods chosen in
the experiments.
The imbalanced class problem can be addressed at different levels: the data level,
the algorithm level, and combining methods. At the algorithm level, the costs
of the classes are adjusted to balance them. For example, different weights can
be assigned to the respective classes in kNN, and a bias can be introduced in
SVM such that the learned hyperplane lies further away from the majority class. At
the data level, there are many forms of re-sampling, such as random under-
sampling, random over-sampling, and the over-sampling technique SMOTE. Combining
methods combine the results of many classifiers, each of which is
built on under- or over-sampled data with a different sampling ratio. To avoid
building extra models on the massive transaction dataset, we focus on data-level
methods. Over-sampling would add computational cost since
the size of our fraud dataset is already large. We apply random under-sampling
in the experiments. It balances the class distribution by randomly discarding
samples of the majority class. One major drawback is that this method can
discard examples that are important for the performance metric.
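A minimal sketch of random under-sampling, assuming the imbalanced-learn package and synthetic data that stands in for the transaction dataset:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced data as a stand-in for the transaction dataset
X, y = make_classification(n_samples=10000, n_features=30, weights=[0.99], random_state=0)
print(Counter(y))            # roughly 99% negatives, 1% positives

# Randomly discard majority class samples until the classes are balanced
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))        # equal class counts
```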
The main feature selection methods are filter, wrapper, and embedded methods.
Filter methods examine each feature individually to determine the strength of the
correlation between the input feature and the class labels. These methods do not rely on
a learning algorithm and are computationally light. However, because filter methods do
not take any classifier into account, ignoring the specific heuristics and biases of the
classifier might lower the classification accuracy. Additionally, features that are more
informative when combined with others, but less important on their own, could be
discarded. Wrapper methods use a classifier and evaluate feature subsets by their
predictive performance. Their main drawback is that the number of evaluations
required to obtain a feature subset can become computationally expensive.
Embedded methods incorporate feature selection into the training process.
Decision trees such as CART have a built-in mechanism to perform feature
selection, and the coefficients of regularized regression models can be used to select
features. After under-sampling, the size of the transaction dataset is
relatively small. Therefore, we apply a wrapper method using a Random Forest
classifier with 500 estimators.
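As a sketch of one possible wrapper procedure (the exact selection strategy used in the experiments is not specified here), forward sequential selection scored by a 500-tree Random Forest can be written as follows, reusing X_res and y_res from the under-sampling sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Wrapper selection: candidate feature subsets are scored by the cross-validated
# performance of the Random Forest itself
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
selector = SequentialFeatureSelector(
    rf, n_features_to_select=10, direction="forward", scoring="roc_auc", cv=3, n_jobs=-1
)
selector.fit(X_res, y_res)

print(selector.get_support())            # boolean mask over the original features
X_selected = selector.transform(X_res)   # reduced feature matrix
```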
Gradient Boosting
Gradient boosting creates a new model based on the difference between the current
approximation of the prediction and the target value, and then adds it to the current
approximation to make the final prediction. This difference is usually called the
residual or residual vector. The method is called gradient boosting because it performs
gradient descent: training each new model on the residual vector minimizes the loss
when the model is added.
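To make the residual-fitting idea concrete, here is a minimal toy implementation of gradient boosting for regression with squared loss (an illustration, not the library used in the experiments).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean prediction
trees = []

for _ in range(100):
    residual = y - prediction                       # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # add the new model to the ensemble
    trees.append(tree)

print(np.mean((y - prediction) ** 2))   # training loss shrinks as trees are added
```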
Clustering
Clustering algorithms are mostly divided into partitioning methods and
hierarchical methods. Partitioning methods usually assume a certain number
of clusters in advance, while hierarchical methods do not make this a priori
assumption. Hierarchical methods are further divided into agglomerative and
divisive methods. The agglomerative method uses a bottom-up approach, starting with
the individual data points and repeatedly merging the two most similar clusters.
In contrast, the divisive method uses a top-down approach, treating all the
data points as one big cluster and then dividing it into smaller clusters.
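A short illustration of the two families on toy data; scikit-learn offers no divisive method, so only the partitioning and agglomerative variants are sketched.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Partitioning method: the number of clusters is fixed in advance
kmeans_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Agglomerative (bottom-up) hierarchical method: starts from single points
# and repeatedly merges the most similar clusters
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)
```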
Subgroup Discovery
In the FPs elimination task, we try to filter out false alarms while retaining most
of the true positives. This can be done by using descriptions that distinguish between
the TP and FP groups. Three techniques share the objective of extracting descriptive
knowledge about a property of interest from data: contrast set mining, emerging pattern
mining, and subgroup discovery. Contrast set mining is a data mining technique for
finding differences between contrasting groups. Emerging pattern mining searches for
itemsets whose support increases significantly from one dataset to another. Subgroup
discovery aims at finding descriptions of interesting subgroups. It has been shown that
the terminology, the task definitions and the rule learning heuristics of these three
mining techniques are compatible. The contrast set mining and emerging pattern mining
communities have primarily addressed categorical data, whereas the subgroup discovery
community has also considered numeric and relational data. Hence, we apply
subgroup discovery techniques to discover descriptions of FPs.
WRAcc optimizes two factors of a rule X -> Y: the rule coverage P(X) and the
distributional unusualness P(Y|X) - P(Y). The rule coverage is the relative size of the
subgroup. The distributional unusualness is the difference between the proportion of
positive examples covered by the rule and the proportion in the entire example set.
The algorithm starts by selecting the rule with the highest WRAcc. In each iteration,
after the best rule is selected, the weights of the covered positive examples are
decreased according to the number of rules c(e) covering each example e, with
w(e) = 1/c(e). Let p denote the positive examples correctly classified as positive by the
rule, n the negative examples wrongly classified as positive by the rule, and P and N
the numbers of all positive and negative examples in the dataset.
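In these terms, the weighted relative accuracy of a rule can be written as

WRAcc(X \rightarrow Y) = \frac{p + n}{P + N} \left( \frac{p}{p + n} - \frac{P}{P + N} \right)

As a small illustrative sketch (the example numbers are made up), this quantity can be computed as:

```python
def wracc(p, n, P, N):
    """Weighted relative accuracy of a rule covering p positive and n negative examples."""
    coverage = (p + n) / (P + N)              # P(X): relative size of the subgroup
    unusualness = p / (p + n) - P / (P + N)   # P(Y|X) - P(Y)
    return coverage * unusualness

# Example: a rule covering 40 positives and 10 negatives in a dataset of 100 positives and 900 negatives
print(wracc(p=40, n=10, P=100, N=900))   # 0.05 * (0.8 - 0.1) = 0.035
```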