Chapter 3: Background and Related Work
Explanation methods
It is vital to ensure that ML models make decisions for the right reasons.
Understanding why a model makes a particular prediction can help the user debug, gain
trust in, and further improve the existing model. This process falls into a research
area called interpretability. Interpretability is defined as the ability to explain or
to present in understandable terms to a human.
Simple models such as linear models and decision trees usually cannot handle
difficult tasks, but they can be interpreted through their weights and decision nodes.
Complex models such as ensemble methods or neural networks typically have
better predictive ability, but they are hard for humans to interpret. Hence, an
explanation model uses a simpler model to approximate the original complex
model locally and returns that approximation as the explanation.
LIME
The local explanation method LIME interprets an individual prediction by
learning an interpretable model locally. The intuition behind LIME is that it
samples instances both in the vicinity of and far away from the interpretable
representation of the original input. It then takes the interpretable
representation of these sampled points, obtains their predictions, and builds a
weighted linear model by minimizing a loss and a complexity term. The samples
are weighted by their distance from the original point: the weights decrease as
the points move further away. This is illustrated in Figure.
caption: The black curve is the decision boundary of a complex model f. The red
cross is the instance being explained. LIME samples instances (red and blue
dots), weights them based on their distances from the original instance
(represented by size), and gets their predictions using f. The dashed grey line is
the learned explanation model g that is locally faithful.
We now formalize this intuition. Let f denote the complex model being explained.
In classification, f(x) is the probability that an instance x belongs to a certain
class. Let g in G denote the explanation model, where G is a class of
interpretable models such as linear models and decision trees. In the current
version of LIME, G is the class of linear models, such that g(z') = w_g * z'.
To ensure interpretability and local fidelity, LIME produces the
explanation xi(x) by minimizing a loss and a complexity term. The loss L measures how
unfaithful g is in approximating f. Omega(g) measures the complexity of the
explanation model; for example, Omega(g) can be the number of non-zero
weights of a linear model or the depth of a decision tree. The proximity
measure between x and z is an exponential kernel denoted pi_x. Given the
dataset Z of perturbed samples with their associated labels, the explanation is
obtained by optimizing the following equation:
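\xi(x) = \arg\min_{g \in G} \; L(f, g, \pi_x) + \Omega(g)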
An advantage of LIME is that its runtime does not depend on the size of the
dataset, but on the complexity of f(x) and the number of samples.
However, any choice of interpretable representation and G has some
drawbacks. Since the complex model is treated as a black box, the chosen
interpretable representation might not fully explain certain behaviours.
Additionally, a linear model g cannot give a faithful local explanation if the input
model is highly non-linear.
SHAP
SHAP assigns each feature an importance value based on Shapley values from game
theory. It is the unique solution in the class of additive feature attribution methods that
satisfies three desired properties (described later). While these properties are
familiar from classical Shapley value estimation, they were previously
unknown for other additive feature attribution methods. We start by explaining
Shapley values.
The Shapley value is one way to distribute the total gains among the contributors,
assuming that all the contributors collaborate. It can be interpreted as the
expected marginal contribution: it is calculated by considering all possible
orders in which the contributors arrive and computing the marginal contribution of
each contributor. The following example illustrates this.
Suppose there are two players A and B in a coalitional game, and their
contributions v are v({A}) = 6, v({B}) = 10, v({A, B}) = 24. The following table
displays the marginal contributions of both players for each arrival order.

Arrival order    Marginal contribution of A      Marginal contribution of B
A, B             v({A}) - v({}) = 6              v({A, B}) - v({A}) = 18
B, A             v({A, B}) - v({B}) = 14         v({B}) - v({}) = 10

Therefore, the expected marginal contribution and Shapley value of player A is
0.5*6 + 0.5*14 = 10 and that of player B is 0.5*18 + 0.5*10 = 14.
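In general, for a model with feature set F, the Shapley value of feature i is

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f(S \cup \{i\}) - f(S) \right]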
where M is the number of interpretable input features and f(S) denotes the prediction
obtained using only the features in the subset S. The Shapley value is thus the marginal
contribution of feature i when it is added to the feature subset S, weighted by the number
of orderings in which this marginal contribution occurs divided by the number of
permutations of all the features.
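As a small, purely illustrative sketch, the following code enumerates all arrival orders for the two-player example above and recovers the Shapley values of 10 and 14.

```python
from itertools import permutations

# Coalition values from the two-player example
v = {frozenset(): 0,
     frozenset({"A"}): 6,
     frozenset({"B"}): 10,
     frozenset({"A", "B"}): 24}
players = ["A", "B"]

def shapley(player):
    # Average the player's marginal contribution over all arrival orders
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(player)])
        total += v[before | {player}] - v[before]
    return total / len(orders)

print(shapley("A"))  # 10.0
print(shapley("B"))  # 14.0
```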
SHAP is also guaranteed to satisfy the following three properties: local accuracy,
missingness, and consistency. Local accuracy requires the sum of the feature
contributions to equal the model's prediction. Missingness requires features that
are absent from the original input to have no impact. Consistency ensures that
when the model changes such that a feature has a higher impact on the model, the
importance of that feature never decreases. Lundberg et al. prove that the SHAP
method is the only locally accurate and consistent method that also satisfies
the missingness property.
Building on this theory, SHAP is combined with LIME and with tree-based
classification models to obtain full explanations for ML models.
Kernel SHAP
LIME chooses its parameters heuristically and does not recover the Shapley
values. Hence, the idea behind Kernel SHAP is to find the loss function L, the
weighting kernel pi_x, and the regularization term Omega that do recover the Shapley
values.
Lundberg et al. prove that the following choices of Omega, pi_x, and L make Equation
recover the Shapley values and satisfy the above three properties:
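\Omega(g) = 0, \qquad \pi_x(z') = \frac{M - 1}{\binom{M}{|z'|} \, |z'| \, (M - |z'|)}, \qquad L(f, g, \pi_x) = \sum_{z' \in Z} \left[ f(h_x(z')) - g(z') \right]^2 \pi_x(z')

where |z'| is the number of non-zero elements in z' and h_x maps the simplified inputs z' back to the original input space.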
The connection between linear regression and Shapley values is that the classical
Shapley value estimate (Equation) is a difference of means; since the mean is also the
best least-squares point estimate, a suitably weighted linear regression can recover the
Shapley values.
Tree SHAP
Popular feature attribution methods for tree ensembles are currently
inconsistent. This means that when a model changes such that a feature has a higher
impact on the model's output, these methods can actually lower the importance assigned
to that feature. Tree SHAP is proposed as a theoretical improvement that solves this
inconsistency problem.
Summary
We summarize the pros and cons of the two methods and connect their
advantages to the properties of a useful explanation. LIME provides a selective
explanation, which explains an instance with a small number of important features.
This satisfies the selectivity property of a good explanation, since humans usually
prefer sparse explanations. LIME is also capable of predicting how the probability of a
label changes when an input feature changes, which provides a good contrastive
explanation.
On the other hand, SHAP provides a full explanation, which is crucial for decisions
that need to consider all effects, for example assisting doctors in clinical
diagnosis or helping judges make careful decisions about defendants. SHAP also
provides contrastive explanations that allow the user to compare explanations at
the group level. In addition, a SHAP value cannot be interpreted, as in LIME, as the
probability decrease after removing the feature. Instead, the value should be interpreted
as the feature's contribution to the difference between the actual prediction and the
mean prediction.
FPs Elimination Framework
The framework for FPs elimination is illustrated in Figure. The proposed novel
approaches are shown in the grey frame. Firstly, we start by preprocessing the
transaction data. The transaction dataset has a large number of features and is
highly imbalanced; under-sampling and feature selection techniques are applied
to deal with these problems. Secondly, we build an ML model on the training set
and predict on a test set. The cases predicted as the positive class are considered
alerts, while the cases predicted as negative are non-alerts. Thirdly, we use the
explanation model Tree SHAP to obtain explanations for the alerts. Each
explanation of an alert consists of the SHAP values of all the features. After the
SHAP values are obtained, we propose the following two approaches to filter out
FPs automatically.
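As a minimal sketch of the pipeline up to the explanation step (synthetic data and a scikit-learn gradient boosting model stand in for the transaction data and the actual model, which are not specified here; the shap package provides Tree SHAP):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced data as a stand-in for the transaction dataset
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative tree-based model (the model and tuning used in the thesis may differ)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Alerts are the test cases predicted as positive
pred = model.predict(X_test)
alerts = X_test[pred == 1]

# Tree SHAP: one SHAP value per feature for every alert
# (for some model types shap returns a list with one array per class)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(alerts)
```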
In the first approach, SHAP values are applied to distinguish between FPs and
TPs. Since the SHAP values of the features decompose the output probability,
the sum of the SHAP values should reflect that TPs have a higher probability of being
an alert than FPs. Moreover, because the SHAP values decompose this confidence, we
expect that some features in this decomposition will dominate the true positive
classifications and other features the false positive classifications. Therefore, the
SHAP values are used as features to build another ML model for predicting
FPs and TPs.
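A minimal sketch of this first approach, continuing the snippet above; the second-stage classifier (a logistic regression here) and the evaluation setup are illustrative assumptions rather than the exact experimental configuration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Label each alert: 1 = true positive, 0 = false positive
is_tp = y_test[pred == 1]

# Second-stage model: predict TP vs FP from the SHAP values of the alerts
fp_filter = LogisticRegression(max_iter=1000)
auc = cross_val_score(fp_filter, shap_values, is_tp, scoring="roc_auc", cv=5)
print(auc.mean())
```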
In the second approach, SHAP values are applied to discriminate between FPs and TPs in
the original feature space. The elimination process then does not rely on the
SHAP values and can filter out FPs using the original features. We relate the SHAP
values to the original features by using clustering and subgroup discovery
techniques. The process is constructed as follows. Clustering is applied to find
structure in the SHAP values; after forming the clusters, the instances in each
cluster are dominated by similar features. The remaining problem is how
to extract patterns that distinguish FPs from TPs. Subgroup discovery is
applied to discover rules that distinguish FPs and TPs in each cluster. To
ensure the extracted rules also reach a certain confidence over all the alerts, the
support and confidence of each rule are computed. Finally, the rules with higher
FP confidence and lower TP support are used in the filter.
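A sketch of the second approach, again continuing the snippets above. KMeans is one possible clustering choice, and the example rule on the original features is purely hypothetical; in the proposed method such rules come from subgroup discovery rather than being hand-written.

```python
from sklearn.cluster import KMeans

# Cluster the alerts by their SHAP values, so each cluster is dominated by similar features
clusters = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(shap_values)

# Evaluate a candidate rule on the ORIGINAL feature space within one cluster,
# e.g. "feature 3 > 0.5" (hypothetical rule, for illustration only)
in_cluster = clusters == 0
rule = alerts[:, 3] > 0.5
covered = in_cluster & rule

support = covered.sum() / len(alerts)                                    # fraction of all alerts covered
fp_confidence = (covered & (is_tp == 0)).sum() / max(covered.sum(), 1)   # fraction of covered alerts that are FPs
print(support, fp_confidence)
```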
The two approaches are evaluated with the ROC curve as the performance metric,
comparing the ML model against the respective baseline. The extracted rules from the
second approach are presented to a domain expert to validate their genuineness.
Preprocessing
The transaction dataset has a large number of features and imbalanced classes. A
useful explanation requires an ML model with high accuracy; therefore,
the preprocessing step is necessary. We summarize the existing methods that deal
with class imbalance and feature selection, and then describe the methods chosen in
the experiments.
The imbalanced class problem can be addressed at different levels: the data level,
the algorithm level, and combining methods. At the algorithm level, the costs
of the classes are adjusted to balance them. For example, different weights can
be assigned to the respective classes in kNN, and a bias can be introduced in
SVM such that the learned hyperplane lies further away from the majority class. At
the data level, there are many forms of re-sampling, such as random under-
sampling, random over-sampling, and the over-sampling technique SMOTE. Combining
methods combine the results of many classifiers, each of which is
built on under- or over-sampled data with a different sampling ratio. To avoid
building extra models on the massive transaction dataset, we focus on data-level
methods. Over-sampling would add computational cost since
the size of our fraud dataset is already large. We apply random under-sampling
in the experiments. It balances the class distribution by randomly discarding
samples of the majority class. One major drawback is that this method can
discard examples that are important for the performance metric.
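A minimal sketch of random under-sampling, assuming the imbalanced-learn package and synthetic data that stands in for the transaction dataset:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced data as a stand-in for the transaction dataset
X, y = make_classification(n_samples=10000, n_features=30, weights=[0.99], random_state=0)
print(Counter(y))            # roughly 99% negatives, 1% positives

# Randomly discard majority class samples until the classes are balanced
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))        # equal class counts
```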
The main feature selection methods are filter, wrapper, and embedded methods.
Filter methods examine each feature individually to determine the strength of the
correlation between the input feature and the class labels. These methods do not rely on
a learning algorithm and are computationally light. However, because filter methods do
not take any classifier into account, ignoring the specific heuristics and biases of the
classifier might lower the classification accuracy. Additionally, features that are more
informative when combined with others, but less important on their own, could be
discarded. Wrapper methods use a classifier and evaluate feature subsets by their
predictive performance. Their main drawback is that the number of evaluations
required to obtain a feature subset can become computationally expensive.
Embedded methods incorporate feature selection into the training process.
Decision trees such as CART have a built-in mechanism to perform feature
selection, and the coefficients of regularized regression models can be used to select
features. After under-sampling, the size of the transaction dataset is
relatively small. Therefore, we apply a wrapper method using a Random Forest
classifier with 500 estimators.
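As a sketch of one possible wrapper procedure (the exact selection strategy used in the experiments is not specified here), forward sequential selection scored by a 500-tree Random Forest can be written as follows, reusing X_res and y_res from the under-sampling sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Wrapper selection: candidate feature subsets are scored by the cross-validated
# performance of the Random Forest itself
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
selector = SequentialFeatureSelector(
    rf, n_features_to_select=10, direction="forward", scoring="roc_auc", cv=3, n_jobs=-1
)
selector.fit(X_res, y_res)

print(selector.get_support())            # boolean mask over the original features
X_selected = selector.transform(X_res)   # reduced feature matrix
```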
Gradient Boosting
Gradient boosting creates a new model based on the difference between the current
approximation of the prediction and the target value, and then adds it to the current
approximation to make the final prediction. This difference is usually called the
residual or residual vector. The method is called gradient boosting because it performs
gradient descent: training each new model on the residual vector minimizes the loss
when the model is added.
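To make the residual-fitting idea concrete, here is a minimal toy implementation of gradient boosting for regression with squared loss (an illustration, not the library used in the experiments).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean prediction
trees = []

for _ in range(100):
    residual = y - prediction                       # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)   # add the new model to the ensemble
    trees.append(tree)

print(np.mean((y - prediction) ** 2))   # training loss shrinks as trees are added
```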
Clustering
Clustering algorithms are mostly divided into partitioning methods and
hierarchical methods. Partitioning methods usually assume a certain number
of clusters in advance, while hierarchical methods do not make this a priori
assumption. Hierarchical methods are further divided into agglomerative and
divisive methods. The agglomerative method uses a bottom-up approach, starting with
the individual data points and repeatedly merging the two most similar clusters.
In contrast, the divisive method uses a top-down approach, treating all the
data points as one big cluster and then dividing it into smaller clusters.
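A short illustration of the two families on toy data; scikit-learn offers no divisive method, so only the partitioning and agglomerative variants are sketched.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Partitioning method: the number of clusters is fixed in advance
kmeans_labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# Agglomerative (bottom-up) hierarchical method: starts from single points
# and repeatedly merges the most similar clusters
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)
```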
Subgroup Discovery
In the FPs elimination task, we try to filter out false alarms while retaining most
of the true positives. This can be done by using descriptions that distinguish between
the TP and FP groups. Three techniques share the objective of extracting descriptive
knowledge about a property of interest from data: contrast set mining, emerging pattern
mining, and subgroup discovery. Contrast set mining is a data mining technique for
finding differences between contrasting groups. Emerging pattern mining searches for
itemsets whose support increases significantly from one dataset to another. Subgroup
discovery aims at finding descriptions of interesting subgroups. It has been shown that
the terminology, the task definitions and the rule learning heuristics of these three
mining techniques are compatible. The contrast set mining and emerging pattern mining
communities have primarily addressed categorical data, whereas the subgroup discovery
community has also considered numeric and relational data. Hence, we apply
subgroup discovery techniques to discover descriptions of FPs.
WRAcc optimizes two factors of a rule X -> Y: the rule coverage P(X) and the
distributional unusualness P(Y|X) - P(Y). The rule coverage is the relative size of the
subgroup. The distributional unusualness is the difference between the proportion of
positive examples covered by the rule and the proportion in the entire example set.
The algorithm starts by selecting the rule with the highest WRAcc. In each iteration,
after the best rule is selected, the weights of the covered positive examples are
decreased according to the number of rules c(e) covering each example e, with
w(e) = 1/c(e). Let p denote the positive examples correctly classified as positive by the
rule, n the negative examples wrongly classified as positive by the rule, and P and N
the numbers of all positive and negative examples in the dataset.
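In these terms, the weighted relative accuracy of a rule can be written as

WRAcc(X \rightarrow Y) = \frac{p + n}{P + N} \left( \frac{p}{p + n} - \frac{P}{P + N} \right)

As a small illustrative sketch (the example numbers are made up), this quantity can be computed as:

```python
def wracc(p, n, P, N):
    """Weighted relative accuracy of a rule covering p positive and n negative examples."""
    coverage = (p + n) / (P + N)              # P(X): relative size of the subgroup
    unusualness = p / (p + n) - P / (P + N)   # P(Y|X) - P(Y)
    return coverage * unusualness

# Example: a rule covering 40 positives and 10 negatives in a dataset of 100 positives and 900 negatives
print(wracc(p=40, n=10, P=100, N=900))   # 0.05 * (0.8 - 0.1) = 0.035
```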