Evaluating Explanation Methods For Deep Learning in Security
Abstract: Deep learning is increasingly used as a building block of security systems. Unfortunately, neural networks are hard to interpret and typically opaque to the practitioner.

In contrast to other application domains of deep learning, computer security poses particular challenges for the use of explanation methods. First, security tasks, such as malware detection and binary code analysis,
arXiv:1906.02108v4 [cs.LG] 27 Apr 2020
Figure 1: Explanations for the prediction of the security system VulDeePecker on a code snippet from the original dataset. From top to bottom: Original code, LRP, LEMNA, and LIME.

Our evaluation highlights the need for comparing explanation methods and determining the best fit for a given security task. Furthermore, it also unveils a notable number of artifacts in the underlying datasets. For all of the four security tasks, we identify features that are unrelated to security but strongly contribute to the predictions. As a consequence, we argue that explanation methods need to become an integral part of learning-based security systems: first, for understanding the decision process of deep learning and, second, for eliminating artifacts in the training datasets.

The rest of this paper is organized as follows: We briefly review the technical background of explainable learning in Section 2. The explanation methods and security systems under test are described in Section 3. We introduce our criteria for comparing explanation methods in Section 4 and evaluate them in Section 5. Our qualitative analysis is presented in Section 6 and Section 7 concludes the paper.

Multilayer Perceptrons (MLPs). Multilayer perceptrons, also referred to as feedforward networks, are a classic and general-purpose network architecture [38]. The network is composed of multiple fully connected layers of neurons, where the first and last layer correspond to the input and output of the network, respectively. MLPs have been successfully applied to a variety of security problems, such as intrusion and malware detection [20, 23]. While MLP architectures are not necessarily complex, explaining the contribution of individual features is still difficult, as several neurons impact the decision when passing through the network layers.

Convolutional Neural Networks (CNNs). These networks share a similar architecture with MLPs, yet they differ in the concept of convolution and pooling [29]. The neurons in convolutional layers receive input only from a local neighborhood of the previous layer. These neighborhoods overlap and create receptive fields that provide a powerful primitive for identifying spatial structure in data. CNNs have thus been successfully used for detecting malicious patterns in the bytecode of Android applications [33]. Due to the convolution and pooling layers, however, it is hard to explain the decisions of a CNN, as its output needs to be "unfolded" and "unpooled" for analysis.
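The overlapping local neighborhoods and the pooling step described for CNNs can be illustrated with a small one-dimensional sketch. The helper names `conv1d` and `max_pool1d`, the filter, and the input values are illustrative assumptions and are not taken from any of the systems discussed in the paper.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation): each output only
    depends on a local neighborhood of len(kernel) inputs."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, size):
    """Non-overlapping max-pooling over windows of the given size."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

# Hypothetical input with a single active position (index 2).
x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, 1.0, 1.0])  # toy filter, not a trained one

# Index 2 lies in three overlapping neighborhoods, so it reaches three outputs.
features = conv1d(x, kernel)

# Pooling keeps the maximum of each window but not its position inside the window.
pooled = max_pool1d(features, 2)
```

Because pooling discards the position of the maximum, relevance arriving at a pooled output has to be redistributed to the inputs of the window, which is one way to read the "unpooling" mentioned above.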
Recurrent Neural Networks (RNNs). Recurrent networks, such as LSTM and GRU networks [9, 22], are characterized by a recurrent structure, that is, some neurons are connected in a loop. This structure enables memorizing information and allows RNNs to operate on sequences of data [16]. As a result, RNNs have been successfully applied in security tasks involving sequential data, such as the recognition of functions in native code [10, 41] or the discovery of vulnerabilities in software [30]. Interpreting the prediction of an RNN is also difficult, as the relevance of an input feature depends on the sequence of previously processed features.

2.2. Explanation Strategies

Given the different architectures and the complexity of many neural networks, decoding the entire decision process is a challenging task that currently cannot be solved adequately. However, there exist several recent methods that enable explaining individual predictions of a neural network instead of the complete decision process [e.g., 5, 21, 36, 47, 54]. We focus on this form of explainable learning that can be formally defined as follows:

Definition 1. Given an input vector $x = (x_1, \dots, x_d)$, a neural network $N$, and a prediction $f_N(x) = y$, an explanation method determines why the label $y$ has been selected by $N$. This explanation is given by a vector $r = (r_1, \dots, r_d)$ that describes the relevance of the dimensions of $x$ for $f_N(x)$.

The computed relevance values $r$ are typically real numbers and can be overlaid on the input in the form of a heatmap, such that relevant features are visually highlighted. An example of this visualization is depicted in Figure 1. Positive relevance values are shown in blue and indicate importance towards the prediction $f_N(x)$, whereas negative values are given in orange and indicate importance against the prediction. We will use this color scheme throughout the paper.¹

¹ We use the blue-orange color scheme instead of the typical green-red scheme to make our paper better accessible to color-blind readers.

Despite the variety of approaches for computing a relevance vector for a given neural network and an input, all approaches can be broadly categorized into two explanation strategies: black-box and white-box explanations.

Black-box Explanations. These methods operate under a black-box setting that assumes no knowledge about the neural network and its parameters. Black-box methods are an effective tool if no access to the neural network is available, for example, when a learning service is audited remotely. Technically, black-box methods rest on an approximation of the function $f_N$, which enables them to estimate how the dimensions of $x$ contribute to a prediction. Although black-box methods are a promising approach for explaining deep learning, they can be impaired by the black-box setting and omit valuable information provided through the network architecture and parameters.

White-box Explanations. These approaches operate under the assumption that all parameters of a neural network are known and can be used for determining an explanation. As a result, these methods do not rely on approximations and can directly compute explanations for the function $f_N$ on the structure of the network. In practice, predictions and explanations are often computed from within the same system, such that the neural network is readily available for generating explanations. This is usually the case for stand-alone systems for malware detection, binary analysis, and vulnerability discovery. However, several white-box methods are designed for specific network layouts from computer vision and are not applicable to all considered architectures [e.g., 43, 46, 54].

Black-box and white-box explanation methods often share similarities with concepts of adversarial learning and feature selection, as these also aim at identifying features related to the prediction of a classifier. However, adversarial learning and feature selection pursue fundamentally different goals and cannot be directly applied for explaining neural networks. We discuss the differences to these approaches for the interested reader in Appendix A.

3. Methods and Systems under Test

Before presenting our criteria for evaluating explanation methods, we first introduce the methods and systems under test. In particular, we cover six methods for explaining predictions in Section 3.1 and present four security systems based on deep learning in Section 3.2. For more information about explanation methods we do not evaluate in the paper [e.g., 12, 17], we refer the reader to Appendix B.

3.1. Explanation Methods

Table 1 provides an overview of popular explanation methods along with their support for the different network architectures. As we are interested in explaining predictions of security systems, we select those methods for our study that are applicable to all common architectures. In the following, we briefly sketch the main idea of these approaches for computing relevance vectors, illustrating the technical diversity of explanation methods.

TABLE 1: Popular explanation methods. The support for different neural network architectures is indicated by ✓. Methods evaluated in this paper are indicated by *.

Explanation methods                    MLP  CNN  RNN
Gradients* [43], IG* [47]              ✓    ✓    ✓
LRP* [5], DeepLift [42]                ✓    ✓    ✓
PatternNet, PatternAttribution [24]    ✓    ✓    –
DeConvNet [54], GuidedBP [46]          ✓    ✓    –
CAM [56], GradCAM [8, 39]              ✓    ✓    –
RTIS [11], MASK [17]                   ✓    ✓    –
LIME* [36], SHAP* [31], QII [12]       ✓    ✓    ✓
LEMNA* [21]                            ✓    ✓    ✓

Gradients and IG. One of the first white-box methods to compute explanations for neural networks has been introduced by Simonyan et al. [43] and is based on simple gradients. The output of the method is given by $r_i = \partial y / \partial x_i$, which the authors call a saliency map. Here $r_i$ measures how much $y$ changes with respect to $x_i$. Sundararajan et al. [47] extend this approach and propose Integrated Gradients (IG) that use a baseline $x'$, for instance a vector of zeros, and calculate the shortest path from $x'$
to $x$, given by $x - x'$. To compute the relevance of $x_i$, the gradients with respect to $x_i$ are cumulated along this path, yielding

$$ r_i = (x_i - x'_i) \int_0^1 \frac{\partial f_N\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha . $$

Both gradient-based methods can be applied to all relevant network architectures and thus are considered in our comparative evaluation of explanation methods.

LRP and DeepLift. These popular white-box methods determine the relevance of a prediction by performing a backward pass through the neural network, starting at the output layer and performing calculations until the input layer is reached [5]. The central idea of layer-wise relevance propagation (LRP) is the use of a conservation property that needs to hold true during the backward pass. If $r_i^l$ is the relevance of the unit $i$ in layer $l$ of the neural network, then

$$ \sum_i r_i^1 = \sum_i r_i^2 = \dots = \sum_i r_i^L $$

needs to hold true for all $L$ layers. Similarly, DeepLift performs a backward pass but takes a reference activation $y' = f_N(x')$ of a reference input $x'$ into account. The method enforces the conservation law

$$ \sum_i r_i = y - y' = \Delta y , $$

that is, the relevance assigned to the features must sum up to the difference between the outcome of $x$ and $x'$. Both approaches support explaining the decisions of feedforward, convolutional and recurrent neural networks [see 4]. However, as DeepLift and IG are closely related [2], we focus our study on the method $\epsilon$-LRP.

LIME and SHAP. Ribeiro et al. [36] introduce one of the first black-box methods for explaining neural networks that is further extended by Lundberg and Lee [31]. Both methods aim at approximating the decision function $f_N$ by creating a series of $l$ perturbations of $x$, denoted as $\tilde{x}_1, \dots, \tilde{x}_l$, by setting entries in the vector $x$ to 0 randomly. The methods then proceed by predicting a label $f_N(\tilde{x}_i) = \tilde{y}_i$ for each $\tilde{x}_i$ of the $l$ perturbations. This sampling strategy enables the methods to approximate the local neighborhood of $f_N$ at the point $f_N(x)$. LIME [36] approximates the decision boundary by a weighted linear regression model,

$$ \arg\min_{g \in G} \sum_{i=1}^{l} \pi_x(\tilde{x}_i) \big( f_N(\tilde{x}_i) - g(\tilde{x}_i) \big)^2 , $$

where $G$ is the set of all linear functions and $\pi_x$ is a function indicating the difference between the input $x$ and a perturbation $\tilde{x}$. SHAP [31] follows the same approach but uses the SHAP kernel as weighting function $\pi_x$, which is shown to create Shapley Values [40] when solving the regression. Shapley Values are a concept from game theory where the features act as players under the objective of finding a fair contribution of the features to the payout, in this case the prediction of the model. As both approaches can be applied to any learning model, we study them in our empirical evaluation.

LEMNA. As the last explanation method, we consider LEMNA, a black-box method specifically designed for security applications [21]. It uses a mixture regression model for approximation, that is, a weighted sum of $K$ linear models:

$$ f(x) = \sum_{j=1}^{K} \pi_j (\beta_j \cdot x + \epsilon_j) . $$

The parameter $K$ specifies the number of models, the random variables $\epsilon = (\epsilon_1, \dots, \epsilon_K)$ originate from a normal distribution $\epsilon_i \sim N(0, \sigma)$ and $\pi = (\pi_1, \dots, \pi_K)$ holds the weights for each model. The variables $\beta_1, \dots, \beta_K$ are the regression coefficients and can be interpreted as $K$ linear approximations of the decision boundary near $f_N(x)$.

3.2. Security Systems

As a field of application for the six explanation methods, we consider four recent security systems that employ deep learning (see Table 2). The systems cover the three major architectures/types introduced in Section 2.1 and comprise between 4 and 6 layers of different types.

TABLE 2: Overview of the considered security systems.

System         Publication        Type   # Layers
Drebin+        ESORICS'17 [20]    MLP    4
Mimicus+       CCS'18 [21]        MLP    4
DAMD           CODASPY'17 [33]    CNN    6
VulDeePecker   NDSS'18 [30]       RNN    5

Drebin+. The first system uses an MLP for identifying Android malware. The system has been proposed by Grosse et al. [20] and builds on features originally developed by Arp et al. [3]. The network consists of two hidden layers, each comprising 200 neurons. The input features are statically extracted from Android applications and cover data from the application's manifest, such as hardware details and requested permissions, as well as information based on the application's code, such as suspicious API calls and network addresses. To verify the correctness of our implementation, we train the system on the original Drebin dataset [3], where we use 75 % of the 129,013 Android applications for training and 25 % for testing. Table 3 shows the results of this experiment, which are in line with the performance published by Grosse et al. [20].

Mimicus+. The second system also uses an MLP but is designed to detect malicious PDF documents. The system is re-implemented based on the work of Guo et al. [21] and builds on features originally introduced by Smutz and Stavrou [45]. Our implementation uses two hidden layers with 200 nodes each and is trained with 135 features extracted from PDF documents. These features cover properties about the document structure, such as the number of sections and fonts in the document, and are mapped to binary values as described by Guo et al. [21]. For a full list of features, we refer the reader to the implementation by Šrndić and Laskov [50]. For verifying our implementation, we make use of the original dataset that contains 5,000 benign and 5,000 malicious PDF files and again split the dataset into 75 % for training and 25 %
for testing. Our results are shown in Table 3 and come close to a perfect detection.

TABLE 3: Performance of the re-implemented security systems on the original datasets.

System         Accuracy   Precision   Recall   F1-Score
Drebin+        0.980      0.926       0.924    0.925
Mimicus+       0.994      0.991       0.998    0.994
DAMD           0.949      0.967       0.924    0.953
VulDeePecker   0.908      0.837       0.802    0.819

DAMD. The third security system studied in our evaluation uses a CNN for identifying malicious Android applications [33]. The system processes the raw Dalvik bytecode of Android applications and its neural network consists of six layers for embedding, convolution, and max-pooling of the extracted instructions. As the system processes entire applications, the number of features depends on the size of the applications. For a detailed description of this process, we refer the reader to the publication by McLaughlin et al. [33]. To replicate the original results, we apply the system to data from the Malware Genome Project [57]. This dataset consists of 2,123 applications in total, with 863 benign and 1,260 malicious samples. We again split the dataset into 75 % of training and 25 % of testing data and obtain results similar to those presented in the original publication.

VulDeePecker. The fourth system uses an RNN for discovering vulnerabilities in source code [30]. The RNN consists of five layers, uses 300 LSTM cells [22], and applies a word2vec embedding [34] with 200 dimensions for analyzing C/C++ code. As a preprocessing step, the source code is sliced into code gadgets that comprise short snippets of tokens. The gadgets are truncated or padded to a length of 50 tokens. For verifying the correctness of our implementation, we use the CWE-119 dataset, which consists of 39,757 code gadgets, with 10,444 gadgets corresponding to vulnerabilities. In line with the original study, we split the dataset into 80 % training and 20 % testing data, and attain a comparable accuracy.

The four selected security systems provide a diverse view on the current use of deep learning in security. Drebin+ and Mimicus+ are examples of systems that make use of MLPs for detecting malware. However, they differ in the dimensionality of the input: While Mimicus+ works on a small set of engineered features, Drebin+ analyzes inputs with thousands of dimensions. DAMD is an example of a system using a CNN in security and capable of learning from large inputs, whereas VulDeePecker makes use of an RNN, similar to other learning-based approaches analyzing program code [e.g., 10, 41, 53].

4. Evaluation Criteria

In light of the broad range of available explanation methods, the practitioner is in need of criteria for selecting the best method for a security task at hand. In this section, we develop these criteria and demonstrate their utility in different examples. Before doing so, however, we address another important question: Do the considered explanation methods provide different results? If the methods generated similar explanations, criteria for their comparison would be less important and any suitable method could be chosen in practice.

To answer this question, we investigate the top-$k$ features of the six explanation methods when explaining predictions of the security systems. That is, we compare the set $T_i$ of the $k$ features with the highest relevance from method $i$ with the set $T_j$ of the $k$ features with the highest relevance from method $j$. In particular, we compute the intersection size

$$ \mathrm{IS}(i, j) = \frac{|T_i \cap T_j|}{k} , \qquad (1) $$

as a measure of similarity between the two methods. The intersection size lies between 0 and 1, where 0 indicates no overlap and 1 corresponds to identical top-$k$ features.

A visualization of the intersection size averaged over the samples of the four datasets is shown in Figure 3. We choose $k = 10$ according to a typical use case of explainable learning: An expert investigates the top-10 features to gain insights on a prediction. For DAMD, we use $k = 50$, as the dataset consists of long opcode sequences. We observe that the top features of the explanation methods differ considerably. For example, in the case of VulDeePecker, all methods determine different top-10 features. While we notice some similarity between the methods, it becomes clear that the methods cannot be simply interchanged, and there is a need for measurable evaluation criteria.

[Figure 3 panels: Mimicus+, Drebin+, VulDeePecker, DAMD. Each panel compares the methods Gradient, IG, LRP, SHAP, LEMNA, and LIME pairwise; the color scale ends at 1.0.]

Figure 3: Comparison of the top-10 features for the different explanation methods. An average value of 1 indicates identical top-10 features and a value of 0 indicates no overlap.

4.1. General Criteria: Descriptive Accuracy

As the first evaluation criterion, we introduce the descriptive accuracy. This criterion reflects how accurately an explanation method captures relevant features of a prediction. As it is difficult to assess the relation between features and a prediction directly, we follow an indirect
    data = NULL;
    data = new wchar_t[50];
    data[0] = L'\\0';
    wchar_t source[100];
    wmemset(source, L'C', 100-1);
    source[100-1] = L'\\0';
    memmove(data, source, 100*sizeof(wchar_t));

(a) Original code

    INT0 ] ;
    VAR0 [ INT0 ] = STR0 ;
    wchar_t VAR0 [ INT0 ] ;
    wmemset ( VAR0 , STR0 , INT0 - INT1 ) ;
    VAR0 [ INT0 - INT1 ] = STR0 ;
    memmove ( VAR0 , VAR1 , INT0 * sizeof ( wchar_t ) ) ;

(b) Integrated Gradients

    INT0 ] ;
    VAR0 [ INT0 ] = STR0 ;
    wchar_t VAR0 [ INT0 ] ;
    wmemset ( VAR0 , STR0 , INT0 - INT1 ) ;
    VAR0 [ INT0 - INT1 ] = STR0 ;
    memmove ( VAR0 , VAR1 , INT0 * sizeof ( wchar_t ) ) ;

TABLE 4: Explanations of LRP and LEMNA for a sample of the GoldDream family from the DAMD dataset.

Id   LRP                  LEMNA
0    invoke-virtual       invoke-virtual
1    move-result-object   move-result-object
2    if-eqz               if-eqz
3    const-string         const-string
4    invoke-virtual       invoke-virtual
5    move-result-object   move-result-object
6    check-cast           check-cast
7    array-length         array-length
8    new-array            new-array
9    const/4              const/4
10   array-length         array-length
11   if-ge                if-ge
12   aget-object          aget-object
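The intersection size from Equation (1) can be sketched in a few lines. This is a minimal illustration, assuming that each explanation is given as a relevance vector and that the top-k features are those with the highest relevance values; the function names and the example vectors are hypothetical and not part of the paper's framework.

```python
from itertools import combinations

import numpy as np

def top_k(relevance, k):
    """Indices of the k features with the highest relevance."""
    order = np.argsort(-np.asarray(relevance, dtype=float), kind="stable")
    return set(order[:k].tolist())

def intersection_size(r_i, r_j, k):
    """IS(i, j) = |T_i ∩ T_j| / k for two relevance vectors."""
    return len(top_k(r_i, k) & top_k(r_j, k)) / k

def average_intersection_size(relevance_vectors, k):
    """Mean IS over all pairs, e.g. over repeated runs of one method."""
    pairs = list(combinations(relevance_vectors, 2))
    return sum(intersection_size(a, b, k) for a, b in pairs) / len(pairs)

# Made-up relevance vectors of two methods for the same four-dimensional input.
r_a = [0.9, 0.1, 0.5, 0.3]   # top-2 features: {0, 2}
r_b = [0.8, 0.7, 0.1, 0.2]   # top-2 features: {0, 1}

similarity = intersection_size(r_a, r_b, k=2)   # one shared feature out of two
```

Averaging the same quantity over repeated runs of a single method, as `average_intersection_size` does, gives one possible way to quantify how much its explanations fluctuate.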
4.2. General Criteria: Descriptive Sparsity
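Section 5.3 measures sparsity with the mass around zero (MAZ) score, which Figure 5 plots over an interval size between 0 and 1. The following is a minimal sketch under the assumption that the relevance values are first scaled to the range [-1, 1] and that the score for an interval size t is the share of features whose scaled relevance lies within [-t, t]. The scaling step, the function name, and the example vectors are assumptions made for illustration.

```python
import numpy as np

def mass_around_zero(relevance, interval_sizes):
    """Share of features whose scaled relevance lies in [-t, t], for each t."""
    r = np.asarray(relevance, dtype=float)
    scale = np.abs(r).max()
    if scale > 0:                      # an all-zero explanation stays as it is
        r = r / scale                  # assumed scaling to [-1, 1]
    return np.array([np.mean(np.abs(r) <= t) for t in interval_sizes])

sizes = [0.0, 0.5, 1.0]

# Hypothetical explanations: one concentrates relevance on a single feature,
# the other spreads it over all features.
sparse_curve = mass_around_zero([0.0, 0.0, 0.1, -1.0], sizes)
dense_curve = mass_around_zero([0.6, -0.8, 0.9, -1.0], sizes)
```

Under this reading, a sparse explanation produces a curve that rises steeply near an interval size of 0, whereas a dense explanation stays flat until the interval is wide.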
4.3. Security Criteria: Completeness

After introducing two generic evaluation criteria, we start focusing on aspects that are especially important for the area of security. In a security system, an explanation method must be capable of creating proper results in all possible situations. If some inputs, such as pathological data or corner cases, cannot be processed by an explanation method, an adversary may trick the method into producing degenerated results. Consequently, we propose completeness as the first security-specific criterion.

Definition 4. An explanation method is complete, if it can generate non-degenerated explanations for all possible input vectors of the prediction function $f_N$.

Several white-box methods are complete by definition, as they calculate relevance vectors directly from the weights of the neural network. For black-box methods, however, the situation is different: If a method approximates the prediction function $f_N$ using random perturbations, it may fail to derive a valid estimate of $f_N$ and return degenerated explanations. We investigate this phenomenon in more detail in Section 5.4.

As an example of this problem, Table 5 shows explanations generated by the methods Gradients and SHAP for a benign Android application of the Drebin dataset. The Gradients explanation finds the touchscreen feature in combination with the launcher category and the internet permission as an explanation for the benign classification. SHAP, however, creates an explanation of zeros which provides no insights. The reason for this degenerated explanation is rooted in the random perturbations used by SHAP. By flipping the value of features, these perturbations aim at changing the class label of the input. As there exist far more benign features than malicious ones in the case of Drebin+, the perturbations can fail to switch the label and prevent the linear regression from working, resulting in a degenerated explanation.

4.4. Security Criteria: Stability

In addition to complete results, the explanations generated in a security system need to be reliable. That is, relevant features must not be affected by fluctuations and need to remain stable over time in order to be useful for an expert. As a consequence, we define stability as another security-specific evaluation criterion.

Definition 5. An explanation method is stable, if the generated explanations do not vary between multiple runs. That is, for any run $i$ and $j$ of the method, the intersection size of the top features $T_i$ and $T_j$ should be close to 1, that is, $\mathrm{IS}(i, j) > 1 - \epsilon$ for some small threshold $\epsilon$.

The stability of an explanation method can be empirically determined by running the methods multiple times and computing the average intersection size, as explained in the beginning of this section. White-box methods are deterministic by construction since they perform a fixed sequence of computations for generating an explanation. Most black-box methods, however, require random perturbations to compute their output, which can lead to different results for the same input. Table 6, for instance, shows the output of LEMNA for a PDF document from the Mimicus+ dataset over two runs. Some of the most relevant features from the first run receive very little relevance in the second run and vice versa, rendering the explanations unstable. We analyze these instabilities of the explanation methods in Section 5.5.

TABLE 6: Two explanations from LEMNA for the same sample computed in different runs.

Id    LEMNA (Run 1)       LEMNA (Run 2)
0     pos_page_min        pos_page_min
1     count_js            count_js
2     count_javascript    count_javascript
3     pos_acroform_min    pos_acroform_min
4     ratio_size_page     ratio_size_page
5     pos_image_min       pos_image_min
6     count_obj           count_obj
...   ...                 ...
27    pos_image_max       pos_image_max
28    count_page          count_page
29    len_stream_avg      len_stream_avg
30    pos_page_avg        pos_page_avg
31    count_stream        count_stream
32    moddate_tz          moddate_tz
33    len_stream_max      len_stream_max
34    count_endstream     count_endstream

4.5. Security Criteria: Efficiency

When operating a security system in practice, explanations need to be available in reasonable time. While low run-time is not a strict requirement in general, time differences between minutes and milliseconds are still significant. For example, when dealing with large amounts of data, it might be desirable for the analyst to create
explanations for every sample of an entire class. We thus define efficiency as a further criterion for explanation methods in security applications.

Definition 6. We consider a method efficient if it enables providing explanations without delaying the typical workflow of an expert.

As the workflow depends on the particular security task, we do not define concrete run-time numbers, yet we provide a negative example as an illustration. The run-time of the method LEMNA depends on the size of the inputs. For the largest sample of the DAMD dataset with 530,000 features, it requires about one hour for computing an explanation, which obstructs the workflow of inspecting Android malware severely.

4.6. Security Criteria: Robustness

As the last criterion, we consider the robustness of explanation methods to attacks. Recently, several attacks [e.g., 14, 44, 55] have shown that explanation methods may suffer from adversarial perturbations and can be tricked into returning incorrect relevance vectors, similarly to adversarial examples [7]. The objective of these attacks is to disconnect the explanation from the underlying prediction, such that arbitrary relevance values can be generated that do not explain the behavior of the model.

Definition 7. An explanation method is robust if the computed relevance vector cannot be decoupled from the prediction by an adversarial perturbation.

Unfortunately, the robustness of explanation methods is still not well understood and, similarly to adversarial examples, guarantees and strong defenses have not been established yet. To this end, we assess the robustness of the explanation methods based on the existing literature.

5. Evaluation

Equipped with evaluation criteria for comparing explanation methods, we proceed to empirically investigate these in different security tasks. To this end, we implement a comparison framework that integrates the six selected explanation methods and four security systems.

5.1. Experimental Setup

White-box Explanations. For our comparison framework, we make use of the iNNvestigate toolbox by Alber et al. [1] that provides efficient implementations for LRP, Gradients, and IG. For the security system VulDeePecker, we use our own LRP implementation [51] based on the publication by Arras et al. [4]. In all experiments, we set $\epsilon = 10^{-3}$ for LRP and use $N = 64$ steps for IG. Due to the high dimensional embedding space of VulDeePecker, we choose a step count of $N = 256$ in the corresponding experiments.

Black-box Explanations. We re-implement LEMNA in accordance with Guo et al. [21] and use the Python package cvxpy [13] to solve the linear regression problem with Fused Lasso restriction [52]. We set the number of mixture models to $K = 3$ and the number of perturbations to $l = 500$. The parameter $S$ is set to $10^4$ for Drebin+ and Mimicus+, as the underlying features are not sequential, and to $10^{-3}$ for the sequences of DAMD and VulDeePecker [see 21]. Furthermore, we implement LIME with $l = 500$ perturbations, use the cosine similarity as proximity measure, and employ the regression solver from the scipy package using L1 regularization. For SHAP we make use of the open-source implementation by Lundberg and Lee [31] including the KernelSHAP solver.

TABLE 7: Descriptive accuracy (DA) and sparsity (MAZ) for the different explanation methods.

(a) Area under the DA curves from Figure 5.

Method      Drebin+   Mimicus+   DAMD    VulDeePecker
LIME        0.580     0.257      0.919   0.571
LEMNA       0.656     0.405      0.983   0.764
SHAP        0.891     0.565      0.966   0.869
Gradients   0.472     0.213      0.858   0.856
IG          0.446     0.206      0.499   0.574
LRP         0.474     0.213      0.504   0.625

(b) Area under MAZ curves from Figure 5.

Method      Drebin+   Mimicus+   DAMD    VulDeePecker
LIME        0.757     0.752      0.833   0.745
LEMNA       0.681     0.727      0.625   0.416
SHAP        0.783     0.716      0.713   0.813
Gradients   0.846     0.856      0.949   0.816
IG          0.847     0.858      0.999   0.839
LRP         0.846     0.856      0.964   0.827

5.2. Descriptive Accuracy

We start our evaluation by measuring the descriptive accuracy (DA) of the explanation methods as defined in Section 4.1. In particular, we successively remove the most relevant features from the samples of the datasets and measure the decrease in the classification score. For Drebin+ and Mimicus+, we remove features by setting the corresponding dimensions to 0. For DAMD, we replace the most relevant instructions with the no-op opcode, and for VulDeePecker we substitute the selected tokens with an embedding vector of zeros.

The top row in Figure 5 shows the results of this experiment. As the first observation, we find that the DA curves vary significantly between the explanation methods and security systems. However, the methods IG and LRP consistently obtain strong results in all settings and show steep declines of the descriptive accuracy. Only on the VulDeePecker dataset can the black-box method LIME provide explanations with comparable accuracy. Notably, for the DAMD dataset, IG and LRP are the only methods to generate real impact on the outcome of the classifier. For Mimicus+, IG, LRP and Gradients achieve a perfect accuracy decline after only 25 features and thus the white-box explanation methods outperform the black-box methods in this experiment.

Table 7(a) shows the area under curve (AUC) for the descriptive accuracy curves from Figure 5. We observe that IG is the best method over all datasets (lower values indicate better explanations), followed by LRP. In comparison to other methods it is up to 48 % better on average.
[Figure 5 panels: Drebin+, Mimicus+, VulDeePecker, DAMD. Top row: ADA over the number of removed features. Bottom row: MAZ over the interval size from 0 to 1. Legend: LRP, IG, Gradient, LIME, LEMNA, KernelSHAP.]

Figure 5: Descriptive accuracy and sparsity for the considered explanation methods. Top row: Average descriptive accuracy (ADA); bottom row: sparsity measured as mass around zero (MAZ).
Intuitively, this considerable difference between the white-box and black-box methods makes sense, as white-box approaches can utilize internal information of the neural networks that is not available to black-box methods.

5.3. Descriptive Sparsity

We proceed by investigating the sparsity of the generated explanations with the MAZ score defined in Section 4.2. The second row in Figure 5 shows the result of this experiment for all datasets and methods. We observe that the methods IG, LRP, and Gradients show the steepest slopes and assign the majority of features little relevance, which indicates a sparse distribution. By contrast, the other explanation methods provide flat slopes of the MAZ close to 0, as they generate relevance values with a broader range and thus are less sparse.

For Drebin+ and Mimicus+, we observe an almost identical level of sparsity for LRP, IG, and Gradients, supporting the findings from Figure 3. Interestingly, for VulDeePecker, the MAZ curve of LEMNA shows a strong increase close to 1, indicating that it assigns high relevance to many tokens. While this is generally undesirable, in the case of LEMNA it is rooted in the basic design and the use of the Fused Lasso constraint. In the case of DAMD, we see a massive peak at 0 for IG, showing that it marks almost all features as irrelevant. According to the previous experiment, however, it simultaneously provides a very good accuracy on this data. The resulting sparse and accurate explanations are particularly advantageous for a human analyst, since the DAMD dataset contains samples with up to 520,000 features. The explanations from IG provide a compressed yet accurate representation of the sequences which can be inspected easily.

We summarize the performance on the MAZ metric by calculating the area under curve and report it in Table 7(b). A high AUC indicates that more features have been assigned a relevance close to 0, that is, the explanation is more sparse. We find that the best methods again are white-box approaches, providing explanations that are up to 50 % sparser compared to the other methods in this experiment.

5.4. Completeness of Explanations

We further examine the completeness of the explanations. As shown in Section 4.3, some explanation methods cannot calculate meaningful relevance values for all inputs. In particular, perturbation-based methods suffer from this problem, since they determine a regression with labels derived from random perturbations. To investigate this problem, we monitor the creation of perturbations and their labels for the different datasets.

When creating perturbations for some sample x, it is essential for black-box methods that a fraction p of them is classified as belonging to the opposite class of x. In an optimal case one can achieve p ≈ 0.5; however, during our experiments we find that 5 % can be sufficient to calculate a non-degenerate explanation in some cases. Figure 6 shows for each value of p and all datasets the fraction of samples remaining when enforcing a percentage p of perturbations from the opposite class.

In general, we observe that creating malicious perturbations from benign samples is a hard problem, especially for Drebin+ and DAMD. For example, in the Drebin+ dataset only 31 % of the benign samples can obtain a p value of 5 %, which means that more than 65 % of the whole dataset suffer from degenerate explanations. A detailed calculation for all datasets with a p value of 5 % can be found in Table 12 in Appendix C.

The problem of incomplete explanations is rooted in the imbalance of features characterizing malicious and benign data in the datasets. While only few features make a sample malicious, there exists a large variety of features turning
a sample benign. As a consequence, randomly setting malicious features to zero leads to a benign classification, while setting benign features to zero usually does not impact the prediction. It is therefore often not possible to explain predictions for benign applications and the analyst is stuck with an empty explanation.

In summary, we argue that perturbation-based explanation methods should only be used in security settings where incomplete explanations can be compensated by other means. In all other cases, one should refrain from using these black-box methods in the context of security.

Figure 6: Perturbation label statistics of the datasets (Drebin+, Mimicus+, VulDeePecker, DAMD). For each percentage of perturbations from the other class the percentage of samples achieving this number is shown. Left panel: Class-, x-axis: perturbations from Class+; right panel: Class+, x-axis: perturbations from Class-; y-axis: samples remaining.

5.5. Stability of Explanations

We proceed to evaluate the stability of the explanation methods when processing inputs from the four security systems. To this end, we apply the explanations to the same samples over multiple runs and measure the average intersection size between the runs.

Table 8 shows the average intersection size between the top k features for three runs of the methods as defined in Equation 1. We use k = 10 for all datasets except for DAMD, where we use k = 50 due to the larger input space. Since the outputs of Gradients, IG, and LRP are deterministic, they reach the perfect score of 1.0 in all settings and thus do not suffer from limitations concerning stability.

For the perturbation-based methods, however, stability poses a severe problem, since none of those methods obtains an intersection size of more than 0.5. This indicates that on average half of the top features do not overlap when computing explanations on the same input. Furthermore, we see that the assumption of locality of the perturbation-based methods does not apply for all models under test, since the output is highly dependent on the perturbations used to approximate the decision boundary. Therefore, the best methods for the stability criterion beat the perturbation-based methods by a factor of at least 2.5 on all datasets.

TABLE 8: Average intersection size between top features for multiple runs. Values close to one indicate greater stability.

Method      Drebin+   Mimicus+   DAMD    VulDeePecker
LIME        0.480     0.446      0.040   0.446
LEMNA       0.4205    0.304      0.016   0.416
SHAP        0.257     0.411      0.007   0.440
Gradients   1.000     1.000      1.000   1.000
IG          1.000     1.000      1.000   1.000
LRP         1.000     1.000      1.000   1.000

5.6. Efficiency of Explanations

We finally examine the efficiency of the different explanation methods. Our experiments are performed on a regular server system with an Intel Xeon E5 v3 CPU at 2.6 GHz. It is noteworthy that the methods Gradients, IG, and LRP can benefit from computations on a graphics processing unit (GPU). We therefore report both results but use only the CPU results to achieve a fair comparison with the black-box methods.

Table 9 shows the average run-time per input for all explanation methods and security systems. We observe that Gradients and LRP achieve the highest throughput in general, beating the other methods by orders of magnitude. This advantage arises from the fact that data can be processed batch-wise for methods like Gradients, IG, and LRP, that is, explanations can be calculated for a set of samples at the same time. The Mimicus+ dataset, for example, can be processed in one batch, resulting in a speed-up factor of more than 16,000× over the fastest black-box method. In general, we note that the white-box methods Gradients and LRP achieve the fastest run-time since they require a single backwards-pass through the network. Moreover, computing these methods on a GPU results in additional speedups of a factor up to three.

TABLE 9: Run-time per sample in seconds. Note the range of the different times from microseconds to minutes.

Method      Drebin+       Mimicus+      DAMD          VulDeePecker
LIME        3.1 × 10^-2   2.8 × 10^-2   7.4 × 10^-1   3.0 × 10^-2
LEMNA       4.6           2.6           6.9 × 10^2    6.1
SHAP        9.1           4.3 × 10^-1   4.5           5.0
Gradients   8.1 × 10^-3   7.8 × 10^-6   1.1 × 10^-2   7.6 × 10^-4
IG          1.1 × 10^-1   5.4 × 10^-5   6.9 × 10^-1   4.0 × 10^-1
LRP         8.4 × 10^-3   1.7 × 10^-6   1.3 × 10^-2   2.9 × 10^-2

GPU         Drebin+       Mimicus+      DAMD          VulDeePecker
Gradients   7.4 × 10^-3   3.9 × 10^-6   3.5 × 10^-3   3.0 × 10^-4
IG          1.5 × 10^-2   3.9 × 10^-5   3.0 × 10^-1   1.3 × 10^-1
LRP         7.3 × 10^-3   1.6 × 10^-6   7.8 × 10^-3   1.1 × 10^-2

The run-time of the black-box methods increases for high dimensional datasets, especially DAMD, since the regression problems need to be solved in higher dimensions. While the speed-up factors are already enormous, we have not even included the creation of perturbations and their classification, which consume additional run-time as well.

5.7. Robustness of Explanations

Recently, multiple authors have shown that adversarial perturbations are also applicable against explanation methods and can manipulate the generated relevance values. Given a classification function f, an input x, and a target class c_t, the goal of an adversarial perturbation is to find x̃ = x + δ such that δ is minimal but at the same time f(x̃) = c_t ≠ f(x).
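The goal of an adversarial perturbation stated above can also be written as a constrained optimization problem. This is only a sketch of the standard formulation: the text does not say which norm measures the size of δ, so the p-norm below is an assumption.

```latex
\[
\begin{aligned}
\min_{\delta}\quad & \lVert \delta \rVert_{p} \\
\text{s.t.}\quad & f(x + \delta) = c_t, \qquad c_t \neq f(x)
\end{aligned}
\]
```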
TABLE 10: Results of the evaluated explanation methods. The last column summarizes these metrics in a rating comprising three levels: strong, medium, and weak. [The rating symbols, including the entries of the Robustness and Overall Rating columns, are not legible in this copy and are omitted.]

Explanation Method   Accuracy   Sparsity   Completeness   Stability   Efficiency
LIME                 0.582      0.772      –              0.353       2.1 × 10^-1 s
LEMNA                0.702      0.612      –              0.289       1.8 × 10^2 s
SHAP                 0.823      0.757      –              0.279       4.8 s
Gradients            0.600      0.867      ✓              1.000       5.0 × 10^-3 s
IG                   0.431      0.886      ✓              1.000       3.0 × 10^-1 s
LRP                  0.454      0.873      ✓              1.000       5.0 × 10^-2 s
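The Stability column of Table 10 appears to be the mean of the per-dataset intersection sizes in Table 8. The following sketch checks this arithmetic for the black-box methods and also shows how an average intersection size over several runs could be computed. The helper names are hypothetical, and the pairwise definition |T_a ∩ T_b| / k is an assumed reading of Equation 1, which is not reproduced in this section.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Set


def top_k(relevance: Sequence[float], k: int) -> Set[int]:
    """Indices of the k features with the largest relevance values."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return set(ranked[:k])


def average_intersection_size(runs: Sequence[Sequence[float]], k: int) -> float:
    """Mean of |T_a ∩ T_b| / k over all pairs of runs (assumed form of Equation 1)."""
    tops = [top_k(r, k) for r in runs]
    return mean(len(a & b) / k for a, b in combinations(tops, 2))


# Three toy runs of an unstable explanation method on the same input.
toy_runs = [
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.1, 0.8, 0.0],
    [0.9, 0.8, 0.1, 0.0],
]
toy_stability = average_intersection_size(toy_runs, k=2)

# Values copied from Table 8 (Drebin+, Mimicus+, DAMD, VulDeePecker).
TABLE_8: Dict[str, List[float]] = {
    "LIME": [0.480, 0.446, 0.040, 0.446],
    "LEMNA": [0.4205, 0.304, 0.016, 0.416],
    "SHAP": [0.257, 0.411, 0.007, 0.440],
}
stability = {method: round(mean(values), 3) for method, values in TABLE_8.items()}
print(stability)
```

The rounded means agree with the Stability entries 0.353, 0.289, and 0.279 reported for LIME, LEMNA, and SHAP.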
Figure 7: Intersection size of the Top-10 features of explanations obtained from models that were stolen from the original model of the Drebin+ and Mimicus+ dataset.

Table 11 (rows on this page; columns: class, feature, frequency in benign documents, frequency in malicious documents):

–   pdfid1_num          81.5 %     2.8 %
–   title_uc            68.6 %     4.8 %
–   pos_eof_min        100.0 %    93.4 %
+   count_javascript     6.0 %    88.0 %
+   count_js             5.2 %    83.4 %
+   count_trailer       89.3 %    97.7 %
+   pos_page_avg       100.0 %   100.0 %
+   count_endobj       100.0 %    99.6 %
+   createdate_tz       85.5 %    99.9 %
+   count_action        16.4 %    73.8 %

5.9. Model Stealing for White-Box Explanations
Our experiments show that practitioners from the security domain should favor white-box methods over black-box methods when aiming to explain neural networks. However, there are cases when access to the parameters of the system is not available and white-box methods cannot be used. Instead of using black-box methods, one could also use model stealing to obtain an approximation of the original network [49]. This approach assumes that the user can predict an unlimited number of samples with the model to be explained. The obtained predictions can then be used to train a surrogate model which might have a different architecture but a similar behavior.

To evaluate the differences between the explanations of surrogate models and those of the original ones, we conduct an experiment on the Drebin+ and Mimicus+ datasets as follows: We use the predictions of the original model from Grosse et al. [20], which has two dense layers with 200 units each, and use these predictions to train three surrogate models. The number of layers is varied to be [1, 2, 3] and the number of units in each layer is always 200, resulting in models with lower, the original, and higher complexity. For each model we calculate explanations via LRP and compute the intersection size given by Equation 1 for k = 10.

The results in Figure 7 show that, for the Drebin+ dataset, the models deliver similar explanations to the original model (IS≈0.7) although having different architectures. However, the similarity between the stolen models is clearly higher (IS≈0.85). For the Mimicus+ dataset, we observe a general stability of the learned features at a lower level (IS≈0.55). These results indicate that the explanations of the stolen models are better than those obtained from black-box methods (see Figure 3) but still deviate from the original model, i.e., there is no global transferability between the explanations. Overall, model stealing can be considered a good alternative to the usage of black-box explanation methods.

Moreover, we publish the generated explanations from all datasets and methods on the project's website² in order to foster future research.

6.1. Insights on Mimicus+

When inspecting explanations for the Mimicus+ system, we observe that the features for detecting malware are dominated by count_javascript and count_js, which both stand for the number of JavaScript elements in the document. The strong impact of these elements is meaningful, as JavaScript is frequently used in malicious PDF documents [27]. However, we also identify features in the explanations that are non-intuitive. For example, features like count_trailer, which measures the number of trailer sections in the document, or count_box_letter, which counts the number of US letter sized boxes, can hardly be related to security and rather constitute artifacts in the dataset captured by the learning process.

To further investigate the impact of JavaScript features on the neural network, we determine the distribution of the top 5 features from the method IG for each class in the entire dataset. It turns out that JavaScript appears in 88 % of the malicious documents, whereas only about 6 % of the benign samples make use of it (see Table 11). This makes JavaScript an extremely discriminating feature for the dataset. From a security perspective, this is an unsatisfying result, as the neural network of Mimicus+ relies on a few indicators for detecting the malicious code in the documents. An attacker could potentially evade Mimicus+ by not using JavaScript or obfuscating the JavaScript elements in the document.

6.2. Insights on Drebin+