
Evaluating Explanation Methods for Deep Learning in Security

Alexander Warnecke∗, Daniel Arp∗, Christian Wressnegger†, and Konrad Rieck∗
∗ Technische Universität Braunschweig, Germany
† Karlsruhe Institute of Technology, Germany

Abstract—Deep learning is increasingly used as a building block of security systems. Unfortunately, neural networks are hard to interpret and typically opaque to the practitioner. The machine learning community has started to address this problem by developing methods for explaining the predictions of neural networks. While several of these approaches have been successfully applied in the area of computer vision, their application in security has received little attention so far. It is an open question which explanation methods are appropriate for computer security and what requirements they need to satisfy. In this paper, we introduce criteria for comparing and evaluating explanation methods in the context of computer security. These cover general properties, such as the accuracy of explanations, as well as security-focused aspects, such as the completeness, efficiency, and robustness. Based on our criteria, we investigate six popular explanation methods and assess their utility in security systems for malware detection and vulnerability discovery. We observe significant differences between the methods and build on these to derive general recommendations for selecting and applying explanation methods in computer security.

1. Introduction

Over the last years, deep learning has been increasingly recognized as an effective tool for computer security. Different types of neural networks have been integrated into security systems, for example, for malware detection [20, 23, 33], binary analysis [10, 41, 53], and vulnerability discovery [30]. Deep learning, however, suffers from a severe drawback: Neural networks are hard to interpret and their decisions are opaque to practitioners. Even simple tasks, such as determining which features of an input contribute to a prediction, are challenging to solve on neural networks. This lack of transparency is a considerable problem in security, as black-box learning systems are hard to audit and protect from attacks [7, 35].

The machine learning community has started to develop methods for interpreting deep learning in computer vision [e.g., 5, 43, 54]. These methods enable tracing back the predictions of neural networks to individual regions in images and thereby help to understand the decision process. These approaches have been further extended to also explain predictions on text and sequences [4, 21]. Surprisingly, this work has received little attention in security and there exists only a single technique that has been investigated so far [21].

In contrast to other application domains of deep learning, computer security poses particular challenges for the use of explanation methods. First, security tasks, such as malware detection and binary code analysis, require complex neural network architectures that are challenging to investigate. Second, explanation methods in security do not only need to be accurate but also satisfy security-specific requirements, such as complete and robust explanations. As a result of these challenges, it is an unanswered question which of the available explanation methods can be applied in security and what properties they need to possess for providing reliable results.

In this paper, we address this problem and develop evaluation criteria for assessing and comparing explanation methods in security. Our work provides a bridge between deep learning in security and explanation methods developed for other application domains of machine learning. Consequently, our criteria for judging explanations cover general properties of deep learning as well as aspects that are especially relevant to the domain of security.

General evaluation criteria. As general criteria, we consider the descriptive accuracy and sparsity of explanations. These properties reflect how accurately and concisely an explanation method captures relevant features of a prediction. While accuracy is an evident criterion for obtaining reliable results, sparsity is another crucial constraint in security. In contrast to computer vision, where an analyst can examine an entire image, a security practitioner cannot investigate large sets of features at once, and thus sparsity becomes an essential property when non-graphic data is analyzed.

Security evaluation criteria. We define the completeness, stability, robustness, and efficiency of explanations as security criteria. These properties ensure that reliable explanations are available to a practitioner in all cases and in reasonable time—requirements that are less important in other areas of deep learning. For example, an attacker may expose pathologic inputs to a security system that mislead, corrupt, or slow down the computation of explanations. Note that the robustness of explanation methods to adversarial examples is not well understood yet, and thus we base our analysis on the recent work by Zhang et al. [55] and Dombrowski et al. [14].

With the help of these criteria, we analyze six recent explanation methods and assess their performance in different security tasks. To this end, we implement four security systems from the literature that make use of deep learning and enable detecting Android malware [20, 33], malicious PDF files [50], and security vulnerabilities [30],
respectively. When explaining the decisions of these systems, we observe significant differences between the methods in all criteria. Some methods are not capable of providing sparse results, whereas others struggle with structured security data or suffer from unstable outputs. While the importance of the individual criteria depends on the particular task, we find that the methods IG [47] and LRP [5] comply best with all criteria and resemble general-purpose techniques for security systems.

To demonstrate the utility of explainable learning, we also qualitatively examine the generated explanations. As an example for this investigation, Figure 1 shows three explanations for the system VulDeePecker [30] that identifies vulnerabilities in source code. While the first explanation method provides a nuanced representation of the relevant features, the second method generates an unsharp explanation due to a lack of sparsity. The third approach provides an explanation that even contradicts the first one. Note that the variables VAR2 and VAR3 receive a positive relevance (blue) in the first case and a negative relevance (orange) in the third.

    Original code:
    1 c = split(arg[i],"=",&n);
    2 block_flgs = strcpy((xmalloc(strlen(c[1]) + 1)),c[1]);

    Tokenized slice shown for each explanation:
    1 VAR0 = FUN0 ( VAR1 [ VAR2 ] , STR0 , & VAR3 ) ;
    2 VAR0 = strcpy ( ( FUN0 ( strlen ( VAR1 [ INT0 ] ) + INT0 ) ) , VAR1 [ INT0 ] ) ;

Figure 1: Explanations for the prediction of the security system VulDeePecker on a code snippet from the original dataset. From top to bottom: Original code, LRP, LEMNA, and LIME.

Our evaluation highlights the need for comparing explanation methods and determining the best fit for a given security task. Furthermore, it also unveils a notable number of artifacts in the underlying datasets. For all of the four security tasks, we identify features that are unrelated to security but strongly contribute to the predictions. As a consequence, we argue that explanation methods need to become an integral part of learning-based security systems—first, for understanding the decision process of deep learning and, second, for eliminating artifacts in the training datasets.

The rest of this paper is organized as follows: We briefly review the technical background of explainable learning in Section 2. The explanation methods and security systems under test are described in Section 3. We introduce our criteria for comparing explanation methods in Section 4 and evaluate them in Section 5. Our qualitative analysis is presented in Section 6 and Section 7 concludes the paper.

2. Explainable Deep Learning

Neural networks have been used in artificial intelligence for over 50 years, yet concepts for explaining their decisions have just recently started to be explored. This development has been driven by the remarkable progress of deep learning in several areas, such as image recognition [28] and machine translation [48]. To embed our work in this context, we briefly review two aspects of explainable learning that are crucial for its application in security: the type of neural network and the explanation strategy.

2.1. Neural Network Architectures

Different architectures can be used for constructing a neural network, ranging from general-purpose networks to highly specific architectures. In the area of security, three of these architectures are prevalent: multilayer perceptrons, convolutional neural networks, and recurrent neural networks (see Figure 2). Consequently, we focus our study on these network types and refer the reader to the books by Rojas [37] and Goodfellow et al. [19] for a detailed description of network architectures in general.

Figure 2: Overview of network architectures in security: Multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN). Panels: (a) MLP layer, (b) CNN layer, (c) RNN layer.

Multilayer Perceptrons (MLPs). Multilayer perceptrons, also referred to as feedforward networks, are a classic and general-purpose network architecture [38]. The network is composed of multiple fully connected layers of neurons, where the first and last layer correspond to the input and output of the network, respectively. MLPs have been successfully applied to a variety of security problems, such as intrusion and malware detection [20, 23]. While MLP architectures are not necessarily complex, explaining the contribution of individual features is still difficult, as several neurons impact the decision when passing through the network layers.

Convolutional Neural Networks (CNNs). These networks share a similar architecture with MLPs, yet they differ in the concept of convolution and pooling [29]. The neurons in convolutional layers receive input only from a local neighborhood of the previous layer. These neighborhoods overlap and create receptive fields that provide a powerful primitive for identifying spatial structure in data. CNNs have thus been successfully used for detecting malicious patterns in the bytecode of Android applications [33]. Due to the convolution and pooling layers, however, it is hard to explain the decisions of a CNN, as its output needs to be "unfolded" and "unpooled" for analysis.
Recurrent Neural Networks (RNNs). Recurrent networks, such as LSTM and GRU networks [9, 22], are characterized by a recurrent structure, that is, some neurons are connected in a loop. This structure enables memorizing information and allows RNNs to operate on sequences of data [16]. As a result, RNNs have been successfully applied in security tasks involving sequential data, such as the recognition of functions in native code [10, 41] or the discovery of vulnerabilities in software [30]. Interpreting the prediction of an RNN is also difficult, as the relevance of an input feature depends on the sequence of previously processed features.

2.2. Explanation Strategies

Given the different architectures and the complexity of many neural networks, decoding the entire decision process is a challenging task that currently cannot be solved adequately. However, there exist several recent methods that enable explaining individual predictions of a neural network instead of the complete decision process [e.g., 5, 21, 36, 47, 54]. We focus on this form of explainable learning that can be formally defined as follows:

Definition 1. Given an input vector x = (x1, . . . , xd), a neural network N, and a prediction fN(x) = y, an explanation method determines why the label y has been selected by N. This explanation is given by a vector r = (r1, . . . , rd) that describes the relevance of the dimensions of x for fN(x).

The computed relevance values r are typically real numbers and can be overlayed with the input in form of a heatmap, such that relevant features are visually highlighted. An example of this visualization is depicted in Figure 1. Positive relevance values are shown in blue and indicate importance towards the prediction fN(x), whereas negative values are given in orange and indicate importance against the prediction. We will use this color scheme throughout the paper¹.

1. We use the blue-orange color scheme instead of the typical green-red scheme to make our paper better accessible to color-blind readers.

Despite the variety of approaches for computing a relevance vector for a given neural network and an input, all approaches can be broadly categorized into two explanation strategies: black-box and white-box explanations.

Black-box Explanations. These methods operate under a black-box setting that assumes no knowledge about the neural network and its parameters. Black-box methods are an effective tool if no access to the neural network is available, for example, when a learning service is audited remotely. Technically, black-box methods rest on an approximation of the function fN, which enables them to estimate how the dimensions of x contribute to a prediction. Although black-box methods are a promising approach for explaining deep learning, they can be impaired by the black-box setting and omit valuable information provided through the network architecture and parameters.

White-box Explanations. These approaches operate under the assumption that all parameters of a neural network are known and can be used for determining an explanation. As a result, these methods do not rely on approximations and can directly compute explanations for the function fN on the structure of the network. In practice, predictions and explanations are often computed from within the same system, such that the neural network is readily available for generating explanations. This is usually the case for stand-alone systems for malware detection, binary analysis, and vulnerability discovery. However, several white-box methods are designed for specific network layouts from computer vision and not applicable to all considered architectures [e.g., 43, 46, 54].

Black-box and white-box explanation methods often share similarities with concepts of adversarial learning and feature selection, as these also aim at identifying features related to the prediction of a classifier. However, adversarial learning and feature selection pursue fundamentally different goals and cannot be directly applied for explaining neural networks. We discuss the differences to these approaches for the interested reader in Appendix A.

3. Methods and Systems under Test

Before presenting our criteria for evaluating explanation methods, we first introduce the methods and systems under test. In particular, we cover six methods for explaining predictions in Section 3.1 and present four security systems based on deep learning in Section 3.2. For more information about explanation methods we do not evaluate in the paper [e.g., 12, 17] we refer the reader to the Appendix B.

3.1. Explanation Methods

Table 1 provides an overview of popular explanation methods along with their support for the different network architectures. As we are interested in explaining predictions of security systems, we select those methods for our study that are applicable to all common architectures. In the following, we briefly sketch the main idea of these approaches for computing relevance vectors, illustrating the technical diversity of explanation methods.

TABLE 1: Popular explanation methods. The support for different neural network architectures is indicated by ✓. Methods evaluated in this paper are indicated by *.

    Explanation methods                    MLP  CNN  RNN
    Gradients* [43], IG* [47]              ✓    ✓    ✓
    LRP* [5], DeepLift [42]                ✓    ✓    ✓
    PatternNet, PatternAttribution [24]    ✓    ✓    –
    DeConvNet [54], GuidedBP [46]          ✓    ✓    –
    CAM [56], GradCAM [8, 39]              ✓    ✓    –
    RTIS [11], MASK [17]                   ✓    ✓    –
    LIME* [36], SHAP* [31], QII [12]       ✓    ✓    ✓
    LEMNA* [21]                            ✓    ✓    ✓

Gradients and IG. One of the first white-box methods to compute explanations for neural networks has been introduced by Simonyan et al. [43] and is based on simple gradients. The output of the method is given by ri = ∂y/∂xi, which the authors call a saliency map. Here ri measures how much y changes with respect to xi. Sundararajan et al. [47] extend this approach and propose Integrated Gradients (IG) that use a baseline x0, for instance a vector of zeros, and calculate the shortest path from x0
to x, given by x − x0. To compute the relevance of xi, the gradients with respect to xi are cumulated along this path yielding

    r_i = (x_i − x0_i) ∫₀¹ [∂f_N(x0 + α(x − x0)) / ∂x_i] dα.

Both gradient-based methods can be applied to all relevant network architectures and thus are considered in our comparative evaluation of explanation methods.
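To make the two gradient-based methods concrete, the following Python sketch computes both relevance vectors for a toy logistic-regression model. The model, its hand-coded gradient, and the choice of N = 64 steps are stand-ins for the neural network fN and the gradients that a deep learning framework would provide; the loop approximates the path integral above with a simple Riemann sum.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=10)

    def f(x):                     # toy stand-in for the prediction function fN
        return 1.0 / (1.0 + np.exp(-w @ x))

    def grad_f(x):                # df/dx of the toy model (a framework would supply this)
        y = f(x)
        return y * (1.0 - y) * w

    x = rng.normal(size=10)       # input to be explained
    x0 = np.zeros_like(x)         # baseline, here a vector of zeros

    # Gradients (saliency map): r_i = dy/dx_i evaluated at x.
    r_gradients = grad_f(x)

    # Integrated Gradients: average the gradients along the straight path from
    # x0 to x and scale by (x - x0); N steps approximate the integral.
    N = 64
    alphas = (np.arange(N) + 0.5) / N
    r_ig = (x - x0) * np.mean([grad_f(x0 + a * (x - x0)) for a in alphas], axis=0)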
LRP and DeepLift. These popular white-box methods determine the relevance of a prediction by performing a backward pass through the neural network, starting at the output layer and performing calculations until the input layer is reached [5]. The central idea of layer-wise relevance propagation (LRP) is the use of a conservation property that needs to hold true during the backward pass. If r_i^l is the relevance of the unit i in layer l of the neural network then

    Σ_i r_i^1 = Σ_i r_i^2 = · · · = Σ_i r_i^L

needs to hold true for all L layers. Similarly, DeepLift performs a backward pass but takes a reference activation y0 = fN(x0) of a reference input x0 into account. The method enforces the conservation law,

    Σ_i r_i = y − y0 = Δy,

that is, the relevance assigned to the features must sum up to the difference between the outcome of x and x0. Both approaches support explaining the decisions of feed-forward, convolutional and recurrent neural networks [see 4]. However, as DeepLift and IG are closely related [2], we focus our study on the method ε-LRP.
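The sketch below illustrates the ε-rule of LRP on a small toy ReLU network with two dense layers. The network and its random weights are only placeholders for the systems studied later; the final print statement shows that the relevance sums remain approximately constant across layers, which is the conservation property stated above.

    import numpy as np

    def lrp_epsilon_dense(a, W, b, R_out, eps=1e-3):
        # epsilon-rule for a dense layer with inputs a and outputs z = a @ W + b:
        # R_i = sum_j a_i * W_ij / (z_j + eps * sign(z_j)) * R_j
        z = a @ W + b
        denom = z + eps * np.sign(z)
        return a * ((R_out / denom) @ W.T)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(10, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

    x = rng.normal(size=10)
    h = np.maximum(0.0, x @ W1 + b1)         # forward pass: hidden layer (ReLU)
    y = h @ W2 + b2                          # forward pass: network output

    R_h = lrp_epsilon_dense(h, W2, b2, y)    # backward pass: output -> hidden
    R_x = lrp_epsilon_dense(x, W1, b1, R_h)  # backward pass: hidden -> input

    print(y.sum(), R_h.sum(), R_x.sum())     # approximately equal relevance sums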
LIME and SHAP. Ribeiro et al. [36] introduce one of the first black-box methods for explaining neural networks that is further extended by Lundberg and Lee [31]. Both methods aim at approximating the decision function fN by creating a series of l perturbations of x, denoted as x̃1, . . . , x̃l, by setting entries in the vector x to 0 randomly. The methods then proceed by predicting a label fN(x̃i) = ỹi for each x̃i of the l perturbations. This sampling strategy enables the methods to approximate the local neighborhood of fN at the point fN(x). LIME [36] approximates the decision boundary by a weighted linear regression model,

    arg min_{g∈G} Σ_{i=1}^{l} π_x(x̃_i) (f_N(x̃_i) − g(x̃_i))²,

where G is the set of all linear functions and πx is a function indicating the difference between the input x and a perturbation x̃. SHAP [31] follows the same approach but uses the SHAP kernel as weighting function πx, which is shown to create Shapley Values [40] when solving the regression. Shapley Values are a concept from game theory where the features act as players under the objective of finding a fair contribution of the features to the payout—in this case the prediction of the model. As both approaches can be applied to any learning model, we study them in our empirical evaluation.
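A minimal sketch of this perturbation-based strategy is given below. The black-box function, the number of perturbations, and the cosine-similarity weighting are illustrative assumptions rather than the exact kernels of LIME or SHAP; the coefficients of the weighted least-squares fit act as the relevance vector r.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                      # stand-in for the black-box prediction function fN
        return float(np.tanh(x.sum()))

    x = rng.normal(size=20)        # input to be explained
    l = 500                        # number of perturbations

    # Create perturbations by randomly setting entries of x to 0.
    masks = rng.integers(0, 2, size=(l, x.size))
    X_tilde = masks * x
    y_tilde = np.array([f(xt) for xt in X_tilde])

    # Proximity weights pi_x: here the cosine similarity between x and each perturbation.
    pi = (X_tilde @ x) / (np.linalg.norm(X_tilde, axis=1) * np.linalg.norm(x) + 1e-12)

    # Weighted linear regression g(x) = beta . x that approximates fN around x.
    sw = np.sqrt(np.clip(pi, 0.0, None))
    beta, *_ = np.linalg.lstsq(X_tilde * sw[:, None], y_tilde * sw, rcond=None)
    r = beta                       # relevance values of the surrogate model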
LEMNA. As last explanation method, we consider LEMNA, a black-box method specifically designed for security applications [21]. It uses a mixture regression model for approximation, that is, a weighted sum of K linear models:

    f(x) = Σ_{j=1}^{K} π_j (β_j · x + ε_j).

The parameter K specifies the number of models, the random variables ε = (ε1, . . . , εK) originate from a normal distribution εi ∼ N(0, σ), and π = (π1, . . . , πK) holds the weights for each model. The variables β1, . . . , βK are the regression coefficients and can be interpreted as K linear approximations of the decision boundary near fN(x).
3.2. Security Systems

As field of application for the six explanation methods, we consider four recent security systems that employ deep learning (see Table 2). The systems cover the three major architectures introduced in Section 2.1 and comprise between 4 and 6 layers of different types.

TABLE 2: Overview of the considered security systems.

    System         Publication       Type  # Layers
    Drebin+        ESORICS'17 [20]   MLP   4
    Mimicus+       CCS'18 [21]       MLP   4
    DAMD           CODASPY'17 [33]   CNN   6
    VulDeePecker   NDSS'18 [30]      RNN   5

Drebin+. The first system uses an MLP for identifying Android malware. The system has been proposed by Grosse et al. [20] and builds on features originally developed by Arp et al. [3]. The network consists of two hidden layers, each comprising 200 neurons. The input features are statically extracted from Android applications and cover data from the application's manifest, such as hardware details and requested permissions, as well as information based on the application's code, such as suspicious API calls and network addresses. To verify the correctness of our implementation, we train the system on the original Drebin dataset [3], where we use 75 % of the 129,013 Android applications for training and 25 % for testing. Table 3 shows the results of this experiment, which are in line with the performance published by Grosse et al. [20].
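For orientation, a minimal Keras sketch of such an MLP is shown below. Only the two hidden layers with 200 neurons each are taken from the description above; the input dimensionality, the activations, and the training configuration are assumptions made for illustration.

    import tensorflow as tf

    num_features = 100000   # hypothetical dimensionality of the binary feature vectors

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(200, activation="relu"),    # first hidden layer
        tf.keras.layers.Dense(200, activation="relu"),    # second hidden layer
        tf.keras.layers.Dense(2, activation="softmax"),   # benign vs. malicious
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])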
Mimicus+. The second system also uses an MLP but is designed to detect malicious PDF documents. The system is re-implemented based on the work of Guo et al. [21] and builds on features originally introduced by Smutz and Stavrou [45]. Our implementation uses two hidden layers with 200 nodes each and is trained with 135 features extracted from PDF documents. These features cover properties about the document structure, such as the number of sections and fonts in the document, and are mapped to binary values as described by Guo et al. [21]. For a full list of features, we refer the reader to the implementation by Šrndić and Laskov [50]. For verifying our implementation, we make use of the original dataset that contains 5,000 benign and 5,000 malicious PDF files and again split the dataset into 75 % for training and 25 %
for testing. Our results are shown in Table 3 and come close to a perfect detection.

TABLE 3: Performance of the re-implemented security systems on the original datasets.

    System         Accuracy  Precision  Recall  F1-Score
    Drebin+        0.980     0.926      0.924   0.925
    Mimicus+       0.994     0.991      0.998   0.994
    DAMD           0.949     0.967      0.924   0.953
    VulDeePecker   0.908     0.837      0.802   0.819

DAMD. The third security system studied in our evaluation uses a CNN for identifying malicious Android applications [33]. The system processes the raw Dalvik bytecode of Android applications and its neural network is comprised of six layers for embedding, convolution, and max-pooling of the extracted instructions. As the system processes entire applications, the number of features depends on the size of the applications. For a detailed description of this process, we refer the reader to the publication by McLaughlin et al. [33]. To replicate the original results, we apply the system to data from the Malware Genome Project [57]. This dataset consists of 2,123 applications in total, with 863 benign and 1,260 malicious samples. We again split the dataset into 75 % of training and 25 % of testing data and obtain results similar to those presented in the original publication.

VulDeePecker. The fourth system uses an RNN for discovering vulnerabilities in source code [30]. The RNN consists of five layers, uses 300 LSTM cells [22], and applies a word2vec embedding [34] with 200 dimensions for analyzing C/C++ code. As a preprocessing step, the source code is sliced into code gadgets that comprise short snippets of tokens. The gadgets are truncated or padded to a length of 50 tokens. For verifying the correctness of our implementation, we use the CWE-119 dataset, which consists of 39,757 code gadgets, with 10,444 gadgets corresponding to vulnerabilities. In line with the original study, we split the dataset into 80 % training and 20 % testing data, and attain a comparable accuracy.
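A minimal Keras sketch of such a recurrent model is given below. Only the gadget length of 50 tokens, the 200-dimensional embedding, and the 300 LSTM cells are taken from the description above; the vocabulary size, the exact layer stack, and all training settings are assumptions.

    import tensorflow as tf

    vocab_size = 10000   # hypothetical size of the token vocabulary
    gadget_len = 50      # tokens per code gadget

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(gadget_len,)),
        tf.keras.layers.Embedding(vocab_size, 200),      # embedding akin to word2vec
        tf.keras.layers.LSTM(300),                       # 300 LSTM cells
        tf.keras.layers.Dense(2, activation="softmax"),  # vulnerable vs. non-vulnerable
    ])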
The four selected security systems provide a diverse view on the current use of deep learning in security. Drebin+ and Mimicus+ are examples of systems that make use of MLPs for detecting malware. However, they differ in the dimensionality of the input: While Mimicus+ works on a small set of engineered features, Drebin+ analyzes inputs with thousands of dimensions. DAMD is an example of a system using a CNN in security and capable of learning from large inputs, whereas VulDeePecker makes use of an RNN, similar to other learning-based approaches analyzing program code [e.g., 10, 41, 53].

4. Evaluation Criteria

In light of the broad range of available explanation methods, the practitioner is in need of criteria for selecting the best method for a security task at hand. In this section, we develop these criteria and demonstrate their utility in different examples. Before doing so, however, we address another important question: Do the considered explanation methods provide different results? If the methods generated similar explanations, criteria for their comparison would be less important and any suitable method could be chosen in practice.

To answer this question, we investigate the top-k features of the six explanation methods when explaining predictions of the security systems. That is, we compare the set Ti of the k features with the highest relevance from method i with the set Tj of the k features with the highest relevance from method j. In particular, we compute the intersection size

    IS(i, j) = |T_i ∩ T_j| / k,    (1)

as a measure of similarity between the two methods. The intersection size lies between 0 and 1, where 0 indicates no overlap and 1 corresponds to identical top-k features.
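A direct implementation of Equation (1) is straightforward; the helper below is a small sketch that takes two relevance vectors and returns the overlap of their top-k features.

    def intersection_size(relevance_i, relevance_j, k=10):
        # IS(i, j): overlap of the top-k features of two relevance vectors, in [0, 1].
        order_i = sorted(range(len(relevance_i)), key=lambda d: relevance_i[d], reverse=True)
        order_j = sorted(range(len(relevance_j)), key=lambda d: relevance_j[d], reverse=True)
        return len(set(order_i[:k]) & set(order_j[:k])) / k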
Figure 3: Comparison of the top-10 features for the different explanation methods. An average value of 1 indicates identical top-10 features and a value of 0 indicates no overlap. (One matrix per dataset: Mimicus+, Drebin+, VulDeePecker, and DAMD; rows and columns correspond to Gradient, IG, LRP, SHAP, Lemna, and LIME.)

A visualization of the intersection size averaged over the samples of the four datasets is shown in Figure 3. We choose k = 10 according to a typical use case of explainable learning: An expert investigates the top-10 features to gain insights on a prediction. For DAMD, we use k = 50, as the dataset is comprised of long opcode sequences. We observe that the top features of the explanation methods differ considerably. For example, in the case of VulDeePecker, all methods determine different top-10 features. While we notice some similarity between the methods, it becomes clear that the methods cannot be simply interchanged, and there is a need for measurable evaluation criteria.
4.1. General Criteria: Descriptive Accuracy

As the first evaluation criterion, we introduce the descriptive accuracy. This criterion reflects how accurately an explanation method captures relevant features of a prediction. As it is difficult to assess the relation between features and a prediction directly, we follow an indirect
strategy and measure how removing the most relevant features changes the prediction of the neural network.

Definition 2. Given a sample x, the descriptive accuracy (DA) is calculated by removing the k most relevant features x1, . . . , xk from the sample, computing the new prediction using fN and measuring the score of the original prediction class c without the k features,

    DA_k(x, f_N) = f_N(x | x_1 = 0, . . . , x_k = 0)_c.

If we remove relevant features from a sample, the accuracy should decrease, as the neural network has less information for making a correct prediction. The better the explanation, the quicker the accuracy will drop, as the removed features capture more context of the predictions. Consequently, explanation methods with a steep decline of the descriptive accuracy provide better explanations than methods with a gradual decrease.
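The following sketch implements Definition 2 under the assumption that removing a feature corresponds to setting it to zero and that fN returns a vector of class scores; plotting the result for increasing k yields curves like those in the top row of Figure 5.

    import numpy as np

    def descriptive_accuracy(f_N, x, relevance, k, c):
        # DA_k: score of the original class c after the k most relevant
        # features of x have been removed (set to 0).
        x_reduced = np.array(x, dtype=float)
        top_k = np.argsort(relevance)[::-1][:k]
        x_reduced[top_k] = 0.0
        return f_N(x_reduced)[c]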
To demonstrate the utility of the descriptive accuracy, we consider a sample from the VulDeePecker dataset, which is shown in Figure 4(a). The sample corresponds to a program slice and is passed to the neural network as a sequence of tokens. Figures 4(b) and 4(c) depict these tokens overlayed with the explanations of the methods Integrated Gradients (IG) and LIME, respectively. Note that the VulDeePecker system truncates all code snippets to a length of 50 tokens before processing them through the neural network [30].

    (a) Original code:
    1 data = NULL;
    2 data = new wchar_t[50];
    3 data[0] = L'\\0';
    4 wchar_t source[100];
    5 wmemset(source, L'C', 100-1);
    6 source[100-1] = L'\\0';
    7 memmove(data, source, 100*sizeof(wchar_t));

    (b), (c) Tokenized program slice shown for both explanations:
    1 INT0 ] ;
    2 VAR0 [ INT0 ] = STR0 ;
    3 wchar_t VAR0 [ INT0 ] ;
    4 wmemset ( VAR0 , STR0 , INT0 - INT1 ) ;
    5 VAR0 [ INT0 - INT1 ] = STR0 ;
    6 memmove ( VAR0 , VAR1 , INT0 * sizeof ( wchar_t ) ) ;

Figure 4: Explanations for a program slice from the VulDeePecker dataset using (b) Integrated Gradients and (c) LIME.

The example shows a simple buffer overflow which originates from an incorrect calculation of the buffer size in line 7. The two explanation methods significantly differ when explaining the detection of this vulnerability. While IG highlights the wmemset call as important, LIME highlights the call to memmove and even marks wmemset as speaking against the detection. Measuring the descriptive accuracy can help to determine which of the two explanations reflects the prediction of the system better.
4.2. General Criteria: Descriptive Sparsity

Assigning high relevance to features which impact a prediction is a necessary prerequisite for good explanations. However, a human analyst can only process a limited number of these features, and thus we define the descriptive sparsity as a further criterion for comparing explanation methods as follows:

Definition 3. The descriptive sparsity is measured by scaling the relevance values to the range [−1, 1], computing a normalized histogram h of them and calculating the mass around zero (MAZ) defined by

    MAZ(r) = ∫_{−r}^{r} h(x) dx    for r ∈ [0, 1].

The MAZ can be thought of as a window which starts at 0 and grows uniformly into the positive and negative direction of the x axis. For each window, the fraction of relevance values that lies in the window is evaluated. Sparse explanations have a steep rise in MAZ close to 0 and are flat around 1, as most of the features are not marked as relevant. By contrast, dense explanations have a notable smaller slope close to 0, indicating a larger set of relevant features. Consequently, explanation methods with a MAZ distribution peaking at 0 should be preferred over methods with less pronounced distributions.
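The sketch below computes MAZ(r) for a single window size r from a relevance vector, following Definition 3; the normalization to [−1, 1] and the number of histogram bins are assumptions.

    import numpy as np

    def mass_around_zero(relevance, r, bins=201):
        # Fraction of normalized relevance values falling into the window [-r, r].
        scaled = relevance / (np.abs(relevance).max() + 1e-12)    # scale to [-1, 1]
        hist, edges = np.histogram(scaled, bins=bins, range=(-1.0, 1.0))
        hist = hist / hist.sum()                                  # normalized histogram h
        centers = (edges[:-1] + edges[1:]) / 2
        return hist[np.abs(centers) <= r].sum()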
As an example of a sparse and dense explanation, we consider two explanations generated for a malicious Android application of the DAMD dataset. Table 4 shows a snapshot of these explanations, covering opcodes of the onReceive method. LRP provides a crisp representation in this setting, whereas LEMNA marks the entire snapshot as relevant. If we normalize the relevance vectors to [−1, 1] and focus on features above 0.2, LRP returns only 14 relevant features for investigation, whereas LEMNA returns 2,048 features, rendering a manual examination tedious.

TABLE 4: Explanations of LRP and LEMNA for a sample of the GoldDream family from the DAMD dataset.

    Id  LRP                  LEMNA
    0   invoke-virtual       invoke-virtual
    1   move-result-object   move-result-object
    2   if-eqz               if-eqz
    3   const-string         const-string
    4   invoke-virtual       invoke-virtual
    5   move-result-object   move-result-object
    6   check-cast           check-cast
    7   array-length         array-length
    8   new-array            new-array
    9   const/4              const/4
    10  array-length         array-length
    11  if-ge                if-ge
    12  aget-object          aget-object

It is important to note that the descriptive accuracy and the descriptive sparsity are not correlated and must both be satisfied by an effective explanation method. A method marking all features as relevant while highlighting a few ones can be accurate but is clearly not sparse. Vice versa, a method assigning high relevance to very few meaningless features is sparse but not accurate.
4.3. Security Criteria: Completeness

After introducing two generic evaluation criteria, we start focusing on aspects that are especially important for the area of security. In a security system, an explanation method must be capable of creating proper results in all possible situations. If some inputs, such as pathological data or corner cases, cannot be processed by an explanation method, an adversary may trick the method into producing degenerated results. Consequently, we propose completeness as the first security-specific criterion.

Definition 4. An explanation method is complete, if it can generate non-degenerated explanations for all possible input vectors of the prediction function fN.

Several white-box methods are complete by definition, as they calculate relevance vectors directly from the weights of the neural network. For black-box methods, however, the situation is different: If a method approximates the prediction function fN using random perturbations, it may fail to derive a valid estimate of fN and return degenerated explanations. We investigate this phenomenon in more detail in Section 5.4.

As an example of this problem, Table 5 shows explanations generated by the methods Gradients and SHAP for a benign Android application of the Drebin dataset. The Gradients explanation finds the touchscreen feature in combination with the launcher category and the internet permission as an explanation for the benign classification. SHAP, however, creates an explanation of zeros which provides no insights. The reason for this degenerated explanation is rooted in the random perturbations used by SHAP. By flipping the value of features, these perturbations aim at changing the class label of the input. As there exist far more benign features than malicious ones in the case of Drebin+, the perturbations can fail to switch the label and prevent the linear regression from working, resulting in a degenerated explanation.

TABLE 5: Explanations for the Android malware FakeInstaller generated for Drebin+ using Gradients and SHAP.

    Id  Gradients                                      SHAP
    0   feature::android.hardware.touchscreen          feature::android.hardware.touchscreen
    1   intent::android.intent.category.LAUNCHER       intent::android.intent.category.LAUNCHER
    2   real_permission::android.permission.INTERNET   real_permission::android.permission.INTERNET
    3   api_call::android/webkit/WebView               api_call::android/webkit/WebView
    4   intent::android.intent.action.MAIN             intent::android.intent.action.MAIN
    5   url::translator.worldclockr.com                 url::translator.worldclockr.com
    6   url::translator.worldclockr.com/android.html    url::translator.worldclockr.com/android.html
    7   permission::android.permission.INTERNET        permission::android.permission.INTERNET
    8   activity::.Main                                 activity::.Main

4.4. Security Criteria: Stability

In addition to complete results, the explanations generated in a security system need to be reliable. That is, relevant features must not be affected by fluctuations and need to remain stable over time in order to be useful for an expert. As a consequence, we define stability as another security-specific evaluation criterion.

Definition 5. An explanation method is stable, if the generated explanations do not vary between multiple runs. That is, for any run i and j of the method, the intersection size of the top features Ti and Tj should be close to 1, that is, IS(i, j) > 1 − ε for some small threshold ε.

The stability of an explanation method can be empirically determined by running the methods multiple times and computing the average intersection size, as explained in the beginning of this section. White-box methods are deterministic by construction since they perform a fixed sequence of computations for generating an explanation. Most black-box methods, however, require random perturbations to compute their output which can lead to different results for the same input. Table 6, for instance, shows the output of LEMNA for a PDF document from the Mimicus+ dataset over two runs. Some of the most relevant features from the first run receive very little relevance in the second run and vice versa, rendering the explanations unstable. We analyze these instabilities of the explanation methods in Section 5.5.
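This empirical procedure can be sketched in a few lines: run a (possibly randomized) explanation method several times on the same input and average the pairwise intersection sizes of the top-k features, reusing the intersection_size helper from the sketch after Equation (1).

    import numpy as np
    from itertools import combinations

    def stability(explain, x, runs=3, k=10):
        # Average pairwise intersection size of the top-k features over several runs.
        relevances = [explain(x) for _ in range(runs)]
        pairs = list(combinations(relevances, 2))
        return float(np.mean([intersection_size(ri, rj, k) for ri, rj in pairs]))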
TABLE 6: Two explanations from LEMNA for the same sample computed in different runs.

    Id   LEMNA (Run 1)     LEMNA (Run 2)
    0    pos_page_min      pos_page_min
    1    count_js          count_js
    2    count_javascript  count_javascript
    3    pos_acroform_min  pos_acroform_min
    4    ratio_size_page   ratio_size_page
    5    pos_image_min     pos_image_min
    6    count_obj         count_obj
    ...  ...               ...
    27   pos_image_max     pos_image_max
    28   count_page        count_page
    29   len_stream_avg    len_stream_avg
    30   pos_page_avg      pos_page_avg
    31   count_stream      count_stream
    32   moddate_tz        moddate_tz
    33   len_stream_max    len_stream_max
    34   count_endstream   count_endstream

4.5. Security Criteria: Efficiency

When operating a security system in practice, explanations need to be available in reasonable time. While low run-time is not a strict requirement in general, time differences between minutes and milliseconds are still significant. For example, when dealing with large amounts of data, it might be desirable for the analyst to create
explanations for every sample of an entire class. We thus define efficiency as a further criterion for explanation methods in security applications.

Definition 6. We consider a method efficient if it enables providing explanations without delaying the typical workflow of an expert.

As the workflow depends on the particular security task, we do not define concrete run-time numbers, yet we provide a negative example as an illustration. The run-time of the method LEMNA depends on the size of the inputs. For the largest sample of the DAMD dataset with 530,000 features, it requires about one hour for computing an explanation, which obstructs the workflow of inspecting Android malware severely.

4.6. Security Criteria: Robustness

As the last criterion, we consider the robustness of explanation methods to attacks. Recently, several attacks [e.g., 14, 44, 55] have shown that explanation methods may suffer from adversarial perturbations and can be tricked into returning incorrect relevance vectors, similarly to adversarial examples [7]. The objective of these attacks is to disconnect the explanation from the underlying prediction, such that arbitrary relevance values can be generated that do not explain the behavior of the model.

Definition 7. An explanation method is robust if the computed relevance vector cannot be decoupled from the prediction by an adversarial perturbation.

Unfortunately, the robustness of explanation methods is still not well understood and, similarly to adversarial examples, guarantees and strong defenses have not been established yet. To this end, we assess the robustness of the explanation methods based on the existing literature.

5. Evaluation

Equipped with evaluation criteria for comparing explanation methods, we proceed to empirically investigate these in different security tasks. To this end, we implement a comparison framework that integrates the six selected explanation methods and four security systems.

5.1. Experimental Setup

White-box Explanations. For our comparison framework, we make use of the iNNvestigate toolbox by Alber et al. [1] that provides efficient implementations for LRP, Gradients, and IG. For the security system VulDeePecker, we use our own LRP implementation [51] based on the publication by Arras et al. [4]. In all experiments, we set ε = 10⁻³ for LRP and use N = 64 steps for IG. Due to the high dimensional embedding space of VulDeePecker, we choose a step count of N = 256 in the corresponding experiments.

Black-box Explanations. We re-implement LEMNA in accordance with Guo et al. [21] and use the Python package cvxpy [13] to solve the linear regression problem with Fused Lasso restriction [52]. We set the number of mixture models to K = 3 and the number of perturbations to l = 500. The parameter S is set to 10⁴ for Drebin+ and Mimicus+, as the underlying features are not sequential, and to 10⁻³ for the sequences of DAMD and VulDeePecker [see 21]. Furthermore, we implement LIME with l = 500 perturbations, use the cosine similarity as proximity measure, and employ the regression solver from the scipy package using L1 regularization. For SHAP we make use of the open-source implementation by Lundberg and Lee [31] including the KernelSHAP solver.

TABLE 7: Descriptive accuracy (DA) and sparsity (MAZ) for the different explanation methods.

(a) Area under the DA curves from Figure 5.

    Method     Drebin+  Mimicus+  DAMD   VulDeePecker
    LIME       0.580    0.257     0.919  0.571
    LEMNA      0.656    0.405     0.983  0.764
    SHAP       0.891    0.565     0.966  0.869
    Gradients  0.472    0.213     0.858  0.856
    IG         0.446    0.206     0.499  0.574
    LRP        0.474    0.213     0.504  0.625

(b) Area under the MAZ curves from Figure 5.

    Method     Drebin+  Mimicus+  DAMD   VulDeePecker
    LIME       0.757    0.752     0.833  0.745
    LEMNA      0.681    0.727     0.625  0.416
    SHAP       0.783    0.716     0.713  0.813
    Gradients  0.846    0.856     0.949  0.816
    IG         0.847    0.858     0.999  0.839
    LRP        0.846    0.856     0.964  0.827

5.2. Descriptive Accuracy

We start our evaluation by measuring the descriptive accuracy (DA) of the explanation methods as defined in Section 4.1. In particular, we successively remove the most relevant features from the samples of the datasets and measure the decrease in the classification score. For Drebin+ and Mimicus+, we remove features by setting the corresponding dimensions to 0. For DAMD, we replace the most relevant instructions with the no-op opcode, and for VulDeePecker we substitute the selected tokens with an embedding-vector of zeros.

The top row in Figure 5 shows the results of this experiment. As the first observation, we find that the DA curves vary significantly between the explanation methods and security systems. However, the methods IG and LRP consistently obtain strong results in all settings and show steep declines of the descriptive accuracy. Only on the VulDeePecker dataset, the black-box method LIME can provide explanations with comparable accuracy. Notably, for the DAMD dataset, IG and LRP are the only methods to generate real impact on the outcome of the classifier. For Mimicus+, IG, LRP and Gradients achieve a perfect accuracy decline after only 25 features and thus the white-box explanation methods outperform the black-box methods in this experiment.

Table 7(a) shows the area under curve (AUC) for the descriptive accuracy curves from Figure 5. We observe that IG is the best method over all datasets—lower values indicate better explanations—followed by LRP. In comparison to other methods it is up to 48 % better on average.
Figure 5: Descriptive accuracy and sparsity for the considered explanation methods. Top row: Average descriptive accuracy (ADA) against the number of removed features; bottom row: sparsity measured as mass around zero (MAZ) against the interval size. One column per dataset (Drebin+, Mimicus+, VulDeePecker, DAMD) with curves for LRP, IG, Gradient, LIME, LEMNA, and KernelSHAP.

Intuitively, this considerable difference between the white-box and black-box methods makes sense, as white-box approaches can utilize internal information of the neural networks that are not available to black-box methods.

5.3. Descriptive Sparsity

We proceed by investigating the sparsity of the generated explanations with the MAZ score defined in Section 4.2. The second row in Figure 5 shows the result of this experiment for all datasets and methods. We observe that the methods IG, LRP, and Gradients show the steepest slopes and assign the majority of features little relevance, which indicates a sparse distribution. By contrast, the other explanation methods provide flat slopes of the MAZ close to 0, as they generate relevance values with a broader range and thus are less sparse.

For Drebin+ and Mimicus+, we observe an almost identical level of sparsity for LRP, IG and Gradients, supporting the findings from Figure 3. Interestingly, for VulDeePecker, the MAZ curve of LEMNA shows a strong increase close to 1, indicating that it assigns high relevance to a lot of tokens. While this generally is undesirable, in case of LEMNA, this is founded in the basic design and the use of the Fused Lasso constraint. In case of DAMD, we see a massive peak at 0 for IG, showing that it marks almost all features as irrelevant. According to the previous experiment, however, it simultaneously provides a very good accuracy on this data. The resulting sparse and accurate explanations are particularly advantageous for a human analyst since the DAMD dataset contains samples with up to 520,000 features. The explanations from IG provide a compressed yet accurate representation of the sequences which can be inspected easily.

We summarize the performance on the MAZ metric by calculating the area under curve and report it in Table 7(b). A high AUC indicates that more features have been assigned a relevance close to 0, that is, the explanation is more sparse. We find that the best methods again are white-box approaches, providing explanations that are up to 50 % sparser compared to the other methods in this experiment.

5.4. Completeness of Explanations

We further examine the completeness of the explanations. As shown in Section 4.3, some explanation methods cannot calculate meaningful relevance values for all inputs. In particular, perturbation-based methods suffer from this problem, since they determine a regression with labels derived from random perturbations. To investigate this problem, we monitor the creation of perturbations and their labels for the different datasets.

When creating perturbations for some sample x it is essential for black-box methods that a fraction p of them is classified as belonging to the opposite class of x. In an optimal case one can achieve p ≈ 0.5, however during our experiments we find that 5 % can be sufficient to calculate a non-degenerated explanation in some cases. Figure 6 shows for each value of p and all datasets the fraction of samples remaining when enforcing a percentage p of perturbations from the opposite class.
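The monitoring described here can be sketched as follows; the way perturbations are drawn (randomly zeroing features) and their number are assumptions in line with the setup of Section 5.1.

    import numpy as np

    def perturbation_label_fraction(classify, x, num_perturbations=500, rng=None):
        # Fraction p of random perturbations of x that receive the opposite label.
        rng = rng or np.random.default_rng()
        original = classify(x)
        flipped = 0
        for _ in range(num_perturbations):
            mask = rng.integers(0, 2, size=x.shape)   # randomly zero out features
            if classify(mask * x) != original:
                flipped += 1
        return flipped / num_perturbations

    # A sample is prone to degenerated explanations when p stays close to 0,
    # e.g. below the 5 % threshold discussed above.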
In general, we observe that creating malicious perturbations from benign samples is a hard problem, especially for Drebin+ and DAMD. For example, in the Drebin+ dataset only 31 % of the benign samples can obtain a p value of 5 % which means that more than 65 % of the whole dataset suffer from degenerated explanations. A detailed calculation for all datasets with a p value of 5 % can be found in Table 12 in the Appendix C.

Figure 6: Perturbation label statistics of the datasets. For each percentage of perturbations from the other class the percentage of samples achieving this number is shown. (Two panels, Class− and Class+: samples remaining against the percentage of perturbations from the opposite class, with one curve per dataset: Drebin+, Mimicus+, VulDeePecker, and DAMD.)

The problem of incomplete explanations is rooted in the imbalance of features characterizing malicious and benign data in the datasets. While only few features make a sample malicious, there exists a large variety of features turning
a sample benign. As a consequence, randomly setting malicious features to zero leads to a benign classification, while setting benign features to zero usually does not impact the prediction. Hence, it is often not possible to explain predictions for benign applications and the analyst is stuck with an empty explanation.

In summary, we argue that perturbation-based explanation methods should only be used in security settings where incomplete explanations can be compensated by other means. In all other cases, one should refrain from using these black-box methods in the context of security.

5.5. Stability of Explanations

We proceed to evaluate the stability of the explanation methods when processing inputs from the four security systems. To this end, we apply the explanations to the same samples over multiple runs and measure the average intersection size between the runs.

Table 8 shows the average intersection size between the top k features for three runs of the methods as defined in Equation 1. We use k = 10 for all datasets except for DAMD where we use k = 50 due to the larger input space. Since the outputs of Gradients, IG, and LRP are deterministic, they reach the perfect score of 1.0 in all settings and thus do not suffer from limitations concerning stability.

TABLE 8: Average intersection size between top features for multiple runs. Values close to one indicate greater stability.

    Method     Drebin+  Mimicus+  DAMD   VulDeePecker
    LIME       0.480    0.446     0.040  0.446
    LEMNA      0.4205   0.304     0.016  0.416
    SHAP       0.257    0.411     0.007  0.440
    Gradients  1.000    1.000     1.000  1.000
    IG         1.000    1.000     1.000  1.000
    LRP        1.000    1.000     1.000  1.000

For the perturbation-based methods, however, stability poses a severe problem since none of those methods obtains an intersection size of more than 0.5. This indicates that on average half of the top features do not overlap when computing explanations on the same input. Furthermore, we see that the assumption of locality of the perturbation-based methods does not apply for all models under test, since the output is highly dependent on the perturbations used to approximate the decision boundary. Therefore, the best methods for the stability criterion beat the perturbation-based methods by a factor of at least 2.5 on all datasets.

5.6. Efficiency of Explanations

We finally examine the efficiency of the different explanation methods. Our experiments are performed on a regular server system with an Intel Xeon E5 v3 CPU. It is noteworthy that the methods Gradients, IG and LRP can benefit from computations on a graphical processing unit (GPU), therefore we report both results but use only the CPU results to achieve a fair comparison with the black-box methods.

Table 9 shows the average run-time per input for all explanation methods and security systems. We observe that Gradients and LRP achieve the highest throughput in general, beating the other methods by orders of magnitude. This advantage arises from the fact that data can be processed batch-wise for methods like Gradients, IG, and LRP, that is, explanations can be calculated for a set of samples at the same time. The Mimicus+ dataset, for example, can be processed in one batch resulting in a speed-up factor of more than 16,000× over the fastest black-box method. In general we note that the white-box methods Gradients and LRP achieve the fastest run-time since they require a single backwards-pass through the network. Moreover, computing these methods on a GPU results in additional speedups of a factor up to three.

TABLE 9: Run-time per sample in seconds. Note the range of the different times from microseconds to minutes.

    Method     Drebin+     Mimicus+    DAMD        VulDeePecker
    LIME       3.1 × 10⁻²  2.8 × 10⁻²  7.4 × 10⁻¹  3.0 × 10⁻²
    LEMNA      4.6         2.6         6.9 × 10²   6.1
    SHAP       9.1         4.3 × 10⁻¹  4.5         5.0
    Gradients  8.1 × 10⁻³  7.8 × 10⁻⁶  1.1 × 10⁻²  7.6 × 10⁻⁴
    IG         1.1 × 10⁻¹  5.4 × 10⁻⁵  6.9 × 10⁻¹  4.0 × 10⁻¹
    LRP        8.4 × 10⁻³  1.7 × 10⁻⁶  1.3 × 10⁻²  2.9 × 10⁻²

    GPU        Drebin+     Mimicus+    DAMD        VulDeePecker
    Gradients  7.4 × 10⁻³  3.9 × 10⁻⁶  3.5 × 10⁻³  3.0 × 10⁻⁴
    IG         1.5 × 10⁻²  3.9 × 10⁻⁵  3.0 × 10⁻¹  1.3 × 10⁻¹
    LRP        7.3 × 10⁻³  1.6 × 10⁻⁶  7.8 × 10⁻³  1.1 × 10⁻²

The run-time of the black-box methods increases for high dimensional datasets, especially DAMD, since the regression problems need to be solved in higher dimensions. While the speed-up factors are already enormous, we have not even included the creation of perturbations and their classification, which consume additional run-time as well.

5.7. Robustness of Explanations

Recently, multiple authors have shown that adversarial perturbations are also applicable against explanation methods and can manipulate the generated relevance values. Given a classification function f, an input x and a target class ct, the goal of an adversarial perturbation is to find x̃ = x + δ such that δ is minimal but at the same time f(x̃) = ct ≠ f(x).
TABLE 10: Results of the evaluated explanation methods. The last column summarizes these metrics in a rating
comprising three levels: strong( ), medium ( ), and weak (#).
Explanation Method Accuracy Sparsity Completeness Stability Efficiency Robustness Overall Rating
−1
LIME 0.582 0.772 – 0.353 2.1 × 10 s # ## #
LEMNA 0.702 0.612 – 0.289 1.8 × 102 s # ######
SHAP 0.823 0.757 – 0.279 4.8 s # # ####
Gradients 0.600 0.867 3 1.000 5.0 × 10−3 s # #
IG 0.431 0.886 3 1.000 3.0 × 10−1 s # #
LRP 0.454 0.873 3 1.000 5.0 × 10−2 s # #

For an explanation method g_f(x), Zhang et al. [55] propose to solve

    min_δ  d_p(f(x̃), c_t) + λ d_e(g_f(x̃), g_f(x)),    (2)

where d_p and d_e are distance measures for classes and explanations of f. The crafted input x̃ is misclassified by the network but keeps an explanation very close to the one of x. Dombrowski et al. [14] show that many white-box methods can be tricked to produce an arbitrary explanation e_t without changing the classification by solving

    min_δ  d_e(g_f(x̃), e_t) + γ d_p(f(x̃), f(x)).    (3)

While the aforementioned attacks are constructed for white-box methods, Slack et al. [44] have recently proposed an attack against LIME and SHAP. They show that the perturbations, which have to be classified to create explanations, deviate strongly from the original data distribution and hence are easily distinguishable from original data samples. With this knowledge an adversary can use a different model f̃ to classify the perturbations and create arbitrary explanations to hide potential biases of the original model. Although LEMNA is not considered by Slack et al. [44], it can be attacked likewise since it relies on perturbation labels as well.

The white-box attacks by Zhang et al. [55] and Dombrowski et al. [14] require access to the model parameters, which is a technical hurdle in practice. Similarly, however, the black-box attack by Slack et al. [44] needs to bypass the classification process of the perturbations to create arbitrary explanations, which is equally difficult. A further problem of all attacks in the security domain is the discrete input features: For images, an adversarial perturbation δ is typically small and imperceptible, while binary features, as in the Drebin+ dataset, require larger changes with |δ| ≥ 1. Similarly, for VulDeePecker and DAMD, a direct application of existing attacks will likely result in broken code or invalid behavior. Adapting these attacks seems possible but requires further research on adversarial learning in structured domains.

Based on this analysis, we conclude that explanation methods are not robust and are vulnerable to different attacks. Still, these attacks require access to specific parts of the victim system as well as further extensions to work in discrete domains. As a consequence, the robustness of the methods is difficult to assess and further work is needed to establish a better understanding of this threat.
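To make the objective of Equation (3) concrete, the following sketch shows how such an attack could be mounted against a plain gradient explanation. It is a minimal illustration only: it assumes a differentiable PyTorch model, uses a simple Gradients map as a stand-in for the explanation methods attacked in the literature, and picks distance measures and hyper-parameters freely.

```python
# Hedged sketch of an Equation (3)-style attack: optimize a perturbation delta so
# that the gradient explanation of x + delta approaches a target map e_target while
# the prediction stays close to the original output. All names are illustrative.
import torch

def manipulate_explanation(model, x, e_target, gamma=1.0, steps=200, lr=0.01):
    c = int(model(x).argmax(dim=1))          # class of the original input
    y_orig = model(x).detach()               # original prediction, kept fixed
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        out = model(x_adv)
        score = out[0, c]                    # explain the original class
        # Gradients-style explanation, kept in the graph for second-order updates
        expl = torch.autograd.grad(score, x_adv, create_graph=True)[0]
        loss = (expl - e_target).pow(2).sum() + gamma * (out - y_orig).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()
```

As discussed above, such continuous perturbations do not transfer directly to the discrete feature spaces of the security systems considered here.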
5.8. Summary

A strong explanation method is expected to achieve good results for each criterion and on each dataset. For example, we have seen that the Gradients method computes sparse results in a decent amount of time. The features, however, are not accurate on the DAMD and VulDeePecker datasets. Equally, the relevance values of SHAP for the Drebin+ dataset are sparser than those from LEMNA but suffer from instability. To provide an overview, we average the performance of all methods over the four datasets and summarize the results in Table 10.

For each of the six evaluation criteria, we assign each method one of the following three categories: ●, ◐, and ○. The ● category is given to the best explanation method and other methods with a similar performance. The ○ category is assigned to the worst method and methods performing equally bad. Finally, the ◐ category is given to methods that lie between the best and worst methods.

Based on Table 10, we can see that white-box explanation methods achieve a better ranking than black-box methods in all evaluation criteria. Due to the direct access to the parameters of the neural network, these methods can better analyze the prediction function and are able to identify relevant features. In particular, IG and LRP are the best methods overall regarding our evaluation criteria. They compute results in less than 50 ms in our benchmark, mark only few features as relevant, and the selected features have great impact on the decision of the classifier. These methods also provide deterministic results and do not suffer from incompleteness. As a result, we recommend using these methods for explaining deep learning in security. However, if white-box access is not available, we recommend the black-box method LIME, as it shows the best performance in our experiments, or to apply model stealing as shown in the following Section 5.9 to enable the use of white-box methods.

In general, whether white-box or black-box methods are applicable also depends on who is generating the explanations: If the developer of a security system wants to investigate its prediction, direct access to all model parameters is typically available and white-box methods can be applied easily. Similarly, if the learning models are shared between practitioners, white-box approaches are also the method of choice. If the learning model, however, is trained by a remote party, such as a machine-learning-as-a-service provider, only black-box methods are applicable. Likewise, if an auditor or security tester inspects a proprietary system, black-box methods also become handy, as they do not require reverse-engineering and extracting model parameters.
Figure 7: Intersection size of the Top-10 features of explanations obtained from models that were stolen from the original model of the Drebin+ and Mimicus+ dataset. [Bar chart with panels "Top-10-Drebin+" and "Top-10-Mimicus+" and bars for the Original, 1-layer, 2-layers, and 3-layers models.]

TABLE 11: Top-5 features for the Mimicus+ dataset determined using IG. The right columns show the frequency in benign and malicious PDF documents, respectively.

Class   Top 5 Feature        Benign     Malicious
–       count_font            98.4 %     20.8 %
–       producer_mismatch     97.5 %     16.6 %
–       title_num             68.6 %      4.8 %
–       pdfid1_num            81.5 %      2.8 %
–       title_uc              68.6 %      4.8 %
–       pos_eof_min          100.0 %     93.4 %
+       count_javascript       6.0 %     88.0 %
+       count_js               5.2 %     83.4 %
+       count_trailer         89.3 %     97.7 %
+       pos_page_avg         100.0 %    100.0 %
+       count_endobj         100.0 %     99.6 %
+       createdate_tz         85.5 %     99.9 %
+       count_action          16.4 %     73.8 %

5.9. Model Stealing for White-Box Explanations
Our experiments show that practitioners from the security domain should favor white-box methods over black-box methods when aiming to explain neural networks. However, there are cases when access to the parameters of the system is not available and white-box methods cannot be used. Instead of using black-box methods, one could also use model stealing to obtain an approximation of the original network [49]. This approach assumes that the user can predict an unlimited number of samples with the model to be explained. The obtained predictions can then be used to train a surrogate model which might have a different architecture but a similar behavior.

To evaluate the differences between the explanations of surrogate models and the original ones, we conduct an experiment on the Drebin+ and Mimicus+ datasets as follows: We use the predictions of the original model from Grosse et al. [20], which has two dense layers with 200 units each, and use these predictions to train three surrogate models. The number of layers is varied to be [1, 2, 3] and the number of units in each layer is always 200, resulting in models with higher, lower, and the original complexity. For each model we calculate explanations via LRP and compute the intersection size given by Equation 1 for k = 10.

The results in Figure 7 show that the models deliver similar explanations to the original model (IS ≈ 0.7) although having different architectures for the Drebin+ dataset. However, the similarity between the stolen models is clearly higher (IS ≈ 0.85). For the Mimicus+ dataset, we observe a general stability of the learned features at a lower level (IS ≈ 0.55). These results indicate that the explanations of the stolen models are better than those obtained from black-box methods (see Figure 3) but still deviate from the original model, i.e., there is no global transferability between the explanations. All in all, model stealing can be considered a good alternative to the usage of black-box explanation methods.
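The following sketch outlines this experiment with stand-in components: a small scikit-learn MLP acts as the victim, the surrogate is trained on the victim's predicted labels only, and a simple weight-based relevance serves as a placeholder for the LRP explanations used above. The intersection-size function reflects one plausible reading of Equation 1.

```python
# Hedged sketch of the surrogate-model ("model stealing") experiment and of the
# Top-k intersection size used to compare explanations. All models, features, and
# the toy relevance function are illustrative stand-ins, not the original setup.
import numpy as np
from sklearn.neural_network import MLPClassifier

def intersection_size(r_a, r_b, k=10):
    """|top-k(r_a) ∩ top-k(r_b)| / k -- one plausible reading of Equation 1."""
    top_a = set(np.argsort(-np.abs(r_a))[:k])
    top_b = set(np.argsort(-np.abs(r_b))[:k])
    return len(top_a & top_b) / k

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 50)).astype(float)   # binary toy features
y = (X[:, :5].sum(axis=1) > 2).astype(int)              # toy ground truth

victim = MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=300).fit(X, y)

# Model stealing: the surrogate only ever sees the victim's output labels.
X_query = rng.integers(0, 2, size=(2000, 50)).astype(float)
surrogate = MLPClassifier(hidden_layer_sizes=(200,), max_iter=300).fit(
    X_query, victim.predict(X_query))

# Toy relevance (input times summed first-layer weights) as a stand-in for LRP.
def explain(model, x):
    return x * model.coefs_[0].sum(axis=1)

scores = [intersection_size(explain(victim, x), explain(surrogate, x)) for x in X[:100]]
print("mean intersection size:", np.mean(scores))
```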

6. Insights on the Datasets

During the experiments for this paper, we have analyzed various explanations of security systems—not only quantitatively as discussed in Section 5 but also qualitatively from the perspective of a security analyst. In this section, we summarize our observations and discuss insights related to the role of deep learning in security. Moreover, we publish the generated explanations from all datasets and methods on the project's website (http://explain-mlsec.org) in order to foster future research.

6.1. Insights on Mimicus+

When inspecting explanations for the Mimicus+ system, we observe that the features for detecting malware are dominated by count_javascript and count_js, which both stand for the number of JavaScript elements in the document. The strong impact of these elements is meaningful, as JavaScript is frequently used in malicious PDF documents [27]. However, we also identify features in the explanations that are non-intuitive. For example, features like count_trailer, which measures the number of trailer sections in the document, or count_box_letter, which counts the number of US letter sized boxes, can hardly be related to security and rather constitute artifacts in the dataset captured by the learning process.

To further investigate the impact of JavaScript features on the neural network, we determine the distribution of the top 5 features from the method IG for each class in the entire dataset. It turns out that JavaScript appears in 88 % of the malicious documents, whereas only about 6 % of the benign samples make use of it (see Table 11). This makes JavaScript an extremely discriminating feature for the dataset. From a security perspective, this is an unsatisfying result, as the neural network of Mimicus+ relies on a few indicators for detecting the malicious code in the documents. An attacker could potentially evade Mimicus+ by not using JavaScript or by obfuscating the JavaScript elements in the document.
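A frequency analysis of this kind requires only a few lines of code. The sketch below assumes a binary feature matrix X, labels y (0 for benign, 1 for malicious), and a list of feature names; it merely counts how often selected features occur in each class, as reported in Table 11.

```python
# Hedged sketch of the per-class frequency analysis behind Table 11. The inputs
# X, y, and feature_names are assumed to be available; names are illustrative.
import numpy as np

def class_frequencies(X, y, feature_names, selected):
    """Return {feature: (benign %, malicious %)} for the selected features."""
    out = {}
    for name in selected:
        j = feature_names.index(name)
        benign = 100.0 * np.mean(X[y == 0, j] > 0)
        malicious = 100.0 * np.mean(X[y == 1, j] > 0)
        out[name] = (benign, malicious)
    return out
```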
6.2. Insights on Drebin+

During the analysis of the Drebin+ dataset, we notice that several benign applications are characterized by the hardware feature touchscreen, the intent filter launcher, and the permission INTERNET. These features frequently occur in benign and malicious applications in the Drebin+ dataset and are not particularly descriptive for benignity. Note that the interpretation of features speaking for benign applications is challenging due to the broader scope and the difficulty in defining benignity.
We conclude that the three features together form an artifact in the dataset that provides an indicator for detecting benign applications.

For malicious Android applications, the situation is different: The explanation methods return highly relevant features that can be linked to the functionality of the malware. For instance, the requested permission SEND_SMS or features related to accessing sensitive information, such as the permission READ_PHONE_STATE and the API call getSimCountryIso, receive consistently high scores in our investigation. These features are well in line with common malware for Android, such as the FakeInstaller family [32], which is known to obtain money from victims by secretly sending text messages (SMS) to premium services. Our analysis shows that the MLP network employed in Drebin+ has captured indicative features directly related to the underlying malicious activities.

6.3. Insights on VulDeePecker

In contrast to the datasets considered before, the features processed by VulDeePecker resemble lexical tokens and are strongly interconnected on a syntactical level. This becomes apparent in the explanations of the method Integrated Gradients in Figure 4, where adjacent tokens have mostly equal colors. Moreover, orange and blue colored features in the explanation are often separated by tokens with no color, indicating a gradual separation of positive and negative relevance values.

During our analysis, we notice that it is still difficult for a human analyst to benefit from the highlighted tokens. First, an analyst interprets the source code rather than the extracted tokens and thus maintains a different view on the data. In Figure 4, for example, the interpretation of the highlighted INT0 and INT1 tokens as buffer sizes of 50 and 100 wide characters is misleading, since the neural network is not aware of this relation. Second, VulDeePecker truncates essential parts of the code. In Figure 4, during the initialization of the destination buffer, for instance, only the size remains as part of the input. Third, the large amount of highlighted tokens like semicolons, brackets, and equality signs seems to indicate that VulDeePecker overfits to the training data at hand.

Given the truncated program slices and the seemingly unrelated tokens marked as relevant, we conclude that the VulDeePecker system might benefit from extending the learning strategy to longer sequences and cleansing the training data to remove artifacts that are irrelevant for vulnerability discovery.

6.4. Insights on DAMD

Finally, we consider Android applications from the DAMD dataset. Due to the difficulty of analyzing raw Dalvik bytecode, we guide our analysis of the dataset by inspecting malicious applications from three popular Android malware families: GoldDream [26], DroidKungFu [25], and DroidDream [18]. These families exfiltrate sensitive data and run exploits to take full control of the device.

In our analysis of the Dalvik bytecode, we benefit from the sparsity of the explanations from LRP and IG as explained in Section 5.3. Analyzing all relevant features becomes tractable with moderate effort using these methods, and we are able to investigate the opcodes with the highest relevance in detail. We observe that the relevant opcode sequences are linked to the malicious functionality.

As an example, Table 4 depicts the opcode sequence that is found in all samples of the GoldDream family. Taking a closer look, this sequence occurs in the onReceive method of the com.GoldDream.zj.zjReceiver class. In this function, the malware intercepts incoming SMS and phone calls and stores the information in local files before sending them to an external server. Similarly, we can interpret the explanations of the other two malware families, where functionality related to exploits and persistent installation is highlighted in the Dalvik opcode sequences.

For all members of each malware family, the opcode sequences identified using the explanation methods LRP and IG are identical, which demonstrates that the CNN in the DAMD system has learned a discriminative pattern from the underlying opcode representation.

7. Conclusion

The increasing application of deep learning in security renders means for explaining its decisions vitally important. While there exists a wide range of explanation methods from the area of computer vision and machine learning, it has been unclear which of these methods are suitable for security systems. We have addressed this problem and propose evaluation criteria that enable a practitioner to compare and select explanation methods in the context of security. While the importance of these criteria depends on the particular security task, we find that the methods Integrated Gradients and LRP comply best with all requirements. Hence, we generally recommend these methods for explaining predictions in security systems.

Aside from our evaluation of explanation methods, we reveal problems in the general application of deep learning in security. For all considered systems under test, we identify artifacts that substantially contribute to the overall prediction but are unrelated to the security task. Several of these artifacts are rooted in peculiarities of the data. It is likely that the employed neural networks overfit the data rather than solving the underlying task. We thus conclude that explanations need to become an integral part of any deep learning system to identify artifacts in the training data and to keep the learning focused on the targeted security problem.

Our study is a first step for integrating explainable learning in security systems. We hope to foster a series of research that applies and extends explanation methods, such that deep learning becomes more transparent in computer security. To support this development, we make all our implementations and datasets publicly available.

Acknowledgements

The authors gratefully acknowledge funding from the German Federal Ministry of Education and Research (BMBF) under the projects VAMOS (FKZ 16KIS0534) and BIFOLD (FKZ 01IS18025B). Furthermore, the authors acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2092 CASA-390781972.
References

[1] M. Alber, S. Lapuschkin, P. Seegerer, M. Hägele, K. T. Schütt, G. Montavon, W. Samek, K.-R. Müller, S. Dähne, and P.-J. Kindermans. iNNvestigate neural networks! Technical Report abs/1808.04260, Computing Research Repository (CoRR), 2018.
[2] M. Ancona, E. Ceolini, C. Öztireli, and M. Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations (ICLR), 2018.
[3] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck. Drebin: Efficient and explainable detection of Android malware in your pocket. In Proc. of the Network and Distributed System Security Symposium (NDSS), Feb. 2014.
[4] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek. "What is relevant in a text document?": An interpretable machine learning approach. PLoS ONE, 12(8), Aug. 2017.
[5] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7), July 2015.
[6] N. Carlini. Is AmI (attacks meet interpretability) robust to adversarial examples? Technical Report abs/1902.02322, Computing Research Repository (CoRR), 2019.
[7] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. In Proc. of the IEEE Symposium on Security and Privacy, pages 39–57, 2017.
[8] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
[9] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Technical Report abs/1606.04435, Computing Research Repository (CoRR), 2014.
[10] Z. L. Chua, S. Shen, P. Saxena, and Z. Liang. Neural nets can learn function type signatures from binaries. In Proc. of the USENIX Security Symposium, pages 99–116, 2017.
[11] P. Dabkowski and Y. Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems (NIPS), pages 6967–6976, 2017.
[12] A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In IEEE Symposium on Security and Privacy, pages 598–617, 2016.
[13] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 2016.
[14] A.-K. Dombrowski, M. Alber, C. J. Anders, M. Ackermann, K.-R. Müller, and P. Kessel. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems (NIPS), 2019.
[15] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons, second edition, 2000.
[16] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[17] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision, pages 3449–3457, 2017.
[18] J. Foremost. DroidDream mobile malware. https://www.virusbulletin.com/virusbulletin/2012/03/droiddream-mobile-malware, 2012. (Online; accessed 14-February-2019).
[19] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[20] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. D. McDaniel. Adversarial examples for malware detection. In Proc. of the European Symposium on Research in Computer Security (ESORICS), pages 62–79, 2017.
[21] W. Guo, D. Mu, J. Xu, P. Su, G. Wang, and X. Xing. LEMNA: Explaining deep learning based security applications. In Proc. of the ACM Conference on Computer and Communications Security (CCS), pages 364–379, 2018.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[23] W. Huang and J. W. Stokes. MtNet: A multi-task neural network for dynamic malware classification. In Proc. of the Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), pages 399–418, 2016.
[24] P.-J. Kindermans, K. T. Schütt, M. Alber, K.-R. Müller, D. Erhan, B. Kim, and S. Dähne. Learning how to explain neural networks: PatternNet and PatternAttribution. In Proc. of the International Conference on Learning Representations (ICLR), 2018.
[25] X. Jiang. Security Alert: New sophisticated Android malware DroidKungFu found in alternative Chinese app markets. https://www.csc2.ncsu.edu/faculty/xjiang4/DroidKungFu.html, 2011. (Online; accessed 14-February-2019).
[26] X. Jiang. Security Alert: New Android malware GoldDream found in alternative app markets. https://www.csc2.ncsu.edu/faculty/xjiang4/GoldDream/, 2011. (Online; accessed 14-February-2019).
[27] A. Kapravelos, Y. Shoshitaishvili, M. Cova, C. Kruegel, and G. Vigna. Revolver: An automated approach to the detection of evasive web-based malware. In Proc. of the USENIX Security Symposium, pages 637–651, Aug. 2013.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., 2012.
[29] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. In The Handbook of Brain Theory and Neural Networks. MIT, 1995.
[30] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong. VulDeePecker: A deep learning-based system for vulnerability detection. In Proc. of the Network and Distributed System Security Symposium (NDSS), 2018.
[31] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NIPS), pages 4765–4774, 2017.
[32] McAfee. Android/FakeInstaller.L. https://home.mcafee.com/virusinfo/, 2012. (Online; accessed 1-August-2018).
[33] N. McLaughlin, J. M. del Rincón, B. Kang, S. Y. Yerima, P. C. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, and G.-J. Ahn. Deep android malware detection. In Proc. of the ACM Conference on Data and Application Security and Privacy (CODASPY), pages 301–308, 2017.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proc. of the International Conference on Learning Representations (ICLR Workshop), 2013.
[35] N. Papernot, P. D. McDaniel, A. Sinha, and M. P. Wellman. SoK: Security and privacy in machine learning. In Proc. of the IEEE European Symposium on Security and Privacy (EuroS&P), pages 399–414, 2018.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[37] R. Rojas. Neural Networks: A Systematic Approach. Springer-Verlag, Berlin, Germany, 1996. ISBN 3-450-60505-3.
[38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel distributed processing: Explorations in the microstructure of cognition, 1(Foundation), 1986.
[39] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision (ICCV), pages 618–626, Oct. 2017.
[40] L. Shapley. A value for n-person games. 1953.
[41] E. C. R. Shin, D. Song, and R. Moazzezi. Recognizing functions in binaries with neural networks. In Proc. of the USENIX Security Symposium, pages 611–626, 2015.
[42] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proc. of the International Conference on Machine Learning (ICML), pages 3145–3153, 2017.
[43] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proc. of the International Conference on Learning Representations (ICLR), 2014.
[44] D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES), 2019.
[45] C. Smutz and A. Stavrou. Malicious PDF detection using metadata and structural features. In Proc. of the Annual Computer Security Applications Conference (ACSAC), pages 239–248, 2012.
[46] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR (workshop track), 2015.
[47] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328, 2017.
[48] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014.
[49] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pages 601–618, Austin, TX, Aug. 2016. USENIX Association. ISBN 978-1-931971-32-4. URL https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer.
[50] N. Šrndić and P. Laskov. Practical evasion of a learning-based classifier: A case study. In Proc. of the IEEE Symposium on Security and Privacy, pages 197–211, 2014.
[51] A. Warnecke. Layerwise Relevance Propagation for LSTMs. https://github.com/alewarne/Layerwise-Relevance-Propagation-for-LSTMs.
[52] A. Warnecke. Explain Security DNNs. https://github.com/alewarne/ExplainSecurityDNNs.
[53] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proc. of the ACM Conference on Computer and Communications Security (CCS), pages 363–376, 2017.
[54] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer International Publishing, 2014.
[55] X. Zhang, N. Wang, H. Shen, S. Ji, X. Luo, and T. Wang. Interpretable deep learning under fire. In Proc. of the USENIX Security Symposium, 2019.
[56] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.
[57] Y. Zhou and X. Jiang. Dissecting android malware: Characterization and evolution. In Proc. of the IEEE Symposium on Security and Privacy, pages 95–109, 2012.
Appendix

1. Related Concepts

Some of the considered explanation methods share similarities with techniques of adversarial examples and feature selection. While these similarities result from an analogous analysis of the prediction function f_N, the underlying objectives are fundamentally different from explainable learning and cannot be transferred easily. In the following, we briefly highlight these different objectives:

Adversarial examples. Adversarial examples are constructed by determining a minimal perturbation δ such that f_N(x + δ) ≠ f_N(x) for a given neural network N and an input vector x [7, 35]. The perturbation δ encodes which features need to be modified to change the prediction. However, the perturbation does not explain why x was given the label y by the neural network. The Gradients explanation method described in Section 3 shares similarities with some attacks generating adversarial examples, as the gradient ∂f_N/∂x_i is used to quantify the difference of f_N when changing a feature x_i slightly. Still, algorithms for determining adversarial examples are insufficient for computing reasonable explanations.

Note that we deliberately do not study adversarial examples in this paper. Techniques for attacking and defending learning algorithms are orthogonal to our work. These techniques can be augmented using explanations, yet it is completely open how this can be done in a secure manner. Recent defenses for adversarial examples based on explanations have proven to be totally ineffective [6].

Feature selection. This concept aims at reducing the dimensionality of a learning problem by selecting a subset of discriminative features [15]. At first glance, features determined through feature selection seem like a good fit for explanations. While the selected features can be investigated and often capture characteristics of the underlying data, they are determined independently of a particular learning model. As a result, feature selection methods cannot be directly applied for explaining the decision of a neural network.

2. Incompatible Explanation Methods

Several explanation methods are not suitable for general application in security, as they do not support common architectures of neural networks used in this area (see Table 1). We do not consider these methods in our evaluation, yet for completeness we provide a short overview of these methods in the following.

PatternNet and PatternAttribution. These white-box methods are inspired by the explanation of linear models. While PatternNet determines gradients and replaces neural network weights by so-called informative directions, PatternAttribution builds on the LRP framework and computes explanations relative to so-called root points whose output is 0. Both approaches are restricted to feed-forward and convolutional networks. Recurrent neural networks are not supported.

DeConvNet and GuidedBackProp. These methods aim at reconstructing an input x given output y, that is, mapping y back to the input space. To this end, the authors present an approach to revert the computations of a convolutional layer followed by a rectified linear unit (ReLU) and max-pooling, which is the essential sequence of layers in neural networks for image classification. Similar to LRP and DeepLift, both methods perform a backwards pass through the network. The major drawback of these methods is again the restriction to convolutional neural networks.

CAM, GradCAM, and GradCAM++. These three white-box methods compute relevance scores by accessing the output of the last convolutional layer in a CNN and performing global average pooling. Given the activations a_ki of the k-th channel at unit i, GradCAM learns weights w_k such that

    y ≈ Σ_k Σ_i w_k a_ki.

That is, the classification is modeled as a linear combination of the activations of the last layer of all channels, and finally r_i = Σ_k w_k a_ki. GradCAM and GradCAM++ extend this approach by including specific gradients in this calculation. All three methods are only applicable if the neural network uses a convolutional layer as the final layer. While this setting is common in image recognition, it is rarely used in security applications and thus we do not analyze these methods.
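The following numpy sketch illustrates this linear-combination view of CAM-style relevance; the activations and weights are random stand-ins for the last convolutional layer and the learned class weights of a real CNN.

```python
# Minimal sketch of CAM-style relevance: r_i = sum_k w_k * a_{k,i}. Shapes and
# values are illustrative stand-ins for a real network's last convolutional layer.
import numpy as np

K, H, W = 8, 7, 7                       # channels and spatial size of the last conv layer
activations = np.random.rand(K, H, W)   # a_{k,i}: activation of channel k at unit i
w = np.random.rand(K)                   # w_k: class weights after global average pooling

# class score as a linear combination: y ≈ sum_k sum_i w_k * a_{k,i}
y = float((w[:, None, None] * activations).sum())

# relevance map r_i = sum_k w_k * a_{k,i}; in practice it is upsampled to the input size
relevance = np.tensordot(w, activations, axes=([0], [0]))   # shape (H, W)
print(y, relevance.shape)
```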
RTIS and MASK. These methods compute relevance scores by solving an optimization problem for a mask m. A mask m is applied to x as m ◦ x in order to affect x, for example by setting features to zero. To this end, Fong and Vedaldi [17] propose the optimization problem

    m* = arg min_{m ∈ [0,1]^d}  λ ‖1 − m‖₁ + f_N(m ◦ x),

which determines a sparse mask that identifies relevant features of x. This can be solved using gradient descent, which thus makes these white-box approaches. However, solving the equation above often leads to noisy results, which is why RTIS and MASK add additional terms to achieve smooth solutions using regularization and blurring. These concepts, however, are only applicable to images and cannot be transferred to other types of features.
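A minimal sketch of this optimization, assuming a differentiable PyTorch classifier and omitting the smoothing and blurring terms of the original methods, could look as follows; all names and hyper-parameters are illustrative.

```python
# Hedged sketch of the mask optimization: find a sparse mask m in [0,1]^d that,
# applied to x, minimizes the class score while keeping most of the mask at 1.
import torch

def optimize_mask(model, x, target_class, lam=0.05, steps=300, lr=0.1):
    m = torch.ones_like(x, requires_grad=True)           # start with an all-ones mask
    opt = torch.optim.Adam([m], lr=lr)
    for _ in range(steps):
        m_clamped = m.clamp(0.0, 1.0)
        score = model(m_clamped * x)[0, target_class]     # f_N(m ∘ x)
        loss = lam * (1.0 - m_clamped).abs().sum() + score
        opt.zero_grad()
        loss.backward()
        opt.step()
    return m.detach().clamp(0.0, 1.0)
```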
Quantitative Input Influence. This method is another black-box approach which calculates relevances by changing input features and calculating the difference between the outcomes. Let X_{−i} U_i be the random variable with the i-th input of X being replaced by a random value that is drawn from the distribution of feature x_i. Then the relevance of feature i for a classification to class c is given by

    r_i = E[ f_N(X_{−i} U_i) ≠ c | X = x ].

However, when the features of X are binary, as in some of our datasets, this equation becomes

    r_i = 1 if f_N(x_{¬i}) ≠ c, and r_i = 0 otherwise.

As noted by Datta et al. [12], this results in many features receiving a relevance of zero, which has no meaning. We notice that even the extension to sets proposed by Datta et al. [12] does not solve this problem since it is highly related to degenerated explanations as discussed in Section 5.4.
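The following sketch estimates this quantity empirically by resampling a single feature from reference data; the predict function and the data arrays are placeholders for the classifier f_N and its input distribution.

```python
# Hedged sketch of a QII-style relevance estimate: the relevance of feature i is the
# empirical probability that the prediction changes when x_i is replaced by values
# drawn from the feature's distribution. All inputs are illustrative placeholders.
import numpy as np

def qii_relevance(predict, X_background, x, i, n_samples=100, rng=None):
    rng = rng or np.random.default_rng()
    c = predict(x[None, :])[0]                                    # original class of x
    X_rep = np.repeat(x[None, :], n_samples, axis=0)
    X_rep[:, i] = rng.choice(X_background[:, i], size=n_samples)  # resample feature i
    return float(np.mean(predict(X_rep) != c))                    # estimate of E[f_N(X_-i U_i) != c]
```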

3. Completeness of Datasets: Example calculation

In Section 5.4 we discussed the problem of incomplete or degenerated explanations from black-box methods that can occur when there are not enough labels from the opposite class in the perturbations. Here we give a concrete example when enforcing 5 % of the labels to be from the opposite class.

Table 12 shows the results of this experiment. On average, 29 % of the samples cannot be explained well, as the computed perturbations contain too few instances from the opposite class. In particular, we observe that creating malicious perturbations from benign samples is a hard problem in the case of Drebin+ and DAMD, where only 32.6 % and 2.8 % of the benign samples achieve sufficient perturbations from the opposite class.

TABLE 12: Incomplete explanations of black-box methods. First two columns: Samples remaining when enforcing at least 5 % perturbations of the opposite class.

System          Class−    Class+    Incomplete
Drebin+         24.2 %    97.1 %    66.3 %
Mimicus+        73.5 %    98.9 %    14.2 %
VulDeePecker    90.5 %    99.8 %     7.1 %
DAMD             5.9 %    94.8 %    44.9 %
Average         48.3 %    97.7 %    33.15 %
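The criterion behind this experiment can be expressed in a few lines; the sketch below assumes that the labels a model assigned to the perturbations of a single sample are already available and simply tests the 5 % threshold.

```python
# Hedged sketch of the completeness check behind Table 12: a sample is considered
# explainable only if at least 5% of its perturbation labels belong to the
# opposite class. Variable names are illustrative.
import numpy as np

def has_sufficient_perturbations(perturbation_labels, original_label, min_fraction=0.05):
    opposite = np.mean(np.asarray(perturbation_labels) != original_label)
    return opposite >= min_fraction
```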
