
Black Box Fairness Testing of Machine Learning Models

Aniya Aggarwal Pranay Lohia Seema Nagar Kuntal Dey Diptikalyan Saha
[email protected] [email protected] [email protected] [email protected] [email protected]
IBM Research AI IBM Research AI IBM Research AI IBM Research AI IBM Research AI
India India India India India
ABSTRACT

Any given AI system cannot be accepted unless its trustworthiness is proven. An important characteristic of a trustworthy AI system is the absence of algorithmic bias. "Individual discrimination" exists when an individual who differs from another only in "protected attributes" (e.g., age, gender, race, etc.) receives a different decision outcome from a given machine learning (ML) model. The current work addresses the problem of detecting the presence of individual discrimination in given ML models. Detection of individual discrimination is test-intensive, which in a black-box setting is not feasible for non-trivial systems. We propose a methodology for the auto-generation of test inputs for the task of detecting individual discrimination. Our approach combines two well-established techniques, symbolic execution and local explainability, for effective test case generation. We empirically show that our approach to generating test cases is very effective compared to the best-known benchmark systems that we examine.

CCS CONCEPTS

• Software and its engineering → Software testing and debugging.

KEYWORDS

Individual Discrimination, Fairness Testing, Symbolic Execution, Local Explainability

ACM Reference Format:
Aniya Aggarwal, Pranay Lohia, Seema Nagar, Kuntal Dey, and Diptikalyan Saha. 2019. Black Box Fairness Testing of Machine Learning Models. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '19), August 26–30, 2019, Tallinn, Estonia. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3338906.3338937

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE '19, August 26–30, 2019, Tallinn, Estonia
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-5572-8/19/08...$15.00. https://doi.org/10.1145/3338906.3338937

1 INTRODUCTION

Model Bias. This decade marks the resurgence of Artificial Intelligence (AI), with AI models taking crucial decisions in many systems, from hiring decisions and loan approvals to the design of driver-less cars. The dependability of AI models is therefore of utmost importance to ensure wide acceptance of AI systems. One of the important aspects of a trusted AI system is to ensure that its decisions are fair. Bias may be inherent in a decision-making system in multiple ways. It can exist in the form of group discrimination [6], where two different groups (e.g., based on "protected attributes" such as gender or race) receive different decisions, or through individual discrimination [7], which discriminates between two samples. Note that discrimination-aware systems need to be trained to avoid discriminating on sensitive characteristic features, which are termed "protected attributes." Protected attributes are application specific. Features such as age, gender, and ethnicity are frequent examples of what many applications treat as protected attributes [7].

Individual discrimination. In this paper, we address the problem of detecting individual discrimination in machine learning models. The definition of individual fairness/bias that we use is a simplified, non-probabilistic form of counterfactual fairness [12], which also fits into Dwork's framework of individual fairness [5]. Under this definition, a system is fair if any two valid inputs which differ only in the protected attributes are always assigned the same class (and bias exists if, for some pair of valid inputs, the model yields different classifications). Such cases of bias have previously been noticed in models [7] and have caused derogatory consequences for the model owners. Detection of such cases is therefore of utmost importance. Note that such bias cannot be removed simply by removing the protected attributes from the training data, as individual discrimination may still exist due to possible correlations between protected and non-protected attributes, as in the case of race (protected) and zip code (non-protected) in the Adult census income dataset.¹ This is an instance of indirect discrimination, for which individual discrimination testing is still required over correlated non-protected attributes. The challenge, therefore, is to find those combinations of non-protected attribute values for which, across values of the protected attributes, the model demonstrates such individual discrimination behavior.

Existing Techniques and their drawbacks. Measuring individual discrimination requires exhaustive testing, which is infeasible for a non-trivial system. Existing techniques, such as THEMIS [7] and AEQUITAS [23], generate a test suite to determine if, and how much, individual discrimination is present in the model. THEMIS selects random values from the domain for all attributes to determine whether the system discriminates amongst individuals. AEQUITAS generates test cases in two phases. The first phase generates test cases by random sampling of the input space. The second phase takes every discriminatory input found in the first phase and perturbs it to generate further test cases. Both techniques aim to generate more discriminatory inputs. Even though these two techniques are applicable to any black-box system, our experiments demonstrate that they miss many combinations of non-protected attribute values for which individual discrimination may exist.

¹ https://archive.ics.uci.edu/ml/datasets/bank+marketing
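The pairwise notion of individual discrimination used above can be made concrete with a small sketch. The model, attributes, and values here are invented for illustration only; they are not the paper's subjects:

```python
# Toy illustration: a hand-written "model" that leaks the protected
# attribute "gender", and the pairwise individual-bias check.

def toy_model(age, gender, income):
    """Hypothetical loan-approval classifier; 'gender' is protected."""
    score = 0.3 * (income > 40000) + 0.7 * (age > 25)
    if gender == "female":       # direct (unfair) use of a protected attribute
        score -= 0.6
    return 1 if score >= 0.5 else 0

def is_individual_bias_pair(x, x_prime, model, protected):
    """True iff the pair agrees on all non-protected attributes, differs on
    some protected one, and the model assigns them different classes."""
    same_rest = all(x[k] == x_prime[k] for k in x if k not in protected)
    diff_prot = any(x[k] != x_prime[k] for k in protected)
    return same_rest and diff_prot and model(**x) != model(**x_prime)

x = {"age": 30, "gender": "male", "income": 50000}
x_prime = {"age": 30, "gender": "female", "income": 50000}
print(is_individual_bias_pair(x, x_prime, toy_model, protected={"gender"}))  # True
```

The search problem the paper tackles is precisely finding such pairs automatically when the model is only accessible as a black box.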


We also look to cover more diverse paths of the model in order to generate more test inputs.

Our approach. Our aim is to search the feature space systematically, covering more of the space without much redundancy. There exist symbolic evaluation based techniques [2, 9, 17] to automatically generate test inputs by systematically exploring different execution paths in a program. Such methods avoid generating multiple inputs that explore the same program path. These techniques are essentially white-box and leverage the capabilities of constraint solvers [4] to create test inputs automatically. Symbolic execution starts with a random input and analyzes its path to generate a set of path constraints (i.e., conditions on the input attributes), then iteratively toggles (negates) the constraints in the path to generate new sets of path constraints. It solves the resultant path constraints using a constraint solver to generate a new input which can possibly drive control to a new path, as explained with an example in Section 2. Our idea is to use such dynamic symbolic execution to generate test inputs which can potentially uncover individual discrimination in ML models. However, existing techniques of this kind have been used to generate inputs for procedural programs, which are interpretable. Our main challenge is to apply such a technique to non-interpretable models, which cannot be executed symbolically. Note that, similar to THEMIS, our goal is to build a scalable black-box solution which can be applied efficiently to varied models. Black-box analysis will enable an AI testing service, where a third-party machine learning model in the form of an access API can be given as input along with some payload/training data.

Challenges. A few works try to use symbolic evaluation based techniques for non-interpretable models such as deep neural networks, although they do not address the problem of finding individual discrimination in the model. Such techniques are essentially white-box and try to approximate the functions (ReLU/Sigmoid) that exist in the network. They are therefore catered towards specific kinds of networks and are not general. Other test-case generation techniques [18, 19] use coverage criteria (like neuron coverage, sign coverage, etc.) which are structure dependent; hence, such techniques suffer from scalability issues.

Solution overview. In this paper, our key idea is to use a local interpretable model as the path in symbolic execution. A local explainer such as LIME [15] can produce a decision tree path corresponding to an input in a model-agnostic way. The decisions in the decision tree path are then toggled to generate a new set of constraints. Next, we list several advantages and salient features of our approach.

• Constraints. It is possible to use an off-the-shelf local explainer to generate a linear approximation of the path. The linear constraints obtained from such an explainer can be used for symbolic evaluation without requiring any specialized constraint solver [18].
• Data-driven. Our algorithm can take advantage of available data, which can be used as the seed data to start the search.
• Global and Local Search. Once an individual discrimination instance is found, we perform a local search to uncover many input combinations which can reveal more discrimination. Otherwise, we perform a global search using symbolic execution to cover different paths in the model.
• Optimizations. The local explainer reports a confidence score associated with each predicate. Our algorithm selects constraints for toggling based on these confidence scores.
• Scalability. Our algorithm systematically traverses paths in the feature space by toggling feature-related constraints. This makes it scalable, unlike other techniques [18] which consider structure-based coverage criteria.

Contributions. Our contributions are listed next.

• We present a novel technique to find individual discrimination in a model.
• We develop a novel combination of dynamic symbolic execution and local explanation to generate test cases for non-interpretable models. We believe that the use of a local explainer will open up many avenues for path-based analysis of black-box AI models.
• We demonstrate the effectiveness of our technique on several open-source classification models having known biases. We empirically compare our technique with the existing algorithms, THEMIS and AEQUITAS, and demonstrate the performance improvement delivered by our approach over these prior works.

Outline. Section 2 presents the required background on dynamic symbolic execution and local explainability. In Section 3, we present our solution, concentrating on the various challenges we faced in combining symbolic execution with local explanation. In Section 4, we present our experimental setup and results. We discuss related prior work in Section 5, while Section 6 gives a brief summary along with possible future extensions of this work.

2 BACKGROUND

2.1 Notation and Individual Discrimination

Consider a dataset D with the set of attributes A = {A1, A2, ..., An}, where P ⊂ A denotes its set of protected attributes. We use x, x′ to represent two data instances, where x, x′ ∈ D. Each data instance is an n-tuple x = (x1, ..., xn), where xi denotes the value of Ai in data instance x. The domain of attribute Ai is denoted Dom(Ai). If a model M is trained on the training data TD, then M(x) denotes the output of this model on an input x. Formally, an individual bias instance is a pair (x, x′) of input samples such that xi = x′i for all Ai ∉ P, ∃j such that xj ≠ x′j where Aj ∈ P, and M(x) ≠ M(x′).

Note that we use a strict notion of equality for all non-protected attributes, which does not consider issues such as equality of continuous attributes or possible correlation between attributes. However, the focus of the paper is systematic test case generation with respect to a definition of individual fairness; handling the above issues is left as a possible future extension.

2.2 Dynamic Symbolic Execution

Dynamic symbolic execution (DSE) [2, 9, 17] performs automated test generation by instrumenting and running a program while collecting execution path constraints for different inputs. It systematically toggles the predicates present in a set of path constraints to generate a new set of path constraints whose solution generates new inputs to further explore a new path.
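The toggle-and-solve loop just described can be sketched in a few lines. Here a "path" is simply an ordered list of interval predicates over named attributes, and negation is simplified to stepping above the upper bound; real DSE engines, and the algorithm in this paper, work against a richer constraint language:

```python
# Sketch of the DSE toggling step: keep a prefix of the path and negate
# the next predicate. Predicates are (name, low, high) interval constraints.

def toggled_path_constraints(path):
    """Yield one new constraint set per predicate: prefix + negated predicate."""
    for i, (name, low, high) in enumerate(path):
        negated = (name, high, float("inf"))  # simplified one-sided negation
        yield path[:i] + [negated]

path = [("age", 25, 40), ("income", 0, 30000)]
new_paths = list(toggled_path_constraints(path))
print(new_paths)
# [[('age', 40, inf)], [('age', 25, 40), ('income', 30000, inf)]]
```

Each resulting constraint set, once handed to a solver, yields an input that steers execution down a path not yet covered.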


Note that this technique caters for path coverage and hence does not generate any redundant inputs. The path coverage criterion states that the test cases are executed in such a way that every possible path is executed at least once. Note further that the path coverage criterion is stronger than branch/decision coverage and statement coverage.

Next, we formally present a generalized version of the above algorithm, Algorithm 1, which can be used as a framework to generate test cases for AI models. We then discuss the changes in this generalized version compared to the dynamic symbolic execution algorithm used in DART [9].

Algorithm 1: Generalized Dynamic Symbolic Execution
1  count = 0;
2  inputs = seed_test_inputs()
3  priorityQ q = empty; q.enqueueAll(inputs, 0)
4  while count < limit && !q.isEmpty() do
5      t = q.dequeue()
6      check_for_error_condition(t)
7      Path p = getPath(t)
8      prefix_pred = true
9      foreach predicate c in order from top of path do
10         path_constraint = prefix_pred ∧ toggle(c)
11         if !visited_path.contains(path_constraint) then
12             visited_path.add(path_constraint)
13             input = solve(path_constraint)
14             r = rank(path_constraint)
15             q.enqueue(input, r)
16         end
17         prefix_pred = prefix_pred ∧ c
18     end
19     count++
20 end

The first change relates to the inputs the algorithm starts with. Instead of starting from a random input, the algorithm finds one or more seed inputs to start with (Line 2). Seeds can also be generated at random. The second change (Lines 3, 14, 15) is the abstraction of the ranking strategy for selecting which test input to execute next: the rank function determines which input should be processed next. The third change is the addition of a check to detect whether a path has already been traversed (Line 11). Such checks are not required in symbolic execution of programs, as the selection of predicates for toggling ensures that an already traversed path will not be traversed again. However, in an environment where path generation does not guarantee preservation of predicates in the path, this check is necessary to avoid generating redundant inputs.

The generation of constraints (Lines 9-17) by the generalized algorithm is illustrated in Figure 1. Note that variables not present in a constraint can take any value from their respective domains.

[Figure 1: Generation of path constraints]

2.3 Local Explainability

Local Interpretable Model-agnostic Explanations (LIME) [15] comprises explanation techniques that explain any prediction of any classifier or regressor in an interpretable and faithful manner, by approximating the model with an interpretable one locally around the prediction. It generates explanations in the form of interpretable models, such as linear models, decision trees, or falling rule lists, which can be easily comprehended by the user through visual or textual artifacts. Given an input data instance and its output as generated by the classifier, LIME generates data points in the vicinity of the instance by perturbing the input data point and obtains their outputs. Using such inputs and outputs, it learns an interpretable model by maximizing local fidelity and interpretability.

We use LIME to explain a prediction instance of a model, generating a decision tree as the interpretable model for the prediction.

3 ALGORITHM

In this section, we discuss our overall solution approach, spread across the subsequent three subsections. The first subsection discusses the goals of our test case generation technique under different conditions. We divide our algorithm into two kinds of search, called global search and local search; the next two subsections are devoted to these respective searches.

3.1 Problem Formulation

Below are the two optimization criteria that we want to achieve through the devised test case generation technique.

Effective Test Case Generation: Given a model M, a set of domain constraints C, and a protected attribute set P, the aim is to generate test cases that maximize the ratio |Succ|/|Gen|, where Gen is the set of non-protected attribute value combinations generated by the algorithm and Succ ⊆ Gen is the subset that leads to discrimination, i.e., each instance in Succ yields at least one different decision across different combinations of protected attribute values.

Here are a few salient points about this criterion.

• Test case: Each test case is considered not as the collection of values of all attributes, but only of the non-protected attributes. This ensures that multiple discriminatory test cases are not counted for the same combination of non-protected attribute values.
• Domain constraints: We assume that applying the domain constraints C will filter out unrealistic test cases.
• Order of generation and discrimination test: The optimization criterion does not specify whether all the test cases are generated at once or whether discrimination checking and generation go hand in hand. Thus, test case generation can also depend on discrimination checking.

In the software testing domain, there exist a number of predefined coverage criteria. Many such coverage criteria have also been defined in recent works on machine learning [19]. Next, we define a path coverage criterion that is applicable to varied types of models.

Coverage criteria: Note that defining a path coverage criterion for an arbitrary black-box model is not straightforward. It is possible to define paths for different types of models based on their operational characteristics.
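The perturb-label-fit loop behind the local explanation of Section 2.3 can be sketched in miniature. For brevity, this sketch fits a single decision stump rather than LIME's tree, and the black-box model, feature names, and neighbourhood parameters are stand-ins, not the paper's setup:

```python
# Miniature version of the local-explanation step: perturb an input,
# label the neighbourhood with the black-box model, and fit the best
# one-split surrogate of the form "predict 1 iff feature > t".
import random

def local_surrogate_split(model, x, feature_names, scale=1.0, n=200, seed=0):
    rng = random.Random(seed)
    points = [[v + rng.gauss(0, scale) for v in x] for _ in range(n)]
    labels = [model(p) for p in points]
    best = None  # (accuracy, feature name, threshold)
    for f, fname in enumerate(feature_names):
        for t in sorted({p[f] for p in points}):
            acc = sum((p[f] > t) == bool(l) for p, l in zip(points, labels)) / n
            if best is None or acc > best[0]:
                best = (acc, fname, t)
    return best

black_box = lambda p: int(p[0] + 0.1 * p[1] > 5)  # stand-in opaque model
acc, feature, threshold = local_surrogate_split(black_box, [5.0, 0.0], ["age", "income"])
print(feature, round(threshold, 2), "local accuracy", round(acc, 2))
```

The recovered split plays the role of one predicate on a decision tree path; the paper's get_path produces a whole path of such predicates, each with a confidence score.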


For instance, it is possible to define a path in a neural network based on the activation of neurons (similar to the branch coverage defined in [19]), and decision paths in a decision tree classifier.

We define our coverage criterion as follows. Given a classification model M and a set of test cases T, the coverage of T is the number of decision regions of M executed by T.

In this paper we use a decision tree classifier to approximate the behavior of the model M. We generate a highly accurate decision tree model to approximate the decision regions of M.

The aim of our test case generation technique is to maximize both path coverage and individual discrimination detection. In practice, there is always a limit on the automatic test case generation process within which both objectives must be maximized. In our case, we consider two such possible limits: 1) the number of generated test cases, and 2) the time taken to generate them.

In the subsequent subsections, we present our algorithm, which intends to maximize path coverage and effectively detect discrimination, and the way both goals are combined.

3.2 Maximizing Path Coverage

Path coverage maximization is carried out by leveraging the capabilities of the symbolic execution algorithm, which caters to the systematic exploration of different execution paths. In this subsection, we describe the various functions that were left undefined in Algorithm 1. The final algorithm is presented in Algorithm 2; maximizing path coverage is done in its global search module.

Algorithm 2: Individual Discrimination Checker
1  function generate_test_cases(seed, model, cluster_count, limit, Rank1, Rank2, Rank3):
2      count = 0;
3      priorityQ q = seed_test_input(seed, cluster_count, limit, Rank1)
4      while count < limit && !q.isEmpty() do
5          t = q.dequeue()
6          found = check_for_error_condition(t)
7          Path p = get_path(model, t)
8          if found then
9              // local search
10             foreach predicate c in order from top to bottom of p do
11                 path_constraint = p.constraint
12                 if c is on a protected attribute then continue
13                 path_constraint.remove(c); path_constraint.add(not(c))
14                 if !visited_path.contains(path_constraint) then
15                     visited_path.add(path_constraint);
16                     input = solve(path_constraint, t)
17                     r = average_confidence(path_constraint)
18                     q.enqueue(input, Rank2 - r)
19                 end
20             end
21         end
22         // global search
23         prefix_pred = true
24         foreach predicate c in order from top to bottom of p do
25             if c is on a protected attribute then continue
26             if c.confidence < T1 then break
27             path_constraint = prefix_pred ∧ not(c)
28             if !visited_path.contains(path_constraint) then
29                 visited_path.add(path_constraint);
30                 input = solve(path_constraint)
31                 r = average_confidence(path_constraint)
32                 q.enqueue(input, Rank3 + r)
33             end
34             prefix_pred = prefix_pred ∧ c
35         end
36         count++
37     end
38     return
39 function get_path(model, input):
40     Set<In, Out> inout = LIME_localexpl(model, input)
41     return genDecisionTree(inout);
42 function seed_test_input(data, cluster_count, limit, Rank1):
43     i = 0
44     priorityQ q = empty
45     clusters = KMeans(data, cluster_count)
46     while i < max(size(clusters)) do
47         if q.size() == limit then
48             break
49         end
50         foreach cluster ∈ clusters do
51             if i ≥ size(cluster) then
52                 continue
53             end
54             row = cluster.get(i)
55             q.enqueue(row, Rank1)
56         end
57         i++
58     end
59     return (q)
60 function check_for_error_condition(t):
61     class = M(t)
62     foreach ⟨val_0, ..., val_n⟩ such that ∀Ai ∈ P, val_i ∈ Dom(Ai) do
63         // try a combination of protected attribute values
64         tnew = replace the value of each Ai in t with val_i
65         class1 = M(tnew)
66         if class1 != class then
67             return Individual Discrimination Found
68         end
69     end
70     return Individual Discrimination Not Found

Path Creation. Let us start with the get_path method. As mentioned in Section 1, our key idea is to use the set of inputs and outputs generated by a local explainer, LIME. Our algorithm learns a decision tree from these inputs and outputs instead of the linear regression classifier used in the original implementation of LIME (refer to Line 39 in Algorithm 2).

With the above definition of get_path, the algorithm works in the same way as the symbolic execution algorithm: it negates the conditions present in the decision tree path to generate new sets of constraints, which can then be solved using a constraint solver to generate new inputs that can traverse other paths.

There are three major challenges associated with this straightforward combination of symbolic execution and a local model approximating a path. The first two arise from the inherent approximation in the local model, while symbolic execution is the reason for the third.

• Approximation: The decision tree path approximates the actual execution path in terms of interpretable features. Because of this approximation, duplicates of the actual program path can be generated.
• Confidence: A decision tree path has a confidence score associated with each of its predicates (which is not the case for a program path). The challenge lies in devising a way to use these confidence scores for better exploration of the paths.
• Path explosion: Symbolic execution in program testing suffers from the path explosion problem [8], especially in the depth-first-search setting. It can keep exploring paths deep in the program tree without exploring paths in other parts of the program. Researchers have explored various techniques to address this problem: demand-driven or directed techniques, which generate test cases towards a particular location in the program, and compositional techniques, which analyze functional modules separately before combining them to generate longer paths in the whole program. All these techniques exploit the structure of the program under test.
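The check_for_error_condition routine of Algorithm 2 (Lines 60-70) can be rendered roughly as follows. The stand-in model, attribute names, and domains here are invented for illustration:

```python
# Executable sketch of check_for_error_condition (Algorithm 2, Lines 60-70):
# hold the non-protected attributes of t fixed, try every combination of
# protected-attribute values, and report whether the class label flips.
from itertools import product

def check_for_error_condition(t, model, protected_domains):
    base = model(t)
    names = list(protected_domains)
    for combo in product(*(protected_domains[n] for n in names)):
        t_new = dict(t)
        t_new.update(zip(names, combo))
        if model(t_new) != base:
            return True   # individual discrimination found
    return False

# Stand-in models: one unfairly keys on the protected attribute, one does not.
biased_model = lambda s: int(s["income"] > 40000 and s["gender"] == "male")
fair_model = lambda s: int(s["income"] > 40000)
t = {"income": 50000, "gender": "male"}
print(check_for_error_condition(t, biased_model, {"gender": ["male", "female"]}))  # True
print(check_for_error_condition(t, fair_model, {"gender": ["male", "female"]}))    # False
```

Note that the cost of this check is exponential in the number of protected attributes, which is why it is applied per generated test case rather than over the whole input space.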


this problem - applying demand driven or directed technique Checking Individual Discrimination: To start with, let us consider
which generates test cases towards a particular location in the case of checking individual discrimination which is carried
the program, and compositional techniques which try to out in the method check_for_error_condition, as presented in
analyze various functional modules separately before com- Algorithm 2. The algorithm performs the check as per the definition
bining them to generate longer paths in the whole program. of individual discrimination. A test case is said to be individually
All these techniques exploit the structure of the program discriminatory if keeping the values of its set of non-protected
under test. attributes same, but changing the values of its protected attributes
set by trying every possible combination, yields different class
Addressing Path Explosion. To resolve the path explosion prob- labels.
lem, such that the path generation does not concentrate only on
a few localized portion of the entire space, we exploit the distri- Local Search: The symbolic execution as discussed in earlier sec-
bution present in the data (training or testing), if available. Each tions, tries to find test inputs to maximize the path coverage. We
data instance can be a good starting point of the symbolic search. call such a symbolic search strategy the global search. Some of the
Therefore, we add the data as the seed test data, as shown in method test inputs generated through seed data or symbolic execution will
at Line 42 given in Algorithm 2. be discriminatory in nature. To increase the likelihood of discrimi-
However, the order of data instances become important when natory test cases, we exploit the fact that we can execute test cases
a search limit is imposed due to which execution of all the data and check whether they are discriminatory and then, depending
instances is not possible. Therefore, to increase the diversity in on that, generate more test cases.
search, we cluster the data and process seed inputs from each cluster Once a discriminatory test case, say t is found, we try to generate
in a round-robin fashion. further more test inputs which may lead to individual discrimina-
tion. The key idea is to negate the non-protected attribute constraints
Addressing Local Model Approximation. We use an off-the-shelf of t’s decision tree to generate more test inputs. By toggling one
local explainer (LIME) to fetch the interpretable local model. The constraint related to non-protected attribute and generating an
perturbation used in such an explainer is outside the scope of the input solving the resultant constraint, the algorithm tries to explore
present algorithm, therefore, it is not possible to reduce the error the neighborhood of discriminatory path p. This form of symbolic
caused by the local approximation. Note that, the approximation execution is what we call as local search as it tends to search the
can lead to production of test inputs which do not uncover any new locality of discriminatory test cases.
path in the model. However, by introducing the data instances in The reason why this works is due to the inherent adversarial
the seed, our algorithm addresses the above issue. Our algorithm robustness property of a machine learning model which demon-
further ensures that the seed inputs get more priority than the strates that a small perturbation of an input can result in changing
inputs generated by the constraint solver during symbolic execution. This ensures a high degree of varied path coverage.
the classifier decision [20].

Ranking based on Confidence. An automated test case generation procedure typically runs within a limit. It is therefore important to generate non-redundant test inputs and to cover as many paths as possible with the generated test inputs within the limit. For the test cases generated by the constraint solver, we use a ranking scheme based on the confidence of predicates in the decision tree to select which test input to execute next. The confidence of a path is determined by taking the average of the confidences of all its comprising predicates (Line 31, Algorithm 2). Therefore, when the algorithm selects a predicate c for negating, it takes as the rank the average confidence of the predicates in the prefix of the path leading to and including c.

Confidence Threshold. The lower the confidence of a predicate in the decision tree path, the lower the chance of generating varied inputs. Therefore, to avoid unnecessary generation of inputs through symbolic evaluation, our algorithm employs a threshold on the confidence of the predicate for selection (Line 26, Algorithm 2). The value of the threshold is obtained through experimentation.

The algorithm for test case generation to detect individual discrimination is presented in method generate_test_cases at Line 1 of Algorithm 2. Lines 10-21 describe the local search, whereas Lines 23-34 describe the global search. There are two major differences between the implementations of the local and the global search. The first difference is that no threshold-based constraint selection is done in local search; in contrast, during global search, only the high-confidence constraints are chosen for toggling. The reason behind this strategy is to increase the chances of diverse coverage of paths. The second difference is that, in global search, no constraint exists for the suffix of the path, whereas in local search, all constraints, except the selected low-confidence one to toggle, remain as they are.

3.3 Maximizing Effectiveness of Discrimination Detection

In this subsection, we discuss a few more changes that we have employed on the generic algorithm in order to maximize the detection of individual discrimination.

Sticky Solutions. The aim of the local and the global search is to traverse as many paths as possible. The local search concentrates on exploring the paths in the vicinity of the discriminatory paths, i.e. the paths that are generated from discriminatory inputs. Therefore, we only get one solution for the constraint. However, to cater to the possible approximation caused by the local linear model, we use the constraint solver's solution that is close to the solution of the previous constraint (related to the discriminatory input). We call such a solution a sticky solution. Because of the stickiness, if we negate one predicate, the remaining predicates tend to take the same values as in the previous solution. In Section 4, we demonstrate that sticky solutions provide a huge performance improvement over the existing works.
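The ranking and thresholding heuristics can be summarized in a short sketch. This is our illustrative reconstruction, not the authors' implementation: the path encoding and all names are hypothetical, while the prefix-averaging rule and the confidence threshold come from the text above.

```python
# Illustrative sketch of the confidence-based ranking and threshold
# heuristics. A decision-tree path is modeled as a list of
# (predicate, confidence) pairs; names are hypothetical.

THRESHOLD = 0.3  # the threshold T1 chosen experimentally in Section 4

def rank_of_toggle(path, i):
    """Rank of negating predicate i: average confidence of the
    predicates in the prefix of the path up to and including i."""
    prefix = [conf for _, conf in path[: i + 1]]
    return sum(prefix) / len(prefix)

def candidate_toggles(path, threshold=THRESHOLD):
    """During global search, only predicates whose confidence clears
    the threshold are considered for toggling; candidates are then
    ordered by descending rank."""
    cands = [i for i, (_, conf) in enumerate(path) if conf >= threshold]
    return sorted(cands, key=lambda i: rank_of_toggle(path, i), reverse=True)

path = [("age <= 35", 0.9), ("credit > 5000", 0.2), ("hours <= 40", 0.6)]
print(candidate_toggles(path))  # [0, 2] -- index 1 is filtered out
```

The low-confidence predicate (index 1) is never selected in global search, mirroring the first difference between the two searches described above.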
ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia Aniya Aggarwal, Pranay Lohia, Seema Nagar, Kuntal Dey, and Diptikalyan Saha
Ordering Local and Global Search. In the consolidated Algorithm 2, three reference ranks, namely Rank1, Rank2 and Rank3, are presented, one each for the seed input, the local search, and the global search, respectively. These ranks are set in such a way that the highest priority is given to local search, followed by the seed input, which is further followed by the global search, based on their ability to uncover discrimination-causing inputs (see Lines 3, 18, 32 of Algorithm 2).

In the next section, we experimentally show the effectiveness of the various optimizations described in this section, along with comparisons with the existing works.

4 EXPERIMENTAL EVALUATION

4.1 Setup

4.1.1 Benchmark Characteristics. We have conducted our experiments on eight open-source fairness benchmarks from various sources, as listed in Table 1.

Table 1: Benchmark Characteristics

Benchmark                 Size    Features Count  Domain Size
German Credit Data        1000    21              6.32 × 10^17
Adult census income       32561   12              1.95 × 10^16
Bank marketing            45211   16              3.30 × 10^24
US Executions             1437    10              1.40 × 10^8
Fraud Detection           1100    9               4.01 × 10^9
Raw Car Rentals           486     7               1.15 × 10^3
credit data (modified)    600     20              1.74 × 10^15
census data (modified)    15360   13              1.74 × 10^18

Dataset sources: Adult census income: https://archive.ics.uci.edu/ml/datasets/adult; Bank marketing: https://archive.ics.uci.edu/ml/datasets/bank+marketing; US Executions: https://data.world/markmarkoh/executions-since-1977; Fraud Detection: https://www.kaggle.com/c/frauddetection/data; Raw Car Rentals: https://www.yelp.com/biz/raw-car-rentals-sacramento

4.1.2 Configurations. Our code is written in Python and executed in Python 2.7.12. All the experiments are performed on a machine running Ubuntu 16.04, with 16 GB RAM and a 2.4 GHz Intel Core i5 CPU. We have used LIME [15] for local explainability. We have used K-means with cluster size = 4 to cluster the input seed data. Since our use case requires the generation of more test cases in less time, K-means, being one of the simplest and fastest clustering algorithms, proves to be a reasonable choice. The fact that the data sets used to run our experiments have either two or four true class labels drives the logical assumption to set the cluster count to 4. This was further validated using a scatter plot, shown in Figure 2, which clearly depicts four different clusters in the seed data.

For each benchmark, we have generated a model using Logistic Regression with its default configuration as provided in scikit-learn. The model is configured similarly to the one used in THEMIS [7] in order to ensure a fair comparison. The models are highly precise (accuracy > 95%) to avoid overfitting. The threshold T1, proposed earlier in Algorithm 2, is set to T1 = 0.3. The justification for choosing this value of the threshold is experimentally validated later in this section.

Figure 2: Scatter plot for Clustered Seed Data and Discriminatory Inputs (Black) for German (age) data

4.2 Experiment Goals

By conducting our set of experiments, we broadly try to achieve two goals, as mentioned below.

• Comparison with the existing works - The intention is to gauge how well our algorithm performs as compared to the existing works in the field of test case generation to find individual discrimination in models. We compare our approach with two existing systems: (a) THEMIS [7], which checks for individual discrimination by random test case generation, and (b) AEQUITAS [23], which performs both a random global search and a perturbation-based local search. We have used the success score (i.e. Succ/Gen) as the evaluation metric, where Succ denotes the subset of the generated test cases (Gen) which results in individual discrimination. Please note that the same metric has also been used in other related works such as THEMIS and AEQUITAS. Furthermore, precision and recall are more appropriate as evaluation metrics when we have fixed gold-standard data. A detailed theoretical comparison with both approaches is presented in the forthcoming Section 5. In all our subsequent discussions, we refer to our algorithm as SG, which stands for Symbolic Generation.
• Effect of algorithmic features - The motive is to find out how well each algorithmic feature, such as symbolic execution in local and global search, presence of data, etc., contributes towards finding individual discrimination.

4.3 Comparison with Related Work

4.3.1 Comparison with THEMIS. We have fetched the THEMIS code from their GitHub repository (https://github.com/LASER-UMASS/Themis). On carefully analyzing their code, we figured out an unintended behavior in the open-sourced code: THEMIS actually generates duplicate test cases, and their reported experimental statistics also contain these duplicates. This is one of the problems posed by random test case generation, as it can produce duplicate test cases. We have changed THEMIS's code to remove duplicates for our experimental evaluations.

Table 2 depicts the results of the comparison with THEMIS. It can be inferred from the results that our algorithm performs better than THEMIS, except in the case of one benchmark, i.e. Fraud Detection. THEMIS has an average success score (#Succ/#Gen) of 6.4%, but for our algorithm it is 34.8%. It is evident that across the 12 benchmarks, our algorithm generates 6 times more successful (i.e. the ones that resulted in discrimination) test cases than THEMIS. It is further to be noted that the maximum percentage of discriminatory test
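The causal check underlying the success score — flip the protected attribute and see whether the model's decision changes — together with the de-duplication applied to THEMIS's output, can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the toy stand-in model and all names are ours.

```python
# Sketch of the individual-discrimination check and the Succ/Gen
# success score used in the evaluation. `model` is any black-box
# predictor; here a toy stand-in. Names are hypothetical.

def is_discriminatory(model, instance, prot_attr, domain):
    """Flip the protected attribute across its domain; the input is
    discriminatory if any flip changes the model's decision."""
    base = model(instance)
    for v in domain:
        if v == instance[prot_attr]:
            continue
        flipped = dict(instance, **{prot_attr: v})
        if model(flipped) != base:
            return True
    return False

def success_score(model, tests, prot_attr, domain):
    """Succ/Gen over de-duplicated test cases (THEMIS's duplicates
    would otherwise inflate Gen)."""
    unique = {tuple(sorted(t.items())) for t in tests}
    gen = [dict(t) for t in unique]
    succ = sum(is_discriminatory(model, t, prot_attr, domain) for t in gen)
    return succ / len(gen)

# Toy model that (unfairly) consults `sex` when income is low.
toy = lambda x: int(x["income"] > 50 or x["sex"] == "M")
tests = [{"sex": "M", "income": 40}, {"sex": "F", "income": 60},
         {"sex": "M", "income": 40}]  # one duplicate
print(success_score(toy, tests, "sex", ["M", "F"]))  # 0.5
```

Only two of the three generated inputs are distinct, and one of the two flips the toy model's decision, giving a success score of 1/2.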
cases as generated by SG is 97%, which is also very high as compared to THEMIS's 29%.

Table 2: Comparison of SG with THEMIS

Bench.           Prot. Attr.   THEMIS #Gen   THEMIS #Succ   SG #Gen   SG #Succ
German Credit    gender        999           166            1000      640
German Credit    age           999           90             1000      485
Adult income     race          999           70             1000      295
Adult income     sex           990           1              1000      858
Fraud Detection  age           999           3              1000      0
Car Rentals      Gender        680           198            1000      783
credit           i/gender      598           44             1000      267
census           h/race        999           57             1000      970
census           i/sex         999           7              1000      282
Bank Marketing   age           999           0              1000      1
US Executions    Race          999           2              1000      6
US Executions    Sex           999           8              1000      31

In Table 3, we report the contribution of the different test case generation features, namely seed data, global symbolic search and local symbolic search, towards the aforementioned success. "Data" refers to the case where the test instance itself is a data point available in the input seed data. Global and local symbolic search refer to the cases where test cases are generated by employing symbolic execution in the global search and the local search, respectively. Please note that the contribution of each of these searches is dependent on the previously executed search. Table 3 does not aim to compare the effectiveness of these individual searches, but records in which phase (data, local, or global symbolic search) a test case is generated, thus drilling down on the results of Table 2.

Table 3: Contribution of different features

                        Data            Local Symb.     Global symbolic
Bench.                  Gen    Succ     Gen    Succ     Gen    Succ
German Credit(gender)   25     2        975    638      0      0
German Credit(age)      25     2        975    483      0      0
Adult Income(race)      37     4        963    291      0      0
Adult Income(sex)       35     6        965    852      0      0
Fraud Detection         1000   0        0      0        0      0
Car Rentals             2      2        998    781      0      0
credit                  22     2        978    265      0      0
census(race)            1      1        999    969      0      0
census(sex)             6      1        994    281      0      0
Bank Marketing          983    1        17     0        0      0
US Executions(Race)     877    0        22     4        101    2
US Executions(Sex)      877    0        74     29       49     2

The results show the effect of our relative ranking strategy, which specifies the order of preference as local symbolic, seed data, and global symbolic, in decreasing order. Please note that, on average, the success percentages for data and local symbolic execution are 22.4% and 47.4%, respectively.

We have performed another experiment where we make the same comparison with THEMIS, but by varying the time limit instead of considering a fixed limit on the number of test cases to be generated. Essentially, we conducted this test to check whether, within a given time limit, SG performs better than THEMIS. Considering that our algorithm requires local model generation and running a constraint solver, this experiment also takes into account the time taken by such components in our algorithm. This result also depicts the time required for test case generation by our algorithm. We experimented with the German Credit dataset with 'age' as its protected attribute and ran both THEMIS and SG for 2,000 secs. Interestingly, at any given instant, the number of test cases generated by SG is more than that of THEMIS. The reason is that random sample generation takes a lot of time, as it still conforms to the constraints of the domain, whereas SG takes very little time to generate samples from seed, and the LIME/constraint-solver-based local and global test case generation is also quite fast. Most importantly, within the same limit, the number of discriminatory test cases for THEMIS is much lower as compared to SG, which is well evident from the results in Table 4.

Table 4: Time limit wise comparison between THEMIS and SG for German Credit Dataset

Duration (secs)   THEMIS Gen   THEMIS Succ   SG Gen   SG Succ
10                3            0             12       0
20                4            0             22       0
50                8            0             51       2
100               12           0             89       14
200               33           2             58       41
500               114          5             341      113
1000              282          21            649      232
2000              569          40            1000     359

4.3.2 Comparison with AEQUITAS. The AEQUITAS [23] algorithm operates in two search phases - global and local. The global phase considers a limit on the count of test cases and generates them by random sampling of the input space. Out of all these generated test cases, a few are discriminatory in nature. The local phase then starts by taking each of the discriminatory inputs identified in the global search phase as an input and perturbs it to generate further test cases. This phase, just like the preceding global search, considers a limit on the number of test cases to be generated. They have applied three different types of perturbation, resulting in three different variations of the algorithm.

As we have similar phases in our algorithm too, we perform a phase-by-phase comparison with AEQUITAS by considering the same limit. We have fetched the code for AEQUITAS from their GitHub repository (https://github.com/sakshiudeshi/Aequitas) and run it for comparison.

Global Search Comparison. Table 5 presents a comparison of SG with AEQUITAS in the context of their global search strategies. Our global search method uses clustered seed data and symbolic execution, whereas their strategy uses random sampling of the input space. It is evident from the statistics that, in general, our algorithm generates more discriminatory inputs than AEQUITAS.

Table 5: Number of Discriminatory Test Cases generated through Global Search for Adult Dataset with limit 1000

Model Type               AEQUITAS   SG
Decision Tree            6          56
Random Forest            73         93
Multi Layer Perceptron   41         48

Local Search Comparison. AEQUITAS perturbs the available discriminatory test inputs to find even more discriminatory inputs. In contrast, SG's local search is still symbolic, which helps to discover more execution paths near the discriminatory paths. To accurately compare both local searches, we have run them considering the same set of discriminatory inputs and the same models for the Adult dataset. Please note that the AEQUITAS open-source code has
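As a sanity check, the 6.4% average success score reported for THEMIS is reproducible as the mean of the per-row Succ/Gen ratios of Table 2:

```python
# Recomputing THEMIS's average success score (#Succ/#Gen) from the
# Table 2 figures; the mean of the per-benchmark ratios reproduces
# the 6.4% reported in the text.

themis = [  # (#Gen, #Succ) per benchmark/protected attribute, from Table 2
    (999, 166), (999, 90), (999, 70), (990, 1), (999, 3), (680, 198),
    (598, 44), (999, 57), (999, 7), (999, 0), (999, 2), (999, 8),
]
mean_score = sum(s / g for g, s in themis) / len(themis)
print(f"{mean_score:.1%}")  # 6.4%
```

The same computation over the SG columns confirms its much higher average.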
hard-coding for the Adult dataset and, therefore, we could only perform the comparison experiments on that dataset, but for all three different model types. Table 6 presents the comparison of the number of discriminatory inputs generated by both approaches within the limit of 1000 test cases. In two out of three cases, SG's local search performs better than AEQUITAS's. In the case of the Multi Layer Perceptron, SG's poor performance is attributed to the error in local model approximation by LIME.

Table 6: Number of Discriminatory Test Cases generated through Local Search for Adult Dataset with limit 1000

Model Type               AEQUITAS   SG
Decision Tree            318        694
Random Forest            541        726
Multi Layer Perceptron   403        42

4.4 Path Coverage

We perform an experiment to compare the path coverage of our global search and random-data-based search. The random-data-based search has been applied in both THEMIS and AEQUITAS; therefore, this experiment presents the comparison with the existing related works.

To measure path coverage, we learn a decision tree model with a precision of 85%-95%, measured using 5-fold cross-validation, for each benchmark, and map each generated test input to a path of the decision tree model. The results in Table 7 show that, on average over all the benchmarks, SG has 2.66 times more path coverage than random data. This result demonstrates that on the path coverage metric we are superior to the other algorithms. Therefore, our algorithm will be able to find discriminatory inputs at various different places in the model. This is important, as in one shot we can de-bias multiple parts of the model if we use the test cases for retraining.

Table 7: Path coverage of SG-global search vs Random

Benchmark      Limit   Random   SG-Global
adult          1000    284      610
census         1000    47       413
credit         600     61       116
German (age)   1000    102      187

4.5 Importance of Seed Data

We conducted two experiments for determining the importance of seed data and related features. In both experiments, we compare training data with random data as the seed data.

In the first experiment (shown in Table 8), we switch off the global and local symbolic execution, so that all the test cases are generated from the seed data, i.e. the training data. In the second experiment (Table 9), we keep both the symbolic searches. Both experiments are performed with a limit on the number of test cases generated. The third experiment compares the path coverage of training data vs. random data.

Limited Number of Test Cases: The results from the first experiment, in Table 8, demonstrate the importance of the seed data. The results show that, for all benchmarks, the number of discriminatory test cases generated from seed training data is more than that from random data. We see that just by applying training data (without any symbolic search) we get an average improvement of 108% (25% vs. 12%) in comparison with random data. This implies that, in global search, more discriminatory examples can be obtained with training data. This result also demonstrates why our global search method is superior to THEMIS and AEQUITAS in finding discriminatory inputs.

Table 8: Training vs Random seed data (no symbolic execution)

Bench.         Training Gen   Training Succ   Random Gen   Random Succ
credit         500            56              500          25
German (Age)   1000           70              1000         46
Census (Sex)   500            20              500          5
Car            500            394             500          190

The second experiment, in Table 9, shows the effectiveness of symbolic evaluation even if we start with random input. For example, on the credit data, we see that random data got 310 successful test cases and is less effective than training data, which got 421 successful test cases.

Table 9: Random seed data (with symbolic) (#Succ/#Gen)

Bench.         Total      Seed    Local Symbolic   Global Symbolic
credit         310/1000   1/21    309/979          0/0
German (Age)   365/1000   4/49    361/951          0/0
Census (Sex)   195/1000   3/87    192/913          0/0
Car            803/1000   14/21   789/979          0/0

Importance of Clustering of Seed Data. Figure 2 shows the distribution of the seed data and the discriminatory inputs found. It also demonstrates that the discriminatory inputs are scattered across the input space, a proof of why random exploration of the input space is not effective in finding discriminatory inputs.

Figure 3 shows the use of diverse seed data ordering for getting individual discrimination. When the number of test cases is limited, we expect that a diverse ordering (round-robin across clusters) will fetch more discrimination than an iterative ordering. It is evident from Figure 3 that, for most numbers of test cases, round-robin selection from the clusters finds more discriminatory inputs than iterative selection. For example, in the execution of 600 test cases, round-robin selection found 45 discriminatory inputs, compared to 35 found by iterative selection.

Note that, even on using random data as seed, our global symbolic search performs well, as discussed in the next subsection.

Figure 3: Importance of clustering (German-age)
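The two seed orderings compared in Figure 3 can be sketched as follows; this is illustrative, with placeholder cluster contents and hypothetical names.

```python
# Sketch of the two seed orderings compared in Figure 3: iterative
# (exhaust one cluster before the next) vs. round-robin (cycle across
# clusters), the latter yielding more diverse seeds early on.

def iterative(clusters):
    return [x for c in clusters for x in c]

def round_robin(clusters):
    out, i = [], 0
    pools = [list(c) for c in clusters]
    while any(pools):
        if pools[i % len(pools)]:
            out.append(pools[i % len(pools)].pop(0))
        i += 1
    return out

clusters = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
print(iterative(clusters))    # ['a1', 'a2', 'b1', 'c1', 'c2', 'c3']
print(round_robin(clusters))  # ['a1', 'b1', 'c1', 'a2', 'c2', 'c3']
```

Under a tight test case budget, the round-robin prefix already touches every cluster, which is why it tends to surface more discriminatory inputs early.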
Table 10: Global Symbolic Search (w/o local) with Random Seed Data (#Succ/#Gen)

Bench.         Total      Seed     Local   Global    Total Random
credit         14/1000    14/773   0/0     0/227     18/1000
German (Age)   446/1000   0/1      0/0     446/999   380/1000
Census (Sex)   16/1000    1/883    0/0     15/117    1/1000
Car            655/1000   1/1      0/0     654/999   626/1000

4.6 Importance of Global Symbolic Search

To determine the effectiveness of the global symbolic search, we performed an experiment where we removed the local symbolic search, to put the emphasis on the global search part, and assigned a higher priority to the global symbolic search than to the seed data. Since the seed data selection may also affect the symbolic search, we used random data as seed data. The result of the experiment is shown in Table 10. We see that in the case of credit, effectiveness is a little lower (14 vs 18), as the symbolic search could not find any discriminatory cases among the 227 generated test cases. In all the other three cases, global symbolic search alone is able to generate 22% more discriminatory test cases than random search.

Importance of Threshold in the Global Symbolic Search: We experimented with the global search threshold for 4 benchmarks, and the result is shown in Figure 4. The result shows that, in most cases, a lower threshold causes a decrease in the ratio of discriminatory inputs. Therefore, we have chosen the threshold to be 0.3, which balances the number of test cases generated and the effectiveness of generation.

Figure 4: Importance of Threshold

4.7 Importance of Local Symbolic Search

Based on the previous experiments, we notice that local symbolic search has a high percentage (47.4%) of effectiveness. We conduct another experiment to see how local symbolic search affects the overall execution of the algorithm given a limit on the number of test cases (1,000). The results obtained by switching off the local symbolic search feature are presented in Table 11. We should compare this result with the results in Table 3 for the same 4 benchmarks. We see that the average effectiveness drops from 45.4% to 25.2% for these 4 benchmarks on removing the local symbolic search feature. This shows the importance of the local symbolic search technique.

Table 11: Without local symbolic search (#Succ/#Gen)

Bench.         Total     Seed      Local   Global
credit         66/603    66/600    0/0     0/3
German (Age)   70/1000   70/1000   0/0     0/0
Census (Sex)   45/994    45/992    0/0     0/2
Car            231/295   91/114    0/0     140/181

Overall, our experiments demonstrate that local symbolic search uncovers many bias instances after finding the first discriminatory input. For initial fault finding, seed data works better than the global symbolic search.

4.8 Threats to Validity

Nature of data inputs. Our approach works well on data sets containing numeric attributes, categorical attributes, or a mix of them. However, it doesn't handle very high dimensional data, such as image, sound or video, in its current state, due to a limitation posed by our local explainer, LIME. Note that our algorithm is generic with respect to the local explainer, so LIME can be replaced by a more sophisticated explainer, such as Grad-CAM [16] in the image domain, to make it work for high dimensional data.

Relevance of benchmarks and models used. The benchmarks used to evaluate this work are well-known in the field of fairness and are used in the related works, such as THEMIS [7] and AEQUITAS [23], as well. Further, we have picked the same models to conduct our experiments as the ones used in these works.

Protected attributes. We consider only one protected attribute at a time per benchmark for our experiments. However, considering multiple protected attributes will not hamper the coverage or effectiveness offered by our novel search technique, but will certainly lead to an increase in execution time. This increase is attributed to the fact that the algorithm, in such a case, needs to consider all the possible combinations of their unique values.

5 RELATED WORK

This section discusses the existing works spread across two related spheres - model testing and individual discrimination detection.

AI Model Testing. First, we present the works which are capable of performing symbolic or concolic test case generation for AI models. DeepCheck [10] uses a white-box technique which performs symbolic execution on deep neural networks with the target of generating adversarial images. Another related work [19] performs concolic execution [9] (Concrete and Symbolic) on deep neural networks. Their technique is also white-box, with the goal of ensuring coverage of the deep neural network by systematic test case generation. They model the network using linear constraints and use a specialized solver to generate test cases. Wicker et al. [24] aim to cover the input space by exhaustive mutation testing that has theoretical guarantees, while in [13, 14, 21] gradient-based search algorithms are applied to solve optimization problems. In addition, Sun et al. [18] leverage the capabilities of linear programming.

All the above techniques are white-box, as compared to our black-box solution approach. We use an off-the-shelf solver to generate test cases. The aforementioned approaches consider test generation to create adversarial inputs in the image space, while our technique addresses a new problem domain - trust and ethics.
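The 45.4% and 25.2% averages quoted in Section 4.7 can be recomputed from Tables 3 and 11; for Table 3, Gen and Succ are summed over the data, local and global phases per benchmark:

```python
# Recomputing the effectiveness drop when local symbolic search is
# switched off. With-local figures are the per-benchmark phase sums
# from Table 3; without-local figures come from Table 11.

with_local = {  # benchmark: (Gen, Succ), summed over data/local/global
    "credit": (1000, 267), "German (Age)": (1000, 485),
    "Census (Sex)": (1000, 282), "Car": (1000, 783),
}
without_local = {  # benchmark: (Gen, Succ), from Table 11
    "credit": (603, 66), "German (Age)": (1000, 70),
    "Census (Sex)": (994, 45), "Car": (295, 231),
}

def mean_effectiveness(table):
    return sum(s / g for g, s in table.values()) / len(table)

print(f"{mean_effectiveness(with_local):.1%}")     # 45.4%
print(f"{mean_effectiveness(without_local):.1%}")  # 25.2%
```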
Individual Discrimination Detection. THEMIS [7] uses the notion of causality to define individual discrimination. Even though theirs is a black-box technique, it employs a random test case generation approach instead of a systematic one. However, they envision the use of systematic test case generation techniques in their paper as future work.

AEQUITAS [23] randomly samples the input space to discover the presence of discriminatory inputs. Then, it searches the neighborhood of these inputs to generate further inputs. Therefore, their global search technique is based on random search, while the local search is performed by perturbing discriminatory inputs. However, their approach still differs from ours in the context of both the global and the local search techniques. For global search, they perform random sampling on the input space, while we use clustered seed data and symbolic execution to systematically uncover the first set of discriminatory inputs. The effectiveness of their technique completely depends on where the discriminatory inputs are placed. As we have already demonstrated in Section 4, our global search performs better than random sampling because of the combination of clustered seed data and symbolic execution used in our technique. In the context of local search, the two techniques differ in the way they perturb. AEQUITAS perturbs the input by a factor of δ ∈ {−1, +1} (0.5 to start with) and selects attributes with a probability p (which is set as uniform across the non-protected attributes at the start). The probability of selection and δ are further updated depending on their three different strategies, namely random, semi-directed, and fully-directed. In contrast, our selection of attributes is based on the confidence of the predicates associated with the attributes, and the change (or perturbation) is performed based on the constraints found in the approximated decision tree path. AEQUITAS tends to generate test data points in the vicinity of the discriminatory test cases, which may or may not explore new paths, but our algorithm makes use of symbolic execution, which results in the exploration of new paths near the discriminatory ones. Furthermore, our perturbation scheme does not consider any pre-determined offset like the one used in AEQUITAS.

FairTest [22] uses manually written tests to measure four types of discrimination scores. Their idea is to leverage the indirect correlation existing between attributes (e.g., salary is related to age) to generate test cases. FairML [1] uses an iterative procedure based on an orthogonal projection of input attributes to enable interpretability of black-box predictive models. Through such an iterative technique, one can quantify the relative dependence of a black-box model on its input attributes. The relative significance of the inputs with respect to a predictive model can then be used to assess the fairness (or discriminatory extent) of such a model.

Despite being black-box techniques, none of these existing individual discrimination techniques uses systematic test case generation. To the best of our knowledge, we are the first to present a black-box solution approach capable of generating test cases systematically with the intention of detecting individual discrimination in AI models.

6 CONCLUSION AND FUTURE WORK

In this paper, we present a test case generation algorithm to identify individual discrimination in machine learning models. Our approach combines the notion of symbolic evaluation, which systematically generates test inputs for any program, with local explanation, which approximates the execution path in the model using a linear and interpretable model. Our technique offers an additional advantage by being black-box in nature. Our search strategy spans two approaches, namely global and local search. Global search caters for path coverage and helps to discover an initial set of discriminatory inputs. To achieve that, we use seed data along with symbolic execution, while accounting for the approximation existing in the local model and intelligently using the confidence associated with the path constraints fetched from the local model. Further, local search aims at finding more and more discriminatory inputs. It starts with the initial set of available discriminatory paths and generates other inputs belonging to the nearby execution paths, thereby systematically performing local explanation while banking on the adversarial robustness property. Our experimental evaluations clearly show that our approach performs better than all the existing tools.

Symbolic testing of machine learning models is bound to pave the way for a number of future efforts; some of them are listed below.

A generic framework for testing. Our global search method is a generic way of testing any black-box ML model and not just a method for detecting individual discrimination. For a different problem, there may be a need to change the implementation of only the check_for_error_condition function, especially if the modality remains the same.

Local Model Generation. In this paper, we are very much dependent on LIME, which is a local model generator. The approximation caused by the local model does have a role to play as far as the effectiveness of systematic exploration is concerned. In future, we want to investigate how we can generate a more effective and efficient local model for better exploration.

Model Path Definition. For testing neural networks, we can use neuron activation to precisely define a path.

Global Approximation. We have used a local approximation as a way to get an interpretable portion of the model. Intuitively, that decision goes with the 'on-demand exploration' strategy, which does not require re-engineering all the parts of a model. Also, local model generation does not require any training data. In future, we also wish to investigate global approximation algorithms like TREPAN [3] in order to create decision tree models and perform symbolic execution on such a decision tree when training data is available.

Adversarial Robustness. There exist many white-box approaches addressing the problem of finding adversaries. In future, we may investigate whether such techniques can be applied in the case of black-box testing to detect individual discrimination.

Hybrid Search Strategy. Our algorithm gives preference to local search over the global one, and to seed data over global search. Finding an interleaved strategy can be another direction to pursue in future.

Explaining the Source of Individual Discrimination and De-Biasing. Once individual discrimination is detected in a model, we would like to uncover its root cause, especially by mapping it to the training data. We envision the use of influence functions [11] to achieve this. Once we figure out the training data instances and the attributes responsible for bias, it is then possible to de-bias the model by either removing or appropriately perturbing the training data instances.
REFERENCES
[1] Julius Adebayo and Lalana Kagal. 2016. Iterative Orthogonal Feature Projection for Diagnosing Bias in Black-Box Models. arXiv:1611.04967
[2] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. 2006. EXE: Automatically Generating Inputs of Death. In Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS '06). ACM, New York, NY, USA, 322–335. https://doi.org/10.1145/1180405.1180445
[3] Mark William Craven. 1996. Extracting Comprehensible Models from Trained Neural Networks. Ph.D. Dissertation. AAI9700774.
[4] Leonardo de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems, C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 337–340.
[5] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS '12). ACM, New York, NY, USA, 214–226. https://doi.org/10.1145/2090236.2090255
[6] Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 259–268.
[7] Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 498–510. https://doi.org/10.1145/3106237.3106277
[8] Patrice Godefroid. 2007. Compositional Dynamic Test Generation. In Proceedings of the 34th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '07). ACM, New York, NY, USA, 47–54. https://doi.org/10.1145/1190216.1190226
[9] Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed Automated Random Testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05). ACM, New York, NY, USA, 213–223. https://doi.org/10.1145/1065010.1065036
[10] Divya Gopinath, Kaiyuan Wang, Mengshi Zhang, Corina S. Pasareanu, and Sarfraz Khurshid. 2018. Symbolic Execution for Deep Neural Networks. arXiv:1807.10439
[11] Pang Wei Koh and Percy Liang. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, International Convention Centre, Sydney, Australia, 1885–1894. http://proceedings.mlr.press/v70/koh17a.html
[12] Matt Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual Fairness. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). Curran Associates Inc., USA, 4069–4079. http://dl.acm.org/citation.cfm?id=3294996.3295162
[13] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), Montpellier, France, September 3-7, 2018. 120–131. https://doi.org/10.1145/3238147.3238202
[14] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). ACM, New York, NY, USA, 1–18. https://doi.org/10.1145/3132747.3132785
[15] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.
[16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV). 618–626. https://doi.org/10.1109/ICCV.2017.74
[17] Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A Concolic Unit Testing Engine for C. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE-13). ACM, New York, NY, USA, 263–272. https://doi.org/10.1145/1081706.1081750
[18] Youcheng Sun, Xiaowei Huang, and Daniel Kroening. 2018. Testing Deep Neural Networks. arXiv:1803.04792
[19] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. arXiv:1805.00089
[20] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing Properties of Neural Networks. CoRR abs/1312.6199 (2013). http://arxiv.org/abs/1312.6199
[21] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 303–314. https://doi.org/10.1145/3180155.3180220
[22] F. Tramèr, V. Atlidakis, R. Geambasu, D. Hsu, J. Hubaux, M. Humbert, A. Juels, and H. Lin. 2017. FairTest: Discovering Unwarranted Associations in Data-Driven Applications. In 2017 IEEE European Symposium on Security and Privacy (EuroS&P). 401–416. https://doi.org/10.1109/EuroSP.2017.29
[23] Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay. 2018. Automated Directed Fairness Testing. In Proceedings of the 2018 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), September 3-7, 2018, Montpellier, France. https://doi.org/10.1145/3238147.3238165 arXiv:1807.00468
[24] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems, Dirk Beyer and Marieke Huisman (Eds.). Springer International Publishing, Cham, 408–426.

