Outcome Explorer
Abstract—The widespread adoption of algorithmic decision-making systems has brought about the necessity to interpret the reasoning behind these decisions. The majority of these systems are complex black-box models, and auxiliary models are often used to approximate and then explain their behavior. However, recent research suggests that such explanations are not readily accessible to lay users with no specific expertise in machine learning, and this can lead to an incorrect interpretation of the underlying model. In this paper, we show that a predictive and interactive model based on causality is inherently interpretable, does not require any auxiliary model, and allows both expert and non-expert users to understand the model comprehensively. To demonstrate our method we developed Outcome Explorer, a causality-guided interactive interface, and evaluated it by conducting think-aloud sessions with three expert users and a user study with 18 non-expert users. All three expert users found our tool to be comprehensive in supporting their explanation needs, while the non-expert users were able to understand the inner workings of a model easily.
1 INTRODUCTION
Fig. 1. Prediction in a causal model. The hammer icon represents an intervention. (a) True causal model. (b) Interventions on all feature variables. The causal links leading to nodes C and D are removed since the values of C and D are set externally. (c) Interventions on nodes A, B, and C. (d) Interventions on nodes A and B. In the path equations above models (b)-(d), the β are standardized regression coefficients estimated from the data.
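To make the prediction mechanism of Figure 1 concrete, the following minimal sketch propagates values through a small linear causal model with the same structure; the coefficient values, variable names, and the predict helper are illustrative placeholders, not the estimates or code used in the paper.

```python
# Minimal sketch of prediction in a linear (path-equation) causal model with
# the structure of Figure 1: A and B are exogenous; C, D, and E are endogenous.
# The beta values are illustrative placeholders, not estimates from the paper.

PARENTS = {
    "C": [("A", 0.4), ("B", 0.3)],   # C <- A, B
    "D": [("B", 0.6)],               # D <- B
    "E": [("C", 0.5), ("D", 0.2)],   # E <- C, D  (E is the outcome)
}

def predict(interventions):
    """Evaluate the model given externally set values.

    Nodes in `interventions` are fixed and their incoming causal links are
    ignored, mirroring the hammer icons in Figure 1(b)-(d). The exogenous
    nodes A and B must always be supplied. Remaining endogenous nodes are
    computed from their parents via the path equations.
    """
    values = dict(interventions)
    for node, parents in PARENTS.items():        # insertion order is topological
        if node not in values:                   # not intervened on
            values[node] = sum(beta * values[p] for p, beta in parents)
    return values

# Figure 1(d): intervene on A and B only; C, D, and the outcome E follow.
print(predict({"A": 1.0, "B": 0.5}))
# Figure 1(c): additionally intervene on C; the A->C and B->C links are ignored.
print(predict({"A": 1.0, "B": 0.5, "C": 2.0}))
```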
what features they have in common. Our tool, Outcome-Explorer, also falls into the category of these systems, but focuses on causal networks and quantitative data. It specifically targets non-expert users and actively supports four of the six interface features specified in GAMUT. We chose this subset since they appear most suited for non-expert users.

2.2 Interactive Causal Analysis

Several interactive systems have been proposed for visual causal analysis, but none have been designed for algorithmic decision-making. Wang and Mueller [37], [38] proposed a causal DAG-based modeler equipped with several statistical methods. The model understanding and what-if experience it provides to users comes from observing how editing the causal network's edges affects the model's quality, measured via the Bayesian Information Criterion (BIC) score. Our tool, Outcome-Explorer, on the other hand, conveys the what-if experience by allowing users to change the values of the network nodes (the model variables) and observe the effect this has on the outcome and other variables. At first glance both help in model understanding, but only the second is an experimenter's procedure: it probes a process with different inputs and collects outcomes (predictions), using the causal edges to see the relations with ease. This kind of what-if analysis has appeared in numerous XAI interfaces [13], [34], [35], and we have designed ours specifically for causal models. The mechanism also appeals to self-determination and gamification, which both play a key part in education and learning [39]. One might say that achieving a lower BIC score is also gamification, but a BIC score does not emotionally connect a person to the extent that a lower house price or a college admission does. The latter provides a storyline, fun, and realism, all elements of gamification [40].

Yan et al. [41] proposed a method that allows users to edit a post-hoc causal model in order to reveal issues with algorithmic fairness. Their focus is primarily on advising analysts which causal relations to keep or omit to gain a fairer adjunct ML model. We, on the other hand, use the causal model itself for prediction and focus on supporting the right to explanation of both expert and non-expert users.

The work by Xie et al. [42] is closest to ours, but it uses different mechanisms for value propagation and it also addresses a different user audience. They look at the problem in terms of distributions, which is essentially a manager's cohort perspective, while we look at specific outcomes from an individual's perspective. Both are forms of what-if analyses, but we believe the latter is more accessible to a person directly affected by the modeled process, and so is more amenable to non-expert users who do not think in terms of distributions, uncertainties, and probabilities.

Finally, several visual analytics systems have been proposed for causal analysis in ecological settings. CausalNet [43] allows users to interactively explore and verify phenotypic character networks. A user can search for a subgraph and verify it using several statistical methods (e.g., Pearson correlation and SEM) and textual descriptions of the relations, mined from a literature database. Natsukawa et al. [44] proposed a dynamic causal network based system to analyze evolving relationships between time-dependent ecological variables. While these systems have been shown to be effective in analyzing complex ecological relations, they do not provide facilities for network editing and interventions, which ours does. They also do not support what-if and counterfactual analyses, which are instrumental for an XAI platform [13], [34].

In summary, the fundamental differences of our system to the existing ones are: (1) introduction of a prediction mechanism; (2) allowing users to change variables (intervention) by directly interacting with the nodes in the causal DAG; (3) visualizing the interplay between variables when applying/removing interventions; (4) supporting explanation queries such as what-if analyses, neighborhood exploration, and instance comparisons; and (5) including non-expert users in the design process, supporting their explanation needs.

3 BACKGROUND

We follow Pearl's Structural Causal Model (SCM) [45] to define causal relationships between variables. According to SCM, causal relations between variables are expressed in a DAG, also known as a path diagram. In a path diagram, variables are categorized as either exogenous (U) or endogenous (V). Exogenous variables have no parents in a path diagram and are considered to be independent and unexplained by the model. Endogenous variables, on the other hand, are fully explained by the model and presented as the causal effects of the exogenous variables. Figure 1 presents two exogenous variables (A and B) and three endogenous variables (C, D, and E). Formally, the causal model is a triple (U, V, F) such that
• U is the set of exogenous variables, and V is the set of endogenous variables.
Fig. 3. Model Creation Module of Outcome-Explorer. A) Control Panel. B) Causal structure obtained from the search algorithms. Users can interactively add, remove, and direct edges in the causal structure. C) Model fit measures obtained from Structural Equation Modelling (SEM). D) Parameter estimation for each relation (beta coefficients). E) A line chart showing the prediction accuracy of the causal model on the test set.
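To illustrate the SEM step behind panels C and D, here is a minimal sketch assuming semopy 2's Model/fit/inspect/calc_stats interface; the model description string, column names, and file name are hypothetical and do not reproduce the paper's actual specification.

```python
# Hypothetical sketch of estimating path (beta) coefficients and fit measures
# with semopy, as used in the Model Creation Module; names are placeholders.
import pandas as pd
import semopy

data = pd.read_csv("training_data.csv")   # columns A, B, C, D, E (placeholder)

# lavaan-style description of the causal DAG edited in panel B
desc = """
C ~ A + B
D ~ B
E ~ C + D
"""

model = semopy.Model(desc)
model.fit(data)                            # estimate the beta coefficients (panel D)

print(model.inspect())                     # path estimates, std. errors, p-values
print(semopy.calc_stats(model))            # fit measures such as CFI and RMSEA (panel C)
```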
comparing themselves with others and try to make sense of why different people received different decisions. The participants shared several such cases where they wondered why they received one decision while their friends received different decisions (P1-5, P7-8, P10). We note that this is similar to instance comparison and neighborhood exploration, which are capabilities C2 and C4 identified by Hohman et al. [13] that expert users should have to compare a data point to its nearest neighbors. Our study shows that our tool should support these capabilities also for non-expert users.

5 DESIGN GUIDELINES

Based on the insights gathered from our formative study, we formulate the following design guidelines:

DG1. Supporting experts and non-experts via a two-module design: The formative study revealed overlapping interests between expert and non-expert users in interpreting predictive models. However, a model needs to be created before it can be interpreted. XAI interfaces typically accept trained models for this purpose [13], [14], [20]. However, at the time of the development of this work, no open-source software or package was available for human-centered causal analysis (the third method from Section 3.1). Additionally, none of the existing tools supported prediction in a causal model. Hence, we decided to also support the creation of a predictive causal model.

The methods described for creating a causal model in Section 3 require substantial algorithmic and statistical expertise, which can only be expected from an expert user. The relationship between expert and non-expert users follows the producer-consumer analogy, where an expert user will create and interpret the model for accurate and fair modeling, while a non-expert user will interpret this verified model to understand the service the model facilitates. To support this relationship, we decided that Outcome-Explorer should have two different modules: (1) a Model Creation module, and (2) a Model Interpretation module. Figure 2 shows the two modules and their respective target users.

DG2. Creating the Model: Using the Model Creation module, an expert user should be able to create a causal model interactively with the help of state-of-the-art techniques and evaluate the performance of the model.

DG3. Interpreting the Causal DAG: The causal DAG is central to understanding a causal model. The visualization and interaction designed for the causal DAG in the Model Interpretation module should allow both expert and non-expert users to interpret the model correctly. Users should be able to set values of the input features in the DAG to observe the changes in the outcome.

DG4. Supporting Explanation Queries: The formative study revealed that non-expert users ask explanation queries (C1-C4) similar to those already well-studied in XAI research [13]. Our tool should support these queries, and they should be implemented keeping in mind the algorithmic and visualization literacy gap between expert and non-expert users.

DG5. Input Feature Configuration: Our tool is completely transparent, and a non-expert user can change the input features freely in the interface. However, when engaging in this activity, it is possible that, to obtain a certain outcome, a user might opt for a feature configuration that is unlikely
to be realistic [21]. Thus, our tool should allow non-expert users to evaluate not only the value of the outcome, but also how realistic the input configuration is when compared to existing data points and configurations.

6 VISUAL INTERFACE

Outcome-Explorer is a web-based interface. We used Python as the back-end language and D3 [51] for interactive visualization. We used Tetrad (https://www.ccd.pitt.edu/tools/) and semopy (https://semopy.com) for causal analysis.

As per DG1, Outcome-Explorer has two interactive modules. While the Model Interpretation Module is accessible to both expert and non-expert users, the Model Creation Module is only available to expert users (DG1). We describe the visual components of these two modules below.

6.1 Model Creation Module

The Model Creation Module is divided into five regions (Figure 3). The Control Panel (A) allows an expert user to upload data and run statistical models on the data. The Central Panel (B) visualizes the causal model obtained from automated algorithms. An expert user can select the appropriate algorithm from a dropdown list there. The graph returned by the automated algorithms is not necessarily a DAG; it can contain undirected edges. In addition, an expert user can edit the graph in this module (DG2), often resulting in a change of the structure of the graph. Since the structure of the graph is uncertain, we decided to use GraphViz [52], a well-known library for graph visualization. The panel facilitates four sets of edge editing features: (1) Add, (2) Direct, (3) Remove, and (4) Reverse. When editing the causal model, an expert user can evaluate several model fit measures in panel C, model parameters in panel D, and prediction accuracy in panel E.

Fig. 4. Model Interpretation Module of Outcome-Explorer. A) Interactive causal DAG showing causal relations between variables. Each node includes two circular knobs (green and orange) to facilitate profile comparisons. The edge thickness and color depict the effect size and type of each edge. B) Sample selection panel. C) A biplot showing the position of the green and orange profiles compared to their nearest neighbors. D) A line chart to track the model outcome and to go back and forth between feature configurations. E) Realism meter allowing users to determine how common a profile is compared to other samples in the dataset.

6.2 Model Interpretation Module

The Model Interpretation Module (Figure 4) uses a different visual representation to present the graph than the Model Creation Module, since a parameterized causal model has a definitive structure (a DAG). The interpretation module accepts a DAG as input and employs topological sorting to present that DAG in a left-to-right manner.

Each variable in the causal model contains two circular knobs: a green and an orange knob. A user can control two different profiles independently by setting the green and orange knobs to specific values (DG3). This two-profile mechanism facilitates instance comparison and what-if analysis (DG3, DG4, see Section 7). The range of the input knobs is set from the minimum to the maximum of a particular variable. Each knob provides a grey handle which the user can drag with the mouse to move the knob. The user can also set the numbers directly in the input boxes, either an exact number or even a number that is out of range (outside the [min, max] range) for that variable. In the case of an out-of-range value, the circular knob is simply set to the minimum or maximum, whichever extremum is closer to the value. The outcome variable is presented as a bar chart in the causal model. Similar to the input knobs, the outcome variable contains two bars to show the prediction values for the two profiles.
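A minimal sketch of the left-to-right layout step described above, using Python's standard-library topological sorter; this is an illustration only, since the actual interface computes its layout in D3.

```python
# Illustrative sketch: assign each node of the causal DAG to a horizontal
# layer so that every node appears to the right of all of its parents.
from graphlib import TopologicalSorter   # Python 3.9+

def layers(dag):
    """`dag` maps each node to the set of its parents (direct causes)."""
    depth = {}
    for node in TopologicalSorter(dag).static_order():
        depth[node] = max((depth[p] + 1 for p in dag.get(node, ())), default=0)
    return depth

# The Figure 1 structure: A and B land in layer 0, C and D in layer 1,
# and the outcome E in the rightmost layer.
print(layers({"A": set(), "B": set(), "C": {"A", "B"}, "D": {"B"}, "E": {"C", "D"}}))
```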
Finally, we follow the visual design of Wang et al. [37], [38] to encode the edge weights in the causal DAG. To visualize an intervention, all edges leading to an endogenous variable are blurred whenever a user sets that variable to a specific value. The user can cancel the intervention by clicking the × icon beside an endogenous variable, in which case its value is again estimated from its parent nodes and the edges return to their original opacity.

6.2.1 Profile Map

The profile map (Figure 4(C)) is a biplot which shows the nearest neighbors of a profile (DG4). To compute the biplot, we run PCA on the selected points. A user can control the radius of the neighborhood, given by the range of the outcome variable, through the "Radius" slider. The neighbors are colored according to the outcome value, as specified in the color map on the right of the plot.

A user can hover the mouse over any point on the map to compare that data point with the existing green profile (DG4). Subsequently, the user can click on any point to set it as a comparison case for a more detailed analysis. Both the green and orange disks (larger circles) move around the map as the user changes the profiles in the causal model.

6.2.2 Evolution Tracker

The evolution tracker (Figure 4(D)) is a line chart that tracks the model outcome and lets the user go back and forth between feature configurations.

6.2.3 Realism Meter

Given a Gaussian Mixture Model (GMM) with K components fitted to the dataset, we can calculate the probability of a data point x belonging to a component C_i using the following equation:

P(C_i \mid x) = \frac{P(C_i)\, P(x \mid C_i)}{\sum_{k=1}^{K} P(C_k)\, P(x \mid C_k)} = \frac{\phi_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K} \phi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}    (3)

A high value of P(C_i | x) implies that x is highly likely to belong to C_i, whereas a low value of P(C_i | x) implies that the features of x are not common among the members of C_i. Thus, P(C_i | x) can be interpreted as a scale of how "real" a data point is relative to the other members of a component. We translate P(C_i | x) to a human-understandable meter, with P(C_i | x) = 0 interpreted as "Rare", P(C_i | x) = 0.5 as "Moderately Common", and P(C_i | x) = 1 as "Common".
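A minimal sketch of how the realism score in Eq. (3) could be computed, assuming the mixture is fitted with scikit-learn's GaussianMixture; the paper does not prescribe a library, and how the reference component C_i is chosen for a given profile is left to the caller here.

```python
# Illustrative sketch of Eq. (3): the responsibility P(C_i | x) of mixture
# component i for a profile x, used as the realism score.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture(X_train, n_components=3):
    # Estimates the phi_i, mu_i, and Sigma_i of Eq. (3) from the dataset.
    return GaussianMixture(n_components=n_components, random_state=0).fit(X_train)

def realism(gmm, x, i):
    """P(C_i | x): ~0 reads as "Rare", ~0.5 "Moderately Common", ~1 "Common"."""
    responsibilities = gmm.predict_proba(np.atleast_2d(x))[0]
    return responsibilities[i]
```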
7 USAGE SCENARIO

In this section, we present a usage scenario to demonstrate how a hypothetical expert user (Adam) and a non-expert user (Emily) could benefit from Outcome-Explorer.

Adam (he/him) is a Research Engineer at a technology company and is responsible for creating a housing price prediction model. Non-expert users will eventually use the model. As a result, Adam also needs to create an easy-to-understand interactive interface for the non-expert users. Based on these requirements, Adam decides to use an interpretable model for prediction and determines that Outcome-Explorer matches the requirements perfectly.

Fig. 6. Study results. The average number of changes and the average magnitude of changes (%) made to (a) non-impacting variables and (b) all variables to reach the target outcomes. (c) Average self-reported subjective measures (1: Strongly Disagree, 7: Strongly Agree). Error bars show +1 SD.
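For reference, here is a minimal sketch of the significance tests reported below (a paired t-test with Bonferroni correction for the quantitative measures and a Mann-Whitney U test for the Likert ratings), using SciPy; the arrays are placeholder values, not the collected study data.

```python
# Illustrative only: placeholder per-participant values, NOT the study's data.
import numpy as np
from scipy.stats import ttest_rel, mannwhitneyu

changes_shap = np.array([7, 4, 9, 3, 6, 8, 2, 5, 6])   # e.g., changes to non-impacting variables
changes_oel  = np.array([3, 2, 4, 1, 3, 5, 2, 3, 4])   # same participants, Outcome-Explorer-Lite

t_stat, p_val = ttest_rel(changes_shap, changes_oel)   # paired (within-subjects) t-test
p_bonferroni = min(p_val * 2, 1.0)                     # correct for two quantitative measures (assumed)

likert_shap = np.array([4, 5, 4, 3, 5, 4, 5, 4, 4])    # 1-7 subjective ratings
likert_oel  = np.array([6, 6, 7, 5, 6, 6, 7, 6, 5])
u_stat, p_likert = mannwhitneyu(likert_shap, likert_oel)

print(p_bonferroni, p_likert)
```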
are not good predictors of how humans perform on actual decision-making tasks [55]. Based on that, we decided to evaluate our tool on actual decision-making tasks. The tasks were similar to the case study presented in Section 7.4. For example, in the case of the housing dataset, we provided the participants with the scenario of a person who wants to buy an ideal house (e.g., a house priced at 35K) under budget constraints. We then asked the participants to reach alternative/target outcomes (e.g., reducing the housing price from 35K to 25K) to satisfy the budget constraints while minimizing the number of changes and the magnitude of the changes from the ideal house. Note that both conditions made predictions based on the same underlying causal model. We also collected self-reported subjective measures such as model understanding, trust, and usability.

8.2.3 Study Design

Similar to the sessions with the expert users, we conducted the study sessions via web and Skype. A study session began with the participant signing a consent form. Following this, the participants were introduced to their assigned first condition and received a brief description of the interface. The participants then interacted with the system (with a training dataset), during which they were encouraged to ask questions until they were comfortable. Each participant was then given a scenario and a task list for the first condition. After completing the tasks, participants rated the study conditions (interfaces) on a Likert scale ranging from 1 (Strongly Disagree) to 7 (Strongly Agree) based on six subjective measures. The same process was carried out for the second condition. Each session lasted around one hour and ended with an exit interview.

8.2.4 Results

H1: Model Understanding. In a causal model, the exogenous variables may not affect the outcome if endogenous variables are set to specific values. While Outcome-Explorer visualizes this interplay, SHAP only estimates feature contributions to the decision and does not explain why some variables do not affect the outcome. We refer to such variables as non-impacting variables. We anticipated that interactions with non-impacting variables might reveal how well users understood the model. Based on that, to account for model understanding, we measured (1) the number of changes (non-impacting) and (2) the magnitude of changes (%, non-impacting). Here, non-impacting refers to the changes made on non-impacting variables. We used a paired t-test with Bonferroni correction and the Mann-Whitney U test to assess the statistical significance of the quantitative and Likert-scale measures, respectively.

On average, the participants made 5.68 (SD = 4.01) changes to the non-impacting variables when using SHAP, while for Outcome-Explorer-Lite the average was 3.00 (SD = 1.97). Participants reduced the changes made to the non-impacting variables by 47%, which was statistically significant (p < 0.02); Cohen's effect size value (d = 0.68) suggested a medium effect. We also found a significant difference between the magnitudes of changes users made on non-impacting variables (36% reduction with p < 0.001, plotted in Figure 6a).

In order to understand how the study conditions and datasets relate to each other with respect to the above quantitative measures, we constructed two mixed-effect linear models, one for each measure. We tested for an interaction between study condition and dataset while predicting a specific measure. While there were no interaction effects and the dataset did not play any significant role in predicting the measures, we found study condition to be the main effect in predicting the number of changes on non-impacting variables (F(1, 26.033) = 7.723, p = 0.01) and the magnitude of changes (%) on non-impacting variables (F(1, 26.992) = 13.140, p = 0.001).

Finally, as shown in Figure 6(c), participants rated Outcome-Explorer-Lite favorably in terms of Model Understanding (M = 6.11, SD = 0.50) and Educational (M = 6.26, SD = 0.87). In comparison, the scores for SHAP were: Model Understanding (M = 4.34, SD = 0.911) and Educational (M = 5.15, SD = 0.964). The differences were statistically significant with p < 0.0001 (Model Understanding) and p < 0.01 (Educational). However, we also observe that participants' trust did not improve in Outcome-Explorer. In the post-study interview, several participants mentioned their unpleasant experiences with automated systems. Such distrust is unlikely to change after a single study, and that might be the reason for equal trust in both conditions.

H2: Efficiency in Decision-making Tasks. We measured
two quantitative measures to account for users' overall performance in decision-making tasks. They are the total (1) number of changes and (2) magnitude of changes (%) made to input variables. As shown in Figure 6b, the average number of changes was 12.74 (SD = 8.59) for SHAP and 10.21 (SD = 6.88) for Outcome-Explorer-Lite. The difference was not statistically significant. We also did not find a significant difference between the overall magnitude of changes (%) users made in each study condition. Finally, we measured the time taken to complete the tasks, but no statistically significant difference was found.

Similar to the above, we constructed two mixed-effect linear models. We did not find any interaction effects, and the datasets as well as the conditions did not play any significant role in predicting the measures.

H3: Ease of Use. Participants rated Outcome-Explorer-Lite favorably in terms of Easy to use (M = 5.96, SD = 0.96) and I would use this system (M = 6.33, SD = 0.6). In comparison, the scores for SHAP were: Easy to use (M = 5.08, SD = 1.35) and I would use this system (M = 5.16, SD = 1.42). The differences were statistically significant with p < 0.04 (Easy to use) and p < 0.03 (I would use this system). The other metric (Enjoyable) was not statistically significant.

The results matched our anticipation that Outcome-Explorer-Lite would improve users' model understanding and that they would learn more about the prediction mechanism using our tool. In the post-study interview, participants appreciated the visual design of the causal DAG, which might be the reason why they found Outcome-Explorer-Lite easy to use and want to see it in practice.

On average, participants spent slightly more time when using Outcome-Explorer. While familiarizing themselves with the interface was one factor for that, in the post-study interview several participants mentioned that they felt curious, spent more time to learn the relations, and put some thought in before taking an action. A participant, a senior college student, mentioned: "The interface (Outcome-Explorer-Lite) is fun, attractive as well as educational. I feel like I learned something. I did not know much about housing prices before this session. But, I think I now have a much better understanding of housing prices. If available in public when I buy a house in the future, it will help me make an informed decision."

9 DISCUSSION AND LIMITATIONS

Model Understanding vs. Overall Performance: The user study validated H1 and H3, but not H2. The study revealed that participants reduced interactions with non-impacting variables significantly in Outcome-Explorer-Lite, indicating a better model understanding compared to SHAP. The lack of edges or blurred edges in Outcome-Explorer-Lite provided participants with clear evidence for non-impacting variables. On the other hand, while using SHAP, participants interacted with the non-impacting variables despite observing their zero feature importance in the visualization. They constructed several hypotheses about non-impacting variables while using SHAP, including the possibility of a change of impact in the future and their indirect effects on other variables. This may be the reason for the increased interaction with the non-impacting variables. However, improved understanding of non-impacting variables did not result in better overall performance. Participants instead increased interaction with the impacting variables to understand the effects of causal relations while using Outcome-Explorer-Lite. As a result, the overall performance (i.e., the total number of interactions with impacting and non-impacting variables) remained similar for both conditions. We also believe that the increased focus on impacting variables while using Outcome-Explorer-Lite fostered the observed better model understanding in the subjective measures.

The Accuracy-Interpretability Trade-off: We acknowledge that causal models might not reach the prediction accuracy of complex machine learning models. The comparison requires rigorous experiments on common ML tasks, which is beyond the scope of this paper. However, there exists empirical evidence where causal models or linear models such as ours outperformed complex ML models [50], [56]. As stated by Rudin [21], the idea that interpretable models do not perform as well as black-box models (the accuracy-interpretability trade-off) is often due to the lack of feature engineering while building interpretable models.

Implication for XAI Research: Outcome-Explorer offers several design insights for visual analytics systems in XAI. First, its novel two-module design shows that it is possible to support the explanation needs of experts and non-experts as well as the model creation functionalities for experts in a single system. While we acknowledge this is not a strict requirement for an XAI interface, we believe our work will motivate non-expert-inclusive design of XAI interfaces in the future. Second, it shows that an effective XAI interface does not necessarily require a new and complex visualization. "Simple" visual design and "intuitive" interactions are effective ways to convey the inner workings of predictive models, especially for non-expert users.

Another potential impact of Outcome-Explorer is bridging XAI and algorithmic fairness research. Algorithmic fairness ensures fair machine-generated decisions while XAI ensures transparency and explainability of ML models. Although highly relevant, these two research directions have not been bridged together yet. By promoting the causal model, a highly effective paradigm for bias mitigation strategies, Outcome-Explorer opens the door for an ML model to be transparent, accountable, and fair altogether.

Finally, our design and visual encoding can be extended to other graphical models. For example, a Bayesian Network is also represented as a DAG, and the design of Outcome-Explorer can be transferred to interactive XAI systems based on Bayesian networks.

Limitations & Future Work: It is important for a causal model to have a sufficient number of variables that cover all or at least most aspects determining the predicted outcome. Causal inferencing under incomplete domain coverage can result in islands of variables or a causal skeleton where some links are reduced to correlations only. We are currently experimenting with evolutionary and confirmatory factor analysis to introduce additional variables that can complement the native set of variables. These variables are often not directly measurable and can serve as latent variables. Our preliminary work has shown that they can greatly add to both model comprehensibility and completeness.

Another source of error can be confounders, which can
lead to an overestimation or underestimation of the strength of certain causal edges. There are several algorithms available for the detection and elimination of confounding effects, and we are presently working on a visual interface where expert users can take an active role in this type of effort.

A current limitation of our system is scalability. At the moment we limit the number of variables following George Miller's "Magical Number Seven, Plus or Minus Two" paradigm [57]. This allowed us to understand the explanation needs of mainstream non-expert users and support them through interactive visualizations on previously studied XAI datasets [13], [20]. However, in many real-life scenarios there can be "hundreds of variables", which could overwhelm non-expert users. We envision that the Model Creation Module could be enhanced with advanced feature engineering capabilities, such as clustering, dimension reduction, pooling of variables into latent variables (factor analysis), and level-of-detail visualization [58]. Alternatively, scalable causal graph visualization [42] could also be used for this purpose. Our future work will focus on gaining more insight into how much complexity non-expert users can handle, and which of the above-mentioned methods work best for them.

Finally, so far the participants we have studied were all from the younger generation (19-35), who generally tend to be savvier when it comes to the graphical tools used in our interface. In future work we aim to study how our system would be received by older members of society. It might require additional information integrated into the user interface, such as tooltips, pop-up suggestion boxes, and the like.

10 CONCLUSION

We presented Outcome-Explorer, an interactive visual interface that exploits the explanatory power of the causal model and provides a visual design that can be extended to other graphical models. Outcome-Explorer advances research towards interpretable interfaces and provides critical findings through a user study and expert evaluation.

We envision a myriad of applications of our interface. For example, bank advisors or insurance agents might sit with a client and use our system to discuss the various options with them (in response to their right to explanation), or a bank or insurance company could make our interface available on their website, along with a short instructional video. Future work will explore how complex a model can get while still being understandable by non-expert users.

ACKNOWLEDGMENTS

This research was partially supported by NSF grants IIS 1527200 and 1941613.

REFERENCES

[1] T. Brennan, W. Dieterich, and B. Ehret, "Evaluating the predictive validity of the COMPAS risk and needs assessment system," Criminal Justice and Behavior, vol. 36, no. 1, pp. 21–40, 2009.
[2] A. Chouldechova, D. Benavides-Prado, O. Fialko, and R. Vaithianathan, "A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions," in Conference on Fairness, Accountability and Transparency, 2018, pp. 134–148.
[3] Z. Obermeyer and S. Mullainathan, "Dissecting racial bias in an algorithm that guides health decisions for 70 million people," in Conf. on Fairness, Accountability, and Transparency, 2019, pp. 89–89.
[4] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, "Machine bias," ProPublica, May, vol. 23, p. 2016, 2016.
[5] J. Buolamwini and T. Gebru, "Gender shades: Intersectional accuracy disparities in commercial gender classification," in Conference on Fairness, Accountability and Transparency. PMLR, 2018, pp. 77–91.
[6] O. Keyes, "The misgendering machines: Trans/HCI implications of automatic gender recognition," ACM CSCW, vol. 2, pp. 1–22, 2018.
[7] S. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. NIPS, 2017, pp. 4765–4774.
[8] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you? Explaining the predictions of any classifier," in Proc. ACM Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[9] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau, "Visual analytics in deep learning: An interrogative survey for the next frontiers," IEEE Trans. on Visualization and Computer Graphics, vol. 25, no. 8, pp. 2674–2693, 2018.
[10] A. Abdul, J. Vermeulen, D. Wang, B. Lim, and M. Kankanhalli, "Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda," in ACM CHI, 2018, pp. 1–18.
[11] S. Amershi, M. Chickering, S. Drucker, B. Lee, P. Simard, and J. Suh, "ModelTracker: Redesigning performance analysis tools for machine learning," in ACM CHI, 2015, pp. 337–346.
[12] M. Kahng, P. Y. Andrews, A. Kalro, and D. Chau, "ActiVis: Visual exploration of industry-scale deep neural network models," IEEE Trans. on Visualization and Computer Graphics, vol. 24, no. 1, pp. 88–97, 2017.
[13] F. Hohman, A. Head, R. Caruana, R. DeLine, and S. Drucker, "Gamut: A design probe to understand how data scientists understand machine learning models," in ACM CHI, 2019, pp. 1–13.
[14] F. Hohman, H. Park, C. Robinson, and D. H. P. Chau, "Summit: Scaling deep learning interpretability by visualizing activation and attribution summarizations," IEEE Trans. on Visualization and Computer Graphics, vol. 26, no. 1, pp. 1096–1106, 2019.
[15] H.-F. Cheng, R. Wang, Z. Zhang, F. O'Connell, T. Gray, F. M. Harper, and H. Zhu, "Explaining decision-making algorithms through UI: Strategies to help non-expert stakeholders," in Proc. ACM CHI, 2019, pp. 1–12.
[16] T. Miller, "Explanation in artificial intelligence: Insights from the social sciences," Artificial Intelligence, vol. 267, pp. 1–38, 2019.
[17] P. Voigt and A. Von dem Bussche, "The EU General Data Protection Regulation (GDPR)," A Practical Guide, 1st ed., Springer Int'l., 2017.
[18] H. Shen, H. Jin, Á. A. Cabrera, A. Perer, H. Zhu, and J. I. Hong, "Designing alternative representations of confusion matrices to support non-expert public understanding of algorithm performance," Proc. ACM on HCI, vol. 4, pp. 1–22, 2020.
[19] R. Moraffah, M. Karami, R. Guo, A. Raglin, and H. Liu, "Causal interpretability for machine learning: Problems, methods and evaluation," ACM KDD Explorations, vol. 22, no. 1, pp. 18–33, 2020.
[20] Y. Ming, H. Qu, and E. Bertini, "RuleMatrix: Visualizing and understanding classifiers with rules," IEEE Trans. on Visualization and Computer Graphics, vol. 25, no. 1, pp. 342–352, 2018.
[21] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[22] I. E. Kumar, S. Venkatasubramanian, C. Scheidegger, and S. Friedler, "Problems with Shapley-value-based explanations as feature importance measures," Proc. ICML, 2020.
[23] T. Hastie and R. Tibshirani, Generalized Additive Models. CRC Press, 1990, vol. 43.
[24] B. Kim, C. Rudin, and J. A. Shah, "The Bayesian case model: A generative approach for case-based reasoning and prototype classification," in Proc. NIPS, 2014, pp. 1952–1960.
[25] J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
[26] B. Glymour and J. Herington, "Measuring the biases that matter: The ethical and casual foundations for measures of fairness in algorithms," in Proc. Conf. on Fairness, Accountability, and Transparency, 2019, pp. 269–278.
[27] Y. Wu, L. Zhang, X. Wu, and H. Tong, "PC-fairness: A unified framework for measuring causality-based fairness," in Proc. NIPS, 2019, pp. 3399–3409.
[28] J. Zhang and E. Bareinboim, "Fairness in decision-making: The causal explanation formula," in AAAI Artificial Intelligence, 2018.
[29] D. Madras, E. Creager, T. Pitassi, and R. Zemel, "Fairness through causal awareness: Learning causal latent-variable models for biased data," in Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM, 2019, pp. 349–358.
[30] J. R. Loftus, C. Russell, M. J. Kusner, and R. Silva, "Causal reasoning for algorithmic fairness," arXiv preprint arXiv:1805.05859, 2018.
[31] M. J. Kusner, C. Russell, J. R. Loftus, and R. Silva, "Causal interventions for fairness," arXiv preprint arXiv:1806.02380, 2018.
[32] A. Khademi, S. Lee, D. Foley, and V. Honavar, "Fairness in algorithmic decision making: An excursion through the lens of causality," in ACM World Wide Web, 2019, pp. 2907–2914.
[33] B. Dietvorst, J. Simmons, and C. Massey, "Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them," Management Science, vol. 64, no. 3, pp. 1155–1170, 2018.
[34] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson, "The What-If Tool: Interactive probing of machine learning models," IEEE Trans. on Visualization and Computer Graphics, vol. 26, no. 1, pp. 56–65, 2019.
[35] O. Gomez, S. Holter, J. Yuan, and E. Bertini, "ViCE: Visual counterfactual explanations for machine learning models," in Proc. Intern. Conference on Intelligent User Interfaces, 2020, pp. 531–535.
[36] J. Krause, A. Perer, and K. Ng, "Interacting with predictions: Visual inspection of black-box machine learning models," in Proc. ACM CHI, 2016, pp. 5686–5697.
[37] J. Wang and K. Mueller, "The visual causality analyst: An interactive interface for causal reasoning," IEEE Trans. on Visualization and Computer Graphics, vol. 22, no. 1, pp. 230–239, 2015.
[38] ——, "Visual causality analysis made practical," in 2017 IEEE Conf. on Visual Analytics Science and Technology (VAST), 2017, pp. 151–161.
[39] R. Ryan and E. Deci, "Self-determination theory and the role of basic psychological needs in personality and the organization of behavior," in Handbook of Personality: R&T, 2008, pp. 654–678.
[40] J. Schell, The Art of Game Design: Book of Lenses. CRC Press, 2008.
[41] J. Yan, Z. Gu, H. Lin, and J. Rzeszotarski, "Silva: Interactively assessing machine learning fairness using causality," in Proc. ACM CHI, 2020, pp. 1–13.
[42] X. Xie, F. Du, and Y. Wu, "A visual analytics approach for exploratory causal analysis: Exploration, validation, and applications," IEEE Trans. on Visualization and Computer Graphics, 2020.
[43] Y. Onoue, K. Kyoda, M. Kioka, K. Baba, S. Onami, and K. Koyamada, "Development of an integrated visualization system for phenotypic character networks," in 2018 IEEE Pacific Visualization Symposium (PacificVis). IEEE, 2018, pp. 21–25.
[44] H. Natsukawa, E. R. Deyle, G. M. Pao, K. Koyamada, and G. Sugihara, "A visual analytics approach for ecosystem dynamics based on empirical dynamic modeling," IEEE Transactions on Visualization and Computer Graphics, 2020.
[45] J. Pearl, Causality: Models, Reasoning & Inference, 2nd ed. Cambridge University Press, 2013.
[46] P. M. Bentler and D. G. Weeks, "Linear structural equations with latent variables," Psychometrika, vol. 45, no. 3, pp. 289–308, 1980.
[47] R. H. Hoyle, Structural Equation Modeling: Concepts, Issues, and Applications. Sage, 1995.
[48] C. Glymour, K. Zhang, and P. Spirtes, "Review of causal discovery methods based on graphical models," Frontiers in Genetics, vol. 10, p. 524, 2019.
[49] X. Shen, S. Ma, P. Vemuri, and G. Simon, "Challenges and opportunities with causal discovery algorithms: Application to Alzheimer's pathophysiology," Scientific Reports, vol. 10, no. 1, pp. 1–12, 2020.
[50] S. Tople, A. Sharma, and A. Nori, "Alleviating privacy attacks via causal learning," in International Conference on Machine Learning. PMLR, 2020, pp. 9537–9547.
[51] M. Bostock, V. Ogievetsky, and J. Heer, "D3: Data-driven documents," IEEE Trans. on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2301–2309, 2011.
[52] J. Ellson, E. Gansner, L. Koutsofios, S. C. North, and G. Woodhull, "Graphviz—open source graph drawing tools," in International Symposium on Graph Drawing. Springer, 2001, pp. 483–484.
[53] D. Harrison Jr. and D. L. Rubinfeld, "Hedonic housing prices and the demand for clean air," 1978.
[54] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and R. Johannes, "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus," in Proc. of the Annual Symposium on Computer Application in Medical Care, 1988, p. 261.
[55] Z. Buçinca, P. Lin, K. Z. Gajos, and E. L. Glassman, "Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems," in Proc. ACM IUI, 2020, pp. 454–464.
[56] C. Rudin and J. Radin, "Why are we using black box models in AI when we don't need to? A lesson from an explainable AI competition," Harvard Data Science Review, vol. 1, no. 2, 2019.
[57] G. A. Miller, "The magical number seven, plus or minus two: Some limits on our capacity for processing information," Psychological Review, vol. 101, no. 2, p. 343, 1994.
[58] Z. Zhang, K. T. McDonnell, and K. Mueller, "A network-based interface for the exploration of high-dimensional data spaces," in 2012 IEEE Pacific Visualization Symposium, 2012, pp. 17–24.

Md Naimul Hoque is currently a PhD student at the College of Information Studies, University of Maryland, College Park. Previously, he obtained an M.S. in Computer Science from Stony Brook University. His current research interests include explainable AI, visual analytics, and human-computer interaction. For more information, see https://naimulh0que.github.io

Klaus Mueller has a PhD in computer science and is currently a professor of computer science at Stony Brook University and a senior scientist at Brookhaven National Lab. His current research interests include explainable AI, visual analytics, data science, and medical imaging. He won the US National Science Foundation Early CAREER Award, the SUNY Chancellor's Award for Excellence in Scholarship & Creative Activity, and the IEEE CS Meritorious Service Certificate. His 200+ papers were cited over 10,000 times. For more information, see http://www.cs.sunysb.edu/~mueller