PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines
Jorge Piazentin Ono, Sonia Castelo, Roque Lopez, Enrico Bertini, Juliana Freire, Claudio Silva
arXiv:2005.00160v2 [cs.HC] 4 Sep 2020
Fig. 1. PipelineProfiler applied to the analysis of binary classification pipelines generated by five different AutoML systems for the Statlog
(Heart) Data Set. A) The system is integrated with Jupyter Notebook and can be invoked with one line of code. B) PipelineProfiler menu,
with options to subset, export, sort, and perform automated analysis on pipelines. C) Pipeline Matrix: C1) Primitives (columns) used by
the pipelines (rows). C2) Tooltip showing the metadata and hyperparameters for a primitive. C3) One-hot-encoded hyperparameters
(columns) for the primitive Xgboost Gbtree across pipelines (rows). C4) Pipeline scores: users can select different metrics to rank
pipelines. C5) Primitive Contribution View, showing correlations between primitive usage and pipeline scores (here, Deep Feature
Synthesis has the highest correlation with F1 scores). D) Pipeline Comparison View: visual comparison of the top-3 scoring pipelines.
Abstract— In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to generate
end-to-end ML pipelines. While these techniques facilitate the creation of models, given their black-box nature, the complexity of the
underlying algorithms, and the large number of pipelines they derive, they are difficult for developers to debug. It is also challenging
for machine learning experts to select an AutoML system that is well suited for a given problem. In this paper, we present
PipelineProfiler, an interactive visualization tool that allows the exploration and comparison of the solution space of machine learning
(ML) pipelines produced by AutoML systems. PipelineProfiler is integrated with Jupyter Notebook and can be combined with common
data science tools to enable a rich set of analyses of the ML pipelines, providing users a better understanding of the algorithms that
generated them as well as insights into how they can be improved. We demonstrate the utility of our tool through use cases where
PipelineProfiler is used to better understand and improve a real-world AutoML system. Furthermore, we validate our approach by
presenting a detailed analysis of a think-aloud experiment with six data scientists who develop and evaluate AutoML tools.
Index Terms—Automatic Machine Learning, Pipeline Visualization, Model Evaluation
1 INTRODUCTION
Machine Learning (ML) has been successfully used in a plethora of applications. However, assembling end-to-end ML pipelines is a difficult endeavor that requires a time-consuming, trial-and-error process. This process is difficult for ML experts and out of reach for subject-matter experts with little or no training in ML or computer science. AutoML systems have been proposed to address this challenge. Given a ML problem, AutoML aims to automate the synthesis of ML pipelines that perform well for the problem by searching over a space of possible pipelines which use different combinations of primitives (computational steps / ML algorithms) and values for their associated hyperparameters (settings that configure algorithm behavior) [32].

All authors are with New York University. E-mails: {jorgehpo, s.castelo, rlopez, enrico.bertini, juliana.freire, csilva}@nyu.edu

AutoML has had substantial practical impact by making data scientists more efficient, enabling researchers to work on harder problems, and democratizing ML to less experienced users [17, 22, 23]. Several open-source AutoML systems are currently available. For example, Auto-sklearn [23] creates pipelines based on a pool of 33 pre-processing and classification primitives. In the context of the DARPA Data-Driven Discovery of Models (D3M) program [20], a collaborative effort involving multiple research groups, twenty AutoML systems were developed that synthesize pipelines from a set of over 300 primitives [12, 14]. Because exploring the entire space of primitives and associated hyperparameters is not feasible in practice, these systems use sophisticated search strategies to reduce the number of pipelines they need to evaluate. For example, AutoWeka uses Bayesian optimization [57], Auto-sklearn uses meta-learning [23], TPOT uses genetic programming [44], and AlphaD3M uses deep learning [17].
Challenges in Understanding and Comparing AutoML Systems. Given the complexity of these systems, two important challenges arise. First, it is difficult to evaluate the efficiency, correctness and performance of AutoML systems. To debug an AutoML system, developers must analyze logs consisting of synthesized pipeline instances and their outcomes. They need to assess, for instance, the efficiency of the search process, the structural diversity of the derived pipelines, how well the search covers the available primitives, and whether primitives were used correctly. This requires involved analyses that are further complicated by the fact that not only do these logs contain a large number of instances, but the instances can have a complex structure and use a wide variety of primitives.

The second challenge emerges from the need to compare different AutoML systems. Given the growing number of available systems [17, 23, 27, 33, 42, 44], to make an informed decision when selecting a system, it is important for users to understand how well the systems perform for different problems, and to identify features that contribute to a system being more or less effective than others for a task. Insights obtained in such a comparison are also of great value for AutoML developers, as these may help them improve their systems.

Visual analytics techniques have been proposed to make AutoML more transparent by enabling the exploration of the produced pipelines, the primitives employed, and their associated hyperparameters [8, 26, 45, 63, 66]. However, these have important limitations with respect to the challenges we outlined: (1) the graphical encodings used for pipelines only support fixed templates, thus they cannot represent pipelines that have complex structure; and (2) they do not support the comparison of multiple AutoML systems. For example, AutoAIViz [66] only supports sequential pipelines that consist of three primitives: two transformers (e.g., pre-processing or feature selection) and one estimator (e.g., classification or regression); ATMSeer [63] presents information only for the estimator step of the pipeline. Using these approaches, it is not possible to explore pipelines that have a complex structure such as directed acyclic graphs (DAGs), or long pipelines that contain multiple estimators, such as the ones used for ensemble models, which have been shown to outperform simpler pipelines [23].

It is worth noting that since these techniques are tightly coupled with a specific AutoML system, they do not address the challenges involved in comparing different systems. For example, while ATMSeer [63] visualizes pipelines produced by ATM [53], AutoAIViz can only show pipelines from AutoAI [62].

Our Approach. We propose PipelineProfiler, a visual analytics tool that enables the exploration and comparison of end-to-end ML pipelines produced by multiple AutoML systems. PipelineProfiler takes as input pipelines represented with a common description language, consisting of the pipeline architecture (the set of primitives and the data flow between primitives, encoded as a Directed Acyclic Graph) and pipeline metadata (hyperparameters, running time, evaluation scores, etc.). Fig. 1 shows the main components of the system. The Pipeline Matrix provides a visual summary of a collection of AutoML pipeline instances that captures structural information, including the primitives and associated hyperparameters used in the pipelines, as well as the outcome of each pipeline encoded in a score. This representation is compact and can effectively encode pipelines derived by multiple AutoML systems that have both complex structure and use a variety of primitives. By showing the correlations between primitive usage and pipeline scores, and how much primitives contribute to the score, the representation can also help uncover insights into the suitability (or effectiveness) of primitives for specific data and problem types. Users can drill down into the structural details of pipelines, and examine both their differences and similarities. As we discuss later, these analyses enable the identification of patterns which can, for example, expose interesting aspects of the search strategies used by AutoML systems. PipelineProfiler is integrated with Jupyter Notebooks. This enables complex analyses to be performed over pipeline collections that leverage the rich ecosystem of Python tools for data science.

Our design was inspired by requirements from developers of AutoML systems and experts that evaluate and compare these systems in the context of the DARPA D3M project [20]. These experts used PipelineProfiler and provided feedback throughout its development.

Our main contributions can be summarized as follows:

• We propose PipelineProfiler, which is, to the best of our knowledge, the first visual analytics tool for exploring and comparing pipelines produced by multiple AutoML systems.

• The tool combines effective visual representations and interactions for AutoML analysis, including: the Pipeline Matrix, a visual encoding that summarizes the pipelines, primitives and hyperparameter values produced by AutoML systems; and the Graph Comparison View, which applies graph matching to highlight the structural differences and similarities among a set of pipelines of interest.

• To derive the information needed to convey the correlation between primitives and pipeline scores, we propose a new algorithm that identifies groups of primitives that have a strong impact on pipeline scores.

• To demonstrate the effectiveness of PipelineProfiler, we present two use cases that show how the tool was used to obtain insights that resulted in efficiency improvements to an open-source AutoML system. We also discuss a detailed analysis of a think-aloud experiment with six data scientists, which suggests that the tool is both usable and useful, and highlights the benefits of its integration with Jupyter Notebooks, both in the flexibility it affords for complex analyses and in the simplicity of incorporating the tool into the experts' development and evaluation workflows.

2 RELATED WORK

AutoML has emerged as an approach to simplify the use of ML for different applications, and many systems that support AutoML are currently available [17, 23, 27, 32, 33, 42, 44]. While early work on AutoML focused on hyperparameter optimization and on ML primitives, recent approaches aim to efficiently automate the synthesis of end-to-end pipelines, covering data loading, pre-processing, feature extraction, feature selection, model fitting and selection, and hyperparameter tuning [17, 23, 44, 49, 53]. AlphaD3M uses deep learning to learn how to incrementally construct ML pipelines, framing the problem of pipeline synthesis for model discovery as a single-player game with a neural network sequence model and Monte Carlo Tree Search (MCTS) [17]. Auto-Sklearn produces pipelines using Bayesian optimization combined with meta-learning [23], and TPOT uses genetic programming and tree-based optimization [44].

During the search for well-performing pipelines, AutoML systems can generate a large number of pipelines. This makes the analysis of the search results a challenging problem, in particular when pipelines have similar scores. As a consequence, the selection of the best-performing end-to-end pipeline becomes an expensive, time-consuming and tedious process. In the following, we discuss research that has attempted to tackle these challenges through visualization.

Explaining AutoML. The black-box nature of AutoML systems and the difficulty in understanding the inner workings of these systems lead to reduced trust in the pipelines they produce [63]. There have been some attempts to make the AutoML process more transparent through insightful visualizations of the resulting pipelines. These can be roughly grouped into two categories: hyperparameter visualization [26, 45, 64] and pipeline visualization [8, 66].
ATMSeer [63] and Google Vizier [26] are approaches to visualize hyperparameters. ATMSeer, which is integrated with the ATM AutoML framework [53], displays the predictive model (i.e., the last step of the ML pipeline) together with its hyperparameters and performance metrics to the user. Users can use crossfiltering (e.g., over ML algorithms) to facilitate the exploration of large collections of pipelines and refine the search space of the AutoML system, if needed. Google Vizier makes use of a parallel coordinates view where hyperparameters and objective functions are displayed to the users. It allows them to examine how the different hyperparameters (dimensions) co-vary with each other and also against the objective function. Although these methods help AutoML users analyze the generated pipelines, most of them only support the analysis of the last step of the pipeline, the fitted model, leaving aside important aspects of the pipeline such as data cleaning and feature engineering. In contrast, PipelineProfiler provides a visualization to explore and analyze the end-to-end pipeline, from data ingestion and engineering to model generation.

Systems that support pipeline visualization include AutoAIViz [66] and REMAP [8]. AutoAIViz uses Conditional Parallel Coordinates (PCP) [65] to represent sequential pipelines and their hyperparameters. The system provides a hierarchical visualization which shows the pipeline steps (on the first level of the PCP) as well as the hyperparameters of each step (on the second level of the PCP). REMAP [8] focuses on pipelines that use deep neural networks. It proposes a new glyph, called Sequential Neural Architecture Chips (SNAC), which shows both the layer type and dimensionality, and allows users to interactively add or remove layers from the networks. Despite their ability to show end-to-end pipelines, both systems can only show linear pipelines, making it difficult to explore pipelines created by different AutoML systems that have more complex structure. Furthermore, REMAP was designed specifically to visualize neural network architectures, and thus it is not suitable for exploring general ML pipelines that use different learning techniques. In this work, our goal is to allow users to explore, compare and analyze pipelines generated by multiple AutoML systems, which can have nonlinear pipeline structures and use a variety of primitives and learning techniques.

Visual Analytics for Model Selection. Selecting a good model among the potentially large set of models (or pipelines) derived by an AutoML system is a challenging problem that has attracted significant attention in the literature. Visual analytics systems such as Visus [48], TwoRavens [25], and Snowcat [7] provide a front-end to AutoML systems and guide subject-matter experts without training in data science through the model selection process. They focus on providing explanations for models, and some provide simple mechanisms to compare models (e.g., based on the scores, or the actual explanations). Other approaches focus exclusively on model selection. RegressionExplorer [16] enables the creation, evaluation and comparison of logistic regression models using subgroup analysis. ModelTracker [2] and Squares [46] introduce novel encodings to investigate different models, by visually comparing histograms based on the statistical performance metrics of binary and multi-class classifiers. They also show instance-level distribution information and enable multi-scale analysis. For a more complete list of visual analytics tools for interactive model exploration, we refer the interested reader to the following surveys [24, 30, 38, 52]. The majority of these methods were designed to evaluate and select predictive models based on performance results. However, they do not take into account additional metrics like running time (i.e., how long pipelines take to run) or primitive usage (i.e., whether primitives are used correctly and effectively). PipelineProfiler not only encodes this information in a compact visual representation, but it also provides a usable interface that allows users to interact with a pipeline collection at different levels of abstraction, from a high-level overview to drilling down to inspect details of selected pipelines.

Interactive Model Steering. Systems such as TreePOD [43], BEAMES [15] and EnsembleMatrix [54] support the analysis and refinement of models through interactive visualizations, and allow users to explore the effects of modifying some parameters. BEAMES [15] lets users steer the training of new regression models from a set of previous models. It presents model performance information to the users and trains new models based on the user feedback. EnsembleMatrix [54] and TreePOD [43] enable users to steer the creation of ensemble decision tree models. With EnsembleMatrix, users can combine and choose weights for decision trees and interactively evaluate the performance of the ensemble model. Conversely, TreePOD supports the optimization of multiple objectives, including performance and interpretability. Although these systems help users understand the impact of these parameters on the models, they do not consider other relevant steps (also called primitives) that are part of end-to-end pipelines, like data ingestion and feature engineering, which can have a significant impact on the final model performance. The Primitive Contribution view in PipelineProfiler displays the correlations between primitive usage and scores, allowing users to infer which primitives can lead to well-performing pipelines. Users can then drill down and further explore individual pipelines, their primitives and hyperparameters.

3 PipelineProfiler: EXPLORING END-TO-END ML PIPELINES

In this section, we describe PipelineProfiler, a tool that enables the exploration of end-to-end machine learning pipelines produced by AutoML systems. We first present the desiderata we distilled from interviews with AutoML experts and subsequently used to guide our design choices. Then, we describe the components of PipelineProfiler, how they are integrated, and the algorithms we developed to enable the effective analysis of ML pipelines. Finally, we briefly describe the implementation details of our system.

3.1 Domain Requirements

We conducted interviews with six data scientists who actively work with AutoML systems in the context of the D3M project [20]: the developers of four distinct AutoML systems (D1 - D4) and two data scientists who are tasked with evaluating the D3M AutoML systems (E1 and E2). Since each developer works on a specific system, they have different needs and follow distinct workflows. However, they share some challenges. The AutoML evaluators are part of the D3M management team. They are responsible for selecting what types of ML tasks the developers must focus on, and they also evaluate system performance.

D3M pipelines are represented as JSON-serialized objects that contain metadata, input and output information, and the pipeline architecture, which is described as a directed acyclic graph (DAG) [13, 40]. The exploration of ML pipeline collections is a task performed by all AutoML developers and evaluators. All interviewees said they explored pipelines by looking at their JSON representations, and complained that reading the text files and inspecting the pipelines one at a time was a tedious and time-consuming task. Understanding and comparing the pipelines is difficult, in particular because the DAG structure is hard to grasp from the JSON representation.
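The sketch below is an abbreviated, illustrative example of the kind of JSON pipeline document discussed above, written as a Python dictionary. The field names and primitive paths are simplified assumptions for illustration; real pipeline documents follow the D3M pipeline schema and carry additional metadata (identifiers, digests, typed argument references, etc.).

```python
# Illustrative, simplified sketch of a D3M-style pipeline description.
# Field names and primitive paths are assumptions; real documents follow the
# D3M pipeline schema and include more metadata.
pipeline = {
    "id": "0001",
    "inputs": [{"name": "inputs"}],
    "outputs": [{"data": "steps.2.produce"}],   # the last step produces the predictions
    "steps": [
        {"type": "PRIMITIVE",
         "primitive": {"python_path": "d3m.primitives.data_transformation.denormalize.Common"},
         "arguments": {"inputs": {"data": "inputs.0"}}},
        {"type": "PRIMITIVE",
         "primitive": {"python_path": "d3m.primitives.data_cleaning.imputer.SKlearn"},
         "arguments": {"inputs": {"data": "steps.0.produce"}},
         "hyperparams": {"strategy": {"data": "mean"}}},
        {"type": "PRIMITIVE",
         "primitive": {"python_path": "d3m.primitives.classification.xgboost_gbtree.Common"},
         "arguments": {"inputs": {"data": "steps.1.produce"}}},
    ],
}
```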
D1 said she does not have time to inspect pipelines often, and instead focuses on assessing cross-validation scores and looking for correlations between primitives and performance scores. In contrast, D2, D3 and D4 examine the pipeline DAGs and their architecture. D1 and D2 also analyze the prediction and training time. They mentioned that their AutoML systems are evaluated within a given time budget, and therefore training time is an important metric for them.

D2's system has a blacklisting feature: when a primitive is found to have poor performance, it can be flagged and excluded from the search process. Therefore, he was interested in identifying when a primitive was associated with high or low pipeline scores.

D3 usually compares pipelines using their cross-validation scores. When he finds a problem for which his system derives sub-optimal pipelines, he inspects the pipelines derived by other systems. His goal is to understand which features of his pipelines lead to the low scores, and conversely, why the pipelines derived by the other systems perform better. By answering these questions, he hopes to gain insights into whether and how he can improve his system. He is also interested in exploring the pipelines at the hyperparameter level, but said this is currently not possible due to the large number of primitives (over 300), pipelines, and parameters involved.
D4 is also interested in comparing pipelines, albeit for a different reason. More specifically, he is interested in comparing AutoML pipelines from different sources, including human-generated pipelines. His goal is to evaluate whether there are differences between machine- and human-generated pipelines. He is also interested in primitive similarity. More specifically, he wants to find which primitives are exchangeable within a pipeline architecture.

The analysis workflow followed by the AutoML evaluators is significantly different from that of the developers. While developers focus on pipeline structure, evaluators are mostly concerned with how well the systems perform and with the problem types (e.g., classification, regression, forecasting, object detection, etc.) they currently support and should support in future iterations. More specifically, E1 and E2 said that their workflow consisted mostly of evaluating AutoML systems based on their cross-validation scores. However, they were also interested in checking how the primitives were being used, and whether AutoML systems produced different pipelines for a given problem type. More specifically, they stated that if all AutoML systems derived the same (or very similar) pipelines, the task they are solving is no longer challenging and new problem types should be proposed. For formal evaluations, the D3M systems are evaluated using sequestered problems that are not visible to the developers. Thus, to give actionable insight to AutoML developers without disclosing specifics of the sequestered problems, E2 was also interested in identifying why pipelines fail.

We compiled the following desiderata from the interviews. Each requirement is marked as Critical, Important, or Optional, according to the number of users that requested it and how it guided the design of our tool.

[R1] Pipeline collection overview and summary: all participants would like to visualize and compare multiple pipelines simultaneously, instead of inspecting them one by one (Critical).

[R2] Primitive usage: E1 and E2 are interested in exploring how primitives are used across different AutoML systems. More specifically, they want to check if the systems are generating diverse solutions and if there are underutilized primitives (Important).

[R3] Visualizing primitive hyperparameters: D3 would like to be able to explore the hyperparameter space of the primitives used in his pipelines (Optional).

[R4] Visualizing pipeline metadata: D1, D2, E1 and E2 mentioned they were interested in visualizing and comparing different aspects of the trained pipelines, including scores, prediction and training time (Important).

[R5] Finding correlations between primitives and scores: D1 and D2 were interested in identifying primitives that correlate with high scores on different problems and datasets. Furthermore, D2 would like to see primitives that perform poorly in order to blacklist them, and E2 is interested in identifying possible causes for pipeline failure (i.e., low scores) (Important).

[R6] Visualizing and comparing pipeline graphs: all developers were interested in visualizing the connections between pipeline primitives using a graph metaphor. Furthermore, D3 and D4 are interested in performing a detailed comparison of the pipeline graphs. In particular, they want to identify how different AutoML systems structure their pipelines to solve a particular problem type (Critical).

3.2 Visualization Design

In order to fulfill the requirements identified in the previous section, we developed PipelineProfiler, a tool that enables the interactive exploration of pipelines generated by AutoML systems. Fig. 1 shows PipelineProfiler being applied to compare pipelines derived by five distinct AutoML systems for a classification problem that aims to predict the presence of heart disease using the Statlog (Heart) Data Set [19]. The main components of PipelineProfiler are the Pipeline Matrix (C) and the Pipeline Comparison View (D). The Pipeline Matrix (C) shows a tabular summary of all the pipelines in the collection. The user can also drill down and explore one or multiple pipelines in more detail: the graph structure of selected pipelines is displayed in the Pipeline Comparison View (D) upon request. The system Menu (B) enables users to focus on a subset of the pipelines, export pipelines of interest to Python, sort the table rows and columns, and perform automated analyses over groups of primitives. These operations are described later in this section. PipelineProfiler is implemented as a Python library that can be used with Jupyter Notebooks to facilitate the integration with the workflow of the AutoML community (A).

Pipeline Matrix

The Pipeline Matrix provides a summary of a collection of machine learning pipelines [R1] selected by the user. Its visual encoding was inspired by visualizations used for topic modeling systems [1, 10]. However, instead of words and documents, this matrix represents whether a primitive is used [R2] in a machine learning pipeline (Fig. 1(C1)). Users can interactively reorder rows and columns according to pipeline evaluation score, pipeline source (the AutoML system that generated it), primitive type (e.g., classification, regression, feature extraction, etc.), and estimated primitive contribution (i.e., the correlation of primitive usage with pipeline scores). Furthermore, we use shape to encode primitive types. When columns are sorted by primitive type, we include vertical separator lines between primitive types to help differentiate groups of primitives. A similar design was used in [56] to highlight groups of samples.

To support the exploration of hyperparameters [R3], PipelineProfiler implements two interactions that show this information on demand: a parameter tooltip and a one-hot-encoded parameter matrix. When the user hovers over a cell in the matrix, a tooltip shows the primitive metadata (type and Python path) as well as a table with all the hyperparameters that were set. Fig. 1(C2) shows a tooltip for the primitive Denormalize, with four hyperparameter values set. Users can also inspect a summary of the hyperparameter space of a primitive by selecting a column in the Pipeline Matrix. When a primitive (column) is selected, all of its hyperparameters are represented using a one-hot-encoding approach: each hyperparameter value becomes a column in the matrix, and dots indicate when the hyperparameter is set in a pipeline. Since ML researchers are familiar with one-hot encoding, this representation is natural for them. Fig. 1(C3) shows the hyperparameter space of Xgboost Gbtree.
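The one-hot-encoded parameter matrix follows the same idea as one-hot encoding in everyday ML practice. A minimal sketch of the underlying transformation is shown below, assuming `hyperparams` holds one dictionary of hyperparameter settings per pipeline for the selected primitive; the parameter names and values are purely illustrative.

```python
import pandas as pd

# One dict of hyperparameter settings per pipeline for the selected primitive.
hyperparams = [
    {"booster": "gbtree", "max_depth": "3"},
    {"booster": "gbtree", "max_depth": "5"},
    {},  # pipeline where the primitive has no explicit settings
]
df = pd.DataFrame(hyperparams)

# Each distinct hyperparameter value becomes its own column; a 1 marks the
# pipelines where that value is set (the dots in Fig. 1(C3)). Unset values
# (NaN) are simply left out.
one_hot = pd.get_dummies(df, prefix_sep="=")
print(one_hot)
```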
Domain experts were interested in exploring pipeline metadata [R4], including training and testing scores, training time and execution time. PipelineProfiler shows the pipeline metadata in the Metric View (Fig. 1(C4)). Users can select which metric to display using a drop-down menu, and the numerical values are shown in a bar chart aligned with the matrix rows. In C4, the user can choose to display the metric F1 or the prediction time. Pipeline rows can be re-ordered based on the metric, and to enable a comparison across systems, users can also interactively group pipelines by the system that generated them.

To convey information about the relationships between primitive usage and pipeline scores [R5], we designed the Primitive Contribution view. This view shows an estimate of how much a primitive contributes to the score of the pipeline using a bar chart encoding, aligned with the columns of the matrix (Fig. 1(C5)). The contribution can be either positive or negative, representing positive or negative primitive correlation with the scores. For example, in C5, Deep Feature Synthesis is the primitive most highly correlated with F1.

We estimate the primitive contribution using the Pearson Correlation (PC) between the primitive indicator vector p (p_i = 1 if pipeline i contains the primitive in question and p_i = 0 otherwise) and the pipeline metric vector m, where m_i is the metric score for pipeline i. Since p is dichotomous and m is quantitative, PC can be computed more efficiently with the Point-Biserial Correlation (PBC) coefficient [50]. PBC is equivalent to the Pearson correlation, but can be evaluated with fewer operations. Let m_1 be the mean of the metric scores (m) when the primitive is used (p_i = 1); m_0 the mean of the scores when the primitive is not used (p_i = 0); s the standard deviation of all the scores (m); n_1 the number of pipelines where the primitive is used; n_0 the number of pipelines where the primitive is not used; and n = n_1 + n_0. The Point-Biserial Correlation is computed as:

$$PBC = \frac{m_1 - m_0}{s}\sqrt{\frac{n_1 n_0}{n^2}}$$
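A minimal sketch of this computation is shown below, assuming `usage` is the binary primitive indicator vector and `scores` holds the pipeline metric values; it uses the population standard deviation, which makes the result equal to the Pearson correlation of the two vectors.

```python
import numpy as np

def primitive_contribution(usage, scores):
    """Point-biserial correlation between primitive usage (0/1) and pipeline scores."""
    usage = np.asarray(usage, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    m1 = scores[usage].mean()      # mean score when the primitive is used
    m0 = scores[~usage].mean()     # mean score when the primitive is not used
    s = scores.std()               # population standard deviation of all scores
    n1, n0, n = usage.sum(), (~usage).sum(), len(scores)
    return (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

# Equivalent (up to floating point) to scipy.stats.pointbiserialr(usage, scores).correlation
```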
The choice of Pearson Correlation (PC) as a proxy for primitive contribution is motivated by its natural interpretation, community familiarity, and fast computation. However, PC is sensitive to sample size and outliers [34]. There are more robust methods to evaluate the importance of hyperparameters/attributes [3, 31, 58], but they are not tailored for interactive systems. In particular, these algorithms have longer running times and require data pre-processing, i.e., training regression models [31, 58] or computing frequent patterns [3].

Pipeline Comparison View

To provide a concise summary of a collection of pipelines, the Pipeline Matrix models the pipelines as a set of primitives that can be effectively displayed in a matrix. However, while analyzing pipeline collections, AutoML developers also need to examine and compare the graph structure of the pipelines [R6]. The Pipeline Comparison view (Fig. 1(D)) consists of a node-link diagram that shows either an individual pipeline or a visual-difference summary of multiple pipelines selected in the matrix representation. In the summary graph, each primitive (node) is color-coded to indicate the pipeline where it appears. If a primitive is present in multiple pipelines, all corresponding colors are displayed. If a primitive appears in all selected pipelines, no color is displayed.

The Pipeline Comparison View enables users to quickly identify similarities and differences across pipelines. Fig. 2 shows the best (a) and worst (b) pipelines solving the 20 newsgroups classification problem [37], and a merged pipeline (c) that highlights the differences between the two pipelines, clearly showing that the best pipeline (blue) uses a Gradient Boosting classifier, and HDP and Text Reader feature extractors.

Fig. 2. Pipeline Comparison View, showing the best and worst pipelines for a multitask classification problem on the 20 newsgroups dataset. (a) and (b) show the individual pipeline structures of the best (F1 Macro: 0.45) and worst (F1 Macro: 0.06) pipelines, respectively. (c) presents the merged view of both pipelines, highlighting the differences between them using color-coded headers.

To support the comparison of multiple pipeline structures, we adapted the graph summarization method proposed by Koop et al. [35]. Since ML pipelines are directed acyclic graphs, we modify the method to avoid cycles in the merged graph. The algorithm creates a summary graph by iteratively merging graph pairs. The merge of two graphs G1 and G2 is performed in four steps, as detailed below.

1) Computing Node Similarity: Let p ∈ G1 and q ∈ G2 be two primitives (nodes). We say that p and q have the same type if they perform the same data manipulation (e.g., Classification, Regression, Feature Extraction, etc.). The similarity f is given by:

$$f(p, q) = \begin{cases} 1.0 & \text{if } p \text{ and } q \text{ are the same primitive (total match)} \\ 0.5 & \text{if } p \text{ and } q \text{ have the same type (partial match)} \\ 0.0 & \text{otherwise} \end{cases}$$

As in Koop et al. [35], we use the Similarity Flooding algorithm [39] to iteratively adjust the similarity between nodes and take node connectivity into account. We refer the reader to [39] for details.

2) Graph Edit Matrix Construction: In order to match two graphs, G1 and G2, the algorithm builds a graph edit matrix that contains all the possible costs to transform G1 into G2. Let m and n be the number of nodes in G1 and G2, respectively. The edit matrix E is defined so that the selection of one entry from every row and one entry from every column corresponds to a graph edit that transforms G1 into G2 [47]. E contains the costs to add (a), delete (d) and substitute (s) nodes. We choose costs that prioritize node substitutions in case of a total or partial match: s_{i,j} = 1 − f(i, j), d = 0.4 and a = 0.4.

$$E = \begin{bmatrix}
s_{00} & s_{01} & \cdots & s_{0n} & a_0 & \infty & \cdots & \infty \\
s_{10} & s_{11} & \cdots & s_{1n} & \infty & a_1 & \cdots & \infty \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
s_{m0} & s_{m1} & \cdots & s_{mn} & \infty & \infty & \cdots & a_m \\
d_0 & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\infty & d_1 & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\infty & \infty & \cdots & d_n & 0 & 0 & \cdots & 0
\end{bmatrix}$$

3) Node Matching: We use the Hungarian algorithm [36] to select one entry from every row and one entry from every column of E, while minimizing the total cost of the graph edit. Two nodes match when one can be substituted by the other, i.e., their substitution entry is selected from the matrix.

4) Graph Merging: We merge G1 and G2 by creating a compound node for every pair of nodes that were matched in step 3. However, since machine learning pipelines are directed acyclic graphs, we do not want the merged graph to have cycles either. Therefore, we add the constraint that we only merge nodes that do not create cycles in the merged graph. This check is done using a depth-first search after each merge.
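A minimal sketch of steps 2 and 3 is shown below. It assumes the node similarities f have already been computed (e.g., after Similarity Flooding) and uses SciPy's linear_sum_assignment, a standard solver for the assignment problem addressed by the Hungarian algorithm; the infinite entries of E are replaced by a large finite constant so the solver can be applied directly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(f, add_cost=0.4, del_cost=0.4, big=1e6):
    """f: (m x n) array of node similarities between G1 (m nodes) and G2 (n nodes).
    Returns the pairs (i, j) whose substitution entries were selected, i.e. the
    matched nodes."""
    m, n = f.shape
    E = np.full((m + n, m + n), big)
    E[:m, :n] = 1.0 - f                        # substitution costs s_ij = 1 - f(i, j)
    E[:m, n:][np.diag_indices(m)] = add_cost   # a_i on the diagonal of the upper-right block
    E[m:, :n][np.diag_indices(n)] = del_cost   # d_j on the diagonal of the lower-left block
    E[m:, n:] = 0.0                            # zero padding block
    rows, cols = linear_sum_assignment(E)      # minimum-cost assignment
    return [(i, j) for i, j in zip(rows, cols) if i < m and j < n]

# Example: G1 has 2 nodes, G2 has 3 nodes.
f = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
print(match_nodes(f))   # -> [(0, 0), (1, 1)]
```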
Combined-Primitive Contribution

The primitive contribution presented in the previous section does not take into account primitive interactions. For example, it might be the case that, for a given problem, the classification algorithm SVM and the pre-processing PCA together produce good models, but they may lead to low-scoring pipelines when used independently. Because the contribution is estimated with the Point-Biserial Correlation of the binary primitive usage vector and the pipeline score, interactions involving multiple primitives are not considered.

To take all primitive interactions into consideration, it would be necessary to check the correlations of all the primitive groups in the powerset of our primitive space. This strategy has two critical problems: 1) it is not computationally tractable, and 2) it would result in a number of combinations prohibitively large for users to inspect. To tackle this challenge, we propose a new algorithm, Combined-Primitive Contribution (CPC), to identify groups of primitives strongly correlated with pipeline scores. The algorithm works as follows: for every combination of primitives S up to a predefined constant size, 1) create a new primitive indicator vector $\hat{p} = \prod_{i \in S} p_i$, which contains 1 if the set of primitives is used in the pipeline, and 0 otherwise; 2) compute the correlation of the primitive group with the pipeline scores using the Point-Biserial Correlation; 3) select which combinations of primitives to report to the user. We only report a primitive group if its Pearson correlation is greater than the Pearson correlation of all the elements in its powerset. Algorithm 1 describes CPC in detail.

Algorithm 1: Combined-Primitive Contribution
  Input: p_1, p_2, ..., p_n, the primitive indicator vectors;
         m, the evaluation metric score vector;
         K, the maximal cardinality of the primitive group
  I ← [1, 2, ..., n]
  contributions ← {}
  // Computing group correlations
  // 2^I_{≤K}: powerset of I up to cardinality K
  for S ∈ 2^I_{≤K} do
      p̂ ← ∏_{i∈S} p_i
      contributions[S] ← corr(p̂, m)
  // Selecting primitive groups to report (R)
  R ← []
  // 2^I_{≥2,≤K}: powerset of I of cardinality c, 2 ≤ c ≤ K
  for S ∈ 2^I_{≥2,≤K} do
      keep ← True
      // Check if there is a proper subset of S with greater contribution
      for sub ∈ 2^S_{≥1}, sub ≠ S do
          a ← contributions[S]
          b ← contributions[sub]
          if |b| ≥ |a| then
              keep ← False
              break
      if keep = True then
          R ← R ∪ {S}
  return R

The idea behind CPC is simple. The algorithm checks the correlation between combinations of primitives and the pipeline scores, and reports surprising combinations to the user (correlations not shown in the Primitive Contribution View). The user defines K (in our tests, we found that K = 3 is effective; larger values did not return more patterns in the data). If there are n primitives, the algorithm evaluates $\sum_{k=1}^{K} \binom{n}{k}$ groups of primitives and has a time complexity of $O(n^K)$. In PipelineProfiler, CPC can be run via the "Combinatorial Analysis" menu (Fig. 1(A)). When the algorithm is run, we show a table containing the selected groups of primitives and the correlation values. Fig. 3 shows an example of a CPC run over pipelines derived to perform a classification task using the Diabetes dataset [19]. When a primitive group is clicked, the corresponding columns are highlighted in the Pipeline Matrix.

Fig. 3. Combined-Primitive Contribution applied to pipelines that solve a classification problem on the Diabetes [19] dataset: A) The Pipeline Matrix representation of the pipeline collection. B) The Combinatorial Analysis View, showing a group of two primitives, Min Max Scaler and RBF Sampler, that correlate with higher F1 scores. The two primitives are highlighted in (A). Notice that pipelines with higher scores use both primitives (#1, #2); pipelines that use them separately have lower scores (#3 - #9).
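A minimal Python sketch of this procedure is given below. It assumes `primitives` maps each primitive name to its 0/1 indicator vector across pipelines and `scores` holds the pipeline metric values; the function name and data layout are illustrative, not the library's internal API.

```python
from itertools import combinations
import numpy as np
from scipy.stats import pointbiserialr

def cpc(primitives, scores, K=3):
    names = list(primitives)
    contributions = {}
    # Correlation of every primitive group of cardinality 1..K with the scores.
    for k in range(1, K + 1):
        for group in combinations(names, k):
            indicator = np.prod([primitives[p] for p in group], axis=0)
            if 0 < indicator.sum() < len(scores):   # defined only if both classes occur
                contributions[group] = pointbiserialr(indicator, scores).correlation
    # Report a group only if it beats every proper subgroup in absolute correlation.
    reported = []
    for group, corr in contributions.items():
        if len(group) < 2:
            continue
        subsets = [s for k in range(1, len(group)) for s in combinations(group, k)]
        if all(abs(corr) > abs(contributions.get(s, 0.0)) for s in subsets):
            reported.append((group, corr))
    return reported
```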
Implementation details

PipelineProfiler is implemented as a Python 3 library. The front-end is implemented in JavaScript with React [21], D3 [5] and Dagre [11]. The back-end, responsible for data management, graph merging and the Jupyter Notebook hooks, is implemented in Python with NumPy [61] and NetworkX [29].

The PipelineProfiler library takes as input a Python array of pipelines in the D3M JSON [41, 55] format, and plots the visualization in Jupyter using widget hooks. The library also enables users to import pipelines from Auto-Sklearn [23]. We implemented a bi-directional communication between Jupyter Notebook and our tool. From Jupyter, the user can create an instance of PipelineProfiler for their dataset of choice. The main menu (Fig. 1(B)) of PipelineProfiler, on the other hand, enables users to subset the data (remove pipelines from the analysis), reorder pipelines according to different metrics, and export the selected pipelines back to Python. The goal of our design is to provide a seamless integration with the existing AutoML ecosystem and Python libraries, and to make it easier for experts to explore, subset, combine and compare results from multiple AutoML systems.

PipelineProfiler is already being used in production by the DARPA D3M project members. An open-source release is available at https://github.com/VIDA-NYU/PipelineVis.
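As a usage illustration, the typical notebook workflow is a single call on a list of pipeline documents. The import name and the plot_pipeline_matrix entry point below are assumptions based on the open-source release; check the installed version if the names differ.

```python
import json
import PipelineProfiler   # import name assumed from the open-source release

# `pipelines` is a Python list of D3M pipeline documents (dicts), e.g. loaded
# from the JSON logs produced by an AutoML run.
with open("pipelines.json") as f:
    pipelines = json.load(f)

# Renders the interactive Pipeline Matrix widget in the notebook.
PipelineProfiler.plot_pipeline_matrix(pipelines)
```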
4 EVALUATION

To demonstrate the usefulness of PipelineProfiler, we present case studies that use a collection containing 10,131 pipelines created as part of the D3M program's Summer 2019 Evaluation. In this evaluation, 20 AutoML systems were run to solve various ML tasks (classification, regression, forecasting, graph matching, link prediction, object detection, etc.) over 40 datasets, which covered multiple data types (tabular, image, audio, time-series and graph). Each AutoML system was executed for one hour and derived zero or more pipelines for each dataset.

4.1 Case Study 1: Improving an AutoML System

To showcase how PipelineProfiler supports the effective exploration of AutoML-derived pipelines in a real-world scenario, we describe how an AlphaD3M developer used the system, the insights he obtained, and how these insights helped him improve AlphaD3M. AlphaD3M is an AutoML system based on reinforcement learning that uses a grammar (a set of primitive patterns) to reduce the search space of pipelines [17, 18].

The AlphaD3M developer started his exploration using a problem for which AlphaD3M had poor performance: a multi-class classification task using the libras move dataset¹ from the OpenML database [60]. For this task, in the ranking of all pipelines produced by D3M AutoML systems, the best pipeline produced by AlphaD3M was ranked 18th, with an accuracy score of 0.79.

Comparing pipeline patterns. The developer sought to identify common patterns in the best pipelines that were overlooked by the AlphaD3M search. To this end, he first sorted the primitives by type and the pipelines by performance. This uncovered useful patterns. As Fig. 4A shows, primitives for feature selection were frequently present in the best pipelines, while lower-scoring pipelines did not use these primitives. Although he identified other patterns, the information provided by the primitive contribution bar charts indicated that feature selection primitives had a large impact on the scores of the best pipelines. This information led the developer to hypothesize that the usage of feature selection primitives might be necessary for pipelines to perform well for this problem and data combination.

Exploring execution times. The developer then analyzed the pipelines produced only by AlphaD3M. Fig. 4B clearly shows that pipelines containing one-hot encoding primitives take a substantially longer time to execute, approximately 10 seconds; this is in contrast to pipelines that do not use this primitive and take less than 1 second. He also saw in the primitive contribution bar charts that the one-hot encoding primitive has the highest impact on the running time. He realized that for this specific dataset, one-hot encoding primitives were used inefficiently, since all the features of the dataset were numeric. Since an AutoML system needs to evaluate a potentially large number of pipelines during its search, an order-of-magnitude difference in execution time, such as what was observed here, will greatly limit its ability to find good pipelines given a limited time budget; for the summer evaluation, this budget was 1 hour.

¹ OpenML dataset, https://www.openml.org/d/299
Fig. 4. Pipeline Matrix. A) Pipelines are sorted by performance, with the 10 best pipelines at the top and the 10 worst pipelines at the bottom. Green and red boxes show the presence and absence, respectively, of feature selection primitives in the pipelines, together with their performances. B) Pipelines are sorted by execution time; only pipelines generated by AlphaD3M are displayed. Green and red boxes show the absence and presence, respectively, of the one-hot encoder primitive in the pipelines, together with their execution times.
| Dataset | Dataset Type | Task Type | Metric | Mean Score | Score Range | # Pipelines | # Primitives | Participant |
| Auto MPG [19] | Tabular | Regression | Mean Squared Error | 30.18 ± 82.39 | [4.71, 595.24] | 103 | 71 | D1 |
| Word Levels [28] | Tabular | Classification | F1 Macro | 0.24 ± 0.07 | [0.04, 0.33] | 115 | 69 | D2 |
| Sunspots [51] | Time Series | Forecasting | Root Mean Squared Error | 27.82 ± 23.64 | [8.08, 121.27] | 137 | 71 | D3 |
| Popular Kids [59] | Tabular | Classification | F1 Macro | 0.39 ± 0.06 | [0.21, 0.48] | 120 | 64 | D4 |
| Chlorine Concentration [9] | Time Series | Classification | F1 Macro | 0.39 ± 0.23 | [0.00, 0.78] | 47 | 48 | E1 |
| GPS Trajectories [19] | Tabular | Classification | F1 Macro | 0.48 ± 0.15 | [0.00, 0.68] | 163 | 91 | E2 |
using decision trees. The Primitive Contribution view confirmed his hypothesis that the use of gradient boosting was indeed correlated with high scores. He said that he could use this insight to drop and replace primitives in his AutoML search space. D1, D2 and D3 had similar findings in their pipeline investigations. Evaluators compared primitive usage for a different reason: they wanted to make sure AutoML systems were exploring the search space and the primitives available to them. For example, E1 noticed that the best AutoML system used a single classifier type in its pipelines, as opposed to other systems that had more diverse solutions. E2 did a similar analysis on his dataset.

Hyperparameter search strategy. D1 noticed that the top-five pipelines belonged to the same AutoML system and were nearly identical. She explored the hyperparameters of these pipelines using the one-hot-encoded hyperparameter view, and found that although they had the same graph structure, they were using different hyperparameters for the Gradient Boosting primitive. She compared this strategy with another system which did not tune many hyperparameters, and concluded that tuning parameters was beneficial for this problem.

Primitive correctness. Participants also used PipelineProfiler to check if primitives were being used correctly. A common finding was the unnecessary use of primitives. For example, D2 found that pipelines containing Cast to Type resulted in lower F1 scores. He inspected the hyperparameters of this primitive and noted that string features were being converted to float values (hashes). He concluded that string hashes were bad features for this classification problem, and that the Cast to Type primitive should be removed from those pipelines. Similar findings were obtained with One Hot Encoder used on datasets with no categorical features (D3, E1, E2), and Imputer used on datasets with no missing data (D4, E1). E2 also found incorrect hyperparameter settings, such as the use of "warm-start=true" in a Random Forest primitive.

Execution time investigation. D4 checked the running times of the pipelines. In particular, he was interested in verifying whether the best pipelines took longer to run. First, he sorted the pipelines by score. Then, he switched the displayed metric to "Time" and noticed that, contrary to his original hypothesis, the best pipelines were also the fastest. He looked at the Primitive Contribution View in order to find which primitives were responsible for the longer running times, and identified that the General Relational Dataset primitive was most likely the culprit. He concluded that if he removed this primitive, he would get a faster pipeline.

Expert Feedback

We received very positive feedback from the participants. They expressed interest in using PipelineProfiler for their work and suggested new features to improve the system. After the think-aloud experiment, they were asked if they had any additional comments or suggestions. Here are some of their answers:

• D1 mentioned that PipelineProfiler is better than her current tools: "I think this is very useful, we are always trying to improve our pipelines. The pipeline scores can give you some scope, but this is doing it more comprehensively".

• D2 liked the debugging capabilities of PipelineProfiler: "Actually, with this tool we can infer what search strategies the AutoML is using. This tool is really nice to do reverse engineering".

• D3 particularly liked the integration with Jupyter Notebook: "I really liked this tool! It is very informative and easy to use. It works as a standalone tool without any coding, but I can make more specific/advanced queries with just a little bit of code."

• D4 wants to integrate PipelineProfiler into his development workflow: "The tool is great, and as I mentioned earlier, it would be even better if an API is provided to ingest the data automatically from our AutoML systems". E1 and E2 were also interested in integrating the tool with their sequestered datasets, which are used for evaluation but not shared with the developers.

4.4 Usability

We evaluated the usability of PipelineProfiler using the System Usability Scale (SUS) [6], a valuable and robust tool for assessing the quality of system interfaces [4]. In order to compute the SUS, we conducted a survey at the end of the second interview: we asked participants to fill out the standard SUS survey, grading each of the 10 statements on a scale from 1 (strongly disagree) to 5 (strongly agree). The SUS grades systems on a scale between 0 and 100, and our system obtained an average score of 82.92 ± 12.37. According to Bangor et al. [4], a mean SUS score above 80 is in the fourth quartile and is acceptable.
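For reference, the standard SUS scoring used here works as follows: odd-numbered items contribute their answer minus 1, even-numbered items contribute 5 minus their answer, and the total is multiplied by 2.5 to map it onto the 0-100 range. The sketch below illustrates this; the answer vector is made up for illustration.

```python
def sus_score(answers):
    """Standard SUS scoring for a list of 10 answers on a 1-5 scale."""
    assert len(answers) == 10
    total = sum((a - 1) if i % 2 == 0 else (5 - a)   # items 1,3,5,... are positively worded
                for i, a in enumerate(answers))
    return total * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 5, 2]))     # -> 92.5
```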
5 CONCLUSIONS AND FUTURE WORK

We presented PipelineProfiler, a new tool for the exploration of pipeline collections derived by AutoML systems. PipelineProfiler advances the state of the art in visual analytics for AutoML in two significant directions: it enables the analysis of pipelines that have complex structure and use a multitude of primitives, and it supports the comparison of multiple AutoML systems. Users can perform a wide range of analyses which can help them answer common questions that arise when they are debugging or evaluating AutoML systems. Because these analyses are scripted, they can be reproduced and re-used.

Limitations. 1) Regarding the visualization of hyperparameters, our tool is well suited for categorical data, but numerical parameters suffer from the one-hot-encoded representation. This issue may be addressed by using linked views that show the distribution of numerical hyperparameters, such as in ATMSeer [63]. 2) When the pipelines have different structures, it may be hard to see individual graphs in the Pipeline Comparison View. Currently, users need to select individual pipelines to explore them in more detail. As future work, we will investigate interactions that facilitate this process, such as highlighting individual graphs and groups of nodes/edges. 3) The Combined-Primitive Contribution currently only displays primitive names and their combined effects on the model. More complex analyses can be performed by mining patterns in the data (e.g., [3, 31, 58]).

Future work. There are many avenues for future work. To increase the adoption of our tool beyond the D3M ecosystem, we plan to add support for other pipeline schemata adopted by widely used AutoML systems. On the research front, we would like to explore how to extend the system to support end-users of AutoML systems, who have limited knowledge of ML but need to explore and deploy their models. In particular, we would like to capture the knowledge derived by users of PipelineProfiler and use it to steer the search performed by the AutoML system, which, in turn, can lead to the generation of more efficient pipelines in a shorter time. For example, if the user finds that a group of primitives works well together, they should be able to indicate this to the AutoML system, so that it can focus the search on pipelines that use these primitives.

ACKNOWLEDGEMENTS

This work was partially supported by the DARPA D3M program and NSF awards CNS-1229185, CCF-1533564, CNS-1544753, CNS-1730396, and CNS-1828576. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and DARPA.
REFERENCES
[1] E. Alexander, J. Kohlmann, R. Valenza, M. Witmore, and M. Gleicher. Serendip: Topic model-driven visual exploration of text corpora. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 173–182. IEEE, 2014.
[2] S. Amershi, M. Chickering, S. M. Drucker, B. Lee, P. Simard, and J. Suh. ModelTracker: Redesigning performance analysis tools for machine learning. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), pp. 337–346, 2015.
[3] P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri. MacroBase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 541–556, 2017.
[4] A. Bangor, P. T. Kortum, and J. T. Miller. An empirical evaluation of the system usability scale. International Journal of Human–Computer Interaction, 24(6):574–594, 2008.
[5] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011.
[6] J. Brooke et al. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
[7] D. Cashman, S. R. Humayoun, F. Heimerl, K. Park, S. Das, J. Thompson, B. Saket, A. Mosca, J. T. Stasko, A. Endert, M. Gleicher, and R. Chang. Visual analytics for automated model discovery. CoRR, 2018.
[8] D. Cashman, A. Perer, R. Chang, and H. Strobelt. Ablate, Variate, and Contemplate: Visual Analytics for Discovering Neural Architectures. arXiv:1908.00387 [cs], July 2019.
[9] Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series classification archive, 2015.
[10] J. Chuang, C. D. Manning, and J. Heer. Termite: Visualization techniques for assessing textual topic models. In Proceedings of the Working Conference on Advanced Visual Interfaces, pp. 74–77, 2012.
[11] B. Cobarrubia. Dagre: Graph layout for JavaScript. https://github.com/dagrejs/dagre/wiki, 2018.
[12] D3M. Data-Driven Discovery of Models (D3M) program index. https://github.com/darpa-i2o/d3m-program-index, 2020.
[13] D3M. datadrivendiscovery / metalearning. https://gitlab.com/datadrivendiscovery/metalearning, 2020.
[14] D3M. Index of open source D3M primitives. https://gitlab.com/datadrivendiscovery/primitives, 2020.
[15] S. Das, D. Cashman, R. Chang, and A. Endert. BEAMES: Interactive Multimodel Steering, Selection, and Inspection for Regression Tasks. IEEE Computer Graphics and Applications, 39(5):20–32, 2019.
[16] D. Dingen, M. V. Veer, P. Houthuizen, E. H. J. Mestrom, E. H. H. M. Korsten, A. R. A. Bouwman, and J. V. Wijk. RegressionExplorer: Interactive Exploration of Logistic Regression Models with Subgroup Analysis. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2018.
[17] I. Drori, Y. Krishnamurthy, R. Rampin, R. d. P. Lourenco, J. P. Ono, K. Cho, C. Silva, and J. Freire. AlphaD3M: Machine Learning Pipeline Synthesis. In Proceedings of the ICML Workshop on Automatic Machine Learning, p. 8, 2018.
[18] I. Drori, Y. Krishnamurthy, R. Rampin, R. d. P. Lourenco, J. P. Ono, K. Cho, C. Silva, and J. Freire. Automatic machine learning by pipeline synthesis using model-based reinforcement learning and a grammar. In Proceedings of the ICML Workshop on Automatic Machine Learning, 2019.
[19] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
[20] J. Elliott. DARPA Data-Driven Discovery of Models (D3M) program. https://www.darpa.mil/program/data-driven-discovery-of-models, 2020.
[21] A. Fedosejev. React.js Essentials. Packt Publishing Ltd, 2015.
[22] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pp. 2962–2970, 2015.
[23] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Auto-sklearn: Efficient and Robust Automated Machine
[24] … model engineering. Computers & Graphics, 77:30–49, 2018.
[25] Y. Gil, J. Honaker, S. Gupta, Y. Ma, V. D'Orazio, D. Garijo, S. Gadewar, Q. Yang, and N. Jahanshad. Towards human-guided machine learning. In Proceedings of the Conference on Intelligent User Interfaces (IUI), pp. 614–624. ACM, 2019.
[26] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1487–1495. Association for Computing Machinery, 2017.
[27] Google. Cloud AutoML - Custom Machine Learning Models | AutoML. https://cloud.google.com/automl/, 2020.
[28] O. Guzey, G. Sohsah, and M. Unal. Classification of word levels with usage frequency, expert opinions and machine learning, 2014.
[29] A. Hagberg, P. Swart, and D. S. Chult. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008.
[30] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 25(8):2674–2693, 2018.
[31] F. Hutter, H. Hoos, and K. Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pp. 754–762, 2014.
[32] F. Hutter, L. Kotthoff, and J. Vanschoren. Automated Machine Learning: Methods, Systems, Challenges. Springer, Cham, Switzerland, 2019. OCLC: 1127437059.
[33] IBM. Watson Studio - AutoAI, Jan. 2020.
[34] Y. Kim, T.-H. Kim, and T. Ergün. The instability of the Pearson correlation coefficient in the presence of coincidental outliers. Finance Research Letters, 13:243–257, 2015.
[35] D. Koop, J. Freire, and C. T. Silva. Visual summaries for graph collections. In 2013 IEEE Pacific Visualization Symposium (PacificVis), pp. 57–64. IEEE, Sydney, Australia, Feb. 2013. doi: 10.1109/PacificVis.2013.6596128
[36] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[37] K. Lang. NewsWeeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pp. 331–339. Elsevier, 1995.
[38] S. Liu, X. Wang, M. Liu, and J. Zhu. Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics, 1(1):48–56, 2017.
[39] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the International Conference on Data Engineering, pp. 117–128. IEEE, 2002.
[40] M. Milutinovic. Towards Automatic Machine Learning Pipeline Design. PhD thesis, EECS Department, University of California, Berkeley, Aug. 2019.
[41] M. Milutinovic, B. Schoenfeld, D. Martinez-Garcia, S. Ray, S. Shah, and D. Yan. On evaluation of AutoML systems. In Proceedings of the ICML Workshop on Automatic Machine Learning, 2020.
[42] MLJAR. Machine Learning for Humans. https://mljar.com/, 2020.
[43] T. Mühlbacher, L. Linhardt, T. Möller, and H. Piringer. TreePOD: Sensitivity-Aware Selection of Pareto-Optimal Decision Trees. IEEE Transactions on Visualization and Computer Graphics, 24(1):174–183, 2018.
[44] R. S. Olson and J. H. Moore. TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning. In F. Hutter, L. Kotthoff, and J. Vanschoren, eds., Automated Machine Learning: Methods, Systems, Challenges, pp. 151–160. Springer International Publishing, Cham, 2019.
[45] H. Park, J. Kim, M. Kim, J.-H. Kim, J. Choo, J.-W. Ha, and N. Sung. VisualHyperTuner: Visual Analytics for User-Driven Hyperparameter Tuning of Deep Neural Networks. Proceedings of the 2nd SysML Conference, p. 2, 2019.
[46] D. Ren, S. Amershi, B. Lee, J. Suh, and J. D. Williams. Squares: Supporting Interactive Performance Analysis for Multiclass Classifiers. IEEE Transactions on Visualization and Computer Graphics, 23(1):61–70, Jan. 2017. doi: 10.1109/TVCG.2016.2598828
[47] K. Riesen and H. Bunke. Approximate graph edit distance computation
Learning. In F. Hutter, L. Kotthoff, and J. Vanschoren, eds., Automated by means of bipartite graph matching. Image and Vision computing,
Machine Learning: Methods, Systems, Challenges, pp. 113–134. Springer 27(7):950–959, 2009.
International Publishing, Cham, 2019. [48] A. Santos, S. Castelo, C. Felix, J. P. Ono, B. Yu, S. R. Hong, C. T. Silva,
[24] R. Garcia, A. C. Telea, B. C. da Silva, J. Tørresen, and J. L. D. Comba. A E. Bertini, and J. Freire. Visus: An Interactive System for Automatic
task-and-technique centered survey on visual analytics for deep learning Machine Learning Model Building and Curation. In Proceedings of
the Workshop on Human-In-the-Loop Data Analytics (HILDA), pp. 1–
7. Association for Computing Machinery, 2019.
[49] Z. Shang, E. Zgraggen, B. Buratti, F. Kossmann, P. Eichmann, Y. Chung, C. Binnig, E. Upfal, and T. Kraska. Democratizing data science through interactive curation of ML pipelines. In Proceedings of the Conference on Management of Data (SIGMOD), pp. 1171–1188. Association for Computing Machinery, 2019.
[50] D. J. Sheskin. Handbook of parametric and nonparametric statistical procedures. CRC Press, 2003.
[51] SILSO World Data Center. The international sunspot number. Interna-
tional Sunspot Number Monthly Bulletin and online catalogue, 2019.
[52] T. Spinner, U. Schlegel, H. Schäfer, and M. El-Assady. explAIner: A visual analytics framework for interactive and explainable machine learning. IEEE Transactions on Visualization and Computer Graphics, 26(1):1064–1074, 2019.
[53] T. Swearingen, W. Drevo, B. Cyphers, A. Cuesta-Infante, A. Ross, and K. Veeramachaneni. ATM: A distributed, collaborative, scalable system for automated machine learning. In 2017 IEEE International Conference on Big Data (Big Data), pp. 151–162, Dec. 2017. doi: 10.1109/BigData.2017.8257923
[54] J. Talbot, B. Lee, A. Kapoor, and D. S. Tan. EnsembleMatrix: interactive
visualization to support machine learning with multiple classifiers. In
Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pp. 1283–1292. ACM, 2009.
[55] D. M. Team. D3M metalearning database. https://metalearning.datadrivendiscovery.org/, 2020.
[56] A. Telea. Combining extended table lens and treemap techniques for visu-
alizing tabular data. In Proceedings of the Eighth Joint Eurographics/IEEE
VGTC conference on Visualization, pp. 51–58, 2006.
[57] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847–855, 2013.
[58] J. N. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2367–2376, 2018.
[59] J. Vanschoren. OpenML - PopularKids, 2014.
[60] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198
[61] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[62] D. Wang, J. D. Weisz, M. Muller, P. Ram, W. Geyer, C. Dugan, Y. Tausczik, H. Samulowitz, and A. Gray. Human-AI collaboration in data science: Exploring data scientists’ perceptions of automated AI. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–24, 2019.
[63] Q. Wang, Y. Ming, Z. Jin, Q. Shen, D. Liu, M. J. Smith, K. Veeramachaneni, and H. Qu. ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning. arXiv:1902.05009 [cs, stat], Feb. 2019. doi: 10.1145/3290605.3300911
[64] Q. Wang, J. Yuan, S. Chen, H. Su, H. Qu, and S. Liu. Visual Genealogy of
Deep Neural Networks. IEEE Transactions on Visualization and Computer
Graphics, p. 12, 2019.
[65] D. K. I. Weidele. Conditional Parallel Coordinates. In 2019 IEEE Visualization Conference (VIS), pp. 221–225, Oct. 2019. doi: 10.1109/VISUAL.2019.8933632
[66] D. K. I. Weidele, J. D. Weisz, E. Oduor, M. Muller, J. Andres, A. Gray, and D. Wang. AutoAIViz: Opening the Blackbox of Automated Artificial Intelligence with Conditional Parallel Coordinates. arXiv:1912.06723 [cs, stat], Jan. 2020. doi: 10.1145/3377325.3377538