

Dead or Alive: Continuous Data Profiling for Interactive Data Science


Will Epperson, Vaishnavi Gorantla, Dominik Moritz, Adam Perer
All authors are with Carnegie Mellon University.

Manuscript received 31 March 2023; revised 1 July 2023; accepted 8 August 2023. Date of publication 30 October 2023; date of current version 21 December 2023. Digital Object Identifier no. 10.1109/TVCG.2023.3327367.

© 2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

Fig. 1: In AutoProfiler, data profiles update whenever the data in memory updates and are sorted with the last updated dataframes at
the top. In this example, a user has (1) loaded a dataframe about housing prices and sees the profile for df in the sidebar. (2) The user
then investigates the price column and exports a chart to code so they can persist this chart and tweak the code for follow-up analysis.

Abstract— Profiling data by plotting distributions and analyzing summary statistics is a critical step throughout data analysis. Currently,
this process is manual and tedious since analysts must write extra code to examine their data after every transformation. This
inefficiency may lead to data scientists profiling their data infrequently, rather than after each transformation, making it easy for them to
miss important errors or insights. We propose continuous data profiling as a process that allows analysts to immediately see interactive
visual summaries of their data throughout their data analysis to facilitate fast and thorough analysis. Our system, AutoProfiler, presents
three ways to support continuous data profiling: (1) it automatically displays data distributions and summary statistics to facilitate data
comprehension; (2) it is live, so visualizations are always accessible and update automatically as the data updates; (3) it supports follow-up analysis and documentation by authoring code for the user in the notebook. In a user study with 16 participants, we evaluate two
versions of our system that integrate different levels of automation: both automatically show data profiles and facilitate code authoring,
however, one version updates reactively (“live”) and the other updates only on demand (“dead”). We find that both tools, dead or alive,
facilitate insight discovery with 91% of user-generated insights originating from the tools rather than manual profiling code written by
users. Participants found live updates intuitive and felt it helped them verify their transformations while those with on-demand profiles
liked the ability to look at past visualizations. We also present a longitudinal case study on how AutoProfiler helped domain scientists
find serendipitous insights about their data through automatic, live data profiles. Our results have implications for the design of future
tools that offer automated data analysis support.
Index Terms—Data Profiling, Data Quality, Exploratory Data Analysis, Interactive Data Science

1 Introduction
In recent decades, data analysis is no longer bottlenecked by the technical feasibility of executing queries against large datasets, but by the difficulty in choosing where to look for interesting insights [5]. Interactive programming environments such as Jupyter notebooks help since they support fast, flexible, and iterative feedback when programming with data [2, 33]. However, while these coding tools were designed to track the state of program execution and variables for debugging, they were not inherently designed to track how data is manipulated and transformed. This forces users to manually make sense of, and write additional code to explore, their data.

Exploratory Data Analysis (EDA) is critical to understanding a dataset and its limitations and is a common task at the beginning of a data analysis [47, 49].


Yet the manual effort required to construct data profiles for EDA takes up a significant part of data analysts' time: recent surveys of data scientists show that they spend almost 50% of their time just cleaning and visualizing their data [3]. Since data profiling is so time intensive, it is easy for users to skip over important trends or errors in their data. This can lead to negative downstream consequences when this data is used for modeling and decision-making [41]. In particular, many data quality issues are potentially silent: models will still train or queries will execute, but the results will be incorrect [16]. For example, in the data profile of apartment prices in Figure 1 we can see that some apartment prices have negative values. If these values are not addressed, analyses or models that use this data may lead to wrong decisions.

We propose continuous data profiling as a process that allows analysts to immediately see interactive visual summaries of their data throughout their data analysis to facilitate fast and thorough analysis. To explore how automated tools can best support continuous data profiling, we have built a computational notebook extension, AutoProfiler, that tightly integrates data profiling information into the analysis loop. AutoProfiler maintains the advantages of the interactive notebook programming paradigm, while giving users immediate feedback on how their code affects their data. This tightens the feedback loop between manipulating data and understanding it during data programming.

We explore three main features in AutoProfiler. First, it automatically displays profiling information about each dataframe and column to facilitate data understanding. By showing data distributions and summaries, AutoProfiler jump-starts a user's EDA. Second, when the data in memory updates, the profiling information updates accordingly. "Live" updates in user interfaces have been shown to reduce iteration time [27]; with AutoProfiler we apply this concept to data profiling to understand how it helps facilitate data understanding. Third, although AutoProfiler eliminates the repetitive work of authoring data profiling code, users still need to be able to conduct flexible follow-up analysis and persist interesting findings in their notebook [40]. AutoProfiler supports this by authoring code for the user through code exports to help users quickly select subsets, find outliers, or author charts.

We present two complementary evaluations of AutoProfiler. In a user study with 16 participants, we evaluate two levels of automated assistance to see how different versions of the tool help users find errors and insights in their data. Half of the participants used AutoProfiler (a "live" profiler) and the other half used a version that presents the same information but in a static, inline version (which we denote as "dead"). In this evaluation, we found that users experience similar benefits from both versions of the tool, "dead" or "live", and generate 91% of findings from the tools as opposed to their own code. Participants found live updates intuitive and felt they helped them verify their transformations, while those with static profiles liked the ability to look at past visualizations. Furthermore, participants described how the systems sped up their analysis and how exports facilitated a more fluid analysis. In our second evaluation, we conducted a long-term deployment of AutoProfiler with domain scientists who used the system during their analysis. These users described how the "live" system enabled them to find and follow up on interesting trends and how AutoProfiler facilitated serendipitous discoveries in their data by plotting things they might not have checked otherwise. We discuss how future automated assistants can build on AutoProfiler to augment data programming environments. In summary, our paper makes the following contributions:

1. We demonstrate the benefits of continuous data profiling with AutoProfiler, which supports data programming with automatic, live profiles and code exports.
2. We evaluate this tool in a controlled study and demonstrate how continuous profiling helps analysts discover insights in their data and supports their workflow.
3. We also present a longitudinal case study demonstrating how AutoProfiler leads to insights and discoveries during daily analysis workflows for scientists.

2 Related Work
Our work builds on prior literature on assisted data understanding, live interfaces, and linking GUI and code interfaces.

2.1 Data understanding is critical yet cumbersome
Understanding data and its limitations has long been an important, but often overlooked, part of analysis. Tukey was an early advocate for plotting distributions and summary statistics to get to know your data before confirmatory analysis (hypothesis testing) began [47]. Current best practices taught in introductory statistics courses still emphasize the importance of starting analysis with summaries of individual columns, such as distributions and descriptive statistics, before moving on to plot combinations of columns or investigating correlations [42]. Recent research has highlighted how, with the increasing emphasis on developing AI models, people often undervalue data quality, leading to negative downstream effects [41]. Multiple surveys of production data scientists describe the difficulty and time spent on data understanding, profiling, and wrangling [3, 19, 24]. For example, a recent Anaconda foundation survey described that data scientists self-reported spending almost 50% of their time on data cleaning and visualization [3].

Data understanding is difficult because of a variety of factors: data updates quickly in production environments, so automated methods and alerts have a high number of false positives [43]; current popular tools require manual data exploration and become messy [33]; and as datasets have grown, there are a large number of issues to check for. Prior systems in the visualization community have addressed parts of this space, such as comparing data over time as models are trained on subsequent data versions [18] or methods for cleaning up notebooks during analysis [14]. However, more work is needed to understand how tools can facilitate discovering data and potential quality issues before they propagate to downstream models or analyses.

2.2 Prior assisted and integrated EDA tools
Prior visualization systems aim to automate the visual presentation of data to speed up data understanding. In general, this automation helps alleviate the burden of specifying charts so that users can focus more on insights rather than on how to produce a specific chart [15]. Some systems automate visual presentation and then rank charts according to metrics of interest such as high correlation [8], charts that satisfy a particular pattern in the data [45], or charts that contain attributes of interest [50]. Closely related to our work is the Profiler system, which checks data for common quality issues such as missing data or outliers and presents potentially interesting charts to the user [20].

However, many of these systems exist as standalone tools, making them difficult to integrate into flexible data analysis workflows in programming environments like Jupyter notebooks [2]. Other systems have explored how to integrate visualization recommendations into the notebook programming context through visualization callbacks, libraries, embedded widgets, and similar notebook search [26, 34]. Lux [25] and other open source tools [1, 6, 30, 31, 39] show EDA information on demand for individual Pandas dataframes. While Lux uses "always on" visualization recommendations to overwrite the default table view for Pandas dataframes, users must still ask for visualizations by calling a dataframe explicitly. Diff in the Loop [48] presents a paradigm for automatically visualizing the differences between dataframes after each step in an analysis. Although these prior systems use automatic visualization, they still require the user to manually ask for this information after each data update, and they often present an abundance of information that can be difficult to compute in reactive times and for users to parse quickly. With AutoProfiler, we explore the benefits and design constraints around coupling automatic visualization with live updates and code authoring on the user's behalf.

2.3 Liveness in user interfaces
Fast iteration on data and models is a key element to effective data science [11, 43]. The fast, incremental feedback that users receive in Jupyter notebooks is part of the popularity of the platform [10, 33], yet the default presentation of data feedback in Jupyter is limited to a handful of rows. "Liveness" in user interfaces reduces iteration time through reactive updates [27], such as in spreadsheets [17]. Prior studies of liveness in data science tools have compared live interfaces to REPL (read-eval-print-loop) interfaces like Jupyter and found users like the responsiveness and clean coding that live interfaces afford [7].
Inspired by the affordances of live, reactive updates, AutoProfiler evaluates how automatically updating data profiles after a user changes their data can help reduce iteration time during analysis. When using AutoProfiler in Jupyter, users must still explicitly execute their code to manipulate the data, so it is not a completely "live" environment. However, data profiles reactively update when data changes.

2.4 Linking code and GUI interactions
There is a tradeoff between tools that support using code to interact with data and those that support direct manipulation. Programming languages are flexible and expressive, yet GUIs are responsive and easy to use [2]. Prior systems in the notebook setting have bridged this gap by writing interactions with a chart [51] or widget [22] back to the notebook automatically. This allows users to reuse analysis code and preserves the steps of their analysis. Selection exports in AutoProfiler serve a similar purpose of facilitating drill down into rows of interest in a dataset. Our code authoring approach differs from prior systems since we only write code to the notebook explicitly when the user asks, rather than implicitly after every interaction, to avoid polluting the user's notebook.

Beyond their flexibility, programming languages remain popular for data science because they allow users to reuse old analysis code for new purposes [21], or to use analysis "templates" that help users go through the same steps of analysis for similar tasks [10]. AutoProfiler's template exports serve a similar purpose to author code in the notebook and support follow-up analysis for tasks like customizing a plot, doing outlier analysis, or investigating duplicates.

3 Design Goals
We developed the following design principles to inform our system requirements and design:

G1: Automatic & Predictable: Basic data profiling information should be visualized automatically, without any need for extra code, in a consistent manner.
G2: Live: When the data updates, so should all visualizations of it. This prevents "stale" data visualizations in a notebook and allows data profiles to be accessible throughout an analysis.
G3: Non-intrusive: Since users are writing code to interact with their data, automatic visualization should not interfere with their flow.
G4: Initiate EDA: Data profiles should present a starting point for understanding each column, which can inform follow-up analysis.
G5: Persistence: Tools should support writing findings to the notebook to enable reproducible and shareable analysis.

G1 and G2 were motivated by manual EDA, which is the current status quo in notebook programming. We build on prior techniques in live interfaces [27] and automatic visualization [15, 25] to speed up the data profiling process and enable continuous data profiling. This eliminates the need to write repetitive profiling code to understand dataframes after each update. Importantly, we show the same profiling information for each type of column and visualize the data "as is" in order to facilitate finding issues (G1). With live updates, we situate our profiler alongside the programming environment rather than inline (G3) so that it does not take programmers out of their analysis flow [12]. This also helps declutter the programming environment since most preliminary visualization can be done in the sidebar. We make the design choice to show univariate profiling information to help users jump-start their EDA process (G4). Previous profiling systems often require scrolling to look through multiple pages of charts [25, 30], making it hard to find interesting problems or insights. Our goal is to facilitate rapid data understanding with data profiles, then allow users to do further custom analysis by handing their analysis back to code through exports. Code exports also facilitate saving findings such as charts or code snippets to the notebook so that notebooks can be shared and reproduced (G5), a core goal in notebook data analysis [40].

4 Continuous Data Profiling with AutoProfiler
AutoProfiler provides data analysts rapid feedback on how their code affects their data to speed up insight generation. The system fits into a common existing workflow for analysis: using Pandas in Jupyter. Pandas is the most popular data manipulation library in Python, with millions of downloads every week [29]. Likewise, computational notebooks in Jupyter have become the tool of choice for data science in Python [33]. AutoProfiler focuses on Pandas users in Jupyter, with the goal that features that support this workflow will generalize to other dataframe libraries such as Polars [36] or Arrow [4], as well as other notebook programming environments. The AutoProfiler system has three core features that enable continuous data profiling: automatic visualization (§ 4.1), live updates (§ 4.2), and code exports (§ 4.3).

4.1 AutoProfiler shows EDA automatically
AutoProfiler detects all Pandas dataframes in memory and presents them in the sidebar of the notebook. Each dataframe profile can be shown or hidden, along with more information about each column. This allows users to drill down into dataframes and columns of interest to see more information, providing details on demand. By situating AutoProfiler in the sidebar, it also allows users to simultaneously look at both the summary data profiles of their data in AutoProfiler and the default instance view inline from Jupyter.

We use the Pandas datatype of the column to show corresponding charts and summary information. We categorize the Pandas datatypes into semantic datatypes of numeric, categorical, or timestamp columns, similar to previous Pandas visualization systems [9, 25]. Column profiles for each of these three data types are shown in Figure 2. Each column profile has three core components:

1. Column Overview, which contains the name, data type, a small visualization, and the percentage of missing values.
2. Column Distribution, which is shown by clicking on the overview to reveal a larger, interactive visualization of column values.
3. Column Summary, which has extra facts about a column such as the number of outliers or duplicate values.

The overview, distribution, and summary shown depend on the data type of the column. Furthermore, the distribution and summary can be toggled on and off to show more details on demand [44]. This is important for large dataframes with many columns, or when there are many dataframes in memory, to prevent unnecessary scrolling. Many visual elements show hints on hover to further prevent visual clutter, providing further details on demand. Our core charting components were adapted from the open-source Rill Developer platform, which shows data profiles for SQL queries [38]. We use the same visualizations in AutoProfiler with extra summary information and linked interactions to connect the profile to the notebook.
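As a concrete illustration of this categorization, the dtype-to-semantic-type bucketing could be sketched with Pandas' public type predicates. This is a minimal sketch of the idea, not AutoProfiler's actual rules:

```python
# Sketch: bucket a Pandas column into the numeric/categorical/timestamp
# semantic types described above. Illustrative only.
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_numeric_dtype,
)

def semantic_type(col: pd.Series) -> str:
    if is_datetime64_any_dtype(col):
        return "timestamp"
    if is_bool_dtype(col):       # booleans are profiled like categoricals
        return "categorical"
    if is_numeric_dtype(col):    # ints and floats
        return "numeric"
    return "categorical"         # strings/objects fall back to categorical
```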
Quantitative Columns: For quantitative columns like integers and floats, we show a binned histogram so that users can get an overview of the distribution of the column. This histogram is shown in the column overview as a preview; a larger and interactive version is presented upon toggling the column open. On hover, users can see how many points are in each bin. We also show numerical summary information like the min, mean, median, and max of the column. This is similar to what is presented in the describe() function in Pandas to give a numeric summary of a column. In Figure 2 (left), we demonstrate this information for a price column where we can see that some of the prices in this distribution are negative, a potential error that should be inspected during analysis.

If users want to see more information, they can toggle the summary to see potential outliers, whether the column is sorted, and the number of positive, zero, and negative values. We use two common heuristics to detect outlier values. The first is if a value is greater than 3 standard deviations from the mean; the second is if a point falls outside of 1.5 × IQR away from the first or third quartile. Both forms of outlier detection can be exported to code, which allows users to investigate potential outliers further or manually change these thresholds for classifying outliers.
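The exported outlier-detection code is pre-built into the tool; in Pandas, the two heuristics could look roughly like the following sketch (the exact snippet AutoProfiler writes may differ):

```python
# Sketch of the two outlier heuristics on the price column from Figure 2.
col = df["price"]

# Heuristic 1: values more than 3 standard deviations from the mean.
std_outliers = df[(col - col.mean()).abs() > 3 * col.std()]

# Heuristic 2: values outside 1.5 * IQR from the first or third quartile.
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
```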

Fig. 2: AutoProfiler shows distributions and summary information depending on the column type. For quantitative columns, we show a binned
histogram along with summary statistics. On hover, the user can see the count in each bin or export the selection to code. We also show a summary
with extra information like potential outliers that can be exported to code. For categorical columns like strings or boolean values, we show up to the
top 10 most frequent values. On click, the selection can also be exported to code. For temporal columns, we show the count of records over time
and the range of the column.

Categorical Columns: For categorical or boolean columns, we first show the cardinality of the column in the overview to let users understand the total number of unique values. Once toggled open, the distribution view shows the frequency of the top 10 most common values. This is similar to the commonly used value_counts() function in Pandas, which shows the counts of all unique values. In the categorical summary, we show extra information about the character lengths of the strings in the column along with a more detailed description of the column's uniqueness. This uniqueness fact can be exported to code, which lets users inspect duplicated data points. Once again, users can export a selection to code in the notebook to quickly filter their dataframe. For example, in Figure 2 (center) we show the information for the categorical column "county". This column has some default values of "---" that seem like an error, so a user can click "Export rows to code" to have the code df[df.county == "---"] written to their notebook and can investigate these rows further. Once this new code is written to the notebook, the user can look at this subselection in AutoProfiler or with their own Pandas code.
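The statistics behind a categorical profile map onto a few Pandas one-liners; a minimal sketch for the county column above (illustrative, not AutoProfiler's exact queries):

```python
# Sketch of the summary statistics behind a categorical profile.
col = df["county"]

top_values = col.value_counts().head(10)  # distribution view: top 10 values
cardinality = col.nunique()               # overview: number of unique values
char_lengths = col.str.len()              # summary: string length statistics
n_duplicated = col.duplicated().sum()     # summary: uniqueness description
```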
Temporal Columns: Our last semantic data type is for temporal columns, where we also show a distribution overview so users can see the count of their records over time. In the larger distribution view, users can hover over this chart to see the count of values at a particular point in time. We also show the range of the column and whether the column is sorted. Users can drag over a selection of the column to zoom into that time range in the visualization. We plan on adding selection exports to temporal columns in the future. In Figure 2 (right), we show the profiling information for a date column where a user can observe that the records in their dataset span 17 years; however, they are not evenly distributed, with large spikes in certain years such as early 2012.
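A sketch of the counts-over-time computation behind this temporal view, assuming a dataframe df with a parsed date column as in Figure 2 (illustrative only):

```python
# Sketch: records over time and column range for a temporal profile.
import pandas as pd

dates = pd.to_datetime(df["date"])
counts_by_month = dates.dt.to_period("M").value_counts().sort_index()
date_range = (dates.min(), dates.max())    # range shown in the summary
is_sorted = dates.is_monotonic_increasing  # sortedness shown in the summary
```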
4.2 Live Data Profiles
Beyond showing useful data profiling information just once, AutoProfiler updates as the data in memory updates. Once a new cell is executed, AutoProfiler recomputes the data profiles for all Pandas dataframes in memory and updates the charts and statistics as necessary in the interface. With live updates, AutoProfiler always shows the current state of all dataframes currently in memory in the notebook, allowing users to quickly verify whether transformations have expected or unexpected effects on their data. Figure 3 shows this update when a string column is parsed to numeric. Here, Pandas initially parses this column as an object data type, but when the user turns the column into an integer, the distribution and summary information is updated. Live updates help users verify a wide range of transforms, for example, after updating the types of columns, applying filters, or dropping "bad" values.
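The transform in Figure 3 is a typical example of the kind of update being verified; in Pandas it might be written as follows (a sketch, assuming the study's sqft column):

```python
# Parse the string-typed sqft column to numeric, as in Figure 3.
# Unparseable values become NaN, which the live profile surfaces
# as a small rise in the column's null percentage.
import pandas as pd

df["sqft"] = pd.to_numeric(df["sqft"], errors="coerce")
```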
AutoProfiler has several UI elements to help users track and assess changes after updates. The first is that when a user hovers over a column in any dataframe, if other dataframes have columns with the exact same name, they are highlighted. For example, if a user takes the dataframe df, filters it to df_filtered, and then hovers on the Price column, the linked highlights help the user make a visual connection between the two Price columns. With automatic dataframe detection and visualization, there can potentially be many dataframes in memory as users manipulate their data over an analysis. AutoProfiler supports sorting dataframe profiles to find those of interest. By default, the most recently updated profiles are shown at the top of the sidebar. A user can also sort alphabetically by the dataframe name. Furthermore, users can pin any profile so that it always appears at the top of the sort order. Dataframe profiles are typically only shown for dataframes explicitly assigned to a variable, with one exception: if the output from the most recently executed cell is a Pandas dataframe, we will compute a profile for it with the name "Output from cell [5]". On the next cell execution, these temporary profiles are removed. This fits into a common notebook programming workflow where users display their dataframe after making a transformation to see how the data has changed.

4.3 Exports to code
In addition to interactive data profiles, AutoProfiler assists users in authoring code. AutoProfiler facilitates code creation in two ways: selection and template code exports. For both of these, a user clicks on a button or part of a chart and AutoProfiler writes code for them in the notebook below the user's currently selected cell. All code export snippets are pre-built into AutoProfiler and produce the same code snippet for each task, with the dataframe and column names filled in so the code is ready to execute in the notebook.

Selection and template exports only differ in the kind of code they produce. Selection exports allow users to export selections from charts to help them filter their data, as mentioned in § 4.1. For example, Figure 2 (left and center) demonstrates how a user can export selections from categorical and numeric charts to quickly filter their data. This helps users more quickly iterate on ideas during analysis by spending less time writing simple code, and it proved very popular in our user study.

AutoProfiler authors more complex code, like charts or code to detect outliers, with template exports. Code exports for these tasks are still relatively simple, only exporting up to 10 lines of code. However, this saves users from having to remember how to author a chart themselves or compute outliers. Users can then easily edit this code, for example to customize their visualization or change the threshold for an outlier. Prior work has discovered how data scientists often re-use snippets of code across analyses to help them speed up their workflows [10, 21]. AutoProfiler's exports serve as a form of these pre-baked "templates" for analysis steps. The other benefit of this type of export is that it helps preserve analysis in the notebook in the form of code, which supports more replicable analyses in notebooks, a common goal [35].
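As a hypothetical example of what such a template chart export might look like once written into the notebook (the shipped snippet may differ):

```python
# Hypothetical shape of a template chart export for df["price"].
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
df["price"].plot.hist(bins=30, ax=ax)  # same binned view as the profile
ax.set_xlabel("price")
ax.set_ylabel("count")
plt.show()
```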
This linking between analysis in a visual analytics tool and notebook code has been introduced in previous systems such as Mage [22] and B2 [51]. Our goal here is similar: to support tight integration between GUI and code. However, our approach differs slightly in that we only write code to the notebook when the user explicitly clicks a button, to prevent polluting the user's working environment.
Fig. 3: AutoProfiler updates the data profiles shown as soon as the data updates. In this example, Pandas parses the sqft column as a string type since some of the values initially have strings in them. Once the dataframe df updates in memory, AutoProfiler will update the profile shown. This way the user can see their transformation was successful, inspect the distribution of sqft, and even notice that the number of nulls increased by 0.3% after this parse.

Fig. 4: AutoProfiler profiling workflow. Data profiles are computed reactively when a user executes new code. Profiling is done in the kernel to speed up performance and avoid serializing the entire dataframe.

4.4 Implementation and Architecture
AutoProfiler is built as a Jupyter Lab extension to augment a normal interactive programming environment with a data profiling sidebar. Figure 4 shows the components involved in an example live update loop. When a user executes new code, the kernel sends a signal that a cell was executed (step 1). AutoProfiler then interacts with the kernel to get all variables that are Pandas dataframes, and requests data profiles for each of these variables (steps 2-4). When a user requests to export code, a new cell is created with the code (step 5). This is only a UI interaction, and when the user executes the generated cell, the update loop will trigger again. Whenever the kernel is restarted, the dataframes in memory are cleared, so the profiles in AutoProfiler reset.
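This update loop can be approximated with IPython's event hooks. Below is a minimal sketch of steps 1-4, assuming a standard IPython kernel; AutoProfiler's real implementation profiles inside the kernel and ships only summaries to the sidebar UI:

```python
# Minimal sketch of the Figure 4 update loop; not AutoProfiler's code.
import pandas as pd
from IPython import get_ipython

ip = get_ipython()

def reprofile(result):
    # Step 2: find all Pandas dataframes in the kernel's namespace.
    frames = {name: val for name, val in ip.user_ns.items()
              if isinstance(val, pd.DataFrame) and not name.startswith("_")}
    # Steps 3-4: compute lightweight profiles and hand them to the UI.
    for name, frame in frames.items():
        print(name, frame.shape)  # stand-in for sending a profile to the sidebar

# Step 1: the kernel signals that a cell was executed.
ip.events.register("post_run_cell", reprofile)
```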
As a Jupyter extension, AutoProfiler can be easily installed as a Python package and included in a user's Jupyter Lab environment. This easy installation has proven very popular with users of our system. The frontend code for AutoProfiler uses Svelte [46] for all UI components. Our code is open-sourced and available at https://github.com/cmudig/AutoProfiler.

All profiling functions are written in Python and execute code in Pandas. Pre-binning distributions in Python makes serialization faster by avoiding serializing entire dataframes. Since our profiling happens in Pandas, the performance of AutoProfiler generally scales with the capabilities of Pandas. Anecdotally, we have used AutoProfiler during analyses with dataframes with hundreds of thousands of datapoints and updates remain responsive.
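Pre-binning might look like the following sketch: only bin edges and counts cross the kernel/UI boundary, so the payload size is independent of the row count (illustrative; the real profiling queries differ):

```python
# Sketch: pre-bin a numeric column in the kernel so the UI payload
# stays small no matter how many rows the dataframe has.
import json
import numpy as np

def binned_histogram(col, bins=20):
    counts, edges = np.histogram(col.dropna(), bins=bins)
    return json.dumps({"counts": counts.tolist(), "edges": edges.tolist()})

payload = binned_histogram(df["price"])  # a few hundred bytes, not the column
```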
The scalability of our approach is primarily impacted by two main considerations: the number of columns in each dataframe and the number of dataframes in memory. Pandas can still execute a single query relatively quickly for dataframes with up to millions of datapoints, and we consider a full benchmarking of Pandas queries outside the scope of this work. Since requests to the Jupyter Python kernel are currently executed serially, larger requests for dataframes with many columns, or more dataframes in memory, make updates slower. The AutoProfiler UI is not affected by the size of the underlying data since the queries return binned data counts or summary statistics, so the UI remains responsive; it simply takes longer to fetch new data for larger or more dataframes. We have included several performance tweaks to make AutoProfiler usable for real workflows. For example, we do not calculate updates when the AutoProfiler tab is closed, to avoid unnecessary computation.

The scalability of AutoProfiler can be improved with further engineering. For example, the requests for profiling queries could be executed in parallel by augmenting the Jupyter kernel. Furthermore, a faster query execution system like DuckDB [37] can speed up the response on individual queries over Pandas. For particularly large datasets, the distributions and statistics could be estimated from samples.

5 Evaluation: User Study
We demonstrate the effectiveness of AutoProfiler in two ways. In this section, we discuss the results of a user study comparing two levels of automation support with AutoProfiler, and in § 6 we discuss the results of a longitudinal case study of users with AutoProfiler.

5.1 Participants
To evaluate how AutoProfiler helps data analysts in a sample data analysis task, we recruited Pandas and Jupyter users for a between-subjects user study. We recruited 16 participants from social media and our networks who were experienced data analysts. Our inclusion criteria required that participants be regular Pandas and Python users. Our participants had 2 to 12 years of experience doing data science (mean 4.8 years), and were all regular Python and Pandas users who frequently used Jupyter. The typical participant reported doing data analysis weekly and using Pandas daily, with all participants using Pandas at least monthly. Our participants worked in a variety of industries, including autonomous vehicles, data journalism, and finance, with job titles including data analyst, data engineer, post-doc, and researcher.

5.2 Research Questions
We had three primary research questions in our user study:

Q1. Live updates: Does a profiler with live updates lead to more insights found than one with manual updates?
Q2. Starting point for EDA: Does automatically providing visual data profiles lead users to write less code, and is this information helpful?
Q3. Linked code and GUI: How does code exporting facilitate handoff for follow-up analysis?

These research questions correspond to the main features of our tool.
We test how different levels of automation support continuous data profiling for Q1 by comparing the number of insights found through a profiler with live updates to one that required manual invocation. With Q2, we explore our design choice of showing a starting point for data profiling. To answer this question, we measure how many insights participants found through our tools versus their own code, as well as their qualitative perceptions of each tool version. Finally, to answer Q3, we measured how often exports to code are used during analysis and participants' perceptions of this feature.

In order to answer these research questions, we ran a between-subjects user study with two versions of our tool. We elected for a between-subjects design since data analysis requires time to do well, and we found during pilots that having participants analyze two separate datasets was infeasible and the quality of analysis on the second task was significantly worse. We also noticed a large learning effect in pilot studies when participants analyzed two datasets back to back.

5.3 StaticProfiler
In our study, one condition used AutoProfiler with live profiles, automatic updates, and code exports. For our other condition, participants used a static version of the tool, which we call StaticProfiler, that requires manual invocation. StaticProfiler allows us to test how different levels of automation support continuous data profiling. The interface shows the exact same information as AutoProfiler; however, it must be called manually with plot(df) and does not update automatically with data updates. The same profiles for each column are presented in an inline interactive widget with the ability to hand off to code in the notebook. This sort of manual invocation is similar to other Pandas visualization tools in notebooks [1, 25, 30]. A screenshot of the StaticProfiler tool is included in the appendix.

We compare AutoProfiler with StaticProfiler rather than other open source tools since StaticProfiler includes largely the same information as other tools but its UI design is the same as AutoProfiler's. Our goal with this comparison was to isolate the effects that live updates have on continuous data profiling (Q1) and to evaluate Q2 and Q3 through logs and interviews across both system versions. We compare AutoProfiler to a non-live updating tool, StaticProfiler, instead of a baseline of no tool, since participants could write any extra code in the study notebook and did not have to use the tools. This allowed us to evaluate how different designs impacted tool use and how a tool augmented a typical programming workflow.

5.4 Procedure and Task
In both conditions, participants were first shown a demo of the tool version they would be using (AutoProfiler or StaticProfiler). Each participant then analyzed the same dataset during the task. The dataset was a sample of a larger dataset of apartment listings from craigslist [32] with extra "errors" added (task dataset: https://github.com/cmudig/AP-Lab-Study-Public). The task dataset had 1,942 rows and 13 columns. We sampled the dataset to a smaller size so we could be more confident that our rubric covered the majority of important insights and errors in the data.

We had 13 pre-known insights/errors that we measured to see how well participants could explore the data and find these insights, as an initial "rubric" of task performance. Additionally, we included three extra insights and errors that participants found during their exploration. A detailed description of each insight/error that we measured is in Table 1. The categories of errors in this dataset were inspired by prior studies that group dataset errors into common types [20]. Our first 10 dataset errors are issues of missing data, inconsistent data, incorrect data, outliers, and schema violations. Inconsistent data refers to data with inconsistencies like variations in spelling or units; incorrect data is parsed as the wrong data type or has default values like dashes or empty strings. In addition to errors that might jeopardize an analysis if not discovered, we also measured how well participants discovered several broader insights in the dataset. Building off past definitions of dataset insights as unexpected, qualitative findings rooted in the data [28], we broadly considered insights to be findings about the data that did not fit into one of the aforementioned error buckets and are important to know before the dataset is used for a downstream task. We initially included three general insights, such as the scope of the dataset, recognizing skewed distributions, and investigating correlations. While these errors/insights are by no means exhaustive of everything of interest in our dataset, they provide a common "rubric" that we could evaluate participants against. We consider this rubric indicative of things that should be found in a proper EDA of the dataset, regardless of the tool being used. With the exception of insight 13 about correlations, all of these findings can be seen in the AutoProfiler or StaticProfiler interfaces.

No.  Type          Category  Origin    Description                                                    Found  Found with tool
1    Missing       Error     Inherent  Small number of missing values in county, beds, title         56%    100%
2    Missing       Error     Inherent  Mostly missing in baths, sqft, description                    56%    100%
3    Inconsistent  Error     Added     City has values that are lower and upper case                 69%    91%
4    Inconsistent  Error     Added     Negative prices                                               69%    100%
5    Incorrect     Error     Inherent  Date could be parsed to DateTime format                       63%    90%
6    Incorrect     Error     Added     County has default values of "---"                            81%    85%
7    Incorrect     Error     Added     Sqft has string values and should be converted to an int     69%    82%
8    Outliers      Error     Inherent  Outliers in sqft                                              6%     100%
9    Outliers      Error     Inherent  Outliers in price                                             44%    100%
10   Schema        Error     Added     Duplicate datapoints (duplicate post_ids)                     38%    100%
11   Distribution  Insight   Inherent  Room_in_apt is almost all 0                                   56%    100%
12   Scope         Insight   Inherent  Dataset is only apartments in California                      31%    100%
13   Correlation   Insight   Inherent  Inspect correlations between any variable and price          13%    0%
14   Distribution  Insight   Inherent  Data is not evenly distributed across years                   38%    100%
15   Inconsistent  Error     Inherent  Year and date column correspond (ensure consistency)         19%    67%
16   Inconsistent  Error     Inherent  The price is not properly extracted from title for some rows  6%     0%

Table 1: Description of each of the errors and insights on our "rubric" of participant performance. We include the percentage of participants that discovered each error/insight, noting that some discoveries were found far more often than others. As the same information was present in both AutoProfiler and StaticProfiler, the discovery rate in each condition is largely comparable. The first 13 insights and errors were things we expected participants to discover ahead of time, and the last 3 were valid extra findings discovered by participants.

Participants were asked to explore and clean the data under the guidance that this dataset was recently acquired by a colleague who wants to build a predictive model of apartment prices. Participants were asked to clean and produce a report about the dataset in the notebook that would be handed off to their colleague. Participants were told there were at least 10 errors in this dataset that they should try to find and fix, to encourage critical engagement with the data. They were not told what kind of errors these were or what constituted an error.

Participants were given 30 minutes to explore the data with the tool and asked to think aloud about what they were investigating. Participants were asked to write down any insights and findings in their notebooks and voice them aloud.
During their analysis, they were free to look up external documentation and use any other Python libraries they thought might be helpful. Our research team was present if participants had questions about the task overall; however, we did not answer questions about the data. We automatically logged interactions with the tools during the study. Afterward, we conducted semi-structured interviews with each participant and asked them about how they went about the task and how the tool supported their analysis. We examined the findings that participants wrote down in the notebook or voiced aloud from study recordings to quantify how many of the insights on our rubric they had found. In Sections § 5.5, § 5.6, and § 5.7 we discuss findings based on these logs and interview data.

Fig. 5: Usage and task performance metrics of AutoProfiler and StaticProfiler from our user study.

5.5 Live profiles do not lead to more insights but make verification easier
In both conditions, participants found a similar number of insights: on average, 6.9 with StaticProfiler and 7.4 with AutoProfiler out of the 16 we measured (P=0.71). Therefore, we did not observe more insights found with AutoProfiler (Q1). Participants heavily used both versions of the tool, as demonstrated by the similar number of unique dataframes and columns explored in Figure 5. We suspected the live updates in AutoProfiler would encourage more tool use, which would lead to more insights found, but participants found both versions to be helpful during their analysis task, reinforcing the value of automatic visualization. Furthermore, live updates may not have made as much of a difference in a controlled lab setup versus a less well-defined analysis outside of the lab setting, which we explore in § 6.

Participants used both versions of the tools to verify that their code had the expected effect on a dataframe. For example, we observed participants finding an error through the tool, writing code to fix it, and then checking that their code had the expected effect through the tool. We particularly noticed this pattern with users of AutoProfiler. For example, P3 noticed error #3, that the city column contained some cities that were spelled with different casings ("Oakland" and "oakland"), with the column detail view. They then fixed this error by making all the values upper case with their own Pandas code and verified that the top values were all upper case in AutoProfiler. As P3 described:

"It was nice to see when I do the upper [casing] and I can just see, oh that worked. When I do the drop duplicates, I can just look and see like, oh that worked. I like that."

We observed this (1) find a dataset error, (2) fix, and (3) verify in the tool loop for many of our participants. Live updates help facilitate this verification since the updates happen automatically, whereas with the static version of the tool, users would often verify transformations with their own code manually. As P7 (StaticProfiler) mentioned: "I only want [StaticProfiler] when I'm ready for it. Because it does take up some screen space. Like I don't want it like suddenly bumping a bunch of things out of the way." Since StaticProfiler puts visualizations inline in the notebook, multiple invocations can lead to cluttered notebooks.

Both AutoProfiler and StaticProfiler also helped participants quickly discover when they had done a transformation incorrectly. For example, P5 used AutoProfiler to export the outliers for the beds column to code. However, when they re-assigned their dataframe variable, they accidentally assigned df to contain only outliers. With AutoProfiler they quickly noticed that their dataframe now only contained 12 data points with extreme distributions and were able to fix their error. We observed this pattern of the tool helping find user errors during four different studies, three of which were using AutoProfiler.
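P3's fix is a compact example of this loop; the code side of it amounts to a couple of Pandas one-liners (a sketch of what such a fix looks like, not a participant's verbatim code):

```python
# Find (in the profile) -> fix (in code) -> verify (in the updated profile).
df["city"] = df["city"].str.upper()  # fix: normalize inconsistent casing
df = df.drop_duplicates()            # fix: remove duplicated rows
# Verify: the city profile now shows one casing per value, and the row
# count in the dataframe profile drops accordingly.
```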
Using static, inline data profiles is not without its advantages. For one, several users liked the ability to keep a history of past dataframes in their notebook when they called plot() with StaticProfiler. Although some participants felt this led to potentially cluttered notebooks, it can be useful to scroll back to an earlier version of the data. This is not possible in AutoProfiler since the visualizations always show the current dataframe in memory.

5.6 Automatic visualizations speed up insight discovery
Participants found the tools to be useful both as a first step in analysis and as a way to help them understand their data after updates and transforms. We logged interactions during the study and present metrics of interest in Figure 5. We measured the unique dataframes explored as the number of unique dataframes toggled open (AutoProfiler) or called with plot (StaticProfiler). This metric captures how often a user returns to a dataframe after it updates or explores a new dataframe. For example, if a user explores df, updates it, then explores df again, we would count this as two unique interactions. We observed that participants with AutoProfiler interacted with slightly more dataframes (9.9 vs 5.5); however, this difference was not statistically significant (P=0.21). Over the course of their analysis, participants were on average inspecting data profiles in AutoProfiler for almost 10 different slices or updates to dataframes. One of our participants with AutoProfiler actually interacted with 30 unique dataframes during their analysis.

We also measured the number of unique columns (including updates) that participants interacted with and find that they explore largely the same number of columns in each condition, investigating 25.5 unique columns on average with AutoProfiler and 24.1 with StaticProfiler. Since the original dataset had 13 columns, this indicates that participants were not only interacting with the original data but were returning to the profiles as they updated or filtered their data. This continuous interaction is the main goal of continuous data profiling.

Overwhelmingly, participants found their insights with the assistance of either tool rather than by manually writing code to get the same information. This means that when a participant said the insight aloud or wrote it down in their notebook, this information was discovered through the tool. Across both conditions, an average of 91% of insights found came from the tool, with a non-significant difference in rates between the two conditions (P=1.0). This means that on average only 9% of insights were found by users writing manual Pandas code during the study. This supports that the information contained in the profiles is useful and replicates what participants would have wanted to see anyway, without requiring extra code to be written (Q2). As P14 (StaticProfiler) said, "it does a lot of the things that I already do, but just in one succinct and easy-to-understand way". By presenting this information automatically, the tools saved participants time and prevented them from having to exit their analysis flow to look up external documentation. As P10 (AutoProfiler) described:

"I might have known to look for it, but it would have taken me a lot longer to remember how to do it in Pandas."

When data profiling information is more easily accessible, it speeds up the entire analysis loop, making it easier to discover more insights in a shorter amount of time while still being thorough. As P9 (AutoProfiler) described:
"I would probably try to do similar things that AutoProfiler suggests [on my own], but it would take a much longer time. Like the amount I did in 30 minutes, if I had to do it without AutoProfiler, would have taken hours. And then since it takes longer, my motivation would go down and my focus would go down. So I feel like I would have found far fewer errors than I could with AutoProfiler."

We found that not all insights were discovered with the same frequency, with discovery rates between 6% and 81%. In Table 1 we see that some errors, like #6, were found by 81% of participants; others, like #8 or #10, were found by 6% and 38%, respectively. Error #8 was particularly difficult since the sqft column had to be parsed from a string to an integer (error #7) to get information about the outliers in the profiles. Many participants did not successfully fix this issue during the study time, explaining the low discovery rate. However, duplicate primary keys (error #10) were readily discoverable in the interface by looking at the number of unique values in the post_id column, yet few participants found them. We discuss this usage trend in more depth in § 7: tools can facilitate users finding information they would have already wanted to investigate, but if they do not know to check for an issue, this information is easily skipped over.
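The duplicate-key check behind error #10 is a one-liner in Pandas if one knows to run it, which is exactly what most participants did not (a sketch, assuming the study's post_id column):

```python
# Error #10: duplicate post_ids, visible by comparing unique IDs to rows.
n_rows = len(df)
n_ids = df["post_id"].nunique()
print(f"{n_rows - n_ids} duplicate post_ids")  # > 0 signals duplicates
```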

5.7 Exports facilitate follow-up analysis and learning
We also measured the number of times that participants exported to code during their analysis. Every participant used code exports at least once, with the total number of exports ranging from 1 to 16, with a mean of 7.1 exports. In Figure 5, we detail the average number of exports between the two tools. We see similar trends across both conditions, where participants use more selection exports than template exports. Selection exports refer to exporting a filter from a chart or summary statistic, like exporting the selection for df[df.city == "San Jose"]. Although these exports are small, they can help make follow-up analysis easier if a user wants to filter, since "that's probably the most annoying lines to constantly type is [to] just filter" (P5, using AutoProfiler).

Template exports refer to code for authoring a chart or getting outliers. Participants also found this helpful because it facilitated tweaking code for follow-up analysis. When describing their reason for using chart exports, P14 (StaticProfiler) mentioned, "It's really nice to just quickly be able to like to copy that and use it, and then I could just make some edits to it." This answers Q3: exports facilitate faster feedback loops.

Another unexpected benefit of code exports is the ability to actually learn Pandas better and understand what is going on under the hood of the system when it reports a statistic. As P12 (AutoProfiler) said succinctly: "I'm learning as I'm exploring and it's saving me time." Expanding more, P2 (StaticProfiler) mentioned:

"For the educational perspective, that's something I didn't expect...specifically, I [exported] the standard deviation and I could see points inside or outside of 3 [std]. When I saw that code I learned that's the way to do that."

The ability to teach users how to do common analysis steps is an exciting aspect of systems that support easily linking code and direct manipulation interactions.

5.8 Limitations
Our user study is subject to several limitations. First, subjects were explicitly told to explore and clean their dataset and were given 30 minutes to engage with a brand-new dataset. This is a relatively short time span to learn and use a new tool on new data. We also suspect that the explicit instructions to find errors and write down findings in a report might have encouraged better continuous data profiling practices than what actually happens in real-world settings. However, these explicit instructions helped us determine which features specifically aid in continuous data profiling and what kind of errors users commonly find or miss. Another limitation is that participants analyzed a relatively small dataset. The errors and insights in our dataset were representative of those found in larger datasets, and we believe our findings translate well to other tabular dataset tasks. Finally, we compared two versions of our tool with different levels of automation to understand how they supported continuous data profiling, rather than comparing to a baseline with no tool, and view this as an area for future work.

Fig. 6: AutoProfiler integrated into a domain scientist's analysis workflow during our case study. AutoProfiler is shown on the bottom screen in the Jupyter notebook.

6 Evaluation: Longitudinal Case Study
To address some of the limitations of our user study, we also evaluated how AutoProfiler helps data scientists in a real-world environment by working with domain scientists at a US National Lab to integrate AutoProfiler into their workflows. These scientists work with large-scale image data collected from beamline X-ray scattering experiments to understand the properties of physical materials [23]. Two different scientists installed AutoProfiler into their Jupyter Lab environments and used it over a three-month period during their analyses as much as they liked. We were unable to collect log data during this deployment for privacy reasons. We periodically spoke with the scientists during the deployment to make sure the tool was working. At the end of the 3-month period, we conducted in-person observations and interviews with the participants, where they showed us the notebooks and datasets in which they were using AutoProfiler, and we asked about how they used the system and which features they felt supported their workflows.

As a Jupyter Lab extension, AutoProfiler fits into the existing workflows of these scientists since they typically did data analysis with Python and had existing libraries for visualizing and manipulating their data. AutoProfiler helped improve two different workflows they have for data analysis. The first is monitoring data outputs and quality while an experiment is running. Their experiments last for multiple hours or even days while they collect image readings from a sensor and then process these images into tabular datasets with Python image processing pipelines. As the scientists describe, during these experiments "real-time feedback is important as it shows us whether the experiment is working". The participants mentioned how AutoProfiler improved this type of monitoring since it works with any Python-based analysis and "allows [them] to easily notice any anomaly and observe a trend or correlation during experiments."

The second way the participants used AutoProfiler was to analyze their results after an experiment completed. In this scenario, the scientists "iteratively sub-selected a relevant set of data, using AutoProfiler as a guide, and then analyzed this subset of data using existing analysis/plotting tools.
data triage, data organization, and serendipitous discovery of trends in datasets”. In the remainder of this section, we discuss two high-level patterns of use that emerged from interviews with the participants in our long-term deployment.

6.1 Finding and following up on trends
When using AutoProfiler to analyze their experimental results, our participants expressed how the tool facilitated finding interesting aspects in their data and then diving deeper into those subsets. In this way, AutoProfiler facilitated a faster find-and-verify loop during analysis. The automatic plotting in AutoProfiler presented interesting plots in their dataset that helped them find subsets to export and explore further, such as by running other analysis code to plot the images corresponding to each data point. They were especially excited about the possibility of incorporating bivariate charts into AutoProfiler so they would have to use even less of their own analysis code.

6.2 AutoProfiler facilitates serendipitous discovery
The scientists used the live version of AutoProfiler that updates whenever their data changes. They mentioned that the combination of all three features (automatic visualization, live updates, and code authoring) supported one another to lower the friction of their data analysis, and they were not enthusiastic about using versions of the tool without all of these features (such as in StaticProfiler). Furthermore, the participants mentioned that using AutoProfiler helped them discover trends or errors they might not have noticed otherwise:

“One of the things that I very often notice is if the histogram is completely flat. That means that either all the numbers are exactly the same, or that it’s some sort of sequential number. Sometimes that’s what I’m expecting, so great. But sometimes, if it’s not what I’m expecting, then that immediately stands out as being weird and it draws my attention to it. I would never have noticed if it were not plotted; I would never have thought to plot it.”

Our participants described how these unexpected, serendipitous discoveries were primarily facilitated by the auto-updating and automatic visualizations of AutoProfiler and made the system a valuable part of their workflow.

7 DISCUSSION AND FUTURE WORK
Data science is messy. There is a combinatorially large number of ways to slice a dataset in search of meaningful insights. The goal of continuous data profiling is to augment a human’s sense-making ability by making the analysis feedback loop as fast as possible. Previous work has established that automated systems can best facilitate data understanding by removing the need for manual specification [15]. We found that two different versions of automatic profiling help speed up this feedback loop in our user study. Furthermore, we found evidence that the combination of automatic visualization, live updates, and code handoff leads to a smoother, more thorough analysis loop in our long-term deployment, where our participants credited AutoProfiler with helping them find “serendipitous discoveries” in their dataset.

In real-world tasks, encouraging critical engagement is challenging because analysts must trade off finding insights and errors quickly against a thorough and exhaustive analysis of their data. AutoProfiler’s design removes friction by saving time and clicks to better facilitate continuous data profiling. Since AutoProfiler works with any pandas dataframe, users do not have to write or copy and paste profiling code that might be tightly coupled to a specific dataset. This makes notebooks cleaner and easier to maintain.
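Concretely, this is the kind of per-dataframe profiling boilerplate that otherwise gets retyped or pasted after every transformation (a generic sketch of common manual profiling cells, not code taken from AutoProfiler):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "city": ["Oakland", "Oakland", None, "Berkeley"],
    "price": [1800, 1800, 2500, 2100],
})

# Manual profiling cells that tend to be re-run after each transformation:
print(df.shape)               # row/column counts
print(df.dtypes)              # column types
print(df.isna().sum())        # nulls per column
print(df.duplicated().sum())  # duplicate rows
print(df.describe())          # summary statistics
df["price"].hist()            # distribution of one numeric column
plt.show()
```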
Future tools can leverage the benefits of both code and automated visualization for data analysis through linked and deeply integrated data profiles. Automatically presenting a starting set of profiling information and supporting follow-up analysis by enabling code exports helps reduce the feedback time during analysis. This approach differs from other profiling systems that aim to include as much information as possible in the interface without handing off to code [25, 30].

7.1 Guiding users towards unknown insights
Beyond making data analysis faster, automated systems like AutoProfiler can help users discover insights they might have otherwise missed. These serendipitous discoveries present an interesting opportunity for tools to help users look at their data in new ways. However, this process cannot be fully automated. Automatically presenting data profiles to users gives them the opportunity to find insights, but users must still take the time to look at the data and interpret whether an insight or error is noteworthy. Automated systems can augment human expertise, but do not replace it. For example, in our user study, many participants missed important data quality issues like duplicate values, even though this information was readily available in either tool if one knew to check. The most common types of unexpected errors discovered through AutoProfiler were strange distributions, such as a totally flat distribution or odd frequent values. The distribution information is very visually prominent in AutoProfiler, perhaps making it easier to discover in the interface.

Automated assistance in notebooks opens up the design space for further improvements toward guided analysis. One exciting area for future work is the potential to integrate alerts into automatic data profiles to draw user attention to important errors. For example, an alert could be displayed if a column has a number of null values or outliers greater than some threshold. Alerts must be customizable and designed to minimize alert fatigue, or else a user may ignore them entirely [43]. With existing inline, manual profilers [30], these alerts would be recomputed and displayed every time a user updates and re-profiles their data, quickly causing alert fatigue. Tools like AutoProfiler present an opportunity for persistent alerts between profiles that can better support continuous data science.
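As a sketch of what such alert rules could look like (a hypothetical helper, not an existing AutoProfiler API), a profiler might re-evaluate user-customizable checks on each dataframe update, including the flat-distribution case the case study participants described:

```python
import pandas as pd

def profile_alerts(df: pd.DataFrame, max_null_frac: float = 0.1) -> list[str]:
    """Hypothetical alert rules a continuous profiler could re-evaluate on
    each dataframe update; the thresholds would be user-customizable."""
    alerts = []
    for col in df.columns:
        null_frac = df[col].isna().mean()
        if null_frac > max_null_frac:
            alerts.append(f"{col}: {null_frac:.0%} null values")
        # A completely flat column (every value identical) often signals an
        # upstream error, echoing the flat-histogram case above.
        if df[col].nunique(dropna=True) == 1:
            alerts.append(f"{col}: all values are identical")
    return alerts

df = pd.DataFrame({"temperature": [21.0, 22.5, None, None], "run_id": [7, 7, 7, 7]})
print(profile_alerts(df))
# -> ['temperature: 50% null values', 'run_id: all values are identical']
```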
7.2 Authoring more analysis code for users
Our export-to-code feature was very popular among participants, with many requests for even more ways to export to code. Part of the benefit of AutoProfiler’s approach to exports is that they are predictable: the system exports the same template code every time, with the dataframe and column names filled in. This is in contrast to generative approaches to code authoring such as GitHub Copilot [13], where a model might produce different code for the same task depending on the prompt, and users must then take time to understand this new code each time it is exported. The downside of template approaches like ours is that they are less flexible for arbitrary analysis.

In our user study, we frequently observed participants needing to look up the documentation for how to write a certain command with the Pandas library, even if they were experienced users. As tools continue to evolve to automatically write analysis code through text prompting, we think this will make data iteration even faster. The linked, interactive outputs from systems like AutoProfiler become even more valuable for helping users understand their data as the time it takes to write analysis code decreases, perhaps especially when users are not manually writing all of that code and still need to understand its effect on their data.

8 CONCLUSION
In conclusion, we present AutoProfiler, a Jupyter notebook assistant that uses automatic, live, and linked data profiles to support continuous data profiling during data analysis. In a controlled user study, we find users leverage two versions of our tool, dead or alive, to find the vast majority of insights during a data cleaning task. Furthermore, we find that AutoProfiler easily fits into data scientists’ real-world workflows and helps them discover unexpected insights in their data during a longitudinal case study. We discuss how tools like AutoProfiler open up the design space for automated assistants to support continuous data profiling during analysis.

ACKNOWLEDGMENTS
We would like to thank Venkat Sivaraman, Katelyn Morrison, Alex Cabrera, and the members of the Data Interaction Group at CMU for their feedback on this work; Hamilton Ulmer and the Rill Data team for the initial implementation of our data profiling charts; and Wei Xu, Kevin Yager, and Esther Tsai at Brookhaven National Laboratory for their feedback and use of AutoProfiler. This research was
supported by Brookhaven National Laboratory through New York State funding and the Human-AI-Facility Integration (HAI-FI) initiative.

REFERENCES
[1] 8080 Labs. bamboolib. https://bamboolib.8080labs.com/, 2020. Accessed 06-2023. 2, 6
[2] S. Alspaugh, N. Zokaei, A. Liu, C. Jin, and M. A. Hearst. Futzing and moseying: Interviews with professional data analysts on exploration practices. IEEE Transactions on Visualization and Computer Graphics, 25:22–31, 2019. doi: 10.1109/TVCG.2018.2865040 1, 2, 3
[3] Anaconda Foundation. The state of data science 2020: Moving from hype toward maturity. https://www.anaconda.com/state-of-data-science-2020, 2020. Accessed 06-2023. 2
[4] Apache Arrow. Pyarrow - apache arrow python bindings. https://arrow.apache.org/docs/python/index.html, 2023. Accessed 06-2023. 3
[5] P. D. Bailis, E. Gan, K. Rong, S. Suri, and S. InfoLab. Prioritizing attention in fast data: Principles and promise. In 8th Biennial Conference on Innovative Data Systems Research (CIDR 17), 2017. 1
[6] F. Bertrand. sweetviz. https://github.com/fbdesignpro/sweetviz. Accessed 06-2023. 2
[7] R. DeLine and D. Fisher. Supporting exploratory data analysis with live programming. In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 111–119, 2015. doi: 10.1109/VLHCC.2015.7357205 2
[8] Ç. Demiralp, P. J. Haas, S. Parthasarathy, and T. Pedapati. Foresight: Recommending visual insights. Proc. VLDB Endow., 10(12), aug 2017. doi: 10.14778/3137765.3137813 2
[9] W. Epperson, D. Jung-Lin Lee, L. Wang, K. Agarwal, A. G. Parameswaran, D. Moritz, and A. Perer. Leveraging analysis history for improved in situ visualization recommendation. Computer Graphics Forum, 41(3):145–155, 2022. doi: 10.1111/cgf.14529 3
[10] W. Epperson, A. Y. Wang, R. DeLine, and S. M. Drucker. Strategies for reuse and sharing among data scientists in software teams. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP ’22, p. 243–252. Association for Computing Machinery, New York, NY, USA, 2022. doi: 10.1145/3510457.3513042 2, 3, 4
[11] D. Fisher, R. DeLine, M. Czerwinski, and S. Drucker. Interactions with big data analytics. Interactions, 19(3):50–59, may 2012. doi: 10.1145/2168931.2168943 2
[12] N. Forsgren, M.-A. Storey, C. Maddila, T. Zimmermann, B. Houck, and J. Butler. The space of developer productivity: There’s more to it than you think. Queue, 19(1):20–48, mar 2021. doi: 10.1145/3454122.3454124 3
[13] Github. Github copilot - your ai pair programmer. https://github.com/features/copilot. Accessed 06-2023. 9
[14] A. Head, F. Hohman, T. Barik, S. M. Drucker, and R. DeLine. Managing messes in computational notebooks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, p. 1–12. Association for Computing Machinery, New York, NY, USA, 2019. doi: 10.1145/3290605.3300500 2
[15] J. Heer. Agency plus automation: Designing artificial intelligence into interactive systems. Proceedings of the National Academy of Sciences, 116:1844–1850, 2019. doi: 10.1073/pnas.1807184115 2, 3, 9
[16] J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008. 2
[17] F. Hermans, B. Jansen, S. Roy, E. Aivaloglou, A. Swidan, and D. Hoepelman. Spreadsheets are code: An overview of software engineering approaches applied to spreadsheets. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 5, pp. 56–65, 2016. doi: 10.1109/SANER.2016.86 2
[18] F. Hohman, K. Wongsuphasawat, M. B. Kery, and K. Patel. Understanding and visualizing data iteration in machine learning. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, p. 1–13. Association for Computing Machinery, New York, NY, USA, 2020. doi: 10.1145/3313831.3376177 2
[19] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, 2012. doi: 10.1109/TVCG.2012.219 2
[20] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI ’12, p. 547–554. Association for Computing Machinery, New York, NY, USA, 2012. doi: 10.1145/2254556.2254659 2, 6
[21] M. B. Kery, A. Horvath, and B. Myers. Variolite: Supporting exploratory programming by data scientists. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, p. 1265–1276. Association for Computing Machinery, New York, NY, USA, 2017. doi: 10.1145/3025453.3025626 3, 4
[22] M. B. Kery, D. Ren, F. Hohman, D. Moritz, K. Wongsuphasawat, and K. Patel. Mage: Fluid moves between code and graphical work in computational notebooks. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, UIST ’20, p. 140–151. Association for Computing Machinery, New York, NY, USA, 2020. doi: 10.1145/3379337.3415842 3, 4
[23] M. H. Kiapour, K. G. Yager, A. C. Berg, and T. L. Berg. Materials discovery: Fine-grained classification of x-ray scattering images. IEEE Winter Conference on Applications of Computer Vision, pp. 933–940, 2014. 8
[24] M. Kim, T. Zimmermann, R. DeLine, and A. Begel. Data scientists in software teams: State of the art and challenges. IEEE Transactions on Software Engineering, 44(11):1024–1038, 2018. doi: 10.1109/TSE.2017.2754374 2
[25] D. J. L. Lee, D. Tang, K. Agarwal, T. Boonmark, C. Chen, J. T. J. Kang, U. Mukhopadhyay, J. Song, M. Yong, M. A. Hearst, and A. G. Parameswaran. Lux: Always-on visualization recommendations for exploratory dataframe workflows. Proc. VLDB Endow., 15:727–738, 2021. 2, 3, 6, 9
[26] X. Li, Y. Zhang, J. Leung, C. Sun, and J. Zhao. Edassistant: Supporting exploratory data analysis in computational notebooks with in situ code search and recommendation. ACM Trans. Interact. Intell. Syst., 13(1), mar 2023. doi: 10.1145/3545995 2
[27] J. H. Maloney and R. B. Smith. Directness and liveness in the morphic user interface construction environment. In ACM Symposium on User Interface Software and Technology, 1995. 2, 3
[28] C. North. Toward measuring visualization insight. IEEE Computer Graphics and Applications, 26(3):6–9, may 2006. doi: 10.1109/mcg.2006.70 6
[29] Pandas. Pandas: Python data analysis library. https://pandas.pydata.org. Accessed 06-2023. 3
[30] Pandas-Profiling. pandas-profiling. https://github.com/pandas-profiling/pandas-profiling. Accessed 06-2023. 2, 3, 6, 9
[31] J. Peng, W. Wu, B. Lockhart, S. Bian, J. N. Yan, L. Xu, Z. Chi, J. M. Rzeszotarski, and J. Wang. Dataprep.eda: Task-centric exploratory data analysis for statistical modeling in python. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD ’21, p. 2271–2280. Association for Computing Machinery, New York, NY, USA, 2021. doi: 10.1145/3448016.3457330 2
[32] K. Pennington. Bay area craigslist posts, 2000 - 2018. https://www.katepennington.org/data. Accessed 06-2023. 6
[33] J. M. Perkel. Why jupyter is data scientists’ computational notebook of choice. Nature News, Oct 2018. doi: 10.1038/d41586-018-07196-1 1, 2, 3
[34] J. Piazentin Ono, J. Freire, and C. T. Silva. Interactive data visualization in jupyter notebooks. Computing in Science & Engineering, 23(2):99–106, 2021. doi: 10.1109/MCSE.2021.3052619 2
[35] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire. A large-scale study about quality and reproducibility of jupyter notebooks. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 507–517, 2019. doi: 10.1109/MSR.2019.00077 4
[36] Polars. Polars, lightning-fast dataframe library. https://www.pola.rs/. Accessed 06-2023. 3
[37] M. Raasveldt and H. Mühleisen. Efficient SQL on Pandas with DuckDB — duckdb.org. https://duckdb.org/2021/05/14/sql-on-pandas.html, may 2021. Accessed 06-2023. 5
[38] Rill Data. Rill developer. https://github.com/rilldata/rill-developer. Accessed 06-2023. 3
[39] A. Rose. PandasGUI. https://github.com/adamerose/pandasgui. Accessed 06-2023. 2
[40] A. Rule, A. Tabard, and J. D. Hollan. Exploration and explanation in computational notebooks. In Proceedings of the 2018 CHI Conference on
Human Factors in Computing Systems, CHI ’18, p. 1–12. Association for
Computing Machinery, New York, NY, USA, 2018. doi: 10.1145/3173574
.3173606 2, 3
[41] N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M.
Aroyo. “everyone wants to do the model work, not the data work”: Data
cascades in high-stakes ai. In Proceedings of the 2021 CHI Conference
on Human Factors in Computing Systems, CHI ’21. Association for Com-
puting Machinery, New York, NY, USA, 2021. doi: 10.1145/3411764.
3445518 2
[42] H. Seltman. Experimental design and analysis. Carnegie Mellon Univer-
sity, Jul 2018. 2
[43] S. Shankar, R. Garcia, J. M. Hellerstein, and A. G. Parameswaran. Opera-
tionalizing machine learning: An interview study. ArXiv, abs/2209.09125,
2022. 2, 9
[44] B. Shneiderman. The eyes have it: a task by data type taxonomy for
information visualizations. In Proceedings 1996 IEEE Symposium on
Visual Languages, pp. 336–343, 1996. doi: 10.1109/VL.1996.545307 3
[45] T. Siddiqui, A. Kim, J. Lee, K. Karahalios, and A. Parameswaran. Ef-
fortless data exploration with zenvisage: An expressive and interactive
visual analytics system. Proc. VLDB Endow., 10(4), nov 2016. doi: 10.
14778/3025111.3025126 2
[46] Sveltejs. Svelte: cybernetically enhanced web apps. https://svelte.dev/, 2016. Accessed 06-2023. 5
[47] J. W. Tukey. We need both exploratory and confirmatory. The American
Statistician, 34:23–25, 1980. doi: 10.2307/2682991 1, 2
[48] A. Y. Wang, W. Epperson, R. A. DeLine, and S. M. Drucker. Diff in
the loop: Supporting data comparison in exploratory data analysis. In
Proceedings of the 2022 CHI Conference on Human Factors in Computing
Systems, CHI ’22. Association for Computing Machinery, New York, NY,
USA, 2022. doi: 10.1145/3491102.3502123 2
[49] K. Wongsuphasawat, Y. Liu, and J. Heer. Goals, process, and challenges
of exploratory data analysis: An interview study. ArXiv, abs/1911.00568,
2019. 1
[50] K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, and
J. Heer. Voyager: Exploratory analysis via faceted browsing of visual-
ization recommendations. IEEE Trans. Visualization & Comp. Graphics
(Proc. InfoVis), 2016. doi: 10.1109/TVCG.2015.2467191 2
[51] Y. Wu, J. M. Hellerstein, and A. Satyanarayan. B2: Bridging Code and
Interactive Visualization in Computational Notebooks. In ACM User
Interface Software & Technology (UIST), 2020. doi: 10.1145/3379337.
3415851 3, 4