Wrex: A Unified Programming-by-Example Interaction for Synthesizing Readable Code for Data Scientists
Author Keywords
computational notebooks; program synthesis; data science
CCS Concepts
• Human-centered computing → Interactive systems and tools
• Software and its engineering → Development frameworks and environments
INTRODUCTION
Data wrangling—the process of transforming, munging, shaping, and cleaning data to make it suitable for downstream analysis—is a difficult and time-consuming activity [4, 14]. Consequently, data scientists spend a substantial portion of their time preparing data rather than performing data analysis tasks such as modeling and prediction.

Increasingly, data scientists orchestrate all of their data-oriented activities—including wrangling—within a single context: the computational notebook [25, 1, 2, 5, 20, 30, 31]. The notebook user interface, essentially, is an interactive session that contains a collection of input and output "cells." Data scientists use input code cells, for example, to write Python. The result of running an input cell renders an output cell, which can display rich media, such as audio, images, and plots. This interaction paradigm has made notebooks a popular choice for exploratory data analysis.

Figure 1: Wrex is a programming-by-example environment within a computational notebook, which supports a variety of program transformations to accelerate common data wrangling activities. (A) Users create a data frame with their dataset and sample it. (B) Wrex's interactive grid, where users can derive a new column and give data transformation examples. (C) Wrex's code window containing synthesized code generated from grid interactions. (D) Synthesized code inserted into a new input cell. (E) Applying synthesized code to the full data frame and plotting the results.

CHI '20, April 25–30, 2020, Honolulu, HI, USA.
ACM ISBN 978-1-4503-6708-0/20/04.
http://dx.doi.org/10.1145/3313831.3376442

Through formative interviews with professional data scientists at a large, data-driven company, we identified an unaddressed gap between existing data wrangling tools and how
data scientists prefer to work within their notebooks. First, although data scientists were aware of and appreciated the productivity benefits of existing data wrangling tools, having to leave their native notebook environment to perform wrangling limited the usefulness of these tools. Second, although we expected that data scientists would only want to complete their data wrangling tasks, our participants were reluctant to use data wrangling tools that transformed their data through "black boxes." Instead, they wanted to inspect the code that transformed their data. Crucially, data scientists preferred these scripts to be written in their familiar data science languages, like Python or R. This allows them to insert and execute this code directly into their notebooks, modify and extend the code if necessary, and keep the data transformation code alongside their other notebook code for reproducibility.

To address this gap, we introduce a hybrid interaction model that reconciles the productivity benefits of interactivity with the versatility of programmability. We implemented this interaction model as Wrex, a Jupyter notebook extension for Python. Wrex automatically displays an interactive grid when a code cell returns a tabular data structure, such as a data frame. Using programming-by-example, data scientists can provide examples to the system of the data transform they intend to perform. From these examples, Wrex generates readable code in Python, a popular data science language.

Existing programming-by-example systems for data wrangling address some, but not all, of these requirements. FlashFill [15] does not display the transformed code to the data scientist. Although Wrangler [14, 23] can produce Python code, these scripts are not designed to be read or modified directly by data scientists. Trifacta [36] produces readable code, but in a domain-specific language and not a general-purpose one.

The contributions of this paper are as follows:
• We propose a hybrid interaction model that combines programming-by-example with readable code synthesis within the cell-based workflow of computational notebooks. We implement this interaction model as a Jupyter notebook extension for Python, using an interactive grid and a provisional code cell.
• We apply program synthesis to the domain of data science in a scalable way. Up until now, program synthesis has been restricted to Excel-like settings where the user wants to transform a small amount of data. Our approach allows data scientists to synthesize code on subsets of their data and to apply this code to other, larger datasets. The synthesized code can be incorporated into existing data pipelines.
• Through a user study, we find that data scientists are significantly more effective and efficient at performing data wrangling tasks with Wrex than with manual programming. Through qualitative feedback, participants report that Wrex reduced barriers in having to recall or look up the usage of various data transform functions. Data scientists indicated that the availability of readable code allowed them to verify that the data transform would do what they intended and increased their trust and confidence in the wrangling tool. Moreover, inserting synthesized code as cells is useful and fits naturally with the way data scientists already work in notebooks.

EXAMPLE USAGE SCENARIO FOR WREX
Dan is a professional data scientist who uses computational notebooks in Python. He has recently installed Wrex as an extension within his notebook environment. Dan has an open-ended task that requires him to explore an unfamiliar dataset relating to emergency calls (911) for Montgomery County, PA. The dataset contains several columns, including the emergency call's location as a latitude and longitude pair, the time of the incident, the title of the emergency, and an assortment of other columns.

First Steps: As with most of his data explorations, Dan starts with a blank Python notebook. He loads the montcoalert.zip dataset into a data frame using pandas—a common library for working with this rectangular data. He previews a slice of the data frame, the latlng and title columns for the first ten rows (A, 1). Wrex displays an interactive grid representing the returned data frame (B, 2). Through the interactive grid, Dan can view, filter, or search his data. He can also perform data wrangling using "Derive Column by Example."

From Examples to Code: Dan notices that cells in the title column seem to start with EMS, Fire, and Traffic. As a sanity check, he wants to confirm that these are the only types of incidents in his data, and also get a sense of how frequently these types of incidents are happening.

Dan selects the title column by clicking its header (3), then clicks the "Derive column by example" button to activate this feature (4). The result is a new empty column (5) through which Dan can provide an example (or more, if necessary) of the transform he needs.

He types an example of his intent, EMS, into an arbitrary row—the second—of the newly created column in the grid (6). When Dan presses Enter or leaves the cell, Wrex detects a cell change in the derived column. Wrex uses the example provided by Dan (EMS), along with the corresponding input taken from the column it was derived from, title (EMS: DIABETIC EMERGENCY), to automatically fill in the remaining rows (7).

In addition, Wrex presents the actual data transformation to Dan as Python code through a provisional code cell (C, 8). This allows Dan to inspect the Python code before committing to it. In this case, the code seems like what he probably would have written had he done this transformation manually: split the string on a colon, and then return the first split. Dan decides to insert this code into a cell below this one, but defers executing it (9). If Dan had actually intended to uppercase all of the types, he could have provided Wrex with a second example: Fire to FIRE. If desired, Dan could have also changed the target from Python to R for comparison (or even PySpark), since Dan is a bit more familiar with R.

Since the new input cell is just Python code (D), Dan is free to use it however he wants: he can use it as is, modify the function, or even copy the snippet elsewhere. Dan decides to apply the synthesized function to the larger data frame—this results in adding a type column to the data frame (df). He plots a bar chart of the count of the categories and confirms that there are only three types of incidents in the data (E).

From Code to Insights: Having wrangled the title column to type, and given the latlng column already present, Dan thinks it might be interesting to plot the locations on a map. To do so, the latlng column—a string—needs to be separated into lat and lng columns. Once again, he repeats the data wrangling steps as before: Dan returns a subset of the data, and uses the interactive grid in Wrex to wrangle the latitude and longitude transforms out of the latlng column. He applies these functions to his data frame. Having done the tricky part of data wrangling the three columns—lat, lng, and type—he cobbles together some code to add this information onto folium (10), a map visualization tool.

Like the data scientists in our study, Dan finds that data wrangling is a roadblock to doing more impactful data analysis. With Wrex, Dan can accelerate the tedious process of data wrangling to focus on more interesting data explorations—all in Python, and without having to leave his notebook.

RELATED WORK
Wrex extends and coalesces two lines of prior work: data wrangling tools and program synthesis for structured data.

Data Wrangling Tools
One well-known class of tools tackles the need to make data wrangling (e.g., preprocessing, cleaning, transformation [24]) more efficient. OpenRefine provides an interactive grid that allows the data scientist to perform simple text transformations, such as trimming a string, to clean up data columns, and to discover needed transformations through filters [8]. JetBrains' DataLore provides a data science IDE that provides code suggestions given user intentions [18]. Wrangler lowered the time and effort that data scientists spent on data wrangling by suggesting contextually relevant transforms to users, showing a preview window with the transform's effects, and providing an export-to-JavaScript function [23] (but this feature has since been removed in the commercial version [36]). Proactive Wrangler extended it with mixed-initiative suggested steps to transform data into relational formats [14]. Tempe provides interactive and continuous visualization support for live streaming data [6], where the visualization changed not only with new incoming data but also with new user input through live programming support. TrendQuery is a "human-in-the-loop" interactive system that allowed users to iteratively and directly manipulate their data visualizations for the curation and discovery of trends [21]. Northstar describes an interactive system for data analysis aimed at allowing non-data-scientist domain experts and data scientists to collaborate, making data science more accessible [27]. DS.js leverages existing web pages, and the tables and visualizations on them, to create programming environments that help novices learn data science [39].

While these tools all provide increased efficiency for data scientists performing data wrangling tasks, they are missing several key features that Wrex provides. Most notably, they do not aim to generate readable code or to integrate with data scientists' existing workflows. Wrex uses program synthesis to achieve these goals and integrates with Jupyter, a popular computational notebook used by data scientists [26]. Through both our formative and controlled lab studies, we found these features to be critical for data scientists. Participants required saving source code as an artifact of their data wrangling so they could perform similar transformations on future datasets. Further, they wanted to take the wrangling scripts they created with their sample dataset and apply them to the full dataset in cluster or cloud environments.

As for notebook integration, Kandel et al. described a common data science workflow of context switching between raw data, wrangling tools, and visualization tools; Kandel noted that the "ideal" tool would combine these workflows into one tool [22]. We also found that data scientists desired tools such as Wrex that integrated with their current workflow. Using separate applications to perform data wrangling and analysis requires extra time and effort to import data into independent tools to wrangle it, after which the transformed data must be exported back into their preferred tool for data exploration, the computational notebook.
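The transform Dan reviews in the scenario above is described as splitting the string on a colon and returning the first split. A minimal stand-alone sketch of that transform (the example titles are illustrative; this is not Wrex's actual synthesized output):

```python
def extract_type(title):
    # "EMS: DIABETIC EMERGENCY" -> "EMS": split on the colon, keep the first part
    return title.split(":")[0]

# Illustrative rows in the shape of the 911 dataset's title column
titles = ["EMS: DIABETIC EMERGENCY", "Fire: GAS-ODOR/LEAK", "Traffic: VEHICLE ACCIDENT"]
types = [extract_type(t) for t in titles]
# types == ["EMS", "Fire", "Traffic"]
```

Counting the distinct values of `types` is then enough to confirm the three incident categories Dan expects.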
Program Synthesis for Structured Data
Gulwani et al. developed a new language and program synthesis algorithm, implemented in Microsoft Excel, that can perform several tasks that users have difficulty with in spreadsheet environments [9, 11, 12]. This feature, which became known as FlashFill, leveraged input-output examples defined by the user. FlashFill took these examples and created programs to perform string manipulations quickly, and with very few output examples from the user. Harris and Gulwani then applied this research direction to table transformations in spreadsheets [15]. Yessenov et al. used programming-by-example to do text processing [38]. Le and Gulwani then developed FlashExtract, a framework that uses examples to extract data from documents and tabular data [28]. Others have also leveraged synthesis to perform data transformations involving tabular data [19, 7, 17, 37].

This procession of research allows end users to perform the above tasks without knowing how to write wrangling scripts. Further, even when users have the knowledge to create these scripts, these methods can produce results in a fraction of the time it takes to code the scripts by hand. This increases both the accessibility and efficiency of users dealing with data. Wrex leverages these benefits to allow data scientists to forgo writing data wrangling scripts and focus on providing example output of desired data transformations.

These projects found examples to be "the most natural" way to provide a program synthesizer with a specification, but challenges remained in designing programming-by-example interaction models, particularly around user intent [10]. They noted that user examples may be ambiguous, so users need a way to address this ambiguity. Wrex uses the "User Driven Interaction" model described in this line of work, which allows the user to examine the artifact, through reading the synthesized source code, and the behavior of the artifact, through the resultant output in the derived column. If any discrepancies exist with either, the user can provide further input by interacting with their data frame or by directly editing the code. Chasins et al. found that some participants perceived PBE tools to have less flexibility than traditional programming [3]. In Wrex, users have both the speed and ease-of-use benefits of PBE and the freedom to switch to traditional text-based coding whenever they perceive it to be necessary.

FORMATIVE INTERVIEWS AND DESIGN GOALS
We conducted interviews with seven data scientists who frequently use computational notebooks at a large, data-driven software company. In our interviews, we focused on how they perform data wrangling, how data wrangling fits within their notebook workflow, what tools they use or have used for data wrangling, and what difficulties they face as they wrangle data. These data scientists (F1–F7) provided several insights that guided the design goals for Wrex.

Data scientists reported that using standalone tools designed for data wrangling required "excessive roundtrips" (F2, F4) or "shuffling data back and forth" (F1, F6) between their notebooks and the data wrangling tool. As a result, they preferred to write their wrangling code by hand in their notebooks. F2 explained that although these tools have nice capabilities, "[they] shouldn't have to go somewhere else just to transform data." F6 wondered if it was "possible to put some of these capabilities [that are available in standalone tools] within their notebooks." This feedback led to our first design goal:

D1. Data wrangling tools should be available where the data scientist works—within their notebooks.

All of our participants wanted tools that produced code as an inspectable artifact, because "as a black box, you don't have a good intuition about what is happening to your data" (F7) and because "black boxes aren't transparent, the data transforms aren't customizable. If the tool doesn't have your transformation, you have to write it yourself anyway" (F6).

Although some tools allowed data scientists to view their data transformations as scripts, we found that data scientists preferred that these scripts be written in languages they were already comfortable with (F1, F6). For example, F1 "preferred general-purpose languages for doing data science." F6 explained, "there's a learning curve to having to learn new libraries". F7 added that the scripts from these tools were often quite limited: "I'm an expert in Python; these [languages for data wrangling] seem to cater only to novice programmers. They don't compose well with our existing notebook code or the 'crazy formats' we have to deal with."

Data scientists' desire for inspectable code as the output of the data transformation tool, their preference for using familiar programming languages, and the desire to customize or extend data wrangling transforms led to our second design goal:

D2. Data wrangling tools should produce code as an inspectable and modifiable artifact, using programming languages already familiar to the data scientist.

WREX SYSTEM DESIGN AND IMPLEMENTATION
Wrex is implemented as a Jupyter notebook extension. The front-end display component is based on Qgrid [33], an interactive grid view for editing data frames. Several changes were made to this component to support code generation. First, we modified Qgrid to render views of the underlying data frame, rather than the data frame itself. Second, we added the ability to add new columns to the grid. By implementing both of these changes, users are able to give examples through virtual columns without affecting the underlying data. Third, we added a view component to Qgrid to render the code block. Finally, we bound appropriate event handlers to invoke our program synthesis engine on cell changes. To automatically display the interactive grid for data frames, the back-end component injects configuration options into the Python pandas library [32] and overrides its HTML display mechanism.

Readable Synthesis Algorithm
The program synthesis engine that powers Wrex substantially extends the FlashFill toolkit [29], which provides several domain-specific languages (DSLs) with operators that support string transformations [9], number transformations [35], date transformations [35], and table lookup transformations [34]. A technical report by Gulwani et al. [13] formally describes the semantics of these extensions; Wrex uses these extensions, which we summarize in this section.
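The virtual-column mechanism in the system design can be made concrete with a toy sketch (all names here are hypothetical, not Wrex's implementation): user examples live outside the underlying rows so the data is never mutated, and every cell edit re-invokes a synthesizer callback over the accumulated (input, output) pairs.

```python
class GridView:
    """Toy model of an interactive grid with a derived (virtual) column."""
    def __init__(self, rows, synthesize):
        self.rows = rows              # underlying data, never modified
        self.examples = {}            # row index -> example output
        self.synthesize = synthesize  # callback: [(input, output)] -> program

    def edit_cell(self, row_index, value):
        # Record the example and re-run synthesis, mirroring how Wrex
        # reacts to each cell change in the derived column.
        self.examples[row_index] = value
        program = self.synthesize(
            [(self.rows[i], out) for i, out in self.examples.items()])
        return [program(r) for r in self.rows]

def toy_synthesize(examples):
    # Stand-in synthesizer: hard-codes a prefix-extraction "program"
    # instead of searching a DSL. Assumes the example output is a
    # prefix of its input, and splits on the character that follows it.
    inp, out = examples[0]
    sep = inp[len(out)]
    return lambda s: s.split(sep)[0]

grid = GridView(["EMS: DIABETIC EMERGENCY", "Fire: GAS-ODOR/LEAK"], toy_synthesize)
derived = grid.edit_cell(0, "EMS")
# derived == ["EMS", "Fire"], while grid.rows is untouched
```

The key design point this illustrates is the separation of examples from data: discarding the derived column costs nothing, because the underlying rows were never written to.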
Table 1: Wrex synthesizes readable code for transformations commonly used by data scientists during data wrangling activities. After selecting one or more columns, the data scientist can specify examples in an output column to provide their intent. As the data scientist provides examples, Wrex generates a synthesized code block and presents it to them.

String transforms:

Extracting — "12;L MERION;CITY AVE" → "L MERION":
    a = s.index(";") + len(";")
    b = s.rindex(";")
    return s[a:b]

Case manipulation — "NEW HANOVER" → "New Hanover":
    return s.title()

Concatenation — "Claudio", "A", "Chew" → "Claudio-A-Chew":
    return "{}-{}-{}".format(s, t, u)

Generating initials — "Doug Funnie" → "D.F.":
    t = regex.search(r"\p{Lu}+", s).group(0)
    u = list(regex.finditer(r"\p{Lu}+", s))[-1].group(0)
    return "{}.{}.".format(t, u)

Mapping constant values — "Male" → 0, "Female" → 1:
    return { "Male": 0, "Female": 1 }.get(s)

Number transforms:

Round to two decimals with ties going away from zero — -15.319 → -15.32, 17.315 → 17.32:
    return Decimal(s).quantize(
        Decimal(".01"),
        rounding=ROUND_HALF_UP)
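The Table 1 snippets use only standard operations and can be exercised directly; here they are lightly adapted into named functions (the initials row is omitted because it relies on the third-party `regex` module):

```python
from decimal import Decimal, ROUND_HALF_UP

def extract_between_semicolons(s):
    # Table 1, Extracting: "12;L MERION;CITY AVE" -> "L MERION"
    a = s.index(";") + len(";")
    b = s.rindex(";")
    return s[a:b]

def map_gender(s):
    # Table 1, Mapping constant values: "Male" -> 0, "Female" -> 1
    return {"Male": 0, "Female": 1}.get(s)

def round_half_up_2(s):
    # Table 1, Number: ties go away from zero, so "17.315" -> "17.32".
    # Building the Decimal from the string avoids binary-float error
    # (the float 17.315 is actually slightly below 17.315).
    return Decimal(s).quantize(Decimal(".01"), rounding=ROUND_HALF_UP)
```

Note the deliberate use of `Decimal(s)` on the string rather than `Decimal(float(s))`; the latter would reintroduce exactly the floating-point tie-breaking errors the transform is meant to avoid.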
With Wrex, we surfaced this PBE algorithm through an interaction that is accessible to data scientists. The algorithm supports a variety of transformations, and even compositions of those transformations, without requiring the user to explicitly specify any input or output data types. Table 1 lists examples of the resulting synthesized Python code for typical data science use cases; the synthesized code for Extracting is only three lines of code. In the classic FlashFill algorithm, this same program is over 30 lines of code.

Algorithm 1: Program synthesis phases for readable code.
    function ReadableCodeSynthesis(df, examples)
        P1 ← Synthesize(examples, flashfill_ranker)
        examples_all ← {(row, P1(row)) | row ∈ df}
        P2 ← Synthesize(examples_all, readability_ranker)
        P3 ← Rewrite(P2, rules, df)
        code ← translate_to_target(P3)
        return Formatter(code)

The extended FlashFill algorithm (RCS) has four phases:

Phase 1: Standard Program. RCS calls Synthesize with the user-provided examples, using the standard FlashFill ranker. Since the FlashFill ranker is optimized to minimize the number of required examples, data scientists can in many cases obtain a useful program (P1) by giving only a single example. Here, the program is represented in an internal DSL.

Phase 2: Readable Program. We use the size of the program as a proxy for readability, and design a ranker that prefers small programs, which are likely easier to understand. This ranker is also designed to prefer programs that use DSL operators that have a direct translation into the target language (e.g., Python). Since the readability ranker is optimized to prefer small programs, it requires more examples than the FlashFill ranker. The insight is to apply the program P1 to all required input columns in the data frame to obtain these additional examples (examples_all). RCS again calls Synthesize to obtain the program P2, this time using examples_all and the readability ranker. Concretely, consider the transform of "21-07-2012" to "21". FlashFill (intent-based) takes the substring that matches \d+ on the left and "-" on the right—because it handles dates like "4-12-2018" (input.match('^\d+')). However, tuning towards generality makes it less succinct. The objective-based ranker chooses input[0:2], but if and only if the behavior matches on a much larger sample of inputs (maintaining behavioral equivalence to the intent-based ranker). Hence, we pick input[0:2] if there are no inputs of the form "4-12-2018". If there are inputs in such a form, the user would have to provide a second example.

Phase 3: Rewriting. The goal of the rewriting phase is to transform the synthesized program into another program that is simpler to understand. As before, we apply the insight that we can use rewrite rules such that the synthesized program preserves the behavior of examples_all, while allowing the semantics to change on potential inputs that have not been passed to RCS. One such rule rewrites "[0-9]+(\,[0-9]{3})*(\.[0-9]+)?" to "\d+". This replaces a complex pattern that matches numbers with commas and a decimal point with a pattern that matches a sequence of digits. Clearly, replacing the first regular expression by the second can change the semantics of a program. But if all numbers in all the inputs are of the form "\d+", then the replacement will preserve behavioral equivalence.

Phase 4: Translation to the Target Language. The final translation step walks the abstract syntax tree (AST) of the DSL program and translates each node (DSL operator) into the equivalent operator in the target language—which today can be Python, R, or PySpark. For example, the Concat operator in the DSL is simply mapped to + or to a format method on a string, depending on the number of elements concatenated. If the DSL operator does not have a semantically equivalent Python operator, then the translator generates multiple lines of code in the target language to emulate its behavior. Finally, the target code is passed through an off-the-shelf code formatter: for Python, this is autopep8 [16].

Limitations
A limitation of Wrex is that user-friendly error handling is not implemented yet. Errors can arise in two ways: when the user specifies a conflicting set of constraints (for example, transforming "Traffic to T" alongside "Traffic to TR"), or when the synthesis engine fails to learn a program. Program synthesis will also unrelentingly generate incomprehensible programs due to difficult-to-spot typos in user-entered examples (such as a trailing space in an example). In such cases, we asked participants to invoke the grid again and redo the task, although we did not restart their task time. In practice, it is unrealistic to expect that data scientists can perfectly provide examples to the system, so these issues will need to be addressed in future work. When the user introduces these internally conflicting examples, or when rows in a dataset have ambiguous values (e.g., null), it is useful to suggest additional rows to investigate; this significant inputs feature is available but not evaluated in this paper.

Some tasks are not amenable to programming and thus are not performed by Wrex, like certain natural language transformations (e.g., "S.F." to "San Francisco") and tasks that require aggregation, like the sum or average of an entire column. Another limitation comes from how a user samples their data (for example, df.head(n), which may lack sufficient diversity in the range of exposed values). This issue may lead to synthesized code that works perfectly for the sample but runs into issues on the full dataset.

Users may not know when to stop providing examples (the point where further examples have little effect on synthesis). Here users must inspect the data frame and code to determine whether Wrex has narrowed in on an acceptable solution. Wrex outputs only the top-ranked program, and it is possible that the user may prefer a lower-ranked program (e.g., one that avoids regular expressions). Finally, Wrex is aimed at professional data scientists who work mostly within notebooks; users of Excel, Tableau, and other GUI tools may be more accustomed to switching between multiple tools, so an integrated single-app workflow may not be as necessary for them.
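The Phase 2 and Phase 3 decisions both hinge on behavioral equivalence over the sampled inputs, which can be mimicked in a few lines. This is a toy illustration of those checks, not the actual RCS ranker:

```python
import re

# Phase 2 (sketch): prefer the succinct program over the general one
# only when both agree on every input in the sampled data frame.
def general(s):
    return re.match(r"\d+", s).group(0)   # leading digits, any width

def succinct(s):
    return s[0:2]                         # the input[0:2] program

def pick_program(inputs):
    if all(general(s) == succinct(s) for s in inputs):
        return succinct                   # behaviorally equivalent on the sample
    return general

# Phase 3 (sketch): rewrite the complex number pattern to \d+ only
# when every sampled input already matches \d+ in full.
def simplify_pattern(inputs):
    complex_pat = r"[0-9]+(,[0-9]{3})*(\.[0-9]+)?"
    if all(re.fullmatch(r"\d+", s) for s in inputs):
        return r"\d+"
    return complex_pat
```

On the paper's own example, `pick_program(["21-07-2012", "30-01-2011"])` keeps the slice, while adding a single-digit-day input such as "4-12-2018" forces the general regex-based program, exactly as described in Phase 2.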
EVALUATION: IN-LAB COMPARATIVE USER STUDY notebook (manual condition). They had 5 minutes per task
Participants: We recruited 12 data scientists (10 male), ran- to read the requirements of the task and write code to com-
domly selected from a directory of computational notebook plete the task. Participants were provided a verifcation code
users with Python familiarity within a large, data-driven soft- snippet within their notebooks that participants ran to deter-
ware company. They self-reported an average of 4 years of mine if they had completed the task successfully. If partici-
data science experience within the company. They self-rated pants failed to complete the task within the allotted time, we
familiarity with Jupyter notebooks with a median of “Ex- marked the task as incorrect. Participants had access to the
tremely familiar (5),” using a 5-point Likert scale from “Not internet to assist them in completing the task if needed. At
familiar at all (1)” to “Extremely familiar (5),” and their famil- the end of the manual condition, we interviewed the partic-
iarity with Python at a median of “Moderately familiar (4).” ipants about their experience and asked them to complete a
questionnaire to rate aspects of their experience. Next, par-
Tasks: Participants completed six tasks using two different ticipants completed a short tutorial that introduced them to
datasets. These tasks involved transformations commonly W REX. After participants completed the tutorial, they moved
done by data scientists during data wrangling, such as extract- on to the second set of tasks, this time using W REX with con-
ing part of a string and changing its case, formatting dates, ditions similar to the frst set of tasks. After the three tasks are
time-binning, and rounding foating-point numbers. completed, we again interviewed them about their experience
The frst dataset, called A, contains emergency call data con- and asked them to complete the questionnaire.
taining columns with dates, times, latitude, longitude, physi- Questionnaires: After the frst set of tasks, participants rated
cal location with zip code and cross streets, and an incident how often the tasks showed up in their day-to-day work us-
description.1 We designed three tasks using this dataset: ing a 5-point Likert scale from “Never (1)” to “A great deal
A1 Using the Location (19044;HORSHAM;CEDAR AVE & (5)”, and discussed what aspects of the notebook made it diff-
COTTAGE AVE) column, extract the city name and title case it (Horsham).

A2  Using the Date (12/11/2015) and Time (13:34:52) columns, format the date to the day of the week, the time to 12-hour clock format, and combine these values with an "@" symbol (Friday @ 1pm).

A3  Using the Latitude (40.185840) and Longitude (-75.125512) columns, round half up the values to the nearest hundredths place and combine them in a new format ([40.19, -75.13]).

The second dataset, called B, contains New York City noise complaint data, which includes columns containing the date-timestamp of the call, the date-timestamp of when the incident was closed, type of location, zip code of incident, city of incident location, borough of incident location, latitude, and longitude.2 We designed three tasks using this dataset:

B1  Using the Created Date (12/31/2015 0:01) column, extract the time and place it in a 2-hour time bin (00:00-02:00).

B2  Using the Location Type (Store/Commercial), City (NEW YORK), and Borough (MANHATTAN) columns, title case the values and combine them in a new format (Store/Commercial in New York, Manhattan).

B3  Using the Latitude (40.865324) and Longitude (-73.938237) columns, round half down the values to the nearest hundredths place and combine them in a new format ((40.86, -73.94)).

Protocol: Participants were assigned the A and B datasets through a counterbalanced design, such that half the participants received the A dataset first (A-dataset group), and the other half received the B dataset first (B-dataset group). We randomized task order within each dataset to mitigate learning effects. They first completed three tasks with a Jupyter [...] difficult to complete the tasks and what affordances could address these difficulties. After the second set of tasks, this time with WREX, the participants took a second questionnaire that also had them rate task representativeness, and asked free-form questions on difficulties they had and tool improvements. Further, the second questionnaire asked the participant to rate grid and code acceptability using a 5-point Likert-type item scale ranging from "Unacceptable (1)" to "Acceptable (5)", and to rate the likeliness they would use a productionized version of WREX using a 5-point Likert-type item scale ranging from "Extremely unlikely (1)" to "Extremely likely (5)". Finally, participants were interviewed after each set of tasks about their experience with Jupyter notebooks and WREX.

Follow-up: We directly addressed participant feedback to improve the synthesized code: We removed the use of classes entirely and replaced these instances with lightweight functions. We replaced the register-based variable naming scheme (_0 and _1) with a variable-name generation scheme that uses simpler mnemonic names, such as s and t for string arguments. We removed exception handling logic because these constructs made it harder for the data scientists to identify the core part of the transformation. Finally, we returned to the participants after implementing these changes and asked them to reassess the synthesized code for the study tasks.

QUANTITATIVE RESULTS
Table 2 shows completion rates by task and condition. Fisher's exact test identified a significant difference between the WREX and manual conditions, both in the A-dataset first (p < .0001) and in the B-dataset first (p < .0001) subgroups. Participants in the manual condition completed 12/36 tasks, while those in the WREX condition completed all 36/36 tasks. The significant differences between the two conditions can be explained mostly by tasks A2 and B1, which require non-trivial date and time transformations.
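Written by hand, the date and rounding transformations above are short but fiddly. The following is a minimal Python sketch of tasks A2 and A3 (an illustrative hand-written version, not WREX's synthesized output; the input formats are taken from the task descriptions):

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

def day_at_hour(date_str, time_str):
    """Task A2: '12/11/2015' + '13:34:52' -> 'Friday @ 1pm'."""
    dt = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M:%S")
    hour = dt.strftime("%I%p").lstrip("0").lower()  # '01PM' -> '1pm'
    return f"{dt:%A} @ {hour}"

def round_coords(lat_str, lon_str):
    """Task A3: round half up to the nearest hundredth, e.g. '[40.19, -75.13]'."""
    q = lambda s: Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"[{q(lat_str)}, {q(lon_str)}]"

print(day_at_hour("12/11/2015", "13:34:52"))    # Friday @ 1pm
print(round_coords("40.185840", "-75.125512"))  # [40.19, -75.13]
```

Note that Python's built-in round() rounds halves to even, which is why the "round half up" requirement needs an explicit decimal rounding mode; this is exactly the kind of language nuance that makes such tasks non-trivial to write from memory.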
1 https://www.kaggle.com/mchirico/montcoalert
2 https://www.kaggle.com/somesnm/partynyc

        Manual        WREX          Frequency
Task    n    %        n    %        n     Median
A1      3    50%      6    100%     12    3
A2      0    0%       6    100%     12    2
A3      2    33%      6    100%     12    2
B1      0    0%       6    100%     12    3
B2      3    50%      6    100%     12    2
B3      4    67%      6    100%     12    2

Table 2: Participant task completion under WREX and manual data wrangling conditions, and participant-reported frequency of such tasks in day-to-day work. Participants were given five minutes to complete each task. Rating scale for task frequency: Never (1), Rarely (2), Occasionally (3), Moderately (4), A great deal (5). Medians of the frequency distributions are shown.

        Manual             WREX
Task    n    Time (min)    n    Time (min)
A1      3    2.5           6    2.4
A2      0    5.0           6    2.9
A3      2    3.6           6    1.8
B1      0    5.0           6    3.1
B2      3    4.4           6    3.2
B3      4    4.2           6    1.7

Table 3: Participant efficiency under WREX and manual data wrangling conditions.

        Acceptability
Task    n    Grid    Code1    Code2
A1      6    5       3        5
A2      6    5       2        5
A3      6    5       2        5
B1      6    4       2        4
B2      6    4       3        5
B3      6    5       3        5

Table 4: How acceptable was the grid experience and the corresponding synthesized code snippet? Rating scale: Unacceptable (1), Slightly unacceptable (2), Neutral (3), Slightly acceptable (4), and Acceptable (5). Code1 are the ratings for the code synthesized in the in-lab study; Code2 are the ratings after incorporating the participants' feedback. Median values are shown.

Participant Efficiency: Table 3 shows the distribution of completion times by task and condition, and the participants' self-reported frequency of how often they do that type of task. A t-test failed to identify a significant difference in the A-dataset (t(5.93) = 1.13, p = 0.30), but did identify a significant difference for the B-dataset (t(22.32) = 5.17, p < 0.001). Using WREX, the average time to completion was μ = 2.4, sd = 1.0 (A) and μ = 2.7, sd = 1.0 (B). Participants using WREX, on average, were about 40 seconds faster (μ = 0.60, sd = 0.53) in A, and about 1.6 minutes faster (μ = 1.61, sd = 0.31) in B. This means that if one has a good understanding of the code required to perform their transformation—and if the code is simple to write—then it may be faster to write code directly than to give an example to WREX.

Grid and Code Acceptability: Table 4 shows the distribution of acceptability for the grid, the code acceptability during the study (Code1), and the code acceptability after post-study improvements (Code2). Participants reported the median acceptability of the grid experience as Acceptable (5). The code acceptability during the study (Code1) had substantial variation in responses, with a median of Neutral (3). After improving the program synthesis engine based on the participant feedback (Code2), the median score improved to Acceptable (5). A Wilcoxon signed-rank test identified these differences as significant (S = 319, p < .0001), with a median rating increase of 2. As a measure of user satisfaction, we asked participants if they would use WREX for data wrangling tasks if a production version of the tool was made available: five participants reported that they would probably use the tool (4), and seven reported that they would definitely use the tool (5).

QUALITATIVE FEEDBACK FROM STUDY PARTICIPANTS

Reducing Barriers to Data Wrangling
After completing the three tasks in the manual Jupyter condition, participants noted these sets of barriers to wrangling that they experienced both during the tasks and also in their daily work, some of which WREX helped overcome.

Recall of Functions and Syntax
The most common barrier reported by participants, both within our lab study and in their daily work, is remembering what functions and syntax are required to perform the necessary data transformations. One reason for failed recall is a lack of recency: "my biggest difficulty was recalling the specific command names and syntax, just because I didn't use them today" (P2). The complexity of modern languages and the number of libraries available are too vast for data scientists to rely on their memory faculties, as "it is just tough to memorize all the nuances of a language" (P5). Participants noted that although computational notebooks have features like inline documentation and autocompletion, these features don't directly help them in understanding which operations they need to use and how they should use them.

WREX reduces this barrier with the synthesis of readable code via programming-by-example. This removes the need for data scientists to remember the specific functions or syntax needed for a transformation. Instead, they need only know what they want to do with the data in order to produce code.
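The core idea of programming-by-example can be sketched in a few lines. The toy below (a deliberately simplified stand-in for WREX's actual synthesizer, which searches a far richer space of string-transformation programs) picks, from a fixed set of candidate transforms, one that is consistent with the user's input-output examples:

```python
# Toy programming-by-example: choose a transform consistent with the
# user's examples. (Illustrative only; WREX's engine synthesizes
# readable programs over a much richer transformation language.)
CANDIDATES = {
    "title-case": str.title,
    "upper-case": str.upper,
    "lower-case": str.lower,
    "strip": str.strip,
}

def synthesize(examples):
    """Return (name, fn) for the first candidate matching every example."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name, fn
    return None  # no candidate explains the examples

name, fn = synthesize([("NEW YORK", "New York"), ("MANHATTAN", "Manhattan")])
print(name, fn("HORSHAM"))  # title-case Horsham
```

Because only the examples constrain the search, any remaining ambiguity is resolved by supplying more examples, which is precisely the interaction the grid supports.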
Searching for Solutions
To alleviate recall issues, data scientists rely on web searches for solutions on websites like Stack Overflow. These searches occur because "most of the tasks are pretty standard, I expect there to be one function that solves the piece, generally in Stack Overflow, if you are able to break the problem down small enough you can find a teeny code snippet to test" (P3). Participants believed searching for these code snippets is quicker than producing the solutions themselves. This helps them reach their goal of "achieving the final result as fast as possible", so they prefer "to save time and use something existing" (P8). Unfortunately, searching for solutions can fail or increase data wrangling time depending on the domain of the task, since "there are so many [web pages] and you need to pick the right one. So, it takes time to find someone who has the exact same problem that you had. Usually in 70-80% of the cases I've found that someone else has had the same problem, sometimes not, depends on the domain. [...] In audio [data] it's more complicated to find someone who did something similar to what you were looking for" (P8).

WREX reduced participants' reliance on web searches. Instead of hunting online for the right syntax or API calls, they could remain in the context of their wrangling activities and only had to provide the expected output for data transformations. Participants immediately noticed the time it took to complete the three tasks with WREX compared to doing so with a default Jupyter notebook and web search: "I super liked it, it was amazing, really quick, I didn't have to look up or browse anything else" (P9); WREX also "avoided my back and forth from Stack Overflow" (P12). By removing the need to search websites and code repositories, WREX allows data scientists to remain in the context of their analysis.

Fitting into Data Scientists' Workflows
WREX helps address the above barriers by providing familiar interactions that reduce the need for syntax recall and code-related web searches. First, WREX's grid felt familiar, lessening the learning curve required to perform data wrangling tasks with the system. This form of interaction was likened to "the pattern recognition that Excel has when you drag and drop it", and WREX was said to have a "nice free text flow" (P5). Feedback for the grid interaction was overwhelmingly positive (Table 4), with only minor enhancements suggested, such as a right-click context menu and better horizontal scrolling.

Participants agreed that this tool fit into their workflow. They were enthusiastic about not having to leave the notebook when performing their day-to-day data wrangling tasks. By having a tool that generates wrangling code directly in their notebook, participants felt that they could easily iterate between data wrangling and analysis. Some participants reported running subsets of their data on local notebooks for exploratory analysis, but then eventually needing to export their code into production Python scripts to run in the cloud. With existing data wrangling tools, participants indicated that they would have to re-write these transformations in Python. Because WREX already produces code, these data wrangling transformations are easy to incorporate into such scripts.

Data Scientists' Expectations of Synthesized Code

Readability of Synthesized Code
Participants described readability as a critical feature of usable synthesized code. P6 wanted "to read what the code was doing and make sure it was doing what I expect it to do, in case there was an ambiguity I didn't pick up on". It is also important for collaboration, "because the whole purpose for me to use Jupyter notebook is to be able to interpret things. [So], if I leave and pass on my work to someone else, they would be able to use it if they know how it is written" (P12). Participants also cited readability as an enabler for debugging and maintenance, where readable code would allow them to make small changes to the code themselves rather than provide more examples to the interactive grid. Amongst our participants, one example standard for readability was: "I would want it to be very similar to what I would expect searching Stack Overflow" (P10). Interestingly, our participants found short variable names like "String s" or "Float f" to be acceptable, as they could just rename these themselves.

Trust in Synthesized Code
The most salient method to increase trust was reading the synthesized code. Inspecting the resulting wrangled data frame is not enough; without readable code, participants "don't know what is going on there, so I don't know if I can trust it if I want to extend it to other tasks. I saw my examples were correctly transformed, but because the code is hard to read, I would not be able to trust what it is doing" (P10). Several participants noted that the best way to gain the confidence of a user in these types of tools is to "have the code be readable" (P3).

Several participants proposed alternative methods, beyond inspecting the data frame and the data wrangling code, to improve their trust in the system. These proposals ranged from simple summations of the resulting output to code comments and data visualizations. Some participants desired information on any assumptions made for edge cases, or wanted to be asked for examples covering these edge cases. These alternative affordances are important, as they could provide "a way of validating, maybe not the mechanics, the internals of it, but the output of it, would help me be confident that it did what I thought it did" (P2). That said, if WREX did not produce readable code, some participants "would be less trustful of it" (P10).

DISCUSSION

Data Scientists Need In-Situ Tools Within Their Workflow
Computational notebooks are not just for wrangling, but for the entire data analysis workflow. Thus, programming-by-example (PBE) tools that enhance the user experience at each stage of data analysis need to reside where data scientists perform these tasks: within the notebook. These in-situ workflows are an efficiency boost for data scientists in two ways. First, providing PBE within the notebook removes the need for users to leave their notebook and spend valuable time web searching for code solutions, as the solutions are generated based on user examples. Second, users no longer need to export their data, open an external tool, load the data into that tool, perform any data wrangling required, export the wrangled data, and reload that data back into their notebook.

Though our investigation focused on data wrangling, tools like WREX can play a critical role at each step of a data analysis by providing unified PBE interactions. For example, future PBE tools can first ingest data to synthesize code that creates a data frame, which can then be wrangled using PBE, and finally again be used to produce code to create visualizations like histograms or other useful graphs. This provides an accessible, efficient, and powerful interface to data scientists
performing data analysis, allowing them to never leave their notebooks and thus avoid context-switching costs.

In our user studies, we found that data scientists were unlikely to adopt tools that required them to leave their notebook. We also witnessed participants struggle to find code online that was suitable to the task at hand. Without WREX, they had to frequently move back and forth between web searches and their notebook as they copied and modified various code snippets. With WREX, this frustration was removed. When PBE interactions are within the notebook, a streamlined and efficient workflow for data analysis can be realized. In sum, as long as interactions are in-notebook, familiar, and can produce readable code, data scientists are enthusiastic to adopt PBE tools; they are hesitant to do so without these features.

Also, notebooks are an ideal environment for PBE tools, since program synthesis is good at generating small code snippets, which is a similar granularity to existing notebook cell usage. Program synthesis also relies on user interactions to provide examples that remove ambiguity so that PBE tools can produce correct code. Notebooks already provide a platform for interaction between a user and their code that is a good match for the user interaction requirements of PBE.

Data Scientists' Priorities for Readable Synthesized Code
Data scientists need to be able to read and comprehend the code so they can verify it is accomplishing their task. Thus, if a system synthesizes unreadable code, users have much more trouble performing verification of the output. Verifiability increases trust in the system, and gives data scientists confidence that the synthesized code handles edge cases and performs the task without errors. Readability also improves maintainability. If a data scientist wants to reuse the synthesized code elsewhere but make edits based on the context of their data, they need to be able to first understand that code.

Data scientists prioritize certain readability features over others when thinking about acceptable synthesized code. Participants did signal a need for features that increase readability, like better indentation, line breaks, naming conventions, and meaningful comments in the synthesized code. Some participants desired synthesized code that followed language idioms or that "would pass a [GitHub] pull request", but other participants saw their notebooks as exploratory code that would need to be rewritten and productionized later anyway, and instead desired synthesized code that is brief and easy to follow. This means that the goal of synthesized code should not be to appear as if a human had written it, but to focus on having the high-priority features that data scientists require.

Alternative Interactions with Code and Data
Data scientists frequently use applications like Excel to view their data and Python IDEs to manipulate it, which makes them choose between the ease of use afforded by GUIs and the expressive flexibility afforded by programming. WREX merges usability and flexibility by generating code through grid interactions. We found that our grid was familiar to data scientists who had used various grid-like structures before in spreadsheets. By implementing our grid in a programming environment such as Jupyter Notebooks, our system fits into data scientists' existing code-oriented workflow.

Though our original aim was to help data scientists accomplish difficult data wrangling tasks, our participants found that WREX was also useful for performing simpler PBE tasks like adding or dropping a column. While we implemented our interaction with program synthesis as an interactive grid, we believe that other interactions can also synthesize readable code. For example, our study participants mentioned data summaries and visualizations as potential sources for verification of the output of data science tools. One requested feature was histograms of the initial and the updated data frame, so users can take a quick glance and make sure the shape of the data makes sense. Data summaries provide ranges of values that surface potential edge cases for their code to handle, either by feeding them as examples to WREX or by modifying the synthesized code to handle these cases. The insight we gleaned from this feedback is that data scientists want the freedom that comes with multiple workflows so they can choose the best interaction for each task. In future work, it is interesting to explore different surfaces for exposing PBE tools, like the visually-richer interfaces described above, while discovering and minimizing potential trade-offs in user experience.

Synthesized Code Makes Data Science More Accessible
Synthesizing code with PBE has the potential to make data science more accessible to people with varying levels of programming proficiency. For instance, without a tool like WREX, a data scientist in a neuroscience lab must become an expert not only on brain-related data but also in the mechanics of programming. With WREX, they can not only see the final wrangled data, which speeds up their workflow; they can also study the code that performed those transformations.

Our study participants noted that WREX was useful in learning how to perform the transforms they were interested in, and could even assist them in discovering different programming patterns for regular expressions. WREX also alleviates the tedium felt by data scientists having to learn new APIs, and even lessens the burden of having to keep up with API updates. This also benefits polyglot programmers who might be weaker in a new language, as they can quickly get up to speed by leveraging WREX to produce code that they can use and learn from. In the future, we see potential for interactive program synthesis tools as learning instruments if they are able to synthesize readable and pedagogically-suitable code.

CONCLUSION
Our formative study found that professional data scientists are reluctant to use existing wrangling tools that did not fit within their notebook-based workflows. To address this gap, we developed WREX, a notebook-based programming-by-example system that generates readable code for common data transformations. Our user study found that data scientists are significantly more effective and efficient at data wrangling with WREX than with manual programming. In particular, users reported that the synthesis of readable code—and the transparency that code offers—was an essential requirement for supporting their data wrangling workflows.
Acknowledgments
We thank Arjun Radhakrishna, Ashish Tiwari, and Andrew Head for helpful discussions about tool and study design, and the data scientists at Microsoft who participated in the interviews and studies.

REFERENCES
[1] Apache. 2019. Zeppelin. (2019). https://zeppelin.apache.org/

[2] Carbide. 2019. Carbide Alpha. (2019). https://alpha.trycarbide.com/

[3] Sarah E. Chasins, Maria Mueller, and Rastislav Bodik. 2018. Rousillon: Scraping Distributed Hierarchical Web Data. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (UIST '18). 963–975. DOI: http://dx.doi.org/10.1145/3242587.3242661

[4] Tamraparni Dasu and Theodore Johnson. 2003. Exploratory Data Mining and Data Cleaning (1 ed.). DOI: http://dx.doi.org/10.1002/0471448354

[5] Databricks. 2019. databricks. (2019). https://databricks.com/

[6] Robert DeLine, Danyel Fisher, Badrish Chandramouli, Jonathan Goldstein, Michael Barnett, James F. Terwilliger, and John Wernsing. 2015. Tempe: Live scripting for live data. In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC '15). 137–141. DOI: http://dx.doi.org/10.1109/VLHCC.2015.7357208

[7] Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '17). 422–436. DOI: http://dx.doi.org/10.1145/3062341.3062351

[8] Google. 2019. OpenRefine. (2019). https://openrefine.org/

[9] Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-output Examples. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '11). 317–330. DOI: http://dx.doi.org/10.1145/1926385.1926423

[10] Sumit Gulwani. 2012. Synthesis from Examples: Interaction Models and Algorithms. In Proceedings of the 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC '12). 8–14. DOI: http://dx.doi.org/10.1109/SYNASC.2012.69

[11] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet Data Manipulation Using Examples. Commun. ACM 55, 8 (Aug. 2012), 97–105. DOI: http://dx.doi.org/10.1145/2240236.2240260

[12] Sumit Gulwani and Mark Marron. 2014. NLyze: Interactive Programming by Natural Language for Spreadsheet Data Analysis and Manipulation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). 803–814. DOI: http://dx.doi.org/10.1145/2588555.2612177

[13] Sumit Gulwani, Kunal Pathak, Arjun Radhakrishna, Ashish Tiwari, and Abhishek Udupa. 2019. Quantitative Programming by Examples. arXiv e-prints (Sep. 2019), arXiv:1909.05964.

[14] Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, and Jeffrey Heer. 2011. Proactive Wrangling: Mixed-initiative End-user Programming of Data Transformation Scripts. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11). 65–74. DOI: http://dx.doi.org/10.1145/2047196.2047205

[15] William R. Harris and Sumit Gulwani. 2011. Spreadsheet Table Transformations from Examples. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). 317–328. DOI: http://dx.doi.org/10.1145/1993498.1993536

[16] Hideo Hattori. 2019. autopep8. (2019). https://github.com/hhatto/autopep8/

[17] Yeye He, Kris Ganjam, Kukjin Lee, Yue Wang, Vivek Narasayya, Surajit Chaudhuri, Xu Chu, and Yudian Zheng. 2018. Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). 1785–1788. DOI: http://dx.doi.org/10.1145/3183713.3193539

[18] Jetbrains. 2019. Datalore. (2019). https://datalore.io/

[19] Zhongjun Jin, Michael R. Anderson, Michael Cafarella, and H. V. Jagadish. 2017. Foofah: Transforming Data By Example. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). 683–698. DOI: http://dx.doi.org/10.1145/3035918.3064034

[20] Jupyter. 2019. Jupyter Notebook. (2019). https://jupyter.org/

[21] Niranjan Kamat, Eugene Wu, and Arnab Nandi. 2016. TrendQuery: A System for Interactive Exploration of Trends. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA '16). Article 12, 4 pages. DOI: http://dx.doi.org/10.1145/2939502.2939514

[22] Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, and Paolo Buono. 2011a. Research Directions in Data Wrangling: Visualizations and Transformations for Usable and Credible Data. Information Visualization 10, 4 (Oct. 2011), 271–288. DOI: http://dx.doi.org/10.1177/1473871611415994

[23] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011b. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11). 3363–3372. DOI: http://dx.doi.org/10.1145/1978942.1979444

[24] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Transactions on Visualization and Computer Graphics 18, 12 (Dec. 2012), 2917–2926. DOI: http://dx.doi.org/10.1109/TVCG.2012.219

[25] Mary Beth Kery, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers. 2018. The Story in the Notebook: Exploratory Data Science Using a Literate Programming Tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). Article 174, 11 pages. DOI: http://dx.doi.org/10.1145/3173574.3173748

[26] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E. Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B. Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and et al. 2016. Jupyter Notebooks - a publishing format for reproducible computational workflows. In Proceedings of the 20th International Conference on Electronic Publishing (ELPUB '16). DOI: http://dx.doi.org/10.3233/978-1-61499-649-1-87

[27] Tim Kraska. 2018. Northstar: An Interactive Data Science System. Proceedings of the VLDB Endowment 11, 12 (Aug. 2018), 2150–2164. DOI: http://dx.doi.org/10.14778/3229863.3240493

[28] Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '14). 542–553. DOI: http://dx.doi.org/10.1145/2594291.2594333

[29] Microsoft. 2019. PROSE SDK. (2019). https://microsoft.github.io/prose/

[30] Mozilla. 2019. Iodide. (2019). https://alpha.iodide.io/

[31] Observable. 2019. Observable. (2019). https://observablehq.com/

[32] pandas-dev. 2019. The pandas project. (2019). https://pandas.pydata.org/

[33] Quantopian. 2019. Qgrid. (2019). https://github.com/quantopian/qgrid/

[34] Rishabh Singh and Sumit Gulwani. 2012a. Learning Semantic String Transformations from Examples. Proceedings of the VLDB Endowment 5, 8 (April 2012), 740–751. DOI: http://dx.doi.org/10.14778/2212351.2212356

[35] Rishabh Singh and Sumit Gulwani. 2012b. Synthesizing Number Transformations from Input-Output Examples. In Computer Aided Verification. 634–651. DOI: http://dx.doi.org/10.1007/978-3-642-31424-7_44

[36] Trifacta. 2019. Wrangler. (2019). https://www.trifacta.com/products/wrangler-editions/

[37] Navid Yaghmazadeh, Xinyu Wang, and Isil Dillig. 2018. Automated Migration of Hierarchical Data to Relational Tables Using Programming-by-example. Proceedings of the VLDB Endowment 11, 5 (Jan. 2018), 580–593. DOI: http://dx.doi.org/10.1145/3187009.3177735

[38] Kuat Yessenov, Shubham Tulsiani, Aditya Menon, Robert C. Miller, Sumit Gulwani, Butler Lampson, and Adam Kalai. 2013. A Colorful Approach to Text Processing by Example. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST '13). 495–504. DOI: http://dx.doi.org/10.1145/2501988.2502040

[39] Xiong Zhang and Philip J. Guo. 2017. DS.Js: Turn Any Webpage into an Example-Centric Live Programming Environment for Learning Data Science. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST '17). 691–702. DOI: http://dx.doi.org/10.1145/3126594.3126663