Development Workflows for Data Scientists
Enabling Fast, Efficient, and Reproducible Results for Data Science Teams
Ciara Byrne
Foreword
The field of data science has taken all industries by storm. Data scientist positions are consistently in the top-ranked best job listings, and new job opportunities with titles like data engineer and data analyst are opening faster than they can be filled. The explosion of data collection and the subsequent backlog of big data projects in every industry has led to the situation in which “we’re drowning in data and starved for insight.”
To anyone who lived through the growth of software engineering in the previous two decades, this is a familiar scene. The imperative to maintain a competitive edge in software by rapidly delivering higher-quality products to market led to a revolution in software development methods and tooling: hence the manifesto for Agile software development, Agile operations, DevOps, Continuous Integration, Continuous Delivery, and so on.
Much of the analysis performed by scientists in this fast-growing field occurs as software experimentation in languages like R and Python. This raises the question: what can data science learn from software development?
Ciara Byrne takes us on a journey through the data science and analytics teams of many different companies to answer this question. She leads us through their practices and priorities, their tools and techniques, and their capabilities and concerns. It’s an illuminating journey that shows that even though the pace of change is rapid and the desire for knowledge and insight from data is ever growing, the dual disciplines of software engineering and data science are up to the task.
— Compliments of GitHub
Development Workflows for Data Scientists
data scientist doesn’t always know what that is. “Planning a data science project can be difficult because the scope of a project can be difficult to know ex ante,” says Conway. “There is often a zero-step of exploratory data analysis or experimentation that must be done in order to know how to define the end of a project.”
Test
Testing is one area where data science projects often deviate from standard software development practices. Alluvium’s Drew Conway explains:

The rules get a bit broken because the tests are not the same kinds of tests that you might think about in regular software where you say “does the function always give the expected value? Is the pixel always where it needs to be?” With data science, you may have some stochastic process that you’re measuring, and therefore testing is going to be a function of that process.
However, Conway thinks that testing is just as important for data science as it is for development. “Part of the value of writing tests is that you should be able to interpret the results,” he says. “You have some expectation of the answer coming out of a model given some set of inputs, and that improves interpretability.” Conway points out that tooling specifically for data science testing has also improved.
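To make that concrete, here is a minimal sketch of what a test for a stochastic model might look like in Python, using a fixed random seed and a tolerance-based assertion on a statistical property rather than an exact value. The model, data, and thresholds are hypothetical illustrations, not taken from Conway’s tooling.

    import numpy as np

    def train_and_predict(X, y, seed=0):
        # Hypothetical stand-in for a stochastic model: a least-squares fit
        # against noisy targets, seeded so the test is repeatable.
        rng = np.random.default_rng(seed)
        coef, *_ = np.linalg.lstsq(X, y + rng.normal(0, 0.1, size=len(y)), rcond=None)
        return X @ coef

    def test_predictions_recover_known_signal():
        # Synthetic data with a known relationship: y = 2 * x.
        rng = np.random.default_rng(42)
        X = rng.uniform(0, 1, size=(200, 1))
        y = 2.0 * X[:, 0]

        preds = train_and_predict(X, y, seed=0)

        # Assert on an expected statistical property (mean error within a
        # tolerance), not an exact output, because the process is stochastic.
        mae = np.abs(preds - y).mean()
        assert mae < 0.05, f"mean absolute error too high: {mae:.3f}"

A test like this runs under pytest; the key pattern is that the expectation is statistical (error within a tolerance for known inputs), which is what makes the model’s behavior interpretable and checkable.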
Deploy to Production
Scotiabank’s global risk data science team built a sophisticated, automated deployment system for new risk models (see Figure 1-4). “We want to develop models almost as quickly as you can think about an idea,” says data science director Shergill. “Executing it and getting an answer should be as quick and easy as possible without compromising quality, compliance, and security. To enable that, lots of infrastructure has to be put into place and a lot of restructuring of teams needs to happen.”
Knowledge Discovery
The data BinaryEdge works with is constantly changing, and the model changes with it. For modeling, the team uses frameworks such as scikit-learn, along with OpenCV for image data and NLTK for textual data. When building a new model, all of the steps, including choosing the most relevant features, the best algorithm, and the parameter values, are performed manually. At this stage, the process involves a lot of research, experimentation, and general trial and error.
When feeding an existing model with new data, the entire process (cleaning, retrieving the best features, normalizing) can be automated. However, if the results substantially change when new data is injected, the model will be retuned, leading back to the manual work of experimentation.
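To illustrate the automated part of this process, the sketch below chains cleaning, feature selection, and normalization in a scikit-learn Pipeline so that new data passes through exactly the same fitted steps. The data, estimator choices, and parameter values are illustrative assumptions, not BinaryEdge’s actual pipeline.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: 500 samples, 30 features, with a few missing values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 30))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < 0.02] = np.nan  # simulate dirty incoming data

    # Once fitted, the pipeline applies identical cleaning, feature selection,
    # and normalization to any new batch; only the modeling decisions (features,
    # algorithm, parameters) remain a manual, exploratory step.
    pipeline = Pipeline([
        ("clean", SimpleImputer(strategy="median")),   # fill missing values
        ("select", SelectKBest(f_classif, k=10)),      # keep the 10 most relevant features
        ("normalize", StandardScaler()),               # zero mean, unit variance
        ("model", LogisticRegression(max_iter=1000)),  # placeholder estimator
    ])
    pipeline.fit(X, y)
    # pipeline.predict(new_batch) would now push new data through the same
    # fitted cleaning, selection, and normalization steps.

If retraining on new data substantially changes the results, the manual stages of the pipeline are revisited, as described above.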
Visualization
The main outputs of this step in the workflow are reports such as BinaryEdge’s Internet Security Exposure 2016 Report, blog posts on security issues, dashboards for internal data quality, and infographics.
Data visualizations are created using one of three tools: Plotly for dashboards and interactive plots, Matplotlib when the output is a Jupyter notebook report, and Illustrator when more sophisticated design is needed.
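As a small illustration of the Matplotlib path, the snippet below renders a chart of the kind that might be embedded in a Jupyter notebook report; the port numbers and counts are made-up placeholder values, not figures from BinaryEdge’s reports.

    import matplotlib.pyplot as plt

    # Hypothetical summary data for a notebook report.
    ports = ["22", "80", "443", "3389", "5900"]
    exposed_hosts = [120_000, 950_000, 870_000, 45_000, 30_000]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.bar(ports, exposed_hosts, color="steelblue")
    ax.set_xlabel("Port")
    ax.set_ylabel("Exposed hosts")
    ax.set_title("Exposed services by port (illustrative data)")
    fig.tight_layout()
    plt.show()  # in a notebook, the figure renders inline

For dashboards and interactive plots, the same data would instead be passed to Plotly.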