Development Workflows for Data Scientists
Enabling Fast, Efficient, and Reproducible Results for Data Science Teams
Ciara Byrne
Foreword
The field of data science has taken all industries by storm. Data scientist positions are consistently in the top-ranked best job listings, and new job opportunities with titles like data engineer and data analyst are opening faster than they can be filled. The explosion of data collection and the subsequent backlog of big data projects in every industry has led to the situation in which “we’re drowning in data and starved for insight.”
To anyone who lived through the growth of software engineering in the previous two decades, this is a familiar scene. The imperative to maintain a competitive edge in software by rapidly delivering higher-quality products to market led to a revolution in software development methods and tooling: hence the manifesto for Agile software development, Agile operations, DevOps, Continuous Integration, Continuous Delivery, and so on.
Much of the analysis performed by scientists in this fast-growing field occurs as software experimentation in languages like R and Python. This raises the question: what can data science learn from software development?
Ciara Byrne takes us on a journey through the data science and analytics teams of many different companies to answer this question. She leads us through their practices and priorities, their tools and techniques, and their capabilities and concerns. It’s an illuminating journey that shows that even though the pace of change is rapid and the desire for knowledge and insight from data is ever growing, the dual disciplines of software engineering and data science are up to the task.
— Compliments of GitHub
Development Workflows for Data Scientists
data scientist doesn’t always know what that is. “Planning a data science project can be difficult because the scope of a project can be difficult to know ex ante,” says Conway. “There is often a zero-step of exploratory data analysis or experimentation that must be done in order to know how to define the end of a project.”
Test
Testing is one area where data science projects often deviate from standard software development practices. Alluvium’s Drew Conway explains:

The rules get a bit broken because the tests are not the same kinds of tests that you might think about in regular software where you say “does the function always give the expected value? Is the pixel always where it needs to be?” With data science, you may have some stochastic process that you’re measuring, and therefore testing is going to be a function of that process.
However, Conway thinks that testing is just as important for data science as it is for development. “Part of the value of writing tests is that you should be able to interpret the results,” he says. “You have some expectation of the answer coming out of a model given some set of inputs, and that improves interpretability.” Conway points out that tooling specifically for data science testing has also improved.
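To make that concrete, here is a minimal sketch of what a test for a stochastic model might look like in Python, using a fixed random seed and a tolerance-based assertion on a statistical property rather than an exact value. The model, data, and thresholds are hypothetical illustrations, not taken from Conway’s tooling.

    import numpy as np

    def train_and_predict(X, y, seed=0):
        # Hypothetical stand-in for a stochastic model: a least-squares fit
        # against noisy targets, seeded so the test is repeatable.
        rng = np.random.default_rng(seed)
        coef, *_ = np.linalg.lstsq(X, y + rng.normal(0, 0.1, size=len(y)), rcond=None)
        return X @ coef

    def test_predictions_recover_known_signal():
        # Synthetic data with a known relationship: y = 2 * x.
        rng = np.random.default_rng(42)
        X = rng.uniform(0, 1, size=(200, 1))
        y = 2.0 * X[:, 0]

        preds = train_and_predict(X, y, seed=0)

        # Assert on an expected statistical property (mean error within a
        # tolerance), not an exact output, because the process is stochastic.
        mae = np.abs(preds - y).mean()
        assert mae < 0.05, f"mean absolute error too high: {mae:.3f}"

A test like this runs under pytest; the key pattern is that the expectation is statistical (error within a tolerance for known inputs), which is what makes the model’s behavior interpretable and checkable.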
Deploy to Production
Scotiabank’s global risk data science team built a sophisticated, automated deployment system for new risk models (see Figure 1-4). “We want to develop models almost as quickly as you can think about an idea,” says data science director Shergill. “Executing it and getting an answer should be as quick and easy as possible without compromising quality, compliance, and security. To enable that, lots of infrastructure has to be put into place and a lot of restructuring of teams needs to happen.”
Knowledge Discovery
The data BinaryEdge works with is constantly changing, and the model changes with it. For modeling, the team uses frameworks such as scikit-learn, along with OpenCV for image data and NLTK for textual data. When building a new model, all of the steps, including choosing the most relevant features, the best algorithm, and the parameter values, are performed manually. At this stage, the process involves a lot of research, experimentation, and general trial and error.
When feeding an existing model with new data, the entire process (cleaning, retrieving the best features, normalizing) can be automated. However, if the results substantially change when new data is injected, the model will be retuned, leading back to the manual work of experimentation.
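To illustrate the automated part of this process, the sketch below chains cleaning, feature selection, and normalization in a scikit-learn Pipeline so that new data passes through exactly the same fitted steps. The data, estimator choices, and parameter values are illustrative assumptions, not BinaryEdge’s actual pipeline.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: 500 samples, 30 features, with a few missing values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 30))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < 0.02] = np.nan  # simulate dirty incoming data

    # Once fitted, the pipeline applies identical cleaning, feature selection,
    # and normalization to any new batch; only the modeling decisions (features,
    # algorithm, parameters) remain a manual, exploratory step.
    pipeline = Pipeline([
        ("clean", SimpleImputer(strategy="median")),   # fill missing values
        ("select", SelectKBest(f_classif, k=10)),      # keep the 10 most relevant features
        ("normalize", StandardScaler()),               # zero mean, unit variance
        ("model", LogisticRegression(max_iter=1000)),  # placeholder estimator
    ])
    pipeline.fit(X, y)
    # pipeline.predict(new_batch) would now push new data through the same
    # fitted cleaning, selection, and normalization steps.

If retraining on new data substantially changes the results, the manual stages of the pipeline are revisited, as described above.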
Visualization
The main outputs of this step in the workflow are reports such as BinaryEdge’s Internet Security Exposure 2016 Report, blog posts on security issues, dashboards for internal data quality, and infographics.
Data visualizations are created using one of three tools: Plotly for dashboards and interactive plots, Matplotlib when the output is a Jupyter notebook report, and Illustrator when more sophisticated design is needed.
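As a small illustration of the Matplotlib path, the snippet below renders a chart of the kind that might be embedded in a Jupyter notebook report; the port numbers and counts are made-up placeholder values, not figures from BinaryEdge’s reports.

    import matplotlib.pyplot as plt

    # Hypothetical summary data for a notebook report.
    ports = ["22", "80", "443", "3389", "5900"]
    exposed_hosts = [120_000, 950_000, 870_000, 45_000, 30_000]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.bar(ports, exposed_hosts, color="steelblue")
    ax.set_xlabel("Port")
    ax.set_ylabel("Exposed hosts")
    ax.set_title("Exposed services by port (illustrative data)")
    fig.tight_layout()
    plt.show()  # in a notebook, the figure renders inline

For dashboards and interactive plots, the same data would instead be passed to Plotly.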