White Paper - DataOps Is NOT DevOps For Data
While the name “DataOps” implies that it borrows most heavily from DevOps, it
is all three of these methodologies — Agile, DevOps and statistical process
control — that comprise the intellectual heritage of DataOps. Agile governs
analytics development; DevOps optimizes code verification, builds and delivery
of new analytics; and SPC orchestrates and monitors the data factory. Figure 2
illustrates how Agile, DevOps and statistical process control flow into DataOps.
You can view DataOps in the context of a century-long evolution of ideas that
improve how people manage complex systems. It started with pioneers like
Deming and statistical process control — gradually these ideas crossed into the
technology space in the form of Agile, DevOps and now, DataOps.
DevOps was created to serve the needs of software developers. Dev engineers
love coding and embrace technology. The requirement to learn a new
language or deploy a new tool is an opportunity, not a hassle. They take a
professional interest in all the minute details of code creation, integration and
deployment. DevOps embraces complexity.
In DataOps, tests target either data or code. In a recent blog, we discussed this
concept using Figure 9. Data that flows through the Value Pipeline is variable
and subject to statistical process control and monitoring. Tests target the data,
which is continuously changing. Analytics in the Value Pipeline, on the other
hand, are fixed and change only through a formal release process. In the Value
Pipeline, analytics are revision controlled to minimize any disruptions in service
that could affect the data factory.
In the Innovation Pipeline, code is variable and data is fixed. The analytics are
revised and updated until complete. Once the sandbox is set up, the data
doesn’t usually change. In the Innovation Pipeline, tests target the code
(analytics), not the data.
Some tests are aimed at both data and code. For example, a test that makes
sure that a database has the right number of rows helps your data and code
work together. Ultimately both data tests and code tests need to come together
in an integrated pipeline as shown in Figure 5. DataOps enables code and data
tests to work together so overall quality remains high.
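To make the distinction concrete, here is a minimal Python sketch of the two kinds of tests. The table name, row-count bounds and the summarize_sales_by_region analytic are hypothetical; they simply illustrate a data test whose input varies (Value Pipeline) alongside a code test run against fixed input (Innovation Pipeline).

```python
# Illustrative only: one data test and one code test written as plain Python
# functions. Table names, thresholds and the example analytic are hypothetical.
import sqlite3


def data_test_row_count(conn, table="daily_orders", minimum=1_000, maximum=50_000):
    """Value Pipeline test: the code is fixed, the data varies.
    Flag incoming data that falls outside expected statistical bounds."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert minimum <= rows <= maximum, (
        f"{table}: {rows} rows is outside the expected range "
        f"[{minimum}, {maximum}]; possible upstream data problem"
    )


def summarize_sales_by_region(rows):
    """The (hypothetical) analytic under development in the sandbox."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
    return totals


def code_test_transformation():
    """Innovation Pipeline test: the data is fixed, the code varies.
    Verify new analytics logic against a small, known input."""
    fixed_input = [{"region": "east", "sales": 100},
                   {"region": "east", "sales": 50},
                   {"region": "west", "sales": 25}]
    assert summarize_sales_by_region(fixed_input) == {"east": 150, "west": 25}


# Example usage of the data test against an in-memory database:
# conn = sqlite3.connect(":memory:")
# conn.execute("CREATE TABLE daily_orders (id INTEGER)")
# data_test_row_count(conn, minimum=0, maximum=10)
code_test_transformation()
```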
DataOps Complexity — Sandbox Management
When an engineer joins a software development team, one of their first steps
is to create a “sandbox.” A sandbox is an isolated development environment
where the engineer can write and test new application features, without
impacting teammates who are developing other features in parallel. Sandbox
creation in software development is typically straightforward — the engineer
usually receives a set of scripts from teammates and can configure a working
sandbox in a day or two. This ease of environment creation reflects the typical
mindset of a team practicing DevOps.
Sandboxes in data analytics are often more challenging from a tools and data
perspective. First of all, data teams collectively tend to use many more tools
than typical software dev teams. There are literally thousands of tools,
languages and vendors for data engineering, data science, BI, data
visualization, and governance. Without the centralization that is characteristic
of most software development teams, data teams tend to naturally diverge
with different tools and data islands scattered across the enterprise.
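As a rough sketch of what scripted sandbox creation can look like, the Python below bootstraps an isolated environment and copies in a sample data set. The directory layout, package list and file formats are assumptions made for illustration; a real analytics sandbox usually involves far more tools and data sources than this.

```python
# Hypothetical sketch of a scripted analytics sandbox bootstrap. Every path,
# package name and data location below is an assumption for illustration.
import subprocess
import venv
from pathlib import Path


def create_sandbox(name: str, requirements: list[str], sample_data_src: Path) -> Path:
    sandbox = Path.home() / "sandboxes" / name
    sandbox.mkdir(parents=True, exist_ok=True)

    # 1. Isolated Python interpreter and analytics libraries
    venv.create(sandbox / "env", with_pip=True)
    pip = sandbox / "env" / "bin" / "pip"   # Scripts\pip.exe on Windows
    subprocess.run([str(pip), "install", *requirements], check=True)

    # 2. A local copy of (sampled, redacted) test data; see the next section
    data_dir = sandbox / "data"
    data_dir.mkdir(exist_ok=True)
    for f in sample_data_src.glob("*.parquet"):
        (data_dir / f.name).write_bytes(f.read_bytes())

    return sandbox


# Example (paths and packages are placeholders):
# create_sandbox("churn-model-v2", ["pandas", "scikit-learn"],
#                Path("/shared/test-data/churn"))
```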
DataOps Complexity — Test Data Management
In order to create a dev environment for analytics, you have to create a copy of
the data factory. This requires the data professional to replicate data which
may have security, governance or licensing restrictions. It may be impractical
or expensive to copy the entire data set, so some thought and care is required
to construct a representative data set. Once a multi-terabyte data set is
sampled or filtered, it may have to be cleaned or redacted (have sensitive
information removed). The data also requires infrastructure which may not be
easy to replicate due to technical obstacles or license restrictions.
Figure 11: The concept of test data management is a first-order problem in DataOps.
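One common approach is to script the sampling and redaction so a representative test data set can be rebuilt on demand. The pandas sketch below samples a fraction of a table and hashes sensitive columns; the column names, sampling fraction and hashing scheme are assumptions, not a prescription, and a real multi-terabyte source would typically be sampled inside the warehouse rather than read into memory.

```python
# Illustrative sketch of building a sampled, redacted test data set.
# Column names, the sampling fraction and the hashing scheme are assumptions.
import hashlib

import pandas as pd


def build_test_dataset(source_path: str, out_path: str, fraction: float = 0.01) -> pd.DataFrame:
    df = pd.read_parquet(source_path)

    # Sample a manageable, representative slice (stratified or filtered
    # sampling may be needed to preserve important edge cases).
    sample = df.sample(frac=fraction, random_state=42)

    # Redact sensitive columns before the data leaves the governed environment.
    for col in ("email", "ssn"):
        if col in sample.columns:
            sample[col] = sample[col].astype(str).map(
                lambda v: hashlib.sha256(v.encode()).hexdigest()[:12]
            )

    sample.to_parquet(out_path, index=False)
    return sample
```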
In data analytics, the operations team supports and monitors the data pipeline.
This can be IT, but it also includes customers — the users who create and
consume analytics. DataOps brings these groups together so they can
collaborate more closely.
Figure 12: DataOps combines data analytics development and data operations.
Centralizing analytics development under the control of one group, such as IT,
enables the organization to standardize metrics, control data quality, enforce
security and governance, and eliminate islands of data. The issue is that too
much centralization chokes creativity.
In a DataOps enterprise, new analytics originate and undergo refinement in the
local pockets of innovation. When an idea proves useful or is worthy of wider
distribution, it is promoted to a centralized development group who can more
efficiently and robustly implement it at scale.

Figure 13: DataOps brings together centralized and distributed development.
Enterprise Example — Data Analytics Lifecycle Complexity
Having examined the DataOps development process at a high level, let’s look at
the development lifecycle in the enterprise context. Figure 15 illustrates the
complexity of analytics progression from inception to production. Analytics are
first created and developed by an individual and then merged into a team
project. After completing user acceptance testing (UAT), analytics move into
production. The goal of DataOps is to create analytics in the individual
development environment, advance them into production, receive feedback from
users and then continuously improve through further iterations. This can be
challenging due to the differences in personnel, tools, code, versions, manual
procedures/automation, hardware, operating systems/libraries and target data.
The columns in Figure 16 show the varied characteristics for each of these four
environments.

The challenge of pushing analytics into production across these four quite
different environments is daunting without DataOps. It requires a patchwork of
manual operations and scripts that are in themselves complex to manage.
Human processes are error-prone, so data professionals compensate by working
long hours, mistakenly relying on hope and heroism for success. All of this
results in unnecessary complexity, confusion and a great deal of wasted time
and energy. Slow progression through the lifecycle shown in Figure 16, coupled
with high-severity errors finding their way into production, can leave a data
analytics team little time for innovation.
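One way to tame these environment differences is to parameterize the pipeline so that only configuration, not analytics code, changes as work is promoted from an individual sandbox through team, UAT and production. The sketch below is hypothetical; the environment names, connection strings and paths are placeholders.

```python
# Hypothetical sketch: one pipeline definition, parameterized per environment,
# so the same analytics code runs unchanged as it is promoted. All names,
# connection strings and paths are placeholders for illustration.
from dataclasses import dataclass


@dataclass
class EnvConfig:
    name: str
    database_url: str
    data_path: str


ENVIRONMENTS = {
    "dev":  EnvConfig("dev",  "sqlite:///dev.db",          "./sandbox-data"),
    "team": EnvConfig("team", "sqlite:///team.db",         "/shared/team-data"),
    "uat":  EnvConfig("uat",  "postgresql://uat-host/dw",  "/shared/uat-data"),
    "prod": EnvConfig("prod", "postgresql://prod-host/dw", "/data/warehouse"),
}


def run_pipeline(env_name: str) -> None:
    env = ENVIRONMENTS[env_name]
    print(f"[{env.name}] reading from {env.data_path}, writing to {env.database_url}")
    # ...ingest, transform and publish steps would go here, identical in every
    # environment; only the configuration above differs between promotions.


run_pipeline("dev")
```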
DataOps simplifies the complexity of data analytics creation and operations. It
aligns data analytics development with user priorities. It streamlines and
automates the analytics development lifecycle — from the creation of sandboxes
to deployment. DataOps controls and monitors the data factory so data quality
remains high, keeping the data team focused on adding value.

You can get started with DataOps by implementing these seven steps. You can
also adopt a DataOps Platform, which will support DataOps methods within the
context of your existing tools and infrastructure.

A DataOps Platform automates the steps and processes that comprise DataOps:
sandbox management, orchestration, monitoring, testing, deployment, the data
factory, dashboards, Agile, and more. A DataOps Platform is built for data
professionals with the goal of simplifying all of the tools, steps and processes
that they need into an easy-to-use, configurable, end-to-end system. This high
degree of automation eliminates a great deal of manual work, freeing up the
team to create new and innovative analytics that maximize the value of an
organization’s data.
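As a toy illustration of the orchestration, testing and monitoring idea, the sketch below runs pipeline steps with data tests between them and halts on the first failure. It is not any vendor’s API; a real DataOps Platform layers scheduling, monitoring dashboards, alerting and deployment automation on top of this basic pattern.

```python
# Minimal, hypothetical orchestration loop: run each step, then run the data
# tests attached to it, and stop the pipeline (and alert) on the first failure.
def ingest():      print("ingest raw data")
def transform():   print("transform and model")
def publish():     print("publish dashboards and reports")

def check_row_counts() -> bool:  return True   # placeholder data test
def check_schema() -> bool:      return True   # placeholder data test

PIPELINE = [
    (ingest,    [check_row_counts]),
    (transform, [check_schema]),
    (publish,   []),
]


def orchestrate() -> None:
    for step, tests in PIPELINE:
        step()
        for test in tests:
            if not test():
                raise RuntimeError(
                    f"{test.__name__} failed after {step.__name__}; "
                    "halting the pipeline and alerting the team"
                )


orchestrate()
```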
About DataKitchen
DataKitchen, Inc. helps organizations turn data into value by offering
the world’s first DataOps platform. With DataKitchen, data and
analytic teams can orchestrate data to value and deploy features into
production while automating quality. These teams benefit from
delivering value quickly, with high quality, using the tools that they
love. DataKitchen is leading the DataOps movement to incorporate
Agile Software Development, DevOps, and manufacturing-based
statistical process control into analytics and data management.
DataKitchen is headquartered in Cambridge, Massachusetts.
© 2021 DataKitchen, Inc. All Rights Reserved. The information in this document is subject to
change without notice and should not be construed as a commitment by DataKitchen. While
reasonable precautions have been taken, DataKitchen assumes no responsibility for any errors
that may appear in this document. All products shown or mentioned are trademarks or registered
trademarks of their respective owners. | 910510A