DataOps is NOT Just DevOps for Data


Figure 1: DevOps is often depicted as an infinite loop,
while DataOps is illustrated as intersecting Value and Innovation Pipelines

One common misconception about DataOps is that it is just DevOps
applied to data analytics. While a little semantically misleading, the name
“DataOps” has one positive attribute. It communicates that
data analytics can achieve what software development attained with
DevOps. That is to say, DataOps can yield an order of magnitude
improvement in quality and cycle time when data teams utilize new tools and
methodologies. The specific ways that DataOps achieves these gains reflect the
unique people, processes and tools characteristic of data teams (versus
software development teams using DevOps). Here’s our in-depth take on
both the pronounced and subtle differences between DataOps and
DevOps.

The Intellectual Heritage of DataOps


DevOps is an approach to software development that accelerates the build
lifecycle (formerly known as release engineering) using automation. DevOps
focuses on continuous integration and continuous delivery of software by
leveraging on-demand IT resources (infrastructure as code) and by
automating integration, test and deployment of code. This merging of
software development and IT operations (“DEVelopment” and “OPerationS”)
reduces time to deployment, decreases time to market, minimizes defects,
and shortens the time required to resolve issues.

Using DevOps, leading companies have been able to reduce their
software release cycle time from months to (literally) seconds. This has
enabled them to grow and lead in fast-paced, emerging markets.
Companies like Google, Amazon and many others now release software many
times per day. By improving the quality and cycle time of code releases,
DevOps deserves a lot of credit for these companies’ success.

2 WHITEPAPER | DATAOPS IS NOT JUST DEVOPS FOR DATA | DATAKITCHEN.IO


Optimizing code builds and delivery is only one piece of the larger puzzle for
data analytics. DataOps seeks to reduce the end-to-end cycle time of data
analytics, from the origin of ideas to the literal creation of charts, graphs
and models that create value. The data lifecycle relies upon people in
addition to tools. For DataOps to be effective, it must manage collaboration
and innovation. To this end, DataOps introduces Agile Development into
data analytics so that data teams and users work together more efficiently
and effectively.

In Agile Development, the data team publishes new or updated analytics in
short increments called “sprints.” With innovation occurring in rapid
intervals, the team can continuously reassess its priorities and more easily
adapt to evolving requirements. This type of responsiveness is
impossible using a Waterfall project management methodology which locks
a team into a long development cycle with one “big-bang” deliverable at the
end.

Studies show that software development projects complete faster and with
fewer defects when Agile Development replaces the traditional sequential
Waterfall methodology. The Agile methodology is particularly effective in
environments where requirements evolve quickly, a situation well known to
data analytics professionals. In a DataOps setting, Agile methods enable
organizations to respond quickly to customer requirements and accelerate
time to value.

Figure 2: The intellectual heritage of DataOps.

Agile development and DevOps add significant value to data
analytics, but there is one more major component to DataOps.
Whereas Agile and DevOps relate to analytics development and
deployment, data analytics also manages and orchestrates a data
pipeline. Data continuously enters on one side of the pipeline,
progresses through a series of steps and exits in the form of reports,
models and views. The data pipeline is the “operations” side of data
analytics. It is helpful to conceptualize the data pipeline as a
manufacturing line where quality, efficiency, constraints and uptime must
be managed. To fully embrace this manufacturing mindset, we call this
pipeline the “data factory.”

In DataOps, the flow of data through operations is an important area of focus.
DataOps orchestrates, monitors and manages the data factory. One
particularly powerful lean-manufacturing tool is statistical process control
(SPC). SPC measures and monitors data and operational characteristics of the
data pipeline, ensuring that statistics remain within acceptable ranges. When
SPC is applied to data analytics, it leads to remarkable improvements in
efficiency, quality and transparency. With SPC in place, the data flowing
through the operational system is continuously verified. If an anomaly occurs,
the data analytics team is the first to know, through an automated alert.
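
To make the idea concrete, here is a minimal Python sketch of an SPC-style check on one pipeline measurement. It is an illustration under stated assumptions, not a description of any particular product: the function names (`control_limits`, `check_measurement`), the three-sigma threshold and the sample row counts are all hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Derive lower and upper control limits from past measurements."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread


def check_measurement(name, value, history, sigmas=3.0):
    """Return an alert message if value falls outside the limits, else None."""
    lower, upper = control_limits(history, sigmas)
    if value < lower or value > upper:
        return f"ALERT: {name}={value} is outside [{lower:.1f}, {upper:.1f}]"
    return None


# Hypothetical row counts observed in previous pipeline runs.
past_row_counts = [1000, 1010, 990, 1005, 995, 1002, 998]

print(check_measurement("row_count", 1003, past_row_counts))  # within limits
print(check_measurement("row_count", 400, past_row_counts))   # triggers an alert
```

In a real data factory, the alert message would be routed to the data team (for example by email or chat) rather than printed, and the limits would be maintained per measurement.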

While the name “DataOps” implies that it borrows most heavily from DevOps, it
is all three of these methodologies — Agile, DevOps and statistical process
control — that comprise the intellectual heritage of DataOps. Agile governs
analytics development, DevOps optimizes code verification, builds and delivery
of new analytics and SPC orchestrates and monitors the data factory. Figure 2
illustrates how Agile, DevOps and statistical process control flow into DataOps.

You can view DataOps in the context of a century-long evolution of ideas that
improve how people manage complex systems. It started with pioneers like
Deming and statistical process control — gradually these ideas crossed into the
technology space in the form of Agile, DevOps and now, DataOps.

DevOps vs. DataOps: The Human Factor
As mentioned above, DataOps is as much about managing people as it is about
tools. One subtle difference between DataOps and DevOps relates to the needs
and preferences of stakeholders.

Figure 3: DataOps and DevOps users have different mindsets

DevOps was created to serve the needs of software developers. Dev engineers
love coding and embrace technology. The requirement to learn a new
language or deploy a new tool is an opportunity, not a hassle. They take a
professional interest in all the minute details of code creation, integration and
deployment. DevOps embraces complexity.

DataOps users are often the opposite of that. They are data scientists or
analysts who are focused on building and deploying models and visualizations.
Scientists and analysts are typically not as technically savvy as engineers. They
focus on domain expertise. They are interested in getting models to be more
predictive or deciding how to best visually render data. The technology used to
create these models and visualizations is just a means to an end. Data
professionals are happiest using one or two tools — anything beyond that adds
unwelcome complexity. In extreme cases, the complexity grows beyond their
ability to manage it. DataOps accepts that data professionals live in a
multi-tool, heterogeneous world and it seeks to make that world more manageable
for them.

DevOps vs. DataOps: Process Differences
We can begin to understand the unique complexity facing data professionals
by looking at data analytics development and lifecycle processes. We find that
data analytics professionals face challenges both similar and unique relative to
software developers.

The DevOps lifecycle is commonly illustrated using a diagram in the shape of
an infinity symbol (see Figure 4). The end of the cycle (“plan”) feeds back to the
beginning (“create”), and the process iterates indefinitely.

Figure 4: The DevOps lifecycle is often depicted as an infinite loop

The DataOps lifecycle shares these iterative properties, but an important
difference is that DataOps consists of two active and intersecting pipelines
(Figure 5). The data factory, described above, is one pipeline. The other
pipeline governs how the data factory is updated — the creation and
deployment of new analytics into the data pipeline.

The data factory takes raw data sources as input and through a series
of orchestrated steps produces analytic insights that create “value” for
the organization. We call this the “Value Pipeline.” DataOps
automates orchestration and, using SPC, monitors the quality of data flowing
through the Value Pipeline.

The “Innovation Pipeline” is the process by which new analytic ideas
are introduced into the Value Pipeline. The Innovation Pipeline
conceptually resembles a DevOps development process, but upon closer
examination, several factors make the DataOps development process more
challenging than DevOps. Figure 5 shows a simplified view of the Value and
Innovation Pipelines.

Figure 5: The DataOps lifecycle — the Value and Innovation Pipelines

DevOps vs. DataOps: Development & Deployment Processes
DataOps builds upon the DevOps development model. As shown in Figure 6, the
DevOps process flow includes a series of steps that are common to software
development projects:

• Develop – create/modify an application
• Build – assemble application components
• Test – verify the application in a test environment
• Deploy – transition code into production
• Run – execute the application

DevOps introduces two foundational concepts: Continuous Integration (CI) and
Continuous Deployment (CD). CI continuously builds, integrates and tests
new code in a development environment. Build and test are automated so
they can occur rapidly and repeatedly. This allows issues to be identified and
resolved quickly. Figure 6 illustrates how CI encompasses the build and
test process stages of DevOps.

Figure 6: Comparing the DataOps and DevOps processes

CD is an automated approach to deploying or delivering software. Once an
application passes all qualification tests, DevOps deploys it into production.
Together CI and CD resolve the main constraint hampering Agile development.
Before DevOps, Agile created a rapid succession of updates and innovations
that would stall in a manual integration and deployment process. With
automated CI and CD, DevOps has enabled companies to update their
software many times per day.
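
The interplay of CI and CD can be sketched as a list of automated stages that run in order, with deployment reached only if every earlier stage succeeds. The Python below is a simplified illustration, not how any specific CI/CD tool works; `run_pipeline` and the stage functions are hypothetical placeholders for a real build, test suite and deployment.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            print(f"{name} failed; later stages are skipped")
            break
        completed.append(name)
    return completed


# Hypothetical stage implementations; each returns True on success.
def do_build() -> bool:
    return True


def do_tests() -> bool:
    return 2 + 2 == 4  # stand-in for an automated test suite


def do_deploy() -> bool:
    return True


stages = [("build", do_build), ("test", do_tests), ("deploy", do_deploy)]
completed = run_pipeline(stages)
print(completed)
```

Because a failing stage halts the run, code that does not pass its tests never reaches the deploy stage.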

The Duality of Orchestration in DataOps


It’s important to note that “orchestration” occurs twice in the DataOps process
shown in Figure 6. As we explained above, DataOps orchestrates the data
factory (the Value Pipeline). The data factory consists of a pipeline process with
many steps. Imagine a complex directed acyclic graph (DAG). The
“orchestrator” could be a software entity which controls the execution of the
steps, traverses the DAG, and handles exceptions. For example, the
orchestrator might create containers, invoke runtime processes with
context-sensitive parameters, transfer data from stage to stage, and “monitor” pipeline
execution. Orchestration of the data factory is the second “orchestration” in the
DataOps process in Figure 7.
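
As a rough sketch of what such an orchestrator does, the Python below walks a small DAG in dependency order, passes data from stage to stage, records what ran, and stops on an exception. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+); the `run_dag` function and the three step names are hypothetical, and a production orchestrator would also handle containers, retries and parallelism.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(dependencies, steps):
    """Run each step after its predecessors; stop at the first failure.

    dependencies maps a step name to the set of steps it depends on.
    steps maps a step name to a callable that receives the shared context.
    """
    context = {}   # outputs handed from stage to stage
    executed = []  # simple record of what ran, for monitoring
    for name in TopologicalSorter(dependencies).static_order():
        try:
            context[name] = steps[name](context)
            executed.append(name)
        except Exception as exc:
            print(f"step {name!r} failed: {exc}")
            break
    return executed, context


# Hypothetical three-step data factory.
dependencies = {"transform": {"extract"}, "report": {"transform"}}
steps = {
    "extract": lambda ctx: [3, 1, 2],
    "transform": lambda ctx: sorted(ctx["extract"]),
    "report": lambda ctx: f"max={ctx['transform'][-1]}",
}

executed, context = run_dag(dependencies, steps)
print(executed, context["report"])
```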

Figure 7: DataOps orchestrates the data factory.

As noted above, the Innovation Pipeline has a representative copy of the
data pipeline which is used to test and verify new analytics before
deployment into production. This is the orchestration that occurs in
conjunction with “testing” and prior to “deployment” of new analytics — as
shown in Figure 8.

Orchestration occurs in both the Value and Innovation Pipelines. Similarly,
testing fulfills a dual role in DataOps.

Figure 8: DataOps orchestration controls the numerous tools that
access, transform, model, visualize and report data.

The Duality of Testing in DataOps


Tests in DataOps have a role in both the Value and Innovation Pipelines. In the
Value Pipeline, tests monitor the data values flowing through the data factory to
catch anomalies or flag data values outside statistical norms. In the Innovation
Pipeline, tests validate new analytics before deploying them.

In DataOps, tests target either data or code. In a recent blog post, we discussed this
concept using Figure 9. Data that flows through the Value Pipeline is variable
and subject to statistical process control and monitoring. Tests target the data,
which is continuously changing. Analytics in the Value Pipeline, on the other
hand, are fixed and change only through a formal release process. In the Value
Pipeline, analytics are revision controlled to minimize any disruptions in service
that could affect the data factory.

In the Innovation Pipeline, code is variable and data is fixed. The analytics are
revised and updated until complete. Once the sandbox is set up, the data
doesn’t usually change. In the Innovation Pipeline, tests target the code
(analytics), not the data.
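
The two roles can be illustrated with a short Python sketch. Everything here is hypothetical: `total_revenue` stands in for an analytic, `data_test` for a Value Pipeline check on changing data, and `code_test` for an Innovation Pipeline check that runs changing code against fixed sandbox data.

```python
def total_revenue(rows):
    """Hypothetical analytic: revenue summed over order rows."""
    return sum(r["price"] * r["quantity"] for r in rows)


def data_test(rows):
    """Value Pipeline: the code is fixed, so check the incoming data."""
    return all(r["price"] >= 0 and r["quantity"] >= 0 for r in rows)


def code_test():
    """Innovation Pipeline: the data is fixed, so check the code."""
    fixed_rows = [
        {"price": 2.0, "quantity": 3},
        {"price": 1.5, "quantity": 2},
    ]
    return total_revenue(fixed_rows) == 9.0


# Hypothetical batch arriving in production with one bad value.
todays_rows = [{"price": 4.0, "quantity": 1}, {"price": -1.0, "quantity": 2}]

print(data_test(todays_rows))  # data check flags the negative price
print(code_test())             # code check passes against fixed data
```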

All tests must pass before promoting (merging) new code into production. A
good test suite serves as an automated form of impact analysis that runs on any
and every code change before deployment.

Some tests are aimed at both data and code. For example, a test that makes
sure that a database has the right number of rows helps your data and code
work together. Ultimately both data tests and code tests need to come together
in an integrated pipeline as shown in Figure 5. DataOps enables code and data
tests to work together so overall quality remains high.
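
A row-count test of this kind might look like the following Python sketch, which uses an in-memory SQLite database from the standard library so that it is self-contained. The `orders` table, the `row_count` helper and the expected count are assumptions made for illustration only.

```python
import sqlite3


def row_count(conn, table):
    """Count the rows in a table.

    The table name is interpolated into the SQL, so pass trusted names only.
    """
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Hypothetical table loaded by an earlier pipeline step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    [(10.0,), (20.0,), (5.5,)],
)

expected_rows = 3  # assumed to be known from the source system
actual_rows = row_count(conn, "orders")
print("row count ok" if actual_rows == expected_rows else "row count mismatch")
```

The same check is useful in both pipelines: in production it catches a bad load, and in a sandbox it catches a code change that drops or duplicates rows.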

Figure 9: In DataOps, analytics quality is a function of data and code testing

DataOps Complexity: Sandbox Management
When an engineer joins a software development team, one of their first steps
is to create a “sandbox.” A sandbox is an isolated development environment
where the engineer can write and test new application features, without
impacting teammates who are developing other features in parallel. Sandbox
creation in software development is typically straightforward — the engineer
usually receives a bunch of scripts from teammates and can configure a
sandbox in a day or two. This is the typical mindset of a team using DevOps.

Sandboxes in data analytics are often more challenging from a tools and data
perspective. First of all, data teams collectively tend to use many more tools
than typical software dev teams. There are literally thousands of tools,
languages and vendors for data engineering, data science, BI, data
visualization, and governance. Without the centralization that is characteristic
of most software development teams, data teams tend to naturally diverge
with different tools and data islands scattered across the enterprise.

Figure 10: A “sandbox” is an isolated development environment where the data
professional can write and test new analytics without impacting teammates.

DataOps Complexity: Test Data Management
In order to create a dev environment for analytics, you have to create a copy of
the data factory. This requires the data professional to replicate data which
may have security, governance or licensing restrictions. It may be impractical
or expensive to copy the entire data set, so some thought and care is required
to construct a representative data set. Once a multi-terabyte data set is
sampled or filtered, it may have to be cleaned or redacted (have sensitive
information removed). The data also requires infrastructure which may not be
easy to replicate due to technical obstacles or license restrictions.
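
As a simplified illustration of the sample-then-redact step, the Python sketch below draws a reproducible sample and masks a sensitive field. The helper names, the `email` field and the generated rows are hypothetical. Truncated hashing is used here only to show the shape of the step; real redaction has to follow your organization's security and governance rules.

```python
import hashlib
import random


def sample_rows(rows, k, seed=0):
    """Pick k rows reproducibly so every sandbox gets the same subset."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def redact(row, sensitive_fields):
    """Return a copy of the row with sensitive values replaced by a digest."""
    masked = dict(row)
    for field in sensitive_fields:
        if field in masked:
            raw = str(masked[field]).encode("utf-8")
            masked[field] = hashlib.sha256(raw).hexdigest()[:12]
    return masked


# Hypothetical production rows containing a sensitive column.
production_rows = [
    {"id": i, "email": f"user{i}@example.com", "region": "EU" if i % 2 else "US"}
    for i in range(100)
]

test_rows = [redact(r, ["email"]) for r in sample_rows(production_rows, k=10)]
print(len(test_rows), test_rows[0])
```

Fixing the seed keeps the sample stable between runs, which makes test results in the sandbox comparable over time.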

Figure 11: The concept of test data management is a first order problem in DataOps.

Test data management is a first-order problem in DataOps,
whereas in most DevOps environments it is an afterthought. To accelerate
analytics development, DataOps has to automate the creation of development
environments with the needed data, software, hardware and libraries so
innovation keeps pace with Agile iterations.

DataOps Connects the Organization in Two Ways
DevOps strives to help development and operations (information technology)
teams work together in an integrated fashion. In DataOps, this concept is
depicted in Figure 12. The development team are the analysts, scientists,
engineers, architects and others who create data warehouses and analytics.

In data analytics, the operations team supports and monitors the data pipeline.
This can be IT, but it also includes customers, the users who create and
consume analytics. DataOps brings these groups together so they can
collaborate more closely.

Figure 12: DataOps combines data analytics development and data operations.

Freedom vs. Centralization


DataOps also brings the organization together across another dimension. A
great deal of data analytics development occurs in remote corners of the
enterprise, close to business units, using self-service tools like Tableau, Alteryx,
or Excel. These local teams, engaged in decentralized, distributed analytics
creation, play an essential role in delivering innovation to users. Empowering
these pockets of creativity maintains the enterprise’s competitiveness, but
frankly, a lack of top-down control can lead to unmanaged chaos.

Centralizing analytics development under the control of one group, such as IT,
enables the organization to standardize metrics, control data quality, enforce
security and governance, and eliminate islands of data. The issue is that too
much centralization chokes creativity.

One important benefit of DataOps is its ability to harmonize the back-and-forth
between the decentralized and centralized development of data analytics, the
tension between centralization and freedom.

In a DataOps enterprise, new analytics originate and undergo refinement in the
local pockets of innovation. When an idea proves useful or is worthy of wider
distribution, it is promoted to a centralized development group who can more
efficiently and robustly implement it at scale.

Figure 13: DataOps brings together centralized and distributed development

DataOps brings localized and centralized development together, enabling
organizations to reap the efficiencies of centralization while preserving localized
development — the tip of the innovation spear. DataOps brings the enterprise
together across two dimensions as shown in Figure 14 — development/operations
as well as distributed/centralized development.

DataOps brings together three cycles of innovation among core groups in the
organization: centralized production teams, centralized data
engineering/analytics/science/governance development teams, and groups using
self-service tools distributed into the lines of business closest to the
customer. Figure 15 shows the interlocking cycles of innovation.

Figure 14: DataOps brings teams together across two dimensions:
development/operations as well as distributed/centralized development.

Figure 15: DataOps brings three cycles of innovation between production, central data, and self-service teams.

Enterprise Example: Data Analytics Lifecycle Complexity
Having examined the DataOps development process at a high level, let’s look at
the development lifecycle in the enterprise context. Figure 16 illustrates the
complexity of analytics progression from inception to production. Analytics are
first created and developed by an individual and then merged into a team
project. After completing user acceptance testing (UAT), analytics move into
production. The goal of DataOps is to create analytics in the individual
development environment, advance into production, receive feedback from users
and then continuously improve through further iterations. This can be
challenging due to the differences in personnel, tools, code, versions, manual
procedures/automation, hardware, operating systems/libraries and target data.
The columns in Figure 16 show the varied characteristics for each of these four
environments.

The challenge of pushing analytics into production across these four quite
different environments is daunting without DataOps. It requires a patchwork of
manual operations and scripts that are in themselves complex to manage. Human
processes are error-prone, so data professionals compensate by working long
hours, mistakenly relying on hope and heroism for success. All of this results
in unnecessary complexity, confusion and a great deal of wasted time and
energy. Slow progression through the lifecycle shown in Figure 16, coupled with
high-severity errors finding their way into production, can leave a data
analytics team little time for innovation.

Figure 16: Data Analytics Development Lifecycle Complexities

Implementing DataOps
DataOps simplifies the complexity of data analytics creation and operations. It
aligns data analytics development with user priorities. It streamlines and
automates the analytics development lifecycle, from the creation of sandboxes
to deployment. DataOps controls and monitors the data factory so data quality
remains high, keeping the data team focused on adding value.

You can get started with DataOps by implementing these seven steps. You can
also adopt a DataOps Platform which will support DataOps methods within the
context of your existing tools and infrastructure.

A DataOps Platform automates the steps and processes that comprise DataOps:
sandbox management, orchestration, monitoring, testing, deployment, the data
factory, dashboards, Agile, and more. A DataOps Platform is built for data
professionals with the goal of simplifying all of the tools, steps and
processes that they need into an easy-to-use, configurable, end-to-end system.
This high degree of automation eliminates a great deal of manual work, freeing
up the team to create new and innovative analytics that maximize the value of
an organization’s data.

Learn More About DataOps
For more information about DataOps please refer to datakitchen.io.

About DataKitchen
DataKitchen, Inc. helps organizations turn data into value by offering
the world’s first DataOps platform. With DataKitchen, data and
analytic teams can orchestrate data to value and deploy features into
production while automating quality. These teams benefit from
delivering value quickly, with high quality, using the tools that they
love. DataKitchen is leading the DataOps movement to incorporate
Agile Software Development, DevOps, and manufacturing-based
statistical process control into analytics and data management.
DataKitchen is headquartered in Cambridge, Massachusetts.

© 2021 DataKitchen, Inc. All Rights Reserved. The information in this document is subject to
change without notice and should not be construed as a commitment by DataKitchen. While
reasonable precautions have been taken, DataKitchen assumes no responsibility for any errors
that may appear in this document. All products shown or mentioned are trademarks or registered
trademarks of their respective owners. | 910510A