DataKitchen DataOps Cookbook
Methodologies and tools that reduce analytics
cycle time while improving quality.
DataKitchen Headquarters:
101 Main Street, 14th Floor
Cambridge, MA 02142
In the early 2000s, Chris and Gil worked at a company that specialized in analytics for the
pharmaceutical industry. It was a small company that offered a full suite of services related
to analytics — data engineering, data integration, visualization and what is now called “data
science.” Their customers were marketing and sales executives who tend to be challenging
because they are busy, need fast answers and don’t understand or care about the underlying
mechanics of analytics. They are business people, not technologists.
When a request from a customer came in, Chris and Gil would gather their team of engineers, data scientists and consultants to plan out how to get the project done. After days
of planning, they would propose their project plan to the customer. “It will take two weeks.”
The customer would shoot back, “I need it in two hours!”
Walking back to their office, tail between their legs, they would pick up the phone. It was a
customer boiling over with anger. There was a data error. If it wasn’t fixed immediately the
customer would find a different vendor.
The company had hired a bunch of smart people to deliver these services. “I want to innovate — can I try out this new open source tool?” the team members would ask. “No,” the
managers would have to answer. “We can’t afford to introduce technical risk.”
They lived this life for many years. How do you create innovative data analytics? How do you
not have embarrassing errors? How do you let your team easily try new ideas? There had to
be a better way.
They found their answer by studying the software and manufacturing industries which had
been struggling with these same issues for decades. They discovered that data-analytics
cycle time and quality can be optimized with a combination of tools and methodologies that
they now call DataOps. They decided to start a new company. The new organization adopted
the kitchen metaphor for data analytics. After all, cooking up charts and graphs requires the
right ingredients and recipes.
Having experienced this transformation, the DataKitchen founders sought a way to help
other data professionals. There are so many talented people stuck in no-win situations. This
book is for data professionals who are living the nightmare of slow, buggy analytics and
frustrated users. It will explain why working weekends isn’t the answer. It provides you with
practical steps that you can take tomorrow to improve your analytics cycle time.
DataKitchen markets a DataOps Platform that will help analytics organizations implement
DataOps. However, this book isn’t really about us and our product. It is about you, your
challenges, your potential and getting your analytics team back on track.
The values and principles that are central to DataOps are listed in the DataOps Manifesto
which you can read below. If you agree with it, please join the thousands of others who
share these beliefs by signing the manifesto. There may be aspects of the manifesto that
require further explanation. Please read on. By the end of this book, it should all make sense.
You’ll also notice that we’ve included some real recipes in this book. These are some of our
favorites. We hope you enjoy them!
Background
Through firsthand experience working with data across organizations, tools, and industries
we have uncovered a better way to develop and deliver analytics that we call DataOps.
Whether referred to as data science, data engineering, data management, big data, business
intelligence, or the like, through our work we have come to value in analytics:
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront design
• Cross-functional ownership of operations over siloed responsibilities
DataOps Principles
1. CONTINUALLY SATISFY YOUR CUSTOMER
Our highest priority is to satisfy the customer through the early and continuous delivery of
valuable analytic insights from a couple of minutes to weeks.
3. EMBRACE CHANGE
We welcome evolving customer needs, and in fact, we embrace them to generate competi-
tive advantage. We believe that the most efficient, effective, and agile method of communi-
cation with customers is face-to-face conversation.
5. DAILY INTERACTIONS
Customers, analytic teams, and operations must work together daily throughout the project.
6. SELF-ORGANIZE
We believe that the best analytic insight, algorithms, architectures, requirements, and de-
signs emerge from self-organizing teams.
7. REDUCE HEROISM
As the pace and breadth of need for analytic insights ever increases, we believe analytic
teams should strive to reduce heroism and create sustainable and scalable data analytic
teams and processes.
8. REFLECT
Analytic teams should fine-tune their operational performance by self-reflecting, at regular
intervals, on feedback provided by their customers, themselves, and operational statistics.
9. ANALYTICS IS CODE
Analytic teams use a variety of individual tools to access, integrate, model, and visualize data.
Fundamentally, each of these tools generates code and configuration which describes the
actions taken upon data to deliver insight.
10. ORCHESTRATE
The beginning-to-end orchestration of data, tools, code, environments, and the analytic
team’s work is a key driver of analytic success.
13. SIMPLICITY
We believe that continuous attention to technical excellence and good design enhances
agility; likewise simplicity—the art of maximizing the amount of work not done—is essential.
17. REUSE
We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the
repetition of previous work by the individual or team.
Join the Thousands of People Who Have Already Signed The Manifesto
Companies increasingly look to analytics to drive growth strategies. As the leader of the
data-analytics team, you manage a group responsible for supplying business partners with
the analytic insights that can create a competitive edge. Customer and market opportunities
evolve quickly and drive a relentless series of questions. Analytics, by contrast, move slowly,
constrained by development cycles, limited resources and brittle IT systems. The gap be-
tween what users need and what IT can provide can be a source of conflict and frustration.
Inevitably this mismatch between expectations and capabilities can cause dissatisfaction,
leaving the data-analytics team in an unfortunate position and preventing a company from
fully realizing the strategic benefit of its data.
As a manager overseeing analytics, it’s your job to understand and address the factors that
prevent the data-analytics team from achieving peak levels of performance. If you talk to
your team, they will tell you exactly what is slowing them down. You’ll likely hear variations
of the following eight challenges:
They don’t know what they want. Users are not data experts. They don’t know what insights
are possible until someone from your team shows them. Sometimes they don’t know what
they want until after they see it in production (and maybe not even then). Often, business
stakeholders do not know what they will need next week, let alone next quarter or next year.
It’s not their fault. It’s the nature of pursuing opportunities in a fast-paced marketplace.
They need everything ASAP. Business is a competitive endeavor. When an opportunity opens,
the company needs to move on it faster than the competition. When users bring a question
to the data-analytics team, they expect an immediate response. They can’t wait weeks or
months — the opportunity will close as the market seeks alternative solutions.
The questions never end. Sometimes providing business stakeholders with analytics generates
more questions than answers. Analytic insights enable users to understand the business
in new ways. This spurs creativity, which leads to requests for more analytics. A healthy
relationship between the analytics team and users will foster a continuous series of questions that
drive demand for new analytics. However, this relationship can sour quickly if the delivery of
new analytics can’t meet the required time frames.
Business stakeholders want fast answers. Meanwhile, the data-analytics team has to work
with IT to gain access to operational systems, plan and implement architectural changes, and
develop/test/deploy new analytics. This process is complex, lengthy and subject to numer-
ous bottlenecks and blockages.
A database optimized for data analytics is structured to optimize reads and aggregations. It’s
also important for the schema of an analytics database to be easily understood by humans.
For example, the field names would be descriptive of their contents and data tables would
be linked in ways that make intuitive sense.
4 – Data Errors
Whether your data sources are internal or from external third parties, data will eventually
contain errors. Data errors can prevent your data pipeline from flowing correctly. Errors may
also be subtle, such as duplicate records or individual fields that contain erroneous data.
Data errors could be caused by a new algorithm that doesn’t work as expected, a database
schema change that broke one of your feeds, an IT failure or one of many other possibilities.
Data errors can be difficult to trace and resolve quickly.
Further, manual processes can also lead to high employee turnover. Many managers have
watched high-performing data-analytics team members burn out due to having to repeat-
edly execute manual data procedures. Manual processes strain the productivity of the data
team in numerous ways.
According to the research firm Gartner, Inc., half of all chief data officers (CDOs) in large
organizations will not be deemed a success in their role. Per Forrester Research, 60% of the
data and analytics decision-makers surveyed said they are not very confident in their analytics
insights. Only ten percent responded that their organizations sufficiently manage the quality
of data and analytics. Just sixteen percent believe they perform well in producing accurate
models.
Heroism - Data-analytics teams work long hours to compensate for the gap between performance and expectations. When a deliverable is met, the data-analytics team members are considered heroes. However, yesterday’s heroes are quickly forgotten when there is a new deliverable
to meet. Also, this strategy is difficult to sustain over a long period of time, and it, ultimately,
just resets expectations at a higher level without providing additional resources. The heroism
approach is also difficult to scale up as an organization grows.
Hope - When a deadline must be met, it is tempting to just quickly produce a solution with
minimal testing, push it out to the users and hope it does not break. This approach has inher-
ent risks. Eventually, a deliverable will contain data errors, upsetting the users and harming
the hard-won credibility of the data-analytics team.
Caution - The team decides to give each data-analytics project a longer development and
test schedule. Effectively, this is a decision to deliver higher quality, but fewer features to
users. One difficulty with this approach is that users often don’t know what they want until
they see it, so a detailed specification might change considerably by the end of a project. The
slow and methodical approach might also make the users unhappy because the analytics are
delivered more slowly than their stated delivery requirements and as requests pile up, the
data-analytics team risks being viewed as bureaucratic and inefficient.
None of these approaches adequately serve the needs of both users and data-analytics
professionals, but there is a way out of this bind. The challenges above are not unique to
analytics, and in fact, are shared by other organizations.
Overcoming the Challenges
Some say that an analytics team can overcome these challenges by buying a new tool. While it is true that new tools are helpful, they are not enough by themselves. You cannot truly transform your staff into a high-performance team without an overhaul of the methodologies and processes that guide your workflows. In this book, we will discuss how to combine tools and new processes in a way that improves the productivity of your data analytics team by orders of magnitude.
INSTRUCTIONS
1. Crock Pot: Combine all ingredients and cook on high for 5-8 hours.
Stir occasionally.
2. Stove Top: Combine ground beef, onion, and pepper. Cook on medium high
until beef is cooked through. Add the remaining ingredients and cook on
low-simmer for 1-2 hours. Stir occasionally.
3. For vegan chili: Substitute 5 tablespoons of canola oil for the ground beef.
4. Serve with rice.
You can view DataOps in the context of a century-long evolution of ideas that improve
how people manage complex systems. It started with pioneers like W. Edwards Deming and
statistical process control - gradually these ideas crossed into the technology space in the
form of Agile, DevOps and now, DataOps. In the next section we will examine how these
methodologies impact productivity, quality and reliability in data analytics.
For example, some may remember video stores where movies were rented for later viewing.
Today, 65 percent of global respondents to a recent Nielsen survey watch video on demand
(VOD), many of them daily. With VOD, a person’s desire to watch a movie is fulfilled within
seconds. Amazon participates in the VOD market with their Amazon Prime Video service.
Instant fulfillment of customer orders seems to be part of Amazon’s business model. They
have even brought that capability to IT. About 10 years ago, Amazon Web Services (AWS)
began offering computing, storage, and other IT infrastructure on an as-needed basis.
Whether the need is for one server or thousands and whether for hours, days, or months,
you only pay for what you use, and the resources are available in just a few minutes.
In order to deliver value consistently, quickly and accurately, data-analytics teams must learn
to create and publish analytics in a new way. We call this new approach DataOps. DataOps
is a combination of tools and methods, which streamline the development of new analytics
while ensuring impeccable data quality. DataOps helps shorten the cycle time for producing
analytic value and innovation, while avoiding the trap of “hope, heroism and caution.”
Figure 1: In Agile development, a burndown chart shows work remaining over time.
The waterfall model is better suited to situations where the requirements are fixed and well understood up front. This is nothing like the technology industry, where the competitive environment evolves rapidly. In the 1980s, a typical software project required about 12 calendar months. In technology-driven businesses (i.e., nearly everyone these days), customers demand new features and services, and competitive pressures change priorities on a seemingly daily basis. The waterfall model has no mechanism to respond to these changes; instead, changes trigger a seemingly endless cycle of replanning, causing delays and project budget overruns.
In the early 2000’s, the software industry embraced a new approach to code production
called Agile Development. Agile is an umbrella term for several different iterative and incre-
mental software development methodologies.
In Agile Software Development, the team and its processes and tools are organized around
the goal of publishing releases to the users every few weeks (or at most every few months).
A development cycle is called a sprint or an iteration. At the beginning of an iteration, the team commits to completing working versions of the most valuable changes to the code base. Features are associated with user stories, which help the development team understand each feature from the user’s perspective.
Agile is widely credited with boosting software productivity. One study sponsored by the
Central Ohio Agile Association and Columbus Executive Agile Special Interest Group found
that Agile projects were completed 31 percent faster and with a 75 percent lower defect
rate than the industry norm. The vast majority of companies are getting on-board. In a survey
of 400 IT professionals by TechBeacon, two-thirds described their company as either “pure
agile” or “leaning towards agile.” Among the remaining one third of companies, most use a
hybrid approach, leaving only nine percent using a pure waterfall approach.
If, for example, the customer reported a problem, it might not be replicable in the support,
test or development groups due to differences in the hardware and software environments
being run. This lack of alignment fostered misunderstandings and delays and often led to a
lack of trust and communication between the various stakeholders.
In a complex world requiring the physical provisioning of servers, installation of stacks and
frameworks, and numerous target devices, the standardization and control of the run-time
environment has been difficult and slow. It became necessary to break down barriers be-
tween the respective teams in the software development pipeline. This merging of develop-
ment and IT/Operations is widely known as DevOps, which also has had enormous impact
on the world of software development. DevOps improves collaboration between employees
from the planning through the deployment phases of software. It seeks to reduce time to
deployment, decrease time to market, minimize defects, and shorten the time required to fix
problems.
About a decade ago, Amazon Web Services (AWS) and other cloud providers, began offering
computing, storage and other IT resources as an on-demand service. No more waiting weeks
or months for the IT department to fulfill a request for servers. Cloud providers now allow
you to order computing services, paying only for what you use, whether that is one proces-
sor for an hour or thousands of processors for months. These on-demand cloud services
have enabled developers to write code that provisions processing resources with strictly specified environments, on-demand, in just a few minutes. This capability has been called Infrastructure as Code (IaC). IaC has made it possible for everyone in the software development pipeline, all the different groups mentioned above, to use an identical environment tailored to application requirements. With IaC, design, test, QA and support teams can all work in identical environments.
With IT infrastructure being defined by code, the hard divisions between IT operations and
software development are able to blur. The merger of development and operations is how
the term DevOps originated.
With the automated provisioning of resources, DevOps paved the way for a fully automated
test and release process. The process of deploying code that used to take weeks, could now
be completed in minutes. Major organizations including Amazon, Facebook and Netflix are
now operating this way. At a recent conference, Amazon disclosed that their AWS team per-
forms 50,000,000 code releases per year. This is more than one per second! This methodol-
ogy of rapid releases is called continuous delivery or alternatively, continuous deployment, when
new features (and fixes) are not only delivered internally but fully deployed to customers.
DevOps starts with continuous delivery and Agile development and adds automated provi-
sioning of resources (infrastructure as code) and cloud services (platform as a service) to en-
sure that the same environment is being utilized at every stage of the software development
pipeline. The cloud provides a natural platform that allows individuals to create and define
identical run-time environments. DevOps is beginning to achieve critical mass in terms of its
adoption within the world of software development.
The impact of DevOps on development organizations was shown in a 2014 survey, “The
2014 State of DevOps Report” by Puppet Labs, IT Revolution Press and ThoughtWorks,
based on 9,200 survey responses from technical professionals. The survey found that IT or-
ganizations implementing DevOps were deploying code 30 times more frequently and with
50 percent fewer failures. Further, companies with these higher performing IT organizations
tended to have stronger business performance, greater productivity, higher profitability and
larger market share. In other words, DevOps is not just something that engineers are doing
off in a dark corner. It is a core competency that helps good companies become better.
The data analytics team transforms raw data into actionable information that improves
decision making and provides market insight. Imagine an organization with the best data
analytics in the industry. That organization would have a tremendous advantage over com-
petitors. That could be you.
In data analytics, tests should verify that the results of each intermediate step in the
production of analytics matches expectations. Even very simple tests can be useful. For
example, a simple row-count test could catch an error in a join that inadvertently produces a
Cartesian product. Tests can also detect unexpected trends in data, which might be flagged
as warnings. Imagine that the number of customer transactions exceeds its historical average
by 50%. Perhaps that is an anomaly that upon investigation would lead to insight about
business seasonality.
Tests in data analytics can be applied to data or models either at the input or output of a
phase in the analytics pipeline. Tests can also verify business logic.
Input tests check data prior to each stage in the analytics pipeline. For example:
• Count Verification – Check that row counts are in the right range, ...
• Conformity – US Zip5 codes are five digits, US phone numbers are 10 digits, ...
• History – The number of prospects always increases, ...
• Balance – Week over week, sales should not vary by more than 10%, ...
• Temporal Consistency – Transaction dates are in the past, end dates are later than start
dates, ...
• Application Consistency – Body temperature is within a range around 98.6F/37C, ...
• Field Validation – All required fields are present, correctly entered, ...
Output tests check the results of an operation, like a Cartesian join. For example:
• Completeness – Number of customer prospects should increase with time
• Range Verification – Number of physicians in the US is less than 1.5 million
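Checks like these are simple enough to sketch directly. The following is a minimal illustration of a few of the test styles listed above; the function names and thresholds are our own assumptions, not a prescribed API, and each check returns True when the data passes:

```python
import re

def count_verification(rows, low, high):
    # Count Verification: the row count must fall in an expected range.
    return low <= len(rows) <= high

def conformity_zip5(value):
    # Conformity: a US Zip5 code is exactly five digits.
    return bool(re.fullmatch(r"\d{5}", value))

def balance_check(this_week, last_week, tolerance=0.10):
    # Balance: week-over-week sales should not vary by more than 10%.
    return abs(this_week - last_week) <= tolerance * last_week

def range_verification(count, maximum=1_500_000):
    # Range Verification: e.g., the number of physicians in the US
    # should be less than 1.5 million.
    return count < maximum
```

In practice, checks like these run before and after each stage of the pipeline; a failing check either stops the stage or raises a warning, depending on its severity.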
The data analytics pipeline is a complex process with steps often too numerous to be moni-
tored manually. SPC allows the data analytics team to monitor the pipeline end-to-end from
a big-picture perspective, ensuring that everything is operating as expected. As an automat-
ed test suite grows and matures, the quality of the analytics is assured without adding cost.
This makes it possible for the data analytics team to move quickly — enhancing analytics to
address new challenges and queries — without sacrificing quality.
When DataOps is implemented correctly, it addresses many of the issues discussed earlier
that have plagued data-analytics teams.
DataOps views the data-analytics pipeline as a process and as such focuses on how to make
the entire process run more rapidly and with higher quality, rather than optimizing the pro-
ductivity of any single individual or tool by itself.
DataKitchen markets an automated DataOps platform that helps companies accelerate their
DataOps implementation, but this book is about DataOps not us. This book is not trying to
sell you anything. You can implement DataOps all by yourself, using your existing tools, by
implementing the seven steps described in the next section. If you desire assistance, there is
an ecosystem of DataOps vendors who offer a variety of innovative solutions and services.
Imagine the next time that the Vice President of Marketing requests a new customer
segmentation, by tomorrow. With DataOps, the data-analytics team can respond ‘yes’ with
complete confidence that the changes can be accomplished quickly, efficiently and robustly.
How then does an organization implement DataOps? You may be surprised to learn that an
analytics team can migrate to DataOps in seven simple steps.
Adding tests in data analytics is analogous to the statistical process control that is implemented in a manufacturing operations flow. Tests ensure the integrity of the final output by
verifying that work-in-progress (the results of intermediate steps in the pipeline) matches
expectations. Testing can be applied to data, models and logic. The figure below shows
examples of tests in the data-analytics pipeline.
For every step in the data-analytics pipeline, there should be at least one test. The philoso-
phy is to start with simple tests and grow over time. Even a simple test will eventually catch
an error before it is released out to the users. For example, just making sure that row counts
are consistent throughout the process can be a very powerful test. One could easily make a mistake on a join and inadvertently produce a cross product. A simple row-count test would quickly catch that.
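As a sketch of that idea (the table and key names are illustrative, with plain dicts standing in for database tables), a duplicated join key inflates the output row count, and a row-count test trips immediately:

```python
def inner_join(orders, customers, key="customer_id"):
    # A simple inner join of two "tables" on the given key.
    by_key = {}
    for c in customers:
        by_key.setdefault(c[key], []).append(c)
    return [{**o, **c} for o in orders for c in by_key.get(o[key], [])]

def row_count_test(orders, joined):
    # If customer_id is unique in the customer table, the inner join
    # can never return more rows than the orders table. More rows
    # means a duplicated key produced a partial cross product.
    return len(joined) <= len(orders)
```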
Tests can detect warnings in addition to errors. A warning might be triggered if data exceeds
certain boundaries. For example, the number of customer transactions in a week may be
OK if it is within 90% of its historical average. If the transaction level exceeds that, then a
warning could be flagged. This might not be an error. It could be a seasonal occurrence for
example, but the reason would require investigation. Once recognized and understood, the
users of the data could be alerted.
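Such a warning test could be sketched as follows (reading "within 90% of its historical average" as a deviation of no more than 10%; the names and band are illustrative):

```python
OK, WARNING = "ok", "warning"

def transaction_level_check(weekly_count, historical_avg, band=0.10):
    # Deviations beyond the band are flagged as warnings, not errors:
    # the pipeline keeps running, but someone should investigate.
    if abs(weekly_count - historical_avg) <= band * historical_avg:
        return OK
    return WARNING
```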
DataOps is not about being perfect. In fact, it acknowledges that code is imperfect. It’s
natural that a data-analytics team will make a best effort, yet still miss something. If so, they
can determine the cause of the issue and add a test so that it never happens again. In a rapid
release environment, a fix can quickly propagate out to the users.
With a suite of tests in place, DataOps allows you to move fast because you can make changes and quickly rerun the test suite. If the changes pass the tests, then the data-analytics team can deploy them with confidence.
Automated tests continuously monitor the data pipeline for errors and anomalies. They work
nights, weekends and holidays without taking a break. If you build a DataOps dashboard, you
can view the high-level state of your data operations at any time. If warning and failure alerts
are automated, you don’t have to constantly check your dashboard. Automated testing frees
the data-analytics team from the drudgery of manual testing, so they can focus on higher
value-add activities.
Figure 4: Tests enable the data professional to apply statistical process controls
to the data pipeline
The artifacts (files) that make this reproducibility possible are usually subject to continuous
improvement. Like other software projects, the source files associated with the data pipeline
should be maintained in a version control (source control) system such as Git. A version con-
trol tool helps teams of individuals organize and manage the changes and revisions to code.
It also keeps code in a known repository and facilitates disaster recovery. However, the most
important benefit of version control relates to a process change that it facilitates. It allows
data-analytics team members to branch and merge.
Branching and merging can be a major productivity boost for data analytics because it allows
teams to make changes to the same source code files in parallel without slowing each other
down. Each individual team member has control of his or her work environment. They can
run their own tests, make changes, take risks and experiment. If they wish, they can discard
their changes and start over. Another key to allowing team members to work well in parallel
relates to providing them with an isolated machine environment.
When many team members work on the production database, it can lead to conflicts. A
database engineer changing a schema may break reports. A data scientist developing a new
model might get confused as new data flows in. Giving team members their own environment isolates the rest of the organization from being impacted by their work.
Some steps in the data-analytics pipeline are messy and complicated. For example, one
operation might call a custom tool, run a python script, use FTP and other specialized
logic. This operation might be hard to set up (because it requires a specific set of tools)
and difficult to create (because it requires a specific skill set). This scenario is another
common use case for creating a container. Once the code is placed in a container, it is
much easier to use by other programmers who aren’t familiar with the custom tools inside
the container but know how to use the container’s external interfaces. It is also easier to
deploy that code to each environment.
For example, imagine a pharmaceutical company that obtains prescription data from a
3rd party company. The data is incomplete, so the data producer uses algorithms to fill in
those gaps. In the course of improving their product, the data producer develops a different
algorithm to fill in the gaps. The data has the same shape (rows and columns), but certain
fields are modified using the new algorithm. With the correct built-in parameters, an engi-
neer or analyst can easily build a parallel data mart with the new algorithm and have both
the old and new versions accessible through a parameter change.
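A minimal sketch of that parameter-driven build might look like this; the algorithm names and gap-fill logic are invented for illustration, but the pattern is the point: a single parameter selects which version of the logic a run uses:

```python
def fill_gaps_v1(values):
    # "Old" algorithm: carry the last known value forward.
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def fill_gaps_v2(values):
    # "New" algorithm: replace gaps with the mean of the known values.
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [v if v is not None else mean for v in values]

ALGORITHMS = {"v1": fill_gaps_v1, "v2": fill_gaps_v2}

def build_data_mart(raw_values, algorithm="v1"):
    # One parameter selects the algorithm, so old and new data marts
    # can be built side by side from the same raw feed and compared.
    return ALGORITHMS[algorithm](raw_values)
```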
Data engineers, scientists and analysts spend an excessive amount of time and energy
working to avoid these disastrous scenarios. They attempt “heroism” — working weekends.
They do a lot of hoping and praying. They devise creative ways to avoid overcommitting. The
problem is that heroic efforts are eventually overcome by circumstances. Without the right
controls in place, a problem will slip through and bring the company’s critical analytics to a halt.
The DataOps enterprise puts the right set of tools and processes in place to enable data and
new analytics to be deployed with a high level of quality. When an organization implements
DataOps, engineers, scientists and analysts can relax because quality is assured. They can
Work Without Fear or Heroism. DataOps accomplishes this by optimizing two key workflows.
As mentioned above, the worst possible outcome is for poor quality data to enter the Value
Pipeline. DataOps prevents this by implementing data tests (step 1). Inspired by the statistical process control in a manufacturing workflow, data tests ensure that data values lie within an acceptable statistical range. Data tests validate data values at the inputs and outputs of
each processing stage in the pipeline. For example, a US phone number should be ten digits.
Any other value is incorrect or requires normalization.
Once data tests are in place, they work 24x7 to guarantee the integrity of the Value Pipeline.
Quality becomes literally built in. If anomalous data flows through the pipeline, the data tests
catch it and take action — in most cases this means firing off an alert to the data analytics
team who can then investigate. The tests can even, in the spirit of auto manufacturing, “stop
the line.” Statistical process control eliminates the need to worry about what might happen.
With the right data tests in place, the data analytics team can Work Without Fear or Heroism.
This frees DataOps engineers to focus on their other major responsibility — the Innovation
Pipeline.
DataOps implements continuous deployment of new ideas by automating the workflow for
building and deploying new analytics. It reduces the overall cycle time of turning ideas into
innovation. While doing this, the development team must avoid introducing new analytics
that break production. The DataOps enterprise uses logic tests (step 1) to validate new code
before it is deployed. Logic tests ensure that data matches business assumptions. For exam-
ple, a field that identifies a customer should match an existing entry in a customer dimension
table. A mismatch should trigger some type of follow-up.
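A minimal sketch of such a logic test follows; the table and column names are illustrative rather than taken from any particular system:

```python
def find_orphan_customer_keys(fact_rows, customer_dim_ids):
    """Return fact rows whose customer key has no matching entry in
    the customer dimension table -- each one warrants follow-up."""
    known = set(customer_dim_ids)
    return [row for row in fact_rows if row["customer_id"] not in known]

customer_dim = [101, 102, 103]        # IDs present in the dimension table
sales_facts = [
    {"sale": 1, "customer_id": 101},  # matches the dimension: passes
    {"sale": 2, "customer_id": 999},  # orphan key: flagged
]
print(find_orphan_customer_keys(sales_facts, customer_dim))
```

Running checks like this before every deployment means a bad join key is caught in development rather than in production.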
With logic tests in place, the development pipeline can be automated for continuous deploy-
ment, simplifying the release of new enhancements and enabling the data analytics team to
focus on the next valuable feature. With DataOps the dev team can deploy without worrying
about breaking the production systems — they can Work Without Fear or Heroism. This is a
key characteristic of a fulfilled, productive team.
Figure 8: The Value and Innovation Pipelines illustrate how new analytics are
introduced into data operations.
Using DevOps, leading companies have been able to reduce their software release cycle time
from months to (literally) seconds. This has enabled them to grow and lead in fast-paced,
emerging markets. Companies like Google, Amazon and many others now release software
many times per day. By improving the quality and cycle time of code releases, DevOps de-
serves a lot of credit for these companies’ success.
Optimizing code builds and delivery is only one piece of the larger puzzle for data analyt-
ics. DataOps seeks to reduce the end-to-end cycle time of data analytics, from the origin
of ideas to the literal creation of charts, graphs and models that create value. The data
lifecycle relies upon people in addition to tools. For DataOps to be effective, it must manage
collaboration and innovation. To this end, DataOps introduces Agile Development into data
analytics so that data teams and users work together more efficiently and effectively.
In Agile Development, the data team publishes new or updated analytics in short increments
called “sprints.” With innovation occurring in rapid intervals, the team can continuously
reassess its priorities and more easily adapt to evolving requirements. This type of
responsiveness is a key benefit of the Agile method.
Studies show that software development projects complete faster and with fewer
defects when Agile Development replaces the traditional sequential Waterfall methodology.
The Agile methodology is particularly effective in environments where requirements
are quickly evolving — a situation well known to data analytics professionals. In a DataOps
setting, Agile methods enable organizations to respond quickly to customer requirements
and accelerate time to value.
Agile development and DevOps add significant value to data analytics, but there is one more
major component to DataOps. Whereas Agile and DevOps relate to analytics development
and deployment, data analytics also manages and orchestrates a data pipeline. Data con-
tinuously enters on one side of the pipeline, progresses through a series of steps and exits
in the form of reports, models and views. The data pipeline is the “operations” side of data
analytics. It is helpful to conceptualize the data pipeline as a manufacturing line where quali-
ty, efficiency, constraints and uptime must be managed. To fully embrace this manufacturing
mindset, we call this pipeline the “data factory.”
In DataOps, the flow of data through operations is an important area of focus. DataOps
orchestrates, monitors and manages the data factory. One particularly powerful lean-man-
ufacturing tool is statistical process control (SPC). SPC measures and monitors data and
operational characteristics of the data pipeline, ensuring that statistics remain within
acceptable ranges. When SPC is applied to data analytics, it leads to remarkable
improvements in quality and efficiency.
While the name “DataOps” implies that it borrows most heavily from DevOps, it is all three
of these methodologies — Agile, DevOps and statistical process control — that comprise the
intellectual heritage of DataOps. Agile governs analytics development, DevOps optimizes
code verification, builds and delivery of new analytics and SPC orchestrates and monitors
the data factory. Figure 10 illustrates how Agile, DevOps and statistical process control flow
into DataOps.
You can view DataOps in the context of a century-long evolution of ideas that improve how
people manage complex systems. It started with pioneers like Deming and statistical
process control — gradually these ideas crossed into the technology space in the form of Agile,
DevOps and now, DataOps.
DevOps was created to serve the needs of software developers. Dev engineers love coding
and embrace technology. The requirement to learn a new language or deploy a new tool
is an opportunity, not a hassle. They take a professional interest in all the minute details of
code creation, integration and deployment. DevOps embraces complexity.
DataOps users are often the opposite of that. They are data scientists or analysts who are
focused on building and deploying models and visualizations. Scientists and analysts are
typically not as technically savvy as engineers. They focus on domain expertise. They are
interested in getting models to be more predictive or deciding how to best visually render
data. The technology used to create these models and visualizations is just a means to an
end. Data professionals are happiest using one or two tools — anything beyond that adds un-
welcome complexity. In extreme cases, the complexity grows beyond their ability to manage
it. DataOps accepts that data professionals live in a multi-tool, heterogeneous world and it
seeks to make that world more manageable for them.
The data factory takes raw data sources as input and through a series of orchestrated steps
produces analytic insights that create “value” for the organization. We call this the “Value
Pipeline.” DataOps automates orchestration and, using SPC, monitors the quality of data
flowing through the Value Pipeline.
The “Innovation Pipeline” is the process by which new analytic ideas are introduced into the
Value Pipeline. The Innovation Pipeline conceptually resembles a DevOps development pro-
cess, but upon closer examination, several factors make the DataOps development process more
challenging than DevOps. Figure 13 shows a simplified view of the Value and Innovation Pipelines.
Figure 13: The DataOps lifecycle — the Value and Innovation Pipelines
DevOps introduces two foundational concepts: Continuous Integration (CI) and Continuous
Deployment (CD). CI continuously builds, integrates and tests new code in a development
environment. Build and test are automated so they can occur rapidly and repeatedly. This
allows issues to be identified and resolved quickly. Figure 14 illustrates how CI encompasses
the build and test process stages of DevOps.
As noted above, the Innovation Pipeline has a representative copy of the data pipeline
which is used to test and verify new analytics before deployment into production. This is
the orchestration that occurs in conjunction with “testing” and prior to “deployment” of new
analytics — as shown in Figure 16.
Orchestration occurs in both the Value and Innovation Pipelines. Similarly, testing fulfills a
dual role in DataOps.
Figure 16: DataOps orchestration controls the numerous tools that access, transform,
model, visualize and report data
In the Innovation Pipeline code is variable and data is fixed. The analytics are revised and
updated until complete. Once the sandbox (analytics development environment) is set-up,
the data doesn’t usually change. In the Innovation Pipeline, tests target the code (analytics),
not the data. All tests must pass before promoting (merging) new code into production. A
good test suite serves as an automated form of impact analysis that runs on any and every
code change before deployment.
Some tests are aimed at both data and code. For example, a test that makes sure that a
database has the right number of rows helps your data and code work together. Ultimately
both data tests and code tests need to come together in an integrated pipeline as shown
in Figure 13. DataOps enables code and data tests to work together so that overall quality
remains high.
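A row-count test of this kind might look like the following sketch; the expected count and tolerance are assumed parameters that a team would tune to its own data:

```python
def row_count_ok(actual, expected, tolerance=0.05):
    """Pass when the row count is within +/- tolerance of the expected
    value; a large deviation suggests either a data problem or a code
    change that altered the transform's output."""
    return expected * (1 - tolerance) <= actual <= expected * (1 + tolerance)

assert row_count_ok(10_200, expected=10_000)      # within 5%: passes
assert not row_count_ok(6_000, expected=10_000)   # big drop: stop the line
```

The same check guards both pipelines: in the Value Pipeline it flags anomalous data, and in the Innovation Pipeline it flags a code change that unexpectedly changed the output.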
Figure 17: In DataOps, analytics quality is a function of data and code testing
Figure 19: The concept of test data management is a first order problem in DataOps.
The concept of test data management is a first order problem in DataOps whereas in most
DevOps environments, it is an afterthought. To accelerate analytics development, DataOps
has to automate the creation of development environments with the needed data, software,
hardware and libraries so innovation keeps pace with Agile iterations.
In data analytics, the operations team supports and monitors the data pipeline. This can be
IT, but it also includes customers — the users who create and consume analytics. DataOps
brings these groups together so they can collaborate more closely.
Figure 20: DataOps combines data analytics development and data operations
Centralizing analytics development under the control of one group, such as IT, enables the
organization to standardize metrics, control data quality, enforce security and governance,
and eliminate islands of data. The issue is that too much centralization chokes creativity.
Figure 22: DataOps brings teams together across two dimensions — develop-
ment/operations as well as distributed/centralized development.
DataOps brings three cycles of innovation among core groups in the organization: centralized
production teams, centralized data engineering/analytics/science/governance development
teams, and groups using self-service tools distributed into the lines of business closest to
the customer. Figure 23 shows the interlocking cycles of innovation.
The challenge of pushing analytics into production across these four quite different envi-
ronments is daunting without DataOps. It requires a patchwork of manual operations and
scripts that are in themselves complex to manage. Human processes are error-prone so data
professionals compensate by working long hours, mistakenly relying on hope and heroism for
success. All of this results in unnecessary complexity, confusion and a great deal of wasted
time and energy. Slow progression through the lifecycle shown in Figure 24 coupled with
high-severity errors finding their way into production can leave a data analytics team little
time for innovation.
Implementing DataOps
DataOps simplifies the complexity of data analytics creation and operations. It aligns data
analytics development with user priorities. It streamlines and automates the analytics
development lifecycle — from the creation of sandboxes to deployment. DataOps controls
and monitors the data factory so data quality remains high, keeping the data team focused
on adding value.
A DataOps Platform automates the steps and processes that comprise DataOps: sandbox
management, orchestration, monitoring, testing, deployment, the data factory, dashboards,
Agile, and more. A DataOps Platform is built for data professionals with the goal of simpli-
fying all of the tools, steps and processes that they need into an easy-to-use, configurable,
end-to-end system. This high degree of automation eliminates a great deal of manual work,
freeing up the team to create new and innovative analytics that maximize the value of an
organization’s data.
Some managers respond to this challenge by centralizing analytics. With data and analytics
under the control of one group, such as IT, you can standardize metrics, control data quality,
enforce security and governance, and eliminate islands of data. All worthy endeavors, how-
ever forcing analytic updates through a heavyweight IT development process is a sure way to
stifle innovation. It is one of the reasons that some companies take three months to deploy
ten lines of SQL into production. Analytics have to be able to evolve and iterate quickly to
keep up with user demands and fast-paced markets. Managers instinctively understand that
data analytics teams must be free to innovate. The fast-growing market for self-service
tools (Tableau, Looker, etc.) addresses this need.
Centralizing analytics brings it under control, but granting analysts free rein is necessary to
stay competitive. How do you balance the need for centralization and freedom? How do you
empower your analysts to be innovative without drowning in the chaos and inconsistency
that a lack of centralized control inevitably produces? Visit any modern enterprise, and you
will find this challenge playing out repeatedly in budget discussions and hiring decisions. You
might say, it is a struggle between centralization and freedom.
Figure 26: Data suppliers, engineers and analysts use different cycle times driven
mainly by their tools, methods and proximity to demanding users.
Analysts choose tools and processes oriented toward this business context. They use
powerful, self-service tools, such as Tableau, Alteryx, and Excel, to quickly create or iterate
on charts, graphs, and dashboards. They organize their work into daily sprints (figure 26), so
they can deliver value regularly and receive feedback from users immediately. Agile tools like
Jira are an excellent way to manage the productivity of analyst daily sprints.
The data analyst is the tip of the innovation spear. Organizations must give data analysts
maximum freedom to experiment. There is far more data in the world than companies
can analyze. Not everything can be placed in data warehouses. Not all data should be
operationalized. Companies need data analysts to play around with different data sets to establish
what is predictive and relevant.
Some companies mistakenly ask data engineering to create data sets for every idea. It is best
to let analysts lead on implementing new analytic ideas and proving them out before
considering how data engineering can help. With this approach, the organization focuses
its data engineering resources on those items
that give the most bang for the buck. Keep in mind that when analytics are moved into a data
warehouse, some of the benefits of centralization come at the expense of reduced freedom
— it is slower to update a data warehouse than a Tableau worksheet. It’s important to wait
until analytics have earned the right to make this transition. The value created by centralizing
must outweigh the restriction of freedom.
Data engineers utilize programmable platforms such as AWS (S3, EC2, Redshift). These
tools require programming in a high-level language and offer greater potential functionality
than the tools used by analysts. The relative complexity of the tools and scope of projects in
data engineering fit best in weekly Agile iterations (figure 26). DataOps platforms like Dat-
aKitchen enable the data engineer to streamline the quality control, orchestration and data
operations aspects of their duties. With automated support for agile development, impact
analysis, and data quality, the data engineer can stay focused on creating and improving data
sets for analysts.
After data sets have proven their value, it’s worth considering whether the benefits of fur-
ther centralization outweigh the cost of a further reduction in freedom. Data suppliers fulfill
the function of greater centralization by providing data sources or data extracts for data
engineering.
There are several reasons that a project may have earned the right to transition to data
suppliers. Analytics may provide functionality that executives wish to make available to the
entire corporation, not just one business unit. It could also be a case of standardization — for
example, the company wants to standardize on an algorithm for calculating market share. In
another example, perhaps data engineering has implemented quality control on a data set
and wishes to achieve efficiencies by pushing this functionality upstream to the data suppli-
er. A data supplier may be an external third party or an internal group, such as an IT master
data management (MDM) team.
After the usefulness of the mastered data is established, the company might decide that the
data has broader uses. They may want the customer or partner list to be available for a portal
or tied into a billing system. This use case requires a higher standard of accuracy for the mas-
tered data than was necessary for the analytic data warehouse. It’s appropriate at this point
to consider moving the MDM to a data supplier, such as a corporate IT team, who are adept
at tackling more extensive, development initiatives. Put another way, initial data mastery
may have been good enough for analytic insights, but data must be perfect when it is being
used in a billing system. The data supplier takes the MDM to the next level.
Data Suppliers
Projects transitioned to data suppliers tend to incorporate more process and tool complexity
than those in data engineering, leading to a more extended iteration period of one or more
months (figure 26). These projects use tools such as RDBMS, MDM, Salesforce, Excel, sFTP,
etc., and rely upon waterfall project management and MS Project tracking. Table 2 summariz-
es tools and processes preferred by data suppliers as contrasted with engineers and analysts.
Figure 27: Data Suppliers, Data Engineers and Data Analysts sit on a spectrum of
centralization and innovation/freedom.
Figure 28: Tests verify that data rows, facts and dimensions match business logic
throughout the data pipeline
For example, Figure 28 shows how the DataOps platform orchestrates, tests and monitors
every step of the data operations pipeline, freeing up the team from significant manual
effort. The test verifies that the quantity of data matches business logic at each stage of the
data pipeline. If a problem occurs at any point in the pipeline, the analytics team is alerted and can
resolve the issue before it develops into an emergency. With 24x7 monitoring of the data pipeline,
the team can rest easy and focus on customer requirements for new/updated analytics.
INSTRUCTIONS
1. Place chicken drums and wings in a large zip-lock bag, add marinade, seal
zip-lock bag, mix contents of the bag around gently (you don’t want to acci-
dentally open the bag and marinate your kitchen floor or counter), make sure
your chicken is well coated inside the bag.
2. Refrigerate your chicken in the marinade for 8-24 hours (You can also just
cook them right away if you don’t have the time)
3. Best slow cooked for 5-6 hours in a crockpot or at 225 degrees in a conven-
tional oven—use all the contents in the bag. (If you don’t have that kind of time,
bake at 400 degrees Fahrenheit.) 3.5 lbs. of chicken should bake for 55-60
minutes; 4.5 lbs. of chicken requires 60-65 minutes.
As a Chief Data Officer (CDO) or Chief Analytics Officer (CAO), you serve as an advocate for
the benefits of data-driven decision making. Yet, many CDOs are surprisingly unanalytical
about the activities relating to their own department. Why not use analytics to shine a light
on yourself?
Internal analytics could help you pinpoint areas of concern or provide a big-picture assess-
ment of the state of the analytics team. We call this set of analytics the CDO Dashboard. If
you are as good as you think you are, the CDO Dashboard will show how simply awesome
you are at what you do. You might find it helpful to share this information with your boss
when discussing the data analytics department and your plans to take it to the next level.
Below are some reports that you might consider including in your CDO dashboard:
VELOCITY CHART
The velocity chart shows the amount of work completed during each sprint — it displays how
much work the team is doing week in and week out. This chart can illustrate how improved
processes and indirect investments (training, tools, process improvements, …) increase veloc-
ity over time.
The average tenure of a CDO or CAO is about 2.5 years. In our conversations with data and
analytics executives, we find that CDOs and CAOs often fall short of expectations because
they fail to add sufficient value in an acceptable time frame. If you are a CDO looking to sur-
vive well beyond year two, we recommend avoiding three common traps that we have seen
ensnare even the best and brightest.
Data offense expands top-line revenue, builds the brand, grows the company and in general
puts points on the board. Using data analytics to help marketing and sales is data offense.
Companies may acknowledge the importance of defense, but they care passionately about
offense and focus on it daily. Data offense provides the organization with direct value and it
is what gets CDOs and CAOs promoted.
The challenge for a CDO is that data defense is hard. A company’s shortcomings in gover-
nance, security, privacy, or compliance may be glaringly obvious. In some cases, new
regulations like GDPR (General Data Protection Regulation, EU 2016/679) demand
immediate attention.
In a fast-paced, competitive environment, an 18-month integration project can seem like the
remote future. Also, success is uncertain until you deliver. Your C-level peers know that big
software integration projects fail half the time. Projects frequently turn out to be more com-
plex than anticipated, and they often miss the mark. For example, you may have thought you
needed ten new capabilities, but your internal customers only really require seven, and two
of them were not on your original list. The issue is that you won’t know which seven features
are critical until around the time of your second annual performance review and by then it
might be too late to right the ship.
Figure 36: CDOs often make the dual mistake of (1) focusing too much on delivering
indirect value (governance, security, privacy, or compliance, …) and (2) using a wa-
terfall project methodology which defers the delivery of value to the end of a long
project cycle. In the case shown, it takes several months to deliver direct value.
A data valuation project can take months of effort and consumes the attention of the CDO
and her staff on what is essentially an internally-focused, intellectual exercise. In the end,
you have a beautiful PowerPoint presentation with detailed spreadsheets to back it up. Your
data has tremendous value that can and should be carried on the balance sheet. You tell every-
one all about it — why don’t they care?
Don’t confuse data valuation with data offense. Knowing the theoretical value of data is not
data offense. While data valuation may be useful and important in certain cases, it is often a
distraction. All of the time and resources devoted to creating and populating the valuation
model could have been spent on higher value-add activities.
Figure 37: DataOps uses an iterative product management methodology (Agile develop-
ment) that enables the CDO to rapidly deliver direct value (growing the top line).
People do not always trust data. Imagine you are an executive and an employee walks into
your office and shows you charts and graphs that contradict strongly held assumptions
about your business. A lot of managers in this situation favor their own instincts. Data-ana-
lytics professionals, who tend to be doers, not talkers, are sometimes unable to convince an
organization to trust its data.
DataOps relies upon the data lake design pattern, which enables data analytics teams to up-
date schemas and transforms quickly to create new data warehouses that promptly address
pressing business questions. DataOps incorporates the continuous-deployment methodolo-
gy that is characteristic of DevOps. This reduces the cycle time of new analytics by an order
of magnitude. When users get used to quick answers, it builds trust in the data-analytics
team, and stimulates the type of creativity and teamwork that leads to breakthroughs.
A company that trusts its data develops a unified view of reality and can formulate a
shared vision of how to achieve its goals. Data-driven companies deliver higher growth and
ultimately higher valuations than their peers. As a CAO or CDO, leading the organization
to become more data-driven is your mission. DataOps makes that easier by helping the da-
ta-analytics team deliver quickly and robustly, creating value that is recognized and trusted
by the organization.
This situation could have implications for the company’s future. What if competitors have
devised a way to use data analytics to garner a competitive advantage? Without a compre-
hensive data strategy, a company risks missing the market.
There is nothing inherently wrong with Boutique Analytics. It is a great way to explore the
best ways to deliver value based on data. The eventual goal should be to operationalize the
data and deliver that value on a regular basis. This can be time-consuming and error-prone if
executed manually.
In the Waterfall world, development cycles are long and rigidly controlled. Projects pass
through a set of sequential phases: architecture, design, test, deployment, and maintenance.
Changes in the project plan at any stage cause modifications to the scope, schedule or
budget of the project. As a result, Waterfall projects are resistant to change. This is wholly
appropriate when you are building a bridge or bringing a new drug to market, but in the field
of data analytics, changes in requirements occur on a continuous basis. Teams that use Wa-
terfall analytics often struggle with development cycle times that are much longer than their
users expect and demand. Waterfall analytics also tends to be labor intensive, which makes
every aspect of the process slow and susceptible to error. Most data-analytics teams today
are in the Waterfall analytics stage and are often unaware that there is a better way.
INSTRUCTIONS
Preheat oven to 350 degrees Fahrenheit. Combine dry ingredients (flour through nutmeg) in a small
bowl. In a separate bowl, mix together yogurt, vanilla, brown sugar and honey. Add
egg. Add mashed up bananas. Slowly fold dry ingredients into wet. Stir in cran-
berries and 3/4 cup walnuts gently. Pour mixture into buttered loaf pan. Sprinkle
remaining walnuts on top of loaf. Bake about 45 minutes, or until lightly browned
and knife comes out clean.
Imagine that you oversee a fifty-person team managing numerous large integrated databases
(DB) for a big insurance or financial services company. You have 300 terabytes (TB) of data
which you manage using a proprietary database. Between software, licensing, maintenance,
support and associated hardware, you pay $10M per year in annual fees. Even putting an-
other single CPU into production could cost hundreds of thousands of dollars.
The machine environments are different and have to be managed and maintained separately.
New analytics are tested on each machine in turn — first in dev, then QA and finally produc-
tion. You may not catch every problem in dev and QA since they aren’t using the same data
and environment as production.
Running regression tests manually is time-consuming so it can’t be done often. This creates
risk whenever new code is deployed. Also, when changes are made on one machine they
have to be manually installed on the others. The steps in this procedure are detailed in a
30-page text document, which is updated by a committee through a cumbersome series of
reviews and meetings. It is a very siloed and fractured process, not to mention inefficient;
during upgrades, the DB is offline, so new work is temporarily on hold.
In our hypothetical company, the organization of the workforce is also a factor in slowing the
team’s velocity. Everyone is assigned a fixed role. Adding a table to a database involves sev-
eral discrete functions: a Data Quality person who analyzes the problem, a Schema/Architect
who designs the schema, an ETL engineer who writes the ETL, a Test Engineer that writes
tests and a Release Engineer who handles deployment. Each of these functions is performed
sequentially and requires considerable documentation and committee review before any
action is taken. Hand-off meetings mark the transition from one stage to the next.
The team wants to move faster but is prevented from doing so due to heavyweight process-
es, serialization of tasks, overhead, difficulty in coordination and lack of automation. They
need a way to increase collaboration and streamline the many inefficiencies of their current
process without having to abandon their existing tools.
Shared Workspace – DataOps creates a shared workspace so team members have visibility
into each other’s work. This enables the team to work more collaboratively and seamlessly
outside the formal structure of the hand-off meeting. DataOps also streamlines documenta-
tion and reduces the need for formal meetings as a communication forum.
Orchestration – DataOps deploys code updates to each machine instantiation and auto-
mates the execution of tests along each stage of the data analytics pipeline. This includes
data and logic tests that validate both the production and feature deployment pipelines.
Tests are parameterized so they can run in the subset database of each particular machine
environment equally well. As the test suite improves, it grows to reflect the full breadth of
the production environment. Automated tests are run repeatedly so you can be confident
that new features have not broken old ones.
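One way such parameterized tests might be sketched, using throwaway in-memory SQLite databases to stand in for each environment's differently sized data (the table name, row counts and thresholds are all assumptions for illustration):

```python
import sqlite3

def make_db(n_rows):
    """Build an in-memory database standing in for one environment's
    (possibly subset) copy of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?)",
                     [(i,) for i in range(n_rows)])
    return conn

def row_count_check(conn, min_rows):
    """The same test runs unchanged everywhere; only the connection
    and threshold parameters differ per environment."""
    (count,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
    return count >= min_rows

# hypothetical per-environment parameters (dev and QA hold subsets)
environments = {
    "dev":  (make_db(200),    100),
    "qa":   (make_db(2_000),  1_000),
    "prod": (make_db(50_000), 10_000),
}
results = {name: row_count_check(conn, min_rows)
           for name, (conn, min_rows) in environments.items()}
print(results)   # the one test suite passes in every environment
```

The point of the parameterization is that the test logic is written once and promoted along with the code, rather than rewritten for each machine.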
People often speak of a data lake as a repository for raw data, but it can also be helpful to
move processed data into the data lake. There are several important advantages to using
data lakes. First and foremost, the data analytics team controls access to it. Nothing can
frustrate progress more than having to wait for access to an operational system (ERP, CRM,
MRP, …). Additionally, a data lake brings data together in one place. This makes it much eas-
ier to process. Imagine buying items at garage sales all over town and placing them in your
backyard. When you need the items, it is much easier to retrieve them from the backyard
rather than visiting each of the garage sale sites. A data lake serves as a common store for all
of the organization’s critical data. Easy, unrestricted access to data eliminates restrictions on
productivity that slow down the development of new analytics.
Note that if you put public company financial data in a data lake, everyone who has access to the
data lake is an “insider.” If you have confidential data, HIPAA data (Health Insurance Portability
and Accountability Act of 1996) or personally identifiable information (PII) — these must be
managed in line with government regulations, which vary by country.
UNDERSTANDING SCHEMAS
A database schema is a collection of tables. It dictates how the database is structured and
organized and how the various data relate to each other. Below is a schema that might be
used in a pharmaceutical-sales analytics use case. There are tables for products, payers,
period, prescribers and patients with an integer ID number for each row in each table. Each
sale recorded has been entered in the fact table with the corresponding IDs that identify the
product, payer, period, and prescriber respectively. Conceptually, the IDs are pointers into
the other tables.
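A stripped-down sketch of this star schema, using SQLite for illustration (table and column names are simplified, and only three of the dimensions are shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- dimension tables: one integer ID per row
    CREATE TABLE product    (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payer      (payer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prescriber (prescriber_id INTEGER PRIMARY KEY,
                             name TEXT, zip_code TEXT);
    -- fact table: each recorded sale points into the dimensions by ID
    CREATE TABLE sales_fact (product_id INTEGER, payer_id INTEGER,
                             prescriber_id INTEGER, units INTEGER);
    INSERT INTO product    VALUES (1, 'DrugA');
    INSERT INTO payer      VALUES (1, 'AcmeHealth');
    INSERT INTO prescriber VALUES (1, 'Dr. Smith', '02142');
    INSERT INTO sales_fact VALUES (1, 1, 1, 30);
""")

# a typical analytic read: join the fact table out to its dimensions
row = conn.execute("""
    SELECT p.name, pr.name, f.units
    FROM sales_fact f
    JOIN product p     ON p.product_id = f.product_id
    JOIN prescriber pr ON pr.prescriber_id = f.prescriber_id
""").fetchone()
print(row)
```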
The schema establishes the basic relationships between the data tables. A schema for an
operational system is optimized for inserts and updates. The schema for an analytics system,
like the star schema shown here, is optimized for reads, aggregations, and is easily under-
stood by people.
Suppose that you want to do analysis of patients based on their MSA (metropolitan service
area). An MSA is a metropolitan region usually clustered near a large city. For example,
Cambridge, Massachusetts is in the Greater Boston MSA. The prescriber table has a zip-code
field. You could create a zip-code-to-MSA lookup table or just add MSA as an attribute to
the patient table. Both of these are schema changes. In one case you add a table and in the
other case you add a column.
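The lookup-table option might be sketched as follows; the zip-to-MSA mapping shown is illustrative, not a real reference table:

```python
# hypothetical zip-code-to-MSA lookup table (the "add a table" option)
ZIP_TO_MSA = {
    "02142": "Greater Boston",
    "10001": "New York-Newark-Jersey City",
}

prescribers = [
    {"id": 1, "zip_code": "02142"},
    {"id": 2, "zip_code": "10001"},
]

# enrich each prescriber row with its MSA via the lookup
for p in prescribers:
    p["msa"] = ZIP_TO_MSA.get(p["zip_code"], "Unknown")

print(prescribers[0]["msa"])   # Greater Boston
```

Adding MSA as a column (the second option) amounts to baking the result of this lookup directly into the table.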
You might hear the term data mart in relation to data analytics. Data marts are a streamlined
form of data warehouses. The two are conceptually very similar.
Data transforms (scripts, source code, algorithms, …) create data warehouses from data
lakes. In DataOps this process is optimized by keeping transform code in source control and
by automating the deployment of data warehouses. An automated deployment process is
significantly faster, more robust and more productive than a manual deployment process.
DataOps moves the enterprise beyond slow, inflexible, disorganized and error-prone manual
processes. The DataOps pipeline leverages data lakes and transforms them into well-crafted
data warehouses using continuous deployment techniques. This speeds the creation and
deployment of new analytics by an order of magnitude. Additionally, the DataOps pipeline is
constantly monitored using statistical process control so the analytics team can be confident
of the quality of data flowing through the pipeline. Work Without Fear or Heroism. With
these tools and process improvements, DataOps compresses the cycle time of innovation
while ensuring the robustness of the analytic pipeline. Faster and higher quality analytics ul-
timately lead to better insights that enable an enterprise to thrive in a dynamic environment.
When we talk about reusing code, we mean reusing data analytics components. All of the
files that comprise the data analytics pipeline — scripts, source code, algorithms, html, con-
figuration files, parameter files — we think of these as code. Like other software develop-
ment, code reuse can significantly boost coding velocity.
Code reuse saves time and resources by leveraging existing tools, libraries or other code in
the extension or development of new code. If a software component has taken several
months to develop, it effectively saves the organization several months of development time
when another project reuses that component. This practice can be used to decrease project
budgets. In other cases, code reuse makes it possible to complete projects that would have
been impossible if the team were forced to start from scratch.
Containers make code reuse much simpler. A container packages everything needed to run
a piece of software — code, runtimes, tools, libraries, configuration files — into a stand-alone
executable. Containers are somewhat like virtual machines but use fewer resources because
they do not include full operating systems. A given hardware server can run many more
containers than virtual machines.
A container eliminates the problem in which code runs on one machine, but not on another,
because of slight differences in the set-up and configuration of the two servers or software
environments. A container enables code to run the same way on every machine by auto-
mating the task of setting up and configuring a machine environment. This is one DataOps
techniques that facilitates moving code from development to production — the run-time
environment is the same for both. One popular open-source container technology is Docker.
Each step in the data-analytics pipeline is the output of the prior stage and the input to the
next stage. It is cumbersome to work with an entire data-analytics pipeline as one mono-
lith, so it is common to break it down into smaller components. On a practical level, smaller
components are much easier to reuse by other team members.
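One way to picture this decomposition: each stage is a small, independently reusable function, and the pipeline simply chains them so that each stage's output becomes the next stage's input. The stage names and data below are invented for illustration:

```python
# Three small, reusable pipeline stages (names and data are hypothetical).
def extract():
    """Pull raw rows from a source system."""
    return [{"customer": "A", "sales": 120}, {"customer": "B", "sales": 80}]

def transform(rows):
    """Enrich each row with a derived attribute."""
    return [{**r, "tier": "top" if r["sales"] >= 100 else "standard"} for r in rows]

def load(rows):
    """Shape the result for consumption by a report."""
    return {r["customer"]: r["tier"] for r in rows}

def run_pipeline(stages):
    """Chain stages: each stage's output is the next stage's input."""
    data = stages[0]()
    for stage in stages[1:]:
        data = stage(data)
    return data

result = run_pipeline([extract, transform, load])
print(result)  # {'A': 'top', 'B': 'standard'}
```

Because each stage is self-contained, a teammate can reuse `transform` in a different pipeline without dragging along the rest of the monolith.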
These are all excellent points and often the conversation ends here — in exasperation. We
can tell you that we have been there and have the PTSD to prove it. Fortunately, a few years
ago, we found a way out of what may seem at times like a no-win situation. We believe that the
secret to successful data science is a little about tools and a lot about people and processes.
INSTRUCTIONS
1. Combine flour, yeast and salt in a large bowl and stir with your DataKitchen
spoon. Add water and stir until blended; dough will be shaggy. You may need
an extra ¼ cup of water to get all the flour to blend in. Cover bowl with plas-
tic wrap. Let dough rest at least 4 hours (12-18 hours is good too) at warm
room temperature, about 70 degrees.
2. Lightly oil a work surface and place dough on it; fold it over on itself once or
twice. Cover loosely with plastic wrap and let rest 30 minutes more. This is a
good time to turn the oven on to 425°F.
3. Put a 6-to-8-quart heavy covered pot (cast iron, enamel, Pyrex or ceramic)
in the oven as it heats. When dough is ready, carefully remove pot from
oven. Slide your hand under dough and put it into pot, seam side up. Shake
pan once or twice if dough is unevenly distributed; it will straighten out as it
bakes.
4. Cover with lid and bake 30 minutes, then remove lid and bake another 15 to
30 minutes, until loaf is beautifully browned. Cool on a rack.
NOTES
In a convection oven, cook 23 minutes with the lid on, and then 5 minutes with
the lid off.
You don’t need to pre-heat the pot. You can put the dough on a cookie sheet. The
only difference is the crust will not be as crunchy or as beautifully browned. You
can experiment with a round shape or Italian or French loaf shapes. The longer
shapes will take less time to cook.
You can also cook at a lower temperature (e.g. 350°F). In all cases, take the bread
out when the internal temperature reaches 190°F - 200°F. Use a meat thermom-
eter to check.
The last thing that an analytics professional wants to do is introduce a change that breaks
the system. Nobody wants to be the object of scorn, the butt of jokes, or a cautionary tale.
If that 20-line SQL change is misapplied, it can be a “career-limiting move” for an analytics
professional.
Analytics systems grow so large and complex that no single person in the company under-
stands them from end to end. A large company often institutes slow, bureaucratic proce-
dures for introducing new analytics in order to reduce fear and uncertainty. They create
a waterfall process with specific milestones. There is a lot of documentation, checks and
balances, and meetings — lots of meetings.
Imagine you are building technical systems that integrate data and do models and visu-
alizations. How does a change in one area affect other areas? In a traditional established
company, that information is locked in various people’s heads. The company may think it has
no choice but to gather these experts together in one room to discuss and analyze proposed
changes. This is called an “impact analysis meeting.” The process includes the company’s
most senior technical contributors, the backbone of data operations. Naturally, these individ-
uals are extremely busy and subject to high-priority interruptions. Sometimes it takes weeks
to gather them in one room. It can take additional weeks or months for them to approve a
change.
The impact analysis team is a critical bottleneck that slows down updates to analytics. A
DataOps approach to improving analytics cycle time adopts process optimization techniques
from the manufacturing field. In a factory environment, a small number of bottlenecks
often limit throughput. This is called the Theory of Constraints. Optimize the throughput of
bottlenecks and your end-to-end cycle time improves (check out “The Goal” by Eliyahu M.
Goldratt).
DataOps automates testing. Environments are spun up under machine control and test
scripts, written in advance, are executed in batch. Automated testing is much more cost-ef-
fective and reliable than manual testing, but the effectiveness of automated testing depends
on the quality and breadth of the tests. In a DataOps enterprise, members of the analytics
team spend 20% of their time writing tests. Whenever a problem is encountered, a new test
is added. New tests accompany every analytics update. The breadth and depth of the test
suite continuously grow.
These concepts are new to many data teams, but they are well established in the software
industry. As figure 39 shows, the cycle time of software development releases has been
(and continues to be) reduced by orders of magnitude through automation and process
improvements. The automation of impact analysis can have a similar positive effect on your
organization’s analytics cycle time.
Figure 39: Software developers have reduced the cycle time for new releases
by orders of magnitude using automation and process improvements
ANALYTICS IS CODE
At this point some of you are thinking: This has nothing to do with me. I am a data
analyst/scientist, not a coder. I am a tool expert. What I do is just a sophisticated form of
configuration. This
is a common point of view in data analytics. However, it leads to a mindset that slows down
analytics cycle time.
Tools vendors have a business interest in perpetuating the myth that if you stay within the
well-defined boundaries of their tool, you are protected from the complexity of software
development. This is ill-considered.
Don’t get us wrong. We love our tools, but don’t buy into this falsehood.
The $100B analytics market is divided into two segments: tools that create code and tools
that run code. The point is — data analytics is code. The data professional creates code and
must own, embrace and manage the complexity that comes along with it.
Figure 40 shows a data operations pipeline with code at every stage of the pipeline. Python,
SQL, R: these are all code. The tools of the trade (Informatica, Tableau, Excel, …) are code
too. If you open an Informatica or Tableau file, it’s XML. It contains conditional branches
(if-then-else constructs) and loops, and you can embed Python or R in it.
Figure 41: Tableau files are stored as XML, and can contain conditional
branches, loops and embedded code.
Remember our 20-line SQL change that took six months to implement? The problem is that
analytics systems become so complex that they can easily break if someone makes one mis-
begotten change. The average data-analytics pipeline encompasses many tools (code genera-
tors) and runs lots of code. Between all of the code and people involved, data operations
becomes a combinatorially complex hairball of systems that could come crashing down with
one little mistake.
For example, imagine that you have analytics that sorts customers into five bins based on
some conditional criterion. Deep inside your tool’s XML file is an if-then-else construct that
is responsible for sorting the customers correctly. You have numerous reports based off of
a template that contains this logic. They provide information to your business stakeholders:
top customers, middle customers, gainers, decliners, whales, profitable customers, …
There’s a team of IT engineers, database developers, data engineers, analysts and data
scientists that manage the end to end system that supports these analytics. One of these
individuals makes a change. They convert the sales volume field from an integer into a
decimal. Perhaps they convert a field that was US dollars into a different currency. Maybe
they rename a column. Everything in the analytics pipeline is so interdependent; the change
breaks all of the reports that contain the if-then-else logic upon which the original five
categories are built. All of a sudden, your five customer categories become one category, or
the wrong customers are sorted into the wrong bins. None of the dependent analytics are
correct, reports are showing incorrect data, and the VP of Sales is calling you hourly.
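The fragility described above can be made concrete with a small sketch. The bin boundaries, field names, and the specific upstream change below are all hypothetical, but the failure mode is exactly the one in the story:

```python
# Hypothetical binning logic buried inside a report template.
def bin_customer(row):
    volume = row["sales_volume"]
    if volume >= 1000:
        return "whale"
    elif volume >= 500:
        return "top"
    elif volume >= 200:
        return "middle"
    elif volume >= 50:
        return "standard"
    else:
        return "small"

good_row = {"sales_volume": 750}
print(bin_customer(good_row))  # top

# A seemingly harmless upstream change, renaming the column,
# breaks every report that contains this logic:
renamed_row = {"volume_usd": 750}
try:
    bin_customer(renamed_row)
except KeyError as e:
    print("report broken:", e)
```

A type change (integer to decimal) or a currency conversion would fail more quietly, silently sorting customers into the wrong bins rather than raising an error, which is worse.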
Whether you use an analytics tool like Informatica or Tableau, an Integrated Development
Environment (IDE) like Microsoft Visual Studio (Figure 44) or even a text editor like Notepad,
you are creating code. The code that you create interacts with all of the other code that
populates the DAG that represents your data pipeline.
To automate impact analysis, think of the end-to-end data pipeline holistically. Your test
suite should verify software entities on a stand-alone basis as well as how they interact.
Figure 44: Developers write SQL, Python and other code using an integrated
development environment or sometimes a simple editor like Notepad.
The development of new analytics follows a different path, which is shown in Figure 46 as
the Innovation Pipeline. The Innovation Pipeline delivers new insights to the data operations
pipeline, regulated by the release process. To safely develop new code, the analyst needs an
isolated development environment. When creating new analytics, the developer creates an
environment analogous to the overall system. If the database is terabytes in size, the data
professional might copy it for test purposes. If the data is petabytes in size, it may make
sense to sample it; for example, take 10% of the overall data. If there are concerns about
Table 4: In the Value Pipeline code is fixed and data is variable. In the Innovation
Pipeline, data is fixed, and code is variable.
In the Innovation Pipeline code is variable, but data is fixed. Tests target the code, not the
data. The unit, integration, functional, performance and regression tests that were men-
tioned above are aimed at vetting new code. All tests are run before promoting (merging)
new code to production. Code changes should be managed using a version control system,
for example Git. A good test suite serves as an automated form of impact analysis that can
be run on any and every code change before deployment.
Some tests are aimed at both data and code. For example, a test that makes sure that a
database has the right number of rows helps your data and code work together. Ultimately
both data tests and code tests need to come together in an integrated pipeline as shown
in Figure 47. DataOps enables code and data tests to work together so all around quality
remains high.
A unified, automated test suite that tests/monitors both production data and analytic code
is the linchpin that makes DataOps work. Robust and thorough testing removes or minimizes
the need to perform manual impact analysis, which avoids a bottleneck that slows innova-
tion. Removing constraints helps speed innovation and improve quality by minimizing analyt-
ics cycle time. With a highly optimized test process you’ll be able to expedite new analytics
into production with a high level of confidence.
We recently talked to a data team in a financial services company that lost the trust of their
users. They lacked the resources to implement quality controls so bad data sometimes
leaked into user analytics. After several high-profile episodes, department heads hired their
own people to create reports. For a data-analytics team, this is the nightmare scenario, and it
could have been avoided.
Organizations trust their data when they believe it is accurate. A data team can struggle to
produce high-quality analytics when resources are limited, business logic keeps changing
and data sources have less-than-perfect quality themselves. Accurate data analytics are the
product of quality controls and sound processes.
The data team can’t spend 100% of its time checking data, but if data analysts or scientists
spend 10-20% of their time on quality, they can produce an automated testing and monitor-
ing system that does the work for them. Automated testing can work 24x7 to ensure that
bad data never reaches users, and when a mishap does occur, it helps to be able to assure
users that new tests can be written to make certain that an error never happens again. Auto-
mated testing and monitoring greatly multiplies the effort that a data team invests in quality.
Figure 48 depicts the data-analytics pipeline. In this diagram, databases are accessed and
then data is transformed in preparation for being input into models. Models output visualiza-
tions and reports that provide critical information to users.
Along the way, tests ask important questions. Are data inputs free from issues? Is business
logic correct? Are outputs consistent? As in lean manufacturing, tests are performed at every
step in the pipeline. For example, data input tests are analogous to manufacturing incoming
quality control. Figure 49 shows examples of data input, output and business logic tests.
Data input tests strive to prevent any bad data from being fed into subsequent pipeline
stages. Allowing bad data to progress through the pipeline wastes processing resources and
increases the risk of never catching an issue. It also focuses attention on the quality of data
sources, which must be actively managed — manufacturers call this supply chain management.
Data output tests verify that a pipeline stage executed correctly. Business logic tests validate
data against tried and true assumptions about the business. For example, perhaps all Europe-
an customers are assigned to a member of the Europe sales team.
Test results saved over time provide a way to check and monitor quality versus historical
levels.
FAILURE MODES
A disciplined data production process classifies failures according to severity level. Some
errors are fatal and require the data analytics pipeline to be stopped. In a manufacturing
setting, the most severe errors “stop the line.”
Some test failures are warnings. They require further investigation by a member of the data
analytics team. Was there a change in a data source? Or a redefinition that affects how data
is reported? A warning gives the data-analytics team time to review the changes, talk to
domain experts, and find the root cause of the anomaly.
Finding issues before your internal customers do is critically important for the data team.
There are three basic types of tests that will help you find issues before anyone else: location
balance, historical balance and statistical process control.
Figure 50: Location Balance Tests verify 1M rows in raw source data, and the
corresponding 1M rows / 300K facts / 700K dimension members in the database
schema, and 300K facts / 700K dimension members in a Tableau report
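A location balance test of the kind shown in Figure 50 simply compares row counts at each location in the pipeline. The counts below are illustrative stand-ins for the figure's numbers:

```python
# Illustrative counts matching the shape of Figure 50.
counts = {
    "raw_source_rows": 1_000_000,
    "db_facts": 300_000,
    "db_dimension_members": 700_000,
    "report_facts": 300_000,
    "report_dimension_members": 700_000,
}

def location_balance(c):
    """Verify that row counts reconcile at each location in the pipeline."""
    return {
        "raw_equals_db_total":
            c["raw_source_rows"] == c["db_facts"] + c["db_dimension_members"],
        "facts_survive_to_report":
            c["db_facts"] == c["report_facts"],
        "dims_survive_to_report":
            c["db_dimension_members"] == c["report_dimension_members"],
    }

print(location_balance(counts))  # every check True for these counts
```

If any check fails, rows were dropped or duplicated somewhere between the source, the schema, and the report, and the team knows which hop to inspect.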
HISTORICAL BALANCE
Historical Balance tests compare current data to previous or expected values. These tests
rely upon historical values as a reference to determine whether data values are reasonable
(or within the range of reasonable). For example, a test can check the top fifty customers or
suppliers. Did their values unexpectedly or unreasonably go up or down relative to historical
values?
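The top-customer check described above can be sketched as a comparison against historical values with a tolerance. The customer values and the 25% threshold below are assumptions chosen for illustration:

```python
# Hypothetical historical and current values for top customers.
historical = {"Acme": 1000.0, "Globex": 800.0, "Initech": 600.0}
current = {"Acme": 1050.0, "Globex": 790.0, "Initech": 150.0}

def historical_balance(hist, curr, tolerance=0.25):
    """Flag any customer whose current value moves more than
    `tolerance` (fractional change) away from its historical value."""
    flagged = []
    for customer, past in hist.items():
        change = abs(curr[customer] - past) / past
        if change > tolerance:
            flagged.append(customer)
    return flagged

print(historical_balance(historical, current))  # ['Initech']
```

A flagged customer is a warning, not necessarily an error; it prompts the investigation and proactive communication described below.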
It’s not enough for analytics to be correct. Accurate analytics that “look wrong” to users raise
credibility questions. Figure 51 shows how a change in allocations of SKUs, moving from
pre-production to production, affects the sales volumes for product groups G1 and G2. You
can bet that the VP of sales will notice this change immediately and will report back that
the analytics look wrong. This is a common issue for analytics — the report is correct, but it
reflects poorly on the data team because it looks wrong to users. What has changed? When
confronted, the data-analytics team has no ready explanation. Guess who is in the hot seat.
Historical Balance tests could have alerted the data team ahead of time that product group
sales volumes had shifted unexpectedly. This would give the data-analytics team a chance to
investigate and communicate the change to users in advance. Instead of hurting credibility,
this episode could help build it by showing users that the reporting is under control and that
the data team is on top of changes that affect analytics. “Dear sales department, you may no-
tice a change in the sales volumes for G1 and G2. This is driven by a reassignment of SKUs within
the product groups.”
Automated tests and alerts enforce quality and greatly lessen the day-to-day burden of
monitoring the pipeline. The organization’s trust in data is built and maintained by producing
consistent, high-quality analytics that help users understand their operational environment.
That trust is critical to the success of an analytics initiative. After all, trust in the data is really
trust in the data team.
In data analytics, tests should verify that the results of each intermediate step in the
production of analytics matches expectations. Even very simple tests can be useful. For
example, a simple row-count test could catch an error in a join that inadvertently produces a
Cartesian product. Tests can also detect unexpected trends in data, which might be flagged
as warnings. Imagine that the number of customer transactions exceeds its historical average
by 50%. Perhaps that is an anomaly that upon investigation would lead to insight about
business seasonality.
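The Cartesian-product example can be shown in a few lines. The tables and the deliberately broken join below are hypothetical, but they illustrate why even a trivial row-count test earns its keep:

```python
# Toy tables (hypothetical data).
orders = [(1, "A"), (2, "B"), (3, "A")]        # (order_id, customer)
customers = [("A", "Acme"), ("B", "Globex")]   # (customer, name)

def join(orders, customers, on_key=True):
    if on_key:
        # Correct join: each order matches exactly one customer.
        lookup = dict(customers)
        return [(o, c, lookup[c]) for (o, c) in orders]
    # Buggy join with no condition: every order pairs with every
    # customer, producing a Cartesian product.
    return [(o, c, name) for (o, c) in orders for (_, name) in customers]

good = join(orders, customers, on_key=True)
bad = join(orders, customers, on_key=False)

def row_count_test(result, expected_rows):
    """A join on a unique key should preserve the row count."""
    return len(result) == expected_rows

print(row_count_test(good, len(orders)))  # True
print(row_count_test(bad, len(orders)))   # False: Cartesian product caught
```

The same one-line check in SQL (`SELECT COUNT(*)` before and after the join) catches the bug the moment it is introduced, rather than after users see a report with triple-counted sales.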
Tests in data analytics can be applied to data or models either at the input or output of a
phase in the analytics pipeline. Tests can also verify business logic.
The data analytics pipeline is a complex process with steps often too numerous to be moni-
tored manually. SPC allows the data analytics team to monitor the pipeline end-to-end from
a big-picture perspective, ensuring that everything is operating as expected. As an automat-
ed test suite grows and matures, the quality of the analytics is assured without adding cost.
This makes it possible for the data analytics team to move quickly — enhancing analytics to
address new challenges and queries — without sacrificing quality.
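At its simplest, SPC-style monitoring flags any pipeline measurement that falls outside control limits derived from its own history, classically three standard deviations from the mean. The measurements below are invented for illustration:

```python
import statistics

# Hypothetical history of a pipeline metric (e.g. daily row counts).
history = [1000, 1020, 980, 1010, 990, 1005, 995, 1015]
mean = statistics.mean(history)
sigma = statistics.stdev(history)

def in_control(value, mean=mean, sigma=sigma, k=3):
    """Classic 3-sigma control limits: flag values more than
    k standard deviations away from the historical mean."""
    return abs(value - mean) <= k * sigma

print(in_control(1012))  # True: within control limits
print(in_control(1500))  # False: out of control, raise an alert
```

As new measurements arrive, they join the history and the control limits adapt, so the monitoring keeps pace with the pipeline without manual retuning.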
Instructions
1. Preheat oven to 325°F. Line a baking sheet with parchment paper.
2. In a large bowl, cream together butter and sugar until light and fluffy.
3. Beat in honey, vanilla, and both eggs, adding the eggs in one at a time.
4. In a medium bowl, whisk together flour, baking soda, cinnamon, and salt.
5. Working by hand or at a low speed, gradually incorporate flour mixture into
honey mixture.
6. Stir in trail mix.
7. Shape cookie dough into 1-inch balls and place onto prepared baking sheet,
leaving about 2 inches between each cookie to allow for the dough to
spread.
8. Bake for 12-15 minutes, until cookies are golden brown.
9. Cool for 3-4 minutes on the baking sheet, then transfer to a wire rack to cool
completely.
The role of release engineer was (and still is) critical to completing a successful software
release and deployment, but as these things go, my friend was valued less than the software
developers who worked beside him. The thinking went something like this — developers
could make or break schedules and that directly contributed to the bottom line. Release
engineers, on the other hand, were never noticed, unless something went wrong. As you
might guess, in those days the job of release engineer was compensated less generously
than development engineer. Often, the best people vied for positions in development where
compensation was better.
Whereas a release engineer used to work off in a corner tying up loose ends, the DevOps
engineer is a high-visibility role coordinating the development, test, IT and operations
functions. If a DevOps engineer is successful, the wall between development and operations
melts away and the dev team becomes more agile, efficient and responsive to the market.
This has a huge impact on the organization’s culture and ability to innovate. With so much
at stake, it makes sense to get the best person possible to fulfill the DevOps engineer role
and compensate them accordingly. When DevOps came along, the release engineer went
from fulfilling a secondary supporting role to occupying the most sought-after position in
the department. Many release engineers have successfully rebranded themselves as DevOps
engineers and significantly upgraded their careers.
Data engineers, data analysts, data scientists — these are all important roles, but they will be
valued even more under DataOps. Too often, data analytics professionals are trapped into
relying upon non-scalable methods: heroism, hope or caution. DataOps offers a way out of
this no-win situation.
The capabilities unlocked by DataOps impact everyone who uses data analytics, all the
way to the top levels of the organization. DataOps breaks down the barriers between data
analytics and operations. It makes data more easily accessible to users by redesigning the
data analytics pipeline to be more flexible and responsive. It will completely change what
people think of as possible in data analytics.
In many organizations, the DataOps engineer will be a separate role. In others, it will be a
shared function. In any case, the opportunity to have a high-visibility impact on the organi-
zation will make DataOps engineering one of the most desirable and highly compensated
functions. Like the release engineer whose career was transformed by DevOps, DataOps will
boost the fortunes of data analytics professionals. DataOps will offer select members of the
analytics team a chance to reposition their roles in a way that significantly advances their
career. If you are looking for an opportunity for growth as a DBA, ETL Engineer, BI Analyst,
or another role, look into DataOps as the next step.
And watch out, Data Scientist: the real sexiest job of the 21st century is DataOps Engineer.
Data analytics analyzes internal and external data to create value and actionable insights.
Analytics is a positive force that is transforming organizations around the globe. It helps cure
diseases, grow businesses, serve customers better and improve operational efficiency.
In analytics there is mediocre and there is better. A typical data analytics team works slowly,
all the while living in fear of a high-visibility data quality issue. A high-performance data
analytics team rapidly produces new analytics and flexibly responds to marketplace demands
while maintaining impeccable quality. We call this a DataOps team. A DataOps team can
Work Without Fear or Heroism because they have automated controls in place to enforce
a high level of quality even as they shorten the cycle time of new analytics by an order of
magnitude. Want to upgrade your data analytics team to a DataOps team? It comes down to
roles, tools and processes.
DATA ENGINEER
The data engineer is a software or computer engineer who lays the groundwork for other
members of the team to perform analytics. The data engineer moves data from operation-
al systems (ERP, CRM, MRP, …) into a data lake and writes the transforms that populate
schemas in data warehouses and data marts. The data engineer also implements data tests
for quality.
DATA SCIENTIST
Data scientists perform research and tackle open-ended questions. A data scientist has
domain expertise, which helps him or her create new algorithms and models that address
questions or solve problems.
For example, consider the inventory management system of a large retailer. The company
has a limited inventory of snow shovels, which have to be allocated among a large number of
stores. The data scientist could create an algorithm that uses weather models to predict buy-
ing patterns. When snow is forecasted for a particular region it could trigger the inventory
management system to move more snow shovels to the stores in that area.
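A sketch of the kind of weather-driven allocation the data scientist might build follows. Everything here, including the store names, the snowfall threshold, and the boost factor, is a hypothetical stand-in for a real forecasting model:

```python
# Hypothetical forecast snowfall (inches) and baseline shovel allocation.
forecast_inches = {"boston": 8.0, "miami": 0.0, "denver": 3.0}
base_allocation = {"boston": 50, "miami": 10, "denver": 40}

def allocate_shovels(forecast, base, threshold=2.0, boost=1.5):
    """When forecast snowfall for a store's region crosses the threshold,
    increase that store's allocation by the boost factor."""
    allocation = {}
    for store, units in base.items():
        if forecast[store] >= threshold:
            allocation[store] = int(units * boost)
        else:
            allocation[store] = units
    return allocation

print(allocate_shovels(forecast_inches, base_allocation))
# {'boston': 75, 'miami': 10, 'denver': 60}
```

In practice, the weather model and the demand prediction would be far richer; the value the data scientist adds is the algorithm connecting forecasts to inventory decisions.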
The process and tools enhancements described above can be implemented by anyone on
the analytics team or a new role may be created. We call this role the DataOps Engineer.
DATAOPS ENGINEER
The DataOps Engineer applies Agile Development, DevOps and statistical process controls
to data analytics. He or she orchestrates and automates the data analytics pipeline to make
it more flexible while maintaining a high level of quality. The DataOps Engineer uses tools
to break down the barriers between operations and data analytics, unlocking a high level of
productivity from the entire team.
As DataOps breaks down the barriers between data and operations, it makes data more
easily accessible to users by redesigning the data analytics pipeline to be more responsive,
efficient and robust. This new function will completely change what people think of as possi-
ble in data analytics. The opportunity to have a high-visibility impact on the organization will
make DataOps engineering one of the most desirable and highly compensated functions on
the data-analytics team.
INSTRUCTIONS
1. Preheat oven to 350°F
2. Soften butter to room temperature
3. Line a baking sheet with parchment paper
4. In a large bowl, cream together softened butter, brown sugar and white
sugar
5. Add vanilla extract, chocolate peanut butter and eggs and mix well
6. Stir in flour, baking soda and cocoa powder and combine until blended
7. Fold chocolate chips and peanut butter chips into batter
8. Scoop batter onto prepared baking sheet using a cookie or ice-cream scoop,
leaving enough space in-between for cookies to expand
9. Bake for 14-16 minutes
10. Transfer cookies to a wire rack to cool
Rapid-Response Analytics – The sales and marketing team will continue to demand a
never-ending stream of new and changing requirements, but the data-analytics team will
delight your sales and marketing colleagues with rapid responses to their requests. New
analytics will inspire new questions that will, in turn, drive new requirements for analytics.
The feedback loop between analytics and sales/marketing will iterate so quickly that it will
infuse excitement and creativity throughout the organization. This will lead to breakthroughs
that vault the company to a leadership position in its markets.
Data Under Your Control – Data from all of the various internal and external sources will be
integrated into a consolidated database that is under the control of the data-analytics team.
Your team will have complete access to it at all times, and they will manage it independently
of IT, using their preferred tools. With data under its control, the data-analytics team can
modify the format and architecture of data to meet its own operational requirements.
Impeccable Data Quality – As data flows through the data-analytics pipeline, it will pass
through tests and filters that ensure that it meets quality guidelines. Data will be monitored
for anomalies 24x7, preventing bad data from ever reaching sales and marketing analytics.
You’ll have a dashboard providing visibility into your data pipeline with metrics that delineate
problematic data sources or other issues. When an issue occurs, the system alerts the
appropriate member of your team, who can then fix the problem before it ever becomes
visible to users.
As the manager of the data-analytics team, you’ll spend far less time in uncomfortable meet-
ings discussing issues and anomalies related to analytics.
The processes, methodologies and tools required to realize these efficiencies combine two
powerful ideas: The Customer Data Platform (CDP) and a revolutionary new approach to
analytics called DataOps. Below we’ll explain how you can implement your own
DataOps-powered CDP that improves both your analytics cycle time and data-pipeline quality
by 10X or more.
Figure 56: The Customer Data Platform consolidates data from operational
systems to provide a unified customer view for sales and marketing.
DATAOPS
A CDP is a step in the right direction, but it won’t provide much improvement in team
productivity if the team relies on cumbersome processes and procedures to create analytics.
DataOps is a set of methodologies and tools that will help you optimize the processes by
which you create and deliver analytics.
When implemented in concert, Agile, DevOps and SPC take the productivity of data-analyt-
ics professionals to a whole new level. DataOps will help you get the most out of your data,
human resources and integrated CDP database.
Every resource, technology and tool in the data-analytics organization exists to support the
data analyst’s ability to serve Sales and Marketing. This also applies to Data Scientists, who
deliver insights directly to Sales and Marketing colleagues.
The engineer writes transforms that operate on the data lake, creating data warehouses and
data marts used by data analysts and scientists. The data engineer also implements tests that
monitor data at every point along the data-analytics pipeline assuring a high level of quality.
The data engineer lays the groundwork for other members of the team to perform analytics
without having to be operations experts. With a dedicated data engineering function,
DataOps provides a high level of service and responsiveness to the data-analytics team.
With tests monitoring each stage of the automated data pipeline, DataOps can produce a
dashboard showing the status of the pipeline. The DataOps dashboard provides a high-level
overview of the end-to-end data pipeline. Is any data failing quality tests? What are the error
rates? Which are the troublesome data sources? With this information at his or her finger-
tips, the Data Engineer can proactively improve the data pipeline to increase robustness. In
the event of a high-severity data anomaly, an alert is sent to the Data Engineer who can take
steps to protect production analytics and work to resolve the error. If the anomaly relates to
a data supplier, data engineering can work with the vendor to drive the issue to resolution.
Workarounds and data patches can be implemented as needed with information in release
notes for users. In many cases, errors are resolved without the users (or the organization’s
management) ever being aware of any problem.
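The dashboard's error rates and the high-severity alerting can be sketched the same way; the record layout here ("source", "passed", "severity") is invented for illustration.

```python
from collections import defaultdict

def error_rates(results):
    """Per-source failure rates for a pipeline status dashboard."""
    totals, fails = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["source"]] += 1
        if not r["passed"]:
            fails[r["source"]] += 1
    return {s: fails[s] / totals[s] for s in totals}

def high_severity_alerts(results, threshold=2):
    """Failures severe enough to notify the data engineer immediately."""
    return [r for r in results if not r["passed"] and r["severity"] >= threshold]

# Example test results from two data sources.
results = [
    {"source": "crm", "passed": True,  "severity": 0},
    {"source": "crm", "passed": False, "severity": 3},
    {"source": "web", "passed": True,  "severity": 0},
]
```

Aggregating by source is what lets the dashboard answer "which are the troublesome data sources?" at a glance.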
DATAOPS PLATFORM
The various methodologies, processes, people (and their tools) and the CDP analytics data-
base are tied together cohesively using a technical environment called a DataOps Platform.
The DataOps Platform includes support for:
• Agile project management
• Deployment of new analytics
• Execution of the data pipeline (orchestration)
• Integration of all tools and platforms
• Management of development and production environments
• Source-code version control
• Testing and monitoring of data quality
• Data Operations reporting and dashboards
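Of the capabilities above, orchestration is the one that ties the others together. A toy stand-in for the idea, assuming nothing about any vendor's implementation:

```python
def orchestrate(steps, data):
    """Run named pipeline steps in order; stop and report the first failure.
    A toy stand-in for a platform's orchestration feature."""
    for name, fn in steps:
        try:
            data = fn(data)
        except Exception as exc:
            return {"status": "failed", "step": name, "error": str(exc)}
    return {"status": "ok", "result": data}

# Example: a two-step pipeline over a list of values.
pipeline = [
    ("extract",   lambda d: d + [4]),
    ("transform", lambda d: [x * 10 for x in d]),
]
report = orchestrate(pipeline, [1, 2, 3])
```

The returned status record is the raw material for the operations reporting and dashboards listed above.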
The high degree of automation offered by DataOps eliminates a great deal of work that has
traditionally been done manually. This frees up the team to create new analytics requested
by stakeholder partners.
The enterprise can also outsource these functions initially and insource them at a later date. Once set up, the DataOps Platform can be easily and seamlessly transitioned to an internal team.
Customer Data Platforms promise to drive sales and improve the customer experience by
unifying customer data from numerous disjointed operational systems. As a leader of the
analytics team, you can take control of sales and marketing data by implementing efficient
analytics-creation and deployment processes using a DataOps-powered CDP. A DataOps
platform makes analytics responsive and robust. This enables your data analysts and scien-
tists to rise above the bits and bytes of data operations and focus on new analytics that help
the organization achieve its goals.
DataOps is the mixed martial arts of data analytics. It is a hybrid of Agile Development,
DevOps and the statistical process controls drawn from lean manufacturing. Like MMA, the
strength of DataOps is its readiness to evolve and incorporate new techniques that improve
the quality, reliability, and flexibility of the data analytics pipeline. DataOps gives data ana-
lytics professionals an unfair advantage over those who are doing things the old way — using
hope, heroism or just going slowly in order to cope with the rapidly changing requirements
of the competitive marketplace.
Agile development has revolutionized the speed of software development over the past
twenty years. Before Agile, development teams spent long periods of time developing
specifications that would be obsolete long before deployment. Agile breaks down software
development into small increments, which are defined and implemented quickly. This allows
a development team to become much more responsive to customer requirements and ulti-
mately accelerates time to market.
The difficulty of procuring and provisioning physical IT resources has often hampered data
analytics. In the software development domain, leading-edge companies are turning to
DevOps, which utilizes cloud resources instead of on-site servers and storage. This allows
developers to procure and provision IT resources nearly instantly and with much greater
control over the run-time environment. This improves flexibility and yields another order of
magnitude improvement in the speed of deploying features to the user base.
DataOps also incorporates lean manufacturing techniques into data analytics through the
use of statistical process controls. In manufacturing, tests are used to monitor and improve
the quality of factory-floor processes. In DataOps, tests are used to verify the inputs,
business logic, and outputs at each stage of the data analytics pipeline. The data analytics
professional adds a test each time a change is made. The suite of tests grows over time
until it eventually becomes quite substantial. The tests validate the quality and integrity of
a new release when a feature set is released to the user base. Tests allow the data analytics
professional to quickly verify a release, substantially reducing the amount of time spent on
deploying updates.
Statistical process controls also monitor data, alerting the data team to an unexpected vari-
ance. This may require updates to the business logic built into the tests, or it might lead data
scientists down new paths of inquiry or experimentation. The test alerts can be a starting
point for creative discovery.
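A classic Shewhart-style control check conveys the idea: flag any new measurement (say, a daily row count from a supplier) that falls outside the mean plus or minus three standard deviations of recent history. A minimal sketch:

```python
import statistics

def control_limits(history, k=3.0):
    """Lower/upper limits at k standard deviations around the mean,
    as in a Shewhart control chart."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return mu - k * sigma, mu + k * sigma

def out_of_control(history, value, k=3.0):
    """True if a new observation shows unexpected variance."""
    lo, hi = control_limits(history, k)
    return not (lo <= value <= hi)

# Example: daily row counts received from a data supplier.
history = [1000, 1002, 998, 1001, 999, 1000, 1003, 997, 1000, 1000]
```

A flagged value does not decide by itself whether the cause is a broken feed or a genuine market shift; it simply tells the team where to look next.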
The combination of Agile development, DevOps, and statistical process controls gives Data-
Ops the strategic tools to reduce time to insight, improve the quality of analytics, promote
reuse and refactoring and lower the marginal cost of asking the next business question.
Like mixed martial arts, DataOps draws its effectiveness from an eclectic mix of tools and
techniques drawn from other fields and domains. Individually, each of these techniques is
valuable, but together they form an effective new approach, which can take your data analyt-
ics to the next level.
Figure 59: With DataKitchen, marketing automation data flows continuously from
numerous sources through the analytics pipeline with efficiency and quality.
BUSINESS IMPACT
With the DataKitchen Platform, the company was able to break the long 18-month project
into sprints and began to deliver value in six weeks. The agility of the DataKitchen DataOps
approach enabled the analytics team to rapidly respond to changing user requirements
with a continuous series of enhancements. Users no longer waited months to add new data
sources or make other changes. The team can now deploy new data sources, update sche-
mas and produce new analytics quickly and efficiently without fear of disrupting the existing
data pipelines.
DataKitchen’s lean manufacturing-style process controls helped the team address data quality issues more proactively. With monitoring and alerts, the team can now give data suppliers immediate feedback about issues and prevent bad data from reaching user analytics.
All this has led to improved insight into customers and markets and higher impact marketing
campaigns that drive revenue growth.
DataKitchen’s DataOps Platform helped this pharmaceutical company achieve its strategic
goals by improving analytics quality, responsiveness, and efficiency. DataKitchen software
provides support for improved processes, automation of tools, and agile development of new
analytics. With DataKitchen, the analytics team was able to deliver value to users in 1/10th
the time, accelerating and magnifying their impact on top line growth.
Cashew Cream
• 1 cup cashews soaked in water for at least 2 hours
• 2 cups veg stock
• 4 teaspoons cornstarch (can sub tapioca starch if desired)
• Drain the cashews. In a blender, combine all the ingredients and work for 2
to 5 minutes or until smooth, scraping down the sides with a rubber spatula
several times. Set aside.
Soup
• 1 Tablespoon olive oil
• 1 large onion, coarsely chopped
• 2 celery ribs, chopped
• 3 cups veg broth
• 1 large carrot, chopped
• 1 red pepper, diced (can sub 1 bag of frozen mixed vegetables, thawed, in a pinch)
• 1 potato, diced
• 3 ears of fresh corn (cut the kernels off and scrape the corn cobs for corn milk to add to the soup)
• 1 can of corn
INSTRUCTIONS
1. Heat the oil in a 4-quart pot.
2. When hot, add the onion and celery with a pinch of salt; cook until they start to soften.
3. Add the carrots and potatoes.
4. Add the corn and red pepper and stir-fry for 10 minutes.
5. Add 3 cups veg stock and the corn milk.
6. Bring to a boil, then lower the heat, cover, and simmer 10 minutes or until the vegetables are tender but not overcooked.
7. Stir in the Cashew Cream and cook gently for 7 minutes until nicely thickened.
8. Blend up to half the soup to make it smoother and add it back in.
9. Add salt and pepper to taste, depending on the type of veg stock you used.
My own adaptation of a vegan New England clam chowder recipe from the Boston Globe and from Isa Does It by Isa Chandra Moskowitz.
DataOps Energy Bytes
by Eric Estabrooks
INSTRUCTIONS
1. Add rolled oats, coconut flakes, nut butter, flax seed, honey, and vanilla to a mixing bowl.
2. Mix well so that you can form the balls easily.
3. Add chocolate chips, if using, or other desired mix-ins.
4. Chill the mixture in the fridge for an hour so that the balls will bind together.
5. Roll the mixture into balls about 1 inch in diameter.
Classic Baked Macaroni and Cheese
INSTRUCTIONS
1. Preheat oven to 400°F. Microwave milk at HIGH for 1 ½ minutes. Melt butter in a large skillet or Dutch oven over medium-low heat; whisk in flour until smooth. Cook, whisking constantly, for 1 minute.
2. Gradually whisk in warm milk and cook, whisking constantly 5 minutes or
until thickened.
3. Whisk in salt, black pepper, 1 cup shredded cheese, and, if desired, red pepper until smooth; stir in pasta. Spoon pasta mixture into a lightly greased 2-qt. baking dish; top with remaining cheese. Bake at 400°F for 20 minutes or until golden and bubbly.
NOTES
For this recipe, it is recommended that you grate the block(s) of cheese. I combine
Sharp Cheddar and Swiss cheeses — my favorite. Pre-shredded varieties won’t
give you the same sharp bite or melt into creamy goodness over your macaroni
as smoothly as block cheese that you grate yourself. You can go reduced-fat (but
then it’s even more important to prep your own). Grating won’t take long, and the
rest of this recipe is super simple. Use a pasta that has plenty of nooks to capture
the cheese—like elbows, shells, or cavatappi. Try it just once, and I guarantee that
Classic Baked Macaroni and Cheese will become your go-to comfort food.
Statistical Process Control: https://en.wikipedia.org/wiki/Statistical_process_control
W. Edwards Deming: https://en.wikipedia.org/wiki/W._Edwards_Deming
Christopher Bergh is a Founder and Head Chef at DataKitchen where, among other activi-
ties, he is leading DataKitchen’s DataOps initiative. Chris has more than 25 years of research,
engineering, analytics, and executive management experience.
Previously, Chris was Regional Vice President of the Revenue Management Intelligence group at Model N. Before Model N, Chris was COO of LeapFrogRx, a descriptive and predictive analytics software and service provider. Chris led the acquisition of LeapFrogRx by
Model N in January 2012. Prior to LeapFrogRx, Chris was CTO and VP of Product Management at MarketSoft (now part of IBM), an innovative Enterprise Marketing Management software company. Prior to that, Chris developed Microsoft Passport, the predecessor to Windows Live ID, a distributed authentication system used by hundreds of millions of users today. He
was awarded a US Patent for his work on that project. Before joining Microsoft, he led the
technical architecture and implementation of Firefly Passport, an early leader in Internet Per-
sonalization and Privacy. Microsoft subsequently acquired Firefly. Chris led the development
of the first travel-related e-commerce web site at NetMarket. Chris began his career at the
Massachusetts Institute of Technology’s (MIT) Lincoln Laboratory and NASA Ames Research
Center. There he created software and algorithms that provided aircraft arrival optimization
assistance to Air Traffic Controllers at several major airports in the United States.
Chris served as a Peace Corps Volunteer Math Teacher in Botswana, Africa. Chris has an M.S.
from Columbia University and a B.S. from the University of Wisconsin-Madison. He is an
avid cyclist, hiker, reader, and father of two college age children.
Gil has held various technical and leadership roles at Solid Oak Consulting, HealthEdge,
Phreesia, LeapFrogRx (purchased by Model N), Relicore (purchased by Symantec), Phase For-
ward (IPO and then purchased by Oracle), Netcentric, Sybase (purchased by SAP), and AT&T
Bell Laboratories (now Nokia Bell Labs).
Gil holds an M.S. in Computer Science from Stanford University and a Sc.B. in Applied Mathematics/Biology from Brown University. He has completed hiking all 48 of New Hampshire’s 4,000-footers, is now working on the New England 67, and is the father of one high school and two college age boys.
Eran Strod works in marketing at DataKitchen where he writes white papers, case studies
and the DataOps blog. Eran was previously Director of Marketing for Atrenne Integrated
Solutions (now Celestica) and has held product marketing and systems engineering roles at
Curtiss-Wright, Black Duck Software (now Synopsys), Mercury Systems, Motorola Computer
Group (now Artesyn), and Freescale Semiconductor (now NXP), where he was a contributing
author to the book “Network Processor Design, Issues and Practices.”
Eran began his career as a software developer at CSPi working in the field of embedded
computing.
Eran holds a B.A. in Computer Science and Psychology from the University of California at
Santa Cruz and an M.B.A. from Northeastern University. He is father to two children and
enjoys hiking, travel and watching the New England Patriots.