Creating a Data-Driven Enterprise with DataOps
Insights from Facebook, Uber, LinkedIn, Twitter, and eBay
The killer app for the public cloud is big data analytics. As IT
evolves from a cost center into a true nexus of business innovation,
data teams, data engineers, platform engineers, and database admins
need to build the enterprise of tomorrow: one that is scalable and
built on a fully self-service infrastructure.
Their stories are in this book. Come meet them in person and learn
more at Data Platforms 2017, the first conference dedicated to
building the enterprise of tomorrow. Attendees will take home the
blueprint for creating tomorrow's data-driven architecture today.
Learn More
http://bit.ly/DataPlatformsConference
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Creating a Data-
Driven Enterprise with DataOps, the cover image, and related trade dress are trade‐
marks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is sub‐
ject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.
978-1-491-97781-1
Table of Contents
Acknowledgments
When Facebook’s Data Warehouse Ran Out of Steam
Is Using Either/Or a Possible Strategy?
Common Misconceptions
Difficulty Finding Qualified Personnel
Summary
8. A Maturity-Model “Reality Check” for Organizations
Organizations Understand the Need for Big Data, But Reach Is Still Limited
Significant Challenges Remain
Summary
Acknowledgments
We want to thank Alice LaPlante for diligently capturing our interviews
on the subject and for helping build the content based on those
interviews.
This book also tries to look for patterns that are common in enter‐
prises that have achieved the “nirvana” of being data-driven. In that
aspect, the contributions of Debashis Saha (eBay), Karthik Ramas‐
amy (Twitter), Shrikanth Shankar (LinkedIn), and Zheng Shao
(Uber) are some of the most valuable to the book as well as to our
collective knowledge. All of these folks are great practitioners of the
art and science of making their companies data-driven, and we are
very thankful to them for sharing their learnings and experiences,
and in the process making this book all the more insightful.
Last but not least, thanks to our families for putting up with us while
we worked on this book. Without their constant encouragement and
support, this effort would not have been possible.
PART I
Foundations of a Data-Driven
Enterprise
This book is divided into two parts. In Part I, we discuss the theoret‐
ical and practical foundations for building a self-service, data-driven
company.
In Chapter 1, we explain why data-driven companies are more suc‐
cessful and profitable than companies that do not center their
decision-making on data. We also define what DataOps is and
explain why moving to a self-service infrastructure is so critical.
In Chapter 2, we trace the history of data over the past three decades
and how analytics has evolved accordingly. We then introduce the
Qubole Self-Service Maturity Model to show how companies pro‐
gress from a relatively simple state to a mature state that makes data
ubiquitous to all employees through self-service.
In Chapter 3, we discuss the important distinctions between data
warehouses and data lakes, and why, at least for now, you need to
have both to effectively manage big data.
In Chapter 4, we define what a data-driven company is and how to
successfully build, support, and evolve one.
In Chapter 5, we explore the need for a complete, integrated, and
self-service data infrastructure, and the personas and tools that are
required to support this.
In Chapter 6, we talk about how the cloud makes building a self-
service infrastructure much easier and more cost effective. We
explore the five capabilities of cloud to show why it makes the per‐
fect enabler for a self-service culture.
In Chapter 7, we define metadata, and explain why it is essential for
a successful self-service, data-driven operation.
In Chapter 8, we reveal the results of a Qubole survey that show the
current state of maturity of global organizations today.
CHAPTER 1
Introduction
Hadoop. Joydeep created the first Hadoop cluster at Facebook and
the first set of jobs, populating the first datasets to be consumed by
other engineers—application logs collected using Scribe and appli‐
cation data stored in MySQL.
But Hadoop wasn’t (and still isn’t) particularly user friendly, even for
engineers. Gartner found that even today—due to how difficult it is
to find people with adequate Hadoop skills—more than half of busi‐
nesses (54 percent) have no plans to invest in it.1 It was, and is, a
challenging environment. We found that the productivity of our
engineers suffered. The bottleneck of data requests persisted (see
Figure 1-1).
SQL, on the other hand, was widely used by both engineers and ana‐
lysts, and was powerful enough for most analytics requirements. So
Joydeep and I decided to make the programmability of Hadoop
available to everyone. Our idea: to create a SQL-based declarative
language that would allow engineers to plug in their own scripts and
programs when SQL wasn’t adequate. In addition, it was built to
store all of the metadata about Hadoop-based datasets in one place.
This latter feature was important because it turned out to be indispensable
for creating the data-driven company that Facebook subsequently
became.
1 http://www.gartner.com/newsroom/id/3051717
That language, of course, was Hive, and the rest is history. Still, the
idea was very new to us. We had no idea whether it would succeed.
But it did. The data team immediately became more productive. The
bottleneck eased. But then something happened that surprised us.
In January of 2008, when we released the first version of Hive inter‐
nally at Facebook, a rush of employees—data scientists and engi‐
neers—grabbed the interfaces for themselves. They began to access
the data they needed directly. They didn’t bother to request help
from the data team. With Hive, we had inadvertently brought the
power of big data to the people. We immediately saw tremendous
opportunities in completely democratizing data. That was our first
“ah-ha!”
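To make that shift concrete, here is a minimal sketch of the kind of access Hive opened up, assuming a reachable HiveServer2 endpoint and the PyHive client; the host, table, and script names are hypothetical:

    # A sketch, not Facebook's actual code: querying Hadoop data with SQL
    # via Hive, and plugging a custom script in where SQL isn't adequate.
    from pyhive import hive

    conn = hive.connect(host="hive.example.internal", port=10000)  # hypothetical host
    cursor = conn.cursor()

    # A declarative query any analyst could write, instead of a MapReduce job.
    cursor.execute("""
        SELECT dt, COUNT(*) AS impressions
        FROM ad_click_logs
        WHERE dt >= '2008-01-01'
        GROUP BY dt
    """)
    print(cursor.fetchall())

    # Hive's TRANSFORM clause lets engineers plug in their own programs.
    cursor.execute("ADD FILE parse_ua.py")
    cursor.execute("""
        SELECT TRANSFORM (user_agent)
        USING 'python parse_ua.py'
        AS (browser, os)
        FROM ad_click_logs
    """)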
One of the things driving employees to Hive was that at that same
time (January 2008) Facebook released its Ad product.
Over the course of the next six months, a number of employees
began to use the system heavily. Although the initial use case for
Hive and Hadoop centered around summarizing and analyzing
clickstream data for the launch of the Facebook Ad program, Hive
quickly began to be used by product teams and data scientists for a
number of other projects. In addition, we first talked about Hive at
the first Hadoop summit, and immediately realized the tremendous
potential beyond just what Facebook was doing with it.
With this, we had our second “ah-ha”—that by making data more
universally accessible within the company, we could actually disrupt
our entire industry. Data in the hands of the people was that power‐
ful. As an aside, some time later we saw another example of what
happens when you make data universally available.
Facebook used to have “hackathons,” where everyone in the com‐
pany stayed up all night, ordered pizza and beer, and coded into the
wee hours with the goal of coming up with something interesting.
One intern—Paul Butler—came up with a spectacular idea. He per‐
formed analyses using Hadoop and Hive and mapped out how Face‐
book users were interacting with each other all over the world. By
drawing the interactions between people and their locations, he
developed a global map of Facebook’s reach. Astonishingly, it map‐
ped out all continents and even some individual countries.
2 https://hbr.org/2012/10/big-data-the-management-revolution
Companies in the top third of their industry in using data-driven
decision-making were, on average, 5 percent more productive and 6
percent more profitable than their competitors. This performance
difference remained even after accounting
for labor, capital, purchased services, and traditional IT investments.
It was also statistically significant and reflected in increased stock
market prices that could be objectively measured.
Another survey, by The Economist Intelligence Unit, showed a clear
connection between how a company uses data, and its financial suc‐
cess. Only 11 percent of companies said that their organization
makes “substantially” better use of data than their peers. Yet more
than a third of this group fell into the category of “top performing
companies.”3 The reverse also indicates the relationship between
data and financial success. Of the 17 percent of companies that said
they “lagged” their peers in taking advantage of data, not one was a
top-performing business.
3 https://www.tableau.com/sites/default/files/whitepapers/tableau_dataculture_130219.pdf
4 http://www.zsassociates.com/publications/articles/Broken-links-Why-analytics-investments-have-yet-to-pay-off.aspx
5 http://www.contegix.com/the-importance-of-a-data-driven-company-culture/
6 https://hbr.org/2012/10/making-advanced-analytics-work-for-you
Identify, combine, and manage multiple sources of data
You might already have all the data you need. Or you might
need to be creative to find other sources for it. Either way, you
need to eliminate silos of data while constantly seeking out new
sources to inform your decision-making. And it’s critical to
remember that when mining data for insights, demanding data
from different and independent sources leads to much better
decisions. Today, both the sources and the amount of data you
can collect have increased by orders of magnitude. It’s a connec‐
ted world, given all the transactions, interactions, and, increas‐
ingly, sensors that are generating data. And the fact is, if you
combine multiple independent sources, you get better insight.
The companies that do this are in much better shape, financially
and operationally.
Build advanced analytics models for predicting and optimizing
outcomes
The most effective approach is to identify a business opportu‐
nity and determine how the model can achieve it. In other
words, you don’t start with the data—at least at first—but with a
problem.
Transform the organization and culture of the company so that data
actually produces better business decisions
Many big data initiatives fail because they aren’t in sync with a
company’s day-to-day processes and decision-making habits.
Data professionals must understand what decisions their busi‐
ness users make, and give users the tools they need to make
those decisions. (More on this in Chapter 5.)
So, why are we hearing about the failure of so many big data initia‐
tives? One PricewaterhouseCoopers study found that only four per‐
cent of companies with big data initiatives consider them successful.
Almost half (43 percent) of companies “obtain little tangible benefit
from their information,” and 23 percent “derive no benefit whatso‐
ever.”7 Sobering statistics.
It turns out that despite the benefits of a data-driven culture, creat‐
ing one can be difficult. It requires a major shift in the thinking and
7 http://www.cio.com/article/3003538/big-data/study-reveals-that-most-companies-are-failing-at-big-data.html
8 http://www.zsassociates.com/publications/articles/Broken-links-Why-analytics-investments-have-yet-to-pay-off.aspx
could do so in a way that was controlled and easily auditable. We
also had to make sure that this infrastructure could be built incre‐
mentally so that we could add capacity as dictated by the demands
of the users.
As Figure 1-4 illustrates, moving from manual infrastructure provi‐
sioning processes—which creates the same bottlenecks that occur‐
red with the old model of data access—to a self-service one gives
employees a much faster response to their data-access needs at a
much lower operating cost. Think about it: just as you had the data
team positioned between the employees and the data, now you had
the same wall between employees and infrastructure. Having theo‐
retical access to data did employees no good when they had to go to
the data team to request infrastructure resources every time they
wanted to query the data.
The Emergence of DataOps
Once upon a time, corporate developers and IT operations profes‐
sionals worked separately, in heavily armored silos. Developers
wrote application code and “threw it over the wall” to the operations
team, who then were responsible for making sure the applications
worked when users actually had them in their hands. This was never
an optimal way to work. But it soon became impossible as busi‐
nesses began developing web apps. In the fast-paced digital world,
they needed to roll out fresh code and updates to production rap‐
idly. And it had to work. Unfortunately, it often didn’t. So, organiza‐
tions are now embracing a set of best practices known as DevOps
that improve coordination between developers and the operations
team.
DevOps is the practice of combining software engineering, quality
assurance (QA), and operations into a single, agile organization. The
practice is changing the way applications—particularly web apps—
are developed and deployed within businesses.
Now a similar model, called DataOps, is changing the way data is
consumed.
Here’s Gartner’s definition of DataOps:
[A] hub for collecting and distributing data, with a mandate to pro‐
vide controlled access to systems of record for customer and mar‐
keting performance data, while protecting privacy, usage
restrictions, and data integrity.9
That mostly covers it. However, I prefer a slightly different, perhaps
more pragmatic, hands-on definition:
DataOps is a new way of managing data that promotes communi‐
cation between, and integration of, formerly siloed data, teams, and
systems. It takes advantage of process change, organizational
realignment, and technology to facilitate relationships between
everyone who handles data: developers, data engineers, data scien‐
tists, analysts, and business users. DataOps closely connects the
people who collect and prepare the data, those who analyze the
data, and those who put the findings from those analyses to good
business use.
9 http://www.gartner.com/it-glossary/data-ops/
Data becoming more mainstream
This ties back to the fact that in today’s world there is a prolifer‐
ation of data sources because of all the advancements in collec‐
tion: new apps, sensors on the Internet of Things (IoT), and
social media. There’s also the increasing realization that data can
be a competitive advantage. As data has become mainstream,
the need to democratize it and make it accessible is felt very
strongly within businesses today. In light of these trends, data
teams are getting pressure from all sides.
In effect, data teams are having the same problem that application
developers once had. Instead of developers writing code, we now
have data scientists designing analytic models for extracting actiona‐
ble insights from large volumes of data. But there’s the problem: no
matter how clever and innovative those data scientists are, they don’t
help the business if they can’t get hold of the data or can’t put the
results of their models into the hands of decision-makers.
DataOps has therefore become a critical discipline for any IT orga‐
nization that wants to survive and thrive in a world in which real-
time business intelligence is a competitive necessity. Three reasons
are driving this:
Data isn’t a static thing
According to Gartner, big data can be described by the “Three
Vs”:10 volume, velocity, and variety. It’s also changing constantly.
On Monday, machine learning might be a priority; on Tuesday,
you need to focus on predictive analytics. And on Friday, you’re
processing transactions. Your infrastructure needs to be able to
support all these different workloads, equally well. With Data‐
Ops, you can quickly create new models, reprioritize workloads,
and extract value from your data by promoting communication
and collaboration.
Technology is not enough
Data science and the technology that supports it are getting
stronger every day. But these tools are only as good as the
consistency and reliability with which they are applied.
10 http://www.gartner.com/it-glossary/big-data/
In This Book
In this book, we explain what is required to become a truly data-
driven organization that adopts a self-service data culture. You’ll
read about the organizational, cultural, and—of course—technical
transformations needed to get there, along with actionable advice.
Finally, we’ve profiled five leading companies on their
data-driven journeys: Facebook, Twitter, Uber, eBay, and LinkedIn.
CHAPTER 2
Data and Data Infrastructure
If we think back again to Gartner’s Three Vs of big data—volume,
velocity, and variety—we realize that the interaction data has a much
higher velocity, volume, and variety than the traditional transac‐
tional data created by business applications. That data is also of very
high value to businesses. Figure 2-1 offers a simple illustration of the
evolution of data from transactional to interaction.
In this chapter, we explore the drivers of big data and how organiza‐
tions can get the most out of all the different kinds of data they now
routinely collect. We’ll also present a maturity model that shows the
steps that organizations should take to achieve data-driven status.
1 http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
2 https://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf
Figure 2-3. Devices are now always on, connected, and powerful
(source: Qubole)
3 https://infocus.emc.com/william_schmarzo/kpmg-survey-firms-struggle-with-big-data/
4 http://www.gartner.com/newsroom/id/3130017
5 https://assets.kpmg.com/content/dam/kpmg/pdf/2016/07/2016-ceo-survey.pdf
6 http://www.gartner.com/it-glossary/predictive-analytics/
Stage 1: Aspiration
At this stage, a company is typically using a traditional data ware‐
house with production reporting and ad hoc analyses.
The signs that you are a Stage 1 company include: having a large
number of apps that are collecting growing volumes of data;
researching big data, but not investing in it yet; hiring big data engi‐
neers; actively contracting with Software-as-a-Service (SaaS) appli‐
cations and go-to-market products; and gathering budget
requirements for big data initiatives.
You also face certain challenges if you’re in Stage 1. You don’t know
what you don’t know. You are typically afraid of the unknown.
You’re legitimately worried about the competitive landscape. Added
to that, you are unsure of what the total cost of ownership (TCO) of
a big data initiative will be. You know you need to come up with a
plan to reap positive return on investment (ROI). And you might
also at this time be suffering from internal organizational or cultural
conflicts.
The classic sign of a Stage 1 company is that the data team acts as a
conduit to the data, and all employees must go through that team
to access data.
The key to getting from Stage 1 to Stage 2 is to not think too big.
Rather than worrying about how to change to a DataOps culture,
begin by focusing on one problem you have that might be solved by
a big data initiative.
Stage 3: Expansion
In this stage, multiple projects are using big data, so you have the
foundation for a big data infrastructure. You have created a roadmap
for building out teams to support the environment.
You also face a plethora of possible projects. These typically are
“top-down” projects—that is, they come from high up in the organi‐
zation, from executives or directors. You are focused on scalability
and automation, but you’re not yet evaluating new technologies to
see if they can help you. However, you do have the capacity and
resources to meet future needs, and have won management buy-in
for the project on your existing infrastructure.
Stage 4: Inversion
It is at this stage that you achieve an enterprise transformation and
begin seeing “bottoms-up” use cases—meaning employees are iden‐
tifying projects for big data themselves rather than depending on
executives to commission them. All of this is good. But there is still
pain.
You know you are in Stage 4 if you have spent many months build‐
ing a cluster and have invested a considerable amount of money, but
you no longer feel in control. Your users used to be happy with the
big data infrastructure but now they complain. You’re also simulta‐
neously seeing high growth in your business—which means more
customers and more data—and you’re finding it difficult if not
impossible to scale quickly. This results in massive queuing for data.
You’re not able to serve your “customers”: employees and lines of
business are not getting the insight they need to make decisions.
Stage 5: Nirvana
If you’ve reached this stage, you’re on par with the Facebooks and
Googles of the world. You are a truly data-driven enterprise with
ubiquitous insights. Your business has been successfully trans‐
formed.
Summary
The drivers of big data—the fact that we live in an increasingly con‐
nected world, and the number and types of different data-producing
devices are multiplying—will only grow more intense as time goes
by. By moving to a self-service culture, organizations can get the
most out of all the different kinds of data they now routinely collect.
But to do this successfully, organizations need to advance step by
step through the Qubole self-service maturity model presented in
this chapter.
houses and was based on technologies such as Teradata, Oracle,
Netezza, Greenplum, and Vertica, among others.
Figure 3-3 shows the main differences between a data lake and a
data warehouse.
Data warehouses continue to be popular because they are very
mature technologies, having been around since the 1990s. Addition‐
ally, they work well with the tools business analysts and users have
become accustomed to when using dashboards or other kinds of
mechanisms through which they can consume insights from the res‐
ident data. In fact, for certain use cases, data warehouses perform
very well because the data is completely curated and structured to
answer certain query patterns quickly.
Figure 3-4. Analytics value pyramid
We cover the first two points in this chapter and discuss the roles
and responsibilities of the employees who form the vital cogs in the
engine that drives the data-driven organization—from data produc‐
ers, to data scientists, to engineers, to analysts, to business users.
The next chapter devotes itself to the technology needed to support
a data-driven culture.
Creating a Self-Service Culture
The most important—and arguably the most difficult—aspect of
transitioning to a data-driven organization that practices DataOps is
the cultural shift required to move to a data mindset. This shift
entails identifying and building a cultural framework that enables all
the people involved in a data initiative—from the producers of the
data, to the people who build the models, to the people who analyze
it, to the employees who use it in their jobs—to collaborate on mak‐
ing data the heart of organizational decision-making. Though the
technology that makes this collaboration and data access easy is very
important, it is just one of the considerations. A key focus area in
this transition is the employees and the organization. After you
achieve a true self-service, data-driven culture, as discussed in Chap‐
ter 1, you should experience a significant competitive boost to your
business.
This skill is critical because the two languages are very different. The
business wants to ask questions in its own, business-oriented terms.
You then need analysts who can take those business questions and
convert them into a series of questions to ask the data. Thus, data
analysts would translate these questions into SQL or other com‐
mands to pull the relevant data from the data stores.
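As an illustration only (the schema and numbers are invented), here is how such a translation might look; SQLite's in-memory database makes the example self-contained:

    import sqlite3

    # Hypothetical translation of the business question "which regions
    # generated the most revenue last quarter?" into SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, order_date TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("EMEA", "2017-02-01", 1200.0),
         ("APAC", "2017-02-15", 800.0),
         ("EMEA", "2017-03-20", 450.0)],
    )

    rows = conn.execute("""
        SELECT region, SUM(revenue) AS q_revenue
        FROM orders
        WHERE order_date >= '2017-01-01' AND order_date < '2017-04-01'
        GROUP BY region
        ORDER BY q_revenue DESC
    """).fetchall()
    print(rows)  # [('EMEA', 1650.0), ('APAC', 800.0)]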
At Facebook, we had a centralized data team. Then, we had analysts
embedded in every product team. We also took care that all the ana‐
lysts had a central forum at which they could meet and communi‐
cate what they were doing, allowing data intelligence to flow
through the entire organization. Essentially, this model transmitted
the data-driven DNA of the self-service organizational culture
throughout the company.
What’s a persona?
The persona is a profile of a job role. It is not a person or a job title.
In fact, one employee might represent multiple personas (in a small
company). Alternatively, one persona might be split across multiple
people in a larger enterprise. Understanding personas is important
because it allows you to comprehend their pain points, what they
care about, and their ultimate responsibilities.
If you understand personas, you can pinpoint which messages are
likely to be the most relevant to the person you are talking to as you
try to effect a transformation into a data-driven enterprise. It’s very
important to understand that this is not a hierarchy, but a collabora‐
tive team that works together in the hub-and-spoke model described
in the previous section.
Summary
In this chapter, we discussed ways to transform your business into a
data-driven organization with a self-service culture. We discussed
the different organization roles and responsibilities, and presented
the very important hub-and-spoke organizational structure needed
to support such a transformation. By learning and implementing
these concepts, your business will be on its way to becoming truly
data driven.
organization, we will publish all data without thinking of how it will
be used. Then, the infrastructure platforms and analysis tools all
need to be self-service and data universally available.
Summary
Like all new technologies and tools, the self-service model has some
barriers to adoption. Chief among them: how to allocate resources
to all the different employees who wish to use the business’s data
resources, which are usually constrained by cost. Going forward, compa‐
nies seeking to implement a self-service model need to find ways to
monitor and charge for use of data resources. The fact that this is
infinitely easier in the cloud points to where the future of the data-
driven organization is going.
• Scalability
• Elasticity
• Self-service and collaboration
• Cost efficiencies
• Monitoring and usage-tracking capabilities
In this chapter, we discuss how these properties can help data teams
create self-service data platforms.
Scalability
One of the biggest advantages of cloud computing is your ability to
quickly expand infrastructure to meet the needs of your organiza‐
tion. Today, because of the cloud, huge amounts of infrastructure are
now instantaneously available to any organization. Compute resour‐
ces can be provisioned in minutes or even less. This is called scala‐
bility. If you suddenly need, say, a thousand more nodes of compute,
you can get them today in the cloud thanks to its scaling capabilities.
Because of the cloud, large-scale infrastructures are now available to
any organization, not just the Facebooks and Googles of the world.
Scalability is frequently talked about at the application layer. Can the
infrastructure handle a growing volume of workloads? Can it grow
sufficiently and fast enough to meet the expanding demands of the
business?
There are two types of scalability the cloud supports: you can scale
up (scale vertically) or scale out (horizontally). You scale vertically
by creating a larger virtual machine (VM) and transferring a work‐
load to it. Or, you could make the existing VM for the workload big‐
ger. Scaling horizontally means you add more VMs, and divide the
load between all of them.
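As a hedged sketch of the two modes using the AWS SDK for Python (the instance IDs, AMI, and types are placeholders; production systems would typically rely on managed autoscaling instead):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Scale up (vertically): stop an instance and give it a larger type.
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])
    ec2.modify_instance_attribute(
        InstanceId="i-0123456789abcdef0",
        InstanceType={"Value": "m5.4xlarge"},
    )
    ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])

    # Scale out (horizontally): add more identical VMs and divide the load.
    ec2.run_instances(
        ImageId="ami-12345678",  # placeholder AMI
        InstanceType="m5.xlarge",
        MinCount=4,
        MaxCount=4,
    )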
Why is the scalability of cloud important to the self-service model?
Because big data workloads can be very large, and very “bursty.”
They are difficult to run using an on-premises infrastructure
because the ability to scale is so limited—you can grow only to the
capacity of the physical infrastructure you have already put into
place. It’s difficult to grow it quickly and cost effectively. And by
being limited by infrastructure scalability, organizations can find
themselves compromising on data. They use smaller datasets, which
result in inferior models and, ultimately, less valuable business
insight. With the scalability of the cloud, you can have very large
datasets against which to test your hypotheses. And to be true to the
vision of a truly data-driven culture, all your data should be kept
and made available online. The cloud eliminates the limitations that
difficult-to-scale on-premises infrastructures place on you.
Elasticity
Although they are frequently used interchangeably, scalability and
elasticity are not synonyms. Both refer to an environment’s capability
to adapt to a changing workload, but elasticity also implies that
capacity shrinks automatically when demand falls, so you pay only for
what you actually use.
Cost Effectiveness
Another important property of the cloud is that it’s significantly
more cost effective than on-premises infrastructure. There are two
reasons for that: one, the fees are calculated on a usage model rather
than a software-licensing one; and two, your operational costs are
much lower because you don’t need to maintain an IT operations
staff to manage and maintain the infrastructure. In fact, moving to
the cloud generally boosts the productivity of IT personnel.
First, the pay-as-you-use model. In the cloud, you pay only for what
you use. With on-premises infrastructure, you must purchase and
build the infrastructure to accommodate peak workloads. Even if
most of the time your usage requirements are generally not high,
you must accommodate those peaks or risk running out of resour‐
ces, which almost always results in a higher total cost of ownership
(TCO) than the cloud. For example, most retailers previously had
to build infrastructure to handle the holiday season, which meant
that they paid a great deal for infrastructure that was used for two
and a half months each year. The rest of the year, that capacity
remained dormant. Now, with the cloud, retailers can scale up in
November and December, and then flexibly go back to paying much
less for what they need in January and the rest of the year.
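A back-of-the-envelope comparison makes the point; all of the numbers below are hypothetical:

    # Provisioning for the holiday peak year-round versus paying for
    # cloud capacity only while it is needed. Numbers are invented.
    PEAK_NODES = 100        # nodes needed in November and December
    BASELINE_NODES = 20     # nodes needed the other ten months
    COST_PER_NODE_MONTH = 500.0

    on_prem = PEAK_NODES * 12 * COST_PER_NODE_MONTH  # sized for peak all year
    cloud = (PEAK_NODES * 2 + BASELINE_NODES * 10) * COST_PER_NODE_MONTH
    print(f"on-premises: ${on_prem:,.0f}  cloud: ${cloud:,.0f}")
    # on-premises: $600,000  cloud: $200,000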
Secondly, the TCO of the cloud is reduced because it improves the
productivity of IT in general and the DevOps team in particular.
You don’t need to employ people to manage datacenters, because all
of that is done by the cloud provider. In addition, the economies of
scale that are achieved by the cloud providers because multiple
organizations are using the same infrastructure are passed on to
everyone.
One analogy can be found in the way that electricity is produced
and consumed. Most businesses don’t run their own electrical grids
or generators—it would be much more expensive to do that than to buy power from a utility.
Cloud Architecture
What makes these properties possible is the cloud architecture.
Attempting to “lift and shift” from on-premises to the cloud simply
doesn’t work for big data. Instead, the big data architecture for the
cloud is different and needs to be built according to some very spe‐
cific architectural characteristics.
Object storage is an architecture that manages storage as objects
instead of as a hierarchy (as filesystems do), or as blocks (as block
storage does). In a cloud architecture, you need to store data in
object stores and use compute when needed to process data directly
from the object stores. The object stores become the place where the
data lake is created—basically, all data is dumped into the data lake
in its raw format. The cloud-native data platforms then create the
compute infrastructure needed to process this data on the fly,
whether it be processing for data wrangling, ad hoc analysis,
machine learning, or any other analytical application.
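A minimal sketch of this pattern, with AWS S3 standing in for the object store (the bucket and key are placeholders, and configured credentials are assumed):

    import boto3

    # Data lives in the object store; transient compute reads it directly.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2017-02-01.json")
    raw_bytes = obj["Body"].read()
    print(len(raw_bytes), "bytes read straight from the object store")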
Note that this is different from the architectural pattern typically fol‐
lowed by the big data platforms built on-premises. Because of the
lack of highly scalable object stores that can support thousands of
machines reading data from them and writing data to them, on-
premises data-lake architectures stress convergence of compute and
storage, as demonstrated in Figure 6-1. The rise of Apache Hadoop
—one of the leading big data platforms—was based on the principle
that compute and storage should be converged. The same physical
machines that store data are the machines that provide the compu‐
tation for different data applications on that data.
different from the needs for on-premises deployments of data
platforms.
On-premises big data deployments happen behind the firewall.
Because hardware and infrastructure are not virtualized and are not
multitenant, there are no strong mechanisms needed to secure infra‐
structure resources such as machines. Typically, machines are
racked-up in the datacenters behind a firewall, and the security is
managed by making the firewalls secure. Therefore, for on-premises
data platforms, it becomes sufficient to tie user identity stored in
identity-management systems such as Active Directory with the
authorization policies for accessing data. That is the standard for
authentication and authorization to which most data platforms built
on-premises adhere.
On the other hand, in the cloud, each API call for provisioning, de-
provisioning, or using infrastructure must be authenticated and
authorized. It is not sufficient to simply create data-access authoriza‐
tion rules. It is also necessary to tie those rules and the user identity
to the right authorization rules for orchestrating infrastructure. As a
result, data platforms built for the cloud need to be tightly integrated
with the cloud API-level authentication and authorization policies,
which are integrations that on-premises data platforms do not need
to support.
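As an illustration, an IAM-style policy in AWS syntax (the resource names are placeholders) shows how a single identity must be authorized both for the data itself and for orchestrating infrastructure:

    import json

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # data-access authorization
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": "arn:aws:s3:::my-data-lake/raw/*",
            },
            {   # infrastructure-orchestration authorization
                "Effect": "Allow",
                "Action": ["ec2:RunInstances", "ec2:TerminateInstances"],
                "Resource": "*",
            },
        ],
    }
    print(json.dumps(policy, indent=2))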
In addition, the cloud provides mechanisms to encrypt data both in
transit and also while the data is stored (at rest). These encryption
mechanisms are also tied to external key management systems in
which the keys that are used to decrypt the data are stored separately
from the cloud infrastructure. Native cloud data platforms are inte‐
grated with such mechanisms to provide extra security for highly
sensitive datasets that are stored and processed in the cloud.
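A sketch of what this looks like in practice, using the AWS API (the bucket name and key alias are placeholders):

    import boto3

    s3 = boto3.client("s3")  # the SDK uses TLS, covering encryption in transit
    s3.put_object(
        Bucket="my-data-lake",
        Key="sensitive/patients.csv",
        Body=b"...",
        ServerSideEncryption="aws:kms",          # encrypted at rest
        SSEKMSKeyId="alias/sensitive-data-key",  # key held in KMS, apart from the data
    )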
Summary
Big data and data platforms are increasingly vital for businesses. To
create a data-driven culture, the agility and flexibility of these plat‐
forms are very important. Cloud architecture with its scalability,
elasticity, usage tracking, and cost savings is the best infrastructure
on which to build and deploy these data platforms. The cloud’s sepa‐
ration of compute and storage, security architecture, and scalable
object stores are important capabilities that Data 2.0 platforms need
to build on, in order to enable data teams to reach the “nirvana” of
the data-driven culture in their companies.
Data by itself means very little. After all, a piece of digital data is
simply a collection of bits. To be discovered (found) or understood,
this collection of bits needs something called metadata. In this chap‐
ter, we cover what metadata is and why it’s important to the self-
service data model.
First, a basic definition: metadata is any information that gives you
information about the data. It’s data about data.
Figure 7-1. Classes of metadata
Descriptive Metadata
Descriptive metadata describes a collection of bits, or piece of digital
data, so that it can be cataloged, discovered, identified, explored, and
so on. It can include elements such as title, abstract, author, and key‐
words. It is typically meant to be read by humans.
Descriptive metadata can also help data users “decode” data. For
example, suppose that you’ve administered a survey to a number of
users, and have captured their answers to the multiple-choice survey
as one of five letters: a, b, c, d, and e. Although the data itself might
consist only of the letters, the descriptive metadata informs you how
to interpret those letters. The single most important attribute of
descriptive metadata is that it allows data users to understand what
the data is, and how to use it.
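To make the survey example concrete (the codes and values below are invented):

    # The raw data is just letters; descriptive metadata says how to read them.
    responses = ["a", "c", "e", "b", "a"]

    descriptive_metadata = {
        "question": "How satisfied are you with the product?",
        "codes": {"a": "very satisfied", "b": "satisfied", "c": "neutral",
                  "d": "dissatisfied", "e": "very dissatisfied"},
    }

    decoded = [descriptive_metadata["codes"][r] for r in responses]
    print(decoded)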
In addition to direct applications for data users, this type of meta‐
data is also used by data discovery tools to aid the data users in find‐
ing, exploring, and understanding datasets.
Structural Metadata
Structural metadata is data about the organization of information
and the relationship between different organizational units. It
informs either the data users or machines about how the data is
organized so that they can perform transformations on it and correlate
it with other datasets.
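A small sketch of what structural metadata might look like in machine-readable form; the dataset and field names are illustrative:

    # A description of how a dataset is organized, which tools can use to
    # transform it or join it with other datasets.
    table_schema = {
        "name": "ad_click_logs",
        "fields": [
            {"name": "user_id", "type": "bigint",    "nullable": False},
            {"name": "ts",      "type": "timestamp", "nullable": False},
            {"name": "url",     "type": "string",    "nullable": True},
        ],
        "partitioned_by": ["dt"],                 # physical layout
        "foreign_keys": {"user_id": "users.id"},  # relationship to another dataset
    }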
Administrative Metadata
Administrative metadata provides information to help administra‐
tors manage a resource, such as when and how it was created, file
type, who can access it, and other technical information. It captures
information during the entire lifecycle of data, from creation,
through consumption, through changes, and then finally to archival
or deletion. The information is most easily captured by software aid‐
ing in all of these activities. Given this information, one of the key
applications of administrative data is compliance and auditing.
Another part of administrative data captures organizational policies
related to data management. Such policies might be specified to
control any or all aspects the lifecycle of data. As an example, a key
policy element around data that every organization deals with is
controlling access to it. Access control lists (ACLs) and permissions
are elements of administrative metadata that help administrators
with managing access to different parts of the data’s lifecycle—cre‐
ation, consumption (read), modification (update), and deletion or
archival. Another example of a policy element associated with data
management could be specifications of the duration of time when
data is held online and when it is either deleted or archived away.
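Continuing the illustration, administrative metadata for the same hypothetical dataset might capture lifecycle facts alongside policy elements such as ACLs and a retention rule:

    admin_metadata = {
        "created_by": "scribe-ingest",             # invented creator
        "created_at": "2017-02-01T00:00:00Z",
        "file_format": "parquet",
        "acl": {                                   # who may do what
            "read":   ["analysts", "data-science"],
            "update": ["data-eng"],
            "delete": ["data-admins"],
        },
        "retention": {"online_days": 365, "then": "archive"},
    }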
Summary
Creating and managing metadata effectively and efficiently is essen‐
tial for a self-service data infrastructure. Without accurate and con‐
sistent metadata, your datasets will not be as useful as they otherwise
would be, and your users will constantly be running to the data team
for clarification on the metadata. Instituting a “middle of the road”
metadata management philosophy using standards, crowdsourcing,
and software automation will help you achieve metadata that is both
accurate and relevant.
CHAPTER 8
A Maturity-Model “Reality Check”
for Organizations
some time (38 percent) or were just getting started on big data
projects (38 percent). Although the remaining 20 percent don’t cur‐
rently have big data projects in place, they expect to in the future
(Figure 8-1).
This tells us that the notion that everyone should be making use of
big data is both well understood and widely accepted.
Figure 8-1. Does your company currently have a big data initiative?
(Source: The State of Data Ops—A Global Survey of IT and Data Pro‐
fessionals, February 2017, Dimensional Research)
Despite this broad interest in big data, it seems that big data activity
has yet to reach deeply into most organizations. Less than half of
respondents (45 percent) say they are seeing more than four groups
requesting big data projects today (Figure 8-2).
What this tells us is that big data deployment is still quite limited.
Enthusiasm for taking advantage of data and analytics hasn’t spread
throughout most organizations. Yet our core theory behind writing
this book is that every company should want to provide ubiquitous
access to data and analytics to all of their users to become fully data-
driven. If all employees in a company have this kind of access, they
will be able to do their jobs better and be more competitive and effi‐
cient while reducing costs.
If you look back at the illustration of Qubole’s data maturity model
in Chapter 2, you’ll see that most organizations are at Stage 2 or
Stage 3.
Reinforcing these conclusions, when asked to assess themselves, 88
percent of respondents said they were still in the early stages of big
data adoption (Figure 8-3).
Figure 8-3. Where is your company in terms of adopting big data?
(Source: The State of Data Ops—A Global Survey of IT and Data Pro‐
fessionals, February 2017, Dimensional Research)
Figure 8-8. How confident are you that the data team can achieve self-
service analytics? (Source: The State of Data Ops—A Global Survey of
IT and Data Professionals, February 2017, Dimensional Research)
For us, the takeaway of Figure 8-8 is that companies are underesti‐
mating the difficulty of this transition to a self-service data model.
This reminds us of a few years ago, when companies were convinced
that they could build their own private clouds—that they could do
better than the public cloud vendors. They eventually had to admit
that their private clouds would never be as big, as cheap, as scalable,
reliable, or secure as an Amazon or Microsoft public cloud. So, even
if they succeeded in building a private cloud, they had effectively
Figure 8-9. Is any of your big data processing performed in the cloud?
(Source: The State of Data Ops—A Global Survey of IT and Data Pro‐
fessionals, February 2017, Dimensional Research)
The takeaway from Figure 8-12 is that big data is made up of a lot of
diverse workloads. Therefore, you need a “complete” platform to
handle all of them—a platform that supports different transforma‐
tion engines and tools for different workloads and user personas. In
addition, making this platform self-service and meeting the diverse
needs of multiple workloads and users becomes an increasingly dif‐
ficult task.
Application data integration leads the current big data workloads
(53 percent), followed by ad hoc analytics (41 percent), and ETL (37
percent).
Summary
In summary, most companies today are still in the early stages of
moving to a self-service infrastructure that can fully support their
big data aspirations. Most would be at either Stage 1 or Stage 2 of the
Qubole Data-Driven Maturity Model. Although on one level they
seem cognizant of this fact, in other ways they appear to have unre‐
alistic expectations of exactly how much effort and expertise it will
require for them to progress to the next maturity stage. This appears
to be a case where the more they learn, the more realistic and tem‐
pered their expectations and goals will be.
PART II
Case Studies
I’ve been with LinkedIn for only 18 months. Yet, what I’ve seen in
data operations has amazed me. Like all consumer web companies,
there’s always been an enormous amount of data that’s flowed
through LinkedIn. But LinkedIn was relatively early to realize the
importance of this data.
At LinkedIn, it wasn’t just about getting the analytics right. The
company realized early on that infrastructure had to go hand in
hand with analytics to support the data ecosystem. Many open
source projects, most famously Apache Kafka, were born at
LinkedIn to support this ecosystem. Today at LinkedIn, we rely
heavily on the scalability and reliability of Kafka, Hadoop, and a sur‐
rounding ecosystem of open source and internally developed tools
to serve our analytic needs.
Early on, the company found that different teams—such as the
Email Team, and the Homepage Team—were using disparate tools
when building data pipelines, as illustrated in Figure 9-1.
Figure 9-1. Different teams built and operated different pipelines
(source: LinkedIn)1
The next step was to unify the metric computation. When I arrived,
the team was working on the second generation of a unified plat‐
form for computing metrics. We were moving beyond using this
platform for experimentation to supporting reporting and other
analytics use cases. This centralized platform allowed us to impose
governance and quality controls on the data. There was now just
one place to go to edit the logic, so if product analysts wanted to
change something, there was no confusion about where the logic
was, who it affected, or the process for changing it.
Now, we have a way for analysts, data scientists, and engineers to
define metrics using the language of their choice, whether it’s Hive,
Pig, or something else. A significant percentage of these metrics can
be expressed declaratively, which allows the data infrastructure team
to keep adding more execution platforms and optimizations. The
platform code generates the entire flow, and we push all these com‐
puted metrics into a distributed Online Analytical Processing
(OLAP) engine called Pinot that we also built here.
tried to do its own data analysis, and soon each had its own data-
driven processes, which resulted in inconsistent data across teams.
So, Uber’s first step toward data democratization was to establish a
centralized data organization that would be responsible for the
company-wide data platform. This centralized team is important for
two reasons: consistency and efficiency. The first, consistency, is
obvious. We can’t have two different teams define the same driver
rating average value with different formulas (for example, moving
average and lifetime average). If one team found drivers getting an
average rating of 4.8, and another 4.6, that would make our data
confusing to internal users.
The second reason for centralizing our data team is efficiency. The
bigger the data is, the costlier it will be. We need to handle our data
efficiently to minimize those costs. Increasing granularity gives us
more insights but increases the cost because more hardware will be
needed. A centralized team also makes centralized access control for
compliance, privacy, and security more efficient.
Technical Scalability
Uber’s initial data stack included massively parallel processing
(MPP) architecture that used proprietary technology. MPP is tradi‐
tionally more efficient in a small-scale cluster because it assigns each
node to a specific function. For example, if we have 10 nodes, we can
divide the job into 10 pieces, so each node will take exactly 10 per‐
cent of the work. This approach is centrally organized and typically
results in high efficiency when everything runs smoothly.
Self-Serve Platforms
There are many ways to serve data on a web browser. Dashboards
are very popular among nontechnical users; however, dashboards—
even the interactive ones—are limited in functionality. To combat
these limitations at Uber, we built a browser-based SQL access tool.
It’s very important to enforce sound controls on a self-serve plat‐
form. Ensuring compliance, privacy, and security is the most obvi‐
ous reason, but scalability and efficiency are also strong
considerations. The data cluster might not have enough capacity to
give everyone the computational resources they require, especially
when a company is scaling rapidly.
Supporting Users
A self-serve platform is the first component of data democratization;
the second is implementing good processes and support systems.
For Uber’s use case, we decided the appropriate combination to
meet these needs was a specialized, full-time data ops team, as well
as dedicated time from rest of the data team.
This combination is important. The data ops team builds very
strong relationships with users. It understands users’ business use
cases as well as their pain points with the self-serve platform.
Responsibilities include building and improving communication
When my team joined Twitter, the need for real-time analytics was
growing exponentially. The company had one of the first implemen‐
tations of the real-time processing system called Apache Storm,
another popular software project that was open sourced by Twitter
in 2011.
But Storm had three primary shortcomings: it was labor-intensive
to support, organizations couldn’t reliably run it in a
monitored cluster at scale, and it couldn’t handle diverse use cases.
We would receive pages almost every night in the wee hours because
it was so unreliable. And precisely because it was so unreliable, my
team would devote significant resources to a single job, even though
that job might not warrant that much attention. Thus, the cost of
running individual jobs was high.
Ironically, the more resources that were dedicated to a job, the more
unreliable that job became. Such was the status of the situation there
when we started. After looking at the various options, I decided that
writing new software from scratch would give Twitter more inde‐
pendent control and ownership than trying to adapt existing open
source software. So, we decided to write our own software, which we
did pretty quickly—within just nine months.
After we moved Heron into production, my team was able to reduce
the number of incidents tenfold. In fact, within the two and half
years Heron was in production, my team experienced only four or
five incidents in total. In short, Heron had proven to be completely
reliable.
By having a stable system like Heron to analyze all of our data, the
data-driven nature of Twitter’s operations gave us continuous visi‐
bility across all our operations. Twitter is therefore a stronger com‐
pany because its team is able to make decisions based on data rather
than “gut feel.”
For example, we used data coming from logs to analyze failures
across datacenters. If one datacenter failed for any reason, we ana‐
lyzed that data to figure out when the failure occurred, and to
immediately cut traffic over to the other datacenter. This type of
instant failover improved the overall availability of Twitter.
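A hypothetical sketch of that kind of failover decision; the thresholds, names, and logic are invented, not Twitter's actual system:

    # Watch per-datacenter error rates derived from logs and route traffic
    # to the healthiest datacenter.
    ERROR_RATE_THRESHOLD = 0.05

    def choose_active_dc(error_rates: dict) -> str:
        """Return the healthiest datacenter given recent log-derived error rates."""
        healthy = {dc: r for dc, r in error_rates.items() if r < ERROR_RATE_THRESHOLD}
        pool = healthy or error_rates  # degrade gracefully if all look bad
        return min(pool, key=pool.get)

    print(choose_active_dc({"dc-east": 0.21, "dc-west": 0.01}))  # dc-west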
Heron is all about processing data in motion. In other words, Twit‐
ter’s data is continuously streaming, and the aggregations and calcu‐
lations all happen on the data in real time.
Looking Ahead
Today, I am considering a number of new ideas, especially in terms
of detecting anomalies in data. For example, instead of saying, “I’m
looking for an anomaly,” we want to automatically find anomalies
when they occur, and make them visible so that we can do a
forward-looking analysis of why something happened.
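One simple way to surface anomalies automatically, as a hedged sketch (the window size and threshold are arbitrary choices):

    from statistics import mean, stdev

    def anomalies(series, window=10, z=3.0):
        """Flag points that sit far outside a rolling mean of recent history."""
        flagged = []
        for i in range(window, len(series)):
            hist = series[i - window:i]
            mu, sigma = mean(hist), stdev(hist)
            if sigma and abs(series[i] - mu) > z * sigma:
                flagged.append(i)
        return flagged

    data = [10, 11, 9, 10, 12, 10, 11, 10, 9, 10, 55, 10, 11]
    print(anomalies(data))  # [10] -> the spike at index 10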
What can’t you buy on eBay? You name it, you can find it on the
global online marketplace. Four nights in a luxury condo in Tellur‐
ide? Credit for Uber rides? A Winnebago motor home? Or perhaps
clothes, kitchen gadgets, computers, and just about anything else
you can think of for personal or business use.
All the activities of all the millions of buyers and sellers—165 mil‐
lion active users at the end of 20161 to be exact, with more than 1
billion simultaneous listings2—are captured as part of eBay’s stan‐
dard operating procedures: capture all data and decide whether—
and how—to use it later.
1 https://www.statista.com/statistics/242235/number-of-ebays-total-active-users/
2 http://expandedramblings.com/index.php/ebay-stats/
When I joined eBay 10 years ago, I was glad to see that a data-driven
culture was already ingrained in the organization. In particular, I
was delighted to see that eBay already had executive support for cen‐
tralizing data. It made my role to build, manage, and operate every‐
thing to do with data for eBay—from datacenters to frameworks, to
data infrastructure, networks, and all data services—much easier.
At eBay, we produce three fundamental types of data. First is the
transactional data: every time a user buys, bids, or otherwise takes
action on the site, data is generated and collected. Transactional data
produces tens of millions of data points every day.
The second kind of data is behavioral data: the ways our users inter‐
act with eBay. A lot of behavioral actions—browsing, clicking on a
link in an email, or monitoring a particular auction—eventually lead
to the transaction, but at eBay, we consider that a different kind of
data. The behavioral data, which is both semi-structured and
unstructured, is orders of magnitude larger in volume than the
transactional data.
The third kind of data is system data: this is data that represents the
way our systems operate and interact with each other, which
includes but is not limited to their health (security, performance,
availability, diagnostics) and more, in order to produce a characteri‐
zation of the systems and humans working together to deliver eBay’s
customer experiences.
Before I joined the company, eBay was primarily a transaction data–
driven company. All the transaction data was put in one central data
warehouse that was managed by a centralized team, and all deci‐
sions and reporting came out of that team. For example, the ware‐
house tracked such things as how many Pez dispensers were sold
during a given period, the average selling price, and other parame‐
ters of transactions.
Back then, the transactional data warehouse was very small. Still, I
feel that having a centralized data warehouse and supporting team
was a critical early decision—and a great one. The only issue was the
velocity of change to data models. Whenever we had logic changes
within the site, it took us longer to make changes in the reports than
we would have liked.
At that point, we were not storing any of our user behavior data.
Instead, we were throwing it out. This was in late 2007, and Hadoop
was just coming into the picture within the industry but had not yet
matured to the point where it was mainstream.
So, when I arrived, the question was how could we use all the behav‐
ioral data in a way that could drive personalization—which would in
turn drive better customer engagement and delight customers? The
drive to do more with data came from the top of the company.
eBay’s CFO was particularly interested in how to use data to under‐
stand how the business could be improved.
You need three elements to build a data-driven company: extremely
strong executive support; centralized reporting of business data; and
high-quality data.
About the first and second points: if directives to use centralized
data are not coming from the top down, you’re going to get different
answers for the same questions, and people are going to begin ques‐
tioning the data itself. Then, of course, there’s the question “is the
data right?” The quality of data is extremely important so that people
trust it. Data governance plays a huge part in building a data-driven
company.
Organizational Structure
When I was at eBay, the analytics team reported up through the
CFO. We organized our data analysts around the business units—
the groups that support our various business products.
When companies are small, centralization supports their ability to
build momentum. It also creates a consumer relationship between
the providers and the users of data. Somebody’s clamoring for the
data and then somebody produces the data to meet that need, and
that’s a good thing. But it begins to break down when you have
product managers and other employees who want direct access to
the data. The centralized team can’t service them all fast enough.
Then you enter a phase in which self-service data access becomes
important. Every company is going to struggle with that eventually.
Who is the end user for the data? Who gets to be on the user side of
self-service?
In other words, does your data exist to simply make your data ana‐
lysts more productive, or are you concerned about giving access to
data to product managers, development managers, and everyone
else in the organization?
I championed the fact that data is there for everybody. Of course,
eBay has specialized people—the employees we called data analysts
—who are experts in the use of the data. However, more and more,
we realized that the regular questions about products, product help,
product metrics, and all the other things that are routinely asked
should be done by product managers and other employees.
In such cases, a simple SQL-like interface, or simple out-of-the-box
reports and graphical representations of data that can be achieved
with today’s visualization tools provide more self-service ways of
slicing and dicing data.
But now the challenge is that the demands on your data and data
infrastructure become extremely high. This was where having a
strong relationship with the CFO was very important. As it turned
out, the cost of capturing, storing, and making data available was
always one of our bigger budget line items.
data and create a group platform where sellers and others can use
data to extend their experiences. eBay is continuing on its journey to
open up even more of its data to users.
However, we were very careful to avoid creating multiple frame‐
works that did the same thing. We focused on having the platform
team provide a very simple, easy-to-use library and easy-to-use end
points, focusing heavily on productivity in two forms. One was data
productivity for data engineers, and the second was developer pro‐
ductivity because developers also want to use the same data to
enhance the experiences that they are creating in the products.
Looking Ahead
Today, eBay’s plan is to integrate machine learning into each and
every piece of data in the eBay product infrastructure, and to under‐
stand how to create a more complex and more intelligent form of
data processing. The company is now working on building infra‐
structure that it can use to promote self-service machine learning,
reusable machine learning, and extensible machine learning. That’s
the next frontier that eBay is trying to get to for analysts, product
managers, business people, and developers.
companies. It’s been proven again and again and there’s a lot of liter‐
ature around this, which shows that companies that embrace that
type of an approach ultimately become much more profitable from
different metrics of success. They become much more successful as
compared to companies who are just relying on intuition or gut feel
or certain expert opinions inside the company itself, so that is what I
mean by data-driven culture. It is essentially a positive confluence
of people, processes, and technologies that puts data
into the conversations that companies have whether they’re for stra‐
tegic decision-making or tactical reasons.
Jon: It’s a matter of avoiding, in part, what is sometimes called HIPPO
—the “highest paid person’s opinion.”
Ashish: That is correct, correct. The HIPPOs are very dangerous
and data essentially makes the conversation much more objective as
opposed to subjective.
Jon: Excellent.
Ashish: Sort of [inaudible 00:02:09] and it empowers people to
actually talk about issues in an objective way, as opposed to in a
subjective way.
Jon: It’s kind of a mindset that can spread throughout a whole com‐
pany and become a way that any employee contributes by looking at
the data and making, as you say, more objective decisions rather than
perhaps embedding themselves unproductively in a hierarchy or feeling
like they can’t contribute.
Ashish: That is correct. It does empower employees and very impor‐
tantly, it also...What it does is that when you are in a room making a
decision or talking about a certain issue, then the natural ques‐
tion...Whenever there’s such a discussion, there are a lot of questions
that arise and the natural recourse to that should be, “Hey, let’s look
at the data and figure out whether some of these assumptions are
correct,” or “If we do such and such thing, what will be the effect?
What does the data show us?” That type of realization across the
company, across different levels of the company, across different
functions of the company, once that type of realization seeps in, that’s
what creates a data-driven culture.