Big Data Beyond The Hype
Dirk deRoos is IBM’s World-Wide Technical Sales Leader for IBM’s Big Data
technologies. Dirk has spent the past four years helping customers build Big
Data solutions featuring InfoSphere BigInsights for Hadoop and Apache
Hadoop, along with other components in IBM’s Big Data and Analytics plat-
form. Dirk has co-authored three books on this subject area: Hadoop for Dum-
mies, Harness the Power of Big Data, and Understanding Big Data. Dirk earned
two degrees from the University of New Brunswick in Canada: a bachelor of
computer science and a bachelor of arts (honors English). You can reach him
on Twitter @Dirk_deRoos.
Rick Buglio has been at IBM for more than eight years and is currently a
product manager responsible for managing IBM’s InfoSphere Optim solu-
tions, which are an integral part of IBM’s InfoSphere lifecycle management
portfolio. He specializes in Optim’s Data Privacy and Test Data Management
services, which are used extensively to build right-sized, privatized, and
trusted nonproduction environments. Prior to joining the Optim team at
IBM, he was a product manager for IBM’s Data Studio solution and was
instrumental in bringing the product to market. Rick has more than 35 years
of experience in the information technology and commercial software industry,
in roles including application programmer, business
analyst, database administrator, and consultant, and has spent the last 18
years as a product manager specializing in the design, management, and
delivery of successful and effective database management solutions for
numerous industry-leading database management software companies.
Paul Zikopoulos
Dirk deRoos
Christopher Bienko
Rick Buglio
Marc Andrews
ISBN: 978-0-07-184466-6
MHID: 0-07-184466-X
The material in this eBook also appears in the print version of this title: ISBN: 978-0-07-184465-9,
MHID: 0-07-184465-1.
All trademarks are trademarks of their respective owners. Rather than put a trademark symbol after every occurrence of a
trademarked name, we use names in an editorial fashion only, and to the benefit of the trademark owner, with no intention
of infringement of the trademark. Where such designations appear in this book, they have been printed with initial caps.
McGraw-Hill Education eBooks are available at special quantity discounts to use as premiums and sales promotions or
for use in corporate training programs. To contact a representative, please visit the Contact Us page at
www.mhprofessional.com.
The contents of this book describe features that may or may not be available in the current release of any on-premises
or off-premises products or services mentioned in this book, regardless of what the book may state. IBM reserves the
right to include or exclude any functionality mentioned in this book for the current or subsequent releases of any IBM
cloud services or products mentioned in this book. Decisions to purchase any IBM software should not be made based
on the features described in this book. In addition, any performance claims made in this book are not official
communications by IBM; rather, they are the results observed by the authors in unaudited testing. The views expressed in
this book are ultimately those of the authors and not necessarily those of IBM Corporation.
Information has been obtained by McGraw-Hill Education from sources believed to be reliable. However, because of the
possibility of human or mechanical error by our sources, McGraw-Hill Education, or others, McGraw-Hill Education
does not guarantee the accuracy, adequacy, or completeness of any information and is not responsible for any errors or
omissions or the results obtained from the use of such information.
TERMS OF USE
This is a copyrighted work and McGraw-Hill Education and its licensors reserve all rights in and to the work. Use of this
work is subject to these terms. Except as permitted under the Copyright Act of 1976 and the right to store and retrieve
one copy of the work, you may not decompile, disassemble, reverse engineer, reproduce, modify, create derivative works
based upon, transmit, distribute, disseminate, sell, publish or sublicense the work or any part of it without McGraw-Hill
Education’s prior consent. You may use the work for your own noncommercial and personal use; any other use of the
work is strictly prohibited. Your right to use the work may be terminated if you fail to comply with these terms.
THE WORK IS PROVIDED “AS IS.” McGRAW-HILL EDUCATION AND ITS LICENSORS MAKE NO GUAR-
ANTEES OR WARRANTIES AS TO THE ACCURACY, ADEQUACY OR COMPLETENESS OF OR RESULTS
TO BE OBTAINED FROM USING THE WORK, INCLUDING ANY INFORMATION THAT CAN BE ACCESSED
THROUGH THE WORK VIA HYPERLINK OR OTHERWISE, AND EXPRESSLY DISCLAIM ANY WARRANTY,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABIL-
ITY OR FITNESS FOR A PARTICULAR PURPOSE. McGraw-Hill Education and its licensors do not warrant or guar-
antee that the functions contained in the work will meet your requirements or that its operation will be uninterrupted or
error free. Neither McGraw-Hill Education nor its licensors shall be liable to you or anyone else for any inaccuracy, error
or omission, regardless of cause, in the work or for any damages resulting therefrom. McGraw-Hill Education has no
responsibility for the content of any information accessed through the work. Under no circumstances shall McGraw-Hill
Education and/or its licensors be liable for any indirect, incidental, special, punitive, consequential or similar damages
that result from the use of or inability to use the work, even if any of them has been advised of the possibility of such
damages. This limitation of liability shall apply to any claim or cause whatsoever whether such claim or cause arises in
contract, tort or otherwise.
Book number 19: One day I’ll come to my senses. The time and energy
needed to write a book in your “spare time” are daunting, but if the end
result is the telling of a great story (in this case, the topic of Big Data in
general and the IBM Big Data and Analytics platform), I feel that it is well
worth the sacrifice.
Speaking of sacrifice, it’s not just the authors who pay a price to tell a story;
it’s also their loved ones, and I am lucky to have plenty of those people. My
wife is unwavering in her support for my career and takes a backseat more
often than not—she’s the “behind-the-scenes” person who makes it all work:
Thank you. And my sparkling daughter, Chloë, sheer inspiration behind a
simple smile, who so empowers me with endless energy and excitement each
and every day. Over the course of the summer, we started to talk about
Hadoop. She hears so much about it and wanted to know what it is. We
discussed massively parallel processing (MPP), and I gave her a challenge:
Come up with an idea for how MPP could help in a kid’s world. After some
thought, she brought a grass rake with marshmallows on each tine, put it on
the BBQ, and asked “Is this MPP, BigDaDa?” in her witty voice. That’s an
example of the sparkling energy I’m talking about. I love you for this,
Chloë—and for so much more.
I want to thank all the IBMers with whom I interact daily in my quest for
knowledge. I wasn’t born with it; I learn it—and from some of the most
talented people in the world…IBMers.
Finally, to Debbie Smith, Kelly McCoy, Brad Strickert, Brant Hurren,
Lindsay Hogg, Brittany Fletcher, and the rest of the Canada Power Yoga
crew in Oshawa: In the half-year leading up to this book, I badly injured my
back…twice. I don’t have the words to describe the frustration and despair
and how my life had taken a wrong turn. After meeting Debbie and her crew,
I found that, indeed, my life had taken a turn…but in the opposite direction
from what I had originally thought. Thanks to this studio for teaching me a
new practice, for your caring and your acceptance, and for restoring me to a
place of well-being (at least until Chloë is a teenager). Namaste.
—Paul Zikopoulos
To Sandra, Erik, and Anna: Yet again, I’ve taken on a book project, and yet
again, you’ve given me the support that I’ve needed.
—Dirk deRoos
To my family and friends who patiently put up with the late nights, short
weekends, and the long hours it takes to be a part of this incredible team:
Thank you. I would also like to thank those who first domesticated Coffee
arabica for making the aforementioned possible.
—Christopher Bienko
I would like to thank all of our clients for taking the time to tell us about
their challenges and giving us the opportunity to demonstrate how we can
help them. I would also like to thank the entire IBM Big Data Industry
Team for their continued motivation and passion working with our clients
to understand their business needs and helping them to find ways of
delivering new value through information and analytics. By listening to
our clients and sharing their experiences, we are able to continuously learn
new ways to help transform industries and businesses with data. And thank
you to my family, Amy, Ayla, and Ethan, for their patience and support
even when I am constantly away from home to spend time with companies
across the world in my personal pursuit to make an impact.
—Marc Andrews
CONTENTS AT A GLANCE
PART I
Opening Conversations About Big Data
1 Getting Hype out of the Way: Big Data and Beyond . . . . 3
PART II
Watson Foundations
5 Starting Out with a Solid Base:
A Tour of Watson Foundations . . . . . . . . . . . . . . . . . . . . 123
PART III
Calming the Waters: Big Data Governance
11 Guiding Principles for Data Governance . . . . . . . . . . . . . 303
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
PART I
Opening Conversations About Big Data
1 Getting Hype out of the Way: Big Data and Beyond . . . . 3
There’s Gold in “Them There” Hills! . . . . . . . . . . . . . . . . . . 3
Why Is Big Data Important? . . . . . . . . . . . . . . . . . . . . . . . . 5
Brought to You by the Letter V:
How We Define Big Data . . . . . . . . . . . . . . . . . . . . . . . 8
Cognitive Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Why Does the Big Data World
Need Cognitive Computing? . . . . . . . . . . . . . . . . . . . . 15
A Big Data and Analytics Platform Manifesto . . . . . . . . . . . 17
1. Discover, Explore, and Navigate Big Data Sources . . . . 18
2. Land, Manage, and Store
Huge Volumes of Any Data . . . . . . . . . . . . . . . . . . . . . . 20
3. Structured and Controlled Data . . . . . . . . . . . . . . . . . . . . 21
4. Manage and Analyze Unstructured Data . . . . . . . . . . . . 22
5. Analyze Data in Real Time . . . . . . . . . . . . . . . . . . . . . . . . 24
6. A Rich Library of Analytical Functions and Tools . . . . . 24
7. Integrate and Govern All Data Sources . . . . . . . . . . . . . 26
Cognitive Computing Systems . . . . . . . . . . . . . . . . . . . . . . 27
Of Cloud and Manifestos… . . . . . . . . . . . . . . . . . . . . . . . . . 27
Wrapping It Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
PART II
Watson Foundations
5 Starting Out with a Solid Base:
A Tour of Watson Foundations . . . . . . . . . . . . . . . . . . . . 123
Overview of Watson Foundations . . . . . . . . . . . . . . . . . . . . . . 124
A Continuum of Analytics Capabilities:
Foundations for Watson . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
PART III
Calming the Waters: Big Data Governance
11 Guiding Principles for Data Governance . . . . . . . . . . . . . 303
The IBM Data Governance Council Maturity Model . . . . . . . 304
Wrapping It Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
Beth Smith
General Manager, IBM Information Management
ACKNOWLEDGMENTS
Collectively, we want to thank the following people, without whom this
book would not have been possible: Anjul Bhambhri, Rob Thomas, Roger
Rea, Steven Sit, Rob Utzschneider, Joe DiPietro, Nagui Halim, Shivakumar
Vaithyanathan, Shankar Venkataraman, Dwaine Snow, Andrew Buckler,
Glen Sheffield, Klaus Roder, Ritika Gunnar, Tim Vincent, Jennifer McGinn,
Anand Ranganathan, Jennifer Chen, and Robert Uleman. Thanks also to all
the other people in our business who make personal sacrifices day in and
day out to bring you the IBM Big Data and Analytics platform. IBM is an
amazing place to work and is unparalleled when you get to work beside
this kind of brain power every day.
Roman Melnyk, our technical editor, has been working with us for a long
time—sometimes as a coauthor, sometimes as an editor, but always as an
integral part of the project. We also want to thank Xiamei (May) Li who
bailed us out on Chapter 14 and brought some common sense to our Big
Match chapter. Bob Harbus helped us a lot with Chapter 8—the shadow
tables technology—and we wanted to thank him here too.
We want to thank (although at times we cursed) Susan Visser, Frances
Wells, Melissa Reilly, and Linda Currie for getting the book in place; an idea
is an idea, but it takes people like this to help get that idea up and running.
Our editing team—Janet Walden, Kim Wimpsett, and Lisa McCoy—all
played a key role behind the scenes, and we want to extend our thanks for
that. It’s also hard not to give a special sentence of thanks to Hardik Popli at
Cenveo Publisher Services—the guy’s personal commitment to perfection is beyond
apparent. Finally, to our McGraw-Hill guru, Paul Carlstroem—there is a rea-
son why we specifically want to work with you—you did more magic for
this book than any other before it…thank you.
INTRODUCTION
The poet A.R. Ammons once wrote, “A word too much repeated falls out
of being.” Well, kudos to the term Big Data, because it’s hanging in there, and
it’s hard to imagine a term with more hype than Big Data. Indeed, perhaps it
is repeated too much. Big Data Beyond the Hype: A Guide to Conversations for
Today’s Data Center is a collection of discussions that take an overused term
and break it down into a confluence of technologies, some that have been
around for a while, some that are relatively new, and others that are just com-
ing down the pipe or are not even a market reality yet. The book is organized
into three parts.
Part I, “Opening Conversations About Big Data,” gives you a framework
so that you can engage in Big Data conversations in social forums, at key-
notes, in architectural reviews, during marketing mix planning, at the office
watercooler, or even with your spouse (nothing like a Big Data discussion to
inject romance into an evening). Although we talk a bit about what IBM does
in this space, the aim of this part is to give you a grounding in cloud service
delivery models, NoSQL, Big Data, cognitive computing, what a modern
data information architecture looks like, and more. This part gives you the
constructs and foundations that you will need to engage in conversations
that acknowledge the hype around Big Data, and then to take those
conversations beyond it.
In Chapter 1, we briefly tackle, define, and illustrate the term Big Data.
Although its use is ubiquitous, we think that many people have used it irre-
sponsibly. For example, some people think Big Data just means Hadoop—
and although Hadoop is indeed a critical repository and execution engine in
the Big Data world, Hadoop is not solely Big Data. In fact, without analytics,
Big Data is, well, just a bunch of data. Others think Big Data just means more
data, and although that could be a characteristic, you certainly can engage in
Big Data without lots of data. Big Data certainly doesn’t replace the RDBMS
either, and admittedly we do find it ironic that the biggest trend in the NoSQL
world is SQL.
We also included in this chapter a discussion of cognitive computing—the
next epoch of data analytics. IBM Watson represents a whole new class of
industry-specific solutions called Cognitive Systems. It builds upon—but is
a primer; that said, if you thought NoSQL was something you put on your
resume if you don’t know SQL, then reading this chapter will give you a
solid foundation for understanding some of the most powerful forces in
today’s IT landscape.
In the cloud, you don’t build applications; you compose them. “Composing
Cloud Applications: Why We Love the Bluemix and the IBM Cloud” is the
title for our dive into the innovative cloud computing marketplace of compos-
able services (Chapter 3). Three key phrases are introduced here: as a service, as
a service, and as a service (that’s not a typo; our editors are too good to have
missed that one). In this chapter, we introduce you to different cloud “as a
service” models, which you can define by business value or use case; as your
understanding of these new service models deepens, you will find that this
distinction appears less rigid than you might first expect. During this journey,
we examine IBM SoftLayer’s flexible, bare-metal infrastructure as a service
(IaaS), IBM Bluemix’s developer-friendly and enterprise-ready platform as a
service (PaaS), and software as a service (SaaS). In our SaaS discussion, we
talk about the IBM Cloud marketplace, where you can get started in a free-
mium way with loads of IBM and partner services to drive your business. We
will also talk about some of the subcategories in the SaaS model, such as data
warehouse as a service (DWaaS) and database as a service (DBaaS). The IBM
dashDB fully managed analytics service is an example of DWaaS. By this
point in the book, we will have already briefly discussed the IBM NoSQL Cloudant
database—an example of DBaaS. Finally, we will tie together how services can
be procured in a PaaS or SaaS environment—depending on what you are try-
ing to get done. For example, IBM provides a set of services to help build trust
into data; they are critical to building a data refinery. The collection of these
services is referred to as IBM DataWorks and can be used in production or to
build applications (just like the dashDB service, among others). Finally, we
talk about the IBM Cloud and how IBM, non-IBM, and open source services
are hosted to suit whatever needs may arise. After finishing this chapter, you
will be seeing “as a service” everywhere you look.
Part I ends with a discussion about where any organization that is serious
about its analytics is heading: toward a data zones model (Chapter 4). New
technologies introduced by the open source community and enhancements
developed by various technology vendors are driving a dramatic shift in
Part II, “IBM Watson Foundations,” covers the IBM Big Data and Analyt-
ics platform that taps into all relevant data, regardless of source or type, to
provide fresh insights in real time and the confidence to act on them. IBM
Watson Foundations, as its name implies, is the place where data is prepared
for the journey to cognitive computing, in other words, IBM Watson. Clients
often ask us how to get started with an IBM Watson project. We tell them to
start with the “ground truth”—your predetermined view of good, rational,
and trusted insights—because to get to the start line, you need a solid foun-
dation. IBM Watson Foundations enables you to infuse analytics into every
decision, every business process, and every system of engagement; indeed,
this part of the book gives you details on how IBM can help you get Big Data
beyond the hype.
As part of the IBM Big Data and Analytics portfolio, Watson Foundations
supports all types of analytics (including discovery, reporting, and analysis)
and predictive and cognitive capabilities, and that’s what we cover in Chap-
ter 5. For example, Watson Foundations offers an enterprise-class, nonforked
Apache Hadoop distribution that’s optionally “Blue suited” for even more
value; there’s also a rich portfolio of workload-optimized systems, analytics
on streaming data, text and content analytics, and more. With governance,
privacy, and security services, the platform is open, modular, trusted, and
integrated so that you can start small and scale at your own pace. The follow-
ing illustration shows an overview map of the IBM Watson Foundations capa-
bilities and some of the things we cover in this part of the book.
from?” and “Who owns this metric?” and “What did you do to present these
aggregated measures?”
A business glossary that can be used across the polyglot environment is
imperative. This information can be surfaced no matter the user or the tool.
Conversations about the glossarization, documentation, and location of data
make the data more trusted, and this allows you to broadcast new data assets
across the enterprise. Trusting data isn’t solely an on-premise phenomenon;
in our social-mobile-cloud world, it’s critical that these capabilities are pro-
vided as services, thereby creating a data refinery for trusted data. The set of
products within the InfoSphere Information Server family, which is discussed
in this chapter, includes the services that make up the IBM data refinery, and
they can be used traditionally through on-premise product installation or as
individually composable discrete services via the IBM DataWorks catalog.
Master data management (matching data from different repositories) is
the focus of Chapter 14. Matching is a critical tool for Big Data environments
that facilitate regular reporting and analytics, exploration, and discovery.
Most organizations have many databases, document stores, and log data
repositories, not to mention access to data from external sources. Successful
organizations in the Big Data era of analytics will effectively match the data
across these data sets and build context around them at scale, and this is
where IBM’s Big Match technology comes into play. In a Big Data world,
traditional matching engines that rely solely on relational technology aren’t
going to cut it. IBM’s Big Match, as far as we know, is the only enterprise-capable
matching engine that’s built on Hadoop, and it is the focus of the aptly
named Chapter 14, “Matching at Scale: Big Match.”
Ready…Set…Go!
We understand that when all is said and done, you will spend the better part
of a couple of days of your precious time reading this book. But we’re confi-
dent that by the time you are finished, you’ll have a better understanding of
requirements for the right Big Data and Analytics platform and a strong
foundational knowledge of available IBM technologies to help you tackle
the most promising Big Data opportunities. You will be able to get beyond
the hype.
Our authoring team has more than 100 years of collective experience,
including many thousands of consulting hours and customer interactions.
We have experience in research, patents, sales, architecture, development,
competitive analysis, management, and various industry verticals. We hope
that we have been able to effectively share some of that experience with you
to help you on your Big Data journey beyond the hype.
1 Getting Hype out of the Way: Big Data and Beyond
The term Big Data is a bit of a misnomer. Truth be told, we’re not even big
fans of the term—despite that it is so prominently displayed on the cover of
this book—because it implies that other data is somehow small (it might
be) or that this particular type of data is large in size (it can be, but doesn’t
have to be). For this reason, we thought we’d use this chapter to explain
exactly what Big Data is, to explore the future of Big Data (cognitive com-
puting), and to offer a manifesto of what constitutes a Big Data and Analyt-
ics platform.
in cleansing it, transforming it, tracking it, cataloging it, glossarizing it, and
so on. It was harvested…with care.
Today’s miners work differently. Gold mining leverages new age capital
equipment that can process millions of tons of dirt (low value-per-byte data)
to find nearly invisible strands of gold. Ore grades of 30 parts per million are
usually needed before gold is visible to the naked eye. In other words, there’s
a great deal of gold (high value-per-byte data) in all of this dirt (low value-
per-byte data), and with the right equipment, you can economically process
lots of dirt and keep the flakes of gold that you find. The flakes of gold are
processed and combined to make gold bars, which are stored and logged in
a safe place, governed, valued, and trusted. If this were data, we would call
it harvested because it has been processed, is trusted, and is of known quality.
The gold industry is working on chemical washes whose purpose is to
reveal even finer gold deposits in previously extracted dirt. The gold analogy
for Big Data holds for this innovation as well. New analytic approaches in
the future will enable you to extract more insight out of your forever
archived data than you can with today’s technology (we come back to this
when we discuss cognitive computing later in this chapter).
This is Big Data in a nutshell: It is the ability to retain, process, and under-
stand data like never before. It can mean more data than what you are using
today, but it can also mean different kinds of data—a venture into the
unstructured world where most of today’s data resides.
If you’ve ever been to Singapore, you’re surely aware of the kind of down-
pours that happen in that part of the world; what’s more, you know that it is
next to impossible to get a taxi during such a downpour. The reason seems
obvious—they are all busy. But when you take the “visible gold” and mix it
with the “nearly invisible gold,” you get a completely different story. When
Big Data tells the story about why you can’t get a cab in Singapore when it is
pouring rain, you find out that it is not because they are all busy. In fact, it is
the opposite! Cab drivers pull over and stop driving. Why? Because the
deductible on their insurance is prohibitive and not worth the risk of an acci-
dent (at fault or not). It was Big Data that found this correlation. By rigging
Singapore cabs with GPS systems to do spatial and temporal analysis of their
movements and combining that with freely available national weather ser-
vice data, it was found that taxi movements mostly stopped in torrential
downpours.
an “if” here that tightly correlates with the promise of Big Data: “If you could
collect and analyze all the data….” We like to refer to the capability of analyz-
ing all of the data as whole-population analytics. It’s one of the value proposi-
tions of Big Data; imagine the kind of predictions and insights your analytic
programs could make if they weren’t restricted to samples and subsets of
the data.
In the last couple of years, the data that is available in a Big Data world has
increased even more, and we refer to this phenomenon as the Internet of
Things (IoT). The IoT represents an evolution in which objects are capable of
interacting with other objects. For example, hospitals can monitor and regu-
late pacemakers from afar, factories can automatically address production-
line issues, and hotels can adjust temperature and lighting according to their
guests’ preferences. IBM’s Smarter Planet agenda predicted this development
and encapsulated it in the term interconnected.
This plethora of data sources and data types opens up new opportunities.
For example, energy companies can do things that they could not do before.
Data gathered from smart meters can provide a better understanding of cus-
tomer segmentation and behavior and of how pricing influences usage—but
only if companies have the ability to use such data. Time-of-use pricing
encourages cost-savvy energy consumers to run their laundry facilities, air
conditioners, and dishwashers at off-peak times. But the opportunities don’t
end there. With the additional information that’s available from smart meters
and smart grids, it’s possible to transform and dramatically improve the effi-
ciency of electricity generation and scheduling. It’s also possible to deter-
mine which appliances are drawing too much electricity and to use that
information to propose rebate-eligible, cost-effective, energy-efficient
upgrades wrapped in a compelling business case to improve the conversion
yield on an associated campaign.
Now consider the additional impact of social media. A social layer on top
of an instrumented and interconnected world generates a massive amount of
data too. This data is more complex because most of it is unstructured (images,
Twitter feeds, Facebook posts, micro-blog commentaries, and so on). If you
eat Frito-Lay SunChips, you might remember its move to the world’s first
biodegradable, environmentally friendly chip bag; you might also remember
how loud the packaging was. Customers created thousands of YouTube vid-
eos showing how noisy the environmentally friendly bag was. A “Sorry, but
I can’t hear you over this SunChips bag” Facebook page had hundreds of
thousands of Likes, and bloggers let their feelings be known. In the end, Frito-
Lay introduced a new, quieter SunChips bag, demonstrating the power and
importance of social media. It’s hard to miss the careers lost and made
by a tweet or video that went viral.
For a number of years, Facebook was adding a new user every three seconds;
today these users collectively generate double-digit terabytes of data every day.
In fact, in a typical day, Facebook experiences more than 3.5 billion posts and
about 155 million “Likes.” The format of a Facebook post is indeed structured
data. It’s encoded in the JavaScript Object Notation (JSON) format—which we
talk about in Chapter 2. However, it’s the unstructured part that has the “golden
nugget” of potential value; it holds monetizable intent, reputational decrees,
and more. Although the structured data is easy to store and analyze, it is the
unstructured components for intent, sentiment, and so on that are hard to
analyze. They’ve got the potential to be very rewarding, if….
Twitter is another phenomenon. The world now generates more than
400 million tweets a day: short opinions of 140 characters or less (amounting
to double-digit terabytes) and commentary (often unfiltered) about
sporting events, sales, images, politics, and more. Twitter also provides enor-
mous amounts of data that’s structured in format, but it’s the unstructured
part within the structure that holds most of the untapped value. Perhaps
more accurately, it’s the combination of the structured (timestamp, location)
and unstructured (the message) data where the ultimate value lies.
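To make the structured-versus-unstructured split concrete, here is a minimal sketch in Python of a social post represented as a JSON-style document. The field names and values are purely illustrative assumptions, not the actual schema of Twitter, Facebook, or any other platform.

```python
import json

# Illustrative post; the field names are hypothetical, not a real platform's schema.
post = {
    "id": "123456789",
    "timestamp": "2014-09-21T14:05:00Z",          # structured: trivial to store and index
    "geo": {"lat": 1.3521, "lon": 103.8198},      # structured: location
    "user": {"handle": "@example_user", "followers": 284},
    "text": "So done with this phone. Screen cracked AGAIN. Time for a new one?!",
}

# The structured envelope slots neatly into columns and indexes...
when, where = post["timestamp"], post["geo"]

# ...but the monetizable intent ("time for a new one") hides in the
# unstructured text and needs text analytics to surface it.
print(json.dumps(post, indent=2))
print(when, where)
```

The point of the sketch is simply that the container is structured while the value-bearing payload is free-form text; the two have to be analyzed together.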
The social world is at an inflection point. It is moving from a text-centric
mode of communication to a visually centric one. In fact, the fastest growing
social sites (such as Vine, Snapchat, Pinterest, and Instagram, among others)
are based on video or image communications. For example, one fashion house
uses Pinterest to build preference profiles for women (Pinterest’s membership
is approximately 95 percent female). It does this by sending collages of outfits
to clients and then extracting from the likes and dislikes preferences for color,
cut, fabric, and so on; the company is essentially learning through error (it’s a
methodology we refer to as fail fast and we talk about it later in this book).
Compare this to the traditional method whereby you might fill in a question-
naire about your favorite colors, cuts, and styles. Whenever you complete a
survey, you are guarded and think about your responses. The approach that
this fashion house is using represents unfiltered observation of raw human
behavior—instant decisions in the form of “likes” and “dislikes.”
much data you have; it’s what you do with it that counts. Like we always say,
Big Data without analytics is…well…just a bunch of data.
and classify what the sound was (in this case a falling tree), even if no one is
around to hear it. Image and sound classification is a great example of the
variety characteristic of Big Data.
Embedded within all of this noise are potentially useful signals: The person
who professes a profound disdain for her current smart phone manufacturer
and starts a soliloquy about the need for a new one is expressing monetizable
intent. Big Data is so vast that quality issues are a reality, and veracity is what
we generally use to refer to this problem domain; no one believes everything
they read on the Internet, do they? The fact that one in three business leaders
doesn’t trust the information used to make decisions is a strong indica-
tor that veracity needs to be a recognized Big Data attribute.
Cognitive Computing
Technology has helped us go faster and go further—it has literally taken us
to the moon. Technology has helped us to solve problems previous genera-
tions couldn’t imagine, let alone dream about. But one question that we often
ask our audiences is “Can technology think?” Cognitive systems are a category
of technologies that use natural language processing and machine
learning (we talk about these disciplines in Chapter 6) to enable people and
machines to interact more naturally and to extend and magnify human
expertise and cognition. These systems will eventually learn, think, and
interact with users to provide expert assistance to scientists, engineers, law-
yers, and other professionals in a fraction of the time it now takes.
When we are asked the “Can technology think?” question, our answer is
“Watson can!” Watson is the IBM name for its cognitive computing family
that delivers a set of technologies unlike any that has come before it. Rather
than forcing its users to think like a computer, Watson interacts with
humans…on human terms. Watson can read and understand natural lan-
guage, such as the tweets, texts, studies, and reports that make up the major-
ity of the world’s data—a simple Internet search can’t do that. You likely got
your first exposure to cognitive computing when IBM Watson defeated Brad
Rutter and Ken Jennings in the Jeopardy! challenge. This quiz show, known
for its complex, tricky questions and smart champions, was a perfect venue
for the world to get a first look at the potential for cognitive computing.
To play, let alone win, the Jeopardy! challenge, Watson had to answer ques-
tions posed in every nuance of natural language, including puns, synonyms
and homonyms, slang, and jargon. One thing you might not know is that
during the match, Watson was not connected to the Internet. It knew only
what it had amassed through years of interaction and learning from a large
there to make sense of the data and to help make the process faster and
more accurate.
Watson is also learning the language of finance to help wealth manage-
ment advisors plan for their clients’ retirement. For example, the IBM Watson
Engagement Advisor can react to a caller’s questions and help front-line ser-
vice personnel find the right answers faster. The IBM Watson Discovery
Advisor helps researchers uncover patterns that are hiding in mountains of
data and then helps them share these new insights with colleagues. With
Watson Analytics, you can ask Watson a question, and Watson then shows
you what the data means in a way that is easy for anyone to understand and
share—no techie required. Watson Data Explorer helps you to search, inter-
rogate, and find the data wherever it might be—and this capability is a pillar
to any Big Data and Analytics platform (we talk about this in the “A Big Data
and Analytics Platform Manifesto“ section in this chapter). Finally, there is
also the IBM Watson Developer’s Cloud, which offers the software, technology, and
tools that developers need to take advantage of Watson’s cognitive power
in the cloud, anytime.
For city leaders, such new systems can help them to prepare for major
storms by predicting electrical outages. They can also help them to plan evac-
uations and prepare emergency management equipment and personnel to
respond in the areas that will need help the most. These are just a few exam-
ples that show how Watson and cognitive computing are game changers.
Personalization applications that help you to shop for clothes, pick a bottle of
wine, and even invent a recipe (check out Chef Watson at
www.bonappetit.com/tag/chef-watson) are another area in which IBM cognitive computing
capabilities are redefining what Big Data solutions can do and what they
should do.
The one thing to remember about Watson long after you’ve read this book
is that Watson provides answers to your questions. Watson understands the
nuances of human language and returns relevant answers in appropriate
context. It keeps getting smarter, learning from each interaction with users
and each piece of data it ingests. In other words, you don’t really program
Watson; you learn with it.
You might be familiar with Apple’s Siri technology. People often ask us,
“What’s the difference?” Siri is preprogrammed, and all possible questions
and her answers must be written into the application using structured data.
Siri does not learn from her interactions either. Watson, on the other hand,
reads through all kinds of data (structured, semistructured, and unstruc-
tured) to provide answers to questions in natural language. Watson learns
from each interaction and gets smarter with time through its machine learn-
ing capabilities. We were pretty excited when IBM and Apple recently
announced their partnership because we believe that Siri and Watson are
going to go on a date and fireworks are going to happen! We also see the
potential for IBM technologies such as the NoSQL Cloudant document store
to get deeply embedded into iOS devices, thereby changing the game of this
technology for millions of users.
Figure 1-1 Big Data analytics and the context multiplier effect: raw data (travel history, social relationships, and so on) is progressively enriched through domain linkages with actuarial data, government statistics, epidemic data, occupational risk, family history, and dietary risk on the way to full contextual analytics.
Let’s assume you use a wearable device that tracks your steps during the
day and sleep patterns at night. In this case, the raw data is the number of
steps and hours of sleep. From this raw data, the device can calculate the
number of calories burned in a day; this is an example of feature extraction.
You can also use this device’s software to see at what points in the day you
are active; for example, you were highly active in the latter part of the day
and sedentary in the morning. During your sleeping hours, the device is able
to leverage the metadata to map your sleeping habits. This device’s accom-
panying application also enables you to log what you eat in a day and how
much you weigh; this is a great example of domain linkages. You are mixing
information about diet with physical markers that describe your overall
sleep and activity levels. There might also be a social ecosystem where others
share information such as occupation, location, some family medical history
(such as “My dad died of a heart attack”), travel information, hobbies, and so
on. Such a community can share information and encourage friendly compe-
tition to promote health. Such an infrastructure can represent a corpus for
full contextual analytics. It can factor in other variables, such as the weather,
and uncover trends, such as low activity, and alert you to them. This is a
simple example, but the point is that as interest in such a problem domain grows,
so too does the data. Now think about the field of cancer research. Consider
how many publications in this area are generated every day with all the clin-
ical trials and the documentation behind both successes and failures. Think
about all the relevant factors, such as a patient’s family history, diet, lifestyle,
job, and more. You can see that when you want to connect the dots—including
dots that you can’t even see at first—the data is going to get unmanageable
and cognitive help is needed.
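To make the context-multiplier idea a little more tangible, here is a toy sketch of the wearable example in Python. The calorie factor, thresholds, and food-log numbers are invented for illustration only; they are not medical guidance or any vendor’s actual algorithm.

```python
# Raw data: steps per hour and hours slept, roughly as a wearable might record them.
steps_by_hour = {8: 120, 12: 450, 18: 4200, 21: 300}
hours_slept = 6.5

# Feature extraction: derive a higher-level measure from the raw signal.
CALORIES_PER_STEP = 0.04                      # made-up constant for illustration
total_steps = sum(steps_by_hour.values())
calories_burned = total_steps * CALORIES_PER_STEP

# Domain linkage: mix in data from another domain (a self-reported food log).
calories_eaten = 2400
energy_balance = calories_eaten - calories_burned

# Toward contextual analytics: fold in yet more context (weather, family history,
# travel, social data...). Hand-written rules like the one below stop scaling
# quickly, which is exactly where cognitive systems come in.
rainy_day = True
if total_steps < 8000 and not rainy_day:
    print("Nudge: you were unusually sedentary today.")
print(f"Steps: {total_steps}, burned ~{calories_burned:.0f} kcal, "
      f"net balance {energy_balance:+.0f} kcal, sleep {hours_slept} h")
```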
That cognitive help can come in the form of the new class of industry-specific solu-
tions called cognitive systems, of which IBM Watson is a good example. The
next generation of problem solvers is going to learn much faster than we ever
did with cognitive computing technologies like Watson, and, in turn, Watson
will learn much faster with our help. Cognitive computing will enable devel-
opers to solve new problems and business leaders to ask bigger questions.
Together, people and technology will do things that generations before could
not have imagined.
24/7 decisioning (rather than just 24/7 data collection). A big part of that capa-
bility, and surely the focus of the next inflection point, is cognitive computing.
When putting together the building blocks of a Big Data and Analytics
platform, it is imperative that you start with the requirement that the plat-
form must support all of your data. It must also be able to run all of the com-
putations that are needed to drive your analytics, taking into consideration
the expected service level agreements (SLAs) that will come with multitiered
data. Figure 1-2 shows a vendor-neutral set of building blocks that we feel
are “must have” parts of a true Big Data platform—it’s our Big Data and
Analytics Manifesto.
Figure 1-2 A Big Data and Analytics platform manifesto: imperatives and underlying technologies (including data appliances for extreme performance)
forgot you have. It’s the same in the analytics world. With all the hype about
Big Data, it’s easy to see why companies can’t wait to analyze new data types
and make available new volumes of data with dreams of epiphanies that are
just “around the corner.” But any Big Data project should start with the
“what you could already know” part. From personalized marketing to
finance to public safety, your Big Data project should start with what you
have. In other words, know what you have before you make big plans to go
out and get more. Provision search, navigation, and discovery services over
a broad range of data sources and applications, both inside and outside of
your enterprise, to help your business uncover the most relevant information
and insights. A plan to get more mileage out of the data that you already have
before starting other data gathering projects will make understanding the
new data that much easier.
The process of data analysis begins with understanding data sources, figur-
ing out what data is available within a particular source, and getting a sense
of its quality and its relationship to other data elements. This process, known
as data discovery, enables data scientists to create the right analytic models
and computational strategies. Traditional approaches required data to be
physically moved to a central location before it could be discovered. With Big
Data, this approach is too expensive and impractical.
To facilitate data discovery and unlock resident value within Big Data, the
platform must be able to discover data that is in place. It has to be able to sup-
port the indexing, searching, and navigation of different sources of Big Data.
Quite simply, it has to be able to facilitate discovery in a diverse set of data
sources, such as databases, flat files, or content management systems—pretty
much any persistent data store (including Hadoop) that contains structured,
semistructured, or unstructured data.
And don’t forget, these cross-enterprise search, discovery, and navigation
services must strictly adhere to and preserve the inherent security profiles of
the underlying data systems. For example, suppose that a financial docu-
ment is redacted for certain users. When an authorized user searches for
granular details in this document, the user will see all the details. But if
someone else with a coarser level of authorization can see only the summary
when directly accessing the data through its natural interfaces, that’s all that
will be surfaced by the discovery service; it could even be the case that the
search doesn’t turn up any data at all!
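As a rough illustration of that requirement, the following sketch filters discovery results by the caller’s authorization before anything is surfaced. The roles, documents, and catalog structure are hypothetical; a real implementation would delegate these checks to the security model of each underlying system.

```python
# Hypothetical catalog entries; each records the clearance its source system enforces.
documents = [
    {"title": "Q3 financial detail",  "required_role": "analyst", "body": "granular figures ..."},
    {"title": "Q3 financial summary", "required_role": "viewer",  "body": "summary only ..."},
]

ROLE_RANK = {"viewer": 1, "analyst": 2}

def discover(query: str, user_role: str) -> list:
    """Return only the results that the underlying systems would let this user see."""
    hits = [d for d in documents if query.lower() in d["title"].lower()]
    return [d["title"] for d in hits if ROLE_RANK[user_role] >= ROLE_RANK[d["required_role"]]]

print(discover("q3 financial", "viewer"))   # only the summary is surfaced
print(discover("q3 financial", "analyst"))  # both documents are surfaced
```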
In a Big Data world, it’s not just about being able to search your data
repositories (we cover these repositories in Chapter 4); it’s also about creat-
ing a platform that promotes the rapid development and deployment of
search-based applications that leverage the Big Data architecture that you’ve
put in place—and that’s the point, right? You aren’t investing in Big Data
because you don’t want people to use the data. These aren’t data science
projects; they are strategic business imperatives.
When you invest in the discovery of data that you already have, your
business benefits immensely. Benefits include enhanced productivity with
greater access to needed information and improved collaboration and deci-
sion making through better information sharing.
It’s simple. In the same manner that humans forget stuff they used to
know as they get older, organizations forget things too. We call it organiza-
tional amnesia and it’s way more of an epidemic than you think. Keeping in
mind that innovation is facilitated by making information available to
empowered employees, when you start having conversations about getting
beyond the Big Data hype, don’t forget to include the information you
already have—the information that you may have just forgotten about.
Considering the pressing need for technologies that can overcome the vol-
ume and variety challenges for data at rest, it’s no wonder that business mag-
azines and online tech forums alike are buzzing about Hadoop. And it’s not
all talk either. The IT departments in most Fortune 500 companies have done
more than a little experimentation with Hadoop. The problem is that many of
these initiatives have stagnated in the “science project” phase. The challenges
are common: It is easy and exciting to start dumping data into these reposi-
tories; the hard part comes with what to do next. The meaningful analysis of
data that is stored in Hadoop requires highly specialized programming
skills, and for many algorithms, it can be challenging to put them into a par-
allelizable form so that they can run in Hadoop. And what about information
governance concerns, such as security and data lifecycle management, where
new technologies like Hadoop don’t have a complete story?
Some traditional database vendors consider Hadoop to be no more than a
data preparation opportunity for their data warehousing projects. We dis-
agree with that view and discuss how we see Hadoop at play for analytics,
data integration, operational excellence, discovery, and more in Chapter 6.
It is not just Hadoop that’s at play here. The JSON specification is ubiqui-
tous for how mobile information is encoded today. Understanding relation-
ships between entities (people, things, places, and so on) is something that
requires the use of graphs. Some data just needs a “bit bucket” to store any
data and retrieve it via a simple key. All of these fall under the NoSQL data-
base genre, and we cover that in Chapter 2.
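A minimal sketch of those three data shapes, using nothing but plain Python structures so the contrast is easy to see (Chapter 2 covers the real database genres behind them; the records below are invented examples):

```python
# Document model: a self-describing, JSON-style record with no fixed schema.
profile_doc = {"_id": "u42", "name": "Chloe", "devices": ["phone", "tablet"]}

# Key/value model: a "bit bucket" that stores any blob and retrieves it by a simple key.
bit_bucket = {}
bit_bucket["session:u42"] = b"\x00\x01opaque-bytes"
blob = bit_bucket["session:u42"]

# Graph model: entities plus the relationships between them.
edges = [("u42", "FRIEND_OF", "u77"), ("u42", "BOUGHT", "sku-123")]
friends = [dst for src, rel, dst in edges if src == "u42" and rel == "FRIEND_OF"]

print(profile_doc["name"], friends, len(blob))
```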
data classifications, and independent systems have been used to store and
manage these different data types. We’ve also seen the emergence of hybrid
systems that can be disappointing because they don’t natively manage all
data types. For the most part—there are some exceptions—we find that this
“round peg through a square hole” approach comes with a lot of
compromise.
Of course, organizational processes don’t distinguish among data types.
When you want to analyze customer support effectiveness, structured infor-
mation about a customer service representative’s conversation (such as call
duration, call outcome, customer satisfaction survey responses, and so on) is
as important as unstructured information gleaned from that conversation
(such as sentiment, customer feedback, and verbally expressed concerns).
Effective analysis needs to factor in all components of an interaction and to
analyze them within the same context, regardless of whether the underlying
data is structured or not. A Big Data and Analytics platform must be able to
manage, store, and retrieve both unstructured and structured data, but it
must also provide tools for unstructured data exploration and analysis. (We
touch on this in Chapter 6, where we cover Hadoop.)
This is a key point that we do not want you to miss because it is often over-
looked. You likely do not have the appetite or the budget to hire hundreds of
expensive Java developers to build extractors. Nor are you likely a LinkedIn-
or Facebook-like company; you have a different core competency—your busi-
ness. To process and analyze the unstructured data that is found in call logs,
sentiment expressions, contracts, and so on, you need a rich ecosystem in
which you can develop those extractors. That ecosystem must start with a tool
set: visually rich tooling that enables you to compose and work with the
extractions. You will want to compose textual understanding algorithms
through a declarative language; in other words, you should not have to rely
on bits and bytes Java programming, but rather on an app-based composition
framework that is more consumable throughout the enterprise. If you want to
query Hadoop data at rest, you are likely looking for an SQL-like interface to
query unstructured data. This way, you can leverage the extensive skills in
which you have invested. Finally, the ecosystem should have the ability to
compile the higher-level declarations (the queries that you write) into code
that can be executed in an efficient and well-performing manner, just like the
SQL that is optimized under the covers by every RDBMS vendor.
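To illustrate the spirit of that approach, here is a miniature, purely hypothetical sketch of declarative extraction in Python: the rules are expressed as data, and a small engine “compiles” and applies them, rather than every analyst hand-coding low-level logic. It is not the syntax of any IBM, Hadoop, or SQL-on-Hadoop product; it only sketches the idea.

```python
import re

# Declarative rules: describe what to extract, not how to loop over the bytes.
RULES = {
    "monetizable_intent": r"\b(?:need|want|time for) a new \w+",
    "negative_sentiment": r"\b(?:hate|terrible|done with)\b",
}

# "Compile" the declarations once, the way an optimizer compiles SQL under the covers.
COMPILED = {name: re.compile(pattern, re.IGNORECASE) for name, pattern in RULES.items()}

def extract(text: str) -> dict:
    """Apply every rule to a call log, post, or contract and return what matched."""
    return {name: rx.findall(text) for name, rx in COMPILED.items() if rx.search(text)}

print(extract("I am so done with this phone, time for a new one."))
# {'monetizable_intent': ['time for a new one'], 'negative_sentiment': ['done with']}
```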
Of course, you can apply the same requirements to video and audio, among
other data types. For example, IBM hosts what we believe to be the world’s
largest image classification library. Pictures are retrieved on the basis of attri-
butes that are “learned” through training sets. Imagine combing through thou-
sands of photos looking for winter sports without having to tag them as winter
sports. You can try it for yourself at http://mp7.watson.ibm.com/imars/.
your platform empowers you to run extremely fast analytics, you have the
foundation on which to support multiple analytic iterations and speed up
model development. Although this is the goal, there needs to be a focus on
improving developer productivity. By making it easy to discover data,
develop and deploy models, visualize results, and integrate with front-end
applications, your organization can enable practitioners, such as analysts
and data scientists, to be more effective. We refer to this concept as the art of
consumability. Consumability is key to democratizing Big Data across the
enterprise. You shouldn’t just want this; you should demand that your Big
Data and Analytics platform flatten the time-to-analysis curve with a rich set
of accelerators, libraries of analytic functions, and a tool set that brings agility
to the development and visualization process. Throughout this book, we
illustrate how this part of the manifesto is rooted across the IBM Big Data
and Analytics platform.
A new take on an old saying is appropriate here: “A picture is worth a thou-
sand terabytes.” Because analytics is an emerging discipline, it is not uncommon
to find data scientists who have their own preferred approach to creating and
visualizing models. It’s important that you spend time focusing on the visual-
ization of your data because that will be a key consumability attribute and
communication vehicle for your platform. It is not the case that the fancier the
visualization, the better—so we are not asking you to use your Office Excel 3D
mapping rotations here. That said, Big Data does offer some new kinds of visu-
alizations that you will want to get acquainted with, such as Streamgraphs,
Treemaps, Gantt charts, and more. But rest assured, your eyes will still be look-
ing over the traditional bar charts and scatter plots too.
Data scientists often build their models by using packaged applications,
emerging open source libraries, or “roll your own” approaches to build mod-
els with procedural languages. Creating a restrictive development environ-
ment curtails their productivity. A Big Data and Analytics platform needs to
support interaction with the most commonly available analytic packages. There
should also be deep integration that facilitates the pushing of computationally
intensive activities, such as model scoring, from those packages into the plat-
form on premise or in the cloud. At a minimum, your data-at-rest and data-in-
motion environments should support SAS, SPSS, and R. The platform also
needs to support a rich set of “parallelizable” algorithms (think: machine
learning) that has been developed and tested to run on Big Data (we cover all
of this in Chapter 6), with specific capabilities for unstructured data analytics
(such as text analytic routines) and a framework for developing additional
algorithms. Your platform should provide the ability to visualize and publish
results in an intuitive and easy-to-use manner.
Now reflect on some of the required Big Data and Analytics platform capa-
bilities we outlined in our manifesto. For example, you might leverage a
cloud service to perform one of the many data refinery activities available in
the IBM Cloud. The catalog of these services is collectively known as IBM
DataWorks. IBM BLU Acceleration technology is available as a turnkey analytics
powerhouse service called dashDB. (dashDB also includes other services,
some of which are found in the IBM DataWorks catalog, along with Netezza
capabilities and more.)
The cloud-first strategy isn’t just for IBM’s Information Management
brand either. Watson Analytics is a set of services that provide a cloud-based
predictive and cognitive analytics discovery platform that’s purposely
designed for the business user (remember the new buyer of IT we talked
about earlier in this chapter).
Need a JSON document store to support a fast-paced and agile app dev
environment? It’s available in the IBM cloud as well; Hadoop cluster, yup;
SQL data store, yup—you get the point. From standing up computing
resources (infrastructure as a service), to composing applications (platform as
a service—think IBM Bluemix), to running them (the IBM Cloud marketplace
SaaS environment), the richest set of service capabilities we know of is available
from IBM. These services are mostly available in a freemium model to
get going with options for enterprise-scale use and sophistication. Finally,
the IBM Cloud services catalog doesn’t just contain IBM capabilities; it also
includes services delivered through a rich partner ecosystem and open
source technologies too.
Wrapping It Up
In this chapter, we introduced you to the concept of Big Data. We also high-
lighted the future of analytics: cognitive computing. Finally, we shared with
you our Big Data and Analytics platform manifesto.
We kept most of this chapter vendor agnostic. We wanted to give you a
solid background against which you can evaluate the vendors that you
choose to help you build your own Big Data and Analytics platform and to
describe the things to look for when you design the architecture. In short, this
chapter gave you the key points to guide you in any Big Data conversation.
The remaining chapters in this part of the book cover NoSQL and the cloud.
By the time you finish this part, you will have a solid understanding of the
main topics of discussion in today’s data-intensive analytics landscape.
Of course, it goes without saying that we hope you will agree that IBM is a
Big Data and Analytics partner that has journeyed further with respect to stra-
tegic vision and capabilities than any other vendor in today’s marketplace.
And although we would love you to implement every IBM solution that maps
to this manifesto, remember that the entire portfolio is modular in design. You
can take the governance pieces for database activity monitoring and use them on
MongoDB or another vendor’s distribution of Hadoop. Whatever vendor, or
vendors, you choose to partner with on your Big Data journey, be sure to
follow this Big Data and Analytics manifesto.
2 To SQL or Not to SQL: That’s Not the Question, It’s the Era of Polyglot Persistence
complicated data-layer problems into segments and to select the best tech-
nology for each problem segment. No technology completely replaces
another; there are existing investments in database infrastructure, skills
honed over decades, and existing best practices that cannot (and should not)
be thrown away. Innovation around polyglot solutions and NoSQL technolo-
gies will be complementary to existing SQL systems. The question is and
always has been “What is available to help the business, and what is the
problem that we are trying to solve?" If you're having a Big Data conversation that's intended to get beyond the hype, you need to become versed in polyglot concepts—the point of this chapter.
The easiest way to look at polyglot persistence is first to understand the
emergence of the two most ubiquitous data persistence architectures in use
today: SQL and NoSQL. (We are setting aside for the moment another, more
recent approach, which is aptly named NewSQL.) NoSQL is not just a single
technology, a fact that makes any conversations on this topic a bit more com-
plicated than most. There are many kinds of NoSQL technologies. Even the
NoSQL classification system is full of complexity (for example, the data
model for some column family stores is a key/value format, but there are
key/value databases too). In fact, the last time we looked, there were more
than 125 open source NoSQL database offerings! (For a summary of the cur-
rent landscape, see 451 Group’s Matt Aslett’s data platforms landscape map
at http://tinyurl.com/pakbgbd.)
The global NoSQL market has been forecast to reach $3.4 billion by 2020, a 21 percent compound annual growth rate (CAGR) between 2015 and 2020, according to the value-added reseller firm Solid IT. The bottom line is that NoSQL technologies continue to gain traction (some much more than others), but the majority have ended up as spin-offs or offshoots or have just faded away. In this chapter, we are going to talk about the styles of NoSQL databases and refer to some of the ones that have made their mark and didn't fade away.
You likely interact with NoSQL databases (directly or indirectly) on a daily
basis. Do you belong to any social sites? Are you an online gamer? Do you
buy things online? Have you experienced a personalized ad on your mobile
device or web browser? If you answered “yes” to any of these questions,
you’ve been touched by NoSQL technology. In fact, if you’re a Microsoft Xbox
One gamer, have leveraged the Rosetta Stone learning platform to
done” is the stuff that makes developers cringe). Now, think about this app
in production…your change request to surface the new Facebook scoring
module falls into the realm of DBA “limbo.” Ask colleagues who have been
in similar situations how long this might take and the answer might shock
you. In some companies, it’s months! (Additional data modeling could create
even further delays.)
Developers want to be able to add, modify, or remove data attributes at
will, without having to deal with middlemen or “pending approval” obsta-
cles. They want to experiment, learn from failure, and be agile. If they are to
fail, they want to fail fast (we talk about this methodology in Chapter 4).
Agility is prized because mobile app users expect continuous functional
enhancements, but the friction of traditional development approaches typi-
cally doesn’t allow for that. In most environments, developers have to keep
“in sync” with DBA change cycles.
NoSQL databases appeal to these developers because they can evolve
apps rapidly without DBA or data modeler intervention. There are other sce-
narios as well. Transaction speed matters in a social-mobile-cloud world; depending on the app, many developers are willing to serve up "stale" results (data that might be slightly out of date and will in all likelihood eventually be right, it's just not right now) in exchange for more responsive and faster apps. In some cases, they are even okay if the data is lost; we discuss these trade-offs later in this chapter. We think Big Data conversations that include polyglot persistence open the door to a more satisfactory trade-off between consistency and availability.
This agility scenario sums up what we believe are the fundamental differ-
ences between the SQL and NoSQL worlds. There are others. For example, in
the SQL world, a lot of the work to store data has to be done up front (this is
often referred to as schema-first or schema-on-write), but getting data out is
pretty simple (for example, Excel easily generates SQL queries). In the NoSQL
world, however, it’s really easy to store data (just dump it in), and much of
the work goes into programming ways to get the data out (schema-later or
schema-on-read).
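To make that contrast concrete, here is a minimal sketch in Python; the standard library's sqlite3 module stands in for a schema-first RDBMS, and a plain list of dictionaries stands in for a schema-later document store. Neither is a product discussed in this book; they simply illustrate the two styles.

import sqlite3

# Schema first: the table must be declared before any data can be stored,
# and a new attribute means an explicit schema change (ALTER TABLE).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (course TEXT, yelp_score TEXT)")
conn.execute("INSERT INTO reviews VALUES (?, ?)", ("Cooking 101", "*****"))
conn.execute("ALTER TABLE reviews ADD COLUMN facebook_review TEXT")  # the schema change

# Schema later: just store whatever attributes each record happens to have;
# the "schema" is discovered when the data is read back.
doc_store = []
doc_store.append({"course": "Cooking 101", "yelpScore": "*****"})
doc_store.append({"course": "Cooking 102", "facebookReview": "5/5", "instructor": "Jane"})

for doc in doc_store:
    print(doc.get("facebookReview", "no Facebook review recorded"))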
As previously mentioned, NoSQL was born out of a need for scalability.
Most (not all) NoSQL technologies run on elastic scale-out clusters, but most
RDBMSs are hard-pressed to scale horizontally for transactions. Microsoft SQL
Server doesn’t have anything here that isn’t some kind of workaround; at the
time of writing, SAP HANA doesn’t support scale-out for OLTP, and those
with experience trying to get Oracle RAC to scale know the effort that is
involved. From a transactional perspective, it’s fair to give IBM’s DB2 for Sys-
tem z and DB2 pureScale offerings due credit here. Their well-proven transac-
tional scale-out and performance-optimizing services are made possible by
IBM’s Coupling Facility technology for bringing in and removing resources
without having to alter your apps to achieve the near-linear performance gains
that you would expect when scaling a cluster. It’s also fair to note that almost
every RDBMS vendor has at least a semi-proven solution for scaling out busi-
ness intelligence (BI) apps—but these are very focused on structured data.
Table 2-1 lists a number of other differences between the NoSQL and SQL
worlds. We aren’t going to get into the details for each of them (or this would
be solely a NoSQL book), but we touch on some throughout the remainder
of this chapter. Take note that what we’re listing in this table are general
tendencies—there are exceptions.
The NoSQL World (Schema Later) | The SQL World (Schema First)
New questions? No schema change | New questions? Schema change
Schema change in minutes, if not seconds | Schema change requires a permission process (we're not talking minutes here)
Tolerates chaos for new insights and agility (the cost of "getting it wrong" is low) | Control freaks (in many cases for good reasons)
Agility in the name of discovery (easy to change) | Single version of the truth
Developers code integrity (yikes!) | Database manages integrity (consistency for any app that accesses the database)
Eventual consistency (the data that you read might not be current) | Consistency (can guarantee that the data you are reading is 100 percent current)
All kinds of data | Mostly structured data
Consistency, availability, and partition tolerance (CAP) | Atomicity, consistency, isolation, durability (ACID)
What Is NoSQL?
In our collective century-plus experience with IT, we honestly can’t recall a
more confrontational classification for a set of technologies than NoSQL.
More ironic is the popular trend of adopting SQL practices and terminology
in NoSQL solutions. We’re sure that there are practitioners who feel that the
name is apt because, as they see it, a war is coming—a sort of Game of Thrones
(“The Database”) where the Wildings of the North (the NoSQL “barbarians”)
are set to take over the regulated (and boring) world of RDBMSs. Although
this sounds pretty exciting, it’s simply not going to happen. The term NoSQL
was the happy (some would say not happy) result of a developer meet-up
that was pressed to come up with a name for SQL-alternative technologies.
Popular NoSQL products—which we classify later in this chapter—
include Cloudant, MongoDB, Redis, Riak, Couchbase, HBase, and Cassan-
dra, among others. These products almost exclusively run on Linux, leverage
locally attached storage (though sometimes network-attached storage), and
scale out on commodity hardware.
SQL, which has been around for 40 years, is the biggest craze in the NoSQL and Hadoop ecosystem space. SQL is so hot because it's incredibly well suited to querying structured data and because it is a declarative language: It enables you to get an answer without having to program how to get it.
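To see what "declarative" buys you, consider the following minimal sketch; it uses Python's standard sqlite3 module rather than Big SQL, and the table and column names are invented, but the contrast holds for any SQL engine.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (region TEXT, amount REAL)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("east", 10.0), ("west", 4.5), ("east", 2.5)])

# Declarative: say WHAT you want and let the engine decide how to get it.
for row in conn.execute("SELECT region, SUM(amount) FROM clicks GROUP BY region"):
    print(row)

# Imperative: spell out HOW to get it, step by step.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM clicks"):
    totals[region] = totals.get(region, 0) + amount
print(totals)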
This matters. It matters when it comes to broadening insights about your
organization. A terrific example of this involves the IBM SQL on Hadoop
engine (called Big SQL) available in InfoSphere BigInsights for Hadoop—a
“just the query engine” stand-alone MPP Hadoop SQL processing engine
that quite simply is the best in its class. The IBM approach is different from that of other vendors such as Oracle, Teradata, Pivotal, and Microsoft; they focus on a "use the RDBMS to either submit a remote query or run the entire query with the involvement of a database" approach. A true Hadoop query engine runs the MPP SQL query engine on Hadoop, uses Hive metadata, and operates on data stored in Hadoop's file system using common Hadoop data types. (For details on SQL on Hadoop and Big SQL, see Chapter 6.)
We believe that Big SQL can deliver incredible performance benefits and is superior to anything we've seen in the marketplace, including Cloudera Impala. Why? First, IBM has been in the MPP SQL parallelization game for more than four decades; as you can imagine, there are decades of experience in the Big SQL optimizer. Second, IBM's Big SQL offers the richest
SQL support that we know of today in this space—it supports the SQL used
in today’s enterprises. That’s very different from supporting the SQL that’s
easiest to implement, which is the tactic of many other vendors, like those
that support query-only engines. If Hadoop SQL compatibility were a golf
game, we’d say that the IBM Big SQL offering is in the cup or a tap-in birdie,
whereas others are in the current position of trying to chip on to save a bogey.
Don't lose sight of the motivation behind polyglot persistence: matching the right approach and technology to the business problem at hand. Conver-
sations around SQL and NoSQL should focus more on how the two technolo-
gies complement one another. In the same way that a person living in Canada
is better served by speaking English and French, a data architecture that can
easily combine NoSQL and SQL techniques offers greater value than the two
solutions working orthogonally. In our opinion—and there are those who will
debate this—the rise of NoSQL has really been driven by developers and the
requirements we outlined earlier in this chapter. New types of apps (online
gaming, ad serving, and so on), the social-mobile-cloud phenomenon, and the
acceptance of something we call eventual consistency (where data might not
necessarily be the most current—no problem if you are counting “Likes” on
Facebook but a big deal if you’re looking at your account balance) are all sig-
nificant factors.
At least for now, Hadoop is mostly used for analytic workloads. Over
time, we expect that to change; indeed, parts of its ecosystem, such as HBase,
are being used for transactional work and have emerging database proper-
ties for transactional control. Like NoSQL databases, Hadoop is implemented
on commodity clusters that scale out and run on Linux (although Horton-
works did port its Hortonworks Data Platform (HDP) to Windows in part-
nership with Microsoft). The disk architecture behind this ecosystem is
almost always locally attached disks.
which in this analogy’s case is the directory pathname. In the same way you
don’t load half a photo, in a true K/V store, you don’t fetch a portion of the
values. (If you want to do that, you should likely be using a NoSQL document
database, which we cover in the next section.) K/V stores are also less suited to complex data structures; they typically can't substitute for a JSON-encoded NoSQL document store, such as IBM Cloudant, which is better suited to persisting such structures.
K/V stores are popular for user sessions and shopping cart sessions
because they can provide rapid scaling for simple data collections. For exam-
ple, you might use a K/V store to create a front end for a retail web site’s
product catalog (this is information that doesn’t change very often); if incon-
sistent changes do occur, they’re not going to create major problems that
grind service to a halt. NoSQL’s eventually consistent data model applies
conflict resolution strategies under the covers while the distributed database
“agrees” (eventually) to a consistent view of your data. Data that is typically
found in K/V stores isn’t highly related, so practitioners rely on basic pro-
grammatic CRUD (create, read, update, delete) interfaces to work with them.
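As a sketch of what those CRUD interfaces look like, here is how a shopping cart session might be handled with Redis (mentioned earlier in this chapter) through the community redis-py client; the host, key names, and values are illustrative assumptions rather than a prescribed design.

import redis  # the community redis-py client; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Create/update: the value is an opaque blob as far as the store is concerned.
r.set("session:42:cart", '{"items": ["tequila", "limes", "chicken"]}')

# Read: you fetch the whole value by key; there is no query to run.
print(r.get("session:42:cart"))

# Delete: remove the key when the session expires or the order is placed.
r.delete("session:42:cart")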
We estimate that K/V stores represent about 24 percent of the NoSQL
market. Examples of K/V databases include MemcacheDB, Redis, Riak,
Amazon DynamoDB, and the one we aren’t supposed to say out loud (Volde-
mort), among others. IBM provides HBase (packaged with the InfoSphere
BigInsights for Hadoop offering, available on premise and off premise
through the IBM Cloud), which includes a number of enterprise-hardening
and accessibility features such as SQL and high-availability services. (HBase
has columnar characteristics as well, so we return to that product in our dis-
cussion about columnar stores.) Finally, it’s fair to note that IBM’s Cloudant
document store has been used with great success for many of the use cases
mentioned earlier.
database tables that store sparse data sets. (In a social-mobile-cloud world,
there are hundreds or thousands of stored attributes, and not all of them
have data values. Think of a detailed social profile that is barely filled out—
this is an example of a sparse data set.)
The data types that you can represent with JSON are intuitive and comprehensive: strings, numbers, Booleans, nulls, arrays, and objects (nested combinations of these data types). JSON objects are composed with curly brackets ({ }) and can contain arrays (unlike XML), which are denoted with
square brackets ([ ]). You use double quotation marks (" ") to denote a string.
For example, 35 is a number on which you can perform arithmetic operations,
whereas "35" is a string—both could be used for age, but the data type that
you choose has implications for what you can do with the data.
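The snippet below pulls those pieces together; the field names are invented for illustration, and Python's standard json module simply stands in for whatever JSON parser your language of choice provides.

import json

profile = {
    "name": "Jane",            # string
    "age": 35,                 # number (you can do arithmetic on it)
    "ageAsText": "35",         # string (you can't, at least not directly)
    "isMember": True,          # Boolean
    "nickname": None,          # null
    "interests": ["sailing", "Big Data"],              # array
    "address": {"city": "Toronto", "country": "CA"},   # nested object
}

doc = json.dumps(profile)             # serialize for storage or transport
print(json.loads(doc)["age"] + 1)     # 36: the number round-trips as a number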
In almost every use case that we’ve seen, JSON processing is much faster
than XML processing; it’s much easier to read and write too. As its name implies,
it’s closely related to the ubiquitous JavaScript programming language—
ubiquitous in the mobile apps space. XML, on the other hand, is fairly difficult
for JavaScript to parse; its processing involves a lot of infrastructure overhead.
Today we are in a world that IBM calls the “API economy,” which serves
as the backbone to the as-a-service model that dominates today’s cloud (we
talk about this in Chapter 3). JSON is a natural fit for these services—in fact,
if you programmatically interact with Facebook, Flickr, or Twitter services,
then you’re familiar with how these vendors leverage JSON-based APIs for
sending data to and receiving data from end users. By the time the RDBMS
world got sophisticated with XML, developers had moved to JSON because
it is so naturally integrated with almost any programming language used
today (not just JavaScript). JSON has a much simpler markup and appear-
ance than XML, but it is by no means less sophisticated in terms of function-
ality. This app shift has led developers to request that their data be stored in
JSON document store NoSQL databases.
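A typical exchange in this API economy looks something like the following sketch; the endpoint URL is hypothetical, and the third-party requests library is assumed only because it is a common way to call JSON-over-HTTP services from Python.

import requests  # assumed third-party HTTP client (pip install requests)

# Hypothetical REST endpoint that returns a JSON document.
response = requests.get("https://api.example.com/v1/courses/cooking-101")
course = response.json()   # the JSON payload parses straight into native types

print(course.get("yelpScore"))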
Most new development tools support JSON but not XML. In fact, between
2009 and 2011, one in every five APIs said “good-bye” to XML, and 20 per-
cent of all new APIs supported only JSON (based on the 3,200 web APIs that
were listed at Programmable Web in May 2011).
“Less is better: The less we need to agree upon for interoperation, the more
easily we interoperate,” said Tim O’Reilly (the personality behind those tech
books with animal covers). The logical representation, readability, and flexibil-
ity of JSON are compelling reasons why JSON is the de facto data interchange
format for the social-mobile-cloud world. These are the drivers behind JSON’s
success—they directly map to the developer’s value system we commented on
earlier—and are precisely the reason why NoSQL document stores such as
IBM Cloudant are dominating the NoSQL marketplace today.
{"yelpScore" : "*****"},
{"facebookReview" : "5/5"}
]
}
As you can see, a JSON document is pretty simple. And if you want to add
another set of attributes to this generated document, you can do so in seconds.
For example, you might want to add an Instructors array with corre-
sponding bios and certification information. A developer could do this with a
few simple keystrokes, thereby avoiding the begging and bribing rituals
around DBA-curated schema change requests of the past.
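Those few keystrokes might look like the following sketch in Python; the course name and Instructors entries are invented for illustration, and in a real document store (Cloudant, for example) you would write the updated JSON back through the database's API rather than printing it.

import json

course = {
    "course": "Tequila Lime Chicken 101",
    "reviews": [{"yelpScore": "*****"}, {"facebookReview": "5/5"}],
}

# Add a brand-new array attribute on the fly; no DBA, no ALTER TABLE.
course["Instructors"] = [
    {"name": "Jane Doe", "bio": "15 years teaching", "certified": True},
    {"name": "John Roe", "bio": "Pastry specialist", "certified": False},
]

print(json.dumps(course, indent=2))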
schema-less—you can define columns within rows and on the fly. Perhaps
puzzling to RDBMS practitioners is the fact that one row can have a different
number of columns than another row in the same table.
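Conceptually, you can picture a column-family row as a map of maps, which is why two rows can carry completely different columns; the sketch below is a plain-Python illustration of that idea, not the actual API of HBase or Cassandra.

# Two rows in the same (conceptual) column family; each row carries only the
# columns it actually has values for, so sparse data stays cheap to store.
profiles = {
    "user:1001": {"profile:name": "Jane", "profile:city": "Toronto"},
    "user:1002": {"profile:name": "Rick", "contact:twitter": "@rick",
                  "profile:employer": "IBM"},
}

# The "columns" are discovered per row at read time.
for row_key, columns in profiles.items():
    print(row_key, "->", sorted(columns))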
HBase and Cassandra are probably the most popular names in the NoSQL
columnar space—both descend from BigTable. There is a lot of fierce compe-
tition between them in the marketplace, with zealots on each side claiming
victory. Both technologies have a lot in common. They both log but don’t
support “real” transactions (more about this topic later), they both have
high availability and disaster recovery services that are implemented
through replication, both are Apache open source projects, and more. We
examined the strengths and weaknesses of both technologies, but we tend to
favor HBase (and not just because IBM ships it as part of BigInsights for
Hadoop). HBase is a more general-purpose tool, and it’s also an established
part of the Hadoop ecosystem. For example, when you install HBase through
BigInsights for Hadoop, you don’t need to iteratively download, install,
configure, and test other open source components. Instead, simply launch
the installer, and it performs the necessary work on single-node and cluster
environments. After installation, the tool automatically performs a “health
check” to validate the installation and report on the results. Monitoring of
your environment is centralized and ongoing administration is simplified
because it’s integrated into the Hadoop management tooling that is pro-
vided by IBM. Ultimately, HBase just feels like a much richer and more inte-
grated experience, especially for users of Hadoop. Cassandra can also run
on Hadoop, but we are told it runs best in standalone mode, where it’s really
good at high-velocity write operations.
is key, such as a web site recommendation engine for purchasing that is based
on social circles and influencers. Graph store databases are like those tiny fish
in a big aquarium that look really cool—they aren’t plentiful, but we are going
to keep a watchful eye because we expect great things from them.
As you can imagine, in the IoT world where we live, graph database tech-
nologies are poised to become a big deal. We think that Social Network Anal-
ysis (SNA) is a front-runner for graph database apps—it maps and measures
relationship flows between entities that belong to groups, networks, opera-
tional maps, and so on. If you’re an online dater, there’s no question that
you’ve been “touched” by a graph database. The social phenomenon Snap
AYI (Are You Interested) has socially graphed hundreds of millions of pro-
files for its matchmaking recommendations engine to statistically match
people that are more likely to be compatible. (Imagine what Yenta could have
done with this technology—think Fiddler on the Roof…wait for it…got it?
Good. If not, Google it.)
Although online dating might not be your cup of tea, we are almost sure that
our readers participate in professional social forums such as LinkedIn or Glass-
door; both use graph store technologies. Glassdoor’s social graph has nearly a
billion connections, and it openly talks about the accuracy of its recommenda-
tions engine (which is powered by graph database technology). Twitter? It
released its own graph store, FlockDB, to the open source community.
Imagine the usefulness of graph databases for solving traffic gridlock
problems by using location data from smart phones to orchestrate the speeds
and paths that drivers should take to reach their destinations; as creepy as
that sounds, it’s still a graph database use case. The classic traveling salesper-
son problem, network optimization, and so on—these kinds of problems are
aptly suited for graph databases. For a great example of how this type of
analysis can have a true impact on a business, consider a company like
UPS—the “What can brown do for you?” delivery folks. If every UPS driver
drove one extra mile a day, it would mean an average of 30,000 extra miles
for UPS—at a cost of $10,000,000 per year! Graph database technology could
identify the relationships among drivers, fuel, and costs sooner. Ten million
dollars in annual savings is certainly a strong incentive for companies that
are on the fence about what NoSQL technology can do for their business.
Finally, IBM Watson (briefly discussed in Chapter 1) uses graph stores to
draw conclusions from the corpus of data that it uses to identify probable
answers with confidence. There is a lot of math going on behind the scenes in
a graph database, but hey, that’s what computers are for.
Graph processing is heating up; graph stores are the fastest-growing database category in the NoSQL space. Some RDBMS vendors are incorporating graph technology as
part of their database solution. For example, both IBM DB2 and Teradata
have at one time or another announced such capabilities to the marketplace.
That said, we believe graph store technologies belong squarely in the NoSQL
realm and out of the RDBMS world, for reasons that are beyond the scope of
this book.
There are a number of graph databases available to the NoSQL market
today. For example, Apache Giraph is an iterative graph processing system
that is built for high scalability and that runs on Hadoop. Other examples
include Titan and Neo4J. IBM is working closely with graph stores in Wat-
son, as mentioned previously, and in social apps that are deployed across the
company. We wanted to formally state IBM’s direction in the graph store
space, but as it stands today our lawyers see fit to give us that loving mother
stare of “We know better” on this topic (at least we thought it was loving), so
we’ll leave it at this: Expect potentially big “titan-sized” things around the
time that this book is in your hands (or sometime thereafter).
area. SmallBlue is an internal IBM app that enables you to find colleagues—
both from your existing network of contacts and also from the wider organi-
zation—who are likely to be knowledgeable in subject areas that interest you.
This app can suggest professional network paths to help you reach out to
colleagues more accurately (from an interest perspective) and more quickly.
Figure 2-1 uses our group’s real app to show you how a graph store works.
A graph is always built from an ordered pair (just like the K/V stores we talked about earlier): vertices (nodes) and edges, and a graph is defined by those vertices and edges. Nodes can represent anything that interests you, and that's why
we often refer to them as entities. As you can see in Figure 2-1, a graph can
have many nodes and edges that are connected in multiple ways. This figure
shows a general graph from which you can learn a lot of things. For example,
who might be the best gatekeeper for a certain class of information in the
network? In other words, who is the key group connector? That individual
should be the one with the shortest paths to, and the highest degree (number of linkages) with, other people in the network.
Figure 2-1 Using a graph database to discover and understand relationships in a large
organization
A basic graph has details about every node and edge (edges can be directional). For example, the fact that Dirk knows Paul doesn't necessarily mean that Paul knows Dirk until an edge showing that Paul knows Dirk is added. In
a graph store, it’s important to understand that these connections are weighted;
not all connections are equal. Weightings can help you to understand the relationships and how they affect a network; in a visualization, weightings and other attributes can be conveyed through color or line thickness. For example, in Figure 2-1, Dirk has a larger
circle connecting him to Paul than Chris does. Although they both work for
Paul, Dirk and Paul have worked together for many years, published books
together, and previously worked side-by-side in development—all this
strengthens their relationship. Marc Andrews is a vice president in a different
segment and therefore has a different relationship with Paul; although that
relationship doesn’t have similar attributes to what Dirk and Paul have, Marc
and Paul work closely together and have equal hierarchical standing at IBM.
Rick Buglio is in a different relationship altogether; he doesn't report to Paul, nor is he part of the department, but something connects them (for exam-
ple, this book), and so on.
We created this visualization by searching our graph store for Big Data
attributes; we were specifically looking for the strongest spoke of the book’s
authors into a new group, or an individual with the greatest number of
spokes, into Big Data domains that we aren’t already a part of—all in an effort
to spread our gospel of Big Data. Any time you try to figure out multiple entry
points into a community—be it for internal enablement, charity fundraising,
or to get a date from a pool of people—it’s a graph problem. This domain of
analysis involves concepts such as closeness centrality, "betweenness" central-
ity, and degree centrality. As it turns out, Lori Zavattoni works in IBM’s Wat-
son division; we identified her by using a social graph within IBM to find an
enablement leader who seems to be a major influencer and shares spokes with
other groups. We kept this graph simple, so keep in mind that a “true” graph
can contain thousands (and more likely millions) of connections. We also left
the connection graph at a coarse degree of separation (2, as shown on the right
side of this figure); you can throttle this (and other) attributes up or down. We
reached out to Lori to see what we could do around Big Data and Watson, and
the result was a section on the topic in Chapter 1.
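A hedged sketch of these ideas, built with the open source networkx library (not SmallBlue itself) and made-up edge weights, shows how degree, betweenness, and shortest paths fall out of a few lines of code.

import networkx as nx   # assumes the third-party networkx package

G = nx.DiGraph()                      # directed: "Dirk knows Paul" is one edge
G.add_edge("Dirk", "Paul", weight=9)  # long collaboration -> heavier weight
G.add_edge("Paul", "Dirk", weight=9)  # the reverse edge must be added explicitly
G.add_edge("Chris", "Paul", weight=4)
G.add_edge("Marc", "Paul", weight=6)
G.add_edge("Rick", "Paul", weight=2)  # connected mostly through this book
G.add_edge("Paul", "Lori", weight=3)

print(nx.degree_centrality(G))              # who has the most linkages?
print(nx.betweenness_centrality(G))         # who sits on the most paths?
print(nx.shortest_path(G, "Dirk", "Lori"))  # Dirk reaches Lori through Paul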
There are graph subtypes such as a tree and a linked list. A tree graph
enables any node to talk to any other node, but it cannot loop back (cycle) to
the starting node except by retracing its exact steps through the path it took
to get there (this kind of breadcrumb trail is popular for web page navigation
analysis). If a connection were made between Paul and Lori, the only way to
Lori’s connections is through Paul; therefore, Dirk must go through Paul to
get to Lori. Of course, Dirk could just send Lori a quick email—but if this
were a voter identification list, a political party would use Paul to get to Lori
and gain access to her connections only through that first link with Paul. In
fact, this exact technique was used by the reelection team for Barack Obama’s
second presidential campaign.
The best way to describe a linked list (which is more restrictive than a tree
graph) is to think of a graph that resembles a connect-the-dots picture. Each
node sequentially connects to another node, though in this case you might
not end up with a pretty picture. Examining a company’s supply chain or
procurement channel might produce a linked list graph. For example, the
only way to get a doctor’s appointment with a specialist in Canada is to get
a referral by your general practitioner; so if you’ve got a skin issue, your fam-
ily doctor is part of the linked list to the dermatologist—there is only one path through a linked list.
As you’ve seen, graph databases are great for highly connected data
because they are powerful, come with a flexible data model, and provide a
mechanism to query connected data with high performance. They
are, however, not perfect. Aside from the fact that they are still maturing, the
schema that is produced by a graph database can get complex in a hurry, and
transactions aren’t always ACID-compliant (more on this in a bit). What’s
more, because they can grow rapidly, one of the traditional challenges with
graphs has been (and still is) the processing time that is required to work
through all of the data permutations that even relatively small graphs can
produce (we think that this is a short-term challenge). That said, we feel that
the greatest short-term weakness of graph databases is that they require a
new way of thinking—a paradigm where everything that you look at is a
graph. The open source movement is taking the lead and you can see quar-
terly gains in graph database adoption as a reflection of those efforts. We
think that this is a great example of why polyglot persistence is key to your
strategy—an architecture that leverages multiple capabilities (in a variety of
different forms) can provide the most value as businesses make the jump to
NoSQL.
Being partition tolerant also means being able to scale into distributed sys-
tems; you can see how RDBMSs traditionally place less emphasis on partition
tolerance, given how difficult many of these systems are to scale horizontally.
Partition tolerance helps with the elastic scalability of NoSQL databases (those
that scale out). Although the SQL world is weaker in the partition tolerance
transactional area (with the exception of DB2 pureScale and DB2 for System z
technologies that are based on IBM’s Coupling Facility, which was designed
for this purpose), the NoSQL world prioritizes partition tolerance.
The CAP theorem (first proposed by Eric Brewer in a keynote at the Symposium on Principles of Distributed Computing and later refined by many) states that you can guarantee only two out of the three CAP properties simultaneously—and two out of three ain't bad, according to Meat Loaf's hit song. In other words,
if your database has partitions (particularly if it’s distributed), do you priori-
tize consistency or the availability of your data? (We should note that trade-
offs between consistency and availability exist on a continuum and are not an
“all-or-nothing” proposition.)
Clearly, reduced consistency and weakened transactional “ACIDity”
don’t sound like a good idea for some banking use cases that we can think
of—and yet there are many apps where high availability trumps consistency
if your goal is to be more scalable and robust. As you might expect, in a
social-mobile-cloud world, systems of engagement and systems of people
apps could very well make this kind of trade-off. For example, if you’re
checking how many Repins you have on Pinterest, does it matter if that num-
ber is accurate at a specific point in time? Or can you settle for being eventu-
ally consistent, if it means that your users get values (whether or not they are
the most up to date) on demand?
NoSQL practitioners are, more often than not, willing to trade consistency
for availability on a tunable spectrum. In fact, one NoSQL database with a
“mongo” footprint is well known for data writes that can just disappear into
thin air. Ultimately, the NoSQL solution that you select should be based on
business requirements and the type of apps that you’re going to support.
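That tunable spectrum often boils down to simple quorum arithmetic in Dynamo-style stores (Riak and Cassandra expose knobs along these lines); the helper below is only a sketch of the rule of thumb, not any product's actual API.

# N = number of replicas, W = replicas that must acknowledge a write,
# R = replicas that must answer a read.
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """Read and write quorums overlap when R + W > N."""
    return r + w > n

print(read_sees_latest_write(3, 2, 2))  # True: stronger consistency, slower operations
print(read_sees_latest_write(3, 1, 1))  # False: fast and available, but reads can be stale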
There is a related set of properties that describes a persistence layer as being basically available, with soft state and eventual consistency—BASE. We're
not going to get into these finer-level details in this chapter because our goal
was to give you a basic overview, which we feel we’ve done.
Wrapping It Up
In this chapter, we covered SQL, NoSQL, and even touched on NewSQL
database technologies as options for today’s data centers. We also discussed
the concept of polyglot persistence. Supporting a polyglot Big Data and Ana-
lytics platform means selecting a variety of data-layer technologies that
address the widest array of use cases that your apps might encounter. We
also talked about how in the API economy, these layers will be fully
abstracted—we talk more about this in the next chapter. Elsewhere in this
book, you’ll read about NoSQL technologies such as the Cloudant document
store, HBase, the Hadoop ecosystem, in-memory columnar database technol-
ogy, and traditional RDBMS technologies across all form factors: from off-
premise hosted or fully managed services, to "roll-your-own" on-premise solutions, to appliances. Together, these polyglot tools represent the next
generation of data-layer architectures and solutions for tomorrow’s Big Data
and Analytics challenges.
3
Composing Cloud Applications: Why We Love the Bluemix and the IBM Cloud
Of course, all this new data has become the new basis for competitive advan-
tage. We call this data The What. All that’s left is The How. And this is where the
cloud comes in. This is where you deploy infrastructure, software, and even
services that deliver analytics and insights in an agile way. In fact, if you’re a
startup today, would you go to a venture investment firm with a business plan
that includes the purchase of a bunch of hardware and an army of DBAs to get
started? Who really does that anymore? You'd get shown the door, if you even made it that far. Instead, you go to a cloud company like SoftLayer (the IBM Cloud), swipe your credit card to get the capacity or services, and get to work. This approach lets you get started fast and cheaply (as far as your credit card is concerned). You don't spend time budgeting, forecasting, planning, or going
through approval processes: You focus on your business.
The transformational effects of The How (cloud), The Why (engagement),
and The What (data) can be seen across all business verticals and their associ-
ated IT landscapes. Consider your run-of-the-mill application (app) devel-
oper and what life was like before the cloud era and “as a service” models
became a reality. Developers spent as much time navigating roadblocks as
they did writing code. The list of IT barriers was endless: contending with
delays for weeks or months caused by ever-changing back-end persistence
(database) requirements, siloed processes, proprietary platform architec-
tures, resource requisitions, database schema change synchronization cycles,
cost models that were heavily influenced by the number of staff that had to
be kept in-house to manage the solution, processes longer than this sentence,
and more. Looking back, it is a wonder that any code got written at all!
Development is a great example. Today we see a shift toward agile processes; we call this continual engineering. In this development and operations (DevOps) model, development cycles get measured in days; environment stand-up times are on the order of minutes (at most hours); the data persistence layer is likely to be a hosted (or even a fully managed) service; the platform architecture is loosely coupled and based on an
API economy and open standards; and the cost model is variable and
expensed, as opposed to fixed and capital cost depreciated over time.
As is true with so many aspects of IT and the world of Big Data, we antic-
ipate this wave of change has yet to reach its apex. Indeed, there was too
much hype around cloud technologies, and practitioners had to get beyond
that just like Big Data practitioners are now facing the hype barrier. But today
there is no question about it: The cloud is delivering real value and agility. It
has incredible momentum, it is here to stay, and the conversation is now well
beyond the hype. The cloud was once used for one-off projects or test work-
loads, but now it’s a development hub, a place for transactions and analytics,
a services procurement platform where things get done. In fact, it’s whatever
you decide to make it. That’s the beauty of the cloud: All you need is an idea.
come up with the following analogy (we’d like to thank IBMer Albert Barron
for this story’s inspiration).
Imagine you have a son who's just started his sophomore year at university and is living in your home. He just mentioned that he found a tequila lime chicken recipe on the Chef Watson bon appetit site (http://bonappetit.com/tag/chef-watson) and wants to use it to impress a date. You
laugh and remind him that you taught him how to boil an egg over the sum-
mer. After some thought, your son concludes that making a gourmet dish is
too much work (cooking, cleaning, preparation, and so on) and his core com-
petencies are nowhere to be found in the kitchen; therefore, he decides to
instead take his date to a restaurant. This is similar to the SaaS model, because
the restaurant manages everything to do with the meal: the building and
kitchen appliances, the electricity, the cooking skills, and the food itself. Your
son only had to make sure he cleaned himself up.
A year passes and your son is now in his junior year, but is still living in
your home (we can hear the painful groan of some of our readers who know
this situation too well). He’s got another date and wants to arrange for a
more intimate setting: your house. He buys all the ingredients and digs out
the recipe, but pleads for help with making the meal. Being a loving parent,
you also want your son to impress his date (this may speed up his moving
out), and you prepare the meal for him. This is similar to the PaaS model,
because you (the vendor) managed the cooking “platform”: You provided
the house and kitchen appliances, the electricity, and the cooking skills. You
even bought candles! However, your son brought the food and the recipe.
In his senior year, you’re pleased that your son has moved into his own
apartment and has been taking on more responsibility. Just before he moved,
he tells you he’s found “the one” and wants to really impress this person
with his favorite tequila lime chicken recipe. But this time he wants to class it
up, so he’s hosting it and he’s making it. He’s had to go and find a place to
live (floor space), and his rent includes appliances (a stove and fridge) and
utilities. He takes a trip to IKEA and gets some tasteful place settings and a
kitchen table and chairs (your 20-year-old couch wouldn’t be suitable for
such an important guest to sit and eat). Once he’s settled into his apartment,
he goes to the grocery store to get the ingredients, and then comes home and
cooks the meal—he essentially owns the process from start to finish. Your
son owns everything in the food preparation process except for the core
infrastructure, which he’s renting: This is IaaS. As an aside, since he has sole
access to all of the resources in his apartment, this IaaS model would be
called bare-metal infrastructure in cloud-speak.
In this IaaS scenario, imagine if your son shared an apartment with room-
mates. Everything from the contents in the fridge, to the stove, to place set-
tings, to the tables and chairs are potentially shared; we call this a multitenant
infrastructure in cloud-speak. Those roommates? They are what we call "noisy neighbors" in the cloud space. In so many ways, they can get in the way of the experience you expected and planned for. For example, perhaps they are cooking their own food when he needs access to the stove, they planned their own important date on the very same evening, they might have used the dishes and left them dirty in the sink, or worse yet…drank all the tequila! In a multitenant cloud environment, you share resources; what's different from our analogy is the fact that you don't know who you're sharing the resources with. But the key concept here is that when you share resources, things may not operate in the manner you expected or planned.
A few years later, your son has a stable job (finally!) and has just bought
his own house. He's newly married, and for old times' sake, wants to make
his spouse the tequila lime chicken recipe that started their romance. This
time, he truly owns the entire process. He owns the infrastructure, he’s cook-
ing the meal, and he bought the ingredients. In every sense, this is similar to
an on-premise deployment.
Notice what’s common: In every one of these scenarios, at the end of the
evening, your son eats tequila lime chicken. In the on-premise variation of
the story, he does all of the work—in some scenarios he has to share the
resource and in others he doesn’t. There are scenarios in which he has other
people take varying amounts of responsibility for the meal. Today, even
though your son is now independent and has the luxury of his own infra-
structure, he still enjoys occasionally eating at restaurants; it’s definitely eas-
ier for him to go out for something like sushi or perhaps he is just too tired to
cook after a long week. This is the beauty of the “as a service” models—you
can consume services whenever the need arises to offload the management
of infrastructure, the platform, or the software itself; in cloud-speak you’ll
hear this referred to as bursting.
As your understanding of these service models deepens, you will come to
realize that the distinctions among them can begin to blur. Ad hoc develop-
ment strategies of the past have given way to predefined development
“stacks,” and going forward, we will see these stacks refined into develop-
ment “patterns” that are available as a service. Today, the expert integrated
IBM PureSystems family (PureFlex, PureApplication, and PureData systems)
combines the flexibility of a general-purpose system, the elasticity of an on-
premise cloud, and the simplicity of an appliance. These systems are inte-
grated by design and are finely tuned by decades of experience to deliver a
simplified IT experience. You can think of these as foolproof recipes. For
example, a data warehouse with a BLU Acceleration pattern (we talk about
this technology in Chapter 8) is available for a PureFlex system, which equates
to tracing paper for a beautiful picture (in this case, a high-performance turn-
key analytics service). Patterned solutions are not exclusive to on-premise deployments:
Pattern design thinking is driving service provisioning for cloud-based envi-
ronments. For example, IBM makes the BLU Acceleration technology avail-
able in PaaS and SaaS environments through a hosted or managed analytics
warehousing service called dashDB.
From small businesses to enterprises, organizations that want to lead their
segments will quickly need to embrace the transition from low-agility soft-
ware development strategies to integrated, end-to-end DevOps. The shift
toward cloud models ushers in a new paradigm of “consumable IT”; the
question is, what is your organization going to do? Are you going to have a
conversation about how to take part in this revolutionary reinvention of IT?
Or will you risk the chance that competitors taking the leap first will find
greater market share as they embrace the cloud?
isn’t the true reason for the cloud’s momentum. Moving to the cloud gives you
increased agility and the opportunity for capacity planning that’s similar to the
concept of using electricity: It’s “metered” usage that you pay for as you use it.
In other words, IaaS enables you to consider compute resources as if they are a
utility like electricity. Similar to utilities, it’s likely the case that your cloud ser-
vice has tiered pricing. For example, is the provisioned compute capacity
multi-tenant or bare metal? Are you paying for fast CPUs or extra memory?
Never lose sight of what we think is the number-one reason for the cloud and
IaaS: It's all about how fast you can provision computing capacity.
When organizations can quickly provision an infrastructure layer (for
example, the server component for IaaS shown in Figure 3-1), the required
time to market is dramatically reduced. Think of it this way: If you could snap
your fingers and have immediate access to a four-core, 32GB server that you
could use for five hours, at less than the price of a combo meal at your favorite
fast-food restaurant, how cool would that be? Even cooler, imagine you could
then snap your fingers and have the server costs go away—like eating the
combo meal and not having to deal with the indigestion or calories after you
savor the flavor. Think about how agile and productive you would be. (We do
want to note that if you are running a cloud service 24x7, 365 days a year, it
may not yield the cost savings you think; but agility will always reign supreme
compared to an on-premise solution.)
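In practice, that finger snap is an API call. The sketch below assumes the open source softlayer-python client; treat the parameter names and values as assumptions to verify against the current client documentation rather than as a recipe.

import SoftLayer  # assumed: the softlayer-python client (pip install SoftLayer)

client = SoftLayer.create_client_from_env(username="me", api_key="my-api-key")
vs_manager = SoftLayer.VSManager(client)

# Ask for roughly the "four-core, 32GB for a few hours" server described above,
# billed hourly so it can be cancelled as soon as the work is done.
instance = vs_manager.create_instance(
    hostname="scratch-node",
    domain="example.com",
    cpus=4,
    memory=32768,          # in megabytes
    hourly=True,
    os_code="UBUNTU_LATEST",
    datacenter="dal09",
)
print(instance["id"])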
A more efficient development and delivery cycle leads to less friction and
lowers the bar for entry when it comes to innovation and insights: Imagine
the opportunities and innovations that your organization can realize when
the traditional pain points and risk associated with costly on-premise IT infra-
structures are reduced by a cloud-based, scalable, “as a service” solution.
We often talk about three significant use-case patterns emerging for adopt-
ers of IaaS data layers, and we think they should be part of your Big Data
conversations too. The first involves a discussion on ways for organizations
to reduce their large and ongoing IT expenses through the optimization of
their existing data centers. When such companies move their infrastructure
to a managed cloud, they begin to enjoy consolidation of services, virtualized
deployments that are tailored to their storage and networking requirements,
and automatic monitoring and migration.
The second pattern involves the ability to accelerate the time to market for
new applications and services. Clients can leverage new environments and
topologies that are provisioned on the cloud with ease, often by simply select-
ing the sizing that is appropriate for their workloads and clicking “Go” to kick
off the compute resource spin-up process. It might not be real magic, but a
couple of swipes and gestures can be just as impressive.
The third pattern involves an environment that is running a client’s apps
and services on top of an IaaS layer. These clients can gain immediate access
to higher-level enterprise-grade services, including software for composing
(developing) apps in the cloud or consuming cloud apps such as IBM’s
dashDB analytics warehouse or IBM’s DataWorks refinery services, and so
on. These are the PaaS and SaaS flavors of the cloud. Each service builds on
the other and delivers more value in the stack. It’s kind of like a set of Rus-
sian dolls—each level of beauty gets transferred to the next and the attention
to detail (the capabilities and services provided from a cloud perspective)
increases. In other words, the value of IaaS gives way to even more value,
flexibility, and automation in PaaS, which gives way to even more of these
attributes in SaaS (depending on what it is you are trying to do).
What are some of the available services for organizations that leverage an
IaaS methodology? Figure 3-2 shows IaaS at its most fundamental level. A
vendor delivers complete management of the hardware resources that you
request for provisioning over the cloud: virtual machine (VM) provisioning,
construction and management of VM images, usage metering, management
of multitenant user authentication and roles, deployment of patterned solu-
tions, and management of cloud resources.
Earlier in this chapter we referred to cloud services as hosted or managed.
With managed IaaS providers, you gain access to additional resources such as
patching and maintenance, planning for capacity and load, managed
responses to data-layer events (including bursts in activity, backup, and
disaster recovery), and endpoint compliance assurances for security—all of
which are handled for you! In a hosted environment, these responsibilities fall
on you. It’s the difference between sleeping over at mom’s house (where you
Figure 3-2 Infrastructure as a service (IaaS) provides the foundational level for any
service or application that you want to deploy on the cloud.
The IBM Cloud gives clients the option to provision a true bare-metal,
dedicated IaaS that removes “noisy neighbors” and privacy concerns from
the equation altogether. It can provision and manage an infrastructure layer
that is exclusively dedicated to your company’s cloud services or hosted
apps. Your services benefit from the guaranteed performance and privacy
that come from a bare-metal deployment. IBM is uniquely positioned to
seamlessly integrate bare-metal and virtual servers, all on demand, as part of
a unified platform. SoftLayer also provides autoscaling, load balancing, dif-
ferent security models (for open and commercial technologies), and a variety
of server provisioning options (dedicated and multitenant). Every client has
a preference; some might have a business need that fits a hybrid model of
multitenancy and dedicated bare metal working side by side; others might
be fine with virtualization for all their servers. SoftLayer offers the IaaS flex-
ibility to meet and exceed the requirements of any client—including those
who don’t want to live beside noisy neighbors.
security, so there is no need to fret about the volatile world of the cloud.
Lastly, the way you pay for using Bluemix is much like how you’re billed for
electricity—you pay for what you use. And if you are using one of the many
freemium services that are available on Bluemix, unlike your electricity, you
don’t pay anything at all!
IBM envisions Bluemix as an ecosystem that answers the needs and challenges facing developers, while at the same time empowering and enabling businesses to leverage their resources in the most efficient way possible. This means Bluemix provides organizations with a cloud platform that requires very little in-house technical know-how and that delivers cost savings. At the same
time, Bluemix lets organizations dynamically react to users’ demands for
new features. The Bluemix platform and the cloud provide the elasticity of
compute capacity that organizations require when their apps explode in
popularity.
As a reflection of IBM’s commitment to open standards, there are ongoing
efforts to ensure Bluemix is a leader when it comes to the open standards that
surround the cloud. As such, Bluemix is an implementation of IBM’s Open
Cloud Architecture, leveraging Cloud Foundry to enable developers to rap-
idly build, deploy, and manage their cloud applications, while tapping a
growing ecosystem of available services and run-time frameworks.
The API economy is at the core of Bluemix. It includes a catalog of IBM,
third-party, and open source services and libraries that allows developers to
compose apps in minutes. Quite simply, Bluemix offers the instant services,
run times, and managed infrastructure that you need to experiment, inno-
vate, and build cloud-first apps. Bluemix exposes a rich library of IBM proprietary technologies, such as the BLU Acceleration technology behind the dashDB analytics service, the Cloudant NoSQL document store, and the IBM DataWorks refinery services. IBM partner-built technologies (like
Pitney Bowes location services) as well as open source services (like Mon-
goDB and MySQL, among others) are also part of its catalog. The Bluemix
catalog is by no means static—Bluemix continues to add services and shape
the platform to best serve its community of developers.
In practical terms, this means that cloud apps built on Bluemix will reduce
the time needed for application and infrastructure provisioning, allow for
flexible capacity, help to address any lack of internal tech resources, reduce
TCO, and accelerate the exploration of new workloads: social, mobile, and
Big Data.
Bluemix is deployed through a fully managed IBM Cloud infrastructure
layer. Beyond integration in the IBM Cloud, Bluemix has integration hooks
for on-premise resources as well. Flexibility is the design point of Bluemix, and it lets you seamlessly pull in resources from public and private clouds as well as from other on-premise systems.
Layered security is at the heart of Bluemix. IBM secures the platform and
infrastructure and provides you with the tools you need to secure the apps
that you compose. IBM’s pedigree in this area matters, and this is baked into
the apps you compose on this platform.
Bluemix has a dynamic pricing scheme that includes a freemium model! It
has pay-as-you-go and subscription models that offer wide choice and flexi-
bility when it comes to the way you want to provision Bluemix services.
What’s more, and perhaps unlike past IBM experiences, you can sign up and
start composing apps with Bluemix in minutes!
PaaS offers such tremendous potential and value to users like Jane because
there are more efficiencies to be gained than just provisioning some hardware.
Take a moment to consider Figure 3-3. Having access to subsystems and link-
ages to databases, messaging, workflow, connectivity, web portals, and so
on—this is the depth of customization that developers crave from their envi-
ronment. It is also the kind of tailored experience that’s missing from IaaS-only
vendors. The layered approach shown in Figure 3-3 demonstrates how PaaS is,
in many respects, a radical departure from the way in which enterprises and
businesses traditionally provision and link services over distributed systems.
Traditionally, classes of subsystems would be deployed independently of the
app that they are supporting. Similarly, their lifecycles would be managed
independently of the primary app as well. If Jane’s app is developed out of
step with her supporting network of services, she will need to spend time
(time that would be better spent building innovative new features for her app)
ensuring that each of these subsystems has the correct versioning, functional-
ity mapping for dependencies, and so on. All of this translates into greater risk,
steeper costs, higher complexity, and a longer development cycle for Jane.
With a PaaS delivery model like Bluemix, these obstacles are removed:
App developers and users don’t need to manage installation, licensing,
maintenance, or availability of underlying support services. The platform
assumes responsibility for managing subservices (and their lifecycles) for
you. When the minutiae of micromanaging subservices are no longer part of
the equation, unobstructed, end-to-end DevOps becomes possible. With a complete DevOps framework, a developer can move from
concept to full production in a matter of minutes. Jane can be more agile in
her development, which leads to faster code iteration and a more polished
product or service for her customers. We would describe this as the PaaS
approach to process-oriented design and development.
needs to tap into both Cloud A and B. How she does that integration and
how these required services are coordinated between those clouds (some-
thing we refer to as orchestration) is where we start to differentiate between
the PaaS approaches. Whether Jane provisions services through a “single
pane of glass” (a global orchestrator that has a predefined assortment of ser-
vices to choose from) or defines linkages with these cloud services directly
(even if they live on clouds that are separate from her PaaS environment)
determines the type of PaaS architecture her solution requires.
Need a breather? Remember that this layer of complexity—how the pieces
of PaaS and its services fit together—is not the concern of the developer.
This is what’s great about PaaS after all! IBM handles the integration of ser-
vices behind the scenes and determines how those services come together. In
other words, the PaaS infrastructure in Bluemix determines how Jane’s
instructions to a “single pane of glass” are translated across multiple cloud
environments (internal or even external public clouds) to deliver the func-
tionality that her apps need. We added this tech section in case you want a taste of the details that are going on behind the scenes, but never lose sight of the fact that with Bluemix, you don't have to care about such details.
Figure 3-5 IBM Bluemix’s catalog of composable services is central to the integration
of on-premise and off-premise infrastructure services.
and the IBM Cloud as a whole. What differentiates IBM’s platform from com-
petitors is the continuum of services that can be accessed across both on-
premise and off-premise resources. Cloud-ready solutions such as SoftLayer
make this possible, bursting data to the cloud on demand.
and safety of the treatment are well understood that the vaccine is mass pro-
duced for the general population.
To apply this analogy to business patterns, let’s consider Figure 3-6.
Although several ad hoc, one-off strategies might exist for solving a
particular business use case, it isn't until a strategy demonstrates successful
results over a period of time that IBM considers it an expert, deployable
"patterned" solution for clients. IBM's entire PureSystems family
is built around this concept of standardization and proven patterns to address
business challenges of innovation, speed, and time to market. What’s more,
all of the expertise and effort that went into developing these on-premise pat-
terned expert-baked solutions are now being leveraged for the Bluemix PaaS
environment.
The challenge with deploying patterns to a cloud environment is that,
from a design perspective, patterns are geared more toward existing tech-
nologies and are not necessarily optimized for iterative deployment of ever-
evolving applications and architectures. For example, what is the pattern for
taking a cookie-cutter implementation and days later stitching in a new ser-
vice that just changed the scope of what your app is all about? We are getting
at the fact that the cloud is a highly dynamic environment. The idea of estab-
lishing a hardened pattern around an environment where new code is
deployed on the first day, a new database is added on the second day, a queu-
ing system and then a monitoring system are tacked on after that, and who
knows what else, becomes seemingly impossible.
So if apps that are born and developed on the cloud are built iteratively and
gradually, how can design patterns ever get traction in such an environment?
The answer is PaaS, because PaaS can be broken down and modeled in much
the same way as subpatterns. There is a surprising amount of synergy between
design patterns and the way that practitioners provision environments and
services over the cloud. With this in mind, let’s explore a layered approach to
application development in the cloud.
At the top of Figure 3-6, developer Jane is actively making use of the Blue-
mix core services to compose and run cloud-native apps; this is what we refer
to as the Bluemix Fabric. Meanwhile, Jane is taking advantage of a large catalog
of external services (available through the API economy; more on this topic in
a bit) such as database back ends, queuing services, and so on. The key idea
here is that these back-end external services can be provisioned using patterns.
For example, at some point in the coding cycle in Figure 3-6, Jane will need a
service that can handle a certain requirement for her app—perhaps applying
an XSL style sheet transformation that was requested by an end user. “Under
the covers” Bluemix pushes down a design pattern that is used to instantiate
the respective piece of software. After it is ready, the necessary hooks and API
connectors to the service are returned to Jane so that she can tie them into the
new app that is being developed on top of the PaaS. When design patterns are
correctly leveraged as part of a PaaS environment, the complexities of provi-
sioning and looping in external services are made transparent to the devel-
oper. In other words, Jane doesn't need to worry about how it's achieved; she
simply moves forward with the confidence that the underlying PaaS environ-
ment handles it for her, all abstracted through an API. Ultimately, this trans-
lates into the frictionless and more productive experience that developers like
Jane thrive on.
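To make this concrete, here is a minimal sketch of our own (not an excerpt from
Bluemix documentation) showing how an app running on a Cloud Foundry-based
platform such as Bluemix can discover the hooks for a service the platform has
bound to it. Cloud Foundry exposes bound-service credentials to the app through
the VCAP_SERVICES environment variable; the service label "xslt-transform" is
purely hypothetical.

    import json
    import os

    def get_service_credentials(service_label):
        # VCAP_SERVICES maps each service label to a list of bound instances,
        # each carrying a "name" and a "credentials" block.
        vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
        for label, instances in vcap.items():
            for instance in instances:
                if service_label in label or service_label in instance.get("name", ""):
                    return instance.get("credentials")
        return None

    creds = get_service_credentials("xslt-transform")  # hypothetical service label
    if creds:
        print("Service endpoint:", creds.get("url"))
    else:
        print("No matching service bound to this app.")

The point is that Jane's code only reads a well-known environment variable; the
provisioning and lifecycle of the service behind it remain the platform's problem.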
• Finally, the ecosystem that supports these apps must appeal to customer
and developer communities. Business services and users want to be
able to use an app without needing to build it and own it. Likewise,
developers demand an environment that provides them with the
necessary tools and infrastructure to get their apps off the ground and
often have next-to-zero tolerance for roadblocks (like budget approvals,
DBA permissions, and so on).
These expectations and requirements precisely fit into the concept of PaaS
and building cloud-native apps. Put another way, this is exactly the kind of
environment that IBM Bluemix promotes and delivers.
As previously mentioned, Bluemix provides a PaaS over the IBM Cloud
(powered by SoftLayer and Cloud Foundry) to deliver a venue for testing
and deploying your apps. At its core, Bluemix is an open standards compli-
ant, cloud-based platform for building, managing, and running apps of all
types, especially for smart devices, which are driving the new data econ-
omy. You can think of Bluemix as a catalog of composable services and run
times that is powered by IBM, open source, or third-party vendor-built tech-
nologies running on top of a managed hardware infrastructure. The goal is
to make the process of uploading and deploying applications to the cloud as
seamless as possible so that development teams can hit the ground running
and start programming from day one. Bluemix capabilities include a Java
framework for mobile back-end development, application monitoring in a
self-service environment, and a host of other capabilities…all delivered
through an as-a-service model. Bluemix offers a services-rich portfolio without
the burden of mundane infrastructure tasks. Its target audience—developers,
enterprise line-of-business users—can say goodbye to the drudgery of
installing software, configuring hardware and software, fulfilling middle-
ware requirements, and wrestling with database architectures. Bluemix has
you covered.
As previously mentioned, Bluemix offers a catalog of services, drawing
from the strengths of the IBM portfolio around enterprise-grade security,
web, database management, Big Data analytics, cross-services and platform
integration, DevOps, and more. These are known quantities—things that cli-
ents have come to expect when doing business with IBM. One of the things
that the Bluemix team found is that developers often identify themselves
with the technology that they choose to use. With this in mind, IBM also
wants to ensure that the Bluemix platform provides a choice of code base,
language and API support, and an infrastructure that will attract developers
from communities that IBM has not traditionally addressed. For example,
Bluemix supports Mongo-based apps. Quite simply, Bluemix eliminates entry
barriers for these programmers and developers by offering a streamlined,
cost-flexible development and deployment platform.
Figure 3-7 offers a tiny glimpse at the Bluemix service catalog where
developers can browse, explore, and provision services to power their apps.
Being able to provide a full lifecycle of end-to-end DevOps is important,
given the experimental, iterative, sandbox-like approach that we expect
developers to bring to this platform.
The entire app lifecycle is cloud agile—from planning the app to monitor-
ing and managing these services. All of these services are supported by the
Bluemix platform right “out of the box.” This ultimately enables developers
to integrate their apps with a variety of systems, such as on-premise systems
of record (think transactional databases) or other public and private services
that are already running on the cloud, all through Bluemix APIs and services.
If you want other developers on Bluemix to tap into the potential of these
systems, you can work with IBM to expose your APIs to the Bluemix audi-
ence by becoming a service partner. Consider the fundamental Bluemix ini-
tiatives that we outlined: Bluemix is a collaborative ecosystem that supports
both developers and customers, and enabling our partners to integrate across
these verticals is a key part of that initiative.
Figure 3-7 The Bluemix catalog of composable services comes from IBM, partner,
and open source technologies that are enabled for the continuous, agile development
of cloud apps.
user who needs to consume the capabilities of a service that is already built
and ready to deploy. SaaS consumers span the breadth of multiple domains
and interest groups: human resources, procurement officers, legal depart-
ments, city operations, marketing campaign support, demand-generation
leads, political or business campaign analysis, agency collaboration, sales,
customer care, technical support, and more. The ways in which business
users can exploit SaaS vary enormously: A need can be specific (to the extent
that it requires tailored software to address the problem at hand), or it can
call for a broad and adaptable solution that addresses the needs of an
entire industry.
Not every SaaS vendor can guarantee that their packaged software will
satisfy every vertical and niche use case we’ve outlined here. Nor do all SaaS
vendors offer a catalog of services that supports deep integration hooks and
connectors for when you need to scale out your solution or roll in additional
tools in support of your business end users. Figure 3-8 highlights the fact that
SaaS can apply to anything, as long as it is provisioned and maintained as a
service over the cloud.
What differentiates IBM’s solution is that it offers the complete stack—
from IaaS to PaaS to SaaS—in support of every possible customer use case. We
would argue that IBM’s portfolio stands alone in being able to offer both tai-
lored and extensible SaaS deliverables to address the various dimensions of
consumable and composable functionality that are required by enterprises today.
Figure 3-8 SaaS delivers software that runs on the cloud, enabling you to consume
functionality without having to manage software installation, upgrades, or other
maintenance.
And consider this: As developers build their apps on the Bluemix PaaS plat-
form, they can seamlessly bring their solutions to a SaaS market and a sophis-
ticated IBM Cloud Computing Marketplace (ibm.com/cloud-computing/
us/en/marketplace.html), where new clients can consume them. Throughout
this section, we explore SaaS and IBM’s offerings in this space in more detail,
but first, it is worth setting the stage by describing how cloud distributions are
reshaping software.
services through APIs. The real value of Twitter lies in its ability to seam-
lessly integrate with other apps. Have you ever bought something online
and found an integrated tweet button for sharing your shopping experience?
That vendor is leveraging Twitter’s API to help you announce to the world
that your new golf club is the solution to your slicing issues. When you tweet,
underlying service authorizations can store your tweeted image in Flickr and
leverage Google Maps to enrich the tweet with geospatial information visu-
alizations. The API economy also lets you use that same tweet to tell all your
Facebook friends how your golf game is set to improve. Such a neatly orches-
trated set of transactions between distinct services is made possible, and made
seamless, by the API economy that drives today's cloud SaaS marketplace, as
shown in Figure 3-9.
Having a well thought-out and architected API is crucial for developing
an ecosystem around the functionality that you want to deliver through an
as-a-service property. Remember, the key is designing an API for your ser-
vices and software that plays well with others. Your API audience is not only
people in your company, but partners and competitors too. Consumers of the
SaaS API economy require instant access to that SaaS property (these services
need to be always available and accessible), instant access to the API’s docu-
mentation, and instant access to the API itself (so that developers can pro-
gram and code against it). Furthermore, it is critical that the API architecture
be extensible, with hooks and integration end points across multiple domains
(Figure 3-9) to promote cooperation and connectivity with new services as
they emerge.
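As a loose illustration of what "programming and coding against" such an API
looks like from the consumer's side, the following Python sketch posts an event
to a hypothetical REST end point. The base URL, API key, and JSON payload are
placeholders of our own invention; a real SaaS property's documentation would
define the actual contract.

    import json
    import urllib.request

    API_BASE = "https://api.example.com/v1"    # placeholder end point
    API_KEY = "replace-with-your-api-key"      # placeholder credential

    def post_event(path, payload):
        # POST a JSON payload and return the parsed JSON response.
        req = urllib.request.Request(
            API_BASE + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json",
                     "Authorization": "Bearer " + API_KEY},
            method="POST")
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    # Share a purchase event so that other services (social, mapping, and so on)
    # can pick it up through their own integration end points.
    print(post_event("/events", {"type": "purchase", "item": "golf club"}))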
Figure 3-9 IBM’s “API economy” is made possible by a unified cloud architecture of
well-defined integration end points between external ecosystems, cloud marketplaces,
and IBM’s SaaS portfolio.
surface the application in a SaaS model and charge for it down the road; IBM
gives you that option as a service partner. That said, clients can use dashDB in
a PaaS model for their own analytics needs—it’s just not managed in the same
way as a SaaS model. However, for customers who simply want to consume
software without building and assembling it themselves, a ready-to-go and
managed SaaS solution might be just the ticket. Remember, SaaS isn’t for
developers; it’s for users and consumers who need to achieve new business
outcomes rapidly and cost effectively.
You may see dashDB referred to as a data warehouse as a service
(DWaaS)—which we think is best conceptualized as an added layer of clas-
sification atop the software as a service (SaaS) model. Fundamentally,
DWaaS is very much a SaaS offering—clients consume the functionality of
the service (in the case of dashDB, that functionality entails best-of-breed
performance and analytics) without concern over provisioning or managing
the software’s lifecycle themselves.
IBM’s portfolio of SaaS offerings is built on the same foundations of resil-
iency, availability, and security as the IaaS and Bluemix PaaS layers we talked
about in this chapter. What differentiates dashDB from the pack are the three
pillars that set DWaaS apart from the traditional on-premise data warehous-
ing that has come before it: simplicity, agility, and performance. Let's take a
moment to tease apart what each of these pillars offers clients that provision
this service.
• Simplicity dashDB is easy to deploy, easy to use, requires fewer resources
to operate, and is truly "load and go" for any type of data you can throw at it.
• Agility "Train of thought" analytics are made possible by dashDB's
integration across the IBM Big Data portfolio, making your apps faster and
delivering actionable business insight sooner.
• Performance dashDB delivers BLU Acceleration's signature, industry-leading
performance at a price the market demands—all within a cloud-simple form factor.
In ten words or less, dashDB gives faster actionable insight for unlocking
data's true potential.
IBM dashDB is all about the shifting client values around how functional-
ity is consumed and the explosive growth that we are witnessing in the deliv-
ery of SaaS on the cloud. Customers today are focused on assembling the
components and services that they need to achieve their business outcomes.
In terms of the marketplace, analytics are becoming increasingly valuable; in
fact, we’d go so far as to say that in the modern marketplace, analytics are
indispensable—like we said before, from initiative to imperative. Analytics
are driving decisions about how skills are deployed across an organization,
how resources and infrastructure are distributed in support of a business,
and how clients access the mountains of data that they are actively creating
or collecting.
One question being asked by many businesses is “How do we rapidly
transition from data collection to data discovery and insight to drive decision
making and growth strategies?” As companies move aggressively into this
space, they quickly realize that the skills that are required to build the tools
and perform data analysis are still pretty scarce; the combination of statisti-
cian and DBA guru in one individual is hard to come by! Cost becomes a
major obstacle for companies that want to deploy or manage business analyt-
ics suites with an on-premise infrastructure and appliances. The future inno-
vators in this space will be companies that enable their partners and custom-
ers to bypass the risk and costs that were previously associated with analytics.
This is the drive behind SaaS delivery and in particular IBM’s dashDB.
The idea of an analytics warehouse that can be consumed as a service is a
product of powerful solutions and infrastructure emerging at just the right
time…over the cloud. Speaking from our own firsthand interactions with cli-
ents, we can confidently say that the vast majority of IT and enterprise cus-
tomers want to move at least part of their resources into a cloud environment.
The trend is increasingly toward applications that are “born” in the cloud,
developed and optimized for consumption over distributed systems. A key
driver behind this momentum is CAPEX containment: A SaaS product scales
elastically and can be "right-sized" for your business, translating into smaller
up-front costs and greater flexibility down the road.
SaaS solutions are distinctly “cloud first.” We believe that following best
practices and architecting a software solution for the cloud first makes trans-
lating that asset into an on-premise solution much more straightforward and
painless. In terms of the performance of on-premise and cloud deliverables,
the laws of physics are ultimately going to win. With the cloud, there is the
unavoidable fact that your data and instructions need to travel across a wire.
In most cases, the latency in cloud deployments will be at least slightly greater
(whether or not it is significant is another question). Being able to use in-
memory analytic capabilities, perform data refinement, and ingest data for
analysis in a real-time environment, all in the cloud, is propelling the indus-
try toward new opportunities for actionable business insight. This is all being
We say build more because dashDB gives you the flexibility to get started
without making costly CAPEX investments, which reduces risk and miti-
gates cost. The provisioning of this SaaS takes less than a dozen or so minutes
for 1TB of data, and migration and performance have been tuned to delight
users with their first experience. To support bursty workloads from on-prem-
ise infrastructures to dashDB in the cloud, a lot of effort has been put into
data transfer rates. (Mileage is going to vary—remember that your network
connection speed matters.) When these pain points are eliminated, LOB
owners are free to experiment and pursue new insights that are born out of
the cloud—in short, they can build more.
The beauty of the SaaS model is that capability enhancements can be
delivered to your business without performance degradation or downtime to
your customers. Consumption of its analytics capabilities becomes your only
concern—that, and growing your business with your newfound insights.
This is why we say dashDB lets you grow more. Think about it: You can
put personnel on discovering new insights and leveraging investigative
techniques you could never apply before, or you can spend your time
upgrading your on-premise investments.
IBM dashDB’s in-database analytics also includes R integration for predic-
tive modeling (alongside the INZA functionality that will continue to work its
way into the service offering as dashDB iterates over time). R’s deeply woven
integration with dashDB delivers in-database functions that span the analytic
gamut: linear regression, decision trees, regression trees, K-means, and data
preparation. Other statistical showcases, such as support for geospatial ana-
lytics and developing analytics extensions through C++ UDX, will eventually
round out dashDB’s analytics portfolio.
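Because dashDB presents itself as a standard SQL database, an analyst can drive
these in-database routines from an ordinary client program rather than pulling
data out of the warehouse. The sketch below assumes the ibm_db Python driver
and an IDAX-style K-means procedure; treat the connection string, procedure
name, and parameter string as placeholders and confirm them against the service
documentation.

    import ibm_db

    # Placeholder connection string; substitute your service's host and credentials.
    conn_str = ("DATABASE=BLUDB;HOSTNAME=your-dashdb-host;PORT=50000;"
                "PROTOCOL=TCPIP;UID=your-user;PWD=your-password")
    conn = ibm_db.connect(conn_str, "", "")

    # Run K-means clustering where the data lives instead of pulling rows out.
    # The IDAX.KMEANS routine and its parameter string are assumptions based on
    # IBM's in-database analytics conventions; verify them in the documentation.
    ibm_db.exec_immediate(
        conn,
        "CALL IDAX.KMEANS('model=CHURN_SEGMENTS, intable=CUSTOMERS, "
        "id=CUSTOMER_ID, k=3')")

    ibm_db.close(conn)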
From a security perspective, dashDB is Guardium-ready (covered in
Chapter 12). Industry-leading data refinery services will add free (and some
for fee) masking, transformation, and enrichment, as well as performance-
enhancing operations, which we briefly discuss in the next section.
Refinery as a Service
Both Bluemix and Watson Analytics are platforms that consume and produce
data; as such, they need governance and integration tools that fit into the
SaaS model we’ve been describing in this chapter. IBM’s engineering teams
are taking its rich data governance capabilities and exposing them as a
Wrapping It Up
Having well-performing and powerful analytics tools is critical to making
data warehousing in the cloud a worthwhile endeavor. If you can get an
answer in seconds rather than hours, you are more likely to think about the
next step of analysis to probe your data even further. If you can set up a sys-
tem in less than an hour instead of waiting months for budgetary approval
and dealing with the numerous delays that get between loading dock and
loading data, you can deliver business results sooner. This is why we say that
dashDB and its BLU Acceleration technologies enable “train of thought” ana-
lytics: Having a high-performance analytics SaaS toolkit keeps business
users hungry for more data-driven insight.
In this chapter, we implicitly touched on the four core elements that are
central to IBM’s brand and its cloud-first strategy. The first is simplicity: offer-
ing customers the ability to deploy, operate, and easily manage their apps
and infrastructure. “Load and go” technology makes it possible to get off the
ground and soar into the cloud quickly and cost effectively. Agility is next: If
your business can deploy and iterate new applications faster, you will be
positioned to adapt to a rapidly changing marketplace with ease. IBM’s as-a-
service portfolio (for infrastructure, development platforms, and software)
makes that a reality. The third element (it’s a big one) is integration with Big
Data. Analytics that are being run in platforms such as Hadoop do not live in
isolation from the type of questions and work that is being done in data
warehousing. These technologies work in concert, and workflows between
the two need to exist on a continuum for businesses to unlock the deep and
actionable insight that is within reach. The fourth element in IBM’s cloud
vision is a unified experience. As we show in Figure 3-10, the integration hooks,
interoperability, migration services, consistent and elastic infrastructure, and
platform services layers that IBM’s cloud solution has in place (and contin-
ues to develop) offer a truly holistic and unified experience for developers
and enterprise customers alike.
Figure 3-10 IBM’s cloud portfolio delivers a holistic approach to deeply integrated
infrastructure, platform, and software as a service environments that is unrivalled in
scalability, availability, and performance.
No other competitor in the cloud space can offer an equally rich as-a-
service environment with solutions that are designed for the enterprise, and
still remain open to emerging technologies from partners and third-party
vendors. Cloud used to be a term that was surrounded by hype, but over the
years, the as-a-service model has transformed business challenges that had
remained unsolved for decades into new sources of opportunity and reve-
nues; quite simply, this proves that cloud is beyond the hype. As competi-
tors race to stitch together their own offerings, IBM customers are already
designing apps and composing solutions with a Bluemix fabric, consuming
software as a service functionality on top of a SoftLayer infrastructure that
lives in the cloud.
4
The Data Zones Model:
A New Approach to
Managing Data
The Big Data opportunity: Is it a shift, rift, lift, or cliff? We think it is a mat-
ter of time until you are using one (ideally two) of these words to describe the
effect that Big Data has had on your organization and its outcomes. Study
after study proves that those who invest in analytics outperform their peers
across almost any financial indicator that you can think of: earnings, stock
appreciation, and more. Big Data, along with new technologies introduced
by the open source community and enhancements developed by various
technology vendors, is driving a dramatic shift in how organizations manage
their analytic assets. And when it’s done right, it has the effect of lifting those
organizations’ Analytics IQ, our measure for how smart an organization is
about analytics.
A common challenge to any organization investing in its Analytics IQ has
been providing common access to data across the organization when the
data is being generated in various divisional and functional silos. In a social-
mobile-cloud world, this challenge hasn't changed, but the answer certainly
has. Historically, the answer was to make the enterprise data
warehouse (EDW) the center of the universe. The process for doing this had
become fairly standardized too. If companies want an EDW, they know what
to do: Create a normalized data model to make information from different
sources more consumable; extract required data from the operational sys-
tems into a staging area; map the data to a common data model; transform
and move the data into a centralized EDW; and create views, aggregates,
tiers, and marts for the various reporting requirements.
Over time, various challenges to this approach have given rise to some
variations. For example, some organizations have attempted to move their
data transformation and normalization processes into the warehouse, in
other words, shifting from extract, transform, and load (ETL) to extract, load,
and transform (ELT). Some organizations invested more heavily in opera-
tional data stores (ODSs) for more immediate reporting on recent operational
and transaction-level data.
As organizations began to do more predictive analytics and modeling,
they became interested in data that was not necessarily in the EDW: data sets
whose volumes are too large to fit in the EDW, or untouched raw data in its
most granular form. For example, consider a typical telecommunications
(telco) EDW. Perhaps the most granular data in a telco is the call detail record
(CDR). The CDR contains subscriber data, information about the quality of
the signal, who was called, the length of the call, geographic details (location
before, during, and at the end of the call), and more. Because this data is so
voluminous—we’re talking terabytes per day here—it can’t possibly reside
in the EDW. The data that ends up in the EDW is typically aggregated, which
removes the finer details that a data scientist might require. This data is also
likely to be range-managed; perhaps only six to nine months of the data is
kept in the EDW, and then it’s “rolled out” of the repository on some sched-
uled basis to make room for new data. As you can imagine, this approach
makes it tough to do long-term historical analysis.
Analytics teams respond to such challenges by pulling down data directly
from both operational systems and the data warehouse to their desktops to
build and test their predictive models. At many organizations, this has cre-
ated a completely separate path for data that is used for analytics, as opposed
to data that is used for enterprise reporting and performance management.
With data center consolidation efforts, we’ve seen some organizations that
make use of a specific vendor’s warehousing technology load their EDWs
with as much raw source system data as they can and then use the ware-
house mostly for data preparation (this is where the ELT we mentioned ear-
lier is used). As you can imagine, if an EDW is to be a trusted and highly
[Figure labels: Operational Systems, Staging Area, Enterprise Warehouse, Archive, Predictive Analytics and Modeling, Actionable Insight]
Agility
Agility matters. It’s critical in sports, politics, organizational effectiveness,
and more. Agility is really a measurement of the ability to adapt or change
direction in an efficient and effective manner to achieve a targeted goal. In
fact, agility is so important that almost every role-playing game (RPG) has
some sort of key performance indicator to measure a character’s ability to
carry out or evade an attack or to negotiate uneven terrain. Organizations are
discovering that with traditional EDW approaches, it takes too long to make
information available to the business. A common complaint from most busi-
ness users is that any time they make a request for some additional data, the
standard response is that it will take at least “several months” and “more
than a million dollars”…and that’s even before they share their specific
requirements! But if you think about what it takes to make new information
available in this kind of environment, responses like this are actually not as
unreasonable as they might initially sound. At this point, we are beginning to
feel a little sorry for operational teams because they really do get it from both
ends: Developers don’t want to wait on them (Chapter 2) and neither does
the line of business (LOB). We think that IT ops is like parenthood, which is
often a thankless job.
In the traditional “EDW is the center of the universe” architecture, if a
LOB requires additional data to investigate a problem, several steps need to
be taken. Of course, before even getting started, you have to figure out
whether the required data already exists somewhere in the EDW. If it doesn’t,
the real work begins.
You start by figuring out how the new data should fit into the current nor-
malized data model. You will likely need to change that model, which might
involve some governance process in addition to the actual tasks of modifying
the database schema. The next step (we hope) is to update the system’s docu-
mentation so that you can easily identify this data if you get a similar request
in the future. Of course, the database schema changes aren’t in production
yet; they are in some testing process to ensure that the required changes
didn’t break anything. After you are confident that you can deliver on the
LOB’s request, you need to schedule these changes in production so that they
don’t impact any business-critical processes. What about total time? It
depends on your environment, but traditionally, this isn’t an hour or days
measurement; in fact, weeks would be remarkably good. Our scenario
assumes you have all the data lined up and ready to go—and that might not
even be the case.
After you’ve taken care of the “where the data will go” part, you have to
figure out where and how to get the data. This typically involves multiple
discussions and negotiations with the source system owners to get the data
feeds set up and scheduled based on the frequency and timeliness require-
ments of the business. You can then start working on how to map the source
data to your new target model, after which you have to start building or
updating your data integration jobs.
After all this is done, your new data is finally available. Only at this point
can you start building queries, updating reports, or developing analytic
models to actually leverage the new data. Tired yet? Sound familiar? What
we are talking about used to be the status quo: The zones model this chapter
introduces you to is about turning the status quo on its head, in the name of
agility, insights, and economies of scale.
This is a major reason why organizations continue to see new databases
pop up across various LOBs and continue to experience data sprawl chal-
lenges even after investing in the ideal EDW. Remember, just like developers,
business users require agility too. If IT ops doesn’t deliver it to them, they
will go and seek it elsewhere. Agility is one of the reasons that investing in a
data zones model is a great idea.
Cost
No one ever suggested that EDWs are cheap. In fact, an EDW is one of the
most expensive environments for managing data. But this is typically a con-
scious decision, usually made for very good reasons (think back to the gold
miner discussion we had in Chapter 1). High-end hardware and storage,
along with more advanced, feature-rich commercial software, are used to
provide the workload management and performance characteristics needed
to support the various applications and business users that leverage this
platform. These investments are part of a concerted effort to maintain the
reliability and resilience that are required for the critical, enterprise-level
data that you have decided to manage in the EDW. In addition, as mentioned
earlier, ensuring that you provide the consistent, trusted information that is
expected from this investment drives up the effort and costs to manage this
environment. Although EDWs are important, they are but a single technol-
ogy within a polyglot environment.
Another thing about the “EDW is the center of the universe” approach is
that organizations continuously have to add more and more capacity as the
amount of data they are capturing and maintaining continues to rise and as
they deliver additional reporting and analytic solutions that leverage that
data. Companies are starting to realize that they will need to set aside sig-
nificant amounts of additional capital budget every year just to support the
current needs of the business, let alone the additional reporting or analytical
requirements that continue to arise. Often, there isn’t a clear understanding
of the impact of the new applications and analytics that are being developed
by the LOB. As a result, organizations can unexpectedly run into capacity
issues that impact the business and drive the need for “emergency” capital
expenditures, which are usually more costly and don’t provide the flexibility
to accommodate alternative approaches.
The cost issues all come down to this question: “Do you need to incur
these high costs for all of your data, or are there more cost-effective ways to
manage and leverage information across the enterprise?”
Depth of Insight
The third category of challenges that organizations face with the traditional
EDW approach is related to the maturity and adoption of current approaches
to enterprise data management. Initially, the organizations that invested in
aggregating data for analysis and establishing an enterprise view of their
business were able to drive competitive advantage. They were able to gener-
ate new insights about their customers and their operations, which provided
significant business value. However, the early adopters are starting to see
diminishing returns on these investments as their peers catch up.
Organizations are looking for new ways to deliver deeper, faster, and
more accurate insights. They want to incorporate larger volumes of data into
Next-Generation Information
Management Architectures
Next-generation (NextGen) architectures for managing information and gen-
erating insight are not about a single technology. Despite what some people
think, Big Data does not equal Hadoop. Hadoop does play a major role in
NextGen architectures, but it’s a player on a team. There are various tech-
nologies, along with alternative approaches to managing and analyzing
information, that enable organizations to address the challenges we have dis-
cussed and to drive a whole new level of value from their information.
The most significant transformation affects how data is maintained and
used in different zones, each supporting a specific set of functions or work-
loads and each leveraging a specific set of technologies. When you have
these zones working together with data refinery services and information
integration and governance frameworks, you’ll have a data reservoir.
[Figure: a Hadoop MapReduce flow over clickstream records for IP address 201.67.34.67 (www.ff.com, www.ff.com/search?, www.ff.com/shipping); the map step keys each record by IP, and the framework's shuffle/sort/group/reduce step ensures something is in the cart and determines the last page visited]
site, but it has spatial information (IP addresses) and temporal information
(time stamps). Understanding data like this, across millions of users, could
enable an organization to zoom in on why potential customers abandon their
carts at the shipping calculation phase of their purchase. We’ve all been
there—we thought we got a great online deal only to find that shipping was
four times the price. Perhaps this organization could offer extra discount pro-
motions. For example, it could give a point-in-time offer for free shipping if
you spend $100 or more today.
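The following toy Python script walks through the same map, shuffle/sort/group,
and reduce flow in a single process, keying each clickstream record by IP address
and then reducing each visitor's records down to the last page visited. The
three-field log layout (IP, time stamp, URL) is assumed for illustration; on a
real cluster, the Hadoop framework performs the shuffle/sort/group step for you
across many nodes.

    from collections import defaultdict

    clicks = [
        ("201.67.34.67", "10:24:48", "www.ff.com"),
        ("201.67.34.67", "10:25:15", "www.ff.com/search?q=driver"),
        ("201.67.34.67", "18:09:56", "www.ff.com/shipping"),
    ]

    # Map step: emit (key, value) pairs keyed by visitor IP.
    mapped = [(ip, (ts, url)) for ip, ts, url in clicks]

    # Shuffle/sort/group step: on a cluster, the Hadoop framework does this for you.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce step: per visitor, determine the last page visited.
    for ip, visits in grouped.items():
        last_ts, last_url = max(visits)  # same-day HH:MM:SS stamps sort lexically
        print(ip, "last visited", last_url, "at", last_ts)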
However, the greater value comes when you start combining all of these
different types of data and when you start linking your customer’s transac-
tions to their path through your web site—this is called pathing. Data includes
comments from call center notes or emails, insights from linking the daily
output of different production facilities to the condition of production equip-
ment, machine sensor data, and environmental data from weather feeds.
When you consider that our example shows only a small portion of the input
data available for analysis, you will understand where Big Data gets its name.
You have volume, you have variety, and you have velocity (clickstreams are
typically measured in terabytes per day).
The landing zone should really be the place where you capture and store
the raw data coming from various sources. But what do you do with the
metadata and other information extracted from your unstructured data?
And where do you create these linkages and maintain some type of structure
that is required for any real analysis? Guess what? Metadata is cool again! It’s
how you understand the provenance of your data; it’s how you announce
data assets to the community (those data sets that are refined, and those left
raw for exploration). We cover this in Chapter 13.
A separate set of “exploration” zones is emerging to provide the aforemen-
tioned capabilities, enabling different groups of users across the organization
to combine data in different ways and to organize the data to fit their specific
business needs. It also provides a separate environment to test different
hypotheses or discover patterns, correlations, or outliers. Such an environment
lets the data lead the way. We call it a sandbox.
These sandboxes become virtualized and agile testing and learning envi-
ronments where you can experiment with data from the landing zone in
various combinations and schemas, incorporating additional information
that has been generated from the less structured data. These data sets can be
more fluid and temporary in nature. Sandboxes provide a place where vari-
ous groups of users can easily create and modify data sets that can be
removed when they are no longer needed.
We are great proponents of sandboxes because they encourage us to fail,
and fail fast. That caught you off-guard, right? Much can be learned from
failure, and in yesterday’s environments, because it cost so much to fail, we’ve
become afraid of it. Some of the world’s greatest accomplishments came after
tremendous failures. Thomas Edison developed more than 10,000 prototypes
of the lightbulb. James Dyson’s iconic vacuum cleaner made it to market after
5,126 prototypes. And Steven Spielberg was rejected by the University of
Southern California three times until he changed direction and started to
make movies! Hadoop and cloud computing have changed the consequences
of failure and afford new opportunities to learn and discover. This is why it’s
called a sandbox—go on and play, and if you find something, explore!
When leveraging Hadoop to host an exploration zone, the zone might be
just another set of files that sits on the same cluster as the landing zone, or
you can create different data sets for different user communities or LOBs. But
you could also establish exploration zones on different clusters to ensure that
the activities of one group do not impact those of another. What’s more, you
could leverage IBM BigInsights for Hadoop’s multitenant capabilities that
were designed for this very purpose (details in Chapter 6).
Of course, Hadoop is not the only place for doing exploration. Although
Hadoop can be attractive from a cost and data preparation perspective, it might
not provide the required performance characteristics, especially for deeper ana-
lytics. In-memory columnar technologies such as IBM’s BLU Acceleration
(learn about BLU in Chapter 8) are key instruments in any exploration architec-
ture. Discovery via a managed service is key to a NextGen architecture since it
provides an agile delivery method with LOB-like consumption—IBM Watson
Analytics and dashDB are services we wholly expect to be part of such an archi-
tecture. Finally, there is an opportunity to leverage other fit-for-purpose envi-
ronments and technologies, such as analytic appliances, instead of setting up
exploration in the more expensive EDW counterparts.
To make data from your different source systems available for reporting
and analysis, you need to define a target data model and transform the data
into the appropriate structure. In the past, a typical data transformation pro-
cess involved one of two approaches. The first approach was to extract the
data from where it had landed into a separate system to perform the transfor-
mation processes and then to deliver the transformed data to the target sys-
tem, usually a data warehouse or mart. When a large volume of data is
involved and many transformations are required, this process can take a long
time, create a complex and fragile stack of dependencies, and more. Many
companies have become challenged by the long batch processing times that
are associated with this approach. In some cases, the available batch win-
dows are exceeded; in other cases, the required transformations are not com-
plete before the next day’s processing is set to begin.
The other approach was to move the required data into the target data ware-
house or mart and then to leverage the “in-database” processing capabilities to
perform the transformations—this is sometimes appealing to clients whose
transformation logic is SQL-rich since an RDBMS is very good at parallelizing
data. Although this initially helps to speed up some data transformations, it
can often increase the complexity and cost. It drives up the data storage and
processing requirements of the data warehouse environment and impacts the
performance of other critical EDW applications. Because of the drain on capac-
ity, many companies have found themselves once again faced with challenges
around processing times. Finally, when you consider IBM’s Big SQL stand-
alone MPP query engine for Hadoop (Chapter 6), you can see that, for the most
part, the future of transformation is not in the EDW; it’s too expensive and
consumes too many high-investment resources.
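For the sake of illustration, here is a minimal Python sketch of the ELT style
just described: rather than extracting rows to transform them elsewhere, the
client ships a set-based SQL statement to the warehouse and lets the RDBMS
parallelize the work. The driver, connection string, and table names are
placeholders, not a prescription.

    import ibm_db_dbi  # any DB-API 2.0 driver for your warehouse would work here

    # Placeholder connection details.
    conn = ibm_db_dbi.connect("DATABASE=EDW;HOSTNAME=edw-host;PORT=50000;"
                              "PROTOCOL=TCPIP;UID=etl_user;PWD=secret")
    cur = conn.cursor()

    # Transform and load in one set-based statement that runs inside the EDW,
    # letting the warehouse parallelize the work (ELT rather than ETL).
    cur.execute("""
        INSERT INTO sales_mart (sale_date, region, total_amount)
        SELECT CAST(sale_ts AS DATE), region_code, SUM(amount)
        FROM   staging_sales
        GROUP  BY CAST(sale_ts AS DATE), region_code
    """)
    conn.commit()
    conn.close()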
Our point is that there are different classes (or tiers) of analytical process-
ing requirements. Tiers that deliver the most performance are likely to cost
more than those that can’t meet speed-of-thought response times or stringent
service level agreements (SLAs). If you are investing in a platinum analytics
tier that provides the fastest performance for your organization, then we sug-
gest you keep it focused on those tasks. Having a platinum tier prepare data
or having it house data that is seldom queried isn’t going to yield the best
return on your investment. Keep in mind that the more transformation work
you place in the EDW, the slower it will be for those LOB users who think
they are leveraging a platinum-rated service.
Organizations are finding that they can include Hadoop in their architec-
ture to assist with data transformation jobs. Instead of having to extract data
from where it has landed to perform the transformations, Hadoop enables
the data to be transformed where it is. And because they leverage the parallel
processing capabilities of Hadoop (more on that in a bit), such transforma-
tions can often be performed faster than when they are done within the EDW.
This provides companies with the opportunity to reduce their required batch
window time frames and offload processing from a platinum-tiered system.
Faster processing and reduced costs…that’s a potential win-win scenario!
We want to say a couple of things here with respect to Hadoop as a data
preparation platform and why we just said the word potential—you can think
of this as an appetizer for Chapter 13. We aren’t proposing that you toss out
your integration tools. In fact, Gartner has been very public about Hadoop on
its own being used as a data integration platform and how such an approach
demands custom code, requires specialized skills and effort, and is likely to
cost more. And the famous Dr. Ralph Kimball (the godfather of data ware-
housing) noted in a Cloudera web cast that Hadoop is not an ETL tool. Our
own personal experience with a pharmaceutical client saw a mere two days of
effort to compose a complex transformation with a set of integration services,
compared to 30 days for a roll-your-own, hand-coding Hadoop effort—and
the two-day effort ran faster too! So, here’s the point: Data integration isn’t
just about the run time; it’s about a graphical data flow interface, reusability,
maintainability, documentation, metadata, and more. Make no mistake about
it, the data integration tool that you select must be able to work with data in
Hadoop’s distributed file system (HDFS) as though it were any other data
source and even be able to leverage its scalable parallel processing frame-
work. But here’s the thing: We’ve seen other parallel frameworks vastly out-
perform MapReduce for data transformation. And if the transformation is
heavy on the SQL side, it actually might make sense to do the transformation
in the EDW itself or in a separate, dedicated, parallel-processing data integra-
tion engine (not always, but sometimes). The IBM Information Server plat-
form (and its associated as-a-service IBM DataWorks capabilities) is this kind
of integration tool. It enables you to compose and document transformations once
and then run them across a wide range of parallel processing frameworks,
choosing the one that makes the most sense—Hadoop, a relational database,
There are two ways to establish a queryable archive in Hadoop. The first
approach is to follow standard archiving procedures, archiving the data into
Hadoop instead of some alternative storage medium. Considering all the
data storage cost models that are associated with Hadoop, we’ll often refer to
Hadoop as “the new tape.” If you follow this approach, you can leverage
your established processes for moving data to lower-cost storage. And
because the data can easily be accessed with SQL, you can move data off the
EDW much more aggressively. For example, instead of archiving only data
that is older than seven years, you can archive data that is older than two
years. And maybe you can start moving detail-level data even sooner. The
IBM InfoSphere Optim product suite has been enhanced with Hadoop as a
key repository for the archiving process, supporting HDFS as a repository for
archived files. We discuss archiving, Hadoop, and information lifecycle man-
agement in Chapter 13.
The second approach for establishing a queryable archive in Hadoop
involves creating the aforementioned day zero archive. Consider all the zones
we’ve introduced so far. If you were to transform your overall data architec-
ture to establish landing and staging zones, your staging zone could actually
become a queryable archive zone. In other words, you could create your
archive on day zero! If you create a load-ready staging zone, where the data
structure and model matches, or can easily be mapped to the structure and
model of your EDW, you essentially have a copy of most of the data in your
EDW already in Hadoop—and in a queryable format. You can maintain that
data there instead of archiving data from the EDW. You can then just delete
unneeded older data from the EDW. The data will still reside in the Hadoop
staging zone, which now is also your queryable archive zone. Even if you
choose not to instantiate a mirror data model of the EDW, the data is still
captured, and you can use metadata and SQL to retrieve it as needed.
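To give a feel for what a queryable, day zero archive looks like in practice,
the sketch below defines a Hive-compatible external table over the load-ready
files in the staging zone and then queries the older detail rows with plain SQL.
The DDL, HDFS path, and column list are assumptions on our part; you would
submit equivalent statements through whichever SQL-on-Hadoop engine you run
(Big SQL, Hive, and so on).

    # Hive-compatible DDL over the load-ready files already sitting in the
    # staging zone; path, columns, and delimiter are illustrative assumptions.
    ARCHIVE_DDL = """
    CREATE EXTERNAL TABLE sales_archive (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_ts     TIMESTAMP,
        amount      DECIMAL(12,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/staging/sales/'
    """

    # Ordinary SQL against detail rows that no longer need to live in the EDW,
    # for example everything older than a chosen cutoff date.
    ARCHIVE_QUERY = """
    SELECT customer_id, SUM(amount) AS lifetime_spend
    FROM   sales_archive
    WHERE  sale_ts < '2013-01-01 00:00:00'
    GROUP  BY customer_id
    """

    def run(cursor):
        # Submit both statements through any DB-API cursor for your
        # SQL-on-Hadoop engine and return the query results.
        cursor.execute(ARCHIVE_DDL)
        cursor.execute(ARCHIVE_QUERY)
        return cursor.fetchall()

    if __name__ == "__main__":
        # Without a live cluster, simply show the statements this sketch would run.
        print(ARCHIVE_DDL)
        print(ARCHIVE_QUERY)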
Most organizations do not build their EDW from scratch, nor are they just
getting started in this space. We have found that, in most cases, it is a gradual
process and culture shift to get to a day zero archive architecture. We recom-
mend starting small and learning “the ropes.” Perhaps choose a limited subject
area and perform an initial archiving of the data from your existing EDW. Most
organizations already have multiple data transformation processes in place
and cannot begin using new landing and staging zones, as outlined here, right
away. But having a set of integration tools and services that works with HDFS
as well as it works with an RDBMS enables you to change the execution engine
for the transformations without rewriting thousands of lines of code. You
might prefer to start with the first approach outlined earlier (following stan-
dard archiving procedures), but moving toward the second approach (creating
a day zero archive) should be part of your longer-term plan.
on its own terms. Companies have historically addressed this by creating dif-
ferent data marts or separate views on the EDW to support these different
business-reporting requirements.
This requirement will not go away, and the approach will remain fairly sim-
ilar. Organizations will continue to maintain “business reporting” zones and
will, at least for the immediate future, continue to rely on relational databases.
The biggest advancement in this area is likely to be the use of in-memory
technologies to improve query performance. Many business users have com-
plained about the time it takes to run reports and get information about their
business operations. To address this, some organizations are starting to lever-
age in-memory technologies as part of their operational systems so that they
can report on those systems without impacting core operational processes.
Others are starting to leverage in-memory databases for their actual data marts
and improving response times, especially when reviewing business perfor-
mance across multiple dimensions.
This is where the BLU Acceleration technology and the IBM managed
dashDB service play such a crucial role in business reporting. They are
designed from the ground up to be fast, to deliver incredible compression
ratios so that more data can be loaded into memory, to be agile, and to be
nearly DBA-free. With BLU Acceleration, you don’t spend time creating per-
formance-tuning objects such as indexes or materialized views; you load the
data and go. This keeps things agile and suits LOB reporting just fine.
Finally, for data maintained on premise, the addition of BLU Acceleration
shadow tables in the DB2 Cancun Release further extends the reporting capa-
bilities on operational data by enabling the creation of in-memory reporting
structures over live transactional data without impacting transactional per-
formance. This latest enhancement reinforces the theme of this chapter: There
are multiple technologies and techniques to help you deliver an effective
analytics architecture to your organization.
often associated with “real-time” capabilities have often restricted the appli-
cation of vanilla CEP technologies to the fringe use cases where the effort and
investment required could be justified.
Big Data, along with some of its related evolving technologies, is now driv-
ing both the demand and the opportunity to extend these capabilities to a
much broader set of mainstream use cases. As data volumes explode and orga-
nizations start dealing with more semistructured and unstructured data, some
organizations are discovering that they can no longer afford to keep and store
everything. And some organizations that used to just discard entire data sets
are realizing that they might be able to get value from some of the information
buried in those data sets. The data might be clickstreams generated from con-
stantly growing web traffic, machine-generated data from device and equip-
ment sensors being driven by the Internet of Things, or even system log files
that are typically maintained for short periods of time and then purged.
New capabilities have emerged to address this need, enabling companies
to process this data as it is generated and to extract the specific information
required without having to copy or land all of the unneeded data or “noise”
that might have been included in those feeds. And by incorporating analytic
capabilities, organizations can derive true real-time insights that can be acted
upon immediately. In addition, because there is so much focus on harvesting
analytics from data at rest, people often miss the opportunity to apply what
they learned from that at-rest analysis to additional analytics on data in
motion. Both of these potential game changers underscore the need for a
real-time processing and analytics zone.
The real-time processing and analytics zone, unlike some of the other
zones, is not a place where data is maintained, but is more of a processing
zone where data is analyzed or transformed as it is generated from various
sources. Some organizations use this zone to apply transformations to data
as it hits their doorsteps to build aggregated data structures on the fly.
Whereas the raw data that is generated by those sources is the input, the
output can be either a filtered set of that data, which has been extracted, or
new data that was generated through analytics: in short, insight. This data
can be put into one of the other zones for further analysis and reporting, or it
can be acted upon immediately. Actions could be triggered directly from
logic in the real-time processing and analytics zone or from a separate deci-
sion management application. This powerful zone is powered by IBM’s Info-
Sphere Streams technology, which we cover in Chapter 7.
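The snippet below is not InfoSphere Streams code; it is simply a plain-Python
sketch of what this zone does conceptually: analyze records as they flow past,
keep only the distilled insight, and hand that on to another zone or act on it
immediately. The sensor schema and the temperature threshold are illustrative
assumptions.

    def sensor_feed():
        # Stand-in for a live feed (message queue, socket, device gateway, ...).
        readings = [
            {"device": "pump-7", "temp_c": 61.0},
            {"device": "pump-7", "temp_c": 104.5},  # worth acting on
            {"device": "pump-9", "temp_c": 58.2},
        ]
        for reading in readings:
            yield reading

    def detect_overheat(stream, threshold_c=100.0):
        # Emit only the readings that exceed the threshold: the "insight."
        for reading in stream:
            if reading["temp_c"] > threshold_c:
                yield reading

    for alert in detect_overheat(sensor_feed()):
        # In a real deployment this could trigger a decision-management action
        # or be landed in the exploration or reporting zones for further analysis.
        print("ALERT:", alert["device"], "at", alert["temp_c"], "C")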
Figure 4-3 A next-generation zones reference architecture for Big Data and Analytics
5
Starting Out with a Solid
Base: A Tour of Watson
Foundations
It’s a frightening thing to wake up one morning and realize that your busi-
ness is data-rich but knowledge-poor. It seems that while the world spent
the last few years collecting data 24/7, it neglected to put the same effort
into decisioning that data—transforming it into something we like to call
actionable insight. This concept might seem counterintuitive. You’re likely
thinking, “I have petabytes of data on my customers. Why can’t I make
predictions about where my business is heading?” But we are seeing an
epiphany across enterprises and businesses. Companies have spent decades
amassing mountains of data on their businesses, customers, and work-
forces, but now they are asking “What do we do with that data?” and “How
do we transform data points into actionable insight?”
Elsewhere in this book we talk about the concept of a data economy, compar-
ing data analytics to the transformation of crude oil into refined petroleum.
Data is bountiful and can take on a variety of forms: structured, semistructured,
unstructured, at rest, or in motion. Data generates insights…but only when
backed by understanding. If you have the tools to help you understand the big-
ger picture that your Big Data is painting, then the notion of mountains of data
is no longer a frightening prospect; it becomes an opportunity for further dis-
covery and insight. This chapter introduces the rich portfolio of services IBM
Figure 5-1 The IBM Watson Foundations portfolio—leveraging the breadth of services
and software available from IBM’s Big Data and Analytics platform
Hadoop has drawn so much interest because it's a practical solu-
tion to save money and make money. For example, Hadoop enables large vol-
umes of data to be stored on lower-tier storage with (relatively speaking) lower
software licensing costs (saving money), and it enables users to perform analy-
sis activities that weren’t possible before because of past limitations around
data volumes or variety (making money). We almost sound like an infomercial
here—Hadoop provides the best of both worlds. But as we discussed in Chap-
ter 4, while Hadoop is a major part of an information architecture, conversa-
tions about Big Data can’t just be about Hadoop. In practice, what our success-
ful clients have typically found is that they start by inserting Hadoop into their
architecture to save money, and as their sophistication around the technology
grows, they extend it to make money.
Considering the pressing need for technologies that can overcome the vol-
ume and variety challenges for data at rest, it’s no wonder business maga-
zines and online tech forums alike are still buzzing about Hadoop. And it’s
not all talk either. IT departments in almost all Fortune 500 companies have
done some level of experimentation with Hadoop. The problem is that many
of these initiatives have stagnated in the “science project” phase or have not
been able to expand their scope. The challenges are common. It’s easy and
exciting to start dumping data into these repositories; the hard part comes
with what to do next. Meaningful analysis of data stored in Hadoop requires
highly specialized programming skills—and for many algorithms, it can be
challenging to put them into a parallelizable form so that they can run prop-
erly in Hadoop. And what about information governance concerns, such as
security and data lifecycle management, where new technologies like
Hadoop don’t have a complete story?
IBM sees the tremendous potential for technologies such as Hadoop to
have a transformative impact on businesses. This is why IBM has scores of
researchers, developers, and support staff continually building out a platform
for Hadoop, called IBM InfoSphere BigInsights for Hadoop (BigInsights).
BigInsights was released in October 2010 with a relatively straightforward goal:
to maintain an open source Hadoop distribution that can be made enterprise-
ready. In this chapter, we describe five aspects of enterprise readiness:
• Analytics ease Enable different classes of analysts to use skills they
already have (like SQL or R) to get value out of data that is stored in
BigInsights. This avoids the need to pass requirements over the fence to a
separate development team.
Each computer (often referred to as a node in Hadoop-speak) has its own pro-
cessors and a dozen or so 3TB or 4TB hard disk drives. All of these nodes are
running software that unifies them into a single cluster, where, instead of see-
ing the individual computers, you see an extremely large bucket where you
can put your data. The beauty of this Hadoop system is that you can store
anything there: millions of digital image scans of mortgage contracts, weeks
of security camera footage, trillions of sensor-generated log records, or all of
the operator transcription notes from a call center. Basically, if the data is born
into the Internet of Things (IoT), Hadoop is where we think it should land.
This ingestion of data, without worrying about the data model, is actually a
key tenet of the NoSQL movement (this is referred to as “schema later”),
which we talked about in Chapter 2. In contrast, the traditional SQL and rela-
tional database world depends on the opposite approach (“schema now”),
where the data model is of utmost concern upon data ingest. This is where the
flexibility of Hadoop is even more apparent, because you can store data using
both schema later and schema now approaches. There are Hadoop-based
databases where you can store records in a variety of models: relational,
columnar, and key/value. In other words, with data in Hadoop, you can go
from completely unstructured to fully relational, and any point in between.
The data storage system that we describe here is known as Hadoop’s distrib-
uted file system (HDFS).
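To make the "schema later" idea concrete, here is a minimal sketch, using the standard Hadoop Java file system API, of landing a raw record in HDFS. The path and record layout are made up for illustration; the point is that nothing about the record's structure has to be declared before it lands.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandRawData {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster's core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical landing-zone path; no schema is declared up front
        Path target = new Path("/landing/sensors/2014-10-01.log");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("device=42,temp=21.5,ts=2014-10-01T08:00:00Z\n");
        }
        System.out.println("Wrote " + fs.getFileStatus(target).getLen() + " bytes");
    }
}
```

Interpretation of those fields is deferred until somebody queries the data, whether with Big SQL, Hive, BigSheets, or another tool.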
Let’s go back to this imaginary Hadoop cluster with many individual
nodes. Suppose your business uses this cluster to store all of the clickstream
log records for its web site. Your Hadoop cluster is using the IBM BigInsights
for Hadoop distribution, and you and your fellow analysts decide to run
some of the sessionization analytics against this data to isolate common pat-
terns for customers who abandon their shopping carts. When you run this
application, Hadoop sends copies of your application logic to each individ-
ual computer in the cluster to be run against data that’s local to each com-
puter. So, instead of moving data to a central computer for processing (bring-
ing data to the function), it’s the application that gets moved to all of the
locations where the data is stored (bringing function to the data). Having a
cluster of independent computers working together to run applications is
known as distributed processing.
One of the main benefits of bringing application code to the target data
distributed across a cluster of computers is to avoid the high cost of transferring
all the data to be processed to a single central point. Another way of putting
this is that computing systems have to respect the laws of physics; shipping
massive amounts of data to a central computer (data to function) is highly
impractical when you start needing to process data at a terabyte (or higher)
scale. Not only will it take a long time to transfer, but the central computer
doing the processing will likely not have enough memory or disk space to
support the workload.
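As a rough illustration of bringing function to the data, the sketch below (our own code, not something shipped with BigInsights; the log layout and event name are assumptions) shows the kind of mapper Hadoop would copy to every node and run against the clickstream blocks stored locally. A reducer would then total the flags per session.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (sessionId, 1) for every clickstream event that looks like an abandoned cart.
public class AbandonedCartMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text sessionId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: sessionId,timestamp,eventType
        String[] fields = line.toString().split(",");
        if (fields.length >= 3 && "cart_abandoned".equals(fields[2].trim())) {
            sessionId.set(fields[0]);
            context.write(sessionId, ONE);
        }
    }
}
```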
An important aspect of Hadoop is the redundancy that’s built into its
environment. Not only is data redundantly stored in multiple places across
the cluster, but the programming model is designed to expect failures, which
are resolved automatically by running portions of the program on various
servers in the cluster. Because of this redundancy, it’s possible to distribute
the data and programming across a large cluster of commodity components,
like the cluster we discussed earlier. It’s well known that commodity hard-
ware components will fail (especially when you have many of them), but this
redundancy provides fault tolerance and the capability for the Hadoop cluster
to heal itself. This enables Hadoop to scale workloads across large clusters of
inexpensive machines to work on Big Data problems.
The Apache Software Foundation hosts many open source projects that work with Hadoop; the following is a list of some of the more
important ones (all of which, and more, are supported in BigInsights):
• Apache Avro A data serialization framework (translates data from
different formats into a binary form for consistent processing)
• Apache Flume Collects and aggregates data and stores it in Hadoop
• Apache HBase Real-time read and write database (this is a BigTable-
style database—see Chapter 2 for more about BigTable)
• Apache Hive Facility for cataloging data sets stored in Hadoop and
processing SQL-like queries against them
• Apache Lucene/Solr Indexing and search facility
• Apache Oozie Workflow and job scheduling engine
• Apache Pig A high-level data flow language and execution framework
• Apache Sqoop Data transfer between Hadoop and relational
databases
• Apache Spark A flexible, general-purpose engine for batch, iterative, and streaming data processing (see the sketch after this list)
• Apache Zookeeper Monitors the availability of key Hadoop services
and manages failovers so the Hadoop system can remain functional
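To give a taste of one of these projects, here is a minimal Spark sketch in Java that filters a clickstream directory where it sits in the cluster rather than pulling the data back to a client. The HDFS path and the event string are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ClickstreamFilter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ClickstreamFilter");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the raw clickstream files straight out of HDFS
        JavaRDD<String> lines = sc.textFile("hdfs:///landing/clickstream/");

        // Keep only abandoned-cart events and count them on the cluster
        long abandoned = lines.filter(line -> line.contains("cart_abandoned")).count();
        System.out.println("Abandoned-cart events: " + abandoned);

        sc.stop();
    }
}
```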
We’d love to get into more depth and describe how HDFS and YARN
work, not to mention the additional projects in the Apache Hadoop ecosys-
tem; however, unlike a Hadoop cluster, a limiting factor for a book is space.
To learn about Hadoop in greater detail, visit BigDataUniversity.com where
you can find a number of quality online courses for free! You may also want
to get Hadoop for Dummies (Wiley, 2014), another book written by some of this
same author team.
You can’t just drop new code into production. In our experience, backward-
compatibility issues are always present in open source projects. BigInsights
pretty much takes away all of the risk and guesswork that’s associated with
typical open source projects for your Hadoop components.
When you install BigInsights, you also have the assurance that every line
of code from open source and from IBM has been tested using the same rigor-
ous test cases, which cover the following aspects and more:
• Security in all code paths
• Performance
• Scalability (we have a Hadoop cluster that’s well over 1000 nodes in
our development lab where we test petabyte-scale workloads)
• Integration with other software components (such as Java drivers and
operating systems)
• Hardware compatibility (our development lab clusters feature dozens
of different brands and models of switches, motherboards, CPUs,
and HDDs)
In short, BigInsights goes through the same rigorous regression and qual-
ity assurance testing processes used for all IBM software. So, ask yourself
this: Would you rather be your own systems integrator, repeatedly testing all
of the Hadoop components to ensure compatibility? Or would you rather let
IBM find a stable stack that you can deploy without worry?
Architecture
The SQL on Hadoop solutions available in today’s marketplace can be bro-
ken down quite neatly into three distinct categories (going from least effec-
tive to most):
• Remote query facility The approach taken by many of the established
relational database management system (RDBMS) vendors—except
IBM—such as Oracle, Teradata, and Microsoft, where the RDBMS
remains the center of activity. Users submit SQL to the front-end
RDBMS, which catalogs data on the separate Hadoop cluster as
external tables. In this architecture, the database then sends the query
[Big SQL architecture figure: client applications, the Big SQL coordinator, scheduler, and catalog, and the cluster network]
built-in SerDe that takes a JSON-encoded social feed (such as Twitter) and
turns it into a “vanilla” structure that is Big SQL queryable. This is an addi-
tional example of how Big SQL is a first-class citizen, native to Hadoop.
Security
Setting aside the increase in Hadoop adoption rates, security as a founda-
tional topic is more important than ever because it seems we can’t go a month
without realizing our data isn’t as safeguarded as it should be. This is a great
example of how IBM’s experience in hardening enterprises for operations has
come to benefit Hadoop. In the relational world, fine-grained access control
(FGAC) and row and column access control (RCAC) are well-known methods for
implementing application-transparent security controls on data. These are
capabilities available in BigInsights for Hadoop. This is a great example of
where Big SQL’s code maturity comes into play because IBM engineers have
built these kinds of regulatory-compliant security features deep into the Big
SQL architecture. Other SQL on Hadoop engines in the market offer access
control only at the database, view, or table level; Big SQL, by contrast, can
restrict access at the row or column level using SQL statements that any
DBA would be comfortable with. Big SQL also includes an audit facility
that can be used independently or integrated seamlessly with leading
audit facilities such as InfoSphere Guardium (covered in Chapter 12).
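As a sketch of what row-level control can look like from a DBA's chair, the following Java fragment submits DB2-style row and column access control statements over JDBC. The endpoint, credentials, table, group name, and predicate are all hypothetical, and the exact RCAC syntax a given Big SQL release accepts should be confirmed against its documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RowAccessControlSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC endpoint and credentials for the Big SQL head node
        String url = "jdbc:db2://bigsql-head.example.com:51000/bigsql";
        try (Connection con = DriverManager.getConnection(url, "dbadmin", "secret");
             Statement stmt = con.createStatement()) {

            // DB2-style row permission: members of the CLAIMS_EAST group
            // only see rows for the EAST region (illustrative predicate)
            stmt.execute(
                "CREATE PERMISSION claims_east_only ON claims " +
                "FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'CLAIMS_EAST') = 1 " +
                "AND region = 'EAST' " +
                "ENFORCED FOR ALL ACCESS ENABLE");

            // Nothing is filtered until row access control is activated on the table
            stmt.execute("ALTER TABLE claims ACTIVATE ROW ACCESS CONTROL");
        }
    }
}
```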
for some time, to the point where today’s generation of statisticians and data
scientists mostly favor R as their statistical platform of choice. And why
not—it’s open source, there’s a huge public library of statistical models and
applications coded in R (through the Comprehensive R Archive Network—
CRAN—a network of FTP and web servers around the world that store iden-
tical, up-to-date versions of code and documentation for R), and there are a
number of great R tools for development and visualization (like RStudio).
Taken altogether, the R ecosystem provides a complete set of capabilities for
data exploration, transformation, visualization, and modeling.
So, you’d think that R and Hadoop would be made for each other, right?
Sadly, this isn’t the case. While Hadoop is a parallel execution environment,
R is not. R runs beautifully on a single computer with lots of memory, but it’s
not designed to run in a distributed environment like Hadoop. This is where
Big R comes in—it enables R applications to be run against large Hadoop
data sets. For data scientists and statisticians, this means they can use their
native tools to do analysis on Hadoop data in a highly performant manner,
without having to learn new languages or programming approaches (such as
parallel programming).
At its core, Big R is simply an R package, and you can install it in your R
client of choice (like RStudio). It’s also installed on each node in the BigInsights
cluster. When you use Big R, you have access to the entire dataset that’s resid-
ing in Hadoop, but there is minimal data movement between the BigInsights
cluster and the client. What makes Big R special is that you can use it to push
down your R code to be executed on the BigInsights Hadoop cluster. This
could be your own custom code or preexisting R library code from CRAN.
IBM’s Big R also includes a unique machine learning facility that enables
statistics algorithms to be run in parallel on a BigInsights cluster. We’re not
sure if our marketing teams will end up swizzling the name into something
else, but for now, we’ll refer to it by its research name: SystemML. (If you’ve
followed IBM innovations, you’ll note their projects get prefixed with the
word System, for example, System R for relational databases, SystemT for
text analysis, System S for streaming, and SystemML for machine learning.)
SystemML is an ecosystem for distributed machine learning that provides
the ability to run machine learning models on the BigInsights cluster at a mas-
sive scale. SystemML includes a number of prebuilt algorithms for machine
learning and descriptive statistics. For cases where you need to write your own
custom algorithms, SystemML includes a high-level declarative language for expressing them.
Manually developing this kind of text analysis workflow for Hadoop would require a great
deal of tedious hand-coding. IBM has decades of text analytics experience
(think about the natural language processing capabilities in Watson Analytics
we talked about in Chapter 1) and includes in BigInsights an Advanced Text
Analytics Toolkit. While the BigInsights analytics components all connect to
this toolkit, it’s also integrated into IBM InfoSphere Streams (covered in Chap-
ter 7). This means the text extractors that you write for your organization’s
data can be deployed on data at rest (through BigInsights) or data in motion
(through Streams). For example, you can expose a set of extractors as an app
(see the later section “Hadoop Apps for the Masses: Easy Deployment and
Execution of Custom Applications”), or you can apply it to a column of data in
a BigSheets sheet (see the next section).
The biggest challenge in text analysis is to ensure the accuracy of results. To
keep things simple, we will define accuracy of results through two components, precision and recall (formalized briefly after this list):
• Precision A measure of exactness, the percentage of items in the
result set that are relevant: “Are the results you’re getting valid?” For
example, if you wanted to extract all the references to publicly traded
companies from news stories, and 20 of the 80 passages identified as
referring to these companies weren’t about them at all, your precision
would be 75 percent. In summary, precision describes what fraction of the
identified passages are actually correct.
• Recall A measure of completeness, the percentage of relevant results
that are retrieved from the text; in other words, are all the valid strings
from the original text showing up? For example, if you wanted to extract
all of the references to publicly traded companies from news stories, and
got 80 out of 100 that would be found by a human expert, your recall
would be 80 percent, because your application missed 20 percent of
references. In summary, recall is how many matching passages are
found out of the total number of actual matching passages.
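Put a little more formally (our restatement of the two bullets above), let TP be the passages the extractor identified correctly, FP the passages it identified that were not real matches, and FN the real matches it missed. Then:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
\]

In the two examples above, precision = 60/80 = 75 percent (20 of the 80 extracted passages were wrong), and recall = 80/100 = 80 percent (20 of the 100 true references were missed).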
As analysts develop their extractors and applications, they iteratively
make refinements to tune their precision and recall rates. The development
of extractors is really about adding more rules and knowledge to the extrac-
tor itself; in short, it’s about getting more powerful with each iteration.
The Text Analytics Toolkit has two interfaces: a web-based GUI designed
for business users to quickly and simply design text extractors (see Figure 6-2)
With such a rich tool set, your business users are shielded from AQL
code and only see visual representations of the extractions.
formulas pivot, filter data, and so on. You can also deploy custom
macros for BigSheets. For example, you can deploy a text extractor
that you built with the BigInsights development tools as a
BigSheets macro. As you build your sheets and refine your
analysis, you can see the interim results in the sample data. It’s
only when you click the Run button that your analysis is applied to
the complete data collection. Because your data could range from
gigabytes to terabytes to petabytes in size, working iteratively with
a small data set is the best approach.
3. Explore and visualize data. After running the analysis from your
sheets against the data, you can apply visualizations to help you
make sense of your data. BigSheets provides a number of traditional
and new age Big Data visualizations, including the following:
• Tag cloud Shows word frequencies; the larger the letters, the
more occurrences of data referencing that term were found.
• Pie chart Shows proportional relationships, where the relative
size of the slice represents its proportion of the data.
Spatiotemporal Analytics
Much of the data we see flowing into Hadoop clusters today comes from IoT
applications, where some kind of status or activity is logged with a time code
and a geocode. To make it easier to get value out of this data in BigInsights,
IBM has included a spatiotemporal analytics library with BigInsights. This is
actually the same spatiotemporal analytics library that’s included in the
Geospatial toolkit for InfoSphere Streams, which further shows IBM’s com-
mitment to a unified analytics experience across the Big Data stack.
The spatiotemporal analytics library consists of a number of functions
such as interpreting bounding boxes and points in space/time; calculating
area, distances, and intersections; and converting geohash codes. The beauty
of the integration of these functions in BigInsights is that they’re surfaced in
BigSheets. This means nonprogrammers can apply a rich set of powerful spa-
tiotemporal functions using a GUI tool. Moreover, since this is in BigSheets,
it means other analytics tools in BigInsights can consume them; for example,
you can query a BigSheets sheet with the spatiotemporal functions using a
Big SQL query.
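To give a flavor of the kind of computation such a library wraps, here is a plain Java implementation of the textbook haversine great-circle distance between two latitude/longitude points. This is generic math, not the BigInsights or Streams geospatial API itself.

```java
public final class GeoDistance {
    private static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two (lat, lon) points, in kilometers
    public static double haversineKm(double lat1, double lon1,
                                     double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Roughly Toronto to Ottawa
        System.out.printf("%.1f km%n", haversineKm(43.65, -79.38, 45.42, -75.70));
    }
}
```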
Cradle-to-Grave Application
Development Support
In 2008, Steve Ballmer made YouTube history with his infamous “Developers…
Developers… Developers…” speech. Search for it, and you’ll quickly find him
on the stage in a passionate frenzy, repeating the word “Developers!” with
increasing intensity. While this might make you question your own sanity in
choosing a career in IT, we can’t help but agree that Steve Ballmer was onto
something important here: developers. (Refer to our discussion on this topic in
Chapter 2 for more details.) If you want your platform to succeed, you had bet-
ter make it easy for your developers to write applications for it. It’s with this
thought in mind that IBM built a rich set of developer tools for BigInsights. The
ultimate goal is to minimize the barrier between analysts and developers so that
apps can quickly be developed and tested with the BigInsights cluster and be
easily consumed. This section describes the BigInsights developer tool set.
[Figure: the BigInsights application lifecycle for apps and BigSheets macros: develop, publish, deploy, execute]
The following details the lifecycle steps and their impacted components:
• Develop Using the BigInsights development tools for Eclipse, you
can, for example, create text extractors to isolate connection failures in
your web server logs.
• Publish From Eclipse, you can push your text extractor to be
available in the Web Console as either an app (through the
Applications panel) or a macro for BigSheets.
• Deploy From the Web Console, someone with the Application
Administrator role can deploy and configure an application for
execution on the BigInsights cluster.
• Execute An end user can run the text extractor in the console (either as
an app or as a macro, depending on how it was published). Alternatively,
a developer can run the app from the Web Console and download the
results for testing in the Text Analytics debugging tool.
These are all steps that an IT organization needs to take to develop and
deploy applications in a Hadoop context. In BigInsights, IBM has provided
be authorized to run specific apps. Given that there are data sources or services
where security credentials are required, the app’s interface lets you leverage
the BigInsights credentials store, enabling you to securely pass authentication
information to the data source or service to which you’re connecting.
failed), and more. Each dashboard contains monitoring widgets that you can
configure to a high degree of granularity, ranging from the time period to its
refresh interval (see Figure 6-7).
Monitoring Applications
The BigInsights Web Console provides context-sensitive views of the cluster
so that people see only what they need to see based on their role. For exam-
ple, if someone with the role of “user” runs an application, they see only their
own statistics in the Application History view. This particular view is opti-
mized to show a high level of application status information, hiding the
lower-level workflow and job information.
People with administrator roles can see the state of all apps in the Appli-
cation History view and the Application Status monitoring dashboard;
additionally, they’re able to drill into individual workflows and jobs for
debugging or performance testing reasons. The Application Status pane has
views for Workflows and Jobs, which list every active and completed work-
flow and job in the cluster. You can drill into each workflow or job to get
further details and from there also see related elements.
BigInsights reduces the security surface area through securing access to the administrative inter-
faces, key Hadoop services, lockdown of open ports, role-based security, inte-
gration into InfoSphere Guardium (Guardium), and more.
The BigInsights Web Console has been structured to act as a gateway to
the cluster. It features enhanced security by supporting Lightweight Direc-
tory Access Protocol (LDAP) and Kerberos authentication protocols. Secure
authentication and reverse-proxy support help administrators restrict access
to authorized users. In addition, clients outside of the cluster must use
secured REST interfaces to gain access to the cluster through the gateway. In
contrast, Apache Hadoop has open ports on every node in the cluster. The
more ports you need to have open (and there are a lot of them in open source
Hadoop), the less secure the environment and the more likely you won’t pass
internal audit scans.
BigInsights can be configured to communicate with an LDAP credentials
server or Kerberos key distribution center (KDC) for authentication. In the
case of LDAP, all communication between the console and the LDAP server
occurs using LDAP (by default) or both LDAP and LDAPS (LDAP over
SSL/TLS). The BigInsights installer helps you to define mappings between
your LDAP/Kerberos users and groups and the four BigInsights roles (Sys-
tem Administrator, Data Administrator, Application Administrator, and
User). After BigInsights has been installed, you can add or remove users
from the LDAP groups to grant or revoke access to various console functions.
BigInsights also supports alternative authentication options such as Linux
pluggable authentication modules (PAMs). You can use this to deploy bio-
metric authentication and other custom protocols.
Putting the cluster behind the Web Console’s software firewall and estab-
lishing user roles can help lock down BigInsights and its data, but a complete
security story has to include auditing, encryption, data masking and redac-
tion, and enhanced role-based access. In Chapter 12, the Guardium and
Optim solutions that provide on-premise and off-premise (for IBM Data-
Works) governance and integration services are described in detail. Earlier in
this chapter we describe Big SQL, which features fine-grained and label-
based access controls for data at the row or column level.
Adaptive MapReduce
BigInsights includes enhanced workload management and scheduling capa-
bilities to improve the resilience and flexibility of Hadoop, called Adaptive
can reduce their total storage footprint by as much as one third, helping
reduce infrastructure costs and their data center footprint.
The full POSIX compliance of GPFS-FPO enables you to manage your
Hadoop storage just as you would any other servers in your IT shop. That’s
going to give you economies of scale when it comes to building Hadoop skills
and just make life easier. Many Hadoop people seem to take it for granted that
they can’t do anything but delete or append to files in HDFS—in GPFS-FPO,
you’re free to edit your files! Also, your traditional file administration utilities
will work, as will your backup and restore tooling and procedures. This is
actually a big deal—imagine you need to quickly look for differences between
two data sets. In Linux, you’d simply use the diff command, but in Hadoop
you’d have to write your own diff application.
GPFS includes a rich replication facility called Active File Management
(AFM), which supports a range of high availability options. With AFM you
can create associations to additional BigInsights clusters within your data
center, or in other geographic locations. AFM will asynchronously replicate
data between the clusters, enabling you to have a single namespace, span-
ning multiple data centers, even in different countries. You can configure
active-active or active-passive configurations for the associated BigInsights
clusters. In the case of disasters, AFM supports online recovery. This goes
light years beyond what HDFS is capable of.
GPFS-FPO enables you to safely manage multitenant Hadoop clusters
with a robust separation-of-concern (SoC) infrastructure, allowing other
applications to share the cluster resources. This isn’t possible in HDFS. This
also helps from a capacity planning perspective because without GPFS-FPO,
you would need to design the disk space that is dedicated to the Hadoop
cluster up front. In fact, not only do you have to estimate how much data you
need to store in HDFS, but you’re also going to have to guess how much stor-
age you’ll need for the output of MapReduce jobs, which can vary widely by
workload. Finally, don’t forget that you need to account for space that will be
taken up by log files created by the Hadoop system too! With GPFS-FPO, you
only need to worry about the disks themselves filling up; there’s no need to
dedicate storage for Hadoop.
All of the characteristics that make GPFS the file system of choice for
large-scale, mission-critical IT installations are applicable to GPFS-FPO.
After all, this is still GPFS, but with Hadoop-friendly extensions. You get the
same stability, flexibility, and performance in GPFS-FPO, as well as all of the
utilities that you’re used to. GPFS-FPO also provides hierarchical storage
management (HSM) capabilities—another favorite of ours—whereby it can
manage and use disk drives with different retrieval speeds efficiently. This
enables you to manage multitemperature data, keeping “hot” data on your
best-performing hardware.
GPFS-FPO also features extensive security and governance capabilities,
which you can configure as automatically deployed policies. For a particular
data set, you can define an expiration date (for defensible disposal), a replica-
tion policy (for example, you may only want to replicate mission-critical data
off site), specific access controls, mutability (whether people can edit, append,
or delete files), compression, or encryption. That last point about encryption is
worth an extra bit of attention: the GPFS-FPO encryption is NIST SP 800-131A
compliant. And even better, this is file system–based encryption that doesn’t
require extra tools or expense. In addition, GPFS-FPO includes a secure erase
facility that ensures hard disk drives in the cluster won’t include traces of
deleted data.
GPFS-FPO is such a game changer that it won the prestigious Supercom-
puting Storage Challenge award in 2010 for being the “most innovative stor-
age solution” submitted to this competition.
connect to the IBM PureData System for Analytics and Netezza through a
special high-speed adapter.
text extraction across both products, the spatiotemporal support is the same,
and more. Both the PureData System for Analytics and BigInsights include
some entitlement to the Streams product.
What if you want to explore this data from different sources and make correlations?
Without a matching facility, this exploration becomes a tedious exercise. To
solve this problem, IBM has taken its probabilistic matching engine from the
master data management offering and ported it to Hadoop with the new
name: Big Match. (Yes, everything with BigInsights is big. We’re not a cre-
ative bunch when it comes to naming, but we like to think we are foreshad-
owing the results.) For more about Big Match, see Chapter 14.
Deployment Flexibility
When people deploy Hadoop today, there is a fairly limited set of options
available out of the box. What if you want to run Hadoop on noncommodity
(that is, non-Intel) hardware for some different use cases? Cloud computing
is all the rage now, but how many cloud services for Hadoop can boast bare-
metal hardware deployments with hardware that’s optimized for Hadoop?
Bottom line, we recognize that what different organizations need from
Hadoop is varied, and to best serve these different needs, IBM has provided
a number of varied deployment options for BigInsights.
Standard Edition
BigInsights Standard Edition is designed to help you get your feet wet with
BigInsights in a budget-friendly way. It has Big SQL, BigSheets, development
tools, security features, administration console, and, of course, the same open
source Hadoop foundation as the Enterprise Edition. It’s the enterprise fea-
tures (Big R, the analytic accelerators, Text Analytics Toolkit, GPFS-FPO sup-
port, Adaptive MapReduce, and the additional software license) that are
exclusive to the Enterprise Edition. However, to provide additional value,
both Big SQL and BigSheets are included in the Standard Edition.
Enterprise Edition
BigInsights Enterprise Edition includes all of the features described in this
chapter. In addition, you get a number of software licenses for other data
analytics offerings from IBM:
• InfoSphere Streams A streaming data analysis engine; we describe
this in Chapter 7. This license entitles you to run streaming data jobs in
conjunction with your data-at-rest processing using BigInsights.
BigInsights supports two flexible cloud options: bring your own license
(BYOL) to whatever cloud provider you want or provision a powerful and
secure bare-metal deployment on the IBM Cloud. BigInsights is also avail-
able as a platform as a service component in IBM Bluemix (for more, see
Chapter 3).
Figure 6-8 Provisioning BigInsights over the cloud and getting it right.
There are many cases where the cloud is appealing if you just need to try
something and performance doesn’t matter. But when reliable performance
does matter and you don’t have the time or desire to provision your own
Hadoop cluster, this Hadoop-optimized SoftLayer bare-metal option is a
winner.
Higher-Class Hardware:
Power and System z Support
Although commodity hardware deployments have been the norm for Hadoop
clusters up to this point, we’re seeing some alternatives. IBM provides two
compelling options: support for BigInsights on Power or System z hardware.
BigInsights on Power
One of the most significant benefits of Power hardware is reliability, which
makes it well suited to be deployed as master nodes in your cluster. If you
need higher energy efficiency or need to concentrate data nodes onto fewer tiles
or simply need more processing power, Power can most definitely be a great
way to go. In 2011, the IBM Watson system beat the two greatest champions
of the American quiz show Jeopardy!—Linux on Power was the underlying
platform for Watson, which also leveraged Hadoop for some of its subsys-
tems. During the development of the IBM PowerLinux Solution for Big Data,
customizations and lessons learned from the Watson project were applied to
this offering for both on-premise and off-premise deployments.
BigInsights on System z
Nothing says uptime and rock-solid stability like the mainframe. The next
time you talk to a System z admin, ask her about the last outage—planned or
unplanned. It’s likely that it was many years ago. Also, System z has an
extremely secure architecture and access control regime. Not everyone wants
to run Hadoop on commodity hardware, and a number of IBM’s mainframe
customers have been asking for the flexibility of Hadoop. So, to provide that
flexibility, BigInsights can be installed on Linux on System z.
Wrapping It Up
We said it earlier in this book, but it’s really important to understand that Big
Data conversations do not solely equal Hadoop conversations. Hadoop is
but one of multiple data processing engines you will need to address today’s
challenges (and the ones you don't know about yet). For this reason, IBM
has long been committed to Hadoop, with code contributions to open source
and continual engineering around the ecosystem. With its long history of
enterprise-grade infrastructure and optimization, IBM has taken this experi-
ence and applied it to Hadoop through BigInsights. BigInsights includes
open source Hadoop and adds some operational excellence features such as
respiration, brain waves, and blood pressure, among others. This constant
feed of vital signs is transmitted as waves and numbers and routinely dis-
played on computer screens at every bedside. Currently, it’s up to doctors
and nurses to rapidly process and analyze all this information in order to
make medical decisions.
Emory aims to enable clinicians to acquire, analyze, and correlate medical
data at a volume and velocity that was never before possible. IBM Streams
and EME’s bedside monitors work together to provide a data aggregation
and analytics application that collects and analyzes more than 100,000 real-
time data points per patient per second. Think about how bedside monitor-
ing is handled today. A nurse comes and records charting information every
half-hour or so; if there are no major changes, the nurse is apt to just record
the same number. Throughout this book we keep talking about how our Big
Data world involves 24/7 data collection (in a single hour, a bedside monitor
could produce a whopping 360 million data points). The problem here is
captured perfectly by the “enterprise amnesia” we talked about in Chapter 1.
Many potentially interesting patterns could be found in those 360 million
data points, but current hospital practices use just a few dozen.
The software developed by Emory identifies patterns that could indicate
serious complications such as sepsis, heart failure, or pneumonia, aiming to
provide real-time medical insights to clinicians. Tim Buchman, MD, PhD,
director of critical care at Emory University Hospital, speaks to the potential
of analyzing those missing 359,999,998 records when he notes, “Access-
ing and drawing insights from real-time data can mean life and death for a
patient.” He further notes how IBM Streams can help because it empowers
Emory to “…analyze thousands of streaming data points and act on those
insights to make better decisions about which patient needs our immediate
attention and how to treat that patient. It’s making us much smarter in our
approach to critical care.” For example, patients with a common heart disorder
called atrial fibrillation (A-fib) often show no symptoms, but it is a common
and serious condition that can be associated with congestive heart failure
and strokes. Using the new research system, Emory clinicians can view a
real-time digital visualization of the patient’s analyzed heart rhythm and
spot A-fib in its earliest stages.
In short, we say it’s advancing the ICU of the future. In fact, we saw this
work pioneered by Dr. Carolyn McGregor from the University of Ontario
Institute of Technology (UOIT) and her work in the neonatal ICU (NICU) at
Toronto’s Sick Kids hospital—search the Internet for “IBM data baby” and
learn about this incredible story.
A variety of industries have adopted information technology innovations
to transform everything, from forecasting the weather to studying fraud, and
from homeland security to call center analytics—even predicting the out-
comes of presidential elections. Emory’s vision of the “ICU of the future” is
based on the notion that the same predictive capabilities possible in banking,
air travel, online commerce, oil and gas exploration, and other industries can
also apply in medicine. With this in mind, let’s take a look at the IBM Watson
Foundations software that allows you to take data at rest and apply it to ana-
lytics for data in motion.
Figure 7-1 A simple data stream that applies a transformation to data and splits it into
two possible outputs based on predefined logic
A window is a finite sequence of tuples and looks a lot like a database view. Windows are continu-
ously updated as new data arrives, by eliminating the oldest tuples and adding
the newest tuples. Windows can be easily configured in many ways. For
example, the window size can be defined by the number of tuples or an aging
attribute such as the number of seconds. Windows can be advanced in many
ways, including one tuple at a time, or by replacing an entire window at
once. Each time the window is updated, you can think of it as a temporarily
frozen view. It’s easy to correlate a frozen view with another window of data
from a different stream, or you can compute aggregates using similar tech-
niques for aggregates and joins in relational databases. The windowing
libraries in Streams provide incredible productivity for building applica-
tions. Windowing is an important concept to understand because Streams is
not just about manipulating one tuple at a time, but rather analyzing large
sets of data in real time and gaining insight from analytics across multiple
tuples, streams, and context data.
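The idea is easier to see in plain code. The sketch below is not SPL and does not use the Streams windowing library; it is a bare-bones Java illustration of a count-based tumbling window that aggregates every N tuples and then starts over.

```java
import java.util.ArrayList;
import java.util.List;

// Bare-bones count-based tumbling window: collects N readings, emits their
// average, then clears and starts filling again.
public class TumblingAverageWindow {
    private final int windowSize;
    private final List<Double> window = new ArrayList<>();

    public TumblingAverageWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    // Returns the window average when the window is full, otherwise null
    public Double add(double reading) {
        window.add(reading);
        if (window.size() < windowSize) {
            return null;
        }
        double sum = 0;
        for (double value : window) {
            sum += value;
        }
        window.clear();                 // tumble: the next window starts empty
        return sum / windowSize;
    }

    public static void main(String[] args) {
        TumblingAverageWindow w = new TumblingAverageWindow(3);
        for (double reading : new double[] {98.0, 97.5, 99.1, 96.4, 97.0, 98.2}) {
            Double avg = w.add(reading);
            if (avg != null) {
                System.out.printf("window average: %.2f%n", avg);
            }
        }
    }
}
```

A sliding window would instead evict only the oldest tuple (or the oldest few seconds' worth of tuples) as each new one arrives; as noted above, Streams supports both styles and handles the bookkeeping for you.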
Streams also has the concept of composite operators. A composite operator
consists of a reusable and configurable Streams subgraph. Technically, all
Streams applications contain at least one composite (the main composite for
the application), but they can include more than one composite (composites
can also be nested). A composite defines zero or more input streams and zero
or more output streams. Streams can be passed to the inputs of the composite
and are connected to inputs in the internal subgraph. Outputs from the inter-
nal subgraph are similarly connected to the composite outputs. A composite
can expose parameters that are used to customize its behavior. This is all
technical-speak for saying you can nest flows of Streams logic within other
flows of Streams logic—which makes for a powerful composition paradigm.
High Availability
Large, massively parallel jobs have unique availability requirements because
in a large cluster there are bound to be failures; massively parallel
technologies such as Streams have to plan for when things fail in a cluster by
expecting failures (Hadoop’s architecture embraced this notion as well).
Streams has built-in availability protocols that take this into account. Coupled
with application monitoring, Streams enables you to keep management costs
low and the reputation of your business high.
When you build a streaming application, the operators that make up the
graph are compiled into processing elements (PEs). PEs can contain one or
more operators, which are often “fused” together inside a single PE for per-
formance reasons. (You can think of a PE as a unit of physical deployment in
the Streams run time.) PEs from the same application can run on multiple
hosts in a network, as well as exchange tuples across the network. In the
event of a PE failure, Streams automatically detects the failure and chooses
from among a large number of possible remedial actions. For example, if the
PE is restartable and relocatable, the Streams run-time engine automatically
picks an available host in the cluster on which to run the PE, starts the PE on
that host, and automatically “rewires” inputs and outputs to other PEs, as
appropriate. However, if the PE continues to fail over and over again and
exceeds a retry threshold (perhaps because of a recurring underlying hardware
issue), the PE is placed into a stopped state and requires manual intervention
to resolve the issue. If the PE is restartable but has been defined as not relo-
catable (for example, the PE is a sink that requires it to be run on a specific
host where it intends to deposit output of the stream after some operation),
the Streams run-time engine automatically attempts to restart the PE on the
same host, if it is available. Similarly, if a management host fails, you can
have the management function restarted elsewhere. In this case, the recovery
metadata has the necessary information to restart any management tasks on
another server in the cluster. As you can see, Streams wasn’t just designed to
recover from multiple failures; it was designed to expect them. Large scale-
out clusters have more components, which means the probability of an indi-
vidual component failing goes up.
Data processing is guaranteed in Streams when there are no host, net-
work, or PE failures. You might be wondering what happens to the data in
the event of one of these failures. When a PE (or its network or host) does fail,
the data in the PE’s buffers (and data that appears while a PE is being
restarted) can be lost without special precautions. For the most sensitive
applications that also demand very high performance, it is appropriate to deploy
two or more parallel instances of the streaming application on different serv-
ers in the cluster—that way, if a PE in one instance of a graph fails, the other
parallel instance is already actively processing all of the data and can con-
tinue while the failed graph is being restored. There are other strategies that
can be used, depending on the needs of the application. Streams is often
deployed in production environments in which the processing of every sin-
gle bit of data with high performance is essential, and these high-availability
strategies and mechanisms have been critical to successful Streams deploy-
ments. The fact that Streams is hardened for enterprise deployments helped
one customer recover from a serious business emergency that was caused by
an electrical outage—if only everything were as reliable as Streams!
Figure 7-2 The management of jobs and hosts in the Streams Console
The Streams Console also includes a hosts view that shows each available host in a Streams cluster. At a glance, you can
see host health in the Status field. You can also see the status of services on
each host, whether the host is available for running applications, metrics for
the host, including load average and number of CPU cores when metrics col-
lection is enabled, and so on. Tags, such as IngestServer, can be added to
hosts to facilitate optimal application placement in the cluster. When some-
thing goes wrong with a host, you can view or download logs with the click
of a button for problem determination. It’s also possible to quiesce work-
loads on a server so that it can be taken out of service for maintenance. Simi-
lar features for jobs, operators, and processing elements are available as well;
for example, the top of Figure 7-2 shows the console’s jobs view.
Operations Visualized
A great feature in the Streams Console is the ability to monitor results of
Streams applications in a natural way, using graphs. The Application Graph
service displays all running applications and run-time metrics from a Streams
instance in an interactive and customizable window.
For example, Figure 7-3 illustrates two running financial services applica-
tions, one calculating trend metrics and one calculating the volume-weighted
average price (VWAP) for every stock symbol on a tick-by-tick basis. At the
top of Figure 7-3 you can see that the graph is configured to continuously
update the color of the operator based on the tuple rate. The thickness of the
lines is also updated to be proportional to the data flow rate to give you a
quick visual cue as to how much data is flowing through the system. The
bottom of Figure 7-3 illustrates a different view—one in which the operators
are grouped by job. By clicking an operator in the TrendCalculator applica-
tion, additional information, such as tuples processed by the operator, is dis-
played. Clicking other objects, such as ports or a stream, also provides sig-
nificant detail.
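For readers who have not encountered the metric, the volume-weighted average price calculated by the second job is just the standard definition (nothing Streams-specific):

\[
\mathrm{VWAP} = \frac{\sum_i p_i \, v_i}{\sum_i v_i}
\]

where \(p_i\) and \(v_i\) are the price and volume of the ith trade for a given symbol within the window being evaluated.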
Figure 7-3 The Streams Console showing the Application Graph service with
two running jobs at the top and a grouping by jobs showing details about the
NewTrendValue custom operator at the bottom
manipulated until the flow is right. The operators and graphs can also be
annotated with the functions that should be performed. Other users, or
developers, can later choose and configure the remaining operator imple-
mentations, creating new operators if they don’t already exist in one of the
extensive toolkits.
For example, Figure 7-5 shows a simple sketched application with operators
labeled Reader, Filter, and Writer. The implementation for Reader is
already known to be a FileSource (a built-in operator for reading data from
files). The Filter and Writer operators are generic operators that serve as
placeholders until an implementation is chosen. In this example, an architect
has annotated the graph, indicating that the generic Filter placeholder
should be implemented using the Streams standard toolkit Filter operator,
with filtering based on ticker symbol. The figure also shows a user searching
for operators starting with fil and subsequently dragging (the squared icon
hovering over the operator) the standard toolkit Filter operator onto the
graph to provide an implementation for the Filter placeholder.
Figure 7-5 Application building is as simple as dragging and dropping with the
Streams Graphical Editor.
Figure 7-6 Extending the VWAP sample to write tuples from QuoteFilter to a file
Figure 7-7 The Streams Graphical Editor and SPL editor are linked for round-trip updates.
Figure 7-8 Display live data being fed from the NewTrendValue stream by clicking the
Instance Graph.
SPL raises the level of abstraction above general-purpose languages and their application programming interfaces (APIs), in the same way that SQL makes
it easy to pull desired data sets out of database tables instead of hand-coding C
or Java applications. In fact, Dr. Alex Philp, founder and CTO of Adelos, Inc.,
noted that his developers can “deliver applications 45 percent faster due to
the agility of [the] Streams Processing Language.” We think that Bó Thide, a
professor at Sweden’s top-ranked Uppsala University, said it best when
referring to SPL: “Streams allows me to again be a space physicist instead of
a computer scientist.” After all, technology is great, but if you can’t quickly
apply it to the business need at hand, what’s the point?
Streams-based applications built with the Streams Graphical Editor or
written in SPL are compiled using the Streams compiler, which turns them
into C++ code and invokes the C++ compiler to create binary executable
code—this executable code runs in the Streams environment to accomplish
tasks on the various servers in the cluster.
MetricsSink
The MetricsSink adapter is an interesting and useful sink adapter because
it enables you to set up a named metric, which is updated whenever a tuple
arrives at the sink. You can think of these metrics as a gauge that you can
monitor using Streams Studio or other tools. If you’ve ever driven over one
of those traffic counters (those black rubber hoses that lie across an intersec-
tion or road), you have the right idea. While a traffic counter measures the
flow of traffic through a point of interest, a MetricsSink can be used to
monitor the volume and velocity of data flowing out of your data stream.
Analytical Operators
Operators are at the heart of the Streams analytical engine. They take data from
upstream adapters or other operators, manipulate that data, and create a new
stream and new tuples (possibly pass-through) to send to downstream opera-
tors. In addition to tuples from an input stream, operators have access to met-
rics that can be used to change the behavior of the operator, for example, during
periods of high load. In this section, we discuss some of the more common
Streams operators that can be strung together to build a Streams application.
Filter
The Filter operator is similar to a filter in an actual water stream or in your
furnace or car: Its purpose is to allow only some of the streaming contents
to pass. A Streams Filter operator removes tuples from a data stream based
on a user-defined condition specified as a parameter to the operator. After
you’ve specified a condition, the first output port defined in the operator will
send out any tuples that satisfy that condition. You can optionally specify a
second output port to send any tuples that did not satisfy the specified condi-
tion. (If you’re familiar with extract, transform, and load [ETL] flows, this is
similar to a match and discard operation.)
Functor
The Functor operator reads from an input stream, transforms tuples in flex-
ible ways, and sends new tuples to an output stream. The transformations
can manipulate or perform calculations on any of the attributes in the tuple.
For example, if you need to keep a running tally of the number of seconds a
patient’s oxygen saturation level is below 90 percent, you could extract the
applicable data element out of the patient’s data stream and output the run-
ning total for every tuple.
Punctor
The Punctor operator adds punctuation into the stream, which can then be
used downstream to separate the stream into multiple windows. For exam-
ple, suppose a stream reads a contact directory listing and processes the data
flowing through that stream. You can keep a running count of last names in
the contact directory by using the Punctor operator to add a punctuation
mark into the stream any time your application observes a changed last name
in the stream. You could then use this punctuation mark downstream in an
aggregation Functor operator that sends out the running total for the cur-
rent name, to reset the count to zero and start counting occurrences of the
next name. Other operators can also insert punctuation markers, but for the
Punctor that is its only role in life.
Sort
The aptly named Sort operator outputs the tuples that it receives, but in a
specified sorted order. This operator uses a window on the input stream.
Think about it for a moment: If a stream represents an infinite flow of data,
how can you sort that data? You don’t know whether the next tuple to arrive
will need to be sorted with the first tuple to be sent as output. To overcome
this issue, Streams enables you to specify a window on which to operate. You
can specify a window of tuples in the following ways:
• count The number of tuples to include in the window
• delta Waiting until a given attribute of an element in the stream has
changed by a specified delta amount
• time The amount of time, in seconds, to allow the window to fill up
• punctuation The punctuation used to delimit the window
(inserted by a Punctor or some other upstream operator)
In addition to specifying the window, you must specify an expression that
defines how you want the data to be sorted (for example, sort by a given
attribute in the stream). After the window fills up, the sort operator will sort
the tuples based on the element that you specified and then send those tuples
to the output port in sorted order. Then the window fills up again. By default,
Streams sorts in ascending order, but you can specify a sort in descending
order.
Join
As you’ve likely guessed, the Join operator takes two streams, matches the
tuples on a specified condition, and then sends the matches to an output
stream. When a row arrives on one input stream, the matching attribute is
compared to the tuples that already exist in the operating window of the
second input stream to try to find a match. Just as in a relational database,
several types of joins can be used, including inner joins (in which only
matches are passed on) and outer joins (which can pass on one of the stream
tuples even without a match, in addition to matching tuples from both
streams).
Aggregate
The Aggregate operator can be used to sum up the values of a given attri-
bute or set of attributes for the tuples in the window; this operator also relies
on a windowing option to group together a set of tuples. An Aggregate
operator enables groupBy and partitionBy parameters to divide up the
tuples in a window and perform aggregation on those subsets of tuples. You
can use the Aggregate operator to perform COUNT, SUM, AVERAGE, MAX,
MIN, and other forms of aggregation.
Beacon
The Beacon is a useful operator because it’s used to create tuples on the fly. For
example, you can set up a Beacon to send tuples into a stream, optionally rate-
limited by a time period (send a tuple every n tenths of a second) or limited to
a number of iterations (send out n tuples and then stop), or both. The Beacon
operator can be useful for testing and debugging your Streams applications.
Streams Toolkits
In addition to the adapters and operators that are described in the previous
sections, Streams ships with a number of toolkits that enable even faster
application development. These toolkits enable you to connect to specific
data sources and manipulate the data that is commonly found in databases
or Hadoop, perform signal processing on time series data, extract informa-
tion from text using advanced text analytics, score data mining models in
real time, process financial markets data, and much more. Because the
Streams toolkits can dramatically accelerate your time to analysis with
Streams, we cover the Messaging, Database, Big Data, Text, and Mining
Toolkits in more detail here. There are many more toolkits available as part
of the product, such as the TimeSeries, Geospatial, Financial Services, Mes-
saging, SPSS, R-project, Complex Event Processing, Internet, and IBM Info-
Sphere Information Server for Data Integration toolkits. There are also toolkits
freely downloadable from GitHub (https://github.com/IBMStreams), such as
the MongoDB and JSON toolkits.
Solution Accelerators
IBM has also surfaced the most popular streaming data use cases as solution
accelerators. For example, IBM provides a customizable solution accelerator
known as the IBM Accelerator for Telecommunications Event Data Analyt-
ics, which uses Streams to process call detail records (CDRs) for telecommu-
nications. A customizable solution accelerator known as the IBM Accelerator
for Social Data provides analytics for lead generation and brand manage-
ment based on social media.
Use Cases
To give you some insight into how Streams can fit into your environment, in
this section we’ll provide a few examples of use cases where Streams has
provided transformative results. Obviously, we can’t cover every industry in
such a short book, but we think this section will get you thinking and excited
about the breadth of possibilities that Streams technology can offer your
environment.
In telecommunications companies, the quantity of CDRs their IT depart-
ments need to manage is staggering. Not only is this information useful for
providing accurate customer billing, but a wealth of information can be
gleaned from CDR analysis performed in near real time. For example, CDR
analysis can help to prevent customer loss by analyzing the access patterns
of “group leaders” in their social networks. These group leaders are people
who might be in a position to affect the tendencies of their contacts to move
from one service provider to another. Through a combination of traditional
and social media analysis, Streams can help you to identify these individu-
als, the networks to which they belong, and on whom they have influence.
Streams can also be used to power up a real-time analytics processing
(RTAP) campaign management solution to help boost campaign effective-
ness, deliver a shorter time to market for new promotions and soft bundles,
help to find new revenue streams, and enrich churn analysis. For example,
Globe Telecom leverages information gathered from its handsets to identify
the optimal service promotion for each customer and the best time to deliver
it, which has had profound effects on its business. Globe Telecom reduced
the time to market for new services from 10 months to 40 days, significantly
increased sales through real-time promotional engines, and more.
Wrapping It Up
IBM InfoSphere Streams is an advanced analytic platform that allows user-
developed applications to quickly ingest, analyze, and correlate information
as it arrives from thousands of real-time sources. The solution can handle very
high data throughput rates, up to millions of events or messages per second.
InfoSphere Streams helps you to do the following:
• Analyze data in motion Provides submillisecond response times,
allowing you to view information and events as they unfold
• Simplify development of streaming applications Uses an Eclipse-
based integrated development environment (IDE), coupled with a
declarative language, to provide a flexible and easy framework for
developing applications; for non-programmers, there is a web-based
drag and drop interface for building Streams applications
• Extend the value of existing systems Integrates with your applications
and supports both structured and unstructured data sources
In this chapter, we introduced you to the concept of data in motion and
InfoSphere Streams, the world’s fastest and most flexible platform for stream-
ing data. The key value proposition of Streams is the ability to get analytics
to the frontier of the business—transforming the typical forecast into a now-
cast. We’re starting to hear more buzz about streaming data of late, but the
other offerings in this space are limited by capability and capacity and are
often highly immature (one can create a storm of problems with data loss
issues at high volumes). Streams is a proven technology with a large number
of successful deployments that have helped transform businesses into lead-
ers in their peer groups. In a Big Data conversation that seeks to go beyond
the hype, Streams can help you overcome the speed and variety of the data
arriving at your organization’s doorstep—and just as important, you’ll be
able to easily develop applications to understand the data.
8
700 Million Times Faster
Than the Blink of an Eye:
BLU Acceleration
capabilities that make it easy to meet new end-user reporting needs, provision
it on or off premise, and you have an in-memory computing technology that is
fast, simple, and agile—you have what we would call a NextGen database
technology—you have BLU Acceleration.
The first question we always get asked about BLU Acceleration is “What
does BLU stand for?” The answer: nothing. We are sure it’s somewhat related
to IBM’s “Big Blue” nickname, and we like that, because it is suggestive of
big ideas and leading-edge solutions like Watson. The IBM research project
behind BLU was called Blink Ultra, so perhaps that is it. With that in mind,
you might also be wondering what IBM dashDB stands for, since BLU Accel-
eration and SoftLayer play a big role behind the scenes for this managed
analytics cloud service. Don’t let trying to figure out the acronyms keep you
up at night; these technologies are going to let you sleep…like a baby.
The one thing we want you to understand about BLU Acceleration is that
it is a technology and not a product. You are going to find this technology
behind a number of IBM offerings—be they cloud-based services (such as
dashDB) or on premise. For example, making BLU Acceleration technology
available in the cloud is fundamental to empowering organizations to find
more agile, cost-effective ways to deploy analytics.
GigaOM Research, in its May 2013 ditty on the future of cloud computing,
noted that 75 percent of organizations are reporting some sort of cloud plat-
form usage. Make no mistake about it: cloud adoption is happening now. We
think that the coming years will be defined by the hybrid cloud; sure, private
clouds exist today, but so do public ones—ultimately, clients are going to
demand that the two work together cohesively in a hybrid cloud. Gartner
thinks so too. In its report last year on the breakdown of IT expenditures, it predicted
that half of all large enterprises would have a hybrid cloud deployment by
the end of 2017. It's simple: BLU Acceleration is readily available off premise:
through a public cloud platform as a service (PaaS) offering aimed at
developers (Bluemix) and, eventually, as a managed service in a software
as a service (SaaS) model. BLU skies and billowy clouds sound like a
perfect analytics forecast to us!
BLU Acceleration is the DNA behind the IBM Bluemix dashDB service
(formerly known as the Analytics Warehouse service on Bluemix). It enables
you to start analyzing your data right away with familiar tools—in minutes.
The official IBM announcement notes that Bluemix is a PaaS offering based
on the Cloud Foundry open source project that delivers enterprise-level fea-
tures and services that are easy to integrate into cloud applications. We like
to say that Bluemix is a PaaS platform where developers can act like kids in
a sandbox, except that this box is enterprise-grade. You can get started with
this service (along with hundreds of other IBM, non-IBM, and open source
services that let you build and compose apps in no time at all) for free at
https://bluemix.net.
Earlier we alluded to the fact that BLU Acceleration will also be found as
part of a SaaS-based offering. You will often hear this type of usage referred
to as a database as a service (DBaaS) offering or a data warehouse as a service
offering (DWaaS). We envision dashDB being available in the future in mul-
tiple tiers through a DBaaS/DWaaS offering where you contract services that
relate to the characteristics of the run time and how much control you have
over them, their availability commitments, and so on, in a tiered service for-
mat. No matter what you call it, a DBaaS/DWaaS would deliver a managed
(not just hosted like others in this space) turnkey analytical database service.
This service is cloud agile: It can be deployed in minutes with rapid cloud
provisioning, it supports a hybrid cloud model, and there is zero infrastruc-
ture investment required. IBM dashDB is also simple like an appliance—
without the loading dock and floorspace. It is just “load and go,” and there is
no tuning required (you will see why later in this chapter).
The data that you store in a data repository, enriched by BLU Acceleration,
is memory- and columnar-optimized for incredible performance. And more
importantly, when you are talking off premise, the data is enterprise secure.
IBM's SoftLayer-backed as-a-service offerings exceed compliance standards
and come with advanced compliance reporting, patching, and alerting capa-
bilities that put the “Ops” into DevOps. They also provide the option for
single-tenant operations where you can isolate yourself from the “noisy
neighbor” effect. We cover dashDB, Bluemix, noisy neighbors, and other
cloud computing concepts in Chapter 3.
BLU Acceleration technology is available on premise too. This is funda-
mental to IBM's ground-to-cloud strategy, which we think is unique to IBM.
The seamless manner in which it can take some workloads and burst them
onto the cloud without changing the application (and this list of workloads
is ever expanding) is going to be a lynchpin capability behind any organiza-
tion that surges ahead of its peers. For example, suppose that an
structures. IBM recently debuted the DB2 Cancun Release 10.5.0.4 (we will
just refer to it as the DB2 Cancun release), which introduces the concept of
shadow tables. This innovation leverages BLU Acceleration technology directly
on top of OLTP environments, which enables a single database to fulfill both
reporting and transactional requirements—very cool stuff! The DB2 Cancun
release also includes the ability to set up BLU Acceleration in an on-premise
deployment by using DB2’s High Availability and Disaster Recovery (HADR)
integrated clustering software. In this chapter, we don’t dive deeply into the
use of BLU Acceleration in DB2 or dashDB—we talk about the BLU Accel-
eration technology and what makes it so special. Both Informix and DB2
technologies, and a whole lot more of the IBM Information Management
portfolio, are available through IBM’s PaaS and SaaS offerings and can also
be deployed in IaaS environments.
Now think back to the application developers we talked about in Chapter 2—
their quest for agility and how they want continuous feature delivery because
the consumers of their apps demand it. In that same style, you will see most
BLU Acceleration enhancements show up in dashDB before they are available
in a traditional on-premise solution. This is all part of a cloud-first strategy that
is born “agile” and drives IBM’s ability to more quickly deliver Big Data ana-
lytic capabilities to the marketplace.
For example, the IBM PureData System for Analytics, powered by Netezza
technology, includes a suite of algorithms referred to as IBM Netezza Analytics—
INZA. Over time you are going to see these deeply embedded statistical algo-
rithms appear in the BLU Acceleration technology. In the dashDB public cloud
service offerings, these kinds of new capabilities can be delivered more quickly
and in a more continuous fashion when compared to traditional on-premise
release vehicles. Perhaps it’s the case that a number of phases are used to quickly
deliver new capabilities. Netezza SQL compatibility with dashDB is also an
example of a capability that could be continually delivered in this manner.
We’re not going to further discuss this topic in this chapter, although we do
cover the INZA capabilities in Chapter 9. The point we are making is that a
cloud delivery model allows your analytics service to get more capable on a
continuum as opposed to traditional software release schedules. You can see
the benefit for line of business (LOB) here: if the service is managed, you aren’t
applying code or maintenance upgrades; you are simply getting access to a con-
tinually evolving set of analytic capabilities through a service to empower your
business to do new things and make better decisions.
This chapter introduces you to the BLU Acceleration technology. For the
most part, we remain agnostic with respect to the delivery vehicle, whether
you are composing analytic applications in the cloud on Bluemix or running
on premise within DB2. We discuss BLU Acceleration from a business value
perspective, covering the benefits our clients are seeing and also what we’ve
personally observed. That said, we don’t skip over how we got here. There
are a number of ideas—really big ideas—that you need to be aware of to fully
appreciate just how market-advanced BLU Acceleration really is, so we dis-
cuss those from a technical perspective as well.
If the price of memory drops by X percent every year but the amount of data we're trying to store and
analyze increases by Y percent (>X percent), a nondynamic, memory-only solu-
tion isn’t going to cut it—you’re going to come up short. This is why a NextGen
technology needs to avoid the rigid requirement that all of the query’s data has
to fit into memory to experience super-easy, super-fast analytics. Of course, bet-
ter compression can help to get more data into memory, but NextGen analytics
can’t be bound by memory alone; it needs to understand I/O. For this reason, we
like to say that “BLU Acceleration is in-memory optimized not system memory
constrained.”
Although there is a lot of talk about just how fast in-memory databases
are, this talk always centers around system memory (the memory area that
some vendors demand your query data not exceed) compared to disk.
System memory compared to disk is faster—it is way faster, even faster than
solid-state disk (SSD) access. How much faster? About 166,666 times faster!
But today’s CPUs have memory too—various hierarchies of memory—
referred to as L1, L2, and so on. The IBM POWER8 CPU, the nucleus of the
OpenPOWER Foundation, leads all processor architectures in the levels and
size of its CPU caches. CPU memory areas are even faster than system memory!
How fast? If L1 cache were a fighter jet rocketing through the skies, system
memory would be a cheetah in full stride. Not much of a race, is it? A Next-
Gen database has to have prefetch and data placement protocols that are
designed for CPU memory, include system memory, and also have provi-
sions for SSD and plain old spinning disk; it has to make use of all of these
tiers, move data between them with knowledge, purpose, optimization, and
more. BLU Acceleration does this—be it in the cloud through the dashDB
service or in a roll-your-own on-premise software deployment such as DB2.
In fact, BLU Acceleration was built from the ground up to understand and
make these very decisions in its optimizations.
Columnar processing has been around for a while and is attracting
renewed interest, but it has some drawbacks that a NextGen platform must
address. BLU Acceleration is about enhanced columnar storage techniques,
and has the ability to support the coexistence of row-organized and column-
organized tables.
A NextGen database should have what IBM refers to as actionable compres-
sion. BLU Acceleration has patented compression techniques that preserve
order such that the technology almost always works on compressed data.
Seamlessly Integrated
A NextGen in-memory database technology must be simple to use and must
seamlessly integrate with your environment. In fact, on premise or off premise,
BLU Acceleration comes with Cognos software to help you get going faster.
When it comes to DB2, the BLU Acceleration technology is not bolted on;
it has been deeply integrated and is actually part of the product. BLU Accel-
eration is in DB2’s DNA: This is not only going to give you administrative
efficiencies and economies of scale, but risk mitigation as well. Seamless inte-
gration means that the SQL language interfaces that are surfaced to your
applications are the same no matter how the table is organized. It means that
backup and restore strategies and utilities such as LOAD and EXPORT are
consistent. Consider that in DB2, BLU Acceleration is exposed to you as a
simple table object—it is not a new engine, and that’s why we say it is not
bolted on. It is simply a new format for storing table data. Do not overlook
that when you use DB2 with BLU Acceleration, it looks and feels just like the
DB2 you have known for years, except that a lot of the complexity around
tuning your analytic workloads has disappeared. If you compare DB2 with
BLU Acceleration to some other vendor offerings, we are sure that you
will quickly get a sense of how easy it is to use and why it doesn’t leave DBAs
“seeing red.” We should mention that BLU Acceleration even supports some
of the Oracle PL/SQL protocols and data types, which would make porting
applications to any database that surfaces this technology easier too!
Another compelling fact that resonates with clients (and makes our com-
petitors jealous) is that you don’t have to rip and replace on-premise hard-
ware investments to get all the benefits of BLU Acceleration. Some vendors
are suggesting that there is no risk involved in upgrading your applications,
swapping in a new database, retraining your DBAs, and tossing out your
existing hardware to buy very specific servers that are “certified” for their
software. Although some of this might not matter in the cloud, it all matters
on the ground.
Hardware Optimized
The degree to which BLU Acceleration technology is optimized for the
entire hardware stack is a very strong, industry-leading value proposition,
which matters on or off premise. BLU Acceleration takes advantage of the
queries. That’s a number that we feel will pacify our legal team, but prepare
to be amazed.
Let's clarify what we mean by "average"—some clients witness mind-
blowing quadruple-digit performance speedups for individual (typically their most
difficult) queries, but we think it is most important to appreciate the average
speedup of the query set, which is what we are referring to here.
Finally, keep this in mind: If a query that is running fast and within your
service level agreement (SLA) runs faster, that is a nice benefit. However, it is
the query that you have never been able to run before where the real value
resides. We are convinced that you are going to find some of these with BLU
Acceleration. The toughest queries are going to see the most benefit and are
going to give you that jaw-dropping triple- to quadruple-digit performance boost.
The same BCBS member that we mentioned earlier had an analyst drop a
research hypothesis because the query that was used to test the hypothesis ran
for three hours and seemed like it would run "forever," so he just stopped it. When the
same query was loaded into BLU Acceleration, it completed in 10 seconds—
that is 1,080 times faster!
because you are not going to need these objects when you use BLU Accelera-
tion (and you are not enforcing uniqueness because a reliable ETL process
has cleansed and validated the data), you are going to save lots of space. That
said, if you want BLU Acceleration to enforce uniqueness, you can still create
uniqueness constraints and primary keys on BLU Acceleration tables.
Just like performance results, compression ratios will vary. Our personal
experiences lead us to believe that you will see, on average, about a 10-fold
compression savings with BLU Acceleration (our lawyers hate it when we
convey compression savings just like performance results). Later in this
chapter we talk about how BLU Acceleration can achieve amazing compres-
sion ratios, but for now, let’s share a couple of client experiences. Andrew
Juarez, Lead SAP Basis and DBA at Coca-Cola Bottling Consolidated (CCBC),
told us that “in our mixed environment (it includes both row- and column-
organized tables in the same database), we realized an amazing 10- to 25-fold
reduction in the storage requirements for the database when taking into
account the compression ratios, along with all the things I no longer need to
worry about: indexes, aggregates, and so on.” So although baseline compres-
sion with BLU Acceleration was 10-fold, when the CCBC team took into
account all the storage savings from the things that they no longer have to do
(create indexes and so on), they found savings of up to 25-fold!
Just a word or two on compression ratios: The term raw data is used to
describe the amount of data that is loaded into a performance warehouse;
think of it mostly as the input files. The 10-fold compression that CCBC real-
ized was on raw data. When you ask most people how large their warehouses
or data marts are, they will tell you a number that includes performance tun-
ing objects such as indexes. This is not raw data; we call this fully loaded data.
The compression ratios that we discuss in this chapter are based on raw data
(unless otherwise noted, such as the 25-fold savings achieved by CCBC).
Because the difference between raw and fully loaded data can be quite
astounding (sometimes double the raw data amount or more), ensure that
you clarify what’s being talked about whenever having discussions about the
size of a database for any purpose, not just compression bragging rights.
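A quick back-of-the-envelope calculation shows why the baseline matters. The numbers below are hypothetical; they simply mirror the roughly 10-fold raw-data compression discussed in this chapter and the doubling effect that indexes, aggregates, and other tuning objects can have on a fully loaded footprint:

raw_tb          = 10.0              # raw data: roughly the size of the input files
index_overhead  = 2.0               # indexes, aggregates, and so on can double the footprint
fully_loaded_tb = raw_tb * index_overhead

compressed_tb = raw_tb / 10         # assume the ~10-fold compression on raw data

print(f"Savings vs. raw data:          {raw_tb / compressed_tb:.0f}-fold")           # 10-fold
print(f"Savings vs. fully loaded data: {fully_loaded_tb / compressed_tb:.0f}-fold")  # 20-fold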
Mike Petkau is the director of database architecture and administration at
TMW Systems—an industry-leading provider of enterprise management
software for the surface transportation services industry, including logistics,
freight, trucking, and heavy-duty repair and maintenance. His team loaded
one of TMW Systems’ largest customer databases into BLU Acceleration and
found that it produced “astounding compression results.” TMW Systems
found compression savings ranging from 7- to 20-fold when compared with
its uncompressed tables. In fact, one of its largest and most critical operations
tables saw a compression rate of 11-fold, and these measurements are only on
raw data; in other words, they do not even account for the performance-
tuning objects that they no longer need. We like how Mike summed up his
BLU Acceleration experience: “These amazing results will save us a great
deal of space on disk and memory.”
Don’t forget: BLU Acceleration has a wide spectrum of run-time optimiza-
tions for your analytics environment, and it has a holistic approach that we
have yet to see any other vendor adopt. For example, the more concurrency
you need for your analytic database, the more temporary (temp) space you’re
going to need. If a table spills into temp storage, BLU Acceleration might
choose to automatically compress it if it’s beneficial to the query. What’s
more, if BLU Acceleration anticipates that it will need to reference that large
temp table again, it might compress it as well. So, although temp space is a
definite requirement in an analytics environment, we want you to know that
those unique benefits are not even accounted for in the experience of our
clients. We don’t know of another in-memory database technology that can
intelligently autocompress temp space—a key factor in analytic performance.
Since the DB2 Cancun Release 10.5.0.4 and its BLU Acceleration shadow
tables became generally available, you can have a single database for both
your OLTP and OLAP queries, making the decision to use DB2 with BLU
Acceleration for on premise more compelling than ever!
BLU Acceleration simplifies and accelerates the analysis of data in sup-
port of business decisions because it empowers database administrators
(DBAs) to effortlessly transform poorly performing analytic databases into
super-performing ones, while at the same time insulating the business from
front-end application and tool set changes. From a DBA’s perspective, it is
an instant performance boost—just load up the data in BLU Acceleration
and go…analyze. Of course it empowers LOB (the future buyers of IT) to
self-serve their BI, spinning up analytic data marts with swipes and gestures
in minutes through the managed dashDB service. And don’t forget how
Bluemix offers developers the ability to effortlessly stitch analytic ser-
vices into their apps—no DBA required!
When the BLU Acceleration technology first made its debut, one of our
early adopters, who works for one of the largest freight railroad networks in
North America, told the press “…I thought my database had abended because
a multimillion-row query was processed so fast.” Think of that impact on your
end users. BLU Acceleration makes your business agile too. Typical data marts
require architectural changes, capacity planning, storage choices, tooling deci-
sions, optimization, and index tuning; with BLU Acceleration, the simplicity of
create, load, and go becomes a reality—it’s not a dream.
There’s been a lot of integration work with Cognos Business Intelligence
(Cognos BI) and BLU Acceleration. For example, deploying a Cognos-based
front end is done by simply modeling business and dimensional character-
istics of the database and then deploying them for consumption by using
Cognos BI’s interactive exploration, dashboards, and managed reporting.
BLU Acceleration technology flattens the time-to-value curve for Cognos
BI (or any analytics tool sets for that matter) by decreasing the complexity
of loading, massaging, and managing the data at the data mart level. This
is one of the reasons why you will find BLU Acceleration behind the IBM
Watson Analytics cloud service that we talked about in Chapter 1. We think
that perhaps one of the most attractive features in BLU Acceleration and
Cognos is that the Cognos engine looks at a BLU Acceleration database just
like it does a row-organized database. Because they are both just tables in a
data service to Cognos, a Cognos power user can convert underlying row-
organized tables to BLU Acceleration without changing anything in the
Cognos definitions; that’s very cool!
Almost all LOBs find that although transactional data systems are sufficient
to support the business, the data from these online transaction processing
(OLTP) or enterprise resource planning (ERP) systems is not quickly surfaced
to their units as actionable information; the data is “mysterious” because it is
just not organized in a way that would suit an analytical workload. This quan-
dary gives way to a second popular use case: creating local logical or separate
physical marts directly off transactional databases for fast LOB reporting.
Because BLU Acceleration is so simple to use, DBAs and LOBs can effortlessly
spin up analytics-oriented data marts to rapidly react to business require-
ments. For example, consider a division’s CMO who is sponsoring a certain
marketing promotion. She wants to know how it is progressing and to analyze
the information in a timely manner. BLU Acceleration empowers this use case:
Data does not need to be indexed and organized to support business queries.
As well, the data mart can now contain and handle the historical data that’s
continuously being spawned out of a system of record, such as a transactional
database.
As you can see, BLU Acceleration is designed for data mart–like analytic
workloads that are characterized by activities such as grouping, aggregation,
range scans, and so on. These workloads typically process more than 1 per-
cent of the active data and access less than 25 percent of the table’s columns
in a single query when it comes to traditional on-premise repositories—but
BLU Acceleration isn’t just limited to this sweet spot. A lot of these environ-
ments are characterized by star and snowflake schemas and will likely be top
candidates to move to BLU Acceleration, but these structures are not required
to benefit from this technology.
With the DB2 Cancun release and its shadow tables, you no longer need to
spin off physical data marts, unless you want to, of course—and there are still
really good reasons to do so—more on that in a bit. Be careful not to overlook
the incredible opportunity if you are an on-premise BLU Acceleration user.
You really do have the best of both worlds with this offering. BLU Acceleration
shadow tables enable you to have your operational queries routed seamlessly
to your collocated row-organized tables and your analytical queries routed to
your collocated logical marts through shadow tables in the same database!
do not have to do instead of what you have to do. How cool is that? First,
there’s no physical design tuning to be done. In addition, operational memory
and storage attributes are automatically configured for you. Think about the
kinds of things that typically reside in a DBA’s toolkit when it comes to per-
formance tuning and maintenance. These are the things you are not going to
do with BLU Acceleration, even if you are not using it through a hosted man-
aged service such as dashDB. You are not going to spend time creating
indexes, creating aggregates, figuring out what columns are hot, implement-
ing partitioning to get the hot data into memory, reclaiming space, collecting
statistics, and more. This is why BLU Acceleration is so appealing. None of
this is needed to derive instant value from the BLU Acceleration technology—
be it on premise or off premise. From a DBA’s perspective, you “load and go”
and instantly start enjoying performance gains, compression savings, and the
ability to do things that you could never do before. Now compare that to
some of our competitors that make you face design decisions like the ones we
just outlined.
Figure 8-1 A representation of how BLU Acceleration encodes and packs the CPU register for optimal performance and compression
said to be ordered (01 is less than 10 because 20000 is less than 50000). As
a result, the BLU Acceleration query engine can perform a lot of predicate
evaluations without decompressing the data. A query such as select * …
where C1 < 50000 using this example becomes select * … where C1
< 10. In our example, BLU Acceleration would filter out the values that are
greater than 50000 without having to decompress (materialize) the data.
Quite simply, the fact that BLU Acceleration has to decompress only qualify-
ing rows is a tremendous performance boost. Specifically, note how instead
of decompressing all the data to see whether it matches the predicate (<
50000), which could potentially require decompressing billions of values,
BLU Acceleration will simply compress the predicate into the encoding space
of the column. In other words, it compresses 50000 to 10 and then per-
forms the comparisons on the compressed data—using smaller numbers
(such as 10 compared to 50000) results in faster performance for compari-
son operations, among others. That’s just cool!
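The following toy Python sketch captures the essence of order-preserving encoding and predicate evaluation on encoded values. The column values are made up, and the real encoding is a far more sophisticated, bit-level scheme than a simple sorted dictionary:

# Build an order-preserving dictionary: because the distinct values are sorted
# before codes are assigned, comparing codes gives the same answer as
# comparing the original values.
column_values = [20000, 50000, 75000, 20000, 50000]
dictionary = {v: code for code, v in enumerate(sorted(set(column_values)))}
encoded = [dictionary[v] for v in column_values]   # 20000 -> 0, 50000 -> 1, 75000 -> 2

# Evaluate "C1 < 50000" directly on the encoded column: compress the predicate
# constant into the column's encoding space and compare codes instead.
predicate_code = dictionary[50000]
qualifying = [code for code in encoded if code < predicate_code]

# Only the qualifying values ever need to be decoded (materialized).
decode = {code: v for v, code in dictionary.items()}
print([decode[c] for c in qualifying])             # [20000, 20000]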
The final tenet of this big idea pertains to how the data is handled on the
CPU: BLU Acceleration takes the symbols’ bits and packs them as tightly as
possible into vectors, which are collections of bits that match (as closely as
possible) the width of the CPU register; this is what you see on the right side
of Figure 8-1 (although the figure is intentionally oversimplified and intended
to not reveal all of our secrets, we think you will get the gist of what the tech-
nology is doing). This is a big deal because it enables BLU Acceleration to
flow the data (in its compressed form) into the CPU with maximum efficiency.
It’s like a construction yard filling a dump truck with dirt that is bound for
some landfill—in this case the dirt is the data and the landfill is a data mart.
Is it more efficient to send the dump truck away loaded to the brim with dirt
or to load it halfway and send it on its way? You’d be shocked at how many
of the competing technologies send trucks half full, which is not only ineffi-
cient, it’s not environmentally friendly either.
To sum up, all of the big idea components in this section will yield a syn-
ergistic combination of effects: better I/O because the data is smaller, which
leads to more density in the memory (no matter what hierarchy of memory
you are using), optimized storage, and more efficient CPU because data is
being operated on without decompressing it and it is packed in the CPU “reg-
ister aligned.” This is all going to result in much better—we say blistering—
performance gains for the right workloads.
yet another differentiator between BLU Acceleration and some other in-
memory database technologies that are available in today’s marketplace. On
the right side of this figure, we show four column values being processed at
one time. Keep in mind this is only for illustration purposes because it is
quite possible to have more than four data elements processed by a single
instruction with this technology.
Contrast the left side (without SIMD) of Figure 8-2 with the right side
(with SIMD), which shows an example of how predicate evaluation process-
ing would work if BLU Acceleration were not engineered to automatically
detect, exploit, and optimize SIMD technology, or implement big idea #2 to
optimally encode and pack the CPU register with data. In such an environ-
ment, instead of optimally packing the register width, things just happen the
way they do in so many competing technologies: Each value is loaded one at
a time into its own register for predicate evaluation. As you can see on the left
side of Figure 8-2, other data elements queue up for predicate evaluation,
each requiring distinct processing cycles that waste resources by having to
schedule the execution of that work on the CPU, context switches, and so on.
Figure 8-2 Comparing predicate evaluations with and without SIMD in a CPU packing-optimized processing environment such as BLU Acceleration
In summary, big idea #3 is about multiplying the power of the CPU for the
key operations that are typically associated with analytic query processing,
such as scanning, joining, grouping, and arithmetic. By exploiting low-level
instructions that are available on modern CPUs and matching that with opti-
mizations for how the data is encoded and packed on the CPU’s registers
(big idea #2), BLU Acceleration literally multiplies the power of the CPU: A
single instruction can get results on multiple data elements with relatively
little run-time processing. The big ideas discussed so far compress and
encode the data and then pack it into a set of vectors that matches the width
of the CPU register as closely as possible. This gives you the biggest “bang”
for every CPU “buck” (cycles that the on-premise or off-premise servers on
which the technology is running consume). BLU Acceleration literally
squeezes out every last drop of CPU that it can wherever it is running.
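As a rough analogy in Python, NumPy's vectorized comparison illustrates the difference between evaluating one value per instruction and applying one operation across many packed data elements at once. This is only a stand-in for the idea, not a description of how BLU Acceleration or the underlying SIMD instructions are actually implemented:

import numpy as np

values = np.array([20000, 50000, 75000, 30000, 60000, 10000], dtype=np.int32)

# Scalar style: one comparison per loop iteration (one value at a time).
scalar_result = [v < 50000 for v in values]

# Vectorized style: the comparison is applied to whole chunks of packed
# elements at once, much as SIMD applies a single instruction to multiple data.
vector_result = values < 50000

print(scalar_result)            # [True, False, False, True, False, True]
print(vector_result.tolist())   # the same answer, computed chunk-wise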
of clients that have tested these claims and the performance issues that arise
when transactional activity occurs on their columnar storage format.)
Essentially, this big idea for BLU Acceleration is to bring all of the typical
benefits of columnar stores (such as I/O minimization through elimination,
improved memory density, scan-based optimizations, compelling compres-
sion ratios, and so on) to modern-era analytics. What makes BLU Accelera-
tion so special is not just that it is a column store; indeed, many claim to be
and are…just that. Rather, it is how the BLU Acceleration technology is
implemented with the other big ideas detailed in this chapter.
With all the focus on columnar these days, we took a leap of faith and
assumed that you have the gist of what a column-organized table is, so we
won’t spend too much time describing it here. To ensure that we are all on the
same page, however, Figure 8-3 shows a simplified view of a row-organized
table (on the left) and a column-organized table (on the right). As you can
see, a column-organized table stores its data by column instead of by row.
This technique is well suited to warehousing and analytics scenarios.
With a columnar format, a single page stores the values of just a single
column, which means that when a database engine performs I/O to retrieve
data, it performs I/O for only the columns that satisfy the query. This can
save a lot of resources when processing certain kinds of queries, as well as
sparse or highly repeating data sets.
Figure 8-3 A simplified view of a row-organized table (left) and a column-organized table (right)
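A simple calculation, with hypothetical numbers and ignoring compression and page overhead, shows why this matters for a query that touches one column out of many:

# Hypothetical table: 100 columns, 1 million rows, 4 bytes per value.
columns, rows, bytes_per_value = 100, 1_000_000, 4

row_store_io    = rows * columns * bytes_per_value   # pages hold entire rows
column_store_io = rows * 1 * bytes_per_value         # the query reads a single column

print(f"Row-organized I/O:    {row_store_io / 1e6:.0f} MB")     # 400 MB
print(f"Column-organized I/O: {column_store_io / 1e6:.0f} MB")  #   4 MB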
the technology is very effective at keeping the data in memory, which allows
for efficient reuse and minimizing I/O; this is true even in cases where the
working set and individual scans are larger than the available memory. BLU
Acceleration’s scan-friendly memory caching is an automatically triggered
cache-replacement algorithm that provides egalitarian access, and it is some-
thing new and powerful for analytics. How that translates for your business
users who don’t really care about technology details is that performance won’t
“fall off the cliff” when a query’s data requirements are bigger than the amount
of memory that is available on the provisioned server—be it on the ground or
in the cloud.
The analytics-optimized page replacement algorithm that is associated with
BLU Acceleration assumes that the data is going to be highly compressed
and will require columnar access and that it is likely the case that all of this
active data (or at least 70 to 80 percent of it) is going to be put into memory.
When surveying our clients, we found that the most common memory-to-
disk ratios were about 15 to 50 percent. Assuming a conservative 10-fold com-
pression rate, there is still a high probability that you are going to be able to
land most (if not all) of your active data in memory when you use a data
store with BLU Acceleration technology. However, although we expect most,
if not all, of the active data to fit in memory for the majority of environments,
we don’t require it, unlike the technology sold by a certain ERP vendor-
turned-database provider. When DB2 accesses column-organized data, it
will automatically use its scan-friendly memory-caching algorithm to decide
which pages should stay in memory to minimize I/O, as opposed to using
the ubiquitous least recently used (LRU) algorithm, which is good for OLTP
but not optimized for analytics.
The second aspect of dynamic in-memory processing is a new prefetch
technology called dynamic list prefetching. You will find that the prefetching
algorithms for BLU Acceleration have been designed from scratch for its
columnar parallel vector-processing engine.
These algorithms take a different approach because BLU Acceleration
does not have indexes to tell it what pages are interesting to the query (list
prefetching), which would be a common case with a traditional database
technology. Of course, almost any database could simply prefetch every page
of every column that appears in a query, but that would be wasteful. BLU
Acceleration addresses this challenge with an innovative strategy to prefetch
only a subset of pages that are interesting (from a query perspective), without
the ability to know far in advance what they are. We call this dynamic list
prefetching because the specific list of pages cannot be known in advance
through an index.
To sum up, remember that one of the special benefits of BLU Acceleration
in comparison to traditional in-memory columnar technologies is that per-
formance doesn’t “fall off the cliff” if your data sets are so large that they
won’t fit into memory. If all of your data does fit into memory, it is going to
benefit you, but in a Big Data world, that is not always going to be the case,
and this important BLU Acceleration benefit should not be overlooked.
Dynamic memory optimization and the additional focus on CPU cache
exploitation truly do separate BLU Acceleration from its peers.
Figure 8-4 Looking at BLU Acceleration's seven big ideas from a compute resources perspective: the hardware stack
service and without any idea what BLU Acceleration is, the analyst com-
poses the following query: select count(*) from LOYALTYCLIENTS
where year = '2012'.
The BLU Acceleration goal is to provide the analyst with subsecond response
times from this single nonpartitioned, 32-core server without creating any
indexes or aggregates, partitioning the data, and so on. When we tabled this
scenario to a group of LOB users (without mentioning the BLU Acceleration
technology), they laughed at us. We did the same thing in front of a seasoned
DBA with one of our biggest clients, and she told us, “Impossible, not without
an index!” Figure 8-5 shows how the seven big ideas worked together to take an
incredible opportunity and turn it into something truly special.
Figure 8-5 How some of the seven big BLU Acceleration ideas manifest into incredible performance opportunities
The example starts with 10TB of raw data (1) that is sitting on a file system
waiting to be loaded into an analytics repository with BLU Acceleration.
Although we have shared with you higher compression ratio experiences
from our clients, this example uses the average order of magnitude (10-fold)
reduction in the raw data storage requirements that most achieve with BLU
Acceleration's encoding and compression techniques. This leaves 1TB (2) of
data. Note that there are not any indexes or summary tables here. In a typical
data warehouse, 10TB of raw data is going to turn into a 15- to 30-TB foot-
print by the time these traditional kinds of performance objects are taken into
account. In this example, the 10TB of data is raw data. It is the size before the
data is aggregated, indexed, and so on. When it is loaded into a BLU Accel-
eration table, that 10TB becomes 1TB.
The analyst’s query is looking only for loyalty members acquired in 2012.
YEAR is just one column in the 100-column LOYALTYCLIENTS table. Because
BLU Acceleration needs to access only a single column in this table, you can
divide the 1TB of loaded data by 100. Now the data set is down to 10GB (3)
of data that needs to be processed. However, BLU Acceleration is not fin-
ished applying its seven big ideas yet! While it is accessing this 10GB of data
using its columnar algorithms, it also applies the data-skipping big idea to
skip over the other nine years of data in the YEAR column. At this point, BLU
Acceleration skips over the nonqualifying data without any decompression
or evaluation processing. There is now 1GB (4) of data that needs to be
worked on: BLU Acceleration is left looking at a single column of data and,
within that column, a single discrete interval. Thanks to scan-friendly mem-
ory caching, it is likely that all of that data can be accessed at main memory
speeds (we are keeping the example simple by leaving out the CPU memory
caches that we have referenced throughout this chapter; assuming you
believe that high-speed bullet trains are faster than a sloth, you can assume
things could go even faster than what we are portraying here). Now BLU
Acceleration takes that 1GB and parallelizes it across the 32 cores on the
server (5), with incredible results because of the work that was done to
implement the fourth big idea: parallel vector processing. This means that
each server core has work to do on only about 31MB of data (1GB is roughly
1,000MB, and 1,000MB / 32 cores works out to about 31.25MB per core). Don't overlook this important point that might not be
obvious: BLU Acceleration is still operating on compressed data at this point,
and nothing has been materialized. It is really important to remember this,
because all of the CPU, memory density, and I/O benefits still apply.
BLU Acceleration now applies the other big ideas, namely, actionable
compression (operating on the compressed data, carefully organized as vec-
tors that automatch the register width of the CPU wherever the technology is
running) and leveraging SIMD optimization (6). These big ideas take the
required scanning activity to be performed on the remaining ~32MB of data per core
and make it run several times faster than on traditional systems. How fast?
We think you already know the answer to that one—it depends. The com-
bined benefits of actionable compression and SIMD can be profound. For the
sake of this example, you can assume that the speedup over traditional row-
based systems is four times faster per byte (we think it is often much higher,
but we are being conservative—or rather, being told to be). With this in mind,
the server that is empowered by BLU Acceleration has to scan only about
8MB (32MB / 4 speedup factor = 8MB) of data compared to a traditional
system. Think about that for a moment. Eight megabytes is about the size of
a high-quality digital image that you can capture with your smartphone. A
modern CPU can chew through that amount of data in less than a second…
no problem. The end result? BLU Acceleration took a seemingly impossible
challenge on 10TB of raw data and was able to run a typical analytic query on
it in less than a second by using the application of seven big ideas and BLU
Acceleration technology (7).
incur a lot of overhead when rows are inserted, updated, or deleted because
all of the supporting analytical indexes have to be maintained. Of course, if
you did not have those indexes, then reporting requirements would result in
full-table scans and kill the performance of your analytic workloads and put
at risk the OLTP transactions that require seconds-or-less performance.
Because of these trade-offs, IT shops typically take data from their transac-
tional system and extract, transform, and load that data into some kind of
analytical reporting system that is exposed to the LOB. The disadvantage of
this approach is that it can often require a complex ETL setup and massive
data movement between different systems. There are data time lags (differ-
ent degrees of freshness of the data on the OLAP system as compared to the
real-time OLTP system), and the complexity of managing multiple systems
can be a nightmare.
Figure 8-6 shows a simplified high-level architectural view of a traditional
analytical environment. You can see in this figure that an OLAP database is
sourced from an OLTP database and has periodic refresh cycles to bring the
reporting and analytics data up to date.

Figure 8-6 A traditional analytic environment where data from a transactional source is moved to another database for reporting and analytics purposes
Although this common approach is effective, having different data in two
systems and needing to refresh the data daily has its disadvantages, as men-
tioned earlier. You can see how this environment could get wildly complex if
we expanded the OLAP database to tens if not hundreds of marts (quite typ-
ical in a large enterprise) for LOB reporting. Suppose there is a schema
change in the OLTP database; that would create an enormous trickle-down
change from the ETL scripts and processes to the OLAP database schema and
into the reports, which would require effort and introduce risk.
Clients have been trying to figure out ways to offer reporting and analyt-
ics on OLTP databases without the risk profile or trade-offs of the past. We
believe that the “right” answer is the BLU Acceleration shadow tables that
had their debut in the DB2 Cancun release.
Before we introduce you to shadow tables, we want to note that you will
likely need both approaches for LOB reporting and analytics. Think about the
challenge at hand: How does IT provision deep analytical reporting environ-
ments with historical data to LOB but at the same time give them access to the
real-time data for real-time reporting? And in a self-service manner too?
Reporting on a single transactional system is great for operational report-
ing. But think about the kinds of data that you need in order to explore higher-level
domain analytics versus "What is the run-rate of our inventory levels hour to
hour?” If you want to empower LOB for tactical or strategic work, you are
more than likely going to want to work with a 360-degree view of the prob-
lem domain, and that’s likely to require bringing together data from multiple
sources. The point we are trying to make is that the inability to get
real-time operational reports has frustrated IT for years, and BLU
Acceleration on premise has a solution for that. But the coordination of sub-
ject domains from across the business is not going away either because you
are not going to keep all the data in your OLTP system, as some of our com-
petitors (who ironically have data warehousing solutions) are suggesting.
Shadow tables deliver operational reporting capabilities to your transac-
tional environment from a single database, which addresses the shortcom-
ings of most current environments. As of the time that this book was pub-
lished, this capability was available only for DB2 with BLU Acceleration; if
you are leveraging BLU Acceleration off premise, it is not available…yet.
Figure 8-7 A high-level look at how BLU Acceleration shadow tables operate in an on-premise DB2 environment
Shadow tables are maintained by a set of services that are part of the DB2
offering. These services asynchronously replicate SQL change activities that
were applied to the source table (where the transactional workload is run-
ning) to the BLU Acceleration shadow table (where the analytics are running).
By default, all applications access the source transactional tables. Queries are
transparently routed to either the source table or the shadow table, depending
on the workload and certain business rules that you apply to the analytics
environment. Figure 8-7 shows an example of DB2 with BLU Acceleration
shadow tables.
The DB2 optimizer decides, at query optimization time, which table to
execute the SQL workload against: the DB2 table or the BLU Acceleration shadow
table. This decision is based on two factors: the cost of the SQL statement and
the business-defined latency rules. Although an administrator might set up a
shadow table to refresh its data at specific intervals, it is the business-defined
latency that determines whether the shadow table is used.
You can define a latency-based business rule that is used to prevent appli-
cations from accessing the shadow tables when the actual latency is beyond
the user-defined limit. This enables you to effectively control the “freshness”
of the data that your analytical workloads will access. You can imagine how
beneficial these governance rules are. Perhaps a report calculates the run-rate
costs that are associated with a manufacturing line’s reject ratio KPI; such a
calculation might need to be performed on data that is not older than a cer-
tain number of seconds. For example, it could be the case that data that is
older than 10 seconds is considered too stale for this KPI as defined by the
LOB. If the shadow table’s required data was refreshed within the last 10
seconds when the SQL statement executes, the optimizer transparently
routes the query to the shadow table.
Let’s assume that the acceptable latency was set to 5 seconds and the data
was last updated 10 seconds ago. In this case, the query will be routed to the
row store to access the freshest data for the report. If a report hits the row
table, you have access to all the services and capabilities of that technology to
manage the query. For example, the time-sensitive query could “waterfall”
its resource consumption such that it gets 30 seconds of exclusive access to
unlimited resources; then for the next 30 seconds, it gradually is forced to
decrease its consumption of the various compute resources available on the
system until the query completes. Remember, this is all transparent to the
end user.
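Conceptually, the latency-based part of the routing decision looks like the following Python sketch. This is only an illustration of the rule described above; the real decision is made by the DB2 optimizer and also weighs the cost of the SQL statement, and the function and parameter names here are invented:

import time

def route_query(is_analytic, latency_limit_s, last_refresh_ts, now=None):
    # A transactional query always goes to the row-organized source table.
    # An analytic query may use the column-organized shadow table only when
    # the replicated data is fresher than the business-defined latency limit;
    # otherwise it falls back to the row store to see the freshest data.
    now = time.time() if now is None else now
    if not is_analytic:
        return "row-organized source table"
    staleness = now - last_refresh_ts
    if staleness <= latency_limit_s:
        return "column-organized shadow table"
    return "row-organized source table"

now = time.time()
print(route_query(True, latency_limit_s=10, last_refresh_ts=now - 8, now=now))   # shadow table
print(route_query(True, latency_limit_s=5, last_refresh_ts=now - 10, now=now))   # row store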
As previously noted, you can set these business rules at the connection
level, but you can granularly set them at the statement level too. For exam-
ple, perhaps the same reporting application also maintains a dashboard with
a KPI that measures the mechanical fluid level on the production line; the
data behind this KPI might have business relevance that is measured in min-
utes or even hours.
We have mentioned the transparency of this solution. From an application
point of view, this transparency is key. Both transaction workloads and ana-
lytical queries will reference the same base row-organized table names. There
is no change to the application, nor to the SQL statements. The database
determines and handles query routing as described earlier. This makes using
BLU Acceleration shadow tables easy for existing DB2 environments. The
advantage here is that OLTP queries get the great DB2 OLTP performance,
and analytical queries get the great BLU Acceleration performance. Super
Analytics, Super Easy.
Wrapping It Up
In this chapter, we introduced you to one of the “crown jewels” in the IBM
Information Management portfolio: BLU Acceleration. We talked about
seven big ideas that are the inspirations behind the BLU Acceleration tech-
nology. We took some time to articulate how BLU Acceleration is a technol-
ogy that can be provisioned as a service off premise (hosted or managed as a
Bluemix service or in dashDB) or on premise. We also talked about how this
technology can be found deeply embedded but not exposed within IBM
technologies such as Informix, as well as deeply embedded and surfaced,
such as in DB2. Finally, we introduced you to the latest BLU Acceleration
innovation, shadow tables, which you can find in the DB2 Cancun Release
10.5.0.4.
Many clients have noticed the transformative opportunity that BLU Accel-
eration technology can have on their business. For example, LIS.TEC’s
Joachim Klassen observed one of the key benefits of BLU Acceleration when
he noted, “Even if your data does not completely fit into memory, you still
have great performance gains. In the tests we ran we were seeing queries run
up to 100 times faster with BLU Acceleration.” It’s a very different approach
than what’s been taken by some other vendors. After all, if the price of
memory drops by about 30 percent every 18 months yet the amount of data
grows by 50 percent, and in a Big Data world data is being used to move from
transactions to interactions, you’re not going to be able to fit all your data in
memory—and that’s why BLU Acceleration is so different.
We shared the experiences of only a few of the clients who are delighted
by the BLU Acceleration technology from a performance, simplicity, com-
pression, and, most of all, opportunity perspective. Handelsbanken, one of
the more secure and profitable banks in the world, saw some of its risk insight
queries speed up 100-fold with no tuning, and was actively seeing the effect
that BLU Acceleration had on its queries within six hours of downloading
the technology!
Paul Peters is the lead DBA at VSN Systemen BV—a builder of high-density,
high-volume telecom and datacom applications. VSN Systemen first started
using BLU Acceleration when it debuted two years ago and was quite pleased
with the 10-fold performance improvements and 10-fold compression ratios.
As part of a set of clients who jumped on to the DB2 Cancun release and its
BLU Acceleration shadow tables, VSN Systemen was pleased with the ability
“to allow our clients to run reports directly on top of transactional tables. The
results are delighting my end users, and we don’t see any impact to our trans-
actional performance,” according to Peters.
Ruel Gonzalez from DataProxy LLC notes that in the “DB2 Cancun Release
10.5.0.4, shadow tables were easily integrated into our system. This allows
our transactional and analytic workloads to coexist in one database, with no
effect on our transactional workloads. We get the extreme performance of
BLU Acceleration while maintaining our key transactional execution!”
You are going to hear a lot of chest thumping in the marketplace around
in-memory columnar databases, some claiming the ability to run both OLTP and OLAP applications, and more. Although there is no question that we did
some chest thumping of our own in this chapter, we want to invite you to ask
every vendor (including IBM) to put their money where their mouth is. We
invite you to try it for yourself. It won’t take long; you just load and go! In
fact, things are even easier than that... just dashDB, load, and go!
9
An Expert Integrated System
for Deep Analytics
Traditional systems relied on the data warehouse designers to guess query patterns and retrieval
needs up front so that they could tune the system for performance. This not
only impacted business agility in meeting new reporting and analytics
requirements, but also required significant manual effort to set up, optimize,
maintain, tune, and configure the data warehouse tiers. As a result, collec-
tively, these systems became expensive to set up and manage and brutal to
maintain.
The IBM PureData System for Analytics, powered by Netezza technology,
was developed from the whiteboard to the motherboard to overcome these
specific challenges. (Since this chapter talks about the history of the Netezza
technology that is the genesis for the IBM PureData System for Analytics, we
will refer to its form factor as an appliance and the technology as Netezza for
most of this chapter.) In fact, it’s fair to say that Netezza started what became
the appliance revolution in data warehousing by integrating the database,
processing, analytics, and storage components into a flexible, compact, pur-
pose-built, and optimized system for analytical workloads. This innovative
platform was built to deliver an industry-leading price-performance ratio
with appliance simplicity. As a purpose-built appliance for high-speed Big
Data analytics, its power comes not from the most powerful and expensive
components available in the current marketplace (which would spike the
slope of its cost-benefit curve) but from how the right components can be
assembled to work together in perfect harmony to maximize performance. In
short, the goal wasn’t to build a supercomputer, but rather an elegantly
designed system to address commonplace bottlenecks with the unequalled
ability to perform complex analytics on all of your data. Netezza did this by
combining massively parallel processing (MPP) scalability techniques,
highly intelligent software, and multicore CPUs with Netezza’s unique hard-
ware acceleration (we refer to this as Netezza’s secret sauce) to minimize I/O
bottlenecks and deliver performance that much more expensive systems
could never match or even approach—all without the need for any tuning.
Netezza didn’t try to catch up to the pace of analytics-based appliance
innovation—it set it. Unlike some Exa-branded offerings in the marketplace
that take an old system with known analytic shortcomings and attempt to
balance it with a separate storage tier that requires further database licensing
and maintenance, the Netezza technology was not inspired by the “bolt-on”
approach. Netezza was built from the ground up, specifically for running analytical workloads. By filtering data as it streams off disk, the appliance spares the downstream CPU, memory, and network from having to process superfluous data; this produces a significant multiplier effect on system performance.
The NYSE Euronext is a Euro-American corporation that operates multiple
securities exchanges around the world, most notably the New York Stock
Exchange (NYSE) and Euronext. Its analysts track the value of a listed com-
pany, perform trend analysis, and search for evidence of fraudulent activity. As
you can imagine, their algorithms perform market surveillance and analyze
every transaction from a trading day, which translates into full table scans on
massive volumes of data. As is the case with many enterprises, NYSE Euron-
ext’s traditional data warehouse was moving data back and forth between stor-
age systems and its analytic engine; it took more than 26 hours to complete
certain types of processing! How did the company address this challenge? It
chose Netezza technology and reduced the time needed to access business-
critical data from 26 hours to 2 minutes because analytics were getting pro-
cessed closer to the source.
The appliance is built around a design that simply scales from a few hundred gigabytes to petabytes of user
data for query. In fact, the system has been designed to be highly adaptable
and to serve the needs of different segments of the data warehouse and analyt-
ics market. For example, aside from being a dedicated deep analytics appli-
ance, Netezza can execute MapReduce routines (the original programming framework for Hadoop that we covered in Chapter 6) right in the database, allowing you to leverage skills across the information management architecture we introduced you to in Chapter 4.
T-Mobile is a terrific example that illustrates the scalability of Netezza
technology. Every day it processes 17+ billion events, including phone calls,
text messages, and data traffic, over its networks. This translates to upward
of 2PB of data that needs to be crunched. T-Mobile needed a Big Data solu-
tion that could store and analyze multiple years of call detail records (CDRs)
containing switch, billing, and network event data for its millions of sub-
scribers. T-Mobile wanted to identify and address call network bottlenecks
and to ensure that quality and capacity would be provisioned when and
where they are needed.
Netezza proved to be the right technology for T-Mobile to manage its mas-
sive growth in data. More than 1,200 users access its Netezza system to ana-
lyze these events and perform network quality of experience (QoE) analytics,
traffic engineering, churn analysis, and dropped session analytics, as well as
voice and data session analytics. Since deploying Netezza, T-Mobile has seen data warehouse administrative work all but evaporate compared to its previous solution, which had ironically been running at its limit, and has also
been able to reduce tax and call-routing fees by using the greater volume of
granular data to defend against false claims. To top it off, T-Mobile has been
able to increase call network availability by identifying and fixing bottle-
necks and congestion issues whenever and wherever they arise.
Figure 9-1 shows a single-rack Netezza system. As you can see, it has two
high-end rack-mounted servers called hosts, which function both as the
external interface to the Netezza appliance and the controller for the MPP
infrastructure. The hosts are connected through an internal network fabric to
a set of snippet blades (we’ll often refer to them simply as S-blades), where the
bulk of the data processing is performed. The S-blades are connected through
a high-speed interconnect to a set of disk enclosures where the data is stored
in a highly compressed format.
If any component fails, be it the host, disk, or S-blade, recovery is automatic.
Thus, the full range of automatic failure detection and recovery options dem-
onstrates the advantages of an appliance approach. Netezza has built-in
component and operational redundancy, and the system responds automati-
cally and seamlessly to the failure of critical components.
Figure 9-1 The IBM Netezza appliance (now known as the IBM PureData System for Analytics): SMP hosts (SQL compiler, query plan, optimization, and administration) connect to snippet blades, which pair CPUs and memory with FPGAs for hardware-based query acceleration and run complex analytics as the data streams from disk, while BI, advanced analytics, and ETL/loader applications connect through the hosts.
An FPGA is tiny (roughly a 1-inch square of silicon) but performs programmed tasks with enormous efficiency, drawing little power and generating little heat.
A dedicated high-speed interconnect from the storage array enables data
to be delivered to Netezza memory as quickly as it can stream off the disk.
Compressed data is cached in memory using a smart algorithm, which
ensures that the most commonly accessed data is served right out of memory
instead of requiring disk access.
Each FPGA contains embedded engines that perform compression, filter-
ing, and transformation functions on the data stream. These engines are
dynamically reconfigurable (enabling them to be modified or extended
through software), are customized for every snippet through instructions pro-
vided during query execution, and act on the data stream at extremely high
speeds. They all run in parallel and deliver the net effect of decompressing
and filtering out 95 to 98 percent of table data at physics speed, thereby keep-
ing only the data that’s relevant to the query. The process described here is
repeated on about 100 of these parallel snippet processors running in the
appliance. On a 10-rack system, this would represent up to 1,000 parallel
snippet processors, with performance that exceeds that of much more expen-
sive systems by orders of magnitude.
Figure 9-3 The FPGA role in a Netezza query run—The “FPGA effect”: the data stream is decompressed, projected, and restricted (with visibility checks) before being handed off for complex joins and aggregations.
The first thing to note is that this system needs to read only a page of data
and not the entire table (the solid block that makes its way from the disks to
the FPGA assembly line in Figure 9-3—this is an initial qualifying row restric-
tion) because storage ZoneMaps know what data values reside within each
page and can completely skip data that is of no interest to the query. These
pages contain compressed data that gets streamed from disk onto the assem-
bly line at the fastest rate that the physics of the disk will allow (this is why
you see the solid block expand and its pattern change in the figure—it is
now decompressed). Note that if data were already cached, it would be taken
directly from memory.
The FPGA then uses its Project engine to apply any query projections in
order to filter out any attributes (columns) based on parameters specified in
the SELECT clause of the SQL query being processed, in this case, the
DISTRICT, PRODUCTGRP, and NRX columns. Next, the assembly line applies
any restrictions from the query using its Restrict engine. Here rows that do
not qualify based on restrictions specified in the WHERE clause are removed.
This phase includes a Visibility engine that feeds in additional parameters to
the Restrict engine to filter out rows that should not be “seen” by a query; for
example, perhaps there is a row with a transaction that is not yet committed.
The Visibility engine is critical for maintaining ACID compliance at stream-
ing speeds in the PureData System for Analytics (see Chapter 2 if you don’t
know what this is).
Finally, the processing core picks up the data and performs fundamental
database operations such as sorts, joins, and aggregations. The CPU can also
apply complex algorithms that are embedded in the snippet code for
advanced analytics processing. All the intermediate results are assembled,
and the final results are sent over the network fabric to other S-blades or the
host, as directed by the snippet code.
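If you like to think in code, here is a tiny conceptual sketch of that assembly line in Python. It is our illustration, not Netezza internals: the page layout, column names, and predicate are invented, but the order of operations mirrors the description above (ZoneMap skip, decompress, project, restrict with visibility checks, then CPU-side aggregation).

# Conceptual model only: invented data and names, not Netezza code.

def zonemap_skip(pages, lo, hi):
    """Skip pages whose min/max value range cannot satisfy the restriction."""
    return (p for p in pages if p["max"] >= lo and p["min"] <= hi)

def snippet(pages, columns, lo, hi):
    """One snippet processor: decompress, project, restrict, visibility."""
    for page in zonemap_skip(pages, lo, hi):
        for row in page["rows"]:                      # rows stream off disk, decompressed
            projected = {c: row[c] for c in columns}  # Project: keep only SELECTed columns
            # Restrict: WHERE nrx BETWEEN lo AND hi; Visibility: drop uncommitted rows.
            if lo <= row["nrx"] <= hi and row["committed"]:
                yield projected                       # only relevant rows reach the CPU

pages = [
    {"min": 10,  "max": 90,
     "rows": [{"district": "East", "productgrp": "A", "nrx": 50,  "committed": True}]},
    {"min": 120, "max": 180,
     "rows": [{"district": "West", "productgrp": "A", "nrx": 150, "committed": True},
              {"district": "West", "productgrp": "B", "nrx": 130, "committed": False}]},
]

# CPU-side aggregation over the filtered stream (prints 150).
print(sum(r["nrx"] for r in snippet(pages, ["district", "productgrp", "nrx"], 100, 200)))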
More than 200 analytical algorithms have been deeply integrated into the Netezza system; collectively, these are known as INZA. For example, INZA includes estimation models (regression trees, nearest neighbor analysis), simulations (a matrix engine and Monte Carlo analysis), k-means clustering, a Naïve Bayes classifier, and many others that we don’t have the space to detail here. Through a
planned number of phases, you’re going to see these INZA capabilities get
deeply embedded within the BLU Acceleration technology. As mentioned
earlier in this book, IBM Information Management will iterate capabilities
over the cloud first—in a manner that allows us to deliver capabilities much
faster than traditional models. For example, you will eventually be able to
run the same INZA statistical algorithms that you can run in Netezza on the
dashDB service (powered by BLU Acceleration) in the cloud. This greatly
empowers organizations with the ability to take bursty Netezza workloads
into the cloud. This said, the INZA capabilities will also find their way into
on-premise software instantiations of BLU Acceleration too; DB2 will even-
tually gain these capabilities as well.
IBM Netezza Analytics supports multiple tools, languages, and frame-
works. It enables analytic applications, visualization tools, and business
intelligence tools to harness parallelized advanced analytics through a vari-
ety of programming methods such as SQL, Java, MapReduce, Python, R, C,
and C++, among others; all can be used to deliver powerful, insightful ana-
lytics. The following table summarizes the in-database analytics built into
the Netezza technology:
Netezza also enables the integration of its robust set of built-in analytics
with leading analytics tools from such vendors as Revolution Analytics
(which ships a commercial version of R), open source R, SAS, IBM SPSS,
Fuzzy Logix, and Zementis, among others. Additionally, you can develop
new capabilities using the platform’s user-defined extensions.
This comprehensive advanced analytics environment makes it easy to
derive benefit from this platform, giving you the flexibility to use your pre-
ferred tools for ad hoc analysis, prototyping, and production deployment of
advanced analytics. Do not overlook this INZA stuff—our clients are far out-
pacing their peer groups by latching onto these capabilities.
Wrapping It Up
The IBM PureData System for Analytics, powered by Netezza technology, is a
simple data appliance for serious analytics. It simplifies and optimizes the per-
formance of data services for analytic applications, enabling very complex algo-
rithms to run in minutes and seconds, not hours. Clients often cite performance
boosts that are 10–100x faster than traditional “roll-your-own” solutions. This
system requires minimal tuning and administration since it is expert integrated
and highly resilient. It also includes more than 200 in-database analytics func-
tions, which support analysis where the data resides and effectively eliminate
costly data movement, yielding industry-leading performance while hiding the
complexity of parallel programming; collectively, this capability is known as
INZA. In fact, the INZA algorithms serve as the base for the embedded analytic
capabilities that will emerge within the BLU Acceleration technology over
time—showing up in its as-a-service dashDB offering first.
In this chapter we referred to the IBM PureData System for Analytics as an
appliance and by its ancestry name, Netezza. We did this because it made it
simpler to explain the way it works. The truth is that the Netezza technology
has evolved from an appliance into an expert integrated system since the
technology was acquired by IBM. Expert integrated systems fundamentally
change both the experience and economics of IT, and they are quite different
from appliances.
Expert integrated systems are more than a static stack of self-tuned
components—a server here, some database software there—serving a fixed
application at the top. Instead, these systems have three attributes whose
confluence is not only unique, but empowering.
More than ever, we live in a time in which the desire to consume func-
tionality without having to manage and maintain the solution is the expec-
tation, if not the norm. This is particularly true for cloud-based initiatives
and “as a service” delivery models, such as infrastructure as a service (IaaS),
platform as a service (PaaS), software as a service (SaaS), database as a ser-
vice (DBaaS), data warehouse as a service (DWaaS), business process
as a service (BPaaS), and any others that might have arisen since this book
was written. Today’s marketplace is bearing witness to an inflection point:
There has been an explosion in new classes of middleware, infrastructure
provisioning over the cloud, and applications (apps) that are offered as a
service. The buyer of information technology (IT) is increasingly the chief
marketing officer (CMO) and line-of-business (LOB) user. For example, in a
recent report, Gartner suggested that corporate IT spending will come more
from the CMO’s office than it will from the chief information officer (CIO).
“By 2016, 80% of new IT investments will directly involve LOB executives,”
Gartner notes. The as-a-service model is about six things that we want you
to remember long after you’re done reading this book: agility, agility, agility, and simplicity, simplicity, simplicity.
With this in mind, it is our expectation that the cloud as a service delivery
model is going to see a sprint (not a walk) of ever-growing adoption by
established and emerging markets. If you think about it, this change was
inevitable, a natural evolution. Social-mobile-cloud created so much friction
that Big Data and Internet of Things (IoT) apps had no choice but to adopt
this model. For businesses, developers, and the enterprise, the answer to
delivering these new consumption models is simply “the cloud.”
But is the cloud truly simple? Application developers for mobile devices
and modern web apps are often challenged by their database management
systems. Choosing your database solution intelligently and correctly from
the outset is essential. Selecting an inappropriate data persistence model (a
developer’s way of referring to a database) or building apps on top of a
poorly sized or misconfigured data layer can result in serious, even debilitat-
ing, issues with performance and scalability down the road. What’s more, in
a mobile-social-cloud world, the demand placed on database systems can
change on a day-to-day basis.
No doubt you’ve heard about overnight Internet app sensations. Consider
Hothead Games, an IBM Cloudant (Cloudant) client and maker of some of the
most amazing mobile games you will find. Its published mission statement is
“develop kick-ass mobile games or go bust.” We’re happy to say that it is
doing quite well. In fact, as an example of just how well the company is doing,
it leveraged a fully managed IBM NoSQL JSON data-layer service running
Cloudant to grow its cluster 33-fold over the course of a year! (If you’re thinking
that JSON is the name of the guy in the “Friday the 13th” horror franchise,
check out Chapter 2.) Hothead Games needed agility and simplicity from a
managed data service so they could focus on gaming, not scaling the database.
Hothead Games needed DBaaS (a type of SaaS). More specifically, its develop-
ers needed a solution that enabled them to accommodate fluctuations in load
and use of their infrastructure. With growing demand for their games, they
required a delivery model that could scale up without requiring a correspond-
ing scaling up of the “administrative bench”—database administrators (DBAs)
and operations (ops) teams, all of which bring added costs to an organization.
Improper planning from day zero will saddle any app with performance, scal-
ability, and availability issues that translate into higher costs for the business;
Cloudant scopes these requirements for you and delivers a managed service to
grow your data layer as needed.
Cloudant eliminates the aforementioned potential pitfalls by putting to
rest the complexities of database management and growth. Startups and
enterprises have traditionally needed to spend time on things that have noth-
ing to do with their business: selecting what hardware to purchase, deciding
on a database and then maintaining it, hiring DBAs, and so on. Cloudant is
designed specifically to eliminate these stressors and to enable the develop-
ers of net-new mobile and “born on the web” apps to focus purely on build-
ing their next generation of apps. There’s no need to manage database infra-
structure or growth.
Cloudant is a fully managed NoSQL data layer service that is available both
on and off premise (in cloud-speak, these deployment models are described
as on prem and off prem). Cloudant guarantees its customers high availability,
scalability, simplicity, and performance, all delivered through a fully hosted
and managed DBaaS in the cloud, or through an on-premise implementation.
Cloudant is also deeply integrated with IBM’s PaaS offering called Bluemix,
where it’s known as Cloudant NoSQL DB. Cloudant’s potential to interoper-
ate between on-premise deployments and scalable cloud IaaS is key to IBM’s
cloud computing and Mobile First initiatives.
In this chapter, we’ll introduce you to the Cloudant service and demon-
strate how it offers greater agility for launching or rapidly evolving new
products in response to an ever-changing market. In short, Cloudant gives
you the ability to build more, grow more, and sleep more, and who couldn’t
use more of all of these, especially the last one?
and to the demands of that growth on its cluster. Collectively, this authoring
team has more than a century of experience in IT, and we cannot imagine the
turmoil that budget planning and execution for this kind of scaling demand
would cause. Each of these factors imparts a hidden cost in addition to the
investment your company will make into infrastructure and licensing; scop-
ing any of these requirements incorrectly will impact your costs even further.
An alternative to the RYO approach is what’s described as a hosted solu-
tion. We are a little disappointed in some database vendors who have taken
this term and misleadingly marketed it as a catchall for managed databases.
In reality, these “hosted” data layers are only one step removed from the RYO
approach. Sure, the infrastructure layer is provisioned and set up by the ven-
dor, but the remaining software administration, configuration, and mainte-
nance responsibilities rest with the customer. These vendors openly market
their offerings as a “managed database service.” But the vendor has only “delivered the car and its keys”; it remains the sole responsibility of the client to drive the car. In fact, the client has to fill the tank with gas, change the oil, schedule time to comply with a recall on any defects, and more.
IBM Cloudant’s distinguishing feature is the risk-free delivery of a fully man-
aged (not just hosted) data layer through a cloud-distributed DBaaS solution.
Unlike RYO solutions, which ask developers to handle everything from pro-
visioning the hardware to database administration at the top of the stack, Cloudant manages the entire stack for you.
To cope with the existing dilemma and avoid repeating the same mistakes,
the design studio articulated five criteria that need to be satisfied before it is
able to commit to a solution:
• The improved database back end needs to scale massively and
elastically (up and down like a thermostat controls temperature) in
response to fluctuating demand on the app vendor store.
• The solution needs to be highly available and deployable without
downtime to the existing service so that the delivery of entertainment
to users around the world is not interrupted.
• The solution needs to be up and running quickly, while the studio still
has a chance to capitalize on early popularity and buzz around the game.
• The data layer needs to be managed. Hiring a large DBA team doesn’t
make sense for the company’s long-term objectives of developing
better games for its customers.
• The technology has to deliver improved tooling and techniques for
data management. The messy and frustrating experience of managing
a CouchDB instance (and the complexity of some RDBMS alternatives)
was a contributing factor to the outage the company suffered.
Why would this studio come to use the IBM Cloudant DBaaS (as so many
others have done)? As we’ve discussed, IBM Cloudant’s database scales mas-
sively and elastically: As demand for an app grows or wanes (the mobile
gaming marketplace is often described as a cyclical business of peaks and
valleys), Cloudant database clusters can shrink and grow as needed. More-
over, the costs associated with Cloudant adjust accordingly to the size of
your business (in fact, you can get started for free with a freemium plan). Cus-
tomers do not require a big capital investment up front (no CAPEX) to get a
massive database solution up and running ahead of time. If demand for your
company’s services inflates in the months and years ahead, the elastic scal-
ability of Cloudant’s data layer will be able to grow accordingly.
There is also the matter of guaranteed performance and uptimes. Cloudant
has many unique capabilities for delivering customer access to data. Cloud-
ant clients are not just buying database software; in addition to the technol-
ogy, clients are buying the peace of mind that comes from an SLA that removes
concerns over high availability and disaster recovery. If a data center goes
down, Cloudant will reroute users to another replica of the data. In this way,
your data is guaranteed to remain highly available, even in the event of hard-
ware failure, anywhere in the world, when and where you need it. You are
likely thinking what we thought when we first met the team shortly after the
IBM acquisition: Cloudant feels like an extension of its customers’ develop-
ment teams. They provide expert operational development for the data-layer
back end, applying their expertise to growing apps and providing guidance
on application architecture as you manage your business.
In this section, we used the gaming industry as an example of a domain
that greatly benefits from the DBaaS service that Cloudant offers. It does not
stand alone. You will find that the Cloudant value proposition has no indus-
try boundaries. From Microsoft’s Xbox One gaming platform to the Rosetta
Stone language learning system to the Dropbox file-sharing solution—each
of these products interacts and engages with IBM Cloudant’s NoSQL docu-
ment store database.
Vendor lock-in is another concern for organizations looking to make the jump to cloud-based technologies. The last thing you
want is to invest in a solution that you expect to be “pick up and go” and
then find yourself restricted to only products from the same vendor as you
look to build out and extend your service. Many competing vendors in the
cloud space follow these kinds of lock-in practices, offering an enticing entry
service that scales dramatically in cost if you need to tack on additional func-
tionality or services—and only from their own catalog of services, which
might not be the best fit for your use case! We think you will agree that lock-
in practices such as these run contrary to the type of open, low-risk, fast time
to value environment promoted by as-a-service cloud offerings.
Open is key here: Cloudant maintains a 99 percent compatibility with open
source Apache CouchDB (the foundation on which Cloudant was built),
allowing Cloudant and CouchDB users to replicate between the two data-
bases; query and perform create, read, update, and delete operations; and
build secondary indices with MapReduce—all in a consistent way. A team of
Cloudant engineers has made a number of significant contributions to the
Apache CouchDB open source community from which it emerged, such as
adding Cloudant’s user-friendly dashboard to the latest iteration of CouchDB.
In addition, a number of Cloudant engineers—Adam Kocoloski, Joan Touzet,
Paul Davis, and Bob Newson—also serve as members of the Apache
CouchDB Project Management Committee (PMC).
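To show what that compatibility looks like in practice, the following sketch asks a Cloudant account to continuously replicate a database from a self-hosted CouchDB instance over the standard _replicate endpoint; the URLs, database name, and credentials are placeholders.

import requests

AUTH = ("cloudant-user", "password")          # placeholder credentials
CLOUDANT = "https://myaccount.cloudant.com"   # placeholder account URL

replication = {
    "source": "http://admin:pass@couchdb.example.com:5984/fanposts",
    "target": f"{CLOUDANT}/fanposts",
    "create_target": True,    # create the target database if it does not exist
    "continuous": True,       # keep the two databases in sync as changes arrive
}

resp = requests.post(f"{CLOUDANT}/_replicate", json=replication, auth=AUTH)
print(resp.status_code, resp.json())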
There are considerable advances that Cloudant has made atop CouchDB,
of course: Cloudant has a different authorization model (by virtue of being a
hosted service, which CouchDB is not); Dynamo-style database clustering is
alien to CouchDB; Apache Lucene and GeoJSON geospatial querying do not
extend from Cloudant to CouchDB; and so on. However, because Cloudant’s
API is deeply compatible with CouchDB, Cloudant frequently leverages the
open source community for things like building libraries and promoting best
practices for interacting with that shared API. Cloudant is a strong supporter
of open standards—using JSON instead of a binary alternative that requires
adding drivers just to interact with your data.
The motivation behind this is simple: native compatibility eases the pain
points for CouchDB users making the move to a managed cloud service;
opening the platform to external services and libraries (and not locking in
users to a particular vendor) makes it easier to develop with Cloudant; and
the ease of integration with existing platforms fosters the type of environ-
ment and culture that web and mobile developers cherish.
Cloudant or Hadoop?
A question we get a lot is “What NoSQL technology should I use, Cloudant
or Hadoop?” Sometimes the answer is “both,” applying each to the use cases
for which it was designed. We covered the NoSQL landscape in Chapter 2, so
you know that Cloudant is a purpose-built document database and Hadoop
is an ecosystem of technologies that includes NoSQL technologies and more.
Recall that Cloudant is best utilized as an operational data store rather than
as a data warehouse or for ad hoc analytical queries. Cloudant is flexible
enough to handle any type of work you could throw at it, but a jack-of-all-
trades is a master of none, or so the saying goes. You will want to intelli-
gently apply Cloudant where it can be most useful, namely, as the back-end
data layer to systems of engagement (remember: web, mobile, social) because
of the incremental index builds that support its query architecture.
We explore this in much more detail in the upcoming “For Techies” sections,
but for now, just remember that with Cloudant, your indexes are rebuilt incre-
mentally as the data that they index changes. This results in high-performing
queries: Cloudant is fast because you perform searches against an index that
has already done the heavy lifting of being built ahead of time. Index the fields
that you want to query, send out the results to a Reduce job to consolidate your
results (from the MapReduce framework you hear about in Hadoop so much),
and format the output. These are the types of indexes that you can design with
Cloudant, which then handles the work of ensuring that your searchable
indexes are kept up to date as data streams in or is altered.
Imagine, however, that you are a data scientist who regularly changes the
type of index you need to define or frequently tests new query patterns.
Because Cloudant needs to rebuild an index every time the design document
changes, it would be computationally demanding for your database to follow
this kind of access pattern. Cloudant keeps pace with the work of building
indexes across massive collections of data through Dynamo-style horizontal
clustering and a masterless architecture. Nodes operate independently with
no single host node governing the jobs that are running on other parts of the
cluster. Each node is able to service requests made by the client, and requests
that are delivered to an offline node can simply be rerouted to other hosts in
the cluster. This contrasts starkly with a Hadoop-style master-slave architec-
ture and ensures that Cloudant clusters achieve the highest degree of avail-
ability and uptime. The failure of any one of its masterless nodes does not
result in any breakdown of job coordination among other nodes in the cluster.
Should you need to scale your database for added nodes and computational
power, Cloudant is able to move and divide existing partitions at the database
level for rebalancing. The advantage for end users is that the high costs that
would be associated with building indexes incrementally can be distributed
among the masterless nodes of the cluster—divide and conquer!
During connectivity outages, the SQLite data layer manages local reads, writes, and
indexing. When your device regains connectivity to the network, the Sync
library handles pushing up that data to the Cloudant cluster. When your
device’s data hits the cluster, your edits are propagated through the tradi-
tional Cloudant replication and synchronization mechanisms to the rest of
the nodes according to the quorum parameters that you defined earlier. As
before, no additional libraries are required for Cloudant Sync to interact with
your database cluster; all requests are passed over HTTP or HTTPS through
Cloudant’s RESTful API. You can see how this kind of robustness is critical
for occasionally connected clients.
For cross-cluster operations such as replication and sync, the limiting factor on response time is the physical latency between your data centers.
Versioning and document conflicts will inevitably arise with any large dis-
tributed system, and although Cloudant employs numerous strategies to avoid
document update conflicts—such as the requirement that all updates to docu-
ments include the document ID and the latest revision ID of the document—
there must always be a contingency for resolving conflicts. In other words,
expect the best, but always plan for the worst. Cloudant’s solution is to never
delete conflicts; instead, conflicting documents are both persisted inside the
database and left for the application layer to deterministically resolve. Mini-
mizing write contention is rule number one for avoiding conflicts. Requiring
document updates to supply both the document ID and revision value cer-
tainly helps to reduce the number of conflicts that might arise. Cloudant recommends that you treat your data as immutable documents and create new documents instead of updating existing ones, as a strategy for further
cutting down on the number of revision conflicts. This has some storage
implications, but the benefits of having fewer document update conflicts and
having a traceable document lineage are valuable to many Cloudant cus-
tomer use cases.
The “right” version of a document is something that Cloudant determinis-
tically returns for you by default. If you want to know about all existing
conflicts, you must ask for them. The parameter conflicts=true returns a
list of revision identifiers for siblings of the document that you specify. Today,
Cloudant will not inform you as to the number of conflicts that exist in a
document’s metadata; however, you can easily leverage Cloudant’s built-in
MapReduce functionality to build a list of documents that have revision con-
flicts if you need this information. With conflicting documents, the cluster
coordinator detects that one of the document hosts has responded to a request
with an older version of the document. It does so by detecting divergence in
the document history, at which point it can determine—through the conflict-
ing document’s hash history—how the conflicting document fits into the
complete revision tree. This enables a developer to later commit that docu-
ment programmatically. Cloudant actively leverages hash history technol-
ogy to preserve the integrity and consistency of the cluster as a whole.
Append, merge, drop, and pick the winning revision; as a developer, the
choice is up to you. Conflicts are never automatically resolved within the
database (which is to say, Cloudant will never take the leap of deleting data
without your express permission); it is up to the developer to resolve them at
the application layer.
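Here is a sketch of what that application-layer resolution can look like over the HTTP API. The database name, document ID, and the “keep the winner, delete the siblings” policy are illustrative choices only; a real application might merge the siblings first.

import requests

AUTH = ("cloudant-user", "password")                    # placeholder credentials
URL = "https://myaccount.cloudant.com/fanposts/game7"   # placeholder document

# Ask for the winning revision plus any conflicting sibling revisions.
doc = requests.get(URL, params={"conflicts": "true"}, auth=AUTH).json()
losing_revs = doc.get("_conflicts", [])

# One possible policy: accept the deterministic winner and delete the siblings.
for rev in losing_revs:
    requests.delete(URL, params={"rev": rev}, auth=AUTH)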
Cloudant Local
By the time we go to print, IBM will have a special offering, called Cloudant
Local, which is designed for enterprise, healthcare, security, and other organi-
zations that have privacy-related ordinances or government regulations that
restrict their data to on-premise deployment models. Cloudant’s dedicated
team of engineers can fully support the deployment of your on-premise infra-
structure, as well as make your on-premise data available on the cloud, if and
when your business requires it.
Cloudant will provide support for the initial provisioning and system
calibration of Cloudant Local; however, Cloudant Local will not carry the
same level of ongoing managed support. After Cloudant Local begins its on-
site residency, the further provisioning, management, and scaling of your
database is up to you. Of course, you will interact with Cloudant Local
through the same interfaces and APIs that you currently employ with tradi-
tional Cloudant (a dashboard GUI, the cURL command line, the RESTful
HTTP- or HTTPS-based API, and so on).
We want to stress that the core strengths of Cloudant’s functionality are
preserved across both cloud and on-premise deployments. Cloudant Local
offers superior agility for developers: a NoSQL data layer built on top of
open source Apache CouchDB, with fully supported database replication
and synchronization protocols. Cloudant’s multimaster (“masterless”) repli-
cation architecture, as well as mobile application sync capabilities for occa-
sionally connected devices, are also available in Cloudant Local. Further-
more, Cloudant’s powerful query suite—including geospatial, real-time
indexing, incremental MapReduce, and Apache Lucene full-text search—is
carried over to on-premise deployments.
We think you will leverage Cloudant Local to integrate directly with
existing on-site NoSQL environments or infrastructures. But more impor-
tantly, because Cloudant Local maintains compatibility with traditional
Cloudant, you can leverage the same API libraries to “burst” data to the
cloud for additional capacity (for backups, disaster recovery, or high avail-
ability, for example) when and as needed. Even if your organization is not
subject to regulatory requirements, it might make sense to start your deploy-
ment on premise and take “bursty” workloads into the cloud as needed—
we refer to this as hybrid cloud computing.
When you query the primary index, you can specify start and end key ranges, and more. The takeaway message here
is that the primary index can be an extremely efficient tool when properly
leveraged by your application layer. Cloudant supports the specification of
customized document IDs (versus the autogenerated UUIDs that Cloudant
supplies by default) to make it easier to locate documents and to perform
searches against the primary index. Figure 10-3 illustrates how a simple
JSON document can be written to or retrieved from a Cloudant document
database endpoint using the HTTP-based API.
All interactions with a Cloudant database are made through a RESTful
HTTP or HTTPS API. Queries against the primary index nicely demonstrate
how a web-based API can be both intuitive and expressive to users. Any
application that “speaks the language of the web” can interact with your
Cloudant data layer through its RESTful API. Nearly every modern pro-
gramming language and development environment you can imagine—Java,
Python, .NET, Ruby, Node.js, and so on—cleanly integrates with the web;
such languages can therefore just as easily interoperate with a Cloudant data
layer. Developers can issue commands using verbs such as GET, PUT, POST,
DELETE, and COPY to a set of logically organized URI endpoints. The hierar-
chy of endpoints that developers specify through this API is intuitive to any
user of a modern web browser (and, in fact, the API can be accessed and
queried through your favorite browser at any time). Cloudant’s language-
agnostic API is incredibly useful for developing apps for any platform or
device that speaks a web language.
GET /<database>/<doc._id>
{
  "name": "Joe Smith",
  "headline": "I love hockey",
  "date": "2014-07-30",
  "tags": ["Boston", "hockey", "win"],
  "comment": "This year we take the cup!"
}
PUT /<database>/<doc._id>
Figure 10-3 Cloudant’s RESTful API follows a logical set of URI endpoints to interact
with your data layer through HTTP or HTTPS.
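Because the API is just HTTP, the round trip in Figure 10-3 can be driven from any modern language. Here is a sketch using Python’s requests library; the account, database, document ID, and credentials are placeholders.

import requests

AUTH = ("cloudant-user", "password")                # placeholder credentials
BASE = "https://myaccount.cloudant.com"             # placeholder account URL
DOC_URL = f"{BASE}/fanposts/joe-smith-2014-07-30"   # custom document ID

doc = {
    "name": "Joe Smith",
    "headline": "I love hockey",
    "date": "2014-07-30",
    "tags": ["Boston", "hockey", "win"],
    "comment": "This year we take the cup!",
}

# PUT writes (or updates) the document at the ID we chose.
print(requests.put(DOC_URL, json=doc, auth=AUTH).json())

# GET reads it back, including the server-assigned revision (_rev).
print(requests.get(DOC_URL, auth=AUTH).json()["_rev"])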
(In the relational world, a compound key contains multiple columns.) This
capability is extremely powerful when you consider that the resulting index,
when queried, is automatically sorted by key. For example, you could retrieve
a list of all books written about a specific topic, sorted by copyright date and
version, by using a complex key.
Developers have the option to specify a Reduce job following the Map
phase. Cloudant provides Reduce job boilerplates that are natively compiled
in Erlang, a programming language that is designed specifically for engaging
with and powering massively distributed apps. Erlang-compiled jobs are built on disk much more efficiently than custom Reduce functions. Cloudant also lets you define custom Reduce jobs in JavaScript, but these will perform more slowly than their Erlang counterparts. Recall that Cloudant is best applied as
an operational data store precisely because it builds its query indexes incre-
mentally in real time—it is not optimized for building indexes in batch. Clou-
dant does support the query parameter ?stale=ok, which enables prioriti-
zation of performance over consistency by giving Cloudant permission to
query the last-built index rather than waiting for a currently running index
job to finish rebuilding before processing the query. If you have a large pro-
duction environment, we recommend you leverage ?stale=ok for improved
query responses as a best practice.
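Pulling those pieces together, the following sketch stores a design document whose view emits a compound key and uses the built-in (Erlang-compiled) _count reduce, and then queries it with ?stale=ok. The database, field, and view names are invented for illustration.

import requests

AUTH = ("cloudant-user", "password")        # placeholder credentials
BASE = "https://myaccount.cloudant.com"     # placeholder account URL

design_doc = {
    "_id": "_design/books",
    "views": {
        "by_topic": {
            # Map: emit a compound key of [topic, copyright year, version].
            "map": ("function (doc) {"
                    "  if (doc.topic) { emit([doc.topic, doc.copyright, doc.version], 1); }"
                    "}"),
            # Built-in reduces (_count, _sum, _stats) run as compiled Erlang.
            "reduce": "_count",
        }
    },
}
requests.put(f"{BASE}/library/_design/books", json=design_doc, auth=AUTH)

# Query the view; stale=ok returns the last-built index rather than waiting
# for an in-flight incremental rebuild to finish.
resp = requests.get(f"{BASE}/library/_design/books/_view/by_topic",
                    params={"stale": "ok", "group": "true"}, auth=AUTH)
print(resp.json())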
When you define a search index, you supply an indexing function; its resulting output creates an index of one or more fields that can be queried.
These indexes are built as soon as you load the design document. Over time,
any event that updates your documents will trigger a rebuild of your indexes,
as well as subsequent replication of the indexes across the cluster.
Defining a search index requires supplying the key for the field that you
want to index, as well as several optional parameters if you are feeling adven-
turous. One parameter that Cloudant practitioners often use in place of a
custom identifier (which would be used for querying a field by name) is
default; it enables you to submit a search against the index without know-
ing the name of the field. For example, you could perform an equivalent
search to q=animalName:zebra by using q=zebra, without having to
specify the field identifier. Cloudant’s search index also supports more
advanced types of queries, such as searches against geospatial coordinates,
group searches, and faceting. Lucene supports sorting on a particular field.
By default, it sorts based on the Lucene order number; however, if you want
to have reliable sorts that are based on relevance, you can index by a particu-
lar field name.
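Here is a sketch of the kind of search index that the q=zebra example implies; it indexes a named field and the special default field, and the database and field names are placeholders.

import requests

AUTH = ("cloudant-user", "password")        # placeholder credentials
BASE = "https://myaccount.cloudant.com"     # placeholder account URL

design_doc = {
    "_id": "_design/zoo",
    "indexes": {
        "animals": {
            # Index the field by name and also into "default", which is what
            # lets a bare q=zebra query work without naming the field.
            "index": ("function (doc) {"
                      "  if (doc.animalName) {"
                      "    index('animalName', doc.animalName);"
                      "    index('default', doc.animalName);"
                      "  }"
                      "}")
        }
    },
}
requests.put(f"{BASE}/wildlife/_design/zoo", json=design_doc, auth=AUTH)

# These two queries are equivalent, as described above.
for q in ("animalName:zebra", "zebra"):
    hits = requests.get(f"{BASE}/wildlife/_design/zoo/_search/animals",
                        params={"q": q}, auth=AUTH).json()
    print(q, "->", hits.get("total_rows"))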
If you have ever lined up at an amusement park to ride a roller coaster and
have been placed into one of several lines waiting to fill the carts, then you
have been part of a hashing function—what a rush! Determining where to
distribute the document shards is a purely functional computation of the
document ID; there is no lookup table that says “For this document ID 123,
host on partition X.” Instead, the location is determined as a pure hash func-
tion of the document ID to a segment in the hash key-space. The only state
that Cloudant needs to maintain is the mapping of shards to nodes. This is
persisted in a separate database that is continuously replicated to every node
in the cluster. Each document is guaranteed to fall somewhere in the key-space of that hash function, and each segment of the key-space represents a shard. Selecting a good,
consistent hashing function will generate an even distribution of the number
of documents per shard across your key-space. By sharding (distributing)
your data over multiple nodes, you are able to house within your database
(and thus across your cluster) more data than could be persisted on just a
single host. To ensure that these shards are stored durably and consistently,
you need to calibrate the replication factor for each shard. This brings us to
the letter N.
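The following toy sketch (ours, not Cloudant’s code) captures the idea: the shard is a pure function of the document ID, so no per-document lookup table is ever needed.

import hashlib

NUM_SHARDS = 8   # the number of key-space segments; illustrative value

def shard_for(doc_id: str) -> int:
    """Map a document ID to a shard purely by hashing the ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for doc_id in ("user:123", "user:124", "order:2014-07-30-0001"):
    print(doc_id, "-> shard", shard_for(doc_id))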
Figure 10-4 Conceptualizing a “write” request from a client to a Cloudant cluster of six nodes with a write quorum of 2
Suppose one replica of the requested data happens to reside on the coordinating node, and other host nodes in the cluster hold the two remaining replicas. If these three replicas are in agreement and able to
produce a version of the requested JSON document, then a quorum is reached,
and the document is returned to the requesting application.
When you size the number of shards, two considerations apply. First, each request must be computed fast enough to give a response with low latency. Second, you
want to minimize the amount of network traffic that is generated across your
cluster: If you are building an index that you expect will be queried against
thousands of times per second, you need to consider that each of those (thou-
sand) queries will be sent to every single partition. Cranking up the number
of partitions significantly increases the load you place on both the CPUs that
support your database and the network itself. If an application is read-heavy,
the database should favor a lower number of shards. Conversely, if your
application is write-heavy, you will want a higher number of shards to bal-
ance out and distribute the work of performing write operations. In general,
the number of shards should be a multiple of the number of nodes in the
cluster so that the shards can be distributed evenly.
Of course, “stuff happens.” In the event of failure, Cloudant operators
manage the restoration of a downed node by first returning this failed host to
life in maintenance mode. The node accepts write operations and continues to
perform internal replication tasks to bring it in step with the rest of the clus-
ter, but it does not respond to coordination requests, which are read and write
quorum acknowledgments in response to client requests.
When the node matches the replication state of the rest of the cluster,
Cloudant operators turn off maintenance mode, and the node is available for
servicing coordination tasks. Cloudant engages maintenance mode in the event of a node failure to preserve consistency and minimize the stale responses that you might otherwise get on secondary index queries. The bottom line is
that your database solution partner (IBM) strives to keep you from having to
wrestle with the underlying technology while delivering reliable uptime
document store services—which is why you chose a managed service in the
first place.
Wrapping It Up
In this chapter, we talked about how Cloudant eliminates complexity and
risk when you build fast-growing web and mobile apps by providing not just
a hosted, but a fully managed data-layer solution. With Cloudant, you receive
a cloud service that enables your developers to focus on creating the next
generation of apps without needing to manage their database infrastructure
and growth. The flexibility of a NoSQL JSON persistence model ensures that
you can work with nearly any type of data, structured or unstructured, on an
architecture that is scalable, available, and distributed.
On top of this data layer, Cloudant adds a suite of advanced indexing and
querying capabilities (for use with even multidimensional spatial data),
including Cloudant Sync libraries to support local write, read, and indexing
capabilities on occasionally connected or even disconnected mobile devices.
Delivered from the cloud as a managed DBaaS that is unparalleled in the
industry, Cloudant simplifies application development for a new class of sys-
tems of engagement. At the same time, Cloudant Local gives companies that are not ready (or are unable because of regulations) to leverage an off-premise environment the ability to take advantage of the many Cloudant capabilities we outlined in this chapter. Cloudant Local maintains pure API
compatibility with traditional cloud-based Cloudant deployments to enable
hybrid apps that support bursty workloads.
Cloudant brings cloud-based scalability and continuous availability to
empower application developers to build more, grow more, and sleep more.
We think you will find IBM Cloudant to be the “do more” NoSQL layer that
your business is looking for. Get started for free at https://cloudant.com/
sign-up/.
Part III
Calming the Waters:
Big Data Governance
11
Guiding Principles for Data
Governance
IBM has long been promoting the importance of data governance, both in
word, by advising clients, and in deed, by building governance capabilities
into our software. This part of the book discusses both, by first outlining key
principles for data governance and, in the following chapters, by describing
the on-premise and off-premise data governance tools that can be used to
bring order to your Big Data challenges. Where the first two parts of this
book are about understanding Big Data, blue sky thinking (industry trends,
ideal architectures), and the latest tech (the Watson Foundations data stor-
age and processing technologies), this is the parental guidance section. The
chapters in this part of the book are critical to any Big Data project that
aspires to transform an organization from the hype phase to a data decision-
ing one—there is no hype allowed here. We feel that this is an appropriate
closing note for our book because as exciting as new technologies are, real
success in analytics and data management comes from a properly deployed
data governance strategy. In this chapter, we’re going to lay out the IBM
data governance maturity model as a set of directives, a modus operandi if
you will. We’ll use the remaining chapters in this book to explain their
application in a Big Data setting. This is the sort of chapter you pick up and
reread a few times a year to see how your organization measures up to data
governance criteria.
2. Stewardship
More often than not, we see discussions around stewardship begin and end
with the question of which department should own certain data sets. A
proper discussion on stewardship needs to go beyond control and should
include details about how the people in control of the data will take care of it,
nurture it, feel responsible for it, and instill a culture of responsibility across
the population that consumes the organization’s information supply chain—
which should be everyone.
3. Policy
Most people find documenting policies boring, as do we, but without having
the desired practices around your organization’s data written in black and
white, you’re asking for trouble. We’ve seen this with many early Big Data
projects, where teams went off and played with new technologies without
having firm documented goals. More often than not, the lack of documented
policies indicates a potential problem. Documented policies are like a Big
Data constitution for your organization: They spell out expectations and ordinances for how data is to be handled, not to mention that they come in handy
if you’re ever subjected to an audit. But step back for a moment and try to
imagine any highly invested ecosystem for which you don’t have a set of
well-documented policies. Do you have a credit card? Ever rent a car? Gym
membership? Policies set out the rules, regulations, and remediation actions for a service, so why shouldn’t your data be governed by them?
4. Value Creation
Mature organizations tend to be good at identifying the value of their data
assets. In the same way that analytics helps organizations to make better deci-
sions, quantifying the importance that data assets have to an organization
helps IT departments prioritize how they deal with these assets. For example,
which data sets need to be hosted on tier-one storage and are subject to per-
formance requirements (such as a service level agreement)? Or which data
sets are most important to cleanse? These questions are equally relevant to
data sets that people call Big Data. For example, much of the data that is har-
vested from social sources is sparse and poorly identified, but there might be
little value in applying the same quality techniques to this type of data that
you would apply to relational data. The bottom line is that all data is not cre-
ated equal: It doesn’t all have to be as fast or as available or as cleansed, and
it’s not all going to have the same value. As kids, all the authors of this book
collected hockey cards (yes, even the U.S. ones) and we recounted about how
we would keep them in an ordered fashion that reflected their value. You
should have value statements around your Big Data sources and they will
dictate how the data is used, trusted, curated, invested in, and so on.
7. Data Architecture
Well-planned and well-documented data architectures might sound like an
obvious part of a data governance model, but we’ve been in enough cus-
tomer situations where architecture has been an afterthought. A lack of archi-
tectural thinking is especially common when IT teams face urgent new
requirements and implement solutions that are quick and inexpensive. This
often results in costly changes being needed in the future.
policy; after all, the decision to keep something forever is a policy. What’s
needed is for this policy to be captured in metadata around the data in question.
Wrapping It Up
None of these principles of data governance should be looked at in isolation.
This is an interrelated set of practices—when you deal with one principle, you
need to consider the other ten principles as well. Good data governance is a
cultural thing: The things we outlined in this chapter are the virtues of a
strong culture. For example, no discussion of data risk management and com-
pliance is complete without considering information lifecycle management.
Applying the principles of data governance becomes increasingly impor-
tant as the complexity and volume of the data that is being managed con-
tinue to grow. This is not just about making life easier for analysts or DBAs,
or even about saving money with cheaper data centers. The success—even
survival—of organizations today hinges on the quality of their analytics,
which demands a mature implementation of data governance principles.
The remainder of this book focuses on tools and strategies to apply many of
the data governance principles that we’ve looked at in this chapter. The
chapters that follow aren’t part of any hype conversations: They cover con-
cepts that have been readily applied to the relational world for years. IBM
has gone to great lengths to extend good governance capabilities, be it on
premise or off premise via cloud services such as IBM DataWorks, to the Big
Data world including NoSQL and Hadoop. Take your time through these
chapters—you can save yourself a lot of headaches in the future.
12
Security Is NOT an
Afterthought
Data breach! Did the hair on the back of your neck stand up? Concerned
about hackers, crackers, and spies, oh my? Take a breath and relax. This is
a test. The authors are conducting a test of your built-in data breach fear
system to see whether this concept still scares you. If it doesn’t, do an Inter-
net search on the topic to see what the last year or two has brought to the
headlines. From retailers with credit card breaches to the personal profiles
of productivity software users to gamers—three breaches alone impacted
more than 250 million people, and that’s more people than the population
of every individual country in the world except for China, India, and the
United States. Now consider what happens to a public company’s market
capitalization when they are associated with a headline breach. In this
social-mobile-cloud world, more and more clients will give you their per-
sonal information for a deal or a better personalized marketing experience,
but if you give them the impression that you have not done all that you can
to protect their data, they will punish you…with their wallets.
There are two kinds of security breaches: those that have happened and
those that are going to happen. So, let’s start with the fact that no organization
is immune to security breaches. We don’t want you walking around thinking
“It will never happen to us.” Breaches can happen because of planned ill intent
(from outside or within) or merely by accident. In short, we are telling you this:
Where there is data, there is the potential for breaches and unauthorized access.
The other thing to consider is that when it comes to hackers executing a
successful security breach, they have to be successful only once. When your
job is to protect sensitive data, you have to be successful all the time.
The best advice we can give you is that security can’t be an afterthought—
it should actually be the first one. When you jump into the car for a drive, the
first thing you do is buckle up. Thrill seeker? We have yet to see a single
person who doesn’t check the safety device on that roller coaster before trav-
eling over hills at 140 kilometers per hour—that would be dangerous. So,
why operate like this with your—actually, other people’s—data? Every
sound security implementation begins with planning; you make choices to
govern data up front, not after the fact.
information, intellectual property, and more. When you’re done reading this
section, you’ll say, “These methods have been available for a long time in the
SQL database world!” That’s correct, and if they are available there and those
platforms store data, then they should be available for Hadoop (and other
NoSQL ecosystem databases) too.
The most important parts of this definition are the phrases “embodies
data policies” and “business process management.” Establishing data gover-
nance policies (the what, the how, and the rules) should occur before imple-
menting any data governance activities (this is the policy stuff we talked
about in Chapter 11). Ironically, this step is typically an afterthought and gets
overlooked by most organizations. It is not a difficult task, but it can be time
consuming and resource intensive to accomplish.
Simply put, you cannot govern what you do not understand. Be prepared to
invest in the required due diligence: take the time to define and establish the
data governance policies and rules that are needed, and publish them across
the organization so that everyone not only has access to the policies, but
understands them too; how else do you expect those responsible for managing
the data to protect it properly, or even to care about it? We
believe that a charter can embody these concepts, and we recommend you
document the journey to a charter for all to see and embrace.
Ensuring the clear definition and corporate understanding of your organi-
zation’s data privacy policies provides a substantial return on investment;
however, these benefits cannot be experienced unless the business commits
to making it happen. If this kind of investment is made up front, then policies
and rules on how data is governed and protected can be clearly established,
made accessible from anywhere at any time, be well understood throughout
the organization, and be enforced and implemented by the data stewards
and custodians who are responsible for the data.
Services in the IBM InfoSphere Information Governance Catalog (IGC)
provide all of these capabilities. The IGC aligns business goals and IT for bet-
ter information governance and trusted information. It is an information gov-
ernance encyclopedia with a comprehensive and intuitive set of services for
managing and storing data governance information, including business,
technical, and operational metadata. It can establish and define business policies
and rules, assign ownership to those who are responsible for enforcing and
implementing them, and catalog and store them in a central repository that
everyone who needs this information can reach from anywhere at any time.
In short, it empowers the next-generation information architecture
that we introduced you to in Chapter 4.
The definition of what constitutes sensitive data varies by country and even
by state or province within those countries. In the European Union (EU), for
example, the definition of sensitive data is much broader and extends to
identifiers such as trade union membership and even political beliefs. There
are cultural issues that vary geographically too. For example, “Right to be
Forgotten” policies are making their way through certain European coun-
tries. Countries such as Germany and Luxembourg tend to have broader
definitions of private data than countries such as Canada.
You should assume that any data about an identified or identifiable individ-
ual is considered to be personal data.
For example, another kind of PII is personal health information (PHI),
which is any information that is related to an individual’s health, condition,
diagnosis, or treatment. In the United States, this information is governed by
the Health Insurance Portability and Accountability Act (HIPAA), which
contains national standards for electronic healthcare transactions and
national identifiers for providers, insurance plans, and employers. One of the
first steps in protecting data is to understand your organization’s privacy
policies and to identify what is considered sensitive within your organiza-
tion, because of either legislation or good corporate citizenship. This has to
be done up front, before the fact, not after it.
The following services and capabilities are essential to any effective mask-
ing strategy for sensitive data:
• Substitute real values with fictional values such as names and addresses
by using replacement data that is provided from trusted sources such
as national postal services from different countries.
• Use hashing algorithms to ensure that you select the same consistent
replacement data based on a key value or a set of key values. This is
important to ensure referential integrity across multiple data
environments. For example, if you want to ensure that Rick Buglio
is masked to a different name such as John Smith, but always the
same masked name, you can use a hashing algorithm and assign a
hash value that consistently identifies Rick Buglio, like an employee
identification number or some other unique identifier. Of course, the
hash table must be secured, but this is a great way to ensure that your
application test cases won’t break down for certain kinds of context-
dependent data. (A minimal sketch of this repeatable masking idea follows
this list.)
• Use intelligent algorithms that are designed for specific types of PII
such as national identifiers (U.S. Social Security numbers, Canadian
social insurance numbers, and so on), personal identification numbers
(PINs), email addresses, phone numbers, driver’s license numbers,
credit cards and their associated verification numbers, dates, ages, and
so on. These algorithms are context-aware with respect to the data that
is being masked and produce a repeatable or random masked value
that is fictional but contextually correct. Repeatable means that every
time the original value is encountered and the same algorithm is
applied, the same masked value is produced. As its name implies, a
randomized masking service generates a contextually correct masked
value, based on the type of data, but the algorithm generates a random
value that is not based on the source value.
• Provide the ability to construct custom data masking methods through
the use of scripts or custom code to do additional processing on data
values before, during, or after data masking. The additional processing
uses programming techniques such as conditional logic, arithmetic
functions, operating system functions, data manipulation and
transformation functions, decryption and encryption, and so on.
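As a hedged illustration of the hashing approach described above—our own minimal sketch in Python, not code from any IBM product—the following shows how a keyed hash over a stable identifier can select the same fictional replacement value every time, which is what preserves referential integrity across environments. The secret key, the replacement list, and the employee identifier are assumptions made for the example.

    import hmac, hashlib

    # Fictional replacement values drawn from a trusted source (illustrative only).
    REPLACEMENT_NAMES = ["John Smith", "Jane Doe", "Alex Chen", "Maria Garcia"]
    SECRET_KEY = b"example-masking-key"  # like the hash table, this key must itself be secured

    def mask_name(employee_id: str) -> str:
        # Key the hash on a stable identifier (an employee ID here), not on the name
        # itself, so the same person always maps to the same masked value.
        digest = hmac.new(SECRET_KEY, employee_id.encode("utf-8"), hashlib.sha256).digest()
        index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
        return REPLACEMENT_NAMES[index]

    print(mask_name("1001"))  # the same employee ID...
    print(mask_name("1001"))  # ...always yields the same fictional name

Real masking services add context awareness (valid check digits, plausible dates, and so on), but the repeatability property shown here is the part that keeps test cases and joins intact.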
All of these masking methods can be valid and effective; choosing one
over the others depends on what your data masking requirements are and
what type of data you are masking.
The Optim Data Privacy solution, with its granular services that are compos-
able on or off premise through IBM DataWorks, provides all these capabilities
and more to mask structured data that is stored in relational database manage-
ment systems and a majority of nonrelational data stores, including Hadoop.
Think about how important this is to your polyglot environments (Chapters 2
and 4). It makes sense to have a consistent business policy that centrally con-
trols the masking of data. If the PII data happens to land in Hadoop and work
its way into an in-memory analytics database service such as dashDB that is
surfaced to the line of business (LOB), why should time and effort be spent on
redefining masking rules or perhaps missing them altogether in one persistent
data store? The set of masking services that are offered both on and off premise
by IBM enables you to have a consistent and persistent masking strategy across
almost any data asset in your environment.
Optim Data Privacy includes a wide variety of methods to mask data in
Hadoop. For example, it can be used to mask data in CSV and XML files that
already exist in HDFS, even if that data comes from RDBMS tables and is
converted into these (or other) data transfer formats. Optim masking ser-
vices can be used to mask data from an RDBMS and then load that masked
data into Hadoop, or can be used to extract and mask Hadoop data and then
load it into an RDBMS or the NoSQL stores we talked about in Chapter 2.
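To make that concrete, here is a generic, hedged sketch—again ours, not Optim code—of masking one sensitive column in a CSV file before the file is landed in HDFS. The file names, the column name, and the mask_value() helper are assumptions for illustration.

    import csv
    import hashlib

    def mask_value(value: str) -> str:
        # Stand-in for a real masking service: a repeatable, fictional-looking token.
        return "MASKED-" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]

    def mask_csv(in_path: str, out_path: str, sensitive_column: str) -> None:
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                row[sensitive_column] = mask_value(row[sensitive_column])
                writer.writerow(row)

    # mask_csv("customers.csv", "customers_masked.csv", "ssn")
    # The masked file can then be landed in Hadoop, for example:
    #   hdfs dfs -put customers_masked.csv /data/landing/customers/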
Optim Data Privacy services are also integrated with the business metadata,
policy, and rule capabilities of the IBM Information Governance Catalog.
This synergy delivers an unprecedented policy-
driven data privacy approach to protect sensitive data that can be directly
applied to Hadoop distributions such as IBM InfoSphere BigInsights for
Hadoop (BigInsights), the Cloudera Distribution for Hadoop (CDH), Horton-
works Data Platform (HDP), other Hadoop distributions, and non-Hadoop
NoSQL or SQL environments. These services empower data stewards and
custodians who are responsible for protecting sensitive data to use the Infor-
mation Governance Catalog data privacy policies, data classification terms,
and rules that were specifically designed to classify and mask sensitive data—
no matter where they reside. They enable the automatic classification of data
elements as sensitive, apply a best-practice data masking function to that
data, and enable the rich set of IBM masking services to implement and
enforce the data masking policy. Figure 12-1 gives you an example of just how
powerful this integration work is.
Figure 12-1 IBM Information Governance Catalog: data privacy policies and rules and
data classification information in one place for your enterprise
In Figure 12-1, you can see multiple capabilities that are offered by the
IBM Information Governance Catalog. On the left side of this figure is a large
number (greater than 40) of predefined data privacy rules that you can use to
mask sensitive data. For example, there is a random method to mask a Cana-
dian social insurance number (SIN), credit card number masking that uses
repeatable or random methods, and more. Using this capability through the
IBM DataWorks cloud service, you could mask a Canadian SIN and wher-
ever that data lands (perhaps in a cloud storage service), it is obfuscated.
On the right side of Figure 12-1, you can see that the Information Gover-
nance Catalog has a Data Classification category that contains a number of rich
services with defined data classification terms that can be used to mask sensi-
tive data such as account number, address, credit card verification number,
and more.
Figure 12-2 shows the tool set’s designer interface, which has an integrated
view into the Information Governance Catalog that empowers you to review,
understand, and use the data classifications to mask sensitive data.
Figure 12-2 The Optim Designer interface is integrated with the Information
Governance Catalog.
It might not be obvious at first, but the capabilities that are shown in
Figure 12-2 should astound you. The figure shows a data steward leveraging
a web-based management interface to apply enterprise-defined data privacy
classification terms and masking policies to columns in Hive (part of the
Hadoop ecosystem) through simple drag-and-drop operations. These ser-
vices automatically classify the column, identify it as sensitive, and then
apply and enforce the IBM best-practice data masking privacy service to
mask the data based on its classification. You can also add your own custom
data masking services that implement masking rules to protect sensitive data
and host them here too. Think about that: an enterprisewide corpus of
masking policies that can be applied across the polyglot data environment
through simple drag-and-drop gestures. If that phone number exists in Hive,
DB2, or Oracle, a simple gesture applies the same policy across those reposi-
tories. Folks, this is groundbreaking.
Figure 12-3 showcases an example of a Data Privacy Compliance report
that can be generated from the service’s operational dashboard.
This report makes it easy to track the status of data privacy policy compli-
ance across the enterprise. It reports on what columns (by table and data
store) have been classified with a data masking policy applied, classified with
no data masking policy applied, unclassified with a data masking policy
applied, or unclassified with no data masking policy applied. Understand
what is happening here: The report is discovering compliance metadata for
you to help you curate and govern your data assets. The report also provides
an overall status for those columns that are fully compliant and protected, as
well as alerts for those that are not. Finally, the report identifies whether a data
masking service has been assigned to implement the data masking policies.
This integration delivers a cross-enterprise, end-to-end, policy-driven
approach to data privacy that enables organizations to easily define and
establish data privacy policies, publish them for everyone to understand,
and then enforce and implement them to protect sensitive data.
Wrapping It Up
We hope that after reading this chapter you are reminded about, or perhaps
now fully appreciate, the vital importance of securing data, especially in
Hadoop environments. In a Big Data world, it’s even more challenging and
imperative to safeguard PII and sensitive information to minimize the risk of
data breaches and other unauthorized, suspicious, and malicious activities.
Hadoop data is different in many respects; the volumes are described with
words such as massive, and the content and format of the data coming from new
and different sources such as social media, machines, sensors, and call centers
can be unpredictable. It’s also true that a fair amount of data that will be (or is
being) stored in Hadoop systems is the same production data that is used by
organizations to run their critical business applications. Think back to the land-
ing zones concept that we cover in Chapter 4. Governance is about the data
being stored, not where it is stored; if that data needs to be protected, it must
be treated with equal importance and urgency, and secured properly, no
matter where it lands.
The good news is that there are now proven methods for securing and
protecting data in Hadoop. These methods have been available and used for
decades in the SQL world across distributed and mainframe systems. They
are reliable and can be applied to NoSQL and Hadoop systems to establish
critical data governance policies and rules; mask, redact, or encrypt sensitive
data; and monitor, audit, and control data access.
Today there are mature, industry-leading, on- and off-premise data secu-
rity solutions from IBM that have extended their capabilities to support
Hadoop environments.
13
Big Data Lifecycle
Management
In the months leading up to the writing of this book, we’ve been struck by
the number of times we’ve heard customers share stories about how their
data volumes are rapidly increasing. One financial services customer mar-
veled, “We bought a 25TB warehouse appliance two years ago, and at the
time I never imagined we’d fill it. And now here we are, asking for more
capacity.” Over and over we hear about how a terabyte no longer seems
like a huge amount of data.
Data governance and, in particular, lifecycle management were important
before, but they become an even more significant concern with so much addi-
tional data to manage. This impacts every facet of the lifecycle. As data is
generated, is sufficient metadata collected to enable analysts to find their nee-
dle in an increasingly larger stack of needles? Can the data integration strat-
egy (transfer and transformation) keep up with the data volumes? Can the
data quality tools handle the increased variations in the data sets being stored,
as well as the high data velocity? As growing numbers of analytics applica-
tions are being developed, for Hadoop as well as relational sources, how can
test data be managed consistently? And finally, can archiving facilities provide
more defensible, flexible, and cost-effective solutions, enabling data to be
taken off of warehouse systems in a governable manner, while having the
archives stored in Hadoop where they can still be queried? In presenting key
portions of IBM’s data integration and governance platform, this chapter will
answer all these questions for you, among others we are sure you have, and
also show how IBM is enabling governance capabilities to be consumed as
services in cloud environments.
Shared metadata is the basis for effective data integration. By sharing a com-
mon definition of terms and data transformations at the enterprise level, an
organization can quickly translate data from one system to another. In a fash-
ion not dissimilar to how you would browse for and buy a shirt online, the IGC
features tooling that enables analysts to browse through metadata collections
and simply “shop” for the data sets they want. You can see this interface in
Figure 13-1. Once you’ve found a data set you’re interested in, you can transfer
it to your sandbox (on InfoSphere BigInsights for Hadoop, another database,
or a cloud-based analytics service such as dashDB) for further analysis.
Once you’ve found a data collection you’re interested in, you can explore
the rich set of metadata, email users, provision replicas of the data for sand-
boxing purposes, and more. Figure 13-2 shows this exploration interface.
Effective sharing of metadata relies upon a common vocabulary, one that
business and IT can agree upon. The IGC manages the business definition of
the metadata—putting IT terms into a vocabulary that business users can
understand. Because data comes from many new sources, it also comes with
many new terms and in many new formats. In this era of Big Data, the need
for a common vocabulary has never been greater.
Metadata also documents data lineage (where the data comes from),
where it’s heading, and what happened to it along the way. Data lineage
metadata is one of the most powerful tools for building confidence in data,
because when it’s exposed to a business user, there is clear evidence that
what’s being looked at isn’t just another unmanaged replica of some random
data set. In Figure 13-3 you can see the Data Lineage view for a data set in the
IGC. This data lineage metadata can be populated in the IGC as it’s gener-
ated by ETL or data movement tools.
Data Integration
Information integration is a key requirement of a Big Data platform because
it enables you to leverage economies of scale from existing investments and
to discover new economies of scale as you expand the analytics paradigm. For
example, consider the case where you have a heavy SQL investment in a next
best offer (NBO) app. If this app were based on an SQL warehouse, it could
be enhanced with the ability to call a Hadoop job that would look at the
trending sentiment associated with a feature item stock-out condition. This
could help to determine the acceptability of such an offer before it’s made.
Having the ability to leverage familiar SQL to call a function that spawns a
MapReduce job in a Hadoop cluster to perform this sentiment analysis is not
only a powerful concept; it’s crucial from the perspective of maximizing the
value of your IT investment.
Perhaps it’s the case that you have a machine data analysis job running on
a Hadoop cluster and want to draw customer information from a system that
manages rebates, hoping to find a strong correlation between certain log
events and eventual credits.
Figure 13-5 A data-flow job that utilizes a combination of Big Data assets, including
source data in Hadoop joined with DB2 relational data, and various transformations to
classify risk
The job shown in Figure 13-5 analyzes emails (stored in Hadoop) for customer sentiment, and the results of that analy-
sis are used to update a profile in the warehouse (for example, the Customer
dimension); this is an example of risk classification based on email analytics.
Other integration technologies that are commonplace in today’s IT envi-
ronments (and remain important in a Big Data world) include real-time rep-
lication and federation. Real-time replication, utilizing a product such as IBM
InfoSphere Data Replication (Data Replication), involves monitoring a source
system and triggering a replication or change to the target system. This is
often used for low-latency integration requirements. Data Replication has
sophisticated functionality for high-speed data movement, conflict detec-
tion, system monitoring, and a graphical development environment for
designing integration tasks. Furthermore, it’s integrated with the set of IBM
PureData Systems for high-speed data loading/synchronization and also
with Information Server to accumulate changes and move data in bulk to a
target system.
We believe that organizations shouldn’t try to deliver enterprise integra-
tion solely with Hadoop; rather, they should leverage mature data integra-
tion technologies to help speed their deployments of Big Data whenever that
makes sense. There’s a huge gap between a general-purpose tool and a pur-
pose-built one, not to mention that integration involves many aspects other
than the delivery of data, such as discovery, profiling, metadata, and data
quality. We recommend that you consider using IBM Information Server
with your Big Data projects to optimize the loading (via bulk load or replica-
tion) of high-volume structured data into a data warehouse, the loading of
structured or semistructured data into Hadoop, and the collecting of infor-
mation that’s filtered and analyzed by stream analytics. You can then load
the data into a relational system (such as a data warehouse) and replicate
data sources to a Hadoop cluster or other data warehouse.
Data Quality
Data quality components can be used to ensure the cleanliness and accuracy
of information. IBM InfoSphere Information Server for Data Quality (IIS for
DQ) is a market-leading data quality product containing innovative features,
such as information profiling and quality analysis, address standardization
and validation, and so on. It’s fully integrated into the IGC, which allows you
to develop quality rules and share metadata too. You can also use the IIS for
DQ services while you’re executing quality jobs on Information Server’s par-
allel processing platform. Data quality discussions typically involve the fol-
lowing services (a brief standardization and validation sketch follows the list):
• Parsing Separating data and parsing it into a structured format.
• Standardization Determining what data to place in which field and
ensuring that it’s stored in a standard format (for example, a nine-digit
ZIP code).
• Validation Ensuring that data is consistent; for example, a phone
number contains an area code and the correct number of digits for its
locale. It might also include cross-field validation, such as checking the
telephone area code against a city to ensure that it’s valid (for example,
area code 416 is valid for Toronto; 415 is not).
• Verification Checking data against a source of verified information
to ensure that the data is valid; for example, checking that an address
value is indeed a real and valid address.
• Matching Identifying duplicate records and merging those records
correctly.
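To make the standardization and validation services more tangible, here is a minimal, hedged Python sketch—ours, not IIS for DQ code. The city-to-area-code reference table and the phone format are assumptions chosen to mirror the 416/Toronto example above.

    import re

    # Illustrative reference data only: valid area codes per city.
    AREA_CODES_BY_CITY = {"Toronto": {"416", "647", "437"}, "San Francisco": {"415", "628"}}

    def standardize_phone(raw: str) -> str:
        """Parse free-form input and return a standard NNN-NNN-NNNN format."""
        digits = re.sub(r"\D", "", raw)            # parsing: keep digits only
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]                    # drop a leading country code
        if len(digits) != 10:
            raise ValueError(f"cannot standardize {raw!r}")
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

    def validate_phone(standardized: str, city: str) -> bool:
        """Cross-field validation: is the area code plausible for the city?"""
        return standardized.split("-")[0] in AREA_CODES_BY_CITY.get(city, set())

    phone = standardize_phone("(416) 555 0123")
    print(phone, validate_phone(phone, "Toronto"))    # 416-555-0123 True
    print(validate_phone("415-555-0123", "Toronto"))  # False: 415 is not a Toronto area code

Verification and matching go further, checking values against trusted sources and merging duplicate records, which is the territory Big Match covers in Chapter 14.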
Organizations should determine whether their Big Data sources require
quality checking before analysis and then apply the appropriate data quality
components. A Big Data project is likely going to require you to focus on data
quality when loading a data warehouse to ensure accuracy and complete-
ness, when loading and analyzing new sources of Big Data that will be inte-
grated with a data warehouse, and when Big Data analysis depends on a
more accurate view (for example, reflecting customer insight), even if the
data is managed within Hadoop.
The appeal of the Bluemix approach is that it’s all about the developer: All the hardware, underlying soft-
ware, and data are managed on IBM’s SoftLayer cloud so the developer doesn’t
have to worry about it. IBM Watson Analytics is a similar sort of ecosystem,
only its focus is on analysts, as opposed to developers. Watson Analytics brings
together a complete set of self-service analytics capabilities on the cloud. You
bring your problem, and Watson Analytics helps you acquire the data, cleanse
it, discover insights, predict outcomes, visualize results, create reports or dash-
boards, and collaborate with others.
Since they’re platforms that consume and produce data, both Bluemix and
Watson Analytics need governance and integration tools. And the traditional
software offerings we’ve talked about do not fit in this model. IBM’s engi-
neering teams are taking its rich data governance portfolio capabilities and
exposing them as a collection of composite and data refinery services, known
as IBM DataWorks. Figure 13-6 shows an example of some of the DataWorks
services available for Bluemix and Watson Analytics. (Also note that some of
the DataWorks services are part of dashDB.)
• Simple Load Service The Simple Load Service extracts data and
metadata from source data and creates the same schema in the target
database if it doesn’t exist. Once the schema is created, this service
loads the data into the target, offering high-speed data movement both
at a bulk level and at a granular level. For data in files, as opposed to
relational tables, this service will read a sample of the data to detect the
schema and then create a corresponding schema in the target database.
What gives this service additional credibility from a governance
perspective is its ability to automatically capture the data design and
collect operational lineage data. Cloud applications and analytics need
governance too!
• Masking Service The Masking Service masks sensitive data based on
the Optim masking capabilities we detailed in Chapter 12. Masking
support includes credit card numbers, Social Security numbers, dates,
and more.
• Selective Load Service The Selective Load Service does exactly what
its name implies: It loads only a subset of source data into a target
repository. The selection criteria can include constraints on the
metadata (for example, only a certain schema, tables, and columns)
and data values (for example, rows that meet a specified condition).
• Standardize Service The Standardize Service reads a string, identifies
“tokens” that have meaning, and then structures all identified tokens
into a correctly formatted result. These tokens can be entities such as
addresses, dates, or phone numbers.
• Match Data Service The Match Data Service provides probabilistic
matching to inspect multiple records, score the likelihood of a match,
and return a single master record, based on user options. This is
essentially a service-based offering of the Big Match product described
in Chapter 14.
• Profile Data Service The Profile Data Service analyzes the attributes
of individual data records to provide information on cardinality,
completeness, conformance, validity, and other characteristics. This
enables analysts to understand the characteristics of data so they can
better determine how much confidence to place in particular data values.
(A small profiling sketch follows this list.)
• Combine Data Service The Combine Data Service extracts data from
multiple sources, joins it (based on criteria defined by the user), and
then loads the combined set into the target database.
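As a hedged illustration of what profiling computes—our own sketch, not the DataWorks service—the following reports cardinality and completeness for each attribute of a small record set; the sample records are invented for the example.

    def profile(records):
        """Report cardinality and completeness for each column in a record set."""
        columns = {key for record in records for key in record}
        report = {}
        total = len(records)
        for column in sorted(columns):
            present = [r[column] for r in records if r.get(column) not in (None, "")]
            report[column] = {
                "cardinality": len(set(present)),   # distinct populated values
                "completeness": round(len(present) / total, 2) if total else 0.0,
            }
        return report

    records = [
        {"name": "Rick", "zip": "10001"},
        {"name": "Dirk", "zip": ""},
        {"name": "Paul", "zip": "94105"},
    ]
    print(profile(records))  # e.g., zip has cardinality 2 and completeness 0.67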
Wrapping It Up
For all the phases in the life of your data, IBM provides a range of mature
tools to help you apply the principles of data governance based on your busi-
ness requirements. The major benefit of IBM’s suite of tools is the central
storage of metadata in the Information Governance Catalog, which enables a
far more seamless experience as data is transformed, moved, and selected for
analytics. In addition, with IBM’s emphasis on cloud-friendly deployments,
many of these capabilities are now available as services as well.
14
Matching at Scale:
Big Match
Every Hadoop vendor’s marketing charts have the same bold proclama-
tions about storing all kinds of data in Hadoop and then being able to ana-
lyze it. “Imagine the joy of your analysts!” the marketers cheer. “With the
wonders of SQL on Hadoop, they’ll be able to query across all your data
sets, making new discoveries every day!” The IBM charts are no different,
and our marketing folks say it too; in fact, this book says it too. But each
time a veteran analyst hears those claims, the response is fear, not excite-
ment. We’ve seen it ourselves in our customer work, and the question is
always the same: “How (insert expletives here) can we easily query all our
data in Hadoop and make correlations when both the data formats and the
data are inconsistent?”
This is not a new problem, as anyone who’s tried to run queries against
relational databases from different organizations knows. That said, Big Data
makes it a bigger problem because of the amount, variety, and speed at which
data arrives at your doorstep. IBM has solutions to overcome this problem,
which is a key technical enabler for the landing zone architecture we
described back in Chapter 4. As a result, there’s a huge difference between
IBM’s claim about being able to easily query across these different data sets
and competitors making the same claim. The key differentiator here is a data
matching service called Big Match.
engine to find matches between the new entries and existing ones. When
done properly, matching data entries is a resource-intensive and complex
process. Consider the scenario with Daenerys Targaryen and her financial
institution; if her records are to be properly matched out of the millions of
other customer records, many sophisticated comparisons need to be made.
The end result is the generation of master data, which would include a
single record for Daenerys Targaryen, including all the data attributes
about her. Let’s say that Daenerys applied for a loan, and she added her
email address to the loan’s application form. After the loan application data
is added to a database, this record will eventually pass through the match-
ing engine. Daenerys’ email address is seen as a new data point, and a cor-
responding update is made to Daenerys Targaryen’s master data record.
Linkage information is also written to the master data, which records the
connections between the new data entries and the corresponding master
records (these are also known as golden records).
Matching on Hadoop
Now that we’ve walked through the generation of master data, you’re prob-
ably thinking, “Hey, Hadoop would make a lot of sense as a place to do
matching!” First, recall the landing zone pattern in the data zones model
from Chapter 4: Having all your data land in one place is a really good start-
ing point for any large-scale matching activity. This avoids a great deal of
additional data movement and complexity. Second, matching is computa-
tionally expensive, and as the number of records grows, it can become
impractical to keep the matching workloads on operational systems. Hadoop
is a great fit here because it’s good at large-scale operations such as matching,
and can do so inexpensively. And if more capacity is needed, it’s easy to add
servers. Third, in thinking back to the 360-degree view of the customer, imag-
ine the potential for incorporating data from nonrelational sources, such as
social data or call center records. Hadoop can easily store these other data
sets and make them available to matching technologies.
As with many good ideas, there’s a catch. Remember that matching, at
least when it’s done right, is extremely complex. And the trouble with run-
ning complex operations at scale on Hadoop is that they need to be able to
run in parallel. Although there are many matching engines for traditional
relational data, you’ll notice that outside IBM there are no mature, fully func-
tioning matching engines available for Hadoop. So, what do we mean by
“mature” and “fully functioning”? To give you an appreciation of the tech-
nology behind Big Match, the next section shows you the common approaches
for matching, their benefits, and their limitations.
Matching Approaches
When you think back to the pain that analysts endure as they struggle to
analyze data from different databases, the phrase “Necessity is the mother of
invention” comes to mind. Given all that pain, the state of the art in data
matching has advanced considerably. The following list summarizes the
most common approaches to matching, ranging from the simplest to the
most sophisticated approach:
• Rules-based matching Logical rules define when records match.
This is an inflexible approach and does not yield good results for large
data sets or when there are more than two or three data sources.
• Rules-based fuzzy matching More advanced than simple rules-based
matching, this approach includes code to factor in typos and misspellings.
This is still not sufficient to handle the complex, but common, scenarios
that we illustrated with Daenerys Targaryen’s example.
Algorithm
Configuration Files
HBase Tables
Before running the Big Match applications in BigInsights, you need to store
the record data in HBase tables. There is an additional Big Match table in
HBase where all the linkages from the processed records are connected with
the originating record.
How It Works
Big Match matches and links the data automatically as it is ingested. Before
you run matching operations, you need to configure the attributes that you
want to match (for example, a person’s name, their phone number, and
address) and the tolerances for matching. Big Match has a graphical configu-
ration tool that you can use to generate the configuration files that the Big
Match applications will consume. Following are the steps involved in the
matching process (a simplified scoring sketch follows the steps):
1. Load Ingest new records into the Big Match data table in HBase.
2. Derive A Big Match coprocessor standardizes and compacts each
record, optimizing it for statistical comparisons. A Big Match HBase
coprocessor then performs rough calculations to find the set of
potential matches (this is called a bucket).
3. Compare A Big Match HBase coprocessor scores the comparisons
from the bucket by using a number of statistical methods.
4. Link Based on the results of the scoring, the Big Match linking
application builds a set of discrete entities, where an entity represents
an individual (or organization). The Big Match HBase tables are
populated with master records and entity data that captures the
linkages between the original records.
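To give a feel for what the derive and compare steps do, here is a heavily simplified, hedged sketch of bucketing and similarity scoring. It is our illustration only, with invented field weights and a toy bucket key; the real Big Match engine uses far more sophisticated statistical methods.

    from difflib import SequenceMatcher

    def derive(record):
        """Compact a record into a rough bucket key so only plausible candidates are compared."""
        return (record.get("last_name", "")[:3] + record.get("postal", "")[:3]).upper()

    def compare(a, b):
        """Score two records with a weighted string similarity across a few attributes."""
        weights = {"first_name": 0.3, "last_name": 0.4, "phone": 0.3}
        return sum(
            weight * SequenceMatcher(None, a.get(field, ""), b.get(field, "")).ratio()
            for field, weight in weights.items()
        )

    existing = {"first_name": "Daenerys", "last_name": "Targaryen", "phone": "4165550123", "postal": "M5V"}
    incoming = {"first_name": "Danerys", "last_name": "Targaryn", "phone": "416-555-0123", "postal": "M5V"}

    if derive(existing) == derive(incoming):          # same bucket, so worth scoring
        print(round(compare(existing, incoming), 2))  # link the records if the score clears a threshold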
You can also run this process in batch mode by running the Big Match
apps from the BigInsights console.
So imagine you’ve got records about your customers from multiple data
sets, where you know there are inconsistencies for whatever reason (it
could be different date formats, data entry mistakes, or any of the reasons
we looked at with the Daenerys Targaryen example earlier). In addition,
even within each of the data sets, you know there are instances where
there are duplicate records for individual people. In short, this is a mess,
and this is the normal state for most businesses storing customer data.
What the linking application does is build a single set of entities represent-
ing individual people, and links each entity with all the matching records
for each individual person.
Extract
After you have generated a set of master data, you can use Big Match to gen-
erate an up-to-date set of system-of-record data. You can do this by running
the Big Match Extract app, which generates a set of delimited files that repre-
sents the system-of-record data. There are many uses for a trusted data set
like this—for example, you can then load the system-of-record data into a
warehouse to populate dimension tables.
Search
Having your customer data sets scored and linked with a master set of entities
opens up many possibilities. There are many use cases, as you’ll see later on in
this chapter, where taking a set of different records and searching for the
matching entity can be very useful. Big Match provides a REST interface for
external applications to invoke probabilistic searches against the matched
data. This REST interface is great for checking batches of records, but there are
also situations where it would be useful for someone to manually search for
matches one individual at a time. Big Match also includes an interactive match-
ing dashboard (see Figure 14-3), where you can enter data points for people
and see the corresponding records. As you can see in Figure 14-3, with an
entity-based search, you can overcome data quality issues (like mistakenly
swapping first name and last name) to get meaningful results. Other similar
records will also be shown, but with lower scores.
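For the batch case, an external application might call the REST interface along these lines. This is a hedged sketch only; the host name, path, and parameter names are hypothetical placeholders, not the documented Big Match API.

    import json
    from urllib import request

    def probabilistic_search(host, attributes):
        """POST a set of attributes and return candidate entities with match scores."""
        body = json.dumps({"attributes": attributes}).encode("utf-8")
        req = request.Request(
            f"https://{host}/bigmatch/api/search",     # hypothetical endpoint
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as response:
            return json.loads(response.read())         # e.g., [{"entity_id": ..., "score": ...}]

    # Even with first and last names swapped, an entity-based search should still
    # return the right master record with a high score:
    # probabilistic_search("bigmatch.example.com", {"first_name": "Targaryen", "last_name": "Daenerys"})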
could be information from call center logs, forms, emails, and social media.
While for the most part relational databases are heavily biased to tightly
organized structured data, Hadoop doesn’t discriminate on the kinds of data
it stores.
Many companies talk a big game when it comes to analyzing text data
(some people call it unstructured data), but once you look past their market-
ing slides and hard-coded demos, there’s not much behind the curtain that
you could get value from (think about the closing chapter in The Wizard of Oz
as a good analogy). IBM Research has built an entity analytics engine that’s
capable of making sense of text data in a repeatable way, without requiring
special programming. For example, from call center logs you could derive
sentiment on how your customers feel about your offerings. Or consider
social data: Twitter has become a public place for people to express their
immediate reactions to the world around them. But how do you link Twitter
accounts to actual customer records? With the combination of Big Match and
IBM’s entity analytics engine, it’s now possible to automatically discover
(assuming they were not provided to you) your customers’ Twitter accounts.
IBM Research worked with two nameplate clients to refine and iterate this
technology: a large U.S. bank and a large U.S. retailer—both with millions of
customers. In both cases, Big Match was able to identify customers’ Twitter
accounts with 90 percent precision! (This means that for the millions of Twit-
ter accounts that were each associated with a customer, there was 90 percent
accuracy.) Of course, many customers don’t have Twitter accounts, but for
those that do, they represent a sizable sample of the full set of customers. This
gives you a wealth of additional information not only about how different
segments of customers feel about your products and services, but also attribute
information such as hobbies, age, sex, and so on.
customers, their account manager, and their investments. IBM was able to fur-
ther enhance these applications using Big Match by providing much more
accurate relationship information for Watson Explorer to consume.
and insurance payers and maintain a large set of reference data. As prescrip-
tion requests come into the system, the company can quickly score them and
return recommendations to the insurance payers and pharmacies.
Wrapping It Up
Increasingly, businesses are storing operational data that captures their sys-
tems of engagement in Hadoop. This is often customer data, and where there
is customer data, invariably there is complexity. Hadoop is an ideal platform
to land system-of-engagement data, but if analytics users are to get any value
out of it, a matching solution is essential. IBM has made its industry-leading
probabilistic matching engine available in Big Match, which offers a match-
ing solution that can run at Hadoop scale, and handle complex data.
In this chapter we shared with you a number of Big Match use cases. You
also learned about how Big Match has a number of patterns that you can
deploy to solve some of the more pressing business challenges today. We
think you are going to find that while the details of the use case change from
enterprise to enterprise or industry to industry, there’s a repeating pattern
that can unlock value in the data like never before. Take the knowledge you’ve
learned in this chapter and start to think about how Big Match can help solve
your pressing needs, and now you’re taking Big Data beyond the hype.
Additional Skills Resources
InfoSphere BigInsights for Hadoop Community
Rely on the wide range of IBM experts, programs, and services that are avail-
able to help you take your Big Data skills to the next level. Participate with us
online in the InfoSphere BigInsights for Hadoop Community. Find whitepa-
pers; videos; demos; BigInsights downloads; links to Twitter, blogs, and
Facebook pages; and more.
Visit https://developer.ibm.com/hadoop/
Twitter
For the latest news and information as it happens, follow us on Twitter:
@IBMBigData, @Hadoop_Dev, and @IBMAnalytics
developerWorks
On developerWorks, you’ll find deep technical articles and tutorials that can
help you build your skills to the mastery level. Also find downloads of free
and trial versions of software to try today.
Visit www.ibm.com/developerworks/analytics/
Blogs
A team of experts regularly writes blog posts related to the full spectrum of Big
Data topics. Bookmark the “Stream Computing” page and check often to
stay on top of industry trends.
Visit www.ibmbigdatahub.com/technology/stream-computing or www
.ibmbigdatahub.com/technology/all
This is part of The Big Data & Analytics Hub (ibmbigdatahub.com) that is
populated with the content from thought leaders, subject matter experts, and
Big Data practitioners (both IBM and third-party thinkers). The Big Data &
Analytics Hub is your source for information, content, and conversation
regarding Big Data analytics for the enterprise.
IBM Redbooks
IBM Redbooks publications are developed and published by the IBM Inter-
national Technical Support Organization (ITSO). The ITSO develops and
delivers skills, technical know-how, and materials to IBM technical profes-
sionals, business partners, clients, and the marketplace in general.
Visit http://ibm.com/redbooks