Fast Data:
Smart and at Scale
Design Patterns and Recipes
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Fast Data: Smart and at Scale, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-94038-9
Foreword
Fast Data Application Value
So how do you combine real-time, streaming analytics with real-time decisions in an architecture that's reliable, scalable, and simple? You could do it yourself using a batch/streaming approach that would require a lot of infrastructure and effort; or you could build your app on a fast, distributed data processing platform with support for per-event transactions, streaming aggregations combined with per-event ACID processing, and SQL. This approach would simplify app development and enhance performance and capability.
This report examines how to develop apps for fast data, using well-recognized, predefined patterns. While our expertise is with VoltDB's unified fast data platform, these patterns are general enough to suit both the do-it-yourself, hybrid batch/streaming approach as well as the simpler, in-memory approach.
Our goal is to create a collection of fast data app development recipes. In that spirit, we welcome your contributions, which will be tested and included in future editions of this report. To submit a recipe, send a note to [email protected].
This report is structured into four main sections: an introduction to fast data, with advice on identifying and structuring fast data architectures; a chapter on ACID and CAP, describing why it's important to understand the concepts and limitations of both in a fast data architecture; four chapters, each a recipe/design pattern for writing certain types of streaming/fast data applications; and a glossary of terms and concepts that will aid in understanding these patterns. The recipe portion of the book is designed to be easily extensible as new common fast data patterns emerge. We invite readers to submit additional recipes at [email protected].
Into a world dominated by discussions of big data, fast data has been born with little fanfare. Yet fast data will be the agent of change in the information-management industry, as we will show in this report.
Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints: mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Real-time applications built on top of fast data are changing the game for businesses that are data dependent: telco, financial services, health/medical, energy, and others. It's also changing the game for developers, who must build applications to handle increasing streams of data.1
We're all familiar with big data. It's data at rest: collections of structured and unstructured data, stored in Hadoop and other data lakes, awaiting historical analysis. Fast data, by contrast, is streaming data: data in motion. Fast data demands to be dealt with as it streams into the enterprise in real time. Big data can be dealt with some other time, typically after it's been stored in a Hadoop data warehouse, and analyzed via batch processing.

1 Where is all this data coming from? We've all heard the statement that data is doubling every two years, the so-called Moore's Law of data. And according to the oft-cited EMC Digital Universe Study (2014), which included research and analysis by IDC, this statement is true. The study states that data will multiply 10-fold between 2013 and 2020, from 4.4 trillion gigabytes to 44 trillion gigabytes. This data, much of it new, is coming from an increasing number of new sources: people, social, mobile, devices, and sensors. It's transforming the business landscape, creating a generational shift in how data is used, and a corresponding market opportunity. Applications and services tapping this market opportunity require the ability to process data fast.
A stack is emerging across verticals and industries to help developers build applications to process fast streams of data. This fast data stack has a unique purpose: to process real-time data and output recommendations, analytics, and decisions (transactions) in milliseconds (billing authorization and up-sell of service level, for example, in telecoms), although some fast data use cases can tolerate up to minutes of latency (energy sensor networks, for example).
Streaming Analytics
As data is created, it arrives in the enterprise in fast-moving streams. Data in a stream may arrive in many data types and formats. Most often, the data provides information about the process that generated it; this information may be called messages or events. This includes data from new sources, such as sensor data, as well as clickstreams from web servers, machine data, and data from devices, events, transactions, and customer interactions.
The increase in fast data presents the opportunity to perform analytics on data as it streams in, rather than post-facto, after it's been pushed to a data warehouse for longer-term analysis. The ability to analyze streams of data and make in-transaction decisions on this fresh data is the most compelling vision for designers of data-driven applications.
Queryable Cache
Queries that make a decision on ingest are another example of using fast data front-ends to deliver business value. For example, a click event arrives in an ad-serving system, and we need to know which ad was shown, and analyze the response to the ad. Was the click fraudulent? Was it a robot? Which customer account do we debit because the click came in and it turns out that it wasn't fraudulent? Using queries that look for certain conditions, we might ask questions such as: "Is this router under attack based on what I know from the last hour?" Another example might deal with SLAs: "Is my SLA being met based on what I know from the last day or two? If so, what is the contractual cost?" In this case, we could populate a dashboard that says SLAs are not being met, and it has cost n in the last week. Other deep analytical queries, such as "How many purple hats were sold on Tuesdays in 2015 when it rained?" are really best served by systems such as Hive or Impala. These types of queries are ad hoc and may involve scanning lots of data; they're typically not fast data queries.
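To make the idea concrete, here is a minimal, in-memory sketch of a queryable cache: per-router request counters bucketed by minute, queried on ingest to answer "is this router under attack based on the last hour?" The class and method names (RouterStats, record, isUnderAttack) and the threshold parameter are illustrative, not part of any product API; a production fast data platform would maintain and query this state transactionally rather than inside a single JVM.

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// A minimal sketch of a queryable cache: per-router request counters,
// bucketed by minute, that can be queried on ingest to make a decision.
public class RouterStats {
    // (routerId -> epochMinute -> request count)
    private final Map<String, Map<Long, LongAdder>> counts = new ConcurrentHashMap<>();

    // Called for every incoming event (the ingest path).
    public void record(String routerId, Instant eventTime) {
        long minute = eventTime.getEpochSecond() / 60;
        counts.computeIfAbsent(routerId, r -> new ConcurrentHashMap<>())
              .computeIfAbsent(minute, m -> new LongAdder())
              .increment();
    }

    // Decision query: sum the last 60 minute-buckets and compare to a threshold.
    public boolean isUnderAttack(String routerId, Instant now, long threshold) {
        Map<Long, LongAdder> perMinute = counts.get(routerId);
        if (perMinute == null) return false;
        long currentMinute = now.getEpochSecond() / 60;
        long total = 0;
        for (long m = currentMinute - 59; m <= currentMinute; m++) {
            LongAdder c = perMinute.get(m);
            if (c != null) total += c.sum();
        }
        return total > threshold;
    }
}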
Fast data is transformative. The most significant uses for fast data apps have been discussed in prior chapters. Key to writing fast data apps is an understanding of two concepts central to modern data management: the ACID properties and the CAP theorem, addressed in this chapter. It's unfortunate that in both acronyms the "C" stands for "Consistency," but actually means completely different things. What follows is a primer on the two concepts and an explanation of the differences between the two C's.
What Is ACID?
The idea of transactions, their semantics and guarantees, evolved with data management itself. As computers became more powerful, they were tasked with managing more data. Eventually, multiple users would share data on a machine. This led to problems where data could be changed or overwritten out from under users in the middle of a calculation. Something needed to be done; so the academics were called in.
The rules were originally defined by Jim Gray in the 1970s, and the acronym was popularized in the 1980s. ACID transactions solve many problems when implemented to the letter, but have been engaged in a push-pull with performance tradeoffs ever since. Still, simply understanding these rules can educate those who seek to bend them.
A transaction is a bundling of one or more operations on database state into a single sequence. Databases that offer transactional semantics offer a clear way to start, stop, and cancel (or roll back) a set of operations (reads and writes) as a single logical meta-operation.
But transactional semantics do not make a transaction. A true transaction must adhere to the ACID properties. ACID transactions offer guarantees that absolve the end user of much of the headache of concurrent access to mutable database state.
From the seminal Google F1 Paper:
The system must provide ACID transactions, and must always
present applications with consistent and correct data. Designing
applications to cope with concurrency anomalies in their data is
very error-prone, time-consuming, and ultimately not worth the
performance gains.
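As a concrete illustration of that bundling, the following sketch uses plain JDBC against a generic SQL database to group two updates into one transaction: both take effect together or neither does. The accounts table, column names, and connection URL are hypothetical; the point is only the start/commit/rollback structure, not any particular product's API.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// A minimal sketch of bundling two writes into one transaction using plain JDBC.
public class TransferExample {
    public static void transfer(String url, long fromAcct, long toAcct, long cents)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);           // start the transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, cents);
                debit.setLong(2, fromAcct);
                debit.executeUpdate();
                credit.setLong(1, cents);
                credit.setLong(2, toAcct);
                credit.executeUpdate();
                conn.commit();                   // make both writes visible atomically
            } catch (SQLException e) {
                conn.rollback();                 // cancel the whole bundle on any failure
                throw e;
            }
        }
    }
}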
How Is CAP Consistency Different from ACID Consistency?
ACID consistency is all about database rules. If a schema declares that a value must be unique, then a consistent system will enforce uniqueness of that value across all operations. If a foreign key implies deleting one row will delete related rows, then a consistent system will ensure the state can't contain related rows once the base row is deleted.
CAP consistency promises that every replica of the same logical value, spread across nodes in a distributed system, has the same exact value at all times. Note that this is a logical guarantee, rather than a physical one. Due to the speed of light, it may take some nonzero time to replicate values across a cluster. The cluster can still present a logical view by preventing clients from viewing different values at different nodes.
The most interesting confluence of these concepts occurs when systems offer more than a simple key-value store. When systems offer some or all of the ACID properties across a cluster, CAP consistency becomes more involved. If a system offers repeatable reads, compare-and-set, or full transactions, then to be CAP consistent, it must offer those guarantees at any node. This is why systems that focus on CAP availability over CAP consistency rarely promise these features.
Idea in Brief
Increasing numbers of high-speed transactional applications are being built: operational applications that transact against a stream of incoming events for use cases like real-time authorization, billing, usage, operational tuning, and intelligent alerting. Writing these applications requires combining real-time analytics with transaction processing.
Transactions in these applications require real-time analytics as inputs. Recalculating analytics from base data for each event in a high-velocity feed is impractical. To scale, maintain streaming aggregations that can be read cheaply in the transaction path. Unlike periodic batch operations, streaming aggregations maintain consistent, up-to-date, and accurate analytics needed in the transaction path.
This pattern trades ad hoc analytics capability for high-speed access to analytic outputs that are known to be needed by an application. This trade-off is necessary when calculating an analytic result from base data for each transaction is infeasible.
Let's consider a few example applications to illustrate the concept.
Pattern: Reject Requests Past a Threshold
Consider a high-request-volume API that must implement sophisticated usage metrics for groups of users and individual users on a per-operation basis. Metrics are used for multiple purposes: they are used to derive usage-based billing charges, and they are used to enforce a contracted quality of service standard (expressed as a number of requests per second, per user, and per group). In this case, the operational platform implementing the policy check must be able to maintain fast counters for API operations, for users and for groups. These counters must be accurate (they are inputs to billing and quality of service policy enforcement), and they must be accessible in real time to evaluate and authorize (or deny) new requests.
In this scenario, it is necessary to keep a real-time balance for each
user. Maintaining the balance accurately (granting new credits,
deducting used credits) requires an ACID OLTP system. That same
system requires the ability to maintain high-speed aggregations.
Combining real-time, high-velocity streaming aggregations with
transactions provides a scalable solution.
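A minimal, process-local sketch of that authorize-or-reject decision follows. The class, field, and method names are invented for illustration; in the pattern described above, this logic would execute as a single ACID transaction against the operational store so that the counter read, the quota check, and the credit deduction are atomic per event. Here a synchronized block merely stands in for that isolation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A process-local sketch of per-user rate limiting plus credit accounting.
public class QuotaAuthorizer {
    static final class Account {
        long creditsRemaining;
        long requestsThisSecond;
        long currentSecond;
        Account(long credits) { this.creditsRemaining = credits; }
    }

    private final Map<String, Account> accounts = new ConcurrentHashMap<>();
    private final long maxRequestsPerSecond;

    public QuotaAuthorizer(long maxRequestsPerSecond) {
        this.maxRequestsPerSecond = maxRequestsPerSecond;
    }

    public void addUser(String userId, long credits) {
        accounts.put(userId, new Account(credits));
    }

    // Returns true if the request is admitted; false if it is rejected.
    public boolean authorize(String userId, long costInCredits, long nowEpochSecond) {
        Account acct = accounts.get(userId);
        if (acct == null) return false;
        synchronized (acct) {                              // stands in for transaction isolation
            if (acct.currentSecond != nowEpochSecond) {    // roll the per-second counter
                acct.currentSecond = nowEpochSecond;
                acct.requestsThisSecond = 0;
            }
            if (acct.requestsThisSecond >= maxRequestsPerSecond) return false; // QoS limit
            if (acct.creditsRemaining < costInCredits) return false;           // billing limit
            acct.requestsThisSecond++;
            acct.creditsRemaining -= costInCredits;        // deduct used credits
            return true;
        }
    }
}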
Related Concepts
Pre-aggregation is a common technique for which many algorithms and features have been developed. Materialized views, probabilistic data structures (examples: HyperLogLog, Bloom filters), and windowing are common techniques for implementing efficient real-time aggregation and summary state.
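As one example of a probabilistic data structure, here is a minimal Bloom filter written from scratch: a compact set-membership summary that can report false positives but never false negatives, which makes it useful as a cheap pre-check in real-time aggregation paths. This is a sketch for illustration only; production systems typically use a hardened library implementation with better hash functions.

import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// A minimal Bloom filter: compact, probabilistic "have I seen this key?" structure.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    private int hash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h = h * 31 + b;      // simple multiplicative hash; real filters use murmur/xxhash
        }
        return Math.floorMod(h, numBits);
    }

    public void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(data, i + 1));
        }
    }

    public boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(data, i + 1))) return false;   // definitely never added
        }
        return true;                                          // probably added
    }
}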
Idea in Brief
Processing big data effectively often requires multiple database engines, each specialized to a purpose. Databases that are very good at event-oriented real-time processing are likely not good at batch analytics against large volumes. Some systems are good for high-velocity problems. Others are good for large-volume problems. However, in most cases, these systems need to interoperate to support meaningful applications.
Minimally, data arriving at the high-velocity, ingest-oriented systems needs to be processed and captured into the volume-oriented systems. In more advanced cases, reports, analytics, and predictive models generated from the volume-oriented systems need to be communicated to the velocity-oriented system to support real-time applications. Real-time analytics from the velocity side need to be integrated into operational dashboards or downstream applications that process real-time alerts, alarms, insights, and trends.
In practice, this means that many big data applications sit on top of a platform of tools. Usually the components of the platform include at least a large shared storage pool (like HDFS), a high-performance BI analytics query tool (like a columnar SQL system), a batch processing system (MapReduce or perhaps Spark), and a streaming system. Data and processing outputs move between all of these systems. Designing that dataflow (designing a processing pipeline that coordinates these different platform components) is key to solving many big data challenges.
1. Where does data that cannot be pushed (or pulled) through the pipeline rest? Which components are responsible for the durability of stalled data?
2. How do systems recover? Which systems are systems of record, meaning they can be recovery sources for lost data or interrupted processing?
3. What is the failure and availability model of each component in the pipeline?
4. When a component fails, which other components become unavailable? How long can upstream components maintain functionality (for example, how long can they log processed work to disk)? These numbers inform your recovery time objective (RTO).
Idea in Brief
Most streaming applications move data through multiple processing stages. In many cases, events are landed in a queue and then read by downstream components. Those components might write new data back to a queue as they process, or they might directly stream data to their downstream components. Building a reliable data pipeline requires designing failure-recovery strategies.
With multiple processing stages connected, eventually one stage will fail, become unreachable, or otherwise be unavailable. When this occurs, the other stages continue to receive data. When the failed component comes back online, typically it must recover some previous state and then begin processing new events. This recipe discusses where to resume processing.
There are a few factors that complicate these problems and lead to
different trade-offs.
First, it is usually uncertain what the last processed event was. It is typically not technologically feasible, for example, to two-phase commit the event processing across all pipeline components. Typically, unreliable communication between processing components means the fate of events in-flight near the time of failure is unknown.
Second, event streams are often partitioned (sharded) across a number of processors. Processors and upstream sources can fail in arbitrary combinations. Picturing a single, unified event flow is often an insufficient abstraction. Your system should assume that a single partition of events can be omitted from an otherwise available stream due to failure.
This leads to three options for resuming processing, distinguished by how events near the failure time are handled. Approaches to solving the problem follow.
Idea in Brief
When dealing with streams of data in the face of possible failure,
processing each datum exactly once is extremely difficult. When the
processing system fails, it may not be easy to determine which data
was successfully processed and which data was not.
Traditional approaches to this problem are complex, require strongly consistent processing systems, and require smart clients that can determine through introspection what has or hasn't been processed.
As strongly consistent systems have become more scarce, and throughput needs have skyrocketed, this approach often has been deemed unwieldy and impractical. Many have given up on precise answers and chosen to work toward answers that are as correct as possible under the circumstances. The Lambda Architecture proposes doing all calculations twice, in two different ways, to allow for cross-checking. Conflict-free replicated data types (CRDTs) have been proposed as a way to add data structures that can be reasoned about when using eventually consistent data stores.
If these options are less than ideal, idempotency offers another path. An idempotent operation is an operation that has the same effect no matter how many times it is applied. The simplest example is setting a value. If I set x = 5, then I set x = 5 again, the second action doesn't have any effect. How does this relate to exactly-once processing? For idempotent operations, there is no effective difference between at-least-once processing and exactly-once processing, and at-least-once processing is much easier to achieve.
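The following small sketch makes the point concrete by delivering the same event twice against a map of state: the idempotent "set x = 5" leaves the same result after a duplicate delivery, while a non-idempotent increment double-counts. The class and variable names are illustrative only.

import java.util.HashMap;
import java.util.Map;

// Contrasts an idempotent update (set) with a non-idempotent one (increment)
// when the same event is delivered twice, as happens under at-least-once processing.
public class IdempotencyDemo {
    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();

        // Idempotent: applying the same "set x = 5" event twice leaves x = 5.
        state.put("x", 5);
        state.put("x", 5);
        System.out.println("after duplicate set:       x = " + state.get("x")); // 5

        // Not idempotent: replaying "add 5 to y" double-counts.
        state.merge("y", 5, Integer::sum);
        state.merge("y", 5, Integer::sum);
        System.out.println("after duplicate increment: y = " + state.get("y")); // 10
    }
}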
Leveraging the idempotent setting of values in eventually consistent systems is one of the core tools used to build robust applications on these platforms. Nevertheless, setting individual values is a much weaker tool than the ACID-transactional model that pushed data management forward in the late 20th century. CRDTs offer more, but come with rigid constraints and restrictions. They're still a dangerous thing to build around without a deep understanding of what they offer and how they work.
With the advent of consistent systems that truly scale, a broader set of idempotent processing can be supported, which can improve and simplify past approaches dramatically. ACID transactions can be built that read and write multiple values based on business logic, while offering the same effects if repeatedly executed.
1. Inserting items into Kafka has all of the same problems as any other distributed system. Managing exactly-once insertion into Kafka is not easy, and Kafka doesn't offer the right tools (at this time) to manage idempotency when writing to the Kafka topic.
2. If the Kafka cluster is restarted or switched, topic offsets may no longer be unique. It may be possible to use a third value, e.g., a Kafka cluster ID, to make the event unique (see the sketch after this list).
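Here is a hedged sketch of the approach suggested in note 2: derive a unique event identifier from the cluster ID, topic, partition, and offset, and skip events whose identifier has already been applied. The class and method names are hypothetical, and the in-memory set stands in for state that, in practice, would be stored and updated in the same transaction as the event's effects.

import java.util.HashSet;
import java.util.Set;

// Builds a unique event identifier from (cluster ID, topic, partition, offset)
// and applies each event at most once, making at-least-once delivery safe.
public class DedupingConsumer {
    private final Set<String> processedIds = new HashSet<>();

    public static String eventId(String clusterId, String topic, int partition, long offset) {
        return clusterId + "/" + topic + "/" + partition + "/" + offset;
    }

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean processOnce(String clusterId, String topic, int partition, long offset,
                               Runnable applyEvent) {
        String id = eventId(clusterId, topic, partition, offset);
        if (!processedIds.add(id)) {
            return false;            // already seen: applying it again would double-count
        }
        applyEvent.run();            // in practice, same transaction as recording the id
        return true;
    }
}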
The app must ingest these events and compute average hold time
globally.
Glossary

Determinism
…will always have the exact same result given a particular input and state. Determinism is important in replication. A deterministic operation can be applied to two replicas, assuming the results will match. Determinism is also useful in log replay. Performing the same set of deterministic operations a second time will give the same result.

Dimension Data
Dimension data is infrequently changing data that expands upon data in fact tables or event records. For example, dimension data may include products for sale, current customers, and current salespeople. The record of a particular order might reference rows from these tables so as not to duplicate data. Dimension data not only saves space, it allows a product to be renamed and have that rename reflected in all open orders instantly. Dimensional schemas also allow easy filtering, grouping, and labeling of data. In data warehousing, a single fact table, a table storing a record of facts or events, combined with many dimension tables full of dimension data, is referred to as a star schema.

ETL
Extract, transform, load is the traditional sequence by which data is loaded into a database. Fast data pipelines may either compress this sequence, or perform analysis on or in response to incoming data before it is loaded into the long-term data store.

Exponential Backoff
Exponential backoff is a way to manage contention during failure. Often, during failure, many clients try to reconnect at the same time, overloading a recovering system. Exponential backoff is a strategy of exponentially increasing the timeouts between retries on failure. If an operation fails, wait one second to retry. If that retry fails, wait two seconds, then four seconds, and so on. This allows simple one-off failures to recover quickly, but for more complex failures, there will eventually be a low enough load to successfully recover. Often the growing timeouts are capped at some large number to bound recovery times, such as 16 seconds or 32 seconds.

Fast Data
The processing of streaming data at real-time velocity, enabling instant analysis, awareness, and action. Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints: mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Systems and applications designed to take advantage of fast data enable companies to make real-time, per-event decisions that have direct, real-time…
About the Authors
Ryan Betts is one of the VoltDB founding developers and is presently VoltDB CTO. Ryan came to New England to attend WPI. He graduated with a B.S. in Mathematics and has been part of the Boston tech scene ever since. Ryan has been designing and building distributed systems and high-performance infrastructure software for almost 20 years. Chances are, if you've used the Internet, some of your ones and zeros passed through a slice of code he wrote or tested.
John Hugg, founding engineer & Manager of Developer Relations at VoltDB, specializes in the development of databases, information management software, and distributed systems. As the first engineer on the VoltDB product, he worked with the team of academics at MIT, Yale, and Brown to build H-Store, VoltDB's research prototype. John also helped build the world-class engineering team at VoltDB to continue development of the company's open source and commercial products. He holds a B.S. in Mathematics and Computer Science and an M.S. in Computer Science from Tufts University.