Fast Data: Smart and at Scale
Design Patterns and Recipes

Streaming Analytics with Transactions
Foreword

Fast Data Application Value
So how do you combine real-time, streaming analytics with real-time decisions in an architecture that’s reliable, scalable, and simple? You could do it yourself using a batch/streaming approach that would require a lot of infrastructure and effort; or you could build your app on a fast, distributed data processing platform with support for per-event transactions, streaming aggregations combined with per-event ACID processing, and SQL. This approach would simplify app development and enhance performance and capability.

This report examines how to develop apps for fast data, using well-recognized, predefined patterns. While our expertise is with VoltDB’s unified fast data platform, these patterns are general enough to suit both the do-it-yourself, hybrid batch/streaming approach as well as the simpler, in-memory approach.

Our goal is to create a collection of “fast data app development recipes.” In that spirit, we welcome your contributions, which will be tested and included in future editions of this report. To submit a recipe, send a note to [email protected].
This report is structured into four main sections: an introduction to fast data, with advice on identifying and structuring fast data architectures; a chapter on ACID and CAP, describing why it’s important to understand the concepts and limitations of both in a fast data architecture; four chapters, each a recipe/design pattern for writing certain types of streaming/fast data applications; and a glossary of terms and concepts that will aid in understanding these patterns.

The recipe portion of the book is designed to be easily extensible as new common fast data patterns emerge. We invite readers to submit additional recipes at [email protected].
Into a world dominated by discussions of big data, fast data has been
born with little fanfare. Yet fast data will be the agent of change in
the information-management industry, as we will show in this
report.
Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints—mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Real-time applications built on top of fast data are changing the game for businesses that are data dependent: telco, financial services, health/medical, energy, and others. It’s also changing the game for developers, who must build applications to handle increasing streams of data.1
We’re all familiar with big data. It’s data at rest: collections of structured and unstructured data, stored in Hadoop and other “data lakes,” awaiting historical analysis. Fast data, by contrast, is streaming data: data in motion. Fast data demands to be dealt with as it streams in to the enterprise in real time. Big data can be dealt with some other time—typically after it’s been stored in a Hadoop data warehouse—and analyzed via batch processing.

1 Where is all this data coming from? We’ve all heard the statement that “data is doubling every two years”—the so-called Moore’s Law of data. And according to the oft-cited EMC Digital Universe Study (2014), which included research and analysis by IDC, this statement is true. The study states that data “will multiply 10-fold between 2013 and 2020—from 4.4 trillion gigabytes to 44 trillion gigabytes.” This data, much of it new, is coming from an increasing number of new sources: people, social, mobile, devices, and sensors. It’s transforming the business landscape, creating a generational shift in how data is used, and a corresponding market opportunity. Applications and services tapping this market opportunity require the ability to process data fast.
A stack is emerging across verticals and industries to help developers build applications to process fast streams of data. This fast data stack has a unique purpose: to process real-time data and output recommendations, analytics, and decisions—transactions—in milliseconds (billing authorization and up-sell of service level, for example, in telecoms), although some fast data use cases can tolerate up to minutes of latency (energy sensor networks, for example).
Streaming Analytics
As data is created, it arrives in the enterprise in fast-moving streams. Data in a stream may arrive in many data types and formats. Most often, the data provides information about the process that generated it; this information may be called messages or events. This includes data from new sources, such as sensor data, as well as clickstreams from web servers, machine data, and data from devices, events, transactions, and customer interactions.
The increase in fast data presents the opportunity to perform analytics on data as it streams in, rather than post-facto, after it’s been pushed to a data warehouse for longer-term analysis. The ability to analyze streams of data and make in-transaction decisions on this fresh data is the most compelling vision for designers of data-driven applications.
Queryable Cache
Queries that make a decision on ingest are another example of using fast data front-ends to deliver business value. For example, a click event arrives in an ad-serving system, and we need to know which ad was shown, and analyze the response to the ad. Was the click fraudulent? Was it a robot? Which customer account do we debit because the click came in and it turns out that it wasn’t fraudulent?

Using queries that look for certain conditions, we might ask questions such as: “Is this router under attack based on what I know from the last hour?” Another example might deal with SLAs: “Is my SLA being met based on what I know from the last day or two? If so, what is the contractual cost?” In this case, we could populate a dashboard that says SLAs are not being met, and it has cost n in the last week. Other deep analytical queries, such as “How many purple hats were sold on Tuesdays in 2015 when it rained?” are really best served by systems such as Hive or Impala. These types of queries are ad hoc and may involve scanning lots of data; they’re typically not fast data queries.
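The kind of windowed, ingest-time check described above can be sketched with a simple in-memory structure. The following Python fragment is only an illustration (the threshold, window size, and function names are assumptions for the example, not anything defined in this report): it keeps a rolling one-hour count of requests per router and answers the “is this router under attack?” question directly from that cache, while deep ad hoc questions stay in the batch systems.

```python
# Hypothetical sketch of an ingest-time, queryable cache: keep a rolling
# one-hour request count per router and answer "is this router under
# attack?" from memory rather than from a batch query.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600          # "the last hour"
ATTACK_THRESHOLD = 100_000     # assumed requests/hour that signals an attack

events = defaultdict(deque)    # per-router timestamps inside the window

def record_request(router_id, now=None):
    """Called on ingest for every request event."""
    now = time.time() if now is None else now
    q = events[router_id]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:   # evict expired timestamps
        q.popleft()

def under_attack(router_id, now=None):
    """Decision query served from the cache, not from the data warehouse."""
    now = time.time() if now is None else now
    q = events[router_id]
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > ATTACK_THRESHOLD
```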
Fast data is transformative. The most significant uses for fast data
apps have been discussed in prior chapters. Key to writing fast data
apps is an understanding of two concepts central to modern data
management: the ACID properties and the CAP theorem, addressed
in this chapter. It’s unfortunate that in both acronyms the “C” stands for “Consistency,” even though it means completely different things in each. What follows is a primer on the two concepts and an explanation of the differences between the two “C”s.
What Is ACID?
The idea of transactions, their semantics and guarantees, evolved with data management itself. As computers became more powerful, they were tasked with managing more data. Eventually, multiple users would share data on a machine. This led to problems where data could be changed or overwritten out from under users in the middle of a calculation. Something needed to be done; so the academics were called in.

The rules were originally defined by Jim Gray in the 1970s, and the acronym was popularized in the 1980s. “ACID” transactions solve many problems when implemented to the letter, but have been engaged in a push-pull with performance tradeoffs ever since. Still, simply understanding these rules can educate those who seek to bend them.
A transaction is a bundling of one or more operations on database state into a single sequence. Databases that offer transactional semantics offer a clear way to start, stop, and cancel (or roll back) a set of operations (reads and writes) as a single logical meta-operation.

But transactional semantics do not make a “transaction.” A true transaction must adhere to the ACID properties. ACID transactions offer guarantees that absolve the end user of much of the headache of concurrent access to mutable database state.
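As a concrete, toy illustration of those semantics, the sketch below uses Python’s built-in sqlite3 module; the accounts table and transfer logic are invented for the example and are not from this report. The point is only that a group of reads and writes either commits as one logical operation or rolls back entirely.

```python
# A minimal sketch of transactional semantics using Python's built-in
# sqlite3 module. The schema and transfer logic are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds between accounts; either both updates commit or neither does."""
    try:
        with conn:  # begins a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # aborts: both updates roll back
    except ValueError:
        pass  # neither update is visible; the account was never overdrawn

transfer(conn, 1, 2, 30)    # commits: balances become 70 and 80
transfer(conn, 1, 2, 500)   # rolls back: balances stay 70 and 80
```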
From the seminal Google F1 Paper:
The system must provide ACID transactions, and must always
present applications with consistent and correct data. Designing
applications to cope with concurrency anomalies in their data is
very error-prone, time-consuming, and ultimately not worth the
performance gains.
How Is CAP Consistency Different from ACID Consistency?
ACID consistency is all about database rules. If a schema declares
that a value must be unique, then a consistent system will enforce
uniqueness of that value across all operations. If a foreign key
implies deleting one row will delete related rows, then a consistent
system will ensure the state can’t contain related rows once the base
row is deleted.
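A small sketch can make those schema rules concrete. The schema below is invented for illustration, not taken from the report: the UNIQUE constraint is enforced on every write, and the foreign key’s ON DELETE CASCADE guarantees no orphaned child rows survive the parent’s deletion.

```python
# Sketch of ACID consistency as schema rules, using sqlite3 with an
# illustrative schema: rule-violating writes are rejected, and deletes
# cascade so the state never contains orphaned child rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite requires enabling FK enforcement
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT UNIQUE NOT NULL
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'alice@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1)")

try:
    conn.execute("INSERT INTO users VALUES (2, 'alice@example.com')")  # duplicate email
except sqlite3.IntegrityError:
    pass  # the consistent system rejects the rule-violating write

conn.execute("DELETE FROM users WHERE id = 1")
# the cascade removed the dependent order; no state violates the schema rules
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 0
```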
CAP consistency promises that every replica of the same logical value, spread across nodes in a distributed system, has the same exact value at all times. Note that this is a logical guarantee, rather than a physical one. Due to the speed of light, it may take some nonzero time to replicate values across a cluster. The cluster can still present a logical view by preventing clients from viewing different values at different nodes.
The most interesting confluence of these concepts occurs when systems offer more than a simple key-value store. When systems offer some or all of the ACID properties across a cluster, CAP consistency becomes more involved. If a system offers repeatable reads, compare-and-set, or full transactions, then to be CAP consistent, it must offer those guarantees at any node. This is why systems that focus on CAP availability over CAP consistency rarely promise these features.
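To make compare-and-set concrete, here is a toy, single-process version of the primitive (names are invented for the example). The point of the surrounding discussion is that a CAP-consistent system must give the same answer no matter which replica serves the call; this sketch only shows the semantics of the operation itself.

```python
# A toy compare-and-set cell: the swap succeeds only if the current value
# matches what the caller last read. Single-process sketch for illustration.
import threading

class CasCell:
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        """Atomically set to `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

cell = CasCell(0)
assert cell.compare_and_set(0, 1)      # succeeds: value was 0
assert not cell.compare_and_set(0, 2)  # fails: the value has already moved to 1
```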
Idea in Brief
Increasing numbers of high-speed transactional applications are
being built: operational applications that transact against a stream of
incoming events for use cases like real-time authorization, billing,
usage, operational tuning, and intelligent alerting. Writing these
applications requires combining real-time analytics with transaction
processing.
Transactions in these applications require real-time analytics as inputs. Recalculating analytics from base data for each event in a high-velocity feed is impractical. To scale, maintain streaming aggregations that can be read cheaply in the transaction path. Unlike periodic batch operations, streaming aggregations maintain consistent, up-to-date, and accurate analytics needed in the transaction path.
This pattern trades ad hoc analytics capability for high-speed access
to analytic outputs that are known to be needed by an application.
This trade-off is necessary when calculating an analytic result from
base data for each transaction is infeasible.
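The core mechanical idea can be sketched in a few lines (class and field names here are made up for illustration): do a constant amount of aggregation work per incoming event, so the transaction path can read a precomputed value instead of recalculating it from base data.

```python
# Minimal sketch: update a streaming aggregation as each event arrives, so
# the transaction path reads a precomputed value rather than scanning history.
from collections import defaultdict

class StreamingAggregates:
    """Per-key running count and sum, updated once per incoming event."""
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def apply_event(self, key, amount):
        # O(1) work per event keeps up with a high-velocity feed
        self.count[key] += 1
        self.total[key] += amount

    def average(self, key):
        # cheap read in the transaction path; no scan of base data
        n = self.count[key]
        return self.total[key] / n if n else 0.0

agg = StreamingAggregates()
for user, spend in [("u1", 9.99), ("u1", 4.50), ("u2", 20.00)]:
    agg.apply_event(user, spend)
print(agg.average("u1"))  # read instantly when authorizing the next event
```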
Let’s consider a few example applications to illustrate the concept.
Pattern: Reject Requests Past a Threshold
Consider a high-request-volume API that must implement sophisticated usage metrics for groups of users and individual users on a per-operation basis. Metrics are used for multiple purposes: they are used to derive usage-based billing charges, and they are used to enforce a contracted quality of service standard (expressed as a number of requests per second, per user, and per group). In this case, the operational platform implementing the policy check must be able to maintain fast counters for API operations, for users and for groups. These counters must be accurate (they are inputs to billing and quality of service policy enforcement), and they must be accessible in real time to evaluate and authorize (or deny) new requests.
In this scenario, it is necessary to keep a real-time balance for each
user. Maintaining the balance accurately (granting new credits,
deducting used credits) requires an ACID OLTP system. That same
system requires the ability to maintain high-speed aggregations.
Combining real-time, high-velocity streaming aggregations with
transactions provides a scalable solution.
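A stripped-down sketch of that combination follows; the limits, names, and one-second window are assumptions for illustration rather than part of the pattern itself. Per-user and per-group counters are updated on each accepted event and consulted inside the authorization path to accept or reject the request.

```python
# Hedged sketch: fast per-user and per-group counters maintained on each API
# event and read in the authorization path to reject requests past a threshold.
import time
from collections import defaultdict

PER_USER_LIMIT = 100    # assumed contracted requests per second, per user
PER_GROUP_LIMIT = 1000  # assumed requests per second, per group

user_counts = defaultdict(int)
group_counts = defaultdict(int)
current_second = int(time.time())

def authorize(user, group, now=None):
    """Admit the request only if both the user and the group are under quota."""
    global current_second
    second = int(now if now is not None else time.time())
    if second != current_second:
        # a new one-second window: reset the streaming counters
        user_counts.clear()
        group_counts.clear()
        current_second = second
    if user_counts[user] >= PER_USER_LIMIT or group_counts[group] >= PER_GROUP_LIMIT:
        return False  # reject: past threshold
    # accepted requests also feed the usage metrics used for billing
    user_counts[user] += 1
    group_counts[group] += 1
    return True
```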
Related Concepts
Pre-aggregation is a common technique for which many algorithms and features have been developed. Materialized views, probabilistic data structures (examples: HyperLogLog, Bloom filters), and windowing are common techniques for implementing efficient real-time aggregation and summary state.
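As one example of such a structure, the following is a tiny Bloom filter sketch (the sizing and hashing choices are arbitrary here): it answers "possibly seen" or "definitely not seen" in constant space, trading exactness for speed, which is the general trade these summary structures make.

```python
# A tiny Bloom filter: constant-space membership summary with a small
# false-positive rate and no false negatives. Sizing is illustrative.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            # carve several hash values out of one digest
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("click:user42")
assert bf.might_contain("click:user42")   # always true for added items
# absent items are usually rejected, with a small false-positive rate
```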
Idea in Brief
Processing big data effectively often requires multiple database engines, each specialized to a purpose. Databases that are very good at event-oriented real-time processing are likely not good at batch analytics against large volumes. Some systems are good for high-velocity problems. Others are good for large-volume problems. However, in most cases, these systems need to interoperate to support meaningful applications.
Minimally, data arriving at the high-velocity, ingest-oriented systems needs to be processed and captured into the volume-oriented systems. In more advanced cases, reports, analytics, and predictive models generated from the volume-oriented systems need to be communicated to the velocity-oriented system to support real-time applications. Real-time analytics from the velocity side need to be integrated into operational dashboards or downstream applications that process real-time alerts, alarms, insights, and trends.
In practice, this means that many big data applications sit on top of a platform of tools. Usually the components of the platform include at least a large shared storage pool (like HDFS), a high-performance BI analytics query tool (like a columnar SQL system), a batch processing system (MapReduce or perhaps Spark), and a streaming system. Data and processing outputs move between all of these systems. Designing that dataflow—designing a processing pipeline that coordinates these different platform components—is key to solving many big data challenges. When designing such a pipeline, consider questions like the following:
1. Where does data that cannot be pushed (or pulled) through the pipeline rest? Which components are responsible for the durability of stalled data?
2. How do systems recover? Which systems are systems of record—meaning they can be recovery sources for lost data or interrupted processing?
3. What is the failure and availability model of each component in the pipeline?
4. When a component fails, which other components become unavailable? How long can upstream components maintain functionality (for example, how long can they log processed work to disk, as sketched just after this list)? These numbers inform your recovery time objective (RTO).
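As one illustration of question 4, the sketch below (the file name, size limit, and error type are assumptions for the example) shows an upstream stage spooling events to a local append-only file while its downstream component is unreachable. The spool’s capacity bounds how long the stage can keep absorbing work, which in turn feeds the RTO calculation.

```python
# Hedged sketch: an upstream stage buffers events to local disk while the
# downstream component is unavailable, then replays them after recovery.
import json, os

SPOOL_PATH = "spool.jsonl"
SPOOL_LIMIT_BYTES = 512 * 1024 * 1024  # spool capacity implies the recovery window

def send_or_spool(event, send_downstream):
    """Try to deliver; on failure, persist locally so the event is not lost."""
    try:
        send_downstream(event)
    except ConnectionError:
        if os.path.exists(SPOOL_PATH) and os.path.getsize(SPOOL_PATH) > SPOOL_LIMIT_BYTES:
            raise RuntimeError("spool full: this stage must now apply backpressure")
        with open(SPOOL_PATH, "a") as f:
            f.write(json.dumps(event) + "\n")

def drain_spool(send_downstream):
    """Once the downstream component recovers, replay the spooled events."""
    if not os.path.exists(SPOOL_PATH):
        return
    with open(SPOOL_PATH) as f:
        for line in f:
            send_downstream(json.loads(line))
    os.remove(SPOOL_PATH)
```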
Idea in Brief
Most streaming applications move data through multiple processing stages. In many cases, events are landed in a queue and then read by downstream components. Those components might write new data back to a queue as they process or they might directly stream data to their downstream components. Building a reliable data pipeline requires designing failure-recovery strategies.

With multiple processing stages connected, eventually one stage will fail, become unreachable, or otherwise become unavailable. When this occurs, the other stages continue to receive data. When the failed component comes back online, typically it must recover some previous state and then begin processing new events. This recipe discusses where to resume processing.
There are a few factors that complicate these problems and lead to
different trade-offs.
First, it is usually uncertain what the last processed event was. It is typically not technologically feasible, for example, to two-phase commit the event processing across all pipeline components. Typically, unreliable communication between processing components means the fate of events in-flight near the time of failure is unknown.

Second, event streams are often partitioned (sharded) across a number of processors. Processors and upstream sources can fail in arbitrary combinations. Picturing a single, unified event flow is often an insufficient abstraction. Your system should assume that a single partition of events can be omitted from an otherwise available stream due to failure.

This leads to three options for resuming processing, distinguished by how events near the failure time are handled. Approaches to solving the problem follow.
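One common resumption strategy can be sketched as follows (the names and file layout are invented for the example): the consumer periodically checkpoints the last offset it finished, and on restart it resumes from that checkpoint, accepting that events between the checkpoint and the crash may be replayed, which is an at-least-once behavior.

```python
# Hedged sketch of at-least-once resumption from a checkpointed offset,
# assuming a replayable source such as a partitioned log.
import json, os

CHECKPOINT_FILE = "consumer.ckpt"

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_offset"]
    return 0  # no prior state: start from the beginning of the retained stream

def save_checkpoint(next_offset):
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic swap so a crash cannot corrupt it

def run(read_events, process, batch_size=100):
    offset = load_checkpoint()
    while True:
        events = read_events(offset, batch_size)  # assumed replayable source
        if not events:
            break
        for event in events:
            process(event)          # may re-process events after a crash
        offset += len(events)
        save_checkpoint(offset)     # events before this point will not be replayed
```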
Idea in Brief
When dealing with streams of data in the face of possible failure,
processing each datum exactly once is extremely difficult. When the
processing system fails, it may not be easy to determine which data
was successfully processed and which data was not.
Traditional approaches to this problem are complex, require
strongly consistent processing systems, and require smart clients
that can determine through introspection what has or hasn’t been
processed.
As strongly consistent systems have become more scarce, and throughput needs have skyrocketed, this approach often has been deemed unwieldy and impractical. Many have given up on precise answers and chosen to work toward answers that are as correct as possible under the circumstances. The Lambda Architecture proposes doing all calculations twice, in two different ways, to allow for cross-checking. Conflict-free replicated data types (CRDTs) have been proposed as a way to add data structures that can be reasoned about when using eventually consistent data stores.
If these options are less than ideal, idempotency offers another path. An idempotent operation is an operation that has the same effect no matter how many times it is applied. The simplest example is setting a value. If I set x = 5, then I set x = 5 again, the second action doesn’t have any effect. How does this relate to exactly-once processing? For idempotent operations, there is no effective difference between at-least-once processing and exactly-once processing, and at-least-once processing is much easier to achieve.
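The contrast is easy to see in a few lines of code; the state and operation names below are invented for the example. Under at-least-once delivery the same message may be applied twice, and only the idempotent version ends up correct.

```python
# Idempotent set versus non-idempotent increment under duplicate delivery.
state = {"x": 0}

def apply_set(key, value):
    state[key] = value          # idempotent: applying it again changes nothing

def apply_increment(key, delta):
    state[key] += delta         # not idempotent: a redelivered message double-counts

apply_set("x", 5)
apply_set("x", 5)               # duplicate delivery is harmless
assert state["x"] == 5

apply_increment("x", 1)
apply_increment("x", 1)         # duplicate delivery corrupts the count
assert state["x"] == 7          # we wanted 6
```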
Leveraging the idempotent setting of values in eventually consistent systems is one of the core tools used to build robust applications on these platforms. Nevertheless, setting individual values is a much weaker tool than the ACID-transactional model that pushed data management forward in the late 20th century. CRDTs offer more, but come with rigid constraints and restrictions. They’re still a dangerous thing to build around without a deep understanding of what they offer and how they work.

With the advent of consistent systems that truly scale, a broader set of idempotent processing can be supported, which can improve and simplify past approaches dramatically. ACID transactions can be built that read and write multiple values based on business logic, while offering the same effects if repeatedly executed.
1. Inserting items into Kafka has all of the same problems as any
other distributed system. Managing exactly-once insertion into
Kafka is not easy, and Kafka doesn’t offer the right tools (at this
time) to manage idempotency when writing to the Kafka topic.
2. If the Kafka cluster is restarted or switched, topic offsets may no
longer be unique. It may be possible to use a third value, e.g., a
Kafka cluster ID, to make the event unique.
The app must ingest these events and compute average hold time
globally.
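Putting the two notes above together with the hold-time example, a sketch might derive a unique event identifier from the Kafka coordinates (cluster ID, topic, partition, offset), skip duplicates, and maintain a global running average. All field and function names here are assumptions for illustration.

```python
# Hedged sketch: dedup by Kafka coordinates, then update a global running
# average of hold time so at-least-once delivery behaves like exactly-once.
seen_ids = set()        # in practice this would be bounded or stored durably
total_hold_seconds = 0.0
event_count = 0

def ingest(cluster_id, topic, partition, offset, hold_seconds):
    """Apply an event at most once, then update the global average."""
    global total_hold_seconds, event_count
    event_id = (cluster_id, topic, partition, offset)
    if event_id in seen_ids:
        return          # redelivered event: applying it again would double-count
    seen_ids.add(event_id)
    total_hold_seconds += hold_seconds
    event_count += 1

def average_hold_time():
    return total_hold_seconds / event_count if event_count else 0.0

ingest("cluster-a", "calls", 0, 41, 120.0)
ingest("cluster-a", "calls", 0, 42, 60.0)
ingest("cluster-a", "calls", 0, 42, 60.0)   # duplicate delivery is ignored
print(average_hold_time())                  # 90.0
```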
Glossary
Determinism
A deterministic operation will always have the exact same result given a particular input and state. Determinism is important in replication. A deterministic operation can be applied to two replicas, assuming the results will match. Determinism is also useful in log replay. Performing the same set of deterministic operations a second time will give the same result.

Dimension Data
Dimension data is infrequently changing data that expands upon data in fact tables or event records.
For example, dimension data may include products for sale, current customers, and current salespeople. The record of a particular order might reference rows from these tables so as not to duplicate data. Dimension data not only saves space, it allows a product to be renamed and have that rename reflected in all open orders instantly. Dimensional schemas also allow easy filtering, grouping, and labeling of data.
In data warehousing, a single fact table, a table storing a record of facts or events, combined with many dimension tables full of dimension data, is referred to as a star schema.

ETL
Extract, transform, load is the traditional sequence by which data is loaded into a database. Fast data pipelines may either compress this sequence, or perform analysis on or in response to incoming data before it is loaded into the long-term data store.

Exponential Backoff
Exponential backoff is a way to manage contention during failure. Often, during failure, many clients try to reconnect at the same time, overloading a recovering system.
Exponential backoff is a strategy of exponentially increasing the timeouts between retries on failure. If an operation fails, wait one second to retry. If that retry fails, wait two seconds, then four seconds, and so on. This allows simple one-off failures to recover quickly, but for more-complex failures, there will eventually be a low-enough load to successfully recover. Often the growing timeouts are capped at some large number to bound recovery times, such as 16 seconds or 32 seconds.

Fast Data
The processing of streaming data at real-time velocity, enabling instant analysis, awareness, and action. Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints—mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more.
Systems and applications designed to take advantage of fast data enable companies to make real-time, per-event decisions that have direct, real-time…
About the Authors
Ryan Betts is one of the VoltDB founding developers and is presently VoltDB CTO. Ryan came to New England to attend WPI. He graduated with a B.S. in Mathematics and has been part of the Boston tech scene ever since. Ryan has been designing and building distributed systems and high-performance infrastructure software for almost 20 years. Chances are, if you’ve used the Internet, some of your ones and zeros passed through a slice of code he wrote or tested.
John Hugg, founding engineer & Manager of Developer Relations at VoltDB, specializes in the development of databases, information management software, and distributed systems. As the first engineer on the VoltDB product, he worked with the team of academics at MIT, Yale, and Brown to build H-Store, VoltDB’s research prototype. John also helped build the world-class engineering team at VoltDB to continue development of the company’s open source and commercial products. He holds a B.S. in Mathematics and Computer Science and an M.S. in Computer Science from Tufts University.