Streaming Change Data Capture
A Foundation for Modern Data Architectures
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Change
Data Capture, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is sub‐
ject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.
This work is part of a collaboration between O’Reilly and Qlik. See our statement of
editorial independence.
978-1-492-03249-6
Table of Contents
Acknowledgments
Prologue
4. Case Studies
    Case Study 1: Streaming to a Cloud-Based Lambda Architecture
    Case Study 2: Streaming to the Data Lake
    Case Study 3: Streaming, Data Lake, and Cloud Architecture
    Case Study 4: Supporting Microservices on the AWS Cloud Architecture
    Case Study 5: Real-Time Operational Data Store/Data Warehouse
7. Conclusion
Acknowledgments
Prologue
Introduction: The Rise of Modern
Data Architectures
structured system of record for analytics. Figure I-1 summarizes
these shifts.
All this entails careful planning and new technologies, because traditional batch-oriented data integration tools do not meet these requirements.
queries is to both record business data and analyze it, without one
action interfering with the other.
The first method used for replicating production records (i.e., rows
in a database table) to an analytics platform is batch loading, also
known as bulk or full loading. This process creates files or tables at
the target, defines their “metadata” structures based on the source,
and populates them with data copied from the source as well as the
necessary metadata definitions.
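To make these mechanics concrete, the following sketch performs a minimal full load in Python, using SQLite databases to stand in for a production source and an analytics target; the database files and the orders table are hypothetical.

    import sqlite3

    def full_load(source_path, target_path, table):
        """Copy one table wholesale from a source database to a target database."""
        src = sqlite3.connect(source_path)
        tgt = sqlite3.connect(target_path)
        try:
            # 1. Read the source table's structure (its "metadata").
            ddl = src.execute(
                "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
                (table,),
            ).fetchone()[0]

            # 2. Re-create the table at the target from that definition.
            tgt.execute(f"DROP TABLE IF EXISTS {table}")
            tgt.execute(ddl)

            # 3. Copy every row. This full scan is the work that strains the
            #    source system and forces the "batch window" discussed next.
            rows = src.execute(f"SELECT * FROM {table}").fetchall()
            if rows:
                placeholders = ", ".join("?" for _ in rows[0])
                tgt.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
            tgt.commit()
        finally:
            src.close()
            tgt.close()

    # Hypothetical usage: reload the entire orders table each night.
    # full_load("production.db", "analytics.db", "orders")

Even in this toy form, step 3 reads the whole table on every run, which is exactly the cost that grows painful as data volumes increase.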
Batch loads and periodic reloads with the latest data take time and
often consume significant processing power on the source system.
This means administrators need to run replication loads during
“batch windows” of time in which production is paused or will not
be heavily affected. Batch windows are increasingly unacceptable in
today’s global, 24×7 business environment.
Here are real examples of enterprise struggles with batch loads (in
Chapter 4, we examine how organizations are using CDC to elimi‐
nate struggles like these and realize new business value):
Advantages of CDC
CDC has three fundamental advantages over batch replication:
It’s not difficult to envision ways in which real-time data updates, sometimes referred to as fast data, can improve the bottom line.
For example, business events create data with perishable business
value. When someone buys something in a store, there is a limited
time to notify their smartphone of a great deal on a related product
in that store. When a customer logs into a vendor’s website, this cre‐
ates a short-lived opportunity to cross-sell to them, upsell to them,
or measure their satisfaction. These events often merit quick analy‐
sis and action.
In a 2017 study titled The Half Life of Data, Nucleus Research ana‐
lyzed more than 50 analytics case studies and plotted the value of
data over time for three types of decisions: tactical, operational, and
strategic. Although mileage varied by example, the aggregate find‐
ings are striking:
Examples bring research findings like this to life. Consider the case
of a leading European payments processor, which we’ll call U Pay. It
handles millions of mobile, online and in-store transactions daily
for hundreds of thousands of merchants in more than 100 countries.
Part of U Pay’s value to merchants is that it credit-checks each trans‐
action as it happens. But loading data in batch to the underlying
data lake with Sqoop, an open source ingestion scripting tool for
Hadoop, created damaging bottlenecks. The company could not
integrate both the transactions from its production SQL Server and
Oracle systems and credit agency communications fast enough to
meet merchant demands.
U Pay decided to replace Sqoop with CDC, and everything changed.
The company was able to transact its business much more rapidly
and bring the credit checks in house. U Pay created a new automa‐
ted decision engine that assesses the risk on every transaction on a
near-real-time basis by analyzing its own extensive customer infor‐
mation. By eliminating the third-party agency, U Pay increased mar‐
gins and improved service-level agreements (SLAs) for merchants.
Indeed, CDC is fueling more and more software-driven decisions.
Machine learning algorithms, an example of artificial intelligence
(AI), teach themselves as they process continuously changing data.
Machine learning practitioners need to test and score multiple, evolving models against one another to generate the best results.
Change data capture (CDC) identifies and captures just the most
recent production data and metadata changes that the source has
registered during a given time period, typically measured in seconds
or minutes, and then enables replication software to copy those
changes to a separate data repository. A variety of technical mecha‐
nisms enable CDC to minimize time and overhead in the manner
most suited to the type of analytics or application it supports. CDC can accompany batch load replication to ensure that the target is synchronized with the source when the load completes and remains synchronized afterward.
Like batch loads, CDC helps replication software copy data from one source to one target, or from one source to multiple targets. CDC also identifies and replicates source schema changes (that is, data definition language [DDL] changes), enabling targets to adapt dynamically to structural updates. This eliminates the risk that other data management and analytics processes become brittle and require time-consuming manual updates.
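To illustrate, here is a minimal sketch of the apply side of CDC in Python: it replays a captured batch of changes, including one DDL change, against a SQLite target so that the target stays in sync and adjusts its schema automatically. The event format, table, and columns are hypothetical, not any vendor’s actual change format.

    import sqlite3

    # A hypothetical set of changes captured from the source during the
    # latest interval: three row changes plus one schema (DDL) change.
    changes = [
        {"op": "INSERT", "table": "customers", "row": {"id": 1, "name": "Ada"}},
        {"op": "DDL", "statement": "ALTER TABLE customers ADD COLUMN tier TEXT"},
        {"op": "UPDATE", "table": "customers", "key": {"id": 1}, "row": {"tier": "gold"}},
        {"op": "DELETE", "table": "customers", "key": {"id": 1}},
    ]

    def apply_changes(target, events):
        """Replay captured changes so the target remains in sync with the source."""
        for e in events:
            if e["op"] == "DDL":
                # Propagating the schema change lets the target adapt
                # automatically instead of breaking downstream processes.
                target.execute(e["statement"])
            elif e["op"] == "INSERT":
                cols = ", ".join(e["row"])
                marks = ", ".join("?" for _ in e["row"])
                target.execute(
                    f"INSERT INTO {e['table']} ({cols}) VALUES ({marks})",
                    list(e["row"].values()),
                )
            elif e["op"] == "UPDATE":
                sets = ", ".join(f"{c} = ?" for c in e["row"])
                where = " AND ".join(f"{c} = ?" for c in e["key"])
                target.execute(
                    f"UPDATE {e['table']} SET {sets} WHERE {where}",
                    list(e["row"].values()) + list(e["key"].values()),
                )
            elif e["op"] == "DELETE":
                where = " AND ".join(f"{c} = ?" for c in e["key"])
                target.execute(
                    f"DELETE FROM {e['table']} WHERE {where}",
                    list(e["key"].values()),
                )
        target.commit()

    tgt = sqlite3.connect(":memory:")
    tgt.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    apply_changes(tgt, changes)

In practice the events would arrive continuously from the capture process rather than as a fixed list, but the apply logic is the same: row changes map to INSERT, UPDATE, and DELETE statements, and DDL events alter the target’s structure before the row changes that depend on it.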
Targets, meanwhile, commonly include not just traditional struc‐
tured data warehouses, but also data lakes based on distributions
from Hortonworks, Cloudera, or MapR. Targets also include cloud
platforms such as Elastic MapReduce (EMR) and Amazon Simple
Storage Service (S3) from Amazon Web Services (AWS), Microsoft
Azure Data Lake Store, and Azure HDInsight. In addition, message
streaming platforms (e.g., open source Apache Kafka and similar managed services such as Amazon Kinesis and Azure Event Hubs) are used both to enable streaming analytics applications and to transmit data to various big data targets.
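As a small illustration of the streaming pattern, the sketch below publishes a single change event to a Kafka topic with the kafka-python client; the broker address, topic name, and event fields are assumptions for illustration, and the same event could be written to Kinesis or Event Hubs through their own client libraries.

    import json
    from kafka import KafkaProducer  # pip install kafka-python

    # Assumes a local broker and a hypothetical topic that downstream
    # consumers (streaming analytics jobs, data lake loaders) subscribe to.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    change_event = {
        "op": "UPDATE",
        "table": "orders",
        "key": {"order_id": 1042},
        "after": {"status": "shipped"},
        "committed_at": "2018-06-01T12:00:00Z",
    }

    # Keying by table keeps all of a table's changes in one partition,
    # so consumers see them in commit order.
    producer.send("cdc.orders", key=b"orders", value=change_event)
    producer.flush()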
CDC has evolved to become a critical building block of modern data
architectures. As explained in Chapter 1, CDC identifies and cap‐
tures the data and metadata changes that were committed to a
source during the latest time period, typically seconds or minutes.
This enables replication software to copy and commit these incre‐
mental source database updates to a target. Figure 2-1 offers a sim‐
plified view of CDC’s role in modern data analytics architectures.
So, what are these incremental data changes? There are four primary categories of changes to a source database: row inserts, updates, and deletes, plus metadata (DDL) changes:
Figure 2-2. CDC example: row changes (one row = one record)
Query-based CDC
This approach regularly checks the production database for
changes. This method can also slow production performance by
consuming source CPU cycles. Certain source databases and
Log readers
Log readers identify new transactions by scanning changes in
transaction log files that already exist for backup and recovery
purposes (Figure 2-4). Log readers are the fastest and least dis‐
ruptive of the CDC options because they require no additional
modifications to existing databases or applications and do not
weigh down production systems with query loads. A leading
example of this approach is Qlik Replicate. Log readers must
carefully integrate with each source database’s distinct processes,
such as those that log and store changes, apply inserts/updates/
deletes, and so on. Different databases can have different and
often proprietary, undocumented formats, underscoring the
need for deep understanding of the various databases and care‐
ful integration by the CDC vendor.
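For contrast with log readers, here is a minimal sketch of the query-based approach described earlier: it polls the source on a schedule and selects only the rows modified since the last checkpoint. It assumes a hypothetical orders table whose last_updated column is maintained by the source application.

    import sqlite3
    import time

    def poll_changes(conn, last_seen):
        """Query-based CDC: fetch rows changed since the previous checkpoint.

        Each poll runs a query against the production database, which is why
        this method consumes source CPU cycles.
        """
        rows = conn.execute(
            "SELECT id, status, last_updated FROM orders "
            "WHERE last_updated > ? ORDER BY last_updated",
            (last_seen,),
        ).fetchall()
        new_checkpoint = rows[-1][2] if rows else last_seen
        return rows, new_checkpoint

    source = sqlite3.connect("production.db")  # hypothetical source database
    checkpoint = "1970-01-01T00:00:00Z"
    while True:
        changed_rows, checkpoint = poll_changes(source, checkpoint)
        for row in changed_rows:
            print("replicate:", row)  # hand off to the delivery/apply step
        time.sleep(30)  # a shorter interval lowers latency but raises source load

The sketch also makes the trade-offs visible: every poll issues a query against production, and a timestamp filter cannot see deleted rows, limitations that log readers avoid by working from the transaction log.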
Figure 3-1. CDC and modern data integration
Replication to Databases
Organizations have long used databases such as Oracle and SQL
Server for operational reporting; for example, to track sales transac‐
tions and trends, inventory levels, and supply-chain status. They can
employ batch and CDC replication to copy the necessary records to
reporting databases, thereby offloading the queries and analytics
workload from production. CDC has become common in these sce‐
narios as the pace of business quickens and business managers at all
levels increasingly demand real-time operational dashboards.
Microservices
A final important element of modern architectures to consider is
microservices. The concept of the microservice builds on service-
oriented architectures (SOAs) by structuring each application as a
collection of fine-grained services that are more easily developed,
deployed, refined, and scaled out by independent IT teams. Whereas
SOA services often are implemented together, microservices are
modular.
In his 2017 O’Reilly book Reactive Microsystems, Jonas Bonér
defined five key attributes of a microservice:
Isolation
Architectural resources are separate from those of other micro‐
services.
Autonomy
Service logic and the teams that manage them have independent
control.
Single responsibility
A microservice has one purpose only.
Ownership of state
Each microservice remembers all relevant preceding events.
Mobility
A microservice can be moved to new deployment scenarios and
topologies.
It’s not difficult to envision the advantages of microservices for large,
distributed enterprises that need to provide granular services to a
wide range of customers. You want to automate and repeat processes
wherever possible but still address individual errors, enhancement
requests, and so on for each process in a focused fashion without
interrupting other services. For instance, a global bank providing
hundreds of distinct online services across multiple customer types
needs to be able to rapidly update, improve, or retire individual services without waiting on other resources. Microservices help it achieve that.
CDC plays a critical role in such architectures by efficiently delivering and synchronizing data across the specialized microservice data repositories. In the next chapter, we explore how one global investment firm uses CDC to enable microservices for customers on multiple continents.
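A rough sketch of that pattern appears below, assuming change events arrive on a Kafka topic (as in the earlier producer example) and that a hypothetical balance-tracking microservice keeps its own private SQLite store current by consuming them.

    import json
    import sqlite3
    from kafka import KafkaConsumer  # pip install kafka-python

    # Each microservice owns its state: here, a private store of account
    # balances kept current by consuming the shared change stream.
    store = sqlite3.connect("balances_service.db")
    store.execute(
        "CREATE TABLE IF NOT EXISTS balances (account_id INTEGER PRIMARY KEY, balance REAL)"
    )

    consumer = KafkaConsumer(
        "cdc.accounts",                  # hypothetical topic fed by CDC
        bootstrap_servers="localhost:9092",
        group_id="balances-service",     # each service reads at its own pace
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        if event["op"] in ("INSERT", "UPDATE"):
            store.execute(
                "INSERT OR REPLACE INTO balances (account_id, balance) VALUES (?, ?)",
                (event["after"]["account_id"], event["after"]["balance"]),
            )
        elif event["op"] == "DELETE":
            store.execute(
                "DELETE FROM balances WHERE account_id = ?",
                (event["key"]["account_id"],),
            )
        store.commit()

Because each service owns its state and simply replays the change stream, it can be deployed, scaled, or rebuilt independently, which is the isolation and autonomy described above.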
Now let’s explore some case studies. Each of these illustrates the role
of change data capture (CDC) in enabling scalable and efficient ana‐
lytics architectures that do not affect production application perfor‐
mance. By moving and processing incremental data and metadata
updates in real time, these organizations have reduced or eliminated
the need for resource-draining and disruptive batch (aka full) loads.
They are siphoning data to multiple platforms for specialized analy‐
sis on each, consuming CPU and other resources in a balanced and
sustainable way.
GetWell data scientists conduct therapy research on this Lambda
architecture, using both historical batch processing and real-time
analytics. In addition to traditional SQL-structured analysis, they
run graph analysis to better assess the relationships between clinical
drug treatments, drug usage, and outcomes. They also perform
natural-language processing (NLP) to identify key observations
within physicians’ notes and are testing other new AI approaches
such as machine learning to improve predictions of clinical treat‐
ment outcomes.
After the data arrives in HDFS and HBase, Spark in-memory pro‐
cessing helps match orders to production on a real-time basis and
maintain referential integrity for purchase order tables. As a result,
Suppertime has accelerated sales and product delivery with accurate
real-time operational reporting. It has replaced batch loads with
CDC to operate more efficiently and more profitably.
Figure 4-3. Data architecture for cloud-based streaming and data lake
architecture
Figure 5-1. Replication Maturity Model
Level 1: Basic
At the Basic maturity level, organizations have not yet implemented
CDC. A significant portion of organizations are still in this phase.
During a course on data integration at a TDWI event in Orlando in December 2017, this author was surprised to see only half of the attendees raise their hands when asked if they used CDC.
Instead, organizations use traditional, manual extract, transform,
and load (ETL) tools and scripts, or open source Sqoop software in
the case of Hadoop, that replicate production data to analytics plat‐
forms via disruptive batch loads. These processes often vary by end
point and require skilled ETL programmers to learn multiple pro‐
cesses and spend extra time configuring and reconfiguring replica‐
tion tasks. Data silos persist because most of these organizations lack
the resources needed to integrate all of their data manually.
Such practices often are symptoms of larger issues that leave much
analytics value unrealized, because the cost and effort of data inte‐
gration limit both the number and the scope of analytics projects.
Siloed teams often run ad hoc analytics initiatives that lack a single
source of truth and strategic guidance from executives. To move
from the Basic to Opportunistic level, IT department leaders need to
Level 2: Opportunistic
At the Opportunistic maturity level, enterprise IT departments have
begun to implement basic CDC technologies. These often are man‐
ually configured tools that require software agents to be installed on
production systems and capture source updates with unnecessary,
disruptive triggers or queries. Because such tools still require
resource-intensive and inflexible ETL programming that varies by
platform type, efficiency suffers.
From a broader perspective, Level 2 IT departments often are also
beginning to formalize their data management requirements. Mov‐
ing to Level 3 requires a clear executive mandate to overcome cul‐
tural and motivational barriers.
Level 3: Systematic
Systematic organizations are getting their data house in order. IT
departments in this phase implement automated CDC solutions
such as Qlik Replicate that require no disruptive agents on source
systems. These solutions enable uniform data integration proce‐
dures across more platforms, breaking silos while minimizing skill
and labor requirements with a “self-service” approach. Data archi‐
tects rather than specialized ETL programmers can efficiently per‐
form high-scale data integration, ideally through a consolidated
enterprise console and with no manual scripting. In many cases,
they also can integrate full-load replication and CDC processes into
larger IT management frameworks using REST or other APIs. For
example, administrators can invoke and execute Qlik Replicate tasks
from workload automation solutions.
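The details differ by product and version, but conceptually a workload automation job starts a replication task with a single REST call. The endpoint path, host, and task name below are placeholders for illustration, not Qlik Replicate’s documented API.

    import requests  # pip install requests

    # Placeholder values: a real deployment would use the vendor's documented
    # endpoint, authentication scheme, and task identifiers.
    REPLICATION_SERVER = "https://replication.example.internal"
    TASK_NAME = "orders_to_datalake"

    def start_replication_task(session_token):
        """Ask the replication server to run a named task, and fail loudly so
        the surrounding workload automation job can alert on errors."""
        response = requests.post(
            f"{REPLICATION_SERVER}/api/tasks/{TASK_NAME}/run",
            headers={"Authorization": f"Bearer {session_token}"},
            timeout=30,
        )
        response.raise_for_status()

    # A scheduler (cron, Control-M, Airflow, and so on) would call this once
    # its upstream jobs have finished.
    # start_replication_task(session_token="...")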
IT teams at this level often have clear executive guidance and spon‐
sorship in the form of a crisp corporate data strategy. Leadership is
beginning to use data analytics as a competitive differentiator.
Examples from Chapter 4 include the case studies for Suppertime
and USave, which have taken systematic, data-driven approaches to
improving operational efficiency. StartupBackers (case study 3) is
similarly systematic in its data consolidation efforts to enable new
analytics insights. Another example is illustrated in case study 4,
Nest Egg, whose ambitious campaign to run all transactional
records through a coordinated Amazon Web Services (AWS) cloud
data flow is enabling an efficient, high-scale microservices environ‐
ment.
Level 4: Transformational
Organizations reaching the Transformational level are automating
additional segments of data pipelines to accelerate data readiness for
analytics. For example, they might use data warehouse automation
software to streamline the creation, management, and updates of
data warehouse and data mart environments. They also might be
automating the creation, structuring, and continuous updates of
data stores within data lakes. Qlik Compose (formerly Attunity
Compose) for Hive provides these capabilities for Hive data stores
so that datasets compliant with ACID (atomicity, consistency, isola‐
tion, durability) can be structured rapidly in what are effectively
SQL-like data warehouses on top of Hadoop.
We find that leaders within Transformational organizations are
often devising creative strategies to reinvent their businesses with
analytics. They seek to become truly data-driven. GetWell (case
study 1 in Chapter 4) is an example of a transformational organiza‐
tion. By applying the very latest technologies—machine learning,
and so on—to large data volumes, it is reinventing its offerings to
greatly improve the quality of care for millions of patients.
So why not deploy Level 3 or Level 4 solutions and call it a day?
Applying a consistent, nondisruptive and fully automated CDC pro‐
cess to various end points certainly improves efficiency, enables real-
time analytics, and yields other benefits. However, the technology
will take you only so far. We find that the most effective IT teams
achieve the greatest efficiency, scalability, and analytics value when
they are aligned with a C-level strategy to eliminate data silos, and
guide and even transform their business with data-driven decisions.
its log reader. Changes are sent in-memory to the target, with the
ability to filter out rows or columns that do not align with the target
schema or user-defined parameters. Qlik Replicate also can rename
target tables or columns, change data types, or automatically per‐
form other basic transformations that are necessary for transfer
between heterogeneous end points. Figure 6-1 shows a typical Qlik
Replicate architecture.
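The snippet below is a simplified, generic illustration of that kind of in-flight filtering and renaming, written as a plain Python function rather than as Qlik Replicate’s actual configuration model; the column names and rules are hypothetical.

    # Hypothetical rules: drop an internal column, rename one column to match
    # the target schema, and pass through only rows for active customers.
    DROP_COLUMNS = {"internal_notes"}
    RENAME_COLUMNS = {"cust_nm": "customer_name"}

    def transform(change_event):
        """Reshape a captured row change before it is delivered to the target.

        Returns None when the row should be filtered out entirely.
        """
        row = change_event.get("after", {})
        if row.get("status") != "active":       # row-level filter
            return None
        shaped = {}
        for column, value in row.items():
            if column in DROP_COLUMNS:          # column-level filter
                continue
            shaped[RENAME_COLUMNS.get(column, column)] = value
        return {**change_event, "after": shaped}

    event = {
        "op": "INSERT",
        "table": "customers",
        "after": {"cust_nm": "Ada", "status": "active", "internal_notes": "call back"},
    }
    print(transform(event))  # after: {'customer_name': 'Ada', 'status': 'active'}

Pushing these light transformations into the replication flow keeps heterogeneous end points aligned without requiring a separate ETL pass for every target.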
Qlik Replicate CDC and the larger Qlik Replicate portfolio enable
efficient, scalable, and low-impact integration of data to break silos.
Organizations can maintain consistent, flexible control of data flows
throughout their environments and automate key aspects of data
transformation for analytics. These key benefits are achieved while
reducing dependence on expensive, high-skilled ETL programmers.
APPENDIX A
Gartner Maturity Model
for Data and Analytics
Figure A-1. Overview of the Maturity Model for Data and Analytics
(D&A = data and analytics; ROI = return on investment)
About the Authors
Kevin Petrie is senior director of product marketing at Qlik. He has
20 years of experience in high tech, including marketing, big data
services, strategy, and journalism. Kevin has held leadership roles at
EMC and Symantec, and is a frequent speaker and blogger. He holds
a Bachelor of Arts degree from Bowdoin College and an MBA from the Haas School of Business at UC Berkeley. Kevin is a bookworm, outdoor fitness nut, husband, and father of three boys.
Dan Potter is VP of Product Marketing at Qlik. In this role, Dan is
responsible for product marketing and go-to-market strategies
related to modern data architectures, data integration, and DataOps.
Previously, he held executive positions at Attunity, Datawatch, IBM, Oracle, and Progress Software, where he was responsible for identifying and launching solutions across a variety of emerging markets
including cloud computing, real-time data streaming, federated
data, and ecommerce.
Itamar Ankorion is Senior Vice President of Technology Alliances
at Qlik. In his previous role, Itamar was managing director of Enter‐
prise Data Integration at Qlik, responsible for leading Qlik’s Data
Integration software business including Attunity, which was
acquired by Qlik in May 2019.
Prior to the acquisition, Itamar was Chief Marketing Officer at Attu‐
nity where he was responsible for global marketing, alliances, busi‐
ness development and product management. Itamar has over 20
years of experience in marketing, business development and prod‐
uct management in the enterprise software space. He holds a BA in Computer Science and Business Administration and an MBA from Tel Aviv University.