Data Engineers’ Handbook
4 Cloud Design Patterns for Data Ingestion and Transformation
www.streamsets.com
Table of Contents

Introduction
Critical Design Patterns for the Cloud
The Role of the Data Engineer
The Rise of Smart Data Pipelines
4 Cloud Design Patterns
  Pipeline Example #1: Ingest to Cloud Data Lakes or Cloud Storage
  Pipeline Example #2: Ingest to Cloud Data Warehouses
  Pipeline Example #3: Ingest to Cloud Messaging Services or Event Hubs
  Pipeline Example #4: Transform from Raw Data to Conformed Data
Operationalizing Smart Data Pipelines
Conclusion
Introduction
The move to the cloud has become top of mind and more
urgent for data engineers than ever before. According to
Gartner, “By 2023, 75% of all databases will be on a cloud
platform.” 1 This mass migration to the cloud has translated
into huge new cloud projects that have dropped into the lap
of the data engineer.
Whether you are an individual contributor or a senior data team leader, you most
likely are supporting a growing number of ETL developers who in turn support an
expanding number of data scientists and analysts across the whole business.
They want data faster, and they increasingly want data from outside the company.
The cloud is the easiest way to gather all of the external data your teams need and
set the analytics team free to party on the data. For data engineers, moving to the
cloud means pipeline redesigns, migration projects, and shifts in data processing
strategy.
1. Gartner, “Gartner Says the Future of the Database Market Is the Cloud”
Critical Design Patterns for the Cloud

To successfully migrate data and data workloads to your cloud data platforms, these are the four most common data pipeline design patterns:
Pipeline Example #1: Ingest to Cloud Data Lakes or Cloud Storage

Pipeline Overview

The first pattern is the most common and often the first step in moving data to the cloud. It’s all about ingesting data into a cloud data lake or other raw cloud storage. This is the gateway into the cloud for much of your data — it can go in many different directions and serve many different use cases after it’s ingested into the cloud data lake or cloud storage.
Key Steps

• Read data from multiple tables in parallel from the Oracle database.
• Conditionally route customer records based on the table name record header attribute (metadata) exposed by the platform to mask customer PII, for example, customers’ email addresses.
• Securely store the data on the Amazon S3 Data Lake using server-side encryption, partitioned by table name.
• Automatically stop the pipeline once all of the data has been processed from the Oracle database and written to Amazon S3.

These steps are sketched in code below.

Figure 2. Pipeline Example #1 – Migrating data from an on-prem Oracle database to an Amazon S3 Data Lake.
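The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes the python-oracledb and boto3 libraries; the table names, bucket name, credentials, and hash-based masking rule are illustrative, and the tables are read sequentially rather than in parallel for brevity.

# Hypothetical sketch of Pipeline Example #1: Oracle -> masked records -> Amazon S3.
import csv
import hashlib
import io

import boto3        # AWS SDK for Python
import oracledb     # python-oracledb driver

TABLES = ["CUSTOMERS", "ORDERS"]     # tables to migrate (illustrative)
BUCKET = "example-data-lake"         # target S3 bucket (illustrative)

def mask_email(value: str) -> str:
    """Replace a customer email with a one-way hash to remove PII."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def export_table(conn, s3, table: str) -> None:
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    for row in cursor:
        record = dict(zip(columns, row))
        # Mask PII only for customer records, keyed off the table name.
        if table == "CUSTOMERS" and "EMAIL" in record:
            record["EMAIL"] = mask_email(str(record["EMAIL"]))
        writer.writerow(record[c] for c in columns)

    # Partition by table name and request server-side encryption at rest.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"raw/{table.lower()}/part-000.csv",
        Body=buf.getvalue().encode("utf-8"),
        ServerSideEncryption="AES256",
    )

if __name__ == "__main__":
    connection = oracledb.connect(user="etl", password="secret", dsn="dbhost/orclpdb1")
    s3_client = boto3.client("s3")
    for table_name in TABLES:
        export_table(connection, s3_client, table_name)
    connection.close()   # the job stops once all tables have been written to S3

A production pipeline would parallelize the table reads and stream rows in batches instead of buffering whole tables in memory.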
Pipeline Example #2: Ingest to Cloud Data Warehouses

Pipeline Overview

Cloud data warehouses are a critical component of modern data architecture in enterprises that leverage massive amounts of data to drive the quality of their products and services. They are a new breed that brings the added advantages of cost-effectiveness and scalability through pay-as-you-go pricing models, a serverless approach, and on-demand resources. By separating compute and storage, they provide a layer built specifically for fast analytics, reporting, and data mining.
Key Steps

• Read web logs stored in a file system.
• Convert data types of certain fields from string to their appropriate types.
• Enrich records by creating new fields using regular expressions.
• Store the transformed web logs in Snowflake Cloud Data Warehouse.

These steps are sketched in code below.

Figure 3. Pipeline Example #2 – Ingest web logs from a file system to Snowflake Cloud Data Warehouse.
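The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes pandas and the snowflake-connector-python library (including its write_pandas helper and auto_create_table option); the log path, log format regex, derived field, and Snowflake connection details are illustrative.

# Hypothetical sketch of Pipeline Example #2: web logs on a file system -> Snowflake.
import re

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Common-log-style pattern; adjust to the actual web log format.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_logs(path: str) -> pd.DataFrame:
    rows = []
    with open(path) as f:
        for line in f:
            match = LOG_PATTERN.match(line)
            if match:
                rows.append(match.groupdict())
    df = pd.DataFrame(rows)
    # Convert string fields to their appropriate types.
    df["status"] = df["status"].astype(int)
    df["bytes"] = df["bytes"].astype(int)
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z")
    # Enrich: derive a new field from the request path via a regular expression.
    df["top_level_path"] = df["path"].str.extract(r"^/([^/?]+)", expand=False)
    return df

if __name__ == "__main__":
    weblogs = parse_logs("/data/weblogs/access.log")   # source path is an assumption
    conn = snowflake.connector.connect(
        account="myaccount", user="etl", password="secret",
        warehouse="ANALYTICS_WH", database="RAW", schema="WEBLOGS",
    )
    write_pandas(conn, weblogs, table_name="ACCESS_LOGS", auto_create_table=True)
    conn.close()

write_pandas stages the DataFrame and loads it with a bulk COPY, which is generally much faster than row-by-row inserts.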
Pipeline Example #3: Ingest to Cloud Messaging Services or Event Hubs

Pipeline Overview

Ingesting into cloud messaging systems enables flexible deployment and maintenance of decoupled producers and consumers. This design pattern eliminates duplication of logic and, in fact, allows for decoupled logic and operations for different consumers. In the case of Kafka, for example, topics provide a well-understood metaphor, partitions provide a basis for scaling topics, and consumer groups provide “shared consumer” capabilities across threads, processes, or both.
Key Steps

• Stream data from Twitter using the Twitter API.
• Create individual tweet records from the list.
• Enrich records by parsing and extracting data from the XML field.
• Reduce the payload by removing unwanted fields.
• Flatten nested records.
• Send transformed data to Kafka.

These steps are sketched in code below.

Figure 4. Pipeline Example #3 – Stream data from an HTTP client to Apache Kafka.
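The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes the requests and kafka-python libraries; the HTTP endpoint stands in for the Twitter API, the unwanted field names and broker address are illustrative, and the XML parsing step is omitted for brevity.

# Hypothetical sketch of Pipeline Example #3: HTTP client -> transform -> Apache Kafka.
import json

import requests
from kafka import KafkaProducer   # kafka-python client

ENDPOINT = "https://api.example.com/tweets/recent"   # stands in for the Twitter API
UNWANTED_FIELDS = ("entities", "matching_rules")     # payload trimming (illustrative)

def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into dotted keys, e.g. user.name."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def transform(tweet: dict) -> dict:
    # Reduce the payload by removing unwanted fields, then flatten nested records.
    for field in UNWANTED_FIELDS:
        tweet.pop(field, None)
    return flatten(tweet)

if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    response = requests.get(ENDPOINT, timeout=30)
    # Create individual records from the list returned by the API.
    for tweet in response.json().get("data", []):
        producer.send("tweets", transform(tweet))
    producer.flush()

Because the producer only publishes to the topic, additional consumers can be added later without changing this code, which is the decoupling benefit described above.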
Pipeline Example #4: Transform from Raw Data to Conformed Data

Pipeline Overview

The raw zone normally stores large amounts of data in its originating state, usually in its original format such as JSON or CSV. The clean zone can be thought of as a filter zone that improves data quality and may involve data enrichment. Common transformations include data type definition and conversion, removing unnecessary columns, and so on, to further improve the value of insights. The organization of this zone is normally dictated by business needs — for example, per region, date, department, etc. The curated zone is the consumption zone, optimized for analytics rather than data processing. This zone stores data in denormalized data marts and is best suited for analysts or data scientists who want to run ad hoc queries, analysis, or advanced analytics.
Key Steps

• Ingest sales insights stored in CSV format on Azure Data Lake Storage (ADLS) Gen2.
• Remove information from records that is not critical for downstream analysis.
• Enrich records by performing calculations and aggregations.
• Store clean data in Parquet format in ADLS Gen2 and aggregated data in Azure SQL.

These steps are sketched in code below.

Figure 5. Pipeline Example #4 – Transform from raw data to conformed data: ingest raw sales insights in CSV, perform transformations and aggregations, and store the conformed (clean and curated) data in Parquet and SQL.
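The following is a minimal PySpark sketch of this pattern, not the StreamSets implementation. The ADLS paths, column names, JDBC URL, and credentials are illustrative, and the Spark session is assumed to already be configured with access to the storage account.

# Hypothetical sketch of Pipeline Example #4: raw CSV on ADLS Gen2 -> clean Parquet + Azure SQL.
from pyspark.sql import SparkSession, functions as F

RAW_PATH = "abfss://raw@examplestorage.dfs.core.windows.net/sales/*.csv"
CLEAN_PATH = "abfss://clean@examplestorage.dfs.core.windows.net/sales_parquet/"
SQL_URL = "jdbc:sqlserver://example.database.windows.net:1433;database=analytics"

spark = SparkSession.builder.appName("raw-to-conformed").getOrCreate()

# Ingest raw sales insights stored as CSV on ADLS Gen2.
raw = spark.read.option("header", True).option("inferSchema", True).csv(RAW_PATH)

# Remove columns that are not needed for downstream analysis (names are illustrative).
clean = raw.drop("internal_notes", "row_checksum")

# Enrich with a calculated field, then aggregate per region and date.
clean = clean.withColumn("revenue", F.col("unit_price") * F.col("quantity"))
aggregated = clean.groupBy("region", "order_date").agg(
    F.sum("revenue").alias("total_revenue"),
    F.count("*").alias("order_count"),
)

# Store clean data in Parquet on ADLS Gen2 and the aggregates in Azure SQL.
clean.write.mode("overwrite").parquet(CLEAN_PATH)
aggregated.write.mode("overwrite").format("jdbc").options(
    url=SQL_URL, dbtable="dbo.sales_daily", user="etl", password="secret",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
).save()

Writing the full clean detail set to Parquet while sending only the aggregates to Azure SQL keeps the curated zone compact and query-friendly.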
Operationalizing Smart Data Pipelines

Figure 7. The full lifecycle of data pipelines: Design, Deploy, and Enrich. Data engineers design, test/debug, deploy, and redeploy pipelines; platform operators adopt cloud data platforms (e.g. Snowflake), manage the data pipeline platform, and set and enforce policies/SLAs; both monitor health and handle changes.

Monitoring the health of all your pipelines and the performance across all stages can be a staggering proposition. Smart data pipelines give you continuous visibility at every stage of execution. Collections of pipelines can be visualized in live data maps and drilled into
when problems arise. This drastically reduces the amount of time data engineers spend fixing errors and hunting for root causes. Smart data pipelines let you make changes to pipelines, even when they are running in production, allowing you to create agile development sprints.

Smart data pipelines report on critical metrics including:

• Throughput rates
• Error rates
• Execution time by stage
• PII detection
• Schema drift alerting
• Semantic drift alerting

This active monitoring helps data engineers ensure that data is delivered correctly with retained fidelity. It also helps flag and troubleshoot any operational or performance issues with either the data pipelines or the underlying execution engines in real time, no matter where they are deployed, even across multiple platforms both on-premises and in the cloud. Such end-to-end transparency significantly reduces the administrative burden of monitoring and managing tens of thousands of pipelines across hundreds of engines.

Having this real-time instrumentation is also critical for smart pipelines’ operational resiliency to data drift. When drift happens, data engineers a) are able to detect it immediately, based on the sensors embedded into the smart data pipelines themselves, and b) have choices on how they want to handle the drift. In some cases, structural drift is not material to the meaning of the data, so the smart data pipeline can simply keep running with no change or intervention whatsoever. Other types of change, such as a schema update, can be automatically propagated into downstream systems. This ability to automatically handle many common types of data drift drastically reduces the amount of time and effort spent on maintenance and change management of data pipelines in operation.
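To make the drift-handling idea concrete, here is a minimal Python sketch of schema drift detection, not a description of StreamSets internals. The field names, the in-memory known schema, and the alert and propagate actions are illustrative assumptions.

# Hypothetical sketch of schema drift detection: compare incoming record fields
# against the last known schema and decide whether to ignore, alert, or propagate.
KNOWN_SCHEMA = {"order_id", "customer_id", "amount"}   # last schema seen (illustrative)

def detect_drift(record: dict, known: set) -> tuple[set, set]:
    """Return (new_fields, missing_fields) relative to the known schema."""
    fields = set(record)
    return fields - known, known - fields

def handle_record(record: dict) -> None:
    new_fields, missing_fields = detect_drift(record, KNOWN_SCHEMA)
    if not new_fields and not missing_fields:
        return                                   # no drift: keep running unchanged
    if new_fields:
        # Structural drift: e.g. propagate an added column downstream (ALTER TABLE, etc.)
        print(f"schema drift detected, new fields: {sorted(new_fields)}")
        KNOWN_SCHEMA.update(new_fields)
    if missing_fields:
        # Missing fields may or may not matter; alert rather than fail the pipeline.
        print(f"warning: fields absent from record: {sorted(missing_fields)}")

if __name__ == "__main__":
    handle_record({"order_id": 1, "customer_id": 7, "amount": 19.5, "coupon_code": "X1"})

In a smart pipeline, checks like these run continuously on live data rather than per batch, which is what makes immediate detection and automatic handling possible.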
StreamSets: Smart Data Pipelines for Data Engineers

The StreamSets DataOps Platform supports your entire data team with an easy on-ramp for a wide variety of developers and powerful tools for advanced data engineers. Our smart data pipelines are resilient to changes. The platform actively detects and alerts users when data drift occurs.

Figure 6. The Full Data Pipelines Lifecycle: Design, Deploy, and Enrich.
Conclusion
For data engineers, the public cloud provides numerous opportunities to modernize your toolkit with exciting new
data services that scale well beyond the confines of your traditional role. However, simply shifting your legacy
platform to the public cloud brings all your problems along with it. Data pipelines for the cloud need to address
the elastic, scalable, and accessible nature of the cloud. Smart data pipelines take full advantage of these cloud
attributes, while also detecting and being resilient to data drift.
By developing the core capabilities to land data into raw data lakes and data warehouses, enrich with real-time data from streaming services and event hubs, and transform data to be delivered to analytics teams and platforms, you will have the foundations for delivering fast, reliable insight to every corner of your business.
StreamSets helps you build smart data pipelines with a common design interface, extensive tools for deep
integration, reliable operation with monitoring and reporting, and truly portable pipeline design across all
environments.
Do you want to start building these design patterns today? Try StreamSets
StreamSets and the StreamSets logo are the registered trademarks of StreamSets, Inc. All other marks referenced are the property of their respective owners.