Data Engineers’ Handbook
4 Cloud Design Patterns for Data Ingestion and Transformation
www.streamsets.com
Table of Contents

Introduction
Critical Design Patterns for the Cloud
The Role of the Data Engineer
The Rise of Smart Data Pipelines
4 Cloud Design Patterns
  Pipeline Example #1: Ingest to Cloud Data Lakes or Cloud Storage
  Pipeline Example #2: Ingest to Cloud Data Warehouses
  Pipeline Example #3: Ingest to Cloud Messaging Services or Event Hubs
  Pipeline Example #4: Transform from Raw Data to Conformed Data
Operationalizing Smart Data Pipelines
Conclusion
Introduction
The move to the cloud has become top of mind and more
urgent for data engineers than ever before. According to
Gartner, “By 2023, 75% of all databases will be on a cloud
platform.” 1 This mass migration to the cloud has translated
into huge new cloud projects that have dropped into the lap
of the data engineer.
Whether you are an individual contributor or a senior data team leader, you most
likely are supporting a growing number of ETL developers who in turn support an
expanding number of data scientists and analysts across the whole business.
They want data faster, and they increasingly want data from outside the company.
The cloud is the easiest way to gather all of the external data your teams need and
set the analytics team free to party on the data. For data engineers, moving to the
cloud means pipeline redesigns, migration projects, and shifts in data processing
strategy.
1. Gartner, “Gartner Says the Future of the Database Market Is the Cloud”
Critical Design Patterns for the Cloud

To successfully migrate data and data workloads to your cloud data platforms, these are the four most common data pipeline design patterns:
Pipeline Example #1: Ingest to Cloud Data Lakes or Cloud Storage

Pipeline Overview

The first pattern is the most common and often the first step in moving data to the cloud. It’s all about ingesting data into a cloud data lake or other raw cloud storage. This is the gateway into the cloud for much of your data — it can go in many different directions and serve many different use cases after it’s ingested into the cloud data lake or cloud storage.
Key Steps

• Read data from multiple tables in parallel from the Oracle database.
• Conditionally route customer records based on the table name record header attribute (metadata) exposed by the platform to mask customer PII, for example, customers’ email addresses.
• Securely store the data on the Amazon S3 Data Lake using server-side encryption, partitioned by table name.
• Automatically stop the pipeline once all of the data has been processed from the Oracle database and written to Amazon S3.

These steps are sketched in code below.

Figure 2. Pipeline Example #1 – Migrating data from an on-prem Oracle database to an Amazon S3 Data Lake.
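The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes the python-oracledb and boto3 libraries; the table names, bucket name, credentials, and hash-based masking rule are illustrative, and the tables are read sequentially rather than in parallel for brevity.

# Hypothetical sketch of Pipeline Example #1: Oracle -> masked records -> Amazon S3.
import csv
import hashlib
import io

import boto3        # AWS SDK for Python
import oracledb     # python-oracledb driver

TABLES = ["CUSTOMERS", "ORDERS"]     # tables to migrate (illustrative)
BUCKET = "example-data-lake"         # target S3 bucket (illustrative)

def mask_email(value: str) -> str:
    """Replace a customer email with a one-way hash to remove PII."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def export_table(conn, s3, table: str) -> None:
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    for row in cursor:
        record = dict(zip(columns, row))
        # Mask PII only for customer records, keyed off the table name.
        if table == "CUSTOMERS" and "EMAIL" in record:
            record["EMAIL"] = mask_email(str(record["EMAIL"]))
        writer.writerow(record[c] for c in columns)

    # Partition by table name and request server-side encryption at rest.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"raw/{table.lower()}/part-000.csv",
        Body=buf.getvalue().encode("utf-8"),
        ServerSideEncryption="AES256",
    )

if __name__ == "__main__":
    connection = oracledb.connect(user="etl", password="secret", dsn="dbhost/orclpdb1")
    s3_client = boto3.client("s3")
    for table_name in TABLES:
        export_table(connection, s3_client, table_name)
    connection.close()   # the job stops once all tables have been written to S3

A production pipeline would parallelize the table reads and stream rows in batches instead of buffering whole tables in memory.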
Pipeline Example #2: Ingest to Cloud Data Warehouses

Pipeline Overview

Cloud data warehouses are a critical component of modern data architecture in enterprises that leverage massive amounts of data to drive the quality of their products and services. They are a new breed that brings the added advantages of cost-effectiveness and scalability through pay-as-you-go pricing models, a serverless approach, and on-demand resources. By separating compute and storage, they provide a layer built specifically for fast analytics, reporting, and data mining.
Key Steps

• Read web logs stored in a file system.
• Convert data types of certain fields from string to their appropriate types.
• Enrich records by creating new fields using regular expressions.
• Store the transformed web logs in Snowflake Cloud Data Warehouse.

These steps are sketched in code below.

Figure 3. Pipeline Example #2 – Ingest web logs from a file system to Snowflake Cloud Data Warehouse.
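The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes pandas and the snowflake-connector-python library (including its write_pandas helper and auto_create_table option); the log path, log format regex, derived field, and Snowflake connection details are illustrative.

# Hypothetical sketch of Pipeline Example #2: web logs on a file system -> Snowflake.
import re

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Common-log-style pattern; adjust to the actual web log format.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_logs(path: str) -> pd.DataFrame:
    rows = []
    with open(path) as f:
        for line in f:
            match = LOG_PATTERN.match(line)
            if match:
                rows.append(match.groupdict())
    df = pd.DataFrame(rows)
    # Convert string fields to their appropriate types.
    df["status"] = df["status"].astype(int)
    df["bytes"] = df["bytes"].astype(int)
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z")
    # Enrich: derive a new field from the request path via a regular expression.
    df["top_level_path"] = df["path"].str.extract(r"^/([^/?]+)", expand=False)
    return df

if __name__ == "__main__":
    weblogs = parse_logs("/data/weblogs/access.log")   # source path is an assumption
    conn = snowflake.connector.connect(
        account="myaccount", user="etl", password="secret",
        warehouse="ANALYTICS_WH", database="RAW", schema="WEBLOGS",
    )
    write_pandas(conn, weblogs, table_name="ACCESS_LOGS", auto_create_table=True)
    conn.close()

write_pandas stages the DataFrame and loads it with a bulk COPY, which is generally much faster than row-by-row inserts.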
Pipeline Example #3: Ingest to Cloud Messaging Services or Event Hubs

Pipeline Overview

Ingesting into cloud messaging systems enables flexible deployment and maintenance of decoupled producers and consumers. This design pattern eliminates duplication of logic and, in fact, allows for decoupled logic and operations for different consumers. In the case of Kafka, for example, topics provide a well-understood metaphor, partitions provide a basis for scaling topics, and consumer groups provide “shared consumer” capabilities across threads, processes, or both.
Key Steps

• Stream data from Twitter using the Twitter API.
• Create individual tweet records from the list.
• Enrich records by parsing and extracting data from the XML field.
• Reduce the payload by removing unwanted fields.
• Flatten nested records.
• Send transformed data to Kafka.

These steps are sketched in code below.

Figure 4. Pipeline Example #3 – Stream data from an HTTP client to Apache Kafka.
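The following is a minimal Python sketch of this pattern, not the StreamSets implementation. It assumes the requests and kafka-python libraries; the HTTP endpoint stands in for the Twitter API, the unwanted field names and broker address are illustrative, and the XML parsing step is omitted for brevity.

# Hypothetical sketch of Pipeline Example #3: HTTP client -> transform -> Apache Kafka.
import json

import requests
from kafka import KafkaProducer   # kafka-python client

ENDPOINT = "https://api.example.com/tweets/recent"   # stands in for the Twitter API
UNWANTED_FIELDS = ("entities", "matching_rules")     # payload trimming (illustrative)

def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into dotted keys, e.g. user.name."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def transform(tweet: dict) -> dict:
    # Reduce the payload by removing unwanted fields, then flatten nested records.
    for field in UNWANTED_FIELDS:
        tweet.pop(field, None)
    return flatten(tweet)

if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    response = requests.get(ENDPOINT, timeout=30)
    # Create individual records from the list returned by the API.
    for tweet in response.json().get("data", []):
        producer.send("tweets", transform(tweet))
    producer.flush()

Because the producer only publishes to the topic, additional consumers can be added later without changing this code, which is the decoupling benefit described above.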
Pipeline Example #4: Transform from Raw Data to Conformed Data

Pipeline Overview

The raw zone normally stores large amounts of data in its originating state, usually in its original format such as JSON or CSV. The clean zone can be thought of as a filter zone that improves data quality and may involve data enrichment. Common transformations include data type definition and conversion, removing unnecessary columns, and so on, to further improve the value of insights. The organization of this zone is normally dictated by business needs — for example, per region, date, department, etc. The curated zone is the consumption zone, optimized for analytics rather than data processing. This zone stores data in denormalized data marts and is best suited for analysts or data scientists who want to run ad hoc queries, analysis, or advanced analytics.
Key Steps

• Ingest sales insights stored in CSV format on Azure Data Lake Storage (ADLS) Gen2.
• Remove information from records that is not critical for downstream analysis.
• Enrich records by performing calculations and aggregations.
• Store clean data in Parquet format in ADLS Gen2 and aggregated data in Azure SQL.

These steps are sketched in code below.

Figure 5. Pipeline Example #4 – Transform from raw data to conformed data: ingest raw sales insights in CSV, perform transformations and aggregations, and store the conformed (clean and curated) data in Parquet and SQL.
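The following is a minimal PySpark sketch of this pattern, not the StreamSets implementation. The ADLS paths, column names, JDBC URL, and credentials are illustrative, and the Spark session is assumed to already be configured with access to the storage account.

# Hypothetical sketch of Pipeline Example #4: raw CSV on ADLS Gen2 -> clean Parquet + Azure SQL.
from pyspark.sql import SparkSession, functions as F

RAW_PATH = "abfss://raw@examplestorage.dfs.core.windows.net/sales/*.csv"
CLEAN_PATH = "abfss://clean@examplestorage.dfs.core.windows.net/sales_parquet/"
SQL_URL = "jdbc:sqlserver://example.database.windows.net:1433;database=analytics"

spark = SparkSession.builder.appName("raw-to-conformed").getOrCreate()

# Ingest raw sales insights stored as CSV on ADLS Gen2.
raw = spark.read.option("header", True).option("inferSchema", True).csv(RAW_PATH)

# Remove columns that are not needed for downstream analysis (names are illustrative).
clean = raw.drop("internal_notes", "row_checksum")

# Enrich with a calculated field, then aggregate per region and date.
clean = clean.withColumn("revenue", F.col("unit_price") * F.col("quantity"))
aggregated = clean.groupBy("region", "order_date").agg(
    F.sum("revenue").alias("total_revenue"),
    F.count("*").alias("order_count"),
)

# Store clean data in Parquet on ADLS Gen2 and the aggregates in Azure SQL.
clean.write.mode("overwrite").parquet(CLEAN_PATH)
aggregated.write.mode("overwrite").format("jdbc").options(
    url=SQL_URL, dbtable="dbo.sales_daily", user="etl", password="secret",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
).save()

Writing the full clean detail set to Parquet while sending only the aggregates to Azure SQL keeps the curated zone compact and query-friendly.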
Operationalizing Smart Data Pipelines

Figure 7. The full lifecycle of data pipelines: Design, Deploy, and Enrich. Data engineers design, test/debug, deploy, and redeploy pipelines; platform operators adopt cloud data platforms (e.g. Snowflake), manage the data pipeline platform, and set and enforce policies/SLAs; both monitor health and handle changes.

Monitoring the health of all your pipelines and the performance across all stages can be a staggering proposition. Smart data pipelines give you continuous visibility at every stage of execution. Collections of pipelines can be visualized in live data maps and drilled into
when problems arise. This drastically reduces the amount of time data engineers spend fixing errors and hunting for root causes. Smart data pipelines let you make changes to pipelines, even when they are running in production, allowing you to create agile development sprints.

Smart data pipelines report on critical metrics including:

• Throughput rates
• Error rates
• Execution time by stage
• PII detection
• Schema drift alerting
• Semantic drift alerting

This active monitoring helps data engineers ensure that data is delivered correctly with retained fidelity. It also helps flag and troubleshoot any operational or performance issues with either the data pipelines or the underlying execution engines in real time, no matter where they are deployed, even across multiple platforms both on-premises and in the cloud. Such end-to-end transparency significantly reduces the administrative burden of monitoring and managing tens of thousands of pipelines across hundreds of engines.

Having this real-time instrumentation is also critical for smart pipelines’ operational resiliency to data drift. When drift happens, data engineers a) are able to detect it immediately, based on the sensors embedded into the smart data pipelines themselves, and b) have choices on how they want to handle the drift. In some cases, structural drift is not material to the meaning of the data, so the smart data pipeline can simply keep running with no change or intervention whatsoever. Other types of change, such as a schema update, can be automatically propagated into downstream systems. This ability to automatically handle many common types of data drift drastically reduces the amount of time and effort spent on maintenance and change management of data pipelines in operation.
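To make the drift-handling idea concrete, here is a minimal Python sketch of schema drift detection, not a description of StreamSets internals. The field names, the in-memory known schema, and the alert and propagate actions are illustrative assumptions.

# Hypothetical sketch of schema drift detection: compare incoming record fields
# against the last known schema and decide whether to ignore, alert, or propagate.
KNOWN_SCHEMA = {"order_id", "customer_id", "amount"}   # last schema seen (illustrative)

def detect_drift(record: dict, known: set) -> tuple[set, set]:
    """Return (new_fields, missing_fields) relative to the known schema."""
    fields = set(record)
    return fields - known, known - fields

def handle_record(record: dict) -> None:
    new_fields, missing_fields = detect_drift(record, KNOWN_SCHEMA)
    if not new_fields and not missing_fields:
        return                                   # no drift: keep running unchanged
    if new_fields:
        # Structural drift: e.g. propagate an added column downstream (ALTER TABLE, etc.)
        print(f"schema drift detected, new fields: {sorted(new_fields)}")
        KNOWN_SCHEMA.update(new_fields)
    if missing_fields:
        # Missing fields may or may not matter; alert rather than fail the pipeline.
        print(f"warning: fields absent from record: {sorted(missing_fields)}")

if __name__ == "__main__":
    handle_record({"order_id": 1, "customer_id": 7, "amount": 19.5, "coupon_code": "X1"})

In a smart pipeline, checks like these run continuously on live data rather than per batch, which is what makes immediate detection and automatic handling possible.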
StreamSets: Smart Data Pipelines for Data Engineers

The StreamSets DataOps Platform supports your entire data team with an easy on-ramp for a wide variety of developers and powerful tools for advanced data engineers. Our smart data pipelines are resilient to changes. The platform actively detects and alerts users when data drift occurs.

Figure 6. The Full Data Pipelines Lifecycle: Design, Deploy, and Enrich.
Conclusion
For data engineers, the public cloud provides numerous opportunities to modernize your toolkit with exciting new
data services that scale well beyond the confines of your traditional role. However, simply shifting your legacy
platform to the public cloud brings all your problems along with it. Data pipelines for the cloud need to address
the elastic, scalable, and accessible nature of the cloud. Smart data pipelines take full advantage of these cloud
attributes, while also detecting and being resilient to data drift.
By developing the core capabilities to land data into raw data lakes and data warehouses, enrich with real-time data from streaming services and event hubs, and transform data to be delivered to analytics teams and platforms, you will have the foundations for delivering fast, reliable insight to every corner of your business.
StreamSets helps you build smart data pipelines with a common design interface, extensive tools for deep
integration, reliable operation with monitoring and reporting, and truly portable pipeline design across all
environments.
Do you want to start building these design patterns today? Try StreamSets
StreamSets and the StreamSets logo are the registered trademarks of StreamSets, Inc. All other marks referenced are the property of their respective owners.