The document is a publication titled 'Essentials of Data Engineering' by Dr. Mukesh Kumar Saini, scheduled for release in July 2024. It covers fundamental concepts and principles of data engineering, focusing on the data lifecycle, architecture, and security, while addressing the challenges faced by data professionals. The book aims to provide a comprehensive understanding of data engineering applicable across various technologies and is targeted at technical practitioners and data stakeholders.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/385041174

Essentials of Data Engineering

Book · July 2024


DOI: 10.5281/zenodo.14617149

CITATIONS: 2
READS: 765

1 author:

Mukesh Kumar Saini
Hospital Corporation of America
29 publications, 514 citations

All content following this page was uploaded by Mukesh Kumar Saini on 18 October 2024.



Essentials of Data Engineering
by Dr. Mukesh Saini

Copyright © 2024 Mukesh Kumar Saini


All rights reserved.
ISBN: 9798334158917
DEDICATION

To my parents, family and friends, whose unwavering support and belief in my dreams have
been my guiding light. Your love and encouragement have made this journey possible. Thank
you for always believing in me.
Table of Contents

Preface. ................................................................................................................. xi

Part I. Structural Foundation and Essential Building Blocks


1. Insights into Data Engineering................................................................................................... 3
Understanding Data Engineering 3
Defining Data Engineering 4
Data Engineering Lifecycle 4
Transformative Journey of the Data Engineer 5
Data Engineering and Data Science 9
Deep Dive into Data Engineering Skills and Activities 12
Data Maturity and the Data Engineer 13
The DNA of a Data Engineer Background, Skills, and Attributes 16
Business Responsibilities 18
Technical Responsibilities 19
Data Engineers and Other Technical Roles 21
Conclusion 24

2. Navigating the Data Engineering Lifecycle. ............................................................. 25


Understanding Data Engineering Lifecycle 25
Data Lifecycle Versus the Data Engineering Lifecycle 26
Generation: Source Systems 27
Storage 29
Ingestion 32
Transformation 35
Serving Data 36
Driving Forces in Data Engineering: Exploring Lifecycle Dynamics 40

Table of Contents | iii


Security 41
Data Management 42
DataOps 50
Data Architecture 51
Orchestration 52
Software Engineering 53
Conclusion 56

3. Design Principles for Data Architecture.................................................................. 58


Understanding Data Architecture 58
Defining Enterprise Architecture 58
Defining Data Architecture 61
Characteristics of Effective Data Architecture 63
Principles of Effective Data Architecture 64
Principle 1: Strategic Components Selection 64
Principle 2: Resilience Planning 65
Principle 3: Scalability-Centric Architecture 66
Principle 4: Leadership in Architecture Design 66
Principle 5: Continuous Architectural Evolution 67
Principle 6: Loose Coupling for Flexibility 68
Principle 7: Flexible Decision Making 68
Principle 8: Security-First Approach 69
Key Architecture Concepts 71
Domains and Services 71
Distributed Systems, Scalability, and Designing for Failure 73
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices 75
User Access: Single Versus Multitenant 79
Event-Driven Architecture 80
Brownfield Versus Greenfield Projects 80
Examples and Types of Data Architecture 82
Data Warehouse 82
Data Lake 85
Unified Data Platforms, Convergence and Next-Generation Data Lakes 86
Modern Data Stack 87
Lambda Architecture 88
Kappa Architecture 89
The Dataflow Model and Unified Batch and Streaming 89
Architecture for IoT 90
Other Data Architecture Examples 93
Who Will Be the Stakeholders in Data Architecture Design? 94
Conclusion 94

Part II. Navigating the Data Engineering Lifecycle: A Deep Dive
4. Data Generation in Source Systems. .................................................................... 97
Sources of Data: How Is Data Created? 98
Explore Source Systems: Key Concepts 98
Files and Unstructured Data 99
Application Programming Interfaces (APIs) 99
Application Databases (OLTP Systems) 99
Online Analytical Processing System (OLAP) 101
Change Data Capture (CDC) 102
Logs 102
Database Logs 103
CRUD 104
Insert-Only 105
Messages and Streams 105
Types of Time 106
Source System Practical Details 107
Databases 107
APIs 116
Data Sharing 118
Message Queues and Event-Streaming Platforms 119
With Whom You’ll Collaborate 123
Understanding the Impact of Undercurrents on Source Systems 124
Security 124
Data Management 125
DataOps 125
Data Architecture 126
Orchestration 127
Software Engineering 128
Conclusion 128

5. Storage. ...................................................................................................... 130


Raw Ingredients of Data Storage 132
Magnetic Disk Drive 132
Solid-State Drive 133
Random Access Memory 133
Networking and CPU 134
Serialization 134
Compression 135
Caching 135
Data Storage Systems 136
Single Machine Versus Distributed Storage 136

Eventual Versus Strong Consistency 137
File Storage 138
Block Storage 139
Object Storage 140
Cache and Memory-Based Storage Systems 143
Streaming Storage 144
Indexes, Partitioning, and Clustering 144
Data Engineering Storage Abstractions 146
The Data Warehouse 147
The Data Lake 147
The Data Lakehouse 147
Data Platforms 148
Stream-to-Batch Storage Architecture 149
Emerging Paradigms and Trends in Storage 149
Data Catalog 149
Data Sharing 150
Schema 150
Separation of Compute from Storage 151
Data Storage Lifecycle and Data Retention 154
Single-Tenant Versus Multitenant Storage 156
With Whom You’ll Collaborate 157
Undercurrents 158
Security 158
Data Management 158
DataOps 159
Data Architecture 159
Orchestration 160
Software Engineering 160
Conclusion 160

6. Ingestion. .................................................................................................... 161


Understanding Data Ingestion 161
Key Engineering Considerations for the Ingestion Phase 162
Bounded Versus Unbounded Data 163
Frequency 164
Synchronous Versus Asynchronous Ingestion 165
Serialization and Deserialization 166
Throughput and Scalability 166
Reliability and Durability 167
Payload 167
Push Versus Pull Versus Poll Patterns 170
Batch Ingestion Considerations 171

Snapshot or Differential Extraction 172
File-Based Export and Ingestion 172
ETL Versus ELT 172
Inserts, Updates, and Batch Size 173
Data Migration 173
Message and Stream Ingestion Considerations 174
Schema Evolution 174
Late-Arriving Data 175
Ordering and Multiple Delivery 175
Time to Live 176
Message Size 176
Error Handling and Dead-Letter Queues 176
Consumer Pull and Push 177
Location 177
Ways to Ingest Data 177
Direct Database Connection 178
Change Data Capture 179
APIs 181
Message Queues and Event-Streaming Platforms 182
Managed Data Connectors 183
Moving Data with Object Storage 183
EDI 184
Databases and File Export 184
Practical Issues with Common File Formats 184
Shell 185
SSH 185
SFTP and SCP 186
Webhooks 186
Web Interface 187
Web Scraping 187
Transfer Appliances for Data Migration 188
Data Sharing 188
With Whom You’ll Collaborate 189
Upstream Stakeholders 189
Downstream Stakeholders 190
Undercurrents 190
Security 190
Data Management 191
DataOps 192
Orchestration 194
Software Engineering 195
Conclusion 195



7. Exploring Queries, Modeling Techniques, and Data Transformation. ...................... 197
Queries 198
Understanding a Query 198
The Life of a Query 200
The Query Optimizer 201
Improving Query Performance 201
Queries on Streaming Data 205
Data Modeling 210
Understanding a Data Model 211
Conceptual, Logical, and Physical Data Models 211
Normalization 213
Techniques for Modeling Batch Analytical Data 216
Modeling Streaming Data 228
Transformations 229
Batch Transformations 229
Materialized Views, Federation, and Query Virtualization 239
Streaming Transformations and Processing 242
With Whom You’ll Collaborate 244
Upstream Stakeholders 244
Downstream Stakeholders 245
Undercurrents 245
Security 245
Data Management 246
DataOps 247
Data Architecture 248
Orchestration 248
Software Engineering 248
Conclusion 249

8. Delivering Data for Analytics, Machine Learning, and Reverse ETL. .........................251
General Considerations for Serving Data 252
Trust 252
What’s the Use Case, and Who’s the User? 253
Data Products 254
Self-Service or Not? 255
Data Definitions and Logic 255
Data Mesh 256
Analytics 257
Business Analytics 257
Operational Analytics 257
Embedded Analytics 258
Machine Learning 259

What a Data Engineer Should Know About ML 260
Ways to Serve Data for Analytics and ML 261
File Exchange 261
Databases 262
Streaming Systems 263
Query Federation 263
Data Sharing 264
Semantic and Metrics Layers 264
Serving Data in Notebooks 265
Reverse ETL 267
With Whom You’ll Collaborate 268
Undercurrents 269
Security 269
Data Management 270
DataOps 270
Data Architecture 271
Orchestration 271
Software Engineering 272
Conclusion 273

Part III. Safeguarding Data in the Digital Era: Security, Privacy, and the Future
of Data Engineering
9. Security and Privacy.......................................................................................... 277
People 278
How Negative Thinking Can Fuel Success? 278
Always Be Paranoid 278
Processes 279
Security Theater Versus Security Habit 279
Active Security 280
The Principle of Least Privilege 280
Shared Responsibility in the Cloud 281
Always Back Up Your Data 281
An Example Security Policy 281
Technology 282
Patch and Update Systems 283
Encryption 283
Logging, Monitoring, and Alerting 283
Network Access 284
Security for Low-Level Data Engineering 285
Conclusion 286

10. The Evolution of Data Engineering – Future Perspectives. ................... 287
Why the Data Engineering Lifecycle Matters Now and in the Future 288
Simplifying Complexity: The Evolution of User-Friendly Data Tools 288
Enhanced Interoperability Through Cloud-Scale Data OS 289
Enterprisey Data Engineering 290
Evolving Titles and Responsibilities in Modern Workspaces 291
Transitioning to the Live Data Stack: Beyond the Modern Data Infrastructure 292
The Live Data Stack 293
The Power of Real-Time: Streaming Pipelines and Analytical Databases 294
Data-Driven Application: The Fusion of Data and Functionality 295
Enhancing Applications Through ML Feedback 295
The Unexpected Duo: Spreadsheets and Dark Matter Data? 296
Conclusion 297

A. Serialization and Compression Technical Details. .................................................... 299

B. Additional Resources and References. ..................................................................... 305

Index. .................................................................................................................. 309

Preface

The genesis of this book traces back to a transition from data science to data engineering.
As data science gained prominence, companies heavily invested in data science talent with
high hopes of significant returns. However, many data scientists frequently encountered
fundamental issues that their training did not adequately prepare them for—such as data
collection, cleansing, access, transformation, and infrastructure. These challenges are
precisely what data engineering seeks to address.

How This Book Differs


Before diving into what you can expect from this book and how it will benefit you, it's
important to clarify what this book does not aim to be. This book is not centered around
using specific tools, technologies, or platforms for data engineering. While there are many
valuable resources that take this approach, those tend to become outdated relatively
quickly. Instead, our focus is on imparting the fundamental concepts that underpin data
engineering.

The Essential Focus of This Book


This book seeks to address a notable gap in existing data engineering literature. While
numerous technical resources delve into specific tools and technologies, there remains a
challenge in understanding how to integrate these components effectively for real-world
applications. The goal is to connect the dots across the entire data lifecycle, illustrating
how to seamlessly weave together diverse technologies. This approach enables you to
meet the requirements of downstream data consumers such as analysts, data scientists, and
machine learning engineers.

The core concept of this book revolves around the data engineering lifecycle: data
generation, storage, ingestion, transformation, and serving. Despite the ever-changing
landscape of specific technologies and vendor products, these stages have remained
fundamentally consistent throughout the evolution of data practices. By embracing this
framework, readers will gain a solid foundation for applying technologies to solve real-
world business challenges.

Preface | xi
The objective is to establish principles that span two key dimensions. First, to distill
data engineering into principles adaptable to any relevant technology. Second, to present
principles that endure over time, drawing from lessons learned amidst the technological
upheavals of the past two decades. This mental framework should remain valuable for a
decade or more into the future.

Who Will Benefit Most from This Book


The primary audience for this book comprises technical practitioners: mid- to senior-level
software engineers, data scientists, and analysts eager to transition into data engineering.
It also includes data engineers deeply involved in specific technologies seeking a broader
perspective. The secondary audience consists of data stakeholders with technical
backgrounds, such as data team leads overseeing teams of data engineers or directors of
data warehousing looking to migrate to cloud-based solutions.

Ideally, you're someone driven by curiosity and a desire to learn, the very reason you're
exploring this book. You actively keep pace with developments in data technologies,
regularly reading books and articles on topics like data warehousing, data lakes, batch and
streaming systems, orchestration, modeling, and cloud technologies. This book aims to
help you synthesize this knowledge, providing a cohesive understanding of data
engineering across various technologies and paradigms.

Prerequisites
This book assumes readers possess a solid understanding of typical data systems used in
corporate settings, familiarity with SQL and Python (or another programming language),
and some experience using cloud services.

There are abundant resources available for those looking to practice Python and SQL.
Numerous free online platforms such as blogs, tutorials, and YouTube offer learning
materials, and new Python books are published regularly.

The cloud offers unparalleled opportunities for hands-on experience with data tools, and we
recommend that aspiring data engineers create accounts with cloud providers like AWS,
Azure, Google Cloud Platform, Snowflake, or Databricks. Many of these platforms offer
free-tier options, although users should monitor costs carefully and start with small
datasets and single-node clusters during their studies.

For individuals aiming to familiarize themselves with corporate data systems outside of a
formal work environment, challenges exist, and these can pose barriers for aspiring data
engineers seeking their first job. This book aims to bridge that gap: newcomers are
encouraged to read for broad concepts initially and to refer to the additional resources. On a
second pass through the material, focus on unfamiliar terms and technologies. Supplement your

understanding using resources like Google, Wikipedia, blogs, YouTube, and vendor
websites to fill any knowledge gaps.

Enhancing Your Skills: What You’ll Learn and How It Will Boost Your Abilities
This book aims to establish a strong foundation for tackling practical data engineering
challenges. By the end of this book, you will achieve the following:

• Gain insights into how data engineering influences your current role (whether
you're a data scientist, software engineer, or data team lead).
• Navigate through marketing noise to select the most suitable technologies, data
architectures, and processes.
• Utilize the data engineering lifecycle effectively to design and construct a robust
architecture.
• Implement best practices at every phase of the data lifecycle.

Moreover, you will be equipped to:

• Apply data engineering principles within your current role, whether as a data
scientist, analyst, software engineer, or data team lead.
• Integrate various cloud technologies to cater to the needs of downstream data
consumers.
• Evaluate data engineering challenges using a comprehensive framework of best
practices.
• Integrate data governance and security throughout the data engineering lifecycle.

Navigating This Book


This book is composed of four parts:

• Part I, “Structural Foundation and Essential Building Blocks”


• Part II, “Navigating the Data Engineering Lifecycle: A Deep Dive”
• Part III, “Safeguarding Data in the Digital Era: Security, Privacy, and the Future of Data
Engineering”
• Appendix A, covering serialization and compression, and Appendix B, covering additional
resources and references

In Part I, we begin by defining data engineering in Chapter 1, then map out the data engineering
lifecycle in Chapter 2. Chapter 3 discusses good and effective data architecture.

Part II builds on Chapter 2 to cover the data engineering lifecycle in depth; each lifecycle
stage (data generation, storage, ingestion, transformation, and serving) is covered in its own
chapter. Part II is arguably the heart of the book, and the other chapters exist to support the
core ideas covered here.

Part III covers additional topics. Chapter 9 discusses security and privacy. While security has
always been an important aspect of the data engineering profession, its significance has
heightened due to the increase in for-profit hacking and state-sponsored cyber-attacks. As for
privacy, the era of corporate indifference is over: no company desires to be featured in headlines
due to lax privacy practices. Mishandling personal data can lead to severe legal consequences,
especially with the implementation of GDPR, CCPA, and other regulations. In summary,
prioritizing security and privacy is essential in all data engineering endeavors.

While researching this book and interviewing numerous experts along a journey in data
engineering, we contemplated extensively the direction of the field in both the short and long
term. Chapter 10 delves into a few speculative thoughts on the future of data engineering.
Predicting the future is inherently uncertain, and only time will reveal the accuracy of these ideas.

The appendices cover a handful of technical topics that are extremely relevant to the day-to-day
practice of data engineering but didn’t fit into the main body of the text. Specifically, engineers
need to understand serialization and compression (see Appendix A) as well as additional resources
and references (see Appendix B).

For instance, in analyzing this book, you might notice certain words appearing frequently.
By replacing these common words with shorter tokens during compression, you could
potentially achieve a compression ratio of 2:1 or greater. However, sophisticated
compression algorithms use advanced mathematical techniques to achieve even higher
compression ratios, often reaching 10:1 or more for text data.
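The token-substitution idea above can be sketched in a few lines of Python. This is a toy illustration, not a real codec; the word list, the control-character tokens, and the sample text are all invented for demonstration:

```python
# Toy "replace frequent words with shorter tokens" compression sketch.
# The text and substitution table are hypothetical examples.
text = ("data engineering " * 50) + ("the data lifecycle " * 50)

# Map the most frequent words to one-byte tokens not present in the text.
table = {"data": "\x01", "engineering": "\x02", "lifecycle": "\x03", "the": "\x04"}

compressed = text
for word, token in table.items():
    compressed = compressed.replace(word, token)

ratio = len(text) / len(compressed)
print(f"{len(text)} -> {len(compressed)} chars, ratio {ratio:.1f}:1")
```

Because each token is unique and absent from the original text, reversing the substitutions restores the input exactly, which is the defining property of a lossless scheme.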

It's important to note that these discussions focus on lossless compression algorithms.
When data is compressed using a lossless algorithm, decompression results in an exact
copy of the original data, bit-for-bit. In contrast, lossy compression algorithms, used in
audio, images, and video, aim to preserve perceptual fidelity rather than exact data
replication. Data engineers encounter lossy compression in media processing pipelines
but typically use lossless compression for analytical serialization, where data fidelity is
crucial.
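The lossless guarantee can be demonstrated with Python's standard-library zlib; the repetitive JSON-like payload here is a hypothetical example:

```python
import zlib

# Lossless round trip: decompression must restore the original bytes exactly.
original = b'{"event": "page_view", "user_id": 42}\n' * 1000

compressed = zlib.compress(original, level=6)
restored = zlib.decompress(compressed)

assert restored == original  # bit-for-bit identical
print(f"ratio {len(original) / len(compressed):.1f}:1")
```

Highly repetitive text like this compresses far beyond the 10:1 figure mentioned above; real analytical data usually sits somewhere in between.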

Traditional compression engines like gzip and bzip2 excel at compressing text data and
are commonly applied to formats such as JSON, JSONL, XML, CSV, and others. In recent
years, a new generation of compression algorithms has emerged that prioritize speed and
CPU efficiency over achieving the highest possible compression ratio. Examples include
Snappy, Zstandard, LZFSE, and LZ4. These algorithms are widely employed in
compressing data within data lakes and columnar databases to optimize for rapid query
performance.
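Snappy, Zstandard, LZFSE, and LZ4 ship as third-party libraries, so the following sketch uses Python's standard-library codecs (zlib, bz2, lzma) to illustrate the same ratio-versus-speed trade-off on a hypothetical JSONL payload; the exact numbers will vary by machine:

```python
import bz2
import lzma
import time
import zlib

# Hypothetical JSONL payload standing in for typical data-lake text.
record = b'{"id": %d, "status": "ok", "latency_ms": 12}\n'
payload = b"".join(record % i for i in range(20000))

# Compare compression ratio and wall-clock time across stdlib codecs.
for name, codec in [("zlib", zlib), ("bz2", bz2), ("lzma", lzma)]:
    start = time.perf_counter()
    out = codec.compress(payload)
    elapsed = time.perf_counter() - start
    print(f"{name}: ratio {len(payload) / len(out):.1f}:1 in {elapsed * 1000:.0f} ms")
```

Typically lzma achieves the highest ratio but is the slowest, which mirrors why speed-oriented codecs like Snappy and LZ4 dominate in query-heavy storage systems.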

304 | Appendix A: Exploring Serialization and Compression in Depth


APPENDIX B

Additional Resources and References

This appendix discusses additional resources and references that could be valuable for
understanding various aspects of data engineering, data architecture, and related topics.

Additional Resources
Books
• Designing Data-Intensive Applications by Martin Kleppmann
• Streaming Systems: The What, Where, When, and How of Large-Scale Data
Processing by Tyler Akidau et al. (O’Reilly)
• Data Science for Business: What You Need to Know about Data Mining and Data-
Analytic Thinking by Foster Provost and Tom Fawcett (O’Reilly)

Articles and Papers


• "Lambda Architecture: Design Simplicity and Performance" by Nathan Marz
• "Kappa Architecture: Simplifying Big Data Pipelines" by Jay Kreps
• "Event-Driven Architecture: Software that Keeps on Giving" by Martin Fowler
• "Data Mesh: A paradigm shift in big data architecture" by Zhamak Dehghani

Websites and Blogs


• Towards Data Science (towardsdatascience.com) - A platform with numerous
articles on data engineering and related topics.
• The Data Engineering Cookbook (dataengineeringcookbook.com) - Offers
practical guides and recipes for data engineering tasks.
• The Data Warehousing Institute (tdwi.org) - Provides resources, research, and
education on data warehousing and BI.
305 | Appendix B: Additional Resources and References
Documentation and Technical Guides
• Apache Kafka Documentation (kafka.apache.org/documentation/) -
Comprehensive documentation for Kafka, a popular distributed streaming
platform.
• Apache Spark Documentation (spark.apache.org/documentation/) - Official
documentation for Apache Spark, a unified analytics engine for big data
processing.
Academic Resources
• IEEE Transactions on Knowledge and Data Engineering (tkde.ieee.org) -
Publishes research articles on data engineering topics.
• Journal of Data Science and Analytics (springer.com/journal/41060) - Covers
theoretical and practical aspects of data science and analytics.

These resources cover a wide range of topics, from foundational principles to advanced
techniques in data engineering and architecture. They are essential for anyone looking to
deepen their understanding or stay updated in this rapidly evolving field.

References
• “The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction” by Jay Kreps
• “Modernizing Business Data Indexing” by Benjamin Douglas and Mohammad Mohtasham
• “The What, Why, and When of Single-Table Design with DynamoDB” by Alex DeBrie
• “The AI Hierarchy of Needs” by Monica Rogati
• “Building Analytics Teams” by John K. Thompson
• “Data as a Product vs. Data as a Service” by Justin Gage
• “Data Engineering: A Quick and Simple Definition” by James Furbush
• “Data Teams” by Jesse Anderson
• “The Downfall of the Data Engineer” by Maxime Beauchemin
• “The Future of Data Engineering Is the Convergence of Disciplines” by Liam Hausmann
• “How Creating a Data-Driven Culture Can Drive Success” by Frederik Bussler
• “On Complexity in Big Data” by Jesse Anderson
• “The Rise of the Data Engineer” by Maxime Beauchemin
• “A Short History of Big Data” by Mark van Rijmenam
• “Skills of the Data Architect” by Bob Lambert
• “The Three Levels of Data Analysis: A Framework for Assessing Data Organization Maturity” by Emilie Schario
• “What Is a Data Architect? IT’s Data Framework Visionary” by Thor Olavsrud
• “Big Data Architectures” Azure documentation
• “Bounded Context” by Martin Fowler
• “A Brief Introduction to Two Data Processing Architectures—Lambda and Kappa for Big Data” by Iman Samizadeh
• “The Modern Data Stack: Past, Present, and Future” by Tristan Handy
• “Moving Beyond Batch vs. Streaming” by David Yaffe
• “A Personal Implementation of Modern Data Architecture: Getting Strava Data into Google Cloud Platform” by Matthew Reeve
• “Principled Data Engineering, Part I: Architectural Overview” by Hussein Danish
• “Questioning the Lambda Architecture” by Jay Kreps
• “Reliable Microservices Data Exchange with the Outbox Pattern” by Gunnar Morling
• “The Rise of the Metadata Lake” by Prukalpa
• “Run Your Data Team Like a Product Team” by Emilie Schario and Taylor A. Murphy
• “The Building Blocks of a Modern Data Platform” by Prukalpa
• “Choosing Open Wisely” by Benoit Dageville et al.
• “Choosing the Right Architecture for Global Data Distribution” Google Cloud Architecture web page
• “A Comparison of Data Processing Frameworks” by Ludovik Santos
• “Data Architecture: A Primer for the Data Scientist” by W. H. Inmon et al. (Academic Press)
• “Data Architecture: Complex vs. Complicated” by Dave Wells



• “The Data Dichotomy: Rethinking the Way We Treat Data and Services” by Ben Stopford
• “Data Fabric Architecture Is Key to Modernizing Data Management and Integration” by Ashutosh Gupta
• “Data Warehouse Architecture: Overview” by Roelant Vos
• “Data Warehouse Architecture” tutorial at Javatpoint
• “Defining Architecture” ISO/IEC/IEEE 42010 web page
• “The Design and Implementation of Modern Column-Oriented Database Systems” by Daniel Abadi et al.
• “Domain-Driven Design” by Martin Fowler
• “EABOK” website
• “End-to-End Serverless ETL Orchestration in AWS: A Guide” by Rittika Jindal
• “Enterprise Architecture” Gartner Glossary definition
• “Enterprise Architecture’s Role in Building a Data-Driven Organization” by Ashutosh Gupta
• “Five Principles for Cloud-Native Architecture: What It Is and How to Master It” by Tom Grey
• “Functional Data Engineering: A Modern Paradigm for Batch Data Processing” by Maxime Beauchemin
• “Google Cloud Architecture Framework” Google Cloud Architecture web page
• “How to Beat the CAP Theorem” by Nathan Marz
• “How to Build a Data Architecture to Drive Innovation—Today and Tomorrow” by Antonio Castro et al.
• “How TOGAF Defines Enterprise Architecture (EA)” by Avancier Limited
• “Introducing Dagster: An Open Source Python Library for Building Data Applications” by Nick Schrock
• “Microsoft Azure IoT Reference Architecture” documentation
• Microsoft’s “Azure Architecture Center”
• “Modern CI Is Too Complex and Misdirected” by Gregory Szorc



Index

Symbols
1NF (first normal form), 214
2NF (second normal form), 214
3NF (third normal form), 215

A
abstraction, 25
access policies, 262
accountability, 43
accuracy, 45
ACID (atomicity, consistency, isolation, and durability) transactions, 100
active security, 280
ad hoc analysis, 257
agility, 63
AI researchers, 23
alerting, 283
Amazon EMR, 152
analog data, 98
Apache Arrow, 302
Apache Beam framework, 90
Apache Druid, 153
Apache Spark, 232
APIs (see application programming interfaces), 99
application databases, 99
architecture tiers, 75
archival storage, 143
areal density, 132
atomic transactions, 99
atomicity, consistency, isolation, and durability (ACID) transactions, 100
automated lifecycle policies, 155
Avro, 300
AWS Well-Architected Framework, 64

B
backups, 283
bash, 20
batch data ingestion, 33, 180
batch logs, 103
batch size, 173
Bezos API Mandate, 68
big data engineers, 8
block storage, 139
bounded data, 162
Boyce-Codd system, 216
bridge tables, 226
broadcast joins, 231
brownfield projects, 81
bursty data ingestion, 166
business analytics, 257
business intelligence (BI), 37
business logic, 36
business logic and derived data, 236
business stakeholders, 67

C
cache hierarchy, 135
caching, 110, 135
change data capture (CDC), 102
cloning, zero-copy, 153
cloud data warehouses, 49
clustering, 145
code-based transformation tools, 232
cold data, 32
cold storage, 112
collections, 110
columnar serialization, 301
comma-separated values (CSV) format, 300
common table expressions (CTEs), 202
completeness, 45
compliance, 191
composable materialized views, 240
compression algorithms, 135
concurrency, 204
conformed dimension, 222
consistency, 108
consumers (from a stream), 182
correlated failure, 212
CRUD (create, read, update, and delete), 104
CSV (comma-separated values) format, 300

D
DAGs (directed acyclic graphs), 52
DAMA (Data Management Association International), 42
dark data, 86
data access frequency, 31
data accountability, 45
data analysts, 23
data applications, 149
data architects, 21
data block location, 153
data breaches, 49
Data Build Tool (dbt), 265
data catalogs, 149
data connectors, 183
data control language (DCL), 199
data definition language (DDL), 199
data egress costs, 177
data featurization, 36
data governance, 43
data integration, 43
data lakehouses, 30
data lakes, 86
data latency, 263
data lifecycle, 26
data lifecycle engineers, 25
data lifecycle management, 25, 43
data lineage, 31
data lineage tools, 247
data logic, 256
Data Management Body of Knowledge (DMBOK), 61
Data Management Maturity (DMM), 13-17
data manipulation language (DML), 199
data marketplaces, 119
data marts, 83
data maturity, 12, 194
data mesh, 79
data migration, 173
data modeling and design, 43
data observability, 247
Data Observability Driven Development (DODD), 48
data pipelines, 231
data platforms, 148
data producers and consumers, 270
data products, 254
data quality, 43
data retention, 152
data schemas, 109
Data Science Hierarchy of Needs, 10
data security, 41
data sharing, 118
data stacks, 87
data swamps, 47
data virtualization, 241
data wrangling, 235
database logs, 102
database replication, 180
database storage engines, 303
dead database records, 204
dead-letter queues, 176
decompression, 304
decoupling, 76
defensive posture, 278
denormalization, 145
derived data, 237
deserialization, 166
digital data, 98
dimension tables, 219
direct costs, 167
disaster prevention, 281
discoverability, 31
distributed joins, 230
distributed storage, 136
distributed systems, 74
document stores, 110
domain coupling, 77

E
EA (enterprise architecture), 58
EABOK, 59
EBS (Amazon Elastic Block Store), 151
edge computing, 91
electronic data interchange (EDI), 172
ELT (extract, load, and transform), 84
embedded analytics, 256
emitted metrics, 263, 354
encryption, 70, 192
ephemerality, 152
error handling, 28, 249
ethics, 50, 191, 265
event-based data, 115, 174, 248-250
event-driven architecture, 80, 89, 119
event-driven systems, 163, 175
eventual consistency, 112, 137, 138
Extensible Markup Language (XML), 300, 311, 316
extract, load, and transform (ELT), 84, 310

F
fact tables, 219, 220, 221
Family Educational Rights and Privacy Act (FERPA), 277, 311
fault tolerance, 77, 123, 140
featurization, 36, 259
federated queries, 324, 354
FERPA (Family Educational Rights and Privacy Act), 311
file storage, 261, 138
File Transfer Protocol (FTP), 311
filesystems, 138, 139, 152
financial management, 72
FinOps, 159, 247
Five Principles for Cloud-Native Architecture, 64, 67, 308
fixed-time windows, 284
foreign keys, 109, 198, 212
frequency, 28, 91, 120
full snapshots, 171
full-disk encryption, 283

G
Gartner Hype Cycle, 59
    sources of data, 61
golden records, 31
Google File System (GFS), 288
governance, 17, 31, 42
grain, 138, 158, 213
graph databases, 172
GraphQL, 110
greenfield projects, 80, 81
gRPC, 118

H
Hadoop Distributed File System (HDFS), 86, 136, 141
hard delete, 234
Harvard architecture, 134
headless BI, 265
Health Insurance Portability and Accountability Act (HIPAA), 277
horizontal scaling, 74
hot data, 154, 155
hot storage, 156
hotspotting, 122
Hudi (Hadoop Update Delete Incremental), 148, 302
hybrid columnar storage, 146
hybrid separation, 152
hybrid serialization, 302

I
Iceberg, 148, 303
identifiable business element, 223
incident response, 51, 193
indexes, 107, 112, 144
ingestion time, 106, 175
Inmon data model, 217
insert-only pattern, 104, 239
inserts, 171, 234
institutional knowledge, 256
integration, 29, 48, 129
internal ingestion, 162
Internet of Things (IoT), 90
interoperability, 75, 290, 302
IoT gateways, 91
irreversible decisions, 81

J
Java, 20, 117, 289
JavaScript Object Notation (JSON), 300
join strategy, 201
joins, 28, 109, 201

K
key-value timestamps, 121
keys, 125, 223, 233
Kimball data model, 220

L
lakehouses, 148
Lambda architecture, 88
late-arriving data, 175
latency, 99, 112, 113
lean practices, 51
lifecycle management, 43
lightweight object caching, 143
lineage, 40
link tables (Data Vault), 224
log analysis, 115
log-structured merge-trees (LSMs), 107
logging, 285
logs, 102
Looker, 265
lossless compression algorithms, 304
lossy compression algorithms, 304
lukewarm data, 32

M
machine learning (ML), 38
machine learning engineers, 23
magnetic disks, 132
managed data connectors, 181
MapReduce, 237
massively parallel processing (MPP) databases, 6, 63
master data management (MDM), 44
materialized views, 239
maximum message retention time, 176
measurement data, 115
memcached, 143
memory-based storage systems, 143
merge pattern, 234
message queues, 120
metrics layers, 256
micro-partitioning, 145
microservices architecture, 77
modeling patterns, 108
modularization, 87
monolithic architectures, 77
multitenant storage, 156
multitier architectures, 76
multitier caching, 152

N
NAS (network-attached storage), 138, 139
near real-time data ingestion, 180
negative thinking, 276, 280
nested subqueries, 202
network security, 284
nodes, 74, 77, 113
normal forms, 213
normalization, 227, 229
normalized schemas, 109
NTFS (New Technology File System), 138

O
object storage, 141, 148, 185
observability, 51, 247, 271
one-size-fits-all technology solutions, 65
online analytical processing (OLAP) systems, 101
online transaction processing (OLTP) systems, 99, 101
Open Group Architecture Framework, The (TOGAF)
    data architecture, 75
    enterprise architecture, 72
open source software (OSS), 133-137, 381, 388
operational analytics, 45, 346
operational architecture, 76
operational expenses (opex), 119
operational metadata, 54
Optimized Row Columnar (ORC), 394
optional persistence, 211
organizational characteristics, 82
organizational data warehouse architecture, 98
overarchitecting, 120
overhead, 118

P
PaaS (platform as a service), 124
Parquet, 135, 290
partial dependency, 214
partition keys, 180
partitioning, 213
patches, 374
payloads, 241-243
performance, 95
permissions, monitoring, 376
persistence, optional, 211
physical data models, 289
pipeline metadata, 54
pipelines as code, 68
PIT (point-in-time) tables, 305
platform as a service (PaaS), 124
plumbing tasks, 175
point-in-time (PIT) tables, 305
polyglot applications, 141
prejoining data, 276
premium-tier networking, 401
prerequisites for book, xv
primary keys, 167
principle of least privilege, 49, 372
privacy, 59, 85, 265, 369 (see also security)
process time, 165
processing engines, choosing, 78
processing time, 165
product managers, 30
project managers, 30
proprietary cloud offerings, 138
proprietary walled gardens, 137-138
pruning, 278
public access, 85
publishers, 163
pull model of data ingestion, 42, 244, 250
push model of data ingestion, 42, 244, 246, 250
Python, 20

Q
query engines, 272
query optimizers, 166, 275
query performance, 349
query pushdown, 326, 353

R
RAID (redundant array of independent disks), 202
random access memory (RAM), 194
RDBMSs (relational database management systems), 35, 157, 167
real-time analytical databases, 386
real-time data ingestion, 41, 106, 237
real-time logs, 161
recovery point objective (RPO), 79
recovery time objective (RTO), 79
reduce step, 220
redundancy, 401
redundant array of independent disks (RAID), 202
reference metadata, 54
regions, 400
regulations, 226
relational database management systems (RDBMSs), 35, 157, 167
relational schema, 219
relations (rows), 167, 169
reliability, 79, 89, 240
remote procedure calls (RPCs), 176
repatriation, 130
replay, 212, 249
reports, 345
representational state transfer (REST) APIs, 174
resilience, 181
resolution, 161
resource use, 376
resume-driven development, 97, 115
reverse cache, 197
reverse ETL, 47, 358-360
reversible decisions, 83
rotational latency, 192
row explosion, 276
row-based serialization, 391
rows, 213, 290
RPO (recovery point objective), 79
RTO (recovery time objective), 79

S
S3 Standard-Infrequent Access storage class, 142
SaaS (software as a service), 116
SAN (storage area network) systems, 115
satellites (Data Vault), 225
SCDs (slowly changing dimensions), 221
schema changes, 191
schema evolution, 174
schemaless option, 29
search databases, 115
secure copy (SCP), 186
secure FTP (SFTP), 186
security policies, 118
self-service data, 255
semantic layers, 228
sequencing plans, 67
serialization, 165
serialization formats, 299
session windows, 208
sessionization, 207
SFTP (secure FTP), 186
shared responsibility security model, 69
shuffle hash joins, 230
single-row inserts, 173
single-tenant storage, 156
single-tier architectures, 76
site-reliability engineers (SREs), 21
size-based batch ingestion, 245
SLAs (service-level agreements), 183
SLOs (service-level objectives), 124
slowly changing dimensions (SCDs), 221
snapshots, 205
Snowflake micro-partitioning, 145
soft delete, 234
software as a service (SaaS), 101
solid-state drives (SSDs), 133
Spark API, 233
SSDs (solid-state drives), 133
SSH protocol, 185
star schema, 219
streaming data modeling, 228
streaming data processing, 54
streaming pipelines, 294
streaming platforms, 182
streaming queries, 242
streaming transformations, 242
strong consistency, 137
subject orientation, 217
synchronous data ingestion, 238
systems stakeholder, 123

T
target architecture, 67
TCO (total cost of ownership), 54
technical architecture, 62
technical data warehouse architecture, 83
technologies (see data technologies)
temporary tables, 202
text search, 115
third normal form (3NF), 215
tight coupling, 75
time to live (TTL), 176
time-interval batch ingestion, 171
time-series databases, 92
timeliness, 46
timestamps, 121
touchless production, 192
transaction control language (TCL), 200
transactional databases, 99, 234
transactions, 119
transfer appliances, 188
transformation stage, 35
transitive dependency, 214
trust, 252
tumbling windows, 90
two-way doors, 60, 69

U
unbounded data, 90, 163
unstructured data, 99
update operations, 144, 173
update patterns, 233
upstream stakeholders, 22

V
vacuuming, 204
version metadata, 142
version-control systems, selecting, 65
views, 280
volatility, 134
von Neumann architecture, 134

W
warm data, 154
watermarks, 208
web interfaces, 187
web scraping, 187
webhooks, 117
wide denormalized tables, 226
wide tables, 226
wide-column databases, 113
write once, read never (WORN), 47

X
XML (Extensible Markup Language), 300

Z
zero-copy cloning, 153
zero-trust security models, 69

About the Author
Dr. Mukesh Saini is a renowned figure in technology, research, architecture,
and data engineering, with a distinguished career spanning more than twenty
years. His expertise covers various domains within the data industry, including
statistical modeling, forecasting, and advanced machine learning techniques.
Dr. Saini has been instrumental in data engineering, shaping architecture, and
implementing robust business intelligence solutions, particularly in the
healthcare sector.

Throughout his career, Dr. Saini has made substantial contributions to the
progress of data science and technology. His work demonstrates a profound
grasp of intricate data ecosystems and a talent for creating pioneering solutions
tailored to diverse business requirements. Spanning multiple sectors, his
experience showcases his adeptness in addressing complex challenges in data
management and analysis.

As a technologist, Dr. Saini has not only excelled in theoretical frameworks but
has also delivered practical, scalable solutions in production settings. His
proficiency in integrating cutting-edge technologies with business strategies
has made him a sought-after consultant in the field. Whether it's optimizing
data workflows, designing efficient data pipelines, or harnessing the power of
machine learning for predictive analytics, Dr. Saini has consistently
demonstrated his ability to deliver impactful results.

In addition to his technical expertise, Dr. Saini is recognized for his scholarly
contributions. He has authored numerous publications and research papers,
providing valuable insights to the academic community and influencing
discussions on data integration and analytics.

Dr. Mukesh Saini remains a prominent figure in both clinical and data industries,
blending profound technical expertise with a strategic vision for harnessing data
as a transformative asset. His diverse career underscores his commitment to
pushing the boundaries of data possibilities, establishing him as both an
inspirational figure and a trusted authority in his field.
