Architecting A Data Lake
Chad Gronbach
Chief Technology Architect
Microsoft Technology Center - Boston
Definitions (1/3)
Data Warehouse: Repository of data from multiple sources, cleansed & enriched for reporting; generally ‘schema on write’
Data Lake: Repository for multi-structured data; generally ‘schema on read’
Hadoop: (1) Data storage via HDFS (Hadoop Distributed File System), and (2) a set of Apache projects for data processing and analytics
Lambda Architecture: Data processing & storage with batch, speed, and serving layers
Semantic Model: User-friendly interface on top of a data warehouse and/or data lake
Definitions (2/3)
Federated Query: A type of data virtualization: access & consolidate data from multiple distributed data sources
Schema on Read: Data structure is applied at query time rather than when the data is initially stored (data lakes, NoSQL); a short sketch follows these definitions
More in-depth definitions: https://www.sqlchick.com/entries/2017/1/9/defining-the-components-of-a-modern-data-warehouse-a-glossary
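To make ‘schema on read’ concrete, here is a minimal PySpark sketch: the raw file was landed with no schema enforced, and structure is applied only at query time. The path, column names, and file layout are hypothetical.

```python
# Minimal schema-on-read sketch with PySpark; path and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw file was landed as-is; no schema was enforced at write time.
# The structure is applied only now, at query time.
contact_schema = StructType([
    StructField("CustomerId", StringType()),
    StructField("ContactName", StringType()),
    StructField("LoadDate", DateType()),
])

contacts = (spark.read
            .schema(contact_schema)          # schema applied on read
            .option("header", "true")
            .csv("/datalake/raw/sales/salesforce/customercontacts/"))
contacts.createOrReplaceTempView("contacts")
spark.sql("SELECT COUNT(*) FROM contacts").show()
```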
Definitions (3/3)
(Spectrum diagram, on-premises to cloud: On-Premises and Azure Stack (private cloud) sit at one end, with dedicated infrastructure on physical or virtual servers, higher cost of ownership, more control, higher administration effort, and more difficulty scaling; the cloud end offers less control but lower administration effort.)
What are some common challenges of analytical environments?
Challenges of Analytical Environments
Agility | Complexity | Balance | Never-Ending
Data Lake Objectives (1/2)
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Achieve agility faster than a traditional data warehouse can, to speed up decision-making ability
(Diagram: sources such as devices & sensors and images, audio, & video feed the data lake, with a processing engine and a data science sandbox alongside)
Data Lake Use Cases
Data Science Experimentation | Hadoop Integration
✓ Sandbox solutions for initial data prep, experimentation, and analysis
✓ Migrate from proof of concept to operationalized solution
✓ Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.
✓ Big data clusters
✓ SQL-on-Hadoop solutions
(Diagram: sources including devices & sensors, spatial/GPS, images/audio/video, social media data, and flat files land as raw data in the data lake; curated data feeds Hadoop, machine learning, advanced analytics, a data science sandbox, exploratory analysis, and analytics & reporting)
Data Lake Use Cases
Data Warehouse Staging Area
✓ ELT strategy
✓ Reduce storage needs in the relational platform by using the data lake as the landing area
✓ Practical use for data stored in the data lake
✓ Potentially also handle transformations in the data lake (sketched below)
(Diagram: cloud systems, devices & sensors, social media data, third-party data & flat files, and corporate data land in the data lake’s raw data / staging area; data processing jobs feed the data warehouse, cubes & semantic models, and analytics & reporting)
Data Lake Use Cases
Integration with DW | Data Archival | Centralization
✓ Grow around the existing DW
✓ Aged data available for querying when needed
✓ Complement to the DW via data virtualization
✓ Federated queries to access current data (relational DB) + archive (data lake)
(Diagram: cloud systems, devices & sensors, social media data, and corporate data land in the data lake’s raw data / staging area, which integrates with the data warehouse and cubes & semantic models)
Data Lake Use Cases
Lambda Architecture
✓ Support for low-latency, high-velocity data in near real time
✓ Support for batch-oriented operations
(Diagram: devices & sensors and corporate data flow through data ingestion and data processing; the speed layer serves near real-time needs while the data lake’s curated data and the data warehouse serve batch needs, all feeding analytics & reporting through the serving layer)
What are some initial considerations for deciding if a data lake is right for you?
Is a Data Lake Right For You?
Initial Considerations:
Do you have non-relational data?
Do you need to offload ETL processing (ELT) and/or archival data from a data warehouse?
Readiness:
Are you ready & willing to learn different development patterns and/or new technologies?
Are you ready to handle the trade-offs of ‘schema on read’ vs ‘schema on write’?
What are some key differences between a data warehouse & a data lake?
Data Lake + Data Warehouse: Inverse Relationship
(Diagram: the data lake and the enterprise data warehouse depicted as an inverse relationship)
Big Data in Azure: Compute
Azure HDInsight | Azure Databricks | Azure Machine Learning | Azure Data Science VMs | Hadoop on a cluster of Azure virtual machines | Azure Storage
Deciding Between Compute Services
(Spectrum diagram, from greater control & customization to greater ease of use:)
Hadoop / Data Science on Azure virtual machines (IaaS): higher level of complexity, greater administrative effort, greater integration with Apache projects
Azure HDInsight (PaaS)
Azure Databricks (PaaS)
Azure Machine Learning: less administrative effort, less integration with Apache projects
Azure Data Lake Store – Distributed File System
Files of any size can be stored because ADLS is a distributed system in which file contents are divided up across backend storage nodes.
(Diagram: a custom application and services such as Azure HDInsight, Azure Databricks, and Azure Machine Learning, including Spark components like SparkSQL, DataFrames, MLlib, GraphX, and SparkR, access ADLS through its HDFS / WebHDFS-compatible interface)
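As a rough illustration of the HDFS-compatible interface, the sketch below reads from ADLS Gen1 through its adl:// URI scheme. It assumes a Spark cluster (e.g., HDInsight or Databricks) already configured with ADLS credentials; the account and path are hypothetical.

```python
# Hedged sketch: reading from Azure Data Lake Store via its HDFS-compatible
# interface. Assumes cluster-level ADLS credentials are already configured;
# the account name and path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read").getOrCreate()

# ADLS Gen1 exposes an adl:// URI scheme through its WebHDFS-compatible layer,
# so Spark treats it like any other distributed file system.
df = spark.read.parquet("adl://myaccount.azuredatalakestore.net/curated/sales/")
df.printSchema()
```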
Organizing a Data Lake (1/7)
Zones: Transient/Temp, Raw Data, Standardized, Curated Data, Master Data (supporting big data analytics, data science, and operationalized solutions)
Security Boundaries: Department, Business unit, Downstream app/purpose, Owner / Steward / SME, etc…
Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc…
Confidential Classification: Public information, Internal use only, Supplier/partner confidential, Personally identifiable information (PII), Sensitive – financial, Sensitive – intellectual property, etc…
Organizing a Data Lake (2/7)
Example 1
Pros: Subject area at top level, organization-wide; partitioned by time
Cons: No obvious security or organizational boundaries
Raw Data Zone: Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
e.g., Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt
Curated Data Zone: Purpose \ Type \ Snapshot Date \ File(s)
e.g., Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
(A small path-building sketch follows.)
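A small Python helper sketching Example 1's raw-zone convention (Subject Area \ Data Source \ Object \ Date Loaded \ File). Purely illustrative: it uses '/' separators for HDFS-style paths, and the file-name pattern is a made-up stand-in.

```python
# Illustrative helper for Example 1's raw-zone convention; names and the
# file-name pattern are assumptions, not the deck's exact convention.
from datetime import date

def raw_zone_path(subject_area: str, data_source: str, obj: str, load_date: date) -> str:
    """Build a date-partitioned raw-zone path with the date as the lowest folder level."""
    return (f"/raw/{subject_area}/{data_source}/{obj}/"
            f"{load_date:%Y/%m/%d}/"
            f"{obj}_{load_date:%Y_%m_%d}.txt")

print(raw_zone_path("Sales", "Salesforce", "CustomerContacts", date(2016, 12, 1)))
# -> /raw/Sales/Salesforce/CustomerContacts/2016/12/01/CustomerContacts_2016_12_01.txt
```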
Organizing a Data Lake (3/7)
Example 2
Pros: Security at the organizational level; partitioned by time
Cons: Potentially siloed data, duplicated data
Raw Data Zone: Organizational Unit \ Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
e.g., East Division \ Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt
Curated Data Zone: Organizational Unit \ Purpose \ Type \ Snapshot Date \ File(s)
e.g., East Division \ Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
Organizing a Data Lake (4/7)
Example 3
Pros: Segregates records coming in, going out, as well as error records; time partitioning can go down to the hour, or even minute, level depending on volume (ex: IoT data)
Cons: Not obvious from the names what the purpose of ‘out’ is (which could be ok if numerous downstream applications utilize the same ‘out’ data)
(Example folder structure: Subject Area \ RawData \ YYYY \ MM, plus CuratedData, MasterData, and StagedData zones)
Organizing a Data Lake (6/7)
Do:
✓ Hyper-focus on ease of data discovery & retrieval – will one type of structure make more sense?
✓ Focus on security implications early – what data redundancy is allowed in exchange for security
✓ Include data lineage & relevant metadata with the data file itself whenever possible (ex: columns indicating the source system where the data originated, source date, processed date, etc); see the sketch after this list
✓ Include the time element in both the folder structure & the file name
✓ Be liberal yet disciplined with folder structure (lots of nests are ok)
✓ Clearly separate out the zones so governance & policies can be applied separately
✓ Register the curated data with a catalog (ex: Azure Data Catalog) to document the metadata; a data catalog is even more important with a data lake
✓ Implement change management for migrating from a sandbox zone (discourage production use
from the sandbox)
✓ Assign a data owner & data archival policies as part of the structure, or part of the metadata
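A hedged PySpark sketch of the lineage guidance above: stamp rows with the source system, source date, and processed date before writing to the raw zone. Paths, the literal dates, and column names are assumptions for illustration.

```python
# Hedged lineage sketch: carry source system, source date, and processed
# date with the data itself. Paths, dates, and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-metadata").getOrCreate()

incoming = spark.read.option("header", "true").csv("/landing/salesforce/customercontacts/")

with_lineage = (incoming
                .withColumn("SourceSystem", F.lit("Salesforce"))
                .withColumn("SourceDate", F.lit("2016-12-01").cast("date"))
                .withColumn("ProcessedDate", F.current_timestamp()))

with_lineage.write.mode("append").parquet("/raw/Sales/Salesforce/CustomerContacts/2016/12/01/")
```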
Organizing a Data Lake (7/7)
Don’t:
× Do not combine mixed formats in a single folder structure (see the sketch after this list)
  – Schema-on-read will fail if a script looping through all files in a folder encounters a different format
  – Files in one folder should all be traversable with the same script
× Do not put your date partitions at the beginning of the file path -- it’s much easier to organize &
secure by subject area/department/etc if dates are the lowest folder level
Optimal for top-level security: \SubjectArea\YYYY\MM\DD\FileData_YYYY_MM_DD.txt
Tedious for enforcing security: \YYYY\MM\DD\SubjectArea\FileData_YYYY_MM_DD.txt
× Do not neglect naming conventions. You might use camel case, or you might just go with all lower case – either is ok, as long as you’re consistent, because some languages are case-sensitive
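One way to sketch the "one format per folder" rule in Python: check that every file in a folder shares a single extension before a schema-on-read job loops over it. The folder path is hypothetical.

```python
# Hedged sketch of the 'one format per folder' rule: verify every file in a
# folder shares one extension before a schema-on-read job traverses it.
from pathlib import Path

def assert_single_format(folder: str) -> str:
    """Raise if the folder mixes file formats; return the single extension found."""
    extensions = {p.suffix.lower() for p in Path(folder).iterdir() if p.is_file()}
    if len(extensions) != 1:
        raise ValueError(f"Mixed formats in {folder}: {sorted(extensions)}")
    return extensions.pop()

# Usage: run before pointing a schema-on-read query at the folder.
# assert_single_format("/datalake/raw/Sales/Salesforce/CustomerContacts/2016/12/01")
```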
Following Big Data Principles When Designing a Data Lake
Lambda Architecture
(Diagram: SaaS data, web data, corporate data, and devices & sensors are ingested via Data Factory and Event Hubs into Data Lake Store and Blob Storage; in the batch layer, an HDInsight data processing cluster, with a SQL Database Hive metastore, loads Azure SQL Data Warehouse; the speed layer uses Stream Analytics and an HDInsight Interactive Query cluster; the serving layer, including Analysis Services, supports analytics, data exploration, corporate reporting, self-service BI, advanced analytics & data science with Machine Learning, and streaming/real-time applications)
Lambda Architecture
Speed Layer: Real-time dataset. Temporary storage of low-latency data; moves to the batch layer for retention.
Batch Layer: Master dataset. An immutable, growing master dataset (typically partitioned among many physical files) of higher-latency data; the source of truth from which batch views are created. Atomic data is typically stored in a normalized format.
Serving Layer: Batch view. Support for data analysis via queries (random reads). Typically stored in denormalized form suitable for reporting & analysis; aggregations can be stored to reduce computations at runtime.
“Big Data” by Nathan Marz and James Warren
(A query-time sketch of how the layers combine follows.)
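A minimal sketch, assuming Parquet-backed views, of how the layers combine at query time: the complete but high-latency batch view is unioned with the small low-latency real-time view. Paths and columns (PageId, ViewCount) are hypothetical.

```python
# Hedged sketch of query-time layer merging in a lambda architecture;
# paths and columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-merge").getOrCreate()

batch_view = spark.read.parquet("/serving/pageviews_batch/")      # high latency, complete
realtime_view = spark.read.parquet("/speed/pageviews_recent/")    # low latency, approximate

# The serving query unions both so results cover the full timeline;
# the speed layer's contribution is superseded once the batch layer catches up.
combined = (batch_view.unionByName(realtime_view)
            .groupBy("PageId")
            .agg(F.sum("ViewCount").alias("ViewCount")))
combined.show()
```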
What principles might you expect to follow in a big data project?
Big Data Principles to Follow in a Data Lake Project
Immutable Raw Data
• Raw data is append-only & unchanging
• Continually growing
• No summarizations or deletions
• Bad data can be deleted, but it’s rare
• Immutable data is resilient to human error
Recreatable
• Everything downstream from the raw data can be regenerated (error tolerant; sketched below)
• Schema changes can be handled
• Unstructured data can always be re-structured (“semantic normalization”)
Identifiable Data
• Timestamped
• Unique (tolerant of duplicates from retries)
Rawness of Data
• Obtain the rawest, most atomic, data available
Separate Layers
• Redundant data in both the batch & serving layers allows normalized & denormalized data
• Speed layer may use approximations, corrected in the batch layer (eventual consistency)
“Big Data” by Nathan Marz and James Warren
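A hedged sketch of the immutability and recreatability principles: the raw zone is append-only, and the curated summary is disposable because it can always be rebuilt from the full raw dataset. Paths and fields are illustrative.

```python
# Hedged sketch of two principles above: raw data is append-only, and
# everything downstream can be regenerated from it. Paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recreatable").getOrCreate()

# 1. Raw zone: append only -- never overwrite or update in place.
new_batch = spark.read.json("/landing/events/2016/12/01/")
new_batch.write.mode("append").json("/raw/events/")

# 2. Downstream views are disposable: rebuild the curated summary from the
#    full, immutable raw dataset whenever logic or schema changes.
summary = (spark.read.json("/raw/events/")
           .groupBy("EventType")
           .agg(F.count("*").alias("EventCount")))
summary.write.mode("overwrite").parquet("/curated/event_summary/")
```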
Schema Changes Over Time (1/2)
Two options:
(1) Schema enforcement upon the ingestion of data
(2) Schema flexibility for the developers; deal with “standardizing” the data after ingestion
Schema Changes Over Time (2/2)
(Diagram: Raw Data → Standardized Raw Data via semantic normalization; see the sketch below)
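A hedged sketch of option (2), standardizing after ingestion: raw JSON is accepted as it arrives, and a later step reconciles schema drift (here, an old vs. new field name) into one canonical shape. Field and path names are assumptions, and both name variants are assumed to occur across the raw files.

```python
# Hedged sketch of "standardizing" after ingestion (semantic normalization).
# Assumes both the old field (CustName) and the new field (CustomerName)
# appear somewhere in the raw files; all names here are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("semantic-normalization").getOrCreate()

# Raw ingestion accepted whatever shape arrived; old and new field
# names may coexist across files.
raw = spark.read.json("/raw/customercontacts/")

# Standardization step: reconcile schema drift into one canonical shape.
standardized = raw.withColumn(
    "CustomerName",
    F.coalesce(F.col("CustomerName"), F.col("CustName"))  # old vs. new field name
).drop("CustName")

standardized.write.mode("overwrite").parquet("/standardized/customercontacts/")
```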
Data Formats & Data Compression
CSV: Commonly used. Human-readable. Not compressed. Typically not the best choice for large datasets.
JSON: Commonly used. Human-readable. Self-describing schema.
Parquet: Columnar format; highly compressed.
Avro: Row-based format. Supports compression. Schema encoded in the file.
ORC (Optimized Row Columnar): Columnar format with collections of rows. Light indexing and statistics.
Deciding on a format (a short writer sketch follows this list):
• Supported formats by key systems
• Integration with other systems
• File sizes
• Schema changes over time
• If a self-describing schema is desired
• Data type support
• Data format compatibility
• Performance of workload (read vs. write)
• Convenience & ease of use
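To see the formats side by side, the sketch below writes one DataFrame out as CSV, JSON, Parquet, and ORC with Spark's built-in writers (Avro requires the external spark-avro package, so it is omitted). Output paths are illustrative.

```python
# Hedged sketch comparing on-disk formats from the table above with one
# DataFrame; output paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "SaleId")

df.write.mode("overwrite").option("header", "true").csv("/tmp/sales_csv")  # human-readable, uncompressed
df.write.mode("overwrite").json("/tmp/sales_json")        # human-readable, self-describing
df.write.mode("overwrite").parquet("/tmp/sales_parquet")  # columnar, highly compressed
df.write.mode("overwrite").orc("/tmp/sales_orc")          # columnar with light indexing/statistics
```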
Techniques to Recompute the Serving Layer
Full recomputation
The entire master dataset is used to recompute the batch views in the serving layer.
Pros: Simplicity
Better human fault-tolerance
Ability to continually reap benefits of improved algorithms or calculations
Easier to keep wide datasets that contain redundant data synchronized/consistent
Cons: Performance; speed of updates
CPU and I/O heavy
Not practical for extremely large datasets
Incremental recomputation
Only new data from the master dataset is involved in recomputations.
Pros: Better performance
Cons: Significantly more complex
Still need a way to do a full recomputation in the event of errors or significant changes
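A hedged PySpark sketch contrasting the two techniques above. The master dataset layout, partitioning by load date, and the aggregate being computed are all assumptions for illustration.

```python
# Hedged sketch of full vs. incremental recomputation of a serving-layer view;
# paths and the partitioning scheme are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recompute").getOrCreate()

def full_recompute():
    """Simple & fault-tolerant: rebuild the batch view from the entire master dataset."""
    master = spark.read.parquet("/raw/sales/")
    view = master.groupBy("ProductId").agg(F.sum("SaleAmount").alias("TotalSales"))
    view.write.mode("overwrite").parquet("/serving/sales_by_product/")

def incremental_recompute(load_date: str):
    """Faster but more complex: fold only the new partition into the existing view."""
    new_data = spark.read.parquet(f"/raw/sales/{load_date}/")
    delta = new_data.groupBy("ProductId").agg(F.sum("SaleAmount").alias("TotalSales"))
    existing = spark.read.parquet("/serving/sales_by_product/")
    merged = (existing.unionByName(delta)
              .groupBy("ProductId")
              .agg(F.sum("TotalSales").alias("TotalSales")))
    # Write to a new path rather than overwriting a path still being read.
    merged.write.mode("overwrite").parquet("/serving/sales_by_product_v2/")
```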
What is the state of data modeling for files stored in a data lake?
Data Modeling for Files in a Data Lake
Wide datasets, with all data needed in one file, are commonly used
Pros: Easy to do analysis.
Data can be co-located on the nodes as the data gets distributed (depending on the tool).
Frequently the desired format for data scientists & the tools they use.
Usually well-suited to in-memory, columnar, data formats.
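As a small illustration of that last point, a wide Parquet dataset can be read straight into a columnar in-memory format with pyarrow and handed to pandas; the path is hypothetical.

```python
# Hedged sketch: a wide, denormalized dataset read into a columnar,
# in-memory format (Parquet -> Arrow -> pandas); the path is illustrative.
import pyarrow.parquet as pq

table = pq.read_table("/curated/sales_wide/")   # columnar on disk and in memory
df = table.to_pandas()                          # one wide frame, analysis-ready
print(df.shape)
```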