Architecting A Data Lake
Chad Gronbach
Chief Technology Architect
Microsoft Technology Center - Boston
Definitions (1/3)
Data Warehouse: Repository of data from multiple sources, cleansed & enriched for reporting; generally ‘schema on write’
Data Lake: Repository for multi-structured data; generally ‘schema on read’
Hadoop: (1) Data storage via HDFS (Hadoop Distributed File System), and (2) a set of Apache projects for data processing and analytics
Lambda Architecture: Data processing & storage with batch, speed, and serving layers
Semantic Model: User-friendly interface on top of a data warehouse and/or data lake
Definitions (2/3)
Federated Query: A type of data virtualization: access & consolidate data from multiple distributed data sources
Schema on Read: Data structure is applied at query time rather than when the data is initially stored (data lakes, NoSQL); a short sketch follows these definitions
More in-depth definitions: https://www.sqlchick.com/entries/2017/1/9/defining-the-components-of-a-modern-data-warehouse-a-glossary
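To make ‘schema on read’ concrete, here is a minimal PySpark sketch: the raw file was landed with no schema enforced, and structure is applied only at query time. The path, column names, and file layout are hypothetical.

```python
# Minimal schema-on-read sketch with PySpark; path and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw file was landed as-is; no schema was enforced at write time.
# The structure is applied only now, at query time.
contact_schema = StructType([
    StructField("CustomerId", StringType()),
    StructField("ContactName", StringType()),
    StructField("LoadDate", DateType()),
])

contacts = (spark.read
            .schema(contact_schema)          # schema applied on read
            .option("header", "true")
            .csv("/datalake/raw/sales/salesforce/customercontacts/"))
contacts.createOrReplaceTempView("contacts")
spark.sql("SELECT COUNT(*) FROM contacts").show()
```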
Definitions (3/3)
(Spectrum diagram, on-premises to cloud: On-Premises and Azure Stack (private cloud) sit at one end, with dedicated infrastructure on physical or virtual servers, higher cost of ownership, more control, higher administration effort, and more difficulty scaling; the cloud end offers less control but lower administration effort.)
What are some common challenges of analytical environments?
Challenges of Analytical Environments
Agility | Complexity | Balance | Never-Ending
Data Lake Objectives (1/2)
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Achieve agility faster than a traditional data warehouse can, to speed up decision-making ability
(Diagram: sources such as devices & sensors and images, audio, & video feed the data lake, with a processing engine and a data science sandbox alongside)
Data Lake Use Cases
Data Science Experimentation | Hadoop Integration
✓ Sandbox solutions for initial data prep, experimentation, and analysis
✓ Migrate from proof of concept to operationalized solution
✓ Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.
✓ Big data clusters
✓ SQL-on-Hadoop solutions
(Diagram: sources including devices & sensors, spatial/GPS, images/audio/video, social media data, and flat files land as raw data in the data lake; curated data feeds Hadoop, machine learning, advanced analytics, a data science sandbox, exploratory analysis, and analytics & reporting)
Data Lake Use Cases
Data Warehouse Staging Area
✓ ELT strategy
✓ Reduce storage needs in the relational platform by using the data lake as the landing area
✓ Practical use for data stored in the data lake
✓ Potentially also handle transformations in the data lake (sketched below)
(Diagram: cloud systems, devices & sensors, social media data, third-party data & flat files, and corporate data land in the data lake’s raw data / staging area; data processing jobs feed the data warehouse, cubes & semantic models, and analytics & reporting)
Data Lake Use Cases
Integration with DW | Data Archival | Centralization
✓ Grow around the existing DW
✓ Aged data available for querying when needed
✓ Complement to the DW via data virtualization
✓ Federated queries to access current data (relational DB) + archive (data lake)
(Diagram: cloud systems, devices & sensors, social media data, and corporate data land in the data lake’s raw data / staging area, which integrates with the data warehouse and cubes & semantic models)
Data Lake Use Cases
Lambda Architecture
✓ Support for low-latency, high-velocity data in near real time
✓ Support for batch-oriented operations
(Diagram: devices & sensors and corporate data flow through data ingestion and data processing; the speed layer serves near real-time needs while the data lake’s curated data and the data warehouse serve batch needs, all feeding analytics & reporting through the serving layer)
What are some initial considerations for deciding if a data lake is right for you?
Is a Data Lake Right For You?
Initial Considerations:
Do you have non-relational data?
Do you need to offload ETL processing (ELT) and/or archival data from a data warehouse?
Readiness:
Are you ready & willing to learn different development patterns and/or new technologies?
Are you ready to handle the trade-offs of ‘schema on read’ vs ‘schema on write’?
What are some key differences between a data warehouse & a data lake?
Data Lake + Data Warehouse: Inverse Relationship
(Diagram: the data lake and the enterprise data warehouse depicted as an inverse relationship)
Big Data in Azure: Compute
Azure HDInsight | Azure Databricks | Azure Machine Learning | Azure Data Science VMs | Hadoop on a cluster of Azure virtual machines | Azure Storage
Deciding Between Compute Services
(Spectrum diagram, from greater control & customization to greater ease of use:)
Hadoop / Data Science on Azure virtual machines (IaaS): higher level of complexity, greater administrative effort, greater integration with Apache projects
Azure HDInsight (PaaS)
Azure Databricks (PaaS)
Azure Machine Learning: less administrative effort, less integration with Apache projects
Azure Data Lake Store – Distributed File System
Files of any size can be stored because ADLS is a distributed system in which file contents are divided up across backend storage nodes.
(Diagram: a custom application and services such as Azure HDInsight, Azure Databricks, and Azure Machine Learning, including Spark components like SparkSQL, DataFrames, MLlib, GraphX, and SparkR, access ADLS through its HDFS / WebHDFS-compatible interface)
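As a rough illustration of the HDFS-compatible interface, the sketch below reads from ADLS Gen1 through its adl:// URI scheme. It assumes a Spark cluster (e.g., HDInsight or Databricks) already configured with ADLS credentials; the account and path are hypothetical.

```python
# Hedged sketch: reading from Azure Data Lake Store via its HDFS-compatible
# interface. Assumes cluster-level ADLS credentials are already configured;
# the account name and path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read").getOrCreate()

# ADLS Gen1 exposes an adl:// URI scheme through its WebHDFS-compatible layer,
# so Spark treats it like any other distributed file system.
df = spark.read.parquet("adl://myaccount.azuredatalakestore.net/curated/sales/")
df.printSchema()
```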
Organizing a Data Lake (1/7)
Zones: Transient/Temp, Raw Data, Standardized, Curated Data, Master Data (supporting big data analytics, data science, and operationalized solutions)
Security Boundaries: Department, Business unit, Downstream app/purpose, Owner / Steward / SME, etc…
Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc…
Confidential Classification: Public information, Internal use only, Supplier/partner confidential, Personally identifiable information (PII), Sensitive – financial, Sensitive – intellectual property, etc…
Organizing a Data Lake (2/7)
Example 1
Pros: Subject area at top level, organization-wide; partitioned by time
Cons: No obvious security or organizational boundaries
Raw Data Zone: Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
e.g., Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt
Curated Data Zone: Purpose \ Type \ Snapshot Date \ File(s)
e.g., Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
(A small path-building sketch follows.)
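A small Python helper sketching Example 1's raw-zone convention (Subject Area \ Data Source \ Object \ Date Loaded \ File). Purely illustrative: it uses '/' separators for HDFS-style paths, and the file-name pattern is a made-up stand-in.

```python
# Illustrative helper for Example 1's raw-zone convention; names and the
# file-name pattern are assumptions, not the deck's exact convention.
from datetime import date

def raw_zone_path(subject_area: str, data_source: str, obj: str, load_date: date) -> str:
    """Build a date-partitioned raw-zone path with the date as the lowest folder level."""
    return (f"/raw/{subject_area}/{data_source}/{obj}/"
            f"{load_date:%Y/%m/%d}/"
            f"{obj}_{load_date:%Y_%m_%d}.txt")

print(raw_zone_path("Sales", "Salesforce", "CustomerContacts", date(2016, 12, 1)))
# -> /raw/Sales/Salesforce/CustomerContacts/2016/12/01/CustomerContacts_2016_12_01.txt
```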
Organizing a Data Lake (3/7)
Example 2
Pros: Security at the organizational level; partitioned by time
Cons: Potentially siloed data, duplicated data
Raw Data Zone: Organizational Unit \ Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
e.g., East Division \ Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt
Curated Data Zone: Organizational Unit \ Purpose \ Type \ Snapshot Date \ File(s)
e.g., East Division \ Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
Organizing a Data Lake (4/7)
Example 3
Pros: Segregates records coming in, going out, as well as error records; time partitioning can go down to the hour, or even minute, level depending on volume (ex: IoT data)
Cons: Not obvious from the names what the purpose of ‘out’ is (which could be ok if numerous downstream applications utilize the same ‘out’ data)
(Example folder structure: Subject Area \ RawData \ YYYY \ MM, plus CuratedData, MasterData, and StagedData zones)
Organizing a Data Lake (6/7)
Do:
✓ Hyper-focus on ease of data discovery & retrieval – will one type of structure make more sense?
✓ Focus on security implications early – what data redundancy is allowed in exchange for security
✓ Include data lineage & relevant metadata with the data file itself whenever possible (ex: columns indicating the source system where the data originated, source date, processed date, etc); see the sketch after this list
✓ Include the time element in both the folder structure & the file name
✓ Be liberal yet disciplined with folder structure (lots of nests are ok)
✓ Clearly separate out the zones so governance & policies can be applied separately
✓ Register the curated data with a catalog (ex: Azure Data Catalog) to document the metadata; a data catalog is even more important with a data lake
✓ Implement change management for migrating from a sandbox zone (discourage production use
from the sandbox)
✓ Assign a data owner & data archival policies as part of the structure, or part of the metadata
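A hedged PySpark sketch of the lineage guidance above: stamp rows with the source system, source date, and processed date before writing to the raw zone. Paths, the literal dates, and column names are assumptions for illustration.

```python
# Hedged lineage sketch: carry source system, source date, and processed
# date with the data itself. Paths, dates, and columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lineage-metadata").getOrCreate()

incoming = spark.read.option("header", "true").csv("/landing/salesforce/customercontacts/")

with_lineage = (incoming
                .withColumn("SourceSystem", F.lit("Salesforce"))
                .withColumn("SourceDate", F.lit("2016-12-01").cast("date"))
                .withColumn("ProcessedDate", F.current_timestamp()))

with_lineage.write.mode("append").parquet("/raw/Sales/Salesforce/CustomerContacts/2016/12/01/")
```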
Organizing a Data Lake (7/7)
Don’t:
× Do not combine mixed formats in a single folder structure (see the sketch after this list)
  – Schema-on-read will fail if a script looping through all files in a folder encounters a different format
  – Files in one folder should all be traversable with the same script
× Do not put your date partitions at the beginning of the file path -- it’s much easier to organize &
secure by subject area/department/etc if dates are the lowest folder level
Optimal for top-level security: \SubjectArea\YYYY\MM\DD\FileData_YYYY_MM_DD.txt
Tedious for enforcing security: \YYYY\MM\DD\SubjectArea\FileData_YYYY_MM_DD.txt
× Do not neglect naming conventions. You might use camel case, or you might just go with all lower case – either is ok, as long as you’re consistent, because some languages are case-sensitive
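One way to sketch the "one format per folder" rule in Python: check that every file in a folder shares a single extension before a schema-on-read job loops over it. The folder path is hypothetical.

```python
# Hedged sketch of the 'one format per folder' rule: verify every file in a
# folder shares one extension before a schema-on-read job traverses it.
from pathlib import Path

def assert_single_format(folder: str) -> str:
    """Raise if the folder mixes file formats; return the single extension found."""
    extensions = {p.suffix.lower() for p in Path(folder).iterdir() if p.is_file()}
    if len(extensions) != 1:
        raise ValueError(f"Mixed formats in {folder}: {sorted(extensions)}")
    return extensions.pop()

# Usage: run before pointing a schema-on-read query at the folder.
# assert_single_format("/datalake/raw/Sales/Salesforce/CustomerContacts/2016/12/01")
```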
Following Big Data Principles When Designing a Data Lake
Lambda Architecture
(Diagram: SaaS data, web data, corporate data, and devices & sensors are ingested via Data Factory and Event Hubs into Data Lake Store and Blob Storage; in the batch layer, an HDInsight data processing cluster, with a SQL Database Hive metastore, loads Azure SQL Data Warehouse; the speed layer uses Stream Analytics and an HDInsight Interactive Query cluster; the serving layer, including Analysis Services, supports analytics, data exploration, corporate reporting, self-service BI, advanced analytics & data science with Machine Learning, and streaming/real-time applications)
Lambda Architecture
Speed Layer: Real-time dataset. Temporary storage of low-latency data; moves to the batch layer for retention.
Batch Layer: Master dataset. An immutable, growing master dataset (typically partitioned among many physical files) of higher-latency data; the source of truth from which batch views are created. Atomic data is typically stored in a normalized format.
Serving Layer: Batch view. Support for data analysis via queries (random reads). Typically stored in denormalized form suitable for reporting & analysis; aggregations can be stored to reduce computations at runtime.
“Big Data” by Nathan Marz and James Warren
(A query-time sketch of how the layers combine follows.)
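A minimal sketch, assuming Parquet-backed views, of how the layers combine at query time: the complete but high-latency batch view is unioned with the small low-latency real-time view. Paths and columns (PageId, ViewCount) are hypothetical.

```python
# Hedged sketch of query-time layer merging in a lambda architecture;
# paths and columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-merge").getOrCreate()

batch_view = spark.read.parquet("/serving/pageviews_batch/")      # high latency, complete
realtime_view = spark.read.parquet("/speed/pageviews_recent/")    # low latency, approximate

# The serving query unions both so results cover the full timeline;
# the speed layer's contribution is superseded once the batch layer catches up.
combined = (batch_view.unionByName(realtime_view)
            .groupBy("PageId")
            .agg(F.sum("ViewCount").alias("ViewCount")))
combined.show()
```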
What principles might you expect to follow in a big data project?
Big Data Principles to Follow in a Data Lake Project
Immutable Raw Data
• Raw data is append-only & unchanging
• Continually growing
• No summarizations or deletions
• Bad data can be deleted, but it’s rare
• Immutable data is resilient to human error
Recreatable
• Everything downstream from the raw data can be regenerated (error tolerant; sketched below)
• Schema changes can be handled
• Unstructured data can always be re-structured (“semantic normalization”)
Identifiable Data
• Timestamped
• Unique (tolerant of duplicates from retries)
Rawness of Data
• Obtain the rawest, most atomic, data available
Separate Layers
• Redundant data in both the batch & serving layers allows normalized & denormalized data
• Speed layer may use approximations, corrected in the batch layer (eventual consistency)
“Big Data” by Nathan Marz and James Warren
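A hedged sketch of the immutability and recreatability principles: the raw zone is append-only, and the curated summary is disposable because it can always be rebuilt from the full raw dataset. Paths and fields are illustrative.

```python
# Hedged sketch of two principles above: raw data is append-only, and
# everything downstream can be regenerated from it. Paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recreatable").getOrCreate()

# 1. Raw zone: append only -- never overwrite or update in place.
new_batch = spark.read.json("/landing/events/2016/12/01/")
new_batch.write.mode("append").json("/raw/events/")

# 2. Downstream views are disposable: rebuild the curated summary from the
#    full, immutable raw dataset whenever logic or schema changes.
summary = (spark.read.json("/raw/events/")
           .groupBy("EventType")
           .agg(F.count("*").alias("EventCount")))
summary.write.mode("overwrite").parquet("/curated/event_summary/")
```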
Schema Changes Over Time (1/2)
Two options:
(1) Schema enforcement upon the ingestion of data
(2) Schema flexibility for the developers; deal with “standardizing” the data after ingestion
Schema Changes Over Time (2/2)
(Diagram: Raw Data → Standardized Raw Data via semantic normalization; see the sketch below)
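A hedged sketch of option (2), standardizing after ingestion: raw JSON is accepted as it arrives, and a later step reconciles schema drift (here, an old vs. new field name) into one canonical shape. Field and path names are assumptions, and both name variants are assumed to occur across the raw files.

```python
# Hedged sketch of "standardizing" after ingestion (semantic normalization).
# Assumes both the old field (CustName) and the new field (CustomerName)
# appear somewhere in the raw files; all names here are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("semantic-normalization").getOrCreate()

# Raw ingestion accepted whatever shape arrived; old and new field
# names may coexist across files.
raw = spark.read.json("/raw/customercontacts/")

# Standardization step: reconcile schema drift into one canonical shape.
standardized = raw.withColumn(
    "CustomerName",
    F.coalesce(F.col("CustomerName"), F.col("CustName"))  # old vs. new field name
).drop("CustName")

standardized.write.mode("overwrite").parquet("/standardized/customercontacts/")
```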
Data Formats & Data Compression
CSV: Commonly used. Human-readable. Not compressed. Typically not the best choice for large datasets.
JSON: Commonly used. Human-readable. Self-describing schema.
Parquet: Columnar format; highly compressed.
Avro: Row-based format. Supports compression. Schema encoded in the file.
ORC (Optimized Row Columnar): Columnar format with collections of rows. Light indexing and statistics.
Deciding on a format (a short writer sketch follows this list):
• Supported formats by key systems
• Integration with other systems
• File sizes
• Schema changes over time
• If a self-describing schema is desired
• Data type support
• Data format compatibility
• Performance of workload (read vs. write)
• Convenience & ease of use
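To see the formats side by side, the sketch below writes one DataFrame out as CSV, JSON, Parquet, and ORC with Spark's built-in writers (Avro requires the external spark-avro package, so it is omitted). Output paths are illustrative.

```python
# Hedged sketch comparing on-disk formats from the table above with one
# DataFrame; output paths are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "SaleId")

df.write.mode("overwrite").option("header", "true").csv("/tmp/sales_csv")  # human-readable, uncompressed
df.write.mode("overwrite").json("/tmp/sales_json")        # human-readable, self-describing
df.write.mode("overwrite").parquet("/tmp/sales_parquet")  # columnar, highly compressed
df.write.mode("overwrite").orc("/tmp/sales_orc")          # columnar with light indexing/statistics
```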
Techniques to Recompute the Serving Layer
Full recomputation
The entire master dataset is used to recompute the batch views in the serving layer.
Pros: Simplicity
Better human fault-tolerance
Ability to continually reap benefits of improved algorithms or calculations
Easier to keep wide datasets that contain redundant data synchronized/consistent
Cons: Performance; speed of updates
CPU and I/O heavy
Not practical for extremely large datasets
Incremental recomputation
Only new data from the master dataset is involved in recomputations.
Pros: Better performance
Cons: Significantly more complex
Still need a way to do a full recomputation in the event of errors or significant changes
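A hedged PySpark sketch contrasting the two techniques above. The master dataset layout, partitioning by load date, and the aggregate being computed are all assumptions for illustration.

```python
# Hedged sketch of full vs. incremental recomputation of a serving-layer view;
# paths and the partitioning scheme are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("recompute").getOrCreate()

def full_recompute():
    """Simple & fault-tolerant: rebuild the batch view from the entire master dataset."""
    master = spark.read.parquet("/raw/sales/")
    view = master.groupBy("ProductId").agg(F.sum("SaleAmount").alias("TotalSales"))
    view.write.mode("overwrite").parquet("/serving/sales_by_product/")

def incremental_recompute(load_date: str):
    """Faster but more complex: fold only the new partition into the existing view."""
    new_data = spark.read.parquet(f"/raw/sales/{load_date}/")
    delta = new_data.groupBy("ProductId").agg(F.sum("SaleAmount").alias("TotalSales"))
    existing = spark.read.parquet("/serving/sales_by_product/")
    merged = (existing.unionByName(delta)
              .groupBy("ProductId")
              .agg(F.sum("TotalSales").alias("TotalSales")))
    # Write to a new path rather than overwriting a path still being read.
    merged.write.mode("overwrite").parquet("/serving/sales_by_product_v2/")
```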
What is the state of data modeling for files stored in a data lake?
Data Modeling for Files in a Data Lake
Wide datasets, with all data needed in one file, are commonly used
Pros: Easy to do analysis.
Data can be co-located on the nodes as the data gets distributed (depending on the tool).
Frequently the desired format for data scientists & the tools they use.
Usually well-suited to in-memory, columnar, data formats.
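As a small illustration of that last point, a wide Parquet dataset can be read straight into a columnar in-memory format with pyarrow and handed to pandas; the path is hypothetical.

```python
# Hedged sketch: a wide, denormalized dataset read into a columnar,
# in-memory format (Parquet -> Arrow -> pandas); the path is illustrative.
import pyarrow.parquet as pq

table = pq.read_table("/curated/sales_wide/")   # columnar on disk and in memory
df = table.to_pandas()                          # one wide frame, analysis-ready
print(df.shape)
```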