The document discusses architecting a data lake. A data lake is a repository that stores large quantities of raw data in its native format. It provides a single platform for various types of data including machine, human, and traditional operational data. The data lake uses a processing engine like Hadoop for analyzing the raw data. Some common use cases of data lakes include storing IoT sensor data, social media data, web logs, images and more. Data lakes aim to provide agility, handle complexity, and balance self-service with corporate solutions for analytics.


Architecting a Data Lake

Chad Gronbach
Chief Technology Architect
Microsoft Technology Center - Boston

Content Credit: JamesSerra.com


Modern Multi-Platform Architectures

Modern data warehousing & analytics is a multi-platform architecture. Sources such as devices & sensors, social media data, cloud systems, corporate data, third-party data, and master data feed a data lake, Hadoop, NoSQL stores, an ODS, a data warehouse, and data marts; downstream consumption includes alerts, near real-time monitoring, advanced analytics, machine learning, operational reporting, historical analytics & reporting, cubes & semantic models, in-memory models, and self-service reports & models, tied together by batch ETL, data integration, and data virtualization.

✓ Handle a variety of data types & sources
✓ Larger data volumes at lower latency
✓ Bimodal: self-service + corporate BI to support all types of users
✓ Newer cloud services
✓ Advanced analytics scenarios
✓ Balance data integration & data virtualization
Definitions (1/3)

Data Warehouse: Repository of data from multiple sources, cleansed & enriched for reporting; generally 'schema on write'
Data Lake: Repository for multi-structured data; generally 'schema on read'
Hadoop: (1) Data storage via HDFS (Hadoop Distributed File System), and (2) a set of Apache projects for data processing and analytics
Lambda Architecture: Data processing & storage with batch, speed, and serving layers
ETL: Extract > Transform > Load; traditional paradigm associated with data warehousing and 'schema on write'
ELT: Extract > Load > Transform; newer paradigm associated with data lakes & 'schema on read'
Semantic Model: User-friendly interface for users on top of a data warehouse and/or data lake
Definitions (2/3)

Data Integration: Physically moving data to integrate multiple sources together
Data Virtualization: Access to one or more distributed data sources without requiring the data to be physically materialized in another data structure
Federated Query: A type of data virtualization; access & consolidate data from multiple distributed data sources
Polyglot Persistence: A multi-platform strategy which values using the most effective technology based on the data itself ("best fit engineering")
Schema on Write: Data structure is applied at design time, requiring additional up-front effort to formulate a data model (relational DBs)
Schema on Read: Data structure is applied at query time rather than when the data is initially stored (data lakes, NoSQL)

More in-depth definitions: https://www.sqlchick.com/entries/2017/1/9/defining-the-components-of-a-modern-data-warehouse-a-glossary
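To make the contrast concrete, here is a minimal schema-on-read sketch in PySpark. The paths and column names are hypothetical, not taken from the deck: the raw JSON is landed without declaring any schema, and structure is applied only when the data is queried.

```python
# Minimal schema-on-read sketch (hypothetical paths & columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw tweets were landed as-is; no schema was declared at write time.
raw = spark.read.json("/datalake/raw/social/tweets/2016/12/01/")

# Structure emerges at query time: select, cast, and rename only what this analysis needs.
curated = (raw
           .select(F.col("id").cast("long").alias("tweet_id"),
                   F.to_timestamp("created_at").alias("created_ts"),
                   F.col("text"))
           .where(F.col("text").isNotNull()))

curated.show(5)
```

Under schema on write, the same columns and types would have to be modeled up front, before a single row could be loaded into the warehouse table.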
Definitions (3/3)

Cloud service models, ranging from more control (dedicated infrastructure, higher cost of ownership, higher administration effort, more difficult to scale) to less control (shared infrastructure, lower cost, lower administration effort, easier to scale):

On-Premises / Azure Stack (private cloud): physical servers and virtual servers
Infrastructure as a Service (IaaS): SQL Server in a VM, Hadoop in a VM
Platform as a Service (PaaS): Azure SQL DB, Azure SQL DW, Azure HDInsight / Databricks, Azure Data Lake Store
Software as a Service (SaaS): Power BI, Office 365
Serverless: Azure Functions, serverless apps
What are some common challenges of analytical environments?

Challenges of Analytical Environments

Agility:
✓ Reducing time to value
✓ Minimizing chaos with self-service
✓ Evolving & maturing technology
✓ Balancing schema-on-read with schema-on-write
✓ How strict to be with dimensional design?

Complexity:
✓ Hybrid scenarios
✓ Multi-platform architecture
✓ Ever-increasing data volumes
✓ Diversity of file types & formats
✓ Effort & cost of data integration
✓ Many skillsets needed

Balance:
✓ Self-service solutions challenge corporate DW solutions
✓ Operationalizing valuable user-created solutions (including data science)
✓ Handling ownership changes of a productionized solution

Never-Ending:
✓ Data quality
✓ User trust
✓ Master data
✓ Security
✓ Governance
✓ Performance
Data Lake Overview & Use Cases

Data Lake

(A) A repository for storing large quantities of disparate sources of data in its native format. One architectural platform to house all types of data:
✓ Machine-generated data (ex: IoT, logs)
✓ Human-generated data (ex: tweets, e-mail)
✓ Traditional operational data (ex: sales, inventory)

(B) A processing engine for analyzing data.

(Diagram: spatial/GPS data, devices & sensors, social media data, web logs, and images/audio/video feeding the data lake and its processing engine.)
Data Lake Objectives (1/2)

✓ Reduce up-front effort by ingesting data in any format, any size, without requiring a schema initially
✓ Make acquiring new data easy, so it can be available for data science & analysis quickly
✓ Store large volumes of multi-structured data in its native format
✓ Storage for additional types of data which were historically difficult to obtain or store
✓ Reduce the long-term ownership cost of data management & storage
Data Lake Objectives (2/2)

✓ Schema-on-read: defer work to 'schematize' until after value & requirements are known
✓ Achieve agility faster than a traditional data warehouse can, to speed up decision-making ability
✓ Access to low-latency data
✓ Different / new value proposition vs. traditional data warehousing
✓ Facilitate advanced analytics scenarios
Data Lake Use Cases: Ingestion of New File Types

✓ Preparatory file storage for multi-structured data
✓ Exploratory analysis + POCs to determine the value of new data types & sources
✓ Affords additional time for longer-term planning while accumulating data or handling an influx of data

(Diagram: raw data from spatial/GPS sources, devices & sensors, social media, web logs, and images/audio/video feeding exploratory analysis and a data science sandbox.)
Data Lake Use Cases: Data Science Experimentation | Hadoop Integration

✓ Sandbox solutions for initial data prep, experimentation, and analysis
✓ Migrate from proof of concept to operationalized solution
✓ Integrate with open source projects such as Hive, Pig, Spark, Storm, etc.
✓ Big data clusters
✓ SQL-on-Hadoop solutions

(Diagram: raw data, curated data, and flat files feeding Hadoop, machine learning, advanced analytics, a data science sandbox, exploratory analysis, and analytics & reporting.)
Data Lake Use Cases: Data Warehouse Staging Area

✓ ELT strategy
✓ Reduce storage needs in the relational platform by using the data lake as the landing area
✓ Practical use for data stored in the data lake
✓ Potentially also handle transformations in the data lake

(Diagram: cloud systems, devices & sensors, social media data, third-party data/flat files, and corporate data land in the raw data/staging area; data processing jobs load the data warehouse, cubes & semantic models, and analytics & reporting.)
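As a rough illustration of the ELT pattern described above, the sketch below (hypothetical paths and column names, not from the deck) treats the lake as the landing area and performs the transformation there with PySpark, writing a curated extract that a separate data warehouse load process would pick up.

```python
# Hypothetical ELT sketch: the lake is the landing area; transform there, then hand off
# a curated extract that the data warehouse load process consumes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-staging-demo").getOrCreate()

# Extract > Load already happened: third-party flat files sit in the raw/staging zone.
orders = spark.read.csv("/datalake/raw/sales/orders/2016/12/01/",
                        header=True, inferSchema=True)

# Transform in the lake instead of in the relational platform.
daily = (orders
         .withColumn("order_date", F.to_date("order_timestamp"))
         .groupBy("order_date", "product_id")
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("order_count")))

# Write to the curated zone; a separate job (for example PolyBase or Data Factory)
# loads this extract into the data warehouse.
daily.write.mode("overwrite").parquet("/datalake/curated/sales/daily_orders/")
```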
Data Lake Use Cases: Integration with DW | Data Archival | Centralization

✓ Grow around the existing DW
✓ Aged data available for querying when needed
✓ Complement to the DW via data virtualization
✓ Federated queries to access current data (relational DB) + archive (data lake)

(Diagram: the staging area and archived data in the lake sit alongside the data warehouse (relational DB), cubes & semantic models, and analytics & reporting.)
Data Lake Use Cases: Lambda Architecture

✓ Support for low-latency, high-velocity data in near real time
✓ Support for batch-oriented operations

(Diagram: the speed layer handles data ingestion and processing for near real-time consumption; the batch layer holds raw and curated data in the lake and feeds the data warehouse; the serving layer exposes cubes & semantic models for analytics & reporting.)
What are some initial considerations for
deciding if a data lake is right for you?
Is a Data Lake Right For You?
Initial Considerations:
Do you have non-relational data?

Do you have IoT type of data?

Do you have advanced analytics scenarios on unusual datasets?

Do you need to offload ETL processing (ELT) and/or archival data from a data warehouse?

Readiness:
Are you ready and willing to learn different development patterns and/or new technologies?

Are you ready to handle the trade-offs of ‘schema on read’ vs ‘schema on write’?
What are some key differences between
a data warehouse & a data lake?
Data Lake + Data Warehouse: Inverse Relationship

Data Lake (schema on read) focuses on:
✓ Agility
✓ Flexibility
✓ Easy data acquisition
✓ Early exploration activities

Enterprise Data Warehouse (schema on write) focuses on:
✓ Cleansed, user-friendly data
✓ Reliability
✓ Standardization
✓ Process-oriented operationalization

Data acquisition: less effort in the data lake, more effort in the data warehouse
Data retrieval: more effort in the data lake, less effort in the data warehouse
Data Lake Challenges

Technology:
✓ Additional component(s) in a multi-layered architecture
✓ Unknown storage & scalability
✓ Data retrieval
✓ Working with un-curated data
✓ Performance
✓ Change management

Process:
✓ Right balance of deferred work vs. up-front work
✓ Ignoring established best practices for data management
✓ Data quality
✓ Governance
✓ Security
✓ Disaster recovery for large solutions

People:
✓ Expectations & trust
✓ Data stewardship
✓ Redundant effort
✓ Skills required to effectively use the data
Big Data in Azure

Compute (PaaS): Azure HDInsight, Azure Databricks, Azure Machine Learning
Compute (IaaS): Azure Data Science VMs; Hadoop on a cluster of Azure virtual machines
Storage: Azure Data Lake Store (Gen2), Azure Storage
Big Data in Azure: Compute

The compute services span a spectrum from higher complexity, control, & customization (greater administrative effort, greater integration with Apache projects) to greater ease of use (less administrative effort, less integration with Apache projects):

Hadoop / Data Science on Azure virtual machines (IaaS)
Azure HDInsight (PaaS)
Azure Databricks (PaaS)
Azure Machine Learning
Deciding Between Compute Services

Hadoop VM
Type: IaaS
Purpose: Running your own cluster of Hadoop virtual machines
Suitable for: Full control over everything; investment in distributions such as Hortonworks, Cloudera, MapR

HDInsight
Type: PaaS
Purpose: Running a managed cluster
Suitable for: Integration with open source Apache projects (ex: Hive, Storm, Kafka, Spark, etc.)

Databricks
Type: PaaS
Purpose: Running an optimized Spark framework
Suitable for: Collaborative notebooks, easier deployments

Azure ML
Type: SaaS
Purpose: Running packaged AI, R, or Python script
Suitable for: An ideal initial entry point for sandbox experimentation
Intro to
Azure Data Lake
Azure Data Lake Store - Compatibility

(1) A WebHDFS endpoint (https://) allows integration with open source projects.
(2) "AzureDataLakeFilesystem" (adl://) provides additional performance enhancements not available in WebHDFS.
(3) Other various connectivity options (ex: Spark API, RDD API, Databricks File System) are also available.

(Diagram: Azure HDInsight workloads such as Hadoop, Kafka, R, HBase, Hive, Spark, and Storm; Azure Machine Learning; Azure Databricks components such as SparkSQL, DataFrames, MLlib, GraphX, and SparkR; and custom HDFS applications all connect to Azure Data Lake Store (ADLS) through the WebHDFS-compatible interface.)
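For illustration, a Spark job might address the store directly through the adl:// scheme, as in the hedged sketch below. The account name and paths are placeholders, and the cluster (HDInsight or Databricks, for example) is assumed to already be configured with credentials for the store.

```python
# Hypothetical example of addressing Azure Data Lake Store from Spark via the adl:// scheme.
# Assumes the cluster is already configured with service-principal credentials for the store;
# the account name and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-read-demo").getOrCreate()

logs = spark.read.text("adl://mydatalake.azuredatalakestore.net/raw/weblogs/2016/12/01/")
print(logs.count())
```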
Azure Data Lake Store – Distributed File System

Files of any size can be stored because ADLS is a distributed system in which file contents are divided up across backend storage nodes.
A read operation on a file is also parallelized across the nodes.
Blocks are also replicated for fault tolerance.

(Diagram: an ADLS file split into blocks spread across data nodes 1-4.)

The ideal file size in ADLS is 256MB – 2GB.
Many very tiny files introduce significant overhead, which reduces performance. This is a well-known issue with storing data in HDFS. Techniques:
• Append-only data streams
• Consolidation of data into larger files
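A simple way to apply the consolidation technique is to rewrite a folder of tiny files as a handful of larger ones, as in this sketch (hypothetical paths; the target file count would be derived from the total input size and the 256MB – 2GB guideline).

```python
# Small-file consolidation sketch (hypothetical paths): read a folder containing many
# tiny files and rewrite it as a few larger files closer to the ideal size range.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

events = spark.read.json("/datalake/raw/sensors/events/2016/12/01/")

# coalesce() reduces the number of output files without a full shuffle;
# choose the target count from total input size / desired file size.
(events
 .coalesce(8)
 .write.mode("overwrite")
 .json("/datalake/standardized/sensors/events/2016/12/01/"))
```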
Designing the Structure of a Data Lake

Designing the Zones of a Data Lake

(Diagram: zones within the data lake: Transient/Temp, Raw Data, Master Data, User Drop Zone, Standardized Raw Data, Curated Data, Staged Data, Archive Data, and Analytics Sandbox. Curated and staged data feed the data warehouse, big data analytics, operationalized data science, and analytics & reporting; the sandbox supports exploratory analysis & data science. Metadata, security, governance, and information management span all zones.)
Raw Data Zone

✓ Storage in native format for any type of data
✓ Exact copy from the source
✓ Immutable to change
✓ Typically append-only
✓ History retained indefinitely
✓ Extremely limited access to the Raw Data Zone; no operationalized usage
✓ Everything downstream from here can be regenerated from raw data
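The append-only, immutable character of this zone can be illustrated with a small sketch (hypothetical source and target paths): each extract is written to a new date-partitioned folder, and an already-loaded folder is never rewritten.

```python
# Hypothetical append-only raw-zone load: each extract lands in a new date-partitioned
# folder; existing folders are never rewritten.
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-zone-append").getOrCreate()

load_dt = date.today()
target = ("/datalake/raw/salesforce/customercontacts/"
          f"{load_dt:%Y}/{load_dt:%m}/{load_dt:%d}/")

incoming = spark.read.csv("/dropzone/salesforce/customercontacts/", header=True)

# errorifexists (the default save mode) guards immutability: re-running the job against
# an already-loaded date fails instead of silently overwriting history.
incoming.write.mode("errorifexists").csv(target, header=True)
```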
Transient/Temp Zone

✓ Selectively utilized
✓ Useful when data quality checks or validation are required before the data is routed to the Raw Data Zone for retention
✓ Useful when you need a "New Data" zone separate from the Raw Data Zone (ex: to ensure that jobs pulling data from Raw receive consistent data)
✓ Could contain transient, low-latency data (aka 'speed layer')
Master Data Zone

✓ Reference data to augment analysis


User Drop Zone

✓ Manually-generated data to augment analysis


Curated Data Zone

✓ Cleansed and transformed
✓ Organized for optimal data delivery (aka 'serving layer')
✓ Nearly all self-service data access comes from the Curated Data Zone
✓ Standard governance and security policies
✓ Standard change management principles


Standardized Data Zone

✓ A standardized version of the Raw Data Zone, applicable to data structures which vary in format (ex: JSON which is standardized into consistent columns & rows, aka 'semantic normalization')
✓ No real cleansing or transformations applied
✓ Intermediary to assist creation of curated data
✓ File consolidations (ex: solve 'small files' performance issues)
Staged Data Zone

✓ Data which is staged for a particular purpose or application (thus has certain columns, certain formats, with or without headers, etc.)


Analytics Sandbox Zone

✓ Workspace for data science and exploratory activities
✓ Minimal, if any, governance and standards (purposely undisciplined)
✓ Valuable efforts are "productionized" and "operationalized" to the Curated Data Zone
✓ Not used for self-service, operationalized purposes


Archive Data Zone

✓ An active archive
✓ Contains aged data offloaded from a data warehouse or other application
✓ Available for querying when needed (typically only occasionally)


What are some ways we
could potentially organize
data in a data lake?
Organizing a Data Lake (1/7)

Objectives:
✓ Plan the structure based on optimal data retrieval
✓ Avoid a chaotic, unorganized data swamp

Common ways to organize the data:
• Time partitioning: Year/Month/Day/Hour/Minute (see the path-building sketch after this list)
• Data retention policy: temporary data, permanent data, applicable period (ex: project lifetime), etc.
• Probability of data access: recent/current data, historical data, etc.
• Subject area
• Security boundaries: department, business unit, etc.
• Business impact / criticality: high (HBI), medium (MBI), low (LBI), etc.
• Confidential classification: public information, internal use only, supplier/partner confidential, personally identifiable information (PII), sensitive - financial, sensitive - intellectual property, etc.
• Downstream app/purpose
• Owner / Steward / SME
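As referenced above, a tiny illustrative helper shows how time partitioning might translate into folder paths, with the subject area at the top and the date as the lowest folder levels. The names and layout are only an example, in the style of the examples that follow.

```python
# Illustrative helper: build a time-partitioned raw-zone path with subject area at the
# top of the hierarchy and the date as the lowest folder levels.
from datetime import datetime

def raw_zone_path(subject_area: str, source: str, obj: str, loaded: datetime) -> str:
    return (f"/raw/{subject_area}/{source}/{obj}/"
            f"{loaded:%Y}/{loaded:%m}/{loaded:%d}/")

print(raw_zone_path("sales", "salesforce", "customercontacts",
                    datetime(2016, 12, 1)))
# -> /raw/sales/salesforce/customercontacts/2016/12/01/
```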
Organizing a Data Lake (2/7)

Example 1
Pros: Subject area at the top level, organization-wide; partitioned by time
Cons: No obvious security or organizational boundaries

Raw Data Zone: Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
ex: Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt

Curated Data Zone: Purpose \ Type \ Snapshot Date \ File(s)
ex: Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
Organizing a Data Lake (3/7)

Example 2
Pros: Security at the organizational level; partitioned by time
Cons: Potentially siloed data, duplicated data

Raw Data Zone: Organizational Unit \ Subject Area \ Data Source \ Object \ Date Loaded \ File(s)
ex: East Division \ Sales \ Salesforce \ CustomerContacts \ 2016 \ 12 \ 01 \ CustContact_2016_12_01.txt

Curated Data Zone: Organizational Unit \ Purpose \ Type \ Snapshot Date \ File(s)
ex: East Division \ Sales Trending Analysis \ Summarized \ 2016_12_01 \ SalesTrend_2016_12_01.txt
Organizing a Data Lake (4/7)

Example 3
Pros: Segregates records coming in, going out, as well as error records. Time partitioning can go down to the hour, or even minute, level depending on volume (ex: IoT data).
Cons: Not obvious by the names what the purpose of 'out' is (which could be OK if numerous downstream applications utilize the same 'out' data).

Raw Data Zone: Organization Unit \ Subject Area \ In | Out | Error \ YYYY \ MM \ DD \ HH \ File(s)
Organizing a Data Lake (5/7)

Example 4
Zones are a logical need, but they don't necessarily have to be at the top of the structure.
Pros: Security by subject area
Cons: All raw data is not centralized

Subject Area 1: RawData (YYYY \ MM), CuratedData, MasterData, StagedData
Subject Area 2: RawData (YYYY \ MM), CuratedData, MasterData, StagedData
Organizing a Data Lake (6/7)
Do:
✓ Hyper-focus on ease of data discovery & retrieval – will one type of structure make more sense?
✓ Focus on security implications early – what data redundancy is allowed in exchange for security
✓ Include data lineage & relevant metadata with the data file itself whenever possible (ex: columns
indicating source system where the data originated, source date, processed date, etc)
✓ Include the time element in both the folder structure & the file name
✓ Be liberal yet disciplined with folder structure (lots of nests are ok)
✓ Clearly separate out the zones so governance & policies can be applied separately
✓ Register the curated data with a catalog (ex: Azure Data Catalog) to document the metadata; a data catalog is even more important with a data lake
✓ Implement change management for migrating from a sandbox zone (discourage production use
from the sandbox)
✓ Assign a data owner & data archival policies as part of the structure, or part of the metadata
Organizing a Data Lake (7/7)

Don't:
× Do not combine mixed formats in a single folder structure
  • If a process is looping through all files in a folder, schema-on-read will fail when it finds a different format
  • Files in one folder should all be able to be traversed with the same script

× Do not put your date partitions at the beginning of the file path; it's much easier to organize & secure by subject area/department/etc. if dates are the lowest folder level
  Optimal for top-level security: \SubjectArea\YYYY\MM\DD\FileData_YYYY_MM_DD.txt
  Tedious for enforcing security: \YYYY\MM\DD\SubjectArea\FileData_YYYY_MM_DD.txt

× Do not neglect naming conventions. You might use camel case, or you might just go with all lower case; either is OK, as long as you're consistent, because some languages are case-sensitive.
Following
Big Data Principles
When Designing
A Data Lake
Lambda Architecture

(Diagram of an Azure implementation:)
Batch layer: corporate data, SaaS data, and web data are ingested with Data Factory into Data Lake Store; an HDInsight cluster handles data processing; a SQL Database serves as the Hive metastore.
Speed layer: devices & sensors stream through Event Hubs and Stream Analytics into an HDInsight Interactive Query cluster and Blob Storage for streaming/real-time applications.
Serving layer: Azure SQL Data Warehouse and Analysis Services support analytics, data exploration, corporate reporting, and self-service BI; Machine Learning (R, Python, APIs) supports advanced analytics & data science.
Lambda Architecture

Speed layer: the real-time dataset. Temporary storage of low-latency data; it moves to the batch layer for retention.
Batch layer: the master dataset. An immutable, growing master dataset (typically partitioned among many physical files) of higher-latency data; the source of truth from which batch views are created. Atomic data is typically stored in a normalized format.
Serving layer: the batch views. Support for data analysis via queries (random reads). Typically stored in denormalized form suitable for reporting & analysis. Aggregations can be stored to reduce computations at runtime.

Source: "Big Data" by Nathan Marz and James Warren
What principles might you
expect to follow
in a big data project?
Big Data Principles to Follow in a Data Lake Project

Immutable Raw Data
• Raw data is append-only & unchanging
• Continually growing
• No summarizations or deletions
• Bad data can be deleted, but it's rare
• Immutable data is resilient to human error

Recreatable
• Everything downstream from the raw data can be regenerated (error tolerant)
• Schema changes can be handled
• Unstructured data can always be re-structured ("semantic normalization")

Identifiable Data
• Timestamped
• Unique (tolerant of duplicates from retries)

Rawness of Data
• Obtain the rawest, most atomic, data available

Separate Layers
• Redundant data in both the batch & serving layers allows normalized & denormalized data
• Speed layer may use approximations, corrected in the batch layer (eventual consistency)

Source: "Big Data" by Nathan Marz and James Warren
Schema Changes Over Time (1/2)

Schema changes include:


Addition of new columns
Removal of columns
Renaming of columns

Two options:
(1) Schema enforcement upon the ingestion of data
(2) Schema flexibility for the developers; deal with “standardizing” the data after ingestion
Schema Changes Over Time (2/2)

(Diagram: a raw data example with drifting columns, and the corresponding standardized raw data after 'semantic normalization'.)
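Option (2) from the previous slide, standardizing after ingestion, might look roughly like the sketch below (hypothetical paths and column list): raw JSON files whose columns drift over time are projected onto a fixed set of standardized columns, and anything missing in a given batch becomes null.

```python
# Hypothetical "standardize after ingestion" sketch: project drifting raw columns onto a
# fixed, standardized column list for the Standardized Raw Data zone.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("semantic-normalization").getOrCreate()

raw = spark.read.json("/datalake/raw/devices/telemetry/2016/12/*/")

standard_columns = ["device_id", "reading_ts", "temperature", "humidity"]
projected = raw.select([
    # Columns absent from this batch are filled with nulls; in a real job each filler
    # would be cast to the column's proper type rather than string.
    F.col(c) if c in raw.columns else F.lit(None).cast("string").alias(c)
    for c in standard_columns
])

projected.write.mode("overwrite").parquet("/datalake/standardized/devices/telemetry/")
```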
Data Formats & Data Compression

CSV: Commonly used. Human-readable. Not compressed. Typically not the best choice for large datasets.
JSON: Commonly used. Human-readable. Self-describing schema.
Parquet: Columnar format; highly compressed.
Avro: Row-based format. Supports compression. Schema encoded in the file.
ORC (optimized row columnar): Columnar format with collections of rows. Light indexing and statistics.

Deciding on a format:
• Supported formats by key systems
• Integration with other systems
• File sizes
• Schema changes over time
• Whether a self-describing schema is desired
• Data type support
• Data format compatibility
• Performance of workload (read vs. write)
• Convenience & ease of use
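A typical hand-off between formats is converting a landed CSV file to compressed Parquet for downstream querying; a hedged sketch (hypothetical paths) follows.

```python
# Illustrative format conversion (hypothetical paths): a CSV landing file is rewritten as
# snappy-compressed Parquet, a columnar format well suited to analytical reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

csv_df = spark.read.csv("/datalake/raw/thirdparty/prices/2016/12/01/",
                        header=True, inferSchema=True)

(csv_df.write
 .mode("overwrite")
 .option("compression", "snappy")
 .parquet("/datalake/curated/thirdparty/prices/"))
```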
Techniques to Recompute the Serving Layer
Full recomputation
The entire master dataset is used to recompute the batch views in the serving layer.
Pros: Simplicity
Better human fault-tolerance
Ability to continually reap benefits of improved algorithms or calculations
Easier to keep wide datasets which contain redundant data synchronized/consistent
Cons: Performance; speed of updates
CPU and I/O heavy
Not practical for extremely large datasets

Incremental recomputation
Only new data from the master dataset is involved in recomputations.
Pros: Better performance
Cons: Significantly more complex
Still need a way to do a full recomputation in the event of errors or significant changes
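The two techniques might be sketched as follows (hypothetical paths and measures, not from the deck). Note that the incremental variant writes its merged view to a new location, since Spark cannot safely overwrite a Parquet folder it is reading from within the same job.

```python
# Hedged sketch of full vs. incremental recomputation of a batch view over a hypothetical
# master dataset of orders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("recompute-serving-layer").getOrCreate()

def full_recompute():
    # Re-derive the batch view from the entire immutable master dataset.
    master = spark.read.parquet("/datalake/raw/sales/orders/")
    view = master.groupBy("product_id").agg(F.sum("amount").alias("total_amount"))
    view.write.mode("overwrite").parquet("/datalake/curated/views/sales_by_product/")

def incremental_recompute(year: str, month: str, day: str):
    # Process only the newest partition and merge it into the existing batch view.
    new_data = spark.read.parquet(f"/datalake/raw/sales/orders/{year}/{month}/{day}/")
    delta = new_data.groupBy("product_id").agg(F.sum("amount").alias("total_amount"))
    existing = spark.read.parquet("/datalake/curated/views/sales_by_product/")
    merged = (existing.unionByName(delta)
              .groupBy("product_id")
              .agg(F.sum("total_amount").alias("total_amount")))
    # Written to a new path, then swapped in, to avoid overwriting data being read.
    merged.write.mode("overwrite").parquet("/datalake/curated/views/sales_by_product_new/")
```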
What is the state of data modeling
for files stored in
a data lake?
Data Modeling for Files in a Data Lake
Wide datasets, with all data needed in one file, are commonly used
Pros: Easy to do analysis.
Data can be co-located on the nodes as the data gets distributed (depending on the tool).
The desired format frequently for data scientists & the tools they use.
Usually well-suited to in-memory, columnar, data formats.

Cons: Data is repeated (particularly dimensional data) across lots of files.


Keeping data updated across many files can take time.
Data of different granularities can get tricky.
Immutable, append-only data means everything acts like a slowly changing dimension.
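A wide dataset of the kind described above is often produced by joining fact-grained data with dimension-style reference data; the sketch below (hypothetical paths and keys) shows the idea, accepting repeated dimension values in exchange for analysis-friendly, single-file consumption.

```python
# Hypothetical sketch: build a wide, denormalized dataset by joining fact rows with
# dimension-style reference data from the master data zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wide-dataset").getOrCreate()

orders   = spark.read.parquet("/datalake/standardized/sales/orders/")
products = spark.read.parquet("/datalake/masterdata/products/")
stores   = spark.read.parquet("/datalake/masterdata/stores/")

wide = (orders
        .join(products, "product_id", "left")
        .join(stores, "store_id", "left"))

wide.write.mode("overwrite").parquet("/datalake/curated/sales/orders_wide/")
```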
Recap,
Suggestions,
Q&A
