Data Lake

The document discusses data lakes, including what they are, how they differ from data warehouses, why they are needed, common architectures and components, use cases, tools and technologies used, and challenges. A data lake is a centralized repository that stores raw data in its native format at scale. It differs from a data warehouse in that data does not need to be structured beforehand and can be used for various analytics. Benefits include cost effectiveness, scalability, and flexibility in storing various data types.

Uploaded by Nada Elsharawy

Data Lake

Agenda
- What is a Data lake?
- What is the difference between Data lake and Data warehouse?
- Why do you need a Data lake?
- What are the architectures of the Data lake?
- Use Cases for Data lakes
- Tools and Technologies Used in Data Lakes
- What are the challenges of data lakes?
What is a Data lake?
Data Lake A data lake is a centralized
repository that allows you to store
all your data at any scale.
It lets you store raw data as-is.
without having to first structure the
data, and run different types of
analytics—from dashboards and
visualizations to big data processing,
and real-time analytics.
large-scale repositories designed
to store and manage massive
amounts of data.
Key Characteristics of data lakes:

- Stores data in its native format
- Data can be structured, semi-structured, or unstructured
- Uses a flat architecture
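These characteristics can be made concrete with a toy sketch. The class below is an illustrative, in-memory stand-in (not any real product): objects of any format sit side by side under path-like keys in one flat namespace, with no schema enforced at write time.

```python
import json

# Minimal sketch of a data lake's flat storage model (illustrative only):
# raw bytes stored under path-like keys, no enforced schema.
class TinyDataLake:
    def __init__(self):
        self._objects = {}  # flat key -> raw bytes

    def put(self, key, raw_bytes):
        # Data is stored as-is, in its native format.
        self._objects[key] = raw_bytes

    def get(self, key):
        return self._objects[key]

lake = TinyDataLake()
# Structured (CSV), semi-structured (JSON), and unstructured (text) coexist.
lake.put("sales/2024/q1.csv", b"region,amount\neu,100\nus,250")
lake.put("events/click.json", json.dumps({"user": 7, "page": "/home"}).encode())
lake.put("notes/readme.txt", b"free-form text, no schema")

print(sorted(lake._objects))
```

Note that nothing stops the three formats from mixing in one namespace; structure is the reader's problem, not the writer's.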
What is the difference between Data lake and Data warehouse?
Data warehouse vs. data lake:

- Data Types: A data warehouse stores processed, structured data according to specific metrics and attributes. A data lake stores raw, structured, semi-structured, and unstructured data from multiple sources.
- Data Purpose: In a warehouse, the data is currently being used for operations and analytics within a structured system. In a lake, the purpose of the data is often not predefined; it can be used for machine learning, AI algorithms, and other business purposes after processing.
- Process: A warehouse follows an Extract, Transform, Load (ETL) process, which offers security and high performance. A lake follows an Extract, Load, Transform (ELT) approach, which offers agility and easy data capture.
- Schema Position: A warehouse uses schema-on-write; the schema is defined before data storage. A lake uses schema-on-read; the schema is applied after data storage.
- Users: A warehouse is suited for business professionals who need operational reporting and analytics. A lake is ideal for data scientists and those who need in-depth analysis and predictive modeling tools.
- Accessibility: A warehouse is more complicated to change due to the structured nature of the data. A lake is highly accessible and easier to update.
- History: The warehouse concept has been around for decades and is well-established. The data lake is a relatively new concept for managing big data.
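The schema-position contrast is the sharpest difference, and a small sketch can make it concrete. The functions below are purely illustrative (not any real warehouse or lake API): schema-on-write validates each record before storage, while schema-on-read stores raw text untouched and applies structure only when a reader parses it.

```python
import json

def write_to_warehouse(table, record, schema):
    # Schema-on-write: reject records that don't match the schema up front.
    if set(record) != set(schema):
        raise ValueError(f"record fields {set(record)} != schema {set(schema)}")
    table.append(record)

def read_from_lake(raw_lines, parser):
    # Schema-on-read: raw data stays untouched; the reader imposes structure.
    return [parser(line) for line in raw_lines]

warehouse = []
write_to_warehouse(warehouse, {"id": 1, "amount": 9.5}, schema={"id", "amount"})

raw = ['{"id": 2, "amount": 3.0, "extra": true}']   # raw JSON, stored as-is
parsed = read_from_lake(raw, json.loads)

print(warehouse[0]["amount"], parsed[0]["extra"])
```

The lake side happily carries the unexpected `extra` field; the warehouse side would have rejected it at write time.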
Why do you need a data lake?
• Cost-Effectiveness: They typically require low-cost hardware, and many technologies used
for data management in data lakes are open source, such as Hadoop, making them more
economical than data warehouses.
• Resource Optimization: By storing any kind of data, data lakes help reduce unnecessary
resource usage within an organization.
• Scalability: Data lakes can easily scale to store and process large amounts of data,
accommodating the growth of data over time.
• Flexibility: They can store data in any format, including structured, semi-structured, and
unstructured, which is essential for machine learning use cases.
• Centralization: A data lake creates a single point of reference by consolidating information in
one place, reducing data siloing and making it easier to find, analyze, and share data across
different departments and projects.
• Machine Learning and AI: The sheer volume and variety of data in a data lake fuels model
development and unlocks the true potential of artificial intelligence and predictive analytics.
What are the architectures of the data lake?
Data lake architecture generally comprises three components, or layers:

• Sources
• Data Processing Layer
• Target
[Diagram: a high-level data lake architecture showing Sources, the Data Processing Layer, and Targets]
1. Sources
• Sources are the providers of the business data to the data lake.
• ETL or ELT pipelines are used to retrieve data from the various sources for further processing.
• Sources are categorized into two types based on their structure and format:

a. Homogeneous sources
• These share similar data formats or structures.
• The data is easy to join and consolidate.
• Example: sources from MS SQL Server databases.

b. Heterogeneous sources
• These have different data formats or structures.
• It is tricky for ELT professionals to aggregate these sources into consolidated data for processing.
• Example: sources from flat files, NoSQL databases, etc.
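Aggregating heterogeneous sources typically means normalizing each format into a common record shape before processing. Below is a minimal sketch, with illustrative field names, consolidating a CSV flat file and a JSON export into one record format.

```python
import csv
import io
import json

# Two heterogeneous sources carrying the same logical data in different formats.
csv_source = "customer_id,city\n1,Cairo\n2,Alexandria"
json_source = '[{"customer_id": 3, "city": "Giza"}]'

def from_csv(text):
    # Normalize CSV rows (all strings) into the common record shape.
    return [{"customer_id": int(r["customer_id"]), "city": r["city"]}
            for r in csv.DictReader(io.StringIO(text))]

def from_json(text):
    # JSON already carries types; just project the shared fields.
    return [{"customer_id": r["customer_id"], "city": r["city"]}
            for r in json.loads(text)]

consolidated = from_csv(csv_source) + from_json(json_source)
print(consolidated)
```

With homogeneous sources this normalization step collapses to a plain union, which is why they are the easier case.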
2. Data Processing Layer
• The data processing layer of a data lake comprises the datastore, the metadata store, and replication to support high availability (HA) of data.
• An index is applied to the data to optimize processing.
• Best practice is to use a cloud-based cluster for the data processing layer.
• The data processing layer is designed to support the security, scalability, and resilience of the data.
• Proper business rules and configurations are maintained through administration.
• Several tools and cloud providers support this data processing layer.
• Example: Apache Spark, Azure Databricks, and data lake solutions from AWS.
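A minimal sketch of how the datastore, metadata store, and an index fit together, using in-memory stand-ins (real systems, such as the Hive metastore, are far richer):

```python
datastore = {}         # object key -> raw payload
metadata_store = {}    # object key -> descriptive metadata
tag_index = {}         # tag -> set of object keys (the "index" on the data)

def ingest(key, payload, tags):
    # Store the payload, record metadata, and update the index together.
    datastore[key] = payload
    metadata_store[key] = {"size": len(payload), "tags": tags}
    for tag in tags:
        tag_index.setdefault(tag, set()).add(key)

ingest("logs/2024-01-01.txt", b"GET /home 200", tags=["logs", "web"])
ingest("img/cat.png", b"\x89PNG...", tags=["images"])

# The index answers "which objects are web logs?" without scanning payloads.
print(sorted(tag_index["logs"]))
```

Replication for HA is omitted here; in practice each `datastore` write would be mirrored to replicas.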
3. Targets for the Data Lake
After the processing layer, the data lake provides the processed data to target systems or applications.
Several systems consume data from the data lake through an API layer or through connectors.
Examples of systems that use the data lake:

• Enterprise Data Warehouse (EDW)
• Analytics Dashboards
• Data Visualization Tools
• Machine Learning Projects

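The API-layer idea above can be sketched as follows. `curated_zone` and `lake_api` are hypothetical names for illustration: the point is that targets read through a connector rather than touching raw storage directly.

```python
# Curated, processed data that the lake exposes to downstream targets.
curated_zone = {
    "sales_summary": [{"region": "eu", "total": 100},
                      {"region": "us", "total": 250}],
}

def lake_api(dataset, filter_fn=None):
    # Connector/API layer: targets never touch raw storage, only this function.
    rows = curated_zone[dataset]
    return [r for r in rows if filter_fn is None or filter_fn(r)]

# An analytics dashboard pulls only the slice it needs:
eu_rows = lake_api("sales_summary", lambda r: r["region"] == "eu")
print(eu_rows)
```

An EDW or ML project would call the same layer with different filters, which keeps raw storage details hidden from every consumer.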
Use Cases for Data lakes
Tools and Technologies Used in Data Lakes:

- Storage: Object storage systems like Azure Data Lake Storage, AWS S3, and Google Cloud Storage are commonly used.
- Data Ingestion: Tools for data ingestion include Apache NiFi, Apache Kafka, and Azure Data Factory.
- Metadata Management: Cataloging tools like Apache Atlas and AWS Glue help manage metadata.
- Analytics and Processing: Big data processing frameworks like Apache Hadoop, Apache Spark, and Databricks are essential for analytics.


Steps for creating a data lake using AWS S3
1. Register an Amazon Simple Storage Service (Amazon S3) path as a data lake.
2. Grant Lake Formation permissions to write to the Data Catalog and to Amazon S3 locations in the data lake.
3. Create a database to organize the metadata tables in the Data Catalog.
4. Use a blueprint to create a workflow, then run the workflow to ingest data from a data source.
5. Set up your Lake Formation permissions to allow others to manage data in the Data Catalog and the data lake.
6. Set up Amazon Athena to query the data that you imported into your Amazon S3 data lake.
7. For some data store types, set up Amazon Redshift Spectrum to query the data that you imported into your Amazon S3 data lake.
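To make the flow of these steps concrete, here is an in-memory simulation of the same sequence. It is deliberately not the AWS API (in practice you would reach Lake Formation, Glue, and Athena through boto3 clients); every function and name below is a hypothetical stand-in.

```python
# Toy Data Catalog mirroring the Lake Formation steps above (illustrative only).
data_catalog = {"databases": {}, "registered_locations": set(), "grants": []}

def register_location(s3_path):                       # step 1
    data_catalog["registered_locations"].add(s3_path)

def grant(principal, permission):                     # steps 2 and 5
    data_catalog["grants"].append((principal, permission))

def create_database(name):                            # step 3
    data_catalog["databases"][name] = {}

def ingest_table(db, table, rows):                    # step 4 (workflow run)
    data_catalog["databases"][db][table] = rows

def query(db, table):                                 # step 6 (Athena-style read)
    return data_catalog["databases"][db][table]

register_location("s3://my-lake/raw/")
grant("analyst", "SELECT")
create_database("sales")
ingest_table("sales", "orders", [{"id": 1, "total": 40}])
print(query("sales", "orders"))
```

The ordering matters in the real service too: locations must be registered and permissions granted before workflows can write tables that Athena later queries.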
What are the challenges of data lakes?

- Schema: The lack of a predefined schema can make data hard to consume or query.
- Quality: Maintaining the quality of data ingested into the data lake can be challenging.
- Governance: The main challenge with a data lake architecture is storing raw data without oversight of its contents, which leads to a "data swamp" if not properly managed.
- Usability: To make data usable, the lake needs defined mechanisms to catalog and secure data; without these elements, data cannot be found or trusted.
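One common mitigation for the data-swamp problem is to refuse ingestion of uncataloged data, so every object stays findable. A hedged sketch with hypothetical names:

```python
lake, catalog = {}, {}

def ingest(key, payload, owner=None, description=None):
    # Require catalog metadata at ingest time: no metadata, no storage.
    if not owner or not description:
        raise ValueError(f"refusing uncataloged object: {key}")
    lake[key] = payload
    catalog[key] = {"owner": owner, "description": description}

ingest("crm/leads.json", b"{}", owner="sales-team", description="daily CRM export")

try:
    ingest("mystery.bin", b"\x00")        # no metadata -> rejected
except ValueError as err:
    print(err)
```

Enforcing this at the ingestion boundary is cheaper than cataloging a swamp after the fact, at the cost of some of the lake's write-time flexibility.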
Thank you
