Data Lake

The document discusses data lakes, including what they are, how they differ from data warehouses, why they are needed, common architectures and components, use cases, tools and technologies used, and challenges. A data lake is a centralized repository that stores raw data in its native format at scale. It differs from a data warehouse in that data does not need to be structured beforehand and can be used for various analytics. Benefits include cost effectiveness, scalability, and flexibility in storing various data types.

Uploaded by Nada Elsharawy

Data Lake

Agenda
- What is a Data lake?
- What is the difference between Data lake and Data warehouse?
- Why do you need a Data lake?
- What are the architectures of the Data lake?
- Use Cases for Data lakes
- Tools and Technologies Used in Data Lakes
- What are the challenges of data lakes?
What is a Data lake?
Data Lake A data lake is a centralized
repository that allows you to store
all your data at any scale.
It lets you store raw data as-is.
without having to first structure the
data, and run different types of
analytics—from dashboards and
visualizations to big data processing,
and real-time analytics.
large-scale repositories designed
to store and manage massive
amounts of data.
Key Characteristics of data lakes:

- Stores data in its native format
- Data can be structured, semi-structured, or unstructured
- Uses a flat architecture
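These characteristics can be made concrete with a toy sketch. The class below is an illustrative, in-memory stand-in (not any real product): objects of any format sit side by side under path-like keys in one flat namespace, with no schema enforced at write time.

```python
import json

# Minimal sketch of a data lake's flat storage model (illustrative only):
# raw bytes stored under path-like keys, no enforced schema.
class TinyDataLake:
    def __init__(self):
        self._objects = {}  # flat key -> raw bytes

    def put(self, key, raw_bytes):
        # Data is stored as-is, in its native format.
        self._objects[key] = raw_bytes

    def get(self, key):
        return self._objects[key]

lake = TinyDataLake()
# Structured (CSV), semi-structured (JSON), and unstructured (text) coexist.
lake.put("sales/2024/q1.csv", b"region,amount\neu,100\nus,250")
lake.put("events/click.json", json.dumps({"user": 7, "page": "/home"}).encode())
lake.put("notes/readme.txt", b"free-form text, no schema")

print(sorted(lake._objects))
```

Note that nothing stops the three formats from mixing in one namespace; structure is the reader's problem, not the writer's.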
What is the difference between Data lake and Data warehouse?
Data warehouse vs. data lake:

- Data Types: A data warehouse stores processed, structured data according to specific metrics and attributes. A data lake stores raw, structured, semi-structured, and unstructured data from multiple sources.
- Data Purpose: In a warehouse, the data is currently being used for operations and analytics within a structured system. In a lake, the purpose of the data is often not predefined; it can be used for machine learning, AI algorithms, and other business purposes after processing.
- Process: A warehouse follows an Extract, Transform, Load (ETL) process, which offers security and high performance. A lake follows an Extract, Load, Transform (ELT) approach, which offers agility and easy data capture.
- Schema Position: A warehouse uses schema-on-write; the schema is defined before data storage. A lake uses schema-on-read; the schema is applied after data storage.
- Users: A warehouse is suited for business professionals who need operational reporting and analytics. A lake is ideal for data scientists and those who need in-depth analysis and predictive modeling tools.
- Accessibility: A warehouse is more complicated to change due to the structured nature of the data. A lake is highly accessible and easier to update.
- History: The warehouse concept has been around for decades and is well-established. The data lake is a relatively new concept for managing big data.
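The schema-position contrast is the sharpest difference, and a small sketch can make it concrete. The functions below are purely illustrative (not any real warehouse or lake API): schema-on-write validates each record before storage, while schema-on-read stores raw text untouched and applies structure only when a reader parses it.

```python
import json

def write_to_warehouse(table, record, schema):
    # Schema-on-write: reject records that don't match the schema up front.
    if set(record) != set(schema):
        raise ValueError(f"record fields {set(record)} != schema {set(schema)}")
    table.append(record)

def read_from_lake(raw_lines, parser):
    # Schema-on-read: raw data stays untouched; the reader imposes structure.
    return [parser(line) for line in raw_lines]

warehouse = []
write_to_warehouse(warehouse, {"id": 1, "amount": 9.5}, schema={"id", "amount"})

raw = ['{"id": 2, "amount": 3.0, "extra": true}']   # raw JSON, stored as-is
parsed = read_from_lake(raw, json.loads)

print(warehouse[0]["amount"], parsed[0]["extra"])
```

The lake side happily carries the unexpected `extra` field; the warehouse side would have rejected it at write time.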
Why do you need a data lake?
• Cost-Effectiveness: They typically require low-cost hardware, and many technologies used
for data management in data lakes are open source, such as Hadoop, making them more
economical than data warehouses.
• Resource Optimization: By storing any kind of data, data lakes help reduce unnecessary
resource usage within an organization.
• Scalability: Data lakes can easily scale to store and process large amounts of data,
accommodating the growth of data over time.
• Flexibility: They can store data in any format, including structured, semi-structured, and
unstructured, which is essential for machine learning use cases.
• Centralization: A data lake creates a single point of reference by consolidating information in
one place, reducing data siloing and making it easier to find, analyze, and share data across
different departments and projects.
• Machine Learning and AI: The sheer volume and variety of data in a data lake fuels model
development and unlocks the true potential of artificial intelligence and predictive analytics.
What are the architectures of the data lake?
Data lake architecture generally comprises three components, or layers:

• Sources
• Data Processing Layer
• Target
[Diagram: a high-level data lake architecture showing Sources, the Data Processing Layer, and Targets]
1. Sources
• Sources are the providers of the business data to the data lake.
• ETL or ELT pipelines are used to retrieve data from the various sources for further processing.
• Sources are categorized into two types based on their structure and format:

a. Homogeneous sources
• These share similar data formats or structures.
• The data is easy to join and consolidate.
• Example: sources from MS SQL Server databases.

b. Heterogeneous sources
• These have different data formats or structures.
• It is tricky for ELT professionals to aggregate these sources into consolidated data for processing.
• Example: sources from flat files, NoSQL databases, etc.
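Aggregating heterogeneous sources typically means normalizing each format into a common record shape before processing. Below is a minimal sketch, with illustrative field names, consolidating a CSV flat file and a JSON export into one record format.

```python
import csv
import io
import json

# Two heterogeneous sources carrying the same logical data in different formats.
csv_source = "customer_id,city\n1,Cairo\n2,Alexandria"
json_source = '[{"customer_id": 3, "city": "Giza"}]'

def from_csv(text):
    # Normalize CSV rows (all strings) into the common record shape.
    return [{"customer_id": int(r["customer_id"]), "city": r["city"]}
            for r in csv.DictReader(io.StringIO(text))]

def from_json(text):
    # JSON already carries types; just project the shared fields.
    return [{"customer_id": r["customer_id"], "city": r["city"]}
            for r in json.loads(text)]

consolidated = from_csv(csv_source) + from_json(json_source)
print(consolidated)
```

With homogeneous sources this normalization step collapses to a plain union, which is why they are the easier case.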
2. Data Processing Layer
• The data processing layer of a data lake comprises the datastore, the metadata store, and replication to support high availability (HA) of data.
• An index is applied to the data to optimize processing.
• Best practice is to use a cloud-based cluster for the data processing layer.
• The data processing layer is designed to support the security, scalability, and resilience of the data.
• Proper business rules and configurations are maintained through administration.
• Several tools and cloud providers support this data processing layer.
• Example: Apache Spark, Azure Databricks, and data lake solutions from AWS.
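A minimal sketch of how the datastore, metadata store, and an index fit together, using in-memory stand-ins (real systems, such as the Hive metastore, are far richer):

```python
datastore = {}         # object key -> raw payload
metadata_store = {}    # object key -> descriptive metadata
tag_index = {}         # tag -> set of object keys (the "index" on the data)

def ingest(key, payload, tags):
    # Store the payload, record metadata, and update the index together.
    datastore[key] = payload
    metadata_store[key] = {"size": len(payload), "tags": tags}
    for tag in tags:
        tag_index.setdefault(tag, set()).add(key)

ingest("logs/2024-01-01.txt", b"GET /home 200", tags=["logs", "web"])
ingest("img/cat.png", b"\x89PNG...", tags=["images"])

# The index answers "which objects are web logs?" without scanning payloads.
print(sorted(tag_index["logs"]))
```

Replication for HA is omitted here; in practice each `datastore` write would be mirrored to replicas.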
3. Targets for the Data Lake
After the processing layer, the data lake provides the processed data to target systems or applications.
Several systems consume data from the data lake through an API layer or through connectors.
Examples of systems that use the data lake:

• Enterprise Data Warehouse (EDW)
• Analytics Dashboards
• Data Visualization Tools
• Machine Learning Projects

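The API-layer idea above can be sketched as follows. `curated_zone` and `lake_api` are hypothetical names for illustration: the point is that targets read through a connector rather than touching raw storage directly.

```python
# Curated, processed data that the lake exposes to downstream targets.
curated_zone = {
    "sales_summary": [{"region": "eu", "total": 100},
                      {"region": "us", "total": 250}],
}

def lake_api(dataset, filter_fn=None):
    # Connector/API layer: targets never touch raw storage, only this function.
    rows = curated_zone[dataset]
    return [r for r in rows if filter_fn is None or filter_fn(r)]

# An analytics dashboard pulls only the slice it needs:
eu_rows = lake_api("sales_summary", lambda r: r["region"] == "eu")
print(eu_rows)
```

An EDW or ML project would call the same layer with different filters, which keeps raw storage details hidden from every consumer.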
Use Cases for Data lakes
Tools and Technologies Used in Data Lakes:

- Storage: Object storage systems like Azure Data Lake Storage, AWS S3, and Google Cloud Storage are commonly used.
- Data Ingestion: Tools for data ingestion include Apache NiFi, Apache Kafka, and Azure Data Factory.
- Metadata Management: Cataloging tools like Apache Atlas and AWS Glue help manage metadata.
- Analytics and Processing: Big data processing frameworks like Apache Hadoop, Apache Spark, and Databricks are essential for analytics.


Steps for creating a data lake using AWS S3
1. Register an Amazon Simple Storage Service (Amazon S3) path as a data lake.
2. Grant Lake Formation permissions to write to the Data Catalog and to Amazon S3 locations in the data lake.
3. Create a database to organize the metadata tables in the Data Catalog.
4. Use a blueprint to create a workflow, then run the workflow to ingest data from a data source.
5. Set up your Lake Formation permissions to allow others to manage data in the Data Catalog and the data lake.
6. Set up Amazon Athena to query the data that you imported into your Amazon S3 data lake.
7. For some data store types, set up Amazon Redshift Spectrum to query the data that you imported into your Amazon S3 data lake.
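To make the flow of these steps concrete, here is an in-memory simulation of the same sequence. It is deliberately not the AWS API (in practice you would reach Lake Formation, Glue, and Athena through boto3 clients); every function and name below is a hypothetical stand-in.

```python
# Toy Data Catalog mirroring the Lake Formation steps above (illustrative only).
data_catalog = {"databases": {}, "registered_locations": set(), "grants": []}

def register_location(s3_path):                       # step 1
    data_catalog["registered_locations"].add(s3_path)

def grant(principal, permission):                     # steps 2 and 5
    data_catalog["grants"].append((principal, permission))

def create_database(name):                            # step 3
    data_catalog["databases"][name] = {}

def ingest_table(db, table, rows):                    # step 4 (workflow run)
    data_catalog["databases"][db][table] = rows

def query(db, table):                                 # step 6 (Athena-style read)
    return data_catalog["databases"][db][table]

register_location("s3://my-lake/raw/")
grant("analyst", "SELECT")
create_database("sales")
ingest_table("sales", "orders", [{"id": 1, "total": 40}])
print(query("sales", "orders"))
```

The ordering matters in the real service too: locations must be registered and permissions granted before workflows can write tables that Athena later queries.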
What are the challenges of data lakes?

- Schema: The lack of a predefined schema can make data hard to consume or query.
- Quality: Maintaining the quality of data ingested into the data lake can be challenging.
- Governance: The main challenge with a data lake architecture is storing raw data without oversight of its contents, which leads to a "data swamp" if not properly managed.
- Usability: To make data usable, the lake needs defined mechanisms to catalog and secure data; without these elements, data cannot be found or trusted.
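One common mitigation for the data-swamp problem is to refuse ingestion of uncataloged data, so every object stays findable. A hedged sketch with hypothetical names:

```python
lake, catalog = {}, {}

def ingest(key, payload, owner=None, description=None):
    # Require catalog metadata at ingest time: no metadata, no storage.
    if not owner or not description:
        raise ValueError(f"refusing uncataloged object: {key}")
    lake[key] = payload
    catalog[key] = {"owner": owner, "description": description}

ingest("crm/leads.json", b"{}", owner="sales-team", description="daily CRM export")

try:
    ingest("mystery.bin", b"\x00")        # no metadata -> rejected
except ValueError as err:
    print(err)
```

Enforcing this at the ingestion boundary is cheaper than cataloging a swamp after the fact, at the cost of some of the lake's write-time flexibility.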
Thank you
