
Eb Data Lake Vs Data Warehouse Selection Guide en

This document provides a guide to help organizations choose between a data lake and a data warehouse. It outlines key differences between the two approaches and presents three sections for comparison: 1) a selection Q&A with 12 questions to help identify the best approach, 2) descriptions of data lakes and data warehouses, and 3) a comparison matrix of the features. The guide notes that while data warehouses are still more commonly used, recent technologies are blurring the lines between the two approaches.

Uploaded by

Amila

WORKSHEET

DATA INTEGRATION

Data Lake vs Data Warehouse

Selection Guide
This guide describes the differences between
a data lake and a data warehouse to help you
choose the right approach for your organization.

QLIK.COM
Table of Contents
We begin with a few key takeaways and then provide three ways
for you to compare and evaluate data lakes and data warehouses.
The selection guide, descriptions, and a comparison matrix
present similar information but in different ways. Explore each
section to get a complete picture.

Key Takeaways

Part 1: Selection Q&A

Part 2: Descriptions of Each Approach

Part 3: Comparison Matrix

Data Lake vs Data Warehouse Selection Guide 2


Key Takeaways
Data warehouses still dominate. Even today, the large majority
of organizations primarily use data warehouses and the clear
trend is toward cloud data warehouses. Data lakes are typically
used by data scientists for machine learning and exploration of
flat files.

Here’s why:

1. The use case is usually simple. Typically, the primary use case for most organizations’
data storage and analysis systems is to analyze and report on historical data for decision-
making, which data warehouses support better than data lakes. While machine learning
is exciting, you may not need it. And if you do need it, your data scientist will likely
perform it as a side project using their own data lake.

2. SQL is pervasive. There is a deep understanding and usage of SQL across the entire
data industry (analytics tools, relational database tools, computer science teaching,
etc.). SQL is so well known and supported that, for most organizations, a data warehouse
with SQL as the access mechanism is much easier to adopt than a data lake with a
processing framework like Spark.

3. Data warehouses are evolving. Many organizations are moving their data warehouses
to the cloud, which removes the high investment in buying and maintaining hardware
on premises. Plus, cloud data warehouse vendors are extending SQL with machine
learning operators so you don’t need the new toolset involved with a data lake to put
unstructured data into a data warehouse.



The lines are blurring. Recent technologies offer middle ground between what
have historically been very different approaches. As stated above, new data
warehouse tools allow you to put unstructured data into a data warehouse, making
it more like a data lake. Conversely, tools such as Amazon Athena or Microsoft’s
U-SQL allow you to put SQL on the front end of a data lake to add structure, making
it more like a data warehouse.

A new, hybrid approach, called a data lakehouse, can be more flexible than either
a data warehouse or data lake in that it can eliminate data redundancy and improve
data quality while allowing unstructured data. This is achieved via ETL pipelines
that provide the critical link between the unsorted lake layer and the integrated
warehouse layer.



Part 1: Selection Q&A
Deciding between using a data lake or a data warehouse can be
challenging because each approach has its own advantages and
disadvantages and there are a lot of criteria to consider. To make
it easier, we’ve created this checklist of the top 12 questions.
Answering these questions Yes or No will help you identify which
option is the best fit for your organization.

1) Data Type:
Does your organization primarily deal with structured data?

YES: A data warehouse may be the better choice because it is designed to handle structured data
that conforms to a specific schema, making it easier to process and analyze.

NO: A data lake may be the better choice because it allows you to store data in its raw form,
without the need for upfront schema design or data transformation, making it easier to handle
unstructured and semi-structured data.

2) Use Cases:
Is your primary use case for your data storage and analysis system to store structured data
for consistency and historical analysis?

YES: A data warehouse may be the better choice because it is designed to store structured data,
ensuring consistency and quality and making it easier to analyze historical data for
decision-making.

NO: A data lake may be the better choice because it’s a flexible and cost-effective way to store
large volumes of data of different types, without the need for extensive data modeling or
transformation. It can handle exploratory analysis, machine learning, real-time analytics, and
processing of unstructured and semi-structured data.



3) Schema:
Do you need a predefined schema for your data storage and analysis system?

YES: A data warehouse may be the better choice because it requires a predefined schema, which
ensures consistency and quality of the data, making it easier to analyze and query.

NO: A data lake may be the better choice because it allows you to store data in its raw form and
apply a schema when the data is read, making it easier to handle diverse data types and formats.
A schema-on-read approach is more flexible and can handle evolving data requirements.

4) Data transformation:
Do you need to transform your data before it can be analyzed?

YES: A data warehouse may be the better choice because it requires structured data, which can be
transformed and loaded into a predefined schema, making it easier to analyze.

NO: A data lake may be the better choice because it allows you to store data in its raw form,
which can be transformed when it is read, making it easier to handle diverse data types and
formats.

5) Data volume:
Do you need to store and analyze large volumes of data?

YES: A cloud data warehouse or a data lake is the better choice because both are designed to
handle large volumes of data and can scale horizontally.

NO: An on-premises data warehouse may be the better choice as it’s designed to handle smaller
volumes of data and can scale vertically, making it easier to optimize query performance.

6) Data access:
Do you have a single user or a specific set of users who need access to the data?

YES: A data warehouse may be the better choice because it has a centralized structure and can be
optimized for specific use cases, making it easier to ensure data quality and consistency.

NO: A data lake may be the better choice if you have multiple users with varying needs who need
access to the data, as it allows for granular access control and can be accessed by different
tools and applications, making it easier to collaborate and share data.



7) Cost:
Do you need to dynamically grow or shrink your data storage and processing costs
depending on changing business budgets and requirements?

YES: A cloud data warehouse or a data lake is the better choice because vendors such as
Snowflake, Azure, Amazon, and BigQuery make it easy to scale as your storage needs grow and
don’t require you to purchase new hardware as an on-premises data warehouse does.

NO: An on-premises data warehouse may be the better choice if you’re able to cover the upfront
hardware and software costs and the cost of maintaining the system.

8) Processing Speed:
Do you primarily require fast query response times on historical data?

YES: A data warehouse may be the better choice because it typically offers faster query response
times on historical data due to an optimized and structured data storage design.

NO: A data lake may be the better choice if your primary use case is not querying historical
data but is exploring a broader range of data.

9) Flexibility:
Does your organization have diverse and evolving data sources and structures?

YES: A cloud data warehouse or a data lake is the better choice because both can accommodate a
wide range of data types and formats and can allow for flexible data storage.

NO: An on-premises data warehouse may be the better choice if your organization has fairly
static data sources and structures because it requires structured data and schema design
upfront.

10) Integration:
Does your system need to integrate with existing data infrastructure and tools?

YES: A data warehouse may be the better choice because it can be integrated with existing
infrastructure and tools more easily due to its SQL-based access mechanism and structured and
optimized data storage design.

NO: A data lake may be the better choice if integration with existing infrastructure and tools
is not a concern because it offers more flexibility in terms of data storage and processing.



11) Data Governance:
Does your organization need to comply with specific data
governance requirements?

YES: A data warehouse may be the better choice because it typically offers more built-in data
governance features, such as access control and auditing, making it a good option for
organizations that need to comply with specific data governance requirements.

NO: A data lake may be the better choice because it offers more flexibility in terms of data
storage and processing, making it a good option for organizations that don’t need to comply
with specific data governance requirements.

12) Data quality:
Does your organization require high data quality standards?

YES: A data warehouse may be the better choice because it typically requires structured data and
schema design upfront, making it easier to ensure that you meet data quality standards.

NO: A data lake may be the better choice because it allows for more flexible data storage and
processing, but may require more effort to ensure high data quality standards are met.

So, how did it go?

You may not have a clear winner. This is why many organizations use both approaches to cover the
spectrum of their data storage needs. And, depending on the issue, you can take extra steps to make either
approach work for you. For example, a data integration platform with a robust data governance tool will
help you govern a data lake.

As we said in the Key Takeaways section, the large majority of organizations primarily use data warehouses
and the clear trend is toward cloud data warehouses. So unless you’re a data scientist with a specific
machine learning project, a cloud data warehouse may be the best approach.
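
If you want to see how your answers lean, a simple tally can help. The sketch below encodes which option each YES or NO answer favors, exactly as stated in the 12 questions above. Counting every question with equal weight is an assumption made for illustration, not part of the checklist, and the names `CHECKLIST` and `tally` are hypothetical.

```python
from collections import Counter

DW, DL = "data warehouse", "data lake"
CLOUD, ON_PREM = "cloud data warehouse or data lake", "on-premises data warehouse"

# Question number -> (option favored by YES, option favored by NO),
# copied from the checklist. Equal weighting is an illustrative assumption.
CHECKLIST = {
    1: (DW, DL), 2: (DW, DL), 3: (DW, DL), 4: (DW, DL),
    5: (CLOUD, ON_PREM), 6: (DW, DL), 7: (CLOUD, ON_PREM), 8: (DW, DL),
    9: (CLOUD, ON_PREM), 10: (DW, DL), 11: (DW, DL), 12: (DW, DL),
}


def tally(answers):
    """Count how often each option is favored.

    answers maps a question number to True (YES) or False (NO).
    """
    votes = Counter()
    for question, is_yes in answers.items():
        yes_option, no_option = CHECKLIST[question]
        votes[yes_option if is_yes else no_option] += 1
    return votes


# Example: YES to everything except questions 1 and 9.
answers = {question: True for question in CHECKLIST}
answers[1] = False
answers[9] = False
print(tally(answers).most_common())
```

A close count is a signal in itself: it suggests the mixed approach described above rather than a single winner.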

Data warehouse automation accelerates and simplifies the data warehouse lifecycle for faster
time to insights.

Learn more: https://www.qlik.com/us/data-warehouse-automation



Part 2: Descriptions of Each Approach
Data lakes and data warehouses are both approaches to data
management and analytics. A data lake is a cost-effective and
scalable repository of structured and unstructured data whose
purpose has not yet been defined. A data warehouse
is a repository of highly structured historical data which has
been optimized for processing and analysis for
business intelligence and reporting.

Here we dive into each approach in turn:

Data Lakes
Data Lake Definition
A data lake is a repository that stores all of your organization’s data — both structured and unstructured.
Think of it as a massive storage pool for data in its natural, raw state (like a lake). A data lake can handle
the huge volumes of data that most organizations produce without the need to structure it first. Data
stored in a data lake can be used to build data pipelines to make it available for data analytics tools to find
insights that inform key business decisions.



Data Lake Architecture
Data lakes employ a flat architecture, allowing you to avoid pre-defining the schema and data
requirements and instead store raw data at any scale without the need to structure it first. You achieve this
by using tools to assign unique identifiers and tags to data elements so that only a subset of relevant data
is queried to analyze a given business question.

These tools, such as Snowflake, Azure, and AWS, vary in specific capabilities. Therefore, your system’s
detailed physical structure will depend on which tool you select.
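
To make the idea of identifiers and tags concrete, here is a minimal sketch of a metadata catalog. It assumes each raw object gets a unique ID and a set of tags at ingestion, so a query only touches the objects that match. The `LakeObject` class, the paths, and the tag names are hypothetical and do not reflect any specific vendor’s API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw file in the lake, described only by its metadata (hypothetical model)."""
    object_id: str                                  # unique identifier assigned at ingestion
    path: str                                       # where the raw bytes live
    tags: set[str] = field(default_factory=set)     # descriptive labels used for lookup


def select_by_tags(catalog: list[LakeObject], required: set[str]) -> list[LakeObject]:
    """Return only the objects that carry every required tag."""
    return [obj for obj in catalog if required <= obj.tags]


catalog = [
    LakeObject("obj-001", "raw/crm/2023/contacts.json", {"crm", "customer"}),
    LakeObject("obj-002", "raw/web/2023/clicks.log", {"web", "customer"}),
    LakeObject("obj-003", "raw/iot/2023/sensors.csv", {"iot"}),
]

# A customer-focused question only needs the objects tagged "customer".
relevant = select_by_tags(catalog, {"customer"})
print([obj.object_id for obj in relevant])
```

The raw files themselves are never restructured here; only the catalog is consulted to narrow the set of data to read.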

High-level data lake architecture diagram:



Your data teams can build ETL data pipelines and schema-on-read transformations to make data stored
in a data lake available for data science and machine learning and for analytics and business intelligence
tools. Alternatively, managed data lake creation tools can help you achieve this without the
limitations of slow, hand-coded scripts and scarce engineering resources.
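
As a rough illustration of a schema-on-read transformation, the sketch below keeps raw JSON lines untouched and applies a schema only when the data is read. The sample events, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are assumptions made for this example.

```python
import json

# Raw events stored "as is"; fields and types vary from record to record.
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.90", "channel": "web"}',
    '{"user": "b42", "amount": 5, "note": "gift card"}',
    '{"user": "c03"}',
]

# Schema applied at read time: field name -> (converter, default value).
ORDER_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema, ignoring extra fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, (convert, default) in schema.items():
            value = record.get(name, default)
            row[name] = convert(value) if value is not None else None
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
print(orders, total)
```

A different question could read the same raw events with a different schema, which is the flexibility that schema-on-read trades against upfront consistency.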

Modern data lake architectures are cloud-based and provide rapid data access and analytics by having
all necessary compute resources and storage objects internal to the data lake platform. They also isolate
workloads and allocate resources to the prioritized jobs, helping you prevent user concurrency issues
from slowing down analyses across your organization. Here are the key features of a cloud data lake
architecture:

• Simultaneous data loading and querying without impacting performance
• Independent scaling of storage and compute resources
• A multi-cluster, shared-data architecture
• Metadata capabilities that are core to the object storage environment
• No performance degradation from adding users

Managed data lake creation allows you to have continuously updated, analytics-ready data lakes
without coding.

Learn more: https://www.qlik.com/us/data-lake-creation



Data Lake Benefits
Because the large volumes of data in a data lake are not structured before being stored, skilled data
scientists or modern BI tools can gain access to a broader range of data far faster than in a data warehouse.
Some of the main advantages of data lakes include:

Scalability: Data lakes are highly scalable, allowing you to cost-effectively store and manage massive
volumes of structured and unstructured data like ERP transactions and call logs.

Flexibility: Data lakes can store data in its raw form, making it possible to store different types of
data in the same location without needing to structure it in advance. This allows you to store diverse
data types, including structured, semi-structured, and unstructured data, in the same location.

Cost-effectiveness: Data lakes can be a cost-effective way to store large amounts of data, as they
use low-cost storage options such as cloud storage or commodity hardware. This makes data lakes a
viable option for storing and processing large amounts of data without incurring high costs.

Agility: Data lakes can be rapidly provisioned and deployed, allowing you to quickly store and
manage data as needed. This makes it easier to respond to changing business needs and to innovate
quickly.

Data analysis: Like data warehouses, data lakes can be a valuable resource for data analysis and
business intelligence. The difference here is that a broader range of data can be analyzed in new ways
to gain unexpected and previously unavailable insights.



Data Lake Challenges
Data lakes present several challenges that you need to consider, including:

Data quality: Data lakes are designed to store vast amounts of data, which can come from various
sources and in different formats. This can make it difficult to ensure that the data is accurate,
consistent, and reliable.

Data governance: With data lakes, it can be challenging to maintain data governance, including data
lineage, data classification, and data privacy. You’ll have to establish and enforce data governance
policies to ensure that data is used appropriately.

Data security: Data lakes are a prime target for cyber-attacks and data breaches. The vast amounts
of data stored in data lakes can be challenging to secure, and you must ensure that appropriate
security measures are in place.

Data integration: Integrating data from different sources can be a significant challenge. Data lakes
must be able to handle data from different formats, sources, and systems, and it can be challenging
to integrate this data in a way that maintains its quality and integrity.

Cost management: Storing large amounts of data can be expensive. You’ll need to carefully manage
the cost of storing and managing data lakes to ensure that they’re cost-effective.

Analytics and processing: Data lakes are designed to store large amounts of data, but processing
this data in a meaningful way can be challenging. You’ll need to have the right tools and technologies
in place to analyze and process the data to gain insights.



Data Warehouses
Data Warehouse Definition
Similar to a data lake, a data warehouse is a repository for business data. However, unlike a data lake, only
highly structured and unified data lives in a data warehouse to support specific business intelligence and
analytics needs. Think of it like an actual warehouse, where contents are first processed, then organized
into sections and onto shelves (called Data Marts). Data from a warehouse is ready for use to support
historical analysis and reporting to inform decision making across an organization’s lines of business.

Data Warehouse Architecture


Your specific data warehouse architecture will be determined by your organization’s unique needs. Here’s
a high-level diagram of the typical structure:

Generally, there are three zones. Data in the landing zone is structured as tables and mirrors the data from
your transactional systems. Data in the curated zone conforms to a well-known methodology such as Data
Vault, Inmon, or Kimball. Data in the analytics zone is typically housed in data marts and structured in star
schemas, where you’ll have a central fact such as the number of units sold, and emanating from that fact
are dimensions such as days, weeks, months, and years.
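
To illustrate the star schema described above, here is a minimal sketch using Python’s built-in sqlite3 module with an in-memory database. The table names `fact_sales` and `dim_date`, their columns, and the sample rows are assumptions for illustration; a real data mart would typically have several dimensions around each fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table (units sold) and one dimension table (date).
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    day      TEXT,
    month    TEXT,
    year     INTEGER
);
CREATE TABLE fact_sales (
    date_key   INTEGER REFERENCES dim_date(date_key),
    units_sold INTEGER
);
INSERT INTO dim_date VALUES
    (1, '2023-01-30', '2023-01', 2023),
    (2, '2023-01-31', '2023-01', 2023),
    (3, '2023-02-01', '2023-02', 2023);
INSERT INTO fact_sales VALUES (1, 10), (2, 5), (3, 7), (3, 3);
""")

# Roll the fact up along the date dimension: units sold per month.
units_by_month = conn.execute("""
    SELECT d.month, SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.month
    ORDER BY d.month
""").fetchall()
print(units_by_month)
```

Grouping by `d.year` or `d.day` instead of `d.month` answers the same question at a different grain without touching the fact table.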

A key challenge in executing the above structure is that it requires you to write a lot of SQL code for each
zone and for moving data between zones. Data warehouse automation allows you to use visual tools to
rapidly design, deploy, and manage your entire warehouse lifecycle without writing any code.



Data Warehouse Benefits
A data warehouse offers enormous benefits to organizations, especially as it relates to BI and analytics.
After the initial work of cleansing and processing, data stored in a warehouse serves as a consistent “single
source of truth” which is invaluable to business data analysis, collaboration, and better insights. Some of
the main advantages of data warehouses include:

Data consistency: Data warehouses are designed to store structured data, which means the data is
pre-processed and structured to ensure consistency and quality. This makes it easier to analyze and
use the data for decision-making.

Performance: Data warehouses are optimized for query and analysis performance, making it
possible to retrieve and analyze large amounts of data quickly. This makes it possible to perform
complex queries and generate reports in real-time.

Security: Data warehouses are designed with security in mind, making it possible to control access
to data and ensure data privacy. This makes data warehouses a good option for organizations that
need to comply with data privacy regulations and protect sensitive data.

Single source of truth: Unified, harmonized data offers a single source of truth for data analysis,
building trust in data insights and decision-making across business lines.

Historical analysis: Data warehouses store historical data, making it possible to perform trend
analysis and identify patterns over time. This makes it easier to make data-driven decisions based on
past performance and future trends.



Data Warehouse Challenges
Data warehouses present several challenges that you should consider, including:

Data quality: Data warehouses are designed to store structured data, but ensuring the quality of this
data can be a challenge. The data may come from different sources and may have inconsistencies,
errors, or duplicates that need to be resolved.

Data integration: Integrating data from different sources can be a significant challenge. Data
warehouses must be able to handle data from different formats, sources, and systems, and it can be
challenging to integrate this data in a way that maintains its quality and integrity.

Scalability: As data volumes grow, it can be challenging to scale on-premises data warehouses to
handle the increased workload. You’ll have to carefully manage the size and complexity of your data
warehouses to ensure that they remain performant.

Cost management: Building and maintaining an on-premises data warehouse can be very
expensive. And even cloud-based solutions require you to carefully manage the cost of maintaining a
cloud data warehouse to ensure that it’s cost-effective.

Data governance: With data warehouses, it can be challenging to maintain data governance,
including data lineage, data classification, and data privacy. As with data lakes, you’ll have to
establish and enforce data governance policies to ensure that data is used appropriately.

Maintenance and support: On-premises data warehouses require ongoing maintenance and
support to ensure that they remain performant and secure. Be sure you have the right resources and
expertise in place to manage and maintain your on-prem data warehouse.

Cloud data migration solutions accelerate your move to the cloud with easy, automated data
transfer.

Learn more: https://www.qlik.com/us/cloud-data-migration



Part 3: Comparison Matrix
Here we take a side-by-side look at data lakes vs data
warehouses.

Purpose
Data Lake: Store unstructured or raw data: A data lake contains all of an organization’s data in
a raw, unstructured form, and can store the data indefinitely, for immediate or future use.
Data Warehouse: Store structured and processed data: A data warehouse contains structured data
that has been cleaned and processed, ready for strategic analysis based on predefined business
needs.

Data Type
Data Lake: Variety of structured and unstructured data: A data lake can store a wide range of
data types, including structured, semi-structured, and unstructured data, such as log files,
social media feeds, sensor data, and more.
Data Warehouse: Structured data only: A data warehouse typically only stores structured data,
such as data from transactional or CRM systems. This structured data is typically processed and
transformed before it is loaded into the data warehouse.

Use Cases
Data Lake: Advanced analytics/ML: typically used for exploratory data science and machine
learning of flat files.
Data Warehouse: Business intelligence, reporting, analytics: suitable for pre-defined queries
and analysis for decision making.



Schema
Data Lake: Schema-on-read: schema is defined after the data is stored in a data lake and when
the data is read or queried. This makes the process of capturing and storing the data faster.
Data Warehouse: Schema-on-write: schema is defined before the data is stored. This lengthens the
time it takes to process the data, but once complete, the data is at the ready for consistent,
confident use across the organization.

Data Transformation
Data Lake: Only when needed: data is stored “as is” and then transformed using an ELT (Extract,
Load, Transform) process when needed.
Data Warehouse: Extensive ETL processes: data requires cleansing and processing in an ETL
(Extract, Transform, Load) pipeline prior to loading in the data warehouse.

Data Volume
Data Lake: Large volumes of data: can handle terabytes to petabytes.
Data Warehouse: Large volumes of data: cloud data warehouses can scale to petabytes but on-prem
DWs typically handle gigabytes to terabytes.

Data Access
Data Lake: Suitable for ad-hoc analysis: provides a flexible and scalable way to store and
process data.
Data Warehouse: Suitable for pre-defined queries and analysis: optimized for business
intelligence and reporting.

Cost
Data Lake: Typically lower storage cost but higher processing cost: object storage is cheaper
than relational database storage, but since the data isn’t structured, the processing cost is
often higher.
Data Warehouse: Typically higher storage cost but lower processing cost: relational database
storage requires costly hardware, whether in the cloud or on-prem, but the processing cost can
be lower relative to data lakes.



Processing Speed
Data Lake: Typically slower processing speed: data needs to be processed before it can be
analyzed. However, using platforms like Apache Spark allows data to be processed in parallel
across multiple nodes.
Data Warehouse: Faster processing speed: historical data is pre-processed and ready for
analysis.

Flexibility
Data Lake: Highly flexible and adaptable: can store a variety of data types and handle changes
in data structure easily.
Data Warehouse: Less flexible and more rigid: requires a predefined schema and structure.

User Skill Level
Data Lake: Requires advanced technical skills: Data from a data lake is typically used by data
scientists and engineers who prefer to study data in its raw form. This requires expertise in
big data tools and programming languages.
Data Warehouse: Less technical skills needed: Data from a data warehouse is typically accessed
by business analysts and report developers looking to gain insights from business KPIs and
answer pre-determined questions.

Data Governance
Data Lake: Harder to govern: data may not be governed or controlled as closely as in a data
warehouse.
Data Warehouse: Easier to govern: data is more easily governed and controlled to ensure
consistency and accuracy.

Data Quality
Data Lake: Harder to meet standards: flexible data storage and processing allow for varying
degrees of quality.
Data Warehouse: Easier to meet standards: data is structured and schema is designed upfront.
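
The ETL versus ELT distinction in the Data Transformation row of the matrix comes down to where the cleaning step sits relative to the load. The sketch below shows both orderings on the same raw rows. The sample data and the `clean`, `etl`, and `elt` helpers are hypothetical, and in-memory lists stand in for the warehouse and the lake.

```python
RAW_ROWS = [
    {"customer": " Ada ", "amount": "12.50"},
    {"customer": "Bob", "amount": "n/a"},
    {"customer": "Cy", "amount": "7"},
]


def clean(row):
    """Return a trimmed, typed row, or None if the amount is not numeric."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {"customer": row["customer"].strip(), "amount": amount}


def etl(rows):
    """Extract, Transform, Load: only cleaned rows ever reach the warehouse."""
    warehouse = []
    for row in rows:
        cleaned = clean(row)
        if cleaned is not None:
            warehouse.append(cleaned)
    return warehouse


def elt(rows):
    """Extract, Load, Transform: store rows as is and transform them on demand."""
    lake = list(rows)  # raw copy; nothing is rejected at load time

    def transform_on_demand():
        return [c for c in map(clean, lake) if c is not None]

    return lake, transform_on_demand


warehouse = etl(RAW_ROWS)
lake, transform = elt(RAW_ROWS)
print(len(warehouse), len(lake), transform() == warehouse)
```

Both paths can produce the same cleaned result. The difference is that the lake still holds the rejected raw row, which is useful for later exploration but is also why quality and governance take more effort there.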



ABOUT QLIK

Qlik’s vision is a data-literate world, where everyone can use data
and analytics to improve decision-making and solve their most
challenging problems. Qlik offers real-time data integration and
analytics solutions, powered by Qlik Cloud, to close the gaps
between data, insights and action. By transforming data into
Active Intelligence, businesses can drive better decisions, improve
revenue and profitability, and optimize customer relationships. Qlik
serves more than 38,000 active customers in over 100 countries.

© 2023 QlikTech International AB. All rights reserved. All company and/or product names may be trade names, trademarks and/or registered
trademarks of the respective owners with which they are associated.
