Critical Success Factors For Data Lake Architecture: Checklist Report
By Philip Russom
Sponsored by:
MARCH 2020

NUMBER TWO
Know your data lake’s requirements and map them to data platform architectural options

NUMBER THREE
Expect your data lake to evolve to the cloud and prepare for multicloud

NUMBER FOUR
Design a data lake’s internal architecture to accommodate diverse data structures, data domains, and analytics applications

NUMBER FIVE
Give end users the tools and curation they need to control sharing data across the lake architecture

NUMBER SIX
Select a cloud data platform that fulfills the lake’s diverse data requirements and integrates well with related IT systems
© 2020 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to [email protected].

T 425.277.9126
F 425.687.2842
E [email protected]
tdwi.org

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.
FOREWORD
The data lake has come a long way since its origins around 2015. Today it is a well-established design pattern and data architecture for profound applications in data warehousing, reporting, data science, and advanced analytics as well as operational environments for marketing, supply chain, and finance. Over the years, users’ expectations, best practices, and business use cases for the data lake have evolved, as have the available data platforms upon which a data lake may be deployed.

This evolution is forcing changes in how data lakes are designed, architected, and deployed. In short, TDWI sees many corporations, government agencies, and other user organizations modernizing their data lakes to adapt them to today’s data and business requirements—instead of those of 2015. Similarly, “greenfield” data lakes today are quite different from early data lakes.

Many of the recent changes in the data lake are occurring at the architectural level. In particular, TDWI sees many organizations replatforming their data lakes as they abandon older database management systems and other data platforms in favor of modern ones. This forces changes in the systems architecture of a data lake, whether an old platform is replaced by a new one or left in place and augmented by a new one. Replatforming also leads to changes in a data lake’s data architecture when data is redistributed among the new mix of platforms or remodeled and improved during data migration.

Replatforming and other drivers for data lake architecture evolution take various forms:

DATA LAKES STARTED ON HADOOP BUT ARE MIGRATING ELSEWHERE. In fact, the earliest data lakes were almost exclusively on Hadoop. The current wave of dissatisfaction with Hadoop is driving a number of lake migrations off of Hadoop. For example, after living with their lakes for a year or more, many users discover that key use cases demand more and better relational functionality than can be retrofitted onto Hadoop. In a related trend, many organizations proved the value of data lakes on premises and are now migrating to cloud data platforms for their relational functionality, elasticity, minimal administration, and cost control.

A MODERN DATA LAKE MUST SERVE A WIDER RANGE OF USERS AND THEIR NEEDS. The first users of data lakes were mostly data scientists and data analysts who program algorithm-based analytics for data mining, statistics, clustering, and machine learning. As lakes have become more multitenant (serving more user types and use cases), set-based analytics (reporting at scale, broad data exploration, self-service queries, and data prep) has arisen as a requirement for the lake—and that requires a relational database.

CLOUD HAS RECENTLY BECOME THE PREFERRED PLATFORM FOR DATA-DRIVEN APPLICATIONS. Cloud is no longer just for operational applications. Many TDWI clients first proved the value of cloud as a general computing platform by adopting or upgrading to cloud-based applications deployed in the software-as-a-service (SaaS) model. Data warehousing, data lakes, reporting, and analytics are now aggressively adopting or migrating to cloud tools and data platforms. This is a normal maturity life cycle—many new technologies are first adopted for operational applications, then for data-driven, analytics applications.

NEW CLOUD DATA PLATFORMS ARE NOW FULLY PROVEN. The early adoption phase is over, spurring a rush of migrations to them for all kinds of data sets. As mentioned earlier, cloud
For data lakes, as with any valuable enterprise data set, architecture is a requirement. Without the design patterns, organization, and standards imposed by architecture, a data set will be hard to maintain, scale, optimize, govern, access, and leverage for organizational advantage. Hence, a good architectural design is key to achieving business value and a return on investment (ROI) from a data lake.

Before diving into data lake architectures, let’s review TDWI’s definitions of architecture and how multiple architectures work together in data-driven applications such as analytics, data warehousing, and data lakes.

ARCHITECTURE EXISTS ON MULTIPLE LAYERS. In other words, the “technology stack” of data-driven applications includes more than one architecture. This includes:

• Data architecture (both physical storage and virtual views)

• Platform architecture (typically database management systems, storage, file systems, and other data platforms)

• Data integration architecture (for ETL/ELT, quality, replication, metadata, etc.)

These numerous architectures and technology stack components tightly integrate or overlap with one another such that it can be hard to tell where one architecture stops and another begins. The peaceful coexistence of multiple architectures is commonly seen in complex enterprise environments, including data lakes and warehouses.1

AMONG THESE, THE DATA ARCHITECTURE IS THE MOST IMPORTANT. In fact, TDWI has always defined the “data warehouse” as a data architecture that is populated by data, models, relationships, and metadata. This same definition also applies to data lakes and related data sets such as data marts and operational data stores.

DATA MODELS AND ARCHITECTURE ARE RELATED BUT DIFFERENT. Data modeling is largely about local data structures and their components (rows, columns, tables, keys, and data types), typically one database or table at a time. Data architecture tends to be about relationships across multiple data sets and their platforms. Obviously, a data architecture such as the data lake can have many data models within it.

DATA PLATFORM ARCHITECTURE IS THE PRIMARY ENABLER OF DATA ARCHITECTURES. Note that the “real” data lake or data warehouse is the data, not the data platforms that store and manage the data. However, we wouldn’t have analytics and other data-driven practices without data platforms that can store and manage data as well as provide in-place processing for data. In the practical world, selecting the platform(s) that best satisfy your requirements is key to the success of a data set and its overall architecture. This is true of data lakes, warehouses, and most other data architectures.2

DATA PLATFORM ARCHITECTURE VARIES CONSIDERABLY. One trend in data architecture involves multiplatform architectures that consist of two or more types of data platforms. Multiplatform architectures are hybrid when they distribute data across both on-premises and cloud data platforms.
1 For a detailed discussion of layered architectures in complex data environments, see the TDWI Best Practices Report: Evolving Data Warehouse Architectures in the Age of Big Data, online at tdwi.org/bpreports.
2 For more details, see the section “Data Lake Platforms and Architectures” in the TDWI Best Practices Report: Data Lakes: Purposes, Practices, Patterns, and Platforms, online at tdwi.org/bpreports.
3 For a comprehensive study of complex architectures for data management, see the TDWI Best Practices Report: Multiplatform Data Architectures, online at tdwi.org/bpreports.
2 KNOW YOUR DATA LAKE’S REQUIREMENTS AND MAP THEM TO DATA PLATFORM ARCHITECTURAL OPTIONS

Before selecting one or more platforms for your data lake, you must determine the lake’s many requirements. This is true whether you are designing a new data lake architecture or modernizing an existing one. Data lake requirements fall into three general categories. Here are examples of questions asked when gathering data lake requirements:

BUSINESS REQUIREMENTS:

• Who are the primary end users for the data lake?

• What do they need to achieve via the lake?

• What practices for analytics, reporting, and self-service will they apply to the lake’s data?

DATA REQUIREMENTS:

• What sources will feed the lake?

• What kinds of data integration solutions are required?

• What data structures result and how do those affect requirements for storage?

ANALYTICS REQUIREMENTS:

• What analytics tool types and technologies will access and process the lake’s data?

• What data structures, quality conditions, and interfaces do these tools demand?

• What forms of in-database analytics and other push-down processing will the analytics tools require?

Once you have gathered data lake requirements, you can map them to data platform options and platform architecture designs. Here are examples of how data lake requirements guide such decisions.

RELATIONAL BUSINESS REQUIREMENTS. Early data lakes were usually deployed for technical users, such as data scientists and analysts. Today, business users also demand access to a lake’s data so they can perform self-service data prep, visualization, and light analytics. Most end-user tools for self-service (from enterprise BI platforms to visualization tools) are query-driven, and they assume relational data. In a related area, many firms have portfolios of SQL-driven reporting and analytics tools and want to apply this investment to data lake practices. Due to these relational business requirements, a data lake’s primary (or only) data platform should support the relational paradigm deeply. This way, self-service business users and other groups of users get the easy access to lake data and fast SQL-based exploration and analytics that they need from their data-driven solutions.

BROAD DATA FOR SELF-SERVICE AND ANALYTICS CORRELATIONS. In addition to self-service practices, another high-priority use case for the lake is discovery analytics, where analytics algorithms scan large volumes of diverse data to discover different types of business entities and to correlate facts about them. This usually involves technologies and tools for mining, statistics, clustering, and machine learning. One thing that self-service and discovery analytics have in common is that both assume massive volumes of data, from many sources, covering multiple data domains, in a variety of structures and quality states, about many business entities and processes, and involving multiple latencies.

When we pile together the data requirements of lake-based analytics this way, the list is daunting. Yet a data lake must satisfy these requirements and more. One strategy for coping with the list is to limit the first phase of a data lake to a short list of use cases, then grow into more in subsequent project phases. This avoids the risks of a big bang project, but it means that you must revise data lake architecture with each phase.

EXTREMELY DIVERSE REQUIREMENTS CAN LEAD TO COMPLEX ARCHITECTURE. As we just saw, the list of business, data, and analytics requirements is long and challenging. It is difficult to satisfy all requirements with a single data platform. The situation leads some organizations to distribute the data of their lake across multiple data platforms of different types. The resulting data platform architecture is complex in that it can be multiplatform, heavily distributed, and hybrid (when distributed across both clouds and on-premises systems). Some organizations accept this complexity because it aligns with their data ownership practices. Other users prefer to diversify within a single platform or instance so they can simplify design patterns, governance, and optimization while avoiding the distraction and cost of administering multiple platforms. In short, stick to simple architectures if you can and control the scope of complexity if you must adopt multiple platforms.
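The requirements-gathering checklist above lends itself to a simple, mechanical representation. The following sketch is illustrative only: the three category names come from this section, but the capability flags and the candidate platform profile are invented for the example.

```python
# Hypothetical sketch: recording the report's three requirement categories
# as data, then checking a candidate data platform against them.
# The capability flag names below are illustrative, not a real product API.
requirements = {
    "business": {"self_service_sql"},         # query-driven end-user tools
    "data": {"multi_structured_storage"},     # diverse sources and structures
    "analytics": {"in_database_processing"},  # push-down analytics
}

def unmet_requirements(platform_capabilities):
    """Return the requirement flags a candidate platform does not satisfy."""
    needed = set().union(*requirements.values())
    return needed - platform_capabilities

# A hypothetical cloud data platform profile:
candidate = {"self_service_sql", "multi_structured_storage"}
print(sorted(unmet_requirements(candidate)))  # → ['in_database_processing']
```

Even a toy structure like this makes the mapping exercise explicit: each gathered requirement either is or is not covered by a candidate platform, and the gaps drive the architectural decision.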
3 EXPECT YOUR DATA LAKE TO EVOLVE TO THE CLOUD AND PREPARE FOR MULTICLOUD

Nowadays, all kinds of platform choices and architectural design patterns in IT almost inevitably lead to cloud, so let’s consider cloud as an important, modern platform.

TDWI considers cloud to be a general compute platform upon which many types of other platforms or applications can be deployed. As discussed in the foreword of this report, cloud went mainstream for operational applications, driven by a rush of organizations adopting SaaS apps, which are cloud based. TDWI is now witnessing a flurry of migrations to cloud-based DBMSs and data warehouse platforms.4

DRIVERS FOR DATA LAKES ON THE CLOUD

Cloud is well on its way to becoming mainstream for analytics, warehousing, and data lakes. Therefore, we have yet another question to consider regarding data lake architecture and deployment: on premises or on the cloud? Here’s how the answer to that question plays out:

TECHNOLOGY ISSUES ALSO LEAD DATA LAKES TO THE CLOUD. Deploying a data lake on the cloud avoids the risky, time-consuming, and distracting system integration projects that plague on-premises deployments. Furthermore, administering a cloud data platform is trivial compared to on-premises databases. Finally, due to elasticity, cloud data platforms scale automatically to rising data volumes and unpredictable analytics workloads. In summary, cloud data platforms take many mundane tasks off the plates of data management professionals so they can focus on their mandate: building data-driven solutions for the business.

CLOUD IS THE FUTURE. Cloud is quickly becoming the preferred compute platform for all of IT. Many firms have a “cloud-first” mandate or similar guidance that forces new implementations and updates of older ones onto clouds. The point is that cloud is inevitable, so you should embrace it and plan for it. This is fine, given the many benefits of cloud for analytics use cases.

4 For more details, see the TDWI Best Practices Report: Cloud Data Management, available at tdwi.org/bpreports.
4 DESIGN A DATA LAKE’S INTERNAL ARCHITECTURE TO ACCOMMODATE DIVERSE DATA STRUCTURES, DATA DOMAINS, AND ANALYTICS APPLICATIONS

At this point, let’s leave our discussion of platform architectures and focus on data architectures that organize the internals of a data lake.

The earliest data lakes had little or no design or architecture because they were single-instance dumping grounds for whatever data data scientists and analysts deemed useful. If a lake had an internal architecture, it was the result of sandboxing or prominent schema dumped into it. As users worked with their data lakes over time, they realized that—like all enterprise data collections—the usability and performance of the lake improve when its data’s models, quality, semantics, and organization are improved. In a related trend, early successes with analytics on lakes drew more use cases and user constituencies to them, and these demanded certain improvements and modernizations. Furthermore, the available data platforms and tools for data lakes have advanced considerably.

Today, a modern and mature data lake differs from its predecessors:

• DATA LAKES ARE NOW MULTITENANT. They are enterprise assets shared by multiple departments and their diverse analytics needs. Data scientists and analysts continue to be important users, but they are now joined by mildly technical business users, typically from marketing, finance, and supply chain departments.

• DATA LAKES ARE NOW MULTIFUNCTIONAL. They are still true to their original mandate as a massive repository of detailed source data that can be repurposed ad infinitum for use cases in advanced analytics. That repository is still at the heart of the modern data lake, along with sandboxes, data labs, and other areas for data prototyping. Yet the data lake serves other technical purposes, too, including data landing, data staging, data archiving, and managing cleansed or remodeled data sets for self-service or specific departments.

Given these changes and the ongoing modernization of the data lake, it behooves data management professionals to carefully design an internal architecture that can accommodate the diverse data structures, data domains, and analytics applications that a multitenant and multifunctional data lake is called upon to support. See Figure 1, a reference architecture for organizing the internals of a data lake. Note that this is a logical representation, and its architectural components may be physically deployed on one data platform or across many. Figure 1 also assumes a left-to-right flow of data.

The reference architecture in Figure 1 identifies four key areas for a lake’s internal organization:

INGESTION AREAS. Data landing and staging is the weakest area of most data warehouse architectures and other enterprise data environments. Users have traditionally used a hodgepodge of file systems and spare database licenses for this without much thought for design or architecture. The data lake corrects this architectural failure by modernizing the ingestion side of data architecture. This makes sense because the detailed source data extracted by data integration processes is the same data that the lake’s primary repository needs to fulfill its mandate as the provider of analytics data. This also illustrates how a single modern data lake does double duty by supporting multiple architectural functions.

ANALYTICS DATA SETS. This is the “real” data lake in terms of the lake’s size and mandate. It is where source data ingested via extraction processes is persisted on disk in the lake’s repository of raw detailed data. Some of the data sets developed from raw detailed source data are also persisted here, ranging from transient sandboxes to lightly standardized data sets for self-service, operational reporting at scale, and other set-based analytics.

Note that the lake’s raw detailed source data is always maintained in its original, arrival condition, even when subsets from it are copied to develop other data sets. After all, source data is the lake’s mandate. Furthermore, the focus on source data differentiates data lakes from data warehouses, which focus on calculated values, aggregated data sets, and specialized data models. This differentiation makes lakes and warehouses complementary, which explains why they are increasingly deployed side by side, with data flowing through the lake into the warehouse and minimal data redundancy between the two.

FUNCTIONAL DATA SETS. A multitenant data lake may also manage data sets derived from its raw data repository that are intended for specific departments and business units. This makes the most sense when these data sets are shared across multiple units or across a global organization, as is the case with complete customer views or analytics data sets designed for broad consumption and collaboration.

LIVE ARCHIVE. As with any valuable enterprise data asset, a lake needs life cycle management for its data. For example, data that is “cold” (because it is used infrequently or has limited value) should be moved to cheaper storage media or a “live” archive, where data is still available 24x7 without restoration processes. As an alternative, a best practice is to leave lake data in place but mark it as cold or archival using metadata or cataloging. Given the petabyte scale that data lakes are growing toward, it makes sense to identify data sets that are candidates for archiving and deletion. In addition, isolating cold data means less data for algorithms and queries to parse, thereby optimizing performance. Finally, data curation (discussed in the next section of this report) should set rules for who is allowed to bring what data into the lake, as well as when abandoned sandboxes should be archived or deleted.
[Figure 1. Reference architecture for a data lake’s internals: many sources and ingestion methods feed, left to right, the lake’s ingestion, analytics, and functional areas.]
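The four areas of the reference architecture can be made concrete as a routing rule that assigns each data set to an area. This is a minimal, hypothetical sketch: the attribute names, the 365-day cold threshold, and the zone labels are assumptions for illustration, not taken from the report.

```python
from datetime import date, timedelta

# Hypothetical sketch of the four internal areas from Figure 1, modeled
# as zones within a single lake store. In practice these zones may be
# deployed on one data platform or across several.
ZONES = ("ingestion", "analytics", "functional", "archive")

def zone_for(dataset):
    """Assign a data set (a dict of illustrative attributes) to a zone."""
    if dataset.get("raw"):          # newly landed or staged source data
        return "ingestion"
    if dataset.get("department"):   # curated for a specific business unit
        return "functional"
    # Cold data moves to the live archive rather than being deleted.
    if date.today() - dataset["last_access"] > timedelta(days=365):
        return "archive"
    return "analytics"              # the lake's primary repository

ds = {"name": "orders_2019", "raw": False,
      "last_access": date.today() - timedelta(days=400)}
print(zone_for(ds))  # → archive
```

The point of the sketch is the design choice it encodes: zone membership is a policy decision driven by a data set’s role and temperature, not by which physical platform happens to hold it.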
5 GIVE END USERS THE TOOLS AND CURATION THEY NEED TO CONTROL
SHARING DATA ACROSS THE LAKE ARCHITECTURE
We all know that data lakes run the risk of deteriorating into the dreaded data swamp—an undocumented and disorganized data store that is nearly impossible to navigate, trust, and leverage for organizational advantage. A lake becomes a swamp when users bring any data they like to the lake, make contradictory copies of data as they create sandboxes, and fail to provide metadata for their data.

As organizations modernized their data lakes in recent years, they adopted user best practices and tool functions that control the lake and avoid the swamp. These practices and tool types are now considered required components of a comprehensive and controlled data lake architecture.

DATA CURATION. The data curator is key to controlling data’s entrance into a lake as well as how data is used, documented, and shared within and outside the lake. Data curation may be performed by a dedicated, full-time employee or by users who follow governance policies. Either way, data curation prevents multiple users from copying the same data into the lake, resulting in redundancy that skews statistics, machine learning, and other analytics outcomes.

Curation usually demands metadata for all data brought into the lake or created there, which is necessary for users to find data in the lake, to share data with others, and to distinguish between source data (raw and unaltered) and sandbox data (aggregated and remodeled).

DATA GOVERNANCE. The data lake is like any data platform or data store in that it needs data governance (DG) to keep its technical standards and business compliance high. DG usually takes the form of a board or committee populated with a mix of data management professionals (who create enterprise standards for data) and business managers (who serve as data owners, stewards, and curators, with a focus on compliance). All these people collaborate to establish and enforce policies that assure data’s compliant access, use, security, standards, and trust.

Implementers of a data lake should work with their enterprise DG board so the lake and its data will comply with established DG policies. Do this prior to designing the lake and loading it with data. Given that data lakes differ from older data store types, it is probable that yours will require new DG policies or revisions of older ones.

METADATA MANAGEMENT. The challenge is that data lakes typically manage a wide range of data types, structures, and containers, many of which arrive without metadata. Curation demands metadata, yet manually retrofitting metadata onto terabytes or petabytes of data is not practical. Luckily, modern tools for metadata management can scan data to deduce its structures and components, then suggest metadata to a developer or automatically apply the deduced metadata. Similarly, tools can inject metadata into files, documents, and other containers (which are common with lakes), making those easier and faster to scan for analytics. Metadata tools aside, these same features are increasingly found in tools for data integration, visualization, and analytics.

COLLABORATIVE FEATURES. Whether data scientists or self-service users, all data lake users want and need to share their data discoveries and carefully crafted data sets with other colleagues—but with privacy and governance controls. Hence, a data lake ideally needs some kind of “publish and subscribe” mechanism or similar sharing facility. In some user organizations, the data curator reviews and
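The metadata-deduction capability described above can be approximated in miniature with Python’s standard csv.Sniffer, used here purely as a toy stand-in for commercial metadata-management tools; the sample data is invented.

```python
import csv
import io

# Toy stand-in for metadata-management tools that scan arriving data to
# deduce structure: sniff the dialect and header of a CSV sample, then
# surface suggested column metadata for a curator to review.
sample = "customer_id,region,amount\n101,east,120.50\n102,west,80.00\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)        # deduce the delimiter, quoting, etc.
has_header = sniffer.has_header(sample)  # heuristic: does row 1 name columns?

reader = csv.reader(io.StringIO(sample), dialect)
columns = next(reader) if has_header else []
print(has_header, columns)  # → True ['customer_id', 'region', 'amount']
```

Real metadata tools go much further (type inference, profiling, lineage), but the workflow is the same: scan a sample, deduce structure, and suggest metadata rather than require it up front.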
6 SELECT A CLOUD DATA PLATFORM THAT FULFILLS THE LAKE’S DIVERSE DATA REQUIREMENTS AND INTEGRATES WELL WITH RELATED IT SYSTEMS

In closing, let’s summarize the findings of this report related to selecting data platforms and designing architectures for modern data lakes:

A DATA LAKE REQUIRES ARCHITECTURE. In fact, a data lake can be at the intersection of multiple architectures, typically data architecture, data platform architecture, integration architecture, and analytics or reporting architectures. All these must work together to complete a technology stack and to meet the diverse requirements of lake-based solutions.

DATA LAKE DESIGN IS ALL ABOUT SATISFYING MULTIPLE REQUIREMENTS. These include business requirements (for specific user types), data requirements (to accommodate many data structures, domains, and latencies, both at rest and in motion), and analytics requirements (to give specific tools or methods the data structures and breadth of data content they need). These add up to a long list of requirements, and all of them need attention. Hence, the greatest challenge for a mature and modern data lake is to satisfy all its requirements credibly.

DATA MANAGEMENT IS TRENDING TOWARD COMPLEX DATA ARCHITECTURES. This is true of the modern data warehouse, which is increasingly multiplatform and hybrid. The data lake may go this direction, though it is usually simpler today, typically with a single cloud data platform. TDWI recommends that data architects and other data management professionals keep their architectures as simple as possible so they are easier to design, maintain, govern, and optimize. However, if the extremely diverse requirements of modern data lakes drive you into a more complex architecture, then at least control the scope of complexity.

THE DATA LAKE HAS BECOME A BALANCING ACT. The lake’s original high-priority use case was open-ended advanced analytics, which requires masses of detailed source data. That is now joined by a new priority, namely self-service data access, prep, visualization, and analytics, which requires aggregated and lightly standardized data. In addition, as data lakes become more multitenant and multifunctional, they must support a growing number of users and use cases. A successful data lake architecture will support all user groups and their solutions by provisioning data that is appropriate for each.

THE DATA LAKE’S INTERNAL ARCHITECTURE SATISFIES DIVERSE USERS AND THEIR REQUIREMENTS. This provides areas for the lake’s multiple technical functions (data ingestion and source data management) and multiple business functions (improved data sets for self-service, reporting, and specific business units). A given data platform may include functions for defining data volumes, optimizing specific data sets, and security mechanisms for limiting access. However, a data lake’s internal architecture must be designed by technical users, typically data architects and modelers.

THE RELATIONAL PARADIGM IS HIGHLY RELEVANT TO MODERN AND MATURE DATA LAKES. In fact, high-priority lake use cases are impossible without it, especially self-service data practices and operational reporting at scale. Relational requirements explain why Hadoop is no longer the preferred data lake platform (because its relational support is weak) and why cloud data platforms are now preferred (because they support the relational paradigm deeply, as well as algorithmic approaches to advanced analytics).
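To illustrate the set-based, SQL-driven access that the relational paradigm enables, here is a minimal sketch using Python’s built-in sqlite3 purely as a stand-in for a cloud data platform; the table and values are hypothetical.

```python
import sqlite3

# In-memory relational store standing in for a lake's relational platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# Set-based analytics: one declarative query aggregates the whole data
# set, rather than row-at-a-time procedural code.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 200.0), ('west', 200.0)]
```

Self-service BI and reporting tools generate exactly this style of query, which is why a lake platform with deep relational support serves those users directly.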