
CHECKLIST REPORT 2020

Critical Success Factors for Data Lake Architecture

By Philip Russom

Sponsored by:
MARCH 2020

TDWI CHECKLIST REPORT

TABLE OF CONTENTS

FOREWORD

NUMBER ONE
Give your data lake an architecture similar to those of data warehouses and other databases

NUMBER TWO
Know your data lake’s requirements and map them to data platform architectural options

NUMBER THREE
Expect your data lake to evolve to the cloud and prepare for multicloud

NUMBER FOUR
Design a data lake’s internal architecture to accommodate diverse data structures, data domains, and analytics applications

NUMBER FIVE
Give end users the tools and curation they need to control sharing data across the lake architecture

NUMBER SIX
Select a cloud data platform that fulfills the lake’s diverse data requirements and integrates well with related IT systems

ABOUT OUR SPONSOR

ABOUT TDWI CHECKLIST REPORTS

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

555 S. Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E [email protected]
tdwi.org

© 2020 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to [email protected].

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.

FOREWORD

The data lake has come a long way since its origins around 2015. Today it is a well-established design pattern and data architecture for profound applications in data warehousing, reporting, data science, and advanced analytics as well as operational environments for marketing, supply chain, and finance. Over the years, users’ expectations, best practices, and business use cases for the data lake have evolved, as have the available data platforms upon which a data lake may be deployed.

This evolution is forcing changes in how data lakes are designed, architected, and deployed. In short, TDWI sees many corporations, government agencies, and other user organizations modernizing their data lakes to adapt them to today’s data and business requirements—instead of those of 2015. Similarly, “greenfield” data lakes today are quite different from early data lakes.

Many of the recent changes in the data lake are occurring at the architectural level. In particular, TDWI sees many organizations replatforming their data lakes as they abandon older database management systems and other data platforms in favor of modern ones. This forces changes in the systems architectures of a data lake, whether an old platform is replaced by a new one or left in place and augmented by a new one. Replatforming also leads to changes in a data lake’s data architecture when data is redistributed among the new mix of platforms or remodeled and improved during data migration.

Replatforming and other drivers for data lake architecture evolution take various forms:

DATA LAKES STARTED ON HADOOP BUT ARE MIGRATING ELSEWHERE. In fact, the earliest data lakes were almost exclusively on Hadoop. The current wave of dissatisfaction with Hadoop is driving a number of lake migrations off of Hadoop. For example, after living with their lakes for a year or more, many users discover that key use cases demand more and better relational functionality than can be retrofitted onto Hadoop. In a related trend, many organizations proved the value of data lakes on premises and are now migrating to cloud data platforms for their relational functionality, elasticity, minimal administration, and cost control.

A MODERN DATA LAKE MUST SERVE A WIDER RANGE OF USERS AND THEIR NEEDS. The first users of data lakes were mostly data scientists and data analysts who program algorithm-based analytics for data mining, statistics, clustering, and machine learning. As lakes have become more multitenant (serving more user types and use cases), set-based analytics (reporting at scale, broad data exploration, self-service queries, and data prep) has arisen as a requirement for the lake—and that requires a relational database.

CLOUD HAS RECENTLY BECOME THE PREFERRED PLATFORM FOR DATA-DRIVEN APPLICATIONS. Cloud is no longer just for operational applications. Many TDWI clients first proved the value of cloud as a general computing platform by adopting or upgrading to cloud-based applications deployed in the software-as-a-service (SaaS) model. Data warehousing, data lakes, reporting, and analytics are now aggressively adopting or migrating to cloud tools and data platforms. This is a normal maturity life cycle—many new technologies are first adopted for operational applications, then for data-driven, analytics applications.

NEW CLOUD DATA PLATFORMS ARE NOW FULLY PROVEN. The early adoption phase is over, spurring a rush of migrations to them for all kinds of data sets. As mentioned earlier, cloud data warehouses and other data platforms have the relational functionality that users need. In addition, they support the push-down execution of custom programming in Java, R, and Python. Early adopters have corroborated that the platforms perform and scale elastically, as advertised, while maintaining high availability and tight security. This gives more organizations the confidence they need to make their own commitments to cloud data platforms.

USER BEST PRACTICES FOR DATA LAKES ARE FAR MORE SOPHISTICATED TODAY. Early data lakes
suffered abusive practices such as data dumping,
neglect of data standards, and a disregard for
compliance. Over time, lake users have corrected
these poor practices. Furthermore, users have
realized that the data lake—like any enterprise data
set—benefits from more structure, quality, curation,
and governance.

The catch is to make these improvements in moderation, without harming the spirit of the data
lake as a repository for massive volumes of raw
source data fit for broad exploration, discovery,
and many analytics approaches. It’s a bit of a
balancing act, but data lake best practices are
now established for maintaining detailed source
data for discovery analytics while also providing
cleansed and lightly standardized data for
set-based analytics.

This TDWI Checklist Report will drill into the many issues, design patterns, and best practices of data
architectures with a focus on modernizing data
lake architectures. The report will also touch on
the many practical use cases—in analytics and
elsewhere—that a well-constructed data lake
architecture can support and nurture, as well as the
types of data platforms and tools that commonly
go into such architectures.


1 GIVE YOUR DATA LAKE AN ARCHITECTURE SIMILAR TO THOSE OF DATA WAREHOUSES AND OTHER DATABASES

For data lakes, as with any valuable enterprise data set, architecture is a requirement. Without the design patterns, organization, and standards imposed by architecture, a data set will be hard to maintain, scale, optimize, govern, access, and leverage for organizational advantage. Hence, a good architectural design is key to achieving business value and a return on investment (ROI) from a data lake.

Before diving into data lake architectures, let’s review TDWI’s definitions of architecture and how multiple architectures work together in data-driven applications such as analytics, data warehousing, and data lakes.

ARCHITECTURE EXISTS ON MULTIPLE LAYERS. In other words, the “technology stack” of data-driven applications includes more than one architecture. This includes:

• Data architecture (both physical storage and virtual views)

• Platform architecture (typically database management systems, storage, file systems, and other data platforms)

• Data integration architecture (for ETL/ELT, quality, replication, metadata, etc.)

These numerous architectures and technology stack components tightly integrate or overlap with one another such that it can be hard to tell where one architecture stops and another begins. The peaceful coexistence of multiple architectures is commonly seen in complex enterprise environments, including data lakes and warehouses.1

AMONG THESE, THE DATA ARCHITECTURE IS THE MOST IMPORTANT. In fact, TDWI has always defined the “data warehouse” as a data architecture that is populated by data, models, relationships, and metadata. This same definition also applies to data lakes and related data sets such as data marts and operational data stores.

DATA MODELS AND ARCHITECTURE ARE RELATED BUT DIFFERENT. Data modeling is largely about local data structures and their components (rows, columns, tables, keys, and data types), typically one database or table at a time. Data architecture tends to be about relationships across multiple data sets and their platforms. Obviously, a data architecture such as the data lake can have many data models within it.

DATA PLATFORM ARCHITECTURE IS THE PRIMARY ENABLER OF DATA ARCHITECTURES. Note that the “real” data lake or data warehouse is the data, not the data platforms that store and manage the data. However, we wouldn’t have analytics and other data-driven practices without data platforms that can store and manage data as well as provide in-place processing for data. In the practical world, selecting the platform(s) that best satisfy your requirements is key to the success of a data set and its overall architecture. This is true of data lakes, warehouses, and most other data architectures.2

DATA PLATFORM ARCHITECTURE VARIES CONSIDERABLY. One trend in data architecture involves multiplatform architectures that consist of two or more types of data platforms. Multiplatform architectures are hybrid when they distribute data across both on-premises and cloud data platforms.

1 For a detailed discussion of layered architectures in complex data environments, see the TDWI Best Practices Report: Evolving Data Warehouse Architectures in the Age of Big Data, online at tdwi.org/bpreports.

2 For more details, see the section “Data Lake Platforms and Architectures” in the TDWI Best Practices Report: Data Lakes: Purposes, Practices, Patterns, and Platforms, online at tdwi.org/bpreports.


For example, the average data warehouse has been multiplatform for years, and most data warehouse
modernization programs today replace some or
all of these platforms with a cloud data platform.
However, most data lakes are today deployed on
a single platform, and that platform is increasingly
a cloud-based database management system or
data warehouse. Single-platform data lakes still
have an internal data architecture that organizes the
lake’s data to distinguish data of diverse structures,
domains, sources, or use cases.

Section Four in this report will drill into the internal architecture of a data lake.3

3 For a comprehensive study of complex architectures for data management, see the TDWI Best Practices Report: Multiplatform Data Architectures, online at tdwi.org/bpreports.


2 KNOW YOUR DATA LAKE’S REQUIREMENTS AND MAP THEM TO DATA PLATFORM ARCHITECTURAL OPTIONS

Before selecting one or more platforms for your data lake, you must determine the lake’s many requirements. This is true whether you are designing a new data lake architecture or modernizing an existing one. Data lake requirements fall into three general categories. Here are examples of questions asked when gathering data lake requirements:

BUSINESS REQUIREMENTS:

• Who are the primary end users for the data lake?

• What do they need to achieve via the lake?

• What practices for analytics, reporting, and self-service will they apply to the lake’s data?

DATA REQUIREMENTS:

• What sources will feed the lake?

• What kinds of data integration solutions are required?

• What data structures result and how do those affect requirements for storage?

ANALYTICS REQUIREMENTS:

• What analytics tool types and technologies will access and process the lake’s data?

• What data structures, quality conditions, and interfaces do these tools demand?

• What forms of in-database analytics and other push-down processing will the analytics tools require?

Once you have gathered data lake requirements, you can map them to data platform options and platform architecture designs. Here are examples of how data lake requirements guide such decisions.

RELATIONAL BUSINESS REQUIREMENTS. Early data lakes were usually deployed for technical users, such as data scientists and analysts. Today, business users also demand access to a lake’s data so they can perform self-service data prep, visualization, and light analytics. Most end-user tools for self-service (from enterprise BI platforms to visualization tools) are query-driven, and they assume relational data. In a related area, many firms have portfolios of SQL-driven reporting and analytics tools and want to apply this investment to data lake practices. Due to these relational business requirements, a data lake’s primary (or only) data platform should support the relational paradigm deeply. This way, self-service business users and other groups of users get the easy access to lake data and fast SQL-based exploration and analytics that they need from their data-driven solutions.
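To make the relational requirement concrete, here is a minimal sketch of the kind of set-based, SQL-driven exploration that self-service tools generate against a lake’s relational platform. It uses Python’s built-in sqlite3 module purely as a stand-in engine; the table, columns, and query are hypothetical and would be replaced by the lake’s actual platform, connector, and schema.

```python
import sqlite3

# Stand-in for a connection to the lake's relational data platform.
# In practice this would be the platform's own Python connector.
conn = sqlite3.connect(":memory:")

# Hypothetical lightly standardized data set exposed to self-service users.
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        region      TEXT,
        order_date  TEXT,
        amount_usd  REAL
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "EMEA", "2020-01-15", 120.0),
     (2, "AMER", "2020-01-16", 340.5),
     (3, "EMEA", "2020-02-02", 89.9)],
)

# The kind of set-based exploratory query a BI or visualization tool emits.
query = """
    SELECT region,
           strftime('%Y-%m', order_date) AS month,
           COUNT(*)        AS order_count,
           SUM(amount_usd) AS revenue_usd
    FROM orders
    GROUP BY region, month
    ORDER BY region, month
"""
for row in conn.execute(query):
    print(row)
```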

BROAD DATA FOR SELF-SERVICE AND ANALYTICS CORRELATIONS. In addition to self-service practices, another high-priority use case for the data is discovery analytics, where analytics algorithms scan large volumes of diverse data to discover different types of business entities and to correlate facts about them. This usually involves technologies and tools for mining, statistics, clustering, and machine learning. One thing that self-service and discovery analytics have in common is that both assume massive volumes of data, from many sources, covering multiple data domains, in a variety of structures and quality states, about many business entities and processes, and involving multiple latencies.

When we pile together the data requirements of lake-based analytics this way, the list is daunting. Yet a data lake must satisfy these requirements and more. One strategy for coping with the list is to limit the first phase of a data lake to a short list of use cases, then grow into more in subsequent project phases. This avoids the risks of a big bang project, but it means that you must revise data lake architecture with each phase.

EXTREMELY DIVERSE REQUIREMENTS CAN LEAD TO COMPLEX ARCHITECTURE. As we just saw, the list of business, data, and analytics requirements is long and challenging. It is difficult to satisfy all requirements with a single data platform. The situation leads some organizations to distribute the data of their lake across multiple data platforms of different types. The resulting data platform architecture is complex in that it can be multiplatform, heavily distributed, and hybrid (when distributed across both clouds and on-premises systems).

• UPSIDE OF COMPLEX DATA PLATFORM ARCHITECTURES. When handled carefully, diverse requirements are satisfied by platform instances that are optimized for specific analytics use cases. Furthermore, users have more and better options when it comes to mapping specific data sets and analytics practices to a fully appropriate data platform.

• DOWNSIDE OF COMPLEX DATA PLATFORM ARCHITECTURES. The more platform types or instances you have, the higher you drive up the costs of licenses, IT infrastructure, administration, staffing, and training per platform. Furthermore, complexity exacerbates design, maintenance, governance, and cross-platform optimization for speed and scale.

KEEP IT SIMPLE AND MANAGE THE SCOPE OF COMPLEXITY. TDWI sees organizations succeeding with both single-platform and multiplatform data lakes. Users with multiplatform or hybrid architectures put up with the complexity because it fits their technical expertise, IT infrastructure, and data ownership practices. Other users prefer to diversify within a single platform or instance so they can simplify design patterns, governance, and optimization while avoiding the distraction and cost of administering multiple platforms. In short, stick to simple architectures if you can and control the scope of complexity if you must adopt multiple platforms.


3 EXPECT YOUR DATA LAKE TO EVOLVE TO THE CLOUD AND PREPARE FOR
MULTICLOUD

Nowadays, all kinds of platform choices and architectural design patterns in IT almost inevitably lead to cloud, so let’s consider cloud as an important, modern platform.

TDWI considers cloud to be a general compute platform upon which many types of other platforms or applications can be deployed. As discussed in the foreword of this report, cloud went mainstream for operational applications, driven by a rush of organizations adopting SaaS apps, which are cloud based. TDWI is now witnessing a flurry of migrations to cloud-based DBMSs and data warehouse platforms.4

DRIVERS FOR DATA LAKES ON THE CLOUD

Cloud is well on its way to becoming mainstream for analytics, warehousing, and data lakes. Therefore, we have yet another question to consider regarding data lake architecture and deployment: on premises or on the cloud? Here’s how the answer to that question plays out:

ECONOMICS ALONE LEADS DATA LAKES TO THE CLOUD. Data lakes are characterized by their massive data volumes and strenuous data processing for analytics. Building an on-premises MPP configuration of a traditional relational database—a configuration that is scalable and powerful enough to satisfy a modern data lake—is cost-prohibitive for most organizations in terms of the servers, storage, licenses, and administrative resources that would need to be amassed. By comparison, an equivalent data lake deployed on a cloud database or cloud data platform is less expensive because it avoids capital expenses for hardware and has reasonably priced licensing.

TECHNOLOGY ISSUES ALSO LEAD DATA LAKES TO THE CLOUD. Deploying a data lake on the cloud avoids the risky, time-consuming, and distracting system integration projects that plague on-premises deployments. Furthermore, administering a cloud data platform is trivial compared to on-premises databases. Finally, due to elasticity, cloud data platforms scale automatically to rising data volumes and unpredictable analytics workloads. In summary, cloud data platforms take many mundane tasks off the plates of data management professionals so they can focus on their mandate: building data-driven solutions for the business.

CLOUD IS THE FUTURE. Cloud is quickly becoming the preferred compute platform for all of IT. Many firms have a “cloud-first” mandate or similar guidance that forces new implementations and updates of older ones onto clouds. The point is that cloud is inevitable, so you should embrace it and plan for it. This is fine, given the many benefits of cloud for analytics use cases.

MULTICLOUD ARCHITECTURES FOR DATA LAKES

We say “cloud” as though it is one thing, but many organizations end up with operational applications and data-driven systems on multiple clouds. So-called multicloud is becoming a norm for IT and data management, and it has ramifications for data lake architecture.

IT CAN BE FICKLE CONCERNING PREFERRED CLOUD PROVIDERS. As corporations and other organizations fine-tune their cloud commitments, their IT groups regularly strike new deals with more cloud providers. Ideally, users should look for data platforms and solutions that are portable, in case they need to migrate their data and analytics across multiple clouds. Likewise, modern tooling for warehousing, reports, analytics, and integration must now reach data residing in multiple cloud providers and cloud regions, plus aggregate and process multicloud data with close to real-time performance. This is for the purposes of federated queries, plus the inevitable near-time movement of data across multiple clouds when, say, a data scientist builds a new analytics sandbox.

4 For more details, see the TDWI Best Practices Report: Cloud Data Management, available at tdwi.org/bpreports.

MULTICLOUD CAN RESULT FROM DIVERSE ANALYTICS SPONSORSHIP. Most analytics
applications have a departmental bias. For example,
marketing should sponsor customer analytics, and a
procurement department needs to control partner
and supply chain analytics. In a related trend,
many departments and business units are now in
the habit of funding their own “shadow IT,” which
increasingly involves cloud tools and platforms,
sometimes on multiple clouds. Hence, there is a
need for data lake and data platform solutions that
are portable across clouds or can help unify data
across cloud providers.

GLOBAL BUSINESS NEEDS REGIONAL ARCHITECTURAL OPTIONS. A particular cloud
provider may not serve all geographic regions
natively. Similarly, the regional clouds of
international cloud providers may have different
performance and reliability characteristics, or a
regional cloud may be subject to local governance
regulations or accounting standards. These vagaries
may push a global business toward a multicloud
architecture for data lakes and other systems.


4 DESIGN A DATA LAKE’S INTERNAL ARCHITECTURE TO ACCOMMODATE DIVERSE DATA STRUCTURES, DATA DOMAINS, AND ANALYTICS APPLICATIONS

At this point, let’s leave our discussion of platform architectures and focus on data architectures that organize the internals of a data lake.

The earliest data lakes had little or no design or architecture because they were single-instance dumping grounds for whatever data that data scientists and analysts deemed useful. If a lake had an internal architecture, it was the result of sandboxing or prominent schema dumped into it. As users worked with their data lakes over time, they realized that—like all enterprise data collections—the usability and performance of the lake improves when its data’s models, quality, semantics, and organization are improved. In a related trend, early successes with analytics on lakes drew more use cases and user constituencies to them, and these demanded certain improvements and modernizations. Furthermore, the available data platforms and tools for data lakes have advanced considerably.

Today, a modern and mature data lake differs from its predecessors:

• DATA LAKES ARE NOW MULTITENANT. They are enterprise assets shared by multiple departments and their diverse analytics needs. Data scientists and analysts continue to be important users, but they are now joined by mildly technical business users, typically from marketing, finance, and supply chain departments.

• DATA LAKES ARE NOW MULTIFUNCTIONAL. They are still true to their original mandate as a massive repository of detailed source data that can be repurposed ad infinitum for use cases in advanced analytics. That repository is still at the heart of the modern data lake, along with sandboxes, data labs, and other areas for data prototyping. Yet the data lake serves other technical purposes, too, including data landing, data staging, data archiving, and managing cleansed or remodeled data sets for self-service or specific departments.

Given these changes and the ongoing modernization of the data lake, it behooves data management professionals to carefully design an internal architecture that can accommodate the diverse data structures, data domains, and analytics applications that a multitenant and multifunctional data lake is called upon to support. See Figure 1, a reference architecture for organizing the internals of a data lake. Note that this is a logical representation, and its architectural components may be physically deployed on one data platform or across many. Figure 1 also assumes a left-to-right flow of data.

The reference architecture in Figure 1 identifies four key areas for a lake’s internal organization:

INGESTION AREAS. Data landing and staging is the weakest area of most data warehouse architectures and other enterprise data environments. Users have traditionally used a hodgepodge of file systems and spare database licenses for this without much thought for design or architecture. The data lake corrects this architectural failure by modernizing the ingestion side of data architecture. This makes sense because the detailed source data extracted by data integration processes is the same data that the lake’s primary repository needs to fulfill its mandate as the provider of analytics data. This also illustrates how a single modern data lake does double duty by supporting multiple architectural functions.

ANALYTICS DATA SETS. This is the “real” data lake in terms of the lake’s size and mandate. It is where source data ingested via extraction processes is persisted on disk in the lake’s repository of raw detailed data. Some of the data sets developed from raw detailed source data are also persisted here, ranging from transient sandboxes to lightly standardized data sets for self-service, operational reporting at scale, and other set-based analytics.

Note that the lake’s raw detailed source data is always maintained in its original, arrival condition, even when subsets from it are copied to develop other data sets. After all, source data is the lake’s mandate. Furthermore, the focus on source data differentiates data lakes from data warehouses, which focus on calculated values, aggregated data sets, and specialized data models. This differentiation makes lakes and warehouses complementary, which explains why they are increasingly deployed side by side with data flowing through the lake into the warehouse and minimal data redundancy between the two.

FUNCTIONAL DATA SETS. A multitenant data lake may also manage data sets derived from its raw data repository that are intended for specific departments and business units. This makes the most sense when these data sets are shared across multiple units or across a global organization, as is the case with complete customer views or analytics data sets designed for broad consumption and collaboration.

LIVE ARCHIVE. As with any valuable enterprise data asset, a lake needs life cycle management for its data. For example, data that is “cold” (because it is used infrequently or has limited value) should be moved to cheaper storage media or a “live” archive, where data is still available 24x7 without restoration processes. As an alternative, a best practice is to leave lake data in place but mark it as cold or archival using metadata or cataloging. Given the petabyte scale that data lakes are growing toward, it makes sense to identify data sets that are candidates for archiving and deletion. In addition, isolating cold data means less data for algorithms and queries to parse, thereby optimizing performance. Finally, data curation (discussed in the next section of this report) should set rules for who is allowed to bring what data into the lake, as well as when abandoned sandboxes should be archived or deleted.
most sense when these data sets are shared

[Figure 1 shows the data lake’s internal organization as a left-to-right flow: many sources and ingestion methods feed the ingestion areas (data landing, data staging, ETL/ELT, push-down processing, file/document/container capture, and real-time stream capture); data curation governs movement into the analytics data sets (algorithm-based advanced analytics, set-based self-service such as data prep and visualization, collaborative sandboxes, data labs, and prototypes) and the functional data sets (business units such as marketing, sales, call center, and financials; shared data such as customer views; special data such as synchronization with operational apps and outbound real-time data), which serve many targets and delivery methods; a live archive holds infrequently used data, still available 24x7, plus expired data and sandboxes per curation rules.]

FIGURE 1. Reference architecture for a data lake’s internal organization.
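To make the reference architecture tangible, below is a minimal sketch of one way the four areas could be laid out as zones and a new arrival routed into them. The zone names, path prefixes, and routing rules are illustrative assumptions inspired by Figure 1, not a prescribed implementation; on a cloud data platform the "paths" might instead be databases, schemas, or stages.

```python
# Illustrative zone layout mirroring the four areas in Figure 1.
ZONES = {
    "ingestion":  "lake/ingestion/",   # landing and staging for new arrivals
    "analytics":  "lake/analytics/",   # raw detailed source data plus sandboxes
    "functional": "lake/functional/",  # curated sets for business units and sharing
    "archive":    "lake/archive/",     # cold data, still available without restore
}

def landing_path(source_system: str, dataset: str, arrival_date: str) -> str:
    """Every new arrival lands in the ingestion area first, keyed by source and date."""
    return f"{ZONES['ingestion']}{source_system}/{dataset}/{arrival_date}/"

def promote(dataset: str, audience: str) -> str:
    """After curation, route a data set to the analytics or functional area."""
    zone = "functional" if audience in {"marketing", "finance", "supply_chain"} else "analytics"
    return f"{ZONES[zone]}{audience}/{dataset}/"

# Example usage with hypothetical names.
print(landing_path("crm", "contacts", "2020-03-01"))   # lake/ingestion/crm/contacts/2020-03-01/
print(promote("contacts_cleansed", "marketing"))       # lake/functional/marketing/contacts_cleansed/
print(promote("contacts_raw", "data_science"))         # lake/analytics/data_science/contacts_raw/
```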


5 GIVE END USERS THE TOOLS AND CURATION THEY NEED TO CONTROL
SHARING DATA ACROSS THE LAKE ARCHITECTURE

We all know that data lakes run the risk of deteriorating into the dreaded data swamp—an undocumented and disorganized data store that is nearly impossible to navigate, trust, and leverage for organizational advantage. A lake becomes a swamp when users bring any data they like to the lake, make contradictory copies of data as they create sandboxes, and fail to provide metadata for their data.

As organizations modernized their data lakes in recent years, they adopted user best practices and tool functions that control the lake and avoid the swamp. These practices and tool types are now considered required components of a comprehensive and controlled data lake architecture.

DATA CURATION. The data curator is key to controlling data’s entrance into a lake as well as how data is used, documented, and shared within and outside the lake. Data curation may be performed by a dedicated, full-time employee or by users who follow governance policies. Either way, data curation prevents multiple users from copying the same data into the lake, resulting in redundancy that skews statistics, machine learning, and other analytics outcomes.

Curation usually demands metadata for all data brought into the lake or created there, which is necessary for users to find data in the lake, to share data with others, and to distinguish between source data (raw and unaltered) and sandbox data (aggregated and remodeled).

DATA GOVERNANCE. The data lake is like any data platform or data store in that it needs data governance (DG) to keep its technical standards and business compliance high. DG usually takes the form of a board or committee populated with a mix of data management professionals (who create enterprise standards for data) and business managers (who serve as data owners, stewards, and curators, with a focus on compliance). All these people collaborate to establish and enforce policies that assure data’s compliant access, use, security, standards, and trust.

Implementers of a data lake should work with their enterprise DG board so the lake and its data will comply with established DG policies. Do this prior to designing the lake and loading it with data. Given that data lakes differ from older data store types, it is probable that yours will require new DG policies or revisions of older ones.

METADATA MANAGEMENT. The challenge is that data lakes typically manage a wide range of data types, structures, and containers, many of which arrive without metadata. Curation demands metadata, yet manually retrofitting metadata onto terabytes or petabytes of data is not practical. Luckily, modern tools for metadata management can scan data to deduce its structures and components, then suggest metadata to a developer or automatically apply the deduced metadata. Similarly, tools can inject metadata into files, documents, and other containers (which are common with lakes) making those easier and faster to scan for analytics. Metadata tools aside, these same features are increasingly found in tools for data integration, visualization, and analytics.
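As a rough illustration of what such scanning does, here is a minimal sketch that inspects a sample of arriving records and suggests column-level metadata. It is a toy stand-in for the metadata tools the report describes, assuming simple JSON-like records; real tools profile far more (keys, domains, quality, lineage) and work at lake scale.

```python
from collections import defaultdict

def suggest_metadata(records):
    """Scan sample records and deduce a simple column/type profile."""
    observed = defaultdict(set)
    nullable = defaultdict(bool)
    for rec in records:
        for column, value in rec.items():
            if value is None:
                nullable[column] = True
            else:
                observed[column].add(type(value).__name__)
    return {
        column: {"types": sorted(types), "nullable": nullable[column]}
        for column, types in observed.items()
    }

# Hypothetical sample pulled from a landed file that arrived without metadata.
sample = [
    {"customer_id": 101, "signup_date": "2020-01-05", "lifetime_value": 250.0},
    {"customer_id": 102, "signup_date": "2020-02-11", "lifetime_value": None},
]

for column, profile in suggest_metadata(sample).items():
    print(column, profile)
```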
COLLABORATIVE FEATURES. Whether data scientists or self-service users, all data lake users want and need to share their data discoveries and carefully crafted data sets with other colleagues—but with privacy and governance controls. Hence, a data lake ideally needs some kind of “publish and subscribe” mechanism or similar sharing facility. In some user organizations, the data curator reviews and approves data sets or documents shared this way to reduce redundancy and ensure metadata, security, privacy, and other controls.
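Here is a minimal sketch of what such a publish-and-subscribe sharing facility might look like, with the curator approval step in the loop. The class, method names, and workflow are hypothetical illustrations of the idea, not a description of any particular product; a real facility would add access controls, notifications, and audit trails.

```python
class DataSetExchange:
    """Toy publish/subscribe registry for sharing curated data sets."""

    def __init__(self):
        self.pending = {}      # name -> (owner, metadata) awaiting curator review
        self.published = {}    # name -> (owner, metadata) approved for sharing
        self.subscribers = {}  # name -> set of user ids

    def publish(self, name, owner, metadata):
        """A user offers a data set for sharing; it waits for curator approval."""
        self.pending[name] = (owner, metadata)

    def approve(self, name):
        """The data curator reviews the submission and makes it discoverable."""
        self.published[name] = self.pending.pop(name)
        self.subscribers.setdefault(name, set())

    def subscribe(self, name, user):
        """Colleagues subscribe to approved data sets instead of copying them."""
        if name not in self.published:
            raise KeyError(f"{name} is not published yet")
        self.subscribers[name].add(user)

exchange = DataSetExchange()
exchange.publish("customer_360_q1", owner="analyst_7",
                 metadata={"domain": "customer", "refresh": "daily"})
exchange.approve("customer_360_q1")
exchange.subscribe("customer_360_q1", "marketing_user_3")
print(exchange.subscribers["customer_360_q1"])
```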

END-USER TOOLS AND PLATFORMS. Note that software functionality for curation, governance,
metadata, and collaboration is available via many
types of systems, including data platforms, open
source, and tools for analytics, data semantics,
and integration.

TDWI sees users making successful decisions about the platforms and tools they call on for these
functions. For example, some users like to keep it
simple and stick to their data platform’s capabilities.
Others take more of a best-of-breed approach by
acquiring strong tools from third parties. Users
who take the second route should look for vendors
that partner with their platform provider because
partnerships usually ensure strong compatibility,
interoperability, and performance.


6 SELECT A CLOUD DATA PLATFORM THAT FULFILLS THE LAKE’S DIVERSE DATA
REQUIREMENTS AND INTEGRATES WELL WITH RELATED IT SYSTEMS

In closing, let’s summarize the findings of this report related to selecting data platforms and designing architectures for modern data lakes:

A DATA LAKE REQUIRES ARCHITECTURE. In fact, a data lake can be at the intersection of multiple architectures, typically data architecture, data platform architecture, integration architecture, and analytics or reporting architectures. All these must work together to complete a technology stack and to meet the diverse requirements of lake-based solutions.

DATA LAKE DESIGN IS ALL ABOUT SATISFYING MULTIPLE REQUIREMENTS. These include business requirements (for specific user types), data requirements (to accommodate many data structures, domains, and latencies, both at rest and in motion), and analytics requirements (to give specific tools or methods the data structures and breadth of data content they need). These add up to a long list of requirements, and all of them need attention. Hence, the greatest challenge for a mature and modern data lake is to satisfy all its requirements credibly.

DATA MANAGEMENT IS TRENDING TOWARD COMPLEX DATA ARCHITECTURES. This is true of the modern data warehouse, which is increasingly multiplatform and hybrid. The data lake may go this direction, though it is usually simpler today, typically with a single cloud data platform. TDWI recommends that data architects and other data management professionals keep their architectures as simple as possible so they are easier to design, maintain, govern, and optimize. However, if the extremely diverse requirements of modern data lakes drive you into a more complex architecture, then at least control the scope of complexity.

THE DATA LAKE HAS BECOME A BALANCING ACT. The lake’s original high-priority use case was open-ended advanced analytics, which requires masses of detailed source data. That is now joined by a new priority, namely self-service data access, prep, visualization, and analytics, which requires aggregated and lightly standardized data. In addition, as data lakes become more multitenant and multifunctional, they must support a growing number of users and use cases. A successful data lake architecture will support all user groups and their solutions by provisioning data that is appropriate for each.

THE DATA LAKE’S INTERNAL ARCHITECTURE SATISFIES DIVERSE USERS AND THEIR REQUIREMENTS. This provides areas for the lake’s multiple technical functions (data ingestion and source data management) and multiple business functions (improved data sets for self-service, reporting, and specific business units). A given data platform may include functions for defining data volumes, optimizing specific data sets, and security mechanisms for limiting access. However, a data lake’s internal architecture must be designed by technical users, typically data architects and modelers.

THE RELATIONAL PARADIGM IS HIGHLY RELEVANT TO MODERN AND MATURE DATA LAKES. In fact, high-priority lake use cases are impossible without it, especially self-service data practices and operational reporting at scale. Relational requirements explain why Hadoop is no longer the preferred data lake platform (because its relational support is weak) and why cloud data platforms are now preferred (because they support the relational paradigm deeply, as well as algorithmic approaches to advanced analytics).


CLOUD IS THE FUTURE OF ALL IT, INCLUDING DATA LAKE ARCHITECTURES. Many organizations
have made general commitments to cloud (e.g.,
cloud-first mandates), which are driving data lakes
to the cloud. More important, however, cloud data
platforms are a natural fit for the data lake due to
their support for relational technology and elasticity
to automatically accommodate the scale and
unpredictable stresses of analytics workloads.

Also, the general benefits of the cloud come into play with data lakes on cloud data platforms,
namely short time to use, minimal cost of entry and
ownership, and zero system integration and capital
expense. Hence, it is no surprise that TDWI sees
many of its members and other user organizations
modernizing their first-generation data lakes by
migrating them to the cloud, as well as adopting
new data lake architectures designed for cloud
data platforms.


ABOUT OUR SPONSOR

Snowflake’s cloud data platform shatters the barriers that have prevented organizations of all sizes from unleashing the true value from their data. More than 2,000 customers deploy Snowflake to advance their businesses beyond what was once possible by deriving all the insights from all their data by all their business users. Snowflake equips organizations with a single, integrated platform that offers a data warehouse built for the cloud; instant, secure, and governed access to their entire network of data; and a core architecture to enable many types of data workloads, including a single platform for developing modern data applications. Snowflake: Data without limits. Find out more at Snowflake.com.

ABOUT THE AUTHOR

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 600 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, analytics, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for BI professionals worldwide. TDWI Research focuses exclusively on analytics and data management issues and teams up with industry practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data management solutions. TDWI Research offers reports, commentary, and inquiry services via a worldwide membership program and provides custom research, benchmarking, and strategic planning services to user and vendor organizations.
