BlueGranite Data Lake Ebook
Data Lakes in a Modern Data Architecture
Modern Data Architecture
What does it really mean to implement a modern data architecture? Like many
other technology initiatives, it depends on the implementation objectives. The
following characteristics are most commonly associated with a modern data
architecture:
• Data originating from internal systems and cloud-based systems, as well as external data provided by partners and third parties
• A diverse set of data sources and multi-structured formats
• Streaming real-time data, loading in batches, or some combination of both
• Data volumes that range from moderate to high
• Cloud-based and hybrid delivery modes
• Delivery of analytics to traditional platforms such as data marts and semantic layers, as well as specialized databases like graph, spatial, or NoSQL
• Data virtualization techniques employed, in addition to data integration
• Analytics use cases ranging from operational BI to corporate BI to advanced analytics and data science
• A multi-platform data architecture to suit a variety of use cases
• An agile delivery approach with iterative delivery cycles
• Support for a diverse group of users, whether casual data consumers, data analysts, or data scientists
• Automation and DevOps to reduce time-to-value and ensure solution consistency and quality
Business Needs Driving Data Architectures to Evolve & Adapt
Today’s business leaders understand that data holds the key to making educated and supportable
decisions.
Traditional data warehousing and business intelligence approaches have been challenged as being
too slow to respond. Reducing the time to value is a fundamental objective of a modern data
architecture.
Data warehouses have traditionally excelled in simplifying data access and answering many of the
questions required to successfully run the business. However, it’s impossible to anticipate every
question a business might ask and every report they might need. In a modern data architecture,
acquiring new data should be relatively easy so that new analysis can be conducted swiftly.
Data volumes have exploded as businesses discover the value contained within social media,
documents, comments, sensors, and edge devices. Fifteen years ago, companies never expected
to have to keep track of things such as social media “likes.” The ability to capture and analyze
practically any type of data is a critical business capability.
A final thought: users need to know that data in the data lake is governed and high
quality, not a disorganized, unreliable swamp.
With all the media hype around data lakes and big data, it can be difficult to understand how — and
even if — a technology like a data lake makes sense for your analytics needs. Some people believe
that implementing a data lake means throwing away their investment in a data warehouse. This
perception either sends them down the wrong path or causes them to sideline big data
and data lakes as a future project.
The good news? At BlueGranite, we believe that a data lake does not replace a company’s existing
investment in its data warehouse. In fact, they complement each other very nicely. With a modern
data architecture, organizations can continue to leverage their existing investments, begin collecting
data they have been ignoring or discarding, and ultimately enable analysts to obtain insights faster.
Principles of a Modern Data Architecture
Big data technologies, such as a data lake, support and enhance modern analytics
but they do not typically replace traditional systems.
Data integration and data virtualization are both prevalent.
Many IT professionals have become less willing to take on data integration – that
is, the requirement to physically move data before it can be used or analyzed. In
reality, a lot of data integration still occurs, but it is more thoughtful and purposeful.
Data virtualization and logical data warehouse tactics, such as federated queries
across multiple data stores, are ways to “query data where it lives.” Minimizing data
movement is useful in situations such as:
• Large datasets, impractical to move
• Short time window to do data integration
• Data privacy, regulatory, geographic concerns
• Risk of losing metadata or additional context when data is moved
Data Lake + Data Warehouse: Complementary Solutions
A traditional data warehouse is a centralized repository containing information
which has been integrated from multiple source systems and structured in a
user-friendly way that facilitates analytical queries.
Data Warehouse
Data warehousing is characterized by requiring a significant amount of discovery,
planning, data modeling, and development work before the data becomes
available for analysis by the business users.
Data Lake
Data lakes are broadly accepting of new data regardless of format. This is a
marked departure from the rule-laden, highly-structured storage within
traditional relational databases. While that rigidity helps relational databases
maintain high standards for data quality, heavy-handed enforcement of dataset
schemas can impede rapid and iterative development.
The philosophy of data lakes is to accept new data instantly and with few restrictions,
but then apply the rigors of business logic, type-checking, and data governance when
it comes time to use (or “read”) the data. This is widely termed “schema on read”, in
contrast to the relational database approach of “schema on write”.
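To make the contrast concrete, here is a minimal Python sketch of schema on read; the file contents, field names, and validation rule are illustrative assumptions, not part of any specific product:

```python
import csv
import io

# Raw events land in the lake exactly as received -- no schema is enforced on write.
raw_file = "device_id,reading,ts\nd1,21.5,2021-01-01\nd2,not-a-number,2021-01-02\n"

def read_with_schema(text):
    """Schema on read: types and validation are applied only at query time."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        try:
            rows.append({"device_id": rec["device_id"],
                         "reading": float(rec["reading"]),  # type check happens here
                         "ts": rec["ts"]})
        except ValueError:
            pass  # bad records surface on read, not on load
    return rows

rows = read_with_schema(raw_file)  # only the valid record survives the read
```

A schema-on-write system would have rejected the second record at load time; the lake accepts it and defers that decision to each consumer of the data.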
This flexibility allows for new value propositions that are more difficult or time-
consuming to achieve with a traditional data warehouse. A data lake focuses on
providing:
• One architectural platform to house any type of data: machine-generated data,
human-generated data, as well as traditional operational data
• Fewer obstacles to data acquisition
• Access to low-latency and near real-time data
• Reduced cost of ownership, permitting long-term retention of data in its raw,
granular, form
• Deferral of work to schematize data until value is known and requirements are
established
The tradeoff to a data lake’s agility is the additional effort required to analyze the data
via “schema on read” techniques, during which a data structure is defined at query time
to analyze the data.
The different characteristics lead to an inverse relationship between a data lake and a
data warehouse:
This inverse relationship is the precise reason why a data lake and a
data warehouse are complementary solutions.
To summarize: a data warehouse is a highly structured store of the data that the
business has deemed important, while a data lake is a more organic store of data
without regard for the perceived value or structure of the data.

Load pattern:
• Data warehouse: ETL (Extract, Transform, then Load)
• Data lake: ELT (Extract, Load, and Transform at the time the data is needed)
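The two load patterns can be contrasted in a few lines of Python; the records and the transformation are purely illustrative:

```python
raw_records = ["  Alice ", "BOB", ""]

def transform(rec):
    # Stand-in for real cleansing/conforming logic
    return rec.strip().title()

# ETL (data warehouse): transform first, then load only conformed data.
warehouse = [transform(r) for r in raw_records if r.strip()]

# ELT (data lake): load everything as-is; transform when the data is needed.
lake = list(raw_records)

def query_lake():
    return [transform(r) for r in lake if r.strip()]
```

Both paths yield the same conformed result, but the lake still holds the original records, so a future, different transformation can be applied to them at any time.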
Data Lakes 1: Retain Data Perpetually
During the development of a traditional data warehouse, a considerable amount of
time is spent analyzing data sources, understanding business processes, profiling data,
and modeling data. The result is a highly structured data model designed for reporting.
Generally, if data isn’t used to answer specific known questions, or required for a defined
report, it may be excluded from the data warehouse. This is usually done to simplify the
data model and to wisely utilize data storage.
In contrast, a data lake can acquire all types of data and retain that data perpetually –
just in case. Whereas the default mode of a data warehouse is to justify the need for
data before it is included, the default expectation for a data lake is to acquire all of the
data and retain all of the data. Unless a firm archival policy is required, long-term data
retention is justified because future requirements and needs are unknown.
This approach of retaining large amounts of data becomes possible because the
hardware for a data lake usually differs greatly from that used for a data warehouse.
Inexpensive storage allows scaling of a data lake to terabytes and petabytes fairly
economically.
A data lake can also act as an "active archive" in which older data that is rarely
needed is moved from the data warehouse to the data lake. This is often described
as keeping hot data in the DW and cold data in the data lake.
Data Lakes 2: Support All Types of Data
Traditional relational data warehouses most commonly contain data extracted from
transactional systems. These systems, such as sales and inventory, typically consist of
quantitative metrics and the textual attributes that describe them.
Nontraditional data sources include items such as web server logs, sensor data, social
network activity, text, and images. New use cases for these data types continue to be
found. Storing and consuming multi-structured data in a relational database can be
very challenging.
The data lake approach embraces these nontraditional data types. A data lake can
store all data, regardless of source, regardless of structure, and (usually) regardless of
size.
Best practices dictate that raw data be retained in its original native format. No
changes should be allowed to the raw data, as it is considered immutable. It is
particularly important to retain, and securely back up, raw data in its native format to
ensure:
• All data which is transformed downstream from the raw data can be regenerated
• Access to the raw data is possible in select circumstances – for instance, data
scientists frequently request the raw data because there has been no context
applied to it
• Transformations or algorithms which adapt over time can be reprocessed, thus
improving accuracy of the historical data
• Point-in-time analysis can be accomplished if data has been stored to support
historical reporting
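A small Python sketch of the regeneration idea, using hypothetical records and a stand-in transformation:

```python
# Raw data is immutable; curated data is derived and can always be rebuilt from it.
raw_events = ({"id": 1, "amount": "10.0"}, {"id": 2, "amount": "5.5"})  # read-only tuple

def build_curated(events, rate=1.0):
    """Re-runnable transformation: if the logic improves (e.g. a corrected rate),
    reprocessing the raw history regenerates every downstream record."""
    return [{"id": e["id"], "amount": float(e["amount"]) * rate} for e in events]

v1 = build_curated(raw_events)             # original transformation
v2 = build_curated(raw_events, rate=1.1)   # improved logic applied to full history
```

Because the raw records are never modified, any downstream dataset is reproducible, and improved algorithms can be replayed over the entire history.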
Data Lakes 3: Encourage Early Data Exploration
One of the chief complaints about data warehouses is how long modifications can take. A
good data warehouse design can adapt to change, but because of the complexity of the
data load transformation processes and the work done to make analysis and reporting
easy, introducing changes will necessarily consume some DW/BI team resources.
Some business questions don’t have the luxury of waiting for the data warehouse team
to adapt the system. This ever-increasing need for faster answers is one of the main
drivers for self-service business intelligence initiatives.
In the data lake, by contrast, all data is stored in its raw form, so it can be
made accessible quickly to someone who needs it. Although direct access to the
raw data should be highly restricted, select users can be empowered to conduct
early analysis.
If the result of an exploration is shown to be useful and there is a desire to repeat it,
then a more formal schema can be applied to the data. Data cleansing, transformations,
standardization, reusability, and automation can be incorporated to extend the results
to a broader audience via the data warehouse or via a curated data area of the data lake.
Conversely, if the initial exploration results were not useful, they can be discarded with
no additional time and effort.
Data loads into a data warehouse perform best in batch mode. As data warehousing
systems have scaled to larger solutions, the distributed nature of MPP (massively parallel
processing) systems makes delivering near real-time data even more problematic.
The ability for a data lake to ingest data with ease brings about many more use cases,
especially related to IoT (Internet of Things). A frequent pattern we see with respect to
near real-time data is to transmit data to two outputs: once to a streaming dashboard or
application, and once to persist the data permanently in the data lake.
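This dual-output pattern can be sketched as follows; the event shape and sink names are assumptions for illustration:

```python
dashboard, lake = [], []  # hot path and cold path sinks

def route(event):
    """Fan each incoming event out to two destinations."""
    dashboard.append(event["value"])  # hot path: feed a streaming dashboard
    lake.append(event)                # cold path: persist the raw event permanently

for e in [{"sensor": "s1", "value": 20}, {"sensor": "s1", "value": 22}]:
    route(e)
```

The dashboard sees only the latest values it needs, while the lake keeps the complete, unaltered event history for later analysis.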
As we know, a data warehouse is geared towards business users with its user-friendly
structuring of the data. Conversely, a data lake is more appealing to a data scientist
because:
• A data lake offers access to raw, uncleansed, untransformed data. Although a BI
professional may have spent time transforming data for usability, a data scientist
frequently wants data without any context applied whatsoever. The structure can be
applied specifically for the individual experiment or analysis.
• Since a data lake can capture any type of data, that makes data access easy for
multi-structured data sources which have become commonplace.
• Modern analytical tools and languages understand how to connect to many types of
data formats.
The objective for data engineers is to facilitate easy access to the data, so that data
scientists and analysts spend the majority of their time running experiments and
analyzing data, as opposed to acquiring and importing data.
TIPS FOR DESIGNING A DATA LAKE
The importance of purposefully organizing the data lake cannot be overstated, and
the concept of zones is central to that organization. Zones can be physical or
purely conceptual. The following high-level zones are commonly used:
Raw Data Zone: As the name implies, the raw data zone is storage for newly acquired data in its native format. This is an exact copy from the source, often in a normalized format, and is immutable. History is retained indefinitely in the raw data zone, not only to satisfy future business needs but also to regenerate downstream data whenever needed. User access to raw data is highly restricted.

Transient Zone: A temporary transient zone can be included and selectively utilized when data quality validations are required before the data may land in the raw data zone. It is also helpful when you temporarily need to segregate a "new data zone" from the "raw data zone". Alternatively, some people think of this as the speed layer, separate from the batch layer, in a Lambda architecture.

Master Data Zone: The master data zone contains master data and reference data which augment and aid analytical activities.

User Drop Zone: The user drop zone is an area where users can place data which is manually maintained.

Archive Data Zone: The archive data zone contains aged data which has been offloaded from the data warehouse or other systems. As an active archive, the data is available for querying when needed.

Analytics Sandbox: The analytics sandbox is the workspace for data science and exploratory activities. Valuable efforts should be operationalized into the curated data zone, to ensure the analytics sandbox is not delivering production solutions.

Curated Data Zone: The curated data zone is the serving layer for data which has been cleansed, transformed as necessary, and structured for optimal delivery. Data structures here can be large, wide, flat files, or the structure could mimic a star schema/denormalized format. Nearly all self-service data access should come from the curated data zone. Standard governance, security, and change management principles all apply.
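When zones are physical rather than conceptual, they often map to a folder hierarchy in the lake's storage. The sketch below shows one possible raw-zone path convention in Python; the folder names and partitioning scheme are assumptions, not a standard:

```python
from pathlib import PurePosixPath

def raw_path(source_system, dataset, load_date):
    """Build a raw-zone path partitioned by source, dataset, and load date."""
    return PurePosixPath("raw") / source_system / dataset / f"load_date={load_date}"

p = raw_path("erp", "orders", "2021-01-15")
```

Partitioning by load date keeps writes append-only, which fits the raw zone's immutability and makes point-in-time reprocessing straightforward.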
When designing the raw data zone, focus on optimal write performance.
One tip is to include data lineage and relevant metadata within the actual data
itself whenever possible. For instance: columns indicating the source system
where the data originated.
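A minimal Python sketch of this tip, assuming hypothetical lineage column names such as source_system and load_ts:

```python
from datetime import datetime, timezone

def add_lineage(record, source_system, source_file):
    """Stamp a raw record with lineage metadata alongside the data itself."""
    return {**record,
            "source_system": source_system,  # system where the data originated
            "source_file": source_file,      # extract that produced the record
            "load_ts": datetime.now(timezone.utc).isoformat()}

row = add_lineage({"order_id": 1001, "amount": 42.5}, "erp", "orders_2021.csv")
```

Embedding lineage in the data means any downstream copy can be traced back to its origin even if it is moved between zones or systems.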
Azure Technologies for Implementing a Data Lake
Data lakes are very well-suited to cloud services, such as the Microsoft Azure
cloud platform. In this section we will focus primarily on the data storage and data
processing layers of the architecture.
Since data in storage can be reused across various compute
services, choosing a compute service for processing is not an
either/or proposition. For instance, you may have one cluster
running certain hours of the day to handle data processing
operations, and another cluster running 24/7 to handle user
queries. Typically, the cost for compute services significantly
exceeds the cost for data storage.
The data in the data lake can also be used in conjunction with
your data warehouse. This could be via a full-fledged third-party
data virtualization provider, or via one of the numerous ways to
execute remote federated queries between the data lake and
relational databases.
Considerations for a Successful Data Lake in the Cloud
Cloud-based services, such as Microsoft Azure, have become the
most common choice for new data lake deployments. Cloud service
providers allow organizations to avoid the cost and hassle of managing
an on-premises data center by moving storage, compute, and
networking to hosted solutions.
The following are some specific considerations when
planning a data lake deployment on a cloud service:
Type of storage
A data lake is a conceptual data architecture, and not a specific technology. The technical
implementation can vary, which means different types of storage and storage features can be
utilized.
The following options for a data lake are less commonly used due to greatly reduced flexibility:
• Relational databases (ex: SQL Server, Azure SQL Database, Azure SQL Data Warehouse)
• NoSQL databases (ex: Azure CosmosDB)
Security Capabilities
Different technology platforms implement security differently. A service such as Azure Data Lake
Store implements hierarchical security based on access control lists, whereas Azure Blob Storage
implements key-based security.
Elasticity
One of the most powerful features of cloud-based deployments is elasticity, which refers
to scaling resources up or down depending on demand. This equates to cost savings
when processing power isn’t needed. The services in Azure Data Lake are decoupled with
respect to compute and storage, which provides independent scalability and flexibility.
Disaster recovery
The most critical data from a disaster recovery standpoint is your raw data (because, in
theory, your curated data views should be reproducible at any time from the raw data).
Data center outages are certainly not a common occurrence, but you do need to
pre-plan and test how to handle a situation when acquiring raw data is business critical. The
ability to recover your data after a damaging weather event, system error, or human
error is crucial.
Getting Started with a Data Lake
Confirm a data lake really is the best choice
Make sure that your data and use cases truly are well-suited to a data lake. Ignore the
rampant marketing hype. Continue using your relational database technologies for what
they are good at, while introducing a data lake to your multi-platform architecture when it
becomes appropriate. The existence of many types of data sources, in varying formats, is
usually a great indicator of a data lake use case. Conversely, if all of your data sources are
relational, extracting that data into a data lake may not be the best option unless you are
looking to house point-in-time history.
We hope you have found this eBook useful. If you are
looking to bring in new approaches, combined with
proven techniques, to support decision making at all
levels of your organization, let BlueGranite help.