Using Hadoop For Data Warehouse Optimization
By David Loshin
JUNE 2017
NUMBER ONE
Leverage scalability and elasticity to reduce costs

NUMBER TWO
Augment EDW storage with Hadoop and Hive

NUMBER THREE
Use flexible data organization to enable schema-on-read

NUMBER FOUR
Use open source tools to handle unstructured data

NUMBER FIVE
Provide data discovery tools for data consumers

NUMBER SIX
Offload ETL processing to Hadoop

NUMBER SEVEN
Ensure data governance for data versioning, lineage, and provenance

AFTERWORD
Evolving the hybrid EDW as a modernization strategy
FOREWORD

The conventional enterprise data warehouse (EDW) architecture common in most organizations represents over two decades of engineering and refinement, yet its design has become relatively static and has not kept pace with the exploding desire for business intelligence (BI) and analytics. Weaknesses associated with the conventional architecture limit the scale and scope of a BI/analytics program, and the design may actually impede the optimal use of information. For example, consider that:

• There are high up-front capital expenses prior to deploying a data warehouse environment.
• The EDW team must anticipate scalability needs prior to instantiating the system and must be prepared to add to the original capital investment.
• The EDW has relatively high continuing operations and maintenance costs.
• Its architecture is engineered for summarization and may not include all of the data. This limits what can be analyzed.
• Decisions about data organization made early in the design process force dependence on a highly structured data model that constrains the ability of data consumers to analyze the data.
• The established structure complicates agile development, so adding new features is a slow process. The design, development, and implementation of extraction, transformation, and loading (ETL) are also slow, making it difficult to quickly add new data sources.
• Analysis is often limited to structured data because typical EDW architectures are not designed to easily support unstructured data.
• Proactive data discovery is difficult, if allowed at all.
• The EDW is often missing lineage and provenance for the data, and there is little information about how the data was aggregated, allowing for confusion in interpretation.

These inhibitors are increasingly being addressed through the integration of new technologies such as the many components of the open source Hadoop stack. In this TDWI Checklist Report, we recommend ways that augmenting the conventional EDW design with Hadoop and Hive can help optimize your EDW by expanding usability, improving performance, improving results, and reducing overall costs.

NUMBER ONE
Leverage scalability and elasticity to reduce costs

Implementing an EDW used to imply a significant capital investment: purchasing a system platform or specialty appliance sized to meet the immediate and the anticipated future needs for business analytics. Despite the initial investment, the development and deployment life cycle often delayed the data warehouse's time to value, and the investment wasn't fully monetized until long after the EDW was put into production.

As demand grows for access to (and reports from) EDW data, the initial system configuration may no longer meet performance requirements, necessitating an increased investment in storage and computing resources. This may be particularly acute if your organization uses costly specialty data warehouse appliances; increased demand means rising costs. An alternative to this pattern leverages a Hadoop platform and technologies such as Hive that exploit pools of commodity computing and storage resources, thereby allowing system performance to scale proportionally to demand while reducing overall costs.

Begin your Hive configuration with a small number of storage and compute resources. Increased user demand does not necessitate a major system refresh; instead, user performance needs can be met by growing the system incrementally—adding resources as needed. Hadoop's system elasticity allows reporting and analytics applications to consume the resources they need when required and release those resources once the application has completed so they can be used by other applications.

Commodity hardware lowers overall costs in three ways. First, you only buy the components you need instead of making a significant bulk investment. Second, maintaining commodity components is less expensive than maintaining specialty data warehouse appliances. Finally, decoupling the software ecosystem from the underlying hardware allows you to transparently replace slower, aging components with newer technology that seamlessly provides increased performance with few changes to your application code.

Technology innovations also contribute to optimized performance. The Apache Stinger.next initiative is intended to provide full ACID transaction support and SQL:2011 analytics capabilities, and architectural advancements such as low-latency analytical processing (LLAP) provide optimized in-memory caching and persistent query executors that scale elastically and provide subsecond response times, even for more complex SQL queries.
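As a rough illustration of what opting into LLAP can look like from the client side, the sketch below opens a Hive session with LLAP execution preferred and runs an analytical query. It assumes the PyHive client, a HiveServer2 endpoint at hive.example.com, an LLAP-enabled Hive 2.x deployment with daemons already provisioned, and an illustrative fact_sales table; the configuration property names and values should be confirmed against your distribution's documentation.

```python
# Minimal sketch: run an analytics query against Hive with LLAP execution preferred.
# Assumes the pyhive package, HiveServer2 at hive.example.com:10000, and an
# LLAP-enabled deployment; table and column names are illustrative placeholders.
from pyhive import hive

conn = hive.Connection(
    host="hive.example.com",
    port=10000,
    username="analyst",
    database="edw",
    # Session-level settings; exact names/values vary by Hive version and distribution.
    configuration={
        "hive.execution.engine": "tez",     # LLAP runs on top of the Tez engine
        "hive.llap.execution.mode": "all",  # prefer LLAP daemons for query fragments
    },
)

cursor = conn.cursor()
cursor.execute(
    """
    SELECT region, SUM(sale_amount) AS total_sales
    FROM fact_sales
    WHERE sale_date >= '2017-01-01'
    GROUP BY region
    """
)
for region, total in cursor.fetchall():
    print(region, total)
```

Because the persistent LLAP executors and their cache are shared across sessions, repeated queries of this kind benefit from warm data without any change to the client code.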
NUMBER TWO
Augment EDW storage with Hadoop and Hive

This past decade has been an extraordinary time for data management and analytics as the rate of big data and analytics innovations continues to accelerate. The growing inventory of tools collectively bundled within the Hadoop ecosystem has significantly lowered the barriers to launching or expanding a BI/analytics program.

Organizations may need to leverage existing systems while considering the adoption of new systems architectures and may not yet be ready to completely rip and replace their existing EDWs. Nonetheless, the future points to a modernized hybrid environment for BI/analytics evolved through the introduction of the right technology at the right time.

An example is introducing Hive as a way of enhancing the EDW. Hive running on top of Hadoop—with its SQL-like interface layered on top of data stored in HDFS (Hadoop Distributed File System) as well as other file systems (such as Amazon's S3), structures, and formats (such as plain text or Parquet)—provides a database platform that can accommodate most, if not all, of the workloads satisfied by the conventional EDW.

Migrating to Hive provides a starting point for augmenting the existing EDW's storage footprint while leveraging the scalability of commodity hardware resources that can be easily increased with growing demand. The continually maturing Hive system provides faster execution, increased security and data protection, and greater flexibility in supporting different downstream analytics requirements:

• In-memory caching and persistent query executors within the Hive LLAP architecture speed execution and allow elastic scaling to meet performance needs.
• Integrating the Druid column-oriented distributed data store with Hive optimizes low-latency analytics by combining a column store with inverted indexing, which can be used to effectively create extremely fast OLAP (online analytical processing) cubes.
• Hive can be tightly integrated with Apache Ranger to provide a comprehensive approach to security and data access authorization within a Hadoop cluster, as well as to set policies to encrypt or mask data at rest.
• Improved Hive SQL support for SQL ACID Merge reduces the need for massive refreshes and allows users to read the data as it is being updated.
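To make the storage augmentation concrete, here is a minimal sketch that registers an external Hive table over Parquet files already sitting in HDFS (an S3 location such as s3a://bucket/path works the same way), so historical detail can live on commodity storage while remaining queryable through SQL. The database, table, columns, and path are illustrative placeholders, not taken from the report.

```python
# Minimal sketch: expose Parquet files on HDFS (or S3 via s3a://) as a Hive table.
# Table/column names and the storage location are illustrative placeholders.
from pyhive import hive

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS edw.sales_history (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    sale_amount  DECIMAL(12,2)
)
PARTITIONED BY (sale_year INT)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/offload/sales_history'
"""

conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute(DDL)

# Register any partitions that were written directly to the storage location.
cursor.execute("MSCK REPAIR TABLE edw.sales_history")
```

Because the table is declared EXTERNAL, dropping it removes only the metadata and leaves the files in place, which suits an approach that augments rather than replaces the existing warehouse.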
NUMBER THREE
Use flexible data organization to enable schema-on-read

You might say that one of the most innovative concepts associated with data warehousing is dimensional modeling. Recognizing that the data models used by transaction processing and operational systems were not suited to the types of reporting and analyses desired by business users, pioneering data warehouse architects transformed the structured data from the source systems into a different model organized around "facts" representing events or entities and the dimensions that characterize "items of interest" associated with the facts. This structured model is more amenable to the typical analyst's slicing and dicing, and the dimensional modeling approach has become standard for most EDWs.

However, although this data model may be suitable for standard operational reporting, it forces the data sources into a predetermined organization. This is an example of schema-on-write, in which data sets are loaded into a predefined schema or structure. Despite its suitability for typical BI activities, schema-on-write is inherently biased, as it reflects the opinions of the data modelers and decisions based on their interpretation of user requirements. This approach may constrain data analysts and data scientists from applying more creative methods and ultimately impedes optimal data analysis. The alternative is schema-on-read, in which data sets are stored in their original formats and are reconfigured into one of a variety of target schemas when the data sets are read for analysis.

When using schema-on-read, the data is stored using the format in which it was acquired, and no transformations remove any of the original values. Downstream users are not constrained by using data forced into a single predefined model.

Adopting a schema-on-read approach enables greater freedom in your EDW environment. Schema-on-read allows multiple data users to layer their own logical structures on top of the source data, thereby optimizing their ability to explore creative algorithmic methods that take advantage of text analytics and machine learning in addition to traditional BI and reporting.
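A small, tool-neutral illustration of the schema-on-read idea: raw clickstream events are kept exactly as they arrived (JSON lines), and each consumer projects only the fields it cares about at read time. The file name and field names are hypothetical, and the sketch uses only the Python standard library.

```python
# Minimal sketch of schema-on-read: keep raw JSON events untouched and let each
# consumer apply its own "schema" (field projection and typing) when reading.
# The file name and field names are hypothetical.
import json
from datetime import datetime

def read_events(path):
    """Yield raw events exactly as stored; no values are dropped or rewritten."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def marketing_view(event):
    """Marketing cares about campaign attribution."""
    return {
        "user_id": event.get("user_id"),
        "campaign": event.get("utm_campaign"),
        "ts": datetime.fromisoformat(event["timestamp"]),
    }

def operations_view(event):
    """Operations cares about latency and errors from the same raw records."""
    return {
        "endpoint": event.get("url"),
        "latency_ms": float(event.get("latency_ms", 0)),
        "is_error": int(event.get("status", 200)) >= 500,
    }

for raw in read_events("clickstream.jsonl"):
    print(marketing_view(raw))
    print(operations_view(raw))
```

The same raw file serves both consumers; neither view required a load-time transformation or discarded any original values.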
NUMBER FOUR
Use open source tools to handle unstructured data

Data warehouses are engineered around structured data—most are populated with data extracted from existing transaction processing or operational processing systems. If your organization's preference is to focus largely (if not exclusively) on structured reports reflecting historical operations, this architecture will probably be satisfactory. However, if you are interested in introducing more complex reporting and advanced analytics, including predictive and prescriptive modeling, you may find that the conventional data warehouse's limitation to structured data sources will impede those objectives.

An increasing number of Internet-connected devices, social media channels, volumes of digitized data artifacts, and storage devices loaded with legacy documents are all sources of semistructured and unstructured data fit for advanced analytics. These sources include some machine-generated sources (such as automatically generated weather and traffic updates), but the lion's share of unstructured data emanates from human-generated sources such as documents, email, text messages, audio files, graphics, images, blog posts, and website comments.

The first step in integrating unstructured data is to ingest, parse, tag, and organize the information embedded in text, and a number of open source machine learning and text analysis tools in the Hadoop ecosystem can help. For example, Lucene is an Apache open source library that provides indexing and search for unstructured text as well as spellchecking, hit highlighting, and advanced analysis and tokenization capabilities. Apache Solr is an open source product layered on top of Lucene, supporting high-volume traffic of unstructured documents and providing real-time indexing and full-text search. Mahout is a component in the Hadoop open source stack providing a number of machine learning techniques, including text mining.

These tools can be used to identify key concepts within specific business contexts and optimize the integration of unstructured data into your enterprise reporting and analytics environment. Meta-tagging unstructured text and organizing unstructured data into usable structures improves your BI/analytics capability by expanding the breadth of information you can analyze.
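As a rough sketch of the ingest-and-search step, the snippet below pushes a few text documents into a Solr collection and runs a full-text query using Solr's standard JSON update and select endpoints. It assumes a Solr instance at localhost:8983 with an already-created collection named support_emails; the collection name, field names, and documents are illustrative.

```python
# Minimal sketch: index unstructured text in Apache Solr and run a full-text query.
# Assumes Solr at localhost:8983 with an existing "support_emails" collection;
# collection name, field names, and documents are illustrative placeholders.
import requests

SOLR = "http://localhost:8983/solr/support_emails"

docs = [
    {"id": "msg-001", "subject": "Billing question",
     "body": "Customer asked about a duplicate invoice charge."},
    {"id": "msg-002", "subject": "Login issue",
     "body": "Password reset email never arrived for the web portal."},
]

# Index the documents and commit so they become searchable immediately.
requests.post(f"{SOLR}/update?commit=true", json=docs, timeout=30).raise_for_status()

# Full-text search over the body field, requesting highlighted snippets.
params = {"q": "body:invoice", "hl": "true", "hl.fl": "body", "wt": "json"}
resp = requests.get(f"{SOLR}/select", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("subject"))
```

In practice, the extracted concepts and tags would be written back as structured fields so they can be joined with warehouse data for reporting.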
NUMBER FIVE
Provide data discovery tools for data consumers

The lowered barrier to entry for scalable high-performance systems suggests that any BI/analytics program would benefit from adding more data to the mix. Yet adding new data into the warehouse requires a lot of work—profile the data source, assess data quality, identify the transformations, and implement the data integration application.

This bottleneck is compounded when different sets of downstream data consumers have differing needs, requiring the data integration development to be done multiple times for the various users. More critically, conflating all of the downstream requirements into one set of transformations leads to conflicts and confusion.

There are three main challenges to be addressed:

• Ensure that the quality of the ingested data is sufficient for downstream consumption.
• Maintain flexibility in interpreting the data for each consumer's particular purpose.
• Avoid imposing standardizations based on overly strict data usability expectations.

All three of these issues can be accommodated with data discovery tools. Don't subject all data sources to a single predefined set of business rules. Instead, leverage the freedom of schema-on-read and let the data consumers use a data discovery tool to profile the data, find patterns that are relevant to their own business needs, and evaluate how the source data can be manipulated and transformed for their specific application purposes.

This optimizes the EDW in a number of ways:

• It gives users flexibility in defining their own data transformation rules, which reduces the complexity of attempting to force data integration to meet multiple downstream users' needs.
• Data discovery allows users to combine structured data with unstructured data, which broadens the scope for analytics applications.
• Data discovery tools free the downstream users from the constraints of IT decisions about how data sources are loaded into the warehouse, providing greater freedom for analysis.
• Enabling data scientists to prepare their own views of the data allows more innovative use of machine learning methods and more sophisticated unsupervised deep learning algorithms.
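To ground the profiling step described above, here is a minimal, hand-rolled profile of a raw CSV source: per-column null rates, distinct counts, and sample values, which is the kind of summary a data discovery tool presents before any transformation rules are chosen. It uses only the Python standard library; the file name and columns are hypothetical.

```python
# Minimal sketch of source-data profiling: null rates, distinct counts, and sample
# values per column, computed from a raw CSV file before any integration work.
# File name and columns are hypothetical.
import csv
from collections import Counter, defaultdict

def profile_csv(path, max_samples=5):
    rows = 0
    nulls = Counter()
    distinct = defaultdict(set)

    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            rows += 1
            for column, value in record.items():
                if value is None or value.strip() == "":
                    nulls[column] += 1
                else:
                    distinct[column].add(value)

    for column in sorted(distinct.keys() | nulls.keys()):
        print(
            f"{column}: "
            f"null_rate={nulls[column] / rows:.1%}, "
            f"distinct={len(distinct[column])}, "
            f"samples={sorted(distinct[column])[:max_samples]}"
        )

profile_csv("supplier_feed.csv")
```

Each consumer can run this kind of profile against the same untouched source and then decide on transformation rules that suit their own application.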
NUMBER SIX
Offload ETL processing to Hadoop

It is commonly said that 60 percent to 70 percent of the effort of data warehouse development can be attributed to ETL. Even if these are just estimates, there is no doubt that ETL development commands the lion's share of both the development and execution time necessary to populate a data warehouse. The conventional data architecture provides a staging area for data ingestion and ETL processing, after which the data is loaded into the target system.

Offload your ETL processing to a Hadoop platform. This is a straightforward step toward EDW modernization, and your organization will benefit from three key optimizations:

• Faster execution: You can take advantage of the parallelism provided by a Hadoop cluster to significantly speed up ETL because it can be linearly scaled based on the number of computing resources used.
• Faster development time: Faster execution can radically reduce the develop/test/debug cycle, thereby shortening time to value.
• Reduced resources: You can reduce the amount of resources solely dedicated to ETL because you can spin up and launch your ETL processing on an elastic Hadoop cluster and then release those computing resources when the ETL is finished.

Recent improvements to Apache Hive support the SQL ACID Merge, which handles inserts, updates, and deletes in a single pass instead of needing more complex update pipelines supported by complicated logic for managing rollback. This effectively allows in-database transformations to be performed without the need for massive refreshes. By allowing updates to be performed consistently and isolating readers from in-progress updates, the Hive-based EDW remains available while updates are being applied.
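To illustrate the single-pass pattern that ACID MERGE enables, the sketch below applies a batch of staged changes (inserts, updates, and deletes) to a transactional Hive target table with one statement. It assumes Hive 2.2 or later with ACID transactions enabled, a target table created as transactional, and a staging table of change records; all database, table, and column names are illustrative.

```python
# Minimal sketch: apply staged inserts/updates/deletes to a transactional Hive
# table in a single pass using ACID MERGE (Hive 2.2+ with transactions enabled).
# Table and column names are illustrative placeholders.
from pyhive import hive

MERGE_SQL = """
MERGE INTO edw.dim_customer AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email, updated_at = s.extracted_at
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email, s.extracted_at)
"""

conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute(MERGE_SQL)
```

Readers querying edw.dim_customer while the merge runs continue to see a consistent snapshot, which is the behavior the section describes: no massive refresh, and the table stays available during updates.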
NUMBER SEVEN
Ensure data governance for data versioning, lineage, and provenance

Expanding the accessibility and usability of enterprise data definitely creates new freedoms for data consumers and data scientists. However, delegating more power to the user community poses risks to consistency in use, interpretation, and, importantly, protection. Sometimes data discovery tools provide too much flexibility, allowing different users to apply rules and transformations that effectively result in contradictory semantics. Moderate user flexibility with operational data governance to ensure consistent policies for data oversight, including:

• Metadata management: Don't just keep track of data asset and data element metadata; ensure that the captured metadata and corresponding data controls can be shared among applications across the BI/analytics framework both in and outside your Hadoop ecosystem.
• Data lineage and provenance: Address the typical EDW challenge of tracking data from its acquisition to its integration. This becomes more complex as static data integration yields to integration combining both static and continuously streaming data sources. Further, as a growing number of downstream consumers use data discovery tools to assess the data and define their own data standardization rules and data integration paths, it becomes challenging to track the sequences of actions applied to reach the data asset used in analysis.
• Data archiving and versioning policies: With scalable storage provided by Hadoop, the organization can maintain data longer without having to shunt it off to a secondary (or tertiary) storage device such as DVD or tape. Also, a byproduct of incorporating a broader set of older data is the need to differentiate earlier versions from more current versions. You will need to continually define and ensure observance of these policies.
• Data security and protection: Migrating your EDW to a big data platform does not eliminate the need to ensure proper access to sensitive data.

Apache Atlas is an open source system that supports big data governance. The Atlas system allows for data element definition, hierarchical taxonomies, capture of business-oriented annotations, and documentation of relationships—linking data sets to their underlying data elements. In addition, Atlas supports metadata exchange, allowing you to export metadata to downstream and third-party systems.
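As a rough illustration of how governance metadata might be pulled from Atlas, the sketch below uses the Atlas v2 REST API to run a basic search for Hive table entities and then fetch the lineage graph for the first match. It assumes an Atlas server at localhost:21000 with placeholder basic-auth credentials and hive_table entities already registered by the Hive hook; endpoint paths and response fields should be checked against your Atlas version.

```python
# Minimal sketch: query Apache Atlas (v2 REST API) for Hive table entities and
# fetch the lineage graph of the first match. Host, credentials, and the assumed
# presence of hive_table entities are illustrative.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # placeholder credentials

# Basic search for registered Hive tables.
search = requests.get(
    f"{ATLAS}/search/basic",
    params={"typeName": "hive_table", "limit": 5},
    auth=AUTH,
    timeout=30,
)
search.raise_for_status()
entities = search.json().get("entities", [])

for entity in entities:
    print(entity["guid"], entity["attributes"].get("qualifiedName"))

# Lineage (upstream/downstream processes) for the first table found.
if entities:
    guid = entities[0]["guid"]
    lineage = requests.get(f"{ATLAS}/lineage/{guid}", auth=AUTH, timeout=30)
    lineage.raise_for_status()
    print(lineage.json().get("relations", []))
```

Exposing this kind of lineage and metadata alongside data discovery tooling gives downstream consumers a consistent picture of where a data asset came from and what transformations produced it.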