
CHECKLIST REPORT
JUNE 2017

Using Hadoop for Data Warehouse Optimization

By David Loshin

Sponsored by Hortonworks

TABLE OF CONTENTS

FOREWORD: Legacy architectures impede optimal performance
NUMBER ONE: Leverage scalability and elasticity to reduce costs
NUMBER TWO: Augment EDW storage with Hadoop and Hive
NUMBER THREE: Use flexible data organization to enable schema-on-read
NUMBER FOUR: Use open source tools to handle unstructured data
NUMBER FIVE: Provide data discovery tools for data consumers
NUMBER SIX: Offload ETL processing to Hadoop
NUMBER SEVEN: Ensure data governance for data versioning, lineage, and provenance
AFTERWORD: Evolving the hybrid EDW as a modernization strategy
ABOUT OUR SPONSOR
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS

555 S. Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E [email protected]
tdwi.org

© 2017 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

FOREWORD
LEGACY ARCHITECTURES IMPEDE OPTIMAL PERFORMANCE

The conventional enterprise data warehouse (EDW) architecture common in most organizations represents over two decades of engineering and refinement, yet its design has become relatively static and has not kept pace with the exploding desire for business intelligence (BI) and analytics. Weaknesses associated with the conventional architecture limit the scale and scope of a BI/analytics program, and the design may actually impede the optimal use of information. For example, consider that:

• There are high up-front capital expenses prior to deploying a data warehouse environment.
• The EDW team must anticipate scalability needs prior to instantiating the system and must be prepared to add to the original capital investment.
• The EDW has relatively high continuing operations and maintenance costs.
• Its architecture is engineered for summarization and may not include all of the data. This limits what can be analyzed.
• Decisions about data organization made early in the design process force dependence on a highly structured data model that constrains the ability of data consumers to analyze the data.
• The established structure complicates agile development, so adding new features is a slow process. The design, development, and implementation of extraction, transformation, and loading (ETL) are also slow, making it difficult to quickly add new data sources.
• Analysis is often limited to structured data because typical EDW architectures are not designed to easily support unstructured data.
• Proactive data discovery is difficult, if allowed at all.
• The EDW is often missing lineage and provenance for the data, and there is little information about how the data was aggregated, allowing for confusion in interpretation.

These performance inhibitors are being increasingly addressed through the integration of new technologies such as the many components of the open source Hadoop stack. In this TDWI Checklist Report, we recommend ways that augmenting the conventional EDW design with Hadoop and Hive can help optimize your EDW by expanding usability, improving performance, improving results, and reducing overall costs.

NUMBER ONE
LEVERAGE SCALABILITY AND ELASTICITY TO REDUCE COSTS

Implementing an EDW used to imply a significant capital investment: purchasing a system platform or specialty appliance sized to meet the immediate and the anticipated future needs for business analytics. Despite the initial investment, the development and deployment life cycle often delayed the data warehouse's time to value, and the investment wasn't fully monetized until long after the EDW was put into production.

As demand grows for access to (and reports from) EDW data, the initial system configuration may no longer meet performance requirements, necessitating an increased investment in storage and computing resources. This may be particularly acute if your organization uses costly specialty data warehouse appliances; increased demand means rising costs. An alternative to this pattern leverages a Hadoop platform and technologies such as Hive that exploit pools of commodity computing and storage resources, thereby allowing system performance to scale proportionally to demand while reducing overall costs.

Begin your Hive configuration with a small number of storage and compute resources. Increased user demand does not necessitate a major system refresh; instead, user performance needs can be met by growing the system incrementally—adding resources as needed. Hadoop's system elasticity allows reporting and analytics applications to consume the resources they need when required and release those resources once the application has completed so they can be used by other applications.

Commodity hardware lowers overall costs in three ways. First, you only buy the components you need instead of making a significant bulk investment. Second, maintaining commodity components is less expensive than maintaining specialty data warehouse appliances. Finally, decoupling the software ecosystem from the underlying hardware allows you to transparently replace slower, aging components with newer technology that seamlessly provides increased performance with few changes to your application code.

Technology innovations also contribute to optimized performance. The Apache Stinger.next initiative is intended to provide fully ACID transactions and access to SQL:2011 analytics support, and architectural advancements such as low-latency analytical processing (LLAP) provide optimized in-memory caching and persistent query executors that scale elastically and provide subsecond response time, even for more complex SQL queries.
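To make the LLAP option concrete, the following is a minimal sketch of pointing a query session at LLAP on a Hive 2.x cluster where LLAP daemons are already running. The table and column names (sales_fact, store_id, sale_amount, sale_date) are hypothetical, and the settings shown are illustrative rather than tuning advice.

-- Illustrative session settings; assumes Hive on Tez with LLAP daemons already started
SET hive.execution.engine=tez;
SET hive.llap.execution.mode=all;   -- route eligible query fragments to the LLAP daemons

-- A typical low-latency aggregation that benefits from LLAP's in-memory caching
SELECT store_id,
       SUM(sale_amount) AS total_sales
FROM   sales_fact
WHERE  sale_date >= '2017-01-01'
GROUP  BY store_id;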


NUMBER TWO
AUGMENT EDW STORAGE WITH HADOOP AND HIVE

This past decade has been an extraordinary time for data management and analytics as the rate of big data and analytics innovations continues to accelerate. The growing inventory of tools collectively bundled within the Hadoop ecosystem has significantly lowered the barriers to launching or expanding a BI/analytics program.

Organizations may need to leverage existing systems while considering the adoption of new systems architectures and may not yet be ready to completely rip and replace their existing EDWs. Nonetheless, the future points to a modernized hybrid environment for BI/analytics evolved through the introduction of the right technology at the right time.

An example is introducing Hive as a way of enhancing the EDW. Hive running on top of Hadoop—with its SQL-like interface layered on top of data stored in HDFS (Hadoop Distributed File System) as well as other file systems (such as Amazon's S3), structures, and formats (such as plain text or Parquet)—provides a database platform that can accommodate most, if not all, of the workloads satisfied by the conventional EDW.

Migrating to Hive provides a starting point for augmenting the existing EDW's storage footprint while leveraging the scalability of commodity hardware resources that can be easily increased with growing demand. The continually maturing Hive system provides faster execution, increased security and data protection, and greater flexibility in supporting different downstream analytics requirements:

• In-memory caching and persistent query executors within the Hive LLAP architecture speed execution and allow elastic scaling to meet performance needs.
• Integrating the Druid column-oriented distributed data store with Hive optimizes low-latency analytics by combining a column store with inverted indexing, which can be used to effectively create extremely fast OLAP (online analytical processing) cubes.
• Hive can be tightly integrated with Apache Ranger to provide a comprehensive approach to security and data access authorization within a Hadoop cluster, as well as to set policies to encrypt or mask data at rest.
• Improved Hive SQL support for SQL ACID Merge reduces the need for massive refreshes and allows users to read the data as it is being updated.
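As a concrete illustration of the storage augmentation described above, here is a minimal sketch of exposing offloaded warehouse history through a Hive external table. The bucket path, table name, and columns are hypothetical, and the example assumes a Hive deployment with the s3a connector configured.

-- Hypothetical example: historical sales data already landed as Parquet files in S3
CREATE EXTERNAL TABLE IF NOT EXISTS sales_history (
  order_id     BIGINT,
  customer_id  BIGINT,
  order_date   DATE,
  order_total  DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3a://example-bucket/warehouse/sales_history/';

-- The offloaded history is immediately queryable with familiar SQL
SELECT year(order_date) AS order_year,
       SUM(order_total) AS annual_revenue
FROM   sales_history
GROUP  BY year(order_date);

Because the table is external, dropping it removes only the Hive metadata; the underlying files remain in place for other consumers.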
NUMBER THREE
USE FLEXIBLE DATA ORGANIZATION TO ENABLE SCHEMA-ON-READ

You might say that one of the most innovative concepts associated with data warehousing is dimensional modeling. Recognizing that the data models used by transaction processing and operational systems were not suited to the types of reporting and analyses desired by business users, pioneering data warehouse architects transformed the structured data from the source systems into a different model organized around "facts" representing events or entities and the dimensions that characterize "items of interest" associated with the facts. This structured model is more amenable to the typical analyst's slicing and dicing, and the dimensional modeling approach has become standard for most EDWs.

However, although this data model may be suitable for standard operational reporting, it forces the data sources into a predetermined organization. This is an example of schema-on-write, in which data sets are loaded into a predefined schema or structure. Despite its suitability for typical BI activities, schema-on-write is inherently biased, as it reflects the opinions of the data modelers and decisions based on their interpretation of user requirements. This approach may constrain data analysts and data scientists from applying more creative methods and ultimately impedes optimal data analysis. The alternative is schema-on-read, in which data sets are stored in their original formats and are reconfigured into one of a variety of target schemas when the data sets are read for analysis.

When using schema-on-read, the data is stored using the format in which it was acquired, and no transformations remove any of the original values. Downstream users are not constrained by data forced into a single predefined model.

Adopting a schema-on-read approach enables greater freedom in your EDW environment. Schema-on-read allows multiple data users to layer their own logical structures on top of the source data, thereby optimizing their ability to explore creative algorithmic methods that take advantage of text analytics and machine learning in addition to traditional BI and reporting.
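The following is a minimal sketch of how schema-on-read might look in Hive: the raw files stay untouched, and each consumer layers its own structure at query time. The path, field names, delimiter, and timestamp format are hypothetical.

-- Raw clickstream files remain in their original tab-delimited form in HDFS
CREATE EXTERNAL TABLE clicks_raw (
  event_time  STRING,
  user_id     STRING,
  url         STRING,
  referrer    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/clickstream/';

-- One consumer's reading of the same files: daily page views by user
-- (assumes event_time strings like '2017-06-01 12:34:56')
CREATE VIEW clicks_by_day AS
SELECT to_date(event_time) AS click_date,
       user_id,
       url
FROM   clicks_raw;

Another consumer can define a different view or external table over the same files without reloading or transforming anything on disk.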


NUMBER FOUR
USE OPEN SOURCE TOOLS TO HANDLE UNSTRUCTURED DATA

Data warehouses are engineered around structured data—most are populated with data extracted from existing transaction processing or operational processing systems. If your organization's preference is to focus largely (if not exclusively) on structured reports reflecting historical operations, this architecture will probably be satisfactory. However, if you are interested in introducing more complex reporting and advanced analytics, including predictive and prescriptive modeling, you may find that the conventional data warehouse's limitation to only structured data sources will impede those objectives.

An increasing number of Internet-connected devices, social media channels, volumes of digitized data artifacts, and storage devices loaded with legacy documents are all sources of semistructured and unstructured data fit for advanced analytics. These sources include some machine-generated sources (such as automatically generated weather and traffic updates), but the lion's share of unstructured data emanates from human-generated sources such as documents, email, text messages, audio files, graphics, images, blog posts, and website comments.

The first step in integrating unstructured data is to ingest, parse, tag, and organize information embedded in text, and there are a number of open source machine learning and text analysis tools that are part of the Hadoop ecosystem. For example, Lucene is an Apache open source library that provides indexing and search for unstructured text as well as spellchecking, hit highlighting, and advanced analysis and tokenization capabilities. Apache Solr is an open source product layered on top of Lucene, supporting high-volume traffic of unstructured documents and providing real-time indexing and full-text search. Mahout is a component in the Hadoop open source stack providing a number of machine learning techniques, including text mining.

These tools can be used to identify key concepts within specific business contexts and optimize the integration of unstructured data into your enterprise reporting and analytics environment. Meta-tagging unstructured text and organizing unstructured data into usable structures improves your BI/analytics capability by expanding the breadth of information you can analyze.
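As a small illustration of turning parsed text into a usable structure, the sketch below assumes document bodies have already been extracted to delimited text files; the table name and path are hypothetical, and the query uses only built-in Hive functions as a crude stand-in for the richer tokenization and tagging that Lucene, Solr, or Mahout would provide.

-- Hypothetical landing table for extracted document text
CREATE EXTERNAL TABLE support_emails (
  doc_id  STRING,
  body    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/support_emails/';

-- Rough word-frequency structure derived at read time; a real pipeline would
-- substitute proper tokenization, stemming, and entity tagging
SELECT doc_id,
       word,
       COUNT(*) AS occurrences
FROM   support_emails
LATERAL VIEW explode(split(lower(body), '\\s+')) words AS word
GROUP  BY doc_id, word;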
NUMBER FIVE
PROVIDE DATA DISCOVERY TOOLS FOR DATA CONSUMERS

The lowered barrier to entry for scalable high-performance systems suggests that any BI/analytics program would benefit from adding more data to the mix. Yet adding new data into the warehouse requires a lot of work—profile the data source, assess data quality, identify the transformations, and implement the data integration application.

This bottleneck is compounded when different sets of downstream data consumers have differing needs, requiring the data integration development to be done multiple times for the various users. More critically, conflating all of the downstream requirements into one set of transformations leads to conflicts and confusion.

There are three main challenges to be addressed:

• Ensure that the quality of the data ingested is sufficient for downstream consumption
• Maintain flexibility in interpreting the data for each consumer's particular purpose
• Guard against imposing standardizations based on data usability expectations that are too strict

All three of these issues can be accommodated with data discovery tools. Don't subject all data sources to a single predefined set of business rules. Instead, leverage the freedom of schema-on-read and let the data consumers use a data discovery tool to profile the data, find patterns that are relevant to their own business needs, and evaluate how the source data can be manipulated and transformed for their specific application purposes.

This optimizes the EDW in a number of ways:

• It gives users flexibility in defining their own data transformation rules, which reduces the complexity of attempting to force data integration to meet multiple downstream users' needs.
• Data discovery allows users to combine structured data with unstructured data, which broadens the scope for analytics applications.
• Data discovery tools free the downstream users from the constraints of IT decisions about how data sources are loaded into the warehouse, providing greater freedom for analysis.
• Enabling data scientists to prepare their own views of the data allows more innovative use of machine learning methods and more sophisticated unsupervised deep learning algorithms.
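To ground the profiling step mentioned above, here is a minimal sketch of the kind of exploratory query a data consumer might run against a newly exposed source before deciding how to transform it; the table and column names are hypothetical.

-- Quick profile of a newly exposed feed: volume, key cardinality, and null rates
SELECT COUNT(*)                                           AS row_count,
       COUNT(DISTINCT customer_id)                        AS distinct_customers,
       AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END) AS email_null_rate,
       MIN(signup_date)                                   AS earliest_signup,
       MAX(signup_date)                                   AS latest_signup
FROM   raw_customer_feed;

Dedicated data discovery tools automate and visualize this kind of profiling, but the underlying questions they answer are the same.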


NUMBER SIX
OFFLOAD ETL PROCESSING TO HADOOP

It is commonly said that 60 percent to 70 percent of the effort of data warehouse development can be attributed to ETL. Even if these are just estimates, there is no doubt that ETL development commands the lion's share of both development and execution time necessary to populate a data warehouse. The conventional data architecture provides a staging area for data ingestion and ETL processing, after which the data is loaded into the target system.

Offload your ETL processing to a Hadoop platform. This is a straightforward step towards EDW modernization, and your organization will benefit from three key optimizations:

• Faster execution: You can take advantage of the parallelism provided by a Hadoop cluster to significantly speed up ETL because it can be linearly scaled based on the number of computing resources used.
• Faster development time: Faster execution can radically reduce the develop/test/debug cycle, thereby shortening time to value.
• Reduced resources: You can reduce the amount of resources solely dedicated to ETL because you can spin up and launch your ETL processing on an elastic Hadoop cluster and then release those computing resources when the ETL is finished.

Recent improvements to Apache Hive support the SQL ACID Merge, which handles inserts, updates, and deletes in a single pass instead of needing more complex update pipelines supported by complicated logic for managing rollback. This effectively allows in-database transformations to be performed without the need for massive refreshes. By allowing updates to be performed consistently and isolating readers from in-progress updates, the Hive-based EDW remains available while updates are being applied.
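The sketch below shows what such a single-pass merge might look like, assuming Hive 2.2 or later with transactions enabled and a transactional (ACID) ORC table as the target; the dimension and staging table names and columns are hypothetical.

-- Target must be a transactional table for ACID MERGE (bucketing required on older Hive releases)
CREATE TABLE customer_dim (
  customer_id  BIGINT,
  email        STRING,
  last_update  DATE
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Apply inserts, updates, and deletes from a staging load in one pass
MERGE INTO customer_dim AS t
USING customer_staging AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET email = s.email, last_update = s.load_date
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.load_date);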
NUMBER SEVEN
ENSURE DATA GOVERNANCE FOR DATA VERSIONING, LINEAGE, AND PROVENANCE

Expanding the accessibility and usability of enterprise data definitely creates new freedoms for data consumers and data scientists. However, delegating more power to the user community poses some risks in ensuring consistency in use, interpretation, and, importantly, protection. Sometimes data discovery tools provide too much flexibility, allowing different users to apply rules and transformations that effectively result in contradictory semantics.

Moderate user flexibility with operational data governance to ensure consistent policies for data oversight, including:

• Metadata management: Don't just keep track of data asset and data element metadata; ensure that the captured metadata and corresponding data controls can be shared among applications across the BI/analytics framework both in and outside your Hadoop ecosystem.
• Data lineage and provenance: Address the typical EDW challenge of tracking data from its acquisition to its integration. This becomes more complex as static data integration yields to integration combining both static and continuously streaming data sources. Further, as a growing number of downstream consumers use data discovery tools to assess the data and define their own data standardization rules and data integration paths, it becomes challenging to track the sequences of actions applied to reach the data asset used in analysis.
• Data archiving and versioning policies: With scalable storage provided by Hadoop, the organization can maintain data longer without having to shunt it off to a secondary (or tertiary) storage device such as DVD or tape. Also, a byproduct of incorporating a broader set of older data is the need to differentiate earlier versions from more current versions. You will need to continually define and ensure observance of these policies.
• Data security and protection: Migrating your EDW to a big data platform does not eliminate the need to ensure proper access to sensitive data.

Apache Atlas is an open source system that supports big data governance. The Atlas system allows for data element definition, hierarchical taxonomies, capture of business-oriented annotations, and documenting relationships—linking data sets to their underlying data elements. In addition, Atlas supports metadata exchange, allowing you to export metadata to downstream and third-party systems.


Atlas is able to capture lineage across both internal Hadoop and external data assets. It can capture the details of data derivation processes that map source data to corresponding targets and centralize auditing and interactive visualization of data lineage and provenance, including security access and operational information associated with the execution of every application, process, and interaction with data.

Atlas supports observation and compliance with data security directives and other data policies based on defined metadata specifications. At the same time, policies can be defined to prevent data derivation, and data masking can be instituted at the column and row level based on data attributes.

AFTERWORD
EVOLVING THE HYBRID EDW AS A MODERNIZATION STRATEGY

The conventional RDBMS-based architecture for the EDW has served most organizations well, but market trends indicate that scalable systems built on commodity storage and compute components are the future for enterprise BI, analytics, and data science practices.

The best path to adoption is evolutionary: focus on EDW optimization, adjust your architecture to be augmented with Hadoop to expand the storage footprint, increase computational power, and broaden the scope of applications supporting advanced analytics. By keeping relevant data in the enterprise data warehouse and moving less critical data to Hive, you will benefit from the best of both worlds while future-proofing your EDW and BI investment.


ABOUT OUR SPONSOR

hortonworks.com

Hortonworks is a leading innovator in the industry—creating, distributing, and supporting enterprise-ready open data platforms and modern data applications. Our mission is to manage the world's data. We have a single-minded focus on driving innovation in open source communities such as Apache Hadoop, NiFi, and Spark. We, along with our 1,600+ partners, provide the expertise, training, and services that allow our customers to unlock transformational value for their organizations across any line of business. Our connected data platforms power modern data applications that deliver actionable intelligence from both data in motion and data at rest. We are powering the future of data.

ABOUT THE AUTHOR

David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader, TDWI instructor, and expert consultant in the areas of data management and business intelligence. David is a prolific author on business intelligence best practices and has written numerous books and papers on data management, including Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph and The Practitioner's Guide to Data Quality Improvement, with additional content provided at www.dataqualitybook.com. David is a frequent invited speaker at conferences, online seminars, and sponsored websites and channels including TechTarget and The Bloor Group. His best-selling book, Master Data Management, has been endorsed by many data management industry leaders.

David can be reached at [email protected].

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for BI professionals worldwide. Focusing exclusively on data management and analytics issues, TDWI Research teams up with industry practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data warehousing solutions. TDWI Research offers reports, commentary, and inquiry services via a worldwide membership program and provides custom research, benchmarking, and strategic planning services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, analytics, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.
