Using Hadoop For Data Warehouse Optimization
By David Loshin
JUNE 2017
NUMBER ONE
Leverage scalability and elasticity to reduce costs

NUMBER TWO
Augment EDW storage with Hadoop and Hive

NUMBER THREE
Use flexible data organization to enable schema-on-read

NUMBER FOUR
Use open source tools to handle unstructured data

NUMBER FIVE
Provide data discovery tools for data consumers

NUMBER SIX
Offload ETL processing to Hadoop

NUMBER SEVEN
Ensure data governance for data versioning, lineage, and provenance

AFTERWORD
Evolving the hybrid EDW as a modernization strategy
FOREWORD

The conventional enterprise data warehouse (EDW) architecture common in most organizations represents over two decades of engineering and refinement, yet its design has become relatively static and has not kept pace with the exploding desire for business intelligence (BI) and analytics. Weaknesses associated with the conventional architecture limit the scale and scope of a BI/analytics program, and the design may actually impede the optimal use of information. For example, consider that:

• There are high up-front capital expenses prior to deploying a data warehouse environment.
• The EDW team must anticipate scalability needs prior to instantiating the system and must be prepared to add to the original capital investment.
• The EDW has relatively high continuing operations and maintenance costs.
• Its architecture is engineered for summarization and may not include all of the data. This limits what can be analyzed.
• Decisions about data organization made early in the design process force dependence on a highly structured data model that constrains the ability of data consumers to analyze the data.
• The established structure complicates agile development, so adding new features is a slow process. The design, development, and implementation of extraction, transformation, and loading (ETL) are also slow, making it difficult to quickly add new data sources.
• Analysis is often limited to structured data because typical EDW architectures are not designed to easily support unstructured data.
• Proactive data discovery is difficult, if allowed at all.
• The EDW is often missing lineage and provenance for the data, and there is little information about how the data was aggregated, allowing for confusion in interpretation.

These inhibitors are increasingly being addressed through the integration of new technologies such as the many components of the open source Hadoop stack. In this TDWI Checklist Report, we recommend ways that augmenting the conventional EDW design with Hadoop and Hive can help optimize your EDW by expanding usability, improving performance, improving results, and reducing overall costs.

NUMBER ONE
Leverage scalability and elasticity to reduce costs

Implementing an EDW used to imply a significant capital investment: purchasing a system platform or specialty appliance sized to meet the immediate and the anticipated future needs for business analytics. Despite the initial investment, the development and deployment life cycle often delayed the data warehouse's time to value, and the investment wasn't fully monetized until long after the EDW was put into production.

As demand grows for access to (and reports from) EDW data, the initial system configuration may no longer meet performance requirements, necessitating an increased investment in storage and computing resources. This may be particularly acute if your organization uses costly specialty data warehouse appliances; increased demand means rising costs. An alternative to this pattern leverages a Hadoop platform and technologies such as Hive that exploit pools of commodity computing and storage resources, thereby allowing system performance to scale proportionally to demand while reducing overall costs.

Begin your Hive configuration with a small number of storage and compute resources. Increased user demand does not necessitate a major system refresh; instead, user performance needs can be met by growing the system incrementally—adding resources as needed. Hadoop's system elasticity allows reporting and analytics applications to consume the resources they need when required and release those resources once the application has completed so they can be used by other applications.

Commodity hardware lowers overall costs in three ways. First, you only buy the components you need instead of making a significant bulk investment. Second, maintaining commodity components is less expensive than maintaining specialty data warehouse appliances. Finally, decoupling the software ecosystem from the underlying hardware allows you to transparently replace slower, aging components with newer technology that seamlessly provides increased performance with few changes to your application code.

Technology innovations also contribute to optimized performance. The Apache Stinger.next initiative is intended to provide full ACID transaction support and SQL:2011 analytics capabilities, and architectural advancements such as low-latency analytical processing (LLAP) provide optimized in-memory caching and persistent query executors that scale elastically and provide subsecond response times, even for more complex SQL queries.
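As a rough illustration of what opting into LLAP can look like from the client side, the sketch below opens a Hive session with LLAP execution preferred and runs an analytical query. It assumes the PyHive client, a HiveServer2 endpoint at hive.example.com, an LLAP-enabled Hive 2.x deployment with daemons already provisioned, and an illustrative fact_sales table; the configuration property names and values should be confirmed against your distribution's documentation.

```python
# Minimal sketch: run an analytics query against Hive with LLAP execution preferred.
# Assumes the pyhive package, HiveServer2 at hive.example.com:10000, and an
# LLAP-enabled deployment; table and column names are illustrative placeholders.
from pyhive import hive

conn = hive.Connection(
    host="hive.example.com",
    port=10000,
    username="analyst",
    database="edw",
    # Session-level settings; exact names/values vary by Hive version and distribution.
    configuration={
        "hive.execution.engine": "tez",     # LLAP runs on top of the Tez engine
        "hive.llap.execution.mode": "all",  # prefer LLAP daemons for query fragments
    },
)

cursor = conn.cursor()
cursor.execute(
    """
    SELECT region, SUM(sale_amount) AS total_sales
    FROM fact_sales
    WHERE sale_date >= '2017-01-01'
    GROUP BY region
    """
)
for region, total in cursor.fetchall():
    print(region, total)
```

Because the persistent LLAP executors and their cache are shared across sessions, repeated queries of this kind benefit from warm data without any change to the client code.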
NUMBER TWO
Augment EDW storage with Hadoop and Hive

This past decade has been an extraordinary time for data management and analytics as the rate of big data and analytics innovations continues to accelerate. The growing inventory of tools collectively bundled within the Hadoop ecosystem has significantly lowered the barriers to launching or expanding a BI/analytics program.

Organizations may need to leverage existing systems while considering the adoption of new systems architectures and may not yet be ready to completely rip and replace their existing EDWs. Nonetheless, the future points to a modernized hybrid environment for BI/analytics evolved through the introduction of the right technology at the right time.

An example is introducing Hive as a way of enhancing the EDW. Hive running on top of Hadoop—with its SQL-like interface layered on top of data stored in HDFS (Hadoop Distributed File System) as well as other file systems (such as Amazon's S3), structures, and formats (such as plain text or Parquet)—provides a database platform that can accommodate most, if not all, of the workloads satisfied by the conventional EDW.

Migrating to Hive provides a starting point for augmenting the existing EDW's storage footprint while leveraging the scalability of commodity hardware resources that can be easily increased with growing demand. The continually maturing Hive system provides faster execution, increased security and data protection, and greater flexibility in supporting different downstream analytics requirements:

• In-memory caching and persistent query executors within the Hive LLAP architecture speed execution and allow elastic scaling to meet performance needs.
• Integrating the Druid column-oriented distributed data store with Hive optimizes low-latency analytics by combining a column store with inverted indexing, which can be used to effectively create extremely fast OLAP (online analytical processing) cubes.
• Hive can be tightly integrated with Apache Ranger to provide a comprehensive approach to security and data access authorization within a Hadoop cluster, as well as to set policies to encrypt or mask data at rest.
• Improved Hive SQL support for SQL ACID Merge reduces the need for massive refreshes and allows users to read the data as it is being updated.
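To make the storage augmentation concrete, here is a minimal sketch that registers an external Hive table over Parquet files already sitting in HDFS (an S3 location such as s3a://bucket/path works the same way), so historical detail can live on commodity storage while remaining queryable through SQL. The database, table, columns, and path are illustrative placeholders, not taken from the report.

```python
# Minimal sketch: expose Parquet files on HDFS (or S3 via s3a://) as a Hive table.
# Table/column names and the storage location are illustrative placeholders.
from pyhive import hive

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS edw.sales_history (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    sale_amount  DECIMAL(12,2)
)
PARTITIONED BY (sale_year INT)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/offload/sales_history'
"""

conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute(DDL)

# Register any partitions that were written directly to the storage location.
cursor.execute("MSCK REPAIR TABLE edw.sales_history")
```

Because the table is declared EXTERNAL, dropping it removes only the metadata and leaves the files in place, which suits an approach that augments rather than replaces the existing warehouse.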
NUMBER THREE
Use flexible data organization to enable schema-on-read

You might say that one of the most innovative concepts associated with data warehousing is dimensional modeling. Recognizing that the data models used by transaction processing and operational systems were not suited to the types of reporting and analyses desired by business users, pioneering data warehouse architects transformed the structured data from the source systems into a different model organized around "facts" representing events or entities and the dimensions that characterize "items of interest" associated with the facts. This structured model is more amenable to the typical analyst's slicing and dicing, and the dimensional modeling approach has become standard for most EDWs.

However, although this data model may be suitable for standard operational reporting, it forces the data sources into a predetermined organization. This is an example of schema-on-write, in which data sets are loaded into a predefined schema or structure. Despite its suitability for typical BI activities, schema-on-write is inherently biased, as it reflects the opinions of the data modelers and decisions based on their interpretation of user requirements. This approach may constrain data analysts and data scientists from applying more creative methods and ultimately impedes optimal data analysis. The alternative is schema-on-read, in which data sets are stored in their original formats and are reconfigured into one of a variety of target schemas when the data sets are read for analysis.

When using schema-on-read, the data is stored using the format in which it was acquired, and no transformations remove any of the original values. Downstream users are not constrained by using data forced into a single predefined model.

Adopting a schema-on-read approach enables greater freedom in your EDW environment. Schema-on-read allows multiple data users to layer their own logical structures on top of the source data, thereby optimizing their ability to explore creative algorithmic methods that take advantage of text analytics and machine learning in addition to traditional BI and reporting.
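A small, tool-neutral illustration of the schema-on-read idea: raw clickstream events are kept exactly as they arrived (JSON lines), and each consumer projects only the fields it cares about at read time. The file name and field names are hypothetical, and the sketch uses only the Python standard library.

```python
# Minimal sketch of schema-on-read: keep raw JSON events untouched and let each
# consumer apply its own "schema" (field projection and typing) when reading.
# The file name and field names are hypothetical.
import json
from datetime import datetime

def read_events(path):
    """Yield raw events exactly as stored; no values are dropped or rewritten."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def marketing_view(event):
    """Marketing cares about campaign attribution."""
    return {
        "user_id": event.get("user_id"),
        "campaign": event.get("utm_campaign"),
        "ts": datetime.fromisoformat(event["timestamp"]),
    }

def operations_view(event):
    """Operations cares about latency and errors from the same raw records."""
    return {
        "endpoint": event.get("url"),
        "latency_ms": float(event.get("latency_ms", 0)),
        "is_error": int(event.get("status", 200)) >= 500,
    }

for raw in read_events("clickstream.jsonl"):
    print(marketing_view(raw))
    print(operations_view(raw))
```

The same raw file serves both consumers; neither view required a load-time transformation or discarded any original values.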
NUMBER FOUR
Use open source tools to handle unstructured data

Data warehouses are engineered around structured data—most are populated with data extracted from existing transaction processing or operational processing systems. If your organization's preference is to focus largely (if not exclusively) on structured reports reflecting historical operations, this architecture will probably be satisfactory. However, if you are interested in introducing more complex reporting and advanced analytics, including predictive and prescriptive modeling, you may find that the conventional data warehouse's limitation to structured data sources will impede those objectives.

An increasing number of Internet-connected devices, social media channels, volumes of digitized data artifacts, and storage devices loaded with legacy documents are all sources of semistructured and unstructured data fit for advanced analytics. These sources include some machine-generated sources (such as automatically generated weather and traffic updates), but the lion's share of unstructured data emanates from human-generated sources such as documents, email, text messages, audio files, graphics, images, blog posts, and website comments.

The first step in integrating unstructured data is to ingest, parse, tag, and organize the information embedded in text, and a number of open source machine learning and text analysis tools in the Hadoop ecosystem can help. For example, Lucene is an Apache open source library that provides indexing and search for unstructured text as well as spellchecking, hit highlighting, and advanced analysis and tokenization capabilities. Apache Solr is an open source product layered on top of Lucene, supporting high-volume traffic of unstructured documents and providing real-time indexing and full-text search. Mahout is a component in the Hadoop open source stack providing a number of machine learning techniques, including text mining.

These tools can be used to identify key concepts within specific business contexts and optimize the integration of unstructured data into your enterprise reporting and analytics environment. Meta-tagging unstructured text and organizing unstructured data into usable structures improves your BI/analytics capability by expanding the breadth of information you can analyze.
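As a rough sketch of the ingest-and-search step, the snippet below pushes a few text documents into a Solr collection and runs a full-text query using Solr's standard JSON update and select endpoints. It assumes a Solr instance at localhost:8983 with an already-created collection named support_emails; the collection name, field names, and documents are illustrative.

```python
# Minimal sketch: index unstructured text in Apache Solr and run a full-text query.
# Assumes Solr at localhost:8983 with an existing "support_emails" collection;
# collection name, field names, and documents are illustrative placeholders.
import requests

SOLR = "http://localhost:8983/solr/support_emails"

docs = [
    {"id": "msg-001", "subject": "Billing question",
     "body": "Customer asked about a duplicate invoice charge."},
    {"id": "msg-002", "subject": "Login issue",
     "body": "Password reset email never arrived for the web portal."},
]

# Index the documents and commit so they become searchable immediately.
requests.post(f"{SOLR}/update?commit=true", json=docs, timeout=30).raise_for_status()

# Full-text search over the body field, requesting highlighted snippets.
params = {"q": "body:invoice", "hl": "true", "hl.fl": "body", "wt": "json"}
resp = requests.get(f"{SOLR}/select", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("subject"))
```

In practice, the extracted concepts and tags would be written back as structured fields so they can be joined with warehouse data for reporting.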
NUMBER FIVE
Provide data discovery tools for data consumers

The lowered barrier to entry for scalable high-performance systems suggests that any BI/analytics program would benefit from adding more data to the mix. Yet adding new data into the warehouse requires a lot of work—profile the data source, assess data quality, identify the transformations, and implement the data integration application.

This bottleneck is compounded when different sets of downstream data consumers have differing needs, requiring the data integration development to be done multiple times for the various users. More critically, conflating all of the downstream requirements into one set of transformations leads to conflicts and confusion.

There are three main challenges to be addressed:

• Ensure that the quality of the ingested data is sufficient for downstream consumption.
• Maintain flexibility in interpreting the data for each consumer's particular purpose.
• Avoid imposing standardizations based on overly strict data usability expectations.

All three of these issues can be accommodated with data discovery tools. Don't subject all data sources to a single predefined set of business rules. Instead, leverage the freedom of schema-on-read and let the data consumers use a data discovery tool to profile the data, find patterns that are relevant to their own business needs, and evaluate how the source data can be manipulated and transformed for their specific application purposes.

This optimizes the EDW in a number of ways:

• It gives users flexibility in defining their own data transformation rules, which reduces the complexity of attempting to force data integration to meet multiple downstream users' needs.
• Data discovery allows users to combine structured data with unstructured data, which broadens the scope for analytics applications.
• Data discovery tools free the downstream users from the constraints of IT decisions about how data sources are loaded into the warehouse, providing greater freedom for analysis.
• Enabling data scientists to prepare their own views of the data allows more innovative use of machine learning methods and more sophisticated unsupervised deep learning algorithms.
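To ground the profiling step described above, here is a minimal, hand-rolled profile of a raw CSV source: per-column null rates, distinct counts, and sample values, which is the kind of summary a data discovery tool presents before any transformation rules are chosen. It uses only the Python standard library; the file name and columns are hypothetical.

```python
# Minimal sketch of source-data profiling: null rates, distinct counts, and sample
# values per column, computed from a raw CSV file before any integration work.
# File name and columns are hypothetical.
import csv
from collections import Counter, defaultdict

def profile_csv(path, max_samples=5):
    rows = 0
    nulls = Counter()
    distinct = defaultdict(set)

    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            rows += 1
            for column, value in record.items():
                if value is None or value.strip() == "":
                    nulls[column] += 1
                else:
                    distinct[column].add(value)

    for column in sorted(distinct.keys() | nulls.keys()):
        print(
            f"{column}: "
            f"null_rate={nulls[column] / rows:.1%}, "
            f"distinct={len(distinct[column])}, "
            f"samples={sorted(distinct[column])[:max_samples]}"
        )

profile_csv("supplier_feed.csv")
```

Each consumer can run this kind of profile against the same untouched source and then decide on transformation rules that suit their own application.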
NUMBER SIX
Offload ETL processing to Hadoop

It is commonly said that 60 percent to 70 percent of the effort of data warehouse development can be attributed to ETL. Even if these are just estimates, there is no doubt that ETL development commands the lion's share of both the development and execution time necessary to populate a data warehouse. The conventional data architecture provides a staging area for data ingestion and ETL processing, after which the data is loaded into the target system.

Offload your ETL processing to a Hadoop platform. This is a straightforward step toward EDW modernization, and your organization will benefit from three key optimizations:

• Faster execution: You can take advantage of the parallelism provided by a Hadoop cluster to significantly speed up ETL because it can be linearly scaled based on the number of computing resources used.
• Faster development time: Faster execution can radically reduce the develop/test/debug cycle, thereby shortening time to value.
• Reduced resources: You can reduce the amount of resources solely dedicated to ETL because you can spin up and launch your ETL processing on an elastic Hadoop cluster and then release those computing resources when the ETL is finished.

Recent improvements to Apache Hive support the SQL ACID Merge, which handles inserts, updates, and deletes in a single pass instead of needing more complex update pipelines supported by complicated logic for managing rollback. This effectively allows in-database transformations to be performed without the need for massive refreshes. By allowing updates to be performed consistently and isolating readers from in-progress updates, the Hive-based EDW remains available while updates are being applied.
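To illustrate the single-pass pattern that ACID MERGE enables, the sketch below applies a batch of staged changes (inserts, updates, and deletes) to a transactional Hive target table with one statement. It assumes Hive 2.2 or later with ACID transactions enabled, a target table created as transactional, and a staging table of change records; all database, table, and column names are illustrative.

```python
# Minimal sketch: apply staged inserts/updates/deletes to a transactional Hive
# table in a single pass using ACID MERGE (Hive 2.2+ with transactions enabled).
# Table and column names are illustrative placeholders.
from pyhive import hive

MERGE_SQL = """
MERGE INTO edw.dim_customer AS t
USING staging.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email, updated_at = s.extracted_at
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email, s.extracted_at)
"""

conn = hive.Connection(host="hive.example.com", port=10000, username="etl")
cursor = conn.cursor()
cursor.execute(MERGE_SQL)
```

Readers querying edw.dim_customer while the merge runs continue to see a consistent snapshot, which is the behavior the section describes: no massive refresh, and the table stays available during updates.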
NUMBER SEVEN
Ensure data governance for data versioning, lineage, and provenance

Expanding the accessibility and usability of enterprise data definitely creates new freedoms for data consumers and data scientists. However, delegating more power to the user community poses risks to consistency in use, interpretation, and, importantly, protection. Sometimes data discovery tools provide too much flexibility, allowing different users to apply rules and transformations that effectively result in contradictory semantics. Moderate user flexibility with operational data governance to ensure consistent policies for data oversight, including:

• Metadata management: Don't just keep track of data asset and data element metadata; ensure that the captured metadata and corresponding data controls can be shared among applications across the BI/analytics framework both in and outside your Hadoop ecosystem.
• Data lineage and provenance: Address the typical EDW challenge of tracking data from its acquisition to its integration. This becomes more complex as static data integration yields to integration combining both static and continuously streaming data sources. Further, as a growing number of downstream consumers use data discovery tools to assess the data and define their own data standardization rules and data integration paths, it becomes challenging to track the sequences of actions applied to reach the data asset used in analysis.
• Data archiving and versioning policies: With scalable storage provided by Hadoop, the organization can maintain data longer without having to shunt it off to a secondary (or tertiary) storage device such as DVD or tape. Also, a byproduct of incorporating a broader set of older data is the need to differentiate earlier versions from more current versions. You will need to continually define and ensure observance of these policies.
• Data security and protection: Migrating your EDW to a big data platform does not eliminate the need to ensure proper access to sensitive data.

Apache Atlas is an open source system that supports big data governance. The Atlas system allows for data element definition, hierarchical taxonomies, capture of business-oriented annotations, and documentation of relationships—linking data sets to their underlying data elements. In addition, Atlas supports metadata exchange, allowing you to export metadata to downstream and third-party systems.
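As a rough illustration of how governance metadata might be pulled from Atlas, the sketch below uses the Atlas v2 REST API to run a basic search for Hive table entities and then fetch the lineage graph for the first match. It assumes an Atlas server at localhost:21000 with placeholder basic-auth credentials and hive_table entities already registered by the Hive hook; endpoint paths and response fields should be checked against your Atlas version.

```python
# Minimal sketch: query Apache Atlas (v2 REST API) for Hive table entities and
# fetch the lineage graph of the first match. Host, credentials, and the assumed
# presence of hive_table entities are illustrative.
import requests

ATLAS = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # placeholder credentials

# Basic search for registered Hive tables.
search = requests.get(
    f"{ATLAS}/search/basic",
    params={"typeName": "hive_table", "limit": 5},
    auth=AUTH,
    timeout=30,
)
search.raise_for_status()
entities = search.json().get("entities", [])

for entity in entities:
    print(entity["guid"], entity["attributes"].get("qualifiedName"))

# Lineage (upstream/downstream processes) for the first table found.
if entities:
    guid = entities[0]["guid"]
    lineage = requests.get(f"{ATLAS}/lineage/{guid}", auth=AUTH, timeout=30)
    lineage.raise_for_status()
    print(lineage.json().get("relations", []))
```

Exposing this kind of lineage and metadata alongside data discovery tooling gives downstream consumers a consistent picture of where a data asset came from and what transformations produced it.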