Denodo
HEADQUARTERS
Palo Alto, CA.
CUSTOMERS
Speakers
Logical Data Warehouse
Gartner Definition
Description:
“The Logical Data Warehouse (LDW) is a new data management architecture for
analytics combining the strengths of traditional repository warehouses with
alternative data management and access strategy. The LDW will form a new
best practice by the end of 2015.”
“The LDW is an evolution and augmentation of DW practices, not a replacement”
“A repository-only style DW contains a single ontology/taxonomy, whereas in the
LDW a semantic layer can contain many combinations of use cases, many
business definitions of the same information”
“The LDW permits an IT organization to make a large number of datasets
available for analysis via query tools and applications.”
Logical Data Warehouse
Description:
A semantic layer on top of the data warehouse that keeps the business data
definition.
Allows the integration of multiple data sources including enterprise systems,
the data warehouse, additional processing nodes (analytical appliances, Big
Data, …), Web, Cloud and unstructured data.
Publishes data to multiple applications and reporting tools.
Three Integration/Semantic Layer Alternatives
Gartner’s View of Data Integration
Application/BI Tool as the Data Integration Layer
EDW as the Data Integration Layer
Data Virtualization as the Data Integration Layer
Logical Data Warehouse
[Diagram: data source labels HDFS, Document, ERP, Sales, Files, Collections]
Logical Data Warehouse
Reference Architecture by Denodo
Physical data movement architectures that aren’t designed to
support the dynamic nature of business change, volatile
requirements and massive data volume are increasingly being
replaced by data virtualization.
What about the Logical Data Lake?
A Data Lake will not have a star or snowflake schema, but rather a more
heterogeneous collection of views with raw data from heterogeneous
sources
The virtual layer will act as a common umbrella under which these
different sources are presented to the end user as a single system
Virtual Data Marts
Simplified semantic models for business users
Typical queries
Simple projections, filters and aggregations on top of curated “fat tables”
that merge data from facts and many dimensions
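A virtual data mart of this kind can be sketched with an ordinary SQL view. The example below is a minimal illustration in Python, with an in-memory SQLite database standing in for the virtual layer; the table names, columns and sample rows are hypothetical, and a real deployment would define the view in the virtualization platform over remote sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE sales    (retailer_fk INTEGER, product_fk INTEGER, amount REAL);

    INSERT INTO retailer VALUES (1, 'North'), (2, 'South');
    INSERT INTO product  VALUES (10, 'Anvil', 'ACME'), (11, 'Rope', 'Other');
    INSERT INTO sales    VALUES (1, 10, 100.0), (1, 11, 20.0), (2, 10, 50.0);

    -- The "fat table": facts pre-joined with their dimensions, exposed as a view.
    CREATE VIEW sales_mart AS
    SELECT r.name AS retailer, p.name AS product, p.brand AS brand, s.amount AS amount
    FROM sales s
    JOIN retailer r ON s.retailer_fk = r.id
    JOIN product  p ON s.product_fk  = p.id;
""")

# A typical business-user query: a filter plus an aggregation, no joins needed.
rows = conn.execute(
    "SELECT retailer, SUM(amount) FROM sales_mart "
    "WHERE brand = 'ACME' GROUP BY retailer ORDER BY retailer"
).fetchall()
print(rows)  # [('North', 100.0), ('South', 50.0)]
```

The business user only sees sales_mart; the joins stay hidden inside the view definition.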
Virtual Data Marts
[Diagram: virtual marts (Sales, Product) built on top of the Retailer Dimension, Time Dimension, Fact table (sales), Product and Prod. Details; sources: EDW, Others]
DW + MDM
Slim dimensions with extended information maintained in an external
MDM system
Motivation
Keep a single copy of golden records in the MDM that can be reused across
systems and managed in a single place
Typical queries
Join a large fact table (DW) with several MDM dimensions, aggregations on
top
Example
Revenue by customer, projecting the address from the MDM
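The "revenue by customer" example could look roughly like the sketch below. It assumes two separate stores: the main SQLite connection plays the DW and an attached in-memory database plays the MDM. All table and column names are hypothetical.

```python
import sqlite3

# Main connection plays the DW; an attached in-memory database plays the MDM.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mdm")
conn.executescript("""
    CREATE TABLE sales (customer_fk INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 120.0), (1, 30.0), (2, 75.0);

    CREATE TABLE mdm.customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    INSERT INTO mdm.customer VALUES (1, 'Alice', '1 Main St'), (2, 'Bob', '2 Oak Ave');
""")

# Aggregate the large fact table first, then join the small result with the
# golden records so that the address is projected from the MDM.
revenue_by_customer = conn.execute("""
    SELECT c.name, c.address, f.revenue
    FROM (SELECT customer_fk, SUM(amount) AS revenue
          FROM sales GROUP BY customer_fk) AS f
    JOIN mdm.customer AS c ON c.id = f.customer_fk
    ORDER BY c.name
""").fetchall()
print(revenue_by_customer)
# [('Alice', '1 Main St', 150.0), ('Bob', '2 Oak Ave', 75.0)]
```

The DW keeps only the customer key; the descriptive attributes are read from the single MDM copy at query time.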
DW + MDM dimensions
[Diagram: Retailer Dimension; sources: EDW, MDM]
DW + Cloud dimensional data
Fresh data from cloud systems (e.g. SFDC) is mixed with the EDW, usually
on the dimensions. DW is sometimes also in the cloud.
Motivation
Take advantage of “fresh” data coming straight from SaaS systems
Avoid local replication of cloud systems
Typical queries
Dimensions are joined with cloud data to filter based on some external attribute
not available (or not current) in the EDW
Example
Report on current revenue on accounts where the potential for an expansion is
higher than 80%
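A minimal sketch of this pattern, assuming the cloud records have already been fetched into memory. The list cloud_accounts stands in for a live call to the SaaS system, and its field names, like the DW table, are hypothetical.

```python
import sqlite3

# Stand-in for records fetched live from a SaaS CRM (hypothetical fields).
cloud_accounts = [
    {"account_id": 1, "expansion_potential": 0.90},
    {"account_id": 2, "expansion_potential": 0.40},
    {"account_id": 3, "expansion_potential": 0.85},
]

dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE sales (account_fk INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 500.0), (2, 900.0), (3, 100.0), (3, 150.0);
""")

# Filter on the cloud attribute first; it is not stored in the DW.
hot_ids = [a["account_id"] for a in cloud_accounts if a["expansion_potential"] > 0.80]

# Push the surviving keys down to the DW as an IN filter and aggregate there.
placeholders = ",".join("?" for _ in hot_ids)
revenue = dw.execute(
    f"SELECT account_fk, SUM(amount) FROM sales "
    f"WHERE account_fk IN ({placeholders}) "
    f"GROUP BY account_fk ORDER BY account_fk",
    hot_ids,
).fetchall()
print(revenue)  # [(1, 500.0), (3, 250.0)]
```

Nothing from the cloud system is replicated locally; only the matching keys travel to the DW.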
DW + Cloud dimensional data
[Diagram: Customer Dimension, Time Dimension, Fact table (sales), Product Dimension and SFDC Customer; sources: EDW, CRM]
Multiple DW integration
Use multiple DWs as if they were one
Motivation
Mergers and acquisitions
Different DWs by department
Transition to new EDW deployments (migration to Spark, Redshift, etc.)
Typical queries
Joins across fact tables in different DWs, with aggregations before or after the JOIN
Example
Get customers with purchases higher than 100 USD who do not have a fidelity
card (purchases and fidelity card data in different DWs)
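The fidelity card example can be sketched as an aggregation inside each DW followed by an anti-join in the integration layer. The two SQLite connections below stand in for two separate warehouses. Which DW holds which table, the table names, and the reading of "higher than 100 USD" as total purchases per customer are all assumptions.

```python
import sqlite3

finance_dw = sqlite3.connect(":memory:")      # holds purchases
marketing_dw = sqlite3.connect(":memory:")    # holds fidelity card data
finance_dw.executescript("""
    CREATE TABLE purchases (customer_id INTEGER, amount_usd REAL);
    INSERT INTO purchases VALUES (1, 80.0), (1, 40.0), (2, 60.0), (3, 300.0);
""")
marketing_dw.executescript("""
    CREATE TABLE fidelity_cards (customer_id INTEGER);
    INSERT INTO fidelity_cards VALUES (3);
""")

# Aggregation before the join: each DW returns only a small, pre-reduced result.
big_spenders = {
    cid for (cid,) in finance_dw.execute(
        "SELECT customer_id FROM purchases "
        "GROUP BY customer_id HAVING SUM(amount_usd) > 100"
    )
}
card_holders = {
    cid for (cid,) in marketing_dw.execute(
        "SELECT DISTINCT customer_id FROM fidelity_cards"
    )
}

# The cross-DW step is an anti-join, done here in the integration layer.
result = sorted(big_spenders - card_holders)
print(result)  # [1]
```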
Multiple DW integration
[Diagram: Sales fact and Fidelity facts with Product, Store, City, Region, Time and Customer dimensions; sources: Marketing EDW, Finance EDW]
DW Historical offloading
Only the most current data (e.g. last year) is in the EDW. Historical data is
offloaded to a Hadoop cluster
Motivations
Reduce storage cost
Transparently use the two datasets as if they were all together
Typical queries
Facts are defined as a partitioned UNION based on date
Queries join the “virtual fact” with dimensions and aggregate on top
Example
Queries on current date only need to go to the DW, but longer timespans need to merge
with Hadoop
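The partitioned UNION can be sketched as a small routing function that only queries the partitions a date range can touch. Two SQLite databases stand in for the EDW and the Hadoop cluster; the cutoff date, table layout and sample rows are hypothetical.

```python
import sqlite3
from datetime import date

CUTOFF = date(2015, 1, 1)  # hypothetical boundary: older rows live in the archive

edw = sqlite3.connect(":memory:")       # stand-in for the EDW (current data)
archive = sqlite3.connect(":memory:")   # stand-in for the Hadoop cluster (history)
for db, sample in (
    (edw, [("2015-03-01", 10.0), ("2015-06-01", 20.0)]),
    (archive, [("2013-05-01", 5.0), ("2014-11-01", 7.0)]),
):
    db.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", sample)


def total_sales(start: date, end: date):
    """Sum sales in [start, end], querying only the partitions that can match."""
    partitions = []
    if end >= CUTOFF:
        partitions.append(("edw", edw))          # range overlaps current data
    if start < CUTOFF:
        partitions.append(("archive", archive))  # range overlaps historical data
    total = 0.0
    for _, db in partitions:
        (subtotal,) = db.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM sales "
            "WHERE sale_date BETWEEN ? AND ?",
            (start.isoformat(), end.isoformat()),
        ).fetchone()
        total += subtotal
    return total, [name for name, _ in partitions]


print(total_sales(date(2015, 1, 1), date(2015, 12, 31)))  # (30.0, ['edw'])
print(total_sales(date(2014, 1, 1), date(2015, 12, 31)))  # (37.0, ['edw', 'archive'])
```

A query on recent dates never reaches the archive, while a longer timespan merges both partial results.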
DW Historical offloading
Horizontal partitioning
[Diagram: Retailer Dimension; source: EDW]
Slim DW extension
Vertical partitioning
[Diagram: Retailer Dimension; source: EDW]
Performance in an LDW
It is a common assumption that a virtualized solution will
be much slower than a persisted approach via ETL.
Debunking the myths of virtual performance
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmarking test
TPC-DS* and the following scenario
Compares the performance of a federated approach in Denodo with an MPP system where
all the data has been replicated via ETL
Both setups use the same data volumes:
Sales Facts: 290 M rows
Customer Dim.: 2 M rows
Items Dim.: 400 K rows
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of
decision support solutions including, but not limited to, Big Data systems.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Query Description         | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer   | 1.99 M        | 20.9 sec.    | 21.4 sec.                                            | Full aggregation push-down
Total sales by item brand | 31.35 K       | 4.7 sec.     | 5.0 sec.                                             | Partial aggregation push-down
Performance and optimizations in Denodo
Focused on 3 core concepts
Performance and optimizations in Denodo
Comparing optimizations in DV vs ETL
Query Optimizer
How Dynamic Query Optimizer Works
Step by Step
Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
How Dynamic Query Optimizer Works
Example: Total sales by retailer and product during the last month for the brand ACME
SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date > ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
[Diagram: Retailer Dimension, Time Dimension, Fact table (sales) and Product Dimension, spread across the EDW and the MDM]
How Dynamic Query Optimizer Works
Example: Non-optimized
[Query plan diagram: JOIN, JOIN, JOIN, then GROUP BY product.name, retailer.name; labelled “10,000,000 rows”]
[Query plan diagram: JOIN, JOIN, then GROUP BY product.name, retailer.name; labelled “10,000,000 rows”]
How Dynamic Query Optimizer Works
Step 3
[Query plan diagram: GROUP BY product.name, retailer.name; labelled “1,000 rows”]
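Partial aggregation push-down, one of the techniques named in the performance comparison, can be illustrated with a toy example. This is a sketch under stated assumptions, not Denodo's implementation: the data is hypothetical, the fact table and the dimension are assumed to live in different sources, and the number of rows handed to the integration layer is the only cost tracked.

```python
from collections import defaultdict

# Hypothetical data: the fact table lives in one source, the dimension in another.
sales = [(pid, 1.0) for pid in (1, 2, 3) for _ in range(1000)]   # (product_fk, amount)
product_brand = {1: "ACME", 2: "ACME", 3: "Other"}               # product id -> brand


def naive(sales_rows):
    """Ship every fact row to the virtual layer, then join and aggregate."""
    transferred = list(sales_rows)
    totals = defaultdict(float)
    for product_fk, amount in transferred:
        totals[product_brand[product_fk]] += amount
    return dict(totals), len(transferred)


def pushed_down(sales_rows):
    """Pre-aggregate by join key at the source, then join and re-aggregate."""
    partial = defaultdict(float)
    for product_fk, amount in sales_rows:        # would run inside the source system
        partial[product_fk] += amount
    transferred = list(partial.items())          # one row per product, not per sale
    totals = defaultdict(float)
    for product_fk, subtotal in transferred:
        totals[product_brand[product_fk]] += subtotal
    return dict(totals), len(transferred)


print(naive(sales))        # ({'ACME': 2000.0, 'Other': 1000.0}, 3000)
print(pushed_down(sales))  # ({'ACME': 2000.0, 'Other': 1000.0}, 3)
```

Both plans return the same totals; the second moves one row per join key instead of one row per sale.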
How Dynamic Query Optimizer Works
Summary
3. The Cost-based Optimizer picks the right JOIN strategies based on estimations on data
volumes, existence of indexes, transfer rates, etc.
Denodo estimates costs in a different way for parallel databases (Vertica, Netezza, Teradata) than for regular
databases to take into consideration the different way those systems operate (distributed data, parallel
processing, different aggregation techniques, etc.)
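To make the idea of cost-based JOIN selection concrete, here is a deliberately simplified cost model. It is not Denodo's optimizer: the Side fields, the two strategies, the cost formulas and the match_ratio parameter are all assumptions chosen for illustration, and cost is measured only as estimated rows transferred.

```python
from dataclasses import dataclass


@dataclass
class Side:
    rows: int            # estimated rows after filters
    distinct_keys: int   # estimated distinct join-key values
    indexed: bool        # whether the join key is indexed at the source


def choose_join(left: Side, right: Side, match_ratio: float = 0.01) -> str:
    """Pick the strategy with the lowest estimated number of transferred rows.

    match_ratio is the assumed fraction of the right side that matches the
    left side's keys; a real optimizer would derive it from statistics.
    """
    costs = {
        # Pull both inputs into the integration layer and hash-join them there.
        "hash": left.rows + right.rows,
    }
    if right.indexed:
        # Read the left side, send its keys to the right source, and fetch
        # only the matching rows through the index.
        costs["nested"] = left.rows + left.distinct_keys + int(right.rows * match_ratio)
    return min(costs, key=costs.get)


small = Side(rows=1_000, distinct_keys=1_000, indexed=True)
huge = Side(rows=10_000_000, distinct_keys=1_000_000, indexed=True)
print(choose_join(small, huge))  # nested
print(choose_join(huge, small))  # hash
```

With a small left side and an indexed right side, shipping keys is cheaper than transferring ten million rows; with a huge left side, sending its keys across costs more than a plain hash join.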
How Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Execution Alternatives
If a view exists in more than one system, Denodo can decide at execution time which one
to use
The goal is to maximize query delegation depending on the other tables involved in the
query
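One way to picture this choice is the heuristic below: among the systems that hold a copy of a view, prefer the one that also holds most of the other tables in the query, so more of the query can be delegated to a single source. The catalog, the system names and the scoring rule are hypothetical and do not describe Denodo's actual logic.

```python
# Hypothetical catalog: which systems hold a copy of each table or view.
LOCATIONS = {
    "sales": {"edw"},
    "customer": {"edw", "hadoop"},   # replicated view: two execution alternatives
    "clickstream": {"hadoop"},
}


def pick_source(view: str, other_tables: list[str]) -> str:
    """Choose the copy of `view` that is co-located with most other tables."""
    colocated = {system: 0 for system in LOCATIONS[view]}
    for table in other_tables:
        for system in LOCATIONS[table] & LOCATIONS[view]:
            colocated[system] += 1
    # Sort candidates by name first so that ties resolve deterministically.
    return max(sorted(colocated), key=lambda system: colocated[system])


print(pick_source("customer", ["sales"]))        # edw
print(pick_source("customer", ["clickstream"]))  # hadoop
```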
Caching
Caching
Real time vs. caching
For these scenarios, Denodo can replicate just the relevant data in
the cache
Caching
Overview
References
Further Reading
Denodo Cookbook
• Data Warehouse Offloading
Success Stories
Customer Case Studies
Autodesk Overview
Business Drivers for Change
Technology Challenges
Logical Data Warehouse at Autodesk
Logical Data Warehouse at Autodesk
Traditional BI/Reporting
Logical Data Warehouse at Autodesk
‘New Data’ Ingestion
Logical Data Warehouse Example
Reporting on Combined Data
Case Study: Autodesk Successfully Changes Their
Revenue Model and Transforms Business
Autodesk, Inc. is an American multinational software corporation that makes software for the
architecture, engineering, construction, manufacturing, media, and entertainment industries.
Q&A
Thanks!
www.denodo.com [email protected]
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying and microfilm, without the prior written authorization from Denodo Technologies.