Best Practices For Query Performance in A Data Warehouse: Calisto Zuzarte
To meet SLAs, DBAs usually go through several iterations of augmenting the database with performance layer objects and setting up the initial configuration to get good performance.
During production, changing requirements and changing data require ongoing tuning to keep operations running smoothly.
Motivation
Data warehouse environment characteristics:
Large volumes of data
Millions/Billions of rows involved in some tables
Large amounts of data rolled-in and rolled-out
Complex queries
Large Joins
Large Sorts
Large amounts of Aggregations
Many tables involved
Ad Hoc Queries
Objectives
Provide recommendations so that you can improve data warehouse query performance
Agenda
Database Partitioning
Table Partitioning
Multi-Dimension Clustering
UNION ALL Views
SMP recommended
When CPUs are highly underutilized
When DPF is not an option
Table Partitioning
Key Benefit: Better data management (roll-in and roll-out of data)
PartitionBy
DistributeBy
Star Schema
SALES (fact): Product_id, Store_id, Channel_id, Date_id, Amount, Quantity
TIME: Date_id, Month_id, Quarter_id, Year_id
PRODUCT: Product_id, Class_id, Group_id, Family_id, Line_id, Division_id
STORE: Store_id, Region_id
CHANNEL: Channel_id
Dimension Hierarchy
Product Dimension: Division (Level 5) > Line (Level 4) > Family (Level 3) > Group (Level 2) > Class (Level 1) > Product (Level 0)
Time Dimension: Year > Quarter > Month > Date
Store Dimension: Retailer > Store
Channel Dimension: Channel
Sales Fact
Compression
Table, Index and Temp Table compression
Large storage savings
30-70% with table and TEMP compression
30-40% with index compression
Constraints
Referential Integrity
Indexes
Indexes are a vertical subset of the data in the table
Indexes provide ORDER
Indexes may allow for clustered access to the table
Index Considerations
To get Index Only Access instead of more expensive ISCANFETCH or TSCAN (Table Scan)
To avoid SORTs particularly those that spill
To promote index-ORing and index-ANDing
To promote Star Joins
When you have range join predicates
Better possibilities with Nested Loop Join
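The first consideration, index-only access, can be illustrated outside DB2. The sketch below uses Python's built-in sqlite3 module rather than DB2, so the plan text is SQLite's and not DB2 explain output; the table `sales`, the index names, and the query are hypothetical. It only shows the principle: when an index contains every column a query needs, the engine can answer from the index alone instead of fetching table rows.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id INTEGER, product_id INTEGER, "
    "amount REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(i % 10, i % 50, float(i), 1) for i in range(1000)],
)

# Hypothetical query: needs only store_id and amount.
query = "SELECT store_id, amount FROM sales WHERE store_id = 3"

# Index on the predicate column only: the engine still has to visit
# the table to pick up amount.
conn.execute("CREATE INDEX ix_sales_store ON sales (store_id)")
plan_fetch = plan_for(conn, query)

# Index that also carries amount: every referenced column is in the index.
conn.execute("DROP INDEX ix_sales_store")
conn.execute("CREATE INDEX ix_sales_store_amount ON sales (store_id, amount)")
plan_index_only = plan_for(conn, query)

print("predicate-only index:", plan_fetch)
print("covering index:      ", plan_index_only)
```

The trade-off is the usual one: the wider index costs more storage and more maintenance during roll-in, so it is worth adding mainly for frequent queries that touch few columns.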
Cardinality Estimation
Estimating the size of intermediate results is critical to getting good query execution plans
Without sufficient information, the optimizer can only guess based on some assumptions
Data skew and statistical correlation between multiple column values introduce uncertainty
Pay attention to DATE columns
Country   City       Hotel Name
Germany   Bremen     Hilton
Germany   Bremen     Best Western
Germany   Frankfurt  InterCity
Germany   Frankfurt  Shangri-La
Canada    Toronto    Four Seasons
Canada    Toronto    Intercontinental
Example: COUNTRY = Germany AND CITY = Frankfurt
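To make the correlation problem concrete, the sketch below works through the hotel example with a deliberately simplified model: it assumes uniformly distributed values and uses only distinct-value counts, which is not a reproduction of the DB2 optimizer. The function name `distinct` and the variable names are illustrative. It compares an estimate that treats COUNTRY and CITY as independent with one that uses the distinct count of the (COUNTRY, CITY) pair, the kind of information that statistics on a column group provide.

```python
# Hotel rows from the example: (country, city, hotel name)
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Germany", "Frankfurt", "Shangri-La"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def distinct(rows, *cols):
    """Number of distinct value combinations in the given column positions."""
    return len({tuple(row[c] for c in cols) for row in rows})


total_rows = len(hotels)

# Independence assumption: selectivity = 1/distinct(COUNTRY) * 1/distinct(CITY).
est_independent = total_rows / (distinct(hotels, 0) * distinct(hotels, 1))

# Column group: selectivity = 1/distinct(COUNTRY, CITY).
est_column_group = total_rows / distinct(hotels, 0, 1)

# Ground truth for COUNTRY = Germany AND CITY = Frankfurt.
actual = sum(
    1 for country, city, _ in hotels
    if country == "Germany" and city == "Frankfurt"
)

print("independent estimate: ", est_independent)
print("column group estimate:", est_column_group)
print("actual rows:          ", actual)
```

Because the city determines the country here, multiplying the two selectivities counts the same restriction twice and underestimates the result; on large tables with several correlated predicates the error compounds.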
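Data skew example: a table of 10000000 rows with columns CUSTID and CNAME, and the number of rows per CUSTID
Customer names shown: ABC, DEF, GHI, IBM, JKL, MNO, PQR, XYZ
CUSTID values shown: 10, 63, 72, 12, 100
# of Rows for the most frequent CUSTID values: 2000000, 700000, 500000, 300000, 100000, 50000, 20000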
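The effect of skew on an equality predicate such as CUSTID = ? can be sketched with the row counts listed above. This is a simplified model, not the DB2 optimizer: the total of 10000000 rows and the seven frequent-value counts come from the example, while `DISTINCT_CUSTOMERS` is a hypothetical figure chosen only so that the arithmetic can be carried out.

```python
TOTAL_ROWS = 10_000_000

# Row counts of the most frequent CUSTID values, as listed in the example.
frequent_counts = [2_000_000, 700_000, 500_000, 300_000, 100_000, 50_000, 20_000]

# Hypothetical: the example does not say how many distinct customers exist.
DISTINCT_CUSTOMERS = 1_000

# Uniform assumption: every CUSTID is expected to match the same number of rows.
uniform_estimate = TOTAL_ROWS / DISTINCT_CUSTOMERS

# With frequent-value information, the known counts are used directly and
# only the remaining rows are spread evenly over the remaining values.
remaining_rows = TOTAL_ROWS - sum(frequent_counts)
remaining_values = DISTINCT_CUSTOMERS - len(frequent_counts)
non_frequent_estimate = remaining_rows / remaining_values

# How far off the uniform estimate is for each frequent value.
error_factors = [count / uniform_estimate for count in frequent_counts]

print("uniform estimate per CUSTID:", uniform_estimate)
print("estimate for non-frequent CUSTIDs:", round(non_frequent_estimate, 1))
for count, factor in zip(frequent_counts, error_factors):
    print(f"actual {count:>9} rows -> uniform estimate off by {factor:g}x")
```

Under these assumptions the uniform estimate is too low by a factor of 200 for the most frequent customer, which is the kind of error that can lead the optimizer to a poor join method or join order.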
[Figure: query graphs labeled Q9, Sue's Query, Joe's Query, and Bob's Query; each combines GROUP BY (GB) and JOIN operators over the Fact, Dim1, and Dim2 tables]
MQT
As far as possible build the MQT from the fact table alone
Use Table Partitioning for the fact table and the MQTs
REFRESH DEFERRED
If log space is an issue, consider NOT LOGGED INITIALLY or LOAD from cursor
An MQT can be temporarily toggled into a regular table by using
ALTER TABLE DROP MATERIALIZED QUERY
ALTER TABLE ADD MATERIALIZED QUERY
Use ATTACH / DETACH if fact table and MQT are range partitioned tables
Replicated Tables
[Figure: three database partitions, each joining SALES with CUST; labels show a broadcast table queue (BTQ) and a local CUST COPY on each partition]
DB2_REDUCED_OPTIMIZATION=YES
If compile time is an issue
Application Design
SQL Tips
Performance Layer
Indexes, Statistics, Referential Integrity, Materialized Query Tables, Replicated Tables
Calisto Zuzarte