0% found this document useful (0 votes)
80 views41 pages

Best Practices For Query Performance in A Data Warehouse: Calisto Zuzarte

The document provides best practices for improving query performance in a data warehouse. It discusses database design considerations like parallelism, partitioning, schema design, and compression. It also covers application design best practices and performance layer techniques including indexes, statistics, constraints, and materialized views. Ongoing tuning considerations are also highlighted to keep operations optimized as data and requirements change over time.

Uploaded by

Indrajit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views41 pages

Best Practices For Query Performance in A Data Warehouse: Calisto Zuzarte

The document provides best practices for improving query performance in a data warehouse. It discusses database design considerations like parallelism, partitioning, schema design, and compression. It also covers application design best practices and performance layer techniques including indexes, statistics, constraints, and materialized views. Ongoing tuning considerations are also highlighted to keep operations optimized as data and requirements change over time.

Uploaded by

Indrajit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 41

Best Practices for Query

Performance In a Data Warehouse


Calisto Zuzarte
IBM
[email protected]
Session Code: D09
May 13, 2010 8:30AM9:30AM
Platform: Linux, Unix and Windows

Data Warehouse Life Cycle

Database design / Application design

Database performance layer implementation

The Warehouse Application architects and Database Administrators work together


to design the queries and schema before they put the application in production

In order to meet SLAs, DBAs usual go through some iterations augmenting the
database with performance layer objects and set up the initial configuration to get
good performance

Database tuning operations

During production, with changing requirements and change in data, there is ongoing tuning required to keep operations smooth.

Motivation
Data warehouse environments characteristics:
Large volumes of data
Millions/Billions of rows involved in some tables
Large amounts of data rolled-in and rolled-out

Complex queries

Large Joins
Large Sorts,
Large amounts of Aggregations
Many tables involved

Ad Hoc Queries

It is important to pay attention to query performance

Objectives
Provide recommendations so that you can improve data
warehouse query performance

Database Design considerations


Application Design considerations
Performance Layer Considerations
Ongoing Tuning Considerations

Agenda

Best Practices Database Design


Best Practices Application Design
Best Practices Performance Layer
Best Practices Configuration and Operations

Best Practices Database Design


Best Practices - Parallelism
Inter-partition Shared nothing parallelism
Intra-Query Parallelism (SMP)

Best Practices - Partitioning

Database Partitioning
Table Partitioning
Multi-Dimension Clustering
UNION ALL Views

Best Practices Schema


Best Practices - Compression

Best Practices - Parallelism


Database partition feature (DPF) is recommended
To achieve parallelism in a data warehouse
For scalability and query performance

SMP (Intra-Query Parallelism) not recommended


In concurrent multi-user environments with heavy CPU usage

SMP recommended
When CPUs are highly under utilized
When DPF is not an option

Partitioning (Complimentary Strategies in DB2)

Database Partitioning (DPF) DISTRIBUTE BY HASH

Multidimensional Clustering (MDC) ORGANIZE BY DIMENSION

Key Benefit : Better scalability and performance through parallelism

Key Benefit : Better query performance through data clustering

Table (Range) Partitioning PARTITION BY RANGE

Table Partitioning
Key Benefit : Better data management (roll-in and roll-out of data)

UNION ALL Views


Key Benefit : Independent branch optimization

Divide And Conquer ! Distribute, Partition, Organize !


OrganizeBy

PartitionBy

DistributeBy

Best Practices Database Partitioning


Collocate the fact and largest dimension
Choose to avoid significant skew on some partitions
Avoid DATE dimension where active transactions for
current date all fall on one database partition
(TIMESTAMP is good)
Possibilities for workload isolation for data marts
Different partition groups but common dimension tables
Needs replicated tables (discussed later)

Best Practices Table Partitioning


Recommend partitioning the fact tables
Typically based on DATE dimension
Works better with application key predicates applied directly

Table or Range Partitioning


Recommend table or range partitioning (V9.7 :partitioned indexes)
Choose partitioning based on roll-in / roll-out granularity

UNION ALL Views

Each branch optimized independently


Use with well designed applications (Dangers of materialization)
Large number of branches require time and memory to optimize
Needs predicates with constants for branch elimination

Best Practices Multidimensional Clustering


(MDC)
Recommend defining MDC on the fact table
Guaranteed clustering (Avoids the need to REORG for clustering)
I/O optimization
Compact indexes (compact, coexists with regular indexes)

Choose dimensions based on query predicates


Recommend the use of 1 to 4 dimensions
Need to ensure dimensions are chosen such that they do not waste
storage

Could choose a finer granularity of Table partitioning range


For example: Table partition range by month, MDC by date

Star Schema

STORE

PRODUCT

Store_id

Product_id
SALES

Region_id

TIME
Date_id
Month_id
Quarter_id
Year_id

Product_id
Store_id
Channel_id
Date_id
Amount
Quantity

Class_id
Group_id
Family_id
Line_id
Division_id

CHANNEL
Channel_id

Product Dimension

Dimension Hierarchy

Division

Level 5

Time Dimension

Line

Level 4

Year

Family

Level 3

Group

Level 2

Class

Level 1

Product

Level 0

Store Dimension

Channel Dimension

Retailer

Channel

Store

Quarter
Month
Date

Sales Fact

Best Practices - Schema


Surrogate Keys
As far as possible use application keys themselves
allows predicates to be applied/transferred directly on the fact table
DATE is a good candidate (easier to roll-in/roll-out and for MDC )

Star Schema / Snowflakes


Separate tables for each dimension hierarchy (snowflake) may result in a large
number of joins
Flattened dimensions may contain a lot of redundancy (space)

Define Columns NOT NULL when appropriate


Many optimizations that are done based on NOT NULL

Define Uniqueness when appropriate


Primary Keys / Unique Constraints / Unique Indexes

Compression
Table, Index and Temp Table compression
Huge benefits with storage savings
With table and TEMP compression 30-70%
With Index compression 30-40%

Performance gains because


Less I/O and better use of bufferpools

TEMP table compression helps operators like Hash


Join, Merge Join, Sorts and Table Queues if they spill

Best Practices - Compression


Consider compression particularly with the fact table
Strongly recommend compression on the fact table when
not CPU bound

Agenda

Best Practices Database Design


Best Practices Application Design
Best Practices Performance Layer
Best Practices Configuration and Operations

Best Practices Application Considerations


Use constants instead of expressions in the query
Example
WHERE DateCol <= CURRENT DATE 5
Use VALUES(CURRENT DATE 5) to get the resulting constant first
and use it in the query

Avoid expressions on indexed columns


Example
WHERE DATECOL 2 DAYS > 2009-10-22
WHERE DATECOL > 2009-10-22 + 2 DAYS

Best Practices Application Considerations


Avoid mixing data types in join predicates
Example
WHERE IntegerCol = DecimalCol

Use Global Temporary Tables to split a query if it contains


more than 10-15 tables
Reduces optimization time

Agenda

Best Practices Database Design


Best Practices Application Design
Best Practices Performance Layer
Best Practices Configuration and Operations

Best Practices Performance Layer


Indexes
Statistics
Distribution Statistics
Column Group Statistics
Statistical Views

Constraints
Referential Integrity

Materialized Query Tables


Replicated Tables

Indexes
Indexes are a vertical subset of the data in the table
Indexes provide ORDER
Indexes may allow for clustered access to the table

Index Considerations
To get Index Only Access instead of more expensive ISCANFETCH or TSCAN (Table Scan)
To avoid SORTs particularly those that spill
To promote index-ORing and index-ANDing
To promote Star Joins
When you have range join predicates
Better possibilities with Nested Loop Join

Indexes for clustering (MDC)

Cardinality Estimation
Estimating the size of intermediate results is critical to getting
good query execution plans
Without sufficient information, the optimizer can only guess
based on some assumptions
Data skew and statistical correlation between multiple
column values introduce uncertainty
Pay attention to DATE columns

Best Practices - Statistics


Collect distribution Statistics when there is skew and
predicates use constants
Consider a high number of quantile statistics on columns
with DATE range predicates and character string columns

Column Group Statistics


1
2
3

Country City

Hotel Name

German Bremen
y

Hilton

German Bremen
y

Best Western

German Frankfur
y
t

InterCity

German Frankfur
y
t

Shangri-La

Canada
Four Seasons
Example: COUNTRY = Germany And
CITYToronto
= Frankfurt
Canada

Toronto

No CGS: Selectivity = * 1/3 = 1/6 Estimate 1 row


With CGS: Selectivity = 1/3 Estimate 2 rows

Intercontinent
al

Problem Scenario - Skew


CUST Table 100 rows, 100 custids

10000000 rows

Frequency Statistics SALES Table

CUSTID

CNAME

CUSTID

# of Rows

ABC

2000000

DEF

10

700000

GHI

500000

IBM

63

300000

JKL

72

100000

MNO

50000

PQR

12

20000

100

XYZ

SELECT FROM SALES, CUST


WHERE CUST.CNAME = IBM AND CUST.CUSTID = SALES.CUSTID
Cardinality Estimate with Uniformity = 100,000
Actual Cardinality : 2,000,000 !!!!!!!!!!!!!!!!!!!!

Best Practices - Statistics


Collect Column Group Statistics with multiple predicates on the
same table
WHERE Country = CANADA and City = TORONTO
RUNSTATS ON ALL COLUMNS AND ON COLUMNS ((country, city) )

Consider Statistical Views when


There is skew on the join column
There is a significant difference in the range of values in the fact and the
dimension
CREATE VIEW SV1 AS
(SELECT C.* FROM CUST C, FACT F WHERE C.CUST_ID = F.CUST_ID)
ALTER VIEW cust_fact ENABLE QUERY OPTIMIZATION
RUNSTATS ON TABLE dba.cust_fact WITH DISTRIBUTION

Referential Integrity (RI)


Facilitates aggregation push down
Example in the appendix section

Eliminates redundant joins in views


RI helps determine that queries that do not require data from a
primary key table need not do that join even if it is in the view

Helps with Materialized Query Table matching


Allows Queries to match MQTs with more dimension table joins

Consider Materialized Query Tables


Joes Query

Q9

Q9

GB

Sues

GB
JOIN
JOIN

Dim2

JOIN

Joes Q
GB

JOIN
Dim2
JOIN
Fact
Dim1
GB
Fact
Dim1
JOIN
Dim2
JOIN
Fact
Dim1
JOIN
Dim2

GB

Sues Query

Fact

Bobs Q

JOIN

Dim1

JOIN

Bobs Query
Fact

Dim2
Dim1

MQT

Best Practices - Defining Materialized Query


Tables
What MQTs should I define ?
Estimate the size of the candidate MQTs by executing COUNT
queries against base tables.
Try to achieve at least a 10X reduction in size between fact and
the MQT
Build MQTs with a reasonable number of GROUP BY columns (3
to 6 dimension keys) at a time based on query patterns

As far as possible build the MQT from the fact table alone

Use Table Partitioning for the fact table and the MQTs

Best Practices - MQT Matching


Define Referential Integrity to help with matching MQTs
that contain more tables than the queries
Define Functional Dependencies for thinner MQTs
Use COUNT_BIG instead of COUNT for DPF MQTs
Define indexes on MQTs
Keep statistics up-to-date
Define base table columns NOT NULL as far as possible
For example we can match SUM(A + B) with SUM(A) + SUM(B)

Best Practices MQT Maintenance


REFRESH IMMEDIATE
Create an index on the GROUP BY columns
Create the index on the set of columns that form a unique key
Always keep the base table and MQT statistics up-to-date

REFRESH DEFERRED
If log space is an issue, consider NOT LOGGED INITIALLY or LOAD from
cursor
An MQT can be temporarily toggled into a regular table by using
ALTER TABLE DROP MATERIALIZED QUERY
ALTER TABLE ADD MATERIALIZED QUERY

Use ATTACH / DETACH if fact table and MQT are range partitioned tables

Replicated Tables
JOIN
BTQ

CUST

CUST
COPY

JOIN
BTQ

SALES

CUST
COPY

JOIN
BTQ

SALES

CUST
COPY

SALES

Replicate dimension tables (unless collocated with fact )


Benefit : Avoids data movement
Important : Define suitable indexes
If too large, replicate a subset of frequently used columns

Agenda

Best Practices Database Design


Best Practices Application Design
Best Practices Performance Layer
Best Practices Configuration and Operations

Best Practices Configuration


Optimization Level 5
Registry Variables
DB2_ANTIJOIN=EXTEND
If slow queries have NOT EXISTS, NOT IN predicates

DB2_REDUCED_OPTIMIZATION=YES
If compile time is an issue

Configuration thumb rules


BUFFPOOL ~= SHEAPTHRES
SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)

Best Practices - Statistics


The DB2 Query Optimizer relies on reasonably accurate statistics to
get a good query plans
User runs RUNSTATS when data changes (part of ETL)
Statistics Fabrication (unreliable)
DB2 keeps UPDATE / DELETE / INSERT counters
Fabrication limited to a few statistics Not enough

Consider configuring Automatic Statistics


Automatically collects statistics on tables in need
Runs in the background as a low priority job

Consider configuring Real Time Statistics


Collects statistics on-the-fly

Summary Best Practices


Database Design :
Parallelism, Partitioning, Schema, Compression

Application Design
SQL Tips

Performance Layer
Indexes, Statistics, Referential Integrity, Materialized Query
Tables, Replicated Tables

Configuration and Operations


Configuration, Collecting Statistics

Calisto Zuzarte
[email protected]

You might also like