Building Scalable, High-Performance Datamarts with MySQL

High Performance Datamarts with MySQL
MySQL AB® & O’Reilly Media, Inc.
Presented by Tangirala C. Sarma
E-mail: [email protected]
Agenda
• Introduction
• Section 1: Why DW/DM? (30 min)
• Section 2: DW Project Methodology (30 min)
• Break – 10 min
• Section 3: DW Modeling Techniques (60 min)
• Break – 10 min
• Section 4: DW Technologies (30 min)
• Questions
Introduction
• DW/BI Architect
• 15+ years of experience providing consulting / advisory services on DW/BI
• Implemented 10+ multi-terabyte data warehouses at enterprises including Wells Fargo, WAMU, REI, Reader’s Digest and Marriott, plus 15+ data marts
• Started looking at MySQL as a viable database platform for Data Warehousing over the last year
Why Data Warehouse/DM?
Demand for DW/DM
• Business reasons for a Data Warehouse
  • Discover and act on “market windows”
  • Competitive pressures / compressed product cycles
  • Reduce business costs: inventory, advertising, production
  • The other guy has one
• IT reasons for a Data Warehouse
  • Consolidate data from diverse operational systems
  • Off-load decision support from mainframe systems
  • OLTP systems make poor decision support servers
Data Warehouse Definitions
• Theoretical definition:
  • Subject Oriented
    • Revolves around business rules
    • “High level entities of the enterprise” (i.e. subject areas)
    • Organized differently than the operational/functional environment
Data Warehouse Definitions (cont..)
Subject-Oriented
[Diagram: operational systems organized by application (loans, bank cards, checking) vs. the data warehouse organized by enterprise subject areas (product, customer, location)]
Data Warehouse Definitions (cont..)
• Integrated
  • Data can come from many different sources
  • Each source can look different than the others
  • Once it’s in the DW, it should look the same
[Diagram: marital status codes from different sources (“MARRIED”/“SINGLE”/“DIVORCED”/“OTHER”, “0”/“1”/“2”/“3”, “MAR”/“SGL”/“DIV”/“OTH”) are integrated into a single representation (S, M, D, O) in the warehouse]
Data Warehouse Definitions (cont..)
• Time-Variant
  • Key structure includes an element of time
  • No matter how it is organized, it still represents a series of snapshots
  • Snapshots or slices can lose accuracy over time, unlike the operational environment, which doesn’t. Example: “Our product code has changed since last year.”
Data Warehouse Definitions (cont..)
• Nonvolatile
  • Two types of routine operations: Load & Access
  • No updates
  • Data is removed only according to business rules
Operational Database          Data Warehouse
current                       historical
not time-based                rolling schedule
used for “day-to-day”         used for business decisions
updated                       inserted and left alone
large number of users         smaller number of users
well-defined                  ad-hoc climate
Data Warehouse Architecture
Several Flavors
[Diagram: a relational Data Warehouse database feeding multiple datamarts]
Project Management, Status Reporting, Issue and Risk Management, Facilitated Workshops, Process Controls, Joint Project Teams, Scope Management, Consensus Building, and Phase-Based Delivery
Project Approach
Goals, Planning, Preparation
Deliverables
• Objective & Goals document: The high-level objective of the overall initiative and the key success metrics should be established and documented here.
Project Approach
Discovery, Business & Data Requirements
Deliverables
• Business requirements document: Contains the finalized business requirements for the DW/DM; this is the synthesized information resulting from the interviews and workshops.
Project Approach
Conceptual Model
Deliverables
• Conceptual Data Model: High-level data model that identifies data entities, relationships and key metrics.
Project Approach
Data Discovery & Data Quality Analysis
• Spend time to understand the source data and assess data quality
• Use data profiling tools to identify data availability and anomalies
• This information will help with ETL design in later phases
Project Approach
Data Discovery & Data Quality Analysis - Deliverables
• Data Source Gap Matrix: Defines the specific data element gaps in the available data required to support the solution and the Business Data Model as defined by the representative users. It details the gaps and their characteristics (uniqueness, availability, granularity, quality, etc.) and defines possible options or alternate solutions to support the business requirements (where feasible).
• High-Level Data Quality Assessment: Documents findings related to source data, its integrity, availability, ability to meet business requirements, and overall quality. It includes issues and possible approaches to resolution.
Project Approach
Technical Architecture
• Architecture Blueprint: Defines the technical design based on business requirements and clarifies the technical components of the overall solution. Where there are integration requirements, this document should summarize the configuration in a series of diagrams.
• Capacity Plan: Defines the storage capacity requirements for current and future needs, covering database storage, staging area and archival needs. Other items it should cover include processor performance, memory capacity, backup bandwidth, network bandwidth, extensibility of the platform, etc.
Project Approach
Data Modeling & Database Design
• Logical Database Design: Documents the following design elements: table, column and view definitions; primary, unique and foreign key definitions; column- and row-level validation rules (check constraints); and rules for populating specific columns (sequences, derivations).
• Physical Database Design: The initial Physical Database Design consists of a narrative description of the database design decisions and a number of listings of the physical storage aspects of the database and its associated objects.
• Customer Standardization & Matching Rules: Captures the standardization and matching rules used to identify unique sites and build the customer hierarchy.
• Source Data to Target Logical Database Matrix: Defines the key assumptions, the mapping between source and target, and the mapping rules and logic needed to create the conversions and interfaces necessary to support the solution. It is intended to provide the developer with the information necessary to write accurate transformation and load logic.
Project Approach
Data Acquisition (ETL) Development
• ETL components and processes are built during this phase to cover:
  • First-time load
  • On-going load
  • Customer and householding rules
  • Process automation
  • Error and reject record handling
• Use of ETL tools / data quality tools is recommended; writing custom SQL may introduce maintenance issues later
Project Approach
Data Acquisition (ETL) Development - Deliverables
• Data Cleansing and standardization components for on-going load: These modules standardize the name, address, e-mail address, etc. The standardized information is used by other components to de-dup and consolidate the customer records; the consolidated information is fed through the mappings to be loaded into the corresponding database objects.
• Data Acquisition (ETL) components for on-going load: These modules move data from sources to the DW/DM. Developers are also responsible for unit testing the respective modules before making them available in the test environment.
• Data Acquisition (ETL) for History Load: These modules move historical data from sources to the DW/DM for the first-time load. Sometimes the first-time sources are different from the on-going sources.
Project Approach
Data Acquisition (ETL) Development - Deliverables
• Data Acquisition (ETL) Automation for standard, on-going loads: Includes the scripts and control information to manage the automated receipt and loading of data that is scheduled to be loaded into the database on a regular basis. It logs activity and sends out appropriate notices of success and failure of the data loads.
• Data Acquisition (ETL) for extracts: These components extract data from the DW/DM in a pre-defined format to provide to downstream systems.
• Data Acquisition (ETL) components for Data Archival: These components roll data off the DW/DM and promote it to higher levels of aggregation tables as the data reaches its history-requirement ceiling.
Project Approach
System Integration Testing (SIT)
• Integration and Testing Plan: Defines the scope and approach for handling the system, system integration and component testing activities.
Dimension Tables
• Dimensions provide context to the data
• Define the business in terms already familiar to users
• Wide rows with lots of descriptive text
• Generally small tables (there are exceptions)
• Joined to the Fact table by a foreign key (not necessarily enforced)
• Typical dimensions: Time, Geography, Product, Customer, etc. (see the sketch below)
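As a minimal sketch of such a dimension table in MySQL (table name, columns and sizes are illustrative assumptions, not from the slides):

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE dim_product (
        product_key   INT NOT NULL AUTO_INCREMENT,   -- surrogate key joined to the fact table
        part_num      VARCHAR(20)  NOT NULL,         -- natural key from the source system
        product_name  VARCHAR(100) NOT NULL,
        category      VARCHAR(50),                   -- wide, descriptive attributes for analysis
        brand         VARCHAR(50),
        PRIMARY KEY (product_key)
    ) ENGINE=MyISAM;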
Fact Table
[Diagram: a star schema with a central Fact table joined to Dimension and Date dimension tables]
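A hedged sketch of the corresponding fact table at daily grain (names are illustrative); each row carries foreign keys to the dimensions plus the numeric measures:

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE fact_daily_sales (
        tran_date_key    INT NOT NULL,        -- joins to the DATE dimension
        product_key      INT NOT NULL,        -- joins to dim_product
        customer_key     INT NOT NULL,        -- joins to dim_customer
        transaction_qty  INT,
        transaction_amt  DECIMAL(12,2),
        transaction_cost DECIMAL(12,2),
        PRIMARY KEY (tran_date_key, product_key, customer_key)
    ) ENGINE=MyISAM;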
Dimension Normalization
(SnowFlaking)
• Each level in the hierarchy is represented as its own table
• Attributes for each level reside in the corresponding DIM table
• Each level is mapped to the parent level table using a FK relationship
• Each level has a source key which is used to match rows while loading
• Use this technique only for very large dimensions
SnowFlake Schema Example
[Diagram: a snowflaked Date dimension hierarchy, with separate Year, Quarter, Month and Date dimension tables joined down to the Fact table]
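One possible rendering of the snowflaked Date hierarchy above (all names are illustrative assumptions); each level becomes its own table carrying a foreign key to its parent level:

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE dim_year (
        year_key INT NOT NULL PRIMARY KEY,
        year_num SMALLINT NOT NULL
    );
    CREATE TABLE dim_quarter (
        quarter_key INT NOT NULL PRIMARY KEY,
        quarter_num TINYINT NOT NULL,
        year_key    INT NOT NULL               -- FK to the parent level (dim_year)
    );
    CREATE TABLE dim_month (
        month_key   INT NOT NULL PRIMARY KEY,
        month_name  VARCHAR(10) NOT NULL,
        quarter_key INT NOT NULL               -- FK to dim_quarter
    );
    CREATE TABLE dim_date_snowflake (
        date_key      INT NOT NULL PRIMARY KEY,  -- the fact table joins to this key
        calendar_date DATE NOT NULL,
        month_key     INT NOT NULL               -- FK to dim_month
    );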
The DATE Dimension
• The DATE dimension is one dimension that you will find in every DM/DW
• You can build the DATE dimension in advance with 5 to 10 years of data, depending on the history requirements
• You may have to model for both Calendar and Fiscal hierarchies in the DATE dimension
• To allow for non-standard calendar calculations (such as holiday sales, weekend sales, seasonal sales, etc.), use indicators to flag the days
[Figure: example of a simple DATE dimension]
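A minimal DATE dimension sketch along these lines, with both calendar and fiscal attributes and indicator flags (column names are illustrative assumptions); it can be pre-populated with 5 to 10 years of rows before any fact data arrives:

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE dim_date (
        date_key       INT NOT NULL PRIMARY KEY,   -- e.g. 20100415
        calendar_date  DATE NOT NULL,
        day_of_week    VARCHAR(9),
        calendar_month TINYINT,
        calendar_qtr   TINYINT,
        calendar_year  SMALLINT,
        fiscal_month   TINYINT,                    -- fiscal hierarchy alongside the calendar one
        fiscal_qtr     TINYINT,
        fiscal_year    SMALLINT,
        holiday_ind    CHAR(1) DEFAULT 'N',        -- flags for non-standard calculations
        weekend_ind    CHAR(1) DEFAULT 'N'
    ) ENGINE=MyISAM;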
Other Dimensions
• Other dimensions are generally sourced from operational sources
• Capture as many descriptive attributes as possible to allow for robust and complete analysis
• Model the dimension as a STAR or SNOWFLAKE depending on its size
• Really large dimensions should use the SNOWFLAKE model to avoid storing the attribute values redundantly at every level
• Always add a dummy row to represent the ‘unknown’ dimension value, to avoid storing NULL keys in the FACT table
Changing The Grain
• Let’s say that after we designed the DW/DM and loaded the data, users would like access to more detailed data, say HOURLY sales rather than DAILY sales (another example: users want to see the daily summary at the store level within Zip code)
• HOURLY will change the grain of the Fact
• Changing the grain requires reloading all the historical data, which is painful
• Make sure you capture the requirements early on, to avoid these painful scenarios
The New SALES FACT
[ERD: the new sales fact at hourly grain, with keys TRAN DATE KEY, PRODUCT KEY, CUSTOMER KEY, ZIP KEY and HOUR KEY and measures TRANSACTION QTY, TRANSACTION AMT and TRANSACTION COST, joined to DIM_HOUR (HOUR KEY, HOUR), DIM_PRODUCT (PART NUM, PRODUCT NAME) and DIM_CUSTOMER (CUSTOMER NO, CUSTOMER NAME)]
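Rendered as DDL, the new hourly-grain fact might look like the sketch below (names are illustrative assumptions); the HOUR key is what changes the grain, which is why all history would have to be reloaded:

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE fact_hourly_sales (
        tran_date_key    INT NOT NULL,
        hour_key         INT NOT NULL,        -- the new dimension key that changes the grain
        product_key      INT NOT NULL,
        customer_key     INT NOT NULL,
        zip_key          INT NOT NULL,
        transaction_qty  INT,
        transaction_amt  DECIMAL(12,2),
        transaction_cost DECIMAL(12,2),
        PRIMARY KEY (tran_date_key, hour_key, product_key, customer_key, zip_key)
    ) ENGINE=MyISAM;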
[Matrix: each measure (Sales Quantity, Sales Amount, Cost Amt, Gross Profit) cross-referenced against the dimensions by which it can be analyzed]
How Many Fact tables in DW/DM?
• The granularity and dimensionality of the measures determine the number of Fact tables required
• Identify all the measures that the users would like to analyze
• Identify the granularity at which these measures should be stored
• Determine the dimensionality of each measure
• Tabulate these results in a spreadsheet
• This information should guide you in determining the number of FACT tables in the DW/DM
Surrogate Keys
• Surrogate keys are integers that are assigned sequentially while populating dimension tables
• Avoid embedding intelligence in generating these keys
• If the source id has intelligence built in (say the first 3 letters of the part number identify the manufacturer), those codes should appear as separate columns in the dimension table
• Surrogate keys shield the DW/DM from operational changes
• Always add a dummy row with ‘-1’ as the sequence id, to identify ‘unknown’ codes in the source data
[Figure: a dimension row keyed by a surrogate key, with the natural key kept as an ordinary column]
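A small sketch of these two rules against an illustrative dim_customer table (names are assumptions; the surrogate key is assigned sequentially by the load process, with AUTO_INCREMENT as one option):

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE dim_customer (
        customer_key  INT NOT NULL,            -- surrogate key, carries no business meaning
        customer_no   VARCHAR(20) NOT NULL,    -- natural key kept as an ordinary column
        customer_name VARCHAR(100),
        PRIMARY KEY (customer_key)
    ) ENGINE=MyISAM;

    -- dummy row so 'unknown' codes in the source never become NULL keys in the fact table
    INSERT INTO dim_customer (customer_key, customer_no, customer_name)
    VALUES (-1, 'UNKNOWN', 'Unknown Customer');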
Slowly Changing Dimensions
• Dimension attributes change over time
• Users may want to track these attribute changes, to help them with analysis of the data
• Early on, work with the users to identify the attributes for which they would like to track history
• Educate the users to resist the urge to track ‘everything’, since it will have performance implications
• Do a proper analysis of the source to determine the frequency of change; this helps in identifying the right strategy for handling the changes
Handling Slowly Changing Dimensions
• Type 1 – Overwrite the attribute value
• Type 2 – Track every change by adding a new row
• Type 3 – Track only the last change by adding a new column
• A combination of these techniques can be used in the same dimension table as required
Type 1 – Overwrite the value
• This technique is used if users don’t want to track the history of an attribute
• Match on the source id and replace the value of the attribute with the new value, if there is a change
[ERD (Before / After): DIM_CUSTOMER (CUSTOMER KEY, CUSTOMER NO, CUSTOMER NAME, AGE, INCOME RANGE, GENDER, MARITAL STATUS) joined to FACT DAILY SALES (TRAN DATE KEY, PRODUCT KEY, CUSTOMER KEY, DEMOGRAPHIC KEY, ZIP KEY, HOUR KEY); the attribute value is simply overwritten in place]
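A hedged sketch against the illustrative dim_customer above, assuming it also carries a marital_status attribute and, for the Type 2 contrast, row_start_date/row_end_date columns (all assumptions, not from the slides):

    -- Type 1: overwrite the attribute in place, matching on the natural key; no history kept
    UPDATE dim_customer
       SET marital_status = 'M'
     WHERE customer_no = '10042';

    -- Type 2, for contrast: expire the current row and add a new one with a new surrogate key
    UPDATE dim_customer
       SET row_end_date = CURDATE()
     WHERE customer_no = '10042'
       AND row_end_date IS NULL;

    INSERT INTO dim_customer
        (customer_key, customer_no, customer_name, marital_status, row_start_date)
    VALUES
        (50123, '10042', 'Jane Doe', 'M', CURDATE());   -- 50123: next key from the load process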
Multi Sourced Dimension tables
• In some instances, one dimension table is populated from more than one source
• Identify rules for matching dimension records between source systems
• Determine survival rules for the dimension attributes
Data Structures - Summary Tables
[Diagram: daily transaction fact rows (more detail) are summarized into rollup tables such as sum by Product, sum by Household and sum by Day; weekly and monthly rollups carry progressively less detail]
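A minimal sketch of one such rollup, aggregating the illustrative daily fact into a monthly summary table by product (all names are assumptions carried over from the earlier sketches):

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE sum_monthly_sales (
        month_key       INT NOT NULL,          -- e.g. 201004
        product_key     INT NOT NULL,
        transaction_qty BIGINT,
        transaction_amt DECIMAL(15,2),
        PRIMARY KEY (month_key, product_key)
    ) ENGINE=MyISAM;

    -- roll the daily transactions up to monthly grain
    INSERT INTO sum_monthly_sales (month_key, product_key, transaction_qty, transaction_amt)
    SELECT d.calendar_year * 100 + d.calendar_month,
           f.product_key,
           SUM(f.transaction_qty),
           SUM(f.transaction_amt)
      FROM fact_daily_sales f
      JOIN dim_date d ON d.date_key = f.tran_date_key
     GROUP BY d.calendar_year * 100 + d.calendar_month, f.product_key;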
Typical ETL Architecture
[Diagram: source relational databases, flat files and schemas in an RDBMS feed the ETL layer, which loads dimensional tables (atomic data and aggregates) in the target RDBMS or file system; scheduling, exception handling, restart/recovery and configuration management span the entire flow]
DW/DM Technology
DW/DM Technology Architecture
ETL tool Requirements
Reports
BI tools (Commercial Vendors)
• There are numerous vendors in this space
• Some of the major players: Business Objects, Hyperion, Oracle EEBI, Cognos, SAS
BI tools (OpenSource Options)
• Jasper and Pentaho are the major players
  • Jasper Reports, Jasper Decisions
  • Pentaho BI Suite (Mondrian for OLAP, Dashboard, Weka for Data Mining, Reporting)
• JFreeChart for graphs
• Actuate BIRT (for reporting)
Required Database Functionality
• Data Partitioning
• Columnar Storage
• Data Compression
• Parallel Query
• Specialized data loaders
• Materialized Views (in the MySQL roadmap, but not currently available)
• Specialized analytical functions

Commercial vendors like Oracle have evolved over the last 10 years to support Data Warehousing features. Some built-in features, plus support from 3rd-party storage engines, are making MySQL a viable database platform for DW/DM.
Partitioning
• ‘Not if to partition, but how to partition’
• Partitioning benefits
  • A series of separate tables under one “view”
  • Partition schemes: RANGE, HASH, LIST, KEY, COMPOSITE
  • An alternative to managing one extremely large table
  • Targets fact/detail tables most of the time
  • Partition pruning helps examine only the required data
  • Easy to manage
• MySQL 5.1 and above supports partitioning (see the sketch below)
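A minimal sketch of native RANGE partitioning on a fact table in MySQL 5.1+ (table name and partition boundaries are illustrative assumptions); a query restricted to one year would be pruned to a single partition:

    -- illustrative sketch only; names and boundaries are assumptions
    CREATE TABLE fact_daily_sales_p (
        tran_date_key   INT NOT NULL,          -- partitioning column
        product_key     INT NOT NULL,
        customer_key    INT NOT NULL,
        transaction_qty INT,
        transaction_amt DECIMAL(12,2)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (tran_date_key) (
        PARTITION p2007 VALUES LESS THAN (20080101),
        PARTITION p2008 VALUES LESS THAN (20090101),
        PARTITION p2009 VALUES LESS THAN (20100101),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );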
Data Loading
• One of the most forgotten and neglected issues
• Perhaps the highest item on the critical path for daily operation of the warehouse
• The database should support fast / incremental loaders optimized for bulk data loads (see the sketch below)
• Some 3rd-party MySQL storage engines have specialized loaders that provide screaming load performance
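For MySQL itself, LOAD DATA INFILE is the built-in bulk loader; a sketch against the illustrative fact table (the file path and column list are assumptions):

    -- disable non-unique index maintenance during the bulk load (MyISAM), then rebuild once
    ALTER TABLE fact_daily_sales DISABLE KEYS;

    LOAD DATA INFILE '/data/extracts/sales_20100415.csv'
    INTO TABLE fact_daily_sales
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (tran_date_key, product_key, customer_key, transaction_qty, transaction_amt, transaction_cost);

    ALTER TABLE fact_daily_sales ENABLE KEYS;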
Data Compression
• Data compression provides enormous storage savings
• Data compression may impact query performance if the server has to uncompress the data to analyze it
• Storage engines such as KickFire are emerging that can answer queries without uncompressing the data
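Within stock MySQL, the ARCHIVE engine is one built-in compression option: it stores rows zlib-compressed and supports only INSERT and SELECT, which suits cold history that is loaded once and read occasionally. An illustrative sketch (names are assumptions):

    -- illustrative sketch only; names and types are assumptions
    CREATE TABLE sales_history_2005 (
        tran_date_key   INT NOT NULL,
        product_key     INT NOT NULL,
        transaction_qty INT,
        transaction_amt DECIMAL(12,2)
    ) ENGINE=ARCHIVE;    -- rows are compressed on insert and uncompressed on read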
Columnar Storage
• Traditional databases write data to disk as rows; columnar storage writes data to disk as columns
• Columnar storage requires less I/O when a query selects only a subset of the columns, thus improving query performance
• 3rd-party storage engines such as KickFire and InfoBright support columnar storage
MySQL Storage Engines
supporting DW/DM
• Internal: MyISAM, Archive, Memory
• 3rd Party: KickFire, BrightHouse, NitroEDB
Feature Comparison
[Matrix: Data Partitioning, Parallel Query, Columnar Storage and Data Compression mapped against the storage engines listed above]