Optimizing ETL for Oracle
Maria Colgan
Data Warehouse Product Management
Agenda
The Three Layers of ETL
Think about Hardware
Think about your Access Methods
Hardware Trends and ETL
Parallel Loading and Partitioning
Q&A
The Three Layers of ETL
The ETL Ecosystem
Your ETL Tool of Choice
The Data Warehouse Database
Hardware and Storage
Balanced Configuration
The weakest link defines the throughput
CPU quantity and speed dictate the number of HBAs and the capacity of the interconnect
HBA quantity and speed dictate the number of disk controllers and the speed and quantity of switches
Controller quantity and speed dictate the number of disks and the speed and quantity of switches
Disk quantity and speed determine the final I/O throughput
[Diagram: servers with HBA1/HBA2 pairs connected through FC-Switch1 and FC-Switch2 to Disk Arrays 1-8]
Database configuration best practices
Minimum of 4GB of memory per core (8GB if using compression)
Interconnect bandwidth should equal I/O bandwidth
Use ASM or RAID 10 to Stripe And Mirror Everything (SAME)
Use AutoAllocate tablespaces (no uniform extent sizes)
Set INITIAL and NEXT to 8MB in the storage clause of large partitioned tables (see the sketch below)
DB_BLOCK_SIZE = 8KB / 16KB
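A sketch of these settings in DDL (the tablespace, table, and sizes are illustrative assumptions, not from the deck):

-- A locally managed AutoAllocate tablespace and a large partitioned table
-- with INITIAL and NEXT set to 8MB in the storage clause.
CREATE TABLESPACE dw_data
  DATAFILE '+DATA' SIZE 100G
  EXTENT MANAGEMENT LOCAL AUTOALLOCATE;

CREATE TABLE sales_fact (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
TABLESPACE dw_data
STORAGE (INITIAL 8M NEXT 8M)
PARTITION BY RANGE (sale_date) (
  PARTITION p2009 VALUES LESS THAN (DATE '2010-01-01'),
  PARTITION p2010 VALUES LESS THAN (DATE '2011-01-01')
);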
Most Bang for the Buck
Transporting data around can be expensive; most performance is gained when loading into the database
Focus should therefore be on:
  Where the data is now
  How to get it into a format that allows the fastest load methods
  Ensuring the database and the data sets are set up for this load
Spend your time wisely and go after the biggest improvements that can be made
Minimize staging of data (writes are expensive)
Bulk Access Methods
[Chart: access methods plotted by bulk performance against heterogeneity]
Ideal access methods (best bulk performance): TTS, Data Pump, flat files
Common methods (more heterogeneous, slower in bulk): JDBC, ODBC, Gateways and DB links, XML files, Web Services
Bulk Access Strategies
Own your access strategy:
  Dictate how you want the data
  Set SLAs for your data access provider
Do not join over database links:
  No parallel access
  Limited functionality when using ODBC
Keep it simple
SQL*Loader or External Tables
And the winner is => External Tables. Why?
  Full usage of SQL capabilities directly on the data (see the sketch below)
  Automatic use of parallel capabilities (just like a table)
  No need to stage the data once more
  Better allocation of space when storing data
  Interesting capabilities such as:
    The use of Data Pump
    The use of a pre-processor
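A sketch of what this looks like (the directory, file names, and columns are illustrative assumptions):

-- An external table over pipe-delimited flat files. Full SQL runs directly
-- against the files, in parallel, with no extra staging step.
CREATE TABLE sales_ext (
  sale_id NUMBER,
  cust_id NUMBER,
  amount  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('sales_01.dat', 'sales_02.dat')
)
REJECT LIMIT UNLIMITED
PARALLEL 4;

-- Filter, join, and aggregate while reading, as with any table:
SELECT cust_id, SUM(amount)
FROM   sales_ext
WHERE  amount > 0
GROUP  BY cust_id;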
Tips for External Tables
File locations and size
Stage data across as many physical disks as possible
When using multiple files, keep the file sizes similar
If sizes differ, list files from largest to smallest in the LOCATION clause
File Formats
Use a format that allows position-able and seek-able scans
Delimit fields clearly and use a well-known record terminator to allow automatic granulation
Consider compressing data files and uncompressing them during loading
Parallel with Compressed Files
When a data file is compressed, Oracle will not parallelize the load within that file (not even with an external table)
Use multiple compressed files per external table to get parallel loading (see the sketch below)
To create these multiple files, either:
  Create the archives in equal chunks when the data is generated and compressed, or
  Use the pre-processing step to dynamically break the master archive into smaller files
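A sketch of the multi-file approach (names are illustrative; zcat is used here rather than plain gunzip because the preprocessor must write the decompressed stream to stdout, and exec_dir is assumed to be a directory object pointing at its location):

-- Four similarly sized gzip files, so up to four parallel servers can each
-- take a whole file to decompress and scan.
CREATE TABLE sales_gz_ext (
  sale_id NUMBER,
  cust_id NUMBER,
  amount  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'zcat'
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('sales_01.dat.gz', 'sales_02.dat.gz',
            'sales_03.dat.gz', 'sales_04.dat.gz')
)
PARALLEL 4;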
Tips for External Tables (2)
The PREPROCESSOR clause: new functionality in 11.1.0.7 and 10.2.0.5
For example, it:
  Lets large files stay compressed, speeding up transport of the files to the load location
  Simplifies the process by encapsulating the unzip step
CREATE TABLE sales_external (…)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir1
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'gunzip'
    FIELDS TERMINATED BY '|'
  )
  LOCATION (…)
);
Tips for External Tables (3)
The ORACLE_DATAPUMP access driver: available since 10.1.x
For example, it:
  Allows Data Pump to be used as an ETL extraction method
  Great for unloading from Oracle to Oracle
CREATE TABLE inventories_xt
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY def_dir1
  LOCATION ('inv_xt.dmp')
)
AS SELECT * FROM inventories WHERE <anything goes>;
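On the target database, the dump file is read back through a matching external table; a sketch (the column list is a hypothetical stand-in and must match what was unloaded, and inventories_target is an assumed destination table):

-- Ship inv_xt.dmp to the target, read it in place, and load with direct path.
CREATE TABLE inventories_in (
  product_id   NUMBER,
  warehouse_id NUMBER,
  qty_on_hand  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY def_dir1
  LOCATION ('inv_xt.dmp')
);

INSERT /*+ APPEND */ INTO inventories_target
SELECT * FROM inventories_in;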
Bulk Data Loading Example
[Diagram: two bulk load flows into the Oracle target (10.2 / 11.1)]
From SQL Server: BCP unload -> compress -> FTP -> uncompress -> external tables
From an Oracle source: Data Pump unload -> compress -> FTP -> uncompress -> external tables
Hardware Trends
Commodity hardware platforms:
  Intel chips, 64-bit Linux OS
Clustered environments:
  Increasing CPU counts
  Increasing memory sizes available
Larger systems:
  A lot more data
  Compute power you didn't think you could have
Hardware Trends and ETL
Fact: External tables are the best way to load the DW
Fact: If you run Linux, you probably run Oracle RAC
Oops: You cannot run a cluster-parallelized SQL statement against an external table unless the files are on shared storage
Oops: A simple, single NFS filer will bottleneck your ETL solution (not enough throughput)
Introducing DBFS (Database Machine)
DBFS (DataBase File System) is a file system interface for storing unstructured data in the Oracle Database
Built on SecureFiles; it is cluster-coherent and scalable with Real Application Clusters
FUSE
Filesystem in Userspace (http://fuse.sourceforge.net/)
Combining DBFS with FUSE offers mountable filesystems for Linux x64 (e.g. Database Machine!)
Configuring DBFS as a staging area Steps on the OS
Check that the FUSE rpm is installed
Add fusermount to your PATH
Add your Oracle OS user to the OS group fuse
Edit the file /etc/fuse.conf and add the line:
  user_allow_other
Create a mount point for the file system owned by the oracle OS user, e.g. /data
Configuring DBFS as a staging area Steps in the database
DBFS should be housed in a separate database
Use DBCA with the OLTP template to create it
Redo logs should be large: 8GB
Create a bigfile tablespace large enough to hold the filesystem
Create a user DBFS (see the sketch below):
  Grant CREATE SESSION, CREATE TABLE, CREATE PROCEDURE, and the DBFS role
  Grant unlimited quota on the tablespace to the user
Create the actual filesystem using the script:
  $ORACLE_HOME/rdbms/admin/dbfs_create_filesystem_advanced.sql
Mount the file system using:
$ORACLE_HOME/bin/dbfs_client
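A sketch of the database-side steps above (the user name, password, and sizes are illustrative):

-- Bigfile tablespace to hold the filesystem, plus the DBFS owner and grants.
CREATE BIGFILE TABLESPACE dbfs_ts
  DATAFILE '+DATA' SIZE 200G
  EXTENT MANAGEMENT LOCAL AUTOALLOCATE;

CREATE USER dbfs IDENTIFIED BY dbfs_pwd
  DEFAULT TABLESPACE dbfs_ts
  QUOTA UNLIMITED ON dbfs_ts;

GRANT CREATE SESSION, CREATE TABLE, CREATE PROCEDURE TO dbfs;
GRANT dbfs_role TO dbfs;

-- Then, connected as DBFS, create the filesystem with the packaged script:
-- @$ORACLE_HOME/rdbms/admin/dbfs_create_filesystem_advanced.sql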
Access Strategies Example
[Diagram: load flows into the Oracle target (11.2), staging through DBFS]
From SQL Server: BCP unload -> FTP -> external tables
From an Oracle source: Data Pump unload -> SCP -> external tables
DBFS, running in its own instance, holds the staged files in SecureFiles
Direct Path Load
Data is written directly to database storage, using multiple blocks per I/O request and asynchronous writes
A CTAS command always uses direct path, but an IAS (INSERT AS SELECT) needs the APPEND hint
INSERT /*+ APPEND */ INTO sales PARTITION (p2)
SELECT * FROM ext_tab_for_sales_data;
Ensure you do direct path loads in parallel (see the sketch below):
  Specify the parallel degree either with a hint or on both tables
  Enable parallel DML by issuing the alter session command:
    ALTER SESSION ENABLE PARALLEL DML;
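Putting the pieces together, a sketch (the table names and the DOP of 8 are illustrative):

-- Enable parallel DML for the session, then run a parallel direct path load.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(s, 8) */ INTO sales s
SELECT /*+ PARALLEL(e, 8) */ *
FROM   ext_tab_for_sales_data e;

COMMIT;  -- direct-path-loaded rows are not visible until the commit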
Database Settings for Data Loading
Use parallel loading:
  Create the table with PARALLEL on it, or hint with PARALLEL
  Don't forget: ALTER SESSION ENABLE PARALLEL DML
Use these settings (see the sketch below):
  PARALLEL_EXECUTION_MESSAGE_SIZE = 16KB
  PARALLEL_MIN_SERVERS = 2 * default DOP
  PARALLEL_MAX_SERVERS = max concurrent queries * default DOP
  DB_BLOCK_SIZE = 8KB / 16KB
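As a sketch, assuming a default DOP of 16 and at most 4 concurrent parallel queries:

-- PARALLEL_EXECUTION_MESSAGE_SIZE is static, so it takes an SPFILE change and
-- a restart; DB_BLOCK_SIZE can only be chosen when the database is created.
ALTER SYSTEM SET parallel_execution_message_size = 16384 SCOPE = SPFILE;
ALTER SYSTEM SET parallel_min_servers = 32;   -- 2 * default DOP
ALTER SYSTEM SET parallel_max_servers = 64;   -- 4 concurrent queries * default DOP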
COLLABORATE 10 - IOUG Forum
April 18-22, 2010
Mandalay Bay Convention Center in Las Vegas, NV
5,000 attendees; Hands-on Labs; Boot Camps; Deep Dives; hospitality events; and over 200 exhibits
IOUG is offering an entire conference within a conference: Get Analytical with BIWA Training Days, powered by the IOUG BIWA SIG
Sample sessions include:
  Integrating Essbase & OBIEE+ With Your Data Warehouse Strategy
  OBIEE Security & BI APPS Integration Case Study: Atheros Communications
  Data Visualization Best Practices: Know How to Design and Improve Your BI & EPM Reports, Dashboards and Queries
  De-Mystifying Oracle Business Intelligence Applications
  An ETL Framework using Oracle Warehouse Builder
  Driving Unbelievable ROI through Operational Intelligence with OBIEE
  Oracle Data Mining: Hands-on Lab
Register by March 18 to Save on COLLABORATE 10 IOUG Forum April 18-22, 2010 Las Vegas, NV
Register via offer code BIWA2010 (step one of the registration process) by March 18 to save $740 off the onsite rate
Registration gets your name entered into a drawing for an iPod Touch!
collaborate10.ioug.org