How To Select An Analytic DBMS: Overview, Checklists, and Tips
Curt Monash
Covered DBMS since the pre-relational days. Also analytics, search, etc.
Blogs, including DBMS2 (www.DBMS2.com -- the source for most of this talk). Feed at www.monash.com/blogs.html. White papers and more at www.monash.com.
Our agenda
Why are there such things as specialized analytic DBMS? What are the major analytic DBMS product alternatives? What are the most relevant differentiations among analytic DBMS users? What's the best process for selecting an analytic DBMS?
General-purpose database managers are optimized for updating short rows, not for analytic query performance.
10-100X price/performance differences are not uncommon.
[Chart: Compound Annual Growth Rate, vertical axis 0% to 45%]
Transistors/chip: >100,000 since 1971
Disk density: >100,000,000 since 1956
Disk speed: 12.5 since 1956
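The growth rates on this chart can be reproduced with a little arithmetic. The sketch below reads the three figures as total improvement factors and assumes an end year of 2009; both are assumptions, since the slide gives only the start years. The function name `cagr` is illustrative.

```python
def cagr(improvement_factor: float, years: int) -> float:
    """Compound annual growth rate implied by a total improvement factor."""
    return improvement_factor ** (1.0 / years) - 1.0


# Assumption: the end year is not stated on the slide; 2009 is a guess.
END_YEAR = 2009

# (total improvement factor, start year), read off the slide as lower bounds.
trends = {
    "Transistors/chip": (100_000, 1971),
    "Disk density": (100_000_000, 1956),
    "Disk speed": (12.5, 1956),
}

rates = {
    name: cagr(factor, END_YEAR - start_year)
    for name, (factor, start_year) in trends.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} per year")
```

Under these assumptions, transistors per chip and disk density both compound at well over 30% a year, while disk speed compounds at roughly 5%. That gap is the reason analytic DBMS designs work so hard to avoid random disk access.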
Page size
Materialized views; OLAP cubes
Precalculate results
Tuned MPP (Massively Parallel Processing) is ideal. Recommended configurations are a mixed bag.
Custom or unusual chips (rare); Custom or unusual interconnects; Fixed configurations of common parts
Aster Data; Dataupia; Exasol; Greenplum; HP Neoview; IBM DB2 BCUs; Infobright/MySQL; Kickfire/MySQL; Kognitio; Microsoft Madison
Netezza; Oracle Exadata; Oracle w/o Exadata; ParAccel; SQL Server w/o Madison; Sybase IQ; Teradata; Vertica
Query performance; Update/load performance; Alternate datatypes; Compatibilities; Advanced analytics; Manageability and availability; Encryption and security
Oracle (especially pre-Exadata); IBM DB2 (especially mainframe); Microsoft SQL Server (pre-Madison)
Other specialized indexes; Query optimization tools; Other OLAP extensions; SQL 2003; Other embedded analytics
Drawbacks
Complexity and people cost; Hardware cost; Software cost; Absolute performance
Undemanding performance (and therefore administration too); OLTP-like; Integrated MOLAP; Edge-case analytics
Teradata; DB2 (open systems version); Netezza; Oracle Exadata (sort of); DATAllegro/Microsoft Madison; Greenplum; Aster Data; Kognitio; HP Neoview
Random (hashed or round-robin) data distribution among nodes
Large block sizes
Suitable for scans rather than random accesses
Or little optimization for using the full boat
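To make the two distribution schemes concrete, here is a minimal sketch of hashed versus round-robin row placement. It does not reflect any particular vendor's implementation; the function names, the use of CRC32 as the hash, and the node count of 4 are all illustrative assumptions.

```python
import zlib
from collections import Counter
from itertools import cycle


def hash_distribute(keys, num_nodes):
    """Place each row on a node chosen by hashing its distribution key."""
    # CRC32 is only a stand-in for whatever hash a real product uses.
    return [zlib.crc32(str(key).encode()) % num_nodes for key in keys]


def round_robin_distribute(keys, num_nodes):
    """Place rows on nodes in strict rotation, ignoring the key."""
    nodes = cycle(range(num_nodes))
    return [next(nodes) for _ in keys]


NUM_NODES = 4  # hypothetical cluster size
keys = range(10_000)

hashed = Counter(hash_distribute(keys, NUM_NODES))
rotated = Counter(round_robin_distribute(keys, NUM_NODES))

print("hashed rows per node:     ", dict(sorted(hashed.items())))
print("round-robin rows per node:", dict(sorted(rotated.items())))
```

Round-robin gives a perfectly even spread but scatters equal keys across nodes. Hashing sends equal keys to the same node, which lets joins and aggregations on that key run locally, at the cost of possible skew when some key values are much more common than others.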
Enterprise standards; Vendor size; Hardware lock-in; Total system price; Features
Sybase IQ; Vertica; Infobright; SAND; ParAccel; Kickfire; Exasol; MonetDB; SAP BI Accelerator (sort of)
Bulk retrieval is faster
Pinpoint I/O is slower
Compression is easier
Memory-centric processing is easier
MPP is not as crucial
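A toy example shows why these trade-offs follow from storing data by column. The table, the column names, and the helper functions below are all made up for illustration, and run-length encoding is just one of several compression schemes a column store might use.

```python
from itertools import groupby

# A tiny row-oriented table: (day, country, amount). Data is invented.
rows = [
    ("2009-01-01", "US", 10),
    ("2009-01-01", "US", 12),
    ("2009-01-01", "UK", 7),
    ("2009-01-02", "UK", 3),
    ("2009-01-02", "UK", 9),
]
COLUMN_NAMES = ("day", "country", "amount")

# Columnar layout: each column is stored contiguously.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)
}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def fetch_row(index):
    """Pinpoint retrieval: one lookup per column to rebuild a single row."""
    return tuple(columns[name][index] for name in COLUMN_NAMES)


# Bulk retrieval: the aggregate reads one column and skips the rest.
total_amount = sum(columns["amount"])

day_runs = run_length_encode(columns["day"])
country_runs = run_length_encode(columns["country"])

print(total_amount)
print(day_runs)
print(country_runs)
print(fetch_row(2))
```

The aggregate touches only the `amount` column, the sorted low-cardinality columns collapse to a couple of runs each, and rebuilding a single row costs one access per column.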
One database to rule them all; One analytic database to rule them all; Frontline analytic database; Very, very big analytic database; Big analytic database handled very cost-effectively
Below 1-2 TB, references abound
10 TB is another major breakpoint
5, 15, 50, or 500?
Data freshness
Figure out what you're trying to buy
Make a shortlist
Do free POCs*
Evaluate and decide
*The only part that's even slightly specific to the analytic DBMS category
Current; Known future; Wish-list/dream-list future; People and platforms; Money; Must-haves; Nice-to-haves
Set constraints
Database growth
Reports
Today Future Today Future Latency Users Now that we have great response time
Ad-hoc
Run more models? Model on more data? Add more variables? Increase model complexity?
Which of those can the DBMS help with anyway? What about scoring?
SLA realism
Customer or customer-facing users; Executive users; Analyst users
Cash cost
But purchases are heavily negotiated
Appliances can be good
You might as well consider incumbent(s)
Appliances can be frowned on
Deployment effort
Platform politics
Who matches your requirements in theory? What kinds of evidence do you require?
References? How many? How relevant? A careful POC? Analyst recommendations? General buzz?
What's your tolerance for specialized hardware? What's your tolerance for set-up effort? What's your tolerance for ongoing administration? What are your insert and update requirements? At what volumes will you run fairly simple queries? What are your complex queries like? For which third-party tools do you need support?
Proof-of-Concept basics
The better you match your use cases, the more reliable the POC is.
Most of the effort is in the set-up.
You might as well do POCs for several vendors at (almost) the same time!
Where is the POC being held?
Getting data
Real?
Politics; Privacy
Picking queries
Realistic simulation(s)
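One way to keep a POC comparable across vendors is to run the same named query set through the same timing harness on each candidate. The sketch below is only an illustration of that idea: it uses an in-memory SQLite database as a stand-in for the system under test, and the table, the queries, and the `time_queries` helper are all hypothetical. A real POC would use your own data, your own queries, and realistic concurrency.

```python
import sqlite3
import statistics
import time


def time_queries(conn, queries, repeats=5):
    """Return the median wall-clock seconds for each named query."""
    results = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch all rows so the work is done
            samples.append(time.perf_counter() - start)
        results[name] = statistics.median(samples)
    return results


# Stand-in system under test: an in-memory SQLite table with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("r%d" % (i % 10), float(i)) for i in range(50_000)],
)

# Hypothetical query mix: one fairly simple query, one aggregate.
queries = {
    "simple_lookup": "SELECT COUNT(*) FROM sales WHERE region = 'r3'",
    "aggregate": "SELECT region, SUM(amount) FROM sales GROUP BY region",
}

timings = time_queries(conn, queries)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Reporting the median of several runs damps one-off noise such as cold caches. Single-user timings like these say nothing about behavior under concurrent load, which needs its own test.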
POC tips
Don't underestimate requirements
Don't overestimate requirements
Get SOME data ASAP
Don't leave the vendor in control
Test what you'll actually be buying
Use the baseball bat
Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com