Multi-Terabyte MySQL Data Warehouses - Absolutely! Presentation

Uploaded by yejr
© Attribution Non-Commercial (BY-NC)

Multi-terabyte Data Warehouses on MySQL?

Absolutely!
Agenda

 Data Warehousing Today

 Traditional Data Warehouse Solutions

 A New Approach to Multi-Terabyte Data Warehouses


A Look at the Market

 The worldwide database market was $18.8 billion in 2006
  $11.1 billion for OLTP
  $7.7 billion for data warehousing
 41% of the database budget is spent on data warehousing
 Data warehousing is the number one area of CIO spend in North America in 2007 and 2008
 The data warehousing market is growing twice as fast as the OLTP database market
BI is Not Just for the Boardroom

 BI started as a strategic, decision-support tool, used to create canned reports by executives and analysts to guide the ship

 Today, BI is mission critical, and serves users across the enterprise, used to support not only traditional analytics but also daily, operational decision making

 These changes in use have brought changes in infrastructure requirements

 Traditional RDBMSs have trouble making the grade


How is the Data Warehouse different?

OLTP
 Many simple transactions of exactly the same type
 A lot of tuning (data model, indexes, partitions) in order for the specific transaction to perform well
 Focus on current data
 Scaling through massive hardware and multiple copies of the database (e.g. online gaming systems have one database instance for every customer)
 Interface to an application, often custom

Data Warehouse
• Many queries - all very different, unpredictable, and always changing
• Queries are very complex - lots of joins, group bys, where clauses
• Focus on history over many days, months, years
• Interface to users and many different Business Intelligence tools
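The contrast above can be sketched in runnable form. This is a minimal illustration using SQLite with an invented `orders` table (the schema and data are assumptions, not from any system described in this deck): the OLTP side is a single predictable point lookup, while the warehouse side is an ad hoc aggregate over history with a where clause and a group by.

```python
import sqlite3

# Hypothetical table, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(order_id INT, customer_id INT, amount REAL, order_date TEXT);
INSERT INTO orders VALUES
  (1, 100, 25.0, '2007-01-10'),
  (2, 101, 30.0, '2007-02-11'),
  (3, 100, 15.0, '2007-02-20');
""")

# OLTP: one simple, predictable transaction touching a single current row.
oltp = con.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# Warehouse: an ad hoc aggregate over history (where clause, group by, order by).
dw = con.execute("""
    SELECT customer_id, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE order_date >= '2007-01-01'
    GROUP BY customer_id
    ORDER BY total DESC
""").fetchall()
```

At warehouse scale, the second query shape touches most of the table rather than one row, which is why tuning strategies that work for OLTP (point indexes) fall short.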
The Data Warehousing Challenge

“Volume of the world’s data doubles every three years. Ninety-two percent of new information is stored in magnetic media...organizations face a simple problem: what to do with all the data.”

Industry Research

“Collecting and analyzing information that enables your organization to better lead, decide, measure, manage and optimize its overall efficiency is a major financial and competitive differentiator. The faster an enterprise can gather and use relevant information, the faster it will be able to reduce costs and increase profits.”

Gartner
The Problem: Data Warehouses
are Strained

Data Volume + Complexity = Trouble

Data is growing exponentially, and users are asking more complex questions. As a result:
 Data is aggregated and deleted
 Data is archived and not usable
 Complex queries are blocked
 Complex queries don’t perform
What do the current limitations mean for stakeholders?

 Users
  Do not get access to the data they need;
  Queries run too slowly;
  Are not allowed to think creatively – to ask new and different questions;
  Are told to wait months for what they want in minutes

 IT
  Besieged with requests for new data sources;
  Feature creep and changing requirements strain resources;
  Analytic system maintenance and tuning affect support for operational systems;

 Executives
  CIOs face service level complaints and rising IT costs;
  Business unit leaders without analytic data fail to achieve objectives;
About Infobright

Founded: 2005
Headquarters: Toronto, Canada; offices in Boston, MA and Warsaw, Poland

A highly scalable, analytic data warehouse built on MySQL. Brighthouse is designed to deliver fast response for ad hoc, complex queries without burdening IT.

Major Benefits
 Simplicity: “Load and Go”, no new schemas, no indices, no data partitioning, easy to maintain
 Scalability: Ideal for databases of 500 GB - 50 TB
 Low TCO: Industry-leading compression, less storage, industry-standard servers, low software costs, minimal ongoing operational expenses

MySQL/Sun: Key partner. Leverages MySQL connectivity to ETL and BI; provides MySQL customers with a scalable, enterprise-ready data warehouse.

Data Warehousing: Part of the Problem

[Diagram: more data (clickstream and log files, existing data warehouses, external sources) and more kinds of output needed by more business users]

Traditional Data Warehousing is:
 I/O intensive, write centric
 Labor intensive: heavy indexing and partitioning
 Hardware intensive: massive storage; big servers
Real Life Example

Background: A large internet marketing company was performance driven and used data from 155 million online consumers worldwide to do sophisticated analytics and advanced targeting, creating value for both marketers and publishers. Their operational systems could no longer handle both operations and reporting; the company was unable to execute queries against large data volumes.

IT Challenge: The volumes of data (32 million visitors/day and 140 million actions) exceeded the capabilities of the production system. In addition, staff could not keep pace with the needs of the users; there was no ad hoc query ability and therefore no ability to compete using analytics.

“Data is the difference. The difference between a campaign that meets your objectives vs. one that blows them away. Between paying $9 vs. $60 for a new customer. Between predicting what sites to advertise on vs. knowing you’ve put the right message in front of the right person no matter what site they are on.”
Desired Queries:
 #1 - Campaign Effectiveness:
  Goal was to determine the optimum number of times to show an ad to get the best results
  Actual query example: analyzed 2 billion rows of campaign frequency by date, to look at 5 campaigns in order to determine how many times a user saw each campaign
 #2 - User Demographics by Campaign:
  Counts users by different demographic categories
  Very wide range of possible results across a varying range of rows
  Two actual query examples:
   User entered an incorrect campaign number. The search was performed against 1.3 billion rows in the user campaign aggregate table and the result was a null set
   Largest campaign (highest results returned), where 89 million rows (11% of the entire table) in user campaign were selected and joined to 57 million rows in the user dimension table
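Query #1 above boils down to a frequency histogram: for each campaign, how many users saw it how many times. A minimal sketch with sqlite3 and invented impression data (the `impressions` table and its columns are assumptions, not the customer's actual schema; the real query ran against billions of rows):

```python
import sqlite3

# Hypothetical impression log, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE impressions(user_id INT, campaign_id INT, seen_date TEXT);
INSERT INTO impressions VALUES
  (1, 10, '2008-05-01'), (1, 10, '2008-05-02'), (1, 10, '2008-05-03'),
  (2, 10, '2008-05-01'),
  (2, 20, '2008-05-02'), (3, 20, '2008-05-02');
""")

# For each campaign: how many users saw it exactly N times?
# Inner query counts per-user frequency; outer query buckets users by it.
freq = con.execute("""
    SELECT campaign_id, times_seen, COUNT(*) AS users
    FROM (SELECT user_id, campaign_id, COUNT(*) AS times_seen
          FROM impressions
          GROUP BY user_id, campaign_id)
    GROUP BY campaign_id, times_seen
    ORDER BY campaign_id, times_seen
""").fetchall()
```

The nested group-by is exactly the shape the deck calls "very complex" for a row store: it forces a scan of the full fact table, which motivates the columnar approach described later.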
Traditional Data Warehouse Approach

 Identify the reporting requirements


 Determine the data needed
 Design the data warehouse:
 Extract-Transform-Load
 Data Model (Logical and Physical)
 Canned reports and BI tools

…then
 Revise the model as reporting requirements change and data
grows:
 Add indexes
 Partition data to improve performance
 Restrict users!
Traditional Data Warehouse Approach

 Results:
 Software costs well known and predictable but...
 Management and support costs spiral:
 Partitioning strategies
 Indexing strategies
 Additional data marts
 More hardware
 Business user satisfaction declines as restrictions are placed on:
  Ad hoc query capabilities
  Volume of historical data that can be queried
  Time lag between business requirement and system delivery
 With this particular client, their existing systems were unable to handle this workload at all
Market Evolution

From working harder to working smarter:

 Traditional: all-purpose RDBMS - resource intensive, lots of DBA time
 Hardware Advances: divide and conquer on lots of hardware (MPP) - nothing to address the underlying issues
 Database Advances: extending database concepts - incremental improvements, still inflexible
 Data Warehouse Innovator: a radical new approach - working smarter
What to Look for in a New Approach

 Leverages a column approach
 Automatically creates structures that:
  find needed data
  respond to all queries
  are always ready
 Has a small footprint
 Uses existing infrastructure
 Is easy to set up and maintain

The Analytic Data Warehouse


A New Approach: Introducing Brighthouse

Working Smarter, Not Harder - a scalable solution without scaling IT:

 Better Analytics
 Faster Response
 Decreased IT Burden
 Smaller Footprint

The Infobright Analytic Data Warehouse


How Brighthouse Works Smarter

Smarter architecture:
 Load data and go - no indices or partitions to build or maintain
 Knowledge Grid: statistics and metadata “describing” the super-compressed data, created automatically as data is loaded
 Data Packs: data stored in manageably sized, highly compressed packs
 Data compressed using algorithms tailored to each data type; up to 40:1 compression reduces storage
 Open architecture leverages off-the-shelf hardware
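The idea of compression algorithms tailored to the data type can be sketched with two classic codecs (a toy illustration of the principle, not Infobright's actual compression): run-length encoding suits low-cardinality text columns, while delta encoding suits sorted or slowly changing integers such as timestamps.

```python
def rle_encode(values):
    """Run-length encoding: effective for low-cardinality columns."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs

def delta_encode(values):
    """Delta encoding: effective for sorted or slowly changing integers."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# A column of repeated country codes compresses well with RLE ...
countries = ["CA"] * 5 + ["US"] * 3
# ... while an increasing timestamp column yields tiny deltas.
stamps = [1000, 1001, 1003, 1006]
```

Because a column store keeps each column's values together, each data pack holds values of one type, so a per-pack, per-type codec choice like this is natural; that locality is what makes the high compression ratios claimed above plausible.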
How Brighthouse Works Smarter

 Query received by Brighthouse
 Optimizer iterates over the Knowledge Grid
 Only the data packs needed to resolve the query are decompressed
 Often, query results can be determined from the Knowledge Grid alone
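The pruning step above can be sketched with per-pack min/max statistics (a simplified illustration under my own assumptions; the real Knowledge Grid keeps richer metadata than min/max). Packs whose statistics rule them out are skipped, packs whose statistics fully satisfy the predicate are answered from metadata alone, and only the remaining "suspect" packs are opened.

```python
# Split a column into fixed-size "data packs" and keep min/max per pack.
PACK_SIZE = 4

def build_packs(column):
    packs = []
    for i in range(0, len(column), PACK_SIZE):
        chunk = column[i:i + PACK_SIZE]
        packs.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return packs

def count_greater(packs, threshold):
    """COUNT(*) WHERE value > threshold, opening as few packs as possible."""
    total, opened = 0, 0
    for p in packs:
        if p["max"] <= threshold:   # irrelevant pack: skip entirely
            continue
        if p["min"] > threshold:    # fully relevant: answer from metadata alone
            total += len(p["rows"])
            continue
        opened += 1                 # suspect pack: must open ("decompress") it
        total += sum(1 for v in p["rows"] if v > threshold)
    return total, opened

packs = build_packs([1, 2, 3, 4, 10, 11, 12, 13, 5, 9, 2, 14])
```

With `threshold=8`, the first pack is skipped, the second is counted from its metadata, and only the third is opened - the same reason the deck can claim many queries never decompress most of the data.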
Brighthouse is Easy on IT

[Diagram: data flows in from existing data warehouses, clickstream/log files, and external sources via an ETL platform connector; BI tools connect through BI connectors]

No strain on IT:
 No need for physical data modeling
 Runs on standard hardware
 Works with existing BI and ETL platforms
 MySQL “wrapper”:
  No need to learn a new database system
  Leverage mature tools
Brighthouse Architecture and MySQL

MySQL selected due to:
 Mature connectors, tools, and resources
 Interconnectivity and certification with BI tools
 Commercial OEM license protects our IP
 Most broadly used open-source DB (12 million users)

Benefits:
 Greatly improved time to market
 Development focused on competitive differentiators
 Sell to MySQL customers experiencing scalability problems
Real Life Example: Results with Brighthouse

What type of queries did they want to run? How did Brighthouse perform?
 #1 Users by Campaign: analyzed 2 billion rows of campaign frequency by date to look at 5 campaigns
 #2 Demographics by Campaign:
  Against 1.3 billion rows in the user campaign aggregate table, where the result is a null set
  Largest campaign (highest results returned), where 89 million rows (11% of the entire table) in user campaign were selected and joined to 57 million rows in the user dimension table

Overall benefits of the Infobright solution:
• The ability to track impressions/clicks/actions by user and thereby more intelligently provide their clients with reliable data. They can now compare costs to clickthroughs to optimize banner purchasing and improve cost performance.
• The ability to optimize advertising dollars for their clients.
• The ability to make data accessible via SQL and BI tools (Pentaho) to end users.
• The ability to lower the cost of queries - eliminating DBA involvement and thereby reducing the personnel costs associated with ad hoc queries.
• Brighthouse supported a 10 TB data warehouse with a single $20K industry-standard server and much less storage than alternatives.
Customer Results: Query Response Time vs. Oracle

 Oracle time: 136 sec
 Brighthouse 3.0 time: 16.8 sec

Query Speed as Volume of Data Grows

[Chart: average response time (in secs) vs. data volume - response time increases as data volume grows, with Brighthouse holding a performance advantage]

• Queries were moderately complex, with at least two table joins and two or more where clauses
• Tables were indexed
• Response time represents the average response time of queries
Brighthouse Load Time Remains Constant

[Chart: load time (in secs) vs. volume (rows in millions) - Brighthouse load time stays constant as volume increases, saving processing time during load over conventional databases]

• Comparison of load to a single table. Data was loaded in 10 million row chunks
• Table had a single index
Brighthouse is Fast

 Brighthouse is designed specifically to quickly run complex queries on large data sets
 The Knowledge Grid’s small chunks of highly compressed data are fast and easy to manipulate
 The Knowledge Grid optimizer iteratively optimizes the query execution plan
 Only the data packs needed to answer a query are opened; often, query results can be determined from the Knowledge Grid alone
 Users enjoy fast response times no matter how complex or spontaneous their query

“Each month we process and analyze data generated by 20 billion online transactions... We are pleased by Brighthouse’s performance and the fact that we now can get answers to questions we want to ask.”
-- Ola Udén, CTO of TradeDoubler
Brighthouse is Flexible

 Brighthouse ensures changing and complex analytic requirements are supported with fast response times
 The Knowledge Grid is built on the fly, creating a layer of statistics and metadata across all columns and rows
 The Knowledge Grid obviates the need for indexing, data partitioning, or other physical data structures
 Data no longer needs to be off-loaded or archived
 Users can ask any question of all the data

“Brighthouse allows us to do very complex analyses on over 30 terabytes of data”
-- Jay Webster, General Manager, BlueLithium
Brighthouse is Simple

 Brighthouse eliminates the complications, cost and disruption IT teams must endure to support complex queries
 No DBA resources required to build indices and data partitions in response
to user requirements
 No complicated performance variables to tune; “No Knobs”
 Leverages MySQL ease of use, connectivity, and supported BI tools
 Runs on off-the-shelf hardware
 Reduced complexity frees IT resources and significantly lowers
lifetime TCO
How does Brighthouse impact TCO?
 Hardware footprint 20 to 50 times smaller
 Fewer DBA resources required
 40 – 60% reduction in one-time build
 Up to 90% reduction in ongoing support
 Support for existing infrastructure
 Load and Go
 Improved SLAs – immediate response vs. weeks or months

[Chart: TCO cost categories - software, hardware, ETL and data changes, physical modeling, tuning]
Prove it!

RAPID START:
 Call us – we’ll walk you through a few questions to mutually determine if our technology is a good fit
 Agree on process – e.g. your place or ours?
 Load and go – load your data, run your queries
 Summarize results – performance, compression, load times
 Next steps – did we prove it?

Contact Us
 [email protected]
 416.596.2483, x. 225

Download Claudia Imhoff paper: http://www.infobright.com