DB Revolution Report Final
Mark Madsen
Robin Bloor, Ph.D.
WHITE PAPER
Contents

Introducing the Database Revolution
A Summary of Findings
A Quick Look Back
The Sea Change
The Scale-Out Architecture and the New Generation
The Relational Database and the Non-Relational Database
The Relational Database (RDBMS)
The Non-Relational Database
OldSQL, NewSQL and the Problem of NoSQL
The Influence of MapReduce and Hadoop
HBase, Hive, Pig and Other Hadoop Database Developments
Key Value Stores and Distributed Hash Tables
Horses for Courses
Database Workloads
Workload Characteristics
Read-Write Mix
Data Latency
Consistency
Updatability
Data Types
Response Time
Predictability
In Summary
Conclusion
Data Flow Architecture
Introducing the Database Revolution

This paper has two purposes:

How to understand the new generation of databases that have recently emerged in the marketplace. We cover both those sometimes described as NoSQL databases and also column-store databases that are like the traditional relational databases to which we have become accustomed. Our intention is to explain the overall market shift and expansion and, in consequence, what the database market looks like today.

Guidance on how to view database workloads and technologies, and how they line up. We attempt to provide rules of thumb that may help the reader determine which class of technology is likely to fit which workloads.
A Summary of Findings
This paper is the result of a research program driven by Mark Madsen and Robin Bloor,
involving interviews with vendors, interviews with customers, four webcasts, two of which
took the format of round tables with other respected database technology analysts, and a
survey of database users.
We reached several key conclusions, listed here in summary:
The new database products include some that implement the relational model of data (we are terming these products "NewSQL" databases) and some that choose not to do so (NoSQL databases). Having said that, we do not believe the term NoSQL is informative, since it covers too wide a range of capabilities to be useful.
We do not believe at this point in time that the older universal database products
(grouped under the term "OldSQL") have become outmoded. They are good at what
they do, but they lack scalability for some specialized or very large workloads.
[Figure: a scale-out database architecture, in which a query against a table is split into sub-queries that execute in parallel on separate servers, each with its own CPUs, common memory, cache and data; the database scales up and out by adding more servers.]
The evolution of computer hardware, in combination with its decline in cost (per unit of work), supports more cost-effective scale-out architectures.
Because the driver of this database renaissance is different, we are currently uncertain where
it will lead. We do not expect it to lead to a few dominant and fairly similar products. Much of
what is currently driving database innovation is the need for workload-specific solutions or,
as we like to characterize it, "horses for courses."
We have entered an era where many of the new database products are distinctly different.
Some are targeted at processing problems for which the older universal databases are
inappropriate, while others are designed for extreme scalability beyond the capabilities of the
traditional RDBMS. For those who are seeking to select a database for a specific task, we
believe that there are two primary considerations:
1. What is the structure of the data that the database will hold?
2. What are the workloads that the database will be required to process?
We have already identified that traditional universal relational databases (we will refer to these as OldSQL databases) have proven to be excellent workhorses for most transactional
data and also for querying and analyzing broad collections of corporate data. These databases
are characterized by the use of SQL as the primary means of data access, although they may
have other data access features.
There are also relatively new species of relational databases that operate differently or
extend the relational model. A key element of many of these databases is new architectures to
extend performance or scalability, most commonly scale-out. They include such products as
Infobright, SAP Sybase IQ, Greenplum, ParAccel, SAND Technologies, Teradata, Vertica,
Vectorwise and others. We categorize these as NewSQL databases, since they employ SQL as
their primary means of access and are primarily relational in nature.
There are also new species of database that specifically choose not to use SQL, or that provide a SQL interface but support other, non-SQL modes of data access as the primary interface. These are commonly categorized as NoSQL databases, a term variously defined as "not only SQL" or "no
SQL at all." Nevertheless, when we examine the new collection of databases that happily
claim to be NoSQL, we discover that they are quite diverse. What they have in common is:
1. Most have been built to scale out using similar techniques to the NewSQL databases.
2. They do not rigidly adhere to SQL. The attitude to SQL varies between vendors. Some
offer a broader set of data access capabilities than is offered by SQL, while some
implement only a subset of SQL.
[Figure: how data model complexity (no join/single table, star schema, snowflake, 3NF schema, nested data, graph data) and data volume map to the NoSQL, OldSQL and NewSQL database categories.]
Map: The map step partitions the workload across all the nodes for execution. This
step may cycle as each node can spawn a further division of work and share it with
other nodes. In any event, an answer set is arrived at on each node.
Reduce: The reduce step combines the answers from all the nodes. This activity may
also be distributed across the grid, if needed, with data being passed as well.
Eventually an answer is arrived at.
The Map stage is a filter/workload partition stage. It simply distributes selection criteria
across every node. Each node selects data from HDFS files at its node, based on key values.
HDFS stores data as a key with other attached data whose structure is undefined (in the sense of not being declared in a schema). Hence it is a primitive key value store, with the records consisting of a head (the key) and a tail (all other data).
The Map phase reads data serially from the file and retains only keys that fit the map. Java
hooks are provided for any further processing at this stage. The map phase then sends results
to other nodes for reduction, so that records which fit the same criteria end up on the same
node for reduction. In effect, results are mapped to and sent to an appropriate node for
reduction.
The Reduce phase processes this data. Usually it will be aggregating or averaging or
counting or some combination of such operations. Java hooks are provided for adding
sophistication to such processing. Then there is a result of some kind on each reduce node.
Further reduction passes may then be carried out to arrive at a final result. This may involve further movement of data between nodes.
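The mechanics above can be illustrated with a small, self-contained sketch in Python. This is not Hadoop code (a real job would use Hadoop's Java APIs against HDFS); it simply simulates the map, shuffle-to-node and reduce steps for a toy aggregation, with records held in memory and node assignment done by hashing the key.

```python
from collections import defaultdict

# Toy records: (key, attached data), as they might sit in a key value store.
records = [("web", 120), ("mail", 30), ("web", 45), ("ftp", 7), ("mail", 12)]

NODES = 2  # pretend there are two reduce nodes

def map_phase(records):
    """Emit (key, value) pairs; a real job hooks custom selection logic in here."""
    for key, value in records:
        yield key, value

def shuffle(pairs):
    """Route each pair to a reduce node by hashing its key, so that identical
    keys end up on the same node."""
    nodes = defaultdict(list)
    for key, value in pairs:
        nodes[hash(key) % NODES].append((key, value))
    return nodes

def reduce_phase(pairs):
    """Aggregate the values for each key (here, a simple sum)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

partial_results = [reduce_phase(pairs) for pairs in shuffle(map_phase(records)).values()]
print(partial_results)  # e.g. [{'web': 165}, {'mail': 42, 'ftp': 7}] -- one partial result per node
```

A further "reduce" over the partial results would merge the per-node answers into a single final answer, which is the role of the additional reduction passes described above.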
Key value store: A key value store is a file that stores records by key. The record
consists of a key and other attached information: a key value pair. The structure of the
attached data is not explicitly defined by a schema; in effect it is a blob of data. The HDFS within Hadoop is a key value store. The primary benefits of such a file are that it is relatively easy to scale in a shared-nothing fashion, it delivers good performance for keyed reads, and developers have more flexibility when storing complex data structures.
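As an illustration of the idea (a hypothetical class, not the API of any particular product), the sketch below stores opaque byte values against keys; the store itself never interprets the "tail" of the record, so it is the application that decides how the attached data is encoded.

```python
import json

class KeyValueStore:
    """Toy key value store: keyed reads and writes, values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value          # no schema: the store never looks inside the value

    def get(self, key: str) -> bytes:
        return self._data[key]

store = KeyValueStore()
# The application decides how to encode the "tail" of the record.
store.put("user:42", json.dumps({"name": "Ada", "visits": 17}).encode())
print(json.loads(store.get("user:42")))   # {'name': 'Ada', 'visits': 17}
```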
[Figure: a typical enterprise data flow architecture, in which transactional systems and applications (each backed by a file or DBMS) feed a staging area, an operational data store and a data warehouse, which in turn feed data marts, OLAP databases, personal data stores and BI applications; the architecture spans structured data and unstructured content.]
Structured data: Structured data is data for which an explicit structure has been
declared in a database schema. In other words, the metadata for every element and its
storage is accessible; its structure has been formally declared for use by multiple
programs.
Unstructured data: Unstructured data constitutes all digital data that falls outside the
definition of structured data. Its structure is not explicitly declared in a schema. In
some cases, as with natural language, the structure may need to be discovered.
Soon after XML (the eXtensible Mark-up Language) was invented, designers realized that XML could be used to carry metadata along with the data. The data would then be self-describing. This gave rise to another form of structured data aside from that described explicitly in a database schema. There are databases, such as MarkLogic, which use an XML-based schema to define the structure of the data they contain. We can refer to such structured data as XML-defined data.
There is an important distinction between data defined by SQL's data definition statements
and data defined by XML. The SQL schema defines the data for use within the associated
database, whereas XML defines data at any level of granularity from a single item through to
a complex structure such as a web page. XML is far more versatile in that respect and it is
relatively simple to use it to define data that is exported from a database.
Further, XML may be used to define many things about the data to which it applies, such as
page mark-up for display of data in a browser. Consequently it became a dominant standard
for use in information interchange, and because XML is an extensible language it has been
extended for use in many different contexts.
Soon after the advent of XML, a query language, XQuery, was developed for querying XML-defined data. This is a new query language developed in the XML Query Working Group
(part of the World Wide Web Consortium) and it specifically uses XML as the basis for its data
model and type system. So XQuery is based on XML just as SQL is based on the relational
model of data. However, XQuery has no concept of relational data. Because of that, an extension of SQL, SQL/XML, was developed, designed for SQL programmers and intended to
allow them to query XML data stored within a relational database. SQL/XML is included in
the ANSI/ISO SQL 2003 specification.
At this point in time, the use of XML is not as widespread as the use of SQL. However, many
of the more developer-oriented databases use JSON (the JavaScript Object Notation) rather
than XML for data manipulation and interchange. There is an important issue here. SQL
schemas prove to be very useful at the logical level to provide a basis for set-oriented data
manipulation, but do not define data at the physical level particularly well. The physical
definition of data is specific to the database product. XML is broader in some ways as a logical
definition of data, but is cumbersome at the physical data storage level. JSON, which is object
oriented, is less cumbersome than XML at the physical level, but lacks logical information.
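To make the contrast concrete, the following sketch (standard-library Python, with invented field names) renders the same self-describing record first as XML and then as JSON; in both cases the element or field names carry the metadata along with the data, with JSON being the terser of the two.

```python
import json
import xml.etree.ElementTree as ET

# One record, described two ways.
order = {"id": 1001, "customer": "Acme", "total": 249.95}

# XML: every value is wrapped in a named element.
root = ET.Element("order")
for name, value in order.items():
    ET.SubElement(root, name).text = str(value)
print(ET.tostring(root, encoding="unicode"))
# <order><id>1001</id><customer>Acme</customer><total>249.95</total></order>

# JSON: the same metadata, carried as field names.
print(json.dumps(order))
# {"id": 1001, "customer": "Acme", "total": 249.95}
```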
Database Workloads
The most important criterion in selecting a database is whether it will be able to handle the
intended workload. This is not a simple criterion because workloads have multiple
dimensions and there are different architectural approaches to managing these workloads.
Aside from streaming data (which we ignore in this report for reasons of space), workloads
can be classified into three primary groups:

Online transaction processing (OLTP): OLTP workloads consist of large numbers of short transactions that both read and write data, with writes typically making up 20% to 30% of the activity. Response times are expected to be sub-second, data must be available for query as soon as it has been inserted or updated, and access patterns are highly repetitive and predictable.
Business intelligence (BI): Originally this was viewed as a combination of batch and
on-demand reporting, later expanding to include ad hoc query, dashboards and
visualization tools. BI workloads are read-intensive, with writes usually done during
off-hours or in ways that don't compete with queries. While quick response times are
desired, they are not typically in the sub-second range that OLTP requires. Data access
patterns tend to be unpredictable; they often involve reading a lot of data at one time,
and can have many complex joins.
Analytics: Analytic workloads involve more extensive calculation over data than BI.
They are both compute-intensive and read-intensive, similar in many ways to BI
except that access patterns are more predictable. They generally access entire datasets
at one time, sometimes with complex joins prior to doing computations. Most analytic
workloads are done in a batch mode, with the output used downstream via BI or other
applications.
Relational databases have been the platform of choice for all three workloads over the past
two decades. As workloads grew larger and more varied, the databases kept pace, adding
new features and improving performance. This development led to the database market of
fairly recent years, with a small number of large database vendors offering general-purpose
RDBMS products designed to support everything from transaction processing to batch
analytics.
Over the last decade, companies pushed workloads past the capabilities of almost all of
these universal databases. Workload scale is larger and the required scope broader, making it
difficult for the traditional RDBMS to support both the past use cases and new use cases that
exist today.
Consequently, software vendors have developed new database products to support
workload-specific needs. By limiting the scope to a single workload, these vendors narrow
the set of requirements that must be met and expand their technology and design choices.
Some choices are poor for one workload but good for another.
These choices were not adopted by the older RDBMSs even though they would be optimal
for a specific workload. Instead, a tradeoff was made for breadth of scope against capability
for a given workload. Such tradeoffs manifest themselves in the RDBMS as poor performance
at extreme scale for a single workload or at moderate scale when more than one simultaneous
workload is involved.
Workload Characteristics
The challenge for any product is that different workloads have different characteristics,
leading to conflicts when trying to support a mixed workload. Supporting a single workload,
assuming the database is powerful enough and appropriate for the workload, is pain free, but
there is usually a mix. The following seven characteristics are key to defining workloads.
Read-Write Mix
All workloads are a mix of reads and writes. OLTP is a write-intensive workload, but
writing data on most OLTP systems makes up only 20% to 30% of the total. BI and analytics
are thought of as read-only, but the data must be loaded at some point before it can be used.
The difference is that most BI systems write data in bulk at one time and read data
afterwards. OLTP reads and writes are happening at the same time. The intensity of reading
and writing and the mix of the two are important aspects of a workload. Business
intelligence-specific databases designed to handle read-intensive work are often designed to
load data in bulk, avoiding writes while querying. If the writes are done continuously
throughout the day rather than in batch, poor query performance can result.
Conventional workloads are changing. Operational BI and dashboards often require up-to-date information. Analytic processing is done in real time as part of the work in OLTP
systems. The workload for an operational BI application can look very similar to an OLTP
application.
Many of today's analytic workloads are based on log or interaction data. This high volume
data flows continuously, so it must be written constantly. Continuous loading is the extreme
end of the spectrum for write intensity. Likewise, large-scale analytics, particularly when
building models, will read entire datasets one or more times, making them among the most
read-intensive workloads.
Data Latency
Data latency is the time between the creation of data and its availability for query. Applications can
have different tolerances for latency. For example, many data warehouses have long latencies,
updated once per day. OLTP systems have short latencies, with the data available for query as
soon as it has been inserted or updated.
Longer latency requirements mean more options are available in a database. They allow for
the possibility of incremental updates or batch processing and the separation of data
collection processes from data consumption processes. Short latencies impose more
restrictions on a system.
Consistency
Consistency applies to data that is queried. Immediate consistency means that as soon as
data has been updated, any other query will see the updated value. Eventual consistency
means that changes to data will not be uniformly visible to all queries for some period of
time. Some queries may see the earlier value while others see the new value. The time until
consistency could be a few milliseconds to a few minutes, depending on the database.
Consistency is important to most OLTP systems because inconsistent query results could
lead to serious problems. For example, if a bank account is emptied by one withdrawal, it must not still appear to hold funds when a second withdrawal is attempted.
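The difference can be sketched with a toy simulation (not any vendor's replication protocol): a write is acknowledged once one replica has it, and a read routed to the second replica sees the old value until replication catches up.

```python
class EventuallyConsistentStore:
    """Two replicas with asynchronous replication, simulated by an explicit step."""

    def __init__(self):
        self.replicas = [{}, {}]
        self.pending = []               # writes not yet copied to replica 1

    def write(self, key, value):
        self.replicas[0][key] = value   # acknowledged as soon as one replica has it
        self.pending.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def replicate(self):
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 0)
store.replicate()
store.write("balance", 100)
print(store.read("balance", 0))   # 100 -- the replica that took the write
print(store.read("balance", 1))   # 0   -- a stale read before replication catches up
store.replicate()
print(store.read("balance", 1))   # 100 -- eventually consistent
```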
Updatability
Data may be changeable or it may be permanent. If an application never updates or deletes
data then it is possible to simplify the database and improve both performance and scalability.
If updates and deletes are a normal and constant part of the workload then mechanisms must
be present to handle them.
Event streams, such as log data or web tracking activity, are examples of data that by its
nature does not have updates. It is created when an event occurs, unlike transaction data in
an OLTP system that might be changed over the lifetime of a process. Outside of event
streams, the most common scenarios for write-once data are in BI and analytics workloads,
where data is usually loaded once and queried thereafter.
A number of BI and analytic databases assume that updates and deletes are rare and use
very simple mechanisms to control them. Putting a workload with a constant stream of
updates and deletes onto one of these databases will lead to query performance problems
because that workload is not part of their primary design. The same applies to some NoSQL
stores that have been designed as append-only stores to handle extremely high rates of data
loading. They can write large volumes of data quickly, but once written the data can't be
changed. Instead it must be copied, modified and written a second time.
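A minimal sketch of that append-only pattern (illustrative only, with hypothetical keys): nothing is modified in place, a "change" is a new version appended to the log, and a read simply returns the most recent version of a key.

```python
class AppendOnlyStore:
    """Records are only ever appended; an update is a copy written as a new version."""

    def __init__(self):
        self.log = []                     # the immutable, ever-growing log

    def append(self, key, value):
        self.log.append((key, value))

    def latest(self, key):
        for k, v in reversed(self.log):   # the newest entry for the key wins
            if k == key:
                return v
        return None

store = AppendOnlyStore()
store.append("sensor:7", {"temp": 21.5})
# "Updating" means copying the record, modifying it and appending it again.
updated = dict(store.latest("sensor:7"), temp=22.1)
store.append("sensor:7", updated)
print(store.latest("sensor:7"))   # {'temp': 22.1}
print(len(store.log))             # 2 -- the old version is still in the log
```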
Data Types
Relational databases operate on tables of data, but not all data is tabular. Data structures can
be hierarchies, networks, documents or even nested inside one another. If the data is
hierarchical then it must be flattened into different tables before it can be stored in a relational
database. This isn't difficult, but it creates a challenge when mapping between the database
and a program that needs to retrieve the data.
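The kind of mapping involved can be sketched as follows (hypothetical field names): a nested order document is flattened into a parent table and a child table that carries a foreign key back to the parent, which is the translation work an application must perform against a relational schema.

```python
# A hierarchical record as an application might hold it.
order = {
    "order_id": 1001,
    "customer": "Acme",
    "lines": [
        {"sku": "A-17", "qty": 2},
        {"sku": "B-42", "qty": 1},
    ],
}

# Flattened into two relational-style tables.
orders_table = [(order["order_id"], order["customer"])]
order_lines_table = [
    (order["order_id"], line["sku"], line["qty"])   # order_id acts as the foreign key
    for line in order["lines"]
]

print(orders_table)        # [(1001, 'Acme')]
print(order_lines_table)   # [(1001, 'A-17', 2), (1001, 'B-42', 1)]
```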
Different types of databases, like object and document databases, are designed to accept
these data structures, making it much easier for an application to store, retrieve or analyze
this data. There are tradeoffs with these databases because they are mostly non-relational.
Being non-relational means that managing unpredictable queries may be difficult or
impossible. They simplify the query to a retrieval or write operation based on a single key.
The benefits of these databases are performance, extremely low latency, application flexibility
and scalability for OLTP workloads.
Response Time
Response time is the duration of a query or transaction and the time taken to return the
result of the operation. There are three coarse ranges of response time: machine speed, interactive speed and batch speed. Machine speed is measured in microseconds to milliseconds, while interactive speed is on the order of seconds and batch speed is measured in minutes or hours.
Predictability
Some workloads have highly predictable data access patterns. For example, OLTP access
patterns are usually highly repetitive because there are only a few types of transaction,
making them easier to design for and tune. Dashboards and batch reporting will issue the
same queries day after day. The repetition allows more options to design or tune a database
since the workload can be anticipated.
When queries are unpredictable, as with ad hoc query or data exploration workloads, the
database must be more flexible. The query optimizer must be better so it can provide
reasonable performance given unknown queries. Performance management is much more
difficult because there is little that can be done in advance to design or tune for the workload.
The repetition of transactions and queries is one of the key parameters for selecting suitable
technologies. The less predictable the data access patterns are, the more likely it is that a
relational model, or one that permits arbitrary joins or searches easily, will be required.
In Summary
These seven items are key defining
characteristics of workloads. Table 1
shows how the ends of the
spectrum align with constraints on a
database for each characteristic. One
or more items on the more
restrictive end of the scale can
significantly limit the available
choice of technologies. A workload
is defined by the combination of
these characteristics and the scale of
the work. Scale exists independently
of the above parameters and at
extremes can complicate even the
simplest of workloads.
Table 1: Workload characteristics and the two ends of each spectrum

Characteristic  | Less Restrictive | More Restrictive
Read-Write Mix  | Low              | High
Data Latency    | High             | Low
Consistency     | Eventual         | Immediate
Updatability    | None             | Constant
Data Types      | Simple           | Complex
Response Time   | High             | Low
Predictability  | High             | Low
Data Volume
Data growth has been a consistent source of
performance trouble for databases. Data volume can
be looked at in different ways. The simple
measurement of size in gigabytes or terabytes of total
data hides some of the important aspects.
Higher Impact            | Lower Impact
BI workloads             | OLTP workloads
Complex data structure   | Simple data structure
Many tables or objects   | Fewer tables or objects
Fast rate of growth      | Slow rate of growth
Above 5 terabytes (2012) | Below 5 terabytes (2012)
The number of tables and relationships is as important as the amount of data stored. Large
numbers of schema objects imply more complex joins and more difficulty distributing the
data so that it can be joined efficiently. These drive query complexity which can result in poor
optimizations and lots of data movement. This element of data size is often overlooked, but is
one that can significantly affect scalability.
The rate of data growth is important as well. A large initial volume with small incremental
growth is easier to manage than a quickly growing volume of data. Fast growth implies the
need for an easily scalable platform, generally pushing one toward databases that support a
scale-out model.
There are few helpful rules of thumb for what size qualifies as small or large. In general,
when the total amount of data rises to the five terabyte range, universal databases running on
a single server begin to experience performance challenges. At this scale it takes more
expertise to tune and manage a system. It's at this boundary that most organizations begin
looking to alternatives like purpose-built appliances and parallel shared-nothing databases.
The five terabyte threshold is a 2012 figure. As hardware power increases, this boundary figure will also increase. The overall capability of a single server can currently be expected to increase by 20% to 30% per year with respect to data volumes.
Concurrency
The important aspect of concurrency is the number of simultaneous queries and transactions. The number of end users accessing the system is often used as a proxy for these counts, although it is the number of simultaneously active requests that matters.
Higher Impact                           | Lower Impact
More distinct users (i.e., connections) | Fewer distinct users (i.e., connections)
More active users                       | Fewer active users
Many tables or objects                  | Fewer tables or objects
Computation
This scale axis is about computational complexity as well as the sheer volume of
computations. Analytic workload performance is heavily influenced by both complexity and
data volume.
Running complex models over moderate data sets can be a performance challenge. The problem is that many algorithms are nonlinear in performance. As data volume increases, the amount of work done by the algorithm increases even more. A doubling of data can lead to far more than a doubling of the work.
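As a simple illustration of that nonlinearity (a generic sketch, not tied to any particular analytic algorithm), consider an all-pairs comparison of the kind used by many similarity and clustering methods: doubling the number of rows roughly quadruples the number of comparisons.

```python
def pairwise_comparisons(n):
    """Number of comparisons an all-pairs algorithm performs over n rows."""
    return n * (n - 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, pairwise_comparisons(n))
# 1000 499500
# 2000 1999000   -- doubling the rows roughly quadruples the work
# 4000 7998000
```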
In Summary
Taken together, these three axes define
the scale of a workload. Workload scale
may grow at different rates along any or
all of the axes. The important point to
understand when looking at workload
scale is that growth along different axes
imposes different requirements on a
database. These requirements may
eliminate technology categories from
consideration when designing a system.
[Figure: the three axes of workload scale (data volume, concurrency and computation).]
Scaling a Platform
Scale Up                                          | Scale Out
Upgrade to a more powerful server or cluster      | Deploy on a grid or cluster of similar servers
Appropriate for traditional scale-up architecture | Appropriate for recent scale-out products
More expensive in hardware                        | Less expensive in hardware
Eventually hits a limit                           |
[Figure: a scale-up configuration, with servers sharing a disk (SAN), compared with a scale-out server grid in which each server has its own disk.]
Conclusion
We've spent a good deal of time taking a fresh look at the database market and the key
technologies available. We have a number of conclusions regarding database selection:
Much database innovation has taken place in recent years, prompted both by the
continuing evolution of computer hardware and the emergence of new collections of
data that can now be used profitably by businesses. The market has diversified in
response to this, in order to fill important but often niche requirements.
Universal databases based on the relational model still fit the need for most database
implementations, but they have reached scalability limits, making them either
impractical or too expensive for specialized workloads. New entrants to the market
and alternative approaches are often better suited to specific workloads.
Therefore our general advice, if you are considering using a new type of database, is to make
sure you aren't buying for novelty or just to follow the current fad. The IT department has the
[Figure: a tiered data flow architecture in which very low latency (high value) data is held in an in-memory data store, low latency (medium value) data in an analytic data store, and higher latency (low value) data in an ETL data store.]
High data volume: Large data volumes generate two problems. First there is the
physical problem of distributing the data across the available resources so that access
times are consistent. Second, there is the simple fact that if a query accesses a large
percentage of the data, it will take time to process, even with excellent scale-out. The
time taken will increase if there is a large amount of joining and sorting of data
involved in the query, or if there is a high level of analytical processing.
Concurrency: Satisfying a single user is relatively easy. As soon as you have more than
one user, the problems escalate. The database must try to distribute its activity among
all the data access requests that it is concurrently trying to satisfy, ideally ensuring that
the appropriate priority is given to each. Data locking becomes an issue, even if there
are only two users. If this is a read-only workload then locking is not a problem, but
note that data ingest is data writing, so if data ingest is concurrent with other usage
then locking can be an issue. The greater the number of concurrent data accesses the
more difficult it will be for the database.
Scalability and mixed workload: The more homogeneous the workloads, the better a
database will scale. Mixed workloads consisting of short queries, long queries,
complex analytical queries and transactions will impose a scalability limit at some
point. Of note is the balance or conflict between throughput and response time. Some
application workloads require specific service levels which must be met. Other
workloads are more forgiving of specific user latency, but require that throughput
goals are met (e.g., the whole workload must be processed before the start of the
normal working day, but individual latencies do not matter).
Data model: The most effective data model to use depends a great deal upon the database product's capabilities. In the past, data warehouses were not frequently normalized to third normal form. Instead, a star or a snowflake data model was implemented. In these models the majority of data resides in a large fact table with a few other tables linking to it. Some data, such as log file data, can fit into one big table.
Failover: The limit this imposes depends on the true availability requirement for the
data. It used to be acceptable with transaction applications for the database to fail and
for users to wait while data recovery took place. That could take minutes or longer
when recovery was achieved by restoring a reliable backup and applying transaction
logs. That recovery technique was superseded to a large degree by faster methods - for
example, by having a hot stand-by available. In the realm of many servers, the
probability that a server will fail increases in proportion to the number of servers. If
very high availability is required, the database needs to have a strategy for continuous
operation in the event of server failure. This inevitably involves replication of the data
in some way and a database architecture that is aware, at more than one node, of all
the data accesses that are currently in progress, so it can recover any that are impacted
by a node failure.
Database entropy: No matter how good the database, in time data growth combined
with changing and escalating workload demands can impose a limit on scalability.
Entropy may set in faster with some products. It therefore helps to understand what
new or changed workloads are likely to cause degradation of any database product. It
also helps to know whether this can be addressed by a database rebuild and whether
taking time out for a rebuild is allowable, given the availability service level that the
database applications demand.
Cost: There is a limit to every IT budget. While it might be possible, with infinite
dollars, to meet all user performance goals, infinite dollars do not exist. Cost factors
are many and various. They include start-up costs, hardware and networking
resources, software licenses, software support, database management, DBA time,
design costs, and opportunity costs. They also include the incremental costs of scaling
as the system grows in size or number of users.
Implementation time: Finally, there is the time it takes to implement the solution fully.
Some database products are faster to deploy than others and time to implementation
may be a critical consideration.
Data security
Data cleansing
Product support
These naturally form part of the database selection. There may also be other policies the
organization requires you to use for product selection (parameters such as preferred vendors
or vendor size). These may eliminate some products from consideration.
2. Budget
Clearly a budget can be stated simply as a figure, but there are complexities that need to be
sorted out. For example, does it cover the product purchase or also training and consultancy?
Would it be better to have a perpetual license or subscription model that shifts cost from
capital to operational budgets? When is the money available? What is the support budget?
This can vary based on the context of the project. There is also the chance that the solution
may involve more than one product.
4. The Short-List
There are a bewildering number of database products, over 200 at last count, and some
vendors have multiple products. You cannot examine them all in detail. To assist in creating a
list of potential products you need some drop-dead conditions: qualifying characteristics that will eliminate most choices. These can often be created from the data already gathered.

In practice, during database selections we have been involved in, we would create a short-list of products in a two-hour interactive session with the selection team. First we would
define selection criteria and then consider a fairly long list of products, quickly reducing the
list by applying agreed criteria. Often, selecting a short-list would require just a few criteria.
Total cost of acquisition (TCA)/Total cost of ownership (TCO) was always one of these. Does
it look like the database can do the job, technically? That was also always a factor, but rarely
explored in depth. Adequate support was also an important factor for consideration.
33
5. Product Research
There is no need to contact any potential vendors directly until
you have decided on a short-list of products to investigate. You
need to engage with the vendors in order to identify how suitable
each candidate product is for your requirements.
The reason for postponing contact with vendors is that vendors
have their established sales process. From the moment you engage
with them, they will try to influence your decision. This is natural.
However, it is important that you are not derailed from your
product selection process. Our advice is that you create a detailed
step-by-step plan for selection.
This plan will vary based upon your particular organization and
the requirements, but should include most of the following:
Writing and issuing a formal request for proposal (RFP) for
short-listed vendors. It should at minimum include the following:
Table 5: Database Selection Criteria

General: data governance policies; hardware platform requirements; vendor support offered
Performance: data volumes; concurrency; scalability; expected workloads; data model; failover capability
Ecosystem: data architecture of which it will be a part; database entropy
Time Factors: implementation time (design); implementation time (data load)
Cost: total cost of acquisition (TCA); total cost of ownership (TCO)
Note that this is an "RFP lite." It should take little time to prepare. It should be useful to the
vendors that receive it so that they know what you want and need as accurately as you are
able to describe it. If you want to carry out an extensive RFP, you can add a long list of
technical questions for the vendor to answer. Our experience is that the larger vendors will
have a canned list of answers and smaller vendors may be intimidated by the time it will
take them to respond, if they suspect that they may not be chosen anyway.
The responses to the RFP should guide the further product research that needs to be done.
In our view, the following further activities are worth pursuing.
If possible, download a version of the product (if the vendor has a downloadable
version) and have the technically aware people on the selection team gain some
experience with it.
Maintain a list of resolved and unresolved points for each vendor during this process.
8. Proof of Concept
The goal of a PoC is to demonstrate that a product will be capable of managing the
workloads with the kind of data your company intends to store and process. The PoC is also a
chance to gain hands-on experience with the database. This experience can easily shift
purchase decisions. For example, one might discover that development and performance
features are great but managing in production will be very difficult. Managing and
diagnosing problems is one area that often goes untested.
Consequently, any PoC needs to target the most common tasks as well as the most difficult
aspects of the workload. Since the vendor involved will likely need to assist in the exercise,
the goal of the activity must be clearly stated in measurable terms so that it is clear whether
the product was capable of doing the work.
The PoC should be designed to allow the work done to contribute to the final
implementation. Therefore the PoC should mimic the intended live environment to some
degree, so that both the data flow architecture and the performance of the database itself are
properly tested. Too often, we see databases tested using a configuration or environment very
different from what will be deployed in production.
It is also important to ensure that if the selected product fails there is an alternative product
to try next as a fallback. In other words, the PoC needs to be a genuine test of capability with
stated goals and known consequences if a product fails.
9. Product Selection
In our view the selection team should write a short report on what was done to ensure that
the selected product met all the requirements. This report should include the outcome of
evaluation and PoC testing, if applicable, and the recommended configurations.
DataStax offers products and services based on the popular open-source database, Apache Cassandra, that solve today's most challenging big data problems. DataStax Enterprise (DSE) combines the performance of Cassandra with analytics powered by Apache Hadoop, creating a smartly integrated, data-centric platform. With DSE, real-time and analytic workloads never conflict, giving you maximum performance with the added benefit of only managing a single database. The company has over 100 customers, including leaders such as Netflix, Cisco, Rackspace and Constant Contact, and spanning verticals including web, financial services, telecommunications, logistics and government. DataStax is backed by industry leading investors, including Lightspeed Venture Partners and Crosslink Capital, and is based in San Mateo, California. http://www.datastax.com.
Dataversity provides a centralized location for training, online webinars, certification, news and more
for information technology (IT) professionals, executives and business managers worldwide. Members
enjoy access to a deeper archive, leaders within the industry, an extensive knowledge base and discounts
off many educational resources including webcasts and data management conferences.
Infobright's high-performance database is the preferred choice for applications and data marts that
analyze large volumes of "machine-generated data" such as Web data, network logs, telecom records,
stock tick data, and sensor data. Easy to implement and with unmatched data compression, operational
simplicity and low cost, Infobright is being used by enterprises, SaaS and software companies in online
businesses, telecommunications, financial services and other industries to provide rapid access to
critical business data. For more information, please visit http://www.infobright.com or join our open
source community at http://www.infobright.org.
MarkLogic leads the advancement of Big Data with the first operational database technology for mission-critical Big Data Applications. Customers trust MarkLogic to drive revenue and growth through Big Data Applications enabled by MarkLogic products, services, and partners. MarkLogic is a fast growing enterprise software company and ushers in a new era of Big Data by powering more than 500 of the world's most critical Big Data Applications in the public sector and Global 1000. Organizations around the world get to better decisions faster with MarkLogic. MarkLogic is headquartered in Silicon Valley with field offices in Austin, Frankfurt, London, Tokyo, New York, and Washington D.C. For more information, please visit www.marklogic.com.
RainStor provides Big Data management software. RainStor's database enables the world's largest companies to keep and access limitless amounts of data for as long as they want at the lowest cost. It features the highest level of compression on the market, together with high performance on-demand query and simplified management. RainStor runs natively on a variety of architectures and infrastructure including Hadoop. RainStor's leading partners include AdaptiveMobile, Amdocs, Dell, HP, Informatica, Qosmos, and Teradata. RainStor is a privately held company with offices in San Francisco, California, and Gloucester, UK. For more information, visit www.rainstor.com. Join the conversation at www.twitter.com/rainstor.
As market leader in enterprise application software, SAP helps companies of all sizes and industries run better. From back office to boardroom, warehouse to storefront, desktop to mobile device, SAP empowers people and organizations to work together more efficiently. Sybase IQ with PlexQ technology is an analytics grid that extends the power of business analytics beyond a few users to the entire organization with greater ease and efficiency. For more information, please visit: http://www.sybase.com/iq.
Teradata is the world's largest company focused on integrated data warehousing, big data analytics, and business applications. Our powerful solutions portfolio and database are the foundation on which we've built our leadership position in business intelligence and are designed to address any business or technology need for companies of all sizes. Only Teradata gives you the ability to integrate your organization's data, optimize your business processes, and accelerate new insights like never before. The power unleashed from your data brings confidence to your organization and inspires leaders to think boldly and act decisively for the best decisions possible. Learn more at teradata.com.
Robin Bloor is co-founder and principal analyst of The Bloor Group. He has more than 25 years of experience in the world of data and information management. He is the creator of the Information-Oriented Architecture, which is to data what the SOA is to services. He is the author of several books, including The Electronic B@zaar, From the Silk Road to the eRoad, a book on e-commerce, and three IT books in the Dummies series on SOA, Service
Management and The Cloud. He is an international speaker on information management
topics. As an analyst for Bloor Research and The Bloor Group, Robin has written scores of
white papers, research reports and columns on a wide range of topics from database evaluation
to networking options and comparisons to the enterprise in transition.
Mark Madsen is president of Third Nature, a technology research and consulting firm focused
on business intelligence, data integration and data management. Mark is an award-winning
author, architect and CTO whose work has been featured in numerous industry publications.
Over the past ten years Mark received awards for his work from the American Productivity &
Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a
contributing editor at Intelligent Enterprise, and manages the open source channel at the
Business Intelligence Network. For more information or to contact Mark, visit http://
ThirdNature.net.
"
"
Copyright 2012 Bloor Group & Third Nature. All Rights Reserved.