
The Reality behind Oracle Real Application Clusters (RAC) Marketing Messages
October 2004

Abstract:
This document examines the key features of Oracle Real Application Clusters (RAC)
in versions 9i and 10g, and the true benefits and issues surrounding its
implementation. The paper delves into the various capabilities claimed by Oracle
and provides results from independent research on which are real and which are
just marketing messages.
The information contained in this document represents the current view of Microsoft
Corporation on the issues discussed as of the date of publication. Because Microsoft
must respond to changing market conditions, it should not be interpreted to be a
commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of
any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in, introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Microsoft, the furnishing of
this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
© 2004 Microsoft Corporation. All rights reserved.
Windows® and Windows NT® are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks
of their respective owners.

Table of Contents

Executive Summary
Introduction
Target Audience
What is Oracle Real Application Clusters?
How RAC works
Summary
Oracle Claim: "RAC is Scalable"
The Truth: Unscalable Architecture
Summary
Oracle Claim: "RAC Provides High Availability"
The Truth: Downtime Still Unavoidable
The Truth: Transactions Do Not Failover Automatically
Summary
Oracle Claim: "RAC Lowers Total Cost of Ownership"
Oracle's TCO Claims
The Truth: RAC will probably increase cost
The Truth: RAC Increases Licensing Costs
Summary
Conclusion
Appendix I: Resource links
Appendix II: References

Executive Summary
Oracle has touted its Real Application Clusters (RAC) technology as the all-
encompassing solution for today's enterprise computing requirements in scalability and
high availability. With the release of its latest database, Oracle further proposes that RAC
is the only solution for Enterprise Grids as part of its Grid computing marketing
campaign. However, although it is now about four years since the initial public release of
RAC and over 14 years since the release of RAC's predecessor, Oracle Parallel Server,
several questions still remain around RAC's business value as a technology:
• RAC continues Oracle's tradition of high cost of ownership for
customers already on the Intel platform, offering at most some hardware
cost savings for customers on legacy UNIX®/mainframe systems.
Regardless of the marketing claims, the cost of Oracle licensing increases when
deploying RAC. In measuring real deployments and long-term TCO, SQL Server
still provides the best total cost of ownership for users who demand high
performance and reliability for their systems.
The savings users may expect with Oracle RAC are primarily on the hardware,
and only if moving from legacy UNIX/mainframe systems to Intel-based systems.
If the user is already on Intel-based servers, deploying RAC will very
likely increase cost, because the user will now need to purchase
o additional certified server(s) and networking hardware
o certified storage and connectivity solutions
o additional Oracle database licenses (each node needs to be licensed for
the database and all options)
o Oracle RAC licenses (typically separate from database licenses)
o specialized RAC training for operators and database administrators
o consulting and other services for implementation, certification and tuning
(RAC consulting typically commands a premium over basic DBA consulting)

On the other hand, not only has SQL Server 2000 repeatedly demonstrated lower
total cost of ownership for deployments of various sizes, it also provides tightly
integrated business intelligence (analysis, reporting, data mining and ETL)
capabilities as a standard feature. Furthermore, its intelligent, dynamic,
automated resource management features provide peace of mind and reduced
administration overhead that have no peer in the industry.
• RAC has few customer references running business applications like
SAP®, PeopleSoft®, Siebel®, etc., and no verified large deployments.
Oracle has made many claims about the scalability of RAC since its launch
almost four years ago. It has presented respectable results on the TPC-C
benchmark but has yet to demonstrate scalability with actual large-scale
commercial deployments. While there are some interesting deployments by
research institutions, these are not the traditional business applications that
commercial users can relate to.
On the other hand, SQL Server has proven its scalability by holding top positions
for some time in various industry benchmarks like TPC-C and TPC-W, and in
application benchmarks for SAP, Siebel, PeopleSoft, Onyx and others, and it will
continue to raise the bar on scalability for business applications while Oracle RAC
remains a relative non-player.
• RAC does not provide automatic high availability out of the box. Most
applications will not have automated transaction failover.
The level of availability that RAC can provide is no different from what is
commonly available today and has been available for several years. There are no
secrets or advanced technology here, as similar methods have been employed by
other vendors for years. RAC does, in some cases, simplify the process (for
example, adding and removing nodes in a cluster is much simpler in the 10g
version), but it does not introduce any groundbreaking new technology, nor does
it raise the bar for high availability.
Additionally, while there is no blackout for the transaction, since other servers are
available to which the transaction can be re-submitted, there is a brownout, and
transaction state in most ISV or internally developed corporate applications
(those not written with TAF) is lost.
This document will delve into each of the key claims made by Oracle about RAC's
abilities in scalability, availability and total cost of ownership, explain how RAC really
works, and uncover the reality behind Oracle's marketing messages.

Introduction
This paper provides a high-level view of Oracle's Real Application Clusters (RAC)
technology and its key features, strengths and weaknesses. It also sheds light on various
claims made by Oracle regarding scalability, availability and total cost of ownership. The
primary objective is to provide the reader with the facts about RAC, based on the
authors' research, testing and observations, as opposed to accepting Oracle's marketing
messages.
Note that this paper is not intended to be an exhaustive technical whitepaper on RAC,
nor will it cover every single detail about RAC and related technologies. Other
technologies may be addressed where appropriate.

Target Audience
This paper is useful to anyone who manages database systems or develops database
applications, and to anyone involved in decisions on acquiring database systems.
Business and technical decision makers will find this paper particularly helpful in
making the right decisions based on real data, rather than marketing messages.

What is Oracle Real Application Clusters?
Oracle launched Real Application Clusters (RAC) with the release of its 9i database
product several years ago and has since made several updates. While there have been
some changes in marketing taglines, RAC is still generally positioned as the enterprise
solution for both scalability and reliability. Industry veterans will recognize RAC as an
updated version of Oracle's older scale-out clustering technology, Oracle Parallel
Server (OPS), found in versions of the Oracle database prior to 9i.

Figure 1: Shared Data (Oracle RAC)
Figure 2: Partitioned Data (SQL Server 2000)

Figure 1 above provides a very high-level view of the RAC architecture. Basically, RAC
is a limited scale-out system built on a shared disk, shared cache architecture: there is
only one copy of the actual data in a single database, which can be serviced by one or
more database server instances. All instances work against the same copy of the data,
with various mechanisms in place to manage resources, locks, access and so on.
In contrast, Figure 2 shows how SQL Server 2000 implements scale-out by federating a
group of independent databases to provide users with the view of a single database, even
though there can be two or more physical servers and databases beneath that view. This
concept obviously includes a lot more detail than what has just been described, but it is
not the objective of this paper to delve into SQL Server 2000's scale-out
implementation. More information can be obtained from
http://www.microsoft.com/sql/evaluation/features/distpart.asp

How RAC works
In the preceding section, we took a brief look at RAC’s architecture. Now, let’s drill down
into how RAC works.
Though it uses the same Oracle database engine, there are several key components that
are unique to RAC. These components are listed below and their functions will be
discussed later in this section.
• Cluster Manager
• Global Cache Service
• Global Enqueue Service
• Cluster Interconnect & Inter-Process Communication (node-to-node)
• Quiesce Database Feature
• Shared Disk Subsystem

Figure 3 below shows a typical 2-node RAC setup. If users were to expand this to
support more nodes, each new node would need to be connected to all existing nodes
and to the shared disk subsystem. The Interconnect between all nodes uses Network
Interface Cards (NICs) on each node, connected via a hub/switch. Connection to the
shared disk typically requires Host Bus Adapters (HBA) for Storage Area Network (SAN)
systems. In some cases, SCSI cards may be used for 2-node clusters.

Figure 3. Sample 2-node RAC system: each node (Node 1, Node 2) runs an Oracle
instance with a Distributed Lock Manager (DLM) and Parallel Cache Management, a
cluster manager and communication layer, and a shared disk driver; the nodes are
joined by the cluster Interconnect, and both attach to the shared disk sub-system.

Transactions that come into the system may be directed at any active node, either
manually or via a cluster alias (a feature of the cluster services), which automatically
re-directs the transaction to an available node. The Global Cache Service manages the
status and transfer of data blocks across the buffer caches of the instances and is
integrated with the buffer cache manager to look up resource information in the Global
Resource Directory. This directory is distributed across all instances and maintains
status information about resources, including any data blocks that require global
coordination.
When a transaction requests a specific row or rows, the node that services the request
first checks its local cache to see if the requested rows are cached locally. If they are,
the requested rows are returned to the caller and the transaction ends. If not, the
calling node checks whether the rows are cached in any of the other nodes in the
cluster. If found, a series of processes is initiated to ship that cache block to the calling
node, after which the results are returned to the user. This fixes some of the problems
suffered by OPS, which required several I/O operations before the requested block was
acquired by the calling node. In the event that the requested rows are not cached
anywhere, a regular read operation is performed.
Disk I/Os are only performed when none of the collective caches contain the necessary
data, or when an update transaction performs a COMMIT operation requiring disk write
guarantees.
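To make that access-preference order concrete, here is a minimal sketch in Python. It is
illustrative pseudocode of the lookup logic described above, not Oracle code; the
function shape, cache structures and cost notes are our own assumptions.

```python
# Illustrative sketch of RAC's block-lookup preference; not Oracle internals.
# Assumed cost ordering: local cache (no messages) < remote cache (block
# shipped over the cluster Interconnect) < disk read.

def read_block(block_id, local_cache, remote_caches, disk):
    """Return a data block from the cheapest available source."""
    if block_id in local_cache:               # best case: no messages at all
        return local_cache[block_id]
    for cache in remote_caches.values():      # Cache Fusion: block is shipped
        if block_id in cache:                 # across the cluster Interconnect
            block = cache[block_id]
            local_cache[block_id] = block     # now cached locally as well
            return block
    block = disk.read(block_id)               # slow but inevitable on a miss
    local_cache[block_id] = block
    return block
```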
There are four situations that warrant consideration when multiple Oracle instances
access the same row in the database. For simplicity, the following examples refer to a
2-node cluster whose nodes are named node 1 and node 2.
Read/read — A user on node 1 wants to read a block that a user on node 2 has recently
read. The read request may be served by any of the caches in the cluster database,
where the order of access preference is local cache, remote cache, and finally disk I/O.
Read/write — User on node 1 wants to read a block that user on node 2 has recently
updated. In both the read/write and write/write cases in which a user on node 1
updates the block, coordination between the instances becomes necessary so that the
block being read is a read consistent image (for read/write) and the block being updated
preserves data integrity (for write/write). In both cases, the node that holds the initially-
updated data block ships the block to the requesting node across the high speed cluster
interconnect and avoids expensive disk I/O for read. It does this with full recovery
capabilities.
Write/read — User on node 1 wants to update a block that user on node 2 has recently
read. An update operation typically involves reading the relevant block into memory and
then writing the updated block back to disk. In this scenario, node 1 wants to update a
block that has already been read and cached by a remote instance or node 2. A disk I/O
for read is avoided and performance is increased, as the block is shipped from the cache
of node 2 into the cache of node 1.
Write/write — User on node 1 wants to update a block that user on node 2 has recently
updated. See read/write above.

The efficiency of inter-node messaging depends on three primary factors:
• The number of messages required for each synchronization sequence
• The frequency of synchronization (the less frequent, the better)
• The latency, or speed, of inter-node communications

Number of Messages for Synchronization


The number of messages required for coordination is intricately linked with the Global
Enqueue Service (GES) implementation in the cluster database. The Global Enqueue
Service tracks which database resources are mastered on which nodes. It also handles
resource allocation and de-allocation to synchronize data operations made by the user
processes.
The Global Cache Service implementation enables coordination of a fast block transfer
across caches using, at most, two inter-node messages and one intra-node message.
A user process that needs access to some piece of data can quickly determine, via an
in-memory look-up in the Global Resource Directory, whether the associated block is
currently mastered on the local node or a remote one. Afterwards, processing branches
in the following manner:
• If the data is not already available in a local or a remote cache, Oracle performs
regular disk access—this is slow, but inevitable.
• If the data is in the local cache, it is accessed immediately without any message
generation—clearly the best case.
• If the data is in a remote cache, generally:
o The user process generates an inter-node message to the Global Cache
Service process on the remote node.
o The Global Cache Service and instance processes in turn update the in-
memory Global Resource Directory and send the block to the requesting
user process. In some cases a third instance may be involved.
Each synchronization operation occurs with minimal communication and thus delivers
fast response. However, does this scheme scale to the several thousand user processes
that may potentially access the system in an e-business application environment?
Oracle's inter-node resource model is the key to providing this capability. Note,
however, that the interconnect is also used by other Oracle technologies like parallel
query, so bandwidth consumed by parallel query leaves less bandwidth available to the
RAC cluster itself.
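As a rough way to reason about the branches above, the following toy model tallies
message costs per block request. The request mix and helper function are our own
illustrative assumptions; only the per-branch message counts come from the
description above.

```python
# Toy message-cost model for one block request in RAC, per the three
# branches above. The workload mix below is an assumption, not a measurement.

COSTS = {
    "local_cache":  (0, 0),  # immediate access, no messages (best case)
    "remote_cache": (2, 1),  # at most 2 inter-node + 1 intra-node message
    "disk":         (0, 0),  # no coordination messages, but a slow disk read
}

def inter_node_messages(requests_per_sec, mix):
    """Inter-node messages per second for a given request mix."""
    return sum(requests_per_sec * share * COSTS[branch][0]
               for branch, share in mix.items())

# Example: 10,000 requests/sec; 70% local, 20% remote, 10% disk (assumed mix)
mix = {"local_cache": 0.7, "remote_cache": 0.2, "disk": 0.1}
print(f"{inter_node_messages(10_000, mix):.0f} inter-node messages/sec")  # 4000
```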

Inter-node Message Latency


Cache Fusion requires low-latency communication protocols to function effectively. As
such, the functionality, and even the usability, of RAC is highly dependent on the
performance and capacity of the cluster Interconnect and all of its related components.
This means users will typically need to acquire enterprise-class hardware, not just any
commodity hub, to serve as the Interconnect.

Summary
RAC functions by spreading processing load across multiple servers, but all these
servers still work against the same single database residing on the same storage
system. Additionally, RAC depends on the cluster Interconnect to provide cache
coherency and, effectively, data integrity. Any weakness or failure in either of these
parts may result in problems with the overall system.
The following sections will delve into the claims of scalability, availability and total cost
of ownership, providing insight into the reality behind Oracle’s marketing messages.

Oracle Claim: “RAC is Scalable”
Oracle claims to provide record breaking scalability with RAC. It claims that users can
easily add nodes whenever they needed to with virtually boundless scalability.
Interestingly however, after having been in the market for around 6 years, RAC has
barely merely demonstrated scalability on non-business application scenarios like TPC-C
benchmarks and research organizations. While respectable in their own right, they are
not representative of typical commercial/business applications like SAP, Siebel or
PeopleSoft.
Oracle did publish a SAP-SD Parallel standard benchmark, but the number of benchmark
users achieved was not even half of what SQL Server accomplished in its SAP-SD 3-tier
results. In fact, Oracle's own highest benchmark results are based on Oracle running on
single-server, non-clustered systems, not RAC. (See Appendix II, #1 for details.)

The Truth: Unscalable Architecture


One of the possible reasons for RAC not living up to the marketing promise is rooted
deep within the architecture of RAC itself. Though RAC has eliminated the disk I/O
processes that were the main bottleneck in OPS, it merely transferred the problem to
another area with a slightly higher ceiling rather than solving it. As explained in the
earlier section on how RAC works, RAC's design lends itself to being bottlenecked by
the cluster Interconnect. This manifests itself in several ways, and there is only limited
headroom for addressing this scalability problem before the technology reaches its
capacity ceiling.
When each node attempts to process very high numbers of transactions, as is typical in
any high-volume OLTP environment, the traffic traveling across the Interconnect to
facilitate block requests and all related interprocess communications, together with the
processes required to service each block request, will quickly flood the nodes' network
interface cards (NICs) and the cluster Interconnect switch or hub. While this may be
mitigated with improved (expensive) networking hardware, there is still the issue of
latency on the Interconnect, which is often significantly higher than what one would
experience on the bus of a Symmetric Multi-Processing (SMP) system. Using
higher-capacity NICs or switches simply raises the ceiling by a marginal amount and
does not resolve the problem, because its root cause lies in the design of RAC, which
causes heavy resource contention on the storage and Interconnect with little or no room
to maneuver. This is probably the main reason why Oracle's sweet spot for RAC is
4 CPUs per node (as per content presented at OpenWorld 2001/2002, the Oracle
International User Group conference 2001 and Oracle OpenWorld London 2004) and,
ideally, a cluster of 4, as more CPUs per node, or more nodes in a cluster, may
overwhelm the cluster.
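A back-of-the-envelope calculation shows how quickly this happens. All inputs below
(8 KB block size, usable Gigabit Ethernet bandwidth, the transaction profile and the
remote-hit rate) are illustrative assumptions for 2004-era hardware, not measurements.

```python
# Back-of-the-envelope interconnect saturation estimate. All inputs are
# illustrative assumptions for a 2004-era cluster, not measured values.

BLOCK_SIZE_BYTES = 8 * 1024              # common Oracle block size
GIGE_USABLE_BPS = 1_000_000_000 * 0.7    # Gigabit Ethernet, ~70% usable (assumed)

def interconnect_load(txns_per_sec, blocks_per_txn, remote_hit_rate):
    """Fraction of interconnect bandwidth consumed by shipped blocks."""
    shipped_blocks = txns_per_sec * blocks_per_txn * remote_hit_rate
    bits_per_sec = shipped_blocks * BLOCK_SIZE_BYTES * 8
    return bits_per_sec / GIGE_USABLE_BPS

# Example: 5,000 txns/sec, 10 block reads each, 20% served from a remote cache
load = interconnect_load(5_000, 10, 0.20)
print(f"Interconnect utilization: {load:.0%}")   # ~94% -- already saturated
```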
Furthermore, if RAC really were scalable for most applications out of the box, why would
Oracle need to implement partitioning (a $10,000 per CPU option according to Oracle’s
online store http://oraclestore.oracle.com) in its TPC-C clustered benchmark? (See
http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=103120803 for details)

Summary
Oracle has made many claims about the scalability of RAC since its launch, but to date
it has provided little to no real customer evidence of large-scale RAC systems or top
benchmarks with common business applications like SAP and PeopleSoft. The few
benchmarks published, though respectable in their own right, are a far cry from what
Oracle has claimed in various marketing presentations and literature. At this time, RAC
has still not delivered on the promises made by Oracle.
On the other hand, SQL Server has proven its scalability not only by securing top
positions in various industry benchmarks, but also through recognition by independent
analysts like WinterCorp, based on SQL Server's large customer deployments. In fact,
there are many SQL Server customers running mission-critical applications that process
up to 100 million transactions per day using only 8-CPU SMP servers. More information
is available at the following sites:
http://www.microsoft.com/sql/evaluation/compare/wintercorp.asp
http://www.microsoft.com/sql/evaluation/casestudies/default.asp
http://www.microsoft.com/sql/worldrecord/

Oracle Claim: "RAC Provides High Availability"
Oracle makes interesting claims about RAC's ability to provide users with a system that
suffers no downtime, along with other attractive high availability features. However,
after filtering out the marketing messages, the reality is not nearly as interesting as
one might initially believe. In fact, there are instances of customers moving from RAC
to non-RAC systems to recover from a failure attributed to RAC.

The Truth: Downtime Still Unavoidable


Oracle marketing claims that RAC allows users to build systems that suffer no
downtime due to maintenance or hardware failure. Unfortunately, that claim is only
partially true.
RAC does provide tolerance for hardware failure, if set up and configured appropriately,
so long as there is adequate redundancy in all hardware components. However, in the
event that the user needs to apply a patch (not an uncommon activity in Oracle or any
other database), the data dictionary in Oracle generally needs to be updated. For that to
happen, the system needs to be in restricted mode, which means end-users are unable
to access data in that database until the process is complete. As such, though the
system is not "down" per se, it certainly is not available to serve end-users. To the
end-user doing data entry, or the customer on an e-commerce site trying to complete a
transaction, the experience is no different from a system gone offline.
On the operating system level, users can certainly perform rolling upgrades on the
operating system, where patches and service packs may be applied on one node at a
time. Even if the node needs to be rebooted after applying the patch or service pack, the
surviving nodes are still actively serving end-users. However, this feature isn’t
particularly interesting, as it has been available in the industry from most major
database vendors for several years. It certainly isn’t groundbreaking by today’s
standards.

The Truth: Transactions Do Not Failover Automatically


Another key point to note about RAC's availability claims is that transactions and
queries do not fail over automatically. Users need to build applications that utilize
Oracle's Transparent Application Failover (TAF) via the Oracle Call Interface (OCI),
which allows an application to track the state of the server to which it is presently
connected. In the event that the currently connected server fails, the application can
then re-submit the query, or re-direct it to a surviving node for processing, without
user intervention. Note that using this functionality does introduce significant overhead
in the application, since the state of the transaction must be stored at all times.

In short, applications that were not developed with TAF in place will not enjoy fully
automated failover or retry for transactions or queries that were still running when a
node fails. Developing applications to leverage TAF or TAF-like features is not a new
concept and is commonly available among enterprise databases, including SQL Server,
so again, there really is nothing groundbreaking here.
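To illustrate what the absence of automated failover means in practice, the sketch
below shows the kind of reconnect-and-retry logic an application without TAF must
supply itself. It is a minimal, hypothetical sketch assuming a generic Python DB-API
style driver and an invented list of per-node connect strings; real TAF is configured in
Oracle Net (the FAILOVER_MODE clause of the connect descriptor) and surfaced
through OCI, not through application code like this.

```python
# Minimal sketch of client-side failover for an application NOT using TAF.
# 'connect' is any DB-API style connect function; NODES is a hypothetical
# list of per-node connect strings -- both are assumptions for illustration.

import time

NODES = ["node1/acme", "node2/acme"]  # hypothetical connect strings

def run_with_retry(connect, sql, retries=3, delay=2.0):
    """Re-submit a statement on a surviving node after a node failure.

    Transaction state is lost on failure: anything uncommitted must be
    re-executed from the start -- exactly the 'brownout' described above
    for applications written without TAF.
    """
    last_error = None
    for _ in range(retries):
        for node in NODES:                    # try each node in turn
            try:
                conn = connect(node)
                cur = conn.cursor()
                cur.execute(sql)              # whole statement re-submitted
                conn.commit()
                rows = cur.rowcount
                conn.close()
                return rows
            except Exception as exc:          # node down or connection lost
                last_error = exc
        time.sleep(delay)                     # brief brownout before retrying
    raise last_error
```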

Summary
The level of availability that RAC can provide is no different from what has been
commonly available for several years. There are no secrets or advanced technology
introduced here, as similar methods have been employed by other vendors for some
time. RAC does, in some cases, simplify the process (for example, adding and removing
nodes in a cluster is much simpler compared to OPS), but it does not introduce any
groundbreaking new technology, nor does it raise the bar for high availability, despite
the significant additional cost incurred with RAC.
Additionally, while there is no blackout for the transaction, since other servers are
available to which the transaction can be re-submitted, there is a brownout, and
transaction state in most ISV or internally developed corporate systems (those not
written with TAF) is lost.
There is no magic formula to high availability despite what Oracle marketing would have
you believe. RAC does have a role in the overall HA scenario but it certainly isn’t the
cure-all solution. See http://www.eweek.com/article2/0,1759,1196874,00.asp for one
example of where RAC was identified as the cause of failure of a major internet
commerce site and the solution to get back online was to move off RAC onto a single
server SMP system.
Microsoft takes a more pragmatic approach and provides customers with prescriptive
guidance on how to build highly available SQL Server systems. This approach of
providing customers with actual facts and advance warning of potential pitfalls has
proven highly effective, with customers successfully building systems with up to
99.999% availability without resorting to exotic hardware or buying expensive add-on
software. See case studies on customers like NASDAQ, Western Digital, Borgata Hotel
Casino & Spa, Wildcard Systems and Countrywide Home Loans, who have achieved
99.999% availability or more with SQL Server 2000. Details are available at
http://www.microsoft.com/sql/evaluation/casestudies/alphalisting.asp

Oracle Claim: “RAC Lowers Total Cost of
Ownership”
One of the most heavily touted benefits claimed by Oracle is that RAC will significantly
lower total cost of ownership compared to other databases in the industry, based on
lower hardware costs and easier management. However, though there are some
instances where the savings can be significant, only a small segment of customers will
actually realize them. For the most part, customers will see minimal or no change in
TCO, and in many cases TCO actually increases.

Oracle’s TCO Claims


Oracle claims significantly lower TCO when deploying its Oracle Database with RAC for
scalability and high availability. The argument presented is that users can deploy RAC on
low-cost, commodity Intel™ servers running Linux, thus saving on hardware and
operating system costs.

Another claim concerns Oracle's new management tools, which are supposed to
significantly reduce management complexity (a well-known issue with Oracle) and thus
the cost of management. However, one only needs to attempt installing RAC to realize
how far from the truth this really is. Furthermore, none of the claims of simplified
management have yet been proven in the demanding environment of a production
system, particularly one that is constantly changing.

Finally, Oracle makes claims about what other databases cost to manage, based on its
own assertions about their downtime, and Oracle itself defines the cost of that
downtime. For the claimed cost savings to be realized, customers must believe that all
Oracle databases have zero downtime and that all other databases inherently have
significant downtime. Fortunately, most customers are able to see through Oracle
marketing's concoctions and understand that any system that is well designed, deployed
and managed can provide a high level of availability, and that downtime affects all
databases, including Oracle (see the news article linked in the summary of the
availability section). Even very large companies with large DBA teams, such as major
internet-based auction and e-commerce sites, have experienced significant downtime
involving their databases, often not because of faulty technology. See
http://roc.cs.berkeley.edu/papers/Cost_Downtime_LISA.pdf and
http://news.com.com/2009-1001-251651.html?legacy=cnet&tag=owv for some
examples. The fact is, the human factor is the single largest cause of downtime in any
system, and this is something that RAC does not address. If anything, RAC increases
administrative complexity, and with it the opportunity for human error.
The Truth: RAC will probably increase cost
The few customers who will realize the cost savings claimed by Oracle are mostly those
currently running on legacy UNIX and/or mainframe platforms from vendors like Sun or
IBM. The savings really just come from the hardware itself and hardware-related
services. Note, however, that the software and services costs of the Oracle database do
not go down but rather increase. Users still need to purchase all relevant Oracle
licenses and pay the same maintenance, support, consulting and training fees (see the
next section for license costs). In fact, users will probably end up paying Oracle more
than before, because of the additional per-CPU, per-node charges for the RAC option
and the increased number of nodes for which the customer must now license the
database. RAC is also probably not covered by the typical Oracle site license; rather, it
is available as a separate option that incurs additional licensing charges. On top of that,
there will almost certainly be a significant increase in system administration complexity
with the deployment of any cluster system, which Oracle does not mention in its
messaging. Even a 10-year Oracle veteran points out this increased complexity and the
added skills required (see
http://www.miracleas.dk/WritingsFromMogens/YouProbablyDontNeedRACUSVersion.pdf).
There are some parties who believe RAC is a cost effective solution for high availability,
because:
• RAC spreads cost across multiple active servers, all of which are servicing users
at all times, instead of waiting idle until a node fails
• All servers are servicing users for the same database, hence load is balanced
across all servers and you don’t have idle resources
• Customers need half the resources on each server as the user load can be
managed to an almost even distribution between the servers. Hence, you get
failover support without needing to invest in high capacity servers
This is a dangerous misconception, because it may lead users into a false sense of
security while deploying systems that do not provide adequate levels of availability and
scalability. Users need to understand that in any high availability system where
multiple nodes are built into a cluster to provide failover support, each node in the
cluster must be able to support the entire load of the cluster in the event of a disaster
where only a single node is left functioning. This means each node must be configured
with adequate resources to handle the entire cluster's load on its own. This is true for
any real high availability scenario, regardless of vendor, for clusters with few nodes. As
the number of nodes increases beyond four, the risk of simultaneous multiple-node
failure drops, so users can reduce the level of redundancy provisioned per node.
However, sizing still has to take into account workload spikes and multiple node
failures.
For example, if you need 8 CPUs and 16GB RAM in a single-server environment to
support your transaction system, and you choose to deploy a 2-node cluster for the
same application, each node in the cluster must have the same 8-CPU, 16GB RAM
configuration. Otherwise, in the event of a node failure, the surviving node will not be
able to cope with the increased workload, resulting in transactions being rejected.
Additionally, because some resources are consumed dealing with the added transaction
requests (even though the system cannot service them), even the existing workload
that is being serviced may experience degraded performance. So while the system is
technically "up and running", your users/customers are experiencing downtime. In
summary, moving from a single-node SMP system to a RAC-based system generally
does not save a customer hardware costs and will very likely increase software and
administrative costs.

To further illustrate this, please see the following diagrams and descriptions:
Single server, minimal high availability features

– Running ACME ERP system


– Supports 10,000 concurrent users
– System requires a total of 8-CPUs, 16GB RAM

Figure 4

High Availability Failover Cluster (as used by SQL Server)

– Same ACME ERP system
– Supports 10,000 concurrent users
– System requires a total of 8-CPUs, 16GB RAM
– Has failover clustering and redundant storage features
– Each node has 8-CPUs, 16GB RAM to ensure that in the event of a node failure,
the surviving node can handle the entire load and run without problems

Figure 5. Two-node failover cluster; each node has 8 CPUs and 16GB RAM.

Wrong assumption about RAC configurations


Oracle implies in its messaging that, for a system similar to the example above, the
10,000 users can be balanced across 2 nodes with 4 CPUs and 8GB RAM per node,
providing a high availability solution in case either node fails. This means that when
both nodes are up and running, workload is spread across both nodes. However, as
described earlier, this is a flawed deployment for a high availability cluster: in the event
of a node failure, all the users from the failed node will switch to the surviving node and
try to resume operations (manually or automatically, depending on how the application
was written). In that situation, the surviving node, which was sized for only 5,000
users, will not be able to service the requests of the 10,000 users it now has. As such,
at least half of the users will be unable to execute their transactions, and the remaining
users who can will likely see significantly impacted performance, as some system
resources are consumed managing the 5,000 users the node is unable to service.

Single node failure causes downtime in a common RAC configuration

– Same ACME ERP system
– Jointly supports 10,000 concurrent users
– System requires a total of 8-CPUs, 16GB RAM
– Has failover clustering and redundant storage features
– Each node has 4-CPUs, 8GB RAM
– In the event of failure, the surviving node is overwhelmed and not able to support
the load of 10,000 users, hence about 50% of users suffer downtime

Figure 6. Two nodes of 4 CPUs, 8GB RAM each; one has failed, the other is
overloaded. When 1 node fails, the system only has 4-CPUs, 8GB RAM, which is not
capable of supporting the 10,000-user load.

Correct deployment for high availability clusters, including RAC

10,000 users balanced on 2 nodes, with 8-CPUs, 16GB RAM per node

– Same ACME ERP system
– Supports 10,000 concurrent users
– System requires a total of 8-CPUs, 16GB RAM
– Has failover clustering and redundant storage features
– Each node has 8-CPUs, 16GB RAM
– In the event of a node failure, the surviving node picks up the entire load and
runs without problems

Figure 7. Two-node cluster; each node has 8 CPUs and 16GB RAM.

If 1 node fails, the system still has 8-CPUs, 16GB RAM, which means it can still support
10,000 users. Though the example takes a simplified view of workloads, the reasoning
applies to any environment.
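The sizing rule behind Figures 4 through 7 can be expressed as a simple check. The
sketch below is a toy capacity model using the paper's own numbers; the function
name and the assumption that capacity scales linearly with CPU count are ours.

```python
# Toy failover-capacity check using the ACME ERP example above.
# Assumes capacity scales linearly with CPU count -- our simplification.

def survives_failure(nodes, cpus_per_node, users_per_cpu, total_users):
    """True if the cluster can still carry the full load after one node fails."""
    surviving_capacity = (nodes - 1) * cpus_per_node * users_per_cpu
    return surviving_capacity >= total_users

USERS_PER_CPU = 10_000 / 8          # 8 CPUs support 10,000 users (Figure 4)

# Figure 6: two 4-CPU nodes -- one failure leaves 4 CPUs for 10,000 users
print(survives_failure(2, 4, USERS_PER_CPU, 10_000))   # False: ~50% brownout

# Figure 7: two 8-CPU nodes -- one failure still leaves 8 CPUs
print(survives_failure(2, 8, USERS_PER_CPU, 10_000))   # True
```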

Once users understand why this sizing rule is critical for any system requiring true high
availability (a point omitted from Oracle's communications), they will quickly realize
that costs do not decrease but will likely increase as the number of nodes (as proposed
by Oracle using RAC) and the workload grow. Additionally, in a real deployment, if a
single machine with 4 CPUs and 8GB RAM can support 5,000 users, deploying a 2-node
RAC cluster of identical hardware will not support 10,000 users. Oracle itself indicates
that, at best, you get an 85% gain by adding a second node, and as the number of
nodes in a cluster increases, the gain per node decreases noticeably. This is partly due
to physical inefficiencies of the hardware and partly due to the overhead Oracle
imposes when running RAC.
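Taking that 85% figure at face value, a short calculation shows how quickly the
effective capacity per node erodes. The constant 85%-per-additional-node model below
is our own illustrative extrapolation, not a formula published by Oracle.

```python
# Illustrative cluster-scaling model: each added node contributes 85% of
# the previous node's effective capacity (our extrapolation of the 85%
# second-node figure; Oracle publishes no such formula).

def effective_nodes(n, factor=0.85):
    """Effective single-node equivalents for an n-node cluster."""
    return sum(factor ** i for i in range(n))

for n in (1, 2, 4, 8):
    print(f"{n} nodes -> {effective_nodes(n):.2f} node-equivalents")
# 1 -> 1.00, 2 -> 1.85, 4 -> 3.19, 8 -> 4.85: per-node yield keeps falling
```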
Thus, the claims made by Oracle about how users can save money with RAC are, at
best, amusing myths or marketing gimmicks. In reality, for a system with true high
availability, users will not save money by deploying RAC, and all users should be wary
of any marketing claims to the contrary.

Conclusion
Despite Oracle marketing's claims, the commonly deployed scenario shown above
clearly illustrates how deploying RAC for high availability will not save the user any cost
compared to existing, proven failover solutions available today, like Microsoft failover
clustering, which can be used by both SQL Server and Oracle. Additionally, RAC may
increase hardware, software and management costs, as there are additional
components required compared to a failover cluster (e.g. the Interconnect). Though the
solution will work by providing automated failover, it certainly is not the money saver
claimed. If anything, it will likely increase TCO.
That does not mean customers should avoid clusters altogether, as they serve a useful
purpose in providing high availability. As mentioned before, both Oracle and SQL Server
support failover clustering, which also provides high availability. What customers
should know is that, in the given scenario, they will not save on hardware costs for
high availability clustering with RAC, and will likely increase software and
administration costs.

The Truth: RAC Increases Licensing Costs
Sample Cost Comparison between Oracle and Microsoft Solutions
The following table provides estimated retail prices, based on published prices on the
Oracle, SUN™, DELL™ and Microsoft websites, for various configurations with
equivalent basic functionality. This comparison does not include storage costs, as those
are expected to be reasonably similar across the different configurations (though RAC
systems require more Host Bus Adapters and NICs than the others). Note that software
maintenance, taxes, shipping and other charges may apply.

Configuration A — Oracle10g on SUN Solaris™ 9, 4-CPU x 1-node
Configuration B — Oracle10g with RAC on Dell/Red Hat® Enterprise Linux AS 3, 2-CPU x 2-nodes
Configuration C — SQL Server 2000 on Windows Server 2003 Enterprise Edition, 4-CPU x 1-node

Database license
• Enterprise Edition: A $160,000 | B $160,000 | C $79,996
• OLAP & Data Mining: A $160,000 (note 1) | B $160,000 (note 1) | C included
• RAC Option: A n/a | B $80,000 | C n/a
• Network Encryption, PKI and single sign-on support: A $40,000 (note 2) | B $40,000 (note 2) | C included
• Performance monitoring & tuning tools: A $12,000 (note 3) | B $12,000 (note 3) | C included
• Diagnostic tools: A $12,000 (note 4) | B $12,000 (note 4) | C included
DATABASE COST: A $384,000 | B $464,000 | C $79,996

Server cost (not inclusive of storage, networking and other devices)
A $201,052 — SUN Fire™ 4800 small base server configuration: 4 CPU (UltraSPARC™ III 1.05GHz), 16GB RAM, 1-year gold support
B $49,672 ($24,836 per node) — DELL PowerEdge™ 6650: 2 x 2 CPU (3.0GHz/4MB XEON™), 8GB RAM per node, 3-year gold support
C $39,533 — DELL PowerEdge 6650: 4 CPU (3.0GHz/4MB XEON™), 16GB RAM, 3-year gold support

Interconnect switch & NIC with redundancy (required for RAC only)
A not required | C not required
B — 2x Dell PowerConnect 2608 Gigabit Ethernet switch: $356; 2x 3Com Gigabit Server NIC: $261.90

Host Bus Adapters (for SAN connectivity)
A $995 — SUN Dual Gigabit Ethernet + Dual SCSI PCI Host Adapter
B $2,135.90 — QLOGIC QLA2340-CK SANblade 2340 PCI-X to 2Gb Fibre Channel Host Bus Adapter x2 (1 HBA required per node)
C $1,067.95 — QLOGIC QLA2340-CK SANblade 2340 PCI-X to 2Gb Fibre Channel Host Bus Adapter

HARDWARE COSTS: A $202,047 | B $52,425.80 | C $40,600.95
TOTAL COST: A $586,047 | B $516,425.80 | C $120,596.95

Table 1. Cost estimates as per published prices on vendor websites; subject to change
without notice.
1. Requires OLAP option at $20,000 per CPU and Data Mining option at $20,000 per
CPU for 4 CPUs
2. Requires Advanced Security option at $10,000 per CPU for 4 CPUs
3. Requires Diagnostics Pack at $3,000 per CPU for 4 CPUs
4. Requires Tuning Pack at $3,000 per CPU for 4 CPUs
5. Prices from www.dell.com, oraclestore.oracle.com, www.sun.com,
www.microsoft.com/sql
Note: While Oracle typically mentions "commodity hardware", RAC deployments require
hardware specially certified for RAC, not just any off-the-shelf server or components.
That means most, if not all, users will not be able to re-use existing hardware to build a
RAC system. A shared storage subsystem with a high-speed interconnect is also a
prerequisite for RAC and is certainly not a commodity item.
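The database-cost rows in Table 1 follow directly from Oracle's published per-CPU list
prices. The sketch below reproduces that arithmetic; the per-CPU figures are the ones
cited in the table and its footnotes (Enterprise Edition at $40,000 per CPU is implied by
the $160,000 four-CPU line).

```python
# Reproduces the DATABASE COST rows of Table 1 from per-CPU list prices.
# Per-CPU figures are implied by the table and its footnotes.

PER_CPU = {
    "Enterprise Edition": 40_000,   # $160,000 / 4 CPUs
    "OLAP option": 20_000,          # footnote 1
    "Data Mining option": 20_000,   # footnote 1
    "Advanced Security": 10_000,    # footnote 2
    "Diagnostics Pack": 3_000,      # footnote 3
    "Tuning Pack": 3_000,           # footnote 4
    "RAC option": 20_000,           # $80,000 / 4 CPUs
}

def oracle_db_cost(total_cpus, with_rac):
    """Total Oracle database license cost for a given CPU count."""
    items = [k for k in PER_CPU if with_rac or k != "RAC option"]
    return sum(PER_CPU[k] * total_cpus for k in items)

print(oracle_db_cost(4, with_rac=False))  # 384000: single 4-CPU server
print(oracle_db_cost(4, with_rac=True))   # 464000: 2 nodes x 2 CPUs
```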

It is clear from the figures in Table 1 that moderate hardware savings can be realized if
users move away from legacy UNIX systems onto Intel-based commodity systems.
However, note that the savings come purely from the reduced hardware and
hardware-related services costs. Database costs either remain constant, if running on a
single server, or increase with the RAC option.
With Oracle's licensing costs making up almost 90% of the total system cost, the
hardware savings become almost negligible. Software discounts have not been taken
into account in this comparison, and neither have maintenance and software upgrade
costs. Since maintenance and upgrade costs are a percentage of the software base
price, it isn't hard to see that Oracle once again leads in high TCO.
Note that high availability requirements (as described in the preceding section) were not
factored into this sample configuration. Deploying the same systems in a highly
available setting minimally involves doubling the count (and cost) of most of the
components listed, including for the RAC system. This applies to hardware components
(true even for RAC systems that already have two nodes, as explained in the earlier
section) and software licenses. Please see your Microsoft or hardware representative for
further details on configuring systems that provide high availability.

Summary
Regardless of the marketing claims, when measuring real deployments SQL Server still
provides the best total cost of ownership for users who demand high performance and
reliability for their systems. The only savings which users may expect with Oracle10g
RAC is purely from the hardware but that typically makes up less than 15% of the
overall system cost (see Budgeting for IT: Average Spending Ratios by J. Giera, Giga
Analyst) and only if the user is moving from an expensive, legacy UNIX system to an
Intel-based commodity system.
If the user is already using Intel-based servers, moving to RAC would probably not
reduce costs. In fact, there is a high likelihood that total cost of ownership would
increase with Oracle RAC. Deploying Oracle RAC does not offer significant scalability or
high availability benefits, but it does require additional purchases such as:
• additional certified server and networking hardware
• certified storage & connectivity solutions
• additional Oracle database licenses (each node needs to be licensed)
• Oracle RAC licenses
• training of operators and administrators
• services for implementation, certification and tuning
This is not even counting infrastructure requirements and the additional administrative
tasks for the DBA, system administrator and network administrator.
On the other hand, not only does SQL Server 2000 deliver the lowest total cost of
ownership, but it also provides the best price/performance for the user. Its intelligent,
dynamic automated resource management features provide peace of mind and reduced
administration overhead. This is not a claim but rather a report on customers’
experience with SQL Server. See
http://www.microsoft.com/sql/evaluation/compare/tco.asp for more details.

Conclusion
Though having navigated around some significant problems inherent with OPS (RAC’s
predecessor) in the area of performance, and having raised the scalability ceiling
marginally, RAC has still not proven it is able to deliver enterprise-class scalability for
real business applications. In high availability, RAC only delivers one small part of any
typical high availability solution and offers, at best, marginal advantage over existing
solutions but at a much high cost & complexity.
The actual cost to deploy and manage RAC is nowhere near as minimal as Oracle might
have users believe; one only needs to attempt installing RAC realize the added
complexity (even for experienced Oracle DBAs). RAC is a significant step forward by
Oracle in making OPS usable, but there is only so much that can be done with an
architecture that limits scalability by design.
Conclusion: RAC is a significant release for Oracle, but does not live up to Oracle’s
marketing claims. RAC typically costs users more to implement and maintain than SMP
systems offering equal or greater performance. SQL Server offers comparable or better
levels of scalability and availability but at a fraction of the cost and is significantly easier
to manage/use.

Appendix I: Resource links
Transaction Processing Performance Council (TPC)
www.tpc.org
SAP Benchmarks
www.sap.com/benchmark
SQL Server Scalability Benchmarks Leadership Proofpoints
www.microsoft.com/sql/worldrecord
SQL Server Total Cost Of Ownership Leadership Proofpoints
www.microsoft.com/sql/evaluation/compare/tco.asp
SQL Server Business Intelligence and Data Warehousing
www.microsoft.com/sql/evaluation/bi/default.asp
You Probably Don't Need RAC
http://www.miracleas.dk/WritingsFromMogens/YouProbablyDontNeedRACUSVersion.pdf

Appendix II: References
1. SAP benchmark reference
Oracle (SAP-SD Parallel)
** This is Oracle’s highest Oracle RAC based SAP SD Parallel benchmark to date

The SAP SD Standard 4.6 C Application Benchmark performed on November 16, 2001 by HP in Nashua, NH,
USA was certified on June 3, 2002 with the following data:

Number of benchmark users & comp.: 12,000 SD (Sales & Distribution) Parallel
Average dialog response time: 1.92 seconds
Throughput:
Fully Processed Order Line items / hour: 1,208,330
Dialog steps / hour: 3,625,000
SAPS: 60,420
Average DB request time (dia/upd): 0.058 sec / 0.185 sec
Operating System all server: HP Tru64 Unix V5.1A
RDBMS: Oracle 9i Real Application Clusters (RAC)
R/3 Release: 4.6 C
Configuration:
4 Database servers (4 active nodes):
HP AlphaServer ES45 Model 2, 4-processors, Alpha EV6.8CB (21264C) 1000 MHz, 8 MB L2 cache, 32 GB main
memory each
Certification Number: 2002031

SQL Server (SAP-SD 3-tier)


The SAP SD standard R/3 Enterprise 4.70 application benchmark performed on March 11, 2004 by HP in
Houston, TX, USA was certified on April 1, 2004 with the following data:

Number of benchmark users & comp.: 8,016 SD (Sales & Distribution)


Average dialog response time: 1.99 seconds
Throughput:
Fully Processed Order Line items / hour: 802,330
Dialog steps / hour: 2,407,000
SAPS: 40,120
Average DB request time (dia/upd): 0.160 sec / 0.235 sec
Operating System all servers: Windows Server 2003 Enterprise Edition
RDBMS database server: SQL Server 2000
R/3 Release: 4.70
Configuration:
Database server: HP ProLiant Model DL580 G2, 4-way SMP, Intel XEON MP, 3.0 GHz, 20 KB L1 cache, 512 KB
L2 cache, 4 MB L3 cache, 8 GB main memory
Certification Number: 2004017
More details are available at http://www.sap.com/benchmark/

2. Oracle TPC-C clustered benchmark required the Partitioning option

hp Integrity rx5670 Cluster 64P C/S with 80 hp ProLiant DL360-G3
Report Date: December 8, 2003
Total System Cost: $6,541,770
TPC-C Throughput: 1,184,893.38 tpmC
Price/Performance: $5.52 per tpmC
Availability Date: April 30, 2004 (hardware available at report date)
Processors: 64 x 1.5GHz Intel Itanium 2 6M (database servers)
Nodes: 16
Database Manager: Oracle Database 10g Enterprise Edition with Real Application
Clusters and Partitioning
Operating System: Red Hat Enterprise Linux AS 3
Other Software: BEA Tuxedo 8.1
Number of users: 1,280,160