The previous section explains the principles behind database replication. The following sections provide step-by-step instructions for setting up and managing replication using VoltDB.
All of the following examples use the same fictional servers to describe the replication process. The server used for the master cluster is called serverA; the server for the replica is serverB.
It is easy to establish database replication with VoltDB. You can replicate any VoltDB database; no special requirements or configuration are needed for the master database. You can begin replication with either a new (empty) database or an existing database that already contains data.
The steps to start replication are:
Start the master database.
You can either create a new database or use an existing database as the master. When starting the database, you can use either of the standard startup arguments: create or recover. For example:
$ voltdb create catalog.jar \
-d deployment.xml \
-H serverA \
-l license.xml
If any of the servers in the master database cluster have two or more network interface cards (and therefore multiple network addresses), you must explicitly identify which interface the server uses for both internal and external communication when you start VoltDB. For example:
$ voltdb create catalog.jar \
-d deployment.xml \
-H serverA \
-l license.xml \
--externalinterface=10.11.169.10 \
--internalinterface=10.12.171.14
If you do not specify which interface to use for multi-homed servers, replication will fail when the DR agent attempts to connect to those servers of the master database.
Create a replica database.
You create a replica database just as you would any other VoltDB database, except instead of specifying create as the startup action, you specify replica. For example:
$ voltdb replica catalog.jar \
-d deployment.xml \
-H serverB \
-l license.xml
Note that the replica database must:
Use the same version of the VoltDB server software.
Start with the same catalog as the master database.
Have the same configuration (that is, the same number of servers, sites per host, and K-safety value) as the master database.
If these settings do not match, the DR agent will report an error and fail to start in the next step.
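For reference, the cluster settings that must match are defined in the deployment file used to start each database. The following hypothetical deployment.xml fragment shows the relevant attributes; the values themselves (two hosts, four sites per host, a K-safety value of one) are examples only, but whatever values you use must be identical for the master and the replica:

<?xml version="1.0"?>
<deployment>
   <cluster hostcount="2"
            sitesperhost="4"
            kfactor="1"
   />
</deployment>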
Start the DR agent.
The DR agent is a separate process that can be run on any server that meets the hardware and software requirements for VoltDB. It is possible to run the agent on the same node as one of the master or replica cluster nodes. However, for best performance, it is recommended that the DR agent run on a separate, dedicated server located near the replica database.
To start the DR agent, use the dragent command, specifying the IP address or hostname of a node from the master database and a node from the replica database as arguments to the command. For example:
$ dragent master serverA replica serverB
If the master or replica uses ports other than the default, you can specify which port the DR agent should use as part of the server name. For example, the following command tells the agent to connect to the master on port 6666 and the replica on port 23232:
$ dragent master serverA:6666 replica serverB:23232
If you are using the Enterprise Manager to manage your databases, you can start the master database (Step 1) as you would normally, using the create, restore, or recover action. There is also a replica option on the Start Database dialog for creating a replica database (Step 2). The DR agent must be started by hand.
When the DR agent starts, it performs the following actions:
Contacts both the master and replica databases.
Verifies that the application catalogs match for the two databases.
Verifies that the two clusters have the same number of unique partitions.
Requests a snapshot from the master database. If data exists, the agent replays the snapshot on the replica.
Begins to POLL and ACK the master database for completed transactions to be replayed on the replica.
If, for any reason, you wish to stop replication of a database, all you need to do is stop the DR agent process or the replica database. If either the agent or the replica database is unable to process the stream of transactions, the master continues to queue completed transactions until the queue is full, at which point it abandons replication, deletes the queue, and resumes normal operation.
In other words, except for logging error messages explaining that replication has stopped, there is no outward change to the master cluster and no interruption of client activity. If you wish to shut down replication in a more orderly fashion, you can do the following (a sample command sequence follows these steps):
Pause the master cluster, using the voltadmin pause command, to put the database in admin mode and stop client activity.
Once all transactions have passed through the DR agent to the replica (see Section 12.2.4.1, “Monitoring the Replication Process”), stop the DR agent process.
Stop the replica database, using voltadmin shutdown to perform an orderly shutdown.
Resume normal client operations on the master database, using voltadmin resume.
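Put together, and using the fictional hosts from this section (serverA for the master, serverB for the replica), the orderly shutdown might look like the following sketch. How you stop the dragent process depends on how it was started, so that step is shown only as a comment, and the voltadmin options follow the same pattern as the promote example later in this section:

$ voltadmin pause --host=serverA        # pause the master; client activity stops
#   ... wait for the DR agent queues to drain, then stop the dragent process ...
$ voltadmin shutdown --host=serverB     # orderly shutdown of the replica
$ voltadmin resume --host=serverA       # resume client activity on the master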
If the master database becomes unreachable for whatever reason (such as catastrophic system or network failure) and you choose to “turn on” the replica as a live database in its place, you use the voltadmin promote command to promote the replica to a fully active (writable) database. Specifically:
Stop the DR agent process. If you do not, the agent will report an error and stop after the following step.
Issue the voltadmin promote command on the replica database.
When you invoke voltadmin promote, the replica exits read-only mode and becomes a fully operational VoltDB database. For example, the following Linux shell command uses voltadmin to promote the replica node serverB:
$ voltadmin promote --host=serverB
Database replication runs silently in the background, providing security against unexpected disruptions. Ideally, the replica will never be needed. But it is there just in case, and the replication process is designed to withstand normal operational glitches. However, some conditions can interrupt replication, and it is important to recognize and respond to those situations to ensure ongoing protection.
Both the master database and the DR agent maintain queues to handle fluctuations in the transmission of transactions. Network hiccups or a sudden increase of load on the master database can cause delays. Nodes on the master cluster may fail and rejoin (assuming K-safety). The queues help the replication process survive such interruptions.
In the case of the master database, replication initially queues data in memory. If the pending data exceeds the allocated queue size, the data then overflows to disk in the directory voltdbroot/dr_overflow.
If the problem persists for too long, it is possible for the queues to fill up, resulting in either the master or the DR agent (or both) canceling replication. When this happens, it is necessary to restart the replication process. The following sections explain how to monitor the replication process and how to respond to error conditions.
There are two ways to monitor the replication process:
The DR agent provides a stream of informational messages concerning its status as part of its logs (displayed on the console by default).
You can query the master database about its current replication queue using the @Statistics system procedure and the "DR" component type.
The DR agent logs information about the ongoing transmissions with the master and the replica. It also reports any issues communicating with the master and continues to retry until communication is re-established. If the agent encounters a problem it cannot recover from, it logs the error and the process stops. In this situation, you must restart replication from the beginning. (See Section 12.2.4.2, “Restarting Replication if an Error Occurs” for details.)
If you do not want the log messages displayed on the console, you can redirect them by providing an alternate Log4J configuration file. You specify the alternate configuration file with the environment variable LOG4J_CONFIG_PATH. For example, the following commands start the DR agent and specify an alternate log configuration file mylogconfig.xml in the current working directory:
$ export LOG4J_CONFIG_PATH="mylogconfig.xml"
$ dragent master serverA replica serverB
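A minimal Log4J configuration that could serve as mylogconfig.xml might look like the following sketch. It simply routes all of the agent's messages at INFO level and above to a file named dragent.log (an arbitrary name chosen for this example) by configuring the root logger; it does not assume anything about the specific logger names VoltDB defines:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
   <!-- Write DR agent messages to a file instead of the console -->
   <appender name="file" class="org.apache.log4j.FileAppender">
      <param name="File" value="dragent.log"/>
      <param name="Append" value="true"/>
      <layout class="org.apache.log4j.PatternLayout">
         <param name="ConversionPattern" value="%d{ISO8601} %-5p [%t] %c: %m%n"/>
      </layout>
   </appender>
   <!-- Send everything at INFO and above to the file appender -->
   <root>
      <priority value="info"/>
      <appender-ref ref="file"/>
   </root>
</log4j:configuration>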
In addition to the DR agent logs, you can query the master database to determine the current state of its replication queues using the @Statistics system procedure. The "DR" keyword returns information about the amount of replication data currently in memory (waiting to be sent to the agent). One VoltTable reports the amount of memory used for queuing transactions and another reports the current status of any snapshots waiting to be sent.
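For example, assuming you use the sqlcmd utility to connect to one of the master nodes, you can check the replication statistics interactively as shown below. The second argument to @Statistics is the standard delta flag; 0 requests cumulative values:

$ sqlcmd --servers=serverA
1> exec @Statistics DR 0;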
If an error does occur that causes replication to fail, you must restart replication from the beginning (a sample command sequence follows these steps). In other words:
Stop the DR agent process, if it is not already stopped.
Shutdown and restart the replica database.
If the master database is not running, restart it.
Restart the DR agent.
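Using the same fictional servers as before, and assuming the master is still running and the dragent process has already stopped, the restart might look like the following sketch. The replica is restarted with the same replica startup command used when it was originally created, and the DR agent is restarted with the same arguments as before:

$ voltadmin shutdown --host=serverB     # stop the replica, if it is still running
$ voltdb replica catalog.jar \
-d deployment.xml \
-H serverB \
-l license.xml
$ dragent master serverA replica serverB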
Note that, if the master is still running, it does not need to be stopped and restarted. However, both the DR agent and the replica database must be restarted if any condition causes replication to fail. Situations that will require restarting replication include the following:
If the replica database stops.
If the master database stops.
If the DR agent stops.
If a snapshot is restored to the master database. (Consequently, if restoring or recovering data when restarting the master database, be sure the restore completes on the master before beginning replication.)
If communication between the master and the DR agent is delayed to the point where the master cluster's replication queues overflow.
If any transaction replayed on the replica fails. Only successfully completed transactions are sent to the replica, so if a replayed transaction fails, the replica is no longer in sync with the master.
If any transaction replayed on the replica returns a different result than received on the master. The results are hashed and compared. Just as all replicated transactions must succeed, they must produce the same results or the two databases are out of sync.