User Manual - IBM InfoSphere Data Replication's Change Data Capture (CDC) Disaster Recovery (DR) Considerations
Version 1.0
IBM Information Management
INTRODUCTION
There are various considerations when CDC is used within an environment for which DR is
implemented. This document will explore the various topologies and considerations for
operating CDC in such an environment.
Before we can talk about recovery of a CDC instance, we will introduce the concept of
database log reading as it applies to CDC, and the concept of a bookmark that CDC uses to
track its replication progress.
When reading a DBMS log, there is the concept of a log position: a unique point in the DBMS log. Examples of a simplistic representation of a log position are an SCN for Oracle and an LSN for DB2.
The CDC bookmark consists of all relevant information required to be able to restart
replication at the appropriate log position (including the current log position and earliest open
log position). Data changes are scraped from the log and sent to the target. CDC will apply
the appropriate database operation (insert/update/delete), and in the same transaction
commit the bookmark to a metadata table. When CDC restarts after any normal or
abnormal shutdown, it will acquire the bookmark from the target system, and restart
replication at the appropriate point in the log. The following diagram illustrates the log reading concept and the population of the CDC bookmark table:
For the purposes of this document, we will use a simplistic representation of a log position (in reality, this can differ quite significantly from one database to another). Here we show that each operation has a corresponding log position. In this case, the 'insert a' corresponds to log position A01, the 'insert b' corresponds to log position A02, and so on. When CDC is replicating and applies data to the target database, it will also write an entry into the bookmark table (within the same commit). The example above illustrates that CDC has replicated all four inserts, and thus the bookmark would contain the log position of the last operation (in this case 'insert d').
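The following is a minimal JDBC sketch of this apply-plus-bookmark pattern. The table and column names (TARGET.ORDERS, CDC_BOOKMARK, LAST_APPLIED_POS) are illustrative assumptions only, not CDC's actual metadata schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class BookmarkedApply {

        /** Apply one replicated insert and persist its log position atomically. */
        public static void applyWithBookmark(Connection conn, String value,
                                             String logPosition) throws Exception {
            conn.setAutoCommit(false); // both writes belong to one transaction
            try (PreparedStatement apply = conn.prepareStatement(
                     "INSERT INTO TARGET.ORDERS (VAL) VALUES (?)");
                 PreparedStatement bookmark = conn.prepareStatement(
                     "UPDATE CDC_BOOKMARK SET LAST_APPLIED_POS = ?")) {
                apply.setString(1, value);
                apply.executeUpdate();              // the replicated operation
                bookmark.setString(1, logPosition);
                bookmark.executeUpdate();           // the bookmark, e.g. "A04"
                conn.commit();   // operation and bookmark succeed or fail together
            } catch (Exception e) {
                conn.rollback(); // neither the row nor the bookmark persists
                throw e;
            }
        }
    }

Because the commit covers both statements, the bookmark can never describe an operation that was not applied, and vice versa.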
The example above does oversimplify the bookmark. In actuality, multiple items make up the CDC bookmark. As it pertains to the topic of CDC in a DR environment, there are two key aspects that you need to be aware of:
1. The last applied log position (which was illustrated above)
2. The earliest open transaction log position
When CDC reads data from the logs, it will first build the transactions from the source system and will not send them to the target system until a complete transaction has been built. The earliest open log position keeps track of the log position for the start of a transaction which CDC has started to process (read from the logs), but has not yet seen an end transaction for. This is an important concept for DR recovery because, when CDC restarts replication, it may need to go back to the earliest open log position. This is another good reason to follow the best practice of avoiding large and long-running transactions.
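To make the two bookmark components concrete, here is an illustrative tracker using the simplified string log positions from this document (which happen to sort correctly as text). This is a sketch of the concept, not CDC's internal implementation:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class BookmarkTracker {

        // Begin log position of each transaction that is still being built.
        private final Map<String, String> beginPosByTxn = new HashMap<>();
        // The same begin positions, sorted, so the earliest is cheap to find.
        private final TreeMap<String, String> openByPosition = new TreeMap<>();
        private String lastCompletedPosition;

        /** A begin-transaction record was read from the log. */
        public void beginTransaction(String txnId, String logPosition) {
            beginPosByTxn.put(txnId, logPosition);
            openByPosition.put(logPosition, txnId);
        }

        /** An end-transaction record was read; the transaction can now be sent. */
        public void endTransaction(String txnId, String logPosition) {
            String begin = beginPosByTxn.remove(txnId);
            if (begin != null) {
                openByPosition.remove(begin);
            }
            lastCompletedPosition = logPosition;
        }

        /**
         * Where replication must restart: the begin position of the earliest
         * still-open transaction, otherwise just after the last completed
         * position. A single long-running transaction drags this position far
         * back in the log, which is why such transactions are best avoided.
         */
        public String restartPosition() {
            return openByPosition.isEmpty() ? lastCompletedPosition
                                            : openByPosition.firstKey();
        }
    }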
How CDC is installed and how the instance should be created will depend on the DR
replication solution being utilized.
If you are using physical replication to your DR machine, it is best to mirror the file system that CDC is installed on to the DR site as well. In this case, no additional install or instance creation will be required on the DR system. In all other scenarios, where the file system that CDC is installed on is not replicated to the DR system, you will be required to do an additional install of CDC on the DR system, and to create the CDC instance as well.
The following are items that need to be considered when CDC is failed over to another system:
1. IP Address used to reach CDC
2. CDC configuration metadata (stored in an internal database)
3. CDC Operational metadata (stored in client database), most importantly the CDC
bookmark table
For now, let's consider that CDC is installed locally on the source and target database servers. In the following diagram, the active production CDC replication is indicated by the solid arrow. The dashed arrows numbered 1 to 3 represent possible CDC replication after either the source production server fails and is switched over to a backup server, the target production server fails and is switched over, or both the source and target production servers fail.
Although the replication may not be instantaneous, synchronous physical replication will
ensure that the data on the source and target will remain synchronized in the event of a
failure. Note that in the above example the DR log always contains the exact same image
as the Production log.
When using synchronous physical DR replication, you can either replicate only the database, or replicate both the database and the CDC instance directory. It is recommended that you also replicate the CDC instance directory: you will then not need to do a separate install on the DR system, and the CDC internal metadata will be kept in sync.
Using asynchronous physical replication means that there is the possibility that the DR system will have an image from an earlier point in time than the production system. Note in the above example that the log on the DR system is at a different point in time than the log on the production system.
When using asynchronous physical DR replication, you can likewise either replicate only the database, or replicate both the database and the CDC instance directory. It is recommended that you also replicate the CDC instance directory: you will then not need to do a separate install on the DR system, and the CDC internal metadata will be kept in sync. Even though there is some latency with asynchronous physical replication, given that the CDC metadata rarely changes (only when configuration changes are made), the CDC metadata on the DR site should always be in sync.
In the above example, depicted by the wide arrows, whole-system physical replication is used between the production and DR servers. Also key is that a synchronous mode is being utilized for the replication. DBMS-level synchronization between the production and DR systems is handled by the DR solution.
It is also ideal to use the physical replication to mirror the CDC instance directory. If you are using synchronous replication for the DBMS but not for the CDC instance directory, please see the section 'Keeping CDC Instance Synchronized'. If you do use physical mirroring for the CDC instance, fail-over is straightforward, since the DR system holds an exact copy of the entire CDC instance and database. As a result, there are no CDC-specific considerations about log position and data availability for disaster recovery in this scenario.
The only CDC consideration is dealing with an IP address change. Using a virtual IP address is recommended and simplifies the process. If a virtual IP address cannot be used, you will need to follow the procedures outlined in the 'Changing CDC IP Address' section of this document.
The above diagram illustrates some of the possible outcomes at the point of failure. After a fail-over, the data replicated to the target can be ahead of, equal to, or behind the data on the source. For instance, in the above diagram, if only the production source switched over to the DR box, the production target database is ahead of the source and contains data that does not exist on the source. If only the production target machine switched over to the DR target, then the data on the DR target would be behind the production source, which is the easier situation to deal with. Refer to the section 'How To Determine If Log Position Valid After Switch-Over' for information on determining which scenario you fall into after a fail-over.
Lastly, the considerations for CDC instance synchronization and IP address change are common to the asynchronous and synchronous physical DR replication methods, and as such are not repeated here.
If you are using asynchronous physical DR replication, and after switchover the CDC target is at the same point as or behind the source log, then you have a straightforward recovery case. Below are two examples of situations which match this scenario:
or
In the first example, the source switches over to the DR site. In this case, the DR source system has more recent data than the production target. In the second example, the production target switches over to the DR target, and again, the source has more recent data than the target. In both of these examples, since the data on the target is older than what is available on the source server, there is no special CDC consideration required. When you restart CDC, it will go back in the log based on the log position stored in the CDC target bookmark, and will restart replication as per normal operating behavior.
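As an illustration of that restart lookup, the sketch below reads a hypothetical bookmark table on the target and derives the restart position. The CDC_BOOKMARK name and its columns are assumptions for illustration, not CDC's real metadata schema:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RestartFromBookmark {

        /** Returns the log position to restart from, or null if no bookmark exists. */
        public static String readRestartPosition(Connection target) throws Exception {
            try (Statement st = target.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT LAST_APPLIED_POS, EARLIEST_OPEN_POS FROM CDC_BOOKMARK")) {
                if (!rs.next()) {
                    // No bookmark row at all: nothing has been replicated yet.
                    return null;
                }
                String earliestOpen = rs.getString("EARLIEST_OPEN_POS");
                String lastApplied = rs.getString("LAST_APPLIED_POS");
                // Resume at the earliest open transaction if one was in flight,
                // otherwise immediately after the last applied position.
                return earliestOpen != null ? earliestOpen : lastApplied;
            }
        }
    }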
If you are using asynchronous physical DR replication, and after switchover the CDC target is ahead of the source log, then recovery is significantly more involved. The following diagram illustrates this case:
In the above example, CDC had replicated the data up to log position A04. However, the asynchronous source DR replication had only replicated up to the point in time of log position A02. Thus, at the point of the fail-over to the DR source, the DR source is missing data that has already been applied to the target system. Because of this, the bookmark on the target system is invalid, as it refers to a log position that does not exist on the source system.
This situation is one you want to avoid at all costs. If, for instance, you know that the asynchronous DR replication will be at most 2 minutes latent, one way that you may be able to prevent this situation is by creating a CDC target user exit that delays the apply by a set amount of time (in this example, 3 minutes). There is a sample user exit available on developerWorks, located here:
https://www.ibm.com/developerworks/community/files/app/file/f047a38c-734a-4071-8a3c-4fe37c85baeb
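The sketch below shows only the delay idea, not the actual sample linked above: hold each change until it is older than the worst-case DR lag before allowing the apply to proceed. In a real deployment this logic would live inside a CDC target user exit; see the linked sample for the actual user exit API.

    import java.time.Duration;
    import java.time.Instant;

    public class DelayedApplyGate {

        // Must exceed the worst-case asynchronous DR latency; the text pairs a
        // 3-minute delay with a 2-minute maximum lag.
        private static final Duration DELAY = Duration.ofMinutes(3);

        /** Block until the change is old enough that async DR has caught up. */
        public static void waitUntilSafe(Instant changeCommitTime)
                throws InterruptedException {
            Instant safeAt = changeCommitTime.plus(DELAY);
            long waitMs = Duration.between(Instant.now(), safeAt).toMillis();
            if (waitMs > 0) {
                Thread.sleep(waitMs);
            }
        }
    }

The trade-off is that target latency increases by the configured delay, so the delay should be no larger than needed to cover the DR replication lag.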
If you need to deal with the situation in the example above, then you are out of sync and
need to make a business decision on how to move forward. Here are some possible
options:
1) Reconcile the DR source with the latest changes that were replicated to the target but missed by the asynchronous DR replication solution
• In this case, the operations/transactions that were not captured by the DR replication solution need to be repopulated to the DR source location
• This reconciliation needs to be performed before restarting the source applications on the DR site
In case ①, when the target fails, you will not be required to do any special processing (beyond ensuring that the CDC instance configuration is up to date), and can just restart CDC replication, which will pick up from the last applied entry.
In cases ② and ③, the DR source has a different log, and the log positions do not correspond to the log positions found on the original production source. There is also the added potential complexity that the Object IDs may be different on the DR source. Additionally, this example also illustrates that the CDC target may be ahead of the DR replication, so there is data on the target that does not exist on the source. The techniques to deal with the newer data are the same as those outlined for asynchronous physical DR replication.
Dealing with the potential difference in Object IDs will require the table mappings to be reconfigured. The simplest way to accomplish this is by doing an export on the production system and an import on the DR system. Note that after you import the table mappings, the tables will be marked for refresh and the bookmark position will be 'reset'. The bookmark position is not valid since it corresponds to another log. You will need to change the tables from refresh to mirroring, and mark the table capture point to set an appropriate bookmark starting position.
Note that the above is very simplified, as marking the correct log position can involve the complex setup that was described previously.
The ideal case is to use physical replication to mirror the CDC instance directory to the DR machine (this would not be applicable to an MS Windows environment). If physical replication of the CDC instance directory is not possible, then you will have to perform the following:
If you are using asynchronous physical DR mirroring, the easiest way to know if the target is ahead of the source is to start CDC replication (before making the source system available to users). If, upon restart, the bookmark position is not found, the target is ahead of the source. Since the CDC bookmark is specific to the source database, there are different procedures to determine the right bookmark, and for some databases this can only be done with the help of IBM L2 support. The following sections outline the procedures for some databases.
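Before those procedures, here is a hedged Oracle-only illustration of the underlying check. Assuming you have extracted the SCN from the bookmark (for some engines this itself requires IBM L2 support), comparing it against the DR source's current SCN shows whether the target is ahead. V$DATABASE.CURRENT_SCN is a standard Oracle view and column; other databases need their own equivalent:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BookmarkValidityCheck {

        /** True if the bookmark points past anything present on the DR source. */
        public static boolean targetIsAhead(Connection drSource, long bookmarkScn)
                throws Exception {
            try (Statement st = drSource.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT CURRENT_SCN FROM V$DATABASE")) {
                rs.next();
                long sourceScn = rs.getLong(1);
                // A bookmark beyond the source's newest change describes data
                // that never reached the DR source, so it is invalid there.
                return bookmarkScn > sourceScn;
            }
        }
    }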
If your CDC target is not a database, then the CDC bookmark will not be in the customer target database, but rather stored in the internal CDC metadata. This aspect requires special consideration for DR of the CDC target system. The following CDC engines are examples of ones that have this characteristic:
• CDC Event Server
• CDC for DataStage if using FlatFile
Since the bookmark is stored in the CDC instance directory, it is very important to ensure that the CDC target instance directory is mirrored to the DR target system for easier recovery. The other approach is to use the dmbackupmd command on a very regular basis and ensure that the backup is available for restore on the backup target system. Note that, using this technique, the stored bookmark will be at a point in time behind what has already been applied, so when replication is restarted, you will "replay" some transactions.
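Below is a sketch of scheduling that regular backup. The -I <instance> flag follows the usual pattern of the CDC dm* commands and the instance name is illustrative; verify both against your installed version. The step that ships the backup to the DR target is environment-specific and only hinted at in a comment:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class MetadataBackupScheduler {

        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            // "Very regular basis": every 15 minutes is an illustrative choice;
            // the shorter the interval, the fewer transactions are replayed.
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    Process p = new ProcessBuilder(
                            "bin/dmbackupmd", "-I", "cdcinst") // illustrative instance name
                            .inheritIO()
                            .start();
                    p.waitFor();
                    // ...then copy the backup directory to the DR target system.
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }, 0, 15, TimeUnit.MINUTES);
        }
    }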