0% found this document useful (0 votes)
13 views5 pages

Block Change Tracking RMAN

Recovery Manager (RMAN) introduced incremental backups in Oracle, but prior to version 10g, these required full datafile scans, leading to inefficiencies. Change Tracking was implemented to record changed blocks in real-time, improving backup performance by reducing the amount of data scanned. A recent incident revealed that a node failure disabled Change Tracking, causing backups to revert to scanning entire files, which significantly increased backup times until the issue was resolved by re-enabling Change Tracking and performing a full backup.

Uploaded by

prabhatbhusan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views5 pages

Block Change Tracking RMAN

Recovery Manager (RMAN) introduced incremental backups in Oracle, but prior to version 10g, these required full datafile scans, leading to inefficiencies. Change Tracking was implemented to record changed blocks in real-time, improving backup performance by reducing the amount of data scanned. A recent incident revealed that a node failure disabled Change Tracking, causing backups to revert to scanning entire files, which significantly increased backup times until the issue was resolved by re-enabling Change Tracking and performing a full backup.

Uploaded by

prabhatbhusan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 5

Recovery Manager (RMAN) was introduced in Oracle 8.

0 and enabled true incremental


backups for the first time. Prior to RMAN, database exports could be taken incrementally,
but if any block in a table was changed, the entire table was written. RMAN incremental
backups write only the changed blocks from a given datafile, thus saving space.

Prior to version 10g, however, incremental backups required a scan of an entire datafile to
detect which blocks in it had changed. For very large databases, this made the time
required for incrementals not much better than for full backups, even though space was
saved.

Change Tracking, sometimes called Block Change Tracking, provides a solution by


recording to a separate Change Tracking file which blocks have changed. The changes are
written in real time by a background process.

The first level 0 incremental backup still requires a complete scan. Thereafter, the Change
Tracking file is consulted during backups and the required blocks are accessed directly.

To create the Change Tracking file and enable Change Tracking, make sure the database
is either open or mounted. Then issue an ALTER DATABASE command:

ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE <pathname>;

where <pathname> is the operating system pathname to the location where you want the
Change Tracking file to be stored. (If you have Oracle Managed Files enabled, the USING
clause is optional, and the file will be created in the directory pointed to by
DB_CREATE_FILE_DEST.)

Regardless of how the path is determined, the file begins at 10MB in size and grows by
10MB if it needs to be extended.

There is a performance impact to enabling Change Tracking, so unless you are using
RMAN's incremental backup feature, it is best left disabled (the default). But if you are doing
incremental backups, the time lost recording changes will be more than made up in faster
backups.
Magic of Block Change Tracking
The other day, one of the instances of our main 3-node RAC database crashed due to an I/O
error. Apparently there were many "dead paths" from the host to the SAN, which made
some LUNs in the ASM diskgroup not being recognizable by the ASM instance. Why? Well,
that's the topic for another blog. The point behind this one has nothing to do with the
reason. All I want to convey that there was a LUN problem on one of the nodes which
brought the instance down. Being RAC, the database was serviced from other two nodes -
praise God for RAC! - and the users didn't notice it terribly (or, I would like to think that
way).

After a few days we noticed the incremental RMAN backup taking a long time. This caused
major issues - it took a long time and I/O waits went through the roof. In fact it took
increasingly longer every day that passed by that unfortunate collapse of the node.
Everyone was quite intrigued - what could be the connection between an instance crash and
instance crashing? All sorts of theories cropped up - from failed HBA cards to undiscovered
RAC bugs.
This is where I got involved. The following chronicles the diagnosis of the issue and the
resolution.

First, the increased length of time is obviously a result of the incemental backups doing
more work, i.e. more changed blocks. What caused so many changed blocks? Interviews
with stakeholdrs yielded no clear answer - there was absolutely no reason for increased
activity. Since we are doing proper research, I decided to start with the facts. How much was
the extra blocks processed by incrementals?

I started with this simple query:

select completion_time, datafile_blocks, blocks_read, blocks


from v$backup_datafile
where file# = 1
order by 1
/

Output:

COMPLETIO DATAFILE_BLOCKS BLOCKS_READ BLOCKS


--------- --------------- ----------- ----------
18-JUL-08 524288 32023 31713
19-JUL-08 524288 11652 10960
20-JUL-08 524288 524288 12764
21-JUL-08 524288 524288 5612
22-JUL-08 524288 524288 11089

The columns are:

DATAFILE_BLOCKS - the number of blocks in the datafile at that time


BLOCKS_READ - the exact number of blocks the RMAN incremental backup read
BLOCKS - the numberof blocks it actually backed up

From the above output, a pattern emerges - until Jul 20th, the backup read only a few
blocks; but on July 20th, it started scanning the entire file - all the blocks! I checked for a few
other datafiles and the story is the same everywhere. With a 4.5 TB database, if the
incremental backup reads the datafiles in entirity, then I/O would obviously go for a toss.
That explains the I/O and time issue.
But why did RMAN switch from reading a few blocks to the whole file that day? The #1
suspect is Block Change Tracking. The 10g feature BCT allows RMAN to scan only the
changed blocks and not the entire file. We use that. So, did something happen to make that
disappear?

to answer, I issued a modified query:

select completion_time, datafile_blocks, blocks_read, blocks,


used_change_tracking
from v$backup_datafile
where file# = 1
order by 1
/

Output:

COMPLETIO DATAFILE_BLOCKS BLOCKS_READ BLOCKS USE


--------- --------------- ----------- ---------- ---
18-JUL-08 524288 32023 31713 YES
19-JUL-08 524288 11652 10960 YES
20-JUL-08 524288 524288 12764 NO
21-JUL-08 524288 524288 5612 NO
22-JUL-08 524288 524288 11089 NO

Bingo! The BCT use ceased from the 20th July date. That was what caused the whole file to
be scanned. But why was it stopped? No one actually stopped it.

Investigating even further, I found from the alert log of Node 1:

Sun Jul 20 00:23:52 2008


CHANGE TRACKING ERROR in another instance, disabling change tracking
Block change tracking service stopping.

From Node 2:

Sun Jul 20 00:23:51 2008


CHANGE TRACKING ERROR in another instance, disabling change tracking
Block change tracking service stopping.

Alert log of Node 3 showed the issue:

Sun Jul 20 00:23:50 2008


Unexpected communication failure with ASM instance:
ORA-12549: TNS:operating system resource quota exceeded
CHANGE TRACKING ERROR 19755, disabling change tracking
Sun Jul 20 00:23:50 2008
Errors in file /xxx/oracle/admin/XXXX/bdump/xxx3_ctwr_20729.trc:
ORA-19755: could not open change tracking file
ORA-19750: change tracking file: '+DG1/change_tracking.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DG1/change_tracking.dbf
ORA-12549: TNS:operating system resource quota exceeded
Block change tracking service stopping.

The last message shows the true error. The error was “operating system resource quota
exceeded”, making the diskgroup unavailable. Since the ASM diskgroup was down, all the
files were also not available, including BCT file. Surprisingly, Oracle decided to stop BCT
altogether rather than report it as a problem and let the user decide the next steps. So block
change tracking was silently disabled and the DBAs didn't get a hint of that. Ouch!

Resolution

Well, now that we discovered the issue, we took the necessary steps to correct it. Because
of the usual change control process, it took some time to have the change approved and put
in place. We executed the following to put the BCT file.

alter database enable block change tracking using file '+DG1/change_tracking.dbf'

The entry in alert log confirms it (all all nodes)

Block change tracking file is current.

But this does not solve the issue completely. to use block change tracking, there has to be a
baseline, which is generally a full backup. We never take a full backup. We always take an
incremental image copy and then merge to a full backup on a separate location. So, the first
order of business was to take a full backup. After that we immediately took an incremental.
It took just about an hour, down from some 18+ hours earlier.

Here is some analysis. Looking at the backup of just one file - file#1, i.e. SYSTEM datafile:

select COMPLETION_TIME, USED_CHANGE_TRACKING, BLOCKS, BLOCKS_READ


from v$backup_datafile
where file# = 1
order by 1
/

The output:

COMPLETIO USE BLOCKS BLOCKS_READ


--------- --- ---------- -----------
18-AUG-08 NO 31713 524288
18-AUG-08 NO 10960 524288
20-AUG-08 NO 12764 524288

21-AUG-08 NO 5612 524288


22-AUG-08 NO 11089 524288
23-AUG-08 NO 8217 524288
23-AUG-08 NO 8025 524288
25-AUG-08 NO 3230 524288
26-AUG-08 NO 6629 524288
27-AUG-08 NO 11094 524288 <= the filesize was increased 28-AUG-08 NO 3608
786432 29-AUG-08 NO 8199 786432 29-AUG-08 NO 12893 786432 31-AUG-08 YES 1798
6055 01-SEP-08 YES 7664 35411
Columns descriptions:

USE - was Block Change Tracking used?


BLOCKS - the number of blocks backed up
BLOCKS_READ - the number of blocks read by the backup

Note, when the BCT was not used, the *entire* file - 524288 blocks - were
being read every time. Of course only a percent of that was being backed up
since that percentage changed; but the whole file was being checked.
After BCT, note how the "blocks read" number dropped dramatically. That is
the magic behind the dropped time.

I wanted to find out exactly how much I/O savings BCT was bringing us. A simple query
would show that:

select sum(BLOCKS_READ)/sum(DATAFILE_BLOCKS)
from v$backup_datafile
where USED_CHANGE_TRACKING = 'YES'
/

The output:
.09581342

That's just 9.58%. After BCT, only 9.58% of the blocks of the datafiles were scanned!
Consider the impact of that. Before BCT, the entire file was scanned for changed blocks.
After BCT, only about 9.58% of the blocks were scanned for changed blocks. Just 9.58%.
How sweet is that?!!!

Here are three representative files:

File# Blocks Read Actual # of blocks Pct Read


----- ------------- ------------------- --------
985 109 1254400 .009
986 1 786432 .000
987 1 1048576 .000

Note, files 986 and 987 were virtually unread (only one block was read). Before BCT, all the
1048576 blocks were read; after BCT only 1 was. This makes perfect sense. These files are
essentially older data; so nothing changes there. RMAN incremental is now blazing fast
because it scans less than 10% of the blocks. The I/O problem disappered too, making the
database performance even better.

So, we started with some random I/O issue, causing a node failure, which led to increased
time for incremental, which was tracjed down to a block change tracking file being
suspended by Oracle silently without raising an error.

Takeaways:

The single biggest takeway you should get is that just because it is defined, don't get the
idea it is going to be there. So, a periodic check for the BCT file is a must. I will work on
developing an automated tool to check for non-use of BCT file. The tool will essentially issue:

SELECT count(1)
FROM v$backup_datafile
where USED_CHANGE_TRACKING = 'NO'
/

If the output is >1, then an alert should be issued. Material for the next blog. Thanks for
reading.

You might also like