Troubleshooting Backup Failures
Effective date: 10/18/2006
Your name: Anthony Nguyen
Description of Change: Document Creation

Effective date: 10/30/2006
Your name: Anthony Nguyen
Description of Change: Changed escalation process per Ron Caplinger and Bryce Pier.

Effective date: 11/29/2006
Your name: Jackie Schlitz
Description of Change: Added the following comments:
Whenever a new Netbackup alert appears in OVOU, please verify that it is not related to a Sev 2 ticket or any other ticket.
If a backup job is re-started, please monitor it and keep the job id handy.
INDEX
Responsibilities/Escalation
HPOV Alert
Netbackup Administration Console
Error Numbers and Resolution
Netbackup Troubleshooter
Restarting Failed Backups
Manually Stopping a Backup Job
Troubleshooting Windows NetBackup Clients
Troubleshooting Unix/Linux NetBackup Clients
Shutting off VSP (Veritas SnapShot Provider)
Netbackup Drives Alerts
Contacting IBM on hardware issues
Responsibilities/Escalation:
Infrastructure Support attempts to remediate all backup failures and creates a trouble ticket to
track the failures. Any failure that cannot be resolved by Infrastructure Support is escalated to
L3-ENT-BACK. Normally, if the failure is a single failure, a Sev 3 Med ticket is sufficient. In the case
of error code 96 (Out of Media), send a Sev 3 Critical ticket to L3-ENT-BACK. If there are multiple
errors of the same kind, first follow the normal troubleshooting procedures for that error type.
Examples of multiple errors of the same kind: more than two tape drives down that cannot be
brought UP; both tape drives on the same Media Server down that cannot be brought UP;
multiple 219 (Storage Unit Unavailable) errors; multiple 84 (Media Write Error) errors.
If you are still unable to resolve the problem, send a Sev 3 Critical ticket to L3-ENT-BACK.
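The escalation rules above can be summarized as a small decision function. This is only an illustrative sketch in Python, not part of Netbackup or of any ticketing tool. The function name escalation_severity and its parameters are hypothetical and exist only to make the rules explicit.

```python
# Illustrative sketch of the escalation rules; all names are hypothetical.
ESCALATION_QUEUE = "L3-ENT-BACK"


def escalation_severity(error_code, repeated=False, resolved=False):
    """Return the severity of the ticket to send to L3-ENT-BACK, or None.

    error_code: Netbackup status code of the failure (for example 96, 219, 84).
    repeated:   True when there are multiple errors of the same kind.
    resolved:   True when Infrastructure Support resolved the failure itself.
    """
    if error_code == 96:
        # Out of Media is sent to the backup team as Sev 3 Critical.
        return "Sev 3 Critical"
    if resolved:
        # Resolved failures are still tracked in a ticket but not escalated.
        return None
    if repeated:
        # Multiple errors of one kind that normal troubleshooting did not fix.
        return "Sev 3 Critical"
    # A single failure that could not be resolved.
    return "Sev 3 Med"


print(ESCALATION_QUEUE, escalation_severity(219, repeated=True))
```

The check for error 96 comes first because that rule is stated without any condition on troubleshooting.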
Teradata Backups:
At this time, Teradata backups are the responsibility of the backup team. Teradata backups can
be distinguished by the policy they belong to. The policy name of a Teradata backup starts
with "TD_":
Example: TD_DS32BKP, TD_DS35BKP, TD_1a_Inv_Item_Dict, etc.
For Teradata backups, create a Sev 3 ticket to L3-ENT-BACK.
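As a minimal sketch, the policy-name check could look like the following in Python. The function name is_teradata_policy is hypothetical, and the check assumes the "TD_" prefix is always upper case, as in the examples above.

```python
def is_teradata_policy(policy_name):
    """Return True when the policy name starts with the Teradata prefix "TD_"."""
    # Assumption: the prefix is case sensitive, matching the documented examples.
    return policy_name.startswith("TD_")


for name in ("TD_DS32BKP", "TD_1a_Inv_Item_Dict", "BDC_Wintel_Prod_File_Z_Drive"):
    print(name, is_teradata_policy(name))
```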
HPOV Alert
Infrastructure Support will get an alert when a backup fails. The alerts will appear in the OVOW
console with the message group NBup. The alerts will also appear in the OVOU console with the
message group DCTech. The alert will look like the following:
HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z
BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202
10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at
07:07:49
The alert provides the name of the media server (ds27bkup), the name of the client
(dvp03fc2z), the backup policy (BDC_Wintel_Prod_File_Z_Drive), the schedule being executed
(Monthly_Cumulative), the date and time the job started (10/09/2006 06:30:06), the error
number (54), the reason why the backup failed (timed out connecting to client), and the
date and time the job ended abnormally (10/09/2006 at 07:07:49).
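If the alert text ever needs to be split into these fields by a script, a regular expression along the following lines may work. This is a sketch based only on the single sample alert above; the pattern and the function name parse_alert are assumptions, not a format guaranteed by HPOV or Netbackup. The two numeric fields after the schedule are not described in this document, so they are captured without interpretation.

```python
import re

# Pattern for the part of the alert that follows "Entry:".
# number_1 and number_2 are the two undocumented numeric fields.
ALERT_RE = re.compile(
    r"Entry:\s+(?P<media_server>\S+)\s+(?P<client>\S+)\s+"
    r"(?P<policy>\S+)\s+(?P<schedule>\S+)\s+"
    r"(?P<number_1>\d+)\s+(?P<number_2>\d+)\s+"
    r"(?P<started>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<error>\d+)\s+:(?P<reason>.*?)\s+on\s+"
    r"(?P<ended_date>\d{2}/\d{2}/\d{4})\s+at\s+(?P<ended_time>\d{2}:\d{2}:\d{2})"
)


def parse_alert(text):
    """Return the alert fields as a dict of strings, or None if nothing matches."""
    # The sample alert is wrapped across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = ALERT_RE.search(flat)
    return match.groupdict() if match else None


sample = """HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z
BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202
10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at
07:07:49"""

fields = parse_alert(sample)
print(fields["client"], fields["error"], fields["reason"])
```

All values are returned as strings; convert the error number with int() if it needs to be compared numerically.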
Since Netbackup retries failed backup jobs twice, you should get only one job failure alert,
after the 2nd retry.
In addition to the backup job failure alerts, Infrastructure Support will also get alerts on Netbackup
Drives.
Whenever a new Netbackup alert appears in OVOU, please verify that it is not related to a Sev 2
ticket or any other ticket.
1. Restart the job that has failed. See Restarting Backup jobs via Backup Policies.
Error 219 Storage Unit is currently unavailable
1. Check for drives down. See Error 52.
2. Create a Sev 3 Critical trouble ticket for the Enterprise Backup & Recovery group (L3-ENT-BACK).
Self Correcting Jobs
Some jobs will automatically re-run when they hit certain errors. The Exchange Store jobs (e.g.
Exchange_BDC_STG1) are monitored for a status of 1. When that status occurs, they are
automatically re-run and an email with a subject of Restart Notice is sent. Sometimes this
process does not work well due to an issue on the Exchange server, and you will see many
restarts for the same servers in a short period of time. If this occurs, contact the Enterprise
Backup & Recovery group.
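One way to spot repeated restarts is to count Restart Notice emails per server within a time window. The Python sketch below is illustrative only. The function name, the one-hour window and the limit of three restarts are placeholder assumptions, because this document does not define how many restarts in how short a period is too many.

```python
from collections import Counter
from datetime import datetime, timedelta


def servers_restarting_too_often(restarts, now, window=timedelta(hours=1), limit=3):
    """Return the servers with more than `limit` restarts inside `window`.

    restarts: iterable of (server_name, restart_time) pairs, for example
              collected by hand from the Restart Notice emails.
    window, limit: placeholder thresholds, not values from this document.
    """
    recent = Counter(
        server for server, when in restarts if now - window <= when <= now
    )
    return sorted(server for server, count in recent.items() if count > limit)


# Hypothetical server names and times, for illustration only.
now = datetime(2006, 11, 29, 12, 0)
restarts = [("exch1", now - timedelta(minutes=m)) for m in (5, 10, 15, 20)]
restarts.append(("exch2", now - timedelta(minutes=5)))
print(servers_restarting_too_often(restarts, now))
```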
Netbackup Troubleshooter
Additional error codes and resolution steps can be found using the Troubleshooter within
Netbackup. To access the Troubleshooter, click on the hand/wrench icon. See Figure 1.
Figure 1
Enter the error code into the status code field and click Lookup (Figure 2). The Troubleshooter
will detail the problem and provide troubleshooting steps based on the error code you entered.
See Figure 3.
Figure 2
Figure 3
Restarting Failed Backups
1. The first thing you need to know is the backup policy and the schedule the failed
backup ran under. You can find this information from the HPOV alert or by looking at
the activity monitor and locating or filtering for the client.
2. Log into one of the media servers listed in the List of Netbackup Media Server chart.
Choose the media server for the domain the client belongs to. Launch the Netbackup
Administration Console.
3. On the left pane, click on the + next to Policies.
4. Locate the backup policy from step 1.
5. Right click on the backup policy and select Manual Backup (Figure 4).
Figure 4
6. Select the schedule (from step 1) on the left pane and select the client from the right
pane. Click OK (Figure 5).
Figure 5
7. Verify the backup started via the activity monitor.
Note:
It is OK to restart incremental and full backups. If you restart a backup and users
complain about the performance of the server, please kill the backup and restart it at a
later time.
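This document only describes restarting a backup through the Administration Console. For reference, Netbackup also ships a bpbackup command whose -i option starts an immediate manual backup for a policy (-p), schedule (-s) and client (-h). The Python sketch below only builds such a command line and does not run it. The function name is hypothetical, the default path is the usual Unix install location and is an assumption for this environment, and the options should be verified against the Netbackup command reference for the installed version before use.

```python
def manual_backup_command(policy, schedule, client,
                          bpbackup="/usr/openv/netbackup/bin/bpbackup"):
    """Build (but do not run) a bpbackup command for an immediate manual backup.

    policy and schedule come from step 1; client is the failed client.
    The bpbackup path is an assumed default and may differ on your servers.
    """
    return [bpbackup, "-i", "-p", policy, "-s", schedule, "-h", client]


cmd = manual_backup_command(
    "BDC_Wintel_Prod_File_Z_Drive", "Monthly_Cumulative", "dvp03fc2z"
)
print(" ".join(cmd))
```

As with the console procedure, verify in the activity monitor that the job actually started.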
DMZ Domain
Teradata
DS15BKUP
DS16BKUP
DS17BKUP
DS20BKUP
DS21BKUP
DS22BKUP
DS23BKUP
DS24BKUP
DS25BKUP
DS26BKUP
DS27BKUP
DS28BKUP
RS20BKUP
RS21BKUP
Prod DMZ:
DS19BKUP
RS22BKUP
DS30BKUP
DS31BKUP
DS32BKUP
DS33BKUP
DS34BKUP
DS35BKUP
DS37BKUP
DS38BKUP
Legacy DMZ:
DS29BKUP
3.
4.
5.
6.
Figure 6
7.
8.
Figure 7
9. In the field below VSP volume exclude list (drive letters separated by commas):
enter all drives on the server (example: c,d,e,f,g).
10. Check Customize the cache sizes
Serial #
78A0347
78A0308
78A1096
800-426-7378
3584
3592
Phone # Assoc. to Lib.
(612) 670-2706
(952) 324-1872
(952) 324-1872
FYI, you will need to give IBM the Phone # associated with the library, as well as the library model
# (3584, same for all libraries) and serial #. If the problem appears to be a tape drive, or if IBM
asks, the tape drive model # we use is 3592.