
Storage Knowledge Base - Failed Disk Replacement in NetApp

This document discusses disk replacement procedures in NetApp storage systems and potential complications that may arise. It describes two scenarios: 1) A replaced disk is still shown as broken, which could be due to improper labeling or a faulty disk. Contacting NetApp for guidance is recommended. 2) Two disks fail in the same RAID group without spare disks available, putting data at high risk. Replacing disks ASAP from another system or contacting NetApp for immediate replacement is advised. Always maintaining adequate spare disks is important for avoiding data loss during rebuilds or replacement delays.


1/18/2019 Storage Knowledge Base : Failed disk replacement in NetApp


Storage Knowledge Base

Monday, January 20, 2014

Failed disk replacement in NetApp

Disk failures are very common in storage environments, and as storage administrators we come across them often. How often depends on how many disks your storage systems have: the more disks you manage, the more often you will see a failure.
I have written this post with RAID-DP and FC-AL disks in mind, because RAID-DP is always better than RAID4 and we don't use SCSI loops. By design, RAID-DP protects against a double disk failure in a single RAID group. In other words, you will not lose data even if two disks fail in a single RG, whether at the same time or one after another.
Like any other storage system, ONTAP uses a disk from the spare pool to rebuild the data from the surviving disks as soon as it encounters a failed disk, and it sends an AutoSupport message to NetApp for parts replacement. Once NetApp receives the AutoSupport, they initiate the RMA process and the part is delivered to the address listed for that system in NetApp's records. When the disk arrives, you either change it yourself or ask a NetApp engineer to come onsite and change it. Either way, as soon as you replace the disk, your system finds the new working disk and adds it to the spare pool.
Now wasn't that simple and straightforward? Oh yes, because we are using software-based disk ownership and disk auto-assignment is turned on. It is much like your baby catching a cold, calling the GP himself, and getting it cured without asking you to take care of him. But what if there are complications?
The rest of this post covers the other things that can get in the way.
Scenario 1:
I have replaced my drive and the light shows green or amber, but "sysconfig -r" still shows the drive as broken?
Sometimes we face this problem because the system could not label the disk properly or because the replacement disk itself is not good. The first thing to try is to label the disk correctly. If that doesn't work, try replacing it with another disk or a known good disk. If that doesn't work either, contact NetApp and follow their guidelines.
To relabel the disk from "BROKEN" to "SPARE", first note down the broken disk ID, which you can get from "aggr status -r". Then go to advanced mode with "priv set advanced" and run "disk unfail <disk_id>". At this stage your filer will throw some 3-4 errors on the console, in syslog, or as SNMP traps, depending on how you have configured it. This was the final step, and the disk should now be good, which you can confirm with "disk show" for detailed status or with the "sysconfig -r" command. Give it a few seconds to recognize the changed disk status if the change doesn't show at first.
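
The relabel steps above can be written down as an ordered checklist. The following is only a sketch: it builds the list of console commands named in this scenario and does not talk to a filer. The function name and the sample disk ID "0a.23" are hypothetical.

```python
from typing import List


def unfail_command_plan(disk_id: str) -> List[str]:
    """Return the Scenario 1 console commands, in order, for one disk.

    This only formats strings; nothing is sent to a storage system.
    """
    if not disk_id or any(ch.isspace() for ch in disk_id):
        raise ValueError("disk_id must be a non-empty name without whitespace")
    return [
        "aggr status -r",          # note down the broken disk ID
        "priv set advanced",       # disk unfail is run from advanced mode
        f"disk unfail {disk_id}",  # relabel the disk from BROKEN to SPARE
        "disk show",               # detailed status
        "sysconfig -r",            # confirm the disk is no longer listed as broken
    ]


if __name__ == "__main__":
    # "0a.23" is a made-up disk ID used only for illustration.
    for command in unfail_command_plan("0a.23"):
        print(command)
```

Keeping the commands in one list makes it easy to paste them into a change record before touching the console.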
Scenario 2:
Two disks have failed in the same RAID group and I don't have any spare disks in my system.
In this case you are in big trouble. You should always have at least one spare disk available in your system, and NetApp recommends a 1:28 ratio, i.e. one spare for every 28 disks. With a dual disk failure you have a very high chance of losing your data if another disk goes while you are rebuilding data onto a spare disk or while you are waiting for new disks to arrive.
So always keep a minimum of two spare disks available in your system. One spare is also fine and the system will not complain about spares, but if you leave the system with only one spare disk then Maintenance Center will not work and the system will not scan any disk for potential failure.
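
The spare guidance above reduces to simple arithmetic. Here is a minimal sketch, assuming the 1:28 ratio is rounded up and that the two-spare minimum applies to any system that has disks at all. Both readings are my assumptions, and the function and constant names are hypothetical.

```python
import math

DISKS_PER_SPARE = 28  # the 1:28 ratio quoted above
MIN_SPARES = 2        # the minimum suggested above so disk scanning keeps working


def recommended_spares(total_disks: int) -> int:
    """Return a spare count for a system with total_disks disks.

    Assumption: the ratio is rounded up, so 29 disks count as two groups of 28.
    """
    if total_disks < 0:
        raise ValueError("total_disks cannot be negative")
    if total_disks == 0:
        return 0
    return max(MIN_SPARES, math.ceil(total_disks / DISKS_PER_SPARE))


if __name__ == "__main__":
    for disks in (14, 28, 56, 57, 112):
        print(disks, "disks ->", recommended_spares(disks), "spares")
```

For small systems the two-spare minimum dominates; the ratio only starts to matter above 56 disks.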
Coming back to the situation above, a dual disk failure with no spares available: your best bet is to ring NetApp and get the failed disks replaced ASAP. If you are losing your patience, select a disk of the same type in another healthy system, do a disk fail on it, remove it, and use it to replace a failed disk in the affected system.
If the filer shows a partial/failed volume after you add the disk, use the "vol status -v" and "vol status -r" commands to make sure the volume reported as partial/failed belongs to the newly inserted disk. If so, destroy the volume with the "vol destroy" command and then zero out the disk with "disk zero spares".
This exercise will not take more than 15 minutes (except disk zeroing, which depends on your disk type and capacity), and you will be left with a single disk failure in each of two systems, both of which can survive another disk failure. But what if you don't do that and keep running your system with a dual disk failure? Your system will shut down by itself after 24 hours; yes, it will shut itself down, without any failover, to get your attention. There is a registry setting to control how long your system should run after a disk failure, but I think 24 hours is a good time and you shouldn't increase or decrease it unless you don't care about the data sitting there or about anyone accessing it.
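
To make the 24-hour window concrete, here is a small sketch that computes when the shutdown would happen and how much time is left. It assumes the timer starts at the moment the system entered the dual-failure state and that the 24-hour default has not been changed. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SHUTDOWN_GRACE = timedelta(hours=24)  # the default window described above


def shutdown_deadline(degraded_since: datetime) -> datetime:
    """Return the time at which the system would shut itself down."""
    return degraded_since + SHUTDOWN_GRACE


def time_remaining(degraded_since: datetime, now: datetime) -> timedelta:
    """Return the time left before the shutdown, never less than zero."""
    remaining = shutdown_deadline(degraded_since) - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Made-up timestamps used only for illustration.
    failed_at = datetime(2014, 1, 20, 7, 38)
    checked_at = datetime(2014, 1, 20, 19, 38)
    print("Deadline:", shutdown_deadline(failed_at))
    print("Remaining:", time_remaining(failed_at, checked_at))
```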
Scenario 3:
My drive failed but there is no disk with an amber light
This often happens because the disk's electronics have failed and the system can no longer recognize the disk as part of it. So in this situation you first have to find out the disk name. There are a couple of ways to find out which disk has failed:
a) "sysconfig -r": look for the broken disk list
b) From the AutoSupport message, check for the failed disk ID
c) "fcadmin device_map": look for a disk marked xxx or "BYP"
d) In /etc/messages, look for a failed or bypassed disk warning, which gives the disk ID
Once you have identified the failed disk ID, run "disk fail <disk_id>" and check whether you see an amber light. If not, use "blink_on <disk_id>" in advanced mode to turn on the disk LED. If that fails, turn on the adjacent disks' lights with the same blink_on command so that you can identify the disk correctly. Alternatively, you can use the led_on command instead of blink_on to turn on the LEDs of the disks adjacent to the defective disk rather than its red LED.
If you use the auto-assign function, the system will assign the new disk to the spare pool automatically; otherwise use the "disk assign <disk_id>" command to assign the disk to the system.
Scenario 4:
Disk LED remains orange after replacing the failed disk
This happens when you were in a hurry and did not give the system enough time to recognize the change. When the failed disk is removed from its slot, the disk LED remains lit until Enclosure Services notices and corrects it, which generally takes around 30 seconds after the failed disk is removed.
Since the swap is already done, use the led_off command from advanced mode. If that doesn't work because the system believes the LED is off when it is actually on, simply turn the LED on and then back off again using "led_on <disk_id>" and then "led_off <disk_id>".
Scenario 5:
Disk reconstruction failed
A number of issues can make RAID reconstruction fail on a new disk, including an enclosure access error, a file system disk not responding or missing, a spare disk not responding or missing, or something else. However, the most common reason for this failure is outdated firmware on the newly inserted disk.
Check whether the newly inserted disk has the same firmware as the other disks. If not, first update the firmware on the newly inserted disk, and then reconstruction should finish successfully.
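
The firmware comparison above can be sketched as a small check over an inventory you have collected by hand. This is only an illustration: it assumes all listed disks are the same model, treats the most common revision as the baseline, and uses made-up disk IDs and revision strings. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def firmware_outliers(firmware_by_disk: Dict[str, str]) -> Dict[str, str]:
    """Return the disks whose firmware differs from the most common revision.

    Assumption: every disk in the mapping is the same model, so one
    baseline revision applies to all of them.
    """
    if not firmware_by_disk:
        return {}
    baseline, _count = Counter(firmware_by_disk.values()).most_common(1)[0]
    return {
        disk: revision
        for disk, revision in firmware_by_disk.items()
        if revision != baseline
    }


if __name__ == "__main__":
    # Hypothetical inventory: disk ID -> firmware revision.
    inventory = {"0a.16": "NA01", "0a.17": "NA01", "0a.18": "NA01", "0a.19": "NA00"}
    print(firmware_outliers(inventory))
```

Any disk returned here is a candidate for a firmware update before you retry reconstruction.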
Scenario 6:
Disk reconstruction stuck at 0% or failed to start
This might be an error, or it might be due to a limitation in ONTAP: no more than 2 reconstructions should be running at the same time. The error you may sometimes run into arises when the RAID group was in a degraded state and the system went through an unclean shutdown, so parity is marked inconsistent and needs to be recomputed after boot. However, parity recomputation requires all data disks to be present in the RAID group, and we already have a failed disk in the RG, so the aggregate is marked WAFL_inconsistent. You can confirm this condition with the "aggr status -r" command.
If this is the case, you have to run wafliron with the command "aggr wafliron start <aggr_name>" while you are in advanced mode. Make sure you contact NetApp before starting wafliron, as it will unmount all the volumes hosted in the aggregate until the first phase of tests is complete. The time wafliron takes to complete the first phase depends on many variables, such as the size of the volume/aggregate/RG and the number of files/snapshots/LUNs, so you can't predict how long it will take; it might be 1 hour or it might be 4-5 hours. So if you are going to run wafliron, contact NetApp first.

Posted by Rajat Garg at 7:38 AM

1 comment:

Anonymous October 21, 2014 at 3:26 AM


can you also let us know how we can use disk replace to replace mismatched disk
types like FC and ATA in an aggregate, is it possible.

http://rajat926.blogspot.com/2014/01/failed-disk-replacement-in-netapp.html
