
NetApp ONTAP High Availability

ONTAP HA Overview:

Cluster nodes are configured in high-availability (HA) pairs for fault
tolerance and nondisruptive operations. If a node fails or if you need
to bring a node down for routine maintenance, its partner can take
over its storage and continue to serve data from it. The partner gives
back storage when the node is brought back online.

The HA pair controller configuration consists of a pair of matching
FAS/AFF storage controllers (local node and partner node). Each of
these nodes is connected to the other’s disk shelves. When one node
in an HA pair encounters an error and stops processing data, its
partner detects the failed status of the partner and takes over all data
processing from that controller.
Takeover is the process in which a node assumes control of its
partner’s storage.

Giveback is the process in which the storage is returned to the partner.

An internal HA interconnect allows each node to continually check


whether its partner is functioning and to mirror log data for the other’s
nonvolatile memory. When a write request is made to a node, it is
logged in NVRAM on both nodes before a response is sent back to the
client or host. On failover, the surviving partner commits the failed
node’s uncommitted write requests to disk, ensuring data
consistency.

SENTHILKUMAR MUTHUSAMY | SAN MASTERS 1



Connections to the other controller’s storage media allow each node


to access the other’s storage in the event of a takeover. Network path
failover mechanisms ensure that clients and hosts continue to
communicate with the surviving node.

By default, takeovers occur automatically in any of the following
situations:

▪ A software or system failure occurs on a node that leads to a panic.
▪ A system failure occurs on a node, and the node cannot reboot.
▪ Heartbeat messages are not received from the node’s partner.
▪ The remote management device (Service Processor) detects failure
of the partner node.
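Before relying on automatic takeover, you can confirm from the ONTAP CLI that each node is able to take over its partner. A minimal sketch (node names follow the cluster1 examples used later in this document; output is abbreviated):

```shell
# Verify that storage failover is enabled and that each node reports
# "Connected to <partner>" with Takeover Possible = true
cluster1::> storage failover show
```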



Hardware Assisted Takeover:

Enabled by default, the hardware-assisted takeover feature can speed
up the takeover process by using a node’s remote management device
(Service Processor).

When the remote management device detects a failure, it quickly


initiates the takeover rather than waiting for ONTAP to recognize that
the partner’s heartbeat has stopped. If a failure occurs without this
feature enabled, the partner waits until it notices that the node is no
longer giving a heartbeat, confirms the loss of heartbeat, and then
initiates the takeover.


The hardware-assisted takeover feature uses the following process to


avoid that wait:

1. The remote management device monitors the local system


for certain types of failures.
2. If a failure is detected, the remote management device
immediately sends an alert to the partner node.
3. Upon receiving the alert, the partner initiates takeover.
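Hardware-assisted takeover can be inspected and tuned from the CLI. A sketch; the `hwassist` command and option shown below exist in current ONTAP releases, but verify exact names against your release’s documentation:

```shell
# Show hardware-assisted takeover status and statistics for the nodes
cluster1::> storage failover hwassist show

# Enable (or re-enable) hardware-assisted takeover on a node
cluster1::> storage failover modify -node cluster1-01 -hwassist true
```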

Automatic Takeover and Giveback:

The automatic takeover and giveback operations can work together to
reduce and avoid client outages.

By default, if one node in the HA pair panics, reboots, or halts, the
partner node automatically takes over and then returns storage when
the affected node reboots. The HA pair then resumes a normal
operating state.
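Automatic giveback is controlled per node with `storage failover modify`. A sketch, assuming the default behavior described above:

```shell
# Enable automatic giveback so storage returns to the repaired node
# once it reboots
cluster1::> storage failover modify -node * -auto-giveback true

# Confirm the setting
cluster1::> storage failover show -fields auto-giveback
```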

HA Policy Overview:

ONTAP automatically assigns an HA policy of CFO (controller failover)
or SFO (storage failover) to an aggregate. This policy determines how
storage failover operations occur for the aggregate and its volumes.

The two options, CFO and SFO, determine the aggregate control
sequence ONTAP uses during storage failover and giveback
operations.


Although the terms CFO and SFO are sometimes used informally to
refer to storage failover (takeover and giveback) operations, they
actually represent the HA policy assigned to the aggregates. For
example, the terms SFO aggregate or CFO aggregate simply refer to
the aggregate’s HA policy assignment.

HA policies affect takeover and giveback operations as follows:

• Aggregates created on ONTAP systems (except for the root
aggregate containing the root volume) have an HA policy of SFO.
Manually initiated takeover is optimized for performance by
relocating SFO (non-root) aggregates serially to the partner before
takeover. During the giveback process, aggregates are given back
serially after the taken-over system boots and the management
applications come online, enabling the node to receive its
aggregates.

• Because aggregate relocation operations entail reassigning
aggregate disk ownership and shifting control from a node to its
partner, only aggregates with an HA policy of SFO are eligible for
aggregate relocation.

• The root aggregate always has an HA policy of CFO and is given
back at the start of the giveback operation. This is necessary to allow
the taken-over system to boot. All other aggregates are given back
serially after the taken-over system completes the boot process and
the management applications come online, enabling the node to
receive its aggregates.
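You can see which HA policy each aggregate carries directly from the CLI; the root aggregate should report cfo and data aggregates sfo. A sketch:

```shell
# List each aggregate's HA policy (root aggregate: cfo, data aggregates: sfo)
cluster1::> storage aggregate show -fields ha-policy
```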
Manual Takeover:

You should move epsilon if you expect that any manually initiated
takeovers could result in your storage system being one unexpected
node failure away from a cluster-wide loss of quorum.


To perform planned maintenance, you must take over one of the
nodes in an HA pair. Cluster-wide quorum must be maintained to
prevent unplanned client data disruptions for the remaining nodes. In
some instances, performing the takeover can result in a cluster that is
one unexpected node failure away from cluster-wide loss of quorum.

This can occur if the node being taken over holds epsilon or if the node
with epsilon is not healthy. To maintain a more resilient cluster, you
can transfer epsilon to a healthy node that is not being taken over.
Typically, this would be the HA partner.
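Epsilon is viewed and moved at the advanced privilege level. A hedged sketch of moving epsilon off the node being taken over (node names follow this document’s examples; verify the exact syntax against your ONTAP release):

```shell
cluster1::> set -privilege advanced

# See which node currently holds epsilon
cluster1::*> cluster show -fields epsilon

# Move epsilon from the node being taken over to its healthy partner
cluster1::*> cluster modify -node cluster1-02 -epsilon false
cluster1::*> cluster modify -node cluster1-01 -epsilon true

cluster1::*> set -privilege admin
```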

Screenshot: epsilon is true on Node1 (the master node). During a
planned failover, we can change the epsilon node.


Screenshots of the example environment:

• aggr_prod1 is owned by the cluster1-02 node.
• FlexVol volume volp1 resides in the aggr_prod1 aggregate.
• The NFS data protocol service, svm_prod1, uses two LIFs, one
mapped to port e0d of each node.
• Volume volp1 is mounted on a Linux server.


As per our example:

1. volp1 resides in aggr_prod1, which is owned by the cluster1-02
node. Create files in that share.

Check the HA failover status; both nodes are connected to their
partner node. Using System Manager, you can also manage HA: both
nodes are online and ready to take over.
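The ownership shown in the screenshots can also be confirmed from the CLI. A sketch using this document’s example names (volp1, aggr_prod1, svm_prod1):

```shell
# Which aggregate does the volume live in?
cluster1::> volume show -vserver svm_prod1 -volume volp1 -fields aggregate

# Which node currently owns that aggregate?
cluster1::> storage aggregate show -aggregate aggr_prod1
```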


HA Planned Failover:

As described under Manual Takeover, transfer epsilon to a healthy
node that is not being taken over (typically the HA partner) before
initiating the takeover, so that the planned takeover does not leave
the cluster one unexpected node failure away from a cluster-wide loss
of quorum.
Initiate the planned failover (a manual takeover) of the partner node
(cluster1-02).

Check the failover status: the node relocates its SFO aggregates to
cluster1-01.
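The planned takeover shown above can be issued and monitored from the CLI. A sketch (node names follow this document’s example):

```shell
# Take over the partner's storage for planned maintenance
cluster1::> storage failover takeover -ofnode cluster1-02

# Watch the SFO aggregates being relocated to cluster1-01
cluster1::> storage failover show-takeover
```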


Optimized takeover of the partner is in progress: the cluster1-02 node
is being taken over by its partner, and its SFO aggregates are being
relocated.

Once cluster1-02 has relocated both its SFO and CFO aggregates to
the partner node, it enters the taken-over state. The partner node has
been taken over successfully.


After the takeover by the cluster1-01 node, you can see that the
aggr_prod1 aggregate is now owned by cluster1-01 (before the
takeover it was owned by cluster1-02).

From the UNIX host, you can still access the NFS shares.

As per the defined LIF failover policy, the VIFMGR unit (an RDB unit)
fails the LIF over to the partner node (the nas2 LIF’s home-node status
is false). The LIF failover completes successfully: the nas2 LIF has
failed over to cluster1-01.
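The LIF failover can be confirmed from the CLI; after takeover the nas2 LIF should report a current node of cluster1-01 and is-home false. A sketch:

```shell
# Show where each data LIF is currently hosted and whether it is home
cluster1::> network interface show -vserver svm_prod1 -fields curr-node,is-home
```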


Before the takeover, the nas2 LIF was hosted on the cluster1-02
node’s e0d port.

Manual Giveback:

You can perform a normal giveback, a giveback in which you terminate
processes on the partner node, or a forced giveback.

If the takeover node experiences a failure or a power outage during
the giveback process, that process stops and the takeover node
returns to takeover mode until the failure is repaired or the power is
restored.

However, this depends on the stage of giveback in which the failure
occurred. If the node encountered a failure or a power outage during
the partial-giveback state (after it has given back the root aggregate),
it will not return to takeover mode. Instead, the node returns to
partial-giveback mode. If this occurs, complete the process by
repeating the giveback operation.


If giveback is vetoed, you must check the EMS messages to determine
the cause. Depending on the reason or reasons, you can decide
whether you can safely override the vetoes.

After you configure all aspects of your HA pair, you need to verify that
it is operating as expected in maintaining uninterrupted access to both
nodes' storage during takeover and giveback operations. Throughout
the takeover process, the local (or takeover) node should continue
serving the data normally provided by the partner node. During
giveback, control and delivery of the partner’s storage should return
to the partner node.

Run the storage failover giveback command to give back the node
manually.
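A sketch of the manual giveback, including the veto override discussed above (node names follow the document’s examples; override vetoes only after confirming in EMS that it is safe):

```shell
# Return storage to the repaired partner node
cluster1::> storage failover giveback -ofnode cluster1-02

# If giveback is vetoed and EMS shows the veto is safe to ignore,
# it can be overridden (use with care)
cluster1::> storage failover giveback -ofnode cluster1-02 -override-vetoes true
```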
How Giveback Works:

The local node returns ownership to the partner node when issues are
resolved, when the partner node boots up, or when giveback is
initiated.

The following process takes place in a normal giveback operation. In
this discussion, Node A has taken over Node B. Any issues on Node B
have been resolved and it is ready to resume serving data.

1. Any issues on Node B are resolved, and it displays the following
message: Waiting for giveback
2. The giveback is initiated by the storage failover giveback command
or by automatic giveback if the system is configured for it. This
initiates the process of returning ownership of Node B’s aggregates
and volumes from Node A back to Node B.
3. Node A returns control of the root aggregate first.
4. Node B completes the process of booting up to its normal
operating state.
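The per-aggregate progression above (root/CFO aggregate first, then the SFO aggregates) can be observed while a giveback runs. A sketch:

```shell
# Show giveback status per aggregate; the root (CFO) aggregate is
# returned first, then SFO aggregates once the node has booted
cluster1::> storage failover show-giveback
```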



Initiate the giveback manually once the partner node is up, then check
the HA giveback status. The CFO aggregate is relocated first, followed
by the SFO aggregates. Once all CFO and SFO aggregates are
relocated, the node waits for the management applications to come
online. The node reconnects to cluster1-01 while the giveback of the
SFO aggregates is in progress.


Giveback is successful and both nodes are connected to their partner.

After the successful giveback, the nas2 LIF still uses the cluster1-01
node’s e0d port. Revert the failed-over LIF to its original home node
and port.
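Reverting the LIF can be sketched as follows (assuming the LIF’s home node and home port are already defined in its configuration):

```shell
# Send the nas2 LIF back to its home node and home port
cluster1::> network interface revert -vserver svm_prod1 -lif nas2

# Confirm is-home is now true
cluster1::> network interface show -vserver svm_prod1 -lif nas2 -fields curr-node,is-home
```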

