DRBD 8.0.x and beyond
Shared-Disk semantics on a Shared-Nothing Cluster
Lars Ellenberg
August 10, 2007
Abstract
So, you have an HA-Cluster. Fine. What about when your storage goes down?
DRBD – the Distributed Replicated Block Device, as developed mainly by Philipp Reisner and
Lars Ellenberg and their team at LINBIT http://www.linbit.com – makes your data highly available,
even if a storage node goes down completely. There are no special requirements: off-the-shelf hardware
and some standard IP connectivity are just fine.
I’ll outline the problems we try to solve, how we deal with them in DRBD, and why we deal with them
the way we do, explaining the design ideas of some algorithms we have developed for DRBD, which I think
may be useful for software RAID or snapshots as well (dm-mirror and friends), or in cluster file systems or
distributed databases.
There is an overview of typical and not-so-typical usage scenarios, and the current limitations, and
some hints about how you can (ab)use DRBD to your advantage, e.g. when upgrading your hardware.
There is also mention of benchmarks, as well as upcoming features, like cache-warming on the
receiving side, improved handling of cluster-partition/rejoin scenarios with multiple Primaries (interesting
for cluster file systems), or the (as yet only conceptual) possibility to scale out to many (simultaneously
active) nodes (N > 2).
1 What is DRBD
DRBD is the Distributed Replicated Block Device, implemented as a Linux kernel module.
Block device something like /dev/sda; something you can put a file system on
Replicated any local changes are copied to additional nodes in real time to increase data availability
Distributed reduce the risk of catastrophic data loss, spread the data to storage nodes in different locations
The purpose of DRBD is to keep your data available (business continuity), even if your storage server fails
completely (high-availability fail-over clustering), or if your main site is flooded or otherwise catastrophically
destroyed (disaster recovery). DRBD is not to be confused with grid-storage.
DRBD only provides the infrastructure; to actually do HA-clustering you need a cluster manager.
Heartbeat is the cluster manager most commonly used together with DRBD. If you are not familiar with
some HA-clustering vocabulary, e.g. fail-over, switch-over, fencing, STONITH, SPOF, split-brain, . . . ,
their project page http://www.linux-ha.org might be a good starting point.
There are a number of other cluster managers that work just fine with drbd, too, including the Red Hat
Cluster Suite, as well as "home-grown", tailored solutions at telcos or storage providers.
… to get DRBD included into mainline. If I find the time to "beautify" our code base sufficiently in time, we might get into mainline later this year.
2 hahum. OK, this should read "the other node". Singular :(. But see also section 13
4 Synchronous data replication
We have to defer completion events for WRITE requests and wait for all mirror nodes to complete
the request, before we may relay completion to upper layers (file system, VM, database using drbd as
"raw device" with direct IO, . . . ).
Since, by replicating the data, each node has its own data set, the nodes do not share any hardware: this is
called a "shared-nothing" cluster.
5 Sounds Trivial?
So we basically "only" need to ship READs directly to the local block layer, and ship WRITEs to local
storage as well as over TCP to the other node, which will send back an ACK once it is done.
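To make the data path concrete, here is a minimal sketch in plain C of the naive scheme just described: READs go to local storage only, WRITEs go to the local disk and, in parallel, to the peer, and completion is relayed upward only once both parts are done. This is not DRBD's actual code; the types and helper functions are invented for illustration.

#include <stdbool.h>

/* Invented types and helpers, for illustration only. */
struct request { bool is_write; /* sector, data, ... */ };

extern void submit_to_local_disk(struct request *req);
extern void send_to_peer_over_tcp(struct request *req);
extern void complete_to_upper_layers(struct request *req);

struct pending_write {
    struct request *req;
    bool local_done;    /* local disk signalled completion */
    bool peer_acked;    /* ACK for this request came back  */
};

/* Route an incoming request. */
static void make_request_sketch(struct request *req, struct pending_write *pw)
{
    if (!req->is_write) {
        submit_to_local_disk(req);      /* READs are served locally */
        return;
    }
    pw->req = req;
    pw->local_done = false;
    pw->peer_acked = false;
    submit_to_local_disk(req);          /* WRITE: local disk ...    */
    send_to_peer_over_tcp(req);         /* ... and the peer         */
}

/* Called from the local completion path and from the receiver thread
 * when the peer's ACK arrives; only when both have happened may the
 * WRITE be completed to upper layers.                               */
static void write_part_done(struct pending_write *pw, bool from_peer)
{
    if (from_peer)
        pw->peer_acked = true;
    else
        pw->local_done = true;

    if (pw->local_done && pw->peer_acked)
        complete_to_upper_layers(pw->req);
}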
To get this working is almost trivial, right?
Well, yes. But to keep it working in the face of component failures is slightly more involved. In the
following, all-capital "DRBD" shall denote the driver and the concept, while "drbd" means a single virtual
block device instance.
"Primary" in DRBD-speak means a node that accepts IO-requests from upper layers; as opposed to
"Secondary", which will refuse all IO-requests from upper layers, and only submit process requests as
received from a Primary node. We may change the terminology to Active resp. Standby/Passive someday.
A healthy drbd consists of (at least) two nodes, each with local storage, and an established network
connection.
The network link may become disconnected, disks may throw IO-errors, nodes may crash... We survive
any single failure, and try to handle the most common multiple failures as well as possible.
We guarantee that in a healthy drbd, while no WRITE-IO is in flight, the DRBD-controlled local backend
storage of the nodes is exactly bit-wise identical.
If we lose connectivity, or get local IO-errors, we can no longer do this: we become "unhealthy",
degraded. To become healthy again, we need to make the various disks' contents identical again. This is
called resync, and it has interesting problems in itself, which are detailed in section 6.
When a Secondary node detects an IO-error on its local storage, it will notify the other nodes about this
and detach from its local storage. No harm done yet, we are merely degraded. Operator gets paged, disk
gets replaced, resync happens.
When we lose connectivity, DRBD will not be able to tell if it is a network problem or a node crash.
Enter the cluster manager, which should have multiple communication channels, and the smarts to decide
whether the Primary may keep going, or whether the Secondary needs to be promoted to Primary and start up
services, because the former Primary really is dead. In any case, the data is still there; we wait for the
connection to be established again, and resync.
When a Primary node detects an IO-error on its local storage, it will notify the other nodes and detach
from its local storage. Failing READs will be retried via the network, and new READs will now have to be
served via the network by some other node. WRITEs have been mirrored anyway.
Still no harm done; upper layers on the Primary won't notice the IO-error.
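A rough sketch of this behaviour in plain C, again with invented helper names rather than DRBD's real functions: on a local read error the Primary detaches from its backing device and re-issues the failed READ over the network; WRITEs need no special handling, since they are mirrored anyway.

#include <stdbool.h>

struct request;                                /* opaque for this sketch */

/* Invented helpers, for illustration only. */
extern void notify_peers_that_we_detach(void);
extern void send_read_request_to_peer(struct request *req);

static bool local_disk_attached = true;

/* Called when the local backing device reports an error for a READ. */
static void on_local_read_error(struct request *req)
{
    if (local_disk_attached) {
        notify_peers_that_we_detach(); /* tell the other node(s)  */
        local_disk_attached = false;   /* stop using the bad disk */
    }
    /* Retry the failed READ over the network; while we are diskless,
     * new READs are routed to the peer right away.                  */
    send_read_request_to_peer(req);
}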
But at the nearest convenient opportunity, services should be migrated to another node (switch-over).
Because if you don't, Murphy has it that the network will soon get flaky as well; then we won't have access to
good data any longer, so we must fail any new IO-request coming in, and to have the services available
we'd have to do a fail-over anyway, which will happen at the most inconvenient time, obviously. . .
6 Resync: magic healing
6.1 Network hiccup
• which direction?
6.4 Peanuts . . .
To reliably keep track of the target blocks of in-flight IO, while minimizing the additional io-requests
required for this housekeeping, we came up with the concept of the "Activity-Log".
Think of your storage as a huge heap of peanuts. Sisyphus has tagged them all with a distinct block
number. There are many people running around, taking some of the peanuts in their pockets (that is the
in-flight io), and throwing them back on the heap (that is the io-completion). Painting them blue is allowed;
these are WRITEs we are still missing the acknowledgment of the other node for (dirty bits). Eating peanuts is
strictly forbidden, as is re-tagging.4
Blocks corresponding to the in-pocket peanuts have to be retransmitted; those corresponding to the
heap don't need to be (though it would do no harm if some of them were).
Our mission is to know, at any given moment, as precisely as possible which peanuts are NOT in the
pockets of those people (and not painted blue yet), because if we know that, we can avoid retransmitting
the corresponding blocks after a Primary crash.
First, we get control of the situation. We structure the heap, and put the peanuts in order into boxes
(activity-log extents), which in turn are numbered. We draw a line in the sand.
4 Some do that anyway; call them Eh-i-oh and Silent Corruption ;)
6.5 . . . aka Activity-Log
We prepare a number of parking lots on one side, and get ourselves as many little red wagons
(activity-log slots). People cannot reach the peanuts on the other side of the line. Only we are allowed to move the
wagons from the other side into the parking lots on this side. People are free to take from this side of the
line (the in-memory activity-log).
Now we know that everything not in the activity-log is stable, and does not need to be retransmitted,
apart from the blue ones, whose tags we jotted down (to the on-disk bitmap) whenever we dropped
an extent out of the activity-log.
To further structure this, we line up the extents in the activity-log in three LRUs: those boxes where
some peanuts are missing (the active lru, currently target of in-flight io), those which are complete (the
unused lru, zero in-flight io), and those wagons with no boxes at all (the free list).
We start with an empty activity-log.
Now, if someone wants a peanut from a special box which is not on display yet, we have some work
to do. We check whether there is an empty wagon; if yes, we go fetch the box, and write the box number
down (to the on-disk activity-log). If there is no slot available on the free list, we check the unused lru. If
there is an unused slot, we exchange its content with the requested one, and write both numbers down (to
the on-disk activity-log).
If there is neither an empty nor an unused slot available (activity-log is starving), we have to be rude,
and block further WRITE requests until enough of the outstanding requests have completed and one slot
becomes unused again, which we then will take and exchange, again writing both numbers down (you
know where).
When someone comes by and wants some of the boxes already on display, we just move them around to
keep the lru order; no further action is required. This is the caching effect of the activity-log.
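The bookkeeping described above might look roughly like the following sketch in plain C. It is deliberately simplified (a linear search instead of real LRU lists, invented helper names, no locking); the point is only to show that a hit costs nothing, a miss costs exactly one meta-data transaction, and starvation blocks the WRITE.

#include <stddef.h>

#define AL_SLOTS 61                 /* number of wagons/slots (tunable) */
#define EXTENT_UNUSED (-1L)

struct al_slot {
    long extent_nr;                 /* which box is in this wagon       */
    unsigned int inflight;          /* WRITEs currently targeting it    */
};

static struct al_slot al[AL_SLOTS]; /* assume initialised to EXTENT_UNUSED at start-up */

/* Invented helpers standing in for the real meta-data IO. */
extern void al_write_transaction(long evicted, long loaded); /* on-disk activity-log */
extern void write_dirty_bits_to_bitmap(long extent_nr);      /* blue peanuts' tags   */
extern void block_until_some_slot_is_unused(void);

/* Called for every WRITE, with the extent number its sector falls into. */
static void al_begin_io(long extent_nr)
{
    for (;;) {
        struct al_slot *free_slot = NULL, *unused_slot = NULL;

        for (size_t i = 0; i < AL_SLOTS; i++) {
            if (al[i].extent_nr == extent_nr) {
                al[i].inflight++;   /* hit: box already on display; the real
                                     * code would also update its LRU position */
                return;
            }
            if (al[i].extent_nr == EXTENT_UNUSED)
                free_slot = &al[i];         /* empty wagon (free list)   */
            else if (al[i].inflight == 0)
                unused_slot = &al[i];       /* complete box (unused lru) */
        }

        if (free_slot) {                    /* fetch the box, note it down */
            al_write_transaction(EXTENT_UNUSED, extent_nr);
            free_slot->extent_nr = extent_nr;
            free_slot->inflight = 1;
            return;
        }
        if (unused_slot) {                  /* exchange boxes, note both */
            write_dirty_bits_to_bitmap(unused_slot->extent_nr);
            al_write_transaction(unused_slot->extent_nr, extent_nr);
            unused_slot->extent_nr = extent_nr;
            unused_slot->inflight = 1;
            return;
        }
        /* Starvation: neither a free nor an unused slot.  Be rude: block
         * until some in-flight IO completes, then try again.            */
        block_until_some_slot_is_unused();
    }
}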
For performance reasons (write latency) you want to avoid blocking, so you want a reasonable
number of slots, to make sure there are always some unused ones. But you do not want a huge
number of slots either, because on Primary crash that is the area we need to retransmit, since that is the
granularity with which we know what might have been changed.
Whenever we need to "write down the number", we have to do so transactionally and synchronously, before
we start to process the corresponding WRITE that triggered this meta-data transaction. Since we may crash
during such a transaction, we write to an on-disk ring buffer to avoid corrupting the on-disk representation.
With each such transaction, we cyclically write down a partial list of unchanged members as well. We
dimension the number of slots and the size of the on-disk ring buffer appropriately, so we can always
restore the exact set of extents which have been in the activity-log at the time of crash. When a drbd is
attached to its backing storage, it detects whether it has been cleanly shut down. If it determines it had been
a crashed Primary, it will set all the bits corresponding to the extents recorded in the activity log as dirty.
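A sketch of that recovery step, with invented helper names: when attaching after an unclean shutdown as Primary, every extent recorded in the on-disk activity log is treated as potentially modified and marked for resync.

#include <stdbool.h>

#define AL_SLOTS 61

/* Invented helpers, for illustration only. */
extern bool was_cleanly_shut_down(void);
extern bool was_primary_before_crash(void);
extern int  read_on_disk_activity_log(long *extents, int max); /* returns count */
extern void mark_extent_dirty_in_bitmap(long extent_nr);

static void al_recover_after_crash(void)
{
    long extents[AL_SLOTS];
    int n, i;

    if (was_cleanly_shut_down() || !was_primary_before_crash())
        return;                             /* nothing to do */

    /* The on-disk ring buffer lets us reconstruct the exact set of extents
     * that were hot at crash time; any of them may contain writes we know
     * nothing about, so mark them all dirty for the upcoming resync.      */
    n = read_on_disk_activity_log(extents, AL_SLOTS);
    for (i = 0; i < n; i++)
        mark_extent_dirty_in_bitmap(extents[i]);
}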
The number of slots in the activity-log is tunable. The trade-off: with a larger activity-log, meta-data
transactions are less frequent and activity-log starvation (with the maximum latency it introduces) is less likely,
but the minimum resync time after a Primary crash gets longer. For example, with DRBD's 4 MiB activity-log
extents, 127 slots correspond to roughly 500 MiB that have to be resynced after a Primary crash.
7 Resync direction: you or me?
[Y;X;A;B] | [X;0;A;B]
If we lose communication during this resync, we just start over; content and IDs changed, but not the overall
pattern.
resync finished We reset the bitmap ID, rotating its value out into the history. The SyncTarget
adopts the current ID of the SyncSource, and sets itself consistent again.
It is similar for the other cases. This way we also reliably detect differing data sets, with or without common
ancestry: [G;X;A;B] <???> [K;X;A;B]; split-brain; the version control system equivalent would be: manual merge
required.
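To give a feel for how such a decision could be coded, here is a simplified sketch in plain C. The structure mirrors the ID tuples used above ([current; bitmap; history]), but the rules are reduced to their flavour; DRBD's real algorithm handles more cases, plus the auto-recovery strategies mentioned next.

/* One node's data generation identifiers, as in the text:
 * [ current ; bitmap ; history-1 ; history-2 ]                        */
struct gen_ids {
    unsigned long current, bitmap, history[2];
};

enum sync_decision {
    IN_SYNC,              /* same data generation, nothing to do       */
    BECOME_SYNC_SOURCE,   /* peer still has our old data               */
    BECOME_SYNC_TARGET,   /* we still have the peer's old data         */
    SPLIT_BRAIN,          /* both diverged from a common ancestor      */
    UNRELATED             /* no common ancestry at all                 */
};

/* Simplified comparison of our IDs with the peer's; only the flavour
 * of the algorithm, not DRBD's exact rule set.                        */
static enum sync_decision compare_gen_ids(const struct gen_ids *self,
                                          const struct gen_ids *peer)
{
    if (self->current == peer->current)
        return IN_SYNC;

    if (self->bitmap == peer->current)      /* e.g. [Y;X;A;B] vs [X;0;A;B] */
        return BECOME_SYNC_SOURCE;

    if (self->current == peer->bitmap)
        return BECOME_SYNC_TARGET;

    /* Both changed their data since the last common generation,
     * e.g. [G;X;A;B] vs [K;X;A;B]: split brain, manual merge needed.  */
    if (self->bitmap == peer->bitmap || self->history[0] == peer->history[0])
        return SPLIT_BRAIN;

    return UNRELATED;
}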
We offer a number of auto-recovery strategies when split-brain is detected, but the default policy is to
give up and disconnect, and let the operator sort out how the merge shall be done, or which change-set to
throw away.
A version control system would assist you with a three-way merge. Unfortunately, we cannot, because
we no longer have the parent data set to generate the necessary deltas against. And even if we had,
there simply is no generic three-way binary merge.
Unlike with a "real" shared disk, a split-brain situation with DRBD does not lead to data-corruption.
With a shared disk, if there is no (or malfunctioning) arbitration, once multiple nodes write uncoordinated
to the same disk, you can reach out for your backups, because that disk is now scrambled.
With DRBD, because it is shared-nothing, each node has exclusive access to its own storage. When
they don't communicate, the end result is "just" diverging data sets, each consistent in itself.
This may be even worse than a scrambled disk, though, since now you possibly spend considerable time
trying to merge them, before you get frustrated, reach out for your backups, and start over anyway...
5 If DRBD is connected, not degraded, and we look at it in a moment when no requests are on the fly.
6 Of course we expect them to work. But there will always be someone trying to do this with reiserfs, believing DRBD will magically make it cluster aware. We have to guarantee that we scramble both replication copies in the same way...
8.2 Shared disk emulation
1. A write request (WR) is issued on the node; it is sent to the peer as well as submitted to the local IO
subsystem.
2. The data packet arrives at the peer node and is submitted to the peer's IO subsystem.
3. The write to the peer's disk finishes, and an ACK packet is sent back to the origin node.
4. The ACK packet arrives at the origin node.
Events 1. to 4. always occur in the same order. The local completion can happen anytime, independently
of 2. to 4. The timing can vary drastically.
If we look at two competing write accesses to a location of the replicated storage, we have two classes
of events with five events each, shuffled, where we can still distinguish four different orderings within each
class. Expressed mathematically, the number of different possible timings for the naive implementation is
(number of combinations) / (number of indistinguishable combinations) = (5+5)! / ((5!/4) × (5!/4)) = 4032,
or 2016, if we take the symmetry into account.
This quite impressive number can be reduced to a few different cases if we "sort" by the timing of
the "would be" disk writes:
• the writes are strictly in order (trivial case; 96 combinations);
• the writes can be reordered easily so they are in the correct order again (remote request while local request is still pending; 336 combinations);
• the conflict can be detected and resolved by just one node, without communication (local request while remote still pending; 1080 combinations);
• the write requests have been issued quasi simultaneously (2520 combinations).
These four cases add up to 96 + 336 + 1080 + 2520 = 4032, matching the total above.
… communicate out-of-band state information, avoiding the additional latency such communication would suffer on a congested data
socket. Since we have it, we use it for ACK packets as well. We don't care about ACK packets overtaking data packets, but as explained,
we have reason to ensure that DATA packets are only processed after previously sent ACK packets have been processed.
8.5 Local request while remote request still pending
Figure 4: Concurrent write while writing to the backing storage; 1080 of 4032
8.6 Quasi simultaneous writes
Without arbitration, in the end each node would end up with the respective other node's data. We flag one
node (in the example N2) with the discard-concurrent-writes flag. Now both nodes end up with N2's data.
See figure 5.
Concurrent writes, high latency for data packets This case is also handled by the just introduced flag.
See figure 6.
8.7 Algorithm for Concurrent Write Arbitration
Each time we have a concurrent write access, we print an alert message to the kernel log, since this indicates
that some layer above us is seriously broken!
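A sketch of the arbitration itself, in plain C with invented names. One node carries the discard-concurrent-writes flag; on that node an incoming data packet that conflicts with a local in-flight write is dropped and answered with a discard ACK, while on the other node the peer's data is applied, so both nodes end up with the flagged node's data.

#include <stdbool.h>
#include <stdio.h>

struct data_packet;                    /* the peer's write, opaque here */

/* Invented helpers, for illustration only. */
extern void write_packet_to_disk(const struct data_packet *p);
extern void send_discard_ack(const struct data_packet *p);
extern void send_write_ack(const struct data_packet *p);

/* Exactly one of the two nodes carries this flag. */
static bool discard_concurrent_writes_flag;

/* Receiver side: a data packet arrived that overlaps a write we have
 * issued ourselves and not yet finished, i.e. a concurrent write.
 * Simplified; the real code also has to order this against ACK packets
 * already sent, as explained earlier.                                  */
static void handle_concurrent_write(const struct data_packet *p)
{
    /* Upper layers are supposed to prevent this; complain loudly. */
    fprintf(stderr, "concurrent write detected - some layer above us is broken\n");

    if (discard_concurrent_writes_flag) {
        /* Our data wins on both nodes: drop the peer's packet and tell
         * the peer that its write was discarded here.                  */
        send_discard_ack(p);
    } else {
        /* The flagged peer's data wins: apply it, overwriting ours.    */
        write_packet_to_disk(p);
        send_write_ack(p);
    }
}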
protocol A asynchronous; completion of request happens as soon as it is completed locally and handed
over to the local network stack. In case of a Primary node crash and fail-over to the other node, all
WRITEs that have not yet reached the other node are lost.
protocol B quasi-synchronous; completion of request happens after local completion and remote RecvAck.
Rationale: by the time the RecvAck has made its way to the Primary node, the data has
probably reached the remote disk, too. Even in case of a Primary crash and fail-over to the other
node, no data has been lost. For any completed WRITEs to be actually lost, we'd need a
simultaneous crash of both nodes after the RecvAck has already been received on the Primary, but before
the data reached the disk of the Secondary – and the Primary would have to be irreparably damaged, so the
only remaining copy would be the Secondary's disk, which lacks these WRITEs. Not impossible,
but unlikely.
protocol C synchronous; completion of request only happens after both local completion and remote
WriteAck.
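The difference between the three protocols boils down to the point at which a WRITE may be signalled as complete to upper layers. The following sketch in plain C (invented types, not DRBD's code) states that condition explicitly:

#include <stdbool.h>

enum proto { PROTO_A, PROTO_B, PROTO_C };

struct write_state {
    bool local_done;    /* local disk completed the write                */
    bool net_queued;    /* handed over to the local TCP send queue       */
    bool recv_acked;    /* peer confirmed reception (RecvAck)            */
    bool write_acked;   /* peer confirmed it reached its disk (WriteAck) */
};

/* May this WRITE be completed to upper layers? */
static bool may_complete(enum proto p, const struct write_state *w)
{
    if (!w->local_done)
        return false;                   /* all protocols wait for local completion */

    switch (p) {
    case PROTO_A: return w->net_queued;     /* asynchronous      */
    case PROTO_B: return w->recv_acked;     /* quasi-synchronous */
    case PROTO_C: return w->write_acked;    /* synchronous       */
    }
    return false;
}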
When the former Secondary takes over after a Primary crash and fail-over, its data is consistent. But it may
not be clean. To become clean and consistent, a file system would need to do a journal-replay (or fsck),
a database would additionally need to replay its transaction logs, etc., so to the applications and services it
would just look like an extremely fast reboot.
Now, actually, it is not as simple as that.
Since we do replication, there are two IO layers involved: two IO-schedulers and two disk subsystems
that may reorder WRITEs, and they may choose to order them differently. To avoid that, we could do only
strictly synchronous replication. But we can also define reorder-domains, and separate them with in-protocol
barriers.
Any WRITE that is submitted before a previously submitted WRITE has been completed may be
reordered. Every time we complete a request to upper layers, we create a new "epoch" in the "transfer log"
(a ring-list structure on the Primary), which then contains all requests submitted between two application-visible
completion-events. On the wire, different epochs are separated by in-protocol DRBD-barrier packets
inserted into the data stream. Once such a barrier is received, the receiving side waits for all pending
IO-requests to complete before sending a BarrierAck back, and submitting the next one. This could be
improved upon using tagged command queuing (TCQ), which would avoid the additional latency introduced
by waiting for pending IO.
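On the receiving side, the barrier handling amounts to something like the following sketch (plain C, invented helpers): drain everything belonging to the closing epoch, then confirm it.

/* Invented helpers, for illustration only. */
extern void wait_until_all_pending_local_writes_completed(void);
extern void send_barrier_ack(unsigned int barrier_nr);
extern void start_submitting_next_epoch(void);

/* Called by the receiver when a DRBD-barrier packet separating two
 * epochs arrives in the data stream.                                 */
static void got_barrier(unsigned int barrier_nr)
{
    /* Everything received before the barrier must be on stable storage
     * before we confirm the epoch; only afterwards may the next epoch's
     * writes be submitted (and be freely reordered among themselves).  */
    wait_until_all_pending_local_writes_completed();
    send_barrier_ack(barrier_nr);
    start_submitting_next_epoch();
}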
To still be able to reliably track (possibly) changed blocks, even with the asynchronous protocols,
and mark them for resynchronization if necessary, completion of a request happens in two phases. First
it gets completed to the upper layers. Then, when the corresponding epoch is closed by the receipt of
a BarrierAck, it gets cleared from the transfer-log (in protocol C, closing an epoch is almost a no-op).
When we lose connection, we scan the transfer log for any entries in not-yet closed epochs, and mark the
corresponding bits as dirty, even when the corresponding application requests have been completed already.
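The connection-loss scan could be sketched like this (plain C, invented types); whether a request's dirty bit gets set depends only on whether its epoch was ever closed by a BarrierAck, not on whether upper layers already saw it complete.

#include <stdbool.h>

/* One entry of the Primary's transfer log (simplified). */
struct tl_request {
    unsigned long sector;
    bool epoch_closed;          /* BarrierAck for its epoch received? */
    struct tl_request *next;
};

extern void set_out_of_sync(unsigned long sector);  /* dirty the bitmap bit */

static void on_connection_loss(struct tl_request *transfer_log)
{
    struct tl_request *r;

    for (r = transfer_log; r != NULL; r = r->next) {
        /* Requests in epochs never closed by a BarrierAck may or may not
         * have reached the peer's disk, even if they were already
         * completed to upper layers; mark them for resync.             */
        if (!r->epoch_closed)
            set_out_of_sync(r->sector);
    }
}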
watch -n1 ’cat /proc/drbd; netstat -tnp | grep :7788 | grep ESTABLISHED’
during sequential writes, and see whether it ever runs against the socket buffer limits.
Some IO-subsystems like to be kicked (unplugged) often, some prefer to decide for themselves.
You can tune this using the "unplug-watermark", which tells the receiving side to unplug its lower-level
device whenever it has more than that many in-flight requests (yes, there is some sort of hysteresis in
place).
10.3 find the bottleneck: systematically benchmarking
• DRBD, disconnected
You have tried everything to track down your bottlenecks and tune the system to get decent performance out of it,
but still no luck? Ask for help on our mailing list, or get a support contract. . .
11 typical setups
11.1 Active/Standby
You have one drbd, containing a file system, backing the data area of some service (web/mail/news...). You
can also have several such resource groups, but usually you have them all active on one node, with the other acting
as hot-standby only, ready to take over in case of a Primary crash.
8 You are free to use whatever tool you like, but don’t forget about fsync!
11.4 Disaster-Recovery
You could also simply not use a cluster manager, and have a remote site be strictly Secondary, mirroring
all changes from the main site. If the main site is catastrophically destroyed, you can move the box from
the DR-site physically into a new site, and make it Primary there.
11.5 . . . + Disaster-Recovery
Of course you could combine any of the above two-node clusters with a third node for disaster recovery.
12.2 “online” kernel or firmware upgrade
… on the other node, boot, test, resync. Added storage? Tell DRBD about it (drbdadm resize), then tell
the file system (xfs_growfs/resize2fs).
Virtually no downtime, and your database is running with doubled RAM, your backup server has tripled
its available capacity, and no user has even noticed...
…ment as new selling point –, the commercial variant (DRBD+) makes three-node-setups slightly less cumbersome. But effectively
you pay for the support, to let people who do it every day deal with the details and hide them from you. . .
Also, think of this as your chance to support the further development of DRBD, and to influence feature development and
prioritisation.
13 Current limitations, future prospects
Write-Quorum
In the Two-Primaries case, whenever DRBD currently loses the connection, you effectively get a split-brain
situation immediately, with slightly diverging data, which then needs to be resolved by "discarding" the
changes on one of the nodes, possibly using the provided auto-recovery strategies.
Since this is annoying big time, one of the next features will be the implementation of an (optional)
Write-Quorum: starting with a Write-Quorum of two, any lonely Primary (after connection loss)
would first freeze all IO, then wait for either re-established communications (and then re-transmit requests
and resume), or an administrative (operator- or cluster-manager-induced) request to reduce the
Write-Quorum to one (and then generate the new data generation tag and do the rest as explained above).
You (your cluster manager) would only reduce Write-Quorum after making sure that the other node(s)
won’t fiddle with their copy of the data.
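As a purely conceptual sketch in plain C (this feature does not exist yet, and all names below are made up), the behaviour could look like this:

#include <stdbool.h>

static int write_quorum = 2;            /* configured quorum            */
static int reachable_replicas = 2;      /* ourselves + connected peers  */

/* Invented helpers, for illustration only. */
extern void freeze_all_io(void);
extern void resume_io_and_resend_frozen_requests(void);
extern void create_new_data_generation_tag(void);

static void on_connection_loss_with_quorum(void)
{
    reachable_replicas--;
    if (reachable_replicas < write_quorum)
        freeze_all_io();        /* don't let the data sets diverge      */
}

static void on_reconnect(void)
{
    reachable_replicas++;
    if (reachable_replicas >= write_quorum)
        resume_io_and_resend_frozen_requests();
}

/* Operator or cluster manager decided the peer is really gone. */
static void reduce_write_quorum_to(int new_quorum)
{
    write_quorum = new_quorum;
    if (reachable_replicas >= write_quorum) {
        create_new_data_generation_tag();
        resume_io_and_resend_frozen_requests();
    }
}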
Resources
More publications and papers, documentation, and the software are available online at the DRBD Project
Homepage http://www.drbd.org.
You may want to subscribe to the mailing list http://lists.linbit.com/listinfo/drbd-user.
You should browse (and help to improve) the FAQ http://www.linux-ha.org/DRBD/FAQ.