Chapter 1. DRBD Fundamentals
Table of Contents
1.1. Kernel module
1.2. User space administration tools
1.3. Resources
1.4. Resource roles
1.2. User space administration tools
DRBD comes with a set of administration tools which communicate with the kernel module in
order to configure and administer DRBD resources.
drbdadm. The high-level administration tool of the DRBD program suite. Obtains all DRBD
configuration parameters from the configuration file /etc/drbd.conf and acts as a front-end
for drbdsetup and drbdmeta. drbdadm has a dry-run mode, invoked with the -d option, that
shows which drbdsetup and drbdmeta calls drbdadm would issue without actually calling
those commands.
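For example, to preview the calls drbdadm would issue when bringing up a resource without actually executing them (the resource name r0 is purely illustrative):
# drbdadm -d up r0
The output lists the corresponding drbdsetup and, where applicable, drbdmeta invocations.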
drbdsetup. Configures the DRBD module loaded into the kernel. All parameters to drbdsetup
must be passed on the command line. The separation between drbdadm and drbdsetup allows
for maximum flexibility. Most users will rarely need to use drbdsetup directly, if at all.
drbdmeta. Allows you to create, dump, restore, and modify DRBD meta data structures. Like
drbdsetup, most users will only rarely need to use drbdmeta directly.
1.3. Resources
In DRBD, resource is the collective term that refers to all aspects of a particular replicated data set.
These include:
Resource name. This can be any arbitrary, US-ASCII name not containing whitespace by which
the resource is referred to.
Volumes. Any resource is a replication group consisting of one or more volumes that share a
common replication stream. DRBD ensures write fidelity across all volumes in the resource.
Volumes are numbered starting with 0, and there may be up to 65,535 volumes in one resource. A
volume contains the replicated data set, and a set of metadata for DRBD internal use.
At the drbdadm level, a volume within a resource can be addressed by the resource name and
volume number as <resource>/<volume>.
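For example, assuming a hypothetical resource named r0, its first volume would be addressed as r0/0, for instance when querying its disk state:
# drbdadm dstate r0/0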
DRBD device. This is a virtual block device managed by DRBD. It has a device major number of
147, and its minor numbers are numbered from 0 onwards, as is customary. Each DRBD device
corresponds to a volume in a resource. The associated block device is usually named
/dev/drbdX, where X is the device minor number. DRBD also allows for user-defined block
device names which must, however, start with drbd_.
Note
Very early DRBD versions hijacked NBD’s device major number 43. This
is long obsolete; 147 is the LANANA-registered DRBD device major.
Connection. A connection is a communication link between two hosts that share a replicated data
set. At the time of this writing, each resource involves only two hosts and exactly one connection
between these hosts, so for the most part the terms resource and connection can be used
interchangeably.
At the drbdadm level, a connection is addressed by the resource name.
1.4. Resource roles
In DRBD, every resource has a role, which may be Primary or Secondary.
Note
The choice of terms here is not arbitrary. These roles were deliberately not
named "Active" and "Passive" by DRBD’s creators. Primary vs.
secondary refers to a concept related to availability of storage, whereas
active vs. passive refers to the availability of an application. It is usually
the case in a high-availability environment that the primary node is also
the active one, but this is by no means necessary.
• A DRBD device in the primary role can be used unrestrictedly for read and write operations.
It may be used for creating and mounting file systems, raw or direct I/O to the block device,
etc.
• A DRBD device in the secondary role receives all updates from the peer node’s device, but
otherwise disallows access completely. It cannot be used by applications, for either read or
write access. The reason for disallowing even read-only access to the device is the
necessity to maintain cache coherency, which would be impossible if a secondary resource
were made accessible in any way.
The resource’s role can, of course, be changed, either by manual intervention or by way of some
automated algorithm by a cluster management application. Changing the resource role from
secondary to primary is referred to as promotion, whereas the reverse operation is termed demotion.
Chapter 2. DRBD Features
Table of Contents
2.1. Single-primary mode
2.2. Dual-primary mode
2.3. Replication modes
2.4. Multiple replication transports
2.5. Efficient synchronization
2.5.1. Variable-rate synchronization
2.5.2. Fixed-rate synchronization
2.5.3. Checksum-based synchronization
2.6. Suspended replication
2.7. On-line device verification
2.8. Replication traffic integrity checking
2.9. Split brain notification and automatic recovery
2.10. Support for disk flushes
2.11. Disk error handling strategies
2.12. Strategies for dealing with outdated data
2.13. Three-way replication
2.14. Long-distance replication with DRBD Proxy
2.15. Truck based replication
2.16. Floating peers
This chapter discusses various useful DRBD features, and gives some background information
about them. Some of these features will be important to most users, some will only be relevant in
very specific deployment scenarios. Chapter 6, Common administrative tasks and Chapter 7,
Troubleshooting and error recovery contain instructions on how to enable and use these features in
day-to-day operation.
2.2. Dual-primary mode
In dual-primary mode, a resource is, at any given time, in the primary role on both cluster nodes.
Since concurrent access to the data is thus possible, this mode requires the use of a shared cluster
file system that utilizes a distributed lock manager. Examples include GFS and OCFS2.
Deploying DRBD in dual-primary mode is the preferred approach for load-balancing clusters which
require concurrent data access from two nodes. This mode is disabled by default, and must be
enabled explicitly in DRBD’s configuration file.
2.3. Replication modes
DRBD supports three distinct replication modes, allowing three degrees of replication
synchronicity.
Protocol A. Asynchronous replication protocol. Local write operations on the primary node are
considered completed as soon as the local disk write has finished, and the replication packet has
been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The
data on the standby node is consistent after fail-over, however, the most recent updates performed
prior to the crash could be lost. Protocol A is most often used in long distance replication scenarios.
When used in combination with DRBD Proxy it makes an effective disaster recovery solution.
Protocol C. Synchronous replication protocol. Local write operations on the primary node are
considered completed only after both the local and the remote disk write have been confirmed. As a
result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course,
inevitable even with this replication protocol if both nodes (or their storage subsystems) are
irreversibly destroyed at the same time.
By far, the most commonly used replication protocol in DRBD setups is protocol C.
The choice of replication protocol influences two factors of your deployment: protection and
latency. Throughput, by contrast, is largely independent of the replication protocol selected.
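The protocol is selected per resource in the net section of the configuration file, as the stacked and DRBD Proxy examples later in this guide also show; a minimal sketch:
resource <resource> {
  net {
    protocol C;
  }
  ...
}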
2.7. On-line device verification
On-line device verification enables users to do a block-by-block data integrity check between nodes
in a very efficient manner.
Note that efficient refers to efficient use of network bandwidth here, and to the fact that verification
does not break redundancy in any way. On-line verification is still a resource-intensive operation,
with a noticeable impact on CPU utilization and load average.
It works by one node (the verification source) sequentially calculating a cryptographic digest of
every block stored on the lower-level storage device of a particular resource. DRBD then transmits
that digest to the peer node (the verification target), where it is checked against a digest of the local
copy of the affected block. If the digests do not match, the block is marked out-of-sync and may
later be synchronized. Because DRBD transmits just the digests, not the full blocks, on-line
verification uses network bandwidth very efficiently.
The process is termed on-line verification because it does not require that the DRBD resource being
verified is unused at the time of verification. Thus, though it does carry a slight performance penalty
while it is running, on-line verification does not cause service interruption or system down time —
neither during the verification run nor during subsequent synchronization.
It is a common use case to have on-line verification managed by the local cron daemon, running it,
for example, once a week or once a month.
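For example, a cron entry along the following lines would start a weekly verification run (the file location /etc/cron.d/drbd-verify and the schedule are illustrative assumptions):
42 0 * * 0    root    /sbin/drbdadm verify <resource>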
2.13. Three-way replication
When using three-way replication, DRBD adds a third node to an existing 2-node cluster and
replicates data to that node, where it can be used for backup and disaster recovery purposes. This
type of configuration generally involves Section 2.14, “Long-distance replication with DRBD
Proxy”.
Three-way replication works by adding another, stacked DRBD resource on top of the existing
resource holding your production data, as seen in this illustration:
Figure 2.1. DRBD resource stacking
The stacked resource is replicated using asynchronous replication (DRBD protocol A), whereas the
production data would usually make use of synchronous replication (DRBD protocol C).
Three-way replication can be used permanently, where the third node is continuously updated with
data from the production cluster. Alternatively, it may also be employed on demand, where the
production cluster is normally disconnected from the backup site, and site-to-site synchronization is
performed on a regular basis, for example by running a nightly cron job.
2.14. Long-distance replication with DRBD Proxy
DRBD’s protocol A is asynchronous, but the writing application will block as soon as the socket
output buffer is full (see the sndbuf-size option in drbd.conf(5)). In that event, the writing
application has to wait until some of the data written drains through a network link of possibly
limited bandwidth.
The average write bandwidth is limited by available bandwidth of the network link. Write bursts can
only be handled gracefully if they fit into the limited socket output buffer.
You can mitigate this by using DRBD Proxy’s buffering mechanism. DRBD Proxy will place changed
data from the DRBD device on the primary node into its buffers. DRBD Proxy’s buffer size is
freely configurable, limited only by the address space size and available physical RAM.
Optionally DRBD Proxy can be configured to compress and decompress the data it forwards.
Compression and decompression of DRBD’s data packets might slightly increase latency. However,
when the bandwidth of the network link is the limiting factor, the gain in shortening transmit time
outweighs the compression and decompression overhead.
Compression and decompression were implemented with multi core SMP systems in mind, and can
utilize multiple CPU cores.
Since most block I/O data compresses very well, the resulting increase in effective bandwidth
justifies the use of DRBD Proxy even with DRBD protocols B and C.
Note
DRBD Proxy is the only part of the DRBD product family that is not
published under an open source license. Please contact LINBIT sales for an
evaluation license.
2.15. Truck based replication
Truck based replication, also known as disk shipping, is a means of preseeding a remote site with
data to be replicated, by physically shipping storage media to the remote site. This is particularly
suited for situations where
• the total amount of data to be replicated is fairly large (more than a few hundreds of
gigabytes);
• the expected rate of change of the data to be replicated is less than enormous;
• the available network bandwidth between sites is limited.
In such situations, without truck based replication, DRBD would require a very long initial device
synchronization (on the order of days or weeks). Truck based replication allows us to ship a data
seed to the remote site, and drastically reduce the initial synchronization time.
2.16. Floating peers
A somewhat special use case for DRBD is the floating peers configuration. In floating peer setups,
DRBD peers are not tied to specific named hosts (as in conventional configurations), but instead
have the ability to float between several hosts. In such a configuration, DRBD identifies peers by IP
address, rather than by host name.
Chapter 5. Configuring DRBD
It is not necessary for this storage area to be empty before you create a DRBD resource from it. In
fact it is a common use case to create a two-node cluster from a previously non-redundant single-
server system using DRBD (some caveats apply — please refer to Section 17.1, “DRBD meta data”
if you are planning to do this).
For the purposes of this guide, we assume a very simple setup:
• Both hosts have a free (currently unused) partition named /dev/sda7.
• We are using internal meta data.
Note
Volumes may also be added to existing resources on the fly. For an
example see Section 10.5, “Adding a new DRBD volume to an existing
Volume Group”.
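For a setup like this, the resource configuration might look roughly as follows (the resource name, host names, port, and IP addresses are examples only):
resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda7;
  meta-disk internal;
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}
After distributing this file to both nodes, metadata would typically be created with drbdadm create-md r0 on both nodes before the resource is enabled.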
Enable the resource. This step associates the resource with its backing device (or devices, in case
of a multi-volume resource), sets replication parameters, and connects the resource to its peer:
# drbdadm up <resource>
Observe /proc/drbd. DRBD’s virtual status file in the /proc filesystem, /proc/drbd,
should now contain information similar to the following:
# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by buildsystem@linbit,
2011-12-20 12:58:48
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:524236
Note
The Inconsistent/Inconsistent disk state is expected at this
point.
By now, DRBD has successfully allocated both disk and network resources and is ready for
operation. What it does not know yet is which of your nodes should be used as the source of the
initial device synchronization.
• Create a consistent, verbatim copy of the resource’s data and its metadata. You may do so,
for example, by removing a hot-swappable drive from a RAID-1 mirror. You would, of
course, replace it with a fresh drive, and rebuild the RAID set, to ensure continued
redundancy. But the removed drive is a verbatim copy that can now be shipped off site. If
your local block device supports snapshot copies (such as when using DRBD on top of
LVM), you may also create a bitwise copy of that snapshot using dd.
• On the local node, issue:
# drbdadm new-current-uuid <resource>
After the two peers connect, they will not initiate a full device synchronization. Instead, the
automatic synchronization that now commences only covers those blocks that changed since the
invocation of drbdadm --clear-bitmap new-current-uuid.
Even if there were no changes whatsoever since then, there may still be a brief synchronization
period due to areas covered by the Activity Log being rolled back on the new Secondary. This may
be mitigated by the use of checksum-based synchronization.
You may use this same procedure regardless of whether the resource is a regular DRBD resource, or
a stacked resource. For stacked resources, simply add the -S or --stacked option to drbdadm.
Part III. Working with DRBD
The first line, prefixed with version:, shows the DRBD version used on your system. The
second line contains information about this specific build.
The other four lines in this example form a block that is repeated for every DRBD device
configured, prefixed by the device minor number. In this case, this is 0, corresponding to the device
/dev/drbd0.
The resource-specific output from /proc/drbd contains various pieces of information about the
resource:
cs (connection state). Status of the network connection.
ro (roles). Roles of the nodes. The role of the local node is displayed first, followed by the role of
the partner node shown after the slash.
ds (disk states). State of the hard disks. Prior to the slash the state of the local node is displayed,
after the slash the state of the hard disk of the partner node is shown.
Replication protocol. Replication protocol used by the resource. Either A, B or C.
I/O Flags. Six state flags reflecting the I/O status of this resource.
Performance indicators. A number of counters and gauges reflecting the resource’s utilization and
performance.
The local resource role is always displayed first, the remote resource role last.
You may see one of the following resource roles:
Primary. The resource is currently in the primary role, and may be read from and written to. This
role only occurs on one of the two nodes, unless dual-primary mode is enabled.
Secondary. The resource is currently in the secondary role. It normally receives updates from its
peer (unless running in disconnected mode), but may neither be read from nor written to. This role
may occur on one or both nodes.
Unknown. The resource’s role is currently unknown. The local resource role never has this status. It
is only displayed for the peer’s resource role, and only in disconnected mode.
The local disk state is always displayed first, the remote disk state last.
Both the local and the remote disk state may be one of the following:
Diskless. No local block device has been assigned to the DRBD driver. This may mean that the
resource has never attached to its backing device, that it has been manually detached using
drbdadm detach, or that it automatically detached after a lower-level I/O error.
Attaching. Transient state while reading meta data.
Failed. Transient state following an I/O failure report by the local block device. Next state:
Diskless.
Negotiating. Transient state when an Attach is carried out on an already-Connected
DRBD device.
Inconsistent. The data is inconsistent. This status occurs immediately upon creation of a new
resource, on both nodes (before the initial full sync). Also, this status is found in one node (the
synchronization target) during synchronization.
Outdated. Resource data is consistent, but outdated.
Dunknown. This state is used for the peer disk if no network connection is available.
Consistent. Consistent data of a node without connection. When the connection is established,
it is decided whether the data is UpToDate or Outdated.
UpToDate. Consistent, up-to-date state of the data. This is the normal state.
As always, you may use the keyword all instead of a specific resource name if you want to enable
all resources configured in /etc/drbd.conf at once.
Here, too, you may use the keyword all in place of a resource name if you wish to temporarily
disable all resources listed in /etc/drbd.conf at once.
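Assuming the usual drbdadm up and drbdadm down commands, that would look like:
# drbdadm up all
# drbdadm down all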
6.3. Reconfiguring resources
DRBD allows you to reconfigure resources while they are operational. To that end,
• make any necessary changes to the resource configuration in /etc/drbd.conf,
• synchronize your /etc/drbd.conf file between both nodes,
• issue the drbdadm adjust <resource> command on both nodes.
drbdadm adjust then hands off to drbdsetup to make the necessary adjustments to the
configuration. As always, you are able to review the pending drbdsetup invocations by running
drbdadm with the -d (dry-run) option.
Note
When making changes to the common section in /etc/drbd.conf,
you can adjust the configuration for all resources in one run, by issuing
drbdadm adjust all.
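For example, a sketch of the reconfiguration workflow for a single resource, reviewing first and then applying:
# drbdadm -d adjust <resource>
# drbdadm adjust <resource>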
In single-primary mode (DRBD’s default), any resource can be in the primary role on only one node
at any given time while the connection state is Connected. Thus, issuing drbdadm primary
<resource> on one node while <resource> is still in the primary role on the peer will result in
an error.
A resource configured to allow dual-primary mode can be switched to the primary role on both
nodes.
Now, on the node we wish to make primary, promote the resource and mount the device:
# drbdadm primary <resource>
# mount /dev/drbd/by-res/<resource> <mountpoint>
6.8. Enabling dual-primary mode
Dual-primary mode allows a resource to assume the primary role simultaneously on both nodes.
Doing so is possible on either a permanent or a temporary basis.
Note
Dual-primary mode requires that the resource is configured to replicate
synchronously (protocol C). Because of this it is latency sensitive, and ill
suited for WAN environments.
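The permanent configuration change referred to below is made in the resource’s net section, using DRBD’s allow-two-primaries option; a sketch (the surrounding layout is illustrative):
resource <resource> {
  net {
    protocol C;
    allow-two-primaries yes;
  }
  ...
}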
After that, do not forget to synchronize the configuration between nodes. Run drbdadm adjust
<resource> on both nodes.
You can now change both nodes to role primary at the same time with drbdadm primary
<resource>.
To end temporary dual-primary mode, run the same command as above but with --allow-two-
primaries=no (and your desired replication protocol, if applicable).
The /etc/init.d/drbd system init script parses the become-primary-on option on startup and
promotes resources accordingly.
Note
The become-primary-on approach is not required, nor
recommended, in Pacemaker-managed DRBD configurations. In
Pacemaker configuration, resource promotion and demotion should
always be handled by the cluster manager.
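On-line verification is enabled by adding a verify-alg setting to the resource’s net section; a sketch (layout illustrative):
resource <resource> {
  net {
    verify-alg <algorithm>;
  }
  ...
}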
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.
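A verification run is then started on demand, on the node that is to act as the verification source, with:
# drbdadm verify <resource>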
When you do so, DRBD starts an online verification run for <resource>, and if it detects any
blocks not in sync, will mark those blocks as such and write a message to the kernel log. Any
applications using the device at that time can continue to do so unimpeded, and you may also switch
resource roles at will.
If out-of-sync blocks were detected during the verification run, you may resynchronize them using
the following commands after verification has completed:
# drbdadm disconnect <resource>
# drbdadm connect <resource>
Likewise, and for the same reasons, it does not make sense to set a synchronization rate that is
higher than the bandwidth available on the replication network.
Thus, the recommended value for the rate option would be 33M.
By contrast, if you had an I/O subsystem with a maximum throughput of 80MB/s and a Gigabit
Ethernet connection (the I/O subsystem being the bottleneck), you would calculate:
Figure 6.2. Syncer rate example, 80MB/s effective available bandwidth (80 MB/s × 0.3 = 24 MB/s)
In this case, the recommended value for the rate option would be 24M.
Tip
A good starting value for c-fill-target is BDP×3, where BDP is
your bandwidth delay product on the replication link.
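Checksum-based synchronization is likewise configured in the net section, via the csums-alg setting also referred to in the DRBD Proxy section of this guide; a sketch:
resource <resource> {
  net {
    csums-alg <algorithm>;
  }
  ...
}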
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.
6.12. Configuring congestion policies and suspended
replication
In an environment where the replication bandwidth is highly variable (as would be typical in WAN
replication setups), the replication link may occasionally become congested. In a default
configuration, this would cause I/O on the primary node to block, which is sometimes undesirable.
Instead, you may configure DRBD to suspend the ongoing replication in this case, causing the
Primary’s data set to pull ahead of the Secondary. In this mode, DRBD keeps the replication
channel open — it never switches to disconnected mode — but does not actually replicate until
sufficient bandwidth becomes available again.
The following example is for a DRBD Proxy configuration:
resource <resource> {
  net {
    on-congestion pull-ahead;
    congestion-fill 2G;
    congestion-extents 2000;
    ...
  }
  ...
}
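The I/O error handling policy discussed next is selected with the on-io-error option in the disk section; a sketch (layout illustrative):
resource <resource> {
  disk {
    on-io-error <strategy>;
    ...
  }
  ...
}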
You may, of course, set this in the common section too, if you want to define a global I/O error
handling policy for all resources.
<strategy> may be one of the following options:
1. detach This is the default and recommended option. On the occurrence of a lower-level
I/O error, the node drops its backing device, and continues in diskless mode.
2. pass_on This causes DRBD to report the I/O error to the upper layers. On the primary
node, it is reported to the mounted file system. On the secondary node, it is ignored (because
the secondary has no upper layer to report to).
3. call-local-io-error Invokes the command defined as the local I/O error handler. This
requires that a corresponding local-io-error command invocation is defined in the
resource’s handlers section. It is entirely left to the administrator’s discretion to implement
I/O error handling using the command (or script) invoked by local-io-error.
Note
Early DRBD versions (prior to 8.0) included another option, panic,
which would forcibly remove the node from the cluster by way of a kernel
panic, whenever a local I/O error occurred. While that option is no longer
available, the same behavior may be mimicked via the local-io-error /
call-local-io-error interface. You should do so only if you fully
understand the implications of such behavior.
You may reconfigure a running resource’s I/O error handling strategy by following this process:
• Edit the resource configuration in /etc/drbd.d/<resource>.res.
• Copy the configuration to the peer node.
• Issue drbdadm adjust <resource> on both nodes.
6.14. Configuring replication traffic integrity checking
Replication traffic integrity checking is not enabled for resources by default. To enable it, add the
following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}
<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.
6.15. Resizing resources
6.15.1. Growing on-line
If the backing block devices can be grown while in operation (online), it is also possible to increase
the size of a DRBD device based on these devices during operation. To do so, two criteria must be
fulfilled:
1. The affected resource’s backing device must be one managed by a logical volume
management subsystem, such as LVM.
2. The resource must currently be in the Connected connection state.
Having grown the backing block devices on both nodes, ensure that only one node is in primary
state. Then enter on one node:
# drbdadm resize <resource>
This triggers a synchronization of the new section. The synchronization is done from the primary
node to the secondary node.
If the space you’re adding is clean, you can skip syncing the additional space by using the --assume-
clean option.
# drbdadm -- --assume-clean resize <resource>
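The off-line growing procedure below starts by saving the existing metadata to a text file on each node, for example (the dump file name matches the one referenced in the following steps):
# drbdadm dump-md <resource> > /tmp/metadata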
You must do this on both nodes, using a separate dump file for every node. Do not dump the meta
data on one node, and simply copy the dump file to the peer. This will not work.
• Grow the backing block device on both nodes.
• Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly,
on both nodes. Remember that la-size-sect must be specified in sectors.
• Re-initialize the metadata area:
# drbdadm create-md <resource>
• Re-import the corrected meta data, on both nodes:
# drbdmeta_cmd=$(drbdadm -d dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
Valid meta-data in place, overwrite? [need to type 'yes' to confirm]
yes
Successfully restored meta data
Note
This example uses bash parameter substitution. It may or may not work
in other shells. Check your SHELL environment variable if you are unsure
which shell you are currently using.
• Finally, grow the file system so it fills the extended size of the DRBD device.
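For example, for an ext3/ext4 file system on the DRBD device (device name assumed), that final step might be:
# resize2fs /dev/drbd1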
Before shrinking a DRBD device, you must shrink the layers above DRBD, i.e. usually the file
system. Since DRBD cannot ask the file system how much space it actually uses, you have to be
careful in order not to cause data loss.
Note
Whether or not the filesystem can be shrunk on-line depends on the
filesystem being used. Most filesystems do not support on-line shrinking.
XFS does not support shrinking at all.
To shrink DRBD on-line, issue the following command after you have shrunk the file system
residing on top of it:
# drbdadm resize --size=<new-size> <resource>
You may use the usual multiplier suffixes for <new-size> (K, M, G etc.). After you have shrunk
DRBD, you may also shrink the containing block device (if it supports shrinking).
• Shrink the file system from one node, while DRBD is still configured.
• Unconfigure your DRBD resource:
# drbdadm down <resource>
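As in the grow procedure, save the metadata to a text file on each node before shrinking the backing device, for example:
# drbdadm dump-md <resource> > /tmp/metadata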
You must do this on both nodes, using a separate dump file for every node. Do not dump the meta
data on one node, and simply copy the dump file to the peer. This will not work.
• Shrink the backing block device on both nodes.
• Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly,
on both nodes. Remember that la-size-sect must be specified in sectors.
• Only if you are using internal metadata (which at this time have probably been lost due to
the shrinking process), re-initialize the metadata area:
# drbdadm create-md <resource>
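Then re-import the corrected metadata on both nodes, for example using the same parameter-substitution approach shown in the grow procedure:
# drbdmeta_cmd=$(drbdadm -d dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata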
Note
This example uses bash parameter substitution. It may or may not work
in other shells. Check your SHELL environment variable if you are unsure
which shell you are currently using.
Disabling DRBD’s flushes when running without BBWC, or on BBWC with a depleted battery, is
likely to cause data loss and should not be attempted.
DRBD allows you to enable and disable backing device flushes separately for the replicated data set
and DRBD’s own meta data. Both of these options are enabled by default. If you wish to disable
either (or both), you would set this in the disk section for the DRBD configuration file,
/etc/drbd.conf.
To disable disk flushes for the replicated data set, include the following line in your configuration:
resource <resource> {
  disk {
    disk-flushes no;
    ...
  }
  ...
}
To disable disk flushes on DRBD’s meta data, include the following line:
resource <resource> {
  disk {
    md-flushes no;
    ...
  }
  ...
}
After you have modified your resource configuration (and synchronized your /etc/drbd.conf
between nodes, of course), you may enable these settings by issuing this command on both nodes:
# drbdadm adjust <resource>
6.17. Configuring split brain behavior
6.17.1. Split brain notification
DRBD invokes the split-brain handler, if configured, at any time split brain is detected. To
configure this handler, add the following item to your resource configuration:
resource <resource> {
  handlers {
    split-brain <handler>;
    ...
  }
  ...
}
After you have made this modification on a running resource (and synchronized the configuration
file between nodes), no additional intervention is needed to enable the handler. DRBD will simply
invoke the newly-configured handler on the next occurrence of split brain.
For example, a resource which serves as the block device for a GFS or OCFS2 file system in dual-
Primary mode may have its recovery policy defined as follows:
resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}
6.18. Creating a three-node setup
A three-node setup involves one DRBD device stacked atop another.
resource r0 {
  on alice {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
resource r0-U {
  net {
    protocol A;
  }
  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   192.168.42.1:7788;
  }
  on charlie {
    device    /dev/drbd10;
    disk      /dev/hda6;
    address   192.168.42.2:7788; # Public IP of the backup node
    meta-disk internal;
  }
}
As with any drbd.conf configuration file, this must be distributed across all nodes in the cluster
— in this case, three nodes. Notice the following extra keyword not found in an unstacked resource
configuration:
stacked-on-top-of. This option informs DRBD that the resource which contains it is a
stacked resource. It replaces one of the on sections normally found in any resource configuration.
Do not use stacked-on-top-of in a lower-level resource.
Note
It is not a requirement to use Protocol A for stacked resources. You may
select any of DRBD’s replication protocols depending on your
application.
As with unstacked resources, you must create DRBD meta data on the stacked resources. This is
done using the following command:
# drbdadm create-md --stacked r0-U
After this, you may bring up the resource on the backup node, enabling three-node replication:
# drbdadm create-md r0-U
# drbdadm up r0-U
In order to automate stacked resource management, you may integrate stacked resources in your
cluster manager configuration.
6.19. Using DRBD Proxy
6.19.1. DRBD Proxy deployment considerations
The DRBD Proxy processes can either be located directly on the machines where DRBD is set up,
or they can be placed on distinct dedicated servers. A DRBD Proxy instance can serve as a proxy
for multiple DRBD devices distributed across multiple nodes.
DRBD Proxy is completely transparent to DRBD. Typically you will expect a high number of data
packets in flight; therefore, the activity log should be reasonably large. Since this may cause longer
re-sync runs after the crash of a primary node, it is recommended to enable DRBD’s csums-alg
setting.
6.19.2. Installation
To obtain DRBD Proxy, please contact your Linbit sales representative. Unless instructed otherwise,
please always use the most recent DRBD Proxy release.
To install DRBD Proxy on Debian and Debian-based systems, use the dpkg tool as follows (replace
version with your DRBD Proxy version, and architecture with your target architecture):
# dpkg -i drbd-proxy_3.0.0_amd64.deb
To install DRBD Proxy on RPM based systems (like SLES or RHEL) use the rpm tool as follows
(replace version with your DRBD Proxy version, and architecture with your target architecture):
# rpm -i drbd-proxy-3.0-3.0.0-1.x86_64.rpm
Also install the DRBD administration program drbdadm since it is required to configure DRBD
Proxy.
This will install the DRBD proxy binaries as well as an init script which usually goes into
/etc/init.d. Please always use the init script to start/stop DRBD proxy since it also configures
DRBD Proxy using the drbdadm tool.
6.19.4. Configuration
DRBD Proxy is configured in DRBD’s main configuration file. It is configured by an additional
options section called proxy and additional proxy on sections within the host sections.
Below is a DRBD configuration example for proxies running directly on the DRBD nodes:
resource r0 {
  net {
    protocol A;
  }
  device minor 0;
  disk /dev/sdb1;
  meta-disk /dev/sdb2;
  proxy {
    memlimit 100M;
    plugin {
      zlib level 9;
    }
  }
  on alice {
    address 127.0.0.1:7789;
    proxy on alice {
      inside 127.0.0.1:7788;
      outside 192.168.23.1:7788;
    }
  }
  on bob {
    address 127.0.0.1:7789;
    proxy on bob {
      inside 127.0.0.1:7788;
      outside 192.168.23.2:7788;
    }
  }
}
The inside IP address is used for communication between DRBD and the DRBD Proxy, whereas
the outside IP address is used for communication between the proxies.
show
Shows currently configured communication paths.
show memusage
Shows memory usage of each connection.
show [h]subconnections
Shows currently established individual connections
together with some stats. With h outputs bytes in human
readable format.
show [h]connections
Shows currently configured connections and their states.
With h, outputs bytes in human readable format.
shutdown
Shuts down the drbd-proxy program. Attention: this
unconditionally terminates any DRBD connections running.
Examples:
drbd-proxy-ctl -c "list hconnections"
prints configured connections and their status to stdout
Note that the quotes are required.
While the commands above are only accepted from UID 0 (i.e., the root user), there is one
(information gathering) command that can be used by any user (provided that unix permissions
allow access on the proxy socket at /var/run/drbd-proxy/drbd-proxy-ctl.socket);
see the init script at /etc/init.d/drbdproxy about setting the rights.
print details
This prints detailed statistics for the currently active connections.
Can be used for monitoring, as this is the only command that may be sent by a
user with a UID other than 0.
quit
Exits the client program (closes control connection).
The example above would restrict the outgoing rate over the WAN connection to approximately
2MiB per second, leaving room on the wire for other data.
6.19.8. Troubleshooting
DRBD proxy logs via syslog using the LOG_DAEMON facility. Usually you will find DRBD Proxy
messages in /var/log/daemon.log.
Enabling debug mode in DRBD Proxy can be done with the following command.
# drbd-proxy-ctl -c 'set loglevel debug'
For example, if the proxy fails to connect, it will log something like Rejecting connection
because I can’t connect on the other side. In that case, please check if DRBD is
running (not in StandAlone mode) on both nodes and if both proxies are running. Also double-
check your configuration.
Chapter 7. Troubleshooting and error recovery
Note
For the most part, the steps described here apply only if you run DRBD
directly on top of physical hard drives. They generally do not apply in
case you are running DRBD layered on top of
• an MD software RAID set (in this case, use mdadm to manage drive replacement),
• device-mapper RAID (use dmraid),
• a hardware RAID appliance (follow the vendor’s instructions on how to deal with failed
drives),
• some non-standard device-mapper virtual block devices (see the device mapper
documentation).
By running the drbdadm dstate command, you will be able to verify that the resource is
now in diskless mode:
drbdadm dstate <resource>
Diskless/UpToDate
If the disk failure has occurred on your primary node, you may combine this step with a switch-over
operation.
Full synchronization of the new hard disk starts instantaneously and automatically. You will be able
to monitor the synchronization’s progress via /proc/drbd, as with any background
synchronization.
Here, the drbdadm invalidate command triggers synchronization. Again, sync progress may
be observed via /proc/drbd.
After split brain has been detected, one node will always have the resource in a StandAlone
connection state. The other might either also be in the StandAlone state (if both nodes detected
the split brain simultaneously), or in WFConnection (if the peer tore down the connection before
the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must
manually intervene by selecting one node whose modifications will be discarded (this node is
referred to as the split brain victim). This intervention is made with the following commands:
Note
The split brain victim needs to be in the connection state of
StandAlone or the following commands will return an error. You can
ensure it is standalone by issuing:
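drbdadm disconnect <resource>
On the split brain victim, the usual sequence is then to demote the resource and reconnect while discarding its local modifications; a sketch (the exact option placement may differ between DRBD versions):
drbdadm disconnect <resource>
drbdadm secondary <resource>
drbdadm connect --discard-my-data <resource>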
On the other node (the split brain survivor), if its connection state is also StandAlone, you would
enter:
drbdadm connect <resource>
You may omit this step if the node is already in the WFConnection state; it will then reconnect
automatically.
If the resource affected by the split brain is a stacked resource, use drbdadm --stacked instead
of just drbdadm.
Upon connection, your split brain victim immediately changes its connection state to
SyncTarget, and has its modifications overwritten by the remaining primary node.
Note
The split brain victim is not subjected to a full device synchronization.
Instead, it has its local modifications rolled back, and any modifications
made on the split brain survivor propagate to the victim.
After re-synchronization has completed, the split brain is considered resolved and the two nodes
form a fully consistent, redundant replicated storage system.