
Chapter 1.

DRBD Fundamentals
Table of Contents
1.1. Kernel module
1.2. User space administration tools
1.3. Resources
1.4. Resource roles

The Distributed Replicated Block Device (DRBD) is a software-based, shared-nothing, replicated
storage solution mirroring the content of block devices (hard disks, partitions, logical volumes etc.)
between hosts.
DRBD mirrors data
• in real time. Replication occurs continuously while applications modify the data on the
device.
• transparently. Applications need not be aware that the data is stored on multiple hosts.
• synchronously or asynchronously. With synchronous mirroring, applications are notified of
write completions after the writes have been carried out on all hosts. With asynchronous
mirroring, applications are notified of write completions when the writes have completed
locally, which usually is before they have propagated to the other hosts.

1.1. Kernel module


DRBD’s core functionality is implemented by way of a Linux kernel module. Specifically, DRBD
constitutes a driver for a virtual block device, so DRBD is situated right near the bottom of a
system’s I/O stack. Because of this, DRBD is extremely flexible and versatile, which makes it a
replication solution suitable for adding high availability to just about any application.
DRBD is, by definition and as mandated by the Linux kernel architecture, agnostic of the layers
above it. Thus, it is impossible for DRBD to miraculously add features to upper layers that these do
not possess. For example, DRBD cannot auto-detect file system corruption or add active-active
clustering capability to file systems like ext3 or XFS.
Figure 1.1. DRBD’s position within the Linux I/O stack
1.2. User space administration tools

DRBD comes with a set of administration tools which communicate with the kernel module in
order to configure and administer DRBD resources.
drbdadm. The high-level administration tool of the DRBD program suite. Obtains all DRBD
configuration parameters from the configuration file /etc/drbd.conf and acts as a front-end
for drbdsetup and drbdmeta. drbdadm has a dry-run mode, invoked with the -d option, that
shows which drbdsetup and drbdmeta calls drbdadm would issue without actually calling
those commands.
drbdsetup. Configures the DRBD module loaded into the kernel. All parameters to drbdsetup
must be passed on the command line. The separation between drbdadm and drbdsetup allows
for maximum flexibility. Most users will rarely need to use drbdsetup directly, if at all.
drbdmeta. Allows you to create, dump, restore, and modify DRBD meta data structures. As with
drbdsetup, most users will only rarely need to use drbdmeta directly.
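For example, to preview the low-level calls drbdadm would make without executing them, the dry-run
option can be combined with any subcommand (the resource name r0 is a placeholder):
# drbdadm -d up r0
The command prints the drbdsetup and drbdmeta invocations it would have issued, which is useful for
reviewing a configuration before actually applying it.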

1.3. Resources
In DRBD, resource is the collective term that refers to all aspects of a particular replicated data set.
These include:
Resource name. This can be any arbitrary, US-ASCII name not containing whitespace by which
the resource is referred to.
Volumes. Any resource is a replication group consisting of one or more volumes that share a
common replication stream. DRBD ensures write fidelity across all volumes in the resource.
Volumes are numbered starting with 0, and there may be up to 65,535 volumes in one resource. A
volume contains the replicated data set, and a set of metadata for DRBD internal use.
At the drbdadm level, a volume within a resource can be addressed by the resource name and
volume number as <resource>/<volume>.
DRBD device. This is a virtual block device managed by DRBD. It has a device major number of
147, and its minor numbers are numbered from 0 onwards, as is customary. Each DRBD device
corresponds to a volume in a resource. The associated block device is usually named
/dev/drbdX, where X is the device minor number. DRBD also allows for user-defined block
device names which must, however, start with drbd_.

Note
Very early DRBD versions hijacked NBD’s device major number 43. This
is long obsolete; 147 is the LANANA-registered DRBD device major.

Connection. A connection is a communication link between two hosts that share a replicated data
set. As of the time of this writing, each resource involves only two hosts and exactly one connection
between these hosts, so for the most part, the terms resource and connection can be used
interchangeably.
At the drbdadm level, a connection is addressed by the resource name.
1.4. Resource roles
In DRBD, every resource has a role, which may be Primary or Secondary.

Note
The choice of terms here is not arbitrary. These roles were deliberately not
named "Active" and "Passive" by DRBD’s creators. Primary vs.
secondary refers to a concept related to availability of storage, whereas
active vs. passive refers to the availability of an application. It is usually
the case in a high-availability environment that the primary node is also
the active one, but this is by no means necessary.

• A DRBD device in the primary role can be used unrestrictedly for read and write operations.
It may be used for creating and mounting file systems, raw or direct I/O to the block device,
etc.
• A DRBD device in the secondary role receives all updates from the peer node’s device, but
otherwise disallows access completely. It cannot be used by applications, for either read or
write access. The reason for disallowing even read-only access to the device is the necessity
to maintain cache coherency, which would be impossible if a secondary resource were made
accessible in any way.
The resource’s role can, of course, be changed, either by manual intervention or by way of some
automated algorithm by a cluster management application. Changing the resource role from
secondary to primary is referred to as promotion, whereas the reverse operation is termed demotion.
Chapter 2. DRBD Features
Table of Contents
2.1. Single-primary mode
2.2. Dual-primary mode
2.3. Replication modes
2.4. Multiple replication transports
2.5. Efficient synchronization
2.5.1. Variable-rate synchronization
2.5.2. Fixed-rate synchronization
2.5.3. Checksum-based synchronization
2.6. Suspended replication
2.7. On-line device verification
2.8. Replication traffic integrity checking
2.9. Split brain notification and automatic recovery
2.10. Support for disk flushes
2.11. Disk error handling strategies
2.12. Strategies for dealing with outdated data
2.13. Three-way replication
2.14. Long-distance replication with DRBD Proxy
2.15. Truck based replication
2.16. Floating peers

This chapter discusses various useful DRBD features, and gives some background information
about them. Some of these features will be important to most users, some will only be relevant in
very specific deployment scenarios. Chapter 6, Common administrative tasks and Chapter 7,
Troubleshooting and error recovery contain instructions on how to enable and use these features in
day-to-day operation.

2.1. Single-primary mode


In single-primary mode, a resource is, at any given time, in the primary role on only one cluster
member. Since it is guaranteed that only one cluster node manipulates the data at any moment, this
mode can be used with any conventional file system (ext3, ext4, XFS etc.).
Deploying DRBD in single-primary mode is the canonical approach for high availability (fail-over
capable) clusters.

2.2. Dual-primary mode
In dual-primary mode, a resource is, at any given time, in the primary role on both cluster nodes.
Since concurrent access to the data is thus possible, this mode requires the use of a shared cluster
file system that utilizes a distributed lock manager. Examples include GFS and OCFS2.
Deploying DRBD in dual-primary mode is the preferred approach for load-balancing clusters which
require concurrent data access from two nodes. This mode is disabled by default, and must be
enabled explicitly in DRBD’s configuration file.
2.3. Replication modes
DRBD supports three distinct replication modes, allowing three degrees of replication
synchronicity.

Protocol A. Asynchronous replication protocol. Local write operations on the primary node are
considered completed as soon as the local disk write has finished, and the replication packet has
been placed in the local TCP send buffer. In the event of forced fail-over, data loss may occur. The
data on the standby node is consistent after fail-over, however, the most recent updates performed
prior to the crash could be lost. Protocol A is most often used in long distance replication scenarios.
When used in combination with DRBD Proxy it makes an effective disaster recovery solution.

Protocol B. Memory synchronous (semi-synchronous) replication protocol. Local write operations


on the primary node are considered completed as soon as the local disk write has occurred, and the
replication packet has reached the peer node. Normally, no writes are lost in case of forced fail-over.
However, in the event of simultaneous power failure on both nodes and concurrent, irreversible
destruction of the primary’s data store, the most recent writes completed on the primary may be lost.

Protocol C. Synchronous replication protocol. Local write operations on the primary node are
considered completed only after both the local and the remote disk write have been confirmed. As a
result, loss of a single node is guaranteed not to lead to any data loss. Data loss is, of course,
inevitable even with this replication protocol if both nodes (or their storage subsystems) are
irreversibly destroyed at the same time.

By far, the most commonly used replication protocol in DRBD setups is protocol C.
The choice of replication protocol influences two factors of your deployment: protection and
latency. Throughput, by contrast, is largely independent of the replication protocol selected.
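The protocol is selected per resource in the net section of the configuration. A minimal sketch,
assuming a resource named r0 that should replicate asynchronously:
resource r0 {
  net {
    protocol A;
  }
  ...
}
Protocol C is the usual choice for local, low-latency links; protocol A is typically reserved for
long-distance setups as described above.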

2.4. Multiple replication transports


DRBD’s replication and synchronization framework socket layer supports multiple low-level
transports:
TCP over IPv4. This is the canonical implementation, and DRBD’s default. It may be used on any
system that has IPv4 enabled.
TCP over IPv6. When configured to use standard TCP sockets for replication and synchronization,
DRBD can also use IPv6 as its network protocol. This is equivalent in semantics and performance
to IPv4, albeit using a different addressing scheme.
SDP. SDP is an implementation of BSD-style sockets for RDMA capable transports such as
InfiniBand. SDP is available as part of the OFED stack for most current distributions. SDP uses an
IPv4-style addressing scheme. Employed over an InfiniBand interconnect, SDP provides a high-
throughput, low-latency replication network to DRBD.
SuperSockets. SuperSockets replace the TCP/IP portions of the stack with a single, monolithic,
highly efficient and RDMA capable socket implementation. DRBD can use this socket type for very
low latency replication. SuperSockets must run on specific hardware which is currently available
from a single vendor, Dolphin Interconnect Solutions.
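The transport is selected through the address family keyword in each host's address statement. A
minimal sketch, assuming DRBD 8.4 syntax; the address and port shown are placeholders:
resource r0 {
  on alice {
    ...
    address ipv6 [fd00::10]:7789;   # TCP over IPv6
  }
  ...
}
Replacing the ipv6 keyword with sdp or ssocks selects the SDP or SuperSockets transport,
respectively; omitting the keyword defaults to TCP over IPv4.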
2.5. Efficient synchronization
(Re-)synchronization is distinct from device replication. While replication occurs on any write
event to a resource in the primary role, synchronization is decoupled from incoming writes. Rather,
it affects the device as a whole.
Synchronization is necessary if the replication link has been interrupted for any reason, be it due to
failure of the primary node, failure of the secondary node, or interruption of the replication link.
Synchronization is efficient in the sense that DRBD does not synchronize modified blocks in the
order they were originally written, but in linear order, which has the following consequences:
• Synchronization is fast, since blocks in which several successive write operations occurred
are only synchronized once.
• Synchronization is also associated with few disk seeks, as blocks are synchronized
according to the natural on-disk block layout.
• During synchronization, the data set on the standby node is partly obsolete and partly
already updated. This state of data is called inconsistent.
The service continues to run uninterrupted on the active node, while background synchronization is
in progress.
Important
A node with inconsistent data generally cannot be put into operation, so
it is desirable to keep the time period during which a node is inconsistent
as short as possible. DRBD does, however, ship with an LVM integration
facility that automates the creation of LVM snapshots immediately before
synchronization. This ensures that a consistent copy of the data is always
available on the peer, even while synchronization is running.

2.5.1. Variable-rate synchronization


In variable-rate synchronization (the default), DRBD detects the available bandwidth on the
synchronization network, compares it to incoming foreground application I/O, and selects an
appropriate synchronization rate based on a fully automatic control loop.
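The controller's behavior can be tuned through disk-section options. A minimal sketch, assuming
DRBD 8.4 option names; the values shown are placeholders that need tuning for a particular network:
resource <resource> {
  disk {
    c-plan-ahead 200;    # control loop look-ahead, in tenths of a second
    c-fill-target 15M;   # desired amount of in-flight resync data
    c-max-rate 100M;     # hard upper bound for the synchronization rate
  }
  ...
}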

2.5.2. Fixed-rate synchronization


In fixed-rate synchronization, the amount of data shipped to the synchronizing peer per second (the
synchronization rate) has a configurable, static upper limit. Based on this limit, you may estimate
the expected sync time based on the following simple formula:

tsync = D/R

Synchronization time. tsync is the expected sync time. D is the amount of data to be synchronized,
which you are unlikely to have any influence over (this is the amount of data that was modified by
your application while the replication link was broken). R is the rate of synchronization, which is
configurable, bounded by the throughput limitations of the replication network and I/O
subsystem.
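As a worked example under assumed numbers: if 90 GiB of data changed while the link was down
(D = 92,160 MiB) and the synchronization rate is capped at 30 MiB/s (R), the expected sync time is
roughly 92,160 / 30 ≈ 3,072 seconds, or a little over 51 minutes. The cap itself is set with the
resync-rate disk option; a minimal sketch, with the value as a placeholder:
resource <resource> {
  disk {
    resync-rate 30M;   # static upper limit for background synchronization
  }
  ...
}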

2.5.3. Checksum-based synchronization


The efficiency of DRBD’s synchronization algorithm may be further enhanced by using data
digests, also known as checksums. When using checksum-based synchronization, then rather than
performing a brute-force overwrite of blocks marked out of sync, DRBD reads blocks before
synchronizing them and computes a hash of the contents currently found on disk. It then compares
this hash with one computed from the same sector on the peer, and omits re-writing this block if the
hashes match. This can dramatically cut down synchronization times in situations where a filesystem
re-writes a sector with identical contents while DRBD is in disconnected mode.
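Checksum-based synchronization is enabled by naming a digest algorithm in the net section. A
minimal sketch, assuming the kernel's crypto API provides sha1:
resource <resource> {
  net {
    csums-alg sha1;
  }
  ...
}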

2.6. Suspended replication


If properly configured, DRBD can detect if the replication network is congested, and suspend
replication in this case. In this mode, the primary node "pulls ahead" of the secondary — 
temporarily going out of sync, but still leaving a consistent copy on the secondary. When more
bandwidth becomes available, replication automatically resumes and a background synchronization
takes place.
Suspended replication is typically enabled over links with variable bandwidth, such as wide area
replication over shared connections between data centers or cloud instances.
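Suspended replication is governed by the congestion policy options in the net section. A minimal
sketch, assuming DRBD 8.4 option names; the thresholds are placeholders:
resource <resource> {
  net {
    on-congestion pull-ahead;   # suspend replication instead of blocking writers
    congestion-fill 2G;         # amount of in-flight replication data that counts as congestion
    congestion-extents 2000;    # number of active-log extents that counts as congestion
  }
  ...
}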

2.7. On-line device verification

On-line device verification enables users to do a block-by-block data integrity check between nodes
in a very efficient manner.
Note that efficient refers to efficient use of network bandwidth here, and to the fact that verification
does not break redundancy in any way. On-line verification is still a resource-intensive operation,
with a noticeable impact on CPU utilization and load average.
It works by one node (the verification source) sequentially calculating a cryptographic digest of
every block stored on the lower-level storage device of a particular resource. DRBD then transmits
that digest to the peer node (the verification target), where it is checked against a digest of the local
copy of the affected block. If the digests do not match, the block is marked out-of-sync and may
later be synchronized. Because DRBD transmits just the digests, not the full blocks, on-line
verification uses network bandwidth very efficiently.
The process is termed on-line verification because it does not require that the DRBD resource being
verified is unused at the time of verification. Thus, though it does carry a slight performance penalty
while it is running, on-line verification does not cause service interruption or system down time — 
neither during the verification run nor during subsequent synchronization.
It is a common use case to have on-line verification managed by the local cron daemon, running it,
for example, once a week or once a month.
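A minimal sketch of enabling and running verification, assuming sha1 is available in the kernel
crypto API and that r0 is the resource to be checked:
resource r0 {
  net {
    verify-alg sha1;
  }
  ...
}
After adjusting the configuration on both nodes, a verification run is started on one node with:
# drbdadm verify r0
To schedule it weekly, a crontab entry along these lines could be used (the time of day is arbitrary):
42 0 * * 0    root    /sbin/drbdadm verify r0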

2.8. Replication traffic integrity checking


DRBD optionally performs end-to-end message integrity checking using cryptographic message
digest algorithms such as MD5, SHA-1 or CRC-32C.
These message digest algorithms are not provided by DRBD. The Linux kernel crypto API provides
these; DRBD merely uses them. Thus, DRBD is capable of utilizing any message digest algorithm
available in a particular system’s kernel configuration.
With this feature enabled, DRBD generates a message digest of every data block it replicates to the
peer, which the peer then uses to verify the integrity of the replication packet. If the replicated block
can not be verified against the digest, the peer requests retransmission. Thus, DRBD replication is
protected against several error sources, all of which, if unchecked, would potentially lead to data
corruption during the replication process:
• Bitwise errors ("bit flips") occurring on data in transit between main memory and the
network interface on the sending node (which goes undetected by TCP checksumming if it is
offloaded to the network card, as is common in recent implementations);
• bit flips occurring on data in transit from the network interface to main memory on the
receiving node (the same considerations apply for TCP checksum offloading);
• any form of corruption due to race conditions or bugs in network interface firmware or
drivers;
• bit flips or random corruption injected by some reassembling network component between
nodes (if not using direct, back-to-back connections).
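Integrity checking is enabled by naming a digest algorithm in the net section. A minimal sketch,
assuming the kernel provides sha1:
resource <resource> {
  net {
    data-integrity-alg sha1;
  }
  ...
}
Because this adds a digest computation to every replicated block, it is often enabled temporarily to
diagnose suspected transport problems rather than left on permanently.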

2.9. Split brain notification and automatic recovery


Split brain is a situation where, due to temporary failure of all network links between cluster nodes,
and possibly due to intervention by cluster management software or human error, both nodes
switched to the primary role while disconnected. This is a potentially harmful state, as it implies
that modifications to the data might have been made on either node, without having been replicated
to the peer. Thus, it is likely in this situation that two diverging sets of data have been created,
which cannot be trivially merged.
DRBD split brain is distinct from cluster split brain, which is the loss of all connectivity between
hosts managed by a distributed cluster management application such as Heartbeat. To avoid
confusion, this guide uses the following convention:
• Split brain refers to DRBD split brain as described in the paragraph above.
• Loss of all cluster connectivity is referred to as a cluster partition, an alternative term for
cluster split brain.
DRBD allows for automatic operator notification (by email or other means) when it detects split
brain. See Section 6.17.1, “Split brain notification” for details on how to configure this feature.
While the recommended course of action in this scenario is to manually resolve the split brain and
then eliminate its root cause, it may be desirable, in some cases, to automate the process. DRBD has
several resolution algorithms available for doing so:
• Discarding modifications made on the younger primary. In this mode, when the network
connection is re-established and split brain is discovered, DRBD will discard modifications
made, in the meantime, on the node which switched to the primary role last.
• Discarding modifications made on the older primary. In this mode, DRBD will discard
modifications made, in the meantime, on the node which switched to the primary role first.
• Discarding modifications on the primary with fewer changes. In this mode, DRBD will
check which of the two nodes has recorded fewer modifications, and will then discard all
modifications made on that host.
• Graceful recovery from split brain if one host has had no intermediate changes. In this
mode, if one of the hosts has made no modifications at all during split brain, DRBD will
simply recover gracefully and declare the split brain resolved. Note that this is a fairly
unlikely scenario. Even if both hosts only mounted the file system on the DRBD block
device (even read-only), the device contents would be modified, ruling out the possibility of
automatic recovery.
Whether or not automatic split brain recovery is acceptable depends largely on the individual
application. Consider the example of DRBD hosting a database. The discard modifications from
host with fewer changes approach may be fine for a web application click-through database. By
contrast, it may be totally unacceptable to automatically discard any modifications made to a
financial database, requiring manual recovery in any split brain event. Consider your application’s
requirements carefully before enabling automatic split brain recovery.
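The notification handler and the recovery policies described above map onto handlers and net options
in the resource configuration. A minimal sketch, assuming the notify-split-brain.sh helper shipped
with the DRBD utilities and one possible combination of policies:
resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
  net {
    after-sb-0pri discard-zero-changes;   # neither node is primary at reconnect: recover if one side is unchanged
    after-sb-1pri discard-secondary;      # one node is primary: discard the current secondary's modifications
    after-sb-2pri disconnect;             # both nodes are primary: refuse automatic resolution
  }
  ...
}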

2.10. Support for disk flushes


When local block devices such as hard drives or RAID logical disks have write caching enabled,
writes to these devices are considered completed as soon as they have reached the volatile cache.
Controller manufacturers typically refer to this as write-back mode, the opposite being write-
through. If a power outage occurs on a controller in write-back mode, the last writes are never
committed to the disk, potentially causing data loss.
To counteract this, DRBD makes use of disk flushes. A disk flush is a write operation that
completes only when the associated data has been committed to stable (non-volatile) storage — that
is to say, it has effectively been written to disk, rather than to the cache. DRBD uses disk flushes for
write operations both to its replicated data set and to its meta data. In effect, DRBD circumvents the
write cache in situations it deems necessary, as in activity log updates or enforcement of implicit
write-after-write dependencies. This means additional reliability even in the face of power failure.
It is important to understand that DRBD can use disk flushes only when layered on top of backing
devices that support them. Most reasonably recent kernels support disk flushes for most SCSI and
SATA devices. Linux software RAID (md) supports disk flushes for RAID-1 provided that all
component devices support them too. The same is true for device-mapper devices (LVM2, dm-raid,
multipath).
Controllers with battery-backed write cache (BBWC) use a battery to back up their volatile storage.
On such devices, when power is restored after an outage, the controller flushes all pending writes
out to disk from the battery-backed cache, ensuring that all writes committed to the volatile cache
are actually transferred to stable storage. When running DRBD on top of such devices, it may be
acceptable to disable disk flushes, thereby improving DRBD’s write performance.
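When the backing device sits behind a controller with BBWC, flushes can be disabled explicitly. A
minimal sketch, assuming DRBD 8.4 option names; disable these only if you trust the controller's
battery:
resource <resource> {
  disk {
    disk-flushes no;   # skip flushes for the replicated data set
    md-flushes no;     # skip flushes for DRBD's meta data
  }
  ...
}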

2.11. Disk error handling strategies


If a hard drive fails which is used as a backing block device for DRBD on one of the nodes, DRBD
may either pass on the I/O error to the upper layer (usually the file system) or it can mask I/O errors
from upper layers.
Passing on I/O errors. If DRBD is configured to pass on I/O errors, any such errors occurring on
the lower-level device are transparently passed to upper I/O layers. Thus, it is left to upper layers to
deal with such errors (this may result in a file system being remounted read-only, for example). This
strategy does not ensure service continuity, and is hence not recommended for most users.
Masking I/O errors. If DRBD is configured to detach on lower-level I/O error, DRBD will do so,
automatically, upon occurrence of the first lower-level I/O error. The I/O error is masked from
upper layers while DRBD transparently fetches the affected block from the peer node, over the
network. From then onwards, DRBD is said to operate in diskless mode, and carries out all
subsequent I/O operations, read and write, on the peer node. Performance in this mode will be
reduced, but the service continues without interruption, and can be moved to the peer node in a
deliberate fashion at a convenient time.
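The error handling strategy is selected with the on-io-error disk option. A minimal sketch of the
masking strategy, assuming DRBD 8.4 option names:
resource <resource> {
  disk {
    on-io-error detach;   # detach from the failed backing device and continue in diskless mode
  }
  ...
}
The alternative value pass_on implements the first strategy described above.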
2.12. Strategies for dealing with outdated data
DRBD distinguishes between inconsistent and outdated data. Inconsistent data is data that cannot be
expected to be accessible and useful in any manner. The prime example for this is data on a node
that is currently the target of an on-going synchronization. Data on such a node is part obsolete, part
up to date, and impossible to identify as either. Thus, for example, if the device holds a filesystem
(as is commonly the case), that filesystem could not be expected to mount or even to pass an
automatic filesystem check.
Outdated data, by contrast, is data on a secondary node that is consistent, but no longer in sync with
the primary node. This would occur in any interruption of the replication link, whether temporary or
permanent. Data on an outdated, disconnected secondary node is expected to be clean, but it reflects
a state of the peer node some time past. In order to avoid services using outdated data, DRBD
disallows promoting a resource that is in the outdated state.
DRBD has interfaces that allow an external application to outdate a secondary node as soon as a
network interruption occurs. DRBD will then refuse to switch the node to the primary role,
preventing applications from using the outdated data. A complete implementation of this
functionality exists for the Pacemaker cluster management framework (where it uses a
communication channel separate from the DRBD replication link). However, the interfaces are
generic and may be easily used by any other cluster management application.
Whenever an outdated resource has its replication link re-established, its outdated flag is
automatically cleared. A background synchronization then follows.
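The outdating interface is wired up through the fencing disk option and a fence-peer handler. A
minimal sketch for the Pacemaker integration mentioned above, assuming the crm-fence-peer.sh and
crm-unfence-peer.sh scripts shipped with DRBD:
resource <resource> {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  ...
}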

2.13. Three-way replication


Note
Available in DRBD version 8.3.0 and above

When using three-way replication, DRBD adds a third node to an existing 2-node cluster and
replicates data to that node, where it can be used for backup and disaster recovery purposes. This
type of configuration generally involves Section 2.14, “Long-distance replication with DRBD
Proxy”.
Three-way replication works by adding another, stacked DRBD resource on top of the existing
resource holding your production data, as seen in this illustration:
Figure 2.1. DRBD resource stacking

The stacked resource is replicated using asynchronous replication (DRBD protocol A), whereas the
production data would usually make use of synchronous replication (DRBD protocol C).
Three-way replication can be used permanently, where the third node is continuously updated with
data from the production cluster. Alternatively, it may also be employed on demand, where the
production cluster is normally disconnected from the backup site, and site-to-site synchronization is
performed on a regular basis, for example by running a nightly cron job.
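In configuration terms, the third node is described by a stacked resource that references the lower
resource with the stacked-on-top-of keyword. A heavily abbreviated sketch, assuming a lower resource
r0, a backup host named charlie, and placeholder devices and addresses:
resource r0-U {
  net {
    protocol A;
  }
  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   192.168.42.1:7789;
  }
  on charlie {
    device    /dev/drbd10;
    disk      /dev/sdb1;
    address   192.168.42.2:7789;
    meta-disk internal;
  }
}
The full procedure, including how to manage stacked resources with drbdadm --stacked, is covered in
the three-node setup section referenced above.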

2.14. Long-distance replication with DRBD Proxy


Note
DRBD Proxy requires DRBD version 8.2.7 or above.

DRBD’s protocol A is asynchronous, but the writing application will block as soon as the socket
output buffer is full (see the sndbuf-size option in drbd.conf(5)). In that event, the writing
application has to wait until some of the data written runs off through a possibly small bandwidth
network link.
The average write bandwidth is limited by available bandwidth of the network link. Write bursts can
only be handled gracefully if they fit into the limited socket output buffer.
You can mitigate this with DRBD Proxy’s buffering mechanism. DRBD Proxy will place changed
data from the DRBD device on the primary node into its buffers. DRBD Proxy’s buffer size is
freely configurable, limited only by the address space size and the available physical RAM.
Optionally DRBD Proxy can be configured to compress and decompress the data it forwards.
Compression and decompression of DRBD’s data packets might slightly increase latency. However,
when the bandwidth of the network link is the limiting factor, the gain in shortening transmit time
outweighs the compression and decompression overhead.
Compression and decompression were implemented with multi core SMP systems in mind, and can
utilize multiple CPU cores.
Since most block I/O data compresses very well and the effective bandwidth therefore increases
accordingly, the use of DRBD Proxy is justified even with DRBD protocols B and C.

Note
DRBD Proxy is the only part of the DRBD product family that is not
published under an open source license. Please contact [email protected]
or [email protected] for an evaluation license.
2.15. Truck based replication
Truck based replication, also known as disk shipping, is a means of preseeding a remote site with
data to be replicated, by physically shipping storage media to the remote site. This is particularly
suited for situations where
• the total amount of data to be replicated is fairly large (more than a few hundred
gigabytes);
• the expected rate of change of the data to be replicated is less than enormous;
• the available network bandwidth between sites is limited.
In such situations, without truck based replication, DRBD would require a very long initial device
synchronization (on the order of days or weeks). Truck based replication allows us to ship a data
seed to the remote site, and drastically reduce the initial synchronization time.

2.16. Floating peers


Note
This feature is available in DRBD versions 8.3.2 and above.

A somewhat special use case for DRBD is the floating peers configuration. In floating peer setups,
DRBD peers are not tied to specific named hosts (as in conventional configurations), but instead
have the ability to float between several hosts. In such a configuration, DRBD identifies peers by IP
address, rather than by host name.
Chapter 5. Configuring DRBD

5.1. Preparing your lower-level storage


After you have installed DRBD, you must set aside a roughly identically sized storage area on both
cluster nodes. This will become the lower-level device for your DRBD resource. You may use any
type of block device found on your system for this purpose. Typical examples include:
• A hard drive partition (or a full physical hard drive),
• a software RAID device,
• an LVM Logical Volume or any other block device configured by the Linux device-mapper
infrastructure,
• any other block device type found on your system.
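For instance, if you opt for an LVM Logical Volume rather than the plain partition assumed later in
this guide, a minimal sketch of preparing one on each node might look like this (the physical volume,
the volume group name vg0, the volume name r0, and the size are placeholders):
# pvcreate /dev/sdb1
# vgcreate vg0 /dev/sdb1
# lvcreate --name r0 --size 10G vg0
The resulting device, /dev/vg0/r0, would then be referenced as the disk in the resource
configuration.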
You may also use resource stacking, meaning you can use one DRBD device as a lower-level
device for another. Some specific considerations apply to stacked resources; their configuration is
covered in detail in Section 6.18, “Creating a three-node setup”.
Note
While it is possible to use loop devices as lower-level devices for DRBD,
doing so is not recommended due to deadlock issues.

It is not necessary for this storage area to be empty before you create a DRBD resource from it. In
fact it is a common use case to create a two-node cluster from a previously non-redundant single-
server system using DRBD (some caveats apply — please refer to Section 17.1, “DRBD meta data”
if you are planning to do this).
For the purposes of this guide, we assume a very simple setup:
• Both hosts have a free (currently unused) partition named /dev/sda7.
• We are using internal meta data.

5.2. Preparing your network configuration


It is recommended, though not strictly required, that you run your DRBD replication over a
dedicated connection. At the time of this writing, the most reasonable choice for this is a direct,
back-to-back, Gigabit Ethernet connection. When DRBD is run over switches, use of redundant
components and the bonding driver (in active-backup mode) is recommended.
It is generally not recommended to run DRBD replication via routers, for reasons of fairly obvious
performance drawbacks (adversely affecting both throughput and latency).
In terms of local firewall considerations, it is important to understand that DRBD (by convention)
uses TCP ports from 7788 upwards, with every resource listening on a separate port. DRBD uses
two TCP connections for every resource configured. For proper DRBD functionality, it is required
that these connections are allowed by your firewall configuration.
Security considerations other than firewalling may also apply if a Mandatory Access Control
(MAC) scheme such as SELinux or AppArmor is enabled. You may have to adjust your local
security policy so it does not keep DRBD from functioning properly.
You must, of course, also ensure that the TCP ports for DRBD are not already used by another
application.
It is not possible to configure a DRBD resource to support more than one TCP connection. If you
want to provide for DRBD connection load-balancing or redundancy, you can easily do so at the
Ethernet level (again, using the bonding driver).
For the purposes of this guide, we assume a very simple setup:
• Our two DRBD hosts each have a currently unused network interface, eth1, with IP
addresses 10.1.1.31 and 10.1.1.32 assigned to it, respectively.
• No other services are using TCP ports 7788 through 7799 on either host.
• The local firewall configuration allows both inbound and outbound TCP connections
between the hosts over these ports.
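On a system using iptables, a rule permitting the DRBD ports from the peer could look like the
following sketch (interface and peer address taken from the assumptions above; adjust both, and add
the mirror-image rule on the other host):
# iptables -A INPUT -i eth1 -p tcp -s 10.1.1.32 --dport 7788:7799 -j ACCEPT
Equivalent rules can of course be expressed with firewalld, nftables, or whatever firewall
management tool your distribution provides.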

5.3. Configuring your resource


All aspects of DRBD are controlled in its configuration file, /etc/drbd.conf. Normally, this
configuration file is just a skeleton with the following contents:
include "/etc/drbd.d/global_common.conf";
include "/etc/drbd.d/*.res";

By convention, /etc/drbd.d/global_common.conf contains the global and common


sections of the DRBD configuration, whereas the .res files contain one resource section each.
It is also possible to use drbd.conf as a flat configuration file without any include statements
at all. Such a configuration, however, quickly becomes cluttered and hard to manage, which is why
the multiple-file approach is the preferred one.
Regardless of which approach you employ, you should always make sure that drbd.conf, and
any other files it includes, are exactly identical on all participating cluster nodes.
The DRBD source tarball contains an example configuration file in the scripts subdirectory.
Binary installation packages will either install this example configuration directly in /etc, or in a
package-specific documentation directory such as /usr/share/doc/packages/drbd.
This section describes only those few aspects of the configuration file which are absolutely
necessary to understand in order to get DRBD up and running. The configuration file’s syntax and
contents are documented in great detail in drbd.conf(5).

5.3.1. Example configuration


For the purposes of this guide, we assume a minimal setup in line with the examples given in the
previous sections:
Simple DRBD configuration (/etc/drbd.d/global_common.conf).
global {
  usage-count yes;
}
common {
  net {
    protocol C;
  }
}

Simple DRBD resource configuration (/etc/drbd.d/r0.res).


resource r0 {
  on alice {
    device    /dev/drbd1;
    disk      /dev/sda7;
    address   10.1.1.31:7789;
    meta-disk internal;
  }
  on bob {
    device    /dev/drbd1;
    disk      /dev/sda7;
    address   10.1.1.32:7789;
    meta-disk internal;
  }
}

This example configures DRBD in the following fashion:


• You "opt in" to be included in DRBD’s usage statistics (see usage-count).
• Resources are configured to use fully synchronous replication (Protocol C) unless explicitly
specified otherwise.
• Our cluster consists of two nodes, alice and bob.
• We have a resource arbitrarily named r0 which uses /dev/sda7 as the lower-level device,
and is configured with internal meta data.
• The resource uses TCP port 7789 for its network connections, and binds to the IP addresses
10.1.1.31 and 10.1.1.32, respectively.
The configuration above implicitly creates one volume in the resource, numbered zero (0). For
multiple volumes in one resource, modify the syntax as follows:
Multi-volume DRBD resource configuration (/etc/drbd.d/r0.res).
resource r0 {
  volume 0 {
    device    /dev/drbd1;
    disk      /dev/sda7;
    meta-disk internal;
  }
  volume 1 {
    device    /dev/drbd2;
    disk      /dev/sda8;
    meta-disk internal;
  }
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}

Note
Volumes may also be added to existing resources on the fly. For an
example see Section 10.5, “Adding a new DRBD volume to an existing
Volume Group”.

5.3.2. The global section


This section is allowed only once in the configuration. It is normally in the
/etc/drbd.d/global_common.conf file. In a single-file configuration, it should go to the
very top of the configuration file. Of the few options available in this section, only one is of
relevance to most users:
usage-count. The DRBD project keeps statistics about the usage of various DRBD versions.
This is done by contacting an HTTP server every time a new DRBD version is installed on a
system. This can be disabled by setting usage-count no;. The default is usage-count
ask; which will prompt you every time you upgrade DRBD.
DRBD’s usage statistics are, of course, publicly available: see http://usage.drbd.org.

5.3.3. The common section


This section provides a shorthand method to define configuration settings inherited by every
resource. It is normally found in /etc/drbd.d/global_common.conf. You may define any
option you can also define on a per-resource basis.
Including a common section is not strictly required, but strongly recommended if you are using
more than one resource. Otherwise, the configuration quickly becomes convoluted by repeatedly-
used options.
In the example above, we included net { protocol C; } in the common section, so every
resource configured (including r0) inherits this option unless it has another protocol option
configured explicitly. For other synchronization protocols available, see Section 2.3, “Replication
modes”.

5.3.4. The resource sections


A per-resource configuration file is usually named /etc/drbd.d/<resource>.res. Any
DRBD resource you define must be named by specifying resource name in the configuration. You
may use any arbitrary identifier, however the name must not contain characters other than those
found in the US-ASCII character set, and must also not include whitespace.
Every resource configuration must also have two on <host> sub-sections (one for every cluster
node). All other configuration settings are either inherited from the common section (if it exists), or
derived from DRBD’s default settings.
In addition, options with equal values on both hosts can be specified directly in the resource
section. Thus, we can further condense our example configuration as follows:
resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda7;
  meta-disk internal;
  on alice {
    address   10.1.1.31:7789;
  }
  on bob {
    address   10.1.1.32:7789;
  }
}
5.4. Enabling your resource for the first time
After you have completed initial resource configuration as outlined in the previous sections, you
can bring up your resource.
Each of the following steps must be completed on both nodes.
Please note that with our example config snippets (resource r0 { … }), <resource>
would be r0.
Create device metadata. This step must be completed only on initial device creation. It initializes
DRBD’s metadata:
# drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block successfully created.

Enable the resource. This step associates the resource with its backing device (or devices, in case
of a multi-volume resource), sets replication parameters, and connects the resource to its peer:
# drbdadm up <resource>

Observe /proc/drbd. DRBD’s virtual status file in the /proc filesystem, /proc/drbd,
should now contain information similar to the following:
# cat /proc/drbd
version: 8.4.1 (api:1/proto:86-100)
GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by buildsystem@linbit,
2011-12-20 12:58:48
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:524236

Note
The Inconsistent/Inconsistent disk state is expected at this
point.

By now, DRBD has successfully allocated both disk and network resources and is ready for
operation. What it does not know yet is which of your nodes should be used as the source of the
initial device synchronization.

5.5. The initial device synchronization


There are two more steps required for DRBD to become fully operational:
Select an initial sync source. If you are dealing with a newly-initialized, empty disk, this choice is
entirely arbitrary. If one of your nodes already has valuable data that you need to preserve, however,
it is of crucial importance that you select that node as your synchronization source. If you do initial
device synchronization in the wrong direction, you will lose that data. Exercise caution.
Start the initial full synchronization. This step must be performed on only one node, only on
initial resource configuration, and only on the node you selected as the synchronization source. To
perform this step, issue this command:
# drbdadm primary --force <resource>
After issuing this command, the initial full synchronization will commence. You will be able to
monitor its progress via /proc/drbd. It may take some time depending on the size of the device.
By now, your DRBD device is fully operational, even before the initial synchronization has
completed (albeit with slightly reduced performance). You may now create a filesystem on the
device, use it as a raw block device, mount it, and perform any other operation you would with an
accessible block device.
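For example, to put an ext4 filesystem on the device from the sample configuration and mount it on
the node currently in the primary role (the filesystem type and mount point are arbitrary choices):
# mkfs.ext4 /dev/drbd1
# mkdir -p /mnt/drbd-data
# mount /dev/drbd1 /mnt/drbd-data
This can be done while the initial synchronization is still running in the background.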

5.6. Using truck based replication


In order to preseed a remote node with data which is then to be kept synchronized, and to skip the
initial device synchronization, follow these steps.
This assumes that your local node has a configured, but disconnected DRBD resource in the
Primary role. That is to say, device configuration is completed, identical drbd.conf copies exist
on both nodes, and you have issued the commands for initial resource promotion on your local
node — but the remote node is not connected yet.
• On the local node, issue the following command:
# drbdadm new-current-uuid --clear-bitmap <resource>

• Create a consistent, verbatim copy of the resource’s data and its metadata. You may do so,
for example, by removing a hot-swappable drive from a RAID-1 mirror. You would, of
course, replace it with a fresh drive, and rebuild the RAID set, to ensure continued
redundancy. But the removed drive is a verbatim copy that can now be shipped off site. If
your local block device supports snapshot copies (such as when using DRBD on top of
LVM), you may also create a bitwise copy of that snapshot using dd.
• On the local node, issue:
# drbdadm new-current-uuid <resource>

Note the absence of the --clear-bitmap option in this second invocation.


• Physically transport the copies to the remote peer location.
• Add the copies to the remote node. This may again be a matter of plugging a physical disk,
or grafting a bitwise copy of your shipped data onto existing storage on the remote node. Be
sure to restore or copy not only your replicated data, but also the associated DRBD
metadata. If you fail to do so, the disk shipping process is moot.
• Bring up the resource on the remote node:
# drbdadm up <resource>

After the two peers connect, they will not initiate a full device synchronization. Instead, the
automatic synchronization that now commences only covers those blocks that changed since the
invocation of drbdadm new-current-uuid --clear-bitmap.
Even if there were no changes whatsoever since then, there may still be a brief synchronization
period due to areas covered by the Activity Log being rolled back on the new Secondary. This may
be mitigated by the use of checksum-based synchronization.
You may use this same procedure regardless of whether the resource is a regular DRBD resource, or
a stacked resource. For stacked resources, simply add the -S or --stacked option to drbdadm.
Part III. Working with DRBD

Chapter 6. Common administrative tasks

6.1. Checking DRBD status


6.1.1. Retrieving status with drbd-overview
The most convenient way to look at DRBD’s status is the drbd-overview utility.
# drbd-overview
0:home Connected Primary/Secondary
UpToDate/UpToDate C r--- /home xfs 200G 158G 43G 79%
1:data Connected Primary/Secondary
UpToDate/UpToDate C r--- /mnt/ha1 ext3 9.9G 618M 8.8G 7%
2:nfs-root Connected Primary/Secondary
UpToDate/UpToDate C r--- /mnt/netboot ext3 79G 57G 19G 76%

6.1.2. Status information in /proc/drbd


/proc/drbd is a virtual file displaying real-time status information about all DRBD resources
currently configured. You may interrogate this file’s contents using this command:
$ cat /proc/drbd
version: 8.4.0 (api:1/proto:86-100)
GIT-hash: 09b6d528b3b3de50462cd7831c0a3791abc665c3 build by
[email protected], 2011-10-12 09:07:35
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:656 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

The first line, prefixed with version:, shows the DRBD version used on your system. The
second line contains information about this specific build.
The lines that follow form a block that is repeated for every DRBD device
configured, prefixed by the device minor number. In this example, the minor numbers 0, 1, and 2
correspond to the devices /dev/drbd0, /dev/drbd1, and /dev/drbd2.
The resource-specific output from /proc/drbd contains various pieces of information about the
resource:
cs (connection state). Status of the network connection.
ro (roles). Roles of the nodes. The role of the local node is displayed first, followed by the role of
the partner node shown after the slash.
ds (disk states). State of the hard disks. Prior to the slash the state of the local node is displayed,
after the slash the state of the hard disk of the partner node is shown.
Replication protocol. Replication protocol used by the resource. Either A, B or C.
I/O Flags. Six state flags reflecting the I/O status of this resource.
Performance indicators. A number of counters and gauges reflecting the resource’s utilization and
performance.

6.1.3. Connection states


A resource’s connection state can be observed either by monitoring /proc/drbd, or by issuing
the drbdadm cstate command:
# drbdadm cstate <resource>
Connected

A resource may have one of the following connection states:


StandAlone. No network configuration available. The resource has not yet been connected, or
has been administratively disconnected (using drbdadm disconnect), or has dropped its
connection due to failed authentication or split brain.
Disconnecting. Temporary state during disconnection. The next state is StandAlone.
Unconnected. Temporary state, prior to a connection attempt. Possible next states:
WFConnection and WFReportParams.
Timeout. Temporary state following a timeout in the communication with the peer. Next state:
Unconnected.
BrokenPipe. Temporary state after the connection to the peer was lost. Next state:
Unconnected.
NetworkFailure. Temporary state after the connection to the partner was lost. Next state:
Unconnected.
ProtocolError. Temporary state after the connection to the partner was lost. Next state:
Unconnected.
TearDown. Temporary state. The peer is closing the connection. Next state: Unconnected.
WFConnection. This node is waiting until the peer node becomes visible on the network.
WFReportParams. TCP connection has been established, this node waits for the first network
packet from the peer.
Connected. A DRBD connection has been established, data mirroring is now active. This is the
normal state.
StartingSyncS. Full synchronization, initiated by the administrator, is just starting. The next
possible states are: SyncSource or PausedSyncS.
StartingSyncT. Full synchronization, initiated by the administrator, is just starting. Next state:
WFSyncUUID.
WFBitMapS. Partial synchronization is just starting. Next possible states: SyncSource or
PausedSyncS.
WFBitMapT. Partial synchronization is just starting. Next possible state: WFSyncUUID.
WFSyncUUID. Synchronization is about to begin. Next possible states: SyncTarget or
PausedSyncT.
SyncSource. Synchronization is currently running, with the local node being the source of
synchronization.
SyncTarget. Synchronization is currently running, with the local node being the target of
synchronization.
PausedSyncS. The local node is the source of an ongoing synchronization, but synchronization is
currently paused. This may be due to a dependency on the completion of another synchronization
process, or due to synchronization having been manually interrupted by drbdadm pause-sync.
PausedSyncT. The local node is the target of an ongoing synchronization, but synchronization is
currently paused. This may be due to a dependency on the completion of another synchronization
process, or due to synchronization having been manually interrupted by drbdadm pause-sync.
VerifyS. On-line device verification is currently running, with the local node being the source of
verification.
VerifyT. On-line device verification is currently running, with the local node being the target of
verification.

6.1.4. Resource roles


A resource’s role can be observed either by monitoring /proc/drbd, or by issuing the drbdadm
role command:
# drbdadm role <resource>
Primary/Secondary

The local resource role is always displayed first, the remote resource role last.
You may see one of the following resource roles:
Primary. The resource is currently in the primary role, and may be read from and written to. This
role only occurs on one of the two nodes, unless dual-primary mode is enabled.
Secondary. The resource is currently in the secondary role. It normally receives updates from its
peer (unless running in disconnected mode), but may neither be read from nor written to. This role
may occur on one or both nodes.
Unknown. The resource’s role is currently unknown. The local resource role never has this status. It
is only displayed for the peer’s resource role, and only in disconnected mode.

6.1.5. Disk states


A resource’s disk state can be observed either by monitoring /proc/drbd, or by issuing the
drbdadm dstate command:
# drbdadm dstate <resource>
UpToDate/UpToDate

The local disk state is always displayed first, the remote disk state last.
Both the local and the remote disk state may be one of the following:
Diskless. No local block device has been assigned to the DRBD driver. This may mean that the
resource has never attached to its backing device, that it has been manually detached using
drbdadm detach, or that it automatically detached after a lower-level I/O error.
Attaching. Transient state while reading meta data.
Failed. Transient state following an I/O failure report by the local block device. Next state:
Diskless.
Negotiating. Transient state when an Attach is carried out on an already-Connected
DRBD device.
Inconsistent. The data is inconsistent. This status occurs immediately upon creation of a new
resource, on both nodes (before the initial full sync). Also, this status is found in one node (the
synchronization target) during synchronization.
Outdated. Resource data is consistent, but outdated.
Dunknown. This state is used for the peer disk if no network connection is available.
Consistent. Consistent data of a node without connection. When the connection is established,
it is decided whether the data is UpToDate or Outdated.
UpToDate. Consistent, up-to-date state of the data. This is the normal state.

6.1.6. I/O state flags


The I/O state flag field in /proc/drbd contains information about the current state of I/O
operations associated with the resource. There are six such flags in total, with the following possible
values:
1. I/O suspension. Either r for running or s for suspended I/O. Normally r.
2. Serial resynchronization. When a resource is awaiting resynchronization, but has deferred
this because of a resync-after dependency, this flag becomes a. Normally -.
3. Peer-initiated sync suspension. When resource is awaiting resynchronization, but the peer
node has suspended it for any reason, this flag becomes p. Normally -.
4. Locally initiated sync suspension. When resource is awaiting resynchronization, but a user
on the local node has suspended it, this flag becomes u. Normally -.
5. Locally blocked I/O. Normally -. May be one of the following flags:
• d: I/O blocked for a reason internal to DRBD, such as a transient disk state.
• b: Backing device I/O is blocking.
• n: Congestion on the network socket.
• a: Simultaneous combination of blocking device I/O and network congestion.
6. Activity Log update suspension. When updates to the Activity Log are suspended, this flag
becomes s. Normally -.

6.1.7. Performance indicators


The second line of /proc/drbd information for each resource contains the following counters
and gauges:
ns (network send). Volume of net data sent to the partner via the network connection; in KiB.
nr (network receive). Volume of net data received by the partner via the network connection; in
KiB.
dw (disk write). Net data written on local hard disk; in KiB.
dr (disk read). Net data read from local hard disk; in KiB.
al (activity log). Number of updates of the activity log area of the meta data.
bm (bit map). Number of updates of the bitmap area of the meta data.
lo (local count). Number of open requests to the local I/O sub-system issued by DRBD.
pe (pending). Number of requests sent to the partner, but that have not yet been answered by the
latter.
ua (unacknowledged). Number of requests received by the partner via the network connection, but
that have not yet been answered.
ap (application pending). Number of block I/O requests forwarded to DRBD, but not yet
answered by DRBD.
ep (epochs). Number of epoch objects. Usually 1. Might increase under I/O load when using either
the barrier or the none write ordering method.
wo (write order). Currently used write ordering method: b(barrier), f(flush), d(drain) or n(none).
oos (out of sync). Amount of storage currently out of sync; in KiB.

6.2. Enabling and disabling resources


6.2.1. Enabling resources
Normally, all configured DRBD resources are automatically enabled
• by a cluster resource management application at its discretion, based on your cluster
configuration, or
• by the /etc/init.d/drbd init script on system startup.
If, however, you need to enable resources manually for any reason, you may do so by issuing the
command
# drbdadm up <resource>

As always, you may use the keyword all instead of a specific resource name if you want to enable
all resources configured in /etc/drbd.conf at once.

6.2.2. Disabling resources


You may temporarily disable specific resources by issuing the command
# drbdadm down <resource>

Here, too, you may use the keyword all in place of a resource name if you wish to temporarily
disable all resources listed in /etc/drbd.conf at once.
6.3. Reconfiguring resources
DRBD allows you to reconfigure resources while they are operational. To that end,
• make any necessary changes to the resource configuration in /etc/drbd.conf,
• synchronize your /etc/drbd.conf file between both nodes,
• issue the drbdadm adjust <resource> command on both nodes.
drbdadm adjust then hands off to drbdsetup to make the necessary adjustments to the
configuration. As always, you are able to review the pending drbdsetup invocations by running
drbdadm with the -d (dry-run) option.
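A minimal sketch of this workflow, assuming a hypothetical resource named r0 defined in
/etc/drbd.d/r0.res and a hypothetical peer host named bob:
# vi /etc/drbd.d/r0.res                    # edit the resource configuration
# scp /etc/drbd.d/r0.res bob:/etc/drbd.d/  # synchronize it to the peer
# drbdadm -d adjust r0                     # review the pending drbdsetup calls (dry run)
# drbdadm adjust r0                        # apply; repeat this step on the peer as well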
Note
When making changes to the common section in /etc/drbd.conf,
you can adjust the configuration for all resources in one run, by issuing
drbdadm adjust all.

6.4. Promoting and demoting resources


Manually switching a resource’s role from secondary to primary (promotion) or vice versa
(demotion) is done using the following commands:
# drbdadm primary <resource>
# drbdadm secondary <resource>

In single-primary mode (DRBD’s default), any resource can be in the primary role on only one node
at any given time while the connection state is Connected. Thus, issuing drbdadm primary
<resource> on one node while <resource> is still in the primary role on the peer will result in
an error.
A resource configured to allow dual-primary mode can be switched to the primary role on both
nodes.
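You can check the current roles at any time with the drbdadm role command, which prints the
local role followed by the peer role (output shown is illustrative):
# drbdadm role <resource>
Primary/Secondary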

6.5. Basic Manual Fail-over


If you are not using Pacemaker and want to handle fail-overs manually in a passive/active
configuration, the process is as follows.
On the current primary node stop any applications or services using the DRBD device, unmount the
DRBD device, and demote the resource to secondary.
# umount /dev/drbd/by-res/<resource>
# drbdadm secondary <resource>

Now, on the node we wish to make primary, promote the resource and mount the device.
# drbdadm primary <resource>
# mount /dev/drbd/by-res/<resource> <mountpoint>
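At any point during this procedure you can confirm that both disks are up to date and the resource is
connected, for example with the following commands (output shown is illustrative):
# drbdadm dstate <resource>
UpToDate/UpToDate
# drbdadm cstate <resource>
Connected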
6.8. Enabling dual-primary mode
Dual-primary mode allows a resource to assume the primary role simultaneously on both nodes.
Doing so is possible on either a permanent or a temporary basis.
Note
Dual-primary mode requires that the resource is configured to replicate
synchronously (protocol C). Because of this, it is latency-sensitive and
ill-suited for WAN environments.

Additionally, as the resource is always primary on both nodes, any interruption
in the network between the nodes will result in a split brain.

6.8.1. Permanent dual-primary mode


To enable dual-primary mode, set the allow-two-primaries option to yes in the net section
of your resource configuration:
resource <resource> {
  net {
    protocol C;
    allow-two-primaries yes;
  }
  ...
}

After that, do not forget to synchronize the configuration between nodes. Run drbdadm adjust
<resource> on both nodes.
You can now change both nodes to role primary at the same time with drbdadm primary
<resource>.

6.8.2. Temporary dual-primary mode


To temporarily enable dual-primary mode for a resource normally running in a single-primary
configuration, issue the following command:
# drbdadm net-options --protocol=C --allow-two-primaries <resource>

To end temporary dual-primary mode, run the same command as above but with --allow-two-primaries=no (and your desired replication protocol, if applicable).
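For example, to return the resource to single-primary operation while keeping protocol C:
# drbdadm net-options --protocol=C --allow-two-primaries=no <resource>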

6.8.3. Automating promotion on system startup


When a resource is configured to support dual-primary mode, it may also be desirable to
automatically switch the resource into the primary role upon system (or DRBD) startup.
resource <resource> {
  startup {
    become-primary-on both;
  }
  ...
}

The /etc/init.d/drbd system init script parses this option on startup and promotes resources
accordingly.
Note
The become-primary-on approach is not required, nor
recommended, in Pacemaker-managed DRBD configurations. In
Pacemaker configuration, resource promotion and demotion should
always be handled by the cluster manager.

6.9. Using on-line device verification


6.9.1. Enabling on-line verification
On-line device verification is not enabled for resources by default. To enable it, add the following
lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
  net {
    verify-alg <algorithm>;
  }
  ...
}

<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.

6.9.2. Invoking on-line verification


After you have enabled on-line verification, you will be able to initiate a verification run using the
following command:
# drbdadm verify <resource>

When you do so, DRBD starts an online verification run for <resource>, and if it detects any
blocks not in sync, will mark those blocks as such and write a message to the kernel log. Any
applications using the device at that time can continue to do so unimpeded, and you may also switch
resource roles at will.
If out-of-sync blocks were detected during the verification run, you may resynchronize them using
the following commands after verification has completed:
# drbdadm disconnect <resource>
# drbdadm connect <resource>
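To review the outcome of a verification run, you may search the kernel log and check the oos
counter in /proc/drbd, for example:
# dmesg | grep -i 'out of sync'
# grep 'oos:' /proc/drbd

A non-zero oos value indicates that out-of-sync blocks were found and that the disconnect/connect
cycle shown above is warranted.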

6.9.3. Automating on-line verification


Most users will want to automate on-line device verification. This can be easily accomplished.
Create a file with the following contents, named /etc/cron.d/drbd-verify on one of your
nodes:
42 0 * * 0 root /sbin/drbdadm verify <resource>
This will have cron invoke a device verification every Sunday at 42 minutes past midnight.
If you have enabled on-line verification for all your resources (for example, by adding verify-
alg <algorithm> to the common section in /etc/drbd.conf), you may also use:
42 0 * * 0 root /sbin/drbdadm verify all

6.10. Configuring the rate of synchronization


Normally, one tries to ensure that background synchronization (which makes the data on the
synchronization target temporarily inconsistent) completes as quickly as possible. However, it is
also necessary to keep background synchronization from hogging all bandwidth otherwise available
for foreground replication, which would be detrimental to application performance. Thus, you must
configure the synchronization bandwidth to match your hardware — which you may do in a
permanent fashion or on-the-fly.
Important
It does not make sense to set a synchronization rate that is higher than the
maximum write throughput on your secondary node. You must not expect
your secondary node to miraculously be able to write faster than its I/O
subsystem allows, just because it happens to be the target of an ongoing
device synchronization.

Likewise, and for the same reasons, it does not make sense to set a synchronization rate that is
higher than the bandwidth available on the replication network.

6.10.1. Permanent fixed sync rate configuration


The maximum bandwidth a resource uses for background re-synchronization is determined by the
resync-rate option for that resource. It must be included in the resource configuration's disk
section in /etc/drbd.conf:
resource <resource> {
  disk {
    resync-rate 40M;
    ...
  }
  ...
}
Note that the resync-rate setting is given in bytes, not bits per second; the default unit is
Kibibyte, so a value of 4096 would be interpreted as 4 MiB.
Tip
A good rule of thumb for this value is to use about 30% of the available
replication bandwidth. Thus, if you had an I/O subsystem capable of
sustaining write throughput of 180MB/s, and a Gigabit Ethernet network
capable of sustaining 110 MB/s network throughput (the network being
the bottleneck), you would calculate:

Figure 6.1. Syncer rate example, 110MB/s effective available bandwidth

110 MB/s × 0.3 = 33 MB/s

Thus, the recommended value for the resync-rate option would be 33M.
By contrast, if you had an I/O subsystem with a maximum throughput of 80MB/s and a Gigabit
Ethernet connection (the I/O subsystem being the bottleneck), you would calculate:
Figure 6.2. Syncer rate example, 80MB/s effective available bandwidth

80 MB/s × 0.3 = 24 MB/s

In this case, the recommended value for the resync-rate option would be 24M.

6.10.2. Temporary fixed sync rate configuration


It is sometimes desirable to temporarily adjust the sync rate. For example, you might want to speed
up background re-synchronization after having performed scheduled maintenance on one of your
cluster nodes. Or, you might want to throttle background re-synchronization if it happens to occur at
a time when your application is extremely busy with write operations, and you want to make sure
that a large portion of the existing bandwidth is available to replication.
For example, in order to make most bandwidth of a Gigabit Ethernet link available to re-
synchronization, issue the following command:
# drbdadm disk-options --resync-rate=110M <resource>

You need to issue this command on only one of the nodes.


To revert this temporary setting and re-enable the synchronization rate set in /etc/drbd.conf,
issue this command:
# drbdadm adjust <resource>

6.10.3. Variable sync rate configuration


Specifically in configurations where multiple DRBD resources share a single
replication/synchronization network, fixed-rate synchronization may not be an optimal approach. In
this case, you should configure variable-rate synchronization. In this mode, DRBD uses an
automated control loop algorithm to determine, and continuously adjust, the synchronization rate.
This algorithm ensures that there is always sufficient bandwidth available for foreground
replication, greatly mitigating the impact that background synchronization has on foreground I/O.
The optimal configuration for variable-rate synchronization may vary greatly depending on the
available network bandwidth, application I/O pattern and link congestion. Ideal configuration
settings also depend on whether DRBD Proxy is in use or not. It may be wise to engage
professional consultancy in order to optimally configure this DRBD feature. An example
configuration (which assumes a deployment in conjunction with DRBD Proxy) is provided below:
resource <resource> {
  disk {
    c-plan-ahead 200;
    c-max-rate 10M;
    c-fill-target 15M;
  }
}

Tip
A good starting value for c-fill-target is BDP×3, where BDP is
your bandwidth delay product on the replication link.
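For example (figures here are purely illustrative), on a replication link sustaining 100 Mbit/s
(12.5 MB/s) with a 200 ms round-trip time, the BDP is roughly 12.5 MB/s × 0.2 s = 2.5 MB,
suggesting a c-fill-target in the neighborhood of 7.5M.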

6.11. Configuring checksum-based synchronization


Checksum-based synchronization is not enabled for resources by default. To enable it, add the
following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
  net {
    csums-alg <algorithm>;
  }
  ...
}

<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.
6.12. Configuring congestion policies and suspended
replication
In an environment where the replication bandwidth is highly variable (as would be typical in WAN
replication setups), the replication link may occasionally become congested. In a default
configuration, this would cause I/O on the primary node to block, which is sometimes undesirable.
Instead, you may configure DRBD to suspend the ongoing replication in this case, causing the
Primary’s data set to pull ahead of the Secondary. In this mode, DRBD keeps the replication
channel open — it never switches to disconnected mode — but does not actually replicate until
sufficient bandwidth becomes available again.
The following example is for a DRBD Proxy configuration:
resource <resource> {
  net {
    on-congestion pull-ahead;
    congestion-fill 2G;
    congestion-extents 2000;
    ...
  }
  ...
}

It is usually wise to set both congestion-fill and congestion-extents together with
the pull-ahead option.
A good value for congestion-fill is 90%
• of the allocated DRBD proxy buffer memory, when replicating over DRBD Proxy, or
• of the TCP network send buffer, in non-DRBD Proxy setups.
A good value for congestion-extents is 90% of your configured al-extents for the
affected resources.
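As an illustration only (the buffer size and extent count below are hypothetical and must be derived
from your own settings), a non-Proxy configuration with a 10M TCP send buffer and al-extents
3389 might look like this:
resource <resource> {
  disk {
    al-extents 3389;
  }
  net {
    sndbuf-size 10M;
    on-congestion pull-ahead;
    congestion-fill 9M;      # roughly 90% of the 10M send buffer
    congestion-extents 3000; # roughly 90% of al-extents
    ...
  }
  ...
}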
6.13. Configuring I/O error handling strategies
DRBD’s strategy for handling lower-level I/O errors is determined by the on-io-error option,
included in the resource disk configuration in /etc/drbd.conf:
resource <resource> {
  disk {
    on-io-error <strategy>;
    ...
  }
  ...
}

You may, of course, set this in the common section too, if you want to define a global I/O error
handling policy for all resources.
<strategy> may be one of the following options:
1. detach. This is the default and recommended option. On the occurrence of a lower-level
I/O error, the node drops its backing device, and continues in diskless mode.
2. pass_on. This causes DRBD to report the I/O error to the upper layers. On the primary
node, it is reported to the mounted file system. On the secondary node, it is ignored (because
the secondary has no upper layer to report to).
3. call-local-io-error. Invokes the command defined as the local I/O error handler. This
requires that a corresponding local-io-error command invocation is defined in the
resource’s handlers section. It is entirely left to the administrator’s discretion to implement
I/O error handling using the command (or script) invoked by local-io-error.
Note
Early DRBD versions (prior to 8.0) included another option, panic,
which would forcibly remove the node from the cluster by way of a kernel
panic, whenever a local I/O error occurred. While that option is no longer
available, the same behavior may be mimicked via the local-io-error / call-local-io-error
interface. You should do so only if you fully
understand the implications of such behavior.
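A minimal sketch of such a configuration is shown below. The sysrq action used here is merely one
possible choice of handler and will crash the node immediately; only use something this drastic if
you fully understand the consequences:
resource <resource> {
  disk {
    on-io-error call-local-io-error;
    ...
  }
  handlers {
    # force an immediate kernel crash, mimicking the old panic behavior
    local-io-error "echo c > /proc/sysrq-trigger";
    ...
  }
  ...
}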

You may reconfigure a running resource’s I/O error handling strategy by following this process:
• Edit the resource configuration in /etc/drbd.d/<resource>.res.
• Copy the configuration to the peer node.
• Issue drbdadm adjust <resource> on both nodes.
6.14. Configuring replication traffic integrity checking
Replication traffic integrity checking is not enabled for resources by default. To enable it, add the
following lines to your resource configuration in /etc/drbd.conf:
resource <resource> {
  net {
    data-integrity-alg <algorithm>;
  }
  ...
}

<algorithm> may be any message digest algorithm supported by the kernel crypto API in your
system’s kernel configuration. Normally, you should be able to choose at least from sha1, md5,
and crc32c.
If you make this change to an existing resource, as always, synchronize your drbd.conf to the
peer, and run drbdadm adjust <resource> on both nodes.
6.15. Resizing resources
6.15.1. Growing on-line
If the backing block devices can be grown while in operation (online), it is also possible to increase
the size of a DRBD device based on these devices during operation. To do so, two criteria must be
fulfilled:
1. The affected resource’s backing device must be one managed by a logical volume
management subsystem, such as LVM.
2. The resource must currently be in the Connected connection state.
Having grown the backing block devices on both nodes, ensure that only one node is in primary
state. Then enter on one node:
# drbdadm resize <resource>

This triggers a synchronization of the new section. The synchronization is done from the primary
node to the secondary node.
If the space you’re adding is clean, you can skip syncing the additional space by using the --assume-clean option.
# drbdadm -- --assume-clean resize <resource>
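As a sketch of the complete procedure, assume a hypothetical resource r0 backed on both nodes by
the LVM logical volume /dev/vg0/r0 and carrying an ext4 file system (names and sizes are
examples only):
On both nodes:
# lvextend -L +10G /dev/vg0/r0

On one node, with the resource Connected:
# drbdadm resize r0

On the primary node, grow the file system to use the new space:
# resize2fs /dev/drbd/by-res/r0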

6.15.2. Growing off-line


When the backing block devices on both nodes are grown while DRBD is inactive, and the DRBD
resource is using external meta data, then the new size is recognized automatically. No
administrative intervention is necessary. The DRBD device will have the new size after the next
activation of DRBD on both nodes and a successful establishment of a network connection.
If however the DRBD resource is configured to use internal meta data, then this meta data must be
moved to the end of the grown device before the new size becomes available. To do so, complete
the following steps:
Warning
This is an advanced procedure. Use at your own discretion.

• Unconfigure your DRBD resource:


# drbdadm down <resource>

• Save the meta data in a text file prior to growing:


# drbdadm dump-md <resource> > /tmp/metadata

You must do this on both nodes, using a separate dump file for every node. Do not dump the meta
data on one node, and simply copy the dump file to the peer. This will not work.
• Grow the backing block device on both nodes.
• Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly,
on both nodes. Remember that la-size-sect must be specified in sectors.
• Re-initialize the metadata area:
# drbdadm create-md <resource>
• Re-import the corrected meta data, on both nodes:
# drbdmeta_cmd=$(drbdadm -d dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
Valid meta-data in place, overwrite? [need to type 'yes' to confirm]
yes
Successfully restored meta data

Note
This example uses bash parameter substitution. It may or may not work
in other shells. Check your SHELL environment variable if you are unsure
which shell you are currently using.

• Re-enable your DRBD resource:


# drbdadm up <resource>

• On one node, promote the DRBD resource:


# drbdadm primary <resource>

• Finally, grow the file system so it fills the extended size of the DRBD device.

6.15.3. Shrinking on-line


Warning
Online shrinking is only supported with external
metadata.

Before shrinking a DRBD device, you must shrink the layers above DRBD, i.e. usually the file
system. Since DRBD cannot ask the file system how much space it actually uses, you have to be
careful in order not to cause data loss.
Note
Whether or not the filesystem can be shrunk on-line depends on the
filesystem being used. Most filesystems do not support on-line shrinking.
XFS does not support shrinking at all.

To shrink DRBD on-line, issue the following command after you have shrunk the file system
residing on top of it:
# drbdadm resize --size=<new-size> <resource>

You may use the usual multiplier suffixes for <new-size> (K, M, G etc.). After you have shrunk
DRBD, you may also shrink the containing block device (if it supports shrinking).

6.15.4. Shrinking off-line


If you were to shrink a backing block device while DRBD is inactive, DRBD would refuse to attach
to this block device during the next attach attempt, since it is now too small (in case external meta
data is used), or it would be unable to find its meta data (in case internal meta data is used). To work
around these issues, use this procedure (if you cannot use on-line shrinking):
Warning
This is an advanced procedure. Use at your own discretion.

• Shrink the file system from one node, while DRBD is still configured.
• Unconfigure your DRBD resource:
# drbdadm down <resource>

• Save the meta data in a text file prior to shrinking:


# drbdadm dump-md <resource> > /tmp/metadata

You must do this on both nodes, using a separate dump file for every node. Do not dump the meta
data on one node, and simply copy the dump file to the peer. This will not work.
• Shrink the backing block device on both nodes.
• Adjust the size information (la-size-sect) in the file /tmp/metadata accordingly,
on both nodes. Remember that la-size-sect must be specified in sectors.
• Only if you are using internal metadata (which at this time have probably been lost due to
the shrinking process), re-initialize the metadata area:
# drbdadm create-md <resource>

• Re-import the corrected meta data, on both nodes:


# drbdmeta_cmd=$(drbdadm -d dump-md <resource>)
# ${drbdmeta_cmd/dump-md/restore-md} /tmp/metadata
Valid meta-data in place, overwrite? [need to type 'yes' to confirm]
yes
Successfully restored meta data

Note
This example uses bash parameter substitution. It may or may not work
in other shells. Check your SHELL environment variable if you are unsure
which shell you are currently using.

• Re-enable your DRBD resource:


# drbdadm up <resource>
6.16. Disabling backing device flushes
Caution
You should only disable device flushes when running DRBD on devices
with a battery-backed write cache (BBWC). Most storage controllers
allow you to automatically disable the write cache when the battery is
depleted, switching to write-through mode when the battery dies. It is
strongly recommended to enable such a feature.

Disabling DRBD’s flushes when running without BBWC, or on BBWC with a depleted battery, is
likely to cause data loss and should not be attempted.
DRBD allows you to enable and disable backing device flushes separately for the replicated data set
and DRBD’s own meta data. Both of these options are enabled by default. If you wish to disable
either (or both), you would set this in the disk section for the DRBD configuration file,
/etc/drbd.conf.
To disable disk flushes for the replicated data set, include the following line in your configuration:
resource <resource> {
  disk {
    disk-flushes no;
    ...
  }
  ...
}

To disable disk flushes on DRBD’s meta data, include the following line:
resource <resource> {
  disk {
    md-flushes no;
    ...
  }
  ...
}

After you have modified your resource configuration (and synchronized your /etc/drbd.conf
between nodes, of course), you may enable these settings by issuing this command on both nodes:
# drbdadm adjust <resource>
6.17. Configuring split brain behavior
6.17.1. Split brain notification
DRBD invokes the split-brain handler, if configured, at any time split brain is detected. To
configure this handler, add the following item to your resource configuration:
resource <resource> {
  handlers {
    split-brain <handler>;
    ...
  }
  ...
}

<handler> may be any executable present on the system.


The DRBD distribution contains a split brain handler script that installs as
/usr/lib/drbd/notify-split-brain.sh. It simply sends a notification e-mail message
to a specified address. To configure the handler to send a message to root@localhost (which is
expected to be an email address that forwards the notification to a real system administrator),
configure the split-brain handler as follows:
resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  ...
}

After you have made this modification on a running resource (and synchronized the configuration
file between nodes), no additional intervention is needed to enable the handler. DRBD will simply
invoke the newly-configured handler on the next occurrence of split brain.

6.17.2. Automatic split brain recovery policies


To enable and configure DRBD’s automatic split brain recovery policies, you must understand that
DRBD offers several configuration options for this purpose. DRBD applies its
split brain recovery procedures based on the number of nodes in the Primary role at the time the
split brain is detected. To that end, DRBD examines the following keywords, all found in the
resource’s net configuration section:
after-sb-0pri. Split brain has just been detected, but at this time the resource is not in the
Primary role on any host. For this option, DRBD understands the following keywords:
• disconnect: Do not recover automatically, simply invoke the split-brain handler
script (if configured), drop the connection and continue in disconnected mode.
• discard-younger-primary: Discard and roll back the modifications made on the host
which assumed the Primary role last.
• discard-least-changes: Discard and roll back the modifications on the host where
fewer changes occurred.
• discard-zero-changes: If there is any host on which no changes occurred at all,
simply apply all modifications made on the other and continue.
after-sb-1pri. Split brain has just been detected, and at this time the resource is in the Primary
role on one host. For this option, DRBD understands the following keywords:
• disconnect: As with after-sb-0pri, simply invoke the split-brain handler
script (if configured), drop the connection and continue in disconnected mode.
• consensus: Apply the same recovery policies as specified in after-sb-0pri. If a split
brain victim can be selected after applying these policies, automatically resolve. Otherwise,
behave exactly as if disconnect were specified.
• call-pri-lost-after-sb: Apply the recovery policies as specified in after-sb-
0pri. If a split brain victim can be selected after applying these policies, invoke the pri-
lost-after-sb handler on the victim node. This handler must be configured in the
handlers section and is expected to forcibly remove the node from the cluster.
• discard-secondary: Whichever host is currently in the Secondary role, make that host
the split brain victim.
after-sb-2pri. Split brain has just been detected, and at this time the resource is in the Primary
role on both hosts. This option accepts the same keywords as after-sb-1pri except
discard-secondary and consensus.
Note
DRBD understands additional keywords for these three options, which
have been omitted here because they are very rarely used. Refer to
drbd.conf(5) for details on split brain recovery keywords not discussed
here.

For example, a resource which serves as the block device for a GFS or OCFS2 file system in dual-
Primary mode may have its recovery policy defined as follows:
resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    ...
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ...
  }
  ...
}
6.18. Creating a three-node setup
A three-node setup involves one DRBD device stacked atop another.

6.18.1. Device stacking considerations


The following considerations apply to this type of setup:
• The stacked device is the active one. Assume you have configured one DRBD device
/dev/drbd0, and the stacked device atop it is /dev/drbd10; then /dev/drbd10 will
be the device that you mount and use.
• Device meta data will be stored twice, on the underlying DRBD device and the stacked
DRBD device. On the stacked device, you must always use internal meta data. This means
that the effectively available storage area on a stacked device is slightly smaller, compared
to an unstacked device.
• To get the stacked upper level device running, the underlying device must be in the primary
role.
• To be able to synchronize the backup node, the stacked device on the active node must be up
and in the primary role.

6.18.2. Configuring a stacked resource


In the following example, nodes are named alice, bob, and charlie, with alice and bob
forming a two-node cluster, and charlie being the backup node.
resource r0 {
  net {
    protocol C;
  }

  on alice {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.1:7788;
    meta-disk internal;
  }

  on bob {
    device    /dev/drbd0;
    disk      /dev/sda6;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

resource r0-U {
  net {
    protocol A;
  }

  stacked-on-top-of r0 {
    device    /dev/drbd10;
    address   192.168.42.1:7788;
  }

  on charlie {
    device    /dev/drbd10;
    disk      /dev/hda6;
    address   192.168.42.2:7788; # Public IP of the backup node
    meta-disk internal;
  }
}

As with any drbd.conf configuration file, this must be distributed across all nodes in the cluster 
— in this case, three nodes. Notice the following extra keyword not found in an unstacked resource
configuration:
stacked-on-top-of. This option informs DRBD that the resource which contains it is a
stacked resource. It replaces one of the on sections normally found in any resource configuration.
Do not use stacked-on-top-of in a lower-level resource.
Note
It is not a requirement to use Protocol A for stacked resources. You may
select any of DRBD’s replication protocols depending on your
application.

6.18.3. Enabling stacked resources


To enable a stacked resource, you first enable its lower-level resource and promote it:
drbdadm up r0
drbdadm primary r0

As with unstacked resources, you must create DRBD meta data on the stacked resources. This is
done using the following command:
# drbdadm create-md --stacked r0-U

Then, you may enable the stacked resource:


# drbdadm up --stacked r0-U
# drbdadm primary --stacked r0-U

After this, you may bring up the resource on the backup node, enabling three-node replication:
# drbdadm create-md r0-U
# drbdadm up r0-U

In order to automate stacked resource management, you may integrate stacked resources in your
cluster manager configuration.
6.19. Using DRBD Proxy
6.19.1. DRBD Proxy deployment considerations
The DRBD Proxy processes can either be located directly on the machines where DRBD is set up,
or they can be placed on distinct dedicated servers. A DRBD Proxy instance can serve as a proxy
for multiple DRBD devices distributed across multiple nodes.
DRBD Proxy is completely transparent to DRBD. Typically you should expect a high number of data
packets in flight; therefore, the activity log should be reasonably large. Since this may cause longer
re-sync runs after the crash of a primary node, it is recommended to enable DRBD’s csums-alg
setting.

6.19.2. Installation
To obtain DRBD Proxy, please contact your Linbit sales representative. Unless instructed otherwise,
please always use the most recent DRBD Proxy release.
To install DRBD Proxy on Debian and Debian-based systems, use the dpkg tool as follows (replace
version with your DRBD Proxy version, and architecture with your target architecture):
# dpkg -i drbd-proxy_3.0.0_amd64.deb

To install DRBD Proxy on RPM based systems (like SLES or RHEL) use the rpm tool as follows
(replace version with your DRBD Proxy version, and architecture with your target architecture):
# rpm -i drbd-proxy-3.0-3.0.0-1.x86_64.rpm

Also install the DRBD administration program drbdadm since it is required to configure DRBD
Proxy.
This will install the DRBD proxy binaries as well as an init script which usually goes into
/etc/init.d. Please always use the init script to start/stop DRBD proxy since it also configures
DRBD Proxy using the drbdadm tool.

6.19.3. License file


When obtaining a license from Linbit, you will be sent a DRBD Proxy license file which is required
to run DRBD Proxy. The file is called drbd-proxy.license; it must be copied into the /etc
directory of the target machines and be owned by the user and group drbdpxy.
# cp drbd-proxy.license /etc/

6.19.4. Configuration
DRBD Proxy is configured in DRBD’s main configuration file. It is configured by an additional
options section called proxy and additional proxy on sections within the host sections.
Below is a DRBD configuration example for proxies running directly on the DRBD nodes:
resource r0 {
  net {
    protocol A;
  }
  device    minor 0;
  disk      /dev/sdb1;
  meta-disk /dev/sdb2;

  proxy {
    memlimit 100M;
    plugin {
      zlib level 9;
    }
  }

  on alice {
    address 127.0.0.1:7789;
    proxy on alice {
      inside  127.0.0.1:7788;
      outside 192.168.23.1:7788;
    }
  }

  on bob {
    address 127.0.0.1:7789;
    proxy on bob {
      inside  127.0.0.1:7788;
      outside 192.168.23.2:7788;
    }
  }
}

The inside IP address is used for communication between DRBD and the DRBD Proxy, whereas
the outside IP address is used for communication between the proxies.

6.19.5. Controlling DRBD Proxy


drbdadm offers the proxy-up and proxy-down subcommands to configure or delete the
connection to the local DRBD Proxy process of the named DRBD resource(s). These commands are
used by the start and stop actions which /etc/init.d/drbdproxy implements.
The DRBD Proxy has a low level configuration tool, called drbd-proxy-ctl. When called
without any option it operates in interactive mode.
To pass a command directly, avoiding interactive mode, use the -c parameter followed by the
command.
To display the available commands use:
# drbd-proxy-ctl -c "help"

Note the double quotes around the command being passed.


add connection <name> <listen-lan-ip>:<port> <remote-proxy-ip>:<port>
<local-proxy-wan-ip>:<port> <local-drbd-ip>:<port>
Creates a communication path between two DRBD instances.

set memlimit <name> <memlimit-in-bytes>


Sets memlimit for connection <name>

del connection <name>


Deletes communication path named name.

show
Shows currently configured communication paths.

show memusage
Shows memory usage of each connection.

show [h]subconnections
Shows currently established individual connections
together with some stats. With h outputs bytes in human
readable format.

show [h]connections
Shows currently configured connections and their states
With h outputs bytes in human readable format.

shutdown
Shuts down the drbd-proxy program. Attention: this
unconditionally terminates any DRBD connections running.

Examples:
drbd-proxy-ctl -c "show hconnections"
prints configured connections and their status to stdout
Note that the quotes are required.

drbd-proxy-ctl -c "show subconnections" | cut -f 2,9,13


prints some more detailed info about the individual connections

watch -n 1 'drbd-proxy-ctl -c "show memusage"'


monitors memory usage.
Note that the quotes are required as listed above.

While the commands above are only accepted from UID 0 (i.e., the root user), there’s one
(information gathering) command that can be used by any user (provided that unix permissions
allow access on the proxy socket at /var/run/drbd-proxy/drbd-proxy-ctl.socket);
see the init script at /etc/init.d/drbdproxy about setting the rights.
print details
This prints detailed statistics for the currently active connections.
Can be used for monitoring, as this is the only command that may be sent by a
user with a UID other than 0.

quit
Exits the client program (closes control connection).

6.19.6. About DRBD Proxy plugins


Since DRBD Proxy 3.0, the proxy allows enabling a few specific plugins for the WAN connection.
The currently available plugins are zlib and lzma.
The zlib plugin uses the GZIP algorithm for compression. The advantage is fairly low CPU usage.
The lzma plugin uses the liblzma2 library. It can use dictionaries of several hundred MiB; these
allow for very efficient delta-compression of repeated data, even for small changes. lzma needs
much more CPU and memory, but results in much better compression than zlib. The lzma plugin
has to be enabled in your license.
Please contact Linbit to find the best settings for your environment - it depends on the CPU (speed,
threading count), memory, input and the available output bandwidth.
Please note that the older compression on setting in the proxy section is deprecated, and will be
removed in a future release. Currently it is treated as zlib level 9.
6.19.7. Using a WAN Side Bandwidth Limit
With DRBD-utils 8.4.4 and DRBD Proxy version 3.1.1 there is experimental support for a per-connection
bandwidth limit in the proxy configuration section, via the bwlimit option.
This makes the corresponding sending thread sleep briefly after sending a chunk of data, so that it
uses no more than (approximately) the specified bandwidth.
The value 0 means no limitation, and is the default.
proxy {
  bwlimit 2M;
  ...
}

The example above would restrict the outgoing rate over the WAN connection to approximately
2MiB per second, leaving room on the wire for other data.

6.19.8. Troubleshooting
DRBD proxy logs via syslog using the LOG_DAEMON facility. Usually you will find DRBD Proxy
messages in /var/log/daemon.log.
Enabling debug mode in DRBD Proxy can be done with the following command.
# drbd-proxy-ctl -c 'set loglevel debug'

For example, if the proxy fails to connect it will log something like Rejecting connection
because I can’t connect on the other side. In that case, please check if DRBD is
running (not in StandAlone mode) on both nodes and if both proxies are running. Also double-
check your configuration.
Chapter 7. Troubleshooting and error recovery

7.1. Dealing with hard drive failure


How to deal with hard drive failure depends on the way DRBD is configured to handle disk I/O
errors.

Note
For the most part, the steps described here apply only if you run DRBD
directly on top of physical hard drives. They generally do not apply in
case you are running DRBD layered on top of

• an MD software RAID set (in this case, use mdadm to manage drive replacement),
• device-mapper RAID (use dmraid),
• a hardware RAID appliance (follow the vendor’s instructions on how to deal with failed
drives),
• some non-standard device-mapper virtual block devices (see the device mapper
documentation).

7.1.1. Manually detaching DRBD from your hard drive


If DRBD is configured to pass on I/O errors (not recommended), you must first detach the DRBD
resource, that is, disassociate it from its backing storage:
drbdadm detach <resource>

By running the drbdadm dstate command, you will now be able to verify that the resource is
now in diskless mode:
drbdadm dstate <resource>
Diskless/UpToDate

If the disk failure has occurred on your primary node, you may combine this step with a switch-over
operation.

7.1.2. Automatic detach on I/O error


If DRBD is configured to automatically detach upon I/O error (the recommended option), DRBD
should have automatically detached the resource from its backing storage already, without manual
intervention. You may still use the drbdadm dstate command to verify that the resource is in
fact running in diskless mode.

7.1.3. Replacing a failed disk when using internal meta data


If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk. If the new
hard disk has to be addressed by a different Linux device name than the defective disk, the device
name has to be modified accordingly in the DRBD configuration file.
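For example, if the replacement disk appears as /dev/sdc1 where the defective one was
/dev/sdb1 (device names here are hypothetical), you would adjust the disk keyword in the
corresponding on <host> section:
on <host> {
  ...
  disk /dev/sdc1;   # was: disk /dev/sdb1;
  ...
}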
This process involves creating a new meta data set, then re-attaching the resource:
drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block sucessfully created.

drbdadm attach <resource>

Full synchronization of the new hard disk starts instantaneously and automatically. You will be able
to monitor the synchronization’s progress via /proc/drbd, as with any background
synchronization.

7.1.4. Replacing a failed disk when using external meta data


When using external meta data, the procedure is basically the same. However, DRBD is not able to
recognize independently that the hard drive was swapped, thus an additional step is required.
drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initialising activity log
NOT initializing bitmap
New drbd meta data block sucessfully created.

drbdadm attach <resource>


drbdadm invalidate <resource>

Here, the drbdadm invalidate command triggers synchronization. Again, sync progress may
be observed via /proc/drbd.

7.2. Dealing with node failure


When DRBD detects that its peer node is down (either by true hardware failure or manual
intervention), DRBD changes its connection state from Connected to WFConnection and
waits for the peer node to re-appear. The DRBD resource is then said to operate in disconnected
mode. In disconnected mode, the resource and its associated block device are fully usable, and may
be promoted and demoted as necessary, but no block modifications are being replicated to the peer
node. Instead, DRBD stores internal information on which blocks are being modified while
disconnected.
7.2.1. Dealing with temporary secondary node failure
If a node that currently has a resource in the secondary role fails temporarily (due to, for example, a
memory problem that is subsequently rectified by replacing RAM), no further intervention is
necessary — besides the obvious necessity to repair the failed node and bring it back on line. When
that happens, the two nodes will simply re-establish connectivity upon system start-up. After this,
DRBD replicates all modifications made on the primary node in the meantime, to the secondary
node.
Important
At this point, due to the nature of DRBD’s re-synchronization algorithm,
the resource is briefly inconsistent on the secondary node. During that
short time window, the secondary node cannot switch to the Primary role
if the peer is unavailable. Thus, the period in which your cluster is not
redundant consists of the actual secondary node down time, plus the
subsequent re-synchronization.

7.2.2. Dealing with temporary primary node failure


From DRBD’s standpoint, failure of the primary node is almost identical to a failure of the
secondary node. The surviving node detects the peer node’s failure, and switches to disconnected
mode. DRBD does not promote the surviving node to the primary role; it is the cluster management
application’s responsibility to do so.
When the failed node is repaired and returns to the cluster, it does so in the secondary role; thus, as
outlined in the previous section, no further manual intervention is necessary. Again, DRBD does not
change the resource role back, it is up to the cluster manager to do so (if so configured).
DRBD ensures block device consistency in case of a primary node failure by way of a special
mechanism. For a detailed discussion, refer to Section 17.3, “The Activity Log”.

7.2.3. Dealing with permanent node failure


If a node suffers an unrecoverable problem or permanent destruction, you must take the following
steps:
• Replace the failed hardware with one with similar performance and disk capacity.
Note
Replacing a failed node with one with worse performance characteristics
is possible, but not recommended. Replacing a failed node with one with
less disk capacity is not supported, and will cause DRBD to refuse to
connect to the replaced node.

• Install the base system and applications.


• Install DRBD and copy /etc/drbd.conf and all of /etc/drbd.d/ from the
surviving node.
• Follow the steps outlined in Chapter 5, Configuring DRBD, but stop short of Section 5.5,
“The initial device synchronization”.
Manually starting a full device synchronization is not necessary at this point; it will commence
automatically upon connection to the surviving primary node.
7.3. Manual split brain recovery
DRBD detects split brain at the time connectivity becomes available again and the peer nodes
exchange the initial DRBD protocol handshake. If DRBD detects that both nodes are (or were at
some point, while disconnected) in the primary role, it immediately tears down the replication
connection. The tell-tale sign of this is a message like the following appearing in the system log:
Split-Brain detected, dropping connection!

After split brain has been detected, one node will always have the resource in a StandAlone
connection state. The other might either also be in the StandAlone state (if both nodes detected
the split brain simultaneously), or in WFConnection (if the peer tore down the connection before
the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must
manually intervene by selecting one node whose modifications will be discarded (this node is
referred to as the split brain victim). This intervention is made with the following commands:
Note
The split brain victim needs to be in the connection state of
StandAlone or the following commands will return an error. You can
ensure it is standalone by issuing:

drbdadm disconnect <resource>

drbdadm secondary <resource>


drbdadm connect --discard-my-data <resource>

On the other node (the split brain survivor), if its connection state is also StandAlone, you would
enter:
drbdadm connect <resource>

You may omit this step if the node is already in the WFConnection state; it will then reconnect
automatically.
If the resource affected by the split brain is a stacked resource, use drbdadm --stacked instead
of just drbdadm.
Upon connection, your split brain victim immediately changes its connection state to
SyncTarget, and has its modifications overwritten by the remaining primary node.
Note
The split brain victim is not subjected to a full device synchronization.
Instead, it has its local modifications rolled back, and any modifications
made on the split brain survivor propagate to the victim.

After re-synchronization has completed, the split brain is considered resolved and the two nodes
form a fully consistent, redundant replicated storage system.
