DRBD is essentially "RAID over the network": it replicates block devices across multiple machines. We use it extensively in our Ganeti configuration to replicate virtual machine disks across multiple hosts.

How-to

Checking status

Just like /proc/mdstat for mdadm, there's a file in /proc, /proc/drbd, which shows the status of the DRBD devices. This is a healthy configuration:

# cat /proc/drbd
version: 8.4.10 (api:1/proto:86-101)
srcversion: 9B4D87C5E865DF526864868 
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:10821208 dw:10821208 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:10485760 dw:10485760 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

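The keyword to look for is UpToDate: the ds field shows the disk state of the local node and of its peer, and both should be UpToDate. This is a configuration that is being resynchronized: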

version: 8.4.10 (api:1/proto:86-101)
srcversion: 9B4D87C5E865DF526864868 
 0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:9352840 dw:9352840 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:1468352
    [================>...] sync'ed: 86.1% (1432/10240)M
    finish: 0:00:36 speed: 40,436 (38,368) want: 61,440 K/sec
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:8439808 dw:8439808 dr:0 al:8 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:2045952
    [===============>....] sync'ed: 80.6% (1996/10240)M
    finish: 0:00:52 speed: 39,056 (37,508) want: 61,440 K/sec
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1048580 dw:1048580 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

See the upstream documentation for details on this output.
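To avoid reading the whole output by eye, the check above can be scripted. The following is a minimal sketch, assuming the DRBD 8.4 /proc/drbd format shown in the samples; the function name drbd_unhealthy is made up for this page and is not part of drbd-utils. It prints the minor, connection state, and disk state of every device that is not both Connected and UpToDate/UpToDate.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd-utils): list DRBD minors that are
# not fully healthy. Reads DRBD 8.4 /proc/drbd-style text from the file
# given as $1, defaulting to /proc/drbd.
drbd_unhealthy() {
    awk '
        # Device lines start with the minor number followed by a colon,
        # e.g. " 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----"
        $1 ~ /^[0-9]+:$/ {
            minor = $1
            sub(/:$/, "", minor)
            cs = ""
            ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            # Anything other than Connected + UpToDate on both sides is reported.
            if (cs != "Connected" || ds != "UpToDate/UpToDate")
                print minor, cs, ds
        }
    ' "${1:-/proc/drbd}"
}
```

Calling drbd_unhealthy without arguments on a node reads /proc/drbd directly; an empty output means every device is connected and up to date.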

The drbdmon command provides a similar view, although it is, in my opinion, less readable.

Because DRBD is implemented as a kernel module, its activity also shows up in the kernel logs (dmesg).

Finding device associated with host

In the DRBD status, devices are identified by their minor number. For example, this is the device with minor number 18, whose resync has stalled:

18: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:1237956 nr:0 dw:11489220 dr:341910 al:177 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
    [===================>] sync'ed:100.0% (0/10240)M
    finish: 0:00:00 speed: 764 (768) K/sec (stalled)

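Finding which instance is associated with this device is easy: call gnt-node list-drbd on the node and look for the minor number. Note that a plain grep matches the number anywhere in the line, so check the Minor column of the result: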

root@fsn-node-01:~# gnt-node list-drbd fsn-node-01 | grep 18
fsn-node-01.torproject.org    18 gettor-01.torproject.org          disk/0 primary   fsn-node-02.torproject.org

It's the instance gettor-01.
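Since grep 18 also matches lines that merely contain "18" (minor 118, or a hostname with 18 in it), the lookup can be made exact by comparing the Minor column only. This is a small sketch, assuming the column layout shown in this page (Node, Minor, Instance, Disk, Role, PeerNode); the function name drbd_minor_to_instance is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the instance and disk that use a given DRBD minor.
# Reads `gnt-node list-drbd` output on stdin; $1 is the minor number.
# Only the second column (Minor) is compared, so the header line and
# unrelated lines that happen to contain the number are ignored.
drbd_minor_to_instance() {
    awk -v minor="$1" '$2 == minor { print $3, $4 }'
}
```

For the example above, this would be used as gnt-node list-drbd fsn-node-01 | drbd_minor_to_instance 18.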

Pager playbook

Resyncing disks

If you see this alert in Nagios:

DRBD CRITICAL: Device 10 WFConnection UpToDate, Device 9 WFConnection UpToDate

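It means that, on that host (fsn-node-04.torproject.org in this example), the listed devices have lost the connection to their peer: WFConnection means the node is waiting for the peer to reconnect. Replication has stopped, so the disks drift out of sync. In this case, the affected devices are minors 9 and 10. You can confirm that on the host: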

# ssh fsn-node-04.torproject.org cat /proc/drbd
[...]
 9: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:13799284 nr:0 dw:272704248 dr:15512933 al:1331 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:8343096
10: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
    ns:2097152 nr:0 dw:2097192 dr:2102652 al:9 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:40
[...]

You need to find which instance this disk is associated with (see also above):

$ ssh fsn-node-01.torproject.org gnt-node list-drbd fsn-node-04
[...]
Node                       Minor Instance                            Disk   Role      PeerNode
[...]
fsn-node-04.torproject.org     9 onionoo-frontend-01.torproject.org  disk/0 primary   fsn-node-03.torproject.org
fsn-node-04.torproject.org    10 onionoo-frontend-01.torproject.org  disk/1 primary   fsn-node-03.torproject.org
[...]

Then you can "reactivate" the disks of that instance by telling Ganeti, from the master node:

$ ssh fsn-node-01.torproject.org gnt-instance activate-disks onionoo-frontend-01.torproject.org

The disks will then resync; you can follow the progress in /proc/drbd on the node.
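When many devices are affected, the two lookups above (minors in WFConnection, then the matching instances) can be combined. The following is a rough sketch, assuming the /proc/drbd output of the affected node and the gnt-node list-drbd output for that node have each been saved to a file; the function name drbd_disconnected_instances is hypothetical. It only prints instance names and does not run any Ganeti command itself.

```shell
#!/bin/sh
# Hypothetical helper: print each instance that has at least one DRBD
# device in WFConnection.
#   $1: file holding the node's /proc/drbd output (DRBD 8.4 format)
#   $2: file holding `gnt-node list-drbd <node>` output for the same node
drbd_disconnected_instances() {
    awk '
        # First file: remember minors whose connection state is WFConnection.
        FNR == NR {
            if ($1 ~ /^[0-9]+:$/ && $2 == "cs:WFConnection") {
                minor = $1
                sub(/:$/, "", minor)
                waiting[minor] = 1
            }
            next
        }
        # Second file: column 2 is the minor, column 3 the instance.
        # Print each affected instance only once.
        ($2 in waiting) && !seen[$3]++ { print $3 }
    ' "$1" "$2"
}
```

Each name it prints can then be passed to gnt-instance activate-disks on the master node, as shown above, after checking that the list looks sane.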

Upstream documentation

Reference

Installation

The ganeti Puppet module takes care of basic DRBD configuration, by installing the right software (drbd-utils) and kernel modules. Everything else is handled automatically by Ganeti itself.

There's a Nagios check for the DRBD service that ensures devices are synchronized. It yields an UNKNOWN status when no device exists, so new nodes are expected to be flagged until they host some content. The check is shipped as part of tor-nagios-checks, as dsa-check-drbd.