Software RAID

Replacing a drive

If a drive fails in a server, the procedure is essentially to open a ticket, wait for the drive change, partition and re-add it to the RAID array. The following procdure assumes that sda failed and sdb is good in a RAID-1 array, but can vary with other RAID configurations or drive models.

  1. file a ticket upstream

    Hetzner Support, for example, has an excellent service which asks you the disk serial number (available in the SMART email notification) and the SMART log (output of smartctl -x /dev/sda). Then they will turn off the machine, replace the disk, and start it up again.

  2. wait for the server to return with the new disk

    Hetzner will send an email to the tpa alias when that is done.

  3. partition the new drive (sda) to match the old (sdb):

    sfdisk -d /dev/sdb | sfdisk --no-reread /dev/sda --force
    
  4. re-add the new disk to the RAID array:

    mdadm /dev/md0 -a /dev/sda
    

Note that Hetzner also has pretty good documentation on how to deal with SMART output.

Hardware RAID

Some TPO machines have hardware RAID with megaraid controllers. Those are controlled with the MegaCLI command that is ... rather hard to use.

First, alias the megacli command because the package (derived from the upstream RPM by Alien) installs it in a strange location:

alias megacli=/opt/MegaRAID/MegaCli/MegaCli

This will confirm you are using hardware raid:

root@moly:/home/anarcat# lspci | grep -i raid
05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)

This will show the RAID levels of each enclosure, for example this is RAID-10:

root@moly:/home/anarcat# megacli -LdPdInfo -aALL | grep "RAID Level"
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0

This lists a summary of all the disks, for example the first disk has failed here:

root@moly:/home/anarcat# megacli -PDList -aALL | grep -e '^Enclosure' -e '^Slot' -e '^PD' -e '^Firmware' -e '^Raw' -e '^Inquiry'
Enclosure Device ID: 252
Slot Number: 0
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Failed
Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
Enclosure Device ID: 252
Slot Number: 1
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
Enclosure Device ID: 252
Slot Number: 2
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS     [REDACTED]
Enclosure Device ID: 252
Slot Number: 3
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS     [REDACTED]

This will make the drive blink (slot number 0 in enclosure 252):

megacli -PdLocate -start -physdrv[252:0] -aALL

SMART monitoring

Some servers will fail to properly detect disk drives in their SMART configuration. In particular, smartd does not support:

  • virtual disks (e.g. /dev/nbd0)
  • MMC block devices (e.g. /dev/mmcblk0, commonly found on ARM devices)
  • out of the box, CCISS raid devices (e.g. /dev/cciss/c0d0)

The latter can be configured with the following snippet in /etc/smartd.conf:

#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/cciss/c0d0 -d cciss,0
/dev/cciss/c0d0 -d cciss,1
/dev/cciss/c0d0 -d cciss,2
/dev/cciss/c0d0 -d cciss,3
/dev/cciss/c0d0 -d cciss,4
/dev/cciss/c0d0 -d cciss,5

Notice how the DEVICESCAN is commented out to be replaced by the CCISS configuration. One line for each drive should be added (and no, it does not autodetect all drives unfortunately). This hack was deployed on listera which uses that hardware RAID.

Other hardware RAID controllers are better supported. For example, the megaraid controller on moly was correctly detected by smartd which accurately found a broken hard drive.

References

Here are some external documentation links: