Replacing a Linux RAID Drive

NAS drives

I have been running a software RAID array at home for some time now. It serves as a single network storage location where I consolidate all my files. I manage this array manually using the mdadm command. Some people choose to buy a NAS box, which hides all of the implementation details behind a nice web GUI, but it’s essentially the same thing under the hood.

The array uses four drives in Linux software RAID 5, which means it can tolerate a single drive failure. Failures don’t always take out an entire drive, though. More often they show up as bad sectors on a drive. As an illustration, the RAID 5 array below can still operate properly (meaning no data loss, yet) with bad sectors on two of its drives:

RAID 5 array with damaged blocks

As long as the other drives in the array don’t develop bad sectors in the same stripe, the data can still be reconstructed from the remaining good blocks. This means that you can leave a drive with bad sectors in place for a while without replacing it, but of course you are taking a risk.
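The reconstruction described above relies on RAID 5 parity being the XOR of the data blocks in a stripe. The following is a minimal sketch of that idea, not how the md driver is actually implemented. The block contents and the helper names `xor_blocks` and `rebuild` are made up for illustration, and the sketch keeps parity in a fixed position, whereas real RAID 5 rotates it across drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(stripe, bad_index):
    """Recover the block at bad_index by XOR-ing every other block in the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != bad_index]
    return xor_blocks(survivors)

# Hypothetical 4-drive array: each stripe holds 3 data blocks plus 1 parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# One unreadable block per stripe is recoverable from the other three.
print(rebuild(stripe, bad_index=1))  # b'BBBB'
```

Two bad blocks in the same stripe leave too little information to XOR back, which is why bad sectors on two drives are survivable only while they fall in different stripes.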

I thought I’d share my experiences with drive replacements thus far.

Detecting Drive Problems

Most Linux distributions provide the raid-check script for periodic RAID scrubbing. This is basically a background cron job that tells the kernel to start checking the RAID array. For RHEL/CentOS systems, this should occur every weekend.

During this scrubbing process, every drive in the array is read and the parity for each stripe is recomputed and compared against what is stored, to ensure that everything tallies.
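For reference, the md driver exposes scrubbing through sysfs: writing `check` to an array's `sync_action` file starts a check, and `mismatch_cnt` reports how many mismatches were found. Below is a small sketch of driving that interface by hand. The array name `md0` is an assumption, the helper names are hypothetical, and the `sysfs_root` parameter exists only so the functions can be tried against a scratch directory.

```python
from pathlib import Path

def md_sysfs_dir(array="md0", sysfs_root="/sys"):
    """Directory holding the md control files for one array (array name assumed)."""
    return Path(sysfs_root) / "block" / array / "md"

def start_check(array="md0", sysfs_root="/sys"):
    """Ask the kernel to start a scrub of the array (requires root on a real system)."""
    (md_sysfs_dir(array, sysfs_root) / "sync_action").write_text("check\n")

def scrub_status(array="md0", sysfs_root="/sys"):
    """Return the current sync action and the mismatch count."""
    md = md_sysfs_dir(array, sysfs_root)
    return {
        "sync_action": (md / "sync_action").read_text().strip(),
        "mismatch_cnt": int((md / "mismatch_cnt").read_text()),
    }

# Example usage on a machine that really has /dev/md0 (run as root):
#   start_check("md0")
#   print(scrub_status("md0"))
```

Progress of a running check can also be followed in /proc/mdstat.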

It is during this verification process that hard drive errors sometimes show up. Typically, when a drive encounters a problem during a read, the hardware returns an error, which is then logged by Linux. The log entries can look like this:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata3.00: irq_stat 0x40000001
ata3.00: failed command: READ DMA EXT
ata3.00: cmd 25/00:00:d8:10:27/00:02:05:00:00/e0 tag 8 dma 262144 in
         res 51/40:1f:b8:12:27/00:00:05:00:00/e0 Emask 0x9 (media error)
ata3.00: status: { DRDY ERR }
ata3.00: error: { UNC }
ata3.00: configured for UDMA/133
ata3: EH complete
 .
 . (repeats)
 .
sd 2:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 2:0:0:0: [sdc]  Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        05 27 12 b8
sd 2:0:0:0: [sdc]  Add. Sense: Unrecovered read error - auto reallocate failed
sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 05 27 10 d8 00 02 00 00
end_request: I/O error, dev sdc, sector 86446776
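The most useful line in a log like this is usually the last one, which names the device and the failing sector. Here is a rough sketch of pulling those out of a saved log. The regular expression matches only the `end_request` wording shown above (other kernel versions may word this message differently), and `failed_sectors` is a hypothetical helper name.

```python
import re

# Matches the "end_request: I/O error" line format shown in the log above.
IO_ERROR = re.compile(
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def failed_sectors(log_text):
    """Return a (device, sector) pair for each I/O error line found."""
    return [(m["dev"], int(m["sector"])) for m in IO_ERROR.finditer(log_text)]

sample = "end_request: I/O error, dev sdc, sector 86446776"
print(failed_sectors(sample))  # [('sdc', 86446776)]

# The trailing sense data bytes in the log (05 27 12 b8) give the same number in hex.
print(int("052712b8", 16))  # 86446776
```

Repeated errors at the same sector, or at neighbouring ones, point to a genuine media problem rather than a one-off glitch.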
