I’ve been planning to enable scheduled S.M.A.R.T. scans ever since I built my file server, but it was just one of those things that I never got around to doing.

Well — when one of the 8 TB drives suddenly started reporting S.M.A.R.T. errors, the scheduled scans got back on the agenda. So let’s do that now 👇

Scheduled S.M.A.R.T. scans are performed by smartd:

smartd is a daemon that monitors the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive and predict drive failures, and to carry out different types of drive self-tests. — smartd man page

smartd

First — let’s figure out which drives we want to scan. You can use smartctl to scan for drives:

$ sudo smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device

/dev/sda is my boot SSD, while the rest are drives in the ZFS pool. I don’t like to use /dev/sdX when referencing drives because those names can change between boots, so I use the disk ID instead. To find the ID, you can list the content of /dev/disk/by-id/ and filter on sdX, like this:

$ ls -l /dev/disk/by-id/ | grep -E "sdb$"

lrwxrwxrwx 1 root root  9 Feb 10 23:15 scsi-xxxxxxxxxxxxxxxxx -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 10 23:15 scsi-SATA_WDC_WD40EFRX-68N_WD-xxxxxxxxxxxx -> ../../sdb
lrwxrwxrwx 1 root root  9 Feb 10 23:15 wwn-0x500xxxxxxxxxxxxx -> ../../sdb
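
Doing that lookup by hand for sixteen drives gets tedious. Below is a minimal sketch of a helper that prints the wwn-* link for a given device name. The function name wwn_for_device and its optional directory argument are my own invention for illustration, and it assumes every drive has a wwn-* entry in /dev/disk/by-id/:

```shell
#!/bin/bash

# wwn_for_device DEVICE [BY_ID_DIR]  (hypothetical helper, not part of smartmontools)
# Print every wwn-* symlink in BY_ID_DIR whose target is DEVICE (e.g. sdb).
wwn_for_device() {
    local dev="$1" dir="${2:-/dev/disk/by-id}" link
    for link in "$dir"/wwn-*; do
        # Skip anything that is not a symlink (including an unmatched glob)
        [ -L "$link" ] || continue
        # Compare the last path component of the target: ../../sdb -> sdb.
        # Partition links such as ../../sdb1 do not match.
        if [ "$(basename "$(readlink "$link")")" = "$dev" ]; then
            echo "$link"
        fi
    done
}

# Usage: print the ID of every pool drive, sdb through sdq
# for letter in {b..q}; do wwn_for_device "sd$letter"; done
```

The second argument is only there so the function can be tried against a scratch directory instead of the real /dev/disk/by-id/.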

After I had all the drive IDs, I opened the smartd configuration file:

$ sudo vim /etc/smartd.conf

First I had to comment out the line containing DEVICESCAN, because:

The word DEVICESCAN will cause any remaining lines in this configuration file to be ignored: it tells smartd to scan for all ATA and SCSI devices.
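
If you would rather not do this in an editor, the same change can be scripted. This is a sketch only: the function name comment_out_devicescan is hypothetical, and it assumes the active line starts with DEVICESCAN, optionally preceded by whitespace. It keeps a backup next to the file as FILE.bak:

```shell
#!/bin/bash

# comment_out_devicescan FILE  (hypothetical helper)
# Prefix every active DEVICESCAN line in FILE with "#".
# The original file is kept as FILE.bak.
comment_out_devicescan() {
    sed -i.bak -E 's/^([[:space:]]*)(DEVICESCAN)/\1#\2/' "$1"
}

# Usage on the real file needs root, so run the sed command directly:
# sudo sed -i.bak -E 's/^([[:space:]]*)(DEVICESCAN)/\1#\2/' /etc/smartd.conf
```

Lines that are already commented out are left alone, because the pattern only matches DEVICESCAN at the start of a line.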

Then I added my drives and their configuration to the end of the file:

# vdev 0
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/04|L/../01/./01) -m root@localhost 
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/05|L/../02/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/06|L/../03/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/07|L/../04/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/04|L/../05/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/05|L/../06/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/06|L/../07/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/07|L/../08/./01) -m root@localhost

# vdev 1
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/04|L/../09/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/05|L/../10/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/06|L/../11/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/07|L/../12/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/04|L/../13/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/05|L/../14/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/06|L/../15/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/07|L/../16/./01) -m root@localhost
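
I typed these lines by hand, but the staggering follows a simple pattern, so it could also be generated. The sketch below reads one drive path per line and prints an entry for each, using the same options as the listing above. The function name smartd_lines is made up for this example, and the arithmetic only makes sense for up to 28 drives, since the weekday must stay within 1 to 7 and the day of the month should exist in every month:

```shell
#!/bin/bash

# smartd_lines  (hypothetical helper)
# Read drive paths on stdin, one per line, and print a smartd.conf entry for
# each. Short tests: four drives per weekday, one per hour from 04 to 07.
# Long tests: one drive per day of the month, at 01.
smartd_lines() {
    local i=0 path weekday hour day
    while IFS= read -r path; do
        [ -n "$path" ] || continue      # skip empty lines
        weekday=$(( i / 4 + 1 ))        # 1 = Monday; four drives share a day
        hour=$(( i % 4 + 4 ))           # 4, 5, 6, 7
        day=$(( i + 1 ))                # day of month for the long test
        printf '%s -a -d scsi -s (S/../../%d/%02d|L/../%02d/./01) -m root@localhost\n' \
            "$path" "$weekday" "$hour" "$day"
        i=$(( i + 1 ))
    done
}

# Usage: ls /dev/disk/by-id/wwn-* | grep -v -- -part | smartd_lines
```

Review the output before appending it to /etc/smartd.conf; the usage line would also pick up the boot drive, which I left out of my own configuration.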

Let’s look at what the different directives mean:

  • -a: Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
  • -d TYPE: Set the device type: ata, scsi, marvell, removable, 3ware, N, hpt, L/M/N
  • -s REGE: Start self-test when type/date matches regular expression (see below)
  • -m ADD: Send warning email to ADD for -H, -l error, -l selftest, and -f

Looking closer at the -s directive:

T/MM/DD/d/HH
  • T is the type of the test, L for a long self-test, S for a short self-test.
  • MM is the month of the year, expressed with two decimal digits.
  • DD is the day of the month, expressed with two decimal digits.
  • d is the day of the week, expressed with one decimal digit.
    • The range is from 1 (Monday) to 7 (Sunday) inclusive.
  • HH is the hour of the day, written with two decimal digits, and given in hours after midnight.
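
To get a feel for how a schedule expression lines up with a given date, you can approximate the matching with grep. This is only a sketch: smartd does its own matching internally, and I am assuming here that the expression has to match the whole T/MM/DD/d/HH string, which is what grep -x enforces. The function name schedule_matches is hypothetical, and it only checks the S and L test types used in this post:

```shell
#!/bin/bash

# schedule_matches REGEX [STAMP]  (hypothetical helper)
# STAMP is MM/DD/d/HH and defaults to the current time.
# Print S and/or L for each test type whose T/MM/DD/d/HH string matches REGEX.
schedule_matches() {
    local regex="$1" stamp="${2:-$(date +%m/%d/%u/%H)}" type
    for type in S L; do
        # -E: extended regular expressions, -x: the whole line must match
        if printf '%s\n' "$type/$stamp" | grep -Eqx -- "$regex"; then
            echo "$type"
        fi
    done
}

# Usage: which tests would the first drive start on a Monday at 04?
# schedule_matches '(S/../../1/04|L/../01/./01)' '02/12/1/04'
```

The date format relies on %u, which numbers the weekdays from 1 (Monday) to 7 (Sunday), the same convention as the d field.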

To test email delivery, you can add the directive -M test, which sends a single test email immediately upon smartd startup.

Save the file, and restart smartd:

$ sudo systemctl restart smartd

With this configuration, every drive gets one short test per week: four drives are tested each day from Monday to Thursday, one per hour between 4 and 7 AM. Every drive also gets one long test per month, spread over the 1st to the 16th, at 1 AM. Any problems are mailed to root@localhost, which is delivered to my local mail server.

To show all upcoming scheduled tests, use this command:

$ sudo smartd -q showtests

Test results

To print out the test results for all drives with scheduled scans, I’m using the script below. It prints drive family, model, serial number, and test results for all drives defined in /etc/smartd.conf:

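#!/bin/bash

# Collect the drive paths from the active (uncommented) lines in smartd.conf
DISKS=$(awk '$1 ~ /^\/dev\/disk\// {print $1}' /etc/smartd.conf)

for disk in $DISKS; do
    echo "$disk"
    # Print the identity fields, then the self-test log
    sudo smartctl -i "$disk" | grep -E "Model Family|Device Model|Serial Number"
    sudo smartctl -l selftest "$disk"
    echo ""
done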
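
If you want a script to react to problems instead of just printing logs, smartctl also reports them through its exit status, which is a bitmask documented in the smartctl man page. The sketch below decodes that status into readable lines. The function name decode_smartctl_status is hypothetical, the messages are my paraphrase of the man page, and some bits are only set for ATA drives or when the matching option (such as -H or -l selftest) is given:

```shell
#!/bin/bash

# decode_smartctl_status STATUS  (hypothetical helper)
# Print one line for each bit set in a smartctl exit status (0-255).
decode_smartctl_status() {
    local status="$1" bit=0
    while [ "$bit" -le 7 ]; do
        # Test whether this bit is set in the status value
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            case "$bit" in
                0) echo "command line did not parse" ;;
                1) echo "device open failed" ;;
                2) echo "a SMART command failed or returned bad data" ;;
                3) echo "SMART status: DISK FAILING" ;;
                4) echo "prefail attributes at or below threshold" ;;
                5) echo "attributes were at or below threshold in the past" ;;
                6) echo "device error log contains errors" ;;
                7) echo "self-test log contains errors" ;;
            esac
        fi
        bit=$(( bit + 1 ))
    done
}

# Usage inside the loop above:
# sudo smartctl -H -l selftest "$disk" > /dev/null
# decode_smartctl_status "$?"
```

An exit status of 0 prints nothing, so empty output means smartctl had nothing to complain about for the checks you asked for.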

Conclusion

Catch failing drives before they die — pay attention to S.M.A.R.T. errors and test results. And set up email notifications!

Last commit 2024-04-05, with message: Tag cleanup.