I’ve been planning to enable scheduled S.M.A.R.T. scans ever since I built my file server, but it was just one of those things that I never got around to doing.
Well — when one of the 8 TB drives suddenly started reporting S.M.A.R.T. errors, the scheduled scans got back on the agenda. So let’s do that now 👇
Scheduled S.M.A.R.T. scans are performed by smartd:
smartd is a daemon that monitors the Self-Monitoring, Analysis and Reporting Technology (SMART) system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive and predict drive failures, and to carry out different types of drive self-tests. — smartd man page
smartd
First, let’s figure out which drives we want to scan. You can use smartctl to scan for drives:
$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sda is my boot SSD, while the rest are drives in the ZFS pool. I don’t like to use /dev/sdX when referencing drives, because the name can change between boots; I prefer the disk ID. To find the ID, you can list the contents of /dev/disk/by-id/ and filter on sdX, like this:
$ ls -l /dev/disk/by-id/ | grep -E "sdb$"
lrwxrwxrwx 1 root root 9 Feb 10 23:15 scsi-xxxxxxxxxxxxxxxxx -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 10 23:15 scsi-SATA_WDC_WD40EFRX-68N_WD-xxxxxxxxxxxx -> ../../sdb
lrwxrwxrwx 1 root root 9 Feb 10 23:15 wwn-0x500xxxxxxxxxxxxx -> ../../sdb
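Repeating that lookup for every drive gets tedious, so here is a minimal sketch that prints the wwn-* link for every whole disk in one go. The function name list_wwn_ids and its optional directory argument are my own inventions for illustration, not part of any tool:

```shell
#!/bin/bash
# list_wwn_ids [DIR]: print "<kernel name> <by-id path>" for every wwn-* link
# in DIR (default: /dev/disk/by-id) that points at a whole disk.
# Hypothetical helper; the name and the DIR argument are assumptions.
list_wwn_ids() {
    local dir="${1:-/dev/disk/by-id}"
    local link target
    for link in "$dir"/wwn-*; do
        # Skip the unexpanded glob if nothing matched
        [ -L "$link" ] || continue
        # Skip partition links such as wwn-0x...-part1
        case "${link##*/}" in *-part[0-9]*) continue ;; esac
        # Resolve "../../sdb" to "sdb"
        target=$(basename "$(readlink "$link")")
        printf '%s %s\n' "$target" "$link"
    done | sort
}
```

Calling list_wwn_ids without arguments should list the real drives; passing a directory is mostly useful for trying it out against a scratch folder of symlinks.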
Once I had all the drive IDs, I opened the smartd configuration file:
$ sudo vim /etc/smartd.conf
First I had to comment out the line containing DEVICESCAN, because:
The word DEVICESCAN will cause any remaining lines in this configuration file to be ignored: it tells smartd to scan for all ATA and SCSI devices.
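A quick way to confirm that no active DEVICESCAN line is left behind is to grep for one that is not commented out. This is only a small sketch; the function name has_active_devicescan and the file argument are made up for the example:

```shell
#!/bin/bash
# has_active_devicescan FILE: succeed if FILE contains a DEVICESCAN line
# that is not commented out. Hypothetical helper, not part of smartmontools.
has_active_devicescan() {
    grep -Eq '^[[:space:]]*DEVICESCAN([[:space:]]|$)' "$1"
}

# Example use:
#   has_active_devicescan /etc/smartd.conf && echo "DEVICESCAN is still active"
```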
Then I added my drives and their configuration to the end of the file:
# vdev 0
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/04|L/../01/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/05|L/../02/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/06|L/../03/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../1/07|L/../04/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/04|L/../05/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/05|L/../06/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/06|L/../07/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../2/07|L/../08/./01) -m root@localhost
# vdev 1
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/04|L/../09/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/05|L/../10/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/06|L/../11/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../3/07|L/../12/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/04|L/../13/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/05|L/../14/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/06|L/../15/./01) -m root@localhost
/dev/disk/by-id/wwn-0x5000xxxxxxxxxxxx -a -d scsi -s (S/../../4/07|L/../16/./01) -m root@localhost
Let’s look at what the different directives mean:
-a: Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
-d TYPE: Set the device type: ata, scsi, marvell, removable, 3ware, N, hpt, L/M/N
-s REGE: Start self-test when type/date matches regular expression (see below)
-m ADD: Send warning email to ADD for -H, -l error, -l selftest, and -f
Looking closer at the -s directive:
T/MM/DD/d/HH
T is the type of the test: L for a long self-test, S for a short self-test.
MM is the month of the year, expressed with two decimal digits.
DD is the day of the month, expressed with two decimal digits.
d is the day of the week, expressed with one decimal digit. The range is from 1 (Monday) to 7 (Sunday) inclusive.
HH is the hour of the day, written with two decimal digits, and given in hours after midnight.
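To get a feel for how such a pattern is evaluated, here is a rough sketch that checks whether a T/MM/DD/d/HH string matches a -s expression. As far as I understand the man page, smartd compares strings of this form against the regular expression; the function name matches_schedule is made up, and bash’s extended regex matching only stands in for whatever smartd does internally:

```shell
#!/bin/bash
# matches_schedule REGEX STRING: succeed if STRING (in T/MM/DD/d/HH form)
# matches REGEX as a whole. Hypothetical helper; bash's [[ =~ ]] uses POSIX
# extended regular expressions and only approximates smartd's matching.
matches_schedule() {
    local regex="^($1)\$"
    [[ $2 =~ $regex ]]
}

pattern='(S/../../1/04|L/../01/./01)'

# A Monday (d=1) at 04:00: the short test is due
matches_schedule "$pattern" "S/02/12/1/04" && echo "short test due"
# The 1st of a month at 01:00, any weekday: the long test is due
matches_schedule "$pattern" "L/02/01/4/01" && echo "long test due"
# A Tuesday (d=2) at 04:00: nothing is due for this drive
matches_schedule "$pattern" "S/02/13/2/04" || echo "nothing due"
```

With GNU date, the string for the current hour can be built with date +"S/%m/%d/%u/%H", since %u prints the weekday as 1 (Monday) to 7 (Sunday).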
To verify that email delivery works, you can also add -M test, which will send a single test email immediately upon smartd startup.
Save the file, and restart smartd:
$ sudo systemctl restart smartd
The configuration we just added runs short tests on four drives per day on Monday, Tuesday, Wednesday, and Thursday, one drive per hour from 4 to 7 AM. Long tests run on one drive per day from the 1st to the 16th of each month, at 1 AM. Any problems are mailed to root@localhost, which is delivered to my local mail server.
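Writing sixteen near-identical lines by hand is error-prone, so the staggering can also be generated. The sketch below assumes the same layout as my config: four drives per weekday starting Monday at 04:00, and one long test per day of the month starting on the 1st. The function name gen_smartd_lines is hypothetical, and the drive IDs are read from standard input, one per line:

```shell
#!/bin/bash
# gen_smartd_lines: read drive IDs (one per line) from stdin and print one
# smartd.conf line per drive with staggered self-test schedules:
#   short test: 4 drives per weekday (1 = Monday), hours 04..07
#   long test:  day of month 01, 02, ... (one drive per day), at 01:00
# Hypothetical helper; only sensible for up to 28 drives, after which the
# weekday would exceed 7 and the day of month would exceed 28.
gen_smartd_lines() {
    local id i=0 weekday hour day
    while IFS= read -r id; do
        [ -n "$id" ] || continue            # ignore empty lines
        weekday=$(( i / 4 + 1 ))
        hour=$(( i % 4 + 4 ))
        day=$(( i + 1 ))
        printf '%s -a -d scsi -s (S/../../%d/%02d|L/../%02d/./01) -m root@localhost\n' \
            "$id" "$weekday" "$hour" "$day"
        i=$(( i + 1 ))
    done
}
```

Feed it the list of /dev/disk/by-id/ paths and review the output before pasting it into /etc/smartd.conf.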
To show all upcoming scheduled tests, use this command:
$ sudo smartd -q showtests
Test results
To print the test results for all drives with scheduled scans, I’m using the script below. It prints drive family, model, serial number, and test results for all drives defined in /etc/smartd.conf:
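#!/bin/bash
# Collect the device path (first field) of every active /dev/disk line in
# smartd.conf; commented-out lines are skipped.
DISKS=$(awk '/^\/dev\/disk/ {print $1}' /etc/smartd.conf)
for disk in $DISKS; do
    echo "$disk"
    # Identify the drive
    sudo smartctl -i "$disk" | grep -E "Model Family|Device Model|Serial Number"
    # Print the self-test log
    sudo smartctl -l selftest "$disk"
    echo ""
done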
Conclusion
Catch failing drives before they die — pay attention to S.M.A.R.T. errors and test results. And set up email notifications!
Last commit 2024-11-11, with message: Add lots of tags to posts.