Notes about Smartmontools

I am writing this while running ddrescue on a drive with unrecoverable read errors. According to the internet, SMART monitoring can warn about a failing drive before ddrescue becomes necessary. The tool of choice is smartmontools.

There are two programs: smartctl, which issues SMART commands on demand, and smartd, a daemon that watches drives.

Self-Tests

There are various types of self-tests available:

short: runs in under 10 minutes. Checks electrical, mechanical and read performance.
long: may run for several hours. Basically a longer version of the short test: where the short test may only sample the platter surface, this test reads all of it.
conveyance: runs in minutes. Tries to detect damage sustained during transportation of the drive.
select: tests a range of disk blocks.

All of these tests can be run while the disk is in use. A slowdown in performance might be noticeable (especially for the long or select tests).

Of interest to me are the short and long tests.

To start a test, run smartctl -t short /dev/sdX. The expected runtime is displayed. Once the test has finished, the result can be viewed with smartctl -l selftest /dev/sdX.
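For scripting, the two commands above could be wrapped roughly as follows. This is a minimal Python sketch and not part of smartmontools: the names selftest_commands and run_smartctl are made up for illustration, /dev/sdX remains a placeholder, and the select test is left out because it needs a block range as an extra argument.

```python
import subprocess

# Test types from the list above that take no extra argument.
SELF_TEST_TYPES = ("short", "long", "conveyance")


def selftest_commands(device, test_type="short"):
    """Return (start, show_log) argument lists for the two smartctl calls."""
    if test_type not in SELF_TEST_TYPES:
        raise ValueError(f"unsupported test type: {test_type!r}")
    start = ["smartctl", "-t", test_type, device]
    show_log = ["smartctl", "-l", "selftest", device]
    return start, show_log


def run_smartctl(argv):
    """Run one smartctl command and return its standard output as text.

    The exit status is deliberately not checked: smartctl uses it as a
    bit mask that reports more than plain success or failure.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout
```

Calling run_smartctl(start) for a real device (usually as root) should return the text containing the expected runtime; run_smartctl(show_log) after that time has passed should return the self-test log.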

Daemon

The daemon can be configured to run tests on all or selected drives at certain times. Its configuration is stored in /etc/smartd.conf. Basically, every line names a drive followed by options for what to monitor, when to run which tests, and what to do in case of errors or failures. Consult smartd.conf(5) for a full reference.

# -a is a shortcut for the following: '-H -f -t -l error -l selftest -l
#    selfteststs -C 197 -U 198'
# -H check SMART health status. If this one fails the drive might die in
#    the next 24 hours
# -f check whether any usage attribute is at or below its threshold.
# -t report any time a prefail or usage attribute changed its value
#    since the last check
# -l report increases in the number of errors in one of the SMART logs
#    (error, xerror, selftest); selfteststs reports changes in the
#    self-test execution status
# -C 197 report if the current number of pending sectors is non-zero. A
#    pending sector is a sector which failed to read and is marked to be
#    reallocated
# -U 198 report if the number of offline uncorrectable sectors is
#    non-zero. A sector is marked as offline uncorrectable if it
#    failed to read during a self-test
# -n Do not check the drive if it is in a specified power state (never,
#    sleep, standby, idle). Checking a drive usually spins it up, so
#    this option can be used to conserve power
# -s REGEX run a test if it matches the supplied extended regex. The
#    string has 12 characters: T/MM/DD/d/HH
#    - T type of test: L (long), S (short), C (conveyance), O (offline
#      immediate)
#    - MM month (01-12). Has to be 2 digits or else it doesn't match
#    - DD day of month (01-31). Has to be 2 digits or else it doesn't match
#    - d day of week (1 [Monday] to 7 [Sunday])
#    - HH hour of day. Has to be 2 digits, means hours after midnight
# -m Send a warning email to an address if a failure or new error is
#    detected (-H, -l, -f, -C, -O)
/dev/sda -H -n standby,12 -s (S/../../1/03|L/../01/./04)
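To see how a -s regex lines up with the T/MM/DD/d/HH string, the matching can be imitated in Python. This is only a sketch under assumptions: schedule_string and due_tests are hypothetical helpers, Python's re module stands in for the extended regex matching done by smartd, and re.fullmatch assumes that the whole 12-character string has to match. The example regex asks for a short test every Monday at 03:00 and a long test on the first of each month at 04:00.

```python
import re
from datetime import datetime


def schedule_string(test_type, when):
    """Build the 12-character T/MM/DD/d/HH string for one test type."""
    # isoweekday() counts Monday as 1 and Sunday as 7, like the d field.
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"


def due_tests(regex, when, test_types="LSCO"):
    """Return the test types whose schedule string matches the regex."""
    pattern = re.compile(regex)
    return [t for t in test_types
            if pattern.fullmatch(schedule_string(t, when))]


# Short test every Monday at 03:00, long test on the 1st at 04:00.
REGEX = r"(S/../../1/03|L/../01/./04)"

# 1 January 2024 was a Monday.
print(schedule_string("S", datetime(2024, 1, 1, 3)))  # S/01/01/1/03
print(due_tests(REGEX, datetime(2024, 1, 1, 3)))      # ['S']
print(due_tests(REGEX, datetime(2024, 1, 1, 4)))      # ['L']
print(due_tests(REGEX, datetime(2024, 1, 2, 3)))      # []
```

An hour field written as a single digit (4 instead of 04) can never match, because the generated string always carries two digits for the hour.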

TODO: Set up a local mail server so that emails from smartd are delivered to a user mailbox.

Sources

man pages (smartctl(8), smartd(8), smartd.conf(5))
Monitoring Hard Disks with SMART