I am writing this while running ddrescue on a drive with unrecoverable read errors. The internet says that SMART monitoring can warn about a failing drive before ddrescue becomes necessary. The tools of choice are the smartmontools.
There are two programs: smartctl, which issues SMART commands on demand, and smartd, a daemon that watches drives.
There are various types of self-tests available:
short: runs in under 10 minutes. Checks electrical, mechanical and read performance.
long: may run up to several hours. Basically a longer version of the short test. Where the short test may only sample the platter surface, this test reads all of it.
conveyance: runs in minutes. Tries to detect damage incurred during transportation of the drive.
select: tests a range of disk blocks.
All of these tests can be run while the disk is in use. A slowdown in performance might be noticeable (especially for the long or select tests).
Of interest to me are the short and long tests.
To start a test, use smartctl -t short /dev/sdX. The expected runtime will be displayed. Once the test has finished, the result can be viewed with smartctl -l selftest /dev/sdX.
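The two commands above can be wrapped in small shell functions. This is only a sketch: the function names start_selftest and show_selftest_log are made up for this post, /dev/sdX remains a placeholder for the real device, and smartctl usually has to be run as root. The select test is left out because it needs a block range as an extra argument.

```shell
#!/bin/sh
# Hypothetical helpers around the two smartctl calls from the text.

# start_selftest TYPE DEVICE
# Starts a self-test, e.g.: start_selftest short /dev/sdX
start_selftest() {
    case "$1" in
        short|long|conveyance) ;;
        *) echo "unknown test type: $1" >&2; return 2 ;;
    esac
    smartctl -t "$1" "$2"
}

# show_selftest_log DEVICE
# Prints the self-test log, which lists the results of finished tests.
show_selftest_log() {
    smartctl -l selftest "$1"
}
```

After sourcing the file, start_selftest long /dev/sdX starts a long test and show_selftest_log /dev/sdX shows the result once the announced runtime has passed.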
The daemon can be configured to run tests on all or selected drives at certain times. Its configuration is stored in /etc/smartd.conf. Basically, every line names a drive, followed by options that say what to monitor, when to run which tests and what to do in case of errors or failures. Consult smartd.conf(5) for a full reference.
# -a       is a shortcut for the following:
#          '-H -f -t -l error -l selftest -l selfteststs -C 197 -U 198'
# -H       check SMART health status. If this one fails the drive might
#          die in the next 24 hours
# -f       check whether usage attributes are at or below their threshold
# -t       report any time a prefail or usage attribute changed its value
#          since the last check
# -l       report increases in the number of errors in one of the SMART
#          logs (error, xerror, selftest) or changes of the self-test
#          execution status (selfteststs)
# -C 197   report if the current number of pending sectors is non-zero. A
#          pending sector is a sector which failed to read and is marked
#          to be reallocated
# -U 198   report if the number of offline uncorrectable sectors is
#          non-zero. A sector is marked as offline uncorrectable if it
#          failed to read during a self-test
# -n       do not check the drive if it is in the specified power state
#          (never, sleep, standby, idle). Checking a drive usually spins
#          it up, so this option can be used to conserve power. An
#          optional number after a comma limits how many checks may be
#          skipped in a row
# -s REGEX run a test when the string T/MM/DD/d/HH (12 characters) for the
#          current time matches the supplied extended regex
#          - T  type of test: L (long), S (short), C (conveyance),
#               O (offline immediate)
#          - MM month (01-12). Has to be 2 digits or else it doesn't match
#          - DD day of month (01-31). Has to be 2 digits or else it
#               doesn't match
#          - d  day of week (1 [Monday] to 7 [Sunday])
#          - HH hour of day (00-23), counted in hours after midnight. Has
#               to be 2 digits
# -m       send a warning email to an address if a failure or new error is
#          detected (-H, -l, -f, -C, -O)
#
# Check the health status, skip the check while the drive is in standby
# (at most 12 times in a row), run a short test every Monday at 03:00 and
# a long test on the 1st of every month at 04:00.
/dev/sda -H -n standby,12 -s (S/../../1/03|L/../01/./04)
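To check a -s schedule before handing it to smartd, the matching can be imitated in the shell. The sketch below builds the T/MM/DD/d/HH string for the current hour with date and tests it against the example schedule, written with a two-digit hour (04). It is not how smartd itself is implemented: the function name schedule_matches is made up, and I am assuming that the expression has to match the whole 12-character string, which is why grep is called with -x.

```shell
#!/bin/sh
# schedule_matches REGEX STRING
# Succeeds if the extended regex matches the whole string
# (assumption: smartd also requires a match of the full string).
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -e "$1"
}

# Example schedule: short test on Mondays at 03:00,
# long test on the 1st of every month at 04:00.
regex='(S/../../1/03|L/../01/./04)'

# MM/DD/d/HH for the current local time; %u is the day of week, 1 = Monday.
now=$(date +%m/%d/%u/%H)

# Try every test type against the current hour.
for type in L S C O; do
    if schedule_matches "$regex" "$type/$now"; then
        echo "test $type would be due at $now"
    fi
done
```

Calling schedule_matches directly with a hand-written string such as S/06/10/1/03 shows whether a particular date and hour would trigger a test. A one-digit hour in the expression can never match under this assumption, because the string always carries two digits for HH.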
TODO: Set up a local mail server so that emails by smartd are delivered in a user mailbox.
man pages (smartctl(8), smartd(8), smartd.conf(5))