smartmontools

With respect to hard drives, the acronym "SMART" stands for Self-Monitoring, Analysis and Reporting Technology. This was built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. Basically anything after about 2005 should have it.

Installation

Ubuntu/Debian:

sudo apt-get install smartmontools

CentOS/Fedora/RH:

sudo yum install smartmontools

Gentoo:

sudo emerge sys-apps/smartmontools

Or go to the source.

smartctl

The program smartctl is used to interface with the SMART features on the drive firmware. Here are a couple of easy things to get started with (however some versions do not have the --scan option):

$ smartctl --scan -d ata
/dev/hda -d ata # /dev/hda, ATA device
/dev/hdc -d ata # /dev/hdc, ATA device
$ sudo smartctl --info /dev/hdc
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.33.1-xedvia] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3160023A
Serial Number:    5JS9MDKW
Firmware Version: 8.01
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Thu Feb  7 09:27:18 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

Note that the "SMART support" is listed as available but disabled. To enable full diagnostic checking turn it on with something like this:

$ sudo smartctl --smart=on --offlineauto=on --saveauto=on /dev/hdc
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

In theory this only needs to be done once, since the drive is supposed to remember its SMART settings across power cycles. The saveauto directive tells the drive to save its attribute values automatically, and offlineauto schedules automatic offline testing every four hours. The drive should wait "nicely" if it is already busy, so performance should not be seriously affected.
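
To check whether the setting stuck, the "SMART support is:" lines shown earlier can be tested from a script. Here is a minimal sketch; the function name smart_state is my own invention (not part of smartmontools) and it only looks at the ATA-style lines shown above.

```shell
#!/bin/bash
# smart_state: read `smartctl --info` output on stdin and print one of
# "enabled", "disabled" or "unavailable", based on the
# "SMART support is:" lines. (Hypothetical helper, not a smartmontools tool.)
smart_state() {
    local report
    report=$(cat)
    if ! printf '%s\n' "$report" | grep -q '^SMART support is: *Available'; then
        echo unavailable
    elif printf '%s\n' "$report" | grep -q '^SMART support is: *Enabled'; then
        echo enabled
    else
        echo disabled
    fi
}

# Usage (device name is only an example):
#   sudo smartctl --info /dev/hdc | smart_state
```

Anything other than "enabled" means it is worth running the --smart=on command again, perhaps from a start-up script.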

Testing

Here is how to print everything SMART knows about a drive. This only produces a report; it does not start a test.

$ sudo smartctl -a /dev/sdb

That should produce a table (among other things) that looks like this.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   094   094   050    Pre-fail  Always       -       171774027
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       44951 (46 68 0)
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       4
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       8
181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   032   038   000    Old_age   Always       -       32 (Min/Max 23/38)
195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       171774027
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      -       171774027
204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       171774027
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       9071
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       1588
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       1588
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       15296

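Note that the Raw_Read_Error_Rate’s RAW_VALUE is astronomically high. That looks alarming, but raw values are encoded differently by each manufacturer, and here the normalized VALUE (094) is still well above THRESH (050). Read my helpful blog post to learn more about the subtleties of interpreting this stuff.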
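
Since the raw numbers are so hard to judge, one thing a script can check reliably is whether any normalized VALUE has dropped to or below its THRESH, which is smartctl's own definition of a failed attribute. A rough sketch, assuming the ten-column table layout shown above; the name failing_attrs is made up for illustration.

```shell
#!/bin/bash
# failing_attrs: read `smartctl -a` output on stdin and print
# "NAME VALUE THRESH" for every attribute whose normalized VALUE is at or
# below a nonzero THRESH. A THRESH of 000 means the attribute never fails,
# so those rows are skipped. (Hypothetical helper name.)
failing_attrs() {
    awk '
        # Attribute rows: numeric ID, name, hex flag, then the value columns.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ && NF >= 10 {
            value  = $4 + 0
            thresh = $6 + 0
            if (thresh > 0 && value <= thresh)
                print $2, value, thresh
        }'
}

# Usage (device name is only an example):
#   sudo smartctl -a /dev/sdb | failing_attrs
```

No output means no attribute has crossed its threshold, which, as the rest of these notes show, is not the same thing as a healthy drive.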

There are more proactive tests too. Here’s a way to run a "short" off-line test. This tests electrical and mechanical performance of the drive and does read testing.

$ sudo smartctl --test=short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Thu Feb  7 10:13:19 2013
Use smartctl -X to abort test.

$ sudo smartctl --log=selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     43398        -

$ sudo smartctl --log=selftest /dev/hdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     37994         7234643

The first command starts the test and tells you how long to wait (1 minute here). The next two commands query each drive’s self-test log to see if anything bad came up. In this case hda was fine ("Completed without error") but hdc logged a very important "read failure". Replace that drive ASAP!
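
The log check is easy to automate. This is a small sketch, assuming log entries start with "#" and a number as in the output above; selftest_failures is a name I made up. It treats any status other than "Completed without error" as suspicious, which includes tests that are still in progress or were aborted.

```shell
#!/bin/bash
# selftest_failures: read `smartctl --log=selftest` output on stdin, print
# every log entry whose status is not "Completed without error", and return
# non-zero if there was at least one. (Hypothetical helper name.)
selftest_failures() {
    awk '
        /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }'
}

# Usage (device name is only an example):
#   sudo smartctl --log=selftest /dev/hdc | selftest_failures \
#       || echo "self-test problems logged"
```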

Interpretation

The feedback from smartctl is amazingly hard to interpret. I think this is mostly a result of every drive manufacturer having different ideas of what should be checked and what’s important. Here are scans of two identical drives installed in the same machine. I know one of these drives (A) has problems because it reliably corrupts the filesystem; the other (B) seems to be working fine. This allows a very helpful comparison because it is often difficult to tell what separates good values from bad ones. The raw values for A that I think are problematic are named after the table. Compare them with the raw values for B.

(P=Pre-fail, O=Old_age; A=Always, O=Offline; V=VALUE, W=WORST, T=THRESH, R=RAW_VALUE)

ATTRIBUTE_NAME           TYPE  UPDATED  V A  W A  T A    R A  V B  W B  T B    R B
Raw_Read_Error_Rate      P     A        200  200  051   2848  200  200  051      0
Spin_Up_Time             P     A        146  137  021   3666  149  141  021   3525
Start_Stop_Count         O     A        100  100  000    266  100  100  000    816
Reallocated_Sector_Ct    P     A        200  200  140      0  200  200  140      0
Seek_Error_Rate          O     A        200  200  000      0  200  200  000      0
Power_On_Hours           O     A        049  049  000  37662  063  063  000  27656
Spin_Retry_Count         O     A        100  100  000      0  100  100  000      0
Calibration_Retry_Count  O     A        100  100  000      0  100  100  000      0
Power_Cycle_Count        O     A        100  100  000    264  100  100  000    629
Power-Off_Retract_Count  O     A        200  200  000     66  200  200  000    127
Load_Cycle_Count         O     A        200  200  000    199  200  200  000    688
Temperature_Celsius      O     A        111  075  000     32  112  091  000     31
Reallocated_Event_Count  O     A        200  200  000      0  200  200  000      0
Current_Pending_Sector   O     A        200  200  000     31  200  200  000      0
Offline_Uncorrectable    O     O        200  200  000     10  200  200  000      0
UDMA_CRC_Error_Count     O     A        200  200  000      0  200  200  000      0
Multi_Zone_Error_Rate    O     O        187  149  000   2671  200  200  000      0

Note that the RAW_VALUEs for "Raw_Read_Error_Rate", "Offline_Uncorrectable", and "Multi_Zone_Error_Rate" seem to be reliable indicators of a problem. "Current_Pending_Sector" is also nonzero on A (31) and zero on B.
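
If you have saved the `smartctl -a` reports for two drives (or two reports for the same drive taken at different times), this kind of side-by-side comparison can be scripted. The following is only a sketch: raw_diff is a hypothetical name, it assumes the standard ten-column attribute table, and it compares only the first word of each RAW_VALUE. Expect harmless differences too (Power_On_Hours, Temperature_Celsius and so on); deciding which ones matter is still up to you.

```shell
#!/bin/bash
# raw_diff FILE_A FILE_B: print "NAME RAW_A RAW_B" for every attribute whose
# RAW_VALUE differs between two saved `smartctl -a` reports. Attributes are
# matched by ID number because names such as Unknown_Attribute can repeat.
# (Hypothetical helper name.)
raw_diff() {
    awk '
        # Attribute rows: numeric ID, name, hex flag, ..., raw value in $10.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 {
            if (FNR == NR) { raw[$1] = $10; next }      # first file: remember
            if (($1 in raw) && raw[$1] != $10)          # second file: compare
                print $2, raw[$1], $10
        }' "$1" "$2"
}

# Usage (file names are only examples):
#   raw_diff smart-driveA.txt smart-driveB.txt
```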

Aside from this table and the attributes "Serial Number", "LU WWN Device Id", "Local Time is", "data collection", and "recommended polling time" there were no other differences between the reports for these drives. Note that both drives reported this:

SMART overall-health self-assessment test result: PASSED

I would take this to mean that if you get a "FAILED" here, that is very bad.
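
For a monitoring script, that line is simple to test for. A tiny sketch, assuming the exact ATA wording shown above (health_passed is a name I made up); remember that the known-bad drive A also said PASSED, so this catches only the worst cases.

```shell
#!/bin/bash
# health_passed: read smartctl output on stdin and succeed only if it
# contains the overall-health line with the result PASSED.
# (Hypothetical helper name; checks only the ATA-style wording.)
health_passed() {
    grep -q 'SMART overall-health self-assessment test result: PASSED'
}

# Usage (device name is only an example):
#   sudo smartctl --health /dev/sda | health_passed || echo "Back up NOW"
```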

Here’s another important difference between the good drive (B) and the bad drive (A). This shows the difference between test results as described above.

$ sudo smartctl --test=short /dev/sda
$ sudo smartctl --test=short /dev/sdb ; sleep 120
$ sudo smartctl --log=selftest /dev/sda | grep -A1 Status
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     37663         293137097
$ sudo smartctl --log=selftest /dev/sdb | grep -A1 Status
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27657         -
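
The "sleep 120" above is a guess at how long the short test takes. Since smartctl prints a "Please wait N minutes" line when it starts a test, a script could use that instead. A minimal sketch, assuming that wording; selftest_wait_seconds is a hypothetical name and it falls back to 120 seconds if the line is missing.

```shell
#!/bin/bash
# selftest_wait_seconds: read `smartctl --test=...` output on stdin and print
# how many seconds to wait, taken from the "Please wait N minutes" line.
# Prints 120 as a fallback when no such line is found.
# (Hypothetical helper name.)
selftest_wait_seconds() {
    awk '
        /^Please wait [0-9]+ minutes? for test to complete/ {
            print $3 * 60; found = 1; exit
        }
        END { if (!found) print 120 }'
}

# Usage (device name is only an example):
#   sleep "$(sudo smartctl --test=short /dev/sdb | selftest_wait_seconds)"
#   sudo smartctl --log=selftest /dev/sdb
```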

smartstart

When a hard drive begins to act suspiciously, it can be extremely helpful to have a baseline SMART report to compare with. This program will run smartctl and rename the report based on the date and the hard drive’s serial number. It also copies the report into /lost+found which, although usually a waste of an inode, is the perfect place to store this kind of information. It’s not a terrible idea to copy it elsewhere too, but if the drive is so hosed that this report can’t be read, then you’re probably not running smartctl on it any more anyway.

smartstart
#!/bin/bash
function help { cat <<EOHELP
A simple script to save S.M.A.R.T. reports for hard drives for use
in later comparisons when units begin to fail. Saves in a filename
based on the serial number of the drive. Specify the drive,
just the "sd[a-z]" part, "/dev/" added automatically.
E.g.
    $0 sda
EOHELP
}
D=$1

# How to upgrade to root/sudo if regular user.
SUDO=''; if (( EUID != 0 )); then SUDO='sudo'; fi

# Check for such a drive. Match the whole name so "sd" does not match "sda".
if [[ -z "${D}" ]] || ! lsblk --nodeps -o NAME | tail -n+2 | grep -qxF -- "${D}" ; then
    help
    lsblk --nodeps -o NAME | tail -n+2
    exit 1
fi

# Use a safe temporary file, created only after the drive check passes.
TDIR=/tmp
T=$(mktemp --tmpdir="${TDIR}" smartstart.tmp.XXXXX)

${SUDO} smartctl -a "/dev/${D}" > "${T}"
# For sensible name, extract unique serial number formatted like this.
# Serial Number:    WD-WCAYW0003385
N=smart-$(sed -n 's/^Serial Number: *\(.*\)$/\1/p' "${T}")-$(date '+%Y%m%d')
mv "${T}" "${TDIR}/${N}"

echo "Copying to /lost+found..."
${SUDO} cp "${TDIR}/${N}" "/lost+found/${N}"
${SUDO} ls -l "${TDIR}/${N}" "/lost+found/${N}"

MHDD

I finally found out what "MHDD" stands for. I used to call it "Moscow Hard Disk Diagnostics" but Eugene from hddguru.com says:

MHDD stands for "maysoft’s hdd tool" ("maysoft" is an old nickname Dmitry used on forums). As far as I know Dmitry is from Ukraine, not from Russia :)

This tool can be tricky to get running. It can be difficult to have it detect your drive. But if you can have this tool examine your hard drive, you will get one of the best diagnostic reports about your drive possible. It basically scans every part of the drive and creates a histogram of seek times. If it takes a long time to access a patch of your hard drive, then you can be pretty sure it would be a good idea to make a backup immediately and start replacement actions. I often use this tool to see if I want to reuse old hard drives I reclaim from scrapped computers. If the scan comes back clean, there is no reason not to use the drive for low importance work.

Where To Find MHDD

I typically just use sysresccd which includes an image containing MHDD. Just get a bootloader prompt and type mhdd. If that’s not working out for you, try MHDD’s home.

Detecting Drives

Sometimes the drives are not detected. Perhaps it says something about Slave channels not being recognized. If it truly is an ancient IDE thing and you’re on a slave channel due to cabling issues, in theory you can specify the missing 5 when there is an empty 4. But that probably isn’t the problem unless you’ve time travelled back to 2005.

What worked for me was going to the BIOS and finding the AHCI (Advanced Host Controller Interface) setting. AHCI is the interface through which the system talks to the hard drive controller; the actual hard drive knows nothing about it. But software that likes simple old-fashioned IDE-style interfaces, like MHDD, gets along better without it. So switch any AHCI settings for the drives in the BIOS to IDE, run mhdd, and once you’re done, put them back. An article on the topic.

Effective Use

These instructions are quite helpful (and in English!)

Basically the program runs in an isolated DOS environment. You select the drive you want to be curious about (pressing shift-F3 will bring the drive selection menu up).

Then you can enter commands to query interesting information about the drive.

  • ID - Returns identification info about the drive.

  • EID - Returns some more information than ID including what features the drive BIOS supports.

  • CX - Perform a seek and read test. This will run indefinitely outputting a continuous average. Press ESC when you’re happy.

  • RPM - Tells you the spindle speed for this drive.

  • SMART ATT

  • SMART ERRLOG

  • SCAN - Starts a drive scan without logging it. First you get a menu that allows you to change some things about the scan. F4 is a synonym for SCAN and also starts the actual scan after you’re satisfied with the scan setup menu.

Table 1. SCAN output codes

?  TIME  VERIFY command did NOT complete within the timeout
x  UNC   Data is uncorrectable
!  ABRT  Command was aborted
*  BBK   Bad Block
S  IDNF  Sector ID cannot be read or not as expected
A  AMNF  Data Address Mark Not Found
0  TONF  Track 0 was not found during drive recalibration

More HDD Info

Other Disk Tools When You’re Screwed

Note
Other than ddrescue I’ve never used any of these tools but it seems like a good idea to know about them.
  • TestDisk helps with corrupted partition tables.

  • dcfldd is a forensics enhanced dd. See also ddrescue and dd_rescue.

  • PhotoRec snarfs out file-like data of known formats from raw physical volumes independent of filesystem. Not just for photos.