Drive Fail FTW

Nothing fills an experienced computer user with dread like a failing hard drive. I’m experienced enough to have the word when firmly stuck in my head, not if. Even if you take great care to have this entropic force of nature under control (back ups, back ups, back ups), the actual decline, as with the health of old humans, can be rather unnerving. Crazy non-deterministic stuff happens and computer people hate that above all else.

The main tool for restoring a sense of control and comprehension in this situation is Self-Monitoring Analysis and Reporting Technology, SMART, which is a part of most modern (2006+) hard drives. Unfortunately, SMART feedback is almost incomprehensible. The drive can report between one dozen and three dozen or so attributes about drive health. Unfortunately, although there are common themes, each vendor has their own particular system. For example, is a "Start_Stop_Count" of 816 good or bad? Who knows? It’s probably irrelevant.

There are some attributes that would seem to be obvious, like the ones with the word "error" in them, but even that can be kind of tricky. Sometimes you get attributes like "life left" and "seek time performance" which you’d assume should ideally be a high value. Simply looking for big numbers isn’t sufficient to find errors. And how big do those numbers have to be before you start worrying anyway?

I just discovered a very interesting hard drive failure. I know it is failing because one of the file system journals on the drive gets corrupted without ever being mounted and used. In other words, I fix it, reboot, and it’s broken again. What makes this interesting and even quite lucky is that it is installed in a computer with an identical second drive that is working fine.

This allows me to compare SMART reports of the two drives and get a very clear understanding of the difference between an old drive that is failing and an old drive that is still good. Here is a look at the two reports using vimdiff (installed by default on all sane computers).

Failing Drive | Healthy Drive

What I find most interesting about this is that the "VALUE", "WORST", and "THRESH" metrics are almost useless. Besides being comically incomprehensible, the good drive is indistinguishable from the bad using these metrics on "Raw_Read_Error_Rate" and "Offline_Uncorrectable". For the one problem metric where these are different, "Multi_Zone_Error_Rate", they make no sense at all (to me). Having these side by side has done more for helping me understand how to read this information than all other resources I’ve come across.

Of course I have no idea about other models of drives, especially solid state drives, but I imagine there’d be some common themes. The main heuristic I’m getting from this is that if you have attributes which contain the word "error" with RAW_VALUES that are not zero, pay attention to that. Everything else you can mostly ignore. For example, note that both drives reported this.

SMART overall-health self-assessment test result: PASSED

Obviously FAILED would be worse, but don’t assume your drive is good because it says "PASSED".

This side by side comparison also suggests a strategy that I had not thought of before but which, in hindsight, is obvious. When you install a new hard drive, immediately get a SMART report and put it somewhere safe. If the drive starts flaking out at a later time, it will give you something to compare with. More details about hard drive diagnostics can be found in my hard drive notes.