Hard Disk Diagnostics

smartmontools
MHDD
More HDD Info
Other Disk Tools When You’re Screwed
System Rescue CD

Data Moving Speeds

10BASE-T	10Mb/s	.01Gb/s
100BASE-T	100Mb/s	.1Gb/s
USB2	480Mb/s	.48Gb/s
1000BASE-T	1000Mb/s	1Gb/s
SATA1	1500Mb/s	1.5Gb/s
SATA2	3000Mb/s	3Gb/s
RPi Cam	3628Mb/s	3.6Gb/s
USB3	5000Mb/s	5Gb/s
SATA3	6000Mb/s	6Gb/s
10GBASE-T	10000Mb/s	10Gb/s
USB3.1	10000Mb/s	10Gb/s
USB3.2	20000Mb/s	20Gb/s

Raspberry Pi camera maximum = 24bit x 2592w x 1944h x 30fps
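
To check the camera figure in the table, here is a minimal sketch of the arithmetic in shell. The function name stream_bps is just something I made up for illustration, and it assumes an uncompressed stream with no protocol overhead.

```shell
# Bits per second for an uncompressed video stream.
# Arguments: bits-per-pixel width height frames-per-second
stream_bps() {
    echo $(( $1 * $2 * $3 * $4 ))
}

bps=$(stream_bps 24 2592 1944 30)
echo "${bps} b/s"
# Round to the nearest whole Mb/s (1 Mb = 1,000,000 bits).
echo "$(( (bps + 500000) / 1000000 )) Mb/s"
```

The same function works for any other resolution or frame rate you want to compare against the link speeds above.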

SDCard Experiments

smartmontools

With respect to hard drives, the acronym "SMART" stands for Self-Monitoring, Analysis and Reporting Technology. This was built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. Basically anything after about 2005 should have it.

Installation

Ubuntu/Debian:

sudo apt-get install smartmontools

CentOS/Fedora/RH:

sudo yum install smartmontools

Gentoo:

sudo emerge sys-apps/smartmontools

Or go to the source.

smartctl

The program smartctl is used to talk to the SMART features in the drive firmware. Here are a couple of easy things to get started with (note that some versions do not have the --scan option):

$ smartctl --scan -d ata
/dev/hda -d ata # /dev/hda, ATA device
/dev/hdc -d ata # /dev/hdc, ATA device
$ sudo smartctl --info /dev/hdc
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.33.1-xedvia] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model:     ST3160023A
Serial Number:    5JS9MDKW
Firmware Version: 8.01
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Thu Feb  7 09:27:18 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

Note that the "SMART support" is listed as available but disabled. To enable full diagnostic checking turn it on with something like this:

$ sudo smartctl --smart=on --offlineauto=on --saveauto=on /dev/hdc
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.

In theory this only needs to be done once and the drive should remember it. The saveauto directive turns on attribute autosave, so the drive keeps its attribute values across power cycles, and offlineauto causes automatic offline testing every 4 hours. The testing is supposed to wait "nicely" if the drive is already busy, so performance should not be seriously impacted.
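
If you want to check whether a drive actually remembered the setting, something like this sketch can pick the state out of the --info output. The function name smart_state is hypothetical, and it assumes the "SMART support is:" lines look like the ones shown above.

```shell
# Reads "smartctl --info" output on stdin.
# Prints "enabled", "disabled", or "unavailable".
smart_state() {
    awk '
        /^SMART support is:/ && /Enabled/  { state = "enabled" }
        /^SMART support is:/ && /Disabled/ { state = "disabled" }
        END { print (state == "" ? "unavailable" : state) }
    '
}
```

Use it as a filter, for example: sudo smartctl --info /dev/hdc | smart_state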

Temperature

It can be interesting to see how hot the drive is getting. With smartmontools installed, something like this prints the temperature in degrees Celsius.

$ sudo smartctl -a /dev/sdb | grep Temperature_Celsius | awk '{print $10}'
30

Note that not every SMART report may contain a temperature field.
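
Here is a slightly more defensive sketch that fails cleanly when there is no temperature field. The function name drive_temp is made up, and checking Airflow_Temperature_Cel as a second attribute name is my assumption about what some drives report instead; adjust the names for your drive.

```shell
# Reads "smartctl -a" output on stdin and prints the raw value of the
# first temperature attribute found. Returns 1 if there is none.
drive_temp() {
    awk '
        $2 == "Temperature_Celsius" || $2 == "Airflow_Temperature_Cel" {
            print $10
            found = 1
            exit
        }
        END { exit !found }
    '
}
```

For example: sudo smartctl -a /dev/sdb | drive_temp || echo "no temperature reported"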

Testing

Here is how to do the simplest check, which just dumps everything the drive reports.

$ sudo smartctl -a /dev/sdb

That should produce a table (among other things) that looks like this.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   094   094   050    Pre-fail  Always       -       171774027
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       44951 (46 68 0)
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       4
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       8
181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   032   038   000    Old_age   Always       -       32 (Min/Max 23/38)
195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       171774027
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      -       171774027
204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       171774027
230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       9071
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       1588
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       1588
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       15296

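Note that the Raw_Read_Error_Rate’s RAW_VALUE is astronomically high. That looks bad, but on some drives this raw value is a vendor-specific encoding rather than a simple error count, so it is worth comparing against a known good drive of the same model. Read my helpful blog post to learn more about the subtleties of interpreting this stuff.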

There are more proactive tests too. Here’s a way to run a "short" off-line test. This tests electrical and mechanical performance of the drive and does read testing.

$ sudo smartctl --test=short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Thu Feb  7 10:13:19 2013
Use smartctl -X to abort test.

$ sudo smartctl --log=selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     43398        -

$ sudo smartctl --log=selftest /dev/hdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     37994         7234643

The first command starts the test and tells you to come back in a minute or two. The following commands query the self-test log to see if anything bad came up. In this case hda was fine ("Completed without error") but hdc had a very important "read failure". Replace that drive ASAP!
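
For scripting, the log check can be reduced to a pass or fail. This is only a sketch: the function name latest_selftest_ok is hypothetical, and it assumes the newest entry is the line starting with "# 1" as in the output above.

```shell
# Reads "smartctl --log=selftest" output on stdin.
# Prints the newest log entry ("# 1") and returns 0 only if that
# entry says "Completed without error".
latest_selftest_ok() {
    awk '
        /^# *1 / {
            seen = 1
            print
            ok = ($0 ~ /Completed without error/)
            exit
        }
        END { exit !(seen && ok) }
    '
}
```

For example: sudo smartctl --log=selftest /dev/hdc | latest_selftest_ok || echo "replace this drive". Note that an empty log, or a test that has not finished, also counts as "not ok" here.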

Interpretation

smartctl gives amazingly bad feedback. I think this is mostly a result of every drive manufacturer having different ideas of what should be checked and what’s important. Here are scans of two identical drives installed on the same machine. I know one of these drives (A) has problems: it reliably corrupts the filesystem. The other (B) seems to be working fine. This allows a very helpful comparison because it is often difficult to tell what the values should be for a good drive versus a bad one. I’ve put the raw values for A that I think are problematic in bold. Compare with the raw values for B.

(P=Pre-fail, O=Old_age; A=Always, O=Offline; V=VALUE, W=WORST, T=THRESH, R=RAW_VALUE)

ATTRIBUTE_NAME	TYPE	UPDATED	V A	W A	T A	R A	V B	W B	T B	R B
Raw_Read_Error_Rate	P	A	200	200	051	2848	200	200	051	0
Spin_Up_Time	P	A	146	137	021	3666	149	141	021	3525
Start_Stop_Count	O	A	100	100	000	266	100	100	000	816
Reallocated_Sector_Ct	P	A	200	200	140	0	200	200	140	0
Seek_Error_Rate	O	A	200	200	000	0	200	200	000	0
Power_On_Hours	O	A	049	049	000	37662	063	063	000	27656
Spin_Retry_Count	O	A	100	100	000	0	100	100	000	0
Calibration_Retry_Count	O	A	100	100	000	0	100	100	000	0
Power_Cycle_Count	O	A	100	100	000	264	100	100	000	629
Power-Off_Retract_Count	O	A	200	200	000	66	200	200	000	127
Load_Cycle_Count	O	A	200	200	000	199	200	200	000	688
Temperature_Celsius	O	A	111	075	000	32	112	091	000	31
Reallocated_Event_Count	O	A	200	200	000	0	200	200	000	0
Current_Pending_Sector	O	A	200	200	000	31	200	200	000	0
Offline_Uncorrectable	O	O	200	200	000	10	200	200	000	0
UDMA_CRC_Error_Count	O	A	200	200	000	0	200	200	000	0
Multi_Zone_Error_Rate	O	O	187	149	000	2671	200	200	000	0

Note that the RAW_VALUEs for "Raw_Read_Error_Rate", "Offline_Uncorrectable", and "Multi_Zone_Error_Rate" seem to be reliable indicators of a problem. "Current_Pending_Sector" is also nonzero only on the bad drive.
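
A comparison like the one in the table can be automated from two saved reports. This is a rough sketch: compare_raw is a name I made up, and it assumes attribute rows start with a numeric ID followed by a 0x flag, with RAW_VALUE in the tenth column, as in the smartctl -a output earlier.

```shell
# Usage: compare_raw REPORT_A REPORT_B
# Prints attributes whose RAW_VALUE differs between two saved
# "smartctl -a" reports, one per line, as: NAME RAW_A RAW_B
compare_raw() {
    awk '
        # Attribute rows start with a numeric ID and have a 0x.... flag.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            if (FNR == NR) {
                raw[$1] = $10            # first file: remember raw value by ID
            } else if (($1 in raw) && raw[$1] != $10) {
                print $2, raw[$1], $10   # second file: report differences
            }
        }
    ' "$1" "$2"
}
```

Counters like Power_On_Hours will always differ, so the output still needs a human to decide which differences matter.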

Aside from this table and the fields "Serial Number", "LU WWN Device Id", "Local Time is", "data collection", and "recommended polling time" there were no other differences between the reports for these drives. Note that both drives claimed this.

SMART overall-health self-assessment test result: PASSED

I would take this to mean that if you get a "FAILED" here, that is very bad.

Here’s another important difference between the good drive (B) and the bad drive (A). This shows the difference between test results as described above.

$ sudo smartctl --test=short /dev/sda
$ sudo smartctl --test=short /dev/sdb ; sleep 120
$ sudo smartctl --log=selftest /dev/sda | grep -A1 Status
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     37663         293137097
$ sudo smartctl --log=selftest /dev/sdb | grep -A1 Status
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27657         -

smartstart

When a hard drive begins to act suspiciously, it can be extremely helpful to have a baseline SMART report to compare with. This program will run smartctl and name the report based on the date and the hard drive’s serial number. It also copies the report into /lost+found which, although usually a waste of an inode, is the perfect place to store this kind of information. It’s not a terrible idea to copy it elsewhere too, but if the drive is so hosed that this report can’t be read, then you’re probably not running smartctl on it any more anyway.

smartstart

#!/bin/bash
function help { cat <<EOHELP
A simple script to save S.M.A.R.T. reports for hard drives for use
in later comparisons when units begin to fail. Saves in a filename
based on the serial number of the drive. Specify the drive,
just the "sd[a-z]" part, "/dev/" added automatically.
E.g.
    $0 sda
EOHELP
}
D=$1

# How to upgrade to root/sudo if regular user.
SUDO=''; if (( EUID != 0 )); then SUDO='sudo'; fi

# Check for such a drive. Match the whole name so "sd" does not pass for "sda".
if [[ -z "${D}" ]] || ! lsblk --nodeps --noheadings -o NAME | grep -qxF -- "${D}"; then
    help
    lsblk --nodeps --noheadings -o NAME
    exit 1
fi

# Use a safe temporary file, created only once the drive is known to exist.
TDIR=/tmp
T=$(mktemp --tmpdir="${TDIR}" smartstart.tmp.XXXXX) || exit 1

# smartctl can exit nonzero even when it writes a usable report,
# so its exit status is deliberately not checked here.
${SUDO} smartctl -a "/dev/${D}" > "${T}"
#For sensible name, extract unique serial number formatted like this.
#Serial Number:    WD-WCAYW0003385
N=smart-$(sed -n 's/^Serial Number: *\(.*\)$/\1/p' "${T}")-$(date '+%Y%m%d')
mv "${T}" "${TDIR}/${N}"

echo "Copying to /lost+found..."
${SUDO} cp "${TDIR}/${N}" "/lost+found/${N}"
${SUDO} ls -l "${TDIR}/${N}" "/lost+found/${N}"

MHDD

I finally found out what "MHDD" stands for. I used to call it "Moscow Hard Disk Diagnostics" but Eugene from hddguru.com says:

MHDD stands for "maysoft’s hdd tool" ("maysoft" is an old nickname Dmitry used on forums). As far as I know Dmitry is from Ukraine, not from Russia :)

This tool can be tricky to get running. It can be difficult to have it detect your drive. But if you can have this tool examine your hard drive, you will get one of the best diagnostic reports about your drive possible. It basically scans every part of the drive and creates a histogram of seek times. If it takes a long time to access a patch of your hard drive, then you can be pretty sure it would be a good idea to make a backup immediately and start replacement actions. I often use this tool to see if I want to reuse old hard drives I reclaim from scrapped computers. If the scan comes back clean, there is no reason not to use the drive for low importance work.

Where To Find MHDD

I typically just use System Rescue CD which includes an image containing MHDD. Just get a bootloader prompt and type mhdd. If that’s not working out for you, try MHDD’s home.

Detecting Drives

Sometimes the drives are not detected. Perhaps it says something about Slave channels not being recognized. If it truly is an ancient IDE thing and you’re on a slave channel due to cabling issues, in theory you can specify the missing 5 when there is an empty 4. But that probably isn’t the problem unless you’ve time travelled back to 2005.

What worked for me was going to the BIOS and finding the AHCI (Advanced Host Controller Interface) setting. This is a protocol that controls how the hard drive controller talks to the system. The actual hard drive knows nothing about this. But software that expects a simple old-fashioned IDE style interface, like MHDD, may not see drives behind a controller in AHCI mode. So set any AHCI settings on the drives in BIOS to IDE, run mhdd and once you’re done, put it back. An article on the topic.

Effective Use

These instructions are quite helpful (and in English!)

Basically the program runs in an isolated DOS environment. You select the drive you want to be curious about (pressing shift-F3 will bring the drive selection menu up).

Then you can enter commands to query interesting information about the drive.

ID - Returns identification info about the drive.
EID - Returns some more information than ID including what features the drive BIOS supports.
CX - Perform a seek and read test. This will run indefinitely outputting a continuous average. Press ESC when you’re happy.
RPM - Tells you the spindle speed for this drive.
SMART ATT - Shows the drive’s SMART attributes.
SMART ERRLOG - Shows the drive’s SMART error log.
SCAN - Starts a drive scan without logging it. First you get a menu that allows you to change some things about the scan. F4 is a synonym for SCAN and also starts the actual scan after you’re satisfied with the scan setup menu.

Table 1. SCAN output codes
?	TIME	VERIFY command did NOT complete within the timeout
x	UNC	data is uncorrectable.
!	ABRT	command was aborted
*	BBK	Bad Block
S	IDNF	sector ID cannot be read or not as expected
A	AMNF	Data Address Mark Not Found
0	TONF	Track 0 was not found during drive recalibration

whdd

I was tipped off to this great project which is like mhdd but can run in a live system.

https://github.com/whdd/whdd

It can even run on the mounted disk in use as root (though the feedback may be a bit nonsensical if you’re doing that).

There may be Ubuntu packages. I just did something like this and was up and running quickly.

git clone https://github.com/whdd/whdd
cd whdd
./build.sh
sudo ./whdd

More HDD Info

Here are some references that might help figure out SMART reports.

Find a drive’s volume name: ls /dev/disk/by-label
Also by-id and by-uuid.
My blog post on the topic
https://en.wikipedia.org/wiki/S.M.A.R.T
Understanding SMART Reports
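
The by-label, by-id and by-uuid directories under /dev/disk are just full of symlinks to the real device nodes, so a small loop can show which name points at which device. This is a sketch with a made-up function name (list_links), and it assumes a readlink that supports -f, as the GNU one does.

```shell
# Usage: list_links DIR
# Prints "NAME -> TARGET" for every symlink in DIR,
# e.g. DIR=/dev/disk/by-label
list_links() {
    for link in "$1"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```

For example: list_links /dev/disk/by-id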

Need absurdly detailed info about drives?

Other Disk Tools When You’re Screwed

Note	Other than `ddrescue` I’ve never used any of these tools but it seems like a good idea to know about them.

TestDisk helps with corrupted partition tables.
dcfldd is a forensics enhanced dd. See also ddrescue and dd_rescue.
PhotoRec snarfs out file-like data of known formats from raw physical volumes independent of filesystem. Not just for photos.

System Rescue CD

System Rescue CD (aka sysresccd) can be found at: http://www.system-rescue-cd.org/

One problem that can be extremely frustrating is a hard drive or RAID set that has some deep dark problem (I/O errors and the like) and System Rescue hangs on boot aggressively trying to automount everything. Or worse, System Rescue corrupts a software RAID setup as mentioned in my notes here (search for "md127").

From the boot options reference page, here is how to suppress that.

skipmount=/dev/xxx : The system mounts all the storage devices at boot time to find the sysrcd.dat file. If your hard disk is broken it should not be mounted. Boot with skipmount=/dev/sda1 skipmount=/dev/sda2 to ignore these two partitions.

Use multiple repeated skipmount=/dev/yyy as needed.
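
When a broken disk has a lot of partitions, typing those options by hand gets tedious. Here is a small sketch that builds the string to paste at the boot prompt. The function name skipmount_args is made up; the only thing it relies on from the reference page is the skipmount=/dev/xxx syntax quoted above.

```shell
# Usage: skipmount_args DEVICE...
# Prints one skipmount= boot option per device, separated by spaces.
skipmount_args() {
    out=""
    for dev in "$@"; do
        # ${out:+ } adds a separating space only after the first option.
        out="${out}${out:+ }skipmount=${dev}"
    done
    printf '%s\n' "$out"
}
```

For example, skipmount_args /dev/sda1 /dev/sda2 prints the two options from the quoted text, ready to append to the boot line.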