Data Moving Speeds
Interface | Mb/s | Gb/s
10BASE-T | 10 | 0.01
100BASE-T | 100 | 0.1
USB2 | 480 | 0.48
1000BASE-T | 1000 | 1
SATA1 | 1500 | 1.5
SATA2 | 3000 | 3
RPi Cam | 3628 | 3.6
USB3 | 5000 | 5
SATA3 | 6000 | 6
10GBASE-T | 10000 | 10
USB3.1 | 10000 | 10
USB3.2 | 20000 | 20
Raspberry Pi camera maximum = 24bit x 2592w x 1944h x 30fps
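The camera figure in the table is just that formula multiplied out. Here is a quick way to check it in the shell, assuming uncompressed frames and decimal units (1 Mb = 1,000,000 bits). The function name raw_video_mbps is only for illustration.

```shell
# Raw (uncompressed) video bandwidth in Mb/s, rounded to the nearest integer.
# Arguments: bits-per-pixel width height frames-per-second
raw_video_mbps() {
    awk -v b="$1" -v w="$2" -v h="$3" -v f="$4" \
        'BEGIN { printf "%.0f\n", b * w * h * f / 1000000 }'
}

raw_video_mbps 24 2592 1944 30
# prints 3628
```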
smartmontools
With respect to hard drives, the acronym "SMART" stands for Self-Monitoring, Analysis and Reporting Technology. This was built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. Basically anything after about 2005 should have it.
Installation
Ubuntu/Debian:
sudo apt-get install smartmontools
CentOS/Fedora/RH:
sudo yum install smartmontools
Gentoo:
sudo emerge sys-apps/smartmontools
Or go to the source.
smartctl
The program smartctl is used to interface with the SMART features in the drive firmware. Here are a couple of easy things to get started with (note that some versions do not have the --scan option):
$ smartctl --scan -d ata
/dev/hda -d ata # /dev/hda, ATA device
/dev/hdc -d ata # /dev/hdc, ATA device
$ sudo smartctl --info /dev/hdc
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.33.1-xedvia] (local
build)
Copyright (C) 2002-11 by Bruce Allen,
http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus
Device Model: ST3160023A
Serial Number: 5JS9MDKW
Firmware Version: 8.01
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2
Local Time is: Thu Feb 7 09:27:18 2013 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
Note that the "SMART support" is listed as available but disabled. To enable full diagnostic checking turn it on with something like this:
$ sudo smartctl --smart=on --offlineauto=on --saveauto=on /dev/hdc
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.
SMART Attribute Autosave Enabled.
SMART Automatic Offline Testing Enabled every four hours.
In theory this only needs to be done once and the drive should remember it (because of the saveauto directive). The offlineauto directive causes automatic testing every 4 hours. In theory it will wait "nicely" if the drive is already busy, so performance should not be seriously impacted.
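If you want a script to know whether SMART is already on before touching anything, something like the following might do. It is a minimal sketch that reads a smartctl --info report on standard input and looks at the "SMART support is:" lines shown above. The name smart_state is made up for this example, and it assumes the wording of those lines matches the sample output here (other smartctl versions may word things differently).

```shell
# Print "enabled", "disabled", or "unavailable" based on the
# "SMART support is:" lines of a `smartctl --info` report read from stdin.
smart_state() {
    awk '
        /^SMART support is:/ {
            if ($0 ~ /Enabled/)  state = "enabled"
            if ($0 ~ /Disabled/) state = "disabled"
        }
        END { print (state == "" ? "unavailable" : state) }
    '
}

# Typical use (needs root and a real drive):
#   sudo smartctl --info /dev/hdc | smart_state
```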
Temperature
It can be interesting to see how hot the drive is getting.
Install the smartmontools package and do something like this.
$ sudo smartctl -a /dev/sdb | grep Temperature_Celsius | awk '{print $10}'
30
Note that not every SMART report may contain a temperature field.
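The grep and awk pair can be folded into one awk call that matches the attribute name exactly. This is a small sketch that assumes the attribute table layout shown in these notes (name in column 2, raw value in column 10). smart_temp is a hypothetical helper name. It prints only the first matching attribute, since some drives list more than one.

```shell
# Print the raw Temperature_Celsius value from `smartctl -a` output read on
# stdin. Prints nothing and returns non-zero if the attribute is absent.
smart_temp() {
    awk '$2 == "Temperature_Celsius" { print $10; found = 1; exit }
         END { exit !found }'
}

# Typical use (needs root and a real drive):
#   sudo smartctl -a /dev/sdb | smart_temp
```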
Testing
Here is the simplest check, which just dumps everything the drive reports.
$ sudo smartctl -a /dev/sdb
That should produce a table (among other things) that looks like this.
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 094 094 050 Pre-fail Always - 171774027
5 Reallocated_Sector_Ct 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours 0x0032 049 049 000 Old_age Always - 44951 (46 68 0)
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 24
171 Unknown_Attribute 0x0032 000 000 000 Old_age Always - 0
172 Unknown_Attribute 0x0032 000 000 000 Old_age Always - 0
174 Unknown_Attribute 0x0030 000 000 000 Old_age Offline - 4
177 Wear_Leveling_Count 0x0000 000 000 000 Old_age Offline - 8
181 Program_Fail_Cnt_Total 0x0032 000 000 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 000 000 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 032 038 000 Old_age Always - 32 (Min/Max 23/38)
195 Hardware_ECC_Recovered 0x001c 120 120 000 Old_age Offline - 171774027
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x001c 120 120 000 Old_age Offline - 171774027
204 Soft_ECC_Correction 0x001c 120 120 000 Old_age Offline - 171774027
230 Head_Amplitude 0x0013 100 100 000 Pre-fail Always - 100
231 Temperature_Celsius 0x0013 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0000 000 000 000 Old_age Offline - 9071
234 Unknown_Attribute 0x0032 000 000 000 Old_age Always - 1588
241 Total_LBAs_Written 0x0032 000 000 000 Old_age Always - 1588
242 Total_LBAs_Read 0x0032 000 000 000 Old_age Always - 15296
Note that the Raw_Read_Error_Rate’s RAW_VALUE is astronomically high. That can’t be good. Read my helpful blog post to learn more about the subtleties of interpreting this stuff.
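Since the raw numbers are vendor-specific, a less ambiguous first pass is the normalized columns: an attribute counts as failed once VALUE drops to or below THRESH, which is what the WHEN_FAILED column reports. The sketch below applies that rule to a report on standard input. It assumes the column layout shown above, and it skips rows with a threshold of 000 on the assumption that this means "no threshold" (in the sample table those rows all show "-" under WHEN_FAILED). The function name is made up.

```shell
# Read a smartctl attribute table on stdin and print the name of every
# attribute whose normalized VALUE ($4) is at or below THRESH ($6).
# Rows with a threshold of 0 are skipped (assumed to mean "no threshold").
smart_below_thresh() {
    awk '$1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ \
         && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Typical use (needs root and a real drive):
#   sudo smartctl -a /dev/sdb | smart_below_thresh
```

No output means nothing has crossed its threshold, which is not the same thing as a healthy drive.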
There are more proactive tests too. Here’s a way to run a "short" off-line test. This tests electrical and mechanical performance of the drive and does read testing.
$ sudo smartctl --test=short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Thu Feb 7 10:13:19 2013
Use smartctl -X to abort test.
$ sudo smartctl --log=selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 43398 -
$ sudo smartctl --log=selftest /dev/hdc
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 37994 7234643
The first command starts the test and tells you to come back in 1 or 2 minutes. The next two commands show how to query the self-test log to see if anything bad came up. In this case hda was fine ("Completed without error") but hdc had a very important "read failure". Replace that drive ASAP!
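Checking the self-test log by eye gets old when there are many drives. Here is a rough sketch that reads the output of smartctl --log=selftest on standard input and reports every entry that is not "Completed without error". Note that this also catches tests that are still in progress or were aborted. The function name is hypothetical, and the line format is assumed to match the samples above.

```shell
# Read `smartctl --log=selftest` output on stdin. Print any self-test
# entries that did not complete cleanly; return non-zero if one is found.
selftest_failures() {
    awk 'BEGIN { bad = 0 }
         /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
         END { exit bad }'
}

# Typical use (needs root and a real drive):
#   sudo smartctl --log=selftest /dev/hdc | selftest_failures || echo "check this drive"
```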
Interpretation
smartctl gives amazingly bad feedback. I think this is mostly a result of every drive manufacturer having different ideas of what should be checked and what’s important. Here are scans of two identical drives installed in the same machine. I know one of these drives (A) has problems: it reliably corrupts the filesystem. The other (B) seems to be working fine. This allows a very helpful comparison, because it is often difficult to tell what the values should look like on a good drive versus a bad one. Compare the raw values (R) for A with those for B; the ones I think are problematic are named after the table.
(TYPE: P=Pre-fail, O=Old_age; UPDATED: A=Always, O=Offline; V=VALUE, W=WORST, T=THRESH, R=RAW_VALUE; the trailing A or B is the drive)
ATTRIBUTE_NAME | TYPE | UPDATED | V A | W A | T A | R A | V B | W B | T B | R B
Raw_Read_Error_Rate | P | A | 200 | 200 | 051 | 2848 | 200 | 200 | 051 | 0
Spin_Up_Time | P | A | 146 | 137 | 021 | 3666 | 149 | 141 | 021 | 3525
Start_Stop_Count | O | A | 100 | 100 | 000 | 266 | 100 | 100 | 000 | 816
Reallocated_Sector_Ct | P | A | 200 | 200 | 140 | 0 | 200 | 200 | 140 | 0
Seek_Error_Rate | O | A | 200 | 200 | 000 | 0 | 200 | 200 | 000 | 0
Power_On_Hours | O | A | 049 | 049 | 000 | 37662 | 063 | 063 | 000 | 27656
Spin_Retry_Count | O | A | 100 | 100 | 000 | 0 | 100 | 100 | 000 | 0
Calibration_Retry_Count | O | A | 100 | 100 | 000 | 0 | 100 | 100 | 000 | 0
Power_Cycle_Count | O | A | 100 | 100 | 000 | 264 | 100 | 100 | 000 | 629
Power-Off_Retract_Count | O | A | 200 | 200 | 000 | 66 | 200 | 200 | 000 | 127
Load_Cycle_Count | O | A | 200 | 200 | 000 | 199 | 200 | 200 | 000 | 688
Temperature_Celsius | O | A | 111 | 075 | 000 | 32 | 112 | 091 | 000 | 31
Reallocated_Event_Count | O | A | 200 | 200 | 000 | 0 | 200 | 200 | 000 | 0
Current_Pending_Sector | O | A | 200 | 200 | 000 | 31 | 200 | 200 | 000 | 0
Offline_Uncorrectable | O | O | 200 | 200 | 000 | 10 | 200 | 200 | 000 | 0
UDMA_CRC_Error_Count | O | A | 200 | 200 | 000 | 0 | 200 | 200 | 000 | 0
Multi_Zone_Error_Rate | O | O | 187 | 149 | 000 | 2671 | 200 | 200 | 000 | 0
Note that the RAW_VALUE for "Raw_Read_Error_Rate", "Offline_Uncorrectable", and "Multi_Zone_Error_Rate" seems to be a reliable indicator of a problem: each is zero on the good drive and non-zero on the bad one.
Aside from this table and the attributes "Serial Number", "LU WWN Device Id", "Local Time is", "data collection", and "recommended polling time" there were no other differences between the reports for these drives. Note that both drives reported this:
SMART overall-health self-assessment test result: PASSED
I would take this to mean that if you get a "FAILED" here, that is very bad.
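For scripting, the verdict can be pulled out of a report like this. It is a sketch only: it assumes the ATA-style line shown above (SCSI drives word their health line differently), and smart_health is a made-up name. Given that the known-bad drive here still said PASSED, treat a pass as "not yet condemned" rather than "healthy".

```shell
# Read `smartctl -H` or `smartctl -a` output on stdin and print the
# overall-health verdict. Returns non-zero unless the verdict is PASSED.
smart_health() {
    awk -F': *' '/^SMART overall-health self-assessment test result/ {
                     verdict = $2; print verdict
                 }
                 END { exit (verdict != "PASSED") }'
}

# Typical use (needs root and a real drive):
#   sudo smartctl -H /dev/sda | smart_health || echo "drive says it is failing"
```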
Here’s another important difference between the good drive (B) and the bad drive (A). This shows the difference between test results as described above.
$ sudo smartctl --test=short /dev/sda
$ sudo smartctl --test=short /dev/sdb ; sleep 120
$ sudo smartctl --log=selftest /dev/sda | grep -A1 Status
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 37663 293137097
$ sudo smartctl --log=selftest /dev/sdb | grep -A1 Status
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 27657 -
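The A versus B comparison of raw values in this section can be roughed out in a script when you have a known-good twin to compare against. The sketch below takes two saved smartctl reports and lists attributes whose RAW_VALUE is zero on one drive and non-zero on the other. It assumes the standard attribute table layout (name in column 2, raw value in column 10) and unique attribute names. smart_raw_diff and the file names are hypothetical.

```shell
# Compare RAW_VALUE ($10) per attribute in two saved smartctl reports.
# Prints "name rawA rawB" for attributes that are zero on one drive and
# non-zero on the other.
smart_raw_diff() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if (FNR == NR) { a[$2] = $10; next }   # first file: remember raw
            if (($2 in a) && ((a[$2] + 0 == 0) != ($10 + 0 == 0)))
                print $2, a[$2], $10
        }
    ' "$1" "$2"
}

# Typical use with two reports saved earlier:
#   smart_raw_diff smart-driveA.txt smart-driveB.txt
```

Run against the two drives in the table above, this would flag the three attributes I called out plus Current_Pending_Sector (31 versus 0).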
smartstart
When a hard drive begins to act suspiciously, it can be extremely helpful to have a baseline SMART report to compare with. This program will run smartctl and name the report based on the date and the hard drive’s serial number. It also copies the report into /lost+found which, although usually a waste of an inode, is the perfect place to store this kind of information. It’s not a terrible idea to copy it elsewhere too, but if the drive is so hosed that this report can’t be read, then you’re probably not running smartctl on it any more anyway.
#!/bin/bash
function help {
cat <<EOHELP
A simple script to save S.M.A.R.T. reports for hard drives for use in
later comparisons when units begin to fail. Saves in a filename based
on the serial number of the drive. Specify the drive, just the
"sd[a-z]" part, "/dev/" added automatically. E.g.
    $0 sda
EOHELP
}
D=$1
# How to upgrade to root/sudo if regular user.
SUDO=''; if (( EUID != 0 )); then SUDO='sudo'; fi
# Check for such a drive (exact match on the device name).
if [[ -z "${D}" ]] || ! lsblk --nodeps -o NAME | tail -n+2 | grep -qxF "${D}" ; then
    help
    lsblk --nodeps -o NAME | tail -n+2
    exit 1
fi
# Use a safe temporary file (created only once the drive is known to exist).
TDIR=/tmp
T=$(mktemp --tmpdir="${TDIR}" smartstart.tmp.XXXXX)
${SUDO} smartctl -a "/dev/${D}" > "${T}"
# For a sensible name, extract the unique serial number, formatted like this.
# Serial Number: WD-WCAYW0003385
N=smart-$(sed -n 's/^Serial Number: *\(.*\)$/\1/p' "${T}")-$(date '+%Y%m%d')
mv "${T}" "${TDIR}/${N}"
echo "Copying to /lost+found..."
${SUDO} cp "${TDIR}/${N}" "/lost+found/${N}"
${SUDO} ls -l "${TDIR}/${N}" "/lost+found/${N}"
MHDD
I finally found out what "MHDD" stands for. I used to call it "Moscow Hard Disk Diagnostics" but Eugene from hddguru.com says:
MHDD stands for "maysoft’s hdd tool" ("maysoft" is an old nickname Dmitry used on forums). As far as I know Dmitry is from Ukraine, not from Russia :)
This tool can be tricky to get running, and it can be difficult to get it to detect your drive. But if you can get this tool to examine your hard drive, you will get one of the best diagnostic reports about your drive possible. It basically scans every part of the drive and creates a histogram of access times. If it takes a long time to access a patch of your hard drive, then you can be pretty sure it would be a good idea to make a backup immediately and start replacement actions. I often use this tool to see if I want to reuse old hard drives I reclaim from scrapped computers. If the scan comes back clean, there is no reason not to use the drive for low importance work.
Where To Find MHDD
I typically just use System Rescue CD which includes an image containing MHDD. Just get a bootloader prompt and type mhdd.
If that’s not working out for you, try MHDD’s home.
Detecting Drives
Sometimes the drives are not detected. Perhaps it says something about Slave channels not being recognized. If it truly is an ancient IDE thing and you’re on a slave channel due to cabling issues, in theory you can specify the missing 5 when there is an empty 4. But that probably isn’t the problem unless you’ve time travelled back to 2005.
What worked for me was going to the BIOS and finding the AHCI (Advanced Host Controller Interface) setting. AHCI is an interface that controls how the hard drive controller talks to the system; the actual hard drive knows nothing about it. But software that likes simple old-fashioned IDE-style interfaces appreciates having it switched off. So set any AHCI settings for the drives in the BIOS to IDE, run mhdd, and once you’re done, put them back. An article on the topic.
Effective Use
These instructions are quite helpful (and in English!)
Basically the program runs in an isolated DOS environment. You select the drive you want to be curious about (pressing shift-F3 will bring the drive selection menu up).
Then you can enter commands to query interesting information about the drive.
- ID - Returns identification info about the drive.
- EID - Returns some more information than ID, including what features the drive BIOS supports.
- CX - Performs a seek and read test. This will run indefinitely, outputting a continuous average. Press ESC when you’re happy.
- RPM - Tells you the spindle speed for this drive.
- SMART ATT
- SMART ERRLOG
- SCAN - Starts a drive scan without logging it. First you get a menu that allows you to change some things about the scan. F4 is a synonym for SCAN and also starts the actual scan after you’re satisfied with the scan setup menu.
Symbol | Error | Meaning
? | TIME | VERIFY command did NOT complete within the timeout
x | UNC | Data is uncorrectable
! | ABRT | Command was aborted
* | BBK | Bad Block
S | IDNF | Sector ID cannot be read or is not as expected
A | AMNF | Data Address Mark Not Found
0 | TONF | Track 0 was not found during drive recalibration
whdd
I was tipped off to this great project which is like mhdd but can run in a live system.
It can even run on the mounted disk in use as root (though the feedback may be a bit nonsensical if you’re doing that).
There may be Ubuntu packages. I just did something like this and was up and running quickly.
git clone https://github.com/whdd/whdd
cd whdd
./build.sh
sudo ./whdd
More HDD Info
Here are some references that might help figure out SMART reports.
- Find a drive’s volume name: ls /dev/disk/by-label
- Also by-id and by-uuid.
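Those directories are just symlinks to the real device nodes, so a short loop can show which name maps to which device. This is a minimal sketch; disk_links is a hypothetical name, and it relies on readlink -f as provided by GNU coreutils.

```shell
# List the symlinks in a /dev/disk/by-* directory alongside the device
# node each one points to. Defaults to by-label.
disk_links() {
    dir=${1:-/dev/disk/by-label}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Typical use:
#   disk_links /dev/disk/by-id
```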
Need absurdly detailed info about drives?
Other Disk Tools When You’re Screwed
Note: Other than ddrescue I’ve never used any of these tools, but it seems like a good idea to know about them.
System Rescue CD
System Rescue CD (aka sysresccd) can be found at: http://www.system-rescue-cd.org/
One problem that can be extremely frustrating is a hard drive or RAID set that has some deep dark problem (I/O errors and the like) and System Rescue hangs on boot aggressively trying to automount everything. Or worse, System Rescue corrupts a software raid setup as mentioned in my notes here (search for "md127").
From the boot options reference page, here is how to suppress that.
skipmount=/dev/xxx : The system mounts all the storage devices at boot time to find the sysrcd.dat file. If your hard disk is broken it should not be mounted. Boot with skipmount=/dev/sda1 skipmount=/dev/sda2 to ignore these two partitions.
Use multiple repeated skipmount=/dev/yyy as needed.
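When a sick disk has a lot of partitions, typing all of those out at a boot prompt is error prone, so it can help to generate the string ahead of time on a working machine. This is a trivial sketch with a made-up function name. The device names are whatever you pass in.

```shell
# Build the repeated skipmount= boot options for a list of partitions.
skipmount_args() {
    out=
    for dev in "$@"; do
        out="${out:+$out }skipmount=$dev"
    done
    printf '%s\n' "$out"
}

skipmount_args /dev/sda1 /dev/sda2
# prints: skipmount=/dev/sda1 skipmount=/dev/sda2
```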