One of the most insidious forms of computer trouble is hardware failure, and one of the most insidious forms of hardware failure is degrading memory. Memory often doesn’t fail spectacularly; as it degrades, it starts producing occasional errors. The result is random crashes that can be very confusing.

Because this problem produces such diverse, intermittent, and weird effects, it is often difficult to recognize conclusively. The best tool for isolating it is MemTest86, a tiny utility that boots on its own, without an operating system, and flogs the memory. It knows which sorts of access patterns are especially difficult for memory to handle, and it iterates through them.
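
To give a feel for what "iterating through difficult patterns" means, here is a minimal sketch in Python. It is only an illustration of the general idea (write a pattern, read it back, repeat with the complement), not MemTest86's actual algorithm, and a user-space script like this cannot control which physical RAM or cache it touches. The names walking_ones, check_buffer, and StuckBitMemory are made up for this example, and the "faulty" memory is simulated.

```python
def walking_ones(width=8):
    """Yield patterns with a single 1 bit walking across a byte."""
    for bit in range(width):
        yield 1 << bit


def check_buffer(buf, patterns):
    """Fill buf with each pattern and its complement, reading back each time.

    Returns a list of (index, expected, actual) tuples for every mismatch.
    """
    errors = []
    for pattern in patterns:
        for value in (pattern, pattern ^ 0xFF):
            # Write pass: store the same value in every cell.
            for i in range(len(buf)):
                buf[i] = value
            # Read pass: every cell should still hold that value.
            for i in range(len(buf)):
                actual = buf[i]
                if actual != value:
                    errors.append((i, value, actual))
    return errors


class StuckBitMemory:
    """Simulated bad module (hypothetical): one bit of one cell always reads 0."""

    def __init__(self, size, bad_index, bad_bit):
        self._cells = bytearray(size)
        self._bad_index = bad_index
        self._mask = ~(1 << bad_bit) & 0xFF

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, index, value):
        self._cells[index] = value

    def __getitem__(self, index):
        value = self._cells[index]
        if index == self._bad_index:
            return value & self._mask  # the stuck bit is forced to 0
        return value


good_errors = check_buffer(bytearray(64), walking_ones())
bad_errors = check_buffer(StuckBitMemory(64, bad_index=10, bad_bit=3), walking_ones())
print("good buffer errors:", len(good_errors))
print("bad buffer errors:", len(bad_errors))
```

Note that the simulated fault only shows up for values that try to set the stuck bit. An all-zeros fill would pass every time, which is why a real tester cycles through many patterns rather than just one.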

The problem I have with MemTest86 is that it is difficult to prove a memory module is good. Who knows where badness lurks? Maybe the kryptonite bit sequence hasn’t occurred yet. And if you find an error, is the module bad? Not always. Cosmic radiation is a normal source of soft errors, even in memory that is functionally sound.

Often when I run MemTest86, I look at a blue screen with zero errors for what seems like a long time. My rule of thumb is to let each module (or each pair, if the modules must be installed in pairs) run for about 2 minutes, provided the scan gets through the whole thing at least once. Usually a truly bad memory module, the kind likely to cause noticeable problems, will show an error in less than a minute.

When errors do show up, they aren’t subtle. Here’s a rough photo of what they look like.

[Photo: memtest_bad.jpg]

If 2 minutes for each module comes up clean, you may need to run it longer. When to give up and start looking elsewhere is up to you. But if you run it all night with no errors and still believe the problem might be in the memory, I’d say you’re really in denial about some other problem.