My New Computer
Note: This site hasn't been updated in a long time. I mostly just use this domain for email.
Back in early November I built a new computer. It’s a moderately high-end computer — 2 gigabytes of RAM and an Intel Core 2 Duo E6600 processor — but its most distinguishing feature is its storage subsystem: three 750GB hard drives. My plan was to make a RAID-5 array out of these drives, for about 1.5 terabytes of usable storage space and protection against drive failure. I figure this should be enough storage space for me for the next 4 or 5 years.
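For anyone wondering where the 1.5 terabyte figure comes from, here is a minimal Python sketch of the RAID-5 arithmetic, plus a toy demonstration of how XOR parity lets an array survive one failed drive. The function names (`raid5_usable_gb`, `xor_blocks`) and the sample data are made up for illustration. This is not how the Linux md driver is actually implemented, just the underlying idea.

```python
from functools import reduce


def raid5_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Usable capacity of a RAID-5 array: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three 750GB drives, as in this build.
print(raid5_usable_gb(3, 750))  # 1500 (GB), i.e. about 1.5 TB

# Toy stripe: two data blocks plus one parity block (hypothetical contents).
data_a = b"hello wo"
data_b = b"rld!1234"
parity = xor_blocks(data_a, data_b)

# If the drive holding data_b dies, XOR the survivors to rebuild it.
rebuilt_b = xor_blocks(data_a, parity)
assert rebuilt_b == data_b
```

The same XOR trick works no matter which one of the three blocks is lost, which is why a RAID-5 array tolerates exactly one drive failure.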
My plan was also to encrypt the RAID array using Linux’s “dm-crypt” feature, as I’ve already been doing on my laptop for over a year. I think a 1.5TB encrypted RAID-5 array, which I can then subdivide into logical volumes with LVM, is pretty cool.
Unfortunately, after I’d put everything together and installed the operating system (which itself required jumping through some hoops), I found that I had a problem: frequent, severe filesystem corruption anytime lots of disk activity was occurring. And Debian’s “aptitude” package-management tool would frequently print garbled text when I launched it. And one of the hard drives had logged an I/O communication failure in its SMART log.
Uh-oh.
In a situation like this, the “usual suspect” is naturally the hard drive. I had three brand-new hard drives here, which hadn’t yet proven themselves to be reliable. So I put them to the test: I downloaded Seagate’s SeaTools diagnostic suite and ran full diagnostics, including complete surface scans, on all three drives. Twice. But after 16 hours of thorough self-tests, none of the drives reported any problems whatsoever.
OK, so it’s not the hard drives. Maybe it’s the RAM? I ran memtest86 overnight. Twice. No errors found. I tried each of the two 1GB DIMMs individually; no dice. The filesystem still corrupted itself.
Since I was pretty sure it was a hardware problem, the only likely candidates left were the processor and the motherboard. I borrowed a different processor, a Pentium 4 HT chip, and installed it in place of my Core 2 Duo chip. I booted up the system, and lo and behold: the filesystem still corrupted itself, and aptitude still displayed gibberish half the time I started it.
So, it must be the motherboard, right? I returned the board (an Abit AB9 Pro) to Newegg, where I’d bought it, for a replacement. The replacement they sent me was definitely a different board, not just the same one returned to me; the serial numbers were different, and it was a newer revision so it had a few physical differences too. I installed it and tested the system again. Same problems.
As you can imagine, at this point I was pretty baffled as to what could be wrong with my computer. The symptoms all pointed to a hardware problem, but I’d ruled out all the hardware that seemed like it could have caused this sort of problem. Moving on to less-likely hardware, I borrowed a video card from my brother, just in case my XFX GeForce 7600GS was somehow corrupting data on the system bus. It didn’t help either.
It was while I was waiting for an opportunity to borrow a spare power supply from someone that, by random luck, I came across this while browsing the archives of the debian-kernel mailing list.
A bug in the kernel.
One that causes disk corruption.
One that only manifests itself when dm-crypt encryption is used on a software RAID-5 array, which would explain why it didn’t occur on my laptop, because I don’t use RAID there.
Cue sound of palm smacking forehead.
Things came together pretty quickly once I applied the kernel patch to fix that bug. After a clean reinstall of the OS, the filesystem stayed intact; no more random corruption. The aptitude display problem persisted, but that turns out to be an unrelated threading bug, and as far as I can tell it’s harmless. (It can occur on any system, but it’s much more frequent on SMP systems, and this is the first SMP system I’ve owned.) As for that one I/O failure recorded in the third drive’s SMART log, I have no idea. It hasn’t happened again, so all I can guess is that one of my SATA cables was loose back when I first assembled the system.
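A claim like “the filesystem stayed intact” can be spot-checked with a simple write-then-verify test. The Python sketch below is only an illustration of that idea, not the procedure actually used on this machine; the function names, file names, and sizes are all hypothetical. It writes files of random data, records a SHA-256 digest of what was written, then re-reads the files and reports any that no longer match.

```python
import hashlib
import os
import tempfile


def write_test_files(directory, count=8, size_bytes=1 << 20):
    """Write files of random data; return {path: SHA-256 hex digest of the data written}."""
    expected = {}
    for i in range(count):
        data = os.urandom(size_bytes)
        path = os.path.join(directory, f"stress_{i:03d}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data out to the disk
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def find_corrupted(expected):
    """Re-read each file and return the paths whose contents no longer match."""
    bad = []
    for path, digest in expected.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                bad.append(path)
    return bad


if __name__ == "__main__":
    # A temporary directory is used here; a real test would target the array itself.
    with tempfile.TemporaryDirectory() as test_dir:
        expected = write_test_files(test_dir)
        print("corrupted files:", find_corrupted(expected))
```

One caveat: with small files like these, the re-read is probably served from the page cache rather than from the disks, so a meaningful test would need to write more data than the machine has RAM, or verify after a reboot.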
As I write this, Debian’s packaged kernel has just been updated to incorporate the fix for that encryption bug, so I no longer have to compile my own kernel just to avoid hosing my filesystem. It’s good to know that the Etch release won’t go out the door with that problem.
Total time between ordering the new hardware and being able to actually use the new box for tasks other than troubleshooting: 2 months, from early November 2006 to early January 2007. Most of this time was spent swapping out hardware and repeatedly reinstalling Debian in different configurations to try to isolate the cause of the problem. (It didn’t help that I wrongly thought the aptitude issue and the filesystem corruption were two symptoms of the same root problem.)
But in the end, everything works, and I'm happy with my new computer. :-)