"Contemplate this on the Tree of Woe..."
This is a line from Conan the Barbarian uttered by the villain Thulsa Doom after capturing Conan and beating him to within an inch of his life. Looking down upon the fallen warrior, Doom (played by Vader-esque James Earl Jones) then turns to his henchman Rexor and issues a directive:
"Crucify him."
In response, Conan collapses in exhausted anguish.
Last week wasn't quite that bad for our hardware team, but it was a rough one. We had two important managed server customers suffer catastrophic hardware issues which required hours -- even days -- of downtime to fully repair. While dissimilar, both problems were storage-related.
First, a mini-primer on storage redundancy and fault tolerance:
There are two basic ways to provide storage redundancy in a standalone dedicated server: software and hardware.
- The benefit of the software strategy is lower hardware cost, which we can then pass along to the customer.
- The hardware solution is better at detecting and handling drive failures but costs more to deploy.
Neither strategy is immune to fault, as we've been painfully reminded over the past 10 days. Both affected customers themselves host dozens of their own customers on these servers, and so the interruptions were particularly undesirable for them.
In the first case, the server featured hardware-based redundancy: a RAID controller made by 3Ware. It just so happens that the driver for this controller has a rare bug when installed on servers running a certain Linux kernel version. The bug can cause generalized data corruption (!), and in this case, we discovered that various system configuration files in /etc were periodically getting scrambled, removed, and relocated!
This is a live server providing business-critical functionality to our customer and his customers, and yet the only way to ensure data integrity was to reinstall the operating system with an updated driver. This required an overnight reinstall and data-restoration procedure. When the server was back up and running, a few configurations and software versions were different, and so we had to work through dozens of small web application and email glitches before everything was ship-shape.
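For the technically curious, here is one way to notice this kind of silent scrambling of configuration files sooner. It is only a minimal sketch in Python, not the tooling we actually used: it records a SHA-256 checksum for every file under a directory and later reports what changed, disappeared, or appeared. The function names `build_manifest` and `diff_manifests` are made up for this illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and symlinks; hash only real files.
        if path.is_file() and not path.is_symlink():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest


def diff_manifests(baseline, current):
    """Return (changed, missing, added) as sorted lists of relative paths."""
    changed = sorted(p for p in baseline if p in current and baseline[p] != current[p])
    missing = sorted(p for p in baseline if p not in current)
    added = sorted(p for p in current if p not in baseline)
    return changed, missing, added
```

In practice the baseline would be saved somewhere off the affected disk and compared on a schedule, run by a user who can read every file in the directory. A check like this cannot say why a file changed, only that it did, so legitimate edits would show up as well.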
That would have been challenging enough, but around the same time, another managed server suffered a hard drive failure. This machine used the software approach to storage redundancy, and while the drive failure was indeed detected, what wasn't detected was that the other hard drive was on its last legs and could also have failed at any moment. The server was in an absolutely precarious state when it finally alerted us to the issue.
Ordinarily, when one hard drive in a mirrored pair fails, the procedure is to shut down, replace the failed drive, reboot, and instruct the system to re-mirror everything. In this case though, the server's remaining drive was in such bad shape that we suspected it wouldn't make it through the reboot, and that all current data on the machine would be lost.
The challenge was therefore to ensure that we had a snapshot of the most current data before beginning the surgery. But try making a fresh backup of 30+GB of data off a damaged hard drive that could fail at any moment; it's a slow, slow process, and took close to 18 hours to complete. It was only upon that completion that we could proceed with a full reinstall, reconfiguration, and restoration from backup (which took much of the next day).
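As an aside on detection: the sketch below shows how a monitoring script might spot a mirror that has lost a member. It assumes Linux md software RAID and the usual layout of /proc/mdstat, which may not match the setup on the server described here, and the function name `degraded_arrays` is made up for this illustration.

```python
import re

# An array header looks like: "md0 : active raid1 sdb1[1] sda1[0]"
MD_HEADER = re.compile(r"^(md\d+)\s*:")
# The status line that follows looks like: "104320 blocks [2/1] [_U]"
MD_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member ('_')."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = MD_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = MD_STATUS.search(line)
        if current and status:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded
```

A script could feed it the contents of /proc/mdstat and raise an alert whenever the list is non-empty. Note that this only catches a drive that has already dropped out of the mirror. It says nothing about a surviving drive that is quietly wearing out, which is exactly the gap that bit us, so a health check on each individual drive would be needed as well.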
I'm happy to report that as of Friday afternoon, thanks to long hours of work by our best hardware guys, both servers are (to the best of our knowledge) repaired and fully functioning.
Our managed servers as a rule boast the highest availability of all our services, with many of them enjoying near-100% historical uptime. But the fact remains that hardware components, and especially moving parts such as hard drives, simply wear out and break. We do what we can to ensure that when they do, repair and recovery are relatively painless, but this past week presented a 'perfect storm' of hardware problems.
By the way, Conan's friend Subotai rescued him from the Tree of Woe, and Conan returned heroically to vanquish the enemy.
-JM