Last month, we had a couple of glitches with the storage system of our shared hosting environment. A number of you asked what we were doing to prevent them from happening again, so here's an overview of what happened and what's next:
Our shared hosting system was built with reliability in mind, with many redundant elements. It's a load-balanced cluster of webservers, so that if any one webserver goes belly-up, we keep running without any interruption. The filesystem itself runs on dozens of hard drives, so losing a hard drive causes no significant headaches either.
There are parts of the shared system that aren't redundant in the way we'd like: most of all, the file storage servers. If a file server crashes, web pages and graphics cannot load. That's what happened in September.
When a server starts randomly crashing, despite no changes being made, it can mean it's suffering from a hardware problem that is difficult to diagnose. So, our first step will be to move everything to a different server as soon as possible. We've already deployed a new file server, and this initial migration will be done later this month.
But the real problem is our reliance on each file server, so we've been building a truly redundant storage solution since summer. In fact, a number of volunteers have had their websites hosted on this new system for a while already. We intend to migrate all customer sites to the new system by year's end.
Technical info: The new storage system (using OCFS, AOE, and DRBD) involves multiple storage servers, each of which stores all the data. That means if one has a serious hardware failure, another can take over.
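For the curious, here's a toy Python sketch of the failover idea. It is only an illustration: the class, function, and server names are made up for this example, and it doesn't reflect how OCFS, AOE, or DRBD work internally (they handle replication and failover at the block and filesystem level, not in application code).

```python
class StorageServer:
    """Hypothetical stand-in for one storage server holding a full copy of the data."""

    def __init__(self, name, files):
        self.name = name
        self.files = dict(files)  # every server stores all the data
        self.online = True

    def read(self, path):
        if not self.online:
            raise ConnectionError(f"{self.name} is down")
        return self.files[path]


def read_with_failover(servers, path):
    """Try each server in order and return the first successful read."""
    for server in servers:
        try:
            return server.read(path)
        except ConnectionError:
            continue  # this server is down, so try the next copy
    raise ConnectionError("no storage server is available")


data = {"/index.html": "<h1>hello</h1>"}
storage_a = StorageServer("storage-a", data)
storage_b = StorageServer("storage-b", data)

storage_a.online = False  # simulate a serious hardware failure
page = read_with_failover([storage_a, storage_b], "/index.html")
```

With one server marked offline, the read is still served from the other copy. Only if every server is down does the request fail.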
If you're interested in being a beta tester, just let us know. Keep in mind that beta testing involves occasional interruptions while we test, reconfigure, and reboot the new system.