
December 16, 2008

Monday's Outage Explained

Modwest customers just received the following email about a service issue yesterday:

As many of you are aware, we had a pretty serious issue on the Modwest shared hosting system yesterday. Some of the details are at http://status.modwest.com, but I also wanted to take a moment to let you know what happened and what we're doing to prevent it from happening again.

At Modwest, we strive to build and maintain server systems that can survive the most common problems. For power issues, we use large backup batteries for all our servers, and our datacenter provides generator backup power to ensure continuous power during longer utility power outages.

Thankfully, long utility power outages are extremely rare and we haven't had one in more than five years. Every Monday morning, the datacenter runs a test to make sure the generator works, and when that happens, electricity in the datacenter fluctuates, just a little.

For reasons that may be impossible to determine, right around the time of that fluctuation, one of our backup batteries briefly stopped providing any electricity to the servers relying on it. All those servers immediately shut off. A moment later, power returned, and the servers all came back online. Except for one. A server responsible for storing customer website files refused to completely start. Because it was temporarily offline, the sites it stores were not accessible. Even worse, thousands of other sites were also affected, and started slowing down.

We divided the "Ops" team into two groups: one dedicated to solving the underlying problem, and one to mitigating its side effects and communicating progress. Both groups succeeded. We eliminated most of the secondary effects by mid-afternoon. After a long conference call with the vendor and a subsequent operating system reinstall, we brought the problematic storage server and its thousand or so directly affected sites back online at around 9 PM last night. While the exact technical cause isn't known, we believe that upgrading the capacity of the backup battery that seems to have caused the problem will help prevent this from happening again. The upgrade equipment is already on the way.

For those of you who were affected by the service problems yesterday, I apologize. You can never get away from machines behaving unexpectedly, but you can always improve your preparedness. We will continue to improve our systems to ensure excellent hosting service reliability.

Feel free to comment here with any questions, or contact us.

 
-JM
