Modwest customers just received the following email about a service issue yesterday:
As many of you are aware, we had a pretty serious issue on the Modwest shared hosting system yesterday. Some of the details are at http://status.modwest.com but I wanted to also take a moment and let you know what happened and what we're doing to prevent it from happening again.
At Modwest, we strive to build and maintain server systems which can survive all the most common problems. For power issues, we use large backup batteries for all our servers, and our datacenter provides generator backup power to ensure continuous power during longer utility power outages.
Thankfully, long utility power outages are extremely rare and we haven't had one in more than five years. Every Monday morning, the datacenter runs a test to make sure the generator works, and when that happens, electricity in the datacenter fluctuates, just a little.
For reasons that may be impossible to determine, right around the time of that fluctuation, one of our backup batteries briefly stopped providing any electricity to the servers relying on it. All those servers immediately shut off. A moment later, power returned, and the servers all came back online. Except for one. A server responsible for storing customer website files refused to completely start. Because it was temporarily offline, the sites it stores were not accessible. Even worse, thousands of other sites were also affected, and started slowing down.
We divided the "Ops" team into two groups, one dedicated to solving the underlying problem, and one to mitigating its side effects and communicating progress. Both groups succeeded. We eliminated most of the secondary effects by mid-afternoon. After a long conference call with the vendor and a subsequent operating system reinstall, we brought the problematic storage server and its thousand or so directly affected sites back online at around 9 PM last night. While the specific technical causes aren't known, we believe that upgrading the capacity of the backup battery that seems to have caused the problem will help prevent this from happening again. The upgrade equipment is already on the way.
For those of you who were affected by the service problems yesterday, I apologize. You can never get away from machines behaving unexpectedly, but you can always improve your preparedness. We will continue to improve our systems to ensure excellent hosting service reliability.
Feel free to comment here with any questions, or contact us.
-JM