Greetings, Modwesterners! Yesterday we had some intermittent service disruptions, so I wanted to give you the short and long version of what happened.
Thursday, as a result of some planned upgrades, our storage system began to show some faults, causing a portion of sites to become unavailable for short periods throughout the day. As you might expect, this service disruption was not part of our plan. I usually try to explain exactly how much downtime is involved, but in this case, it varies depending on where your site lives on our storage system. At most, some sites may have experienced an hour of downtime, while other sites would have been fine for all but a few minutes.
We've kept our status page updated, but I want to mention that the downtime shown for HTTP (web) service is exaggerated. Modwest sites were not down for over 5% of the day (yesterday or today), as the numbers might lead you to believe. The uptime calculations are complex and sometimes overly sensitive. Unfortunately for us, that hurts our perceived six-month average, which had been well over 99.9% until recently.
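For anyone who wants to check the arithmetic behind those percentages, here is a rough sketch. It is not how our status page computes uptime; the function names and the 182-day length of the six-month window are assumptions made purely for illustration. It only uses the figures mentioned above: a worst case of about one hour of downtime and a 99.9% uptime target.

```python
HOURS_PER_DAY = 24


def downtime_percent(downtime_hours, window_hours):
    """Share of the window spent down, as a percentage."""
    return 100.0 * downtime_hours / window_hours


def uptime_percent(downtime_hours, window_hours):
    """Share of the window spent up, as a percentage."""
    return 100.0 - downtime_percent(downtime_hours, window_hours)


def downtime_budget_hours(target_uptime_percent, window_hours):
    """Hours of downtime a window can absorb while still meeting the target."""
    return window_hours * (100.0 - target_uptime_percent) / 100.0


# Worst case described above: about one hour of downtime in a single day.
worst_case_day = downtime_percent(1, HOURS_PER_DAY)

# Assumption: a six-month window of 182 days.
six_months_hours = 182 * HOURS_PER_DAY

# Downtime that a 99.9% target allows over that window.
budget = downtime_budget_hours(99.9, six_months_hours)

print(f"One hour down in a day: {worst_case_day:.2f}% of that day")
print(f"99.9% over six months allows about {budget:.2f} hours of downtime")
print(f"One hour down over six months: {uptime_percent(1, six_months_hours):.3f}% uptime")
```

Under these assumptions, even the worst-affected sites stay below 5% downtime for the day, and a single one-hour outage on its own would not pull a six-month average under 99.9%.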
Want to know more? Here's the technical scoop:
Over the last few weeks, our Modwest Grid has undergone some growing pains, largely due to the popularity of our Yep! hosting plans. We have known that additional storage servers in our cluster were needed in order to account for our continual data growth. However, the data growth outpaced our planned project and unfortunately, that has led to some service interruptions.
Our Storage Cluster expansion project is relatively simple -- we're adding storage servers to the cluster, thereby balancing the thousands of customers' accounts across more pieces of hardware. In this particular case, our bottleneck is drive contention -- not enough spinning disks for the thousands and thousands of requests that our Grid Web Cluster receives. As the Storage Cluster, or any of its member servers, slows down, the entire Modwest Grid Web Cluster can be affected.
What are we doing about that? Well, in addition to fine-tuning active processes on the existing storage servers in order to prioritize web traffic, we just completed the hardware and software deployment of a new storage server this afternoon (Thursday, September 8, 2011). We'll be moving some busy accounts over to the newly deployed storage server this evening, which will reduce the disk contention on the other, over-taxed storage servers. This is the first part of the deployment plan. Once it's complete, we are planning to add yet another storage server to the Storage Cluster in order to provide even more resiliency to our Grid system.
To learn more about how the Modwest Grid works, you can check out our blog post from last year that describes the architecture: https://blog.modwest.com/2010/08/introducing-the-modwest-grid.html
For those of you who experienced downtime yesterday, I sincerely apologize, and I hope you felt informed by our updates and by the responses from our outstanding support team. Please let me know if you have any feedback.
- Steven, Cofounder