Last week, two separate service issues disrupted Modwest services. Combined, these issues caused up to 2 hours of downtime for some sites on the Modwest Grid, and one of the events, a network saturation during peak hours, also affected all network services (websites, email, DNS resolution, ...). Modwest kept its status page updated with brief descriptions of what happened and referenced the status page on Modwest's Facebook page, main Twitter feed, and status-specific Twitter feed. We would like to explain what happened in more detail and what we're doing to prevent this from happening again.
The network outage which occurred on Thursday, August 18 was caused by an exploited customer web application. This customer was using a vulnerable, out-of-date plug-in for WordPress which, once exploited, allowed our network to become saturated with traffic (read below for more details). We are currently investigating various methods of detecting this type of network congestion event and preventing it from happening again. We hope that our blog post about the exploited timthumb.php vulnerability will encourage Modwest customers to protect themselves and their site data.
The Grid web cluster outages occurred on various dates during the week of August 17-24 and were caused by storage server disk IO issues (read below for more details). These interruptions were quickly fixed and occurred primarily off-hours. Long term, our solution to this issue is to increase the number of operational storage servers. The Modwest Operations Team is in the process of rebuilding two storage servers, which need updated software and have a greater capacity than the additional storage server currently in use.
We thank you for your continued support and use of Modwest Complete Web Hosting Services. The Modwest Team is happy to serve you and to continue assisting with any issues you may have regarding any part of your Modwest service. Please feel free to contact Modwest Support by phone at 888-549-0917 or 1-406-541-4678, by chat, or by email at support@modwest.com.
NETWORK OUTAGE
Thursday, August 18
5:00 p.m. MDT: The Modwest Operations Team detected a problem with the Modwest Network. Our monitoring system detected multiple service failures for systems across our network, including the Modwest Grid servers (web servers, database servers, and storage servers), managed servers, and VPS servers. Failures affecting such disparate systems indicated that Modwest was not dealing with the failure of a single server but rather that something was wrong with the network itself -- possibly a failed component or network saturation.
5:22 p.m. MDT: We posted to our status page that we were experiencing network congestion on our internal network and that we were working on diagnosing the problem and implementing a solution.
5:28 p.m. MDT: We narrowed down the source of the congestion, confirmed it, and mitigated it by throttling the network port for one of our VPS hardware nodes. The Modwest network began functioning normally after we blocked the source of the trouble. One of our unmanaged VPS customers was the source and, regardless of whether the traffic was legitimate customer traffic or the result of an exploit used by 'bad guys', we had to shut down that customer's VPS container.
6:00 p.m. MDT: We had shut down the affected VPS container and the other VPS containers hosted on the same hardware node proceeded to function normally.
Friday, August 19
In coordination with the unmanaged VPS customer, we spent two hours performing forensics work on the affected VPS container in order to identify the source of the unexpected traffic. We determined that the customer's web application had been compromised through a recently announced 0-day exploit for the timthumb.php utility. You can read more about this vulnerability in our blog post from August 19. A web application on the customer's VPS was compromised on Wednesday, August 17 and a socket to another server had been opened. Since Modwest's forensics work was based on evidence in log files and other files on the filesystem, and was not performed while the event was occurring live, we don't know what kind of data was being transferred through this open socket. The 'bad guys' who exploited the VPS customer's web application lost control of their data transfer when they congested the Modwest network, disabling Modwest network services as well as their own unauthorized network connection. By the end of Thursday, the unmanaged VPS in question had pushed 975GB of data through the Modwest network interface, which is nearly 2000 times the traffic that VPS container transfers in a typical day.
We are currently investigating various methods of detecting this type of network congestion event and preventing it from happening again, and we hope that our blog post about the exploited timthumb.php vulnerability will encourage Modwest customers to protect themselves.
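As an illustration only, and not a description of Modwest's monitoring, one simple way to detect this kind of event is to compare each container's traffic for the day against its own typical daily volume and flag large multiples. The sketch below is in Python; the container names, the 0.5GB baseline for the affected VPS (chosen so that 975GB works out to a little under 2000 times the baseline, as described above), and the threshold of 50 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

GB = 10**9  # decimal gigabyte, in bytes


@dataclass
class ContainerTraffic:
    name: str                      # hypothetical container label
    baseline_bytes_per_day: float  # typical daily transfer (assumed known)
    bytes_today: float             # bytes transferred so far today


def traffic_ratio(c: ContainerTraffic) -> float:
    """Return today's traffic as a multiple of the container's baseline."""
    if c.baseline_bytes_per_day <= 0:
        # No usable baseline: treat any traffic at all as anomalous.
        return float("inf") if c.bytes_today > 0 else 0.0
    return c.bytes_today / c.baseline_bytes_per_day


def find_anomalies(containers: List[ContainerTraffic],
                   threshold: float = 50.0) -> List[str]:
    """Names of containers whose traffic is at least `threshold` x baseline."""
    return [c.name for c in containers if traffic_ratio(c) >= threshold]


# Example data: "vps-a" mirrors the 975GB figure above with an assumed
# 0.5GB/day baseline; "vps-b" is an ordinary container for contrast.
containers = [
    ContainerTraffic("vps-a", baseline_bytes_per_day=0.5 * GB, bytes_today=975 * GB),
    ContainerTraffic("vps-b", baseline_bytes_per_day=2.0 * GB, bytes_today=3.0 * GB),
]
flagged = find_anomalies(containers)
```

With these example numbers only "vps-a" is flagged. A real detector would also need to look at shorter time windows than a full day, since the network was saturated within hours.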
GRID WEB CLUSTER OUTAGES
Wednesday, August 17
Unless you are already familiar with the architecture of how the Modwest Grid works, you'll probably want to look at this blog post from last year for an introduction to the various components of the Modwest Grid.
12:55 a.m. MDT: We reported a problem with our Modwest Grid web hosting cluster that had been occurring for the previous hour.
1:12 a.m. MDT: The problem was resolved. We determined that one of our storage servers was not serving files properly due to a service configuration issue, which was the result of an odd edge case involving our configuration management system, Puppet. As the Grid web servers were not able to retrieve valid data from the affected storage server, they started to slow down in the processing of requests. Modwest load balancers, which test servers for responsiveness, remove unresponsive servers from the cluster. One of the Apache checks that our load balancers perform relies on data that was hosted on the affected storage server. As the Apache timeouts on the Grid web servers increased, our load balancers began removing the Grid web servers from the cluster. When all of the Grid web servers had been removed from the cluster, all Grid websites timed out, regardless of the storage server on which each website was stored.
We are making two changes based on this event. First, we have put additional checks in place in Modwest's configuration management system to ensure service configuration files are not corrupted. Second, Modwest is in the process of reconfiguring its load balancers to use multiple information sources about traffic flow before pulling a Grid web server from the cluster.
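To make the second change concrete, here is a minimal Python sketch of a removal decision that consults several independent checks instead of one. It is not Modwest's load balancer configuration: the check names, the requirement that at least two checks fail, and the rule that at least one server always stays in service are assumptions made for the example.

```python
from typing import Dict, List

# Each server maps check names to results (True means the check passed).
# The check names used below are hypothetical.
CheckResults = Dict[str, bool]


def failed_checks(checks: CheckResults) -> int:
    """Count how many checks failed for one server."""
    return sum(1 for ok in checks.values() if not ok)


def should_remove(checks: CheckResults, min_failures: int = 2) -> bool:
    """Remove a server only when several independent checks agree."""
    return failed_checks(checks) >= min_failures


def plan_removals(cluster: Dict[str, CheckResults],
                  min_failures: int = 2,
                  min_in_service: int = 1) -> List[str]:
    """Pick servers to pull, worst first, never emptying the cluster."""
    candidates = sorted(
        (name for name, checks in cluster.items()
         if should_remove(checks, min_failures)),
        key=lambda name: (-failed_checks(cluster[name]), name),
    )
    max_removable = max(0, len(cluster) - min_in_service)
    return candidates[:max_removable]


# Only the storage-backed check fails on web1, as in the August 17 event;
# web2 fails every check and is genuinely unhealthy.
cluster = {
    "web1": {"storage_backed_page": False, "local_static_page": True, "tcp_connect": True},
    "web2": {"storage_backed_page": False, "local_static_page": False, "tcp_connect": False},
}
removals = plan_removals(cluster)
```

Here "web1" stays in the cluster because only the check that depends on the storage server fails, while "web2" is removed. The `min_in_service` floor reflects the observation above that removing every Grid web server takes down all sites at once.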
Friday, August 19
10:31 p.m. MDT: A problem with the Modwest Grid web hosting cluster affecting websites hosted on the Modwest Grid as well as Modwest Webmail was reported. Modwest is in the process of deploying additional storage servers and rebalancing customer data among the Grid storage servers. The new storage server, although passing initial tests, has exhibited behavior not seen in the other storage servers, particularly during nightly backups: the server sends the backup data too fast, which uses up more disk IO than it should. If backups run during business hours, the demand of the backups on the storage server combined with requests from the Grid web servers exceeds the overall disk IO capacity. As requests from the Grid web servers time out, load balancers remove affected Grid web servers from the cluster, and Modwest customers and visitors experience intermittent timeouts on their sites.
We are working on a solution to this issue from two angles. First, we are investigating why this storage server is sending backup data too quickly. Second, we are reconfiguring nightly backups for this storage server to reduce their priority relative to other requests for disk IO.
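This post does not say how the backup priority is being lowered. As a related illustration, the Python sketch below shows a different but common technique, capping the rate at which a backup stream is read so that it cannot consume all available disk IO. The function name, the chunk size, and the rate are assumptions chosen for the example, and the clock and sleep functions are parameters only so that the behavior can be checked without actually waiting.

```python
import io
import time
from typing import BinaryIO, Callable


def throttled_copy(src: BinaryIO,
                   dst: BinaryIO,
                   max_bytes_per_sec: float,
                   chunk_size: int = 64 * 1024,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> int:
    """Copy src to dst without exceeding max_bytes_per_sec on average."""
    if max_bytes_per_sec <= 0:
        raise ValueError("max_bytes_per_sec must be positive")
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # How long the copy should have taken so far at the target rate.
        expected = copied / max_bytes_per_sec
        elapsed = clock() - start
        if expected > elapsed:
            # Running ahead of the cap: pause so other IO gets a turn.
            sleep(expected - elapsed)
    return copied


# Small in-memory demonstration: 1 MB copied at a cap of 10 MB per second.
source = io.BytesIO(b"\0" * 1_000_000)
target = io.BytesIO()
total = throttled_copy(source, target, max_bytes_per_sec=10_000_000)
```

A cap like this trades a longer backup window for steadier response times, so it only helps if the backups can still finish before daytime traffic picks up.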
Sunday, August 21
8:01 a.m. MDT: We reported a problem with our Modwest Grid web hosting cluster, affecting websites hosted on the Grid as well as webmail. This event was very similar to the event from August 19, but occurred in the morning. Nightly backups were running late, and those requests combined with normal traffic on the Grid web servers to cause problems for the affected storage server. The problems for the Modwest Grid web servers were infrequent and intermittent, primarily because the Modwest Operations Team was actively lowering the priority of the nightly backup requests as they launched.
Wednesday, August 24
4:00 – 6:00 a.m. MDT: Intermittent problems occurred with the Modwest Grid web hosting cluster, affecting websites hosted on the Modwest Grid as well as Modwest Webmail. Again, nightly backups competed for disk IO with requests from the Modwest Grid web servers, which were handling increased traffic from U.S. East Coast morning business, and the combination caused congestion on the affected storage server. The Modwest Operations Team resolved the issue as soon as they were able, but our response time was admittedly too long. Changes were made to reduce the priority of the nightly backup requests in order to alleviate the disk IO problems on the affected storage server.
Long term, Modwest's solution to this storage server issue is to increase the number of operational storage servers. The Modwest Operations Team is in the process of rebuilding two storage servers, which need updated software and have a greater capacity than the storage server described in the above notes. Once those storage servers are redeployed, intermittent service issues caused by overloaded disk IO will no longer occur.