by John Masterson
In late November, Modwest staff and a handful of customers noticed that we dropped off the internet for a few minutes. After a significant investigation, we struck out in identifying the source, and so we moved on to other concerns.
Then it happened again. And again.
Third time's a charm, so we began improving our ability to monitor our network traffic in fine detail and to narrow down the possible causes of the disruptions. Was it a failing switch or router? Some sort of misconfiguration? An attack? We didn't know yet, so we launched a thorough investigation.
The issue returned intermittently. A week would pass, and then we'd have four 5-minute events in a day. And for a time, while we had our suspicions (which eventually turned out to be correct), the specific source eluded us.
But we kept on the investigation. We determined that something was creating a massive burst of internal network traffic, for a few minutes at a time. So much traffic that one of our core switches would stop responding briefly, causing a flurry of "host is down" alerts, and causing some customer websites and email to stop responding.
Our process involved upgrading and improving the configuration of our NMIS implementation to be sure we were monitoring virtually everything that can be monitored and graphed. We also improved our switches' "storm control" settings, a feature which can temporarily shut down affected network segments in the event of a sudden burst of traffic beyond a normal profile. And we analyzed packet captures for source, destination, rate, and protocol.
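For the curious, here is a rough sketch of what "analyzing packet captures for source, destination, rate, and protocol" can look like in practice. It is not the tooling we used, just an illustration in Python. It assumes the capture has already been parsed into simple tuples of (timestamp, source, destination, protocol, size); that record format, the function name `summarize_flows`, and the sample addresses are all made up for the example.

```python
from collections import defaultdict


def summarize_flows(packets):
    """Group packets by (source, destination, protocol) and compute byte rates.

    `packets` is an iterable of (timestamp_seconds, src, dst, proto, size_bytes)
    tuples. This is a hypothetical format standing in for parsed capture output.
    """
    flows = defaultdict(
        lambda: {"bytes": 0, "packets": 0, "first": None, "last": None}
    )
    for ts, src, dst, proto, size in packets:
        flow = flows[(src, dst, proto)]
        flow["bytes"] += size
        flow["packets"] += 1
        flow["first"] = ts if flow["first"] is None else min(flow["first"], ts)
        flow["last"] = ts if flow["last"] is None else max(flow["last"], ts)

    summary = []
    for (src, dst, proto), flow in flows.items():
        # Treat any flow shorter than one second as lasting one second, which
        # avoids division by zero and wildly inflated rates for tiny flows.
        duration = max(flow["last"] - flow["first"], 1.0)
        summary.append({
            "src": src,
            "dst": dst,
            "proto": proto,
            "packets": flow["packets"],
            "bytes": flow["bytes"],
            "bytes_per_sec": flow["bytes"] / duration,
        })
    # Heaviest flows first, so the likely culprit is at the top of the list.
    return sorted(summary, key=lambda f: f["bytes_per_sec"], reverse=True)


# Made-up sample data using documentation-only IP ranges.
sample = [
    (0.0, "10.0.0.5", "203.0.113.9", "UDP", 1400),
    (0.5, "10.0.0.5", "203.0.113.9", "UDP", 1400),
    (1.0, "10.0.0.5", "203.0.113.9", "UDP", 1400),
    (2.0, "10.0.0.5", "203.0.113.9", "UDP", 1400),
    (0.2, "10.0.0.7", "198.51.100.4", "TCP", 600),
    (1.8, "10.0.0.7", "198.51.100.4", "TCP", 600),
]
top = summarize_flows(sample)
for flow in top:
    print(flow["src"], "->", flow["dst"], flow["proto"], flow["bytes_per_sec"])
```

Sorting by rate rather than by total bytes is a deliberate choice here: a short, violent burst matters more for this kind of problem than a long, slow transfer.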
Once we'd narrowed the issue down to a group of Modwest Grid webservers, we tried to catch it in the act (using tools called tcpdump and Wireshark, if you're curious). After a few days, we succeeded. And the new monitoring we'd installed on the webservers logged a gigantic burst of outbound traffic associated with one particular customer.
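To give a feel for how per-customer logs can point at a single account, here is a minimal sketch of one way to flag a burst. It assumes outbound traffic has been recorded as (account, minute, bytes) samples and flags any minute that is far above that account's own median. The log format, the account names, the function `find_bursts`, and the thresholds are all hypothetical; this is an illustration, not a description of our actual monitoring.

```python
from collections import defaultdict
from statistics import median


def find_bursts(samples, factor=10.0, min_bytes=1_000_000):
    """Return (account, minute, bytes) for minutes far above an account's median.

    `samples` is an iterable of (account, minute, outbound_bytes) tuples.
    `factor` and `min_bytes` are illustrative thresholds, not tuned values.
    """
    per_account = defaultdict(list)
    for account, minute, sent in samples:
        per_account[account].append((minute, sent))

    bursts = []
    for account, points in per_account.items():
        # Use the median as the baseline so one huge spike does not
        # drag the baseline up and hide itself.
        baseline = median(sent for _, sent in points)
        for minute, sent in points:
            if sent >= min_bytes and sent > factor * max(baseline, 1):
                bursts.append((account, minute, sent))
    # Largest bursts first.
    return sorted(bursts, key=lambda b: b[2], reverse=True)


# Made-up sample: "bob" has one minute of enormous outbound traffic.
log = [("alice", m, 20_000) for m in range(6)]
log += [
    ("bob", 0, 30_000),
    ("bob", 1, 25_000),
    ("bob", 2, 900_000_000),
    ("bob", 3, 28_000),
    ("bob", 4, 31_000),
    ("bob", 5, 27_000),
]
bursts = find_bursts(log)
print(bursts)
```

The absolute floor (`min_bytes`) keeps quiet accounts from being flagged just because a small number happens to be ten times an even smaller one.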
The culprit? The customer's outdated WordPress installation had been previously exploited by a remote hacker to install an attack program, which was being periodically accessed and used to launch a massive UDP packet flood at various external hosts.
We deleted the program, informed the customer, and locked down a few other parts of our network to detect and quarantine similar events in the future.
We try to turn every disruptive event into an opportunity to improve operations, and this case was no exception. It highlighted the importance of knowing your network setup in depth and backing that knowledge with thorough monitoring, reporting, and graphing -- all of which we now have in place. It also serves as a reminder of the potential impact of ignoring or overlooking those pesky security updates in common web applications like WordPress, Joomla, and Drupal.
At Modwest, we have one day a month we call "Update Day". It's the day on which we go through our list of servers and switches, and ensure everything has the latest patches for security, performance, and functionality. Consider (or talk to your webmaster about) doing the same -- it makes your upgrades easier and helps keep your site secure.
There will always be bugs, exploits, and attackers. This particular issue is resolved, and we're much better equipped to deal with the next one. Onwards!