One of our guiding principles is to continuously improve our systems by never getting bitten twice by the same problem. When disaster strikes, we review what happened and see what changes we can make to prevent similar headaches in the future. We don't always find a fixable “tragic flaw,” but the review is still valuable.
Modwest hosts tens of thousands of mailboxes on our shared hosting mail cluster. We do this by spreading mail services out over numerous load-balanced servers. Some are devoted to processing messages on arrival – ascertaining whether the destination address is valid and then scanning the messages for viruses and spam content. Others are devoted to storage of the messages themselves. The first group (the “front end”) is like a strainer that does its best to ensure only legitimate messages get through to the second group, the “back end.”
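For the technically curious, here’s a rough sketch of that division of labor in Python. This is not our actual mail software – the addresses, rules, and function names are all made-up placeholders – it only illustrates the idea of a front end that strains and a back end that stores.

```python
# Toy sketch of the front end / back end split. Every name and rule here
# is a hypothetical placeholder, not our real mail stack.

KNOWN_MAILBOXES = {"alice@example.com", "bob@example.com"}
BLOCKED_PHRASES = ("free pills", "urgent wire transfer")

def front_end_accepts(recipient: str, body: str) -> bool:
    """The 'strainer': reject invalid addresses and obvious junk."""
    if recipient not in KNOWN_MAILBOXES:
        return False                      # no such mailbox
    if any(phrase in body.lower() for phrase in BLOCKED_PHRASES):
        return False                      # failed the content scan
    return True

def deliver(recipient: str, body: str, backend_store: dict) -> None:
    """Only messages that pass the front end reach back-end storage."""
    if front_end_accepts(recipient, body):
        backend_store.setdefault(recipient, []).append(body)

store = {}  # stands in for the back-end storage servers
deliver("alice@example.com", "Lunch tomorrow?", store)
deliver("nobody@example.com", "Hello?", store)           # rejected: unknown address
deliver("bob@example.com", "FREE PILLS inside!", store)  # rejected: content scan
print(store)  # {'alice@example.com': ['Lunch tomorrow?']}
```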
Due to a series of headache-inducing events, it was the group of back-end storage servers that we had trouble with earlier this month. For the gory details, read on. Or skip to the lessons learned.
It all started with a report from one customer – they had a subfolder which was mysteriously inaccessible via webmail. After some initial debugging on the appropriate back-end storage server, we concluded that the mailbox was corrupt.
Each mailbox can be thought of as a book and its index. Each message is a page in the book, and each page is referenced in the index. Every entry in the index has a corresponding page. If the index and the pages don’t agree for some reason, a system administrator needs to “reconstruct” the mailbox to eliminate the corruption. This is very infrequent, but it happens from time to time.
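Here’s a toy illustration of the idea, with a Python dictionary standing in for the “pages” and a set standing in for the “index.” This is not the mail server’s real on-disk format, just the concept of an index that has to agree with the messages it points to.

```python
# Toy model of mailbox corruption and reconstruction. The real on-disk
# format is more involved; this only shows the idea.

def is_consistent(messages: dict, index: set) -> bool:
    """Every index entry should reference a real message, and vice versa."""
    return index == set(messages)

def reconstruct(messages: dict) -> set:
    """Discard the suspect index and rebuild it from the messages themselves."""
    return set(messages)

messages = {1: "Hi", 2: "Meeting at 3", 3: "Invoice attached"}  # the "pages"
index = {1, 2, 4}        # entry 4 points at nothing; message 3 isn't indexed

print(is_consistent(messages, index))   # False: the mailbox is "corrupt"
index = reconstruct(messages)
print(is_consistent(messages, index))   # True: corruption eliminated
```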
This reconstruction process usually takes a few seconds, or up to a few minutes for mailboxes that contain lots of messages and folders. On this mailbox, the process seemed to still be running after close to 24 hours. In fact, it was stalled, unable even to begin work on the problematic folder.
Some additional research indicated the problem was not with the mailbox, but the underlying filesystem – the software glue between the mail server software and the physical hard drives. The only fix for that problem is an hours-long offline repair process. So, we began preparations for a period of downtime.
The next day, our patient took a turn for the worse. Right in the middle of the business day, the server’s CPU utilization skyrocketed and performance plummeted. We did everything we could to mitigate the problem, but with webmail taking minutes to load every page, we finally decided that we weren’t really providing a “service” anyway and that we had better begin the offline filesystem repairs immediately. The repair took nine hours to complete.
And it didn't help.
So our next tactic was to move mailboxes off the problematic server as quickly as possible. Luckily, we had a brand new machine racked up and ready for exactly this purpose, and after some last-minute software upgrades and configuration changes, it was ready to help. We moved some 4,000 mailboxes late that night.
And that helped a lot! What a relief, for everyone.
Except that the upgrades required to integrate the new server left our oldest mailbox storage server out in the cold. Its older software was unable to communicate properly with the newly upgraded components, so IMAP access to its mailboxes, including webmail, was unavailable. Thousands of mailbox owners were denied access unless they added some special settings or used an alternate webmail address we set up.
Our first plan was to move all the mailboxes on the older server to the brand new one. In theory, this could be done transparently and without interruption. But after exhaustive research on the plan, we concluded it just wouldn't work. An upgrade with data-in-place was the only way forward.
Meanwhile, since we’re preparing for an office move (more on this in a future post), we are in the process of organizing and relocating various equipment. Late one night, we moved a backup server without changing its power settings. The resulting voltage mismatch caused a power fluctuation that rebooted several servers – including the brand new mail storage server. So, the thousands of mailboxes we’d just moved off the problem server were unavailable again while we ran the required filesystem repairs. This coincidence was just plain bad luck.
Near the end of last week, we scheduled and completed the final upgrade required to bring the older storage server up to speed with the rest. The mail system has been functioning fine ever since.
On the server side, we haven’t yet found any “silver bullets” to truly prevent this from happening again. There is no good way for us to provide meaningful additional redundancy that would host mailboxes in more than one place at once, and we still don’t know exactly what caused the initial problem.
What we can do is communicate better about issues like this.
- When we're having server problems, our status page needs to be front and center. Our support team received hundreds of calls during the various problems, and many people simply didn't know the status page existed. We're working on ways to remedy that.
- The status page needs to more clearly explain the scope of the issues and their resolution. Despite thousands of mailboxes being non-functional for hours, the page claimed that POP3 service was “100% OK” throughout. That’s because the statistics are generated automatically from a simple check of initial email connectivity – not from whether any mailboxes are actually functioning. So, we need to explain this better and/or implement more comprehensive checks; the sketch below illustrates the difference.
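To show what we mean, here is a rough sketch of a shallow connectivity check versus a deeper mailbox check, written against a plain POP3 server using Python’s standard library. The host, port, and credentials are placeholders, and this is not our monitoring code.

```python
# Shallow vs. deep health checks. Host, port, and credentials are
# placeholders; this illustrates the gap, nothing more.
import poplib
import socket

def shallow_check(host: str, port: int = 110) -> bool:
    """Roughly what the status page measured: does the port answer at all?"""
    try:
        socket.create_connection((host, port), timeout=5).close()
        return True
    except OSError:
        return False

def deep_check(host: str, user: str, password: str) -> bool:
    """What users actually experience: can a real mailbox be opened?"""
    try:
        conn = poplib.POP3(host, timeout=10)
        conn.user(user)
        conn.pass_(password)
        conn.stat()             # fails if the mailbox itself is broken
        conn.quit()
        return True
    except (OSError, poplib.error_proto):
        return False

# The shallow check can report "100% OK" while the deep check fails,
# which is exactly the gap customers experienced.
```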
As much as I'd like to, we can't prevent servers from breaking. But when problems do occur, we can and will do a better job of communicating what happened, who's affected, and what we're doing about it. And as always, we're happy to hear your feedback.