Earlier this week we experienced a storage server problem on the Modwest shared hosting system. I just wanted to give you the low-down on everything that happened and how we responded.
Late Monday night, the server warned us about a possible hard drive issue; it contains a whopping 24 drives, so one drive having a problem is normally no big deal. The next day, we prepared to replace the drive, which wasn't marked as "bad", just "unavailable" for some reason.
To make a long story short, we hit a glitch related to the confluence of Sun's Solaris operating system and LSI RAID hardware. The server crashed, and wouldn't restart. And, the 500GB of customer data on it was inaccessible.
Because something similar happened last year too, we immediately began preparing an alternate server to take this one's place. The new one runs good ol' Debian Linux.
But we needed access to the current website data. We opened a ticket with Sun commercial support; they couldn't diagnose it beyond "hardware problem", but agreed that reinstalling the operating system might work. Some 12 hours later, after numerous attempts, our server team succeeded in accessing the customer data and began transferring it off the problem server onto the standby server that was ready to take its place.
It takes a long time to copy millions of files totaling nearly 500GB of data between servers, even on our fast internal network, so a waiting game began.
We know our customers' sites are important to them, and we are analyzing all aspects of the event so we can improve our operation and prevent a recurrence.
During the outage, our awesome Support Team spoke with countless customers about the issue. We did our best to provide timely, accurate information about the work in progress, and we updated the public Status Page frequently. Did we do ok?
The Ops Team put in a tremendous effort over two days to get things up and running again as safely and quickly as possible. If you have a congratulatory haiku for the guys, please post it here: https://feedback.modwest.com/topic/103/Soothing_haikus_for_Modwest_engineers
We've recently hired a storage specialist who is now part of the team responsible for monitoring, maintaining, and improving our entire infrastructure. I'm much more confident now that the sorts of problems we endured with you this week will be less likely in the future. Thanks for your patience and understanding, and feel free to send us a soothing haiku!
-JM
Not sure why your boys in Missoula at Modwest deserve a pat on the back, Russ. First of all, they should have had a backup of customers' data to handle just this situation, with a backup server in place. Second, they should not have had to reinstall the OS on that Solaris system to gain access to those disks. Third, I can't believe it took 12 hours to bring that system back up unless it had a major hardware malfunction in the motherboard, in which case it would have had to be replaced, DOA. I think I would consider looking for another hosting provider. Most of them now are running virtual servers, and if something happens they just start your image somewhere else in a different bank.
Dana M. Proctor
Dandy Made Productions
Posted by: Dana Proctor | May 26, 2009 at 09:58 PM
Dana, thanks for your candid comments. Our current conclusion is that the problem stemmed from a bug in the LSI controller driver combined with Solaris' poor handling of that bug, which left the entire data store inaccessible. There may have been another way to overcome it, but after 10+ hours of trying, we opted to reinstall the OS as advised by Sun. Interestingly, a subsequent reboot after recovery caused the data to become inaccessible again, so something is seriously wrong with that machine.
The additional downtime came from transferring the data. 500GB of digital movies might transfer in a few hours on a fast network, but millions of small files are a different scenario.
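To give a rough sense of the arithmetic, here's a quick back-of-the-envelope sketch in Python; the file count, link speed, and per-file overhead are purely illustrative assumptions, not measurements from our systems.

    # Illustrative estimate of why millions of small files transfer slowly
    # compared with a few huge files of the same total size. Every number
    # here is an assumption made up for the example.
    TOTAL_BYTES = 500 * 1024**3          # ~500GB of customer data
    LINK_BYTES_PER_SEC = 1e9 / 8         # assume a ~1 Gbit/s link, ~125 MB/s raw
    NUM_FILES = 3000000                  # assumed number of small files
    PER_FILE_OVERHEAD_SEC = 0.01         # assume ~10ms per file for open/stat/
                                         # metadata creation on the destination

    # A handful of huge files: raw throughput dominates.
    bulk_hours = TOTAL_BYTES / LINK_BYTES_PER_SEC / 3600

    # Millions of small files: per-file overhead dominates.
    overhead_hours = NUM_FILES * PER_FILE_OVERHEAD_SEC / 3600

    print("raw transfer time:       ~%.1f hours" % bulk_hours)
    print("per-file overhead alone: ~%.1f hours" % overhead_hours)

Under those assumptions the per-file cost swamps the raw bandwidth, and that, not link speed, is what stretches a copy like this out.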
Even though everything goes down from time to time (including Google!), this interruption was far too long for everyone's liking, and we are now working on updating our server architecture to minimize any such disruption in the future.
Posted by: John Masterson | May 27, 2009 at 10:00 AM
Oh, one further thought -- I don't think Russ meant to praise any aspect of our technical server-wrangling efforts. We did our best to maintain frequent, honest communication throughout, via our toll-free number and the public status page at http://status.modwest.com, and I think that is the aspect of this unfortunate event he was intending to highlight at http://matr.net
Posted by: John Masterson | May 27, 2009 at 10:04 AM
Just a thought... Can rsync scale to mirror a 500GB array? Even if there were a slight delay in synchronization, you could keep your Debian setup around and use it for a near-realtime storage fallback. I assume that the drives in the storage device were redundant, but when the storage rack itself fails, we're all stuck. Thanks for getting us all back online either way.
Posted by: Trailhead | May 27, 2009 at 11:00 PM
rsync isn't really a good fit because it isn't atomic; across 500 GB of data, even if none of it has changed, rsync still has to look at each file, which takes a long time. If we wanted readily available day-old copies, snapshots would be a better solution ( http://en.wikipedia.org/wiki/Logical_Volume_Management#Snapshots )
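To make that concrete, here's a tiny Python sketch of the metadata-only scan rsync effectively repeats on every run; the path and the per-file timing in the comments are hypothetical, just to show where the time goes.

    # Illustrative only: time a metadata-only walk of a directory tree,
    # roughly the work rsync redoes even when nothing has changed.
    import os
    import time

    ROOT = "/srv/customer-data"   # hypothetical path, not our real layout

    start = time.time()
    count = 0
    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))   # one stat() per file
            count += 1

    print("stat()ed %d files in %.0f seconds" % (count, time.time() - start))
    # At even ~1ms per file, a few million files means the better part of
    # an hour just to decide that nothing needs copying; a snapshot, by
    # contrast, is effectively instant regardless of file count.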
However, in the long term we are looking into DRBD ( http://en.wikipedia.org/wiki/DRBD ) or something similar to accomplish mirroring dynamically and at scale. Additionally, having instant failover would save everyone buckets of stress.
Posted by: Thomas Connell | May 28, 2009 at 09:45 AM
DRBD looks awesome. Let us know how it's going. :)
Posted by: Trailhead | May 28, 2009 at 11:46 PM
I agree; this happens way too much, and every time it does I bring up the need for a fail-safe. I love supporting local business, but when we lose business because of it, I question your business model.
Posted by: Mark Roberts | December 17, 2009 at 04:51 PM
Mark, it certainly isn't a part of our business model to have a hardware failure every 6 months. We've been working towards the more redundant system described in the blog post but have not yet conquered all the technical issues. We're closer, but not close enough today unfortunately!
Thanks for bearing with us.
Posted by: John Masterson | December 17, 2009 at 04:59 PM
I will bear with you and pass this request along to our customers.
Posted by: Mark Roberts | December 17, 2009 at 06:59 PM