
May 22, 2009

Comments


Dana Proctor

Not sure why your boys in Missoula at Modwest deserve
a pat on the back, Russ. First of all, they should have
had a backup of customers' data to handle just this
situation, with the backup server in place. Second, they
should not have had to reinstall the OS on that Solaris
system to gain access to those disks. Third, I can't
believe it took 12 hours to bring that system back up
unless it had a major hardware malfunction in the
motherboard, and in that case it would have had to be
replaced, DOA. I think I would consider looking for
another host provider. Most of them now are running
virtual servers, and if something happens they just
start your image somewhere else in a different bank.

Dana M. Proctor
Dandy Made Productions

John Masterson

Dana, thanks for your candid comments. Our current conclusion is that the problem stemmed from a bug in the LSI controller driver combined with Solaris' poor handling of that bug, resulting in the entire data store being inaccessible. There may have been another way to overcome it, but after 10+ hours of work trying, we opted to reinstall the OS as advised by Sun. Interestingly, a subsequent reboot after recovery caused the data to become inaccessible again; so something's seriously wrong with that machine.

The additional downtime was spent transferring the data. 500GB of digital movies might transfer in a few hours on a fast network, but millions of small files are a different scenario.

Even though everything goes down from time to time (including Google!), this interruption was far too long for everyone's liking, and we are now working on updating our server architecture to minimize any such disruption in the future.

John Masterson

Oh, one further thought -- I don't think Russ meant to praise any aspect of our technical server-wrangling efforts. We did our best to maintain frequent, honest communication throughout, via our toll-free number and our public status page at http://status.modwest.com and I think that's the aspect of this unfortunate event he intended to highlight at http://matr.net

Trailhead

Just a thought... Can rsync scale to mirror a 500GB array? Even if there were a slight delay in synchronization, you could keep your Debian setup around and use it for a near-realtime storage fallback. I assume that the drives in the storage device were redundant, but when the storage rack itself fails, we're all stuck. Thanks for getting us all back online either way.
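Something like this is what I had in mind -- just a rough sketch, and the hostname and paths here are made up, not anything real:

    # pull the live store onto the standby Debian box; cron could run this every 15 min
    rsync -aH --delete --numeric-ids root@storage1:/export/data/ /export/data-mirror/

The -a preserves permissions and timestamps, -H keeps hard links intact, and --delete keeps the mirror from accumulating files removed from the source.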

Thomas Connell

rsync isn't really a good fit because it isn't atomic; with 500 GB of data, even if none of it has changed, rsync would still have to examine each file, which takes a long time. If we wanted readily available day-old copies, snapshots would be a better solution ( http://en.wikipedia.org/wiki/Logical_Volume_Management#Snapshots )
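For the curious, a day-back snapshot on LVM is roughly this -- sketch only, the volume group names and sizes are invented:

    # create a 20GB copy-on-write snapshot of the data volume
    lvcreate --snapshot --size 20G --name data-yesterday /dev/vg0/data
    # mount it read-only if you ever need to pull yesterday's files out
    mount -o ro /dev/vg0/data-yesterday /mnt/yesterday

Because the snapshot is copy-on-write, it's created in seconds regardless of how many files the volume holds, which is exactly what rsync can't do.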

However, we are looking into DRBD ( http://en.wikipedia.org/wiki/DRBD ) or something similar in the long term to accomplish mirroring dynamically and at scale. Additionally, having instant failover would save everyone buckets of stress.
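To give a flavor of it, a DRBD resource definition looks roughly like this -- the hostnames, devices, and addresses below are placeholders, not our actual setup:

    resource r0 {
      protocol C;                # synchronous: a write completes on both nodes
      on storage1 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on storage2 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

The secondary's disk stays in sync at the block level, so failing over is essentially promoting the other node and mounting /dev/drbd0 there.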

Trailhead

DRBD looks awesome. Let us know how it's going. :)

Mark Roberts

I agree this happens way, way too much, and every time it does I ask about a fail-safe. I love supporting local business, but when we lose business because of it, I question your business model.

John Masterson

Mark, it certainly isn't a part of our business model to have a hardware failure every 6 months. We've been working towards the more redundant system described in the blog post but have not yet conquered all the technical issues. We're closer, but not close enough today unfortunately!

Thanks for bearing with us.

Mark Roberts

I will bear with you and pass this along to our customers.
