
May 22, 2009


Listed below are links to weblogs that reference Storage Server Issue: What Happened:

Comments

Dana Proctor

Not sure why your boys in Missoula at Modwest deserve
a pat on the back, Russ. First of all, they should have
had a backup of customers' data to handle just this
situation with the backup server in place. Second, they
should not have had to reinstall the OS on that Solaris
system to gain access to those disks. Third, I can't
believe it took 12 hours to bring that system back up
unless it had a major hardware malfunction in the
motherboard, and in that case it would have had to be
replaced, DOA. I think I would consider looking for
another host provider. Most of them now are running
virtual servers, and if something happens they just
start your image somewhere else in a different bank.

Dana M. Proctor
Dandy Made Productions

John Masterson

Dana, thanks for your candid comments. Our current conclusion is that the problem was related to a bug in the LSI controller driver combined with Solaris' poor handling of that bug, which left the entire data store inaccessible. There may have been another way to overcome it, but after 10+ hours of work trying to do so, we opted to reinstall the OS as advised by Sun. Interestingly, a subsequent reboot after recovery caused the data to become inaccessible again, so something's seriously wrong with that machine.

The additional unavailability time was associated with transferring the data. 500GB of digital movies might transfer in a few hours on a fast network, but millions of small files are a different scenario.

Even though everything goes down (including Google!) from time to time, this interruption was far too long for everyone's liking, and we are now working on updating our server architecture to minimize any such disruption in the future.

John Masterson

Oh, one further thought -- I don't think Russ meant to praise any aspect of our technical server-wrangling efforts. We did our best to maintain frequent, honest communication throughout via our toll-free number and public status page at http://status.modwest.com and I think this is the aspect of this unfortunate event he was intending to highlight at http://matr.net

Trailhead

Just a thought... Can rsync scale to mirror a 500GB array? Even if there were a slight delay in synchronization, you could keep your Debian setup around and use it for a near-realtime storage fallback. I assume that the drives in the storage device were redundant, but when the storage rack itself fails, we're all stuck. Thanks for getting us all back online either way.

Thomas Connell

rsync isn't really a good way because it isn't atomic; in 500 GB of data, even if none of it has changed, rsync would still have to look at each file, which takes a long time. If we wanted readily available day-old copies, snapshots would be a better solution ( http://en.wikipedia.org/wiki/Logical_Volume_Management#Snapshots )

However, we are looking into DRBD ( http://en.wikipedia.org/wiki/DRBD ) or similar in the long term to accomplish mirroring dynamically and at large scale. Additionally, having instant failover would save everyone buckets of stress.
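
To illustrate the per-file cost mentioned for rsync above, here is a minimal sketch in Python. It is not how rsync is implemented; `scan_tree` and `changed_files` are hypothetical helper names used only for this example. It shows that a file-level comparison has to stat every file on each pass, even when nothing has changed.

```python
import os

def scan_tree(root):
    """Return {relative_path: (size, mtime_ns)} by stat-ing every file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)  # one metadata lookup per file, changed or not
            manifest[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return manifest

def changed_files(old_manifest, new_manifest):
    """Paths that are new, or whose size or mtime differ from the earlier scan."""
    return sorted(
        path for path, meta in new_manifest.items()
        if old_manifest.get(path) != meta
    )
```

The work in `scan_tree` grows with the number of files rather than the number of bytes, which is why millions of small files are slower to handle than the same volume of data in a few large files.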

Trailhead

DRBD looks awesome. Let us know how it's going. :)
