Thursday, September 8, 2011

patent granted

This is a follow-up to a post I made last year.  About four years ago I applied for a patent on a method of replicating data from one process to another without blocking operations on the data.  It's used in the GemFire data fabric product to create backup copies of data buckets, and is called the "state flush operation".  In a way it provides a temporal point of virtual synchrony that assures the new replica bucket sees all of the changes to the data that the original bucket sees.

I got the idea for this work after reading a paper by Chandy and Lamport published back in 1985, "Distributed Snapshots: Determining Global States of Distributed Systems".

Basically what you do is create a sort of catcher's mitt that is set up to record operations on the data during the transfer, then you announce to everyone that the transfer is going to happen.  From that point on, any operations performed on the original bucket will also be sent to the new replica bucket.

Then you send a message to each member of the distributed system that holds the bucket telling them to apply in-process operations to the bucket and then "flush" those changes to the member holding the original bucket.  A message gets sent from each of these members to the original bucket holder that tells it which messages have been sent.  An observer is created that watches for all of the changes to arrive and be applied to the original.  It then sends a notice to the new replica that the operation has completed.

At this point the data may be copied from the original bucket to the replica bucket, taking care not to overwrite any items that have shown up in the catcher's mitt.  Because of the flush we know that the copied data holds any changes that were made prior to creating the catcher's mitt, but the catcher's mitt may hold operations that are newer than what is reflected in the copied data.
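
The steps above can be sketched as a toy, single-process simulation in Python.  This is not GemFire code and not the patented implementation: the function name, the use of plain dicts for buckets, and the restriction to "put" operations are all assumptions made for illustration, and a real system would do this with messages between members.

```python
def state_flush_copy(original, in_flight, concurrent_ops):
    """Toy model of the transfer (hypothetical names, puts only).

    original       -- dict standing in for the original bucket
    in_flight      -- (key, value) puts sent before the announcement
                      but not yet applied to the original
    concurrent_ops -- (key, value) puts made after the announcement
    """
    # Step 1: the new replica sets up its catcher's mitt, and the
    # transfer is announced to everyone.
    mitt = {}

    # Step 2: flush.  Every operation sent before the announcement is
    # applied to the original before any copying starts.
    for key, value in in_flight:
        original[key] = value

    # Step 3: the copy starts from the flushed state of the original.
    snapshot = dict(original)

    # Operations made after the announcement reach both the original
    # and the catcher's mitt, so the mitt can be newer than the snapshot.
    for key, value in concurrent_ops:
        original[key] = value
        mitt[key] = value

    # Step 4: build the replica, never overwriting an item that has
    # already shown up in the catcher's mitt.
    replica = dict(mitt)
    for key, value in snapshot.items():
        if key not in replica:
            replica[key] = value
    return replica


original = {"a": 1, "b": 2}
replica = state_flush_copy(
    original,
    in_flight=[("b", 3)],
    concurrent_ops=[("a", 10), ("c", 7)],
)
print(replica == original)
```

In this sketch the replica ends up equal to the original even though the snapshot was taken before the later operations arrived, because anything newer than the snapshot is already sitting in the catcher's mitt.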

Tuesday, August 30, 2011

Chaos Monkey

If you haven't heard of the Netflix Chaos Monkey, read Jeff Atwood's blog. This "monkey" roams around their cloud app killing processes to ensure that the system is resilient. IMO the MTBF for Java VMs isn't all that long unless a great deal of testing has been done, so this is a great way to keep the system healthy.  Jeff asserts that having the monkey in their system was at least part of the reason that Netflix survived the Amazon Web Services (AWS) crash.

When we test GemFire we run many High Availability (HA) tests that randomly kill server processes and then test to ensure that the product continues to run and maintains consistency.  That guarantees that the product reacts to failures correctly in short (10-60 minute) tests, but what about long-running distributed systems?  It would be nice to build an optional Chaos Monkey into the product that randomly killed off server-side processes (can't kill the clients!).  The system-monitoring infrastructure would have to be able to recognize the Monkey's work so that alarms aren't raised, but how hard could that be?

A smart Monkey could examine metadata about the system and, perhaps, give weight to older processes now and then when selecting a process to kill.  That would tend to shake things up a little more in the distributed system and test things like lock grantor fail-over.

The Monkey would need to have a collection of blind spots built into it so that customers could protect VMs that they don't want the Monkey to, er, monkey with.  GemFire might be well tested and be able to withstand a Chaos Monkey, but that doesn't mean the systems built with it could survive degradation of their own essential services.
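
A rough Python sketch of how such a Monkey might choose its victim is below.  Nothing like this exists in GemFire as far as this post is concerned; the `pick_victim` function, the process dictionaries with `name`, `role` and `started` fields, and the `age_bias` parameter are all hypothetical names invented for the example.

```python
import random


def pick_victim(processes, protected, now, age_bias=0.25, rng=random):
    """Pick a server process to kill, or None if nothing is eligible.

    processes -- list of dicts with "name", "role" and "started" keys
                 (a made-up metadata format for this sketch)
    protected -- set of process names the Monkey must never touch
    now       -- current time, in the same units as "started"
    age_bias  -- probability of weighting the choice toward old processes
    """
    # Blind spots: never consider clients or protected processes.
    candidates = [
        p for p in processes
        if p["role"] == "server" and p["name"] not in protected
    ]
    if not candidates:
        return None

    # Now and then, favor older processes in proportion to their age.
    if rng.random() < age_bias:
        weights = [max(now - p["started"], 0.0) + 1.0 for p in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0]

    # Otherwise every eligible process is equally likely.
    return rng.choice(candidates)


procs = [
    {"name": "server-1", "role": "server", "started": 0.0},
    {"name": "server-2", "role": "server", "started": 90.0},
    {"name": "client-1", "role": "client", "started": 0.0},
]
victim = pick_victim(procs, protected={"server-1"}, now=100.0)
print(victim["name"])
```

Filtering before weighting keeps the blind spots absolute: a protected VM can never be chosen no matter how old it is.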

Friday, May 6, 2011

Moving to the Cloud

For half a year I've been doing some of my development work on a virtual computer hosted in a data center that I've never seen.  It works remarkably well and is like using RDP to connect to a desktop at work when you're telecommuting.  I fire up VMware View and connect to the computer, giving it one of the Dexpot screens on my laptop.  I can even connect to it on my iPod Touch using WYSE Pocket Cloud and zaTelnet.

The downside has been that I use other machines to run tests and those machines were seven network hops away from my cloud-based development machine.   Any network interaction with those machines was painfully slow.  So slow that I stopped using the virtual computer for much of anything.

Recently that situation changed.  Most of the rack-mounted machines that we own were moved to the same data center, so that now it's seven network hops from my desk to all of the machines I use.  But it's now only one hop from my cloud-based virtual computer to them.  The situation is reversed and the virtual computer is a life saver.  I log in, pop up VMware View and the rest of my computing day is spent in the cloud.