Bruce's Blog: September 2011

This is a follow-up to a post I made last year. About four years ago I applied for a patent on a method of replicating data from one process to another without blocking operations on the data. It's used in the GemFire data fabric product to create backup copies of data buckets, and is called the "state flush operation". In a way it provides a temporal point of virtual synchrony that assures the new replica bucket sees all of the changes to the data that the original bucket sees.

I got the idea for this work after reading a paper by Chandy and Lamport published back in 1985, "Distributed Snapshots: Determining Global States of Distributed Systems".

Basically what you do is create a sort of catcher's mitt that is set up to record operations on the data during the transfer. Then you announce to everyone that the transfer is going to happen. From that point on, any operations performed on the original bucket will also be sent to the new replica bucket.
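Here is a minimal sketch of that first step in Python. It is not GemFire code: the class and method names (`CatchersMitt`, `OriginalBucket`, `announce_transfer`) are made up for illustration, and it simplifies things by having the original bucket forward operations itself rather than having every member send to both buckets.

```python
class CatchersMitt:
    """Records operations that reach the new replica during the transfer."""

    def __init__(self):
        self.recorded = {}  # key -> latest value seen during the transfer

    def record(self, key, value):
        self.recorded[key] = value


class OriginalBucket:
    """Hypothetical stand-in for the bucket that is being replicated."""

    def __init__(self):
        self.data = {}
        self.mitts = []  # mitts that must also see each new operation

    def announce_transfer(self, mitt):
        # After the announcement, every operation is also sent to the mitt.
        self.mitts.append(mitt)

    def put(self, key, value):
        self.data[key] = value
        for mitt in self.mitts:
            mitt.record(key, value)


original = OriginalBucket()
original.put("a", 1)              # before the announcement: not recorded

mitt = CatchersMitt()
original.announce_transfer(mitt)
original.put("b", 2)              # after the announcement: recorded
```

Only the operation made after the announcement ends up in the mitt. Everything earlier has to reach the replica some other way, which is what the rest of the procedure takes care of.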

Then you send a message to each member of the distributed system that holds the bucket, telling it to apply in-process operations to the bucket and then "flush" those changes to the member holding the original bucket. Each of these members sends a message to the original bucket holder that tells it which messages have been sent. An observer is created that watches for all of those changes to arrive and be applied to the original. Once they have, the observer sends a notice to the new replica that the flush operation has completed.
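The observer can be sketched roughly like this. Again this is an illustration and not the patented implementation: `FlushObserver` is a hypothetical name, and I am assuming each member's flush message simply carries a count of the operations it has sent, which is one simple way of saying "which messages have been sent".

```python
class FlushObserver:
    """Waits until every member's in-flight operations have been applied."""

    def __init__(self, members, on_complete):
        # Count announced by each member's flush message (None = not yet heard from).
        self.expected = {m: None for m in members}
        # Operations from each member that the original has applied so far.
        self.applied = {m: 0 for m in members}
        self.on_complete = on_complete
        self.done = False

    def flush_message(self, member, sent_count):
        self.expected[member] = sent_count
        self._check()

    def operation_applied(self, member):
        self.applied[member] += 1
        self._check()

    def _check(self):
        if self.done:
            return
        for member, expected in self.expected.items():
            if expected is None or self.applied[member] < expected:
                return  # still waiting on this member
        self.done = True
        self.on_complete()  # e.g. notify the new replica


notices = []
observer = FlushObserver(["m1", "m2"], lambda: notices.append("flush complete"))

observer.operation_applied("m1")   # one of m1's operations arrives
observer.flush_message("m1", 2)    # m1 says it sent two operations
observer.flush_message("m2", 0)    # m2 had nothing in flight
observer.operation_applied("m1")   # m1's second operation arrives; flush is done
```

The notice fires exactly once, and only after both members have reported in and everything they reported has been applied.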

At this point the data may be copied from the original bucket to the replica bucket, taking care not to overwrite any items that have shown up in the catcher's mitt. Because of the flush, we know that the copied data holds any changes that were made prior to creating the catcher's mitt, but the catcher's mitt may hold operations that are newer than what is reflected in the copied data.
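The copy step comes down to "skip anything the mitt has already seen". A small sketch, assuming the bucket contents can be modeled as plain dictionaries and the mitt as the set of keys it recorded (`copy_snapshot` and its parameters are hypothetical names):

```python
def copy_snapshot(snapshot, replica, mitt_keys):
    """Copy entries from the original's snapshot into the replica,
    leaving alone any key that was touched during the transfer."""
    copied = 0
    for key, value in snapshot.items():
        if key in mitt_keys:
            # The replica already saw a newer operation on this key
            # (an update or a delete), so the copied value would be stale.
            continue
        replica[key] = value
        copied += 1
    return copied


snapshot = {"a": 1, "b": 2, "c": 3, "d": 4}   # data copied from the original
replica = {"b": 20}                           # "b" was updated during the transfer
mitt_keys = {"b", "d"}                        # "d" was deleted during the transfer

copied = copy_snapshot(snapshot, replica, mitt_keys)
```

After the call the replica holds "a" and "c" from the copy, keeps its newer value for "b", and does not resurrect the deleted "d".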

Thursday, September 8, 2011

patent granted