Wednesday, May 12, 2010

GemFire and Paxos

People have recently been asking me about the Paxos algorithm and how GemFire implements it. I'm not an expert on Paxos itself, but have spent years dealing with the problems that arise if you do not have something similar to it in place in a distributed computing product like GemFire.

Paxos is a family of protocols for developing consensus in a distributed system. The basic idea is that there is a Client that sends a message to effect some change. One or more Acceptors exist to service the message, and the message is sent to all of them. A Proposer works on behalf of the client to make sure the Acceptors agree on the message. The message is then acted on and a Learner is tasked to effect replication of the change.

There is also a Leader, which is selected from among the Acceptors. If there is no unique Leader, no progress is allowed in the system.

In GemFire there are Client processes that send messages to Servers. The Client forms a unique identifier for the message with respect to the Client and the originating thread. In a shortcut of the general Paxos algorithm a Server acts as both an Acceptor and a Proposer. It determines the member having primary storage for the message and forwards the message to it.

The member having primary storage plays the role of Learner. It is responsible for effecting the change in primary storage and then replicating that change in other Servers. Once it is done, any result is sent back to the first Server (Paxos Proposer) and the result is sent back to the client.

If after sending the message, the Client deems that the Server is taking too long to process it, another Server is selected and the client sends the message (with the same unique identifier) to it. The Server finds the member having primary storage and forwards the message to it. The member having primary storage (Paxos Learner) protects its storage by checking the message's unique identifier. If the identifier has already been applied to the store, it is marked as a possible-duplicate and then replicated to other Servers. Each other server is also responsible for performing duplicate-checks.

The "timeout/retry" logic in the Client can cause duplicate processing, but in the normal case it does not and is much faster than a cumbersome back-and-forth Client/Acceptor/Proposer interaction.

The Paxos Leader role is handled in GemFire by the low-level membership system. Administrative members of the system and those marked as not being Servers are excluded from election as Leader. After this, the oldest Server is selected as the GemFire "lead" member.

Paxos also has the notion of a Quorum. Wikipedia defines this as "Quorums are defined as a family of subsets of the set of Acceptors such that any two subsets from the family (that is, any two Quorums) have a non-empty intersection.". In GemFire this function is handled by the Roles feature. Each Server can note that it provides one or more Roles and each cache region can require one or more Roles in order to function. If the required Roles aren't present, the cache region experiences a "loss action" which can cause it to become read-only, block access, or even shut down and restart the cache. This provides better control over the behavior of the system when there are machine or network failures than a simple Quorum mechanism.

Paxos also has the notion of a Choice when there are conflicting values. This is essentially handled by the unique identifiers assigned to messages. The identifiers can be used to order conflicting values and determine which is newer or older than the other. This is done during duplicate-detection and is a very cheap and lightweight way to implement Choice.

So there is quite a strong mapping from GemFire's distributed correctness algorithms and Paxos. Where there are differences it is usually because we opted for higher performance at the possible cost of doing a small amount of extra work (duplicate processing).

1 comment:

  1. Excellent note on how servers use message identifiers to processes duplicate messages in a failover situation.

    Also worth pointing out, is how GemFire uses some of the concepts in "paxos commit" (http://research.microsoft.com/apps/pubs/default.aspx?id=64636). Unlike traditional 2 phase commit algorithms where the commit is doomed if the coordinator fails, in gemfire, if the 1st phase of the commit process has been successfully relayed to all cohorts, the surviving members collaborate with each other to commit the transaction even if the coordinator fails before sending the final commit.

    ReplyDelete