Friday, August 9, 2013

Shifting madness

My team was chasing an odd bug for a couple of weeks. In the GemFire distributed cache we store a 64-bit timestamp representing the last-modified-time for a cache entry and use it for entry expiration and inter-site consistency checks.  This value was all of a sudden going back in time a day and a half, causing early expiration and inter-site inconsistencies.

Well, what do you do?  You review recent changes to the product, for one thing.  Not long ago two people worked on the code for handling this timestamp.  The significant change seemed to be in using the top 8 bits of this field to store boolean flags and using a java.util.concurrent.atomic.AtomicLongFieldUpdater to access the field.

  private static final long LAST_MODIFIED_MASK = 0x00FFFFFFFFFFFFFFL;

   long storedValue;
    long newValue;
    do {
      storedValue = lastModifiedUpdater.get(this);
      newValue = storedValue & ~LAST_MODIFIED_MASK;
      newValue |= lastModifiedTime;
    } while (!lastModifiedUpdater.compareAndSet(this, storedValue, newValue));

This code looks okay unless the relatively new AtomicLongFieldUpdater is messing up.  The other change was introduction of a new bit to store in the 8 top bits of the field:

  private static final long VALUE_RESULT_OF_SEARCH   = 0x01L << 56;
  private static final long UPDATE_IN_PROGRESS       = 0x02L << 56;
  private static final long TOMBSTONE_SCHEDULED      = 0x04L << 56;
  private static final long LISTENER_INVOCATION_IN_PROGRESS = 0x08 << 56;

There's a problem with this last line.  We're shifting a 32-bit integer 56 places to the left. A good C compiler will complain about this but the Java compiler seems okay with it.  Here's a C program:

#include "stdio.h"

int main(int argc, char *argv[]) {
    long l = 1 << 30;
    printf("1 << 30=%ld",l);

    l = 1<<32 font="">
    printf("1 << 32=%ld",l);
    l = 3<<32 br="">    printf("3 << 32=%ld",l);
~> cc -o testit test.c
test.c: In function 'main':
test.c:6: warning: left shift count >= width of type
test.c:8: warning: left shift count >= width of type

And the equivalent Java program:


public class test {
  public static void main(String[] args) {
    long l = 1 << 30;
    System.out.println("1 << 30="+l);

    l = 1 << 32;
    System.out.println("1 << 32="+l);

    l = 3 << 32;
    System.out.println("3 << 32="+l);


~> javac

Running these two programs gives different results

~> ./testit
1 << 30=1073741824

1 << 32=0
3 << 32=0

~> java test
1 << 30=1073741824

1 << 32=1
3 << 32=3

So the << operator works differently in Java than in C.  Shifting 0x08 fifty six bits in C results in a 0 but Java turns it into 0x800,0000!  The Java Language Specification says

If the promoted type of the left-hand operand is int, only the five lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x1f (0b11111). The shift distance actually used is therefore always in the range 0 to 31, inclusive.
 So 56 (111000) is silently turned into 24 (011000) by the javac compiler! The new constant, 0x8000000, is 8 << 24, not 8 << 56 as intended!

This bit was being set and cleared in the timestamps when the new flag was used.  For instance, 2013/07/25 15:55:10.906 PDT is 1374792910906 on the millisecond clock.  Clearing bit 0x8000000 turns the clock back to 1374658693178, which is 2013/07/24 02:38:13.178 PDT.  That's roughly 37 hours earlier than the unmolested timestamp.  No wonder entries were being considered "old" before their time.

Changing the constant to have "L" like the others fixed the problem.

Tuesday, February 5, 2013

I was reviewing some code for a coworker and saw something that I didn't know was possible...

Integer counter = 0;
synchronized(counter) {
  // do other work under sync

I thought that the compiler might be accepting this as a valid use of autoboxing, as in

Integer counter = 0;

synchronized(counter) {
  int i = counter.intValue();
  // do other work under sync

in other words the value is pulled out of the Integer and held in a temporary variable where it is incremented and then thrown away.

This turned out not to be the case at all.  The compiler actually creates code that will increment the value like this but it affects counter just as if this were an "int" field!  So my coworker was right - this "++" is incrementing the counter like he wanted it to do.

But now there is something else wrong with the code!  Java Integers are immutable, and that "++" is assigning a new Integer to the counter variable.  This makes his synchronized(counter)statement useless in protecting anything but the counter++.  Once that's finished there is a new object in counter.  If one thread synchronized on Integer(0) the counter++ would change it to Integer(1)  Another thread could then enter the synchronized block holding a lock on Integer(1) while the original thread continued to lock on Integer(0).

t1: synchronized(Integer(0)) {
t1: counter++ // in other words, counter=Integer(1)
t2: synchronized(Integer(1)) {
t1 & t2: // do other work under sync

What else is wrong with this?  What about the objects we're synchronizing on?  I wrote a short program to look at the Integers generated in code such as this.

public class incInt {

  public static void main(String args[]) {
    Integer i = 0;
    System.out.println("i="+i + " hash="+System.identityHashCode(i));
    System.out.println("i="+i + " hash="+System.identityHashCode(i));
    System.out.println("i="+i + " hash="+System.identityHashCode(i));

    System.out.println("resetting to zero");

    i = 0;
    System.out.println("i="+i + " hash="+System.identityHashCode(i));
    System.out.println("i="+i + " hash="+System.identityHashCode(i));
    System.out.println("i="+i + " hash="+System.identityHashCode(i));


The result of running this with Oracle's JRE 1.7.0_5 shows that there are canonical values for autoboxed zero, one and two.

> java incInt
i=0 hash=4991049
i=1 hash=32043680
i=2 hash=9499036
resetting to zero
i=0 hash=4991049
i=1 hash=32043680
i=2 hash=9499036

Here's a blog post that claims that [-128,127] are cached by the JVM and used in autoboxing.  It turns out that the post is right.  I modified the test to print out the hashes for zero, one and two and they are the same objects

> java incInt
i=0 hash=31879808
i=1 hash=6770745
i=2 hash=12835244
resetting to zero
i=0 hash=31879808
i=1 hash=6770745
i=2 hash=12835244

Getting back to the code under review, this means that the synchronization is at least sometimes using a canonical object used by the whole JVM.  Anything could sync on Integer.valueOf(0), causing this code to be affected by code running in other threads.  All synchronization should be done on private state or by using well-tested concurrency utilities to avoid accidental conflicts and meddling.

Thursday, September 8, 2011

patent granted

This is a follow-up to a post I made last year.  About four years ago I applied for a patent on a method of replicating data from one process to another without blocking operations on the data.  It's used in the GemFire data fabric product to create backup copies of data buckets, and is called "state flush operation".  In a way it provides a temporal point of virtual synchrony that assures the new replica bucket sees all of the changes to the data that the original bucket sees.

I got the idea for this work after reading a paper by Lamport and Chandy published back in 1985, Distributed Snapshots: Determining Global States of Distributed Systems.

Basically what you do is create a sort of catchers-mitt that is set up to record operations on the data during the transfer, then you announce to everyone that the transfer is going to happen.  At that point any operations performed on the original bucket will also be sent to the new replica bucket.

Then you send a message to each member of the distributed system that holds the bucket telling them to apply in-process operations to the bucket and then "flush" those changes to the member holding the original bucket.  A message gets sent from each of these members to the original bucket holder that tells it which messages have been sent.  An observer is created that watches for all of the changes to arrive and be applied to the original.  It then sends a notice to the new replica that the operation has completed.

At this point the data may be copied from the original bucket to the replica bucket, taking care not to overwrite any items that have shown up in the catchers-mitt.  Because of the flush we know that the copied data holds any changes that were made prior to creating the catchers-mitt, but the catchers-mitt may hold operations that are newer than what is reflected in the copied data.

Tuesday, August 30, 2011

Chaos Monkey

If you haven't heard of the Netflix Chaos Monkey, read Jeff Atwood's blog. This "monkey" roams around their cloud app killing processes to ensure that the system is resilient. IMO the MTBF for java VMs isn't all that long unless a great deal of testing has been done, so this is a great way to keep the system healthy.  Jeff asserts that having the monkey in their system was at least part of the reason that Netflix survived the Amazon Web Services (AWS) crash.

When we test GemFire we run many High Availability (HA) tests that randomly kill server processes and then test to ensure that the product continues to run and maintains consistency.  That guarantees that the product reacts to failures correctly in short (10-60 minute) tests, but what about long running distributed systems?  It would be nice to build an optional Chaos Monkey into the product that randomly killed off server-side processes (can't kill the clients!).   The system-monitoring infrastructure would have to be able to recognize the Monkey's work so that alarms aren't raised, but how hard could that be?

A smart Monkey could examine metadata about the system and, perhaps, give weight to older processes now and then when selecting a process to kill.  That would tend to shake things up a little more in the distributed system and test things like lock grantor fail-over.

The Monkey would need to have a collection of blind spots built into it so that customers could protect VMs that they don't want the Monkey to, er, monkey with.  GemFire might be well tested and be able to withstand a Chaos Monkey, but that doesn't mean the systems built with it could survive degradation of their own essential services.

Friday, May 6, 2011

Moving to the Cloud

For half a year I've been doing some of my development work on a virtual computer hosted in a data center that I've never seen.  It works remarkably well and is like using RDP to connect to a desktop at work when you're telecommuting.  I fire up VMware View and connect to the computer, giving it one of the Dexpot screens on my laptop.  I can even connect to it on my iPod Touch using WYSE Pocket Cloud and zaTelnet.

The downside has been that I use other machines to run tests and those machines were seven network hops away from my cloud-based development machine.   Any network interaction with those machines was painfully slow.  So slow that I stopped using the virtual computer for much of anything.

Recently that situation changed.  Most of the rack-mounted machines that we own were moved to the same data center, so that now it's seven network hops from my desk to all of the machines I use.  But it's now only one hop from my cloud-based virtual computer to them.  The situation is reversed and the virtual computer is a life saver.  I log in, pop up VMware View and the rest of my computing day is spent in the cloud.

Wednesday, November 3, 2010

Performance of Message Patterns

I've been faced with a dilemma in distributed processing.  I have a "client" process that needs to send a message to several servers.  The servers contain a versioning service that stamps a version on each message for concurrency management, but the client doesn't have this service.  Each of the servers must end up having the same version number for the message.  And it must be blindingly fast.

Either the servers are going to have to talk to each other to agree on a version for the message or some other trick is going to have to be used.

The simplest approach is to send a request to one of the servers to get a version number and then send the version out with the message to all of the servers in parallel.  This is nice because the message can be serialized to the network in parallel and there are mechanisms in place that will do this very quickly.

Here's a diagram of the first scenario.  If any of the diagrams are hard to read, you can click on it to see a larger image.

Instead of sending a request for a version to the first server, the client could send the message and expect a version number in response.  It could then add this to the message and send it to the other servers.  This would require multiple serialization of the message to on-wire format, but might be faster for small messages.

Another approach is to delegate the sending of the message to one of the servers ("one hop messaging").  The client sends the message to one of the servers and it, in turn, sends the message to the other servers after adding a version stamp to it.  This also requires some level of extra serialization since the client must write the message to the network and then the selected server must forward the message to the other servers over the network.

We can simplify the acknowledgment scheme by having each server send an ack directly back to the client instead of piping them through Server1.  This would cause context switching in the client, but might be better than funneling the acks through Server1.

Yet another approach is to have the client send the message to all of the servers and include a tag that selects one of the servers to supply the version tag.  After the selected server (Server1 in the diagram below) receives the message it stamps it with a version and then sends that version to the other servers.  The other servers wait for the version number before accepting the message.  This, too, requires extra context switching because the other servers have to wait for a signal that the version number has been received.

I wrote a program to simulate these different approaches to see how they compared.  Like the diagrams, I used a client and three servers running on a fairly large and fast Linux computer.

The payload column shows the size of message that was used in each test run, and each column shows how many seconds it took the approch to handle 1 million messages.  The "no versioning" column is a base-line that shows the performance of simple send-with-acknowledgement messaging.

The results were a little surprising to me.  I had expected the last approach, where the client sends the message to all servers and  they wait for one to send a version number, to have the best performance.  Instead, all but one of them converge to the same performance level when messages reach 10,000 bytes in size.  At this level they are only moving about 70mb of data through the system per second, so they aren't being throttled by the network, but with the extra synchronization points and context switches CPU was becoming a limiting factor.

The "version request" and "message returns version" scenarios (the first two diagrams above) are clear losers because server2 and server3 do not even see the message until a complete send and response cycle is performed with server1.

The "one hop" scenario had a poor showing because of the long acknowledgement chain, with both server2 and server3 sending their acks to server1.  Server1 has to wait for both acks before it finally sends its own ack back to the client.

The clear winner is the "one hop, ack client" algorithm with servers sending acknowledgements directly to the client.  It even converged with and then passed the base-line "no versioning" scenario at about 3000 bytes/message.

Thursday, September 23, 2010

Virtual Synchrony in GemFire

A virtual synchrony protocol ensures that all past operations have been delivered before a membership change is put into effect.

In our GemFire product, the equivalent of virtual synchrony is achieved in Gemfire in a component of the cache called a DistributionAdvisor.  These are mini membership managers that track the ownership of cache Regions across the distributed system.  When a member creates a new cache Region that others already have, the DistributionAdvisor is used to perform a flush operation (called State Flush in Gemfire) to ensure that all past operations have been applied before the new member's Region is allowed to be used.  There are a number of steps involved in this operation for which I have applied for a patent (#11/982,563).  Outside of this we haven't seen the need for a full virtual synchrony protocol covering all communications.