Tuesday, August 30, 2011

Chaos Monkey

If you haven't heard of the Netflix Chaos Monkey, read Jeff Atwood's blog. This "monkey" roams around their cloud app killing processes to ensure that the system is resilient. IMO the MTBF for Java VMs isn't all that long unless a great deal of testing has been done, so this is a great way to keep the system healthy. Jeff asserts that having the monkey in their system was at least part of the reason that Netflix survived the Amazon Web Services (AWS) crash.


When we test GemFire we run many High Availability (HA) tests that randomly kill server processes and then check that the product continues to run and maintains consistency. That tells us the product reacts to failures correctly in short (10-60 minute) tests, but what about long-running distributed systems? It would be nice to build an optional Chaos Monkey into the product that randomly kills off server-side processes (can't kill the clients!). The system-monitoring infrastructure would have to be able to recognize the Monkey's work so that alarms aren't raised, but how hard could that be?


A smart Monkey could examine metadata about the system and, perhaps, give weight to older processes now and then when selecting a process to kill. That would tend to shake things up a little more in the distributed system and exercise things like lock grantor failover.
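
To make the "older processes now and then" idea concrete, here is a rough Python sketch of how a victim might be picked. Nothing here is a GemFire API: ServerProcess, pick_victim, and bias_chance are names I made up for illustration, and the uptime values stand in for whatever metadata the system actually exposes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerProcess:
    """Hypothetical record for one server-side process."""
    name: str
    uptime_seconds: float


def pick_victim(processes, bias_chance=0.25, rng=random):
    """Pick one process to kill, or None if there are no candidates.

    With probability bias_chance the pick is weighted by uptime, so
    long-lived processes are favored; otherwise every process is
    equally likely. bias_chance is an assumed tuning knob.
    """
    if not processes:
        return None
    if rng.random() < bias_chance:
        # Add 1 so a freshly started process still has a nonzero weight.
        weights = [1.0 + p.uptime_seconds for p in processes]
        return rng.choices(processes, weights=weights, k=1)[0]
    return rng.choice(processes)
```

Setting bias_chance to 0 gives a purely random Monkey, and setting it to 1 makes it go after the old-timers almost every time.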


The Monkey would need to have a collection of blind spots built into it so that customers could protect VMs that they don't want the Monkey to, er, monkey with.  GemFire might be well tested and be able to withstand a Chaos Monkey, but that doesn't mean the systems built with it could survive degradation of their own essential services.
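
The blind spots could be as simple as a list of name patterns that gets filtered out before the Monkey chooses a target. Here is a small Python sketch under that assumption; eligible_targets and the process names below are hypothetical, not anything GemFire provides.

```python
from fnmatch import fnmatchcase


def eligible_targets(process_names, blind_spots):
    """Return the process names the Monkey is allowed to kill.

    blind_spots is a list of shell-style patterns (e.g. "billing-*")
    naming the processes a customer wants protected.
    """
    return [
        name
        for name in process_names
        if not any(fnmatchcase(name, pattern) for pattern in blind_spots)
    ]


# Example with made-up process names: the billing gateway and the
# locator are protected, so only the cache servers remain.
names = ["cache-server-1", "cache-server-2", "billing-gateway", "locator-1"]
print(eligible_targets(names, ["billing-*", "locator-1"]))
```

Filtering first and selecting second keeps the protection rule in one place, so a protected process can never be picked no matter how the selection logic changes.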