tag:blogger.com,1999:blog-52801621512398323622024-03-05T09:34:18.379-08:00Bruce's BlogBruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.comBlogger10125tag:blogger.com,1999:blog-5280162151239832362.post-85703645853512625782021-01-29T08:59:00.004-08:002021-02-05T11:27:54.382-08:00Integrating Apache Geode with the Rapid cluster membership system<p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In 2020 I set a goal of integrating an alternative Membership Service with <a href="https://geode.apache.org/" target="_blank">Apache Geode</a>. A team of engineers (including myself) broke that service out into a separate module (link) in 2019 and I wanted to see if it was possible to use a different implementation.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">All distributed systems are based on a Membership Service. Geode, Cassandra, Hazelcast, Coherence etc. all depend on one to know which processes (nodes) are part of the cluster and to detect when there are failures. There needs to be a way to introduce new nodes into the cluster and a way to remove current nodes when the need arises.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Apache Geode's <a href="https://cwiki.apache.org/confluence/display/GEODE/Core+Distributed+System+Concepts" target="_blank">Membership Service</a> (which I'll call GMS) grew out of a heavily modified fork of the <a href="http://www.jgroups.org/" target="_blank">JGroups</a> project. Though it's been rewritten to not depend on JGroups, Geode still uses the same concept of a membership coordinator (usually the oldest node in the cluster) that accepts join/remove requests and decides whether a node should be kicked out of the cluster. It also has the other components you'd expect in a Membership Service like a message-sender/receiver and a failure detector.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Halfway through the year I read a paper on the <a href="https://github.com/lalithsuresh/rapid" target="_blank">Rapid membership service</a> claiming it could spin up a large cluster in record time and detect failures with a high probability of success. There was also a Java implementation of the service (as well as a Go impl, which is interesting from a K8s perspective) which made it a nice fit for an integration project. Perfect! So I cloned the Rapid repo and started integrating it with Geode.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">As an initial goal I wanted to demonstrate that the integration would at least pass a non-trivial integration test and an even more complicated distributed unit test, which orchestrates a cluster of JVMs to test distributed algorithms. I also wanted to try out a regression test that kills off processes and asserts that the membership service detects and resolves the node failures.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Another goal was to see what dependencies Geode has on its membership implementation that make it difficult or impossible to use a different implementation. I've highlighted these findings in the sections below.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The modularization effort that I mentioned created a nice API for Geode's membership service but hardcoded the implementation into its Builder implementation. I didn't bother with creating a pluggable SPI during the integration. Instead I modified the <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/rapid/MembershipBuilderImpl.java" target="_blank">Builder</a> to create a <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/rapid/MembershipImpl.java" target="_blank">Membership</a> based on Rapid. The Rapid project includes a sample Netty-based messaging component and a ping-pong based failure detector that are similar enough to what Geode has that I decided to use them in the integration effort.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Node Identifiers</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In a distributed system nodes have identifiers that let you know how to contact them. Rapid uses UUIDs and these map to physical addresses that contain the Netty host and port. Other systems like JGroups use a similar approach, but not Geode. It has an overgrown membership identifier that includes a lot of information about each node. That was the first problem I ran into. How could I disseminate the metadata about each node to other members of the cluster? That metadata included things like the address used for non-membership communications, the type of node (locator, server) and the name assigned to it. Fortunately Rapid's <a href="https://github.com/lalithsuresh/rapid/blob/master/rapid/src/main/java/com/vrg/rapid/Cluster.java" target="_blank">Cluster</a> implementation allows you to provide this data when joining the cluster and makes it available to other nodes in its callback API. I merely serialized the Geode identifier and set that as the metadata for the process joining the cluster. In the callbacks I could then deserialize the ID and pass that up the line.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> This points out that Geode requires a way to transmit metadata about a node in membership views and node identifiers. The metadata is static and is known when joining the cluster.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><pre style="background-color: white; font-family: Menlo;"><pre style="font-family: Menlo;"><div style="text-align: left;">Map<String, ByteString> metadata = <span style="color: navy; font-weight: bold;">new </span>HashMap<>();</div><div style="text-align: left;">metadata.put(<span style="color: #660e7a; font-style: italic; font-weight: bold;">GEODE_ID</span>, objectToByteString(<span style="color: #660e7a; font-weight: bold;">localAddress</span>, <span style="color: #660e7a; font-weight: bold;">serializer</span>));</div><span style="color: #660e7a; font-weight: bold;"><div style="text-align: left;">clusterBuilder = <span style="color: navy;">new </span>Cluster.Builder(listenAddress)</div></span><div style="text-align: left;"> .setMessagingClientAndServer(<span style="color: #660e7a; font-weight: bold;">messenger</span>, <span style="color: #660e7a; font-weight: bold;">messenger</span>)</div><div style="text-align: left;"> .setMetadata(metadata);</div><span style="color: #660e7a; font-weight: bold;"><div style="text-align: left;">clusterBuilder.addSubscription(com.vrg.rapid.ClusterEvents.<span style="font-style: italic;">VIEW_CHANGE_PROPOSAL</span>,</div></span><div style="text-align: left;"> <span style="color: navy; font-weight: bold;">this</span>::onViewChangeProposal);</div><span style="color: #660e7a; font-weight: bold;"><div style="text-align: left;">clusterBuilder.addSubscription(com.vrg.rapid.ClusterEvents.<span style="font-style: italic;">VIEW_CHANGE</span>,</div></span><div style="text-align: left;"> <span style="color: navy; font-weight: bold;">this</span>::onViewChange);</div><span style="color: #660e7a; font-weight: bold;"><div style="text-align: left;">clusterBuilder.addSubscription(com.vrg.rapid.ClusterEvents.<span style="font-style: italic;">KICKED</span>,</div></span><div style="text-align: left;"> <span style="color: navy; font-weight: bold;">this</span>::onKicked);</div></pre></pre></div></blockquote><p style="text-align: left;"> </p></div><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Joining the cluster</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Next I needed to reconcile the discovery mechanisms of Rapid and Geode.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Like many other systems Rapid relies on there being Seed Nodes that new processes can contact and request to join the cluster. Any process that's already in the cluster can be used as a Seed Node — you just need to know the host & port of the node to use it as a Seed Node.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Geode's membership service, on the other hand, doesn't use Seed Nodes<span id="docs-internal-guid-8767c4be-7fff-1da8-7b00-14b58c32725d"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">. Instead it </span></span>uses a Locator service to find the node that is currently the membership coordinator and uses that node to <a href="https://cwiki.apache.org/confluence/display/GEODE/GMSJoinLeave+message+sequence+diagrams" target="_blank">join the cluster</a>. All of the Geode unit tests that create a cluster expect this behavior, so if I wanted to use those tests as-is I couldn't use Seed Nodes.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The Locator service is a simple request/reply server that keeps track of who is in the cluster and gives new processes the address of the membership coordinator. I decided to write an alternative plugin to Geode's GMSLocator class that would allow multiple Rapid nodes to start up concurrently.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">A new process contacts the new <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/rapid/RapidSeedLocator.java" target="_blank">RapidSeedLocator</a> and asks for the ID of a seed node. Normally there is a seed node available and the new process starts the Rapid cluster using the seed node's address.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></p><pre style="background-color: white; font-family: Menlo;"><div style="text-align: left;">seedAddress = findSeedAddressFromLocators();</div><span style="color: navy; font-weight: bold;"><div style="text-align: left;">if (!seedAddress.equals(<span style="color: #660e7a;">listenAddress</span>)) {</div></span><div style="text-align: left;"> <span style="color: #660e7a; font-weight: bold;">cluster </span>= <span style="color: #660e7a; font-weight: bold;">clusterBuilder</span>.join(seedAddress);</div><div style="text-align: left;">}</div></pre></blockquote><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">If there are no nodes in the cluster the new process loops a few times to see if one will show up. If none does it starts a new cluster. It's a little more complicated than that because there may be multiple processes starting up at the same time and they need to work out which one will bootstrap the cluster.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"></span></p><pre style="background-color: white; font-family: Menlo;"><div style="text-align: left;"><span style="color: navy; font-weight: bold;">if </span>(bestChoice.equals(<span style="color: #660e7a; font-weight: bold;">localAddress</span>)) {</div><div style="text-align: left;"> <span style="color: #660e7a; font-style: italic; font-weight: bold;">logger</span>.info(<span style="color: green; font-weight: bold;">"I am the lead registrant - starting new cluster"</span>);</div><div style="text-align: left;"> <span style="color: #660e7a; font-weight: bold;">cluster </span>= <span style="color: #660e7a; font-weight: bold;">clusterBuilder</span>.start();</div><div style="text-align: left;">} <span style="color: navy; font-weight: bold;">else </span>{</div><div style="text-align: left;"> <span style="color: #660e7a; font-style: italic; font-weight: bold;">logger</span>.info(<span style="color: green; font-weight: bold;">"Joining with lead registrant"</span>);</div><div style="text-align: left;"> <span style="color: #660e7a; font-weight: bold;">cluster </span>= <span style="color: #660e7a; font-weight: bold;">clusterBuilder</span>.join(seedAddress);</div><div style="text-align: left;">}</div></pre></blockquote><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This Locator is behavior that we took from the JGroups project and have used in Geode since its inception. Any seed node based system could be fit into this same pattern.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Other than that, all I needed to do was implement leaving the cluster, the callbacks announcing new/left/crashed nodes and a callback that handles being kicked out of the cluster. That can happen if a node becomes unresponsive and the other nodes (observers) declare it dead.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode uses a Locator service to find the node that's the membership coordinator.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode needs to handle concurrent startup of the initial nodes of a cluster.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><br /></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Integration Testing</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">There was an existing unit test for Geode's Membership class that I used to shake out problems with the integration, <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-membership/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/MembershipIntegrationTest.java" target="_blank">MembershipIntegrationTest</a>. The simplest test in that class just boots up a cluster of 1 and verifies that the cluster started okay. It passed so I went on to the next test, one that boots up a 2 node cluster. That one failed because the view identifiers in Rapid are not a monotonically increasing integer as they are in Geode's GMS<span id="docs-internal-guid-753408a6-7fff-ea4d-ca99-4ef5f8c178a5"><span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">. Instead they</span></span> are a hash of the node IDs and endpoints in its membership view. Geode will refuse to accept a membership view with a view ID lower than the current view's and, since it's just a hash, Rapid's identifiers jump all over the place.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The only way to get around that problem was to comment out all of the checks in Geode that pay attention to view identifiers. After doing that the 2 node test passed.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Other tests in MembershipIntegrationTest were variants of that second test and passed. That was encouraging.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode expects monotonically increasing membership view identifiers.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Distributed Unit Testing</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">I moved on to a multi-JVM test using a Cache, <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-core/src/distributedTest/java/org/apache/geode/cache30/DistributedAckRegionDUnitTest.java" target="_blank">DistributedAckRegionDUnitTest</a>. For this kind of test a Locator and four satellite JVMs are started by the test framework and then all of the test methods are executed in the unit test JVM. The unit tests assign tasks to the satellite JVMs via RMI. For instance, a test might assign tasks to create a Cache, populate it and then verify that all of the nodes have the correct data. At the end of each test method the framework cleans up any Caches created in the JVMs to make them ready for the next test method.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">First I focused on getting one test to pass. The test first assigned two of the satellite JVMs tasks to create a Cache and then have them create incompatible cache Regions. This test hung during startup because, doh!, the nodes had membership views with identifiers in the different order. One had </span><span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: courier;">[nodeA, nodeB, Locator]</span></span><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> and the other had</span><span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: courier;"> [nodeB, nodeA, Locator]</span></span><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. The problem with this is that Geode's distributed lock service depends on a stable ordering of the identifiers and it uses the left-most node as its Elder in its distributed algorithms. The two nodes disagreed on which node should be the Elder and it screwed up the algorithms, causing the Caches to hang.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In order for Geode to use a membership service that doesn't provide a stable ordering of the node identifiers some other mechanism would be needed to determine the Elder for the distributed lock service. For my project I didn't want to get into that so I stored the millisecond and nanosecond clock values in the metadata when initializing each node and used that to sort the identifiers. That's good enough for unit testing but it's not something that could be used IRL.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode requires stable ordering of node IDs in a membership view or it needs an alternative way of choosing an Elder for the distributed lock service.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">With that change the distributed test passed, as did other tests that I selected at random in DistributedAckRegionDUnitTest. However, when I fired off all of the tests to run sequentially things started to go wrong. A run would get part way through the tests and then hang.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Looking at the logs I could see the first satellite JVM try to join with the Locator node and some other node that the test hadn't started.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">[vm0] [info 2021/01/13 11:52:29.455 PST <RMI TCP Connection(1)-192.168.1.218> tid=0x14] mac-a01:63410 is sending a join-p2 to mac-a01:63200 (Locator) for config 4530718538436871917</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> [vm0] [info 2021/01/13 11:52:29.455 PST <RMI TCP Connection(1)-192.168.1.218> tid=0x14] mac-a01:63410 is sending a join-p2 to mac-a01:63383 (?????) for config 4530718538436871917</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">and then Rapid's Paxos phase2 would fail:</span></p><p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> </span><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">[vm0] [error 2021/01/13 11:52:34.455 PST <RMI TCP Connection(1)-192.168.1.218> tid=0x14] Join message to seed mac-a01:63200 (Locator) returned an exception: com.vrg.rapid.Cluster$JoinPhaseTwoException</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: "Courier New"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">but there was no logging for this mac-a01:63383 node.</span></p><p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">I finally looked at the previous test's logs and found that this node was used in that test and had shut down but the remaining node (the Locator) didn't install a new view removing it. A quick talk with the Rapid developer confirmed that Rapid requires at least two nodes to make any decisions, so I modified the test to use two Locators instead of one. That way there are always two nodes to come to agreement on removal of lost nodes.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode assumes the cluster can devolve to one node.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">When running all 101 tests in DistributedAckRegionDUnitTest I found that the shutdown of a Rapid node doesn't seem to inform all other nodes in a small cluster and they're left to figure out that a node is gone. This caused new nodes to take 20 seconds or more to join the cluster while the seed nodes figured out the loss of old nodes. Attempts to join would fail and then there would be a retry. Decreasing the timeout in Rapid's NettyClientServer to 2 seconds helped with that, but joining the cluster usually took a full 2 seconds or more compared to 50 to 250ms in Geode's GMS. A better failure detector and a UDP-based heartbeat might give tighter response times.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Geode has tight SLAs concerning failure detection and needs to know whether a node has crashed or left gracefully.</span></p><p style="text-align: left;"><b style="font-weight: normal;"><br /></b></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Regression Testing</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial;"><span style="white-space: pre-wrap;">VMware has a large collection of regression tests that it uses to harden </span><span style="white-space: pre-wrap;">its</span><span style="white-space: pre-wrap;"> own releases of Geode. These are similar to the Distributed Unit Tests I described above but use a different framework, can run on multiple machines and typically run for much longer periods of time. I chose one of the tests from the Membership regression to test the Rapid integration's handling of lost & new members. After the poor performance of the default ping-pong failure detector with Netty in the Distributed Unit Tests I didn't expect this to go well, and it didn't. Though there were only 10 nodes in the test, all of them failed in the first few minutes with processes unable to join the cluster.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Looking through the logs I could see some nodes in the cluster accepting a new node while others didn't and the join attempt failing with a JoinPhaseTwoException. The accepting nodes would then remove the new node.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">--> Like Geode's Distributed Unit Tests, VMware's regression tests require a membership service that can handle frequent shutdown and startup of nodes.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: #e69138; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">* Scalability Testing</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><br /></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial;"><span style="white-space: pre-wrap;">I ran a small <a href="https://github.com/bschuchardt/geode/blob/feature/rapid_integration/geode-membership/src/integrationTest/java/org/apache/geode/distributed/internal/membership/gms/MembershipScalingTest.java" target="_blank">scalability test</a> with Geode's membership implementation and then with the Rapid implementation. The test doesn't spin up a Geode Cache, but just membership instances. For these tests I used my 16gb Mac laptop and ran the tests under IntelliJ. Sometime I'll grab a bunch of machines in AWS or GCP and do a larger test.</span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Rapid was unable to create a cluster of more than 71 nodes before running out of file descriptors in this single-machine test, probably due to the use of Netty instead of the connectionless JGroups UDP messaging used by Geode's membership implementation.</span></p><div class="separator" style="clear: both; font-family: Arial; text-align: left; white-space: pre-wrap;"><br /></div><div class="separator" style="clear: both; font-family: Arial; text-align: left; white-space: pre-wrap;"><br /></div><div class="separator" style="clear: both; font-family: Arial; text-align: left; white-space: pre-wrap;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdZk0iusdjiHVRXSTQjDFDhrj4lHKlY0mCWjBR-JZanBzhyphenhyphen3yMg7cMG3r1WAYOWyY1J_lEf3kJls5eqwvYJ6nvn0H7IFBcZM882mEx-p48ChdqKeS5jWbeacwamvLyIRNRHc7_GpOWlNDR/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="510" data-original-width="910" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdZk0iusdjiHVRXSTQjDFDhrj4lHKlY0mCWjBR-JZanBzhyphenhyphen3yMg7cMG3r1WAYOWyY1J_lEf3kJls5eqwvYJ6nvn0H7IFBcZM882mEx-p48ChdqKeS5jWbeacwamvLyIRNRHc7_GpOWlNDR/w548-h307/startup_time.png" width="548" /></a></div><div class="separator" style="clear: both; font-family: Arial; text-align: left; white-space: pre-wrap;"><br /></div><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; font-family: Arial; text-align: center; white-space: pre-wrap;"><div style="text-align: left;"><br /></div><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhaavcerIpyGA0UJIYhrO7hgnxLDIACyfzLVK3CUtWamEgLP3X4dce892z4oaEezk_wh3zYju_JTeF2WSYhrvdj8JeoKg41RtOMQJ5aDjNAlOP3USaTfInECKlmnd5frBo1oOD20Ky929i/" style="clear: left; margin-bottom: 1em;"><img alt="" data-original-height="508" data-original-width="912" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhaavcerIpyGA0UJIYhrO7hgnxLDIACyfzLVK3CUtWamEgLP3X4dce892z4oaEezk_wh3zYju_JTeF2WSYhrvdj8JeoKg41RtOMQJ5aDjNAlOP3USaTfInECKlmnd5frBo1oOD20Ky929i/w545-h303/startup_time_per_node.png" width="545" /></a></div></div><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; font-family: Arial; text-align: center; white-space: pre-wrap;"><div class="separator" style="clear: both; text-align: justify;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxGh5uHdMOf0k07MQMYpUNKzZgPhUGQJQBncPv8C6UL9JfdMEfT96UpnjUDSEpYcarQHvc8xh95Si07rrBMdEIZ78gKnf8LEtqTYOTUDzgw1lRX2eTSswEoVzInTh3bFVKNRLjWfgxUwYI/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: left;"><br /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxGh5uHdMOf0k07MQMYpUNKzZgPhUGQJQBncPv8C6UL9JfdMEfT96UpnjUDSEpYcarQHvc8xh95Si07rrBMdEIZ78gKnf8LEtqTYOTUDzgw1lRX2eTSswEoVzInTh3bFVKNRLjWfgxUwYI/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: left;"><img alt="" data-original-height="508" data-original-width="910" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxGh5uHdMOf0k07MQMYpUNKzZgPhUGQJQBncPv8C6UL9JfdMEfT96UpnjUDSEpYcarQHvc8xh95Si07rrBMdEIZ78gKnf8LEtqTYOTUDzgw1lRX2eTSswEoVzInTh3bFVKNRLjWfgxUwYI/w544-h304/stddev_startup_per_node.png" width="544" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxGh5uHdMOf0k07MQMYpUNKzZgPhUGQJQBncPv8C6UL9JfdMEfT96UpnjUDSEpYcarQHvc8xh95Si07rrBMdEIZ78gKnf8LEtqTYOTUDzgw1lRX2eTSswEoVzInTh3bFVKNRLjWfgxUwYI/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: left;"><br /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxGh5uHdMOf0k07MQMYpUNKzZgPhUGQJQBncPv8C6UL9JfdMEfT96UpnjUDSEpYcarQHvc8xh95Si07rrBMdEIZ78gKnf8LEtqTYOTUDzgw1lRX2eTSswEoVzInTh3bFVKNRLjWfgxUwYI/" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><div style="margin-left: 1em; margin-right: 1em; text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW6CmTLnzo0T5v23UuDQLyqkMh8U8sDGm922XSSRUL6Q4Jx_Ms64cITybzTxm788TLqAuonJmkuJTLJ8-I0Bml7GP4WjdhmiJs-wIVrkqlHxlK14MQHBZ6cahVSuLE7VhD-BhMZ9EaE3l_/" style="margin-left: 1em; margin-right: 1em;"></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW6CmTLnzo0T5v23UuDQLyqkMh8U8sDGm922XSSRUL6Q4Jx_Ms64cITybzTxm788TLqAuonJmkuJTLJ8-I0Bml7GP4WjdhmiJs-wIVrkqlHxlK14MQHBZ6cahVSuLE7VhD-BhMZ9EaE3l_/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="500" data-original-width="910" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiW6CmTLnzo0T5v23UuDQLyqkMh8U8sDGm922XSSRUL6Q4Jx_Ms64cITybzTxm788TLqAuonJmkuJTLJ8-I0Bml7GP4WjdhmiJs-wIVrkqlHxlK14MQHBZ6cahVSuLE7VhD-BhMZ9EaE3l_/w546-h300/shutdown_time.png" width="546" /></a></div></div></div></div><div class="separator" style="clear: both; font-family: Arial; text-align: left; white-space: pre-wrap;"><br /></div></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Using Rapid, nodes were able to join faster than when using Geode's membership system but the time to join was more erratic. The time required to shut down the cluster was much higher with Rapid.</span></div><p></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><br /></p><h2 style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="background-color: transparent; color: black; font-family: Arial; font-style: normal; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: small;">Conclusions</span></span></h2><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Integration of Rapid and Geode was fairly straightforward but pointed out dependencies that Geode has on its existing membership system that go beyond what you might expect. Untangling these so that the membership module is truly pluggable will take work.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">These are the dependencies that I noticed:</span></p><p></p><ul style="text-align: left;"><li style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Geode requires a way to transmit metadata about a node in membership views and node identifiers. The metadata is static and is known when joining the cluster.</span></li><li style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;">Geode uses a Locator service to find the node that's the membership coordinator.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Geode needs to handle concurrent startup of the initial nodes of a cluster.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Geode expects monotonically increasing membership view identifiers.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Geode requires stable ordering of node IDs in a membership view or it needs an alternative way of choosing an Elder for the distributed lock service.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Geode assumes the cluster can devolve to one node.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Geode has tight SLAs concerning failure detection and needs to know whether a node has crashed or left gracefully.</span></li><li style="text-align: left;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Like Geode's Distributed Unit Tests, VMware's regression tests require a membership service that can handle frequent shutdown and startup of nodes.</span></li></ul><span style="font-family: Arial;"><span style="white-space: pre-wrap;">Integration code can be found here:</span></span><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><span style="font-family: Arial;"><div><span style="white-space: pre-wrap;">https://github.com/bschuchardt/geode/tree/feature/rapid_integration</span></div></span></div><div><span style="font-family: Arial;"><div><span style="white-space: pre-wrap;">https://github.com/bschuchardt/rapid/tree/geode_integration</span></div></span></div></blockquote><div><span style="font-family: Arial;"><div style="white-space: pre-wrap;"><br /></div></span><p></p><div><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><p dir="ltr" style="font-family: Times; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: normal;"><span style="font-family: Arial; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"></span></p></span></span></div><p style="text-align: left;"><br /></p></div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-45297988746048413142013-08-09T11:14:00.002-07:002013-08-09T11:28:22.828-07:00Shifting madness<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
My team was chasing an odd bug for a couple of weeks. In the GemFire distributed cache we store a 64-bit timestamp representing the last-modified-time for a cache entry and use it for entry expiration and inter-site consistency checks. This value was all of a sudden going back in time a day and a half, causing early expiration and inter-site inconsistencies.<br />
<br />
Well, what do you do? You review recent changes to the product, for one thing. Not long ago two people worked on the code for handling this timestamp. The significant change seemed to be in using the top 8 bits of this field to store boolean flags and using a <span style="font-family: "Courier New", Courier, monospace; font-size: small;">java.util.concurrent.atomic.AtomicLongFieldUpdater</span> to access the field.<br />
<br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"> private static final long LAST_MODIFIED_MASK = 0x00FFFFFFFFFFFFFFL;</span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"><br /> long storedValue;<br /> long newValue;<br /> do {<br /> storedValue = lastModifiedUpdater.get(this);<br /> newValue = storedValue & ~LAST_MODIFIED_MASK;<br /> newValue |= lastModifiedTime;<br /> } while (!lastModifiedUpdater.compareAndSet(this, storedValue, newValue));</span><br />
<br />
This code looks okay unless the relatively new AtomicLongFieldUpdater is messing up. The other change was introduction of a new bit to store in the 8 top bits of the field:<br />
<span style="font-size: small;"><br /></span>
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"> private static final long VALUE_RESULT_OF_SEARCH = 0x01L << 56;<br /> private static final long UPDATE_IN_PROGRESS = 0x02L << 56;<br /> private static final long TOMBSTONE_SCHEDULED = 0x04L << 56;<br /> private static final long LISTENER_INVOCATION_IN_PROGRESS = 0x08 << 56;</span><br />
<br />
There's a problem with this last line. We're shifting a 32-bit integer 56 places to the left. A good C compiler will complain about this but the Java compiler seems okay with it. Here's a C program:<br />
<span style="font-size: small;"><br /></span>
<br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;">#include "stdio.h"<br /><br />int main(int argc, char *argv[]) {<br /> long l = 1 << 30;<br /> printf("1 << 30=%ld",l);</span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"> l = 1<<32 font=""></32></span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"> printf("1 << 32=%ld",l);</span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"> l = 3<<32 br=""> printf("3 << 32=%ld",l);</32></span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;">}<!--32--><!--32--><!--32--><!--30--><!--32--><!--32--><!--32--><!--30--><!--32--><!--32--><!--32--><!--30--><!--32--><!--32--><!--32--><!--30--></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;">~> cc -o testit test.c<br />test.c: In function 'main':<br />test.c:6: warning: left shift count >= width of type<br />test.c:8: warning: left shift count >= width of type</span></span></span><br />
<br />
And the equivalent Java program:<br />
<br />
<br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;">import java.io.*;<br /><br />public class test {<br /> public static void main(String[] args) {<br /> long l = 1 << 30;<br /> System.out.println("1 << 30="+l);</span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;"> l = 1 << 32;<br /> System.out.println("1 << 32="+l);</span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;"> l = 3 << 32;<br /> System.out.println("3 << 32="+l);</span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;"> }<br />}<!--32--><!--32--><!--30--><!--32--><!--32--><!--30--><!--32--><!--32--><!--30--><!--32--><!--32--><!--30--></span><br /><br /><span style="font-size: small;">~> javac test.java</span></span></span><br />
<br />
<br />
Running these two programs gives different results<br />
<br />
<br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"><span style="font-family: "Courier New", Courier, monospace;">~> ./testit<br />1 << 30=1073741824</span></span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"><span style="font-family: "Courier New", Courier, monospace;">1 << 32=0</span></span><br />
<span style="font-family: "Courier New", Courier, monospace; font-size: small;"><span style="font-family: "Courier New", Courier, monospace;">3 << 32=0<!--32--><!--32--><!--30--><!--32--><!--32--><!--30--><!--32--><!--32--><!--30--><!--32--><!--32--><!--30--></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-family: "Courier New", Courier, monospace; font-size: small;"><br /></span><span style="font-size: small;">~> java test<br />1 << 30=1073741824</span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;">1 << 32=1</span></span></span><br />
<span style="font-size: x-small;"><span style="font-family: "Courier New", Courier, monospace;"><span style="font-size: small;">3 << 32=3<span style="background-color: yellow;"><!--32--><!--32--><!--32--><!--32--><!--32--><!--32--><!--32--><!--32--></span><!--30--><!--30--><!--30--><!--30--></span></span></span><br />
<br />
So the << operator works differently in Java than in C. Shifting 0x08 fifty six bits in C results in a 0 but Java turns it into 0x800,0000! The <a href="http://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.19">Java Language Specification</a> says<br />
<br />
<blockquote class="tr_bq">
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">If the promoted type of the left-hand operand is<span class="Apple-converted-space"> </span></span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">int</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">, only the five lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator<span class="Apple-converted-space"> </span></span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">&</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;"><span class="Apple-converted-space"> </span>(</span><a class="xref" href="http://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.22.1" style="-webkit-text-stroke-width: 0px; background-color: white; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;" title="15.22.1. Integer Bitwise Operators &, ^, and |">§15.22.1</a><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">) with the mask value<span class="Apple-converted-space"> </span></span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">0x1f</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;"><span class="Apple-converted-space"> </span>(</span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">0b11111</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">). The shift distance actually used is therefore always in the range<span class="Apple-converted-space"> </span></span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">0</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;"><span class="Apple-converted-space"> </span>to<span class="Apple-converted-space"> </span></span><code class="literal" style="-webkit-text-stroke-width: 0px; background-color: white; color: black; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">31</code><span style="-webkit-text-stroke-width: 0px; background-color: white; color: black; display: inline !important; float: none; font-family: Arial, Helvetica, sans-serif; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 16px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">, inclusive.</span></blockquote>
So 56 (111000) is <u>silently </u>turned into 24 (011000) by the javac compiler! The new constant, 0x8000000, is 8 << 24, not 8 << 56 as intended!<!--24--><!--24--><!--24--><!--24--><br />
<br />
This bit was being set and cleared in the timestamps when the new flag was used. For instance, 2013/07/25 15:55:10.906 PDT is 1374792910906 on the millisecond clock. Clearing bit 0x8000000 turns the clock back to 1374658693178, which is 2013/07/24 02:38:13.178 PDT. That's roughly 37 hours earlier than the unmolested timestamp. No wonder entries were being considered "old" before their time.<br />
<br />
Changing the constant to have "L" like the others fixed the problem.<br />
<br /></div>
Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-44591760243873040282013-02-05T14:43:00.001-08:002013-02-05T14:43:08.597-08:00I was reviewing some code for a coworker and saw something that I didn't know was possible...<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">Integer counter = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;">.</span><br />
<span style="font-family: Courier New, Courier, monospace;">.</span><br />
<span style="font-family: Courier New, Courier, monospace;">.</span><br />
<span style="font-family: Courier New, Courier, monospace;">synchronized(counter) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> counter++;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> // do other work under sync</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
I thought that the compiler might be accepting this as a valid use of autoboxing, as in<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">Integer counter = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">synchronized(counter) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int i = counter.intValue();</span><br />
<span style="font-family: Courier New, Courier, monospace;"> i++;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> // do other work under sync</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
in other words the value is pulled out of the Integer and held in a temporary variable where it is incremented and then thrown away.<br />
<br />
This turned out not to be the case at all. The compiler actually creates code that will increment the value like this but it affects <span style="font-family: Courier New, Courier, monospace;">counter</span><span style="font-family: inherit;"> just as if this were an "int" field! So my coworker was right - this "++" is incrementing the counter like he wanted it to do.</span><br />
<span style="font-family: inherit;"><br /></span>
But now there is something else wrong with the code! Java Integers are immutable, and that "++" is assigning a new Integer to the <span style="font-family: Courier New, Courier, monospace;">counter </span>variable. This makes his <span style="font-family: Courier New, Courier, monospace;">synchronized(counter)</span><span style="font-family: inherit;">statement useless in protecting anything but the </span><span style="font-family: Courier New, Courier, monospace;">counter++</span><span style="font-family: inherit;">. Once that's finished there is a new object in </span><span style="font-family: Courier New, Courier, monospace;">counter</span><span style="font-family: inherit;">. If one thread synchronized on </span><span style="font-family: Courier New, Courier, monospace;">Integer(0)</span><span style="font-family: inherit;"> the </span><span style="font-family: Courier New, Courier, monospace;">counter++ </span><span style="font-family: inherit;">would change it to</span><span style="font-family: Courier New, Courier, monospace;"> Integer(1)</span><span style="font-family: inherit;"> Another thread could then enter the synchronized block holding a lock on </span><span style="font-family: Courier New, Courier, monospace;">Integer(1)</span><span style="font-family: inherit;"> while the original thread continued to lock on </span><span style="font-family: Courier New, Courier, monospace;">Integer(0)</span><span style="font-family: inherit;">.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">t1: synchronized(Integer(0)) {</span><br />
<span style="font-family: Courier New, Courier, monospace;">t1: counter++ // in other words, counter=Integer(1)</span><br />
<span style="font-family: Courier New, Courier, monospace;">t2: synchronized(Integer(1)) {</span><br />
<span style="font-family: Courier New, Courier, monospace;">t1 & t2: // do other work under sync</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">What else is wrong with this? What about the objects we're synchronizing on? I wrote a short program to look at the Integers generated in code such as this.</span><br />
<span style="font-family: inherit;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">public class incInt {</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> public static void main(String args[]) {</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> Integer i = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> i++;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> i++;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("resetting to zero");</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> i = 0;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> i++;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> i++;</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> System.out.println("i="+i + " hash="+System.identityHashCode(i));</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">}</span><br />
<div style="font-family: inherit;">
<br /></div>
<div>
<span style="font-family: inherit;">The result of running this with Oracle's JRE 1.7.0_5 shows that there are canonical values for autoboxed zero, one and two.</span></div>
<div>
<span style="font-family: inherit;"><br /></span></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">> java incInt</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=0 hash=4991049</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=1 hash=32043680</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=2 hash=9499036</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">resetting to zero</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=0 hash=4991049</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=1 hash=32043680</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=2 hash=9499036</span></div>
<div style="font-family: inherit;">
<br /></div>
</div>
<div>
<div>
<a href="http://stackoverflow.com/questions/5277881/why-arent-integers-cached-in-java">Here's a blog post</a> that claims that [-128,127] are cached by the JVM and used in autoboxing. It turns out that the post is right. I modified the test to print out the hashes for zero, one and two and they are the same objects</div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">> java incInt</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=0 hash=31879808</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=1 hash=6770745</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=2 hash=12835244</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">resetting to zero</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=0 hash=31879808</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=1 hash=6770745</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">i=2 hash=12835244</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Integer.valueOf(0)=31879808</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Integer.valueOf(1)=6770745</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">Integer.valueOf(2)=12835244</span></div>
<div style="font-family: inherit;">
<br /></div>
</div>
<div>
<span style="font-family: inherit;">Getting back to the code under review, this means that the synchronization is at least sometimes using a canonical object used by the whole JVM. Anything could sync on </span><span style="font-family: Courier New, Courier, monospace;">Integer.valueOf(0)</span><span style="font-family: inherit;">, causing this code to be affected by code running in other threads. All synchronization should be done on private state or by using well-tested concurrency utilities to avoid accidental conflicts and meddling.</span></div>
<div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div>
<br /></div>
<div>
<br /></div>
Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-48730681494008115852011-09-08T15:08:00.000-07:002011-09-08T15:08:32.522-07:00patent granted<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">This is a follow-up to a <a href="http://brucesch.blogspot.com/2010/09/virtual-synchrony-in-gemfire.html">post</a> I made last year. About four years ago I applied for a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=8,005,787.PN.&OS=PN/8,005,787&RS=PN/8,005,787">patent</a> on a method of replicating data from one process to another without blocking operations on the data. It's used in the GemFire data fabric product to create backup copies of data buckets, and is called "state flush operation". In a way it provides a temporal point of <a href="http://en.wikipedia.org/wiki/Virtual_synchrony">virtual synchrony</a> that assures the new replica bucket sees all of the changes to the data that the original bucket sees.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">I got the idea for this work after reading a paper by Lamport and Chandy published back in 1985, <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.7694">Distributed Snapshots: Determining Global States of Distributed Systems</a>.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Basically what you do is create a sort of catchers-mitt that is set up to record operations on the data during the transfer, then you announce to everyone that the transfer is going to happen. At that point any operations performed on the original bucket will also be sent to the new replica bucket.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Then you send a message to each member of the distributed system that holds the bucket telling them to apply in-process operations to the bucket and then "flush" those changes to the member holding the original bucket. A message gets sent from each of these members to the original bucket holder that tells it which messages have been sent. An observer is created that watches for all of the changes to arrive and be applied to the original. It then sends a notice to the new replica that the operation has completed.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">At this point the data may be copied from the original bucket to the replica bucket, taking care not to overwrite any items that have shown up in the catchers-mitt. Because of the flush we know that the copied data holds any changes that were made prior to creating the catchers-mitt, but the catchers-mitt may hold operations that are newer than what is reflected in the copied data.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br />
</span></div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-90273414337841100712011-08-30T09:10:00.000-07:002011-08-30T09:13:23.003-07:00Chaos Monkey<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;">If you haven't heard of the Netflix Chaos Monkey, read Jeff Atwood's blog. This "monkey" roams around their cloud app killing processes to ensure that the system is resilient. IMO the MTBF for java VMs isn't all that long unless a great deal of testing has been done, so this is a great way to keep the system healthy. Jeff asserts that having the monkey in their system was at least part of the reason that Netflix survived the Amazon Web Services (AWS) crash.</span></span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small; line-height: 18px;"><br />
</span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small; line-height: 18px;">When we test GemFire we run many High Availability (HA) tests that randomly kill server processes and then test to ensure that the product continues to run and maintains consistency. That guarantees that the product reacts to failures correctly in short (10-60 minute) tests, but what about long running distributed systems? It would be nice to build an optional Chaos Monkey into the product that randomly killed off server-side processes (can't kill the clients!). The system-monitoring infrastructure would have to be able to recognize the Monkey's work so that alarms aren't raised, but how hard could that be?</span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;">A smart Monkey could examine metadata about the system and, perhaps, give weight to older processes now and then when selecting a process to kill. That would tend to shake things up a little more in the distributed system and test things like lock grantor fail-over.</span></span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;">The Monkey would need to have a collection of blind spots built into it so that customers could protect VMs that they don't want the Monkey to, er, monkey with. GemFire might be well tested and be able to withstand a Chaos Monkey, but that doesn't mean the systems built with it could survive degradation of their own essential services.</span></span><br />
<span class="Apple-style-span" style="color: #404040; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: x-small;"><span class="Apple-style-span" style="line-height: 18px;"><br />
</span></span></div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-6481978384152691962011-05-06T10:09:00.000-07:002011-05-06T10:09:57.000-07:00Moving to the Cloud<div dir="ltr" style="text-align: left;" trbidi="on">For half a year I've been doing some of my development work on a virtual computer hosted in a data center that I've never seen. It works remarkably well and is like using <a href="http://en.wikipedia.org/wiki/Remote_Desktop_Protocol">RDP</a> to connect to a desktop at work when you're telecommuting. I fire up <a href="http://www.vmware.com/products/view/">VMware View</a> and connect to the computer, giving it one of the <a href="http://www.dexpot.de/">Dexpot</a> screens on my laptop. I can even connect to it on my iPod Touch using <a href="http://www.wyse.com/products/software/pocketcloud/index.asp">WYSE Pocket Cloud</a> and <a href="http://www.zatelnet.com/">zaTelnet</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX_uzBMtDR4wzl-YMbs5R6MAyYx_xBa1mXc2ZwOcuc_GQvPoNjcjwmVgFYBTwS52w_POpwZIf_eZeJEJF-O0wxXKLyu6-HpW3OIOApG6WxEZYdn8b7t7KwuAt5qn3J9MxrTjgH3pmT_XKZ/s1600/Clipboard01.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="223" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhX_uzBMtDR4wzl-YMbs5R6MAyYx_xBa1mXc2ZwOcuc_GQvPoNjcjwmVgFYBTwS52w_POpwZIf_eZeJEJF-O0wxXKLyu6-HpW3OIOApG6WxEZYdn8b7t7KwuAt5qn3J9MxrTjgH3pmT_XKZ/s320/Clipboard01.jpg" width="320" /></a></div><br />
The downside has been that I use other machines to run tests and those machines were seven network hops away from my cloud-based development machine. Any network interaction with those machines was painfully slow. So slow that I stopped using the virtual computer for much of anything.<br />
<br />
Recently that situation changed. Most of the rack-mounted machines that we own were moved to the same data center, so that now it's seven network hops from my desk to all of the machines I use. But it's now only one hop from my cloud-based virtual computer to them. The situation is reversed and the virtual computer is a life saver. I log in, pop up VMware View and the rest of my computing day is spent in the cloud.<br />
<br />
<br />
</div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-25548626984764892842010-11-03T15:45:00.000-07:002010-11-08T10:59:59.927-08:00Performance of Message PatternsI've been faced with a dilemma in distributed processing. I have a "client" process that needs to send a message to several servers. The servers contain a versioning service that stamps a version on each message for concurrency management, but the client doesn't have this service. Each of the servers must end up having the same version number for the message. And it must be blindingly fast.<br />
<br />
Either the servers are going to have to talk to each other to agree on a version for the message or some other trick is going to have to be used.<br />
<br />
The simplest approach is to send a request to one of the servers to get a version number and then send the version out with the message to all of the servers in parallel. This is nice because the message can be serialized to the network in parallel and there are mechanisms in place that will do this very quickly.<br />
<br />
Here's a diagram of the first scenario. If any of the diagrams are hard to read, you can click on it to see a larger image.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZ85hG5WNGEos-IjIm4JDSLfKe_RnobfW63fTbRzna_Yr5xy66qgeopHk_zk9AezSeHoam8KMeUf6xttI4QZbT6h7cvLZRSTHa3HGF68Yn5h16pfpJtbgQn2UXBXOwcO5n9iCKFLikYJLZ/s1600/version_first.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZ85hG5WNGEos-IjIm4JDSLfKe_RnobfW63fTbRzna_Yr5xy66qgeopHk_zk9AezSeHoam8KMeUf6xttI4QZbT6h7cvLZRSTHa3HGF68Yn5h16pfpJtbgQn2UXBXOwcO5n9iCKFLikYJLZ/s400/version_first.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: left;">Instead of sending a request for a version to the first server, the client could send the message and expect a version number in response. It could then add this to the message and send it to the other servers. This would require multiple serialization of the message to on-wire format, but might be faster for small messages.</div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlmr9AyV_eMALHVcLF8aNaZ8cEtGCRn6AXxzLkxo9GTBVFrW_p7CY4nW7KxPXXZsPtnaoOOvg1D0ONw9Vmt1Rl9oeqvCglK9d4-9efAly15L4znlO4MyJafBQlFFUZedwhLJEf2oETse-/s1600/message_version_returned.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="277" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZlmr9AyV_eMALHVcLF8aNaZ8cEtGCRn6AXxzLkxo9GTBVFrW_p7CY4nW7KxPXXZsPtnaoOOvg1D0ONw9Vmt1Rl9oeqvCglK9d4-9efAly15L4znlO4MyJafBQlFFUZedwhLJEf2oETse-/s400/message_version_returned.png" width="400" /></a></div><br />
Another approach is to delegate the sending of the message to one of the servers ("one hop messaging"). The client sends the message to one of the servers and it, in turn, sends the message to the other servers after adding a version stamp to it. This also requires some level of extra serialization since the client must write the message to the network and then the selected server must forward the message to the other servers over the network.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQE7Min2YVSW4ZFgRQL2SF1ZT6iO51VhYItOkuMrChg2I5WjYolzMnqEBz2TzMIPph7KjMNIZq7wjLOCnCSKwVKpzkt-45-pTVWlNmgOF0nRy9-SJXOj2eOgJ68j49diRiVuCvJTQhyJfs/s1600/1hop.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQE7Min2YVSW4ZFgRQL2SF1ZT6iO51VhYItOkuMrChg2I5WjYolzMnqEBz2TzMIPph7KjMNIZq7wjLOCnCSKwVKpzkt-45-pTVWlNmgOF0nRy9-SJXOj2eOgJ68j49diRiVuCvJTQhyJfs/s400/1hop.png" width="400" /></a></div><br />
We can simplify the acknowledgment scheme by having each server send an ack directly back to the client instead of piping them through Server1. This would cause context switching in the client, but might be better than funneling the acks through Server1.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_7vMGGyQFQkAunjBx6jwHQRhd-2oy2-PPfqRxMJwJKO_pJEc107fC5ejnKogFYyPKO-L00bDJonJkWp60U507VvUwdn8z0oRBfaHM60lgLiakjhZesYaj08cyL2alwDYqWMtwrhBHWHgv/s1600/ihop+ack_client.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_7vMGGyQFQkAunjBx6jwHQRhd-2oy2-PPfqRxMJwJKO_pJEc107fC5ejnKogFYyPKO-L00bDJonJkWp60U507VvUwdn8z0oRBfaHM60lgLiakjhZesYaj08cyL2alwDYqWMtwrhBHWHgv/s400/ihop+ack_client.png" width="400" /></a></div><br />
Yet another approach is to have the client send the message to all of the servers and include a tag that selects one of the servers to supply the version tag. After the selected server (Server1 in the diagram below) receives the message it stamps it with a version and then sends that version to the other servers. The other servers wait for the version number before accepting the message. This, too, requires extra context switching because the other servers have to wait for a signal that the version number has been received.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilI-wwcb3uFuWv9SsJ9pZBERjgg3reszh4SAPuvD1YKjfcYFpj9w_D5wQcuAWZW3XNYT8Zt1DGKT_2oZQBbOgrq3JMQ_EjxDJsk-vMIcCkQf60Pb3oJr3-_Xsk7RaipVWLPUAh50PTbVgj/s1600/version_delegate.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilI-wwcb3uFuWv9SsJ9pZBERjgg3reszh4SAPuvD1YKjfcYFpj9w_D5wQcuAWZW3XNYT8Zt1DGKT_2oZQBbOgrq3JMQ_EjxDJsk-vMIcCkQf60Pb3oJr3-_Xsk7RaipVWLPUAh50PTbVgj/s400/version_delegate.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: left;"><br />
</div>I wrote a program to simulate these different approaches to see how they compared. Like the diagrams, I used a client and three servers running on a fairly large and fast Linux computer.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCuhu4kH5QtpO4omtetRvM_dp_7lfBdgK2NDwhJm3buVwSl8OAkRW9YIQ_Pw6ufxSYlEbM9lduAqBzkPRU5GfwhW0zjbVgJsTW6u6x4JABPgaaUj2rVWxII5nu63fnOKB9LKVrVjXnXnxy/s1600/table.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCuhu4kH5QtpO4omtetRvM_dp_7lfBdgK2NDwhJm3buVwSl8OAkRW9YIQ_Pw6ufxSYlEbM9lduAqBzkPRU5GfwhW0zjbVgJsTW6u6x4JABPgaaUj2rVWxII5nu63fnOKB9LKVrVjXnXnxy/s400/table.jpg" width="400" /></a></div><br />
The payload column shows the size of message that was used in each test run, and each column shows how many seconds it took the approch to handle 1 million messages. The "no versioning" column is a base-line that shows the performance of simple send-with-acknowledgement messaging.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcLmxkGlp7Wxv1Ywm9ZMiLWX3G7mbNxQmzIJEB_HIP5gL8DIjt31oqC9LvgEZwz7h8iZ5N5caGyMtpRrCus_ZRLAV22j-3XbSJIXg0xi-KD6JJcrzEZR1Se-lYpuZz-w2Lixi2-1BiA1Rs/s1600/graph.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcLmxkGlp7Wxv1Ywm9ZMiLWX3G7mbNxQmzIJEB_HIP5gL8DIjt31oqC9LvgEZwz7h8iZ5N5caGyMtpRrCus_ZRLAV22j-3XbSJIXg0xi-KD6JJcrzEZR1Se-lYpuZz-w2Lixi2-1BiA1Rs/s400/graph.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: left;">The results were a little surprising to me. I had expected the last approach, where the client sends the message to all servers and they wait for one to send a version number, to have the best performance. Instead, all but one of them converge to the same performance level when messages reach 10,000 bytes in size. At this level they are only moving about 70mb of data through the system per second, so they aren't being throttled by the network, but with the extra synchronization points and context switches CPU was becoming a limiting factor.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: left;">The "version request" and "message returns version" scenarios (the first two diagrams above) are clear losers because server2 and server3 do not even see the message until a complete send and response cycle is performed with server1.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: left;">The "one hop" scenario had a poor showing because of the long acknowledgement chain, with both server2 and server3 sending their acks to server1. Server1 has to wait for both acks before it finally sends its own ack back to the client.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div><div class="separator" style="clear: both; text-align: left;">The clear winner is the "one hop, ack client" algorithm with servers sending acknowledgements directly to the client. It even converged with and then passed the base-line "no versioning" scenario at about 3000 bytes/message.</div><div class="separator" style="clear: both; text-align: left;"><br />
</div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-69014108469880391412010-09-23T15:30:00.000-07:002010-09-23T15:30:23.376-07:00Virtual Synchrony in GemFireA virtual synchrony protocol ensures that all past operations have been delivered before a membership change is put into effect.<br />
<br />
In our GemFire product, the equivalent of virtual synchrony is achieved in Gemfire in a component of the cache called a DistributionAdvisor. These are mini membership managers that track the ownership of cache Regions across the distributed system. When a member creates a new cache Region that others already have, the DistributionAdvisor is used to perform a flush operation (called State Flush in Gemfire) to ensure that all past operations have been applied before the new member's Region is allowed to be used. There are a number of steps involved in this operation for which I have applied for a patent (#11/982,563). Outside of this we haven't seen the need for a full virtual synchrony protocol covering all communications.Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com0tag:blogger.com,1999:blog-5280162151239832362.post-28040219239002614812010-07-01T14:42:00.000-07:002010-09-02T16:38:15.299-07:00JGroups address reuseOne of the things we avoided for years in GemFire was the fact that its JGroups component can reuse an old address with a new member.<div><br /></div><div>When a process joins a JGroups channel it creates an ephemeral datagram socket and uses its port number + the machines IP address to create a membership ID. It then finds the membership coordinator and sends a Join message to it with this ID. The coordinator sends out a new membership view containing the new member's ID.</div><div><br /></div><div>If the address happens to be one that was recently used by another process on the same machine, it can look like the member left and then joined again. This is actually possible for a single process if you use the JGroups merge protocols.</div><div><br /></div><div>GemFire keeps track of recently departed members in a "shunned members" collection and has used this to keep addresses from being reused. If a member tries to join using an old ID, all messages from it are ignored. The new member then times out in its attempt to send startup messages to other members of the system and throws a SystemConnectException.</div><div><br /></div><div>That was fine for a long time, but new customers have brought new demands and we now allow you to restrict the range of ephemeral ports that JGroups uses. This makes ID reuse much more likely. Setting a port range of 1000-1050, for instance, only provides a pool of 51 ports to select from.</div><div><br /></div><div>We tried several approaches to making identifiers less likely to be reused before finding a solution that worked. We tried adding the cache's peer-to-peer stream socket port number to the ID and adding randomly selected bytes to the ID before hitting on the idea of using the JGroups Lamport clock.</div><div><br /></div><div>Now when a member starts to join the system it creates a temporary ID that contains its InetAddress and JGroups port. This address is put in the Locator's discovery set and is sent to the membership coordinator in a Join request. The coordinator responds by sending back the current membership view <i>and</i> a modified ID that the member should use. This ID has the JGroups Lamport timestamp embedded in it and this is then used in ID equality checks to differentiate between two IDs that have the same InetAddress and port.</div><div><br /></div><div>In JGroups the Lamport timestamp is nothing more than the membership view number. It is usually incremented by 1 each time a new membership view is formed. Having this information in the ID not only lets us make them "more unique" but it also lets you see at a glance when a member joined the system.</div><div><br /></div><div>Having this information in the identifier has eliminated address reuse completely. The Lamport clock is never decremented or reset, so each new member is guaranteed to have a unique ID and not be confused with an old, departed member.</div>Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com1tag:blogger.com,1999:blog-5280162151239832362.post-47575143949048893802010-05-12T14:04:00.000-07:002010-05-12T14:33:24.726-07:00GemFire and PaxosPeople have recently been asking me about the Paxos algorithm and how GemFire implements it. I'm not an expert on Paxos itself, but have spent years dealing with the problems that arise if you do not have something similar to it in place in a distributed computing product like GemFire.<br /><br />Paxos is a family of protocols for developing consensus in a distributed system. The basic idea is that there is a Client that sends a message to effect some change. One or more Acceptors exist to service the message, and the message is sent to all of them. A Proposer works on behalf of the client to make sure the Acceptors agree on the message. The message is then acted on and a Learner is tasked to effect replication of the change.<br /><br />There is also a Leader, which is selected from among the Acceptors. If there is no unique Leader, no progress is allowed in the system.<br /><br />In GemFire there are Client processes that send messages to Servers. The Client forms a unique identifier for the message with respect to the Client and the originating thread. In a shortcut of the general Paxos algorithm a Server acts as both an Acceptor and a Proposer. It determines the member having primary storage for the message and forwards the message to it.<br /><br />The member having primary storage plays the role of Learner. It is responsible for effecting the change in primary storage and then replicating that change in other Servers. Once it is done, any result is sent back to the first Server (Paxos Proposer) and the result is sent back to the client.<br /><br />If after sending the message, the Client deems that the Server is taking too long to process it, another Server is selected and the client sends the message (with the same unique identifier) to it. The Server finds the member having primary storage and forwards the message to it. The member having primary storage (Paxos Learner) protects its storage by checking the message's unique identifier. If the identifier has already been applied to the store, it is marked as a possible-duplicate and then replicated to other Servers. Each other server is also responsible for performing duplicate-checks.<br /><br />The "timeout/retry" logic in the Client can cause duplicate processing, but in the normal case it does not and is much faster than a cumbersome back-and-forth Client/Acceptor/Proposer interaction.<br /><br />The Paxos Leader role is handled in GemFire by the low-level membership system. Administrative members of the system and those marked as not being Servers are excluded from election as Leader. After this, the oldest Server is selected as the GemFire "lead" member.<br /><br />Paxos also has the notion of a Quorum. Wikipedia defines this as "Quorums are defined as a family of subsets of the set of Acceptors such that any two subsets from the family (that is, any two Quorums) have a non-empty intersection.". In GemFire this function is handled by the Roles feature. Each Server can note that it provides one or more Roles and each cache region can require one or more Roles in order to function. If the required Roles aren't present, the cache region experiences a "loss action" which can cause it to become read-only, block access, or even shut down and restart the cache. This provides better control over the behavior of the system when there are machine or network failures than a simple Quorum mechanism.<br /><br />Paxos also has the notion of a Choice when there are conflicting values. This is essentially handled by the unique identifiers assigned to messages. The identifiers can be used to order conflicting values and determine which is newer or older than the other. This is done during duplicate-detection and is a very cheap and lightweight way to implement Choice.<br /><br />So there is quite a strong mapping from GemFire's distributed correctness algorithms and Paxos. Where there are differences it is usually because we opted for higher performance at the possible cost of doing a small amount of extra work (duplicate processing).Bruce Schuchardthttp://www.blogger.com/profile/06502272736174814112noreply@blogger.com1