One Page Summary. Gryff: Unifying Consensus and Shared Registers

This paper by Matthew Burke, Audrey Cheng, and Wyatt Lloyd appeared in NSDI 2020 and explores an interesting idea of a hybrid replication protocol. The premise is very simple – we can take one protocol that solves a part of the problem well, and marry it with another protocol that excels at the second half of the problem. This paper tackles replication in geo-distributed strongly consistent storage systems. The authors argue that consensus, when used in storage systems with predominantly read and write operations, is inefficient and causes high tail latency.

A system presented in the paper, called Gryff, takes advantage of predominantly read/write workloads in storage systems and exposes these two APIs via a multi-writer atomic storage ABD protocol.  ABD operates in two phases both for reads and writes. On writes, ABD’s coordinator retrieves the latest version of the register from all nodes and writes back with a version higher than it has seen. On reads, ABD’s coordinator again retrieves the register, writes the highest version back to the cluster to ensure future reads do not see any previous versions, and only then returns back to the client. The write-back stage, however, can be skipped if a quorum of nodes agrees on the same version/value of the register, allowing for single RTT reads in a happy case.

Unfortunately, ABD, while providing linearizability, is not capable of supporting more sophisticated APIs. Read-modify-write (RMW) is a common pattern in many storage systems to implement transaction-like conditional updates of data. To support RMW, Gryff resorts back to consensus and in particular to Egalitarian Paxos (EPaxos) protocol. Choice of EPaxos allows any node in the cluster to act as the coordinator, so it does not restrict writes to a single node like with many other protocols. The problem of this hybrid approach is then the ordering of operations completed with ABD protocol and RMW operations running under EPaxos. Since EPaxos side of Gryff works only with RMWs, it can only order these operations with respect to other RMW operations, but what we need is a linearizable ordering of RMWs with normal writes and/or reads. To keep the ordering, Gryff uses tuples of ABD’s logical timestamp, process ID and the RMW logical counter, called carstamps. Carstamps connect the ABD part of the system with EPaxos – only ABD can update ABD’s logical clock, and only EPaxos updates RMWs counter.

When we consider the interleaving of writes and RMWs, the write with higher ABD’s logical time supersedes any other write or RMW. This means that we actually do not need to order all RMWs with respect to each other, but only order RMWs that have the same base or ABD’s logical time. EPaxos was modified to allow such partial ordering of commands belonging to different bases, essentially making the protocols to have different dependency graphs for RMWs applied to different ABD states. Another change to EPaxos is the cluster-execute requirement, as the quorum of nodes need to apply the change before it can be returned to the client to make the change visible for subsequent ABD read operations.

So, how does Gryff do with regards to performance? Based on the author’s evaluation, it is doing very well in reducing the (tail) latency of reads. However, I have to point out that the comparison with Multi-Paxos was flawed, at least to some extent. The authors always consider running a full Paxos phase for reads, and do not consider the possibility of reading from a lease-protected leader, eliminating 1 RTT from Paxos read. This will make Paxos minimum latency to be smaller than Gryff’s, while also dramatically reducing the tail latency. Gryff also struggles with write performance, because writes always take 2 RTTs in the ABD algorithm. As far as scalability, authors admit that it cannot push as many requests per second as EPaxos even in its most favorable configuration with just 3 nodes.

Can Paxos do better? We believe that our PQR optimization when applied in WAN will cut down most of the reads down to 1 quorum RTT, similar to Gryff. PQR, however, may still occasionally retry the reads if the size of a keyspace is small, however this problem also applies to Gryff when the cluster is larger than 3 nodes.

What about Casandra? Cassandra uses a protocol similar to ABD for its replication, and it also incorporates Paxos to perform compare-and-set transactions, which are one case of RMW operation, so in a sense Gryff appears to be very similar to what Cassandra has been doing for years.

Paper Summary: Bolt-On Global Consistency for the Cloud

This paper appeared in SOCC 2018, but caught my Paxos attention only recently. The premise of the paper is to provide strong consistency in a heterogeneous storage system spanning multiple cloud providers and storage platforms. Going across cloud providers is challenging, since storage services at different clouds cannot directly talk to each other and replicate the data with strong consistency. The benefits of spanning multiple clouds, however, may worth the hustle, since a heterogeneous system will be both better protected from cloud provider outages, and provide better performance by placing the data closer to the users. The latter aspect is emphasized in the paper, and as seen in the figure, going multi-cloud can reduce latency by up to ~25%.

To solve the issue of consistent cross-cloud replication, authors propose to use Cloud Paxos (CPaxos), a Paxos variant designed to work with followers supporting a very minimal and common set of operations: get and conditional put. In CPaxos, clients act as prospers, and storage systems serve the role of the followers. The followers are not really “smart” in this protocol, and most of the Paxos logic shifts to the client-proposers (Figure 2).

The prepare phase in CPaxos simply gathers the state from the followers, making the proposer decide for itself whether the followers would have accepted it with the current ballot or not. If the proposer thinks it would have been accepted, it will try updating the followers’ state. Doing this, however, requires some precautions from the followers, since their state may have changed after the proposer made a decision to proceed. For that matter, CPaxos uses conditional put (or compare-and-set) operation, making the followers update their state only if it has not changed since it was read by the proposer. This ensures that at most one proposer can succeed with changing the state of the majority of followers.

I visualize this as a log to represent changes in some object’s state. The new version of an object corresponds to a new slot in the log, while each slot can be tried with different ballot by different proposers. The put operation succeeds at the follower only if the value at the slot and a ballot has not been written by some other proposer. In case a proposer does not get a majority of successful updates, it needs to start from the beginning: increase its ballot, perform a read and make a decision whether to proceed with state update. Upon reaching the majority acks on state update, the proposer sends a message to flip the commit bit to make sure each follower knows the global state of the operation.

This basic protocol has quite a few problems with performance. Latency is large, since at least 2 round-trips are required to reach consensus, since every proposer needs to run 2 phases (+ send a commit message). Additionally, increasing the number of proposers acting on the same objects will lead to the growth in conflict, requiring repeated restarts and further increasing latency. CPaxos mitigates these problems to a degree. For example, it tries to commit values on the fast path by avoiding the prepare phase entirely and starting an accept phase on what it believes will be next version of an object with ballot #0. If the proposer’s knowledge of the object’s state (version, ballot) is outdated, the conditional put will fail and the proposer will try again, but this time with full two phases to learn the correct state first. However, if the proposer is lucky, an update can go in just one round-trip. This optimization, of course, works only when an object is rarely updated concurrently by multiple proposers; otherwise dueling leaders become a problem not only for progress, but for safety as well, since two proposers may write different values for the same version using the same ballot. This creates a bit of a conundrum on when the value becomes safely anchored and won’t ever get lost.

Consider an example in which two proposers write different values: green and blue to the same version using ballot #0 (Figure 4 on the left). One of the proposers is able to write to the majority, before it becomes unresponsive. At the same time, one green follower crashes as well, leading to a situation with two followers having green value and two being blue (Figure 5 on the right). The remaining proposer has no knowledge of whether the green or blue value needs to be recovered (remember, they are both on the same ballot in the same slot/version!). To avoid this situation, CPaxos expands the fast path commit quorum from majority to a supermajority, namely $$\left \lceil{\frac{3f}{2}}\right \rceil +1$$ followers, where $$2f+1$$ is the total number of followers, and f is the tolerated number of follower failures, allowing the anchored/committed value to be in a majority of any majority of followers . Having this creates an interesting misbalance in fault tolerance: while CPaxos still tolerates $$f[\latex] node failures and can make progress by degrading to full 2 phases of the protocol, it can lose an uncommitted value even if it was accepted by the majority when up to [latex]f$$ followers fail.

Proposer conflicts are a big problem for CPaxos, so naturally the protocol tries to mitigate it. The approach taken here reduces the duration in which possible conflicts may occur. As CPaxos is deployed over many datacenters, the latencies between datacenters are not likely to be uniform. This means, that a prepare or accept messages from some proposer reach different datacenters at different times, creating an inconsistent state. When two proposers operate concurrently, they are more likely so observe this inconsistency: as both proposers quickly update their neighboring datacenters, they run the risk of not reaching the required supermajority due to the conflicting state (Figure 6(a)) created by some messages being not as quick to reach remaining datacenters. To avoid rejecting both proposers, CPaxos schedules sending messages in a way to deliver them to all datacenters at roughly the same time. This reduces the duration of inconsistent state, allowing to order some concurrent operations (Figure 6(b)).

Despite the above mitigation strategy, conflicts still affect CPaxos greatly. The authors are rather open about this, and show their system CRIC with CPaxos degrading quicker than Paxos and Fast Paxos as the conflict rate increases. However, in the low conflict scenario, which authors argue is more likely in real world applications, CRIC and CPaxos improve on performance compared to Paxos/Fast Paxos, especially for reading the data. This is because reads in CPaxos are carried out in one round-trip-time (RTT) by client-proposer contacting all followers and waiting for at least a majority of them to reply. If the client sees the latest version with a commit flag set in the majority, it can return the data. Otherwise, it will wait to hear from more followers and use their logs to determine the safe value to return. In some rare cases when the proposer cannot determine the latest safe value, it will perform the recovery by running the write path of CPaxos with the value to recover (highest ballot value or highest frequency value if more than one value share the ballot).

Some Thoughts

• The motivation of the paper was to make strongly consistent system spanning multiple clouds providers and storage systems for the benefit of improved latency though leveraging the location of datacenters of these different providers. However, CRIC and CPaxos protocol requires a lot of communication, even on the read path. During reads, a client-proposer contacts all CPaxos nodes, located at all datacenters, and in best case still needs the majority replies. As such the latency benefit here comes from trying to get not just one node closer to the client, but a majority of nodes. This may be difficult to achieve in large systems spanning many datacenters. I think sharding the system and placing it on subset of nodes based on access locality can benefit here greatly. For instance, Facebook’s Akkio paper claims to have significant reduction in traffic and storage by having fewer replicas and making data follow access patterns. In our recent paper, we have also illustrated a few very simple data migration policies and possible latency improvement from implementing these policies.
• One RTT reads in “happy path” can be implemented on top of regular MutliPaxos without contacting all nodes in the systems. Reading from the majority of followers is good enough for this most of the time, while in rare circumstances the reader may need to retry the read from any one node. More on this will be in our upcoming HotStorage ’19 paper.
• The optimization to delay message sending in order to deliver messages at roughly the same time to all nodes can help with conflict reduction in other protocols that suffer from this problem. EPaxos comes to mind right away, as it is affected by the “dueling leaders” problem as well. Actually, CPaxos and EPaxos are rather similar. Both assume low conflict rate to have single round trip “happy path” writes and reads. When the assumption breaks, and there is a conflict, both switch to two phases. EPaxos is better here in a sense that the first opportunistic phase is not totally wasted and can be used as phase-1 in the two phase mode, whereas CPaxos has to start all the way from the beginning due to the API limitation on the follower side.

Looking at State and Operational Consistency

Recently I rediscovered the “The many faces of consistency” paper by Marcos Aguilera and Doug Terry. When I first read the paper two years ago, I largely dismissed it as trivial, and, oh boy, now I realized how wrong I was at that time.  It is easy to read for sure, and may appear as some summary of various consistency models at first, but it is thought provoking and really makes you ask more questions and draw interesting parallels after giving it some quality time.

Murat gave a good summary of this paper recently in relation to his sabbatical. The questions he asks after the summary, however, provoke even more thoughts about consistency and how we classify, categorize and view it.

In a nutshell, the paper talks about consistency from different perspectives, namely state consistency as observed by a system itself and operational consistency that clients see. State consistency involves enforcing a system-state to hold some invariants. State consistency promotes invariant-based reasoning. The operational consistency is different, as it looks at the system from the client point of view. Outside clients do not directly observe the state of the system, instead they perform operations against it and can only see the results of these operations. So in short, state consistency is invariant-based and concerns with the internal state of the systems. Operational consistency deals with what clients observe from the outside of the system. These operational consistencies include various types sequential equivalence, such as linearizability and serializability, and other client-centric guarantees, like read-your-write or bounded-staleness.

For more details, read the original paper, Murat’s summary or one from the morning paper.

“Strength” of state consistency.

What strikes me right away is how different the two types of consistency are described. Operational consistency gives us the framework for reasoning about systems without having too many details about the internals. We can gauge the relative “strength” of different consistency models and put them in perspective against each other. We know the serializability is weaker then session serializability. Or that linearizability is stronger than sequential equivalence.

But what about state consistency? There is no such reasoning framework. And in fact, it is not easy to even classify state consistency, yet along reason about the “strength” of different state consistency classes. The paper mentions a few state consistency models, such referential integrity for databases, or mutual consistency in primary-backup systems, or error bounds. But these are not generally applicable across the board. These examples of state consistency operate within the constraints of their specific domains or applications.

However, state consistency still comes at different “strength” levels. When reasoning about state consistency, we use invariant-based approach. The invariants on the state we need to enforce for an eventually consistent data-store and a strongly-consistent one are different. In the former case, we can be more relaxed, since we only need to make sure that at least one future state will have different nodes of the store to have the same data. The latter case is more complicated, as we need to hold stricter invariants (i.e. no two alive nodes have different committed value for the same slot in the log, and there can be only one active leader, and the leader must process the commands in the receive order, etc.).

It is the “tightness” of the invariants that makes state consistency strong or weak. But deciding which invariant is tighter or stricter is hard. We cannot simply compare various invariants on the merits of when they should hold, or how many parameters they cover or how many nodes they span, as all these (and other) metrics mean something only in the context of their systems/problems. Invariant that must hold at every state is not necessarily tighter than the eventual one, as it may simply be an invariant against some irrelevant or trivial parameter that holds all the time anyway and has no impact on the system.

And this is why we often translate these invariants and state consistency they represent into the operational consistency. The operational side of things allows us to observe the impact of the invariants on the system, albeit indirectly. Operational consistency levels the playing field and enables the comparison from the external point of view. It allows us to gauge how otherwise hard-to-compare invariants at the state-consistency level affect the outcome of operations.

Smart systems or smart clients

Does this mean that a system providing stronger operational consistency has stronger state consistency? Well, it would have been too simple if that was the case. It often happens, and more so recently, that systems have “misalignments” between their internal state consistency and the operational one exposed on the client side.

A system that provides stronger operational semantics may do so because it has a strong state that makes it easy to expose the strong operational consistency. These smart systems preserve strong state-consistency at all costs. They may need to run complicated algorithms (i.e. Paxos) to achieve that, but doing so makes the clients lean and simple with minimal or no state at all.

On the other hand, simple systems may forgo the complicated protocols needed to enforce a strong state. Instead they aim to run as lean as possible at the core system level and shift as much burden to the smart clients. These clients need to have more complex state and protocols if they are to provide stronger operational guarantees. There are systems that do exactly that.

Both “smart system – lean client” and “lean system – smart client” approaches have their advantages and drawbacks. Designing and maintaining a smart system may actually be simpler: all the things engineers need are readily available at the system level. Invariant-based reasoning and tools like TLA easily apply in this setting. Debugging is simpler too, since lean and stateless clients are likely not the cause of a problem, and internal logging can help collect all the necessary information. On the other hand, having a lean system may improve the performance by reducing the bottlenecks, spreading load more evenly across the nodes and even sharing the load with smart clients. But it comes at some engineering costs. Protocols now involve more state and state is even more distributed: both at the lean system nodes and smart clients. Modeling this state with TLA is still possible, but it will likely take more time to check and require a more complicated models that include clients and client interactions with the system nodes. Debugging may be slowed down due to the lack of necessary data, since many issues (especially on production) may happen at smart clients outside of the engineers’ reach.