Tag Archives: replication

Scalable but Wasteful or Why Fast Replication Protocols are Actually Slow

In the last decade or so, quite a few new state machine replication protocols emerged in the literature and the internet. I am “guilty” of this myself, with the PigPaxos appearing in this year’s SIGMOD and the PQR paper at HotStorage’19. There are better-known examples as well — EPaxos inspired a lot of development in this area. However, there seems to be a big disconnect between literature and production. While many sophisticated protocols, such as EPaxos, Atlas, SDPaxos, Comprmentalized Paxos, appear in the literature, and some of them are rather popular in the academic community, the production systems tend to rely on more conservative approaches based on Raft, Multi-Paxos, or primary-backup.

There are probably a few reasons for this disconnect. For instance, as we discussed in Paxos vs Raft reading group meeting, explaining protocols in terminology closer to actual code and having good reliable reference implementations help with adoption. However, the difficulty of implementing some technology should not be a roadblock when this technology offers substantial improvements. 

Well, it appears that there are substantial quantitative improvements. Consider EPaxos. The protocol presents a leader-less solution, where any node can become an opportunistic coordinator for an operation. The exciting part is that EPaxos can detect conflicts, and when operations do not conflict with each other, both of them can succeed in just one round trip time to a fast quorum. The original paper claims better throughput than Multi-Paxos, and this is not surprising, as EPaxos avoids a single-leader bottleneck of Multi-Paxos and allows non-conflicting operations to proceed concurrently. This is a big deal! 

Throughput of Multi-Paxos and EPaxos in 5 dedicated nodes cluster under low conflict.

I wanted to find the truth here, so I and a couple of colleagues worked on experimentally testing Multi-Paxos and EPaxos. We performed a simple experiment, shown in the figure here, to confirm EPaxos’ performance advantage, and we achieved somewhere between 15-25% better throughput in 5 nodes EPaxos than Multi-Paxos when running on identical VMs (EC2 m5a.large with 2vCPUs and 8 GiB of RAM). Original EPaxos paper has a bit larger performance difference in some situations, but we have spent significantly more time on our Multi-Paxos implementation and optimizations than EPaxos. Anyway, we see a substantial improvement that was stable, reproducible, and quite large to make a real-world difference. Not to mention that EPaxos has a few other advantages, such as lower latency in geo-distributed deployments. So what gives? 

I have a theory here. I think many of these “faster” protocols are often slower in real life. This has to do with how the protocols were designed and evaluated in the first place. So bear with me and let me explain. 

Most evaluations of these consensus-based replication protocol papers are conducted in some sort of dedicated environment, be it bare-metal servers in the lab or VMs in the cloud. These environments have fixed allocated resources, and to improve the performance we ideally want to maximize the resource usage of these dedicated machines. Consider Multi-Paxos or Raft. These protocols are skewed towards the leader, causing the leader to do disproportionately more work than the followers. So if we deploy Multi-Paxos in five identical VMs, one will be used a lot more than the remaining four, essentially leaving unused resources on the table. EPaxos, by design, avoids the leader bottleneck and harvests all the resources at all nodes. So naturally, EPaxos outperforms Multi-Paxos by using the resources Multi-Paxos cannot get a hold of ue to its design. Such uniform resource usage across nodes is rather desirable, as long as avoiding the leader bottleneck and allowing each node to participate equally comes cheaply. 

Aggregate CPU utilization for Multi-Paxos and EPaxos in 5 nodes

Unfortunately, it is not the case, and EPaxos’ ability to use resources that would have been left idle by Multi-Paxos comes at a steep penalty! Intuitively, EPaxos needs to do a lot more work to maintain safety and remain leader-less — it needs to communicate the dependencies, compute dependency graphs, check for conflicts in these graphs, and resolve the conflicts as needed. This added complexity contrasts with Multi-Paxos that provides most of its safety via a few simple comparisons both at the leader and followers. To try to estimate the cost of EPaxos’ role uniformity, we need to look at resource usage. In the same experimental run as in the absolute performance figure earlier, we have measured CPU usage across all VMs. Expectedly, we were able to observe that EPaxos is very good at using all available resources, and Multi-Paxos consumes the entire CPU only at one node – the leader. However, it does not take long to spot something odd if we start looking at the aggregate resource usage, shown in the figure here. It turns out that Multi-Paxos achieved nearly 14k ops/s of throughput consuming roughly 200% CPU (and leaving 300% unused), while EPaxos did a bit over 18k ops/s with all 500% of aggregate CPU utilization. 

Throughput normalized by the CPU usage

This suggests that EPaxos needs overall more CPU cycles across all nodes to finish each operation. We can see this better if we normalize the throughput per the amount of CPU consumed. As seen in the figure, this normalized throughput presents a big gap in the efficiency of protocols. Of course, this gap may vary and will depend on the implementation. It may even shrink if some better engineers implement EPaxos, but I also think there is a fundamental issue here. EPaxos and many other protocols were designed to take advantage of all allocated resources in the cluster since allocated but unused resources are wasted. This resource usage paradigm is common when we have dedicated servers or VMs. However, technology has moved on, and we rarely deploy replicated services or systems all by themselves in isolation. With the advances in virtualization and containerization, we increasingly deploy replicated systems in resource-shared environments that are designed to pack applications across clusters of servers in a way to avoid having idle physical resources. 

Aggregate throughput of 5 Multi-Paxos and EPaxos instances in a task-packed 5 nodes cluster.

Moreover, most replicated systems are sharded, so we normally have multiple instances of the replication protocol supporting shards of one larger system. This makes task-packing simpler as we have many relatively uniform tasks to schedule between machines. Consider Yugabyte DB for example. Karthik Ranganathan in his talk described how the system schedules raft groups across a cluster such that some server has a few resource-heavy raft leaders and plenty more resource-light raft followers. This type of packing allows Yugabyte (and Cockroach DB, Spanner, Cosmos DB, and so on) to achieve uniform resource usage on VMs or dedicated servers while having non-uniform, but efficient replication protocols. Here I show how 5 instances of Multi-Paxos compare with 5 instances of EPaxos when deployed on larger servers (32 GiB RAM and 8 vCPUs). This drastic difference compared to the dedicated environment is due to the fact that we were able to pack multiple instances of Multi-Paxos to consume all the resources of the cluster, and clock-for-clock, EPaxos has no chance of winning due to its lower efficiency. 

Ok, so I think there are quite a few things to unpack here. 

  • Absolute performance as measured on dedicated VMs or bare-metal servers is not necessarily a good measure of performance in real-life, especially in resource-shared setting (aka the cloud)
  • We need to consider efficiency when evaluating protocols to have a better understanding of how well the protocol may behave in the resource-shared, task-packed settings
  • The efficiency of protocols is largely under-studied, as most evaluations from academic literature simply focus on absolute performance.
  • The efficiency (or the lack of it) may be the reason why protocols popular in academia stay in academia and do not get wide adoption in the industry. 

In conclusion, I want to stress that I am not picking on EPaxos. It is a great protocol that has inspired a lot of innovation in the area distributed consensus. Moreover, I am sure there are use cases for EPaxos (or more recent extensions, such as Atlas), especially in the WAN setting when trading of some efficiency for better latency may be acceptable. Many other protocols (mine included) may have similar efficiency problems, as they approach the performance issue similarly – find the bottlenecked node and move its work elsewhere. The fundamental issues at play here are that (1) moving the work elsewhere comes at an efficiency penalty, and (2) in many modern environments there may not be any resource available for such “moved” work due to resource budgets and tight task-packing.

I, Abutalib Aghayev, and Venkata Swaroop Matte discuss the efficiency issues in a bit more detail in our upcoming HotStorage’21 paper: “Scalable but Wasteful: Current State of Replication in the Cloud.”

Reading Group. Strong and Efficient Consistency with Consistency-Aware Durability

In the 62nd reading group session, we covered the “Strong and Efficient Consistency with Consistency-Aware Durability” paper from FAST’20. Jesse did an excellent presentation for the group that explains the core of the paper rather well:

This paper describes a problem with many leader-based replication protocols. It specifically focuses on ZooKeper and Zab, but similar issues arise in Multi-Paxos, and Raft systems as well. Linearizable reads are expensive, as they need to go through a lease-protected leader (or carry out a full consensus round in consensus-based replication if leases are to be avoided). This puts a single leader node in the middle of everything, which is bad for throughput. 

But what if we allow reading from followers? Systems like ZooKeeper allow such reads, but these have a few problems. For one, followers may be stale, so such reads do not have the same recency guarantees as the linearizable solutions. Secondly, reading from followers violates monotonic reads if the client is to jump between the followers it reads from without having a special session/recency token. The third problem, the one the paper is specifically solving, is that there is no monotonicity between clients — when one client reads a value of version n, nothing prevents another client at a later time to read version n-1 from a different node. 

So, the goal of the paper is to allow reading from followers while sacrificing as little consistency as possible. The system introduced in the paper, called ORCA, solves two consistency-related problems. First, it ensures that local followers return to clients only durable data that cannot be lost in the face of failures. Second, ORCA enforces some recency guarantees in the form of cross-client monotonic reads.

To achieve both goals, the system introduces a globally replicated durable-index. The purpose of this index is to let the nodes know up to which point in the log the data has been durably (here durability means persistence to some non-volatile storage) replicated to some quorum of nodes (traditionally, this would be a majority quorum). This durability marker allows the followers to avoid returning a value that can be lost due to a crash. For example, consider some object k with the initial value k1 and a corresponding durable-index=1. Another value k2 is being replicated in log position 2. If a local node returns k2 before it is durable, there is a chance that k2 was not replicated and persisted at the majority of nodes yet, and a minority partition or failure can erase k2. To prevent this, the leader maintains a durable-index that gets updated once the quorum of replicas acknowledged the persistence of data to storage. In the example, once the k2 is durable and followers acked, the leader updates its durable-index:=2, and sends it to the followers. A follower can only return a value k2 if it has received a durable-index>=2 (i.e the index that has moved to or past the value k2). Naturally, there are plenty of situations when the follower may be aware of k2 but has not received the durable-index=2 update just yet. In this case, the follower cannot return the data just yet. It cannot return k2, since it is not durable yet, but it also cannot return k1, as some other follower may have already returned k2 to a client. One solution would be to wait out and see whether the durable-index eventually catches up. A more proactive approach would forward the read to the leader, and let the leader synchronously replicate k2 (and tell followers to flush it to disk) before returning the value to the client. The paper favors the proactive approach, as it acts somewhat as “read-repair” applicable to a variety of leader-based replication schemes. 

The paper presents this durable-index as a big contribution, although I am a bit skeptical about this. See, existing systems have been doing the whole durable-index business for a very long time. I will focus on just one example, however — Raft. In Raft, the leader must communicate the leader’s commitIndex with the followers after a quorum has been reached so that all nodes can mark all log items or updates up to commitIndex as committed. If we assume that Raft persists operations to disk at each node upon receiving them, then the leader’s commitIndex in the “commit” message (i.e next AppendEntry RPC) is the same as durable-index from this paper. A similar global commit progress marker can be implemented in Paxos-style systems despite a few challenges with out-of-order commits in Multi-Paxos. 

For paper’s defense, it makes the durable-index a bit more general than Raft’s approach. For example, the paper allows for more relaxed primary backup implementations, asynchronous replication, and implementations with delayed disk persistence. Consider adding durable-index to an asynchronous leader-driven system to allow durable reads. The system can check whether the value to be returned is durable, allowing it to do synchronous “read-repair” only when needed, and preserving the benefits of async replication as often as possible.

Ok, so we have a durable-index that marks the global progress of the system, and the leader distributes the durable-index to all nodes. This is enough to ensure the data read once won’t ever get lost due to a failure, but this is still not enough to ensure monotonicity. If one server in the cluster of 3 is slow, it may not have received some updates and the new durable-index as 2 other faster nodes have successfully applied some more operations. This can lead to the slow node returning older data than the other two nodes may have already returned. Similar can happen when one replica is network partitioned. For a single client, this monotonicity problem can be solved with some sort of session token to make recency of data observed by the client, but there is no elegant solution for the cross-client case. 

The naive solution to the monotonicity problem, which Jesse describes in great detail in the presentation, expands the durable-index requirement to all nodes. In this case, a follower can only return a value when its state matches the durable state agreed upon by all nodes, so there can be no slow or partitioned replicas. For instance, we may have 3 nodes, all having a persisted history of k1, k2, k3. Since all nodes have the value k3, then it is clearly durable. A leader node can then send the durable-index=3 to signal the rest of the cluster that a 3rd item (i.e. k3) is durable. Let’s say one follower f1 has received this durable-index. We can allow this replica to return k3 locally. At the same time, other follower f2 may not have received the durable-index update just yet, so it is aware of k3 but does not think it is durable. The node f2 won’t return k2 or k3 and instead will forward the read to the leader, which will ensure that k3 is durable and no other operation is in progress before returning k3.

Requiring durability/replication to all nodes has been explored before. If I continue to draw parallels with Raft and take into account the flexible quorums result, then configuring Raft with a replication phase requiring a quorum of all nodes will achieve this durable-index of all nodes approach. Needless to say that Raft will provide linearizability and not just monotonic reads (as long as followers block in case of a dirty read until they receive updated commitIndex from the leader). There is an obvious problem with the quorum of all nodes crippling the availability when one node goes down, however, there are also examples of production systems that make this work to some degree.

Anyway, ORCA does not require durability on all nodes due to availability issues. Instead, it introduces the “active-set,” a set of nodes that must achieve durability for the durable-index to progress. Reads are possible against the replicas in the active-set, while reads against nodes outside of the active-set are prohibited (forwarded to leader?). The active-set must be at least the majority of nodes for fault tolerance reasons. The lease protects the active-set to allow failed or partitioned nodes to fall out of the set safely. This idea is largely described in Paxos Quorum Leases paper, which, again, provides linearizability. 


1) Linearizable Alternatives

Flexible Quorums in Write All Read One configuration. I already touched on this point in the summary, but I also raised it in our group’s discussion. One thing that is different between just running Raft with write quorum of all nodes and ORCA is that ORCA can do “fast writes” since it is not doing linearizability. These fast writes are just uncommitted writes when clients receive a write acknowledgment before an operation has been quorum-committed. Some systems, like MongoDB, also have them, and they seem to cause more confusion and problems than benefits. So, we do not think this is really a good feature to have in a system with stronger consistency guarantees. In fact, this consistency relaxation seems to be the only big difference between Flexible-Paxos (Raft) approach and the naive ZooKeeper-based ORCA. 

Paxos Quorum Leases. Paxos Quorum Leases (PQL) paper was brought up during the discussion too. At the high level, PQL is similar to lease-protected active-sets of ORCA. However, we were surprised to see a complete lack of comparison between ORCA’s active-set approach and PQL. 

Paxos Quorum Read. The problem of scaling read throughput in strongly consistent systems has been explored by me before. Paxos Quorum Read (PQR) approach allows client-driven reads from a quorum of followers. This is not as scalable as reading from just one node, but PQR avoids any leases and sticks to linearizability. 

2) Safety of durable-index

Lease safety. As with all systems that rely on leases, proper timeouts are essential for safety. Lots of care must be put into establishing proper timeouts. For example, it is essential to have partitioned nodes in active-set to timeout and “kill themselves” before the leader has a chance to timeout and pick a new active-set. If a leader creates a new active-set before the bad node in the old one dies, we may have a situation with two nodes returning different data. Timeouts also need properly working clocks. If one clock ticks too quickly or too slowly, this may also be a problem. 

Replication safety. An interesting point was brought up regarding the safety of durable-index. Since the durable-index update may get delivered faster to some nodes and slower (or be dropped altogether) to others, this creates a situation when a client can read from a more up-to-date node and receive some newer data, let’s say value k3. A client may then contact a different node that has not received the new durable-index and still thinks k2 is durable. This node cannot return k2as it would be a violation of monotonic reads! The node must wait for k3 to become durable, which in ORCA means that the read gets forwarded to the leader. So in other words, if a node knows of some non-durable update to an object, it is prettymuch blocked, and needs to forward to the leader to avoid safety problems. This leads to the next discussion point.

3) Performance cost. The safety of the protocol relies on fixing the durability while doing a read operation. So every time a read is done against the node that knows of some newer update that is not durable yet, the protocol forwards to the leader and potentially re-replicates. Even the forwarding part alone can be costly. It will cost not only extra latency but also the resources needed to support these multi-hop operations. 

ORCA is banking on the fact that we have a system with multiple keys/objects and that the access patterns are such that it is unlikely for us to both have an outstanding non-durable update and a read operation for the same key at the same time. This is generally an ok assumption, and many systems do that. The problem is that the opposite case is a complete disaster for the system. Having just a handful of “hot keys” will cause frequent forwards to the leader and all the extra work and extra latency. At the very best this will create bad latency, but because this creates a lot more work for the system, it actually lowers the maximum throughput achievable in the system. Can this translate to a metastable failure?

4) Practicality of cross-client monotonic reads. The final discussion topic I want to mention is the practicality of such cross-client monotonic read consistency. The paper goes into a few examples. However, monotonic reads are a far cry from more stringent consistency models. ORCA does not explicitly enforce read-your-write property, which, intuitively, is desirable in a system claiming strong consistency. For instance, Facebook goes out of its way to provide read-your-write consistency across its stack of weakly consistent systems to make development easier and prevent user confusion. 

As a last remark, I want to add that consistency properties with the paper’s approach will depend on the underlying leader-based protocol and implementation. As I mentioned a few times, Raft already has durable-index backed in, and solutions like PQL can provide something similar to the active-set described here for a completely linearizable system. However, the durable-index and active-set approach can be applied to weaker-consistency leader-based replication, and in such cases, this can improve consistency, hopefully at a low-enough cost. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage

Short Summary

pando_delegatesYesterday we discussed Pando, a geo-replication system achieving near-optimal latency-cost tradeoff in storage systems. Pando uses large Flexible Paxos deployments and erasure coding to do its magic. Pando relies on having many storage sites to locate sites closer to users. It then uses Flexible Paxos to optimize read and write quorums to have the best latency for a particular workload. Finally, erasure reduces the cost associated with having many storage sites. There are a few caveats to the paper. First is the slowness of the (Flexible) Paxos protocol, as it requires two phases to complete an operation. Pando addresses this issue with a cool use of delegate nodes — one of the nodes from phase-1 quorums acts as a delegate on behalf of the client-proposer to collect phase-1 acks and start phase-2. This can save significant latency, getting the full two-rounds in geo deployments closer to the one round RTT of a regular proposer-driven Paxos. The second problem is the quorum overlaps when using erasure coding. If the data is split into parts, we need to ensure the k-nodes overlap between the read and write quorums, making erasure coding detrimental to achieving low latency. This k-nodes overlap is needed to restore the latest version of the data, as having fewer than k-nodes with the latest version is not enough to reconstruct the data. Pando, however, first tries to read from a smaller quorum that guarantees just one node overlap and hopes that there will be enough nodes that have received the update by the time a read is issued. If there are no k nodes to reconstruct the data, Pando continues waiting for more nodes to respond to get to that large quorum with k-nodes overlap. 


The presentation video:


1) ConfigManager. ConfigManager is a component of Pando that computes desirable quorums and delegates for each key and key’s access pattern. We felt like this ConfigManager is very important for the performance of Pando, and it was rather under-specified in the paper. For example, we were wondering if it is possible to make the ConfigManager dynamic and make it learn the workload paters and adjust as needed. 

2) Read consistency. Pando uses quorum reads while using Paxos for writes. Naively reading from a quorum is dangerous and can lead to linearizability problems, yet the reads are not discussed in detail in the paper. The paper briefly in one sentence mentions that to enable quorum reads a global acknowledgment is needed to make sure all nodes know a value has been committed. But Pando does not explain how such commitment works and does not study the performance consequences of this. For example, if a value has been accepted by all nodes, but the commit message is still in flight, does the client read the previous value, or does it retry? If the read quorum has some nodes that have received the commit messages, and some that do not, is it safe to read a new value, or a retry is needed? We do not think it is safe to read, and the paper seems to mention that in passing as well. There are quite a few more questions around the reads and their safety here, so we wished the paper addressed the reads in greater detail. 

3) Rare conflicts. One of the motifs in the paper is that conflicts are rare in scenarios Pando is designed for. This influences a lot of design, as most of the systems are written to work well in the conflict-free case, and somehow address conflict case. For example, when two clients try to write on the same key/document, that will cause a conflict, and Paxos is not very good at handling it due to the dueling-leader problem. Same for the read problem above in (2), as we feel like conflicts will cause the retries, significantly impacting the performance. What is interesting is that despite largely conflict-free evaluation, Pando did better than Fast Paxos in latency, a protocol designed for 1 RTT consensus in conflict-free cases. We feel like the reason is that Pando can get its writes close to 1 RTT with the clever delegates trick, while it can keep the read quorums small to optimize for latency, while Fast Paxos approach is stuck with larger quorums for reads. 

4) Quorum Size. We have spent quite a bit of time discussing quorum sizes. We were confused a bit about the quorum descriptions from the paper, but in the end, we were able to figure out our misunderstanding.

5) Comparison with Atlas. Pando extensively compares with EPaxos. Atlas is a newer, EPaxos-influenced protocol, so we were curious about how the two would compare. 

6) Scalability.  Reducing latency is very important for WAN setups, however, we felt that the paper is not addressing the throughput question. Having too many nodes may have an impact on scalability/throughput. Since Pando uses the clients as proposers, the grunt of dealing with too many storage replicas goes to the clients, and scalability should not be impacted that much. However, the delegates may become the bottlenecks for some workloads.

7) Delegates for throughput. Pando uses delegates to improve latency. We noticed that a somewhat similar idea is used in Compartmentalized Paxos and PigPaxos to improve the throughput instead.

Next Meeting

As always, next week we will continue looking at more great distributed systems papers. We are starting with the OSDI 2020 bunch of papers, and in our next meeting, we will discuss “Fault-tolerant and transactional stateful serverless workflows.” Please join our slack to participate in the discussions and Zoom meetings.

Reading Group. Autoscaling Tiered Cloud Storage in Anna.

This week we looked at “Autoscaling Tiered Cloud Storage in Anna.” This is the second Anna paper. The first one introduces Anna Key-Value store, and the second paper talks about various “cloud-native” improvements. The presentation by Michael Whittaker is available here:

Short Summary

The Anna Architecture

Anna is an eventual-consistent key-value data store, where each value is a lattice. Anna can tweak some consistency guarantees by changing the type of lattice stored in the value. For example, it can be a Last Write Wins (LWW) system that merges operations based on the definition of the last writer (i.e. a timestamp), or it can use some CRDT lattice to preserve all updates/operations. The weak consistency of Anna means that it is relatively easy to add and remove nodes to scale the systems. The Autoscaling part takes advantage of that and dynamically adjusts the number of replicas responsible for each key to match the workload. Moreover, Autoscaled Anna has multiple usage tiers for keys based on the usage frequency. The paper discusses “hot” and “cold” keys, but more tiers are possible. Such tiering allows using heterogeneous hardware to better balance the cost and performance of Anna, as “cold” keys may run on cheaper hardware backed by SSDs, while “hot” keys may be store in RAM on more expensive VMs. Anna has rules for data transitioning between tiers and for controlling the degree of replication. For example, there is a minimum replication factor for durability, and there are usage thresholds for promoting and demoting keys to/from hot/cold tiers.


We have spent quite some time talking about this paper, as there is a lot of information to digest. Below are a few of the bigger discussion points

1) Apache Cassandra comparison. This came up a few times as lacking, as the system is somewhat similar to Cassandra, at least when you consider the LWW lattice. There are many differences too, especially in the context of autoscaling. Cassandra uses a statically configured replication factor and lacks any autoscaling. Cassandra’s DHT model is pretty rigid and resists scaling – adding nodes require considerable ring rebalancing. Also, only one node can be added/removed from the ring at once to avoid membership and ring partitioning issues. Anna has no such problem and can adjust quicker and more dynamic. Additionally, for all the fairness, the original Anna paper compares against Cassandra.

2) Breaking Anna. Part of the discussion was around breaking the autoscaling and putting Anna in a bad/less-performant state. One example would be using a workload that changes at about the same rate as Anna can adjust. For example, if we hit some keys above the threshold to promote from the “cold” tier to “hot”, we can cause a lot of data movement as these keys potentially migrate or added to the memory-tier hardware. At the same time, we can reduce the usage, causing the systems to demote these keys, and experience associated overheads. This is like putting a workload in resonance with Anna’s elastic abilities, and it reminds me of a famous Tacoma Narrows bridge collapse

3) Possible consistency guarantees. One of the benefits of Anna is that some guarantees can be tweaked based on the lattice used for the value. Part of the discussion focused on how much consistency we can get purely out of these lattices, and how much may necessitate more drastic architectural changes. It is pretty straightforward to understand that Anna won’t ever provide strong consistency, but what about some other guarantees? For example, the original Anna paper claims the ability to have causal consistency, although it uses more complicated lattices, and keeps vector clocks for maintaining causality, not only making the system more complicated but also potentially harming the performance (something noted by the authors themselves).

4) Comparison fairness. Some points were mentioned about the evaluation fairness. It appears that some systems compared against are significantly more capable and/or may provide better guarantees(DynamoDB, Redis). It is worth noting, however, that there are not that many, if any, direct competitors, so these comparisons are still very valuable.

5) Cloud-Native. This appears as one the “most” cloud-native database papers that talks about some practical aspects of designing and running a  cloud-native database. One aspect of this “cloud-nativeness” is the focus on elasticity and scalability happening automatically. Another one is the cost-performance consideration in how the system scales with regard to workload changes. We noted some other much older papers that fit our definition of “cloud-native” papers. One example was Yahoo’s PNUTS paper that describes a global system with some capability to adjust to workloads and selective replication.

6) Scheduling/resource allocation. One of the challenging aspects of  Anna with autoscaling is figuring out the best key “packing” or binning” given multiple tiers and pricing constraints. This is similar to a Knapsack problem and bin packing problem, and both are NP-complete. So this resource management/scheduling will become a big problem at the production scale when we may have thousands of machines. At the same time, task packing problems at the datacenter level are well studied, for example in Google’s Borg cluster management system or its derivative Kubernetes


Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

A Few Words about Inconsistent Replication (IR)

Recently I was reading the “Building Consistent Transaction with Inconsistent Replication” paper. In this paper authors use inconsistently replicated state machine, but yet they are capable of creating a consistent transaction system. So what is Inconsistent Replication (IR)?

In the previous posts I summarized Raft and EPaxos. These two algorithms are used to achieve consensus in the distributed system, so for example when we deal with replicated state machines, these algorithms allow each replica to be an exact, consistent copy of each other. So, it is logical to assume that Inconsistent Replication will not produce the same replicas all the time, so our state machines can end up in different states. Why would we want to have a replicated state machine with various copies potentially being in different states? According to the authors of the paper it is faster than consistent replication, yet can still be used in some applications, such as transaction commit. I think the usage of IR will not be as straightforward as using consistent replications, since users of IR must also design their applications in such a way that tolerates the inconsistent state of the nodes.

IR does not guarantee the order in which each command is executed by replicas, thus replicas can reach different states unless the operations are independent of each other.  The figure below illustrates hot this can happen.


Figure 1. Replicas in inconsistent state. (a) Two requests C1 and C2 are being sent to replicas at the same, but the requests reach replicas in different orders. (b) Requests are logged and executed in the order they have been received leading to an inconsistent state.

Since replicas can be in different states, IR cannot guarantee that recovering from failures can preserve the value or the states of each recovered operation. Because of this, IR has two types of guarantees it can provide upon recovering from a failure. If a client does not need to have an ability to recover the value of each operation, it can use IR in an inconsistent operation mode that only guarantees the recovery of only the fact that an operation occurred for up to f failures in a system of 2f+1 replicas. Inconsistent operation mode does not allow to recover the value of the operation. In consensus operation mode, IR can preserve both the command and a result of such command for up to f failures. In this mode, the result of an operation is the result a majority of replicas report for such operation, if such majority exist. Consensus operation may fail if not enough replicas report the same result for the operation, in which case IR protocol retries. Consensus operations also need to have floor(3/2f)+1 replicas agree to be finalized. IR protocol will also retry the operation until it finalizes. It seems like only finalized consensus operation can be recovered with the results of the operation, and not all consensus operation can be finalized since they can timeout after getting the majority but before reaching the consensus on floor(3/2f)+1 replicas.

This notion of retrying the operations until they eventually succeed makes me question whether it is a good solution especially under various kinds of loads.  If the system deals with mostly independent requests it may not be difficult to reach consensus in consensus operation mode, but if there are lot of contention between requests, the system may just be stuck retrying all the operations without doing much of a useful work.

I am not going mention IR recovery mechanism described in the paper, but it is worth noting that during the recovery the entire system blocks and stops responding to new requests. The process is initialized by a recovering replica communicating to all other nodes, and once each node learns that some other replica is recovering from a failure it stops processing the requests until the protocol is finished and normal operation can resume.