Reading Group. Toward a Generic Fault Tolerance Technique for Partial Network Partitioning

Short Summary

We have resumed the distributed systems reading group after a short holiday break. Yesterday we discussed the “Toward a Generic Fault Tolerance Technique for Partial Network Partitioning” paper from OSDI 2020. The paper studies a particular type of network partitioning – partial network partitioning. Normally, we expect that every node can reach every other node; under the network partitions, we often assume that a node or a group of nodes are “disconnected” from the rest of the cluster and cannot be reached. Partial partitioning assumes that some communication remains, and some nodes outside the partition can still communicate with the partitioned servers. The paper studies several public bug reports from popular systems and finds quite a few issues attributed to the partial partitioning problem. A simple example of the issue is a system having one leader and multiple follower nodes. If the leader is partitioned away from the followers, the progress may stop. At the same time, if the leader can still communicate with a configuration manager (ZooKeeper) responsible for picking the next leader, that communication manager won’t see a problem and not elect a new leader, causing the entire system to stall. The paper has many insights about the partial partition failures, and it is worth taking a look at the paper itself for these details.  Aside from extensive exploration of the problem, the paper proposes a simple solution to mask such partial partitions, called Nifty. Nifty construct a connectivity graph and uses it to relay/reroute messages between nodes when the direct communication link between nodes fails.

Video presentation

Discussion

We had a pretty lively discussion of this paper.

1) Monitoring. One of the issues identified in the paper is that such failures caused by partial partitions often go silent. However, the proposed solution does not seem to include the monitoring/notification service itself. We speculate that it should be trivial to add it to Nifty.

2) System Design. Another important find from the paper is that many of the bugs were due to design problems. We spent considerable time discussing how better design with model checking can help avoid such issues. One of the issues, of course, is that model checking is only as good as the assumptions you put in. So if the design was checked, but the designer did not take into account, for example, message loss, it may not have caught the problem. The paper raises awareness for the problem, so hopefully, the next generation of systems can be model-checked with partial network partitions in mind.

3) Nifty performance. We had two major questions with regards to performance. The first one is whether Nifty can scale to large clusters, given its all-to-all communication requirement for constructing the connectivity graph. This all-to-all communication gives a quadratic complexity, but we also think that collecting this connectivity information can be done relatively infrequently, given that the application can tolerate some message delay/loss. The second question is whether Nifty can handle large messages. Data-driven systems, like databases, may transmit a lot of data between nodes, and having all these data flow through some “relay” nodes may put more stress on such. Also, both the topology of connected nodes and the state of the application matter.  If two partial partitions are connected by just one server, it will put more stress on that one machine. Similarly, if the system is in a state that requires more data transfer, such as building a new replica, it may cause a lot of traffic/data to flow through Nifty, putting more stress on the “relay” nodes.

4) Other types of network partitions. The paper talks about (mainly) partial symmetrical partitions. Another interesting type of network partition is an asymmetrical or one-way partition, when messages flow in one direction but not the other. However, one can argue that this is still a partial partition, and Nifty can handle such a case with its connectivity graph approach, assuming the graph is directed.

5) Failure Masking. Nifty masks failures, which is great when we need the applications to continue working. However, what happens when/if Nifty fails? Will it create an even bigger/disastrous failure? Another question is whether the partial partition may exist as a transient state towards a full network partition. In such a case, Nifty only gives a little bit of extra time to react to the issue as it emerges and cannot be the only solution to dealing with the problem (hence discussion point #1 and the need for monitoring).

6) We also noted some similarities with the recent Cloudflare issue.

Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Near-Optimal Latency Versus Cost Tradeoffs in Geo-Distributed Storage

Short Summary

pando_delegatesYesterday we discussed Pando, a geo-replication system achieving near-optimal latency-cost tradeoff in storage systems. Pando uses large Flexible Paxos deployments and erasure coding to do its magic. Pando relies on having many storage sites to locate sites closer to users. It then uses Flexible Paxos to optimize read and write quorums to have the best latency for a particular workload. Finally, erasure reduces the cost associated with having many storage sites. There are a few caveats to the paper. First is the slowness of the (Flexible) Paxos protocol, as it requires two phases to complete an operation. Pando addresses this issue with a cool use of delegate nodes — one of the nodes from phase-1 quorums acts as a delegate on behalf of the client-proposer to collect phase-1 acks and start phase-2. This can save significant latency, getting the full two-rounds in geo deployments closer to the one round RTT of a regular proposer-driven Paxos. The second problem is the quorum overlaps when using erasure coding. If the data is split into parts, we need to ensure the k-nodes overlap between the read and write quorums, making erasure coding detrimental to achieving low latency. This k-nodes overlap is needed to restore the latest version of the data, as having fewer than k-nodes with the latest version is not enough to reconstruct the data. Pando, however, first tries to read from a smaller quorum that guarantees just one node overlap and hopes that there will be enough nodes that have received the update by the time a read is issued. If there are no k nodes to reconstruct the data, Pando continues waiting for more nodes to respond to get to that large quorum with k-nodes overlap. 

pando-perf

The presentation video:

Discussion

1) ConfigManager. ConfigManager is a component of Pando that computes desirable quorums and delegates for each key and key’s access pattern. We felt like this ConfigManager is very important for the performance of Pando, and it was rather under-specified in the paper. For example, we were wondering if it is possible to make the ConfigManager dynamic and make it learn the workload paters and adjust as needed. 

2) Read consistency. Pando uses quorum reads while using Paxos for writes. Naively reading from a quorum is dangerous and can lead to linearizability problems, yet the reads are not discussed in detail in the paper. The paper briefly in one sentence mentions that to enable quorum reads a global acknowledgment is needed to make sure all nodes know a value has been committed. But Pando does not explain how such commitment works and does not study the performance consequences of this. For example, if a value has been accepted by all nodes, but the commit message is still in flight, does the client read the previous value, or does it retry? If the read quorum has some nodes that have received the commit messages, and some that do not, is it safe to read a new value, or a retry is needed? We do not think it is safe to read, and the paper seems to mention that in passing as well. There are quite a few more questions around the reads and their safety here, so we wished the paper addressed the reads in greater detail. 

3) Rare conflicts. One of the motifs in the paper is that conflicts are rare in scenarios Pando is designed for. This influences a lot of design, as most of the systems are written to work well in the conflict-free case, and somehow address conflict case. For example, when two clients try to write on the same key/document, that will cause a conflict, and Paxos is not very good at handling it due to the dueling-leader problem. Same for the read problem above in (2), as we feel like conflicts will cause the retries, significantly impacting the performance. What is interesting is that despite largely conflict-free evaluation, Pando did better than Fast Paxos in latency, a protocol designed for 1 RTT consensus in conflict-free cases. We feel like the reason is that Pando can get its writes close to 1 RTT with the clever delegates trick, while it can keep the read quorums small to optimize for latency, while Fast Paxos approach is stuck with larger quorums for reads. 

4) Quorum Size. We have spent quite a bit of time discussing quorum sizes. We were confused a bit about the quorum descriptions from the paper, but in the end, we were able to figure out our misunderstanding.

5) Comparison with Atlas. Pando extensively compares with EPaxos. Atlas is a newer, EPaxos-influenced protocol, so we were curious about how the two would compare. 

6) Scalability.  Reducing latency is very important for WAN setups, however, we felt that the paper is not addressing the throughput question. Having too many nodes may have an impact on scalability/throughput. Since Pando uses the clients as proposers, the grunt of dealing with too many storage replicas goes to the clients, and scalability should not be impacted that much. However, the delegates may become the bottlenecks for some workloads.

7) Delegates for throughput. Pando uses delegates to improve latency. We noticed that a somewhat similar idea is used in Compartmentalized Paxos and PigPaxos to improve the throughput instead.

Next Meeting

As always, next week we will continue looking at more great distributed systems papers. We are starting with the OSDI 2020 bunch of papers, and in our next meeting, we will discuss “Fault-tolerant and transactional stateful serverless workflows.” Please join our slack to participate in the discussions and Zoom meetings.

Reading Group. Autoscaling Tiered Cloud Storage in Anna.

This week we looked at “Autoscaling Tiered Cloud Storage in Anna.” This is the second Anna paper. The first one introduces Anna Key-Value store, and the second paper talks about various “cloud-native” improvements. The presentation by Michael Whittaker is available here:

Short Summary

The Anna Architecture

Anna is an eventual-consistent key-value data store, where each value is a lattice. Anna can tweak some consistency guarantees by changing the type of lattice stored in the value. For example, it can be a Last Write Wins (LWW) system that merges operations based on the definition of the last writer (i.e. a timestamp), or it can use some CRDT lattice to preserve all updates/operations. The weak consistency of Anna means that it is relatively easy to add and remove nodes to scale the systems. The Autoscaling part takes advantage of that and dynamically adjusts the number of replicas responsible for each key to match the workload. Moreover, Autoscaled Anna has multiple usage tiers for keys based on the usage frequency. The paper discusses “hot” and “cold” keys, but more tiers are possible. Such tiering allows using heterogeneous hardware to better balance the cost and performance of Anna, as “cold” keys may run on cheaper hardware backed by SSDs, while “hot” keys may be store in RAM on more expensive VMs. Anna has rules for data transitioning between tiers and for controlling the degree of replication. For example, there is a minimum replication factor for durability, and there are usage thresholds for promoting and demoting keys to/from hot/cold tiers.

Discussion

We have spent quite some time talking about this paper, as there is a lot of information to digest. Below are a few of the bigger discussion points

1) Apache Cassandra comparison. This came up a few times as lacking, as the system is somewhat similar to Cassandra, at least when you consider the LWW lattice. There are many differences too, especially in the context of autoscaling. Cassandra uses a statically configured replication factor and lacks any autoscaling. Cassandra’s DHT model is pretty rigid and resists scaling – adding nodes require considerable ring rebalancing. Also, only one node can be added/removed from the ring at once to avoid membership and ring partitioning issues. Anna has no such problem and can adjust quicker and more dynamic. Additionally, for all the fairness, the original Anna paper compares against Cassandra.

2) Breaking Anna. Part of the discussion was around breaking the autoscaling and putting Anna in a bad/less-performant state. One example would be using a workload that changes at about the same rate as Anna can adjust. For example, if we hit some keys above the threshold to promote from the “cold” tier to “hot”, we can cause a lot of data movement as these keys potentially migrate or added to the memory-tier hardware. At the same time, we can reduce the usage, causing the systems to demote these keys, and experience associated overheads. This is like putting a workload in resonance with Anna’s elastic abilities, and it reminds me of a famous Tacoma Narrows bridge collapse

3) Possible consistency guarantees. One of the benefits of Anna is that some guarantees can be tweaked based on the lattice used for the value. Part of the discussion focused on how much consistency we can get purely out of these lattices, and how much may necessitate more drastic architectural changes. It is pretty straightforward to understand that Anna won’t ever provide strong consistency, but what about some other guarantees? For example, the original Anna paper claims the ability to have causal consistency, although it uses more complicated lattices, and keeps vector clocks for maintaining causality, not only making the system more complicated but also potentially harming the performance (something noted by the authors themselves).

4) Comparison fairness. Some points were mentioned about the evaluation fairness. It appears that some systems compared against are significantly more capable and/or may provide better guarantees(DynamoDB, Redis). It is worth noting, however, that there are not that many, if any, direct competitors, so these comparisons are still very valuable.

5) Cloud-Native. This appears as one the “most” cloud-native database papers that talks about some practical aspects of designing and running a  cloud-native database. One aspect of this “cloud-nativeness” is the focus on elasticity and scalability happening automatically. Another one is the cost-performance consideration in how the system scales with regard to workload changes. We noted some other much older papers that fit our definition of “cloud-native” papers. One example was Yahoo’s PNUTS paper that describes a global system with some capability to adjust to workloads and selective replication.

6) Scheduling/resource allocation. One of the challenging aspects of  Anna with autoscaling is figuring out the best key “packing” or binning” given multiple tiers and pricing constraints. This is similar to a Knapsack problem and bin packing problem, and both are NP-complete. So this resource management/scheduling will become a big problem at the production scale when we may have thousands of machines. At the same time, task packing problems at the datacenter level are well studied, for example in Google’s Borg cluster management system or its derivative Kubernetes

 

Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group Paper List. Papers ##37-50

This week we are on the 35th paper in our reading group. We will be discussing “Autoscaling Tiered Cloud Storage in Anna” from VLDB 2019. It is exciting that the reading group has managed to go through this many papers since we have started in April! 

Today I and Murat have sat down and picked the next few papers from the OSDI 2020 to discuss. Originally we planned to select just 10 OSDI papers for the next batch, but all the papers are so interesting, and it was so hard to narrow the list down. After some debating, we settled on 14 papers instead. Below is the list in the order the papers will appear in the reading group:

  1. Fault-tolerant and transactional stateful serverless workflows – December 23rd
  2. Aragog: Scalable Runtime Verification of Shardable Networked Systems – December 30th
  3. Toward a Generic Fault Tolerance Technique for Partial Network Partitioning – January 6th
  4. Microsecond Consensus for Microsecond Applications – January 13th
  5. Virtual Consensus in Delos – January 20th
  6. A large scale analysis of hundreds of in-memory cache clusters at Twitter – January 27th
  7. Cobra: Making Transactional Key-Value Stores Verifiably Serializable – February 3rd
  8. hXDP: Efficient Software Packet Processing on FPGA NICs – February 10th
  9. Performance-Optimal Read-Only Transactions – February 17th
  10. Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories – February 24th
  11. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook – March 3rd
  12. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads – March 10th
  13. Sundial: Fault-tolerant Clock Synchronization for Datacenters – March 17th
  14. Protean: VM Allocation Service at Scale – March 24th

This list will start on December 23rd, and will run us all the way to paper #50! Please join the reading group on slack for schedule, discussion, and Zoom links. The reading group Zoom is every Wednesday at 3:30 pm EST.

Reading Group. Compositional Programming and Testing of Dynamic Distributed Systems

We have resumed the reading group after one week of Thanksgiving break. On Wednesday, we have discussed “Compositional Programming and Testing of DynamicDistributed Systems.” This paper is on the edge between programming languages, distributed systems, and some formal methods/verification. The premise of the paper is to decompose large monolithic distributed programs into smaller pieces and test each piece separately using the abstracts of all other components. By using the abstract components for testing, we can reduce the number of traces (i.e. state-tree size) and make the testing feasible. The paper takes advantage and adopts the assume-guarantee (AG) theory to make such piece-wise testing/verification possible and correct.

Discussion Points

We are not formal methods and/or programming languages experts, so our discussion may have been a bit shallow. However, we tried to approach it from multiple directions.
1) TLA+ allows writing specs through refinement. Can something like this be adopted or used in TLA for checking large specs? Interestingly enough, despite having quite a few users of TLA, none of us have tried the refinement approach seriously. Our consensus in this was that this sounds interesting and plausible.

2) The paper relied on the P language, which is an event-driven language. Part of the discussion was on the pros/cons of event-driven programming vs using threads. We eventually drifted off to talk about green-threads/goroutines of Go, and then to coroutines. All-in-all, there are many ways to approach concurrent programming, and, as evidenced by different opinions in the discussion, it appears that we (as a computing community) are still far from agreeing/coming up with the best or most suited way of dealing with concurrency.

3) We also talked about other languages geared towards distributed systems. We have mentioned DistAlgo as one. It is an educational language/compiler that produces runnable python code. DistAlgo, however, has a different purpose, as its aim is the ease of use and clarity of expressing distributed programs, and not testing/verification.

4) We spent quite a bit of time discussing the assume-guarantee (AG) theory and the need to do “cross-testing” of components where every specific implementation must be tested with abstracts of other components. The original discussion question was why it is not enough to simply show the refinement (i.e. specific refines the abstract and satisfies all the properties). In short, it is all about safety and making sure the specifics not only refine their abstracts but do not introduce other unwanted behaviors when interacting with other components.

Reading Group. Aria: A Fast and Practical Deterministic OLTP Database.

In our 33rd reading group meeting, we discussed “Aria: A Fast and Practical Deterministic OLTP Database.” by Yi Lu, Xiangyao Yu, Lei Cao, Samuel Madden. We had a very nice presentation by Alex Miller:

Quick Summary

Aria is a transaction protocol, heavily influenced by Calvin, and it largely adopts Calvin’s transaction model, with one big difference. In Calvin, read and write sets of a transaction must be known beforehand, but Aria has no such strict requirement, allowing for generally more flexible transactions. Aria’s main goal is parallelizing transactions as much as possible to maximize the throughput. To that extent, Aria adopts batching and processes multiple transactions in each batch concurrently.

ariaEach batch operates in two phases: execution and commit. All transactions in a batch start the execution phase from the same snapshot (the result of committing and applying the previous batch). In the execution phase, each transaction concurrently executes and produces the new state, which is stored temporarily. Once all transactions in a batch have finished executing, the protocol moves into the commit phase, where Aria aborts any transaction that has a Write-after-Write(WAW) conflict or Read-after-Write(RAW) conflict. Aria uses unique and sortable transaction IDs to determine the order of transactions within the batch to find these conflicts. Any aborted transaction goes to the future batch, while the remaining successful transactions commit and apply their temporary execution results. Once the commit has finished, a new batch can start, and the protocol keeps moving in these lock-steps, synchronizing after each step. Every replica can run the protocol without synchronizing/communicating with other replicas unless the system is sharded/partitioned and one replica need to read data from another shard. An important optimization of Aria is deterministic reordering, where transactions within a batch can be reordered (i.e. not follow the order of their TIDs) to reduce the number of aborts, and consequently, reduce the amount of wasted work in the execution phase.

Discussion Points

We had good participation and touched on quite a few things in the discussion, but to the credit of the paper, we were able to find lots of answers there just by reading more carefully.

1) Transaction Serializability. Paper claims serializability, but it applies a bunch of transactions to the same snapshot. How does it compare to snapshot isolation? and what is the difference that allows serializability?

We think the batch processing and addition of RAW conflict within the batch makes a difference. Snapshot Isolation checks for WAW conflicts only and allows some artifacts. By also disallowing RAW conflicts, we can eliminate the problem. However, there are nuances with how a read set can be defined for conflict resolution. For example, What is a read set in a transaction running UPDATE items SET x=0 WHERE x=1? It can be either all items, or only items with x=1 if some indexing is used, and this difference may result in serializability issues. For the paper’s defense, it does not explicitly consider a SQL model, and also mentions that things like the above are up to the users to decide, and a fall-back to Calvin-like approach.

2) Batch size. This is a batched protocol, so the batch size may play a role in the performance. Batches that are too small will have more frequent barriers between batches and phases. Batches that are too large have a higher chance of having a super long transaction that stalls the entire batch, impacting the performance. The paper mentions that the transactions should take about the same time to execute for best performance to avoid the case when one slow transaction stalls the entire batch.

3) Paper readability. The overall consensus in the group was that this was one of the easier transaction appears to read.

4) Comparison with other transaction systems. SLOG paper we discussed a few months ago also uses determinism on each node to run transactions. Unlike Aria though, SLOG uses determinism for execution order, while Aria executes unordered (from the same state) and deterministically aborts the conflicting transactions. Overall it seems that Aria’s use of determinism is more extensive – the same snapshot, transactions are deterministic themselves, deterministic conflict search and abort, etc.

There was also a mention of CockroachDB’s transactions, since they also use snapshot, and produce temporary results, but Cockroach is more optimistic than deterministic. Also, Cockroach is more interested in low-latency transactions, while Aria is all about high throughput even at the expense of latency.

5) Deterministic reordering. The paper mentions that the reordering is a best-effort algorithm and not the most optimal one. It also seems that reordering plays a big role in having high throughput, so can we squeeze more performance with a better reordering algorithm? Obviously, it is not efficient to brute-force all possible permutations, but maybe some better heuristic approach?

aria-reorder

6) Performance variance under conflict. Aria works best for workloads with not a lot of conflict between transactions. Reordering helps, but not in all cases, as evidenced in the evaluation.

 

Join our reading group on slack for more discussions, paper schedule, and zoom participation.

Reading Group. High availability in cheap distributed key value storage

Our recent paper was “High availability in cheap distributed key value storage”. And what a paper that was! It was definitely a mind-tingling read the lead to a very interesting and long discussion session with the group.

Short Summary

The paper addresses the problem of fast recovery from the leader (primary) crashes in key-value stores backed by the non-volatile main memory (NVMM). NVMMs provide good latency but have much lower throughput than the DRAM. They are also a lot cheaper than DRAM for the same capacity, making NVMMs an interesting low-latency compromise between in-memory datastores and SSD-backed stores. NVMMs are, however, much more expensive than SSDs. In CANDStore, a primary replica is supported by the NVMM for fast latency, but the follower, or back-up, is backed by an SSD to save on the hardware. An additional “witness” replica exists. Witness, like the primary, is supported by NVMM, but it does not store actual data, and only has the operation log with placeholders for data. This is done, again, for cost reasons. The replication between a primary and the backup is done with a modified Raft algorithm. In a happy case, witness replica is a non-voting member of the cluster, much like in Cheap Paxos. As such, CANDStore is supported by heterogeneous hardware, with specific roles attached to specific hardware.

Upon the failure of a primary node, the witness is promoted to the primary role. Since the witness has no data, it has to learn it from the backup, but it does so smartly: first, learn the log with placeholders, once it knows all the placeholder, it can start serving new write requests that overwrite the key’s state. Knowing the position of the operation in the log is not enough to serve reads and a new primary needs to copy the data from the backup. It does so smartly and copies the “hot” keys first, followed by less frequently used keys. This allows primary to start serving hot reads very soon after gaining the ability to serve writes, and before the full copy of the data has finished. The paper claims this approach yields 4.5-10.5 times recovery speedup compared to the offline rebuild of a replica from an SSD. Backup failure is a less-discussed issue in the paper, but here the authors take the Cheap Paxos approach and make the witness vote on the quorum to select a new configuration.

Video Presentation

Discussion

The discussion was rather heated, as the paper has many points to digest.
1) Homogeneous vs. heterogeneous deployment. The biggest question is why do we need to restore a primary from a backup by doing a full copy? The backup node in a 3-replica configuration studied in the paper(1 primary, 1 backup, 1 witness) already has the latest state, so it makes sense to promote it to the leader/primary node instead of doing a full replica “build” from the witness state. This would undoubtedly result in a much faster primary recovery (while “build-out” of a new backup can happen in the background). Of course, this works only if the backup and primary are homogeneous, which is not the case here. So, this raises the question of why having such a heterogeneous system in the first place? Our best guess was the cost. Having an SSD-backed replica is cheaper than NVMM one.

2) Witness log. In the paper, a witness maintains a log, keeping track of all the keys and essentially having an order of updates. This is a placeholder log that has no values. In the process of recovery, the witness learns of the missing log entries, i.e. tail of the log that a primary and backup may have, but has not reached the witness. So if this learning phase exists, why having a log at all? just make a witness learn the log (or relevant suffix of the log) when recovery is needed. In the discussion, we think the witness has a log to speed-up recovery when it is needed.

3) Performance comparison of DRAM/NVMM/SSD. The paper’s heterogeneous setup leverage the fact that these types of storage have different performance and price. DRAM is the fastest but also the most expensive, NVMM presents a latency compromise, while SSD is the price compromise. However, if the performance of NVMM is close to that of SSDs, there may be fewer incentives for such design. We looked at a few sources. And it appears that NVMM has good latency, but its throughput is somewhat on par with SSDs

4) The performance of the primary. Another question we had is why NVMM is fast enough for primary, but not for backup? This loops back to (3), and it seems like NVMM would do ok for a backup, given its similar throughput to an SSD. However, we speculate that (a) the NVMM performance may change less favorably depending on the write block size, and (b) again the cost and not the performance is the reason for having an SSD backup.

5) One witness for multiple partitions. The paper evaluates a single-partition 3 node system. However, in such a setup, we have a witness that must run on hardware capable of supporting a primary. This does not lead to cost-savings, as primary hardware is more expensive. We speculate that in a sharded system the cost of running a witness can be amortized by having single witness hardware shared across many partitions. When one partition needs a witness promoted, it takes over the resources, and forces other partition to get new witness nodes. This of course has several complications. Firstly, a correlated failure between partitions may not be tolerated well, when two partitions need a new primary, but they share a witness. Secondly, managing such a system become more challenging, since a witness promotion leads to the cleanup of other partition-witness data, finding new witness nodes, etc

6) Hot/cold benefits. Does hot/cold key separation benefit anything except recovery? The backup cannot benefit from this, it (normally?) serves no user work, so there is not much difference. It maintains hot/cold keys to make a copy upon the recovery prioritized to reduce performance degradation. Primary nodes, on the other hand, may benefit. Keep the “hot” keys in DRAM, and “cold” ones in the NVMM storage.

7) Backup failures. This aspect is not discussed in much details in the paper, except mentioning that the Cheap Paxos approach is used to reconfigure and bring up a new backup. One thing that caught the group’s attention is that a backup recovery puts a lot more work on the primary – it has to copy data to the backup, while still serving the operations (given that the backup recovery is done on-line).

8) More backup nodes. So, in relation to (7), what if we have more backup nodes, lets say 2 nodes instead of 1? This means that one backup should be available to restore the failed one. But what does this mean for write quorums? Do we still need all backup in the quorum? or just one out of 2 backups is enough? If we require all backups for the quorum, we also need two witnesses to tolerate two failures and be able to reconfigure with Cheap Paxos. If we require just one backup out of two, do we need a witness (aside from promoting it to the primary in case of a primary crash). Also, how does such two backups setup compare in terms of cost saving?

9 and beyond) We had way more discussion points as well. Can this heterogeneous setup benefit other copy-have workloads for databases, like elasticity tasks? Can heterogeneous deployments be used with regular Multi-Paxos/Raft? Why using modified Raft if most of the things that differ Raft from Paxos were undone? You can read more about these in our slack discussion group.

Reading Group. RMWPaxos: Fault-Tolerant In-Place Consensus Sequences

Quick Summary

In the last reading group discussion, we talked about RMWPaxos. The paper argues that under some circumstances, log-based replication schemes and replicated state machines (RSMs), like Multi-Paxos, are a waste of resources. For example, when the state is small, it may be more efficient to just manage the state directly instead of managing a log of state-mutating operations. To that order, RMWPaxos foregoes the log and instead implements a state machine as a single atomic register. The basic protocol implements a write-once distributed register and resembles Paxos – it has two phases to elect a leader and prepare the value. There are some differences too. For example, the leader’s ballot is given by acceptors in phase-1, essentially meaning that a proposer always gets the majority of replies, as long as the majority of acceptors are up. The protocol then makes use of consistent quorums (see image below), and if the entire majority quorum has the same round number, the leader can proceed to phase-2 and play out regular Paxos (recover value if needed, or write a new one if not). However, if the phase-1 quorum is inconsistent, meaning that some acceptors have returned different round numbers, the protocol default back to running a regular Paxos Phase-1 by picking a round number higher than any seen in the quorum. The basic protocol is augmented to allow a register to be rewritable (after all, if all we care about is a write-once register, we can simply use basic synod Paxos). Anther augmentation deals with preventing the same operation to be double-written in the dueling leader scenarios. Here authors make some strong assumptions about the network and expect ordered point-to-point message delivery.

rmwq

 

Discussion

(1) We have spent quite some time discussing the basic write-once protocol, as it seems more complicated than Paxos (and default to Paxos under inconsistent phase-1 quorum). It appears that when there is a need to recover some partly written value, a new proposer does so in 3 phases instead of two: phase-1, phase-1 classical Paxos, and phase-2

(2) On the usefulness of such protocol. The authors claim it is a big overhead to manage the log, but in our discussion, we did not fully buy that. The compelling point is that when the size of an operation is large, it becomes cheaper to manage the register than a log since the log will occupy a lot more space compared to the state machine. In this argument, an entire state machine fits in a register. The authors claim that there is no need to maintain separate learner nodes as each acceptor always has a copy of a register and that the entire RSM can be “learned” with just one read. The paper gives an example of a KV-store that can use different registers for different keys. One issue we discussed was that the keys are managed by totally separate state machines, meaning that having complex operations that touch multiple keys require a distributed transaction protocol.

(3) With regard to related works, our consensus was that there are some similar protocols proposed earlier that are not discussed or compared against. One example is CASPaxos/gryadka. Another one is CPaxos(Bolt-On Global Consistency for the Cloud), and both of these have earlier precursors, such as Disk Paxos and Active Disk Paxos.

(4) More about assumptions in the basic protocol. We discussed the possibility of duplicate messages causing the repeat of phase-2 in the round since it is possible for a proposer to go into phase-2 with an acceptor-assigned round/ballot. In that case, a bizarre duplication of phase-1 acks, may cause an acceptor to also run phase-2 twice. Which is ok if the value at the proposer does not change. This scenario is a bit of a stretch, but it was fun looking for ways to possibly defeat the protocol.

(5) Evaluation shows good results, with RMWPaxos outperforming Multi-Paxos two-fold in some cases. This is impressive, but not unexpected, given that RMWPaxos used multiple instances for different keys vs one instance of Multi-Paxos for all keys. In the discussion, we believe that this is a better illustration of why sharding/partitioning/multi-leader approach helps with performance and not so much the pure benefit of having a register. The being said, RMWPaxos should be very good for such fine-parallelism in tasks that allow it.

rmwperf

Please join the reading group’s slack channel for more discussion, paper information, and presentation times.

Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions

Ocean Vista

On Wednesday we had a presentation and discussion of the Ocean Vista (OV) replication and distributed transaction protocol. OV works in the WANs, where each region has all data-partitions, and transactions can originate in any region. OV separates replication from transaction execution, by making replication conflict-free with a FastPaxos-inspired protocol. For the transaction execution, OV maintains the visibility watermarks, such that any transaction ongoing the replication is not yet visible, and all transactions below the watermark are visible. The protocol computes the watermark via a hierarchical gossiping protocol by taking the minimum watermark from each region. The regional watermark, in its turn, is a minimum of the server watermarks in the region, and a server watermark is an ongoing transaction with a minimal timestamp on that server. A few additional optimizations exist to allow reading from a single server, but these require an additional watermark to designate the full replication of a transaction.

Below you can find a presentation from our reading group by Balaji Arun

 

Discussion

Our discussion focused on a few points/questions:

(1) Why does OV protocol use the FastPaxos-like algorithm for replication? FastPaxos requires the use of larger “super quorums”, however, OV replication is not client-driven, and a server specifically picks a unique timestamp for the replication. In FastPaxos, multiple commands may be tried on the same instance (i.e. timestamp), requiring a larger quorum for recovery in phase-1 of Paxos with a smaller majority quorum. However, in OV we do not see such conflicts: the timestamps for transactions are unique and are assigned by one node that coordinates replication. The only possible conflict that we saw happening is when a transaction first tries to replicate the command and then issues an abort on the same timestamp. This arguably creates a write-write conflict on the same instance (timestamp), but we think it can be resolved by establishing fixed precedence to make aborts always win over the regular writes. With precedence order established, writing abort to a majority of nodes should be sufficient to make abort persistent and recoverable. We have not reached a definite conclusion on why OV uses larger fast quorums, so we may still be missing something in our understanding of the protocol.

(2) Another discussion point was concerning the evaluation and comparison. A recent SLOG paper solves a similar problem, so we were wondering how it may behave compared to the OV. The group’s conclusion was a definite “it depends.” From one perspective, OV is more decentralized, so it may be able to achieve higher throughput when there are many multi-partition transactions. SLOG is more centralized, having both a dedicated master per partition and a dedicated ordering layer for the transactions involving many partitions. Both systems, however, do not abort transactions in most of the cases (unless there are significant failures), so their performance may be close to each other. 

(3) Related to (2). Is comparison with TAPIR a fair one? Two protocols operate quite differently. We thought that it would have been nice to see a comparison with Calvin and SLOG.

(4) Performance for geo-sharded setup could suffer greatly. If we do not have all partitions/shards in every region, we will need to do a cross-region replication. In the protocol, the visibility watermark gets updated after the replication is complete, and having WAN replication will delay that process. Moreover, this may delay the visibility of other transactions that do not have geo-replication but have a timestamp after the geo-replicated transaction. 

One Page Summary. Gryff: Unifying Consensus and Shared Registers

This paper by Matthew Burke, Audrey Cheng, and Wyatt Lloyd appeared in NSDI 2020 and explores an interesting idea of a hybrid replication protocol. The premise is very simple – we can take one protocol that solves a part of the problem well, and marry it with another protocol that excels at the second half of the problem. This paper tackles replication in geo-distributed strongly consistent storage systems. The authors argue that consensus, when used in storage systems with predominantly read and write operations, is inefficient and causes high tail latency.

A system presented in the paper, called Gryff, takes advantage of predominantly read/write workloads in storage systems and exposes these two APIs via a multi-writer atomic storage ABD protocol.  ABD operates in two phases both for reads and writes. On writes, ABD’s coordinator retrieves the latest version of the register from all nodes and writes back with a version higher than it has seen. On reads, ABD’s coordinator again retrieves the register, writes the highest version back to the cluster to ensure future reads do not see any previous versions, and only then returns back to the client. The write-back stage, however, can be skipped if a quorum of nodes agrees on the same version/value of the register, allowing for single RTT reads in a happy case.

Unfortunately, ABD, while providing linearizability, is not capable of supporting more sophisticated APIs. Read-modify-write (RMW) is a common pattern in many storage systems to implement transaction-like conditional updates of data. To support RMW, Gryff resorts back to consensus and in particular to Egalitarian Paxos (EPaxos) protocol. Choice of EPaxos allows any node in the cluster to act as the coordinator, so it does not restrict writes to a single node like with many other protocols. The problem of this hybrid approach is then the ordering of operations completed with ABD protocol and RMW operations running under EPaxos. Since EPaxos side of Gryff works only with RMWs, it can only order these operations with respect to other RMW operations, but what we need is a linearizable ordering of RMWs with normal writes and/or reads. To keep the ordering, Gryff uses tuples of ABD’s logical timestamp, process ID and the RMW logical counter, called carstamps. Carstamps connect the ABD part of the system with EPaxos – only ABD can update ABD’s logical clock, and only EPaxos updates RMWs counter.

When we consider the interleaving of writes and RMWs, the write with higher ABD’s logical time supersedes any other write or RMW. This means that we actually do not need to order all RMWs with respect to each other, but only order RMWs that have the same base or ABD’s logical time. EPaxos was modified to allow such partial ordering of commands belonging to different bases, essentially making the protocols to have different dependency graphs for RMWs applied to different ABD states. Another change to EPaxos is the cluster-execute requirement, as the quorum of nodes need to apply the change before it can be returned to the client to make the change visible for subsequent ABD read operations.

gryff_1
So, how does Gryff do with regards to performance? Based on the author’s evaluation, it is doing very well in reducing the (tail) latency of reads. However, I have to point out that the comparison with Multi-Paxos was flawed, at least to some extent. The authors always consider running a full Paxos phase for reads, and do not consider the possibility of reading from a lease-protected leader, eliminating 1 RTT from Paxos read. This will make Paxos minimum latency to be smaller than Gryff’s, while also dramatically reducing the tail latency. Gryff also struggles with write performance, because writes always take 2 RTTs in the ABD algorithm. As far as scalability, authors admit that it cannot push as many requests per second as EPaxos even in its most favorable configuration with just 3 nodes.gryff_2

Can Paxos do better? We believe that our PQR optimization when applied in WAN will cut down most of the reads down to 1 quorum RTT, similar to Gryff. PQR, however, may still occasionally retry the reads if the size of a keyspace is small, however this problem also applies to Gryff when the cluster is larger than 3 nodes.

What about Casandra? Cassandra uses a protocol similar to ABD for its replication, and it also incorporates Paxos to perform compare-and-set transactions, which are one case of RMW operation, so in a sense Gryff appears to be very similar to what Cassandra has been doing for years.