# Reading Group. Prescient Data Partitioning and Migration for Deterministic Database Systems

In the 75th reading group session, we discussed the transaction locality and dynamic data partitioning through the eyes of a recent OSDI’21 paper – “Don’t Look Back, Look into the Future: Prescient Data Partitioning and Migration for Deterministic Database Systems.”

This interesting paper solves the transaction locality problem in distributed, sharded deterministic databases. The deterministic databases pre-order transactions so that they can execute without as much coordination after the pre-ordering step. Now, this does not mean that every transaction can run locally. Indeed, if a transaction needs to touch data across multiple partitions or shards, then there is a need to reach out across partitions, which introduces delays. A natural way to solve the problem is to migrate the data based on its access patterns to minimize the number of such cross-partition transactions.

Such dynamic data migration is pretty much the solution in Hermes, the system presented in the paper. The thing is that the “group the data together” approach is not new. The authors cite several papers, such as Clay that do very similar stuff. Now, the biggest difference between Hermes and other approaches is the decision-making process involved in figuring out how and where the data needs to go. Traditionally, dynamic re-partitioning solutions rely on the historical data from the workload. These approaches work great when the workload is decently stable. However, workloads that abruptly change their access patterns present a problem — the system based on historical observation of access locality is reactive and needs time to adjust to the new workload. So, naturally, to tolerate the rapidly changing workload locality characteristics, we need a proactive system that can predict these locality changes and make necessary changes ahead of time. It would be nice to have an oracle that can see into the future. Well, Hermes kind of does this. See, the system requires a batched database, and before executing each batch, it can look at the access patterns within the batch and adjust accordingly.

Hermes looks at the patterns of transactions to be executed shortly, figures out how many transactions will require cross-partition coordination, and then makes data movements to minimize that number. For example, if some transaction type accesses two objects frequently together, Hermes will move one of these objects, incurring one cross-partition data transfer. If this happens, the transaction itself becomes partition-local, incurring no cross-partition data transfers.

In addition to moving and repartitioning the data around, Hermes also moves the transactions. The movement of compute tasks, again, is not new and makes a ton of sense. If a transaction originating in partition P1 needs some objects A & B located in some partition P2, then it makes a lot of sense to move a relatively small transaction to where the data is.

Unfortunately, the combination of dynamic re-partitioning and transaction movements leads to some unintended consequences over the long run. Consider a batch where objects A & B were used frequently together. The system moves them to one partition for speed. In the next batch, we may have objects B & C used together, so C moves to the same partition as B, again for speed. Now we have colocated objects A, B, and C. If this continues for very long, the system will consolidate more and more data in one place. This consolidation is not ideal for load balancing, so Hermes has to account for this and prevent data from gradually drifting closer together. It does so by “de-optimizing” for locality and allowing more distributed transactions in exchange for a more even load in each batch.

As far as performance, Hermes delivers when it comes to the workloads with frequent changes in locality:

As always, the presentation video from the reading group:

## Discussion

1) Wide Area Networks. Hermes takes advantage of locality in a data center setting. It groups objects used together to allow local transactions as opposed to distributed ones. However, it may not work in a geo-distributed environment. The problem is that Hermes only solves one type of locality puzzle. It accounts for the grouping of objects used together in hopes that these objects will be used together again — spatial locality and used soon enough — temporal locality. In fact, Hermes optimizes based on spatial criteria. This notion of locality works well in LAN, where the transaction can run in a partition with the data without a performance penalty. However, in WAN, moving transactions between partitions is costly. If a transaction originated in region R1, but data is in region R2, then moving the transaction from R1 to R2 may incur almost as much latency as moving data from R2 to R1. That is a big difference between LAN and WAN — in the WAN setting, transactions incur a significant latency penalty when processed in another region. In geo-distributed setting, locality means more than just grouping frequently used objects together to rip the benefit shortly. In addition to grouping data based on spatial principles, we need to preserve the geographical affinity and place data close to where it is accessed. In other words, in geo-replication, we do not only care about finding cliques of related data, but also placing these cliques in the best possible geography for a given workload. And needless to say, grouping objects and finding the best geographical location for the data often conflict with each other, making the problem significantly more nuanced and complicated.

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

# Paper Summary: Bolt-On Global Consistency for the Cloud

This paper appeared in SOCC 2018, but caught my Paxos attention only recently. The premise of the paper is to provide strong consistency in a heterogeneous storage system spanning multiple cloud providers and storage platforms. Going across cloud providers is challenging, since storage services at different clouds cannot directly talk to each other and replicate the data with strong consistency. The benefits of spanning multiple clouds, however, may worth the hustle, since a heterogeneous system will be both better protected from cloud provider outages, and provide better performance by placing the data closer to the users. The latter aspect is emphasized in the paper, and as seen in the figure, going multi-cloud can reduce latency by up to ~25%.

To solve the issue of consistent cross-cloud replication, authors propose to use Cloud Paxos (CPaxos), a Paxos variant designed to work with followers supporting a very minimal and common set of operations: get and conditional put. In CPaxos, clients act as prospers, and storage systems serve the role of the followers. The followers are not really “smart” in this protocol, and most of the Paxos logic shifts to the client-proposers (Figure 2).

The prepare phase in CPaxos simply gathers the state from the followers, making the proposer decide for itself whether the followers would have accepted it with the current ballot or not. If the proposer thinks it would have been accepted, it will try updating the followers’ state. Doing this, however, requires some precautions from the followers, since their state may have changed after the proposer made a decision to proceed. For that matter, CPaxos uses conditional put (or compare-and-set) operation, making the followers update their state only if it has not changed since it was read by the proposer. This ensures that at most one proposer can succeed with changing the state of the majority of followers.

I visualize this as a log to represent changes in some object’s state. The new version of an object corresponds to a new slot in the log, while each slot can be tried with different ballot by different proposers. The put operation succeeds at the follower only if the value at the slot and a ballot has not been written by some other proposer. In case a proposer does not get a majority of successful updates, it needs to start from the beginning: increase its ballot, perform a read and make a decision whether to proceed with state update. Upon reaching the majority acks on state update, the proposer sends a message to flip the commit bit to make sure each follower knows the global state of the operation.

This basic protocol has quite a few problems with performance. Latency is large, since at least 2 round-trips are required to reach consensus, since every proposer needs to run 2 phases (+ send a commit message). Additionally, increasing the number of proposers acting on the same objects will lead to the growth in conflict, requiring repeated restarts and further increasing latency. CPaxos mitigates these problems to a degree. For example, it tries to commit values on the fast path by avoiding the prepare phase entirely and starting an accept phase on what it believes will be next version of an object with ballot #0. If the proposer’s knowledge of the object’s state (version, ballot) is outdated, the conditional put will fail and the proposer will try again, but this time with full two phases to learn the correct state first. However, if the proposer is lucky, an update can go in just one round-trip. This optimization, of course, works only when an object is rarely updated concurrently by multiple proposers; otherwise dueling leaders become a problem not only for progress, but for safety as well, since two proposers may write different values for the same version using the same ballot. This creates a bit of a conundrum on when the value becomes safely anchored and won’t ever get lost.

Consider an example in which two proposers write different values: green and blue to the same version using ballot #0 (Figure 4 on the left). One of the proposers is able to write to the majority, before it becomes unresponsive. At the same time, one green follower crashes as well, leading to a situation with two followers having green value and two being blue (Figure 5 on the right). The remaining proposer has no knowledge of whether the green or blue value needs to be recovered (remember, they are both on the same ballot in the same slot/version!). To avoid this situation, CPaxos expands the fast path commit quorum from majority to a supermajority, namely $$\left \lceil{\frac{3f}{2}}\right \rceil +1$$ followers, where $$2f+1$$ is the total number of followers, and f is the tolerated number of follower failures, allowing the anchored/committed value to be in a majority of any majority of followers . Having this creates an interesting misbalance in fault tolerance: while CPaxos still tolerates $$f[\latex] node failures and can make progress by degrading to full 2 phases of the protocol, it can lose an uncommitted value even if it was accepted by the majority when up to [latex]f$$ followers fail.

Proposer conflicts are a big problem for CPaxos, so naturally the protocol tries to mitigate it. The approach taken here reduces the duration in which possible conflicts may occur. As CPaxos is deployed over many datacenters, the latencies between datacenters are not likely to be uniform. This means, that a prepare or accept messages from some proposer reach different datacenters at different times, creating an inconsistent state. When two proposers operate concurrently, they are more likely so observe this inconsistency: as both proposers quickly update their neighboring datacenters, they run the risk of not reaching the required supermajority due to the conflicting state (Figure 6(a)) created by some messages being not as quick to reach remaining datacenters. To avoid rejecting both proposers, CPaxos schedules sending messages in a way to deliver them to all datacenters at roughly the same time. This reduces the duration of inconsistent state, allowing to order some concurrent operations (Figure 6(b)).

Despite the above mitigation strategy, conflicts still affect CPaxos greatly. The authors are rather open about this, and show their system CRIC with CPaxos degrading quicker than Paxos and Fast Paxos as the conflict rate increases. However, in the low conflict scenario, which authors argue is more likely in real world applications, CRIC and CPaxos improve on performance compared to Paxos/Fast Paxos, especially for reading the data. This is because reads in CPaxos are carried out in one round-trip-time (RTT) by client-proposer contacting all followers and waiting for at least a majority of them to reply. If the client sees the latest version with a commit flag set in the majority, it can return the data. Otherwise, it will wait to hear from more followers and use their logs to determine the safe value to return. In some rare cases when the proposer cannot determine the latest safe value, it will perform the recovery by running the write path of CPaxos with the value to recover (highest ballot value or highest frequency value if more than one value share the ballot).

Some Thoughts

• The motivation of the paper was to make strongly consistent system spanning multiple clouds providers and storage systems for the benefit of improved latency though leveraging the location of datacenters of these different providers. However, CRIC and CPaxos protocol requires a lot of communication, even on the read path. During reads, a client-proposer contacts all CPaxos nodes, located at all datacenters, and in best case still needs the majority replies. As such the latency benefit here comes from trying to get not just one node closer to the client, but a majority of nodes. This may be difficult to achieve in large systems spanning many datacenters. I think sharding the system and placing it on subset of nodes based on access locality can benefit here greatly. For instance, Facebook’s Akkio paper claims to have significant reduction in traffic and storage by having fewer replicas and making data follow access patterns. In our recent paper, we have also illustrated a few very simple data migration policies and possible latency improvement from implementing these policies.
• One RTT reads in “happy path” can be implemented on top of regular MutliPaxos without contacting all nodes in the systems. Reading from the majority of followers is good enough for this most of the time, while in rare circumstances the reader may need to retry the read from any one node. More on this will be in our upcoming HotStorage ’19 paper.
• The optimization to delay message sending in order to deliver messages at roughly the same time to all nodes can help with conflict reduction in other protocols that suffer from this problem. EPaxos comes to mind right away, as it is affected by the “dueling leaders” problem as well. Actually, CPaxos and EPaxos are rather similar. Both assume low conflict rate to have single round trip “happy path” writes and reads. When the assumption breaks, and there is a conflict, both switch to two phases. EPaxos is better here in a sense that the first opportunistic phase is not totally wasted and can be used as phase-1 in the two phase mode, whereas CPaxos has to start all the way from the beginning due to the API limitation on the follower side.