We are continuing through the OSDI 2020 paper list in our reading group. This time we have discussed “Virtual Consensus in Delos,” a consensus paper (Delos is yet another greek island to continue the consensus naming tradition). Delos relies on the log abstraction to keep track of all commands/operations and their order. Traditionally, some consensus protocol, like Raft or MultiPaxos, would maintain such log. Delos, however, separates the log from consensus and exposes a virtual log to the higher application levels. The virtual log is composed of many underlying log fragments or Loglets, and each Loglet is supported by some replication protocol. The cool part about Loglets is that they do not need to share the same hardware, configuration, or even the same protocol. This allows Delos to seamlessly switch from one Loglet type or configuration to another to support the ongoing workload or scale the system. Just like in many other strongly consistent systems, there is Paxos in Delos. The virtual log configuration is supported by a Paxos-backed Metastore. This makes Paxos the source of consistency in the systems while keeping it away from the “data-path.” Since Delos can switch Loglets supporting the virtual log, each individual Loglet does not need to be as fault-tolerant as MultiPaxos or Raft. The NativeLoglet implements a stripped-down replication scheme the lacks leader election, and upon the leader failure, Delos can simply switch to a different Loglet. This switching process is a bit involved, so please read the paper for the details.
We had a very nice presentation of the paper, although it is a bit longer than our usual presentations:
After the presentation, we have spent another 40 minutes as a group discussing the paper. A few of the key discussion points:
1) Similarity with WormSpace paper. Delos appears similar to the WormSpace paper we have discussed in one of the first reading group sessions. This makes sense, as both papers include Mahesh Balakrishnan. WormSpace also proposes a virtual log kind of system, but it does not talk about changing protocols on the fly, instead, it focuses on allocating chunks of the log to different clients/applications/leaders. So in contrast to Delos that uses a Loglet for some undermined amount of time (i.e. until it fails), WormSpace allocates fixed chunks of a virtual log ahead of time. Both systems use single-shot Paxos as the source of consistency to allocate different portions of the virtual log, and both systems have similar problems to solve, like a failure of a party responsible for a virtual log segment.
2) Use case for mixing protocols. One of the motivations/threads in the paper is switching from one Loglet implementation to another. This use case largely motivates the ability to switch protocols “on the fly”, but it is a pretty rare use case. We were brainstorming other scenarios where such an ability is useful. One of the possibilities is adjusting to the workloads better, as some protocols may handle certain workloads better than the others. A similar idea is described in this paper, although it talks about switching between leader-full and leader-less protocols on the fly to provide better performance depending on the conflict-rate.
3) Loglet sealing. The mechanism for switching between Loglets is quite complicated. It involves a seal operation that tells a Loglet to stop accepting the operations, however, it is not very trivial in the NativeLoglet. The nuances come from the fact that an operation may still get into the log after the Loglet is sealed and before the Loglet tail is determined and written to the Metastore. The presentation describes this situation pretty nicely. The seal may be a misleading operation, as it does not fully prevent the data from being written into the log. One analogy we had is that sealing is akin to starting closing a valve on a pipe — as we are closing it, some water may still go through, and we need to make sure to collect all that water with the checkTail operation.
4) Availability delay when switching Loglets. Although not apparent from the figures/evaluation, we believe that there is a small delay in the system when it switches from one Loglet to another. This is related to point (3) above, as sealing is not an instantaneous operation, we need to use checkTail to figure out the last log position of the old LogLet to start the new one. As such, until this position is figured out and written to the Metastore, a new Loglet may not accept operations just yet (also, a new Loglet is not known until the Metastore updates anyway). On this subject, we also had a quite long discussion on trying to see if we can improve the seal-checkTail functionality to make the transition faster, but it seems that quite a bit of a challenge comes from the simplified notion of NativeLoglet and the need to seal it without writing the “seal” operation in the log itself.
Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!