Tag Archives: rsm

Reading Group. Log-structured Protocols in Delos

For the 87th DistSys paper, we looked at “Log-structured Protocols in Delos” by Mahesh Balakrishnan, Chen Shen, Ahmed Jafri, Suyog Mapara, David Geraghty, Jason Flinn Vidhya Venkat, Ivailo Nedelchev, Santosh Ghosh, Mihir Dharamshi, Jingming Liu, Filip Gruszczynski, Jun Li Rounak Tibrewal, Ali Zaveri, Rajeev Nagar, Ahmed Yossef, Francois Richard, Yee Jiun Song. The paper appeared at SOSP’21. This is a second Delos paper; the first one was “Virtual Consensus in Delos,” and we covered it in the 40th DistSys meeting.

The first Delos paper looked at building the replication/consensus layer to support strongly consistent use-cases on top. The system created a virtual log made from many log pieces or Loglets. Each Loglet is independent of other Loglets and may have a different configuration or even a different implementation. Needless to say, all these differences between Loglets are transparent to the applications relying on the virtual log. 

With the replication/consensus covered by virtual log and Loglets, this new paper focuses on creating modular architecture on top of the virtual log to support different replicated applications with different user requirements. In my mind, the idea of the Log-structured protocol expressed in the paper is all about using this universal log in the most modular and clean way possible. One can build a large application interacting with the log (reading and writing) without too many thoughts into the design and code reusability. After all, we have been building ad-hoc log-based systems all the time! On the Facebook scale, things are different — it is not ideal for every team/product to reinvent the wheel. Instead, having a smarter reusable architecture can go a long way in saving time and money to build better systems.

Anyway, imagine a system that largely communicates through the shared log. Such a system can create new log items or read the existing ones. In a sense, each log item is like a message delivered from one place to one or several other locations. With message transmission handled by the virtual log, the Delos application simply needs to handle encoding and decoding these “log-item-messages.”

Fortunately, we already have a good idea about encoding and decoding messages while providing services along the encoding/decoding path. Of course, I am thinking of common network stacks, such as TCP, or even more broadly, the OSI model. Delos operates in a very similar way, but also with a great focus on the reusability and composability of layers. When a client needs to write some data to the log, it can form its client-specific message and pass it down the stack. Lower layers can do some additional services, such as ensuring session guarantees or batching the messages. Each layer of the stack wraps the message it received with its own headers and information needed for decoding on the consumer side. Delos calls each layer in the stack an engine. The propagation down through layers continues until the message hits the lowest layer in the stack — the base engine. The job of the base engine is to interact with the log system.   

Similarly, when a system reads the message from the log, the log item travels up the stack through all of the same layers/engines, with each engine decoding it and ensuring the properties specific to that engine before passing it up. An imperfect real-world analogy for this process is sending a paper mail. First, you write the letter; this is the highest layer/engine close to the application. Then you put it in the envelope — now the letter is protected from others seeing it. Then goes the stamp — it is ready to be sent, then goes the mailbox — client batching, then post office — server-side batching, then transmission, and then the process starts to get undone from the bottom up. 

Of course, I oversimplified things a bit here, but such message encapsulation is a pretty straightforward abstraction to use. Delos uses it to implement custom Replicated State Machines (RSMs) with different functionality. Building these RSMs requires a bit more functionality than just pushing messages up and down the engines. Luckily, Delos provides a more extensive API with required features. For example, all engines have access to the shared local storage. Also, moving items up or down the stack is not done in a fire-and-forget manner, as responses can flow between engines to know when the call/request gets completed. Furthermore, it is possible to have more than one engine sitting at the same layer. For instance, one engine can be responsible for serializing the data and pushing it down, and another engine can receive the item from an engine below, deserialize and apply it to RSM. The figure illustrates these capabilities.  

This tiered modular approach allows Delos to reuse layers across applications or even compose some layers in different order. So when one application or use-case needs batching, that engineering team does not need to reinvent the batching from scratch. Instead, they can take an existing batching layer and add it to their stack in the appropriate location. The flexibility and reusability allowed Facebook engineers to implement two different control-plane databases with Delos. One datastore, called DelosTable, uses a table API, while another system, called Zelos, implements a ZooKeeper compatible service. 

I think I will stop my summary here. The paper goes into more detail about the history of the project and the rationale for making certain decisions. It also describes many practical aspects of using Delos. The main lesson I learned from this paper is about the overall modular layered design of large RSM-based systems. I think we all know the benefits but may often step aside from following a clean, modular design as the projects get bigger and pressure builds up to deliver faster. But then, what do I know about production-grade systems in academia? Nevertheless, I’d love to see a follow-up paper when more systems are built using Delos. 

As usual, we had our presentation. This time Micah Lerner delivered a concise but very nice explanation of the paper:


1) Architecture. This paper presents a clean and modular architecture. I do not think there is anything majorly new & novel here, so I view this paper more like an experience report on the benefits of good design at a large company. I think there is quite a bit of educational value in this paper. 

In the group, we also discussed the possibility of applying similar modular approaches to more traditional systems. For instance, we looked at MongoDB Raft in the group before. Nothing should preclude a similar design based on a Raft-provided log in a distributed database. In fact, similar benefits can be achieved — multiple client API, optional and configurable batching functionality, etc. That being said, for a system designed with one purpose, it is easy to start designing layers/modules that are more coupled/interconnected and dependent on each other. 

We had another similar observation in the group that mentioned a somewhat similarly designed internal application a while back, but again with a less clear separation between modules/layers.

2) Performance. A performance impact is a natural question to wonder about in such a layered/modular system. The paper spends a bit of time in the evaluation explaining and showing how layers and propagation between them add very little overheads. What is not clear is whether a less generic, more purpose-built solution could have been faster. This is a tough question to answer, as comparing different architectures is not trivial — sometimes it can be hard to tell whether the difference comes from design choices or implementation differences and nuances.

3) Read cost & Virtual Log. This part of the discussion goes back quite a bit to the first Delos paper. With a native Loglet, Delos assumes a quorum-based operation for Loglets, which may have less than ideal read performance. This is because NativeLoglet uses a sequencer to write, but requires on quorum read and waits for reads with the checkTail operation. So a client will read from the quorum, and assuming the Loglet is not sealed (i.e., closed for writes), the client must wait for its knowledge of the globalTail (i.e., globally committed slot) to catch up with the highest locally committed slot it observed earlier. This process is similar to a PQR read! Naturally, it may have higher read latency, which will largely depend on how fast the client’s knowledge of globally committed slot catches up. In the PQR paper, we also describe a few optimizations to cut down on latency, and I wonder if they can apply here. 

Moreover, a client does not need to perform an expensive operation for all reads — if a client is reading something in the past known to exist, it can use a cheaper readNext API, practically allowing a local read from its collocates LogServer. 

4) Engineering costs. This discussion stemmed from the performance discussion. While large companies care a lot about performance and efficiency (even a fraction of % of CPU usage means a lot of money!), the engineering costs matter a lot too. Not having to redo the same things for different products in different teams can translate into a lot of engineering savings! Not to mention this can allow engineers to focus on new features instead of re-writing the same things all over again. Another point is maintenance — cleaner and better-designed systems will likely be cheaper to maintain as well. 

5) Non-control plane applications? The paper talks about using Delos in two control-plane applications. These often have some more specific and unique requirements, such as zero external dependencies, and stronger consistency. The paper also mentions other possible control plane use cases, so it does not appear like Delos is done here. 

At the same time, we were wondering if/how/when Delos can be used outside of the control plane. For Facebook, there may not be too much need for strongly consistent replication for many of their user-facing apps. In fact, it seems like read-your-write consistency is enough for Facebook, so deploying Delos may not be needed. At the same time, user-facing apps can take on more dependencies and external dependencies, achieving some code reuse this way.

Another point made during the discussion is about making a more general and flexible replication framework that can support strongly-consistent cases and higher-throughput weaker consistency applications. We would not be surprised if Delos or its successors will one day support at least some stronger-consistency user-facing applications. 

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Rabia: Simplifying State-Machine Replication Through Randomization

We covered yet another state machine replication (SMR) paper in our reading group: “Rabia: Simplifying State-Machine Replication Through Randomization” by Haochen Pan, Jesse Tuglu, Neo Zhou, Tianshu Wang, Yicheng Shen, Xiong Zheng, Joseph Tassarotti, Lewis Tseng, Roberto Palmieri. This paper appeared at SOSP’21.

A traditional SMR approach, based on Raft or Multi-Paxos protocols, involves a stable leader to order operations and drive the replication to the remaining nodes in the cluster. Rabia is very different, as it uses a clever combination of determinism to independently order requests at all nodes and a binary consensus protocol to check whether replicas agree on the next request to commit in the system.

Rabia assumes a standard crash fault tolerance (CFT) model, with up to f node failures in a 2f+1 cluster. Each node maintains a log of requests, and the requests execute in the log’s order. The log may contain a NO-OP instead of a request.

When a client sends a request to some node, the node will first retransmit this request to other nodes in the cluster. Upon receiving the request, a node puts it in a min priority queue of pending requests. Rabia uses this priority queue (PQ) to independently and deterministically order pending requests at each node, such that the oldest request is at the head of the queue. The idea is that if all nodes have received the same set of requests, they will have identical PQs. 

At some later point in time, the second phase of Rabia begins — the authors call it Weak-MVC (Weak Multi-Valued Consensus). Weak-MVC itself is broken down into two stages: Propose Stage and Randomized Consensus Stage. In the propose stage, nodes exchange the request at the head of PQ along with the log’s next sequence number seq. This stage allows the nodes to see the state of the cluster and prep for the binary consensus. If a node sees the majority of a cluster proposing the same request in the same sequence number, then the node sets its state to 1. Otherwise, the node assumes a state of 0. 

At this point, the binary consensus begins to essentially certify one of the two options. The first option is that a majority of nodes want to put the same request in the same sequence number (i.e., the state of 1). The second option is to certify that there is no such common request in the majority(state of 0). For binary consensus, Rabia uses a modified Ben-Or algorithm. Ben-Or consists of two rounds that may repeat multiple times.

In round-1, nodes exchange their state and compute a vote to be either 0 or 1. The vote corresponds to the majority of state values received, so if a node received enough messages to indicate that the majority of nodes are in state 1, then the node will take on the vote value of 1. Similarly, if a majority has state 0, then the node will vote 0. If no majority is reached for any state, the vote is nil. 

Round-2 of Ben-Or exchanges votes between nodes. If the majority of nodes agree on the same non-nil vote, the protocol can terminate. Termination means that system has agreed to certify the request from the proposal if the consensus value is 1 or to create NO-OP if a value is 0. 

In an ideal situation, all participating nodes will have the same request at the head of their PQs when the propose phase starts. This means that nodes will have the same state at the end of the propose phase, allowing the binary consensus to certify the proposed request at its sequence number in just one round trip (Round-1 + Round-2 of Ben-Or). So the request distribution + proposal + Ben-Or consensus under such an ideal case only takes 4 message exchanges or 2 RTTs. It is way longer than Multi-Paxos’ ideal case of 1 RTT between the leader and the followers, but Rabia avoids having a single leader.

A less ideal situation arises when no majority quorum has the same request at the head of PQ when the proposal starts. Such a case, for example, may happen when the proposal starts before the request has had a chance to replicate from the receiving node to the rest of the cluster. In this case, binary consensus may reach the agreement on state 0 to not certify any operation in that particular sequence number, essentially producing a NO-OP. The authors argue that this NO-OP is a good thing, as it gives enough time for the inflight requests to reach all the nodes in the cluster, get ordered in their respected PQs. As a result, the system will propose the same request in the next sequence number after the NO-OP. Both of these situations constitute a fast path for Ben-Or, as it terminates in just one iteration (of course the latter situation does not commit a request, at least not until the retry with higher sequence number). 

Now, it is worth pointing out that the fast path of one RTT for binary consensus is not always possible, especially in the light of failures. If too many nodes have a nil vote, the protocol will not have enough votes agreeing for either state (1 – certify the request, 0 – create a NO-OP), and the Ben-Or process must repeat. In fact, the Ben-Or protocol can repeat voting many times with some random coin flips in between the iterations to “jolt” the cluster into a decision. For more information on Ben-Or, Murat’s blog provides ample details. This jolt is the randomized consensus part. The authors, however, replaced the random coin flip at each node with a random, but deterministic coin flip so that each node has the same coin flip value for each iteration. Moreover, the coin flip is only needed at the node if there is no vote received from other nodes in round-1 of Ben-Or, otherwise, the node can assume the state of the vote it received. The whole process can repeat multiple times, so it may not be very fast to terminate. 

The paper provides more details on the protocol. Additionally, the authors have proved the safety and liveness of the protocol using Coq and Ivy.

The big question is whether we need this more complicated protocol if solutions like Multi-Paxos or Raft work well. The paper argues that Raft and Paxos get more complicated when the leader fails, requiring the leader election, which does not happen in Rabia. Moreover, Paxos/Raft-family solutions also require too many additional things to be bolted on, such as log pruning/compaction, reconfigurations, etc. The argument for Rabia is that all these extra components are easier to implement in Rabia.

Quite frankly, I have issues with these claims, and the paper does not really go deep enough into these topics to convince. For example, one argument is that leader election is complicated and takes time. Surely, not having a leader is good to avoid the leader election and performance penalty associated with it. But this does not come for free in Rabia. The entire protocol has more messages and rounds in the common case than Multi-Paxos or Raft, so it kind of shifts the complexity and cost of having a leader election once in a blue moon to having more code and more communication in the common case. I am not sure it is a good tradeoff. The performance claim of slow leader elections in Paxos/Raft is also shaky — failures in Rabia can derail the protocol of the fast path. I am not sure whether the impact of operating under failures is comparable with leader-election overheads upon leader failures, and I hope Rabia may have a point here, but the paper provides no evaluation for any kind of failure cases. 

And speaking of evaluations, this was the biggest disappointment for me. The authors claim Rabia compares in performance to Multi-Paxos and EPaxos in 3 and 5 nodes clusters, with 3-nodes in the same availability zones allowing Rabia to outperform EPaxos. In fact, the figure below shows that Rabia beats Multi-Paxos all the time. 

But there are a ton of assumptions and tweaking going on to get these results. For example, Rabia needs to have enough time to replicate client requests before starting the propose phase to have a good chance for completing on the fast path. So the testing is done with two types of batching to give the delay needed. The figure mentions the client batching. However, there is also a much more extensive server-side batching which is mentioned only in the text. Of course, there is nothing wrong with batching, and it is widely used in systems. For all the fairness, the paper provides a table with no batching results, where Multi-Paxos outperforms Rabia fivefold.

The biggest issue is the lack of testing under less-favorable conditions. No evaluation/testing under failures. No testing when the network is degraded and does not operate on the timing conditions expected by the protocol. These issues impact real performance and may create reliability issues. For example, a network degradation may cause Rabia to not use a fast path and consume more resources, reducing its maximum processing capacity. Such a scenario can act as a powerful trigger for a metastable failure

As usual, we had a nice presentation of the paper in the reading group. Karolis Petrauskas described the paper in great detail:


1) Evaluation. I have already talked about evaluation concerns, and this was one of the discussion topics I brought up during the meeting. 

2) Use of Ben-Or. Ben-Or is an elegant protocol to reach binary consensus, which is not usually useful for solving state machine replication. Traditionally, Multi-Paxos or Raft agree on a value/command and its sequence number, so they need a bit more than just a yes/no agreement. However, Rabia transforms the problem into a series of such yes/no agreements by removing replication and ordering from consensus and doing it apriori. With deterministic timestamp ordering of requests, Rabia just needs to wait for the operation to exist on all/most nodes and agree to commit it at the next sequence number. So the consensus is no longer reached on a value and order, but on whether to commit some command and some sequence number.

3) Practicality. The evaluation suggests that the approach can outperform Multi-Paxos and EPaxos, but whether it is practical remains to be seen. For one, it is important to see how the solution behaves under less ideal conditions. Second, it is also important to see how efficient it is in terms of resource consumption. EPaxos is not efficient despite being fast. The additional message exchanges over Multi-Paxos, Raft, and even EPaxos may cost Rabia on the efficiency side. 

4) Algorithms. The paper provides some nice algorithms the illustrate how the protocol works. However, some of the conditions are not necessarily confusing. In the same algorithm, the authors use f+1, n-f, and floor(n/2)+1 to designate the majority in an n=2f+1 cluster. Please proofread your algorithms — a bit of consistency can improve readability a lot!

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Protocol-Aware Recovery for Consensus-Based Storage

Our last reading group meeting was about storage faults in state machine replications. We looked at the “Protocol-Aware Recovery for Consensus-Based Storage” paper from FAST’18. 

The paper explores an interesting omission in most of the state machine replication (SMR) protocols. These protocols, such as (multi)-Paxos and Raft, are specified with the assumption of having a crash-resistant disk to write the operation log and voting metadata. This disk data allows crashed nodes to restart safely. However, the real-life gets in a way a bit, as infallible storage is as real as unicorns. 

Storage may fail in peculiar ways, when some data may get corrupted, while most other data is correct and the server itself continues working. The problem here is handling such failures. The simplest way is to treat the server as crashed. However, the server must remain crashed, as restarting may get into even more severe state corruption, as the server replays the operations from a faulty log. The paper talks about a variety of other approaches taken to deal with these data issues. The authors state that all the mechanisms they have explored were faulty and led to liveness or safety issues. I personally do not buy such a blanket statement, but a few of the examples in the paper were really interesting. 

The paper then suggests a solution – Protocol-Aware Recovery (PAR). The main point here is to avoid ad-hoc solutions because they are either slow, unsafe, complicated, or all of the above. This makes sense since such a big omission (potential for data-corrupting disk failures) in protocols should be addressed at the protocol level. The paper draws heavily on the Raft state machine protocol and develops the recovery procedure for it.

The log recovery is leader-based and can be broken down into two sub-protocols: follower recovery and leader recovery. The followers are recovered by restoring the data from the leader who always knows of all the committed history. Leader recovery is a bit more tricky and occurs as part of a leader election. Of course, if a non-faulty node can be elected a leader, then recovering faulty nodes is easy with the follower recovery. However, the leader election requires a node to have the most up-to-date log to become a leader, limiting a selection of nodes for the job. That being said, the node can be elected with a corrupted log, but it needs to recover the corrupted entries from the followers. If the entry is not available on any of the followers, the state machine becomes stuck (as it should). The protocol only recovers committed log entries and follows Raft logic to discard non-committed log suffix if it has corrupted entries. 

In addition, to log recovery, the paper also talks about snapshot recovery. The idea behind snapshot recovery is to make sure all nodes take the same snapshots at the same index in the log, break them into “chunks” and recover chunks as needed from other nodes. 

Here is the presentation by Rohan Puri:


1) The need for logs? The paper assumes that a state machine takes periodic snapshots to a disk/drive, and such snapshot in combination with a log can be used for node recovery later. This implies that the actual current state of the state machine can be lost due to a server restart. However, some state machines are directly backed by the disk, in essence, representing a rolling snapshot that gets updated every time an operation from the log applies. Recovery of such disk-backed state machine can be quicker and require only log entries happening after the crash/restart. Of course, this does not mean that the disk-backed state machine itself cannot be corrupted. In any case, the log entries are required for recovery and can be garbage collected once all nodes have persisted the state machine to disk (either as part of normal operation or a snapshot), making the time-frame for the log entries to remain useful to be relatively small. 

A more interesting problem may arise in trying to recover the corrupted state machine. If we rely on this “rolling-snapshot” disk-backed state machine, the mechanism the paper uses for snapshot recovery won’t work, since different copies of the state machine may be misaligned ever-so-slightly. Of course, one can always do the costly node restore procedure — restore to some prior snapshot and replay the log, but this is wasteful and requires keeping an extra snapshot and log from the snapshot onwards. In the spirit of the paper, we should rely on distributed copies instead and be able to restore the corruption without relying on storing redundant copies on the same server

2) Persistent memory vs RAM and recovery for in-memory SMR. If we build a state machine replication (SMR) to work purely off RAM, then we do not have the luxury of retaining any state after a restart. As such, in-memory state machines must have different mechanisms to ensure safety. For example, in traditional Multi-Paxos with a disk, a node always remembers the current term/ballot and past votes it has participated in. Without durable memory, a node restart erases the previous voting state, allowing a node to vote on something it has already voted on before, but with a lower term/ballot. This is not safe and may lead to a double-commit on the same log entry when a node promises to some new leader, and then after restart makes a second promise in the same log index to some older leader. 

Allowing for corruption in persistent memory is somewhat similar to not having persistent memory at all, at least when dealing with crashes/restarts. The very piece of data/metadata we need to ensure safety and avoid double voting as in the example above may be corrupted and cannot be used after a restart. However, the same precautions used for in-memory replicated state machines will work with corrupted storage as well and allow for safe recovery. For example, to prevent the double-voting example, a recovering node needs to run a “mock” leader election (or a leader election with a term guaranteed to not succeed). Such leader election will ensure the node gets a proper view of the current ballot/term in the cluster to make sure it no longer accepts votes from prior leaders. After such a mock election, the node can start accepting/voting for log entries while recovering any prior log and/or state machine from any of the replicas. Of course, the full recovery completes when enough data is shipped from other nodes (i.e. snapshots + missing log entries). 

There are a few differences between RAM and persistent storage when it comes to recovery. First of all, while it seems like both can lose data (one due to a reboot, another due to some random corruption), persistent storage still has a hint of data being missing. This is like not remembering what the node has voted for or who was the leader, but still having a 6th sense that something was voted upon. This extra piece of information may be useful in recovery, and indeed the protocol from the paper takes advantage of that to improve fault tolerance and safety. The recovery protocol preserves safety when the majority of nodes fail at the same log index, as the protocol knows something is missing entirely and will halt for safety. In the RAM setting, a mass reboot (i.e. majority of nodes) leads to a collective loss of memory without any hint that something may have been agreed upon, leading to a rewrite of the log. 

The second difference is that persistent memory may not lose all the data, so fewer items must be shipped from the followers. 

3) Leader-bound recovery. The paper suggests recovering followers from the leader node. This can put more load on the leader, who is already a bottleneck in the protocol. It seems like it may be possible to recover committed log entries from followers (the paper already does so for leader recovery) to make the recovery procedure less demanding for the leader.

4) Byzantine. The paper touches a bit on this topic. Data corruption on disk can be viewed through the lens of Byzantine fault tolerance. The corruption causes a node to act outside of the protocol specs, and byzantine-tolerant protocols are designed to handle such “out-of-spec” behaviors. The paper is a good example of how we can often solve some specific types of byzantine behaviors without resorting to the full-blown PBFT-style solutions. This is very practical, as we want the state machine to handle data corruptions, but we do not want to pay the performance penalty associated with BFT protocols. 

5) Luckilyhood of data corruption. Another point of discussion was around the likelihood of such data-faults happening. It does not seem like these are too frequent, but they do happen. We touched on a few anecdotal occurrences. For example, some firmware issues causing the disk to not write some large buffers of data. 

It is also worth noting error correction. Error correction is standard for server-grade memory, and it comes at a relatively small monetary/performance cost. Similar error-correction technologies are used in disks and drives, allowing for small errors (i.e. a bit-flip) to be fixed by the drive. In fact, NAND flash SSDs rely on error correction in normal operation.

6) Infallible disk. Protocols assume disk is always correct. Why? Even on the surface, this does not come as a super tight assumption. And especially on the scale of millions of SMR instances deployed across millions of machines.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!