Reading Group. chainifyDB: How to get rid of your Blockchain and use your DBMS instead

Our recent meeting focused on Blockchains, as we discussed “chainifyDB: How to get rid of your Blockchain and use your DBMS instead” CIDR’21 paper. The presentation by Karolis Petrauskas is available here:

The paper argues for using existing and proven technology to implement a permissioned blockchain-like system. The core idea is to leverage relational SQL-99 compliant DBMSes that are very common in various IT architectures. The participants in the blockchain can then leverage the existing software stack to implement secure data-sharing or replication. Moreover, the authors argue that in practice different participants may run DBMSes of different versions or even different vendors, making the problem of syncing data between participants more challenging, as different versions/vendors may execute the same SQL transactions slightly differently. To address the heterogeneity issues, the paper proposes a “Whatever-Voting” model, where the system achieves consensus on the results of a transaction, and not on the transaction itself. This allows each node/organization to run the transaction in “whatever” way possible, as long as the final result of the transaction is agreed upon by the quorum of participants in the “voting” phase. If the voting phase does not pass, the node with a mismatched result must roll the transaction back (potentially more than one) and try to recover by reapplying it. To me this seems a bit naive — if a node could not produce the proper results to a correct, non-malicious transaction due to its vendor differences or “whatever” execution differences, it may get stuck in the infinite loop of rollbacks and reapplies until whatever phase is fixed to produce the correct results for a specific transaction. This can be a very error-prone, engineering-intensive process of constant fine-tuning and micromanaging the components.

Step 1: Client creates a Proposal from SQL transaction. Step 2: Early abort of bad proposals. Step 3: Proposal is sent for ordering after all participants verify that it can be executed. Step 4: Ordered produces a block from multiple transactions/proposals. Step 5: Execution against local DBMSes. Step 6: Commit phase after the block is written to DBMS. Step 7: Voting to check the results are identical in the quorum of nodes/organizations. Step 8: Ledger entry is generated if the agreement in Step 7 is successful.

Since the proposed solution does not reach an agreement on transactions, the paper argues that it does not need complicated protocols to ensure that all correct nodes have correct transactions. If some entity issues different transactions to different participants, the results won’t match, missing the quorum agreement and causing a system-wide rollback and retry. However, the voting scheme should still be resilient to some cheating — a node must provide some proof or digest of the authenticity of its results.

The authors argue that their system does not have a centralized authority. However, it still has a centralized control component that batches the transactions and orders them. This component can be untrusted since the participants should catch mismatched batches and roll the state back as needed. That being said, a byzantine centralized batcher/sequencer (the paper calls it orderer) can cause liveness issues by sending garbage to participants causing constant recoveries. The paper does not specify the process of electing/replacing the orderer, so I think in practice it should be trusted for liveness purposes. 

There is a lot more work involved to make the scheme work. For example, each node must remember the transaction batches to create the ledger. This also happens within the DBMS, and some precautions must be taken to ensure that the ledger is tamper-proof. The roll-back procedure is also complicated and must be carefully implemented to have good performance. 

The authors conduct their evaluation on a limited scale with just 3 nodes/DBMS instances and show good scalability compared to Fabric and Fabric++. The evaluation focuses on increasing the number of clients interacting with the DBMSes, and not the number of DBMS instances. The paper does not mention the possibility to have Byzantine clients. Moreover, the evaluation does not consider any byzantine cases. 


1) Paper Quality. Our group’s collective opinion about the quality of the paper was not very enthusiastic. The paper’s premise is interesting and practical, however, it has way too many holes, missed promises, and unsubstantiated claims. These range from inadequate evaluation to relying on centralized unreplaceable “orderer” in a decentralized system to an underspecified voting protocol to glaring engineering holes in the “whatever” model that is prone to liveness issues due to bad implementations. There are way too many to list them all, so I will mention just one in a bit of detail. 

A big undelivered promise is using the DBMSes as “black-box.” While the DBMS software does not need any modifications, the applications running on these DBMSes must be adjusted to accommodate the WV model. Although these adjustments may be relatively small (i.e. making the client talk to chainify server instead of DBMS), it appears that the “whatever” stage may need tweaks on a per transaction-type basis if the DBMSes are very different. This additional logic and tweaking is hardly a black-box approach from the application point of view. Other concerns here involve privacy. What if some tables at one organization are not supposed to be shared but transactions/queries from this same entity touch both shared and private tables? Addressing these may require even bigger application changes and rewrites. And of course, this is not to mention all the additional table chainifying introduces to the database.

2) Preventing Byzantine Behavior vs. Detecting Byzantine Behavior. An interesting point the paper raises (and I wish it was explored more) is whether we need to prevent byzantine actions from taking place or just being able to detect them. Unlike all (most?) of the blockchains, chainifyDB does not prevent bad behavior from executing, and instead, it tries to detect misbehavior and fix it by a reset and restore procedure. Is this approach cheaper on the fundamental level? What applications can it support? Obviously, chainifyDB argues that it supports the permissioned blockchains, but We’d love to see more rigor and research in that direction.

There are some potential issues with this approach, and the paper does not give many details on these. For example, a transaction may execute on all nodes and produce matching result digests for voting. However, if the transaction itself is malicious, and for instance, tries to create value or double-spend something, then the correct nodes must have a mechanism to abort such transaction before its execution (or detect and reset after?). Result voting may not sufficient here. Of course, it is possible to implement some pre-execution sanity check logic to verify that the transaction is compliant with all the rules and constraints. It appears that in chainifyDB this stage happens early and requires every participant’s vote (see Step 2 in the figure in the summary), but the paper is not very clear on the safety implication of this vote/abort decision collection. Also, what about fault tolerance if we need all nodes for this? Additionally, this, in a sense, creates some pre-execution voting that the paper claims to avoid.

3) Issues of Trust. This is a bigger discussion on different types of storage systems and applications that need them. The most popular type of data systems is the ones with trust in a single participant/entity. This is not a problem for an in-house database for example. Many cloud data-intensive services rely on this model, where clients would trust the cloud provider and do not worry about the provider becoming malicious. This even works if the clients distrust each other, as long as they all trust the service provider. The latter is often sold as some form of a permissioned blockchain, but fundamentally it is just a database with an append-only log where safety is ultimately ensured by the trusted provider. Truly permissioned blockchains have no single trusted entity, however, a relatively small set of participants can be trusted as a group. The permissioned nature prevents Sybil attacks by spoofing many fake malicious participants to overtake the network, allowing to have small groups to be trustworthy despite having malicious individuals. Finally, open-blackchains are completely untrusted environments, where only the entire collective of all participants can be trusted. The figure here illustrates this. 

One discussion question we had is when to use permissioned blockchain and why not to stick with open-access protocols instead. Of course, permissioned blockchains or BFT protocols are cheaper to run than open-access counterparts and often have better latency and higher throughput. But is it safe to have trust in a small group of participants? What applications are ok with that level of trust? 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Protocol-Aware Recovery for Consensus-Based Storage

Our last reading group meeting was about storage faults in state machine replications. We looked at the “Protocol-Aware Recovery for Consensus-Based Storage” paper from FAST’18. 

The paper explores an interesting omission in most of the state machine replication (SMR) protocols. These protocols, such as (multi)-Paxos and Raft, are specified with the assumption of having a crash-resistant disk to write the operation log and voting metadata. This disk data allows crashed nodes to restart safely. However, the real-life gets in a way a bit, as infallible storage is as real as unicorns. 

Storage may fail in peculiar ways, when some data may get corrupted, while most other data is correct and the server itself continues working. The problem here is handling such failures. The simplest way is to treat the server as crashed. However, the server must remain crashed, as restarting may get into even more severe state corruption, as the server replays the operations from a faulty log. The paper talks about a variety of other approaches taken to deal with these data issues. The authors state that all the mechanisms they have explored were faulty and led to liveness or safety issues. I personally do not buy such a blanket statement, but a few of the examples in the paper were really interesting. 

The paper then suggests a solution – Protocol-Aware Recovery (PAR). The main point here is to avoid ad-hoc solutions because they are either slow, unsafe, complicated, or all of the above. This makes sense since such a big omission (potential for data-corrupting disk failures) in protocols should be addressed at the protocol level. The paper draws heavily on the Raft state machine protocol and develops the recovery procedure for it.

The log recovery is leader-based and can be broken down into two sub-protocols: follower recovery and leader recovery. The followers are recovered by restoring the data from the leader who always knows of all the committed history. Leader recovery is a bit more tricky and occurs as part of a leader election. Of course, if a non-faulty node can be elected a leader, then recovering faulty nodes is easy with the follower recovery. However, the leader election requires a node to have the most up-to-date log to become a leader, limiting a selection of nodes for the job. That being said, the node can be elected with a corrupted log, but it needs to recover the corrupted entries from the followers. If the entry is not available on any of the followers, the state machine becomes stuck (as it should). The protocol only recovers committed log entries and follows Raft logic to discard non-committed log suffix if it has corrupted entries. 

In addition, to log recovery, the paper also talks about snapshot recovery. The idea behind snapshot recovery is to make sure all nodes take the same snapshots at the same index in the log, break them into “chunks” and recover chunks as needed from other nodes. 

Here is the presentation by Rohan Puri:


1) The need for logs? The paper assumes that a state machine takes periodic snapshots to a disk/drive, and such snapshot in combination with a log can be used for node recovery later. This implies that the actual current state of the state machine can be lost due to a server restart. However, some state machines are directly backed by the disk, in essence, representing a rolling snapshot that gets updated every time an operation from the log applies. Recovery of such disk-backed state machine can be quicker and require only log entries happening after the crash/restart. Of course, this does not mean that the disk-backed state machine itself cannot be corrupted. In any case, the log entries are required for recovery and can be garbage collected once all nodes have persisted the state machine to disk (either as part of normal operation or a snapshot), making the time-frame for the log entries to remain useful to be relatively small. 

A more interesting problem may arise in trying to recover the corrupted state machine. If we rely on this “rolling-snapshot” disk-backed state machine, the mechanism the paper uses for snapshot recovery won’t work, since different copies of the state machine may be misaligned ever-so-slightly. Of course, one can always do the costly node restore procedure — restore to some prior snapshot and replay the log, but this is wasteful and requires keeping an extra snapshot and log from the snapshot onwards. In the spirit of the paper, we should rely on distributed copies instead and be able to restore the corruption without relying on storing redundant copies on the same server

2) Persistent memory vs RAM and recovery for in-memory SMR. If we build a state machine replication (SMR) to work purely off RAM, then we do not have the luxury of retaining any state after a restart. As such, in-memory state machines must have different mechanisms to ensure safety. For example, in traditional Multi-Paxos with a disk, a node always remembers the current term/ballot and past votes it has participated in. Without durable memory, a node restart erases the previous voting state, allowing a node to vote on something it has already voted on before, but with a lower term/ballot. This is not safe and may lead to a double-commit on the same log entry when a node promises to some new leader, and then after restart makes a second promise in the same log index to some older leader. 

Allowing for corruption in persistent memory is somewhat similar to not having persistent memory at all, at least when dealing with crashes/restarts. The very piece of data/metadata we need to ensure safety and avoid double voting as in the example above may be corrupted and cannot be used after a restart. However, the same precautions used for in-memory replicated state machines will work with corrupted storage as well and allow for safe recovery. For example, to prevent the double-voting example, a recovering node needs to run a “mock” leader election (or a leader election with a term guaranteed to not succeed). Such leader election will ensure the node gets a proper view of the current ballot/term in the cluster to make sure it no longer accepts votes from prior leaders. After such a mock election, the node can start accepting/voting for log entries while recovering any prior log and/or state machine from any of the replicas. Of course, the full recovery completes when enough data is shipped from other nodes (i.e. snapshots + missing log entries). 

There are a few differences between RAM and persistent storage when it comes to recovery. First of all, while it seems like both can lose data (one due to a reboot, another due to some random corruption), persistent storage still has a hint of data being missing. This is like not remembering what the node has voted for or who was the leader, but still having a 6th sense that something was voted upon. This extra piece of information may be useful in recovery, and indeed the protocol from the paper takes advantage of that to improve fault tolerance and safety. The recovery protocol preserves safety when the majority of nodes fail at the same log index, as the protocol knows something is missing entirely and will halt for safety. In the RAM setting, a mass reboot (i.e. majority of nodes) leads to a collective loss of memory without any hint that something may have been agreed upon, leading to a rewrite of the log. 

The second difference is that persistent memory may not lose all the data, so fewer items must be shipped from the followers. 

3) Leader-bound recovery. The paper suggests recovering followers from the leader node. This can put more load on the leader, who is already a bottleneck in the protocol. It seems like it may be possible to recover committed log entries from followers (the paper already does so for leader recovery) to make the recovery procedure less demanding for the leader.

4) Byzantine. The paper touches a bit on this topic. Data corruption on disk can be viewed through the lens of Byzantine fault tolerance. The corruption causes a node to act outside of the protocol specs, and byzantine-tolerant protocols are designed to handle such “out-of-spec” behaviors. The paper is a good example of how we can often solve some specific types of byzantine behaviors without resorting to the full-blown PBFT-style solutions. This is very practical, as we want the state machine to handle data corruptions, but we do not want to pay the performance penalty associated with BFT protocols. 

5) Luckilyhood of data corruption. Another point of discussion was around the likelihood of such data-faults happening. It does not seem like these are too frequent, but they do happen. We touched on a few anecdotal occurrences. For example, some firmware issues causing the disk to not write some large buffers of data. 

It is also worth noting error correction. Error correction is standard for server-grade memory, and it comes at a relatively small monetary/performance cost. Similar error-correction technologies are used in disks and drives, allowing for small errors (i.e. a bit-flip) to be fixed by the drive. In fact, NAND flash SSDs rely on error correction in normal operation.

6) Infallible disk. Protocols assume disk is always correct. Why? Even on the surface, this does not come as a super tight assumption. And especially on the scale of millions of SMR instances deployed across millions of machines.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group Special Session: Distributed Transactions in YugabyteDB

When: May 11th at 12:00 pm EST

Who: Karthik Ranganathan.

Karthik Ranganathan is a founder and CTO of YugabyteDB, a globally distributed, strongly consistent database. Prior to Yugabyte, Karthik was at Facebook, where he built the Cassandra database. In this talk, Karthik will discuss Yugabyte’s use of time synchronization and Raft protocol along with some optimizations that enable high-performance distributed transactions.


ACID transactions are a fundamental building block when developing business-critical, user-facing applications. They simplify the complex task of ensuring data integrity while supporting highly concurrent operations. While they are taken for granted in monolithic SQL databases, most distributed DBs would forsake them completely.

Fortunately, this is no longer the case. The trend started with Google Spanner, which offered distributed transactions using GPS based atomic clocks – unheard of in the database world before. Now, distributed transactions – without requiring atomic clocks – are offered by distributed SQL databases. One such example of a fully open source database offering this is YugabyteDB. Using the example of YugabyteDB, this talk will explain how distributed ACID transactions can be achieved without atomic clocks – without compromising on performance.

Reading Group. Paxos vs Raft: Have we reached consensus on distributed consensus?

In our 54th reading group meeting, we were looking for an answer to an important question in the distributed systems community: “What about Raft?” We looked at the “Paxos vs Raft: Have we reached consensus on distributed consensus?” paper to try to find the answer. As always, we had an excellent presentation, this time by A. Jesse Jiryu Davis

The paper compares Multi-Paxos and Raft protocols, but it does so through the lens of Raft and uses Raft terminology to describe Paxos. This can be very handy for people who first learned consensus through Raft. The paper mentions that the two protocols are very similar, especially in the “happy-case” operations. The important differences come in the leader-election phases of the protocols. In Raft, a new leader must be a fully caught-up replica with the longest log, while Multi-Paxos can pick any node as a leader and recover missing log entries. The authors argue that this Raft behavior is good for efficiency — the new leader can start quickly since it does not need to learn any missing entries. 

When it comes to understandability, the paper says that Raft is “slightly more understandable than Paxos.” I suppose this comparison comes after expressing/explaining Paxos in Raft terminology, and not based on the original explanations. 

Personally, Paxos is more understandable to me, since I absolutely hate the quirk of Raft when a majority-accepted but not committed value may get “chopped off” upon some intricate leader churn (See Figure 8 in the original paper). This ties to the sequential commit requirement in the normal case, and different commit rules that apply upon leader churn, which all tie to the leader election procedure itself. How is it more understandable? In contrast, Paxos never loses a majority-accepted value and has one commit rule, although it allows committing out-of-order (the execution of committed operations still must follow the log order). 

I won’t go further into summarizing the paper, it is a good educational read, so I do not want to spoil it.


1) Orignal Paxos Papers as a Source of Confusion? Needless to say, Lamport’s Paxos original paper does not have the clearest description of the algorithm. This somewhat extends to Lamport’s Paxos Made Simple paper, and Paxos Made Moderately Complex by Robbert van Renesse. However, the relative difficulty of understanding Paxos from these later papers may not lie in the algorithm’s description itself, but in the language used. Raft paper is written with an engineering audience in mind, and operates with primitives, like RPCs, that can be used in programming languages right away. Raft is described in operational concepts, such as server states with clear boundaries and transitions between states — first, do leader election, then do log replication, and finally, go back to leader election if needed. These may have a great appeal to students and professionals. We teach computer science from the code first (it is a separate discussion whether this is the right way to approach distributed systems though), and describing protocol closer to code has a certain appeal. 

Multi-Paxos is described in more abstract terms. Multi-Paxos is not as “operational.” For example, the leader election gets blurred a bit with replication as the leader may need to fill some slots before becoming fully active. Transitions back to leader election are also a bit murkier compared to Raft.

I think it is great to have this abstract Paxos algorithm that we can shape and implement in a variety of ways. This leads us to uncover new ways it can be refined and/or implemented — take a look at Flexible Paxos result.

2) Reference Implementation(s). To continue the previous point, a great appeal of Raft are many (reference) implementations. This again appeals greatly to engineers, who can look at the code and run it, trace through breakpoints, and learn it hands-on. Another point that was mention in the discussion is a kind of herd effect. Once a good production-grade implementation is available, more people will just take it and use it. 

3) Leader election efficiency. We spent a bit of time discussing the leader election efficiency point from the paper. This is an important feature that may come in handy in disaster recovery performance. The question is how realistic it is to have followers that significantly lag behind. These lagging followers may put pressure on the leader, as the leader cannot compact the log while some follower is struggling behind and consumes it, which may cause higher memory and/or storage consumption. But of course, this applies only to severely lagging machines, and the performance hit from catch up after the leader election can be noticeable even with a less severe staleness. In the discussion, it was mentioned that having hours of staleness is possible on some systems!

On the other hand, the Cloudflare outage is a good illustration of how Raft’s leader election fails at liveness unless we add a bit more complexity to the protocol (and removing understandability?). So its good performance may get compromised under some conditions. Here is the paper on this too: “Examining Raft’s behaviour during partial network failures.” And no surprise, it is by Heidi Howard.

4) Is Raft an Implementation of Multi-Paxos? Given a more abstract description of Multi-Paxos and a more direct (and less flexible?) approach taken by Raft, one may think that Raft is, in a sense, an implementation of Multi-Paxos. Both protocols operate on Majority quorums, have a stable leader. They have the same communication pattern as the leader talks to followers. Differences appear in the leader election, but are these big enough differences? Can one implement something Raft-like from an abstract Paxos blueprint? 

The differences may appear big, but they also may appear as a refinement of Paxos. 

  • Can I elect a leader without having to copy data from followers? Sure, if you pick the most up-to-date follower to be the leader. 
  • How do I know which one is the most up-to-date? Well, look at the latest committed log item. 
  • Wait a sec, in Paxos I can commit out of order, so which one is the latest committed log item? Just don’t commit out of order… 

5) Porting Paxos Optimizations to Raft. As mentioned in our presentation, another paper tried to answer Paxos vs Raft question. That paper, titled “On the Parallels between Paxos and Raft, and how to PortOptimizations“, looked at a number of Paxos optimizations and tried to figure out if they can apply to Raft. The authors there concluded that for the most part, they apply. And they also noted how the two protocols are similar and can be made more similar by tweaking the implementation of Multi-Paxos to be more Raft-like (which goes back to point (4) above).

6) Industry Use of Raft vs Multi-Paxos. The industry loves Raft. For the reasons I already mentioned above: reference code, good existing libraries, the description that is closer to the code. One notable company that has been using Multi-Paxos is Google. Their Spanner database is based on Multi-Paxos. However, its open-source cousins (CockroachDB and YugabyteDB) rely on Raft. We were wondering how much more difficult it was to implement Multi-Paxos in production compared to Raft in this similar databases.  

Easy adoption is a great point for use in the industry, especially when businesses lack the expertise to use formal methods to check their protocols or protocol variants. However, this may be changing now, as formal methods, like TLA+, are picking up more widespread adoption. We see that companies start to modify their Raft protocols in a variety of ways, and remain confident in their safety and liveness properties through model-checking. This adoption of formal methods may bring Multi-Paxos and Raft closer together, as a more abstract way of thinking about these protocols may highlight their similarities. For example, Jesse mentioned how they use TLA+ to build confidence in their variant of Raft.

7) Other Consensus Protocols. Another important topic we discussed relates to other consensus protocols. There are plenty of consensus-based replicated state machines, yet we still pretty much use Paxos or Raft. Solutions like EPaxos and its extension have been around for some time. Is the problem that they are hard to implement? Are they difficult to understand as well? EPaxos is not an easy protocol for sure, but it has a reference implementation. Or maybe these protocols are not as good in practice as the papers promise…

8) Teachability. If the community has a good understanding and description of Paxos, why not use it in classrooms and textbooks? There are not that many good distributed systems textbooks, and it seems like a lot of faculty prefer to teach from papers and their own notes. The same goes for engineering folks who have to pick up on distributed systems state-of-the-art through an extensive literature survey. 

We have mentioned a few sources to help with distributed systems such as the “Designing Data-Intensive Applications” book by Martin Kleppmann or MIT distributed systems course.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. New Directions in Cloud Programming

Recently we have discussed a CIDR’21 paper: “New Directions in Cloud Programming.Murat Demirbas did the presentation:

Quite honestly, I don’t like to write summaries for this kind of paper. Here, the authors propose a vision for the future of cloud applications, and I feel that summarizing a vision often results in the misinterpretation of that vision. So I recommend reading the paper to draw your own unbiased conclusions. 

That being said, here is my extremely high-level take on the paper in a few points:

  • Application developers should focus on application logic, and not worry about implementation aspects of consistency, durability, availability, etc. 
  • It does not mean that developers do not care about consistency, availability, redundancy at all. Instead, they simply should know what they need, and let the cloud provide these. As such, developers should declare their consistency, availability, budgetary needs, etc., and have the cloud runtime enforce such declarations. This will free up the programmers and let them focus on the application logic instead and make this logic “unsoiled” by the other aspects of the distributed app.
  • To help developers focus on their applications/tasks, we need domain-specific languages (DSLs). DSLs can hide a lot of “mechanical” work from the programmers and delegate it to the cloud runtime. A good example of a popular DSL we have been using for a very long time is SQL. It is declarative — programmers retrieve and update the data without worrying about how it is done under the hood. 
  • Despite potentially having many DSLs, we still want one comprehensive framework to run it on, so the visionary system here can compile DSL to some common Intermediary Representation (IR). The authors want the IR to be human-readable and optimizable, although I feel like this requirement is part of the “evolutionary” theme in the paper, and eventually, the importance of human optimizations may diminish. 
  • Achieving this highly declarative vision is hard, but the paper lists several developing and emerging techniques and research directions that may help evolve the cloud. 


1) DSL. We have spent quite some time discussing DSLs and what does it mean to have many of them. For one, we already have a few successful ones, like SQL. Arguably, ML/AI and data processing systems also often have some form of DSLs. TensorFlow was brought as an example. One minor concern that was expressed in the group is that having many of these DSLs requires someone to make and maintain them. A more interesting DSL question is how specialized they should become? To bring SQL example again, while it is great for what it does, it is rarely used all by itself. So there will be a clear need to allow to mix and match these highly specialized DSLs, potentially making the problem of translating them to IR more difficult. 

2) IR. A big part of the Hydro system vision is the IR language. The language itself may get rather complicated when it needs to support many IRs. A bigger challenge may be having to translate DSL logic to a human-readable IR code. The translations that are done must make sense to engineers, the logic should be clear and not obscure to allow people to make sense of it. This pursuit of human readability may result in less performance efficient IR. The readability may also depend on the individual DSLs. 

Another point we discussed is whether programmers will just start writing code directly in IR if it is a good, readable, feature-rich language. Maybe this is exactly what the programmers need after all? A language made specifically for distributed applications.

3) How much of this is already in the cloud? DSLs exist, the serverless cloud is developing too, providing more consistency and durability than before. For example, Azure Durable Functions save their intermediate state and can be resumed in the face of failures. And surprisingly, many of these cloud offerings, like serverless, durable functions, serverless storage are easy to use. Last semester I gave a project in my Cloud Computing Systems that used blob storage, serverless functions, and durable functions. To my surprise, the students loved the project and were able to figure out all of this tech (which they had to do on their own since the tech aspect was not really part of the problem they were solving) in just a few days. So as it stands right now, the cloud is evolving quickly, with serverless computing and storage becoming more ad more capable. It is not a coherent single cloud runtime just yet, and probably won’t be there any time soon, but some aspects of the vision are there. Users can scale serverless compute, not worry about its availability, may opt into more durable options when needed, may use cloud-native storage with configured/declared consistency, take advantage of DSLs for many tasks in the cloud, like data management, ML/AI systems, etc…

4) Drivers of innovation? An interesting discussion happened at the end of our meeting. Some expressed the opinion that cloud vendors should know better in what direction to develop the cloud since they are in constant interaction with the clients and must adjust to what clients are asking. I, personally, disagreed with this opinion — cloud clients are not thinking about the long-term visions like this paper describes. Instead, they may have more immediate concerns that must be dealt with given all the technology they already use in the cloud. An example I used is the true invention of GUI by Xerox PARC. The vision was out there, but nobody was really asking for this back then, even Xerox did not really know what to do with it, and willingly let others copy the ideas. Yet, this innovation made modern consumer electronics/computing what it is today. I suspect, that if Xerox were asking clients about what to improve, they may have worked on something as boring as developing a console with 120-character lines instead of an 80-characters one to make existing systems more “user friendly.”

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Facebook’s Tectonic Filesystem: Efficiency from Exascale

This time around our reading group discussed a distributed filesystem paper. We looked at FAST’21 paper from Facebook: “Facebook’s Tectonic Filesystem: Efficiency from Exascale.” We had a nice presentation by Akash Mishra:

The paper talks about a unified filesystem across many services and use cases at Facebook. Historically, Facebook had multiple specialized storage infrastructures: one for hot blob storage, another for warm blobs, and another for analytics. While these systems were optimized for their specific purpose, they had a major drawback of resource overprovisioning. For example, hot storage needs to have a lot of IOs per second (IOPS) and may overprovision storage to get these IOPS, leaving storage resources wasted. On the other hand, warm/cold storage has opposite requirements — it needs more storage and fewer IOPS, leaving the IOPS unused. Tectonic, as a unified system, aims to avoid such resource overprovisioning and fragmentation. Having one system to serve different workloads creates several challenges. First, of all the single systems need to scale to support much larger storage capacities than several isolated systems. Second, it needs to work just as well as the purpose-build components it replaces. 

The storage backbone of Tectonic is Chunk Store. It stores chunks of data that make up logical blocks that make up files. Chunk Store is linearly scalable storage that grows with the number of storage servers. This largely covers the data and IOPS scalability of the file system on the hardware level. However, another limiting factor in large file systems is metadata management. As files are being accessed, the clients must first consult the metadata sub-system to locate blocks for the file and enforce other file system guarantees. One problem Facebook had with HDFS in their analytical cluster is limited metadata server scalability, and therefore tectonic avoids “single server” Metadata Store. Instead, all metadata is partitioned based on some directory hashes to allow multiple metadata servers to work at the same time. Furthermore, Tectonic implements the Metadata Store as a collection of micro-services over a Paxos-replicated key-value store, allowing different metadata requests to be processed by different metadata services of the same partition. The client library has all the logic that orchestrates interaction with the Metadata Stores and the Chunk Store. Placing the filesystem logic on the client-side allows for tenant-specific optimizations to better suit the workloads.

Having a unified filesystem for many services at Facebook allows for more efficient resource usage. Tectonic distinguishes between two resource classes: non-ephemeral resources (storage capacity), and ephemeral resources (IOPS). Storage capacity is provisioned to a tenant and remains at the provisioned level until a reconfiguration. Ephemeral resources (IOPS and metadata query capacity), however, can “move around” between tenants depending on the need. The resource sharing/allocation is controlled by a somewhat hierarchical system. A tenant selects a TrafficGroup for its application. The TrafficGroup is assigned to a TrafficClass, which controls latency requirements and decides the priority of allocating the spare resources. 

As always, the paper has way more details!


1) Magnetic storage. Tectonic uses magnetic devices for storage. HDDs are much slower than SSDs, and a lot of new large storage systems are now targetting SSDs. We wonder how much of a design decision was influenced by the use of hard drives instead of SSDs. Resource sharing is a big part of the motivation and design of Tectonic, and since SSDs have different performance characteristics than magnetic storage (lower storage volumes, higher IOPS), these differences may have influenced the design of Tectonic or the entire need to have such unified system in the first place. Of course, at Facebook scale storage cost is a big consideration, and HDDs are cheaper. Cold/warm storage is a big part of Facebook’s architecture, so it makes sense to put infrequently used data on cheap disks and keep “hot” data in caches in the rest of the architecture stack. 

2) Sharded metadata. The metadata store in Tectonic is sharded, meaning that there is no single coherent view of the metadata at any given time, as different directories (subdirectories) may be on different shards. This potentially creates the possibility for some consistency issues, especially for metadata operations that may span multiple shards. Another issue of sharded metadata is getting directory stats, as subdirectories may be in different partitions, so there is no perfectly correct/up-to-date view. But of course, sharded metadata allowed scaling the system horizontally much easier than a more centralized approach. 

3) KV store for metadata. Metadata store is backed by a KV store. A lot of metadata fall naturally into the KV abstraction. For example, a key can be a directory name, and value is a list of its children. However, Tectonic does not store metadata in such a naive way. Instead, the value is expanded into the key. The need for this raised a few questions in the group. The basic here is avoiding the read and write (and a need for a transaction) when you need to append something to the list of values. In the expanded format, the value becomes a suffix of a key in the KV store, and adding something becomes as trivial as writing the key-value pair. Reading the entire list can be by scanning the store for a particular key prefix. Cockroach DB has a very good explanation of using KV storage to implement more complicated data models.

Paper has a few other examples of little performance tricks, like making some metadata records sealed or read-only to allow caching and partitioning schemes designed to avoid hotspots. 

4) Client-driven. An interesting aspect of the system is that it is entirely client-driven in terms of all of its logic. From one point of view, this allows different clients to implement their custom protocols for using Tectonic. The paper has quite a bit of detail on how different tenants interact with the rest of the system differently to optimize for latency or throughput. On the other hand, this creates a possibility of client conflicts, if not managed properly. For instance, Tectonic allows a single writer per file and avoids ordering concurrent writes from different writers. However, this also means that if two clients are competing, the one who writes to the metadata server last and gets a token will be the sole writer, and any data written by a different client will be lost, even if it has been written to the chunk store. We feel like some of these decisions are in part because the product is internal to Facebook, and they have more control over how their engineers use it. 

5) Comparison with other distributed FS. There are plenty of other distributed filesystems out there. Tectonic already compares itself somewhat with HDFS it replaced. Ceph is another popular one. It also has a more centralized, Paxos-replicated metadata service, which is good for consistency but may interfere with scalability. At the same time, Ceph clients (I think) cache the location of data, so maybe the need to query metadata is not very frequent. We also talked about comparing this to blob stores like S3, and extensions over S3, like EMRFS, for use as HFDS replacement. But these are not as general-purpose/flexible to fit multiple different use cases.

6) Verification. The paper does not mention any kind of formal verification done on Tectonic. One argument we had in the reading group is that Tectonic’s lax “eventual” consistency does not require checking. But this is not true, and there are plenty of opportunities to screw up even in the eventual consistency model (i.e. liveness issues, convergence, etc). Moreover, there are plenty more opportunities to make concurrency mistakes when using clients to build stronger consistency models on top. I think S3 was brought up again as an example of a team that verified the blob storage even before it was strongly consistent. We hope the some formal verification was done on a system this large and important, and that simply talking about it was outside of the scope of the paper.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Distributed Snapshots: Determining Global States of Distributed Systems

On Wednesday we kicked off a new set of papers in the reading group. We have started with one of the classical foundational papers in distributed systems and looked at the Chandy-Lamport token-based distributed snapshot algorithm. The basic idea here is to capture the state of distributed processes and channels by “flushing” the messages out of the channels with markers. The markers ensure the causality if not broken, despite the processes taking their local snapshots at different times (and with no affinity to the physical time). I am not going to summarize the paper, as there is plenty of material on the internet on the subject, however, here is our group’s short presentation by Maher Gamal:


1) Use of snapshots. Much of our discussion focused on the use of snapshots. Aside from the trivial use for disaster recovery, snapshots are useful for debugging and runtime verification. The paper suggests some debugging/monitoring usage, like detecting stable properties of the algorithms. However, we also think that detecting violations of certain properties may be more useful in the real world. For instance, detecting the violations of invariant properties at runtime.

Just last week we talked about Aragog, a system for runtime verification of network functions. And while it does not directly use snapshots, it relies on time synchronization to make sure that the messages come from different consistent cuts of the state, and the cause and effect relationship play out correctly in the constructed state machine.

2) Snapshots of states that did not happen. Interesting things about the Chandy-Lamport snapshots are that they may capture a system state that did not happen in the execution. This is because the snapshots are taken progressively as the markers propagate through the channels, and essentially the snapshot gets rolled out at the communication speed. 

3) Timely snapshots. We also brought up snapshots that are close to the wall clock. These may be more useful for debugging purposes, than Chandy-Lamport, as they provide some notion of when things happened. Additionally, more tight snapshots that are taken at about the same time globally are better at recording true state (or should we say has fewer timing artifacts?)

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Aragog: Scalable Runtime Verification of Shardable Networked Systems

We have covered 50 papers in the reading group so far! This week we looked at the “Aragog: Scalable Runtime Verification of Shardable Networked Systems” from OSDI’20. This paper discusses the problem of verifying the network functions (NFs), such as NAT Gateways or firewalls at the runtime. The problem is quite challenging due to its scale, as large cloud providers handle enormous amounts of traffic. The paper uses NAT Gateway (NATGW) as the motivating example. NATGW balances external traffic to the servers in a way to ensure that the entire packet flow goes to the same destination. It is implemented entirely in software. This means that the NF needs to be fault-tolerant and not “forget” the destination server for each flow, but it also needs to be super quick, so strong consistency is not an option. In this NF, some primary node keeps track of routing for a flow and asynchronously replicates the information to backups. 

Aragog is set to examine the operation of this NF and look for invariant violations. Obviously, if we look at the entire NF that handles thousands and thousands of flows, the problem is rather large. However, Aragog does not need to take the global approach, and since each flow is independent, it can look at verification at the flow granularity. For example, the system can check whether at any time the flow is directed by at most one primary node. This check still requires the global view of all system nodes to make sure that there are no two primaries, but it does not require the entirety of the state and needs only the state associated with a particular flow. In the nutshell, Aragog constructs a state machine based on the description of an invariant violation for each flow, allowing for embarrassingly parallel scalability due to the sharded nature of the problem. The state machine transitions across states as it receives the events (i.e. messages), and if it reaches the termination state, a notification can be issues about the violation. Naturally, since the events can happen on different nodes, they are all shipped to a centralized verification agent that runs the invariant violation state machine. However, it would still be inefficient to ship all the events for each flow to construct these state machines, so Aragog does some filtering — it does not send messages irrelevant to the invariant we are checking, and it avoids shipping messages that can be confirmed locally to not transition to a new state of the state machine. 

The evaluation show that Aragog can detect violations due to bugs, and it has quite substantial throughput for checking.

As always, a lot more details can be found in the paper. A. Jesse Jiryu Davis did an excellent presentation of the paper. He also has quite a few interesting questions/ideas at the end of the presentation.


1) Best-Effort. Aragog makes certain assumptions and choices that make it a “best-effort” system, as it may miss some violations (false-negatives) and maybe even detect some that did not happen (false-positives). One such assumption we noted in the discussion is time-synchronization, since the events/messages are ordered and applied to the state-machine according to their timestamps. While good time-sync is possible, it is not clear how well this was implemented in Aragog (it uses PTP) and how big of an impact it may have. For example, a reordering of the messages may potentially hide the violation from being detected, or even worse make a non-violating message ordered appear as one that has a problem. Another avenue for missing out on the violations is the use of sampling.

2) Need for Runtime and Near-real-time Verification. One of the major requirements in Aragog is the runtime or near-real-time verification. However, we were not super clear on why this is needed. For example, even if you detect a problem quickly, there may be limited options to react to it quickly. Unless it causes an outage (which would be detectable by other means), the time to resolve may be very large, as a bug needs to be reproduced, tested, fixed, and deployed, and at best can take many hours or even days. Another reason why a near-real-time requirement is questionable is the best-effort nature of the system described above. However, we have talked about one scenario where a quick response is needed. Consider a rollout of the new version of the network function. The rollouts typically occur in stages, so it is important to detect any issues early and stop the rollout before it hits the majority of servers. 

3) Embarrassingly Parallel. The cool thing about Aragog is its embarrassingly parallel nature. While the system is applied in a specific domain (testing network functions), we feel like other applications/domains can parallelize in a similar matter. 

4) “Productizing” Aragog. Again, currently, Aragog is a niche/domain-specific system. However, it seems to have nice features, like scalability, and nice language to describe invariant violations and filtering, so we wonder if this can be productized beyond this specific application.

5) PL & Distributed Systems. I largely skipped this in my summary, but quite a lot of nice things about Aragog come from its language and how it creates a state-machine out of the invariant violation description. Not to mention all the local suppression to prevent sending the events/messages that do not change the state of the machine to a global verifier. This is cool stuff, at least from our distributed system point of view that is a bit further from programming languages work. 

6) Retroscope. Finally, I could not resist and not make the comparison with some early work of mine — Retroscope. Retroscope uses HLC (to avoid time-sync causality-breaking issues) to construct a progression of globally distributed states and search for… you guessed it, invariant violations. Retroscope uses SQL-like language to express the predicates that describe a violation. Unlike Aragog, I tried to make Retroscope general. It also does not shard-like Aragog, but once the events are ingested by a streaming service, predicate search is embarrassingly parallel. Restrocope was one of my first works in the distributed systems space, so it is not quite as optimized or fast, but it is also a more general prototype.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Protean: VM Allocation Service at Scale

The last paper in our reading group was “Protean: VM Allocation Service at Scale.” This paper from Microsoft is full of technical insights into how they operate their datacenters/regions at scale. In particular, the paper discusses one of the fundamental components of any cloud provider — the VM service. The system, called Protean, is an allocation service that handles VM allocation requests at the availability zone granularity in each Azure region. It tries to figure out which server of many thousands of candidates is the best fit for the VM described in the request. Its goal is to pack VMs tightly to avoid fragmentation of resources — having too many small and unusable server chunks. There are several challenges in doing so. First, each VM has a set of requirements, such as the VM type, the number of vCPUs, memory allocation, on-server SSD size, networking, location preferences for fault tolerance, and many more. This alone makes the problem very hard to solve optimally, NP-hard as a matter of fact. The second major challenge is doing these allocations at scale. There are surprisingly many VM allocations going on in each availability zone all the time. In the steady state, the system deals with hundreds of allocation requests per second, with occasional spikes to thousands of new VM requests per second!

Protean is made up of the placement store, a database that keeps a record of VM assignments in the zone’s server inventory. One of many concurrent Allocation Agents (AAs) computes the actual VM assignment to the machine. Each AA is like a server filter — it takes the requests for new VMs and filters out all the servers not capable of hosting the VM. After the filtering, AAs compute a general preference score to figure out a set of most suitable servers and pick one random server from such a set of candidates. 

Protean implements this whole filtering and scoring using a rule approach and divides the process into multiple phases. First, it uses cluster validator rules to filter out any homogeneous clusters that cannot support a VM. These validator rules specify a “hard” requirement needed to support a VM. For example, a VM with a GPU cannot be supported by a cluster of GPU-less servers, so the entire cluster is automatically not a candidate for allocation. Then the system scores the clusters that can handle the VM based on some preference rules, which describe “nice-to-have” features, as opposed to hard requirements. A similar validator rules process is repeated to filter out the non-compatible machines in the selected cluster (for example, servers that are already at capacity and have no available resources for a VM type). Finally, all remaining good servers are scored based on the machine preference rules.

This tiered approach greatly reduces the possible allocation choices since many thousands of servers can be removed from consideration by excluding the entire clusters. However, filtering out the remaining machines is still a resource-intensive task. Protean has many rules that validate or score machines and doing these computations can add up to significant amounts of time. Each AA, therefore, caches the rules and the scoring results. This works well for two major reasons: (1) most requested VMs are very similar, so the same rules are used repeatedly; (2) inventory changes are relatively small, and between two invocations of the same rule, there will not be a lot of change in terms of server allocations. Moreover, AAs largely address the inventory changes by updating the cached rules before each use. Cache updates recompute the scores/results for a handful of servers that may have been updated by other AAs, and it is a lot faster than doing the full computation for all servers every time. To make the system more efficient, the AAs learn of changes from the placement store via a pub/sub system, so updating cache only involves local operations and local storage and does not query the placement store. This lowers the latency of cache updates and reduces the load on the placement store by avoiding the repeated queries for every cache update. 

The whole interaction between AA and placement store is not strongly consistent/transactional to avoid locking the store while computing the VM placement. This allows multiple AAs to work concurrently, but also introduces the possibility of conflicts due to a race — a couple of AAs working concurrently may pick the same server for two different VM allocation requests. These conflicts are resolved by the placement store in one of two general ways. If the target server can accommodate both VMs (i.e. the validator rules pass for the server for two allocations instead of one), then the placement store will merge the conflicts. If the server cannot handle both VMs, then one conflict allocation is retried. Protean allows up to 10 retries, although this rarely happens in practice. Also, since the system already has a mechanism to tolerate conflicts, it is fine for AAs to work off slightly stale and not-up-to-date caches, allowing the aforementioned pub/sub way of updating them. However, there is probably some balance between the staleness of cache, the number of conflicts/retries, and the overall quality of placement, so I’d suspect that the cache updates still need to be relatively recent. 

Microsoft has released the VM allocation dataset to the public! 

As always, the paper has many more details and rationale for all the decision choices. My rambling presentation of the paper is on YouTube:


1) Preference Rule Evaluation. Preference rules implement a Compare function that orders two objects (two servers or clusters) for a given VM request. Each rule also has a weight that determines the overall weight of a preference rule in the scoring of servers/clusters. The servers are scored/ordered based on all preference rules, and the order is computed with a global compare function that combines all the individual compare functions in a weighted manner. However, the weight is constructed in such a way, that a higher-weight rule always outweighs all lower-weight rules combined. This is done to aid in the explainability of VM placement. The question we had is why do we need to compute the global compare function with all preference rules (and waste all the time doing these computations) if we can evaluate the rules sequentially starting with the most important rules first. This way, if the most important rule produces enough desired servers, we do not need to evaluate other lower-priority rules. 

Of course, caching makes computing fast, since most rules have already been evaluated before, so this may be the reason for just sticking with a general score. At the same time, the need for cache is due to the slow speed of rule evaluations, and it seems like such evaluation of all rules (at least with the strict priority of preference rules) is not necessary.

2) On the Importance of Explaining the Allocations. Part of the design is the result of having “explainable” decisions — engineers want to know which rule has impacted each decision. But how important is this? What benefits it gives the engineers/operators aside from some piece of mind of understanding the system’s choices. Can a more efficient system be designed if the “explainability” rule is omitted? After all, we have many ML systems (including safety-critical systems, like self-driving vehicles) that are based on the models that lack any “explainability”.

3) Caching System. This is one interesting caching system, that caches the results of computations. It is highly-tailored to the task at hand, and papers go into great detail on many nuances of the systems. The interesting part is the cache-updates that must be done before each cache use to bring the cache up-to-date (and recompute some parts). However, the update does not guarantee that cache is the most recent! It simply ensures that cache is more recent, but it still may not have the newest changes that are still in the pub/sub pipeline. 

4) Evaluating the Quality of Placement. The paper talks about the quality of placement quite a lot, however, the evaluation is limited to one simulation on packing density. However, it would be nice to see how production variations impact quality, especially since the paper suggested these impacts are small. Another interesting point is that the paper claims CPU to be the most contended resource. So how much impact other resources and constraints play in the quality of packing?

5) Many Interesting Tidbits. Most VMs are small – 1-2 cores. We think this is due to lots of small automated tasks, such as build and testing pipelines. Many VMs have a short lifespan. This is probably for the same reason, as these build-pipeline VMs will get destroyed when no longer needed. Need to keep empty servers. This looks weird on the surface to have idle capacity, but the paper mentions the fault-tolerance reasons — have resources to move VMs that occupy an entire machine. There are many more interesting tidbits in the paper. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Sundial: Fault-tolerant Clock Synchronization for Datacenters

In our 48th reading group meeting, we talked about time synchronization in distributed systems. More specifically, we discussed the poor state of time sync, the reasons for it, and most importantly, the solutions, as outline in the “Sundial: Fault-tolerant Clock Synchronization for Datacenters” OSDI’20 paper. We had a comprehensive presentation by Murat Demirbas. Murat’s talk was largely based on his extensive time synchronization experience in wireless sensor networks.

First, let’s talk about the need for time synchronization. Many problems of distributed computing could have been avoided if we had a perfect global clock available everywhere, as we often rely on the ordering of events for correctness. For instance, such a perfect clock would make causality/dependency tracking easy. And this alone would have simplified and improved many different systems and processes, ranging from efficient consistent snapshots, to more consistent storage systems, to the improved debuggability of all distributed applications. In the absence of a perfect global clock, we have been relying on other clever tricks and techniques, such as logical clocks, vector clocks, loosely synchronized causality-tracking hybrid logical clocks to name a few. 

Fundamentally, if we have unsynchronized clocks on two servers, we cannot use these clocks to order the events. The paper provides the following example to the issue: a shared variable X is read on some server at time T, but this same variable is updated on a different server at time T-1, however, due to time asynchrony, the update actually happens after the read in the real-time. This essentially makes the clocks useless for ordering, unless we know how badly unsynchronized the clocks are. Knowing this time uncertainty ε allows us to delay the read at T until we know that all servers have moved on past T. Some systems may resort to rescheduling the operations/transactions that fall within the uncertainty instead of waiting, but this is a similar enough use case. Naturally, having smaller uncertainty is better for performance, since a system will incur shorter waits or fewer rescheduled operations.

So what prevents us from driving this uncertainty ε down to 0 for a perfect synchronization? This is not an easy answer, and there is a myriad of factors. The clocks themselves are a problem — servers tend to have cheap quartz oscillators that “tick” at different speeds depending on temperature and voltage variations. These variations make individual machines drift apart ever-so-slightly over time. Trying to synchronize these flimsy clocks is a problem as well — the servers communicate over the network for time sync. And the network is unpredictable, starting from how messages may be routed, to different queues and buffer delays at NICs and switches. All these add the variability to message propagation time and make the network non-symmetric, as message flow in one direction may be faster than in the opposite. 

The paper proposes Sundial, a set of techniques to tame the network-induced uncertainties. Sundial focuses on reducing the message propagation variability in the network.

Firstly, Sundial avoids indirect communication and only exchanges messages between adjacent neighbor nodes in the network topology. This eliminates routing uncertainty between nodes, and also buffer/queue uncertainty at the intermediate switches. 

Secondly, Sundial records the timestamps into messages at the lower level in the network stack. This ensures that the timestamp we are transmitting for synchronization has not been sitting in the local queue for too long, again reducing the variability. 

Thirdly, Sundial ensures that a single node is used as a source of truth for the current time. Since the nodes in the system cannot talk directly to the “source of true time”, the system constructs a tree communication topology starting with the TrueTime root and covering all nodes in the system. 

Fourthly, Sundial tames the unreliable clocks on the individual servers by doing very frequent synchronizations — once every 100 microseconds. 

A big portion of the paper is devoted to handling failures since a link or node failure prevents the updated time to reach any node in the subtree below the fault, that subtree may start to deviate more the TrueTime at the root node. The gist of the solution is to allow all nodes in the impacted branch to detect the synchronization failure and switch to an alternate tree structure that was precomputed ahead of time. As all impacted nodes perform the switch to a new tree locally, the coordination is avoided, and the process is very quick. An important point in having such a back-up plan is to make sure it is smart enough to avoid correlated failures that can render both the main and back-up trees broken. The paper has a lot more details on the fault tolerance aspect, including handling the failures of root nodes.

Combining all the Sundial’s techniques provides good time synchronization with fairly tight bounds. It achieves ~100 ns synchronization even under some failures, which is significantly better than PTP time synchronization (and even better than its precursor NTP?).


We had a nice discussion and questions, below I summarize the most important points.

1) Set of techniques. As outlines, Sundial is a set of techniques to improve the time sync, and there are some important lessons there. For example, doing things in hardware (or as close to hardware) is good. We start seeing (network) hardware optimizations for distributed systems more and more often. Just a few weeks ago we talked about smart switches and using them to drive replication and routing for “hot keys” in a storage system. Obviously, time synchronization is a different problem, but it is also the one to benefit from hardware a lot. Another lesson is to have a single source of time, even though it makes the communication pattern more structured and prone to failures. 

2) Better clocks/oscillators. Sundial synchronizes a lot – one message every ~100 microseconds. This is 10000 messages per second. We are not sure what impact this may have on the network (messages are small) and performance, but there is a practical reason for synchronizing this often. As Sundial aims to keep the uncertainty small (ε=~100ns), it cannot afford the cheap clocks to drift too much upon failures and needs to failover to a back-up tree quickly. This means that the system needs to have a super-tight timeout and very frequent message exchange. Better clocks/oscillators (or maybe using multiple clocks in a server?) can improve the situation here and either allow for even better synchronization or reduce the message frequency. Oven-controlled oscillators, for example, use a heated chamber to keep the crystal at the same temperature and reduce its drift due to the temperature variations. 

3) Comparison with PTP. The paper extensively compares Sundial with PTP protocol. The authors mention how PTP does not report ε, and that they had to augment the designs to provide the uncertainty in these protocols. The paper puts PTP’s uncertainty at ε=800μs. This contrasts with other literature but, where PTP is often reported as having a sub-nanosecond accuracy (is accuracy the same as uncertainty? but regardless, to have an accurate time, we need to have low uncertainty, otherwise how do we know it is accurate?), or nanosecond level accuracy. It is worth noting that PTP in these papers either required a dedicated low-load network for time synchronization or hardware with support of some advanced features needed for PTP to work well or both. 

4) Time sync in wireless sensor networks. Murat has spent quite some time describing how the same set of techniques was used 15-20 years ago to achieve microsecond level synchronization in the wireless sensor networks. The presentation has many fascinating details, but it appears that these techniques were known and used for some time, but not used in the data center setting. What was the blocker for doing this earlier?

5) New applications of synchronized time. Finally, we discussed a lot about the possible new applications of such precise time synchronization. The paper mentioned Spanner latency improvement as one benefit, but this is an “old stuff”. Actually, for many years we, the distributed community, were (self-)taught to not rely on time for anything critical. Sure, we use the time for things like leases and timeouts, but these are all “negative” communication that happens rarely, allowing us to be very conservative with the timeouts — there is a little harm if we add a few more seconds to a lease timeout that happens upon a leader failure and needed in rare reconfiguration cases. With super-tight synchronization bounds, we can try to use the time for positive communication and convey progress instead of the lack of one. This of course is challenging, since time is not the only uncertain variable in the system. All of our other “friends”, such as network uncertainty/variability, and crashes still exist, and we also need to tame them in some way to use the time for positive, active communication. 

For example, one may use a “silent agreement” that requires no acks from the followers if everything is going well. But this quickly falls apart under faults. However, one may treat a synchronized clock as an agreement itself and use it to drive the ordering in a multi-leader system. This may also fall apart if the network is too-asynchronous, and a message from one server that already applied the operation may reach the other follower too late – after it has irreversibly applied some other higher-timestamped operation. Taming the network asynchrony may be the next big thing to allow new usages of time in distributed systems!

The network latency vs time uncertainty is very important for constructing consistent snapshots. If time uncertainty is guaranteed to be smaller than the network latency, we can use the time to construct the consistent snapshots, since we can be sure that no message that breaks the causality can reach the other side within the uncertainty period. This, for example, can be useful for debugging. In my Retroscope consistent monitoring system, I use HLC to preserve the causality when uncertainty is too large, but having software clocks like HLC unnecessarily complicate systems. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!