Reading Group. Multitenancy for Fast and Programmable Networks in the Cloud.

We discussed “Multitenancy for Fast and Programmable Networks in the Cloud” in the 59th DistSys Reading Group meeting. In a sense, this was a continuation of a previous discussion we had a few months ago when covering the Pegasus paper.

Pegasus and many other new protocols rely on specialized programmable network hardware that is not yet available to regular cloud users. This is because such hardware has no support for multitenancy and cannot be efficiently shared and virtualized to allow cloud users to get a virtual slice of the hardware pie. This paper discusses enabling multitenancy on today’s smart network hardware, such as smart switches.

It is not an easy feat to pull off — these switches (in the paper, the authors used a switch based on Barefoot Tofino logic) have limited programmability, limited resources, and can run just one program. Of course, if you can have only one running program, it is hard to run code from different users/tenants. The paper works around this by compiling the code from all tenants together into one “jumbo-program” that can run on the switch. This tenant packing also solves an issue with resource sharing and isolation, as the switch’s hardware resources must be defined at compile time. Of course, there is also the issue of isolating the tenants’ code. This is done at compile time too, by renaming the tenant program’s objects, such as tenant-defined headers and table names, to tenant-unique names. A few restrictions are placed on tenants for the sake of isolation, and the paper goes into more detail on these. One important assumption is that the tenants are expected to be isolated based on their virtual networks (VLANs).
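To make the renaming idea concrete, here is a tiny Python sketch of how compile-time name mangling could isolate tenant-defined objects. The tenant programs, names, and data structures below are made up for illustration and are not the paper’s actual compiler.

```python
# Minimal sketch of compile-time tenant isolation via renaming.
# Tenant programs are represented as dicts of headers and tables;
# everything here is illustrative, not the paper's implementation.

def rename_tenant_objects(tenant_id, program):
    """Prefix every tenant-defined header and table with a tenant-unique name."""
    prefix = f"t{tenant_id}_"
    return {
        "headers": {prefix + name: fields for name, fields in program["headers"].items()},
        "tables": {prefix + name: rules for name, rules in program["tables"].items()},
    }

# Two tenants that happen to use the same names no longer collide
# once their programs are merged into the "jumbo-program".
tenant_a = {"headers": {"req_hdr": ["key", "op"]}, "tables": {"route": []}}
tenant_b = {"headers": {"req_hdr": ["ts"]}, "tables": {"route": []}}

jumbo_tables = {}
for tid, prog in [("A", tenant_a), ("B", tenant_b)]:
    jumbo_tables.update(rename_tenant_objects(tid, prog)["tables"])

print(sorted(jumbo_tables))  # ['tA_route', 'tB_route']
```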

In addition to being the glue that holds all the tenants together, the “jumbo-program” incorporates some “system” code to provide common functionality. For instance, the “system” program can process common packet headers. This allows it to read the VLAN ID and then invoke the proper tenant so that the tenant’s code can do its own header matching/parsing.

The “system” program of a “jumbo-program” also helps us deal with one tricky part here. See, the program on a switch needs to run at the network line rate and cannot delay any packets, so there is no way to defer processing. This means that the program runs as a pipeline of multiple stages with no way of returning to a previous one. The “system” program “sandwiches” the tenant programs and, in a sense, establishes some rules and bounds for the tenants.

I already mentioned that the resources are defined/allocated ahead of time. This is also true for memory, as programmers must hardcode the array sizes of stateful memory they want to use. However, the paper works around this by kind of building a virtual memory in the system program of the “jumbo-program”. The “jumbo-program” just allocates a huge array of memory in each pipeline stage and uses a configuration stage to store the memory boundaries for each tenant. Tenants use the boundaries as index offsets into the big array, essentially implementing virtual memory within the “jumbo-program”, and allowing for dynamic memory reallocation by adjusting these memory offsets.
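Below is a minimal Python sketch of this offset-based “virtual memory” idea: one big array per stage, plus a per-tenant (base, size) configuration that can be adjusted without recompiling. The sizes, tenant IDs, and API are made up; the real switch uses per-stage register arrays rather than Python lists.

```python
# Sketch of per-tenant "virtual memory" inside one large stateful array.
# Illustrative only: sizes, tenant IDs, and the API are invented here.

STAGE_MEMORY = [0] * 4096            # one big array allocated at compile time

# Configuration "stage": base offset and size for each tenant.
# Reallocating memory means changing these bounds, not recompiling.
bounds = {"A": (0, 1024), "B": (1024, 2048)}

def tenant_read(tenant, index):
    base, size = bounds[tenant]
    if index >= size:
        raise IndexError("tenant index outside its allocation")
    return STAGE_MEMORY[base + index]

def tenant_write(tenant, index, value):
    base, size = bounds[tenant]
    if index >= size:
        raise IndexError("tenant index outside its allocation")
    STAGE_MEMORY[base + index] = value

tenant_write("B", 5, 42)
print(tenant_read("B", 5))           # 42, physically stored at STAGE_MEMORY[1029]
```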

The paper has its HotCloud presentation. And as usual, we had our own presentation, this time by David Correa:

Discussion

1) Other approaches? The paper mentions some alternatives to the switches the authors used. For example, software switches are a lot easier to program, and FPGAs can handle some network functions as well. But, according to the paper, these are all slower. ASICs, like the Barefoot (now owned by Intel) Tofino, are a lot faster, but that speed comes with constraints: limited programmability, the pipeline processing model, and memory limitations (22 MB of SRAM) are all big limiting factors.

2) Hardware-specific solution. The solution presented here is rather specific to the hardware available now. This is cool, but I’d want to imagine what can be done in the near future. Can we improve the hardware to reduce these limitations? The authors conjecture that the pipelines will stay: “we expect ASIC-based pipelines—both for switches and NICs—to be more effective than other programmable architectures in terms of both cost and power consumption.” However, maybe better support for multi-tenancy can be baked into a new generation of devices at the hardware level, along with greater flexibility for adding/removing tenants.

3) Not quite cloud-ready? The major pain point of the solution seems to be the need to “mush together” all the tenants before compiling the program and deploying it on the switch. If tenants change (or a tenant’s requirements for some hardware resources change), the whole thing needs to be repackaged and redeployed. This sounds like a cumbersome process that may result in some downtime while the new “jumbo-program” is deployed on the network. Exposing this process as some kind of cloud service seems quite infeasible due to these issues. It would have been nice if more resources could be reallocated between tenants dynamically (of course, we are not experts here, and there may be some real hardware/performance limitation behind this very static approach). It would also be nice to have the ability to “side-load” new tenant programs into the “jumbo-program” runtime.

4) Incentives for the cloud providers? Smart network devices can greatly benefit those who have access to them, and we are starting to see many systems relying on them. However, right now, cloud providers are the ones with access to such hardware, and it may provide some advantage to these providers and their native services. As such, there may be little incentive to offer these resources in the cloud. That being said, this may also play out the other way around. If this becomes a very desired feature, the first cloud provider to offer these virtualized multi-tenant switches stands to benefit, and other providers will follow in a chain reaction of chasing the market.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a Slack group where we post papers, hold discussions, and, most importantly, manage Zoom invites to the meetings. Please join the Slack group to get involved!

Reading Group Paper List. Papers ##61-70

We have gone through our current list of papers, covering 10 interesting projects over the past 10 weeks. Now it is time to move on with a new set of papers that will carry the reading group all the way through the summer term.


Reading Group. Cerebro: A Layered Data Platform for Scalable Deep Learning

In the 58th reading group session, we covered “Cerebro: A Layered Data Platform for Scalable Deep Learning.” This was a short meeting, as our original presenter, unfortunately, had some other important commitments. There are actually two Cerebro papers: one appeared at CIDR and the other at VLDB’20. The main premise behind the system is that ML models are rarely trained in isolation. Instead, we often train multiple models concurrently as we try to find the optimal parameters for the best results. This hyper-parameter tuning is costly, as it relies on an exhaustive search for the best parameter configuration. Also, as the authors claim, most systems do not take this multi-model/config-same-data parallelism into account.

A trivial way to deal with running multiple jobs on the same data is to simply run multiple models/configurations as if they are totally independent. Of course, this can lead to data amplification, as we may need to copy the same data to different workers or even different systems. Alternatively, we can run these models one at a time — as one model/parameter configuration finishes, the next one starts. Cerebro proposes a different approach that allows models to run concurrently, avoids data amplification, and reduces communication while preserving good model convergence. The approach shards the data across workers and lets different models start on these different shards at the same time. Once a worker-full of data has been processed, the entire model (all the weights, etc.) migrates to a worker with another shard. The paper claims that this reduces the data-transfer demand in the cluster compared to approaches that must continuously update model weights (e.g., parameter servers).
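To illustrate the idea of models hopping between data shards, here is a toy Python sketch of a round-robin schedule: in each round, every worker trains a different model configuration on its local shard, and then the model state (not the data) moves to the next worker. This is just my illustration of the concept, not Cerebro’s actual scheduler.

```python
# Toy illustration of "models hop to the data": each round, every worker
# trains a different configuration on its local shard; after the round,
# model state rotates to the next worker while the data stays put.

workers = ["w0", "w1", "w2"]           # each worker holds one data shard
configs = ["cfg0", "cfg1", "cfg2"]     # hyper-parameter configurations

def one_epoch(configs, workers):
    n = len(workers)
    for rnd in range(n):
        # configuration i trains on worker (i + rnd) mod n in this round
        assignment = {configs[i]: workers[(i + rnd) % n] for i in range(len(configs))}
        print(f"round {rnd}: {assignment}")
        # ...train each model on its assigned shard, then ship only the
        # model weights to the next worker -- the shard never moves.

one_epoch(configs, workers)   # after n rounds, every config has seen every shard
```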

I am not an ML expert, and my improvised presentation was rather bad, so instead, I will link to a YouTube presentation by the authors, which explains the whole thing much better than I can do. 

Discussion

1) Still a money problem. Cerebro aims to improve the efficiency of hyper-parameter tuning, but it still seems like a money problem, as we need to throw more resources at the problem to exhaustively find the best model. Are there better approaches than such an exhaustive search? Some heuristics? Or can past experience on similar problems/data be taken into account?

2) Pruning. The paper(s) seem to train all models/configurations all the way to the end. Again, this is probably wasteful — if we see that some model/configuration does not converge fast enough, it makes sense to kill it early. The system seems to support this, and its design accommodates it, but the paper is not very explicit about it. In Cerebro, scheduling is done at the per-epoch level, so there is a chance to evaluate each model/parameter-config after each training epoch and adjust.


Reading Group. XFT: Practical Fault Tolerance beyond Crashes

In the 57th reading group meeting, we continued looking at byzantine fault tolerance. In particular, we looked at “XFT: Practical Fault Tolerance beyond Crashes” OSDI’16 paper.

Today’s summary & discussion will be short, as I am doing it way past my regular time. The paper talks about a fault tolerance model that is stronger than the crash fault tolerance (CFT) model of traditional state machine replication protocols, like Multi-Paxos and Raft, but slightly weaker than full-blown byzantine fault tolerance (BFT). The authors propose cross fault tolerance (XFT), which is a relaxation of the BFT model for partly synchronous networks. In particular, the XFT model assumes that at least a majority of servers are both correct and can communicate synchronously. This deviates from classical BFT, where the entire communication schedule can be byzantine. Naturally, the authors claim that such a byzantine network scheduler is rather difficult to orchestrate in many environments, so we do not need to account for it and can gain some performance in return.
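My reading of the XFT condition is that the machines that are simultaneously correct and synchronous must form a majority; equivalently, the combined number of crashed, byzantine, and partitioned machines must stay within a minority. The tiny sanity-check sketch below encodes that interpretation (mine, not code from the paper).

```python
# Sanity check of the XFT fault assumption (my interpretation of the paper):
# with n replicas, the system stays safe and live as long as the number of
# crashed, Byzantine, and partitioned (non-synchronous) machines combined
# does not exceed floor((n - 1) / 2).

def xft_ok(n, crashed, byzantine, partitioned):
    return crashed + byzantine + partitioned <= (n - 1) // 2

print(xft_ok(n=3, crashed=0, byzantine=1, partitioned=0))   # True:  1 <= 1
print(xft_ok(n=3, crashed=1, byzantine=1, partitioned=0))   # False: 2 >  1
print(xft_ok(n=5, crashed=1, byzantine=0, partitioned=1))   # True:  2 <= 2
```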

The paper then proposes XPaxos, a Paxos variant designed for XFT. I am not going in-depth on XPaxos. Like many BFT protocols, it relies on signed messages and involves more complicated communication patterns than CFT protocols. The two images below should give some hint of how XPaxos works.

XPaxos has a special 3-node configuration that is efficient from a communication standpoint, and it seems like this configuration can compete with a 3-node Multi-Paxos.

However, a more general XPaxos configuration requires more complex communication to handle byzantine nodes. Obviously, this is a lot less efficient.

Another complication in XPaxos is the view change, but hopefully, in the happy case, we do not need to change leaders/sync-groups too often, so the extra cost can be amortized over time.

Mohit Garg did an excellent presentation of the paper:

Discussion

1) Comparison with Protocol-Aware Recovery. Recently we looked at the Protocol-Aware Recovery paper, which assumes the possibility of arbitrary data corruption. Obviously, PAR considers a specific type of byzantine fault (such data corruption makes the node act out of spec by potentially sending bad data), while the XFT model is a lot more general. On the other hand, the PAR approach may be even cheaper to run and does not have a less efficient general case. But we think the spirit of the problems is similar: we need better ways to handle common failures that fall outside the traditional CFT model. The difference is between taking a more general approach, like XFT, and building a piece-wise solution for each non-CFT fault type, like PAR.

2) On 3-way replication. It seems that the only practical and fast configuration of XPaxos is the one with 3 replicas. This may limit some applications. However, many systems do stick with 3-replica deployments, for example, CockroachDB or YugabyteDB. One consideration with using just 3 servers is planned maintenance. When a system needs to be updated in a rolling manner, one node at a time must be taken out of commission, leaving the system vulnerable. But of course, we can solve this problem with reconfiguration and/or temporary operation with a bigger but less efficient cluster.

3) Further reading. Mohit has worked on a follow-up/extension to XFT.

We have also discussed a few examples of these out-of-spec/BFT problems in the wild. For instance, this one talks about data corruption in the Chubby lock service. 


Metastable Failures in Distributed Systems

Metastability is a stable state of a dynamical system other than the system’s state of least energy.

Wikipedia

Distributed systems often fail spectacularly and unpredictably. They are a cause of headaches and sleepless on-call nights for way too many engineers. And this is despite lots of effort to understand the failures, and all the tools and “best practices” we have to contain and/or prevent them.

Today I want to talk about failures that may occur on a nearly healthy system with no perceived major bugs or design flaws. We describe this type of failure in our HotOS’21 paper. This work is in collaboration with Nathan Bronson, Abutalib Aghayev, and Timothy Zhu. It is largely based on Nathan’s observations and experiences.

So how does a “healthy” system fail? Naturally, something needs to happen for a failure to occur. We call this something a trigger. Lots of things can act as a trigger (GC, network blips, software pushes, configuration changes, load spikes, etc.). Triggers are impossible to avoid in our multi-tenant unreliable systems. 

Triggers cause backlogs or errors and make the system temporarily less happy. In some cases, however, they cause the system to spiral into the abyss and practically halt any useful work or goodput. When the trigger is removed (it was transient, or removed by the engineers), the system does not recover by itself. So we end up in a peculiar situation where the condition that caused the failure is no longer present, all of the system’s components seem to be functioning correctly, all servers/nodes are up, the code is working, but nothing useful is actually getting done and goodput remains negligible.

We call this failure pattern a metastable failure, or rather we say that a system has entered a metastable failure state. The culprit behind metastable failures is a sustaining effect that prevents the system from leaving a bad/failed state even after the initial trigger is removed.

To understand more about the importance of triggers and sustaining effects, we need to look at how most distributed systems are deployed. Resource utilization is a major factor for large systems, as achieving better utilization can save a lot of money at scale. This means that systems often have little extra resources to spare when some unexpected load increase occurs. These systems operate in a metastable state that we call a metastable vulnerable state, as the system is vulnerable to a failure if enough of a trigger disturbs its stability.

Of course, having some unanticipated extra load applied to the system is not sufficient for a metastable failure to occur. In fact, even if the load pushes the system to run at its maximum capacity (or tries to run above the capacity), we should only see an increase in latency, and maybe some goodput degradation. When the extra load is removed, the system should return back down to normal operation all by itself. 

In a metastable failure, the extra unanticipated load activates a positive feedback loop that creates more load. This positive feedback loop is the sustaining effect that prevents the system from recovering when the initial trigger is resolved. We say that a system is in a metastable vulnerable state when it is possible to activate such a sustaining effect loop using a strong enough trigger. This state typically occurs at higher system utilization, as the system has fewer spare resources to absorb the extra load of the trigger without activating the positive feedback mechanism. The opposite of the metastable vulnerable state is the stable state, where the positive feedback loop is not possible, allowing the system to recover by itself when the extra load is removed. Note that a system in a stable state may still have a sustaining effect mechanism, but it is not strong enough to feed on itself and diminishes over time when the extra load is removed.

Let’s look at a hypothetical idealized example of a metastable failure. Consider a web application that uses a database capable of serving up to 300 queries per second (QPS). If more than 300 QPS are tried, the latency goes up by an order of magnitude, causing the web application to retry once after a 1-second timeout (when a retry fails, the system errors out). The web servers for the application produce a 280 QPS workload, which falls nicely below the database’s serving capacity.

Now let’s say there was a network glitch lasting a few seconds between the web servers and the database. After the network is fixed, all the delayed packets reach the database in a flood, exceeding the database’s capacity of 300 QPS. This, in turn, increases the latency above the retry threshold at the web application, causing any new, post-glitch queries to retry as well. So now we have a backlog of queries from the network glitch and an effective new query load of 560 QPS (280 queries from the workload + 280 query retries). With such a high sustained load, the database will not return to a normal state all by itself. The result is a maxed-out system that clears only 300 QPS of load (less than the amplified load) and produces very little goodput, since most of the queries take too long to process and time out or get discarded by the time the database returns them. This can continue indefinitely until either the offered workload drops below 150 QPS or the retry policy is changed/suspended.
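The numbers in this example are easy to check with a toy simulation. The sketch below is my own illustration of the retry feedback loop, not anything from the paper; the capacities, timeouts, and the assumption that any backlog pushes latency above the retry threshold all mirror the idealized example above.

```python
# Toy model of the example: while a backlog exists, latency exceeds the
# 1-second retry timeout, so every query is effectively sent twice
# (original + one retry), amplifying the offered load.

def simulate(workload_qps, seconds, glitch_backlog=600, capacity=300):
    backlog = glitch_backlog                     # queries queued during the glitch
    for _ in range(seconds):
        amplification = 2 if backlog > 0 else 1  # retries double the load
        backlog += workload_qps * amplification  # arrivals this second
        backlog = max(0, backlog - capacity)     # database drains up to 300 QPS
    return backlog

print(simulate(280, 60))   # backlog keeps growing: stuck in the failure state
print(simulate(140, 60))   # backlog drains to 0: workload under 150 QPS is stable
```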

Our simple example illustrates the key metastable concepts and principles. In a stable state, a system can (gradually) return to normal operation. In the example, a workload under 150 QPS is stable, since even with the workload amplification caused by the retry policy, the database will receive fewer new requests than it can handle each second, allowing it to gradually clear up the backlog and return to normal operation. While the system is clearing the backlog, its goodput, however, may remain compromised. Systems often operate in the vulnerable state because, by all metrics, the stable state looks like it is wasting resources, and the vulnerability is not known. This, however, leaves very little spare capacity to absorb the extra load spikes caused by trigger events. Once a system enters a metastable failure state, it will feed on itself until there is some sort of intervention. This intervention is costly, as it generally requires either reducing the offered load on the system or finding ways to break the positive feedback loops.

Another possible intervention approach is to increase the capacity of the system to push it into a stable state, but this can be challenging since the triggers are unanticipated. Such unpredictability makes it impossible to reconfigure to add capacity beforehand. And once the system is already in a metastable failure state, it may not have any resources to spare to run the reconfiguration procedure. In fact, a reconfiguration may even increase the workload amplification and make the whole situation worse.

In the paper, we discuss a few other hypothetical and anecdotal scenarios that are a bit more complicated. But the problem is not anecdotal by any means, as there are quite a few failure cases that surely look a lot like metastable failures “in the wild”:

  • Google App Engine Incident #19007. Trigger: Configuration change. Sustaining Effect: cascading load amplification. Fix: Reduce traffic level.
  • Amazon SimpleDB Service Disruption. Trigger: power loss crashing multiple servers. Sustaining Effect: load amplification due to a timeout. Fix: Change in timeout policy & introduction of additional capacity.
  • Cassandra Overload because of hint pressure + MVs. Trigger: rolling restart. Sustaining Effect: a few nodes could not catch up with hinted handoff, preventing them from fully joining and causing the system to generate more hints. Fix: Policy change – disable hinted handoff.

With metastable failures affecting real systems, we need to have more understanding of the problem and processes involved to develop better coping and prevention strategies. A recurring pattern in our experience is that changes meant to improve the common case behavior of a system tend to increase the strength of the sustaining effects. Fast paths, caches, retries, failover, load balancing, and autoscaling all make the failure state less resource-efficient relative to the normal state, which makes the feedback loop worse. Beware of very high cache hit rates!
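To make the cache warning concrete, here is a back-of-the-envelope calculation (my numbers, not the paper’s): if the backend is provisioned for the miss traffic of a 99% hit-rate cache, losing the cache multiplies the backend load by roughly 100x, which is exactly the kind of resource gap a sustaining effect can live in.

```python
# Back-of-the-envelope illustration of why very high cache hit rates
# strengthen sustaining effects. All numbers are made up for the example.

requests_per_sec = 10_000
cache_hit_rate = 0.99
miss_rate = 1 - cache_hit_rate

backend_load_warm = requests_per_sec * miss_rate   # ~100 QPS: what the backend is sized for
backend_load_cold = requests_per_sec               # 10,000 QPS on a cache flush / cold restart

print(round(backend_load_warm))                      # 100
print(round(backend_load_cold / backend_load_warm))  # 100x overload factor
```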

Metastable failures are over-represented in large site outages because the strength of the sustaining effect depends on scale. For example, if the feedback loop requires overloading a network fabric then small-scale stress tests will never trigger it. The sustaining effect may also act as a means of contagion so that the problem spreads across machines or shards. This means that the first occurrence of a novel metastable failure may be a major outage even in a hyper-scale distributed system.

Current approaches for handling metastability often lack the full comprehension of the problem and its causes. For example, engineers often focus on the trigger that causes the failure and fail to realize the complicated positive feedback loops that are responsible for the scale of the failure. Fixing a trigger is a temporary solution that may only push the system higher into the metastable vulnerable zone and make the next crash even more severe.

Unfortunately, replicating the failures and feedback loops is difficult, as many of the issues only manifest themselves at scale. This makes it even harder to fully understand the failures and develop efficient techniques for dealing with them. Furthermore, predicting the possibility of failure is difficult too. For instance, one can look for unexpected performance variations and try to correlate them with other things going on in the system to learn the potential future triggers, but this still does not give the full predictive power of knowing when a failure may happen. Improvements in our ability to predict and avoid metastable failures will also translate directly into efficiency gains because they will let us operate systems closer to their natural performance limits.

I think this is an exciting area for research. As the industry makes bigger and bigger systems and pushes them to work as cheaply as possible, we need to develop a proper understanding of how these critical systems fail at scale so we can continue improving the reliability.

Reading Group. chainifyDB: How to get rid of your Blockchain and use your DBMS instead

Our recent meeting focused on Blockchains, as we discussed “chainifyDB: How to get rid of your Blockchain and use your DBMS instead” CIDR’21 paper. The presentation by Karolis Petrauskas is available here:

The paper argues for using existing and proven technology to implement a permissioned blockchain-like system. The core idea is to leverage relational SQL-99 compliant DBMSes that are very common in various IT architectures. The participants in the blockchain can then leverage their existing software stacks to implement secure data sharing or replication. Moreover, the authors argue that in practice, different participants may run DBMSes of different versions or even different vendors, making the problem of syncing data between participants more challenging, as different versions/vendors may execute the same SQL transactions slightly differently. To address these heterogeneity issues, the paper proposes a “Whatever-Voting” (WV) model, where the system achieves consensus on the results of a transaction, and not on the transaction itself. This allows each node/organization to run the transaction in “whatever” way possible, as long as the final result of the transaction is agreed upon by a quorum of participants in the “voting” phase. If the voting phase does not pass, the node with a mismatched result must roll the transaction back (potentially more than one) and try to recover by reapplying it. To me, this seems a bit naive — if a node could not produce the proper result for a correct, non-malicious transaction due to vendor differences or “whatever” execution differences, it may get stuck in an infinite loop of rollbacks and reapplies until its “whatever” phase is fixed to produce the correct result for that specific transaction. This can be a very error-prone, engineering-intensive process of constant fine-tuning and micromanaging the components.

  • Step 1: Client creates a Proposal from an SQL transaction.
  • Step 2: Early abort of bad proposals.
  • Step 3: Proposal is sent for ordering after all participants verify that it can be executed.
  • Step 4: Orderer produces a block from multiple transactions/proposals.
  • Step 5: Execution against local DBMSes.
  • Step 6: Commit phase after the block is written to the DBMS.
  • Step 7: Voting to check that the results are identical in a quorum of nodes/organizations.
  • Step 8: Ledger entry is generated if the agreement in Step 7 is successful.

Since the proposed solution does not reach an agreement on transactions, the paper argues that it does not need complicated protocols to ensure that all correct nodes have correct transactions. If some entity issues different transactions to different participants, the results won’t match, missing the quorum agreement and causing a system-wide rollback and retry. However, the voting scheme should still be resilient to some cheating — a node must provide some proof or digest of the authenticity of its results.
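A minimal sketch of what result-digest voting could look like: each organization executes the block its own “whatever” way, then votes with a hash of the resulting rows, and the result stands only if a quorum of digests match. This is my illustration of the idea; chainifyDB’s actual digest computation, quorum rules, and signatures are not shown.

```python
import hashlib
from collections import Counter

# Toy illustration of result voting: nodes vote with a digest of the rows
# produced by a transaction block. Serialization, signatures, and the real
# quorum/ledger logic are all simplified away.

def digest(rows):
    return hashlib.sha256(repr(sorted(rows)).encode()).hexdigest()

def vote(digests, quorum):
    """Return the agreed-upon digest if a quorum of nodes produced it, else None."""
    value, count = Counter(digests).most_common(1)[0]
    return value if count >= quorum else None

node_results = [
    [("alice", 90), ("bob", 110)],   # node 1
    [("bob", 110), ("alice", 90)],   # node 2: same rows, different order
    [("alice", 95), ("bob", 105)],   # node 3: diverged, must roll back and recover
]
digests = [digest(r) for r in node_results]
agreed = vote(digests, quorum=2)
print(agreed == digests[0], agreed == digests[2])   # True False
```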

The authors argue that their system does not have a centralized authority. However, it still has a centralized control component that batches the transactions and orders them. This component can be untrusted for safety, since the participants should catch mismatched batches and roll the state back as needed. That being said, a byzantine centralized batcher/sequencer (the paper calls it the orderer) can cause liveness issues by sending garbage to participants and causing constant recoveries. The paper does not specify the process of electing/replacing the orderer, so I think in practice it must be trusted for liveness purposes.

There is a lot more machinery involved in making the scheme work. For example, each node must remember the transaction batches to create the ledger. This also happens within the DBMS, and some precautions must be taken to ensure that the ledger is tamper-proof. The rollback procedure is also complicated and must be carefully implemented to have good performance.

The authors conduct their evaluation on a limited scale with just 3 nodes/DBMS instances and show good scalability compared to Fabric and Fabric++. The evaluation focuses on increasing the number of clients interacting with the DBMSes, and not the number of DBMS instances. The paper does not mention the possibility of Byzantine clients. Moreover, the evaluation does not consider any Byzantine cases at all.

Discussion

1) Paper Quality. Our group’s collective opinion about the quality of the paper was not very enthusiastic. The paper’s premise is interesting and practical; however, it has way too many holes, missed promises, and unsubstantiated claims. These range from an inadequate evaluation, to relying on a centralized, irreplaceable “orderer” in a decentralized system, to an underspecified voting protocol, to glaring engineering holes in the “whatever” model, which is prone to liveness issues with bad implementations. There are too many to list them all, so I will mention just one in a bit of detail.

A big undelivered promise is using the DBMSes as a “black box.” While the DBMS software does not need any modifications, the applications running on these DBMSes must be adjusted to accommodate the WV model. Although these adjustments may be relatively small (i.e., making the client talk to the chainify server instead of the DBMS), it appears that the “whatever” stage may need tweaks on a per-transaction-type basis if the DBMSes are very different. This additional logic and tweaking is hardly a black-box approach from the application’s point of view. Other concerns here involve privacy. What if some tables at one organization are not supposed to be shared, but transactions/queries from this same entity touch both shared and private tables? Addressing these may require even bigger application changes and rewrites. And of course, this is not to mention all the additional tables chainifying introduces to the database.

2) Preventing Byzantine Behavior vs. Detecting Byzantine Behavior. An interesting point the paper raises (and I wish it was explored more) is whether we need to prevent byzantine actions from taking place or just need to be able to detect them. Unlike all (most?) blockchains, chainifyDB does not prevent bad behavior from executing; instead, it tries to detect misbehavior and fix it with a reset-and-restore procedure. Is this approach cheaper on a fundamental level? What applications can it support? Obviously, chainifyDB argues that it supports permissioned blockchains, but we’d love to see more rigor and research in this direction.

There are some potential issues with this approach, and the paper does not give many details on them. For example, a transaction may execute on all nodes and produce matching result digests for voting. However, if the transaction itself is malicious, and, for instance, tries to create value or double-spend something, then the correct nodes must have a mechanism to abort such a transaction before its execution (or detect and reset after?). Result voting may not be sufficient here. Of course, it is possible to implement some pre-execution sanity-check logic to verify that the transaction is compliant with all the rules and constraints. It appears that in chainifyDB this stage happens early and requires every participant’s vote (see Step 2 in the figure in the summary), but the paper is not very clear on the safety implications of this vote/abort decision collection. Also, what about fault tolerance if we need all nodes for this? Additionally, this, in a sense, creates some pre-execution voting that the paper claims to avoid.

3) Issues of Trust. This is a bigger discussion on the different types of storage systems and the applications that need them. The most popular type of data system is the one that trusts a single participant/entity. This is not a problem for an in-house database, for example. Many cloud data-intensive services rely on this model, where clients trust the cloud provider and do not worry about the provider becoming malicious. This even works if the clients distrust each other, as long as they all trust the service provider. The latter is often sold as some form of a permissioned blockchain, but fundamentally it is just a database with an append-only log where safety is ultimately ensured by the trusted provider. Truly permissioned blockchains have no single trusted entity; however, a relatively small set of participants can be trusted as a group. The permissioned nature prevents Sybil attacks, in which many fake malicious participants are spoofed to overtake the network, allowing small groups to be trustworthy despite containing malicious individuals. Finally, open blockchains are completely untrusted environments, where only the entire collective of all participants can be trusted. The figure here illustrates this.

One discussion question we had is when to use a permissioned blockchain and why not stick with open-access protocols instead. Of course, permissioned blockchains or BFT protocols are cheaper to run than their open-access counterparts and often have better latency and higher throughput. But is it safe to place trust in a small group of participants? What applications are ok with that level of trust?


Reading Group. Protocol-Aware Recovery for Consensus-Based Storage

Our last reading group meeting was about storage faults in state machine replication. We looked at the “Protocol-Aware Recovery for Consensus-Based Storage” paper from FAST’18.

The paper explores an interesting omission in most state machine replication (SMR) protocols. These protocols, such as (multi-)Paxos and Raft, are specified with the assumption of having a crash-resistant disk to write the operation log and voting metadata. This disk data allows crashed nodes to restart safely. However, real life gets in the way a bit, as infallible storage is about as real as unicorns.

Storage may fail in peculiar ways: some data may get corrupted while most other data is correct and the server itself continues working. The problem is handling such failures. The simplest way is to treat the server as crashed. However, the server must then remain crashed, as restarting may lead to even more severe state corruption when the server replays operations from a faulty log. The paper talks about a variety of other approaches taken to deal with these data issues. The authors state that all the mechanisms they have explored were faulty and led to liveness or safety issues. I personally do not buy such a blanket statement, but a few of the examples in the paper were really interesting.

The paper then suggests a solution – Protocol-Aware Recovery (PAR). The main point here is to avoid ad-hoc solutions because they are either slow, unsafe, complicated, or all of the above. This makes sense since such a big omission (potential for data-corrupting disk failures) in protocols should be addressed at the protocol level. The paper draws heavily on the Raft state machine protocol and develops the recovery procedure for it.

The log recovery is leader-based and can be broken down into two sub-protocols: follower recovery and leader recovery. The followers are recovered by restoring the data from the leader, who always knows the entire committed history. Leader recovery is a bit more tricky and occurs as part of a leader election. Of course, if a non-faulty node can be elected leader, then recovering faulty nodes is easy with follower recovery. However, the leader election requires a node to have the most up-to-date log to become a leader, limiting the selection of nodes for the job. That being said, a node can be elected with a corrupted log, but it needs to recover the corrupted entries from the followers. If an entry is not available on any of the followers, the state machine becomes stuck (as it should). The protocol only recovers committed log entries and follows Raft logic to discard a non-committed log suffix if it has corrupted entries.
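A very rough sketch of the follower-recovery idea is below: a follower checks its committed entries (here, with a per-entry checksum) and fetches replacements for corrupted or missing ones from the leader, who holds the full committed history. This is a heavy simplification for illustration: the actual PAR protocol distinguishes corruption from legitimately missing data and deals with on-disk formats and uncommitted suffixes.

```python
import zlib

# Rough sketch of leader-based recovery of committed log entries.
# Checksums, entry format, and the recovery flow are simplified for
# illustration; this is not the paper's actual protocol.

def make_entry(index, term, command):
    payload = f"{index}:{term}:{command}"
    return {"index": index, "term": term, "command": command,
            "crc": zlib.crc32(payload.encode())}

def is_corrupted(entry):
    payload = f"{entry['index']}:{entry['term']}:{entry['command']}"
    return zlib.crc32(payload.encode()) != entry["crc"]

def recover_follower(follower_log, leader_log, commit_index):
    for i in range(commit_index):
        if i >= len(follower_log) or is_corrupted(follower_log[i]):
            follower_log[i:i + 1] = [leader_log[i]]   # fetch from the leader
    return follower_log

leader = [make_entry(i, 1, f"op{i}") for i in range(5)]
follower = [dict(e) for e in leader]
follower[2]["command"] = "garbage"                    # simulate on-disk corruption
recover_follower(follower, leader, commit_index=5)
print(all(not is_corrupted(e) for e in follower))     # True: corruption repaired
```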

In addition to log recovery, the paper also talks about snapshot recovery. The idea behind snapshot recovery is to make sure all nodes take the same snapshots at the same index in the log, break them into “chunks,” and recover chunks as needed from other nodes.

Here is the presentation by Rohan Puri:

Discussion

1) The need for logs? The paper assumes that a state machine takes periodic snapshots to a disk/drive, and such a snapshot, in combination with the log, can be used for node recovery later. This implies that the actual current state of the state machine can be lost due to a server restart. However, some state machines are directly backed by the disk, in essence representing a rolling snapshot that gets updated every time an operation from the log is applied. Recovery of such a disk-backed state machine can be quicker and requires only the log entries from after the crash/restart. Of course, this does not mean that the disk-backed state machine itself cannot be corrupted. In any case, the log entries are required for recovery and can be garbage collected once all nodes have persisted the state machine to disk (either as part of normal operation or a snapshot), making the window during which log entries remain useful relatively small.

A more interesting problem may arise in trying to recover a corrupted state machine. If we rely on this “rolling-snapshot,” disk-backed state machine, the mechanism the paper uses for snapshot recovery won’t work, since different copies of the state machine may be misaligned ever so slightly. Of course, one can always do the costly node-restore procedure — restore to some prior snapshot and replay the log, but this is wasteful and requires keeping an extra snapshot and the log from the snapshot onwards. In the spirit of the paper, we should rely on distributed copies instead and be able to repair the corruption without storing redundant copies on the same server.

2) Persistent memory vs RAM and recovery for in-memory SMR. If we build a state machine replication (SMR) to work purely off RAM, then we do not have the luxury of retaining any state after a restart. As such, in-memory state machines must have different mechanisms to ensure safety. For example, in traditional Multi-Paxos with a disk, a node always remembers the current term/ballot and past votes it has participated in. Without durable memory, a node restart erases the previous voting state, allowing a node to vote on something it has already voted on before, but with a lower term/ballot. This is not safe and may lead to a double-commit on the same log entry when a node promises to some new leader, and then after restart makes a second promise in the same log index to some older leader. 

Allowing for corruption in persistent memory is somewhat similar to not having persistent memory at all, at least when dealing with crashes/restarts. The very piece of data/metadata we need to ensure safety and avoid double voting, as in the example above, may be corrupted and unusable after a restart. However, the same precautions used for in-memory replicated state machines will work with corrupted storage as well and allow for safe recovery. For example, to prevent the double-voting scenario, a recovering node needs to run a “mock” leader election (or a leader election with a term guaranteed to not succeed). Such a leader election ensures the node gets a proper view of the current ballot/term in the cluster, making sure it no longer votes for proposals from prior leaders. After such a mock election, the node can start accepting/voting for log entries while recovering any prior log and/or state machine from the other replicas. Of course, the full recovery completes when enough data is shipped from other nodes (i.e., snapshots + missing log entries).
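Here is a toy sketch of that “mock election” recovery step as we discussed it: the recovering node asks a majority of its peers for the highest term/ballot they have seen and then refuses anything older. This is just the discussion’s idea in simplified form (it glosses over the full safety argument and the real message exchange), not code from the paper.

```python
# Toy sketch of the "mock election" recovery step: before accepting or
# voting on anything, a node with lost/untrusted durable state learns the
# highest term/ballot from a majority of its peers. Simplified: the
# recovering node's own (forgotten) state is not counted toward the quorum.

def mock_election(peer_terms, cluster_size):
    majority = cluster_size // 2 + 1
    if len(peer_terms) < majority:
        raise RuntimeError("cannot recover safely without a majority of peers")
    return max(peer_terms)

# A recovering node in a 5-node cluster hears from 3 peers:
min_term_to_respect = mock_election([7, 9, 8], cluster_size=5)
print(min_term_to_respect)   # 9 -- ignore anything from leaders with term < 9
```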

There are a few differences between RAM and persistent storage when it comes to recovery. First of all, while it seems like both can lose data (one due to a reboot, another due to some random corruption), persistent storage still has a hint of data being missing. This is like not remembering what the node has voted for or who was the leader, but still having a 6th sense that something was voted upon. This extra piece of information may be useful in recovery, and indeed the protocol from the paper takes advantage of that to improve fault tolerance and safety. The recovery protocol preserves safety when the majority of nodes fail at the same log index, as the protocol knows something is missing entirely and will halt for safety. In the RAM setting, a mass reboot (i.e. majority of nodes) leads to a collective loss of memory without any hint that something may have been agreed upon, leading to a rewrite of the log. 

The second difference is that persistent memory may not lose all the data, so fewer items must be shipped from the followers. 

3) Leader-bound recovery. The paper suggests recovering followers from the leader node. This can put more load on the leader, who is already a bottleneck in the protocol. It seems like it may be possible to recover committed log entries from followers (the paper already does so for leader recovery) to make the recovery procedure less demanding for the leader.

4) Byzantine. The paper touches a bit on this topic. Data corruption on disk can be viewed through the lens of Byzantine fault tolerance. The corruption causes a node to act outside of the protocol specs, and byzantine-tolerant protocols are designed to handle such “out-of-spec” behaviors. The paper is a good example of how we can often solve some specific types of byzantine behaviors without resorting to the full-blown PBFT-style solutions. This is very practical, as we want the state machine to handle data corruptions, but we do not want to pay the performance penalty associated with BFT protocols. 

5) Likelihood of data corruption. Another point of discussion was the likelihood of such data faults happening. It does not seem like they are too frequent, but they do happen. We touched on a few anecdotal occurrences. For example, some firmware issues can cause a disk to not write some large buffers of data.

It is also worth noting error correction. Error correction is standard for server-grade memory, and it comes at a relatively small monetary/performance cost. Similar error-correction technologies are used in disks and drives, allowing for small errors (i.e. a bit-flip) to be fixed by the drive. In fact, NAND flash SSDs rely on error correction in normal operation.

6) Infallible disk. Protocols assume the disk is always correct. Why? Even on the surface, this does not seem like a particularly solid assumption, especially at the scale of millions of SMR instances deployed across millions of machines.


Reading Group Special Session: Distributed Transactions in YugabyteDB

When: May 11th at 12:00 pm EST

Who: Karthik Ranganathan.

Karthik Ranganathan is a founder and CTO of YugabyteDB, a globally distributed, strongly consistent database. Prior to Yugabyte, Karthik was at Facebook, where he built the Cassandra database. In this talk, Karthik will discuss Yugabyte’s use of time synchronization and Raft protocol along with some optimizations that enable high-performance distributed transactions.

Abstract

ACID transactions are a fundamental building block when developing business-critical, user-facing applications. They simplify the complex task of ensuring data integrity while supporting highly concurrent operations. While they are taken for granted in monolithic SQL databases, most distributed DBs forsake them completely.

Fortunately, this is no longer the case. The trend started with Google Spanner, which offered distributed transactions using GPS-based atomic clocks – unheard of in the database world before. Now, distributed transactions – without requiring atomic clocks – are offered by distributed SQL databases. One such example of a fully open-source database offering this is YugabyteDB. Using the example of YugabyteDB, this talk will explain how distributed ACID transactions can be achieved without atomic clocks – without compromising on performance.

Reading Group. Paxos vs Raft: Have we reached consensus on distributed consensus?

In our 54th reading group meeting, we were looking for an answer to an important question in the distributed systems community: “What about Raft?” We looked at the “Paxos vs Raft: Have we reached consensus on distributed consensus?” paper to try to find the answer. As always, we had an excellent presentation, this time by A. Jesse Jiryu Davis.

The paper compares Multi-Paxos and Raft protocols, but it does so through the lens of Raft and uses Raft terminology to describe Paxos. This can be very handy for people who first learned consensus through Raft. The paper mentions that the two protocols are very similar, especially in the “happy-case” operations. The important differences come in the leader-election phases of the protocols. In Raft, a new leader must be a fully caught-up replica with the longest log, while Multi-Paxos can pick any node as a leader and recover missing log entries. The authors argue that this Raft behavior is good for efficiency — the new leader can start quickly since it does not need to learn any missing entries. 
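Raft’s notion of a “fully caught-up” candidate boils down to comparing the term and index of the last log entry, roughly as in the sketch below (a paraphrase of the voting check from the Raft paper, with everything else about elections omitted).

```python
# Paraphrase of Raft's "at least as up-to-date" check used when a voter
# decides whether to grant its vote to a candidate. Terms, RPCs, and the
# rest of the election machinery are omitted.

def at_least_as_up_to_date(cand_last_term, cand_last_index,
                           voter_last_term, voter_last_index):
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term   # later term wins
    return cand_last_index >= voter_last_index    # same term: longer log wins

# A voter whose log ends at (term=3, index=10) rejects a shorter candidate log:
print(at_least_as_up_to_date(3, 8, 3, 10))    # False
print(at_least_as_up_to_date(4, 2, 3, 10))    # True: higher last term
```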

When it comes to understandability, the paper says that Raft is “slightly more understandable than Paxos.” I suppose this comparison comes after expressing/explaining Paxos in Raft terminology, and not based on the original explanations. 

Personally, Paxos is more understandable to me, since I absolutely hate the quirk of Raft where a majority-accepted but uncommitted value may get “chopped off” upon some intricate leader churn (see Figure 8 in the original Raft paper). This ties into the sequential commit requirement in the normal case and the different commit rules that apply upon leader churn, which all tie back to the leader election procedure itself. How is that more understandable? In contrast, Paxos never loses a majority-accepted value and has one commit rule, although it allows committing out of order (the execution of committed operations still must follow the log order).

I won’t go further into summarizing the paper; it is a good educational read, so I do not want to spoil it.

Discussion.

1) Original Paxos Papers as a Source of Confusion? Needless to say, Lamport’s original Paxos paper does not have the clearest description of the algorithm. This somewhat extends to Lamport’s Paxos Made Simple paper and Paxos Made Moderately Complex by Robbert van Renesse. However, the relative difficulty of understanding Paxos from these later papers may not lie in the algorithm’s description itself, but in the language used. The Raft paper is written with an engineering audience in mind and operates with primitives, like RPCs, that can be used in programming languages right away. Raft is described in operational concepts, such as server states with clear boundaries and transitions between states — first do leader election, then do log replication, and finally go back to leader election if needed. These may have great appeal to students and professionals. We teach computer science from the code first (it is a separate discussion whether this is the right way to approach distributed systems), and describing a protocol closer to code has a certain appeal.

Multi-Paxos, in contrast, is described in more abstract terms and is not as “operational.” For example, the leader election gets blurred a bit with replication, as the leader may need to fill some slots before becoming fully active. Transitions back to leader election are also a bit murkier compared to Raft.

I think it is great to have this abstract Paxos algorithm that we can shape and implement in a variety of ways. This leads us to uncover new ways it can be refined and/or implemented — take a look at the Flexible Paxos result.

2) Reference Implementation(s). To continue the previous point, a great appeal of Raft is its many (reference) implementations. This again appeals greatly to engineers, who can look at the code, run it, trace through breakpoints, and learn it hands-on. Another point that was mentioned in the discussion is a kind of herd effect: once a good production-grade implementation is available, more people will just take it and use it.

3) Leader election efficiency. We spent a bit of time discussing the leader election efficiency point from the paper. This is an important feature that may come in handy for disaster recovery performance. The question is how realistic it is to have followers that significantly lag behind. These lagging followers may put pressure on the leader, as the leader cannot compact the log while some follower is still catching up on it, which may cause higher memory and/or storage consumption. But of course, this applies only to severely lagging machines, and the performance hit from catching up after a leader election can be noticeable even with less severe staleness. In the discussion, it was mentioned that having hours of staleness is possible in some systems!

On the other hand, the Cloudflare outage is a good illustration of how Raft’s leader election fails at liveness unless we add a bit more complexity to the protocol (and remove some understandability?). So its good performance may get compromised under some conditions. Here is the paper on this too: “Examining Raft’s behaviour during partial network failures.” And no surprise, it is by Heidi Howard.

4) Is Raft an Implementation of Multi-Paxos? Given the more abstract description of Multi-Paxos and the more direct (and less flexible?) approach taken by Raft, one may think that Raft is, in a sense, an implementation of Multi-Paxos. Both protocols operate on majority quorums and have a stable leader. They have the same communication pattern, as the leader talks to the followers. Differences appear in the leader election, but are these differences big enough? Can one implement something Raft-like from an abstract Paxos blueprint?

The differences may appear big, but they can also be viewed as a refinement of Paxos.

  • Can I elect a leader without having to copy data from followers? Sure, if you pick the most up-to-date follower to be the leader. 
  • How do I know which one is the most up-to-date? Well, look at the latest committed log item. 
  • Wait a sec, in Paxos I can commit out of order, so which one is the latest committed log item? Just don’t commit out of order… 

5) Porting Paxos Optimizations to Raft. As mentioned in our presentation, another paper tried to answer the Paxos vs Raft question. That paper, titled “On the Parallels between Paxos and Raft, and how to Port Optimizations,” looked at a number of Paxos optimizations and tried to figure out whether they can apply to Raft. The authors there concluded that, for the most part, they apply. They also noted how the two protocols are similar and can be made more similar by tweaking the implementation of Multi-Paxos to be more Raft-like (which goes back to point (4) above).

6) Industry Use of Raft vs Multi-Paxos. The industry loves Raft, for the reasons I already mentioned above: reference code, good existing libraries, and a description that is closer to the code. One notable company that has been using Multi-Paxos is Google. Their Spanner database is based on Multi-Paxos. However, its open-source cousins (CockroachDB and YugabyteDB) rely on Raft. We were wondering how much more difficult it was to implement Multi-Paxos in production compared to Raft in these similar databases.

Easy adoption is a great point for use in the industry, especially when businesses lack the expertise to use formal methods to check their protocols or protocol variants. However, this may be changing now, as formal methods, like TLA+, are picking up more widespread adoption. We see that companies start to modify their Raft protocols in a variety of ways, and remain confident in their safety and liveness properties through model-checking. This adoption of formal methods may bring Multi-Paxos and Raft closer together, as a more abstract way of thinking about these protocols may highlight their similarities. For example, Jesse mentioned how they use TLA+ to build confidence in their variant of Raft.

7) Other Consensus Protocols. Another important topic we discussed relates to other consensus protocols. There are plenty of consensus-based replicated state machines, yet we still pretty much use Paxos or Raft. Solutions like EPaxos and its extension have been around for some time. Is the problem that they are hard to implement? Are they difficult to understand as well? EPaxos is not an easy protocol for sure, but it has a reference implementation. Or maybe these protocols are not as good in practice as the papers promise…

8) Teachability. If the community has a good understanding and description of Paxos, why not use it in classrooms and textbooks? There are not that many good distributed systems textbooks, and it seems like a lot of faculty prefer to teach from papers and their own notes. The same goes for engineering folks who have to pick up on distributed systems state-of-the-art through an extensive literature survey. 

We have mentioned a few sources to help with distributed systems, such as the “Designing Data-Intensive Applications” book by Martin Kleppmann or the MIT distributed systems course.


Reading Group. New Directions in Cloud Programming

Recently we have discussed a CIDR’21 paper: “New Directions in Cloud Programming.” Murat Demirbas did the presentation:

Quite honestly, I don’t like to write summaries for this kind of paper. Here, the authors propose a vision for the future of cloud applications, and I feel that summarizing a vision often results in the misinterpretation of that vision. So I recommend reading the paper to draw your own unbiased conclusions. 

That being said, here is my extremely high-level take on the paper in a few points:

  • Application developers should focus on application logic, and not worry about implementation aspects of consistency, durability, availability, etc. 
  • This does not mean that developers do not care about consistency, availability, or redundancy at all. Instead, they simply should know what they need and let the cloud provide it. As such, developers should declare their consistency, availability, budgetary needs, etc., and have the cloud runtime enforce such declarations. This will free up the programmers to focus on the application logic and keep this logic “unsoiled” by the other aspects of the distributed app.
  • To help developers focus on their applications/tasks, we need domain-specific languages (DSLs). DSLs can hide a lot of “mechanical” work from the programmers and delegate it to the cloud runtime. A good example of a popular DSL we have been using for a very long time is SQL. It is declarative — programmers retrieve and update the data without worrying about how it is done under the hood. 
  • Despite potentially having many DSLs, we still want one comprehensive framework to run them on, so the visionary system here can compile DSLs to some common Intermediate Representation (IR). The authors want the IR to be human-readable and optimizable, although I feel like this requirement is part of the “evolutionary” theme in the paper, and eventually, the importance of human optimizations may diminish.
  • Achieving this highly declarative vision is hard, but the paper lists several developing and emerging techniques and research directions that may help evolve the cloud. 

Discussion. 

1) DSL. We spent quite some time discussing DSLs and what it means to have many of them. For one, we already have a few successful ones, like SQL. Arguably, ML/AI and data processing systems also often have some form of DSL. TensorFlow was brought up as an example. One minor concern expressed in the group is that having many of these DSLs requires someone to make and maintain them. A more interesting DSL question is how specialized they should become. To bring up the SQL example again, while it is great for what it does, it is rarely used all by itself. So there will be a clear need to mix and match these highly specialized DSLs, potentially making the problem of translating them to the IR more difficult.

2) IR. A big part of the Hydro system vision is the IR language. The language itself may get rather complicated when it needs to support many DSLs. A bigger challenge may be translating DSL logic into human-readable IR code. The translations must make sense to engineers; the logic should be clear, not obscured, so that people can make sense of it. This pursuit of human readability may result in a less performance-efficient IR. The readability may also depend on the individual DSLs.

Another point we discussed is whether programmers will just start writing code directly in IR if it is a good, readable, feature-rich language. Maybe this is exactly what the programmers need after all? A language made specifically for distributed applications.

3) How much of this is already in the cloud? DSLs exist, and the serverless cloud is developing too, providing more consistency and durability than before. For example, Azure Durable Functions save their intermediate state and can be resumed in the face of failures. And surprisingly, many of these cloud offerings, like serverless compute, durable functions, and serverless storage, are easy to use. Last semester I gave a project in my Cloud Computing Systems course that used blob storage, serverless functions, and durable functions. To my surprise, the students loved the project and were able to figure out all of this tech (which they had to do on their own, since the tech aspect was not really part of the problem they were solving) in just a few days. So as it stands right now, the cloud is evolving quickly, with serverless computing and storage becoming more and more capable. It is not a coherent single cloud runtime just yet, and probably won’t be any time soon, but some aspects of the vision are there. Users can scale serverless compute without worrying about its availability, may opt into more durable options when needed, may use cloud-native storage with configured/declared consistency, and can take advantage of DSLs for many tasks in the cloud, like data management, ML/AI systems, etc.

4) Drivers of innovation? An interesting discussion happened at the end of our meeting. Some expressed the opinion that cloud vendors should know best in which direction to develop the cloud, since they are in constant interaction with the clients and must adjust to what clients are asking for. I personally disagreed with this opinion — cloud clients are not thinking about long-term visions like the one this paper describes. Instead, they may have more immediate concerns that must be dealt with given all the technology they already use in the cloud. An example I used is the invention of the GUI at Xerox PARC. The vision was out there, but nobody was really asking for it back then; even Xerox did not really know what to do with it and willingly let others copy the ideas. Yet, this innovation made modern consumer electronics/computing what it is today. I suspect that if Xerox had been asking clients what to improve, they might have worked on something as boring as a console with 120-character lines instead of 80-character ones to make existing systems more “user friendly.”
