Tag Archives: avalability

Reading Group. New Directions in Cloud Programming

Recently we have discussed a CIDR’21 paper: “New Directions in Cloud Programming.Murat Demirbas did the presentation:

Quite honestly, I don’t like to write summaries for this kind of paper. Here, the authors propose a vision for the future of cloud applications, and I feel that summarizing a vision often results in the misinterpretation of that vision. So I recommend reading the paper to draw your own unbiased conclusions. 

That being said, here is my extremely high-level take on the paper in a few points:

  • Application developers should focus on application logic, and not worry about implementation aspects of consistency, durability, availability, etc. 
  • It does not mean that developers do not care about consistency, availability, redundancy at all. Instead, they simply should know what they need, and let the cloud provide these. As such, developers should declare their consistency, availability, budgetary needs, etc., and have the cloud runtime enforce such declarations. This will free up the programmers and let them focus on the application logic instead and make this logic “unsoiled” by the other aspects of the distributed app.
  • To help developers focus on their applications/tasks, we need domain-specific languages (DSLs). DSLs can hide a lot of “mechanical” work from the programmers and delegate it to the cloud runtime. A good example of a popular DSL we have been using for a very long time is SQL. It is declarative — programmers retrieve and update the data without worrying about how it is done under the hood. 
  • Despite potentially having many DSLs, we still want one comprehensive framework to run it on, so the visionary system here can compile DSL to some common Intermediary Representation (IR). The authors want the IR to be human-readable and optimizable, although I feel like this requirement is part of the “evolutionary” theme in the paper, and eventually, the importance of human optimizations may diminish. 
  • Achieving this highly declarative vision is hard, but the paper lists several developing and emerging techniques and research directions that may help evolve the cloud. 


1) DSL. We have spent quite some time discussing DSLs and what does it mean to have many of them. For one, we already have a few successful ones, like SQL. Arguably, ML/AI and data processing systems also often have some form of DSLs. TensorFlow was brought as an example. One minor concern that was expressed in the group is that having many of these DSLs requires someone to make and maintain them. A more interesting DSL question is how specialized they should become? To bring SQL example again, while it is great for what it does, it is rarely used all by itself. So there will be a clear need to allow to mix and match these highly specialized DSLs, potentially making the problem of translating them to IR more difficult. 

2) IR. A big part of the Hydro system vision is the IR language. The language itself may get rather complicated when it needs to support many IRs. A bigger challenge may be having to translate DSL logic to a human-readable IR code. The translations that are done must make sense to engineers, the logic should be clear and not obscure to allow people to make sense of it. This pursuit of human readability may result in less performance efficient IR. The readability may also depend on the individual DSLs. 

Another point we discussed is whether programmers will just start writing code directly in IR if it is a good, readable, feature-rich language. Maybe this is exactly what the programmers need after all? A language made specifically for distributed applications.

3) How much of this is already in the cloud? DSLs exist, the serverless cloud is developing too, providing more consistency and durability than before. For example, Azure Durable Functions save their intermediate state and can be resumed in the face of failures. And surprisingly, many of these cloud offerings, like serverless, durable functions, serverless storage are easy to use. Last semester I gave a project in my Cloud Computing Systems that used blob storage, serverless functions, and durable functions. To my surprise, the students loved the project and were able to figure out all of this tech (which they had to do on their own since the tech aspect was not really part of the problem they were solving) in just a few days. So as it stands right now, the cloud is evolving quickly, with serverless computing and storage becoming more ad more capable. It is not a coherent single cloud runtime just yet, and probably won’t be there any time soon, but some aspects of the vision are there. Users can scale serverless compute, not worry about its availability, may opt into more durable options when needed, may use cloud-native storage with configured/declared consistency, take advantage of DSLs for many tasks in the cloud, like data management, ML/AI systems, etc…

4) Drivers of innovation? An interesting discussion happened at the end of our meeting. Some expressed the opinion that cloud vendors should know better in what direction to develop the cloud since they are in constant interaction with the clients and must adjust to what clients are asking. I, personally, disagreed with this opinion — cloud clients are not thinking about the long-term visions like this paper describes. Instead, they may have more immediate concerns that must be dealt with given all the technology they already use in the cloud. An example I used is the true invention of GUI by Xerox PARC. The vision was out there, but nobody was really asking for this back then, even Xerox did not really know what to do with it, and willingly let others copy the ideas. Yet, this innovation made modern consumer electronics/computing what it is today. I suspect, that if Xerox were asking clients about what to improve, they may have worked on something as boring as developing a console with 120-character lines instead of an 80-characters one to make existing systems more “user friendly.”

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. High availability in cheap distributed key value storage

Our recent paper was “High availability in cheap distributed key value storage”. And what a paper that was! It was definitely a mind-tingling read the lead to a very interesting and long discussion session with the group.

Short Summary

The paper addresses the problem of fast recovery from the leader (primary) crashes in key-value stores backed by the non-volatile main memory (NVMM). NVMMs provide good latency but have much lower throughput than the DRAM. They are also a lot cheaper than DRAM for the same capacity, making NVMMs an interesting low-latency compromise between in-memory datastores and SSD-backed stores. NVMMs are, however, much more expensive than SSDs. In CANDStore, a primary replica is supported by the NVMM for fast latency, but the follower, or back-up, is backed by an SSD to save on the hardware. An additional “witness” replica exists. Witness, like the primary, is supported by NVMM, but it does not store actual data, and only has the operation log with placeholders for data. This is done, again, for cost reasons. The replication between a primary and the backup is done with a modified Raft algorithm. In a happy case, witness replica is a non-voting member of the cluster, much like in Cheap Paxos. As such, CANDStore is supported by heterogeneous hardware, with specific roles attached to specific hardware.

Upon the failure of a primary node, the witness is promoted to the primary role. Since the witness has no data, it has to learn it from the backup, but it does so smartly: first, learn the log with placeholders, once it knows all the placeholder, it can start serving new write requests that overwrite the key’s state. Knowing the position of the operation in the log is not enough to serve reads and a new primary needs to copy the data from the backup. It does so smartly and copies the “hot” keys first, followed by less frequently used keys. This allows primary to start serving hot reads very soon after gaining the ability to serve writes, and before the full copy of the data has finished. The paper claims this approach yields 4.5-10.5 times recovery speedup compared to the offline rebuild of a replica from an SSD. Backup failure is a less-discussed issue in the paper, but here the authors take the Cheap Paxos approach and make the witness vote on the quorum to select a new configuration.

Video Presentation


The discussion was rather heated, as the paper has many points to digest.
1) Homogeneous vs. heterogeneous deployment. The biggest question is why do we need to restore a primary from a backup by doing a full copy? The backup node in a 3-replica configuration studied in the paper(1 primary, 1 backup, 1 witness) already has the latest state, so it makes sense to promote it to the leader/primary node instead of doing a full replica “build” from the witness state. This would undoubtedly result in a much faster primary recovery (while “build-out” of a new backup can happen in the background). Of course, this works only if the backup and primary are homogeneous, which is not the case here. So, this raises the question of why having such a heterogeneous system in the first place? Our best guess was the cost. Having an SSD-backed replica is cheaper than NVMM one.

2) Witness log. In the paper, a witness maintains a log, keeping track of all the keys and essentially having an order of updates. This is a placeholder log that has no values. In the process of recovery, the witness learns of the missing log entries, i.e. tail of the log that a primary and backup may have, but has not reached the witness. So if this learning phase exists, why having a log at all? just make a witness learn the log (or relevant suffix of the log) when recovery is needed. In the discussion, we think the witness has a log to speed-up recovery when it is needed.

3) Performance comparison of DRAM/NVMM/SSD. The paper’s heterogeneous setup leverage the fact that these types of storage have different performance and price. DRAM is the fastest but also the most expensive, NVMM presents a latency compromise, while SSD is the price compromise. However, if the performance of NVMM is close to that of SSDs, there may be fewer incentives for such design. We looked at a few sources. And it appears that NVMM has good latency, but its throughput is somewhat on par with SSDs

4) The performance of the primary. Another question we had is why NVMM is fast enough for primary, but not for backup? This loops back to (3), and it seems like NVMM would do ok for a backup, given its similar throughput to an SSD. However, we speculate that (a) the NVMM performance may change less favorably depending on the write block size, and (b) again the cost and not the performance is the reason for having an SSD backup.

5) One witness for multiple partitions. The paper evaluates a single-partition 3 node system. However, in such a setup, we have a witness that must run on hardware capable of supporting a primary. This does not lead to cost-savings, as primary hardware is more expensive. We speculate that in a sharded system the cost of running a witness can be amortized by having single witness hardware shared across many partitions. When one partition needs a witness promoted, it takes over the resources, and forces other partition to get new witness nodes. This of course has several complications. Firstly, a correlated failure between partitions may not be tolerated well, when two partitions need a new primary, but they share a witness. Secondly, managing such a system become more challenging, since a witness promotion leads to the cleanup of other partition-witness data, finding new witness nodes, etc

6) Hot/cold benefits. Does hot/cold key separation benefit anything except recovery? The backup cannot benefit from this, it (normally?) serves no user work, so there is not much difference. It maintains hot/cold keys to make a copy upon the recovery prioritized to reduce performance degradation. Primary nodes, on the other hand, may benefit. Keep the “hot” keys in DRAM, and “cold” ones in the NVMM storage.

7) Backup failures. This aspect is not discussed in much details in the paper, except mentioning that the Cheap Paxos approach is used to reconfigure and bring up a new backup. One thing that caught the group’s attention is that a backup recovery puts a lot more work on the primary – it has to copy data to the backup, while still serving the operations (given that the backup recovery is done on-line).

8) More backup nodes. So, in relation to (7), what if we have more backup nodes, lets say 2 nodes instead of 1? This means that one backup should be available to restore the failed one. But what does this mean for write quorums? Do we still need all backup in the quorum? or just one out of 2 backups is enough? If we require all backups for the quorum, we also need two witnesses to tolerate two failures and be able to reconfigure with Cheap Paxos. If we require just one backup out of two, do we need a witness (aside from promoting it to the primary in case of a primary crash). Also, how does such two backups setup compare in terms of cost saving?

9 and beyond) We had way more discussion points as well. Can this heterogeneous setup benefit other copy-have workloads for databases, like elasticity tasks? Can heterogeneous deployments be used with regular Multi-Paxos/Raft? Why using modified Raft if most of the things that differ Raft from Paxos were undone? You can read more about these in our slack discussion group.