Tag Archives: cloud-native

Reading Group. New Directions in Cloud Programming

Recently we have discussed a CIDR’21 paper: “New Directions in Cloud Programming.Murat Demirbas did the presentation:

Quite honestly, I don’t like to write summaries for this kind of paper. Here, the authors propose a vision for the future of cloud applications, and I feel that summarizing a vision often results in the misinterpretation of that vision. So I recommend reading the paper to draw your own unbiased conclusions. 

That being said, here is my extremely high-level take on the paper in a few points:

  • Application developers should focus on application logic, and not worry about implementation aspects of consistency, durability, availability, etc. 
  • It does not mean that developers do not care about consistency, availability, redundancy at all. Instead, they simply should know what they need, and let the cloud provide these. As such, developers should declare their consistency, availability, budgetary needs, etc., and have the cloud runtime enforce such declarations. This will free up the programmers and let them focus on the application logic instead and make this logic “unsoiled” by the other aspects of the distributed app.
  • To help developers focus on their applications/tasks, we need domain-specific languages (DSLs). DSLs can hide a lot of “mechanical” work from the programmers and delegate it to the cloud runtime. A good example of a popular DSL we have been using for a very long time is SQL. It is declarative — programmers retrieve and update the data without worrying about how it is done under the hood. 
  • Despite potentially having many DSLs, we still want one comprehensive framework to run it on, so the visionary system here can compile DSL to some common Intermediary Representation (IR). The authors want the IR to be human-readable and optimizable, although I feel like this requirement is part of the “evolutionary” theme in the paper, and eventually, the importance of human optimizations may diminish. 
  • Achieving this highly declarative vision is hard, but the paper lists several developing and emerging techniques and research directions that may help evolve the cloud. 

Discussion. 

1) DSL. We have spent quite some time discussing DSLs and what does it mean to have many of them. For one, we already have a few successful ones, like SQL. Arguably, ML/AI and data processing systems also often have some form of DSLs. TensorFlow was brought as an example. One minor concern that was expressed in the group is that having many of these DSLs requires someone to make and maintain them. A more interesting DSL question is how specialized they should become? To bring SQL example again, while it is great for what it does, it is rarely used all by itself. So there will be a clear need to allow to mix and match these highly specialized DSLs, potentially making the problem of translating them to IR more difficult. 

2) IR. A big part of the Hydro system vision is the IR language. The language itself may get rather complicated when it needs to support many IRs. A bigger challenge may be having to translate DSL logic to a human-readable IR code. The translations that are done must make sense to engineers, the logic should be clear and not obscure to allow people to make sense of it. This pursuit of human readability may result in less performance efficient IR. The readability may also depend on the individual DSLs. 

Another point we discussed is whether programmers will just start writing code directly in IR if it is a good, readable, feature-rich language. Maybe this is exactly what the programmers need after all? A language made specifically for distributed applications.

3) How much of this is already in the cloud? DSLs exist, the serverless cloud is developing too, providing more consistency and durability than before. For example, Azure Durable Functions save their intermediate state and can be resumed in the face of failures. And surprisingly, many of these cloud offerings, like serverless, durable functions, serverless storage are easy to use. Last semester I gave a project in my Cloud Computing Systems that used blob storage, serverless functions, and durable functions. To my surprise, the students loved the project and were able to figure out all of this tech (which they had to do on their own since the tech aspect was not really part of the problem they were solving) in just a few days. So as it stands right now, the cloud is evolving quickly, with serverless computing and storage becoming more ad more capable. It is not a coherent single cloud runtime just yet, and probably won’t be there any time soon, but some aspects of the vision are there. Users can scale serverless compute, not worry about its availability, may opt into more durable options when needed, may use cloud-native storage with configured/declared consistency, take advantage of DSLs for many tasks in the cloud, like data management, ML/AI systems, etc…

4) Drivers of innovation? An interesting discussion happened at the end of our meeting. Some expressed the opinion that cloud vendors should know better in what direction to develop the cloud since they are in constant interaction with the clients and must adjust to what clients are asking. I, personally, disagreed with this opinion — cloud clients are not thinking about the long-term visions like this paper describes. Instead, they may have more immediate concerns that must be dealt with given all the technology they already use in the cloud. An example I used is the true invention of GUI by Xerox PARC. The vision was out there, but nobody was really asking for this back then, even Xerox did not really know what to do with it, and willingly let others copy the ideas. Yet, this innovation made modern consumer electronics/computing what it is today. I suspect, that if Xerox were asking clients about what to improve, they may have worked on something as boring as developing a console with 120-character lines instead of an 80-characters one to make existing systems more “user friendly.”

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Autoscaling Tiered Cloud Storage in Anna.

This week we looked at “Autoscaling Tiered Cloud Storage in Anna.” This is the second Anna paper. The first one introduces Anna Key-Value store, and the second paper talks about various “cloud-native” improvements. The presentation by Michael Whittaker is available here:

Short Summary

The Anna Architecture

Anna is an eventual-consistent key-value data store, where each value is a lattice. Anna can tweak some consistency guarantees by changing the type of lattice stored in the value. For example, it can be a Last Write Wins (LWW) system that merges operations based on the definition of the last writer (i.e. a timestamp), or it can use some CRDT lattice to preserve all updates/operations. The weak consistency of Anna means that it is relatively easy to add and remove nodes to scale the systems. The Autoscaling part takes advantage of that and dynamically adjusts the number of replicas responsible for each key to match the workload. Moreover, Autoscaled Anna has multiple usage tiers for keys based on the usage frequency. The paper discusses “hot” and “cold” keys, but more tiers are possible. Such tiering allows using heterogeneous hardware to better balance the cost and performance of Anna, as “cold” keys may run on cheaper hardware backed by SSDs, while “hot” keys may be store in RAM on more expensive VMs. Anna has rules for data transitioning between tiers and for controlling the degree of replication. For example, there is a minimum replication factor for durability, and there are usage thresholds for promoting and demoting keys to/from hot/cold tiers.

Discussion

We have spent quite some time talking about this paper, as there is a lot of information to digest. Below are a few of the bigger discussion points

1) Apache Cassandra comparison. This came up a few times as lacking, as the system is somewhat similar to Cassandra, at least when you consider the LWW lattice. There are many differences too, especially in the context of autoscaling. Cassandra uses a statically configured replication factor and lacks any autoscaling. Cassandra’s DHT model is pretty rigid and resists scaling – adding nodes require considerable ring rebalancing. Also, only one node can be added/removed from the ring at once to avoid membership and ring partitioning issues. Anna has no such problem and can adjust quicker and more dynamic. Additionally, for all the fairness, the original Anna paper compares against Cassandra.

2) Breaking Anna. Part of the discussion was around breaking the autoscaling and putting Anna in a bad/less-performant state. One example would be using a workload that changes at about the same rate as Anna can adjust. For example, if we hit some keys above the threshold to promote from the “cold” tier to “hot”, we can cause a lot of data movement as these keys potentially migrate or added to the memory-tier hardware. At the same time, we can reduce the usage, causing the systems to demote these keys, and experience associated overheads. This is like putting a workload in resonance with Anna’s elastic abilities, and it reminds me of a famous Tacoma Narrows bridge collapse

3) Possible consistency guarantees. One of the benefits of Anna is that some guarantees can be tweaked based on the lattice used for the value. Part of the discussion focused on how much consistency we can get purely out of these lattices, and how much may necessitate more drastic architectural changes. It is pretty straightforward to understand that Anna won’t ever provide strong consistency, but what about some other guarantees? For example, the original Anna paper claims the ability to have causal consistency, although it uses more complicated lattices, and keeps vector clocks for maintaining causality, not only making the system more complicated but also potentially harming the performance (something noted by the authors themselves).

4) Comparison fairness. Some points were mentioned about the evaluation fairness. It appears that some systems compared against are significantly more capable and/or may provide better guarantees(DynamoDB, Redis). It is worth noting, however, that there are not that many, if any, direct competitors, so these comparisons are still very valuable.

5) Cloud-Native. This appears as one the “most” cloud-native database papers that talks about some practical aspects of designing and running a  cloud-native database. One aspect of this “cloud-nativeness” is the focus on elasticity and scalability happening automatically. Another one is the cost-performance consideration in how the system scales with regard to workload changes. We noted some other much older papers that fit our definition of “cloud-native” papers. One example was Yahoo’s PNUTS paper that describes a global system with some capability to adjust to workloads and selective replication.

6) Scheduling/resource allocation. One of the challenging aspects of  Anna with autoscaling is figuring out the best key “packing” or binning” given multiple tiers and pricing constraints. This is similar to a Knapsack problem and bin packing problem, and both are NP-complete. So this resource management/scheduling will become a big problem at the production scale when we may have thousands of machines. At the same time, task packing problems at the datacenter level are well studied, for example in Google’s Borg cluster management system or its derivative Kubernetes

 

Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!