
Reading Group. CompuCache: Remote Computable Caching using Spot VMs

In the 92nd reading group meeting, we have covered “CompuCache: Remote Computable Caching using Spot VMs” CIDR’22 paper by Qizhen Zhang, Philip A. Bernstein, Daniel S. Berger, Badrish Chandramouli, Vincent Liu, and Boon Thau Loo. 

Cloud efficiency seems to be a popular topic recently. A handful of solutions try to improve the efficiency of the cloud by ratcheting up the utilization of already provisioned hardware and infrastructure. These approaches tend to provide some service using otherwise wasted resources. For instance, we have covered a paper that suggests running serverless computing on harvested VMs that use resources normally wasted/idle in the data center. CompuCache provides a similar utility, but instead of running a pure serverless compute infrastructure, it suggests using Spot VMs to run a compute-oriented caching service. See, running a pure caching service is memory-intensive and leaves CPUs relatively underutilized, which defeats the goal of putting otherwise wasted resources to good use. So CompuCache is a caching service that can execute stored procedures against the cached data. This is the “Compu” part, and I suppose it enables better CPU utilization of the spot VMs the service runs on.

So, now let’s talk about using spot VMs for storage systems. Spot instances have unpredictable lifespans controlled by the cloud vendor and not the users/engineers. When the resources used by a spot instance are needed to support other services, the cloud vendor evicts the VM with little notice. This fickle nature of spot VMs precludes them from reliably supporting a storage architecture. However, caching may be a somewhat viable option since the master copy of the data is stored on some durable service. Of course, we still want to avoid cache misses as much as possible for performance reasons, so CompuCache needs to have mechanisms to deal with spot VM evictions. 

CompuCache architecture relies on the Cloud VM Allocator component to get the resources to run multiple caches for different applications. The allocator lets client applications specify the size of the cache they need and allocates the resources accordingly. The system assigns memory in fixed chunks (1 GB), and these chunks can go to different spot VMs that have the resources. Naturally, if no existing VMs have the resources, the allocator will try to get more VMs running. The clients “see” the allocated memory as a single address space and can read and write to it by specifying an address and the length of the data.
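To make this byte-addressed model a bit more concrete, below is a minimal client-side sketch of allocating chunked cache space and resolving a virtual cache address to a spot VM. This is purely illustrative: the class, the method names, and the allocator/RPC handles are my assumptions, not CompuCache’s actual API.

```python
# Illustrative client-side sketch of a chunked, byte-addressed cache.
# The allocator and RPC handles are hypothetical stand-ins, not the paper's API.

CHUNK = 1 << 30  # the paper's fixed 1 GB allocation unit


class CacheClientSketch:
    def __init__(self, allocator):
        self.allocator = allocator   # stand-in for the Cloud VM Allocator
        self.chunks = []             # chunk index -> (spot VM id, local base address)

    def allocate(self, size_bytes):
        """Request enough 1 GB chunks to cover size_bytes; the chunks may
        land on different spot VMs."""
        n_chunks = -(-size_bytes // CHUNK)   # ceiling division
        for _ in range(n_chunks):
            vm_id, local_base = self.allocator.grab_chunk(CHUNK)
            self.chunks.append((vm_id, local_base))

    def _resolve(self, addr):
        """Map a virtual cache address to (vm_id, local address)."""
        vm_id, local_base = self.chunks[addr // CHUNK]
        return vm_id, local_base + (addr % CHUNK)

    def write(self, addr, data: bytes):
        vm_id, local_addr = self._resolve(addr)
        self.allocator.rpc_write(vm_id, local_addr, data)

    def read(self, addr, length: int) -> bytes:
        vm_id, local_addr = self._resolve(addr)
        return self.allocator.rpc_read(vm_id, local_addr, length)
```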

The sharded nature of the cache means that the clients somehow need to know where their data may be in the system. CompuCache uses a scheme that maps the virtual global cache address space to the specific nodes and their local address spaces. The client application constructs and maintains this map based on the results of cache allocation through the Cloud VM Allocator. Moreover, the client application is responsible for distributing this map to the CompuCache instances. When the node composition of the cache changes due to VM evictions or new VM additions, so does the mapping of the virtual memory space to instances. As the system evolves, each client application must migrate parts of its cache between VMs. The paper has more details on the procedures involved in data migration due to spot VM deallocation/destruction.
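Here is a rough sketch of the bookkeeping a client might do when a spot VM holding some of its chunks is evicted: remap the affected chunks to newly allocated space and push the updated map back to the remaining instances. This is just my reading of the responsibility split described above; the helper names are hypothetical, and the paper’s actual migration protocol has more to it.

```python
# Hypothetical sketch of client-side remapping after a spot VM eviction.
# The allocator, instance objects, and 1 GB chunk size mirror the sketch above;
# the real CompuCache migration procedure is more involved.

def remap_after_eviction(mapping, evicted_vm, allocator, instances):
    """mapping: dict {chunk_index: (vm_id, local_base)} maintained by the client."""
    for chunk_index, (vm_id, _local_base) in list(mapping.items()):
        if vm_id != evicted_vm:
            continue
        # Grab replacement space; the lost data would have to be repopulated,
        # e.g., from the durable master copy or a migration stream.
        new_vm, new_base = allocator.grab_chunk(1 << 30)
        mapping[chunk_index] = (new_vm, new_base)
    # The client also redistributes the updated map so that CompuCache
    # instances can locate remote data when running sprocs.
    for instance in instances:
        instance.install_map(dict(mapping))
```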

An interesting part of CompuCache is its compute-oriented API. The clients can read and write data directly from the cache, but they can also run stored procedures (sprocs). I suspect the stored procedures allow CompuCache to better utilize the unused cores that come with the spot instances. I was under the impression that stored procedures are not a very popular tool, but I may be mistaken here. Anyway, of course, the authors provide another explanation for using stored procedures to interact with CompuCache: latency. See, if we need to read something and then update something else based on the retrieved value using just puts and gets, we will pay the price of 2 RTTs (not to mention screwed-up consistency). Using stored procedures, we can achieve this read-then-update pattern in just a single sproc call and 1 RTT between the client and the cache. This is not a new idea, and other systems have suggested sprocs for atomic access to multiple keys and for latency benefits. But here is a caveat: CompuCache is sharded, and the shards are relatively fine-grained, so the data you access in a stored procedure may live on multiple VMs. This not only negates the latency benefit but also complicates CompuCache. To deal with data on multiple VMs, the system can either retrieve the data to the machine running the sproc or move the computation itself to the next VM.
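To illustrate the RTT argument, here is a toy read-then-update procedure that would run cache-side in a single call instead of a client-side get followed by a put. The cache handle, its read/write methods, and the 8-byte integer encoding are my own assumptions for the sketch, not CompuCache’s sproc interface.

```python
# Toy read-then-update stored procedure meant to run on the cache node:
# one client call and one RTT instead of a get followed by a put.
# The `cache` handle with read()/write() is an assumed server-side interface.

import struct


def transfer_if_positive(cache, src_addr: int, dst_addr: int) -> int:
    """Read an 8-byte counter at src_addr and, if it is positive,
    copy it to dst_addr. Returns the value that was read."""
    (value,) = struct.unpack("<q", cache.read(src_addr, 8))
    if value > 0:
        cache.write(dst_addr, struct.pack("<q", value))
    return value
```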

For the evaluation, the authors compared CompuCache with Redis, and it seems to outperform Redis by a large margin. This is interesting and unexpected, as there is nothing about using spot instances that should be beneficial for performance. Quite the opposite, the VM churn and reclamation should make CompuCache more likely to experience cache misses and spend more time migrating data and rebuilding caches. One explanation here is the poor performance of the Redis sprocs implementation used in the evaluation, which seems to be at least two orders of magnitude slower than CompuCache. There is no word on what CompuCache uses for its sprocs to make it so much faster.

Discussion

1) Application-driven design. CompuCache places a lot of burden on the application that needs the cache. It seems like the “memory management” done by the application can be rather heavy. Any data migration requires remapping of the virtual address space to nodes. Consequently, clients then need to redistribute this map so that nodes themselves can find the data they may need to run sprocs. This can make the system brittle due to the application’s mismanagement of the cache. 

2) API. An interesting thing to note is the API. CompuCache allocates memory to applications as some chunk of bytes. This chunk of bytes represents the virtual memory space that is mapped to multiple nodes/VMs. The API allows reading and writing at some byte offset and length. So, there is no KV abstraction common in other caches, and in fact, the application needs to implement a KV API on top if it needs one; a toy sketch of such a layer is below.
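The following is a deliberately simplified illustration of a key-value layer built on top of a raw byte-addressed cache. The fixed-size slots, the hashing, and the lack of collision handling are my simplifications; they are not from the paper.

```python
# Toy key-value layer over a raw byte-addressed cache (read/write at
# offset + length). Fixed-size slots, no collision handling: a simplification
# for illustration only, not anything described in the paper.

import hashlib
import struct

SLOT_SIZE = 256          # 4-byte length prefix + up to 252 bytes of value
NUM_SLOTS = 1 << 20


def _slot_addr(key: bytes) -> int:
    digest = hashlib.sha256(key).digest()
    return (int.from_bytes(digest[:8], "little") % NUM_SLOTS) * SLOT_SIZE


def kv_put(cache, key: bytes, value: bytes) -> None:
    assert len(value) <= SLOT_SIZE - 4, "value does not fit in a slot"
    cache.write(_slot_addr(key), struct.pack("<I", len(value)) + value)


def kv_get(cache, key: bytes) -> bytes:
    raw = cache.read(_slot_addr(key), SLOT_SIZE)
    (length,) = struct.unpack("<I", raw[:4])
    return raw[4:4 + length]
```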

3) Sprocs vs. Serverless functions? One major point of discussion was the comparison with prior work on serverless functions using harvested resources. The compute part of CompuCache looks a lot like a serverless function meant to operate on state directly from the cache. I’d suspect the sproc should also have the ability to find data in storage in case of a cache miss, but the paper leaves out the details on cache misses. Anyway, it seems that a sproc is a bit more restrictive than a general serverless function. So the question is whether you can run a harvest-VM serverless infrastructure capable of a full-fledged function runtime and use it instead of CompuCache sprocs. Of course, you lose that nice benefit of collocating data and compute, but this benefit is elusive at best when a cache is sharded across many VMs anyway. The problem here is the map of the virtual space to nodes: the application knows and maintains it, but a separate serverless compute service does not, so it won’t have a clue which nodes to contact for data. So we are back to the point above: relying on the client application to manage the cache can be brittle and restrictive.

4) Sproc performance. The performance evaluation is on the weaker side of the paper, mainly because it leaves many things underspecified. Why is the Redis sproc implementation so much slower? How was the comparison done given the difference in cache APIs? What was the implementation of CompuCache?

5) Failures. Finally, it is hard to discuss any distributed system without talking about failures. The paper leaves out a lot of details on handling failures in CompuCache. It is not clear what sprocs will do in case of a cache miss. In fact, it is even unclear what a cache miss is in CompuCache, as it does not store objects and instead provides some memory space for the application. It seems like this is all up to the application: if some memory is unreachable, the application should know how to find that data in a reliable store. This, of course, leaves sprocs vulnerable to failures. A big part of the paper’s fault tolerance discussion is data migration due to Spot VM eviction. In the real world, however, servers may fail without notice.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Faster and Cheaper Serverless Computing on Harvested Resources

The 83rd paper in the reading group continues with another SOSP’21 paper: “Faster and Cheaper Serverless Computing on Harvested Resources” by Yanqi Zhang, Íñigo Goiri, Gohar Irfan Chaudhry, Rodrigo Fonseca, Sameh Elnikety, Christina Delimitrou, Ricardo Bianchini. This paper is the second one in a series of harvested resources papers, with the first one appearing in OSDI’20.

As a brief background, harvested resources in the context of the paper are temporary free resources that are otherwise left unused on some machine. The first paper described a new type of VM that can grow and contract to absorb all unused resources of a physical machine. An alternative mechanism for using underutilized machines is creating spot instances on them. The spot instance approach, however, comes with start-up and shut-down overheads. As a result, having one VM that runs all the time but can change its size depending on what is available can reduce these overheads. Of course, using such unpredictable VMs creates several challenges. What types of use cases can tolerate such dynamic resource availability? The harvested VMs paper tested the approach on in-house data-processing tasks with modified Hadoop.

The SOSP’21 serverless computing paper appears to present a commercial use case for harvest VMs. It makes sense to schedule serverless functions that have a rather limited runtime onto one of these dynamic harvested VMs. See, if a function runs for 30 seconds, then we only need the harvest VM to not shrink for these 30 seconds to support the function’s runtime. Of course, the reality is a bit more nuanced than this: serverless functions suffer from cold starts when a function is placed on a new machine, so, ideally, we want to have enough resources on the VM to last us through many invocations. The paper spends significant time studying various aspects of harvest VM performance and resource allocation. Luckily, around 70% of harvest VMs do not change their CPU allocations for at least 10 minutes, allowing plenty of time for a typical function to be invoked multiple times. Moreover, not all of these CPU allocation changes shrink the harvest VM, and adding more resources to the VM will not negatively impact the functions already running on it.

There are two major problems with running serverless workloads in the harvest VM environment: VM eviction, and resource shrinkage. Both of these problems impact running functions and create additional scheduling issues. 

The VM eviction is not unique to harvest VMs and can also occur in spot VMs. According to the original harvest VM paper, harvest VMs should get evicted far less frequently: only when the full capacity of the physical server is needed. Moreover, the VM eviction has a negative impact only when the VM is running a function, and since VMs get an eviction warning, they most often have enough time to finish executions that have already started. As a result, a serverless workload running in harvest VMs still has a 99.9985% success rate in the worst-case situation when a data center has limited harvested resources and undergoes many changes. Nevertheless, the paper considers a few other strategies to minimize evictions and their impact. For instance, one strategy is to use regular VMs for longer-running functions to prevent them from being evicted “mid-flight,” while using harvest VMs for shorter jobs; a sketch of this routing idea is below.
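Here is a minimal sketch of that routing idea: send functions whose expected runtime exceeds some threshold to regular VMs and keep the rest on harvest VMs. The threshold value and the runtime estimate are placeholders of my own, not numbers from the paper.

```python
# Minimal sketch of routing functions by expected runtime: long-running
# functions go to regular VMs (safe from eviction), short ones stay on
# cheaper harvest VMs. The 60-second cutoff is a made-up placeholder.

LONG_RUNNING_THRESHOLD_S = 60.0


def pick_vm_pool(expected_runtime_s: float) -> str:
    """Decide which pool a function invocation should be scheduled on."""
    if expected_runtime_s >= LONG_RUNNING_THRESHOLD_S:
        return "regular"   # unlikely to be cut off mid-flight by an eviction
    return "harvest"       # cheap; the eviction warning usually suffices
```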

The resource shrinkage problem is a bit more unique to harvest VMs. Even though most harvest VMs undergo resource reallocation relatively infrequently, a VM shrinkage can have severe implications. One or more running functions may need to stop due to the resource shrinkage, and rescheduling these jobs may incur additional cold starts. As a result, the paper presents a scheduling policy, called Min-Worker-Set (MWS), that minimizes the cold starts for a function. The idea behind MWS is to place each function onto as few servers as possible while ensuring adequate compute capacity to serve the workload.
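Below is a rough sketch of what a Min-Worker-Set-style placement decision could look like: keep a function’s invocations on its existing (warm) workers and only expand to new workers when the warm set runs out of capacity. This is my reading of the policy’s intent; the Worker class and the greedy expansion are illustrative assumptions, not the paper’s exact algorithm.

```python
# Rough sketch of a Min-Worker-Set-style placement: prefer workers that
# already hold the function (warm starts) and add as few new workers as
# possible when more capacity is needed. Illustrative only.

from dataclasses import dataclass, field


@dataclass(eq=False)  # compare workers by identity
class Worker:
    name: str
    free_cpus: float
    assignments: dict = field(default_factory=dict)

    def assign(self, func_id: str, cpus: float) -> None:
        self.assignments[func_id] = self.assignments.get(func_id, 0.0) + cpus
        self.free_cpus -= cpus


def mws_place(func_id: str, demand_cpus: float, warm_workers, all_workers):
    """Greedily satisfy demand_cpus, warm workers first, then cold ones."""
    def take_from(workers, demand):
        for worker in sorted(workers, key=lambda w: -w.free_cpus):
            if demand <= 0:
                break
            grant = min(worker.free_cpus, demand)
            if grant > 0:
                worker.assign(func_id, grant)
                demand -= grant
        return demand

    remaining = take_from(warm_workers, demand_cpus)
    cold = [w for w in all_workers if w not in warm_workers]
    remaining = take_from(cold, remaining)
    return remaining   # > 0 means the cluster lacks capacity for this demand
```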

The authors have implemented the prototype with the OpenWhisk platform. The paper provides extensive testing and evaluations, both for performance and cost. Each figure in the paper has tons of information! That being said, I am including the performance-on-a-fixed-budget figure below to show how much cheaper running serverless on harvest VMs can be. The blue line is running some workload with rather heavy and longer-running functions on dedicated VMs under some fixed budget. Other lines show the latency vs. throughput of the harvest VM solution under different levels of harvest VM availability. The “lowest” (red triangle) line is when few harvest resources are available, making harvest VMs most expensive (who decides the price of a harvest VM, and how?).

As usual, we had the presentation in the reading group. Bocheng Cui presented this paper, and the video of the presentation is available on YouTube.

Discussion.

1) Commercializing Harvest VMs. This paper sounds like an attempt to sell otherwise unused resources in the data center. The previous harvest VM paper aimed at internal use cases (and there are lots of them, ranging from data processing to build-and-test systems). I think this is a great idea, and hopefully, it can make the cloud both cheaper and greener to operate with less resource waste. At the same time, it seems like the current prototype is just that: a prototype based on an open-source platform, and I wonder if this is feasible (or can otherwise be done even more efficiently) at the scale of the entire Azure cloud.

2) Evaluation. Most of the eval is done in an environment that simulates harvest VMs based on the real Azure traces. I suppose this is good enough, and the evaluation is rather extensive. It also includes a small section of the “real thing” running in real harvest VMs. But again, I wonder about the scale.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!