Murat and Aleksey read “Rethinking the Cost of Distributed Caches for Datacenter Services.”
This paper argues that distributed caches can save money by reducing CPU costs despite using substantially more of the costly DRAM. The authors claim up to 4X savings in synthetic and open-source workloads. The paper also calls for richer caching semantics to allow caching of rich application objects rather than a simple key-value abstraction. A richer caching abstraction, coupled with collocating caches on application nodes/workers, can further improve efficiency by reducing serialization/deserialization and data manipulation costs. Finally, the paper makes the case for better cache coherency and the need for cheaper, strongly consistent caches.
To be honest, it was not surprising to me that caches save money and not just improve latency, though the paper made it sound like a big surprise. That said, all these savings come at a cost the paper does not mention. By saving CPU on more expensive services, such as storage, we make them less able to handle traffic bursts and surges. And traffic surges to systems behind the cache can be substantial even with relatively small cache failures. Imagine a system with three cache servers and ideal partitioning, such that all servers do the same amount of work. If such a system, under a given workload, achieves a 90% cache hit rate, then the database sitting behind the cache handles only 10% of the traffic. Now, if one cache server fails, we lose a third of the cached work (which, in our idealized case, was uniformly distributed), so instead of a 90% hit rate we get 60%. The database's share of the traffic jumps from 10% to 40%, a 4-fold increase.
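The arithmetic above generalizes to any number of servers and failures. Here is a minimal sketch (my own back-of-the-envelope model, not from the paper): with n ideally partitioned cache servers and a baseline hit rate h, losing k servers leaves a hit rate of h * (n - k) / n, and the database absorbs every miss.

```python
def db_load_multiplier(n: int, k: int, h: float) -> float:
    """Factor by which database traffic grows when k of n cache servers fail.

    Assumes ideal uniform partitioning: each surviving server keeps its
    1/n share of the hits, and every miss goes to the database.
    """
    baseline_miss = 1.0 - h             # database's share of traffic before the failure
    degraded_hit = h * (n - k) / n      # surviving servers retain their share of hits
    degraded_miss = 1.0 - degraded_hit  # database's share of traffic after the failure
    return degraded_miss / baseline_miss

# The scenario from the text: 3 servers, 90% hit rate, 1 failure.
print(round(db_load_multiplier(n=3, k=1, h=0.9), 3))  # -> 4.0
```

Note how nonlinear this is: the higher the baseline hit rate, the smaller the baseline database footprint, and thus the larger the multiplier when even a fraction of the cache disappears.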
This dramatic load spike on the database can lead to metastable failures when caches fail: the database struggles, requests time out, no work completes in time to refill the cache, and the whole system remains perpetually overloaded. As such, I cannot help but wonder whether these operational savings make the system more fragile, and more expensive to maintain and to recover from failures. In other words, the savings become deferred expenses. The big question, of course, is what we can do to make the caching system more robust to failures and to avoid this “water hammer” effect downstream when caches fail.