All posts by alekseyc

Reading Group Paper. Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate

In this week’s reading group, we discussed the “Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate” OSDI’23 paper by Audrey Cheng, David Chu, Terrance Li, Jason Chan, Natacha Crooks, Joseph M. Hellerstein, Ion Stoica, Xiangyao Yu. This paper argues against optimizing for object hit rate in caches for transactional databases. The main logic behind this is that missing even a single object needed for a transaction will require a trip to the database and incur the associated costs. Instead, the paper suggests designing caches to leverage the structure/composition of transactions. 

Let me use an example from the paper to explain this better. Let’s say we have a transaction, shown in the image here, that queries 4 different tables: A, B, C, and D, such that querying tables B and C depends on the results of reading table A, and reading D requires first getting the data from C. This read-only transaction has data dependencies, and the authors exploit such dependencies to improve transaction latency. See, this example has two dependency chains (A -> B and A -> C -> D). A -> C -> D is the longer chain, with three queries in sequence. A transaction-optimized cache can then focus on reducing the number of such chained operations that need to hit the database. For instance, caching table A reduces the number of queries answered by the database from 3 down to 2 (using the cache for A, then reading B and C concurrently, and then reading table D). Similarly, caching just table C also reduces the number of database steps in the longest chain (reading A, then using cached C, and reading B and D concurrently). Every time we reduce the number of database steps in the transaction chain with the most such steps, we improve transaction latency further by cutting out more database interactions and associated networking costs. As such, caching both A and C, for example, can be even better than just A or C, as now there is only one database step left in both transaction chains. A further improvement now needs to consider caching tables that can help cut database access from both chains.
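To make the chain-counting logic concrete, here is a small Python sketch (my own illustration, not DeToX's code) that computes the number of sequential database round trips for a transaction's dependency graph, given a set of cached tables. The A/B/C/D graph is the example from above.

```python
def db_steps(deps, cached):
    """Longest chain of uncached reads, i.e., sequential DB round trips.

    deps maps each table to the tables it depends on (its parents).
    cached is the set of tables served from the cache.
    """
    memo = {}

    def chain(table):
        if table in memo:
            return memo[table]
        parent_len = max((chain(p) for p in deps[table]), default=0)
        # A cached read adds no database round trip to the chain.
        memo[table] = parent_len + (0 if table in cached else 1)
        return memo[table]

    return max(chain(t) for t in deps)

# The example transaction: B and C depend on A, D depends on C.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["C"]}

print(db_steps(deps, cached=set()))        # 3 (A -> C -> D all hit the DB)
print(db_steps(deps, cached={"A"}))        # 2
print(db_steps(deps, cached={"C"}))        # 2
print(db_steps(deps, cached={"A", "C"}))   # 1
```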

DeToX, the system proposed in the paper, tries to achieve the above strategy. It works by knowing what types of transactions an application may run and using that knowledge to enumerate all possible groups of tables to cache for each transaction type. These groups are then scored at runtime to decide which groups of tables are more valuable to cache. In short, the idea is to give higher scores to groups with smaller yet frequently used tables. Finally, caching entire tables may be infeasible, so DeToX also scores individual keys/objects to decide which ones to cache. Similarly, the idea here is to keep as many high-impact keys/objects in the cache as possible, where impact is measured by the object’s “hotness” and whether the object benefits important/frequent transactions or many different types of transactions. The paper has a more precise description and the formulas used for scoring than my super high-level summary.
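The snippet below only illustrates that intuition, not the paper's actual formulas: a group's score grows with how often its transactions run and shrinks with the total size of its tables, and an object's score combines its hotness with the weight of the transactions it benefits. All names and numbers here are hypothetical.

```python
def group_score(group_tables, txn_frequency, table_sizes):
    # Hypothetical: favor groups that are hit often but are cheap to cache.
    total_size = sum(table_sizes[t] for t in group_tables)
    return txn_frequency / total_size

def object_score(hotness, benefiting_txn_weights):
    # Hypothetical: weight an object's hotness by the transactions it benefits.
    return hotness * sum(benefiting_txn_weights)

sizes = {"A": 1_000, "C": 5_000}
print(group_score({"A", "C"}, txn_frequency=200, table_sizes=sizes))   # ~0.033
print(object_score(hotness=0.8, benefiting_txn_weights=[1.0, 0.5, 0.5]))  # 1.6
```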

DeToX runs as a shim layer between PostgreSQL and the clients. This shim layer sits on a separate (and equally large!) machine from the database itself. In the eval, both ran on AWS c5a.4xlarge (16 vCPU, 32 GB RAM) VMs. Clients do not interact with the database directly and use the shim layer instead. The shim keeps the cache coherent with the underlying PostgreSQL using two-phase locking. The actual cache is backed by Redis, running on an even larger VM (in the eval, it was a c5a.16xlarge with 64 vCPUs and 128 GB RAM).

Anyway, this approach and setup seem to provide a decent improvement in the transaction hit rate over other state-of-the-art caching strategies. The paper defines the transaction hit rate in terms of using the cache to successfully reduce the number of database accesses in the transaction’s longest dependency chain. The object hit rate, however, is reduced since it is not a priority for the scoring system.

Discussion

We had a long discussion of the paper, and for the sake of space and my time, I will summarize only a handful of points. 

1) Object hit rate vs. Transaction hit rate. The objective of the paper is to maximize the transaction hit rate (i.e., caching at least some of the transaction’s sequential steps in their entirety to remove these steps from ever touching the database). This seems to help improve latency. However, a lower object hit rate may result in the database having to do more work, as now the database needs to serve more objects. It may be the case that for use cases that require higher throughput, the object hit rate may still be more important. For what it’s worth, the paper reports throughput improvements despite the lower object hit rate.

2) Use cases for DeToX. Stemming from the point above, the use case we see for DeToX is latency-driven. Some caches exist only for reducing latency, and regardless of cache hit or miss, they exercise the underlying storage (see the DynamoDB ATC’22 paper) for reliability reasons. It seems like DeToX may be a viable solution for this type of cache usage in transactional workloads.

3) Latency Improvements. The DeToX caching approach is supposed to improve latency by cutting out entire transactional steps from reaching the database. The system prefers to cache objects from smaller yet frequently utilized tables. These smaller tables, due to their size, may also be the most efficient for the database to answer without the cache. As such, the latency improvements may not be proportional to the number of transactional steps cached and “cut out” if the remaining steps require more complex queries over larger tables that simply take longer for the database to serve.

4) Cost of scoring. For large enough applications with many transaction types, there can be a substantial number of groups that need to be scored during the runtime as system access patterns start to emerge. This process can be quite costly (and the paper admits that), so we were wondering whether there are ways to mitigate some of that cost. The paper already proposes a few improvements. 

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group Paper. Aggregate VM: Why Reduce or Evict VM’s Resources When You Can Borrow Them From Other Nodes?

In our recent reading group meeting, we discussed “Aggregate VM: Why Reduce or Evict VM’s Resources When You Can Borrow Them From Other Nodes?” by Ho-Ren Chuang, Karim Manaouil, Tong Xing, Antonio Barbalace, Pierre Olivier, Balvansh Heerekar, Binoy Ravindran. This EuroSys’23 paper introduces the concept of Aggregate VM to allow the pooling of small unused chunks of resources from multiple physical servers to create one VM. 

We had no volunteer for this paper, so I put together a quick presentation myself:

The problem Aggregate VMs solve is resource fragmentation. Let’s say a cluster has some unused resources, such as CPU and memory, lying around on multiple servers. These may be too small for a bigger VM, yet abundant across many physical servers. The authors propose pooling these smaller chunks of resources together into one bigger VM. For instance, taking 1 core and 2 GB of RAM from each of 3 physical servers to create a 3-core VM with 6 GB of memory.

From the get-go, this may not sound like a good idea for performance, given the disaggregation of memory and compute resources over the network. The paper is very defensive about this topic and identifies several workloads where this may not be a problem — the key is to have workloads that require little memory sharing between threads, as sharing memory between cores on different physical servers over the network would be a disaster for performance. 

The paper introduces FragVisor, a new hypervisor built on top of KVM and Distributed Shared Memory (DSM). FragVisor runs on that DSM layer; however, FragVisor nodes on different physical servers can also communicate via message passing over the network. Without going into too many details, the DSM and the behind-the-scenes FragVisor communication enable the Aggregate VMs. The FragVisor communication glues the individual FragVisor nodes into one hypervisor. The DSM acts as a single address space for the guest OS, divided into multiple NUMA domains to let the guest OS know of memory speed limitations and boundaries. The paper also talks about several optimizations to DSM to get better performance for the guest OS.

One important aspect I want to discuss is sharing IO devices, such as network cards, between physical fragments. See, if the aggregate VM has access to a network interface on one physical node, the other nodes need to leverage DSM to interact with that network card — the card may have send/receive ring buffers in memory. These buffers, however, become points of memory sharing, slowing down the Aggregate VMs. The paper proposes using a multi-queue interface, so each core (or at least each physical server?) has its own send and receive buffers. This way, while buffers still rely on DSM, cross-core (or rather cross-server) sharing of buffers is avoided. The paper also explores the idea of DSM-bypass for these tasks, avoiding reliance on DSM for managing such buffers and using FragVisor’s message exchange capability instead.
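Here is a rough sketch of the multi-queue idea as I understand it (nothing here is FragVisor code): each vCPU gets a private send ring, so enqueuing a packet only touches that core's buffer, and no ring is ever shared across cores that sit on different physical servers.

```python
from collections import deque

class MultiQueueNic:
    def __init__(self, num_cores, ring_size=256):
        # One private send ring per core; rings are never shared across
        # cores, so no two physical servers contend on the same buffer.
        self.send_rings = [deque(maxlen=ring_size) for _ in range(num_cores)]

    def send(self, core_id, packet):
        # Only the calling core's ring is written, so only that node's DSM
        # pages (if any) are touched by this operation.
        self.send_rings[core_id].append(packet)

nic = MultiQueueNic(num_cores=3)
nic.send(core_id=1, packet=b"hello")
```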

I will not go into the details of the performance evaluation too much and just show a few use cases. In both use cases, the paper compares the Aggregate VM against an overprovisioned VM running on one physical core (so, a 4-vCPU overprovisioned VM still has only one physical core). There is also a comparison against GiantVM, another disaggregated VM solution meant for gluing together bigger chunks of resources, as opposed to FragVisor’s smaller fragments. The LEMP workload simulates a PHP-serving application; it is a workload that avoids memory sharing, as different concurrent requests to serve a page are independent. In this workload, when page render time is larger (i.e., 250+ ms), Aggregate VMs show nearly linear improvement from having more cores. However, the improvement in OpenLambda varies depending on the task, and in some instances, an overprovisioned VM still runs faster than an aggregate one.

Discussion

1) Motivation. The primary motivation for the paper is resource fragmentation. The idea is that fragments cannot be used for a primary VM unless we aggregate them together. But this runs into a problem evident right from the title of the paper. The authors claim that current solutions to fragmentation are spot VMs and Harvest VMs. According to the paper, the problem with these VMs is that they can get evicted or destroyed, unlike aggregate VMs that can morph and use the resources of multiple servers during their lifetime. However, if a server can support a large enough spot VM, then it can support the same size primary VM, eliminating the need for the Aggregate VMs. In my opinion, spot VMs solve a different problem: getting some revenue or work done from spare capacity kept in the cluster to accommodate load and demand fluctuations. These fragments may exist there on purpose! The Aggregate VM motivation is based on the premise that these fragments are accidental and otherwise unusable.

I am not saying excessive fragmentation is not a real problem; the paper has impressive references claiming up to 17% fragmentation in some clusters and millions in lost revenue due to fragmentation. However, while unused resources are lost income, I think this is done at least partly on purpose. Furthermore, having evictable resources may be the only way to both use the reserves for something and have the spare capacity to accommodate demand changes.

2) Why not use smaller VMs? Since aggregate VMs favor embarrassingly parallel tasks with little-to-no memory sharing between cores, why do we need bigger VMs? Why not use many smaller ones such that each smaller VM fits within one physical server? Of course, there are some overheads, like having many copies of the OS and managing a larger fleet of VMs, but somehow it still feels like a plausible solution for workloads that run okay on aggregate VMs.

3) IO sharing. The multi-queue approach is an interesting way to reduce the memory sharing that comes from having a single IO device/interface. However, it may be hard to make it work in the dynamic environment the authors propose, where a VM can change both which physical servers it spans and how many. This can require quite a bit of readjustment of the buffers and can be complicated.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group Paper: Hyrax: Fail-in-Place Server Operation in Cloud Platforms

In the 142nd reading group meeting, we discussed “Hyrax: Fail-in-Place Server Operation in Cloud Platforms” OSDI’23 paper. Hyrax allows servers with certain types of hardware failures to return to service after some software-only automated mitigation steps.

Traditionally, when a server malfunctions, the VMs are migrated off of it, then the server gets shut down and scheduled for repair. The paper calls this an “all-or-nothing” model for repairs that leaves only perfectly healthy servers in operation. This behavior aligns well with “distributed” people like me — we treat nodes as disposable and replaceable. However, the physical hardware is not disposable and is sometimes hard to replace. The paper suggests that repair time can take up to nearly half a year in some cases, likely due to the availability of parts for older machines. The authors also state that 22% of servers will experience at least one failure in their 6-year lifespan.

Anyway, Hyrax suggests that this shutdown-and-repair process may be overkill, especially in large data centers and clouds. The repairs take a lot of technician time, consume resources, and take capacity out of commission for days at a time. Surprisingly (at least to me), many component failures within a server do not necessarily lead to a complete loss of the server’s function, creating a possibility for the “Fail-in-Place” model. The Fail-in-Place approach works by remotely deactivating failed components, physically leaving them inside the server, and returning the server to production, albeit with some capacity/performance restrictions. This way, the capacity is returned quicker, and valuable technician time can be deferred until the repairs can be done more efficiently (i.e., when more failures accumulate in the server or nearby).

The paper focuses on failures of a select few component types: memory DIMMs, SSDs, and cooling fans. For each failure, the system tries to identify a specific failed component, such as a concrete DIMM or SSD. If successful, then that component can be deactivated. The deactivation, however, is a tricky part with multiple problems to solve, starting from actually doing it without physically removing the components to ensuring that a server can still perform adequately enough to host at least some VM types.

The paper focuses mainly on DIMM deactivation since it is the most complicated case. A faulty “memory stick” gets deactivated via a nearly undocumented BIOS feature, leaving the server in a somewhat crippled state. See, turning off a DIMM or two leaves the server in a configuration not supported by the CPU and its memory controllers, as the DIMM placement rules get violated. Long story short, such a server will have both reduced memory and a memory address space fragmented into regions with different bandwidths — some memory addresses may work as fast as on unaffected servers, while others can have bandwidth degraded by an order of magnitude. To deal with the problem, Hyrax modifies the scheduling policies for the VMs. For example, a degraded server may not host larger VM types that require a lot of memory with high bandwidth. The system also manages VM placement locally to better adhere to these memory regions with different speeds — placing smaller VMs in slower memory regions may be okay, since these smaller VMs, even on healthy servers, do not utilize the full memory bandwidth. Naturally, Hyrax has all these combinations precomputed and stored in a database, so the rules are a simple mapping from the server’s degraded configuration to the VM types it can host.
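As a toy illustration of that precomputed mapping (the real Hyrax tables and VM types are not something I know, so everything below is made up), the scheduler-side check can be a plain lookup from a degraded configuration to the VM types a server may still host:

```python
# Hypothetical degraded-config lookup; Hyrax's real policies and VM types
# are not reproduced here.
ALLOWED_VM_TYPES = {
    # (deactivated_dimms, has_slow_memory_region): VM types still allowed
    (0, False): {"small", "medium", "large", "xlarge"},
    (1, True):  {"small", "medium", "large"},  # no high-bandwidth xlarge VMs
    (2, True):  {"small", "medium"},
}

def schedulable_vm_types(deactivated_dimms, has_slow_memory_region):
    key = (deactivated_dimms, has_slow_memory_region)
    # Unknown configurations fall back to the most conservative option.
    return ALLOWED_VM_TYPES.get(key, {"small"})

print(schedulable_vm_types(1, True))  # e.g., {'large', 'medium', 'small'}
```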

The SSD story is much simpler than the DIMM one. Each server the Hyrax prototype was tested on had 6 SSDs, all striped for maximum performance. The authors calculated that 5 SSDs still provide enough performance even for the largest VMs, so deactivating 1 SSD leads to no performance impact. For cooling fan failures, it appears that cooling is severely overprovisioned, and even 2 “dead” fans do not impact the server.

After deactivating the malfunctioning component, Hyrax returns the server to production with the new capacity restrictions for VM sizes. However, it does so only after extensive testing and verification to ensure that correct components have been deactivated and there are no other faults that can cause unstable performance.

The paper provides much more information and details, including an evaluation of the system!

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Fall 2023 Reading Group Papers (##141-150)

The schedule is also in our Google Calendar.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Spring Term Reading Group Papers: ##131-140

A new set of papers for spring and early summer!

Reading Group. DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems

In the 127th meeting, we discussed the “DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems” SoCC’22 paper by Ziliang Wang, Shiyi Zhu, Jianguo Li, Wei Jiang, K. K. Ramakrishnan, Yangfei Zheng, Meng Yan, Xiaohong Zhang, Alex X. Liu.

This paper argues that current autoscaling solutions for microservice applications are lacking in several ways. The first is the timing of autoscaling. For example, rule-based systems use threshold boundaries on performance metrics to decide whether to scale up or down. This reactive approach adjusts to system behaviors that have already happened, and the thresholds have to account for that — one may set the thresholds more conservatively to allow the autoscaling system to “spool up” once a threshold is hit, but before the system is in real danger of violating SLAs or needing additional resources. Naturally, this threshold manipulation also requires some expert knowledge of the systems being autoscaled. An alternative to rule-based solutions is learning-based autoscaling, which relies on machine learning. This can reduce the need for domain knowledge, as the autoscaling system can learn the patterns itself. However, current learning-based solutions have another limitation — they rarely consider resource usage due to frequent variations in utilization. Naturally, maximizing resource utilization can lead to monetary savings, so it is important for an autoscaling system to accurately provision resources in a dynamic system.
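To make the contrast concrete, here is a minimal sketch of the reactive, rule-based style described above; the thresholds and step size are arbitrary placeholders of mine, not values from any real system.

```python
def rule_based_decision(cpu_utilization, replicas, upper=0.6, lower=0.3, step=1):
    # The upper bound is set conservatively low so that new capacity has
    # time to spool up before SLAs are actually threatened.
    if cpu_utilization > upper:
        return replicas + step
    if cpu_utilization < lower and replicas > 1:
        return replicas - step
    return replicas

print(rule_based_decision(0.72, replicas=4))  # 5: scale up
print(rule_based_decision(0.20, replicas=4))  # 3: scale down
print(rule_based_decision(0.45, replicas=4))  # 4: hold steady
```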

The paper proposes DeepScaling, a system that optimizes resource usage (in this case, CPU utilization) while meeting SLO constraints. DeepScaling does so by forecasting the future workload based on historical data. From this prediction, it estimates the resource requirements for the system in the near future and provisions or decommissions resources as needed. These tasks are separated into 3 components: the Workload Forecaster, the CPU Utilization Estimator, and the Scaling Decider. There are a few other components, such as load balancers, controllers, and monitors, that ensure a predictable load spread in the cluster, perform data collection, and enact the scaling decisions. The predictive approach allows DeepScaling to anticipate load increases and provision enough new capacity ahead of time, eliminating the need for overly conservative autoscaling bounds.
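Here is a high-level sketch of that three-stage flow with the ML models reduced to stand-in functions; the real Workload Forecaster and CPU Utilization Estimator are learned models, and every number and name below is a hypothetical placeholder.

```python
import math

def forecast_workload(history):
    # Stand-in for the Workload Forecaster: predict next-interval requests/s.
    return history[-1] * 1.1

def estimate_cpu(predicted_rps, cpu_per_request=0.002):
    # Stand-in for the CPU Utilization Estimator: map workload to CPU cores.
    return predicted_rps * cpu_per_request

def scaling_decision(cores_needed, cores_per_replica=4, target_util=0.6):
    # Stand-in for the Scaling Decider: provision ahead of predicted demand
    # while targeting a stable utilization level.
    return math.ceil(cores_needed / (cores_per_replica * target_util))

history_rps = [900, 1000, 1200]                   # recent requests per second
predicted = forecast_workload(history_rps)        # 1320.0
print(scaling_decision(estimate_cpu(predicted)))  # 2 replicas
```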

The paper provides significantly more details on the machine learning components, but I am less interested in these as a systems person. The paper also provides a good evaluation of these components, so it is actually worth the read, especially for more ML-inclined folks. 

Discussion

1) Multiple steps. The DeepScaling system tries to see how many resources it will need in the near future to predict scaling demands and satisfy these future needs. It does so by estimating CPU utilization, but indirectly: it first predicts the workloads and then converts these workload estimates into CPU requirements. Why not build a CPU model directly and learn it that way? Our understanding is that decoupling the two processes has several advantages. The workloads for different services can be learned and predicted separately. Adding or removing services from the cluster does not require retraining the model if the CPU requirement can be estimated from the sum of these predicted workloads. Similarly, if a service changes significantly and requires a different amount of resources for the same work, the system can adjust for that in the CPU estimator. In the worst case, if the workload characteristics have changed as well, again, only the model for that service needs to be retrained.

2) Backup autoscaling. One possible problem we discussed was the ability to react to unexpected changes. With a purely predictive system, we believe there is a need for a rule-based backup that can kick in if the predictive model fails.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Method Overloading the Circuit

In the 126th reading group meeting, we continued talking about the reliability of large distributed systems. This time, we read the “Method Overloading the Circuit” SoCC’22 paper by Christopher Meiklejohn, Lydia Stark, Cesare Celozzi, Matt Ranney, and Heather Miller. This paper does an excellent job summarizing a concept of a circuit breaker in microservice applications. The authors discuss several potential problems with circuit breakers, present a taxonomy of circuit breakers, and most importantly, talk about improving circuit breakers and their usage.

Circuit breakers are mechanisms to stop system overload in the presence of some failures. This is the same task circuit breakers perform in electrical circuits, so CS folks borrowed the name. The idea is to monitor the flow of requests/RPC calls through the system, and if errors start happening along some specific flow, to “break the circuit” and stop that request flow. As such, circuit breakers minimize further damage after the initial failure by dropping potentially problematic work. Naturally, if no requests flow through the system, the application will never learn whether the problem has been rectified. So, to allow recovery, circuit breakers must occasionally go into a “half-open” state and let some requests through. Usually, this happens after a timeout period. If the failure persists, the circuit breaker falls back to the open state and continues to shed the problematic work.
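To make the state machine concrete, here is a minimal circuit breaker sketch in Python; the failure threshold, timeout, and probe policy are arbitrary choices of mine, not taken from the paper.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"    # let a probe request through
                return True
            return False                    # shed the potentially bad work
        return True                         # closed or half-open: allow

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"             # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
```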

The paper presents a motivating example for circuit breakers — some API endpoint is malfunctioning, causing users/client applications to retry. The retries create even more faulty requests, making the failure worse by overloading the system. This example is a Metastable failure with load amplification caused by retries, and as we know, load shedding is a primary mechanism for both preventing metastable failures and recovering from them.

What makes circuit breakers special is the ability to shed load selectively along specific faulty execution paths through the system or for some specific type of request. Going back to the example, dropping only requests that trigger the malfunctioning endpoint in a service allows the system to remain operational for other unaffected path/request flows. Unfortunately, this is not only a strength of circuit breakers but also their weakness — high specificity/selectivity makes them harder to use correctly.

Consider a situation when one API of a service fails. If we use the same circuit breaker for all calls to that service, then this circuit breaker may start seeing an elevated error rate and trip, causing all requests to the service to drop, including the ones that were bound for correctly functioning APIs. Similarly, in the case of one API endpoint, it may be possible that some requests work fine while others do not (for example, requests for some, but not all, users work fine), necessitating a circuit breaker that can distinguish between problematic and ok requests to the same endpoint.

The paper then taxonomizes different circuit breakers based on several criteria, such as their implementation, installation location, and whether the installation is explicit. These criteria impact the circuit breakers’ selectivity, or the ability to discriminate/distinguish between different requests and their faults (the paper calls it sensitivity, but I like the word selectivity better, as it describes whether the circuit breaker is selective to a particular type of request). For instance, the installation site of circuit breakers plays a substantial role. They can be installed at the callsite, meaning that the circuit breaker installation “wraps around” the RPC call to the dependency service. If the same RPC is called in another place in the code, it will have a different callsite circuit breaker, meaning that these circuit breakers will work independently to determine the faulty conditions. A method-installed circuit breaker appears in the method that calls another service (it can be installed by some method annotation, for example). In this case, all functions calling the method that uses the other service’s RPC will share the circuit breaker. As you can imagine, this can lead to less sensitivity/selectivity, as many different execution paths may converge on one method that performs an RPC. A client-level circuit breaker can use one shared circuit breaker per client, potentially making it even less sensitive.

Luckily, as with many things distributed, the solution is partitioning. In the case of circuit breakers, we want to partition all possible failure scenarios, execution paths, request configurations, and anything else that can impact request execution into separate circuit breakers, so that each circuit breaker is as selective to a particular request flow or request type as possible. I think this general suggestion is much easier said than done, and in practice, achieving very good partitioning can be challenging.
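One way to picture the partitioning is a registry that keeps a separate breaker per request flow. The sketch below builds on the CircuitBreaker sketch from earlier in this post; the key fields (callsite, endpoint, request class) are just examples of dimensions one might partition on.

```python
class PartitionedBreakers:
    def __init__(self, breaker_factory):
        self.breaker_factory = breaker_factory
        self.breakers = {}

    def for_request(self, callsite, endpoint, request_class):
        # A fault along one (callsite, endpoint, request_class) flow only
        # trips that flow's breaker, leaving unrelated flows untouched.
        key = (callsite, endpoint, request_class)
        if key not in self.breakers:
            self.breakers[key] = self.breaker_factory()
        return self.breakers[key]

# CircuitBreaker is the class from the earlier sketch.
registry = PartitionedBreakers(CircuitBreaker)
cb = registry.for_request("checkout.py:42", "PaymentService.charge", "premium_user")
if cb.allow_request():
    pass  # issue the RPC here, then call record_success() or record_failure()
```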

Ok, I do not want to rewrite the entire paper in this summary, but there is a lot more content there, including code examples, full taxonomy, some improvements to circuit breakers, and hints at unsolved problems, so please read the paper for more!

Discussion

1) Purpose and challenges. The purpose of circuit breakers is to avoid overloading the system or some of its components by shedding work that is likely to fail. This overload can come from a few sources: (1) retries after the initial failed attempt, (2) some other mitigation and/or client-level workarounds for the failed component, and (3) overload due to expensive error handling. The circuit breakers can do well with overload sources (1) and (3). 

Another way to prevent overload, at least due to retries, is to shed only the excess work. See, a circuit breaker sheds all the work that may be problematic, which may not always be a good idea. For instance, if the problem exists due to a high load on some component, stopping all load to it will appear to fix the issue, resetting the circuit breaker and causing the high load to go to the affected component again, causing errors, and tripping the circuit breaker(s) again. I am not entirely sure what is better — a cyclical failure or a persistent one. From the debugging standpoint, I think, a persistent failure may be easier to identify.

So, we may want to shed only excess or extra work, but not the work that would have come organically if there was no failure. This extra load shedding achieves several goals — for faults caused by overload, we avoid cyclical on/off behavior (especially if we can add a simple load shedder on top that can drop any excess “good” work). For non-overload-related failures, sending work through can help with intermittent problems and also speed up recovery once the issue is fixed. Of course, it is not easy to identify and shed that “extra” load under all circumstances; for example, it may be hard to control a person with an itchy “F5 finger” when they do not see a page loading quickly.

Shedding only excess work may have several additional benefits depending on the system. For example, keeping the system busy at the level of the actual organic offered load may help keep the affected/failed services from downscaling due to low load after the circuit breaker trips. Similarly, for cached components, doing work, even such “wasted” work, may keep caches warm for when the problem is fixed. Thinking about the state of the system when the initial problem gets fixed and the circuit breakers reset is important, as it needs to be ready for a quick influx of requests to components that were sitting behind the open circuit breaker for some time.
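A minimal sketch of this “shed only the excess” idea (our discussion’s suggestion, not the paper’s mechanism) could be a per-second budget set to the estimated organic rate: anything within the budget goes through, and only the surplus, such as retries, gets dropped.

```python
import time

class ExcessLoadShedder:
    def __init__(self, organic_rps):
        self.organic_rps = organic_rps      # assumed known/estimated baseline
        self.window_start = time.monotonic()
        self.admitted = 0

    def admit(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:  # start a new one-second window
            self.window_start = now
            self.admitted = 0
        if self.admitted < self.organic_rps:
            self.admitted += 1
            return True                     # organic load: let it through
        return False                        # excess (e.g., retries): shed it

shedder = ExcessLoadShedder(organic_rps=100)
print(shedder.admit())  # True until the per-second budget is used up
```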

2) Implementation difficulties. Partitioning the circuit breakers may be hard to implement in large systems, as we need a lot of them to tune the system for proper sensitivity. We also need to have processes/procedures to adapt to changes in the system and make sure that over time old circuit breakers do not cause more harm than good. And finally, with many circuit breakers, there is a question of resource usage. Something has to keep track of all these circuit breakers, their timers, failure counts, etc. 

3) From microservices to stateful systems. Our reading group has been gravitating toward distributed systems with large state — databases, data stores, etc. So naturally, is there anything similar that can be done in these large systems? Metastable failures can be devastating for them, especially as scaling stateful services is a challenging and resource-intensive process that often cannot be done when the system is overloaded. The basic principles remain the same — these systems need to shed load when they are overloaded. The big question is what work to drop and where in the system to do so.

On the other hand, databases do not just exist by themselves; they are used by other apps and services. Can we protect databases/stores from overload with circuit breakers in the apps, sitting at the database call sites? It is not that easy with the existing circuit breaker designs — if the database overloads, all circuit breakers across the app may trip open. The load on the DB will fall, causing the circuit breakers to close, and the process may repeat in a cyclical pattern. It is possible that we need a smarter “overload-specific” circuit breaker to exist somewhere to drop/prioritize load coming to the stateful components.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. How to fight production incidents?: an empirical study on a large-scale cloud service

In the 125th reading group meeting, we looked at the reliability of cloud services. In particular, we read the “How to fight production incidents?: an empirical study on a large-scale cloud service” SoCC’22 paper by Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. This paper looks at 152 severe production incidents in the Microsoft Teams service. The authors looked at these incidents and distilled them into a handful of categories in terms of the root cause, mitigation type, detection, etc. And from placing the incidents into such categories/buckets, some interesting patterns started to emerge regarding the timeliness of incident mitigation, mitigation approaches, and potential areas for improvement. 

I liked that the paper described its data collection methodology, since the categorizations may be rather subjective. However, I will mention only one detail — the authors assigned a single root cause to each incident, even though some incidents are complex and may have more than one contributing factor. I also liked that the paper cautions readers against making hasty generalizations — the study focuses on just one large service, and many of the findings may be specific to that service.

So, with the above disclaimer in mind, what have we learned from the paper? The paper’s findings can be roughly broken down into a handful of topics: root cause, detection, mitigations, and post-incident lessons learned by the team/engineers. 

On the root cause side of things, different bugs in the service (Teams) only make up roughly 40% of the incidents (27% code bugs + 13% configuration bugs). Other categories include infrastructure failures, deployment failures, and authentication problems. On the infrastructure side of things, the paper separates infrastructure failures into three different categories. The first one, referred to as “infrastructure failures,” deals with scalability problems, like the inability to get enough nodes to run the work. The second infrastructure root cause bucket is “dependency failures.” Finally, the failures of databases & network dependencies get their own root cause category. But if we combine the three together, all infrastructure failures are around 40%. 

On the detection side, the paper suggests a significant deficiency in detection and monitoring systems and practices. Almost half of all incidents had some kind of detection malfunction, with 29% of incidents reported by external users and another 10% reported by internal users! If we look at the automated detection issues, many are due to bugs, lack of automated monitors, or lack of telemetry. 

For incident mitigations, the authors discuss the common types of mitigations and the reasons for some of these mitigations requiring more time to address the problem. While around 40% of issues were due to bugs, only 21% of all mitigations required a bug fix (8% for fixing code and 13% for fixing config). Additionally, 11% of issues relied on some “ad-hoc” fixes, which the authors describe as “hot-fixes.” The paper conjectures that it takes substantial time to go through the process of fixing bugs during mitigation. Instead, a faster way to recover is to initiate a rollback (22% of cases) or perform an “infrastructure change” (another 22% of incidents). By “infrastructure change” the authors mean scaling the system to take on more nodes/CPU. 

As for mitigation delay reasons (the paper calls them mitigation failures, although the mitigations themselves did not fail but instead took longer), the authors describe several common causes. Inadequate documentation and procedures are at the top of the list. Another common one is deployment delay, which occurs when it takes a lot of time to deploy the fix.

With all of the above findings, the lessons learned by the engineers come down to the following: more automation, more testing, and more changes within the organization (behavioral change, better documentation, better coordination). I want to focus a bit on the automation part, as it hits a bit too close to home for me. Most automation suggestions (26%) boil down to better tests, such as performance tests, chaos engineering, and better end-to-end testing. I am a bit skeptical about these numbers, as testing is something we blame all the time when problems occur — it is a reflex response to say that more testing is needed. And of course, we need more testing, but mind you, these same people who call for more testing when dealing with issues write features and (inadequate?) tests when they are not on call.

What caught my attention is the need for automated deployment. More specifically, “automated deployment” includes “automated failover.” Just the fact that this is a recurring ask from engineers puzzles me a lot. This means that at least a handful of times per year, running Microsoft Teams requires engineers to manually reconfigure the system to remove failed nodes/machines/services and switch the load to new/backup ones. 

The authors discuss several more insights after doing a multidimensional analysis. I am not going to go in-depth here, and instead, just leave snippets of the paper:

Discussion

1) General Applicability of results. As mentioned in the paper, all observations are taken from one service, so your mileage may vary. Another point to note is that Microsoft is a large organization with decently mature monitoring, automation, and deployment tools, which may impact the number and severity of observed incidents. It is quite possible that with a less mature set of tools, one may observe a greater variety of serious problems.

Another point to mention with regard to applicability is that there are many other large systems/services that are very different from Microsoft Teams. For example, if we look at systems that maintain a lot of state (i.e., databases), we may see other common root causes and mitigation patterns. For instance, a typical mitigation strategy observed in the paper is throwing more resources at the problem. This strategy works great with systems that run a lot of stateless or small-state components/microservices. Adding more resources allows scaling performance bottlenecks as long as underlying databases/storage dependencies can handle the load. Scaling stateful systems, like databases, often requires spending resources upfront to move data/state around — something a system cannot do well when overloaded. 

2) Lack of concrete actionable items. The findings are interesting from an educational standpoint, but they lack clear directions for improvement. The authors give general advice, along the lines of improving testing, building better automation, etc., but there are no concrete steps/procedures that may help improve systems’ reliability. One point made in our discussion is the need to focus on the specific type of problem to find more concrete solutions for that problem. 

3) Teams Outage. On the day we discussed this paper, Teams experienced an outage, likely related to some networking configuration issues.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

In the 120th DistSys meeting, we talked about “Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service” ATC’22 paper by Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, Akshat Vig.

The paper is loaded with content as it presents many different things, spanning ten years of development. None of the topics are covered in great detail, but I think it is still a great overview of such a massive project. Obviously, the authors discuss DynamoDB, its architecture, and its design. The paper also provides a brief history of the system and examines several challenges/lessons the team has learned while operating such a massive scale system.

To start with the architecture, users interact with the system by reaching out to a request router. The router performs authentication and admission control. Most importantly, however, the router has access to the partition metadata, allowing it to, well, route the requests to the proper storage nodes and replicas. A node hosts multiple replicas for different partitions.

So, speaking of partitions, each data item in DynamoDB has a unique primary key. These primary keys group items into partitions, which are replicated with Multi-Paxos for redundancy across multiple replicas in different availability zones. The assignment of key ranges to partitions (and partitions to nodes?) constitutes the metadata needed by the request router.
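Here is a rough sketch of how a request router might use such metadata to map a primary key to a partition and its replica set; the layout, field names, and hashing below are my own illustration, not DynamoDB internals.

```python
import bisect

class PartitionMetadata:
    def __init__(self, boundaries, replica_sets):
        # boundaries[i] is the exclusive upper bound (in hashed-key space)
        # of partition i; replica_sets[i] lists that partition's replicas.
        self.boundaries = boundaries
        self.replica_sets = replica_sets

    def route(self, primary_key):
        # A real router would use a stable hash; Python's hash() is only
        # stable within one process, which is fine for a sketch.
        point = hash(primary_key) % self.boundaries[-1]
        partition = bisect.bisect_right(self.boundaries, point)
        return partition, self.replica_sets[partition]

meta = PartitionMetadata(
    boundaries=[100, 200, 300],
    replica_sets=[["az1-n3", "az2-n7", "az3-n1"],
                  ["az1-n5", "az2-n2", "az3-n9"],
                  ["az1-n8", "az2-n4", "az3-n6"]],
)
print(meta.route("user#42"))  # e.g., (1, ['az1-n5', 'az2-n2', 'az3-n9'])
```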

DynamoDB has two types of replicas — log and storage replicas. Log replicas only contain replication write-ahead logs. Storage replicas, in addition to having a log, also maintain a state derived from applying/executing the logged commands against the B-tree storage. Both replica types can participate in Paxos quorums, but log replicas are more lightweight and cannot serve reads/queries. The system uses log replicas to improve availability and reliability — it is easier to spin up a log replica that only needs to accept new commands than to rebuild a full storage replica with all the partition’s data. This speed becomes important under failures to restore the system to the proper degree of replication quickly.

From the historical perspective, while DynamoDB started as a pretty basic NoSQL (key-value) store, it has added many features over time, such as secondary indexes, JSON documents, encryption, transactions, and more.

Finally, a decent chunk of the paper focuses on various nuances of running large-scale NoSQL data stores. For example, the paper notes data errors and how DynamoDB verifies data integrity with checksums for every data transfer between nodes. DynamoDB also does background data verification at rest. Another important lesson on the reliability side of things is the need to maintain enough capacity in the metadata system. While the request routers use caches for metadata to improve performance, a metastable failure in the caching system led to a rather big outage. After that incident, the caches are used only to improve latency and no longer offload capacity from the main metadata storage service — all requests for metadata go through to the service even if they are answered by the cache first. This ensures there is adequate capacity to serve critical metadata operations regardless of cache failures. Naturally, this is a more expensive solution for the sake of reliability.
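The resulting pattern, using the cache purely as a latency optimization while the backing service still sees every request, might look roughly like the sketch below; the cache and metadata_service objects and their methods are hypothetical stand-ins.

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def get_metadata(key, cache, metadata_service):
    # Always forward the lookup to the backing service, even on a cache
    # hit, so the service's provisioned capacity always sees the full load
    # and a cold or failed cache cannot trigger a metastable overload.
    future = _pool.submit(metadata_service.lookup, key)
    cached = cache.get(key)
    if cached is not None:
        return cached            # cache hit: answer the caller sooner
    value = future.result()      # cache miss: wait for the service
    cache.put(key, value)
    return value
```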

The authors discuss plenty of other lessons and challenges, such as managing the load and balancing the capacity of the system, implementing backups, and improving availability.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation

In the 122nd reading group meeting, we read “The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation” paper by Ruihong Wang, Jianguo Wang, Stratos Idreos, M. Tamer Özsu, Walid G. Aref. This paper looks at the trend of resource disaggregation in the cloud and asks whether distributed shared memory databases (DSM-DBs) can benefit from memory disaggregation (MD) to become the next “hot” thing in the database world. 

The idea, on the surface, is simple — decoupling compute from memory enables the creation of databases with many separate stateless compute workers for query processing that share a remote memory pool. According to the authors, the driving force for this is RDMA, as it offers latency and bandwidth that are at least somewhat comparable to local memory. On top of that, such a system can also use disaggregated storage for durability. The bulk of the paper then focuses on the challenges of such a disaggregated design without going in-depth into the database architecture itself. The paper also does not go deeply into the design of the disaggregated memory system, although it points to a few issues to solve.

The first challenge listed by the authors is the lack of appropriate APIs to access memory. In particular, the paper suggests having APIs that are more in line with how memory is managed locally — memory allocation APIs. In this disaggregated memory pool case, the memory allocation must work with the virtual address space of the memory system. The authors also suggest data transmission APIs to facilitate moving data for local caching at compute nodes. Finally, a function offloading API can move some simple compute to the memory system (doesn’t this defeat the purpose of memory and compute disaggregation?).
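To make the API discussion more concrete, here is a hypothetical, in-process stand-in showing the shape such interfaces could take (allocation, data transmission, and function offloading); this is purely an illustration, not a real RDMA or disaggregated-memory library.

```python
class ToyMemoryPool:
    """In-process stand-in for a disaggregated memory pool, exposing the
    three API groups the authors call for."""

    def __init__(self, size):
        self._mem = bytearray(size)
        self._next_free = 0

    def alloc(self, num_bytes):
        # Allocation API: hand out a region of the pool's address space.
        addr = self._next_free
        self._next_free += num_bytes
        return addr

    def read(self, addr, num_bytes):
        # Data transmission API: copy bytes into a compute node's local cache.
        return bytes(self._mem[addr:addr + num_bytes])

    def write(self, addr, data):
        # Data transmission API: push locally modified bytes back to the pool.
        self._mem[addr:addr + len(data)] = data

    def offload(self, func, addr, num_bytes):
        # Function offloading API: run simple compute next to the data.
        return func(self._mem[addr:addr + num_bytes])

pool = ToyMemoryPool(size=1 << 20)
addr = pool.alloc(4)
pool.write(addr, b"\x01\x02\x03\x04")
print(pool.offload(sum, addr, 4))  # 10
```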

The second important set of challenges deals with the disaggregated memory system itself. How does such a system handle node failures and still remain available and durable? Sadly, the paper does not provide any concrete details aside from hinting at high-level solutions, all of which will cost performance — backups on storage, erasure coding, replication, etc.

The third block of challenges has to do with concurrency control. If the database system caches data locally at worker/compute nodes, then we need to worry about cache coherence when multiple workers can access the same memory. Here we see that memory disaggregation is still slow — a local cache can be an order of magnitude faster. This is a smaller difference than, let’s say, going from SSD to memory, but it is still substantial. The authors suggest that the reduced speed differences in this new memory hierarchy will require new caching protocols, prioritizing execution time and not cache hit rate.

Another concurrency challenge has to do with transactions, as now we have the potential to fit all data in one large memory pool with many workers accessing it concurrently. Again, the paper does not have many concrete solutions but suggests “rethinking distributed commit.” Similar is the “solution” for concurrency control. It is costly to implement locks over RDMA, so we need to rethink CC as well, preferably without locks. Lastly, this all needs to work with thousands of compute nodes.

The last set of challenges is indexing-related. This, again, can get into the tricky RDMA performance limitations, so we need to have an RDMA-conscious index design. Also, the index needs to work well under high concurrency. 

Discussion

1) Details. Our group collectively found the paper to be rather shallow on details of how these systems may work. While the paper examines some literature on the shared memory databases of the past, it lacks depth and connections with this new paradigm of disaggregated memory used over RDMA. We are especially curious about more depth for concurrency issues and solutions, as many stated issues may have been solved in prior shared memory databases, albeit at a smaller scale.

One example where the paper is very shallow is the disaggregated memory system itself. Stating that there are challenges with availability and durability in a core component of all DSM-DBs is not going to cut it — the entire premise of the paper depends on such a disaggregated memory system being fast and reliable. Without these basics, the rest of the discussion becomes largely irrelevant.

2) Memory Disaggregation. We discussed the memory disaggregation idea in general and whether it can become a mainstream technology. See, storage disaggregation is kind of ubiquitous — you create a VM in some cloud, be it AWS or Azure or GCP, and the storage this VM gets is likely to be in a different box (or rather a set of boxes) than your VM’s CPU or memory (think of EBS volumes on AWS EC2). We are okay with this, as this storage is plenty fast and behaves just as if it were located in the same server as the rest of the virtual hardware. This whole memory disaggregation with RDMA does not work this way, creating a lot of challenges. Most importantly, this disaggregated memory cannot (yet?) be made as universal as disaggregated storage. We won’t run code from it or use it for anything that needs to copy/change contents a lot. As a result, this disaggregated memory, at best, will look like another “storage” solution to systems that use it — something that may be faster than durable storage, but still not fast enough for general use.

Personally, I see more utility in the future with memory disaggregation using CXL. There was a paper at the recent USENIX ATC on the topic. Such a solution may act more like additional memory on demand than a shared pool of memory between processors/nodes, but it will also avoid the cache coherence issues and the difficulties, limitations, and reliability concerns of RDMA. I can envision a top-of-rack memory pool that tenants can tap into if they need more memory, or a cloud VM product that can scale memory and CPU cores independently of each other.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!