Tag Archives: reliability

Metastable Failures in the Wild

Metastable failures in distributed systems are failures that “feed” and strengthen their own “failed” condition. The main characteristic of a metastable failure is a positive feedback loop that keeps the system in a degraded/failed state. These failures are hard to spot, as they always start with some other distraction — some trigger event that nudges the system over its operating limit or capacity. However, fixing the trigger is not enough, and engineers realize this too late — so late that sometimes it takes days to stabilize the system and return to the pre-failure level of operation.

I already blogged about this topic last year in the context of our original metastable failures paper from HotOS’21. Now we have the “improved and extended” version out at OSDI’22, and oh boy, we have a lot of juicy stuff there. That failure that took days to recover? I did not make it up. My students spent quite some time going through hundreds, if not thousands, publicly available post-mortem reports for different systems to identify a handful of impactful metastable failures. We are sure there are more metastable failures out there lurking in these bug and failure reports. In fact, we had significantly more candidate failures, but many post-mortems simply do not have enough information to be absolutely sure metastability played a role.

Anyway, above is a table from the paper that lists examples of metastable failures in the wild. Some last for an hour, and some take days to resolve. Many examples are due to retries and retry-induced load amplification. See, when a system becomes slow (i.e., some server failure or transient network issue), it is likely that latency on some requests will start to grow, and it may reach the point at which the client application gives up the request and retries. Retries are great for dealing with transient issues, but they create more load for the system by potentially repeating the same request over again. When the system is already overloaded due to the ongoing issue, retries do not solve any real problem and create more unnecessary load. More load means higher queuing times, higher latency, and more retries. This vicious cycle can continue even after the original trigger issues are fixed, and, in many cases, there is no escape from this except load shedding.

We managed to reproduce a toy example of retry-induced failure using MongoDB. It is a bit concerning that we were able to do so at a small scale, but our experience also showed that not all hope is lost. See, as it turns out, modern systems are good at load shedding — when the request queues up for too long and is too old by the time server can handle it, such an old request can be dropped without wasting further resources. Our experiment here used a very aggressive retry policy that had to overrun the load-shedding mechanisms of the underlying database. That being said, sometimes the difference between a failure and recovery can be down to minuscule difference in the trigger. The figure here shows a failure when a trigger took out 80% of CPU capacity, and a successful trigger recovery when the trigger reduced CPU capacity by 78%.

Now, of course, retries are not the only cause of the self-feeding failure loops, but retries are something most engineers are familiar with, so the post-mortem reports tend to have more details on that topic. Potentially, this is one reason for observing so many retry-induced metastable failures in the wild. However, we saw other feedback loops. For example, one of the failures in the list was caused by excessive logging on the error path, which in turn caused the system to slow down and have more requests enter that same error path.

Seeing the examples in the real world and reproducing them is important. Our HotOS paper claimed that reproducibility at a small scale is often challenging. In this new paper, we have a handful of metastable failure scenarios reproduced at a scale of at most a few VMs. However, a good chunk of the paper is about generalizing what we have learned from the observations. We developed a “reasoning framework” for metastability and metastable failures.

The two important characteristics of metastable failures are triggers and amplification effects. On the trigger side, we can have a load-spiking trigger and capacity-decreasing triggers. The former trigger type temporarily increases the load on the system above some pre-trigger level. An example of this can be higher than expected demand from users. The later trigger type degrades the systems’ serving capacity. An example of this trigger is a server failure that takes some capacity out.

Similarly to triggers, there are two types of amplification effects: workload amplification and capacity degradation amplification. Both feedback loops kick in after a system enters an overloaded state — having an insufficient serving capacity for the current offered load. The workload amplification seems to be the more common type in the examples we found so far. It creates even more load in response to overloaded. The vicious retry loop from earlier is an example of workload amplification. The other type, capacity degrading amplification, reduces the serving capacity of the system in response to an overload. This can happen, for instance, in cached systems and applications, when losing some cache can result in downstream system overload that either prevents cache rebuild or downright causes more cache to expire and go stale.

The figure above shows how these two types of triggers and two types of amplification mechanisms can combine. Also, an important point to note — it is possible to have a combination of more than one trigger and more than one amplification mechanism, something that we have seen in the wild. As far as experiencing metastable failure, a lot depends on how the triggers and amplification mechanisms interact and how much load was offered to the system before the failure sequence began.

Small triggers that do not last long or do not have a big enough magnitude (i.e., a trigger that does not overload the system too much) may be less likely to cause an issue. On the other hand, a trigger that lasts an hour and takes out half of the servers is a much bigger problem. Of course, the magnitude of the trigger is not an absolute metric — your system may be fine even when you take out half the servers if it was operating at 25% of its maximum capacity before the trigger event took place.

Triggers are not metastable failures. They are failures on their own that can cause availability and performance issues for clients or users, but they are only the beginning of the metastable failure sequence. The amplification mechanisms are those feedback loops that start after the system is overloaded due to a trigger. Amplification leads to a metastable failure after the system’s overload is amplified so high that even fixing the trigger (i.e., returning lost capacity, or solving the load spike) does not solve the problem. The figure marks those metastable failure points. Naturally, the type of amplification matters a lot. A system with “slow” amplification loops that add little overload over time is better than systems with “fast” amplification loops. In both cases, we can have a metastable failure, but a slow amplification mechanism gives us more time to deal with the trigger.

Anyway, what are the lessons here? For one, I cannot stress the importance of understanding the mechanisms involved so that we can operate the systems better. Maybe, it is not a good idea to squeeze every single bit of capacity from the system/hardware? Having some slack resources allows us to absorb smaller triggers and gives time before the system enters the metastable failure for bigger triggers. The same goes for making systems with predictable performance characteristics. If some code path is significantly slower than the other, a sudden switch to a slower code path can act as a trigger, since it is essentially a capacity degradation.

Load shedding is the solution to metastable failures. All recovery efforts included some load shedding mechanisms. After all, to bring the system back into operation, its load needs to decrease sufficiently as to stop the amplification. But we do not need to leave load shedding only for recovery efforts after it is already too late. More aggressive load shedding done ahead of time or early after the trigger overloaded the system can save from a bigger failure — it is better to lose some requests and piss some clients/users than lose the entire system and impact everyone. Here I can also lump load prioritization — prioritize the work that will raise the system’s goodput if possible.

Another point is to account for the human factor — about half of the observed failures in the wild had some direct human involvement, such as rolling out bad configuration or buggy code. It is important to test and deploy configuration the same way as software. This is something that many systems still seem to neglect. Proper testing of new code is paramount, including stress testing. And, of course, do not deploy on Friday… we have seen a failure in which a weekend load/traffic reduction masked increased resource usage of a new version deployed on Friday. The system failed after the load returned on Monday.

Another interesting observation is the “fix-to-break” behaviors. On a few occasions, engineers increased the severity of the amplification mechanism (feedback loop) after fixing an outage, leading to a more severe failure in the future. I refer to this as a “fix-to-break” pattern; it is a good illustration of why we need to understand metastability in order to avoid causing failures.

As always, the paper has way more details. We talk a lot about definitions and formalism of metastable states and failures. We have a handful of distinct reproductions with triggers and amplifications caused by retries, cache failures, and garbage collections. And finally, we include some real-world first-hand details on metastable failures at a large internet company.

Reading Group. Cores that don’t count

Our 66th paper was a recent HotOS piece about faulty CPUs: “Cores that don’t count.” This paper from Google describes a decently common (at Google datacenter scale) issue with CPUs that may miscompute or silently fail under some conditions. This is a big deal, as we expect CPUs to be deterministic and always provide correct results for all computations. In the rare instances of failures, we expect these failures to be “loud” — CPU completely halting or rebooting or at the very least software crashing. 

However, Google engineers observed some dangerous behaviors that do not align with the expectations. Unfortunately, as CPUs fail, they may do so silently and produce a wrong result to a computation. Since the failures are silent, they are way more difficult to detect. According to the paper, such failures often seem non-deterministic and often require specific conditions to manifest. Moreover, most of the time, the faults do not impact an entire CPU but one or a few cores, leading to computation errors appearing sporadically even on one machine.

Speaking of the factors and conditions that play a role in these failures, the paper talks about CPU age — a CPU may start to miscompute some instruction after some time in operation. This makes quality control both at the CPU vendor and customer more difficult — even when there is a known tendency for some errors, they may not appear on new CPU and pass the QA tests. Other factors are operating temperature, frequency, and voltages. The paper states that their impacts vary. However, it is not unreasonable to assume that some errors may start to appear under heavier load and higher temperature and then subside when the load reduces and the CPU cools down. 

All these make detection hard, as reproducing a fault often requires running the same software on the same hardware under the same conditions. That being said, the authors claim that they are getting better at detecting and reproducing many of these errors. For example, the paper mentions a set of benchmarks that exercise more common failure scenarios. Such a benchmark tool can be handy for confirming the faults or offline testing. Unfortunately, it is not very helpful in detecting problems in live production workloads. Currently, engineers at Google detect the CPU issues through crash and bug reports, which are obviously very noisy. However, some patterns, such as identifying cores involved and checking if the same cores appear in multiple similar reports, can provide the first clues. 

So what are the repercussions of faulty computation if left undetected and unmitigated? The short answer is all kinds of stuff! The major problem here is that a computation may cause a fault in a system before any of the traditional redundancy is involved. For example, a failure at the leader node of a replicated system can produce a bad outcome. If this outcome is then checksummed and replicated, it will propagate through the cluster undetected. Our replicated/redundant systems are geared for detecting data issues due to transfer or storage and may lack the tools/mechanisms needed to detect bad computations. 

I think I will leave the summary here. Murat Demirbas goes into more detail in our group’s presentation. He also touches on a similar arXiv paper from Facebook.

Discussion.

1) Vendors. An obvious question to ask is what vendors/models have these issues. For obvious reasons, the paper does not provide any concrete information here, and it makes sense. It would not be a good idea to point fingers at a major company with which you do a lot of business. So the fact that Google (and Facebook) decided to bring this issue to attention instead of trying to resolve the problem with vendors silently is already a big deal. 

2) In-house CPU development. Another discussion point over CPU manufacturers was whether taking an in-house development approach can help improve the quality. Think of AWS Graviton CPUs. From one side, this gives the large-scale company (Google, AWS, etc) more control over the design and quality assurance. On the other side, these defects are likely due to the manufacturing (i.e., shrinking transistor size) and not the design. The design issues would have likely been more deterministic. Manufacturing is hard, and only a handful of companies can pull it off. So the end product would have been made by a third party anyway. So the remaining benefit is a potential for better QA, but this is also hard. For instance, I mentioned in the summary that some issues appear after some aging. Also, big CPU companies may still have an edge in QA due to the sheer volume of CPU made and shipped. If more consumers report the problems, it may be easier for CPU vendors to identify patterns and improve in the next iteration. 

3) Non-deterministic issues. The paper mentions that many of the issues are non-deterministic. At the same time, the paper discusses the techniques they use to find the problems. So it appears that there are actually a few classes of failures: (1) purely non-deterministic faults, (2) obfuscated deterministic faults, and (3) openly deterministic faults. The third category is easy, with problems reproducible all the time, such as design bugs. The second category is reproducible faults that hide. It seems like the issues discussed the most in the paper are of this type. It may be difficult to find them in all the noise as the problems manifest only under certain conditions. However, once these conditions are known, it becomes more or less easy to reproduce and identify. The first category is the tricky one. Does it even exist? Or do these failures require even more “stars to align” for the problems to appear? Anyway, it is this category that is the most challenging to look for and identify. If these failures do exist, is there even a systematic way to detect them?

4) Hardware mitigations? One way to try to approach the problem is with hardware redundancy. For instance, a system may perform some critical computations redundantly on different cores or CPUs and cross-check the results. If different cores produce a different result for the same computation, then one of the cores is at fault, and we may need a third one to arbitrate. This redundancy is not new, and mission-critical systems use it. However, this is too expensive for most public services and cloud systems. 

5) Are these byzantine failures? Murat raised this question at the end of his talk, and this is an interesting one. Byzantine faults are faults that occur due to a system behaving out of spec. But with faulty computation, the out-of-spec behavior may be very subtle. Again, consider a replicated system with a leader. In a computation fault, the leader will try to replicate the same corrupted data to all the followers, leaving no possibility for a cross-check.

6) Software mitigations? So, if the faults are kind of byzantine, but also subtle enough to avoid detection by some protocols, then what can we do? One idea that was floating around is defensive programming — using assertions and verifying the computations (this works for problems that can be checked relatively cheaply). Good error reporting is also crucial to help find the correlations that may point to faulty cores. 

A more fundamental question, however, remains. Can we design cloud-scale systems that tolerate such failures without costing too much?

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience

On Wednesday, we had our 26th reading group meeting, discussing RocksDB with a help of a recent experience paper: “Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience.” Single-server key-value storage systems are crucial for so many distributed systems and databases. For distributed folks like myself, these often remain black-boxes that you pick up and use. That is until something in your system starts to crumble peculiarly, and you dive in to investigate…

Anyway, what I want to say is that KV-stores are important. Their performance matters a lot in the world of fast CPUs and fast networks, where every millisecond of slowdown at storage can no longer be “masked” by other “slow” components. This is where this paper takes us — improving the performance, reliability, and feature-set of RocksDB over the years as technology and demands have evolved. 

To understand the experiences and lessons of the paper, we first need to look at the underlying technology behind RocksDB. In a nutshell, RocksDB is an LSM-tree (Log-Structured Merge tree) key-value storage. LSM trees have been used in storage for quite some time, as they are relay good for write-intensive workloads. The basic idea of the LSM tree is that data are stored sorted by key. These sorted files are called sorted-strings tables or SSTables for short.

Sorted Strings Table with key-value pairs.

Now, maintaining SSTables requires that data are written to storage sequentially in such a sorted state. Of course, the data does not come pre-sorted to a database, so the system needs to do something else before writing these sorted files to disk. The storage system will keep an in-memory buffer, called memtable, of some relatively large number of updates. This memtable can be represented as some tree structure to allow for efficient insertions. Before each operation is added to a memtable, it is written to a write-ahead-log (WAL) for durability. The WAL reconstructs a memtable in the event of a failure. Once memtable reaches a certain size, it is flushed to disk in a sorted manner. At this time a new empty memtable can start. An important aspect of writing these sorted files is keeping track of their recency order. 

Memtable and SSTable files/segments.

When a read request for a key arrives, the system first looks at the memtable to see if data is there. Memory lookup is relatively cheap since no disk access is needed. However, if the requested key is not in the memtable, then we must search on disk, starting from the most recent SSTable segment. Looking up data on a disk can be slow since the system needs to scan a good chunk of a file to find the spot where the key might exist in the sorted list. Naturally, we want to take advantage of the sorted nature of the file. For this, a system maintains a sparse index for each file with the offsets to narrow down the search. Then the system only needs to scan a portion of a file between the two offsets where the key may exist. If the data is missing in the most recent file, then a search continues in the next most recent one and so on. This process results in some peculiar behaviors. For instance, it generally takes less time to find more frequently used data. But it also takes a lot of time to find out that the data is missing entirely. Fishing for non-existent data is a waste of time, so an additional index, a bloom filter, can be used to tell whether the key is guaranteed to be missing.

Index points to some file offset. To lookup key ‘city’, find where ‘city’ fits in the index (between ‘blog’ and ‘food’) and search in that part of the file.

Another caveat the sequential writes create for us is dealing with old versions of data. See, when we write an SSTable to disk, it is immutable, and when an update or delete to a key comes in, this update will eventually flush to a more recent file. This influx of new data creates a situation where old data that is no longer needed keeps occupying space and potentially increases search time. So the system needs to clean up old data frequently. A compaction procedure mitigates the space amplification by cleaning up old data. It essentially takes multiple files and merges them into one bigger file.

Compaction removed old value of ‘city’. Old files are removed and replaced with a new compacted one.

So, my oversimplified descriptions of LSM storage is not necessarily how RocksDB operates, but it should give enough intuition for us to proceed and dive into the lessons and experiences of Facebook engineers working with RocksDB. 

Resources: IOPS vs Space vs CPU

The paper starts by exploring resource efficiency and how optimization priorities were changing over time. RocksDB runs best on SSDs, and these storage devices have a limited lifespan bound by the number of write cycles. Naturally, engineers focused on issues of write amplification (the same data rewritten multiple times) to make sure SSDs do not die prematurely. Interestingly enough, the paper almost makes it sound like write-amplification mitigation efforts were largely wasted. The authors state that the workloads used at Facebook are not too IOPS-heavy (does it meant they are not very write-heavy for write-optimized storage?), and storage space was a more pressing concern. Because of this, the engineers have shifted their efforts to the space-amplification problem (a key occupies more space than it needs to, for example, due to having multiple old versions of it).

Another issue brought up is the CPU utilization. Here, again, the paper states that CPUs are rarely a bottleneck. However, to me, it seems like these represent a delicate resource trade-off. For example, to reduce space consumption, we may need to use more aggressive compression that uses more CPU cycles and more aggressive compaction that needs both CPU and IOPs (and increases write-amplification). So I am not sure about the correctness of saying whether some resource here is a bottleneck or not. They all can be a problem, and it seems more about the ability to reach some balance for a given workload and infrastructure. I believe the need to find such balance in different applications is part of the reason behind the multiple compaction strategies mentioned in the paper.

A significant portion of the paper then focuses on dealing with resources at scale. For example, many instances of RocksDB may coexist on one server, requiring resource management to prevent one instance from hogging all the resources. Other resource-related aspects involve the treatment of write-ahead logs (WALs). For example, it is possible to completely turn off RocksDB’s internal WAL to conserve resources. Of course, this leaves the system vulnerable to data loss in the event of a crash, but this may not be a problem if an application using Rocks has its own WAL for things like transactions or replication. An interesting mention for resource management is rate-limiting file deletion. This issue seems a bit specific, but the authors explain how file deletion can be costly and impact other tenants using the same SSD.

Features

The paper also extensively talks about new features and their significance. Similar to how the authors have approached resource efficiency, these features largely stem from operating at scale. Many of the points simply make sense when I read them, but I suspect that these realizations were not as easy in practice and carry some production pain points. For example, we usually expect backward compatibility, but designing forward compatibility, where an older version should be compatible with a newer one, is definitely a result of sleepless nights after unrolling from some newer but buggy version and realizing that data files changed to the point that the old version no longer understands them. 

The flexibility of RocksDB is another weaved-in theme of the paper. Since the storage system is used in a variety of applications with a variety of needs, this again makes total sense. It appears that the main goal of many features is to make the system more extensible and fit into many different contexts without creating any roadblocks on purpose. One such example is improvements to configuration management that went from “in-code” configuration to having configuration files. However, one big configuration problem directly stems from the flexibility goal — too many different parameters to tune, and it seems like there is no good solution for this. 

The paper presents a few other examples of flexibility features meant to help build apps on top of RocksDB. If implemented, native versioned storage can greatly help systems relying on multi-version concurrency control (MVCC). This, however, may come with a performance penalty. At the same time, MVCC systems have already been relying on RocksDB for storage, since the “no roadblocks” principle provides great flexibility in how keys and values are encoded, allowing versioning information to be a part of the key. 

Replication and backup support got their own subsection in the papers, but this is nothing but a trivial “you can copy the files to another machine to start a new replica” approach. This is hardly a feature, but again, it plays nicely with the idea of designing a system with as few roadblocks as possible and letting users/engineers be creative with using it. 

Reliability

Reliability is a big topic in the paper. We want the data stored in the database to remain correct and intact. Luckily, there is a very concise summary for this — use checksums! The authors point out that their checksum procedures only work for data already in storage and that they are still working on checking the integrity of data in memtables. This memory corruption may not be that big of a problem though. Thankfully, unlike our personal computers, servers rely on ECC memory that can handle some memory issues all by itself. 

I will finish my summary with a large table of features and changes to RocksDB straight from the paper. 

And as always, we have our groups presentation by Rohan Puri available on YouTube:

Discussion

We had a very long discussion after the presentation. I think it lasted almost an hour, just talking about KV-stores in general and RocksDB in particular. There is no way I can possibly summarize every discussion point, but I will try to pick the important ones (by my judgment of their importance) 

1) Scratching the surface. This point started in our pre-presentation discussion. While the paper talks about many different features and issues and tries to explain the reasons for the decisions taken, some explanations barely scratch the surface. Of course, it would be rather difficult to talk about eight years of development and go into deep technical discussions. However, what interested the group the most are some rather odd talking points throughout the paper. For example, talking about rate-limiting file deletions is oddly specific. Why not have rate-limiting for all tasks that may have a high impact on IOPS? These oddly specific examples scream about rather interesting back-stories that are obviously missing from the paper. 

2) Checksums. The checksum discussion was rather interesting. There are multiple layers of checksums. For example, block checksums make a lot of sense, as they are written when an SSTable block flushes to disk. One observation made in the reading group is that the file checksums were added late in the RocksDB lifecycle. A plausible explanation for this is file checksums are rarely needed, as they would come in handy when, for example, copying the entire SSTable file from one machine to another to start a new replica. And in this hopefully rare occasion, we can check the integrity of the data the long way — open the file and go block by block and check block checksums. 

3) Replication. Obviously, RocksDB is a single-server system, but it serves as a store for many replicated systems. In the group, we found it interesting that the paper still talks about replication. However, the replication discussions in the paper boil down to designing the permissive systems that allow to built replicated solutions on top.

4) Too flexible? One of the bigger goals of RocksDB is its flexibility to fit into different applications with different requirements. This creates a system that has too many features, with any application only using a handful of them. However, this ability to tune and have all these features complicates the configuration and management of the system. One notable example is CockroachDB that developed its own in-house replacement for RocksDB with fewer features, and having fewer features seems to be a big bragging point for Cockroach folks. 

5) Impact of Facebook hardware infrastructure. One concern raised during the discussion was the impact of hardware infrastructure at Facebook on the overall design trajectory described in the paper. Of course, it is true, that Facebook deploys RocksDB in their systems and their own infrastructure. But it also means that other non-Facebook users have to adjust to decisions made with Facebook-grade infrastructure in mind. 

One such example is the write-amplification vs space-amplification discussion in the paper. While Facebook engineers have concluded that on their SSDs (and their workloads), write-amplification does not pose a serious risk of premature SSD failures, the same may not be the case for other users who may have lesser quality SSDs or more write-demanding workloads. It is a serious enough concern that authors acknowledge the existence of LSM-tree solutions with better write-amplification mitigation strategies. Moreover, at least some of these solutions have been put into production use already. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Metastable Failures in Distributed Systems

Metastability is a stable state of a dynamical system other than the system’s state of least energy.

Wikipedia

Distributed systems often fail spectacularly and unpredictably. They are a cause for a headache and sleepless on-call nights for way too many engineers. And this is despite lots of efforts to understand the failures, and all the tools and “best practices” we have to contain and/or prevent them. 

Today I want to talk about failures that may occur on a nearly healthy system with no perceived major bugs or design flaws. We describe this type of failure in our HotOS’21 paper. This work is in collaboration with Nathan Bronson, Abutalib Aghayev, and Timothy Zhu. It is largely based on Nathan’s observations and experiences.

So how does a “healthy” system fail? Naturally, something needs to happen for a failure to occur. We call this something a trigger. Lots of things can act as a trigger (GC, network blips, software pushes, configuration changes, load spikes, etc.). Triggers are impossible to avoid in our multi-tenant unreliable systems. 

Triggers cause backlogs or errors and make the system temporarily less happy. In some cases, however, they cause the system to spiral into the abyss and practically halt any useful work or goodput. When the trigger is removed (it was transient, or removed by the engineers), the system does not recover by itself. So we end up in a peculiar situation when the condition that caused a failure is no longer present, all system’s components seem to be functioning correctly, all servers/nodes are up, the code is working, but nothing useful is actually getting done and goodput remains negligible. 

We call this failure pattern a metastable failure, or rather we say that a system has entered a metastable failure state. The culprit behind metastable failures is a sustaining effect that prevents the system from leaving a bad/failed state even after the initial trigger is removed.

To understand more about the importance of triggers and sustaining effects, we need to look at how most distributed systems are deployed. Resource utilization is a major factor for large systems, as achieving better utilization can save a lot of money at scale. This means that systems often have little extra resources to spare when some unexpected load increase occurs. These systems operate in a metastable state that we call a metastable vulnerable state, as the system is vulnerable to a failure if enough of a trigger disturbs its stability.

Of course, having some unanticipated extra load applied to the system is not sufficient for a metastable failure to occur. In fact, even if the load pushes the system to run at its maximum capacity (or tries to run above the capacity), we should only see an increase in latency, and maybe some goodput degradation. When the extra load is removed, the system should return back down to normal operation all by itself. 

In metastable failure, the extra unanticipated load activates a positive feedback loop that creates more load. This positive feedback loop is the sustaining effect that prevents the system from recovering when the initial trigger is resolved. We say that a system is in a metastable vulnerable state when it is possible to activate such a sustaining effect loop using a strong enough trigger. This state typically occurs at the higher system utilization, as the system has fewer spare resources to absorb the extra load of the trigger without activating the positive feedback mechanism. Opposite of metastable vulnerable state is a stable state, where the positive feedback loop is not possible, allowing the system to recover by itself when the extra load is removed. Note that a system in a stable state may still have a sustaining effect mechanism, but it is not strong enough to feed on itself and diminishes over time when the extra load is removed. 

Let’s look at a hypothetical idealized example of a metastable failure. Consider a web application that uses a database capable of serving up to 300 queries per second (QPS). If more than 300 QPS are tried, the latency goes up by an order of magnitude, causing the web application to retry once after a 1-second timeout (when a retry fails, the system errors out). The web servers for the application produce a 280 QPS workload, which falls nicely below the database’s serving capacity.

Now let’s say there was a network glitch lasting a few seconds between the web servers and the database. After the network is fixed, all the delayed packets reach the database, resulting in a flooding manner, exceeding the database’s capacity of 300 QPS. This, in turn, increases the latency above the retry threshold at the web application, causing any new, post-network-glitch queries to retry as well. So now we have a backlog of queries from the network glitch and an effective new query load of 560 QPS (280 queries from the workload + 280 query retries). With such a high and sustained load, the database will not return to a normal state all by itself. This results in a maxed-out system that clears 300 QPS of load (smaller than the amplified load) and produces very little goodput since most of the queries take too long to process and get timed-out or discarded by the time the database returns them. This can continue indefinitely until either the offered workload reduces to under 150 QPS or a retry policy is changed/suspended.

Our simple example illustrates key metastable concepts and principles. In a stable state, a system can (gradually) return to normal operation. In the example, workload under 150 QPS is stable, since even with the workload amplification caused by the retry policy, the database will receive fewer new requests than it can handle each second, allowing it to gradually clear up the backlog and return to normal operation. While the system is clearing the backlog, its goodput, however, may remain compromised. Systems often operate in the vulnerable state because it seems from all metrics like the stable state is wasting resources, and the vulnerability is not known. This, however, leaves very little spare resources to absorb the extra load spikes caused by the trigger events. Once a system enters into a metastable failure state, it will feed onto itself until there is some sort of intervention. This intervention is costly, as it generally requires either reducing the offered load on the system or finding ways to break the positive feedback loops.

Another possible intervention approach is to increase the capacity of the system to push it into a stable state, but this can be challenging since the triggers are unanticipated. Such unpredictability makes it impossible to reconfigure to add capacity beforehand. And once the system is already in a metastable failure state, it may not have any resources to spare to run the reconfiguration procedure. In fact, a reconfiguration may even increase the workload amplification and make the whole situation worse.

In the paper, we discuss a few other hypothetical and anecdotal scenarios that are a bit more complicated. But the problem is not anecdotal by any means, as there are quite a few failure cases that surely look a lot like metastable failures “in the wild”:

  • Google App Engine Incident #19007. Trigger: Configuration change. Sustaining Effect: cascading load amplification. Fix: Reduce traffic level.
  • Amazon SimpleDB Service Disruption. Trigger: power loss crashing multiple servers. Sustaining Effect: load amplification due to a timeout. Fix: Change in timeout policy & introduction of additional capacity.
  • Cassandra Overload because of hint pressure + MVs. Trigger: rolling restart. Sustaining Effect: Few nodes could not catch up with hinted hand-off, preventing them from fully joining, causing the system to generate more hints. Fix: Policy change – disable hinted handoff.

With metastable failures affecting real systems, we need to have more understanding of the problem and processes involved to develop better coping and prevention strategies. A recurring pattern in our experience is that changes meant to improve the common case behavior of a system tend to increase the strength of the sustaining effects. Fast paths, caches, retries, failover, load balancing, and autoscaling all make the failure state less resource-efficient relative to the normal state, which makes the feedback loop worse. Beware of very high cache hit rates!

Metastable failures are over-represented in large site outages because the strength of the sustaining effect depends on scale. For example, if the feedback loop requires overloading a network fabric then small-scale stress tests will never trigger it. The sustaining effect may also act as a means of contagion so that the problem spreads across machines or shards. This means that the first occurrence of a novel metastable failure may be a major outage even in a hyper-scale distributed system.

Current approaches for handling metastability often lack the full comprehension of the problem and its causes. For example, engineers often focus on the trigger that causes the failure and fail to realize the complicated positive feedback loops that are responsible for the scale of the failure. Fixing a trigger is a temporary solution that may only push the system higher into the metastable vulnerable zone and make the next crash even more severe.

Unfortunately, replicating the failures and feedback loops is difficult, as many of the issues only manifest themselves at scale. This makes it ever so harder to fully understand the failures and develop efficient techniques for dealing with them. Furthermore, predicting the possibility of failure is difficult too. For instance, one can look for unexpected performance variations, and try to correlate them with other things going on in the system to learn the potential future triggers, but this still does not give the full predictive power of when a failure may happen. Improvements in our ability to predict and avoid metastable failures will also translate directly to efficiency gains because it will let us operate systems closer to their natural performance limits.

I think this is an exciting area for research. As the industry makes bigger and bigger systems and pushes them to work as cheaply as possible, we need to develop a proper understanding of how these critical systems fail at scale so we can continue improving the reliability.