Tag Archives: metastabiilty

Metastable Failures in the Wild

Metastable failures in distributed systems are failures that “feed” and strengthen their own “failed” condition. The main characteristic of a metastable failure is a positive feedback loop that keeps the system in a degraded/failed state. These failures are hard to spot, as they always start with some other distraction — some trigger event that nudges the system over its operating limit or capacity. However, fixing the trigger is not enough, and engineers realize this too late — so late that sometimes it takes days to stabilize the system and return to the pre-failure level of operation.

I already blogged about this topic last year in the context of our original metastable failures paper from HotOS’21. Now we have the “improved and extended” version out at OSDI’22, and oh boy, we have a lot of juicy stuff there. That failure that took days to recover? I did not make it up. My students spent quite some time going through hundreds, if not thousands, publicly available post-mortem reports for different systems to identify a handful of impactful metastable failures. We are sure there are more metastable failures out there lurking in these bug and failure reports. In fact, we had significantly more candidate failures, but many post-mortems simply do not have enough information to be absolutely sure metastability played a role.

Anyway, above is a table from the paper that lists examples of metastable failures in the wild. Some last for an hour, and some take days to resolve. Many examples are due to retries and retry-induced load amplification. See, when a system becomes slow (i.e., some server failure or transient network issue), it is likely that latency on some requests will start to grow, and it may reach the point at which the client application gives up the request and retries. Retries are great for dealing with transient issues, but they create more load for the system by potentially repeating the same request over again. When the system is already overloaded due to the ongoing issue, retries do not solve any real problem and create more unnecessary load. More load means higher queuing times, higher latency, and more retries. This vicious cycle can continue even after the original trigger issues are fixed, and, in many cases, there is no escape from this except load shedding.

We managed to reproduce a toy example of retry-induced failure using MongoDB. It is a bit concerning that we were able to do so at a small scale, but our experience also showed that not all hope is lost. See, as it turns out, modern systems are good at load shedding — when the request queues up for too long and is too old by the time server can handle it, such an old request can be dropped without wasting further resources. Our experiment here used a very aggressive retry policy that had to overrun the load-shedding mechanisms of the underlying database. That being said, sometimes the difference between a failure and recovery can be down to minuscule difference in the trigger. The figure here shows a failure when a trigger took out 80% of CPU capacity, and a successful trigger recovery when the trigger reduced CPU capacity by 78%.

Now, of course, retries are not the only cause of the self-feeding failure loops, but retries are something most engineers are familiar with, so the post-mortem reports tend to have more details on that topic. Potentially, this is one reason for observing so many retry-induced metastable failures in the wild. However, we saw other feedback loops. For example, one of the failures in the list was caused by excessive logging on the error path, which in turn caused the system to slow down and have more requests enter that same error path.

Seeing the examples in the real world and reproducing them is important. Our HotOS paper claimed that reproducibility at a small scale is often challenging. In this new paper, we have a handful of metastable failure scenarios reproduced at a scale of at most a few VMs. However, a good chunk of the paper is about generalizing what we have learned from the observations. We developed a “reasoning framework” for metastability and metastable failures.

The two important characteristics of metastable failures are triggers and amplification effects. On the trigger side, we can have a load-spiking trigger and capacity-decreasing triggers. The former trigger type temporarily increases the load on the system above some pre-trigger level. An example of this can be higher than expected demand from users. The later trigger type degrades the systems’ serving capacity. An example of this trigger is a server failure that takes some capacity out.

Similarly to triggers, there are two types of amplification effects: workload amplification and capacity degradation amplification. Both feedback loops kick in after a system enters an overloaded state — having an insufficient serving capacity for the current offered load. The workload amplification seems to be the more common type in the examples we found so far. It creates even more load in response to overloaded. The vicious retry loop from earlier is an example of workload amplification. The other type, capacity degrading amplification, reduces the serving capacity of the system in response to an overload. This can happen, for instance, in cached systems and applications, when losing some cache can result in downstream system overload that either prevents cache rebuild or downright causes more cache to expire and go stale.

The figure above shows how these two types of triggers and two types of amplification mechanisms can combine. Also, an important point to note — it is possible to have a combination of more than one trigger and more than one amplification mechanism, something that we have seen in the wild. As far as experiencing metastable failure, a lot depends on how the triggers and amplification mechanisms interact and how much load was offered to the system before the failure sequence began.

Small triggers that do not last long or do not have a big enough magnitude (i.e., a trigger that does not overload the system too much) may be less likely to cause an issue. On the other hand, a trigger that lasts an hour and takes out half of the servers is a much bigger problem. Of course, the magnitude of the trigger is not an absolute metric — your system may be fine even when you take out half the servers if it was operating at 25% of its maximum capacity before the trigger event took place.

Triggers are not metastable failures. They are failures on their own that can cause availability and performance issues for clients or users, but they are only the beginning of the metastable failure sequence. The amplification mechanisms are those feedback loops that start after the system is overloaded due to a trigger. Amplification leads to a metastable failure after the system’s overload is amplified so high that even fixing the trigger (i.e., returning lost capacity, or solving the load spike) does not solve the problem. The figure marks those metastable failure points. Naturally, the type of amplification matters a lot. A system with “slow” amplification loops that add little overload over time is better than systems with “fast” amplification loops. In both cases, we can have a metastable failure, but a slow amplification mechanism gives us more time to deal with the trigger.

Anyway, what are the lessons here? For one, I cannot stress the importance of understanding the mechanisms involved so that we can operate the systems better. Maybe, it is not a good idea to squeeze every single bit of capacity from the system/hardware? Having some slack resources allows us to absorb smaller triggers and gives time before the system enters the metastable failure for bigger triggers. The same goes for making systems with predictable performance characteristics. If some code path is significantly slower than the other, a sudden switch to a slower code path can act as a trigger, since it is essentially a capacity degradation.

Load shedding is the solution to metastable failures. All recovery efforts included some load shedding mechanisms. After all, to bring the system back into operation, its load needs to decrease sufficiently as to stop the amplification. But we do not need to leave load shedding only for recovery efforts after it is already too late. More aggressive load shedding done ahead of time or early after the trigger overloaded the system can save from a bigger failure — it is better to lose some requests and piss some clients/users than lose the entire system and impact everyone. Here I can also lump load prioritization — prioritize the work that will raise the system’s goodput if possible.

Another point is to account for the human factor — about half of the observed failures in the wild had some direct human involvement, such as rolling out bad configuration or buggy code. It is important to test and deploy configuration the same way as software. This is something that many systems still seem to neglect. Proper testing of new code is paramount, including stress testing. And, of course, do not deploy on Friday… we have seen a failure in which a weekend load/traffic reduction masked increased resource usage of a new version deployed on Friday. The system failed after the load returned on Monday.

Another interesting observation is the “fix-to-break” behaviors. On a few occasions, engineers increased the severity of the amplification mechanism (feedback loop) after fixing an outage, leading to a more severe failure in the future. I refer to this as a “fix-to-break” pattern; it is a good illustration of why we need to understand metastability in order to avoid causing failures.

As always, the paper has way more details. We talk a lot about definitions and formalism of metastable states and failures. We have a handful of distinct reproductions with triggers and amplifications caused by retries, cache failures, and garbage collections. And finally, we include some real-world first-hand details on metastable failures at a large internet company.

Metastable Failures in Distributed Systems

Metastability is a stable state of a dynamical system other than the system’s state of least energy.

Wikipedia

Distributed systems often fail spectacularly and unpredictably. They are a cause for a headache and sleepless on-call nights for way too many engineers. And this is despite lots of efforts to understand the failures, and all the tools and “best practices” we have to contain and/or prevent them. 

Today I want to talk about failures that may occur on a nearly healthy system with no perceived major bugs or design flaws. We describe this type of failure in our HotOS’21 paper. This work is in collaboration with Nathan Bronson, Abutalib Aghayev, and Timothy Zhu. It is largely based on Nathan’s observations and experiences.

So how does a “healthy” system fail? Naturally, something needs to happen for a failure to occur. We call this something a trigger. Lots of things can act as a trigger (GC, network blips, software pushes, configuration changes, load spikes, etc.). Triggers are impossible to avoid in our multi-tenant unreliable systems. 

Triggers cause backlogs or errors and make the system temporarily less happy. In some cases, however, they cause the system to spiral into the abyss and practically halt any useful work or goodput. When the trigger is removed (it was transient, or removed by the engineers), the system does not recover by itself. So we end up in a peculiar situation when the condition that caused a failure is no longer present, all system’s components seem to be functioning correctly, all servers/nodes are up, the code is working, but nothing useful is actually getting done and goodput remains negligible. 

We call this failure pattern a metastable failure, or rather we say that a system has entered a metastable failure state. The culprit behind metastable failures is a sustaining effect that prevents the system from leaving a bad/failed state even after the initial trigger is removed.

To understand more about the importance of triggers and sustaining effects, we need to look at how most distributed systems are deployed. Resource utilization is a major factor for large systems, as achieving better utilization can save a lot of money at scale. This means that systems often have little extra resources to spare when some unexpected load increase occurs. These systems operate in a metastable state that we call a metastable vulnerable state, as the system is vulnerable to a failure if enough of a trigger disturbs its stability.

Of course, having some unanticipated extra load applied to the system is not sufficient for a metastable failure to occur. In fact, even if the load pushes the system to run at its maximum capacity (or tries to run above the capacity), we should only see an increase in latency, and maybe some goodput degradation. When the extra load is removed, the system should return back down to normal operation all by itself. 

In metastable failure, the extra unanticipated load activates a positive feedback loop that creates more load. This positive feedback loop is the sustaining effect that prevents the system from recovering when the initial trigger is resolved. We say that a system is in a metastable vulnerable state when it is possible to activate such a sustaining effect loop using a strong enough trigger. This state typically occurs at the higher system utilization, as the system has fewer spare resources to absorb the extra load of the trigger without activating the positive feedback mechanism. Opposite of metastable vulnerable state is a stable state, where the positive feedback loop is not possible, allowing the system to recover by itself when the extra load is removed. Note that a system in a stable state may still have a sustaining effect mechanism, but it is not strong enough to feed on itself and diminishes over time when the extra load is removed. 

Let’s look at a hypothetical idealized example of a metastable failure. Consider a web application that uses a database capable of serving up to 300 queries per second (QPS). If more than 300 QPS are tried, the latency goes up by an order of magnitude, causing the web application to retry once after a 1-second timeout (when a retry fails, the system errors out). The web servers for the application produce a 280 QPS workload, which falls nicely below the database’s serving capacity.

Now let’s say there was a network glitch lasting a few seconds between the web servers and the database. After the network is fixed, all the delayed packets reach the database, resulting in a flooding manner, exceeding the database’s capacity of 300 QPS. This, in turn, increases the latency above the retry threshold at the web application, causing any new, post-network-glitch queries to retry as well. So now we have a backlog of queries from the network glitch and an effective new query load of 560 QPS (280 queries from the workload + 280 query retries). With such a high and sustained load, the database will not return to a normal state all by itself. This results in a maxed-out system that clears 300 QPS of load (smaller than the amplified load) and produces very little goodput since most of the queries take too long to process and get timed-out or discarded by the time the database returns them. This can continue indefinitely until either the offered workload reduces to under 150 QPS or a retry policy is changed/suspended.

Our simple example illustrates key metastable concepts and principles. In a stable state, a system can (gradually) return to normal operation. In the example, workload under 150 QPS is stable, since even with the workload amplification caused by the retry policy, the database will receive fewer new requests than it can handle each second, allowing it to gradually clear up the backlog and return to normal operation. While the system is clearing the backlog, its goodput, however, may remain compromised. Systems often operate in the vulnerable state because it seems from all metrics like the stable state is wasting resources, and the vulnerability is not known. This, however, leaves very little spare resources to absorb the extra load spikes caused by the trigger events. Once a system enters into a metastable failure state, it will feed onto itself until there is some sort of intervention. This intervention is costly, as it generally requires either reducing the offered load on the system or finding ways to break the positive feedback loops.

Another possible intervention approach is to increase the capacity of the system to push it into a stable state, but this can be challenging since the triggers are unanticipated. Such unpredictability makes it impossible to reconfigure to add capacity beforehand. And once the system is already in a metastable failure state, it may not have any resources to spare to run the reconfiguration procedure. In fact, a reconfiguration may even increase the workload amplification and make the whole situation worse.

In the paper, we discuss a few other hypothetical and anecdotal scenarios that are a bit more complicated. But the problem is not anecdotal by any means, as there are quite a few failure cases that surely look a lot like metastable failures “in the wild”:

  • Google App Engine Incident #19007. Trigger: Configuration change. Sustaining Effect: cascading load amplification. Fix: Reduce traffic level.
  • Amazon SimpleDB Service Disruption. Trigger: power loss crashing multiple servers. Sustaining Effect: load amplification due to a timeout. Fix: Change in timeout policy & introduction of additional capacity.
  • Cassandra Overload because of hint pressure + MVs. Trigger: rolling restart. Sustaining Effect: Few nodes could not catch up with hinted hand-off, preventing them from fully joining, causing the system to generate more hints. Fix: Policy change – disable hinted handoff.

With metastable failures affecting real systems, we need to have more understanding of the problem and processes involved to develop better coping and prevention strategies. A recurring pattern in our experience is that changes meant to improve the common case behavior of a system tend to increase the strength of the sustaining effects. Fast paths, caches, retries, failover, load balancing, and autoscaling all make the failure state less resource-efficient relative to the normal state, which makes the feedback loop worse. Beware of very high cache hit rates!

Metastable failures are over-represented in large site outages because the strength of the sustaining effect depends on scale. For example, if the feedback loop requires overloading a network fabric then small-scale stress tests will never trigger it. The sustaining effect may also act as a means of contagion so that the problem spreads across machines or shards. This means that the first occurrence of a novel metastable failure may be a major outage even in a hyper-scale distributed system.

Current approaches for handling metastability often lack the full comprehension of the problem and its causes. For example, engineers often focus on the trigger that causes the failure and fail to realize the complicated positive feedback loops that are responsible for the scale of the failure. Fixing a trigger is a temporary solution that may only push the system higher into the metastable vulnerable zone and make the next crash even more severe.

Unfortunately, replicating the failures and feedback loops is difficult, as many of the issues only manifest themselves at scale. This makes it ever so harder to fully understand the failures and develop efficient techniques for dealing with them. Furthermore, predicting the possibility of failure is difficult too. For instance, one can look for unexpected performance variations, and try to correlate them with other things going on in the system to learn the potential future triggers, but this still does not give the full predictive power of when a failure may happen. Improvements in our ability to predict and avoid metastable failures will also translate directly to efficiency gains because it will let us operate systems closer to their natural performance limits.

I think this is an exciting area for research. As the industry makes bigger and bigger systems and pushes them to work as cheaply as possible, we need to develop a proper understanding of how these critical systems fail at scale so we can continue improving the reliability.