Our 66th paper was a recent HotOS piece about faulty CPUs: “Cores that don’t count.” This paper from Google describes a decently common (at Google datacenter scale) issue with CPUs that may miscompute or silently fail under some conditions. This is a big deal, as we expect CPUs to be deterministic and always provide correct results for all computations. In the rare instances when a CPU does fail, we expect the failure to be “loud”: the CPU halting completely, the machine rebooting, or at the very least software crashing.
However, Google engineers observed some dangerous behaviors that do not align with these expectations. Unfortunately, as CPUs fail, they may do so silently and produce a wrong result for a computation. Since the failures are silent, they are much more difficult to detect. According to the paper, such failures often seem non-deterministic and require specific conditions to manifest. Moreover, most of the time, the faults do not impact an entire CPU but only one or a few cores, leading to computation errors that appear sporadically even on a single machine.
Speaking of the factors and conditions that play a role in these failures, the paper talks about CPU age — a CPU may start to miscompute some instructions only after some time in operation. This makes quality control more difficult for both the CPU vendor and the customer — even when there is a known tendency for some errors, they may not appear on a new CPU, which will then pass the QA tests. Other factors are operating temperature, frequency, and voltage. The paper states that their impacts vary. However, it is not unreasonable to assume that some errors may start to appear under heavier load and higher temperature and then subside when the load reduces and the CPU cools down.
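If that hypothesis holds, even a very simple self-check can catch a marginal core once the right conditions are present. Below is a minimal sketch in Python (my own illustration, not tooling from the paper): rerun a deterministic computation and compare against a reference ideally obtained on trusted hardware, with the idea of running it while the CPU is already hot and busy.

```python
# Minimal self-checking loop (illustrative only): rerun a deterministic
# computation and flag any result that deviates from a known-good reference.
# Run it while the machine is under load/heat to probe the conditions above.
import hashlib

def workload(seed: bytes, rounds: int = 10_000) -> str:
    # A chain of hashes: any silently miscomputed instruction changes the digest.
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

# Reference result, ideally computed once on hardware known to be healthy.
EXPECTED = workload(b"core-screen")

def screen(iterations: int = 1_000) -> int:
    mismatches = 0
    for i in range(iterations):
        if workload(b"core-screen") != EXPECTED:
            mismatches += 1
            print(f"iteration {i}: silent corruption detected")
    return mismatches
```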
All these factors make detection hard, as reproducing a fault often requires running the same software on the same hardware under the same conditions. That being said, the authors claim that they are getting better at detecting and reproducing many of these errors. For example, the paper mentions a set of benchmarks that exercise the more common failure scenarios. Such a benchmark tool can be handy for confirming faults or for offline testing. Unfortunately, it is not very helpful for detecting problems in live production workloads. Currently, engineers at Google detect CPU issues through crash and bug reports, which are obviously very noisy. However, some patterns, such as identifying the cores involved and checking whether the same cores appear in multiple similar reports, can provide the first clues.
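As a toy illustration of that “same cores keep showing up” heuristic (the report format and threshold here are entirely made up), the correlation itself can be as simple as counting (machine, core) pairs across reports:

```python
# Hypothetical sketch of correlating noisy crash/bug reports by core.
from collections import Counter

# Each report is assumed to record the machine and core it ran on.
reports = [
    {"machine": "m17", "core": 5, "symptom": "checksum mismatch"},
    {"machine": "m17", "core": 5, "symptom": "assertion failure"},
    {"machine": "m17", "core": 2, "symptom": "segfault"},
    {"machine": "m42", "core": 11, "symptom": "checksum mismatch"},
]

def suspect_cores(reports, threshold=2):
    # Count how often each (machine, core) pair appears across independent reports.
    counts = Counter((r["machine"], r["core"]) for r in reports)
    # A core implicated repeatedly is a candidate for offline screening.
    return [pair for pair, n in counts.items() if n >= threshold]

print(suspect_cores(reports))  # [('m17', 5)]
```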
So what are the repercussions of faulty computation if left undetected and unmitigated? The short answer is all kinds of stuff! The major problem is that a fault can corrupt a computation before any of the traditional redundancy mechanisms get involved. For example, a failure at the leader node of a replicated system can produce a bad outcome. If this outcome is then checksummed and replicated, it will propagate through the cluster undetected. Our replicated/redundant systems are geared toward detecting data issues during transfer or storage and may lack the tools and mechanisms needed to detect bad computations.
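A toy example (not from the paper) makes the gap clear: if the checksum is computed over an already-miscomputed value, every downstream integrity check passes and the wrong value is faithfully replicated.

```python
# Toy illustration of why storage/transfer checksums do not help: the corruption
# happens before the checksum is computed, so every replica verifies the
# checksum successfully and still stores the wrong value.
import zlib

def compute(x: int) -> int:
    return x * 2  # intended computation

def faulty_compute(x: int) -> int:
    return (x * 2) ^ 0x10  # simulated silent bit flip in the result

def replicate(value: int, replicas: list):
    payload = value.to_bytes(8, "big")
    checksum = zlib.crc32(payload)       # leader checksums the (already wrong) result
    for r in replicas:
        assert zlib.crc32(payload) == checksum  # transfer check passes on every follower
        r.append(int.from_bytes(payload, "big"))

followers = [[], [], []]
replicate(faulty_compute(21), followers)
print(followers)  # every replica durably stores 58 instead of 42; no check fired
```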
I think I will leave the summary here. Murat Demirbas goes into more detail in our group’s presentation. He also touches on a similar arXiv paper from Facebook.
Discussion.
1) Vendors. An obvious question to ask is which vendors and models have these issues. The paper does not provide any concrete information here, and that makes sense: it would not be a good idea to point fingers at a major company with which you do a lot of business. So the fact that Google (and Facebook) decided to bring this issue to public attention instead of trying to resolve it with the vendors quietly is already a big deal.
2) In-house CPU development. Another discussion point about CPU manufacturers was whether taking an in-house development approach can help improve quality. Think of AWS Graviton CPUs. On one side, this gives a large-scale company (Google, AWS, etc.) more control over the design and quality assurance. On the other side, these defects are likely due to manufacturing (e.g., ever-shrinking transistor sizes) and not the design; design issues would likely have been more deterministic. Manufacturing is hard, and only a handful of companies can pull it off, so the end product would have been made by a third party anyway. The remaining benefit is a potential for better QA, but this is also hard; for instance, I mentioned in the summary that some issues appear only after some aging. Also, big CPU companies may still have an edge in QA due to the sheer volume of CPUs made and shipped: if more consumers report the problems, it may be easier for CPU vendors to identify patterns and improve in the next iteration.
3) Non-deterministic issues. The paper mentions that many of the issues are non-deterministic. At the same time, the paper discusses the techniques they use to find the problems. So it appears that there are actually a few classes of failures: (1) purely non-deterministic faults, (2) obfuscated deterministic faults, and (3) openly deterministic faults. The third category is easy, with problems reproducible all the time, such as design bugs. The second category is reproducible faults that hide. It seems like the issues discussed the most in the paper are of this type. It may be difficult to find them in all the noise, as the problems manifest only under certain conditions. However, once these conditions are known, it becomes relatively easy to reproduce and identify the faults. The first category is the tricky one. Does it even exist? Or do these failures just require even more “stars to align” for the problems to appear? Either way, this category is the most challenging to look for and identify. If such failures do exist, is there even a systematic way to detect them?
4) Hardware mitigations? One way to approach the problem is with hardware redundancy. For instance, a system may perform some critical computations redundantly on different cores or CPUs and cross-check the results. If different cores produce different results for the same computation, then one of the cores is at fault, and we may need a third one to arbitrate. This redundancy is not new, and mission-critical systems use it. However, it is too expensive for most public services and cloud systems.
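To make the idea concrete, here is a rough software sketch of redundant execution with majority voting. It assumes a Linux host, where os.sched_setaffinity lets a worker pin itself to a specific core; the function names and core numbers are illustrative, not from the paper.

```python
# Sketch: run the same computation pinned to several cores and vote on the result.
import os
from collections import Counter
from multiprocessing import Pool

def run_on_core(args):
    core, x = args
    os.sched_setaffinity(0, {core})  # pin this worker to one core (Linux only)
    return x * x                     # the computation we want cross-checked

def voted(x, cores=(0, 1, 2)):       # assumes the host exposes at least cores 0-2
    with Pool(len(cores)) as pool:
        results = pool.map(run_on_core, [(c, x) for c in cores])
    winner, votes = Counter(results).most_common(1)[0]
    if votes < len(cores):
        # Disagreement: at least one core computed a different answer.
        print(f"core disagreement on input {x}: {results}")
    return winner

if __name__ == "__main__":
    print(voted(12345))
```

Tripling every computation like this is, of course, exactly the cost problem mentioned above; mission-critical systems typically pay it in hardware with lockstep cores instead.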
5) Are these byzantine failures? Murat raised this question at the end of his talk, and it is an interesting one. Byzantine faults are faults that occur due to a system behaving out of spec. But with faulty computation, the out-of-spec behavior may be very subtle. Again, consider a replicated system with a leader. With a computation fault, the leader will replicate the same corrupted data to all the followers, leaving no possibility for a cross-check.
6) Software mitigations? So, if the faults are kind of byzantine, but also subtle enough to avoid detection by some protocols, then what can we do? One idea that was floating around is defensive programming — using assertions and verifying the computations (this works for problems that can be checked relatively cheaply). Good error reporting is also crucial to help find the correlations that may point to faulty cores.
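As a small example of the “cheaply verifiable computation” idea (my own illustration, not code from the paper): sorting costs O(n log n), but asserting that the output is ordered and is a permutation of the input costs only O(n), so the check is cheap enough to leave enabled in production.

```python
# Defensive-programming sketch: verify a computation with a much cheaper check.
from collections import Counter

def checked_sort(values):
    result = sorted(values)
    # Cheap post-conditions: output is ordered and has the same elements as the input.
    assert all(a <= b for a, b in zip(result, result[1:])), "output not ordered"
    assert Counter(result) == Counter(values), "output is not a permutation of input"
    return result

print(checked_sort([3, 1, 2]))  # [1, 2, 3]
```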
A more fundamental question, however, remains. Can we design cloud-scale systems that tolerate such failures without costing too much?
Reading Group
Our reading group takes place over Zoom every Thursday at 1:00 pm EST. We have a Slack group where we post papers, hold offline discussions, and, most importantly, manage Zoom invites to paper presentations. Please join the Slack group to get involved!