We have resumed the distributed systems reading group after a short holiday break. Yesterday we discussed the “Toward a Generic Fault Tolerance Technique for Partial Network Partitioning” paper from OSDI 2020. The paper studies a particular type of network partitioning – partial network partitioning. Normally, we expect that every node can reach every other node; under the network partitions, we often assume that a node or a group of nodes are “disconnected” from the rest of the cluster and cannot be reached. Partial partitioning assumes that some communication remains, and some nodes outside the partition can still communicate with the partitioned servers. The paper studies several public bug reports from popular systems and finds quite a few issues attributed to the partial partitioning problem. A simple example of the issue is a system having one leader and multiple follower nodes. If the leader is partitioned away from the followers, the progress may stop. At the same time, if the leader can still communicate with a configuration manager (ZooKeeper) responsible for picking the next leader, that communication manager won’t see a problem and not elect a new leader, causing the entire system to stall. The paper has many insights about the partial partition failures, and it is worth taking a look at the paper itself for these details. Aside from extensive exploration of the problem, the paper proposes a simple solution to mask such partial partitions, called Nifty. Nifty construct a connectivity graph and uses it to relay/reroute messages between nodes when the direct communication link between nodes fails.
We had a pretty lively discussion of this paper.
1) Monitoring. One of the issues identified in the paper is that such failures caused by partial partitions often go silent. However, the proposed solution does not seem to include the monitoring/notification service itself. We speculate that it should be trivial to add it to Nifty.
2) System Design. Another important find from the paper is that many of the bugs were due to design problems. We spent considerable time discussing how better design with model checking can help avoid such issues. One of the issues, of course, is that model checking is only as good as the assumptions you put in. So if the design was checked, but the designer did not take into account, for example, message loss, it may not have caught the problem. The paper raises awareness for the problem, so hopefully, the next generation of systems can be model-checked with partial network partitions in mind.
3) Nifty performance. We had two major questions with regards to performance. The first one is whether Nifty can scale to large clusters, given its all-to-all communication requirement for constructing the connectivity graph. This all-to-all communication gives a quadratic complexity, but we also think that collecting this connectivity information can be done relatively infrequently, given that the application can tolerate some message delay/loss. The second question is whether Nifty can handle large messages. Data-driven systems, like databases, may transmit a lot of data between nodes, and having all these data flow through some “relay” nodes may put more stress on such. Also, both the topology of connected nodes and the state of the application matter. If two partial partitions are connected by just one server, it will put more stress on that one machine. Similarly, if the system is in a state that requires more data transfer, such as building a new replica, it may cause a lot of traffic/data to flow through Nifty, putting more stress on the “relay” nodes.
4) Other types of network partitions. The paper talks about (mainly) partial symmetrical partitions. Another interesting type of network partition is an asymmetrical or one-way partition, when messages flow in one direction but not the other. However, one can argue that this is still a partial partition, and Nifty can handle such a case with its connectivity graph approach, assuming the graph is directed.
5) Failure Masking. Nifty masks failures, which is great when we need the applications to continue working. However, what happens when/if Nifty fails? Will it create an even bigger/disastrous failure? Another question is whether the partial partition may exist as a transient state towards a full network partition. In such a case, Nifty only gives a little bit of extra time to react to the issue as it emerges and cannot be the only solution to dealing with the problem (hence discussion point #1 and the need for monitoring).
6) We also noted some similarities with the recent Cloudflare issue.
Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!