# Reading Group. DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols

In the 71st DistSys reading group meeting, we have discussed “DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols” OSDI’21 paper. Despite the misleading title, this paper has nothing to do with AI or Machine Learning. Instead, it focuses on the automated search for invariants in distributed algorithms. I will be brief and a bit hand-wavy in this summary, mainly because I am too busy with other things to take care of (thanks, the beginning of the semester!), and this paper requires rather careful attention and a deep read to fully grasp it.

Before going into the paper itself, I want to talk about invariants. We use invariants a lot when thinking about the safety of algorithms and protocols. So an invariant is a property that must hold throughout the execution of an algorithm. It holds initially and after every discrete step of the protocols. For people like me, finding a good safety property is usually enough when checking some algorithm written in a spec language like TLA+. A TLC model checker runs an exhaustive search of the state space and ensures the safety property after every action. If the invariant condition is violated, TLC will produce a counterexample. There is a bit of a problem with this approach — for big algorithms, checking the model can take hours, days, or even weeks. TLA+/TLC is a bruteforce solution.

IVy, the system used as a component in today’s paper, takes a different approach. Instead of brute-forcing and checking all possible executions (and of course, the execution may be infinite too!), it tries to be clever and prove an algorithm correct by induction. The idea is simple on the surface — take an inductive invariant I prescribing the safety of an algorithm, and show that I holds initially and after all possible transitions starting from any state that satisfied the invariant I. In other words, we need to check that every possible transition from state S satisfying invariant results in a state S’ that still satisfies I. Unfortunately, in practice, this is where things get complicated. The problem is that our safety property I from the model-checking world may not be inductive. More specifically, if we are in some non-initial state S that satisfies I, there is still a possibility that after at action S->S’S’ will not satisfy I. This situation may arise not because the safety property is incorrect but because the state is an impossible or unreachable state that still somehow satisfies I. The solution is then to strengthen the invariant to some new invariant I’ that prevents the unreachable starting state for our possible transitions. IVy system provides tools to do this strengthening semi-automatically with some user help.

The DistAI paper builds on IVy and improves and optimizes the invariant search process to make it faster, more automatic, and more accurate. The process starts similarly to IVy’s, as a user provides the IVy spec with all possible actions, safety properties that must hold, and relations between variables. These relations between variables are important since they specify the properties of the algorithm. For example, a system may have a relation specifying whether a node holds a lock or whether the two nodes have communicated with each other. The system will use these relations to find the inductive invariants. The inductive invariant will be used in conjunction with safety property to finally check the protocol.

The rest of the DistAI’s invariant search process, however, is largely automatic. The process is iterative, with each iteration starting with some input to produce invariant candidates that involve up to some number of variables and no longer than some number of disjunctive terms. For instance, if all variable relations involve up to 2 variables, it may make sense to, initially, restrict the invariant search to just two variables.

Interestingly, at this step, DistAI brings back some of the model-checking magic into the process. It produces a few sample trace executions of the protocol and uses these traces to create a list of all observed variable relations after taking each action. In practice, this list of relations can be large if the simulation trace has many variables, so the system will randomly narrow down its focus to only look at a subsample of the execution trace with just a few variables at a time. For example, if all relations have no more than two variables, we may look at how some combination of two variables changed during the execution trace. This two-variable sample will follow a template with two quantified variables: $$\{\forall V_1, V_2\}$$. The DistAI system produces a handful of these subsamples for each quantified template.

After having these subsamples/sub-traces ready, DistAI enumerates the relations from subsamples. It can easily see which relations were holding throughout the entire execution trace. These relations can be generalized from having specific variable values in the samples to quantified variable placeholders from the sample’s template. For example, if a relation some_property holds in all subsamples, then we can generalize based on the template to $$\forall V_1, V_2: some\_property(V_1, V_2)$$. More than one relation may hold true in the subsamples, and so DistAI will produce multiple of these generalized relations. For example, some other relation $$some\_other\_property(V_1)$$ may have been true throughout all the subsamples. We will use all these relations to produce all different disjunctive invariant candidates, such as
(1): $$\forall V_1, V_2: some\_property(V_1, V_2)$$,
(2): $$\forall V_1, V_2: some\_property(V_1, V_2) \lor some\_other\_property(V_1)$$
and other permutations. Also, note that candidate (1) implies candidate (2), i.e., whenever (1) is true, (2) is also always true. In reality, the DistAI candidate enumeration process is a bit more complicated, as different templates may result in duplicate/overlapped invariant candidates.

At this point, DistAI will use IVy to check the candidate invariants. So since we have obtained the candidates from an execution trace, we already know that the candidates hold in at least some executions, and so we hope that we have had filtered most of the incorrect invariants. DistAI will start the check with the shortest candidate first. In the hypothetical example, candidate (1) will be checked first, since if it is inductive, then there is no need to check candidate (2) that is always true when (1) is satisfied. If the shorter invariant candidate fails IVy check, then DistAI will try to refine it by iterating to a longer version or a version with more variables. In this case, candidate (2) is a refined candidate that has one extra disjunctive term. If DistAI cannot find a successful candidate in the list created during the enumeration process, then it can start from the beginning by running new sample traces with more variables and increased the maximum length of the disjunctive invariant formula.

Phew, this is all very complicated. In short, the idea is to get some invariant candidates from the simulation trace and then check these candidates in IVy, starting with the shortest/simplest candidate and gradually working the way up to more complex invariants. Each of the more complicated candidates you check adds a bit of disjunctive stuff to exclude the issues the previous less-refined candidate stumbled upon.

As always, we had a video:

## Discussion.

Quite frankly, I no longer remember the full extent of the discussion we had. I remember it was quite a lot of talking.

1) Inductive invariants. We spent a bit of time on the difference between inductive invariants and just an invariant. See, we have TLA people in the group, but not that many experts in formal methods. As I mentioned in the beginning, TLA is happy with a short non-inductive safety invariant because it is a brute-force tool that can only check finite protocols.

2) Candidate refinement. It seems like most of the refinement is based on going through the list of candidates from the enumeration step and trying them from simple to more complicated once. Most of the refinement is done by adding more disjunctive predicates until the maximum number of disjunctive terms have been reached (at which point we can try with more terms?).

3) Limitations. The templates for sampling require universal quantifiers. The authors claim that they could not check Paxos protocol because it requires existential quantifiers. However, it was pointed out that another related work (SWISS) managed to find invariants for Paxos, so it seems their approach does not have this limitation.

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

# Reading Group. Compositional Programming and Testing of Dynamic Distributed Systems

We have resumed the reading group after one week of Thanksgiving break. On Wednesday, we have discussed “Compositional Programming and Testing of DynamicDistributed Systems.” This paper is on the edge between programming languages, distributed systems, and some formal methods/verification. The premise of the paper is to decompose large monolithic distributed programs into smaller pieces and test each piece separately using the abstracts of all other components. By using the abstract components for testing, we can reduce the number of traces (i.e. state-tree size) and make the testing feasible. The paper takes advantage and adopts the assume-guarantee (AG) theory to make such piece-wise testing/verification possible and correct.

# Discussion Points

We are not formal methods and/or programming languages experts, so our discussion may have been a bit shallow. However, we tried to approach it from multiple directions.
1) TLA+ allows writing specs through refinement. Can something like this be adopted or used in TLA for checking large specs? Interestingly enough, despite having quite a few users of TLA, none of us have tried the refinement approach seriously. Our consensus in this was that this sounds interesting and plausible.

2) The paper relied on the P language, which is an event-driven language. Part of the discussion was on the pros/cons of event-driven programming vs using threads. We eventually drifted off to talk about green-threads/goroutines of Go, and then to coroutines. All-in-all, there are many ways to approach concurrent programming, and, as evidenced by different opinions in the discussion, it appears that we (as a computing community) are still far from agreeing/coming up with the best or most suited way of dealing with concurrency.

3) We also talked about other languages geared towards distributed systems. We have mentioned DistAlgo as one. It is an educational language/compiler that produces runnable python code. DistAlgo, however, has a different purpose, as its aim is the ease of use and clarity of expressing distributed programs, and not testing/verification.

4) We spent quite a bit of time discussing the assume-guarantee (AG) theory and the need to do “cross-testing” of components where every specific implementation must be tested with abstracts of other components. The original discussion question was why it is not enough to simply show the refinement (i.e. specific refines the abstract and satisfies all the properties). In short, it is all about safety and making sure the specifics not only refine their abstracts but do not introduce other unwanted behaviors when interacting with other components.