# Reading Group. Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3

For the 90th reading group paper, we did “Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3” by James Bornholt, Rajeev Joshi, Vytautas Astrauskas, Brendan Cully, Bernhard Kragl, Seth Markle, Kyle Sauri, Drew Schleit, Grant Slatton, Serdar Tasiran, Jacob Van Geffen, Andrew Warfield. As usual, we have a video:

Andrey Satarin did an excellent write-up on this paper, so I will be short here. This paper discusses the use of formal methods in designing, developing, and maintaining a new key-value storage subsystem in Amazon S3. In my opinion, the core idea and contribution of the paper is establishing engineering processes that ensure that implementation adheres to the specification.

See, using traditional formal methods is hard in practice. For example, using TLA+ to model the system creates a model that is detached from implementation. If something in the system changes, both need to be updated. Moreover, updating TLA+ would require experts in TLA+, creating a possibility of model and code diverging over time. Another problem is using the model to check the implementation. Since the two are detached from each other, this becomes rather difficult.

The paper proposes lightweight formal methods that can be integrated directly into the testing/build framework. The approach requires implementing a reference model for each component that captures the semantics/behavior of the component without any real-world concerns. For example, the model for LSM-tree is a hashmap — both have the same behavior and allow reads and writes of some key-value pairs.

Engineers use these reference models in a couple of ways. First of all, the models act as mocks for unit tests. This increases the utility of the models and forces the engineers to keep them up to date. But of course, the main purpose of the models is to confirm the behavior of the real implementations. The gist of the conformance checking is verifying that the implementation’s behavior is allowed by the model. To do so, the same test, consisting of some sequence of operations, runs against the implementation and the model. The expectation is that both go through the same states and deliver the same outputs. Below I have an image borrowed from the authors’ slides that illustrates the process.

Of course, there are some challenges with running these tests. For instance, it is important to generate interesting scenarios/sequences of operations to test various behaviors. Another challenge is introducing failures into testing. And finally, sequential execution of some operations does not provide comprehensive coverage in modern, multi-threaded, or distributed code, requiring some concurrent testing as well. The paper covers each of these scenarios in greater detail.
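To make the conformance-checking idea concrete, here is a minimal Python sketch. It is not the paper's code: `ReferenceModel`, `ToyLogStore`, `random_ops`, and `conforms` are hypothetical names, and the "implementation" is just a toy append-only log standing in for a real storage component. It only covers the sequential case, without failure injection or concurrency.

```python
import random


class ReferenceModel:
    """Reference model: a plain dict captures the key-value semantics."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class ToyLogStore:
    """Hypothetical implementation under test: an append-only log with tombstones."""

    _TOMBSTONE = object()

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def get(self, key):
        # Scan newest-to-oldest; the latest entry for a key wins.
        for k, v in reversed(self._log):
            if k == key:
                return None if v is self._TOMBSTONE else v
        return None

    def delete(self, key):
        self._log.append((key, self._TOMBSTONE))


def random_ops(rng, length):
    """Generate a random sequence of operations over a small key space."""
    ops = []
    for _ in range(length):
        kind = rng.choice(["put", "get", "delete"])
        key = rng.choice("abc")
        value = rng.randint(0, 9) if kind == "put" else None
        ops.append((kind, key, value))
    return ops


def apply_op(store, op):
    """Apply one operation and return whatever the store returns."""
    kind, key, value = op
    if kind == "put":
        return store.put(key, value)
    if kind == "get":
        return store.get(key)
    return store.delete(key)


def conforms(make_impl, make_model, ops):
    """Run the same operations against both and compare every output."""
    impl, model = make_impl(), make_model()
    return all(apply_op(impl, op) == apply_op(model, op) for op in ops)


def check(seed=0, runs=200, length=30):
    """Check conformance on many randomly generated operation sequences."""
    rng = random.Random(seed)
    return all(
        conforms(ToyLogStore, ReferenceModel, random_ops(rng, length))
        for _ in range(runs)
    )
```

Calling `check()` runs 200 random sequences and returns `True` only if the toy store and the dict-based model agree on every output. The small key space (`"abc"`) is a deliberate choice in this sketch: it makes overwrites and deletes of the same key likely, which is where the interesting behaviors are.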

## Discussion

1) Reliance on experts in formal methods. A big point made by the paper is about not having to rely on formal methods experts to maintain the models and verification. The paper says that while initially all models were written by experts, at the time of writing the paper, about 18% of the models were written by non-experts. To us, this sounded like both a big and a small number. It is important to allow engineers to maintain the models and conformance checking framework, and the number clearly shows that the core engineers are getting on board with the processes involved. At the same time, it is not clear whether a team can completely get rid of expert support.

2) Importance of processes. As we discussed the reliance on experts and reducing this reliance, it became clear that a big contribution of this paper is about the importance of engineering processes. And it is not just about having some processes/workflows to facilitate formal methods adoption. What is crucial is making these processes scale and thus not require significant additional effort from the engineers. For example, developing models to support formal methods is an extra effort. Using these models as mock components for unit tests amortizes such extra effort into almost no additional work. After all, we need to do unit testing anyway, and using mocks is a common practice.

3) TLA+? It is hard to discuss any formal methods paper in our reading group without having some discussion on TLA+. We have talked about the difficulty of keeping the models up to date with the implementation. Using TLA+ does not seem to allow for a low-effort mechanism: there is a big overhead to having engineering processes/practices that keep the TLA+ model and implementation coherent.

4) Testing Reading List. Our presenter, Andrey, has compiled an extensive reading list on testing distributed systems. It is most definitely worth checking out.

5) Codebase size. The authors talk about using these lightweight formal methods on a codebase of about 40,000 lines of code. This is not that much code, to be honest. In fact, this is less code than a handful of my academic projects, not to mention the real software that I have worked on. So it would be interesting to see how these approaches can scale to bigger codebases and bigger teams with more people.

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a Slack group where we post papers, hold discussions, and most importantly manage Zoom invites to the papers. Please join the Slack group to get involved!

# Reading Group. DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols

In the 71st DistSys reading group meeting, we discussed the OSDI’21 paper “DistAI: Data-Driven Automated Invariant Learning for Distributed Protocols”. Despite the misleading title, this paper has nothing to do with AI or Machine Learning. Instead, it focuses on the automated search for invariants in distributed algorithms. I will be brief and a bit hand-wavy in this summary, mainly because I am too busy with other things to take care of (thanks, the beginning of the semester!), and this paper requires rather careful attention and a deep read to fully grasp it.

Before going into the paper itself, I want to talk about invariants. We use invariants a lot when thinking about the safety of algorithms and protocols. An invariant is a property that must hold throughout the execution of an algorithm. It holds initially and after every discrete step of the protocol. For people like me, finding a good safety property is usually enough when checking some algorithm written in a spec language like TLA+. The TLC model checker runs an exhaustive search of the state space and checks the safety property after every action. If the invariant condition is violated, TLC will produce a counterexample. There is a bit of a problem with this approach: for big algorithms, checking the model can take hours, days, or even weeks. TLA+/TLC is a brute-force solution.

IVy, the system used as a component in today’s paper, takes a different approach. Instead of brute-forcing and checking all possible executions (and of course, the executions may be infinite too!), it tries to be clever and prove an algorithm correct by induction. The idea is simple on the surface: take an inductive invariant I prescribing the safety of an algorithm, and show that I holds initially and after all possible transitions starting from any state that satisfies I. In other words, we need to check that every possible transition from a state S satisfying I results in a state S’ that still satisfies I. Unfortunately, in practice, this is where things get complicated. The problem is that our safety property I from the model-checking world may not be inductive. More specifically, if we are in some non-initial state S that satisfies I, there is still a possibility that after an action S → S’, the state S’ will not satisfy I. This situation may arise not because the safety property is incorrect but because S is an impossible or unreachable state that still somehow satisfies I. The solution is then to strengthen the invariant to some new invariant I’ that rules out such unreachable starting states for our possible transitions. The IVy system provides tools to do this strengthening semi-automatically with some user help.
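The gap between "an invariant" and "an inductive invariant" is easy to see on a toy system. The Python sketch below is my own illustration, not anything from IVy or the paper: the counter system, the `safe` property, and all function names are assumptions made for the example, and plain enumeration over a tiny finite state space stands in for the solver-based reasoning a real tool would do.

```python
N = 10  # toy finite state space: x ranges over 0..9


def initial(x):
    return x == 0


def step(x):
    # The only transition of the toy protocol: add 2, wrapping around.
    return (x + 2) % N


def safe(x):
    # Safety property we care about.
    return x != 5


def strengthened(x):
    # Strengthened invariant: also rule out the unreachable odd states.
    return safe(x) and x % 2 == 0


def reachable_states():
    """Model-checking style: explore everything reachable from the initial states."""
    seen = set()
    frontier = [x for x in range(N) if initial(x)]
    while frontier:
        x = frontier.pop()
        if x not in seen:
            seen.add(x)
            frontier.append(step(x))
    return seen


def holds_on_reachable(inv):
    return all(inv(x) for x in reachable_states())


def inductive_counterexample(inv):
    """Return a state that breaks the induction, or None if inv is inductive."""
    for x in range(N):
        if initial(x) and not inv(x):
            return x  # base case fails
        if inv(x) and not inv(step(x)):
            return x  # inductive step fails from this (possibly unreachable) state
    return None
```

Here `holds_on_reachable(safe)` is `True`, since only even states are reachable. Yet `inductive_counterexample(safe)` returns `3`: state 3 is unreachable, satisfies `safe`, and steps to 5. With the strengthened invariant, `inductive_counterexample(strengthened)` returns `None`.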

The DistAI paper builds on IVy and optimizes the invariant search process to make it faster, more automatic, and more accurate. The process starts similarly to IVy’s, as a user provides the IVy spec with all possible actions, safety properties that must hold, and relations between variables. These relations between variables are important since they specify the properties of the algorithm. For example, a system may have a relation specifying whether a node holds a lock or whether two nodes have communicated with each other. The system will use these relations to find the inductive invariants. The inductive invariant will be used in conjunction with the safety property to finally check the protocol.

The rest of DistAI’s invariant search process, however, is largely automatic. The process is iterative: each iteration takes a bound on the number of variables and a bound on the number of disjunctive terms, and produces invariant candidates within those bounds. For instance, if all variable relations involve up to 2 variables, it may make sense to initially restrict the invariant search to just two variables.

Interestingly, at this step, DistAI brings back some of the model-checking magic into the process. It produces a few sample trace executions of the protocol and uses these traces to create a list of all observed variable relations after taking each action. In practice, this list of relations can be large if the simulation trace has many variables, so the system will randomly narrow down its focus to only look at a subsample of the execution trace with just a few variables at a time. For example, if all relations have no more than two variables, we may look at how some combination of two variables changed during the execution trace. This two-variable sample will follow a template with two quantified variables: $$\{\forall V_1, V_2\}$$. The DistAI system produces a handful of these subsamples for each quantified template.

After having these subsamples/sub-traces ready, DistAI enumerates the relations from subsamples. It can easily see which relations were holding throughout the entire execution trace. These relations can be generalized from having specific variable values in the samples to quantified variable placeholders from the sample’s template. For example, if a relation some_property holds in all subsamples, then we can generalize based on the template to $$\forall V_1, V_2: some\_property(V_1, V_2)$$. More than one relation may hold true in the subsamples, and so DistAI will produce multiple of these generalized relations. For example, some other relation $$some\_other\_property(V_1)$$ may have been true throughout all the subsamples. We will use all these relations to produce all different disjunctive invariant candidates, such as
(1): $$\forall V_1, V_2: some\_property(V_1, V_2)$$,
(2): $$\forall V_1, V_2: some\_property(V_1, V_2) \lor some\_other\_property(V_1)$$
and other permutations. Also, note that candidate (1) implies candidate (2), i.e., whenever (1) is true, (2) is also always true. In reality, the DistAI candidate enumeration process is a bit more complicated, as different templates may result in duplicate/overlapped invariant candidates.
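The enumeration step can be sketched in a few lines of Python. This is a heavily simplified toy and not DistAI's actual enumerator: it assumes each subsample has already been reduced to a truth value per relation under one fixed template, it ignores negated terms, and it does nothing about duplicates across templates. The function name and the `third_property` relation are made up for the example.

```python
from itertools import combinations


def enumerate_candidates(samples, predicates, max_terms):
    """Return disjunctions (tuples of predicate names) true in every sample.

    Candidates come out ordered from fewest to most disjunctive terms.
    """
    candidates = []
    for size in range(1, max_terms + 1):
        for terms in combinations(predicates, size):
            # A disjunction holds in a sample if at least one term is true there.
            if all(any(sample[p] for p in terms) for sample in samples):
                candidates.append(terms)
    return candidates


# Hypothetical observations: one dict per subsample, mapping each relation
# (instantiated with the sampled values of V1, V2) to whether it held.
samples = [
    {"some_property": True, "some_other_property": True, "third_property": False},
    {"some_property": True, "some_other_property": True, "third_property": True},
]
predicates = ["some_property", "some_other_property", "third_property"]

candidates = enumerate_candidates(samples, predicates, max_terms=2)
```

With these samples, `third_property` alone is rejected because it was false in one subsample, while every two-term disjunction survives. Each surviving tuple would then be read as a universally quantified formula over the template's variables, for example `("some_property", "some_other_property")` as candidate (2) above.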

At this point, DistAI will use IVy to check the candidate invariants. Since we have obtained the candidates from an execution trace, we already know that the candidates hold in at least some executions, and so we hope that we have filtered out most of the incorrect invariants. DistAI will start the check with the shortest candidate first. In the hypothetical example, candidate (1) will be checked first, since if it is inductive, then there is no need to check candidate (2), which is always true when (1) is satisfied. If the shorter invariant candidate fails the IVy check, then DistAI will try to refine it by iterating to a longer version or a version with more variables. In this case, candidate (2) is a refined candidate that has one extra disjunctive term. If DistAI cannot find a successful candidate in the list created during the enumeration process, then it can start from the beginning by running new sample traces with more variables and an increased maximum length of the disjunctive invariant formula.

Phew, this is all very complicated. In short, the idea is to get some invariant candidates from the simulation trace and then check these candidates in IVy, starting with the shortest/simplest candidate and gradually working the way up to more complex invariants. Each of the more complicated candidates you check adds a bit of disjunctive stuff to exclude the issues the previous less-refined candidate stumbled upon.
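As a rough illustration of the shortest-first loop, here is a Python sketch on a toy counter system. Everything in it is an assumption for the example: the system, the `low_even`/`high_even` relations, and the `search` function are mine, candidates are checked one at a time in isolation, and brute-force enumeration stands in for the IVy check. The real tool is considerably more involved.

```python
N = 10  # toy finite state space: x ranges over 0..9


def initial(x):
    return x == 0


def step(x):
    return (x + 2) % N


# Hypothetical relations over the single state variable x.
PREDICATES = {
    "low_even": lambda x: x in (0, 2, 4),
    "high_even": lambda x: x in (6, 8),
}


def holds(candidate, x):
    """A candidate is a disjunction, given as a tuple of predicate names."""
    return any(PREDICATES[name](x) for name in candidate)


def is_inductive(candidate):
    """Stand-in for the IVy check: initiation plus preservation by every step."""
    for x in range(N):
        if initial(x) and not holds(candidate, x):
            return False
        if holds(candidate, x) and not holds(candidate, step(x)):
            return False
    return True


def search(candidates):
    """Try candidates from shortest to longest, skipping implied ones."""
    accepted = []
    for cand in sorted(candidates, key=len):
        # A disjunction that contains all terms of an accepted one is implied by it.
        if any(set(a) <= set(cand) for a in accepted):
            continue
        if is_inductive(cand):
            accepted.append(cand)
    return accepted


# Pretend a short trace only visited x = 0, 2, 4, so "low_even" looked invariant.
result = search([("low_even", "high_even"), ("low_even",)])
```

In this run, `("low_even",)` is tried first and fails, because state 4 steps to 6. The refined candidate with one extra disjunctive term, `("low_even", "high_even")`, then passes, so `result` is `[("low_even", "high_even")]`.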

As always, we had a video:

## Discussion

Quite frankly, I no longer remember the full extent of the discussion we had. I remember it was quite a lot of talking.

1) Inductive invariants. We spent a bit of time on the difference between an inductive invariant and a plain invariant. See, we have TLA+ people in the group, but not that many experts in formal methods. As I mentioned in the beginning, TLA+ is happy with a short non-inductive safety invariant because it is a brute-force tool that can only check finite protocols.

2) Candidate refinement. It seems like most of the refinement is based on going through the list of candidates from the enumeration step and trying them from simple to more complicated ones. Most of the refinement is done by adding more disjunctive predicates until the maximum number of disjunctive terms has been reached (at which point we can try with more terms?).

3) Limitations. The templates for sampling require universal quantifiers. The authors claim that they could not check the Paxos protocol because it requires existential quantifiers. However, it was pointed out that another related work (SWISS) managed to find invariants for Paxos, so it seems that approach does not have this limitation.


# Reading Group. Compositional Programming and Testing of Dynamic Distributed Systems

We have resumed the reading group after one week of Thanksgiving break. On Wednesday, we discussed “Compositional Programming and Testing of Dynamic Distributed Systems.” This paper is on the edge between programming languages, distributed systems, and some formal methods/verification. The premise of the paper is to decompose large monolithic distributed programs into smaller pieces and test each piece separately using the abstractions of all other components. By using the abstract components for testing, we can reduce the number of traces (i.e. state-tree size) and make the testing feasible. The paper adopts the assume-guarantee (AG) theory to make such piece-wise testing/verification possible and correct.

## Discussion Points

We are not formal methods and/or programming languages experts, so our discussion may have been a bit shallow. However, we tried to approach it from multiple directions.
1) TLA+ allows writing specs through refinement. Can something like this be adopted or used in TLA+ for checking large specs? Interestingly enough, despite having quite a few users of TLA+, none of us have tried the refinement approach seriously. Our consensus on this was that it sounds interesting and plausible.

2) The paper relied on the P language, which is an event-driven language. Part of the discussion was on the pros/cons of event-driven programming vs using threads. We eventually drifted off to talk about green-threads/goroutines of Go, and then to coroutines. All-in-all, there are many ways to approach concurrent programming, and, as evidenced by different opinions in the discussion, it appears that we (as a computing community) are still far from agreeing/coming up with the best or most suited way of dealing with concurrency.

3) We also talked about other languages geared towards distributed systems. We have mentioned DistAlgo as one. It is an educational language/compiler that produces runnable Python code. DistAlgo, however, has a different purpose, as its aim is the ease of use and clarity of expressing distributed programs, and not testing/verification.

4) We spent quite a bit of time discussing the assume-guarantee (AG) theory and the need to do “cross-testing” of components where every specific implementation must be tested with abstracts of other components. The original discussion question was why it is not enough to simply show the refinement (i.e. specific refines the abstract and satisfies all the properties). In short, it is all about safety and making sure the specifics not only refine their abstracts but do not introduce other unwanted behaviors when interacting with other components.