Tag Archives: verification

Reading Group. Aragog: Scalable Runtime Verification of Shardable Networked Systems

We have covered 50 papers in the reading group so far! This week we looked at the “Aragog: Scalable Runtime Verification of Shardable Networked Systems” from OSDI’20. This paper discusses the problem of verifying the network functions (NFs), such as NAT Gateways or firewalls at the runtime. The problem is quite challenging due to its scale, as large cloud providers handle enormous amounts of traffic. The paper uses NAT Gateway (NATGW) as the motivating example. NATGW balances external traffic to the servers in a way to ensure that the entire packet flow goes to the same destination. It is implemented entirely in software. This means that the NF needs to be fault-tolerant and not “forget” the destination server for each flow, but it also needs to be super quick, so strong consistency is not an option. In this NF, some primary node keeps track of routing for a flow and asynchronously replicates the information to backups. 

Aragog is set to examine the operation of this NF and look for invariant violations. Obviously, if we look at the entire NF that handles thousands and thousands of flows, the problem is rather large. However, Aragog does not need to take the global approach, and since each flow is independent, it can look at verification at the flow granularity. For example, the system can check whether at any time the flow is directed by at most one primary node. This check still requires the global view of all system nodes to make sure that there are no two primaries, but it does not require the entirety of the state and needs only the state associated with a particular flow. In the nutshell, Aragog constructs a state machine based on the description of an invariant violation for each flow, allowing for embarrassingly parallel scalability due to the sharded nature of the problem. The state machine transitions across states as it receives the events (i.e. messages), and if it reaches the termination state, a notification can be issues about the violation. Naturally, since the events can happen on different nodes, they are all shipped to a centralized verification agent that runs the invariant violation state machine. However, it would still be inefficient to ship all the events for each flow to construct these state machines, so Aragog does some filtering — it does not send messages irrelevant to the invariant we are checking, and it avoids shipping messages that can be confirmed locally to not transition to a new state of the state machine. 

The evaluation show that Aragog can detect violations due to bugs, and it has quite substantial throughput for checking.

As always, a lot more details can be found in the paper. A. Jesse Jiryu Davis did an excellent presentation of the paper. He also has quite a few interesting questions/ideas at the end of the presentation.


1) Best-Effort. Aragog makes certain assumptions and choices that make it a “best-effort” system, as it may miss some violations (false-negatives) and maybe even detect some that did not happen (false-positives). One such assumption we noted in the discussion is time-synchronization, since the events/messages are ordered and applied to the state-machine according to their timestamps. While good time-sync is possible, it is not clear how well this was implemented in Aragog (it uses PTP) and how big of an impact it may have. For example, a reordering of the messages may potentially hide the violation from being detected, or even worse make a non-violating message ordered appear as one that has a problem. Another avenue for missing out on the violations is the use of sampling.

2) Need for Runtime and Near-real-time Verification. One of the major requirements in Aragog is the runtime or near-real-time verification. However, we were not super clear on why this is needed. For example, even if you detect a problem quickly, there may be limited options to react to it quickly. Unless it causes an outage (which would be detectable by other means), the time to resolve may be very large, as a bug needs to be reproduced, tested, fixed, and deployed, and at best can take many hours or even days. Another reason why a near-real-time requirement is questionable is the best-effort nature of the system described above. However, we have talked about one scenario where a quick response is needed. Consider a rollout of the new version of the network function. The rollouts typically occur in stages, so it is important to detect any issues early and stop the rollout before it hits the majority of servers. 

3) Embarrassingly Parallel. The cool thing about Aragog is its embarrassingly parallel nature. While the system is applied in a specific domain (testing network functions), we feel like other applications/domains can parallelize in a similar matter. 

4) “Productizing” Aragog. Again, currently, Aragog is a niche/domain-specific system. However, it seems to have nice features, like scalability, and nice language to describe invariant violations and filtering, so we wonder if this can be productized beyond this specific application.

5) PL & Distributed Systems. I largely skipped this in my summary, but quite a lot of nice things about Aragog come from its language and how it creates a state-machine out of the invariant violation description. Not to mention all the local suppression to prevent sending the events/messages that do not change the state of the machine to a global verifier. This is cool stuff, at least from our distributed system point of view that is a bit further from programming languages work. 

6) Retroscope. Finally, I could not resist and not make the comparison with some early work of mine — Retroscope. Retroscope uses HLC (to avoid time-sync causality-breaking issues) to construct a progression of globally distributed states and search for… you guessed it, invariant violations. Retroscope uses SQL-like language to express the predicates that describe a violation. Unlike Aragog, I tried to make Retroscope general. It also does not shard-like Aragog, but once the events are ingested by a streaming service, predicate search is embarrassingly parallel. Restrocope was one of my first works in the distributed systems space, so it is not quite as optimized or fast, but it is also a more general prototype.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group Special Session: Building Distributed Systems With Stateright

This talk is part of the Distributed Systems Reading Group.

Stateright is a software framework for analyzing and systematically verifying distributed systems. Its name refers to its goal of verifying that a system’s collective state always satisfies a correctness specification, such as “operations invoked against the system are always linearizable.” Cloud service providers like AWS and Azure leverage verification software such as the TLA+ model checker to achieve the same goal, but whereas that solution can verify a high level system design, Stateright is able to systematically verify the underlying system implementation in addition to the design.

In this talk, Stateright’s author Jonathan Nadal will introduce the project, demo its usage, and discuss upcoming improvements. Jonathan has been in the cloud space since 2012 when he joined Amazon Web Services to work on S3. There he met Chris Newcombe, who introduced TLA+ to the company, and was immediately hooked by that technology. Jonathan later joined Oracle to help build Oracle Cloud Infrastructure, where he is currently a director of engineering and a proponent of model checking techniques within and outside his own teams. Stateright is a personal project not affiliated with Oracle or Amazon.

Video Recording

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Compositional Programming and Testing of Dynamic Distributed Systems

We have resumed the reading group after one week of Thanksgiving break. On Wednesday, we have discussed “Compositional Programming and Testing of DynamicDistributed Systems.” This paper is on the edge between programming languages, distributed systems, and some formal methods/verification. The premise of the paper is to decompose large monolithic distributed programs into smaller pieces and test each piece separately using the abstracts of all other components. By using the abstract components for testing, we can reduce the number of traces (i.e. state-tree size) and make the testing feasible. The paper takes advantage and adopts the assume-guarantee (AG) theory to make such piece-wise testing/verification possible and correct.

Discussion Points

We are not formal methods and/or programming languages experts, so our discussion may have been a bit shallow. However, we tried to approach it from multiple directions.
1) TLA+ allows writing specs through refinement. Can something like this be adopted or used in TLA for checking large specs? Interestingly enough, despite having quite a few users of TLA, none of us have tried the refinement approach seriously. Our consensus in this was that this sounds interesting and plausible.

2) The paper relied on the P language, which is an event-driven language. Part of the discussion was on the pros/cons of event-driven programming vs using threads. We eventually drifted off to talk about green-threads/goroutines of Go, and then to coroutines. All-in-all, there are many ways to approach concurrent programming, and, as evidenced by different opinions in the discussion, it appears that we (as a computing community) are still far from agreeing/coming up with the best or most suited way of dealing with concurrency.

3) We also talked about other languages geared towards distributed systems. We have mentioned DistAlgo as one. It is an educational language/compiler that produces runnable python code. DistAlgo, however, has a different purpose, as its aim is the ease of use and clarity of expressing distributed programs, and not testing/verification.

4) We spent quite a bit of time discussing the assume-guarantee (AG) theory and the need to do “cross-testing” of components where every specific implementation must be tested with abstracts of other components. The original discussion question was why it is not enough to simply show the refinement (i.e. specific refines the abstract and satisfies all the properties). In short, it is all about safety and making sure the specifics not only refine their abstracts but do not introduce other unwanted behaviors when interacting with other components.