Reading Group. Aragog: Scalable Runtime Verification of Shardable Networked Systems

We have covered 50 papers in the reading group so far! This week we looked at “Aragog: Scalable Runtime Verification of Shardable Networked Systems” from OSDI’20. This paper discusses the problem of verifying network functions (NFs), such as NAT gateways or firewalls, at runtime. The problem is quite challenging due to its scale, as large cloud providers handle enormous amounts of traffic. The paper uses a NAT Gateway (NATGW) as the motivating example. NATGW balances external traffic across servers in a way that ensures every packet of a flow goes to the same destination. It is implemented entirely in software. This means that the NF needs to be fault-tolerant and not “forget” the destination server for each flow, but it also needs to be super quick, so strong consistency is not an option. In this NF, some primary node keeps track of routing for a flow and asynchronously replicates this information to backups.
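To make the flow-affinity idea concrete, here is a minimal sketch of a flow table that pins each flow to one backend. This is my own illustration, not the actual NATGW implementation: the class names, the 5-tuple flow key, and the hash-based backend choice are all assumptions.

```python
import hashlib

class FlowTable:
    """Hypothetical sketch of NATGW-style flow affinity: the first packet
    of a flow picks a backend, and every later packet of the same flow
    (same 5-tuple) goes to that same backend."""

    def __init__(self, backends):
        self.backends = backends
        self.table = {}  # flow 5-tuple -> chosen backend

    def route(self, flow):
        if flow not in self.table:
            # Pick a backend for a new flow (a hash for determinism here;
            # a real NATGW may use load-aware selection instead).
            h = int(hashlib.md5(repr(flow).encode()).hexdigest(), 16)
            self.table[flow] = self.backends[h % len(self.backends)]
            # In the real system, the primary would now asynchronously
            # replicate this mapping to backup nodes.
        return self.table[flow]
```

The key property is that `route` is a pure lookup after the first packet, which is what makes losing the table (or replicating it inconsistently) so dangerous for long-lived flows.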

Aragog sets out to examine the operation of this NF and look for invariant violations. Obviously, if we look at the entire NF, which handles thousands and thousands of flows, the problem is rather large. However, Aragog does not need to take this global approach: since each flow is independent, it can perform verification at the flow granularity. For example, the system can check whether, at any time, a flow is directed by at most one primary node. This check still requires a global view of all system nodes to make sure that there are no two primaries, but it does not require the entirety of the state and needs only the state associated with a particular flow. In a nutshell, Aragog constructs a state machine from the description of an invariant violation for each flow, allowing for embarrassingly parallel scalability due to the sharded nature of the problem. The state machine transitions across states as it receives events (i.e., messages), and if it reaches the termination state, a notification can be issued about the violation. Naturally, since the events can happen on different nodes, they are all shipped to a centralized verification agent that runs the invariant violation state machine. However, it would still be inefficient to ship all the events for each flow to construct these state machines, so Aragog does some filtering — it does not send messages irrelevant to the invariant being checked, and it avoids shipping messages that can be confirmed locally to not transition the state machine to a new state.
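As an illustration of the per-flow idea, here is a rough sketch of what a checker for the “at most one primary per flow” invariant might look like. This is hypothetical code of my own, not Aragog’s actual state-machine compiler, and the event format is an assumption.

```python
from collections import defaultdict

class AtMostOnePrimaryChecker:
    """Per-flow state machine sketch: flags a violation when a second node
    claims primary ownership of a flow before the first releases it."""

    def __init__(self):
        self.primaries = defaultdict(set)  # flow_id -> nodes claiming primary
        self.violations = []

    def on_event(self, ts, node, flow_id, kind):
        owners = self.primaries[flow_id]
        if kind == "become_primary":
            if owners and node not in owners:
                # Terminal state: two distinct primaries for one flow.
                self.violations.append((ts, flow_id, set(owners) | {node}))
            owners.add(node)
        elif kind == "release_primary":
            owners.discard(node)

checker = AtMostOnePrimaryChecker()
for ev in [(1, "A", "f1", "become_primary"),
           (2, "A", "f1", "release_primary"),
           (3, "B", "f1", "become_primary"),    # fine: A released first
           (4, "C", "f1", "become_primary")]:   # violation: B still primary
    checker.on_event(*ev)
print(checker.violations)  # [(4, 'f1', {'B', 'C'})]
```

Note that each flow’s state is independent of every other flow’s, which is exactly what makes sharding the verification embarrassingly parallel.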

The evaluation shows that Aragog can detect violations caused by bugs, and it sustains quite substantial checking throughput.

As always, a lot more details can be found in the paper. A. Jesse Jiryu Davis did an excellent presentation of the paper. He also has quite a few interesting questions/ideas at the end of the presentation.

Discussion

1) Best-Effort. Aragog makes certain assumptions and choices that make it a “best-effort” system, as it may miss some violations (false negatives) and may even detect some that did not happen (false positives). One such assumption we noted in the discussion is time synchronization, since the events/messages are ordered and applied to the state machine according to their timestamps. While good time sync is possible, it is not clear how well this was implemented in Aragog (it uses PTP) and how big of an impact imperfect sync may have. For example, a reordering of the messages may potentially hide a violation from being detected, or, even worse, make a non-violating sequence of messages appear as one that has a problem. Another avenue for missing violations is the use of sampling.
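A tiny hypothetical example of the hiding effect: suppose node A actually claimed a flow while node B still held it, but A’s skewed clock stamps its event late enough that it sorts after B’s release. The event format and checker logic below are my own illustration, not Aragog’s.

```python
# Events as (stamped_time, node, kind). In real time, A claimed the flow
# at t=11 while B still held it (B released at t=12) -- a genuine
# two-primaries overlap. But A's clock runs fast, so its event is
# stamped t=13 and sorts after B's release.
events = [(10, "B", "become_primary"),
          (13, "A", "become_primary"),   # really happened at t=11
          (12, "B", "release_primary")]

owners, violation = set(), False
for ts, node, kind in sorted(events):  # verifier orders by timestamp
    if kind == "become_primary":
        if owners and node not in owners:
            violation = True
        owners.add(node)
    else:
        owners.discard(node)
print(violation)  # False: the skewed ordering hides the real overlap
```

Flip the skew the other way (a correct sequence whose timestamps interleave wrongly) and the same loop reports an overlap that never happened, which is the false-positive case.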

2) Need for Runtime and Near-real-time Verification. One of the major requirements in Aragog is runtime or near-real-time verification. However, we were not super clear on why this is needed. For example, even if you detect a problem quickly, there may be limited options to react to it quickly. Unless the problem causes an outage (which would be detectable by other means), the time to resolve may be very large, as a bug needs to be reproduced, tested, fixed, and deployed, which at best takes many hours or even days. Another reason the near-real-time requirement is questionable is the best-effort nature of the system described above. That said, we did talk about one scenario where a quick response is needed. Consider a rollout of a new version of the network function. Rollouts typically occur in stages, so it is important to detect any issues early and stop the rollout before it hits the majority of servers.

3) Embarrassingly Parallel. The cool thing about Aragog is its embarrassingly parallel nature. While the system is applied in a specific domain (testing network functions), we feel that other applications/domains can parallelize in a similar manner.

4) “Productizing” Aragog. Again, currently, Aragog is a niche/domain-specific system. However, it has nice features, like scalability and a nice language for describing invariant violations and filtering, so we wonder if it can be productized beyond this specific application.

5) PL & Distributed Systems. I largely skipped this in my summary, but quite a lot of the nice things about Aragog come from its language and how it creates a state machine out of an invariant violation description. Not to mention all the local suppression to avoid sending to the global verifier the events/messages that do not change the state of the machine. This is cool stuff, at least from our distributed systems point of view, which is a bit removed from programming languages work.

6) Retroscope. Finally, I could not resist making the comparison with some early work of mine — Retroscope. Retroscope uses HLC (to avoid time-sync causality-breaking issues) to construct a progression of globally distributed states and search for… you guessed it, invariant violations. Retroscope uses an SQL-like language to express the predicates that describe a violation. Unlike Aragog, I tried to make Retroscope general. It also does not shard like Aragog does, but once the events are ingested by a streaming service, the predicate search is embarrassingly parallel. Retroscope was one of my first works in the distributed systems space, so it is not quite as optimized or fast, but it is also a more general prototype.
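For context, here is a minimal sketch of a Hybrid Logical Clock in the style of the standard HLC algorithm (my simplified rendering, not Retroscope’s actual code). Each timestamp is an (l, c) pair: l tracks the largest physical time seen so far, and c is a logical counter that breaks ties while preserving causality across loosely synchronized machines.

```python
class HLC:
    """Sketch of a Hybrid Logical Clock. Timestamps (l, c) compare as
    tuples; an event always gets a timestamp larger than any event it
    causally depends on, even with skewed physical clocks."""

    def __init__(self):
        self.l, self.c = 0, 0

    def now(self, physical):
        """Timestamp a local or send event given a physical clock reading."""
        if physical > self.l:
            self.l, self.c = physical, 0
        else:
            self.c += 1  # physical clock lagged; advance logically
        return (self.l, self.c)

    def recv(self, physical, remote):
        """Merge a remote (l, c) timestamp on message receipt."""
        rl, rc = remote
        m = max(self.l, rl, physical)
        if m == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif m == self.l:
            self.c += 1
        elif m == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = m
        return (self.l, self.c)
```

The useful property for violation detection is that sorting events by HLC timestamps never reorders causally related events, so a message can never appear to arrive before it was sent, regardless of clock skew.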

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a Slack group where we post papers, hold discussions, and, most importantly, manage Zoom invites to paper discussions. Please join the Slack group to get involved!