Monitoring with Retroscope: Detecting Invariant Violations

Earlier I briefly mentioned Retroscope, our distributed snapshot library that makes taking non-blocking, unplanned consistent global distributed snapshots possible. However, these snapshots are only good if we know how to use them well. Of course the most obvious use case is just a data backup, and despite it being an important application for snapshots, I feel it being a bit boring to my taste. What I am thinking right now is using snapshots for distributed monitoring and debugging.

Let’s consider an application that has a global invariant predicate P, and we want to check if a distributed system holds the invariant P at all times. This means that we should never see a consistent cut in which predicate P = false. So our problem is boiling down to looking for consistent cuts that violate P. Luckily, Retroscope can do exactly this, since we can take one snapshots and incrementally move forward in time as the application execution progresses, checking the predicates by looking at consistent cuts as the state advances.

With the basic Retroscope described in the earlier post, finding predicate violations is a rather cumbersome effort that requires writing new code for every invariant a user wishes to check. So in the past few months I have been working on Retroscope extension tailored specifically for debugging and monitoring use cases.  Improved Retroscope exposes the Retroscope Query Language (RQL), a SQL-like interface to allow users write queries to search for conditions happening in the consistent cuts.

Now let’s go back to our hypothetical system with global invariant P and for now assume P holds when all local predicates p0, p1, p2, …, pn hold on the nodes [0 … n]. As such, P = p0  p1  p2  …  pn, and if any of the local predicates fail, the global predicate fails as well. For the simplicity of the example, we can say that local predicate pi is following: pi = ai + bi > ci. This makes each node maintain all three variables, although the nodes may have different values. With Retroscope, we can expose these local variables to be stored in the local log named inv. The log will maintain both the current version of the variables and the history of variable changes.

How do we look for the violation of such invariant with RQL? Just a single query would suffice for us:

SELECT inv.a, inv.b, inv.c FROM inv WHEN (inv.a + inv.b <= inv.c) LINK SAME_NODE

Now we can dissect this query into bits and see what happens there. RQL queries are meant to retrieve consistent cuts that satisfy certain criteria. The list of parameters following the SELECT statement specifies what variables we want to see in the resultant consistent cuts. FROM keyword enumerate the logs we use in this particular query. The actual consistent cut criteria are specified after the WHEN keyword. In the particular case the condition for emitting cuts is (inv.a + inv.b <= inv.c) LINK SAME_NODE, which is equivalent to emitting cuts when the following holds:f1

By now a curios reader would have probably asked a question of why we even bother with consistent cut in this particular example. All predicates can be checked locally and their evaluation does not depend on other remote servers, so we can simply run local monitors and do not worry about consistent cuts and time synchronization at all: failure on one node designate the failure of the system globally no matter the time. Retroscope and RQL shines when we break away from this locality. What if our invariant involves messages being sent and received? Or what if in involves different parameters that exists on different machines at the same time? With the ability of looking at consistent cuts, RQL breaches the boundary of a single node. Below I list just a few variations of the original query that no longer deal with conjunction of local predicates and look at global state as a whole:

  1. SELECT inv.a, inv.b, anv.c FROM inv WHEN inv.a + inv.b <= inv.c
    • Omitting LINK SAME_NODE part changes the operation of the query drastically, as all three variables are no longer bound to co-exist on the same node:f2
    • rql_predicates1
  2. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b <= inv.c) LINK EACH_NODE
    • Replacing LINK SAME_NODE with LINK EACH_NODE, changes the search condition to require every node satisfying it in the consistent cut:f3
    • rql_predicates2
  3. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b) LINK SAME_NODE <= inv.c
    • Rewriting the condition to WHEN (inv.a + inv.b) LINK SAME_NODE <= inv.c will cause the inv.a and inv.b to be summed on the same nodes, and compared to inv.c values from other nodes as well, so the consistent cut is emitted when f4
    • rql_predicates3
  4. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b <= inv.c) AND NODE($1) = NODE($3)
    • This query restricts inv.a and inv.c to be on the same node. $1 is the placeholder for the first variable encountered while parsing left to right, and $3 is the third variable. This emits the consistent cuts when f5
    • rql_predicates4

Above are just a few simple examples of what is possible with RQL, however there are limitations. The biggest limitation is the complexity of the conditions. Even though RQL does not limit how many operations are possible in the condition of the query, having large expressions can slow the system down drastically. For example, a simple WHEN inv.a > inv.b will examine all a’s that exist on the nodes of the system at the consistent cut and all b’s in every possible combination. For f6 . Comparison is then carried out on every element of product set E.

P.S. I illustrated some of the syntax as it operates at the time of this writing, however RQL is developing, and I am not sure I like syntax of conditions too much, so it is a subject to change.