Reading Group. Distributed Snapshots: Determining Global States of Distributed Systems

On Wednesday we kicked off a new set of papers in the reading group. We started with one of the classic foundational papers in distributed systems and looked at the Chandy-Lamport marker-based distributed snapshot algorithm. The basic idea is to capture the state of distributed processes and channels by “flushing” the messages out of the channels with markers. The markers ensure that causality is not broken, even though the processes take their local snapshots at different times (and with no affinity to physical time). I am not going to summarize the paper, as there is plenty of material on the internet on the subject. However, here is our group’s short presentation by Maher Gamal:


1) Use of snapshots. Much of our discussion focused on the use of snapshots. Aside from the trivial use for disaster recovery, snapshots are useful for debugging and runtime verification. The paper suggests some debugging and monitoring uses, like detecting stable properties of the algorithms. However, we think that detecting violations of certain properties may be more useful in the real world, for instance, detecting violations of invariant properties at runtime.

Just last week we talked about Aragog, a system for runtime verification of network functions. While it does not directly use snapshots, it relies on time synchronization to make sure that the messages come from different consistent cuts of the state, and that the cause-and-effect relationships play out correctly in the constructed state machine.

2) Snapshots of states that did not happen. An interesting thing about Chandy-Lamport snapshots is that they may capture a system state that never occurred in the execution. This is because the snapshot is taken progressively as the markers propagate through the channels, so it essentially gets rolled out at communication speed. The recorded state is still consistent: it is reachable from the state in which the snapshot started, and the final state of the execution is reachable from it.
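To make this concrete, here is a minimal sketch of a two-process run. It is not code from the paper or from our presentation; the `Process` class, the money-transfer workload, and the event order in `run` are all assumptions made up for illustration. P starts the snapshot, then both processes send money to each other while the markers are still in flight.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical marker value for this sketch


class Process:
    """A toy process with one outgoing and one incoming FIFO channel."""

    def __init__(self, balance):
        self.balance = balance
        self.out = None          # outgoing channel (a deque), wired up in run()
        self.recorded = None     # local state recorded for the snapshot
        self.recording = False   # True while recording the incoming channel
        self.channel_state = []  # messages recorded as in flight on the incoming channel

    def start_snapshot(self):
        # Initiator rule: record local state, send a marker, record the incoming channel.
        self.recorded = self.balance
        self.recording = True
        self.out.append(MARKER)

    def send(self, amount):
        self.balance -= amount
        self.out.append(amount)

    def receive(self, inbox):
        msg = inbox.popleft()
        if msg == MARKER:
            if self.recorded is None:
                # First marker seen: record local state; the incoming channel is recorded as empty.
                self.recorded = self.balance
                self.out.append(MARKER)
            self.recording = False  # a marker closes recording on this channel
        else:
            self.balance += msg
            if self.recording:
                self.channel_state.append(msg)


def run():
    p, q = Process(100), Process(100)
    p_to_q, q_to_p = deque(), deque()
    p.out, q.out = p_to_q, q_to_p
    history = []  # every global state the execution actually passed through

    def observe():
        in_flight = lambda ch: tuple(m for m in ch if m != MARKER)
        history.append((p.balance, q.balance, in_flight(p_to_q), in_flight(q_to_p)))

    observe()
    steps = [
        p.start_snapshot,           # P records 100 and sends a marker
        lambda: p.send(10),         # P sends 10 behind the marker
        lambda: q.send(20),         # Q sends 20 before it has seen the marker
        lambda: q.receive(p_to_q),  # Q gets the marker and records 80
        lambda: q.receive(p_to_q),  # Q gets the 10
        lambda: p.receive(q_to_p),  # P gets the 20 and records it as in flight
        lambda: p.receive(q_to_p),  # P gets Q's marker and stops recording
    ]
    for step in steps:
        step()
        observe()

    # Same layout as the history entries: (P, Q, channel P->Q, channel Q->P).
    snapshot = (p.recorded, q.recorded, tuple(q.channel_state), tuple(p.channel_state))
    return snapshot, history


if __name__ == "__main__":
    snapshot, history = run()
    print("snapshot:", snapshot)
    print("occurred in the execution:", snapshot in history)
```

In this run the snapshot records P with 100, Q with 80, and 20 in flight from Q to P. The total of 200 is preserved, yet no moment of the execution looked like this, because P had already dropped to 90 before Q sent its 20.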

3) Timely snapshots. We also brought up snapshots that are close to the wall clock. These may be more useful for debugging purposes than Chandy-Lamport snapshots, as they provide some notion of when things happened. Additionally, tighter snapshots that are taken at about the same time globally are better at recording the true state (or should we say, they have fewer timing artifacts?).

Reading Group

Our reading group takes place over Zoom every Thursday at 1:00 pm EST. We have a Slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the Slack group to get involved!