# Do not Blame (only) Network for Your Paxos Scalability Issues.

In the past few months our lab has been doing a lot of work with different flavors of paxos consensus algorithm. Paxos and its numerous flavors are widely used in today’s cloud infrastructure. Distributed systems rely on it for many different tasks to ensure safe operation. For instance, coordination services use some consensus protocol flavor to provide services like leader election, cluster membership, service discovery and metadata management. Databases, such as Spanner or CockroachDB, use paxos to provide fault tolerant-replication of data across nodes or even datacenters.

If you work with Paxos, or had to deal with it at some point, you probably heard that it does not scale well to large number of nodes. ”Five nodes is ideal, do not try more” rhetoric has been repeated many times by many people and it gets engraved into one’s mind. ZooKeeper’s Administrator’s guide mentions 3 and 5 server deployments. Fairly recent epaxos provides evaluation on 3 and 5 node deployments in their paper. Paxos Made Live paper also mentions five nodes as typical Chubby deployment.

But why five servers is such a magical number in the Paxos world? The most common answer is along the lines of ”Increasing the number of servers increases the quorum size, making the paxos round leader wait for more messages to come to reach consensus”. This explanation is straightforward on the surface, after all, waiting for n nodes to reply should take longer than waiting for n − 1. The question then becomes why waiting for more replies is expensive? I thought the answer would be largely network related, but it appears to be more complex.

## Modeling Paxos Performance: First Attempt

To answer this, I tried to model the non-faulty paxos execution time by looking at message exchange between nodes in the cluster. In the nutshell, running a phase of paxos requires one round-trip-time (RTT) between a node and its peers: the node broadcasts the message and waits for the quorum to reply. With network arguably being the slowest part, I can try  to express the paxos phase performance through the communication delays

First, let me define some variables to model a phase of paxos:

• $$r_l$$ – message RTT in local area network
• $$\mu_l$$ – average message RTT in local area network
• $$\sigma_l$$  – standard deviation of message RTT in local area network
• $$N$$ – number of nodes participating in a paxos phase
• $$q$$ – quorum size. For a majority quorum $$q=\left \lfloor{\frac{N}{2}}\right \rfloor +1$$
• $$m_s$$ – time to serialize a message
• $$m_d$$ – time to deserialize and process a message. This involves various message-related round tasks, such as ballot comparisons, log maintenance/updated, etc.

I assume that the network performance, at least in the local area, is normally distributed. Figure 1 shows a normalized histogram (shaded area is equal to 1) of latencies of approximately 2,000 ping requests within an AWS region.

RTT $$r_l$$ for every message is drawn from a normal distribution $$\mathcal{N}(\mu_l+m_s+m_d, \sigma_l^2)$$. This distribution simulates RTTs with additional  static, non-variable delay for serialization and deserialization. As such, the only variability so far is due to the network behavior.

In order to run a round of paxos, a node needs to send $$N − 1$$ messages. That is, a node sends a message to every other node except for itself. I assume a ”leader” is also an acceptor with a short-circuit behavior, where a it does not need to send a vote to itself over a network.

Out of the $$N − 1$$ messages sent, only $$q − 1$$ messages actually matter. Upon receiving $$q − 1$$ successful (once again, assuming non-faulty execution) replies a node has achieved a quorum and can finish the round (Figure 2). We can simulate this by drawing $$N −1$$ random RTTs $$r_{l1}, r_{l2},…, r_{lN-1}$$ from $$\mathcal{N}(\mu_l+m_s+m_d, \sigma_l^2)$$ and sorting them. The $$q − 1$$ fastest message $$r_{lq−1}$$ is the one carrying the last vote to make up a quorum, thus after processing this message, the node no longer needs to wait for other messages to come.

Assuming that a node broadcasts all messages at the same time, message RTT $$r_{lq−1}$$ can then be used to express the time $$T_r$$ for the entire round: $$T_r = m_s + r_{lq−1} + m_d$$. Figure 3 visually represents the paxos round expressed in the formula.

## Does the Model Make Sense?

With this simple model, I can simulate paxos execution. In multi-paxos optimization where phase-1 is used to primarily pick a stable leader and phase-2 repeats a number of times, the performance of paxos is approximated by just looking at phase-2. If $$m_s$$ and $$m_d$$ are constant, the variability in performance is only due to the network fluctuations.

I created a small python script to run such simulation, as if a single client was interacting with the paxos synchronously one command at a time. Figure 4 illustrates the results with $$m_s = m_d = 0.01 ms$$ and network parameters taken from AWS ping latency figure.

At the first glance, we observe degradation in throughput and increase in latency. However, the performance decreases very slowly, and once I reach 9 nodes, the performance stays almost flat. This is very different from the data we have observed on our actual paxos implementation. Figure 5 shows the throughput and latency from a few concurrent clients interacting with the protocol.

The first thing that catches my attention is the performance degradation between the 3 and 5 node deployments. On the simulation, the difference in throughput was only 2.8%, while real paxos degraded by astonishing 30%.  As the cluster size increases, beyond 5 nodes, the real paxos also appear to degrade quicker.

Clearly, the network variability alone cannot explain the performance hit from increasing the number of nodes. So what is missing from the simple model that greatly impacts the performance as the cluster grows?

## Towards an Improved Model

Starting from the obvious, the per-message serialization $$m_s$$ and processing $$m_d$$ overheads are not static constant parameters, instead they introduce some variance as well. However, $$m_s$$ and $$m_d$$ are small to start with, and making them introduce additional variance to the model will make it better, but it will not change the overall model simulation much.

Assuming that $$m_s$$ is drawn from a normal distribution $$\mathcal{N}(\mu_{ms}, \sigma_{ms}^2)$$ and $$m_d$$ is similarly drawn from $$\mathcal{N}(\mu_{md}, \sigma_{md}^2)$$, then the message RTT time $$r_{l}$$ must be drawn from $$\mathcal{N}(\mu_l+\mu_{ms}+\mu_{md}, \sigma_l^2 + \sigma_{ms}^2 + \sigma_{md}^2)$$. Making this changes to the model resulted in a difference between 3 and 5 node simulations to grow from 2.8% to 2.9% with $$\mu_{ms} = \mu_{md} = 0.01$$ and $$\sigma_{ms} = \sigma_{md} = 0.002$$ (some arbitrary values that make little impact, unless same magnitude as network mean and standard deviation).

Introducing the variability to message overheads $$m_s$$ and $$m_d$$ still does not account for these overheads being dependent on the number of node in the cluster. The dependence on $$N$$ is not direct, and instead the per-message costs increase as the number of messages that a server needs to handle grows. This puts a ”leader” node repeating the phase-2 of paxos in a more vulnerable position as it will experience the increased traffic.

To understand how message serialization and processing depends on the number of messages exchanged, we need to consider the implementation of application’s network layers. Often times, as an application receives the message, it will use one or more processing queues or pipelines to deserialize the message and dispatch it to the appropriate handler for processing.

Let’s consider an example with just one such pipeline. If the queue is empty upon message arrival, it will get to deserialize with no further delay. However, when there are other messages in the queue, the new message has to wait for it turn to be processed.

A single Paxos round has a bursty network utilization, especially at the leader. This is because the leader node first broadcasts a messages and then receives multiple replies at roughly the same time. If the message deserialization has not finished before the next message arrives, then the pipeline gets clogged. Further messages potentially make the issue worse by growing the queue, as shown in Figure \ref{fig:pipeline_clogged}.

Obviously, having multiple parallel pipelines can help, given that they are balanced and there are enough compute resources to run them. However, the problem can also get worse when we start running rounds concurrently (i.e. multiple clients interact with paxos). In such scenario, messages from different rounds will compete for the fixed number of processing queues.

Where does this leave me with trying to model paxos round performance? I need to account for a number of additional factors, such as the number of message-processing queues/pipelines/threads, the rate of message deserialization at each queue, and the number of concurrent paxos rounds. Oh, and since the messages arriving after the quorum has been reached still need to get deserialized, they will contribute to how clogged the pipelines are.

I will pick up from this point onwards in some future post. I will also try to look at flexible quorums and our WPaxos protocol.

# Trace Synchronization with HLC

Event logging or tracing is one of the most common techniques for collecting data about the software execution. For simple application running on the same machine, a trace of events timestamped with the machine’s hardware clock is typically sufficient. When the system grows and becomes distributed over multiple nodes, each node is going to produce its own independent logs or traces. Unfortunately, nodes running on different physical machines do not have access to a single global master clock to timestamp and order the events, making these individual logs unaligned in time.

For some purposes, such as debugging, engineers often need to look at these independent logs as one whole. For instance, if Alice the engineer needs need to see how some event on one node influenced the rest of the system, she will need to examine all logs and pay attention to the causal relationships between events. The time skew of different nodes and the trace misalignment caused by such skew prevents Alice from safely relying on such independently produced traces, as shown in the figure below. Instead, she need to collate the logs together and reduce the misalignment as much as possible to produce a more coherent picture of a distributed system execution.

Of course if only the causality was important, Alice could have used Lamport’s logical clocks (LC) to help her identify some causal relationship between the events on different nodes. Alternatively, logging system can also use vector clocks (VC) to capture all of the causal relationships in the system. However, both LC and VC are disjoint from the physical time of the events in the trace, making it hard for Alice to navigate such logging system.

Using synchronized time protocols, such as NTP and PTP  will help make the traces better aligned. These protocols are not perfect and still leave out some synchronization error or uncertainty, introducing the possibility of breaking the causality when collating logs purely with them.

## HLC sync

Instead of using NTP or logical time for synchronizing the event logs, I thought whether it is possible to use both at the same time with the help of Hybrid Logical Time (HLC). HLC combines a physical time, such as NTP with logical clock to keep track of causality during the period of synchronization uncertainty. Since HLC acts as a single always-increasing time in the entire distributed system, it can be used to timestamp the events in the log traces of every node. You can learn more about HLC here.

Similar to logical time, HLC does not capture full causality between the events. However, HLC conforms to the LC implication: if event A happened before event B, then HLC timestamp of A is smaller than the HLC timestamp of B. This can be written as A hb B ⇒ hlc.A < hlc.B. Obviously, we cannot use HLC timestamps to order any two events. Despite this limitation, we can still use the LC implication to give some partial information about the order of events. If an event A has an HLC timestamp greater than the event B, we can at least say that event B did not happen before A, thus either A happened before B or A is concurrent with B: hlc.A < hlc.B ⇒ A hb B ∨ A co B.

We can use this property to adjust the synchronization between the traces produced at different nodes. Let’s assume we have two nodes with some clock skew. These nodes produce logs that are not fully synchronized in time (we also assume the knowledge of a global, “ideal” time for now). The events in the log happen instantaneously, however we can rely on the machine’s clock to measure the time between events on the same node to give the “rigidity” to the logs. Each node timestamps the log events with its machine’s physical time (PT).

In the figure above, the two logs are not synchronized in the “ideal” time, even though they appear to be in sync based on the PT of each node. Without any additional information we cannot improve on the synchrony of these logs. However, if we replace PT time with HLC, we can achieve better trace synchronization.

With the addition of HLC time, we may see that when the logs are aligned by the PT only, some HLC timestamps (highlighted in yellow) appear to be out of place. In particular, this alignment does not satisfy the LC condition (A hb B ⇒ hlc.A < hlc.B), since the alignment makes event a1 to appear to happen before c2, however, hlc.c2 > hlc.a1. In order to satisfy the condition, we need to re-sync the logs such that c2 appear concurrent with a1.

After the adjustment, our synchronization error between two nodes is reduced. Note that we cannot synchronize the logs better and put a1 to happen before c2, since the LC implication simple does not allow us to do so.

The two node synchronization works nice and easy because the LC/HLC implication provides some guarantees about the two events, and we pick these two events from two separate nodes.  Aligning more than two logs is more challenging, as we need to perform many comparisons each involving just two events from some two nodes. The number of possible comparison we need to make grows drastically as we increase the number of traces to sync.

However, with HLC we can reduce the problem to just that of performing n-1 2-node log alignments when we need to sync logs from n nodes. HLC operates by learning of higher timestamps from the communication, thus HLC time at all nodes of the cluster tends to follow the PT time of one node with the highest PT clock.  Once, again to see this you need to understand how HLC ticks, which is explained here. Having one node that drives the entire HLC in the cluster allows us to synchronize every other log from each node independently with the log of that HLC “driver node”.

## Some Testing and Limitation

I set up some synthetic tests to see if HLC can help us achieve any improvement in log synchronization on already loosely synchronized system. The benchmark generates the logs with a number of parameters I can controlled: time skew, communication latency and event probabilities. Maximum time-skew controls the time desynchronization in the simulated cluster measured in time ticks. There will be two nodes in the simulation with maximum time-skew difference between their clocks. Communication latency parameters control the latency of simulated communications in time ticks. Probability of an event parameter controls the chance of an event happening at every time tick; similarly, the probability of communication determines the chance of an outgoing message happening at the time tick.

Since the logs are generated synthetically, I have access to the ideal time synchronization between these logs, allowing me to quantify the alignment error. I calculated the alignment error as 1 – adjustment / skew for every pair of logs.

The figure below shows the error in synchronizing 5 logs as a function of the log length. The Idea is that a longer trace can allow for more opportunities to find the violations in LC/HLC condition and adjust the logs accordingly. All other parameters were kept constant with skew of 50 ticks, communication delay of 3 to 10 ticks, a chance of event as 40% per tick. I made different plots for different probability of inter-node communication. I repeated the simulation 100,000 times and computed the average error.

We can see that in this setup the synchronization between logs was significantly improved, and the improvement happens faster when communication if more frequent. This is because communication introduces the causality between nodes, allowing me to exploit it for synchronization. As the logs grow larger the improvements diminish. This is likely due to the reducing closer to the communication delay, at which point HLC synchronization can no longer make improvements.

Inability to achieve any synchronization improvement when the communication latency is equal or greater than the maximum time skew in the system is the most significant limitation of HLC synchronization. The following graph illustrates that:

Here I ran the simulation with 3 different skew levels: 10, 20 and 50 ticks. As I increase the communication latency, the error is growing as well, getting to the level of no improvement as the latency reaches the time-skew.

## Some Concluding Words

I think this is a rather naïve and simple way to achieve better synchronization for distributed logging software. One can argue that NTP and PTP time synchronization methods are rather good nowadays and that it should be enough for most cases. However, computers are fast now and even in 1 ms of desynchronization many computations can be made. Even a few full round trips of network message exchange can happen in that 1 ms in good Local Area Networks.

HLC synchronization’s simplicity allows it to be implemented purely in user-level application code. There is no need to constantly run NTP protocol to keep tight time synchrony, there is no need to access or modify any of the underlying system-level code beyond just reading the clock, there is no need to access high precision clock that may slow the application down.

HLC can also be used within a single machine to synchronize on traces from different threads. Despite having access to the same clock at all threads and processes, the clock granularity may still be coarse enough to do many computations within a single tick.

HLC sync is not without the limits. Its usefulness degrades as the synchronization gets closer to the communication latency, but it can still be used as a fail-safe mechanism in cases NTP fails.

Alternatively to synchronizing logs, HLC can be used to find consistent global states and search through the distributed application’s past execution.

# Sonification of Distributed Systems with RQL

In the past, I have discussed sonification as a mean of representing monitoring data. Aside from some silly and toy examples, sonifications can be used for serious applications. In many monitoring cases, the presence of some phenomena is more important than the details about it. In such situations, simple sonification is a perfect way to alert users about the occurrence of such phenomena.  For example, Geiger counter alerts users of the radioactive decay, without providing any further details.

Our recent work on Retroscope and RQL allows us to bring sonification to the same “phenomena-awareness” plane for distributed systems. With RQL we can search for interesting conditions happening globally in the system, and we can use sounds to alert the engineers when certain predicates occur in the global state. Of course, for such system to be useful, it needs to be real-time, as the recency of events is the most important attribute of this type of sonification. For instance, users of a Geiger counter are not interested in hearing ticks for decay events happening a minute ago, as it rather defeats the purpose of the counter. Similar requirements apply to the “phenomena-awareness” tools in the distributed systems domain, as engineers need to know what happens in the system instantaneously.

RQL can easily allow inspecting past states, however it lacks when it comes to the current state inspection.  What we need is a streaming query that can continuously receive individual logs from Retroscope log-servers, and perform the search as soon as sufficient data become available to form a consistent cut. Of course, streaming queries will still be lagging behind the current time, however we can make this lag small enough to be virtually unobservable by a person.

With streaming queries we can not only receive the cuts meeting some search criteria in near real time, but we can also sonify the mere fact such cuts have been found. If we look at the previous example of ZooKeeper staleness, we can run sonified streaming queries to have the system alert us when the data staleness reaches some threshold, such as two or more versions stale.  In the audio clip below, we have used MIDI to sonify the stream of staleness events occurring in a ZooKeeper cluster. Multiple queries were sonified to produce different sounds for different staleness of ZooKeeper data.

We can hear periods of silence when the staleness is below the threshold of 2 values, however we can also observe some variations in cluster performance. For instance, it is very easy to identify when the cluster was experiencing more problems than normal. It is also easy to hear it recovering from the spike in staleness soon after.

# The First Datastore-driven Vehicle

It is not a secret that procrastination is the favorite activity of most PhD students. I have been procrastinating today, even though my advisor probably wants me to keep writing.  In the midst of my procrastination, I thought: “Why are there self-driving vehicles, but no database-driven vehicles?” As absurd as it sounds, I gave it a try. And now I not so proudly present the prototype of the world’s first(?) datastore-driven vehicle. The implementation was quick and simple. I took the Voldemort code I have been tinkering with in my Voldemort traffic light post, and changed it to send the signals over Bluetooth connection to a Frankenstein-of-a-car I’ve built very quickly out a few motors, a gearbox, an Arduino, a bunch of wires, some tracks, and lots of thermal glue.

The car changes the direction of travel when the workload characteristics change: it will go straight when number of read operations is the same as number of write operation.  By changing the read-to-write ratio of the workload, the car will turn either left or right. The prototype vehicle is not very responsive (weak motors/dead batteries) and sometimes runs into things despite my best attempts to control it.

And now, once the world has a database-driven car, the procrastination can continue… Or should I start writing?

# Monitoring with Retroscope: Detecting Invariant Violations

Earlier I briefly mentioned Retroscope, our distributed snapshot library that makes taking non-blocking, unplanned consistent global distributed snapshots possible. However, these snapshots are only good if we know how to use them well. Of course the most obvious use case is just a data backup, and despite it being an important application for snapshots, I feel it being a bit boring to my taste. What I am thinking right now is using snapshots for distributed monitoring and debugging.

Let’s consider an application that has a global invariant predicate P, and we want to check if a distributed system holds the invariant P at all times. This means that we should never see a consistent cut in which predicate P = false. So our problem is boiling down to looking for consistent cuts that violate P. Luckily, Retroscope can do exactly this, since we can take one snapshots and incrementally move forward in time as the application execution progresses, checking the predicates by looking at consistent cuts as the state advances.

With the basic Retroscope described in the earlier post, finding predicate violations is a rather cumbersome effort that requires writing new code for every invariant a user wishes to check. So in the past few months I have been working on Retroscope extension tailored specifically for debugging and monitoring use cases.  Improved Retroscope exposes the Retroscope Query Language (RQL), a SQL-like interface to allow users write queries to search for conditions happening in the consistent cuts.

Now let’s go back to our hypothetical system with global invariant P and for now assume P holds when all local predicates p0, p1, p2, …, pn hold on the nodes [0 … n]. As such, P = p0  p1  p2  …  pn, and if any of the local predicates fail, the global predicate fails as well. For the simplicity of the example, we can say that local predicate pi is following: pi = ai + bi > ci. This makes each node maintain all three variables, although the nodes may have different values. With Retroscope, we can expose these local variables to be stored in the local log named inv. The log will maintain both the current version of the variables and the history of variable changes.

How do we look for the violation of such invariant with RQL? Just a single query would suffice for us:

SELECT inv.a, inv.b, inv.c FROM inv WHEN (inv.a + inv.b <= inv.c) LINK SAME_NODE

Now we can dissect this query into bits and see what happens there. RQL queries are meant to retrieve consistent cuts that satisfy certain criteria. The list of parameters following the SELECT statement specifies what variables we want to see in the resultant consistent cuts. FROM keyword enumerate the logs we use in this particular query. The actual consistent cut criteria are specified after the WHEN keyword. In the particular case the condition for emitting cuts is (inv.a + inv.b <= inv.c) LINK SAME_NODE, which is equivalent to emitting cuts when the following holds:

By now a curios reader would have probably asked a question of why we even bother with consistent cut in this particular example. All predicates can be checked locally and their evaluation does not depend on other remote servers, so we can simply run local monitors and do not worry about consistent cuts and time synchronization at all: failure on one node designate the failure of the system globally no matter the time. Retroscope and RQL shines when we break away from this locality. What if our invariant involves messages being sent and received? Or what if in involves different parameters that exists on different machines at the same time? With the ability of looking at consistent cuts, RQL breaches the boundary of a single node. Below I list just a few variations of the original query that no longer deal with conjunction of local predicates and look at global state as a whole:

1. SELECT inv.a, inv.b, anv.c FROM inv WHEN inv.a + inv.b <= inv.c
• Omitting LINK SAME_NODE part changes the operation of the query drastically, as all three variables are no longer bound to co-exist on the same node:
2. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b <= inv.c) LINK EACH_NODE
• Replacing LINK SAME_NODE with LINK EACH_NODE, changes the search condition to require every node satisfying it in the consistent cut:
3. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b) LINK SAME_NODE <= inv.c
• Rewriting the condition to WHEN (inv.a + inv.b) LINK SAME_NODE <= inv.c will cause the inv.a and inv.b to be summed on the same nodes, and compared to inv.c values from other nodes as well, so the consistent cut is emitted when
4. SELECT inv.a, inv.b, anv.c FROM inv WHEN (inv.a + inv.b <= inv.c) AND NODE($1) = NODE($3)
• This query restricts inv.a and inv.c to be on the same node. $1 is the placeholder for the first variable encountered while parsing left to right, and$3 is the third variable. This emits the consistent cuts when

Above are just a few simple examples of what is possible with RQL, however there are limitations. The biggest limitation is the complexity of the conditions. Even though RQL does not limit how many operations are possible in the condition of the query, having large expressions can slow the system down drastically. For example, a simple WHEN inv.a > inv.b will examine all a’s that exist on the nodes of the system at the consistent cut and all b’s in every possible combination. For  . Comparison is then carried out on every element of product set E.

P.S. I illustrated some of the syntax as it operates at the time of this writing, however RQL is developing, and I am not sure I like syntax of conditions too much, so it is a subject to change.

# Globally Consistent Distributed Snapshots with Retroscope

Taking a consistent snapshot of a distributed system is no trivial task for the reasons of asynchrony between the nodes in the system. As the state of each machine changes in response to incoming external messages or internal events, each node may produce a log of such state changes. With the log abstraction, the problem of taking a snapshot transforms into the issue of aligning the logs together and finding a consistent cut among all these logs. However, time asynchrony between the servers makes collating all the system logs difficult using just physical clocks available at each machine, because clocks tend to drift, producing some time asynchrony or time uncertainty. Various time synchronization protocols, such as NTP and PTP, exist, but perfect synchronization is still unattainable.

Retroscope is the system we designed to take unplanned, non-blocking, consistent global snapshots. Unlike other systems, Retroscope does not need to block while taking snapshot, as it does not need to wait out time uncertainty caused by the clock skews at various machines thanks to the reliance on Hybrid Logical Clocks (HLC) instead of NTP-synchronized (NTP) time.

HLC introduces causality information into the clock as messages are being exchanged between servers and provides the same causal guarantees as Lamport’s Logical Clocks (LC). More information about HLC can be found here and here.

## Taking a snapshot

Retroscope achieves snapshots by adding HLC into the network communication of the Retroscoped system. Internally, Retroscope keeps a sliding window-log of past state changes along with the associated HLC timestamps at each node. This window-logs are used to facilitate the unplanned nature of taking snapshots. Snapshots are triggered by a special client that maintains the common HLC with the rest of the system. Retroscope allows for instant unplanned snapshot to be started by the initiator, such instant snapshots are guaranteed to capture the states at the time Tnow of snapshot request being issued by the initiator. Obviously, once the snapshot request message reaches the nodes, the time has advanced to Tr > Tnow.

Upon receiving the snapshot request message at Tr, each node starts taking a local snapshot. Since the system does not halt processing requests, depending on the implementation, we may arrive to a local snapshot at some time Tf >= Tr. Because our local snapshot is at the state that happened after the requested time, we need to modify it to arrive to the state at time Tnow. We use the window-log of state changes to undo all operations that happened locally after Tnow, thus arriving to a desired local snapshot. Once all nodes compute local snapshots, Retroscope is done taking a global consistent snapshot at time Tnow.

Retroscope provides more flexibility in taking unplanned snapshots. Taking instant snapshot (i.e. snapshot initiated at Tnow) requires each node to maintain only a small log of recent changes. We can, however, expand the instant snapshots and offer retrospective snapshot flexibility at the expense of growing state change log larger. With retrospective snapshots, we can offer the ability to look at the state that has already happened in the past. This functionality is handy for application debugging when there is a need to investigate the root cause of the problem after it has already happened. A distributed reset is another application that can benefit from the retrospective snapshots, as the system can be reset into a correct state after the state has been corrupted.

We have Retroscoped Voldemort key-value database to take data-snapshots. Retroscoping Voldemort took less than a 1000 lines of code for adding HLC to the network protocol, recording changes in the Retroscope window-log, and performing snapshot on Voldemort’s storage. We did the experiments on the 10-node Voldemort cluster with databases of various sizes. We have learned that keeping the window-log of state changes has very little impact on the throughput and latency when no snapshots are being taken, as seen in figure below.

Performing snapshot is non-blocking in Retroscope, because there is no need to wait out the time uncertainty. The non-blocking nature allows Voldemort to continue processing both read and write requests while the snapshot is being computed. The figure below shows throughput and latency for every second of execution while taking the snapshot on a 10,000,000 items database (each item is 100 bytes). Overall we have observed an 18% throughput and 25% latency degradation over the snapshot time, however these numbers can be improved by using a separate disk system for snapshot.

## What can be done with Retroscope?

We used Retroscope to take snapshots of data on a key-value store, however the utility of snapshots can be very extensive. With powerful snapshot capabilities, such as retrospective snapshot, we can look into the past of our distributed system and search for anomalies. Retrospective snapshots can be used to restore system to the latest correct state after the state corruption. Finding such correct states is also possible with Retroscope; we can use it to take successive snapshots to check for global invariants; it can be powerful in monitoring various application level predicates. Retroscope can perform other monitoring tasks. Unlike other monitoring systems that tend to look at local state independently or isolate monitoring to a request level, Retroscope can look at global parameters of the system across every node in a consistent way. We can even use Retroscope to detect erroneous patterns in message exchange by observing what messages are sent and received and how they impact the state at each node as we go through time.

# The Light of Voldemort

Few month ago I showcased how a single server of Voldemort key-value store sounds.  Sonification is a valid way to monitor systems, and has been used a lot in real applications. Geiger counter would be one of the most well-known examples of a sonified application. In some cases sonification may be the preferred form of representing information, as other forms surprisingly may not work as well for human perception. Even visualization of information is often not as good as sonification. Take the same Geiger counter example; research has shown that visual radiation level monitors do not perform quite as well as the sound ones as people tend to be distracted from the display to perform other tasks. Even visual and audio hybrids do not alert users of high radiation levels as good as simple audio counter.

As an example of a hybrid audio-visual system for Voldemort, I have built a small “traffic-light monitor” that changes colors and beeps differently depending on what action is performed by the server. The rig is built with Arduino and simply plugs to the USB of the machine running Voldemort server. Below is a short video of how it operates:

The green light lights up when server handles a “get” operation. Yellow light is for “put” requests and red light is for “get version” commands. As can be see, writing to Voldemort requires two operations, “get version” and “put” and they happen so quickly that Arduino is barely capable to light up the LEDs.

P.S. “Traffic light monitor” is not to be taken seriously, it is rather a silly example to show that there are plenty of ways to monitor a system or represent system’s logs.

# The Sound of Voldemort

Recently at our lab we discussed a fun little project of making distributed systems “play” music. The idea of sonifying a distributed application can be of some benefit for debugging and maintenance, since people have natural ability in recognizing patterns. Of course developer or systems administrators can analyze the logs of their systems and study the patterns that way, but listing to patterns and hearing the changes in such patterns is something we can do in the background, probably without taxing our entire attention span.

So how does a distributed system sound? And what can we learn by listening to it? Here is a 4.5 minute clip of a single Voldemort server playing its song.

Each message request type coming to the server was assigned a different pitch, with the note duration roughly corresponding to the time it took to fulfill the request. Of course, the recording was slowed down compared to the original execution of the node, with the entire 4 minutes and 37 seconds of audio representing just a coupe of seconds of real-time operation.

The audio has been recorded under a static workload of read and write operations, but there are few things that we can definitely hear about Voldemort’s operation even without the workload variations. The most obvious one is that the first half a minute of the audio is mostly silence. This is something I observed from the logs earlier as well, as Voldemort takes some time to get to its paces. As the execution progresses, we can definitely hear different operations happening at somewhat constant rate. In the second half of the audio, we can hear a few “hick-ups” as well as some louder and more forceful sounds for the requests that took longer to process. This, however was normal operation of Voldemort node, and introducing some problems into the system, such as network congestion or some machine failure will most definitely impact the sounds of this node.

What about making the sound of the entire distributed system?  This becomes a trickier problem, as now we need to play multiple streams for all the distributed components in our system at once. Such components can be located on different physical servers and different racks and even different datacenters. However, for us to play the “true” sound of the system while preserving the causality of events, we need to be able to precisely synchronize and align the streams from various servers, accounting for any time skew and clock imprecisions.

Additionally, with multiple servers “playing” at once it may be more difficult for people to comprehend the patterns and the changes to such patterns.