Tag Archives: time

Reading Group. Sundial: Fault-tolerant Clock Synchronization for Datacenters

In our 48th reading group meeting, we talked about time synchronization in distributed systems. More specifically, we discussed the poor state of time sync, the reasons for it, and most importantly, the solutions, as outline in the “Sundial: Fault-tolerant Clock Synchronization for Datacenters” OSDI’20 paper. We had a comprehensive presentation by Murat Demirbas. Murat’s talk was largely based on his extensive time synchronization experience in wireless sensor networks.

First, let’s talk about the need for time synchronization. Many problems of distributed computing could have been avoided if we had a perfect global clock available everywhere, as we often rely on the ordering of events for correctness. For instance, such a perfect clock would make causality/dependency tracking easy. And this alone would have simplified and improved many different systems and processes, ranging from efficient consistent snapshots, to more consistent storage systems, to the improved debuggability of all distributed applications. In the absence of a perfect global clock, we have been relying on other clever tricks and techniques, such as logical clocks, vector clocks, loosely synchronized causality-tracking hybrid logical clocks to name a few. 

Fundamentally, if we have unsynchronized clocks on two servers, we cannot use these clocks to order the events. The paper provides the following example to the issue: a shared variable X is read on some server at time T, but this same variable is updated on a different server at time T-1, however, due to time asynchrony, the update actually happens after the read in the real-time. This essentially makes the clocks useless for ordering, unless we know how badly unsynchronized the clocks are. Knowing this time uncertainty ε allows us to delay the read at T until we know that all servers have moved on past T. Some systems may resort to rescheduling the operations/transactions that fall within the uncertainty instead of waiting, but this is a similar enough use case. Naturally, having smaller uncertainty is better for performance, since a system will incur shorter waits or fewer rescheduled operations.

So what prevents us from driving this uncertainty ε down to 0 for a perfect synchronization? This is not an easy answer, and there is a myriad of factors. The clocks themselves are a problem — servers tend to have cheap quartz oscillators that “tick” at different speeds depending on temperature and voltage variations. These variations make individual machines drift apart ever-so-slightly over time. Trying to synchronize these flimsy clocks is a problem as well — the servers communicate over the network for time sync. And the network is unpredictable, starting from how messages may be routed, to different queues and buffer delays at NICs and switches. All these add the variability to message propagation time and make the network non-symmetric, as message flow in one direction may be faster than in the opposite. 

The paper proposes Sundial, a set of techniques to tame the network-induced uncertainties. Sundial focuses on reducing the message propagation variability in the network.

Firstly, Sundial avoids indirect communication and only exchanges messages between adjacent neighbor nodes in the network topology. This eliminates routing uncertainty between nodes, and also buffer/queue uncertainty at the intermediate switches. 

Secondly, Sundial records the timestamps into messages at the lower level in the network stack. This ensures that the timestamp we are transmitting for synchronization has not been sitting in the local queue for too long, again reducing the variability. 

Thirdly, Sundial ensures that a single node is used as a source of truth for the current time. Since the nodes in the system cannot talk directly to the “source of true time”, the system constructs a tree communication topology starting with the TrueTime root and covering all nodes in the system. 

Fourthly, Sundial tames the unreliable clocks on the individual servers by doing very frequent synchronizations — once every 100 microseconds. 

A big portion of the paper is devoted to handling failures since a link or node failure prevents the updated time to reach any node in the subtree below the fault, that subtree may start to deviate more the TrueTime at the root node. The gist of the solution is to allow all nodes in the impacted branch to detect the synchronization failure and switch to an alternate tree structure that was precomputed ahead of time. As all impacted nodes perform the switch to a new tree locally, the coordination is avoided, and the process is very quick. An important point in having such a back-up plan is to make sure it is smart enough to avoid correlated failures that can render both the main and back-up trees broken. The paper has a lot more details on the fault tolerance aspect, including handling the failures of root nodes.

Combining all the Sundial’s techniques provides good time synchronization with fairly tight bounds. It achieves ~100 ns synchronization even under some failures, which is significantly better than PTP time synchronization (and even better than its precursor NTP?).


We had a nice discussion and questions, below I summarize the most important points.

1) Set of techniques. As outlines, Sundial is a set of techniques to improve the time sync, and there are some important lessons there. For example, doing things in hardware (or as close to hardware) is good. We start seeing (network) hardware optimizations for distributed systems more and more often. Just a few weeks ago we talked about smart switches and using them to drive replication and routing for “hot keys” in a storage system. Obviously, time synchronization is a different problem, but it is also the one to benefit from hardware a lot. Another lesson is to have a single source of time, even though it makes the communication pattern more structured and prone to failures. 

2) Better clocks/oscillators. Sundial synchronizes a lot – one message every ~100 microseconds. This is 10000 messages per second. We are not sure what impact this may have on the network (messages are small) and performance, but there is a practical reason for synchronizing this often. As Sundial aims to keep the uncertainty small (ε=~100ns), it cannot afford the cheap clocks to drift too much upon failures and needs to failover to a back-up tree quickly. This means that the system needs to have a super-tight timeout and very frequent message exchange. Better clocks/oscillators (or maybe using multiple clocks in a server?) can improve the situation here and either allow for even better synchronization or reduce the message frequency. Oven-controlled oscillators, for example, use a heated chamber to keep the crystal at the same temperature and reduce its drift due to the temperature variations. 

3) Comparison with PTP. The paper extensively compares Sundial with PTP protocol. The authors mention how PTP does not report ε, and that they had to augment the designs to provide the uncertainty in these protocols. The paper puts PTP’s uncertainty at ε=800μs. This contrasts with other literature but, where PTP is often reported as having a sub-nanosecond accuracy (is accuracy the same as uncertainty? but regardless, to have an accurate time, we need to have low uncertainty, otherwise how do we know it is accurate?), or nanosecond level accuracy. It is worth noting that PTP in these papers either required a dedicated low-load network for time synchronization or hardware with support of some advanced features needed for PTP to work well or both. 

4) Time sync in wireless sensor networks. Murat has spent quite some time describing how the same set of techniques was used 15-20 years ago to achieve microsecond level synchronization in the wireless sensor networks. The presentation has many fascinating details, but it appears that these techniques were known and used for some time, but not used in the data center setting. What was the blocker for doing this earlier?

5) New applications of synchronized time. Finally, we discussed a lot about the possible new applications of such precise time synchronization. The paper mentioned Spanner latency improvement as one benefit, but this is an “old stuff”. Actually, for many years we, the distributed community, were (self-)taught to not rely on time for anything critical. Sure, we use the time for things like leases and timeouts, but these are all “negative” communication that happens rarely, allowing us to be very conservative with the timeouts — there is a little harm if we add a few more seconds to a lease timeout that happens upon a leader failure and needed in rare reconfiguration cases. With super-tight synchronization bounds, we can try to use the time for positive communication and convey progress instead of the lack of one. This of course is challenging, since time is not the only uncertain variable in the system. All of our other “friends”, such as network uncertainty/variability, and crashes still exist, and we also need to tame them in some way to use the time for positive, active communication. 

For example, one may use a “silent agreement” that requires no acks from the followers if everything is going well. But this quickly falls apart under faults. However, one may treat a synchronized clock as an agreement itself and use it to drive the ordering in a multi-leader system. This may also fall apart if the network is too-asynchronous, and a message from one server that already applied the operation may reach the other follower too late – after it has irreversibly applied some other higher-timestamped operation. Taming the network asynchrony may be the next big thing to allow new usages of time in distributed systems!

The network latency vs time uncertainty is very important for constructing consistent snapshots. If time uncertainty is guaranteed to be smaller than the network latency, we can use the time to construct the consistent snapshots, since we can be sure that no message that breaks the causality can reach the other side within the uncertainty period. This, for example, can be useful for debugging. In my Retroscope consistent monitoring system, I use HLC to preserve the causality when uncertainty is too large, but having software clocks like HLC unnecessarily complicate systems. 

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Trace Synchronization with HLC

Event logging or tracing is one of the most common techniques for collecting data about the software execution. For simple application running on the same machine, a trace of events timestamped with the machine’s hardware clock is typically sufficient. When the system grows and becomes distributed over multiple nodes, each node is going to produce its own independent logs or traces. Unfortunately, nodes running on different physical machines do not have access to a single global master clock to timestamp and order the events, making these individual logs unaligned in time.

For some purposes, such as debugging, engineers often need to look at these independent logs as one whole. For instance, if Alice the engineer needs need to see how some event on one node influenced the rest of the system, she will need to examine all logs and pay attention to the causal relationships between events. The time skew of different nodes and the trace misalignment caused by such skew prevents Alice from safely relying on such independently produced traces, as shown in the figure below. Instead, she need to collate the logs together and reduce the misalignment as much as possible to produce a more coherent picture of a distributed system execution.

Misaligned logs with PT
Physical time makes logs misaligned and miss the causality between events.

Of course if only the causality was important, Alice could have used Lamport’s logical clocks (LC) to help her identify some causal relationship between the events on different nodes. Alternatively, logging system can also use vector clocks (VC) to capture all of the causal relationships in the system. However, both LC and VC are disjoint from the physical time of the events in the trace, making it hard for Alice to navigate such logging system.

Using synchronized time protocols, such as NTP and PTP  will help make the traces better aligned. These protocols are not perfect and still leave out some synchronization error or uncertainty, introducing the possibility of breaking the causality when collating logs purely with them.

HLC sync

Instead of using NTP or logical time for synchronizing the event logs, I thought whether it is possible to use both at the same time with the help of Hybrid Logical Time (HLC). HLC combines a physical time, such as NTP with logical clock to keep track of causality during the period of synchronization uncertainty. Since HLC acts as a single always-increasing time in the entire distributed system, it can be used to timestamp the events in the log traces of every node. You can learn more about HLC here.

Similar to logical time, HLC does not capture full causality between the events. However, HLC conforms to the LC implication: if event A happened before event B, then HLC timestamp of A is smaller than the HLC timestamp of B. This can be written as A hb B ⇒ hlc.A < hlc.B. Obviously, we cannot use HLC timestamps to order any two events. Despite this limitation, we can still use the LC implication to give some partial information about the order of events. If an event A has an HLC timestamp greater than the event B, we can at least say that event B did not happen before A, thus either A happened before B or A is concurrent with B: hlc.A < hlc.B ⇒ A hb B ∨ A co B.

We can use this property to adjust the synchronization between the traces produced at different nodes. Let’s assume we have two nodes with some clock skew. These nodes produce logs that are not fully synchronized in time (we also assume the knowledge of a global, “ideal” time for now). The events in the log happen instantaneously, however we can rely on the machine’s clock to measure the time between events on the same node to give the “rigidity” to the logs. Each node timestamps the log events with its machine’s physical time (PT).

Unaligned logs
Logs aligned based on PT time. Ideal global time is shown in red.

In the figure above, the two logs are not synchronized in the “ideal” time, even though they appear to be in sync based on the PT of each node. Without any additional information we cannot improve on the synchrony of these logs. However, if we replace PT time with HLC, we can achieve better trace synchronization.

unaligned logs with HLC
HLC can be used instead of PT to synchronize the traces

With the addition of HLC time, we may see that when the logs are aligned by the PT only, some HLC timestamps (highlighted in yellow) appear to be out of place. In particular, this alignment does not satisfy the LC condition (A hb B ⇒ hlc.A < hlc.B), since the alignment makes event a1 to appear to happen before c2, however, hlc.c2 > hlc.a1. In order to satisfy the condition, we need to re-sync the logs such that c2 appear concurrent with a1.

HLC is used to improve log synchronization.
HLC is used to improve log synchronization.

After the adjustment, our synchronization error between two nodes is reduced. Note that we cannot synchronize the logs better and put a1 to happen before c2, since the LC implication simple does not allow us to do so.

The two node synchronization works nice and easy because the LC/HLC implication provides some guarantees about the two events, and we pick these two events from two separate nodes.  Aligning more than two logs is more challenging, as we need to perform many comparisons each involving just two events from some two nodes. The number of possible comparison we need to make grows drastically as we increase the number of traces to sync.

However, with HLC we can reduce the problem to just that of performing n-1 2-node log alignments when we need to sync logs from n nodes. HLC operates by learning of higher timestamps from the communication, thus HLC time at all nodes of the cluster tends to follow the PT time of one node with the highest PT clock.  Once, again to see this you need to understand how HLC ticks, which is explained here. Having one node that drives the entire HLC in the cluster allows us to synchronize every other log from each node independently with the log of that HLC “driver node”.

Some Testing and Limitation

I set up some synthetic tests to see if HLC can help us achieve any improvement in log synchronization on already loosely synchronized system. The benchmark generates the logs with a number of parameters I can controlled: time skew, communication latency and event probabilities. Maximum time-skew controls the time desynchronization in the simulated cluster measured in time ticks. There will be two nodes in the simulation with maximum time-skew difference between their clocks. Communication latency parameters control the latency of simulated communications in time ticks. Probability of an event parameter controls the chance of an event happening at every time tick; similarly, the probability of communication determines the chance of an outgoing message happening at the time tick.

Since the logs are generated synthetically, I have access to the ideal time synchronization between these logs, allowing me to quantify the alignment error. I calculated the alignment error as 1 – adjustment / skew for every pair of logs.

The figure below shows the error in synchronizing 5 logs as a function of the log length. The Idea is that a longer trace can allow for more opportunities to find the violations in LC/HLC condition and adjust the logs accordingly. All other parameters were kept constant with skew of 50 ticks, communication delay of 3 to 10 ticks, a chance of event as 40% per tick. I made different plots for different probability of inter-node communication. I repeated the simulation 100,000 times and computed the average error.

results 1
Log alignment error as a function of log length

We can see that in this setup the synchronization between logs was significantly improved, and the improvement happens faster when communication if more frequent. This is because communication introduces the causality between nodes, allowing me to exploit it for synchronization. As the logs grow larger the improvements diminish. This is likely due to the reducing closer to the communication delay, at which point HLC synchronization can no longer make improvements.

Inability to achieve any synchronization improvement when the communication latency is equal or greater than the maximum time skew in the system is the most significant limitation of HLC synchronization. The following graph illustrates that:

HLC synchronization limitations

Here I ran the simulation with 3 different skew levels: 10, 20 and 50 ticks. As I increase the communication latency, the error is growing as well, getting to the level of no improvement as the latency reaches the time-skew.

Some Concluding Words

I think this is a rather naïve and simple way to achieve better synchronization for distributed logging software. One can argue that NTP and PTP time synchronization methods are rather good nowadays and that it should be enough for most cases. However, computers are fast now and even in 1 ms of desynchronization many computations can be made. Even a few full round trips of network message exchange can happen in that 1 ms in good Local Area Networks.

HLC synchronization’s simplicity allows it to be implemented purely in user-level application code. There is no need to constantly run NTP protocol to keep tight time synchrony, there is no need to access or modify any of the underlying system-level code beyond just reading the clock, there is no need to access high precision clock that may slow the application down.

HLC can also be used within a single machine to synchronize on traces from different threads. Despite having access to the same clock at all threads and processes, the clock granularity may still be coarse enough to do many computations within a single tick.

HLC sync is not without the limits. Its usefulness degrades as the synchronization gets closer to the communication latency, but it can still be used as a fail-safe mechanism in cases NTP fails.

Alternatively to synchronizing logs, HLC can be used to find consistent global states and search through the distributed application’s past execution.