# One Page Summary. Aegean: Replication beyond the client-server model

This paper builds on a key observation about the operation of complex distributed applications. Namely, microservice-style applications rarely follow a simple client-server architecture, where a client makes a request and a server (or servers) responds to it. Instead, many applications use a nested approach, where clients communicate with some service, and the service itself acts as a client for one or more other nested services. This nesting presents challenges for traditional replication protocols, like primary-backup or Paxos-based RSM replication. For instance, when a service is replicated for durability, it becomes more difficult to preserve the correctness of nested requests: in case of a service failure, the information on whether the nested request was issued or returned may be lost, making it difficult to tell whether nested requests need to be reissued or otherwise recovered. The authors also claim that existing approaches, like Paxos, suffer a performance penalty when dealing with nested calls, since the replicated service needs to block and wait for nested calls to resolve. I, personally, do not buy the latter issue too much, as many existing replication solutions, even Paxos-based ones, try to take advantage of parallelism whenever possible, by either using a pipelined approach or concurrently operating on independent requests or data in different conflict domains.

To counter the problems with nested request calls and responses, Aegean proposes to use a shim layer sitting next to each replicated service. When one service creates a nested request to another service, it talks to the shim layer instead of the nested service directly. The shim layer runs at each replica of a replicated service and collects the requests coming from the caller service (it assumes each replica of the caller will send a duplicate request). The shim passes the request to the nested service replica only upon collecting copies of the request from a majority of the caller replicas, ensuring that the caller has sufficiently replicated the nested request. The replica can then process/replicate requests. Similarly, when the nested service generates the response, the shim layer broadcasts the response to all replicas of the caller service, keeping track of which caller replicas have received the response and resending it as needed. Additionally, to ensure response durability, every replica of the caller service sends an ack to the other replicas, and only acts on the response from a nested call when it itself receives a majority of such acks (including its own). This ensures that the responses to a nested call have been logged by at least a majority of replicas in the caller service. All these shim layers and the steps for ensuring response durability create a lot more message exchange in the system, which undoubtedly will impact the performance.
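To make the request-collection step concrete, here is a minimal sketch of how a shim at one replica of the nested service might count duplicate requests and release each request exactly once. This is my own illustration and not code from the paper: the class name `NestedRequestShim`, its methods, and the request ids are all hypothetical, and response broadcasting and acks are left out.

```python
from collections import defaultdict


class NestedRequestShim:
    """Hypothetical shim at one replica of the nested (callee) service."""

    def __init__(self, caller_replicas: int):
        # A majority of the caller's replicas must send the same request.
        self.quorum = caller_replicas // 2 + 1
        self.seen = defaultdict(set)  # request id -> caller replica ids seen
        self.forwarded = set()        # request ids already passed to the service

    def on_request(self, request_id: str, caller_id: int) -> bool:
        """Record one copy; return True exactly once, when a majority is reached."""
        self.seen[request_id].add(caller_id)
        if request_id in self.forwarded:
            return False  # late duplicate, the service already has the request
        if len(self.seen[request_id]) >= self.quorum:
            self.forwarded.add(request_id)
            return True
        return False


shim = NestedRequestShim(caller_replicas=3)
# Caller replica 0 sends twice (a resend), then replicas 2 and 1 send their copies.
released = [shim.on_request("req-1", caller) for caller in (0, 0, 2, 1)]
print(released)
```

Keeping a set of caller ids per request, rather than a plain counter, means that a resend from the same caller replica does not count twice toward the majority.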

Another aspect of the paper deals with speculative execution of some requests, as these also introduce problems in the context of nested microservices: speculative state may leak and get exposed to other services in the nesting chain. Aegean solves the problem of speculation by placing a barrier before the point where speculative state may become visible, and resolves the speculation by reset and replay if replicas arrived at different states.

To solve the performance issues with sequential Paxos, Aegean proposes to use a pipelined approach, which is definitely not new. For example, our Paxi from a few years back is a pipelined implementation of many consensus protocols. The authors claim that Aegean has decent performance, although I find the evaluation a bit lacking. The main comparison is against sequential (not pipelined) Paxos, and Aegean is doing well in this setting. However, even the authors admit that a large portion of the difference is due to the pipelining, raising the question of whether the performance comparison is fair in the first place.

Overall, I enjoyed reading about the problems caused by replication in a nested microservice architecture, but I am not sure I am too excited about the solution. The solution is a solid one for sure, but it appears very piecewise, with every piece specifically targeting a sub-problem, so it lacks a certain elegance (which is not a bad thing at all for a solid practical approach to a problem). The evaluation is the part that raises the most questions for me, ranging from claims that non-byzantine tolerant Paxos and PBFT have similar throughput, to picking inherently weak baselines for evaluation, like non-pipelined Paxos.

# One Page Summary: Ring Paxos

This paper (Ring Paxos: A high-throughput atomic broadcast protocol) has been out for quite some time, but it addresses a problem still relevant in many distributed consensus protocols. Ring Paxos aims to reduce the communication load in the Paxos cluster and provide better scalability. As we have shown in our SIGMOD 2019 paper, communication is a great limiting factor in scalability of Paxos-like protocols.

Ring Paxos reduces communication overheads with a twofold approach. First, it uses ip-multicast to substitute direct node-to-node communication wherever possible with a broadcast type of communication. Second, Ring Paxos overlays a ring topology over (parts of) the Paxos cluster to control the message flow and prevent communication bottlenecks from forming.

Ring Paxos operates very similarly to regular Paxos, with differences being mainly in the communication part of the protocol. One node acts as a designated coordinator that receives proposals from the clients. However, the coordinator must confirm itself as a valid coordinator using phase-1 of Paxos. To that end, the coordinator uses ip-multicast to send a message with some ballot/round number to all acceptors. This message also contains the ring configuration, including the node designated as the beginning of the ring. The acceptors receive the message, compare the ballot with their knowledge, and only accept the node as the new coordinator if the received ballot is the highest an acceptor has seen so far. Each acceptor replies independently to the coordinator (not shown in the figure), and with these replies the coordinator learns whether it has succeeded. Additionally, the coordinator also learns of any unfinished/uncommitted values that must be recovered.

Upon successfully getting a quorum of confirmations, the coordinator moves on to phase-2 of Paxos, illustrated in the figure on the right. During this phase, the coordinator replicates the commands/log entries to the acceptors. Similarly to phase-1, the coordinator uses ip-multicast to send this message to the acceptors (message #2 in the figure). The acceptors (and learners) get the value/command to be replicated, but that value is not committed just yet. Along with the value, acceptors also receive the value-id (c-vid), a unique identifier for the value, and they set their working value-id (v-vid) to c-vid. The acceptors do not reply individually to the coordinator. Instead, the first acceptor in the ring sends a reply containing the c-vid to its successor (message #3 in the figure). The acceptor receiving such a reply compares its v-vid with the c-vid from the reply message. The two value-ids match when the acceptor is not aware of any other value or coordinator acting concurrently. In this case, it forwards the message further down the chain. This chain forwarding in the ring continues until the reply reaches the coordinator (message #4), which sits at the tail of the chain. The protocol terminates the message propagation across the ring when the acceptor’s v-vid and the c-vid from the reply message are different, leaving the coordinator to wait for a timeout and retry the protocol from the beginning with a higher ballot. Node failures produce a similar outcome, since a failure completely hinders message propagation across the ring. As a result, the new phase-1 of Paxos must include a different ring configuration to try avoiding the failed node. When the coordinator receives the reply from the ring/chain, it sends the commit message to all acceptors and learners with ip-multicast (message #5). The protocol may be extended to run multiple slots in parallel by ensuring the c-vid and v-vid comparisons happen within the same slot.
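The ring-forwarding rule in phase-2 is simple enough to sketch in a few lines. The function below is only my illustration of the comparison described above, not code from the paper: the name `reply_reaches_coordinator` is hypothetical, acceptors are modeled as a list of their v-vids in ring order, and a crashed acceptor is modeled as `None`.

```python
from typing import Optional, Sequence


def reply_reaches_coordinator(acceptor_vvids: Sequence[Optional[int]], c_vid: int) -> bool:
    """Return True if the reply carrying c_vid travels the whole ring.

    acceptor_vvids holds each acceptor's working value-id in ring order,
    with the coordinator sitting right after the last entry.
    None stands for a crashed acceptor (an assumption of this sketch).
    """
    for v_vid in acceptor_vvids:
        if v_vid is None or v_vid != c_vid:
            # Propagation stops here; the coordinator will time out and
            # retry from phase-1 with a higher ballot.
            return False
    return True


print(reply_reaches_coordinator([7, 7, 7], c_vid=7))     # all acceptors agree
print(reply_reaches_coordinator([7, 9, 7], c_vid=7))     # a concurrent value was seen
print(reply_reaches_coordinator([7, None, 7], c_vid=7))  # a node in the ring failed
```

Both a conflicting value and a crashed node look the same to the coordinator in this model: the reply never arrives, which matches the timeout-and-retry behavior described above.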

The performance of Ring Paxos is better than that of standard Paxos implementations, such as Libpaxos (uses ip-multicast) and Paxos4sb (unicast). It appears that Ring Paxos is capable of saturating most of the network bandwidth, with only the LCR protocol pushing a bit more throughput.

# Paper Summary: Bolt-On Global Consistency for the Cloud

This paper appeared in SOCC 2018, but caught my Paxos attention only recently. The premise of the paper is to provide strong consistency in a heterogeneous storage system spanning multiple cloud providers and storage platforms. Going across cloud providers is challenging, since storage services at different clouds cannot directly talk to each other and replicate the data with strong consistency. The benefits of spanning multiple clouds, however, may be worth the hassle, since a heterogeneous system will be both better protected from cloud provider outages and provide better performance by placing the data closer to the users. The latter aspect is emphasized in the paper, and as seen in the figure, going multi-cloud can reduce latency by up to ~25%.

To solve the issue of consistent cross-cloud replication, the authors propose to use Cloud Paxos (CPaxos), a Paxos variant designed to work with followers supporting a very minimal and common set of operations: get and conditional put. In CPaxos, clients act as proposers, and storage systems serve the role of the followers. The followers are not really “smart” in this protocol, and most of the Paxos logic shifts to the client-proposers (Figure 2).

The prepare phase in CPaxos simply gathers the state from the followers, making the proposer decide for itself whether the followers would have accepted it with the current ballot or not. If the proposer thinks it would have been accepted, it will try updating the followers’ state. Doing this, however, requires some precautions from the followers, since their state may have changed after the proposer made a decision to proceed. For that matter, CPaxos uses conditional put (or compare-and-set) operation, making the followers update their state only if it has not changed since it was read by the proposer. This ensures that at most one proposer can succeed with changing the state of the majority of followers.
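A tiny sketch may help show why the conditional put is enough to let only one proposer win. This is my own simplification and not the CPaxos implementation: the `Follower` class, the `(version, ballot, value)` state tuple, and the `prepare`/`accept` helpers are hypothetical names, and a real storage service would perform the compare-and-write atomically on its side.

```python
class Follower:
    """Hypothetical storage node exposing only get and conditional put."""

    def __init__(self):
        self.state = (0, 0, None)  # (version, ballot, value)

    def get(self):
        return self.state

    def conditional_put(self, expected, new) -> bool:
        # Update only if nothing changed since the proposer read the state.
        if self.state != expected:
            return False
        self.state = new
        return True


def prepare(followers):
    """Phase 1 in this sketch: just read the state of every follower."""
    return [f.get() for f in followers]


def accept(followers, snapshots, new_state) -> bool:
    """Phase 2: conditional put against the state read earlier; needs a majority."""
    acks = sum(f.conditional_put(snap, new_state)
               for f, snap in zip(followers, snapshots))
    return acks >= len(followers) // 2 + 1


followers = [Follower() for _ in range(3)]
snap_a = prepare(followers)
snap_b = prepare(followers)  # proposer B reads the same state concurrently
a_ok = accept(followers, snap_a, (1, 1, "A"))
b_ok = accept(followers, snap_b, (1, 2, "B"))  # stale snapshots, every put fails
print(a_ok, b_ok)
```

Proposer B would now have to start over: read again, see A's write, and decide whether to proceed with a higher ballot.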

I visualize this as a log to represent changes in some object’s state. The new version of an object corresponds to a new slot in the log, while each slot can be tried with a different ballot by different proposers. The put operation succeeds at the follower only if the slot has not already been written with that ballot by some other proposer. In case a proposer does not get a majority of successful updates, it needs to start from the beginning: increase its ballot, perform a read and make a decision whether to proceed with the state update. Upon reaching the majority of acks on the state update, the proposer sends a message to flip the commit bit to make sure each follower knows the global state of the operation.

This basic protocol has quite a few problems with performance. Latency is large, since every proposer needs to run 2 phases (+ send a commit message), so at least 2 round-trips are required to reach consensus. Additionally, increasing the number of proposers acting on the same objects leads to more conflicts, requiring repeated restarts and further increasing latency. CPaxos mitigates these problems to a degree. For example, it tries to commit values on the fast path by avoiding the prepare phase entirely and starting an accept phase on what it believes will be the next version of an object with ballot #0. If the proposer’s knowledge of the object’s state (version, ballot) is outdated, the conditional put will fail and the proposer will try again, but this time with the full two phases to learn the correct state first. However, if the proposer is lucky, an update can go through in just one round-trip. This optimization, of course, works only when an object is rarely updated concurrently by multiple proposers; otherwise dueling leaders become a problem not only for progress, but for safety as well, since two proposers may write different values for the same version using the same ballot. This creates a bit of a conundrum on when the value becomes safely anchored and won’t ever get lost.

Consider an example in which two proposers write different values, green and blue, to the same version using ballot #0 (Figure 4 on the left). One of the proposers is able to write to the majority before it becomes unresponsive. At the same time, one green follower crashes as well, leading to a situation with two followers having the green value and two having the blue one (Figure 5 on the right). The remaining proposer has no knowledge of whether the green or blue value needs to be recovered (remember, they are both on the same ballot in the same slot/version!). To avoid this situation, CPaxos expands the fast path commit quorum from a majority to a supermajority, namely $$\left\lceil \frac{3f}{2} \right\rceil + 1$$ followers, where $$2f+1$$ is the total number of followers and $$f$$ is the tolerated number of follower failures, allowing the anchored/committed value to be in a majority of any majority of followers. Having this creates an interesting imbalance in fault tolerance: while CPaxos still tolerates $$f$$ node failures and can make progress by degrading to the full 2 phases of the protocol, it can lose an uncommitted value even if it was accepted by the majority when up to $$f$$ followers fail.
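To see what the supermajority buys, it helps to plug in a few numbers. The short script below is my own arithmetic check, not something from the paper (the function name `quorum_sizes` is made up): it computes the fast-path quorum for a given $$f$$ and the smallest possible overlap between a fast-path quorum and an arbitrary majority of followers.

```python
def quorum_sizes(f: int):
    """Quorum arithmetic for 2f+1 followers tolerating f failures."""
    n = 2 * f + 1
    majority = f + 1
    fast = (3 * f + 1) // 2 + 1           # ceil(3f/2) + 1 in integer math
    # Pigeonhole: two sets of these sizes must share at least this many nodes.
    min_overlap = fast + majority - n
    majority_of_majority = majority // 2 + 1
    return n, majority, fast, min_overlap, majority_of_majority


for f in (1, 2, 3):
    n, majority, fast, overlap, needed = quorum_sizes(f)
    print(f"f={f}: n={n}, majority={majority}, fast quorum={fast}, "
          f"overlap with any majority>={overlap}, majority of a majority={needed}")
```

For the five-follower example above ($$f=2$$), the fast path needs 4 followers, so any 3 followers a recovering proposer manages to reach contain at least 2 copies of the fast-path value, which is a majority of those 3 and removes the green-versus-blue tie.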

Proposer conflicts are a big problem for CPaxos, so naturally the protocol tries to mitigate them. The approach taken here reduces the duration in which possible conflicts may occur. As CPaxos is deployed over many datacenters, the latencies between datacenters are not likely to be uniform. This means that prepare or accept messages from some proposer reach different datacenters at different times, creating an inconsistent state. When two proposers operate concurrently, they are more likely to observe this inconsistency: as both proposers quickly update their neighboring datacenters, they run the risk of not reaching the required supermajority due to the conflicting state (Figure 6(a)) created by some messages being slower to reach the remaining datacenters. To avoid rejecting both proposers, CPaxos schedules sending messages in a way that delivers them to all datacenters at roughly the same time. This reduces the duration of the inconsistent state, making it possible to order some concurrent operations (Figure 6(b)).

Despite the above mitigation strategy, conflicts still affect CPaxos greatly. The authors are rather open about this, and show their system CRIC with CPaxos degrading quicker than Paxos and Fast Paxos as the conflict rate increases. However, in the low conflict scenario, which the authors argue is more likely in real world applications, CRIC and CPaxos improve on performance compared to Paxos/Fast Paxos, especially for reading the data. This is because reads in CPaxos are carried out in one round-trip-time (RTT) by the client-proposer contacting all followers and waiting for at least a majority of them to reply. If the client sees the latest version with a commit flag set in the majority, it can return the data. Otherwise, it will wait to hear from more followers and use their logs to determine the safe value to return. In some rare cases when the proposer cannot determine the latest safe value, it will perform the recovery by running the write path of CPaxos with the value to recover (the highest ballot value, or the highest frequency value if more than one value shares the ballot).

Some Thoughts

• The motivation of the paper was to make a strongly consistent system spanning multiple cloud providers and storage systems for the benefit of improved latency through leveraging the location of datacenters of these different providers. However, the CRIC and CPaxos protocol requires a lot of communication, even on the read path. During reads, a client-proposer contacts all CPaxos nodes, located at all datacenters, and in the best case still needs replies from a majority. As such, the latency benefit here comes from trying to get not just one node closer to the client, but a majority of nodes. This may be difficult to achieve in large systems spanning many datacenters. I think sharding the system and placing each shard on a subset of nodes based on access locality can benefit here greatly. For instance, Facebook’s Akkio paper claims to have a significant reduction in traffic and storage by having fewer replicas and making data follow access patterns. In our recent paper, we have also illustrated a few very simple data migration policies and the possible latency improvement from implementing these policies.
• One RTT reads in the “happy path” can be implemented on top of regular MultiPaxos without contacting all nodes in the system. Reading from the majority of followers is good enough for this most of the time, while in rare circumstances the reader may need to retry the read from any one node. More on this will be in our upcoming HotStorage ’19 paper.
• The optimization to delay message sending in order to deliver messages at roughly the same time to all nodes can help with conflict reduction in other protocols that suffer from this problem. EPaxos comes to mind right away, as it is affected by the “dueling leaders” problem as well. Actually, CPaxos and EPaxos are rather similar. Both assume low conflict rate to have single round trip “happy path” writes and reads. When the assumption breaks, and there is a conflict, both switch to two phases. EPaxos is better here in a sense that the first opportunistic phase is not totally wasted and can be used as phase-1 in the two phase mode, whereas CPaxos has to start all the way from the beginning due to the API limitation on the follower side.

# Keep The Data Where You Use It

As trivial as it sounds, keeping the data close to where it is consumed can drastically improve the performance of large globe-spanning cloud applications, such as social networks, ecommerce and IoT. These applications rely on some database systems to make sure that all the data can be accessed quickly. The de facto method of keeping the data close to the users is full replication. Many fully replicated systems, however, still have a single region responsible for orchestrating the writes, making the data available locally only for reads and not for updates.

Moreover, full replication becomes rather challenging when strong consistency is desired, since the cost of synchronizing all database replicas on the global scale is large. Many strongly consistent datastores resort to partial replication across a handful of nearby regions to keep the replication latency low. This creates situations when some clients close to the regions in which data is replicated may experience much better access latency than someone reaching out from the other side of the globe.

Despite the obvious benefits of adapting to locality changes, many databases offer only static partitioning. Of course, some data stores have the migration capability, but still often lack the mechanisms to determine where the data must be moved. Quite a few orthogonal solutions provide capabilities to collocate related data close together or use days or weeks’ worth of logs to compute better data placement offline. Meanwhile, Facebook’s aggressive data migration helps them reduce the access latency by 50% and save on both storage and networking.

We (@AlekseyCharapko, @AAilijiang and @MuratDemirbas) investigated the criteria for quick, live data migration in response to changes in access locality. We posit that effective data-migration policies should:

• Minimize access latency,
• Preserve load balancing with regards to data storage and processing capacity,
• Preserve collocation of related data, and
• Minimize the number of data migrations.

## Policies

We developed four simple migration policies aimed at optimizing the data placement. Our policies operate at an arbitrary data granularity, be it individual key-value pairs, micro-shards, or partitions. For simplicity, I say that policies work on objects that represent some data of an arbitrary granularity.

The main point we address with the policies is minimizing access latency, with each policy using a different heuristic to make a data-placement decision. Once the policy finds the best location for an object, it checks the load balancing constraints to adjust the data migration decision as required.

Our simplest policy, the n-consecutive accesses policy, uses a threshold of consecutive accesses to the object from the same region to make the placement decision. Although simple, this policy works well for workloads with strong locality in a single region. The majority accesses policy keeps track of some request statistics and uses them to find the region with the most accesses to an object over some time interval. It then migrates the data over to that region.
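Both of these single-region heuristics fit in a few lines. The sketch below is a simplified illustration rather than our simulation code: the names `NConsecutivePolicy` and `majority_region` and the region labels are made up for the example, and the load-balancing check is omitted.

```python
from collections import Counter


class NConsecutivePolicy:
    """Move the object after n accesses in a row from the same remote region."""

    def __init__(self, n: int, home: str):
        self.n = n
        self.home = home      # region currently holding the object
        self.last = None      # region of the previous access
        self.streak = 0       # length of the current run of same-region accesses

    def on_access(self, region: str) -> str:
        if region == self.last:
            self.streak += 1
        else:
            self.last, self.streak = region, 1
        if self.streak >= self.n and region != self.home:
            self.home = region  # migrate the object
        return self.home


def majority_region(accesses):
    """Region with the most accesses in the observed interval."""
    return Counter(accesses).most_common(1)[0][0]


policy = NConsecutivePolicy(n=3, home="us-east")
trace = ["eu", "eu", "us-east", "eu", "eu", "eu"]
homes = [policy.on_access(region) for region in trace]
print(homes)
print(majority_region(trace))
```

In this trace the single access from `us-east` resets the streak, so the object moves only after the last three accesses from `eu`, while the majority rule picks `eu` from the interval as a whole.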

The exponential moving average (EMA) policy takes a different approach and computes the average region for all requests to the object. The average region is computed as an exponential moving average favoring the most recent requests. This policy can potentially find better placement for objects that have more than one high-access region. However, it requires the regions to have numerical IDs arranged in the order of region’s proximity to each other. This policy falters for deployments with complicated geography and may require multiple migrations to move data to the best location. Another disadvantage of EMA is that it takes longer to settle and requires many data migrations. Unlike other policies that can move the data directly to the desired region, EMA can only migrate objects to one of the neighboring regions, making adjustment such as going from region (1) to (3) include a temporary migration to region (2).
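Here is a rough sketch of the EMA idea. It is an illustration under my own assumptions, not the exact policy from our simulations: the class name `EmaPolicy`, the smoothing factor `alpha`, and the explicit one-hop-at-a-time step are choices I made to mirror the description above, with regions given integer IDs ordered by proximity.

```python
class EmaPolicy:
    """Track an exponentially weighted 'average region' for one object."""

    def __init__(self, home: int, alpha: float = 0.6):
        self.home = home          # integer ID of the region holding the object
        self.avg = float(home)    # running average of accessing region IDs
        self.alpha = alpha        # weight of the most recent request (assumed)

    def on_access(self, region: int) -> int:
        self.avg = self.alpha * region + (1 - self.alpha) * self.avg
        target = round(self.avg)
        # Assumption of this sketch: move at most one neighboring region per step.
        if target > self.home:
            self.home += 1
        elif target < self.home:
            self.home -= 1
        return self.home


policy = EmaPolicy(home=1)
homes = [policy.on_access(3) for _ in range(3)]
print(homes)
```

With all requests coming from region 3, the object first lands in region 2 and only then reaches region 3, which is the extra migration mentioned above. The whole policy state is a single number per object, which is what made EMA attractive in the first place.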

Finally, the center-of-gravity (CoG) policy calculates the optimal object placement by taking into account the distribution of all requests to an object and the distances between the datacenters. CoG policy calculates the region closest to the central location for any access locality workloads. CoG can collect the request statistics similar to the majority accesses policy and make a placement decision only after some time has elapsed since last decision. Alternatively, it can use a continuous metric to assign each region a score corresponding to its weight in the workload, adjust the score and recompute the best object placement with every request.
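One simple way to read the CoG policy is as picking the region that minimizes the request-weighted latency to all accessing regions. The sketch below takes that reading; it is a simplification and not necessarily how our simulation computes it. The function name `center_of_gravity` and the four-region latency table are hypothetical, and the latencies are invented round numbers rather than AWS measurements.

```python
def center_of_gravity(access_counts, latency):
    """Return the region with the lowest request-weighted latency.

    access_counts: {source region: number of requests}
    latency: {region: {region: latency in ms}} for every pair of regions
    """
    def cost(candidate):
        return sum(count * latency[source][candidate]
                   for source, count in access_counts.items())

    return min(latency, key=cost)


# Hypothetical topology: A, B and C are far from each other, D sits in between.
latency = {
    "A": {"A": 0, "B": 100, "C": 100, "D": 55},
    "B": {"A": 100, "B": 0, "C": 100, "D": 55},
    "C": {"A": 100, "B": 100, "C": 0, "D": 55},
    "D": {"A": 55, "B": 55, "C": 55, "D": 0},
}

shared = center_of_gravity({"A": 10, "B": 10, "C": 10}, latency)
skewed = center_of_gravity({"A": 28, "B": 1, "C": 1}, latency)
print(shared, skewed)
```

When A, B and C share the object equally, the policy places it in D even though D issues no requests at all, while a heavily skewed workload keeps the object in the dominant region.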

## Some Evaluation

I’ve simulated the policies under different access locality scenarios and calculated the latency of inter-region access and the number of object movements each policy makes. In the simulations, I used 3000 distinct objects, initially assigned to a random region in the cluster of 15 regions. I used the AWS inter-region latencies to specify the distances between simulated regions. To my surprise, even the most basic policies showed good improvement over static random placement of data.

In the first experiment, the objects were accessed according to a normal distribution. Each object has an ID assigned to it, and a normal distribution dictates the probability of each region drawing a given ID. All regions have distributions with the same variance, but different means, making each region predominantly access some of the objects, while some groups of objects are more-or-less shared across the regions with adjacent IDs.

In this experiment, both CoG and the majority accesses policy showed the best results in terms of latency and the number of object movements. This is because the workload almost always favors a single region, and in rarer cases shares the object between two regions. This makes the majority heuristic, which only considers one region, work well. Similarly, the 3-consecutive accesses policy shows good latency, but it generates a lot of jitter by constantly moving shared objects between neighboring regions.

When the workload is no longer dominated by a single region for every key, single-region heuristic policies perform worse. For instance, equally sharing an object between at most 3 regions out of 15 causes the majority and 3-consecutive accesses policies to lock in to one of the sharing regions instead of optimizing the latency for all sharing regions. The CoG policy can place the data in a region optimal for all 3 regions (and not even necessarily in one of the sharing regions) and optimize the latency better than the single-region, topology-unaware policies. The EMA policy is at a big disadvantage here, since it relies on ID assignments to dictate the proximity of regions. However, the complex geography of AWS datacenters makes a good ID assignment nearly impossible, causing EMA to sometimes overshoot the best region and settle in a less optimal one.

Previous experiments did not consider the effect of load balancing. However, a good data-migration policy should refrain from migrating data to overloaded regions. In the next experiment I applied a load-balancing filter to the CoG policy to make the migration procedure first compute the best region for the object, check if that region has the capacity, and if no capacity is available, move the data to the next best region with enough processing/storage capacity. Here I used 5 regions and 1000 objects, and limited each region to storing at most 25% of all data. I ran a heavily skewed workload with 80% of all requests coming from a single region. Under these conditions the CoG policy without the filter achieves very low average latency. However, as evidenced by the imbalance graph, all objects migrate over to a single region. If load balancing is enabled, no region becomes overloaded, but the latency improvement becomes more modest.
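The two-step "best region first, then capacity check" logic can be sketched as follows. This is a toy model under strong assumptions and not the simulation itself: the names `place_with_capacity` and `simulate` are made up, every object is assumed to prefer the regions in the same order (region 0 first), and only storage counts are tracked.

```python
def place_with_capacity(ranked_regions, load, capacity):
    """Put one object in the best-ranked region that still has room."""
    for region in ranked_regions:
        if load[region] < capacity:
            load[region] += 1
            return region
    raise RuntimeError("no region has spare capacity")


def simulate(num_objects=1000, num_regions=5, cap_fraction=0.25):
    """Place objects one by one, first come first served."""
    capacity = int(num_objects * cap_fraction)
    load = [0] * num_regions
    ranking = list(range(num_regions))  # assumed: the same preference for all objects
    for _ in range(num_objects):
        place_with_capacity(ranking, load, capacity)
    return load


balanced = simulate()                    # 25% cap per region
unbalanced = simulate(cap_fraction=1.0)  # effectively no cap
print(balanced, unbalanced)
```

Without the cap everything piles into the preferred region; with the 25% cap the overflow spills to the next best regions, which is where the more modest latency improvement comes from.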

## Concluding Remarks

Having data close to the consumers can dramatically improve the access latency. For some databases, this means doing full replication; for others this may involve moving data or the owner/write role from one region to another. It is important to make sure the data is moved to the right location. I have looked at four simple rules or policies for determining the data migration and ran some simulations on these.

There are a few lessons I have learned so far from this:

• Topology aware rules/policies work better for a larger variety of situations
• Simple rules, such as just looking at the number of consecutive requests coming from a region or determining the majority accesses region, can also work surprisingly well, but not always. These tend to break when access locality is not concentrated in a single region, but shared across a few regions in the cluster
• EMA looked interesting on paper. It needs just a single number updated with every request to determine the optimal data placement, but it performed rather badly in most experiments. The main reason for this is the complicated geography of datacenters.
• Optimizing for latency and adjusting for load balancing constraints to prevent region overload can be done in two separate steps. My simple two-stage policy (presently) looks at load balancing for each object separately. This becomes a first-come-first-serve system, but I am not sure yet whether this can become a problem.
• The EMA policy takes multiple hops to move data to a better region, while the n-consecutive accesses policy has constant jitter for objects shared by some regions

I have not studied data collocation much in my experiments, nor designed the policies to take it into consideration. One of the reasons is that I think related objects will have similar access locality, causing them to migrate to the same datacenters. However, this is just a guess, and I need to investigate this further.

# Looking at State and Operational Consistency

Recently I rediscovered the “The many faces of consistency” paper by Marcos Aguilera and Doug Terry. When I first read the paper two years ago, I largely dismissed it as trivial, and, oh boy, now I realize how wrong I was at that time. It is easy to read for sure, and may appear as some summary of various consistency models at first, but it is thought provoking and really makes you ask more questions and draw interesting parallels after giving it some quality time.

Murat gave a good summary of this paper recently in relation to his sabbatical. The questions he asks after the summary, however, provoke even more thoughts about consistency and how we classify, categorize and view it.

In a nutshell, the paper talks about consistency from different perspectives, namely state consistency as observed by a system itself and operational consistency that clients see. State consistency involves enforcing a system-state to hold some invariants. State consistency promotes invariant-based reasoning. The operational consistency is different, as it looks at the system from the client point of view. Outside clients do not directly observe the state of the system; instead they perform operations against it and can only see the results of these operations. So in short, state consistency is invariant-based and concerns the internal state of the system. Operational consistency deals with what clients observe from the outside of the system. These operational consistencies include various types of sequential equivalence, such as linearizability and serializability, and other client-centric guarantees, like read-your-write or bounded-staleness.

For more details, read the original paper, Murat’s summary or one from the morning paper.

## “Strength” of state consistency.

What strikes me right away is how differently the two types of consistency are described. Operational consistency gives us the framework for reasoning about systems without having too many details about the internals. We can gauge the relative “strength” of different consistency models and put them in perspective against each other. We know that serializability is weaker than session serializability. Or that linearizability is stronger than sequential equivalence.

But what about state consistency? There is no such reasoning framework. And in fact, it is not easy to even classify state consistency, let alone reason about the “strength” of different state consistency classes. The paper mentions a few state consistency models, such as referential integrity for databases, or mutual consistency in primary-backup systems, or error bounds. But these are not generally applicable across the board. These examples of state consistency operate within the constraints of their specific domains or applications.

However, state consistency still comes at different “strength” levels. When reasoning about state consistency, we use an invariant-based approach. The invariants on the state we need to enforce for an eventually consistent data-store and a strongly-consistent one are different. In the former case, we can be more relaxed, since we only need to make sure that in at least one future state the different nodes of the store have the same data. The latter case is more complicated, as we need to hold stricter invariants (e.g., no two alive nodes have different committed values for the same slot in the log, there can be only one active leader, the leader must process the commands in the receive order, etc.).

It is the “tightness” of the invariants that makes state consistency strong or weak. But deciding which invariant is tighter or stricter is hard. We cannot simply compare various invariants on the merits of when they should hold, or how many parameters they cover or how many nodes they span, as all these (and other) metrics mean something only in the context of their systems/problems. An invariant that must hold at every state is not necessarily tighter than an eventual one, as it may simply be an invariant against some irrelevant or trivial parameter that holds all the time anyway and has no impact on the system.

And this is why we often translate these invariants and state consistency they represent into the operational consistency. The operational side of things allows us to observe the impact of the invariants on the system, albeit indirectly. Operational consistency levels the playing field and enables the comparison from the external point of view. It allows us to gauge how otherwise hard-to-compare invariants at the state-consistency level affect the outcome of operations.

## Smart systems or smart clients

Does this mean that a system providing stronger operational consistency has stronger state consistency? Well, it would have been too simple if that was the case. It often happens, and more so recently, that systems have “misalignments” between their internal state consistency and the operational one exposed on the client side.

A system that provides stronger operational semantics may do so because it has a strong state that makes it easy to expose the strong operational consistency. These smart systems preserve strong state-consistency at all costs. They may need to run complicated algorithms (e.g. Paxos) to achieve that, but doing so makes the clients lean and simple with minimal or no state at all.

On the other hand, simple systems may forgo the complicated protocols needed to enforce a strong state. Instead they aim to run as lean as possible at the core system level and shift as much burden to the smart clients. These clients need to have more complex state and protocols if they are to provide stronger operational guarantees. There are systems that do exactly that.

Both “smart system – lean client” and “lean system – smart client” approaches have their advantages and drawbacks. Designing and maintaining a smart system may actually be simpler: all the things engineers need are readily available at the system level. Invariant-based reasoning and tools like TLA easily apply in this setting. Debugging is simpler too, since lean and stateless clients are likely not the cause of a problem, and internal logging can help collect all the necessary information. On the other hand, having a lean system may improve the performance by reducing the bottlenecks, spreading load more evenly across the nodes and even sharing the load with smart clients. But it comes at some engineering costs. Protocols now involve more state and the state is even more distributed: both at the lean system nodes and smart clients. Modeling this state with TLA is still possible, but it will likely take more time to check and require more complicated models that include clients and client interactions with the system nodes. Debugging may be slowed down due to the lack of necessary data, since many issues (especially in production) may happen at smart clients outside of the engineers’ reach.

State consistency is an interesting beast. It does not give us the same mental reasoning framework as operational consistency. We, the distributed systems people, often think about consistency in operational terms. It is easy to understand why, since operational consistency allows for comparison between systems or protocols. But then we, the distributed systems people, also think in terms of state consistency. We model our systems at the state level, trying to come up with good invariants, to see what should always hold, or how a system needs to converge to certain states.

But now we understand that there is no clear path from strong state to strong operational consistency. Strong state makes it easier to build operationally strong systems, but it is not a requirement. In fact, ZooKeeper, for example, despite having a Paxos-like protocol at its heart, is not all that operationally strong. And some systems, like TAPIR or OCCULT, may have weaker state, but with clever engineering and smart clients can provide stronger operational semantics. The world of distributed systems is not black-and-white. There is a lot of gray in between.

# Python, Numpy and a Programmer Error: Story of a Bizarre Bug

While recently working on my performance analysis for Paxos-style protocols, I uncovered some weird quirks about python and numpy. Ultimately, the problem was with my code, however the symptoms of the issue looked extremely bizarre at first.

Modeling WPaxos required doing a series of computations with numpy. In each step, I used numpy to do some computations with arrays. Normally, I would initialize a new array and set the values by doing some calculations on the data from previous steps. However, in one step I used a newly initialized array to perform some additions with another numpy array. Of course, by mistake I initialized the new array with numpy.empty() instead of numpy.zeros(), causing the new array to potentially contain some garbage values that may screw up the entire computation. Obviously, I did not know I made this mistake.

However, most of the time, this new array had all values set to zero, so I consistently observed the correct results. That is until I added a simple print statement (something like print some_array) on some array to check on the intermediate computation in the model. Printing the numpy array caused the entire calculation to screw up, leaving me with a big mystery: how could a simple python print statement, which should have no side-effects, change the results of subsequent computations?

I won’t lie, I was mesmerized by this for hours: I remove the print statement and everything works, I add it back and the entire model breaks. Consistently. Even after a reboot. What is even weirder, the bad results I observed were consistently the same, reproducible run after run after run.

And in such consistent failures, I observed a pattern. One computation step was always skewed by the same value, the value of the array I was printing, as if that array was added to the newly initialized array in the skewed step. And this is when I noticed that I used numpy.empty() instead of numpy.zeros(). After a simple fix, the outcome was the same regardless of whether I printed the results of the intermediate steps or not.

In the end, it was a programmer’s error, but the bizarre symptoms kept me away from the solution for way too long before I uncovered the truth. I am not an expert on the internals of python and numpy, but I do have some clue as to what might have happened. I think the print statement created some kind of temporary array, and this temporary array got destroyed after the print. (Alternatively, something else created a temporary array, and the print statement just shifted the address of memory allocations.) The next computation then allocated space for a new array, having the same dimensions, in the exact same spot as the old temporary one. And this newly created array then had garbage values, containing the outputs of the previous step.

The interesting part is how consistent this was for hundreds of tries, producing the same failed output. How consistent did the memory allocations have to be in every run? And of course, many may not even think about the possibility of having such dirty memory problems in languages such as python. Numpy, however, is written in C, and it clearly brings some of C’s quirks to python with it, so read the documentation.
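Here is a minimal sketch of the mistake and its fix, assuming a simple accumulation step. The array names and values are made up for illustration and are not taken from the actual WPaxos model. `numpy.empty()` only allocates memory and leaves whatever bytes were already there, so its contents are unspecified, while `numpy.zeros()` allocates and clears.

```python
import numpy as np

step_output = np.array([1.5, 2.5, 3.5])  # hypothetical result of an earlier step

# Buggy pattern: np.empty() returns uninitialized memory, so adding to it
# may silently include leftover values from a previously freed array.
# acc = np.empty(3)
# acc += step_output   # result is unspecified

# Fixed pattern: start the accumulator from known zeros.
acc = np.zeros(3)
acc += step_output
```

The buggy variant is left commented out on purpose: whether it appears to work depends on what the allocator hands back, which is exactly why the bug could hide for so long.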

# One Page Summary: “PaxosStore: High-availability Storage Made Practical in WeChat”

The PaxosStore paper, published in VLDB 2017, describes the large scale, multi-datacenter storage system used in WeChat. As the name may suggest, it uses Paxos to provide storage consistency. The system claims to provide storage for many components of the WeChat application, with 1.5TB of traffic per day and tens of thousands of queries per second during the peak hours.

PaxosStore relies on the Paxos protocol for consistency and replication within tight geographical regions. The system was designed with a great separation of concerns in mind. At a high level, it has three distinct layers interacting with each other: API layer, consensus layer, and storage. Separating these out allowed PaxosStore to provide the most suitable APIs and storage for different tasks and applications, while still having the same Paxos-backed consistency and replication.

In a paxos-driven consensus layer, the system uses a per-object log to keep track of values and paxos-related metadata, such as promise (epoch) and proposal (slot) numbers. The log’s implementation, however, seems to be somewhat decoupled from the core Paxos protocol. The Paxos implementation is leaderless, meaning there is no single dedicated leader for each object, and every node can perform writes on any of the objects in the cluster by running prepare and accept phases of Paxos. Naturally, the system tries to perform (most) writes in one round trip instead of two by assuming some write locality. After the first successful write, a node can issue more writes with increasing proposal (slot) numbers. If some other node performs a write, it needs to have a higher ballot, preventing the old master from doing quick writes. This is a rather common approach, used in many Paxos variants.

PaxosStore runs in multiple datacenters, but it is not a full-fledged geo-replicated system, as it only replicates between the datacenters located in the same geographical area. The paper is not clear on how data get assigned to regions and whether objects can migrate between regions in any way. Within each datacenter the system organizes nodes into mini-clusters, with each mini-cluster acting as a Paxos follower. When data is replicated between mini-clusters, only one (some?) nodes in each mini-cluster hold the data. This design aims to improve fault tolerance: with a 2-node mini-cluster, failure of 1 node does not result in the failure of the entire mini-cluster Paxos-follower.

The paper somewhat lacks in its evaluation, but PaxosStore seems to handle its goal of multi-datacenter, same-region replication fairly well, achieving sub-10 ms writes.

This paper seems like a good solution for a reliable and somewhat localized data-store. The authors do not address data sharding and migration between regions and focus only on the intra-region replication to multiple datacenters, which makes me think PaxosStore is not really a “global”, geo-replicated database. The fault tolerance is backed by Paxos, mini-clusters and the usage of PaxosLog for data recovery. The evaluation could have been more complete if the authors showed scalability limits of their system and provided some details about throughput and datacenter-locality of the workload in the latency experiments.

Here is one page pdf of this summary.

# Modeling Paxos Performance in Wide Area – Part 3

Earlier I looked at modeling paxos performance in local networks, however nowadays people (companies) use paxos and its flavors in the wide area as well. Take Google Spanner and CockroachDB as an example. I was naturally curious to expand my performance model into wide area networks as well. Since our lab worked on WAN coordination for quite some time, I knew what to expect from it, but nevertheless I got a few small surprises along the way.

In this post I will look at Paxos over WAN, EPaxos and our wPaxos protocols. I am going to skip most of the explanation of how I arrived at the models, since the models I used are very similar in spirit to the one I created for looking at local area performance. They all rely on queuing theory approximations for processing overheads and k-order statistics for the impact of quorum size.

Despite being similar in methods used, modeling protocols designed for WAN operation proved to be more difficult than local area models. This difficulty arises mainly from the myriad of additional parameters I need to account for. For instance, for Paxos in WAN I need to look at latencies between each node in the cluster, since the WAN-networks are not really uniform in inter-region latencies. Going up to EPaxos, I have multiple leaders to model, which means I also must take into consideration the processing overheads each node takes in its role of following other nodes for some slots. wPaxos takes this even further: to model its performance I need to consider access locality and “object stealing” among other things.

Today I will focus only on 5 region models. In particular, I obtained average latencies between 5 AWS regions: Japan (JP), California (CA), Oregon (OR), Virginia (VA) and Ireland (IR). I show these regions and the latencies between them in Figure 1 below.

## Paxos in WAN

Converting paxos model from LAN to WAN is rather straightforward; all I need to do is to modify my paxos model to take non-uniform distances between nodes. I also need the ability to set which node is going to be the leader for my multi-paxos rounds. With these small changes, I can play around with paxos and see how WAN affects it.

Figure 2 (above) shows a model run for 5 nodes in 5 regions (1 node per region). From my previous post, I knew that the maximum throughput of the system does not depend on network latency, so it is reasonable for paxos in WAN to be similar to paxos in local networks in this regard. However, I was a bit surprised to see how flat the latency stays in WAN deployment almost all the way till reaching the saturation point. This makes perfect sense, however, since the WAN RTT dominates the latency and small latency increases due to the queuing costs are largely masked by large network latency. This also may explain why Spanner, CockroachDB and others use paxos in databases; having predictable performance throughout the entire range of load conditions makes it desirable for delivering stable performance to clients and easier for load-balancing efforts.

However, not everything is so peachy here. Geographical placement of the leader node plays a crucial role in determining the latency of the paxos cluster. If the leader node is too far from the majority quorum nodes, it will incur a high latency penalty. We see this with the Japan and Ireland regions, as they are far from all other nodes in the system and result in very high operation latency.
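The effect of leader placement can be sketched in a few lines of Python. The RTT values below are placeholders I picked for illustration, not the measured latencies from Figure 1, and the sketch ignores processing and queuing costs entirely: it only finds the RTT to the slowest follower the leader still has to wait for in order to reach a majority.

```python
# Placeholder symmetric RTTs in milliseconds between regions.
# These numbers are illustrative only, not the measurements from Figure 1.
RTT_MS = {
    ("JP", "CA"): 110, ("JP", "OR"): 100, ("JP", "VA"): 170, ("JP", "IR"): 220,
    ("CA", "OR"): 20, ("CA", "VA"): 70, ("CA", "IR"): 140,
    ("OR", "VA"): 80, ("OR", "IR"): 130,
    ("VA", "IR"): 80,
}
REGIONS = ["JP", "CA", "OR", "VA", "IR"]


def rtt(a: str, b: str) -> int:
    """Look up the symmetric RTT between two distinct regions."""
    return RTT_MS[(a, b)] if (a, b) in RTT_MS else RTT_MS[(b, a)]


def quorum_rtt(leader: str) -> int:
    """RTT to the slowest follower the leader must hear from for a majority."""
    q = len(REGIONS) // 2 + 1                      # majority quorum size
    follower_rtts = sorted(rtt(leader, r) for r in REGIONS if r != leader)
    return follower_rtts[q - 2]                    # (q-1)-th fastest reply, 0-indexed


latency_by_leader = {leader: quorum_rtt(leader) for leader in REGIONS}
```

With 5 nodes the leader counts itself and needs 2 follower replies, so the second-fastest follower RTT determines the network part of the round latency. With these placeholder numbers, leaders in JP and IR end up with the largest quorum RTT.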

## EPaxos

The EPaxos protocol tries to address a few shortcomings in paxos. In particular, EPaxos no longer has a single leader node and any node can lead some commands. If commands are independent, then EPaxos can commit them quickly in one phase using a fast quorum. However, if the commands have dependencies, EPaxos needs to run another phase on a majority quorum (at which point it pretty much becomes Paxos with two phases for leader election and operation commit). The fast quorum in some cases may be larger than the majority quorum, but in the 5-node model I describe today, the fast quorum is the same as the majority quorum (3 nodes).

Naturally, conflict between commands will impact the performance greatly: with no conflict, all operations can be decided in one phase, while with 100% conflict, all operations need two phases. Since running two phases requires more messages, I had to change the model to factor in the probability of running two phases. Additionally, the model now looks at the performance of every node separately, and accounts for the node leading some slots and following on the others.

Figures 3 and 4 (above) show EPaxos performance at every node for 2% and 50% conflict. Note that the aggregate throughput of the cluster is a sum of all 5 nodes. For 2% conflict, the max throughput was 2.7 times larger than that of Paxos. As the conflict between commands increases, EPaxos loses its capacity and its maximum throughput decreases, as I illustrate in Figure 5 (below). This changing capacity may make it more difficult to use EPaxos in production environments. After all, workload characteristics may fluctuate throughout the system’s lifespan and an EPaxos cluster may or may not withstand workloads of identical intensity (same number of requests/sec), but different conflict.

## wPaxos

wPaxos is our recent flavor of WAN-optimized paxos. Its main premise is to separate the commands for different entities (objects) to different leaders and process these commands geographically close to where the entities are required by clients. Unlike most Paxos flavors, wPaxos needs a large cluster; however, thanks to flexible quorums, each operation only uses a subset of nodes in the cluster. This allows us to achieve both multi-leader capability and low average latency.

wPaxos, however, has lots of configurable parameters that all affect the performance. For instance, the fault tolerance may be reduced to the point where a system does not tolerate a region failure, but can still tolerate failure of nodes within the region. In this scenario (Figure 6, below), wPaxos can achieve the best performance with aggregate throughput across all regions (and 3 nodes per region for a total of 15 nodes) of 153,000 requests per second.

We still observe big differences in latencies due to the geography, as some requests originating in one region must go through the stealing phase or be resolved in another region. However, the average latency for a request is smaller than that of EPaxos or Paxos. Of course, a direct comparison between wPaxos and EPaxos is difficult, as wPaxos (at least in this model configuration) is not as fault tolerant as EPaxos. Also, unlike my FPaxos model from last time, the wPaxos model reduces the communication in phase-2 to a phase-2 quorum only. This allows it to take much bigger advantage of flexible quorums than the “talk-to-all-nodes” approach. As a result, having more nodes helps wPaxos provide higher throughput than EPaxos.

Some EPaxos problems still show up in wPaxos. For instance, as the access locality decreases and the rate of object migration grows, the maximum throughput a cluster can provide decreases. Figure 7 (below) shows the wPaxos model with locality reduced to 50% and object migration increased to 3% of all requests.

## How Good Are the Models?

I was striving to achieve the best model accuracy without going overboard with trying to account for all possible variables in the model. The models for both LAN and WAN seem to agree fairly well with the results we observe in our Paxi framework for studying various flavors of consensus.

However, there is always room for improvement, as more parameters can be accounted for to make more accurate models. For instance, WAN RTTs do not really follow a single normal distribution, as a packet can take one of many routes from one region to another (Figure 8, below). This may make real performance fluctuate and “jitter” more compared to a rather idealistic model.

I did not account for some processing overheads as well. In EPaxos, a node must figure out the dependency graph for each request, and for high-conflict workloads these graphs may get large requiring more processing power. My model is simple and assumes this overhead to be negligible.

## Few Concluding Remarks

Over the series of paxos performance modeling posts I looked at various algorithms and parameters that affect their performance. I think it truly helped me understand Paxos a bit better than before doing this work. I showed that network fluctuations have little impact on paxos performance (k-order statistics helps figure this one out). I showed how a node’s processing capacity limits the performance. This is trivial and obvious, but what is still a bit interesting is that roughly half of the messages a paxos node processes no longer make a difference. Once the majority quorum is reached, all other messages for a round carry a dead processing weight on the system.

The stability of Paxos compared to other more complicated flavors of paxos (EPaxos, wPaxos) also seems interesting and probably explains why production-grade systems use paxos a lot. Despite having lesser capacity, paxos is very stable, as its latency changes little across most levels of throughput. Additionally, the maximum throughput of paxos is not affected by the workload characteristics, such as conflict or locality. This predictability is important for production systems that must plan and allocate resources. It is simply easier to plan for a system delivering stable performance regardless of the workload characteristics.

Geography plays a big role in WAN paxos performance. Despite the cluster having the same maximum throughput, the clients will observe the performance very differently depending on the leader region. The same goes for EPaxos and wPaxos, as different regions have different costs associated with communicating to the quorums, meaning that clients in one region may observe very different latency than their peers in some other regions. I think this may make it more difficult to provide the same strong guarantees (SLAs?) regarding the latency of operations to all clients in production systems.

There are still many things one can study with the models, but I will let it be for now. Anyone who is interested in playing around may get the models on GitHub.

# Modeling Paxos Performance – Part 2

In the previous posts I started to explore node-scalability of paxos-style protocols. In this post I will look at processing overheads that I estimate with the help of a queue or a processing pipeline. I show how these overheads cap the performance and affect the latency at different cluster loads.

I look at the scalability for a few reasons. For one, in the age of the cloud a 3- or 5-node cluster may not be enough to provide good resilience, especially in environments with limited control over the node placement. After all, a good cluster needs to avoid nodes that share common points of failure, such as switches or power supplies. Second, I think it helps me learn more about paxos and its flavors and why certain applications chose to use it. And third, I want to look at more exotic paxos variants and how their performance may be impacted by different factors, such as WAN or flexible quorums. For instance, flexible quorums present the opportunity to make trade-offs between performance and resilience. We do this by adjusting the sizes of quorums for phase-1 and phase-2. This is where the modeling becomes handy, as we can check if a particular quorum or deployment makes a difference from the performance standpoint.

Last time, I looked at how local network variations affect the performance when scaling the cluster up in the number of servers. What I realized is that the fluctuations in message round-trip-time (RTT) can only explain roughly 3% performance degradation going from 3 nodes to 5, compared to 30-35% degradation in our implementation of paxos. We also see that this degradation depends on the quorum size, and for some majority quorum deployments there may even be no difference due to the network. In this post I improve the model further to account for processing bottlenecks.

As a refresher from the previous time, I list some of the parameters and variables I have been using:

• $$l$$ – some local message in a round
• $$r_l$$ – message RTT in local area network
• $$\mu_l$$ – average message RTT in local area network
• $$\sigma_l$$ – standard deviation of message RTT in local area network
• $$N$$ – number of nodes participating in a paxos phase
• $$q$$ – Quorum size. For a majority quorum $$q=\left \lfloor{\frac{N}{2}}\right \rfloor +1$$
• $$m_s$$ – time to serialize a message
• $$m_d$$ – time to deserialize and process a message
• $$\mu_{ms}$$ – average serialization time for a single message
• $$\mu_{md}$$ – average message deserialization time
• $$\sigma_{ms}$$ – standard deviation of message serialization time
• $$\sigma_{md}$$ – standard deviation of message deserialization time

The round latency $$L_r$$ was estimated by $$L_r = m_s + r_{lq-1} + m_d$$, where $$r_{lq-1}$$ is the RTT + replica processing time for the $$q-1$$th fastest message $$l_{q-1}$$.

## Message Processing Queue

Most performance difference in the above model comes from the network performance fluctuations, given that $$m_s$$, $$m_d$$ and their variances are small compared to network latency. However, handling each message creates significant overheads at the nodes that I did not account for earlier. I visualize the message processing as a queue or a pipeline; if enough compute resources are available, then the message can process immediately, otherwise it has to wait until earlier messages are through and the resources become available. I say that the pipeline is clogged when the messages cannot start processing instantaneously.

The round leader is more prone to clogging, since it needs to process $$N-1$$ replies coming roughly at the same time for each round. For the model purposes, I consider queuing/pipeline costs only at the leader. The pipeline is shared for incoming and outgoing messages.

Let’s consider a common FIFO pipeline handling messages from all concurrent rounds and clients. When a message $$l_i$$ enters the pipeline at some time $$t_{ei}$$, it can either process immediately if the pipeline is empty or experience some delay while waiting for its turn to process.

In the case of an empty pipeline, the message exits the queue at time $$t_{fi} = t_{ei} + o$$, where $$o$$ is the message processing overhead $$m_s$$ or $$m_d$$ depending on whether the message is outgoing or incoming. However, if there is a message in the queue already, then the processing of $$l_i$$ will stall or clog for some queue waiting time $$w_i$$, thus it will exit the pipeline at time $$t_{fi} = t_{ei} + w_i + o$$. To compute $$w_i$$ we need to know when message $$l_{i-1}$$ is going to leave the queue: $$w_i = t_{fi-1} - t_{ei}$$. In its turn, the exit time $$t_{fi-1}$$ depends on $$w_{i-1}$$, and so we need to compute it first. We can continue to “unroll” the pipeline until we have a message $$l_{i-n}$$ without any queue waiting time ($$w_{i-n} = 0$$). We can compute the dequeue time for that message $$l_{i-n}$$, which in turn allows us to compute the exit time of all following messages. Figure 1 shows different ways a pipeline can get clogged, along with the effects of clog accumulating over time.
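The unrolling above can be sketched as a single forward pass over the messages in arrival order, which is equivalent to starting from the last message that found the pipeline idle. This is a minimal sketch for a single FIFO pipeline; the function name and the input layout are my own choices for illustration and not necessarily how the actual model code is organized.

```python
from typing import List, Tuple


def pipeline_exit_times(arrivals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (wait, exit_time) for each message in a single FIFO pipeline.

    `arrivals` is a list of (t_e, o) pairs sorted by enqueue time t_e, where
    o is the processing overhead (m_s or m_d) of that message.
    """
    results = []
    prev_exit = float("-inf")            # nothing is in the pipeline initially
    for t_e, o in arrivals:
        w = max(0.0, prev_exit - t_e)    # w_i = t_f(i-1) - t_e(i), or 0 if idle
        t_f = t_e + w + o                # t_f(i) = t_e(i) + w_i + o
        results.append((w, t_f))
        prev_exit = t_f
    return results


# Three messages, each needing 2 time units; the second and third arrive
# while the pipeline is still busy, so the clog accumulates.
timeline = pipeline_exit_times([(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)])
```

In this toy run the waiting times grow from 0 to 1 to 2 time units, which is the accumulating clog effect: each late message inherits the delay of the one before it.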

Unlike earlier, today I also want to model the overheads of communicating with the clients, since in practice we tend to measure the performance as observed by the clients. This requires the round model to account for client communication latency $$r_c$$ which is one network RTT. Each round also adds a single message deserialization (client’s request) and a message serialization (reply to a client) to the queue.

Let me summarize the parameters and variables we need to model the queuing costs:

• $$r_c$$ – RTT time to communicate with the client
• $$n_p$$ – the number of parallel queues/pipelines. You can roughly think of this as number of cores you wish to give the node.
• $$s_p$$ – pipeline’s service rate (messages per unit of time). $$s_p = \frac{N+2}{N\mu_{md} + 2 \mu_{ms}}$$
• $$w_i$$ – pipeline waiting time for message $$l_i$$
• $$R$$ – throughput in rounds per unit of time.
• $$\mu_{r}$$ – mean delay between rounds. $$\mu_{r} = \frac{1}{R}$$
• $$\sigma_{r}$$ – standard deviation of inter-round delay.

Now let’s talk about some of these parameters a bit more and how they relate to the model.

Pipeline service rate $$s_p$$ tells how fast a pipeline can process messages. We can get this metric by looking at average latencies of message serialization $$\mu_{ms}$$ and deserialization/processing $$\mu_{md}$$. With $$N$$ nodes in the cluster, we can find an average message overhead of the round $$\mu_{msg}$$. For a given round, the leader node needs to handle 2 message serializations (one to start the round and one to reply back to the client) and $$N$$ deserializations ($$N-1$$ from followers and one from the client). This communication pattern gives us $$\mu_{msg} = \frac{N\mu_{md}+2\mu_{ms}}{N+2}$$. A reciprocal of $$\mu_{msg}$$ gives us how many messages can be handled by the pipeline per unit of time: $$s_p = \frac{N+2}{N\mu_{md} + 2\mu_{ms}}$$.

Variable $$w_i$$ tells how backed up the pipeline is at the time of message $$l_i$$. For instance, $$w_i = 0.002 s$$ means that a message $$l_i$$ can start processing only after 0.002 seconds delay. Figure 2 illustrates the round execution model with queue wait overheads.

To properly simulate multi-paxos, I need to look at multiple rounds. Variable $$R$$ defines the throughput I try to push through the cluster, as higher throughput is likely to lead to longer queue wait times. I also need to take into consideration how rounds are distributed in time. On one side of the spectrum, we can perform bursty rounds, where all $$R$$ rounds start at roughly the same time. This will give us the worst round latency, as the pipelines will likely clog more. On the other side, the rounds can be evenly dispersed in time, greatly reducing the competition for pipeline between messages of different rounds. This approach will lead to the best round latency. I have illustrated both of these extremes in round distribution in Figure 3.

However, the maximum throughput $$R_{max}$$ is the same no matter how rounds are spread out, and it is governed only by when the node reaches the pipeline saturation point: $$R_{max}(N+2) = n_ps_p$$ or $$R_{max}(N+2) = \frac{n_p(N+2)}{N\mu_{md} + 2\mu_{ms}}$$. As such, $$R_{max} = \frac{n_p}{N\mu_{md} + 2\mu_{ms}}$$. In the actual model simulation, the latency is likely to spike up a bit before this theoretical max throughput point, as the pipeline gets very congested and keeps delaying messages more and more.
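The service rate and saturation formulas translate directly into a few lines of Python. The overhead values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the measured Paxi numbers.

```python
def mean_message_overhead(n: int, mu_md: float, mu_ms: float) -> float:
    """mu_msg: average per-message overhead at the leader for one round."""
    return (n * mu_md + 2 * mu_ms) / (n + 2)


def service_rate(n: int, mu_md: float, mu_ms: float) -> float:
    """s_p: messages one pipeline can handle per unit of time."""
    return 1.0 / mean_message_overhead(n, mu_md, mu_ms)


def max_throughput(n: int, mu_md: float, mu_ms: float, n_p: int = 1) -> float:
    """R_max: rounds per unit of time at pipeline saturation."""
    return n_p / (n * mu_md + 2 * mu_ms)


# Hypothetical overheads in seconds, chosen only to make the arithmetic easy.
MU_MD, MU_MS = 20e-6, 10e-6
r_max_3 = max_throughput(3, MU_MD, MU_MS)   # 3 nodes
r_max_5 = max_throughput(5, MU_MD, MU_MS)   # 5 nodes
```

With these made-up overheads, a 3-node round costs the leader 80 microseconds of processing, which caps a single pipeline at 12,500 rounds per second, while a 5-node round costs 120 microseconds and caps it at roughly 8,333. The capacity drops as nodes are added because the leader has more replies to deserialize per round.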

The likely round distribution is probably something more random as different clients interact with the protocol independently of each other, making such perfect round synchronization impossible. For the simulation, I am taking the uniform separation approach and add some variability to it by drawing the round separation times from a normal distribution $$\mathcal{N}(\mu_r, \sigma_r^2)$$. This solution may not be perfect, but the normal distribution tends to do fine in modeling many natural random phenomena. I can also control how much different rounds can affect each other by changing the variance $$\sigma_r^2$$. When $$\sigma_r$$ is close to 0, this becomes similar to uniformly spaced rounds, while large values of $$\sigma_r$$ create more “chaos” and conflict between rounds by spreading them more randomly.
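A rough sketch of drawing round start times this way is shown below. Clipping negative gaps to zero is an assumption of this sketch (a normal distribution with a large $$\sigma_r$$ will produce negative separations), and I am not claiming the actual model code handles them the same way. The function and parameter names are hypothetical.

```python
import numpy as np


def round_start_times(rate: float, rounds: int, sigma_factor: float, seed: int = 0) -> np.ndarray:
    """Draw start times for `rounds` rounds at a target throughput `rate`.

    Gaps come from N(mu_r, sigma_r^2) with mu_r = 1 / rate and
    sigma_r = sigma_factor * mu_r. Negative gaps are clipped to zero,
    which is an assumption of this sketch, not necessarily of the model.
    """
    rng = np.random.default_rng(seed)
    mu_r = 1.0 / rate
    gaps = rng.normal(loc=mu_r, scale=sigma_factor * mu_r, size=rounds)
    gaps = np.clip(gaps, 0.0, None)
    return np.cumsum(gaps)


uniform_like = round_start_times(rate=1000.0, rounds=5, sigma_factor=0.0)
chaotic = round_start_times(rate=1000.0, rounds=5, sigma_factor=2.0)
```

With `sigma_factor=0.0` the rounds are evenly spaced at `1 / rate`, while larger factors bunch some rounds together and spread others apart. Note that clipping raises the average gap slightly, so the effective throughput ends up a bit below the target for large $$\sigma_r$$.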

Now I will put all the pieces together. To model the round latency $$L_r$$, I modify the old formula to include the queuing costs and client communication delays. Since the round latency is driven by the time it takes to process message $$l_{q-1}$$, I only concern myself with the queue waiting time $$w_{q-1}$$ for the quorum message. As such, the new formula for round latency is $$L_r = (m_s + r_{lq-1} + w_{q-1} + m_d) + (m_{cd} + m_{cs} + r_c)$$. In this formula, $$m_{cd}$$ is the deserialization overhead for the client request, and $$m_{cs}$$ is the serialization overhead for the server’s reply back to the client.

## Simulation Results

As before, I have a python script that puts the pieces together and simulates multi-paxos runs. There are quite a few parameters to consider in the model, so I will show just a few, but you can grab the code and tinker with it to see how it will behave with different settings. Figure 4 shows the simulation with my default parameters: network settings taken from AWS measurements, pipeline performance taken from the early paxi implementation (now it is much faster). Only one pipeline/queue is used. The distribution of rounds in time is controlled by inter-round spacing $$\mu_r = \frac{1}{R}$$ with $$\sigma_{r} = 2\mu_{r}$$.

The next figure (Figure 5) shows how latency changes for different inter-round delay variances. The runs with higher standard deviation $$\sigma_r$$ appear more “curvy”, while the runs with more uniform delay do not seem to degrade as quickly until almost reaching the saturation point. High $$\sigma_r$$ runs represent more random, uncoordinated interaction with the cluster, which in my opinion is a better representation of what happens in the real world.

## Do I Need to Simulate Paxos Rounds?

The results above simulate many individual rounds by filling the pipeline with messages and computing the queue wait time for each round. Averaging the latencies across all simulated rounds produces the average latency for some given throughput. However, if I can compute the average queue waiting time and the average latency for the quorum message, then I no longer need to simulate individual rounds to essentially obtain these parameters. This will allow me to find the average round latency much quicker without having to repeat round formula computations over and over again.

Let’s start with computing the average latency for a quorum message $$r_{lq-1}$$. Since $$l_{q-1}$$ represents the last message needed to make up the quorum, I can model this message’s latency as the $$k$$th order statistic of a sample of size $$N-1$$ drawn from the normal distribution $$\mathcal{N}(\mu_l+\mu_{ms}+\mu_{md}, \sigma_l^2 + \sigma_{ms}^2 + \sigma_{md}^2)$$, where $$k=q-1$$. To keep things simple, I use a Monte Carlo method to approximate $$r_{lq-1}$$ fairly quickly and accurately.
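A Monte Carlo estimate of this order statistic could look roughly like the sketch below. It is not the code from my repository: the function name `quorum_message_rtt`, its parameters, and the default trial count are made up for illustration, and the caller is assumed to pass in the already combined mean and standard deviation of the per-message latency.

```python
import random


def quorum_message_rtt(n, q, mu, sigma, trials=20000, seed=0):
    """Estimate the mean latency of the (q-1)-th fastest reply out of n-1 replies.

    mu, sigma -- mean and standard deviation of a single reply's latency,
                 i.e. mu_l + mu_ms + mu_md and
                 sqrt(sigma_l^2 + sigma_ms^2 + sigma_md^2) combined by the caller
    """
    rng = random.Random(seed)
    k = q - 1  # the leader votes for itself, so it needs q-1 follower replies
    total = 0.0
    for _ in range(trials):
        replies = sorted(rng.gauss(mu, sigma) for _ in range(n - 1))
        total += replies[k - 1]  # k-th smallest value (1-indexed)
    return total / trials


if __name__ == "__main__":
    # Hypothetical numbers: 5 nodes, majority quorum of 3, latencies in ms.
    print(quorum_message_rtt(n=5, q=3, mu=0.4, sigma=0.05))
```

Sorting each sample is wasteful for large clusters, but it keeps the sketch short and makes the order statistic explicit.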

Now on to approximating the queue wait time $$w_{q-1}$$. This is a bit more involved, but luckily queuing theory provides some easy ways to compute or estimate various parameters for simple queues. I used Marchal’s average waiting time approximation for a single queue with generally distributed inter-arrival and service intervals (G/G/1). This approximation allows me to incorporate the inter-round interval and variance from my simulation into the queuing theory model computation.

I will spare you the explanation of how I arrived at the formula for the average round queue wait time (it is a pretty straightforward adaptation from here, with service and arrival rates expressed as rounds per second) and just give you the result for a single queue and single worker:

• $$p = R(N\mu_{md} + 2\mu_{ms})$$, where $$p$$ is the queue utilization, or the probability that the queue is busy.
• $$C_s^2 = \frac{N^2\sigma_{md}^2 + 2^2\sigma_{ms}^2}{(N\mu_{md} + 2\mu_{ms})^2}$$
• $$C_a^2 = \frac{\sigma_r^2}{\mu_r^2}$$
• $$w=\frac{p^2(1+C_s^2)(C_a^2+C_s^2p^2)}{2R(1-p)(1+C_s^2p^2)}$$
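The four expressions above translate almost directly into code. The sketch below is only an illustration under my own naming: `marchal_wait` and its argument names are hypothetical, and returning infinity at or beyond saturation ($$p \geq 1$$) is a convention I picked for the sketch, not something the approximation itself defines.

```python
def marchal_wait(R, N, mu_md, sigma_md, mu_ms, sigma_ms, sigma_r):
    """Average queue wait time w for a single queue and single worker.

    R        -- throughput in rounds per second (so mu_r = 1 / R)
    N        -- number of nodes in the cluster
    mu_md    -- mean deserialization time, sigma_md its standard deviation
    mu_ms    -- mean serialization time, sigma_ms its standard deviation
    sigma_r  -- standard deviation of the inter-round interval
    """
    mu_r = 1.0 / R
    service = N * mu_md + 2 * mu_ms          # mean service time per round
    p = R * service                          # queue utilization
    if p >= 1.0:
        return float("inf")                  # sketch convention: saturated queue
    cs2 = (N**2 * sigma_md**2 + 2**2 * sigma_ms**2) / service**2
    ca2 = sigma_r**2 / mu_r**2
    numerator = p**2 * (1 + cs2) * (ca2 + cs2 * p**2)
    denominator = 2 * R * (1 - p) * (1 + cs2 * p**2)
    return numerator / denominator
```

As a sanity check, setting $$C_a^2 = C_s^2 = 1$$ collapses the expression to $$\frac{p^2}{R(1-p)}$$, which is the familiar M/M/1 waiting time, and setting all variances to zero gives a wait time of zero.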

With the ability to compute the average queue waiting time and the average turnaround time for message $$l_{q-1}$$, I can quickly compute the average round latency for a given throughput without having to simulate multiple rounds: $$L = 2\mu_{ms} + 2\mu_{md} + r_{lq-1} + w + \mu_l$$, where $$r_{lq-1}$$ is the mean RTT for quorum message $$l_{q-1}$$, $$w$$ is the average queue wait time for the given throughput parameters, and $$\mu_l$$ is the network RTT for a message exchange with the client.

As such, the average round latency becomes:

$$L = 2\mu_{ms} + 2\mu_{md} + r_{lq-1} + \frac{p^2(1+C_s^2)(C_a^2+C_s^2p^2)}{2R(1-p)(1+C_s^2p^2)} + \mu_l$$

Figure 6 shows the model’s results for latency at various throughputs. The queuing theory model exhibits very similar patterns to the simulation, although the simulation seems to degrade quicker at higher throughputs than the model, especially for the 3-node cluster. This may be due to the fact that the simulation captures the message distribution within each round, while the model looks at the round as one whole.

## Flexible Quorums

I can use both the simulation and the model to show the difference between paxos and flexible paxos (FPaxos) by adjusting the quorums. For instance, I modeled a 9-node deployment of flexible paxos with a phase-2 quorum $$q2$$ of 3 nodes. In my setup, flexible paxos must still communicate with all 9 nodes, but it needs to wait for only 2 replies, so it can finish the phase quicker than the majority quorum. However, as seen in Figure 7, the advantage of the smaller quorum is tiny compared to the normal majority quorum of 9-node paxos. Despite FPaxos requiring the same number of messages to reach a quorum as a 5-node paxos setup, the costs of communicating with all 9 nodes do not allow it to get close in performance even to a 7-machine paxos cluster.
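One way to see why the smaller quorum helps so little is to look at the queue utilization $$p = R(N\mu_{md} + 2\mu_{ms})$$ from the model: the quorum size does not appear in it at all, only the cluster size $$N$$. The sketch below makes that explicit. The function names and the timing values in the usage example are hypothetical placeholders, not measurements.

```python
def queue_utilization(R, N, mu_md, mu_ms):
    """Utilization p = R * (N * mu_md + 2 * mu_ms); independent of quorum size."""
    return R * (N * mu_md + 2 * mu_ms)


def saturation_throughput(N, mu_md, mu_ms):
    """Throughput R (rounds per second) at which utilization reaches 1."""
    return 1.0 / (N * mu_md + 2 * mu_ms)


if __name__ == "__main__":
    # Hypothetical per-message costs in seconds, for illustration only.
    mu_md, mu_ms = 50e-6, 50e-6
    for nodes in (5, 7, 9):
        print(nodes, saturation_throughput(nodes, mu_md, mu_ms))
```

Under this model, a 9-node FPaxos deployment that still talks to all 9 nodes saturates at the same throughput as 9-node majority paxos, so the smaller quorum can only shave off the order-statistic part of the latency, not the queuing part.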

## Conclusion and Next steps

So far I have modeled single-leader paxos variants in the local area network. I showed that network variations have a negligible impact on majority quorum paxos. I also illustrated that it is hard to reap the performance benefits from flexible quorums, since the queuing costs of communicating with a large cluster become overwhelming. However, not everything is lost for FPaxos, as it can reduce the number of nodes involved in phase-2 communication from the full cluster size to as few as $$|q2|$$ nodes and greatly mitigate the effects of queue waiting time for large clusters.

The simulation and model are available on GitHub, so you can check them out and tinker with the parameters to see how the performance changes in response.

There are still quite a few other aspects of paxos that I find interesting and want to model in the future. In particular, I want to look at WAN deployments, multi-leader paxos variants and, of course, our WPaxos protocol that combines multi-leader, WAN and flexible quorums.

# Paxos Performance Modeling – Part 1.5

This post is a quick update/conclusion to Part 1. So, do network variations make any impact at all? In the earlier simulation I showed some small performance degradation going from 3 to 5 nodes.

The reality is that for paxos, network behavior makes very little difference on scalability, and in some cases no difference at all. To see what I mean, look at the figure below:

See how the 4, 6 and 36 node clusters perform the same in my simulation? And how the 5, 7 and even 35 node clusters slightly outperform 4 nodes?

The intuitive high-level explanation for the even-numbered simulation results is quite simple. For even-numbered cluster sizes, a round leader receives an odd number of replies, assuming a self-voting leader. The leader also decides the round after reaching a majority quorum $$q=\left \lfloor{\frac{N}{2}}\right \rfloor +1$$, meaning that it needs to receive $$q-1$$ or $$\left \lfloor{\frac{N}{2}}\right \rfloor$$ messages (with self-voting). As it happens, for even clusters, this message is exactly the median message of the round by arrival time. For instance, a 6-node cluster leader will receive 5 replies, but the round reaches the majority at the 3rd (or median) reply.

Since the simulation draws message RTTs from a normal distribution, the median (50th percentile) RTT is also the mean. After repeating it for a sufficient number of rounds, any fluctuations are averaged out, resulting in an average round decided by a message with an average RTT for the network.

A cluster with an odd number of servers, however, decides the round at a message with an RTT slightly less than the median RTT. This is because we have an even number of replies, and the median is computed by averaging the two middle RTTs. The smaller of the two values used for computing the median is actually the quorum message for the round. For instance, in a 7-node deployment, the leader reaches quorum after receiving the 3rd message $$l_3$$, with the median being $$\frac{l_3+l_4}{2}$$.

As a result, after many round repeats (I do ~125000 rounds) the simulation still ends up with an average RTT of a quorum message to be a tiny bit less than the median/mean RTT, and the more nodes I add, the closer it becomes to the actual 50th percentile and the mean.
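The even/odd effect is easy to reproduce with a toy experiment. The sketch below is a stripped-down stand-in for my simulation, not the real thing: the function name `quorum_gap`, the default $$\mu$$ and $$\sigma$$ values, and the round count are arbitrary choices for illustration, and it ignores everything except the network RTT draw.

```python
import random


def quorum_gap(n, mu=1.0, sigma=0.1, rounds=20000, seed=0):
    """Return (mean RTT of the quorum message) - mu for an n-node cluster.

    Assumes a self-voting leader with majority quorum, so the round is
    decided by the floor(n / 2)-th fastest of the n - 1 replies.
    """
    rng = random.Random(seed)
    k = n // 2
    total = 0.0
    for _ in range(rounds):
        replies = sorted(rng.gauss(mu, sigma) for _ in range(n - 1))
        total += replies[k - 1]  # k-th fastest reply decides the round
    return total / rounds - mu


if __name__ == "__main__":
    for nodes in (4, 5, 6, 7, 35, 36):
        print(nodes, quorum_gap(nodes))
```

For even cluster sizes the gap hovers around zero, while for odd sizes it is slightly negative and shrinks toward zero as the cluster grows.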

So, what do we have after all of this? I think it is safe to assume the effects of network variance on paxos performance are very small and sometimes non-existent. We should not worry about the network as much, as long as it is stable and delivers predictable performance. However, if you have a system with non-majority quorums, you may get a slight benefit from quicker replies.

Update (3/10/2018):

• Part 2 – Queuing/Processing overheads