Tag Archives: key-value

Reading Group. Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories

Hard to imagine, but the reading group just completed the 45th session. We discussed “Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories,” again from OSDI’20. Pegasus is one of these systems that are very obvious in the hindsight. However, this “obviousness” is deceptive — Dan Ports, one of the authors behind the paper who joined the discussion, mentioned that the project started in 2017, so it was quite a bit of time from the start to publish with a lot of things considered and tried before zeroing in on what is in the paper. 

Pegasus is a replication system for load balancing and throughput scalability in heavily skewed workloads. Consider a workload with a handful of “hot” objects. These hot objects may have so much usage, that they overwhelm their respective servers. Naturally, this limits the overall throughput because the system is now capped by servers at their maximum capacity/utilization. The solution is to replicate these hot objects to many servers and allow clients to access them from multiple replicas. However, as soon as we have a replicated scenario, we start running into consistency issues. Strongly consistent systems often degrade in performance with the addition of more replicas due to the synchronization overheads. This is what makes Pegasus rather unique — it scales for load balancing through replication while remaining strongly consistent. The key enabler of this is the smart Top of Rack (ToR) switch that handles all the traffic in the server rack. This switch acts as the “source of synchrony” in the rack, and it does so at the packet’s line speed. 

In Pegasus, the data is assigned to servers in the rack using a consistent hashing mechanism, allowing clients to send the requests directly to servers that own the data. However, all these requests go through the ToR switch which can inspect the packets and make some additional routing decisions for “hot” objects. Let’s consider a write request for a such high-demand object. ToR inspects a packet, and if it is for a “hot” key, it sends the write message to some larger and potentially different set of servers in a rack, essentially increasing the replication factor and rotating the responsible servers. Once the servers ack the write completion, the ToR switch sees the acks and records these servers into its coherency directory as servers with the latest copy of the data. The read requests have a similar rerouting fate — if a read is for a hot object, instead of going to the default server, the ToR switch sends it to one of the replicas from its coherency directory. The paper has more details on implementing this coherency directory and keeping track of the recent progress using a simple versioning mechanism.

The end result is awesome! Just by replicating a handful of objects in skewed workloads (~16 objects out of a million in the paper), Pegasus achieves load balancing and high throughput beating in-network caching in almost all scenarios. There are a few other advantages to Pegasus that are missing in other SOTA solutions: the ability to store larger objects (not evaluated), and tolerance of workloads with different read-write ratios (evaluated extensively).

Finally, I have not touched on a few other important pieces of the system: figuring out which keys are hot and fault-tolerance. For measuring the key temperature, the Pegasus statistics engine samples some packets and determines the frequency of keys in the samples to make gauge how hot each key is. For fault-tolerance, the system uses chain replication across racks for durability.

As always, we have our presentation of the paper by A. Jesse Jiryu Davis:

Discussion

This time around we had Dan Ports join us to answer the questions about the paper, so this turned out to be a nice discussion despite a slightly lower than expected attendance. 

1) Simple API. Currently, Pegasus supports a simple API with reads and simple destructive writes (i.e. each write is an unconditional overwrite of the previous copy of the data). The main reason for this is how the writes are structured in Pegasus. The system is very nimble and quickly adjustable, it picks write servers on the fly as the write request “goes through” the switch. This means that a write should be able to just complete on the new server. However, if the write operation is, for example, a conditional write or an update, then such an update also needs to have the previous version of the object, which may be missing on the new server. We have spent some time discussing workarounds for this, and they surely seem possible. But the solution also introduces additional communication, which both introduces more load to the servers and more latency for operations. And since we are dealing with a subset of objects that already generate the most load in the system, adding anything more to it must be avoided as much as possible. The cost of supporting these more complex API will also differ for various read-write ratios.

2) Comparison with caching. Another big discussion was around using caching to load balance the system. As Dan pointed out, caches are good when they are faster than storage, but for super-fast in-memory storage, it is hard to make a cache faster. NetCache (one of the systems used for comparison in the paper) does provide a faster cache by placing it in the network switch. It has several downsides: handles only small objects, consumes significant switch resources, and does not work well for write workloads (this is a read-through cache, I think). Of course, it is possible to make it a write-through cache as well to optimize for write workloads, but it still does not solve other deficiencies and adds even more complexity to the system. We also touched on the more complicated fault-tolerance of cached systems. The disparity between the load that cache and underlying systems can take can create situations when the underlying systems get overrun upon cache failure or excessive cache misses. 

3) Chain replication. Since Pegasus replicates for scalability, it needs a separate mechanism to handle fault-tolerance. The paper suggests using a chain replication approach, where racks are the chain nodes. Only the tail rack serves reads, however, the writes must be applied in all racks as the write operation propagates through the chain. One question we had is why not use something like CRAQ to allow other racks to serve reads, but the reality is that this is simply not needed. The chances that an object can become so skewed and hot that it needs more than a rack worth of servers are very slim, so there is no need to complicate the protocol. Another question I have now but forgot to ask during the discussion is what happens to hot writes as they go through the chain? If non-tail racks only use the default server for writes on “hot” keys (instead of let’s say round-robin or random node), then this server may get overwhelmed. But I think it is trivial to pick a random server for the hot object on each write at the non-tail racks. 

4) Zipfian distribution and workload skewness. Pegasus needs to load-balance fewer keys for a less skewed Zipfian distribution. This was an interesting and a bit counter-intuitive observation at the first glance since one can intuitively expect the more skewed distribution to require more load-balancing. However, a higher alpha Zipf has more skewed objects, but not necessarily more skewed objects than a lower alpha Zipfian distribution. Fewer highly skewed objects mean less load-balancing. 

5) Virtualization of top-of-rack switches. One important question about the technology that enables Pegasus is the virtualization of these smart ToR switches. Currently, it is not the case, so one needs to have bare-metal access to a switch to deploy the code. Virtualization of such a switch may make the technology available to cloud users. I think this would be a huge boost to the overall state of distributed computing at the datacenter level. I would be speculating here, but I think a lot depends on the willingness of manufacturers to provide such support, especially given the relatively limited capabilities of the hardware right now. Of course, the virtualization should not add a significant latency penalty to the users, and most importantly should not add any penalty to non-users (applications/systems that reside in the same rack but do not use the extended capabilities of the switches). Couple all these with the risks of running user’s code on the hardware that handles all the traffic in the rack, and we also need to worry about user isolation/security more than ever. However, as wishful as it is, it is quite probable that these smart switches will not make their way to the public cloud any time soon. This gives large cloud vendors an edge since they can benefit from the technology in their internal systems. Smaller service providers that rely on the cloud, however, will have to find a way to compete without access to this state-of-the-art technology.  Aside from my extremely high-level speculations, some smart people actually go deeper into the topic.

6) There were a few other minor topics discussed, and jokes are thrown here and there. For example, Dan explains Pegasus with cat pictures

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Cobra: Making Transactional Key-Value Stores Verifiably Serializable.

This Wednesday, we were talking about serializability checking of production databases. In particular, we looked at the recent OSDI’20 paper: “Cobra: Making Transactional Key-Value Stores Verifiably Serializable.” The paper explores the problem of verifying serializability in a black-box production system from a client point of view. This makes sense as serializability is an operational, client-observable property. The tool, called Cobra, collects the history via a middle layer sitting between a client and the database and then uses the history to construct a polygraph representing all possible execution orders. The problem then becomes finding whether a serial execution order exists in the polygraph. This means that we need to find some graph that has no dependency/ordering cycles. For instance, this would represent a cycle: A depends on B, B depends on C, and C depends on A. In fact, quite a few tools take a similar approach for checking sequential equivalence properties. What makes Cobra different is that they want to check serializability at scale, but the problem is NP-complete, so adding more events to the graphs makes checking exponentially slower. To mitigate the issue, Cobra uses a few tricks. It uses a few domain-specific heuristics to reduce the size of the polygraph. It also takes advantage of parallel hardware (GPUs) to speed up (by order of magnitude!) some highly-parallel polygraph-pruning tasks. And at last, Cobra uses an SMT solver to perform the final satisfiability search on the pruned polygraph. I will leave the details of all these methods to the paper.

Our video presentation by Akash Mishra is, as always, on YouTube:

Discussion

This is a very nice and interesting paper that sparked some lively discussion. here I list a few of the key discussion points.

Cobra Performance

1) Performance improvement. Cobra is significantly faster than the baselines in the paper. One concern during the discussion was about the optimizations of the baselines. For example, one baseline is Cobra’s approach minus all the optimizations. This is great in showing how much improvement the core contribution of the paper brings, but not so great at comparing against other state-of-the-art solutions. For the paper’s defense, they do provide other baselines as well, all of which perform worse than Cobra minus the optimizations one. So maybe there are no other domain-specific solutions like Cobra to compare just yet.

2) Performance improvement part 2. Another performance discussion was around the use of GPU. While the authors mention a magnitude improvement when using GPU for graph pruning, it is not often clear how much of the overall improvement is due to the GPUs. Interestingly enough, the overall gains compared to the baseline are ten-fold. For this one, the paper provides a figure that breaks down the time spent in each phase/optimization of Cobra. In read-heavy workloads, polygraph pruning, which is GPU-optimized, dominates the entire computation, suggesting that a lot of the gains may come from the use of specialized hardware.

3) Better parallel hardware? Are GPUs the best hardware to accelerate pruning? Maybe some better alternative exists? FPGAs?

4) Is it fast enough? While Cobra is significantly faster than other approaches, it may still be not fast enough for use in some production workloads. While it can handle 10k transactions in ~15 seconds, real production workloads can produce more transactions in under one second. The paper claims that Cobra can sustain an average load of 2k requests per second, which is enough for many large services. 2k transactions per second is the scale of systems about 35 years ago.

5) Do we need this in production? We spent quite a bit of time discussing this. Aside from concerns in point (3) above, there may be less utility from checking a production system. Achieving serializability in happy-case is not as difficult. There are plenty of databases out there that do just that. Testing systems in production is like testing a happy-case execution most of the time, so there may be little incentive to do that, especially given the cost of Running Cobra.

Keeping the same guarantees under failures is more difficult, this is why tools like Jepsen stress test the system by introducing the faults. Our thinking was that Cobra when combining with the fault injection can be a powerful stress-test tool. And with the capacity to check larger histories, it may be useful for checking more involved scenarios and doing routine fuzz testing (maybe even check-in testing!) to try to prevent engineers from introducing bugs.

As far as continuous production use, we did not have many scenarios, aside from a service provider checking its own compliance with some consistency SLAs or a user trying to catch a service provider to get a discount (but at what cost?)

6) Limitations. Cobra works only on key-value stores and only with a subset of common operations. For instance, it does not support range operations. Checking the serializability of range commands requires knowing not only what keys exist but also what keys do not exist in the system. This is hard in a black-box production system. However, our thought was, if this is not a production system, and you start with a blank-state for testing purposes, the knowledge of what keys do not exist is there — initially, none of the keys exist. Maybe in such a scenario, Cobra can add support for ranged operations. Especially since we think it will be more useful as a very powerful testing tool (point 5 above) rather than a production monitoring tool. There may be other reasons for the lack of range operation support that we do not know or understand.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. High availability in cheap distributed key value storage

Our recent paper was “High availability in cheap distributed key value storage”. And what a paper that was! It was definitely a mind-tingling read the lead to a very interesting and long discussion session with the group.

Short Summary

The paper addresses the problem of fast recovery from the leader (primary) crashes in key-value stores backed by the non-volatile main memory (NVMM). NVMMs provide good latency but have much lower throughput than the DRAM. They are also a lot cheaper than DRAM for the same capacity, making NVMMs an interesting low-latency compromise between in-memory datastores and SSD-backed stores. NVMMs are, however, much more expensive than SSDs. In CANDStore, a primary replica is supported by the NVMM for fast latency, but the follower, or back-up, is backed by an SSD to save on the hardware. An additional “witness” replica exists. Witness, like the primary, is supported by NVMM, but it does not store actual data, and only has the operation log with placeholders for data. This is done, again, for cost reasons. The replication between a primary and the backup is done with a modified Raft algorithm. In a happy case, witness replica is a non-voting member of the cluster, much like in Cheap Paxos. As such, CANDStore is supported by heterogeneous hardware, with specific roles attached to specific hardware.

Upon the failure of a primary node, the witness is promoted to the primary role. Since the witness has no data, it has to learn it from the backup, but it does so smartly: first, learn the log with placeholders, once it knows all the placeholder, it can start serving new write requests that overwrite the key’s state. Knowing the position of the operation in the log is not enough to serve reads and a new primary needs to copy the data from the backup. It does so smartly and copies the “hot” keys first, followed by less frequently used keys. This allows primary to start serving hot reads very soon after gaining the ability to serve writes, and before the full copy of the data has finished. The paper claims this approach yields 4.5-10.5 times recovery speedup compared to the offline rebuild of a replica from an SSD. Backup failure is a less-discussed issue in the paper, but here the authors take the Cheap Paxos approach and make the witness vote on the quorum to select a new configuration.

Video Presentation

Discussion

The discussion was rather heated, as the paper has many points to digest.
1) Homogeneous vs. heterogeneous deployment. The biggest question is why do we need to restore a primary from a backup by doing a full copy? The backup node in a 3-replica configuration studied in the paper(1 primary, 1 backup, 1 witness) already has the latest state, so it makes sense to promote it to the leader/primary node instead of doing a full replica “build” from the witness state. This would undoubtedly result in a much faster primary recovery (while “build-out” of a new backup can happen in the background). Of course, this works only if the backup and primary are homogeneous, which is not the case here. So, this raises the question of why having such a heterogeneous system in the first place? Our best guess was the cost. Having an SSD-backed replica is cheaper than NVMM one.

2) Witness log. In the paper, a witness maintains a log, keeping track of all the keys and essentially having an order of updates. This is a placeholder log that has no values. In the process of recovery, the witness learns of the missing log entries, i.e. tail of the log that a primary and backup may have, but has not reached the witness. So if this learning phase exists, why having a log at all? just make a witness learn the log (or relevant suffix of the log) when recovery is needed. In the discussion, we think the witness has a log to speed-up recovery when it is needed.

3) Performance comparison of DRAM/NVMM/SSD. The paper’s heterogeneous setup leverage the fact that these types of storage have different performance and price. DRAM is the fastest but also the most expensive, NVMM presents a latency compromise, while SSD is the price compromise. However, if the performance of NVMM is close to that of SSDs, there may be fewer incentives for such design. We looked at a few sources. And it appears that NVMM has good latency, but its throughput is somewhat on par with SSDs

4) The performance of the primary. Another question we had is why NVMM is fast enough for primary, but not for backup? This loops back to (3), and it seems like NVMM would do ok for a backup, given its similar throughput to an SSD. However, we speculate that (a) the NVMM performance may change less favorably depending on the write block size, and (b) again the cost and not the performance is the reason for having an SSD backup.

5) One witness for multiple partitions. The paper evaluates a single-partition 3 node system. However, in such a setup, we have a witness that must run on hardware capable of supporting a primary. This does not lead to cost-savings, as primary hardware is more expensive. We speculate that in a sharded system the cost of running a witness can be amortized by having single witness hardware shared across many partitions. When one partition needs a witness promoted, it takes over the resources, and forces other partition to get new witness nodes. This of course has several complications. Firstly, a correlated failure between partitions may not be tolerated well, when two partitions need a new primary, but they share a witness. Secondly, managing such a system become more challenging, since a witness promotion leads to the cleanup of other partition-witness data, finding new witness nodes, etc

6) Hot/cold benefits. Does hot/cold key separation benefit anything except recovery? The backup cannot benefit from this, it (normally?) serves no user work, so there is not much difference. It maintains hot/cold keys to make a copy upon the recovery prioritized to reduce performance degradation. Primary nodes, on the other hand, may benefit. Keep the “hot” keys in DRAM, and “cold” ones in the NVMM storage.

7) Backup failures. This aspect is not discussed in much details in the paper, except mentioning that the Cheap Paxos approach is used to reconfigure and bring up a new backup. One thing that caught the group’s attention is that a backup recovery puts a lot more work on the primary – it has to copy data to the backup, while still serving the operations (given that the backup recovery is done on-line).

8) More backup nodes. So, in relation to (7), what if we have more backup nodes, lets say 2 nodes instead of 1? This means that one backup should be available to restore the failed one. But what does this mean for write quorums? Do we still need all backup in the quorum? or just one out of 2 backups is enough? If we require all backups for the quorum, we also need two witnesses to tolerate two failures and be able to reconfigure with Cheap Paxos. If we require just one backup out of two, do we need a witness (aside from promoting it to the primary in case of a primary crash). Also, how does such two backups setup compare in terms of cost saving?

9 and beyond) We had way more discussion points as well. Can this heterogeneous setup benefit other copy-have workloads for databases, like elasticity tasks? Can heterogeneous deployments be used with regular Multi-Paxos/Raft? Why using modified Raft if most of the things that differ Raft from Paxos were undone? You can read more about these in our slack discussion group.