Tag Archives: rdma

Reading Group. Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM

We have looked at the “Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM” ATC’21 paper. This paper investigates a combination of two promising technologies: Remote Direct Memory Access (RDMA) and Non-Volatile Memory (NVM). We have discussed both of these in our reading group before.

RDMA allows efficient access to the remote server’s memory, often entirely bypassing the remote server’s CPU. NVM is a new non-volatile storage technology. One of the key features of NVM is its ability to be used like DRAM, but with the added benefit of surviving power outages and reboots. Typically, NVM is also faster than traditional storage and gets closer to the latency of DRAM. However, NVM still significantly lags behind DRAM in throughput, and especially in write throughput. The cool thing about NVM is its “Memory Mode,” which essentially makes Optane NVM appear like a ton of RAM in the machine. And here is the nice part — if it acts like RAM, then we can use RDMA to access it remotely. Of course, this comes with a catch — after all, NVM is not actual DRAM, so what works well to optimize RDMA under normal circumstances may not work as well here. This paper presents a handful of RDMA+NVM optimizations from the literature and augments them with a few own observations.

Below I put the table with optimizations summarized in the paper. The table contains the literature optimizations (H1-H5), along with the author’s observations (Parts of H3, H6-H8). One additional aspect of the table is the applicability of optimization to one-sided or two-sided RDMA. One-sided RDMA does not involve a remote CPU. However, two-sided RDAM uses a remote server’s CPU.

H1: Accessing NVM attached to another socket is slow, so it is better to avoid it. We have discussed NUMA a bit in a previous paper to give a hint at what may be the problem here. The paper raises some concerns on the practicality of avoiding cross-socket access, but it also provides some guidance for implementing it. 

H2: For two-sided RDMA that involves the host’s CPU, it is better to spread the writes to multiple NVM DIMMs.

H3: Data Direct I/O technology (DDIO) transfers data from the network card to the CPU’s cache. In one-sided RDMA, that will cause the sequential writes to NVM to become Random, impacting performance, especially for large payloads, so it is better to turn DDIO off. However, there are some implications for touching this option, and the paper discusses them in greater detail. 

H4: ntstore command bypasses the CPU caches and stores data directly to the NVM. So this can make things a bit faster in two-sided RDMA that already touches the CPU. 

H5: This one deals with some hardware specs and suggests using writes that fill an entire 256 bytes XPLine (the data storage granularity). I am not going to explain this any further. However, the paper mentions that this optimization is not always great since padding the entire XPLine (and sending it over the network) incurs a lot of overhead.

H6: For one-sided RDMA, it is more efficient to write at the granularity of PCIe data word (64 bytes) for payloads smaller than XPLine. This is the granularity at which NIC operates over the PCIe bus. 

H7: Similar to H6, for two-sided RDMA, write at the granularity of a cache line (64 bytes on x86 architecture). 

H8: atomic operations, such as read-modify-write, are expensive, so for best performance, it is better to avoid them.

H9: Doorbell batching helps when checking for persistence. In one-sided RDMA, there is no easy way to check if data has persisted, aside from performing a read. This procedure incurs two round-trips and obviously hurts performance. Doorbell batching allows to send a write and a read at once but delays the read on the remote until the write completes, avoiding two separate round-trips. 

The figure below illustrates how the optimization changes the performance when added one at a time. It is worth noting that some optimizations can actually slow the system down, so it is important to understand the specific circumstances of the workload to make the proper decision on using some of the suggestions from this paper.

The paper has a lot more details and discussions of individual optimizations. It is also full of individual evaluations and experiments for different optimizations.

Brian Luger did an awesome presentation of the paper, which is available on YouTube:

Discussion

1) Only writes. The paper focuses on a bunch of write optimizations, and no read ones. The authors explain this by saying that NVMs read throughput is very high and exceeds the NICs bandwidth, so the network will be a bottleneck for reading. On the write side of things, however, the bottleneck is the NVM itself, and it is not held up by other slower components, making write optimizations very desirable. At the same time, if the network catches up with NVM bandwidth, then we may need to play the same optimizing game for reads. 

One thing to mention here is that read-to-write ratios will play a huge role in perceived improvement from optimizations. The paper has not explored this angle, but if the workload is read-dominated (like many database workloads), then all the write optimizations will be left unnoticed for the overall performance. For all the fairness, the paper does use realistic workloads with writes and reads for their evaluation. 

2) Vendor-Specific? Currently, there is pretty much one vendor/product out there for NVM — Intel Optane. So a natural question is how many of these optimizations will transfer to other NVM products and vendors.

3) Vendor Benefit. Our final discussion point was about the exposure of RDMA+NVM technology in the cloud. It is still very hard/expensive to get VMs with RDMA support. Having the two technologies available in the public cloud is even less probably now. This means that RDMA+NVM will remain the technology for the “big boys” that run their own data centers. Vendors like Microsoft, AWS, Google, etc. can benefit from this tech in their internal products while making it virtually inaccessible to the general open-source competition.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. When Cloud Storage Meets RDMA

I am very behind on the reading group summaries, so this summary will be short and less detailed. In the 67th reading group meeting, we discussed the “When Cloud Storage Meets RDMA” paper from Alibaba. This paper is largely an experience report on using RDMA in practical storage systems. 

Large-scale RDMA deployments are rather difficult to manage and maintain. This is the result of RDMA over Converged Ethernet (RoCE), which requires a Priority Flow Control (PFC) mechanism to prioritize the RDMA traffic and prevent frame loss on RDMA traffic due to congestion. PFC includes the mechanisms to pause network nodes (i.e switches, NICs, etc) from sending data to an overwhelmed nodes. This can also create backpressures in the network and can cause degraded operation, dropped frames, and even deadlocks. So anyway, while RDMA needs PFC, it can cause major reliability issues even for large companies.

Alibaba paper presents a handful of engineering solutions and workarounds to deliver RDMA performance while maintaining high reliability. I will mention two general approaches the paper takes. 

First, the authors do not try to make every machine accessible via RDMA from anywhere in a data center. Instead, the network/datacenter is divided into podsets, and machines within a podset can take advantage of RDMA when talking to each other, while communication between podsets relies on good-old TCP. I suspect this provides a lot of isolation and greatly limits the extent to which PFC problems can go. 

The second approach is to fail fast. To support tight SLA, Alibaba engineers prefer to failover the problematic connections quickly. For that, upon detecting problems with RDMA, the podset falls back to using TCP. As a result, the Pongu system operates in this kind of hybrid mode — whenever possible, RDMA is used for low latency and high throughput, but TCP connection is always ready as a backup.

The paper talks about a few other engineering solutions and optimizations aimed at improving reliability and performance. For example, the network is built with redundant switches, including Top-of-Rack switches. One important performance optimization is offloading as much work away from the CPU and onto special-purpose hardware. The paper specifically talks about CRC computations offloaded to NIC to save memory bandwidth and CPU cycles. Kernel bypass (DPDK) is used as well. Many of these and other (i.e. user-space SSD optimized file system, RPC thread management) optimizations reside in the User Space Storage Operating System (USSOS) part of Pangu.

Now I realized that I actually have not talked about the Pangu storage system itself. The image taken from the paper is a decent summary. In short, the storage cluster runs in a podset, so BlockServers can access storage with RDMA. There can be many storage clusters, each cluster in its own podsets. The storage clusters, however, do not talk to the clients using RDMA, and the clients communicate with storage using TCP. So, in the best-case scenario, a client’s request will have one TCP hop and one RDMA hop inside the storage cluster instead of two TCP hops. The experimental data from the paper largely supports this, as latency is essentially halved for clients when RDMA is used. Similarly, the Pangu Master Cluster services communicate with storage and clients using TCP.  

As always, we had a paper presentation in the group. Akash Mishra presented this time:

Discussion

1) Bi-model operation. The fail fast and switch to a backup mode of operation allow for a more reliable operation of the system. However, having these two distinct modes and the spectrum of performance between them may present additional problems. Metastability may come to play here as the system transitions from fast mode to degraded mode due to some external trigger. If the system becomes overwhelmed in this degraded mode, it may develop a positive feedback loop that feeds more work and keeps it overwhelmed. In general, having distinct performance modes is a flag for metastability, and extra care must be used when the system transitions from fast to slow operation.

2) RDMA at other companies. The paper cites Microsoft’s experience report with RDMAquite extensively when talking about reliability issues of using RDMA at scale. So it is obvious that the experience is shared, and it may be the reason for RDMA being largely not offered in the public cloud. The last time I checked, Azure had a few RDMA-capable VM instances, while AWS EC2 had none (instead, EC2 offers its own tech, called EFA).

3) Offloading CPU. This was an interesting point as well. The paper notes that RDMA is so fast that the memory bandwidth becomes a bottleneck! To limit memory usage, Pangu offloads data verification through CRCs away from the CPU and onto smart NICs. This saves quite a bit of memory roundtrips on accessing messages/payload to compute/verify CRCs. Using various accelerators is a common theme in many large systems.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Microsecond Consensus for Microsecond Applications

Our 43rd reading group paper was about an extremely low-latency consensus using RDMA: “Microsecond Consensus for Microsecond Applications.” The motivation is pretty compelling — if you have a fast application, then you need fast replication to make your app reliable without holding it back. How fast are we talking here? Authors go for ~1 microsecond with their consensus system called Mu. That is one-thousandth of a millisecond. Of course, this is not achievable over a regular network and network protocols like TCP, so Mu relies on RDMA.

In my mind, Mu maps rather perfectly to Paxos/MultiPaxos, adjusted for the RDMA usage. Accept phase is pretty much Paxos phase-2. The leader directly writes to the follower’s memory. Mu does not use protocol-specific acks, but there is still an RDMA-level ack for successfully writing memory and thus completion of phase-2. Of course in Paxos, followers must check the ballot before accepting an operation in Phase-2. This requires processing and will negate the benefits of direct memory access. To work around the problem, Mu uses RDMA permissions to control whose memory writes are accepted in phase-2. The bottom line, however, is that we have a single round trip phase-2 capable of rejecting messaging from “wrong” leaders, just like in Paxos.

Paxos elects a leader in phase-1. In Mu, the equivalent of phase-1 consists of 2 sub-phases. First, a prospective leader contacts the quorum of followers and tells them to change the permissions from an old leader to itself. This prevents the old leader from writing to a quorum and makes it stop. This quorum becomes “the leader’s go-to quorum”, as it can only write to the nodes from that quorum due to permissions. In the second sub-phase, the prospective leader learns of the past proposal/ballot number and any past operations to recover. The leader then picks a higher proposal number and writes it back. Just like in Paxos/MultiPaxos, the leader must recover the learned commands.

Another prominent part of the paper is the failure detector. The authors claim that it allows for fast leader failover. The detector operates by a pull mechanism — a leader maintains a heartbeat counter in its memory, and increments it periodically, the followers read the counter and depending on the counter’s progress adjust the “badness” score. If the counter moves too slow or does not move (or not readable at all?), the badness score becomes high, causing the follower to decide that a leader has failed and try to take over.

As always, the paper has way more details than I can cover in a short summary. Our group’s presentation by Mohit Garg is available on YouTube:

Discussion

1) Performance. Microsecond latency covers only replication and does not include any of the client interactions or request capture. These components may add a significant delay to the client-observed latency. Moreover, the throughput figure has latency that is at least somewhat close to 1 microsecond only at the low-end of the throughput curve. Pushing more operations degrades latency quite significantly — up to 15 microseconds. Of course, it is worth noting that this is with batching enabled, so still pretty impressive.

2) Use of RDMA permissions for leader enforcement. This looked familiar to me… Until I was reminded that in the 17th reading group meeting we looked at the “Impact of RDMA on agreement” paper by the same authors.

3) Quorums. Since the protocol relies on the permissions to be explicitly granted to a leader when it contacts a quorum, that leader cannot use any other quorum, as it won’t have permissions to access it. We were not very sure why a leader cannot contact all nodes and try to get permissions to all of them. It still needs only the majority to succeed, but having more than the quorum of nodes who can accept writes from leader may be handy, since trying to write to more nodes than the minimal quorum can be useful for controlling the tail latency and tolerating strugglers.

4) Flexible Quorums. This continues the above point about quorums. Flexible quorums are quite useful in trading off fault tolerance and scalability. Since Mu is restricted to just one quorum that granted the write permissions, it cannot take advantage of flexible quorums, such as grids.

5) Failure detector. Failure detector is one of the most interesting and controversial features in Mu. We have spent quite a bit of time discussing it. First of all, what does the pull model give us? Every follower keeps pegging the leader and reading some counter. But what if the leader is actually totally and utterly down, how can you read the memory of the crashed server to learn its counter and compute the badness score from it? Of course, if a follower cannot read, then it can conclude that the leader is down and start the leader election, but this is not explicitly mentioned in the paper. So what is the purpose of reading a counter and having the counter increase then? Being able to read the counter clearly means the leader is up, at least in some capacity. The counter and badness score computed from it is not so much the proxy of the node’s overall up/down status, but the proxy of the node’s health/performance. The paper briefly alludes to this when talking about replication being stuck, eventually causing the heartbeat counter to stop as well and trigger an election, despite the leader not being completely down.

In the discussion, we came up with a different heartbeat mechanism, that avoids the “read from dead node” issue. If we make the leader write its counter to the followers’ memory, and followers read their local copy of the leader’s counter, then a leader crash will stop the counter progress, and followers can detect it by reading their local memory. Quite honestly, this scheme sounds cleaner to us than the follower pull/read approach used in the paper. The authors claim that the pull mechanism provides better detection latency, but this is not backed up experimentally in the paper.

6) “Dumb” acceptors. Mu is not the only protocol that assumes “dumb” Paxos acceptors/followers that simply provide a write/read interface with very little capacity to run any “logic”. Disk Paxos assumes separate sets of processors and disks. One processor can become a leader, and disks are the followers. Disk Paxos, of course, would not provide the same low latency, as in each phase a processor needs to both write and read remote disks/storage. The paper briefly mentions Disk Paxos. CPaxos is a WAN Paxos variant built using strongly consistent cloud storage services as acceptors. Similarly, the storage service provides limited ability to run any logic and the leader must jump through some hoops to maintain safety. Another one mentioned in the discussion was Zero-copy Paxos.

7) Ordered communication for correctness. We spent a bit of time talking about the importance of ordered communication (FIFO) for the correctness of the protocol. If not for FIFO, there could have been some interesting corner cases around the leader churn. I usually do not fully trust papers that just state the assumptions of the FIFO channels and move on, since traditionally you may have quite a few corner-cases with systems built on FIFO network protocols, like TCP, and have messages reordered. One common reason is that applications often have complex and multi-threaded logic, and may reorder messages internally after the messages have left the TCP stack. Here, however, there is no logic at the followers, and it makes the ordered network all you need (assuming there are no other corner-cases in the network, like dropped connections and re-connections).

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!