Tag Archives: storage

Reading Group. Achieving High Throughput and Elasticity in a Larger-than-Memory Store

Achieving High Throughput and Elasticity in a Larger-than-Memory Store” paper by Chinmay Kulkarni, Badrish Chandramouli, and Ryan Stutsman discusses elastic, scalable distributed storage. The paper proposes Shadowfax, an extension to the FASTER single-node KV-store. The particular use case targeted by Shadowfax is the ingestion of large volumes of (streaming) data. The system does not appear to have replication and cannot handle server failures without data loss. However, the authors focus on reconfiguration performance and its impact on the rest of the system. So, in case of failures, replacement servers can be added with a minimal performance penalty. Under failure-free conditions, nodes can be added or removed without data loss. Finally, Shadowfax stores data in three different “places” based on how hot the keys are. Recently updated keys reside in RAM, colder data gets pushed to the SSD, and finally, the data from SSD is asynchronously copied to the shared cloud storage.

The main problem Shadowfax solves is ingesting lots of data from networked clients. A single server can only do so much, so we really need a system that can partition the keyspace across the servers. Shadowfax builds on top of FASTER, single-node storage that leverages a lock-free hash table to support key indexing/look-up. Shadowfax adds hash-based key partitioning to ensure each server is responsible for a subset of all keys. This partitioning is stored in ZooKeeper for fault tolerance of the sharding configuration. 

A lot of the Shadowfax design stems from two ideas: (1) avoiding coordination and context switching and (2) avoiding unnecessarily costly operations.

For the first point, Shadowfax servers pin processing threads to cores, allowing each thread to pick up some requests and dispatch them to the underlying FASTER store shared among all processing threads. As such, the processing threads never coordinate with each other by any other means except the FASTER store. The shared FASTER store uses lock-free hashing to minimize the impacts of concurrency. Whenever some coordination is needed, FASTER avoids explicit cross-thread coordination with the help of asynchronous cuts. As different threads may access the same shared data, it is often essential to know when all threads have passed or seen some particular version of the data. For instance, when some record moves from RAM to SSD, all threads must have completed their access to that record in RAM (or agreed not to access it) before the memory can be cleaned up. FASTER achieves this lazily via view or epoch changes. When a thread sees a new epoch and transition to it, it agrees to have finished accessing any data in the old epoch. When all threads have moved (at their own pace) to a new epoch, the system can reclaim the memory supporting the older epoch’s data.

On the client-side, threads pin to a core as well. Clients maintain sessions with the servers. Each client thread opens a session with each server, allowing independent client threads to talk to the same server without any internal coordination. When accessing the key, a client thread uses its session for the partition/server hosting the key and batches the request before sending it over to the server with other requests from the same session.

The scheme with sticky sessions works well when the partitions are static. However, in a dynamic situation, it may be possible that a client sends a request to a server that no longer serves a key. This is where the second aspect of the design comes in. The naive solution would be for a server to check ownership for each request, but this would also be a waste of CPU cycles for the vast majority of requests. Instead, Shadowfax checks request ownership at a batch level using view numbers and drops outdated batches. Each server has a view number representing its configuration version. Whenever there is a partitioning change impacting a server, its view increases. The client attaches the view obtained (and cached) from ZooKeeper to a batch, so if the client’s information is stale, its batches will not succeed. Naturally, when this happens, the client refreshes the keyspace mapping and server views from ZooKeeper

The migration tries to avoid as much coordination as possible. It leverages the idea of the asynchronous cut to transition between phases of the migration process — once all threads finish a migration phase, they can transition to the next phase. I will skip the detailed explanation of data migration, however, I will mention that it requires a few state transitions on both the source and target nodes, and during the process, some requests may get delayed, as the target machine may not have yet received the data for remapped/repartitioned keys. In short, the source must find the keys to send, tell the target to expect the data migration, and move the keys, while the target needs to accept the keys and eventually start serving them. 

Another extension of Shadowfax over the FASTER store is the ability to move data from SSD to shared cloud storage. The main benefit of using cloud storage is the speed of reconfiguration and migration. Shadowfax eventually writes all records from SSD to shared cloud storage. This cloud copy allows the migration procedure to avoid copying the records from SSD and instead migrate the index with pointers to the records in the shared cloud storage.

Regarding performance, Shadowfax seems to provide around 130 million ops/sec on one 64-core server. On 12 servers (and 3 clients) as shown in the figure here, the throughput can go as high as 930M ops/sec. Scale-up procedures also seem to be quick with little impact on performance. We can see a dip in source throughput, but it is relatively short and lasts way under a minute in all tested scenarios.

Discussion

1) Fault tolerance? While it is impressive to get up to 130M ops/sec of throughput over TCP (albeit with FPGA network acceleration), I cannot stop wondering what kind of applications need such per-node throughput and do not need data redundancy? If we operate at such speeds, a server reboot can mean millions of lost operations that existed only in RAM. 

2) Client Scalability. Shadowfax clients communicate with servers over TCP, but the system assumes a small number of clients/sessions. In fact, the authors perform most evaluations with just one massive client. What happens is that a client needs a session to each server from each of its threads. So a 64-core client will have 64 sessions per server. I am not entirely sure if having this many sessions will impact the performance. The paper achieves 930M ops/sec in 12 server clusters with 3 clients, but each client is massive, and all three clients combined have 192 sessions on each server. The authors claim this result as proof that the servers can handle many client sessions. 

One question I have is whether having more smaller clients will impact the performance even if it is the same number of sessions? Referring to the theme of the first discussion point, what kind of application will have a single client producing 310 (930/3) million events/records/updates each second? Of course, an argument can be made that these clients are “proxies” for many real clients, like sensors, IoT devices, etc., but that will also mean that these proxy clients will not be able to provide 310M ops/sec, as they need to manage a lot of connections from actual data producers. So this, in turn, will require more proxy clients, hence more sessions. 

3) Benefits of cloud storage? Shadowfax uses shared cloud storage to copy data from SSD. However, this cloud storage does not seem to bring many benefits to the system, except for the data migration and recovery use case. Cloud storage also provides some limited protection from data loss when a server crashes and never reboots. But this protection only covers some data since “hot” keys in RAM are not written to SSD and consequently not copied to the cloud, leading to the cloud alone having incomplete or stale data. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. When Cloud Storage Meets RDMA

I am very behind on the reading group summaries, so this summary will be short and less detailed. In the 67th reading group meeting, we discussed the “When Cloud Storage Meets RDMA” paper from Alibaba. This paper is largely an experience report on using RDMA in practical storage systems. 

Large-scale RDMA deployments are rather difficult to manage and maintain. This is the result of RDMA over Converged Ethernet (RoCE), which requires a Priority Flow Control (PFC) mechanism to prioritize the RDMA traffic and prevent frame loss on RDMA traffic due to congestion. PFC includes the mechanisms to pause network nodes (i.e switches, NICs, etc) from sending data to an overwhelmed nodes. This can also create backpressures in the network and can cause degraded operation, dropped frames, and even deadlocks. So anyway, while RDMA needs PFC, it can cause major reliability issues even for large companies.

Alibaba paper presents a handful of engineering solutions and workarounds to deliver RDMA performance while maintaining high reliability. I will mention two general approaches the paper takes. 

First, the authors do not try to make every machine accessible via RDMA from anywhere in a data center. Instead, the network/datacenter is divided into podsets, and machines within a podset can take advantage of RDMA when talking to each other, while communication between podsets relies on good-old TCP. I suspect this provides a lot of isolation and greatly limits the extent to which PFC problems can go. 

The second approach is to fail fast. To support tight SLA, Alibaba engineers prefer to failover the problematic connections quickly. For that, upon detecting problems with RDMA, the podset falls back to using TCP. As a result, the Pongu system operates in this kind of hybrid mode — whenever possible, RDMA is used for low latency and high throughput, but TCP connection is always ready as a backup.

The paper talks about a few other engineering solutions and optimizations aimed at improving reliability and performance. For example, the network is built with redundant switches, including Top-of-Rack switches. One important performance optimization is offloading as much work away from the CPU and onto special-purpose hardware. The paper specifically talks about CRC computations offloaded to NIC to save memory bandwidth and CPU cycles. Kernel bypass (DPDK) is used as well. Many of these and other (i.e. user-space SSD optimized file system, RPC thread management) optimizations reside in the User Space Storage Operating System (USSOS) part of Pangu.

Now I realized that I actually have not talked about the Pangu storage system itself. The image taken from the paper is a decent summary. In short, the storage cluster runs in a podset, so BlockServers can access storage with RDMA. There can be many storage clusters, each cluster in its own podsets. The storage clusters, however, do not talk to the clients using RDMA, and the clients communicate with storage using TCP. So, in the best-case scenario, a client’s request will have one TCP hop and one RDMA hop inside the storage cluster instead of two TCP hops. The experimental data from the paper largely supports this, as latency is essentially halved for clients when RDMA is used. Similarly, the Pangu Master Cluster services communicate with storage and clients using TCP.  

As always, we had a paper presentation in the group. Akash Mishra presented this time:

Discussion

1) Bi-model operation. The fail fast and switch to a backup mode of operation allow for a more reliable operation of the system. However, having these two distinct modes and the spectrum of performance between them may present additional problems. Metastability may come to play here as the system transitions from fast mode to degraded mode due to some external trigger. If the system becomes overwhelmed in this degraded mode, it may develop a positive feedback loop that feeds more work and keeps it overwhelmed. In general, having distinct performance modes is a flag for metastability, and extra care must be used when the system transitions from fast to slow operation.

2) RDMA at other companies. The paper cites Microsoft’s experience report with RDMAquite extensively when talking about reliability issues of using RDMA at scale. So it is obvious that the experience is shared, and it may be the reason for RDMA being largely not offered in the public cloud. The last time I checked, Azure had a few RDMA-capable VM instances, while AWS EC2 had none (instead, EC2 offers its own tech, called EFA).

3) Offloading CPU. This was an interesting point as well. The paper notes that RDMA is so fast that the memory bandwidth becomes a bottleneck! To limit memory usage, Pangu offloads data verification through CRCs away from the CPU and onto smart NICs. This saves quite a bit of memory roundtrips on accessing messages/payload to compute/verify CRCs. Using various accelerators is a common theme in many large systems.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience

On Wednesday, we had our 26th reading group meeting, discussing RocksDB with a help of a recent experience paper: “Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience.” Single-server key-value storage systems are crucial for so many distributed systems and databases. For distributed folks like myself, these often remain black-boxes that you pick up and use. That is until something in your system starts to crumble peculiarly, and you dive in to investigate…

Anyway, what I want to say is that KV-stores are important. Their performance matters a lot in the world of fast CPUs and fast networks, where every millisecond of slowdown at storage can no longer be “masked” by other “slow” components. This is where this paper takes us — improving the performance, reliability, and feature-set of RocksDB over the years as technology and demands have evolved. 

To understand the experiences and lessons of the paper, we first need to look at the underlying technology behind RocksDB. In a nutshell, RocksDB is an LSM-tree (Log-Structured Merge tree) key-value storage. LSM trees have been used in storage for quite some time, as they are relay good for write-intensive workloads. The basic idea of the LSM tree is that data are stored sorted by key. These sorted files are called sorted-strings tables or SSTables for short.

Sorted Strings Table with key-value pairs.

Now, maintaining SSTables requires that data are written to storage sequentially in such a sorted state. Of course, the data does not come pre-sorted to a database, so the system needs to do something else before writing these sorted files to disk. The storage system will keep an in-memory buffer, called memtable, of some relatively large number of updates. This memtable can be represented as some tree structure to allow for efficient insertions. Before each operation is added to a memtable, it is written to a write-ahead-log (WAL) for durability. The WAL reconstructs a memtable in the event of a failure. Once memtable reaches a certain size, it is flushed to disk in a sorted manner. At this time a new empty memtable can start. An important aspect of writing these sorted files is keeping track of their recency order. 

Memtable and SSTable files/segments.

When a read request for a key arrives, the system first looks at the memtable to see if data is there. Memory lookup is relatively cheap since no disk access is needed. However, if the requested key is not in the memtable, then we must search on disk, starting from the most recent SSTable segment. Looking up data on a disk can be slow since the system needs to scan a good chunk of a file to find the spot where the key might exist in the sorted list. Naturally, we want to take advantage of the sorted nature of the file. For this, a system maintains a sparse index for each file with the offsets to narrow down the search. Then the system only needs to scan a portion of a file between the two offsets where the key may exist. If the data is missing in the most recent file, then a search continues in the next most recent one and so on. This process results in some peculiar behaviors. For instance, it generally takes less time to find more frequently used data. But it also takes a lot of time to find out that the data is missing entirely. Fishing for non-existent data is a waste of time, so an additional index, a bloom filter, can be used to tell whether the key is guaranteed to be missing.

Index points to some file offset. To lookup key ‘city’, find where ‘city’ fits in the index (between ‘blog’ and ‘food’) and search in that part of the file.

Another caveat the sequential writes create for us is dealing with old versions of data. See, when we write an SSTable to disk, it is immutable, and when an update or delete to a key comes in, this update will eventually flush to a more recent file. This influx of new data creates a situation where old data that is no longer needed keeps occupying space and potentially increases search time. So the system needs to clean up old data frequently. A compaction procedure mitigates the space amplification by cleaning up old data. It essentially takes multiple files and merges them into one bigger file.

Compaction removed old value of ‘city’. Old files are removed and replaced with a new compacted one.

So, my oversimplified descriptions of LSM storage is not necessarily how RocksDB operates, but it should give enough intuition for us to proceed and dive into the lessons and experiences of Facebook engineers working with RocksDB. 

Resources: IOPS vs Space vs CPU

The paper starts by exploring resource efficiency and how optimization priorities were changing over time. RocksDB runs best on SSDs, and these storage devices have a limited lifespan bound by the number of write cycles. Naturally, engineers focused on issues of write amplification (the same data rewritten multiple times) to make sure SSDs do not die prematurely. Interestingly enough, the paper almost makes it sound like write-amplification mitigation efforts were largely wasted. The authors state that the workloads used at Facebook are not too IOPS-heavy (does it meant they are not very write-heavy for write-optimized storage?), and storage space was a more pressing concern. Because of this, the engineers have shifted their efforts to the space-amplification problem (a key occupies more space than it needs to, for example, due to having multiple old versions of it).

Another issue brought up is the CPU utilization. Here, again, the paper states that CPUs are rarely a bottleneck. However, to me, it seems like these represent a delicate resource trade-off. For example, to reduce space consumption, we may need to use more aggressive compression that uses more CPU cycles and more aggressive compaction that needs both CPU and IOPs (and increases write-amplification). So I am not sure about the correctness of saying whether some resource here is a bottleneck or not. They all can be a problem, and it seems more about the ability to reach some balance for a given workload and infrastructure. I believe the need to find such balance in different applications is part of the reason behind the multiple compaction strategies mentioned in the paper.

A significant portion of the paper then focuses on dealing with resources at scale. For example, many instances of RocksDB may coexist on one server, requiring resource management to prevent one instance from hogging all the resources. Other resource-related aspects involve the treatment of write-ahead logs (WALs). For example, it is possible to completely turn off RocksDB’s internal WAL to conserve resources. Of course, this leaves the system vulnerable to data loss in the event of a crash, but this may not be a problem if an application using Rocks has its own WAL for things like transactions or replication. An interesting mention for resource management is rate-limiting file deletion. This issue seems a bit specific, but the authors explain how file deletion can be costly and impact other tenants using the same SSD.

Features

The paper also extensively talks about new features and their significance. Similar to how the authors have approached resource efficiency, these features largely stem from operating at scale. Many of the points simply make sense when I read them, but I suspect that these realizations were not as easy in practice and carry some production pain points. For example, we usually expect backward compatibility, but designing forward compatibility, where an older version should be compatible with a newer one, is definitely a result of sleepless nights after unrolling from some newer but buggy version and realizing that data files changed to the point that the old version no longer understands them. 

The flexibility of RocksDB is another weaved-in theme of the paper. Since the storage system is used in a variety of applications with a variety of needs, this again makes total sense. It appears that the main goal of many features is to make the system more extensible and fit into many different contexts without creating any roadblocks on purpose. One such example is improvements to configuration management that went from “in-code” configuration to having configuration files. However, one big configuration problem directly stems from the flexibility goal — too many different parameters to tune, and it seems like there is no good solution for this. 

The paper presents a few other examples of flexibility features meant to help build apps on top of RocksDB. If implemented, native versioned storage can greatly help systems relying on multi-version concurrency control (MVCC). This, however, may come with a performance penalty. At the same time, MVCC systems have already been relying on RocksDB for storage, since the “no roadblocks” principle provides great flexibility in how keys and values are encoded, allowing versioning information to be a part of the key. 

Replication and backup support got their own subsection in the papers, but this is nothing but a trivial “you can copy the files to another machine to start a new replica” approach. This is hardly a feature, but again, it plays nicely with the idea of designing a system with as few roadblocks as possible and letting users/engineers be creative with using it. 

Reliability

Reliability is a big topic in the paper. We want the data stored in the database to remain correct and intact. Luckily, there is a very concise summary for this — use checksums! The authors point out that their checksum procedures only work for data already in storage and that they are still working on checking the integrity of data in memtables. This memory corruption may not be that big of a problem though. Thankfully, unlike our personal computers, servers rely on ECC memory that can handle some memory issues all by itself. 

I will finish my summary with a large table of features and changes to RocksDB straight from the paper. 

And as always, we have our groups presentation by Rohan Puri available on YouTube:

Discussion

We had a very long discussion after the presentation. I think it lasted almost an hour, just talking about KV-stores in general and RocksDB in particular. There is no way I can possibly summarize every discussion point, but I will try to pick the important ones (by my judgment of their importance) 

1) Scratching the surface. This point started in our pre-presentation discussion. While the paper talks about many different features and issues and tries to explain the reasons for the decisions taken, some explanations barely scratch the surface. Of course, it would be rather difficult to talk about eight years of development and go into deep technical discussions. However, what interested the group the most are some rather odd talking points throughout the paper. For example, talking about rate-limiting file deletions is oddly specific. Why not have rate-limiting for all tasks that may have a high impact on IOPS? These oddly specific examples scream about rather interesting back-stories that are obviously missing from the paper. 

2) Checksums. The checksum discussion was rather interesting. There are multiple layers of checksums. For example, block checksums make a lot of sense, as they are written when an SSTable block flushes to disk. One observation made in the reading group is that the file checksums were added late in the RocksDB lifecycle. A plausible explanation for this is file checksums are rarely needed, as they would come in handy when, for example, copying the entire SSTable file from one machine to another to start a new replica. And in this hopefully rare occasion, we can check the integrity of the data the long way — open the file and go block by block and check block checksums. 

3) Replication. Obviously, RocksDB is a single-server system, but it serves as a store for many replicated systems. In the group, we found it interesting that the paper still talks about replication. However, the replication discussions in the paper boil down to designing the permissive systems that allow to built replicated solutions on top.

4) Too flexible? One of the bigger goals of RocksDB is its flexibility to fit into different applications with different requirements. This creates a system that has too many features, with any application only using a handful of them. However, this ability to tune and have all these features complicates the configuration and management of the system. One notable example is CockroachDB that developed its own in-house replacement for RocksDB with fewer features, and having fewer features seems to be a big bragging point for Cockroach folks. 

5) Impact of Facebook hardware infrastructure. One concern raised during the discussion was the impact of hardware infrastructure at Facebook on the overall design trajectory described in the paper. Of course, it is true, that Facebook deploys RocksDB in their systems and their own infrastructure. But it also means that other non-Facebook users have to adjust to decisions made with Facebook-grade infrastructure in mind. 

One such example is the write-amplification vs space-amplification discussion in the paper. While Facebook engineers have concluded that on their SSDs (and their workloads), write-amplification does not pose a serious risk of premature SSD failures, the same may not be the case for other users who may have lesser quality SSDs or more write-demanding workloads. It is a serious enough concern that authors acknowledge the existence of LSM-tree solutions with better write-amplification mitigation strategies. Moreover, at least some of these solutions have been put into production use already. 

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Facebook’s Tectonic Filesystem: Efficiency from Exascale

This time around our reading group discussed a distributed filesystem paper. We looked at FAST’21 paper from Facebook: “Facebook’s Tectonic Filesystem: Efficiency from Exascale.” We had a nice presentation by Akash Mishra:

The paper talks about a unified filesystem across many services and use cases at Facebook. Historically, Facebook had multiple specialized storage infrastructures: one for hot blob storage, another for warm blobs, and another for analytics. While these systems were optimized for their specific purpose, they had a major drawback of resource overprovisioning. For example, hot storage needs to have a lot of IOs per second (IOPS) and may overprovision storage to get these IOPS, leaving storage resources wasted. On the other hand, warm/cold storage has opposite requirements — it needs more storage and fewer IOPS, leaving the IOPS unused. Tectonic, as a unified system, aims to avoid such resource overprovisioning and fragmentation. Having one system to serve different workloads creates several challenges. First, of all the single systems need to scale to support much larger storage capacities than several isolated systems. Second, it needs to work just as well as the purpose-build components it replaces. 

The storage backbone of Tectonic is Chunk Store. It stores chunks of data that make up logical blocks that make up files. Chunk Store is linearly scalable storage that grows with the number of storage servers. This largely covers the data and IOPS scalability of the file system on the hardware level. However, another limiting factor in large file systems is metadata management. As files are being accessed, the clients must first consult the metadata sub-system to locate blocks for the file and enforce other file system guarantees. One problem Facebook had with HDFS in their analytical cluster is limited metadata server scalability, and therefore tectonic avoids “single server” Metadata Store. Instead, all metadata is partitioned based on some directory hashes to allow multiple metadata servers to work at the same time. Furthermore, Tectonic implements the Metadata Store as a collection of micro-services over a Paxos-replicated key-value store, allowing different metadata requests to be processed by different metadata services of the same partition. The client library has all the logic that orchestrates interaction with the Metadata Stores and the Chunk Store. Placing the filesystem logic on the client-side allows for tenant-specific optimizations to better suit the workloads.

Having a unified filesystem for many services at Facebook allows for more efficient resource usage. Tectonic distinguishes between two resource classes: non-ephemeral resources (storage capacity), and ephemeral resources (IOPS). Storage capacity is provisioned to a tenant and remains at the provisioned level until a reconfiguration. Ephemeral resources (IOPS and metadata query capacity), however, can “move around” between tenants depending on the need. The resource sharing/allocation is controlled by a somewhat hierarchical system. A tenant selects a TrafficGroup for its application. The TrafficGroup is assigned to a TrafficClass, which controls latency requirements and decides the priority of allocating the spare resources. 

As always, the paper has way more details!

Discussion.

1) Magnetic storage. Tectonic uses magnetic devices for storage. HDDs are much slower than SSDs, and a lot of new large storage systems are now targetting SSDs. We wonder how much of a design decision was influenced by the use of hard drives instead of SSDs. Resource sharing is a big part of the motivation and design of Tectonic, and since SSDs have different performance characteristics than magnetic storage (lower storage volumes, higher IOPS), these differences may have influenced the design of Tectonic or the entire need to have such unified system in the first place. Of course, at Facebook scale storage cost is a big consideration, and HDDs are cheaper. Cold/warm storage is a big part of Facebook’s architecture, so it makes sense to put infrequently used data on cheap disks and keep “hot” data in caches in the rest of the architecture stack. 

2) Sharded metadata. The metadata store in Tectonic is sharded, meaning that there is no single coherent view of the metadata at any given time, as different directories (subdirectories) may be on different shards. This potentially creates the possibility for some consistency issues, especially for metadata operations that may span multiple shards. Another issue of sharded metadata is getting directory stats, as subdirectories may be in different partitions, so there is no perfectly correct/up-to-date view. But of course, sharded metadata allowed scaling the system horizontally much easier than a more centralized approach. 

3) KV store for metadata. Metadata store is backed by a KV store. A lot of metadata fall naturally into the KV abstraction. For example, a key can be a directory name, and value is a list of its children. However, Tectonic does not store metadata in such a naive way. Instead, the value is expanded into the key. The need for this raised a few questions in the group. The basic here is avoiding the read and write (and a need for a transaction) when you need to append something to the list of values. In the expanded format, the value becomes a suffix of a key in the KV store, and adding something becomes as trivial as writing the key-value pair. Reading the entire list can be by scanning the store for a particular key prefix. Cockroach DB has a very good explanation of using KV storage to implement more complicated data models.

Paper has a few other examples of little performance tricks, like making some metadata records sealed or read-only to allow caching and partitioning schemes designed to avoid hotspots. 

4) Client-driven. An interesting aspect of the system is that it is entirely client-driven in terms of all of its logic. From one point of view, this allows different clients to implement their custom protocols for using Tectonic. The paper has quite a bit of detail on how different tenants interact with the rest of the system differently to optimize for latency or throughput. On the other hand, this creates a possibility of client conflicts, if not managed properly. For instance, Tectonic allows a single writer per file and avoids ordering concurrent writes from different writers. However, this also means that if two clients are competing, the one who writes to the metadata server last and gets a token will be the sole writer, and any data written by a different client will be lost, even if it has been written to the chunk store. We feel like some of these decisions are in part because the product is internal to Facebook, and they have more control over how their engineers use it. 

5) Comparison with other distributed FS. There are plenty of other distributed filesystems out there. Tectonic already compares itself somewhat with HDFS it replaced. Ceph is another popular one. It also has a more centralized, Paxos-replicated metadata service, which is good for consistency but may interfere with scalability. At the same time, Ceph clients (I think) cache the location of data, so maybe the need to query metadata is not very frequent. We also talked about comparing this to blob stores like S3, and extensions over S3, like EMRFS, for use as HFDS replacement. But these are not as general-purpose/flexible to fit multiple different use cases.

6) Verification. The paper does not mention any kind of formal verification done on Tectonic. One argument we had in the reading group is that Tectonic’s lax “eventual” consistency does not require checking. But this is not true, and there are plenty of opportunities to screw up even in the eventual consistency model (i.e. liveness issues, convergence, etc). Moreover, there are plenty more opportunities to make concurrency mistakes when using clients to build stronger consistency models on top. I think S3 was brought up again as an example of a team that verified the blob storage even before it was strongly consistent. We hope the some formal verification was done on a system this large and important, and that simply talking about it was outside of the scope of the paper.

Reading Group

Our reading groups takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Autoscaling Tiered Cloud Storage in Anna.

This week we looked at “Autoscaling Tiered Cloud Storage in Anna.” This is the second Anna paper. The first one introduces Anna Key-Value store, and the second paper talks about various “cloud-native” improvements. The presentation by Michael Whittaker is available here:

Short Summary

The Anna Architecture

Anna is an eventual-consistent key-value data store, where each value is a lattice. Anna can tweak some consistency guarantees by changing the type of lattice stored in the value. For example, it can be a Last Write Wins (LWW) system that merges operations based on the definition of the last writer (i.e. a timestamp), or it can use some CRDT lattice to preserve all updates/operations. The weak consistency of Anna means that it is relatively easy to add and remove nodes to scale the systems. The Autoscaling part takes advantage of that and dynamically adjusts the number of replicas responsible for each key to match the workload. Moreover, Autoscaled Anna has multiple usage tiers for keys based on the usage frequency. The paper discusses “hot” and “cold” keys, but more tiers are possible. Such tiering allows using heterogeneous hardware to better balance the cost and performance of Anna, as “cold” keys may run on cheaper hardware backed by SSDs, while “hot” keys may be store in RAM on more expensive VMs. Anna has rules for data transitioning between tiers and for controlling the degree of replication. For example, there is a minimum replication factor for durability, and there are usage thresholds for promoting and demoting keys to/from hot/cold tiers.

Discussion

We have spent quite some time talking about this paper, as there is a lot of information to digest. Below are a few of the bigger discussion points

1) Apache Cassandra comparison. This came up a few times as lacking, as the system is somewhat similar to Cassandra, at least when you consider the LWW lattice. There are many differences too, especially in the context of autoscaling. Cassandra uses a statically configured replication factor and lacks any autoscaling. Cassandra’s DHT model is pretty rigid and resists scaling – adding nodes require considerable ring rebalancing. Also, only one node can be added/removed from the ring at once to avoid membership and ring partitioning issues. Anna has no such problem and can adjust quicker and more dynamic. Additionally, for all the fairness, the original Anna paper compares against Cassandra.

2) Breaking Anna. Part of the discussion was around breaking the autoscaling and putting Anna in a bad/less-performant state. One example would be using a workload that changes at about the same rate as Anna can adjust. For example, if we hit some keys above the threshold to promote from the “cold” tier to “hot”, we can cause a lot of data movement as these keys potentially migrate or added to the memory-tier hardware. At the same time, we can reduce the usage, causing the systems to demote these keys, and experience associated overheads. This is like putting a workload in resonance with Anna’s elastic abilities, and it reminds me of a famous Tacoma Narrows bridge collapse

3) Possible consistency guarantees. One of the benefits of Anna is that some guarantees can be tweaked based on the lattice used for the value. Part of the discussion focused on how much consistency we can get purely out of these lattices, and how much may necessitate more drastic architectural changes. It is pretty straightforward to understand that Anna won’t ever provide strong consistency, but what about some other guarantees? For example, the original Anna paper claims the ability to have causal consistency, although it uses more complicated lattices, and keeps vector clocks for maintaining causality, not only making the system more complicated but also potentially harming the performance (something noted by the authors themselves).

4) Comparison fairness. Some points were mentioned about the evaluation fairness. It appears that some systems compared against are significantly more capable and/or may provide better guarantees(DynamoDB, Redis). It is worth noting, however, that there are not that many, if any, direct competitors, so these comparisons are still very valuable.

5) Cloud-Native. This appears as one the “most” cloud-native database papers that talks about some practical aspects of designing and running a  cloud-native database. One aspect of this “cloud-nativeness” is the focus on elasticity and scalability happening automatically. Another one is the cost-performance consideration in how the system scales with regard to workload changes. We noted some other much older papers that fit our definition of “cloud-native” papers. One example was Yahoo’s PNUTS paper that describes a global system with some capability to adjust to workloads and selective replication.

6) Scheduling/resource allocation. One of the challenging aspects of  Anna with autoscaling is figuring out the best key “packing” or binning” given multiple tiers and pricing constraints. This is similar to a Knapsack problem and bin packing problem, and both are NP-complete. So this resource management/scheduling will become a big problem at the production scale when we may have thousands of machines. At the same time, task packing problems at the datacenter level are well studied, for example in Google’s Borg cluster management system or its derivative Kubernetes

 

Our reading groups takes place over Zoom every Wednesday at 3:30pm EST. We have a slack group where we post papers, hold discussions and most importantly manage Zoom invites to the papers. Please join the slack group to get involved!

Reading Group. Aria: A Fast and Practical Deterministic OLTP Database.

In our 33rd reading group meeting, we discussed “Aria: A Fast and Practical Deterministic OLTP Database.” by Yi Lu, Xiangyao Yu, Lei Cao, Samuel Madden. We had a very nice presentation by Alex Miller:

Quick Summary

Aria is a transaction protocol, heavily influenced by Calvin, and it largely adopts Calvin’s transaction model, with one big difference. In Calvin, read and write sets of a transaction must be known beforehand, but Aria has no such strict requirement, allowing for generally more flexible transactions. Aria’s main goal is parallelizing transactions as much as possible to maximize the throughput. To that extent, Aria adopts batching and processes multiple transactions in each batch concurrently.

ariaEach batch operates in two phases: execution and commit. All transactions in a batch start the execution phase from the same snapshot (the result of committing and applying the previous batch). In the execution phase, each transaction concurrently executes and produces the new state, which is stored temporarily. Once all transactions in a batch have finished executing, the protocol moves into the commit phase, where Aria aborts any transaction that has a Write-after-Write(WAW) conflict or Read-after-Write(RAW) conflict. Aria uses unique and sortable transaction IDs to determine the order of transactions within the batch to find these conflicts. Any aborted transaction goes to the future batch, while the remaining successful transactions commit and apply their temporary execution results. Once the commit has finished, a new batch can start, and the protocol keeps moving in these lock-steps, synchronizing after each step. Every replica can run the protocol without synchronizing/communicating with other replicas unless the system is sharded/partitioned and one replica need to read data from another shard. An important optimization of Aria is deterministic reordering, where transactions within a batch can be reordered (i.e. not follow the order of their TIDs) to reduce the number of aborts, and consequently, reduce the amount of wasted work in the execution phase.

Discussion Points

We had good participation and touched on quite a few things in the discussion, but to the credit of the paper, we were able to find lots of answers there just by reading more carefully.

1) Transaction Serializability. Paper claims serializability, but it applies a bunch of transactions to the same snapshot. How does it compare to snapshot isolation? and what is the difference that allows serializability?

We think the batch processing and addition of RAW conflict within the batch makes a difference. Snapshot Isolation checks for WAW conflicts only and allows some artifacts. By also disallowing RAW conflicts, we can eliminate the problem. However, there are nuances with how a read set can be defined for conflict resolution. For example, What is a read set in a transaction running UPDATE items SET x=0 WHERE x=1? It can be either all items, or only items with x=1 if some indexing is used, and this difference may result in serializability issues. For the paper’s defense, it does not explicitly consider a SQL model, and also mentions that things like the above are up to the users to decide, and a fall-back to Calvin-like approach.

2) Batch size. This is a batched protocol, so the batch size may play a role in the performance. Batches that are too small will have more frequent barriers between batches and phases. Batches that are too large have a higher chance of having a super long transaction that stalls the entire batch, impacting the performance. The paper mentions that the transactions should take about the same time to execute for best performance to avoid the case when one slow transaction stalls the entire batch.

3) Paper readability. The overall consensus in the group was that this was one of the easier transaction appears to read.

4) Comparison with other transaction systems. SLOG paper we discussed a few months ago also uses determinism on each node to run transactions. Unlike Aria though, SLOG uses determinism for execution order, while Aria executes unordered (from the same state) and deterministically aborts the conflicting transactions. Overall it seems that Aria’s use of determinism is more extensive – the same snapshot, transactions are deterministic themselves, deterministic conflict search and abort, etc.

There was also a mention of CockroachDB’s transactions, since they also use snapshot, and produce temporary results, but Cockroach is more optimistic than deterministic. Also, Cockroach is more interested in low-latency transactions, while Aria is all about high throughput even at the expense of latency.

5) Deterministic reordering. The paper mentions that the reordering is a best-effort algorithm and not the most optimal one. It also seems that reordering plays a big role in having high throughput, so can we squeeze more performance with a better reordering algorithm? Obviously, it is not efficient to brute-force all possible permutations, but maybe some better heuristic approach?

aria-reorder

6) Performance variance under conflict. Aria works best for workloads with not a lot of conflict between transactions. Reordering helps, but not in all cases, as evidenced in the evaluation.

 

Join our reading group on slack for more discussions, paper schedule, and zoom participation.

Reading Group. High availability in cheap distributed key value storage

Our recent paper was “High availability in cheap distributed key value storage”. And what a paper that was! It was definitely a mind-tingling read the lead to a very interesting and long discussion session with the group.

Short Summary

The paper addresses the problem of fast recovery from the leader (primary) crashes in key-value stores backed by the non-volatile main memory (NVMM). NVMMs provide good latency but have much lower throughput than the DRAM. They are also a lot cheaper than DRAM for the same capacity, making NVMMs an interesting low-latency compromise between in-memory datastores and SSD-backed stores. NVMMs are, however, much more expensive than SSDs. In CANDStore, a primary replica is supported by the NVMM for fast latency, but the follower, or back-up, is backed by an SSD to save on the hardware. An additional “witness” replica exists. Witness, like the primary, is supported by NVMM, but it does not store actual data, and only has the operation log with placeholders for data. This is done, again, for cost reasons. The replication between a primary and the backup is done with a modified Raft algorithm. In a happy case, witness replica is a non-voting member of the cluster, much like in Cheap Paxos. As such, CANDStore is supported by heterogeneous hardware, with specific roles attached to specific hardware.

Upon the failure of a primary node, the witness is promoted to the primary role. Since the witness has no data, it has to learn it from the backup, but it does so smartly: first, learn the log with placeholders, once it knows all the placeholder, it can start serving new write requests that overwrite the key’s state. Knowing the position of the operation in the log is not enough to serve reads and a new primary needs to copy the data from the backup. It does so smartly and copies the “hot” keys first, followed by less frequently used keys. This allows primary to start serving hot reads very soon after gaining the ability to serve writes, and before the full copy of the data has finished. The paper claims this approach yields 4.5-10.5 times recovery speedup compared to the offline rebuild of a replica from an SSD. Backup failure is a less-discussed issue in the paper, but here the authors take the Cheap Paxos approach and make the witness vote on the quorum to select a new configuration.

Video Presentation

Discussion

The discussion was rather heated, as the paper has many points to digest.
1) Homogeneous vs. heterogeneous deployment. The biggest question is why do we need to restore a primary from a backup by doing a full copy? The backup node in a 3-replica configuration studied in the paper(1 primary, 1 backup, 1 witness) already has the latest state, so it makes sense to promote it to the leader/primary node instead of doing a full replica “build” from the witness state. This would undoubtedly result in a much faster primary recovery (while “build-out” of a new backup can happen in the background). Of course, this works only if the backup and primary are homogeneous, which is not the case here. So, this raises the question of why having such a heterogeneous system in the first place? Our best guess was the cost. Having an SSD-backed replica is cheaper than NVMM one.

2) Witness log. In the paper, a witness maintains a log, keeping track of all the keys and essentially having an order of updates. This is a placeholder log that has no values. In the process of recovery, the witness learns of the missing log entries, i.e. tail of the log that a primary and backup may have, but has not reached the witness. So if this learning phase exists, why having a log at all? just make a witness learn the log (or relevant suffix of the log) when recovery is needed. In the discussion, we think the witness has a log to speed-up recovery when it is needed.

3) Performance comparison of DRAM/NVMM/SSD. The paper’s heterogeneous setup leverage the fact that these types of storage have different performance and price. DRAM is the fastest but also the most expensive, NVMM presents a latency compromise, while SSD is the price compromise. However, if the performance of NVMM is close to that of SSDs, there may be fewer incentives for such design. We looked at a few sources. And it appears that NVMM has good latency, but its throughput is somewhat on par with SSDs

4) The performance of the primary. Another question we had is why NVMM is fast enough for primary, but not for backup? This loops back to (3), and it seems like NVMM would do ok for a backup, given its similar throughput to an SSD. However, we speculate that (a) the NVMM performance may change less favorably depending on the write block size, and (b) again the cost and not the performance is the reason for having an SSD backup.

5) One witness for multiple partitions. The paper evaluates a single-partition 3 node system. However, in such a setup, we have a witness that must run on hardware capable of supporting a primary. This does not lead to cost-savings, as primary hardware is more expensive. We speculate that in a sharded system the cost of running a witness can be amortized by having single witness hardware shared across many partitions. When one partition needs a witness promoted, it takes over the resources, and forces other partition to get new witness nodes. This of course has several complications. Firstly, a correlated failure between partitions may not be tolerated well, when two partitions need a new primary, but they share a witness. Secondly, managing such a system become more challenging, since a witness promotion leads to the cleanup of other partition-witness data, finding new witness nodes, etc

6) Hot/cold benefits. Does hot/cold key separation benefit anything except recovery? The backup cannot benefit from this, it (normally?) serves no user work, so there is not much difference. It maintains hot/cold keys to make a copy upon the recovery prioritized to reduce performance degradation. Primary nodes, on the other hand, may benefit. Keep the “hot” keys in DRAM, and “cold” ones in the NVMM storage.

7) Backup failures. This aspect is not discussed in much details in the paper, except mentioning that the Cheap Paxos approach is used to reconfigure and bring up a new backup. One thing that caught the group’s attention is that a backup recovery puts a lot more work on the primary – it has to copy data to the backup, while still serving the operations (given that the backup recovery is done on-line).

8) More backup nodes. So, in relation to (7), what if we have more backup nodes, lets say 2 nodes instead of 1? This means that one backup should be available to restore the failed one. But what does this mean for write quorums? Do we still need all backup in the quorum? or just one out of 2 backups is enough? If we require all backups for the quorum, we also need two witnesses to tolerate two failures and be able to reconfigure with Cheap Paxos. If we require just one backup out of two, do we need a witness (aside from promoting it to the primary in case of a primary crash). Also, how does such two backups setup compare in terms of cost saving?

9 and beyond) We had way more discussion points as well. Can this heterogeneous setup benefit other copy-have workloads for databases, like elasticity tasks? Can heterogeneous deployments be used with regular Multi-Paxos/Raft? Why using modified Raft if most of the things that differ Raft from Paxos were undone? You can read more about these in our slack discussion group.

One Page Summary. Gryff: Unifying Consensus and Shared Registers

This paper by Matthew Burke, Audrey Cheng, and Wyatt Lloyd appeared in NSDI 2020 and explores an interesting idea of a hybrid replication protocol. The premise is very simple – we can take one protocol that solves a part of the problem well, and marry it with another protocol that excels at the second half of the problem. This paper tackles replication in geo-distributed strongly consistent storage systems. The authors argue that consensus, when used in storage systems with predominantly read and write operations, is inefficient and causes high tail latency.

A system presented in the paper, called Gryff, takes advantage of predominantly read/write workloads in storage systems and exposes these two APIs via a multi-writer atomic storage ABD protocol.  ABD operates in two phases both for reads and writes. On writes, ABD’s coordinator retrieves the latest version of the register from all nodes and writes back with a version higher than it has seen. On reads, ABD’s coordinator again retrieves the register, writes the highest version back to the cluster to ensure future reads do not see any previous versions, and only then returns back to the client. The write-back stage, however, can be skipped if a quorum of nodes agrees on the same version/value of the register, allowing for single RTT reads in a happy case.

Unfortunately, ABD, while providing linearizability, is not capable of supporting more sophisticated APIs. Read-modify-write (RMW) is a common pattern in many storage systems to implement transaction-like conditional updates of data. To support RMW, Gryff resorts back to consensus and in particular to Egalitarian Paxos (EPaxos) protocol. Choice of EPaxos allows any node in the cluster to act as the coordinator, so it does not restrict writes to a single node like with many other protocols. The problem of this hybrid approach is then the ordering of operations completed with ABD protocol and RMW operations running under EPaxos. Since EPaxos side of Gryff works only with RMWs, it can only order these operations with respect to other RMW operations, but what we need is a linearizable ordering of RMWs with normal writes and/or reads. To keep the ordering, Gryff uses tuples of ABD’s logical timestamp, process ID and the RMW logical counter, called carstamps. Carstamps connect the ABD part of the system with EPaxos – only ABD can update ABD’s logical clock, and only EPaxos updates RMWs counter.

When we consider the interleaving of writes and RMWs, the write with higher ABD’s logical time supersedes any other write or RMW. This means that we actually do not need to order all RMWs with respect to each other, but only order RMWs that have the same base or ABD’s logical time. EPaxos was modified to allow such partial ordering of commands belonging to different bases, essentially making the protocols to have different dependency graphs for RMWs applied to different ABD states. Another change to EPaxos is the cluster-execute requirement, as the quorum of nodes need to apply the change before it can be returned to the client to make the change visible for subsequent ABD read operations.

gryff_1
So, how does Gryff do with regards to performance? Based on the author’s evaluation, it is doing very well in reducing the (tail) latency of reads. However, I have to point out that the comparison with Multi-Paxos was flawed, at least to some extent. The authors always consider running a full Paxos phase for reads, and do not consider the possibility of reading from a lease-protected leader, eliminating 1 RTT from Paxos read. This will make Paxos minimum latency to be smaller than Gryff’s, while also dramatically reducing the tail latency. Gryff also struggles with write performance, because writes always take 2 RTTs in the ABD algorithm. As far as scalability, authors admit that it cannot push as many requests per second as EPaxos even in its most favorable configuration with just 3 nodes.gryff_2

Can Paxos do better? We believe that our PQR optimization when applied in WAN will cut down most of the reads down to 1 quorum RTT, similar to Gryff. PQR, however, may still occasionally retry the reads if the size of a keyspace is small, however this problem also applies to Gryff when the cluster is larger than 3 nodes.

What about Casandra? Cassandra uses a protocol similar to ABD for its replication, and it also incorporates Paxos to perform compare-and-set transactions, which are one case of RMW operation, so in a sense Gryff appears to be very similar to what Cassandra has been doing for years.