Spring Term Reading Group Papers: #131-140

A new set of papers for spring and early summer!

Reading Group. DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems

In the 127th meeting, we discussed the “DeepScaling: microservices autoscaling for stable CPU utilization in large scale cloud systems” SoCC’22 paper by Ziliang Wang, Shiyi Zhu, Jianguo Li, Wei Jiang, K. K. Ramakrishnan, Yangfei Zheng, Meng Yan, Xiaohong Zhang, Alex X. Liu.

This paper argues that current autoscaling solutions for microservice applications fall short in several ways. The first is the timing of autoscaling. Rule-based systems, for example, use threshold boundaries on performance metrics to decide whether to scale up or down. This reactive approach responds to system behavior that has already happened, so the thresholds must account for the lag: one may set them conservatively to let the autoscaler "spool up" once a threshold is hit, but before the system is in real danger of violating SLAs or needing additional resources. Naturally, tuning these thresholds also requires expert knowledge of the systems being autoscaled. An alternative to rule-based solutions is learning-based autoscaling, which relies on machine learning. This reduces the need for domain knowledge, as the autoscaling system can learn the patterns itself. However, current learning-based solutions have another limitation: they rarely optimize resource usage, since utilization varies frequently. Maximizing resource utilization translates into monetary savings, so it is important for an autoscaling system to provision resources accurately in a dynamic environment.

The paper proposes DeepScaling, a system that optimizes resource usage (in this case, CPU utilization) while meeting SLO constraints. DeepScaling does so by forecasting the future workload based on historical data. From this prediction, it estimates the resource requirements for the system in the near future and provisions or decommissions resources as needed. These tasks are separated into three components: Workload Forecaster, CPU Utilization Estimator, and Scaling Decider. There are a few other components, such as load balancers, controllers, and monitors, to ensure predictable load spread in the cluster, perform data collection, and enact the scaling decisions. This predictive approach allows DeepScaling to anticipate load increases and provision enough new capacity ahead of time, eliminating the need for overly conservative autoscaling bounds.
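As a rough illustration of this three-stage split, here is a toy sketch in Python. The component names follow the paper, but the models themselves (a moving average, a linear CPU cost, a target-utilization rule) are stand-ins I made up, not DeepScaling's actual learned models.

```python
import math

def forecast_workload(history):
    # Workload Forecaster: predict next-interval requests/sec.
    # Toy stand-in: average of recent intervals instead of a learned model.
    return sum(history[-3:]) / min(len(history), 3)

def estimate_cpu(workload, cpu_per_request=0.02):
    # CPU Utilization Estimator: convert predicted workload into CPU cores.
    # A linear cost model is an assumption for illustration only.
    return workload * cpu_per_request

def decide_scaling(cpu_needed, cores_per_instance=4, target_util=0.6):
    # Scaling Decider: instances required to keep utilization near target.
    return math.ceil(cpu_needed / (cores_per_instance * target_util))

history = [900, 1100, 1300]              # requests/sec in recent intervals
predicted = forecast_workload(history)
instances = decide_scaling(estimate_cpu(predicted))
print(predicted, instances)              # -> 1100.0 10
```

The decoupling matters for the discussion below: the forecaster is per-service and learned, while the CPU estimator and decider can be adjusted independently.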

The paper provides significantly more details on the machine learning components, but I am less interested in these as a systems person. The paper also provides a good evaluation of these components, so it is actually worth the read, especially for more ML-inclined folks. 


1) Multiple steps. DeepScaling estimates how many resources it will need in the near future so it can satisfy those needs ahead of time. It does so by estimating CPU utilization, but indirectly: it first predicts the workloads and then converts these workload estimates into CPU requirements. Why not build and learn a CPU model directly? Our understanding is that decoupling the two processes has several advantages. The workloads for different services can be learned and predicted separately. Adding or removing services from the cluster does not require retraining the model, since the CPU requirement can be estimated from the sum of the predicted workloads. Similarly, if a service changes significantly and requires a different amount of resources for the same work, the system can account for that in the CPU estimator. In the worst case, if the workload characteristics have changed as well, only the model for that one service needs to be retrained.

2) Backup autoscaling. One possible problem we discussed was the ability to react to unexpected changes. With a purely predictive system, we believe there is a need for a rule-based backup that can kick in if the predictive model fails.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Method Overloading the Circuit

In the 126th reading group meeting, we continued talking about the reliability of large distributed systems. This time, we read the “Method Overloading the Circuit” SoCC’22 paper by Christopher Meiklejohn, Lydia Stark, Cesare Celozzi, Matt Ranney, and Heather Miller. This paper does an excellent job summarizing the concept of circuit breakers in microservice applications. The authors discuss several potential problems with circuit breakers, present a taxonomy of circuit breakers, and most importantly, talk about improving circuit breakers and their usage.

Circuit breakers are mechanisms to stop system overload in the presence of failures. This is the same task circuit breakers perform in electrical circuits, hence the borrowed name. The idea is to monitor the flow of requests/RPC calls through the system and, if errors start happening along some specific flow, to “break the circuit” and stop that request flow. As such, circuit breakers minimize further damage after the initial failure by dropping potentially problematic work. Naturally, if no requests flow through the system, the application will never learn whether the problem has been rectified. So, to allow recovery, circuit breakers must occasionally go into a “half-open” state and let some requests through, usually after a timeout period. If the failure persists, the circuit breaker falls back to the open state and continues to shed the problematic work.
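The closed/open/half-open state machine described above can be sketched in a few lines of Python; the failure threshold and reset timeout here are illustrative defaults, not values prescribed by the paper.

```python
import time

class CircuitBreaker:
    """Minimal sketch of a circuit breaker state machine."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"    # probe with one request
                return True
            return False                    # shed the work
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"               # recovery confirmed

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"             # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
```

A caller would wrap each RPC: check `allow_request()` before the call, then report the outcome with `record_success()` or `record_failure()`.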

The paper presents a motivating example for circuit breakers — some API endpoint is malfunctioning, causing users/client applications to retry. The retries create even more faulty requests, making the failure worse by overloading the system. This example is a Metastable failure with load amplification caused by retries, and as we know, load shedding is a primary mechanism for both preventing metastable failures and recovering from them.

What makes circuit breakers special is the ability to shed load selectively along specific faulty execution paths through the system or for some specific type of request. Going back to the example, dropping only requests that trigger the malfunctioning endpoint in a service allows the system to remain operational for other unaffected path/request flows. Unfortunately, this is not only a strength of circuit breakers but also their weakness — high specificity/selectivity makes them harder to use correctly.

Consider a situation when one API of a service fails. If we use the same circuit breaker for all calls to that service, then this circuit breaker may start seeing an elevated error rate and trip, causing all requests to the service to drop, including the ones that were bound for correctly functioning APIs. Similarly, in the case of one API endpoint, it may be possible that some requests work fine while others do not (for example, requests for some, but not all, users work fine), necessitating a circuit breaker that can distinguish between problematic and ok requests to the same endpoint.

The paper then taxonomizes circuit breakers based on several criteria, such as their implementation, installation location, and whether the installation is explicit. These criteria impact the circuit breakers’ selectivity, or the ability to distinguish between different requests and their faults (the paper calls this sensitivity, but I prefer the word selectivity, as it describes how selective the circuit breaker is toward a type of request). For instance, the installation site of a circuit breaker plays a substantial role. A circuit breaker can be installed at the callsite, meaning it “wraps around” the RPC call to the dependency service. If the same RPC is called in another place in the code, that place will have a different callsite circuit breaker, and the two will work independently to determine faulty conditions. A method-installed circuit breaker appears in the method that calls another service (it can be installed with method annotations, for example). In this case, all functions calling the method that uses the other service’s RPC will share the circuit breaker. As you can imagine, this can lead to less sensitivity/selectivity, as many different execution paths may converge on one method that performs an RPC. A client-level circuit breaker uses one shared circuit breaker per client, potentially making it even less selective.

Luckily, as with many things distributed, the solution is partitioning. In the case of circuit breakers, we want to partition all possible failure scenarios, execution paths, request configurations, and anything else that can impact request execution into separate circuit breakers, so that each circuit breaker is as selective to a particular request flow or request type as possible. I think this general suggestion is much easier said than done, and in practice, achieving very good partitioning can be challenging.
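One way to picture this partitioning is a map from a fine-grained key to its own breaker instance. This is my own sketch, with a hypothetical key of (callsite, endpoint, request type); the paper does not prescribe a specific key shape.

```python
from collections import defaultdict

class PartitionedBreakers:
    """One breaker per (callsite, endpoint, request-type) key, so a fault
    in one flow does not trip breakers for unrelated flows."""

    def __init__(self, breaker_factory):
        # breaker_factory creates a fresh breaker for each unseen key.
        self.breakers = defaultdict(breaker_factory)

    def breaker_for(self, callsite, endpoint, request_type):
        # The finer the key, the more selective the shedding -- and the
        # more breaker state the system must track.
        return self.breakers[(callsite, endpoint, request_type)]

# dict is a trivial stand-in factory; plug in a real breaker class instead.
pool = PartitionedBreakers(dict)
assert pool.breaker_for("svcA.call1", "/users", "GET") is not \
       pool.breaker_for("svcA.call2", "/users", "GET")
```

The same lookup returns the same breaker on repeated calls, so state (failure counts, timers) accumulates per flow rather than per service.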

Ok, I do not want to rewrite the entire paper in this summary, but there is a lot more content there, including code examples, full taxonomy, some improvements to circuit breakers, and hints at unsolved problems, so please read the paper for more!


1) Purpose and challenges. The purpose of circuit breakers is to avoid overloading the system or some of its components by shedding work that is likely to fail. This overload can come from a few sources: (1) retries after the initial failed attempt, (2) some other mitigation and/or client-level workarounds for the failed component, and (3) overload due to expensive error handling. The circuit breakers can do well with overload sources (1) and (3). 

Another way to prevent overload, at least due to retries, is to shed only the excess work. See, a circuit breaker sheds all the work that may be problematic, which may not always be a good idea. For instance, if the problem exists due to a high load on some component, stopping all load to it will appear to fix the issue, resetting the circuit breaker and causing the high load to go to the affected component again, causing errors, and tripping the circuit breaker(s) again. I am not entirely sure what is better — a cyclical failure or a persistent one. From the debugging standpoint, I think, a persistent failure may be easier to identify.

So, we may want to shed only excess or extra work, but not the work that would have come organically if there was no failure. This extra load shedding achieves several goals — for faults caused by overload, we avoid cyclical on/off behavior (especially if we can add a simple load shedder on top that can drop any excess “good” work). For non-overload-related failures, sending work through can help with intermittent problems and also speed up recovery once the issue is fixed. Of course, it is not easy to identify and shed that “extra” load under all circumstances; for example, it may be hard to control a person with an itchy “F5 finger” when they do not see a page loading quickly.
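A minimal sketch of this "shed only the excess" idea, assuming a known organic baseline rate; a real system would have to estimate that baseline from pre-failure traffic rather than hard-code it.

```python
import time
from collections import deque

class ExcessLoadShedder:
    """Admit roughly the organic baseline rate; drop the surplus
    (e.g., retry storms) instead of cutting off all traffic."""

    def __init__(self, baseline_rps, window=1.0):
        self.baseline_rps = baseline_rps
        self.window = window
        self.admitted = deque()   # timestamps of recently admitted requests

    def admit(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that fell out of the sliding window.
        while self.admitted and now - self.admitted[0] > self.window:
            self.admitted.popleft()
        if len(self.admitted) < self.baseline_rps * self.window:
            self.admitted.append(now)
            return True   # within the organic baseline -- let it through
        return False      # excess (likely retries) -- shed it
```

Unlike a circuit breaker, this keeps the downstream component busy at its normal load level, avoiding the trip/reset oscillation described above.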

Shedding only excess work may have several additional benefits depending on the system. For example, keeping the system busy at the level of the actual organic offered load may help keep the affected/failed services from downscaling due to low load after the circuit breaker trips. Similarly, for cached components, doing work, even such “wasted” work, may keep caches warm for when the problem is fixed. Thinking about the state of the system when the initial problem gets fixed and the circuit breakers reset is important, as it needs to be ready for a quick influx of requests to components that were sitting behind the open circuit breaker for some time.

2) Implementation difficulties. Partitioning the circuit breakers may be hard to implement in large systems, as we need a lot of them to tune the system for proper sensitivity. We also need to have processes/procedures to adapt to changes in the system and make sure that over time old circuit breakers do not cause more harm than good. And finally, with many circuit breakers, there is a question of resource usage. Something has to keep track of all these circuit breakers, their timers, failure counts, etc. 

3) From microservices to stateful systems. Our reading group has been gravitating toward distributed systems with large state: databases, data stores, etc. So naturally, is there anything similar that can be done in these systems? Metastable failures can be devastating here, especially as scaling stateful services is a challenging and resource-intensive process that often cannot be done when the system is already overloaded. The basic principle remains the same: these systems need to shed load when they are overloaded. The big question is what work to drop and where in the system to do so.

On the other hand, databases do not just exist by themselves; they are used by other apps and services. Can we protect databases/stores from overload with circuit breakers installed in the apps at the database call sites? It is not that easy with the existing circuit breaker designs: if the database overloads, all circuit breakers across the app may trip open. The load on the DB will fall, causing the circuit breakers to close, and the process may repeat in a cyclical pattern. It is possible that we need a smarter “overload-specific” circuit breaker somewhere that can drop or prioritize load coming to the stateful components.


Reading Group. How to fight production incidents?: an empirical study on a large-scale cloud service

In the 125th reading group meeting, we looked at the reliability of cloud services. In particular, we read the “How to fight production incidents?: an empirical study on a large-scale cloud service” SoCC’22 paper by Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. This paper looks at 152 severe production incidents in the Microsoft Teams service. The authors looked at these incidents and distilled them into a handful of categories in terms of the root cause, mitigation type, detection, etc. And from placing the incidents into such categories/buckets, some interesting patterns started to emerge regarding the timeliness of incident mitigation, mitigation approaches, and potential areas for improvement. 

I liked that the paper described its data collection methodology, since the categorizations may be rather subjective. However, I will mention only one detail: the authors assigned a single root cause to each incident, even though some incidents are complex and may have more than one contributing factor. I also liked that the paper cautions readers against hasty generalizations: the study focuses on just one large service, and many of the findings may be specific to it.

So, with the above disclaimer in mind, what have we learned from the paper? The paper’s findings can be roughly broken down into a handful of topics: root cause, detection, mitigations, and post-incident lessons learned by the team/engineers. 

On the root cause side of things, different bugs in the service (Teams) only make up roughly 40% of the incidents (27% code bugs + 13% configuration bugs). Other categories include infrastructure failures, deployment failures, and authentication problems. On the infrastructure side of things, the paper separates infrastructure failures into three different categories. The first one, referred to as “infrastructure failures,” deals with scalability problems, like the inability to get enough nodes to run the work. The second infrastructure root cause bucket is “dependency failures.” Finally, the failures of databases & network dependencies get their own root cause category. But if we combine the three together, all infrastructure failures are around 40%. 

On the detection side, the paper suggests a significant deficiency in detection and monitoring systems and practices. Almost half of all incidents had some kind of detection malfunction, with 29% of incidents reported by external users and another 10% reported by internal users! If we look at the automated detection issues, many are due to bugs, lack of automated monitors, or lack of telemetry. 

For incident mitigations, the authors discuss the common types of mitigations and the reasons for some of these mitigations requiring more time to address the problem. While around 40% of issues were due to bugs, only 21% of all mitigations required a bug fix (8% for fixing code and 13% for fixing config). Additionally, 11% of issues relied on some “ad-hoc” fixes, which the authors describe as “hot-fixes.” The paper conjectures that it takes substantial time to go through the process of fixing bugs during mitigation. Instead, a faster way to recover is to initiate a rollback (22% of cases) or perform an “infrastructure change” (another 22% of incidents). By “infrastructure change” the authors mean scaling the system to take on more nodes/CPU. 

As for mitigation delay reasons (the paper calls them mitigation failures, although the mitigations themselves did not fail but rather took longer), the authors describe several common causes. Inadequate documentation and procedures are at the top of the list. Another common one is deployment delay, which occurs when it takes a long time to deploy the fix.

With all of the above findings, the lessons learned by the engineers come down to the following: more automation, more testing, and more changes within the organization (behavioral change, better documentation, better coordination). I want to focus a bit on the automation part, as it hits a bit too close to home for me. Most automation suggestions (26%) boil down to better tests, such as performance tests, chaos engineering, and better end-to-end testing. I am a bit skeptical about these numbers, as testing is something we blame all the time when problems occur; it is a reflex response to say that more testing is needed. And of course, we need more testing, but mind you, these same people who call for more testing when dealing with issues write the features and (inadequate?) tests when they are not on call.

What caught my attention is the need for automated deployment. More specifically, “automated deployment” includes “automated failover.” Just the fact that this is a recurring ask from engineers puzzles me a lot. This means that at least a handful of times per year, running Microsoft Teams requires engineers to manually reconfigure the system to remove failed nodes/machines/services and switch the load to new/backup ones. 

The authors discuss several more insights after doing a multidimensional analysis. I am not going to go in-depth here; please refer to the paper for these details.


1) General applicability of results. As mentioned in the paper, all observations are taken from one service, so your mileage may vary. Another point to note is that Microsoft is a large organization with decently mature monitoring, automation, and deployment tools, which may impact the number and severity of observed incidents. It is quite possible that with a less mature set of tools, one may observe a greater variety of serious problems.

Another point to mention with regard to applicability is that there are many other large systems/services that are very different from Microsoft Teams. For example, if we look at systems that maintain a lot of state (i.e., databases), we may see other common root causes and mitigation patterns. For instance, a typical mitigation strategy observed in the paper is throwing more resources at the problem. This strategy works great with systems that run a lot of stateless or small-state components/microservices. Adding more resources allows scaling performance bottlenecks as long as underlying databases/storage dependencies can handle the load. Scaling stateful systems, like databases, often requires spending resources upfront to move data/state around — something a system cannot do well when overloaded. 

2) Lack of concrete actionable items. The findings are interesting from an educational standpoint, but they lack clear directions for improvement. The authors give general advice, along the lines of improving testing, building better automation, etc., but there are no concrete steps/procedures that may help improve systems’ reliability. One point made in our discussion is the need to focus on the specific type of problem to find more concrete solutions for that problem. 

3) Teams Outage. On the day we discussed this paper, Teams experienced an outage, likely related to some networking configuration issues.


Reading Group. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service

In the 120th DistSys meeting, we talked about “Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service” ATC’22 paper by Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, Akshat Vig.

The paper is loaded with content as it presents many different things, spanning ten years of development. None of the topics are covered in great detail, but I think it is still a great overview of such a massive project. Obviously, the authors discuss DynamoDB, its architecture, and its design. The paper also provides a brief history of the system and examines several challenges/lessons the team has learned while operating such a massive scale system.

To start with the architecture, the users interact with the system by reaching out to the request router. The router can perform the authentication and admission control. Most importantly, however, the router has access to partition metadata, allowing it to, well, route the requests to proper storage nodes and replicas. A node hosts multiple replicas for different partitions.

So, speaking of partitions, each data item in DynamoDB has a unique primary key. These primary keys group items into partitions, which are replicated with Multi-Paxos for redundancy across replicas in different availability zones. The assignment of key ranges to partitions (and partitions to nodes?) constitutes the metadata needed by the request router.

DynamoDB has two types of replicas — log and storage replicas. Log replicas only contain replication write-ahead logs. Storage replicas, in addition to having a log, also maintain a state derived from applying/executing the logged commands against the B-tree storage. Both replica types can participate in Paxos quorums, but log replicas are more lightweight and cannot serve reads/queries. The system uses log replicas to improve availability and reliability — it is easier to spin up a log replica that only needs to accept new commands than to rebuild a full storage replica with all the partition’s data. This speed becomes important under failures to restore the system to the proper degree of replication quickly.
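A toy model of the two replica types may help; this is my own simplification with illustrative names and quorum sizes, not DynamoDB's actual replication code.

```python
class Replica:
    def __init__(self, name, kind):
        assert kind in ("storage", "log")
        self.name, self.kind, self.log = name, kind, []

    def append(self, entry):
        self.log.append(entry)           # both types persist the WAL
        return True

def replicate(replicas, entry, quorum):
    acks = sum(r.append(entry) for r in replicas)
    return acks >= quorum                # committed once a quorum acks

def readable(replicas):
    # Only storage replicas apply the log to a B-tree and can serve reads.
    return [r for r in replicas if r.kind == "storage"]

group = [Replica("az1", "storage"), Replica("az2", "storage"),
         Replica("az3", "log")]          # a log replica restores quorum cheaply
print(replicate(group, "put(k, v)", quorum=2), [r.name for r in readable(group)])
# -> True ['az1', 'az2']
```

The point of the sketch: a failed storage replica can be replaced by a log replica almost instantly for write availability, while a full storage replica is rebuilt in the background.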

From the historical perspective, while DynamoDB started as a pretty basic NoSQL (key-value) store, it has added many features over time, such as secondary indexes, JSON documents, encryption, transactions, and more.

Finally, a decent chunk of the paper focuses on various nuances of running large-scale NoSQL data stores. For example, the paper notes data errors and how DynamoDB verifies the data integrity with checksums for every data transfer between nodes. DynamoDB also does background data verification at rest. Another important lesson on the reliability side of things is the need to maintain enough capacity in the metadata system. While the request routers use caches for metadata to improve performance, a metastable failure in the caching system led to a rather big outage. After the fact, the caches are used only to improve the latency, and no longer offload capacity from the main metadata storage service — all requests for metadata go through to the service even if they are answered by the cache first. This ensures having adequate capacity to serve critical metadata operations regardless of the cache failures. Naturally, this is a more expensive solution for the sake of reliability.
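The resulting cache pattern can be sketched as follows. This is my own reading of the design, with a hypothetical `backing_store` callable standing in for the metadata service; the real system is of course asynchronous and far more involved.

```python
class PassThroughCache:
    """Cache used only for latency, not for capacity offload: the cache may
    answer first, but every lookup still reaches the backing service, so the
    service stays provisioned for the full request load."""

    def __init__(self, backing_store):
        self.backing_store = backing_store
        self.cache = {}

    def lookup(self, key):
        hit = self.cache.get(key)
        # Forward to the backing service unconditionally -- the cache no
        # longer shields it from load, it only hides the service's latency.
        fresh = self.backing_store(key)
        self.cache[key] = fresh
        return hit if hit is not None else fresh
```

The design trades extra backing-store traffic for the guarantee that a cold or failed cache cannot create a sudden, unserveable load spike on the metadata service.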

The authors discuss plenty of other lessons and challenges, such as managing load, balancing system capacity, implementing backups, and improving availability.


Reading Group. The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation

In the 122nd reading group meeting, we read “The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation” paper by Ruihong Wang, Jianguo Wang, Stratos Idreos, M. Tamer Özsu, Walid G. Aref. This paper looks at the trend of resource disaggregation in the cloud and asks whether distributed shared memory databases (DSM-DBs) can benefit from memory disaggregation (MD) to become the next “hot” thing in the database world. 

The idea, on the surface, is simple: decoupling compute from memory enables databases with many separate stateless compute workers for query processing that share a remote memory pool. According to the authors, the driving force for this is RDMA, as it provides latency and bandwidth somewhat comparable to local memory. On top of that, such a system can also use disaggregated storage for durability. The bulk of the paper then focuses on the challenges of such a disaggregated design without going in-depth into the database architecture itself. The paper also does not go deeply into the design of the disaggregated memory system, although it points out a few issues to solve.

The first challenge listed by the authors is the lack of appropriate APIs to access disaggregated memory. In particular, the paper suggests APIs that are more in line with how memory is managed locally, i.e., memory allocation APIs. In the disaggregated case, allocation must work with the virtual address space of the memory system. The authors also suggest data transmission APIs to facilitate moving data for local caching at compute nodes. Finally, a function offloading API can move some simple computation to the memory system (does this not defeat the purpose of separating memory and compute?).
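To make the proposed API surface concrete, here is a hypothetical sketch of the three call families (allocation, transmission, offloading). The remote pool is simulated with a local bytearray; none of these calls correspond to a real RDMA interface, and the bump allocator ignores freeing entirely.

```python
class DisaggregatedMemoryPool:
    def __init__(self, size):
        self.mem = bytearray(size)   # stand-in for the remote memory pool
        self.next_free = 0

    def alloc(self, nbytes):
        # Allocation API: returns an address in the pool's virtual space.
        addr = self.next_free
        self.next_free += nbytes
        return addr

    def write(self, addr, data):
        # Transmission API: push bytes (think one-sided RDMA write).
        self.mem[addr:addr + len(data)] = data

    def read(self, addr, nbytes):
        # Transmission API: pull bytes for caching at a compute node.
        return bytes(self.mem[addr:addr + nbytes])

    def offload(self, fn, addr, nbytes):
        # Offloading API: run a simple function "near" the data.
        return fn(self.read(addr, nbytes))

pool = DisaggregatedMemoryPool(1 << 20)
addr = pool.alloc(5)
pool.write(addr, b"hello")
print(pool.offload(len, addr, 5))   # runs len() memory-side -> 5
```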

The second important set of challenges deals with the disaggregated memory system itself. How does such a system handle node failures and still remain available and durable? Sadly, the paper does not provide any concrete details aside from hinting at high-level solutions, all of which will cost the performance — backup on storage, erasure coding, replication, etc.

The third block of challenges has to do with concurrency control. If the database system caches data locally at worker/compute nodes, then we need to worry about cache coherence when multiple workers can access the same memory. Here we see that disaggregated memory is still slow: a local cache can be an order of magnitude faster. This is a smaller gap than, say, going from SSD to memory, but it is still substantial. The authors suggest that the reduced speed differences in this new memory hierarchy will require new caching protocols that prioritize execution time rather than cache hit rate.

Another concurrency challenge has to do with transactions, as we now have the potential to fit all data in one large memory pool with many workers accessing it concurrently. Again, the paper does not offer many concrete solutions but suggests “rethinking distributed commit.” Similar is the “solution” for concurrency control: it is costly to implement locks over RDMA, so we need to rethink CC as well, preferably without locks. Lastly, this all needs to work with thousands of compute nodes.

The last set of challenges is indexing-related. This, again, can get into the tricky RDMA performance limitations, so we need to have an RDMA-conscious index design. Also, the index needs to work well under high concurrency. 


1) Details. Our group collectively found the paper to be rather shallow on details of how these systems may work. While the paper examines some literature on the shared memory databases of the past, it lacks depth and connections with this new paradigm of disaggregated memory used over RDMA. We are especially curious about more depth for concurrency issues and solutions, as many stated issues may have been solved in prior shared memory databases, albeit at a smaller scale.

One example where the paper is very shallow is the disaggregated memory system itself. Stating that there are challenges with availability and durability in a component at the core of all DSM-DBs is not going to cut it: the entire premise of the paper depends on such a disaggregated memory system being fast and reliable. Without these basics, the rest of the discussion becomes largely irrelevant.

2) Memory disaggregation. We discussed the memory disaggregation idea in general and whether it can become mainstream technology. See, storage disaggregation is already ubiquitous: you create a VM in some cloud, be it AWS, Azure, or GCP, and the storage this VM gets is likely in a different box (or rather a set of boxes) than your VM’s CPU or memory (think of EBS volumes on AWS EC2). We are fine with this, as the storage is plenty fast and behaves just as if it were located in the same server as the rest of the virtual hardware. Memory disaggregation over RDMA does not work this way, creating a lot of challenges. Most importantly, this disaggregated memory cannot (yet?) be made as universal as disaggregated storage. We won’t run code from it or use it for anything that needs to copy/change contents a lot. As a result, disaggregated memory will, at best, look like another “storage” tier to the systems that use it: something faster than durable storage, but still not fast enough for general use.

Personally, I see more future utility in memory disaggregation over CXL. There was a paper on the topic at the recent USENIX ATC. Such a solution may act more like additional on-demand memory than a shared pool of memory between processors/nodes, but it also avoids the cache coherence issues and the difficulties and limitations of RDMA, including its reliability. I can envision a top-of-rack memory pool that tenants can tap into if they need more memory, or a cloud VM product that can scale memory and CPU cores independently of each other.


Reading Group. Not that Simple: Email Delivery in the 21st Century

I haven’t been posting new reading group paper summaries lately, but I intend to fix that gap and resume writing these. Our 123rd paper was about email: “Not that Simple: Email Delivery in the 21st Century” by Florian Holzbauer, Johanna Ullrich, Martina Lindorfer, and Tobias Fiebig. This paper studies whether different emerging standards and technologies impact email delivery from one Mail Transfer Agent (MTA) to another.

It turns out that with a slew of new and changing standards and technologies, email delivery is not that easy. Simplifying a bit, the sending MTA first needs to figure out where to send the email using DNS. Then it communicates with the receiving MTA to actually deliver the message. Things can go wrong at both of these steps due to the use of different communication standards (IPv4 vs. IPv6) or adherence to stricter security. On the security side, the authors tried to force TLS to ensure an encrypted connection and presented an invalid TLS certificate to study how vulnerable the email exchange may be to attacks such as man-in-the-middle. They also used misconfigured DNSSEC to see whether that prevented email delivery.
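The two-step flow described above (find the receiving MTA via DNS, then hand the message off over SMTP) can be sketched as a small function. This is our illustrative version, not the paper’s methodology; the `resolve_mx` and `smtp_send` parameters are hypothetical stand-ins for the DNS lookup and the SMTP hand-off, injected so the failure modes of each step can be exercised separately:

```python
def deliver(domain, message, resolve_mx, smtp_send):
    """Simplified two-step email delivery sketch."""
    # Step 1: DNS -- find the receiving MTA via MX records; per RFC 5321,
    # fall back to the domain itself if no MX records exist.
    hosts = resolve_mx(domain) or [(0, domain)]
    # Step 2: SMTP -- try hosts in order of MX preference (lowest first).
    errors = []
    for _pref, host in sorted(hosts):
        try:
            smtp_send(host, message)
            return host
        except ConnectionError as exc:
            errors.append((host, exc))
    raise ConnectionError(f"delivery failed at every MX host: {errors}")

# Stubbed usage: the first-preference host is unreachable (say, IPv6-only),
# so delivery falls through to the second one.
mx = lambda d: [(10, "mx1.example.net"), (20, "mx2.example.net")]
def send(host, msg):
    if host == "mx1.example.net":
        raise ConnectionError("host unreachable")
print(deliver("example.net", "hello", mx, send))  # mx2.example.net
```

Either step failing for all candidate hosts is what the paper counts as a delivery failure.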

To quantify email delivery, the authors set up their own email testbed, consisting of differently configured authoritative DNS servers and MTAs, giving a total of 11 combinations that exercise different connectivity and security options. This testbed served as the receiver. For senders, the authors used email providers, categorized into two groups: large and regular. Large providers are companies like Google and Microsoft that provide email services for a variety of other organizations and individuals. Regular providers are smaller and serve one or a few institutions.

Among the big findings: IPv6 is still a problem. Roughly 40% of email providers (regular and large) cannot deliver mail to systems with IPv6-only DNS. About 60% of providers cannot deliver mail when both the MTA and DNS rely only on IPv6.

Another important lesson is that email providers, large and regular, prioritize email delivery over security. While most providers (90% of regular and all large ones) can deliver emails with forced TLS, the providers do not check the validity of the TLS certificate. Misconfigured DNSSEC also does not seem to stop most providers from delivering mail.
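The “encrypt, but accept any certificate” behavior observed in the study corresponds to opportunistic TLS on the sender side. As a minimal sketch, here is the difference between a strict and a lax client configuration using Python’s standard `ssl` module (the names `strict` and `lax` are ours; either context could be passed to `smtplib.SMTP.starttls(context=...)`):

```python
import ssl

# Strict validation: delivery to a receiver with an expired, self-signed,
# or wrong-hostname certificate fails with SSLCertVerificationError.
strict = ssl.create_default_context()

# Opportunistic TLS, which is what most senders in the study effectively do:
# the channel is encrypted, but any certificate is accepted, leaving the
# exchange open to man-in-the-middle attacks.
lax = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
lax.check_hostname = False   # must be disabled before dropping verification
lax.verify_mode = ssl.CERT_NONE
```

The paper’s invalid-certificate receiver would reject the first configuration and happily complete a handshake with the second.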


1) Missing parts. The paper studies email delivery, but it looks at the problem very literally: whether email arrives at the receiving MTA. This does not necessarily correspond to the user experience. For example, the complexity of dealing with spam is not studied here, yet it is a large part of users’ experience with email. The receiver may apply spam filters after receiving the email, which can cause the email never to be shown to the user.

2) Graylisting. The paper does some study of spam emails, again in the context of their delivery to the receiving MTAs and the technologies used. One spam-specific technology the paper tested is graylisting. This is a trivial method to reduce spam by rejecting emails from unknown sources on the first attempt. If the sender retries the delivery sometime later, the receiver accepts it. It appears that this simple technique resulted in fewer spam emails. However, we wonder whether using it even makes sense: 60% of spam still made it through, and state-of-the-art spam filters may end up catching more spam regardless of whether it passed the graylisting procedure.

3) Misleading results. The results in the paper express the ratio of different providers that may be impacted by a particular issue/configuration. This is a somewhat misleading metric, and it would be much nicer to know how many users may be impacted instead. Of course, in all fairness to the paper, this information is rather hard to obtain, since email providers are unlikely to share their user data.

4) IPv6. The results on IPv6 issues may not be specific to email, as IPv6 adoption in general may not be that great just yet.
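The graylisting mechanism from point 2 above is simple enough to sketch in a few lines. This is our illustrative version, not anything from the paper or a real MTA; the 300-second retry delay is an assumption (a common default, not a standard):

```python
class Greylister:
    """Minimal greylisting sketch: temporarily reject the first delivery
    attempt from an unseen (client IP, sender, recipient) triplet, and
    accept once the sender retries after a minimum delay."""

    def __init__(self, min_delay=300.0):
        self.min_delay = min_delay
        self.first_seen = {}  # triplet -> timestamp of first attempt

    def check(self, client_ip, mail_from, rcpt_to, now):
        key = (client_ip, mail_from, rcpt_to)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"                       # retried late enough: accept
        return "450 greylisted, try again later"  # temporary SMTP failure

g = Greylister()
triplet = ("203.0.113.7", "a@example.org", "b@example.net")
print(g.check(*triplet, now=0))    # 450 greylisted, try again later
print(g.check(*triplet, now=400))  # 250 OK
```

The bet is that legitimate MTAs queue and retry on a 4xx temporary failure, while naive spam senders fire once and move on; the paper’s 60% pass-through rate suggests that bet no longer pays off well.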

Winter Term Reading Group Papers: ##121-130

Our winter set of papers! The schedule is also in our Google Calendar.

Fall Term Reading Group Papers: ##111-120

Below is a list of papers for the fall term of the distributed systems reading group. The list is also on the reading group’s Google Calendar.

Reading Group Special Session: Scalability and Fault Tolerance in YDB

YDB is an open-source distributed SQL database. YDB is used as an OLTP database for mission-critical, user-facing applications. It provides strong consistency and serializable transaction isolation for the end user. One of the main characteristics of YDB is scalability to very large clusters together with multitenancy, i.e., the ability to provide an isolated environment for each user within a single cluster. In this talk, we will cover two layers of YDB, the Tablet and BlobStorage layers, which together provide fault tolerance, scalability, and user isolation.

The Tablet is a very lightweight component that implements distributed consensus. It is a building block for user data storage, the schema manager, the tablet manager, system views, and many other components. A single YDB cluster is known to have more than a million tablets, and a single YDB node can serve up to 10K tablets. A tablet implements a distributed consensus protocol over a shared log provided by the BlobStorage layer.

The BlobStorage layer is a fault-tolerant, append-only key-value store. It implements a specific API for logging, reading and writing blobs, garbage collection, and quorum management. We believe that the protocol between the Tablet and BlobStorage layers is quite simple and understandable to listeners, compared to Paxos or Raft. We will also illustrate the flexibility of YDB with real-life use cases, such as migrating parts of a cluster from one availability zone to another without a maintenance window, or storing some data blobs on fast media like SSDs and others on HDDs.
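To give a flavor of the tablet-over-blob-store layering described above, here is a toy sketch. It is emphatically not the real YDB protocol or API (the class and method names are ours, and all replication, quorum, and recovery logic is omitted); it only illustrates the idea of a tablet persisting its log as immutable blobs keyed so that a restarted tablet in a newer generation cannot collide with older records:

```python
class BlobStorage:
    """Toy stand-in for an append-only, fault-tolerant blob store.
    A real store replicates each blob across disks; a dict suffices here."""
    def __init__(self):
        self._blobs = {}

    def put(self, blob_id, data):
        # Append-only: a blob id is written once and never overwritten.
        assert blob_id not in self._blobs, "blobs are immutable"
        self._blobs[blob_id] = data

    def get(self, blob_id):
        return self._blobs[blob_id]

class Tablet:
    """Persists state changes as a log of blobs keyed by
    (tablet_id, generation, step)."""
    def __init__(self, tablet_id, generation, storage):
        self.tablet_id, self.generation = tablet_id, generation
        self.step = 0
        self.storage = storage

    def append(self, record):
        self.step += 1
        blob_id = (self.tablet_id, self.generation, self.step)
        self.storage.put(blob_id, record)
        return blob_id

store = BlobStorage()
t = Tablet(tablet_id=42, generation=7, storage=store)
bid = t.append(b"set x=1")
print(store.get(bid))  # b'set x=1'
```

The appeal of this layering, as the talk argues, is that the tablet-side protocol stays simple because all the hard fault-tolerance work lives below, in the blob store.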

Speaker: Andrey is the head of the YDB project (https://ydb.tech). He works at Yandex as a Head of Department. He also has a long history at Yandex as a member of the Web Crawler team, a member of the Infrastructure team, one of the creators of Yandex Cloud, and now part of the CloudIL team. Andrey holds a Ph.D. from Moscow State University.

When: Wednesday, August 10th at 2:00 pm EDT (6 pm UTC)

Where: Zoom