fault-tolerance

Reading Group Paper: Hyrax: Fail-in-Place Server Operation in Cloud Platforms

Reading Group

Aleksey Charapko

·

Sep 7, 2023

In the 142nd reading group meeting, we discussed “Hyrax: Fail-in-Place Server Operation in Cloud Platforms” OSDI’23 paper. Hyrax allows servers with certain types of hardware failures to return to service after some software-only automated mitigation steps. Traditionally, when a server malfunctions, the VMs are migrated off of it, then the server gets shut down and…
Read More
Reading Group Special Session: Scalability and Fault Tolerance in YDB

RG Special Session

Aleksey Charapko

·

Aug 2, 2022

YDB is an open-source Distributed SQL Database. YDB is used as an OLTP Database for mission-critical user-facing applications. It provides strong consistency and serializable transaction isolation for the end user. One of the main characteristics of YDB is scalability to very large clusters together with multitenancy, i.e. ability to provide an isolated user environment for…
Read More
Reading Group. Solar Superstorms: Planning for an Internet Apocalypse

Reading Group

Aleksey Charapko

·

May 22, 2022

Our 96th reading group paper was very different from the topics we usually discuss. We talked about the “Solar Superstorms: Planning for an Internet Apocalypse” SIGCOMM’21 paper by Sangeetha Abdu Jyothi. Now (May 2022), we are slowly approaching the peak of solar cycle 25 (still due in a few years?) as the number of observable…
Read More
Reading Group. In Reference to RPC: It’s Time to Add Distributed Memory

Reading Group

Aleksey Charapko

·

Sep 3, 2021

Our 70th meeting covered the “In Reference to RPC: It’s Time to Add Distributed Memory” paper by Stephanie Wang, Benjamin Hindman, and Ion Stoica. This paper proposes some improvements to remote procedure call (RPC) frameworks. In current RPC implementations, the frameworks pass parameters to function by value. The same happens to the function return values.…
Read More
Metastable Failures in Distributed Systems

Other Thoughts, Paper Review and Summary

Aleksey Charapko

·

May 21, 2021

Metastability is a stable state of a dynamical system other than the system’s state of least energy. – Wikipedia Distributed systems often fail spectacularly and unpredictably. They are a cause for a headache and sleepless on-call nights for way too many engineers. And this is despite lots of efforts to understand the failures, and all…
Read More
Reading Group. Protocol-Aware Recovery for Consensus-Based Storage

Reading Group

Aleksey Charapko

·

May 9, 2021

Our last reading group meeting was about storage faults in state machine replications. We looked at the “Protocol-Aware Recovery for Consensus-Based Storage” paper from FAST’18. The paper explores an interesting omission in most of the state machine replication (SMR) protocols. These protocols, such as (multi)-Paxos and Raft, are specified with the assumption of having a…
Read More
Reading Group. Toward a Generic Fault Tolerance Technique for Partial Network Partitioning

Reading Group

Aleksey Charapko

·

Jan 7, 2021

Short Summary We have resumed the distributed systems reading group after a short holiday break. Yesterday we discussed the “Toward a Generic Fault Tolerance Technique for Partial Network Partitioning” paper from OSDI 2020. The paper studies a particular type of network partitioning – partial network partitioning. Normally, we expect that every node can reach every…
Read More
Reading Group. RMWPaxos: Fault-Tolerant In-Place Consensus Sequences

Reading Group

Aleksey Charapko

·

Nov 5, 2020

Quick Summary In the last reading group discussion, we talked about RMWPaxos. The paper argues that under some circumstances, log-based replication schemes and replicated state machines (RSMs), like Multi-Paxos, are a waste of resources. For example, when the state is small, it may be more efficient to just manage the state directly instead of managing…
Read More
One Page Summary: Flease – Lease Coordination without a Lock Server

One Page Summary

Aleksey Charapko

·

Sep 4, 2017

This paper talks about a decentralized lease management solution. In the past, many lock/lease services have been centralized, placing a single authority to manage all locks in the system. Google’s Chubby, Apache ZooKeeper, etcd, and others rely on a centralized approach and backed by some flavor of a consensus algorithm for fault-tolerance. According to Flease authors,…
Read More

Aleksey Charapko

fault-tolerance

Reading Group Paper: Hyrax: Fail-in-Place Server Operation in Cloud Platforms

Reading Group Special Session: Scalability and Fault Tolerance in YDB

Reading Group. Solar Superstorms: Planning for an Internet Apocalypse

Reading Group. In Reference to RPC: It’s Time to Add Distributed Memory

Metastable Failures in Distributed Systems

Reading Group. Protocol-Aware Recovery for Consensus-Based Storage

Reading Group. Toward a Generic Fault Tolerance Technique for Partial Network Partitioning

Reading Group. RMWPaxos: Fault-Tolerant In-Place Consensus Sequences

One Page Summary: Flease – Lease Coordination without a Lock Server

Search

Recent Posts

Categories