Aleksey Charapko

  • Reading Group. Facebook’s Tectonic Filesystem: Efficiency from Exascale

    This time around, our reading group discussed a distributed filesystem paper. We looked at a FAST’21 paper from Facebook: “Facebook’s Tectonic Filesystem: Efficiency from Exascale.” We had a nice presentation by Akash Mishra. The paper talks about a unified filesystem across many services and use cases at Facebook. Historically, Facebook had multiple specialized storage infrastructures: one…

  • Reading Group. Distributed Snapshots: Determining Global States of Distributed Systems

    On Wednesday we kicked off a new set of papers in the reading group. We started with one of the classical foundational papers in distributed systems and looked at the Chandy-Lamport marker-based distributed snapshot algorithm. The basic idea here is to capture the state of distributed processes and channels by “flushing” the messages out…

  • Reading Group. Aragog: Scalable Runtime Verification of Shardable Networked Systems

    We have covered 50 papers in the reading group so far! This week we looked at “Aragog: Scalable Runtime Verification of Shardable Networked Systems” from OSDI’20. This paper discusses the problem of verifying network functions (NFs), such as NAT gateways or firewalls, at runtime. The problem is quite challenging due to its…

  • Reading Group. Protean: VM Allocation Service at Scale

    The last paper in our reading group was “Protean: VM Allocation Service at Scale.” This paper from Microsoft is full of technical insights into how they operate their datacenters/regions at scale. In particular, the paper discusses one of the fundamental components of any cloud provider — the VM service. The system, called Protean, is an…

  • Reading Group. Sundial: Fault-tolerant Clock Synchronization for Datacenters

    In our 48th reading group meeting, we talked about time synchronization in distributed systems. More specifically, we discussed the poor state of time sync, the reasons for it, and most importantly, the solutions, as outlined in the “Sundial: Fault-tolerant Clock Synchronization for Datacenters” OSDI’20 paper. We had a comprehensive presentation by Murat Demirbas. Murat’s talk…

  • Reading Group Special Session: Building Distributed Systems With Stateright

    This talk is part of the Distributed Systems Reading Group. Stateright is a software framework for analyzing and systematically verifying distributed systems. Its name refers to its goal of verifying that a system’s collective state always satisfies a correctness specification, such as “operations invoked against the system are always linearizable.” Cloud service providers like AWS…

  • Reading Group. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

    On Wednesday we discussed scheduling in large distributed ML/AI systems. Our main paper was “Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads” from OSDI’20. However, it was a bit outside of our group’s comfort zone (outside of my comfort zone for sure). Luckily we had an extensive presentation with a complete background…

  • Reading Group. FlightTracker: Consistency across Read-Optimized Online Stores at Facebook

    In the last DistSys Reading Group meeting, we discussed “FlightTracker: Consistency across Read-Optimized Online Stores at Facebook.” This paper is about consistency in Facebook’s TAO caching stack. TAO is a large social graph storage system composed of many caches, indexes, and persistent storage backends. The sheer size of Facebook and TAO makes it difficult to enforce meaningful…

  • Reading Group Paper List. Papers #51-60.

    With just four more papers to go in the DistSys Reading Group’s current batch, it is time to get the next set going. This round, we will have 10 papers that should last till the end of the spring semester. Our last batch was all about OSDI’20 papers, and this time around we will mix…

  • Reading Group. Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories

    Hard to imagine, but the reading group just completed its 45th session. We discussed “Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories,” again from OSDI’20. Pegasus is one of those systems that are very obvious in hindsight. However, this “obviousness” is deceptive: Dan Ports, one of the authors behind the…
