Reading Group

Welcome to the DistSys Reading Group! Every week we present and discuss one distributed systems paper. We try to focus on relatively new papers, although we occasionally break this rule for some important older publications. The main objective of this group is to share knowledge through discussion. Our participants come from academia and industry and often carry a unique perspective and expertise on the subject matter.

Format

We start each meeting with a short presentation of the paper by one of the group members. We record the presentation and later upload it to YouTube for the general audience. After the presentation, we move into a group discussion of the paper. This part is not on the record to make sure we can speak freely about the topic and the paper. However, I write a moderated discussion summary for each meeting and post it here. All the summaries are available via the “Summary” link next to the paper title. To see the archive of past meetings, scroll down to the “Past Meetings” section below.

Meeting Info

Current Schedule (Papers #201-210)

This is a list of papers for the DistSys reading group Fall 2025 term. The schedule is also available on our Google Calendar.

  • The NIC should be part of the OS [HotOS’25]
    • Authors: Pengcheng Xu, Timothy Roscoe
    • What: A stab at the performance vs. usability problem of kernel bypass, with a proposed solution that offers both.
    • When: September 11th
  • Rethinking RPC Communication for Microservices-based Applications [HotOS’25]
    • Authors: Xiangfeng Zhu, Yang Zhou, Yuyao Wang, Xiangyu Gao, Arvind Krishnamurthy, Sam Kumar, Ratul Mahajan, Danyang Zhuo
    • What: More efficient RPC with fewer software layers, with the help of in-network processing/offloading.
    • When: September 18th
  • Picsou: Enabling Replicated State Machines to Communicate Efficiently [OSDI’25]
    • Authors: Reginald Frank, Micah Murray, Chawinphat Tankuranand, Junseo Yoo, Ethan Xu, Natacha Crooks, Suyash Gupta, Manos Kapritsos
    • What: Synchronization between distinct replicated state machines without transactions/2PC
    • When: September 25th
  • Mako: Speculative Distributed Transactions with Geo-Replication [OSDI’25]
    • Authors: Weihai Shen, Yang Cui, Siddhartha Sen, Sebastian Angel, Shuai Mu
    • What: Speculative Geo-distributed transactions with decoupled execution and replication
    • When: October 2nd
  • Low End-to-End Latency atop a Speculative Shared Log with Fix-Ante Ordering [OSDI’25]
    • Authors: Shreesha G. Bhat, Tony Hong, Xuhao Luo, Jiyu Hu, Aishwarya Ganesan, Ramnatthan Alagappan
    • What: Shared log with speculative global order and occasional rollbacks when speculation fails.
    • When: October 9th
  • Quantum Virtual Machines [OSDI’25]
    • Authors: Runzhou Tao, Hongzheng Zhu, Jason Nieh, Jianan Yao, Ronghui Gu
    • What: The title says it all
    • When: October 16th
  • GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale [ATC’25]
    • Authors: Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang
    • What: Detection of slow/underperforming components (GPUs, networks) and mitigation of their impact in the context of ML training jobs
    • When: October 23rd
  • Cuckoo for Clients: Disaggregated Cuckoo Hashing [ATC’25]
    • Authors: Stewart Grant, Alex C. Snoeren
    • What: Key-Value store with a sprinkle of RDMA and some nice algorithmic optimizations
    • When: October 30th
  • Cloudscape: A Study of Storage Services in Modern Cloud Architectures [FAST’25]
    • Authors: Sambhav Satija, Chenhao Ye, Ranjitha Kosgi, Aditya Jain, Romit Kankaria, Yiwei Chen, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Kiran Srinivasan
    • What: Storage usage in the cloud with usage patterns, numbers and statistics.
    • When: November 6th
  • Mooncake: Trading More Storage for Le
    • Authors: Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
    • What: Classical compute vs. memory (storage) tradeoff, but for an LLM-serving system, relying on underutilized resources (CPU, DRAM, network) of ML clusters.
    • When: November 13th

Past Meetings

Past Special Sessions

  1. Building Distributed Systems With Stateright – March 30th @ 1 pm EST – Jon Nadal
  2. Distributed Transactions in YugabyteDB – May 11th @ 12 pm EST – Karthik Ranganathan
  3. Fast General Purpose Transactions in Apache Cassandra – February 9th @ 2 pm EST – Benedict Elliott Smith
  4. Scalability and Fault Tolerance in YDB – August 10th @ 2 pm EST – Andrey Fomichev