Here is the reading list for the Fall 2025 semester. Please join the reading group here: https://discord.gg/VS7J4PAU58. We meet on Thursdays. The schedule is also available on our calendar.
- The NIC should be part of the OS [HotOS’25]
- Authors: Pengcheng Xu, Timothy Roscoe
- What: A stab at the performance vs. usability problem of kernel bypass, with a proposed solution that offers both.
- When: September 11th
- Rethinking RPC Communication for Microservices-based Applications [HotOS’25]
- Authors: Xiangfeng Zhu, Yang Zhou, Yuyao Wang, Xiangyu Gao, Arvind Krishnamurthy, Sam Kumar, Ratul Mahajan, Danyang Zhuo
- What: More efficient RPC with fewer software layers, with the help of in-network processing/offloading.
- When: September 18th
- Picsou: Enabling Replicated State Machines to Communicate Efficiently [OSDI’25]
- Authors: Reginald Frank, Micah Murray, Chawinphat Tankuranand, Junseo Yoo, Ethan Xu, Natacha Crooks, Suyash Gupta, Manos Kapritsos
- What: Synchronization between distinct replicated state machines without transactions/2PC
- When: September 25th
- Mako: Speculative Distributed Transactions with Geo-Replication [OSDI’25]
- Authors: Weihai Shen, Yang Cui, Siddhartha Sen, Sebastian Angel, Shuai Mu
- What: Speculative geo-distributed transactions with decoupled execution and replication
- When: October 2nd
- Low End-to-End Latency atop a Speculative Shared Log with Fix-Ante Ordering [OSDI’25]
- Authors: Shreesha G. Bhat, Tony Hong, Xuhao Luo, Jiyu Hu, Aishwarya Ganesan, Ramnatthan Alagappan
- What: Shared log with speculative global order and occasional rollbacks when speculation fails.
- When: October 9th
- Quantum Virtual Machines [OSDI’25]
- Authors: Runzhou Tao, Hongzheng Zhu, Jason Nieh, Jianan Yao, Ronghui Gu
- What: The title says it all
- When: October 16th
- GREYHOUND: Hunting Fail-Slows in Hybrid-Parallel Training at Scale [ATC’25]
- Authors: Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang
- What: Detection of slow/underperforming components (GPUs, networks) and mitigation of their impact in the context of ML training jobs
- When: October 23rd
- Cuckoo for Clients: Disaggregated Cuckoo Hashing [ATC’25]
- Authors: Stewart Grant, Alex C. Snoeren
- What: Key-Value store with a sprinkle of RDMA and some nice algorithmic optimizations
- When: October 30th
- Cloudscape: A Study of Storage Services in Modern Cloud Architectures [FAST’25]
- Authors: Sambhav Satija, Chenhao Ye, Ranjitha Kosgi, Aditya Jain, Romit Kankaria, Yiwei Chen, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Kiran Srinivasan
- What: A study of storage usage in the cloud, covering usage patterns, numbers, and statistics.
- When: November 6th
- Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot [FAST’25]
- Authors: Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
- What: The classical compute vs. memory (storage) tradeoff, but for an LLM-serving system, relying on underutilized resources (CPU, DRAM, network) of ML clusters.
- When: November 13th
Reading Group
The Distributed Systems reading group meets on Thursdays at 1 pm over Zoom. Please join our Discord server for more info.