Welcome to the DistSys Reading Group! Every week we present and discuss one distributed systems paper. We try to focus on relatively new papers, although we occasionally break this rule for important older publications. The main objective of this group is to share knowledge through discussion. Our participants come from academia and industry and often bring unique perspectives and expertise to the subject matter.
Format
We start each meeting with a short presentation of the paper by one of the group members. We record the presentation and later upload it to YouTube for a general audience. After the presentation, we move into a group discussion of the paper. This part is off the record so that we can speak freely about the topic and the paper. However, I write a moderated summary of each discussion and post it here. All the summaries are available via the “Summary” link next to the paper title. To see the archive of past meetings, scroll down to the “Past Meetings” section below.
Meeting Info
- Meeting Time: Thursdays at 1:00 PM EST (10:00 AM PST)
- Duration: ~1 hour
- Slack Channel – Join our Slack for Zoom information
- YouTube Channel
- Google Calendar with our schedule.
Current Schedule (Papers ##181-190)
Below is a list of papers for the fall term of the distributed systems reading group.
- Starburst: A Cost-aware Scheduler for Hybrid Cloud [ATC’24]
- Authors: Michael Luo, Siyuan Zhuang, Suryaprakash Vengadesan, Romil Bhardwaj, Justin Chang, Eric Friedman, Scott Shenker, Ion Stoica
- What: A batch workload scheduler that spans private and public clouds and reduces public cloud cost while ensuring timely job completion.
- When: October 24th
- If At First You Don’t Succeed, Try, Try, Again…? [SOSP’24]
- Authors: Bogdan A. Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madanlal Musuvathi, Suman Nath
- What: Retries, retry bugs, and some use of LLMs to analyze them.
- When: October 31st
- ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing [OSDI’24]
- Authors: Mike Chow, Yang Wang, William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt McNally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, Chunqiang Tang
- What: Performance testing platform for detecting performance regressions in large systems deployed in noisy (e.g., cloud) environments.
- When: November 7th
- Massively Parallel Multi-Versioned Transaction Processing [OSDI’24]
- Authors: Shujian Qian, Ashvin Goel
- What: Multi-versioned OLTP store with GPU acceleration for massively parallel execution of transactions
- When: November 14th
- Resource Management in Aurora Serverless [VLDB]
- Authors: Bradley Barnhart, Marc Brooker, Daniil Chinenkov, Tony Hooper, Jihoun Im, Prakash Chandra Jha, Tim Kraska, Ashok Kurakula, Alexey Kuznetsov, Grant McAlister, Arjun Muthukrishnan, Aravinthan Narayanan, Douglas Terry, Bhuvan Urgaonkar, Jiaming Yan
- What: Serverless add-on for AWS Aurora that abstracts resource/capacity usage into Aurora Capacity Units and enables a pay-for-usage model by automatically scaling up and down based on demand.
- When: November 21st
- Beaver: Practical Partial Snapshots for Distributed Cloud Services [OSDI’24]
- Authors: Liangcheng Yu, Xiao Zhang, Haoran Zhang, John Sonchack, Dan Ports, Vincent Liu
- What: “Practical partial snapshot protocol that ensures causal consistency.”
- When: December 5th
- SwiftPaxos: Fast Geo-Replicated State Machines [NSDI’24]
- Authors: Fedor Ryabinin, Alexey Gotsman, Pierre Sutra
- What: Partially multi-writer, geo-distributed Paxos SMR with lower latency than MultiPaxos in the optimal case.
- When: December 12th
- Anvil: Verifying Liveness of Cluster Management Controllers [OSDI’24]
- Authors: Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, Oded Padon, Lalith Suresh, Adriana Szekeres, Tianyin Xu
- What: Formal verification of liveness properties of cluster management controllers.
- When: December 19th
- GaussDB: A Cloud-Native Multi-Primary Database with Compute-Memory-Storage Disaggregation [VLDB]
- Authors: Guoliang Li, Wengang Tian, Jinyu Zhang, Ronen Grosman, Zongchao Liu, Sihao Li
- What: A cloud-native, multi-writer database service with 3-way resource disaggregation: compute for TX processing, disaggregated memory for buffers and locks, and disaggregated storage for persistence/durability.
- When: January 9th
- SWARM: Replicating Shared Disaggregated-Memory Data in No Time [SOSP’24]
- Authors: Antoine Murat, Clément Burgelin, Athanasios Xygkis, Igor Zablotchi, Marcos K. Aguilera, Rachid Guerraoui
- What: Replication of data stored in disaggregated memory: how to ensure that objects in shared disaggregated memory survive failures inside the shared memory subsystem.
- When: January 16th
Past Meetings
- Papers ##37-50
- Papers ##51-60
- Papers ##61-70
- Papers ##71-80
- Papers ##81-90
- Papers ##91-100
- Papers ##101-110
- Papers ##111-120
- Papers ##121-130
- Papers ##131-140
- Papers ##141-150
- Papers ##151-160
- Papers ##161-170
- Papers ##171-180
Past Special Sessions
- Building Distributed Systems With Stateright – March 30th @ 1pm EST – Jon Nadal.
- Distributed Transactions in YugabyteDB – May 11th @ 12pm EST – Karthik Ranganathan.
- Fast General Purpose Transactions in Apache Cassandra – February 9th @ 2pm EST – Benedict Elliott Smith.
- Scalability and Fault Tolerance in YDB – August 10th @ 2pm EST – Andrey Fomichev.