debugging

  • Reading Group. How to fight production incidents?: an empirical study on a large-scale cloud service

    In the 125th reading group meeting, we looked at the reliability of cloud services. In particular, we read the “How to fight production incidents?: an empirical study on a large-scale cloud service” SoCC’22 paper by Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. This paper looks at 152 severe production incidents in the Microsoft…

  • Reading Group. Distributed Snapshots: Determining Global States of Distributed Systems

    On Wednesday we kicked off a new set of papers in the reading group. We started with one of the classic foundational papers in distributed systems and looked at the Chandy-Lamport token-based distributed snapshot algorithm. The basic idea here is to capture the state of distributed processes and channels by “flushing” the messages out…

  • Trace Synchronization with HLC

    Event logging, or tracing, is one of the most common techniques for collecting data about software execution. For a simple application running on a single machine, a trace of events timestamped with the machine’s hardware clock is typically sufficient. When the system grows and becomes distributed over multiple nodes, each node is going to produce…
