Tag Archives: debugging

Reading Group. How to fight production incidents?: an empirical study on a large-scale cloud service

In the 125th reading group meeting, we looked at the reliability of cloud services. In particular, we read the “How to fight production incidents?: an empirical study on a large-scale cloud service” SoCC’22 paper by Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. The paper examines 152 severe production incidents in the Microsoft Teams service. The authors distilled these incidents into a handful of categories in terms of root cause, mitigation type, detection, and so on. Placing the incidents into such categories/buckets revealed some interesting patterns regarding the timeliness of incident mitigation, mitigation approaches, and potential areas for improvement.

I liked that the paper described the data collection methodology, since the categorizations may be rather subjective. However, I will mention one caveat: the authors assigned a single root cause to each incident, even though some incidents are complex and may have more than one contributing factor. I also liked that the paper cautions readers against making hasty generalizations; the study focuses on just one large service, and many of the findings may be specific to that service.

So, with the above disclaimer in mind, what have we learned from the paper? The paper’s findings can be roughly broken down into a handful of topics: root cause, detection, mitigations, and post-incident lessons learned by the team/engineers. 

On the root cause side of things, different bugs in the service (Teams) only make up roughly 40% of the incidents (27% code bugs + 13% configuration bugs). Other categories include infrastructure failures, deployment failures, and authentication problems. On the infrastructure side of things, the paper separates infrastructure failures into three different categories. The first one, referred to as “infrastructure failures,” deals with scalability problems, like the inability to get enough nodes to run the work. The second infrastructure root cause bucket is “dependency failures.” Finally, the failures of databases & network dependencies get their own root cause category. But if we combine the three together, all infrastructure failures are around 40%. 

On the detection side, the paper suggests a significant deficiency in detection and monitoring systems and practices. Almost half of all incidents had some kind of detection malfunction, with 29% of incidents reported by external users and another 10% reported by internal users! If we look at the automated detection issues, many are due to bugs, lack of automated monitors, or lack of telemetry. 

For incident mitigations, the authors discuss the common types of mitigations and the reasons for some of these mitigations requiring more time to address the problem. While around 40% of issues were due to bugs, only 21% of all mitigations required a bug fix (8% for fixing code and 13% for fixing config). Additionally, 11% of issues relied on some “ad-hoc” fixes, which the authors describe as “hot-fixes.” The paper conjectures that it takes substantial time to go through the process of fixing bugs during mitigation. Instead, a faster way to recover is to initiate a rollback (22% of cases) or perform an “infrastructure change” (another 22% of incidents). By “infrastructure change” the authors mean scaling the system to take on more nodes/CPU. 

As for the reasons mitigations get delayed (the paper calls them mitigation failures, although the mitigations themselves did not fail but rather took longer), the authors describe several common causes. Inadequate documentation and procedures are at the top of the list. Another common one is deployment delay, which occurs when rolling out the fix takes a long time.

With all of the above findings, the lessons learned by the engineers come down to the following: more automation, more testing, and more changes within the organization (behavioral change, better documentation, better coordination). I want to focus a bit on the automation part, as it hits a bit too close to home for me. A large share of the automation suggestions (26%) boils down to better tests, such as performance tests, chaos engineering, and better end-to-end testing. I am a bit skeptical about these numbers, as testing is something we blame all the time when problems occur; it is a reflex response to say that more testing is needed. And of course we need more testing, but mind you, the same people who call for more testing when dealing with issues write the features and (inadequate?) tests when they are not on call.

What caught my attention is the need for automated deployment. More specifically, “automated deployment” includes “automated failover.” Just the fact that this is a recurring ask from engineers puzzles me a lot. This means that at least a handful of times per year, running Microsoft Teams requires engineers to manually reconfigure the system to remove failed nodes/machines/services and switch the load to new/backup ones. 

The authors discuss several more insights after doing a multidimensional analysis. I am not going to go in depth here and will instead just leave a few snippets from the paper:

Discussion

1) General applicability of results. As mentioned in the paper, all observations are taken from one service, so your mileage may vary. Another point to note is that Microsoft is a large organization with decently mature monitoring, automation, and deployment tools, which may impact the number and severity of observed incidents. It is quite possible that with a less mature set of tools, one may observe a greater variety of serious problems.

Another point to mention with regard to applicability is that there are many other large systems/services that are very different from Microsoft Teams. For example, if we look at systems that maintain a lot of state (i.e., databases), we may see other common root causes and mitigation patterns. For instance, a typical mitigation strategy observed in the paper is throwing more resources at the problem. This strategy works great for systems that run a lot of stateless or small-state components/microservices. Adding more resources helps scale past performance bottlenecks as long as the underlying databases/storage dependencies can handle the load. Scaling stateful systems, like databases, often requires spending resources upfront to move data/state around, which is something a system cannot do well when it is already overloaded.

2) Lack of concrete actionable items. The findings are interesting from an educational standpoint, but they lack clear directions for improvement. The authors give general advice, along the lines of improving testing, building better automation, etc., but there are no concrete steps/procedures that may help improve systems’ reliability. One point made in our discussion is the need to focus on the specific type of problem to find more concrete solutions for that problem. 

3) Teams Outage. On the day we discussed this paper, Teams experienced an outage, likely related to some networking configuration issues.

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a slack group where we post papers, hold discussions, and most importantly manage Zoom invites to paper discussions. Please join the slack group to get involved!

Reading Group. Distributed Snapshots: Determining Global States of Distributed Systems

On Wednesday we kicked off a new set of papers in the reading group. We started with one of the classical foundational papers in distributed systems and looked at the Chandy-Lamport marker-based distributed snapshot algorithm. The basic idea here is to capture the state of distributed processes and channels by “flushing” the messages out of the channels with markers. The markers ensure that causality is not broken, despite the processes taking their local snapshots at different times (and with no affinity to physical time). I am not going to summarize the paper, as there is plenty of material on the internet on the subject; however, here is our group’s short presentation by Maher Gamal:
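
In addition to the presentation, here is a minimal sketch of the marker rules each process follows. This is just my own illustration, not code from the paper; the messaging layer (the `send` callback, the channel ids, and the `MARKER` constant) is assumed to be provided by the application.

```python
# A minimal sketch of the Chandy-Lamport marker rules for one process.
# `send(channel, msg)` and the channel ids are assumed placeholders,
# not a real library API.

MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, incoming_channels, outgoing_channels, local_state):
        self.incoming = incoming_channels   # channels we receive on
        self.outgoing = outgoing_channels   # channels we send on
        self.local_state = local_state
        self.recorded_state = None          # local snapshot, once taken
        self.channel_state = {}             # channel id -> recorded in-flight messages
        self.recording = set()              # channels we are still recording

    def start_snapshot(self, send):
        """Initiator: record own state and flush all outgoing channels with markers."""
        self.recorded_state = self.local_state
        self.recording = set(self.incoming)
        self.channel_state = {ch: [] for ch in self.incoming}
        for ch in self.outgoing:
            send(ch, MARKER)

    def on_message(self, channel, msg, send):
        if msg == MARKER:
            if self.recorded_state is None:
                # First marker seen: take the local snapshot, record this channel
                # as empty, and propagate markers downstream.
                self.recorded_state = self.local_state
                self.recording = set(self.incoming) - {channel}
                self.channel_state = {ch: [] for ch in self.recording}
                self.channel_state[channel] = []
                for ch in self.outgoing:
                    send(ch, MARKER)
            else:
                # Marker on a channel we were recording: stop recording it.
                self.recording.discard(channel)
        elif channel in self.recording:
            # A regular message that was "in flight" when the snapshot started.
            # (The application would also process msg normally here.)
            self.channel_state[channel].append(msg)
```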

Discussion

1) Use of snapshots. Much of our discussion focused on the use of snapshots. Aside from the trivial use for disaster recovery, snapshots are useful for debugging and runtime verification. The paper suggests some debugging/monitoring uses, like detecting stable properties of algorithms. However, we also think that detecting violations of certain properties, for instance invariants, at runtime may be more useful in the real world.

Just last week we talked about Aragog, a system for runtime verification of network functions. And while it does not directly use snapshots, it relies on time synchronization to make sure that the messages come from different consistent cuts of the state, and that the cause-and-effect relationships play out correctly in the constructed state machine.

2) Snapshots of states that did not happen. An interesting thing about Chandy-Lamport snapshots is that they may capture a system state that never actually occurred in the execution. This is because the snapshot is taken progressively as the markers propagate through the channels, and essentially the snapshot gets rolled out at the speed of communication.

3) Timely snapshots. We also brought up snapshots that are close to the wall clock. These may be more useful for debugging purposes than Chandy-Lamport snapshots, as they provide some notion of when things happened. Additionally, tighter snapshots that are taken at about the same time globally are better at recording the true state (or should we say have fewer timing artifacts?).


Trace Synchronization with HLC

Event logging or tracing is one of the most common techniques for collecting data about software execution. For a simple application running on a single machine, a trace of events timestamped with the machine’s hardware clock is typically sufficient. When the system grows and becomes distributed over multiple nodes, each node produces its own independent logs or traces. Unfortunately, nodes running on different physical machines do not have access to a single global master clock to timestamp and order the events, making these individual logs unaligned in time.

For some purposes, such as debugging, engineers often need to look at these independent logs as one whole. For instance, if Alice the engineer needs to see how some event on one node influenced the rest of the system, she will need to examine all the logs and pay attention to the causal relationships between events. The time skew of different nodes and the trace misalignment caused by such skew prevent Alice from safely relying on such independently produced traces, as shown in the figure below. Instead, she needs to collate the logs together and reduce the misalignment as much as possible to produce a more coherent picture of the distributed system execution.

[Figure: Physical time makes logs misaligned and miss the causality between events.]

Of course, if only causality were important, Alice could have used Lamport’s logical clocks (LC) to help her identify some causal relationships between the events on different nodes. Alternatively, the logging system could use vector clocks (VC) to capture all of the causal relationships in the system. However, both LC and VC are disjoint from the physical time of the events in the trace, making it hard for Alice to navigate such a logging system.

Using time synchronization protocols, such as NTP and PTP, will help make the traces better aligned. These protocols are not perfect, however, and still leave some synchronization error or uncertainty, introducing the possibility of breaking causality when collating the logs based purely on them.

HLC sync

Instead of using NTP or logical time alone for synchronizing the event logs, I wondered whether it is possible to use both at the same time with the help of Hybrid Logical Clocks (HLC). HLC combines physical time, such as NTP time, with a logical clock to keep track of causality during periods of synchronization uncertainty. Since HLC acts as a single, always-increasing time across the entire distributed system, it can be used to timestamp the events in the log traces of every node. You can learn more about HLC here.
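
To make this more concrete, here is a minimal sketch of how HLC ticks, written in Python for illustration. The `physical_now` function is an assumed stand-in for reading the node’s NTP-synchronized clock; the (l, c) representation follows the standard HLC formulation.

```python
import time

def physical_now():
    # Assumed stand-in for the node's NTP-synchronized clock, in integer ticks.
    return int(time.time() * 1000)

class HLC:
    """A minimal Hybrid Logical Clock: a pair (l, c) where l tracks the highest
    physical time seen so far and c is a logical counter within a single l."""

    def __init__(self):
        self.l = 0
        self.c = 0

    def now(self):
        """Timestamp a local or send event."""
        pt = physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, msg_l, msg_c):
        """Merge the timestamp carried by a received message, then timestamp
        the receive event itself."""
        pt = physical_now()
        new_l = max(self.l, msg_l, pt)
        if new_l == self.l == msg_l:
            new_c = max(self.c, msg_c) + 1
        elif new_l == self.l:
            new_c = self.c + 1
        elif new_l == msg_l:
            new_c = msg_c + 1
        else:
            new_c = 0
        self.l, self.c = new_l, new_c
        return (self.l, self.c)
```

On a send, the sender attaches the result of `now()` to the message; on a receive, the receiver calls `update()` with the attached (l, c) pair. Every logged event is stamped with the HLC value produced this way.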

Similar to logical time, HLC does not capture the full causality between events. However, HLC conforms to the LC implication: if event A happened before event B, then the HLC timestamp of A is smaller than the HLC timestamp of B. This can be written as A hb B ⇒ hlc.A < hlc.B. Obviously, we cannot use HLC timestamps alone to recover the causal order of two arbitrary events. Despite this limitation, we can still use the LC implication to gain some partial information about the order of events. If event A has an HLC timestamp smaller than that of event B, we can at least say that B did not happen before A, thus either A happened before B or A is concurrent with B: hlc.A < hlc.B ⇒ A hb B ∨ A co B.
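
This one-directional inference is easy to express in code. A small hedged example, with HLC timestamps represented as (l, c) tuples as in the sketch above (the function name is mine, just for illustration):

```python
def definitely_not_happened_before(hlc_a, hlc_b):
    """If hlc.A >= hlc.B, then A could not have happened before B
    (contrapositive of A hb B => hlc.A < hlc.B). Python tuples compare
    lexicographically, which matches the HLC (l, c) ordering."""
    return hlc_a >= hlc_b

# Example: hlc.A < hlc.B only tells us that B did not happen before A,
# i.e., A happened before B OR the two events are concurrent.
a, b = (100, 2), (100, 5)
assert definitely_not_happened_before(b, a)  # B did not happen before A
# ...but we cannot conclude that A happened before B.
```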

We can use this property to adjust the synchronization between the traces produced at different nodes. Let’s assume we have two nodes with some clock skew. These nodes produce logs that are not fully synchronized in time (for now, we also assume knowledge of a global, “ideal” time). The events in the log happen instantaneously; however, we can rely on the machine’s clock to measure the time between events on the same node, giving some “rigidity” to the logs. Each node timestamps the log events with its machine’s physical time (PT).

[Figure: Logs aligned based on PT. Ideal global time is shown in red.]

In the figure above, the two logs are not synchronized in the “ideal” time, even though they appear to be in sync based on the PT of each node. Without any additional information, we cannot improve the synchrony of these logs. However, if we replace PT with HLC, we can achieve better trace synchronization.

[Figure: HLC can be used instead of PT to synchronize the traces.]

With the addition of HLC time, we may see that when the logs are aligned by PT only, some HLC timestamps (highlighted in yellow) appear to be out of place. In particular, this alignment does not satisfy the LC condition (A hb B ⇒ hlc.A < hlc.B): the alignment makes event c2 appear to happen before a1, however, hlc.c2 > hlc.a1, so c2 could not have happened before a1. In order to satisfy the condition, we need to re-sync the logs such that c2 appears concurrent with a1.

[Figure: HLC is used to improve log synchronization.]

After the adjustment, the synchronization error between the two nodes is reduced. Note that we cannot synchronize the logs even better and put a1 strictly before c2, since the LC implication simply does not allow us to conclude that a1 happened before c2; the two events may be concurrent.

The two-node synchronization works nicely because the LC/HLC implication provides some guarantees about two events, and we pick these two events from two separate nodes. Aligning more than two logs is more challenging, as we need to perform many comparisons, each involving just two events from some pair of nodes. The number of comparisons we need to make grows drastically as we increase the number of traces to sync.

However, with HLC we can reduce the problem to performing just n-1 two-node log alignments when we need to sync logs from n nodes. HLC operates by learning of higher timestamps from communication, so the HLC time at all nodes of the cluster tends to follow the PT of the node with the highest PT clock. Once again, to see this you need to understand how HLC ticks, which is explained here. Having one node that drives the entire HLC in the cluster allows us to synchronize the log of every other node independently against the log of that HLC “driver” node.
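
Here is a rough sketch of what such a pairwise alignment could look like. This is my own illustration of the heuristic described above, not code from any paper or library; events are assumed to be (pt, hlc) tuples with hlc as an (l, c) pair, and the first log simply stands in for the HLC “driver” node.

```python
def pairwise_hlc_adjustment(log_a, log_b):
    """Compute how far log_b must be shifted forward so that no event b
    that provably did not happen before some event a (hlc_b >= hlc_a)
    still appears strictly earlier than a in the PT-based alignment.

    log_a, log_b: lists of (pt, hlc) tuples, hlc being an (l, c) pair.
    Returns the shift, in PT ticks, to add to every pt in log_b.
    Sketch of the simple one-directional case where log_b lags log_a.
    """
    shift = 0
    for pt_a, hlc_a in log_a:
        for pt_b, hlc_b in log_b:
            # hlc_b >= hlc_a  =>  b did not happen before a,
            # so b should not be aligned strictly before a.
            if hlc_b >= hlc_a and pt_b < pt_a:
                shift = max(shift, pt_a - pt_b)
    return shift

def align(logs):
    """Align n logs by syncing each one pairwise against the first log,
    which stands in for the HLC 'driver' node in this sketch."""
    driver, rest = logs[0], logs[1:]
    return [0] + [pairwise_hlc_adjustment(driver, log) for log in rest]
```

The double loop is quadratic in the log lengths, which is fine for an illustration but would need some indexing or sampling for long traces.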

Some Testing and Limitations

I set up some synthetic tests to see if HLC can help us achieve any improvement in log synchronization on an already loosely synchronized system. The benchmark generates the logs with a number of parameters I could control: time skew, communication latency, and event probabilities. The maximum time-skew controls the clock desynchronization in the simulated cluster, measured in time ticks; two of the simulated nodes have the maximum time-skew difference between their clocks. The communication latency parameters control the latency of simulated communication, in time ticks. The probability of an event controls the chance of an event happening at every time tick; similarly, the probability of communication determines the chance of an outgoing message happening at a given tick.

Since the logs are generated synthetically, I have access to the ideal time synchronization between these logs, allowing me to quantify the alignment error. I calculated the alignment error as 1 – adjustment / skew for every pair of logs.
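
For reference, here is the error metric in code form; a tiny hedged helper that assumes the true skew between the pair of logs is known, as it is in this synthetic setup:

```python
def alignment_error(adjustment, true_skew):
    """Alignment error for a pair of logs: 1.0 means HLC found no adjustment
    (no improvement over plain PT alignment); 0.0 means the adjustment
    fully recovered the true clock skew."""
    return 1.0 - adjustment / true_skew
```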

The figure below shows the error in synchronizing 5 logs as a function of the log length. The idea is that a longer trace allows more opportunities to find violations of the LC/HLC condition and adjust the logs accordingly. All other parameters were kept constant, with a skew of 50 ticks, a communication delay of 3 to 10 ticks, and a 40% chance of an event per tick. I made separate plots for different probabilities of inter-node communication. I repeated the simulation 100,000 times and computed the average error.

[Figure: Log alignment error as a function of log length.]

We can see that in this setup the synchronization between logs was significantly improved, and the improvement happens faster when communication is more frequent. This is because communication introduces causality between nodes, allowing me to exploit it for synchronization. As the logs grow longer, the improvements diminish. This is likely because the remaining misalignment approaches the communication delay, at which point HLC synchronization can no longer make improvements.

The inability to achieve any synchronization improvement when the communication latency is equal to or greater than the maximum time skew in the system is the most significant limitation of HLC synchronization. The following graph illustrates this:

[Figure: HLC synchronization limitations.]

Here I ran the simulation with 3 different skew levels: 10, 20, and 50 ticks. As the communication latency increases, the error grows as well, reaching the level of no improvement once the latency approaches the time skew.

Some Concluding Words

I think this is a rather naïve and simple way to achieve better synchronization for distributed logging software. One can argue that NTP and PTP time synchronization are rather good nowadays and should be enough for most cases. However, computers are fast now, and many computations can happen even within 1 ms of desynchronization. A few full round trips of network message exchange can fit into that 1 ms on a good local area network.

HLC synchronization’s simplicity allows it to be implemented purely in user-level application code. There is no need to constantly run the NTP protocol to keep tight time synchrony, no need to access or modify any underlying system-level code beyond just reading the clock, and no need to access a high-precision clock that may slow the application down.

HLC can also be used within a single machine to synchronize traces from different threads. Even though all threads and processes have access to the same clock, the clock granularity may still be coarse enough for many computations to happen within a single tick.

HLC sync is not without its limits. Its usefulness degrades as the clock skew approaches the communication latency, but it can still be used as a fail-safe mechanism in case NTP fails.

As an alternative to synchronizing logs, HLC can be used to find consistent global states and search through the distributed application’s past execution.