Paper #196. The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling

The last paper we covered in the Distributed Systems Reading group discussed CPUs, data centers, scheduling, and carbon emissions—we read “The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling.” Below is my improvised presentation of this paper for the reading group.

This paper was an educational read for me, as I learned about scheduling metrics that prioritize the carbon footprint of a computation job over other metrics, like performance or monetary cost. The paper gives a good intro to such carbon-aware metrics, like Software Carbon Intensity (SCI). It also argues that many such metrics are bad: they either lead to more carbon emissions or are too hard to compute accurately to be useful.

The problem with SCI, according to the authors, is that it considers both the operational and embodied carbon emission of a computing task. Operational carbon is more straightforward; it is the carbon emission directly due to running the job on the hardware in the data center, which results from the energy needed to run servers, cooling, etc. The embodied carbon emission is a bit more complex. It is the carbon emitted in the manufacturing and supply chain needed to produce and install the hardware. Naturally, it is more difficult to compute or estimate.

Let’s assume that we can calculate the embodied carbon of some hardware (the paper specifically focuses on CPUs). The SCI metric assumes that the hardware has a limited lifespan and “assigns” this embodied carbon to jobs as they are scheduled to run on the machine. For example, suppose we expect a CPU to work for 5 years, and a particular long-running job will be on that CPU for a year. In that case, this job’s SCI score will consist of the operational carbon (i.e., a carbon footprint from a year’s worth of electric supply to power and cool the CPU) and one-fifth of the CPU’s embodied carbon.
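To make the arithmetic concrete, here is a minimal sketch of the attribution described above. It follows my simplified description (embodied carbon split evenly over the expected lifespan), not necessarily the exact formula from the paper or the SCI specification. The function names and all the numbers are made up for illustration.

```python
def embodied_share_kg(embodied_kg, lifespan_years, job_years):
    """Portion of the hardware's embodied carbon attributed to one job,
    proportional to the fraction of the expected lifespan the job occupies."""
    return embodied_kg * job_years / lifespan_years


def sci_score_kg(operational_kg, embodied_kg, lifespan_years, job_years):
    """Simplified SCI-style score: operational carbon plus the job's
    share of the embodied carbon."""
    return operational_kg + embodied_share_kg(embodied_kg, lifespan_years, job_years)


# Hypothetical values, chosen only to mirror the 5-year CPU / 1-year job example.
cpu_embodied_kg = 25.0      # assumed embodied carbon of the CPU
job_operational_kg = 100.0  # assumed carbon from a year of power and cooling

job_share = embodied_share_kg(cpu_embodied_kg, lifespan_years=5, job_years=1)
job_score = sci_score_kg(job_operational_kg, cpu_embodied_kg, lifespan_years=5, job_years=1)

print(f"embodied share: {job_share} kg, SCI-style score: {job_score} kg")
```

With these assumed inputs, the job is charged one-fifth of the CPU's embodied carbon on top of its operational emissions.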

In heterogeneous clusters, the SCI metric may lead to suboptimal scheduling, resulting in more carbon emissions! This happens because a lower SCI score (which is supposed to correspond to lower emissions for a job) can be achieved by a CPU or server with very low embodied carbon, even if that server consumes substantially more power per unit of performance. The concrete example the paper provides compares an old and a new CPU. The older model has a substantially lower embodied carbon but also delivers markedly less performance per watt in a benchmark. The differences in performance and power consumption of the two CPUs result in the older model having a higher operational carbon footprint. However, the older CPU still has a lower SCI score per unit of performance-year than the newer one. As a result, when given a choice, an SCI-based scheduler will pick the older, less efficient CPU, resulting in higher operational emissions.

Here is the kicker: if a scheduler has a choice between two models, then the carbon needed to produce both CPUs or servers has already been emitted. Not using the more efficient option does not reduce the carbon footprint of manufacturing a server that has already been made. It simply makes the more efficient server sit idle, and the SCI metric does not account for the idle server's embodied carbon. The paper makes the point that at scheduling time we need to prioritize the operational footprint and favor more efficient CPUs or servers, since the carbon needed to manufacture them has already been released regardless of the task scheduling choice. The embodied carbon can then be optimized when acquiring new hardware for the data center.
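The old-versus-new CPU argument can be sketched as a toy scheduler that picks the CPU with the lowest score under either metric. This is my own illustration, not code or data from the paper: the `Cpu` class, the helper functions, and every number are hypothetical, picked only so that the old CPU has lower embodied carbon but higher operational carbon for the same job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    embodied_kg: float     # hypothetical manufacturing carbon
    lifespan_years: float  # hypothetical expected lifespan
    operational_kg: float  # hypothetical carbon to run the same job on this CPU


def sci_metric(cpu, job_years):
    """Operational carbon plus the job's time-based share of embodied carbon."""
    return cpu.operational_kg + cpu.embodied_kg * job_years / cpu.lifespan_years


def operational_only_metric(cpu, job_years):
    """Ignores embodied carbon, which is already emitted at scheduling time."""
    return cpu.operational_kg


def pick(cpus, metric, job_years):
    """Toy scheduler: choose the CPU with the lowest metric value."""
    return min(cpus, key=lambda cpu: metric(cpu, job_years))


old_cpu = Cpu("old", embodied_kg=10.0, lifespan_years=5.0, operational_kg=50.0)
new_cpu = Cpu("new", embodied_kg=40.0, lifespan_years=5.0, operational_kg=45.0)
cpus = [old_cpu, new_cpu]

sci_choice = pick(cpus, sci_metric, job_years=1.0)
operational_choice = pick(cpus, operational_only_metric, job_years=1.0)

# Both CPUs are already built, so the scheduling decision only changes
# the operational emissions that actually get released.
extra_emitted_kg = sci_choice.operational_kg - operational_choice.operational_kg

print(f"SCI picks: {sci_choice.name}, operational-only picks: {operational_choice.name}")
print(f"extra carbon emitted by following SCI: {extra_emitted_kg} kg")
```

With these made-up inputs, the SCI-style score favors the old CPU (52 vs. 53), while the operational-only score favors the new one (45 vs. 50). Since the embodied carbon of both CPUs is sunk, following the SCI-style choice releases 5 kg more carbon for this job.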

This paper is full of calculations to make its point, and I recommend checking them out! Some figures suggest that the SCI metric can result in up to 25% more carbon than a purely operational metric with no embodied carbon (oSCI). I think it would also be interesting to see how SCI and the simpler operational-only oSCI compare to other scheduling metrics driven by performance or money. There are also some additional challenges, ranging from premature hardware failures, which surely drive up the embodied carbon due to replacements, to embodied costs associated with hardware that has an engineered lifespan (e.g., SSDs). Nevertheless, the basic premise should still hold, and operational decisions should prioritize the more efficient hardware over the less efficient one. Doing the opposite is like buying a fuel-efficient hybrid car but continuing to drive a 20-year-old gas-guzzler simply because the old car had lower embodied carbon emissions, since it did not need the battery and electric motors.

Reading Group

The Distributed Systems reading group meets on Thursdays at 1 pm over Zoom. Please join our Discord server for more info.