
Reading Group Paper: Hyrax: Fail-in-Place Server Operation in Cloud Platforms

In the 142nd reading group meeting, we discussed the OSDI’23 paper “Hyrax: Fail-in-Place Server Operation in Cloud Platforms.” Hyrax allows servers with certain types of hardware failures to return to service after some software-only, automated mitigation steps.

Traditionally, when a server malfunctions, the VMs are migrated off of it, and then the server gets shut down and scheduled for repair. The paper calls this an “all-or-nothing” model for repairs that leaves only perfectly healthy servers in operation. This behavior aligns well with “distributed” people like me — we treat nodes as disposable and replaceable. However, the physical hardware is not disposable and is sometimes hard to replace. The paper suggests that repair times can stretch to nearly half a year in some cases, likely due to the availability of parts for older machines. The authors also state that 22% of servers will experience at least one failure over their 6-year lifespan.

Anyway, Hyrax suggests that this shutdown-and-repair process may be overkill, especially in large data centers and clouds. The repairs take a lot of technician time, consume resources, and decommission capacity for days at a time. Surprisingly (at least to me), many component failures within a server do not necessarily lead to a complete loss of function, creating a possibility for the “Fail-in-Place” model. The Fail-in-Place approach works by remotely deactivating failed components, physically leaving them inside the server, and returning the server to production, albeit with some capacity/performance restrictions. This way, the capacity is returned more quickly, and valuable technician time can be deferred until the repairs can be done more efficiently (i.e., when more failures accumulate in the server or nearby).

The paper focuses on failures of a select few component types: memory DIMMs, SSDs, and cooling fans. For each failure, the system tries to identify the specific failed component, such as a concrete DIMM or SSD. If successful, that component can be deactivated. The deactivation, however, is the tricky part, with multiple problems to solve, ranging from actually doing it without physically removing the components to ensuring that the server can still perform adequately to host at least some VM types.
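To make the flow a bit more concrete, here is a rough sketch of the decision logic as I read it; all function and component names below are my own placeholders, not Hyrax’s actual interfaces.

```python
def try_fail_in_place(server, failed_components):
    """Return the VM types the server may still host, or None if it must go to repair."""
    deactivated = []
    for component in failed_components:
        if not is_identifiable(component):
            return None                            # cannot pinpoint the faulty part -> send to repair
        deactivate_in_firmware(server, component)  # e.g., mask a DIMM via a BIOS setting
        deactivated.append(component)
    if not passes_verification(server):            # make sure nothing else is unstable
        return None
    return capacity_lookup(deactivated)            # reduced capacity, but back in production

# Minimal stand-ins so the sketch runs; the real logic lives in the cloud control plane.
def is_identifiable(component): return component.startswith(("DIMM", "SSD", "FAN"))
def deactivate_in_firmware(server, component): print(f"{server}: deactivating {component}")
def passes_verification(server): return True
def capacity_lookup(deactivated): return {"small", "medium"}

print(try_fail_in_place("server-42", ["DIMM_A3"]))  # reduced set of allowed VM types
```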

The paper focuses mainly on DIMM deactivation since it is the most complicated case. A faulty “memory stick” gets deactivated via a nearly undocumented BIOS feature, leaving the server in a somewhat crippled state. See, turning off a DIMM or two leaves the server in a configuration not supported by the CPU and its memory controllers, since the DIMM placement rules get violated. Long story short, such a server will have both reduced memory and a memory address space fragmented into regions with different bandwidths — some memory addresses may work as fast as on unaffected servers, while others can have bandwidth degraded by an order of magnitude. To deal with the problem, Hyrax modifies the scheduling policies for the VMs. For example, a degraded server may not host larger VM types that require a lot of memory with high bandwidth. The system also manages VM placement locally to better adhere to these memory regions with different speeds — placing smaller VMs in slower memory regions may be okay since these smaller VMs, even on healthy servers, do not utilize the full memory bandwidth. Naturally, Hyrax has all these combinations precomputed and stored in a database, so the rules are a simple mapping from the server’s degraded configuration to the VM types it can host.
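Here is a minimal sketch of what such a precomputed lookup could boil down to; the degraded configurations, VM type names, and allowed sets below are made up for illustration and are not from the paper.

```python
# Hypothetical precomputed table: degraded memory configuration -> VM types the server can host.
CAPACITY_TABLE = {
    ():                                            {"small", "medium", "large", "xlarge"},
    ("dimm_off_channel_0",):                       {"small", "medium", "large"},
    ("dimm_off_channel_0", "dimm_off_channel_1"):  {"small", "medium"},
}

def allowed_vm_types(deactivated_dimms):
    """Scheduling becomes a simple lookup keyed by the server's degraded configuration."""
    return CAPACITY_TABLE.get(tuple(sorted(deactivated_dimms)), set())  # unknown config -> host nothing

print(allowed_vm_types([]))                         # healthy server: all VM types
print(allowed_vm_types(["dimm_off_channel_0"]))     # degraded server: only smaller VM types
```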

The SSD story is much simpler than the DIMM one. Each server the Hyrax prototype was tested on had 6 SSDs, all striped to get maximum performance. The authors calculated that 5 SSDs still provide enough performance even for the largest VMs, so deactivating one SSD has no performance impact. For cooling fan failures, it appears that cooling is severely overprovisioned, and even 2 “dead” fans do not impact the server.
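As a back-of-the-envelope illustration of the SSD argument (the throughput numbers below are mine, not the paper’s):

```python
SSD_BANDWIDTH_MBPS = 2_000      # assumed per-SSD throughput, not from the paper
LARGEST_VM_NEED_MBPS = 9_000    # assumed requirement of the biggest VM type, not from the paper

def stripe_is_sufficient(healthy_ssds):
    """With striping, aggregate bandwidth scales with the number of healthy SSDs."""
    return healthy_ssds * SSD_BANDWIDTH_MBPS >= LARGEST_VM_NEED_MBPS

print(stripe_is_sufficient(6))  # True: full stripe
print(stripe_is_sufficient(5))  # True: one SSD deactivated, still enough headroom
print(stripe_is_sufficient(4))  # False under these made-up numbers
```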

After deactivating the malfunctioning component, Hyrax returns the server to production with the new capacity restrictions for VM sizes. However, it does so only after extensive testing and verification to ensure that the correct components have been deactivated and there are no other faults that could cause unstable performance.

The paper provides much more information and details, including an evaluation of the system!


Reading Group. Running BGP in Data Centers at Scale

Our 82nd reading group paper was “Running BGP in Data Centers at Scale.” This paper describes how Facebook adopted the BGP protocol, normally used at Internet scale, to provide routing capabilities inside their data centers. They are not the first to run BGP in the data center, but the paper is nevertheless interesting for the details it gives on how BGP is used and how Facebook data centers are structured. As a disclaimer, I am not a networking person, so I will likely grossly oversimplify things from now on.

The BGP protocol provides routing between different Autonomous Systems (ASes). An AS is a self-contained network; for example, on the Internet, an ISP network may be its own AS. The protocol relies on peering connections to discover different ASes and what addresses are contained in or reachable through these ASes. With this information, the protocol can route packets to the destination ASes.

BGP is a kind of high-level protocol to route between ASes, but how packets navigate inside each AS is left to the AS itself. Facebook uses BGP again inside its data centers. Here, a bit of Facebook’s data center architecture can help with the story. As seen in the image I borrowed from the paper, a data center consists of multiple “Server Pods,” and the Server Pods are connected to each other via multiple “Spine Planes.” Each Server Pod has Fabric Switches (FSWs) that communicate with the Spine Planes, and Rack Switches (RSWs) that reach individual servers.
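To keep the pieces straight, here is a toy data-structure view of that hierarchy; the counts and names are invented and not from the paper.

```python
# A deliberately simplified picture of the topology described above.
datacenter = {
    "spine_planes": ["sp-1", "sp-2", "sp-3", "sp-4"],
    "server_pods": {
        "pod-1": {
            "fabric_switches": ["fsw-1", "fsw-2", "fsw-3", "fsw-4"],  # uplinks toward the Spine Planes
            "rack_switches":   ["rsw-1", "rsw-2", "rsw-3"],           # downlinks to individual servers
        },
        "pod-2": {
            "fabric_switches": ["fsw-1", "fsw-2", "fsw-3", "fsw-4"],
            "rack_switches":   ["rsw-1", "rsw-2"],
        },
    },
}

# Cross-pod traffic goes RSW -> FSW -> Spine Plane -> FSW -> RSW.
print(len(datacenter["server_pods"]["pod-1"]["rack_switches"]))  # 3 racks in pod-1
```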

Each Server Pod is given an AS number and further subdivided into sub-ASes, with different Rack Switches having their own AS numbers. Similarly, Fabric Switches also have their own AS numbers. This concept of hierarchically dividing ASes into sub-ASes is often referred to as BGP confederations. Anyway, the AS numbering is kept uniform across Facebook’s data centers. For instance, the first Spine Plane is always given the number 65001 in each data center. Similar uniformity exists with Server Pods and their sub-ASes. Such uniformity makes it easier to develop, debug, and troubleshoot the network.
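To illustrate the uniform numbering idea, here is a toy scheme; only the first Spine Plane being 65001 comes from the paper, and the rest of the formulas are placeholders I made up.

```python
# A toy numbering scheme to illustrate uniform, role-based AS assignment.
def spine_plane_asn(index):                    # index starts at 1
    return 65000 + index                       # first Spine Plane -> 65001 in every data center

def server_pod_asn(pod_index):
    return 65100 + pod_index                   # each Server Pod is its own (confederation) AS

def rack_switch_asn(pod_index, rsw_index):
    return 65500 + pod_index * 10 + rsw_index  # sub-AS inside the Pod's confederation

print(spine_plane_asn(1))      # 65001, identical across data centers
print(rack_switch_asn(2, 3))   # 65523 -- the same formula everywhere makes debugging easier
```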

Unlike BGP on the Internet, Facebook has complete control over all of the ASes in their data center network. This control allows them to design peering and routing policies that work best in their specific topology. For instance, “the BGP speakers only accept or advertise the routes they are supposed to exchange with their peers according to our overall data center routing design.” The extra knowledge and uniformity also allow the system to establish backup routes for different failures. These backup routes can be very carefully designed to avoid overloading other components/networks or causing other problems.
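Conceptually, such a policy boils down to a whitelist check like the sketch below; this is not Facebook’s implementation, just an illustration of accept/advertise filtering with invented peer names and prefixes.

```python
# Conceptual illustration of "only accept/advertise the routes a peer is supposed to exchange."
EXPECTED_FROM_PEER = {
    "rsw-1.pod-1": {"10.1.1.0/24"},     # a rack switch should only advertise its racks' prefixes
    "fsw-2.pod-1": {"10.1.0.0/16"},     # a fabric switch aggregates its pod's address space
}

def accept_route(peer, prefix):
    """Drop anything a peer is not expected to advertise under the routing design."""
    return prefix in EXPECTED_FROM_PEER.get(peer, set())

print(accept_route("rsw-1.pod-1", "10.1.1.0/24"))   # True: matches the design
print(accept_route("rsw-1.pod-1", "10.9.0.0/16"))   # False: unexpected route, rejected
```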

The system design is also scalable. Intuitively, adding more servers should be as simple as adding another Server Pod to the network. The hierarchical nature of the BGP confederations means that a Pod can be added easily — the rest of the network only needs to know about the Pod and not all the ASes inside the Pod.

The paper talks about quite a few other details that I do not have time to cover. However, a few important parts stuck with me. First is the implementation. Facebook uses a custom multi-threaded implementation with only the features they need to support their data center networks. The second point I want to mention has to do with maintainability. Testing and deployment are a big part of the reliability and maintainability of the network. The entire process is approached similarly to the regular software development and deployment process. New versions are tested in simulations before proceeding to canary testing and, finally, a gradual production roll-out. Recently, I have been going through a bunch of outage reports for one of my projects, and some of them mention a “network configuration change” as the root cause of the problem. I think these “configuration change” issues could have been mitigated with proper testing and deployment processes.

We had our group’s presentation by Lakshmi Ongolu, available on YouTube:

Reading Group

Our reading group takes place over Zoom every Wednesday at 2:00 pm EST. We have a Slack group where we post papers, hold discussions, and, most importantly, manage Zoom invites to paper discussions. Please join the Slack group to get involved!