Cloudflare (Workers AI) · published Jun 1, 2026

How we reduced core unit boot time from hours to minutes


2026-06-01

Giovanni Pereira Zantedeschi

Nnamdi Ajah

Omar Sheik-Omar

7 min read

Cloudflare's core is the set of centralized data centers that run our control plane, billing, and analytics, distinct from the globally distributed edge that handles user traffic. Core servers are bare metal, and when issues happen during reboot, the consequences can cascade fast. Their boot sequence is orchestrated by UEFI, the modern firmware standard that initializes hardware and hands off control to the operating system. Small quirks in that handoff can have outsized consequences.

After a routine firmware update, some of our core servers were taking four hours to come back online, rather than the minutes they took before. What should have been a one-day fleet-wide rollout was stretching into multi-day slogs. New nodes faced the full timeout gauntlet on their very first boot. Maintenance windows ballooned. Engineering teams had to babysit upgrades that should have run unattended.

The behavior came to light when we were bringing online nodes that had been powered off for an extended period. These nodes' firmware was out of date and needed multiple updates to bring it current. Combine this with recent updates to the boot protocols used by servers in some of our locations, and boot times on the affected nodes became unacceptable.

This is the story of how we tracked the cause to a firmware quirk and an over-eager linear search through every available network boot interface, and how we cut total boot and upgrade time from hours back down to minutes. Along the way, we'll share what we learned about UEFI internals, vendor-specific quirks, and the automation strategies that ultimately solved the problem.

The network boot interface

A network boot interface allows a server to boot its operating system over the network instead of from local storage. This is critical for centralized, automated, and scalable control over how machines start up, especially across a globally distributed fleet serving different workloads. Because our servers sit in different environments and serve different purposes, they differ in which network boot interface they require. The two primary interfaces are the Preboot Execution Environment (PXE) and Unified Extensible Firmware Interface (UEFI) HTTPS boot. As part of our reboot process, our servers usually go through PXE for various automation reasons.

At Cloudflare, we use iPXE, an open-source network boot firmware that supports modern protocols like HTTP and HTTPS. This allows computers to boot operating systems directly from web servers, the cloud, or enterprise storage networks with significantly faster speeds and greater reliability. For organizations, iPXE turns the boot process into a programmable workflow. It offers advanced scripting capabilities that allow IT teams to automate complex deployments, such as provisioning servers based on specific hardware configurations or managing secure, diskless workstations.

Some of our hardware supports HTTPS-based UEFI network boot, which lets the motherboard firmware download operating system files natively and securely.

The linear search

Our tale begins with that fateful firmware update. Following the update, the first reports came through our internal channels: servers weren't coming back online. Monitoring dashboards showed machines stuck in a pre-OS state for far longer than expected. Our initial suspicion was a firmware regression: perhaps the update itself had introduced a bug that was hanging the boot process.

To rule that out, we pulled up the serial console on an affected machine and watched a boot cycle in real time. The firmware Power On Self Test (POST) completed normally and hardware initialization looked healthy. But then, instead of quickly reaching the network boot stage and pulling down an OS image, the server sat waiting. And waiting. The console output told the story: the system was attempting an IPv4 HTTPS network boot, timing out after several minutes, then trying IPv4 iPXE, timing out again, then repeating both, all before finally reaching the IPv6 HTTPS boot interface that would actually succeed.

Every failed network boot attempt burned roughly five minutes waiting for a timeout. With four failed attempts stacking up before the correct interface was reached, a single boot cycle wasted around twenty minutes. For a routine reboot, that's painful. Firmware upgrade automation is worse: it requires multiple sequential reboots, one per component, so those twenty-minute penalties compounded into nearly four hours of idle waiting per server.
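The arithmetic above can be sketched as a small model. This is only an illustration, not our automation code: the `BootInterface` type, the `wasted_minutes` helper, and the reboot count of 11 are all hypothetical, and the five-minute timeout is the approximate figure quoted above.

```python
from dataclasses import dataclass

TIMEOUT_MINUTES = 5  # approximate per-attempt timeout quoted in the text


@dataclass(frozen=True)
class BootInterface:
    name: str
    responds: bool  # True if a boot server answers on this interface


def wasted_minutes(boot_order: list[BootInterface]) -> int:
    """Minutes spent on timeouts before the first interface that responds."""
    wasted = 0
    for iface in boot_order:
        if iface.responds:
            return wasted
        wasted += TIMEOUT_MINUTES
    raise RuntimeError("no interface in the boot order responded")


# Attempt sequence as described from the serial console output.
observed_order = [
    BootInterface("IPv4 HTTPS", False),
    BootInterface("IPv4 iPXE", False),
    BootInterface("IPv4 HTTPS", False),
    BootInterface("IPv4 iPXE", False),
    BootInterface("IPv6 HTTPS", True),
]

per_boot = wasted_minutes(observed_order)  # 4 failed attempts * 5 min = 20

# Hypothetical: the text does not state the number of reboots. Eleven is
# chosen only because it lands near the "nearly four hours" figure.
ASSUMED_REBOOTS = 11
total_wasted = per_boot * ASSUMED_REBOOTS  # 220 minutes

print(f"{per_boot} min per boot, {total_wasted} min across {ASSUMED_REBOOTS} reboots")
```

The point of the model is that the penalty scales with both the position of the working interface in the search order and the number of reboots, so fixing either factor shrinks the total.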

No searching games: Declare my boot interface

After tracing the boot sequence and isolating the timeout pattern, the root cause became clear: the servers were blindly searching through every available network boot interface, one by one, waiting for each to fail before moving on. The fix was to eliminate the guesswork entirely: declare the correct boot interface upfront so the system never wastes time on interfaces that will never respond. But putting this into practice was far from straightforward. As we explain next, we hit several obstacles: the order of our boot automation workflow, a setting we were blocked from changing, and differing string formats from our different network interface card vendors.
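In the abstract, declaring the interface upfront amounts to moving the known-good entry to the front of the boot order. The sketch below shows only that idea. The `declare_boot_interface` function and the plain-string interface names are hypothetical, and it does not model any real firmware setting or vendor string format.

```python
def declare_boot_interface(boot_order: list[str], preferred: str) -> list[str]:
    """Return a new boot order with the preferred interface tried first.

    The remaining entries keep their relative order, so they still act as
    fallbacks if the preferred interface ever fails.
    """
    if preferred not in boot_order:
        raise ValueError(f"{preferred!r} is not in the boot order")
    rest = [name for name in boot_order if name != preferred]
    return [preferred] + rest


# Hypothetical names mirroring the attempt sequence described earlier.
searched_order = ["IPv4 HTTPS", "IPv4 iPXE", "IPv4 HTTPS", "IPv4 iPXE", "IPv6 HTTPS"]
declared_order = declare_boot_interface(searched_order, "IPv6 HTTPS")

# Failed attempts before the working interface is reached.
attempts_before = searched_order.index("IPv6 HTTPS")  # 4
attempts_after = declared_order.index("IPv6 HTTPS")   # 0

print(declared_order)
print(f"failed attempts: {attempts_before} -> {attempts_after}")
```

Keeping the other interfaces as fallbacks rather than deleting them is a design choice made for this sketch. It is not a claim about how the fleet is actually configured.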

Our boot automation workflow

Our boot automation flow has three broad stages: firmware initialization, pre-boot, and kernel startup. After power-on, the UEFI firmware initializes hardware and peripherals, then hands off to the PXE pre-boot environment. The pre-boot stage sets up the network card and executes a small program called a bootloader, which kickstarts the kernel. It's in this PXE stage that the various network interfaces are probed to find the right one. On first boot, firmware upgrades are included in our boot automation workflow. Because each firmware upgrade requires a reboot (and its attendant sequence of network boot attempts), the total boot time ended up close to four hours.

By restructuring the automation sequence to declare the network boot interface order early in the pre-boot PXE stage for each hardware type and use case, we were able to cut the total time by about an hour, since the boot process no longer needed to spend 20 minutes probing for each firmware…

