Slashing Core Unit Boot Time: From Hours To Minutes

The core servers at Cloudflare, which handle the control plane, billing, and analytics, rely on centralized data centers. This is different from the worldwide network of edge servers responsible for processing user traffic. These core servers are “bare metal,” meaning they run directly on hardware without a virtualization layer. Because of this, problems that occur during a reboot can spread very quickly through the system.

The boot process for these servers is managed by UEFI. This modern firmware standard is responsible for getting hardware ready and passing control to the operating system. However, even the smallest, unexpected behaviors during this transition can lead to major problems.

Following a standard firmware update, some of our core servers suddenly needed four hours to restart, a process that previously completed in minutes. A fleet-wide update, initially expected to take a single day, was now dragging on for several days. New servers being set up hit the same long delays during their initial boot. Scheduled maintenance windows grew longer, and engineering teams had to manually monitor updates that were designed to run unattended.

This problem affected the entire Gen12 fleet, a group of nearly 2,000 machines. Any failure during the update process forced a complete restart of the sequence, leaving new servers inactive and stuck in line behind these long delays.

This is the story of how we identified the cause: a firmware flaw combined with a slow, sequential network boot search. We will explain how we reduced boot and upgrade times from hours back down to minutes. We’ll also share what we learned about UEFI, hardware-specific issues, and the automated solutions we used.

The network boot interface

A network boot interface allows a server to load its operating system from the network, not from its own local disks. This is essential for automated, consistent, and scalable control over system startups, particularly across a global fleet handling different tasks. As our servers are used for different roles in various places, each has its own specific network boot needs. The two main types are Preboot Execution Environment (PXE) and Unified Extensible Firmware Interface (UEFI) HTTPS boot.

During our standard reboot process, our servers typically use PXE for automation. At Cloudflare, we utilize iPXE, an open-source network boot program. It supports current protocols like HTTP and HTTPS, allowing systems to boot directly from web servers, cloud platforms, or corporate networks quickly and reliably.

For companies, iPXE makes the boot process customizable. Its advanced scripting features allow IT departments to automatically handle complex setups, such as configuring servers for specific hardware or managing secure systems without hard drives.

Some of our hardware can also use HTTPS-based UEFI network boot, which lets the server’s built-in firmware securely download operating system files over the web.

This situation started with a problematic firmware update. The first warnings came through our internal systems: servers were not restarting. Our monitoring dashboards showed machines stuck in a pre-OS state for unreasonably long periods. We initially suspected the update itself, thinking it contained a bug that was stalling the boot process.

To investigate, we connected to an affected server’s serial console and watched a boot cycle happen live. The firmware’s Power On Self Test (POST) finished correctly, and the hardware setup appeared normal. But then, instead of quickly moving to the network boot stage to download an OS image, the server just waited. And kept waiting.

The console messages revealed the issue: the system was trying an IPv4 HTTPS network boot, waiting several minutes before timing out. It then tried IPv4 iPXE, which also timed out. It repeated both attempts before finally reaching the IPv6 HTTPS boot interface, which worked.

Each failed network boot attempt wasted about five minutes waiting for a timeout. With four failed attempts before finding the right one, a single boot cycle lost around twenty minutes. For a normal restart, this is a major issue. For automated firmware upgrades, which need several reboots in a row (one for each component), these twenty-minute delays added up to nearly four hours of wasted time per server.
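The arithmetic behind those figures can be sketched as follows. The timeout and the number of failed attempts come from the text above; the reboot count is not stated in this post, so `ASSUMED_REBOOTS` is a hypothetical value chosen only because it is consistent with "nearly four hours".

```python
# Figures taken from the text above.
TIMEOUT_MINUTES = 5   # approximate wait per failed network boot attempt
FAILED_ATTEMPTS = 4   # HTTPS IPv4, PXE IPv4, then both once more

# Hypothetical value: the post does not state how many reboots an upgrade needs.
ASSUMED_REBOOTS = 12


def wasted_minutes_per_boot(failed_attempts: int, timeout_minutes: int) -> int:
    """Minutes lost to timeouts before the working interface is reached."""
    return failed_attempts * timeout_minutes


def wasted_minutes_per_upgrade(reboots: int, minutes_per_boot: int) -> int:
    """Minutes lost across an upgrade that reboots once per component."""
    return reboots * minutes_per_boot


per_boot = wasted_minutes_per_boot(FAILED_ATTEMPTS, TIMEOUT_MINUTES)
per_upgrade = wasted_minutes_per_upgrade(ASSUMED_REBOOTS, per_boot)

print(f"{per_boot} minutes lost per boot")
print(f"{per_upgrade / 60:.1f} hours lost per upgrade")
```

Under these assumptions the overhead grows linearly with the number of components being updated, which is why a delay that looks tolerable on a single restart dominates a multi-step upgrade.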

No searching games: Declare my boot interface

Once we had analyzed the boot sequence and identified the timeout pattern, the root cause was clear: the servers were checking every possible network boot interface one at a time, waiting for each to fail before trying the next. The solution was to remove this trial-and-error process completely. By specifying the correct boot interface from the start, the system would never waste time on options that wouldn’t work.
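To illustrate why declaring the interface matters, here is a small model of a firmware walking its boot order. It is a simplified sketch, not the firmware's actual logic: the entry labels are illustrative, the function name is hypothetical, and the time taken by the successful attempt itself is ignored.

```python
from typing import List, Tuple


def minutes_until_boot(
    order: List[str], working: str, timeout_minutes: int
) -> Tuple[int, List[str]]:
    """Walk the boot order; every entry before the working one costs a timeout.

    Returns the minutes wasted and the list of entries that were tried.
    """
    wasted = 0
    tried: List[str] = []
    for entry in order:
        tried.append(entry)
        if entry == working:
            return wasted, tried
        wasted += timeout_minutes
    raise RuntimeError("no working boot interface in the boot order")


# Order observed on the serial console (labels are illustrative).
observed_order = ["HTTPS IPv4", "PXE IPv4", "HTTPS IPv4", "PXE IPv4", "HTTPS IPv6"]
# Order after declaring the working interface up front.
declared_order = ["HTTPS IPv6"]

observed_wait, observed_tried = minutes_until_boot(observed_order, "HTTPS IPv6", 5)
declared_wait, declared_tried = minutes_until_boot(declared_order, "HTTPS IPv6", 5)

print(f"observed order: {observed_wait} minutes wasted over {len(observed_tried)} attempts")
print(f"declared order: {declared_wait} minutes wasted over {len(declared_tried)} attempts")
```

The model makes the design choice explicit: the cost depends only on how many failing entries sit in front of the working one, so reordering removes the wait without touching the timeout itself.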

However, putting this fix into action was not simple. As we’ll describe next, we faced several challenges: the order of our boot automation steps, a configuration setting we couldn’t change, and different text formats used by our various network card manufacturers.

Our boot automation workflow

Our automated boot process has three main phases: firmware setup, pre-boot, and kernel launch. When the server powers on, the UEFI firmware first initializes hardware and peripherals, then starts the PXE pre-boot environment. This pre-boot phase configures the network card and runs a small program called a bootloader, which starts the kernel. It’s during this PXE stage that the system tests various network interfaces.

When a server first starts up, firmware updates are automatically applied through our boot automation process.

Since each firmware update requires a system restart (and its corresponding network boot cycle), this process led to total startup times approaching four hours.

We redesigned our automation to specify the network boot device priority early in the pre-boot environment. This change shaved roughly an hour off the total time, as the system no longer spent 20 minutes scanning interfaces to determine whether a firmware update was necessary.

Configuring the boot priority order introduced two specific challenges:

Legacy Hardware Limitations: Older versions of UEFI do not support network boot ordering.
Setting Volatility: Configuration changes are frequently wiped out during a UEFI firmware update.

To handle these issues, we introduced a validation mechanism. The automation now verifies its settings after applying them; if it detects that the configuration has been reverted, it re-applies the settings and initiates a restart.
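The verify-and-reapply idea can be sketched as below. This is a minimal model under stated assumptions: the firmware configuration is represented as a plain dictionary, and the setting name `NetworkBootPriority` and the function `ensure_boot_settings` are hypothetical, not names from our tooling.

```python
from typing import Dict


def ensure_boot_settings(firmware: Dict[str, str], desired: Dict[str, str]) -> bool:
    """Re-apply any desired setting that is missing or has been reverted.

    Returns True when something had to be re-applied, which tells the
    caller to restart so the restored values take effect.
    """
    reverted = {
        name: value for name, value in desired.items() if firmware.get(name) != value
    }
    firmware.update(reverted)
    return bool(reverted)


# Hypothetical setting name and value, for illustration only.
desired = {"NetworkBootPriority": "HTTPS IPv6"}

# Simulate a firmware update that wiped the configuration.
firmware_config: Dict[str, str] = {}

needs_restart_first = ensure_boot_settings(firmware_config, desired)
needs_restart_second = ensure_boot_settings(firmware_config, desired)

print(f"first check, restart needed: {needs_restart_first}")
print(f"second check, restart needed: {needs_restart_second}")
```

The second call reports no change, which is the property that keeps later boots fast: the extra restart is only paid when the settings were actually lost.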

Although this makes the initial boot slightly longer, it dramatically improves stability for future restarts, cutting the time required from approximately 20 minutes down to less than 60 seconds.

Overriding Vendor Boot Order Restrictions

The internal format for network boot options uses an EFI_IFR_REF3 structure that utilizes lazy loading. This means the data remains latent until specifically requested by a UI action:

typedef struct _EFI_IFR_REF3 {
  EFI_IFR_OP_HEADER          Header;
  EFI_IFR_QUESTION_HEADER    Question;
  EFI_FORM_ID                FormId;
  EFI_QUESTION_ID            QuestionId;
  EFI_GUID                   FormSetId;
} EFI_IFR_REF3;

While this is a standard technique used to speed up POST times, it hid the “Network Boot Interface” from our automated tools. Because the data had not been initialized yet, our scripts were unable to identify the priority settings.

We partnered with our hardware vendors to activate specific tokens within the built-in “Boot Order Module.” This forces the system to detect the Network Boot Interface automatically during startup without needing any manual input via the interface.

The firmware from our original equipment manufacturers included a fixed setting, Force Priority Httpv4 Httpv6 Pxev4 Pxev6, which initially blocked us from modifying the boot sequence.

Overcoming this restriction required a new BIOS release from our vendor and a joint troubleshooting session to finalize the boot configuration.

Handling Inconsistent Strings Across NIC Vendors

Varying naming conventions between different network card manufacturers led to mismatches when we attempted to configure the boot order via iPXE.

For example:

UEFI: HTTPS IPv4 Ethernet Network Adapter XXX-XXX-Y for OCP 3.0 P1
UEFI: HTTPS IPv4 Network Adapter - 50:00:E6:8F:4F:32 P1

To solve this, we enhanced the CfHIIConfig_App tool to support partial string matching, allowing it to configure settings without requiring the complete device name:

.*HTTP.*IPv4.*P1

The tool would match this pattern against active configuration strings to identify the correct boot entry. We are currently collaborating with our UEFI suppliers to standardize these strings to include only essential information (such as protocol, transfer type, port number, and slot index) while removing redundant details like MAC addresses. These details, if needed, can be retrieved from the card’s embedded vital product data. This approach reduces both configuration inconsistencies and the necessity of using pattern-matching wildcards.
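The effect of the pattern can be checked against the two example strings above. This sketch uses Python's `re` module purely as an illustration; how CfHIIConfig_App evaluates the pattern internally is not described here, so the whole-string matching used below is an assumption. The last two entries are hypothetical strings added to show what the pattern rejects.

```python
import re

# Pattern quoted in the text above.
BOOT_ENTRY_PATTERN = re.compile(r".*HTTP.*IPv4.*P1")

entries = [
    "UEFI: HTTPS IPv4 Ethernet Network Adapter XXX-XXX-Y for OCP 3.0 P1",
    "UEFI: HTTPS IPv4 Network Adapter - 50:00:E6:8F:4F:32 P1",
    # Hypothetical entries, included only to show non-matches.
    "UEFI: HTTPS IPv6 Network Adapter - 50:00:E6:8F:4F:32 P1",
    "UEFI: HTTPS IPv4 Network Adapter - 50:00:E6:8F:4F:32 P2",
]

# fullmatch requires the whole string to fit the pattern (an assumption
# about the tool's semantics), so an entry must end in "P1" to match.
matches = [entry for entry in entries if BOOT_ENTRY_PATTERN.fullmatch(entry)]

for entry in matches:
    print(entry)
```

Both vendor variants match while the IPv6 and second-port entries do not, which is the behavior the partial match is meant to provide. A looser search-style match would also accept strings with trailing text after P1, which is one reason standardized strings are preferable to wildcards.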

Eliminating Redundant Config Checks in iPXE

Because iPXE interprets these configuration variables as hexadecimal, it read the human-readable strings as raw data. To verify whether a setting had been updated and to save boot time (by avoiding the need to print out variables before changing them), we created a boolean indicator called uefi-same-hex. This flag tracks whether a configuration was actually modified.
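To make the hexadecimal issue concrete, the sketch below shows how a readable string looks once it is treated as raw bytes. It assumes ASCII encoding and a colon-separated hex rendering; the actual encoding of these UEFI variables is not stated in this post, and the helper names are hypothetical.

```python
def as_hex_setting(text: str) -> str:
    """Render a readable string as colon-separated hex bytes (ASCII assumed)."""
    return text.encode("ascii").hex(":")


def from_hex_setting(hex_text: str) -> str:
    """Reverse of as_hex_setting: turn colon-separated hex back into text."""
    return bytes.fromhex(hex_text.replace(":", "")).decode("ascii")


readable = "HTTP"
raw = as_hex_setting(readable)

print(raw)
print(from_hex_setting(raw))
```

Comparing values in this form inside a boot script is awkward, which is why a single flag that reports whether anything changed is easier to act on than a printed value.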

This streamlined our workflow from a two-step process (first displaying the current value to compare it, then setting it if necessary) to a single set command.

# Define the path to the update tracking variable
set buffer-var-guid 91468514-75bc-4bb5-8f33-91efff9e9b1f
set var-upd-path efivar/CfHIIVarUpd-${buffer-var-guid}

# Execute the configuration update command
imgexec  set ${uefi-setting}=${uefi-value}

# Compare the update variable against the expected value.
# If a change occurred, set the flag to trigger a reboot.
iseq ${uefi-same-hex} ${${var-upd-path}} || set has-changed ${uefi-diff-hex}

The Outcome: A More Responsive System

By removing the trial-and-error from our network boot process, we transformed a four-hour sequence back into a 3-minute operation. The new system is fully dynamic and requires zero manual BIOS management. A single firmware image now supports all SKU variations, configuration changes roll out seamlessly through our standard deployment channels, and the entire workflow is managed directly from iPXE.

Metric	Prior to Ordering Update	Post-Ordering Update
Firmware Automation Cycle	Approaching 4 hours	3 minutes
Next Boot Duration	Approx. 20 minutes	Under 1 minute

This optimization was only possible by exploring the nuances of UEFI internals, partnering with OEM vendors to unlock programmatic controls like boot order manipulation, and leveraging open-source technologies such as iPXE to build scalable solutions.

Cloudflare’s OpenBMC team continues to refine, test, and improve the boot experience across our global infrastructure. If you are operating bare-metal servers and facing excessive startup times, we hope this breakdown provides a useful roadmap for identifying and resolving unnecessary bottlenecks in your own network boot workflows. To get started with iPXE and network boot automation, explore the iPXE project resources.
