A few months ago, Cloudflare announced the transition to FL2, our Rust-based rewrite of Cloudflare's core request handling layer. This transition accelerates our ability to help build a better Internet for everyone. Alongside the software stack migration, Cloudflare has refreshed our server hardware design with improved hardware capabilities and better efficiency to serve the evolving demands of our network and software stack. Gen 13 is designed with a 192-core AMD EPYC™ Turin 9965 processor, 768 GB of DDR5-6400 memory, 24 TB of PCIe 5.0 NVMe storage, and a dual 100 GbE port network interface card.
Gen 13 delivers:
- Up to 2x throughput compared to Gen 12 while staying within latency SLA
- Up to 50% improvement in performance-per-watt efficiency, reducing data center expansion costs
- Up to 60% higher throughput per rack while keeping the rack power budget constant
- 2x memory capacity, 1.5x storage capacity, 4x network bandwidth
- Introduced PCIe encryption hardware support in addition to memory encryption
- Improved support for thermally demanding, powerful drop-in PCIe accelerators

This blog post covers the engineering rationale behind each major component decision: what we evaluated, what we chose, and why.
| Generation | Gen 13 Compute | Previous Gen 12 Compute |
|---|---|---|
| Form Factor | 2U1N, single socket | 2U1N, single socket |
| Processor | AMD EPYC™ 9965 | AMD EPYC™ 9684X |
| Memory | 768 GB DDR5-6400, 12 memory channels | 384 GB DDR5-4800, 12 memory channels |
| Storage | 3x E1.S NVMe, Samsung PM9D3a 7.68 TB / Micron 7600 Pro 7.68 TB | 2x E1.S NVMe, Samsung PM9A3 7.68 TB / Micron 7450 Pro 7.68 TB |
| Network | Dual 100 GbE OCP 3.0, Intel Ethernet Network Adapter E830-CDA2 / NVIDIA Mellanox ConnectX-6 Dx | Dual 25 GbE OCP 3.0, Intel Ethernet Network Adapter E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx |
| System Management | DC-SCM 2.0, ASPEED AST2600 (BMC) + AST1060 (HRoT) | DC-SCM 2.0, ASPEED AST2600 (BMC) + AST1060 (HRoT) |
| Power Supply | 1300W, Titanium grade | 800W, Titanium grade |

Figure: Gen 13 server
Gen 12 | AMD EPYC™ 9684X Genoa-X 96-core (400W TDP, 1152 MB L3 cache)
Gen 13 | AMD EPYC™ 9965 Turin Dense 192-core (500W TDP, 384 MB L3 cache)

During the design phase, we evaluated several 5th generation AMD EPYC™ processors, code-named Turin, in Cloudflare's hardware lab: the AMD Turin 9755, AMD Turin 9845, and AMD Turin 9965. The table below summarizes how the specifications of the Gen 13 candidates compare against the AMD Genoa-X 9684X used in our Gen 12 servers. Notably, all three candidates offer increases in core count but with smaller L3 cache per core. However, with the migration to FL2, the new workloads are less dependent on L3 cache and scale up well with the increased core count, achieving up to a 100% increase in throughput.

The three CPU candidates target different use cases: the AMD Turin 9755 offers superior per-core performance, the AMD Turin 9965 trades per-core performance for efficiency, and the AMD Turin 9845 trades core count for lower socket power. We evaluated all three CPUs in the production environment.
| CPU Model | AMD Genoa-X 9684X | AMD Turin 9755 | AMD Turin 9845 | AMD Turin 9965 |
|---|---|---|---|---|
| For server platform | Gen 12 | Gen 13 candidate | Gen 13 candidate | Gen 13 candidate |
| # of CPU cores | 96 | 128 | 160 | 192 |
| # of threads | 192 | 256 | 320 | 384 |
| Base clock | 2.4 GHz | 2.7 GHz | 2.1 GHz | 2.25 GHz |
| Max boost clock | 3.7 GHz | 4.1 GHz | 3.7 GHz | 3.7 GHz |
| All-core boost clock | 3.42 GHz | 4.1 GHz | 3.25 GHz | 3.35 GHz |
| Total L3 cache | 1152 MB | 512 MB | 320 MB | 384 MB |
| L3 cache per core | 12 MB/core | 4 MB/core | 2 MB/core | 2 MB/core |
| Maximum configurable TDP | 400W | 500W | 390W | 500W |
First, FL2 ended the L3 cache crunch.
L3 cache is the large, last-level cache shared among all CPU cores on the same compute die to store frequently used data. It bridges the gap between slow main memory external to the CPU and the fast but smaller L1 and L2 caches on the CPU, reducing the latency for the CPU to access data.
Some may notice that the 9965 has only 2 MB of L3 cache per core, an 83.3% reduction from the 12 MB per core on Gen 12's Genoa-X 9684X. Why trade away the very cache advantage that gave Gen 12 its edge? The answer lies in how our workloads have evolved.
Cloudflare has migrated from FL1 to FL2, a complete rewrite of our request handling layer in Rust. With the new software stack, Cloudflare's request processing pipeline has become significantly less dependent on large L3 cache. FL2 workloads scale nearly linearly with core count, and the 9965's 192 cores provide a 2x increase in hardware threads over Gen 12.
Second, performance per total cost of ownership (TCO). During production evaluation, the 9965's 192 cores delivered the highest aggregate requests per second of the three candidates, and its performance-per-watt scaled favorably at 500W TDP, yielding superior rack-level TCO.
| | Gen 12 | Gen 13 |
|---|---|---|
| Processor | AMD EPYC™ 4th Gen Genoa-X 9684X | AMD EPYC™ 5th Gen Turin 9965 |
| Core count | 96C/192T | 192C/384T |
| FL throughput | Baseline | Up to +100% |
| Performance per watt | Baseline | Up to +50% |

Third, operational simplicity. Our operational teams have a strong preference for fewer, higher-density servers. Managing a fleet of 192-core machines means fewer nodes to provision, patch, and monitor per unit of compute delivered. This directly reduces operational overhead across our global network.

Finally, they're forward compatible. The AMD processor architecture supports DDR5-6400, PCIe Gen 5.0, and CXL 2.0 Type 3 memory across all SKUs. The AMD Turin 9965 has the highest number of high-performing cores per socket in the industry, maximizing compute density per socket and keeping the platform competitive and relevant for years to come. By moving from the AMD Genoa-X 9684X to the AMD Turin 9965, we get longer security support from AMD, extending the useful lifetime of the Gen 13 servers before they become obsolete and need to be refreshed.
Gen 12 | 12x 32GB DDR5-4800 2Rx8 (384 GB total, 4 GB/core)
Gen 13 | 12x 64GB DDR5-6400 2Rx4 (768 GB total, 4 GB/core)

Because the AMD Turin processor has twice the core count of the previous generation, it demands more memory resources, both in capacity and in bandwidth, to deliver its throughput gains.
Maximizing bandwidth with 12 channels
The chosen AMD EPYC™ 9965 CPU supports twelve memory channels, and for Gen 13, we're populating every single one of them. We've chosen 64 GB DDR5-6400 ECC RDIMMs in a "one DIMM per channel" (1DPC) configuration.
This setup provides 614 GB/s of peak memory bandwidth per socket, a 33.3% increase compared to our Gen 12 server platform. By using all 12 channels, we ensure that the CPU isn't "starved" for data, even during the most memory-intensive parallel workloads.
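For those who want to check the math, the peak figure follows directly from DDR5 arithmetic; this short sketch (assuming the standard 8-byte data bus per channel) reproduces both generations' numbers:

```python
# Peak DDR5 bandwidth: channels x transfer rate (MT/s) x 8 bytes per transfer.
def peak_bw_gbs(channels: int, mts: int) -> float:
    return channels * mts * 8 / 1000  # GB/s

gen12 = peak_bw_gbs(12, 4800)  # 460.8 GB/s
gen13 = peak_bw_gbs(12, 6400)  # 614.4 GB/s
print(f"Gen 12: {gen12:.1f} GB/s, Gen 13: {gen13:.1f} GB/s")
print(f"Increase: {(gen13 / gen12 - 1) * 100:.1f}%")  # 33.3%
```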
Populating all twelve channels in a balanced configuration (equal capacity per channel, with no mixed configurations) is common best practice. This matters operationally: AMD Turin processors interleave across all memory channels that have the same DIMM type, same memory capacity, and same rank configuration. Interleaving increases memory bandwidth by spreading contiguous memory accesses across all memory channels in the interleave set, instead of sending all memory accesses to a single channel or a small subset of channels.
The 4 GB per core “sweet spot”
Our Gen 12 servers are configured with 4 GB per core. We revisited that decision as we designed Gen 13.
Cloudflare launches a variety of new products and services every month, and each new product or service demands an incremental amount of memory capacity. These demands accumulate over time and can become a source of memory pressure if capacity is not sized appropriately.
The initial requirement considered a memory-to-core ratio between 4 GB and 6 GB per core. With 192 cores on the AMD Turin 9965, that translates to a range of 768 GB to 1152 GB. Note that at higher capacities, DIMM module capacities typically come in 16 GB increments. With 12 channels in a 1DPC configuration, our options are 12x 48GB (576 GB), 12x 64GB (768 GB), or 12x 96GB (1152 GB).
- 12x 48GB = 576 GB, or 1.5 GB/thread. This configuration's memory capacity is too low; it would starve memory-hungry workloads and violate the lower bound.
- 12x 96GB = 1152 GB, or 3.0 GB/thread. This would be a 50% capacity increase per core, and would also mean higher power consumption and a substantial increase in cost, especially in current market conditions where memory prices are 10x what they were a year ago.
- 12x 64GB = 768 GB, or 2.0 GB/thread (4 GB/core). This configuration is consistent with our Gen 12 memory-to-core ratio and represents a 2x increase in memory capacity per server. Keeping the configuration at 4 GB per core provides sufficient capacity for workloads that scale with core count, like our primary workload, FL, and leaves sufficient memory capacity headroom for future growth without overprovisioning.
FL2 uses memory more efficiently than FL1 did: our internal measurements show FL2 uses less than half the CPU of FL1, and far less than half the memory. The capacity freed up by the software stack migration provides ample headroom to support Cloudflare's growth for the next few years.
The decision: 12x 64GB for 768 GB total. This maintains the proven 4 GB/core ratio, provides a 2x total capacity increase over Gen 12, and stays within the sweet spot of the DIMM price curve.
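The three candidate configurations reduce to simple arithmetic; a small sketch using the core and thread counts from the CPU table above:

```python
# Candidate 1DPC configurations across 12 channels, for 192 cores / 384 threads.
CORES, THREADS, CHANNELS = 192, 384, 12

for dimm_gb in (48, 64, 96):
    total = CHANNELS * dimm_gb
    print(f"12x {dimm_gb}GB = {total} GB "
          f"({total / CORES:.1f} GB/core, {total / THREADS:.1f} GB/thread)")
# 12x 48GB = 576 GB  (3.0 GB/core, 1.5 GB/thread) -> below the 4 GB/core floor
# 12x 64GB = 768 GB  (4.0 GB/core, 2.0 GB/thread) -> chosen
# 12x 96GB = 1152 GB (6.0 GB/core, 3.0 GB/thread) -> cost- and power-heavy
```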
Efficiency through dual rank
In Gen 12, we demonstrated that dual-rank DIMMs provide measurably higher memory throughput than single-rank modules, with an advantage of up to 17.8% at a 1:1 read-write ratio. Dual-rank DIMMs are faster because they allow the memory controller to access one rank while another is refreshing. That same principle carries forward here.
Our requirement also calls for about 1 GB/s of memory bandwidth per hardware thread. With 614 GB/s of peak bandwidth across 384 threads, we deliver 1.6 GB/s per thread, comfortably exceeding the minimum. Production analysis has shown that Cloudflare workloads aren't memory-bandwidth-bound, so we bank the headroom as margin for future workload growth.
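The per-thread figure is just the peak bandwidth divided evenly across SMT threads:

```python
peak_gbs = 614.4        # 12 channels of DDR5-6400
threads = 384           # 192 cores x 2 SMT threads
per_thread = peak_gbs / threads
print(f"{per_thread:.1f} GB/s per thread")  # 1.6 GB/s vs. the ~1 GB/s requirement
```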
By selecting 2Rx4 DDR5 RDIMMs at the maximum supported 6400 MT/s, we ensure we get the lowest latency and best performance from the Gen 13 platform's memory configuration.
Gen 12 | 2x E1.S NVMe PCIe 4.0, 16 TB total: Samsung PM9A3 7.68TB / Micron 7450 Pro 7.68TB
Gen 13 | 3x E1.S NVMe PCIe 5.0, 24 TB total: Samsung PM9D3a 7.68TB / Micron 7600 Pro 7.68TB, plus optional 10x U.2 NVMe PCIe 5.0

Our storage architecture underwent a transformation in Gen 12 when we pivoted from M.2 to EDSFF E1.S. For Gen 13, we're increasing both storage capacity and bandwidth to align with the latest technology. We have also added a front drive bay for the flexibility to add up to 10x U.2 drives, keeping pace with the growth of Cloudflare's storage products.
Gen 13 is configured with PCIe Gen 5.0 NVMe drives. While Gen 4.0 served us well, the move to Gen 5.0 ensures that our storage subsystem can serve data at improved latency and keep up with the increased storage bandwidth demand from the new processor.
Beyond the speed increase, we're physically expanding the array from two to three NVMe drives. Our Gen 12 server platform was designed with four E1.S storage drive slots, but only two were populated with 8TB drives. The Gen 13 server platform uses the same design with four E1.S storage drive slots available, but with three populated with 8TB drives. Why add a third drive? It increases our storage capacity per server from 16TB to 24TB, expanding our global storage capacity to maintain and improve CDN cache performance. It also supports growth projections for Durable Objects, Containers, and Quicksilver services.
Front drive bay to support additional drives
For Gen 13, the chassis is designed with a front drive bay that can support up to ten U.2 PCIe Gen 5.0 NVMe drives. The front drive bay gives Cloudflare the option to use the same chassis across compute and storage platforms, as well as the flexibility to convert a compute SKU to a storage SKU when needed.
Endurance and reliability
We design our servers for a 5-year operational life and require storage drive endurance to sustain 1 DWPD (Drive Writes Per Day) over the full server lifespan.
Both the Samsung PM9D3a and Micron 7600 Pro meet the 1 DWPD specification with hardware over-provisioning (OP) of roughly 7%. If future workload profiles demand higher endurance, we have the option to hold back additional user capacity to increase the effective OP.
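As a rough illustration of what 1 DWPD implies over the service life (the standard DWPD-to-lifetime-writes conversion, not a figure from our fleet):

```python
# Lifetime writes implied by 1 DWPD sustained over a 5-year service life.
capacity_tb = 7.68   # per-drive user capacity
dwpd = 1.0           # drive writes per day
years = 5
tbw = capacity_tb * dwpd * 365 * years
print(f"~{tbw:,.0f} TB written per drive over {years} years")  # ~14,016 TB
```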
NVMe 2.0 and OCP NVMe 2.0 compliance
Both the Samsung PM9D3a and Micron 7600 adopt the NVMe 2.0 specification (up from NVMe 1.4) and the OCP NVMe Cloud SSD Specification 2.0. Key improvements include Zoned Namespaces (ZNS) for better write amplification management, the Simple Copy command for intra-device data movement without crossing the PCIe bus, and enhanced Command and Feature Lockdown for tighter security controls. The OCP 2.0 spec also adds deeper telemetry and debug capabilities purpose-built for datacenter operations, which aligns with our emphasis on fleet-wide manageability.
The storage drives continue to use the E1.S 15mm form factor. Its high-surface-area design is essential for cooling these new Gen 5.0 controllers, which can pull upwards of 25W under sustained heavy I/O. The 2U chassis provides ample airflow over the E1.S drives as well as the U.2 drive bays, a design advantage we validated in Gen 12 when we made the decision to move from 1U to 2U.
Gen 12 | Dual 25 GbE port OCP 3.0 NIC: Intel E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx
Gen 13 | Dual 100 GbE port OCP 3.0 NIC: Intel E830-CDA2 / NVIDIA Mellanox ConnectX-6 Dx

For more than eight years, dual 25 GbE has been the backbone of our fleet. It has served us well since 2018, but as the CPU has improved to serve more requests and our products have scaled, we've formally hit the wall. For Gen 13, we're quadrupling our per-port bandwidth.
Network Interface Card (NIC) bandwidth must keep pace with compute performance growth. With 192 modern cores, our 25 GbE links would become a measurable bottleneck. A week of production data from our co-locations worldwide showed that, on Gen 12, P95 bandwidth per port is consistently above 50% of available bandwidth. Since throughput doubles per server on Gen 13, we risk saturating the NIC bandwidth.
Figure: on Gen 12, P95 bandwidth per port is consistently above 50% of available bandwidth
The decision to go with 100 GbE rather than 50 GbE was driven by industry economics: 50 GbE transceiver volumes remain low across the industry, making them a poor supply chain bet. Dual 100 GbE ports also give us 200 Gb/s of aggregate bandwidth per server, future-proofing against the next several years of traffic growth.
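To make the saturation risk concrete, a back-of-the-envelope projection (illustrative numbers derived from the P95 observation above):

```python
# 25 GbE runs out: P95 utilization is already above 50%, and Gen 13
# roughly doubles per-server throughput.
port_gbps = 25
p95_today = 0.50 * port_gbps    # >12.5 Gb/s observed at P95 on Gen 12
projected = p95_today * 2       # ~2x throughput -> ~25 Gb/s, i.e. saturation
print(f"Projected P95 per port: >{projected:.0f} Gb/s on a {port_gbps} Gb/s link")
# The same projected load on a 100 GbE port sits at ~25% utilization.
print(f"Utilization on 100 GbE: ~{projected / 100:.0%}")
```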
Hardware choices and compatibility
We're maintaining our dual-vendor strategy to ensure supply chain resilience, a lesson hard-learned during the pandemic when single-sourcing the Gen 11 NIC left us scrambling.
Both NICs are compliant with the OCP 3.0 SFF/TSFF form factor with the integrated pull tab, maintaining chassis commonality with Gen 12 and ensuring field technicians need no new tools or training for swaps.
The OCP 3.0 NIC slot is allocated PCIe 4.0 x16 lanes on the motherboard, providing 256 Gb/s of raw bandwidth in each direction, more than enough for dual 100 GbE (200 Gb/s aggregate) with room to spare.
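The slot budget checks out on paper; note that 256 Gb/s is the raw per-direction signaling rate, before the small 128b/130b encoding overhead of PCIe 4.0:

```python
# PCIe 4.0 raw signaling: 16 GT/s per lane, per direction.
gt_per_lane = 16
lanes = 16
raw_gbps = gt_per_lane * lanes   # 256 Gb/s per direction (before 128b/130b overhead)
nic_gbps = 2 * 100               # dual 100 GbE, aggregate
print(f"Slot: {raw_gbps} Gb/s raw per direction; NIC needs {nic_gbps} Gb/s")
```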
We're maintaining the architectural shift, introduced in Gen 12, of separating management and security-related components from the motherboard onto the Project Argus Data Center Secure Control Module 2.0.
Figure: Project Argus DC-SCM 2.0
Continuity with DC-SCM 2.0
We're carrying forward the Data Center Secure Control Module 2.0 (DC-SCM 2.0) standard. By decoupling management and security functions from the motherboard, we ensure that the "brains" of the server's security stay modular and protected.
The DC-SCM module houses our most critical components:
- Basic Input/Output System (BIOS)
- Baseboard Management Controller (BMC)
- Hardware Root of Trust (HRoT) and TPM (Infineon SLB 9672)
- Dual BMC/BIOS flash chips for redundancy
Why we're staying the course with DC-SCM 2.0
The decision to keep this architecture for Gen 13 is driven by the proven security gains we saw in the previous generation. By offloading these functions to a dedicated module, we maintain:
- Rapid recovery: Dual-image redundancy allows near-instant recovery of BIOS/UEFI and BMC firmware if accidental corruption or a malicious update is detected.
- Physical resilience: The Gen 13 chassis also moves the intrusion detection mechanism farther from the flat edge of the chassis, making physical interception harder.
- PCIe encryption: In addition to TSME (Transparent Secure Memory Encryption) for CPU-to-memory encryption, which has been enabled since our Gen 10 platforms, the AMD Turin 9965 processor in Gen 13 extends encryption to PCIe traffic, ensuring data is protected in transit across every bus in the system.
- Operational consistency: Sticking with the Gen 12 management stack means our security audits, deployment, provisioning, and standard operating procedures remain fully compatible.
Gen 12 | 800W 80 PLUS Titanium CRPS
Gen 13 | 1300W 80 PLUS Titanium CRPS

As we upgrade the compute and networking capability of the server, the power envelope of our servers has naturally expanded. Gen 13 servers are equipped with larger power supplies to deliver the power needed.
While our Gen 12 nodes operated comfortably with an 800W 80 PLUS Titanium CRPS (Common Redundant Power Supply), the Gen 13 specification requires a larger power supply. We have selected a 1300W 80 PLUS Titanium CRPS.
Power consumption of Gen 13 during typical operation has risen to 850W, a 250W increase over the 600W seen in Gen 12. The primary contributors are the 500W TDP CPU (up from 400W), the doubling of memory capacity, and the additional NVMe drive.
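The resulting PSU loading at typical draw works out comfortably below capacity (figures from this section):

```python
# Typical draw vs. PSU capacity.
typical_w = 850     # Gen 13 typical operation
psu_w = 1300        # selected 80 PLUS Titanium CRPS
print(f"Typical load: {typical_w / psu_w:.0%} of PSU capacity")  # ~65%
# Of the +250W over Gen 12's ~600W typical, +100W comes from the higher
# CPU TDP; the doubled memory capacity and third NVMe drive make up the rest.
```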
Why 1300W instead of 1000W? The current PSU ecosystem lacks viable, high-efficiency options at 1000W. To ensure supply chain reliability, we moved to the next industry-standard tier at 1300W.
EU Lot 9 is a regulation that requires servers deployed in the European Union to have power supplies whose efficiency at 10%, 20%, 50%, and 100% load meets or exceeds the thresholds specified in the regulation. These thresholds match the Titanium grade requirements of the 80 PLUS power supply certification program. We chose a Titanium grade PSU for Gen 13 to maintain full compliance with EU Lot 9, ensuring that the servers can be deployed in our European data centers and beyond.
Thermal design: 2U pays dividends again
The 2U1N form factor we adopted in Gen 12 continues to pay dividends. Gen 13 uses 5x 80mm fans (up from 4x in Gen 12) to handle the increased thermal load from the 500W CPU. The larger fan volume, combined with the airflow characteristics of the 2U chassis, means the fans operate well below maximum duty cycle at typical ambient temperatures, keeping fan power under 50W per fan.
Drop-in accelerator support
Gen 12 | 2x single-width FHFL or 1x double-width FHFL
Gen 13 | 2x double-width FHFL

Maintaining the modularity of our fleet is a core requirement of our server design. This requirement enabled Cloudflare to quickly retrofit and deploy GPUs globally to more than 100 cities in 2024. In Gen 13, we're continuing support for high-performance PCIe add-in cards.
On Gen 13, the 2U chassis layout is updated and configured to support more demanding power and thermal requirements. While Gen 12 was limited to a single double-width GPU, the Gen 13 architecture supports two double-width PCIe cards.
A launchpad to scale Cloudflare to greater heights
Every generation of Cloudflare servers is an exercise in balancing competing constraints: performance versus power, capacity versus cost, flexibility versus simplicity. Gen 13 delivers 2x core count, 2x memory capacity, 4x network bandwidth, 1.5x storage capacity, and future-proofing for accelerator deployments, all while improving total cost of ownership and maintaining the robust management feature set and security posture that our global fleet demands.
Gen 13 servers are fully qualified and will be deployed to serve millions of requests across Cloudflare's global network in more than 330 cities. As always, Cloudflare's journey to serve the Internet as efficiently as possible doesn't end here. As the deployment of Gen 13 begins, we're already planning the architecture for Gen 14.
If you're excited about helping build a better Internet, come join us. We're hiring.



