Connection overview of a failing test. After the 2-second mark (T=2s), packet loss stops completely — yet the congestion window (cwnd) stays stuck at its minimum level, and the congestion state keeps oscillating between recovery and congestion avoidance roughly every 14 milliseconds.
As a result, CUBIC begins oscillating rapidly — depicted in our graph as a prolonged recovery phase — switching back and forth between the congestion avoidance phase (its normal operating state) and the recovery phase (the packet loss recovery state). The state changes 999 times in roughly 6.7 seconds; since each full cycle involves two transitions (into recovery and back out), that works out to about one complete cycle every 14 milliseconds. That interval is strikingly close to the connection’s round-trip time of 10ms. Throughout this entire episode, the congestion window (cwnd) remains stuck at its lowest possible value: 2700 bytes, equivalent to just two full-sized packets.
Something in CUBIC’s internal logic is clearly misreading the connection’s condition. The most telling clue is the oscillation frequency: the ~14ms cycle aligns with the RTT. Whatever is causing the algorithm to flip between recovery and avoidance is firing once per round trip, perfectly synchronized with the connection’s ACK clock — the self-clocking pattern where each round trip’s worth of ACKs from the client triggers the server’s next transmission. Since this is a download (server sending to client), the ACKs travel from client to server, and CUBIC’s state machine operates on the server side. Each time those ACKs arrive, bytes_in_flight drops to zero, and the server dispatches the next two-packet burst — which is precisely what sets off the bug.
To verify that this behavior was specific to CUBIC, we repeated the same test using Reno, another loss-based algorithm but with a different acceleration profile. The outcome was definitive: a 100% pass rate. Reno recovered smoothly after the loss phase ended, confirming that this is a bug unique to CUBIC.
Reno recovers smoothly once the loss phase ends at T=2s and finishes the download by around 5 seconds.
Loss-based algorithms essentially have two controls — a gas pedal and a brake — differing mainly in how aggressively they accelerate. CUBIC, however, comes with some additional complexity. Here, we’re going to zero in on the condition where bytes_in_flight equals zero.
TCP CUBIC after idle (Linux, 2017)
To grasp the bug, we first need to understand the optimization that introduced it. In 2017, a problem was discovered in the Linux kernel’s CUBIC implementation. The commit message describes it as follows:
The epoch is only updated/reset initially and when losses occur. The delta “t” of now - epoch_start can grow arbitrarily large after an application goes idle, as can the bic_target. As a result, the slope (the inverse of ca->cnt) becomes extremely steep, and eventually ca->cnt gets clamped to a minimum of 2 to produce delayed-ACK slow-start behavior.

This issue is especially visible when slow_start_after_idle is turned off, causing a dangerous inflation of cwnd (by 1.5× RTT) after just a few seconds of idle time.
The epoch is the reference timestamp that CUBIC uses to anchor its growth curve. The function W_cubic(delta_t) is parameterized by delta_t = now - epoch_start, and the epoch is reset each time CUBIC restarts its growth function — most notably after a loss event that reduces cwnd. Between resets, delta_t increases steadily with wall-clock time.
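For reference, the curve itself is defined in RFC 8312, where t plays the role of delta_t above, W_max is the window just before the last reduction, and the constants default to C = 0.4 and beta_cubic = 0.7:

$$W_{cubic}(t) = C\,(t - K)^3 + W_{max}, \qquad K = \sqrt[3]{\frac{W_{max}\,(1 - \beta_{cubic})}{C}}$$

Because the cubic term grows with the cube of elapsed time, a stale epoch is explosive: with the default C = 0.4, an extra 10 seconds of unaccounted delta_t contributes roughly 0.4 × 10³ = 400 segments to the target window.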
When an application pauses sending (goes idle) for a period and then resumes, CUBIC’s growth function W_cubic(delta_t) calculates delta_t as now - epoch_start, as shown in the figure below. Because the epoch wasn’t updated during the idle period, delta_t becomes very large, producing an enormous target window — and CUBIC would immediately attempt to inflate cwnd to an unreasonably high value.
Jana Iyengar’s initial fix was to reset epoch_start when the application resumes sending. But Neal Cardwell identified a flaw in that approach:
…it would cause the CUBIC algorithm to recalculate the curve so that growth again starts steeply upward from the current cwnd value (the same behavior CUBIC exhibits right after a loss). Ideally, we’d want the cwnd growth curve to retain the same shape, just shifted forward in time by the duration of the idle period.
The refined solution, developed by Eric Dumazet, Yuchung Cheng, and Neal Cardwell, was to advance the epoch forward by the idle duration rather than resetting it. This keeps the CUBIC growth curve’s shape intact — simply sliding it forward in time so the algorithm resumes exactly where it left off.
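As a sketch of the difference between the two approaches (illustrative Rust, not the actual kernel C; the struct and method names here are invented for the example):

```rust
use std::time::{Duration, Instant};

// Illustrative sketch of the two candidate fixes. `epoch_start` anchors
// the curve: W_cubic is evaluated at delta_t = now - epoch_start.
struct CubicEpoch {
    epoch_start: Instant,
}

impl CubicEpoch {
    // First approach: reset the epoch when sending resumes. The curve is
    // recomputed from the current cwnd, so growth starts steeply again,
    // just as it does right after a loss.
    fn reset_on_resume(&mut self, now: Instant) {
        self.epoch_start = now;
    }

    // Refined approach: slide the epoch forward by the idle duration.
    // delta_t is unchanged across the idle period, so the curve keeps
    // its shape and resumes exactly where it left off.
    fn shift_on_resume(&mut self, idle: Duration) {
        self.epoch_start += idle;
    }
}
```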
The port to quiche (2020)
When CUBIC was first implemented in quiche, this idle-period adjustment was carried over. However, QUIC — which operates in user space — doesn’t have access to TCP’s kernel-level CA_EVENT_TX_START callback. Instead, the quiche implementation detects the idle condition inside on_packet_sent():
```rust
// cubic.rs — on_packet_sent() (simplified)

/// Updates the state when a packet is sent.
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    // If the sending burst is restarting (i.e., bytes_in_flight was zero
    // before this send), shift the congestion recovery start time forward
    // to account for the gap in sending.
    if bytes_in_flight == 0 {
        let delta = now - self.last_sent_time;
        self.congestion_recovery_start_time += delta;
    }

    // Record the time of this send event.
    self.last_sent_time = now;
}
```

Where it breaks: the QUIC difference
The fix ported to quiche contained a bug that was present in the original kernel change but was corrected by a follow-up patch to the kernel’s CUBIC module about a week later. The commit message for the second fix reads as follows:
tcp_cubic: avoid setting epoch_start to a future timestamp

Tracking idle time within bictcp_cwnd_event() is inherently imprecise because epoch_start is typically updated during ACK processing rather than at the moment a packet is transmitted.

A thorough fix would require introducing an extra state variable, but that added complexity doesn’t seem justified — this CUBIC bug existed for a long time before Jana identified it.

The simpler approach is to just prevent epoch_start from being set ahead of the current time. Without this guard, bictcp_update() could experience an overflow, causing CUBIC to ramp up cwnd at an excessively rapid rate.
As the commit message explains, the recovery start timestamp gets assigned during ACK processing. When the idle-time adjustment is calculated based on send timestamps, it can push that recovery start time forward into the future. This is what caused the oscillation between recovery and congestion avoidance that we observed during testing.
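To see why a future timestamp is so damaging, here is roughly how quiche decides whether a packet was sent during recovery (simplified; the exact code in cloudflare/quiche may differ):

```rust
// Simplified sketch of quiche's recovery check: a packet counts as sent
// during recovery if its send time is at or before the recovery start
// time. Once congestion_recovery_start_time is pushed into the future,
// this returns true for every packet currently in flight.
fn in_congestion_recovery(&self, sent_time: Instant) -> bool {
    match self.congestion_recovery_start_time {
        Some(congestion_recovery_start_time) =>
            sent_time <= congestion_recovery_start_time,
        None => false,
    }
}
```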
This issue only reliably triggers when every incoming ACK drives bytes_in_flight all the way down to zero. In practice, that means the congestion window has already shrunk to its minimum of two packets, and the application has enough data queued to immediately fill the window again once an ACK arrives. Outside of this specific scenario, bytes_in_flight == 0 is less likely to occur on every send, making the bug harder to hit.
Why doesn’t this happen at the beginning of a connection? The bug only activates after the connection leaves slow-start and transitions into congestion avoidance. Before that transition, congestion_recovery_start_time hasn’t been initialized, so the faulty code path in on_packet_sent has no recovery boundary to push forward. During slow-start, CUBIC grows cwnd using the same Reno-style ACK-based rule shared by all loss-based congestion control algorithms. The cubic curve — and its dependence on congestion_recovery_start_time — only comes into play once the connection enters congestion avoidance. So the trap requires three conditions simultaneously: a real loss event to establish the recovery boundary, congestion avoidance mode to be active, and cwnd collapsed to the two-packet floor.
The self-sustaining recovery trap. At minimum cwnd, every ACK cycle triggers the idle period adjustment with an exaggerated delta.
When cwnd is at its minimum of two packets, the connection’s behavior shifts into a “death spiral” where the idle-period optimization becomes a self-reinforcing cycle. The loop works as follows:
1. Send and ACK packets: The sender transmits the full two-packet window. After roughly one RTT (~14ms), both packets are acknowledged, and bytes_in_flight drops to zero.

2. False idle detection: When the next burst of data is sent, on_packet_sent() observes bytes_in_flight == 0 and concludes the connection was idle — but in reality, it was simply congestion-limited.

3. Exaggerated delta: The idle duration is calculated as now - last_sent_time. With cwnd at its minimum, last_sent_time reflects the timestamp from the beginning of the previous RTT cycle. This means the computed delta is approximately 14ms (the connection’s RTT, plus minor rounding errors). That RTT-scale delta is then incorrectly treated as “idle” time. The actual idle period — the brief gap between the last ACK arriving and the next packet being sent — is essentially zero. By measuring the entire RTT instead of the true gap, the delta is significantly exaggerated, aggressively pushing the recovery start time forward, potentially into the future.

4. Perceived recovery: With the recovery start time now set in the future, the in_congestion_recovery() check returns true for every incoming ACK. Processing the next ACK exits recovery and sets the recovery start to the ACK’s timestamp, which is later than last_sent_time. This makes it likely that the next send will again push the recovery time into the future.

5. Stagnation: Because CUBIC skips cwnd growth for any packet it believes was sent in a recovery period, the window stays locked at two packets. This guarantees the pipe will drain completely on the next ACK, restarting the entire cycle (modeled in the sketch below).
This loop repeats for thousands of cycles until tiny accumulated deviations — caused by scheduler jitter and ACK processing variance — allow the <= boundary in in_congestion_recovery() to fall behind the next packet’s send time, finally breaking the cycle.
The fix: measuring idle from the correct moment
Breaking the death spiral requires measuring the idle duration from the point when bytes_in_flight actually reached zero — that is, when the last ACK was processed — rather than from the time the last packet was sent.
1. Introduce a last_ack_time timestamp into the CUBIC state.

2. Update this timestamp each time an ACK is received.

3. Use it when computing the idle delta:
```rust
// cubic.rs — on_packet_sent() (simplified)

fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    // Check if the connection was idle before this packet was sent.
    if bytes_in_flight == 0 {
        if let Some(recovery_start_time) = self.congestion_recovery_start_time {
            // Measure idle from the most recent activity: either the
            // last ACK (approximating when bytes_in_flight hit 0) or the
            // last data send, whichever is later. Using last_sent_time
            // alone would inflate the delta by a full RTT when cwnd is
            // small and bytes_in_flight transiently hits 0 between ACK
            // and send.
            let idle_start = cmp::max(self.last_ack_time, self.last_sent_time);

            if let Some(idle_start) = idle_start {
                if idle_start < now {
                    let delta = now - idle_start;

                    // Shift the recovery boundary forward by the true
                    // idle gap only.
                    self.congestion_recovery_start_time =
                        Some(recovery_start_time + delta);
                }
            }
        }
    }

    // Record the time of this send event.
    self.last_sent_time = Some(now);
}
```
With the delta now reflecting the true gap since the last ACK, the recovery boundary no longer chases the send time:
- Old code: the boundary advances by one RTT each cycle, always landing on or ahead of the next send.

- Fix: the boundary barely moves; the next send lands ahead of it, and cwnd grows.
For genuinely idle connections, last_ack_time is far in the past, and the same expression captures the full idle duration, so the original epoch-shift behavior is preserved.
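One detail worth noting: cmp::max operates on Option values here, and Rust orders None before Some, so the expression picks whichever timestamp exists and the later of the two when both are set. A quick illustrative check, with plain millisecond integers standing in for Instants:

```rust
use std::cmp;

fn main() {
    // None sorts before Some, so max picks the timestamp that exists...
    assert_eq!(cmp::max(None, Some(100u64)), Some(100));
    // ...and the later of the two when both are set.
    assert_eq!(cmp::max(Some(100u64), Some(250)), Some(250));

    // Genuinely idle case: last ACK at t = 300 ms, last send at t = 250 ms,
    // sending resumes at t = 5000 ms. The full idle gap is captured.
    let idle_start = cmp::max(Some(300u64), Some(250)).unwrap();
    assert_eq!(5000 - idle_start, 4700);
}
```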
With the fix applied, the 100% pass rate of our quiche testing suite was restored.
After the fix, cwnd grows along the expected CUBIC curve and the download completes in ~4-5 seconds.
We don’t worry about the losses at the end of the connection — they’re expected because the test fully utilizes the router’s allocated buffer. In other words, the connection is using all of the available bandwidth in this test case.
“Idle” is harder to define than it sounds. Normal pipeline delays at small windows can look like idleness to simple checks.
Minimum-cwnd dynamics are a unique corner case. The bug was invisible at high speeds and only triggered after severe loss.
The fix was surprisingly small compared to the complexity of the behavior. After weeks of instrumenting qlogs and analyzing visualizations to find the root cause, the solution required changing just three lines of code. As we noted during the investigation: the effort to find the bug was massive, but the fix itself was basically one line of logic.
The fix described in this post has been contributed to cloudflare/quiche, Cloudflare’s open-source implementation of QUIC and HTTP/3. Our CCA efforts go beyond loss-based algorithms: we also use quiche’s modular congestion control design to experiment with and tune our model-based BBRv3 implementation, now enabled for a growing percentage of our QUIC deployments. Stay tuned for further updates on QUIC congestion control implementation and performance.
If you’re interested in congestion control, transport protocols, or contributing to open-source networking code, check out the quiche repository. We’re always looking for talented engineers who love digging into problems like these; please explore our open positions.