I’ve been tackling the consumer multi-GPU PCIe bandwidth problem — Nvidia dropped NVLink from the 4090/5090, so distributing a 70B model across two consumer GPUs limits you to roughly 30 GB/s over PCIe peer-to-peer.

Over the past few months, I’ve built a Python library that leverages the GPU’s otherwise-unused NVENC/NVDEC hardware to compress activations and KV cache in real time, then transmits the compact bitstream across the same connection.

Repository: (Apache 2.0)

Existing work (the concept itself isn’t new)

LLM.265 — “Video Codecs are Secretly Tensor Codecs” (late 2025). The most closely related prior work: same core insight applied to LLM weights, activations, and KV cache.
KVFetcher (April 2026). KV compression designed for remote prefix retrieval.
CodecFlow (April 2026). Uses codec motion-vector metadata to handle KV updates during prefill.

The “video codec on tensors” concept was already published when I began. Here’s what this project contributes beyond that:

PCA with rank truncation as a preprocessing step. Activations and KV in their default basis appear noise-like: the codec tops out at roughly 4× compression, essentially the Gaussian-noise limit. Switching to the PCA basis exposes a heavy-tailed channel covariance structure that the codec can actually take advantage of. The basis is computed per layer, generated offline, and distributed alongside the model in a LoRA-like fashion (~32 MB for FLUX.2 Klein 9B’s 8 double-blocks at K=500).
Parallel-path / dual-lane architectural redesign. NVENC and NVDEC are physically independent hardware blocks separate from the SM cluster and the PCIe controller. Using CUDA stream pipelining, codec execution time is hidden behind computation and transfers of other tensors. The compression ratio effectively acts as a bandwidth multiplier rather than just reducing payload size.
Pure-ctypes Direct Video Codec SDK wrapper (DirectBackend) — eliminates FFmpeg subprocess overhead entirely. Zero-copy from torch CUDA tensors, an 8-deep async output ring per NVENC engine, optional CUDA stream binding via nvEncSetIOCudaStreams, and MultiEngineDirectBackend spanning all 3 NVENC engines on the 5090.
Three documented negative results — sparse residual encoding, AV1 NVENC on Blackwell, and channel reordering. These dead ends are recorded so others don’t waste time retreading them.
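
For anyone who wants the PCA step in concrete form, here is a minimal NumPy sketch of fitting a per-layer basis offline and truncating to rank K. It is an illustration under assumptions, not the library's code: the function names (fit_pca_basis, to_pca, from_pca) are hypothetical, and the data is a synthetic stand-in with a power-law channel spectrum rather than real activations.

```python
import numpy as np

def fit_pca_basis(acts, k):
    """Fit a rank-k PCA basis from calibration activations of shape (n_samples, d)."""
    mean = acts.mean(axis=0)
    cov = np.cov(acts - mean, rowvar=False)   # (d, d) channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues come back ascending
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return mean, eigvecs[:, top], eigvals[top]

def to_pca(x, mean, basis):
    """Rotate into the PCA basis and keep only the top-k coefficients."""
    return (x - mean) @ basis

def from_pca(z, mean, basis):
    """Map truncated coefficients back to the original channel basis."""
    return z @ basis.T + mean

# Synthetic stand-in (assumption): power-law channel variances hidden
# behind a random orthogonal mixing, so the raw channels look unstructured.
rng = np.random.default_rng(0)
d, n, k = 64, 2000, 16
scales = 1.0 / np.arange(1, d + 1) ** 1.5
mixing, _ = np.linalg.qr(rng.standard_normal((d, d)))
acts = (rng.standard_normal((n, d)) * scales) @ mixing.T

mean, basis, eigvals = fit_pca_basis(acts, k)
coeffs = to_pca(acts, mean, basis)            # (n, k): what would go to the codec
recon = from_pca(coeffs, mean, basis)

energy_kept = eigvals.sum() / np.trace(np.cov(acts, rowvar=False))
rel_err = np.linalg.norm(acts - recon) / np.linalg.norm(acts - mean)
print(f"kept {k}/{d} components, energy {energy_kept:.4f}, rel. error {rel_err:.4f}")
```

With a heavy-tailed spectrum, a small K keeps almost all of the variance, which is the property the rank truncation relies on. The basis and mean are the only things that would need to ship with the model.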

Benchmark results (RTX 5090, real workloads)

Compression ratios: 6.1× lossless on diffusion (FLUX.2 Klein 9B mid-block), 2.7× lossless on LLM KV cache (Mistral 7B v0.3). Leave-one-out validated across 1,735 diffusion captures and 6 LLM prompts. (FLUX.2 Klein 9B was the internal research target; the public proof-of-concept repo uses FLUX.1-schnell since it’s Apache 2.0 and freely available. Results are qualitatively consistent on schnell — heavy-tailed PCA spectrum, similar Pareto behavior.)
Codec throughput: DirectBackend achieves 0.243 ms/frame encode and 0.435 ms/frame decode at 256×256 YUV444 QP=18 on real PCA-rotated FLUX activations. MultiEngineDirectBackend utilizing all 3 NVENC engines on the 5090: 0.180 ms/frame encode, 0.262 ms/frame decode. Roughly 7.9× faster than an FFmpeg subprocess baseline.
Parallel-path overlap, empirically verified: A 30×4096² fp16 GEMM on CUDA stream A running alongside a 64-frame DirectBackend encode on stream B (encoder bound to stream B via nvEncSetIOCudaStreams). Serialized wall-clock: 40.1 ms; parallel wall-clock: 26.0 ms; ideal full-overlap floor: 20.9 ms. That is a 1.54× speedup over serialized (40.1 / 26.0), recovering 14.1 ms of the 19.2 ms that perfect overlap would save, or about 73% of the theoretical maximum. This is the key measurement supporting the architectural claim that NVENC hardware operates concurrently with SM compute.
Slow-connection gains, end-to-end: Measured 3.13× wall-clock speedup at 100 Mbps residential broadband, 5.29× at 50 Mbps (real codec round-trip + simulated network). 1.69× dual-lane improvement on simulated 1 Gbit ethernet.
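
The overlap figures in the parallel-path item can be derived from the three wall-clock measurements alone. The sketch below shows one way to do that. The helper name overlap_metrics and the definition of "fraction of maximum overlap" (time saved divided by the most time that could be saved) are my assumptions, not something taken from the repo.

```python
def overlap_metrics(serial_ms, parallel_ms, floor_ms):
    """Summarize how much of the ideal compute/codec overlap was achieved.

    serial_ms:   wall-clock with the two workloads run back to back
    parallel_ms: wall-clock with the two workloads on separate streams
    floor_ms:    wall-clock if the shorter workload were hidden completely
    """
    saved = serial_ms - parallel_ms
    max_saved = serial_ms - floor_ms
    return {
        "speedup": serial_ms / parallel_ms,
        "max_speedup": serial_ms / floor_ms,
        "overlap_fraction": saved / max_saved,  # assumed definition, see lead-in
    }

# The three wall-clock times quoted in the benchmark above.
m = overlap_metrics(serial_ms=40.1, parallel_ms=26.0, floor_ms=20.9)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Keeping the metric as a ratio of saved time makes it 0 when nothing overlaps and 1 when the shorter workload is hidden entirely, which is easier to compare across runs than raw speedup.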

What hasn’t been measured end-to-end yet (projections from the data above)

Multi-GPU PCIe peer-to-peer activation transfer recovering ~180 GB/s effective bandwidth — the codec primitive is implemented and benchmarked, but the cross-GPU PCIe peer-to-peer integration is still pending. (This is where I need community support, since my test setup has only one desktop GPU and you need two on the same motherboard to validate this.)

Real two-machine ethernet split-model inference — the network-simulation proof-of-concept measures real codec time plus a simulated link, but it isn’t a genuine two-machine deployment yet. (I have a 4090 laptop arriving next week to physically test this networked scenario.)

Long-context KV-spill end-to-end tok/s on a real model decode loop — the compression ratio is measured, but the concrete N tok/s → 3N tok/s benchmark on, say, a 32B model with 64K context isn’t in the repo yet. The math suggests it works; the benchmark hasn’t been written.
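
For reference, the ~180 GB/s projection is just the link bandwidth multiplied by the measured compression ratio. The sketch below spells that arithmetic out. It is an idealized upper bound that assumes codec time is fully hidden behind other work and that the codec itself can keep up; the function names are hypothetical.

```python
def effective_bandwidth_gbs(link_gbs, compression_ratio):
    """Idealized effective bandwidth when codec latency is completely hidden."""
    return link_gbs * compression_ratio

def transfer_ms(payload_mb, link_gbs, compression_ratio=1.0):
    """Time to move a payload over the link, in ms (1 MB over 1 GB/s takes 1 ms)."""
    return payload_mb / (link_gbs * compression_ratio)

link = 30.0  # GB/s, the PCIe peer-to-peer figure quoted at the top of the post
for label, ratio in [("diffusion activations", 6.1), ("LLM KV cache", 2.7)]:
    bw = effective_bandwidth_gbs(link, ratio)
    print(f"{label}: {ratio}x -> {bw:.0f} GB/s effective")

# Illustrative payload size (an assumption, not a measured tensor).
print(f"300 MB raw: {transfer_ms(300, link):.2f} ms, "
      f"compressed 6.1x: {transfer_ms(300, link, 6.1):.2f} ms")
```

None of this replaces the pending cross-GPU measurement; it only makes explicit where the projected number comes from.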

Where I’d welcome contributions

Anyone with a dual-4090, dual-5090, or other two-GPU PCIe-P2P setup (both cards on the same motherboard) who’d be willing to run the cross-GPU peer-to-peer benchmark once it’s ready. This would meaningfully close the “75%” gap.
Anyone working on long-context KV-spill workloads who’d like to integrate DirectBackend into their decode loop for the end-to-end tok/s measurement. I’d collaborate on the integration.
Cross-vendor support — AMD VCN and Intel QSV/Arc code paths are entirely open. Same architectural principle, different SDK interfaces.

What’s included in the repository

19 numbered, runnable proof-of-concepts, with every reported number reproducible. An honest status summary sits at the top of the README. The PCA basis builder, per-channel quantizer, YUV pack/unpack, and codec wrappers are all modular so components can be swapped independently.
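
To give a feel for what the quantizer and YUV pack/unpack stages do, here is a rough NumPy sketch of one possible version: per-channel min/max quantization to 8 bits, then byte-packing into a single planar 256×256 YUV444 frame. All names and the packing layout are assumptions made for illustration. The repo's actual layout may differ, and how values are arranged spatially affects how well the codec compresses them.

```python
import numpy as np

def quantize_per_channel(coeffs):
    """Map each channel (column) to uint8 using its own offset and step size."""
    lo = coeffs.min(axis=0)
    hi = coeffs.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)  # avoid divide-by-zero
    q = np.clip(np.round((coeffs - lo) / scale), 0, 255).astype(np.uint8)
    return q, lo, scale

def dequantize_per_channel(q, lo, scale):
    """Invert quantize_per_channel up to half a quantization step."""
    return q.astype(np.float32) * scale + lo

def pack_yuv444(q, height=256, width=256):
    """Byte-pack uint8 values into one planar YUV444 frame, zero-padded."""
    capacity = 3 * height * width
    flat = q.ravel()
    if flat.size > capacity:
        raise ValueError("tensor does not fit in a single frame")
    frame = np.zeros(capacity, dtype=np.uint8)
    frame[:flat.size] = flat
    return frame.reshape(3, height, width)  # Y, U, V planes

def unpack_yuv444(frame, shape):
    """Recover the original uint8 array from a packed frame."""
    return frame.ravel()[: int(np.prod(shape))].reshape(shape)

# Round trip on random stand-in coefficients (384 tokens x 500 components).
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((384, 500)).astype(np.float32)
q, lo, scale = quantize_per_channel(coeffs)
frame = pack_yuv444(q)
restored = dequantize_per_channel(unpack_yuv444(frame, q.shape), lo, scale)
print(frame.shape, frame.dtype, float(np.abs(restored - coeffs).max()))
```

The frame would then go to the encoder, and lo and scale would travel as side information so the receiver can dequantize after decoding.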

Built solo while managing full-time caregiving responsibilities — technical feedback, critique, or pointers to related work I may have overlooked are genuinely appreciated.

submitted by /u/shootthesound

torch-nvenc-compress: Using GPU NVENC Silicon as a PCIe Bandwidth Multiplier
