In production AI inference, traffic rises and falls over time, so inference replicas must scale up and down with it. However, starting a new inference workload on Kubernetes from scratch can take several minutes. During that startup period, GPUs are already reserved but sit idle, producing no output and handling no requests.
“Cold start” refers to the entire sequence a model server must finish before it can begin serving any requests: pulling the container image from a registry, loading the model’s weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and finally registering itself with the service discovery system. This multi-step delay raises the risk of breaching service-level agreements (SLAs) during sudden traffic surges, because the system can’t scale fast enough to absorb the spike in incoming requests.
For a single-GPU vLLM (v0.20.0) workload, the cold-start latency breaks into three distinct phases: container/image pull, engine initialization (which includes loading model weights, warming up CUDA kernels, and compiling CUDA graphs), and distributed runtime startup.
To solve this problem, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint-and-restore solution designed specifically for AI inference workloads running on Kubernetes.

What Are CRIU and cuda-checkpoint?
The state of a running inference worker that can be checkpointed consists of two parts. Device state (on the GPU side) includes CUDA contexts, streams, device memory allocations, and virtual address mappings—none of which are directly visible from the host CPU. To save this state, cuda-checkpoint leverages the checkpointing capabilities built into the CUDA driver to copy the GPU device state into the CPU memory of the process that owns each CUDA context. Host state (on the CPU side) covers CPU memory, threads, file descriptors, and Linux namespaces. CRIU (Checkpoint/Restore in Userspace) inspects the Linux kernel’s internal bookkeeping structures and serializes the entire process tree’s state out to disk.
During checkpointing, the two tools execute sequentially: cuda-checkpoint first dumps all device state into CPU memory, and then CRIU dumps the full host-side process tree state into a folder on persistent storage. When restoring—whether on the same node or a different one—CRIU reconstructs the process tree from shared storage such as NFS or SMB first, and then cuda-checkpoint transfers the saved GPU state from CPU memory back onto the target GPUs.
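The ordering described above can be sketched as a pair of command lists. This is only an illustration of the sequence: the source does not state which flags the snapshot agent passes, so the options shown (`--toggle`/`--pid` for cuda-checkpoint, `--tree`/`--images-dir`/`--restore-detached` for CRIU) are assumptions based on the tools' public command lines, and the helper names are hypothetical. The functions build the commands without running them.

```python
from typing import List


def checkpoint_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for a checkpoint: GPU state first, then the host process tree."""
    return [
        # 1. Move device state into the host memory of the owning process.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # 2. Dump the process tree (which now holds the GPU state) to disk.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


def restore_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Commands for a restore: host process tree first, then GPU state."""
    return [
        # 1. Rebuild the process tree from the saved images.
        ["criu", "restore", "--images-dir", images_dir, "--restore-detached"],
        # 2. Move the saved device state from host memory back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]
```

The restore list is the checkpoint list in reverse: the GPU state can only be put back after CRIU has recreated the process whose host memory holds it.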
CRIU works as a freeze-and-thaw mechanism at its core. When a process is restored, execution picks up at the exact instruction where the checkpoint was taken, with no awareness that a checkpoint or restore ever happened. Because of this transparency, any coordination that must happen before checkpointing (such as quiescing the workload to a stable state) or after restoring (such as reconnecting external resources) needs to be managed separately—either by an external orchestrator or by workload-specific hooks.
How Dynamo Snapshot Works on Kubernetes
In Kubernetes, workloads run inside containers, which in turn run inside pods. Since CRIU checkpoint files contain references to the container’s writable filesystem layer, checkpointing operates at the container level so that the process tree state and the filesystem are saved together as a single unit.
NVIDIA delivers a privileged DaemonSet called snapshot-agent, deployable via a Helm chart. One agent runs on every cluster node and handles checkpoint and restoration operations for containers managed by runc, without requiring any changes to runc itself. During checkpointing, the agent waits for the workload’s readiness probe to confirm the service is healthy, then invokes cuda-checkpoint and CRIU from the host side and writes the resulting checkpoint artifact to shared storage. The workload may have created or modified files within the container’s local overlay filesystem, and the agent captures those changes as well after the CRIU stage completes.
During restoration, the agent starts a lightweight placeholder pod, restores the overlay filesystem into it, and then restores the CRIU and CUDA checkpoint into the pod’s namespaces. Because each agent operates independently on its own node, checkpoint and restore operations can run in parallel across the entire cluster.
The team chose this DaemonSet-based approach over relying on Kubernetes-native checkpoint/restore support in runc for three key reasons: it remains fully portable without depending on cloud-provider-specific feature gates, it provides finer-grained control over CRIU for performance tuning, and it allows checkpoint artifacts to be stored in flexible backend storage systems rather than being bundled into OCI container images.
Quiesce/resume hooks: A Dynamo inference worker starts up in two sequential stages. First comes engine setup: communicators are configured, model weights are loaded into memory, kernels are pre-warmed, and CUDA graphs are compiled. At this stage, the worker is fully warmed up but is not yet visible outside its pod. Second comes distributed runtime startup: the worker connects to the Dynamo control plane and registers itself with the discovery backend. From this point forward, open TCP connections to the control plane are active.
If a snapshot were captured after distributed runtime startup, there would be live TCP connections that CRIU cannot preserve. The fix is quiesce/resume hooks: the worker creates a ‘ready for checkpoint’ signal file after engine setup but before distributed runtime startup. It then enters a polling loop, waiting for a ‘restore complete’ signal file while the snapshot agent captures the checkpoint externally. Because CRIU resumes execution at the exact instruction where the snapshot was taken, the worker picks up right inside the polling loop, detects the signal file, and continues with distributed runtime initialization — no extra synchronization needed.
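A minimal sketch of the signal-file handshake described above, written as a plain Python polling loop. The file names, the `max_polls` bound, and the `fake_agent` stand-in are all hypothetical and exist only so the example can run on its own; the actual Dynamo worker and snapshot agent may differ.

```python
import os
import tempfile
import threading
import time
from typing import Optional

READY_FILE = "ready-for-checkpoint"   # hypothetical file name
RESTORED_FILE = "restore-complete"    # hypothetical file name


def quiesce_until_restored(signal_dir: str, poll_interval: float = 0.05,
                           max_polls: Optional[int] = None) -> bool:
    """Announce readiness, then poll until the restore signal appears.

    Intended to run after engine setup and before distributed runtime
    startup. Returns True once the restore signal is seen, or False if
    max_polls is exhausted (max_polls=None polls indefinitely).
    """
    with open(os.path.join(signal_dir, READY_FILE), "w") as f:
        f.write("ready\n")

    restored_path = os.path.join(signal_dir, RESTORED_FILE)
    polls = 0
    while max_polls is None or polls < max_polls:
        # A checkpoint taken inside this loop resumes inside this loop:
        # the restored process runs the next iteration and finds the file.
        if os.path.exists(restored_path):
            return True
        time.sleep(poll_interval)
        polls += 1
    return False


def fake_agent(signal_dir: str) -> None:
    """Stand-in for the snapshot agent: wait for readiness, then signal."""
    ready_path = os.path.join(signal_dir, READY_FILE)
    while not os.path.exists(ready_path):
        time.sleep(0.01)
    # A real agent would checkpoint here and restore the process later,
    # possibly on another node, before creating the signal file.
    with open(os.path.join(signal_dir, RESTORED_FILE), "w") as f:
        f.write("restored\n")


def run_demo() -> bool:
    with tempfile.TemporaryDirectory() as signal_dir:
        agent = threading.Thread(target=fake_agent, args=(signal_dir,))
        agent.start()
        resumed = quiesce_until_restored(signal_dir, max_polls=200)
        agent.join()
        return resumed
```

The loop counts iterations rather than comparing wall-clock timestamps, since an arbitrary amount of time can pass between checkpoint and restore.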
The quiesce/resume pattern also matters for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be captured in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state must be recreated after restore.
Optimization 1: KV Cache Unmap and Release
After measuring peak GPU memory usage with weights, CUDA graphs, and other buffers in place, inference engines assign the remaining GPU memory as a single large KV cache buffer. Since the checkpoint is captured before the replica has handled any requests, this KV cache buffer carries no meaningful data and does not need to be included in the snapshot. However, its virtual address must stay fixed because it is embedded in the CUDA graph.
The approach is to allocate the KV cache through the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then release the underlying physical memory with cuMemUnmap and cuMemRelease — but not cuMemAddressFree. This preserves the virtual address range while freeing the physical memory. This capability is built into vLLM via sleep() and wake_up(), and into SGLang via torch_memory_saver.
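The call ordering can be illustrated with a toy model. The class below makes no CUDA calls at all; it only tracks which pieces of state each driver call would affect, under the assumption that the address range was first reserved with `cuMemAddressReserve`. The class name, method names, and the base address are hypothetical.

```python
class ToyKVCacheRange:
    """Bookkeeping-only model of one CUDA VMM range. No GPU is touched."""

    def __init__(self, size_bytes: int, base_address: int = 0x7F0000000000):
        # Models cuMemAddressReserve: a virtual range with nothing behind it.
        self.size_bytes = size_bytes
        self.virtual_address = base_address
        self.physical_bytes = 0
        self.mapped = False

    def back_with_physical_memory(self) -> None:
        # Models cuMemCreate followed by cuMemMap onto the reserved range.
        self.physical_bytes = self.size_bytes
        self.mapped = True

    def release_physical_memory(self) -> None:
        # Models cuMemUnmap followed by cuMemRelease. cuMemAddressFree is
        # deliberately NOT modeled here, so the virtual address stays valid.
        self.mapped = False
        self.physical_bytes = 0


kv = ToyKVCacheRange(size_bytes=8 * 1024**3)
kv.back_with_physical_memory()       # normal operation
address_in_cuda_graph = kv.virtual_address
kv.release_physical_memory()         # before checkpoint: nothing left to dump
kv.back_with_physical_memory()       # after restore: same address, new backing
```

Because the address captured in the CUDA graph never changes, the graph stays usable once fresh physical memory is mapped back in after restore.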
For Qwen3-0.6B on a B200, this cuts the total artifact size from ~190 GiB down to ~6 GiB. The savings are greatest when the KV cache is large relative to the model — that is, when model weights are small compared to available GPU memory.
Optimization 2: Speeding Up CRIU Memory Restore
Even with a smaller artifact, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually surpasses cold-start time, which cancels out the advantage of checkpointing.
Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may become available once merged into upstream CRIU.
2.1 — Parallel memfd restore: vLLM’s sleep()/wake_up() path and SGLang’s torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these show up as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed over 120 GiB, spread across many independent buffers of 2 GiB or smaller. Upstream CRIU restores those buffers one at a time. The modified CRIU identifies all unique shmem-backed objects, then uses a thread pool to restore them in parallel, letting restore take advantage of available storage bandwidth and CPU parallelism.
2.2 — Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path uses a synchronous preadv loop with only one read in flight at any time, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and maintains a sliding window of up to 128 concurrent reads in flight. As completions arrive via io_getevents, new submissions fill the open slots in the window.
Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable — such as some NFS deployments — restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.
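The sliding-window idea can be sketched in Python, with one important substitution: the standard library has no binding for `io_submit` and `io_getevents`, so this example uses a thread pool issuing `os.pread` calls as a stand-in for native AIO. It shows the windowing logic only (keep up to `window` reads in flight, refill a slot as each read completes) and is not how the modified CRIU is implemented. The function name and parameters are hypothetical.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def windowed_read(path: str, chunk_size: int, window: int = 128) -> bytes:
    """Read a whole file while keeping up to `window` chunk reads in flight."""
    size = os.path.getsize(path)
    out = bytearray(size)
    offsets = iter(range(0, size, chunk_size))
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=window) as pool:
            in_flight = {}  # future -> file offset of that read

            def submit_next() -> bool:
                off = next(offsets, None)
                if off is None:
                    return False
                in_flight[pool.submit(os.pread, fd, chunk_size, off)] = off
                return True

            # Fill the initial window.
            while len(in_flight) < window and submit_next():
                pass

            # As reads complete, copy their data and refill the open slots.
            while in_flight:
                done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
                for fut in done:
                    off = in_flight.pop(fut)
                    data = fut.result()
                    out[off:off + len(data)] = data
                    submit_next()
    finally:
        os.close(fd)
    return bytes(out)
```

A synchronous loop is the special case `window=1`: each read must finish before the next is issued, which is the pattern the text attributes to upstream CRIU.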
Combined results across three models (checkpoint sizes after KV cache unmap):
| Model | Checkpoint Size | CRIU (upstream) | CRIU (AIO) | CRIU (AIO + parallel memfd) | Speedup | SOL* |
|---|---|---|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 6.8 s | 2.9 s | 2.4 s | 2.8× | 0.95 s |
| Qwen3-8B | 26 GiB | 24 s | 11 s | 4.7 s | 5.1× | 1.8 s |
| gpt-oss-120b | 129 GiB | 119 s | 54 s | 15 s | 7.9× | 11 s |
*SOL (speed of light) is the theoretical minimum restore time given available storage bandwidth: the floor below which restore time cannot go.
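As a quick check of the table, the Speedup column is the upstream restore time divided by the fully optimized restore time. The short script below recomputes it from the table's own figures and also derives how far each optimized result sits from SOL; that second ratio is not in the original table and is shown only as a derived illustration.

```python
# (checkpoint GiB, upstream s, AIO + parallel memfd s, SOL s) from the table
rows = {
    "Qwen3-0.6B": (6.2, 6.8, 2.4, 0.95),
    "Qwen3-8B": (26, 24, 4.7, 1.8),
    "gpt-oss-120b": (129, 119, 15, 11),
}

speedups = {}
sol_ratios = {}
for model, (_size_gib, upstream_s, optimized_s, sol_s) in rows.items():
    # Speedup: how many times faster the optimized restore is than upstream.
    speedups[model] = round(upstream_s / optimized_s, 1)
    # Ratio to SOL: 1.0 would mean restore runs at the theoretical floor.
    sol_ratios[model] = round(optimized_s / sol_s, 2)

for model in rows:
    print(f"{model}: {speedups[model]}x faster, {sol_ratios[model]}x SOL")
```

The recomputed speedups match the table (2.8×, 5.1×, 7.9×), and the gap to SOL is smallest for gpt-oss-120b.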
At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by the sequential transfer of large model weights from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU has materialized the weights in host memory.
Optimization 3: GPU Memory Service (GMS)
To eliminate the serial weight-transfer bottleneck, NVIDIA’s research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime, offloading the bulk of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to proceed concurrently using different memory bandwidth channels. Weight restoration can leverage the fastest available paths such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.
Checkpoint artifact sizes with GMS:
| Model | CRIU checkpoint (baseline) | CRIU checkpoint (with GMS) | GMS weight artifact |
|---|---|---|---|
| Qwen3-0.6B | 6.2 GiB | 4.3 GiB | 1.2 GiB |
| Qwen3-8B | 26 GiB | 4.8 GiB | 15 GiB |
| gpt-oss-120b | 129 GiB | 5.1 GiB | 118 GiB |
In a proof-of-concept weight restoration backend that distributes model weights across 8 local NVMe SSDs in a striped configuration, weight restoration happens simultaneously with the CRIU process restore step. This parallelization brings the total end-to-end startup time for gpt-oss-120b to under 5 seconds — a 21× improvement. All restore times are measured from a shared trigger timestamp, and container startup time is not included in these figures.
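The benefit of overlapping the two restore paths can be sketched as follows. The two restore functions are hypothetical placeholders that return labels instead of doing real work, and the timing helper is a simplified model that assumes the two paths do not contend for the same bandwidth; neither reflects the actual GMS implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def restore_process_state() -> str:
    """Placeholder for the CRIU process restore (hypothetical)."""
    return "process-state"


def restore_weights() -> str:
    """Placeholder for the GMS weight restore (hypothetical)."""
    return "weights"


def restore_concurrently() -> Tuple[str, str]:
    """Start both restores at once and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        process_future = pool.submit(restore_process_state)
        weights_future = pool.submit(restore_weights)
        return process_future.result(), weights_future.result()


def end_to_end_seconds(process_s: float, weights_s: float,
                       overlapped: bool) -> float:
    """Simplified model: serial restores add up, overlapped ones do not."""
    if overlapped:
        return max(process_s, weights_s)
    return process_s + weights_s
```

In this model the overlapped total is bounded by the slower of the two paths, which is why shrinking the CRIU artifact and speeding up weight loading both matter.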
Deployment: Kubernetes Resources
The deployment workflow relies on three Kubernetes resources. The snapshot-agent DaemonSet is deployed through a Helm chart. The DynamoCheckpoint custom resource (abbreviated as dckpt) specifies which model configuration should be checkpointed. The DynamoGraphDeployment CR then references that checkpoint for restoration.
Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver version 580.xx or later on GPU nodes (590.xx or later for multi-GPU snapshots); ReadWriteMany storage to support cross-node restore; currently only vLLM is supported as a backend, and it is in limited preview.
The DynamoCheckpoint identity is derived as a 16-character SHA256 hash computed from fields that influence runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not influence the hash include replica count, node placement, resource limits, and observability settings.
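One way such an identity could be computed is sketched below. The source states only that the identity is a 16-character SHA256 hash over the listed fields; the canonical JSON serialization, the choice of the first 16 hex characters, and the example values are assumptions made for illustration.

```python
import hashlib
import json

# Fields the text lists as influencing the identity hash.
HASHED_FIELDS = (
    "model", "backendFramework", "dynamoVersion", "tensorParallelSize",
    "pipelineParallelSize", "dtype", "maxModelLen", "extraParameters",
)


def checkpoint_identity(spec: dict) -> str:
    """Hash only the runtime-relevant fields; ignore everything else."""
    relevant = {field: spec.get(field) for field in HASHED_FIELDS}
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical example values.
base = {
    "model": "Qwen/Qwen3-0.6B",
    "backendFramework": "vllm",
    "tensorParallelSize": 1,
    "dtype": "bfloat16",
}
scaled_out = dict(base, replicas=8)       # replica count is not hashed
other_dtype = dict(base, dtype="float16")  # dtype is hashed
```

Under this sketch, `base` and `scaled_out` share one identity and can reuse the same checkpoint, while `other_dtype` gets a different one.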
Two deployment modes are available. The explicit checkpointRef mode points to an existing, ready DynamoCheckpoint by name. In Auto mode, the operator calculates the identity hash, searches for a matching DynamoCheckpoint, and creates one only if none is found — meaning the first worker performs a cold start while the checkpoint is generated in the background for future scale-up events.
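The Auto mode decision can be sketched as a find-or-create lookup. The in-memory dictionary stands in for the operator's view of existing DynamoCheckpoint resources, and the `dckpt-` naming scheme is hypothetical. Checkpoint readiness is left out to keep the sketch short.

```python
from typing import Dict, Tuple


def resolve_checkpoint(identity: str,
                       registry: Dict[str, str]) -> Tuple[str, bool]:
    """Return (checkpoint name, cold_start_needed) for an identity hash."""
    if identity in registry:
        # A matching checkpoint exists: new workers can restore from it.
        return registry[identity], False
    # No match: create one, and let this worker cold-start in the meantime.
    name = f"dckpt-{identity}"  # hypothetical naming scheme
    registry[identity] = name
    return name, True


registry: Dict[str, str] = {}
first = resolve_checkpoint("0123456789abcdef", registry)   # first worker
second = resolve_checkpoint("0123456789abcdef", registry)  # later scale-up
```

Only the first worker for a given identity pays the cold-start cost; later workers with the same identity find the existing checkpoint.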
Current limitations: checkpoint/restore is available for vLLM workers only, in limited preview; specialized workers (multimodal, embedding, diffusion) are not yet supported; multi-GPU tensor-parallel configurations have undergone limited testing; GMS restore is not yet available; snapshot-agent requires privileged mode; and the restore process is sensitive to live TCP socket state.
Key Takeaways
- Dynamo Snapshot leverages CRIU and cuda-checkpoint to freeze and resume single-GPU inference workers on Kubernetes, so restored workers skip engine initialization instead of repeating it on every cold start.
- KV cache unmap through cuMemUnmap and cuMemRelease shrinks the checkpoint artifact from roughly 190 GiB down to approximately 6 GiB for Qwen3-0.6B on a B200 GPU.
- Linux native AIO and parallel memfd restore reduce CRIU restore time by as much as 7.9× compared to upstream CRIU; these optimizations are still awaiting upstream CRIU integration.
- The GPU Memory Service (GMS) separates model weights from the CRIU artifact, allowing process restoration and weight loading to proceed concurrently over high-speed channels such as GPUDirect Storage.
- In a proof-of-concept setup using 8 striped local NVMe SSDs, gpt-oss-120b startup time drops by 21×, finishing in under 5 seconds.




