NVIDIA AI Debuts Dynamo Snapshot: CRIU-Powered Instant Inference Bootstrapping In Kubernetes Environments

In production AI inference, traffic rises and falls over time, so inference replicas need to scale up and down elastically. However, starting a new inference workload on Kubernetes from scratch can take several minutes. During that startup period, the GPUs are already reserved but sit idle, producing no output and serving no requests.

“Cold start” refers to the entire sequence a model server must finish before it can begin serving any requests: pulling the container image from a registry, loading the model’s weights into GPU memory, warming up CUDA kernels, compiling or capturing CUDA graphs, and finally registering itself with the service discovery system. This multi-step delay raises the risk of breaching service-level agreements (SLAs) during sudden traffic surges, because the system can’t scale fast enough to absorb the spike in incoming requests.

For a single-GPU vLLM (v0.20.0) workload, the cold-start latency breaks into three distinct phases: container/image pull, engine initialization (which includes loading model weights, warming up CUDA kernels, and compiling CUDA graphs), and distributed runtime startup.

To solve this problem, NVIDIA’s AI research team has introduced NVIDIA Dynamo Snapshot: a checkpoint-and-restore solution designed specifically for AI inference workloads running on Kubernetes.

What Are CRIU and cuda-checkpoint?

The state of a running inference worker that can be checkpointed consists of two parts. Device state (on the GPU side) includes CUDA contexts, streams, device memory allocations, and virtual address mappings—none of which are directly visible from the host CPU. To save this state, cuda-checkpoint leverages the checkpointing capabilities built into the CUDA driver to copy the GPU device state into the CPU memory of the process that owns each CUDA context. Host state (on the CPU side) covers CPU memory, threads, file descriptors, and Linux namespaces. CRIU (Checkpoint/Restore in Userspace) inspects the Linux kernel’s internal bookkeeping structures and serializes the entire process tree’s state out to disk.

During checkpointing, the two tools execute sequentially: cuda-checkpoint first dumps all device state into CPU memory, and then CRIU dumps the full host-side process tree state into a folder on persistent storage. When restoring—whether on the same node or a different one—CRIU reconstructs the process tree from shared storage such as NFS or SMB first, and then cuda-checkpoint transfers the saved GPU state from CPU memory back onto the target GPUs.
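The ordering described above can be sketched as a dry-run command plan. This is a minimal illustration, not the snapshot-agent's actual invocation, which the article does not show. The flags are the ones the two tools expose on their public command lines (`cuda-checkpoint --toggle --pid`, `criu dump`/`criu restore` with `--tree` and `--images-dir`); the PID and the directory path are hypothetical, and the code only builds and prints the commands without running them.

```python
import shlex


def checkpoint_plan(pid: int, images_dir: str) -> list[list[str]]:
    """Commands in checkpoint order: GPU state first, then host state."""
    return [
        # Move the process's CUDA state into its own host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the process tree rooted at pid into images_dir.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
    ]


def restore_plan(images_dir: str, restored_pid: int) -> list[list[str]]:
    """Commands in restore order: host state first, then GPU state."""
    return [
        # Rebuild the process tree from the image directory.
        ["criu", "restore", "--images-dir", images_dir, "--restore-detached"],
        # Push the saved CUDA state from host memory back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(restored_pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and path, for illustration only.
    plan = checkpoint_plan(4242, "/mnt/ckpt/demo") + restore_plan("/mnt/ckpt/demo", 4242)
    for cmd in plan:
        print(shlex.join(cmd))
```

The restore plan is the checkpoint plan mirrored: whichever tool ran last during checkpointing runs first during restore.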

CRIU works as a freeze-and-thaw mechanism at its core. When a process is restored, execution picks up at the exact instruction where the checkpoint was taken, with no awareness that a checkpoint or restore ever happened. Because of this transparency, any coordination that must happen before checkpointing (such as quiescing the workload to a stable state) or after restoring (such as reconnecting external resources) needs to be managed separately—either by an external orchestrator or by workload-specific hooks.

How Dynamo Snapshot Works on Kubernetes

In Kubernetes, workloads run inside containers, which in turn run inside pods. Since CRIU checkpoint files contain references to the container’s writable filesystem layer, checkpointing operates at the container level so that the process tree state and the filesystem are saved together as a single unit.

NVIDIA delivers a privileged DaemonSet called snapshot-agent, deployable via a Helm chart. One agent runs on every cluster node and handles checkpoint and restoration operations for containers managed by runc, without requiring any changes to runc itself. During checkpointing, the agent waits for the workload’s readiness probe to confirm the service is healthy, then invokes cuda-checkpoint and CRIU from the host side and writes the resulting checkpoint artifact to shared storage. The workload may have created or modified files within the container’s local overlay filesystem, and the agent captures those changes as well after the CRIU stage completes.

During restoration, the agent starts a lightweight placeholder pod, restores the overlay filesystem into it, and then restores the CRIU and CUDA checkpoint into the pod’s namespaces. Because each agent operates independently on its own node, checkpoint and restore operations can run in parallel across the entire cluster.

The team chose this DaemonSet-based approach over relying on Kubernetes-native checkpoint/restore support in runc for three key reasons: it remains fully portable without depending on cloud-provider-specific feature gates, it provides finer-grained control over CRIU for performance tuning, and it allows checkpoint artifacts to be stored in flexible backend storage systems rather than being bundled into OCI container images.

Quiesce/resume hooks: A Dynamo inference worker starts up in two sequential stages. First comes engine setup: communicators are configured, model weights are loaded into memory, kernels are pre-warmed, and CUDA graphs are compiled. At this stage, the worker is fully warmed up but is not yet visible outside its pod. Second comes distributed runtime startup: the worker connects to the Dynamo control plane and registers itself with the discovery backend. From this point forward, open TCP connections to the control plane are active.

If a snapshot were captured after distributed runtime startup, there would be live TCP connections that CRIU cannot preserve. The fix is quiesce/resume hooks: the worker creates a ‘ready for checkpoint’ signal file after engine setup but before distributed runtime startup. It then enters a polling loop, waiting for a ‘restore complete’ signal file while the snapshot agent captures the checkpoint externally. Because CRIU resumes execution at the exact instruction where the snapshot was taken, the worker picks up right inside the polling loop, detects the signal file, and continues with distributed runtime initialization — no extra synchronization needed.
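The signal-file handshake can be sketched in a few lines. This is an illustration under assumptions: the file names, the polling interval, and the helper names are hypothetical, and a thread stands in for the snapshot agent. The timeout exists only so the demo terminates; a real worker may sit frozen in a checkpoint for an arbitrary time.

```python
import tempfile
import threading
import time
from pathlib import Path

READY = "ready-for-checkpoint"   # hypothetical file name
RESTORED = "restore-complete"    # hypothetical file name


def wait_for_restore(signal_dir: Path, poll_interval: float = 0.01,
                     timeout: float = 30.0) -> None:
    """Announce readiness, then poll until the restore signal appears."""
    (signal_dir / READY).touch()
    deadline = time.monotonic() + timeout
    while not (signal_dir / RESTORED).exists():
        if time.monotonic() > deadline:
            raise TimeoutError("no restore signal")
        time.sleep(poll_interval)


def run_worker(signal_dir: Path, events: list) -> None:
    events.append("engine_setup")      # weights loaded, graphs compiled
    wait_for_restore(signal_dir)       # a checkpoint would be taken in here
    events.append("runtime_join")      # open TCP connections only now


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        signal_dir, events = Path(tmp), []

        def agent() -> None:
            # Stand-in for the snapshot agent: wait, "checkpoint", signal.
            while not (signal_dir / READY).exists():
                time.sleep(0.01)
            events.append("checkpoint_taken")
            (signal_dir / RESTORED).touch()

        thread = threading.Thread(target=agent)
        thread.start()
        run_worker(signal_dir, events)
        thread.join()
        return events


if __name__ == "__main__":
    print(demo())
```

Because the worker is already inside the loop when it is frozen, the restored copy needs no special entry point: the next poll iteration sees the file and falls through to the runtime join.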

The quiesce/resume pattern also matters for multi-GPU and multi-node checkpoints (planned for a future release): outbound TCP connections used for RPC cannot be captured in an established state because the pod IP changes between checkpoint and restore, and RDMA registrations and NIC state must be recreated after restore.

Optimization 1: KV Cache Unmap and Release

After measuring peak GPU memory usage with weights, CUDA graphs, and other buffers in place, inference engines assign the remaining GPU memory as a single large KV cache buffer. Since the checkpoint is captured before the replica has handled any requests, this KV cache buffer carries no meaningful data and does not need to be included in the snapshot. However, its virtual address must stay fixed because it is embedded in the CUDA graph.

The approach is to allocate the KV cache through the CUDA Virtual Memory Management API (cuMemCreate and cuMemMap), then release the underlying physical memory with cuMemUnmap and cuMemRelease — but not cuMemAddressFree. This preserves the virtual address range while freeing the physical memory. This capability is built into vLLM via sleep() and wake_up(), and into SGLang via torch_memory_saver.
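A toy bookkeeping model makes the distinction between the virtual address range and its physical backing concrete. This is not the CUDA API: the class below only mimics the roles of `cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, `cuMemUnmap`, and `cuMemRelease` with Python dictionaries, and the 180 GiB buffer size is a hypothetical figure chosen for illustration.

```python
GIB = 1 << 30


class ToyVmm:
    """Toy model of virtual-range vs physical-memory bookkeeping."""

    def __init__(self) -> None:
        self._next_va = 0x7F00_0000_0000
        self._next_handle = 1
        self.reserved: dict[int, int] = {}   # va -> size (address reserve)
        self.physical: dict[int, int] = {}   # handle -> size (create)
        self.mapped: dict[int, int] = {}     # va -> handle (map)

    def address_reserve(self, size: int) -> int:
        va = self._next_va
        self._next_va += size
        self.reserved[va] = size
        return va

    def create(self, size: int) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.physical[handle] = size
        return handle

    def map(self, va: int, handle: int) -> None:
        if self.reserved[va] != self.physical[handle]:
            raise ValueError("size mismatch")
        self.mapped[va] = handle

    def unmap(self, va: int) -> int:
        return self.mapped.pop(va)

    def release(self, handle: int) -> None:
        del self.physical[handle]

    def physical_bytes(self) -> int:
        return sum(self.physical.values())


def demo() -> tuple:
    vmm = ToyVmm()
    kv_size = 180 * GIB                      # hypothetical KV cache size
    kv_va = vmm.address_reserve(kv_size)
    vmm.map(kv_va, vmm.create(kv_size))
    graph_pointer = kv_va                    # address baked into a CUDA graph
    before = vmm.physical_bytes()
    vmm.release(vmm.unmap(kv_va))            # unmap + release, no address free
    at_checkpoint = vmm.physical_bytes()
    still_reserved = kv_va in vmm.reserved
    vmm.map(kv_va, vmm.create(kv_size))      # after restore: back it again
    return before, at_checkpoint, still_reserved and graph_pointer == kv_va


if __name__ == "__main__":
    print(demo())
```

At checkpoint time the model holds zero physical bytes for the KV cache, yet the address the graph captured is still reserved, so fresh physical memory can be mapped at the same place after restore.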

For Qwen3-0.6B on a B200, this cuts the total artifact size from ~190 GiB down to ~6 GiB. The savings are greatest when the KV cache is large relative to the model — that is, when model weights are small compared to available GPU memory.

Optimization 2: Speeding Up CRIU Memory Restore

Even with a smaller artifact, upstream CRIU restore time remains a bottleneck. For larger models, restore time actually surpasses cold-start time, which cancels out the advantage of checkpointing.

Note: The CRIU optimizations described below are not yet shipped as part of Dynamo Snapshot. They may become available once merged into upstream CRIU.

2.1 — Parallel memfd restore: vLLM’s sleep()/wake_up() path and SGLang’s torch_memory_saver move weight-tagged GPU allocations into pinned CPU shadow buffers. CUDA backs these allocations with shared anonymous memory, pinned through the NVIDIA driver. Inside the Linux kernel, these show up as memfds: anonymous, RAM-backed files mapped with MAP_SHARED. For gpt-oss-120b, these buffers consumed over 120 GiB, spread across many independent buffers of 2 GiB or smaller. Upstream CRIU restores those buffers one at a time. The modified CRIU identifies all unique shmem-backed objects, then uses a thread pool to restore them in parallel, letting restore take advantage of available storage bandwidth and CPU parallelism.
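The per-buffer restore steps and their parallelization can be sketched as follows. This is an illustration, not the modified CRIU code: the image layout (a flat file with one region per buffer), the function names, and the worker count are assumptions. It uses `os.memfd_create` where available (Linux) and falls back to an unlinked temporary file elsewhere.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def create_shared_object(name: str, size: int) -> int:
    """Create an anonymous file of the given size and return its fd."""
    if hasattr(os, "memfd_create"):
        fd = os.memfd_create(name)
    else:  # non-Linux fallback for the demo only
        fd, path = tempfile.mkstemp()
        os.unlink(path)
    os.ftruncate(fd, size)
    return fd


def restore_one(image_path: str, offset: int, size: int, name: str) -> int:
    """Create, size, map, and fill one shared buffer from the image."""
    fd = create_shared_object(name, size)
    with mmap.mmap(fd, size, flags=mmap.MAP_SHARED) as view, \
            open(image_path, "rb") as image:
        image.seek(offset)
        view.write(image.read(size))
    return fd


def restore_all(image_path: str, regions: list, workers: int = 8) -> list:
    """Restore every (offset, size) region concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(restore_one, image_path, off, size, f"buf{i}")
            for i, (off, size) in enumerate(regions)
        ]
        return [f.result() for f in futures]


def demo() -> bool:
    chunks = [bytes([i]) * 4096 * (i + 1) for i in range(4)]
    regions, offset = [], 0
    for chunk in chunks:
        regions.append((offset, len(chunk)))
        offset += len(chunk)
    with tempfile.NamedTemporaryFile(delete=False) as image:
        image.write(b"".join(chunks))
        path = image.name
    try:
        fds = restore_all(path, regions)
        ok = all(os.pread(fd, len(c), 0) == c for fd, c in zip(fds, chunks))
        for fd in fds:
            os.close(fd)
        return ok
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The buffers are independent of one another, which is what makes a plain thread pool sufficient: no thread needs to wait on another's result.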

2.2 — Linux native AIO for anonymous memory: In upstream CRIU, the memory restore path uses a synchronous preadv loop with only one read in flight at any time, leaving the storage device idle between requests. The replacement uses Linux native AIO: CRIU submits a batch of iocbs via io_submit and maintains a sliding window of up to 128 concurrent reads in flight. As completions arrive via io_getevents, new submissions fill the open slots in the window.

Where the storage backend supports it, both anonymous and shared memory reads use O_DIRECT, avoiding unnecessary page cache pressure during the one-pass restore stream. Linux native AIO is only truly asynchronous on files opened with O_DIRECT. On filesystems where O_DIRECT is unavailable — such as some NFS deployments — restore falls back to buffered I/O with sequential readahead, and the gains from AIO are significantly reduced.
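The sliding-window logic can be illustrated without the kernel AIO interface. Python's standard library has no binding for `io_submit` or `io_getevents`, so this sketch swaps in a thread pool of blocking `os.pread` calls and does not use O_DIRECT; it shows only how a bounded window of in-flight reads is refilled as completions arrive. The chunk size and function names are assumptions.

```python
import itertools
import os
import tempfile
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def windowed_read(fd: int, total: int, chunk: int = 1 << 20,
                  window: int = 128) -> bytes:
    """Read `total` bytes from fd with at most `window` reads in flight."""
    out = bytearray(total)
    offsets = iter(range(0, total, chunk))
    in_flight = {}  # future -> file offset

    def read_at(off: int) -> bytes:
        # Assumes a regular file, where pread returns the full request.
        return os.pread(fd, min(chunk, total - off), off)

    with ThreadPoolExecutor(max_workers=window) as pool:
        # Prime the window with the first batch of submissions.
        for off in itertools.islice(offsets, window):
            in_flight[pool.submit(read_at, off)] = off
        while in_flight:
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                off = in_flight.pop(future)
                data = future.result()
                out[off:off + len(data)] = data
                # Each completion frees a slot; refill it immediately.
                nxt = next(offsets, None)
                if nxt is not None:
                    in_flight[pool.submit(read_at, nxt)] = nxt
    return bytes(out)


def demo() -> bool:
    payload = os.urandom(5 * 4096 + 123)
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        return windowed_read(f.fileno(), len(payload), chunk=4096,
                             window=4) == payload


if __name__ == "__main__":
    print(demo())
```

The contrast with a synchronous loop is that the storage device always has several requests queued, so it is not left idle between one read finishing and the next being issued.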

Combined results across three models (checkpoint sizes after KV cache unmap):

Model	Checkpoint Size	CRIU (upstream)	CRIU (AIO)	CRIU (AIO + parallel memfd)	Speedup	SOL*
Qwen3-0.6B	6.2 GiB	6.8 s	2.9 s	2.4 s	2.8×	0.95 s
Qwen3-8B	26 GiB	24 s	11 s	4.7 s	5.1×	1.8 s
gpt-oss-120b	129 GiB	119 s	54 s	15 s	7.9×	11 s

*SOL (speed of light) is the theoretical minimum restore time given the available storage bandwidth: the floor below which restore time cannot go.
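The derived columns can be rechecked with a few lines of arithmetic. The inputs below are copied from the table; the achieved-throughput and multiple-of-SOL figures are computed here and do not appear in the source.

```python
# (checkpoint GiB, upstream s, AIO + parallel memfd s, SOL s), from the table
ROWS = {
    "Qwen3-0.6B": (6.2, 6.8, 2.4, 0.95),
    "Qwen3-8B": (26, 24, 4.7, 1.8),
    "gpt-oss-120b": (129, 119, 15, 11),
}


def summarize(size_gib: float, upstream_s: float, optimized_s: float,
              sol_s: float) -> dict:
    return {
        # Speedup of the fully optimized restore over upstream CRIU.
        "speedup": round(upstream_s / optimized_s, 1),
        # Effective read rate of the optimized restore.
        "throughput_gib_s": round(size_gib / optimized_s, 1),
        # How many times slower than the speed-of-light floor.
        "times_sol": round(optimized_s / sol_s, 1),
    }


if __name__ == "__main__":
    for model, row in ROWS.items():
        print(model, summarize(*row))
```

The computed speedups match the table (2.8×, 5.1×, 7.9×). On these numbers the optimized restore for gpt-oss-120b runs at about 1.4× the SOL time, while the two smaller models sit at roughly 2.5× to 2.6×.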

At this point CRIU restore time is close to SOL, but end-to-end restore is still dominated by the sequential transfer of large model weights from storage through host memory onto the GPU. This is a serial bottleneck: cuda-checkpoint cannot restore GPU memory until CRIU has materialized the weights in host memory.

Optimization 3: GPU Memory Service (GMS)

To eliminate the serial weight-transfer bottleneck, NVIDIA’s research team developed the GPU Memory Service (GMS). GMS uses the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifetime, offloading the bulk of process memory into a separate GMS artifact. By removing weights from the core CRIU checkpoint, GMS allows process state restoration and weight restoration to proceed concurrently using different memory bandwidth channels. Weight restoration can leverage the fastest available paths such as GPUDirect Storage (GDS) or peer-GPU RDMA/NVLink.

Checkpoint artifact sizes with GMS:

Model	CRIU checkpoint (baseline)	CRIU checkpoint (with GMS)	GMS weight artifact
Qwen3-0.6B	6.2 GiB	4.3 GiB	1.2 GiB
Qwen3-8B	26 GiB	4.8 GiB	15 GiB
gpt-oss-120b	129 GiB	6.7 GiB	74 GiB

In a proof-of-concept weight restoration backend that distributes model weights across 8 local NVMe SSDs in a striped configuration, weight restoration happens simultaneously with the CRIU process restore step. This parallelization brings the total end-to-end startup time for gpt-oss-120b to under 5 seconds — a 21× improvement. All restore times are measured from a shared trigger timestamp, and container startup time is not included in these figures.

Deployment: Kubernetes Resources

The deployment workflow relies on three Kubernetes resources. The snapshot-agent DaemonSet is deployed through a Helm chart. The DynamoCheckpoint custom resource (abbreviated as dckpt) specifies which model configuration should be checkpointed. The DynamoGraphDeployment CR then references that checkpoint for restoration.

Prerequisites from the documentation: x86_64 (amd64) GPU nodes; NVIDIA driver version 580.xx or later on GPU nodes (590.xx or later for multi-GPU snapshots); ReadWriteMany storage to support cross-node restore; currently only vLLM is supported as a backend, and it is in limited preview.

The DynamoCheckpoint identity is derived as a 16-character SHA256 hash computed from fields that influence runtime state: model, backendFramework, dynamoVersion, tensorParallelSize, pipelineParallelSize, dtype, maxModelLen, and extraParameters. Fields that do not influence the hash include replica count, node placement, resource limits, and observability settings.
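A sketch of how such an identity could be derived is shown below. The field list comes from the text; the canonical JSON serialization and the choice of the first 16 hex characters of the digest are assumptions, since the article does not describe the operator's actual encoding. The example spec values are hypothetical.

```python
import hashlib
import json

# Fields the article lists as influencing the identity.
IDENTITY_FIELDS = (
    "model", "backendFramework", "dynamoVersion", "tensorParallelSize",
    "pipelineParallelSize", "dtype", "maxModelLen", "extraParameters",
)


def checkpoint_identity(spec: dict) -> str:
    """Hash only the identity fields; everything else is ignored."""
    identity = {key: spec.get(key) for key in IDENTITY_FIELDS}
    # Assumed canonical form: sorted keys, no whitespace.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    base = {
        "model": "Qwen/Qwen3-8B", "backendFramework": "vllm",
        "dynamoVersion": "1.1.1", "tensorParallelSize": 1,
        "pipelineParallelSize": 1, "dtype": "bfloat16",
        "maxModelLen": 8192, "extraParameters": {},
        "replicas": 2,  # not an identity field
    }
    print(checkpoint_identity(base))
    print(checkpoint_identity({**base, "replicas": 10}))      # same identity
    print(checkpoint_identity({**base, "maxModelLen": 4096})) # new identity
```

Keeping replica count and placement out of the hash is what lets one checkpoint serve every scale-up of the same model configuration.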

Two deployment modes are available. The explicit checkpointRef mode points to an existing, ready DynamoCheckpoint by name. In Auto mode, the operator calculates the identity hash, searches for a matching DynamoCheckpoint, and creates one only if none is found — meaning the first worker performs a cold start while the checkpoint is generated in the background for future scale-up events.

Current limitations: checkpoint/restore is available for vLLM workers only, in limited preview; specialized workers (multimodal, embedding, diffusion) are not yet supported; multi-GPU tensor-parallel configurations have undergone limited testing; GMS restore is not yet available; snapshot-agent requires privileged mode; and the restore process is sensitive to live TCP socket state.

Key Takeaways

Dynamo Snapshot uses CRIU and cuda-checkpoint to freeze and resume single-GPU inference workers on Kubernetes, so a restored worker skips engine initialization (weight loading, kernel warm-up, CUDA graph compilation) instead of repeating it on every scale-up.
KV cache unmap through cuMemUnmap and cuMemRelease shrinks the checkpoint artifact from roughly 190 GiB down to approximately 6 GiB for Qwen3-0.6B on a B200 GPU.
Linux native AIO and parallel memfd restore reduce CRIU restore time by as much as 7.9× compared to upstream CRIU; these optimizations are still awaiting upstream CRIU integration.
The GPU Memory Service (GMS) separates model weights from the CRIU artifact, allowing process restoration and weight loading to proceed concurrently over high-speed channels such as GPUDirect Storage.
In a proof-of-concept setup using 8 striped local NVMe SSDs, gpt-oss-120b startup time drops by 21×, finishing in under 5 seconds.

Marktechpost’s Visual Explainer

01 — Overview
What Is NVIDIA Dynamo Snapshot?

In production inference environments, demand shifts over time, requiring inference replicas to scale up and down elastically.
Cold-starting inference workloads on Kubernetes can take several minutes. Throughout that period, GPUs remain allocated yet idle,
producing no tokens and handling no requests.

NVIDIA Dynamo Snapshot is a checkpoint/restore solution designed for AI inference workloads running on Kubernetes.
It captures the complete state of a running inference worker — both on the GPU side and the CPU side — and restores it on the same node or a different one,
entirely skipping the cold-start sequence.

21x startup speedup: gpt-oss-120b (PoC)

<5 s restore time: 8× NVMe SSDs (PoC)

6 GiB checkpoint size: vs ~190 GiB baseline

02 — Core Tools
CRIU and cuda-checkpoint

A running inference worker holds two categories of state that can be checkpointed. Dynamo Snapshot employs one tool for each category:

cuda-checkpoint — captures GPU device state (CUDA contexts, streams, device memory, virtual address mappings) and writes it into the CPU memory of the process that owns each CUDA context. Relies on the checkpointing capability built into the CUDA driver.
CRIU (Checkpoint/Restore in Userspace) — inspects Linux kernel data structures and serializes the host-side process tree (CPU memory, threads, file descriptors, namespaces) out to disk.

Checkpoint sequence: cuda-checkpoint writes GPU state to CPU memory first, then CRIU writes all host-side state to storage.
Restore sequence: CRIU rebuilds the process tree from storage first, then cuda-checkpoint restores GPU state from the now-available CPU memory onto the target GPUs.

03 — Kubernetes Architecture
The snapshot-agent DaemonSet

Dynamo Snapshot is deployed as a privileged DaemonSet named snapshot-agent, installed via a Helm chart.
A single agent runs on every node and manages checkpoint and restore operations for runc-managed containers without requiring any modifications to runc itself.

During checkpoint: the agent waits for the workload readiness probe to pass, invokes cuda-checkpoint and CRIU, then writes the resulting artifact to shared storage. The overlay filesystem is also captured after the CRIU stage completes.
During restore: the agent starts a lightweight placeholder pod, restores the overlay filesystem, then restores the CRIU/CUDA checkpoint into the pod’s namespaces.
Parallelism: each agent works independently on its local node, so checkpoints and restores naturally parallelize across the entire cluster.
Portability: no dependency on cloud-provider feature gates; checkpoint artifacts reside in flexible storage backends rather than being embedded inside OCI images.

04 — Checkpoint Timing
When to Checkpoint a Dynamo Worker

A Dynamo inference worker goes through two sequential startup phases. The snapshot must be captured in the window between them:

Phase 1 — Engine warm-up: communication backends are set up, model weights are loaded into GPU memory, CUDA kernels are warmed up, and CUDA graphs are compiled. At this point the worker is fully initialized but is not yet visible to the rest of the cluster.
Phase 2 — Distributed runtime join: the worker connects to the Dynamo control plane and registers itself with the service discovery layer. From this moment on, live TCP connections are established — and CRIU is unable to capture those.

How it works: once Phase 1 finishes and before Phase 2 begins, the worker writes a "ready for checkpoint" signal file and then enters a polling loop.
The snapshot agent captures the checkpoint while the worker is paused in that loop. When restoring, CRIU resumes execution inside the same polling loop.
The worker detects a "restore complete" signal file and immediately continues into Phase 2 — no additional coordination is required.

05 — Optimization 1
KV Cache Unmap and Release

After model weights and CUDA graphs are placed in GPU memory, inference engines allocate whatever GPU memory remains as a large KV cache buffer.
Because the snapshot is captured before any inference requests have been processed, the KV cache holds no meaningful data and does not need to be included in the checkpoint.
The catch is that its virtual address range must remain unchanged — it is already baked into the compiled CUDA graphs.

Allocate the KV cache through the CUDA Virtual Memory Management API: cuMemCreate + cuMemMap.
Release the backing physical memory with cuMemUnmap + cuMemRelease — but do not call cuMemAddressFree. This keeps the virtual address range reserved and valid.
This capability is already built into vLLM via its sleep() / wake_up() mechanism, and into SGLang via torch_memory_saver.

~190 GiB before unmap (Qwen3-0.6B on B200)

~6 GiB after unmap (same model, same GPU)

06 — Optimization 2.1
Parallel memfd Restore

vLLM’s sleep()/wake_up() and SGLang’s torch_memory_saver migrate weight-tagged GPU allocations
into pinned CPU-side shadow buffers. Under the hood, CUDA backs these with shared anonymous memory regions, which the Linux kernel represents as
memfds — anonymous, RAM-backed files mapped with MAP_SHARED.

For gpt-oss-120b, these shadow buffers added up to over 120 GiB, spread across many individual chunks of up to 2 GiB each.
Stock CRIU restores them one at a time: it creates a shmem-backed object, sets its size, maps it, reads the data from the checkpoint, then moves on to the next.
NVIDIA's modified CRIU first identifies all unique shmem-backed objects, then spins up a thread pool to restore them concurrently. Each thread independently allocates its buffer and reads from the checkpoint image.

Note: These CRIU enhancements are not yet included in Dynamo Snapshot. They may become available once they are merged into the upstream CRIU project.

07 — Optimization 2.2
Linux Native AIO for Anonymous Memory

Once shared resources are restored, CRIU must populate each process’s private memory regions — heap pages, thread stacks, anonymous mappings,
and copy-on-write private file mappings — at the exact same virtual addresses they occupied before the checkpoint.

Stock CRIU: uses a synchronous preadv loop — only one read operation is in flight at any time. The storage device sits idle between successive reads, so fast NVMe bandwidth cannot be fully utilized.
Modified CRIU: uses Linux native AIO. It submits batches of iocbs through io_submit, maintaining a sliding window of up to 128 concurrent reads. Completed operations are collected via io_getevents, and new submissions immediately fill the freed slots.
Where the filesystem supports it, both anonymous and shared memory reads use O_DIRECT to bypass the page cache entirely. AIO delivers true asynchronous behavior only on O_DIRECT files — on certain NFS setups that lack O_DIRECT support, the speedup is smaller.

08 — Results
CRIU Restore Time Comparison
Aggregate results across three models after applying KV cache unmap. SOL (speed of light) represents the theoretical fastest restore time given the available storage bandwidth.

Model	Ckpt Size	Upstream	AIO only	AIO + memfd	Speedup	SOL
Qwen3-0.6B	6.2 GiB	6.8 s	2.9 s	2.4 s	2.8x	0.95 s
Qwen3-8B	26 GiB	24 s	11 s	4.7 s	5.1x	1.8 s
gpt-oss-120b	129 GiB	119 s	54 s	15 s	7.9x	11 s

Note: These CRIU optimizations are still awaiting upstream merge and are not yet shipped as part of Dynamo Snapshot.

09 — Optimization 3
GPU Memory Service (GMS)

Even with all the CRIU-level improvements above, a fundamental bottleneck persisted: cuda-checkpoint cannot begin restoring GPU memory until CRIU
has fully materialized the model weights in host memory — creating a strict serial dependency. GMS removes this dependency entirely.

GMS leverages the CUDA Virtual Memory Management (VMM) API to decouple large model weights from the inference worker’s process lifecycle, managing them as a separate GMS artifact.
Process state restoration (via CRIU) and weight restoration (via GMS) proceed in parallel, each using independent memory bandwidth channels.
Weight data can be streamed through the fastest available paths: GPUDirect Storage (GDS) for direct storage-to-GPU transfers, or peer-GPU RDMA/NVLink for GPU-to-GPU copies.

Model	CRIU (baseline)	CRIU + GMS	GMS artifact
Qwen3-0.6B	6.2 GiB	4.3 GiB	1.2 GiB
Qwen3-8B	26 GiB	4.8 GiB	15 GiB
gpt-oss-120b	129 GiB	6.7 GiB	74 GiB

10 — Deployment & Roadmap
Deploying Snapshots & What’s Coming Up
Requirements (from docs.nvidia.com/dynamo v1.1.1):

x86_64 (amd64) GPU nodes; NVIDIA driver 580.xx+ (590.xx+ for multi-GPU snapshots)
ReadWriteMany storage for restoring across nodes
vLLM backend only — limited preview; specialized workers (multimodal, embedding, diffusion) are not yet supported
Checkpoint identity is a 16-character SHA256 hash derived from: model, backendFramework, tensorParallelSize, dtype, maxModelLen, dynamoVersion, pipelineParallelSize, extraParameters

Roadmap (currently in progress):

GMS restore path with pluggable backends (GDS, UCX) — pending an upcoming CUDA driver patch
TensorRT-LLM support
Multi-GPU and multi-node support via quiesce/resume hooks for PyTorch, NCCL, and NIXL
