NVIDIA's Cuda-oxide: A Rust-to-CUDA Compiler Backend That Translates SIMT GPU Kernels Straight To PTX

Step 01 of 09 · Prerequisites

What You Need Before You Start

cuda-oxide has specific version requirements for each dependency. Before installing anything, verify your system meets all of these. The project is currently Linux-only (tested on Ubuntu 24.04).

Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21 / libclang-common-21-dev
Git

ⓘ Why LLVM 21?
Simple kernels may work on LLVM 20, but anything targeting Hopper or Blackwell — TMA, tcgen05, WGMMA — requires llc from LLVM 21 or later. This is a hard requirement, not a recommendation.

Check your current CUDA version to confirm compatibility:

nvcc --version

Step 02 of 09 · Install Rust Nightly

Set Up the Rust Nightly Toolchain

cuda-oxide requires Rust nightly with two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository — it will be installed automatically when you first run a build inside the repo.

If you need to install it manually:

# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

# Add required components
rustup component add rust-src rustc-dev 
  --toolchain nightly-2026-04-03

# Confirm the toolchain is active
rustup show

ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that the custom codegen backend hooks into. rust-src is needed so the compiler can find and compile its own standard library sources for the device target.

Step 03 of 09 · Install LLVM 21

Install LLVM 21 with the NVPTX Backend

The cuda-oxide pipeline emits textual LLVM IR (.ll files) and hands them to the external llc binary to produce PTX. You need LLVM 21 or later with the NVPTX backend enabled.

# Ubuntu/Debian
sudo apt install llvm-21

# Verify the NVPTX backend is present
llc-21 --version | grep nvptx

The pipeline auto-discovers llc-22 and llc-21 on your PATH in that order. To pin a specific binary, set the environment variable:

# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21

⚠ Common Failure
If NVPTX does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro’s default packages, which may omit GPU backends.

Step 04 of 09 · Install Clang

Install Clang 21 for the cuda-bindings Crate

The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen needs libclang — and specifically, it needs Clang’s own resource directory (which includes stddef.h). A bare libclang1-* runtime package is not enough.

# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev

⚠ Symptom of Missing Clang
If you only install the runtime but not the headers, the host build will fail with a cryptic 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.

Step 05 of 09 · Install cargo-oxide

Clone the Repo and Install cargo-oxide

cargo-oxide is a Cargo subcommand that drives the entire build pipeline — running cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

Inside the repo (for trying examples):

git clone 
cd cuda-oxide

# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Outside the repo (for your own projects):

# Install globally from the git source
cargo install 
  --git  
  cargo-oxide

# On first run, cargo-oxide fetches and builds the codegen backend

Then verify all prerequisites are in place with the built-in health check:

cargo oxide doctor

ⓘ What doctor checks
It validates your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and the codegen backend binary. Fix any red items before proceeding.

Step 06 of 09 · Run Your First Kernel

Build and Run the vecadd Example

The canonical first example is vecadd — a vector addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.

# Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you will see:

✓ SUCCESS: All 1024 elements correct!

To see the full compilation pipeline — from Rust MIR through each Pliron dialect down to PTX — run:

# Print the full Rust MIR — dialect-mir — mem2reg — dialect-llvm — LLVM IR — PTX trace
cargo oxide pipeline vecadd

To debug with cuda-gdb:

cargo oxide debug vecadd --tui

ⓘ Output artifacts
A successful build produces two files: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host binary loads the PTX file via the CUDA driver at runtime.

Step 07 of 09 · Write a Kernel

Writing Your Own #[kernel] Function

A kernel function is annotated with #[kernel]. Use DisjointSlice for mutable outputs and &[T] for

read-only inputs. Retrieve the thread’s unique hardware index using thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by design, no `unsafe` required.
// DisjointSlice::get_mut() only accepts a ThreadIndex —
// an opaque type derived from hardware that guarantees each thread writes to a unique location.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}

ⓘ Tier 1 Safety — how it works
ThreadIndex is an opaque wrapper around usize that can only be constructed from hardware built-in registers (threadIdx, blockIdx, blockDim). Because every thread receives a distinct value, and DisjointSlice::get_mut() exclusively accepts a ThreadIndex, all writes are inherently free of data races — no unsafe code is needed anywhere in the kernel.

Step 08 of 09 · Launch from Host

Launching the Kernel from Host Code

Host and device code coexist in the same .rs file. The host side leverages CudaContext, DeviceBuffer, and the cuda_launch! macro to handle GPU memory management and kernel dispatch.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Set up the GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Transfer input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Launch the kernel — LaunchConfig automatically determines block and grid dimensions
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Copy the result back to the host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}

ⓘ What cuda_launch! does
It flattens the argument list — converting slices, scalar values, and captured closures into PTX kernel parameters — and dispatches the kernel on the specified stream. There’s no need to manually marshal arguments.

Step 09 of 09 · Next Steps

What to Explore Next

Your cuda-oxide environment is up and running. Here are the most valuable next steps, arranged by increasing complexity:

Generic kernels with monomorphization — check out the generic example (cargo oxide run generic) to see how fn scale generates separate PTX kernels for each concrete type.
Closures with captures — the host_closure example demonstrates how a move |x: f32| x * factor closure is automatically flattened and passed as PTX kernel parameters.
Async GPU execution — cuda_launch_async! produces a lazy DeviceOperation that runs on .sync() or .await. Refer to the async_mlp and async_vecadd examples.
Shared memory and warp intrinsics — these require scoped unsafe blocks with clearly documented safety contracts. See Tier 2 in the safety model documentation.
GEMM at Speed-of-Light — the gemm_sol example reaches 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
Blackwell tensor cores — the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.

ⓘ Known Limitation in v0.1.0
index_2d(stride) is currently documented as unsound — if threads within the same kernel pass different stride values, two threads can obtain &mut T references to the same element without any unsafe code being involved. Until this is resolved (by promoting stride to a type parameter), assign stride to a single let binding and reuse it consistently at every call site.

Full documentation: nvlabs.github.io/cuda-oxide · Source: github.com/NVlabs/cuda-oxide

Top Posts

Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes

Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility

Code-Wielding Catalyst: Forge an LLM Agent That Architects and Executes Software

NVIDIA’s cuda-oxide: A Rust-to-CUDA Compiler Backend That Translates SIMT GPU Kernels Straight to PTX

Code-Wielding Catalyst: Forge an LLM Agent That Architects and Executes Software

Speed, VRAM, Multi-GPU Smackdown: Unsloth, Axolotl, TRL, or LLaMA-Factory?

5 No-Cost Courses to Transform from AI Newbie to Pro

The System76 Thelio Mira: My Dream Linux Desktop Come True

Google’s Gemini 3.6 Flash: Slashing Enterprise Agent Token Costs

Stop ML Chaos: Your Blueprint for Experiment Order

Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes

Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility

Code-Wielding Catalyst: Forge an LLM Agent That Architects and Executes Software

Cisco’s Budget AI Guardians: Revolutionizing Source Code Security

Proof Of Node Doesn’t Change Bitcoin: The Immutable Truth

Charting the Vessel Storm: A Proteomic Blueprint for Vasculitis Remission

Migrate Your On-Prem ERP to Dynamics 365: A Cloud Transformation Journey

Supercharging Smart Homes: The Fibre Internet Revolution Behind IoT Awakening

Trending

Orchestrating Infinite Scale: Mastering Multi-Cluster Databases on Kubernetes

Grab the Galaxy Z Fold 8 or Z Flip 8 Early at Verizon—Zero Fees, Full Flexibility

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

NVIDIA’s cuda-oxide: A Rust-to-CUDA Compiler Backend That Translates SIMT GPU Kernels Straight to PTX

What You Need Before You Start

Set Up the Rust Nightly Toolchain

Install LLVM 21 with the NVPTX Backend

Install Clang 21 for the cuda-bindings Crate

Clone the Repo and Install cargo-oxide

Build and Run the vecadd Example

Writing Your Own #[kernel] Function

Launching the Kernel from Host Code

What to Explore Next

Related Posts