which have pervaded nearly every aspect of our daily lives are autoregressive decoder models. These models apply compute-heavy kernel operations to churn out tokens one by one in a manner that, at first glance, seems extremely inefficient. Given the enormous demand for generative AI, it is no surprise that extraordinary engineering effort is being invested in its optimization. Whether through custom CUDA kernels, CUDA Graphs, dedicated AI accelerators, or speculative sampling, any technique that reduces latency and/or cost by even a fraction of a percent is a win.
In this post, we demonstrate a technique for optimizing token generation in PyTorch using CUDA stream interleaving. While simple to implement, the method addresses a specific, often overlooked bottleneck and can lead to meaningful performance boosts. Although pipelining model execution using CUDA streams is common in AI systems engineering, we did not find any tutorial documenting the specific PyTorch-level application we describe here. If you find the technique useful, please be so kind as to reference this post.
To facilitate our discussion, we will use a simple GPT-2 PyTorch decoder model from HuggingFace's transformers (v5.1.0) library. We will run our experiments on an NVIDIA L40S GPU and PyTorch (2.10.0).
Disclaimer: The code we will share is intended for demonstrative purposes. Please do not rely on its accuracy or optimality. Please do not interpret our mentions of any library, platform, or service as an endorsement of its use.
Importantly, the value of the CUDA stream-based method we will discuss can vary greatly based on the details of your model and runtime environment. Please be sure to run your own benchmarks before integrating its use.
Our focus in this post is on PyTorch-native inference workloads, which remain extremely prevalent in development and test settings. However, it is important to note that for production environments, dedicated LLM inference libraries such as vLLM or NVIDIA TensorRT-LLM tend to deliver better performance and should be used whenever relevant.
A Toy GPT-2 Model
To simplify our discussion, we will use a GPT-2 decoder model from the HuggingFace transformers library and have it run autoregressively on a batch of empty prompts.
In the following code block, we initialize the model and define a naive token generation function that creates a batch of random token streams up to a given length.
import torch
from transformers import GPT2LMHeadModel, GPT2Config

torch.set_float32_matmul_precision('high')
DEVICE = "cuda"

# define the decoder model (randomly initialized from the GPT-2 config)
config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel(config).to(DEVICE).eval()


@torch.inference_mode()
def generate_sequence(model, max_seqlen, batch_size):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
    for i in range(max_seqlen):
        # re-run the model on the entire sequence generated so far
        outputs = model(all_tokens)
        # extract new token
        logits = outputs.logits[:, -1, :]
        new_tokens = torch.argmax(logits, dim=-1)
        # append new token to sequence
        all_tokens = torch.cat(
            [all_tokens, new_tokens.unsqueeze(-1)],
            dim=-1
        )
        finished |= (new_tokens == config.eos_token_id)
        stop_gpu = torch.all(finished)
        # checking stop condition
        if stop_gpu.item():
            print(f"All sequences finished at step {i+1}")
            break
    return all_tokens

Next, we define a simple benchmarking function which we use to measure the runtime performance and memory usage of our token generator in different scenarios.
import time, statistics


def benchmark(func, num_runs=10):
    # Warmup
    func()
    torch.cuda.synchronize()
    runtimes = []
    for _ in range(num_runs):
        # reset memory stats before each run
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        _ = func()
        torch.cuda.synchronize()
        end = time.perf_counter()
        runtimes.append(end - start)
    # Get memory allocator stats from last run
    mem_stats = torch.cuda.memory_stats()
    allocated_peak = mem_stats.get('allocated_bytes.all.peak', 0)
    reserved_peak = mem_stats.get('reserved_bytes.all.peak', 0)
    # memory reserved by the allocator but not handed out to tensors
    f_peak = reserved_peak - allocated_peak
    f_pct = (
        100 * f_peak / reserved_peak
        if reserved_peak > 0 else 0
    )
    print(f"\n{'='*60}")
    print("Runtime Results:")
    print(f" Mean: {statistics.mean(runtimes):.4f}s")
    print(f" Std: {statistics.stdev(runtimes):.4f}s")
    print(f" Min: {min(runtimes):.4f}s")
    print(f" Max: {max(runtimes):.4f}s")
    print("\nMemory Stats:")
    print(f" Allocated bytes (peak): {allocated_peak / 1e9:.3f} GB")
    print(f" Reserved bytes (peak): {reserved_peak / 1e9:.3f} GB")
    print(f" Fragmentation (peak): {f_peak / 1e9:.3f} GB ({f_pct:.1f}%)")
    print(f"{'='*60}\n")


batch_size = 32
for max_seqlen in [100, 200, 400]:
    print(
        f"Benchmarking generation with batch size {batch_size} "
        f"and max sequence length {max_seqlen}..."
    )
    benchmark(
        lambda: generate_sequence(
            model, max_seqlen=max_seqlen, batch_size=batch_size
        )
    )

In the table below we capture the results for a batch size of 32 and several different sequence lengths:
As the sequence length doubles, the runtime quadruples, appearing to follow a classic O(N²) scaling pattern. Additionally, high memory fragmentation points to severe strain on the CUDA memory allocator, which can result in frequent memory faults and degrade runtime performance. The fragmentation results from each step asking for slightly larger tensor allocations, a pattern which ends up leaving multiple pockets of unusable memory.
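One way to sanity-check a scaling claim of this kind is to estimate the exponent k in runtime ∝ N^k from two measurements. The snippet below is a minimal sketch; the helper name `scaling_exponent` and the timings passed to it are hypothetical placeholders, not the measurements from our benchmark.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in runtime ~ N**k from two (length, runtime) measurements."""
    if min(n1, t1, n2, t2) <= 0 or n1 == n2:
        raise ValueError("need two distinct positive lengths and positive runtimes")
    return math.log(t2 / t1) / math.log(n2 / n1)


# Hypothetical timings (seconds): the runtime quadruples when the length doubles
k = scaling_exponent(100, 1.0, 200, 4.0)
print(f"estimated exponent: {k:.2f}")
```

An exponent close to 2 is consistent with quadratic scaling, while a value close to 1 indicates linear scaling. With only two data points the estimate is sensitive to measurement noise, so it is best computed from the mean of several runs.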
Our first optimization, KV caching, addresses the runtime complexity of our decoder model.
KV Caching
Our naive generator is extremely inefficient: rather than storing and reusing the intermediate tensors from previous tokens, it recalculates the entire sequence at every step.
We address the computation inefficiency by using KV caching: we store and reuse the intermediate Key and Value tensors for previous tokens. KV caching reduces the runtime complexity of token generation from O(N²) to O(N).
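To make the difference concrete, the sketch below counts how many token positions are passed through the model with and without a cache. It is a simplified illustration: the function `positions_processed` is a hypothetical helper, and the count ignores the fact that the attention cost of each position also depends on the length of the context it attends to.

```python
def positions_processed(num_new_tokens, use_cache):
    """Count token positions fed through the model while generating
    num_new_tokens tokens from a single-token (BOS) prompt."""
    total = 0
    seq_len = 1  # the prompt holds only the BOS token
    for _ in range(num_new_tokens):
        # without a cache the whole sequence is re-run; with a cache only
        # the newest token is (the first step has one token either way)
        total += 1 if use_cache else seq_len
        seq_len += 1
    return total


for n in (100, 200, 400):
    naive = positions_processed(n, use_cache=False)
    cached = positions_processed(n, use_cache=True)
    print(f"N={n}: naive={naive}, cached={cached}")
```

The naive count grows as N(N+1)/2, while the cached count grows as N, which is why the benefit of caching widens with the sequence length.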
In the following code block, we utilize the transformers library's built-in support for KV caching to reprogram our token generation function to compute a single batch of tokens in each step.

@torch.inference_mode()
def generate_sequence(model, max_seqlen, batch_size, use_cache=False):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
    # past_key_values is used to store the cached key/values for each layer
    past_key_values = None
    for i in range(max_seqlen):
        # once a cache exists, only the newest token is fed to the model
        current_input = (
            all_tokens if past_key_values is None
            else all_tokens[:, -1:]
        )
        outputs = model(
            current_input,
            past_key_values=past_key_values,
            use_cache=use_cache
        )
        # update cache for next step
        past_key_values = outputs.past_key_values
        logits = outputs.logits[:, -1, :]
        new_tokens = torch.argmax(logits, dim=-1)
        # append new token to sequence
        all_tokens = torch.cat(
            [all_tokens, new_tokens.unsqueeze(-1)],
            dim=-1
        )
        finished |= (new_tokens == config.eos_token_id)
        stop_gpu = torch.all(finished)
        # checking stop condition
        if stop_gpu.item():
            print(f"All sequences finished at step {i+1}")
            break
    return all_tokens

The resulting performance numbers are captured in the following table:

The performance improvement is profound and, as expected, increases as a function of the sequence length.
Although significantly better than in our baseline experiment, the degree of memory fragmentation remains a concern. To address this we explore two techniques: expandable memory allocations and static KV caching.
Expandable CUDA Memory Allocations
To reduce CUDA memory fragmentation, we program PyTorch to use expandable memory segments. As of the time of this writing, this memory optimization is an experimental feature and should be used with caution. Please see the PyTorch documentation for details. To use the feature we set the following environment variable:

export PYTORCH_ALLOC_CONF="expandable_segments:True"

Rerunning our benchmark results in the following table:

Not only do we see a marked improvement in fragmentation, but we also get an additional (marginal) improvement in runtime performance.
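When exporting a shell variable is inconvenient (e.g., in a notebook), the same setting can be applied from Python. The sketch below assumes that the allocator reads its configuration when it is initialized, so the variable should be in place before then; setting it before `import torch` is the conservative choice. The helper name `enable_expandable_segments` is our own, and note that it overwrites any allocator options that were already set.

```python
import os
import sys


def enable_expandable_segments():
    """Request expandable segments through the environment (illustrative helper).

    Returns True if the variable was set before torch was imported."""
    set_early = "torch" not in sys.modules
    os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
    return set_early


if enable_expandable_segments():
    print("PYTORCH_ALLOC_CONF set before importing torch")
else:
    print("torch was already imported; the setting may not take effect")
```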
KV Caching With StaticCache
The default cache in HuggingFace is dynamic: it grows as the number of keys and values increases while generation progresses. HuggingFace supports a fixed-size cache, StaticCache, which pre-allocates a maximum cache size for the KV pairs and reduces strain on the CUDA memory allocator. The disadvantage of using StaticCache is that the full length of the cache participates in the attention computation at each token generation step, where irrelevant tokens are masked out. This results in a waste of computation that grows with the sequence length. For example, when generating a sequence of 400 tokens, the attention computation for every token runs over all 400 cache positions, including at the first step, when only one of them holds a real token.
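The size of this waste can be estimated with a simple count of the key positions that attention scores over the course of generation. The sketch below is a rough model under stated assumptions: it counts positions per layer and per head, ignores constant factors, and assumes the attention kernel does not skip masked positions. The function `attended_key_positions` is a hypothetical helper.

```python
def attended_key_positions(num_steps, static_cache_len=None):
    """Total key positions scored by attention over num_steps decode steps.

    Dynamic cache: step i (1-based) attends to the i tokens seen so far.
    Static cache: every step attends to all static_cache_len slots."""
    if static_cache_len is None:
        return sum(range(1, num_steps + 1))
    return num_steps * static_cache_len


for n in (100, 200, 400):
    dynamic = attended_key_positions(n)
    static = attended_key_positions(n, static_cache_len=n)
    print(f"N={n}: dynamic={dynamic}, static={static}, ratio={static / dynamic:.2f}")
```

Under this model, a static cache sized to the maximum sequence length roughly doubles the attention work relative to a dynamic cache. Whether this shows up in the measured runtime depends on how much of each step is spent in attention.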
In the code block below we enhance our sequence generator to support the use of a StaticCache:
from transformers import StaticCache


@torch.inference_mode()
def generate_sequence(
    model, max_seqlen, batch_size, use_cache=False, use_static_cache=False
):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
    # Initialize static cache if requested
    if use_cache and use_static_cache:
        past_key_values = StaticCache(
            config=config,
            max_batch_size=batch_size,
            max_cache_len=max_seqlen,
            device=DEVICE,
            dtype=model.dtype
        )
    else:
        past_key_values = None
    # Initialize cache position tracking for static cache
    cache_positions = torch.arange(max_seqlen, device=DEVICE)
    for i in range(max_seqlen):
        current_input = (
            all_tokens if past_key_values is None
            else all_tokens[:, -1:]
        )
        # slot of the static cache that the current token is written to
        cache_position = (
            cache_positions[i:i+1] if use_static_cache else None
        )
        outputs = model(
            current_input,
            past_key_values=past_key_values,
            cache_position=cache_position,
            use_cache=use_cache
        )
        # update cache for next step
        past_key_values = outputs.past_key_values
        logits = outputs.logits[:, -1, :]
        new_tokens = torch.argmax(logits, dim=-1)
        # append new token to sequence
        all_tokens = torch.cat(
            [all_tokens, new_tokens.unsqueeze(-1)],
            dim=-1
        )
        finished |= (new_tokens == config.eos_token_id)
        stop_gpu = torch.all(finished)
        # checking stop condition
        if stop_gpu.item():
            print(f"All sequences finished at step {i+1}")
            break
    return all_tokens

The updated results are captured below:

Using a fixed-sized cache greatly improves memory utilization, as indicated by the decrease in memory fragmentation. However, its impact on runtime performance is mixed: for 100 tokens it reduces performance compared to a dynamic cache, whereas for 200 and 400 tokens it boosts performance by 9% and 10%, respectively.
There are more advanced methods of implementing attention that optimize for memory utilization without the cost of wasted computation. In a previous post, Optimizing Transformer Models for Variable-Length Input Sequences, we covered some PyTorch techniques for computing attention sparsely to reduce computation waste. For production settings, libraries such as vLLM use PagedAttention for maximizing memory utilization. These methods are outside the scope of this post.
For more details on caching in HuggingFace, please see the caching strategies overview.
Model Compilation
One of the documented advantages of using a fixed-sized cache is that it allows for taking advantage of many just-in-time (JIT) optimizations.
In the following code block we apply our benchmark to a PyTorch-compiled version of our decoder model:
batch_size = 32
max_seqlen = 100
model = torch.compile(model)
benchmark(
    lambda: generate_sequence(
        model,
        max_seqlen=max_seqlen,
        batch_size=batch_size,
        use_cache=True,
        use_static_cache=True
    )
)

Model compilation results in an additional boost to runtime performance, as shown in the table below:

Note that we can apply model compilation when using dynamic caching, as well. However, torch.compile provides the best results when the computation graph consists of fixed-sized tensors (e.g., see here for more details).
The Performance Penalty of Early Stopping
An integral part of common token generators is checking for the end-of-sequence (EOS) at the end of each step. Without this test, token generators would always run for max_seqlen, even if all the sequences in the batch have ended. This could result in considerable computation waste and unnecessary latency, especially when common sequence lengths are much shorter than the maximum length. In the case of our toy experiment, we wait for all the sequences in the batch to complete and then discontinue token generation. Production-grade implementations will commonly perform continuous batching, replacing completed sequences with new prompts from the input queue.
finished |= (new_tokens == config.eos_token_id)
stop_gpu = torch.all(finished)
# checking stop condition
if stop_gpu.item():
    print(f"All sequences finished at step {i+1}")
    break

Importantly, the .item() call on the stop_gpu tensor triggers a blocking host-device synchronization event. More specifically, in order to evaluate the conditional if statement, the CPU must wait for the GPU to complete its computation and copy the contents of the tensor to host memory. While the CPU waits, it is blocked from executing the next step of the token generation loop, or more accurately, it is blocked from loading the next computation kernels onto the GPU.
To measure the impact of the stopping condition on runtime performance, we add instrumentation for performance profiling with NVIDIA Nsight™ Systems (nsys) using the torch.cuda.profiler and nvtx (v0.2.14) APIs. (See our recent post for more details on performance profiling with nsys).
import nvtx
from torch.cuda import profiler


@torch.inference_mode()
def generate_sequence(
    model, max_seqlen, batch_size, use_cache=False, use_static_cache=False
):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
    # Initialize static cache if requested
    if use_cache and use_static_cache:
        past_key_values = StaticCache(
            config=config,
            max_batch_size=batch_size,
            max_cache_len=max_seqlen,
            device=DEVICE,
            dtype=model.dtype
        )
    else:
        past_key_values = None
    # Initialize cache position tracking for static cache
    cache_positions = torch.arange(max_seqlen, device=DEVICE)
    for i in range(max_seqlen):
        if i == 30:
            # start nsys profiler
            torch.cuda.synchronize()
            profiler.start()
        elif i == 50:
            # stop nsys profiler
            torch.cuda.synchronize()
            profiler.stop()
        with nvtx.annotate(f"Step {i+1}", color="blue"):
            with nvtx.annotate("Model Forward", color="green"):
                current_input = (
                    all_tokens if past_key_values is None
                    else all_tokens[:, -1:]
                )
                cache_position = (
                    cache_positions[i:i+1] if use_static_cache else None
                )
                outputs = model(
                    current_input,
                    past_key_values=past_key_values,
                    cache_position=cache_position,
                    use_cache=use_cache
                )
            past_key_values = outputs.past_key_values
            logits = outputs.logits[:, -1, :]
            new_tokens = torch.argmax(logits, dim=-1)
            all_tokens = torch.cat(
                [all_tokens, new_tokens.unsqueeze(-1)],
                dim=-1
            )
            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)
            with nvtx.annotate("Check Stop Condition", color="red"):
                # checking stop condition
                if stop_gpu.item():
                    print(f"All sequences finished at step {i+1}")
                    break
    return all_tokens

We run our script using the cudaProfilerApi option to start and stop the profiler programmatically. Please see the official documentation for full details on profiling from the nsys CLI.
nsys profile \
  --capture-range=cudaProfilerApi \
  --trace=cuda,nvtx,osrt \
  --output=baseline \
  python train.py

The following trace, captured for a batch size of 16 and sequence length of 100, shows the GPU idling for about 110 microseconds in between steps, an eternity in the context of high-performance GPU workloads. This is a direct result of the synchronization event triggered by the EOS test.
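To put the idle gap in perspective, the sketch below accumulates it over a full generation run. It is a toy timeline under stated assumptions: the 110 microsecond gap is the figure observed in the trace, whereas the per-step GPU compute time of 1000 microseconds is a hypothetical placeholder and the function `blocking_loop_timeline` is our own illustrative helper.

```python
def blocking_loop_timeline(num_steps, gap_us, compute_us):
    """Toy timeline of a generation loop that synchronizes on every step.

    gap_us: time the GPU sits idle between finishing one step and
            receiving the first kernel of the next step.
    compute_us: time the GPU spends computing a single step."""
    total_us = 0.0
    idle_us = 0.0
    for _ in range(num_steps):
        idle_us += gap_us
        total_us += gap_us + compute_us
    return total_us, idle_us


# compute_us is a hypothetical value chosen for illustration
total_us, idle_us = blocking_loop_timeline(num_steps=100, gap_us=110, compute_us=1000)
print(
    f"total: {total_us / 1000:.1f} ms, GPU idle: {idle_us / 1000:.1f} ms "
    f"({100 * idle_us / total_us:.1f}% of the run)"
)
```

Over 100 steps the gaps alone add up to 11 milliseconds of idle GPU time. The shorter the per-step compute time, the larger the share of the run that the gaps consume.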

In production-grade implementations such synchronization issues are avoided by some combination of 1) use of lower-level (e.g., C/C++) code that avoids the limitations of the Python interpreter, 2) using CUDA graphs to reduce the overhead of kernel loading, 3) moving conditional checks onto the GPU using conditional nodes, and 4) continuously and asynchronously preparing subsequent requests while the EOS check is in progress.
In the next section, we demonstrate a technique for hiding the overhead of the host-device synchronization in PyTorch using CUDA streams.
A CUDA Stream Optimization
A CUDA stream is a linear sequence of operations (kernels, memory copies, etc.) that execute in order on the GPU. While operations within a single stream are guaranteed to execute sequentially, operations in different streams can execute concurrently or overlap.
In previous posts (e.g., here and here) we demonstrated the use of CUDA streams in pipelining common AI/ML workloads, e.g., executing a model on batch N while preparing batch N+1. In this post we will use CUDA streams to enable the CPU to load the GPU kernels of step N+1 before checking the stopping criteria of step N. Contrary to our previous demonstrations of CUDA streams, our current example will not necessarily involve concurrent GPU kernel execution.
We implement an alternative token generation function that interleaves two CUDA streams, running the following operations iteratively:
1. Program stream i%2 to: (A) wait for stream (i-1)%2 to complete its generation of token i-1, (B) use the updated tensors to calculate token i, (C) run the EOS test for token i on the GPU, and (D) perform a (non-blocking) copy of the EOS test result to pinned memory on the CPU.
2. On the default CUDA stream, wait for stream (i-1)%2 to complete its generation of token i-1.
3. On the default CUDA stream, check if the stopping criteria for token i-1 have been met. If so, halt the generator and return. Otherwise, increment i and return to step 1.
Whereas previously the initialization of token i generation was blocked by the EOS test on token i-1, the use of CUDA streams allows us to program the generation of token i before we check the result of the EOS test on token i-1. In practice, the EOS test for token i-1 on the CPU runs while the GPU is computing token i.
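The ordering of host-side actions described above can be sketched without a GPU. The following pure-Python illustration only lists the order in which steps are launched and checked; it performs no CUDA work. The function `pingpong_schedule` and its `stop_after` parameter (the 0-based step at which the EOS test first succeeds) are hypothetical names introduced for this example, and the check is skipped at step 0, where there is no previous step to inspect.

```python
def pingpong_schedule(num_steps, stop_after=None):
    """List host-side actions for the two-stream interleaving loop.

    stop_after is the 0-based step at which the EOS test first
    succeeds; None means it never does."""
    events = []
    for i in range(num_steps):
        curr_idx, prev_idx = i % 2, (i + 1) % 2
        # step i is enqueued first ...
        events.append(f"launch step {i} on stream {curr_idx}")
        # ... and only then is the result of step i-1 inspected
        if i > 0:
            events.append(f"check EOS of step {i - 1} from stream {prev_idx}")
            if stop_after is not None and i - 1 >= stop_after:
                events.append(f"stop: step {i} was launched speculatively")
                break
    return events


for event in pingpong_schedule(num_steps=5, stop_after=2):
    print(event)
```

The schedule makes the cost of the approach visible: when the stopping condition is met at step i-1, step i has already been launched, so one extra step of computation is spent.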
@torch.inference_mode()
def generate_sequence_pipelined(
    model,
    max_seqlen,
    batch_size,
    use_cache=False,
    use_static_cache=False
):
    # Initialize prompts with BOS token
    all_tokens = torch.full(
        (batch_size, 1),
        config.bos_token_id,
        device=DEVICE,
        dtype=torch.long
    )
    finished = torch.zeros(batch_size, device=DEVICE, dtype=torch.bool)
    past_key_values = None
    # Initialize static cache if requested
    if use_cache and use_static_cache:
        past_key_values = StaticCache(
            config=config,
            max_batch_size=batch_size,
            max_cache_len=max_seqlen,
            device=DEVICE,
            dtype=model.dtype
        )
    # Initialize cache position tracking for static cache
    cache_positions = torch.arange(max_seqlen, device=DEVICE)
    # Dual streams for pipelining
    streams = [torch.cuda.Stream(), torch.cuda.Stream()]
    # Pinned host buffers that receive the stop flag of each stream
    stop_host = [
        torch.tensor(False, pin_memory=True),
        torch.tensor(False, pin_memory=True)
    ]
    for i in range(max_seqlen):
        curr_idx, prev_idx = i % 2, (i+1) % 2
        curr_s, prev_s = streams[curr_idx], streams[prev_idx]
        # Launch iteration i in current stream
        with torch.cuda.stream(curr_s):
            # program stream to wait for previous stream to complete
            curr_s.wait_stream(prev_s)
            current_input = (
                all_tokens if past_key_values is None
                else all_tokens[:, -1:]
            )
            cache_position = (
                cache_positions[i:i+1] if use_static_cache else None
            )
            outputs = model(
                current_input,
                past_key_values=past_key_values,
                cache_position=cache_position,
                use_cache=use_cache
            )
            past_key_values = outputs.past_key_values
            logits = outputs.logits[:, -1, :]
            new_tokens = torch.argmax(logits, dim=-1)
            all_tokens = torch.cat(
                [all_tokens, new_tokens.unsqueeze(-1)],
                dim=-1
            )
            finished |= (new_tokens == config.eos_token_id)
            stop_gpu = torch.all(finished)
            # asynchronous device-to-host copy of the stop flag
            stop_host[curr_idx].copy_(stop_gpu, non_blocking=True)
        # Check previous iteration's stop signal.
        # Note: wait_stream only orders GPU work; it does not block the host.
        # A stricter variant would call prev_s.synchronize() (or wait on a
        # CUDA event) before reading the pinned flag.
        torch.cuda.current_stream().wait_stream(prev_s)
        if stop_host[prev_idx].item():
            print(f"All sequences finished at step {i}")
            break
    return all_tokens

The image below captures the nsys trace for our new token generator:

In the CUDA section of the trace we can see the use of two CUDA streams, with token generation being passed back and forth in a kind of ping-pong effect: one stream generates all of the odd tokens and the second all of the even tokens. The CPU is about half a step ahead of the GPU, allowing it to program step i while the GPU is computing step i-1. The CPU-side EOS stop-check of step i-1 (in red) occurs after step i is fully programmed (and has started running). Most importantly, we now find the GPU utilization to be consistent: the idling we saw before is gone.
The CUDA stream interleaving results in an additional performance boost, as shown in the table below:

We would expect the benefit of the ping-pong solution we have implemented to be impacted by the ratio between the GPU idle time (i.e., the overhead of kernel loading) and the kernel computation time. To test this, we fix the sequence length at 100 and rerun the benchmark for different batch sizes:

As expected, the highest performance gain, 11.6%, occurs when the batch size is smallest and the kernel computation load is at its lowest. As the kernel compute increases, the ratio of kernel loading to kernel compute time decreases, as does the impact of CUDA stream interleaving.
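This trend can be captured with a simple model of the best-case gain. The sketch below is a back-of-the-envelope estimate under stated assumptions: the idle gap is hidden completely, the streams add no overhead of their own, and the gain is expressed as a reduction in per-step runtime. The function `interleaving_gain_pct` is a hypothetical helper, and the compute times in the loop are placeholders rather than measurements.

```python
def interleaving_gain_pct(gap_us, compute_us):
    """Upper bound on the runtime reduction (in percent) from hiding a
    per-step idle gap of gap_us when each step computes for compute_us."""
    if gap_us < 0 or compute_us <= 0:
        raise ValueError("gap_us must be >= 0 and compute_us must be > 0")
    return 100.0 * gap_us / (gap_us + compute_us)


# 110 us is the gap seen in the earlier trace; compute times are hypothetical
for compute_us in (500, 1000, 2000, 4000):
    gain = interleaving_gain_pct(gap_us=110, compute_us=compute_us)
    print(f"compute={compute_us} us -> at most {gain:.1f}% less runtime")
```

The bound shrinks as the compute time per step grows, matching the direction of the measured results. Actual gains will be lower than the bound because of the overhead of the streams themselves.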
Note that there is some overhead to the use of CUDA streams. This can be demonstrated by comparing our interleaving solution to a token generator that skips the EOS test altogether:

The Potential Performance Pitfalls of Using CUDA Streams
CUDA streams should be used with extreme caution. When using the default stream we can rely on PyTorch to perform any necessary synchronization when data is moved around. However, when using CUDA streams, we must ensure appropriate synchronization explicitly. In particular, we must ensure appropriate data transfer between the streams. Otherwise, we may experience CUDA errors (e.g., "device-side assert triggered") if we are lucky. If we are less lucky, we may experience data corruption without even knowing it. See the PyTorch CUDA stream documentation for more details on appropriate use.
For AI/ML workloads with large CUDA memory utilization, such as LLMs, another consideration is memory utilization. The PyTorch caching allocator manages memory on a per-stream basis; using multiple streams can lead to increased memory reservation and fragmentation. These could result in increased memory faults, which could overshadow the potential gains from the use of streams.
Results
In the table below we summarize the runtime results of applying static caching, compilation, and pipelining on a batch of 32 sequences and a maximum sequence length of 100. The results are sorted in increasing order of performance:

In the case of our toy GPT-2 model, the best results, nearly five times the baseline performance, are achieved when applying PyTorch compilation and the CUDA stream interleaving method discussed in this post. However, as we have seen, the impact of CUDA interleaving could vary greatly based on the properties of the workload and runtime environment, particularly on the ratio between the kernel loading time and the kernel compute time. Please be sure to run your own benchmarks before adopting this method.
Summary
In high-performance AI engineering, any hint of GPU under-utilization presents an opportunity for optimization. One of the primary optimization tools on NVIDIA GPUs is CUDA streams. In this post, we demonstrated their use in fixing the idle GPU time that results from the host-device synchronization associated with early stopping in PyTorch-native autoregressive token generation. By interleaving CUDA streams in a "ping-pong" pattern, we successfully hid the latency imposed by the EOS check, which resulted in a meaningful increase in the workload's throughput. By combining this technique with the well-known methods of model compilation and static caching, we can maximize the performance of PyTorch-native inference.



