DeepStream provides a production-ready pipeline for multi-stream video analytics: hardware-accelerated decoding, tracking, on-screen display, and message brokering, all wired through GStreamer. For standard detection models exported to TensorRT, nvinfer handles everything.
However, the common case has its limits. Vision-language models, custom post-processing, rotated bounding boxes, or the need to hot-swap models at runtime: these are situations where nvinfer’s built-in assumptions fall apart. Sometimes you have a mature PyTorch inference stack that your team has carefully tuned, and you want DeepStream to call that rather than reimplementing it in a config file.
It’s worth noting that for YOLO-family models specifically, DeepStream-Yolo by Marcos Luciano has already done excellent work implementing custom post-processing in C++. If C++ is an option for you, start there. This article takes a different approach: achieving the same result entirely in Python, using a custom GStreamer plugin with pyservicemaker — without sacrificing throughput.
The key insight that makes this possible: downstream elements like nvtracker, nvdsosd, and nvmsgconv don’t care which element produced the detection metadata. Write to DeepStream’s metadata structure correctly, and the rest of the ecosystem works as if nvinfer was never in the picture.
DeepStream Metadata
Every buffer flowing through a DeepStream pipeline carries more than just pixel data. From the moment frames pass through nvstreammux, each GstBuffer has an NvDsBatchMeta structure attached to it. The hierarchy is straightforward and can be found in the official documentation.
NvDsBatchMeta
├── NvDsUserMeta (batch-level custom metadata)
└── NvDsFrameMeta (one per source stream)
    ├── NvDsUserMeta (frame-level custom metadata)
    └── NvDsObjectMeta (one per detected object)
        ├── NvDsClassifierMeta
        └── NvDsUserMeta (object-level custom metadata)
NvDsBatchMeta describes the entire batch. Each NvDsFrameMeta corresponds to one source stream and carries frame-level information such as the source ID and frame number. Each NvDsObjectMeta represents a single detection, meaning that when our plugin writes detections, we’ll create an NvDsObjectMeta for each one.
The critical thing to understand is that none of this is owned by nvinfer. It’s a shared data contract. Any GStreamer element in the pipeline can read from it, write to it, or both:
- nvtracker reads object bounding boxes and writes tracking IDs.
- nvdsosd reads boxes and labels to draw overlays.
- nvmsgconv reads the whole structure to produce message payloads.
Our custom plugin will simply write detections into this structure the same way nvinfer would, and everything downstream picks them up without modification. One important constraint worth understanding before we write any code: NvDsObjectMeta instances cannot be constructed directly from Python. Attempting to instantiate the class raises a No constructor defined! error at runtime.
The reason is architectural. DeepStream manages its metadata objects through memory pools — pre-allocated blocks that get recycled across frames to avoid the overhead of repeated heap allocation and deallocation in a high-throughput pipeline. These pools are owned by NvDsBatchMeta and live on the C side of the boundary. The Python bindings expose access to those pools, but deliberately don’t expose a Python-side constructor, because creating an NvDsObjectMeta outside the pool would bypass the lifecycle management that keeps DeepStream’s memory usage predictable. The correct way to get one is to ask the batch for it: batch_meta.acquire_object_meta(), which hands you a pre-allocated instance from the pool. When the frame is done, DeepStream returns it to the pool automatically.
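To make the pooling idea concrete, here is a toy pool in plain Python. It only illustrates the acquire-and-recycle pattern described above: ToyObjectMeta, ToyPool, and their methods are hypothetical names chosen for this sketch and do not mirror DeepStream's actual C implementation.

```python
class ToyObjectMeta:
    """Stand-in for a pooled metadata record (hypothetical, not the DeepStream type)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the fields so a recycled instance never leaks stale data.
        self.class_id = -1
        self.confidence = 0.0


class ToyPool:
    """Pre-allocates a fixed number of records and recycles them."""

    def __init__(self, size):
        self._free = [ToyObjectMeta() for _ in range(size)]

    def acquire(self):
        if not self._free:
            raise RuntimeError("pool exhausted")
        return self._free.pop()

    def release(self, obj):
        obj.reset()
        self._free.append(obj)


pool = ToyPool(size=2)
first = pool.acquire()
first.class_id = 7
pool.release(first)      # the record goes back to the pool ...
second = pool.acquire()  # ... and the same instance is handed out again
```

Nothing is allocated after the pool is built, which is the property that matters in a high-throughput pipeline. It also shows why a record constructed outside the pool would be a problem: the pool would have no way to reclaim it.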
The Python Bridge: pyservicemaker
To interact with DeepStream’s metadata from Python, we’ll use pyservicemaker, NVIDIA’s current, supported Python SDK for DeepStream. The official documentation covers the basics of pipelines and flows, but stops short of showing how to write and attach metadata from a custom inference element. That’s the gap this article fills.
The key abstraction is BatchMetadataOperator. Subclassing it and implementing handle_metadata(batch_meta) gives you access to the full NvDsBatchMeta for every buffer flowing through the pipeline. From there, iterating frames is as simple as using batch_meta.frame_items and attaching a detection object.
pyservicemaker also provides a Buffer wrapper around Gst.Buffer that exposes batch_meta directly and, importantly, an extract(batch_id) method that returns a DLPack handle to each frame’s GPU memory. That’s what makes zero-copy inference possible — you can hand the frame straight to TensorRT without ever leaving the GPU.
Rather than using BatchMetadataOperator standalone via a probe, we’ll fold the same pattern directly into our custom plugin’s do_transform_ip method, which gives us control over the element’s lifecycle, properties, and caps negotiation alongside the metadata access. But first, we need to build that plugin.
A Discoverable Python GStreamer Plugin
GStreamer discovers plugins at runtime by scanning directories listed in GST_PLUGIN_PATH. For Python plugins specifically, it looks inside a python/ subdirectory within each of those paths. That means your plugin is just a .py file dropped in the right place — no compilation, no CMake, no shared library. The tradeoff is that the registration pattern is strict, and getting it wrong produces silent failures that are genuinely hard to debug.
$GST_PLUGIN_PATH/
└── python/
    └── gstexampleplugin.py # your plugin
Set GST_PLUGIN_PATH to point at the parent directory, and GStreamer will find python/gstexampleplugin.py automatically on the next pipeline run.
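Because a misplaced file fails silently, it can help to check the layout before involving GStreamer at all. The helper below is a small sketch that lists the .py files sitting in a python/ subdirectory of each GST_PLUGIN_PATH entry. The function name find_python_plugins is made up for this example, and it only checks the file layout; it does not verify that GStreamer's Python plugin loader is installed or that the file registers correctly.

```python
import os
from pathlib import Path


def find_python_plugins(plugin_path):
    """Return .py files under the python/ subdirectory of each search path entry."""
    found = []
    for entry in plugin_path.split(os.pathsep):
        if not entry:
            continue
        python_dir = Path(entry) / "python"
        if python_dir.is_dir():
            found.extend(sorted(python_dir.glob("*.py")))
    return found


if __name__ == "__main__":
    # Print every candidate plugin file GStreamer's Python loader could scan.
    for plugin in find_python_plugins(os.environ.get("GST_PLUGIN_PATH", "")):
        print(plugin)
```

If your plugin file does not show up in this list, the problem is the directory layout or the environment variable rather than the plugin code.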
The Plugin Skeleton
Here’s the minimal skeleton for a passthrough inference element: it receives batched video buffers, runs inference, attaches metadata, and passes the buffer downstream unmodified.
import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstBase', '1.0')
from gi.repository import Gst, GstBase, GObject
import torch
from pyservicemaker import Buffer

GST_PLUGIN_NAME = 'gstexampleplugin'
Gst.init(None)


class GstExamplePlugin(GstBase.BaseTransform):
    # --- Plugin metadata ---
    __gstmetadata__ = (
        'GstExamplePlugin',          # name
        'Filter/Effect/Video',       # classification
        'Custom inference element',  # description
        'Your Name'                  # author
    )

    # --- Pad templates: RGB frames in NVIDIA (NVMM) memory on both pads ---
    src_format = Gst.Caps.from_string(
        "video/x-raw(memory:NVMM), format=RGB, "
        "width=(int)[ 1, 2147483647 ], height=(int)[ 1, 2147483647 ], "
        "framerate=(fraction)[ 0/1, 2147483647/1 ]"
    )
    sink_format = Gst.Caps.from_string(
        "video/x-raw(memory:NVMM), format=RGB, "
        "width=(int)[ 1, 2147483647 ], height=(int)[ 1, 2147483647 ], "
        "framerate=(fraction)[ 0/1, 2147483647/1 ]"
    )
    src_pad_template = Gst.PadTemplate.new(
        "src", Gst.PadDirection.SRC, Gst.PadPresence.ALWAYS, src_format
    )
    sink_pad_template = Gst.PadTemplate.new(
        "sink", Gst.PadDirection.SINK, Gst.PadPresence.ALWAYS, sink_format
    )
    __gsttemplates__ = (src_pad_template, sink_pad_template)

    # --- GObject properties, settable from gst-launch or pipeline code ---
    __gproperties__ = {
        'model-engine': (
            str,
            'TensorRT engine path',
            'Path to the .engine file',
            '',
            GObject.ParamFlags.READWRITE
        ),
        'confidence-threshold': (
            float,
            'Confidence threshold',
            'Minimum confidence to attach a detection',
            0.0, 1.0, 0.5,
            GObject.ParamFlags.READWRITE
        ),
    }

    def __init__(self):
        super().__init__()
        self.model_engine = ''
        self.confidence_threshold = 0.5
        self.engine = None

    def do_get_property(self, prop):
        if prop.name == 'model-engine':
            return self.model_engine
        elif prop.name == 'confidence-threshold':
            return self.confidence_threshold
        raise AttributeError(f'unknown property {prop.name}')

    def do_set_property(self, prop, value):
        if prop.name == 'model-engine':
            self.model_engine = value
        elif prop.name == 'confidence-threshold':
            self.confidence_threshold = value
        else:
            raise AttributeError(f'unknown property {prop.name}')

    def do_start(self):
        # Load your TensorRT engine here. load_engine is a placeholder
        # that you need to implement for your own inference stack.
        self.engine = load_engine(self.model_engine)
        return True

    def do_transform_ip(self, gst_buffer: Gst.Buffer) -> Gst.FlowReturn:
        """In-place transform: attach metadata, pass buffer unchanged."""
        buffer = Buffer(gst_buffer)
        batch_meta = buffer.batch_meta

        # Wrap each frame's GPU memory as a torch tensor via DLPack.
        frames = []
        for frame_meta in batch_meta.frame_items:
            t = torch.utils.dlpack.from_dlpack(buffer.extract(frame_meta.batch_id))
            frames.append(t)
        batch = torch.stack(frames, dim=0)

        # Run your model inference
        results = self.engine(batch)

        # Attach the results to each frame. Detections and segmentations go
        # into object_meta; other outputs can be attached as user_meta.
        # The loop below is pseudocode that depends on your inference output.
        # It assumes results holds one sequence of detections per frame,
        # in the same order as frame_items.
        for frame_meta, detections in zip(batch_meta.frame_items, results):
            for det in detections:
                obj = batch_meta.acquire_object_meta()
                # Fill the obj with each detection
                ...
                frame_meta.append(obj)
        return Gst.FlowReturn.OK


# --- Registration ---
GObject.type_register(GstExamplePlugin)
__gstelementfactory__ = (GST_PLUGIN_NAME, Gst.Rank.NONE, GstExamplePlugin)
A few important details about this basic structure:
GstBase.BaseTransform is the ideal base class for an in-place filter — one that takes a buffer, modifies it (by adding metadata), and sends it downstream. We override do_transform_ip instead of do_transform since we aren’t creating a new output buffer.
__gstmetadata__ and __gsttemplates__ are mandatory. GStreamer will not register the element without them. The caps string video/x-raw(memory:NVMM) informs GStreamer that this element handles NVIDIA memory, which is critical for keeping data on the GPU within a DeepStream pipeline.
__gproperties__ makes model-engine and confidence-threshold available as native GStreamer properties, allowing you to configure them via a gst-launch command or from Python pipeline code without modifying the source.
The final two lines are necessary for registration: GObject.type_register registers the class with the GObject type system, and __gstelementfactory__ tells GStreamer which element name to expose and which class to create.
Checking the plugin. After placing the file and clearing the GStreamer registry cache (by default under ~/.cache/gstreamer-1.0/), confirm registration by running:
GST_PLUGIN_PATH=/path/to/your/plugins gst-inspect-1.0 gstexampleplugin
You should see the element metadata, pad templates, and both properties displayed. If they appear, GStreamer recognizes your plugin and you can integrate it into a pipeline.
End-to-End Inference Example Using Ultralytics
With the plugin structure ready, it’s time to implement the inference logic. The complete working code is provided as a GitHub Gist. Once the plugin file is in place on GST_PLUGIN_PATH, you can inspect it as shown earlier or run it in a pipeline. Below is a simple example that runs inference and reports the FPS:
gst-launch-1.0 -v \
  nvstreammux name=m width=1280 height=720 batch-size=1 \
    batched-push-timeout=33000 ! \
  nvvideoconvert nvbuf-memory-type=0 ! \
  'video/x-raw(memory:NVMM), format=RGB' ! \
  gstyoloplugin model-path=/path/to/yolo26s.engine ! \
  fpsdisplaysink text-overlay=false silent=false sync=false \
    video-sink=fakesink \
  uridecodebin uri=file:///path/to/video.mp4 ! m.sink_0
Reviewing the Code
Compatibility Issue
If you’ve looked at the code, you may have noticed that we are overriding the tuple object, but only within the ultralytics.nn.backends.tensorrt module, since that is where the issue occurs. There is a known compatibility problem between the TensorRT Python bindings and the GStreamer Python wrapper framework (PyGObject) that can cause your pipeline to crash with the well-known error “Segmentation fault (core dumped).” This is why the following code snippet was needed to maintain the expected behavior:
import ultralytics.nn.backends.tensorrt as trt_backend

_original_tuple = tuple

def safe_tuple(obj):
    if "tensorrt" in type(obj).__module__ and type(obj).__name__ == "Dims":
        return _original_tuple(obj[i] for i in range(len(obj)))
    return _original_tuple(obj)

trt_backend.tuple = safe_tuple
This swaps the tuple reference inside the Ultralytics backend’s namespace at runtime with a version that uses index-based access for Dims objects, leaving all other behavior unchanged. It’s not a clean solution, but it’s precise and must be executed at import time, before any model is loaded.
The Inference Loop
The inference loop itself is fairly simple:
- Extract the frames from the buffers
- Preprocess + inference
- Link the results to the object metadata of each frame if the downstream pipeline elements are DeepStream plugins.
Below is a code snippet demonstrating zero-copy data transfer using DLPack:
frames = []
for frame_meta in batch_meta.frame_items:
    t = torch.utils.dlpack.from_dlpack(buffer.extract(frame_meta.batch_id))
    frames.append(t)
batch = torch.stack(frames, dim=0)
Preparing the Input Data
YOLO models, when receiving a torch.Tensor, require a fixed input shape of (N, 3, 640, 640) as specified in the documentation. However, frames output by nvstreammux will match the resolution of your source video. The solution used is letterboxing: resizing the frame to fit within the target dimensions while maintaining the original aspect ratio, then filling the remaining space with padding. The key advantage is that this entire process can be performed on the GPU, across the entire batch simultaneously, without any need to access CPU memory.
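The geometry behind letterboxing is small enough to sketch in plain Python. The two helpers below compute the scale and padding for a given source resolution and map a box from the letterboxed image back to source coordinates. They are an illustration rather than the code from the Gist: the function names are made up for this example, centered padding is an assumption, and the actual resize and pad would be done on the GPU with batched tensor operations.

```python
def letterbox_params(src_w, src_h, dst_w=640, dst_h=640):
    """Return (scale, pad_x, pad_y) for an aspect-preserving resize into dst."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (dst_w - new_w) / 2  # padding added on the left (assumes centering)
    pad_y = (dst_h - new_h) / 2  # padding added on the top (assumes centering)
    return scale, pad_x, pad_y


def to_source_coords(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from letterboxed space back to the source frame."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )


# A 1280x720 frame scales by 0.5 to 640x360, leaving 140 px of padding
# above and below the image.
scale, pad_x, pad_y = letterbox_params(1280, 720)
box = to_source_coords((320.0, 320.0, 480.0, 400.0), scale, pad_x, pad_y)
```

The inverse mapping matters because downstream elements expect boxes in the coordinate space of the original frame, not of the 640x640 model input.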
With frame extraction, letterboxing, inference, and coordinate mapping all executed on the GPU within a single do_transform_ip call, the plugin functions identically to nvinfer from the perspective of all downstream components, while offering the full flexibility of a Python-based inference stack underneath.
From this point onward, the rest of the DeepStream pipeline handles the remaining tasks: nvtracker assigns tracking IDs, nvdsosd renders visual overlays, and nvmsgconv serializes the output payloads.
Key Takeaways and Future Directions
If you’ve made it this far, you now have a functional approach for substituting nvinfer with your own custom Python inference element, and more importantly, you understand the reasoning behind each design decision.
This approach is broadly applicable. Everything covered here (the plugin framework, batched preprocessing, and metadata linking) is independent of any specific model. Replacing Ultralytics YOLO with Roboflow’s rfdetr is a simple swap, and the GStreamer and pyservicemaker infrastructure remains unchanged. The same applies to more complex architectures: NVIDIA’s deepstream_reference_apps repository contains a practical example of integrating a Vision-Language Model via vLLM using this exact plugin method, which is highly recommended for anyone looking to move beyond object detection into advanced video understanding.
The complete plugin code is available as a GitHub Gist. If you develop something based on it, whether a different model, a multi-stream configuration, or a VLM integration, I’d love to learn about your experience. Happy coding!



