Trajectory’s concurrent multi-LoRA architecture delivers a 2.81× boost in experiment throughput compared to traditional single-tenant RL training, with all implementation code publicly available in the NovaSky-AI/SkyRL GitHub repository.
Language models typically improve in discrete jumps. Teams accumulate data, run training cycles, and deploy updated versions, a process that stretches across months and often produces unpredictable shifts in how the model behaves for end users. Trajectory aims to replace this stop-and-go pattern with smoother, ongoing learning.
The Trajectory team released a detailed field report explaining their methodology. They engineered a concurrent, multi-LoRA training infrastructure designed for workloads that require continuous updates. The project was carried out in collaboration with UC Berkeley Sky Lab and Anyscale. All training code has been open-sourced in the NovaSky-AI/SkyRL repository.
The outcome is a 2.81× improvement in end-to-end experiment throughput when measured against a single-tenant training setup. Trajectory reports no degradation in training rewards across any of its experiments.
Understanding Multi-LoRA Training
Ongoing learning depends on models that can absorb feedback from active usage and real production interactions. For instance, a coding assistant could pick up on engineering conventions as developers fix its outputs. Similarly, a customer support agent could get better at handling tricky tickets as human operators step in during difficult situations.
Most existing training systems still follow a linear workflow. Teams allocate GPU resources, load the model, run a single training job, and then shut everything down. Continuous learning flips this workflow on its head. When live production interactions double as training data, the training process itself becomes embedded within a running system.
Contemporary RL training comes down to three fundamental operations. The Sampler creates trajectory sequences from the current policy model. The Trainer calculates gradients and adjusts the policy weights. Parameter synchronization propagates the updated weights back out to inference workers.
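The three operations can be sketched as a toy loop. This is an illustrative simplification, not SkyRL code: the scalar `Policy`, the quadratic toy reward, and the function names `sample`, `train`, and `sync` are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Toy policy: a single scalar weight plus a version counter."""
    weight: float = 0.0
    version: int = 0


def sample(policy, num_rollouts, rng):
    """Sampler: draw actions from the current policy and score them."""
    rollouts = []
    for _ in range(num_rollouts):
        action = policy.weight + rng.gauss(0.0, 1.0)
        reward = -(action - 3.0) ** 2  # toy reward, highest at action == 3
        rollouts.append((action, reward))
    return rollouts


def train(policy, rollouts, lr=0.05):
    """Trainer: one policy-gradient step with a mean-reward baseline."""
    n = len(rollouts)
    baseline = sum(r for _, r in rollouts) / n
    grad = sum((r - baseline) * (a - policy.weight) for a, r in rollouts) / n
    return Policy(weight=policy.weight + lr * grad, version=policy.version + 1)


def sync(workers, policy):
    """Parameter synchronization: push new weights to every inference worker."""
    for worker in workers:
        worker["policy"] = policy


def run(steps, num_workers=2, seed=0):
    rng = random.Random(seed)
    policy = Policy()
    workers = [{"policy": policy} for _ in range(num_workers)]
    for _ in range(steps):
        rollouts = sample(workers[0]["policy"], num_rollouts=64, rng=rng)
        policy = train(policy, rollouts)
        sync(workers, policy)
    return policy, workers
```

In a synchronous setup like this one, the sampler always reads the weights written by the most recent `sync`, so each step trains on rollouts from the current policy.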
Trajectory’s method is called Continuous Multi-LoRA Training, abbreviated as C-LoRA. In this setup, every experiment is linked to its own dedicated LoRA adapter running on a pre-warmed, multi-tenant engine.
The Bottleneck Issues It Addresses
The Trajectory team pinpoints four core inefficiencies found in conventional training stacks:
(1) Cold starts take too long: Every sequential training job forces a full reload of checkpoints, reinitialization of the distributed runtime, and a warmup of inference engines. For large-scale models, just this initialization phase can stretch beyond 30 minutes per execution.
(2) RL carries a heavy memory footprint: Leading-edge models frequently surpass 100 billion parameters. The Qwen3.5-397B model, for example, may need as many as eight H200 nodes just to fit into available memory. LoRA dramatically shrinks memory demands — by roughly a factor of ten. It accomplishes this by keeping the base model frozen and training only lightweight adapter weights.
(3) Conventional stacks operate on a single-tenant basis: They handle one experiment at any given time. Multi-LoRA binds each experiment to its own adapter, so N experiments can share one engine concurrently. The ideal gain is a factor of N; the measured gain at N=8 is 2.81×.
(4) GPU job utilization is poor: Training processors and inference engines frequently sit idle waiting on one another. Multi-LoRA distributes work across concurrent jobs to soak up that unused capacity.
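To see why freezing the base model shrinks the trainable state in point (2), it helps to count parameters for a single linear layer. The sketch below uses a hidden size of 4096 and a LoRA rank of 16; both values are illustrative assumptions, not figures from the report. It counts trainable parameters only and does not try to reproduce the roughly tenfold whole-system memory figure, which also depends on the frozen base weights, activations, and optimizer state.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative sizes (assumed for this example).
D_IN = D_OUT = 4096
RANK = 16

frozen = full_params(D_IN, D_OUT)
trainable = lora_params(D_IN, D_OUT, RANK)

print(f"full fine-tuning: {frozen:,} trainable parameters")
print(f"LoRA rank {RANK}:     {trainable:,} trainable parameters")
print(f"ratio:            {frozen // trainable}x fewer")
```

Because gradients and optimizer moments are kept only for the trainable parameters, the per-experiment training state scales with the adapter size rather than with the base model.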
A Look at the Internal Design
The largest throughput gains originate on the inference side. Within vLLM, all adapters are pre-loaded into GPU memory ahead of time. This lets decode steps interleave tokens from different adapters inside a single shared batch. The critical piece that makes this efficient is the SGMV decode kernel. It consolidates per-adapter matrix-vector computations into just one GPU kernel launch per decode step.
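What a mixed-adapter decode step computes can be written as a plain-Python reference. This is a sketch of the semantics only: a real SGMV kernel performs the grouped work in one GPU launch, whereas this version loops on the CPU. The function names and the tiny 2×2 weights are assumptions made for the example.

```python
from collections import defaultdict


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def decode_step(base_weight, adapters, batch):
    """Apply the shared base weight plus each request's own LoRA delta.

    batch is a list of (adapter_id, hidden_vector); outputs keep batch order.
    """
    # Gather: group request indices into one segment per adapter.
    segments = defaultdict(list)
    for index, (adapter_id, _) in enumerate(batch):
        segments[adapter_id].append(index)

    outputs = [None] * len(batch)
    for adapter_id, indices in segments.items():
        lora_a, lora_b = adapters[adapter_id]  # A: rank x d_in, B: d_out x rank
        for index in indices:
            x = batch[index][1]
            base = matvec(base_weight, x)
            delta = matvec(lora_b, matvec(lora_a, x))
            outputs[index] = [b + d for b, d in zip(base, delta)]
    return outputs


base_weight = [[1, 0], [0, 1]]
adapters = {
    "a": ([[1, 0]], [[1], [0]]),  # rank-1 adapter acting on the first coordinate
    "b": ([[0, 1]], [[0], [1]]),  # rank-1 adapter acting on the second coordinate
}
batch = [("a", [2, 3]), ("b", [2, 3]), ("a", [1, 1])]
result = decode_step(base_weight, adapters, batch)
```

Requests from tenants "a" and "b" sit in the same batch and share the base-weight computation; only the small low-rank delta differs per request.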
Once each optimization pass completes, the newly updated LoRA weights are loaded directly into the inference engine in-place. The scheduler keeps running uninterrupted, so all other tenants continue generating decodes.
Training follows a different path. Only one LoRA adapter is actively training on the GPU at any moment. The remaining adapters reside in pinned CPU memory. Each tenant’s full training state is managed inside an AdapterStore, which stores the LoRA parameters, FP32 master weights, optimizer moments, and gradient buffers.
The engine moves one tenant’s state onto the GPU, carries out a single forward-backward pass, and then swaps it back out. This training path remains single-adapter at its core — the concurrency benefits seen in inference haven’t yet been extended to the training stage.
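The swap-in, train, swap-out cycle can be sketched as follows. Only the `AdapterStore` name and the four pieces of stored state come from the report; the methods, the `device` string standing in for a real host-to-GPU copy, and the momentum-SGD update are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TenantState:
    """Per-tenant training state, mirroring the four pieces listed above."""
    lora_params: list
    master_weights_fp32: list
    optimizer_moments: list
    grad_buffers: list
    device: str = "cpu_pinned"


class AdapterStore:
    def __init__(self):
        self._tenants = {}
        self.active = None  # at most one tenant is resident on the GPU

    def register(self, tenant_id, size):
        self._tenants[tenant_id] = TenantState(
            [0.0] * size, [0.0] * size, [0.0] * size, [0.0] * size
        )

    def get(self, tenant_id):
        return self._tenants[tenant_id]

    def swap_in(self, tenant_id):
        if self.active is not None:
            raise RuntimeError("another tenant is already on the GPU")
        state = self._tenants[tenant_id]
        state.device = "gpu"  # stand-in for a host-to-device copy
        self.active = tenant_id
        return state

    def swap_out(self):
        self._tenants[self.active].device = "cpu_pinned"
        self.active = None

    def train_step(self, tenant_id, grads, lr=0.1, momentum=0.9):
        state = self.swap_in(tenant_id)
        try:
            state.grad_buffers = list(grads)
            state.optimizer_moments = [
                momentum * m + g
                for m, g in zip(state.optimizer_moments, state.grad_buffers)
            ]
            state.master_weights_fp32 = [
                w - lr * m
                for w, m in zip(state.master_weights_fp32, state.optimizer_moments)
            ]
            # Stand-in for casting master weights to the adapter's dtype.
            state.lora_params = list(state.master_weights_fp32)
        finally:
            self.swap_out()


store = AdapterStore()
store.register("tenant-a", size=2)
store.register("tenant-b", size=2)
store.train_step("tenant-a", grads=[1.0, -2.0])
```

Keeping optimizer moments and master weights per tenant is what lets each experiment resume exactly where it left off after other tenants have used the GPU.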
Benchmark Results
Trajectory ran its tests on a single H200 node using Qwen3-4B-Instruct-2507. It executed synchronous RL on the GSM8K benchmark within an agentic framework. The team reimagined GSM8K as a tool-usage learning problem. The model learns to decide when to invoke a Calculator tool and a Final Answer tool. A reward of 1.0 is awarded exclusively when the Final Answer tool is called with the correct response.
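The reward rule described above can be sketched as a small function. The tool identifiers, the `ToolCall` type, the numeric normalization, and the choice of 0.0 for every other outcome are assumptions for illustration; the report only states that 1.0 is awarded when the Final Answer tool is called with the correct response.

```python
from dataclasses import dataclass

CALCULATOR = "calculator"      # hypothetical tool identifier
FINAL_ANSWER = "final_answer"  # hypothetical tool identifier


@dataclass
class ToolCall:
    name: str
    argument: str


def parse_number(text):
    """Parse a numeric answer, tolerating thousands separators and whitespace."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def reward(calls, gold_answer):
    """Return 1.0 only if the Final Answer tool is called with the correct value."""
    gold = parse_number(gold_answer)
    for call in calls:
        if call.name == FINAL_ANSWER:
            predicted = parse_number(call.argument)
            return 1.0 if predicted is not None and predicted == gold else 0.0
    return 0.0  # the episode never called the Final Answer tool


episode = [ToolCall(CALCULATOR, "48 / 2 + 48"), ToolCall(FINAL_ANSWER, "72")]
print(reward(episode, "72"))
```

A sparse reward like this gives no credit for a correct calculator call on its own, so the policy has to learn both when to compute and when to commit to an answer.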
The policy model begins at roughly 40% accuracy at step zero. Using the right learning algorithm, accuracy climbs above 90% by the ninth step.
The team then scaled up to eight concurrent multi-LoRA experiment runs. Total experiment completion time reached 5433 seconds at N=8, which translates to a 2.81× speedup. All eight concurrent experiments wrapped up before three sequential back-to-back runs could finish. Average experiment runtime also showed gains, with the best improvement at N=4 reaching 1.88×. Every concurrency configuration hit reward accuracy above 90% by step 9.
The Tradeoffs to Consider
Greater throughput comes at the cost of slower per-step execution. As N increases, both First Experiment Time and Step Time worsen. At N=8, the first experiment of an equivalent sequential schedule finishes 1.97× sooner than it does under concurrency. Average step time climbs from 191 seconds to 500 seconds, about 2.62× slower.
The bulk of that slowdown traces back to rollout time. Rollouts grow from 162 seconds to 401 seconds, accounting for roughly 77% of the total increase in step time. At N=2, doubling the load adds only 15% to rollout duration, which makes N=2 a sweet spot for multi-LoRA deployments.
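The quoted ratios can be checked with a few lines of arithmetic. The step and rollout times come straight from the figures above. The throughput check at the end rests on an assumption that is not stated in the report: that a sequential baseline run takes 10 steps at 191 seconds each, with cold-start overhead ignored.

```python
# Reported figures (seconds).
BASELINE_STEP_S = 191
N8_STEP_S = 500
BASELINE_ROLLOUT_S = 162
N8_ROLLOUT_S = 401
N8_TOTAL_S = 5433

step_slowdown = N8_STEP_S / BASELINE_STEP_S
rollout_share = (N8_ROLLOUT_S - BASELINE_ROLLOUT_S) / (N8_STEP_S - BASELINE_STEP_S)

print(f"step slowdown at N=8: {step_slowdown:.2f}x")   # 2.62x
print(f"rollout share of increase: {rollout_share:.0%}")  # 77%

# Assumption (not from the report): 10 steps per run, no cold-start cost.
ASSUMED_STEPS = 10
sequential_run_s = ASSUMED_STEPS * BASELINE_STEP_S
throughput_gain = 8 * sequential_run_s / N8_TOTAL_S
three_runs_s = 3 * sequential_run_s

print(f"throughput gain under the assumption: {throughput_gain:.2f}x")  # 2.81x
print(f"three sequential runs: {three_runs_s} s vs {N8_TOTAL_S} s for eight concurrent")
```

Under that assumption the numbers line up with the reported 2.81× gain and with the observation that eight concurrent runs finish before three sequential ones, but the actual baseline accounting in the report may differ.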
The same pattern emerged on a more demanding task. Running τ-bench retail benchmarks with the NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 MoE model, N=2 completed 10 steps 1.28× faster, while per-tenant step time increased by 1.57×.
Advantages and Limitations
Advantages:
- Achieves a 2.81× end-to-end experiment-throughput improvement across eight concurrent runs
- No loss in accuracy; individual runs stay within ±1σ of the serial baseline during final training steps
- LoRA slashes memory requirements by roughly ten times compared to full fine-tuning
- Entirely open-sourced in NovaSky-AI/SkyRL, enabling community adoption and extension
Limitations:
- Per-step latency and First Experiment Time degrade as concurrency N rises
- Training is still serialized across tenants — only inference benefits from multiplexing
- Evaluation has focused on mid-range models rather than frontier-scale parameter counts
- Initial setup demands an 8× H100/H200 node cluster along with a Megatron build
Essential Takeaways
- Trajectory engineered a concurrent, multi-LoRA RL training stack aimed at continual learning, with the full codebase open-sourced in NovaSky-AI/SkyRL.
- It delivers a 2.81× end-to-end experiment-throughput gain over a single-tenant baseline, with no drop-off in reward metrics.
- Every experiment is mapped to a dedicated LoRA adapter on a pre-warmed, always-on engine, so N experiments share the same hardware concurrently.
- The strongest performance gains are driven by vLLM multi-LoRA inference via the SGMV decode kernel; training stays single-adapter.
- The tradeoff is per-step latency: at N=8, step time rises from 191s to 500s.
Michal Sutter holds a Master of Science degree in Data Science from the University of Padova. Her expertise spans across statistical analysis, machine learning, and data engineering, making her highly skilled at turning complicated data into practical, real-world solutions.



