Trajectory Unveils Concurrent Multi-LoRA Training Stack For Continual Learning With 2.81× Experiment-Throughput Gain

Trajectory’s concurrent multi-LoRA architecture delivers a 2.81× boost in experiment throughput compared to traditional single-tenant RL training, with all implementation code publicly available in the NovaSky-AI/SkyRL GitHub repository.

Language models typically evolve through sudden breaks in capability. Groups accumulate data, perform training cycles, and deploy updated versions — a process that extends across months and often leads to unpredictable shifts in how the model behaves for end users. Trajectory aims to replace this stop-and-go pattern with a smoother, ongoing learning approach.

The Trajectory team released a detailed field report explaining their methodology. They engineered a concurrent, multi-LoRA training infrastructure designed for workloads that require continuous updates. The project was carried out in collaboration with UC Berkeley Sky Lab and Anyscale. All training code has been open-sourced in the NovaSky-AI/SkyRL repository.

The outcome is a 2.81× improvement in end-to-end experiment throughput when measured against a single-tenant training setup. Trajectory confirms there was no degradation in training rewards across any experiments.

Understanding Multi-LoRA Training

Ongoing learning depends on models that can absorb feedback from active usage and real production interactions. For instance, a coding assistant could pick up on engineering conventions as developers fix its outputs. Similarly, a customer support agent could get better at handling tricky tickets as human operators step in during difficult situations.

The majority of existing training systems still follow a straight-line workflow. Teams assign GPU resources, load the model, execute a single training job, and then shut everything down. Continuous learning flips this model on its head. When live production interactions double as training data, the training process itself becomes embedded within a running system.

Contemporary RL training comes down to three fundamental operations. The Sampler creates trajectory sequences from the current policy model. The Trainer calculates gradients and adjusts the policy weights. Parameter synchronization propagates the updated weights back out to inference workers.
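The three operations can be illustrated with a toy loop. This is a minimal sketch, not SkyRL code: the `InferenceWorker` class, the single-parameter Bernoulli policy, and the reward (1 for action 1, 0 otherwise) are all assumptions chosen to keep the example small.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class InferenceWorker:
    """Hypothetical stand-in for an inference engine holding a copy of the policy."""
    def __init__(self, theta):
        self.theta = theta

    def rollout(self, n, rng):
        # Sampler: draw n actions from the worker's current copy of the policy.
        p = sigmoid(self.theta)
        return [1 if rng.random() < p else 0 for _ in range(n)]

def trainer_update(theta, actions, lr=1.0):
    # Trainer: REINFORCE for a Bernoulli policy with reward r = a.
    # grad log pi(a) = (a - p), so the per-sample gradient is (a - p) * a.
    p = sigmoid(theta)
    grad = sum((a - p) * a for a in actions) / len(actions)
    return theta + lr * grad

def run(steps=20, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    worker = InferenceWorker(theta)
    for _ in range(steps):
        actions = worker.rollout(64, rng)        # 1. sample trajectories
        theta = trainer_update(theta, actions)   # 2. compute gradient, update policy
        worker.theta = theta                     # 3. parameter sync to the worker
    return theta, worker.theta
```

After `run()`, the trainer's parameter and the worker's copy agree, which is the invariant parameter synchronization is meant to maintain.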

Trajectory’s method is called Continuous Multi-LoRA Training, abbreviated as C-LoRA. In this setup, every experiment is linked to its own dedicated LoRA adapter running on a pre-warmed, multi-tenant engine.

The Bottleneck Issues It Addresses

The Trajectory team pinpoints four core inefficiencies found in conventional training stacks:

(1) Cold starts take too long: Every sequential training job forces a full reload of checkpoints, reinitialization of the distributed runtime, and a warmup of inference engines. For large-scale models, just this initialization phase can stretch beyond 30 minutes per execution.

(2) RL carries a heavy memory footprint: Leading-edge models frequently surpass 100 billion parameters. The Qwen3.5-397B model, for example, may need as many as eight H200 nodes just to fit into available memory. LoRA dramatically shrinks memory demands — by roughly a factor of ten. It accomplishes this by keeping the base model frozen and training only lightweight adapter weights.
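To see why training only adapter weights is so much cheaper, it helps to count trainable parameters for one linear layer. The sketch below assumes the standard LoRA factorization, a frozen weight of shape d_in × d_out plus two trainable matrices of shapes d_in × r and r × d_out. The layer size and rank are hypothetical, and the roughly tenfold memory figure above is the report's number, not something this arithmetic derives.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters as a fraction of the frozen weight's parameters."""
    full_params = d_in * d_out            # frozen base weight, not updated
    lora_params = rank * (d_in + d_out)   # adapter matrices A (d_in x r) and B (r x d_out)
    return lora_params / full_params

# Hypothetical layer: 4096 x 4096 with rank 16.
fraction = lora_trainable_fraction(4096, 4096, 16)
```

Gradients and optimizer state are only needed for the trainable parameters, so they shrink with this fraction, while the frozen base weights still have to be held in memory.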

(3) Conventional stacks operate on a single-tenant basis: They handle one experiment at any given time. Multi-LoRA binds each experiment to its own adapter, multiplying overall throughput by up to a factor of N in the ideal case.

(4) GPU job utilization is poor: Training processors and inference engines frequently sit idle waiting on one another. Multi-LoRA distributes work across concurrent jobs to soak up that unused capacity.

A Look at the Internal Design

The largest throughput gains originate on the inference side. Within vLLM, all adapters are pre-loaded into GPU memory ahead of time. This lets decode steps interleave tokens from different adapters inside a single shared batch. The critical piece that makes this efficient is the SGMV decode kernel. It consolidates per-adapter matrix-vector computations into just one GPU kernel launch per decode step.
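What a mixed-adapter decode step computes can be written as a plain reference loop. This is a sketch of the semantics only, assuming each token's output is the shared base projection plus its own adapter's low-rank correction. The function names are hypothetical, and a fused kernel such as SGMV would produce the same result in one GPU launch rather than a Python loop.

```python
def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def multi_lora_decode_step(xs, adapter_ids, W, adapters):
    """Reference semantics for one decode step over a batch that mixes adapters.

    xs[i] is the hidden vector for request i, adapter_ids[i] picks its adapter,
    W is the shared frozen weight, adapters[k] is the pair (A_k, B_k).
    """
    outputs = []
    for x, adapter_id in zip(xs, adapter_ids):
        A, B = adapters[adapter_id]
        base = matvec(x, W)                 # shared base-model projection
        delta = matvec(matvec(x, A), B)     # per-adapter low-rank update
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs

# Two requests with the same input but different adapters in one batch.
W = [[1, 0], [0, 1]]
adapters = {
    0: ([[1], [0]], [[0, 1]]),
    1: ([[0], [1]], [[2, 0]]),
}
outs = multi_lora_decode_step([[1, 2], [1, 2]], [0, 1], W, adapters)
```

The two requests share the base projection but receive different corrections, which is why they can sit in the same batch without interfering.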

Once each optimization pass completes, the newly updated LoRA weights are loaded into the inference engine in place. The scheduler keeps running uninterrupted, so all other tenants continue decoding.

Training follows a different path. Only one LoRA adapter is actively training on the GPU at any moment. The remaining adapters reside in pinned CPU memory. Each tenant’s full training state is managed inside an AdapterStore, which stores the LoRA parameters, FP32 master weights, optimizer moments, and gradient buffers.

The engine moves one tenant’s state onto the GPU, carries out a single forward-backward pass, and then swaps it back out. This training path remains single-adapter at its core — the concurrency benefits seen in inference haven’t yet been extended to the training stage.
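The swap-in, step, swap-out pattern can be sketched as follows. Only the name AdapterStore and the four kinds of state come from the report. The methods, the `device` flag standing in for real host-to-GPU copies, and the momentum-SGD update are all assumptions for illustration, and the gradients are passed in rather than produced by a real forward-backward pass.

```python
from dataclasses import dataclass

@dataclass
class TenantState:
    lora_params: list
    master_weights: list      # FP32 master copy
    optimizer_moments: list
    grad_buffers: list
    device: str = "cpu"       # "cpu" stands in for pinned host memory

class AdapterStore:
    """Hypothetical store that keeps at most one tenant resident on the GPU."""
    def __init__(self):
        self._slots = {}

    def register(self, tenant, n_params):
        def zeros():
            return [0.0] * n_params
        self._slots[tenant] = TenantState(zeros(), zeros(), zeros(), zeros())

    def on_gpu(self):
        return [t for t, s in self._slots.items() if s.device == "gpu"]

    def params(self, tenant):
        return list(self._slots[tenant].lora_params)

    def train_step(self, tenant, grads, lr=0.1, momentum=0.9):
        if self.on_gpu():
            raise RuntimeError("another adapter is already resident")
        state = self._slots[tenant]
        state.device = "gpu"                      # swap in
        try:
            state.grad_buffers = list(grads)      # stand-in for forward-backward
            state.optimizer_moments = [
                momentum * m + g
                for m, g in zip(state.optimizer_moments, state.grad_buffers)
            ]
            state.master_weights = [
                w - lr * m
                for w, m in zip(state.master_weights, state.optimizer_moments)
            ]
            state.lora_params = list(state.master_weights)
        finally:
            state.device = "cpu"                  # swap back out
        return self.params(tenant)

store = AdapterStore()
store.register("a", 2)
store.register("b", 2)
updated = store.train_step("a", [1.0, -1.0])
```

Because each tenant carries its own optimizer moments and master weights, stepping tenant "a" leaves tenant "b" untouched.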

Benchmark Results

Trajectory ran its tests on a single H200 node using Qwen3-4B-Instruct-2507. It executed synchronous RL on the GSM8K benchmark within an agentic framework. The team reimagined GSM8K as a tool-usage learning problem. The model learns to decide when to invoke a Calculator tool and a Final Answer tool. A reward of 1.0 is awarded exclusively when the Final Answer tool is called with the correct response.
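The reward rule described above is simple enough to sketch. The tool identifiers `calculator` and `final_answer`, the (name, argument) call format, and the exact-string comparison are assumptions; the report only states that 1.0 is given when the Final Answer tool is called with the correct response.

```python
def reward(tool_calls, gold_answer):
    """Return 1.0 only if the Final Answer tool is called with the correct answer.

    tool_calls is a list of (tool_name, argument) pairs in call order.
    The first final_answer call is treated as decisive (an assumption).
    """
    for name, argument in tool_calls:
        if name == "final_answer":
            return 1.0 if argument.strip() == gold_answer.strip() else 0.0
    return 0.0  # the episode never committed to an answer

# A correct calculation that never calls final_answer earns nothing.
no_commit = reward([("calculator", "12 * 4")], "48")
correct = reward([("calculator", "12 * 4"), ("final_answer", "48")], "48")
```

Under this rule, calculator calls carry no reward of their own, so the model is only reinforced for tool use that leads to a correct final answer.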

The policy model begins at roughly 40% accuracy at step zero. With a suitable learning algorithm, accuracy climbs above 90% by step 9.

The team then scaled up to eight concurrent multi-LoRA experiment runs. Total experiment completion time reached 5433 seconds at N=8, which translates to a 2.81× speedup. All eight concurrent experiments wrapped up before three sequential back-to-back runs could finish. Average experiment runtime also showed gains, with the best improvement at N=4 reaching 1.88×. Every concurrency configuration hit reward accuracy above 90% by step 9.
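The claim that eight concurrent runs beat three sequential ones can be checked from the two reported numbers. The sketch below assumes the 2.81× figure is the ratio of sequential total time to concurrent total time and that sequential runs are equally long. The derived per-run and three-run times are implied values, not figures from the report.

```python
def implied_sequential_times(concurrent_total_s, speedup, n_experiments):
    """Back out the implied sequential timings from a reported speedup."""
    sequential_total_s = concurrent_total_s * speedup
    per_run_s = sequential_total_s / n_experiments
    return sequential_total_s, per_run_s

# Reported: 5433 s for N=8 concurrent experiments, 2.81x speedup.
sequential_total_s, per_run_s = implied_sequential_times(5433.0, 2.81, 8)
three_sequential_s = 3 * per_run_s
```

Under those assumptions a single sequential run takes about 1908 seconds, so three back-to-back runs need roughly 5725 seconds, which is longer than the 5433 seconds for all eight concurrent runs.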

The Tradeoffs to Consider

Greater throughput comes at the cost of slower per-step execution. As N increases, both First Experiment Time and Step Time worsen. At N=8, the first experiment of a sequential schedule finishes 1.97× sooner than the first experiment of the concurrent run. Average step time climbs from 191 seconds to 500 seconds, about 2.62× slower.

The bulk of that slowdown traces back to rollout time. Rollouts grow from 162 seconds to 401 seconds, accounting for roughly 77% of the total time increase. At N=2, doubling the load adds only 15% to rollout duration, which makes N=2 a sweet spot for multi-LoRA deployments.
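The slowdown and the rollout share follow directly from the four reported timings. The short calculation below uses only those figures; the variable names are illustrative.

```python
# Reported timings in seconds: single-tenant baseline vs. N=8.
step_base_s, step_n8_s = 191.0, 500.0
rollout_base_s, rollout_n8_s = 162.0, 401.0

step_slowdown = step_n8_s / step_base_s                 # ratio of step times
added_step_s = step_n8_s - step_base_s                  # extra seconds per step
added_rollout_s = rollout_n8_s - rollout_base_s         # extra seconds spent in rollouts
rollout_share = added_rollout_s / added_step_s          # fraction of the increase due to rollouts
```

Rollouts contribute 239 of the 309 added seconds per step, which is where the 77% figure comes from.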

The same pattern emerged on a more demanding task. Running τ-bench retail benchmarks with the NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 MoE model, N=2 completed 10 steps 1.28× faster, while per-tenant step time increased by 1.57×.

Advantages and Limitations

Advantages:

Achieves a 2.81× end-to-end experiment-throughput improvement across eight concurrent runs
No loss in accuracy; individual runs stay within ±1σ of the serial baseline during final training steps
LoRA slashes memory requirements by roughly ten times compared to full fine-tuning
Entirely open-sourced in NovaSky-AI/SkyRL, enabling community adoption and extension

Limitations:

Per-step latency and First Experiment Time degrade as concurrency N rises
Training is still serialized across tenants — only inference benefits from multiplexing
Evaluation has focused on mid-range models rather than frontier-scale parameter counts
Initial setup demands an 8× H100/H200 node cluster along with a Megatron build

Essential Takeaways

Trajectory engineered a concurrent, multi-LoRA RL training stack aimed at continual learning, with the full codebase open-sourced in NovaSky-AI/SkyRL.
It delivers a 2.81× end-to-end experiment-throughput gain over a single-tenant baseline, with no drop-off in reward metrics.
Every experiment is mapped to a dedicated LoRA adapter on a perpetually-ready engine, multiplying total throughput by up to N in the ideal case.
The strongest performance gains are driven by vLLM multi-LoRA inference via the SGMV decode kernel; training stays single-adapter.
The tradeoff is per-step latency: at N=8, step time rises from 191s to 500s.

Marktechpost’s Visual Explainer

Field Report · May 27, 2026

Continuous Multi-LoRA Training for Continual Learning

Trajectory, developed alongside UC Berkeley Sky Lab and Anyscale.

2.81× boost in end-to-end experiment throughput

01 — Overview

A single always-on engine serving multiple adapters

Continual learning refines models using real-time feedback and live production data.

Trajectory introduces its solution: Continuous Multi-LoRA Training (C-LoRA). Every experiment is assigned its own LoRA adapter on a live, multi-tenant engine that stays warm.

Sampler

Creates trajectories using the current policy model.

Trainer

Calculates gradients and adjusts policy weights.

Parameter sync

Pushes updated weights out to inference workers.

The shift

Training integrates directly into a running, distributed service.

02 — Key challenges addressed

Four bottlenecks in traditional serial RL pipelines

Lengthy cold starts

Each run must reload checkpoints and warm up engines—often taking over 30 minutes each time.

Heavy RL memory demands

Qwen3.5-397B may require up to eight H200 nodes; LoRA slashes memory use by roughly 10x.

Single-tenant isolation

Only one experiment runs at a time. Multi-LoRA multiplies throughput by N through multiplexing.

Underutilized resources

Trainers and inference engines sit idle waiting for each other. Multi-LoRA puts that spare capacity to work.

03 — Under the hood

How the throughput gains are achieved

Inference engine: Inside vLLM, all adapters are kept loaded in GPU memory. The SGMV decode kernel merges per-adapter computation into a single GPU launch per decode step.
Weight updates: Updated LoRA weights are swapped in-place. The scheduler never pauses, so other tenants continue decoding without interruption.
Training loop: A single adapter trains on the GPU at a time; the remaining adapters are stored in pinned CPU memory.

AdapterStore

Each tenant’s slot holds LoRA weights, FP32 master weights, optimizer states, and gradient buffers. This pipeline remains single-adapter.

04 — Experimental setup

GSM8K repurposed as a tool-use benchmark

Evaluated on a single H200 node using Qwen3-4B-Instruct-2507, running synchronous RL on GSM8K in an agentic framework.

The model autonomously chooses when to invoke the Calculator and Final Answer tools.
A reward of 1.0 is granted only when Final Answer is triggered with the correct response.
The policy starts around 40% accuracy and surpasses 90% by step 9.

05 — Results

2.81× throughput gain with no reward loss

2.81×

Total experiment time at N=8 (5433s)

1.88×

Average experiment time, peaking at N=4

>90%

reward_accuracy at every scale by step 9

Eight concurrent experiments wrapped up before three serial runs could complete. Results matched the serial baseline within ±1σ during the final steps.

06 — Tradeoffs

Higher throughput at the cost of per-step latency

At N=8, average step time climbs from 191s to 500s, a 2.62× slowdown.
Rollout duration stretches from 162s to 401s, accounting for roughly 77% of the added time.
At N=2, doubling the load increases rollout time by just 15%—the sweet spot.

Stress-test on a harder task

On τ-bench retail with the NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 MoE model, N=2 completed 10 steps 1.28× faster; per-tenant step time grew by 1.57×.

07 — Key takeaways

Main highlights

Concurrent multi-LoRA RL training for continual learning, released as open source in NovaSky-AI/SkyRL.
2.81× end-to-end experiment throughput improvement over a single-tenant setup.
The bulk of the speedup comes from vLLM multi-LoRA inference; training remains single-adapter.
SkyRL implements the Tinker API; reproduce results on 8× H100/H200 using the Tinker cookbook.


Michal Sutter holds a Master of Science degree in Data Science from the University of Padova. Her expertise spans across statistical analysis, machine learning, and data engineering, making her highly skilled at turning complicated data into practical, real-world solutions.
