How do you construct a single imaginative and prescient language motion mannequin that may management many alternative twin arm robots in the actual world? LingBot-VLA is Ant Group Robbyant’s new Imaginative and prescient Language Motion basis mannequin that targets sensible robotic manipulation in the actual world. It’s skilled on about 20,000 hours of teleoperated bimanual knowledge collected from 9 twin arm robotic embodiments and is evaluated on the massive scale GM-100 benchmark throughout 3 platforms. The mannequin is designed for cross morphology generalization, knowledge environment friendly submit coaching, and excessive coaching throughput on commodity GPU clusters.

Giant scale twin arm dataset throughout 9 robotic embodiments
The pre-training dataset is constructed from actual world teleoperation on 9 standard twin arm configurations. These embrace AgiBot G1, AgileX, Galaxea R1Lite, Galaxea R1Pro, Realman Rs 02, Leju KUAVO 4 Professional, Qinglong humanoid, ARX Lift2, and a Bimanual Franka setup. All techniques have twin 6 or 7 diploma of freedom arms with parallel grippers and a number of RGB-D cameras that present multi view observations
Teleoperation makes use of VR management for AgiBot G1 and isomorphic arm management for AgileX. For every scene the recorded movies from all views are segmented by human annotators into clips that correspond to atomic actions. Static frames firstly and finish of every clip are eliminated to scale back redundancy. Process stage and sub job stage language directions are then generated with Qwen3-VL-235B-A22B. This pipeline yields synchronized sequences of photographs, directions, and motion trajectories for pre-training.
To characterize motion range the analysis workforce visualizes probably the most frequent atomic actions in coaching and exams by phrase clouds. About 50 % of atomic actions within the take a look at set don’t seem inside the high 100 most frequent actions within the coaching set. This hole ensures that analysis stresses cross job generalization moderately than frequency primarily based memorization.


Structure, Combination of Transformers, and Stream Matching actions
LingBot-VLA combines a powerful multimodal spine with an motion professional by a Combination of Transformers structure. The imaginative and prescient language spine is Qwen2.5-VL. It encodes multi-view operational photographs and the pure language instruction right into a sequence of multimodal tokens. In parallel, the motion professional receives robotic proprioceptive state and chunks of previous actions. Each branches share a self consideration module that performs layer clever joint sequence modeling over commentary and motion tokens.
At every time step the mannequin varieties an commentary sequence that concatenates tokens from 3 digital camera views, the duty instruction, and the robotic state. The motion sequence is a future motion chunk with a temporal horizon set to 50 throughout pre-training. The coaching goal is conditional Stream Matching. The mannequin learns a vector area that transports Gaussian noise to the bottom reality motion trajectory alongside a linear likelihood path. This offers a steady motion illustration and produces clean, temporally coherent management appropriate for exact twin arm manipulation.
LingBot-VLA makes use of blockwise causal consideration over the joint sequence. Remark tokens can attend to one another bidirectionally. Motion tokens can attend to all commentary tokens and solely to previous motion tokens. This masks prevents info leakage from future actions into present observations whereas nonetheless permitting the motion professional to use the total multimodal context at every choice step.
Spatial notion through LingBot Depth distillation
Many VLA fashions battle with depth reasoning when depth sensors fail or return sparse measurements. LingBot-VLA addresses this by integrating LingBot-Depth, a separate spatial notion mannequin primarily based on Masked Depth Modeling. LingBot-Depth is skilled in a self supervised means on a big RGB-D corpus and learns to reconstruct dense metric depth when elements of the depth map are masked, typically in areas the place bodily sensors are likely to fail.
In LingBot-VLA the visible queries from every digital camera view are aligned with LingBot-Depth tokens by a projection layer and a distillation loss. Cross consideration maps VLM queries into the depth latent house and the coaching minimizes their distinction from LingBot-Depth options. This injects geometry conscious info into the coverage and improves efficiency on duties that require correct 3D spatial reasoning, similar to insertion, stacking, and folding underneath muddle and occlusion.
GM-100 actual world benchmark throughout 3 platforms
The primary analysis makes use of GM-100, an actual world benchmark with 100 manipulation duties and 130 filtered teleoperated trajectories per job on every of three {hardware} platforms. Experiments examine LingBot-VLA with π0.5, GR00T N1.6, and WALL-OSS underneath a shared submit coaching protocol. All strategies advantageous tune from public checkpoints with the identical dataset, batch measurement 256, and 20 epochs. Success Fee measures completion of all subtasks inside 3 minutes and Progress Rating tracks partial completion.
On GM-100, LingBot-VLA with depth achieves cutting-edge averages throughout the three platforms. The typical Success Fee is 17.30 % and the common Progress Rating is 35.41 %. π0.5 reaches 13.02 % SR (success fee) and 27.65 % PS (progress rating). GR00T N1.6 and WALL-OSS are decrease at 7.59 % SR, 15.99 % PS and 4.05 % SR, 10.35 % PS respectively. LingBot-VLA with out depth already outperforms GR00T N1.6 and WALL-OSS and the depth variant provides additional features.
In RoboTwin 2.0 simulation with 50 duties, fashions are skilled on 50 demonstrations per job in clear scenes and 500 per job in randomized scenes. LingBot-VLA with depth reaches 88.56 % common Success Fee in clear scenes and 86.68 % in randomized scenes. π0.5 reaches 82.74 % and 76.76 % in the identical settings. This exhibits constant features from the identical structure and depth integration when area randomization is robust.


Scaling habits and knowledge environment friendly submit coaching
The analysis workforce analyzes scaling legal guidelines by various pre-training knowledge from 3,000 to twenty,000 hours on a subset of 25 duties. Each Success Fee and Progress Rating improve monotonically with knowledge quantity, with no saturation on the largest scale studied. That is the primary empirical proof that VLA fashions keep favorable scaling on actual robotic knowledge at this measurement.
Additionally they examine knowledge effectivity of submit coaching on AgiBot G1 utilizing 8 consultant GM-100 duties. With solely 80 demonstrations per job LingBot-VLA already surpasses π0.5 that makes use of the total 130 demonstration set, in each Success Fee and Progress Rating. As extra trajectories are added the efficiency hole widens. This confirms that the pre-trained coverage transfers with solely dozens to round 100 job particular trajectories, which immediately reduces adaptation value for brand spanking new robots or duties.
Coaching throughput and open supply toolkit
LingBot-VLA comes with a coaching stack optimized for multi-node effectivity. The codebase makes use of a FSDP fashion technique for parameters and optimizer states, hybrid sharding for the motion professional, blended precision with float32 reductions and bfloat16 storage, and operator stage acceleration with fused consideration kernels and torch compile.
On an 8 GPU setup the analysis workforce reported throughput of 261 samples per second per GPU for Qwen2.5-VL-3B and PaliGemma-3B-pt-224 mannequin configurations. This corresponds to a 1.5 occasions to 2.8 occasions speedup in contrast with current VLA oriented codebases similar to StarVLA, Dexbotic, and OpenPI evaluated on the identical Libero primarily based benchmark. Throughput scales near linearly when transferring from 8 to 256 GPUs. The complete submit coaching toolkit is launched as open supply.
Key Takeaways
- LingBot-VLA is a Qwen2.5-VL primarily based imaginative and prescient language motion basis mannequin skilled on about 20,000 hours of actual world twin arm teleoperation throughout 9 robotic embodiments, which allows sturdy cross morphology and cross job generalization.
- The mannequin integrates LingBot Depth by function distillation so imaginative and prescient tokens are aligned with a depth completion professional, which considerably improves 3D spatial understanding for insertion, stacking, folding, and different geometry delicate duties.
- On the GM-100 actual world benchmark, LingBot-VLA with depth achieves about 17.30 % common Success Fee and 35.41 % common Progress Rating, which is increased than π0.5, GR00T N1.6, and WALL OSS underneath the identical submit coaching protocol.
- LingBot-VLA exhibits excessive knowledge effectivity in submit coaching, since on AgiBot G1 it may surpass π0.5 that makes use of 130 demonstrations per job whereas utilizing solely about 80 demonstrations per job, and efficiency continues to enhance as extra trajectories are added.
Take a look at the Paper, Mannequin Weight, Repo and Undertaking Web page. Additionally, be happy to observe us on Twitter and don’t overlook to affix our 100k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be a part of us on telegram as nicely.



