, we discussed AlpamayoR1 (AR1), an autonomous driving model that integrates a VLM as a reasoning backbone. It relies on a carefully collected chain-of-causation dataset. Training on this dataset enables AR1 to “reason” in natural language to resolve challenging driving situations.
But what if natural language is not the best medium for reasoning in driving scenarios? After all, when faced with a driving situation that requires an immediate response, human drivers often act reflexively rather than “reasoning in language step by step”. What is the alternative for driving models?
In this article, we break down the LatentVLA architecture, a compelling counterpoint to language-based approaches that requires no natural-language dataset, performs reasoning in latent space and uses knowledge distillation to meet real-time constraints.
Latent Action Learning
A large part of AR1’s success resides in the chain-of-causation dataset, the collection of which required industrial-scale efforts, a carefully designed labeling pipeline and extensive validation.
In contrast, LatentVLA takes a completely different path: the authors argue that raw driving data already contains the structure required to train a large model, and that natural language is inherently biased and difficult to align with actions. Further, producing natural-language reasoning chains is inefficient since some tokens don’t contribute meaningfully to the reasoning process (e.g. stop words).
Therefore, they introduce a self-supervised framework to predict ego-centric latent actions in a small latent space. In other words, the model uses unlabelled driving data to predict which action the driver must have taken to generate this data. These latent actions will serve as the building blocks for latent-space reasoning.
Representation Learning
To predict latent actions from unlabeled data, the authors use a method reminiscent of LAPO (Learning to Act without Actions) [2]. This approach relies on an encoder-decoder setup where the encoder (also called the “inverse dynamics model”, IDM) uses two consecutive frames to predict a continuous action vector, and the decoder (called the “forward dynamics model”, FDM) uses the current frame and the predicted action vector to reconstruct the next frame.
This clever setup forces the learned action representation to describe what action must have been taken to observe the state transitions in our dataset. However, this continuous action representation is still incompatible with the VLMs we intend to use. To discretise it, the authors use a VQ-VAE (Vector-Quantised Variational Auto-Encoder), which maps continuous vectors to the nearest discrete vectors in a learned codebook (i.e. a dictionary of discrete actions) in a differentiable manner. This is the action that will be used by the FDM to decode the next frame.
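To make the quantisation step concrete, here is a minimal NumPy sketch of the nearest-neighbour lookup at the core of VQ-VAE quantisation. The names, dimensions and the straight-through note are illustrative, not taken from the paper:

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous action vector in z to its nearest codebook entry.

    z: (batch, dim) continuous actions from the IDM.
    codebook: (K, dim) learned discrete action embeddings.
    Returns the quantized vectors and their codebook indices.
    """
    # Squared Euclidean distance between each action and each codebook entry
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (batch, K)
    idx = dists.argmin(axis=1)                                     # nearest entry
    z_q = codebook[idx]
    # During training, a straight-through estimator copies gradients
    # from z_q back to z, conceptually: z_q = z + stop_gradient(z_q - z)
    return z_q, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 discrete actions, as in LatentVLA
z = rng.normal(size=(3, 4))           # 3 continuous IDM outputs
z_q, idx = quantize(z, codebook)
print(idx.shape, z_q.shape)  # (3,) (3, 4)
```

Each output row is an exact copy of a codebook entry, which is what makes the representation discrete while staying trainable end-to-end.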
By optimising the next-frame reconstruction error, the IDM and FDM are jointly trained to encode a predictive discrete action representation.
Distinguishing Ego-Actions from Environmental Noise
Now you might think: “The driver’s actions are not the only factor influencing the next frame when driving, what if a bird flies in front of the camera? Does this pollute the action representation?”. To this, the authors answer yes and no: there must be a mechanism that disentangles the influence of the driver’s actions on the future from environmental dynamics.
The elegant solution to this problem is a two-stage encoder-decoder setup:
- Conditioned on the ground-truth trajectory, ego-state and previous frame, the encoder predicts a latent action. Since this action is conditioned on vehicle dynamics through the trajectory and ego-state, it only needs to model environmental dynamics to enable the decoder to reconstruct the next frame. This “environmental action” is then quantised, and the codebook used for this purpose is frozen for the next stage.
- Conditioned on the previous frame and the environmental action, the encoder encodes another latent action. Similarly, since the environmental dynamics are known and part of the conditioning, this second latent action is forced to encode ego-centric dynamics. Using a new codebook, this action is quantised into a discrete ego-action.
Finally, both actions are fed to the decoder to reconstruct the next frame. This setup ensures a clear separation of ego-actions and environmental dynamics.
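The two-stage pipeline above can be sketched structurally as follows. The stand-in “encoders” and all dimensions are purely illustrative; only the conditioning pattern mirrors the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest(z, codebook):
    """Quantise a single vector z to its nearest codebook row."""
    return codebook[((codebook - z) ** 2).sum(-1).argmin()]

# Stand-in "encoders": any fixed mapping works for this structural sketch.
def env_encoder(prev_frame, trajectory, ego_state):
    # Vehicle dynamics are given via trajectory and ego-state, so this
    # latent only has to explain the environment's contribution.
    return np.tanh(prev_frame[:4] + trajectory[:4] + ego_state[:4])

def ego_encoder(prev_frame, env_action):
    # Environmental dynamics are part of the conditioning, so this
    # latent is pushed toward ego-centric dynamics.
    return np.tanh(prev_frame[:4] - env_action)

env_codebook = rng.normal(size=(16, 4))  # stage 1, frozen afterwards
ego_codebook = rng.normal(size=(16, 4))  # stage 2, trained separately

prev_frame = rng.normal(size=8)
trajectory = rng.normal(size=8)
ego_state = rng.normal(size=8)

# Stage 1: environmental action; Stage 2: ego action conditioned on it.
env_action = nearest(env_encoder(prev_frame, trajectory, ego_state), env_codebook)
ego_action = nearest(ego_encoder(prev_frame, env_action), ego_codebook)

# The decoder (not shown) would reconstruct the next frame from both actions.
print(env_action.shape, ego_action.shape)  # (4,) (4,)
```

The key design choice is the ordering: because the environmental action is computed first and then held fixed in the conditioning, the second latent has nothing left to explain but the ego-vehicle’s own dynamics.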
VLM Training
Building on the learned action representation, the authors train a Qwen2.5-VL model to predict the same latent actions as the encoder-decoder model. This is achieved by having the encoder predict a trajectory of 12 latent actions for a given input frame and having the VLM optimise their negative log likelihood.
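As a rough illustration of this objective, here is how the negative log likelihood of a 12-step latent action sequence over a 16-entry codebook could be computed. Shapes and names are assumptions for the sketch:

```python
import numpy as np

def sequence_nll(logits, targets):
    """Negative log likelihood of a latent action sequence.

    logits: (T, K) unnormalised scores over the K-entry action codebook
            at each of the T prediction steps.
    targets: (T,) codebook indices produced by the encoder-decoder model.
    """
    # Log-softmax over the codebook at each step (numerically stable)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Sum the negative log probability assigned to each target action
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(2)
logits = rng.normal(size=(12, 16))      # 12 latent actions, 16-token codebook
targets = rng.integers(0, 16, size=12)  # "teacher" actions from the encoder
loss = sequence_nll(logits, targets)
print(loss > 0)  # True: the model never assigns probability 1 to every target
```

With uniform logits, the loss reduces to 12·log(16), i.e. the entropy of guessing blindly over the small codebook, which is exactly what the next point about codebook size is about.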
A striking difference from other approaches employing action codebooks is the number of action tokens used by LatentVLA. Where other models like AutoVLA use an action codebook of 2048 special tokens, LatentVLA only uses 16.
This results in:
- A simpler learning task: in a 2048-entry codebook, actions probably represent very precise driving decisions like “steer left at a 16-degree angle”. With only 16 tokens, the model probably adopts higher-level directives like “accelerate slightly” or “take a narrow right turn”, which require fewer demonstrations to learn.
- Preserving the VLM’s pre-training knowledge: it doesn’t have to learn over 2000 “new words”.
Knowledge Distillation
Where AlpamayoR1 relied on efficient tokenisation and flow-matching diffusion to maintain real-time performance, LatentVLA goes for a completely different approach: knowledge distillation. To this end, the authors introduce a fusion module inside existing E2E architectures (iPad [4] and Transfuser [5]). This fusion module is fed visual and action embeddings by the VLM and outputs features in Bird’s-Eye-View (BEV) space. These embeddings serve as keys and values in cross-attention with BEV queries produced by the E2E model. This allows the E2E model to integrate insights from the VLM.
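Here is a minimal single-head sketch of this cross-attention, with BEV queries attending to VLM embeddings. Token counts and dimensions are made up for illustration:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head cross-attention: BEV queries attend to VLM embeddings.

    queries: (Nq, d)       BEV queries from the E2E planner.
    keys, values: (Nk, d)  visual and action embeddings from the VLM.
    """
    d = queries.shape[1]
    scores = queries @ keys.T / np.sqrt(d)         # (Nq, Nk) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over VLM tokens
    return weights @ values                        # (Nq, d) fused BEV features

rng = np.random.default_rng(3)
bev_queries = rng.normal(size=(64, 32))  # e.g. 64 BEV query slots
vlm_tokens = rng.normal(size=(20, 32))   # visual + action embeddings
fused = cross_attention(bev_queries, vlm_tokens, vlm_tokens)
print(fused.shape)  # (64, 32)
```

Because the output keeps the query-side shape, the fused features can be dropped into the E2E model’s BEV pipeline without touching the rest of the architecture.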

However, the VLM remains too large to be used efficiently at test time. Therefore, a small 50M-parameter decision transformer is trained to imitate the large 3.8B Qwen2.5-VL. This is achieved by minimising the KL divergence between the teacher and student distributions.
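A small sketch of such a distillation loss, assuming per-step distributions over the 16 action tokens (the actual training setup is more involved):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student), summed over the action codebook."""
    p = np.clip(p_teacher, eps, 1.0)
    q = np.clip(p_student, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

rng = np.random.default_rng(4)
teacher_logits = rng.normal(size=(12, 16))  # large VLM over 16 action tokens
student_logits = teacher_logits + 0.1 * rng.normal(size=(12, 16))  # small model

loss = kl_divergence(softmax(teacher_logits), softmax(student_logits)).mean()
print(loss >= 0)  # True: KL divergence is non-negative
```

Matching full distributions rather than single sampled actions gives the student a much denser training signal, which is part of why distillation works with so few parameters.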
This framework enables LatentVLA to operate with a very compact reasoning backbone and provides a general approach to integrating VLM knowledge into traditional E2E architectures at a lower cost.

Evaluation
LatentVLA is trained and evaluated on NavSim [6], a dataset composed of over 100,000 frames collected from real-world driving scenes. NavSim also includes a non-reactive simulator to evaluate open-loop planning.
In other words, the model predicts a trajectory over the next few seconds given input images. Then, this trajectory is executed in a BEV simulation operating on the assumption that the actions of the ego-vehicle don’t affect the actions of other agents (hence “non-reactive”). This makes it easy to measure planning-related metrics such as the Predictive Driver Model Score (PDMS): a composite metric that quantifies driving safety, performance, and risk by integrating simulation outputs.
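To give a feel for how such a composite metric works, here is an illustrative score in the spirit of PDMS: hard penalties multiply the score, while softer sub-scores are combined in a weighted average. The sub-metric names and weights below are assumptions; see the NavSim paper for the official definition.

```python
def pdm_score(no_collision, drivable_area, ego_progress, ttc, comfort,
              weights=(5.0, 5.0, 2.0)):
    """Illustrative composite score in the style of NavSim's PDMS.

    Hard failures (a collision, leaving the drivable area) zero out the
    score multiplicatively, while softer sub-scores (progress, time-to-
    collision, comfort) are combined in a weighted average. All inputs
    are sub-scores in [0, 1]; names and weights here are assumptions.
    """
    w_ep, w_ttc, w_c = weights
    weighted = (w_ep * ego_progress + w_ttc * ttc + w_c * comfort) / sum(weights)
    return no_collision * drivable_area * weighted

# A safe, comfortable rollout with good progress scores high...
print(round(pdm_score(1.0, 1.0, 0.9, 1.0, 1.0), 3))  # 0.958
# ...while any collision zeroes the score regardless of progress.
print(pdm_score(0.0, 1.0, 1.0, 1.0, 1.0))  # 0.0
```

The multiplicative structure is what makes small percentage differences near the top of the leaderboard meaningful: a model can only score above 90% if it almost never triggers a hard penalty.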
However, this kind of evaluation has some important shortcomings, as we’ll discuss later.

On this benchmark, LatentVLA obtains state-of-the-art results, improving upon standard E2E and LLM-based architectures. However, the performance boost obtained by integrating VLM knowledge into iPad and Transfuser seems limited. Focusing on the PDMS, we observe that the iPad baseline obtains a score of 91.7%. The distilled LatentVLA variant increases the score to 92.1 (+0.4%) and the non-distilled version reaches 92.4 (another +0.3%).
This small improvement begs the question of whether higher-level reasoning and world knowledge really are essential to driving.
In my opinion they have the potential to unlock a new level of driving performance, but this is poorly measured by non-interactive planning simulators.

The limitations of open-loop planning
Recently, it has become widely accepted that evaluating driving models only on open-loop planning gives an incomplete picture of their real driving abilities. Indeed, open-loop planning is fundamentally different from driving, and arguably easier. The main reason is that open-loop planning doesn’t involve interactions with the environment (the simulator is at best non-reactive) and reduces to imitating the trajectory of an expert. This creates several problems in real scenarios:
- Small deviations from the learned trajectories lead to cascading errors: without dynamic interactions with the environment and other agents, open-loop models struggle to rectify trajectories that are slightly misaligned with the ones they learned.
- Trajectories are inherently multimodal: for each driving scenario, there exist multiple trajectories and acceleration patterns leading to safe driving outcomes. However, imitation learning on a single expert trajectory collapses this multi-modality, limiting the generalisation capabilities of the model.
For these reasons, it is important to thoroughly evaluate driving models in closed-loop (i.e. reactive) simulators, which also warrants the use of RL post-training methods as discussed in the AR1 article.
I’d guess that the discrepancy between LatentVLA and its non-VLM baselines would be larger in these scenarios, as reasoning could help alleviate the limitations of open-loop training.
Conclusion
In this article, we discussed LatentVLA, an approach aiming to integrate VLM knowledge into standard E2E models without relying on natural language. This approach is innovative in the sense that it enables learning useful representations from unlabeled data, whereas competing works like AR1 rely on carefully annotated large-scale datasets to bypass the ambiguity of natural language.
However, LatentVLA would benefit from more thorough evaluation, notably in closed-loop settings.
Thanks for reading this far!
If you found this article useful, please consider sharing it; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to contact me if you have questions, remarks, or ideas for follow-ups. If you’d like to support my independent research and writing, feel free to buy me a coffee 😉
Until next time! 👋
References
- [1] LatentVLA
- [2] LAPO
- [3] VQ-VAE
- [4] iPad
- [5] Transfuser
- [6] NavSim