Rethinking Inference: Berkeley AI's Blueprint For Scalable Efficiency

Overview of adaptive parallel reasoning.

Imagine a reasoning model that could independently determine when to break a problem into separate subtasks and process them in parallel, how many simultaneous threads to create, and how to manage them based on the specific challenge it faces. In this post, we take a close look at the latest developments in parallel reasoning, with a particular focus on Adaptive Parallel Reasoning.

Disclosure: this post serves as both a landscape survey and a perspective on adaptive parallel reasoning. One of the authors (Tony Lian) co-led the ThreadWeaver project (Lian et al., 2025), which is among the methods discussed below. The authors aim to present each approach on its own terms.

Motivation

Much of the recent improvement in LLM reasoning abilities has come from inference-time scaling, alongside advances in data and parameter scaling (OpenAI et al., 2024; DeepSeek-AI et al., 2025). Models that explicitly generate reasoning tokens—through intermediate steps, backtracking, and exploration—now lead on math, coding, and agentic benchmarks. These capabilities let models test alternative hypotheses, fix earlier errors, and draw conclusions rather than locking into a single solution path (Wen et al., 2025).

The issue is that the cost of sequential reasoning scales linearly with the amount of exploration. Expanding the number of sequential reasoning tokens has real costs, as models risk hitting the limits of their effective context window (Hsieh et al., 2024). As intermediate exploration paths pile up, it becomes harder for the model to distinguish useful information from noise when processing its context, leading to a drop in performance known as context rot (Hong, Troynikov and Huber, 2025). Latency also increases in direct proportion to reasoning length. For complex tasks that require millions of tokens for exploration and planning, it’s not unusual for users to wait tens of minutes or even hours for a response (Qu et al., 2025). As we keep scaling along the output sequence length dimension, inference becomes slower, less reliable, and more computationally expensive.

Parallel reasoning has emerged as a natural answer to this challenge. Rather than exploring paths one after another (Gandhi et al., 2024) and bloating the context window at every step, we can let models explore multiple threads independently (threads don’t depend on each other’s context) and concurrently (threads can run at the same time).

Figure 1: Sequential vs. Parallel Reasoning

Over the past few years, a growing body of research has explored this idea across synthetic settings (such as the Countdown game (Katz, Kokel and Sreedharan, 2025)), real-world math problems, and general reasoning tasks.

From Fixed Parallelism to Adaptive Control

Current approaches demonstrate that parallel reasoning can be beneficial, but most of them still determine the parallel structure externally rather than letting the model make that choice itself.

Simple fork-and-join.

Self-consistency/Majority Voting — independently generate multiple complete reasoning traces, extract the final answer from each, and return the most frequent one (Wang et al., 2023).
Best-of-N (BoN) — similar to self-consistency, but relies on a trained verifier to pick the best solution rather than using majority voting (Stiennon et al., 2022).
While easy to implement, these methods often produce redundant computation across branches since trajectories are sampled independently.
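To make the two selection rules concrete, here is a minimal sketch in Python. The sampled answers and the scoring function are made up for illustration; in practice the answers would come from independently sampled reasoning traces and the score from a trained verifier.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most frequent final answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common sorts by count; equal counts keep first-seen order.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the answer that the verifier scores highest."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=score)


# Final answers extracted from five independently sampled traces (made up).
sampled = ["67", "67", "57", "67", "77"]
voted = majority_vote(sampled)

# Hypothetical stand-in for a trained verifier: prefers answers close to 67.
picked = best_of_n(sampled, score=lambda a: -abs(int(a) - 67))
```

Note that both rules only look at finished traces, which is why nothing stops the branches from repeating each other's work.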

Heuristic-based structured search.

Tree / Graph / Skeleton of Thoughts — a family of structured decomposition methods that explores multiple alternative “thoughts” using established search algorithms (BFS/DFS) and prunes via LLM-based evaluation (Yao et al., 2023; Besta et al., 2024; Ning et al., 2024).
Monte-Carlo Tree Search (MCTS) — estimates node values by sampling random rollouts and grows the search tree using Upper Confidence Bound (UCB) style exploration-exploitation (Xie et al., 2024; Zhang et al., 2024).
These methods improve on simple fork-and-join by breaking tasks into non-overlapping subtasks; however, they require advance knowledge of the decomposition strategy, which isn’t always available.
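For reference, the UCB-style rule typically used in MCTS (often called UCT) picks the child $j$ that maximizes the score below. The symbols are the standard ones rather than notation from the cited papers: $\bar{X}_j$ is the mean rollout value of child $j$, $n_j$ its visit count, $N$ the parent's visit count, and $c$ an exploration constant chosen by the practitioner.

```latex
\mathrm{UCT}(j) = \bar{X}_j + c \sqrt{\frac{\ln N}{n_j}}
```

The first term favors children that have looked good so far (exploitation), while the second grows for rarely visited children (exploration).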

Recent variants.

ParaThinker trains a model to operate in two fixed stages: first producing multiple reasoning threads in parallel, then combining them. The approach introduces trainable control tokens and thought-specific positional embeddings to maintain independence during reasoning and enable controlled integration during summarization through a two-phase attention mask (Wen et al., 2025).
GroupThink — multiple parallel reasoning threads can observe each other’s partial progress at the token level and adjust mid-generation. Unlike earlier concurrent methods that handle independent requests, GroupThink runs a single LLM producing multiple interdependent reasoning trajectories at the same time (Hsu et al., 2025).
Hogwild! Inference — multiple parallel reasoning threads share a KV cache and determine how to decompose tasks without an explicit coordination protocol. Workers generate concurrently into a shared attention cache, using RoPE to stitch together individual KV blocks in different orders without recomputation (Rodionov et al., 2025).

Figure 2: Various Strategies for Parallel Reasoning

The methods described above share a key limitation: the decision to parallelize, the degree of parallelization, and the search strategy are imposed on the model from the outside, regardless of whether the problem actually benefits from it. In reality, different problems demand different levels of parallelization, and this is critical to how effective parallelization truly is. For instance, a framework that applies the same parallel structure to “What’s 25+42?” and “What’s the smallest planar region in which you can continuously rotate a unit-length line segment by 180°?” is wasting compute on the simple question and likely using the wrong decomposition strategy for the harder one.

In the methods discussed earlier, the model isn’t trained to adjust its own behavior dynamically. This leads to an intriguing question: What if the model could independently determine when to use parallel processing, how many threads to create, and how to manage them based on the specific challenge it faces?

Adaptive Parallel Reasoning (APR) addresses this by integrating parallelization directly into the model’s decision-making process. More precisely, adaptivity means the model can dynamically shift resources between parallel and sequential operations while generating responses. Essentially, a model equipped with APR learns to manage its own workflow — deciding when to think step-by-step versus when to explore multiple paths simultaneously.

It’s worth clarifying that the idea of adaptive parallel reasoning was first presented in the paper Learning Adaptive Parallel Reasoning with Language Models (Pan et al., 2025), but it represents a broader framework rather than a single technique. In this discussion, APR refers to the overall framework, while “the APR method” indicates the particular implementation from Pan et al. (2025).

This approach is significant for three key reasons. Unlike Tree-of-Thoughts, APR doesn’t rely on manually crafted rules for breaking down problems. Through reinforcement learning, the model picks up universal decomposition techniques by experimenting and learning from mistakes. Remarkably, models develop effective parallelization strategies on their own — such as working on the next step while simultaneously verifying a previous one, or pursuing a primary solution alongside a backup approach — in ways that would be challenging to engineer manually (Yao et al., 2023; Wu et al., 2025; Zheng et al., 2025).

Unlike BoN, APR eliminates unnecessary repetition. APR models decide what each parallel thread will handle before splitting up. This allows the model to create a collection of distinct, non-overlapping subtasks and distribute them across separate threads (Wang et al., 2023; Stiennon et al., 2022; Pan et al., 2025; Yang et al., 2025).

Unlike fixed parallel approaches, APR can opt out of parallelizing altogether. Adaptive models can scale their parallelization to fit the problem’s difficulty relative to the cost and complexity of running parallel operations (Lian et al., 2025).

In implementation, this works by having the model generate special tokens that dictate whether to reason sequentially or in parallel. Below is a simplified ThreadWeaver-style trace: two outlines and two paths within a block, followed by the threads converging on a single boxed answer.

Figure 3: Example of an Adaptive Parallel Reasoning Trajectory from ThreadWeaver, manually condensed for ease of illustration.

Figure 4: Special Tokens Variants across Adaptive Parallel Reasoning Papers

Inference Systems for Adaptive Parallelism

How do we actually run parallel branches? We draw inspiration from computer systems, particularly multithreading and multiprocessing. Much of this work follows a fork-join pattern.

During inference, we’re essentially asking the model to carry out a map-reduce operation:

Split the problem into subtasks/threads, handle them simultaneously
Merge them into one final answer

Figure 5: Fork-join Inference Design

Concretely, the model receives a list of subtasks. It then prefills each subtask and dispatches them as separate requests for the inference engine to handle. These threads run in parallel until they reach an end token or hit the maximum length limit. The process waits until all threads complete their generation, then combines the results. This pattern is shared across various adaptive parallel reasoning methods.

However, a challenge emerges during aggregation: the content produced by different branches can’t be easily merged at the KV cache level. This happens because tokens in separate threads begin at the same position IDs, causing encoding conflicts and irregular behavior when attempting to combine KV caches. Additionally, since separate threads don’t communicate with each other, joining their KV caches creates a non-causal attention pattern that the base model never encountered during training.
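The fork-join flow described above can be sketched at the text level as follows. This is an illustration under stated assumptions, not any paper's implementation: `generate` is a hypothetical callable standing in for a request to an inference engine, and `fake_generate` is a toy replacement so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fork_join(prompt: str, subtasks: List[str],
              generate: Callable[[str], str]) -> str:
    """Fork one request per subtask, wait for all, then join the text."""
    # Fork: each thread sees the shared prefix plus its own subtask only.
    thread_prompts = [f"{prompt}\n{task}\n" for task in subtasks]
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        # map preserves input order; list() waits for every thread to finish.
        outputs = list(pool.map(generate, thread_prompts))
    # Join: concatenate subtasks and thread outputs into one sequence.
    body = "\n".join(f"{task}\n{out}" for task, out in zip(subtasks, outputs))
    return f"{prompt}\n{body}"


def fake_generate(text: str) -> str:
    """Toy stand-in for an engine call: upper-cases the last prompt line."""
    return text.strip().splitlines()[-1].upper()


result = fork_join("Q: solve", ["try algebra", "try geometry"], fake_generate)
```

Joining at the text level avoids the KV-cache merging problem entirely, at the cost of having to rebuild the cache for the joined sequence.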

To tackle this challenge, the research community has diverged into two approaches for handling the aggregation process, distinguished by whether they alter the inference engine or find ways to work around it.

Multiverse adapts the inference engine to share KV cache during the join phase. Before diving deeper into Multiverse (Yang et al., 2025)’s memory management, let’s first examine how KV cache is managed up to the “join” stage. Observe that each separate thread shares the same prefix sequence — that is, the list of subtasks. Without optimization, every thread would need to prefill and recalculate the KV cache for this prefix. However, this duplication can be prevented using SGLang’s RadixAttention (Sheng et al., 2023), which arranges multiple requests into a radix tree — a trie (prefix tree) that stores sequences of elements with varying lengths rather than individual elements. This way, only the KV cache entries from independent thread generation are newly created.
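The effect of prefix sharing can be illustrated with a toy token-level prefix tree. This is a simplified sketch and not SGLang's actual data structure: a real radix tree compresses runs of tokens into single edges and also manages eviction. The class name and token ids are hypothetical.

```python
from typing import Dict, List


class PrefixCache:
    """Toy token-level prefix tree; a real radix tree compresses edges."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> int:
        """Insert a request; return how many tokens were not yet cached."""
        node, new_tokens = self.root, 0
        for tok in tokens:
            if tok not in node:
                node[tok] = {}  # a new KV entry would be computed here
                new_tokens += 1
            node = node[tok]
        return new_tokens


cache = PrefixCache()
prefix = [1, 2, 3, 4]                          # prompt + subtask list
first = cache.insert(prefix + [10, 11])        # 6: nothing cached yet
second = cache.insert(prefix + [20, 21, 22])   # 3: shared prefix is reused
```

The second thread only pays for its own three tokens, which mirrors the claim that only the entries from independent thread generation are newly created.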

Figure 6: RadixAttention’s KV Cache Management Strategy

Now, assuming everything went smoothly, all the separate threads have returned from the inference engine. Our next challenge is figuring out how to merge them back into a single sequence to continue generating subsequent steps. As it turns out, we can reuse the KV cache from these separate threads during the merging phase. Specifically, Multiverse (Yang et al., 2025), Parallel-R1 (Zheng et al., 2025), and NPR (Wu et al., 2025) adjust the inference engine to transfer the KV cache produced by each thread and modify the page table so that it combines non-contiguous memory blocks into one continuous KV cache sequence. This sidesteps the redundant computation of a second prefill and maximizes the reuse of existing KV cache. However, this approach comes with several notable drawbacks.

First, this method requires altering the inference engine to perform unconventional memory operations, which can lead to unpredictable issues. In particular, because the synthesis request depends on KV cache from earlier requests, it introduces system fragility and the risk of faulty pointers. Another request could arrive and displace the referenced KV cache before the synthesis request finishes, forcing it to pause and redo the prefill for the previous thread request. This issue has prompted the Multiverse researchers (Yang et al., 2025) to restrict the batch size that the inference engine can process, which limits throughput.

Figure 7: KV Cache “Stitching” During Multiverse Inference

Second, this method changes how the model interprets the sequence, introducing a distributional shift that the model wasn’t pretrained on, which means more training is needed to correct its behavior. When KV caches are stitched together, the resulting sequence carries an unusual positional encoding. During independent-thread generation, every thread begins at the same position index and only attends to earlier subtasks, not to other threads. Once the threads merge, the combined KV cache has an atypical positional structure and lacks causal attention. This means the model needs significant training to adapt to this unfamiliar setup. To tackle this, Multiverse (Yang et al., 2025) and similar approaches use a modified attention mask during training that blocks independent threads from seeing each other, ensuring training and inference behaviors stay consistent.

Figure 8: Multiverse’s Attention Mask

Given these complications from non-standard KV cache handling, is there a way to avoid modifying the engine altogether?

ThreadWeaver leaves the inference engine untouched and shifts orchestration to the client side. ThreadWeaver (Lian et al., 2025) frames parallel inference as entirely a client-side concern. The “Fork” step works much like Multiverse’s, but the join phase manages memory in a fundamentally different way since it doesn’t touch engine internals. Instead, the client gathers all text outputs from the independent branches and joins them into one continuous sequence. The engine then runs a second prefill pass to build the KV cache needed for the final conclusion step. Although this adds some redundant computation that Multiverse aims to eliminate, prefill costs are far cheaper than decoding. Plus, no special attention handling is needed during inference because the second prefill uses causal attention (threads can see each other), making it straightforward to adapt standard sequential autoregressive models for this purpose.

Figure 9: ThreadWeaver’s Prefill and Decode Strategy

How do we train a model to pick up this behavior? A straightforward approach would be to split each parallel trajectory into several sequential segments that match our inference pattern. For example, we’d train the model to produce subtasks from the prompt, individual threads from the prompt plus subtask assignments, and the conclusion from the prompt plus subtasks plus their corresponding threads. But this feels repetitive and wastes compute. Can we improve on this? Yes, we can. Following ThreadWeaver (Lian et al., 2025), we can arrange a parallel trajectory into a prefix-tree (trie), flatten it into one sequence, and use an ancestor-only attention mask during training (not during inference!).

Figure 10: Building the Prefix-tree and Flattening into a single training sequence

Specifically, we set up masking and position IDs to replicate the inference behavior, so that each thread only conditions on the prompt and subtasks, without ever looking at sibling threads or the final conclusion.
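A minimal sketch of how such a mask and position IDs could be built for a flattened prefix tree is shown below. The node layout, names, and token counts are assumptions made for illustration, and the code only requires that each parent appears before its children in the list; it is not ThreadWeaver's implementation.

```python
from typing import List, Optional, Set, Tuple

# A trie node is (name, parent index or None, number of tokens).
Node = Tuple[str, Optional[int], int]


def flatten_trie(nodes: List[Node]):
    """Return (owner, position_ids, mask) for nodes laid out in list order."""
    owner: List[int] = []         # node index of each flattened token
    position_ids: List[int] = []
    start_pos: List[int] = []     # position of each node's first token
    for idx, (_, parent, length) in enumerate(nodes):
        # A child continues right after its parent's last position.
        base = 0 if parent is None else start_pos[parent] + nodes[parent][2]
        start_pos.append(base)
        owner.extend([idx] * length)
        position_ids.extend(range(base, base + length))

    def ancestors(idx: Optional[int]) -> Set[int]:
        seen: Set[int] = set()
        while idx is not None:
            seen.add(idx)
            idx = nodes[idx][1]
        return seen

    visible = [ancestors(i) for i in range(len(nodes))]
    n = len(owner)
    # mask[q][k] is True when query token q may attend to key token k:
    # k must not come later, and k's node must be q's node or an ancestor.
    mask = [[k <= q and owner[k] in visible[owner[q]] for k in range(n)]
            for q in range(n)]
    return owner, position_ids, mask


nodes = [
    ("prompt+subtasks", None, 3),
    ("thread_1", 0, 2),
    ("thread_2", 0, 2),
]
owner, position_ids, mask = flatten_trie(nodes)
# position_ids == [0, 1, 2, 3, 4, 3, 4]: both threads start at position 3.
```

In this example the first token of thread_2 (flattened index 5) can attend to the three prefix tokens and to itself, but not to the thread_1 tokens at indices 3 and 4.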

The engine-agnostic design makes adoption simple since you don’t need a separate hosting setup and can take advantage of your existing hardware infrastructure. It also automatically benefits from improvements to inference engines over time. Moreover, with an engine-agnostic approach, you can serve a hybrid model that seamlessly switches between sequential and parallel reasoning modes.

Training Models to Use Parallelism

Once the inference pathway is in place, the next challenge is teaching the model to actually use it. Demonstrations are essential because the model needs to learn how to emit special tokens that manage control flow. We discovered that the instruction-following abilities of base models aren’t enough to generate parallel threads on their own.

A thought-provoking question here is: does supervised fine-tuning (SFT) create a fundamentally new reasoning capability for parallel execution that didn’t exist before, or does it simply align the model’s pre-existing capabilities to a particular control-flow token format? Conventional wisdom says SFT imparts new knowledge, but some papers, notably Parallel-R1 (Zheng et al., 2025) and NPR (Wu et al., 2025), go against the grain and claim that their SFT demonstrations merely teach format compliance (i.e., how to structure parallel requests). We leave this question for future investigation.

Figure 11: Sources of Parallelization Demonstration Data

Demonstrations teach the mechanics of parallel control flow, but they don’t fully address the incentive problem. Ideally, rewarding only outcome accuracy would be sufficient, and the parallelization pattern would emerge organically once the model learns to produce special tokens through SFT, much like the emergence of long chain-of-thought. However, researchers (Zheng et al., 2025) found that this alone isn’t enough—parallelization incentives are indeed necessary. The question then becomes: how do we determine when the model is parallelizing effectively?

Structure-only rewards are too easy to exploit. A naive approach would be to reward the number of threads spawned. But the model can game this by spawning many short, meaningless threads. That doesn’t work. What about a binary reward for simply using the parallel structure correctly? This partially prevents the model from spamming threads, but the model still learns to spawn threads even when they’re unnecessary. The authors of Parallel-R1 (Zheng et al., 2025) tried an alternating schedule, rewarding parallel structure only 20% of the time, which successfully boosted parallel structure usage (13.6% → 63%), but had minimal effect on overall accuracy.

With this structure-only strategy, we risk straying from our original goal of boosting accuracy and cutting latency… How can we optimize for the Pareto frontier directly? Accuracy is straightforward—we just check the final outcome. What about latency?

Efficiency rewards must account for the critical path. In purely sequential trajectories, latency can be measured by the total number of tokens generated. For parallel trajectories, we can focus on the critical path (the longest chain of causally dependent tokens), since this directly determines the end-to-end generation time (i.e., wall-clock time). For instance, when there are two <parallel> sections with five threads each, the critical path runs through the longest thread from the first parallel section, then any sequential tokens, then the longest thread from the second parallel section, and so on until the end of the sequence.
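The critical path computation can be sketched in a few lines. The trajectory encoding here is an assumption made for illustration: an integer is a run of sequential tokens, and a list of integers is one parallel section holding the token count of each thread. The numbers are made up.

```python
from typing import List, Union

# int = sequential token count; list of ints = thread lengths in one section.
Segment = Union[int, List[int]]


def critical_path_length(trajectory: List[Segment]) -> int:
    """Sum sequential tokens plus the longest thread of each parallel section."""
    total = 0
    for seg in trajectory:
        if isinstance(seg, list):
            total += max(seg, default=0)  # threads run concurrently
        else:
            total += seg
    return total


def total_tokens(trajectory: List[Segment]) -> int:
    """Count every generated token, regardless of which thread produced it."""
    return sum(sum(seg) if isinstance(seg, list) else seg
               for seg in trajectory)


# Two parallel sections with five threads each, separated by sequential text.
trajectory = [100, [300, 250, 400, 120, 80], 50, [200, 220, 90, 60, 150], 30]
critical = critical_path_length(trajectory)  # 100 + 400 + 50 + 220 + 30 = 800
total = total_tokens(trajectory)             # 2050
```

Here the model spends 2,050 tokens overall, but a user would only wait for roughly 800 tokens' worth of decoding.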

Figure 12: Critical Path Length Illustration

The main objective is to shorten the critical path as much as possible. At the same time, we still want the model to use tokens to explore multiple reasoning threads simultaneously. To balance these two goals, we can aim to reduce the critical path’s share relative to the total number of tokens used. The creators of ThreadWeaver (Lian et al., 2025) defined the parallelization reward as $1 - L_{\mathrm{critical}} / L_{\mathrm{total}}$, which equals 0 for a fully sequential process and grows linearly as the critical path becomes shorter compared to the overall token count.
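As a worked example with made-up token counts (not taken from any paper), suppose a trajectory uses 2,050 tokens in total and its critical path is 800 tokens long. Writing the reward as $R_{\mathrm{parallel}}$:

```latex
R_{\mathrm{parallel}}
  = 1 - \frac{L_{\mathrm{critical}}}{L_{\mathrm{total}}}
  = 1 - \frac{800}{2050}
  \approx 0.61
```

If the same 2,050 tokens were generated in a single sequential thread, then $L_{\mathrm{critical}} = L_{\mathrm{total}}$ and the reward would be $1 - 1 = 0$.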

Parallel efficiency should depend on correctness. It makes sense that when several reasoning paths lead to correct answers, we should favor those that make better use of parallel processing. But what if none of the paths are correct? In that case, should we still give any reward? Likely not.

To put this idea into a formula, let $R = R_{\mathrm{correctness}} + R_{\mathrm{parallel}}$. If we assume correctness is binary (right or wrong), this becomes $R = \mathbf{1}(\text{Correctness}) + \mathbf{1}(\text{Correctness}) \times (\text{some parallelization metric})$. This ensures the model only receives a bonus for parallelization when its answer is correct, because there’s no point enforcing efficiency constraints if the model can’t even solve the problem.
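A minimal sketch of this gated reward follows. Plugging in $1 - L_{\mathrm{critical}} / L_{\mathrm{total}}$ as the parallelization metric is an assumption made for illustration; individual papers use different metrics and weightings, and the function name is hypothetical.

```python
def gated_reward(is_correct: bool, l_critical: int, l_total: int) -> float:
    """R = 1(correct) + 1(correct) * parallelization metric."""
    if l_total <= 0 or not 0 <= l_critical <= l_total:
        raise ValueError("expected 0 <= l_critical <= l_total and l_total > 0")
    if not is_correct:
        # A wrong answer earns nothing, however parallel the trajectory was.
        return 0.0
    # Assumed metric: share of tokens that sit off the critical path.
    parallel_metric = 1.0 - l_critical / l_total
    return 1.0 + parallel_metric


wrong_but_parallel = gated_reward(False, 10, 100)   # 0.0
correct_sequential = gated_reward(True, 100, 100)   # 1.0
correct_parallel = gated_reward(True, 50, 100)      # 1.5
```

Because the indicator multiplies the bonus, a heavily parallel but incorrect trajectory scores below a correct sequential one.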

Figure 13: Differences in Reward Designs Across Adaptive Parallel Reasoning Works

Evaluation and Open Questions

After all this, how effective are these adaptive parallel reasoning methods in practice? The answer isn’t straightforward, since different studies use different models and evaluation criteria. The choice of model depends on the training approach, how hard the problems are, and the maximum sequence length. For example, when applying supervised fine-tuning (SFT) to challenging datasets like s1k—which includes graduate-level math and science questions—researchers opted for a large base model (Qwen2.5 32B for Multiverse (Yang et al., 2025)) to handle the intricate reasoning needed. In contrast, for reinforcement learning (RL), they used smaller, non-chain-of-thought instruct models (4B or 8B parameters) due to limited computational resources.

Figure 14: Difference in Model Choice Across Adaptive Parallel Reasoning Papers

Each paper also frames the contribution of adaptive parallel reasoning differently. Because they optimize for distinct theoretical goals, they rely on different sets of metrics:

Multiverse and ThreadWeaver (Yang et al., 2025; Lian et al., 2025) aim to match the accuracy of traditional sequential autoregressive models while running faster. Multiverse demonstrates that APR models achieve higher accuracy within the same fixed context window, while ThreadWeaver shows reduced end-to-end token latency (i.e., shorter critical path) without sacrificing accuracy.
NPR (Wu et al., 2025) views falling back to sequential reasoning as a failure and instead maximizes the Genuine Parallelism Rate—the proportion of tokens processed in parallel versus total tokens.
Parallel-R1 (Zheng et al., 2025) doesn’t prioritize latency and instead focuses on exploration diversity, positioning APR as a mid-training scaffold that enhances performance during RL rather than just at inference.

Open Questions

Although Adaptive Parallel Reasoning offers a promising path toward more efficient inference-time scaling, many important questions remain unanswered.

As discussed earlier, Parallel-R1 (Zheng et al., 2025) treats APR primarily as a tool for exploration during training rather than a method for speeding up inference. This raises a deeper question: Does parallelizing reasoning at inference time actually improve accuracy, or is its main value in helping the model explore better solutions during training? Parallel-R1 suggests that the diversity introduced by parallel structures during RL may be more impactful than the parallel execution itself when the model is deployed.

Another concern is stability. Models often revert to sequential reasoning once the pressure to parallelize is removed. For instance, Parallel-R1 found that stopping the parallelization reward after just 200 training steps caused the model to abandon parallel behavior entirely. Is this due to unstable training, poorly designed rewards, or does it reflect a fundamental tension between parallel reasoning and the sequential nature of standard autoregressive pretraining?

Beyond whether APR works at all, real-world deployment brings additional challenges. Can we develop training approaches that consider the available hardware at inference time, so that parallelization decisions are guided by both the problem and the device’s capabilities?

Finally, the parallel structures explored so far are mostly flat—meaning all threads run at the same level. What happens if we allow deeper nesting, with parallelization depth greater than 1? Recursive Language Models (RLMs; Zhang, Kraska, and Khattab, 2026) have shown strong results in handling long contexts and scaling effectively at inference. How would RLMs perform if trained with end-to-end RL that encourages adaptive, hierarchical parallelization?

Acknowledgements

We thank Nicholas Tomlin and Alane Suhr for their helpful feedback. We also appreciate the insightful suggestions from Christopher Park, Karl Vilhelmsson, Nyx Iskandar, Georgia Zhou, Kaival Shah, and Jyoti Rani. Thanks go to Vijay Kethana, Jaewon Chang, Cameron Jordan, Syrielle Montariol, Erran Li, and Anya Ji for valuable discussions. Finally, we are grateful to Jiayi Pan, Xiuyu Li, and Alex Zhang for their constructive correspondence regarding Adaptive Parallel Reasoning and Recursive Language Models.
