Are Multi-step Agents Overthinking?

Since R1, the dominant trend in post-training has been longer reasoning traces. For single-step tasks — math, code, QA — this works well. These are fully observable bandit problems: the model sees everything it needs upfront, and longer chain-of-thought helps it reconstruct and cross-reference information. More thinking, better answers.

But most real-world problems are not single-step. Web navigation, device control, tool use — these require a long sequence of consequential decisions before any final reward arrives. The right abstraction is an MDP, not a bandit. This raises a question that, as far as we can tell, has not been carefully examined:

Does the “think longer” paradigm from single-step post-training actually help in multi-step environments? Or are these agents spending tokens on reasoning when they should be spending steps on acting?

The Case for Overthinking

The key structural difference between single-step and multi-step tasks is partial observability across steps. After making a decision, the agent receives genuinely new information — information that was impossible to derive from reasoning alone, no matter how long the chain of thought. Before acquiring this information, the agent should not commit to an answer. And crucially, the action needed to acquire it is often trivial.

Consider a web agent that needs to find a website meeting several requirements. These requirements can only be verified after clicking into the site. Before visiting, no amount of reasoning can determine whether the site qualifies. The agent must click in, check, click out, and try the next candidate. The optimal strategy involves almost no reasoning — just systematic exploration.
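
To make the contrast concrete, here is a minimal sketch of that reasoning-free strategy. The callables (`visit`, `qualifies`, `go_back`) are hypothetical stand-ins for a browser environment, not any real agent API; the point is simply that the loop contains acting and checking, but no deliberation.

```python
from typing import Callable, Iterable, Optional

def find_qualifying_site(
    candidates: Iterable[str],
    visit: Callable[[str], str],        # fetches a page; the only way to see its contents
    qualifies: Callable[[str], bool],   # checks the task's requirements against a page
    go_back: Callable[[], None],        # returns to the results page
) -> Optional[str]:
    """Systematic exploration: act, observe, move on. No reasoning step."""
    for url in candidates:
        page = visit(url)        # act: this information is locked behind the click
        if qualifies(page):      # observe: verify the requirements post hoc
            return url
        go_back()                # act: back out and try the next candidate
    return None                  # no candidate met the requirements
```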

This suggests a possible failure mode: an agent post-trained on single-step tasks learns that long reasoning traces correlate with success. Deployed in a multi-step environment, it applies the same strategy — thinking extensively before each action. But if the information it needs is locked behind environment interactions, thinking is not a substitute for acting. The agent would be, in a precise sense, overthinking.

Is this what actually happens?

What the Experiments Show

We trained agents on multi-step web tasks under different training configurations and found patterns consistent with the overthinking hypothesis:

Observation 1: Long horizons, poor performance. With a train-time horizon of \(h = 30\), the agent produces long trajectories but achieves weak task success. One explanation: REINFORCE suffers from error accumulation over many steps. Even successful trajectories contain suboptimal actions that the agent cannot reliably reproduce at evaluation time. But another reading is that the long horizon gives the agent room to deliberate excessively at each step, compounding reasoning errors across the trajectory.

Observation 2: Short horizons help — up to a point. With \(h = 10\), performance improves substantially. The agent is forced to be decisive. But trajectory length shrinks over training. The agent learns to declare success prematurely rather than explore further. It stops overthinking, but it also stops acting — giving up on complex tasks that require information gathering across many pages.

Observation 3: No fixed horizon is satisfactory. A fixed intermediate value (\(h = 20\)) inherits both problems: it’s too generous for simple tasks (allowing overthinking) and too restrictive for hard tasks (preventing exploration). It also requires task-specific tuning, undermining scalability.

The three plots above are suggestive. Each has three lines; our method (TTI, introduced below) is green. Once training reaches the maximum horizon: (a) average trajectory length grows — the agent takes more actions; (b) information-gathering frequency increases — the agent navigates back and jumps to search engines more often; (c) reasoning token count drops rapidly — the agent thinks less.

If overthinking were not a real phenomenon, we would not expect to see reasoning length decrease as performance increases. Yet that is exactly what happens.
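
To make the horizon mechanics concrete, here is a minimal sketch of trajectory-level REINFORCE with a single terminal reward, the setting Observation 1 describes. The function name, placeholder log-probabilities, and three-step horizon are illustrative assumptions, not the paper's training code.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, reward: float) -> torch.Tensor:
    # Every action's log-probability is scaled by the same terminal reward,
    # so credit is spread uniformly over all steps up to the horizon h,
    # including the suboptimal actions inside successful trajectories.
    return -(reward * log_probs.sum())

# Placeholder log-probs for a short trajectory (h = 3 for illustration).
log_probs = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)
loss = reinforce_loss(log_probs, reward=1.0)  # terminal task success
loss.backward()
print(log_probs.grad)  # identical gradient per step: tensor([-1., -1., -1.])
```

With \(h = 30\), the same scalar must explain ten times as many decisions, which is one way to read the error accumulation described above.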

A Simple Intervention: Horizon Curriculum

If agents are indeed overthinking, what can we do about it? Rather than explicitly constraining reasoning length (which would be a brittle, task-specific fix), we tested whether the training dynamics alone could teach the agent to reason less.

The idea is a horizon curriculum — start small and grow (a schedule sketch follows the list):

  1. Begin with a short horizon (\(h = 10\)). The agent learns basic environment dynamics and solves easy tasks, where extended reasoning is unnecessary.
  2. Gradually increase the horizon (\(10 \to 20 \to 30\)). The agent encounters progressively harder tasks requiring genuine multi-step exploration. Because it already learned to act efficiently at shorter horizons, it extends that efficient behavior rather than regressing into overthinking.
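
Here is a minimal sketch of one way to implement the schedule, assuming equal-length training phases. The \(10 \to 20 \to 30\) values come from the text above, but the phase boundaries, the function name, and the hypothetical `rollout()` helper are our assumptions, not the paper's recipe.

```python
def horizon_schedule(train_step: int, total_steps: int,
                     horizons: tuple = (10, 20, 30)) -> int:
    """Horizon curriculum: split training into equal-length phases and grow
    the per-episode step budget from phase to phase (10 -> 20 -> 30)."""
    phase_len = max(1, total_steps // len(horizons))
    phase = min(train_step // phase_len, len(horizons) - 1)
    return horizons[phase]

# Inside the training loop, the schedule caps rollout length:
#   h = horizon_schedule(step, total_steps)
#   trajectory = rollout(policy, env, max_steps=h)  # hypothetical rollout()
```

Keeping the cap a pure function of training progress means the curriculum composes with any policy-gradient update; nothing else in the training loop has to change.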

We call this Test-Time Interaction (TTI). The curriculum forces the agent to build good habits early — act quickly, gather information, don’t speculate — and then applies those habits to harder problems.

The striking finding is that TTI imposes no constraints on chain-of-thought length. The reduction in reasoning emerges entirely from the training dynamics. The agent discovers on its own that thinking less and acting more is the better strategy. This emergent property is important: it suggests the overthinking problem is real and that the agent can learn to correct it, given the right training structure.

Results

TTI outperforms both fixed-short and fixed-long horizons on WebVoyager and WebArena. The performance gains correlate with the agent learning to reallocate its compute budget: less internal reasoning per step, more environment interaction overall.

How General Is This?

The overthinking hypothesis, if correct, should apply beyond web agents. Any multi-step setting where the environment provides information that cannot be derived internally would exhibit it: robotics (try a grasp instead of planning it perfectly), tool use (call the API instead of predicting its output), dialogue (ask a clarifying question instead of guessing the user’s intent).

The common thread is that in these settings, the environment is a better source of information than the model’s own reasoning. An agent that recognizes this should think just enough to choose a reasonable next action, then act and observe. The optimal reasoning length per step may actually decrease with better training, as the agent learns to offload cognitive work to the environment.
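
As a sketch of what "think just enough, then act and observe" means operationally, here is a generic episode loop under an assumed Gym-style interface. The `policy` and `env` objects and their signatures are hypothetical, not tied to any particular framework.

```python
def run_episode(policy, env, max_steps: int) -> float:
    """Act-and-observe loop: a short rationale per step, then defer to the
    environment for new information instead of reasoning further."""
    obs = env.reset()
    for _ in range(max_steps):
        rationale, action = policy(obs)       # think just enough to pick an action
        obs, reward, done = env.step(action)  # offload cognition to the environment
        if done:
            return reward
    return 0.0  # step budget exhausted without terminating
```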

Whether this pattern holds across domains — and whether the horizon curriculum is the right intervention in all cases — remains to be tested. But the evidence from web tasks is consistent with a simple thesis: multi-step agents trained with standard methods are spending too many tokens thinking and too few steps acting.

Takeaway. There is growing evidence that multi-step agents inherit a “think longer” bias from single-step post-training that actively hurts them. A simple horizon curriculum (short → long) lets the agent discover on its own that less reasoning and more interaction leads to better outcomes — without any explicit constraints on chain-of-thought length.

Paper: https://arxiv.org/pdf/2506.07976
