Position: Why Is the Web a Good Environment for Studying RL?

Web environments are a natural testbed for RL with language model agents, but they have structural properties that make them fundamentally different from the single-step reasoning tasks (math, code, QA) that dominate current post-training. This post examines two of these properties through an RL lens: the interaction-heavy nature of web tasks, and the shallow but combinatorially diverse structure of their solution spaces.

What Is a Realistic Environment?

Most RL training environments for language model agents are simulated: synthetic websites with hand-crafted layouts, deterministic transitions, and a fixed set of pages. The agent learns to navigate these sandboxes, but the skills often fail to transfer to real websites — the visual complexity, layout diversity, and unpredictable behavior of the live web are absent from training.

WebGym (Bai et al., 2025) takes the opposite approach: train on real websites directly. The agent interacts with actual, live web pages through a browser, receiving raw pixel screenshots as observations and issuing click/type/scroll actions. This makes the environment realistic in several concrete ways:

  1. Visual complexity. Real websites contain logos, advertisements, dynamic content, inconsistent layouts, and decorative elements that synthetic environments omit. The agent’s vision-language model must extract task-relevant information from this noise — there is no clean, semantic abstraction layer.

  2. Stochastic transitions. The same action on a real website can produce different results depending on server state, network latency, A/B testing variants, or content updates. Unlike simulated environments where \(P(s' \vert s, a)\) is fixed, the transition dynamics are genuinely non-stationary. Some websites block automated access entirely, causing tasks to fail for reasons unrelated to the agent’s skill.

  3. Diverse task structure. Tasks span shopping, food & cooking, sports, health, entertainment, news, finance, travel, and education — each with different UI conventions, navigation patterns, and success criteria. An agent trained on WebGym cannot overfit to a single website’s layout; it must generalize across radically different interfaces.

  4. Partial observability across turns. At each turn, the agent sees only the current screenshot and a sliding window of recent history (typically 4 rounds). Older screenshots are replaced with text summaries. The agent cannot revisit past observations — it must decide what information to retain and what to discard, much like a human browsing with limited working memory (see the sketch after this list).

  5. Built-in domain randomization. In robotics, stochasticity must be artificially injected via simulation randomization — randomizing gravity, friction, object dimensions — so that policies learn to adapt rather than memorize a single simulator configuration (Ilya Sutskever’s 2018 MIT talk highlights this as the key trick for sim-to-real transfer). Web environments provide this randomization for free: server-side A/B tests, layout updates, dynamic content, network latency, and anti-bot measures all act as natural domain randomizers. An agent trained on live websites is forced to cope with this variation at every step — it cannot overfit to a fixed environment configuration. This is essentially the same principle as simulation randomization, except the stochasticity is real rather than engineered, and the resulting robustness comes without any additional design effort (DigiRL refers to this as “environmental randomness” and identifies it as a key factor in policy generalization).
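
A minimal sketch of how such a sliding-window observation might be assembled (the window size, the Turn/History containers, and the summarize callback are illustrative assumptions, not WebGym's actual interface):

```python
from dataclasses import dataclass, field

WINDOW = 4  # illustrative: keep raw screenshots for only the last 4 turns

@dataclass
class Turn:
    screenshot: bytes           # raw pixels, kept only while the turn is recent
    summary: str | None = None  # text summary once the screenshot is evicted
    action: str = ""

@dataclass
class History:
    turns: list[Turn] = field(default_factory=list)

    def append(self, screenshot: bytes, action: str, summarize) -> None:
        self.turns.append(Turn(screenshot=screenshot, action=action))
        # Older screenshots are replaced with text summaries: the agent
        # cannot revisit past pixels, only what was written down about them.
        for turn in self.turns[:-WINDOW]:
            if turn.summary is None:
                turn.summary = summarize(turn.screenshot)
                turn.screenshot = b""  # drop the pixels

    def to_prompt(self) -> list[dict]:
        # Recent turns contribute images; older turns contribute text only.
        parts = []
        for turn in self.turns:
            if turn.screenshot:
                parts.append({"type": "image", "data": turn.screenshot})
            else:
                parts.append({"type": "text", "data": turn.summary})
            parts.append({"type": "text", "data": f"action: {turn.action}"})
        return parts
```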

These properties make WebGym a challenging testbed for RL algorithms. Standard trajectory-level PPO — which works well enough in simulated environments with short horizons and deterministic dynamics — struggles here. The long horizons (10–30 steps), sparse terminal rewards, and high variance from stochastic transitions mean that assigning a single reward to an entire trajectory provides almost no usable learning signal. This is the motivation for a turn-level value function: if we cannot control the environment’s stochasticity, we can at least decompose the credit assignment problem into manageable per-step pieces.
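
Concretely, with a turn-level value function \(V(s_t)\), the one-step temporal-difference advantage

\[
A_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)
\]

replaces the trajectory-level assignment \(A_t = R(\tau) - b\), which scores every step with the same noisy terminal signal. This is the standard TD decomposition, stated generically rather than as WebGym’s exact estimator: each step’s learning signal now depends on local quantities instead of on everything else that happened in the episode.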

Interaction Over Reasoning

Single-step reasoning tasks — solving a math problem, writing a function, answering a factual question — are fully observable bandits. The model sees the entire problem upfront. All the information needed to produce the answer is present in the prompt. In this regime, longer chain-of-thought helps: more tokens of internal reasoning let the model reconstruct derivations, cross-reference constraints, and catch errors. The “think longer” paradigm from R1-style post-training exploits this directly.

Web environments break this assumption. A web agent operating in a browser faces partial observability across steps: the information needed to complete a task is distributed across multiple pages, each of which can only be accessed by taking actions in the environment. Before clicking into a webpage, no amount of reasoning can determine what is on it. Before submitting a form, no amount of planning can predict the server’s response. The environment holds information that is inaccessible to pure thought.

This creates a concrete tradeoff between thinking (generating reasoning tokens within a single step) and interacting (taking actions that change the environment state and reveal new information). In an MDP formulation, each step of the agent consists of:

  1. Observing the current browser state \(s_t\) (DOM, screenshot, URL, etc.)
  2. Reasoning internally for some number of tokens (chain-of-thought)
  3. Acting on the environment (click, type, navigate) to transition to \(s_{t+1}\)
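
A minimal sketch of this observe-reason-act loop, where the env and policy interfaces are hypothetical placeholders rather than a specific framework’s API:

```python
def run_episode(env, policy, max_steps: int = 30) -> bool:
    """Roll out one web task: observe, reason briefly within the step, act."""
    obs = env.reset()  # initial browser state: screenshot, URL, history
    for _ in range(max_steps):
        # Reasoning happens *inside* a single step: the policy emits
        # chain-of-thought tokens followed by one executable action.
        thought, action = policy.step(obs)
        obs, reward, done = env.execute(action)  # click / type / navigate
        if done:
            return reward > 0  # sparse terminal reward: task success
    return False  # horizon exhausted before completion
```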

The key insight, examined in detail in the post “Are Multi-step Agents Overthinking?”, is that for web tasks, interaction is usually more efficient than reasoning. A web agent that needs to find a product meeting certain criteria should click through candidates and check — not deliberate at length about which candidate might qualify. The information is behind the click, not inside the model’s weights.

Empirically, this shows up as a striking pattern: when agents are trained with a horizon curriculum (short horizon → long horizon), they learn to reduce reasoning tokens per step while increasing the number of environment interactions. Performance improves as thinking decreases — the opposite of what happens in single-step settings. The agent discovers on its own that the environment is a better source of information than its own chain-of-thought.

From an RL perspective, this means web environments are long-horizon MDPs where reasoning tokens are cheap to generate but uninformative, while environment steps are costly (each browser interaction has latency) yet are the only way to reveal new information. The optimal policy should minimize unnecessary deliberation at each step and instead invest its budget in taking more steps — gathering information through interaction rather than speculation. This is the opposite of the single-step regime, where the “step” is free (just generate more tokens) and the quality of reasoning within that single step determines success.

What This Means for Training

Standard RL post-training (GRPO, REINFORCE, PPO) applied to web tasks inherits a bias from the pretraining and single-step fine-tuning stages: the model has learned that longer reasoning traces correlate with better outcomes. In a web environment, this manifests as overthinking — the agent generates extensive chain-of-thought before each action, even when the action is trivial (e.g., clicking a link to see what is on the other side).

This has direct consequences for the RL optimization:

  • Trajectory length vs. reasoning length. The total token budget per episode is split between reasoning tokens and action tokens. An overthinking agent allocates too much to reasoning and too little to actions, resulting in shorter trajectories (fewer environment steps) within the same token budget. Shorter trajectories mean less information gathered, which means worse task completion.

  • Credit assignment becomes harder. Long reasoning traces within each step mean the trajectory contains thousands of tokens, most of which are internal deliberation rather than consequential decisions. Assigning credit to the right tokens — the ones that actually influenced the outcome — becomes a needle-in-a-haystack problem.

  • The horizon problem. With a fixed maximum horizon \(h\), an overthinking agent reaches the limit before completing complex tasks. A simple curriculum — starting with short horizons that force decisive action, then gradually extending — can counteract this bias, as shown in the TTI work.
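
A minimal sketch of such a horizon curriculum; the schedule values are illustrative, not the TTI paper’s settings:

```python
# Illustrative schedule: (training step threshold, max episode horizon).
# Short horizons early force decisive action; later stages allow longer runs.
HORIZON_SCHEDULE = [(0, 5), (1000, 10), (3000, 20), (6000, 30)]

def max_horizon(train_step: int) -> int:
    h = HORIZON_SCHEDULE[0][1]
    for threshold, horizon in HORIZON_SCHEDULE:
        if train_step >= threshold:
            h = horizon
    return h

# Rollouts are truncated at max_horizon(step): an overthinking policy that
# burns its budget on reasoning simply times out and fails, so the reward
# signal pushes it toward shorter thoughts and more environment actions.
```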

Stochastic Transitions: Where Web Differs from Code and Math

There is a deeper structural difference between web environments and the LLM-based environments that dominate current RL research. In math and code generation, the token-level MDP has deterministic transitions: given a state (the prefix so far) and an action (the next token), the next state is uniquely determined — just append the token. The only source of stochasticity is the policy itself. This is what makes the bandit formulation viable: the entire trajectory is determined by the model’s sampling decisions, and replaying the same sequence of tokens from the same prompt always produces the same outcome.

Web environments do not have this property. The transition function \(T(s_{t+1} \vert s_t, a_t)\) is genuinely stochastic. Clicking the same link at the same time can lead to different page states: dynamic content changes between visits, search results rerank based on server-side signals, ads and recommendations rotate, session state evolves, and network conditions affect what loads. Navigating to an e-commerce site today and navigating to it tomorrow — taking the identical action from the identical initial state — produces different product listings, different prices, different layouts. The environment has its own dynamics that the agent cannot control or fully predict.
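
Written out, the difference is stark. In the token-level MDP of math and code, the transition is a delta function on string concatenation:

\[
T_{\text{text}}(s' \mid s, a) = \mathbb{1}\big[\, s' = s \oplus a \,\big],
\]

where \(\oplus\) appends token \(a\) to prefix \(s\). On the web, \(T(s' \mid s, a)\) spreads probability over many possible next page states, and the distribution itself drifts over time.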

This has concrete consequences for RL:

Off-policy correction is harder. In deterministic-transition MDPs (math, code), the state distribution mismatch between two policies arises solely from different action choices — if both policies take the same actions, they visit the same states. In a stochastic web environment, even identical action sequences can lead to different states. This means importance-sampling (IS) based off-policy methods must account for transition stochasticity on top of policy stochasticity, making the correction factors larger and noisier.
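
One way to see this is to factorize the trajectory probability. Writing \(T_{\text{then}}\) for the environment at collection time and \(T_{\text{now}}\) for the environment at reuse time, the importance ratio between the current policy \(\pi\) and the behavior policy \(\mu\) is

\[
\frac{P_{\pi}(\tau)}{P_{\mu}(\tau)}
= \prod_{t} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}
\cdot \prod_{t} \frac{T_{\text{now}}(s_{t+1} \mid s_t, a_t)}{T_{\text{then}}(s_{t+1} \mid s_t, a_t)}.
\]

In a fixed environment the second product is identically 1 and cancels; on the live web it is unknown and cannot be estimated, so reused trajectories carry a transition-mismatch term on top of the usual policy-ratio variance.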

Replay and reproducibility break down. Deterministic transitions make it possible to “replay” a trajectory by re-executing the same actions and arriving at the same states. Web environments offer no such guarantee. A trajectory collected yesterday may not be reproducible today because the underlying pages have changed. This undermines training approaches that rely on cached rollouts or trajectory replay, and it makes evaluation noisier — the same agent can succeed or fail on the “same” task depending on the environment’s stochastic state.

Value estimation requires environment modeling. In deterministic-transition settings, a value function \(V(s_t)\) only needs to model the uncertainty from the policy’s future actions. In web environments, \(V(s_t)\) must additionally model the uncertainty from the environment’s transitions. This makes value functions harder to learn and less reliable as baselines for advantage estimation, increasing variance in policy gradient methods.
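
The Bellman expectation equation makes the extra modeling burden explicit:

\[
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \Big[ r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim T(\cdot \mid s_t, a_t)} \big[ V^{\pi}(s_{t+1}) \big] \Big].
\]

In math and code the inner expectation collapses onto a single successor state; on the web it averages over whatever the server happens to return, so the same \((s_t, a_t)\) pair yields different regression targets across rollouts.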

The environment is not a passive canvas. In math and code, the “environment” is just string concatenation — a mechanical process with no autonomy. A web environment is an active participant: servers respond, content updates, other users act, and the state evolves partly independently of the agent. This means the agent must be robust to environment variability, not just to its own stochastic sampling. Policies trained in one snapshot of the web may not transfer to another.

This stochastic-transition property places web tasks in a qualitatively different region of the RL problem space. Most current LLM post-training operates in deterministic-transition MDPs (or equivalently, contextual bandits) where the only randomness is the model’s own token sampling. Web environments are true stochastic MDPs, and algorithms designed for the deterministic case may need substantial adaptation to handle them.

Vision as a First-Class Modality

Web environments are inherently visual. The agent’s observation \(s_t\) is not a clean text string — it is a rendered webpage: a spatial layout of buttons, images, text, menus, and interactive elements arranged in two dimensions. While some web agent systems operate on the DOM (a structured text representation), the DOM is an imperfect proxy for what the user actually sees. Elements may be visually hidden, overlapping, or styled in ways that the DOM does not capture. The most faithful representation of a webpage is its screenshot — a pixel-level image.

This makes web tasks a vision-language RL problem, which introduces a challenge that does not exist in text-only settings like math or code: the model must simultaneously maintain two capabilities that can interfere with each other during optimization.

The Grounding-Reasoning Tension

Recognition (also called grounding) is the ability to perceive and locate elements in the visual observation — identifying that a particular region of pixels is a “Submit” button, that a dropdown menu is currently open, or that a specific product card matches the search criteria. This is a perceptual skill: it requires the model to map visual inputs to semantic understanding of the interface.

Reasoning is the ability to decide what to do given the perceived state — choosing which button to click, what to type, or when to navigate away. This is a cognitive skill: it requires planning, goal decomposition, and strategy.

In text-only RL (math, code), both capabilities operate in the same modality. The model reads text and produces text. There is no separate “perception” step — the input is already in the format the model reasons over. Improving reasoning through RL does not risk degrading the model’s ability to read the input.

In vision-language web RL, these capabilities are entangled but distinct. The visual encoder must extract the right features from screenshots (grounding), and the language model must use those features to make good decisions (reasoning). RL optimization applies gradients through the entire pipeline, and these gradients can improve reasoning at the cost of degrading grounding — or vice versa.

This tension manifests concretely:

  • RL rewards reasoning improvements but does not directly supervise grounding. When the agent completes a task, the reward signal tells it that the sequence of decisions was good. It does not tell the model which visual features were correctly recognized. The model could learn to reason well on easy-to-ground pages while losing the ability to ground on visually complex ones.

  • Catastrophic forgetting of visual capabilities. Aggressive RL fine-tuning can degrade the pretrained visual encoder’s representations. The model might learn a policy that works on the training distribution’s visual patterns but fails when pages look slightly different — not because it cannot reason about them, but because it can no longer see them properly. (A mitigation sketch follows this list.)

  • Joint optimization is harder than text-only optimization. In text-only RL, the optimization landscape is shaped only by reasoning quality. In vision-language RL, the landscape is shaped by the interaction between grounding accuracy and reasoning quality. A gradient step that improves reasoning may simultaneously shift the visual representations in a way that breaks grounding on other inputs. The algorithm must navigate a more complex loss surface where two objectives are coupled.
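
One common mitigation, offered here as a hedged sketch rather than any particular paper’s prescription, is to give the visual encoder a much smaller learning rate (or freeze it outright) so that RL gradients reshape decision-making faster than they can erode grounding. In PyTorch parameter groups, with hypothetical visual_encoder and language_model attribute names:

```python
import torch

def build_optimizer(model, lr: float = 1e-6, vision_lr_scale: float = 0.1):
    """Two parameter groups: `model.visual_encoder` and `model.language_model`
    are hypothetical attribute names; adapt them to the actual architecture."""
    return torch.optim.AdamW([
        # A slow (or zero) learning rate anchors the grounding features.
        {"params": model.visual_encoder.parameters(), "lr": lr * vision_lr_scale},
        # The full learning rate lets RL reshape the reasoning layers.
        {"params": model.language_model.parameters(), "lr": lr},
    ])

# vision_lr_scale = 0.0 freezes the encoder entirely, trading grounding
# plasticity for protection against catastrophic forgetting.
```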

Why This Makes Web a Challenging RL Testbed

Most current RL post-training for LLMs — GRPO on math, REINFORCE on code — operates entirely in text. The model’s “perception” is tokenization, which is fixed and not part of the optimization. This simplifies the problem enormously: RL only needs to improve what the model does with the information, not how it extracts the information.

Web environments remove this simplification. Any RL algorithm applied to a vision-language web agent must maintain both capabilities jointly. This means:

  • KL regularization has a dual role. The KL penalty \(\beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}]\) in RLHF-style objectives does not just prevent reasoning drift — it also serves as an anchor for the visual representations. Too little regularization and the visual encoder drifts; too much and the agent cannot learn new reasoning strategies. (A sketch of this penalty follows the list.)

  • Data diversity matters more. In text-only RL, the model sees the same type of input (text) across all training examples. In web RL, the visual diversity of webpages — different layouts, color schemes, font sizes, dynamic elements — means the model must maintain grounding robustness across a much wider input distribution. Training on a narrow set of websites risks overfitting the visual encoder.

  • Evaluation must test both axes. A web agent that achieves high reward on training tasks may have done so by memorizing visual patterns rather than learning generalizable grounding. Proper evaluation requires testing on visually novel pages to ensure the agent can still see before assessing whether it can reason.
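
A minimal sketch of the per-token KL-penalized objective; this is a standard RLHF-style estimator with illustrative names, not any specific codebase:

```python
import torch

def kl_regularized_loss(logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        advantages: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Per-token REINFORCE objective with a KL penalty to the reference model.
    The same beta that limits reasoning drift also anchors the visual
    representations that produced these logits."""
    kl = logp_policy - logp_ref               # per-token KL estimate (k1 estimator)
    pg = advantages.detach() * logp_policy    # policy-gradient score term
    return -(pg - beta * kl).mean()           # minimize: ascend pg, penalize KL
```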

This joint optimization challenge — improving reasoning without sacrificing grounding — is what makes web environments a particularly demanding and scientifically interesting testbed for RL. It forces algorithm designers to think about the full perception-to-action pipeline, not just the reasoning component that text-only settings isolate.

Shallow Solution Spaces with Combinatorial Diversity

The second structural property of web tasks is the shape of their solution spaces. Consider a typical web task: “Book a flight from Chicago to New York on March 15, economy class, window seat.” This task has several subgoals:

  1. Navigate to the flight booking page
  2. Enter the departure city
  3. Enter the destination city
  4. Select the date
  5. Choose economy class
  6. Select a window seat
  7. Confirm the booking

Each subgoal is individually simple — often solvable by a short, stereotyped sequence of actions (click a field, type a value, select from a dropdown). The solution to each subgoal follows a pattern that generalizes across tasks: entering a city always involves clicking the input field, clearing any existing text, typing the city name, and selecting from the autocomplete dropdown. These patterns are shallow in the sense that they require few steps and little reasoning.
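
Such a stereotyped pattern is straightforward to express as a reusable skill. A minimal sketch using Playwright’s Python API, where the selectors, URL, and autocomplete behavior are illustrative assumptions (real sites vary):

```python
from playwright.sync_api import sync_playwright

def fill_city_field(page, selector: str, city: str) -> None:
    """Reusable 'enter a city' skill: click, clear, type, select from
    the autocomplete dropdown. Selector strings are site-specific."""
    page.click(selector)                 # focus the input field
    page.fill(selector, "")              # clear any existing text
    page.type(selector, city, delay=50)  # type slowly to trigger autocomplete
    page.keyboard.press("ArrowDown")     # highlight the first suggestion
    page.keyboard.press("Enter")         # select it

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/flights")  # hypothetical booking page
    fill_city_field(page, "#from-city", "Chicago")
    fill_city_field(page, "#to-city", "New York")
    browser.close()
```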

But the subgoals interact in interesting ways. Some are sequential — you must navigate to the booking page before you can enter any details. Others are parallel — the order in which you fill in departure city, destination, and date is arbitrary. The flight booking form does not care whether you enter Chicago before New York or vice versa. This creates a combinatorial explosion in valid trajectories: for \(k\) parallel subgoals, there are \(k!\) valid orderings, each producing a different trajectory that achieves the same result.
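
The count is easy to verify by brute force. Assuming, for illustration, that navigation must come first, confirmation last, and the five middle subgoals are mutually order-free:

```python
from itertools import permutations

SUBGOALS = ["navigate", "from_city", "to_city", "date", "class", "seat", "confirm"]

# Assumed precedence constraints: navigation before everything,
# confirmation after everything; the rest are mutually order-free.
MUST_PRECEDE = [("navigate", g) for g in SUBGOALS if g != "navigate"] + \
               [(g, "confirm") for g in SUBGOALS if g != "confirm"]

def is_valid(order: tuple[str, ...]) -> bool:
    pos = {g: i for i, g in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in MUST_PRECEDE)

valid = [o for o in permutations(SUBGOALS) if is_valid(o)]
print(len(valid))  # 120 = 5! orderings of the five parallel subgoals
```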

Implications for RL

This structure has several consequences for how RL algorithms behave on web tasks:

Low depth, high breadth. The solution “tree” for a web task is wide but shallow. Each branch (subgoal ordering) leads to success, but the branches look very different as token sequences. This means the reward landscape is relatively flat — many trajectories achieve the same reward — but the policy must learn to recognize that superficially different trajectories are equally good.

Pattern reuse across tasks. Because subgoals are solved by stereotyped action sequences, a policy that learns these patterns can compose them to solve novel tasks. From an RL perspective, this is a form of temporal abstraction: the primitive actions (click, type, scroll) combine into reusable “skills” (fill a form field, navigate to a page, select from a dropdown). The policy does not need to learn each task from scratch — it needs to learn the building blocks and how to compose them.

Reward shaping is natural. Because subgoals are relatively independent, it is straightforward to define intermediate rewards for completing each subgoal. This transforms the sparse-reward problem (reward only at task completion) into a denser reward signal. Unlike in math reasoning — where intermediate “process rewards” require expensive human annotation or unreliable heuristics — web tasks have natural checkpoints: did the agent navigate to the right page? Did it fill in the correct value? These can often be verified programmatically.
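
A hedged sketch of such programmatic checkpoints; the field names, URL check, and reward weights are illustrative, and real verifiers would be task-specific:

```python
def subgoal_reward(page_url: str, form_state: dict, task: dict) -> float:
    """Dense shaping signal from programmatically verifiable checkpoints.
    `form_state` maps field names to their current values (assumed schema)."""
    reward = 0.0
    if "/flights" in page_url:                       # reached the booking page
        reward += 0.1
    if form_state.get("from") == task["from_city"]:  # correct departure city
        reward += 0.2
    if form_state.get("to") == task["to_city"]:      # correct destination
        reward += 0.2
    if form_state.get("date") == task["date"]:       # correct date
        reward += 0.2
    return reward  # the terminal task-success reward is added separately
```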

The exploration problem is mild. In many RL domains, exploration is the fundamental challenge — the agent must discover rewarding trajectories in a vast space. In web tasks, the shallow solution depth means that even random exploration has a reasonable chance of completing individual subgoals. The challenge is not finding any solution but finding efficient solutions — completing tasks in fewer steps with less wasted interaction. This shifts the RL problem from exploration to efficiency optimization, which is a qualitatively different (and arguably easier) problem.

Contrast with Single-step Reasoning

In single-step tasks like math or code generation, the solution space has the opposite structure: deep but narrow. A math proof requires a specific sequence of logical steps — skip one, and the entire derivation fails. The order matters: you cannot prove the conclusion before establishing the lemma it depends on. There is typically one (or very few) valid solution paths, and finding it requires deep sequential reasoning.

This contrast explains why the same RL algorithms behave differently on the two task types:

| Property | Single-step reasoning | Web tasks |
| --- | --- | --- |
| Solution depth | Deep (many dependent steps) | Shallow (few steps per subgoal) |
| Solution breadth | Narrow (few valid paths) | Wide (many valid orderings) |
| Key challenge | Finding the right reasoning chain | Composing known patterns efficiently |
| Thinking vs. acting | More thinking helps | More interaction helps |
| Reward structure | Sparse (final answer correctness) | Naturally decomposable into subgoals |
| Exploration difficulty | Hard (must find specific path) | Mild (many paths lead to success) |

Understanding these structural differences is important for designing RL training pipelines for web agents. Methods optimized for single-step reasoning — long chain-of-thought, extensive per-step deliberation, single-reward-at-the-end — may be actively counterproductive when transferred to web environments. The shallow, compositional structure of web tasks calls for different algorithmic choices: shorter reasoning per step, more frequent interaction, intermediate rewards aligned with subgoal completion, and training curricula that encourage acting over thinking.
