Position: Why Is the Web a Good Environment for Studying RL?

Web environments are a natural testbed for RL with language model agents, but they have structural properties that make them fundamentally different from the single-step reasoning tasks (math, code, QA) that dominate current post-training. This post examines two of these properties through an RL lens: the interaction-heavy nature of web tasks, and the shallow but combinatorially diverse structure of their solution spaces.

What Is a Realistic Environment?

Most RL training environments for language model agents are simulated: synthetic websites with hand-crafted layouts, deterministic transitions, and a fixed set of pages. The agent learns to navigate these sandboxes, but the skills often fail to transfer to real websites — the visual complexity, layout diversity, and unpredictable behavior of the live web are absent from training.

WebGym (Bai et al., 2025) takes the opposite approach: train on real websites directly. The agent interacts with actual, live web pages through a browser, receiving raw pixel screenshots as observations and issuing click/type/scroll actions. This makes the environment realistic in several concrete ways:

  1. Visual complexity. Real websites contain logos, advertisements, dynamic content, inconsistent layouts, and decorative elements that synthetic environments omit. The agent’s vision-language model must extract task-relevant information from this noise — there is no clean, semantic abstraction layer.

  2. Stochastic transitions. The same action on a real website can produce different results depending on server state, network latency, A/B testing variants, or content updates. Unlike simulated environments where \(P(s' \vert s, a)\) is fixed, the transition dynamics are genuinely non-stationary. Some websites block automated access entirely, causing tasks to fail for reasons unrelated to the agent’s skill.

  3. Diverse task structure. Tasks span shopping, food & cooking, sports, health, entertainment, news, finance, travel, and education — each with different UI conventions, navigation patterns, and success criteria. An agent trained on WebGym cannot overfit to a single website’s layout; it must generalize across radically different interfaces.

  4. Partial observability across turns. At each turn, the agent sees only the current screenshot and a sliding window of recent history (typically 4 rounds). Older screenshots are replaced with text summaries. The agent cannot revisit past observations — it must decide what information to retain and what to discard, much like a human browsing with limited working memory (see the sketch after this list).

  5. Built-in domain randomization. In robotics, stochasticity must be artificially injected via simulation randomization — randomizing gravity, friction, object dimensions — so that policies learn to adapt rather than memorize a single simulator configuration (Ilya Sutskever’s 2018 MIT talk highlights this as the key trick for sim-to-real transfer). Web environments provide this randomization for free: server-side A/B tests, layout updates, dynamic content, network latency, and anti-bot measures all act as natural domain randomizers. An agent trained on live websites is forced to cope with this variation at every step — it cannot overfit to a fixed environment configuration. This is essentially the same principle as simulation randomization, except the stochasticity is real rather than engineered, and the resulting robustness comes without any additional design effort (DigiRL refers to this as “environmental randomness” and identifies it as a key factor in policy generalization).
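
A minimal sketch of how such a sliding-window observation might be assembled (the window size, the Turn/History containers, and the summarize callback are illustrative assumptions, not WebGym's actual interface):

```python
from dataclasses import dataclass, field

WINDOW = 4  # illustrative: keep raw screenshots for only the last 4 turns

@dataclass
class Turn:
    screenshot: bytes           # raw pixels, kept only while the turn is recent
    summary: str | None = None  # text summary once the screenshot is evicted
    action: str = ""

@dataclass
class History:
    turns: list[Turn] = field(default_factory=list)

    def append(self, screenshot: bytes, action: str, summarize) -> None:
        self.turns.append(Turn(screenshot=screenshot, action=action))
        # Older screenshots are replaced with text summaries: the agent
        # cannot revisit past pixels, only what was written down about them.
        for turn in self.turns[:-WINDOW]:
            if turn.summary is None:
                turn.summary = summarize(turn.screenshot)
                turn.screenshot = b""  # drop the pixels

    def to_prompt(self) -> list[dict]:
        # Recent turns contribute images; older turns contribute text only.
        parts = []
        for turn in self.turns:
            if turn.screenshot:
                parts.append({"type": "image", "data": turn.screenshot})
            else:
                parts.append({"type": "text", "data": turn.summary})
            parts.append({"type": "text", "data": f"action: {turn.action}"})
        return parts
```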

These properties make WebGym a challenging testbed for RL algorithms. Standard trajectory-level PPO — which works well enough in simulated environments with short horizons and deterministic dynamics — struggles here. The long horizons (10–30 steps), sparse terminal rewards, and high variance from stochastic transitions mean that assigning a single reward to an entire trajectory provides almost no usable learning signal. This is the motivation for a turn-level value function: if we cannot control the environment’s stochasticity, we can at least decompose the credit assignment problem into manageable per-step pieces.
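
Concretely, with a turn-level value function \(V(s_t)\), the one-step temporal-difference advantage

\[
A_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)
\]

replaces the trajectory-level assignment \(A_t = R(\tau) - b\), which scores every step with the same noisy terminal signal. This is the standard TD decomposition, stated generically rather than as WebGym’s exact estimator: each step’s learning signal now depends on local quantities instead of on everything else that happened in the episode.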

Interaction Over Reasoning

Single-step reasoning tasks — solving a math problem, writing a function, answering a factual question — are fully observable bandits. The model sees the entire problem upfront. All the information needed to produce the answer is present in the prompt. In this regime, longer chain-of-thought helps: more tokens of internal reasoning let the model reconstruct derivations, cross-reference constraints, and catch errors. The “think longer” paradigm from R1-style post-training exploits this directly.

Web environments break this assumption. A web agent operating in a browser faces partial observability across steps: the information needed to complete a task is distributed across multiple pages, each of which can only be accessed by taking actions in the environment. Before clicking into a webpage, no amount of reasoning can determine what is on it. Before submitting a form, no amount of planning can predict the server’s response. The environment holds information that is inaccessible to pure thought.

This creates a concrete tradeoff between thinking (generating reasoning tokens within a single step) and interacting (taking actions that change the environment state and reveal new information). In an MDP formulation, each step of the agent consists of:

  1. Observing the current browser state \(s_t\) (DOM, screenshot, URL, etc.)
  2. Reasoning internally for some number of tokens (chain-of-thought)
  3. Acting on the environment (click, type, navigate) to transition to \(s_{t+1}\)
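
A minimal sketch of this observe-reason-act loop, where the env and policy interfaces are hypothetical placeholders rather than a specific framework’s API:

```python
def run_episode(env, policy, max_steps: int = 30) -> bool:
    """Roll out one web task: observe, reason briefly within the step, act."""
    obs = env.reset()  # initial browser state: screenshot, URL, history
    for _ in range(max_steps):
        # Reasoning happens *inside* a single step: the policy emits
        # chain-of-thought tokens followed by one executable action.
        thought, action = policy.step(obs)
        obs, reward, done = env.execute(action)  # click / type / navigate
        if done:
            return reward > 0  # sparse terminal reward: task success
    return False  # horizon exhausted before completion
```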

The key insight, examined in detail in the post “Are Multi-step Agents Overthinking?”, is that for web tasks, interaction is usually more efficient than reasoning. A web agent that needs to find a product meeting certain criteria should click through candidates and check — not deliberate at length about which candidate might qualify. The information is behind the click, not inside the model’s weights.

Empirically, this shows up as a striking pattern: when agents are trained with a horizon curriculum (short horizon → long horizon), they learn to reduce reasoning tokens per step while increasing the number of environment interactions. Performance improves as thinking decreases — the opposite of what happens in single-step settings. The agent discovers on its own that the environment is a better source of information than its own chain-of-thought.

From an RL perspective, this means web environments are long-horizon MDPs where reasoning tokens are cheap to generate but uninformative, while environment steps are costly (each browser interaction has latency) yet are the only way to reveal new information. The optimal policy should minimize unnecessary deliberation at each step and instead invest its budget in taking more steps — gathering information through interaction rather than speculation. This is the opposite of the single-step regime, where the “step” is free (just generate more tokens) and the quality of reasoning within that single step determines success.

What This Means for Training

Standard RL post-training (GRPO, REINFORCE, PPO) applied to web tasks inherits a bias from the pretraining and single-step fine-tuning stages: the model has learned that longer reasoning traces correlate with better outcomes. In a web environment, this manifests as overthinking — the agent generates extensive chain-of-thought before each action, even when the action is trivial (e.g., clicking a link to see what is on the other side).

This has direct consequences for the RL optimization:

  • Trajectory length vs. reasoning length. The total token budget per episode is split between reasoning tokens and action tokens. An overthinking agent allocates too much to reasoning and too little to actions, resulting in shorter trajectories (fewer environment steps) within the same token budget. Shorter trajectories mean less information gathered, which means worse task completion.

  • Credit assignment becomes harder. Long reasoning traces within each step mean the trajectory contains thousands of tokens, most of which are internal deliberation rather than consequential decisions. Assigning credit to the right tokens — the ones that actually influenced the outcome — becomes a needle-in-a-haystack problem.

  • The horizon problem. With a fixed maximum horizon \(h\), an overthinking agent reaches the limit before completing complex tasks. A simple curriculum — starting with short horizons that force decisive action, then gradually extending — can counteract this bias, as shown in the TTI work.
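
A minimal sketch of such a horizon curriculum; the schedule values are illustrative, not the TTI paper’s settings:

```python
# Illustrative schedule: (training step threshold, max episode horizon).
# Short horizons early force decisive action; later stages allow longer runs.
HORIZON_SCHEDULE = [(0, 5), (1000, 10), (3000, 20), (6000, 30)]

def max_horizon(train_step: int) -> int:
    h = HORIZON_SCHEDULE[0][1]
    for threshold, horizon in HORIZON_SCHEDULE:
        if train_step >= threshold:
            h = horizon
    return h

# Rollouts are truncated at max_horizon(step): an overthinking policy that
# burns its budget on reasoning simply times out and fails, so the reward
# signal pushes it toward shorter thoughts and more environment actions.
```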

Stochastic Transitions: Where Web Differs from Code and Math

There is a deeper structural difference between web environments and the LLM-based environments that dominate current RL research. In math and code generation, the token-level MDP has deterministic transitions: given a state (the prefix so far) and an action (the next token), the next state is uniquely determined — just append the token. The only source of stochasticity is the policy itself. This is what makes the bandit formulation viable: the entire trajectory is determined by the model’s sampling decisions, and replaying the same sequence of tokens from the same prompt always produces the same outcome.

Web environments do not have this property. The transition function \(T(s_{t+1} \vert s_t, a_t)\) is genuinely stochastic. Clicking the same link at the same time can lead to different page states: dynamic content changes between visits, search results rerank based on server-side signals, ads and recommendations rotate, session state evolves, and network conditions affect what loads. Navigating to an e-commerce site today and navigating to it tomorrow — taking the identical action from the identical initial state — produces different product listings, different prices, different layouts. The environment has its own dynamics that the agent cannot control or fully predict.
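
Written out, the difference is stark. In the token-level MDP of math and code, the transition is a delta function on string concatenation:

\[
T_{\text{text}}(s' \mid s, a) = \mathbb{1}\big[\, s' = s \oplus a \,\big],
\]

where \(\oplus\) appends token \(a\) to prefix \(s\). On the web, \(T(s' \mid s, a)\) spreads probability over many possible next page states, and the distribution itself drifts over time.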

This has concrete consequences for RL:

Off-policy correction is harder. In deterministic-transition MDPs (math, code), the state distribution mismatch between two policies arises solely from different action choices — if both policies take the same actions, they visit the same states. In a stochastic web environment, even identical action sequences can lead to different states. This means importance-sampling (IS) based off-policy methods must account for transition stochasticity on top of policy stochasticity, making the correction factors larger and noisier.
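
One way to see this is to factorize the trajectory probability. Writing \(T_{\text{then}}\) for the environment at collection time and \(T_{\text{now}}\) for the environment at reuse time, the importance ratio between the current policy \(\pi\) and the behavior policy \(\mu\) is

\[
\frac{P_{\pi}(\tau)}{P_{\mu}(\tau)}
= \prod_{t} \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}
\cdot \prod_{t} \frac{T_{\text{now}}(s_{t+1} \mid s_t, a_t)}{T_{\text{then}}(s_{t+1} \mid s_t, a_t)}.
\]

In a fixed environment the second product is identically 1 and cancels; on the live web it is unknown and cannot be estimated, so reused trajectories carry a transition-mismatch term on top of the usual policy-ratio variance.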

Replay and reproducibility break down. Deterministic transitions make it possible to “replay” a trajectory by re-executing the same actions and arriving at the same states. Web environments offer no such guarantee. A trajectory collected yesterday may not be reproducible today because the underlying pages have changed. This undermines training approaches that rely on cached rollouts or trajectory replay, and it makes evaluation noisier — the same agent can succeed or fail on the “same” task depending on the environment’s stochastic state.

Value estimation requires environment modeling. In deterministic-transition settings, a value function \(V(s_t)\) only needs to model the uncertainty from the policy’s future actions. In web environments, \(V(s_t)\) must additionally model the uncertainty from the environment’s transitions. This makes value functions harder to learn and less reliable as baselines for advantage estimation, increasing variance in policy gradient methods.
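
The Bellman expectation equation makes the extra modeling burden explicit:

\[
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \Big[ r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim T(\cdot \mid s_t, a_t)} \big[ V^{\pi}(s_{t+1}) \big] \Big].
\]

In math and code the inner expectation collapses onto a single successor state; on the web it averages over whatever the server happens to return, so the same \((s_t, a_t)\) pair yields different regression targets across rollouts.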

The environment is not a passive canvas. In math and code, the “environment” is just string concatenation — a mechanical process with no autonomy. A web environment is an active participant: servers respond, content updates, other users act, and the state evolves partly independently of the agent. This means the agent must be robust to environment variability, not just to its own stochastic sampling. Policies trained in one snapshot of the web may not transfer to another.

This stochastic-transition property places web tasks in a qualitatively different region of the RL problem space. Most current LLM post-training operates in deterministic-transition MDPs (or equivalently, contextual bandits) where the only randomness is the model’s own token sampling. Web environments are true stochastic MDPs, and algorithms designed for the deterministic case may need substantial adaptation to handle them.

Vision as a First-Class Modality

Web environments are inherently visual. The agent’s observation \(s_t\) is not a clean text string — it is a rendered webpage: a spatial layout of buttons, images, text, menus, and interactive elements arranged in two dimensions. While some web agent systems operate on the DOM (a structured text representation), the DOM is an imperfect proxy for what the user actually sees. Elements may be visually hidden, overlapping, or styled in ways that the DOM does not capture. The most faithful representation of a webpage is its screenshot — a pixel-level image.

This makes web tasks a vision-language RL problem, which introduces a challenge that does not exist in text-only settings like math or code: the model must simultaneously maintain two capabilities that can interfere with each other during optimization.

The Grounding-Reasoning Tension

Recognition (also called grounding) is the ability to perceive and locate elements in the visual observation — identifying that a particular region of pixels is a “Submit” button, that a dropdown menu is currently open, or that a specific product card matches the search criteria. This is a perceptual skill: it requires the model to map visual inputs to semantic understanding of the interface.

Reasoning is the ability to decide what to do given the perceived state — choosing which button to click, what to type, or when to navigate away. This is a cognitive skill: it requires planning, goal decomposition, and strategy.

In text-only RL (math, code), both capabilities operate in the same modality. The model reads text and produces text. There is no separate “perception” step — the input is already in the format the model reasons over. Improving reasoning through RL does not risk degrading the model’s ability to read the input.

In vision-language web RL, these capabilities are entangled but distinct. The visual encoder must extract the right features from screenshots (grounding), and the language model must use those features to make good decisions (reasoning). RL optimization applies gradients through the entire pipeline, and these gradients can improve reasoning at the cost of degrading grounding — or vice versa.

This tension manifests concretely:

  • RL rewards reasoning improvements but does not directly supervise grounding. When the agent completes a task, the reward signal tells it that the sequence of decisions was good. It does not tell the model which visual features were correctly recognized. The model could learn to reason well on easy-to-ground pages while losing the ability to ground on visually complex ones.

  • Catastrophic forgetting of visual capabilities. Aggressive RL fine-tuning can degrade the pretrained visual encoder’s representations. The model might learn a policy that works on the training distribution’s visual patterns but fails when pages look slightly different — not because it cannot reason about them, but because it can no longer see them properly. (A mitigation sketch follows this list.)

  • Joint optimization is harder than text-only optimization. In text-only RL, the optimization landscape is shaped only by reasoning quality. In vision-language RL, the landscape is shaped by the interaction between grounding accuracy and reasoning quality. A gradient step that improves reasoning may simultaneously shift the visual representations in a way that breaks grounding on other inputs. The algorithm must navigate a more complex loss surface where two objectives are coupled.
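
One common mitigation, offered here as a hedged sketch rather than any particular paper’s prescription, is to give the visual encoder a much smaller learning rate (or freeze it outright) so that RL gradients reshape decision-making faster than they can erode grounding. In PyTorch parameter groups, with hypothetical visual_encoder and language_model attribute names:

```python
import torch

def build_optimizer(model, lr: float = 1e-6, vision_lr_scale: float = 0.1):
    """Two parameter groups: `model.visual_encoder` and `model.language_model`
    are hypothetical attribute names; adapt them to the actual architecture."""
    return torch.optim.AdamW([
        # A slow (or zero) learning rate anchors the grounding features.
        {"params": model.visual_encoder.parameters(), "lr": lr * vision_lr_scale},
        # The full learning rate lets RL reshape the reasoning layers.
        {"params": model.language_model.parameters(), "lr": lr},
    ])

# vision_lr_scale = 0.0 freezes the encoder entirely, trading grounding
# plasticity for protection against catastrophic forgetting.
```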

Why This Makes Web a Challenging RL Testbed

Most current RL post-training for LLMs — GRPO on math, REINFORCE on code — operates entirely in text. The model’s “perception” is tokenization, which is fixed and not part of the optimization. This simplifies the problem enormously: RL only needs to improve what the model does with the information, not how it extracts the information.

Web environments remove this simplification. Any RL algorithm applied to a vision-language web agent must maintain both capabilities jointly. This means:

  • KL regularization has a dual role. The KL penalty \(\beta \, \text{KL}[\pi_\theta \| \pi_{\text{ref}}]\) in RLHF-style objectives does not just prevent reasoning drift — it also serves as an anchor for the visual representations. Too little regularization and the visual encoder drifts; too much and the agent cannot learn new reasoning strategies. (A sketch of this penalty follows the list.)

  • Data diversity matters more. In text-only RL, the model sees the same type of input (text) across all training examples. In web RL, the visual diversity of webpages — different layouts, color schemes, font sizes, dynamic elements — means the model must maintain grounding robustness across a much wider input distribution. Training on a narrow set of websites risks overfitting the visual encoder.

  • Evaluation must test both axes. A web agent that achieves high reward on training tasks may have done so by memorizing visual patterns rather than learning generalizable grounding. Proper evaluation requires testing on visually novel pages to ensure the agent can still see before assessing whether it can reason.
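
A minimal sketch of the per-token KL-penalized objective; this is a standard RLHF-style estimator with illustrative names, not any specific codebase:

```python
import torch

def kl_regularized_loss(logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        advantages: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Per-token REINFORCE objective with a KL penalty to the reference model.
    The same beta that limits reasoning drift also anchors the visual
    representations that produced these logits."""
    kl = logp_policy - logp_ref               # per-token KL estimate (k1 estimator)
    pg = advantages.detach() * logp_policy    # policy-gradient score term
    return -(pg - beta * kl).mean()           # minimize: ascend pg, penalize KL
```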

This joint optimization challenge — improving reasoning without sacrificing grounding — is what makes web environments a particularly demanding and scientifically interesting testbed for RL. It forces algorithm designers to think about the full perception-to-action pipeline, not just the reasoning component that text-only settings isolate.

Shallow Solution Spaces with Combinatorial Diversity

The second structural property of web tasks is the shape of their solution spaces. Consider a typical web task: “Book a flight from Chicago to New York on March 15, economy class, window seat.” This task has several subgoals:

  1. Navigate to the flight booking page
  2. Enter the departure city
  3. Enter the destination city
  4. Select the date
  5. Choose economy class
  6. Select a window seat
  7. Confirm the booking

Each subgoal is individually simple — often solvable by a short, stereotyped sequence of actions (click a field, type a value, select from a dropdown). The solution to each subgoal follows a pattern that generalizes across tasks: entering a city always involves clicking the input field, clearing any existing text, typing the city name, and selecting from the autocomplete dropdown. These patterns are shallow in the sense that they require few steps and little reasoning.
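
Such a stereotyped pattern is straightforward to express as a reusable skill. A minimal sketch using Playwright’s Python API, where the selectors, URL, and autocomplete behavior are illustrative assumptions (real sites vary):

```python
from playwright.sync_api import sync_playwright

def fill_city_field(page, selector: str, city: str) -> None:
    """Reusable 'enter a city' skill: click, clear, type, select from
    the autocomplete dropdown. Selector strings are site-specific."""
    page.click(selector)                 # focus the input field
    page.fill(selector, "")              # clear any existing text
    page.type(selector, city, delay=50)  # type slowly to trigger autocomplete
    page.keyboard.press("ArrowDown")     # highlight the first suggestion
    page.keyboard.press("Enter")         # select it

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/flights")  # hypothetical booking page
    fill_city_field(page, "#from-city", "Chicago")
    fill_city_field(page, "#to-city", "New York")
    browser.close()
```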

But the subgoals interact in interesting ways. Some are sequential — you must navigate to the booking page before you can enter any details. Others are parallel — the order in which you fill in departure city, destination, and date is arbitrary. The flight booking form does not care whether you enter Chicago before New York or vice versa. This creates a combinatorial explosion in valid trajectories: for \(k\) parallel subgoals, there are \(k!\) valid orderings, each producing a different trajectory that achieves the same result.
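
The count is easy to verify by brute force. Assuming, for illustration, that navigation must come first, confirmation last, and the five middle subgoals are mutually order-free:

```python
from itertools import permutations

SUBGOALS = ["navigate", "from_city", "to_city", "date", "class", "seat", "confirm"]

# Assumed precedence constraints: navigation before everything,
# confirmation after everything; the rest are mutually order-free.
MUST_PRECEDE = [("navigate", g) for g in SUBGOALS if g != "navigate"] + \
               [(g, "confirm") for g in SUBGOALS if g != "confirm"]

def is_valid(order: tuple[str, ...]) -> bool:
    pos = {g: i for i, g in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in MUST_PRECEDE)

valid = [o for o in permutations(SUBGOALS) if is_valid(o)]
print(len(valid))  # 120 = 5! orderings of the five parallel subgoals
```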

Implications for RL

This structure has several consequences for how RL algorithms behave on web tasks:

Low depth, high breadth. The solution “tree” for a web task is wide but shallow. Each branch (subgoal ordering) leads to success, but the branches look very different as token sequences. This means the reward landscape is relatively flat — many trajectories achieve the same reward — but the policy must learn to recognize that superficially different trajectories are equally good.

Pattern reuse across tasks. Because subgoals are solved by stereotyped action sequences, a policy that learns these patterns can compose them to solve novel tasks. From an RL perspective, this is a form of temporal abstraction: the primitive actions (click, type, scroll) combine into reusable “skills” (fill a form field, navigate to a page, select from a dropdown). The policy does not need to learn each task from scratch — it needs to learn the building blocks and how to compose them.

Reward shaping is natural. Because subgoals are relatively independent, it is straightforward to define intermediate rewards for completing each subgoal. This transforms the sparse-reward problem (reward only at task completion) into a denser reward signal. Unlike in math reasoning — where intermediate “process rewards” require expensive human annotation or unreliable heuristics — web tasks have natural checkpoints: did the agent navigate to the right page? Did it fill in the correct value? These can often be verified programmatically.
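
A hedged sketch of such programmatic checkpoints; the field names, URL check, and reward weights are illustrative, and real verifiers would be task-specific:

```python
def subgoal_reward(page_url: str, form_state: dict, task: dict) -> float:
    """Dense shaping signal from programmatically verifiable checkpoints.
    `form_state` maps field names to their current values (assumed schema)."""
    reward = 0.0
    if "/flights" in page_url:                       # reached the booking page
        reward += 0.1
    if form_state.get("from") == task["from_city"]:  # correct departure city
        reward += 0.2
    if form_state.get("to") == task["to_city"]:      # correct destination
        reward += 0.2
    if form_state.get("date") == task["date"]:       # correct date
        reward += 0.2
    return reward  # the terminal task-success reward is added separately
```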

The exploration problem is mild. In many RL domains, exploration is the fundamental challenge — the agent must discover rewarding trajectories in a vast space. In web tasks, the shallow solution depth means that even random exploration has a reasonable chance of completing individual subgoals. The challenge is not finding any solution but finding efficient solutions — completing tasks in fewer steps with less wasted interaction. This shifts the RL problem from exploration to efficiency optimization, which is a qualitatively different (and arguably easier) problem.

Contrast with Single-step Reasoning

In single-step tasks like math or code generation, the solution space has the opposite structure: deep but narrow. A math proof requires a specific sequence of logical steps — skip one, and the entire derivation fails. The order matters: you cannot prove the conclusion before establishing the lemma it depends on. There is typically one (or very few) valid solution paths, and finding it requires deep sequential reasoning.

This contrast explains why the same RL algorithms behave differently on the two task types:

| Property | Single-step reasoning | Web tasks |
| --- | --- | --- |
| Solution depth | Deep (many dependent steps) | Shallow (few steps per subgoal) |
| Solution breadth | Narrow (few valid paths) | Wide (many valid orderings) |
| Key challenge | Finding the right reasoning chain | Composing known patterns efficiently |
| Thinking vs. acting | More thinking helps | More interaction helps |
| Reward structure | Sparse (final answer correctness) | Naturally decomposable into subgoals |
| Exploration difficulty | Hard (must find specific path) | Mild (many paths lead to success) |

Understanding these structural differences is important for designing RL training pipelines for web agents. Methods optimized for single-step reasoning — long chain-of-thought, extensive per-step deliberation, single-reward-at-the-end — may be actively counterproductive when transferred to web environments. The shallow, compositional structure of web tasks calls for different algorithmic choices: shorter reasoning per step, more frequent interaction, intermediate rewards aligned with subgoal completion, and training curricula that encourage acting over thinking.
