Is Pre-training Hitting a Wall?

There is a growing sense — not yet a consensus, but a real undercurrent — that pre-training as we know it is running into diminishing returns. Not because models are too small or compute is insufficient, but because the paradigm itself may have structural limits. This post tries to articulate what those limits might be, drawing on a discussion with Aviral Kumar.

The question is simple: why is pre-training not scaling as well as we’d like, and what would it take to fix it?

The Standard Story

The standard story of scaling is compelling. Take a transformer, train it on more data with more compute, and performance improves predictably. Scaling laws hold. Loss goes down. Benchmarks improve.

But there are signs that this story is incomplete. Beyond a certain scale, improvements become harder to achieve. Models get larger but not proportionally more capable. The low-hanging fruit — learning syntax, common knowledge, frequent patterns — was picked long ago. What remains is harder, and the question is whether next-token prediction is the right objective for learning it.

Spurious Correlations, Not Representability

A natural first hypothesis is that the model architecture is the bottleneck: maybe transformers lack the representational capacity to capture certain kinds of knowledge. But this is probably not the primary issue. Transformers are highly expressive function approximators. The problem is more likely in the data and the objective.

The core issue is spurious correlations. When you train an autoregressive model to predict the next token on a massive corpus, it learns whatever statistical patterns minimize the loss. Many of these patterns are genuine — syntactic rules, factual associations, reasoning chains. But many are spurious: correlations that hold in the training data but do not reflect causal or logical structure.

Beyond a certain point, there may simply not be much more to extract from the same data via next-token prediction. The model has already captured the “real” patterns; what remains are increasingly subtle spurious correlations that don’t generalize. More compute on the same data and the same objective yields diminishing returns — not because the model can’t fit the data, but because fitting the data more tightly means fitting more noise.
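The finite-sample mechanism behind this can be shown with a toy experiment (an illustrative sketch, not a claim about any real training run): the true data-generating process has no correlation between a distractor feature and the label, yet small training samples routinely manufacture one, and a model that fits the sample tightly will fit that noise.

```python
import random

random.seed(0)

def sample_pair():
    """True process: the label is independent of the distractor feature."""
    return random.choice(["x", "y"]), random.choice(["0", "1"])

def p_one_given_x(data):
    """Empirical P(label = "1" | distractor = "x")."""
    labels = [label for d, label in data if d == "x"]
    return labels.count("1") / len(labels) if labels else 0.5

# In small training samples, finite-sample noise manufactures correlations
# that a sufficiently flexible model will happily fit.
worst = max(abs(p_one_given_x([sample_pair() for _ in range(20)]) - 0.5)
            for _ in range(50))

# On a large held-out sample the true conditional is ~0.5: no correlation.
held_out = [sample_pair() for _ in range(100_000)]

print(f"worst training-sample correlation: {worst:.2f}")
print(f"held-out estimate: {p_one_given_x(held_out):.2f}")
```

The training-sample estimates swing well away from 0.5 while the held-out estimate sits on it; driving training loss lower on such a sample means memorizing exactly these swings.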

Autoregressive Models as Generalized Tables

A useful mental model: autoregressive language models are generalized lookup tables for storing data sequences. They are extraordinarily efficient at memorizing and interpolating over data that has autoregressive structure — where the next token genuinely depends on the preceding context in predictable ways.

Natural language largely has this structure, which is why AR models work so well. The next word in a sentence does depend on the words before it, and this dependency can be captured by attention over the preceding context.
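The "generalized table" framing can be made literal with a count-based bigram model — a hypothetical minimal sketch, with raw counts standing in for the smooth interpolation a neural model performs:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Build a lookup table: context token -> counts of next tokens."""
    table = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table

def predict_next(table, token):
    """Return the most frequent next token seen after `token`."""
    counts = table.get(token)
    return counts.most_common(1)[0][0] if counts else None

corpus = ["the cat sat", "the cat ran", "the dog sat"]
table = train_bigram(corpus)
print(predict_next(table, "the"))  # "cat" ("cat" follows "the" twice, "dog" once)
```

A transformer generalizes this table in two ways: the context is the entire prefix rather than one token, and the lookup is done in a learned feature space rather than over literal strings — which is precisely what lets it interpolate, and also what lets spurious patterns in the counts leak into unseen contexts.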

But not all knowledge has autoregressive structure. Consider:

  • Spatial reasoning: understanding that if A is to the left of B, and B is to the left of C, then A is to the left of C. The conclusion does not follow from any particular ordering of tokens — it follows from a relational structure that is not inherently sequential.
  • Causal reasoning: understanding that correlation does not imply causation, or that an intervention changes the data-generating process. The training data contains both causal and spurious correlations, and next-token prediction cannot distinguish between them.
  • Planning: generating a sequence of actions to achieve a goal requires reasoning backward from the goal, not just predicting forward from the context.

When the data does not follow autoregressive structure, AR models do not degrade gracefully — they fit whatever spurious autoregressive pattern best approximates the non-autoregressive truth. This is the mechanism by which spurious correlations enter: the model must explain the data as an autoregressive process, even when it is not one.
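The spatial-reasoning example makes the contrast concrete. A relational representation derives "A is left of C" mechanically, by transitive closure over stated facts; a sequence model trained only on the strings "A left B" and "B left C" has no statistical evidence for the string "A left C" at all. A minimal sketch of the relational side:

```python
def transitive_closure(pairs):
    """All (x, z) such that x is left of z, derived by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Stated facts: A is left of B, B is left of C.
facts = {("A", "B"), ("B", "C")}
print(("A", "C") in transitive_closure(facts))  # True
```

The conclusion falls out of the relation's structure, not out of any token co-occurrence — which is exactly the kind of knowledge next-token statistics can only approximate.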

It’s Not Easy to Unlearn, Either

One might hope that post-training (RLHF, DPO, etc.) can fix the spurious correlations learned during pre-training. And to some extent it can — post-training is effective at steering the model’s behavior, suppressing certain outputs, and reinforcing others.

But the spurious correlations are not easy to fully “unlearn.” They are baked into the model’s representations, not just its output distribution. Post-training adjusts the surface behavior without necessarily correcting the underlying feature space. A model that has learned a spurious correlation between two concepts during pre-training may suppress the correlation in its outputs after RLHF, but the correlation may still influence internal representations and resurface in novel contexts.

At the same time, these correlations are not so deeply embedded that the model is hopeless. The model’s internal representations are rich enough that post-training can redirect them. The situation is more nuanced than “pre-training is broken” — it is that pre-training alone is insufficient, and the specific ways in which it is insufficient are not well understood.

No Magic Architecture

A tempting conclusion: if autoregressive models have these structural limits, perhaps a different architecture — diffusion models for text, energy-based models, some yet-undiscovered paradigm — would solve the problem.

This is almost certainly too optimistic. Every architecture comes with its own inductive biases and its own failure modes. Diffusion models for text avoid the left-to-right constraint but introduce challenges with discrete tokens and long-range coherence. Energy-based models are flexible but notoriously difficult to train at scale. State-space models are efficient but may sacrifice the flexibility of attention.

The more likely path forward is not a single architectural breakthrough but a co-design between data and architecture:

  • Data: curating training data that reduces spurious correlations, using synthetic data generation to provide examples of genuine reasoning, or training on interaction traces rather than static text.
  • Architecture: designing inductive biases that better match the structure of the target knowledge — perhaps hybrid architectures that combine autoregressive generation with explicit relational or causal reasoning modules.
  • Objective: moving beyond pure next-token prediction to objectives that incentivize learning causal structure, such as masked prediction, denoising, or auxiliary tasks that require understanding interventions.
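The objective-level difference can be seen in how training pairs are constructed. The sketch below (illustrative only — real implementations work on token IDs with attention masking, not lists of strings) contrasts what each objective conditions on: next-token prediction sees only the left context, while masked prediction conditions on both sides.

```python
def next_token_targets(tokens):
    """Causal LM: predict token t from tokens before t (left context only)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_targets(tokens, masked_positions):
    """Masked LM: predict a hidden token from BOTH sides of the sequence."""
    return [
        (tokens[:i] + ["<mask>"] + tokens[i + 1:], tokens[i])
        for i in masked_positions
    ]

toks = ["A", "is", "left", "of", "B"]
print(next_token_targets(toks)[2])   # (['A', 'is', 'left'], 'of')
print(masked_targets(toks, [2])[0])  # (['A', 'is', '<mask>', 'of', 'B'], 'left')
```

Neither pair construction incentivizes causal structure by itself; the point is only that the choice of conditioning set changes which statistical patterns minimize the loss, and so which patterns the model ends up storing.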

None of these alone solves the problem. The right combination is unknown and may be domain-dependent.

The Scaling Angle

This connects to a practical concern: scaling becomes inefficient when the objective is misaligned with the target knowledge. If next-token prediction on static text has diminishing returns beyond a certain scale, then building larger models on the same paradigm is wasteful. The compute would be better spent on:

  1. Better data curation (removing spurious correlations at the source)
  2. More diverse training objectives (not just next-token prediction)
  3. Interactive training (where the model learns from environment feedback, not just static text — connecting to the multi-step agent training discussion)
  4. Post-training that does more than surface-level alignment

The models that break through the current plateau will likely not be the ones that simply scale up existing pre-training. They will be the ones that rethink what the model is trained to do and what data it is trained on.

Open Questions

This is more a collection of observations than a theory. Several questions remain genuinely open:

  • How much of current model capability comes from genuine understanding versus spurious correlation? We don’t have good tools to measure this.
  • Is there a principled way to curate pre-training data to minimize spurious correlations? Current data filtering is mostly heuristic.
  • Can architectural innovations reduce the autoregressive bias without sacrificing the efficiency gains? Hardware is optimized for AR models; any alternative must compete on wall-clock time, not just theoretical expressiveness.
  • What is the role of scale in post-training? If pre-training hits a wall, does post-training (RL, RLHF, etc.) become the primary driver of capability gains? And does that scale?

These are some of the most important open questions in the field. They are unlikely to be resolved by any single paper or any single scaling run. But recognizing the limits of the current paradigm is the first step toward designing a better one.