Pretraining, Post-training, and Test-Time Reasoning
Modern language models are built in three phases: pretraining on massive text corpora, post-training via reinforcement learning or preference optimization, and test-time reasoning through chain-of-thought and extended inference. Each phase plays a distinct role — and has distinct limits. This post argues that the three phases are best understood through a single lens: pretraining defines the interpolation region, post-training reshapes capabilities within that region, and test-time reasoning attempts to linearly extrapolate beyond it.
Pretraining Defines the Paradigm
Pretraining establishes the model’s interpolation region — the set of all input-output patterns over which the model can produce reliable predictions. To understand what this means, consider the distinction between interpolation and extrapolation. Given training data \(\{(x_i, y_i)\}_{i=1}^{n}\) and a learned function \(\hat{f}\): interpolation estimates \(\hat{f}(x)\) for \(x\) within the convex hull of the training data, while extrapolation predicts outside the training distribution where no signal constrains behavior. Interpolation is generally reliable; extrapolation is fragile — a polynomial that fits training data perfectly can diverge wildly just beyond the data boundary.
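This fragility is easy to see numerically. The sketch below (an illustrative toy, not from any model) fits a high-degree polynomial to noisy samples of a smooth function on \([0, 1]\), then evaluates it inside and just outside the training interval; the function name `f_hat` and the specific degree are arbitrary choices for the demonstration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Training data: noisy samples of a smooth function on [0, 1].
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.normal(size=x_train.size)

# A degree-12 polynomial fits the training interval almost perfectly.
f_hat = Polynomial.fit(x_train, y_train, deg=12)

def true_f(x):
    return np.sin(2 * np.pi * x)

err_interp = abs(f_hat(0.5) - true_f(0.5))  # inside the convex hull
err_extrap = abs(f_hat(1.5) - true_f(1.5))  # just beyond the data boundary

print(f"interpolation error at x=0.5: {err_interp:.4f}")
print(f"extrapolation error at x=1.5: {err_extrap:.4f}")
```

The interpolation error is tiny while the extrapolation error explodes, even though both points are equally "close" to the training interval in absolute terms.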
By training on next-token prediction over a massive corpus, the model absorbs statistical regularities — what we call “paradigms” — patterns of the next token given a sequence of preceding tokens. At test time, the model finds the distribution closest to what it saw during pretraining and predicts accordingly. This is not memorization (exact recall of training sequences) nor reasoning (deriving novel conclusions from first principles). It is interpolation: blending nearby paradigms to handle inputs that fall within the convex hull of training experience. The answer to “do LLMs memorize or reason?” is neither — they interpolate. This is what makes them so useful (interpolation over a massive corpus covers a huge range of inputs) and also what limits them (anything genuinely outside the training distribution fails).
The standard scaling story — more data, more compute, better performance — works precisely because more training data expands the interpolation region, covering a wider range of inputs. Next-token prediction is embarrassingly parallelizable via teacher forcing: every token provides a training signal, unlike masked objectives that only train on ~15% of tokens. A single pretraining run on diverse data produces a model that handles syntax, factual recall, translation, code generation, and many other tasks without task-specific supervision. Loss decreases predictably with compute, enabling rational resource allocation.
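The "every token provides a training signal" point can be made concrete with a toy sketch. The logits below are random stand-ins for a model's output, assumed only for illustration; the key structure is the shift, where the logits at position \(t\) are scored against the token at position \(t+1\), yielding one supervised loss term per position in a single forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: vocabulary of 10 tokens, one sequence of length 8.
vocab, T = 10, 8
tokens = rng.integers(0, vocab, size=T)

# Stand-in for a model's output: logits at every position, produced
# in ONE forward pass (this is what teacher forcing buys us).
logits = rng.normal(size=(T, vocab))

# Shift: position t predicts token t+1, so T-1 supervised targets.
inputs, targets = logits[:-1], tokens[1:]

# Cross-entropy at every position via log-softmax.
log_probs = inputs - np.log(np.exp(inputs).sum(axis=1, keepdims=True))
per_token_loss = -log_probs[np.arange(T - 1), targets]

print(f"{per_token_loss.size} training signals from one sequence of {T} tokens")
```

A masked objective that supervises only ~15% of positions would instead yield roughly one loss term per seven tokens from the same sequence.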
But this scaling story has limits. Beyond a certain scale, the model has captured the genuine statistical patterns in the data. What remains are increasingly subtle spurious correlations — patterns that hold in the training data but do not reflect causal or logical structure. The model learns whatever patterns minimize the loss; it cannot distinguish genuine causal structure from statistical coincidence. More compute on the same data and the same objective means fitting more noise, not learning more truth.
The problem deepens for knowledge that does not have autoregressive structure. Spatial reasoning follows relational structure, not token order. Causal reasoning requires distinguishing correlation from causation. Planning requires reasoning backward from goals, not forward from context. The autoregressive objective forces the model to explain all knowledge as sequential token dependencies, even when this is the wrong abstraction. When the data does not follow autoregressive structure, the model fits whatever spurious autoregressive pattern best approximates the non-autoregressive truth — and this is the mechanism by which spurious correlations enter the model’s representations.
Mirzadeh et al. (ICLR 2025) provide direct evidence for this interpolation view. They created GSM-Symbolic, symbolic templates derived from GSM8K where only surface-level details (names, numbers) change while logical structure stays identical. LLMs show significant performance drops on these variants, with accuracy degrading further as reasoning steps increase. More strikingly, inserting irrelevant clauses — which a true reasoner would simply ignore — causes substantial accuracy drops. The model is interpolating over surface patterns, not reasoning over logical structure.
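The GSM-Symbolic construction can be mimicked with a toy template (a hypothetical example for illustration, not drawn from the actual benchmark): the gold answer is computed from the logical structure, so any accuracy drop across variants can only come from surface features or the distractor clause.

```python
import random

# One GSM8K-style problem as a symbolic template. The logical structure
# (a single addition) is fixed; only surface details vary per instance.
# Hypothetical template for illustration, not from GSM-Symbolic itself.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")

IRRELEVANT = " Some of the apples are slightly smaller than the rest."

def make_variant(rng, add_distractor=False):
    """Instantiate the template; return (question, gold_answer)."""
    name = rng.choice(["Ava", "Noah", "Mia", "Liam"])
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(name=name, a=a, b=b)
    if add_distractor:  # irrelevant clause a true reasoner would ignore
        question += IRRELEVANT
    return question, a + b

rng = random.Random(0)
for distract in (False, True):
    question, gold = make_variant(rng, add_distractor=distract)
    print(question, "->", gold)
```

Evaluating a model over many such variants, with and without the distractor, separates sensitivity to surface form from genuine reasoning over the fixed structure.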
Post-Training Specializes Capabilities
Post-training (RLHF, DPO, RL fine-tuning) does not fundamentally expand the interpolation region — it reshapes it. The model’s underlying representations, learned during pretraining, remain largely intact. What changes is the mapping from those representations to outputs: which behaviors are reinforced, which are suppressed, and how the model’s probability mass is redistributed over the output space.
Think of pretraining as carving a rough sculpture. Post-training is the polishing: it refines the surface, sharpens certain features, and smooths away others. But it works with the material that pretraining provided — it cannot add mass that was never there.
The benefits are substantial. RLHF and DPO steer the model toward human preferences — helpfulness, harmlessness, honesty — without retraining from scratch. RL fine-tuning can sharpen performance on specific domains (mathematics, coding, tool use) by reinforcing successful patterns already present in the pretraining distribution. And post-training requires orders of magnitude less compute than pretraining, making it practical to iterate on model behavior.
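As a concrete sketch of how preference optimization redistributes probability mass, here is the DPO loss for a single preference pair, with assumed toy log-probabilities rather than values from any real model: the loss rewards the policy for widening the chosen-vs-rejected margin relative to a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Arguments are summed log-probabilities of the full responses under
    the policy (logp_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# A policy that prefers the chosen response more than the reference does
# incurs low loss; one that prefers the rejected response incurs high loss.
low = dpo_loss(-10.0, -30.0, -20.0, -20.0)
high = dpo_loss(-30.0, -10.0, -20.0, -20.0)
print(f"aligned: {low:.3f}  misaligned: {high:.3f}")
```

Note what the objective touches: only the relative log-probabilities of whole responses. Nothing in it edits the representations those probabilities are computed from, which is the limit discussed next.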
But the limits are equally clear. Post-training adjusts the output distribution without necessarily correcting the underlying feature space. A model that learned a spurious correlation during pretraining may suppress it in outputs after RLHF, but the correlation can still influence internal representations and resurface in novel contexts. Spurious correlations are baked into the model’s representational geometry, not just its output layer — post-training can redirect behavior but cannot fully overwrite what pretraining established. If the model never saw certain reasoning patterns during pretraining, no amount of RLHF will produce them. Post-training is optimization over a fixed landscape, not construction of new terrain.
Zhang (2025) illustrates this vividly with what they call “computational split-brain syndrome”: LLMs can verbally describe correct reasoning procedures but systematically fail to execute them. Post-training can teach the model to articulate the right process (comprehension) without granting the ability to carry it out (competence). The paper shows that transformers solve compositional tasks through “linearized subgraph matching” — memorizing computation patterns from training data rather than learning systematic algorithms. The limitation is architectural: token embeddings encode context-weighted averages that distort the symbolic relationships needed for genuine compositional reasoning. Post-training cannot fix what the architecture cannot represent.
Test-Time Reasoning as Linear Extrapolation
Chain-of-thought prompting and reasoning models like o1 and DeepSeek-R1 attempt to push the model beyond what a single forward pass can achieve. By generating intermediate reasoning steps, they decompose complex problems into short chunks — each chunk short enough to stay within the interpolation region — and chain them together.
This is linear extrapolation: the model takes short, reliable interpolation steps and chains them sequentially, hoping to reach conclusions that no single step could. Each step is grounded in the pretraining distribution; the extrapolation emerges from their composition. Problems that require multi-step inference become accessible by decomposing them into individually tractable sub-problems. The intermediate steps are inspectable, enabling error detection. And more test-time tokens generally improve performance on hard problems, offering a compute-performance tradeoff orthogonal to model size.
But each interpolation step introduces small errors, and over long chains these compound — a known failure mode of imitation learning. The model can produce short, coherent logic chunks, but coherence degrades between chunks that are many steps apart. From an optimization perspective, the GPT architecture prevents infinite-horizon training: each token prediction receives gradient signal only from its immediate next-token loss. The model has learned few-step next-token prediction, but it lacks a mechanism to strengthen the planning (e.g., backup) or long-range logic (e.g., skip-step reasoning) that genuine problem-solving requires. Chain-of-thought prompting mitigates this by keeping each step short, but it manages the symptom rather than solving the underlying problem.
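The compounding argument is simple arithmetic: if each step independently succeeds with probability \(p\), a chain of \(n\) steps succeeds with probability \(p^n\), so reliability decays exponentially in chain length even when each individual step is quite reliable.

```python
# Error accumulation over a reasoning chain: with independent per-step
# success probability p, the whole chain succeeds with probability p**n.
p = 0.95
for n in (1, 5, 20, 50):
    print(f"{n:>3} steps: chain success probability = {p ** n:.3f}")
```

At 95% per-step reliability, a 20-step chain already succeeds barely a third of the time, and a 50-step chain less than a tenth — the independence assumption is a simplification, but it captures why long chains drift.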
Crucially, the “reasoning” produced by CoT is only as reliable as the interpolation steps it is composed of. Zhao et al. (2025) test this directly using a controlled synthetic environment where training distributions are fully specified. Their finding: CoT effectiveness is governed by the distribution gap between training and test queries. In-distribution, CoT produces impressive structured outputs that look like reasoning. Out-of-distribution, it breaks down — revealing that the apparent logical structure was interpolated from training patterns rather than derived from first principles. Test-time reasoning expands the effective interpolation region through sequential composition, but it does not achieve genuine extrapolation.
The three phases form a coherent picture:
| | Pretraining | Post-training | Test-time reasoning |
|---|---|---|---|
| Role | Defines the interpolation region | Reshapes it | Linearly extrapolates from it |
| Mechanism | Next-token prediction on massive data | RL/RLHF/DPO on curated feedback | Chain-of-thought composition |
| What it adds | Breadth of paradigms | Task specialization, alignment | Sequential multi-step reach |
| Key limit | Spurious correlations, structural mismatch | Cannot extend the region | Error accumulation, distribution-dependent |
| Scaling | Diminishing returns on same data | Bounded by pretraining representations | Linear in steps, not exponential in capability |
The fundamental constraint is that no phase produces genuine extrapolation. Pretraining builds the landscape. Post-training sculpts it. Test-time reasoning walks across it, one interpolation step at a time. When the walk exits the landscape — when the problem genuinely requires knowledge or structure not present in the training distribution — the model fails, regardless of how many tokens it spends thinking. Whether this constraint can be broken — through denser data, deeper post-training, non-linear search strategies, or architectural redesign — remains the central open question.