Pretraining, Post-training, and Test-Time Reasoning

This post was first drafted after discussions with Peter Tong upon the release of DeepSeek-R1 in February 2025, about the linear extrapolation of language models' reasoning capability. It was revised after discussions with Prof. Aviral Kumar around August 2025, and archived on 12 September 2025.

Modern language models are built in three phases: pretraining on massive text corpora, post-training via reinforcement learning or preference optimization, and test-time reasoning through chain-of-thought and extended inference. Each phase plays a distinct role — and has distinct limits. This post argues that the three phases are best understood through a single lens: pretraining defines the interpolation region, post-training reshapes capabilities within that region, and test-time reasoning attempts to linearly extrapolate beyond it.

现代语言模型的构建分为三个阶段:在大规模文本语料上进行预训练、通过强化学习或偏好优化进行后训练、以及通过思维链和扩展推理实现测试时推理。每个阶段扮演不同的角色,也有各自的局限。本文的核心观点是:这三个阶段可以用统一的视角来理解——预训练定义了插值区域,后训练在该区域内重塑能力,测试时推理则尝试从中线性外推。

Pretraining Defines the Paradigm

Pretraining establishes the model’s interpolation region — the set of all input-output patterns over which the model can produce reliable predictions. To understand what this means, consider the distinction between interpolation and extrapolation. Given training data \(\{(x_i, y_i)\}_{i=1}^{n}\) and a learned function \(\hat{f}\): interpolation estimates \(\hat{f}(x)\) for \(x\) within the convex hull of the training data, while extrapolation predicts outside the training distribution where no signal constrains behavior. Interpolation is generally reliable; extrapolation is fragile — a polynomial that fits training data perfectly can diverge wildly just beyond the data boundary.
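A minimal numerical sketch of this contrast (NumPy; the target function, noise level, and polynomial degree are arbitrary illustrative choices): a polynomial fitted to samples from a smooth curve tracks it well inside the training range and diverges just beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: noisy samples of a smooth target function on [0, 1].
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.standard_normal(20)

# A flexible model fitted to the training data.
poly = np.poly1d(np.polyfit(x_train, y_train, deg=7))

x_in, x_out = 0.5, 1.3   # inside vs. just outside the training range
print(poly(x_in), np.sin(2 * np.pi * x_in))    # interpolation: close to the target
print(poly(x_out), np.sin(2 * np.pi * x_out))  # extrapolation: typically far off
```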

By training on next-token prediction over a massive corpus, the model absorbs statistical regularities — what we call “paradigms” — patterns of the next token given a sequence of preceding tokens. At test time, the model finds the distribution closest to what it saw during pretraining and predicts accordingly. This is not memorization (exact recall of training sequences) nor reasoning (deriving novel conclusions from first principles). It is interpolation: blending nearby paradigms to handle inputs that fall within the convex hull of training experience. The answer to “do LLMs memorize or reason?” is neither — they interpolate. This is what makes them so useful (interpolation over a massive corpus covers a huge range of inputs) and also what limits them (anything genuinely outside the training distribution fails).
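One way to make "blending nearby paradigms" concrete is Nadaraya-Watson kernel regression, which predicts by averaging training targets weighted by each training input's proximity to the query. This is an analogy for the mechanism, not a claim about transformer internals; the bandwidth and data below are illustrative.

```python
import numpy as np

def kernel_predict(x_query, x_train, y_train, bandwidth=0.1):
    """Blend training targets, weighted by how close each training input is to the query."""
    w = np.exp(-((x_query - x_train) ** 2) / (2 * bandwidth ** 2))
    return float(np.sum(w * y_train) / np.sum(w))

x_train = np.linspace(0.0, 1.0, 50)
y_train = np.sin(2 * np.pi * x_train)

# Inside the convex hull of the training inputs: the blend is reliable.
print(kernel_predict(0.37, x_train, y_train), np.sin(2 * np.pi * 0.37))

# Outside it: the weights collapse onto the boundary points and the prediction is unreliable.
print(kernel_predict(1.25, x_train, y_train), np.sin(2 * np.pi * 1.25))
```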

The standard scaling story — more data, more compute, better performance — works precisely because more training data expands the interpolation region, covering a wider range of inputs. Next-token prediction is embarrassingly parallelizable via teacher forcing: every token provides a training signal, unlike masked objectives that only train on ~15% of tokens. A single pretraining run on diverse data produces a model that handles syntax, factual recall, translation, code generation, and many other tasks without task-specific supervision. Loss decreases predictably with compute, enabling rational resource allocation.
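A sketch of this training-signal structure (PyTorch; the tiny embedding-plus-linear "model" is a stand-in for a real transformer): with teacher forcing, inputs and targets are the same token sequence shifted by one position, so every position contributes a loss term in a single parallel pass.

```python
import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 100 tokens, batch of 2 sequences of length 8.
vocab_size, batch, seq_len, d_model = 100, 2, 8, 32
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Stand-in "model": embedding + linear head (a real LM would be a causal transformer).
embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)
logits = head(embed(tokens))                     # (batch, seq_len, vocab_size)

# Teacher forcing: position t predicts token t+1, so shift logits and targets by one.
# Every one of the (seq_len - 1) positions supplies a gradient signal in parallel,
# unlike masked objectives that supervise only a sampled subset of positions.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)
```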

But this scaling story has limits. Beyond a certain scale, the model has captured the genuine statistical patterns in the data. What remains are increasingly subtle spurious correlations — patterns that hold in the training data but do not reflect causal or logical structure. The model learns whatever patterns minimize the loss; it cannot distinguish genuine causal structure from statistical coincidence. More compute on the same data and the same objective means fitting more noise, not learning more truth.

The problem deepens for knowledge that does not have autoregressive structure. Spatial reasoning follows relational structure, not token order. Causal reasoning requires distinguishing correlation from causation. Planning requires reasoning backward from goals, not forward from context. The autoregressive objective forces the model to explain all knowledge as sequential token dependencies, even when this is the wrong abstraction. When the data does not follow autoregressive structure, the model fits whatever spurious autoregressive pattern best approximates the non-autoregressive truth — and this is the mechanism by which spurious correlations enter the model’s representations.

Mirzadeh et al. (ICLR 2025) provide direct evidence for this interpolation view. They created GSM-Symbolic, symbolic templates derived from GSM8K where only surface-level details (names, numbers) change while logical structure stays identical. LLMs show significant performance drops on these variants, with accuracy degrading further as reasoning steps increase. More strikingly, inserting irrelevant clauses — which a true reasoner would simply ignore — causes substantial accuracy drops. The model is interpolating over surface patterns, not reasoning over logical structure.
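To see what "surface-level details change while logical structure stays identical" means, here is a hypothetical template in the spirit of GSM-Symbolic (my own illustrative example, not taken from the benchmark): every sampled variant shares the same two-step structure, but names and numbers differ.

```python
import random

# Hypothetical GSM8K-style template: surface details vary, logical structure is fixed.
TEMPLATE = (
    "{name} buys {n_packs} packs of pencils with {per_pack} pencils in each pack. "
    "{name} gives away {given} pencils. How many pencils does {name} have left?"
)

def make_variant(rng):
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    n_packs, per_pack = rng.randint(2, 9), rng.randint(3, 12)
    given = rng.randint(1, n_packs * per_pack - 1)
    question = TEMPLATE.format(name=name, n_packs=n_packs, per_pack=per_pack, given=given)
    answer = n_packs * per_pack - given   # ground truth follows from the fixed structure
    return question, answer

rng = random.Random(0)
print(make_variant(rng))
```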

预训练建立了模型的插值区域——即模型能够产生可靠预测的所有输入-输出模式的集合。要理解这意味着什么,需要区分插值与外推。给定训练数据 \(\{(x_i, y_i)\}_{i=1}^{n}\) 和学到的函数 \(\hat{f}\):插值是在训练数据的凸包内对 \(\hat{f}(x)\) 进行估计,而外推则是在训练分布之外进行预测,此时没有信号约束模型的行为。插值通常是可靠的;外推则是脆弱的——一个完美拟合训练数据的多项式,在数据边界之外可能会剧烈发散。

通过在大规模语料上进行下一个token预测的训练,模型吸收了统计规律——我们称之为“范式”——即给定一系列前置token后下一个token的模式。在测试时,模型找到与预训练中所见最接近的分布,并据此进行预测。这既不是记忆(精确回忆训练序列),也不是推理(从第一性原理推导新结论),而是插值:融合相邻范式来处理落在训练经验凸包内的输入。“LLM是在记忆还是在推理?”这个问题的答案是都不是——它们在做插值。这正是它们如此有用的原因(对大规模语料的插值覆盖了巨大的输入范围),也是它们受限的原因(任何真正超出训练分布的输入都会失败)。

标准的规模化叙事——更多数据、更多算力、更好性能——之所以成立,恰恰因为更多训练数据扩展了插值区域,覆盖了更广泛的输入。下一个token预测通过teacher forcing实现了高度并行化:每个token都提供训练信号,不像掩码目标只在约15%的token上训练。在多样化数据上的一次预训练就能产生一个处理语法、事实召回、翻译、代码生成等多种任务的模型,而无需特定任务的监督。损失随算力可预测地下降,使得资源分配有据可依。

但这一规模化叙事也有其极限。超过一定规模后,模型已经捕获了数据中真实的统计模式。剩下的是越来越微妙的虚假相关——在训练数据中成立但不反映因果或逻辑结构的模式。模型学习任何能最小化损失的模式;它无法区分真实的因果结构和统计巧合。在相同数据和相同目标上投入更多算力,意味着拟合更多噪声,而非学到更多真理。

对于不具有自回归结构的知识,这个问题更为严重。空间推理遵循关系结构而非token顺序;因果推理需要区分相关性和因果性;规划需要从目标反向推理而非从上下文正向推理。自回归目标迫使模型将所有知识解释为序列token依赖,即使这是错误的抽象。当数据不遵循自回归结构时,模型会拟合最接近非自回归真相的虚假自回归模式——这正是虚假相关进入模型表征的机制。

Mirzadeh et al. (ICLR 2025) 为这一插值观点提供了直接证据。他们创建了 GSM-Symbolic,这是从GSM8K衍生的符号模板,只改变表面细节(名称、数字),而逻辑结构保持不变。LLM在这些变体上表现出显著的性能下降,且随着推理步骤增加,准确率进一步退化。更引人注目的是,插入无关条件——真正的推理者会直接忽略这些——也会导致准确率大幅下降。模型是在表面模式上插值,而非在逻辑结构上推理。

Post-Training Specializes Capabilities

Post-training (RLHF, DPO, RL fine-tuning) does not fundamentally expand the interpolation region — it reshapes it. The model’s underlying representations, learned during pretraining, remain largely intact. What changes is the mapping from those representations to outputs: which behaviors are reinforced, which are suppressed, and how the model’s probability mass is redistributed over the output space.

Think of pretraining as carving a rough sculpture. Post-training is the polishing: it refines the surface, sharpens certain features, and smooths away others. But it works with the material that pretraining provided — it cannot add mass that was never there.

The benefits are substantial. RLHF and DPO steer the model toward human preferences — helpfulness, harmlessness, honesty — without retraining from scratch. RL fine-tuning can sharpen performance on specific domains (mathematics, coding, tool use) by reinforcing successful patterns already present in the pretraining distribution. And post-training requires orders of magnitude less compute than pretraining, making it practical to iterate on model behavior.
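As a concrete instance of reshaping the output distribution, here is a minimal sketch of the standard DPO objective (PyTorch; the log-probabilities and the β value below are illustrative placeholders). The loss only moves probability mass between a preferred and a rejected response relative to a frozen reference model; it supplies no new pretraining signal.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one batch of preference pairs.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model; shapes are (batch,).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push up the margin between preferred and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Illustrative numbers only.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```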

But the limits are equally clear. Post-training adjusts the output distribution without necessarily correcting the underlying feature space. A model that learned a spurious correlation during pretraining may suppress it in outputs after RLHF, but the correlation can still influence internal representations and resurface in novel contexts. Spurious correlations are baked into the model’s representational geometry, not just its output layer — post-training can redirect behavior but cannot fully overwrite what pretraining established. If the model never saw certain reasoning patterns during pretraining, no amount of RLHF will produce them. Post-training is optimization over a fixed landscape, not construction of new terrain.

Zhang (2025) illustrates this vividly with what they call “computational split-brain syndrome”: LLMs can verbally describe correct reasoning procedures but systematically fail to execute them. Post-training can teach the model to articulate the right process (comprehension) without granting the ability to carry it out (competence). The paper shows that transformers solve compositional tasks through “linearized subgraph matching” — memorizing computation patterns from training data rather than learning systematic algorithms. The limitation is architectural: token embeddings encode context-weighted averages that distort the symbolic relationships needed for genuine compositional reasoning. Post-training cannot fix what the architecture cannot represent.

后训练(RLHF、DPO、RL微调)并不从根本上扩展插值区域——它重塑了这个区域。模型在预训练中学到的底层表征基本保持不变,改变的是从这些表征到输出的映射:哪些行为被强化、哪些被抑制、以及模型的概率质量如何在输出空间上重新分配。

可以把预训练想象成粗雕一座雕塑,后训练则是打磨:它精修表面、锐化某些特征、抹平其他特征。但它使用的是预训练提供的材料——无法添加从未存在过的部分。

后训练的收益是显著的。RLHF和DPO将模型导向人类偏好——有用性、无害性、诚实性——而无需从头重训。RL微调可以在特定领域(数学、编程、工具使用)上通过强化预训练分布中已有的成功模式来提升性能。而且后训练所需算力比预训练少几个数量级,使得迭代模型行为成为可能。

但局限同样明显。后训练调整输出分布,却不一定修正底层特征空间。一个在预训练中学到虚假相关的模型,可能在RLHF后抑制了输出中的这种相关,但该相关仍可能影响内部表征并在新情境中重新浮现。虚假相关被嵌入模型的表征几何中,而不仅仅是输出层——后训练可以重定向行为,但无法完全覆写预训练所建立的东西。如果模型在预训练中从未见过某些推理模式,再多的RLHF也无法产生它们。后训练是在固定地形上的优化,而非构建新地形。

Zhang (2025) 用他们所称的“计算型分裂脑综合征”生动地说明了这一点:LLM能够口头描述正确的推理过程,却系统性地无法执行它们。后训练可以教会模型表述正确的流程(理解),却无法赋予其执行能力(胜任)。论文表明,transformer通过“线性化子图匹配”来解决组合任务——记忆训练数据中的计算模式,而非学习系统性算法。这一局限是架构性的:token嵌入编码了上下文加权平均,扭曲了真正组合推理所需的符号关系。后训练无法修复架构本身无法表达的东西。

Test-Time Reasoning as Linear Extrapolation

Chain-of-thought prompting, and reasoning models like o1 and DeepSeek-R1, attempt to push the model beyond what a single forward pass can achieve. By generating intermediate reasoning steps, they decompose complex problems into short chunks — each chunk short enough to stay within the interpolation region — and chain them together.

This is linear extrapolation: the model takes short, reliable interpolation steps and chains them sequentially, hoping to reach conclusions that no single step could. Each step is grounded in the pretraining distribution; the extrapolation emerges from their composition. Problems that require multi-step inference become accessible by decomposing them into individually tractable sub-problems. The intermediate steps are inspectable, enabling error detection. And more test-time tokens generally improve performance on hard problems, offering a compute-performance tradeoff orthogonal to model size.
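A schematic of this sequential composition (Python; `generate_step` is a placeholder for whatever single short decoding call the deployed model exposes, not a real API): each iteration is one short interpolation step, and the answer, if it arrives at all, emerges from the chain.

```python
def solve_with_cot(question, generate_step, max_steps=16):
    """Schematic chain-of-thought loop: each call stays short (one interpolation
    step); any answer emerges from the composition of steps."""
    context = f"Question: {question}\nLet's think step by step.\n"
    for _ in range(max_steps):
        step = generate_step(context)      # placeholder for one short model call
        context += step + "\n"
        if step.startswith("Answer:"):     # stop once the chain commits to an answer
            return step.removeprefix("Answer:").strip()
    return None                            # the chain never converged
```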

But each interpolation step introduces small errors, and over long chains these compound — a known failure mode of imitation learning. The model can produce short coherent logic chunks, but chunks many steps apart tend to drift into incoherence. From an optimization perspective, the GPT architecture prevents infinite-horizon training: each token prediction only receives gradient signal from its immediate next-token loss. The model has learned few-step next-token prediction, but it lacks a mechanism to strengthen planning (e.g., backup) or long-range logic (e.g., skip-step reasoning) that genuine problem-solving requires. Chain-of-thought prompting mitigates this by keeping each step short, but it manages the symptom rather than solving the underlying problem.
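The compounding is easy to quantify under a crude independence assumption (the per-step reliabilities below are illustrative, not measured): if each step is correct with probability p, an n-step chain succeeds end-to-end with probability roughly p^n.

```python
# If each interpolation step is independently correct with probability p,
# a chain of n steps is correct end-to-end with probability roughly p**n.
for p in (0.99, 0.95, 0.90):
    for n in (5, 20, 50):
        print(f"p={p:.2f}, n={n:2d} -> chain success ~ {p**n:.2f}")
# p=0.95, n=20 already drops to about 0.36; long chains need near-perfect steps.
```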

Crucially, the “reasoning” produced by CoT is only as reliable as the interpolation steps it is composed of. Zhao et al. (2025) test this directly using a controlled synthetic environment where training distributions are fully specified. Their finding: CoT effectiveness is governed by the distribution gap between training and test queries. In-distribution, CoT produces impressive structured outputs that look like reasoning. Out-of-distribution, it breaks down — revealing that the apparent logical structure was interpolated from training patterns rather than derived from first principles. Test-time reasoning expands the effective interpolation region through sequential composition, but it does not achieve genuine extrapolation.

The three phases form a coherent picture:

|              | Pretraining | Post-training | Test-time reasoning |
|--------------|-------------|---------------|---------------------|
| Role         | Defines the interpolation region | Reshapes it | Linearly extrapolates from it |
| Mechanism    | Next-token prediction on massive data | RL/RLHF/DPO on curated feedback | Chain-of-thought composition |
| What it adds | Breadth of paradigms | Task specialization, alignment | Sequential multi-step reach |
| Key limit    | Spurious correlations, structural mismatch | Cannot extend the region | Error accumulation, distribution-dependent |
| Scaling      | Diminishing returns on same data | Bounded by pretraining representations | Linear in steps, not exponential in capability |

The fundamental constraint is that no phase produces genuine extrapolation. Pretraining builds the landscape. Post-training sculpts it. Test-time reasoning walks across it, one interpolation step at a time. When the walk exits the landscape — when the problem genuinely requires knowledge or structure not present in the training distribution — the model fails, regardless of how many tokens it spends thinking. Whether this constraint can be broken — through denser data, deeper post-training, non-linear search strategies, or architectural redesign — remains the central open question.

思维链提示,以及o1、DeepSeek-R1等推理模型,试图推动模型超越单次前向传播所能达到的极限。通过生成中间推理步骤,它们将复杂问题分解为短片段——每个片段足够短以保持在插值区域内——然后将它们链接起来。

这就是线性外推:模型执行短小、可靠的插值步骤,并将它们顺序链接,期望达到单个步骤无法到达的结论。每一步都植根于预训练分布;外推来自于它们的组合。需要多步推理的问题通过分解为单独可解的子问题而变得可及。中间步骤是可检视的,便于错误检测。更多的测试时token通常能提升在困难问题上的表现,提供了一种与模型规模正交的算力-性能权衡。

但每个插值步骤都会引入小误差,在长链中这些误差会累积——这是模仿学习的一个已知失败模式。模型可以产生短小连贯的逻辑片段,但相隔许多步的片段往往会漂移到不连贯。从优化角度看,GPT架构阻止了无限horizon的训练:每个token预测只接收来自其直接下一个token损失的梯度信号。模型学会了少步next-token预测,但缺乏加强规划(如backup)或长程逻辑(如跳步推理)的机制,而这些是真正解决问题所必需的。思维链提示通过保持每步短小来缓解这一问题,但它管理的是症状而非解决根本问题。

关键在于,CoT产生的”推理”的可靠性完全取决于其组成的插值步骤。Zhao et al. (2025) 使用一个训练分布完全已知的受控合成环境直接测试了这一点。他们的发现是:CoT的有效性取决于训练和测试查询之间的分布差距。在分布内,CoT产生令人印象深刻的结构化输出,看起来像推理。在分布外,它就会崩溃——揭示出表面上的逻辑结构是从训练模式中插值而来,而非从第一性原理推导而出。测试时推理通过顺序组合扩展了有效插值区域,但并未实现真正的外推。

三个阶段构成了一个连贯的图景:

|          | 预训练 | 后训练 | 测试时推理 |
|----------|--------|--------|------------|
| 角色     | 定义插值区域 | 重塑该区域 | 从中线性外推 |
| 机制     | 在大规模数据上做next-token预测 | 在精选反馈上做RL/RLHF/DPO | 思维链组合 |
| 贡献     | 范式的广度 | 任务特化与对齐 | 顺序多步扩展 |
| 核心局限 | 虚假相关、结构错配 | 无法扩展区域 | 误差累积、依赖分布 |
| 规模化   | 在相同数据上收益递减 | 受限于预训练表征 | 步骤线性增长,能力非指数增长 |

根本性的约束在于没有任何阶段能产生真正的外推。预训练构建地形,后训练雕刻地形,测试时推理一步一个插值地走过它。当这条路走出了地形——当问题真正需要训练分布中不存在的知识或结构时——模型就会失败,无论它花多少token去“思考”。这一约束能否被打破——通过更密集的数据、更深入的后训练、非线性搜索策略或架构重新设计——仍然是核心的开放问题。