Are LLMs Trained to Memorize or Reason?
Data Interpolation vs. Extrapolation
Before diving into language models, it helps to clarify two fundamental concepts. Given a set of observed data points \(\{(x_i, y_i)\}_{i=1}^{n}\) and a learned function \(\hat{f}\):
- Interpolation estimates \(\hat{f}(x)\) for \(x \in [\min_i x_i,\, \max_i x_i]\) — predicting within the convex hull of the training data.
- Extrapolation estimates \(\hat{f}(x)\) for \(x \notin [\min_i x_i,\, \max_i x_i]\) — predicting outside the support of the training distribution.
Interpolation is generally reliable because \(\hat{f}\) stays within the region where training signal constrains its behavior. Extrapolation is fragile: the model must generalize to inputs it has never seen, and small modeling errors compound rapidly. A polynomial that fits training data perfectly can diverge wildly just beyond the data boundary — a phenomenon the interactive figure below illustrates.
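The polynomial failure mode is easy to demonstrate numerically. A minimal sketch (the degree, sample count, and noise level are illustrative choices, not from the original figure): fit a high-degree polynomial to noisy samples of \(\sin(x)\) on \([0, 3]\), then compare the error at a point inside the training range with the error just beyond it.

```python
import numpy as np

# Fit a degree-9 polynomial to 10 noisy samples of sin(x) on [0, 3].
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 3.0, 10)
y_train = np.sin(x_train) + rng.normal(0.0, 0.05, size=x_train.shape)
coeffs = np.polyfit(x_train, y_train, deg=9)

def f_hat(x):
    return np.polyval(coeffs, x)

# Interpolation: a point inside [0, 3], where training data constrains the fit.
x_in = 1.5
err_in = abs(f_hat(x_in) - np.sin(x_in))

# Extrapolation: a point just beyond the data boundary.
x_out = 4.0
err_out = abs(f_hat(x_out) - np.sin(x_out))

print(err_in, err_out)  # the extrapolation error is typically far larger
```

The same fit that tracks the target closely between training points diverges once the query leaves the data's support.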
This distinction turns out to be central to understanding what language models can and cannot do. Are they interpolating — piecing together patterns from training data — or are they extrapolating — deriving genuinely new conclusions through reasoning?
The Rise of Next-Token Prediction
The foundation of modern language models is next-token prediction: given a sequence of tokens, predict the next one. GPT-1 (2018) established this paradigm by adopting the decoder-only Transformer with a causal mask — a lower-triangular attention matrix, originally introduced in the decoder of Vaswani et al. (2017), that prevents each token from attending to future positions. This design enables fully parallelized training via teacher forcing: all positions in the sequence can be computed simultaneously, with every token serving as a training signal. In contrast, BERT’s masked language modeling only trains on a randomly selected 15% of tokens per sequence.
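The causal mask itself is a one-line construction. A minimal single-head sketch (plain numpy, no real model weights) showing why every position can be trained in parallel without leaking future tokens:

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head attention with a causal (lower-triangular) mask.

    q, k, v: (seq_len, d) arrays. Each position attends only to itself
    and earlier positions, so all targets can be computed simultaneously
    (teacher forcing) with no information flowing from the future.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)         # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, w = causal_attention(x, x, x)
print(np.triu(w, k=1).max())  # 0.0: row i puts zero weight on any column j > i
```

The strictly zero upper triangle of the weight matrix is exactly the lower-triangular structure the text describes.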
For a deeper dive into self-attention and the Transformer architecture, see What is Important about Self-Attention and Transformer?
The biggest advantage of next-token prediction turned out to be scalability. GPT-1 was a contemporary of BERT and aimed to obtain a general language model through pre-training, but at that point the approach did not yield competitive results on downstream benchmarks. By GPT-2 (2019), scaling up data and parameters showed that the pre-training task itself could produce useful capabilities: fine-tuning enabled novel text continuation, emotional chatbots (Diyi Yang), and even decision making (ArCHer, Yifei Zhou). By GPT-3 (2020), fine-tuning became unnecessary for simple tasks because the model had already encountered similar patterns during pre-training. Up to this stage, no one claimed that the GPT architecture possessed “reasoning abilities,” since it clearly could not excel at reasoning tasks. It wasn’t until GPT-4, when post-training elevated the pre-trained model’s abilities to a new level, that the model seemed to demonstrate some zero-shot reasoning capabilities, such as mathematical derivation and logical judgment. Subsequent research based on extensive prompt engineering began to outperform fine-tuned models on narrow reasoning tasks, and from this point a flood of media coverage began to claim that “GPT-4 possesses reasoning capabilities.”
LLMs as Interpolation Machines
However, “reasoning” here essentially means that the model performs well on reasoning tasks, not that the model itself possesses reasoning capabilities. This good performance derives from the paradigms (a paradigm here is a set of next-token patterns conditioned on the preceding input sequence) structured during next-token-prediction training on the data distribution. For example, GPT-4 was trained on a large amount of serializable mathematical and coding data, which is stream-like and can be modeled quite accurately with next-token prediction. As a result, the model can, in many cases, predict a next token that it encountered during pre-training, or one close to it. The model essentially interpolates within its vocabulary to find the distribution closest to what it saw during pre-training, then applies test-time engineering tricks, such as beam search and temperature adjustment, to pick the next token.
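The temperature knob mentioned above is a simple reshaping of the predicted distribution. A toy sketch over a 3-token vocabulary (the logits are made up for illustration):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Temperature-scaled sampling over a toy vocabulary.

    Low temperature sharpens the distribution toward the single most
    likely token; high temperature flattens it, spreading probability
    across the vocabulary the model interpolates over.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())   # stable softmax
    p /= p.sum()
    return rng.choice(len(p), p=p), p

logits = [2.0, 1.0, 0.1]
_, p_cold = sample_next_token(logits, temperature=0.1)   # near-greedy
_, p_hot = sample_next_token(logits, temperature=10.0)   # near-uniform
print(p_cold.round(3), p_hot.round(3))
```

At temperature 0.1 almost all mass sits on the top token; at temperature 10 the three tokens become nearly equally likely.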
Since the model’s test-time interpolation always approaches the distribution of paradigms seen during pre-training, the more patterns the dataset contains, the closer the model gets to a “real” (i.e., objective) paradigm. For example, when the model continues a mathematical derivation whose formulas appeared in the pre-training dataset, it conditions on the formulas it has already generated to find the most similar next formula it has learned. My answer to the title question is therefore neither: not pure memorization, not true reasoning, but interpolation.
Few-Step Logic and Its Limits
This is where the “few-step logic” of autoregressive models comes from: the model can output relatively short logic chunks that are internally coherent, and adjacent chunks are often coherent as well, but chunks that are farther apart tend to be incoherent. From an optimization perspective, this is because the GPT architecture inherently prevents infinite-horizon training; from a reinforcement learning perspective, this is error accumulation, a known issue in imitation learning. It is also why language models generally underperform on zero-shot reasoning tasks: they have learned few-step next-token prediction, but they lack a well-designed mechanism (which typically requires modifying the model’s structure) to strengthen the planning (e.g., backup) and logic (e.g., many-step or skip-step reasoning) that reasoning requires. Hence, the GPT autoregressive architecture is neither simply memorizing answers (it has interpolation capability, i.e., limited continuous few-step logic) nor engaging in reasoning (it lacks planning and many-step/skip-step logic).
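The error-accumulation argument can be made quantitative with a back-of-the-envelope model. Under the simplifying (and admittedly idealized) assumption that each generated step is independently correct with probability \(p\), an \(n\)-step chain is correct with probability \(p^n\):

```python
# Compounding error in autoregressive generation, in miniature.
# Assumption (illustrative): steps are independent, each correct with
# probability p, so an n-step chain survives with probability p**n.
p = 0.95  # per-step (per-chunk) accuracy
for n in (1, 5, 20, 100):
    print(n, round(p ** n, 3))
# → 1 0.95 / 5 0.774 / 20 0.358 / 100 0.006
```

Even a 95%-reliable step rate leaves long chains almost certainly broken, which matches the observation that nearby chunks cohere while distant ones drift.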
Mechanistic Evidence
Earlier work by Anthropic, Decomposing Language Models With Dictionary Learning, found that most neurons in a 1-layer language model could be decomposed via dictionary learning into many monosemantic token distributions. This indicates that each neuron in a language model is essentially a superposition of simple semantics: for example, a neuron might activate when the input tokens are all uppercase or when a person’s name appears in the input, making it a superposition of “uppercase” and “person’s name.” Our recent work, CRATE-LM (Hao Bai), explored a larger GPT-2 and found that as the model deepened, the functional division of layers became more apparent, and the performance of dictionary learning deteriorated significantly in deeper layers. This suggests that deeper layers are likely integrating and optimizing logits for the pre-training objective (next-token prediction) based on the outputs of earlier neurons. Since it is difficult to apply dictionary learning efficiently to larger language models, we integrated sparsity directly into the language model itself, proposing a CRATE-based language model. The CRATE architecture (Yaodong Yu) is mathematically first-principled and directly promotes sparsity; our proposed CRATE-LM offers better neuron interpretability, and because it doesn’t require dictionary learning, it supports loss-free editing. Both works answer the question, “Does a reasoning mechanism exist inside language models?”, from a mechanistic perspective: in GPT-2-sized language models (12 layers or fewer) trained with next-token prediction, no planning or reasoning mechanism has been observed.
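The superposition picture is easy to illustrate with a toy model. In the sketch below (all directions are hypothetical illustrations, not real model weights), a single “neuron” reads a weight vector that is the sum of an “uppercase” direction and a “person’s name” direction, so it fires for either feature; recovering the two directions from a known dictionary is a least-squares stand-in for what dictionary learning does without the dictionary given:

```python
import numpy as np

# Toy superposition: a neuron whose weight vector sums two feature
# directions, in the spirit of dictionary-learning analyses.
rng = np.random.default_rng(0)
d = 64
uppercase = rng.normal(size=d); uppercase /= np.linalg.norm(uppercase)
person = rng.normal(size=d);    person /= np.linalg.norm(person)
unrelated = rng.normal(size=d); unrelated /= np.linalg.norm(unrelated)

neuron_w = uppercase + person  # the neuron superposes both semantics

def activation(x):
    return float(neuron_w @ x)

print(activation(uppercase), activation(person), activation(unrelated))

# With the dictionary known, least squares recovers the decomposition:
# neuron_w lies exactly in span{uppercase, person}, so the coefficients
# come out as [1, 1, 0].
D = np.stack([uppercase, person, unrelated], axis=1)  # (d, 3)
coef, *_ = np.linalg.lstsq(D, neuron_w, rcond=None)
print(coef.round(2))
```

Dictionary learning faces the harder problem of discovering the atoms themselves from activations alone, which is where sparsity constraints come in.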
Can Post-Training Solve This?
So can different post-training paradigms, such as reinforcement learning, solve this problem? In principle, as long as training still relies entirely on next-token prediction as the objective, it is difficult to claim reasoning capabilities. A common workaround is that while the language model itself doesn’t possess reasoning capabilities, its excellent representations can be used to train small reasoning/planning heads; this is the main idea behind many works that use RL to post-train language models (including our work RL4VLM (Yuexiang Zhai)). Workarounds aside, future researchers should look further and explore methods that modify the language model architecture itself. I believe this is already a key area for general robotics models.
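The “frozen representations plus small trained head” workaround can be sketched in a few lines. This is a deliberately simplified illustration, not RL4VLM’s actual method or API: the stand-in “LM features” are random vectors, the labels are synthetic, and only a linear head is trained (here with plain gradient descent on a cross-entropy loss rather than RL).

```python
import numpy as np

# Frozen features + small trainable head, in miniature.
rng = np.random.default_rng(0)
n, d = 200, 16
features = rng.normal(size=(n, d))               # stand-in for frozen LM features
true_w = rng.normal(size=d)
labels = (features @ true_w > 0).astype(float)   # synthetic 'good action' labels

w = np.zeros(d)                                  # the small head is all we train
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features @ w)))    # sigmoid predictions
    w -= 0.1 * features.T @ (p - labels) / n     # cross-entropy gradient step

acc = ((features @ w > 0) == (labels > 0.5)).mean()
print(acc)
```

The base representation never changes; all the task-specific capability lives in the lightweight head, which is why this pattern is cheap but also why it inherits whatever the frozen features cannot express.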