Are Auto-Regressive Language Models Simply Memorizing Answers or Learning to Reason?
My (abrupt) answer is: neither. The breakthrough work in this area is GPT-2, which used next-token prediction as an unsupervised learning objective. Since training with next-token prediction can be parallelized simply by applying a triangular mask to the attention matrix, it is simple and convenient, and later works have largely followed this tradition. This mask became known as the causal mask. The biggest advantage of next-token prediction is that the architecture is very scalable, which tells an entirely different story from BERT. Since the early days of GPT, OpenAI has adhered to this idea. By the time GPT-2 was released, the payoff of scaling up data had already become evident, and with GPT-3 it was clear that pre-training with next-token prediction had become a well-established path.
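As a rough illustration of how the causal mask enables parallel next-token-prediction training, here is a minimal sketch (PyTorch, illustrative only, not any particular model's implementation):

```python
import torch

def causal_attention_scores(q, k):
    """q, k: (seq_len, d) tensors for a single toy attention head."""
    seq_len, d = q.shape
    scores = q @ k.transpose(0, 1) / d ** 0.5            # (seq_len, seq_len)
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))    # hide future positions
    return torch.softmax(scores, dim=-1)

# Row i places weight only on positions <= i, so the next-token prediction at
# every position can be trained in one parallel forward pass.
attn = causal_attention_scores(torch.randn(5, 8), torch.randn(5, 8))
print(attn)
```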
The earliest GPT was actually a competitor to BERT, aiming to obtain a general-purpose language model after pre-training, but at that point next-token prediction itself didn’t yield good enough results. By GPT-2, the pre-training task was capable of producing some relatively decent applications after fine-tuning, such as generating novel continuations, building emotional-support bots (Diyi Yang), and even making decisions (ArCHer, Yifei Zhou). By the time GPT-3 arrived, fine-tuning became unnecessary even for simple tasks like text continuation, because the model had already encountered similar tasks during pre-training. Up to this stage, no one claimed that the GPT architecture possessed “reasoning abilities,” as GPT clearly couldn’t excel at reasoning tasks. It wasn’t until GPT-4, when post-training elevated the pre-trained model’s abilities to a new level, that the model seemed to demonstrate some zero-shot reasoning capabilities, such as performing mathematical derivations and logical judgments. Subsequent research based on extensive prompt engineering began to outperform fine-tuned models on narrow reasoning tasks. From this point, a flood of media coverage began to claim that “GPT-4 possesses reasoning capabilities.”
However, at this stage, “reasoning” essentially means that “the model performs well on reasoning tasks,” not that “the model itself possesses reasoning capabilities.” This good performance fundamentally derives from the paradigms (a paradigm here being a set of patterns for the next token given the preceding sequence of input tokens) structured during next-token-prediction training on the data distribution. For example, GPT-4 was trained on a large amount of serializable mathematical and coding data, which is stream-like and can be modeled quite accurately with next-token prediction. As a result, the model can, in many cases, predict a next token that it encountered during pre-training, or one close to it. The model essentially interpolates within its vocabulary to find the distribution closest to what it saw during pre-training, then applies some test-time engineering tricks, such as beam search and temperature adjustment, to predict the next token.
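To make the test-time tricks concrete, here is a minimal sketch of temperature adjustment when sampling the next token (NumPy, illustrative only; the toy logits and temperature values are made up):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, seed=0):
    """Sample a token index from temperature-scaled logits."""
    rng = np.random.default_rng(seed)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Toy 4-token vocabulary: lower temperature concentrates probability mass on
# the top token; higher temperature flattens the distribution.
logits = np.array([2.0, 1.0, 0.1, -1.0])
print(sample_next_token(logits, temperature=0.7))
print(sample_next_token(logits, temperature=1.5))
```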
Since the model’s test-time interpolation always approaches the distribution of paradigms seen during pre-training, the more patterns that exist in the dataset, the closer the model will get to a “real” (i.e., objective) paradigm. For example, when the model has to infer the next step of a mathematical derivation and similar formulas appeared in the pre-training dataset, it conditions on the formulas it has already generated to find the most similar next formula it has learned.
This is where the “few-step logic” of autoregressive models comes from: the model can output relatively short logic chunks that are internally coherent, and adjacent chunks are often coherent as well, but chunks that are farther apart tend to be incoherent. From an optimization perspective, this is because the GPT architecture inherently precludes infinite-horizon training. From a reinforcement learning perspective, this is error accumulation, a well-known issue in imitation learning. This is also why language models generally don’t perform well on zero-shot reasoning tasks: they have learned few-step next-token prediction, but they lack a well-designed mechanism (which typically requires modifying the model’s architecture) to strengthen the planning (e.g., backup) and logic (e.g., many-step/skip-step reasoning) required for reasoning. Hence, the GPT autoregressive architecture is neither simply memorizing answers (thanks to its interpolation capability, i.e., limited continuous few-step logic) nor engaging in reasoning (as it lacks planning and many-step/skip-step logic).
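The error-accumulation point can be made concrete with a toy calculation (illustrative only; the per-token error rate below is a made-up number, not a measured quantity):

```python
# If each generated token drifts "off-distribution" independently with
# probability eps, the chance an entire chunk stays coherent decays
# exponentially with its length: short chunks look fine, long chains drift.
eps = 0.02  # hypothetical per-token error rate
for length in (5, 20, 100, 500):
    p_coherent = (1 - eps) ** length
    print(f"chunk length {length:>3}: P(still on-distribution) ~ {p_coherent:.3f}")
```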
Earlier work by Anthropic, Decomposing Language Models With Dictionary Learning, found that most neurons in a 1-layer language model could be decomposed, via dictionary learning, into many single-semantic (monosemantic) token distributions. This indicates that each neuron in a language model is essentially a superposition of simple semantics. For example, a neuron might be activated when the input tokens are all uppercase or when a person’s name appears in the input, making it a superposition of “uppercase” and “person’s name.” Our recent work, CRATE-LM (Hao Bai), explored a larger GPT-2 and found that as the model got deeper, the functional division among layers became more apparent, and the performance of dictionary learning deteriorated significantly in deeper layers. This suggests that deeper layers are likely integrating and optimizing logits related to the pre-training objective (next-token prediction) based on the outputs of earlier neurons. Since it’s difficult to apply dictionary learning efficiently to larger language models, we integrated sparsity directly into the language model itself, proposing a CRATE-based language model. The CRATE architecture (Yaodong Yu) is a mathematically first-principled architecture that directly promotes sparsity. Our proposed CRATE-LM offers better neuron interpretability, and because it doesn’t require dictionary learning, it supports loss-free editing. Both works answer the question “Does a reasoning mechanism exist inside language models?” from a mechanistic perspective: in GPT-2-sized language models (12 layers or fewer) trained with next-token prediction, no planning or reasoning mechanism has been observed.
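For readers unfamiliar with dictionary learning on activations, here is a minimal sketch of the general technique (scikit-learn, illustrative only; this is not the exact setup of the Anthropic paper or of CRATE-LM, and the activation matrix is random stand-in data):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Stand-in activations of shape (num_tokens, hidden_dim); in practice these
# would be MLP activations collected from a small language model.
acts = np.random.default_rng(0).normal(size=(500, 64))

# Overcomplete dictionary: more atoms than the hidden dimension, with a sparse
# code, so each activation is explained by a few (ideally monosemantic) atoms.
dl = DictionaryLearning(n_components=128,
                        transform_algorithm="lasso_lars",
                        transform_alpha=0.5,
                        max_iter=10,
                        random_state=0)
codes = dl.fit_transform(acts)              # sparse coefficients, shape (500, 128)
print("avg. active atoms per token:", (np.abs(codes) > 1e-6).sum(axis=1).mean())
```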
So can different post-training paradigms, such as reinforcement learning, solve this problem? In principle, as long as training still relies entirely on next-token prediction as the objective, it’s difficult to claim reasoning capabilities. A common workaround is that, while the language model itself doesn’t possess reasoning capabilities, its excellent representations can be used to train small reasoning/planning heads; this is the main idea behind many works using RL to post-train language models (including our work RL4VLM (Yuexiang Zhai)). Despite these workarounds, future research should look further and explore methods that modify the language model architecture itself. I believe this is already a key direction for general-purpose robotics models.
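A minimal sketch of the “frozen representations + small planning head” workaround described above (PyTorch; the module names and dimensions are hypothetical, and this is not the actual RL4VLM implementation):

```python
import torch
import torch.nn as nn

class PlanningHead(nn.Module):
    """Small policy/value head trained on top of frozen language-model features."""
    def __init__(self, hidden_dim=768, num_actions=4):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                    nn.Linear(256, num_actions))
        self.value = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 1))

    def forward(self, lm_features):
        # lm_features: last-layer hidden state from a frozen language model
        return self.policy(lm_features), self.value(lm_features)

head = PlanningHead()
fake_features = torch.randn(2, 768)          # stand-in for frozen-LM representations
logits, values = head(fake_features)
print(logits.shape, values.shape)            # torch.Size([2, 4]) torch.Size([2, 1])
```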